Research paper / project report

Crayotter: Artifact-Grounded Multimodal Agents for Long-Form Video Editing

We frame long-form video editing as an artifact-grounded agent trajectory problem, where retrieval, research, execution, and revision are all anchored in visible external states instead of opaque reasoning alone.

Lecheng Yan, Yichong Zhang, Ben Pan, Xiaoyu Zheng, Jiawei Qian, Anqi Wu, Wenxi Li, Chenyang Lyu


Figure: Crayotter editing trajectory. A long-form editing request becomes a visible production trajectory with inspectable clips, transitions, narration, subtitles, and revision loops.

Why this paper matters

From black-box prompting to stateful multimodal editing.

Abstract

Long-form multimodal editing remains hard for three reasons: decisions are weakly grounded, intermediate states are hard to audit, and failures are expensive to localize. Crayotter addresses this by mapping a user's intent to a finished edited video through three phases: material preparation, deep editing research, and tool-grounded execution.

Its key mechanism is environment-grounded reflection: after each tool action, the agent revises its plan from observable artifacts such as retrieval coverage reports, analysis JSON, timeline plans, previews, and tool logs, rather than from latent reasoning traces alone.
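The reflect-after-every-action loop can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Artifact` schema and the `execute` and `revise` callables are hypothetical names introduced here, and a real system would inspect much richer artifacts than a single pass/fail flag.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Artifact:
    """An observable output of one tool action (hypothetical schema)."""
    kind: str         # e.g. "coverage_report", "preview", "tool_log"
    ok: bool          # True if the artifact passed its check
    detail: str = ""  # what the check found, used to guide revision


def run_with_reflection(
    actions: List[str],
    execute: Callable[[str], Artifact],
    revise: Callable[[str, Artifact], str],
    max_revisions: int = 2,
) -> List[Tuple[str, Artifact]]:
    """Execute each action, revising it from the artifact it produced."""
    history: List[Tuple[str, Artifact]] = []
    for action in actions:
        current = action
        for _ in range(max_revisions + 1):
            artifact = execute(current)
            history.append((current, artifact))
            if artifact.ok:
                break
            # Revision is driven by the visible artifact, not hidden state.
            current = revise(current, artifact)
    return history
```

The returned history pairs every attempted action with the artifact it produced, so a failed attempt and its revision both stay inspectable after the run.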

Core contributions

Formulates long-form video editing as an artifact-grounded trajectory problem.
Introduces a coverage-aware multimodal footage retrieval loop.
Builds a three-phase editing system with explicit runtime observability.
Defines an RLVR setup with verifiable rewards over editing artifacts.

Three-phase pipeline

The system decomposes long-form editing into retrieval, research, and execution with explicit artifact contracts.

Figure: Crayotter framework architecture. Crayotter externalizes state across retrieval evidence, multimodal analyses, editable blueprints, timeline files, renders, and logs.

Phase 1

Coverage-aware retrieval

Expand the request into scene, action, style, story, and shot tags, then retrieve and rerank videos until semantic coverage is sufficient.
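A minimal sketch of such a loop, assuming each candidate clip carries a set of tags and that a hypothetical `search` callable returns candidates for the tags still missing. The greedy rerank (prefer clips that close the most missing tags) and the round-based budget are illustrative choices, not the paper's stated method.

```python
from typing import Callable, Dict, List, Set, Tuple

Clip = Tuple[str, Set[str]]  # (clip_id, tags); hypothetical representation


def coverage_retrieve(
    required_tags: Set[str],
    search: Callable[[Set[str]], List[Clip]],
    budget: int = 3,
) -> Dict[str, object]:
    """Retrieve clips until every required tag is covered or budget runs out."""
    selected: List[str] = []
    missing = set(required_tags)
    rounds = 0
    while missing and rounds < budget:
        rounds += 1
        # Rerank: clips that close the most missing tags come first.
        ranked = sorted(
            search(missing), key=lambda c: len(c[1] & missing), reverse=True
        )
        for clip_id, tags in ranked:
            if clip_id not in selected and tags & missing:
                selected.append(clip_id)
                missing -= tags
    # The result doubles as a coverage report: what was found, what is still open.
    return {"selected": selected, "missing": sorted(missing), "rounds": rounds}
```

Returning the still-missing tags alongside the selected clips makes the stopping reason visible: an empty `missing` list means coverage was reached, a non-empty one means the budget ran out first.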

Phase 2

Deep editing research

Translate the material pool into a time-grounded editing blueprint with beats, in/out points, transitions, rhythm, and narration intent.
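One way to picture a time-grounded blueprint is as a list of beats, each tied to a source clip with explicit in/out points. The sketch below is an assumption for illustration only: the `Beat` fields and the `validate` helper are hypothetical, and the paper's actual blueprint format may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Beat:
    clip_id: str
    in_s: float               # in point within the source clip, in seconds
    out_s: float              # out point within the source clip, in seconds
    transition: str = "cut"   # transition into the next beat
    narration: str = ""       # narration intent for this beat

    @property
    def duration(self) -> float:
        return self.out_s - self.in_s


def validate(beats: List[Beat]) -> List[str]:
    """Return readable problems; an empty list means the in/out points are sane."""
    problems = []
    for i, b in enumerate(beats):
        if b.in_s < 0 or b.out_s <= b.in_s:
            problems.append(
                f"beat {i} ({b.clip_id}): invalid in/out {b.in_s}-{b.out_s}"
            )
    return problems


def total_duration(beats: List[Beat]) -> float:
    """Planned running time, ignoring any overlap introduced by transitions."""
    return sum(b.duration for b in beats)
```

Because every beat names its clip and timestamps, a problem report can point at a specific beat rather than at the blueprint as a whole.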

Phase 3

Tool-grounded execution

Realize the blueprint through cutting, timeline construction, transitions, subtitles, loudness control, export, and artifact-based diagnosis.

Figure: Coverage-aware retrieval. The retrieval loop searches until missing tags are closed or the retrieval budget is exhausted.

Figure: Crayotter workbench interface. The workbench keeps assets, execution state, and intermediate artifacts in one shared workspace.

What makes Crayotter different

Editing is treated as a controllable production system rather than a single generation step.

Observable state

Artifacts such as coverage reports, analysis outputs, plans, renders, and tool logs are first-class state variables.

Selective repair

Failures can be traced to a clip, timestamp span, or transition decision and fixed locally instead of restarting the whole run.
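As a small illustration of localizing a failure, suppose the timeline is a list of segments with start and end times on the output timeline, and a diagnostic flags a failing timestamp span. Both the `Segment` layout and the helper name are hypothetical; the sketch only shows the interval-overlap test that lets a repair touch the affected clips and nothing else.

```python
from typing import List, Tuple

# (clip_id, start_s, end_s) on the output timeline; hypothetical layout.
Segment = Tuple[str, float, float]


def segments_to_repair(
    timeline: List[Segment], span: Tuple[float, float]
) -> List[str]:
    """Return the clip ids whose timeline interval overlaps the failing span."""
    lo, hi = span
    # Half-open overlap test: segments that merely touch the span are excluded.
    return [cid for cid, start, end in timeline if start < hi and end > lo]
```

For example, with segments `intro` (0 to 4 s), `beach` (4 to 9 s), and `surf` (9 to 15 s), a fault flagged at 8.5 to 9.5 s selects only `beach` and `surf`, which points at the transition between them.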

RLVR-ready traces

The same execution environment exposes verifiable diagnostics and reward signals for trajectory-level policy learning.
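To make "verifiable" concrete, a reward can be computed purely from programmatic checks on the artifacts a trajectory leaves behind. The sketch below is an assumption, not the reward defined in the paper: the artifact keys, the three checks, and the 2-second duration tolerance are all hypothetical.

```python
from typing import Any, Callable, Dict, List

Check = Callable[[Dict[str, Any]], bool]


def verifiable_reward(artifacts: Dict[str, Any], checks: List[Check]) -> float:
    """Fraction of programmatic checks that the trajectory's artifacts pass."""
    if not checks:
        return 0.0
    return sum(1 for check in checks if check(artifacts)) / len(checks)


# Hypothetical checks; keys and thresholds are illustrative assumptions.
CHECKS: List[Check] = [
    # Retrieval closed every requested tag.
    lambda a: not a.get("missing_tags"),
    # Rendered duration is within 2 seconds of the target.
    lambda a: abs(a.get("duration_s", 0.0) - a.get("target_s", 0.0)) <= 2.0,
    # The export step reported success.
    lambda a: a.get("export_ok") is True,
]
```

Each check reads only recorded artifacts, so the same trajectory always receives the same score and a low reward can be traced to the specific check that failed.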

The agent does not revise from hidden chain-of-thought alone; it revises from rendered timeline states, multimodal analysis outputs, and tool-produced intermediate artifacts.

Competitive evaluation

Crayotter leads the reported baselines across all five judged dimensions.

Human overall score: 3.40
AI overall score: 2.39
Fixed benchmark pack: 23 themes
Judged axes (5): theme, richness, narrative, smoothness, visual quality

Method	Theme	Richness	Narrative	Smoothness	Visual	Overall
Crayotter (ours)	3.59	3.35	3.22	3.29	3.54	3.40
CapCut-Mate	2.59	2.71	2.01	2.13	2.74	2.44
CutClaw	1.59	1.64	1.72	1.86	1.67	1.70
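The Overall column agrees with the unweighted mean of the five axis scores, rounded to two decimals. Whether the authors computed it exactly this way is an assumption; the short check below only confirms that the arithmetic matches the reported values.

```python
# Axis scores (theme, richness, narrative, smoothness, visual) and the
# reported Overall value, copied from the table above.
SCORES = {
    "Crayotter (ours)": ([3.59, 3.35, 3.22, 3.29, 3.54], 3.40),
    "CapCut-Mate": ([2.59, 2.71, 2.01, 2.13, 2.74], 2.44),
    "CutClaw": ([1.59, 1.64, 1.72, 1.86, 1.67], 1.70),
}


def overall(axes):
    """Unweighted mean of the axis scores, rounded to two decimals."""
    return round(sum(axes) / len(axes), 2)


for method, (axes, reported) in SCORES.items():
    status = "matches" if abs(overall(axes) - reported) < 0.005 else "differs"
    print(f"{method}: mean {overall(axes):.2f}, reported {reported:.2f} ({status})")
```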

Reading the result

The gains are consistent with the paper's core argument: long-form editing benefits when retrieval, planning, and execution all expose visible state that can be audited and revised.

Human raters score Crayotter highest on theme alignment, content richness, narrative coherence, editing smoothness, and visual quality.

Citation and resources

Use the paper page, the PDF, or the repository directly.