Iterative Motion Editing
with Natural Language

SIGGRAPH 2024

Stanford University, Snap Inc.

We introduce a system for using natural language to conversationally specify local edits to character motion. Our key idea is to cast motion editing as a two-step process: an LLM first converts natural language editing instructions into Python programs that describe fine-grained editing operations, and the resulting operations are then executed using a constraint generation and diffusion-based motion infilling process. As an intermediate between text and joints, we define a set of kinematic motion editing operators (MEOs) that have well-defined semantics for how to modify specific frames of a target motion.


System overview. Our LLM-based parser converts natural language into Python code that describes the desired motion edit (green). Each method in the program is an MEO, defining a joint to modify, a spatial constraint (rotation/translation), and a time interval during which the constraint applies. Constraints are expressed relative to the properties of the source motion they are applied to. These Python programs can be executed to generate the desired motion edit (blue).
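For concreteness, a minimal sketch of a program in this style is shown below. The `MEO` dataclass, its field names, and the example values are simplified placeholders for illustration, not the exact operators exposed by our parser.

```python
from dataclasses import dataclass
from typing import Tuple

# A simplified, hypothetical stand-in for an MEO. The real operators carry
# the same three ingredients (joint, spatial constraint, time interval),
# but the names and fields here are placeholders for illustration.
@dataclass
class MEO:
    joint: str                     # which joint to modify
    constraint: str                # "rotation" or "translation"
    axis: str                      # direction of the change
    amount: float                  # radians or meters, relative to the source motion
    interval: Tuple[float, float]  # normalized [start, end] within the clip

# Edit: "raise the right hand higher during the wave"
program = [
    # The wave occupies roughly the middle third of the source clip,
    # so only that interval is constrained (non-destructive elsewhere).
    MEO(joint="right_wrist", constraint="translation",
        axis="up", amount=0.20, interval=(0.33, 0.66)),
]
```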

MEO programs generated by our LLM-based parser during an iterative editing session are shown below. Notice how the LLM agent provides justifications through comments to break down its reasoning (we encouraged it to do so by giving it several example programs with these patterns via in-context learning). Also notice how context from previous edits informs programs generated for future edits.

Results on single edits. Our system can generate edited motions that are plausible, faithful to input instructions, and non-destructive to the original motion. The videos below show motions before and after the edit.

Results on iterative edits. Our system allows editing motions conversationally, which enables progressive refinement of the character's motion, lets the user break larger editing tasks into sub-goals, and supports clarification or even adjustment of editing intent. Each video below shows a separate iterative editing session.

FAQ:

Why not just use prompt engineering?

One alternative to recent motion-editing methods is to iterate on the input prompt ("prompt engineering") given to the best available text2motion model, e.g., changing the original prompt "a side kick" to "a high side kick" and generating from scratch. Unfortunately, it can be hard to predict how these models will interpret changes in the prompt, and they provide little guarantee that the modified motion will retain any correspondence with the original. For the inherently iterative process of character motion refinement, a system that is both predictable and non-destructive to the current motion is vital (we recommend the excellent discussion in [Agrawala 2023]).

Is text a good way to specify motion edits? When is a text-based interface useful?

Text can be an ambiguous (and thus inefficient) way to describe precise edits, so why use text at all instead of, e.g., traditional keyframe animation software? For a single edit to a single joint, traditional methods may be more efficient, but we believe text is useful for iterative, conversational editing. Text instructions can build upon or refine previous edits, mimicking a conversation between user and character. While there is certainly a ways to go, we see our work as a step toward scaffolding iterative editing workflows.

That being said, we believe the ideal motion editing system of the future should be multimodal: text is one tool for describing edits, but so are demonstrations (e.g., via images or videos), the specifics of the scene/environment, and good old kinematic joint constraints.

What type of edits are and are not supported?

Our MEOs support kinematic editing of the main joints in the SMPL body. Physics-informed edits ("jump more forcefully") or semantic, stylization-based edits ("do that more excitedly") are not handled by our system, although we are excited about these directions.

Our prompt design makes it quite straightforward to add program generation support for new MEOs: the new MEO is included as an import statement in the LLM prompt, and example uses of the new MEO are provided in the demonstration part of the prompt. Thanks to its strong priors, the LLM is likely to correctly target new operators without retraining. For full system integration, the execution engine must also implement the new MEO.
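As a rough illustration, extending the prompt for a new MEO might look like the sketch below. The names here (`translate_relative`, `PROMPT_HEADER`, `DEMONSTRATIONS`, `build_prompt`) are placeholders for illustration, not our actual prompt text.

```python
# Schematic of extending the LLM prompt to support a new MEO.
# All names below are illustrative placeholders, not our exact prompt.

PROMPT_HEADER = (
    "from meo import rotate, translate, translate_relative\n"  # new operator imported
)

DEMONSTRATIONS = [
    # One or more worked examples of the new operator, supplied via
    # in-context learning so the LLM sees when and how to call it.
    ("Instruction: move the left hand closer to the head.",
     "translate_relative(joint='left_wrist', target='head',\n"
     "                   distance=0.1, interval=(0.0, 1.0))"),
]

def build_prompt(instruction: str) -> str:
    demos = "\n\n".join(f"# {q}\n{a}" for q, a in DEMONSTRATIONS)
    return f"{PROMPT_HEADER}\n{demos}\n\n# Instruction: {instruction}\n"
```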

How capable is the LLM Parser in generating executable code?

Quite capable! Given 100 editing prompts (collected both by asking ChatGPT to suggest kinematic editing instructions for source motion descriptions and by hand-writing editing prompts), the LLM parser successfully produced programs for 90 instructions on the first try, and an additional 7 after reflection/re-generation. Only 3 prompts failed to execute, because they requested operations not implemented in our current system (e.g., neck rotation, relative rotation, medial waist rotation). Thus, we believe the fundamental challenge of the system is generating MEO programs that embody the edit well, rather than generating valid programs.
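The reflection/re-generation step can be thought of as a simple retry loop. A hypothetical sketch is shown below, with `llm_generate_program` and `execute_program` standing in for our parser and execution engine.

```python
# Hypothetical sketch of the reflect-and-regenerate loop described above.
# llm_generate_program and execute_program are placeholders for the
# LLM parser and the execution engine.

def parse_with_reflection(instruction, llm_generate_program,
                          execute_program, max_attempts=3):
    feedback = None
    for _ in range(max_attempts):
        program = llm_generate_program(instruction, feedback=feedback)
        try:
            return execute_program(program)
        except Exception as err:
            # Feed the error back to the LLM and ask for a revised program.
            feedback = f"The previous program failed with: {err}"
    raise RuntimeError("Could not produce an executable MEO program.")
```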

Hindsights?

Our method separates constraint generation from motion generation: an LLM translates instructions into executable Python programs of MEOs; our execution engine first generates motion constraints (e.g., keyframes) from the programs, then a diffusion-based motion infilling step integrates those keyframes into the source motion.
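In pseudocode terms, the pipeline roughly follows the sketch below; the function names are placeholders, not our actual interfaces.

```python
# High-level sketch of the pipeline, with placeholder functions for
# each stage (not our exact interfaces).

def execute_edit(source_motion, instruction, llm_parser,
                 constraints_from_program, diffusion_infill):
    # Stage 0: the LLM parser turns the instruction into an MEO program.
    program = llm_parser(instruction)

    # Stage 1: the program is executed to produce frame-wise motion
    # constraints (e.g., edited keyframes) on the source motion.
    keyframes = constraints_from_program(program, source_motion)

    # Stage 2: a diffusion-based infilling step integrates those keyframes
    # back into the source motion, leaving unedited frames largely intact.
    return diffusion_infill(source_motion, keyframes)
```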

We found the division of labor and the introduction of an IR to be quite useful when scaling up the system, as opposed to using a single model to handle the whole text-to-edited-motion pipeline. First, the MEO intermediate representation is highly interpretable: it is clear what edit is being executed, and why. Second, the representation is controllable: each MEO reduces edits to a change along a single DOF; though in our examples the magnitude of change is set procedurally, it can in principle be set directly by the user (like a single-use rig). Third, the division of labor makes the pipeline more modular: the infilling model does not require re-training to add new types of kinematic edits. We found, for example, that scaling up our original set of rotation/translation MEOs to also handle relative translation was fairly simple, with no fine-tuning required.

The downside of this approach is that edits are currently limited to those that can be expressed by frame-wise motion constraints and transitions in and out of those frames; as mentioned earlier, stylization-based edits are more difficult within this division of labor. Motion quality may also suffer with this more modular approach (see, in comparison, the more recent work of [Huang 2024], which uses a single model to encapsulate both constraints and motion generation rather than our proposed two-stage execution engine). We believe there is a lot of room to explore the right balance between modularity and generation quality!

Acknowledgements:

Purvi Goel is supported by a Stanford Interdisciplinary Graduate Fellowship. Kuan-Chieh Wang was supported by Stanford Wu-Tsai Human Performance Alliances while at Stanford University. We thank the anonymous reviewers for constructive feedback; Vishnu Sarukkai, Sarah Jobalia, Sofia Di Toro Wyetzner, Haotian Zhang, David Durst, and James Hong for helpful discussions. Our codebase was built with invaluable help from James Burgess. This website was developed referencing EDGE, and by extension, the excellent Imagen site.