We introduce a system for using natural language to conversationally specify local edits to character motion. Our key idea is to cast motion editing as a two-step process: first converting natural language editing instructions into Python programs that describe fine-grained editing operations with an LLM, then executing the resulting operations using a constraint generation and diffusion-based motion infilling process. As an intermediate representation between text and joints, we define a set of kinematic motion editing operators (MEOs) that have well-defined semantics for how to modify specific frames of a target motion.
System overview. Our LLM-based parser converts natural language into Python code that describes the desired motion edit (green). Each method in the program is an MEO, defining a joint to modify, a spatial constraint (rotation/translation), and a time interval during which the constraint applies. Constraints are expressed relative to the properties of the source motion they are applied to. These Python programs can be executed to generate the desired motion edit (blue).
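To make the interface concrete, here is a minimal sketch of what an executed MEO program might look like. The `Motion` class, method names, joint names, offsets, and frame ranges below are illustrative assumptions for exposition, not the system's actual API:

```python
from dataclasses import dataclass, field

# Illustrative sketch only: the Motion class, method names, and
# signatures below are assumptions, not the system's actual API.

@dataclass
class Motion:
    """Accumulates kinematic constraints for diffusion-based infilling."""
    constraints: list = field(default_factory=list)

    def translate(self, joint, offset_m, frames):
        # Constrain a joint's position, expressed as an offset relative
        # to the source motion, over a (start, end) frame interval.
        self.constraints.append(("translate", joint, offset_m, frames))

    def rotate(self, joint, angle_deg, frames):
        # Constrain a joint's rotation relative to the source motion.
        self.constraints.append(("rotate", joint, angle_deg, frames))


# A program the parser might emit for "make the side kick higher":
def edit(motion: Motion) -> Motion:
    # Raise the kicking foot 0.3 m above its source height at the apex.
    motion.translate(joint="right_foot", offset_m=(0.0, 0.3, 0.0), frames=(30, 45))
    return motion
```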
MEO programs generated by our LLM-based parser during an iterative editing session are shown below. Notice how the LLM agent provides justifications through comments to break down its reasoning (we encouraged it to do so by giving it several example programs with these patterns via in-context learning). Also notice how context from previous edits informs programs generated for future edits.
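As a flavor of what such a session might produce, here is a hypothetical two-turn example (continuing the illustrative interface sketched above; the joint names, angles, and frame ranges are invented for exposition):

```python
motion = Motion()  # in practice, the source motion loaded for editing

# Turn 1: "make the kick higher"
# Reasoning: a higher kick means the kicking foot reaches greater height
# at the apex, so raise the right foot during frames 30-45.
motion.translate(joint="right_foot", offset_m=(0.0, 0.4, 0.0), frames=(30, 45))

# Turn 2: "now lean back while kicking"
# Reasoning: the previous edit raised the right foot over frames 30-45;
# applying the lean over the same interval keeps the two edits coordinated.
motion.rotate(joint="spine", angle_deg=-15.0, frames=(30, 45))
```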
Results on single edits. Our system can generate edited motions that are plausible, faithful to input instructions, and non-destructive to the original motion. The videos below show motions before and after the edit.
Results on iterative edits. Our system allows editing motions conversationally, which enables progressive refinement of the character's motion, allows the user to break larger editing tasks into sub-goals, and supports clarification or even adjustment of editing intent. Each video below shows a separate iterative editing session.
Why not just use prompt engineering?
One alternative to recent motion-editing methods is to iterate on the input prompt ("prompt engineering") to the best available text2motion model, e.g., changing the original prompt "a side kick" to "a high side kick" and generating from scratch. Unfortunately, it can be hard to predict how these models will interpret changes in the prompt, and they provide little guarantee that the modified motion will retain any correspondence with the original. For the inherently iterative process of character motion refinement, a system that is both predictable and non-destructive to the current motion is vital (we recommend the excellent discussion in [Agrawala 2023]).

Is text a good way to specify motion edits? When is a text-based interface useful?
Text can be an ambiguous (and thus inefficient) way to describe precise edits, so why use text at all instead of, e.g., traditional keyframe animation software? For a single edit to a single joint, traditional methods may be more efficient, but we believe text is useful for iterative, conversational editing. Text instructions can build upon or refine previous edits, mimicking a conversation between user and character. While there is certainly a long way to go, we see our work as a solid step toward scaffolding iterative editing workflows.

What types of edits are and are not supported?
Our MEOs support kinematic editing of the main joints in the SMPL body. Physics-informed edits ("jump more forcefully") or semantic, stylization-based edits ("do that more excitedly") are not handled by our system, although we are excited about these directions.

How capable is the LLM parser at generating executable code?
Quite capable! Given 100 editing prompts (collected both by asking ChatGPT to suggest kinematic editing instructions for source motion descriptions and by hand-writing editing prompts), the LLM parser successfully produced programs for 90 instructions on the first try, and an additional 7 after reflection/re-generation. Only 3 prompts failed to execute (their generated instructions required operations not implemented by our current system, e.g., neck rotation, relative rotation, medial waist rotation). Thus, we believe the fundamental challenge of the system is generating MEO programs that embody the edit well, rather than generating valid programs.
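For readers curious what reflection/re-generation means here, a minimal sketch of such a loop is below. The callables (`generate`, `reflect`, `validate`) stand in for the LLM calls and program checker; they are assumptions for exposition rather than our actual implementation:

```python
# Sketch of a parse-validate-reflect loop; `generate`, `reflect`, and
# `validate` are placeholder callables, not the system's actual functions.

def parse_instruction(instruction, context, generate, reflect, validate,
                      max_retries=1):
    """Return an executable MEO program, retrying with LLM reflection."""
    program = generate(instruction, context)   # first-try generation
    for attempt in range(max_retries + 1):
        error = validate(program)              # e.g., try executing the MEOs
        if error is None:
            return program
        if attempt < max_retries:
            # Show the LLM its own program plus the failure, and re-generate.
            program = reflect(instruction, program, error, context)
    raise RuntimeError("instruction requires operations outside the MEO set")
```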