Iterative Motion Editing
with Natural Language

SIGGRAPH 2024

Stanford University, Snap Inc.

We introduce a system for using natural language to conversationally specify local edits to character motion. Our key idea is to cast motion editing as a two-step process: an LLM first converts natural language editing instructions into Python programs that describe fine-grained editing operations, and the resulting operations are then executed using a constraint generation and diffusion-based motion infilling process. As an intermediate between text and joints, we define a set of kinematic motion editing operators (MEOs) that have well-defined semantics for how to modify specific frames of a target motion.


System overview. Our LLM-based parser converts natural language into Python code that describes the desired motion edit (green). Each method in the program is an MEO, defining a joint to modify, a spatial constraint (rotation/translation), and a time interval during which the constraint applies. Constraints are expressed relative to the properties of the source motion they are applied to. These Python programs can be executed to generate the desired motion edit (blue).
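For concreteness, a minimal sketch of a program in this style is shown below. The `MEO` dataclass, its field names, and the example values are simplified placeholders for illustration, not the exact operators exposed by our parser.

```python
from dataclasses import dataclass
from typing import Tuple

# A simplified, hypothetical stand-in for an MEO. The real operators carry
# the same three ingredients (joint, spatial constraint, time interval),
# but the names and fields here are placeholders for illustration.
@dataclass
class MEO:
    joint: str                     # which joint to modify
    constraint: str                # "rotation" or "translation"
    axis: str                      # direction of the change
    amount: float                  # radians or meters, relative to the source motion
    interval: Tuple[float, float]  # normalized [start, end] within the clip

# Edit: "raise the right hand higher during the wave"
program = [
    # The wave occupies roughly the middle third of the source clip,
    # so only that interval is constrained (non-destructive elsewhere).
    MEO(joint="right_wrist", constraint="translation",
        axis="up", amount=0.20, interval=(0.33, 0.66)),
]
```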

MEO programs generated by our LLM-based parser during an iterative editing session are shown below. Notice how the LLM agent provides justifications through comments to break down its reasoning (we encouraged it to do so by giving it several example programs with these patterns via in-context learning). Also notice how context from previous edits informs programs generated for future edits.

Results on single edits. Our system can generate edited motions that are plausible, faithful to input instructions, and non-destructive to the original motion. The videos below show motions before and after the edit.

Results on iterative edits. Our system allows editing motions conversationally, which enables progressive refinement of the character's motion, lets the user break larger editing tasks into sub-goals, and supports clarification or even adjustment of editing intent. Each video below shows a separate iterative editing session.

FAQ:

Why not just use prompt engineering?

One alternative to recent motion-editing methods is to iterate on the input prompt ("prompt engineering") given to the best available text2motion model, e.g., changing the original prompt "a side kick" to "a high side kick" and generating from scratch. Unfortunately, it can be hard to predict how these models will interpret changes in the prompt, and they provide little guarantee that the modified motion will retain any correspondence with the original. For the inherently iterative process of character motion refinement, a system that is both predictable and non-destructive to the current motion is vital (we recommend the excellent discussion in [Agrawala 2023]).

Is text a good way to specify motion edits? When is a text-based interface useful?

Text can be an ambiguous (and thus inefficient) way to describe precise edits, so why use text at all instead of, e.g., traditional keyframe animation software? For a single edit to a single joint, traditional methods may be more efficient, but we believe text is useful for iterative, conversational editing. Text instructions can build upon or refine previous edits, mimicking a conversation between user and character. While there is certainly a ways to go, we see our work as a step toward scaffolding iterative editing workflows.

That being said, we believe the ideal motion editing system of the future should be multimodal: text is one tool for describing edits, but so are demonstrations (e.g., via images or videos), the specifics of the scene/environment, and good old kinematic joint constraints.

What type of edits are and are not supported?

Our MEOs support kinematic editing of the main joints in the SMPL body. Physics-informed edits ("jump more forcefully") or semantic, stylization-based edits ("do that more excitedly") are not handled by our system, although we are excited about these directions.

Our prompt design makes it quite straightforward to add program generation support for new MEOs: the new MEO is included as an import statement in the LLM prompt, and example uses of the new MEO are provided in the demonstration part of the prompt. Thanks to its strong priors, the LLM is likely to correctly target new operators without retraining. For full system integration, the execution engine must also implement the new MEO.
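As a rough illustration, extending the prompt for a new MEO might look like the sketch below. The names here (`translate_relative`, `PROMPT_HEADER`, `DEMONSTRATIONS`, `build_prompt`) are placeholders for illustration, not our actual prompt text.

```python
# Schematic of extending the LLM prompt to support a new MEO.
# All names below are illustrative placeholders, not our exact prompt.

PROMPT_HEADER = (
    "from meo import rotate, translate, translate_relative\n"  # new operator imported
)

DEMONSTRATIONS = [
    # One or more worked examples of the new operator, supplied via
    # in-context learning so the LLM sees when and how to call it.
    ("Instruction: move the left hand closer to the head.",
     "translate_relative(joint='left_wrist', target='head',\n"
     "                   distance=0.1, interval=(0.0, 1.0))"),
]

def build_prompt(instruction: str) -> str:
    demos = "\n\n".join(f"# {q}\n{a}" for q, a in DEMONSTRATIONS)
    return f"{PROMPT_HEADER}\n{demos}\n\n# Instruction: {instruction}\n"
```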

How capable is the LLM Parser in generating executable code?

Quite capable! Given 100 editing prompts (collected both by asking ChatGPT to suggest kinematic editing instructions for source motion descriptions and by hand-writing editing prompts), the LLM parser successfully produced programs for 90 instructions on the first try, and an additional 7 after reflection/re-generation. Only 3 prompts failed to execute, because they requested operations not implemented in our current system (e.g., neck rotation, relative rotation, medial waist rotation). Thus, we believe the fundamental challenge of the system is generating MEO programs that embody the edit well, rather than generating valid programs.
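The reflection/re-generation step can be thought of as a simple retry loop. A hypothetical sketch is shown below, with `llm_generate_program` and `execute_program` standing in for our parser and execution engine.

```python
# Hypothetical sketch of the reflect-and-regenerate loop described above.
# llm_generate_program and execute_program are placeholders for the
# LLM parser and the execution engine.

def parse_with_reflection(instruction, llm_generate_program,
                          execute_program, max_attempts=3):
    feedback = None
    for _ in range(max_attempts):
        program = llm_generate_program(instruction, feedback=feedback)
        try:
            return execute_program(program)
        except Exception as err:
            # Feed the error back to the LLM and ask for a revised program.
            feedback = f"The previous program failed with: {err}"
    raise RuntimeError("Could not produce an executable MEO program.")
```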

Hindsights?

Our method separates constraint generation from motion generation: an LLM translates instructions into executable Python programs of MEOs; our execution engine first generates motion constraints (e.g., keyframes) from the programs, then a diffusion-based motion infilling step integrates those keyframes into the source motion.
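In pseudocode terms, the pipeline roughly follows the sketch below; the function names are placeholders, not our actual interfaces.

```python
# High-level sketch of the pipeline, with placeholder functions for
# each stage (not our exact interfaces).

def execute_edit(source_motion, instruction, llm_parser,
                 constraints_from_program, diffusion_infill):
    # Stage 0: the LLM parser turns the instruction into an MEO program.
    program = llm_parser(instruction)

    # Stage 1: the program is executed to produce frame-wise motion
    # constraints (e.g., edited keyframes) on the source motion.
    keyframes = constraints_from_program(program, source_motion)

    # Stage 2: a diffusion-based infilling step integrates those keyframes
    # back into the source motion, leaving unedited frames largely intact.
    return diffusion_infill(source_motion, keyframes)
```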

We found the division of labor and the introduction of an IR to be quite useful when scaling up the system, as opposed to using a single model to handle the whole text-to-edited-motion pipeline. First, the MEO intermediate representation is highly interpretable: it is clear what edit is being executed, and why. Second, the representation is controllable: each MEO reduces edits to a change along a single DOF; though in our examples the magnitude of change is set procedurally, it can in principle be set directly by the user (like a single-use rig). Third, the division of labor makes the pipeline more modular: the infilling model does not require re-training to add new types of kinematic edits. We found, for example, that scaling up our original set of rotation/translation MEOs to also handle relative translation was fairly simple, with no fine-tuning required.

The downside of this approach is that edits are currently limited to those that can be expressed by frame-wise motion constraints and transitions in and out of those frames; as mentioned earlier, stylization-based edits are more difficult within this division of labor. Motion quality may also suffer with this more modular approach (see, in comparison, the more recent work of [Huang 2024], which uses a single model to encapsulate both constraints and motion generation rather than our proposed two-stage execution engine). We believe there is a lot of room to explore the right balance between modularity and generation quality!

Acknowledgements:

Purvi Goel is supported by a Stanford Interdisciplinary Graduate Fellowship. Kuan-Chieh Wang was supported by Stanford Wu-Tsai Human Performance Alliances while at Stanford University. We thank the anonymous reviewers for constructive feedback; Vishnu Sarukkai, Sarah Jobalia, Sofia Di Toro Wyetzner, Haotian Zhang, David Durst, and James Hong for helpful discussions. Our codebase was built with invaluable help from James Burgess. This website was developed referencing EDGE, and by extension, the excellent Imagen site.