Generating Detailed Character Motion
from Blocking Poses

SIGGRAPH Asia 2025

Stanford University
Paper | Supplement | Code (coming soon) | Video

We focus on the problem of using generative diffusion models to convert a rough "sketch" of a desired motion, represented by a temporally sparse set of coarsely posed, imprecisely timed blocking poses, into a detailed animation. Current diffusion models can address (a) generating motion from temporally sparse constraints, and (b) correcting the timing of imprecisely-timed keyframes, but we find no good solution for handling the coarse posing of input blocking poses. Our key idea is simple: at certain denoising steps, we blend the outputs of an unconditioned diffusion model with input blocking pose constraints using per-blocking-pose tolerance weights, and pass the result as an input condition to a pre-existing motion retiming model. This can be thought of as refining the blocking poses themselves as they condition the motion generation process.

Motivation. One of the most common workflows in animation begins with motion blocking: specifying a coarse set of poses ("blocking poses" or "keys") that convey the gist of the desired action. Blocking poses are typically few in number, imprecise in timing, and coarse or incomplete in posing. They are intended as scaffolding for later motion detailing passes, in which an animator fills in the details that bring the motion to life while adjusting the timing and posing of the blocking poses themselves as necessary [Cooper 2021, Lasseter 1987, Williams 2009]. Unfortunately, no existing diffusion technique robustly converts blocking poses into a plausible, detailed character animation, a task that we call motion detailing. Running a standard motion-inbetweening model on blocking poses can produce unrealistic motion, because the blocking poses themselves are unrealistic.

Problem Illustration: Consider the blocking poses in the figure below (top left), of a character stepping forward and kicking. The goal is to convert these blocking poses into a detailed animation. Goel et al. [2025] presented a motion retiming model, but it requires well-posed keyframes. Running the retiming model, or a standard inbetweening model, on blocking poses can produce unrealistic motion, because the blocking poses themselves are unrealistic. We find that in this setting, standard ways of leveraging the diffusion prior to enhance pose detail, such as blending diffusion outputs with desired keyframes [Shafir 2023, Tseng 2022] or using keyposes as external constraints in reconstruction guidance [Xie 2023], also fail to produce plausible motions.


These 4 blocking poses capture the gist of an action: a character steps forward, kicks with the left leg, then steps back.



Baseline #1:
The retiming model of Goel et al. [2025] assumes that the input keyframes are perfectly posed and tries to preserve them exactly, which results in unnatural motion.
Baseline #2:
Using an automatically generated blending mask, as proposed by Shafir et al. [2023], can lead to over-smooth results.
Ours:
Our method creates a detailed motion with steps, a refined upper body position, and a natural, snappy kick.

Method Summary. At certain diffusion steps, we blend the outputs of an unconditioned diffusion model (shown in light blue in the figure below) with the input blocking poses (shown in orange) based on animator-controlled per-blocking-pose tolerance weights (shown as c). We pass the blended result as an input condition to the existing retiming model (dark blue). This contrasts with standard blending approaches, which blend the outputs of diffusion models and rely on brittle heuristics to define per-frame blending masks. By refining the input condition rather than the model outputs, our approach lets the diffusion model infer how each blocking pose should influence the in-between motion, with no per-frame mask heuristics required.
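A minimal sketch of this step in PyTorch-style Python (our own illustration; the names uncond_model and retiming_model, the tensor shapes, and the exact blend schedule are assumptions, not the paper's implementation):

import torch

def detailing_step(x_t, t, blocking_poses, key_frames, c,
                   uncond_model, retiming_model, blend_steps):
    # x_t:            noisy motion sample at denoising step t, shape (num_frames, pose_dim)
    # blocking_poses: animator's coarse input poses, shape (num_keys, pose_dim)
    # key_frames:     frame index of each blocking pose, shape (num_keys,)
    # c:              per-blocking-pose tolerance weights in [0, 1]; here c = 0 keeps the
    #                 input pose unchanged and c = 1 defers fully to the diffusion prior
    cond = blocking_poses
    if t in blend_steps:  # only refine the condition at selected denoising steps
        # Unconditioned prediction of the clean motion at this step.
        x0_uncond = uncond_model(x_t, t)
        # Blend the prediction at the key frames into the blocking poses,
        # refining the condition itself rather than the model's output.
        w = c.unsqueeze(-1)  # (num_keys, 1)
        cond = (1.0 - w) * blocking_poses + w * x0_uncond[key_frames]
    # The (possibly refined) poses condition the pre-existing retiming model as usual.
    return retiming_model(x_t, t, keyframes=cond, key_frames=key_frames)

Because the blend happens before the conditional retiming model runs, the model itself decides how each refined pose propagates to neighboring frames; no per-frame blending mask is ever constructed.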

Results: Our technique instruments a diffusion model to robustly convert blocking poses into detailed, natural-looking character animations, offering direct support for an important animation workflow.

These 3 blocking poses capture a motion where the character jumps forward, kicks its legs out to the side, then lands.




Our method creates a complete motion with detail even in joints that weren't posed in the input, like natural arm movements, realistic knee bend, and the jump itself.
These 2 blocking poses capture a motion where the character sits. Notice that the poses themselves are rough and underspecified.




Our method produces a complete motion, adding detail even in joints that weren't posed in the input, like natural arm movements, realistic balance, and a more natural sitting pose.
These 2 blocking poses capture a motion where the character walks with its arms raised way above its head. They're very coarsely posed: just a 90-degree rotation on both shoulders.




Our method creates a motion with realistic steps and a more natural upper body position that still keeps the arms raised.

FAQ:

Where can I read more about the blocking workflow?

We recommend the following fantastic resources:
  • GameAnim: Video Game Animation Explained by Jonathan Cooper [2021]
  • The Animator's Survival Kit by Richard Williams [2009]
  • Principles of traditional animation applied to 3D computer animation by John Lasseter [1987]

What is the difference between motion detailing and motion inbetweening?

Motion inbetweening is a core task in both classical animation and modern motion generation, referring to the process of generating intermediate frames that transition smoothly between keyframes. The core problem statement is: given a sparse set of key poses placed at specific times, generate a full motion sequence that respects these poses exactly while producing natural, coherent transitions between them. This may seem like it can be used to generate motion from blocking poses, especially since blocking poses, like keyframes, are usually temporally sparse. However, motion inbetweening assumes that keyframes are well-posed and well-timed, i.e., it treats inputs as hard constraints. In contrast, blocking poses are intentionally rough: they are often underspecified, with only a handful of joints posed meaningfully, and they are placed at approximate times on the timeline. While inbetweening assumes that input poses should be preserved exactly, motion detailing requires a system to potentially refine blocking poses themselves as well as generate the in-between motion. We therefore frame motion detailing as a more general, relaxed form of motion inbetweening.
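To make the hard-constraint view concrete, a typical diffusion-based inbetweening recipe can be sketched as an imputation step like the one below (an illustrative simplification in Python, not any particular model's code):

def hard_inbetweening_step(x0_pred, key_poses, key_frames):
    # Inbetweening treats keyframes as hard constraints: at each denoising step,
    # the model's clean-motion prediction is overwritten at the keyframe times,
    # so the input poses are reproduced exactly in the final motion.
    x0_pred[key_frames] = key_poses
    return x0_pred

Motion detailing drops this exact-equality assumption: the blocking poses may themselves be refined within an animator-specified tolerance, which is what the condition-blending step described in the Method Summary above provides.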

Why blend the inputs of the diffusion model, not the outputs?

The challenge of respecting the gist of the blocking poses while also adding detail is similar to the goals of inference-time imputation or blending techniques [Tevet 2023, Shafir 2023, Goel 2024] in diffusion models, which blend conditioned and unconditioned outputs using a blending mask M. Such masks are difficult to design: a mask that is too narrow can cause discontinuities, one that is too wide can cause the generated motion to adhere too closely to the undetailed blocking input, and heuristics for creating masks automatically are often brittle. Further, how strongly neighboring frames should be influenced by the blocking poses in the final detailed animation may differ per animation, per pose, or even per joint. Our key idea is that instead of requiring the animator to design a dense blending mask M by hand, we blend the input blocking poses themselves throughout the diffusion process.
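For reference, output-space blending looks roughly like the sketch below (again our own simplification; M denotes the dense per-frame, or per-frame-per-joint, mask these methods require):

def blend_outputs(x0_cond, x0_uncond, M):
    # Standard inference-time imputation/blending: mix the conditioned and
    # unconditioned predictions using a hand-designed or heuristically
    # generated mask M. The width of M around each keyframe controls how far
    # that keyframe's influence spreads, which is exactly the brittle choice
    # discussed above.
    return M * x0_cond + (1.0 - M) * x0_uncond

In contrast, our approach exposes only the small set of per-blocking-pose tolerance weights c, and lets the diffusion model determine how each refined pose should influence neighboring frames.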

What is the runtime speed? How does it compare to other motion-inbetweening models?

The full system (diffusion model + constraint refinement) takes about 20 seconds at inference time. In terms of runtime cost, our method introduces a modest overhead of approximately 10% relative to the base model R. Though this overhead could be further reduced through performance engineering (e.g., vectorized IK), most of the runtime is still attributable to the base model itself. We believe that advances in accelerating diffusion models (see CLoSD [Tevet 2025], DPM-Solver [Lu 2022]) could directly benefit our approach and enable our system to run at interactive rates in the future.

Hindsights/Looking Forward

We’ve been reflecting on the role of blocking poses in real animation workflows. They act as a bridge between the animator’s high-level creative intent and the low-level details of motion, making them a natural and powerful input for next-generation, learning-based animation systems. In this role, they form a shared language between animator and system. Because blocking poses can be sourced, extracted, edited, or generated from higher-level modalities such as text, images, or video demonstrations, they offer a flexible and interpretable intermediate representation. We see strong potential for them to connect a wide range of high-level specifications to motion, a direction we explored in earlier work on motion editing from text instructions [Goel 2024], where textual edits were translated into modified blocking poses and then “detailed” into plausible motion (although we didn't realize this framing at the time!).

We are excited to see what new directions this broader line of work takes in the future 🚀

Acknowledgements:

Purvi Goel is supported by a Stanford Interdisciplinary Graduate Fellowship. We thank the anonymous reviewers for constructive feedback; Vishnu Sarukkai, Sarah Jobalia, Sofia Di Toro Wyetzner, and Zander Majercik for helpful discussions; Meta for gift support. This website was developed referencing EDGE, and by extension, the excellent Imagen site.