We focus on the problem of using generative diffusion models to convert a rough "sketch" of a desired motion, represented by a temporally sparse set of coarsely posed, imprecisely timed blocking poses, into a detailed animation. Current diffusion models can address (a) generating motion from temporally sparse constraints, and (b) correcting the timing of imprecisely timed keyframes, but we find no good solution for handling the coarse posing of input blocking poses. Our key idea is simple: at certain denoising steps, we blend the outputs of an unconditioned diffusion model with the input blocking pose constraints using per-blocking-pose tolerance weights, and pass the result as an input condition to a pre-existing motion retiming model. This can be thought of as refining the blocking poses themselves as they condition the motion generation process.
Motivation. One of the most common workflows in animation begins with motion blocking: specifying a coarse set of poses ("blocking poses" or "keys") that convey the gist of the desired action. Blocking poses are typically few in number, imprecise in timing, and coarse or incomplete in posing. They are intended as scaffolding for later motion detailing passes, in which an animator fills in the details that bring the motion to life while adjusting the timing and posing of the blocking poses themselves as necessary [Cooper 2021, Lasseter 1987, Williams 2009]. Unfortunately, no existing diffusion technique robustly converts blocking poses into a plausible, detailed character animation, a task that we call motion detailing. Running a standard motion-inbetweening model on blocking poses can produce unrealistic motion, because the blocking poses themselves are unrealistic.
Problem Illustration. Consider the blocking poses in the figure below (top left), of a character stepping forward and kicking. The goal is to convert these blocking poses into a detailed animation. Goel et al. [2025] presented a motion retiming model, but it requires well-posed keyframes; running it, or a standard inbetweening model, directly on blocking poses yields unrealistic motion because the poses themselves are unrealistic. We find that in this setting, standard ways of leveraging the diffusion prior to enhance pose detail, such as blending diffusion outputs with desired keyframes [Shafir 2023, Tseng 2022] or using key poses as external constraints in reconstruction guidance [Xie 2023], also fail to produce plausible motions.
Method Summary. At certain diffusion steps, we blend the outputs of an unconditioned diffusion model (shown in light blue in the figure below) with the input blocking poses (shown in orange), using animator-controlled per-blocking-pose tolerance weights (shown as c). We then pass the blended result as an input condition to the existing retiming model (dark blue). This contrasts with standard blending approaches, which blend the outputs of diffusion models and rely on brittle heuristics to define per-frame blending masks. By refining the input condition rather than the model outputs, our approach lets the diffusion model infer how each blocking pose should influence the in-between motion, without requiring heuristics for setting per-frame blending masks.
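To make the constraint-refinement step concrete, here is a minimal Python sketch of one denoising step under this scheme. The function names (uncond_model, retiming_model), array shapes, and the refine_steps schedule are illustrative assumptions, not the exact interface of our implementation.

```python
import numpy as np

def denoise_step_with_refined_condition(x_t, t, blocking_poses, blocking_frames,
                                         tolerance, uncond_model, retiming_model,
                                         refine_steps):
    """One denoising step in which the keyframe condition is refined before use.

    x_t             : (T, D) noisy motion at diffusion step t
    blocking_poses  : (K, D) animator-authored blocking poses
    blocking_frames : (K,)   approximate frame indices of the blocking poses
    tolerance       : (K,)   per-blocking-pose weights c in [0, 1]
                             (1 = keep the blocking pose, 0 = defer to the prior)
    uncond_model    : callable (x_t, t) -> (T, D) clean-motion estimate from the
                      unconditioned diffusion prior
    retiming_model  : callable (x_t, t, cond) -> (T, D) denoised motion from the
                      keyframe-conditioned retiming model
    refine_steps    : set of diffusion steps at which the condition is refined
    """
    if t in refine_steps:
        # Let the unconditioned prior propose plausible poses at the blocking frames...
        x0_uncond = uncond_model(x_t, t)
        proposals = x0_uncond[blocking_frames]
        # ...and pull each blocking pose toward its proposal by (1 - c).
        c = tolerance[:, None]
        refined_poses = c * blocking_poses + (1.0 - c) * proposals
    else:
        refined_poses = blocking_poses

    # Condition the retiming model on the (possibly refined) blocking poses.
    cond = {int(f): p for f, p in zip(blocking_frames, refined_poses)}
    return retiming_model(x_t, t, cond)
```

In this sketch, a tolerance weight of 1 preserves a blocking pose exactly (recovering standard conditioning), while smaller values allow the prior to reposition it before the retiming model sees it.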
Results. Our technique instruments a diffusion model to robustly convert blocking poses into detailed, natural-looking character animations, offering direct support for an important animation workflow.
Where can I read more about the blocking workflow?
We recommend the following fantastic resources:

What is the difference between motion detailing and motion inbetweening?
Motion inbetweening is a core task in both classical animation and modern motion generation: generating intermediate frames that transition smoothly between keyframes. The core problem statement is: given a sparse set of key poses placed at specific times, generate a full motion sequence that respects these poses exactly while producing natural, coherent transitions between them. This may seem sufficient for generating motion from blocking poses, especially since blocking poses, like keyframes, are usually temporally sparse. However, motion inbetweening assumes that keyframes are well-posed and well-timed, i.e., it treats its inputs as hard constraints. In contrast, blocking poses are intentionally rough: they are often underspecified, with only a handful of joints posed meaningfully, and they are placed at approximate times on the timeline. Whereas inbetweening preserves input poses exactly, motion detailing requires a system to potentially refine the blocking poses themselves as well as generate the in-between motion. We therefore frame motion detailing as a more general, relaxed form of motion inbetweening.
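To illustrate the difference in how constraints are treated, the short sketch below contrasts hard keyframe imputation (inbetweening) with the relaxed, tolerance-weighted treatment that motion detailing calls for. The arrays, frame indices, and tolerance values are illustrative only.

```python
import numpy as np

T, D = 60, 69                           # frames and pose dimensions (illustrative)
x0_model = np.random.randn(T, D)        # stand-in for the model's clean-motion estimate
key_poses = {0: np.zeros(D), 30: np.ones(D), 59: np.zeros(D)}

# Motion inbetweening: key poses are hard constraints, copied in verbatim.
x0_inbetween = x0_model.copy()
for f, pose in key_poses.items():
    x0_inbetween[f] = pose

# Motion detailing: blocking poses are soft constraints. Each pose may be pulled
# toward the model's own estimate according to a per-pose tolerance c in [0, 1].
tolerance = {0: 1.0, 30: 0.6, 59: 0.8}  # 1 = keep the pose exactly, lower = allow refinement
x0_detail = x0_model.copy()
for f, pose in key_poses.items():
    c = tolerance[f]
    x0_detail[f] = c * pose + (1.0 - c) * x0_model[f]
```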
Why blend the inputs of the diffusion model, not the outputs?
The challenge of respecting the gist of blocking poses while also adding detail is similar to the goal of inference-time imputation or blending techniques [Tevet 2023, Shafir 2023, Goel 2024] in diffusion models, which blend conditioned and unconditioned outputs using a blending mask M. Such masks are difficult to design: a mask that is too narrow can cause discontinuities, one that is too wide can cause the generated motion to adhere too closely to the undetailed blocking input, and heuristics for creating masks automatically are often brittle. Further, how strongly neighboring frames should be influenced by the blocking poses in the final detailed animation may differ per animation, per pose, or even per joint. Our key idea is that instead of requiring the animator to design a dense blending mask M by hand, we blend the input blocking poses throughout the diffusion process.
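For contrast, the sketch below shows the kind of dense, per-frame blending mask M that output-space blending requires, built here with a hand-tuned linear falloff around each keyframe. All names, window sizes, and shapes are illustrative assumptions.

```python
import numpy as np

T, D = 60, 69                          # frames and pose dimensions (illustrative)
x0_cond = np.random.randn(T, D)        # stand-in for the conditioned model output
x0_uncond = np.random.randn(T, D)      # stand-in for the unconditioned model output
key_frames = (0, 30, 59)

# Output blending: a dense per-frame mask M must be designed over all T frames,
# here via a hand-tuned linear falloff (5-frame window) around each keyframe.
window = 5
M = np.zeros((T, 1))
for f in key_frames:
    for dt in range(-window, window + 1):
        if 0 <= f + dt < T:
            M[f + dt, 0] = max(M[f + dt, 0], 1.0 - abs(dt) / window)
x0_blended = M * x0_cond + (1.0 - M) * x0_uncond

# Input blending (our setting): only the K blocking poses themselves carry a weight
# (the per-pose tolerance c); the conditioned model decides how each refined pose
# influences its neighbors, so no dense mask over all T frames is needed.
```

The mask-building loop above encodes exactly the kind of heuristic (window size, falloff shape) that input-side blending avoids.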
What is the runtime speed? How does it compare to other motion-inbetweening models?
The full system (diffusion model + constraint refinement) takes about 20 seconds at inference time. In terms of runtime cost, our method introduces a modest overhead of approximately 10% relative to the base model R. Though this overhead could be reduced further through performance engineering (e.g., vectorized IK), most of the runtime is still attributable to the base model itself. We believe that advances in accelerating diffusion models (see CLoSD [Tevet 2025], DPM-Solver [Lu 2022]) could directly benefit our approach and enable our system to run at interactive rates in the future.