Generative Motion Infilling
From Imprecisely Timed Keyframes

Eurographics 2025

Stanford University, NVIDIA
Paper | Supplement | Video | Code (Coming Soon) | Blog Post

Keyframes are a standard representation in character animation, and recent motion-inbetweening methods use them to control generative motion models. However, specifying keyframe timing is challenging in practice. We introduce a system for motion-inbetweening in the context of keyframes that may be imprecisely timed. Our key idea is a novel model architecture that explicitly outputs a time-warping function to correct mistimed keyframes, and the spatial details to add in between those keyframes. Our system is able to generate high-quality motion from mistimed keyframes, and supports both motion synthesis and editing workflows.

Motivation. Machine-learning-based motion in-betweening solutions can generate natural motion from sparse keyframe constraints. The problem is that while casual users may be able to specify the poses for keyframe constraints, timing those keyframes can be extremely challenging [Terra 2004]. Even experienced animators have noted the importance and difficulty of animation timing. When input keyframes have imprecise timing, standard motion in-betweening solutions, which are trained to match input keyframes exactly, can generate unrealistic or undesirable motion.

We believe learned in-betweening systems must be capable of adjusting the timing of the input keyframes.

Problem Illustration: To generate a motion like the dynamic dance shown here (top left), we give six keyframes (top right) to a standard motion-inbetweening model. If the keyframe timing is incorrect, the model can generate jerky, unrealistic motion with a missed step (bottom left). Our model generates the desired motion even if keyframe timing is incorrect (bottom right).

The desired motion.
Six poses from the desired motion, to act as keyframe constraints for the inbetweening model.
Motion generated by a standard motion-inbetweening model, given imprecisely timed keyframes.
Motion generated by our model, given imprecisely timed keyframes.

Method Summary. First, we use our proposed data generation procedure to create a dataset of deliberately mistimed keyframes and their corresponding ground truth motion (left). Our goal is to train a model that can learn to generate the ground truth motion from the mistimed keyframes. So, we introduce a novel dual-headed transformer-based diffusion model (right) that jointly predicts an explicit global time-warping function to correct and retime the mistimed keyframes, and local pose residuals that add spatial details to the motion.
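
To make this concrete, here is a minimal PyTorch-style sketch of the two ideas above: jittering ground-truth keyframe times to build training pairs, and a transformer denoiser with one head for the global time-warp and one for per-frame pose residuals. This is an illustrative sketch, not the paper's implementation: the layer sizes, jitter range, and the softplus-plus-cumsum warp parameterization are assumptions, and diffusion timestep and keyframe conditioning are omitted for brevity.

import torch
import torch.nn as nn
import torch.nn.functional as F

def mistime_keyframes(times, max_jitter=8, num_frames=120):
    """Data generation sketch: jitter ground-truth keyframe indices (a 1-D integer
    tensor) to mimic imprecisely timed user input."""
    jitter = torch.randint(-max_jitter, max_jitter + 1, times.shape)
    return (times + jitter).clamp(0, num_frames - 1).sort().values

class DualHeadDenoiser(nn.Module):
    """Transformer denoiser with two output heads: a monotone global time-warp over
    the output frames, and per-frame local pose residuals."""

    def __init__(self, pose_dim=135, d_model=256, n_heads=4, n_layers=6):
        super().__init__()
        self.in_proj = nn.Linear(pose_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.warp_head = nn.Linear(d_model, 1)             # per-frame warp increments
        self.residual_head = nn.Linear(d_model, pose_dim)  # per-frame pose residuals

    def forward(self, x):  # x: (batch, frames, pose_dim), noisy motion with keyframes filled in
        h = self.backbone(self.in_proj(x))
        # Monotone warp: positive increments, cumulative sum, rescaled to span [0, frames - 1].
        inc = F.softplus(self.warp_head(h).squeeze(-1))
        warp = torch.cumsum(inc, dim=1)
        warp = (warp - warp[:, :1]) / (warp[:, -1:] - warp[:, :1] + 1e-8) * (x.shape[1] - 1)
        return warp, self.residual_head(h)

# Example: warp has shape (1, 120), residuals have shape (1, 120, 135).
warp, residuals = DualHeadDenoiser()(torch.randn(1, 120, 135))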

Results (Synthesis): Our model can generate high-quality motions from imprecisely timed keyframes. The input keyframes (left, red) may have imprecise timing. The middle shows the motion generated by a state-of-the-art motion-inbetweening model, which treats the timing of input keyframes as hard constraints; the right shows the motion generated by our model.

Set of keyframes specifying a motion of a person doing arm circles. In this case, the keyframes were sampled from a ground truth motion and deliberately mistimed.
Motion generated by a prior learned motion-inbetweening model, which treats the timing of input keyframes as a hard constraint. Due to the imprecise timing of input keyframes, the model fails to create circles.
Motion generated by our model. The model is able to generate high-quality motion and recovers the circles.
Set of keyframes specifying a motion of a person writing on a whiteboard. In this case, there was no ground truth motion; instead, we placed poses manually on the timeline, mimicking typical user interactions with the system.
Because the timing of the keyframes was ultimately inaccurate, a model that treats the timing of input keyframes as a hard constraint produces a motion that is jerky and unrealistic.
Meanwhile, our model creates smooth steps and a natural transition.

Results (Editing): Our model can also be used to edit existing motions. In editing scenarios, handling imprecise timing is very important: the original motion already has an existing timing, and introducing an edit may require changing that timing.

1) Given an original motion of a person walking in a crouched manner, we want the character to walk farther in the same crouched style.
2) We sample four keyframes from the original motion and space them farther apart.
3) A model that treats the timing of input keyframes as a hard constraint reacts to the imprecisely timed middle keyposes by generating motion with an uneven gait.
4) Given the input keyposes, our model generates a new motion (bottom right) that is similar to the original motion, but walks farther.

FAQ:

How does keyframe density affect controllability, and what impact does it have on prior approaches' ability to handle mistimed keyframes?

Existing motion-inbetweening solutions are trained to match input keyposes at exactly the frames provided. In practice, this hard timing constraint poses little problem for prior learned motion-inbetweening models when there are only two or three keyframe constraints, since the model has significant flexibility to construct a natural-looking motion in between the keyframes. But because the model is so sparsely constrained, the generated motion, despite appearing natural and adhering to the constraints, may not reflect the detailed animation the animator envisioned. If the animator seeks more control by providing more keyframes, the model has less flexibility to compensate for mistimed keyframe inputs. Standard motion-inbetweening solutions then produce output that may feature unrealistic dynamics (the character moves from one keyframe to another too fast) or miss keyframes (the character does not have enough time to reach the next keyframe).

Why not learn the retiming function and spatial details with two separate models?

One alternative model architecture is to learn a time-warping function separately from spatial details. For example, a model could learn a time-warping function that warps the input keyframes to match the ground truth motion. The retimed keyframes could then be used to generate motion using a standard motion-inbetweening model. However, in many cases, timing and spatial details are actually closely related, e.g., a higher jump may require more time in the air, but also more wind-up/knee bend. This is primarily why we propose a dual-headed model that jointly learns a time-warping function and spatial details.
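
For intuition on how the two heads combine, here is a small illustrative sketch (not the paper's exact formulation): the predicted warp resamples a naive dense interpolation of the mistimed keyframes, and the residuals add spatial detail on top. The names interp, warp, and residuals are hypothetical stand-ins for the dense keyframe interpolation, the per-output-frame warped source times, and the per-frame pose residuals.

import torch

def compose_motion(interp, warp, residuals):
    """Illustrative composition: resample a dense keyframe interpolation `interp`
    (frames, pose_dim) at warped, fractional source times `warp` (frames,),
    then add per-frame pose residuals (frames, pose_dim)."""
    lo = warp.floor().long().clamp(0, interp.shape[0] - 1)
    hi = (lo + 1).clamp(max=interp.shape[0] - 1)
    frac = (warp - lo.float()).unsqueeze(-1)
    retimed = (1 - frac) * interp[lo] + frac * interp[hi]  # linear resampling at warped times
    return retimed + residuals                             # spatial detail on the retimed base

Because timing and spatial detail feed into the same final motion, training the two heads jointly lets the model trade them off against each other, which is exactly the coupling described above.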

Why not just use a previous learned motion-inbetweening model, and retime the generated output manually?

Instead of relying on a model that learns to retime keyframes during the generation process, one might ask: could a vanilla motion-inbetweening model be applied to imprecisely timed keyframes, and the generated motion then retimed manually, say with a handcrafted time-warp function? Unfortunately, applying prior learned interpolation techniques to imprecisely timed keyframes often results in motion that lacks accurate spatial detail, which is hard to recover through manual retiming. This limitation underscores the need for an integrated approach.

Where does this fit in the context of other forms of conditioned generative motion generation, like text-to-motion?

We believe that the ideal future motion editing system should be multimodal, incorporating both text and kinematic joint constraints (as well as image- or video-based demonstrations, scene-specific details, and more). Each modality offers unique advantages: text is highly accessible but can be ambiguous, while kinematic joint constraints allow for greater control but can be challenging to articulate. We believe a multimodal system that combines both would get the best of both worlds: using text to express high-level intent and kinematic constraints (e.g., keyframes) to communicate more precise details. In this work, we focus on a fundamental challenge in specifying kinematic constraints: defining the timing of these constraints. Ideally, this would make it easier to control any animation system, multimodal or not, that exposes kinematic constraints.

TLDR: Both! Both is good!

Hindsights?

Still brewing...

Acknowledgements:

Purvi Goel is supported by a Stanford Interdisciplinary Graduate Fellowship. We thank the anonymous reviewers for constructive feedback; Vishnu Sarukkai, Sarah Jobalia, Sofia Di Toro Wyetzner, and Mia Tang for proofreading; James Hong, Zander Majercik, and David Durst for helpful discussions; Meta for gift support. This website was developed referencing EDGE, and by extension, the excellent Imagen site.