Goal-Conditioned Motion Generation
Conditioned on start and end points (spatial pokes), our model generates plausible motions that satisfy both constraints.
Path finding: sparse pokes induce globally coherent paths through the scene.
Rotational motion: the model captures circular and spinning dynamics.
Understanding joints: articulated structures move coherently instead of as independent points.
From a single start frame, our model generates multiple coherent futures, illustrating the diversity of the learned motion distribution.
Comparison against Motion Predictors
On open-domain videos, we compare against prior motion predictors under different poke-conditioning sparsities, from a single poke to dense guidance. Our latent motion model achieves the best generation quality and conditioning adherence across the board while being substantially faster than both flow-based and trajectory-based baselines.
Comparison against SOTA video models
We compare against Wan and Veo 3 in a goal-conditioned setup where video baselines receive the start and end frame, while our model receives the start frame and a poke-based motion goal. Since video models do not expose explicit motion trajectories, we track their generated videos with CoTracker and evaluate the recovered motion using the same metrics. In both sample-matched and time-matched settings, our kinematics-first model outperforms the video baselines, with the gap widening substantially when wall-clock sampling time is matched.
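Since the video baselines are evaluated on tracks recovered by CoTracker rather than on motion they output directly, the comparison reduces to distances between recovered and target trajectories. As an illustration of this kind of metric, here is a minimal sketch of an endpoint-error computation on toy data (the function name and the specific metric are our assumptions, not the paper's exact evaluation code):

```python
import numpy as np

def endpoint_error(pred_tracks: np.ndarray, gt_tracks: np.ndarray) -> float:
    """Mean L2 distance between predicted and ground-truth track endpoints.

    pred_tracks, gt_tracks: arrays of shape (N, T, 2) holding (x, y)
    positions of N tracked points over T frames. In the comparison above,
    pred_tracks would come from running CoTracker on a baseline's generated
    video; here we use synthetic data purely for illustration.
    """
    assert pred_tracks.shape == gt_tracks.shape
    # Compare only the final frame, i.e. how well the motion goal is hit.
    diff = pred_tracks[:, -1, :] - gt_tracks[:, -1, :]
    return float(np.linalg.norm(diff, axis=-1).mean())

# Toy example: two points tracked over three frames.
gt = np.zeros((2, 3, 2))
gt[:, -1, :] = [[3.0, 4.0], [0.0, 0.0]]  # goal positions at the last frame
pred = np.zeros((2, 3, 2))               # both points stay at the origin
print(endpoint_error(pred, gt))          # -> 2.5 (mean of 5.0 and 0.0)
```

The same tracks can be scored under additional metrics (e.g. per-frame trajectory error) without changing the tracking step.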
Action Prediction on LIBERO
Given a task description and a start frame, our model predicts how objects in the scene should move to solve the task. We train a small policy head to map our generated motions to 7D robot actions and roll out the policy:
Put both the cream cheese box and the butter in the basket
Turn on the stove and put the moka pot on it
Put the black bowl in the bottom drawer of the cabinet and close it
Put the yellow and white mug in the microwave and close it
Following the ATM and Tra-MoE evaluation protocols, our method outperforms both baselines across tasks, indicating stronger task-conditioned planning and scene understanding:
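The policy head that maps generated motions to robot actions can be very small. The following is a minimal sketch of such a head; the shapes, class name, and the linear mapping are illustrative assumptions (the actual head is a learned network trained on LIBERO demonstrations):

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed shapes: N tracked points over T future frames, each an (x, y) offset.
N_POINTS, T_FRAMES, ACTION_DIM = 32, 8, 7  # 7D action: 3 pos + 3 rot + gripper

class LinearPolicyHead:
    """Stand-in for the small policy head: flattens one generated motion
    of shape (N, T, 2) and maps it to a single 7D robot action.
    The real head is trained from demonstrations; weights here are random."""

    def __init__(self, n_points: int, t_frames: int, action_dim: int):
        in_dim = n_points * t_frames * 2
        self.W = rng.normal(scale=0.01, size=(in_dim, action_dim))
        self.b = np.zeros(action_dim)

    def __call__(self, motion: np.ndarray) -> np.ndarray:
        return motion.reshape(-1) @ self.W + self.b

head = LinearPolicyHead(N_POINTS, T_FRAMES, ACTION_DIM)
motion = rng.normal(size=(N_POINTS, T_FRAMES, 2))  # one generated motion
action = head(motion)
print(action.shape)  # -> (7,)
```

At rollout time, the head is queried once per control step on the motion generated from the current observation.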
How It Works
We learn a dense motion space by encoding sparse tracker trajectories and the start frame into a compact latent motion grid that supports dense reconstruction at arbitrary spatial query points.
We then generate goal-conditioned motion latents directly in this learned motion space from scene context plus text or spatial-poke conditioning.
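The key property of the latent motion grid is that it can be queried at continuous spatial locations, not just at the cells where trajectories were encoded. A minimal sketch of such a query using bilinear interpolation (the learned decoder is more expressive, but the grid-lookup idea is the same; function name and shapes are our assumptions):

```python
import numpy as np

def query_motion_grid(grid: np.ndarray, points: np.ndarray) -> np.ndarray:
    """Bilinearly sample a latent motion grid at continuous query points.

    grid:   (H, W, C) latent motion grid, C latent channels per cell.
    points: (N, 2) query coordinates, (x, y) normalized to [0, 1].
    Returns (N, C): one latent vector per query point.
    """
    H, W, _ = grid.shape
    # Map normalized coordinates into grid index space.
    ys = points[:, 1] * (H - 1)
    xs = points[:, 0] * (W - 1)
    y0 = np.clip(np.floor(ys).astype(int), 0, H - 2)
    x0 = np.clip(np.floor(xs).astype(int), 0, W - 2)
    wy = (ys - y0)[:, None]
    wx = (xs - x0)[:, None]
    # Gather the four surrounding cells and blend.
    top = (1 - wx) * grid[y0, x0] + wx * grid[y0, x0 + 1]
    bot = (1 - wx) * grid[y0 + 1, x0] + wx * grid[y0 + 1, x0 + 1]
    return (1 - wy) * top + wy * bot

# Toy grid whose single channel equals the x index, so queries are exact.
grid = np.tile(np.arange(4.0)[None, :, None], (4, 1, 1))  # shape (4, 4, 1)
pts = np.array([[0.5, 0.5]])                              # grid center
print(query_motion_grid(grid, pts))                       # -> [[1.5]]
```

Because the lookup is continuous, the decoder can reconstruct motion for any pixel, even though the encoder only saw sparse tracker trajectories.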
Compression Improves Efficiency and Semantics
Stronger temporal compression improves motion generation quality and throughput while only mildly reducing reconstruction fidelity, and leads to a more semantic motion space.
Additional Results
Conditioned on the start and end position of a specific point, our model generates a distribution of trajectories, yielding a distribution of positions. We visualize the original video on the left alongside the distribution of positions generated by our model on the right.
Per column, we show multiple generated motions conditioned on spatial pokes shown in the top row.
BibTeX
@inproceedings{stracke2026motionembeddings,
title = {Learning Long-term Motion Embeddings for Efficient Kinematics Generation},
author = {Stracke, Nick and Bauer, Kolja and Baumann, Stefan Andreas and Bautista, Miguel Angel and Susskind, Josh and Ommer, Björn},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year = {2026}
}