
MAGNet: One Model for All Multi-Agent Interactions

Vongani H. Maluleke

Humans coordinate in groups naturally—whether we’re performing synchronized dance routines, sparring in a boxing ring, or simply navigating a crowded sidewalk. For AI systems, replicating this ability has been a longstanding challenge. But we’re excited to share a breakthrough that tackles this problem head-on.

In our new work, we introduce MAGNet (Multi-Agent Generative Network)—a unified framework that successfully handles the full complexity of multi-person coordination. Rather than building specialized models for different interaction types or group sizes, MAGNet provides a single, powerful architecture that generates remarkably realistic motion across diverse multi-person scenarios.

MAGNet generates realistic coordinated motion across diverse scenarios—from dancing and boxing to social interactions—all from a single unified model.

A Unified Solution to Fragmented Problems

The existing landscape of multi-person motion generation has been fragmented. Most methods are narrowly specialized: one model predicts how a partner reacts to your movements, another forecasts joint futures, a third handles specific activities like dancing or boxing. Even more limiting, these approaches typically handle only dyadic (two-person) interactions.

We built MAGNet to fundamentally change this paradigm. A single trained model now supports the full spectrum of multi-agent generation tasks:

Partner Inpainting

Generate plausible motion for one person given complete motion trajectories of their interaction partners. Imagine you have mocap data of one dancer and want to synthesize a believable partner—MAGNet handles this seamlessly and beautifully.

Joint Future Prediction

Given everyone’s past motion, predict simultaneous future movements for the entire group. This is the classic multi-agent forecasting problem, and MAGNet solves it with impressive accuracy.

Partner Prediction

Predict one agent’s future motion given both agents’ past motion, generating reactive and contextually appropriate responses to their partner’s movements.

Motion In-Betweening

Create smooth, natural transitions between specified keyframe poses. This gives artists and animators unprecedented control while letting the model handle the complex physics of coordinated motion.

Polyadic Interactions

Naturally extend from pairs to groups of three, four, or more people without retraining. The architecture elegantly scales to any number of agents.

Agentic Sampling

Run the model independently on each agent, where individuals generate motion from their own perspective based on observations of others. This mirrors how humans actually coordinate.
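To make this concrete, here is a rough sketch of what one round of agentic sampling could look like. The model interface (model.denoise) and data layout are hypothetical, not MAGNet's released API; the point is simply that each agent denoises only its own tokens while treating its observations of the others as conditioning.

def agentic_sampling_step(model, tokens_by_agent, agent_ids):
    """tokens_by_agent: dict mapping agent id -> that agent's current token sequence.
    Each agent runs the same model from its own perspective and updates only itself."""
    updates = {}
    for me in agent_ids:
        # What this agent can observe of everyone else.
        others = {a: t for a, t in tokens_by_agent.items() if a != me}
        # Hypothetical call: denoise my own motion, conditioned on my observations of the others.
        updates[me] = model.denoise(self_tokens=tokens_by_agent[me], observed=others)
    tokens_by_agent.update(updates)   # apply all agents' updates simultaneously
    return tokens_by_agent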

The Secret Sauce: How It Works

MAGNet’s versatility comes from three tightly integrated design choices:

1. Relative Coordinate Representation

Rather than representing each person’s motion in absolute world coordinates, we encode everything as relative transformations between agents. Each person maintains their own local coordinate frame, and we represent how they’re positioned and oriented relative to every other agent in the scene.

This elegant approach makes the model completely agnostic to where the interaction happens in physical space. A group dancing in the center of a room has the same relational structure as one dancing in the corner—the model only needs to learn interaction patterns, not every possible absolute configuration. This dramatically reduces complexity and yields impressive generalization.
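As a minimal illustration (not the paper's actual code), a pairwise relative transform can be computed from two agents' world-frame root poses, assuming each pose is a 4x4 rigid transform; the result is unchanged if the entire scene is translated or rotated, which is exactly the invariance described above.

import numpy as np

def relative_transform(T_world_a: np.ndarray, T_world_b: np.ndarray) -> np.ndarray:
    """Express agent B's root pose in agent A's local frame: T_{A->B} = T_A^{-1} T_B."""
    return np.linalg.inv(T_world_a) @ T_world_b

def pairwise_relative_transforms(root_poses: np.ndarray) -> np.ndarray:
    """root_poses: (num_agents, 4, 4) world-frame rigid transforms.
    Returns (num_agents, num_agents, 4, 4) relative transforms between every pair;
    the result is invariant to any rigid transform applied to the whole scene."""
    n = root_poses.shape[0]
    rel = np.zeros((n, n, 4, 4))
    for i in range(n):
        for j in range(n):
            rel[i, j] = relative_transform(root_poses[i], root_poses[j])
    return rel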

2. Discrete Motion Tokenization via VQ-VAE

Building on our successful approach from the couple dancing work, we use a Vector Quantized Variational Autoencoder (VQ-VAE) to learn a discrete latent representation of human motion. This converts continuous pose sequences into discrete tokens—essentially creating a learned vocabulary of body movements.

Working with discrete tokens rather than continuous joint positions provides a remarkably clean signal for the transformer. It eliminates high-frequency noise, captures meaningful motion primitives, and enables the use of classification-based training rather than regression.
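For readers unfamiliar with VQ-VAEs, here is a generic nearest-codebook quantization step in PyTorch (a textbook sketch, not MAGNet's implementation), showing how continuous encoder outputs become discrete token ids drawn from a learned motion vocabulary:

import torch

def quantize(z_e: torch.Tensor, codebook: torch.Tensor):
    """z_e: (T, D) continuous encoder outputs; codebook: (K, D) learned embeddings.
    Returns discrete token ids (T,) and the quantized vectors (T, D)."""
    # Distance from every encoder output to every codebook entry.
    dists = torch.cdist(z_e, codebook)   # (T, K)
    ids = dists.argmin(dim=-1)           # nearest codebook entry per encoded chunk
    z_q = codebook[ids]                  # (T, D) discrete "motion vocabulary" tokens
    # Straight-through estimator so gradients still flow to the encoder during training.
    z_q = z_e + (z_q - z_e).detach()
    return ids, z_q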

But we’ve pushed this further than standard VQ-VAE approaches. Each motion token in MAGNet is multi-component, encoding not just an agent’s own pose but also their relative transforms to all other agents. This explicit representation of inter-agent relationships is the key to our success in capturing realistic coordination.
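One way to picture such a multi-component token (illustrative field names, assuming the mA = [zA, ΔTcan, TA→B] layout shown in the tokenization figure below) is a small record per agent per timestep:

from dataclasses import dataclass
import numpy as np

@dataclass
class MotionToken:
    """One agent at one timestep; fields are illustrative, not the paper's exact layout."""
    z: int                    # discrete VQ-VAE code indexing the learned motion vocabulary
    delta_T_can: np.ndarray   # (4, 4) canonical-frame delta for the agent's own root motion
    T_to_others: dict         # other agent id -> (4, 4) relative transform to that agent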

3. Diffusion Forcing Transformer

This is where the architecture gets particularly powerful. We adopt Diffusion Forcing, a hybrid of autoregressive and full-sequence diffusion modeling that applies different noise levels to different positions in a sequence during training. Each token (representing a specific agent at a specific timestep) receives an independently sampled noise level.

This elegant modification enables remarkably flexible conditioning at inference time. The model seamlessly handles any combination of known and unknown motion—whether you’re conditioning on clean past motion, inpainting specific agents, or gradually denoising entire future trajectories.

For multi-agent settings, our key innovation is that each agent’s tokens are independently noised. The transformer learns to intelligently modulate how much it relies on an agent’s own motion history versus their partners’ states—essentially learning when to trust individual dynamics versus inter-agent coupling. The results are stunning.
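A minimal sketch of the per-token noising idea, written as a standard DDPM-style forward process with illustrative shapes (a simplification, not the released training code): the only change from ordinary diffusion training is that every (agent, timestep) token draws its own diffusion step.

import torch

def diffusion_forcing_noising(tokens: torch.Tensor, alphas_cumprod: torch.Tensor):
    """tokens: (num_agents, T, D) clean motion token embeddings.
    alphas_cumprod: (num_steps,) cumulative noise schedule of a standard DDPM."""
    num_agents, T, _ = tokens.shape
    num_steps = alphas_cumprod.shape[0]
    # Independent diffusion step per token (per agent, per timestep).
    steps = torch.randint(0, num_steps, (num_agents, T))
    a_bar = alphas_cumprod[steps].unsqueeze(-1)   # (num_agents, T, 1)
    noise = torch.randn_like(tokens)
    noisy = a_bar.sqrt() * tokens + (1.0 - a_bar).sqrt() * noise
    # The transformer is conditioned on `steps` and trained to undo `noise`.
    return noisy, steps, noise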

Motion Tokenization (VQ-VAE)

Each agent’s motion is decomposed into components (body shape β, joint angles Θ, canonical-to-root transform Tc→r, canonical delta ΔTcan), encoded into discrete latent tokens by the VQ-VAE encoder and codebook, and decoded back to motion.

Diffusion Forcing (DFoT)

Each token receives an independently sampled noise level (ε ~ U[0.1, 1]); the Diffusion Forcing Transformer iteratively denoises the full multi-agent sequence over N steps. Each per-agent token combines the agent’s latent code z, its canonical delta ΔTcan, and its relative transform to its partner (Ts→p), e.g. mA = [zA, ΔTcan, TA→B] and mB = [zB, ΔTcan, TB→A].

Ultra-Long Motion Generation

One of MAGNet’s most impressive capabilities is generating ultra-long motion sequences—spanning hundreds of timesteps while maintaining coherent coordination between agents. Where most motion models degrade after a few seconds, MAGNet autoregressively extends generation far beyond available ground truth, sustaining realistic dynamics and timing throughout.

Ultra-Long Waltz Generation. The pink mesh shows the context (4 frames, 1 second), the grey mesh is the ground truth, and the red and blue meshes are the generated motion samples. The ground truth motion ends at 44 seconds (frozen grey meshes), and MAGNet continues generating coordinated motion well beyond the available ground truth.
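Conceptually, ultra-long generation is sliding-window autoregression: denoise a chunk of future tokens, append it, keep the most recent tokens as clean context, and repeat. A schematic sketch with a hypothetical model.denoise interface (names and arguments are illustrative, not the actual API):

def generate_ultra_long(model, context_tokens, total_steps, chunk_len):
    """context_tokens: list of multi-agent tokens, one entry per observed timestep.
    Repeatedly denoise a fresh chunk conditioned on the most recent clean tokens."""
    window = len(context_tokens)                      # how much clean context to carry forward
    generated = list(context_tokens)
    while len(generated) < total_steps:
        context = generated[-window:]                 # most recent tokens become conditioning
        new_chunk = model.denoise(context=context,    # hypothetical interface
                                  num_future=chunk_len)
        generated.extend(new_chunk)
    return generated[:total_steps]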

Temporal Denoising Schedules

A key insight of Diffusion Forcing is that different sampling strategies simply correspond to different noise schedules applied across the token sequence. By choosing which tokens start clean (conditioned) versus noisy (to be generated), MAGNet handles all its capabilities within a single framework. The animated Gaussian textures below show remaining noise intensity at each denoising step.
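A hedged sketch of how such per-task schedules might be laid out over an (agent, timestep) grid, where 0 marks clean conditioning tokens and 1 marks tokens to be generated (task names and arguments are illustrative, not the paper's exact configuration):

import numpy as np

def initial_noise_grid(num_agents, T, task, past_len=4, known_agents=()):
    """Return an (num_agents, T) grid of starting noise levels:
    0.0 = token is given (clean conditioning), 1.0 = token must be generated."""
    grid = np.ones((num_agents, T))
    if task == "joint_future_prediction":
        grid[:, :past_len] = 0.0            # everyone's past is observed
    elif task == "partner_inpainting":
        grid[list(known_agents), :] = 0.0   # known agents' full trajectories are observed
    elif task == "motion_inbetweening":
        grid[:, :past_len] = 0.0            # observed start...
        grid[:, -past_len:] = 0.0           # ...and target keyframes at the end
    return grid

# Example: inpaint agent 1 given agent 0's full motion in a 2-agent, 32-step sequence.
schedule = initial_noise_grid(num_agents=2, T=32, task="partner_inpainting", known_agents=(0,))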

Real-World Applications

The implications are genuinely exciting across multiple domains.

The Future: Toward Truly Social AI

Looking forward, we see transformative possibilities.

Our vision was ambitious but achievable: a generative model that captures the full richness of human social coordination—from intimate two-person interactions to complex group dynamics—in a unified framework that’s fast enough for real-time use and flexible enough for diverse applications. MAGNet represents a major leap toward that goal.

Citation

Maluleke, V. H.*, Horiuchi, K.*, Wilken, L., Ng, E., Malik, J., & Kanazawa, A. (2024). Diffusion Forcing for Multi-Agent Interaction Sequence Modeling. arXiv preprint arXiv:2512.17900.
