
It Takes Two: How We Taught AI to Predict Dance Through Social Interaction

Vongani H. Maluleke

What is the counterpart in video understanding of next-word prediction—the bedrock pre-training task of large language models? In NLP, a "word" has a clear embedding. But what is a "word" in human motion? And can we predict the next one?

In our paper "Synergy and Synchrony in Couple Dances", my collaborators and I at UC Berkeley tackle this question head-on. We study couple dance as a testbed for understanding how social interaction shapes human behavior. The core question: does knowing your partner's movements help predict your own future motion? The answer is a resounding yes—and the implications extend far beyond the dance floor.

To what extent does Bob's behavior affect Alice's behavior? We study this question in couple dance, an example of full-body dyadic physical social interaction. We predict the full-body motion of a dancer, Alice (yellow), given her own past motion and the motion of her partner, Bob (blue).

The Problem: Next-Token Prediction for Human Motion

Unlike next-token prediction in language, the dynamics of a person's state are conditioned on much more than their own past history. In social situations, we also have to consider interactions with other people. When one partner raises their arm during a couple dance, the other partner is likely to twirl.

The Challenge

Most motion prediction treats people as isolated agents. You analyze someone's past trajectory and extrapolate. This works for solo activities, but couple dancing involves continuous physical and social feedback between partners.

Our Approach

We predict a dancer's future motion conditioned on both their own past and their partner's motion. We show that this social conditioning dramatically improves prediction quality—producing surprisingly compelling dance synthesis.

We frame this through two complementary concepts from motor control and social psychology:

Synergy makes single-person prediction possible. Synchrony makes partner-conditioned prediction dramatically better. The coupling between dancers enforces constraints on motion that are unpredictable from either person alone.

The Dataset: Swing Dancing in the Wild

To study this properly, we needed data of real couples dancing—not the sanitized, motion-capture-suit-wearing kind from a lab, but actual professional dancers performing in real environments. We built a dataset from YouTube videos of the International Lindy Hop Championships.

Using SLAHMR, a state-of-the-art 4D human motion reconstruction method, we extracted 3D body poses represented as SMPL parameters—translation, global orientation, and body pose—for both dancers in each video. This is the first couple-dance dataset that combines 3D mesh representations, in-the-wild video, and future motion prediction in a single benchmark.
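As an illustrative sketch of the per-frame state this reconstruction yields, here is how one frame's SMPL parameters could be packed into a single vector. The function name and the flat packing are assumptions for illustration, not the paper's actual data schema:

```python
import numpy as np

# Each dancer at each frame is described by SMPL parameters recovered by
# SLAHMR (shapes follow the standard SMPL convention):
#   - translation:   root position in world coordinates, shape (3,)
#   - global_orient: root rotation as an axis-angle vector, shape (3,)
#   - body_pose:     23 body joints x 3 axis-angle values, shape (69,)

def pack_frame(translation, global_orient, body_pose):
    """Concatenate one frame's SMPL parameters into a single vector."""
    assert translation.shape == (3,)
    assert global_orient.shape == (3,)
    assert body_pose.shape == (69,)
    return np.concatenate([translation, global_orient, body_pose])

frame = pack_frame(np.zeros(3), np.zeros(3), np.zeros(69))
# frame is a 75-dimensional vector describing one dancer at one timestep
```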

Why Swing? Because it features tight physical coupling. The lead and follow maintain hand contact throughout most of the dance, creating exactly the kind of continuous social feedback loop we wanted to study. When the lead signals a move through subtle weight shifts and hand pressure, the follow interprets and responds in real time.

Our Approach: Discrete Motion Tokens + Transformer

Our approach draws an analogy to Labanotation—a notation system invented for capturing dance motion like a musical score. Labanotation breaks dance into atomic motions with discrete notations that are easier to analyze than continuous motion, because there is a finite dictionary of them. We do the same thing computationally.

Pipeline: in-the-wild video (YouTube Swing) → SLAHMR 4D reconstruction → motion VQ-VAE (three disentangled codebooks) → transformer autoregressive prediction.

Stage 1: Learning a Motion Dictionary with VQ-VAE

We train a Vector Quantized Variational Autoencoder (VQ-VAE) to learn a discrete vocabulary of atomic motion elements. The critical design choice: we disentangle motion into three separate codebooks rather than compressing everything into one:

Body pose: the configuration of the body's joints
Global orientation: the direction the dancer faces
Translation: the dancer's position on the floor

Why disentangle? While some poses naturally correlate with orientation (e.g., walking forward), most dance poses can occur at any orientation or position on the floor. A unified codebook entangles these elements, limiting its ability to capture the full diversity of dance motion. Our ablation experiments confirm that separate codebooks significantly outperform a single unified one.
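A minimal sketch of what quantization against three disentangled codebooks looks like. The codebook sizes, latent dimensions, and the `quantize` helper are all illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# One codebook per motion factor, so a pose code can combine freely with
# any orientation or translation code (sizes and dims are made up).
CODEBOOKS = {
    "pose":        rng.normal(size=(256, 64)),  # K x D code vectors
    "orientation": rng.normal(size=(64, 16)),
    "translation": rng.normal(size=(64, 16)),
}

def quantize(latent, codebook):
    """Index of the nearest code vector (training would use a
    straight-through estimator to pass gradients past the argmin)."""
    dists = np.linalg.norm(codebook - latent, axis=1)
    return int(np.argmin(dists))

# One latent per factor -> one discrete token per factor per timestep.
tokens = {name: quantize(rng.normal(size=cb.shape[1]), cb)
          for name, cb in CODEBOOKS.items()}
```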

Stage 2: Autoregressive Transformer Prediction

With motion represented as discrete tokens, we train a decoder-only transformer to autoregressively predict the next motion token. The transformer uses causal masking so that each dancer can only attend to past motion—no peeking into the future. The output is a probability distribution over codebook indices, enabling non-deterministic prediction: we can sample multiple plausible futures from the same past context.
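The sampling loop described above can be sketched as follows. Here `next_token_logits` is a random stand-in for the causal transformer's output head, and the vocabulary size is an assumption; the point is that sampling from the categorical output yields a different plausible future on each call:

```python
import numpy as np

rng = np.random.default_rng(1)

VOCAB = 256  # codebook size (illustrative)

def next_token_logits(history):
    # Placeholder for a causal transformer forward pass over the history.
    return rng.normal(size=VOCAB)

def sample_future(history, n_steps, temperature=1.0):
    """Autoregressively sample one plausible future token sequence."""
    tokens = list(history)
    for _ in range(n_steps):
        logits = next_token_logits(tokens) / temperature
        probs = np.exp(logits - logits.max())  # stable softmax
        probs /= probs.sum()
        tokens.append(int(rng.choice(VOCAB, p=probs)))
    return tokens[len(history):]

future_a = sample_future([3, 17, 42], n_steps=8)
```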

We define two prediction tasks to isolate the effect of social information:

Unary Prediction

Predict Alice's future motion conditioned only on Alice's past. At each timestep, the model sees Alice's ground-truth motion up to time t, plus its own previous predictions.

Dyadic Prediction

Predict Alice's future motion conditioned on both Alice's past and Bob's full ground truth motion. The model learns to leverage the coupling between partners.

In the dyadic case, the transformer receives interleaved tokens from both Alice and Bob, with person-specific and parameter-specific encodings added to each token. This allows the model to learn cross-person temporal correlations—the synchrony between partners.
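A toy sketch of the interleaving step, with made-up field names; in the real model, each token would additionally carry parameter-specific encodings and be embedded before entering the transformer:

```python
# Interleave tokens from both dancers in time order, tagging each token
# with a person identifier so the transformer can tell Alice from Bob.

def interleave_dyadic(alice_tokens, bob_tokens):
    """Merge two token streams into one sequence of (person, token) pairs."""
    assert len(alice_tokens) == len(bob_tokens)
    sequence = []
    for a, b in zip(alice_tokens, bob_tokens):
        sequence.append(("alice", a))
        sequence.append(("bob", b))
    return sequence

seq = interleave_dyadic([1, 2, 3], [7, 8, 9])
# seq alternates Alice/Bob tokens, so causal attention over seq lets the
# model condition Alice's next token on Bob's motion up to the same time.
```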

Results: Social Information Changes Everything

We evaluate using metrics that capture both the realism of individual motion and the quality of partner coordination:

The dyadic model produces motion that is twice as realistic as the unary baseline, measured by Fréchet Inception Distance on the motion feature space. The partner synchrony metric—which measures how well the predicted motion of Alice coordinates with Bob's actual motion—shows an even more dramatic improvement.
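To make the realism metric concrete, here is a simplified Fréchet distance between Gaussians fit to motion features. It assumes diagonal covariances for readability (the standard FID uses full covariance matrices and a matrix square root), and the features here are synthetic:

```python
import numpy as np

def frechet_diag(feats_a, feats_b):
    """Fréchet distance between diagonal Gaussians fit to two feature sets."""
    mu1, mu2 = feats_a.mean(0), feats_b.mean(0)
    var1, var2 = feats_a.var(0), feats_b.var(0)
    return float(((mu1 - mu2) ** 2).sum()
                 + (var1 + var2 - 2 * np.sqrt(var1 * var2)).sum())

rng = np.random.default_rng(2)
real = rng.normal(0.0, 1.0, size=(500, 8))   # "real" motion features
pred = rng.normal(0.5, 1.0, size=(500, 8))   # mean-shifted "predictions"
# frechet_diag(real, real) is ~0; frechet_diag(real, pred) grows with the
# mismatch in feature statistics, so lower is better.
```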

1. Social conditioning halves prediction error. Across all metrics, knowing Bob's motion roughly doubles the quality of Alice's predicted motion. This isn't a marginal improvement—it's a fundamental shift in what the model can capture.

2. Diversity reflects real constraints. The unary model generates more varied movements, but this isn't a virtue. In couple dance, you don't want maximum freedom—you want movements that respect your partner's constraints. The dyadic model's lower diversity reflects the reality that social interaction narrows the space of plausible next moves.

3. Disentangled codebooks matter. Our ablation shows that separating pose, orientation, and translation into independent codebooks outperforms a single unified codebook. These aspects of motion have different structure and predictability—entangling them wastes representational capacity.

Key Design Choices That Mattered

SMPL body model representation: Unlike methods that use 3D joint locations, our SMPL-based representation is invariant to changes in body shape, scale, and camera pose. This is computationally crucial—it lets us ignore irrelevant pixels and predict behavior directly in a canonical body space.

Discrete classification over continuous regression: By predicting distributions over codebook indices rather than regressing continuous parameters, we avoid the mean-pose collapse that plagues regression-based motion prediction. The model can express genuine uncertainty about the future by assigning probability mass to multiple plausible next tokens.
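A toy illustration of why regression collapses to the mean while a discrete head does not, using a synthetic bimodal target:

```python
import numpy as np

# If the true next pose is either "left" (-1) or "right" (+1) with equal
# probability, L2 regression is minimized by predicting their mean (0),
# a pose that never actually occurs.
targets = np.array([-1.0, 1.0] * 500)   # bimodal continuous targets
regression_pred = targets.mean()        # the L2-optimal point prediction

# A categorical head over discrete tokens instead keeps both modes, and
# sampling commits to one plausible pose or the other.
tokens = np.where(targets < 0, 0, 1)            # discretize into 2 tokens
probs = np.bincount(tokens) / len(tokens)       # empirical token distribution
```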

In-the-wild data: Using real YouTube videos rather than lab motion capture means our model learns from the full complexity of actual dancing—different styles, different body types, different environments. Previous couple-interaction datasets relied on controlled lab settings with MoCap systems or depth sensors. Our approach can in principle scale to any collection of internet videos.

Why This Matters Beyond Dance

This work is fundamentally about a question that extends well beyond dance: to what extent does social interaction influence behavior? We use couple dance as a controlled testbed because it offers tight physical coupling and clear measurable outcomes, but the principles apply broadly.

Robotics & HRI

Robots collaborating with humans need to predict human motion based on the robot's own actions—the same dyadic conditioning we study here.

Autonomous Agents

Autonomous vehicles, drone swarms, and multi-agent systems all involve agents that continuously influence each other's trajectories.

Animation & VFX

Generating realistic interaction scenes requires models that understand the coupling between characters, not just individual motion.

Next-Token Prediction

Our work extends the LLM paradigm of next-token prediction to human motion, suggesting that "tokens" in video understanding might be people and their states.

Limitations and Future Directions

We're transparent about current limitations. The 4D motion reconstruction from SLAHMR isn't perfect—there are errors in relative positioning, potential mesh interpenetration, and imprecise contact estimation between partners. As reconstruction methods improve, these issues should diminish, and our approach will directly benefit.

Our focus on Swing dancing means the model is specialized for that genre. Generalizing to other forms of physical interaction—sports, martial arts, collaborative work tasks—would require additional data and potentially architectural modifications. However, our pipeline is designed to work with any internet video, making scaling straightforward in principle.

There's also the question of bidirectional prediction. Currently, we predict Alice's motion given Bob's ground truth. Jointly predicting both dancers' futures from only their shared past remains an open and exciting challenge.

The Bigger Picture

What I find most compelling about this work is how it challenges a default assumption in motion prediction: that you should model individuals first and add interaction as an afterthought. Our results suggest the opposite. For socially interactive scenarios, the coupling between agents is primary. You can't accurately predict either person by treating them as independent.

This maps onto broader questions in AI about modeling multi-agent systems. Whether you're predicting traffic patterns, simulating crowd dynamics, or building collaborative robots, the lesson is the same: social context isn't just helpful additional information—it's often the most important signal.

The dance floor, it turns out, is an excellent laboratory for understanding something fundamental about human behavior: we don't just move near each other, we move with each other. And if we want AI systems that interact naturally with humans, they'll need to learn the same lesson.

Citation

Maluleke, V. H., Müller, L., Rajasegaran, J., Pavlakos, G., Ginosar, S., Kanazawa, A., & Malik, J. (2024). Synergy and Synchrony in Couple Dances. arXiv preprint arXiv:2409.04440.
