It Takes Two: How We Taught AI to Predict Dance Through Social Interaction
What is the counterpart in video understanding of next-word prediction—the bedrock pre-training task of large language models? In NLP, a "word" is a well-defined discrete unit with a clear embedding. But what is a "word" in human motion? And can we predict the next one?
In our paper "Synergy and Synchrony in Couple Dances", my collaborators and I at UC Berkeley tackle this question head-on. We study couple dance as a testbed for understanding how social interaction shapes human behavior. The core question: does knowing your partner's movements help predict your own future motion? The answer is a resounding yes—and the implications extend far beyond the dance floor.
The Problem: Next-Token Prediction for Human Motion
Unlike next-token prediction in language, the dynamics of a person's state are conditioned on much more than their own past history. In social situations, we also have to consider interactions with other people. When one partner raises their arm during a couple dance, the other partner is likely to twirl.
Most motion prediction treats people as isolated agents. You analyze someone's past trajectory and extrapolate. This works for solo activities, but couple dancing involves continuous physical and social feedback between partners.
We predict a dancer's future motion conditioned on both their own past and their partner's motion. We show that this social conditioning dramatically improves prediction quality—producing surprisingly compelling dance synthesis.
We frame this through two complementary concepts from motor control and social psychology:
- Synergy — the reduced dimensionality of human motion. Our joints work together in coordinated ways, creating a lower-dimensional manifold of possible movements than the raw degrees of freedom would suggest.
- Synchrony — the dynamic, reciprocal adaptation between interacting partners. Dancers continuously modulate the temporal structure of their behavior in response to each other.
Synergy makes single-person prediction possible. Synchrony makes partner-conditioned prediction dramatically better. The coupling between dancers imposes constraints on motion that cannot be recovered from either person's history alone.
The Dataset: Swing Dancing in the Wild
To study this properly, we needed data of real couples dancing—not the sanitized, motion-capture-suit-wearing kind from a lab, but actual professional dancers performing in real environments. We built a dataset from YouTube videos of the International Lindy Hop Championships.
Using SLAHMR, a state-of-the-art 4D human motion reconstruction method, we extracted 3D body poses represented as SMPL parameters—translation, global orientation, and body pose—for both dancers in each video. This is the first couple-dance dataset that combines 3D mesh representations, in-the-wild video, and future motion prediction in a single benchmark.
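To make the representation concrete, here is a minimal sketch of what one frame of this data might look like; the field names and clip layout are illustrative, not the paper's released format.

```python
import numpy as np

def make_frame():
    """One dancer's state at one timestep, as SMPL parameters (axis-angle)."""
    return {
        "translation":   np.zeros(3),        # root position in world space
        "global_orient": np.zeros(3),        # root rotation, axis-angle
        "body_pose":     np.zeros((23, 3)),  # 23 SMPL joint rotations, axis-angle
    }

# A couple-dance clip is two time-aligned sequences of such frames:
clip = {
    "lead":   [make_frame() for _ in range(120)],  # e.g. 4 seconds at 30 fps
    "follow": [make_frame() for _ in range(120)],
}
```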
Why Swing? Because it features tight physical coupling. The lead and follow maintain hand contact throughout most of the dance, creating exactly the kind of continuous social feedback loop we wanted to study. When the lead signals a move through subtle weight shifts and hand pressure, the follow interprets and responds in real time.
Our Approach: Discrete Motion Tokens + Transformer
Our approach draws an analogy to Labanotation—a notation system invented for capturing dance motion like a musical score. Labanotation breaks dance into atomic motions drawn from a finite dictionary of discrete symbols, which are far easier to analyze than continuous motion. We do the same thing computationally.
Stage 1: Learning a Motion Dictionary with VQ-VAE
We train a Vector Quantized Variational Autoencoder (VQ-VAE) to learn a discrete vocabulary of atomic motion elements. The critical design choice: we disentangle motion into three separate codebooks rather than compressing everything into one:
- Pose (Θ) — the configuration of the body (joint angles across 23 SMPL joints)
- Orientation (Φ) — which direction the person is facing in the world
- Translation (Γ) — where the person is in 3D space
Why disentangle? While some poses naturally correlate with orientation (e.g., walking forward), most dance poses can occur at any orientation or position on the floor. A unified codebook entangles these elements, limiting its ability to capture the full diversity of dance motion. Our ablation experiments confirm that separate codebooks significantly outperform a single unified one.
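Below is a minimal sketch of that quantization step, assuming a standard VQ-VAE nearest-neighbor lookup (training details such as the straight-through estimator and commitment loss are omitted); the class name, codebook size, and dimensions are illustrative rather than the paper's exact implementation.

```python
import torch

class MultiCodebookQuantizer(torch.nn.Module):
    """One independent codebook per disentangled motion factor."""
    def __init__(self, codebook_size=256, dim=64):
        super().__init__()
        self.books = torch.nn.ModuleDict({
            "pose":        torch.nn.Embedding(codebook_size, dim),  # Θ
            "orientation": torch.nn.Embedding(codebook_size, dim),  # Φ
            "translation": torch.nn.Embedding(codebook_size, dim),  # Γ
        })

    def quantize(self, z, factor):
        """Snap encoder outputs z (B, T, dim) to nearest entries of one codebook."""
        book = self.books[factor].weight                               # (K, dim)
        dists = torch.cdist(z, book.unsqueeze(0).expand(z.size(0), -1, -1))
        idx = dists.argmin(dim=-1)                                     # (B, T) token ids
        return self.books[factor](idx), idx                            # quantized z, indices
```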
Stage 2: Autoregressive Transformer Prediction
With motion represented as discrete tokens, we train a decoder-only transformer to autoregressively predict the next motion token. The transformer uses causal masking so that each dancer can only attend to past motion—no peeking into the future. The output is a probability distribution over codebook indices, enabling non-deterministic prediction: we can sample multiple plausible futures from the same past context.
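Concretely, sampling a future then looks something like the sketch below, assuming `model` maps a token sequence to next-token logits over the codebook; the function and its signature are illustrative.

```python
import torch

@torch.no_grad()
def sample_future(model, past_tokens, n_future, temperature=1.0):
    """past_tokens: (1, T) codebook indices; returns (1, T + n_future)."""
    tokens = past_tokens
    for _ in range(n_future):
        logits = model(tokens)[:, -1, :]               # logits for the next token only
        probs = torch.softmax(logits / temperature, dim=-1)
        nxt = torch.multinomial(probs, num_samples=1)  # sample, don't argmax
        tokens = torch.cat([tokens, nxt], dim=1)       # feed prediction back in
    return tokens
```

Because each step samples from a distribution, running this on the same past context multiple times produces different but plausible continuations.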
We define two prediction tasks to isolate the effect of social information:
- Unary (solo): predict Alice's future motion conditioned only on Alice's own past. At each timestep, the model sees Alice's ground truth up to time t, plus its own previous predictions.
- Dyadic (social): predict Alice's future motion conditioned on both Alice's past and Bob's full ground-truth motion. The model learns to leverage the coupling between partners.
In the dyadic case, the transformer receives interleaved tokens from both Alice and Bob, with person-specific and parameter-specific encodings added to each token. This allows the model to learn cross-person temporal correlations—the synchrony between partners.
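As a rough illustration, the dyadic input might be assembled as below, assuming learned additive embeddings; the names, shapes, and exact interleaving order are our assumptions, not the paper's code.

```python
import torch

dim = 64
token_embed  = torch.nn.Embedding(256, dim)  # codebook-index embedding
person_embed = torch.nn.Embedding(2, dim)    # 0 = Alice, 1 = Bob
param_embed  = torch.nn.Embedding(3, dim)    # 0 = pose, 1 = orientation, 2 = translation

def embed(idx, person, param):
    """Token embedding plus person-specific and parameter-specific encodings."""
    return (token_embed(idx)
            + person_embed(torch.tensor(person))
            + param_embed(torch.tensor(param)))

def interleave(alice_idx, bob_idx, param):
    """alice_idx, bob_idx: (T,) long tensors of codebook indices for one stream."""
    seq = []
    for a, b in zip(alice_idx, bob_idx):   # alternate persons at each timestep
        seq.append(embed(a, person=0, param=param))
        seq.append(embed(b, person=1, param=param))
    return torch.stack(seq)                # (2T, dim) input to the transformer
```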
Results: Social Information Changes Everything
We evaluate using metrics that capture both the realism of individual motion and the quality of partner coordination:
The dyadic model roughly halves the Fréchet Inception Distance (FID), computed on motion features, relative to the unary baseline: its predicted motion is about twice as realistic by this measure. The partner synchrony metric, which measures how well the predicted motion of Alice coordinates with Bob's actual motion, shows an even more dramatic improvement.
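For reference, the Fréchet distance compares the Gaussian statistics of two feature sets; a minimal implementation is sketched below (the motion feature extractor itself is assumed and not shown).

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real, feats_pred):
    """Each input: (N, d) array of motion features from real / predicted clips."""
    mu1, mu2 = feats_real.mean(0), feats_pred.mean(0)
    s1 = np.cov(feats_real, rowvar=False)
    s2 = np.cov(feats_pred, rowvar=False)
    covmean = sqrtm(s1 @ s2)
    if np.iscomplexobj(covmean):   # sqrtm can return tiny imaginary noise
        covmean = covmean.real
    # ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 * (S1 S2)^(1/2))
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(s1 + s2 - 2 * covmean))
```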
Key Design Choices That Mattered
SMPL body model representation: Unlike methods that use 3D joint locations, our SMPL-based representation is invariant to changes in body shape, scale, and camera pose. This invariance is crucial: it lets us ignore irrelevant pixels and predict behavior directly in a canonical body space.
Discrete classification over continuous regression: By predicting distributions over codebook indices rather than regressing continuous parameters, we avoid the mean-pose collapse that plagues regression-based motion prediction. The model can express genuine uncertainty about the future by assigning probability mass to multiple plausible next tokens.
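A toy contrast between the two objectives on a single prediction step (sizes are illustrative): cross-entropy over codebook indices can keep probability mass on several plausible futures, whereas mean-squared-error regression on a multimodal target drives predictions toward the average pose.

```python
import torch
import torch.nn.functional as F

logits = torch.randn(8, 256)                   # (batch, codebook size) from the transformer
target_idx = torch.randint(0, 256, (8,))       # ground-truth next-token indices
ce_loss = F.cross_entropy(logits, target_idx)  # discrete classification over tokens

pred_pose   = torch.randn(8, 23 * 3)           # hypothetical continuous regression head
target_pose = torch.randn(8, 23 * 3)           # 23 joints x 3 axis-angle values
mse_loss = F.mse_loss(pred_pose, target_pose)  # with multimodal futures, this averages modes
```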
In-the-wild data: Using real YouTube videos rather than lab motion capture means our model learns from the full complexity of actual dancing—different styles, different body types, different environments. Previous couple-interaction datasets relied on controlled lab settings with MoCap systems or depth sensors. Our approach can in principle scale to any collection of internet videos.
Why This Matters Beyond Dance
This work is fundamentally about a question that extends well beyond dance: to what extent does social interaction influence behavior? We use couple dance as a controlled testbed because it offers tight physical coupling and clear measurable outcomes, but the principles apply broadly.
Robotics & HRI
Robots collaborating with humans need to predict human motion based on the robot's own actions—the same dyadic conditioning we study here.
Autonomous Agents
Autonomous vehicles, drone swarms, and multi-agent systems all involve agents that continuously influence each other's trajectories.
Animation & VFX
Generating realistic interaction scenes requires models that understand the coupling between characters, not just individual motion.
Next-Token Prediction
Our work extends the LLM paradigm of next-token prediction to human motion, suggesting that "tokens" in video understanding might be people and their states.
Limitations and Future Directions
We're transparent about current limitations. The 4D motion reconstruction from SLAHMR isn't perfect—there are errors in relative positioning, potential mesh interpenetration, and imprecise contact estimation between partners. As reconstruction methods improve, these issues should diminish, and our approach will directly benefit.
Our focus on Swing dancing means the model is specialized for that genre. Generalizing to other forms of physical interaction—sports, martial arts, collaborative work tasks—would require additional data and potentially architectural modifications. However, our pipeline is designed to work with any internet video, making scaling straightforward in principle.
There's also the question of bidirectional prediction. Currently, we predict Alice's motion given Bob's ground truth. Jointly predicting both dancers' futures from only their shared past remains an open and exciting challenge.
The Bigger Picture
What I find most compelling about this work is how it challenges a default assumption in motion prediction: that you should model individuals first and add interaction as an afterthought. Our results suggest the opposite. For socially interactive scenarios, the coupling between agents is primary. You can't accurately predict either person by treating them as independent.
This maps onto broader questions in AI about modeling multi-agent systems. Whether you're predicting traffic patterns, simulating crowd dynamics, or building collaborative robots, the lesson is the same: social context isn't just helpful additional information—it's often the most important signal.
The dance floor, it turns out, is an excellent laboratory for understanding something fundamental about human behavior: we don't just move near each other, we move with each other. And if we want AI systems that interact naturally with humans, they'll need to learn the same lesson.
Citation
Maluleke, V. H., Müller, L., Rajasegaran, J., Pavlakos, G., Ginosar, S., Kanazawa, A., & Malik, J. (2024). Synergy and Synchrony in Couple Dances. arXiv preprint arXiv:2409.04440.