Body2Hands: Learning to Infer 3D Hands
from Conversational Gesture Body Dynamics

Evonne Ng
UC Berkeley
Shiry Ginosar
UC Berkeley
Trevor Darrell
UC Berkeley
Hanbyul Joo
Facebook AI Research

Hand gestures are vital in conveying non-verbal information. (a) Our work considers how a speaker’s body alone can facilitate the inference of their hand gestures. (b) From a temporal stack of the speaker’s 3D body poses (top), we predict corresponding hands (bottom). (c) Body2Hands outputs a sequence of 3D hand poses in the form of an articulated 3D hand model.


We propose a novel learned deep prior of body motion for 3D hand shape synthesis and estimation in the domain of conversational gestures. Our model builds upon the insight that body motion and hand gestures are strongly correlated in non-verbal communication settings. We formulate the learning of this prior as a prediction task of 3D hand shape over time given body motion input alone. Trained with 3D pose estimations obtained from a large-scale dataset of internet videos, our hand prediction model produces convincing 3D hand gestures given only the 3D motion of the speaker's arms as input. We demonstrate the efficacy of our method on hand gesture synthesis from body motion input, and as a strong body prior for single-view image-based 3D hand pose estimation. We demonstrate that our method outperforms previous state-of-the-art approaches and can generalize beyond the monologue-based training data to multi-person conversations.



(1) The hand pose of a speaker is strongly correlated with their body dynamics.

We demonstrate the utility of leveraging surprisingly strong correlations between a speaker's body and hand poses.

(2) Body pose thus serves as a strong prior for a speaker's hand pose.

For each hand pose example query, we find the 10 closest predicted hand poses from in-the-wild videos and visualize their corresponding body poses (darker means closer match). We re-embody the query hands on the corresponding body, shown in the darkest shade. Body2Hands captures distinct correlations between the body and hands for common communicative gestures.
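
The retrieval described in this caption can be sketched as a nearest-neighbour lookup. This is only an illustration: the function name `closest_hand_poses`, the flattened pose layout, and the use of Euclidean distance are assumptions, since this page does not state the distance measure or pose representation used.

```python
import numpy as np

def closest_hand_poses(query, predicted, k=10):
    """Return indices and distances of the k predicted poses nearest to the query.

    query:     (D,) flattened hand pose vector (hypothetical layout).
    predicted: (N, D) flattened predicted hand poses.
    Euclidean distance is an assumed metric, not one confirmed by the source.
    """
    dists = np.linalg.norm(predicted - query, axis=1)
    order = np.argsort(dists, kind="stable")[:k]
    return order, dists[order]

# Toy data: 100 random "hand poses" of dimension 6; the query copies pose 42.
rng = np.random.default_rng(0)
predicted = rng.normal(size=(100, 6))
query = predicted[42].copy()
idx, d = closest_hand_poses(query, predicted, k=10)
```

The returned distances are sorted in ascending order, so they could drive the shading in a visualization like the one described (darker for smaller distance).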

(3) Our method outperforms current SoTA in hand pose estimation, overcoming challenging views.

While current image-based SoTA methods often fail on obstructed views of the hands, our prior based on body motion provides an additional cue for hand pose estimation to overcome challenges caused by fundamental depth ambiguity, frequent self-occlusion, and severe motion blur. Furthermore, we consider the temporal aspect of the input, allowing our method to produce smoother, more realistic hand sequences.

Our predicted 3D hand poses compared against a SoTA image-based method, MTC [Xiang ICCV 2019]. We show each prediction from a novel view below its respective hand. Row 1: view of the speaker and magnified hands (not used by our method). Row 2: results from our method, using the body only as input. Row 3: MTC [Xiang ICCV 2019] image-based results. We show results from a person not seen in the training set (right) to demonstrate that our model generalizes across individuals.

Analysis of typical errors. Error over time is plotted on the left (lower is better). Frames for notable scenarios are shown on the right. MTC fails under (a) naturally arising occlusions or (c) motion blur or low resolution on the hands. With clear views of the hands, (b) and (d), MTC performs slightly better, though the margin separating ours from MTC is smaller than in cases where MTC fails. Overall, our method outperforms the other baselines whether or not it takes an image observation as input.
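
An error-over-time curve of this kind can be computed frame by frame. The sketch below assumes the error is the mean Euclidean distance between predicted and ground-truth 3D hand joints; the page does not name the metric, so both that choice and the `(T, J, 3)` array layout (with a hypothetical 21 joints) are assumptions.

```python
import numpy as np

def per_frame_error(pred, gt):
    """Mean per-joint Euclidean distance for each frame.

    pred, gt: (T, J, 3) arrays of 3D joint positions (assumed layout).
    Returns a (T,) array, one error value per frame.
    """
    return np.linalg.norm(pred - gt, axis=-1).mean(axis=-1)

# Toy sequence: 4 frames, 21 joints; frame 2 is offset by 1 unit along x.
gt = np.zeros((4, 21, 3))
pred = gt.copy()
pred[2, :, 0] = 1.0
err = per_frame_error(pred, gt)
```

Plotting `err` against the frame index for each method would give curves comparable in form to the one described, with spikes marking frames where a method fails.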


To learn the novel deep body prior in a data-driven way, we formulate a predictive task: given the body poses of a speaker, the goal is to predict their corresponding hand poses. Our method is trained on in-the-wild 3D motion data.

Our network takes a 3D body pose sequence as input. The body pose encoder learns inter-joint relationships, while the UNet summarizes the sequence into a body dynamics representation. Finally, the hand decoder learns a mapping from body dynamics to hands. The output is a predicted corresponding gestural hand pose sequence. L1 regression to the ground truth hand poses provides a training signal, while an adversarial discriminator ensures the predicted motion is realistic.
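
The combination of the two training signals can be sketched as a single generator objective. This is a minimal illustration under stated assumptions: the least-squares form of the adversarial term, the weight `adv_weight`, the function names, and the tensor shapes are all hypothetical and not taken from this page, which specifies only that L1 regression and an adversarial discriminator are used.

```python
import numpy as np

def l1_loss(pred, target):
    """Mean absolute error between predicted and ground-truth hand poses."""
    return np.abs(pred - target).mean()

def generator_adv_loss(disc_scores_fake):
    """Adversarial term for the predictor (assumed least-squares form):
    push discriminator scores on predicted sequences toward 1 ("real")."""
    return np.mean((disc_scores_fake - 1.0) ** 2)

def generator_objective(pred, target, disc_scores_fake, adv_weight=1.0):
    """L1 regression plus a weighted adversarial term (weight is an assumption)."""
    return l1_loss(pred, target) + adv_weight * generator_adv_loss(disc_scores_fake)

# Toy shapes: (batch, frames, hand pose dimensions), all hypothetical.
target = np.zeros((8, 64, 42))
pred = np.full_like(target, 0.5)
scores = np.full((8,), 0.5)  # stand-in discriminator outputs on predictions
loss = generator_objective(pred, target, scores, adv_weight=1.0)
```

In this toy case the L1 term is 0.5 and the adversarial term is 0.25. In practice the discriminator would be trained in alternation with the predictor, and the relative weight would be tuned.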


   title={Body2hands: Learning to infer 3d hands from conversational gesture body dynamics},
   author={Ng, Evonne and Ginosar, Shiry and Darrell, Trevor and Joo, Hanbyul},
   journal={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},