From Audio to Photoreal Embodiment:
Synthesizing Humans in Conversations

Meta Reality Labs Research, University of California, Berkeley

TL;DR: From the audio of a dyadic conversation, we generate corresponding photorealistic face, body, and hand gestures.

Generalizability: Avatars are driven by the authors' voices (not those of the actors the model was trained on).

Abstract

We present a framework for generating full-bodied photorealistic avatars that gesture according to the conversational dynamics of a dyadic interaction. Given speech audio, we output multiple possibilities of gestural motion for an individual, including face, body, and hands. The key to our method is combining the benefits of sample diversity from vector quantization with the high-frequency details obtained through diffusion to generate more dynamic, expressive motion. We visualize the generated motion using highly photorealistic avatars that can express crucial nuances in gestures (e.g., sneers and smirks). To facilitate this line of research, we introduce a first-of-its-kind multi-view conversational dataset that allows for photorealistic reconstruction. Experiments show our model generates appropriate and diverse gestures, outperforming both diffusion- and VQ-only methods. Furthermore, our perceptual evaluation highlights the importance of photorealism (vs. meshes) in accurately assessing subtle motion details in conversational gestures. Code and dataset will be publicly released.

Overview: From audio of a dyadic conversation, we generate photorealistic face, body, and hand gestures.

Method

Please follow along with the numbered videos for an overview of our approach.
Takeaway: For the body, our joint VQ + diffusion method achieves more dynamic, peaky motion than using either one alone.

1. We capture a novel, rich dataset of dyadic conversations that allows for photorealistic reconstructions. Dataset here.
2. Our motion model comprises three parts: a face motion model, a guide-pose predictor, and a body motion model.
3. Given audio and outputs from a pretrained lip regressor, we train a conditional diffusion model to output facial motion (see the face-model sketch after this list).
4. For the body, we take audio as input and autoregressively output VQ-ed guide poses at 1 fps.
5. We then pass both audio and guide poses into a diffusion model that in-fills high-frequency body motion at 30 fps (see the body-pipeline sketch after this list).
6. The generated face and body motion are then passed into our trained avatar renderer to produce a photorealistic avatar.
7. Voilà! The final result.
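
Face-model sketch: to make step 3 concrete, below is a minimal sketch of an audio-conditioned face diffusion model in PyTorch. FaceDenoiser, the feature dimensions, and the toy noise schedule are illustrative assumptions rather than the released architecture; the point is only to show how per-frame audio features and the pretrained lip-regressor output enter the denoiser as conditioning.

import torch
import torch.nn as nn

# Minimal sketch of an audio-conditioned face diffusion model (step 3).
# FaceDenoiser, the feature dimensions, and the noise schedule below are
# illustrative placeholders, not the released implementation.
class FaceDenoiser(nn.Module):
    def __init__(self, face_dim=256, audio_dim=128, lip_dim=64, hidden=512):
        super().__init__()
        self.time_embed = nn.Sequential(
            nn.Linear(1, hidden), nn.SiLU(), nn.Linear(hidden, hidden))
        self.net = nn.Sequential(
            nn.Linear(face_dim + audio_dim + lip_dim + hidden, hidden),
            nn.SiLU(),
            nn.Linear(hidden, face_dim))  # predicts the clean face motion

    def forward(self, noisy_face, t, audio_feats, lip_feats):
        # t: (B,) diffusion step; broadcast its embedding over the T frames.
        temb = self.time_embed(t.float().unsqueeze(-1))
        temb = temb.unsqueeze(1).expand(-1, noisy_face.shape[1], -1)
        # Conditioning: per-frame audio features + pretrained lip-regressor output.
        x = torch.cat([noisy_face, audio_feats, lip_feats, temb], dim=-1)
        return self.net(x)

# One simplified training step: noise the ground-truth face motion and
# regress the clean signal while keeping the conditioning intact.
model = FaceDenoiser()
face_gt = torch.randn(8, 30, 256)      # B x T x face-expression codes
audio_feats = torch.randn(8, 30, 128)  # aligned audio features
lip_feats = torch.randn(8, 30, 64)     # pretrained lip-regressor output
t = torch.randint(0, 1000, (8,))
alpha_bar = torch.rand(8, 1, 1)        # stand-in noise schedule
noisy = alpha_bar.sqrt() * face_gt + (1 - alpha_bar).sqrt() * torch.randn_like(face_gt)
loss = (model(noisy, t, audio_feats, lip_feats) - face_gt).pow(2).mean()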
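
Body-pipeline sketch: steps 4-5 can likewise be summarized as a two-stage procedure, sampling coarse guide poses from a VQ codebook at 1 fps and then in-filling 30 fps motion with a diffusion model conditioned on audio and those guide poses. Every name below (generate_body, the stand-in modules, the linear noise schedule, the shapes) is an assumption for illustration, not the released code.

import torch

@torch.no_grad()
def generate_body(audio_feats, guide_transformer, codebook, body_denoiser,
                  seconds, n_steps=100):
    # Stage 1: autoregressively sample one VQ guide-pose token per second.
    # Sampling (rather than argmax) is what gives diverse guide poses.
    tokens = []
    for _ in range(seconds):
        logits = guide_transformer(audio_feats, torch.tensor(tokens).long())
        probs = torch.softmax(logits[-1], dim=-1)
        tokens.append(torch.multinomial(probs, 1).item())
    guide_poses = codebook[torch.tensor(tokens)]        # seconds x pose_dim

    # Stage 2: diffusion in-fills high-frequency 30 fps motion conditioned on
    # audio + guide poses, starting from noise (toy linear schedule below).
    x = torch.randn(seconds * 30, guide_poses.shape[-1])
    for t in reversed(range(n_steps)):
        x0_hat = body_denoiser(x, t, audio_feats, guide_poses)
        a, a_prev = 1.0 - t / n_steps, 1.0 - max(t - 1, 0) / n_steps
        eps = (x - a ** 0.5 * x0_hat) / max((1.0 - a) ** 0.5, 1e-8)
        x = a_prev ** 0.5 * x0_hat + (1.0 - a_prev) ** 0.5 * eps
    return x                                            # (seconds * 30) x pose_dim

# Toy stand-ins so the sketch runs end to end (shapes are illustrative).
codebook = torch.randn(512, 104)                                   # 512 codes
guide_transformer = lambda audio, toks: torch.randn(len(toks) + 1, 512)
body_denoiser = lambda x, t, audio, guide: x                       # identity stand-in
motion = generate_body(torch.randn(4 * 30, 128), guide_transformer,
                       codebook, body_denoiser, seconds=4)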

Results

We highlight notable moments in each video below.
Takeaway: We generate peaky, diverse motion such as pointing, wrist flicking, and shrugging. Our VQ + diffusion method allows for higher diversity across samples.

1. Guide poses drive the diffusion model to incorporate a pointing movement.
2. The diffusion model generates subtle details that convey disgruntlement (an "ugh" face, a dismissive wrist flick, turning away).
3 + 4. Our model generates varied samples given the same audio input.

Comparisons

We highlight notable moments in each video below.
Takeaway: Our approach generates more dynamic and expressive motion than the prior SOTA, and more plausible motion than the KNN or Random baselines.

1. Wrist flicks to indicate listing; shrugged shoulders when telling a story.
2. Emphasizing arm motion on "they definitely happen for a reason"; pointing to make a statement.
3. General hand-sweeping patterns that follow the conversation and voice inflections.
4. Pointing when asking a question; head moving backwards when thinking; outward hand movement during the response.

Applications

Generalizability: Our method generalizes to arbitrary audio, such as audio taken from a TV clip.
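
As a rough illustration of feeding the model arbitrary audio, the snippet below loads a clip with torchaudio, resamples it to 16 kHz mono, and extracts frame-level features with a wav2vec 2.0 encoder. The file name is hypothetical and the choice of wav2vec 2.0 is an assumption for illustration; the released pipeline may use a different audio front end.

import torch
import torchaudio

# Load an arbitrary clip (e.g. extracted from a TV video) and prepare it.
wav, sr = torchaudio.load("tv_clip.wav")                  # hypothetical file
wav = wav.mean(dim=0, keepdim=True)                       # mono
wav = torchaudio.functional.resample(wav, sr, 16000)      # 16 kHz

# Extract frame-level audio features; wav2vec 2.0 is an illustrative choice,
# not necessarily the front end used by the released model.
bundle = torchaudio.pipelines.WAV2VEC2_BASE
model = bundle.get_model().eval()
with torch.no_grad():
    feats, _ = model(wav)                                 # 1 x frames x 768
# `feats` would then be aligned to the motion frame rate and passed to the
# face and body models in place of the training-actor audio features.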

Animation: The coarse guide poses can be used for downstream applications such as motion editing.
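
A minimal sketch of what such editing could look like, assuming the guide poses live in a simple (seconds x pose_dim) tensor: replace the coarse 1 fps guide pose at a chosen second, optionally blending it into its neighbours, then re-run the stage-2 diffusion in-fill so the 30 fps motion follows the edit. The function name and the blending heuristic are illustrative, not the released API.

import torch

def edit_guide_pose(guide_poses, second, new_pose, blend=0.5):
    # guide_poses: seconds x pose_dim coarse 1 fps keyframes.
    edited = guide_poses.clone()
    edited[second] = new_pose                      # e.g. swap in a pointing pose
    # Softly blend into the neighbouring keyframes so the edit is not abrupt
    # (a heuristic for this sketch, not part of the paper's method).
    for nb in (second - 1, second + 1):
        if 0 <= nb < edited.shape[0]:
            edited[nb] = blend * edited[nb] + (1 - blend) * new_pose
    return edited  # feed back into the 30 fps diffusion in-fill

# Toy usage with illustrative shapes.
guide_poses = torch.randn(10, 104)
edited = edit_guide_pose(guide_poses, second=4, new_pose=torch.randn(104))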

A/B perceptual evaluation: Ours vs. ground truth, or Ours vs. our strongest baseline LDA [Alexanderson et al. 2023].

Importance of Photorealism

Ours outperforms LDA [Alexanderson et al. 2023] in both the mesh and photoreal settings (top row). Interestingly, evaluators shifted from slightly to strongly preferring ours when the motion was visualized photorealistically. This trend continues when we compare our method against ground truth (bottom row): while ours performs competitively against ground truth in a mesh-based rendering, it lags in the photoreal domain, with 43% of evaluators strongly preferring ground truth over ours. Since meshes often obscure subtle motion details, it is difficult to accurately evaluate the nuances in gestures, leading evaluators to be more forgiving of "incorrect" motions. Our results suggest that photorealism is essential to accurately evaluating conversational motion.
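
The percentages above are fractions of evaluator responses falling into each preference bucket; a tiny sketch of that bookkeeping is below. The response labels and data are placeholders, not the study's raw responses.

from collections import Counter

def preference_breakdown(responses):
    # Percentage of evaluators per bucket (e.g. "strongly prefer ours",
    # "slightly prefer ours", "slightly prefer other", "strongly prefer other").
    counts = Counter(responses)
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}

# Toy usage (placeholder responses, not the study's data):
print(preference_breakdown(["strongly prefer ours", "slightly prefer ours",
                            "slightly prefer other"]))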

BibTeX

@inproceedings{ng2024audio2photoreal,
  title        = {
    From Audio to Photoreal Embodiment:
    Synthesizing Humans in Conversations
  },
  author       = {
    Ng, Evonne and Romero, Javier and
    Bagautdinov, Timur and Bai, Shaojie and
    Darrell, Trevor and Kanazawa, Angjoo and
    Richard, Alexander
  },
  year         = 2024,
  booktitle    = {ArXiv}
}

Template adapted from Nerfies.