Language models transfer to listener motion prediction. Given a video of a speaker-listener pair, we extract the text of the speaker's spoken words. We fine-tune a pretrained large language model to autoregressively generate realistic 3D listener motion in response to the input transcript. Our method generates semantically meaningful gestures (e.g., an appropriately timed smile inferred from “amazing”) that flow in sync with the conversation. The output of our approach can optionally be rendered as photorealistic video. Please see the supplementary video for results.
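
To make the autoregressive generation step concrete, the following is a minimal sketch and not the authors' implementation. It assumes, purely for illustration, that listener motion is represented as a sequence of discrete token ids and that the model exposes a scoring function over the next motion token; the names `generate_motion_tokens`, `toy_score_fn`, and the vocabulary size of 4 are hypothetical, and the toy scorer merely stands in for a fine-tuned language model.

```python
from typing import Callable, List, Sequence

# Hypothetical interface: maps the token sequence so far to one score
# per candidate motion token. A real system would call a fine-tuned LLM here.
ScoreFn = Callable[[Sequence[int]], List[float]]


def generate_motion_tokens(
    score_fn: ScoreFn,
    transcript_tokens: Sequence[int],
    num_steps: int,
) -> List[int]:
    """Greedy autoregressive decoding.

    Each new motion token is conditioned on the transcript tokens and on
    every motion token generated before it.
    """
    context = list(transcript_tokens)  # the prompt: the speaker's transcript
    motion: List[int] = []
    for _ in range(num_steps):
        scores = score_fn(context)
        # Pick the highest-scoring motion token (greedy choice).
        next_token = max(range(len(scores)), key=scores.__getitem__)
        motion.append(next_token)
        context.append(next_token)  # feed the prediction back in
    return motion


def toy_score_fn(context: Sequence[int]) -> List[float]:
    """Deterministic stand-in for a model (assumption, for illustration only).

    For simplicity it treats text and motion ids as plain integers in one
    sequence; a real shared vocabulary would keep the two id ranges distinct.
    """
    vocab_size = 4  # hypothetical number of motion tokens
    target = (sum(context) + len(context)) % vocab_size
    return [1.0 if i == target else 0.0 for i in range(vocab_size)]


tokens = generate_motion_tokens(toy_score_fn, transcript_tokens=[2, 0, 1], num_steps=5)
print(tokens)
```

Under this assumed discrete representation, the generated token ids would still need to be decoded back into 3D motion by a separate component, which the sketch leaves out.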


The work of Ng, Subramanian and Darrell is supported by BAIR’s industrial alliance programs, and the DoD DARPA’s Machine Common Sense and/or SemaFor programs. Ginosar’s work is funded by NSF under Grant #2030859 to the Computing Research Association for the CIFellows Project.