Localizing Moments in Video with Natural Language

Authors: Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, Bryan Russell

ICCV 2017 [PDF]

Concept Figure

Abstract: We consider retrieving a specific temporal segment, or moment, from a video given a natural language text description. Methods designed to retrieve whole video clips with natural language determine what occurs in a video but not when. To address this issue, we propose the Moment Context Network (MCN) which effectively localizes natural language queries in longer videos by integrating local and global video features over time. A key obstacle to training our MCN model is that current video datasets do not include pairs of localized video segments and referring expressions, or text descriptions which uniquely identify a corresponding moment. Therefore, we collect the Distinct Describable Moments (DiDeMo) dataset which consists of over 10,000 unedited, personal videos in diverse visual settings with pairs of localized video segments and referring expressions. We demonstrate that MCN outperforms baselines and believe our initial results and release of DiDeMo will inspire further research on localizing video moments with natural language.
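As a rough illustration of what "integrating local and global video features over time" could look like, the sketch below builds one feature vector per candidate moment. It is not the released MCN code. It assumes the video has already been split into fixed-length segments with one feature vector each, and the helper names (`mean_pool`, `candidate_moments`, `moment_context_feature`) are hypothetical. The local feature is pooled over the candidate moment, the global feature over the whole video, and two normalized endpoints record where the moment sits in time.

```python
from itertools import combinations_with_replacement


def mean_pool(vectors):
    """Average a non-empty list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def candidate_moments(num_segments):
    """Enumerate every contiguous (start, end) span of segments, end inclusive."""
    return list(combinations_with_replacement(range(num_segments), 2))


def moment_context_feature(segment_feats, start, end):
    """Concatenate local, global, and temporal-position features (illustrative)."""
    # Local context: pooled over the segments inside the candidate moment.
    local_feat = mean_pool(segment_feats[start:end + 1])
    # Global context: pooled over every segment in the video.
    global_feat = mean_pool(segment_feats)
    # Temporal position: start and end of the moment, normalized to [0, 1].
    n = len(segment_feats)
    position = [start / n, (end + 1) / n]
    return local_feat + global_feat + position
```

For a video split into six segments (an assumption made here for illustration), `candidate_moments(6)` yields 21 spans. Each span's feature could then be compared against an embedding of the text query to rank moments.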

Below are examples of retrieved moments in DiDeMo using MCN.


        title = {Localizing Moments in Video With Natural Language},
        author = {Hendricks, Lisa Anne and Wang, Oliver and Shechtman, Eli and Sivic, Josef and Darrell, Trevor and Russell, Bryan},
        booktitle = {International Conference on Computer Vision (ICCV)},
        year = {2017}

Dataset: Released here!

Code: You can find my prototxts and pre-trained models here!