LART

Abstract: In this work we study the benefits of using tracking and 3D poses for action recognition. To achieve this, we take the Lagrangian view on analysing actions over a trajectory of human motion rather than at a fixed point in space. Taking this stand allows us to use the tracklets of people to predict their actions. In this spirit, we first show the benefits of using 3D pose to infer actions, and study person-person interactions. Subsequently, we propose a Lagrangian Action Recognition model by fusing 3D pose and contextualized appearance over tracklets. Our method achieves state-of-the-art performance on the AVA v2.2 dataset in both pose-only settings and standard benchmark settings. When reasoning about the action using only pose cues, our pose model achieves a gain of +10.0 mAP over the corresponding state-of-the-art, while our fused model has a gain of +2.8 mAP over the best state-of-the-art model. Our best model achieves 45.1 mAP on the AVA v2.2 action recognition task.

Lagrangian Action Recognition with Tracking (LART)

Given a video, we first track every person using a tracking algorithm (e.g., PHALP). Every detection in a track is then tokenized into a human-centric vector (e.g., pose, appearance). To represent 3D pose we use SMPL parameters and the estimated 3D location of the person; for contextualized appearance we use MViT features (pre-trained with MaskFeat). We then train a transformer network to predict actions from the tracks. Note that in the second frame there is no detection for the blue person; at such places we pass a mask token to in-fill the missing detection.
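The tokenization step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names (`tokenize_detection`, `tokenize_track`), the dictionary layout of a detection, and the toy feature sizes are all assumptions made for illustration. It shows one plausible way to turn a track with a missing detection into a fixed-length token sequence, placing a mask token wherever the tracker has no detection.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# Hypothetical layout of one detection: toy feature sizes, not the real ones.
Detection = Dict[str, Sequence[float]]


def tokenize_detection(det: Detection) -> List[float]:
    """Concatenate pose, 3D location, and appearance features into one token."""
    return list(det["pose"]) + list(det["location"]) + list(det["appearance"])


def tokenize_track(
    track: Sequence[Optional[Detection]],
    mask_token: Sequence[float],
) -> Tuple[List[List[float]], List[bool]]:
    """Turn a track into tokens, using mask_token where a detection is missing.

    Returns the token sequence and a parallel list of flags that are True
    at the positions filled with the mask token.
    """
    tokens: List[List[float]] = []
    is_masked: List[bool] = []
    for det in track:
        if det is None:
            # No detection in this frame: in-fill with the mask token.
            tokens.append(list(mask_token))
            is_masked.append(True)
        else:
            token = tokenize_detection(det)
            if len(token) != len(mask_token):
                raise ValueError("token and mask token must have the same size")
            tokens.append(token)
            is_masked.append(False)
    return tokens, is_masked


# Toy track of three frames; the person is not detected in the second frame.
det = {"pose": [0.1, 0.2], "location": [0.0, 1.0, 3.0], "appearance": [0.5]}
track = [det, None, det]
mask_token = [0.0] * 6  # placeholder values; a trained model would likely learn this vector
tokens, is_masked = tokenize_track(track, mask_token)
```

The resulting `tokens` list has one entry per frame, so a sequence model such as a transformer can consume the track at a fixed length regardless of how many detections are missing.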

Action Recognition with Pose (LART-Pose)

Class-wise performance on AVA: We show the performance of JMRN and LART-Pose on 60 AVA classes (average precision and relative gain). For pose-based classes such as standing, sitting, and walking, our 3D pose model achieves an average precision above 60 by looking only at the 3D poses over time. By taking multiple trajectories as input, our model can also capture interactions among people. For example, activities such as dancing (+30.1%), martial art (+19.8%), and hugging (+62.1%) show large relative gains over the state-of-the-art pose-only model. We plot a gain only if its magnitude exceeds 1 mAP.
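As a rough illustration of how such per-class comparisons can be computed, the sketch below assumes that relative gain means the percentage change in average precision over the baseline, and that the plotting rule keeps a class only when the absolute difference is larger than 1 mAP. Both readings are assumptions. The class names and numbers are made-up toy values, not results from the paper.

```python
from typing import Dict, Tuple


def relative_gain(baseline_ap: float, new_ap: float) -> float:
    """Percentage change of new_ap with respect to baseline_ap."""
    return 100.0 * (new_ap - baseline_ap) / baseline_ap


def gains_to_plot(
    baseline: Dict[str, float],
    ours: Dict[str, float],
    min_abs_gain: float = 1.0,
) -> Dict[str, Tuple[float, float]]:
    """Keep classes whose absolute AP difference exceeds min_abs_gain.

    Returns a mapping from class name to (absolute gain, relative gain in %).
    """
    selected: Dict[str, Tuple[float, float]] = {}
    for name, base_ap in baseline.items():
        diff = ours[name] - base_ap
        if abs(diff) > min_abs_gain:
            selected[name] = (diff, relative_gain(base_ap, ours[name]))
    return selected


# Toy average-precision values for three hypothetical classes.
baseline = {"class_a": 20.0, "class_b": 50.0, "class_c": 10.0}
ours = {"class_a": 25.0, "class_b": 50.5, "class_c": 8.0}
plotted = gains_to_plot(baseline, ours)
```

With these toy inputs, `class_b` is dropped because its change is only 0.5, while `class_a` (a gain) and `class_c` (a loss) are both kept.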

Action Recognition with MViT+Pose (LART)

Comparison with State-of-the-art methods: We show class-level performance (average precision and relative gain) of MViT (pre-trained with MaskFeat) and our model. Our method outperforms MViT on more than 50 of the 60 classes. In particular, for actions such as running, fighting, hugging, and sleeping, our method gains more than 5 mAP. This shows the benefit of having access to explicit tracks and 3D poses for action recognition. We plot a gain only if its magnitude exceeds 1 mAP.

Comparison with State-of-the-art method (Hiera): Our best model, using Hiera as the backbone and trained on AVA v2.2, achieves 45.1 mAP on the AVA v2.2 action recognition task.

Qualitative Results

Qualitative Results: We show the predictions from MViT and our model on validation samples from AVA v2.2. The person with the colored mesh is the person of interest whose action we recognize, and the people with gray meshes are the supporting actors. The first two columns demonstrate the benefit of having access to the action tubes of other people for action prediction. In the first column, the orange person is very close to the other person in a hugging posture, which makes it easier to predict hugging with higher probability. Similarly, in the second column, the explicit interaction among multiple people, together with the knowledge that the others are also fighting, increases the confidence in the fighting action for the green person relative to the 2D recognition model. The third and fourth columns show the benefit of explicitly modeling 3D pose over time (using tracks): the yellow person is in a riding pose, and the purple person is looking upwards with legs on a vertical plane. The last column shows the benefit of representing people with an amodal representation. Here the hand of the blue person is occluded, so the 2D recognition model does not see the action as a whole. SMPL meshes, however, are amodal, so the hand is still present, which boosts the probability of predicting the action label for closing the door.

Citation

  @inproceedings{rajasegaran2023benefits,
    title={On the Benefits of 3D Pose and Tracking for Human Action Recognition},
    author={Rajasegaran, Jathushan and Pavlakos, Georgios and Kanazawa, Angjoo and Feichtenhofer, Christoph and Malik, Jitendra},
    booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
    pages={640--649},
    year={2023}
  }