About
I am a second-year CS Ph.D. student at UC Berkeley and a member of Berkeley AI Research, advised by
Angjoo Kanazawa.
Previously, I obtained my master's degree from Columbia, where I worked with
Shih-Fu Chang. During my undergraduate years at Jiao Tong University, I collaborated
closely with
Anind Dey.
I am interested in understanding human visual persistence and
how it can be computationally replicated in machines. I strongly
believe that such reasoning across both space and time rarely
emerges from 2D correspondences without connecting to 3D. My work
is therefore currently themed around video and 3D vision, and
their projections into graphics and learning.
Publications
Long-term Human Motion Prediction with Scene Context
Zhe Cao,
Hang Gao, Karttikeya Mangalam, Qi-Zhi Cai, Minh Vo, Jitendra
Malik
ECCV 2020 / arxiv / abstract / bib / website / code & data
Oral Presentation
Human movement is goal-directed and influenced by the spatial
layout of the objects in the scene. To plan future human motion,
it is crucial to perceive the environment -- imagine how hard it
is to navigate a new room with lights off. Existing works on
predicting human motion do not pay attention to the scene
context and thus struggle in long-term prediction. In this work,
we propose a novel three-stage framework that exploits scene
context to tackle this task. Given a single scene image and 2D
pose histories, our method first samples multiple human motion
goals, then plans 3D human paths towards each goal, and finally
predicts 3D human pose sequences following each path. For stable
training and rigorous evaluation, we contribute a diverse
synthetic dataset with clean annotations. On both synthetic and
real datasets, our method shows consistent quantitative and
qualitative improvements over existing methods.
@inproceedings{cao2020long,
title={Long-term Human Motion Prediction with Scene Context},
author={Cao, Zhe and Gao, Hang and Mangalam, Karttikeya
and Cai, Qi-Zhi and Vo, Minh and Malik, Jitendra},
booktitle={European Conference on Computer Vision},
year={2020},
}
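The three-stage structure from the abstract can be summarized in a minimal sketch; the module names (goal_net, path_net, pose_net) and their interfaces are hypothetical placeholders, not the released code:

def predict_motion(scene_image, pose_history, goal_net, path_net,
                   pose_net, num_goals=3):
    """Hypothetical sketch of the three-stage prediction pipeline."""
    predictions = []
    # Stage 1: sample multiple plausible motion goals from the scene
    # image and the 2D pose histories.
    goals = goal_net.sample(scene_image, pose_history, n=num_goals)
    for goal in goals:
        # Stage 2: plan a 3D path towards the sampled goal.
        path = path_net.plan(scene_image, pose_history, goal)
        # Stage 3: predict the 3D pose sequence following that path.
        poses = pose_net.rollout(pose_history, path)
        predictions.append((goal, path, poses))
    return predictions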
Deformable Kernels: Adapting Effective Receptive Fields for Object
Deformation
Hang Gao*, Xizhou Zhu*, Steve Lin, Jifeng
Dai
Spatio-Temporal Action Graph Networks
Roei Herzig*, Elad Levi*, Huijuan
Xu*,
Hang Gao, Eli Brosh, Xiaolong Wang, Amir Globerson, Trevor
Darrell
ICCV 2019 Workshop /
arxiv
Disentangling Propagation and Generation for Video Prediction
Hang Gao*, Huazhe Xu*, Qi-Zhi Cai, Ruth
Wang, Fisher Yu, Trevor Darrell
ICCV 2019 / arxiv / abstract / bib
A dynamic scene has two types of elements: those that move
fluidly and can be predicted from previous frames, and those
which are disoccluded (exposed) and cannot be extrapolated.
Prior approaches to video prediction typically learn either to
warp or to hallucinate future pixels, but not both. In this
paper, we describe a computational model for high-fidelity video
prediction which disentangles motion-specific propagation from
motion-agnostic generation. We introduce a confidence-aware
warping operator which gates the output of pixel predictions
from a flow predictor for non-occluded regions and from a
context encoder for occluded regions. Moreover, in contrast to
prior works where confidence is jointly learned with flow and
appearance using a single network, we compute confidence after a
warping step, and employ a separate network to inpaint exposed
regions. Empirical results on both synthetic and real datasets
show that our disentangling approach provides better occlusion
maps and produces both sharper and more realistic predictions
compared to strong baselines.
@inproceedings{gao2019disentangling,
title={Disentangling Propagation and Generation for Video Prediction},
author={Gao, Hang and Xu, Huazhe and Cai, Qi-Zhi
and Wang, Ruth and Yu, Fisher and Darrell, Trevor},
booktitle={Proceedings of the IEEE International Conference on Computer Vision},
year={2019},
}
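A minimal sketch of the confidence-aware gating described in the abstract, assuming PyTorch tensors; the flow-warping helper and the way confidence is obtained are simplifications for illustration, not the paper's implementation:

import torch
import torch.nn.functional as F

def flow_warp(frame, flow):
    # Backward-warp frame (B, C, H, W) by optical flow (B, 2, H, W)
    # with bilinear sampling.
    b, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).to(frame)    # (H, W, 2)
    coords = grid + flow.permute(0, 2, 3, 1)          # (B, H, W, 2)
    # Normalize coordinates to [-1, 1] as expected by grid_sample.
    coords[..., 0] = 2 * coords[..., 0] / (w - 1) - 1
    coords[..., 1] = 2 * coords[..., 1] / (h - 1) - 1
    return F.grid_sample(frame, coords, align_corners=True)

def predict_next_frame(prev_frame, flow, confidence, inpainted):
    # Gate per pixel between flow-propagated content (non-occluded
    # regions) and the context encoder's inpainting (disoccluded regions).
    warped = flow_warp(prev_frame, flow)
    return confidence * warped + (1.0 - confidence) * inpainted

Here, confidence would come from a separate network applied after the warping step, matching the decomposition described above.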
Low-shot Learning via Covariance-Preserving Adversarial
Augmentation Networks
Hang Gao, Zheng Shou, Alireza Zareian, Hanwang Zhang, Shih-Fu
Chang
NeurIPS 2018 /
arxiv /
abstract /
bib
Deep neural networks suffer from overfitting and catastrophic
forgetting when trained on small data. One natural remedy for
this problem is data augmentation, which has been recently shown
to be effective. However, previous works either assume that
intra-class variances can always be generalized to new classes,
or employ naive generation methods to hallucinate finite
examples without modeling their latent distributions. In this
work, we propose Covariance-Preserving Adversarial Augmentation
Networks to overcome existing limits of low-shot learning.
Specifically, a novel Generative Adversarial Network is designed
to model the latent distribution of each novel class given its
related base counterparts. Since direct estimation on novel
classes can be inductively biased, we explicitly preserve
covariance information as the "variability" of base examples
during the generation process. Empirical results show that our
model can generate realistic yet diverse examples, leading to
substantial improvements on the ImageNet benchmark over the
state of the art.
@inproceedings{gao2018low,
title={Low-shot Learning via Covariance-Preserving Adversarial Augmentation Networks},
author={Gao, Hang and Shou, Zheng and Zareian, Alireza
and Zhang, Hanwang and Chang, Shih-Fu},
booktitle={Advances in Neural Information Processing Systems},
year={2018},
}
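In the spirit of the covariance-preserving idea above, a minimal PyTorch sketch of a penalty that keeps the second-order statistics ("variability") of related base examples in the generated novel-class examples; the Frobenius-norm form is an illustrative assumption, not the paper's exact formulation:

import torch

def covariance(features):
    # Unbiased feature covariance; features: (N, D).
    centered = features - features.mean(dim=0, keepdim=True)
    return centered.t() @ centered / (features.shape[0] - 1)

def covariance_preserving_penalty(generated, base):
    # Encourage generated novel-class features to inherit the
    # variability (covariance structure) of related base-class features.
    return torch.linalg.norm(covariance(generated) - covariance(base),
                             ord="fro") ** 2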
AutoLoc: Weakly-supervised Temporal Action Localization in
Untrimmed Videos
Zheng Shou,
Hang Gao, Lei Zhang, Kazuyuki Miyazawa, Shih-Fu Chang
ECCV 2018 / arxiv / abstract / bib / code
Temporal Action Localization (TAL) in untrimmed video is
important for many applications, but it is very expensive to
annotate segment-level ground truth (action class and temporal
boundary). This raises interest in addressing TAL with weak
supervision, where only video-level annotations are available
during training. However, state-of-the-art weakly-supervised TAL
methods focus only on generating a good Class Activation
Sequence (CAS) over time and then conduct simple thresholding on
the CAS to localize actions. In this paper, we develop a novel
weakly-supervised TAL framework called AutoLoc that directly
predicts the temporal boundary of each action instance. We
propose a novel Outer-Inner-Contrastive (OIC) loss to
automatically discover the segment-level supervision needed to
train such a boundary predictor. Our method achieves
dramatically improved performance: at an IoU threshold of 0.5,
it improves mAP on THUMOS'14 from 13.7% to 21.2% and mAP on
ActivityNet from 7.4% to 27.3%. It is also encouraging that our
weakly-supervised method achieves results comparable to some
fully-supervised methods.
@inproceedings{shou2018autoloc,
title={AutoLoc: Weakly-supervised Temporal Action
Localization in Untrimmed Videos},
author={Shou, Zheng and Gao, Hang and Zhang, Lei
and Miyazawa, Kazuyuki and Chang, Shih-Fu},
booktitle={European Conference on Computer Vision},
year={2018}
}
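The Outer-Inner-Contrastive (OIC) idea can be sketched on a 1D class activation sequence as follows; the inflation ratio and the boundary handling here are assumptions for illustration:

import numpy as np

def oic_loss(cas, start, end, inflation=0.25):
    # cas: 1D class activation sequence; [start, end) is a candidate
    # action segment. A low loss means high activation inside the
    # segment and low activation in the inflated area just outside it.
    margin = max(1, int(inflation * (end - start)))
    lo, hi = max(0, start - margin), min(len(cas), end + margin)
    inner = cas[start:end].mean()
    outer = np.concatenate([cas[lo:start], cas[end:hi]])
    return (outer.mean() if outer.size else 0.0) - inner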
ER: Early Recognition of Inattentive Driving Events Leveraging
Audio Devices on Smartphones
Xiangyu Xu,
Hang Gao, Jiadi Yu, Yingying Chen, Yanmin Zhu, Guangtao Xue,
Minglu Li
INFOCOM 2017 /
IEEE /
abstract /
bib
Real-time driving behavior monitoring is a cornerstone of
improving driving safety. Most existing studies on driving
behavior monitoring using smartphones only provide detection
results after an abnormal driving behavior is finished, which is
not sufficient for alerting drivers and avoiding car accidents. In this
paper, we leverage existing audio devices on smartphones to
realize early recognition of inattentive driving events
including Fetching Forward, Picking up Drops, Turning Back and
Eating or Drinking. Through empirical studies of driving traces
collected in real driving environments, we find that each type
of inattentive driving event exhibits unique patterns on Doppler
profiles of audio signals. This enables us to develop an Early
Recognition system, ER, which can recognize inattentive driving
events at an early stage and alert drivers in a timely manner. ER employs
machine learning methods to first generate binary classifiers
for every pair of inattentive driving events, and then develops
a modified voting mechanism to form a multi-classifier for all
inattentive driving events along with other driving behaviors.
It next turns the multi-classifier into a gradient model forest
to achieve early recognition of inattentive driving. Through
extensive experiments with 8 volunteers driving for about half a
year, ER can achieve an average total accuracy of 94.80% for
inattentive driving recognition and recognize over 80% of
inattentive driving events before the event is 50% finished.
@inproceedings{xu2017er,
title={ER: Early recognition of inattentive driving
leveraging audio devices on smartphones},
author={Xu, Xiangyu and Gao, Hang and Yu, Jiadi and
Chen, Yingying and Zhu, Yanmin and Xue, Guangtao and Li, Minglu},
booktitle={INFOCOM 2017-IEEE Conference on Computer
Communications, IEEE},
pages={1--9},
year={2017},
organization={IEEE}
}
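ER's pairwise-classification stage can be sketched as below; the SVM classifiers, the plain majority vote, and the generic feature vectors are stand-in assumptions for the paper's Doppler-profile features and modified voting mechanism:

import numpy as np
from itertools import combinations
from collections import Counter
from sklearn.svm import SVC

def train_pairwise(X, y):
    # One binary classifier per pair of driving-event classes.
    models = {}
    for a, b in combinations(np.unique(y), 2):
        mask = (y == a) | (y == b)
        models[(a, b)] = SVC().fit(X[mask], y[mask])
    return models

def classify(models, x):
    # Combine the pairwise decisions with a simple majority vote.
    votes = Counter(m.predict(x.reshape(1, -1))[0] for m in models.values())
    return votes.most_common(1)[0][0]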