Deepak Pathak

I am a fourth-year graduate student in Computer Science at UC Berkeley, working primarily with Prof. Trevor Darrell and Prof. Alyosha Efros at the intersection of deep learning, computer vision, and robotics.

Earlier, I graduated from IIT Kanpur in 2014 with a Bachelor's degree in Computer Science and Engineering. I did my undergraduate thesis with Prof. Amitabha Mukerjee on unsupervised anomaly detection in surveillance videos. I have also spent time at the Microsoft Research New York City lab working on forecasting and prediction markets.

I spent Summer 2016 at Facebook AI Research (FAIR, Seattle) working with Ross Girshick, Bharath Hariharan, and Piotr Dollár.

CV | Google Scholar | GitHub

Publications

[NEW] Zero-Shot Visual Imitation
Deepak Pathak*, Parsa Mahmoudieh*, Guanghao Luo*, Pulkit Agrawal*, Dian Chen, Yide Shentu, Evan Shelhamer, Jitendra Malik, Alexei A. Efros, Trevor Darrell
International Conference on Learning Representations (ICLR), 2018
Oral presentation

webpage | pdf | abstract | bibtex | arXiv | code | videos | open-review

The current dominant paradigm for imitation learning relies on strong supervision of expert actions to learn both 'what' and 'how' to imitate. We pursue an alternative paradigm wherein an agent first explores the world without any expert supervision and then distills its experience into a goal-conditioned skill policy with a novel forward consistency loss. In our framework, the role of the expert is only to communicate the goals (i.e., what to imitate) during inference. The learned policy is then employed to mimic the expert (i.e., how to imitate) after seeing just a sequence of images demonstrating the desired task. Our method is 'zero-shot' in the sense that the agent never has access to expert actions during training or for the task demonstration at inference. We evaluate our zero-shot imitator in two real-world settings: complex rope manipulation with a Baxter robot and navigation in previously unseen office environments with a TurtleBot. Through further experiments in VizDoom simulation, we provide evidence that better mechanisms for exploration lead to learning a more capable policy which in turn improves end task performance.
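
As a concrete illustration, here is a minimal PyTorch-style sketch of a forward consistency loss; the module names (phi, policy, forward_model) are placeholders of mine, not from the released code. The policy is judged by the outcome its predicted action would produce, rather than by matching expert actions, which are never observed.

import torch.nn.functional as F

def forward_consistency_loss(phi, policy, forward_model, obs, next_obs):
    # Encode the current observation and the (goal) next observation.
    f_t, f_goal = phi(obs), phi(next_obs)
    # The goal-conditioned policy proposes an action to reach the goal.
    a_hat = policy(f_t, f_goal)
    # A learned forward model predicts where that action would lead.
    f_pred = forward_model(f_t, a_hat)
    # Penalize the predicted outcome, not the action itself: different
    # actions that reach the same state incur no loss.
    return F.mse_loss(f_pred, f_goal.detach())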

@inproceedings{pathakICLR18zeroshot,
    Author = {Pathak, Deepak and
    Mahmoudieh, Parsa and Luo, Guanghao and
    Agrawal, Pulkit and Chen, Dian and
    Shentu, Yide and Shelhamer, Evan and
    Malik, Jitendra and Efros, Alexei A. and
    Darrell, Trevor},
    Title = {Zero-Shot Visual Imitation},
    Booktitle = {ICLR},
    Year = {2018}
}

[NEW] Investigating Human Priors for Playing Video Games
Rachit Dubey, Pulkit Agrawal, Deepak Pathak, Thomas L. Griffiths, Alexei A. Efros
International Conference on Learning Representations (ICLR) Workshop, 2018

project webpage | pdf | abstract | bibtex | arXiv | video

What makes humans so good at solving seemingly complex video games? Unlike computers, humans bring in a great deal of prior knowledge about the world, enabling efficient decision making. This paper investigates the role of human priors for solving video games. Given a sample game, we conduct a series of ablation studies to quantify the importance of various priors on human performance. We do this by modifying the video game environment to systematically mask different types of visual information that could be used by humans as priors. We find that removal of some prior knowledge causes a drastic degradation in the speed with which human players solve the game, e.g. from 2 minutes to over 20 minutes. Furthermore, our results indicate that general priors, such as the importance of objects and visual consistency, are critical for efficient game-play.

@inproceedings{pathakICLRW18human,
    Author = {Dubey, Rachit and Agrawal, Pulkit
    and Pathak, Deepak and Griffiths, Thomas L.
    and Efros, Alexei A.},
    Title = {Investigating Human Priors for
    Playing Video Games},
    Booktitle = {ICLR Workshop},
    Year = {2018}
}

[NEW] Toward Multimodal Image-to-Image Translation
Jun-Yan Zhu, Richard Zhang, Deepak Pathak, Trevor Darrell, Alexei A. Efros, Oliver Wang, Eli Shechtman
Advances in Neural Information Processing Systems (NIPS), 2017

project webpage | pdf | abstract | bibtex | arXiv | code | video

Many image-to-image translation problems are ambiguous, as a single input image may correspond to multiple possible outputs. In this work, we aim to model a distribution of possible outputs in a conditional generative modeling setting. The ambiguity of the mapping is distilled in a low-dimensional latent vector, which can be randomly sampled at test time. A generator learns to map the given input, combined with this latent code, to the output. We explicitly encourage the connection between output and the latent code to be invertible. This helps prevent a many-to-one mapping from the latent code to the output during training, also known as the problem of mode collapse, and produces more diverse results. We explore several variants of this approach by employing different training objectives, network architectures, and methods of injecting the latent code. Our proposed method encourages bijective consistency between the latent encoding and output modes. We present a systematic comparison of our method and other variants on both perceptual realism and diversity.
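
The invertibility constraint can be pictured as a latent-code recovery term: an encoder must reconstruct the sampled code from the generated output, which directly penalizes many-to-one (mode-collapsed) mappings. A hedged sketch with placeholder modules G (generator) and E (encoder); this is one ingredient of the objective, not the paper's complete loss:

import torch.nn.functional as F

def latent_recovery_loss(G, E, x, z):
    # Translate input image x conditioned on a randomly sampled code z.
    y_hat = G(x, z)
    # Try to recover the code from the output; if many codes collapsed
    # onto one output, recovery would be impossible.
    z_hat = E(y_hat)
    return F.l1_loss(z_hat, z)

# In training, z is drawn from N(0, I) and this term is added to the
# usual conditional GAN and image reconstruction losses.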

@inproceedings{zhu2017multimodal,
    Author = {Zhu, Jun-Yan and Zhang, Richard
    and Pathak, Deepak and Darrell, Trevor
    and Efros, Alexei A and Wang, Oliver
    and Shechtman, Eli},
    Title = {Toward Multimodal Image-to-Image
    Translation},
    Booktitle = {NIPS},
    Year = {2017}
}

[NEW] Curiosity-driven Exploration by Self-supervised Prediction
Deepak Pathak, Pulkit Agrawal, Alexei A. Efros and Trevor Darrell
International Conference on Machine Learning (ICML), 2017
[Oral presentation video]

project webpage | pdf | abstract | bibtex | arXiv | code | video | in the media
Also presented at CVPR'17 Workshop on Deep Learning for Robotic Vision (Oral Spotlight)

In many real-world scenarios, rewards extrinsic to the agent are extremely sparse, or absent altogether. In such cases, curiosity can serve as an intrinsic reward signal to enable the agent to explore its environment and learn skills that might be useful later in its life. We formulate curiosity as the error in an agent's ability to predict the consequence of its own actions in a visual feature space learned by a self-supervised inverse dynamics model. Our formulation scales to high-dimensional continuous state spaces like images, bypasses the difficulties of directly predicting pixels, and, critically, ignores the aspects of the environment that cannot affect the agent. The proposed approach is evaluated in two environments: VizDoom and Super Mario Bros. Three broad settings are investigated: 1) sparse extrinsic reward, where curiosity allows for far fewer interactions with the environment to reach the goal; 2) exploration with no extrinsic reward, where curiosity pushes the agent to explore more efficiently; and 3) generalization to unseen scenarios (e.g. new levels of the same game) where the knowledge gained from earlier experience helps the agent explore new places much faster than starting from scratch.
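
The reward computation itself is compact; a hedged sketch with placeholder networks (encoder, inverse_model, forward_model), roughly following the description above rather than the released code:

import torch.nn.functional as F

def curiosity_step(encoder, inverse_model, forward_model,
                   s_t, a_t, s_t1, eta=0.01):
    # Features come from an encoder trained via inverse dynamics, so
    # they retain only what the agent's actions can influence.
    f_t, f_t1 = encoder(s_t), encoder(s_t1)
    a_logits = inverse_model(f_t, f_t1)            # predict the action between states
    inverse_loss = F.cross_entropy(a_logits, a_t)  # a_t: discrete action ids
    # The forward model's error in feature space is the intrinsic reward:
    # transitions the agent cannot yet predict are the ones worth seeking.
    f_pred = forward_model(f_t, a_t)
    forward_err = 0.5 * (f_pred - f_t1.detach()).pow(2).sum(dim=1)
    intrinsic_reward = (eta * forward_err).detach()  # no gradient through the reward
    return intrinsic_reward, inverse_loss + forward_err.mean()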

@inproceedings{pathakICML17curiosity,
    Author = {Pathak, Deepak and
    Agrawal, Pulkit and
    Efros, Alexei A. and
    Darrell, Trevor},
    Title = {Curiosity-driven Exploration
    by Self-supervised Prediction},
    Booktitle = {ICML},
    Year = {2017}
}

[NEW] Learning Features by Watching Objects Move
Deepak Pathak, Ross Girshick, Piotr Dollár, Trevor Darrell and Bharath Hariharan
Computer Vision and Pattern Recognition (CVPR), 2017

project webpage | pdf | abstract | bibtex | arXiv | code
Also presented at CVPR'17 Workshop on YouTube-8M Large-Scale Video Understanding (Oral Spotlight)

This paper presents a novel yet intuitive approach to unsupervised feature learning. Inspired by the human visual system, we explore whether low-level motion-based grouping cues can be used to learn an effective visual representation. Specifically, we use unsupervised motion-based segmentation on videos to obtain segments, which we use as 'pseudo ground truth' to train a convolutional network to segment objects from a single frame. Given the extensive evidence that motion plays a key role in the development of the human visual system, we hope that this straightforward approach to unsupervised learning will be more effective than cleverly designed 'pretext' tasks studied in the literature. Indeed, our extensive experiments show that this is the case. When used for transfer learning on object detection, our representation significantly outperforms previous unsupervised approaches across multiple settings, especially when training data for the target task is scarce.
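
Once the pseudo labels exist, the learning problem reduces to ordinary supervised segmentation with noisy targets; a minimal sketch (net and the tensor shapes are assumptions, not the released code):

import torch.nn.functional as F

def pseudo_label_loss(net, frame, motion_mask):
    # frame: (B, 3, H, W) a single RGB frame drawn from a video.
    # motion_mask: (B, 1, H, W) binary segments produced by unsupervised
    # motion-based grouping, treated as noisy 'pseudo ground truth'.
    logits = net(frame)  # per-pixel foreground logits from a single frame
    return F.binary_cross_entropy_with_logits(logits, motion_mask.float())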

@inproceedings{pathakCVPR17learning,
    Author = {Pathak, Deepak and
    Girshick, Ross and
    Doll{\'a}r, Piotr and
    Darrell, Trevor and
    Hariharan, Bharath},
    Title = {Learning Features
    by Watching Objects Move},
    Booktitle = {CVPR},
    Year = {2017}
}

Context Encoders: Feature Learning by Inpainting
Deepak Pathak, Philipp Krähenbühl, Jeff Donahue, Trevor Darrell and Alexei A. Efros
Computer Vision and Pattern Recognition (CVPR), 2016

project webpage | pdf w/ supp | abstract | bibtex | arXiv | code | slides

We present an unsupervised visual feature learning algorithm driven by context-based pixel prediction. By analogy with auto-encoders, we propose Context Encoders -- a convolutional neural network trained to generate the contents of an arbitrary image region conditioned on its surroundings. In order to succeed at this task, context encoders need both to understand the content of the entire image and to produce a plausible hypothesis for the missing part(s). When training context encoders, we have experimented with both a standard pixel-wise reconstruction loss and a reconstruction plus adversarial loss. The latter produces much sharper results because it can better handle multiple modes in the output. We found that a context encoder learns a representation that captures not just appearance but also the semantics of visual structures. We quantitatively demonstrate the effectiveness of our learned features for CNN pre-training on classification, detection, and segmentation tasks. Furthermore, context encoders can be used for semantic inpainting tasks, either stand-alone or as initialization for non-parametric methods.
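
The generator objective pairs the two losses; a sketch with placeholder modules G and D, using the heavy weighting of reconstruction over the adversarial term reported in the paper (treat the exact values as an assumption):

import torch
import torch.nn.functional as F

def generator_loss(G, D, img, mask, lam_rec=0.999, lam_adv=0.001):
    # mask is 1 inside the dropped region and 0 over the visible context.
    context = img * (1 - mask)
    pred = G(context)                 # hallucinate content for the hole
    filled = context + pred * mask    # paste the prediction back in
    # L2 on the missing region captures coarse structure but averages
    # over modes; the adversarial term picks a sharp, plausible mode.
    rec = F.mse_loss(pred * mask, img * mask)
    adv = -torch.log(D(filled) + 1e-8).mean()  # non-saturating GAN term
    return lam_rec * rec + lam_adv * adv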

@inproceedings{pathakCVPR16context,
    Author = {Pathak, Deepak and
    Kr\"ahenb\"uhl, Philipp and
    Donahue, Jeff and
    Darrell, Trevor and
    Efros, Alexei A.},
    Title = {Context Encoders:
    Feature Learning by Inpainting},
    Booktitle = {CVPR},
    Year = {2016}
}

Large Scale Visual Recognition through Adaptation using Joint Representation and Multiple Instance Learning
Judy Hoffman, Deepak Pathak, Eric Tzeng, Jonathan Long, Sergio Guadarrama, Trevor Darrell and Kate Saenko
Journal of Machine Learning Research (JMLR), 2016

pdf | abstract | bibtex | jmlr

A major barrier towards scaling visual recognition systems is the difficulty of obtaining labeled images for large numbers of categories. Recently, deep convolutional neural networks (CNNs) trained using 1.2M+ labeled images have emerged as clear winners on object classification benchmarks. Unfortunately, only a small fraction of those labels are available with bounding box localization for training the detection task, and even fewer pixel-level annotations are available for semantic segmentation. It is much cheaper and easier to collect large quantities of image-level labels from search engines than it is to collect scene-centric images with precisely localized labels. We develop methods for learning large scale recognition models which exploit joint training over both weak (image-level) and strong (bounding box) labels and which transfer learned perceptual representations from strongly-labeled auxiliary tasks. We provide a novel formulation of a joint multiple instance learning method that includes examples from object-centric data with image-level labels when available, and also performs domain transfer learning to improve the underlying detector representation. We then show how to use our large scale detectors to produce pixel-level annotations. Using our method, we produce a >7.6K category detector and release code and models at lsda.berkeleyvision.org.
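
In simplified form, the joint objective mixes a latent-instance (MIL) term for weakly labeled images with a standard detection term when boxes exist; a sketch over per-image region proposals, with all names and shapes assumed:

import torch.nn.functional as F

def joint_weak_strong_loss(region_scores, image_labels, box_labels=None):
    # region_scores: (R, C) detector scores over R candidate regions.
    # image_labels: (C,) multi-hot image-level tags (weak supervision).
    # Weak term: the best-scoring region per class is the latent box and
    # must agree with the image-level tag.
    pooled = region_scores.max(dim=0).values                    # (C,)
    weak = F.binary_cross_entropy_with_logits(pooled, image_labels.float())
    # Strong term: when box-level class assignments are available,
    # supervise the regions directly.
    if box_labels is not None:                                  # (R,) class ids
        return weak + F.cross_entropy(region_scores, box_labels)
    return weak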

@article{pathakJMLR16,
    Author = {Hoffman, Judy and
    Pathak, Deepak and
    Tzeng, Eric and
    Long, Jonathan and
    Guadarrama, Sergio and
    Darrell, Trevor and
    Saenko, Kate},
    Title = {Large Scale Visual Recognition
    through Adaptation using Joint
    Representation and Multiple Instance
    Learning},
    Journal = {JMLR},
    Year = {2016}
}

Constrained Convolutional Neural Networks for Weakly Supervised Segmentation
Deepak Pathak, Philipp Krähenbühl and Trevor Darrell
International Conference on Computer Vision (ICCV), 2015

pdf | supp | abstract | bibtex | arXiv | code

We present an approach to learn a dense pixel-wise labeling from image-level tags. Each image-level tag imposes constraints on the output labeling of a Convolutional Neural Network (CNN) classifier. We propose Constrained CNN (CCNN), a method which uses a novel loss function to optimize for any set of linear constraints on the output space (i.e. predicted label distribution) of a CNN. Our loss formulation is easy to optimize and can be incorporated directly into standard stochastic gradient descent optimization. The key idea is to phrase the training objective as a biconvex optimization for linear models, which we then relax to nonlinear deep networks. Extensive experiments demonstrate the generality of our new learning framework. The constrained loss yields state-of-the-art results on weakly supervised semantic image segmentation. We further demonstrate that adding slightly more supervision can greatly improve the performance of the learning algorithm.
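
The core computational step is a KL projection of the network output onto the constraint set. Below is a toy instance for a single constraint, a lower bound on the expected pixel count of a class known to be present; the paper handles arbitrary linear constraints through a convex dual, so treat this bisection as an illustrative special case:

import torch

def project_count_constraint(q, cls, lower_bound, iters=50):
    # q: (n_pixels, n_classes) per-pixel softmax output of the CNN.
    # Find P minimizing KL(P || Q) s.t. sum_i P_i(cls) >= lower_bound.
    # The dual solution scales class `cls` by exp(lam) and renormalizes
    # per pixel; the class mass is monotone in lam, so bisect on lam.
    lo, hi = 0.0, 20.0
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        scaled = q.clone()
        scaled[:, cls] = scaled[:, cls] * torch.exp(torch.tensor(lam))
        p = scaled / scaled.sum(dim=1, keepdim=True)
        if p[:, cls].sum() < lower_bound:
            lo = lam   # constraint still violated: boost the class more
        else:
            hi = lam   # satisfied: shrink toward the minimal boost
    return p  # serves as the target distribution for the next SGD step

Training then alternates between this projection and a gradient step that moves the network output toward the projected distribution.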

@inproceedings{pathakICCV15ccnn,
    Author = {Pathak, Deepak and
    Kr\"ahenb\"uhl, Philipp and
    Darrell, Trevor},
    Title = {Constrained Convolutional
    Neural Networks for Weakly
    Supervised Segmentation},
    Booktitle = {ICCV},
    Year = {2015}
}

Detector Discovery in the Wild: Joint Multiple Instance and Representation Learning
Judy Hoffman, Deepak Pathak, Trevor Darrell and Kate Saenko
Computer Vision and Pattern Recognition (CVPR), 2015

pdf | abstract | bibtex | arXiv

We develop methods for detector learning which exploit joint training over both weak (image-level) and strong (bounding box) labels and which transfer learned perceptual representations from strongly-labeled auxiliary tasks. Previous methods for weak-label learning often learn detector models independently using latent variable optimization, but fail to share deep representation knowledge across classes and usually require strong initialization. Other previous methods transfer deep representations from domains with strong labels to those with only weak labels, but do not optimize over individual latent boxes, and thus may miss specific salient structures for a particular category. We propose a model that subsumes these previous approaches, and simultaneously trains a representation and detectors for categories with either weak or strong labels present. We provide a novel formulation of a joint multiple instance learning method that includes examples from classification-style data when available, and also performs domain transfer learning to improve the underlying detector representation. Our model outperforms known methods on ImageNet-200 detection with weak labels.

@inproceedings{pathakCVPR15,
    Author = {Hoffman, Judy and
    Pathak, Deepak and
    Darrell, Trevor and
    Saenko, Kate},
    Title = {Detector Discovery
    in the Wild: Joint Multiple
    Instance and Representation
    Learning},
    Booktitle = {CVPR},
    Year = {2015}
}

Fully Convolutional Multi-Class Multiple Instance Learning
Deepak Pathak, Evan Shelhamer, Jonathan Long and Trevor Darrell
International Conference on Learning Representations (ICLR) Workshop, 2015

pdf | abstract | bibtex | arXiv

Multiple instance learning (MIL) can reduce the need for costly annotation in tasks such as semantic segmentation by weakening the required degree of supervision. We propose a novel MIL formulation of multi-class semantic segmentation learning by a fully convolutional network. In this setting, we seek to learn a semantic segmentation model from just weak image-level labels. The model is trained end-to-end to jointly optimize the representation while disambiguating the pixel-image label assignment. Fully convolutional training accepts inputs of any size, does not need object proposal pre-processing, and offers a pixelwise loss map for selecting latent instances. Our multi-class MIL loss exploits the further supervision given by images with multiple labels. We evaluate this approach through preliminary experiments on the PASCAL VOC segmentation challenge.
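
A simplified multi-label sketch of the pixelwise-max idea (the paper's loss is a multi-class softmax over the max-scoring pixels; per-class binary terms are used here for brevity):

import torch.nn.functional as F

def fcn_mil_loss(logits, image_labels):
    # logits: (B, C, H, W) fully convolutional class scores.
    # image_labels: (B, C) multi-hot image-level tags.
    # The max-scoring pixel per class acts as the latent instance; the
    # pixelwise loss map makes this selection cheap, with no proposals.
    pooled = logits.flatten(2).max(dim=2).values   # (B, C) max over pixels
    return F.binary_cross_entropy_with_logits(pooled, image_labels.float())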

@inproceedings{pathakICLR15,
    Author = {Pathak, Deepak and
    Shelhamer, Evan and
    Long, Jonathan and
    Darrell, Trevor},
    Title = {Fully Convolutional
    Multi-Class Multiple Instance
    Learning},
    Booktitle = {ICLR Workshop},
    Year = {2015}
}

Anomaly Localization in Topic-based Analysis of Surveillance Videos
Deepak Pathak, Abhijit Sharang and Amitabha Mukerjee
IEEE Winter Conference on Applications of Computer Vision (WACV), 2015

pdf | abstract | bibtex

Topic models for video analysis have been used for unsupervised identification of normal activity in videos, thereby enabling the detection of anomalous actions. However, while intervals containing anomalies are detected, it has not been possible to localize the anomalous activities in such models. This is a challenging problem as the abnormal content is usually a small fraction of the entire video data and hence distinctions in terms of likelihood are unlikely. Here we propose a methodology to extend the topic-based analysis with rich local descriptors, incorporating quantized spatio-temporal gradient descriptors with image location and size information. The visual clips over this vocabulary are then represented in latent topic space using models like pLSA. Further, we introduce an algorithm to quantify the anomalous content in a video clip by projecting the learned topic space information. Using the algorithm, we detect whether the video clip is abnormal and, if positive, localize the anomaly in the spatio-temporal domain. We also contribute one real-world surveillance video dataset for comprehensive evaluation of the proposed algorithm. Experiments are presented on the proposed and two other standard surveillance datasets.
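
One way to realize the scoring step is pLSA 'fold-in': fix the learned topics, fit only the clip's topic mixture, and flag clips the topics explain poorly. A sketch under assumed shapes, not the paper's exact algorithm:

import numpy as np

def clip_anomaly_score(counts, topic_word, n_iter=50):
    # counts: (V,) visual-word counts for one video clip.
    # topic_word: (K, V) p(w|z) learned from normal training video.
    K = topic_word.shape[0]
    pz = np.full(K, 1.0 / K)            # clip's topic mixture p(z|d)
    for _ in range(n_iter):             # EM fold-in with topics frozen
        joint = pz[:, None] * topic_word                  # (K, V)
        resp = joint / np.maximum(joint.sum(axis=0), 1e-12)
        pz = (resp * counts).sum(axis=1)                  # expected topic counts
        pz = pz / pz.sum()
    pw = np.maximum(topic_word.T @ pz, 1e-12)             # p(w|d)
    # Average per-word negative log-likelihood: high means anomalous.
    return -(counts @ np.log(pw)) / max(counts.sum(), 1)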

@inproceedings{pathakWACV15,
    Author = {Pathak, Deepak and
    Sharang, Abhijit and
    Mukerjee, Amitabha},
    Title = {Anomaly Localization
    in Topic-based Analysis of
    Surveillance Videos},
    Booktitle = {WACV},
    Year = {2015}
}

Where is my Friend? - Person identification in Social Networks
Deepak Pathak, Sai Nitish Satyavolu and Vinay P. Namboodiri
IEEE Conference on Automatic Face and Gesture Recognition (FG), 2015

pdf | abstract | bibtex

One of the interesting applications of computer vision is the ability to identify or detect persons in the real world. This problem has been posed in the context of identifying people in television series or in multi-camera networks. However, a common scenario for this problem is identifying people in images on social networks. In this paper we present a method that aims to solve this problem in real-world conditions, where the person can be in any pose, profile, and orientation and the face itself is not always clearly visible. Moreover, we show that the problem can be solved with supervision as weak as a single label indicating whether the person is present or not, which is usually all that is available when people are tagged on social networks. This is challenging because there can be ambiguity in associating the label with the right person. We solve the problem in this setting using a latent max-margin formulation in which the identity of the person is the latent parameter being classified. The framework builds on off-the-shelf computer vision techniques for person detection and face detection, and is also able to account for inaccuracies in these components. The idea is to model the complete person in addition to the face, again with only weak supervision. We also contribute three real-world datasets that we created for extensive evaluation of the solution, and we show using these datasets that the problem can be effectively solved by the proposed method.

@inproceedings{pathakFG15,
    Author = {Pathak, Deepak and
    Satyavolu, Sai Nitish and
    Namboodiri, Vinay P.},
    Title = {Where is my Friend? -
    Person identification in Social
    Networks},
    Booktitle = {Automatic Face and
    Gesture Recognition (FG)},
    Year = {2015}
}

A Comparison Of Forecasting Methods: Fundamentals, Polling, Prediction Markets, and Experts
Deepak Pathak, David Rothschild and Miro Dudík
Journal of Prediction Markets (JPM), 2015

pdf | abstract | bibtex | predictions2014 | predictions2016

We compare Oscar forecasts derived from four data types (fundamentals, polling, prediction markets, and domain experts) across three attributes (accuracy, timeliness and cost effectiveness). Fundamentals-based forecasts are relatively expensive to construct, an attribute the academic literature frequently ignores, and update slowly over time, constraining their accuracy. However, fundamentals provide valuable insights into the relationship between key indicators for nominated movies and their chances of victory. For instance, we find that the performance in other awards shows is highly predictive of the Oscar victory whereas box office results are not. Polling-based forecasts have the potential to be both accurate and timely. Timeliness requires incentives for frequent responses by high-information users. Accuracy is achieved by a proper transformation of raw polls. Prediction market prices are accurate forecasts, but can be further improved by simple transformations of raw prices, yielding the most accurate forecasts in our study. Expert forecasts exhibit some characteristics of fundamental models, but are generally not comparatively accurate or timely. This study is unique in both comparing and aggregating four traditional data sources, and considering critical attributes beyond accuracy. We believe that the results of this study generalize to many other domains.

@article{pathakJPM15,
    Author = {Pathak, Deepak and
    Rothschild, David and
    Dudik, Miro},
    Title = {A Comparison Of Forecasting
    Methods: Fundamentals, Polling,
    Prediction Markets, and Experts},
    Journal = {Journal of Prediction Markets (JPM)},
    Year = {2015}
}
Technical Reports

Constrained Structured Regression with Convolutional Neural Networks
Deepak Pathak, Philipp Krähenbühl, Stella X. Yu and Trevor Darrell
arXiv:1511.07497, 2015

pdf | abstract | bibtex | arXiv

Convolutional Neural Networks (CNNs) have recently emerged as the dominant model in computer vision. If provided with enough training data, they predict almost any visual quantity. In a discrete setting, such as classification, CNNs are not only able to predict a label but often predict a confidence in the form of a probability distribution over the output space. In continuous regression tasks, such a probability estimate is often lacking. We present a regression framework which models the output distribution of neural networks. This output distribution allows us to infer the most likely labeling following a set of physical or modeling constraints. These constraints capture the intricate interplay between different input and output variables, and complement the output of a CNN. However, they may not hold everywhere. Our setup further allows us to learn a confidence with which a constraint holds, in the form of a distribution over the constraint satisfaction. We evaluate our approach on the problem of intrinsic image decomposition, and show that constrained structured regression significantly improves the state of the art.

@article{pathakArxiv15,
    Author = {Pathak, Deepak and
    Kr\"ahenb\"uhl, Philipp and
    Yu, Stella X. and
    Darrell, Trevor},
    Title = {Constrained Structured
    Regression with Convolutional
    Neural Networks},
    Journal = {arXiv preprint arXiv:1511.07497},
    Year = {2015}
}
Teaching

CS189/289: Introduction to Machine Learning - Fall '15 (GSI)
Instructors: Prof. Alexei A. Efros and Dr. Isabelle Guyon

CS280: Computer Vision - Spring '16 (GSI)
Instructors: Prof. Trevor Darrell and Prof. Alexei A. Efros

Awards
  • Facebook Graduate Fellowship (2018-2020)
  • NVIDIA Graduate Fellowship (2017-2018)
  • Snapchat Inc. Graduate Fellowship (2017)
  • Gold Medal for the first rank in the CS department at IIT Kanpur (2014)
  • Best Undergraduate Thesis Award at IIT Kanpur (2014)
