Deepak Pathak
Email:

I am a third year graduate student in Computer Science at UC Berkeley advised by Prof. Trevor Darrell, researching computer vision and machine learning.

Earlier, I graduated from IIT Kanpur with Bachelors in Computer Science and Engineering in 2014. I did my undergraduate thesis with Prof. Amitabha Mukerjee on unsupervised anomaly detection in surveillance videos. I have also spent time at Microsoft Research in New York City lab working on forecasting and prediction markets.

I spent Summer 2016 at Facebook AI Research (FAIR, Seattle) working with Ross Girshick, Bharath Hariharan and Piotr Dollár.

CV | Google Scholar | Github

News
Publications
sym sym

[NEW] Curiosity-driven Exploration by Self-supervised Prediction
Deepak Pathak, Pulkit Agrawal, Alexei A. Efros and Trevor Darrell
International Conference on Machine Learning (ICML), 2017
Oral Spotlight at CVPR'17 Workshop on Deep Learning for Robotic Vision

project webpage | pdf | abstract | bibtex | arXiv | code | video | in the media

In many real-world scenarios, rewards extrinsic to the agent are extremely sparse, or absent altogether. In such cases, curiosity can serve as an intrinsic reward signal to enable the agent to explore its environment and learn skills that might be useful later in its life. We formulate curiosity as the error in an agent's ability to predict the consequence of its own actions in a visual feature space learned by a self-supervised inverse dynamics model. Our formulation scales to high-dimensional continuous state spaces like images, bypasses the difficulties of directly predicting pixels, and, critically, ignores the aspects of the environment that cannot affect the agent. The proposed approach is evaluated in two environments: VizDoom and Super Mario Bros. Three broad settings are investigated: 1) sparse extrinsic reward, where curiosity allows for far fewer interactions with the environment to reach the goal; 2) exploration with no extrinsic reward, where curiosity pushes the agent to explore more efficiently; and 3) generalization to unseen scenarios (e.g. new levels of the same game) where the knowledge gained from earlier experience helps the agent explore new places much faster than starting from scratch.

@inproceedings{pathakICMl17curiosity,
    Author = {Pathak, Deepak and
    Agrawal, Pulkit and
    Efros, Alexei A. and
    Darrell, Trevor},
    Title = {Curiosity-driven Exploration
    by Self-supervised Prediction},
    Booktitle = {ICML},
    Year = {2017}
}
sym

[NEW] Learning Features by Watching Objects Move
Deepak Pathak, Ross Girshick, Piotr Dollár, Trevor Darrell and Bharath Hariharan
Computer Vision and Pattern Recognition (CVPR), 2017
Oral Talk at CVPR'17 Workshop on YouTube-8M Large-Scale Video Understanding

project webpage | pdf | abstract | bibtex | arXiv | code

This paper presents a novel yet intuitive approach to unsupervised feature learning. Inspired by the human visual system, we explore whether low-level motion-based grouping cues can be used to learn an effective visual representation. Specifically, we use unsupervised motion-based segmentation on videos to obtain segments, which we use as `pseudo ground truth' to train a convolutional network to segment objects from a single frame. Given the extensive evidence that motion plays a key role in the development of the human visual system, we hope that this straightforward approach to unsupervised learning will be more effective than cleverly designed `pretext' tasks studied in the literature. Indeed, our extensive experiments show that this is the case. When used for transfer learning on object detection, our representation significantly outperforms previous unsupervised approaches across multiple settings, especially when training data for the target task is scarce.

@inproceedings{pathakCVPR17learning,
    Author = {Pathak, Deepak and
    Girshick, Ross and
    Doll{\'a}r, Piotr and
    Darrell, Trevor and
    Hariharan, Bharath},
    Title = {Learning Features
    by Watching Objects Move},
    Booktitle = {CVPR},
    Year = {2017}
}
sym

Context Encoders: Feature Learning by Inpainting
Deepak Pathak, Philipp Krähenbühl, Jeff Donahue, Trevor Darrell and Alexei A. Efros
Computer Vision and Pattern Recognition (CVPR), 2016

project webpage | pdf w/ supp | abstract | bibtex | arXiv | code | slides

We present an unsupervised visual feature learning algorithm driven by context-based pixel prediction. By analogy with auto-encoders, we propose Context Encoders -- a convolutional neural network trained to generate the contents of an arbitrary image region conditioned on its surroundings. In order to succeed at this task, context encoders need to both understand the content of the entire image, as well as produce a plausible hypothesis for the missing part(s). When training context encoders, we have experimented with both a standard pixel-wise reconstruction loss, as well as a reconstruction plus an adversarial loss. The latter produces much sharper results because it can better handle multiple modes in the output. We found that a context encoder learns a representation that captures not just appearance but also the semantics of visual structures. We quantitatively demonstrate the effectiveness of our learned features for CNN pre-training on classification, detection, and segmentation tasks. Furthermore, context encoders can be used for semantic inpainting tasks, either stand-alone or as initialization for non-parametric methods.

@inproceedings{pathakCVPR16context,
    Author = {Pathak, Deepak and
    Kr\"ahenb\"uhl, Philipp and
    Donahue, Jeff and
    Darrell, Trevor and
    Efros, Alexei A.},
    Title = {Context Encoders:
    Feature Learning by Inpainting},
    Booktitle = {CVPR},
    Year = {2016}
}
sym

Large Scale Visual Recognition through Adaptation using Joint Representation and Multiple Instance Learning
Judy Hoffman, Deepak Pathak, Eric Tzeng, Jonathan Long, Sergio Guadarrama, Trevor Darrell and Kate Saenko
Journal of Machine Learning Research (JMLR), 2016

pdf | abstract | bibtex | jmlr

A major barrier towards scaling visual recognition systems is the difficulty of obtaining labeled images for large numbers of categories. Recently, deep convolutional neural networks (CNNs) trained used 1.2M+ labeled images have emerged as clear winners on object classification benchmarks. Unfortunately, only a small fraction of those labels are available with bounding box localization for training the detection task and even fewer pixel level annotations are available for semantic segmentation. It is much cheaper and easier to collect large quantities of image-level labels from search engines than it is to collect scene-centric images with precisely localized labels. We develop methods for learning large scale recognition models which exploit joint training over both weak (image-level) and strong (bounding box) labels and which transfer learned perceptual representations from strongly-labeled auxiliary tasks. We provide a novel formulation of a joint multiple instance learning method that includes examples from object-centric data with image-level labels when available, and also performs domain transfer learning to improve the underlying detector representation. We then show how to use our large scale detectors to produce pixel level annotations. Using our method, we produce a >7.6K category detector and release code and models at lsda.berkeleyvision.org.

@inproceedings{pathakJMLR16,
    Author = {Hoffman, Judy and
    Pathak, Deepak and
    Tzeng, Eric and
    Long, Jonathan and
    Guadarrama, Sergio and
    Darrell, Trevor and
    Saenko, Kate},
    Title = {Large Scale Visual Recognition
    through Adaptation using Joint
    Representation and Multiple Instance
    Learning},
    Booktitle = {JMLR},
    Year = {2016}
}
sym

Constrained Convolutional Neural Networks for Weakly Supervised Segmentation
Deepak Pathak, Philipp Krähenbühl and Trevor Darrell
International Conference on Computer Vision (ICCV), 2015

pdf | supp | abstract | bibtex | arXiv | code

We present an approach to learn a dense pixel-wise labeling from image-level tags. Each image-level tag imposes constraints on the output labeling of a Convolutional Neural Network (CNN) classifier. We propose Constrained CNN (CCNN), a method which uses a novel loss function to optimize for any set of linear constraints on the output space (i.e. predicted label distribution) of a CNN. Our loss formulation is easy to optimize and can be incorporated directly into standard stochastic gradient descent optimization. The key idea is to phrase the training objective as a biconvex optimization for linear models, which we then relax to nonlinear deep networks. Extensive experiments demonstrate the generality of our new learning framework. The constrained loss yields state-of-the-art results on weakly supervised semantic image segmentation. We further demonstrate that adding slightly more supervision can greatly improve the performance of the learning algorithm.

@inproceedings{pathakICCV15ccnn,
    Author = {Pathak, Deepak and
    Kr\"ahenb\"uhl, Philipp and
    Darrell, Trevor},
    Title = {Constrained Convolutional
    Neural Networks for Weakly
    Supervised Segmentation},
    Booktitle = {ICCV},
    Year = {2015}
}
sym

Detector Discovery in the Wild: Joint Multiple Instance and Representation Learning
Judy Hoffman, Deepak Pathak, Trevor Darrell and Kate Saenko
Computer Vision and Pattern Recognition (CVPR), 2015

pdf | abstract | bibtex | arXiv

We develop methods for detector learning which exploit joint training over both weak (image-level) and strong (bounding box) labels and which transfer learned perceptual representations from strongly-labeled auxiliary tasks. Previous methods for weak-label learning often learn detector models independently using latent variable optimization, but fail to share deep representation knowledge across classes and usually require strong initialization. Other previous methods transfer deep representations from domains with strong labels to those with only weak labels, but do not optimize over individual latent boxes, and thus may miss specific salient structures for a particular category. We propose a model that subsumes these previous approaches, and simultaneously trains a representation and detectors for categories with either weak or strong labels present. We provide a novel formulation of a joint multiple instance learning method that includes examples from classification-style data when available, and also performs domain transfer learning to improve the underlying detector representation. Our model outperforms known methods on ImageNet-200 detection with weak labels.

@inproceedings{pathakCVPR15,
    Author = {Hoffman, Judy and
    Pathak, Deepak and
    Darrell, Trevor and
    Saenko, Kate},
    Title = {Detector Discovery
    in the Wild: Joint Multiple
    Instance and Representation
    Learning},
    Booktitle = {CVPR},
    Year = {2015}
}
sym

Fully Convolutional Multi-Class Multiple Instance Learning
Deepak Pathak, Evan Shelhamer, Jonathan Long and Trevor Darrell
International Conference on Learning Representations (ICLR), Workshop 2015

pdf | abstract | bibtex | arXiv

Multiple instance learning (MIL) can reduce the need for costly annotation in tasks such as semantic segmentation by weakening the required degree of supervision. We propose a novel MIL formulation of multi-class semantic segmentation learning by a fully convolutional network. In this setting, we seek to learn a semantic segmentation model from just weak image-level labels. The model is trained end-to-end to jointly optimize the representation while disambiguating the pixel-image label assignment. Fully convolutional training accepts inputs of any size, does not need object proposal pre-processing, and offers a pixelwise loss map for selecting latent instances. Our multi-class MIL loss exploits the further supervision given by images with multiple labels. We evaluate this approach through preliminary experiments on the PASCAL VOC segmentation challenge.

@inproceedings{pathakICLR15,
    Author = {Pathak, Deepak and
    Shelhamer, Evan and
    Long, Jonathan and
    Darrell, Trevor},
    Title = {Fully Convolutional
    Multi-Class Multiple Instance
    Learning},
    Booktitle = {ICLR Workshop},
    Year = {2015}
}
sym

Anomaly Localization in Topic-based Analysis of Surveillance Videos
Deepak Pathak, Abhijit Sharang and Amitabha Mukerjee
IEEE Winter Conference on Applications of Computer Vision (WACV), 2015

pdf | abstract | bibtex

Topic-models for video analysis have been used for unsupervised identification of normal activity in videos, thereby enabling the detection of anomalous actions. However, while intervals containing anomalies are detected, it has not been possible to localize the anomalous activities in such models. This is a challenging problem as the abnormal content is usually a small fraction of the entire video data and hence distinctions in terms of likelihood are unlikely. Here we propose a methodology to extend the topic based analysis with rich local descriptors incorporating quantized spatio-temporal gradient descriptors with image location and size information. The visual clips over this vocabulary are then represented in latent topic space using models like pLSA. Further, we introduce an algorithm to quantify the anomalous content in a video clip by projecting the learned topic space information. Using the algorithm, we detect whether the video clip is abnormal and if positive, localize the anomaly in spatio-temporal domain. We also contribute one real world surveillance video dataset for comprehensive evaluation of the proposed algorithm. Experiments are presented on the proposed and two other standard surveillance datasets.

@inproceedings{pathakWACV15,
    Author = {Pathak, Deepak and
    Sharang, Abhijit and
    Mukerjee, Amitabha},
    Title = {Anomaly Localization
    in Topic-based Analysis of
    Surveillance Videos},
    Booktitle = {WACV},
    Year = {2015}
}
sym

Where is my Friend? - Person identification in Social Networks
Deepak Pathak, Sai Nitish Satyavolu and Vinay P. Namboodiri
IEEE Conference on Automatic Face and Gesture Recognition (FG), 2015

pdf | abstract | bibtex

One of the interesting applications of computer vision is to be able to identify or detect persons in real world. This problem has been posed in the context of identifying people in television series or in multi-camera networks. However, a common scenario for this problem is to be able to identify people among images prevalent on social networks. In this paper we present a method that aims to solve this problem in real world conditions where the person can be in any pose, profile and orientation and the face itself is not always clearly visible. Moreover, we show that the problem can be solved with as weak supervision only a label whether the person is present or not, which is usually the case as people are tagged in social networks. This is challenging as there can be ambiguity in association of the right person. The problem is solved in this setting using a latent max-margin formulation where the identity of the person is the latent parameter that is classified. This framework builds on other off the shelf computer vision techniques for person detection and face detection and is able to also account for inaccuracies of these components. The idea is to model the complete person in addition to face, that too with weak supervision. We also contribute three real-world datasets that we have created for extensive evaluation of the solution. We show using these datasets that the problem can be effectively solved using the proposed method.

@inproceedings{pathakFG15,
    Author = {Pathak, Deepak and
    Satyavolu, Sai Nitish and
    Namboodiri, Vinay P.},
    Title = {Where is my Friend? -
    Person identification in Social
    Networks},
    Booktitle = {Automatic Face and
    Gesture Recognition (FG)},
    Year = {2015}
}
sym

A Comparison Of Forecasting Methods: Fundamentals, Polling, Prediction Markets, and Experts
Deepak Pathak, David Rothschild and Miro Dudík
Journal of Prediction Markets (JPM), 2015

pdf | abstract | bibtex | predictions2014 | predictions2016

We compare Oscar forecasts derived from four data types (fundamentals, polling, prediction markets, and domain experts) across three attributes (accuracy, timeliness and cost effectiveness). Fundamentals-based forecasts are relatively expensive to construct, an attribute the academic literature frequently ignores, and update slowly over time, constraining their accuracy. However, fundamentals provide valuable insights into the relationship between key indicators for nominated movies and their chances of victory. For instance, we find that the performance in other awards shows is highly predictive of the Oscar victory whereas box office results are not. Polling- based forecasts have the potential to be both accurate and timely. Timeliness requires incentives for frequent responses by high-information users. Accuracy is achieved by a proper transformation of raw polls. Prediction market prices are accurate forecasts, but can be further improved by simple transformations of raw prices, yielding the most accurate forecasts in our study. Expert forecasts exhibit some characteristics of fundamental models, but are generally not comparatively accurate or timely. This study is unique in both comparing and aggregating four traditional data sources, and considering critical attributes beyond accuracy. We believe that the results of this study generalize to many other domains.

@inproceedings{pathakJPM15,
    Author = {Pathak, Deepak and
    Rothschild, David and
    Dudik, Miro},
    Title = {A Comparison Of Forecasting
    Methods: Fundamentals, Polling,
    Prediction Markets, and Experts},
    Booktitle = {Journal of Prediction Markets (JPM)},
    Year = {2015}
}
Technical Reports
sym

Constrained Structured Regression with Convolutional Neural Networks
Deepak Pathak, Philipp Krähenbühl, Stella X. Yu and Trevor Darrell
arXiv:1511.07497, 2015

pdf | abstract | bibtex | arXiv

Convolutional Neural Networks (CNNs) have recently emerged as the dominant model in computer vision. If provided with enough training data, they predict almost any visual quantity. In a discrete setting, such as classification, CNNs are not only able to predict a label but often predict a confidence in the form of a probability distribution over the output space. In continuous regression tasks, such a probability estimate is often lacking. We present a regression framework which models the output distribution of neural networks. This output distribution allows us to infer the most likely labeling following a set of physical or modeling constraints. These constraints capture the intricate interplay between different input and output variables, and complement the output of a CNN. However, they may not hold everywhere. Our setup further allows to learn a confidence with which a constraint holds, in the form of a distribution of the constrain satisfaction. We evaluate our approach on the problem of intrinsic image decomposition, and show that constrained structured regression significantly increases the state-of-the-art.

@inproceedings{pathakArxiv15,
    Author = {Pathak, Deepak and
    Kr\"ahenb\"uhl, Philipp and
    Yu, Stella X. and
    Darrell, Trevor},
    Title = {Constrained Structured
    Regression with Convolutional
    Neural Networks},
    Booktitle = {arXiv:1511.07497},
    Year = {2015}
}
Teaching
pacman

CS189/289: Introduction to Machine Learning - Fall '15 (GSI)
Instructor: Prof. Alexei A. Efros and Dr. Isabelle Guyon

CS280: Computer Vision - Spring '16 (GSI)
Instructor: Prof. Trevor Darrell and Prof. Alexei A. Efros


Template: this, this, this and this