Publications

Doris Xin, Devin Petersohn, Dixin Tang, Yifan Wu, Joseph E. Gonzalez, Joseph M. Hellerstein, Anthony D. Joseph, and Aditya G. Parameswaran. "Enhancing the Interactivity of Dataframe Queries by Leveraging Think Time." CoRR (arXiv), 2021.

We propose opportunistic evaluation, a framework for accelerating interactions with dataframes. Interactive latency is critical for iterative, human-in-the-loop dataframe workloads for supporting exploratory data analysis. Opportunistic evaluation significantly reduces interactive latency by 1) prioritizing computation directly relevant to the interactions and 2) leveraging think time for asynchronous background computation for non-critical operators that might be relevant to future interactions. We show, through empirical analysis, that current user behavior presents ample opportunities for optimization, and the solutions we propose effectively harness such opportunities.
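
To make the idea concrete, here is a minimal sketch (not the paper's system; the OpportunisticFrame class and its methods are hypothetical) of deferring non-critical dataframe operators to a background thread that runs during user think time, while critical-path interactions are computed immediately:

import queue
import threading

import pandas as pd


class OpportunisticFrame:
    """Hypothetical sketch: run non-critical operators in the background during think time."""

    def __init__(self, df):
        self.df = df
        self._pending = queue.Queue()   # operators that might matter for future interactions
        self._cache = {}                # results precomputed during think time
        threading.Thread(target=self._background, daemon=True).start()

    def schedule(self, name, fn):
        """Register a non-critical operator speculated to be useful later."""
        self._pending.put((name, fn))

    def interact(self, name, fn):
        """Critical-path interaction: reuse background work if ready, else compute now."""
        return self._cache[name] if name in self._cache else fn(self.df)

    def _background(self):
        while True:                     # consumes think time between interactions
            name, fn = self._pending.get()
            self._cache[name] = fn(self.df)


frame = OpportunisticFrame(pd.DataFrame({"x": range(10)}))
frame.schedule("describe", lambda df: df.describe())  # speculative, off the critical path
print(frame.interact("head", lambda df: df.head()))   # prioritized, returns immediately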

@article{xin2021enhancing,
 abstract = {We propose opportunistic evaluation, a framework for accelerating interactions with dataframes. Interactive latency is critical for iterative, human-in-the-loop dataframe workloads for supporting exploratory data analysis. Opportunistic evaluation significantly reduces interactive latency by 1) prioritizing computation directly relevant to the interactions and 2) leveraging think time for asynchronous background computation for non-critical operators that might be relevant to future interactions. We show, through empirical analysis, that current user behavior presents ample opportunities for optimization, and the solutions we propose effectively harness such opportunities.},
 archiveprefix = {arXiv},
 author = {Doris Xin and Devin Petersohn and Dixin Tang and Yifan Wu and Joseph E. Gonzalez and Joseph M. Hellerstein and Anthony D. Joseph and Aditya G. Parameswaran},
 eprint = {2103.02145},
 journal = {CoRR},
 keywords = {arxivpre},
 primaryclass = {cs.DB},
 title = {Enhancing the Interactivity of Dataframe Queries by Leveraging Think Time},
 url = {https://arxiv.org/abs/2103.02145},
 year = {2021}
}

Alvin Wan, Lisa Dunlap, Daniel Ho, Jihan Yin, Scott Lee, Suzanne Petryk, Sarah Adel Bargal, and Joseph E. Gonzalez. "NBDT: Neural-Backed Decision Tree." International Conference on Learning Representations, 2021.

Machine learning applications such as finance and medicine demand accurate and justifiable predictions, barring most deep learning methods from use. In response, previous work combines decision trees with deep learning, yielding models that (1) sacrifice interpretability for accuracy or (2) sacrifice accuracy for interpretability. We forgo this dilemma by jointly improving accuracy and interpretability using Neural-Backed Decision Trees (NBDTs). NBDTs replace a neural network's final linear layer with a differentiable sequence of decisions and a surrogate loss. This forces the model to learn high-level concepts and lessens reliance on highly-uncertain decisions, yielding (1) accuracy: NBDTs match or outperform modern neural networks on CIFAR and ImageNet, and generalize better to unseen classes by up to 16%. Furthermore, our surrogate loss improves the original model's accuracy by up to 2%. NBDTs also afford (2) interpretability: improving human trust by clearly identifying model mistakes and assisting in dataset debugging.
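
As a loose illustration of tree-structured inference over the final layer's class weights (the paper's induced hierarchy and soft decision rule differ in detail; the tiny tree, class names, and random weights below are made up):

import numpy as np

# Hypothetical setup: rows of W stand in for the trained network's final-linear-layer weights.
rng = np.random.default_rng(0)
classes = ["cat", "dog", "car", "truck"]
W = rng.normal(size=(len(classes), 8))

# A hand-built hierarchy over the classes; an inner node is represented by the
# mean of its descendant leaves' weight vectors (one simple induction scheme).
tree = (("cat", "dog"), ("car", "truck"))

def node_vector(node):
    if isinstance(node, str):                        # leaf: that class's weight row
        return W[classes.index(node)]
    return np.mean([node_vector(child) for child in node], axis=0)

def nbdt_predict(x, node=tree):
    """Descend the hierarchy, taking the child whose representative best matches x."""
    if isinstance(node, str):
        return node
    scores = [node_vector(child) @ x for child in node]
    return nbdt_predict(x, node[int(np.argmax(scores))])

x = rng.normal(size=8)                               # stand-in for penultimate-layer features
print(nbdt_predict(x))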

@inproceedings{wan2021nbdt,
 abstract = {Machine learning applications such as finance and medicine demand accurate and justifiable predictions, barring most deep learning methods from use. In response, previous work combines decision trees with deep learning, yielding models that (1) sacrifice interpretability for accuracy or (2) sacrifice accuracy for interpretability. We forgo this dilemma by jointly improving accuracy and interpretability using Neural-Backed Decision Trees (NBDTs). NBDTs replace a neural network's final linear layer with a differentiable sequence of decisions and a surrogate loss. This forces the model to learn high-level concepts and lessens reliance on highly-uncertain decisions, yielding (1) accuracy: NBDTs match or outperform modern neural networks on CIFAR, ImageNet and better generalize to unseen classes by up to 16\%. Furthermore, our surrogate loss improves the original model's accuracy by up to 2\%. NBDTs also afford (2) interpretability: improving human trust by clearly identifying model mistakes and assisting in dataset debugging.},
 author = {Alvin Wan and Lisa Dunlap and Daniel Ho and Jihan Yin and Scott Lee and Suzanne Petryk and Sarah Adel Bargal and Joseph E. Gonzalez},
 booktitle = {International Conference on Learning Representations},
 code = {https://github.com/alvinwan/neural-backed-decision-trees},
 keywords = {peerrev, selected},
 title = { {NBDT}: Neural-Backed Decision Tree},
 url = {https://openreview.net/forum?id=mCLVeEpplNE},
 year = {2021}
}

Vidit Saxena, Joakim Jalden, and Joseph E. Gonzalez. "Thompson Sampling for Linearly Constrained Bandits." Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics (AIStats), 2020.

We address multi-armed bandits (MAB) where the objective is to maximize the cumulative reward under a probabilistic linear constraint. For a few real-world instances of this problem, constrained extensions of the well-known Thompson Sampling (TS) heuristic have recently been proposed. However, finite-time analysis of constrained TS is challenging; as a result, only O(√T) bounds on the cumulative reward loss (i.e., the regret) are available. In this paper, we describe LinConTS, a TS-based algorithm for bandits that place a linear constraint on the probability of earning a reward in every round. We show that for LinConTS, the regret as well as the cumulative constraint violations are upper bounded by O(log T). We develop a proof technique that relies on careful analysis of the dual problem and combine it with recent theoretical work on unconstrained TS. Through numerical experiments on two real-world datasets, we demonstrate that LinConTS outperforms an asymptotically optimal upper confidence bound (UCB) scheme in terms of simultaneously minimizing the regret and the violation.
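
A rough sketch of a constrained Thompson Sampling loop in this spirit (Bernoulli rewards and constraints, an assumed threshold tau, and a per-round LP over arm-play probabilities; this is a simplification for illustration, not the paper's exact LinConTS algorithm):

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
K, T, tau = 3, 2000, 0.5                   # arms, rounds, constraint threshold (assumed)
true_reward = np.array([0.3, 0.5, 0.8])    # hypothetical Bernoulli reward means
true_safe = np.array([0.9, 0.6, 0.4])      # hypothetical constraint-success probabilities

rs, rf = np.ones(K), np.ones(K)            # Beta posterior counts for rewards
cs, cf = np.ones(K), np.ones(K)            # Beta posterior counts for the constraint

for t in range(T):
    mu, nu = rng.beta(rs, rf), rng.beta(cs, cf)    # Thompson samples for each arm
    # LP over arm-play probabilities p: maximize mu.p subject to nu.p >= tau and sum(p) = 1.
    res = linprog(-mu, A_ub=[-nu], b_ub=[-tau],
                  A_eq=[np.ones(K)], b_eq=[1.0], bounds=[(0, 1)] * K)
    p = np.clip(res.x, 0, None) if res.success else np.full(K, 1.0 / K)
    a = rng.choice(K, p=p / p.sum())
    r = float(rng.random() < true_reward[a])       # observe reward and constraint outcomes
    c = float(rng.random() < true_safe[a])
    rs[a] += r; rf[a] += 1 - r
    cs[a] += c; cf[a] += 1 - c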

@inproceedings{pmlr-v108-saxena20a,
 abstract = {We address multi-armed bandits (MAB) where the objective is to maximize the cumulative reward under a probabilistic linear constraint. For a few real-world instances of this problem, constrained extensions of the well-known Thompson Sampling (TS) heuristic have recently been proposed. However, finite-time analysis of constrained TS is challenging; as a result, only O( sqrt( T ) ) bounds on the cumulative reward loss (i.e., the regret) are available. In this paper, we describe LinConTS, a TS-based algorithm for bandits that place a linear constraint on the probability of earning a reward in every round. We show that for LinConTS, the regret as well as the cumulative constraint violations are upper bounded by O( log ( T ) ). We develop a proof technique that relies on careful analysis of the dual problem and combine it with recent theoretical work on unconstrained TS. Through numerical experiments on two real-world datasets, we demonstrate that LinConTS outperforms an asymptotically optimal upper confidence bound (UCB) scheme in terms of simultaneously minimizing the regret and the violation.},
 address = {Online},
 author = {Vidit Saxena and Joakim Jalden and Joseph E. Gonzalez},
 booktitle = {Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics (AIStats)},
 date-modified = {2020-08-02 11:27:35 -0700},
 editor = {Chiappa, Silvia and Calandra, Roberto},
 keywords = {peerrev},
 month = {8},
 pages = {1999--2009},
 pdf = {http://proceedings.mlr.press/v108/saxena20a/saxena20a.pdf},
 publisher = {PMLR},
 series = {Proceedings of Machine Learning Research},
 title = {Thompson Sampling for Linearly Constrained Bandits},
 url = {http://proceedings.mlr.press/v108/saxena20a.html},
 volume = {108},
 year = {2020}
}

Daniel Rothchild, Ashwinee Panda, Enayat Ullah, Nikita Ivkin, Ion Stoica, Vladimir Braverman, Joseph E. Gonzalez, and Raman Arora. "FetchSGD: Communication-Efficient Federated Learning with Sketching." Proceedings of the International Conference on Machine Learning (ICML), 2020.

Existing approaches to federated learning suffer from a communication bottleneck as well as convergence issues due to sparse client participation. In this paper we introduce a novel algorithm, called FetchSGD, to overcome these challenges. FetchSGD compresses model updates using a Count Sketch, and then takes advantage of the mergeability of sketches to combine model updates from many workers. A key insight in the design of FetchSGD is that, because the Count Sketch is linear, momentum and error accumulation can both be carried out within the sketch. This allows the algorithm to move momentum and error accumulation from clients to the central aggregator, overcoming the challenges of sparse client participation while still achieving high compression rates and good convergence. We prove that FetchSGD has favorable convergence guarantees, and we demonstrate its empirical effectiveness by training two residual networks and a transformer model.
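
A minimal sketch of the core mechanism with a toy Count Sketch (not the paper's implementation; all sketch instances share hash functions via a fixed seed): because the sketch is linear, client sketches sum on the server, and both momentum and error accumulation can be maintained directly in sketch space before unsketching only the heavy coordinates.

import numpy as np

class CountSketch:
    """Toy linear sketch; all instances built with the same seed share hash functions."""

    def __init__(self, d, rows=5, cols=256, seed=0):
        rng = np.random.default_rng(seed)
        self.h = rng.integers(0, cols, size=(rows, d))      # bucket hashes
        self.s = rng.choice([-1.0, 1.0], size=(rows, d))    # sign hashes
        self.table = np.zeros((rows, cols))

    def add(self, vec):
        for j in range(self.h.shape[0]):
            np.add.at(self.table[j], self.h[j], self.s[j] * vec)

    def estimate(self):
        rows = [self.s[j] * self.table[j, self.h[j]] for j in range(self.h.shape[0])]
        return np.median(np.stack(rows), axis=0)

d = 1000
momentum, error = CountSketch(d), CountSketch(d)             # server state lives in sketch space

def server_round(client_grads, lr=0.1, beta=0.9, k=20):
    agg = CountSketch(d)
    for g in client_grads:
        agg.add(g)                  # stands in for summing sketches sent by clients (linearity)
    momentum.table = beta * momentum.table + agg.table       # momentum inside the sketch
    error.table += lr * momentum.table                       # error accumulation inside the sketch
    est = error.estimate()
    topk = np.argsort(np.abs(est))[-k:]                      # unsketch only the heavy coordinates
    update = np.zeros(d)
    update[topk] = est[topk]
    applied = CountSketch(d)
    applied.add(update)
    error.table -= applied.table                             # remove what was applied to the model
    return update

grads = [np.random.default_rng(i).normal(size=d) for i in range(8)]
print(np.count_nonzero(server_round(grads)))                 # sparse model update (k coordinates)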

@inproceedings{FetchSGDICML20,
 abstract = {Existing approaches to federated learning suffer from a communication bottleneck as well as convergence issues due to sparse client participation. In this paper we introduce a novel algorithm, called FetchSGD, to overcome these challenges. FetchSGD compresses model updates using a Count Sketch, and then takes advantage of the mergeability of sketches to combine model updates from many workers. A key insight in the design of FetchSGD is that, because the Count Sketch is linear, momentum and error accumulation can both be carried out within the sketch. This allows the algorithm to move momentum and error accumulation from clients to the central aggregator, overcoming the challenges of sparse client participation while still achieving high compression rates and good convergence. We prove that FetchSGD has favorable convergence guarantees, and we demonstrate its empirical effectiveness by training two residual networks and a transformer model.},
 author = {Daniel Rothchild and Ashwinee Panda and Enayat Ullah and Nikita Ivkin and Ion Stoica and Vladimir Braverman and Joseph E. Gonzalez and Raman Arora},
 booktitle = {Proceedings of the International Conference on Machine Learning (ICML)},
 date-modified = {2020-08-02 11:27:35 -0700},
 keywords = {peerrev},
 month = {7},
 series = {ICML'20},
 title = { {FetchSGD}: Communication-Efficient Federated Learning with Sketching},
 url = {https://proceedings.icml.cc/static/paper\%5Ffiles/icml/2020/5927-Paper.pdf},
 year = {2020}
}

Xin Wang, Thomas E. Huang, Trevor Darrell, Joseph E. Gonzalez, and Fisher Yu. "Frustratingly Simple Few-Shot Object Detection." Proceedings of the International Conference on Machine Learning (ICML), 2020.

Detecting rare objects from a few examples is an emerging problem. Prior works show meta-learning is a promising approach. But, fine-tuning techniques have drawn scant attention. We find that fine-tuning only the last layer of existing detectors on rare classes is crucial to the few-shot object detection task. Such a simple approach outperforms the meta-learning methods by roughly 2 to 20 points on current benchmarks and sometimes even doubles the accuracy of the prior methods. However, the high variance in the few samples often leads to the unreliability of existing benchmarks. We revise the evaluation protocols by sampling multiple groups of training examples to obtain stable comparisons and build new benchmarks based on three datasets: PASCAL VOC, COCO and LVIS. Again, our fine-tuning approach establishes a new state of the art on the revised benchmarks. The code as well as the pretrained models are available at https://github.com/ucbdrive/few-shot-object-detection.
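
The last-layer fine-tuning recipe is easy to sketch with torchvision's Faster R-CNN (the paper's implementation differs in detail, e.g. it uses a cosine-similarity classifier, and the class count below is made up): freeze everything except the final box predictor and train only that layer on the few-shot data.

import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Assume the detector has already been trained on abundant base-class data.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn()

num_classes = 21                                  # hypothetical: 20 classes + background
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

for p in model.parameters():                      # freeze backbone, RPN, and box head...
    p.requires_grad = False
for p in model.roi_heads.box_predictor.parameters():
    p.requires_grad = True                        # ...fine-tune only the last layer

optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad], lr=0.001, momentum=0.9)
# Training on the K-shot set then proceeds as usual:
#   losses = model(images, targets); sum(losses.values()).backward(); optimizer.step()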

@inproceedings{FSFewShotICML20,
 abstract = {Detecting rare objects from a few examples is an emerging problem. Prior works show meta-learning is a promising approach. But, fine-tuning techniques have drawn scant attention. We find that fine-tuning only the last layer of existing detectors on rare classes is crucial to the few-shot object detection task. Such a simple approach outperforms the meta-learning methods by roughly 2\~20 points on current benchmarks and sometimes even doubles the accuracy of the prior methods. However, the high variance in the few samples often leads to the unreliability of existing benchmarks. We revise the evaluation protocols by sampling multiple groups of training examples to obtain stable comparisons and build new benchmarks based on three datasets: PASCAL VOC, COCO and LVIS. Again, our fine-tuning approach establishes a new state of the art on the revised benchmarks. The code as well as the pretrained models are available at this URL.},
 author = {Xin Wang and Thomas E. Huang and Trevor Darrell and Joseph E. Gonzalez and Fisher Yu},
 booktitle = {Proceedings of the International Conference on Machine Learning (ICML)},
 code = {https://github.com/ucbdrive/few-shot-object-detection},
 date-modified = {2020-08-02 11:27:35 -0700},
 keywords = {peerrev},
 month = {7},
 series = {ICML'20},
 title = {Frustratingly Simple Few-Shot Object Detection},
 url = {https://proceedings.icml.cc/static/paper\%5Ffiles/icml/2020/2957-Paper.pdf},
 year = {2020}
}

Xiayue Charles Lin, Joseph E. Gonzalez, and Joseph M. Hellerstein. "Serverless Boom or Bust? An Analysis of Economic Incentives." 12th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud), 2020.

Serverless computing is a new paradigm that promises to free cloud users from the burden of having to provision and manage resources. However, the degree to which serverless computing will replace provisioned servers remains an open question. To address this, we develop an economic model that aims to quantify the value of serverless to providers and customers. A simple model of incentives for rational providers and customers allows us to see, in broad strokes, when and why serverless technologies are worth pursuing. By characterizing the conditions under which mutually beneficial economic incentives exist, our model suggests that many classes of customers can already benefit from switching to a serverless model and taking advantage of autoscaling at today's price points. Our model also helps characterize technical research directions that would be likely to have impact in the market.
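
The flavor of the break-even analysis can be conveyed with back-of-the-envelope arithmetic (the prices below are assumptions for illustration, not the paper's figures):

# Assumed illustrative prices; not the paper's numbers.
vm_price_per_hour = 0.10      # provisioned server, billed whether busy or idle
faas_price_per_hour = 0.60    # the same compute on a FaaS platform, billed only while running

# A provisioned server costs vm_price regardless of load, while serverless costs
# faas_price * utilization, so the two break even when utilization equals the price ratio.
breakeven = vm_price_per_hour / faas_price_per_hour
print(f"serverless is cheaper below {breakeven:.0%} average utilization")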

@inproceedings{Lin20,
 abstract = {
Serverless computing is a new paradigm that promises to free cloud users from the burden of having to provision and manage resources. However, the degree to which serverless computing will replace provisioned servers remains an open question.

To address this, we develop an economic model that aims to quantify the value of serverless to providers and customers. A simple model of incentives for rational providers and customers allows us to see, in broad strokes, when and why serverless technologies are worth pursuing. By characterizing the conditions under which mutually beneficial economic incentives exist, our model suggests that many classes of customers can already benefit from switching to a serverless model and taking advantage of autoscaling at today's price points. Our model also helps characterize technical research directions that would be likely to have impact in the market.
},
 author = {Xiayue Charles Lin and Joseph E. Gonzalez and Joseph M. Hellerstein},
 booktitle = {12th {USENIX} Workshop on Hot Topics in Cloud Computing ({HotCloud})},
 date-modified = {2020-08-02 11:27:35 -0700},
 keywords = {peerrev},
 month = {7},
 publisher = { {USENIX} Association},
 title = {Serverless Boom or Bust? An Analysis of Economic Incentives},
 url = {https://www.usenix.org/conference/hotcloud20/presentation/lin},
 year = {2020}
}

Zhuohan Li, Eric Wallace, Sheng Shen, Kevin Lin, Kurt Keutzer, Dan Klein, and Joseph E. Gonzalez. "Train Big, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers." Proceedings of the International Conference on Machine Learning (ICML), 2020.

Since hardware resources are limited, the objective of training deep learning models is typically to maximize accuracy subject to the time and memory constraints of training and inference. We study the impact of model size in this setting, focusing on Transformer models for NLP tasks that are limited by compute: self-supervised pretraining and high-resource machine translation. We first show that even though smaller Transformer models execute faster per iteration, wider and deeper models converge in significantly fewer steps. Moreover, this acceleration in convergence typically outpaces the additional computational overhead of using larger models. Therefore, the most compute-efficient training strategy is to counterintuitively train extremely large models but stop after a small number of iterations. This leads to an apparent trade-off between the training efficiency of large Transformer models and the inference efficiency of small Transformer models. However, we show that large models are more robust to compression techniques such as quantization and pruning than small models. Consequently, one can get the best of both worlds: heavily compressed, large models achieve higher accuracy than lightly compressed, small models.
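
The "compress afterwards" step can be sketched with off-the-shelf PyTorch utilities (magnitude pruning plus dynamic int8 quantization applied to a stand-in for an already-trained large model; the sparsity level and layer sizes are arbitrary):

import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Stand-in for a large Transformer that has already been trained and stopped early.
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))

for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.6)   # zero out 60% of weights
        prune.remove(module, "weight")                             # bake the pruning in

quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)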

@inproceedings{TrainBigICML20,
 abstract = {
Since hardware resources are limited, the objective of training deep learning models is typically to maximize accuracy subject to the time and memory constraints of training and inference. We study the impact of model size in this setting, focusing on Transformer models for NLP tasks that are limited by compute: self-supervised pretraining and high-resource machine translation. We first show that even though smaller Transformer models execute faster per iteration, wider and deeper models converge in significantly fewer steps. Moreover, this acceleration in convergence typically outpaces the additional computational overhead of using larger models. Therefore, the most compute-efficient training strategy is to counterintuitively train extremely large models but stop after a small number of iterations.

This leads to an apparent trade-off between the training efficiency of large Transformer models and the inference efficiency of small Transformer models. However, we show that large models are more robust to compression techniques such as quantization and pruning than small models. Consequently, one can get the best of both worlds: heavily compressed, large models achieve higher accuracy than lightly compressed, small models.
},
 author = {Zhuohan Li and Eric Wallace and Sheng Shen and Kevin Lin and Kurt Keutzer and Dan Klein and Joseph E. Gonzalez},
 booktitle = {Proceedings of the International Conference on Machine Learning (ICML)},
 date-modified = {2020-08-02 11:27:35 -0700},
 keywords = {peerrev, selected},
 month = {7},
 series = {ICML'20},
 title = {Train Big, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers},
 url = {https://proceedings.icml.cc/static/paper\%5Ffiles/icml/2020/6626-Paper.pdf},
 year = {2020}
}

Alvin Wan, Xiaoliang Dai, Peizhao Zhang, Zijian He, Yuandong Tian, Saining Xie, Bichen Wu, Matthew Yu, Tao Xu, Kan Chen, Peter Vajda, and Joseph E. Gonzalez. "FBNetV2: Differentiable Neural Architecture Search for Spatial and Channel Dimensions." Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

Differentiable Neural Architecture Search (DNAS) has demonstrated great success in designing state-of-the-art, efficient neural networks. However, DARTS-based DNAS's search space is small when compared to other search methods', since all candidate network layers must be explicitly instantiated in memory. To address this bottleneck, we propose a memory and computationally efficient DNAS variant: DMaskingNAS. This algorithm expands the search space by up to 10^14x over conventional DNAS, supporting searches over spatial and channel dimensions that are otherwise prohibitively expensive: input resolution and number of filters. We propose a masking mechanism for feature map reuse, so that memory and computational costs stay nearly constant as the search space expands. Furthermore, we employ effective shape propagation to maximize per-FLOP or per-parameter accuracy. The searched FBNetV2s yield state-of-the-art performance when compared with all previous architectures. With up to 421x less search cost, DMaskingNAS finds models with 0.9% higher accuracy, 15% fewer FLOPs than MobileNetV3-Small; and with similar accuracy but 20% fewer FLOPs than EfficientNet-B0. Furthermore, our FBNetV2 outperforms MobileNetV3 by 2.6% in accuracy, with equivalent model size. FBNetV2 models are open-sourced at https://github.com/facebookresearch/mobile-vision.
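
A rough sketch of the channel-masking trick (a single max-width convolution whose effective width is chosen by a Gumbel-softmax-weighted sum of binary masks; the module, widths, and parameter names below are illustrative, not the paper's code):

import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelMaskedConv(nn.Module):
    """Sketch of DMaskingNAS-style channel search: one max-width conv shared across widths."""

    def __init__(self, in_ch, max_out_ch=32, widths=(8, 16, 24, 32)):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, max_out_ch, kernel_size=3, padding=1)
        masks = torch.zeros(len(widths), max_out_ch)
        for i, w in enumerate(widths):
            masks[i, :w] = 1.0                        # binary mask keeping the first w channels
        self.register_buffer("masks", masks)
        self.alpha = nn.Parameter(torch.zeros(len(widths)))   # architecture parameters

    def forward(self, x, tau=1.0):
        g = F.gumbel_softmax(self.alpha, tau=tau)             # relaxed choice over widths
        mask = (g[:, None] * self.masks).sum(0)               # one effective channel mask
        return self.conv(x) * mask[None, :, None, None]       # feature maps are reused, just masked

layer = ChannelMaskedConv(3)
y = layer(torch.randn(2, 3, 16, 16))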

@inproceedings{FBNetV2,
 abstract = {Differentiable Neural Architecture Search (DNAS) has demonstrated great success in designing state-of-the-art, efficient neural networks. However, DARTS-based DNAS's search space is small when compared to other search methods', since all candidate network layers must be explicitly instantiated in memory. To address this bottleneck, we propose a memory and computationally efficient DNAS variant: DMaskingNAS. This algorithm expands the search space by up to \$10^{14}\times\$ over conventional DNAS, supporting searches over spatial and channel dimensions that are otherwise prohibitively expensive: input resolution and number of filters. We propose a masking mechanism for feature map reuse, so that memory and computational costs stay nearly constant as the search space expands. Furthermore, we employ effective shape propagation to maximize per-FLOP or per-parameter accuracy. The searched FBNetV2s yield state-of-the-art performance when compared with all previous architectures. With up to 421x less search cost, DMaskingNAS finds models with 0.9\% higher accuracy, 15\% fewer FLOPs than MobileNetV3-Small; and with similar accuracy but 20\% fewer FLOPs than Efficient-B0. Furthermore, our FBNetV2 outperforms MobileNetV3 by 2.6\% in accuracy, with equivalent model size. FBNetV2 models are open-sourced at https://github.com/facebookresearch/mobile-vision.},
 author = {Alvin Wan and Xiaoliang Dai and Peizhao Zhang and Zijian He and Yuandong Tian and Saining Xie and Bichen Wu and Matthew Yu and Tao Xu and Kan Chen and Peter Vajda and Joseph E. Gonzalez},
 booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
 code = {https://github.com/facebookresearch/mobile-vision},
 date-modified = {2020-08-02 11:27:35 -0700},
 keywords = {peerrev, selected},
 month = {6},
 title = { {FBNetV2}: Differentiable Neural Architecture Search for Spatial and Channel Dimensions},
 url = {https://arxiv.org/abs/2004.05565},
 year = {2020}
}

Rolando Garcia, Eric Liu, Vikram Sreekanti, Bobby Yan, Anusha Dandamudi, Joseph E. Gonzalez, Joseph M. Hellerstein, and Koushik Sen. "Hindsight Logging for Model Training." Proc. VLDB Endow., 2020.

In modern Machine Learning, model training is an iterative, experimental process that can consume enormous computation resources and developer time. To aid in that process, experienced model developers log and visualize program variables during training runs. Exhaustive logging of all variables is infeasible, so developers are left to choose between slowing down training via extensive conservative logging, or letting training run fast via minimalist optimistic logging that may omit key information. As a compromise, optimistic logging can be accompanied by program checkpoints; this allows developers to add log statements post-hoc, and "replay" desired log statements from checkpoint---a process we refer to as hindsight logging. Unfortunately, hindsight logging raises tricky problems in data management and software engineering. Done poorly, hindsight logging can waste resources and generate technical debt embodied in multiple variants of training code. In this paper, we present methodologies for efficient and effective logging practices for model training, with a focus on techniques for hindsight logging. Our goal is for experienced model developers to learn and adopt these practices. To make this easier, we provide an open-source suite of tools for Fast Low-Overhead Recovery (flor) that embodies our design across three tasks: (i) efficient background logging in Python, (ii) adaptive periodic checkpointing, and (iii) an instrumentation library that codifies hindsight logging for efficient and automatic record-replay of model-training. Model developers can use each flor tool separately as they see fit, or they can use flor in hands-free mode, entrusting it to instrument their code end-to-end for efficient record-replay. Our solutions leverage techniques from physiological transaction logs and recovery in database systems. Evaluations on modern ML benchmarks demonstrate that flor can produce fast checkpointing with small user-specifiable overheads (e.g. 7%), and still provide hindsight log replay times orders of magnitude faster than restarting training from scratch.
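
A toy sketch of the record-replay idea behind hindsight logging (the flor library itself works differently; the checkpoint filenames and the fake training step are made up): checkpoint periodically during the original fast run, then replay just the epoch of interest to execute a log statement added after the fact.

import pickle

def train(epochs, ckpt_every=5, replay_from=None, hindsight_log=None):
    state = {"weights": 0.0}
    start = replay_from or 0
    if replay_from is not None:
        with open(f"ckpt_{replay_from}.pkl", "rb") as f:
            state = pickle.load(f)                 # resume from the chosen checkpoint
    for epoch in range(start, epochs):
        if replay_from is None and epoch % ckpt_every == 0:
            with open(f"ckpt_{epoch}.pkl", "wb") as f:
                pickle.dump(state, f)              # periodic checkpoint during the fast run
        state["weights"] += 1.0                    # stand-in for one epoch of training
        if hindsight_log is not None:
            hindsight_log(epoch, state)            # log statement added post hoc
        if replay_from is not None:
            break                                  # replay only the epoch of interest

train(20)                                          # original run: minimal logging
train(20, replay_from=10,                          # hindsight: replay epoch 10 from its checkpoint
      hindsight_log=lambda e, s: print("epoch", e, "weights", s["weights"]))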

@article{rol2020hindsight,
 abstract = {In modern Machine Learning, model training is an iterative, experimental process that can consume enormous computation resources and developer time. To aid in that process, experienced model developers log and visualize program variables during training runs. Exhaustive logging of all variables is infeasible, so developers are left to choose between slowing down training via extensive conservative logging, or letting training run fast via minimalist optimistic logging that may omit key information. As a compromise, optimistic logging can be accompanied by program checkpoints; this allows developers to add log statements post-hoc, and "replay" desired log statements from checkpoint---a process we refer to as hindsight logging. Unfortunately, hindsight logging raises tricky problems in data management and software engineering. Done poorly, hindsight logging can waste resources and generate technical debt embodied in multiple variants of training code. In this paper, we present methodologies for efficient and effective logging practices for model training, with a focus on techniques for hindsight logging. Our goal is for experienced model developers to learn and adopt these practices. To make this easier, we provide an open-source suite of tools for Fast Low-Overhead Recovery (flor) that embodies our design across three tasks: (i) efficient background logging in Python, (ii) adaptive periodic checkpointing, and (iii) an instrumentation library that codifies hindsight logging for efficient and automatic record-replay of model-training. Model developers can use each flor tool separately as they see fit, or they can use flor in hands-free mode, entrusting it to instrument their code end-to-end for efficient record-replay. Our solutions leverage techniques from physiological transaction logs and recovery in database systems. Evaluations on modern ML benchmarks demonstrate that flor can produce fast checkpointing with small user-specifiable overheads (e.g. 7\%), and still provide hindsight log replay times orders of magnitude faster than restarting training from scratch.},
 author = {Rolando Garcia and Eric Liu and Vikram Sreekanti and Bobby Yan and Anusha Dandamudi and Joseph E. Gonzalez and Joseph M. Hellerstein and Koushik Sen},
 doi = {10.14778/3436905.3436925},
 issn = {2150-8097},
 issue_date = {December 2020},
 journal = {Proc. VLDB Endow.},
 keywords = {peerrev},
 month = {12},
 number = {4},
 numpages = {12},
 pages = {682--693},
 publisher = {VLDB Endowment},
 title = {Hindsight Logging for Model Training},
 url = {https://doi.org/10.14778/3436905.3436925},
 volume = {14},
 year = {2020}
}

Daniel Crankshaw, Gur-Eyal Sela, Corey Zumar, Xiangxi Mo, Joseph E. Gonzalez, Ion Stoica, and Alexey Tumanov. "InferLine: ML Inference Pipeline Composition Framework." Proceedings of the ACM Symposium on Cloud Computing, 2020.

The dominant cost in production machine learning workloads is not training individual models but serving predictions from increasingly complex prediction pipelines spanning multiple models, machine learning frameworks, and parallel hardware accelerators. Due to the complex interaction between model configurations and parallel hardware, prediction pipelines are challenging to provision and costly to execute when serving interactive latency-sensitive applications. This challenge is exacerbated by the unpredictable dynamics of bursty workloads. In this paper we introduce InferLine, a system which efficiently provisions and executes ML inference pipelines subject to end-to-end latency constraints by proactively optimizing and reactively controlling per-model configuration in a fine-grained fashion. Unpredictable changes in the serving workload are dynamically and cost-optimally accommodated with minimal service level degradation. InferLine introduces (1) automated model profiling and pipeline lineage extraction, (2) a fine-grain, cost-minimizing pipeline configuration planner, and (3) a fine-grain reactive controller. InferLine is able to configure and deploy prediction pipelines across a wide range of workload patterns and latency goals. It outperforms coarse-grained configuration alternatives by up to 7.6x in cost while achieving up to 32x lower SLO miss rate on real workloads and generalizes across state-of-the-art model serving frameworks.

@inproceedings{InferlineSOCC20,
 abstract = {
The dominant cost in production machine learning workloads is not training individual models but serving predictions from increasingly complex prediction pipelines spanning multiple models, machine learning frameworks, and parallel hardware accelerators. Due to the complex interaction between model configurations and parallel hardware, prediction pipelines are challenging to provision and costly to execute when serving interactive latency-sensitive applications. This challenge is exacerbated by the unpredictable dynamics of bursty workloads.

In this paper we introduce InferLine, a system which efficiently provisions and executes ML inference pipelines subject to end-to-end latency constraints by proactively optimizing and reactively controlling per-model configuration in a fine-grained fashion. Unpredictable changes in the serving workload are dynamically and cost-optimally accommodated with minimal service level degradation. InferLine introduces (1) automated model profiling and pipeline lineage extraction, (2) a fine-grain, cost-minimizing pipeline configuration planner, and (3) a fine-grain reactive controller. InferLine is able to configure and deploy prediction pipelines across a wide range of workload patterns and latency goals. It outperforms coarse-grained configuration alternatives by up to 7.6x in cost while achieving up to 32x lower SLO miss rate on real workloads and generalizes across state-of-the-art model serving frameworks.
},
 author = {Daniel Crankshaw and Gur{-}Eyal Sela and Corey Zumar and Xiangxi Mo and Joseph E. Gonzalez and Ion Stoica and Alexey Tumanov},
 booktitle = {Proceedings of the ACM Symposium on Cloud Computing},
 keywords = {peerrev},
 month = {11},
 publisher = {Association for Computing Machinery},
 series = { {SoCC} '20},
 title = { {InferLine}: {ML} Inference Pipeline Composition Framework},
 url = {http://arxiv.org/abs/1812.01776},
 year = {2020}
}

Ashwin Balakrishna, Brijen Thananjeyan, Jonathan Lee, Felix Li, Arsh Zahed, Joseph E. Gonzalez, and Ken Goldberg. "On-Policy Robot Imitation Learning from a Converging Supervisor." Proceedings of the Conference on Robot Learning, 2020.

Existing on-policy imitation learning algorithms, such as DAgger, assume access to a fixed supervisor. However, there are many settings where the supervisor may evolve during policy learning, such as a human performing a novel task or an improving algorithmic controller. We formalize imitation learning from a "converging supervisor" and provide sublinear static and dynamic regret guarantees against the best policy in hindsight with labels from the converged supervisor, even when labels during learning are only from intermediate supervisors. We then show that this framework is closely connected to a class of reinforcement learning (RL) algorithms known as dual policy iteration (DPI), which alternate between training a reactive learner with imitation learning and a model-based supervisor with data from the learner. Experiments suggest that when this framework is applied with the state-of-the-art deep model-based RL algorithm PETS as an improving supervisor, it outperforms deep RL baselines on continuous control tasks and provides up to an 80-fold speedup in policy evaluation.
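
A very rough sketch of the setting (DAgger-style data aggregation with labels from a supervisor that improves over rounds; the linear learner and synthetic states are stand-ins for illustration, not the paper's method):

import numpy as np

rng = np.random.default_rng(0)
target = np.array([1.0, -2.0, 0.5])                  # the converged supervisor's linear policy

def supervisor(round_idx, states):
    """Intermediate supervisors: a policy that converges to `target` as rounds progress."""
    return states @ (target * min(1.0, round_idx / 10.0))

X_all, y_all, theta = [], [], np.zeros(3)
for k in range(20):
    states = rng.normal(size=(32, 3))                # stand-in for states visited by the learner
    labels = supervisor(k, states)                   # labels come from the *current* supervisor
    X_all.append(states); y_all.append(labels)       # aggregate all rounds, DAgger-style
    X, y = np.concatenate(X_all), np.concatenate(y_all)
    theta = np.linalg.lstsq(X, y, rcond=None)[0]     # refit the learner on everything so far

print(theta.round(2), target)                        # the learner tracks the converged supervisor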

@inproceedings{Balakrishna20,
 abstract = {Existing on-policy imitation learning algorithms, such as DAgger, assume access to a fixed supervisor. However, there are many settings where the supervisor may evolve during policy learning, such as a human performing a novel task or an improving algorithmic controller. We formalize imitation learning from a ``converging supervisor'' and provide sublinear static and dynamic regret guarantees against the best policy in hindsight with labels from the converged supervisor, even when labels during learning are only from intermediate supervisors. We then show that this framework is closely connected to a class of reinforcement learning (RL) algorithms known as dual policy iteration (DPI), which alternate between training a reactive learner with imitation learning and a model-based supervisor with data from the learner. Experiments suggest that when this framework is applied with the state-of-the-art deep model-based RL algorithm PETS as an improving supervisor, it outperforms deep RL baselines on continuous control tasks and provides up to an 80-fold speedup in policy evaluation.},
 author = {Ashwin Balakrishna and Brijen Thananjeyan and Jonathan Lee and Felix Li and Arsh Zahed and Joseph E. Gonzalez and Ken Goldberg},
 booktitle = {Proceedings of the Conference on Robot Learning},
 date-modified = {2020-08-02 11:27:35 -0700},
 editor = {Kaelbling, Leslie Pack and Kragic, Danica and Sugiura, Komei},
 keywords = {peerrev},
 month = {10},
 pages = {24--41},
 pdf = {http://proceedings.mlr.press/v100/balakrishna20a/balakrishna20a.pdf},
 publisher = {PMLR},
 series = {Proceedings of Machine Learning Research},
 title = {On-Policy Robot Imitation Learning from a Converging Supervisor},
 url = {http://proceedings.mlr.press/v100/balakrishna20a.html},
 volume = {100},
 year = {2020}
}

Vikram Sreekanti, Chenggang Wu, Saurav Chhatrapati, Joseph E. Gonzalez, Joseph M. Hellerstein, and Jose M. Faleiro. "A Fault-Tolerance Shim for Serverless Computing." Proceedings of the Fifteenth European Conference on Computer Systems (EuroSys), 2020.

Serverless computing has grown in popularity in recent years, with an increasing number of applications being built on Functions-as-a-Service (FaaS) platforms. By default, FaaS platforms support retry-based fault tolerance, but this is insufficient for programs that modify shared state, as they can unwittingly persist partial sets of updates in case of failures. To address this challenge, we would like atomic visibility of the updates made by a FaaS application. In this paper, we present aft, an atomic fault tolerance shim for serverless applications. aft interposes between a commodity FaaS platform and storage engine and ensures atomic visibility of updates by enforcing the read atomic isolation guarantee. aft supports new protocols to guarantee read atomic isolation in the serverless setting. We demonstrate that aft introduces minimal overhead relative to existing storage engines and scales smoothly to thousands of requests per second, while preventing a significant number of consistency anomalies.

@inproceedings{AftEuroSys20,
 abstract = {
Serverless computing has grown in popularity in recent years, with an increasing number of applications being built on Functions-as-a-Service (FaaS) platforms. By default, FaaS platforms support retry-based fault tolerance, but this is insufficient for programs that modify shared state, as they can unwittingly persist partial sets of updates in case of failures. To address this challenge, we would like atomic visibility of the updates made by a FaaS application.

In this paper, we present aft, an atomic fault tolerance shim for serverless applications. aft interposes between a commodity FaaS platform and storage engine and ensures atomic visibility of updates by enforcing the read atomic isolation guarantee. aft supports new protocols to guarantee read atomic isolation in the serverless setting. We demonstrate that aft introduces minimal overhead relative to existing storage engines and scales smoothly to thousands of requests per second, while preventing a significant number of consistency anomalies.
},
 address = {New York, NY, USA},
 articleno = {15},
 author = {Vikram Sreekanti and Chenggang Wu and Saurav Chhatrapati and Joseph E. Gonzalez and Joseph M. Hellerstein and Jose M. Faleiro},
 booktitle = {Proceedings of the Fifteenth European Conference on Computer Systems (EuroSys)},
 code = {https://github.com/vsreekanti/aft},
 date-modified = {2020-08-02 11:27:35 -0700},
 isbn = {9781450368827},
 keywords = {peerrev},
 location = {Heraklion, Greece},
 numpages = {15},
 publisher = {Association for Computing Machinery},
 series = {EuroSys '20},
 title = {A Fault-Tolerance Shim for Serverless Computing},
 url = {https://doi.org/10.1145/3342195.3387535},
 year = {2020}
}

Jianfei Chen, Yu Gai, Zhewei Yao, Michael W. Mahoney, and Joseph E. Gonzalez. "A Statistical Framework for Low-bitwidth Training of Deep Neural Networks." Advances in Neural Information Processing Systems, 2020.

Fully quantized training (FQT), which uses low-bitwidth hardware by quantizing the activations, weights, and gradients of a neural network model, is a promising approach to accelerate the training of deep neural networks. One major challenge with FQT is the lack of theoretical understanding, in particular of how gradient quantization impacts convergence properties. In this paper, we address this problem by presenting a statistical framework for analyzing FQT algorithms. We view the quantized gradient of FQT as a stochastic estimator of its full precision counterpart, a procedure known as quantization-aware training (QAT). We show that the FQT gradient is an unbiased estimator of the QAT gradient, and we discuss the impact of gradient quantization on its variance. Inspired by these theoretical results, we develop two novel gradient quantizers, and we show that these have smaller variance than the existing per-tensor quantizer. For training ResNet-50 on ImageNet, our 5-bit block Householder quantizer achieves only 0.5% validation accuracy loss relative to QAT, comparable to the existing INT8 baseline.
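
The unbiased-estimator viewpoint is easy to check numerically with a stochastic-rounding per-tensor quantizer (a generic construction used here for illustration, not one of the paper's proposed quantizers):

import numpy as np

rng = np.random.default_rng(0)

def stochastic_quantize(x, bits=5):
    """Per-tensor quantizer with stochastic rounding, so E[dequantize(quantize(x))] = x."""
    levels = 2 ** bits - 1
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / levels if hi > lo else 1.0
    z = (x - lo) / scale
    floor = np.floor(z)
    q = floor + (rng.random(x.shape) < (z - floor))   # round up with probability frac(z)
    return q * scale + lo

g = rng.normal(size=10_000)                           # stand-in for a gradient tensor
samples = np.stack([stochastic_quantize(g) for _ in range(200)])
print("mean absolute bias:", np.abs(samples.mean(0) - g).mean())   # approximately 0: unbiased
print("average variance:  ", samples.var(0).mean())   # the quantity the paper's quantizers reduce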

@inproceedings{ChenGYMG20,
 abstract = {Fully quantized training (FQT), which uses low-bitwidth hardware by quantizing the activations, weights, and gradients of a neural network model, is a promising approach to accelerate the training of deep neural networks. One major challenge with FQT is the lack of theoretical understanding, in particular of how gradient quantization impacts convergence properties. In this paper, we address this problem by presenting a statistical framework for analyzing FQT algorithms. We view the quantized gradient of FQT as a stochastic estimator of its full precision counterpart, a procedure known as quantization-aware training (QAT). We show that the FQT gradient is an unbiased estimator of the QAT gradient, and we discuss the impact of gradient quantization on its variance. Inspired by these theoretical results, we develop two novel gradient quantizers, and we show that these have smaller variance than the existing per-tensor quantizer. For training ResNet-50 on ImageNet, our 5-bit block Householder quantizer achieves only 0.5\% validation accuracy loss relative to QAT, comparable to the existing INT8 baseline.},
 author = {Jianfei Chen and Yu Gai and Zhewei Yao and Michael W. Mahoney and Joseph E. Gonzalez},
 booktitle = {Advances in Neural Information Processing Systems},
 editor = {H. Larochelle and M. Ranzato and R. Hadsell and M. F. Balcan and H. Lin},
 keywords = {peerrev, selected},
 pages = {883--894},
 publisher = {Curran Associates, Inc.},
 title = {A Statistical Framework for Low-bitwidth Training of Deep Neural Networks},
 url = {https://proceedings.neurips.cc/paper/2020/file/099fe6b0b444c23836c4a5d07346082b-Paper.pdf},
 volume = {33},
 year = {2020}
}

Brijen Thananjeyan, Ashwin Balakrishna, Ugo Rosolia, Joseph E. Gonzalez, Aaron D. Ames, and Ken Goldberg. "ABC-LMPC: Safe Sample-Based Learning MPC for Stochastic Nonlinear Dynamical Systems with Adjustable Boundary Conditions." Proceedings of the Int. Workshop on the Algorithmic Foundations of Robotics (WAFR), 2020.

Sample-based learning model predictive control (LMPC) strategies have recently attracted attention due to their desirable theoretical properties and their good empirical performance on robotic tasks. However, prior analysis of LMPC controllers for stochastic systems has mainly focused on linear systems in the iterative learning control setting. We present a novel LMPC algorithm, Adjustable Boundary Condition LMPC (ABC-LMPC), which enables rapid adaptation to novel start and goal configurations and theoretically show that the resulting controller guarantees iterative improvement in expectation for stochastic nonlinear systems. We present results with a practical instantiation of this algorithm and experimentally demonstrate that the resulting controller adapts to a variety of initial and terminal conditions on 3 stochastic continuous control tasks.

@inproceedings{ABCLMPCWAFR20,
 abstract = {Sample-based learning model predictive control (LMPC) strategies have recently attracted attention due to their desirable theoretical properties and their good empirical performance on robotic tasks. However, prior analysis of LMPC controllers for stochastic systems has mainly focused on linear systems in the iterative learning control setting. We present a novel LMPC algorithm, Adjustable Boundary Condition LMPC (ABC-LMPC), which enables rapid adaptation to novel start and goal configurations and theoretically show that the resulting controller guarantees iterative improvement in expectation for stochastic nonlinear systems. We present results with a practical instantiation of this algorithm and experimentally demonstrate that the resulting controller adapts to a variety of initial and terminal conditions on 3 stochastic continuous control tasks.},
 author = {Brijen Thananjeyan and Ashwin Balakrishna and Ugo Rosolia and Joseph E. Gonzalez and Aaron D. Ames and Ken Goldberg},
 booktitle = {Proceedings of the Int. Workshop on the Algorithmic Foundations of Robotics (WAFR)},
 date-modified = {2020-08-02 11:27:35 -0700},
 keywords = {peerrev},
 title = { {ABC-LMPC:} Safe Sample-Based Learning {MPC} for Stochastic Nonlinear Dynamical Systems with Adjustable Boundary Conditions},
 url = {https://arxiv.org/abs/2003.01410},
 year = {2020}
}

Lianmin Zheng, Chengfan Jia, Minmin Sun, Zhao Wu, Cody Hao Yu, Ameer Haj-Ali, Yida Wang, Jun Yang, Danyang Zhuo, Koushik Sen, Joseph E. Gonzalez, and Ion Stoica. "Ansor: Generating High-Performance Tensor Programs for Deep Learning." 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2020.

High-performance tensor programs are crucial to guarantee efficient execution of deep neural networks. However, obtaining performant tensor programs for different operators on various hardware platforms is notoriously challenging. Currently, deep learning systems rely on vendor-provided kernel libraries or various search strategies to get performant tensor programs. These approaches either require significant engineering effort to develop platform-specific optimization code or fall short of finding high-performance programs due to restricted search space and ineffective exploration strategy. We present Ansor, a tensor program generation framework for deep learning applications. Compared with existing search strategies, Ansor explores many more optimization combinations by sampling programs from a hierarchical representation of the search space. Ansor then fine-tunes the sampled programs with evolutionary search and a learned cost model to identify the best programs. Ansor can find high-performance programs that are outside the search space of existing state-of-the-art approaches. In addition, Ansor utilizes a task scheduler to simultaneously optimize multiple subgraphs in deep neural networks. We show that Ansor improves the execution performance of deep neural networks relative to the state-of-the-art on the Intel CPU, ARM CPU, and NVIDIA GPU by up to 3.8x, 2.6x, and 1.7x, respectively.

@inproceedings{ZhengJSWYHWYZSG20,
 abstract = {
High-performance tensor programs are crucial to guarantee efficient execution of deep neural networks. However, obtaining performant tensor programs for different operators on various hardware platforms is notoriously challenging. Currently, deep learning systems rely on vendor-provided kernel libraries or various search strategies to get performant tensor programs. These approaches either require significant engineering effort to develop platform-specific optimization code or fall short of finding high-performance programs due to restricted search space and ineffective exploration strategy.

We present Ansor, a tensor program generation framework for deep learning applications. Compared with existing search strategies, Ansor explores many more optimization combinations by sampling programs from a hierarchical representation of the search space. Ansor then fine-tunes the sampled programs with evolutionary search and a learned cost model to identify the best programs. Ansor can find high-performance programs that are outside the search space of existing state-of-the-art approaches. In addition, Ansor utilizes a task scheduler to simultaneously optimize multiple subgraphs in deep neural networks. We show that Ansor improves the execution performance of deep neural networks relative to the state-of-the-art on the Intel CPU, ARM CPU, and NVIDIA GPU by up to 3.8x, 2.6x, and 1.7x, respectively.
},
 author = {Lianmin Zheng and Chengfan Jia and Minmin Sun and Zhao Wu and Cody Hao Yu and Ameer Haj{-}Ali and Yida Wang and Jun Yang and Danyang Zhuo and Koushik Sen and Joseph E. Gonzalez and Ion Stoica},
 booktitle = {14th {USENIX} Symposium on Operating Systems Design and Implementation, {OSDI} 2020, Virtual Event, November 4-6, 2020},
 keywords = {peerrev, selected},
 pages = {863--879},
 publisher = { {USENIX} Association},
 title = {Ansor: Generating High-Performance Tensor Programs for Deep Learning},
 url = {https://www.usenix.org/conference/osdi20/presentation/zheng},
 year = {2020}
}

Mong H. Ng, Kaahan Radia, Jianfei Chen, Dequan Wang, Ionel Gog, and Joseph E. Gonzalez. "BEV-Seg: Bird's Eye View Semantic Segmentation Using Geometry and Semantic Point Cloud." Proceedings of the Workshop in Scalability for Autonomous Driving at CVPR'20, 2020.

Bird's-eye-view (BEV) is a powerful and widely adopted representation for road scenes that captures surrounding objects and their spatial locations, along with overall context in the scene. In this work, we focus on bird's eye semantic segmentation, a task that predicts pixel-wise semantic segmentation in BEV from side RGB images. This task is made possible by simulators such as Carla, which allow for cheap data collection, arbitrary camera placements, and supervision in ways otherwise not possible in the real world. There are two main challenges to this task: the view transformation from side view to bird's eye view, as well as transfer learning to unseen domains. Existing work transforms between views through fully connected layers and transfer learns via GANs. This suffers from a lack of depth reasoning and performance degradation across domains. Our novel 2-staged perception pipeline explicitly predicts pixel depths and combines them with pixel semantics in an efficient manner, allowing the model to leverage depth information to infer objects' spatial locations in the BEV. In addition, we transfer learn by abstracting high-level geometric features and predicting an intermediate representation that is common across different domains. We publish a new dataset called BEVSEG-Carla and show that our approach improves state-of-the-art by 24% mIoU and performs well when transferred to a new domain.

@inproceedings{ng2020bevseg,
 abstract = {Bird's-eye-view (BEV) is a powerful and widely adopted representation for road scenes that captures surrounding objects and their spatial locations, along with overall context in the scene. In this work, we focus on bird's eye semantic segmentation, a task that predicts pixel-wise semantic segmentation in BEV from side RGB images. This task is made possible by simulators such as Carla, which allow for cheap data collection, arbitrary camera placements, and supervision in ways otherwise not possible in the real world. There are two main challenges to this task: the view transformation from side view to bird's eye view, as well as transfer learning to unseen domains. Existing work transforms between views through fully connected layers and transfer learns via GANs. This suffers from a lack of depth reasoning and performance degradation across domains. Our novel 2-staged perception pipeline explicitly predicts pixel depths and combines them with pixel semantics in an efficient manner, allowing the model to leverage depth information to infer objects' spatial locations in the BEV. In addition, we transfer learning by abstracting high-level geometric features and predicting an intermediate representation that is common across different domains. We publish a new dataset called BEVSEG-Carla and show that our approach improves state-of-the-art by 24\% mIoU and performs well when transferred to a new domain.},
 archiveprefix = {arXiv},
 author = {Mong H. Ng and Kaahan Radia and Jianfei Chen and Dequan Wang and Ionel Gog and Joseph E. Gonzalez},
 booktitle = {Proceedings of the Workshop in Scalability for Autonomous Driving at {CVPR}'20},
 eprint = {2006.11436},
 keywords = {peerrev},
 title = { {BEV-Seg}: Bird's Eye View Semantic Segmentation Using Geometry and Semantic Point Cloud},
 url = {https://arxiv.org/abs/2006.11436},
 year = {2020}
}

Yaoqing Yang, Rajiv Khanna, Yaodong Yu, Amir Gholami, Kurt Keutzer, Joseph E. Gonzalez, Kannan Ramchandran, and Michael W. Mahoney. "Boundary thickness and robustness in learning models." Advances in Neural Information Processing Systems, 2020.

Robustness of machine learning models to various adversarial and non-adversarial corruptions continues to be of interest. In this paper, we introduce the notion of the boundary thickness of a classifier, and we describe its connection with and usefulness for model robustness. Thick decision boundaries lead to improved performance, while thin decision boundaries lead to overfitting (e.g., measured by the robust generalization gap between training and testing) and lower robustness. We show that a thicker boundary helps improve robustness against adversarial examples (e.g., improving the robust test accuracy of adversarial training), as well as so-called out-of-distribution (OOD) transforms, and we show that many commonly-used regularization and data augmentation procedures can increase boundary thickness. On the theoretical side, we establish that maximizing boundary thickness is akin to minimizing the so-called mixup loss. Using these observations, we can show that noise-augmentation on mixup training further increases boundary thickness, thereby combating vulnerability to various forms of adversarial attacks and OOD transforms. We can also show that the performance improvement in several recent lines of work happens in conjunction with a thicker boundary.
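
A rough numerical sketch of the kind of quantity described (the paper measures thickness via the gap between competing class probabilities along segments between pairs of points; the alpha and beta values, the toy model, and the point pair below are assumptions for illustration):

import numpy as np

def boundary_thickness(predict_proba, x0, x1, alpha=0.0, beta=0.75, steps=256):
    """Length of the segment from x0 to x1 on which the gap between the two largest
    class probabilities lies in (alpha, beta)."""
    ts = np.linspace(0.0, 1.0, steps)
    points = np.array([(1 - t) * x0 + t * x1 for t in ts])
    probs = predict_proba(points)                     # shape (steps, num_classes)
    top2 = np.sort(probs, axis=1)[:, -2:]
    gap = top2[:, 1] - top2[:, 0]
    inside = (gap > alpha) & (gap < beta)
    return np.linalg.norm(x1 - x0) * inside.mean()

def toy_predict_proba(X):                             # a two-class logistic model in the plane
    p1 = 1.0 / (1.0 + np.exp(-4.0 * X[:, 0]))
    return np.stack([1.0 - p1, p1], axis=1)

x0, x1 = np.array([-1.0, 0.0]), np.array([1.0, 0.0])
print(boundary_thickness(toy_predict_proba, x0, x1))  # a sharper model yields a thinner boundary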

@inproceedings{YangKYGKGRM20,
 abstract = {Robustness of machine learning models to various adversarial and non-adversarial corruptions continues to be of interest. In this paper, we introduce the notion of the boundary thickness of a classifier, and we describe its connection with and usefulness for model robustness. Thick decision boundaries lead to improved performance, while thin decision boundaries lead to overfitting (e.g., measured by the robust generalization gap between training and testing) and lower robustness. We show that a thicker boundary helps improve robustness against adversarial examples (e.g., improving the robust test accuracy of adversarial training), as well as so-called out-of-distribution (OOD) transforms, and we show that many commonly-used regularization and data augmentation procedures can increase boundary thickness. On the theoretical side, we establish that maximizing boundary thickness is akin to minimizing the so-called mixup loss. Using these observations, we can show that noise-augmentation on mixup training further increases boundary thickness, thereby combating vulnerability to various forms of adversarial attacks and OOD transforms. We can also show that the performance improvement in several recent lines of work happens in conjunction with a thicker boundary.},
 author = {Yaoqing Yang and Rajiv Khanna and Yaodong Yu and Amir Gholami and Kurt Keutzer and Joseph E. Gonzalez and Kannan Ramchandran and Michael W. Mahoney},
 booktitle = {Advances in Neural Information Processing Systems},
 editor = {H. Larochelle and M. Ranzato and R. Hadsell and M. F. Balcan and H. Lin},
 keywords = {peerrev, selected},
 pages = {6223--6234},
 publisher = {Curran Associates, Inc.},
 title = {Boundary thickness and robustness in learning models},
 url = {https://proceedings.neurips.cc/paper/2020/file/44e76e99b5e194377e955b13fb12f630-Paper.pdf},
 volume = {33},
 year = {2020}
}

Paras Jain, Ajay Jain, Aniruddha Nrusimha, Amir Gholami, Pieter Abbeel, Kurt Keutzer, Ion Stoica, and Joseph E. Gonzalez. "Breaking the Memory Wall with Optimal Tensor Rematerialization." Proceedings of Machine Learning and Systems, 2020.

Modern neural networks are increasingly bottlenecked by the limited capacity of on-device GPU memory. Prior work explores dropping activations as a strategy to scale to larger neural networks with fixed memory. However, these heuristics assume uniform cost per layer and only consider simple linear chain architectures, limiting their usability. In this paper, we formalize the problem of trading-off computation time and memory requirements for DNN training as the tensor rematerialization optimization problem. We develop a new system to optimally solve the problem in reasonable times (under an hour) using off-the-shelf MILP solvers. These schedules subsequently accelerate millions of training iterations. Our optimization pass in TensorFlow 2.0 automatically yields real training speedups of up to 4.8x over prior work, and can enable up to 5x increase in input size for real-world large networks.

@inproceedings{Checkmate20,
 abstract = {Modern neural networks are increasingly bottlenecked by the limited capacity of on-device GPU memory. Prior work explores dropping activations as a strategy to scale to larger neural networks with fixed memory. However, these heuristics assume uniform cost per layer and only consider simple linear chain architectures, limiting their usability. In this paper, we formalize the problem of trading-off computation time and memory requirements for DNN training as the tensor rematerialization optimization problem. We develop a new system to optimally solve the problem in reasonable times (under an hour) using off-the-shelf MILP solvers. These schedules subsequently accelerate millions of training iterations. Our optimization pass in TensorFlow 2.0 automatically yields real training speedups of up to 4.8x over prior work, and can enable up to 5x increase in input size for real-world large networks.},
 author = {Paras Jain and Ajay Jain and Aniruddha Nrusimha and Amir Gholami and Pieter Abbeel and Kurt Keutzer and Ion Stoica and Joseph E. Gonzalez},
 booktitle = {Proceedings of Machine Learning and Systems},
 code = {https://github.com/parasj/checkmate},
 date-modified = {2020-08-02 11:27:35 -0700},
 keywords = {peerrev, selected},
 pages = {497--511},
 title = {Breaking the Memory Wall with Optimal Tensor Rematerialization},
 url = {https://arxiv.org/abs/1910.02653},
 year = {2020}
}

Vikram Sreekanti, Chenggang Wu, Xiayue Charles Lin, Johann Schleier-Smith, Jose M. Faleiro, Joseph E. Gonzalez, Joseph M. Hellerstein, and Alexey Tumanov. "Cloudburst: Stateful Functions-as-a-Service." Proceedings of Very Large Data Bases (PVLDB), 2020.

Function-as-a-Service (FaaS) platforms and "serverless" cloud computing are becoming increasingly popular. Current FaaS offerings are targeted at stateless functions that do minimal I/O and communication. We argue that the benefits of serverless computing can be extended to a broader range of applications and algorithms. We present the design and implementation of Cloudburst, a stateful FaaS platform that provides familiar Python programming with low-latency mutable state and communication, while maintaining the autoscaling benefits of serverless computing. Cloudburst accomplishes this by leveraging Anna, an autoscaling key-value store, for state sharing and overlay routing combined with mutable caches co-located with function executors for data locality. Performant cache consistency emerges as a key challenge in this architecture. To this end, Cloudburst provides a combination of lattice-encapsulated state and new definitions and protocols for distributed session consistency. Empirical results on benchmarks and diverse applications show that Cloudburst makes stateful functions practical, reducing the state-management overheads of current FaaS platforms by orders of magnitude while also improving the state of the art in serverless consistency.
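
As a rough illustration of lattice-encapsulated state, the sketch below implements a last-writer-wins lattice whose merge is associative, commutative, and idempotent, which is the property that lets co-located caches reconcile concurrent updates. All class and method names here are hypothetical stand-ins, not Cloudburst's or Anna's actual API.

# A last-writer-wins (LWW) lattice: merge keeps the value tagged with the larger
# timestamp, so caches and replicas converge regardless of update order.
from dataclasses import dataclass
from itertools import count

@dataclass(frozen=True)
class LWWPair:
    timestamp: int
    value: object

    def merge(self, other: "LWWPair") -> "LWWPair":
        return self if self.timestamp >= other.timestamp else other

class ExecutorCache:
    """A cache co-located with a function executor; it only merges lattice values."""
    _clock = count(1)  # stand-in for a real timestamp source

    def __init__(self):
        self._store = {}

    def put(self, key, value):
        incoming = LWWPair(next(self._clock), value)
        current = self._store.get(key)
        self._store[key] = incoming if current is None else current.merge(incoming)

    def get(self, key):
        pair = self._store.get(key)
        return None if pair is None else pair.value

cache = ExecutorCache()
cache.put("session:42", {"cart": ["book"]})
cache.put("session:42", {"cart": ["book", "pen"]})
print(cache.get("session:42"))  # the later write wins: {'cart': ['book', 'pen']}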

@inproceedings{Cloudburst20,
 abstract = {Function-as-a-Service (FaaS) platforms and ``serverless'' cloud computing are becoming increasingly popular. Current FaaS offerings are targeted at stateless functions that do minimal I/O and communication. We argue that the benefits of serverless computing can be extended to a broader range of applications and algorithms. We present the design and implementation of Cloudburst, a stateful FaaS platform that provides familiar Python programming with low-latency mutable state and communication, while maintaining the autoscaling benefits of serverless computing. Cloudburst accomplishes this by leveraging Anna, an autoscaling key-value store, for state sharing and overlay routing combined with mutable caches co-located with function executors for data locality. Performant cache consistency emerges as a key challenge in this architecture. To this end, Cloudburst provides a combination of lattice-encapsulated state and new definitions and protocols for distributed session consistency. Empirical results on benchmarks and diverse applications show that Cloudburst makes stateful functions practical, reducing the state-management overheads of current FaaS platforms by orders of magnitude while also improving the state of the art in serverless consistency.},
 author = {Vikram Sreekanti and Chenggang Wu and Xiayue Charles Lin and Johann Schleier{-}Smith and Jose M. Faleiro and Joseph E. Gonzalez and Joseph M. Hellerstein and Alexey Tumanov},
 booktitle = {Proceedings of Very Large Data Bases (PVLDB)},
 date-modified = {2020-08-02 11:27:35 -0700},
 keywords = {peerrev, selected},
 title = {Cloudburst: Stateful Functions-as-a-Service},
 url = {http://www.vldb.org/pvldb/vol13/p2438-sreekanti.pdf},
 volume = {13},
 year = {2020}
}

David E. Culler, Prabal Dutta, Gabe Fierro, Joseph E. Gonzalez, Nathan Pemberton, Johann Schleier-Smith, Kalyanaraman Shankari, Alvin Wan, and Thomas Zachariah. "CoVista: A Unified View on Privacy Sensitive Mobile Contact Tracing." IEEE Data Eng. Bull., 2020.

Governments around the world have become increasingly frustrated with tech giants dictating public health policy. The software created by Apple and Google enables individuals to track their own potential exposure through collated exposure notifications. However, the same software prohibits location tracking, denying key information needed by public health officials for robust contact tracing. This information is needed to treat and isolate COVID-19 positive people, identify transmission hotspots, and protect against continued spread of infection. In this article, we present two simple ideas: the lighthouse and the covid-commons that address the needs of public health authorities while preserving the privacy-sensitive goals of the Apple and Google exposure notification protocols.

@article{Covid,
 abstract = {Governments around the world have become increasingly frustrated with tech giants dictating public health policy. The software created by Apple and Google enables individuals to track their own potential exposure through collated exposure notifications. However, the same software prohibits location tracking, denying key information needed by public health officials for robust contact tracing. This information is needed to treat and isolate COVID-19 positive people, identify transmission hotspots, and protect against continued spread of infection. In this article, we present two simple ideas: the lighthouse and the covid-commons that address the needs of public health authorities while preserving the privacy-sensitive goals of the Apple and Google exposure notification protocols.},
 author = {David E. Culler and Prabal Dutta and Gabe Fierro and Joseph E. Gonzalez and Nathan Pemberton and Johann Schleier{-}Smith and Kalyanaraman Shankari and Alvin Wan and Thomas Zachariah},
 date-modified = {2020-08-02 11:27:35 -0700},
 journal = { {IEEE} Data Eng. Bull.},
 keywords = {techreport},
 number = {2},
 pages = {83--94},
 title = {CoVista: {A} Unified View on Privacy Sensitive Mobile Contact Tracing},
 url = {http://sites.computer.org/debull/A20june/p83.pdf},
 volume = {43},
 year = {2020}
}

Paras Jain, Ajay Jain, Tianjun Zhang, Pieter Abbeel, Joseph E. Gonzalez, and Ion Stoica. "Contrastive Code Representation Learning." CoRR (arXiv), 2020.

Machine-aided programming tools such as type predictors and code summarizers are increasingly learning-based. However, most code representation learning approaches rely on supervised learning with task-specific annotated datasets. We propose Contrastive Code Representation Learning (ContraCode), a self-supervised algorithm for learning task-agnostic semantic representations of programs via contrastive learning. Our approach uses no human-provided labels, relying only on the raw text of programs. In particular, we design an unsupervised pretext task by generating textually divergent copies of source functions via automated source-to-source compiler transforms that preserve semantics. We train a neural model to identify variants of an anchor program within a large batch of negatives. To solve this task, the network must extract program features representing the functionality, not form, of the program. This is the first application of instance discrimination to code representation learning to our knowledge. We pre-train models over 1.8m unannotated JavaScript methods mined from GitHub. ContraCode pre-training improves code summarization accuracy by 7.9\% over supervised approaches and 4.8\% over RoBERTa pre-training. Moreover, our approach is agnostic to model architecture; for a type inference task, contrastive pre-training consistently improves the accuracy of existing baselines.
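
For intuition, the contrastive objective can be sketched as an InfoNCE loss in which each program's embedding must match the embedding of its compiler-transformed variant and mismatch the other programs in the batch; the encoder below is a random stand-in, and none of this is ContraCode's actual training code.

# Minimal InfoNCE loss over (program, semantics-preserving variant) pairs.
import torch
import torch.nn.functional as F

def info_nce(anchor_emb, positive_emb, temperature=0.07):
    a = F.normalize(anchor_emb, dim=1)   # (B, D)
    p = F.normalize(positive_emb, dim=1) # (B, D)
    logits = a @ p.t() / temperature     # (B, B) pairwise similarities
    targets = torch.arange(a.size(0))    # the positive sits on the diagonal
    return F.cross_entropy(logits, targets)

# Toy usage: in ContraCode the two inputs would be embeddings of a source
# function and of a textually divergent but equivalent copy of it.
encoder = torch.nn.Sequential(torch.nn.Linear(128, 64), torch.nn.ReLU(), torch.nn.Linear(64, 32))
anchors, positives = torch.randn(8, 128), torch.randn(8, 128)
loss = info_nce(encoder(anchors), encoder(positives))
loss.backward()
print(float(loss))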

@article{jain2020contrastive,
 abstract = {Machine-aided programming tools such as type predictors and code summarizers are increasingly learning-based. However, most code representation learning approaches rely on supervised learning with task-specific annotated datasets. We propose Contrastive Code Representation Learning (ContraCode), a self-supervised algorithm for learning task-agnostic semantic representations of programs via contrastive learning. Our approach uses no human-provided labels, relying only on the raw text of programs. In particular, we design an unsupervised pretext task by generating textually divergent copies of source functions via automated source-to-source compiler transforms that preserve semantics. We train a neural model to identify variants of an anchor program within a large batch of negatives. To solve this task, the network must extract program features representing the functionality, not form, of the program. This is the first application of instance discrimination to code representation learning to our knowledge. We pre-train models over 1.8m unannotated JavaScript methods mined from GitHub. ContraCode pre-training improves code summarization accuracy by 7.9\% over supervised approaches and 4.8\% over RoBERTa pre-training. Moreover, our approach is agnostic to model architecture; for a type inference task, contrastive pre-training consistently improves the accuracy of existing baselines.},
 archiveprefix = {arXiv},
 author = {Paras Jain and Ajay Jain and Tianjun Zhang and Pieter Abbeel and Joseph E. Gonzalez and Ion Stoica},
 code = {https://parasj.github.io/contracode/},
 date-modified = {2020-08-02 11:27:35 -0700},
 eprint = {2007.04973},
 journal = {CoRR},
 keywords = {arxivpre},
 primaryclass = {cs.LG},
 title = {Contrastive Code Representation Learning},
 url = {https://arxiv.org/abs/2007.04973},
 year = {2020}
}

Xiaoliang Dai, Alvin Wan, Peizhao Zhang, Bichen Wu, Zijian He, Zhen Wei, Kan Chen, Yuandong Tian, Matthew Yu, Peter Vajda, and Joseph E. Gonzalez. "FBNetV3: Joint Architecture-Recipe Search using Neural Acquisition Function." CoRR (arXiv), 2020.

Neural Architecture Search (NAS) yields state-of-the-art neural networks that outperform their best manually-designed counterparts. However, previous NAS methods search for architectures under one training recipe (i.e., training hyperparameters), ignoring the significance of training recipes and overlooking superior architectures under other training recipes. Thus, they fail to find higher-accuracy architecture-recipe combinations. To address this oversight, we present JointNAS to search both (a) architectures and (b) their corresponding training recipes. To accomplish this, we introduce a neural acquisition function that scores architectures and training recipes jointly. Following pre-training on a proxy dataset, this acquisition function guides both coarse-grained and fine-grained searches to produce FBNetV3. FBNetV3 is a family of state-of-the-art compact ImageNet models, outperforming both automatically and manually-designed architectures. For example, FBNetV3 matches both EfficientNet and ResNeSt accuracy with 1.4x and 5.0x fewer FLOPs, respectively. Furthermore, the JointNAS-searched training recipe yields significant performance gains across different networks and tasks.
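
A neural acquisition function can be pictured as a small predictor that scores an (architecture encoding, training-recipe encoding) pair; candidate pairs are then ranked by predicted accuracy. The encoding sizes and the MLP below are illustrative assumptions, not the paper's model.

# A tiny "neural acquisition function": score architecture + recipe encodings.
import torch
import torch.nn as nn

class AcquisitionFunction(nn.Module):
    def __init__(self, arch_dim=32, recipe_dim=8, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(arch_dim + recipe_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, arch_enc, recipe_enc):
        return self.mlp(torch.cat([arch_enc, recipe_enc], dim=-1)).squeeze(-1)

predictor = AcquisitionFunction()
arch, recipe = torch.rand(100, 32), torch.rand(100, 8)  # 100 random candidates
scores = predictor(arch, recipe)
print(scores.topk(5).indices.tolist())  # the 5 most promising (arch, recipe) pairs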

@article{FBNetV3,
 abstract = {Neural Architecture Search (NAS) yields state-of-the-art neural networks that outperform their best manually-designed counterparts. However, previous NAS methods search for architectures under one training recipe (i.e., training hyperparameters), ignoring the significance of training recipes and overlooking superior architectures under other training recipes. Thus, they fail to find higher-accuracy architecture-recipe combinations. To address this oversight, we present JointNAS to search both (a) architectures and (b) their corresponding training recipes. To accomplish this, we introduce a neural acquisition function that scores architectures and training recipes jointly. Following pre-training on a proxy dataset, this acquisition function guides both coarse-grained and fine-grained searches to produce FBNetV3. FBNetV3 is a family of state-of-the-art compact ImageNet models, outperforming both automatically and manually-designed architectures. For example, FBNetV3 matches both EfficientNet and ResNeSt accuracy with 1.4x and 5.0x fewer FLOPs, respectively. Furthermore, the JointNAS-searched training recipe yields significant performance gains across different networks and tasks.},
 archiveprefix = {arXiv},
 author = {Xiaoliang Dai and Alvin Wan and Peizhao Zhang and Bichen Wu and Zijian He and Zhen Wei and Kan Chen and Yuandong Tian and Matthew Yu and Peter Vajda and Joseph E. Gonzalez},
 date-modified = {2020-08-02 11:27:35 -0700},
 eprint = {2006.02049},
 journal = {CoRR},
 keywords = {arxivpre},
 title = { {FBNetV3}: Joint Architecture-Recipe Search using Neural Acquisition Function},
 url = {https://arxiv.org/abs/2006.02049},
 volume = {abs/2006.02049},
 year = {2020}
}

Jeffrey Ichnowski, William Lee, Victor Murta, Samuel Paradis, Ron Alterovitz, Joseph E. Gonzalez, Ion Stoica, and Ken Goldberg. "Fog Robotics Algorithms for Distributed Motion Planning Using Lambda Serverless Computing." 2020 IEEE International Conference on Robotics and Automation, ICRA 2020, Paris, France, May 31 - August 31, 2020, 2020.

For robots using motion planning algorithms such as RRT and RRT*, the computational load can vary by orders of magnitude as the complexity of the local environment changes. To adaptively provide such computation, we propose Fog Robotics algorithms in which cloud-based serverless lambda computing provides parallel computation on demand. To use this parallelism, we propose novel motion planning algorithms that scale effectively with an increasing number of serverless computers. However, given that the allocation of computing is typically bounded by both monetary and time constraints, we show how prior learning can be used to efficiently allocate resources at runtime. We demonstrate the algorithms and application of learned parallel allocation in both simulation and with the Fetch commercial mobile manipulator using Amazon Lambda to complete a sequence of sporadically computationally intensive motion planning tasks.

@inproceedings{IchnowskiLMPAGS20,
 abstract = {For robots using motion planning algorithms such as {RRT} and {RRT\textasteriskcentered}, the computational load can vary by orders of magnitude as the complexity of the local environment changes. To adaptively provide such computation, we propose Fog Robotics algorithms in which cloud-based serverless lambda computing provides parallel computation on demand. To use this parallelism, we propose novel motion planning algorithms that scale effectively with an increasing number of serverless computers. However, given that the allocation of computing is typically bounded by both monetary and time constraints, we show how prior learning can be used to efficiently allocate resources at runtime. We demonstrate the algorithms and application of learned parallel allocation in both simulation and with the Fetch commercial mobile manipulator using Amazon Lambda to complete a sequence of sporadically computationally intensive motion planning tasks.},
 author = {Jeffrey Ichnowski and William Lee and Victor Murta and Samuel Paradis and Ron Alterovitz and Joseph E. Gonzalez and Ion Stoica and Ken Goldberg},
 booktitle = {2020 {IEEE} International Conference on Robotics and Automation, {ICRA} 2020, Paris, France, May 31 - August 31, 2020},
 doi = {10.1109/ICRA40945.2020.9196651},
 keywords = {peerrev, selected},
 pages = {4232--4238},
 publisher = { {IEEE} },
 title = {Fog Robotics Algorithms for Distributed Motion Planning Using Lambda Serverless Computing},
 url = {https://doi.org/10.1109/ICRA40945.2020.9196651},
 year = {2020}
}

Priya Sundaresan, Jennifer Grannen, Brijen Thananjeyan, Ashwin Balakrishna, Michael Laskey, Kevin Stone, Joseph E. Gonzalez, and Ken Goldberg. "Learning Rope Manipulation Policies Using Dense Object Descriptors Trained on Synthetic Depth Data." 2020 IEEE International Conference on Robotics and Automation, ICRA 2020, Paris, France, May 31 - August 31, 2020, 2020.

Robotic manipulation of deformable 1D objects such as ropes, cables, and hoses is challenging due to the lack of high-fidelity analytic models and large configuration spaces. Furthermore, learning end-to-end manipulation policies directly from images and physical interaction requires significant time on a robot and can fail to generalize across tasks. We address these challenges using interpretable deep visual representations for rope, extending recent work on dense object descriptors for robot manipulation. This facilitates the design of interpretable and transferable geometric policies built on top of the learned representations, decoupling visual reasoning and control. We present an approach that learns point-pair correspondences between initial and goal rope configurations, which implicitly encodes geometric structure, entirely in simulation from synthetic depth images. We demonstrate that the learned representation - dense depth object descriptors (DDODs) - can be used to manipulate a real rope into a variety of different arrangements either by learning from demonstrations or using interpretable geometric policies. In 50 trials of a knot-tying task with the ABB YuMi Robot, the system achieves a 66\% knot-tying success rate from previously unseen configurations.
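
The core primitive, matching a pixel in the goal image to its corresponding pixel in the current image via learned per-pixel descriptors, can be sketched as a nearest-neighbor lookup. The descriptor maps below are random stand-ins; the actual system learns them from synthetic depth images.

# Nearest-neighbor correspondence between dense descriptor maps.
import numpy as np

def match_descriptor(current_desc, goal_desc, goal_pixel):
    """current_desc, goal_desc: (H, W, D) arrays; goal_pixel: (row, col)."""
    query = goal_desc[goal_pixel]                          # (D,)
    dists = np.linalg.norm(current_desc - query, axis=-1)  # (H, W)
    return np.unravel_index(np.argmin(dists), dists.shape)

H, W, D = 64, 64, 16
current_desc = np.random.rand(H, W, D)
goal_desc = np.random.rand(H, W, D)
# Which pixel of the current rope image should end up at goal pixel (10, 20)?
print(match_descriptor(current_desc, goal_desc, (10, 20)))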

@inproceedings{SundaresanGTBLS20,
 abstract = {Robotic manipulation of deformable 1D objects such as ropes, cables, and hoses is challenging due to the lack of high-fidelity analytic models and large configuration spaces. Furthermore, learning end-to-end manipulation policies directly from images and physical interaction requires significant time on a robot and can fail to generalize across tasks. We address these challenges using interpretable deep visual representations for rope, extending recent work on dense object descriptors for robot manipulation. This facilitates the design of interpretable and transferable geometric policies built on top of the learned representations, decoupling visual reasoning and control. We present an approach that learns point-pair correspondences between initial and goal rope configurations, which implicitly encodes geometric structure, entirely in simulation from synthetic depth images. We demonstrate that the learned representation - dense depth object descriptors (DDODs) - can be used to manipulate a real rope into a variety of different arrangements either by learning from demonstrations or using interpretable geometric policies. In 50 trials of a knot-tying task with the ABB YuMi Robot, the system achieves a 66\% knot-tying success rate from previously unseen configurations.},
 author = {Priya Sundaresan and Jennifer Grannen and Brijen Thananjeyan and Ashwin Balakrishna and Michael Laskey and Kevin Stone and Joseph E. Gonzalez and Ken Goldberg},
 booktitle = {2020 {IEEE} International Conference on Robotics and Automation, {ICRA} 2020, Paris, France, May 31 - August 31, 2020},
 doi = {10.1109/ICRA40945.2020.9197121},
 keywords = {peerrev},
 pages = {9411--9418},
 publisher = { {IEEE} },
 title = {Learning Rope Manipulation Policies Using Dense Object Descriptors Trained on Synthetic Depth Data},
 url = {https://doi.org/10.1109/ICRA40945.2020.9197121},
 year = {2020}
}

Kirthevasan Kandasamy, Joseph E. Gonzalez, Michael I. Jordan, and Ion Stoica. "Mechanism Design with Bandit Feedback." CoRR (arXiv), 2020.

We study a multi-round welfare-maximising mechanism design problem in instances where agents do not know their values. On each round, a mechanism assigns an allocation each to a set of agents and charges them a price; then the agents provide (stochastic) feedback to the mechanism for the allocation they received. This is motivated by applications in cloud markets and online advertising where an agent may know her value for an allocation only after experiencing it. Therefore, the mechanism needs to explore different allocations for each agent, while simultaneously attempting to find the socially optimal set of allocations. Our focus is on truthful and individually rational mechanisms which imitate the classical VCG mechanism in the long run. To that end, we define three notions of regret for the welfare, the individual utilities of each agent and that of the mechanism. We show that these three terms are interdependent via an $\Omega(T^{2/3})$ lower bound for the maximum of these three terms after T rounds of allocations, and describe a family of anytime algorithms which achieve this rate. Our framework provides flexibility to control the pricing scheme so as to trade-off between the agent and seller regrets, and additionally to control the degree of truthfulness and individual rationality.

@article{Kirthevasan20_mechanism_design,
 abstract = {We study a multi-round welfare-maximising mechanism design problem in instances where agents do not know their values. On each round, a mechanism assigns an allocation each to a set of agents and charges them a price; then the agents provide (stochastic) feedback to the mechanism for the allocation they received. This is motivated by applications in cloud markets and online advertising where an agent may know her value for an allocation only after experiencing it. Therefore, the mechanism needs to explore different allocations for each agent, while simultaneously attempting to find the socially optimal set of allocations. Our focus is on truthful and individually rational mechanisms which imitate the classical VCG mechanism in the long run. To that end, we define three notions of regret for the welfare, the individual utilities of each agent and that of the mechanism. We show that these three terms are interdependent via an \$\Omega(T^{2/3})\$ lower bound for the maximum of these three terms after T rounds of allocations, and describe a family of anytime algorithms which achieve this rate. Our framework provides flexibility to control the pricing scheme so as to trade-off between the agent and seller regrets, and additionally to control the degree of truthfulness and individual rationality.},
 archiveprefix = {arXiv},
 author = {Kirthevasan Kandasamy and Joseph E. Gonzalez and Michael I. Jordan and Ion Stoica},
 date-modified = {2020-08-02 11:27:35 -0700},
 eprint = {2004.08924},
 journal = {CoRR},
 keywords = {arxivpre},
 title = {Mechanism Design with Bandit Feedback},
 url = {https://arxiv.org/abs/2004.08924},
 volume = {abs/2004.08924},
 year = {2020}
}

Ankur Dave, Chester Leung, Raluca Ada Popa, Joseph E. Gonzalez, and Ion Stoica. "Oblivious Coopetitive Analytics Using Hardware Enclaves." Proceedings of the Fifteenth European Conference on Computer Systems (EuroSys), 2020.

Coopetitive analytics refers to cooperation among competing parties to run queries over their joint data. Regulatory, business, and liability concerns prevent these organizations from sharing their sensitive data in plaintext. We propose Oblivious Coopetitive Queries (OCQ), an efficient, general framework for oblivious coopetitive analytics using hardware enclaves. OCQ builds on Opaque, a Spark-based framework for secure distributed analytics, to execute coopetitive queries using hardware enclaves in a decentralized manner. Its query planner chooses how and where to execute each relational operator to prevent data leakage through side channels such as memory access patterns, network traffic statistics, and cardinality, while minimizing overhead. We implemented OCQ as an extension to Apache Spark SQL. We find that OCQ is up to 9.9x faster than Opaque, a state-of-the-art secure analytics framework which outsources all data and computation to an enclave-enabled cloud; and is up to 219x faster than implementing analytics using AgMPC, a state-of-the-art secure multi-party computation framework.
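
One concrete ingredient of obliviousness is hiding output cardinality. The toy sketch below pads a filtered result up to a fixed public bound so the number of matching rows is not revealed by result size; the padding scheme and names are illustrative only, and OCQ's planner additionally addresses access-pattern and network side channels.

# Pad a selective operator's output to a public upper bound to hide cardinality.
DUMMY = {"is_dummy": True}

def oblivious_filter(rows, predicate, public_bound):
    kept = [dict(r, is_dummy=False) for r in rows if predicate(r)]
    if len(kept) > public_bound:
        raise ValueError("public bound must dominate the true cardinality")
    return kept + [dict(DUMMY) for _ in range(public_bound - len(kept))]

rows = [{"region": "EU", "sales": s} for s in (10, 3, 42, 7)]
padded = oblivious_filter(rows, lambda r: r["sales"] > 5, public_bound=4)
print(len(padded))  # always 4, regardless of how many rows matched
print(sum(r["sales"] for r in padded if not r["is_dummy"]))  # downstream ops skip dummies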

@inproceedings{OCQ20,
 abstract = {
Coopetitive analytics refers to cooperation among competing parties to run queries over their joint data. Regulatory, business, and liability concerns prevent these organizations from sharing their sensitive data in plaintext.

We propose Oblivious Coopetitive Queries (OCQ), an efficient, general framework for oblivious coopetitive analytics using hardware enclaves. OCQ builds on Opaque, a Spark-based framework for secure distributed analytics, to execute coopetitive queries using hardware enclaves in a decentralized manner. Its query planner chooses how and where to execute each relational operator to prevent data leakage through side channels such as memory access patterns, network traffic statistics, and cardinality, while minimizing overhead.

We implemented OCQ as an extension to Apache Spark SQL. We find that OCQ is up to 9.9x faster than Opaque, a state-of-the-art secure analytics framework which outsources all data and computation to an enclave-enabled cloud; and is up to 219x faster than implementing analytics using AgMPC, a state-of-the-art secure multi-party computation framework.
},
 address = {New York, NY, USA},
 articleno = {39},
 author = {Ankur Dave and Chester Leung and Raluca Ada Popa and Joseph E. Gonzalez and Ion Stoica},
 booktitle = {Proceedings of the Fifteenth European Conference on Computer Systems (EuroSys)},
 date-modified = {2020-08-02 11:27:35 -0700},
 isbn = {9781450368827},
 keywords = {peerrev},
 location = {Heraklion, Greece},
 numpages = {17},
 publisher = {Association for Computing Machinery},
 series = {EuroSys '20},
 title = {Oblivious Coopetitive Analytics Using Hardware Enclaves},
 url = {https://doi.org/10.1145/3342195.3387552},
 year = {2020}
}

Vikram Sreekanti, Harikaran Subbaraj, Chenggang Wu, Joseph E. Gonzalez, and Joseph M. Hellerstein. "Optimizing Prediction Serving on Low-Latency Serverless Dataflow." CoRR (arXiv), 2020.

Prediction serving systems are designed to provide large volumes of low-latency inferences for machine learning models. These systems mix data processing and computationally intensive model inference and benefit from multiple heterogeneous processors and distributed computing resources. In this paper, we argue that a familiar dataflow API is well-suited to this latency-sensitive task, and amenable to optimization even with unmodified black-box ML models. We present the design of Cloudflow, a system that provides this API and realizes it on an autoscaling serverless backend. Cloudflow transparently implements performance-critical optimizations including operator fusion and competitive execution. Our evaluation shows that Cloudflow's optimizations yield significant performance improvements on synthetic workloads and that Cloudflow outperforms state-of-the-art prediction serving systems by as much as 2x on real-world prediction pipelines, meeting latency goals of demanding applications like real-time video analysis.
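
To illustrate the dataflow-plus-optimization idea, here is a toy dataflow API in which consecutive map operators are fused into a single function before execution, so a pipeline runs as one call rather than one call per operator; the API surface is hypothetical and much simpler than Cloudflow's.

# A toy dataflow of map operators with a single optimization pass: operator fusion.
from functools import reduce

class Dataflow:
    def __init__(self):
        self.ops = []

    def map(self, fn):
        self.ops.append(fn)
        return self

    def fuse(self):
        """Compose all map operators into one, preserving their order."""
        fused = reduce(lambda f, g: (lambda x: g(f(x))), self.ops, lambda x: x)
        self.ops = [fused]
        return self

    def run(self, value):
        for op in self.ops:
            value = op(value)
        return value

# A miniature "prediction pipeline": preprocess -> model stand-in -> postprocess.
flow = (Dataflow()
        .map(lambda x: [v / 255.0 for v in x])    # preprocess
        .map(lambda x: sum(x) / len(x))           # stand-in for model inference
        .map(lambda y: {"label": int(y > 0.5)}))  # postprocess
print(flow.fuse().run([10, 200, 150]))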

@article{Cloudflow20,
 abstract = {Prediction serving systems are designed to provide large volumes of low-latency inferences machine learning models. These systems mix data processing and computationally intensive model inference and benefit from multiple heterogeneous processors and distributed computing resources. In this paper, we argue that a familiar dataflow API is well-suited to this latency-sensitive task, and amenable to optimization even with unmodified black-box ML models. We present the design of Cloudflow, a system that provides this API and realizes it on an autoscaling serverless backend. Cloudflow transparently implements performance-critical optimizations including operator fusion and competitive execution. Our evaluation shows that Cloudflow's optimizations yield significant performance improvements on synthetic workloads and that Cloudflow outperforms state-of-the-art prediction serving systems by as much as 2x on real-world prediction pipelines, meeting latency goals of demanding applications like real-time video analysis.},
 archiveprefix = {arXiv},
 author = {Vikram Sreekanti and Harikaran Subbaraj and Chenggang Wu and Joseph E. Gonzalez and Joseph M. Hellerstein},
 date-modified = {2020-08-02 11:27:35 -0700},
 eprint = {2007.05832},
 journal = {CoRR},
 keywords = {arxivpre},
 title = {Optimizing Prediction Serving on Low-Latency Serverless Dataflow},
 url = {https://arxiv.org/abs/2007.05832},
 volume = {abs/2007.05832},
 year = {2020}
}

Ajay Kumar Tanwani, Raghav Anand, Joseph E. Gonzalez, and Ken Goldberg. "RILaaS: Robot Inference and Learning as a Service." IEEE Robotics and Automation Letters, 2020.

Programming robots is complicated due to the lack of 'plug-and-play' modules for skill acquisition. Virtualizing deployment of deep learning models can facilitate large-scale use/re-use of off-the-shelf functional behaviors. Deploying deep learning models on robots entails real-time, accurate and reliable inference service under varying query load. This letter introduces a novel Robot-Inference-and-Learning-as-a-Service (RILaaS) platform for low-latency and secure inference serving of deep models that can be deployed on robots. Unique features of RILaaS include: 1) low-latency and reliable serving with gRPC under dynamic loads by distributing queries over multiple servers on Edge and Cloud, 2) SSH based authentication coupled with SSL/TLS based encryption for security and privacy of the data, and 3) front-end REST API for sharing, monitoring and visualizing performance metrics of the available models. We report experiments to evaluate the RILaaS platform under varying loads of batch size, number of robots, and various model placement hosts on Cloud, Edge, and Fog for providing benchmark applications of object recognition and grasp planning as a service. We address the complexity of load balancing with a reinforcement learning algorithm that optimizes simulated profiles of networked robots, outperforming several baselines including round robin, least connections, and least model time with 68.30\% and 14.04\% decrease in round-trip latency time across models compared to the worst and the next best baseline respectively. Details and updates are available at https://sites.google.com/view/rilaas.

@article{Tanwani20,
 abstract = {Programming robots is complicated due to the lack of `plug-and-play' modules for skill acquisition. Virtualizing deployment of deep learning models can facilitate large-scale use/re-use of off-the-shelf functional behaviors. Deploying deep learning models on robots entails real-time, accurate and reliable inference service under varying query load. This letter introduces a novel Robot-Inference-and-Learning-as-a-Service (RILaaS) platform for low-latency and secure inference serving of deep models that can be deployed on robots. Unique features of RILaaS include: 1) low-latency and reliable serving with gRPC under dynamic loads by distributing queries over multiple servers on Edge and Cloud, 2) SSH based authentication coupled with SSL/TLS based encryption for security and privacy of the data, and 3) front-end REST API for sharing, monitoring and visualizing performance metrics of the available models. We report experiments to evaluate the RILaaS platform under varying loads of batch size, number of robots, and various model placement hosts on Cloud, Edge, and Fog for providing benchmark applications of object recognition and grasp planning as a service. We address the complexity of load balancing with a reinforcement learning algorithm that optimizes simulated profiles of networked robots; outperforming several baselines including round robin, least connections, and least model time with 68.30\% and 14.04\% decrease in round-trip latency time across models compared to the worst and the next best baseline respectively. Details and updates are available at: \url{https://sites.google.com/view/rilaas.} },
 author = {Ajay Kumar Tanwani and Raghav Anand and Joseph E. Gonzalez and Ken Goldberg},
 date-modified = {2020-08-02 11:27:35 -0700},
 journal = {IEEE Robotics and Automation Letters},
 keywords = {peerrev},
 number = {3},
 pages = {4423--4430},
 title = { {RILaaS}: Robot Inference and Learning as a Service},
 url = {https://ieeexplore.ieee.org/document/9103220},
 volume = {5},
 year = {2020}
}

Brijen Thananjeyan, Ashwin Balakrishna, Ugo Rosolia, Felix Li, Rowan McAllister, Joseph E. Gonzalez, Sergey Levine, Francesco Borrelli, and Ken Goldberg. "Safety Augmented Value Estimation From Demonstrations (SAVED): Safe Deep Model-Based RL for Sparse Cost Robotic Tasks." IEEE Robotics Autom. Lett., 2020.

Reinforcement learning (RL) for robotics is challenging due to the difficulty in hand-engineering a dense cost function, which can lead to unintended behavior, and dynamical uncertainty, which makes exploration and constraint satisfaction challenging. We address these issues with a new model-based reinforcement learning algorithm, Safety Augmented Value Estimation from Demonstrations (SAVED), which uses supervision that only identifies task completion and a modest set of suboptimal demonstrations to constrain exploration and learn efficiently while handling complex constraints. We then compare SAVED with 3 state-of-the-art model-based and model-free RL algorithms on 6 standard simulation benchmarks involving navigation and manipulation and a physical knot-tying task on the da Vinci surgical robot. Results suggest that SAVED outperforms prior methods in terms of success rate, constraint satisfaction, and sample efficiency, making it feasible to safely learn a control policy directly on a real robot in less than an hour. For tasks on the robot, baselines succeed less than 5\% of the time while SAVED has a success rate of over 75\% in the first 50 training iterations. Code and supplementary material are available.

@article{SAVED20,
 abstract = {Reinforcement learning (RL) for robotics is challenging due to the difficulty in hand-engineering a dense cost function, which can lead to unintended behavior, and dynamical uncertainty, which makes exploration and constraint satisfaction challenging. We address these issues with a new model-based reinforcement learning algorithm, Safety Augmented Value Estimation from Demonstrations (SAVED), which uses supervision that only identifies task completion and a modest set of suboptimal demonstrations to constrain exploration and learn efficiently while handling complex constraints. We then compare SAVED with 3 state-of-the-art model-based and model-free RL algorithms on 6 standard simulation benchmarks involving navigation and manipulation and a physical knot-tying task on the da Vinci surgical robot. Results suggest that SAVED outperforms prior methods in terms of success rate, constraint satisfaction, and sample efficiency, making it feasible to safely learn a control policy directly on a real robot in less than an hour. For tasks on the robot, baselines succeed less than 5\% of the time while SAVED has a success rate of over 75\% in the first 50 training iterations. Code and supplementary material is available},
 author = {Brijen Thananjeyan and Ashwin Balakrishna and Ugo Rosolia and Felix Li and Rowan McAllister and Joseph E. Gonzalez and Sergey Levine and Francesco Borrelli and Ken Goldberg},
 bibsource = {dblp computer science bibliography, https://dblp.org},
 biburl = {https://dblp.org/rec/journals/ral/ThananjeyanBRLM20.bib},
 date-modified = {2020-08-02 11:27:35 -0700},
 journal = { {IEEE} Robotics Autom. Lett.},
 keywords = {peerrev, selected},
 number = {2},
 pages = {3612--3619},
 timestamp = {Fri, 22 May 2020 21:54:18 +0200},
 title = {Safety Augmented Value Estimation From Demonstrations {(SAVED):} Safe Deep Model-Based {RL} for Sparse Cost Robotic Tasks},
 url = {https://arxiv.org/abs/1905.13402},
 volume = {5},
 year = {2020}
}

Alvin Wan, Daniel Ho, Younjin Song, Henk Tillman, Sarah Adel Bargal, and Joseph E. Gonzalez. "SegNBDT: Visual Decision Rules for Segmentation." CoRR (arXiv), 2020.

The black-box nature of neural networks limits model decision interpretability, in particular for high-dimensional inputs in computer vision and for dense pixel prediction tasks like segmentation. To address this, prior work combines neural networks with decision trees. However, such models (1) perform poorly when compared to state-of-the-art segmentation models or (2) fail to produce decision rules with spatially-grounded semantic meaning. In this work, we build a hybrid neural-network and decision-tree model for segmentation that (1) attains neural network segmentation accuracy and (2) provides semi-automatically constructed visual decision rules such as "Is there a window?" We obtain semantic visual meaning by extending saliency methods to segmentation and attain accuracy by leveraging insights from neural-backed decision trees, a deep learning analog of decision trees for image classification. Our model SegNBDT attains accuracy within ~2-4\% of the state-of-the-art HRNetV2 segmentation model while also retaining explainability; we achieve state-of-the-art performance for explainable models on three benchmark datasets -- Pascal-Context (49.12\%), Cityscapes (79.01\%), and Look Into Person (51.64\%). Furthermore, user studies suggest visual decision rules are more interpretable, particularly for incorrect predictions. Code and pretrained models are publicly available.

@article{wan2020segnbdt,
 abstract = {The black-box nature of neural networks limits model decision interpretability, in particular for high-dimensional inputs in computer vision and for dense pixel prediction tasks like segmentation. To address this, prior work combines neural networks with decision trees. However, such models (1) perform poorly when compared to state-of-the-art segmentation models or (2) fail to produce decision rules with spatially-grounded semantic meaning. In this work, we build a hybrid neural-network and decision-tree model for segmentation that (1) attains neural network segmentation accuracy and (2) provides semi-automatically constructed visual decision rules such as ''Is there a window?.'' We obtain semantic visual meaning by extending saliency methods to segmentation and attain accuracy by leveraging insights from neural-backed decision trees, a deep learning analog of decision trees for image classification. Our model SegNBDT attains accuracy within ~2-4\% of the state-of-the-art HRNetV2 segmentation model while also retaining explainability; we achieve state-of-the-art performance for explainable models on three benchmark datasets -- Pascal-Context (49.12\%), Cityscapes (79.01\%), and Look Into Person (51.64\%). Furthermore, user studies suggest visual decision rules are more interpretable, particularly for incorrect predictions. Code and pretrained models can be found at this https URL.},
 archiveprefix = {arXiv},
 author = {Alvin Wan and Daniel Ho and Younjin Song and Henk Tillman and Sarah Adel Bargal and Joseph E. Gonzalez},
 code = {https://github.com/daniel-ho/SegNBDT},
 date-modified = {2020-08-02 11:27:35 -0700},
 eprint = {2006.06868},
 journal = {CoRR},
 keywords = {arxivpre},
 primaryclass = {cs.CV},
 title = { {SegNBDT}: Visual Decision Rules for Segmentation},
 url = {https://arxiv.org/abs/2006.06868},
 year = {2020}
}

Samvit Jain, Xun Zhang, Yuhao Zhou, Ganesh Ananthanarayanan, Junchen Jiang, Yuanchao Shu, Paramvir Bahl, and Joseph Gonzalez. "Spatula: Efficient cross-camera video analytics on large camera networks." 5th IEEE/ACM Symposium on Edge Computing, SEC 2020, San Jose, CA, USA, November 12-14, 2020, 2020.

Cameras are deployed at scale with the purpose of searching and tracking objects of interest (e.g., a suspected person) through the camera network on live videos. Such cross-camera analytics is data- and compute-intensive, with costs that grow with the number of cameras and over time. We present Spatula, a cost-efficient system that enables scaling cross-camera analytics on edge compute boxes to large camera networks by leveraging spatial and temporal cross-camera correlations. While such correlations have been used in the computer vision community, Spatula uses them to drastically reduce communication and computation costs by pruning the search space of a query identity (e.g., ignoring frames not correlated with the query identity's current position). Spatula provides the first system substrate on which cross-camera analytics applications can be built to efficiently harness the cross-camera correlations that are abundant in large camera deployments. Spatula reduces compute load by 8.3x on an 8-camera dataset, and by 23x-86x on two datasets with hundreds of cameras (simulated from real vehicle/pedestrian traces). We have also implemented Spatula on a testbed of 5 AWS DeepLens cameras.
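
The pruning idea can be sketched as follows: given the camera where the query identity was last seen and a spatio-temporal correlation matrix learned from historical trajectories, run the expensive re-identification model only on sufficiently correlated cameras. The matrix and threshold below are made up for illustration.

# Prune the cross-camera search space using pairwise camera correlations.
import numpy as np

def cameras_to_search(corr, last_seen_cam, threshold=0.2):
    """corr[i, j]: likelihood the identity next appears on camera j, given camera i."""
    candidates = np.flatnonzero(corr[last_seen_cam] >= threshold)
    return [int(c) for c in candidates if c != last_seen_cam]

corr = np.array([
    [1.0, 0.6, 0.1, 0.0, 0.3],
    [0.5, 1.0, 0.4, 0.0, 0.0],
    [0.1, 0.4, 1.0, 0.7, 0.0],
    [0.0, 0.0, 0.6, 1.0, 0.2],
    [0.3, 0.0, 0.0, 0.2, 1.0],
])
print(cameras_to_search(corr, last_seen_cam=0))  # -> [1, 4]; uncorrelated feeds are skipped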

@inproceedings{JainZZAJSBG20,
 abstract = {Cameras are deployed at scale with the purpose of searching and tracking objects of interest (e.g., a suspected person) through the camera network on live videos. Such cross-camera analytics is data and compute intensive, whose costs grow with the number of cameras and time. We present Spatula, a cost-efficient system that enables scaling cross-camera analytics on edge compute boxes to large camera networks by leveraging the spatial and temporal cross-camera correlations. While such correlations have been used in computer vision community, Spatula uses them to drastically reduce the communication and computation costs by pruning search space of a query identity (e.g., ignoring frames not correlated with the query identity's current position). Spatula provides the first system substrate on which cross-camera analytics applications can be built to efficiently harness the cross-camera correlations that are abundant in large camera deployments. Spatula reduces compute load by 8.3x on an 8-camera dataset, and by 23x-86x on two datasets with hundreds of cameras (simulated from real vehicle/pedestrian traces). We have also implemented Spatula on a testbed of 5 AWS DeepLens cameras.},
 author = {Samvit Jain and Xun Zhang and Yuhao Zhou and Ganesh Ananthanarayanan and Junchen Jiang and Yuanchao Shu and Paramvir Bahl and Joseph Gonzalez},
 booktitle = {5th {IEEE/ACM} Symposium on Edge Computing, {SEC} 2020, San Jose, CA, USA, November 12-14, 2020},
 doi = {10.1109/SEC50012.2020.00016},
 keywords = {peerrev},
 pages = {110--124},
 publisher = { {IEEE} },
 title = {Spatula: Efficient cross-camera video analytics on large camera networks},
 url = {https://doi.org/10.1109/SEC50012.2020.00016},
 year = {2020}
}

Bohan Zhai, Tianren Gao, Flora Xue, Daniel Rothchild, Bichen Wu, Joseph E. Gonzalez, and Kurt Keutzer. "SqueezeWave: Extremely Lightweight Vocoders for On-device Speech Synthesis." CoRR (arXiv), 2020.

Automatic speech synthesis is a challenging task that is becoming increasingly important as edge devices begin to interact with users through speech. Typical text-to-speech pipelines include a vocoder, which translates intermediate audio representations into an audio waveform. Most existing vocoders are difficult to parallelize since each generated sample is conditioned on previous samples. WaveGlow is a flow-based feed-forward alternative to these auto-regressive models (Prenger et al., 2019). However, while WaveGlow can be easily parallelized, the model is too expensive for real-time speech synthesis on the edge. This paper presents SqueezeWave, a family of lightweight vocoders based on WaveGlow that can generate audio of similar quality to WaveGlow with 61x - 214x fewer MACs. Code, trained models, and generated audio are publicly available.

@article{SqueezeWave20,
 abstract = {Automatic speech synthesis is a challenging task that is becoming increasingly important as edge devices begin to interact with users through speech. Typical text-to-speech pipelines include a vocoder, which translates intermediate audio representations into an audio waveform. Most existing vocoders are difficult to parallelize since each generated sample is conditioned on previous samples. WaveGlow is a flow-based feed-forward alternative to these auto-regressive models (Prenger et al., 2019). However, while WaveGlow can be easily parallelized, the model is too expensive for real-time speech synthesis on the edge. This paper presents SqueezeWave, a family of lightweight vocoders based on WaveGlow that can generate audio of similar quality to WaveGlow with 61x - 214x fewer MACs. Code, trained models, and generated audio are publicly available at this https URL.},
 archiveprefix = {arXiv},
 author = {Bohan Zhai and Tianren Gao and Flora Xue and Daniel Rothchild and Bichen Wu and Joseph E. Gonzalez and Kurt Keutzer},
 date-modified = {2020-08-02 11:27:35 -0700},
 eprint = {2001.05685},
 journal = {CoRR},
 keywords = {arxivpre},
 title = {SqueezeWave: Extremely Lightweight Vocoders for On-device Speech Synthesis},
 url = {https://arxiv.org/abs/2001.05685},
 volume = {abs/2001.05685},
 year = {2020}
}

Devin Petersohn, William W. Ma, Doris Jung Lin Lee, Stephen Macke, Doris Xin, Xiangxi Mo, Joseph E. Gonzalez, Joseph M. Hellerstein, Anthony D. Joseph, and Aditya G. Parameswaran. "Towards Scalable Dataframe Systems." Proceedings of Very Large Data Bases (PVLDB), 2020.

Dataframes are a popular abstraction to represent, prepare, and analyze data. Despite the remarkable success of dataframe libraries in R and Python, dataframes face performance issues even on moderately large datasets. Moreover, there is significant ambiguity regarding dataframe semantics. In this paper we lay out a vision and roadmap for scalable dataframe systems. To demonstrate the potential in this area, we report on our experience building MODIN, a scaled-up implementation of the most widely-used and complex dataframe API today, Python's pandas. With pandas as a reference, we propose a simple data model and algebra for dataframes to ground discussion in the field. Given this foundation, we lay out an agenda of open research opportunities where the distinct features of dataframes will require extending the state of the art in many dimensions of data management. We discuss the implications of signature dataframe features including flexible schemas, ordering, row/column equivalence, and data/metadata fluidity, as well as the piecemeal, trial-and-error-based approach to interacting with dataframes.
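
Because Modin exposes a pandas-compatible API, an existing script can usually be scaled out by changing only the import, which is Modin's documented usage pattern; the file and column names below are illustrative.

# Drop-in replacement: keep the pandas API, run it on Modin's parallel backend.
import modin.pandas as pd  # instead of: import pandas as pd

df = pd.read_csv("taxi_trips.csv")  # hypothetical dataset
busiest = (df.groupby("pickup_zone")["fare"]
             .agg(["count", "mean"])
             .sort_values("count", ascending=False)
             .head(10))
print(busiest)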

@inproceedings{Modin20,
 abstract = {Dataframes are a popular abstraction to represent, prepare, and analyze data. Despite the remarkable success of dataframe libraries in R and Python, dataframes face performance issues even on moderately large datasets. Moreover, there is significant ambiguity regarding dataframe semantics. In this paper we lay out a vision and roadmap for scalable dataframe systems. To demonstrate the potential in this area, we report on our experience building MODIN, a scaled-up implementation of the most widely-used and complex dataframe API today, Python's pandas. With pandas as a reference, we propose a simple data model and algebra for dataframes to ground discussion in the field. Given this foundation, we lay out an agenda of open research opportunities where the distinct features of dataframes will require extending the state of the art in many dimensions of data management. We discuss the implications of signature dataframe features including flexible schemas, ordering, row/column equivalence, and data/metadata fluidity, as well as the piecemeal, trial-and-error-based approach to interacting with dataframes.},
 author = {Devin Petersohn and William W. Ma and Doris Jung Lin Lee and Stephen Macke and Doris Xin and Xiangxi Mo and Joseph E. Gonzalez and Joseph M. Hellerstein and Anthony D. Joseph and Aditya G. Parameswaran},
 booktitle = {Proceedings of Very Large Data Bases (PVLDB)},
 code = {https://github.com/modin-project/modin},
 date-modified = {2020-08-02 11:27:35 -0700},
 keywords = {peerrev},
 title = {Towards Scalable Dataframe Systems},
 url = {http://www.vldb.org/pvldb/vol13/p2033-petersohn.pdf},
 volume = {13},
 year = {2020}
}

Samvit Jain, Xin Wang, and Joseph Gonzalez. "Accel: A Corrective Fusion Network for Efficient Semantic Segmentation on Video." The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

We present Accel, a novel semantic video segmentation system that achieves high accuracy at low inference cost by combining the predictions of two network branches: (1) a reference branch that extracts high-detail features on a reference keyframe, and warps these features forward using frame-to-frame optical flow estimates, and (2) an update branch that computes features of adjustable quality on the current frame, performing a temporal update at each video frame. The modularity of the update branch, where feature subnetworks of varying layer depth can be inserted (e.g. ResNet-18 to ResNet-101), enables operation over a new, state-of-the-art accuracy-throughput trade-off spectrum. Over this curve, Accel models achieve both higher accuracy and faster inference times than the closest comparable single-frame segmentation networks. In general, Accel significantly outperforms previous work on efficient semantic video segmentation, correcting warping-related error that compounds on datasets with complex dynamics. Accel is end-to-end trainable and highly modular: the reference network, the optical flow network, and the update network can each be selected independently, depending on application requirements, and then jointly fine-tuned. The result is a robust, general system for fast, high-accuracy semantic segmentation on video.
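
The corrective-fusion step can be sketched as: warp the reference branch's features forward along the optical flow, then merge them with the current frame's cheaper features through a 1x1 convolution. This is a generic PyTorch sketch of that pattern, not Accel's implementation.

# Flow-warp reference features, then fuse with the update branch's features.
import torch
import torch.nn.functional as F

def flow_warp(feat, flow):
    """feat: (N, C, H, W); flow: (N, 2, H, W) pixel offsets (dx, dy)."""
    n, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32), indexing="ij")
    grid_x = xs.unsqueeze(0) + flow[:, 0]  # (N, H, W)
    grid_y = ys.unsqueeze(0) + flow[:, 1]
    grid = torch.stack([2 * grid_x / (w - 1) - 1, 2 * grid_y / (h - 1) - 1], dim=-1)
    return F.grid_sample(feat, grid, align_corners=True)

class CorrectiveFusion(torch.nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.fuse = torch.nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, warped_reference_feat, update_feat):
        return self.fuse(torch.cat([warped_reference_feat, update_feat], dim=1))

ref_feat, cur_feat = torch.randn(1, 8, 16, 16), torch.randn(1, 8, 16, 16)
flow = torch.zeros(1, 2, 16, 16)  # zero flow: identity warp
print(CorrectiveFusion(8)(flow_warp(ref_feat, flow), cur_feat).shape)  # (1, 8, 16, 16)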

@inproceedings{Accel19,
 abstract = {We present Accel, a novel semantic video segmentation system that achieves high accuracy at low inference cost by combining the predictions of two network branches: (1) a reference branch that extracts high-detail features on a reference keyframe, and warps these features forward using frame-to-frame optical flow estimates, and (2) an update branch that computes features of adjustable quality on the current frame, performing a temporal update at each video frame. The modularity of the update branch, where feature subnetworks of varying layer depth can be inserted (e.g. ResNet-18 to ResNet-101), enables operation over a new, state-of-the-art accuracy-throughput trade-off spectrum. Over this curve, Accel models achieve both higher accuracy and faster inference times than the closest comparable single-frame segmentation networks. In general, Accel significantly outperforms previous work on efficient semantic video segmentation, correcting warping-related error that compounds on datasets with complex dynamics. Accel is end-to-end trainable and highly modular: the reference network, the optical flow network, and the update network can each be selected independently, depending on application requirements, and then jointly fine-tuned. The result is a robust, general system for fast, high-accuracy semantic segmentation on video.},
 author = {Samvit Jain and Xin Wang and Joseph Gonzalez},
 booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
 date-modified = {2020-08-02 11:27:35 -0700},
 keywords = {peerrev},
 month = {6},
 title = {Accel: {A} Corrective Fusion Network for Efficient Semantic Segmentation on Video},
 url = {http://arxiv.org/abs/1807.06667},
 year = {2019}
}

Xin Wang, Fisher Yu, Ruth Wang, Trevor Darrell, and Joseph E. Gonzalez. "TAFE-Net: Task-Aware Feature Embeddings for Low Shot Learning." The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

Learning good feature embeddings for images often requires substantial training data. As a consequence, in settings where training data is limited (e.g., few-shot and zero-shot learning), we are typically forced to use a generic feature embedding across various tasks. Ideally, we want to construct feature embeddings that are tuned for the given task. In this work, we propose Task-Aware Feature Embedding Networks (TAFE-Nets) to learn how to adapt the image representation to a new task in a meta learning fashion. Our network is composed of a meta learner and a prediction network. Based on a task input, the meta learner generates parameters for the feature layers in the prediction network so that the feature embedding can be accurately adjusted for that task. We show that TAFE-Net is highly effective in generalizing to new tasks or concepts and evaluate the TAFE-Net on a range of benchmarks in zero-shot and few-shot learning. Our model matches or exceeds the state-of-the-art on all tasks. In particular, our approach improves the prediction accuracy of unseen attribute-object pairs by 4 to 15 points on the challenging visual attribute-object composition task.
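
The central mechanism, a meta learner that generates the parameters of the feature layers from a task embedding, can be sketched as a small hypernetwork producing the weight and bias of one linear layer; the dimensions and structure are illustrative, not the paper's architecture.

# A weight-generating meta learner makes the image feature task-aware.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TaskAwareFeature(nn.Module):
    def __init__(self, task_dim=16, feat_in=64, feat_out=32):
        super().__init__()
        self.feat_in, self.feat_out = feat_in, feat_out
        self.weight_gen = nn.Linear(task_dim, feat_in * feat_out)  # meta learner
        self.bias_gen = nn.Linear(task_dim, feat_out)

    def forward(self, image_feat, task_emb):
        w = self.weight_gen(task_emb).view(self.feat_out, self.feat_in)
        b = self.bias_gen(task_emb)
        return F.linear(image_feat, w, b)  # task-conditioned feature embedding

model = TaskAwareFeature()
image_feat = torch.randn(5, 64)  # backbone features for 5 images
task_emb = torch.randn(16)       # embedding of the task description
print(model(image_feat, task_emb).shape)  # torch.Size([5, 32])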

@inproceedings{Tafe19,
 abstract = {Learning good feature embeddings for images often requires substantial training data. As a consequence, in settings where training data is limited (e.g., few-shot and zero-shot learning), we are typically forced to use a generic feature embedding across various tasks. Ideally, we want to construct feature embeddings that are tuned for the given task. In this work, we propose Task-Aware Feature Embedding Networks (TAFE-Nets) to learn how to adapt the image representation to a new task in a meta learning fashion. Our network is composed of a meta learner and a prediction network. Based on a task input, the meta learner generates parameters for the feature layers in the prediction network so that the feature embedding can be accurately adjusted for that task. We show that TAFE-Net is highly effective in generalizing to new tasks or concepts and evaluate the TAFE-Net on a range of benchmarks in zero-shot and few-shot learning. Our model matches or exceeds the state-of-the-art on all tasks. In particular, our approach improves the prediction accuracy of unseen attribute-object pairs by 4 to 15 points on the challenging visual attribute-object composition task.},
 author = {Xin Wang and Fisher Yu and Ruth Wang and Trevor Darrell and Joseph E. Gonzalez},
 booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
 code = {https://github.com/ucbdrive/tafe-net},
 date-modified = {2020-08-02 11:27:35 -0700},
 keywords = {peerrev},
 month = {6},
 title = { {TAFE-Net}: Task-Aware Feature Embeddings for Low Shot Learning},
 url = {https://arxiv.org/abs/1904.05967},
 year = {2019}
}

Eric Jonas, Johann Schleier-Smith, Vikram Sreekanti, Chia-Che Tsai, Anurag Khandelwal, Qifan Pu, Vaishaal Shankar, Joao Menezes Carreira, Karl Krauth, Neeraja Yadwadkar, Joseph E. Gonzalez, Raluca Ada Popa, Ion Stoica, and David A. Patterson. "Cloud Programming Simplified: A Berkeley View on Serverless Computing." EECS Department, University of California, Berkeley Technical Report, 2019.

Serverless cloud computing handles virtually all the system administration operations needed to make it easier for programmers to use the cloud. It provides an interface that greatly simplifies cloud programming, and represents an evolution that parallels the transition from assembly language to high-level programming languages. This paper gives a quick history of cloud computing, including an accounting of the predictions of the 2009 Berkeley View of Cloud Computing paper, explains the motivation for serverless computing, describes applications that stretch the current limits of serverless, and then lists obstacles and research opportunities required for serverless computing to fulfill its full potential. Just as the 2009 paper identified challenges for the cloud and predicted they would be addressed and that cloud use would accelerate, we predict these issues are solvable and that serverless computing will grow to dominate the future of cloud computing.

@techreport{Jonas2019,
 abstract = {Serverless cloud computing handles virtually all the system administration operations needed to make it easier for programmers to use the cloud. It provides an interface that greatly simplifies cloud programming, and represents an evolution that parallels the transition from assembly language to high-level programming languages. This paper gives a quick history of cloud computing, including an accounting of the predictions of the 2009 Berkeley View of Cloud Computing paper, explains the motivation for serverless computing, describes applications that stretch the current limits of serverless, and then lists obstacles and research opportunities required for serverless computing to fulfill its full potential. Just as the 2009 paper identified challenges for the cloud and predicted they would be addressed and that cloud use would accelerate, we predict these issues are solvable and that serverless computing will grow to dominate the future of cloud computing.},
 author = {Eric Jonas and Johann Schleier-Smith and Vikram Sreekanti and Chia-Che Tsai and Anurag Khandelwal and Qifan Pu and Vaishaal Shankar and Joao Menezes Carreira and Karl Krauth and Neeraja Yadwadkar and Joseph E. Gonzalez and Raluca Ada Popa and Ion Stoica and David A. Patterson},
 date-modified = {2020-08-02 11:27:35 -0700},
 institution = {EECS Department, University of California, Berkeley},
 keywords = {techreport},
 month = {2},
 number = {UCB/EECS-2019-3},
 title = {Cloud Programming Simplified: A Berkeley View on Serverless Computing},
 url = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2019/EECS-2019-3.html},
 year = {2019}
}

Samvit Jain, Ganesh Ananthanarayanan, Junchen Jiang, Yuanchao Shu, and Joseph E. Gonzalez. "Scaling Video Analytics Systems to Large Camera Deployments." HotMobile `19, Proceedings of the 20th International Workshop on Mobile Computing Systems and Applications, 2019.

Driven by advances in computer vision and the falling costs of camera hardware, organizations are deploying video cameras en masse for the spatial monitoring of their physical premises. Scaling video analytics to massive camera deployments, however, presents a new and mounting challenge, as compute cost grows proportionally to the number of camera feeds. This paper is driven by a simple question: can we scale video analytics in such a way that cost grows sublinearly, or even remains constant, as we deploy more cameras, while inference accuracy remains stable, or even improves? We believe the answer is yes. Our key observation is that video feeds from wide-area camera deployments demonstrate significant content correlations (e.g. to other geographically proximate feeds), both in space and over time. These spatio-temporal correlations can be harnessed to dramatically reduce the size of the inference search space, decreasing both workload and false positive rates in multi-camera video analytics. By discussing use cases and technical challenges, we propose a roadmap for scaling video analytics to large camera networks, and outline a plan for its realization.

@inproceedings{Hotmobil2019,
 abstract = {Driven by advances in computer vision and the falling costs of camera hardware, organizations are deploying video cameras en masse for the spatial monitoring of their physical premises. Scaling video analytics to massive camera deployments, however, presents a new and mounting challenge, as compute cost grows proportionally to the number of camera feeds. This paper is driven by a simple question: can we scale video analytics in such a way that cost grows sublinearly, or even remains constant, as we deploy more cameras, while inference accuracy remains stable, or even improves. We believe the answer is yes. Our key observation is that video feeds from wide-area camera deployments demonstrate significant content correlations (e.g. to other geographically proximate feeds), both in space and over time. These spatio-temporal correlations can be harnessed to dramatically reduce the size of the inference search space, decreasing both workload and false positive rates in multi-camera video analytics. By discussing use-cases and technical challenges, we propose a roadmap for scaling video analytics to large camera networks, and outline a plan for its realization.},
 author = {Samvit Jain and Ganesh Ananthanarayanan and Junchen Jiang and Yuanchao Shu and Joseph E. Gonzalez},
 booktitle = {HotMobile `19, Proceedings of the 20th International Workshop on Mobile Computing Systems and Applications},
 date-modified = {2020-08-02 11:27:35 -0700},
 keywords = {peerrev},
 month = {2},
 title = {Scaling Video Analytics Systems to Large Camera Deployments},
 url = {https://arxiv.org/abs/1809.02318},
 year = {2019}
}

Zuxuan Wu, Xin Wang, Joseph E. Gonzalez, Tom Goldstein, and Larry S. Davis. "ACE: Adapting to Changing Environments for Semantic Segmentation." International Conference on Computer Vision (ICCV), 2019.

Deep neural networks exhibit exceptional accuracy when they are trained and tested on the same data distributions. However, neural classifiers are often extremely brittle when confronted with domain shift---changes in the input distribution that occur over time. We present ACE, a framework for semantic segmentation that dynamically adapts to changing environments over time. By aligning the distribution of labeled training data from the original source domain with the distribution of incoming data in a shifted domain, ACE synthesizes labeled training data for environments as it sees them. This stylized data is then used to update a segmentation model so that it performs well in new environments. To avoid forgetting knowledge from past environments, we introduce a memory that stores feature statistics from previously seen domains. These statistics can be used to replay images in any of the previously observed domains, thus preventing catastrophic forgetting. In addition to standard batch training using stochastic gradient descent (SGD), we also experiment with fast adaptation methods based on adaptive meta-learning. Extensive experiments are conducted on two datasets from SYNTHIA, and the results demonstrate the effectiveness of the proposed approach when adapting to a number of tasks.
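
The statistics-alignment idea can be illustrated with a small re-normalization sketch; this is an assumed simplification (the function names, the dictionary-style domain memory, and the AdaIN-style re-scaling are illustrative stand-ins, not the paper's implementation):

import torch

def domain_stats(feats):
    # Per-channel mean/std over batch and spatial dimensions.
    # feats: (N, C, H, W)
    mean = feats.mean(dim=(0, 2, 3), keepdim=True)
    std = feats.std(dim=(0, 2, 3), keepdim=True) + 1e-5
    return mean, std

def align_to_domain(source_feats, target_mean, target_std):
    # Re-normalize source features so their channel statistics
    # match the remembered statistics of a target domain.
    src_mean, src_std = domain_stats(source_feats)
    return (source_feats - src_mean) / src_std * target_std + target_mean

# A toy "memory" of previously seen domains: domain id -> (mean, std).
memory = {}
target_feats = torch.randn(8, 64, 32, 32)      # features from the shifted domain
memory["night"] = domain_stats(target_feats)   # store statistics for later replay

source_feats = torch.randn(8, 64, 32, 32)      # labeled source-domain features
stylized = align_to_domain(source_feats, *memory["night"])

Keeping only per-channel means and standard deviations is what would make such a memory cheap enough to retain for every previously seen domain.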

@inproceedings{WangICCV19,
 abstract = {Deep neural networks exhibit exceptional accuracy when they are trained and tested on the same data distributions. However, neural classifiers are often extremely brittle when confronted with domain shift---changes in the input distribution that occur over time. We present ACE, a framework for semantic segmentation that dynamically adapts to changing environments over time. By aligning the distribution of labeled training data from the original source domain with the distribution of incoming data in a shifted domain, ACE synthesizes labeled training data for environments as it sees them. This stylized data is then used to update a segmentation model so that it performs well in new environments. To avoid forgetting knowledge from past environments, we introduce a memory that stores feature statistics from previously seen domains. These statistics can be used to replay images in any of the previously observed domains, thus preventing catastrophic forgetting. In addition to standard batch training using stochastic gradient descent (SGD), we also experiment with fast adaptation methods based on adaptive meta-learning. Extensive experiments are conducted on two datasets from SYNTHIA, and the results demonstrate the effectiveness of the proposed approach when adapting to a number of tasks.},
 author = {Zuxuan Wu and Xin Wang and Joseph E. Gonzalez and Tom Goldstein and Larry S. Davis},
 booktitle = {International Conference on Computer Vision ({ICCV})},
 date-modified = {2020-08-02 11:27:35 -0700},
 keywords = {peerrev},
 month = {10},
 title = { {ACE}: Adapting to Changing Environments for Semantic Segmentation},
 url = {http://arxiv.org/abs/1904.06268},
 year = {2019}
}

Joseph M. Hellerstein, Jose M. Faleiro, Joseph E. Gonzalez, Johann Schleier-Smith, Vikram Sreekanti, Alexey Tumanov, and Chenggang Wu. "Serverless Computing: One Step Forward, Two Steps Back." Conference on Innovative Data Systems Research (CIDR '19), 2019.

Serverless computing offers the potential to program the cloud in an autoscaling, pay-as-you go manner. In this paper we address critical gaps in first-generation serverless computing, which place its autoscaling potential at odds with dominant trends in modern computing: notably data-centric and distributed computing, but also open source and custom hardware. Put together, these gaps make current serverless offerings a bad fit for cloud innovation and particularly bad for data systems innovation. In addition to pinpointing some of the main shortfalls of current serverless architectures, we raise a set of challenges we believe must be met to unlock the radical potential that the cloud---with its exabytes of storage and millions of cores---should offer to innovative developers.

@inproceedings{cidr19,
 abstract = {Serverless computing offers the potential to program the cloud in an autoscaling, pay-as-you go manner. In this paper we address critical gaps in first-generation serverless computing, which place its autoscaling potential at odds with dominant trends in modern computing: notably data-centric and distributed computing, but also open source and custom hardware. Put together, these gaps make current serverless offerings a bad fit for cloud innovation and particularly bad for data systems innovation. In addition to pinpointing some of the main shortfalls of current serverless architectures, we raise a set of challenges we believe must be met to unlock the radical potential that the cloud---with its exabytes of storage and millions of cores---should offer to innovative developers.},
 author = {Joseph M. Hellerstein and Jose M. Faleiro and Joseph E. Gonzalez and Johann Schleier{-}Smith and Vikram Sreekanti and Alexey Tumanov and Chenggang Wu},
 booktitle = {Conference on Innovative Data Systems Research ({CIDR} '19)},
 date-modified = {2020-08-02 11:27:35 -0700},
 keywords = {peerrev},
 month = {1},
 title = {Serverless Computing: One Step Forward, Two Steps Back},
 url = {https://arxiv.org/abs/1812.03651},
 year = {2019}
}

Ajay Kumar Tanwani, Nitesh Mor, John Kubiatowicz, Joseph E. Gonzalez, and Ken Goldberg. "A Fog Robotics Approach to Deep Robot Learning: Application to Object Recognition and Grasp Planning in Surface Decluttering." International Conference on Robotics and Automation, ICRA 2019, Montreal, QC, Canada, May 20-24, 2019, 2019.

The growing demand for industrial, automotive and service robots presents a challenge to the centralized Cloud Robotics model in terms of privacy, security, latency, bandwidth, and reliability. In this paper, we present a `Fog Robotics' approach to deep robot learning that distributes compute, storage and networking resources between the Cloud and the Edge in a federated manner. Deep models are trained on non-private (public) synthetic images in the Cloud; the models are adapted to the private real images of the environment at the Edge within a trusted network and subsequently deployed as a service for low-latency and secure inference/prediction for other robots in the network. We apply this approach to surface decluttering, where a mobile robot picks and sorts objects from a cluttered floor by learning a deep object recognition and a grasp planning model. Experiments suggest that Fog Robotics can improve performance by sim-to-real domain adaptation in comparison to exclusively using Cloud or Edge resources, while reducing the inference cycle time by 4x to successfully declutter 86% of objects over 213 attempts.

@inproceedings{Tanwani19,
 abstract = {The growing demand of industrial, automotive and service robots presents a challenge to the centralized Cloud Robotics model in terms of privacy, security, latency, bandwidth, and reliability. In this paper, we present a `Fog Robotics' approach to deep robot learning that distributes compute, storage and networking resources between the Cloud and the Edge in a federated manner. Deep models are trained on non-private (public) synthetic images in the Cloud; the models are adapted to the private real images of the environment at the Edge within a trusted network and subsequently, deployed as a service for low-latency and secure inference/prediction for other robots in the network. We apply this approach to surface decluttering, where a mobile robot picks and sorts objects from a cluttered floor by learning a deep object recognition and a grasp planning model. Experiments suggest that Fog Robotics can improve performance by sim-to-real domain adaptation in comparison to exclusively using Cloud or Edge resources, while reducing the inference cycle time by 4\times to successfully declutter 86\% of objects over 213 attempts.},
 author = {Ajay Kumar Tanwani and Nitesh Mor and John Kubiatowicz and Joseph E. Gonzalez and Ken Goldberg},
 bibsource = {dblp computer science bibliography, https://dblp.org},
 biburl = {https://dblp.org/rec/bib/conf/icra/TanwaniMKGG19},
 booktitle = {International Conference on Robotics and Automation, {ICRA} 2019, Montreal, QC, Canada, May 20-24, 2019},
 date-modified = {2020-08-02 11:27:35 -0700},
 keywords = {peerrev},
 pages = {4559--4566},
 timestamp = {Tue, 13 Aug 2019 20:25:20 +0200},
 title = {A Fog Robotics Approach to Deep Robot Learning: Application to Object Recognition and Grasp Planning in Surface Decluttering},
 url = {https://doi.org/10.1109/ICRA.2019.8793690},
 year = {2019}
}

Tianjun Zhang, Zhewei Yao, Amir Gholami, Kurt Keutzer, Joseph E. Gonzalez, George Biros, and Michael W. Mahoney. "ANODEV2: A Coupled Neural ODE Evolution Framework." Neural Information Processing Systems (NeurIPS), 2019.

It has been observed that residual networks can be viewed as the explicit Euler discretization of an Ordinary Differential Equation (ODE). This observation motivated the introduction of so-called Neural ODEs, which allow more general discretization schemes with adaptive time stepping. Here, we propose ANODEV2, which is an extension of this approach that also allows evolution of the neural network parameters, in a coupled ODE-based formulation. The Neural ODE method introduced earlier is in fact a special case of this new more general framework. We present the formulation of ANODEV2, derive optimality conditions, and implement a coupled reaction-diffusion-advection version of this framework in PyTorch. We present empirical results using several different configurations of ANODEV2, testing them on multiple models on CIFAR-10. We report results showing that this coupled ODE-based framework is indeed trainable, and that it achieves higher accuracy, as compared to the baseline models as well as the recently-proposed Neural ODE approach.
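
For orientation, the correspondence and the coupled formulation described above can be sketched in LaTeX (illustrative notation, not the paper's exact symbols). A residual block z_{t+1} = z_t + f(z_t, \theta) is the explicit Euler step of \frac{dz(t)}{dt} = f(z(t), \theta), and the coupled extension additionally lets the parameters follow their own dynamics:

\frac{dz(t)}{dt} = f\big(z(t), \theta(t)\big), \qquad \frac{d\theta(t)}{dt} = q\big(\theta(t), p\big)

The standard Neural ODE is recovered when \theta(t) is held constant in time.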

@inproceedings{Zhang19,
 abstract = {It has been observed that residual networks can be viewed as the explicit Euler discretization of an Ordinary Differential Equation (ODE). This observation motivated the introduction of so-called Neural ODEs, which allow more general discretization schemes with adaptive time stepping. Here, we propose ANODEV2, which is an extension of this approach that also allows evolution of the neural network parameters, in a coupled ODE-based formulation. The Neural ODE method introduced earlier is in fact a special case of this new more general framework. We present the formulation of ANODEV2, derive optimality conditions, and implement a coupled reaction-diffusion-advection version of this framework in PyTorch. We present empirical results using several different configurations of ANODEV2, testing them on multiple models on CIFAR-10. We report results showing that this coupled ODE-based framework is indeed trainable, and that it achieves higher accuracy, as compared to the baseline models as well as the recently-proposed Neural ODE approach.},
 author = {Tianjun Zhang and Zhewei Yao and Amir Gholami and Kurt Keutzer and Joseph E. Gonzalez and George Biros and Michael W. Mahoney},
 booktitle = {Neural Information Processing Systems ({NeurIPS})},
 date-modified = {2020-08-02 11:27:35 -0700},
 keywords = {peerrev},
 title = { {ANODEV2:} A Coupled Neural ODE Evolution Framework},
 url = {https://arxiv.org/abs/1906.04596},
 year = {2019}
}

Vidit Saxena, Joakim Jald\'en, Joseph E. Gonzalez, Mats Bengtsson, Hugo M. Tullberg, and Ion Stoica. "Contextual Multi-Armed Bandits for Link Adaptation in Cellular Networks." Proceedings of the Workshop on Network Meets AI (NetAI) at SIGCOMM, 2019.

Cellular networks dynamically adjust the transmission parameters for a wireless link in response to its time-varying channel state. This is known as link adaptation, where the typical goal is to maximize the link throughput. State-of-the-art outer loop link adaptation (OLLA) selects the optimal transmission parameters based on an approximate, offline, model of the wireless link. Further, OLLA refines the offline model by dynamically compensating any deviations from the observed link performance. However, in practice, OLLA suffers from slow convergence and a sub-optimal link throughput. In this paper, we propose a link adaptation approach that overcomes the shortcomings of OLLA through a novel learning scheme. Our approach relies on contextual multi-armed bandits (MAB), where the context vector is composed of the instantaneous wireless channel state along with side information about the link. For a given context, our approach learns the success probability for each of the available transmission parameters, which is then exploited to select the throughput-maximizing parameters. Through numerical experiments, we show that our approach converges faster than OLLA and achieves a higher steady-state link throughput. For frequent and infrequent channel reports respectively, our scheme outperforms OLLA by 15% and 25% in terms of the steady-state link throughput.
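
As a rough illustration of the bandit loop, the sketch below uses Thompson sampling with a Beta posterior for each (discretized channel context, transmission configuration) pair and picks the arm maximizing expected throughput; the rate table, context discretization, and toy environment are assumptions made for illustration, not the paper's model:

import numpy as np

rng = np.random.default_rng(0)
rates = np.array([1.0, 2.0, 4.0, 6.0])   # hypothetical throughput of each transmission config (arm)
n_contexts = 8                            # discretized channel-quality buckets (simplification)

# Beta posterior over success probability for every (context, arm) pair.
alpha = np.ones((n_contexts, len(rates)))
beta = np.ones((n_contexts, len(rates)))

def select_arm(ctx):
    # Thompson sampling: sample a success probability per arm, then pick the
    # arm with the highest expected throughput (rate * sampled success prob).
    p = rng.beta(alpha[ctx], beta[ctx])
    return int(np.argmax(rates * p))

def update(ctx, arm, success):
    alpha[ctx, arm] += success
    beta[ctx, arm] += 1 - success

for step in range(10_000):
    ctx = rng.integers(n_contexts)                        # observed channel state
    arm = select_arm(ctx)
    # Toy environment: higher rates succeed less often, better channels more often.
    p_true = np.clip(1.2 * (ctx + 1) / n_contexts - 0.1 * arm, 0.05, 0.95)
    update(ctx, arm, int(rng.random() < p_true))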

@inproceedings{SaxenaSigcommNetAI,
 abstract = {Cellular networks dynamically adjust the transmission parameters for a wireless link in response to its time-varying channel state. This is known as link adaptation, where the typical goal is to maximize the link throughput. State-of-the-art outer loop link adaptation (OLLA) selects the optimal transmission parameters based on an approximate, offline, model of the wireless link. Further, OLLA refines the offline model by dynamically compensating any deviations from the observed link performance. However, in practice, OLLA suffers from slow convergence and a sub-optimal link throughput. In this paper, we propose a link adaptation approach that overcomes the shortcomings of OLLA through a novel learning scheme. Our approach relies on contextual multi-armed bandits (MAB), where the context vector is composed of the instantaneous wireless channel state along with side information about the link. For a given context, our approach learns the success probability for each of the available transmission parameters, which is then exploited to select the throughput-maximizing parameters. Through numerical experiments, we show that our approach converges faster than OLLA and achieves a higher steady-state link throughput. For frequent and infrequent channel reports respectively, our scheme outperforms OLLA by 15\% and 25\% in terms of the steady-state link throughput.},
 author = {Vidit Saxena and Joakim Jald{\'{e} }n and Joseph E. Gonzalez and Mats Bengtsson and Hugo M. Tullberg and Ion Stoica},
 bibsource = {dblp computer science bibliography, https://dblp.org},
 biburl = {https://dblp.org/rec/bib/conf/sigcomm/SaxenaJGBTS19},
 booktitle = {Proceedings of the Workshop on Network Meets {AI} ({NetAI}) at SIGCOMM},
 date-modified = {2020-08-02 11:27:35 -0700},
 keywords = {peerrev},
 pages = {44--49},
 timestamp = {Thu, 15 Aug 2019 09:19:24 +0200},
 title = {Contextual Multi-Armed Bandits for Link Adaptation in Cellular Networks},
 url = {https://doi.org/10.1145/3341216.3342212},
 year = {2019}
}

Xin Wang, Fisher Yu, Lisa Dunlap, Yi-An Ma, Ruth Wang, Azalia Mirhoseini, Trevor Darrell, and Joseph E. Gonzalez. "Deep Mixture of Experts via Shallow Embedding." Proceedings of the Thirty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI 2019, Tel Aviv, Israel, July 22-25, 2019, 2019.

Larger networks generally have greater representational power at the cost of increased computational complexity. Sparsifying such networks has been an active area of research but has been generally limited to static regularization or dynamic approaches using reinforcement learning. We explore a mixture of experts (MoE) approach to deep dynamic routing, which activates certain experts in the network on a per-example basis. Our novel DeepMoE architecture increases the representational power of standard convolutional networks by adaptively sparsifying and recalibrating channel-wise features in each convolutional layer. We employ a multi-headed sparse gating network to determine the selection and scaling of channels for each input, leveraging exponential combinations of experts within a single convolutional network. Our proposed architecture is evaluated on four benchmark datasets and tasks, and we show that Deep-MoEs are able to achieve higher accuracy with lower computation than standard convolutional networks.
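
A minimal sketch of per-example channel gating, assuming a ReLU-sparsified linear gate head fed by a shallow embedding of the input (class and variable names are illustrative, not the released code):

import torch
import torch.nn as nn

class GatedConv(nn.Module):
    """A convolution whose output channels are selected and rescaled per example."""
    def __init__(self, in_ch, out_ch, embed_dim=64):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        # One head of the gating network: embedding -> per-channel gate weights.
        self.gate = nn.Sequential(nn.Linear(embed_dim, out_ch), nn.ReLU())

    def forward(self, x, embedding):
        g = self.gate(embedding)              # (N, out_ch); ReLU zeroes channels -> sparsity
        return self.conv(x) * g[:, :, None, None]

# Usage: the embedding would be produced once per input by a small shallow network.
x = torch.randn(4, 3, 32, 32)
embed = torch.randn(4, 64)                    # stand-in for the shallow embedding of x
layer = GatedConv(3, 16)
out = layer(x, embed)                         # (4, 16, 32, 32)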

@inproceedings{DeepMoE19,
 abstract = {Larger networks generally have greater representational power at the cost of increased computational complexity. Sparsifying such networks has been an active area of research but has been generally limited to static regularization or dynamic approaches using reinforcement learning. We explore a mixture of experts (MoE) approach to deep dynamic routing, which activates certain experts in the network on a per-example basis. Our novel DeepMoE architecture increases the representational power of standard convolutional networks by adaptively sparsifying and recalibrating channel-wise features in each convolutional layer. We employ a multi-headed sparse gating network to determine the selection and scaling of channels for each input, leveraging exponential combinations of experts within a single convolutional network. Our proposed architecture is evaluated on four benchmark datasets and tasks, and we show that Deep-MoEs are able to achieve higher accuracy with lower computation than standard convolutional networks.},
 author = {Xin Wang and Fisher Yu and Lisa Dunlap and Yi{-}An Ma and Ruth Wang and Azalia Mirhoseini and Trevor Darrell and Joseph E. Gonzalez},
 bibsource = {dblp computer science bibliography, https://dblp.org},
 biburl = {https://dblp.org/rec/bib/conf/uai/WangYDMWMDG19},
 booktitle = {Proceedings of the Thirty-Fifth Conference on Uncertainty in Artificial Intelligence, {UAI} 2019, Tel Aviv, Israel, July 22-25, 2019},
 date-modified = {2020-08-02 11:27:35 -0700},
 keywords = {peerrev},
 pages = {192},
 timestamp = {Fri, 19 Jul 2019 13:05:12 +0200},
 title = {Deep Mixture of Experts via Shallow Embedding},
 url = {http://auai.org/uai2019/proceedings/papers/192.pdf},
 year = {2019}
}

Wenting Zheng, Raluca Ada Popa, Joseph E. Gonzalez, and Ion Stoica. "Helen: Maliciously Secure Coopetitive Learning for Linear Models." IEEE Symposium on Security and Privacy (Oakland), 2019.

Many organizations wish to collaboratively train machine learning models on their combined datasets for a common benefit (e.g., better medical research, or fraud detection). However, they often cannot share their plaintext datasets due to privacy concerns and/or business competition. In this paper, we design and build Helen, a system that allows multiple parties to train a linear model without revealing their data, a setting we call coopetitive learning. Compared to prior secure training systems, Helen protects against a much stronger adversary who is malicious and can compromise m-1 out of m parties. Our evaluation shows that Helen can achieve up to five orders of magnitude of performance improvement when compared to training using an existing state-of-the-art secure multi-party computation framework.

@inproceedings{Helen19,
 abstract = {Many organizations wish to collaboratively train machine learning models on their combined datasets for a common benefit (e.g., better medical research, or fraud detection). However, they often cannot share their plaintext datasets due to privacy concerns and/or business competition. In this paper, we design and build Helen, a system that allows multiple parties to train a linear model without revealing their data, a setting we call coopetitive learning. Compared to prior secure training systems, Helen protects against a much stronger adversary who is malicious and can compromise m-1 out of m parties. Our evaluation shows that Helen can achieve up to five orders of magnitude of performance improvement when compared to training using an existing state-of-the-art secure multi-party computation framework.},
 author = {Wenting Zheng and Raluca Ada Popa and Joseph E. Gonzalez and Ion Stoica},
 booktitle = { {IEEE} Symposium on Security and Privacy ({Oakland})},
 date-modified = {2020-08-02 11:27:35 -0700},
 keywords = {peerrev},
 publisher = { {IEEE} Computer Society},
 title = {Helen: Maliciously Secure Coopetitive Learning for Linear Models},
 url = {https://people.eecs.berkeley.edu/~wzheng/helen\%5Fieeesp.pdf},
 year = {2019}
}

Richard Liaw, Romil Bhardwaj, Lisa Dunlap, Yitian Zou, Joseph E. Gonzalez, Ion Stoica, and Alexey Tumanov. "HyperSched: Dynamic Resource Reallocation for Model Development on a Deadline." Proceedings of the ACM Symposium on Cloud Computing, 2019.

Prior research in resource scheduling for machine learning training workloads has largely focused on minimizing job completion times. Commonly, these model training workloads collectively search over a large number of parameter values that control the learning process in a hyperparameter search. It is preferable to identify and maximally provision the best-performing hyperparameter configuration (trial) to achieve the highest accuracy result as soon as possible. To optimally trade off evaluating multiple configurations and training the most promising ones by a fixed deadline, we design and build HyperSched---a dynamic application-level resource scheduler to track, identify, and preferentially allocate resources to the best-performing trials to maximize accuracy by the deadline. HyperSched leverages three properties of a hyperparameter search workload overlooked in prior work -- trial disposability, progressively identifiable rankings among different configurations, and space-time constraints -- to outperform standard hyperparameter search algorithms across a variety of benchmarks.

@inproceedings{HyperschedSOCC19,
 abstract = {Prior research in resource scheduling for machine learning training workloads has largely focused on minimizing job completion times. Commonly, these model training workloads collectively search over a large number of parameter values that control the learning process in a hyperparameter search. It is preferable to identify and maximally provision the best-performing hyperparameter configuration (trial) to achieve the highest accuracy result as soon as possible. To optimally trade-off evaluating multiple configurations and training the most promising ones by a fixed deadline, we design and build HyperSched---a dynamic application-level resource scheduler to track, identify, and preferentially allocate resources to the best performing trials to maximize accuracy by the deadline. HyperSched leverages three properties of a hyperparameter search workload overlooked in prior work -- trial disposability, progressively identifiable rankings among different configurations, and space-time constraints -- to outperform standard hyperparameter search algorithms across a variety of benchmarks.},
 address = {New York, NY, USA},
 author = {Richard Liaw and Romil Bhardwaj and Lisa Dunlap and Yitian Zou and Joseph E. Gonzalez and Ion Stoica and Alexey Tumanov},
 booktitle = {Proceedings of the ACM Symposium on Cloud Computing},
 code = {https://github.com/ucbrise/hypersched},
 date-modified = {2020-08-02 11:27:35 -0700},
 isbn = {9781450369732},
 keywords = {peerrev},
 location = {Santa Cruz, CA, USA},
 numpages = {13},
 pages = {61--73},
 publisher = {Association for Computing Machinery},
 series = {SoCC '19},
 title = { {HyperSched}: Dynamic Resource Reallocation for Model Development on a Deadline},
 url = {https://doi.org/10.1145/3357223.3362719},
 year = {2019}
}

Xin Wang, Fisher Yu, Trevor Darrell, and Joseph E. Gonzalez. "Task-Aware Feature Generation for Zero-Shot Compositional Learning." CoRR (arXiv), 2019.

Visual concepts (e.g., red apple, big elephant) are often semantically compositional and each element of the compositions can be reused to construct novel concepts (e.g., red elephant). Compositional feature synthesis, which generates image feature distributions exploiting the semantic compositionality, is a promising approach to sample-efficient model generalization. In this work, we propose a task-aware feature generation (TFG) framework for compositional learning, which generates features of novel visual concepts by transferring knowledge from previously seen concepts. These synthetic features are then used to train a classifier to recognize novel concepts in a zero-shot manner. Our novel TFG design injects task-conditioned noise layer-by-layer, producing task-relevant variation at each level. We find the proposed generator design improves classification accuracy and sample efficiency. Our model establishes a new state of the art on three zero-shot compositional learning (ZSCL) benchmarks, outperforming the previous discriminative models by a large margin. Our model improves the performance of the prior art by over 2x in the generalized ZSCL setting.

@article{TFG19,
 abstract = {Visual concepts (e.g., red apple, big elephant) are often semantically compositional and each element of the compositions can be reused to construct novel concepts (e.g., red elephant). Compositional feature synthesis, which generates image feature distributions exploiting the semantic compositionality, is a promising approach to sample-efficient model generalization. In this work, we propose a task-aware feature generation (TFG) framework for compositional learning, which generates features of novel visual concepts by transferring knowledge from previously seen concepts. These synthetic features are then used to train a classifier to recognize novel concepts in a zero-shot manner. Our novel TFG design injects task-conditioned noise layer-by-layer, producing task-relevant variation at each level. We find the proposed generator design improves classification accuracy and sample efficiency. Our model establishes a new state of the art on three zero-shot compositional learning (ZSCL) benchmarks, outperforming the previous discriminative models by a large margin. Our model improves the performance of the prior arts by over 2x in the generalized ZSCL setting.},
 archiveprefix = {arXiv},
 author = {Xin Wang and Fisher Yu and Trevor Darrell and Joseph E. Gonzalez},
 date-modified = {2020-08-02 11:27:35 -0700},
 eprint = {1906.04854},
 journal = {CoRR},
 keywords = {arxivpre},
 primaryclass = {cs.CV},
 title = {Task-Aware Feature Generation for Zero-Shot Compositional Learning},
 url = {https://arxiv.org/abs/1906.04854},
 year = {2019}
}

Rolando Garcia, Vikram Sreekanti, Neeraja Yadwadkar, Daniel Crankshaw, Joseph E. Gonzalez, and Joseph M. Hellerstein. "Context: The Missing Piece in the Machine Learning Lifecycle." Proceedings of the KDD Workshop on Common Model Infrastructure (CMI), 2018.

Machine learning models have become ubiquitous in modern applications. The ML Lifecycle describes a three-phase process used by data scientists and data engineers to develop, train, and serve models. Unfortunately, context around the data, code, people, and systems involved in these pipelines is not captured today. In this paper, we first discuss common pitfalls that missing context creates. Some examples where context is missing include tracking the relationships between code and data and capturing experimental processes over time. We then discuss techniques to address these challenges and briefly mention future work around designing and implementing systems in this space.

@inproceedings{Flor18,
 abstract = {Machine learning models have become ubiquitous in modern applications. The ML Lifecycle describes a three-phase process used by data scientists and data engineers to develop, train, and serve models. Unfortunately, context around the data, code, people, and systems involved in these pipelines is not captured today. In this paper, we first discuss common pitfalls that missing context creates. Some examples where context is missing include tracking the relationships between code and data and capturing experimental processes over time. We then discuss techniques to address these challenges and briefly mention future work around designing and implementing systems in this space.},
 author = {Rolando Garcia and Vikram Sreekanti and Neeraja Yadwadkar and Daniel Crankshaw and Joseph E. Gonzalez and Joseph M. Hellerstein},
 booktitle = {Proceedings of the KDD Workshop on Common Model Infrastructure (CMI)},
 date-modified = {2020-08-02 11:27:35 -0700},
 keywords = {peerrev},
 month = {8},
 title = {Context: The Missing Piece in the Machine Learning Lifecycle},
 url = {http://www.vikrams.io/papers/flor-cmi18.pdf},
 year = {2018}
}

Xin Wang, Yujia Luo, Dan Crankshaw, Alexey Tumanov, Fisher Yu, and Joseph E. Gonzalez. "IDK Cascades: Fast Deep Learning by Learning not to Overthink." Conference on Uncertainty in Artificial Intelligence (UAI), 2018.

Advances in deep learning have led to substantial increases in prediction accuracy but have been accompanied by increases in the cost of rendering predictions. We conjecture that for a majority of real-world inputs, the recent advances in deep learning have created models that effectively "overthink" on simple inputs. In this paper, we revisit the classic question of building model cascades that primarily leverage class asymmetry to reduce cost. We introduce the "I Don't Know" (IDK) prediction cascades framework, a general framework to systematically compose a set of pre-trained models to accelerate inference without a loss in prediction accuracy. We propose two search-based methods for constructing cascades as well as a new cost-aware objective within this framework. The proposed IDK cascade framework can be easily adopted in existing model serving systems without additional model re-training. We evaluate the proposed techniques on a range of benchmarks to demonstrate the effectiveness of the proposed framework.
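
A minimal sketch of the cascade idea for a single input, assuming two pre-trained classifiers and a confidence threshold chosen on held-out data (the threshold value and model handles are placeholders; the paper additionally searches cascade structure with a cost-aware objective):

import torch

def idk_cascade(x, fast_model, accurate_model, threshold=0.9):
    # Run the cheap model first; fall back to the expensive one only when the
    # cheap model effectively says "I don't know" (max softmax below threshold).
    # x is a single example batched as (1, ...).
    probs = torch.softmax(fast_model(x), dim=-1)
    conf, pred = probs.max(dim=-1)
    if conf.item() >= threshold:
        return pred.item()
    return accurate_model(x).argmax(dim=-1).item()

Because both models are used as-is, a cascade like this can be dropped into an existing serving system without retraining either model.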

@inproceedings{IDK18,
 abstract = {Advances in deep learning have led to substantial increases in prediction accuracy but have been accompanied by increases in the cost of rendering predictions. We conjecture that for a majority of real-world inputs, the recent advances in deep learning have created models that effectively ``overthink'' on simple inputs. In this paper, we revisit the classic question of building model cascades that primarily leverage class asymmetry to reduce cost. We introduce the ``I Don't Know'' (IDK) prediction cascades framework, a general framework to systematically compose a set of pre-trained models to accelerate inference without a loss in prediction accuracy. We propose two search based methods for constructing cascades as well as a new cost-aware objective within this framework. The proposed IDK cascade framework can be easily adopted in the existing model serving systems without additional model re-training. We evaluate the proposed techniques on a range of benchmarks to demonstrate the effectiveness of the proposed framework.},
 author = {Xin Wang and Yujia Luo and Dan Crankshaw and Alexey Tumanov and Fisher Yu and Joseph E. Gonzalez},
 booktitle = {Conference on Uncertainty in Artificial Intelligence (UAI)},
 date-modified = {2020-08-02 11:27:35 -0700},
 keywords = {peerrev},
 month = {7},
 title = { {IDK} Cascades: Fast Deep Learning by Learning not to Overthink},
 url = {https://arxiv.org/abs/1706.00885},
 year = {2018}
}

Eric Liang, Richard Liaw, Robert Nishihara, Philipp Moritz, Roy Fox, Joseph Gonzalez, Ken Goldberg, and Ion Stoica. "Ray RLLib: A Composable and Scalable Reinforcement Learning Library." Proceedings of the 35th International Conference on Machine Learning, 2018.

Reinforcement learning (RL) algorithms involve the deep nesting of highly irregular computation patterns, each of which typically exhibits opportunities for distributed computation. We argue for distributing RL components in a composable way by adapting algorithms for top-down hierarchical control, thereby encapsulating parallelism and resource requirements within short-running compute tasks. We demonstrate the benefits of this principle through RLlib: a library that provides scalable software primitives for RL. These primitives enable a broad range of algorithms to be implemented with high performance, scalability, and substantial code reuse.

@inproceedings{rllibicml2018,
 abstract = {Reinforcement learning (RL) algorithms involve the deep nesting of highly irregular computation patterns, each of which typically exhibits opportunities for distributed computation. We argue for distributing RL components in a composable way by adapting algorithms for top-down hierarchical control, thereby encapsulating parallelism and resource requirements within short-running compute tasks. We demonstrate the benefits of this principle through RLlib: a library that provides scalable software primitives for RL. These primitives enable a broad range of algorithms to be implemented with high performance, scalability, and substantial code reuse.},
 author = {Eric Liang and Richard Liaw and Robert Nishihara and Philipp Moritz and Roy Fox and Joseph Gonzalez and Ken Goldberg and Ion Stoica},
 booktitle = {Proceedings of the 35th International Conference on Machine Learning},
 code = {https://ray.readthedocs.io/en/latest/rllib.html},
 date-modified = {2020-08-02 11:27:35 -0700},
 keywords = {peerrev, selected},
 month = {7},
 publisher = {ACM},
 series = {ICML '18},
 title = {Ray {RLLib}: {A} Composable and Scalable Reinforcement Learning Library},
 url = {https://arxiv.org/abs/1712.09381},
 year = {2018}
}

Dan Crankshaw, Joseph E. Gonzalez, and Peter Bailis. "Research for Practice: Prediction-Serving Systems." Commun. ACM, 2018.

What happens when we wish to actually deploy a machine learning model to production? This survey examines several recent systems for serving machine learning models as well as some classic papers describing early efforts in prediction serving.

@article{acmqueue2018,
 abstract = {What happens when we wish to actually deploy a machine learning model to production? This survey examines several recent systems for serving machine learning models as well as some classic papers describing early efforts in prediction serving.},
 acmid = {3190574},
 address = {New York, NY, USA},
 author = {Dan Crankshaw and Joseph E. Gonzalez and Peter Bailis},
 date-modified = {2020-08-02 11:27:35 -0700},
 issn = {0001-0782},
 issue_date = {August 2018},
 journal = {Commun. ACM},
 keywords = {techreport},
 month = {7},
 number = {8},
 numpages = {5},
 pages = {45--49},
 publisher = {ACM},
 title = {Research for Practice: Prediction-Serving Systems},
 url = {http://doi.acm.org/10.1145/3190574},
 volume = {61},
 year = {2018}
}

Xin Wang, Fisher Yu, Zi-Yi Dou, and Joseph E. Gonzalez. "SkipNet: Learning Dynamic Routing in Convolutional Networks." Proceedings of the European Conference on Computer Vision (ECCV), 2018.

While deeper convolutional networks are needed to achieve maximum accuracy in visual perception tasks, for many inputs shallower networks are sufficient. We exploit this observation by learning to skip convolutional layers on a per-input basis. We introduce SkipNet, a modified residual network, that uses a gating network to selectively skip convolutional blocks based on the activations of the previous layer. We formulate the dynamic skipping problem in the context of sequential decision making and propose a hybrid learning algorithm that combines supervised learning and reinforcement learning to address the challenges of non-differentiable skipping decisions. We show SkipNet reduces computation by 30-90% while preserving the accuracy of the original model on four benchmark datasets and outperforms the state-of-the-art dynamic networks and static compression methods. We also qualitatively evaluate the gating policy to reveal a relationship between image scale and saliency and the number of layers skipped.
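
A rough sketch of one gated residual block, assuming a global-average-pool summary feeding a one-unit gate (illustrative only; the paper's gates are recurrent or convolutional and are trained with the hybrid supervised/reinforcement scheme mentioned above because the hard skip decision is non-differentiable):

import torch
import torch.nn as nn

class SkippableBlock(nn.Module):
    """Residual block preceded by a tiny gate that decides whether to execute it."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        # The gate looks only at a cheap summary of the previous activations.
        self.gate = nn.Linear(channels, 1)

    def forward(self, x):
        summary = x.mean(dim=(2, 3))                  # global average pool, (N, C)
        p_exec = torch.sigmoid(self.gate(summary))    # probability of executing the block
        execute = (p_exec > 0.5).float()[:, :, None, None]   # hard decision at inference
        return x + execute * self.body(x)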

@inproceedings{Skipnet18,
 abstract = {While deeper convolutional networks are needed to achieve maximum accuracy in visual perception tasks, for many inputs shallower networks are sufficient. We exploit this observation by learning to skip convolutional layers on a per-input basis. We introduce SkipNet, a modified residual network, that uses a gating network to selectively skip convolutional blocks based on the activations of the previous layer. We formulate the dynamic skipping problem in the context of sequential decision making and propose a hybrid learning algorithm that combines supervised learning and reinforcement learning to address the challenges of non-differentiable skipping decisions. We show SkipNet reduces computation by 30-90\% while preserving the accuracy of the original model on four benchmark datasets and outperforms the state-of-the-art dynamic networks and static compression methods. We also qualitatively evaluate the gating policy to reveal a relationship between image scale and saliency and the number of layers skipped.},
 author = {Xin Wang and Fisher Yu and Zi{-}Yi Dou and Joseph E. Gonzalez},
 booktitle = {Proceedings of the European Conference on Computer Vision ({ECCV})},
 code = {https://github.com/ucbdrive/skipnet},
 date-modified = {2020-08-02 11:27:35 -0700},
 keywords = {peerrev, selected},
 month = {7},
 title = { {SkipNet}: Learning Dynamic Routing in Convolutional Networks},
 url = {https://arxiv.org/abs/1711.09485},
 year = {2018}
}

Bichen Wu, Alvin Wan, Xiangyu Yue, Peter Jin, Sicheng Zhao, Noah Golmant, Amir Gholaminejad, Joseph E. Gonzalez, and Kurt Keutzer. "Shift: A Zero FLOP, Zero Parameter Alternative to Spatial Convolutions." The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

Neural networks rely on convolutions to aggregate spatial information. However, spatial convolutions are expensive in terms of model size and computation, both of which grow quadratically with respect to kernel size. In this paper, we present a parameter-free, FLOP-free "shift" operation as an alternative to spatial convolutions. We fuse shifts and point-wise convolutions to construct end-to-end trainable shift-based modules, with a hyperparameter characterizing the tradeoff between accuracy and efficiency. To demonstrate the operation's efficacy, we replace ResNet's 3x3 convolutions with shift-based modules for improved CIFAR10 and CIFAR100 accuracy using 60% fewer parameters; we additionally demonstrate the operation's resilience to parameter reduction on ImageNet, outperforming ResNet family members. We finally show the shift operation's applicability across domains, achieving strong performance with fewer parameters on classification, face verification and style transfer.
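
A sketch of the shift-then-pointwise-convolution module, assuming a fixed five-direction channel grouping (illustrative only; the actual operation assigns shift directions according to the kernel size and is implemented without materializing a shifted copy):

import torch
import torch.nn as nn

def shift(x):
    """Zero-FLOP, zero-parameter spatial shift: each channel group moves one pixel
    in a different direction (here: up, down, left, right, plus an unshifted group)."""
    n, c, h, w = x.shape
    out = torch.zeros_like(x)
    g = c // 5
    out[:, 0*g:1*g, :-1, :] = x[:, 0*g:1*g, 1:, :]    # shift up
    out[:, 1*g:2*g, 1:, :]  = x[:, 1*g:2*g, :-1, :]   # shift down
    out[:, 2*g:3*g, :, :-1] = x[:, 2*g:3*g, :, 1:]    # shift left
    out[:, 3*g:4*g, :, 1:]  = x[:, 3*g:4*g, :, :-1]   # shift right
    out[:, 4*g:]            = x[:, 4*g:]              # remaining channels unchanged
    return out

class ShiftConv(nn.Module):
    """Shift followed by a pointwise (1x1) convolution, standing in for a 3x3 conv."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.pw = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pw(shift(x))

All spatial mixing comes from the shift, which costs no parameters and no FLOPs; the only learned weights live in the 1x1 convolution.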

@inproceedings{Shift18,
 abstract = {Neural networks rely on convolutions to aggregate spatial information. However, spatial convolutions are expensive in terms of model size and computation, both of which grow quadratically with respect to kernel size. In this paper, we present a parameter-free, FLOP-free ``shift'' operation as an alternative to spatial convolutions. We fuse shifts and point-wise convolutions to construct end-to-end trainable shift-based modules, with a hyperparameter characterizing the tradeoff between accuracy and efficiency. To demonstrate the operation's efficacy, we replace ResNet's 3x3 convolutions with shift-based modules for improved CIFAR10 and CIFAR100 accuracy using 60\% fewer parameters; we additionally demonstrate the operation's resilience to parameter reduction on ImageNet, outperforming ResNet family members. We finally show the shift operation's applicability across domains, achieving strong performance with fewer parameters on classification, face verification and style transfer.},
 author = {Bichen Wu and Alvin Wan and Xiangyu Yue and Peter Jin and Sicheng Zhao and Noah Golmant and Amir Gholaminejad and Joseph E. Gonzalez and Kurt Keutzer},
 booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
 date-modified = {2020-08-02 11:27:35 -0700},
 keywords = {peerrev},
 month = {6},
 title = {Shift: A Zero {FLOP}, Zero Parameter Alternative to Spatial Convolutions},
 url = {https://arxiv.org/abs/1711.08141},
 year = {2018}
}

Samvit Jain and Joseph E. Gonzalez. "Fast Semantic Segmentation on Video Using Block Motion-Based Feature Interpolation." The Third International Workshop on Video Segmentation (IWVS), 2018.

Convolutional networks optimized for accuracy on challenging, dense prediction tasks are prohibitively slow to run on each frame in a video. The spatial similarity of nearby video frames, however, suggests an opportunity to reuse computation. Existing work has explored basic feature reuse and feature warping based on optical flow, but has encountered limits to the speedup attainable with these techniques. In this paper, we present a new, two-part approach to accelerating inference on video. First, we propose a fast feature propagation technique that utilizes the block motion vectors present in compressed video (e.g. H.264 codecs) to cheaply propagate features from frame to frame. Second, we develop a novel feature estimation scheme, termed feature interpolation, that fuses features propagated from enclosing keyframes to render accurate feature estimates, even at sparse keyframe frequencies. We evaluate our system on the Cityscapes and CamVid datasets, comparing to both a frame-by-frame baseline and related work. We find that we are able to substantially accelerate segmentation on video, achieving near real-time frame rates (20.1 frames per second) on large images (960 x 720 pixels), while maintaining competitive accuracy. This represents an improvement of almost 6x over the single-frame baseline and 2.5x over the fastest prior work.
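
The propagate-and-fuse step can be sketched as follows, assuming per-pixel motion vectors have already been read from the compressed bitstream and resampled to feature resolution (function names and the bilinear warp are illustrative choices, not the paper's implementation):

import torch
import torch.nn.functional as F

def warp(feat, motion):
    # feat: (N, C, H, W); motion: (N, 2, H, W) pixel offsets (dx, dy) toward the current frame.
    n, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys)).float().to(feat)              # (2, H, W) absolute pixel coords
    coords = base.unsqueeze(0) + motion                        # where to sample from
    grid_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0          # normalize to [-1, 1]
    grid_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)               # (N, H, W, 2) as (x, y)
    return F.grid_sample(feat, grid, align_corners=True)

def interpolate_features(f_prev, f_next, mv_prev, mv_next, alpha):
    # Fuse features propagated from the two enclosing keyframes; alpha in [0, 1]
    # is the current frame's temporal position between them.
    return (1.0 - alpha) * warp(f_prev, mv_prev) + alpha * warp(f_next, mv_next)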

@inproceedings{Jain18IWVS,
 abstract = {Convolutional networks optimized for accuracy on challenging, dense prediction tasks are prohibitively slow to run on each frame in a video. The spatial similarity of nearby video frames, however, suggests opportunity to reuse computation. Existing work has explored basic feature reuse and feature warping based on optical flow, but has encountered limits to the speedup attainable with these techniques. In this paper, we present a new, two part approach to accelerating inference on video. First, we propose a fast feature propagation technique that utilizes the block motion vectors present in compressed video (e.g. H.264 codecs) to cheaply propagate features from frame to frame. Second, we develop a novel feature estimation scheme, termed feature interpolation, that fuses features propagated from enclosing keyframes to render accurate feature estimates, even at sparse keyframe frequencies. We evaluate our system on the Cityscapes and CamVid datasets, comparing to both a frame-by-frame baseline and related work. We find that we are able to substantially accelerate segmentation on video, achieving near real-time frame rates (20.1 frames per second) on large images (960 x 720 pixels), while maintaining competitive accuracy. This represents an improvement of almost 6x over the single-frame baseline and 2.5x over the fastest prior work.},
 author = {Samvit Jain and Joseph E. Gonzalez},
 booktitle = {The Third International Workshop on Video Segmentation (IWVS)},
 date-modified = {2020-08-02 11:27:35 -0700},
 keywords = {peerrev},
 month = {3},
 title = {Fast Semantic Segmentation on Video Using Block Motion-Based Feature Interpolation},
 url = {https://arxiv.org/abs/1803.07742},
 year = {2018}
}

Sicheng Zhao, Bichen Wu, Joseph Gonzalez, Sanjit A. Seshia, and Kurt Keutzer. "Unsupervised Domain Adaptation: from Simulation Engine to the Real World." CoRR (arXiv), 2018.

Large-scale labeled training datasets have enabled deep neural networks to excel on a wide range of benchmark vision tasks. However, in many applications it is prohibitively expensive or time-consuming to obtain large quantities of labeled data. To cope with limited labeled training data, many have attempted to directly apply models trained on a large-scale labeled source domain to another sparsely labeled target domain. Unfortunately, direct transfer across domains often performs poorly due to domain shift and dataset bias. Domain adaptation is the machine learning paradigm that aims to learn a model from a source domain that can perform well on a different (but related) target domain. In this paper, we summarize and compare the latest unsupervised domain adaptation methods in computer vision applications. We classify the non-deep approaches into sample re-weighting and intermediate subspace transformation categories, while the deep strategy includes discrepancy-based methods, adversarial generative models, adversarial discriminative models and reconstruction-based methods. We also discuss some potential directions.

@article{Zhao2018,
 abstract = {Large-scale labeled training datasets have enabled deep neural networks to excel on a wide range of benchmark vision tasks. However, in many applications it is prohibitively expensive or time-consuming to obtain large quantities of labeled data. To cope with limited labeled training data, many have attempted to directly apply models trained on a large-scale labeled source domain to another sparsely labeled target domain. Unfortunately, direct transfer across domains often performs poorly due to domain shift and dataset bias. Domain adaptation is the machine learning paradigm that aims to learn a model from a source domain that can perform well on a different (but related) target domain. In this paper, we summarize and compare the latest unsupervised domain adaptation methods in computer vision applications. We classify the non-deep approaches into sample re-weighting and intermediate subspace transformation categories, while the deep strategy includes discrepancy-based methods, adversarial generative models, adversarial discriminative models and reconstruction-based methods. We also discuss some potential directions.},
 archiveprefix = {arXiv},
 author = {Sicheng Zhao and Bichen Wu and Joseph Gonzalez and Sanjit A. Seshia and Kurt Keutzer},
 bibsource = {dblp computer science bibliography, https://dblp.org},
 biburl = {https://dblp.org/rec/bib/journals/corr/abs-1803-09180},
 date-modified = {2020-08-02 11:27:35 -0700},
 eprint = {1803.09180},
 journal = {CoRR},
 keywords = {arxivpre},
 month = {3},
 title = {Unsupervised Domain Adaptation: from Simulation Engine to the Real World},
 url = {http://arxiv.org/abs/1803.09180},
 volume = {abs/1803.09180},
 year = {2018}
}

Vladimir Feinberg, Alvin Wan, Ion Stoica, Michael I. Jordan, Joseph E. Gonzalez, and Sergey Levine. "Model-Based Value Estimation for Efficient Model-Free Reinforcement Learning." CoRR (arXiv), 2018.

Recent model-free reinforcement learning algorithms have proposed incorporating learned dynamics models as a source of additional data with the intention of reducing sample complexity. Such methods hold the promise of incorporating imagined data coupled with a notion of model uncertainty to accelerate the learning of continuous control tasks. Unfortunately, they rely on heuristics that limit usage of the dynamics model. We present model-based value expansion, which controls for uncertainty in the model by only allowing imagination to fixed depth. By enabling wider use of learned dynamics models within a model-free reinforcement learning algorithm, we improve value estimation, which, in turn, reduces the sample complexity of learning.
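
The fixed-depth expansion at the core of the method can be written as a short worked formula (a sketch in LaTeX notation, with \hat{s}_t, \hat{r}_t denoting states and rewards imagined by the learned dynamics model and V_\phi the learned value function):

\hat{V}_H(s_0) = \sum_{t=0}^{H-1} \gamma^{t}\,\hat{r}_t + \gamma^{H}\,V_\phi(\hat{s}_H)

That is, the model is rolled out for exactly H steps and the critic supplies the value beyond that fixed horizon, which bounds how far model error can compound.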

@article{Feinberg2018,
 abstract = {Recent model-free reinforcement learning algorithms have proposed incorporating learned dynamics models as a source of additional data with the intention of reducing sample complexity. Such methods hold the promise of incorporating imagined data coupled with a notion of model uncertainty to accelerate the learning of continuous control tasks. Unfortunately, they rely on heuristics that limit usage of the dynamics model. We present model-based value expansion, which controls for uncertainty in the model by only allowing imagination to fixed depth. By enabling wider use of learned dynamics models within a model-free reinforcement learning algorithm, we improve value estimation, which, in turn, reduces the sample complexity of learning.},
 archiveprefix = {arXiv},
 author = {Vladimir Feinberg and Alvin Wan and Ion Stoica and Michael I. Jordan and Joseph E. Gonzalez and Sergey Levine},
 bibsource = {dblp computer science bibliography, https://dblp.org},
 biburl = {https://dblp.org/rec/bib/journals/corr/abs-1803-00101},
 date-modified = {2020-08-02 11:27:35 -0700},
 eprint = {1803.00101},
 journal = {CoRR},
 keywords = {arxivpre},
 month = {2},
 title = {Model-Based Value Estimation for Efficient Model-Free Reinforcement Learning},
 url = {http://arxiv.org/abs/1803.00101},
 volume = {abs/1803.00101},
 year = {2018}
}

Xiangxi Mo, Paras Jain, Ajay Jain, Alexey Tumanov, Joseph E. Gonzalez, and Ion Stoica. "A Case for Dynamic GPU Inference Multitenancy and Scheduling." Proceedings of the Learning Systems Workshop at NIPS 2018, 2018.

Serving deep neural networks in latency critical interactive settings often requires GPU acceleration. However, the small batch sizes typical in online inference result in poor GPU utilization, a potential performance gap which GPU resource sharing can address. In this paper, we explore several techniques to leverage both temporal and spatial multiplexing to improve GPU utilization for deep learning inference workloads. We evaluate the performance trade-offs of each approach with respect to resource-efficiency, latency predictability, and isolation when compared with conventional batched inference. Our experimental analysis suggests at least a 5x potential for improved utilization through the exploration of more advanced spatial and temporal multiplexing strategies. Our preliminary prototype of a dynamic space-time scheduler demonstrates a 3.18x speedup over space-only multiplexing and a 7.76x speedup over time-only multiplexing, while also providing better isolation and latency predictability.

@inproceedings{LearningSys2018,
 abstract = {Serving deep neural networks in latency critical interactive settings often requires GPU acceleration. However, the small batch sizes typical in online inference results in poor GPU utilization, a potential performance gap which GPU resource sharing can address. In this paper, we explore several techniques to leverage both temporal and spatial multiplexing to improve GPU utilization for deep learning inference workloads. We evaluate the performance trade-offs of each approach with respect to resource-efficiency, latency predictability, and isolation when compared with conventional batched inference. Our experimental analysis suggests at least a 5x potential for improved utilization through the exploration of more advanced spatial and temporal multiplexing strategies. Our preliminary prototype of a dynamic space-time scheduler demonstrates a 3.18x speedup over space-only multiplexing and a 7.76x speedup over time-only multiplexing, while also providing better isolation and latency predictability.},
 author = {Xiangxi Mo and Paras Jain and Ajay Jain and Alexey Tumanov and Joseph E. Gonzalez and Ion Stoica},
 booktitle = {Proceedings of the Learning Systems Workshop at NIPS 2018},
 date-modified = {2020-08-02 11:27:35 -0700},
 keywords = {peerrev},
 month = {12},
 title = {A Case for Dynamic GPU Inference Multitenancy and Scheduling},
 url = {http://learningsys.org/nips18/assets/papers/102CameraReadySubmissionGPU\%5FVirtualization\%20(8).pdf},
 year = {2018}
}

J. Weston Hughes, Taylor Sittler, Anthony D. Joseph, Jeffrey E. Olgin, Joseph E. Gonzalez, and Geoffrey H. Tison. "Using Multitask Learning to Improve 12-Lead Electrocardiogram Classification." CoRR (arXiv), 2018.

We develop a multi-task convolutional neural network (CNN) to classify multiple diagnoses from 12-lead electrocardiograms (ECGs) using a dataset comprised of over 40,000 ECGs, with labels derived from cardiologist clinical interpretations. Since many clinically important classes can occur in low frequencies, approaches are needed to improve performance on rare classes. We compare the performance of several single-class classifiers on rare classes to a multi-headed classifier across all available classes. We demonstrate that the addition of common classes can significantly improve CNN performance on rarer classes when compared to a model trained on the rarer class in isolation. Using this method, we develop a model with high performance as measured by F1 score on multiple clinically relevant classes compared against the gold-standard cardiologist interpretation.
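
The multi-headed setup can be sketched as a shared encoder with one output per diagnosis trained with a summed binary cross-entropy; the tiny 1-D encoder below is a stand-in for illustration, not the paper's architecture:

import torch
import torch.nn as nn

class MultiTaskECG(nn.Module):
    def __init__(self, n_leads=12, n_diagnoses=20):
        super().__init__()
        # Shared 1-D convolutional encoder over the 12-lead signal (illustrative).
        self.encoder = nn.Sequential(
            nn.Conv1d(n_leads, 32, kernel_size=7, stride=2), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=7, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
        )
        # One binary classification head per diagnosis.
        self.heads = nn.Linear(64, n_diagnoses)

    def forward(self, ecg):                      # ecg: (N, 12, T)
        return self.heads(self.encoder(ecg))     # one logit per diagnosis

model = MultiTaskECG()
x = torch.randn(8, 12, 5000)                      # 8 ECGs, 12 leads, 5000 samples each
labels = torch.randint(0, 2, (8, 20)).float()     # multi-label targets
loss = nn.BCEWithLogitsLoss()(model(x), labels)   # rare classes share the common encoder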

@article{Hughes18,
 abstract = {We develop a multi-task convolutional neural network (CNN) to classify multiple diagnoses from 12-lead electrocardiograms (ECGs) using a dataset comprised of over 40,000 ECGs, with labels derived from cardiologist clinical interpretations. Since many clinically important classes can occur in low frequencies, approaches are needed to improve performance on rare classes. We compare the performance of several single-class classifiers on rare classes to a multi-headed classifier across all available classes. We demonstrate that the addition of common classes can significantly improve CNN performance on rarer classes when compared to a model trained on the rarer class in isolation. Using this method, we develop a model with high performance as measured by F1 score on multiple clinically relevant classes compared against the gold-standard cardiologist interpretation.},
 archiveprefix = {arXiv},
 author = {J. Weston Hughes and Taylor Sittler and Anthony D. Joseph and Jeffrey E. Olgin and Joseph E. Gonzalez and Geoffrey H. Tison},
 bibsource = {dblp computer science bibliography, https://dblp.org},
 biburl = {https://dblp.org/rec/bib/journals/corr/abs-1812-00497},
 date-modified = {2020-08-02 11:27:35 -0700},
 eprint = {1812.00497},
 journal = {CoRR},
 keywords = {arxivpre},
 month = {12},
 title = {Using Multitask Learning to Improve 12-Lead Electrocardiogram Classification},
 url = {http://arxiv.org/abs/1812.00497},
 volume = {abs/1812.00497},
 year = {2018}
}

Noah Golmant, Nikita Vemuri, Zhewei Yao, Vladimir Feinberg, Amir Gholami, Kai Rothauge, Michael W. Mahoney, and Joseph Gonzalez. "On the Computational Inefficiency of Large Batch Sizes for Stochastic Gradient Descent." CoRR (arXiv), 2018.

Increasing the mini-batch size for stochastic gradient descent offers significant opportunities to reduce wall-clock training time, but there are a variety of theoretical and systems challenges that impede the widespread success of this technique. We investigate these issues, with an emphasis on time to convergence and total computational cost, through an extensive empirical analysis of network training across several architectures and problem domains, including image classification, image segmentation, and language modeling. Although it is common practice to increase the batch size in order to fully exploit available computational resources, we find a substantially more nuanced picture. Our main finding is that across a wide range of network architectures and problem domains, increasing the batch size beyond a certain point yields no decrease in wall-clock time to convergence for \emph{either} train or test loss. This batch size is usually substantially below the capacity of current systems. We show that popular training strategies for large batch size optimization begin to fail before we can populate all available compute resources, and we show that the point at which these methods break down depends more on attributes like model architecture and data complexity than it does directly on the size of the dataset.

@article{Golmant18,
 abstract = {Increasing the mini-batch size for stochastic gradient descent offers significant opportunities to reduce wall-clock training time, but there are a variety of theoretical and systems challenges that impede the widespread success of this technique. We investigate these issues, with an emphasis on time to convergence and total computational cost, through an extensive empirical analysis of network training across several architectures and problem domains, including image classification, image segmentation, and language modeling. Although it is common practice to increase the batch size in order to fully exploit available computational resources, we find a substantially more nuanced picture. Our main finding is that across a wide range of network architectures and problem domains, increasing the batch size beyond a certain point yields no decrease in wall-clock time to convergence for \emph{either} train or test loss. This batch size is usually substantially below the capacity of current systems. We show that popular training strategies for large batch size optimization begin to fail before we can populate all available compute resources, and we show that the point at which these methods break down depends more on attributes like model architecture and data complexity than it does directly on the size of the dataset.},
 archiveprefix = {arXiv},
 author = {Noah Golmant and Nikita Vemuri and Zhewei Yao and Vladimir Feinberg and Amir Gholami and Kai Rothauge and Michael W. Mahoney and Joseph Gonzalez},
 bibsource = {dblp computer science bibliography, https://dblp.org},
 biburl = {https://dblp.org/rec/bib/journals/corr/abs-1811-12941},
 date-modified = {2020-08-02 11:27:35 -0700},
 eprint = {1811.12941},
 journal = {CoRR},
 keywords = {arxivpre},
 month = {11},
 title = {On the Computational Inefficiency of Large Batch Sizes for Stochastic Gradient Descent},
 url = {http://arxiv.org/abs/1811.12941},
 volume = {abs/1811.12941},
 year = {2018}
}

Richard Liaw, Eric Liang, Robert Nishihara, Philipp Moritz, Joseph E. Gonzalez, and Ion Stoica. "Tune: A Research Platform for Distributed Model Selection and Training." Proceedings of the ICML Workshop on AutoML, 2018.

Modern machine learning algorithms are increasingly computationally demanding, requiring specialized hardware and distributed computation to achieve high performance in a reasonable time frame. Many hyperparameter search algorithms have been proposed for improving the efficiency of model selection; however, their adaptation to the distributed compute environment is often ad hoc. We propose Tune, a unified framework for model selection and training that provides a narrow-waist interface between training scripts and search algorithms. We show that this interface meets the requirements for a broad range of hyperparameter search algorithms, allows straightforward scaling of search to large clusters, and simplifies algorithm implementation. We demonstrate the implementation of several state-of-the-art hyperparameter search algorithms in Tune.

@inproceedings{Tune18,
 abstract = {Modern machine learning algorithms are increasingly computationally demanding, requiring specialized hardware and distributed computation to achieve high performance in a reasonable time frame. Many hyperparameter search algorithms have been proposed for improving the efficiency of model selection; however, their adaptation to the distributed compute environment is often ad hoc. We propose Tune, a unified framework for model selection and training that provides a narrow-waist interface between training scripts and search algorithms. We show that this interface meets the requirements for a broad range of hyperparameter search algorithms, allows straightforward scaling of search to large clusters, and simplifies algorithm implementation. We demonstrate the implementation of several state-of-the-art hyperparameter search algorithms in Tune.},
 author = {Richard Liaw and Eric Liang and Robert Nishihara and Philipp Moritz and Joseph E. Gonzalez and Ion Stoica},
 booktitle = {Proceedings of the ICML Workshop on AutoML},
 code = {https://ray.readthedocs.io/en/latest/tune.html},
 date-modified = {2020-08-02 11:27:35 -0700},
 keywords = {peerrev},
 title = {Tune: A Research Platform for Distributed Model Selection and Training},
 url = {https://arxiv.org/abs/1807.05118},
 year = {2018}
}

Ion Stoica, Dawn Song, Raluca Ada Popa, David A. Patterson, Michael W. Mahoney, Randy H. Katz, Anthony D. Joseph, Michael Jordan, Joseph M. Hellerstein, Joseph E. Gonzalez, Ken Goldberg, Ali Ghodsi, David E. Culler, and Pieter Abbeel. "A Berkeley View of Systems Challenges for AI." EECS Department, University of California, Berkeley Technical Report, 2017.

With the increasing commoditization of computer vision, speech recognition and machine translation systems and the widespread deployment of learning-based back-end technologies such as digital advertising and intelligent infrastructures, AI (Artificial Intelligence) has moved from research labs to production. These changes have been made possible by unprecedented levels of data and computation, by methodological advances in machine learning, by innovations in systems software and architectures, and by the broad accessibility of these technologies. The next generation of AI systems promises to accelerate these developments and increasingly impact our lives via frequent interactions and making (often mission-critical) decisions on our behalf, often in highly personalized contexts. Realizing this promise, however, raises daunting challenges. In particular, we need AI systems that make timely and safe decisions in unpredictable environments, that are robust against sophisticated adversaries, and that can process ever-increasing amounts of data across organizations and individuals without compromising confidentiality. These challenges will be exacerbated by the end of Moore's Law, which will constrain the amount of data these technologies can store and process. In this paper, we propose several open research directions in systems, architectures, and security that can address these challenges and help unlock AI's potential to improve lives and society.

@techreport{Stoica17,
 abstract = {
With the increasing commoditization of computer vision, speech recognition and machine translation systems and the widespread deployment of learning-based back-end technologies such as digital advertising and intelligent infrastructures, AI (Artificial Intelligence) has moved from research labs to production. These changes have been made possible by unprecedented levels of data and computation, by methodological advances in machine learning, by innovations in systems software and architectures, and by the broad accessibility of these technologies.

The next generation of AI systems promises to accelerate these developments and increasingly impact our lives via frequent interactions and making (often mission-critical) decisions on our behalf, often in highly personalized contexts. Realizing this promise, however, raises daunting challenges. In particular, we need AI systems that make timely and safe decisions in unpredictable environments, that are robust against sophisticated adversaries, and that can process ever-increasing amounts of data across organizations and individuals without compromising confidentiality. These challenges will be exacerbated by the end of Moore's Law, which will constrain the amount of data these technologies can store and process. In this paper, we propose several open research directions in systems, architectures, and security that can address these challenges and help unlock AI's potential to improve lives and society.
},
 author = {Ion Stoica and Dawn Song and Raluca Ada Popa and David A. Patterson and Michael W. Mahoney and Randy H. Katz and Anthony D. Joseph and Michael Jordan and Joseph M. Hellerstein and Joseph E. Gonzalez and Ken Goldberg and Ali Ghodsi and David E. Culler and Pieter Abbeel},
 date-modified = {2020-08-02 11:27:35 -0700},
 institution = {EECS Department, University of California, Berkeley},
 keywords = {techreport},
 month = {9},
 number = {UCB/EECS-2017-159},
 title = {A Berkeley View of Systems Challenges for {AI} },
 url = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2017/EECS-2017-159.html},
 year = {2017}
}

Neeraja J. Yadwadkar, Bharath Hariharan, Joseph E. Gonzalez, Burton Smith, and Randy H. Katz. "Selecting the Best VM Across Multiple Public Clouds: A Data-driven Performance Modeling Approach." Proceedings of the 2017 Symposium on Cloud Computing, 2017.

Users of cloud services are presented with a bewildering choice of VM types and the choice of VM can have significant implications on performance and cost. In this paper we address the fundamental problem of accurately and economically choosing the best VM for a given workload and user goals. To address the problem of optimal VM selection, we present PARIS, a data-driven system that uses a novel hybrid offline and online data collection and modeling framework to provide accurate performance estimates with minimal data collection. PARIS is able to predict workload performance for different user-specified metrics, and resulting costs for a wide range of VM types and workloads across multiple cloud providers. When compared to sophisticated baselines, including collaborative filtering and a linear interpolation model using measured workload performance on two VM types, PARIS produces significantly better estimates of performance. For instance, it reduces runtime prediction error by a factor of 4 for some workloads on both AWS and Azure. The increased accuracy translates into a 45\% reduction in user cost while maintaining performance.

@inproceedings{Paris17,
 abstract = {Users of cloud services are presented with a bewildering choice of VM types and the choice of VM can have significant implications on performance and cost. In this paper we address the fundamental problem of accurately and economically choosing the best VM for a given workload and user goals. To address the problem of optimal VM selection, we present PARIS, a data-driven system that uses a novel hybrid offline and online data collection and modeling framework to provide accurate performance estimates with minimal data collection. PARIS is able to predict workload performance for different user-specified metrics, and resulting costs for a wide range of VM types and workloads across multiple cloud providers. When compared to sophisticated baselines, including collaborative filtering and a linear interpolation model using measured workload performance on two VM types, PARIS produces significantly better estimates of performance. For instance, it reduces runtime prediction error by a factor of 4 for some workloads on both AWS and Azure. The increased accuracy translates into a 45\% reduction in user cost while maintaining performance.},
 acmid = {3131614},
 author = {Neeraja J. Yadwadkar and Bharath Hariharan and Joseph E. Gonzalez and Burton Smith and Randy H. Katz},
 booktitle = {Proceedings of the 2017 Symposium on Cloud Computing},
 date-modified = {2020-08-02 11:27:35 -0700},
 isbn = {978-1-4503-5028-0},
 keywords = {peerrev},
 location = {Santa Clara, California},
 month = {9},
 numpages = {14},
 pages = {452--465},
 publisher = {ACM},
 series = { {SoCC} '17},
 title = {Selecting the Best {VM} Across Multiple Public Clouds: A Data-driven Performance Modeling Approach},
 url = {https://doi.acm.org/10.1145/3127479.3131614},
 year = {2017}
}

Francois W. Belletti, Evan R. Sparks, Michael J. Franklin, Alexandre M. Bayen, and Joseph E. Gonzalez. "Random Projection Design for Scalable Implicit Smoothing of Randomly Observed Stochastic Processes." Artificial Intelligence and Statistics (AIStats '17), 2017.

Sampling at random timestamps, long range dependencies, and scale hamper standard methods for multivariate time series analysis. In this paper we present a novel estimator for cross-covariance of randomly observed time series which unravels the dynamics of an unobserved stochastic process. We analyze the statistical properties of our estimator without needing the assumption that observation timestamps are independent from the process of interest and show that our solution is not hindered by the issues affecting standard estimators for cross-covariance. We implement and evaluate our statistically sound and scalable approach in the distributed setting using Apache Spark and demonstrate its ability to unravel causal dynamics on both simulations and high-frequency financial trading data.

@inproceedings{aistats17,
 abstract = {Sampling at random timestamps, long range dependencies, and scale hamper standard methods for multivariate time series analysis. In this paper we present a novel estimator for cross-covariance of randomly observed time series which unravels the dynamics of an unobserved stochastic process. We analyze the statistical properties of our estimator without needing the assumption that observation timestamps are independent from the process of interest and show that our solution is not hindered by the issues affecting standard estimators for cross-covariance. We implement and evaluate our statistically sound and scalable approach in the distributed setting using Apache Spark and demonstrate its ability to unravel causal dynamics on both simulations and high-frequency financial trading data.},
 author = {Francois W. Belletti and Evan R. Sparks and Michael J. Franklin and Alexandre M. Bayen and Joseph E. Gonzalez},
 booktitle = {Artificial Intelligence and Statistics ({AIStats} '17)},
 date-modified = {2020-08-02 11:27:35 -0700},
 keywords = {peerrev},
 month = {7},
 title = {Random Projection Design for Scalable Implicit Smoothing of Randomly Observed Stochastic Processes},
 url = {http://proceedings.mlr.press/v54/belletti17a/belletti17a.pdf},
 year = {2017}
}

Richard Liaw, Sanjay Krishnan, Animesh Garg, Daniel Crankshaw, Joseph E. Gonzalez, and Ken Goldberg. "Composing Meta-Policies for Autonomous Driving Using Hierarchical Deep Reinforcement Learning." CoRR (arXiv), 2017.

Rather than learning new control policies for each new task, it is possible, when tasks share some structure, to compose a ``meta-policy'' from previously learned policies. This paper reports results from experiments using Deep Reinforcement Learning on a continuous-state, discrete-action autonomous driving simulator. We explore how Deep Neural Networks can represent meta-policies that switch among a set of previously learned policies, specifically in settings where the dynamics of a new scenario are composed of a mixture of previously learned dynamics and where the state observation is possibly corrupted by sensing noise. We also report the results of experiments varying dynamics mixes, distractor policies, magnitudes/distributions of sensing noise, and obstacles. In a fully observed experiment, the meta-policy learning algorithm achieves 2.6x the reward achieved by the next best policy composition technique with 80\% less exploration. In a partially observed experiment, the meta-policy learning algorithm converges after 50 iterations while a direct application of RL fails to converge even after 200 iterations.

@article{Liaw2017,
 abstract = {Rather than learning new control policies for each new task, it is possible, when tasks share some structure, to compose a ``meta-policy'' from previously learned policies. This paper reports results from experiments using Deep Reinforcement Learning on a continuous-state, discrete-action autonomous driving simulator. We explore how Deep Neural Networks can represent meta-policies that switch among a set of previously learned policies, specifically in settings where the dynamics of a new scenario are composed of a mixture of previously learned dynamics and where the state observation is possibly corrupted by sensing noise. We also report the results of experiments varying dynamics mixes, distractor policies, magnitudes/distributions of sensing noise, and obstacles. In a fully observed experiment, the meta-policy learning algorithm achieves 2.6x the reward achieved by the next best policy composition technique with 80\% less exploration. In a partially observed experiment, the meta-policy learning algorithm converges after 50 iterations while a direct application of RL fails to converge even after 200 iterations.},
 archiveprefix = {arXiv},
 author = {Richard Liaw and Sanjay Krishnan and Animesh Garg and Daniel Crankshaw and Joseph E. Gonzalez and Ken Goldberg},
 bibsource = {dblp computer science bibliography, https://dblp.org},
 biburl = {https://dblp.org/rec/bib/journals/corr/abs-1711-01503},
 date-modified = {2020-08-02 11:27:35 -0700},
 eprint = {1711.01503},
 journal = {CoRR},
 keywords = {arxivpre},
 month = {11},
 title = {Composing Meta-Policies for Autonomous Driving Using Hierarchical Deep Reinforcement Learning},
 url = {http://arxiv.org/abs/1711.01503},
 volume = {abs/1711.01503},
 year = {2017}
}

Daniel Crankshaw, Xin Wang, Giulio Zhou, Michael J. Franklin, Joseph E. Gonzalez, and Ion Stoica. "Clipper: A Low-Latency Online Prediction Serving System." 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17), 2017.

Machine learning is being deployed in a growing number of applications which demand real-time, accurate, and robust predictions under heavy query load. However, most machine learning frameworks and systems only address model training and not deployment. In this paper, we introduce Clipper, a general-purpose low-latency prediction serving system. Interposing between end-user applications and a wide range of machine learning frameworks, Clipper introduces a modular architecture to simplify model deployment across frameworks and applications. Furthermore, by introducing caching, batching, and adaptive model selection techniques, Clipper reduces prediction latency and improves prediction throughput, accuracy, and robustness without modifying the underlying machine learning frameworks. We evaluate Clipper on four common machine learning benchmark datasets and demonstrate its ability to meet the latency, accuracy, and throughput demands of online serving applications. Finally, we compare Clipper to the Tensorflow Serving system and demonstrate that we are able to achieve comparable throughput and latency while enabling model composition and online learning to improve accuracy and render more robust predictions.

@inproceedings{Clipper17,
 abstract = {
Machine learning is being deployed in a growing number of applications which demand real-time, accurate, and robust predictions under heavy query load. However, most machine learning frameworks and systems only address model training and not deployment.

In this paper, we introduce Clipper, a general-purpose low-latency prediction serving system. Interposing between end-user applications and a wide range of machine learning frameworks, Clipper introduces a modular architecture to simplify model deployment across frameworks and applications. Furthermore, by introducing caching, batching, and adaptive model selection techniques, Clipper reduces prediction latency and improves prediction throughput, accuracy, and robustness without modifying the underlying machine learning frameworks. We evaluate Clipper on four common machine learning benchmark datasets and demonstrate its ability to meet the latency, accuracy, and throughput demands of online serving applications. Finally, we compare Clipper to the Tensorflow Serving system and demonstrate that we are able to achieve comparable throughput and latency while enabling model composition and online learning to improve accuracy and render more robust predictions.
},
 address = {Boston, MA},
 author = {Daniel Crankshaw and Xin Wang and Giulio Zhou and Michael J. Franklin and Joseph E. Gonzalez and Ion Stoica},
 booktitle = {14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17)},
 code = {https://clipper.ai},
 date-modified = {2020-08-02 11:27:35 -0700},
 isbn = {978-1-931971-37-9},
 keywords = {peerrev, selected},
 pages = {613--627},
 publisher = {USENIX Association},
 title = {Clipper: A Low-Latency Online Prediction Serving System},
 url = {https://www.usenix.org/conference/nsdi17/technical-sessions/presentation/crankshaw},
 year = {2017}
}

Joseph M. Hellerstein, Vikram Sreekanti, Joseph E. Gonzalez, Sudhanshu Arora, Arka Bhattacharyya, Shirshanka Das, Akon Dey, Mark Donsky, Gabriel Fierro, Sreyashi Nag, Krishna Ramachandran, Chang She, Eric Sun, Carl Steinbach, and Venkat Subramanian. "Establishing Common Ground with Data Context." Conference on Innovative Data Systems Research (CIDR '17), 2017.


@inproceedings{Ground17,
 author = {Joseph M. Hellerstein and Vikram Sreekanti and Joseph E. Gonzalez and Sudhanshu Arora and Arka Bhattacharyya and Shirshanka Das and Akon Dey and Mark Donsky and Gabriel Fierro and Sreyashi Nag and Krishna Ramachandran and Chang She and Eric Sun and Carl Steinbach and Venkat Subramanian},
 booktitle = {Conference on Innovative Data Systems Research ({CIDR} '17)},
 keywords = {peerrev},
 title = {Establishing Common Ground with Data Context},
 year = {2017}
}

Wenting Zheng, Ankur Dave, Jethro G. Beekman, Raluca Ada Popa, Joseph E. Gonzalez, and Ion Stoica. "Opaque: An Oblivious and Encrypted Distributed Analytics Platform." 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17), 2017.


@inproceedings{Opaque17,
 address = {Boston, MA},
 author = {Wenting Zheng and Ankur Dave and Jethro G. Beekman and Raluca Ada Popa and Joseph E. Gonzalez and Ion Stoica},
 booktitle = {14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17)},
 date-modified = {2020-08-02 11:27:35 -0700},
 isbn = {978-1-931971-37-9},
 keywords = {peerrev},
 pages = {283--298},
 publisher = {USENIX Association},
 title = {Opaque: An Oblivious and Encrypted Distributed Analytics Platform},
 url = {https://www.usenix.org/conference/nsdi17/technical-sessions/presentation/zheng},
 year = {2017}
}

Matei Zaharia, Reynold S. Xin, Patrick Wendell, Tathagata Das, Michael Armbrust, Ankur Dave, Xiangrui Meng, Josh Rosen, Shivaram Venkataraman, Michael J. Franklin, Ali Ghodsi, Joseph E. Gonzalez, Scott Shenker, and Ion Stoica. "Apache Spark: A Unified Engine for Big Data Processing." Commun. ACM, 2016.


@article{acmqueu2016,
 acmid = {2934664},
 address = {New York, NY, USA},
 author = {Matei Zaharia and Reynold S. Xin and Patrick Wendell and Tathagata Das and Michael Armbrust and Ankur Dave and Xiangrui Meng and Josh Rosen and Shivaram Venkataraman and Michael J. Franklin and Ali Ghodsi and Joseph E. Gonzalez and Scott Shenker and Ion Stoica},
 date-modified = {2020-08-02 11:27:35 -0700},
 issn = {0001-0782},
 issue_date = {November 2016},
 journal = {Commun. ACM},
 keywords = {techreport},
 month = {9},
 number = {11},
 numpages = {10},
 pages = {56--65},
 publisher = {ACM},
 title = {Apache Spark: A Unified Engine for Big Data Processing},
 url = {https://doi.acm.org/10.1145/2934664},
 volume = {59},
 year = {2016}
}

Rong Gu, Qianhao Dong, Haoyuan Li, Joseph E. Gonzalez, Zhao Zhang, Shuai Wang, Yihua Huang, Scott Shenker, Ion Stoica, and Patrick P. C. Lee. "DFS-Perf: A Scalable and Unified Benchmarking Framework for Distributed File Systems." EECS Department, University of California, Berkeley Technical Report, 2016.


@techreport{Rong2016,
 author = {Rong Gu and Qianhao Dong and Haoyuan Li and Joseph E. Gonzalez and Zhao Zhang and Shuai Wang and Yihua Huang and Scott Shenker and Ion Stoica and Patrick P. C. Lee},
 date-modified = {2020-08-02 11:27:35 -0700},
 institution = {EECS Department, University of California, Berkeley},
 keywords = {techreport},
 month = {7},
 number = {UCB/EECS-2016-133},
 title = {DFS-Perf: A Scalable and Unified Benchmarking Framework for Distributed File Systems},
 url = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-133.html},
 year = {2016}
}

Ankur Dave, Alekh Jindal, Li Erran Li, Reynold Xin, Joseph E. Gonzalez, and Matei Zaharia. "GraphFrames: An Integrated API for Mixing Graph and Relational Queries." SIGMOD Grades Workshop, 2016.


@inproceedings{Graphframes16,
 author = {Ankur Dave and Alekh Jindal and Li Erran Li and Reynold Xin and Joseph E. Gonzalez and Matei Zaharia},
 booktitle = { {SIGMOD} Grades Workshop},
 keywords = {peerrev},
 title = {GraphFrames: An Integrated API for Mixing Graph and Relational Queries},
 year = {2016}
}

Neeraja J. Yadwadkar, Bharath Hariharan, Joseph E. Gonzalez, and Randy Katz. "Multi-Task Learning for Straggler Avoiding Predictive Job Scheduling." Journal of Machine Learning Research (JMLR '16), 2016.


@inproceedings{Mtl16,
 author = {Neeraja J. Yadwadkar and Bharath Hariharan and Joseph E. Gonzalez and Randy Katz},
 booktitle = {Journal of Machine Learning Research ({JMLR} '16)},
 keywords = {peerrev},
 title = {Multi-Task Learning for Straggler Avoiding Predictive Job Scheduling},
 year = {2016}
}

Francois W. Belletti, Evan R. Sparks, Michael J. Franklin, Alexandre M. Bayen, and Joseph E. Gonzalez. "Scalable Linear Causal Inference for Irregularly Sampled Time Series with Long Range Dependencies." CoRR (arXiv), 2016.


@article{BellettiSFBG16,
 archiveprefix = {arXiv},
 author = {Francois W. Belletti and Evan R. Sparks and Michael J. Franklin and Alexandre M. Bayen and Joseph E. Gonzalez},
 bibsource = {dblp computer science bibliography, https://dblp.org},
 biburl = {https://dblp.org/rec/bib/journals/corr/BellettiSFBG16},
 date-modified = {2020-08-02 11:27:35 -0700},
 eprint = {1603.03336},
 journal = {CoRR},
 keywords = {arxivpre},
 timestamp = {Mon, 13 Aug 2018 16:48:40 +0200},
 title = {Scalable Linear Causal Inference for Irregularly Sampled Time Series with Long Range Dependencies},
 url = {http://arxiv.org/abs/1603.03336},
 volume = {abs/1603.03336},
 year = {2016}
}

Joseph E. Gonzalez, Peter Bailis, Michael I. Jordan, Michael J. Franklin, Joseph M. Hellerstein, Ali Ghodsi, and Ion Stoica. "Asynchronous Complex Analytics in a Distributed Dataflow Architecture." CoRR (arXiv), 2015.


@article{Gonzalez15,
 archiveprefix = {arXiv},
 author = {Joseph E. Gonzalez and Peter Bailis and Michael I. Jordan and Michael J. Franklin and Joseph M. Hellerstein and Ali Ghodsi and Ion Stoica},
 bibsource = {dblp computer science bibliography, https://dblp.org},
 biburl = {https://dblp.org/rec/bib/journals/corr/GonzalezBJFHGS15},
 date-modified = {2020-08-02 11:27:35 -0700},
 eprint = {1510.07092},
 journal = {CoRR},
 keywords = {arxivpre},
 timestamp = {Mon, 13 Aug 2018 16:46:22 +0200},
 title = {Asynchronous Complex Analytics in a Distributed Dataflow Architecture},
 url = {http://arxiv.org/abs/1510.07092},
 volume = {abs/1510.07092},
 year = {2015}
}

Veronika Strnadova-Neeley, Aydin Buluc, Jarrod Chapman, John Gilbert, Joseph E. Gonzalez, and Leonid Oliker. "Efficient Data Reduction for Large-Scale Genetic Mapping." ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (BCB '15), 2015.


@inproceedings{bcb2015,
 author = {Veronika Strnadova-Neeley and Aydin Buluc and Jarrod Chapman and John Gilbert and Joseph E. Gonzalez and Leonid Oliker},
 booktitle = { {ACM} Conference on Bioinformatics, Computational Biology, and Health Informatics ({BCB} '15)},
 keywords = {peerrev},
 title = {Efficient Data Reduction for Large-Scale Genetic Mapping},
 year = {2015}
}

Neeraja J. Yadwadkar, Bharath Hariharan, Joseph E. Gonzalez, and Randy Katz. "Faster Jobs in Distributed Data Processing using Multi-Task Learning." SIAM International Conference on Data Mining (SDM '15), 2015.


@inproceedings{sdm15,
 author = {Neeraja J. Yadwadkar and Bharath Hariharan and Joseph E. Gonzalez and Randy Katz},
 booktitle = { {SIAM} International Conference on Data Mining ({SDM} '15)},
 keywords = {peerrev},
 title = {Faster Jobs in Distributed Data Processing using Multi-Task Learning},
 year = {2015}
}

Daniel Crankshaw, Xin Wang, Joseph E. Gonzalez, and Michael J. Franklin. "Scalable Training and Serving of Personalized Models." Proceedings of the Learning Systems Workshop at NIPS 2015, 2015.


@inproceedings{Velox15,
 author = {Daniel Crankshaw and Xin Wang and Joseph E. Gonzalez and Michael J. Franklin},
 booktitle = {Proceedings of the Learning Systems Workshop at NIPS 2015},
 keywords = {peerrev},
 title = {Scalable Training and Serving of Personalized Models},
 year = {2015}
}

Daniel Crankshaw, Peter Bailis, Joseph E. Gonzalez, Haoyuan Li, Zhao Zhang, Michael J. Franklin, Ali Ghodsi, and Michael I. Jordan. "The Missing Piece in Complex Analytics: Low Latency, Scalable Model Management and Serving with Velox." Conference on Innovative Data Systems Research (CIDR '15), 2015.


@inproceedings{VeloxCIDR15,
 author = {Daniel Crankshaw and Peter Bailis and Joseph E. Gonzalez and Haoyuan Li and Zhao Zhang and Michael J. Franklin and Ali Ghodsi and Michael I. Jordan},
 booktitle = {Conference on Innovative Data Systems Research ({CIDR} '15)},
 keywords = {peerrev},
 title = {The Missing Piece in Complex Analytics: Low Latency, Scalable Model Management and Serving with Velox},
 year = {2015}
}

Veronika Strnadova, Aydin Buluc, Leonid Oliker, Joseph E. Gonzalez, Stefanie Jegelka, Jarrod Chapman, and John Gilbert. "Fast Clustering Methods for Genetic Mapping in Plants." 16th SIAM Conference on Parallel Processing for Scientific Computing, 2014.


@inproceedings{GeneClust14,
 author = {Veronika Strnadova and Aydin Buluc and Leonid Oliker and Joseph E. Gonzalez and Stefanie Jegelka and Jarrod Chapman and John Gilbert},
 booktitle = {16th SIAM Conference on Parallel Processing for Scientific Computing},
 keywords = {peerrev},
 title = {Fast Clustering Methods for Genetic Mapping in Plants},
 year = {2014}
}

Joseph E. Gonzalez. "From Graphs to Tables the Design of Scalable Systems for Graph Analytics." Proceedings of the 23rd International Conference on World Wide Web, 2014.


@inproceedings{WWW2014,
 author = {Joseph E. Gonzalez},
 booktitle = {Proceedings of the 23rd International Conference on World Wide Web},
 date-modified = {2020-08-02 11:27:35 -0700},
 isbn = {978-1-4503-2745-9},
 keywords = {techreport},
 location = {Seoul, Korea},
 numpages = {2},
 pages = {1149--1150},
 publisher = {ACM},
 series = {WWW '14 Companion},
 title = {From Graphs to Tables the Design of Scalable Systems for Graph Analytics},
 url = {https://doi.acm.org/10.1145/2567948.2580059},
 year = {2014}
}

Joseph E. Gonzalez, Reynold S. Xin, Ankur Dave, Daniel Crankshaw, Michael J. Franklin, and Ion Stoica. "GraphX: Graph Processing in a Distributed Dataflow Framework." 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), 2014.


@inproceedings{GraphX14,
 author = {Joseph E. Gonzalez and Reynold S. Xin and Ankur Dave and Daniel Crankshaw and Michael J. Franklin and Ion Stoica},
 booktitle = {11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14)},
 keywords = {peerrev},
 pages = {599--613},
 title = {GraphX: Graph Processing in a Distributed Dataflow Framework},
 year = {2014}
}

Xinghao Pan, Stefanie Jegelka, Joseph E. Gonzalez, Joseph K. Bradley, and Michael I. Jordan. "Parallel Double Greedy Submodular Maximization." Neural Information Processing Systems (NIPS '14), 2014.


@inproceedings{ParallelSubmodular14,
 author = {Xinghao Pan and Stefanie Jegelka and Joseph E. Gonzalez and Joseph K. Bradley and Michael I. Jordan},
 booktitle = {Neural Information Processing Systems ({NIPS} '14)},
 keywords = {peerrev},
 title = {Parallel Double Greedy Submodular Maximization},
 year = {2014}
}

David Bader, Aydın Buluç, John Gilbert, Joseph E. Gonzalez, Jeremy Kepner, and Timothy Mattson. "The Graph BLAS effort and its implications for Exascale." SIAM Workshop on Exascale Applied Mathematics Challenges and Opportunities (EX14), 2014.


@inproceedings{gblas14,
 author = {David Bader and Ayd{\i}n Bulu\c{c} and John Gilbert and Joseph E. Gonzalez and Jeremy Kepner and Timothy Mattson},
 booktitle = {SIAM Workshop on Exascale Applied Mathematics Challenges and Opportunities (EX14)},
 keywords = {peerrev},
 title = {The Graph BLAS effort and its implications for Exascale},
 year = {2014}
}

T. Mattson, D. Bader, J. Berry, A. Buluc, J. Dongarra, C. Faloutsos, J. Feo, J. Gilbert, J. Gonzalez, B. Hendrickson, J. Kepner, C. Leiserson, A. Lumsdaine, D. Padua, S. Poole, S. Reinhardt, M. Stonebraker, S. Wallach, and A. Yoo. "Standards for graph algorithm primitives." 2013 IEEE High Performance Extreme Computing Conference (HPEC), 2013.

It is our view that the state of the art in constructing a large collection of graph algorithms in terms of linear algebraic operations is mature enough to support the emergence of a standard set of primitive building blocks. This paper is a position paper defining the problem and announcing our intention to launch an open effort to define this standard.

@inproceedings{Standards13,
 abstract = {It is our view that the state of the art in constructing a large collection of graph algorithms in terms of linear algebraic operations is mature enough to support the emergence of a standard set of primitive building blocks. This paper is a position paper defining the problem and announcing our intention to launch an open effort to define this standard.},
 author = {T. Mattson and D. Bader and J. Berry and A. Buluc and J. Dongarra and C. Faloutsos and J. Feo and J. Gilbert and J. Gonzalez and B. Hendrickson and J. Kepner and C. Leiserson and A. Lumsdaine and D. Padua and S. Poole and S. Reinhardt and M. Stonebraker and S. Wallach and A. Yoo},
 booktitle = {2013 IEEE High Performance Extreme Computing Conference (HPEC)},
 date-modified = {2020-08-02 11:27:35 -0700},
 keywords = {peerrev},
 month = {9},
 pages = {1--2},
 title = {Standards for graph algorithm primitives},
 url = {https://doi.org/10.1109/HPEC.2013.6670338},
 year = {2013}
}

Evan Sparks, Ameet Talwalkar, Virginia Smith, Xinghao Pan, Joseph E. Gonzalez, Tim Kraska, Michael I. Jordan, and Michael J. Franklin. "MLI: An API for Distributed Machine Learning." International Conference on Data Mining (ICDM), 2013.

MLI is an Application Programming Interface designed to address the challenges of building Machine Learning algorithms in a distributed setting based on data-centric computing. Its primary goal is to simplify the development of high-performance, scalable, distributed algorithms. Our initial results show that, relative to existing systems, this interface can be used to build distributed implementations of a wide variety of common Machine Learning algorithms with minimal complexity and highly competitive performance and scalability.

@inproceedings{MLI12,
 abstract = {MLI is an Application Programming Interface designed to address the challenges of building Machine Learning algorithms in a distributed setting based on data-centric computing. Its primary goal is to simplify the development of high-performance, scalable, distributed algorithms. Our initial results show that, relative to existing systems, this interface can be used to build distributed implementations of a wide variety of common Machine Learning algorithms with minimal complexity and highly competitive performance and scalability.},
 author = {Evan Sparks and Ameet Talwalkar and Virginia Smith and Xinghao Pan and Joseph E. Gonzalez and Tim Kraska and Michael I. Jordan and Michael J. Franklin},
 booktitle = {International Conference on Data Mining (ICDM)},
 date-modified = {2020-08-02 11:27:35 -0700},
 keywords = {peerrev},
 month = {12},
 organization = {IEEE},
 title = { {MLI}: An API for Distributed Machine Learning},
 url = {https://ieeexplore.ieee.org/abstract/document/6729619},
 year = {2013}
}

Reynold Xin, Joseph E. Gonzalez, Michael Franklin, and Ion Stoica. "GraphX: A Resilient Distributed Graph System on Spark." SIGMOD Grades Workshop, 2013.

From social networks to targeted advertising, big graphs capture the structure in data and are central to recent advances in machine learning and data mining. Unfortunately, directly applying existing data-parallel tools to graph computation tasks can be cumbersome and inefficient. The need for intuitive, scalable tools for graph computation has led to the development of new graph-parallel systems (e.g., Pregel, PowerGraph) which are designed to efficiently execute graph algorithms. Unfortunately, these new graph-parallel systems do not address the challenges of graph construction and transformation which are often just as problematic as the subsequent computation. Furthermore, existing graph-parallel systems provide limited fault-tolerance and support for interactive data mining. We introduce GraphX, which combines the advantages of both data-parallel and graph-parallel systems by efficiently expressing graph computation within the Spark data-parallel framework. We leverage new ideas in distributed graph representation to efficiently distribute graphs as tabular data-structures. Similarly, we leverage advances in data-flow systems to exploit in-memory computation and fault-tolerance. We provide powerful new operations to simplify graph construction and transformation. Using these primitives we implement the PowerGraph and Pregel abstractions in less than 20 lines of code. Finally, by exploiting the Scala foundation of Spark, we enable users to interactively load, transform, and compute on massive graphs.

@inproceedings{SigmodGraphX13,
 abstract = {
From social networks to targeted advertising, big graphs capture the structure in data and are central to recent advances in machine learning and data mining. Unfortunately, directly applying existing data-parallel tools to graph computation tasks can be cumbersome and inefficient. The need for intuitive, scalable tools for graph computation has led to the development of new graph-parallel systems (e.g., Pregel, PowerGraph) which are designed to efficiently execute graph algorithms. Unfortunately, these new graph-parallel systems do not address the challenges of graph construction and transformation which are often just as problematic as the subsequent computation. Furthermore, existing graph-parallel systems provide limited fault-tolerance and support for interactive data mining.

We introduce GraphX, which combines the advantages of both data-parallel and graph-parallel systems by efficiently expressing graph computation within the Spark data-parallel framework. We leverage new ideas in distributed graph representation to efficiently distribute graphs as tabular data-structures. Similarly, we leverage advances in data-flow systems to exploit in-memory computation and fault-tolerance. We provide powerful new operations to simplify graph construction and transformation. Using these primitives we implement the PowerGraph and Pregel abstractions in less than 20 lines of code. Finally, by exploiting the Scala foundation of Spark, we enable users to interactively load, transform, and compute on massive graphs.
},
 author = {Reynold Xin and Joseph E. Gonzalez and Michael Franklin and Ion Stoica},
 booktitle = { {SIGMOD} Grades Workshop},
 date-modified = {2020-08-02 11:27:35 -0700},
 keywords = {peerrev},
 title = {GraphX: A Resilient Distributed Graph System on Spark},
 url = {https://dl.acm.org/citation.cfm?id=2484427},
 year = {2013}
}

Xinghao Pan, Joseph E. Gonzalez, Stefanie Jegelka, Tamara Broderick, and Michael I. Jordan. "Optimistic Concurrency Control for Distributed Unsupervised Learning." NIPS '13, 2013.

Research on distributed machine learning algorithms has focused primarily on one of two extremes - algorithms that obey strict concurrency constraints or algorithms that obey few or no such constraints. We consider an intermediate alternative in which algorithms optimistically assume that conflicts are unlikely and if conflicts do arise a conflict-resolution protocol is invoked. We view this ``optimistic concurrency control'' paradigm as particularly appropriate for large-scale machine learning algorithms, particularly in the unsupervised setting. We demonstrate our approach in three problem areas: clustering, feature learning and online facility location. We evaluate our methods via large-scale experiments in a cluster computing environment.

@inproceedings{OCC13,
 abstract = {Research on distributed machine learning algorithms has focused primarily on one of two extremes - algorithms that obey strict concurrency constraints or algorithms that obey few or no such constraints. We consider an intermediate alternative in which algorithms optimistically assume that conflicts are unlikely and if conflicts do arise a conflict-resolution protocol is invoked. We view this ``optimistic concurrency control'' paradigm as particularly appropriate for large-scale machine learning algorithms, particularly in the unsupervised setting. We demonstrate our approach in three problem areas: clustering, feature learning and online facility location. We evaluate our methods via large-scale experiments in a cluster computing environment.},
 author = {Xinghao Pan and Joseph E. Gonzalez and Stefanie Jegelka and Tamara Broderick and Michael I. Jordan},
 booktitle = { {NIPS} '13},
 date-modified = {2020-08-02 11:27:35 -0700},
 keywords = {peerrev},
 title = {Optimistic Concurrency Control for Distributed Unsupervised Learning},
 url = {https://arxiv.org/abs/1307.8049},
 year = {2013}
}

Yucheng Low, Joseph E. Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, and Joseph M. Hellerstein. "Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud." Proceedings of Very Large Data Bases (PVLDB), 2012.

While high-level data parallel frameworks, like MapReduce, simplify the design and implementation of large-scale data processing systems, they do not naturally or efficiently support many important data mining and machine learning algorithms and can lead to inefficient learning systems. To help fill this critical void, we introduced the GraphLab abstraction which naturally expresses asynchronous, dynamic, graph-parallel computation while ensuring data consistency and achieving a high degree of parallel performance in the shared-memory setting. In this paper, we extend the GraphLab framework to the substantially more challenging distributed setting while preserving strong data consistency guarantees. We develop graph based extensions to pipelined locking and data versioning to reduce network congestion and mitigate the effect of network latency. We also introduce fault tolerance to the GraphLab abstraction using the classic Chandy-Lamport snapshot algorithm and demonstrate how it can be easily implemented by exploiting the GraphLab abstraction itself. Finally, we evaluate our distributed implementation of the GraphLab abstraction on a large Amazon EC2 deployment and show 1-2 orders of magnitude performance gains over Hadoop-based implementations.

@inproceedings{DistGraphlab12,
 abstract = {While high-level data parallel frameworks, like MapReduce, simplify the design and implementation of large-scale data processing systems, they do not naturally or efficiently support many important data mining and machine learning algorithms and can lead to inefficient learning systems. To help fill this critical void, we introduced the GraphLab abstraction which naturally expresses asynchronous, dynamic, graph-parallel computation while ensuring data consistency and achieving a high degree of parallel performance in the shared-memory setting. In this paper, we extend the GraphLab framework to the substantially more challenging distributed setting while preserving strong data consistency guarantees. We develop graph based extensions to pipelined locking and data versioning to reduce network congestion and mitigate the effect of network latency. We also introduce fault tolerance to the GraphLab abstraction using the classic Chandy-Lamport snapshot algorithm and demonstrate how it can be easily implemented by exploiting the GraphLab abstraction itself. Finally, we evaluate our distributed implementation of the GraphLab abstraction on a large Amazon EC2 deployment and show 1-2 orders of magnitude performance gains over Hadoop-based implementations.},
 author = {Yucheng Low and Joseph E. Gonzalez and Aapo Kyrola and Danny Bickson and Carlos Guestrin and Joseph M. Hellerstein},
 booktitle = {Proceedings of Very Large Data Bases (PVLDB)},
 date-modified = {2020-08-02 11:27:35 -0700},
 keywords = {peerrev},
 month = {8},
 title = {Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud},
 url = {https://arxiv.org/abs/1204.6078},
 year = {2012}
}

Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin. "PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs." OSDI '12, 2012.

Large-scale graph-structured computation is central to tasks ranging from targeted advertising to natural language processing and has led to the development of several graph-parallel abstractions including Pregel and GraphLab. However, the natural graphs commonly found in the real-world have highly skewed power-law degree distributions, which challenge the assumptions made by these abstractions, limiting performance and scalability. In this paper, we characterize the challenges of computation on natural graphs in the context of existing graph-parallel abstractions. We then introduce the PowerGraph abstraction which exploits the internal structure of graph programs to address these challenges. Leveraging the PowerGraph abstraction we introduce a new approach to distributed graph placement and representation that exploits the structure of power-law graphs. We provide a detailed analysis and experimental evaluation comparing PowerGraph to two popular graph-parallel systems. Finally, we describe three different implementation strategies for PowerGraph and discuss their relative merits with empirical evaluations on large-scale real-world problems demonstrating order of magnitude gains.

@inproceedings{PowerGraph12,
 abstract = {
Large-scale graph-structured computation is central to tasks ranging from targeted advertising to natural language processing and has led to the development of several graph-parallel abstractions including Pregel and GraphLab. However, the natural graphs commonly found in the real-world have highly skewed power-law degree distributions, which challenge the assumptions made by these abstractions, limiting performance and scalability.

In this paper, we characterize the challenges of computation on natural graphs in the context of existing graph-parallel abstractions. We then introduce the PowerGraph abstraction which exploits the internal structure of graph programs to address these challenges. Leveraging the PowerGraph abstraction we introduce a new approach to distributed graph placement and representation that exploits the structure of power-law graphs. We provide a detailed analysis and experimental evaluation comparing PowerGraph to two popular graph-parallel systems. Finally, we describe three different implementation strategies for PowerGraph and discuss their relative merits with empirical evaluations on large-scale real-world problems demonstrating order of magnitude gains.
},
 author = {Joseph E. Gonzalez and Yucheng Low and Haijie Gu and Danny Bickson and Carlos Guestrin},
 booktitle = { {OSDI} '12},
 date-modified = {2020-08-02 11:27:35 -0700},
 keywords = {peerrev},
 title = {PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs},
 url = {https://www.usenix.org/system/files/conference/osdi12/osdi12-final-167.pdf},
 year = {2012}
}

Amr Ahmed, Mohamed Aly, Joseph Gonzalez, Shravan Narayanamurthy, and Alex Smola. "Scalable Inference in Latent Variable Models." Conference on Web Search and Data Mining (WSDM), 2012.

Latent variable techniques are pivotal in tasks ranging from predicting user click patterns and targeting ads to organizing the news and managing user generated content. Latent variable techniques like topic modeling, clustering, and subspace estimation provide substantial insight into the latent structure of complex data with little or no external guidance making them ideal for reasoning about large-scale, rapidly evolving datasets. Unfortunately, due to the data dependencies and global state introduced by latent variables and the iterative nature of latent variable inference, latent-variable techniques are often prohibitively expensive to apply to large-scale, streaming datasets. In this paper we present a scalable parallel framework for efficient inference in latent variable models over streaming web-scale data. Our framework addresses three key challenges: 1) synchronizing the global state which includes global latent variables (e.g., cluster centers and dictionaries); 2) efficiently storing and retrieving the large local state which includes the data-points and their corresponding latent variables (e.g., cluster membership); and 3) sequentially incorporating streaming data (e.g., the news). We address these challenges by introducing: 1) a novel delta-based aggregation system with a bandwidth-efficient communication protocol; 2) schedule-aware out-of-core storage; and 3) approximate forward sampling to rapidly incorporate new data. We demonstrate state-of-the-art performance of our framework by easily tackling datasets two orders of magnitude larger than those addressed by the current state-of-the-art. Furthermore, we provide an optimized and easily customizable open-source implementation of the framework.

@inproceedings{ParamServer12,
 abstract = {
Latent variable techniques are pivotal in tasks ranging from predicting user click patterns and targeting ads to organizing the news and managing user generated content. Latent variable techniques like topic modeling, clustering, and subspace estimation provide substantial insight into the latent structure of complex data with little or no external guidance making them ideal for reasoning about large-scale, rapidly evolving datasets. Unfortunately, due to the data dependencies and global state introduced by latent variables and the iterative nature of latent variable inference, latent-variable techniques are often prohibitively expensive to apply to large-scale, streaming datasets.

In this paper we present a scalable parallel framework for efficient inference in latent variable models over streaming web-scale data. Our framework addresses three key challenges: 1) synchronizing the global state which includes global latent variables (e.g., cluster centers and dictionaries); 2) efficiently storing and retrieving the large local state which includes the data-points and their corresponding latent variables (e.g., cluster membership); and 3) sequentially incorporating streaming data (e.g., the news). We address these challenges by introducing: 1) a novel delta-based aggregation system with a bandwidth-efficient communication protocol; 2) schedule-aware out-of-core storage; and 3) approximate forward sampling to rapidly incorporate new data. We demonstrate state-of-the-art performance of our framework by easily tackling datasets two orders of magnitude larger than those addressed by the current state-of-the-art. Furthermore, we provide an optimized and easily customizable open-source implementation of the framework.
},
 author = {Amr Ahmed and Mohamed Aly and Joseph Gonzalez and Shravan Narayanamurthy and Alex Smola},
 booktitle = {Conference on Web Search and Data Mining (WSDM)},
 date-modified = {2020-08-02 11:27:35 -0700},
 keywords = {peerrev},
 title = {Scalable Inference in Latent Variable Models},
 url = {http://www.cs.cmu.edu/~jegonzal/papers/ahmed\%5Fscalable\%5Finference\%5Fin\%5Flatent\%5Fvariable\%5Fmodels.pdf},
 year = {2012}
}

Joseph E. Gonzalez, Yucheng Low, Arthur Gretton, and Carlos Guestrin. "Parallel Gibbs Sampling: From Colored Fields to Thin Junction Trees." Artificial Intelligence and Statistics (AISTATS), 2011.

We explore the task of constructing a parallel Gibbs sampler, to both improve mixing and the exploration of high likelihood states. Recent work in parallel Gibbs sampling has focused on update schedules which do not guarantee convergence to the intended stationary distribution. In this work, we propose two methods to construct parallel Gibbs samplers guaranteed to draw from the targeted distribution. The first method, called the Chromatic sampler, uses graph coloring to construct a direct parallelization of the classic sequential scan Gibbs sampler. In the case of 2-colorable models we relate the Chromatic sampler to the Synchronous Gibbs sampler (which draws all variables simultaneously in parallel), and reveal new ergodic properties of Synchronous Gibbs chains. Our second method, the Splash sampler, is a complementary strategy which can be used when the variables are tightly coupled. This constructs and samples multiple blocks in parallel, using a novel locking protocol and an iterative junction tree generation algorithm. We further improve the Splash sampler through adaptive tree construction. We demonstrate the benefits of our two sampling algorithms on large synthetic and real-world models using a 32 processor multi-core system.

@inproceedings{GibbsSplash11,
 abstract = {We explore the task of constructing a parallel Gibbs sampler, to both improve mixing and the exploration of high likelihood states. Recent work in parallel Gibbs sampling has focused on update schedules which do not guarantee convergence to the intended stationary distribution. In this work, we propose two methods to construct parallel Gibbs samplers guaranteed to draw from the targeted distribution. The first method, called the Chromatic sampler, uses graph coloring to construct a direct parallelization of the classic sequential scan Gibbs sampler. In the case of 2-colorable models we relate the Chromatic sampler to the Synchronous Gibbs sampler (which draws all variables simultaneously in parallel), and reveal new ergodic properties of Synchronous Gibbs chains. Our second method, the Splash sampler, is a complementary strategy which can be used when the variables are tightly coupled. This constructs and samples multiple blocks in parallel, using a novel locking protocol and an iterative junction tree generation algorithm. We further improve the Splash sampler through adaptive tree construction. We demonstrate the benefits of our two sampling algorithms on large synthetic and real-world models using a 32 processor multi-core system.},
 author = {Joseph E. Gonzalez and Yucheng Low and Arthur Gretton and Carlos Guestrin},
 booktitle = {Artificial Intelligence and Statistics (AISTATS)},
 date-modified = {2020-08-02 11:27:35 -0700},
 keywords = {peerrev},
 month = {5},
 title = {Parallel Gibbs Sampling: From Colored Fields to Thin Junction Trees},
 url = {http://proceedings.mlr.press/v15/gonzalez11a.html},
 year = {2011}
}
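
As a rough illustration of the Chromatic sampler described in the abstract above, the sketch below colors a pairwise Markov random field and then resamples each color class in parallel. The greedy coloring and the Ising-style conditional are illustrative stand-ins, not the paper's implementation.

import math
import random
from concurrent.futures import ThreadPoolExecutor

def greedy_coloring(neighbors):
    # Assign each vertex the smallest color not used by an already-colored neighbor.
    colors = {}
    for v in neighbors:
        used = {colors[u] for u in neighbors[v] if u in colors}
        colors[v] = next(c for c in range(len(neighbors) + 1) if c not in used)
    return colors

def sample_conditional(v, state, neighbors):
    # Illustrative Ising conditional for a +/-1 variable with unit couplings.
    field = sum(state[u] for u in neighbors[v])
    p_plus = 1.0 / (1.0 + math.exp(-2.0 * field))
    return 1 if random.random() < p_plus else -1

def chromatic_sweep(state, neighbors, colors, pool):
    # Variables sharing a color are non-adjacent, hence conditionally independent
    # given the rest, so each color class can be resampled in parallel.
    for c in sorted(set(colors.values())):
        block = [v for v in state if colors[v] == c]
        new_values = list(pool.map(lambda v: sample_conditional(v, state, neighbors), block))
        for v, value in zip(block, new_values):
            state[v] = value
    return state

One sweep over all colors plays the role of one pass of the sequential-scan sampler; because no two vertices of the same color are adjacent, the draws within a color do not depend on one another.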

Yucheng Low, Joseph E. Gonzalez, Aapo Kyrola, Daniel Bickson, Carlos Guestrin, and Joseph M. Hellerstein. "GraphLab: A New Parallel Framework for Machine Learning." Conference on Uncertainty in Artificial Intelligence (UAI), 2010.

Designing and implementing efficient, provably correct parallel machine learning (ML) algorithms is challenging. Existing high-level parallel abstractions like MapReduce are insufficiently expressive while low-level tools like MPI and Pthreads leave ML experts repeatedly solving the same design challenges. By targeting common patterns in ML, we developed GraphLab, which improves upon abstractions like MapReduce by compactly expressing asynchronous iterative algorithms with sparse computational dependencies while ensuring data consistency and achieving a high degree of parallel performance. We demonstrate the expressiveness of the GraphLab framework by designing and implementing parallel versions of belief propagation, Gibbs sampling, Co-EM, Lasso and Compressed Sensing. We show that using GraphLab we can achieve excellent parallel performance on large scale real-world problems.

@inproceedings{Graphlab10,
 abstract = {Designing and implementing efficient, provably correct parallel machine learning (ML) algorithms is challenging. Existing high-level parallel abstractions like MapReduce are insufficiently expressive while low-level tools like MPI and Pthreads leave ML experts repeatedly solving the same design challenges. By targeting common patterns in ML, we developed GraphLab, which improves upon abstractions like MapReduce by compactly expressing asynchronous iterative algorithms with sparse computational dependencies while ensuring data consistency and achieving a high degree of parallel performance. We demonstrate the expressiveness of the GraphLab framework by designing and implementing parallel versions of belief propagation, Gibbs sampling, Co-EM, Lasso and Compressed Sensing. We show that using GraphLab we can achieve excellent parallel performance on large scale real-world problems.},
 author = {Yucheng Low and Joseph E. Gonzalez and Aapo Kyrola and Daniel Bickson and Carlos Guestrin and Joseph M. Hellerstein},
 booktitle = {Conference on Uncertainty in Artificial Intelligence (UAI)},
 date-modified = {2020-08-02 11:27:35 -0700},
 keywords = {peerrev},
 title = {GraphLab: A New Parallel Framework for Machine Learning},
 url = {https://arxiv.org/abs/1006.4990},
 year = {2010}
}
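
The update-function abstraction at the heart of GraphLab can be caricatured in a few lines of Python: each vertex recomputes its value from its neighborhood and reschedules its dependents only when the change is large enough. The sketch below uses PageRank as the example and is an illustration of the programming model, not the GraphLab API; it also assumes every vertex has at least one out-edge.

from collections import deque

def pagerank_update(v, rank, in_neighbors, out_degree, damping=0.85, tol=1e-6):
    # Recompute v's rank from its in-neighbors; report whether it changed enough
    # to justify rescheduling the vertices that depend on v.
    incoming = sum(rank[u] / out_degree[u] for u in in_neighbors[v])
    new_rank = (1.0 - damping) + damping * incoming
    return new_rank, abs(new_rank - rank[v]) > tol

def run_engine(vertices, in_neighbors, out_neighbors, out_degree):
    # Sequential stand-in for GraphLab's schedulers: a work queue of vertices
    # that is refilled dynamically as long as values keep changing.
    rank = {v: 1.0 for v in vertices}
    queue, queued = deque(vertices), set(vertices)
    while queue:
        v = queue.popleft()
        queued.discard(v)
        rank[v], changed = pagerank_update(v, rank, in_neighbors, out_degree)
        if changed:
            for u in out_neighbors[v]:
                if u not in queued:
                    queue.append(u)
                    queued.add(u)
    return rank

In the real system the queue is replaced by parallel schedulers with configurable consistency guarantees; the point here is only the shape of the abstraction: local computation over a neighborhood plus dynamic scheduling.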

Joseph E. Gonzalez, Yucheng Low, Carlos Guestrin, and David O'Hallaron. "Distributed Parallel Inference on Large Factor Graphs." Conference on Uncertainty in Artificial Intelligence (UAI), 2009.

As computer clusters become more common and the size of the problems encountered in the field of AI grows, there is an increasing demand for efficient parallel inference algorithms. We consider the problem of parallel inference on large factor graphs in the distributed memory setting of computer clusters. We develop a new efficient parallel inference algorithm, DBRSplash, which incorporates over-segmented graph partitioning, belief residual scheduling, and uniform work Splash operations. We empirically evaluate the DBRSplash algorithm on a 120 processor cluster and demonstrate linear to super-linear performance gains on large factor graph models.

@inproceedings{DistSplash09,
 abstract = {As computer clusters become more common and the size of the problems encountered in the field of AI grows, there is an increasing demand for efficient parallel inference algorithms. We consider the problem of parallel inference on large factor graphs in the distributed memory setting of computer clusters. We develop a new efficient parallel inference algorithm, DBRSplash, which incorporates over-segmented graph partitioning, belief residual scheduling, and uniform work Splash operations. We empirically evaluate the DBRSplash algorithm on a 120 processor cluster and demonstrate linear to super-linear performance gains on large factor graph models.},
 author = {Joseph E. Gonzalez and Yucheng Low and Carlos Guestrin and David O'Hallaron},
 booktitle = {Conference on Uncertainty in Artificial Intelligence (UAI)},
 date-modified = {2020-08-02 11:27:35 -0700},
 keywords = {peerrev},
 month = {7},
 title = {Distributed Parallel Inference on Large Factor Graphs},
 url = {https://arxiv.org/pdf/1205.2645.pdf},
 year = {2009}
}
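
Of the three ingredients listed in the abstract, over-segmented partitioning is the easiest to sketch: the graph is cut into many more pieces than there are machines, and pieces are assigned greedily to the least-loaded machine. The chunking and work estimate below are placeholders (a real system would use a proper graph partitioner), so treat this only as an illustration of the load-balancing idea.

import heapq

def oversegment(vertices, num_machines, factor=8):
    # Cut the vertex set into num_machines * factor roughly equal pieces.
    vertices = list(vertices)
    num_pieces = num_machines * factor
    size = max(1, -(-len(vertices) // num_pieces))   # ceiling division
    return [vertices[i:i + size] for i in range(0, len(vertices), size)]

def assign_pieces(pieces, num_machines, work=len):
    # Greedy longest-processing-time assignment: always give the next-largest
    # piece to the machine with the smallest estimated load.
    heap = [(0.0, m, []) for m in range(num_machines)]
    heapq.heapify(heap)
    for piece in sorted(pieces, key=work, reverse=True):
        load, m, assigned = heapq.heappop(heap)
        assigned.append(piece)
        heapq.heappush(heap, (load + work(piece), m, assigned))
    return {m: assigned for _, m, assigned in heap}

Over-segmentation matters because residual scheduling makes the work per piece unpredictable; many small pieces give the greedy assignment room to even out the load across machines.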

Joseph E. Gonzalez, Yucheng Low, and Carlos Guestrin. "Residual Splash for Optimally Parallelizing Belief Propagation." Artificial Intelligence and Statistics (AISTATS), 2009.

As computer architectures move towards parallelism, we must build a new theoretical understanding of parallelism in machine learning. In this paper we focus on parallelizing message-passing inference algorithms in graphical models. We develop a theoretical understanding of the limitations of parallelism in belief propagation and bound the optimal achievable parallel performance on a certain class of graphical models. We demonstrate that the fully synchronous parallelization of belief propagation is highly inefficient. We provide a new parallel belief propagation algorithm which achieves optimal performance on a certain class of graphical models. Using two challenging real-world problems, we empirically evaluate the performance of our algorithm and find that it achieves near-linear performance improvements and outperforms alternative parallel belief propagation algorithms.

@inproceedings{ParallelSplash09,
 abstract = {As computer architectures move towards parallelism, we must build a new theoretical understanding of parallelism in machine learning. In this paper we focus on parallelizing message-passing inference algorithms in graphical models. We develop a theoretical understanding of the limitations of parallelism in belief propagation and bound the optimal achievable parallel performance on a certain class of graphical models. We demonstrate that the fully synchronous parallelization of belief propagation is highly inefficient. We provide a new parallel belief propagation algorithm which achieves optimal performance on a certain class of graphical models. Using two challenging real-world problems, we empirically evaluate the performance of our algorithm and find that it achieves near-linear performance improvements and outperforms alternative parallel belief propagation algorithms.},
 author = {Joseph E. Gonzalez and Yucheng Low and Carlos Guestrin},
 booktitle = {Artificial Intelligence and Statistics (AISTATS)},
 date-modified = {2020-08-02 11:27:35 -0700},
 keywords = {peerrev},
 month = {4},
 title = {Residual Splash for Optimally Parallelizing Belief Propagation},
 url = {http://proceedings.mlr.press/v5/gonzalez09a.html},
 year = {2009}
}
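
The scheduling pattern behind the Splash algorithm can be sketched generically: repeatedly pick the vertex with the largest residual, grow a bounded breadth-first tree (a "splash") around it, and update along the tree from the leaves to the root and back. The residual bookkeeping and the user-supplied update below are placeholders rather than the paper's belief propagation equations.

import heapq
from collections import deque

def grow_splash(root, neighbors, max_size):
    # Breadth-first tree of at most max_size vertices, rooted at the
    # highest-residual vertex.
    order, seen, frontier = [], {root}, deque([root])
    while frontier and len(order) < max_size:
        v = frontier.popleft()
        order.append(v)
        for u in neighbors[v]:
            if u not in seen:
                seen.add(u)
                frontier.append(u)
    return order

def residual_splash(neighbors, update_vertex, residuals, max_size=16, tol=1e-3):
    # update_vertex(v) recomputes v's outgoing messages and returns a dict
    # mapping affected vertices (typically v's neighbors) to their new residuals.
    heap = [(-r, v) for v, r in residuals.items() if r > tol]
    heapq.heapify(heap)
    while heap:
        _, root = heapq.heappop(heap)
        if residuals[root] <= tol:
            continue  # stale queue entry; this vertex was already resolved
        splash = grow_splash(root, neighbors, max_size)
        for v in splash[::-1] + splash[1:]:   # leaves-to-root, then root-to-leaves
            residuals[v] = 0.0
            for u, r in update_vertex(v).items():
                residuals[u] = max(residuals.get(u, 0.0), r)
                if r > tol:
                    heapq.heappush(heap, (-r, u))
    return residuals

Prioritizing by residual focuses computation where beliefs are still changing, which is what lets the parallel schedule avoid the wasted work of a fully synchronous sweep.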