Kay Ousterhout

keo (at) cs (dot) berkeley (dot) edu · 419 Soda Hall


I am a PhD student in the UC Berkeley NetSys Lab, advised by Sylvia Ratnasamy. I am broadly interested in networking, computer systems, and cloud computing. My PhD thesis work focuses on understanding performance in data analytics frameworks.

I am also a committer and PMC member for Apache Spark. My work on Spark has focused on improving scheduler performance, and I currently help maintain and review pull requests for the scheduler code.

I am currently supported by a Google PhD Fellowship. In the past, I was supported by a Hertz Foundation Graduate Fellowship, a UC Berkeley Chancellor's Fellowship, and a Google Anita Borg Memorial Scholarship.

I graduated from Princeton University in 2011 with a B.S.E. in Computer Science. At Princeton, I was advised by Jennifer Rexford and Michael J. Freedman.

Thesis Work

The first component of my thesis work focused on characterizing the performance of large-scale data analytics frameworks like Spark. As part of that project, I added instrumentation to Spark to measure how much time is spent doing network and disk I/O. Most of that instrumentation is now part of Spark, and can be visualized in the Spark UI by clicking the "Event Timeline" link on the stage detail page. More information about that project is available here; that page includes links to some detailed traces we collected.

One takeaway from my work measuring performance in current systems is that today's systems make it difficult to reason about performance. In Spark, for example, pervasive pipelining and parallelism make it difficult (even with extensive instrumentation and metrics) for users to model performance and understand how changing the software or hardware configuration would impact performance. Today's users have many choices in how to configure their workloads (e.g., what type of EC2 instance should they use to run their job?); without the ability to reason about performance, they cannot configure for the best performance.

My current research focuses on a new system, Monotasks, that we've designed with the singular goal of making it easy for users to reason about performance. Monotasks is a replacement for the execution layer of Apache Spark, and is fully API-compatible with Spark. For more information about Monotasks, refer to my talk at Spark Summit 2016 (linked below).


Monotasks: Architecting for Performance Clarity in Data Analytics Frameworks
Kay Ousterhout, Christopher Canel, Sylvia Ratnasamy, Scott Shenker
SOSP 2017 (to appear)

Drizzle: Fast and Adaptable Stream Processing at Scale
Shivaram Venkataraman, Aurojit Panda, Kay Ousterhout, Michael Armbrust, Ali Ghodsi, Michael J. Franklin, Benjamin Recht, Ion Stoica
SOSP 2017 (to appear)

Performance clarity as a first-class design principle
Kay Ousterhout, Christopher Canel, Max Wolffe, Sylvia Ratnasamy, Scott Shenker
HotOS 2017

Making Sense of Performance in Data Analytics Frameworks
Kay Ousterhout, Ryan Rasti, Sylvia Ratnasamy, Scott Shenker, Byung-Gon Chun
NSDI 2015

Sparrow: Distributed, Low Latency Scheduling
Kay Ousterhout, Patrick Wendell, Matei Zaharia, Ion Stoica
SOSP 2013

The Case for Tiny Tasks in Compute Clusters
Kay Ousterhout, Aurojit Panda, Joshua Rosen, Shivaram Venkataraman, Reynold Xin, Sylvia Ratnasamy, Scott Shenker, Ion Stoica
HotOS 2013


Re-Architecting Apache Spark for Performance Understandability at Spark Summit 2016 (pdf, pptx, video)

Making Sense of Spark Performance at Spark Summit 2015 (pdf, pptx, video)

Making Sense of Performance in Data Analytics Frameworks at NSDI 2015 (pdf, pptx, video)

Making Sense of Spark Performance, O'Reilly Webcast (pdf, pptx, webcast)

Next-Generation Spark Scheduling with Sparrow at Spark Summit 2013 (pdf, pptx, video)

Sparrow: Distributed, Low Latency Scheduling at SOSP 2013 (pdf, pptx, video)

The Case for Tiny Tasks in Compute Clusters at HotOS 2013 (pdf, pptx)


In the Spring 2016 semester, I was a TA for CS61B. As part of my work as a TA, I wrote the Editor Project. I also maintained a list of weekly programming tips, which is available here.

I also developed the course materials for the "Scaling Up Analytics" portion of the Spring 2014 Introduction to Data Science course, including a lab on how Apache Spark works.