Kristal Curtis

I recently completed my Ph.D. in Computer Science at UC Berkeley. I was advised by David Patterson and Armando Fox as a member of the Algorithms, Machines, and People (AMP) Lab.

My research interests include machine learning, distributed systems, and genomics. I am also interested in software engineering and data science.

In what follows, I highlight some of the projects I've worked on during my Ph.D. My CV and résumé are also available.


Though DNA sequencing has improved dramatically over the past decade, variant calling, which is the process of reconstructing a patient’s genome from the reads that the sequencers produce, remains a difficult problem, largely due to the genome’s redundant structure. SiRen is our algorithm for characterizing the genome’s structure in a way that makes sense from the perspective of the reads themselves. We use the term similar regions to refer to the areas of redundancy that we have identified. We then confirm that the similar regions are characterized by low variant calling accuracy. We show that the structure of the similar regions provides a platform for repairing alignment errors, thus leading to significantly improved variant calling accuracy.

Software is open source and available at

Kristal Curtis. “Leveraging Similar Regions to Improve Genome Data Processing.” Ph.D. Thesis. Technical Report No. UCB/EECS-2015-199, 2015.

Kristal Curtis, Ameet Talwalkar, Matei Zaharia, Armando Fox, and David Patterson. “SiRen: Leveraging Similar Regions for Efficient & Accurate Variant Calling.” Technical Report No. UCB/EECS-2015-159, 2015.


Variant calling is the essential problem of transforming the raw output of DNA sequencers into a summary of a patient's genome. Because variant calling is difficult, tools often disagree on many of their predictions, and current methods for evaluating accuracy and computational performance are ad hoc and incomplete. SMaSH is a benchmarking methodology consisting of synthetic and real datasets together with validation data that, via our validation scripts, can be used to evaluate variant calling algorithms.

See the SMaSH website to download datasets and evaluation scripts.

Ameet Talwalkar, Jesse Liptrap, Julie Newcomb, Christopher Hartl, Jonathan Terhorst, Kristal Curtis, Ma’ayan Bresler, Yun S. Song, Michael I. Jordan, and David Patterson. “SMaSH: a benchmarking toolkit for human genome variant calling.” Bioinformatics, Vol. 30, No. 19, pp. 2787–2795, 2014.


The Scalable Nucleotide Alignment Program (SNAP) is a novel and efficient alignment algorithm and software package that enables new applications in sequence analysis, including outbreak detection, sample quality control, and genome remapping. SNAP provides accuracy equivalent to the current state-of-the-art aligners in 1/2 to 1/30 the time. SNAP is well-suited to upcoming developments in sequencing technology, with improved accuracy and increasing speed on longer paired-end read lengths. It can align a typical paired-end human genome dataset in approximately four hours on a single commodity server, or at double that speed on longer read lengths.

See the SNAP website for more information.

SNAP is available open source at

Matei Zaharia, William J. Bolosky, Kristal Curtis, Armando Fox, David Patterson, Scott Shenker, Ion Stoica, Richard Karp, and Taylor Sittler. “Faster and More Accurate Sequence Alignment with SNAP.” ArXiv Technical Report 1111.5572v1, 2011.


Providers of large-scale, interactive web services obsess about performance. Their developers labor to write queries whose latency will meet the company’s service level objective (SLO). Since it is difficult to reason about latency for traditional relational database queries, today’s large-scale applications are increasingly moving to distributed key-value stores, which are attractive due to their incremental scalability and predictable performance. While key-value stores are less convenient than databases due to their narrow interface, the Performance-Insightful Query Language (PIQL) addresses this problem by allowing developers to express their queries declaratively and compiling the queries to key-value primitives. These developments provide an appealing setting for prediction. Using PIQL query operator models and query plans, we can predict queries’ 99th-percentile latency within 20% of the actual values.

PIQL is available open source at

Michael Armbrust, Kristal Curtis, Tim Kraska, Armando Fox, Michael J. Franklin, and David Patterson. “PIQL: Success-Tolerant Query Processing in the Cloud.” Proceedings of the VLDB Endowment, Vol. 5, No. 3, 2011.

Kristal Curtis. “Determining SLO Violations at Compile Time.” M.S. Thesis, UC Berkeley, 2010.

© 2015 Kristal Curtis