Aditya Parameswaran

I am an Associate Professor at the University of California, Berkeley, with a joint appointment at the I School and EECS. I co-direct the EPIC Data Lab and the California Police Records Access project. I co-founded Ponder, which was acquired by Snowflake in October 2023. I am part of the Data Systems & Foundations and Human-Computer Interaction groups, and I am affiliated with the Berkeley Institute of Data Science.

My research interests are broadly in building tools for simplifying data science at scale, i.e., empowering individuals and teams to leverage and make sense of their large datasets more easily, efficiently, and effectively.



We are always looking for postdocs, PhD, MS, and UG students or research/development staff to join our efforts! If you are a postdoc or staff applicant, feel free to email me directly with your CV and qualifications. If you are an aspiring PhD student, please apply to the EECS or I School PhD programs. If you are an MS or UG student, feel free to fill out this form: it is rare that we will work with UG/MS students outside UC Berkeley except in cases of unusually good fit.

Formal Biographical Sketch

Aditya Parameswaran is an Associate Professor in the School of Information (I School) and Electrical Engineering and Computer Sciences (EECS) at UC Berkeley. Aditya co-directs the EPIC Data Lab, a lab targeted at low/no-code data tooling with a special emphasis on social justice applications. Aditya also co-directs the California Police Records Access project, an initiative to build a first-of-its-kind state-wide use-of-force and misconduct database. Until its acquisition by Snowflake in October 2023, Aditya served as the President of Ponder, a company he co-founded with his students based on popular data science tools developed at Berkeley. Aditya is affiliated with the Berkeley Institute of Design, and the Berkeley Institute of Data Science — and is part of the Data Systems & Foundations and Human-Computer Interaction groups at Berkeley. Aditya develops human-centered tools for scalable data science — making it easy for end-users and teams to leverage and make sense of their large and complex datasets — by synthesizing techniques from data systems and human-computer interaction. His visualization and data exploration tools have been downloaded and used by millions of users in a variety of domains.

Click here for a longer bio.

News

  • May 5, 2024: Our paper on dataset search, led by Madelon, was accepted at HILDA! Preprint here.
  • May 1, 2024: New preprint up on document analytics with LLMs as part of our new ZenDB system, led by Yiming. Read about it here!
  • April 28, 2024: Shreya summarized her work on evaluating LLM pipelines, from SPADE to EvalGen, as part of an MLSys seminar. Listen here!
  • April 24, 2024: Devin gave a nice talk at CMU on our journey through dataframe land, from Berkeley to Ponder and now to Snowflake. Listen here!
  • April 22, 2024: Our work on building a UI for evaluating LLM pipelines was just released as a preprint.
  • April 1, 2024: Our demo on Motion was accepted to SIGMOD!
  • April 1, 2024: Welcome to Tarak Shah, who is joining us from the Human Rights Data Advocacy Group for work on the police records project.
  • March 1, 2024: Welcome to our newest postdoc, Sep Zeighami, joining us from USC!
  • February 28, 2024: Our work on LLM-based extraction of information from police misconduct PDFs is having real impact; check out this news story from Stockton.
  • February 14, 2024: Two of three organizers for the DEEM workshop this year work in our group, so look out for an exciting event!
  • January 16, 2024: New preprint on automatic assertion generation for LLM pipelines, led by Shreya!
  • January 13, 2024: We presented our work on prompt engineering-meets-crowdsourcing at CIDR'24!
  • November 28, 2023: Madelon Hulsebos joins us as a postdoc, with a BIDS Postdoctoral Fellowship!
  • November 8, 2023: Excited to announce a collaboration with LangChain, a leading LLM framework, to support automatic discovery of evaluation functions and assertions by analysis of prompt version histories. Blog post here.
  • October 24, 2023: Delighted to announce that our company Ponder was acquired by Snowflake, the leading cloud data warehouse vendor, to accelerate their data science offerings. It has been a wild ride, from inception in 2021, to a $7M seed fundraise by leading VC firms, to scaling the company up to 15 employees, to building a never-before-seen product running data science libraries in data warehouses, to acquisition. Posts from Snowflake, VentureBeat, and us.
  • August 30, 2023: We presented three full papers (one by Dixin, one by Shreya, and one jointly by Shreya and Stephen) and one demo paper (by Dixin) at VLDB.
  • August 29, 2023: Here's a blog post describing our VLDB paper on Transactional Panorama, led by Dixin Tang.
  • August 25, 2023: We are part of a new NSF-funded $3M AI training program called CRELS (Computational Research for Equity in the Legal System), with colleagues in social sciences and statistics.
  • August 15, 2023: Welcome to Yiming Lin, our newest postdoc, who comes to us from UC Irvine!
  • August 8, 2023: We released a new preprint on how one can leverage ideas from declarative crowdsourcing to improve accuracy and reduce cost when processing data with Large Language Models. It was nice to revisit work from a decade ago!
  • June 28, 2023: The CA state budget includes $6.87M in funding to support the Police Records Access project, which we are part of, along with our partners at the Stanford and Berkeley Schools of Journalism.
  • June 27, 2023: Nice to see a shoutout to our work on police roster data cleaning and extraction in a local journalism article on San Francisco law enforcement settlements. Congrats Aditya M and Diana Q!
  • June 18, 2023: I enjoyed giving a keynote at the DEEM workshop at SIGMOD, with my talk titled "Enhance, don't Replace: a Recipe for Success in Data Science Tooling".
  • May 31, 2023: Dixin Tang, a postdoc working with me, will be an assistant professor at University of Texas, Austin. Go Dixin!
  • May 31, 2023: Dixin and Fanchao's visualization tool for spreadsheet computation networks was accepted at VLDB'23.
  • May 24, 2023: I enjoyed appearing on The Data Stack Show, talking about why decoupling APIs from execution engines is important.
  • May 18, 2023: Our work with journalists and public defenders as part of the EPIC lab was highlighted in the LA Times news article about the new College of Computing, Data Science, and Society.
  • May 11, 2023: I enjoyed serving on a panel on AI-meets-databases at the NorCal DB day.
  • May 1, 2023: Our paper on GATE, an efficient, automated, and precise system for ML-centric data validation, is up on arXiv.
  • March 27, 2023: Our work with journalists as part of the EPIC data lab is having real impact: we helped journalists identify that responses to a FOIA request were inadequate, as detailed in this news story.
  • March 17, 2023: Modin now supports NumPy in addition to Pandas; we can push down linear algebra operations to distributed computing engines.
  • March 6, 2023: I wrote a rejoinder to the "Big Data is Dead" blogpost: read it here.
  • March 4, 2023: Delighted to be inducted as a Kavli Fellow by the National Academy of Sciences at the US Frontiers of Science Symposium in Irvine.
  • February 9, 2023: Enjoyed speaking at the Data Dividend, targeted at "harnessing the power of people, processes and technology to unlock value from data", an event hosted by the Economist and by IBM.
  • February 10, 2023: Dixin's paper on compaction of spreadsheet graphs was accepted at ICDE!
  • January 13, 2023: Dixin's paper on transactional panorama — bridging transactions and interactive dashboards — was accepted at VLDB!
  • January 1, 2023: What a year for open-source efforts! Modin reached 5M downloads, Lux reached 500K downloads, and Nbsafety reached 150K downloads.
  • December 1, 2022: Modin now ships as part of the AWS SDK for Pandas as well as AWS Glue — showcasing how useful Modin is for data wrangling and ETL with the pandas API at scale.
  • November 29, 2022: Congrats to my former student, Doris Lee, on being named to the Forbes 30 Under 30 list in Enterprise Technology!
        Click here for more news.

Synergistic Activities

I serve on the steering committees of HILDA (Human-in-the-loop Data Analytics) at SIGMOD and DSIA (Data Systems for Interactive Analysis) at VIS. Lots of excitement around this nascent area at the intersection of databases, data mining, and visualization/HCI - join us! I also served as the Faculty Equity Advisor at the School of Information for two terms in 2023 and 2021.

In the recent past, I was a co-chair of Workshops for SIGMOD 2020 and 2021. I served as the US Sponsor Chair for VLDB 2021; I also served as an Area/Associate Chair for HCOMP 2020, VLDB 2020, and SIGMOD 2020, and as a Program Committee member for VLDB Demo 2019 and HILDA 2019 (phew!). I've served on the program committees of VLDB, KDD, SIGMOD, WSDM, WWW, SOCC, HCOMP, ICDE, and EDBT, many of them multiple times. I am serving on the program committee for VLDB 2023-2024.

Selected Projects

Lux: An always-on visualization recommendation system

Lux is a tool for effortlessly visualizing insights from very large data sets in dataframe workflows. Lux builds on half a decade of work on visualization recommendation systems.

Project page here.


Modin: A Scalable Dataframe System

Modin applies database and distributed systems ideas to help run dataframe workloads faster, with over 5M open-source downloads.

Project page here.


NBTools: Better Computational Notebooks

NBsafety and NBslicer make it easy for data scientists to write correct, reproducible code in computational notebooks.

Project page here.


Helix: An Accelerated Human-in-the-loop Machine Learning System

Helix accelerates the iterative development of machine learning pipelines with a human developer "in the loop" via intelligent assistance and reuse.

Project page here.


DataSpread: A Spreadsheet-Database Hybrid

DataSpread is a tool that marries the best of databases and spreadsheets.

Project page here.


Orpheus: Relational Dataset Version Management at Scale

DataHub (or "GitHub for Data") is a system that enables collaborative data science by keeping track of large numbers of versions and their dependencies compactly, and allowing users to progressively clean, integrate and visualize their datasets. OrpheusDB is a component of DataHub focused on using a relational database for versioning.

Project page here.


Populace: A Suite of Crowd-Powered Algorithms

Our work has developed a number of algorithms for gathering, processing, and understanding data obtained from humans (or crowds), while minimizing cost, latency, and error. Since 2014, our focus has been on optimizing open-ended crowdsourcing: an understudied and challenging class of problems.

Project page here.