Aditya Parameswaran

I am an Associate Professor in EECS at the University of California, Berkeley. I co-direct the EPIC Data Lab and the Police Records Access project. I co-founded Ponder, which was acquired by Snowflake in October 2023.

I am broadly interested in simplifying data science at scale, i.e., empowering individuals and teams to leverage and make sense of their large datasets more easily, efficiently, and effectively.

Of late, my team and I have been exploring LLM-powered data tooling: what are new approaches for data extraction, data transformation, and insight discovery, now with LLMs as a key ingredient?

We are always looking for postdocs, PhD, MS, and UG students or research/development staff to join our efforts! If you are a postdoc or staff applicant, feel free to email me directly with your CV and qualifications. If you are an aspiring PhD student, please apply to our PhD program. If you are an MS or UG student, feel free to fill out this form: it is rare that we will work with UG/MS students outside UC Berkeley except in cases of unusually good fit.

Formal Biographical Sketch

Aditya Parameswaran is an Associate Professor in Electrical Engineering and Computer Sciences (EECS) at UC Berkeley. Aditya co-directs the EPIC Data Lab, a lab targeted at low/no-code data tooling powered by LLMs. Aditya also co-directs the Police Records Access project, an initiative to build a first-of-its-kind state-wide police use-of-force and misconduct database. Until its acquisition by Snowflake in October 2023, Aditya served as the President of Ponder, a company he co-founded with his students based on popular data science tools developed at Berkeley. Aditya develops next-generation data tooling — making it easy for end-users and teams to leverage and make sense of their large and complex datasets — by synthesizing techniques from data systems, human-computer interaction, and artificial intelligence. Multiple tools from his group (DocETL, IPyFlow, Modin, Lux) have been widely adopted by end-users, with tens of millions of downloads.

Aditya spent a year as a PostDoc at MIT CSAIL following his PhD at Stanford. Aditya is affiliated with the Berkeley Institute of Design and the Berkeley Institute of Data Science, and is part of the Data Systems & Foundations and Human-Computer Interaction groups at Berkeley. Aditya has received various awards for his work, including being named a Kavli Fellow by the National Academy of Sciences (2023), the IIT Bombay Young Alumnus Achiever award (2022), the SIGMOD 2020 Best Paper Award (2020), the Alfred P. Sloan Research Fellowship (2020), the VLDB Early Career Research Contributions Award (2019), the Army Research Office Young Investigator Program Award (2018), the NSF CAREER Award (2017), the TCDE Rising Star Award (2017), the Dean's Excellence in Research Award (2018) and the C. W. Gear Junior Faculty Award (2017) from the University of Illinois, multiple "best" doctoral dissertation awards (from SIGMOD, SIGKDD, and Stanford in 2014), "Excellent" teacher awards from Illinois (2015, 2017), a Google faculty award (2015) and focused research award (2017), the Key Scientific Challenges award from Yahoo!, seven best-of-conference citations (VLDB 2010, KDD 2012, ICDE 2014, ICDE 2016, AISTATS 2017, VLDB 2017, VLDB 2023), a best demo award (ICDE 2019) and a best demo honorable mention (SIGMOD 2017).

In the last five years, he has served as the Space Coordinator for Systems (200+ people) for EECS, Faculty Equity Advisor at the I School and EECS, Area Editor for VLDB and SIGMOD, Associate Editor for VLDB Journal, the Workshops Co-Chair at SIGMOD, the US Sponsorship Chair at VLDB, and on the steering committee of the Data AI Systems Workshop @ICDE, the HILDA (Human-in-the-loop Data Analytics) Workshop @ SIGMOD and the DSIA (Data Systems for Interactive Analysis) Workshop @ VIS. He has also served on program committees of various database, data mining, web, systems, visualization, and crowdsourcing conferences, and as an associate editor of SIGMOD Record (2014-2019). His research group has been supported with funding from the State of California, NSF (NRT - now terminated, Future of Work, CAREER, Medium, AITF, BigData), the NIH (2X), the Army Research Office, the Sloan Foundation, the Toyota Research Institute, Microsoft, Sigma Computing, Adobe, the Siebel Energy Institute, Facebook, and Google.

News

April 18, 2025: We released a preprint on a new interface, RAGGY, for debugging RAG pipelines.
April 15, 2025: Grad school admission decisions: undergrad alums Ruiying and Reya were admitted to a number of top grad programs! Reya will be heading to Columbia, while Ruiying will be returning to Berkeley as a grad student!
April 15, 2025: We contributed to the once-every-five-years report on the future of database research, take a look!
April 7, 2025: Chroma's Generative Benchmarking approach uses Shreya et al.'s EvalGen work.
March 1, 2025: Our paper, led by Reya and Shreya, on a benchmark of assertions for LLM pipelines was accepted to NAACL as an oral talk!
February 24, 2025: Turns out, with LLMs, your data systems can now become more proactive. Read about our new vision paper here.
January 16, 2025: I co-wrote a blog post on challenges in data exploration and visual analytics that appeared on the SIGMOD blog.
January 15, 2025: Our preprint on extracting information from templatized documents, led by Yiming, is up! The results are astounding: 20-30% higher precision and recall than state-of-the-art, at a fraction of the cost and time.
January 13, 2025: We've been building an IDE for authoring semantic data processing pipelines in DocETL, called DocWrangler. You can play with it in our playground. Led by Shreya and Bhavya. Blogpost here.
January 4, 2025: Call for papers at the first Data-AI Systems workshop is out.
November 12, 2024: Shreya presented our paper on the three V's of MLOps at CSCW. Paper here.
November 1, 2024: We released TARGET, a table retrieval benchmark, led by Madelon. See here.
October 22, 2024: I appeared on a podcast called Disseminate and spoke about our CIDR'11 paper, one of the most influential database papers from that year according to a ranking from Ryan Marcus. Listen here.
October 16, 2024: Our DocETL preprint is out! Lots of people on GitHub and Discord getting value from it already - spanning domains ranging from forensic analysis to climate science to medical data analysis. Over 1000 GitHub stars already. Excited to have this out there.
September 20, 2024: Sep's NUDGE framework is now part of LlamaIndex.
September 20, 2024: Rachel's paper on RequestAtlas was accepted at CSCW'2025!
September 16, 2024: Excited to see pandas-to-SQL translation that we pioneered at Ponder now generally available to all Snowflake customers. Congrats Ponder team!
September 5, 2024: Sep's paper on lightweight fine-tuning of retrieval for RAG is now on Arxiv. Blog post here. This is a no-brainer for folks wanting to improve retrieval without paying the cost of expensive fine-tuning!
September 5, 2024: We restarted our blog! See here.
September 4, 2024: Congrats to incoming PhD student Bhavya Chopra for a VL/HCC best paper (her second in two years!)
September 1, 2024: Congrats to alum Madelon Hulsebos for starting as faculty at CWI, a leading research institution in Amsterdam!
August 29, 2024: Shreya presented her work on EvalGen and SPADE at Princeton.
August 15, 2024: Welcome to incoming PhD students Bhavya Chopra and HC Moore! And goodbyes to Prof. Eugene Wu from Columbia who was visiting us this past year - we'll miss you Eugene!
July 23, 2024: Our work on translating dataframes to databases was awarded two patents, see here and here.
July 15, 2024: Our work on transactional panorama won a "best of VLDB 2023" award; congrats Dixin!
July 1, 2024: Shreya's EvalGen paper was accepted at UIST'24!
June 26, 2024: Shreya's work on evaluation assistants was deployed by LangChain! Read more here. She also wrote a piece with other LLM practitioners on best practices with LLMs here.
June 1, 2024: Our papers on SMASH (a string alignment algorithm) and SPADE (an LLM-powered assertion generation system) were accepted at VLDB! Five-letter acronyms FTW!
May 5, 2024: Our paper on dataset search, led by Madelon, was accepted at HILDA! Preprint here.
May 1, 2024: New preprint up on document analytics with LLMs as part of our new ZenDB system, led by Yiming. Read about it here!
April 28, 2024: Shreya summarized her work on evaluating LLM pipelines, from SPADE to EvalGen, as part of an MLSys seminar. Listen here!

April 24, 2024: Devin gave a nice talk at CMU on our journey through dataframe land, from Berkeley to Ponder and now to Snowflake. Listen here!
April 22, 2024: Our work on building a UI for evaluating LLM pipelines was just released as a preprint.
April 1, 2024: Our demo on Motion was accepted to SIGMOD!
April 1, 2024: Welcome to Tarak Shah, who is joining us from the Human Rights Data Analysis Group for work on the police records project.
March 1, 2024: Welcome to our newest postdoc, Sep Zeighami, joining us from USC!
February 28, 2024: Our work on LLM-based extraction of information from police misconduct PDFs is having real impact; check out this news story from Stockton.
February 14, 2024: Two of three organizers for the DEEM workshop this year work in our group, so look out for an exciting event!
January 16, 2024: New preprint on automatic assertion generation for LLM pipelines, led by Shreya!
January 13, 2024: We presented our work on prompt engineering-meets-crowdsourcing at CIDR'24!
November 28, 2023: Madelon Hulsebos joins us as a postdoc, with a BIDS Postdoctoral Fellowship!
November 8, 2023: Excited to announce a collaboration with LangChain, a leading LLM framework, to support automatic discovery of evaluation functions and assertions by analysis of prompt version histories. Blog post here.
October 24, 2023: Delighted to announce that our company Ponder was acquired by Snowflake, the leading cloud data warehouse vendor, to accelerate their data science offerings. It has been a wild ride, from inception in 2021, to a $7M seed fundraise by leading VC firms, to scaling the company up to 15 employees, to building a never-before-seen product running data science libraries in data warehouses, to acquisition. Posts from Snowflake, VentureBeat, and us.
August 30, 2023: We presented three full papers (one by Dixin, one by Shreya, and one jointly by Shreya and Stephen) and one demo paper (by Dixin) at VLDB.
August 29, 2023: Here's a blog post describing our VLDB paper on Transactional Panorama, led by Dixin Tang.
August 25, 2023: We are part of a new NSF-funded $3M AI training program called CRELS (Computational Research for Equity in the Legal System), with colleagues in social sciences and statistics.
August 15, 2023: Welcome to Yiming Lin, our newest postdoc, who comes to us from UC Irvine!
August 8, 2023: We released a new preprint on how one can leverage ideas from declarative crowdsourcing to improve accuracy and reduce cost when processing data with Large Language Models. It was nice to revisit work from a decade ago!
June 28, 2023: The CA state budget includes $6.87M in funding to support the Police Records Access project, which we are part of, along with our partners at the Stanford and Berkeley Schools of Journalism.
June 27, 2023: Nice to see a shoutout to our work on police roster data cleaning and extraction from a local journalism article on San Francisco law enforcement settlements. Congrats Aditya M and Diana Q!
June 18, 2023: I enjoyed giving a keynote at the DEEM workshop at SIGMOD, with my talk titled "Enhance, don't Replace: a Recipe for Success in Data Science Tooling".
May 31, 2023: Dixin Tang, a postdoc working with me, will be an assistant professor at University of Texas, Austin. Go Dixin!
May 31, 2023: Dixin and Fanchao's visualization tool for spreadsheet computation networks was accepted at VLDB'23.
May 24, 2023: I enjoyed appearing on the data stack show, talking about how decoupling APIs from execution engines is important.
May 18, 2023: Our work with journalists and public defenders as part of the EPIC lab was highlighted in the LA Times news article about the new college of computing, data science, and society.
May 11, 2023: I enjoyed serving on a panel on AI-meets-databases at the NorCal DB day.
May 1, 2023: Our paper on GATE, an efficient, automated, and precise system for ML-centric data validation is up on arxiv.
March 27, 2023: Our work with journalists as part of the EPIC data lab is having real impact: we helped journalists identify that responses to a FOIA request were inadequate, as detailed in this news story.
March 17, 2023: Modin now supports NumPy in addition to Pandas; we can push down linear algebra operations to distributed computing engines.
March 6, 2023: I wrote a rejoinder to the "Big Data is Dead" blogpost: read it here.
March 4, 2023: Delighted to be inducted as a Kavli Fellow by the National Academy of Sciences at the US Frontiers of Science Symposium in Irvine.
February 9, 2023: Enjoyed speaking at the Data Dividend, targeted at "harnessing the power of people, processes and technology to unlock value from data", an event hosted by the Economist and by IBM.
February 10, 2023: Dixin's paper on compaction of spreadsheet graphs was accepted at ICDE!
January 13, 2023: Dixin's paper on transactional panorama — bridging transactions and interactive dashboards — was accepted at VLDB!
January 1, 2023: What a year for open-source efforts! Modin reached 5M downloads, Lux reached 500K downloads, and Nbsafety reached 150K downloads.

December 1, 2022: Modin now ships as part of the AWS SDK for Pandas as well as AWS Glue — showcasing how useful Modin is for data wrangling and ETL with the pandas API at scale.
November 29, 2022: Congrats to my former student, Doris Lee, on being named to the Forbes 30 Under 30 list in Enterprise Technology!
September 16, 2022: Shreya and Rolando's paper on lessons from interviewing ML Engineers on MLOps practices and challenges has been posted on arxiv.
September 13, 2022: We formally unveiled our new lab, the EPIC Data Lab. Here is an article from CDSS about the lab. We are thankful to our current sponsors, Microsoft, Google, and Sigma Computing, along with the National Science Foundation.
September 1, 2022: Our paper on NBSlicer (led by Shreya and Stephen) and a vision paper on MLOps (led by Shreya) were both accepted at VLDB'23!
August 18, 2022: Here's our fourth blog post comparing Pandas to SQL.
July 26, 2022: Here's our third blog post comparing Pandas to SQL.
July 14, 2022: Here's our second blog post comparing Pandas to SQL.
June 30, 2022: Our CACM paper on ShapeSearch, the invited extended version of our SIGMOD Best Paper Award paper, is out here.
June 28, 2022: Here's our first blog post comparing Pandas to SQL.
June 23, 2022: Congrats to Drs. Doris Lee, Devin Petersohn, and Doris Xin on their graduation!
June 15, 2022: New paper from Joe and me on our new data engineering class at Cal!
June 1, 2022: New paper with CMU folks on extending Lux to remember analysis history in a notebook, and use it to bias recommendations towards recent operations or past interest - to appear at Eurovis.
May 20, 2022: We released a position paper on the vision behind the Sky lab.
May 19, 2022: Startups all around! My former student, Doris Xin, now serves as the CEO of a startup, Linea.
April 30, 2022: Enjoyed giving a keynote at the SIAM Data Mining Conference.
March 9, 2022: Excited to announce the launch of Ponder, a company I co-founded with my students. Ponder raised $7M from top-tier VCs to develop scalable, enterprise-ready data science tools, leveraging our success with Modin and Lux. I am serving as the President of Ponder, while staying on as faculty at Berkeley.
February 15, 2022: Modin hit 1.5M downloads this week, with 200K downloads in the last month - grateful to see the groundswell of adoption.
February 2, 2022: Delighted to receive the Young Alumni Achiever's award from my alma mater, IIT Bombay.
January 6, 2022: I presented a talk on our mission to develop enterprise-ready pandas at CIDR, see recording here.
November 18, 2021: Doris presented Lux at the Data Umbrella; see recording here.
November 11, 2021: Shreya Shankar presented her work on MLTrace at the Toronto ML Society as well as at Facebook; recording here.
October 26, 2021: Stephen Macke was interviewed about NBSafety in the Software Engineering Daily Podcast.
October 28, 2021: Doris Lee presented Lux at PyData, while Modin was featured on the Anaconda blog.
October 1, 2021: Our papers on Lux and Modin were accepted to VLDB 2022!
September 15, 2021: Tenure!
September 7, 2021: We received a $2M NSF grant to kickstart the EPIC Data Lab, short for Effective Programming, Interaction, and Computation with Data. Articles on EPIC here and here. We're working on the messy data challenges in criminal justice, along with public defenders via the NACDL (National Association of Criminal Defense Lawyers) and journalists via the Big Local News/KQED. We're part of a consortium called CLEAN: Community Law Enforcement Accountability Network.
September 1, 2021: Delighted to welcome Nithin Chalapathi and Shreya Shankar as new EECS PhD students!
August 13, 2021: Doris Lee and Devin Petersohn wrap up their dissertations on visualization recommendation and dataframe systems respectively. Congratulations Dr. Lee and Dr. Petersohn! Exciting times ahead!!
August 9, 2021: Devin Petersohn gives his dissertation talk on dataframe systems. Devin's work has laid the groundwork for how to reason about and optimize dataframe computation. Congrats Devin!
August 3, 2021: Stephen Macke wraps up his dissertation! Stephen's work on minimizing error while ensuring interactivity in a range of settings has been a joy to be a part of.
July 26, 2021: Tana Wattanawaroon defends his dissertation on supporting efficient computation in spreadsheets, titled "Generalizing Spreadsheet Computation for Evolving Spreadsheets at Scale". Congratulations Tana!
July 12, 2021: Our new PhD student, Shreya Shankar, is off to the races with MLTrace, a lightweight approach for instrumenting ML code to allow for end-to-end debugging, reproducibility, and introspection. Super exciting stuff; watch this space!
July 1, 2021: Our scalable dataframe system, Modin, has over 6000 github stars and over 1M downloads at this point. We're excited that there is so much organic traction and interest!
May 14, 2021: Our technical report on Lux is out! Lux has users across a range of industries at this point, including insurance, retail, and education, and thousands of github stars. Our paper introduces our always-on visualization framework, our lightweight intent language, and how we made Lux interactive when operating on large dataframes.
Apr 19, 2021: New Lux release; v0.3.0 supports a relational database backend, geovis, Jupyter lab, and matplotlib instead of altair, and more! Dashboard export (to streamlit, data pane, HTML, etc.) forthcoming. More here.
Apr 16, 2021: Doris Xin presents her dissertation talk on "Usable and Efficient Systems for Machine Learning"! Congratulations Dr. Xin! Doris is off to be a CEO at a new startup!
Apr 13, 2021: Doris Lee's paper in collaboration with folks at Tableau Research (Vidya Setlur, Melanie Tory) on developing a taxonomy for visualization recommendation was accepted at VIS 2021 via TVCG!
March 26, 2021: Doris Xin's paper in collaboration with Google folks on understanding production machine learning pipelines in TFX (along with my longtime collaborator Alkis Polyzotis and Hui Miao) was accepted as an industry paper at SIGMOD'21!
March 15, 2021: Our paper on leveraging think-time for opportunistic dataframe query evaluation was published in the IEEE Data Engineering Bulletin and showcased at ML and Data Projects to know.
February 15, 2021: Thrilled to see continued Lux adoption and interest. We had another industry blog post, yet another one, and hit over 700 stars on github.
January 15, 2021: Stephen Macke won the CIDR gong show for presenting his take on the next generation of notebooks! Devin Petersohn also presented his work on scalable dataframes.
January 10, 2021: Dual VLDB'21 accepts! Our paper on a general-purpose spreadsheet exploration tool, enabling zoom in/out, called NOAH, led by Sajjadur Rahman, was one. Our paper on NBSafety, a Jupyter kernel for safe notebook interactions, led by Stephen Macke, was another.
January 6, 2021: A virtual welcome to our new postdoc, Dixin Tang, who is joining us from U Chicago having worked with Aaron Elmore and Mike Franklin.
January 1, 2021: I took on the role of the Faculty Equity Advisor at the School of Information. See more here.
December 13, 2020: Our paper studying industry users of AutoML and what next-gen AutoML tools should look like was accepted at CHI'21, led by Doris Xin, Eva Wu, and Doris Lee.
December 9, 2020: Doris Lee passed her qualifying exam!
November 20, 2020: Much Lux love coming from industry. This LinkedIn post has nearly 8000 shares. This Medium article was written by another industry person to introduce Lux to the general public. Lux was also listed as a "Project to Know" here.
November 10, 2020: Our quest for safer notebooks continues with NBSafety, led by Stephen Macke! NBSafety automatically tracks lineage of variables in the notebook, providing suggestions to avoid executions of stale cells that could lead to incorrect or non-reproducible behavior. Lots of heavyweight program analysis techniques leading to a delightful, unobtrusive user experience. NBSafety was first presented at PLATEAU 2020, a PL-HCI workshop.
November 4, 2020: Congratulations to Doris Xin for passing her qualifying exam!
November 3, 2020: Congratulations to Stephen Macke for defending his thesis!
October 23, 2020: Enjoyed writing this perspective piece on the challenges and opportunities from working with domain experts on building interactive visualization tools, appearing at Patterns.
October 10, 2020: We've poured all of our experience building visualization recommendation tools over the years into Lux, led by Doris Lee. Lux provides in-situ visualization recommendations within notebooks as an add-on to dataframes instead of the standard tabular dataframe view. Doris demonstrated Lux at JupyterCon 2020.
October 7, 2020: Devin Petersohn wrote about how Modin helps data scientists be more productive here.
October 9, 2020: Congratulations to Tana Wattanawaroon for passing his preliminary exam!
October 3, 2020: Yay! Stephen Macke's paper on tighter confidence intervals for approximate query processing was accepted to ICDE'21!
September 14, 2020: Excited to have our paper on visualization recommendation for genomics out in the Patterns journal from Cell Press, led by Silu Huang (now at MSR) and Charles Blatti. Paper here. Our work on this project started as part of the NIH BD2K center a while back.
September 1, 2020: I presented a two-part keynote at the VLDB PhD workshop. One of the parts was on my experiences dealing with rejection; check out the recording here.
July 15, 2020: Our vision paper on scalable dataframe systems has been accepted at VLDB'20.
June 16, 2020: Our paper on ShapeSearch won the SIGMOD 2020 Best paper award, one of two awards out of over 450 submissions. It's amazing to see SIGMOD appreciate non-traditional usability-oriented work. Congratulations Tarique! Here's an (overly generous) article on the award.
June 10, 2020: We have open-sourced our spreadsheet benchmark, sheetperf.
June 9, 2020: I appeared on the Software Engineering Daily podcast to discuss Human-in-the-loop Data Analytics.
May 26, 2020: My former student Silu Huang, now a senior researcher at Microsoft Research, was awarded the Jim Gray dissertation award honorable mention, given to the second ranked dissertation in data management. Congratulations, Silu!
May 19, 2020: Slides and other materials from my Spring class on data engineering are available here. I squished the entirety of a traditional database class into the first half of the semester, focusing on user-facing aspects. I then covered non-traditional topics: json/doc stores, IR, spreadsheets, data frames, OLAP/vis, col. stores & compression, parallel proc., streaming/sketching, security & privacy, graph proc., contrasting with the relational approach. The key emphasis was on principles underlying data systems that data scientists may consider using, and how to pick between them for a given situation.
May 16, 2020: Congratulations to my Ph.D. student graduates! Tarique Siddiqui is off to be a Senior Researcher at Microsoft Research in the Databases group. Sajjadur Rahman is off to be a Researcher at Megagon Labs. Liqi Xu is off to be a Research Scientist at Facebook.
May 16, 2020: Congratulations to my M.S. student graduates! Angela Lee is off to Google, while Jaewoo Kim is off to AWS.
May 4, 2020: We released another version of our covidvis tool. We also released a manually gathered intervention dataset to go with the tool. Grateful to receive positive feedback for our efforts.
May 1, 2020: Two papers accepted to HILDA this year: Pingjing et al. led a paper on results from the first lab-based user evaluation across a range of analytical tasks on spreadsheets, identifying roadblocks and opportunities. Doris et al. led a paper on understanding ML development patterns from surveying a large collection of real-world ML traces.
April 15, 2020: Doris Lee participated in the CHI'20 doctoral consortium!
April 10, 2020: We released the first version of our covidvis tool to help make sense of the impact of interventions on the progress of COVID-19. Collaborators include epidemiologists, statisticians, and public health folks.
April 1, 2020: SIGMOD paper acceptances: our paper on benchmarking spreadsheets, led by Sajjadur Rahman, and our paper on ShapeSearch, a multi-modality interface for querying for visual patterns led by Tarique Siddiqui.
February 12, 2020: Excited to be named one of the Alfred P. Sloan research fellows in Computer Science this year! Articles here, here, and here.
February 11, 2020: Preprint on Genvisage is live. Genvisage is a tool for explaining the results of genomics experiments; done in collaboration with Saurabh Sinha and Charles Blatti, led by Silu Huang.
January 28, 2020: Congrats Doris Lee for winning a Facebook fellowship! She's one of 36 fellows from over 1800 applicants. Article here.
January 26, 2020: A photo tribute to Hector Garcia-Molina, my advisor, here, who passed away earlier this year, assembled by his Ph.D. students.
January 24, 2020: The SIGMOD 2020 workshops list is live. Great list of upcoming workshops!
January 7, 2020: The first of many papers on the Modin project is live! ArXiV link here. Modin is aiming to make dataframes more scalable by applying database techniques, and is currently being used in a number of companies, with dozens of collaborators and lots of interest on github. Led by Devin Petersohn.
December 6, 2019: The Morning Paper, a popular blog that covers academic papers, covered our paper on spreadsheet benchmarking. Thanks Adrian!
November 13, 2019: A nice article on the new Data Systems & Foundations group at UC Berkeley, from the Division of Data Science and Information. Excited for what lies ahead!
September 27, 2019: A generous article on the VLDB Early Career Award, from the School of Information. Here's the paper I had the privilege of writing as part of the award.
September 9, 2019: Double Doris Xin whammy! Helix was mentioned as a "project to know" in a blog post from Amplify Partners. Thanks Sarah! And Doris was selected as a Heidelberg Laureate. Congratulations!
August 26, 2019: Grateful to receive the VLDB Early Career Research Contributions Award for "developing tools for large-scale data exploration, targeting non-programmers." The award is given for research impact through a specific technical contribution of high significance since completing the Ph.D. (for up to 8 years post Ph.D.)
August 22, 2019: Please send us your most compelling SIGMOD workshop ideas for SIGMOD2020! Deadline September 27. Here's the call for proposals.
July 1, 2019: *BIG NEWS!* I moved to UC Berkeley, with a faculty appointment at the School of Information and the EECS Department; you can find articles about my appointment in EECS and the I School. Berkeley has been making some exciting moves in the broad data and information space, including a new Division of Data Science and Information, a very popular Masters in Information and Data Science program, and a new and increasingly popular data science major. Looking forward to being a part of this journey, and to moving back to the Bay Area! Illinois has been a wonderful home for the past 5-odd years with terrific students, a collegial research environment, and Midwestern charm (plus baby goats!); I'm going to miss my colleagues, the university, and the town terribly.
June 25, 2019: Our new paper describing our design study with Zenvisage was accepted at VAST'19 at VIS. Congrats Doris Lee!
June 20, 2019: Our vision paper on the data management and HCI challenges underlying AutoML was published in the IEEE Data Engineering Bulletin.
June 20, 2019: Silu Huang defended her thesis on versioning-meets-databases. Congrats Silu!
May 1, 2019: Our proposal received a runner-up mention for the Facebook Probability and Programming Research Award.
April 11, 2019: Congratulations to the DataSpread team for a best demo award at ICDE 2019! The best demo awards went to two out of 24 demonstrations.
March 20, 2019: Silu Huang accepted her offer to be a researcher at the DMX group at Microsoft Research! I interned in the DMX group back in 2010 and it's a fantastic place to work.
March 15, 2019: DataSpread's asynch computation framework was accepted at SIGMOD'19. Congrats to Mangesh, Tana et al.!
March 1, 2019: Yihan Gao defended his thesis on data compression and data extraction, titled "Extracting and Utilizing Hidden Structures in Large Datasets". Yihan will return to his undergraduate alma mater to be an assistant professor at Tsinghua University. Congratulations Yihan!
February 10, 2019: Our demo paper on DataSpread's navigation, formula, and relational capabilities was accepted at ICDE'19. Congrats Mangesh, Tana, Sajjadur, and gang!
February 1, 2019: Mangesh Bendre is leaving the group to start as a research scientist at Visa Research. Congratulations Mangesh! Mangesh's thesis on DataSpread is up at this link.
December 15, 2018: Congratulations to Doris and gang for our IUI 2019 paper (link here) identifying a new fallacy in data exploration, the drill-down fallacy, and developing techniques to work around it.
December 1, 2018: Yay! My student Mangesh Bendre (coadvised with Kevin Chang) defended his thesis on DataSpread. Mangesh has spearheaded the development of DataSpread, and was instrumental in many of the key innovations so far: the hybrid data model, positional indexing, and asynchronous formula computation.
November 15, 2018: Congrats to Doris and the rest of the Helix team for the VLDB 2019 paper on the design of Helix, our human-in-the-loop machine learning system.
August 13, 2018: Congratulations to Silu, Liqi et al. (w/ Aaron Elmore) for a "best of conference" nomination for our VLDB 2017 paper on our versioned database system Orpheus's design and implementation!
August 13, 2018: Doris's IEEE D.E. Bulletin paper articulating our vision for a visual discovery assistant, called VIDA, will be out soon. Thanks to Alexandra for inviting us.
August 10, 2018: Shreya's paper on incorporating constraints for more accurate crowd-powered sorting was accepted as a short paper at CIKM 2018.
July 11, 2018: Thrilled to receive the Army Research Office Young Investigator Program Award for our work on decoupling perspectives in crowdsourcing. Thanks to the CS Department for the generous article!
June 30, 2018: Our SIGMOD blog post on why visual data exploration introduces a number of new data management challenges is up. Thank you to Georgia Koutrika for inviting me!
June 15, 2018: Our project page for Helix is up!
June 15, 2018: Short papers accepted at IDEA on iteration in machine learning workflows, and at HCOMP on quality evaluation methods for crowdsourced segmentation.
May 27, 2018: I am serving on the steering committee of the DSIA workshop @ VIS 2018. Consider submitting your latest and greatest work here!
May 2, 2018: Demos on Helix, our human-in-the-loop machine learning tool, and ShapeSearch, our flexible shape-based trend-line querying tool, were accepted at VLDB 2018.
April 25, 2018: Our paper on Needletail, an efficient sampling engine for browsing, was accepted at the HILDA Workshop at SIGMOD 2018. We've used Needletail in a number of papers on scalable approximate visualization generation, so we're glad to have this finally out there!
April 15, 2018: Our paper on accelerating human-in-the-loop ML, a vision paper for the Helix project, was accepted at the DEEM Workshop at SIGMOD 2018. Lots more to come on Helix in the near future. Congrats Doris!
April 1, 2018: Thrilled to receive the 2018 Dean's Excellence in Research Award from the University of Illinois, given to assistant professors with an outstanding research profile + impact. Delighted to be able to celebrate with the group (photo on the right)!
March 1, 2018: Happy to be recognized with a spot on the "List of teachers rated as Excellent" for the second year in a row!
February 15, 2018: Our demo paper (w/ folks at UChicago) on generating succinct diffs between data versions was accepted at SIGMOD 2018.
February 10, 2018: Mangesh's paper on data models and indexes for scalable spreadsheets has been accepted to ICDE 2018! This paper lays out the groundwork for our DataSpread project, many years in the making.
December 12, 2017: New paper on quickly identifying a succinct difference (or "diff") between two relational datasets here. We characterize the complexity of this problem, based on varying the classes of operators and types of attributes.
December 10, 2017: More Kelly news! Kelly received the Snap Research Scholarship and the CRA Undergraduate research award honorable mention. Woohoo!
December 1, 2017: Interested in trying out our latest version of Zenvisage? Here's the link: zenvisage.cs.illinois.edu. More at our Medium blog post.
November 18, 2017: Paper studying scalability issues in Microsoft Excel by analyzing a large collection of Reddit posts, accepted at CHI 2018. Congrats Kelly Mack (an amazing achievement for an undergrad)! In other news, Kelly was also nominated for the CRA undergraduate research award.
November 10, 2017: Paper on our automatic data lake extraction tool accepted at SIGMOD 2018. Our tool automatically identifies the components corresponding to formatting and filters them out to extract a structured representation, with high accuracy on log files from GitHub. Congrats Yihan and Silu!
October 15, 2017: My O'Reilly Blog post on "Enabling Data Science for the Majority" is live! In here, I articulate that there are 5 BIG challenges in democratizing data science, and describe some of our work as well as some of the other work in this space. Read this if you want to find out what's new and cool in data science research.
October 10, 2017: New preprint on characterizing the spectrum of scalability issues in Microsoft Excel via Reddit posts here as part of our DataSpread project. Led by Kelly, our intrepid undergrad!
October 1, 2017: The Zenvisage gang's multi-year effort in participatory design with Zenvisage, along with scientists from material science, genetics, and astrophysics, is chronicled here. Many interesting insights on how visual exploration systems like Zenvisage can fit into scientific data exploration workflows + many real instances of valuable scientific findings gained from the process!
September 11, 2017: Thanks to new funding from the NSF Algorithms in the Field (AitF) program, we can advance scalable visualization by applying sublinear time techniques, along with the super smart theory duo of Ronitt Rubinfeld (MIT) and Ilias Diakonikolas (USC). NSF page here.
September 1, 2017: VLDB Blog Posts! Here they are:
- "Towards Automating Insight", here.
- "Drawing Conclusions Early with Incvisage", here.
- "Painless Data Versioning for Collaborative Data Science", here.
- "Crowdsourcing in Practice: Our Findings", here.
August 15, 2017: Grateful to receive the C.W. Gear Junior Faculty Award from the University of Illinois! Thanks to the Department of CS for being such a supportive environment for junior faculty!
August 1, 2017: New/Updated preprints:
- on Needletail, our "any-k" browsing and sampling engine, here.
- on FastMatch, an algorithm for rapidly matching histograms to a target, applying a variety of systems and algorithmic ideas, here; a key component of Zenvisage.
- on DataSpread, studying representation and indexing schemes for spreadsheet data, here.
- on Catamaran (formerly known as Datamaran), our unsupervised extraction tool for large-scale extraction from data lakes, here.
May 18, 2017: The OrpheusDB demo received a best demo honorable mention! Congrats to Liqi + Silu! Missed it at SIGMOD? You can still catch it here: video.
May 15, 2017: Paper on IncVisage: our incrementally improving visualization algorithm and interface has been accepted to VLDB'17! Paper here. Joint work with theorists at MIT and Waterloo, and HCI/Viz folks at Illinois. Perhaps the first paper that has theory, DB, and HCI co-authors? (Would love to be corrected if not.)
April 15, 2017: Thrilled and honored to receive:
- The NSF CAREER Award: Abstract here. Excited to pursue the vision of optimizing "open-ended" crowdsourcing! Vision paper from IEEE Data Engg. Bulletin here.
- The TCDE (Technical Committee on Data Engineering) Early Career Award, awarded for an individual's whole body of work in the first 5 years after the PhD. The award citation: The award is for developing new interactive tools and techniques that expand the reach of data analytics, enabling powerful data-driven discoveries by experts and non-experts alike.
April 15, 2017: Orpheus Updates: demo accepted at SIGMOD 2017; paper accepted at VLDB 2017 (no revisions!); open-source release here.
April 3, 2017: The New York Times cited the book that Adam Marcus and I wrote on crowdsourced data management. Article here.
April 3, 2017: The HILDA 2017 workshop (co-located with SIGMOD) program is up.
March 1, 2017: Manas's TKDE paper on smart drill-down (from the "best of ICDE 2016") was accepted.
February 20, 2017: Vision paper on next-gen visualization recommendation systems with Manasi Vartak, Sam et al. is out at SIGMOD Record. Link here.
January 30, 2017: My student Silu Huang won the MSR Faculty Fellowship: the first Illinois student since 2011! A great honor! Silu has been recently working on Orpheus.
January 30, 2017: Yihan's paper on calibrating classifiers has been accepted as an ORAL presentation at AISTATS'17!
January 15, 2017: Our paper analyzing a very large log of all tasks from a popular crowdsourcing marketplace has been accepted at VLDB'17. Learn all about how a marketplace operates, what the distribution of tasks looks like, and how the workers behave here.
January 10, 2017: Three of our key analytics tools, DataSpread, Zenvisage, and OrpheusDB, are moving out of private betas with a few interested parties to the public, available for easy download and deployment. More details and download links here: http://tiny.cc/three-tools.
January 1, 2017: New preprint release on Catamaran (formerly known as Datamaran), our new fully-unsupervised data extraction tool from machine generated data: no examples or supervision needed! Preprint here.
December 1, 2016: Two new paper updates:
- Our paper on SlimFast, a data fusion algorithm, spearheaded by Theo Rekatsinas, has been accepted at SIGMOD'17!
- Our vision paper on Open-Ended Crowdsourcing, spearheaded by the amazing Tova Milo, was accepted to appear in the IEEE Data Engineering Bulletin.
December 1, 2016: I've given a bunch of talks on our three tools for human-in-the-loop data analytics: a distinguished colloquium at Northwestern, a keynote at the Enterprise Intelligence workshop at KDD'16, and BigData events at Illinois and Chicago. Grab the slides here.
December 1, 2016: My exceptional PhD student, Silu Huang, was a finalist in the prestigious Microsoft Research PhD fellowship competition, with an in-person interview at MSR HQ -- so proud of her! Fingers crossed for the eventual outcome.
November 15, 2016: Many new preprints! Grab 'em while they're hot:
- From the Zenvisage project: a paper on visualizations that incrementally improve over time, and a paper on our rapid sampling engine for visualizations.
- From the Orpheus project: a paper describing data models and partitioning schemes for relational dataset versioning.
- From the Populace project: a paper on consensus-based clustering of unstructured data.
- From the DataSpread project: a paper evaluating representation schemes and indexing structures for billion cell spreadsheets.
November 15, 2016: We delivered our tutorial on crowdsourced data management at HCOMP'16: slides part 1 part 2.
November 1, 2016: Thanks to Adobe for supporting our research efforts!
November 1, 2016: New releases for Zenvisage: a paper on the Zenvisage query language, ZQL, and our smart-fuse query optimizer, accepted at VLDB'17 here, plus a demonstration paper accepted at CIDR'17 here.
October 15, 2016: I am one of the chairs of the Human-in-the-loop Data Analytics (HILDA) Workshop at SIGMOD'17, along with the peerless Joe Hellerstein, from Berkeley, and Carsten Binnig from Brown. Website here. Follow us on twitter.
October 1, 2016: My outstanding MS student, Vipul Venkataraman, won the Siebel Scholarship: a cool cash prize of $20K. Well-deserved!
September 15, 2016: Participated in a fun panel on "Will AI eat us all?" with the eminent team of Sunita Sarawagi, Sihem Amer-Yahia, H. Jagadish, and Ihab Ilyas at VLDB'16. Short answer: no.
September 1, 2016: Thanks to NSF for funding our work on DataSpread with an NSF BigData grant. Some Illinois press here.
September 1, 2016: Participated in an invited workshop on the "Theory and Models for Crowds and Networks" with an eminent team of researchers in Oaxaca, Mexico. I presented a tutorial on the data management community's take on crowdsourcing. Slides here.
August 1, 2016: New slick websites for projects:
- DataSpread, our spreadsheet-database hybrid: here.
- Orpheus, our versioned database system: here.
- Zenvisage, our visualization recommendation system: here.
- Populace, our project on optimizing crowdsourcing: here.
July 15, 2016: Our new paper on producing intelligent summaries of facets of papers, with Xiang Ren, Tarique Siddiqui, and Jiawei Han, has been accepted at CIKM 2016!
June 20, 2016: Our paper on data exploration at ICDE 2016 was invited to the TKDE "best of conference" issue, an honor reserved for the top few papers at the conference. Great job Manas!
June 15, 2016: After two years of extensive collaborations with folks at the two institutes, I am now an "official" affiliate of the Institute for Genomic Biology, and the Beckman Institute for Advanced Science and Technology.
June 1, 2016: Our paper on Squish, a tool for compression of relational datasets, was accepted at KDD 2016! Our code is open-source and available on GitHub.
May 1, 2016: New release of our visual data exploration platform Zenvisage. Paper here, and website dedicated to Zenvisage here. Contact us if you'd like to test run Zenvisage on your datasets!
April 15, 2016: We just received a seed grant from the Siebel Energy Institute to develop Zenvisage in collaboration with battery scientists at Carnegie Mellon! Excited to see what happens next.
April 10, 2016: We received a whopping 3X the number of submissions for the undergraduate research contest. Who knows what these young researchers will accomplish next?
April 1, 2016: Our paper on Decibel, the storage engine underlying DataHub, was accepted at SIGMOD 2016!
March 1, 2016: Thrilled to be among the "List of Teachers Ranked as Excellent by their Students" at Illinois! Happy to see that students enjoy my classes.
January 6, 2016: Adam and I are proud to finally release a book on crowdsourced data management, a labor of love under development for two years. The book not only covers the state of the art, but also contains a survey of both industry users of crowdsourcing and managers of crowdsourcing marketplaces. We hope that this book will be the definitive reference for how crowdsourcing is used in practice. Do send us comments!
January 1, 2016: Our vision paper on the unsolved challenges in large-scale data crowdsourcing was accepted at TKDE.
December 15, 2015: Our paper on interactive exploration using a more expressive drill-down operator was accepted at ICDE 2016 in Finland.
November 25, 2015: Some Illinois press on our NSF-funded DataHub grant. Thrilled and honored to be working with the amazing Sam Madden and Amol Deshpande at solving the problems underlying collaborative data analytics.
November 15, 2015: Our paper on optimally managing worker and answer quality in crowdsourcing was accepted at SIGMOD 2016.
October 1, 2015: We just heard word that NIH has funded our BD2K commons supplement. Looking forward to working with folks at UChicago to improve data publication workflows!
September 15, 2015: Student awesomeness: my student Silu Huang won the 3M foundation fellowship, while Tarique Siddiqui won the Siebel Foundation fellowship.
September 1, 2015: Thanks to the NSF, we now have funding to support research and development on DataHub via a Medium IIS grant with MIT and UMD! Link to the project page here.
August 1, 2015: The full SeeDB paper has been accepted at VLDB 2016 in India!
July 1, 2015: Our JellyBean paper on using humans to count objects in images will appear at HCOMP 2015!
June 9, 2015: Release of a new preprint on calibrating the output of confidence estimates from classification algorithms, using classical learning theory tools. This is work driven by my awesome student Yihan Gao.
June 6, 2015: Our DataHub query language proposal was accepted at TaPP, a focused provenance workshop.
June 1, 2015: Final tally for VLDB 2015 -- three papers and three demos on a variety of topics:
- papers: crowds, visualizations, and versioning;
- demos: data exploration, Excel-meets-databases, and collaborative data analytics.
May 27, 2015: Our paper on versioning principles was accepted at VLDB'15 without any revisions!
May 15, 2015: Undergraduate research news: Andrew Kuznetsov, a freshman working in our group, won the ISUR undergraduate research prize, and Andrew, with two other freshmen (Andrew Thieck and Radhir Kothuri), won the third prize in the Illinois Engineering Open House competition for their crowdsourcing tool.
May 12, 2015: Our paper on debiasing was accepted at KDD 2015!
April 9, 2015: Our first release of a new project, titled Data-Spread, with my esteemed colleague Kevin Chang and student Mangesh Bendre. Data-Spread is a tool that unifies databases and spreadsheets. You have to see it to believe it!
- Here is a YouTube video showing Data-Spread in action.
- Here is our demo paper on Data-Spread.
March 10, 2015: Four more new preprints in the last month! These were:
- our paper on SeeDB for query driven automatic visualization generation;
- our jellybean paper on counting objects in images; turns out we can do way better than humans or computer vision algorithms!
- our paper on debiasing of batches; crowdsourcing practitioners often use batching to save costs, but this can lead to non-independence: we deal with this issue.
- our versioning theory paper; to build a solid foundation for our DataHub project, we explored how to trade off storage and retrieval costs.
February 9, 2015: Our paper on exploiting correlations to avoid expensive predicate evaluations was accepted at SIGMOD 2015!
February 12, 2015: Many thanks to Google for their support via a Google Faculty Research Award! Excited to be building the next generation visualization toolkit.
December 10, 2014: Three new preprints in the last month! These were:
- smart drill-down, our tool for zooming into portions of a dataset quickly;
- our paper on globally optimal crowdsourcing quality management; and
- our paper on gathering data using the crowd, exploiting a hierarchy and MABs.
November 10, 2014: Three new paper acceptances in the last month!
- Our DataHub paper was accepted at CIDR;
- our rapid approximate visualization generation paper was accepted at VLDB;
- and our paper on generalized confidence intervals for crowdsourced workers was accepted at ICDE!
October 10, 2014: Thrilled to be a part of the new NIH BD2K (Big Data 2 Knowledge) center for revolutionizing genomic data analysis. Thank you, NIH, for the support!
September 2, 2014: We can finally talk about our exciting new project, titled DataHub (i.e., GitHub for Data), on collaborative data science and version management. The ambitious goal is to eliminate the pain-points of data book-keeping while doing collaborative data science.
September 1, 2014: Our paper on pricing for crowdsourcing tasks has been accepted for presentation at VLDB 2015! The paper studies a simple, but important problem: if you have a batch of tasks and a deadline, how should you vary price to meet the deadline?
August 25, 2014: Pleasantly surprised to be selected as the KDD dissertation award runner-up, having already been given the SIGMOD dissertation award! Feel truly lucky to have two communities - SIGMOD and KDD - supporting my work!
August 24, 2014: Had a blast being a keynote speaker at KDD IDEA 2014 - a big thank you to the organizers for inviting me! If this year was any indication, IDEA is going to flourish as a workshop for many years!
August 20, 2014: Our paper on optimally learning maximum-likelihood worker accuracies has been accepted as a work-in-progress paper for HCOMP 2014! The paper tackles the problem of worker quality estimation in a way EM-based algorithms cannot - by providing optimality guarantees.
August 15, 2014: Started at Illinois; exciting times ahead!

Synergistic Activities

I serve on the steering committees of Data AI Systems Workshop @ICDE, HILDA (Human-in-the-loop Data Analytics) at SIGMOD and DSIA (Data Systems for Interactive Analysis) at VIS. Lots of excitement around this nascent area at the intersection of AI, databases, data mining, and visualization/HCI - join us! I currently serve as the Faculty Equity Advisor and the Space Lead for Systems for the Computer Science Division in EECS; I also served as the Faculty Equity Advisor at the School of Information for two terms in 2023 and 2021.

I am serving as an Area Chair for SIGMOD 2026 and VLDB 2025 Demo. I also serve as the Associate Editor for VLDB Journal. I am serving on the program committee for VLDB Tutorials 2025, VLDB 2024-25, and CIDR 2025. I've served on the Program Committees and as Area Chair/Editor of VLDB, KDD, SIGMOD, WSDM, WWW, SOCC, HCOMP, ICDE, and EDBT, many of them multiple times.

Recent Releases

PAPER RequestAtlas: Supporting the Slow and Iterative Process of Requesting Public Records.
Rachel Warren, Aditya G. Parameswaran, Lisa Pickoff-White, Niloufar Salehi. 28th Int'l Conference on Computer-Supported Cooperative Work (CSCW), Bergen, Norway. November 2025
(Used by journalists for managing 1000s of public record requests.)
PAPER Towards Accurate and Efficient Document Analytics with Large Language Models.
Yiming Lin, Madelon Hulsebos, Ruiying Ma, Shreya Shankar, Sepanta Zeighami, Aditya G. Parameswaran, Eugene Wu. 41st IEEE Int’l Conf on Data Engineering (ICDE), Hong Kong. May 2025
PAPER PromptEvals: A Dataset of Assertions and Guardrails for Custom Production Large Language Model Pipelines.
Reya Vir*, Shreya Shankar*, Harrison Chase, William Hinthorn, Aditya G. Parameswaran. 20th Conf. of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL), Albuquerque, USA. April 2025
(Done in collaboration with LangChain, a leading LLM workflow company.)
PRE-PRINT RAG Without the Lag: Interactive Debugging for Retrieval-Augmented Generation Pipelines.
Quentin Romero Lauro*, Shreya Shankar*, Sepanta Zeighami, Aditya G. Parameswaran. Technical Report. April 2025
PRE-PRINT TWIX: Automatically Reconstructing Structured Data from Templatized Documents.
Yiming Lin, Mawil Hasan, Rohan Kosalge, Alvin Cheung, Aditya G. Parameswaran. Technical Report. April 2025
PRE-PRINT Steering Semantic Data Processing with DocWrangler.
Shreya Shankar*, Bhavya Chopra*, Mawil Hasan, Stephen Lee, Björn Hartmann, Joseph M. Hellerstein, Aditya G. Parameswaran, Eugene Wu. Technical Report. April 2025
(The deployed version of DocWrangler has been used over 1500 times.)
PRE-PRINT The Cambridge Report on Database Research.
Anastasia Ailamaki, ..., Aditya Parameswaran, .... Technical Report. April 2025
(This is a once-every-five-years report on data management research, written by experts in the field.)
PRE-PRINT Why Do Multi-Agent LLM Systems Fail?
Mert Cemri, Melissa Z Pan, Shuyi Yang, Lakshya A Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, Matei Zaharia, Joseph E Gonzalez, Ion Stoica. Technical Report. March 2025
PAPER NUDGE: Lightweight Non-Parametric Fine-Tuning of Embeddings for Retrieval.
Sepanta Zeighami, Zac Wellmer, Aditya G. Parameswaran. 13th Int’l Conf on Learning Representations (ICLR), Singapore. March 2025
(Deployed by LlamaIndex as part of their open-source.)
PAPER LLM-Powered Proactive Data Systems.
Sepanta Zeighami, Yiming Lin, Shreya Shankar, Aditya G. Parameswaran. IEEE Data Engineering Bulletin, Issue on LLMs-meets-data. March 2025
PAPER Flow with FlorDB: Incremental Context Maintenance for the Machine Learning Lifecycle.
Rolando Garcia, Pragya Kallanagoudar, Chithra Anand, Sarah E. Chasins, Joseph M. Hellerstein, Aditya G. Parameswaran. 25th Conference on Innovative Data Systems Research (CIDR), Amsterdam, Netherlands. January 2025
PAPER Benchmarking Table Retrieval for Generative Tasks.
Carl Ji, Aditya Parameswaran, Madelon Hulsebos. TRL Workshop @ NeurIPS 2024, Vancouver, Canada. November 2024
PAPER 'We Have No Idea How Models will Behave in Production until Production': How Engineers Operationalize Machine Learning.
Shreya Shankar, Rolando Garcia, Joseph M. Hellerstein, Aditya G. Parameswaran. 27th Int'l Conference on Computer-Supported Cooperative Work (CSCW), San Jose, Costa Rica. November 2024
(The 3 V's of MLOps, coined by this paper, were covered in a number of industry blogs and podcasts.)
PAPER Inferring Visualization Intent from Conversation.
Haotian Li, Nithin Chalapathi, Huamin Qu, Alvin Cheung, Aditya G. Parameswaran. 33rd Int’l Conf on Information and Knowledge Management (CIKM), Boise, USA. October 2024
PRE-PRINT DocETL: Agentic Query Rewriting and Evaluation for Complex Document Processing.
Shreya Shankar, Aditya G. Parameswaran, Eugene Wu. Technical Report. October 2024
(Over 1.4K GitHub stars and multiple users across domains.)

Medium Blog

Selected Projects

LLMs In Production

Supporting and sustaining LLMs in production, including building robust pipelines, identifying valuable constraints/assertions, evaluating performance, and maintaining state for long-running pipelines.

PAPER PromptEvals: A Dataset of Assertions and Guardrails for Custom Production Large Language Model Pipelines.
Reya Vir*, Shreya Shankar*, Harrison Chase, William Hinthorn, Aditya G. Parameswaran. 20th Conf. of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL), Albuquerque, USA. April 2025
(Done in collaboration with LangChain, a leading LLM workflow company.)
PAPER 'We Have No Idea How Models will Behave in Production until Production': How Engineers Operationalize Machine Learning.
Shreya Shankar, Rolando Garcia, Joseph M. Hellerstein, Aditya G. Parameswaran. 27th Int'l Conference on Computer-Supported Cooperative Work (CSCW), San Jose, Costa Rica. November 2024
(The 3 V's of MLOps, coined by this paper, were covered in a number of industry blogs and podcasts.)
PAPER Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences.
Shreya Shankar, J.D. Zamfirescu-Pereira, Björn Hartmann, Aditya G. Parameswaran, Ian Arawjo. 39th ACM Symposium on User Interface Software and Technology (UIST), Pittsburgh, USA. October 2024
(Deployed by LangChain as part of their LangChain Hub.)
PAPER SPADE: Synthesizing Assertions for Large Language Model Pipelines.
Shreya Shankar, Haotian Li, Parth Asawa, Madelon Hulsebos, Yiming Lin, J. D. Zamfirescu-Pereira, Harrison Chase, Will Fu-Hinthorn, Aditya G. Parameswaran, Eugene Wu. 50th International Conference on Very Large Data Bases (VLDB), Guangzhou, China. August 2024
(Deployed by LangChain as part of their LangChain Hub.)
PAPER Building Reactive Large Language Model Pipelines with Motion (Demo).
Shreya Shankar, Aditya G. Parameswaran. ACM SIGMOD Int'l Conf. on Management of Data, Santiago, Chile. June 2024
PAPER Revisiting Prompt Engineering via Declarative Crowdsourcing.
Aditya G. Parameswaran, Shreya Shankar, Parth Asawa, Naman Jain, Yujie Wang. Conference on Innovative Database Research (CIDR), Chaminade, USA. January 2024

Question Answering with LLMs

Building general-purpose question-answering systems with LLM-powered agents operating on structured data.

PRE-PRINT Why Do Multi-Agent LLM Systems Fail?
Mert Cemri, Melissa Z Pan, Shuyi Yang, Lakshya A Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, Matei Zaharia, Joseph E Gonzalez, Ion Stoica. Technical Report. March 2025
PAPER NUDGE: Lightweight Non-Parametric Fine-Tuning of Embeddings for Retrieval.
Sepanta Zeighami, Zac Wellmer, Aditya G. Parameswaran. 13th Int’l Conf on Learning Representations (ICLR), Singapore. March 2025
(Deployed by LlamaIndex as part of their open-source.)
PAPER LLM-Powered Proactive Data Systems.
Sepanta Zeighami, Yiming Lin, Shreya Shankar, Aditya G. Parameswaran. IEEE Data Engineering Bulletin, Issue on LLMs-meets-data. March 2025
PAPER Benchmarking Table Retrieval for Generative Tasks.
Carl Ji, Aditya Parameswaran, Madelon Hulsebos. TRL Workshop @ NeurIPS 2024, Vancouver, Canada. November 2024

LLM-Powered Document Analytics

Supporting structured queries on unstructured data, including PDFs, with applications in social justice.

PAPER RequestAtlas: Supporting the Slow and Iterative Process of Requesting Public Records.
Rachel Warren, Aditya G. Parameswaran, Lisa Pickoff-White, Niloufar Salehi. 28th Int'l Conference on Computer-Supported Cooperative Work (CSCW), Bergen, Norway. November 2025
(Used by journalists for managing 1000s of public record requests.)
PAPER Towards Accurate and Efficient Document Analytics with Large Language Models.
Yiming Lin, Madelon Hulsebos, Ruiying Ma, Shreya Shankar, Sepanta Zeighami, Aditya G. Parameswaran, Eugene Wu. 41st IEEE Int’l Conf on Data Engineering (ICDE), Hong Kong. May 2025
PRE-PRINT TWIX: Automatically Reconstructing Structured Data from Templatized Documents.
Yiming Lin, Mawil Hasan, Rohan Kosalge, Alvin Cheung, Aditya G. Parameswaran. Technical Report. April 2025
PRE-PRINT Steering Semantic Data Processing with DocWrangler.
Shreya Shankar*, Bhavya Chopra*, Mawil Hasan, Stephen Lee, Björn Hartmann, Joseph M. Hellerstein, Aditya G. Parameswaran, Eugene Wu. Technical Report. April 2025
(The deployed version of DocWrangler has been used over 1500 times.)
PAPER Flow with FlorDB: Incremental Context Maintenance for the Machine Learning Lifecycle.
Rolando Garcia, Pragya Kallanagoudar, Chithra Anand, Sarah E. Chasins, Joseph M. Hellerstein, Aditya G. Parameswaran. 25th Conference on Innovative Data Systems Research (CIDR), Amsterdam, Netherlands. January 2025
PRE-PRINT DocETL: Agentic Query Rewriting and Evaluation for Complex Document Processing.
Shreya Shankar, Aditya G. Parameswaran, Eugene Wu. Technical Report. October 2024
(Over 1.4K GitHub stars and multiple users across domains.)
PAPER Revisiting Prompt Engineering via Declarative Crowdsourcing.
Aditya G. Parameswaran, Shreya Shankar, Parth Asawa, Naman Jain, Yujie Wang. Conference on Innovative Database Research (CIDR), Chaminade, USA. January 2024

Lux: An always-on visualization recommendation system

Lux is a tool for effortlessly visualizing insights from very large data sets in dataframe workflows. Lux builds on half a decade of work on visualization recommendation systems.

Project page here.

PAPER Lux: Always-on Visualization Recommendations for Exploratory Data Science.
Doris Jung-Lin Lee, Dixin Tang, Kunal Agarwal, Thyne Boonmark, Caitlyn Chen, Jake Kang, Ujjaini Mukhopadhyay, Jerry Song, Micah Yong, Marti A. Hearst, Aditya G. Parameswaran. 48th International Conference on Very Large Data Bases (VLDB), Sydney, Australia and Zoom. September 2022
(Downloaded over 675K times as of October 2023, and used in a variety of industries.)
PAPER Expressive Visual Querying for Accelerating Insight.
Tarique Siddiqui, Paul Luh, Zesheng Wang, Karrie Karahalios, Aditya G. Parameswaran. CACM, Volume 65 No. 7. 2022
(Invited Paper due to SIGMOD Best Paper Award.)
PAPER Leveraging Analysis History for Improved In Situ Visualization Recommendation.
Will Epperson, Doris Lee, Leijie Wang, Kunal Agarwal, Aditya Parameswaran, Dominik Moritz, Adam Perer. EuroVis’22: 24th Eurographics Conference on Visualization, Rome, Italy. 2022
PAPER Deconstructing Categorization in Visualization Recommendation: A Taxonomy and Comparative Study.
Doris Jung-Lin Lee, Vidya Setlur, Melanie Tory, Karrie Karahalios, Aditya Parameswaran. IEEE Int'l Conf. on Information Visualization (InfoVis), Zoom. October 2021
PAPER ShapeSearch: A Flexible and Efficient System for Shape-based Exploration of Trendlines.
Tarique Siddiqui, Zesheng Wang, Paul Luh, Karrie Karahalios, Aditya Parameswaran. SIGMOD Int'l Conf. on Management of Data, Portland, USA. June 2020
(Best Paper Award: 2 out of 450+ submissions.)
PAPER You can't always sketch what you want: Understanding Sensemaking in Visual Query Systems.
Doris Jung-Lin Lee, John Lee, Tarique Siddiqui, Jaewoo Kim, Karrie Karahalios, Aditya Parameswaran. IEEE Int’l Conf. on Visual Analytics Science & Technology (TVCG Track at VAST’19 at VIS), Vancouver, Canada. October 2019
PAPER Avoiding drill-down fallacies with VisPilot: assisted exploration of data subsets.
Doris Jung-Lin Lee, Himel Dev, Huizi Hu, Hazem Elmeleegy, Aditya Parameswaran. 24th International Conference on Intelligent User Interfaces (IUI), Los Angeles, USA. March 2019
PAPER The Case for a Visual Discovery Assistant: A Holistic Solution for Accelerating Visual Data Exploration.
Doris Jung-Lin Lee and Aditya Parameswaran. IEEE Data Engineering Bulletin, Issue on Insights and Explanations in Data Analysis. September 2018
PAPER Effortless Visual Data Exploration with Zenvisage: An Interactive and Expressive Visual Analytics System.
Tarique Siddiqui, Albert Kim, John Lee, Karrie Karahalios, Aditya Parameswaran. 43rd International Conference on Very Large Data Bases (VLDB), Munich, Germany. September 2017
PAPER Towards Visualization Recommendation Systems.
Manasi Vartak, Silu Huang, Tarique Siddiqui, Samuel Madden, and Aditya Parameswaran. SIGMOD Record, Chicago, USA. December 2016
PAPER SeeDB: Efficient Data-Driven Visualization Recommendations to Support Visual Analytics.
Manasi Vartak, Sajjadur Rahman, Samuel Madden, Aditya Parameswaran, and Neoklis Polyzotis. 42nd International Conference on Very Large Data Bases (VLDB), New Delhi, India. September 2016

Modin: A Scalable Dataframe System

Modin applies database and distributed systems ideas to help run dataframe workloads faster, with over 2M open-source downloads.

Project page here.

PAPER Flexible Rule-Based Decomposition and Metadata Independence in Modin: A Parallel Dataframe System.
Devin Petersohn*, Dixin Tang*, Rehan Durrani, Areg Melik-Adamyan, Joseph E. Gonzalez, Anthony D. Joseph, Aditya G. Parameswaran. 48th International Conference on Very Large Data Bases (VLDB), Sydney, Australia and Zoom. September 2022
(Downloaded over 14M times as of October 2023, and used in a variety of industries.)
PAPER Enhancing the Interactivity of Dataframe Queries by Leveraging Think Time.
Doris Xin, Devin Petersohn, Dixin Tang, Yifan Wu, Joseph E. Gonzalez, Joseph M. Hellerstein, Anthony D. Joseph, Aditya G. Parameswaran. IEEE Data Engineering Bulletin, Issue on Data Validation for Machine Learning. May 2021
PAPER Towards Scalable Dataframe Systems.
Devin Petersohn, Stephen Macke, Doris Xin, William Ma, Doris Lee, Xiangxi Mo, Joseph E. Gonzalez, Joseph M. Hellerstein, Anthony D. Joseph, Aditya Parameswaran. 46th Int'l Conf. on Very Large Data Bases, Tokyo, Japan. August 2020

NBTools: Better Computational Notebooks

NBsafety and NBslicer make it easy for data scientists to write correct, reproducible code in computational notebooks.

Project page here.

PAPER Bolt-on, Compact, and Rapid Program Slicing for Notebooks.
Shreya Shankar*, Stephen Macke*, Sarah Chasins, Andrew Head, Aditya G. Parameswaran. 49th International Conference on Very Large Data Bases (VLDB), Vancouver, Canada. September 2023
(The underlying open-source IPyflow tool has 275K downloads and 1000 GitHub stars as of October 2023.)
PAPER Fine-Grained Lineage for Safer Notebook Interactions.
Stephen Macke, Hongpu Gong, Doris Jung-Lin Lee, Andrew Head, Doris Xin, Aditya Parameswaran. 47th International Conference on Very Large Data Bases (VLDB), Copenhagen, Denmark and Zoom. September 2021
(Downloaded over 240K times as of June 2023.)

DataSpread: A Spreadsheet-Database Hybrid

DataSpread is a tool that marries the best of databases and spreadsheets.

Project page here.

PAPER Visualizing Spreadsheet Formula Graphs Compactly.
Fanchao Chen, Dixin Tang, Haotian Li, Aditya G. Parameswaran. 49th International Conference on Very Large Data Bases (VLDB), Vancouver, Canada. September 2023
PAPER Efficient and Compact Spreadsheet Formula Graphs.
Dixin Tang, Fanchao Chen, Christopher De Leon, Tana Wattanawaroon, Jeaseok Yun, Srinivasan Seshadri, Aditya G. Parameswaran. 39th International Conf. on Data Engineering (ICDE), Anaheim, CA, USA. April 2023
PAPER NOAH: Interactive Spreadsheet Exploration with Dynamic Hierarchical Overviews.
Sajjadur Rahman, Mangesh Bendre, Yuyang Liu, Shichu Zhu, Zhaoyuan Su, Karrie Karahalios, Aditya Parameswaran. 47th International Conference on Very Large Data Bases (VLDB), Copenhagen, Denmark and Zoom. September 2021
PAPER Benchmarking Spreadsheet Systems.
Sajjadur Rahman, Kelly Mack, Mangesh Bendre, Ruilin Zhang, Karrie Karahalios, Aditya Parameswaran. SIGMOD Int'l Conf. on Management of Data, Portland, USA. June 2020
(Covered in the Morning Paper, a popular industry blog)
PAPER Understanding Data Analysis Workflows on Spreadsheets: Roadblocks and Opportunities.
Pingjing Yang, Ti-Chung Cheng, Sajjadur Rahman, Mangesh Bendre, Karrie Karahalios, Aditya Parameswaran. Workshop on Human-in-the-Loop Data Analytics (HILDA) at the ACM SIGMOD Int'l Conf. on Management of Data, Portland, USA. June 2020
PAPER Anti-Freeze for Large and Complex Spreadsheets: Asynchronous Formula Computation.
Mangesh Bendre, Tana Wattanawaroon, Kelly Mack, Kevin Chang, Aditya Parameswaran. SIGMOD Int'l Conf. on Management of Data, Amsterdam, The Netherlands. June 2019
PAPER Faster, Higher, Stronger: Redesigning Spreadsheets for Scale (Demo).
Mangesh Bendre, Tana Wattanawaroon, Sajjadur Rahman, Kelly Mack, Yuyang Liu, Shichu Zhu, Yu Lu, Ping-Jing Yang, Xinyan Zhou, Kevin Chang, Karrie Karahalios, Aditya Parameswaran. 35th International Conf. on Data Engineering (ICDE), Macau. April 2019
(Best Demo Award: Given to two out of 24 papers.)
PRE-PRINT Directed Data Management: A New Frontier in Database Usability.
Mangesh Bendre, Sajjadur Rahman, Tana Wattanawaroon, Kelly Mack, Yu Lu, Kevin Chang, Karrie Karahalios, Aditya Parameswaran. Technical Report. August 2018
PAPER Towards a Holistic Integration of Spreadsheets with Databases: A Scalable Storage Engine for Presentational Data Management.
Mangesh Bendre, Vipul Venkataraman, Xinyan Zhou, Kevin Chang, Aditya Parameswaran. 34th International Conf. on Data Engineering (ICDE), Paris, France. April 2018
PAPER Characterizing Scalability Issues in Spreadsheet Software using Online Forums (Case Study Paper).
Kelly Mack, John Lee, Kevin Chang, Karrie Karahalios, Aditya Parameswaran. International Conference on Human Factors in Computing Systems (CHI), Montreal, Canada. April 2018
PAPER Data-Spread: Unifying Databases and Spreadsheets (Demo).
Mangesh Bendre, Bofan Sun, Xinyan Zhou, Ding Zhang, Shy-Yauer Lin, Kevin Chang, and Aditya Parameswaran. 41st International Conference on Very Large Data Bases (VLDB), Kohala Coast, Hawaii, USA. September 2015