stackoverflow

An Analysis of the Stack Overflow Q&A Site

This project analyzes Stack Overflow, a Question & Answer site for programmers that dramatically improves on the utility and performance of Q&A systems for technical domains. Over 92% of Stack Overflow questions about expert topics are answered, in a median time of 11 minutes. Using a mixed methods approach that combines statistical data analysis with user interviews, we seek to understand this success. We argue that it is not primarily due to an a priori superior technical design, but also to the high visibility and daily involvement of the design team within the community they serve. This model of continued community leadership presents challenges both to CSCW systems research and to attempts to apply the Stack Overflow model to other specialized knowledge domains.

This project is complete and no longer under active development.

Paper

Lena Mamykina, Bella Manoim, Manas Mittal, George Hripcsak, and Björn Hartmann. 2011. Design lessons from the fastest q&a site in the west. In Proceedings of CHI 2011. ACM, New York, NY, USA, 2857-2866.
ACM Digital Library | Local PDF

CHI 2011 Talk Slides: PowerPoint file | PDF

Berkeley Contributors

Björn Hartmann
Manas Mittal

Corpus

Our analysis is based on the August 2010 Stack Exchange Data Dump (Creative Commons licensed). We analyzed two years of user activity, from July 31, 2008 to July 31, 2010. As of early August 2010, Stack Overflow had a total of 300k registered users who asked 833k questions, provided 2.2M answers, and posted 2.9M comments.

Source Code

Our analysis code is available for download under a BSD license:

stackoverflowanalysis-188.zip (40 MB)

We converted the XML data dump files into a SQLite3 database. The analysis code is written in Python 2.x and SQL. Graphs are generated using matplotlib. The archive is large because it contains intermediate results of some large queries.
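The XML-to-SQLite conversion can be illustrated with a minimal sketch. This is not the project's import code: it is written for Python 3 rather than the Python 2.x used in the archive, and the function name import_posts, the posts table, and the chosen column subset are assumptions made for illustration. It also assumes that each record in the dump is a row element whose fields are stored as XML attributes.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET

# Assumed subset of attributes; the real dump carries more fields per row.
POST_COLUMNS = ("Id", "PostTypeId", "ParentId", "CreationDate", "Score")


def import_posts(xml_source, conn):
    """Stream <row .../> elements from a posts XML source into a 'posts' table.

    xml_source may be a file name or a binary file object.
    Returns the number of rows inserted.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts ("
        "Id INTEGER PRIMARY KEY, PostTypeId INTEGER, ParentId INTEGER, "
        "CreationDate TEXT, Score INTEGER)"
    )
    insert_sql = "INSERT INTO posts VALUES (?, ?, ?, ?, ?)"
    count = 0
    # iterparse avoids building the whole (very large) tree before processing.
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == "row":
            # Missing attributes (e.g. ParentId on questions) become NULL.
            conn.execute(insert_sql, [elem.get(c) for c in POST_COLUMNS])
            count += 1
        elem.clear()  # drop the element's attributes once they are stored
    conn.commit()
    return count


if __name__ == "__main__":
    sample = io.BytesIO(
        b'<posts>'
        b'<row Id="1" PostTypeId="1" CreationDate="2008-08-01T00:00:00" Score="3" />'
        b'<row Id="2" PostTypeId="2" ParentId="1" CreationDate="2008-08-01T00:05:00" Score="1" />'
        b'</posts>'
    )
    db = sqlite3.connect(":memory:")
    print(import_posts(sample, db), "rows imported")
```

Streaming the file row by row keeps memory use modest, which matters for a dump with millions of posts. A real import would cover every table and attribute in the dump.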

Build Instructions

Download the source file above and unzip it into a directory of your choice.
Download the data dump and place the XML files in the xml/ directory (see xml/README.txt).
Run the import script import/import-all.sh, which calls Python scripts to import the individual tables.
Create indices to speed up queries using import/create-indices.sql.
Individual analysis scripts (in both Python and SQL) are in the analysis/ folder.
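Once the database is built, an analysis script typically combines an index, a SQL query, and some Python post-processing. The sketch below is a hypothetical example of that pattern, not one of the scripts in analysis/. It is written for Python 3 and assumes a posts table with Id, PostTypeId, ParentId, and CreationDate columns, where PostTypeId 1 marks questions and 2 marks answers, and where creation dates are ISO 8601 strings. The names answer_stats and idx_posts_parent are made up for illustration.

```python
import sqlite3
import statistics
from datetime import datetime

# For each question, find the creation time of its earliest answer (NULL if none).
# ISO 8601 timestamps in a uniform format sort correctly as plain strings.
FIRST_ANSWER_SQL = """
SELECT q.Id, q.CreationDate, MIN(a.CreationDate)
FROM posts AS q
LEFT JOIN posts AS a
       ON a.ParentId = q.Id AND a.PostTypeId = 2
WHERE q.PostTypeId = 1
GROUP BY q.Id
"""


def answer_stats(conn):
    """Return (fraction of questions answered, median minutes to first answer)."""
    # An index on ParentId keeps the join from scanning every post per question.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_posts_parent ON posts (ParentId)")
    rows = conn.execute(FIRST_ANSWER_SQL).fetchall()

    delays = []
    for _question_id, asked, first_answer in rows:
        if first_answer is not None:
            delta = datetime.fromisoformat(first_answer) - datetime.fromisoformat(asked)
            delays.append(delta.total_seconds() / 60.0)

    answered_fraction = len(delays) / len(rows) if rows else 0.0
    median_minutes = statistics.median(delays) if delays else None
    return answered_fraction, median_minutes
```

With a database file produced by the import step, this could be called as answer_stats(sqlite3.connect(path)), where path is the location of that file. The figures reported in the paper come from the authors' own scripts and filtering choices, so this sketch should not be expected to reproduce them exactly.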