Crowd-Powered Data Management
sCOOP: The Stanford-Santa Cruz Project for Cooperative Computing with Algorithms, Data and People
News
(Sep 2013) New Technical Report on Filtering Generalizations released.
(May 2013) New Technical Report on Crowd-Powered Search released.
(Aug 2012) New Technical Reports on Confidence, Crowd-powered Finding, and Deco Query Processing released.
(Jun 2012) New Technical Report on Identifying Reliable Workers released.
(Feb 2012) New Technical Reports on the Deco System demo and Entity Resolution released.
(Dec 2011) New Technical Report on Human-powered Debugging of Lineage in Large Data Pipelines released.
(Nov 2011) New Technical Report on Human-powered Top-1 Computation released.
(Nov 2011) New Technical Report on the Deco system released.
(Sep 2011) New Technical Report on Filtering Data with Humans released.
(Jun 2011) We presented the data model for our system, called Deco (for Declarative Crowdsourcing), outlined some Query Processing challenges, and surveyed our work on Crowd Algorithms at a Crowdsourcing event at UC Berkeley.
(Jan 2011) We presented the vision for sCOOP at CIDR 2011.
Overview
Many tasks are performed more easily and more accurately by people than by current computer algorithms: tasks that involve understanding and analyzing images, video, text, and speech, as well as judging subjective opinions and abstract concepts.
Thanks to the proliferation of cheap and reliable internet connectivity, a large number of people are now online and willing to answer questions for pay. A number of human computation (a.k.a. crowdsourcing) marketplaces, such as Mechanical Turk, oDesk, and LiveOps, make it easy for these workers to find tasks.
sCOOP is a project whose broad theme is to leverage people as processing units, much like computer processes or subroutines, to achieve some global objective. A primary focus of sCOOP is to optimize this computation: while there may be many ways to orchestrate a particular task, our goal is to use as few resources (e.g., time, money) as possible while obtaining results that are as good as or better than those of an unoptimized computation.
We are approaching the problem of orchestrating computing tasks that involve people along two (not necessarily mutually exclusive) directions:
Optimizing Crowd Algorithms
Here, the goal is to optimize fundamental data processing algorithms whose unit operations are performed by people. Examples include sorting, clustering, classification, and categorization.
Over the last year, we worked on algorithms for max, filtering, graph search, and lineage debugging. In the Max problem, the goal is to find the best item in a given set (e.g., photos, videos, or songs), given a budget on the number of pairwise comparisons that may be asked of humans. In the Filtering problem, the goal is to determine which items in a given data set satisfy a given set of properties (each verifiable by humans), using a cost-optimal algorithm subject to constraints on error and time. We also considered human-assisted graph search, which has applications in many domains that can utilize human intelligence, including curation of hierarchies, image segmentation and categorization, interactive search, and filter synthesis. (We studied this problem along several dimensions: fixed vs. unlimited budget, different graph structures, and multiple "target" nodes vs. a single "target".) Most recently, we considered using expert human input to debug data lineage in large data pipelines.
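To make these problems concrete, below is a minimal sketch of the Max problem under a comparison budget. The ask_crowd routine is a hypothetical stand-in for posting a pairwise-comparison task to a marketplace, simulated here with a noisy oracle; the single-elimination strategy is an illustrative baseline, not the optimized algorithms from our papers.

import random

def ask_crowd(a, b, true_score, error_rate=0.1):
    # Simulated crowd worker: returns the better of items a and b,
    # answering incorrectly with probability error_rate.
    better = a if true_score[a] > true_score[b] else b
    worse = b if better == a else a
    return better if random.random() > error_rate else worse

def budgeted_max(items, budget, ask):
    # Single-elimination tournament: each match consumes one unit of the
    # comparison budget; once the budget is exhausted, remaining matches
    # are decided arbitrarily.
    contenders = list(items)
    while len(contenders) > 1:
        next_round = []
        for i in range(0, len(contenders) - 1, 2):
            if budget > 0:
                next_round.append(ask(contenders[i], contenders[i + 1]))
                budget -= 1
            else:
                next_round.append(contenders[i])  # no budget left: arbitrary pick
        if len(contenders) % 2 == 1:
            next_round.append(contenders[-1])  # odd item out gets a bye
        contenders = next_round
    return contenders[0]

if __name__ == "__main__":
    scores = {f"photo{i}": random.random() for i in range(16)}
    winner = budgeted_max(scores, budget=20, ask=lambda a, b: ask_crowd(a, b, scores))
    print("crowd's pick:", winner, "| true best:", max(scores, key=scores.get))

With a perfectly reliable crowd, a tournament needs n-1 comparisons for n items; the interesting questions arise when the budget is smaller than that and when answers are noisy, which is where our Max algorithms come in.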
The Deco System: Declarative Querying of Humans, Algorithms and Databases
Here, the goal is to combine human and algorithmic computation with traditional database operations in order to perform complex tasks. This combination involves several optimization objectives: minimizing total elapsed time, minimizing the monetary cost of human computation (by minimizing the number of questions asked and pricing them appropriately), and maximizing confidence in the obtained answers.
Our proposed approach views the crowdsourcing service as another database whose facts are computed by human processors. By promoting the crowdsourcing service to a first-class citizen on the same level as extensional data, it becomes possible to write a declarative query that seamlessly combines information from both. The system is then responsible for optimizing the order in which tuples are processed, the order in which tasks are scheduled, whether tasks are handled by algorithms or by the crowdsourcing service, the pricing of the latter tasks, and the seamless transfer of information between the database system and the external services. Moreover, it provides built-in mechanisms for handling uncertainty, so that the developer can explicitly control the quality of query results. This declarative approach facilitates the development of complex applications that combine knowledge from human computation, algorithmic computation, and stored data.
Our current design and details of our initial prototype can be found in the Deco paper.
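For intuition, here is a highly simplified sketch of the declarative idea. The restaurants relation, the fetch_rating stand-in for a crowd task, and the SQL-style query are all illustrative assumptions; the actual Deco data model, query language, and fetch rules are described in the paper.

def fetch_rating(restaurant):
    # Hypothetical stand-in for posting a "rate this restaurant" task to a
    # crowdsourcing marketplace; a real system would collect and reconcile
    # multiple, possibly conflicting, worker answers.
    print(f"[crowd task] rate {restaurant}")
    return 4

# Stored (extensional) data; ratings are unknown until the crowd supplies them.
restaurants = [
    {"name": "Zare", "city": "Palo Alto", "rating": None},
    {"name": "Oren's", "city": "Palo Alto", "rating": None},
    {"name": "Gary Danko", "city": "San Francisco", "rating": None},
]

# Conceptually, the declarative query is:
#   SELECT name FROM restaurants WHERE city = 'Palo Alto' AND rating >= 4
# The query processor evaluates cheap algorithmic predicates first and
# invokes (paid) crowd fetches only for the tuples that still need them.
result = []
for r in restaurants:
    if r["city"] != "Palo Alto":
        continue                                # algorithmic predicate: free
    if r["rating"] is None:
        r["rating"] = fetch_rating(r["name"])   # crowd-resolved attribute: costs money
    if r["rating"] >= 4:
        result.append(r["name"])
print(result)

The ordering of the two predicates is exactly the kind of decision the optimizer makes: filtering on the stored city attribute first avoids paying for crowd ratings of restaurants that could never qualify.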
Talks
Data-Centric Human Computation — Jennifer's overview talk about the sCOOP project: talk
Active Sampling for Entity Matching — This talk was given at KDD 2012: talk
CrowdScreen: Algorithms for Filtering Data with Humans — This talk was given at SIGMOD 2012: talk
Human-assisted Graph Search: It's Okay to Ask Questions — This talk was given at VLDB 2011: talk
Deco Data Model, Query Processing, and Crowd Algorithms Talks — These talks were given at the Crowd-Crowd event at UC Berkeley on June 6, 2011: Data Model Talk, Query Processing Talk, Crowd Algorithms Talk
Answering Queries using Humans, Algorithms and Databases — This is the vision talk for sCOOP given at CIDR’11: talk
Relevant technical reports and publications:
Optimal Crowd-Powered Rating and Filtering Algorithms, pdf
Aditya Parameswaran, Stephen Boyd, Hector Garcia-Molina, Ashish Gupta, Neoklis Polyzotis, and Jennifer Widom
40th International Conf. on Very Large Data Bases (VLDB), Hangzhou, China, Sep 2014
DataSift: A Crowd-Powered Search Toolkit (Demo), pdf
Aditya Parameswaran, Ming Han Teh, Hector Garcia-Molina, and Jennifer Widom
SIGMOD International Conf. on Management of Data, Snowbird, Utah, USA, Jun 2014
Crowd-Powered Find Algorithms, pdf
Anish Das Sarma, Aditya Parameswaran, Hector Garcia-Molina, and Alon Halevy
30th International Conf. on Data Engineering (ICDE), Chicago, USA, Apr 2014
Finish Them!: Pricing Algorithms for Human Computation, pdf
Yihan Gao and Aditya Parameswaran
Technical Report, Mar 2014
Comprehensive and Reliable Crowd Assessment Algorithms, pdf
Manas Joglekar, Hector Garcia-Molina, and Aditya Parameswaran
Technical Report, Mar 2014
An Expressive and Accurate Crowd-Powered Search Toolkit, pdf
Aditya Parameswaran, Ming Han Teh, Hector Garcia-Molina, and Jennifer Widom
1st Conf. on Human Computation and Crowdsourcing (HCOMP), Palm Springs, USA, Nov 2013
Human-Powered Data Management, pdf
Aditya Parameswaran
Doctoral Dissertation, Stanford University, Sep 2013
(Winner of Stanford University's Arthur Samuel Best Thesis Award, 2013-14)
Active Sampling for Entity Matching with Guarantees, pdf
Kedar Bellare, Suresh Iyengar, Aditya Parameswaran, and Vibhor Rastogi
ACM Transactions on Knowledge Discovery from Data (TKDD), Special Issue on ACM SIGKDD 2012, Volume 7(3), Sep 2013
Evaluating the Crowd with Confidence, pdf
Manas Joglekar, Hector Garcia-Molina, and Aditya Parameswaran
19th International Conf. on Knowledge Discovery and Data Mining (KDD), Chicago, USA, Aug 2013
An Overview of the Deco System: Data Model and Query Language; Query Processing and Optimization, pdf
Hyunjung Park, Richard Pang, Aditya Parameswaran, Hector Garcia-Molina, Neoklis Polyzotis, and Jennifer Widom
SIGMOD Record, Volume 41, Dec 2012
Human-Powered Debugging of Large Data Pipelines, pdf
Nilesh Dalvi, Aditya Parameswaran, and Vibhor Rastogi
25th International Conf. on Neural Information Processing Systems (NIPS), Lake Tahoe, Nevada, USA, Dec 2012
Deco: Declarative Crowdsourcing, pdf
Aditya Parameswaran, Hyunjung Park, Hector Garcia-Molina, Neoklis Polyzotis, and Jennifer Widom
21st International Conf. on Information and Knowledge Management (CIKM), Maui, Hawaii, USA, Nov 2012
Deco: A System for Declarative Crowdsourcing (Demo), pdf
Hyunjung Park, Richard Pang, Aditya Parameswaran, Hector Garcia-Molina, Neoklis Polyzotis, and Jennifer Widom
38th International Conf. on Very Large Data Bases (VLDB), Istanbul, Turkey, Sep 2012
Query Processing over Crowdsourced Data, pdf
Hyunjung Park, Aditya Parameswaran, and Jennifer Widom
Infolab Technical Report, Aug 2012
Active Sampling for Entity Matching, pdf talk
Kedar Bellare, Suresh Iyengar, Aditya Parameswaran, and Vibhor Rastogi
18th International Conf. on Knowledge Discovery and Data Mining (KDD), Beijing, China, Aug 2012 (Invited to the Special Issue of the TKDD Journal for KDD 2012 Best Papers.)
Identifying Reliable Workers Swiftly, pdf
Aditya Ramesh, Aditya Parameswaran, Hector Garcia-Molina, and Neoklis Polyzotis
Infolab Technical Report, Jun 2012
So Who Won? Dynamic Max Discovery with the Crowd, pdf
Stephen Guo, Aditya Parameswaran, and Hector Garcia-Molina
SIGMOD International Conf. on Management of Data, Scottsdale, Arizona, USA, Jun 2012
CrowdScreen: Algorithms for Filtering Data with Humans, pdf talk
Aditya Parameswaran, Hector Garcia-Molina, Hyunjung Park, Neoklis Polyzotis, Aditya Ramesh, and Jennifer Widom
SIGMOD International Conf. on Management of Data, Scottsdale, Arizona, USA, Jun 2012
Human-assisted Graph Search: It's Okay to Ask Questions, pdf talk
Aditya Parameswaran, Anish Das Sarma, Hector Garcia-Molina, Neoklis Polyzotis, and Jennifer Widom
37th International Conf. on Very Large Data Bases (VLDB), Seattle, USA, Sep 2011
Answering Queries using Humans, Algorithms and Databases, pdf pptx
Aditya Parameswaran and Neoklis Polyzotis
Conference on Innovative Data Systems Research (CIDR), Asilomar, USA, Jan 2011