Web Extraction

Concept Hierarchy Maintenance

At a research internship at Kosmix (since acquired by Walmart) with Anand Rajaraman, I looked at the problem of generating concepts, i.e., entities, items and ideas that users are interested in and searching for. The difficulty is that users typically do not type only their concept of choice into the search bar; they also add extra terms or other concepts. For example, a person searching for “Martin Scorsese” might type in “Martin Scorsese Departed”. We used ideas from association rule mining to decide whether a sequence of n words represents a concept, relative to the (n+1)-word sequences that contain it and the (n-1)-word sequences it contains. We proved some theoretically desirable properties of our approach and experimentally demonstrated its effectiveness on a dataset of query logs. Our solution was deployed internally at Kosmix to enhance their concept hierarchy.
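To make the association-rule idea concrete, here is a minimal Python sketch; it is not the method from the paper. It assumes a hypothetical in-memory list of queries and scores a candidate with three illustrative quantities: its support (how often it occurs), a confidence relative to its (n-1)-word subsequences, and a confidence relative to the (n+1)-word sequences that contain it. The function names and the exact formulas are assumptions chosen for illustration.

```python
from collections import Counter


def ngram_counts(queries, max_n=4):
    """Count every contiguous word sequence of length 1..max_n in the queries."""
    counts = Counter()
    for query in queries:
        words = query.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return counts


def concept_scores(candidate, counts):
    """Return (support, sub_conf, super_conf) for a candidate, or None if unseen.

    sub_conf:   smallest fraction of a shorter sequence's occurrences that are
                part of the candidate (high = the shorter sequences rarely
                appear on their own).
    super_conf: largest fraction of the candidate's occurrences covered by one
                longer sequence (low = no single extension dominates).
    These definitions are illustrative assumptions, not the paper's metrics.
    """
    cand = tuple(candidate.lower().split())
    support = counts[cand]
    if support == 0:
        return None
    # The two (n-1)-word sequences contained in the candidate.
    subs = [cand[1:], cand[:-1]] if len(cand) > 1 else []
    sub_conf = min((support / counts[s] for s in subs), default=1.0)
    # All (n+1)-word sequences that contain the candidate as prefix or suffix.
    supers = [g for g in counts
              if len(g) == len(cand) + 1 and (g[1:] == cand or g[:-1] == cand)]
    super_conf = max((counts[g] / support for g in supers), default=0.0)
    return support, sub_conf, super_conf


# Hypothetical query log, made up for this example.
queries = [
    "martin scorsese",
    "martin scorsese departed",
    "martin scorsese films",
    "martin luther king",
    "departed cast",
]
counts = ngram_counts(queries)
print(concept_scores("martin scorsese", counts))
print(concept_scores("martin scorsese departed", counts))
```

In this toy log, “martin scorsese” scores well on both measures, while “martin scorsese departed” has a low sub_conf because “martin scorsese” mostly occurs without “departed”.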

The next step in this work was to determine where in a topic hierarchy to attach an extracted concept. Since computers cannot easily do this task, we used human involvement to assist the search. We devised algorithms to find the minimal set of questions to ask humans in order to identify the correct category.
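As a rough illustration of question-driven search, the sketch below uses a simple greedy heuristic rather than the algorithms from the paper: at each step it asks the question “does the concept belong under node X?” for the node that splits the remaining candidate categories most evenly. The hierarchy, the `ask` callback (standing in for a human answer), and the function names are all hypothetical.

```python
def descendants(tree, node):
    """Return the node together with everything below it in the hierarchy."""
    seen, stack = set(), [node]
    while stack:
        cur = stack.pop()
        if cur not in seen:
            seen.add(cur)
            stack.extend(tree.get(cur, []))
    return seen


def locate_category(tree, root, ask):
    """Find the target category by asking yes/no reachability questions.

    ask(node) must return True when the target is `node` or lies below it.
    Returns (category, number_of_questions). Greedy heuristic, assumed for
    illustration; it does not guarantee a minimal number of questions.
    """
    subtree = {n: descendants(tree, n) for n in descendants(tree, root)}
    candidates = set(subtree)
    asked = 0
    while len(candidates) > 1:
        def worst_case(node):
            # Size of the larger candidate set left after asking about `node`.
            inside = len(candidates & subtree[node])
            return max(inside, len(candidates) - inside)

        node = min(sorted(candidates), key=worst_case)
        asked += 1
        if ask(node):
            candidates &= subtree[node]
        else:
            candidates -= subtree[node]
    return candidates.pop(), asked


# Hypothetical topic hierarchy: parent -> list of children.
tree = {
    "root": ["arts", "science"],
    "arts": ["film", "music"],
    "science": ["physics"],
    "film": ["directors"],
}
target = "directors"
category, asked = locate_category(
    tree, "root", lambda node: target in descendants(tree, node)
)
print(category, asked)
```

Here the simulated human answers truthfully; handling noisy or inconsistent answers would need additional machinery that this sketch leaves out.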

Fault-Tolerant Wrappers

Web pages that are script- or template-based prove to be invaluable for extracting concept metadata. For instance, it is easy to ask humans to annotate a few web pages and learn a web wrapper that extracts all metadata from a script-based website such as Yelp, Amazon, eBay and so on. (Restaurant phone numbers, for example, may be extracted from Yelp.) However, these web pages change often, and the wrappers learnt for them may no longer extract correct data. At a research internship at Yahoo! Research Bangalore, I looked at the wrapper maintenance and management problem with Rajeev Rastogi and Nilesh Dalvi. We were able to design efficient algorithms that output theoretically optimal robust wrappers for two different change models, and found that these wrappers perform orders of magnitude better than existing wrappers in terms of fault tolerance.
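The following small Python example illustrates why some wrappers survive page changes better than others; it does not implement the algorithms from the paper. The two page snippets and both path expressions are made up for illustration, and the wrappers are written in the limited XPath subset supported by Python's standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET

# Hypothetical page before a site redesign.
OLD_PAGE = (
    "<html><body>"
    "<div><h1>Blue Door Cafe</h1></div>"
    "<div><span class='phone'>555-0100</span></div>"
    "</body></html>"
)

# The same page after the site inserts a banner at the top.
NEW_PAGE = (
    "<html><body>"
    "<div>Ad banner</div>"
    "<div><h1>Blue Door Cafe</h1></div>"
    "<div><span class='phone'>555-0100</span></div>"
    "</body></html>"
)

# A position-based wrapper: depends on the exact layout of the page.
BRITTLE_WRAPPER = "./body/div[2]/span"
# An attribute-based wrapper: depends only on the class of the target node.
STURDIER_WRAPPER = ".//span[@class='phone']"


def extract(page, wrapper):
    """Apply a wrapper (an ElementTree path) and return the text, or None."""
    node = ET.fromstring(page).find(wrapper)
    return node.text if node is not None else None


for name, wrapper in [("brittle", BRITTLE_WRAPPER), ("sturdier", STURDIER_WRAPPER)]:
    print(name, extract(OLD_PAGE, wrapper), extract(NEW_PAGE, wrapper))
```

Both wrappers extract the phone number from the old page, but only the attribute-based one still works once the banner shifts the position of the `div` elements.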

Keyword Search

At an internship at Microsoft Research, I worked on the problem of matching keyword search queries to a large collection of prespecified patterns (such as “Restaurant” near “Location”) and a large database of facts. Ours is the first efficient solution with guarantees, and our implementation scaled easily to the large datasets at Microsoft.
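A toy version of this matching task can be sketched as follows. This is a naive backtracking matcher written for illustration, not the efficient parsing-based algorithm from the paper. The fact database, the `<Type>` slot notation, and the function name are all assumptions.

```python
# Hypothetical fact database: phrase -> entity type.
FACTS = {
    "blue door cafe": "Restaurant",
    "palo alto": "Location",
    "san jose": "Location",
}

# A pattern mixes typed slots (written "<Type>") with literal keywords.
PATTERN = ["<Restaurant>", "near", "<Location>"]


def match_pattern(tokens, pattern, facts, i=0, j=0):
    """Match tokens[i:] against pattern[j:]; return slot bindings or None."""
    if j == len(pattern):
        # Succeed only if the whole query has been consumed.
        return {} if i == len(tokens) else None
    element = pattern[j]
    if element.startswith("<"):
        slot = element[1:-1]
        # Try every span starting at i whose phrase is a fact of this type.
        for k in range(i + 1, len(tokens) + 1):
            phrase = " ".join(tokens[i:k])
            if facts.get(phrase) == slot:
                rest = match_pattern(tokens, pattern, facts, k, j + 1)
                if rest is not None:
                    return {slot: phrase, **rest}
        return None
    # Literal keyword: must match the next token exactly.
    if i < len(tokens) and tokens[i] == element:
        return match_pattern(tokens, pattern, facts, i + 1, j + 1)
    return None


query = "Blue Door Cafe near Palo Alto"
print(match_pattern(query.lower().split(), PATTERN, FACTS))
```

Trying every span for every slot is fine for a handful of patterns and facts; it would not scale to large collections, which is the setting the actual work addresses.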

Other Topics

I also looked at problems in debugging large information extraction pipelines and in building better classifiers for entity resolution, in both cases using crowdsourcing in an efficient and optimized manner.

Relevant tech reports / publications:

  1. Efficient Parsing-based Keyword Search over Databases
    Aditya Parameswaran, Raghav Kaushik and Arvind Arasu
    22nd International Conf. on Information and Knowledge Management (CIKM), Burlingame, USA, Nov 2013

  2. Active Sampling for Entity Matching with Guarantees
    Kedar Bellare, Suresh Iyengar, Aditya Parameswaran and Vibhor Rastogi
    ACM Transactions on Knowledge Discovery from Data (TKDD) - Special Issue on ACM SIGKDD 2012
    Volume 7(3), September 2013

  3. Human-Powered Debugging of Large Data Pipelines
    Nilesh Dalvi, Aditya Parameswaran and Vibhor Rastogi
    25th International Conf. on Neural Information Processing Systems (NIPS), Tahoe, Nevada, USA, Dec 2012

  4. Active Sampling for Entity Matching
    (Invited to: Special Issue of TKDD Journal for KDD 2012 Best Papers.)
    Kedar Bellare, Suresh Iyengar, Aditya Parameswaran and Vibhor Rastogi
    18th International Conf. on Knowledge Discovery and Data Mining (KDD), Beijing, China, Aug 2012

  5. Optimal Schemes for Robust Web Extraction
    Aditya Parameswaran, Nilesh Dalvi, Hector Garcia-Molina and Rajeev Rastogi
    37th International Conf. on Very Large Data Bases (VLDB), Seattle, USA, Sep 2011

  6. Human-assisted Graph Search: It's Okay to Ask Questions
    Aditya Parameswaran, Anish Das Sarma, Hector Garcia-Molina, Neoklis Polyzotis and Jennifer Widom
    37th International Conf. on Very Large Data Bases (VLDB), Seattle, USA, Sep 2011

  7. Towards the Web of Concepts: Extracting Concepts from Large Datasets
    Aditya Parameswaran, Hector Garcia-Molina and Anand Rajaraman
    36th International Conf. on Very Large Data Bases (VLDB), Singapore, Sep 2010
    (Invited to: Special Issue of VLDB Journal for VLDB 2010 Best Papers.)