PLUIE: Probability and Logic Unified for Information Extraction
Reading list

This list is still under construction. Suggestions welcome.

Books

Information Extraction: Algorithms and Prospects in a Retrieval Context by Marie-Francine Moens. Springer, 2006/2010.
Text Mining: Predictive Methods for Analyzing Unstructured Information by Sholom M. Weiss, Nitin Indurkhya, Tong Zhang and Fred Damerau. Springer, 2010.
For background: relevant sections of Artificial Intelligence: A Modern Approach, 3rd edition by Stuart Russell and Peter Norvig. Pearson, 2010.

Overviews

Hobbs, Jerry R., and Ellen Riloff, 2010. ``Information Extraction.'' In N. Indurkhya and F. Damerau (eds.), Handbook of Natural Language Processing, CRC Press, Boca Raton, Florida, pp. 511-532.
Thorough and helpful survey by two pioneers in the field.
Russell and Norvig, Chapter 22.4
A very brief (9 pages) introduction to tasks and methods.
Sunita Sarawagi (2008). ``Information Extraction,'' in Foundations and Trends in Databases, 1(3), 261-377.
A thorough, insightful, and long introduction and survey.
Douglas E. Appelt and David J. Israel (1999). Introduction to Information Extraction Technology. A Tutorial Prepared for IJCAI-99.
Suchanek et al. (2011). Semantic knowledge bases from web sources. Tutorial given at IJCAI 2011.
(Video also available.) A comprehensive survey of tasks and systems for web-scale knowledge generation.
See also Fabian M. Suchanek and Gerhard Weikum (2013). ``Knowledge Harvesting from Text and Web Sources.'' Tutorial at the International Conference on Data Engineering (ICDE 2013)
Eduard Hovy, Roberto Navigli, Simone Paolo Ponzetto (2013). Collaboratively built semi-structured content and Artificial Intelligence: The story so far. Artificial Intelligence, 194, 2013.
Introductory paper for an AIJ special issue on extraction and Wikipedia.
Hobbs, Jerry R., 2002. ``Information Extraction from Biomedical Text,'' Journal of Biomedical Informatics, Vol. 35, No. 4, August 2002, pp. 260-264.

Generative models for relation extraction

Rink, B. and Harabagiu, S. (2011). A generative model for unsupervised discovery of relations and argument classes from clinical texts. In Proc. EMNLP-11.
Yao, L., Haghighi, A., Riedel, S., and McCallum, A. (2011). Structured relation discovery using generative models. In Proc. EMNLP-11.
See also Pasula et al. (2003).

Bootstrapping

Brin, Sergey (1999). Extracting Patterns and Relations from the World Wide Web. Technical Report 1999-65, Stanford University Infolab.
Jones, R., McCallum, A., Nigam, K., and Riloff, E. (1999). Bootstrapping for text learning tasks. In Proc. IJCAI-99 Workshop on Text Mining.
These papers introduced the basic idea of finding linguistic patterns that express facts and finding facts expressed by those patterns.

Entity extraction and entity resolution/Identity uncertainty

Hanna Pasula, Bhaskara Marthi, Brian Milch, Stuart Russell, and Ilya Shpitser, ``Identity Uncertainty and Citation Matching.'' In Advances in Neural Information Processing Systems 15, MIT Press, 2003.
An open-universe generative probability model of researchers, papers, and citations, showing how entity resolution happens automatically in such a model.
Xin Dong, Alon Y. Halevy and Jayant Madhavan: Reference Reconciliation in Complex Information Spaces. SIGMOD 2005.
Describes a practical system aimed at extracting personal information from a user's own files.
Fatiha Sais, Nathalie Pernelle and Marie-Christine Rousset (2009). Combining a Logical and a Numerical method for Reference Reconciliation. Journal of Data Semantics, Volume 12.
Fabian M. Suchanek, Serge Abiteboul, and Pierre Senellart, ``PARIS: Probabilistic Alignment of Relations, Instances, and Schema.'' 38th International Conference on Very Large Databases (VLDB 2012) PVLDB Journal, Volume 5, Number 3, November 2011.
Joel Nothman, Nicky Ringland, Will Radford, Tara Murphy, James R. Curran (2013). Learning multilingual named entity recognition from Wikipedia. Artificial Intelligence, 194, 151-175, 2013.

Ontology, time, and events

Russell and Norvig, Chapter 12 (sections 12.1-12.5)
Describes a high-level ontology for everything, with a focus on categories and events.
Johannes Hoffart, Fabian M. Suchanek, Klaus Berberich, Gerhard Weikum (2013). YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia. Artificial Intelligence, 194, 28-61, 2013.
Pays special attention to the time at which events occur and facts hold.
Vivi Nastase, Michael Strube (2013). Transforming Wikipedia into a large scale multilingual concept network. Artificial Intelligence, 194, 62-85, 2013.
Creating categories and taxonomies: leveraging a good deal of human curation and semi-structured knowledge may yield higher quality.

Systems (see also YAGO2, SEMEX, etc.)

Etzioni, O., Banko, M., Soderland, S., and Weld, D. S. (2008). ``Open Information Extraction from the Web.'' CACM 51(12).
Motivates the importance of the task and suggests some plausible approaches taken in the authors' projects.
See also O. Etzioni, A. Fader, J. Christensen, S. Soderland, and Mausam (2011). Open Information Extraction: the Second Generation. In Proc. International Joint Conference on Artificial Intelligence.
and H. Poon, J. Christensen, P. Domingos, O. Etzioni, R. Hoffmann, C. Kiddon, Mausam, A. Ritter, S. Soderland, D. Weld, F. Wu, T. Lin, X. Ling, S. Schoenmackers (2010). Machine Reading at the University of Washington. Annual Conference of the North American Chapter of the Association for Computational Linguistics.
A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E.R. Hruschka Jr. and T.M. Mitchell (2010). Toward an Architecture for Never-Ending Language Learning. In Proceedings of the Conference on Artificial Intelligence (AAAI), 2010.
NELL has developed over several years and combines several interesting ideas to improve performance over time.
Michael Wick, Andrew McCallum, Gerome Miklau (2010). Scalable Probabilistic Databases with Factor Graphs and MCMC. Proc. VLDB 2010.
Probably the most comprehensive probabilistic approach to information extraction.

Probability models

Russell and Norvig, Chapter 14
Introduction to Bayesian networks.
Russell and Norvig, Chapter 15
Introduction to temporal models (HMMs, dynamic Bayesian networks).
Russell and Norvig, Chapter 22.1-2, Chapter 23.1-3
Introduction to n-gram models and grammars.
Isabelle Tellier (2013). Markov Random Fields for Information Extraction (draft). Chapter 6 of Gaussier and Yvon. Wiley 2013.
Covers HMMs, CRFs, MRFs, as well as an overview of IE tasks and approaches.
Sato, T. and Kameya, Y.: New advances in logic-based probabilistic modeling by PRISM. In Probabilistic Inductive Logic Programming, LNCS 4911, Springer, pp. 118-155, 2008.
PRISM is a closed-universe probabilistic modelling language and inference engine based on Prolog; this paper includes application to PCFGs.
Brian Milch and Stuart Russell, ``Extending Bayesian Networks to the Open-Universe Case.'' In Rina Dechter, Hector Geffner, and Joseph Y Halpern, Eds., Heuristics, Probability and Causality. A Tribute to Judea Pearl. College Publications, 2010.
Describes a formal language (BLOG) for open-universe probability models.
