Sparse Machine Learning for Large-Scale Text Analytics and Applications

Consider a large data set of documents (emails, Q&As, news, etc.). What are the most important keywords? What are examples of questions (or answers) that illustrate how these keywords are used? What are the important clusters of documents? Can we automatically name or tag them? Can we summarize the difference between two sets of documents? Can we even compare them quickly if they are written in different languages, without having to translate everything? How can these summarization techniques support user-friendly visualizations?

In this lecture I will describe a set of recently developed machine learning techniques for large-scale text analytics that are based on the idea of sparsity. These highly scalable techniques make it possible to address some of the challenges raised above. I will provide examples from specific data sets (flight reports from commercial pilots in the US, and news data), and I hope to encourage a discussion on the relevance of large-scale text analytics for online learning.