Discovering Word Associations in News Media via Feature Selection and Sparse Classification

  • Authors: B. Gawalt, J. Jia, L. Miratrix, L. El Ghaoui, B. Yu, S. Clavier.

  • Abstract:

Our perceptions of the world are largely shaped by news media. Understanding how media portray certain words and terms is a critical step towards assessing media's influence on those perceptions. In this paper we analyze the "image" of a given query word in a given corpus of text news by producing a short list of other words with which this query is strongly associated. We use a number of feature selection schemes for text classification to help in this task. We apply these classification techniques using indicators of the query word's appearance in each document as the document "labels" and the indicators for all other words as document predictors/features. The features selected by any scheme are then considered the list of words comprising the query word's "image". To be easily understandable, a list should be extremely short with respect to the dictionary of terms present in the corpus. The approach thus requires aggressive feature (word) selection in order to single out at most a few tens of terms in a universe of hundreds of thousands or more. In addition, a word imaging scheme should scale well with the size of data (number and size of documents, size of dictionary). We produce one scheme for feature selection through a sparse classification model. A standard classification algorithm assigns one weight per term in the predictor dictionary, in order to maximize the capacity to successfully predict the labels of document units. By imposing a sparsity constraint on the weight vector, we single out the few words that are most able to predict the presence or absence of a query word in any document. This paper compares this and several other schemes that are potentially well suited to the task of word imaging, each method presenting a different manner of feature selection.

We present two evaluations of these schemes. One evaluates the predictive classification performance of a logistic regression model trained over the corpus using only a scheme's selected features. The other is based on the judgement of human readers: a pair of word lists, generated by different schemes operating on the same query, is presented to a human subject alongside a trio of document units (paragraphs) containing the query word. The subject then chooses the list that, in his/her estimation, best summarizes the document units.

We apply these schemes to study the images of frequently covered countries and regions in recent news articles from the International section of the New York Times. Our preliminary experiments indicate that, while most methods perform similarly well on our data in terms of predictive performance, human-based evaluations appear to favor features selected by training sparse logistic regression, a penalized variant of logistic regression that encourages sparse classifiers. This indicates that classification metrics based on pure predictive performance, while useful as an indicator for preselecting algorithms, are not enough to predict human assessment of word association algorithms.
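
As an illustration only, the sparse logistic regression idea described above can be sketched in a few lines of pure Python. This is not the authors' implementation: the toy corpus, the function names (build_dataset, sparse_logistic, word_image), the penalty value lam, and the choice of a proximal-gradient (soft-thresholding) solver are all assumptions made for the example. Each document is labeled by whether it contains the query word, every other word becomes a binary feature, and an L1 penalty drives most weights to exactly zero so that the surviving positive weights form the word "image".

```python
import math

def build_dataset(docs, query):
    """Labels mark documents containing the query; features are binary
    indicators for every other word in the (toy) corpus."""
    tokenized = [set(d.lower().split()) for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks} - {query})
    X = [[1.0 if w in toks else 0.0 for w in vocab] for toks in tokenized]
    y = [1.0 if query in toks else 0.0 for toks in tokenized]
    return vocab, X, y

def soft_threshold(v, t):
    """Proximal operator of t * |v|: shrink v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def sparse_logistic(X, y, lam=0.1, step=0.5, iters=500):
    """L1-penalized logistic regression fitted by proximal gradient descent.
    The intercept b is left unpenalized. The solver and the default
    hyperparameters are assumptions chosen for this small example."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(iters):
        grad_w, grad_b = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi  # predicted prob. minus label
            grad_b += err / n
            for j, xj in enumerate(xi):
                if xj:
                    grad_w[j] += err * xj / n
        b -= step * grad_b
        # Gradient step on the smooth loss, then shrinkage for the L1 term.
        w = [soft_threshold(wj - step * gj, step * lam)
             for wj, gj in zip(w, grad_w)]
    return w, b

def word_image(docs, query, lam=0.1, k=5):
    """Return up to k words with the largest positive weights."""
    vocab, X, y = build_dataset(docs, query)
    w, _ = sparse_logistic(X, y, lam=lam)
    ranked = sorted(((wj, word) for wj, word in zip(w, vocab) if wj > 0),
                    reverse=True)
    return [word for _, word in ranked[:k]]

# Hypothetical toy corpus: each string stands in for one document unit.
docs = [
    "china trade beijing talks",
    "china beijing olympics",
    "beijing china economy",
    "russia moscow gas",
    "moscow russia talks",
    "france paris economy",
]
print(word_image(docs, "china"))
```

Raising lam shortens the returned list and lowering it lengthens the list, which mirrors the requirement that an image be very short relative to the dictionary. On a corpus of realistic size, a dedicated solver would be needed in place of this dense pure-Python loop.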

  • Bibtex reference:

	Author = {B. Gawalt and J. Jia and L. Miratrix and L. {El Ghaoui} and B. Yu and S. Clavier},
	Title = {Discovering Word Associations in News Media
			via Feature Selection and Sparse Classification},
	Booktitle = {Proc. 11th ACM SIGMM International
	Conference on Multimedia Information Retrieval},
	Year = 2010