# jemdoc: menu{MENU}{index.html}, showsource
= Discovering Word Associations in News Media via Feature Selection and Sparse Classification
- *Download:* [Pubs/MIR2010.pdf .pdf]
- *Authors:* B. Gawalt, J. Jia, L. Miratrix, L. El Ghaoui, B. Yu, S. Clavier.
- *Status:* In /Proc. [http://riemann.ist.psu.edu/mir2010/ 11th ACM SIGMM International Conference on Multimedia Information Retrieval]/, 2010.
- *Abstract:*
Our perceptions of the world are largely shaped by news media. Understanding how media portray certain words and terms is a critical step towards assessing media's influence on those
perceptions. In this paper we analyze the ``image" of a given query word in a given corpus of text
news by producing a short list of other words with which this query is strongly associated. We use
a number of feature selection schemes for text classification to help with this task. We apply these
classification techniques with indicators of the query word's appearance in each document serving as
the document ``labels'' and the indicators for all other words as document predictors/features. The
features selected by a given scheme are then taken as the list of words comprising the query word's
``image''. To be easily understandable, a list should be extremely short relative to the
dictionary of terms present in the corpus. The approach thus requires aggressive feature (word)
selection in order to single out at most a few tens of terms in a universe of hundreds of
thousands or more. In addition, a word imaging scheme should scale well with the size of data
(number and size of documents, size of dictionary). We produce one scheme for feature selection
through a sparse classification model. A standard classification algorithm assigns one weight per
term in the predictor dictionary so as to maximize its ability to predict the
labels of the document units. By imposing a sparsity constraint on the weight vector, we single out
the few words that are most predictive of the presence or absence of the query word in any
document. This paper compares this scheme with several others that are potentially well suited to
the task of word imaging, each offering a different manner of feature selection.
We present two evaluations of these schemes. One evaluates the predictive classification
performance of a logistic regression model trained over the corpus using only a scheme's selected
features. The other is based on the judgment of human readers: a pair of word lists generated by
different schemes operating on an identical query is presented to a human subject alongside a trio
of document units (paragraphs) containing the query word. The subject then chooses the list that,
in their estimation, best summarizes the document units.
We apply these schemes to study the images of frequently covered countries and regions in recent
news articles from the International section of the New York Times. Our preliminary experiments
indicate that, while most methods perform similarly well on our data in terms of predictive
performance, human-based evaluations appear to favor features selected by training a sparse
logistic regression, a penalized variant of logistic regression that encourages sparse
classifiers. This suggests that classification metrics based on pure predictive performance,
while useful as an indicator for preselecting algorithms, are not enough to predict human
assessment of word association algorithms.
#- *Related entries:*
- *Bibtex reference:*
~~~
{}{}
@conference{GJMEYC:10,
Author = {B. Gawalt and J. Jia and L. Miratrix and L. {El Ghaoui} and B. Yu and S. Clavier},
Title = {Discovering Word Associations in News Media
via Feature Selection and Sparse Classification},
Booktitle = {Proc. 11th ACM SIGMM International
Conference on Multimedia Information Retrieval},
Year = 2010
}
~~~