Image of ‘‘microsoft’’

The graph shows an image of the term ‘‘microsoft’’ in all New York Times headlines (Jan 1985 to Dec 2007), obtained via sparse classification.

We used the sparse logistic regression algorithm described in Koh, Kim and Boyd (2007), on a sliding window of one year, moved in yearly increments. For every year, we took the past year’s worth of headlines and computed a sparse logistic regression classifier that separates the headlines containing the term ‘‘microsoft’’ from those not containing it. (Each headline is a document unit, and the text is represented as a bag of words, after removing stop words, capitalization and punctuation.) Thus, for each year, we obtain a short list of terms, with associated weights, that are predictors of the appearance of the query term in a headline. This results in a matrix of weights, with each column corresponding to a year and each row to a word. If we arrange the rows of the matrix so that words are listed in order of first appearance, the matrix has a staircase pattern. Tracing any row reveals the corresponding term blinking in and out over time as a salient descriptor of ‘‘microsoft’’.
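The procedure above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used for the figure: the L1-regularized problem is solved here with a plain proximal-gradient loop rather than the method of Koh, Kim and Boyd (2007); the stop-word list, the toy headlines, the parameter values, and the function names are all invented for the example; and dropping the query term from the features (so that it cannot predict itself) is an assumption the text does not spell out.

```python
import math
import re

# Tiny illustrative stop-word list (an assumption; a real list would be much longer).
STOP_WORDS = {"the", "a", "of", "to", "in", "and", "for", "on", "is"}

def tokenize(headline):
    """Lowercase, strip punctuation, drop stop words; return the set of terms."""
    words = re.findall(r"[a-z0-9]+", headline.lower())
    return {w for w in words if w not in STOP_WORDS}

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def soft_threshold(x, t):
    """Proximal operator of t * |x|: shrink x toward zero by t."""
    return math.copysign(max(abs(x) - t, 0.0), x)

def sparse_logreg(docs, labels, lam=0.05, step=0.5, iters=500):
    """L1-regularized logistic regression via proximal gradient steps.

    docs: list of sets of terms (binary bag of words); labels: +1 or -1.
    Returns only the terms with nonzero weight. The intercept is not penalized.
    """
    vocab = sorted(set().union(*docs))
    w = {t: 0.0 for t in vocab}
    b = 0.0
    n = len(docs)
    for _ in range(iters):
        grad_w = {t: 0.0 for t in vocab}
        grad_b = 0.0
        for doc, y in zip(docs, labels):
            margin = y * (b + sum(w[t] for t in doc))
            g = -y * sigmoid(-margin) / n  # gradient of the average logistic loss
            grad_b += g
            for t in doc:
                grad_w[t] += g
        b -= step * grad_b
        for t in vocab:
            w[t] = soft_threshold(w[t] - step * grad_w[t], step * lam)
    return {t: v for t, v in w.items() if v != 0.0}

def image_of_term(headlines_by_year, query, **kwargs):
    """Fit one sparse classifier per year; return {year: {term: weight}}."""
    weights = {}
    for year, headlines in sorted(headlines_by_year.items()):
        docs, labels = [], []
        for h in headlines:
            terms = tokenize(h)
            labels.append(1 if query in terms else -1)
            docs.append(terms - {query})  # assumption: the query is not a feature
        weights[year] = sparse_logreg(docs, labels, **kwargs)
    return weights

# Invented toy headlines, grouped by the year-long window they fall in.
toy = {
    1999: ["Microsoft judge rules in antitrust case", "Judge hears Microsoft case",
           "Stocks fall on rate fears", "Mayor opens new park"],
    2005: ["Microsoft ships new Xbox", "Xbox sales lift Microsoft",
           "Stocks rise on earnings", "Judge rules in tax case"],
}
result = image_of_term(toy, "microsoft")
for year, col in result.items():
    print(year, sorted(col, key=lambda t: -col[t])[:3])
```

Each entry of `result` plays the role of one column of the weight matrix. On this toy data, terms shared by the positive headlines of a window (such as `judge` in 1999 and `xbox` in 2005) receive positive weights; the penalty `lam` controls how short each yearly list is.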

A visualization of the matrix of weights is shown in the figure above. In the figure, each rectangle indicates that the word is an important feature for that year. (We show only the top 40 words, ranked by their overall weight in the matrix.) The darkness of a rectangle indicates the word’s weight relative to the other words selected. The vertical axis corresponds to the terms in the resulting collection of lists, in order of first appearance. Thus, the plot shows a staircase pattern.
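The row arrangement described here can be sketched as follows. The snippet assumes the weights are stored as a hypothetical `{year: {term: weight}}` dictionary holding only the selected (nonzero) terms; it further assumes that "overall weight" means the sum of absolute weights across years and that darkness is the weight divided by the largest weight selected in the same year. Neither rule is specified in the text, and the numbers below are made up for illustration.

```python
def staircase_rows(weights_by_year, top_k=40):
    """Return (years, rows, matrix) with rows ordered by first appearance."""
    years = sorted(weights_by_year)

    # Overall weight of a term: sum of absolute weights over all years (assumed rule).
    total = {}
    for y in years:
        for term, w in weights_by_year[y].items():
            total[term] = total.get(term, 0.0) + abs(w)
    kept = sorted(total, key=lambda t: (-total[t], t))[:top_k]

    # First year in which each kept term is selected; ties are broken alphabetically.
    first = {t: next(y for y in years if t in weights_by_year[y]) for t in kept}
    rows = sorted(kept, key=lambda t: (first[t], t))

    # Darkness in [0, 1]: weight relative to the largest weight in that year (assumed rule).
    matrix = []
    for t in rows:
        row = []
        for y in years:
            col = weights_by_year[y]
            peak = max((abs(v) for v in col.values()), default=0.0)
            row.append(abs(col.get(t, 0.0)) / peak if peak else 0.0)
        matrix.append(row)
    return years, rows, matrix

# Made-up weights, for illustration only.
weights = {
    1990: {"company": 0.8, "profit": 0.4},
    1995: {"windows": 1.0, "software": 0.5},
    2000: {"judge": 0.9, "software": 0.6, "case": 0.3},
    2005: {"xbox": 0.7, "software": 0.7},
}
years, rows, matrix = staircase_rows(weights)

# Crude text rendering: '#' marks a selected term, '.' an absent one.
for term, row in zip(rows, matrix):
    print(f"{term:>10} " + "".join("#" if d > 0 else "." for d in row))
```

Because rows are sorted by the year in which each term first enters a list, the leftmost mark of each row moves to the right as one reads down, which is the staircase pattern.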

The list appears to provide an accurate summary of Microsoft, with the top prize going to ‘‘software’’ (the long-term focus of the company) and ‘‘xbox’’ (its most recent best-selling product). The figure is not simply a summary: it provides a timeline of the evolution of the company that is consistent with common knowledge, and indeed with the Wikipedia entry on Microsoft. The initial terms refer to a high-growth corporation (with terms like ‘‘company’’, ‘‘net’’, ‘‘profit’’, ‘‘doubled’’). The list of words then visits terms related to products, from ‘‘lotus’’ to ‘‘windows’’ to ‘‘xbox’’. Another important topic involves legal terms (‘‘case’’, ‘‘judge’’, ‘‘settle’’), with a reference to the famous anti-trust case in Europe. More recently, the terms reflect the growing importance of the Internet (‘‘web’’) and media (‘‘broadcast’’). Throughout, the names of important related companies are mentioned: ‘‘intuit’’, ‘‘apple’’, ‘‘google’’.

Reading the plot vertically reveals the main topics for a particular year. For example, the 2002-2003 column highlights terms like ‘‘anti-trust’’, ‘‘europe’’, and ‘‘software’’. Reading it horizontally pinpoints terms that keep recurring in the news over a long stretch of time (‘‘software’’ is the most constant reference over time).
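Both ways of reading the plot correspond to simple queries on the weight matrix. The sketch below again assumes a hypothetical `{year: {term: weight}}` layout with made-up numbers; the function names are illustrative only.

```python
def terms_for_year(weights_by_year, year, k=3):
    """Vertical reading: the k terms with the largest absolute weight in one year."""
    col = weights_by_year[year]
    return sorted(col, key=lambda t: (-abs(col[t]), t))[:k]

def persistence(weights_by_year):
    """Horizontal reading: number of years in which each term is selected."""
    counts = {}
    for col in weights_by_year.values():
        for term in col:
            counts[term] = counts.get(term, 0) + 1
    return counts

# Made-up weights, for illustration only.
weights = {
    1995: {"windows": 1.0, "software": 0.5},
    2000: {"judge": 0.9, "software": 0.6, "case": 0.3},
    2005: {"xbox": 0.7, "software": 0.7},
}
top_2000 = terms_for_year(weights, 2000, k=2)
counts = persistence(weights)
most_persistent = max(counts, key=counts.get)
print(top_2000, most_persistent, counts[most_persistent])
```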