This page is meant to be a brief follow-up to our paper Painless Unsupervised Learning with Features, Taylor Berg-Kirkpatrick, Alexandre Bouchard-Côté, John DeNero, and Dan Klein. At NAACL a lot of people had questions about why the direct gradient approach (applying LBFGS directly to the data likelihood) consistently outperformed EM in terms of accuracy. We wondered this ourselves, and after the talk I mentioned a hypothesis having to do with coarse features and when their weights gain magnitude during the learning process. Since the talk, I've run a few more experiments and here I'll show some data that supports the hypothesis.
What do I mean by coarse features? Coarse features are active for multiple outcomes. The basic indicator features, that duplicate the naive multinomial parameterization, are each active for a single multinomial outcome and are thus the finest-grained features possible. The rich features that we add to tie together probabilities in the model do so by activating for multiple outcomes. For example, we added a feature to the emission distribution of the HMM that activates for all words that end in 'ing'. This is an example of a coarse feature. In contrast, we also had a feature that activated only for the word 'running'. This is a fine feature.
The coarse features are designed so that putting weight on them is correlated with producing consistent and accurate analyses. For example, assigning large weight to the coarse feature that activates on all words that end in 'ing' given some latent POS class will ensure that most words with tense morphology 'ing' are in this same class. Thus, we predict that using the coarse features effectively can lead to high accuracy. While regularization should encourage the use of coarse features, in practice we found it blunt. Instead, we found that the choice of learning procedure had a more interesting effect.
When we plug any supervised feature-based maxent model into a gradient-based likelihood optimizer the trajectory is as follows: first frequent (coarse) features are over represented, and then, as learning progresses the likelihood term is fine-tuned through negotiation between the fine and coarse features. We know this from experience. In practice, we see weights for coarse features getting a sharp gain in magnitude early in learning and then dropping later on.
When training an unsupervised model with features using EM, each M-step is like a supervised maxent problem trained using a gradient-based approach, and thus each individual M-step should follow the mentioned trajectory. This means that at the beginning of each M-step we should see coarse feature weights gaining magnitude, and then at the end of each M-step we should see these weights diminishing. It also means that even on the first M-step EM will begin fine-tuning the likelihood using the finer-grained features.
We expect something different from an unsupervised model with features trained by directly optimizing the marginal data likelihood. In particular, since each step along the gradient of the data likelihood is actually the first step of some M-step (and we expect the first part of any M-step to put the most weight on coarse features), we expect the direct gradient approach to focus on the coarse features for longer before it begins to fine-tune.
Putting this all together, it seems reasonable to hope that the direct gradient optimizer might find a local optimum where coarse features can be used more effectively. This is because it will put weight on coarse features when the basin is still being determined. In contrast, EM will enter into a fine-tuning stage and commit to settings of the finer-grained features in the very first M-step, before things have been figured out and thus may get stuck in a bad basin.
We ran an experiment for unsupervised POS induction where we recorded the magnitudes of the different kinds of features (coarse and fine) at each iteration of learning for both the direct gradient method (LBFGS) and EM.
Figure 1 shows the sum of the squared weights of coarse features (active for more than 50 outcomes) against the learning iteration. For EM, each step taken inside each M-step counts as an iteration. The plot of coarse feature weights norm against iteration for EM is given in red, and the same for the direct gradient approach is shown in green. For EM, the iterations at which new E-steps occurred are marked in blue. These mark the boundaries between M-steps. For contrast, Figure 2 shows the same information for fine-grained features.
It appears that within each M-step, coarse feature weights increase and then decrease, giving the visible bumps in the curve. Since there are many M-steps before convergence, this happens many times during learning. The first M-step is the most extreme, seeing a sharp increase in coarse feature weights, followed by a deep plunge. These results fit with our predictions.
Applying LBFGS directly to the data likelihood yielded a different trajectory from that of EM. While overall we still see a rise and fall of coarse feature weights, this happens on a large scale and only once over the course of learning, compared to the multiple bumps seen in EM's trajectory. As predicted, the direct gradient approach puts a lot of weight on coarse features early in learning. Finally, the local optimum found by the direct gradient approach assigns larger magnitude to coarse features than does the local optimum found by EM. This again fits the hypothesis.
Figure 3 plots many-to-1 accuracy of the induced POS tags against the learning iteration. Comparing the graph of coarse feature weights norm to the graph of accuracy, it appears that the two are in close correspondence. This again fits with our hypothesis. Also, as we found in the experiments reported in the paper, the direct gradient approach outperforms EM in terms of accuracy. This finding is in agreement with our hypothesis that the direct gradient approach finds a basin early in learning where coarse features can be used effectively.
Figure 4 plots the regularized marginal data likelihood against the learning iteration for both the direct gradient approach and EM. In this experiment the direct gradient approach converges to a local optimum with a better data likelihood than that of the local optimum EM converges to. However, this was not the case in general. After the NAACL talk I mentioned that we did not find that the direct gradient approach consistently achieved better data likelihood than EM did. Initializing the unsupervised POS inducer with 10 random seeds yielded accuracies that were significantly better for the direct gradient approach when compared to EM. However, the difference in mean data likelihood was very small compared to the variance. Perhaps evaluating a much larger number of random seeds could show a statistically significant difference in data likelihood.