CS 288: Statistical Natural Language Processing, Spring 2010
Assignment 1: Language Modeling
Due: February 2nd
First, make sure you can access the course materials. The relevant files are:
code1.zip : the Java source code provided for this course
data1.zip : the data sets used in this assignment
The authentication restrictions are due to licensing terms. The username and password should have been mailed to the account you listed with the Berkeley registrar. If for any reason you did not get it, please let me know.
Unzip the source files to your local working directory. Some of the classes and packages won't be relevant until later assignments, but feel free to poke around. Make sure you can compile the entirety of the course code without errors. If you get warnings about unchecked casts, ignore them - that's a Java 1.5 issue. If you cannot get the code to compile, email, stop by office hours, or post to the newsgroup. If you are at the source root (i.e. your current directory contains only the directory 'edu'), you can compile the provided code with
javac -d classes */*/*/*.java */*/*/*/*.java
You can then run a simple test file by typing
java -cp classes edu.berkeley.nlp.Test
You should get a confirmation message back. You may wish to use an IDE such as Eclipse (I recommend it). If so, it is expected that you be able to set it up yourself.
Next, unzip the data into a directory of your choice.
For this assignment, the first Java file to inspect is edu/berkeley/nlp/assignments/LanguageModelTester.java.
Try running it with:
java edu.berkeley.nlp.assignments.LanguageModelTester -path DATA -model baseline
where DATA is the directory containing the contents of the data zip.
If everything's working, you'll get some output about the performance of a baseline language model being tested. The code is reading in some newswire and building a basic unigram language model that I've provided. This is a phenomenally bad language model, as you can see from the strings it generates - you'll improve on it.
In this assignment, you will construct several language
models and test them with the provided harness.
Take a look at the main method of LanguageModelTester.java, and its output.
Training: Several data objects are loaded by the
harness. First, it loads about 1M words of WSJ text (from the Penn treebank,
which we'll use again later). These sentences have been "speechified", for
example translating "$" to "dollars", and tokenized for you. The WSJ data
is split into training data (80%), validation (held-out) data (10%), and test
data (10%). In addition to the WSJ text, the harness loads a set of speech
recognition problems (from the HUB data set). Each HUB problem consists of a set
of candidate transcriptions of a given spoken sentence. For this
assignment, the candidate list always includes the correct transcription and
never includes words unseen in the WSJ training data. Each candidate
transcription is accompanied by a pre-computed acoustic score, which represents
the degree to which an acoustic model matched that transcription. These
lists are stored in SpeechNBestList objects. Once all the WSJ data and HUB
lists are loaded, a language model is built from the WSJ training sentences (the
validation sentences are ignored entirely by the provided baseline language
model, but may be used by your implementations for tuning). Then, several tests
are run using the resulting language model.
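The core of the baseline training step described above can be sketched as follows. This is a hypothetical, self-contained illustration, not the course code: the class and method names are invented, and it simply counts unigrams over the training sentences (plus a per-sentence stop token) and normalizes the counts into probabilities.

```java
import java.util.*;

// Hypothetical sketch of a baseline unigram model: count word occurrences
// over the training sentences, then normalize counts into probabilities.
public class UnigramSketch {
    private final Map<String, Double> counts = new HashMap<>();
    private double total = 0.0;

    public void train(List<List<String>> sentences) {
        for (List<String> sentence : sentences) {
            for (String word : sentence) {
                counts.merge(word, 1.0, Double::sum);
                total += 1.0;
            }
            counts.merge("</S>", 1.0, Double::sum); // one stop token per sentence
            total += 1.0;
        }
    }

    public double probability(String word) {
        return counts.getOrDefault(word, 0.0) / total;
    }

    public static void main(String[] args) {
        UnigramSketch model = new UnigramSketch();
        model.train(Arrays.asList(
                Arrays.asList("the", "rate", "rose"),
                Arrays.asList("the", "dollar", "fell")));
        System.out.println(model.probability("the")); // 2 of 8 tokens -> 0.25
    }
}
```

Your own models will replace this with something smarter (n-gram histories, smoothing, held-out tuning), but the train-then-query shape stays the same.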
Evaluation: Each language model is tested in several ways. First, the harness calculates the perplexity of the WSJ test sentences. In the WSJ test data, there will be unknown words. Your language models should treat all entirely unseen words as if they were a single UNK token. This means that, for example, a good unigram model will actually assign a larger probability to each unknown word than to a known but rare word - this is because the aggregate probability of the UNK event is large, even though each specific unknown word itself may be rare. To emphasize, your model's WSJ perplexity score will not strictly speaking be the perplexity of the exact test sentences, but the UNKed test sentences (a lower number).
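The UNK convention above amounts to the following computation: perplexity is exp of the negative average log probability per token, where any out-of-vocabulary word is scored as the single UNK token. The sketch below is illustrative only (the model interface, the "<UNK>" symbol, and the toy distribution are assumptions, not the harness's API):

```java
import java.util.*;

// Sketch of perplexity with UNK handling: perplexity = exp(-(1/N) * sum log p(w_i)),
// where every word outside the training vocabulary is scored as "<UNK>".
public class PerplexitySketch {
    public static double perplexity(Map<String, Double> probs,
                                    Set<String> vocab,
                                    List<List<String>> testSentences) {
        double logProbSum = 0.0;
        int tokenCount = 0;
        for (List<String> sentence : testSentences) {
            for (String word : sentence) {
                String token = vocab.contains(word) ? word : "<UNK>";
                logProbSum += Math.log(probs.get(token));
                tokenCount++;
            }
        }
        return Math.exp(-logProbSum / tokenCount);
    }

    public static void main(String[] args) {
        // Toy distribution: two known words plus an aggregate UNK mass.
        Map<String, Double> probs = new HashMap<>();
        probs.put("the", 0.5);
        probs.put("rate", 0.3);
        probs.put("<UNK>", 0.2);
        Set<String> vocab = new HashSet<>(Arrays.asList("the", "rate"));
        List<List<String>> test = Collections.singletonList(
                Arrays.asList("the", "xylophone")); // "xylophone" is unseen
        System.out.println(perplexity(probs, vocab, test));
    }
}
```

Note how the unseen word gets the full UNK mass (0.2 here), which is larger than what a rare known word would receive - exactly the effect described above.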
Second, the harness will calculate the perplexity of the
correct HUB transcriptions. This number will, in general, be worse than
the WSJ perplexity, since these sentences are drawn from a different source.
Language models predict less well on distributions which do not match their
training data. The HUB sentences, however, will not contain any unseen words.
Third, the harness will compute a word error rate (WER) on the HUB recognition task. The code takes the candidate transcriptions, scores each one with the language model, and combines those scores with the pre-computed acoustic scores. The best-scoring candidates are compared against the correct answers, and WER is computed. The testing code also provides information on the range of WER scores which are possible: note that the candidates are only so bad to begin with (the lists are pre-pruned n-best lists). You should inspect the errors the system is making on the speech re-ranking task, by running the harness with the "-verbose" flag.
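The re-ranking step can be sketched as an argmax over combined log-scores. Everything in this snippet is an illustrative assumption - in particular, the interpolation weight on the language-model score is a tunable hyperparameter, not something the harness dictates:

```java
// Sketch of n-best re-ranking: combine each candidate's precomputed acoustic
// log-score with a weighted language-model log-score, and pick the argmax.
// The names and the weight are illustrative assumptions.
public class RerankSketch {
    public static int bestCandidate(double[] acousticLogScores,
                                    double[] lmLogScores,
                                    double lmWeight) {
        int best = 0;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < acousticLogScores.length; i++) {
            double combined = acousticLogScores[i] + lmWeight * lmLogScores[i];
            if (combined > bestScore) {
                bestScore = combined;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] acoustic = {-10.0, -10.5}; // candidate 0 sounds slightly better
        double[] lm = {-8.0, -6.0};         // but candidate 1 is more fluent
        System.out.println(bestCandidate(acoustic, lm, 1.0)); // prints 1
    }
}
```

A better language model shifts which candidate wins these comparisons, which is exactly where WER improvements (or regressions) come from.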
Finally, the harness will generate sentences by randomly sampling your language models. The provided language model's outputs aren't even vaguely like well-formed English, though yours will hopefully be a little better. Note that improved fluency of generation does not mean improved modeling of unseen sentences.
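Sampling from a unigram model is just inverse-CDF sampling repeated until a stop token is drawn. The sketch below is a toy illustration (the distribution and the "</S>" stop symbol are assumptions for this example):

```java
import java.util.*;

// Sketch of generation by sampling: draw words from a unigram distribution
// (inverse-CDF over the cumulative probabilities) until a stop token appears.
public class SamplerSketch {
    public static String sampleWord(Map<String, Double> probs, Random rng) {
        double r = rng.nextDouble();
        double cumulative = 0.0;
        String last = null;
        for (Map.Entry<String, Double> e : probs.entrySet()) {
            cumulative += e.getValue();
            last = e.getKey();
            if (r < cumulative) return e.getKey();
        }
        return last; // guard against floating-point rounding in the sum
    }

    public static List<String> sampleSentence(Map<String, Double> probs, Random rng) {
        List<String> sentence = new ArrayList<>();
        String word;
        while (!(word = sampleWord(probs, rng)).equals("</S>")) {
            sentence.add(word);
        }
        return sentence;
    }

    public static void main(String[] args) {
        Map<String, Double> probs = new LinkedHashMap<>();
        probs.put("the", 0.4);
        probs.put("market", 0.3);
        probs.put("</S>", 0.3);
        System.out.println(sampleSentence(probs, new Random(42)));
    }
}
```

With context-free draws like these you get word salad, which is why the baseline's samples look so bad; conditioning on history is what makes samples start to resemble English.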
Experiments: You will implement several language
models, though you can choose which specific ones to try out. Along the way you
must build the following:
Write-ups: For this assignment, you should turn in a write-up of the work you've done, but not the code (it is sometimes useful to mention code choices or even snippets in write-ups, and this is fine). The write-up should specify what models you implemented and what significant choices you made. It should include tables or graphs of the perplexities, accuracies, etc., of your systems. It should also include some error analysis - enough to convince me that you looked at the specific behavior of your systems and thought about what it's doing wrong and how you'd fix it. There is no set length for write-ups, but a ballpark length might be 3-4 pages, including your evaluation results, a graph or two, and some interesting examples. I'm more interested in knowing what observations you made about the models or data than having a reiteration of the formal definitions of the various models.

Random Advice: In edu.berkeley.nlp.util there are some classes that might be of use - particularly the Counter and CounterMap classes. These make dealing with word-to-count and history-to-word-to-count maps much easier.
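To see why a history-to-word-to-count structure is convenient, here is a self-contained sketch of the same idea built on plain nested maps (this is not the course's Counter/CounterMap API, just an illustration of the shape it saves you from writing by hand):

```java
import java.util.*;

// Self-contained sketch of the CounterMap idea: a history -> word -> count
// table, which is the natural shape for bigram statistics.
public class BigramCounts {
    private final Map<String, Map<String, Double>> table = new HashMap<>();

    public void increment(String history, String word) {
        table.computeIfAbsent(history, h -> new HashMap<>())
             .merge(word, 1.0, Double::sum);
    }

    // Conditional probability p(word | history) from raw counts (no smoothing).
    public double probability(String history, String word) {
        Map<String, Double> row = table.get(history);
        if (row == null) return 0.0;
        double total = 0.0;
        for (double c : row.values()) total += c;
        return row.getOrDefault(word, 0.0) / total;
    }

    public static void main(String[] args) {
        BigramCounts counts = new BigramCounts();
        counts.increment("the", "rate");
        counts.increment("the", "rate");
        counts.increment("the", "dollar");
        System.out.println(counts.probability("the", "rate")); // 2/3
    }
}
```

The provided Counter and CounterMap classes wrap this bookkeeping (and normalization) for you, so prefer them over hand-rolled nested maps in your actual implementations.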
What will impact your grade is the degree to which you can present what you did clearly and make sense of what's going on in your experiments using thoughtful error analysis. When you do see improvements in WER, where are they coming from, specifically? Try to localize the improvements as much as possible. Some example questions you might consider: Do the errors that are corrected by a given change to the language model make any sense? Are there changes to the models which substantially improve perplexity without improving WER? Do certain models generate better text? Why? Similarly, you should do some data analysis on the speech errors that you cannot correct. Are there cases where the language model isn't selecting a candidate which seems clearly superior to a human reader? What would you have to do to your language model to fix these cases? For these kinds of questions, it's actually more important to sift through the data and find some good ideas than to implement those ideas. The bottom line is that your write-up should include concrete examples of errors or error-fixes, along with commentary.