CS 288: Statistical Natural Language Processing, Spring 2011
Assignment 1: Language Modeling
Due: February 3rd
Setup
First, make sure you can access the course materials. The
components are:
code1.tar.gz : the Java source code provided for this course
data1.tar.gz : the data sets used in this assignment
The authentication restrictions are due to licensing terms. The username and
password should have been mailed to the account you listed with the Berkeley
registrar. If for any reason you did not receive them, please let us know.
The source archive contains four files:
assign1.jar : the provided classes and source code (most classes have source attached, but some do not)
build_assign1.xml : an ant script that you will use to compile the .jar file you submit for grading
the two remaining files : stubs of the classes you will need to implement
You may wish to use an IDE such as Eclipse to link the .jar file and browse the source (we recommend it). In general, you are expected to set up your development environment yourself, but for this first assignment we provide setup instructions using Eclipse.
At this point, your code should all compile, and if you are using Eclipse, you can browse
the classes provided in assign1.jar by looking under "Referenced Libraries". You can also run
a simple test by running
java -cp assign1.jar edu.berkeley.nlp.Test
You should get a confirmation message back.
The testing harness
we will be using is LanguageModelTester (in the edu.berkeley.nlp.assignments.assign1 package).
To run it, first unzip the data archive to a local directory ($PATH). Then, build the submission jar using
ant -f build_assign1.xml
Finally, try running
java -cp assign1.jar:assign1-submit.jar -server -mx500m edu.berkeley.nlp.assignments.assign1.LanguageModelTester -path $PATH -lmType STUB
You will see the tester do some translation, which will take a couple of minutes and use
about 130M of memory, printing out translations along the way (printing of translations
can be turned off with -noprint). The tester will also print BLEU, an automatic metric
for measuring translation quality (bigger numbers are better; 60 is roughly human-level
accuracy); a sketch of the metric appears at the end of this section.
For the stub, the BLEU score should be terrible (15-16). The next step
is to include an actual language model. We've provided a model, which you can use by running
java -cp assign1.jar:assign1-submit.jar -server -mx500m edu.berkeley.nlp.assignments.assign1.LanguageModelTester -path $PATH -lmType UNIGRAM
Now, you'll see the tester read in around 9,000,000 sentences of monolingual data and build an
LM. Unfortunately, the unigram model doesn't really help, so you'll need to improve it by writing
a higher order language model.
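For orientation: standard BLEU is the geometric mean of the modified n-gram precisions p_n for n = 1..4, scaled by a brevity penalty BP that punishes candidate translations shorter than the reference. This is the textbook formulation; the harness's exact configuration may differ slightly:

\mathrm{BLEU} = \mathrm{BP} \cdot \exp\Big( \tfrac{1}{4} \sum_{n=1}^{4} \log p_n \Big), \qquad \mathrm{BP} = \min\big(1, e^{1 - r/c}\big)

where r is the total reference length and c is the total candidate length.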
Description
In this assignment, you will construct two different language
models and test them with the provided harness.
Take a look at the main method of LanguageModelTester.java, and its output.
Training: Several data objects are loaded by the
harness. First, it loads about 250M words of monolingual English text.
These sentences have been tokenized for you. In addition to the monolingual text,
the harness loads the data necessary to run a phrase-based statistical translation system
and a set of sentence pairs
to test the system on. The data for the MT system consists of a phrase table
(a set of scored translation rules)
and some pre-tuned weights to trade
off between the scores from the phrase table (known as the translation model) and
the scores from the language model. Once all the data is loaded, a language model is built
from the monolingual English text. Then, we test how well it works by incorporating
the language model into an MT system and measuring translation quality.
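Schematically, the decoder searches for the translation e of a foreign sentence f that maximizes a weighted combination of these scores. This is a simplification (the provided decoder's feature set may include additional components, such as length penalties), but it captures the role your language model plays:

\hat{e} = \arg\max_e \big[ \lambda_{\mathrm{TM}} \cdot \mathrm{score}_{\mathrm{TM}}(f, e) + \lambda_{\mathrm{LM}} \cdot \log P_{\mathrm{LM}}(e) \big]

The pre-tuned weights \lambda_{\mathrm{TM}} and \lambda_{\mathrm{LM}} are what the harness loads alongside the phrase table.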
Experiments: You will need to implement two language models:
an exact language model that directly reports the (appropriately smoothed) scores computed
from the training data, and a noisy language model that uses some sort of
approximation technique to create a language model that takes less
memory, but may not work quite as well. You should modify the classes ExactLmFactory and
NoisyLmFactory to produce the exact and noisy models, respectively. You are welcome
to create as many additional classes as you like, so long as those
two classes retain their names and continue to implement LanguageModelFactory; a minimal factory sketch follows this paragraph.
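As a starting point, a factory implementation might look like the minimal sketch below. The method signature and the LanguageModelFactory import path are assumptions based on the provided stubs (the package declaration is omitted; put the class wherever your stub lives), so check the interfaces in assign1.jar for the exact names.

import java.util.List;

import edu.berkeley.nlp.langmodel.LanguageModelFactory;
import edu.berkeley.nlp.langmodel.NgramLanguageModel;

public class ExactLmFactory implements LanguageModelFactory {
  // Hypothetical signature; verify against the interface in assign1.jar.
  public NgramLanguageModel newLanguageModel(Iterable<List<String>> trainingData) {
    // One pass over the sentences to build count tables, then wrap them
    // in a trigram Kneser-Ney scorer (see the sketch below) adapted to
    // the NgramLanguageModel interface.
    throw new UnsupportedOperationException("build and return your LM here");
  }
}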
For both types of language models, the basic scoring method should be a Kneser-Ney trigram model.
The particular implementation details are mostly up to you, though you are encouraged to
experiment with different options and see what works best. For the noisy model, you are free to
try any of the approximation methods discussed in class (or if you're feeling particularly ambitious,
ones you find in outside literature). In your write-up you should include a discussion of the
tradeoffs between memory usage and BLEU that you found in your experiments.
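To make the smoothing concrete, below is a toy, hash-map-backed sketch of interpolated Kneser-Ney. It is our own illustration, not the provided code, and is useful mainly for checking the math against brute-force counts; string-keyed HashMaps will not come close to fitting the memory budget described under Evaluation, so treat this as a reference implementation of the formulas only.

import java.util.HashMap;
import java.util.Map;

/** Toy interpolated Kneser-Ney trigram scorer (illustration only). */
class ToyKneserNey {
  static final double D = 0.75; // absolute discount; tune on held-out data

  // Raw counts for the highest order, filled in one pass over the
  // training sentences (population code omitted).
  Map<String, Long> trigramCount = new HashMap<>(); // c(u v w)
  Map<String, Long> bigramCount = new HashMap<>();  // c(u v)
  // Type counts ("N1+" quantities). followersOf is keyed by both
  // "u v" (N1+(u v *)) and "v" (N1+(v *)).
  Map<String, Long> followersOf = new HashMap<>();
  Map<String, Long> contBigram = new HashMap<>();   // N1+(* v w)
  Map<String, Long> contContext = new HashMap<>();  // N1+(* v *)
  Map<String, Long> contUnigram = new HashMap<>();  // N1+(* w)
  long bigramTypes;                                 // N1+(* *)

  double trigramProb(String u, String v, String w) {
    long ctx = bigramCount.getOrDefault(u + " " + v, 0L);
    if (ctx == 0) return bigramProb(v, w); // unseen context: back off
    long c = trigramCount.getOrDefault(u + " " + v + " " + w, 0L);
    double discounted = Math.max(c - D, 0.0) / ctx;
    // Mass freed by discounting, redistributed via the lower order:
    double lambda = D * followersOf.getOrDefault(u + " " + v, 0L) / ctx;
    return discounted + lambda * bigramProb(v, w);
  }

  double bigramProb(String v, String w) {
    // Lower orders use continuation (type) counts in place of raw counts;
    // this substitution is what distinguishes Kneser-Ney from plain
    // absolute discounting.
    long ctx = contContext.getOrDefault(v, 0L);
    if (ctx == 0) return unigramProb(w);
    long c = contBigram.getOrDefault(v + " " + w, 0L);
    double discounted = Math.max(c - D, 0.0) / ctx;
    double lambda = D * followersOf.getOrDefault(v, 0L) / ctx;
    return discounted + lambda * unigramProb(w);
  }

  double unigramProb(String w) {
    return contUnigram.getOrDefault(w, 0L) / (double) bigramTypes;
  }
}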
Evaluation:
Each language model is primarily tested by providing its scores to
a standard MT decoder and measuring the quality of the resulting translations.
In addition to translation quality, your language model will be evaluated for its speed
and memory usage. We will expect your exact language model to fit into about 900M, and the noisy
one to fit into about 600M.
Note that around 300M is used by the phrase table and the vocabulary (when we run
the unigram model, total memory usage is 348M), so at worst,
you should aim to make your exact language model fit in 1.2G of memory. However, we will allow the JVM
to use up to 2G of memory since some implementations may require additional scratch space
during language model construction.
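A common way to stay inside these budgets is to avoid per-n-gram objects entirely: map words to integer indices, encode each trigram as a single primitive long, and keep counts in primitive arrays or an open-addressing hash table. The helper below is a hypothetical sketch, assuming word indices fit in 21 bits (about 2 million word types):

final class NgramPacking {
  // Pack three 21-bit word indices into one 63-bit key, so trigram counts
  // can live in primitive long[]/int[] tables instead of maps of boxed
  // objects (often an order-of-magnitude memory saving).
  static long packTrigram(int w1, int w2, int w3) {
    return ((long) w1 << 42) | ((long) w2 << 21) | (long) w3;
  }

  // Example unpack: recover the last word's index from a packed key.
  static int lastWord(long key) {
    return (int) (key & 0x1FFFFF);
  }
}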
For speed, we measure decoding time with your language model plugged in.
When we autograde your submitted code, we will do two things. First, we will measure BLEU, memory
usage and decoding speed using the same testing harness you have, by running the two commands:
java -cp assign1.jar:assign1-submit.jar -server -mx2000m edu.berkeley.nlp.assignments.assign1.LanguageModelTester -path $PATH -lmType EXACT
java -cp assign1.jar:assign1-submit.jar -server -mx2000m edu.berkeley.nlp.assignments.assign1.LanguageModelTester -path $PATH -lmType NOISY
In addition, we will programmatically spot-check the stored counts for various n-grams. The exact model should return
all correct counts, whereas the noisy model should still be mostly correct,
but is permitted to make errors on at most 2% of the trigrams we query.
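Concretely, a spot check might look something like the sketch below. The getCount accessor is an assumption about the NgramLanguageModel interface, so verify the actual method in assign1.jar before relying on it.

import edu.berkeley.nlp.langmodel.NgramLanguageModel;

class CountSpotCheck {
  // Compare a stored count against one tallied directly from the training
  // text (getCount(int[]) is an assumed accessor). The exact model must
  // match on every query; the noisy model may mismatch on at most 2% of
  // the trigrams queried.
  static boolean matches(NgramLanguageModel lm, int[] trigram, long trueCount) {
    return lm.getCount(trigram) == trueCount;
  }
}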
Write-ups: For this assignment, in addition to submitting a compiled jar for
autograding according to the standard instructions,
you should turn in a write-up of the work
you've done. The write-up should specify what
models you implemented and what significant choices you made. It should
include tables or graphs of BLEU, runtime, memory, etc., of your systems.
It should also include some error analysis - enough to convince us that you
looked at the specific behavior of your systems and thought about what they are
doing wrong and how you'd fix them. There is no set length for write-ups, but a
ballpark might be 3-4 pages, including your evaluation results, a graph
or two, and some interesting examples. We're more interested in knowing what
observations you made about the models or data than in having you reiterate the
formal definitions of the various models.
What will impact your grade is the degree to which you can present what you did
clearly and make sense of what's going on in your experiments using thoughtful
error analysis. When you do see improvements in BLEU, where are they coming from,
specifically?
Try to localize the improvements as much as possible. Some example questions you
might consider: Do the errors that are corrected
by a given change to the language model make any sense? Why? You should also include some
discussion of what approximation techniques you tried for your noisy model, and what
worked and what didn't. What kind of tradeoffs did you observe?