CS 288: Statistical Natural Language Processing, Spring 2011

Assignment 2: Phrase-Based Decoding

Due: February 17th
Setup
First, make sure you can access the course materials. The components are:
- code2.tar.gz: the Java source code provided for this course
- data2.tar.gz: the data sets used in this assignment
The authentication restrictions are due to licensing terms. The username and
password are the same as for Assignment 1.
The testing harness
we will be using is MtDecoderTester (in the edu.berkeley.nlp.assignments.assign2 package).
To run it, first unzip the data archive to a local directory ($PATH). Then, build the submission jar using
ant -f build_assign2.xml
Then, try running
java -cp assign2.jar:assign2-submit.jar -server -mx2000m edu.berkeley.nlp.assignments.assign2.MtDecoderTester -path $PATH -decoderType MONO_GREEDY
You will see the tester translate 100 French sentences, which will take a few seconds, printing out translations along the way (printing
can be turned off with -noprint). The provided decoder is a very simple monotonic decoder which greedily translates foreign phrases one at a time.
Its accuracy should be poor (a BLEU score of around 20), as should its model score (around -5216; see below for more about the model).
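For intuition, here is a minimal sketch of greedy monotonic decoding. The PhraseOption and PhraseTableSketch types are hypothetical stand-ins for the provided classes (the actual API in assign2.jar differs) and are reused by the later sketches on this page:

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical stand-ins for the provided classes; reused in later sketches.
    class PhraseOption {
      int foreignStart, foreignEnd;  // [start, end) span in the foreign sentence
      double tmScore;                // translation-model score of this phrase pair
      List<String> english;          // English side of the phrase pair
    }

    interface PhraseTableSketch {
      // All phrase pairs whose foreign side matches the span [start, end).
      List<PhraseOption> getOptions(int start, int end);
    }

    class GreedyMonoSketch {
      // Greedy monotonic decoding: at each foreign position, commit to the single
      // highest-scoring phrase pair starting there, then jump past it.
      static List<String> decode(int numForeignWords, PhraseTableSketch table) {
        List<String> english = new ArrayList<String>();
        int i = 0;
        while (i < numForeignWords) {
          PhraseOption best = null;
          for (int j = i + 1; j <= numForeignWords; j++) {
            for (PhraseOption opt : table.getOptions(i, j)) {
              if (best == null || opt.tmScore > best.tmScore) best = opt;
            }
          }
          // Assumes the table has an option (e.g. an unknown-word identity
          // translation) for every single word, so best is never null.
          english.addAll(best.english);
          i = best.foreignEnd;
        }
        return english;
      }
    }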
Description
In this assignment, you will implement several phrase-based decoders
and test them with the provided harness. Your decoder will attempt to
find the translation with maximum score under a linear model. This
linear model consists of three features: a trigram language model, a linear distortion model, and
a translation model which assigns scores to phrase pairs. The language model uses the data from Assignment 1, but
we have provided an implementation so you don't need to use your own (though you can if you wish). Each of
these features has a pre-tuned weight, so you should not need to worry about weight tuning (though you may experiment
if you wish). Constructing the translation model table and tuning the weights will be dealt with in later assignments.
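As a concrete illustration of how the three features combine, here is a hedged sketch of the model score of a complete derivation, reusing the hypothetical PhraseOption type from the sketch above. The weights and the exact distortion formula here are stand-ins for the pre-tuned values and the DistortionModel provided in assign2.jar:

    import java.util.List;

    // Sketch of the three-feature linear model score for a complete derivation.
    class LinearModelSketch {
      static double score(List<PhraseOption> derivation, double lmLogProb,
                          double wTm, double wLm, double wDist) {
        double tm = 0.0, dist = 0.0;
        int prevEnd = 0;  // end of the previously translated foreign span
        for (PhraseOption p : derivation) {
          tm += p.tmScore;                             // translation model
          dist -= Math.abs(p.foreignStart - prevEnd);  // linear distortion penalty
          prevEnd = p.foreignEnd;
        }
        return wTm * tm + wLm * lmLogProb + wDist * dist;
      }
    }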
You will need to implement four decoders of increasing complexity. The first is a
monotonic beam-search decoder with no language model. You should return an instance of such a decoder from MonotonicNoLmDecoderFactory.
Using our implementation with beams of size 2000, we decode the test set in 16 seconds, achieving a model score of -5276
and a BLEU score of 17.2. Note that this decoder performs poorly even compared to the greedy decoder, since it uses no language model.
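A minimal sketch of this first decoder, again using the hypothetical types from above: one beam per foreign position, where a hypothesis's only state is how many foreign words it covers:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.List;

    // Sketch of monotonic beam search with no LM. Beam i holds partial
    // derivations covering exactly the first i foreign words.
    class MonoNoLmSketch {
      static class Hyp {
        final double score; final Hyp prev; final PhraseOption phrase;
        Hyp(double s, Hyp p, PhraseOption ph) { score = s; prev = p; phrase = ph; }
      }

      static Hyp decode(int n, PhraseTableSketch table, int beamSize) {
        List<List<Hyp>> beams = new ArrayList<List<Hyp>>();
        for (int i = 0; i <= n; i++) beams.add(new ArrayList<Hyp>());
        beams.get(0).add(new Hyp(0.0, null, null));
        for (int i = 0; i <= n; i++) {
          List<Hyp> beam = beams.get(i);
          // Sort best-first and prune to the beam size.
          Collections.sort(beam, new Comparator<Hyp>() {
            public int compare(Hyp a, Hyp b) { return Double.compare(b.score, a.score); }
          });
          if (beam.size() > beamSize) beam.subList(beamSize, beam.size()).clear();
          if (i == n) break;
          // Extend every surviving hypothesis with every phrase starting at i.
          for (Hyp h : beam) {
            for (int j = i + 1; j <= n; j++) {
              for (PhraseOption opt : table.getOptions(i, j)) {
                beams.get(j).add(new Hyp(h.score + opt.tmScore, h, opt));
              }
            }
          }
        }
        List<Hyp> goal = beams.get(n);
        return goal.isEmpty() ? null : goal.get(0);
      }
    }

In fact, with no language model the single best hypothesis per position suffices (the search is plain dynamic programming); the beam structure is shown because the later decoders extend it.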
The second decoder, which should be constructed in MonotonicWithLmDecoderFactory, should implement monotonic beam search with
an integrated trigram language model. This will slow the search, but model score and BLEU will go up significantly.
Using our implementation with beams of size 2000, we decode the test set in 2 minutes, achieving a model score of -3781
and a BLEU score of 26.5.
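The key change is to the search state: with a trigram LM, two hypotheses covering the same number of foreign words can only be recombined if they end in the same two English words. Here is a sketch of such a state, with a hypothetical LmSketch interface standing in for the provided language model API:

    import java.util.Arrays;
    import java.util.List;

    // Hypothetical stand-in for the provided trigram LM scoring call.
    interface LmSketch {
      double trigramLogProb(String w1, String w2, String w3);
    }

    // Search state for LM-integrated monotonic decoding: hypotheses recombine
    // only when both the coverage count and the two-word LM context match.
    class LmStateSketch {
      final int wordsCovered;      // number of foreign words translated so far
      final String[] lmContext;    // last two English words produced

      LmStateSketch(int wordsCovered, String[] lmContext) {
        this.wordsCovered = wordsCovered;
        this.lmContext = lmContext;
      }

      public boolean equals(Object o) {
        if (!(o instanceof LmStateSketch)) return false;
        LmStateSketch s = (LmStateSketch) o;
        return wordsCovered == s.wordsCovered && Arrays.equals(lmContext, s.lmContext);
      }

      public int hashCode() {
        return 31 * wordsCovered + Arrays.hashCode(lmContext);
      }

      // Incremental LM score for appending a phrase's English words, shifting
      // the two-word context as we go. Assumes lmContext has length 2.
      static double extensionLogProb(String[] context, List<String> words, LmSketch lm) {
        double logProb = 0.0;
        String c1 = context[0], c2 = context[1];
        for (String w : words) {
          logProb += lm.trigramLogProb(c1, c2, w);
          c1 = c2;
          c2 = w;
        }
        return logProb;
      }
    }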
The third decoder, which should be constructed in DistortingWithLmDecoderFactory, should implement a beam search
which permits limited distortion as discussed in class. The distortion score can be retrieved from the DistortionModel class,
as can the distortion limit. Adding distortion will also slow the decoder, but again achieves higher model and BLEU scores.
Using our implementation with beams of size 2000, we decode the test set in about 20 minutes, achieving a model score of -3529
and a BLEU score of 27.5.
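With distortion, hypotheses may cover non-contiguous foreign spans, so the state needs a coverage set and the position where the last phrase ended. A sketch follows; the bitmask limits sentences to 64 words, and distortionLimit stands in for the value retrieved from the provided DistortionModel:

    // Search state for limited-distortion decoding: a coverage bitmask replaces
    // the single "words covered" counter, and the end of the last translated
    // phrase is kept for the distortion penalty and limit check.
    class DistortionStateSketch {
      final long coverage;       // bit i set iff foreign word i is translated
      final int lastEnd;         // end of the most recently translated span
      final String[] lmContext;  // last two English words, as before

      DistortionStateSketch(long coverage, int lastEnd, String[] lmContext) {
        this.coverage = coverage;
        this.lastEnd = lastEnd;
        this.lmContext = lmContext;
      }

      // A phrase over foreign span [start, end) can extend this hypothesis iff
      // it overlaps no covered word and respects the distortion limit.
      boolean canExtend(int start, int end, int distortionLimit) {
        long span = ((1L << (end - start)) - 1) << start;
        if ((coverage & span) != 0) return false;
        return Math.abs(start - lastEnd) <= distortionLimit;
      }
    }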
Finally, you should implement a decoder with extensions of your choosing in AwesomeDecoderFactory. This portion of the assignment is open-ended, and you can
improve your decoder in any way you like. Some possible extensions include:
- Implementing/improving the future cost estimates used in the Pharaoh decoder (a sketch follows this list).
- Implementing/improving cube pruning.
- Implementing coarse-to-fine decoding.
- Exploiting equivalent LM states as in a context-encoded language model.
Note that this requires access to additional functionality in the language model, which is provided via the ContextEncodingNgramLanguageModel interface. You can get an instance
of this interface via the NgramLanguageModelAdaptor class.
- Improving/replacing the linear distortion model.
- Using A* instead of beam search.
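As a starting point for the first extension above, here is a sketch of Pharaoh-style future cost estimation: a dynamic program over foreign spans that ignores distortion and cross-phrase LM context. The optimisticScore helper is a hypothetical placeholder; a fuller estimate would also fold in a context-free LM bound on each phrase's English words:

    // Sketch of Pharaoh-style future cost estimation: for every foreign span,
    // precompute the score of the cheapest way to translate it. Adding the
    // estimates for all uncovered spans to each hypothesis's score lets beams
    // compare partial derivations with different coverage fairly.
    class FutureCostSketch {
      static double[][] compute(int n, PhraseTableSketch table) {
        double[][] cost = new double[n + 1][n + 1];  // cost[i][j] for span [i, j)
        for (int len = 1; len <= n; len++) {
          for (int i = 0; i + len <= n; i++) {
            int j = i + len;
            double best = Double.NEGATIVE_INFINITY;
            // Cover the span with a single phrase pair...
            for (PhraseOption opt : table.getOptions(i, j)) {
              best = Math.max(best, optimisticScore(opt));
            }
            // ...or split it into two smaller spans.
            for (int k = i + 1; k < j; k++) {
              best = Math.max(best, cost[i][k] + cost[k][j]);
            }
            cost[i][j] = best;
          }
        }
        return cost;
      }

      // Hypothetical optimistic per-phrase estimate (translation score only).
      static double optimisticScore(PhraseOption opt) {
        return opt.tmScore;
      }
    }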
Evaluation:
The task of an MT decoder is to take a foreign sentence
and find the highest-scoring English translation according to a particular model. If the model is a good one, then translation accuracy (BLEU)
will correlate with model score, though this will not always be the case. As such, we will primarily evaluate your decoder on the model score
it achieves on the training data, though we will measure BLEU as well. Also, with all decoders there is a speed vs. accuracy trade-off: you can
always achieve a higher model score with a slower decoder, or decode very quickly if you do not mind making large numbers of search errors and achieving a low
model score. We will measure the performance of your decoder both in terms of speed and accuracy (model score), and compare it to the speed-accuracy trade-off of our own implementation.
Note that it is up to you to pick a point along the speed-accuracy curve in your submission. However, you can (and should) do further exploration of your decoders
and report on the results in your write-up.
When we autograde your submitted code, we will run the following commands:
java -cp assign2.jar:assign2-submit.jar -server -mx2000m edu.berkeley.nlp.assignments.assign2.MtDecoderTester -path $PATH -decoderType MONO_NOLM
java -cp assign2.jar:assign2-submit.jar -server -mx2000m edu.berkeley.nlp.assignments.assign2.MtDecoderTester -path $PATH -decoderType MONO_LM
java -cp assign2.jar:assign2-submit.jar -server -mx2000m edu.berkeley.nlp.assignments.assign2.MtDecoderTester -path $PATH -decoderType DIST_LM
java -cp assign2.jar:assign2-submit.jar -server -mx2000m edu.berkeley.nlp.assignments.assign2.MtDecoderTester -path $PATH -decoderType AWESOME
Please ensure that all of these commands complete on your machine before submitting.
Write-ups: Your write-up should be similar to the one for Assignment 1. It should
include tables or graphs of BLEU, runtime, model score, etc. for your systems, as well as
some error analysis - enough to convince us that you
looked at the specific behavior of your systems and thought about what they are
doing wrong and how you would fix it. You should also describe the extensions you implemented in your AwesomeDecoder, including how and why
they improve on your other decoders.
Submission: You will submit assign2-submit.jar to an online system.
Note that this jar must contain implementations of
the DecoderFactory classes we provide to you, but must not contain any modifications of the source code provided in
assign2.jar. To check that everything is in order, we will run a small "sanity" check when you submit the jar. Specifically,
we will run the commands
java -cp assign2.jar:assign2-submit.jar -server -mx50m edu.berkeley.nlp.assignments.assign2.MtDecoderTester -path $PATH -decoderType MONO_NOLM -sanityCheck
java -cp assign2.jar:assign2-submit.jar -server -mx50m edu.berkeley.nlp.assignments.assign2.MtDecoderTester -path $PATH -decoderType MONO_LM -sanityCheck
java -cp assign2.jar:assign2-submit.jar -server -mx50m edu.berkeley.nlp.assignments.assign2.MtDecoderTester -path $PATH -decoderType DIST_LM -sanityCheck
java -cp assign2.jar:assign2-submit.jar -server -mx50m edu.berkeley.nlp.assignments.assign2.MtDecoderTester -path $PATH -decoderType AWESOME -sanityCheck
The -sanityCheck flag will run the test harness with a tiny amount of data just to make sure no exceptions are thrown.
Please ensure that these commands return successfully before submitting your jar.
You will also submit a write-up in class on the due date.
Grading: Unlike Assignment 1, there will be no hard requirements for completion of this assignment. However, you should ensure
that your times and accuracies are comparable to those quoted in this write-up -- underperforming submissions will receive a lower grade.
As always, additional improvements in memory usage, BLEU score, model score
and decoding speed will also affect your grade, as will your write-up. Your extension(s) will be evaluated primarily based on your write-up, though we will
run your AWESOME decoder to observe any improvements ourselves.