CS 288: Statistical Natural Language Processing, Spring 2009

Instructor: Dan Klein
Lecture: Monday and Wednesday, 2:30pm-4:00pm, 405 Soda Hall
Office Hours: Monday and Wednesday, 4pm-5pm, 775 Soda Hall


1/20/09:  The course newsgroup is ucb.class.cs288. If you use it, I'll use it!
1/20/09:  The previous website has been archived.
1/24/09:  Assignment 1 is posted.
1/27/09:  Corrected office hours posted (M/W, not T/Th)!
2/10/09:  Assignment 2 is posted.
2/26/09:  Assignment 3 is posted.
3/15/09:  Assignment 4 is posted.
3/29/09:  Final project guidelines are posted.
4/12/09:  Assignment 5 is posted.


This course will explore current statistical techniques for the automatic analysis of natural (human) language data. The dominant modeling paradigm is corpus-driven statistical learning, with a split focus between supervised and unsupervised methods.

In the first part of the course, we will examine the core tasks in natural language processing, including language modeling, word-sense disambiguation, morphological analysis, part-of-speech tagging, syntactic parsing, semantic interpretation, coreference resolution, and discourse analysis. In each case, we will discuss which linguistic features are relevant to the task, how to design efficient models which can accommodate those features, and how to learn with such models in data-sparse contexts. In the second part of the course, we will explore how these core techniques can be applied to user applications such as information extraction, question answering, speech recognition, machine translation, and interactive dialog systems.

Course assignments will highlight several core NLP tasks. For each task, you will construct a basic system, then improve it through a cycle of linguistic error analysis and model redesign. There will also be a final project, which will investigate a single topic or application in greater depth.

This course assumes a good background in basic probability and a strong ability to program in Java. Prior experience with linguistics or natural languages is helpful, but not required. Disclaimer: there will be a lot of statistics, algorithms, and coding in this class.


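The primary texts for this course are:

Jurafsky and Martin, Speech and Language Processing, 2nd edition (J+M)
Manning and Schütze, Foundations of Statistical Natural Language Processing (M+S)

Note that M+S is free online. Also, make sure you get the purple 2nd edition of J+M, not the white 1st edition.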

Syllabus [subject to substantial change!]

Week Date Topics Techniques Readings Assignments (Out) Assignments (Due)
1 Jan 20 Course Introduction [6PP] [2PP] J+M 1, M+S 1-3 HW1: Language Models
2 Jan 26 Language Modeling [6PP] [2PP] Multinomial Smoothing J+M 4, M+S 6, Chen & Goodman  
Jan 28 Language Modeling II [6PP] [2PP] More Smoothing Interpreting KN  
3 Feb 2 Text Classification [6PP] [2PP] Naive Bayes M+S 7, Event Models
Feb 4 Word Sense Disambiguation [6PP] [2PP]     HW1
4 Feb 9 More Classification Maximum Entropy Classification Tutorial Maxent Tutorial 1, 2, J+M 6 HW2: PNP Classification
Feb 11 Part-of-Speech Tagging [6PP] [2PP] HMMs/CRFs J+M 5, Toutanova & Manning, Brants, Brill
5 Feb 16 No Class
Feb 18 Word Class Induction EM J+M 6, M+S 9-10, HMM Learning, Distributional Clustering, Johnson  
6 Feb 23 Speech Recognition [6PP] [2PP] Speech Signal J+M 7 HW3: POS Tagging HW2
Feb 25 Speech Recognition II [6PP] [2PP] Acoustic Modeling J+M 9  
7 Mar 2 Competitive Parsing  
Mar 4 Competitive Parsing      
8 Mar 9 Parsing [6PP] [2PP] M+S 3.2, 12.1, J+M 11  
Mar 11 Parsing II [6PP] [2PP] M+S 11, J+M 12, Best-First, A*, K-best    
9 Mar 16 PCFGs [6PP] [2PP] Unlexicalized, Split, Lexicalized HW4: Parsing HW3
Mar 18 Grammar Induction [6PP] [2PP]    
10 Mar 23 Spring Break
Mar 25 Spring Break
11 Mar 30 Machine Translation [6PP] [2PP] Word Alignment J+M 25, IBM Models, HMM Agreement Discriminative  
Apr 1 Machine Translation II Word Decoding Decoding
12 Apr 6 Machine Translation III [6PP] [2PP] Phrase-Based Systems Decoding, Learning Phrases FP Guidelines HW4
Apr 8 Syntactic Translation GHKM, Vs Phrases, Decoding
13 Apr 13 Semantic Roles [6PP] [2PP] J+M 16, 19 HW5: Machine Translation  
Apr 15 Compositional Semantics Manning, J+M 18    
14 Apr 20 Semantic Interpretation Parsing to LF  
Apr 22 Coreference Supervised, Unsupervised, J+M 21
15 Apr 27 Question Answering [6PP] [2PP] HW5
Apr 29 Sentiment Analysis    
16 May 4 QA / Summarization
May 6 Unsupervised Semantics    
17 May 11 Diachronics [6PP] [2PP]     FP Due May 21