CS 288: Statistical Natural Language Processing, Spring 2010

Instructor: Dan Klein
Lecture: Monday and Wednesday, 2:30pm-4:00pm, 405 Soda Hall
Office Hours: Monday 4pm-5pm and Thursday 2:30pm-3:30pm in 724 (or 730) Sutardja Dai Hall.


1/19/10:  The course newsgroup is ucb.class.cs288. If you use it, I'll use it!
1/19/10:  The previous website has been archived.
1/19/10:  Assignment 1 is posted.
2/2/10:  Assignment 2 is posted.
2/18/10:  Assignment 3 is posted.
3/5/10:  Comments on writeups posted.
3/7/10:  Assignment 4 is posted.
4/4/10:  Final project guidelines are posted.
4/4/10:  Assignment 5 is posted.
4/12/10:  There are extra office hours on 4/12 (4-6pm) and 4/15 (2:30-4:00pm).


This course will explore current statistical techniques for the automatic analysis of natural (human) language data. The dominant modeling paradigm is corpus-driven statistical learning, with a split focus between supervised and unsupervised methods.

In the first part of the course, we will examine the core tasks in natural language processing, including language modeling, syntactic analysis, semantic interpretation, coreference resolution, and discourse analysis. In each case, we will discuss the underlying linguistic phenomena, which features are relevant to the task, how to design efficient models that can accommodate those features, and how to learn such models. In the second part of the course, we will explore how these core techniques can be applied to user applications such as information extraction, question answering, speech recognition, machine translation, and interactive dialog systems.

Course assignments will highlight several core NLP tasks and methods. For each task, you will construct a basic system, then improve it through a cycle of linguistic error analysis and model redesign. There will also be a final project, which will investigate a single topic or application in greater depth. This course assumes a good background in basic probability and a strong ability to program in Java. Prior experience with linguistics or natural languages is helpful, but not required.  There will be a lot of statistics, algorithms, and coding in this class.


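The primary texts for this course are:

Jurafsky and Martin, Speech and Language Processing, 2nd edition (J+M)
Manning and Schütze, Foundations of Statistical Natural Language Processing (M+S)

Note that M+S is free online. Also, make sure you get the purple 2nd edition of J+M, not the white 1st edition.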

Syllabus [subject to substantial change!]

Week | Date | Topics | Techniques | Readings | Assignments (Out) | Assignments (Due)
1 | Jan 20 | Course Introduction | | J+M 1, M+S 1-3 | HW1: Language Models |
2 | Jan 25 | Words: Language Modeling | N-Grams, Smoothing | J+M 4, M+S 6, Chen & Goodman, Interpreting KN Massive Data | |
 | Jan 27 | Words: LMs II | Smoothing, Naive Bayes | M+S 7, Event Models | |
3 | Feb 1 | Words: Text Cat | | | |
 | Feb 3 | Words: WSD | Maxent | Classification Tutorial, Maxent Tutorial 1, 2, J+M 6 | HW2: PNP Classification | HW1
4 | Feb 8 | Parts-of-Speech: Tagging | HMMs/CRFs | J+M 5, Toutanova & Manning, Brants, Brill | |
 | Feb 10 | Parts-of-Speech: Induction | EM | J+M 6, M+S 9-10, HMM Learning, Distributional Clustering, Johnson | |
5 | Feb 15 | | | | |
 | Feb 17 | Speech Recognition | Speech Signal | J+M 7 | HW3: POS Tagging | HW2
6 | Feb 22 | Speech Recognition II | Acoustic Modeling | J+M 9 | |
 | Feb 24 | Interlude: Competitive Parsing | | | |
7 | Mar 1 | Interlude: Competitive Parsing | | | |
 | Mar 3 | Syntax: PCFGs | | M+S 3.2, 12.1, J+M 11 | |
8 | Mar 8 | Syntax: Algorithms | | M+S 11, J+M 12, Best-First, A*, K-best | HW4: Parsing | HW3
 | Mar 10 | Syntax: Richer Models | | Unlexicalized, Split, Lexicalized | |
9 | Mar 15 | | | | |
 | Mar 17 | Syntax: Grammar Induction | | | |
10 | Mar 22 | Spring Break | | | |
 | Mar 24 | Spring Break | | | |
11 | Mar 29 | Machine Translation I | Word-Based Models | J+M 25, IBM Models, HMM Agreement Discriminative, Decoding | |
 | Mar 31 | Machine Translation II | Phrase-Based Systems | Decoding, Learning Phrases | FP Guidelines | HW4
12 | Apr 5 | Machine Translation III | Syntactic Systems | GHKM, Vs Phrases, Decoding | HW5: Machine Translation |
 | Apr 7 | Semantics: Roles | | J+M 16, 19 | |
13 | Apr 12 | Semantics: Compositional | | Manning, J+M 18 | |
 | Apr 14 | Semantics: Interpretation | | Parsing to LF | |
14 | Apr 19 | Discourse: Coreference | | Supervised, Unsupervised, J+M 21 | | HW5
 | Apr 21 | Discourse: Summarization | | Topic-based, N-gram based | |
15 | Apr 26 | Question Answering | | N-gram-based, Grammar-based | |
 | Apr 28 | Diachronics | | Reconstruction | | FP Due May 21