CS 294-7: Statistical Natural Language Processing, Spring 2007

Instructor: Dan Klein
Lecture: Mondays and Wednesdays, 1:00-2:30pm, 405 Soda Hall
Office Hours: Mondays and Wednesdays 2:30-3:30pm in 775 Soda Hall
GSI: Aria Haghighi
Section: Friday 11am-12pm, 320 Soda Hall
Office Hours: Tuesdays and Thursdays 2:10-3:10pm in 525 Soda Hall


4/1/07:  Assignment 5 is posted, due April 16th.
3/20/07:  Final project guidelines are up.
3/6/07:  Assignment 4 is posted, due March 21st.
2/20/07:  Assignment 3 is posted, due March 7th.
2/4/07:  Assignment 2 is posted, due Feb 21st.
1/27/07:  Assignment 1 is posted, due Feb 7th.
1/17/07:  The course newsgroup is ucb.class.cs294-7. If you use it, I'll use it!
1/17/07:  The previous website has been archived.


This course will explore current statistical techniques for the automatic analysis of natural (human) language data. The dominant modeling paradigm is corpus-driven supervised learning, but unsupervised methods and even hand-coded rule-based systems will be mentioned when appropriate.

In the first part of the course, we will examine the core tasks in natural language processing, including language modeling, word-sense disambiguation, morphological analysis, part-of-speech tagging, syntactic parsing, semantic interpretation, coreference resolution, and discourse analysis. In each case, we will discuss which linguistic features are relevant to the task, how to design efficient models which can accommodate those features, and how to estimate parameters for such models in data-sparse contexts. In the second part of the course, we will explore how these core techniques can be applied to user applications such as information extraction, question answering, speech recognition, machine translation, and interactive dialog systems.

Course assignments will highlight several core NLP tasks. For each task, we will construct a basic system, then improve it through a cycle of linguistic error analysis and model redesign. There will also be a final project, which will investigate a single topic or application in greater depth. This course assumes a familiarity with basic probability and the ability to program in Java. Prior experience with linguistics or natural languages is helpful, but not required. Disclaimer: there will be a lot of statistics and algorithms in this class, as well as some serious coding.


The texts for this course are:
Manning and Schütze, Foundations of Statistical Natural Language Processing (M+S)
Jurafsky and Martin, Speech and Language Processing, 2nd Edition (J+M)

I recommend both, but most of what you need is online.

Syllabus [subject to change!]

| Week | Date | Topics | Techniques | Readings | Assignments (Out) | Assignments (Due) |
| 1 | Jan 17 | Course Introduction | | M+S 1-3 | | |
| 2 | Jan 22 | Language Models (6pp) (2pp) | Multinomial Smoothing | M+S 6, J+M 2nd Ed Ch 4, Chen & Goodman | | |
| | Jan 24 | Language Models (6pp) (2pp) | More Smoothing | Interpreting KN | HW1: Language Models | |
| 3 | Jan 29 | Text Categorization (6pp) (2pp) | Naive-Bayes | M+S 7, Event Models | | |
| | Jan 31 | Word-Sense Disambiguation (6pp) (2pp) | Maximum Entropy | Tutorial 1, 2, J+M 6 | | |
| 4 | Feb 5 | Part-of-Speech Tagging (6pp) (2pp) | HMMs | J+M 5, Toutanova & Manning, Brants, Brill | HW2: PNP Classification | |
| | Feb 7 | Tagging / Word Classes (6pp) (2pp) | | J+M 6, M+S 9-10, HMM Learning, Distributional Clustering | | HW1 |
| 5 | Feb 12 | Speech Recognition (6pp) (2pp) | | J+M 7 | | |
| | Feb 14 | Speech Recognition (6pp) (2pp) | | J+M 9 | | |
| 6 | Feb 19 | | | | | |
| | Feb 21 | Machine Translation (6pp) (2pp) | Word Alignment | J+M 24, IBM Models, HMM | HW3: POS Tagging | HW2 |
| 7 | Feb 26 | Machine Translation | Word Decoding | Decoding | | |
| | Feb 28 | Machine Translation (6pp) (2pp) | Phrase Alignment | Phrases | | |
| 8 | Mar 5 | Machine Translation (6pp) (2pp) | Phrase Alignment | | | |
| | Mar 7 | Syntax (6pp) (2pp) | | M+S 3.2, 12.1, J+M 11 | HW4: Machine Translation | HW3 |
| 9 | Mar 12 | Syntactic Ambiguity (6pp) (2pp) | | M+S 11, J+M 12 | | |
| | Mar 14 | Chart Parsing (6pp) (2pp) | PCFGs | | | |
| 10 | Mar 19 | PCFGs (6pp) (2pp) | | Unlexicalized, Split | | |
| | Mar 21 | Lexicalized Models (6pp) (2pp) | | M+S 12.2, J+M 12.3-4, Best-First, A*, Collins, Charniak and Johnson | FP Guidelines | HW4 |
| 11 | Mar 26 | NO CLASS | | | | |
| | Mar 28 | NO CLASS | | | | |
| 12 | Apr 2 | Frame Semantics (6pp) (2pp) | | J+M 16, 19 | HW5: Parsing | |
| | Apr 4 | Compositional Semantics (6pp) (2pp) | | Manning | | |
| 13 | Apr 9 | Semantics (6pp) (2pp) | | | | |
| | Apr 11 | Semantics | | | | |
| 14 | Apr 16 | Coreference / QA (6pp) (2pp) | | | | |
| | Apr 18 | Question Answering | | | | |
| 15 | Apr 23 | Syntactic Translation | | | | |
| | Apr 25 | TBD | | | | |
| 16 | Apr 30 | Final Presentations | | | | |
| | May 2 | Final Presentations | | | | |
| 17 | May 7 | Conclusion / Grammar Induction | | | | FPs by May 18 |