CS 294

CS 294-19: Statistical Natural Language Processing, Spring 2008

Instructor: Dan Klein
Lecture: Tuesday and Thursday, 11:00am-12:30pm, 320 Soda Hall
Office Hours: Tuesday and Thursday 12:30-1:30pm in 775 Soda Hall

GSI: Aria Haghighi
Section: Wed. 12-1pm in 320 Soda
Office Hours: Wed. 1-5pm in 525 Soda Hall (or by appointment)

Announcements

3/17/08: Assignment 5 is posted, due April 3.
2/17/08: Telebears mistakenly kicked out half the class. I'm working on it -- but don't panic, you're all in the class.
2/16/08: Assignment 3 is posted, due Feb 28. A few days' extension is likely, but start early, as your experiments will be compute-intensive!
2/12/08: New section time! Friday 4-5pm in 320 Soda, starting this week.
2/4/08: Assignment 2 is posted, due Feb 14.
2/3/08: Section time survey for possible new section time. Please vote even if you like the current time. This week (2/6) section still W 12-1pm.
2/2/08: Newsgroup exists!
2/2/08: Readers now available at Copy Central at Hearst and Euclid.
1/29/08: Section is now Wednesday 12-1pm in 320 Soda.
1/21/08: Assignment 1 is posted, due Feb 5.
1/21/08: The course newsgroup is ucb.class.cs294-19. If you use it, I'll use it!
1/21/08: The previous website has been archived.

Description

This course will explore current statistical techniques for the automatic analysis of natural (human) language data. The dominant modeling paradigm is corpus-driven statistical learning, with a split focus between supervised and unsupervised methods.

In the first part of the course, we will examine the core tasks in natural language processing, including language modeling, word-sense disambiguation, morphological analysis, part-of-speech tagging, syntactic parsing, semantic interpretation, coreference resolution, and discourse analysis. In each case, we will discuss which linguistic features are relevant to the task, how to design efficient models which can accommodate those features, and how to learn with such models in data-sparse contexts. In the second part of the course, we will explore how these core techniques can be applied to user applications such as information extraction, question answering, speech recognition, machine translation, and interactive dialog systems.

Course assignments will highlight several core NLP tasks. For each task, we will construct a basic system, then improve it through a cycle of linguistic error analysis and model redesign. There will also be a final project, which will investigate a single topic or application in greater depth. This course assumes a good familiarity with basic probability and a strong ability to program in Java. Prior experience with linguistics or natural languages is helpful, but not required. Disclaimer: there will be a lot of statistics and algorithms in this class, as well as some serious coding.

Readings

The primary texts for this course are:

Jurafsky and Martin, Speech and Language Processing, 2nd edition ONLY [online -- but just vanished! More info soon... probably it'll become a reader.]
Manning and Schütze, Foundations of Statistical Natural Language Processing [amazon] [online]

I recommend both, but everything that you need is online.

Syllabus [subject to substantial change!]

Week	Date	Topics	Techniques	Readings	Assignments (Out)	Assignments (Due)
1	Jan 22	Course Introduction (6PP) (2PP)		J+M 1, M+S 1-3	HW1: Language Models
1	Jan 24	Language Modeling (6PP) (2PP)	Multinomial Smoothing	J+M 4, M+S 6, Chen & Goodman
2	Jan 29	Language Modeling (6PP) (2PP)	More Smoothing	Interpreting KN
2	Jan 31	Text Classification (6PP) (2PP)	Naive Bayes	M+S 7, Event Models
3	Feb 5	Word Sense Disambiguation (6PP) (2PP)	Maximum Entropy	Classification Tutorial, Maxent Tutorial 1, 2, J+M 6	HW2: PNP Classification	HW1
3	Feb 7	Classification (6PP) (2PP)		J+M 5, Toutanova & Manning, Brants, Brill
4	Feb 12	Part-of-Speech Tagging (6PP) (2PP)	HMMs/CRFs	J+M 6, M+S 9-10, HMM Learning, Distributional Clustering
4	Feb 14	Word Classes (6PP) (2PP)	EM	Johnson	HW3: POS Tagging	HW2
5	Feb 19	Speech Recognition (6PP) (2PP)	Speech Signal	J+M 7
5	Feb 21	Speech Recognition (6PP) (2PP)	Acoustic Modeling	J+M 9
6	Feb 26	Machine Translation (6PP) (2PP)	Word Alignment	J+M 25, IBM Models, HMM Agreement Discriminative	HW4: Machine Translation
6	Feb 28	Machine Translation (6PP) (2PP)	Word Decoding	Decoding		HW3
7	Mar 4	Machine Translation (6PP) (2PP)	Phrase-Based Systems	Decoding, Learning Phrases
7	Mar 6	Parsing I (6PP) (2PP)		M+S 3.2, 12.1, J+M 11
8	Mar 11	Parsing II (6PP) (2PP)		M+S 11, J+M 12	HW5: Parsing
8	Mar 13	PCFGs (6PP) (2PP)		Unlexicalized, Split		HW4
9	Mar 18	Lexicalized Parsing (6PP) (2PP)		M+S 12.2, J+M 12.3-4, Best-First, A*, Collins, Charniak and Johnson
9	Mar 20	Grammar Induction (6PP) (2PP)
10	Mar 25	Spring Break
10	Mar 27	Spring Break
11	Apr 1	Semantic Roles		J+M 16, 19
11	Apr 3	Coreference		Supervised, Unsupervised, J+M 21		HW5
12	Apr 8	Compositional Semantics (6PP) (2PP)		Manning, J+M 18	FP Guidelines
12	Apr 10	Semantic Interpretation		Parsing to LF
13	Apr 15	Question Answering (6PP) (2PP)
13	Apr 17	Question Answering
14	Apr 22	Syntactic Translation		GHKM, Vs Phrases, Decoding
14	Apr 24	Grammar Induction
15	Apr 29	Historical Reconstruction / Phrase Learning	Sampling
15	May 1	Final Presentations
16	May 6	Final Presentations
16	May 8	Conclusion / Translation from Monotexts?				FP