
Practical information
Volume: 3 units
Lectures: Evans Hall 330, Mon/Wed 13:00–14:30.
Grading: Homework (20%),
In-class Midterm (40%), held on Wed., April 8,
Final Project (40%).
Instructor: Martin Wainwright
 Office Hours: Mon. 14:30–15:30 (421 Evans Hall); Thurs. 12:00–13:00 (421 Evans Hall)
 Email: wainwrig AT stat DOT berkeley DOT edu
GSI: Junming Yin
 Office Hours: Tues. 15:00–16:00 (751 Soda Hall); Friday 11:00–12:00 (751 Soda Hall)
 Email: junming AT eecs DOT berkeley DOT edu
Course syllabus: Download PDF
Scribing materials: Download ZIP file
Updates and Announcements
 Mon, May 4: Projects will be presented during the poster
session, to be held on Monday, May 11, 12:30 pm in 1011 Evans.
Writeups are due on the same day.
 Wed, April 22: Please email me if you are doing your project in a group, so that we can decide between in-class presentation and poster session.
 Sat, April 4: Midterm review session will be held on Monday, April 6 from 7–8:30 pm in Soda Hall 320.
 Thurs, March 19: Typo in HW 4.4(a) in definition of Vs;
new version posted.
 Fri, March 13: Schedule for next week: Lecture
on Monday, March 16, but NO lecture on Wed., March 18. Lecture
scribe notes for makeup lecture now posted.
 Wed, March 11: Makeup lecture on Friday, March 13 from 1–2:30pm in Evans Hall 70. Also, the midterm exam will be held in class on Wed., April 8.
 Mon, March 2: HW #4 posted, due on Monday March 30.
 Wed, March 4: Problem 3.3 is correct as stated. However, some books use a slightly different definition of packing number, in which the ball centers must be separated by epsilon (not 2 epsilon). Apologies for any confusion.
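To see the distinction numerically, the following sketch (the point set and epsilon are purely illustrative, not from the homework) greedily builds packings under both separation conventions:

```python
import numpy as np

def greedy_packing(points, sep):
    """Greedily select points that are pairwise at least `sep` apart (Euclidean norm)."""
    centers = []
    for x in points:
        if all(np.linalg.norm(x - c) >= sep for c in centers):
            centers.append(x)
    return centers

# Illustrative point set: a fine grid on [0, 1]
pts = [np.array([t]) for t in np.linspace(0, 1, 101)]
eps = 0.3

# Under the "separated by epsilon" convention the packing is larger
# than under the "separated by 2*epsilon" convention.
m_eps = len(greedy_packing(pts, eps))
m_2eps = len(greedy_packing(pts, 2 * eps))
print(m_eps, m_2eps)
```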
 Thurs, Feb 26: Corrected version of HW #3 posted; minor typo in Problem 3.1 (no log in front of phi).
 Mon, Feb 23: HW #3 posted. Prof. Wainwright's office hours moved from Thurs. to Friday, Feb 27, 12–1pm (421 Evans).
 Mon. Feb 23: HW #1 Solution posted. Graded homeworks available
during office hours from Junming.
 Thurs. Feb 19: Clarifications on HW #2:
Problem 2.4: Delete "fixed" if it confuses you.
Problem 2.5(b): Delete "fraction"; the interpretation of $\nu$ is the amount by which the constraint can be violated, as in soft-margin SVM.
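As a reminder of that interpretation, the following sketch (with made-up data and a fixed, hypothetical linear classifier) computes the soft-margin slack variables $\xi_i = \max(0, 1 - y_i f(x_i))$, i.e., the amounts by which the margin constraints are violated:

```python
import numpy as np

# Illustrative data: labels in {-1, +1} and a fixed linear score f(x) = w.x + b
X = np.array([[2.0, 0.5], [1.0, 1.0], [-1.5, 0.2], [-0.5, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = np.array([1.0, 0.0]), 0.0

# Slack xi_i measures how much the margin constraint y_i f(x_i) >= 1 is violated
scores = X @ w + b
xi = np.maximum(0.0, 1.0 - y * scores)
print(xi)
```

Only the last point violates its margin constraint (by 0.5); the others have zero slack.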
 Monday, Feb 9: HW #2 posted; due Monday, February 23rd in class. Data set
regression.dat and test set
regression.test

 Wed. Feb. 4: Comments on HW #1:
In Problem 1.4 (and throughout the course), log means base e.
In Problem 1.6, parts (a) through (d), all kernels are to be positive semidefinite kernels.
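One quick numerical sanity check of positive semidefiniteness (a hedged sketch; the Gaussian kernel and sample points are illustrative) is to confirm that the Gram matrix has no negative eigenvalues:

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
K = gaussian_kernel_matrix(X)

# Every eigenvalue of a PSD Gram matrix is >= 0 (up to numerical noise)
eigs = np.linalg.eigvalsh(K)
print(eigs.min() >= -1e-10)
```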
 Mon. Jan 26: HW #1 posted; due Monday, February 9th in class.
 Data sets for Homework #1: X.dat
and Y.dat
 Office hours on Wed. Jan. 28/ Thurs. Jan 29: Cancelled (Prof. Wainwright
delayed by ice storm in Texas).
Scribe notes
 Wednesday April 15: Lecture 22
 Monday April 13: Lecture 21
 Wednesday April 8: Midterm exam
 Monday April 6: Lecture 20
 Wed April 1: Lecture 19
 Monday March 30: Lecture 18
 Monday March 16: Lecture 17
 Fri March 13 (Makeup lecture): Lecture 16
 Mon March 11: Lecture 15
 Mon March 9: Lecture 14
 Wed March 4: Lecture 13
 Mon March 2: Lecture 12
 Wed Feb 25: Lecture 11
 Mon February 23: Lecture 10
 Wed February 18: Lecture 9
 Mon February 16: HOLIDAY (no lecture)
 Wed February 11: Lecture 7
 Mon February 9: Lecture 6
 Wed February 4: Lecture 5
 Mon February 2: Lecture 4
 Wed January 28: Lecture 3
 Mon January 26: Lecture 2
 Wed January 21: Lecture 1 (Introduction and logistics)
Homework and solutions
References and readings
Books:

Chapter 2 of "Convergence of Stochastic Processes", by D. Pollard. Material on uniform laws of large numbers.

Convex Optimization, by S. Boyd and L. Vandenberghe. Material on Lagrangians, duality, etc.
 Learning with Kernels, by B. Scholkopf and A. Smola, MIT Press, 2002.
 Probabilistic Theory of Pattern Recognition, by L. Devroye, L. Gyorfi, G. Lugosi, Springer, New York, 1996.
Papers:

Surrogate losses and convexity, by Bartlett, Jordan and McAuliffe.

Nonparametric decentralized detection using kernel methods, by X. Nguyen et al. Relevant for the general duality statement (see equation (22)).

Kernel PCA, by B. Schoelkopf et al.

An introduction to kernel-based learning algorithms, by K.R. Mueller, S. Mika, G. Raetsch, K. Tsuda, and B. Schoelkopf.

Theory of reproducing kernels, by N. Aronszajn, Transactions of the AMS, 68(3), May 1950.
 Lecture notes on concentration of measure, by G. Lugosi.
Course projects
The project can be in any area related to the topics of the course (classification, regression, density estimation, etc.). Possible types of projects include:
 Implementing and experimenting with some algorithms (e.g., SVMs, boosting, kernel PCA, kernel regression, kernel CCA, spectral clustering).
 Running experiments for a particular application of interest.
 Trying to extend an existing method or theoretical result.
 Picking a topic that interests you, and learning about it in more depth.
For instance, if you were interested in dimensionality reduction techniques, then you might read and prepare a survey overview of a related set of papers (e.g., on PCA, spectral methods, Laplacian eigenmaps, etc.). Similarly for concentration inequalities and risk bounds; see below for some suggested readings.
The project will be evaluated by a combination of a written report (at most 5–10 pages in length) and a presentation (either a poster or a class presentation, depending on the total number of projects). The presentations will take place during the week of May 4–8, and the written report will be due by May 11, 2009.
Some papers that may be of interest (a very incomplete list) are given below. If the PDF files are not available here but the journal reference is given, you can download the papers using websites such as:
 JSTOR
 IEEE Xplore
 MathSciNet
Note that you need to be working from a computer with a UC Berkeley IP address; otherwise, see the UC Berkeley library instructions on setting up a proxy server.
If journal references are not available here, you can find many papers by googling the authors and title.
Kernel methods, SVMs and related topics

`An introduction to kernel-based learning algorithms.'
K.R. Mueller, S. Mika, G. Raetsch, K. Tsuda, and B. Schoelkopf.

`A Tutorial on Support Vector Machines for Pattern Recognition.'
C. J. C. Burges.

`A Tutorial on Support Vector Regression.'
A. J. Smola and B. Schoelkopf.
 See also the text books:
 `An Introduction to Support Vector Machines.'
Nello Cristianini and John Shawe-Taylor.
Cambridge University Press, Cambridge, UK, 2000.
(see the web page)
 `Kernel Methods for Pattern Analysis.'
John Shawe-Taylor and Nello Cristianini.
Cambridge University Press, Cambridge, UK, 2004.
(see the web page)
 `Learning with Kernels.'
Bernhard Schoelkopf and Alex Smola.
MIT Press, Cambridge, MA, 2002.
(see the web page)
The following papers describe a geometric view of
SVM optimization problems.

`A fast iterative nearest point algorithm for support vector
machine classifier design.'
S.S. Keerthi, S.K. Shevade, C. Bhattacharyya and K.R.K. Murthy.

`Duality and Geometry in SVM Classifiers.'
Kristin P. Bennett and Erin J. Bredensteiner.

`A Geometric Interpretation of nu-SVM Classifiers.'
D.J. Crisp and C.J.C. Burges.
Reproducing kernel Hilbert spaces and related topics
These papers investigate the RKHS of Gaussian kernels.

`Consistency and convergence rates of one-class SVM and related
algorithms.' (See Section 3.)
R. Vert and J.P. Vert.

`An Explicit Description of the Reproducing Kernel Hilbert
Spaces of Gaussian RBF Kernels.'
I. Steinwart, D. Hush, and C. Scovel.
The following papers describe kernels defined on structures, such as
sequences and trees. The first describes a number of operations that
can be used in constructing kernels.

`Convolution Kernels on discrete structures'
D. Haussler

`Dynamic alignment kernels'
C. Watkins

`Convolution kernels for natural language'
Michael Collins and Nigel Duffy.

`Text classification using string kernels'
H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins.
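For a flavor of string kernels, here is a minimal sketch of the k-spectrum kernel (an illustrative example, not the specific constructions of the papers above), which takes the inner product of substring-count feature vectors:

```python
from collections import Counter

def spectrum_kernel(s, t, k=2):
    """k-spectrum kernel: inner product of length-k substring count vectors."""
    cs = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    ct = Counter(t[i:i + k] for i in range(len(t) - k + 1))
    return sum(cs[u] * ct[u] for u in cs)

# "abab" contains "ab" twice, so it overlaps the string "ab" with weight 2
print(spectrum_kernel("abab", "ab", k=2))
```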
Another good resource is the kernel machines web site:
http://www.kernelmachines.org
Surrogate loss functions and their properties
The following papers present relationships between convex cost functions and the discrete loss (for two-class pattern classification). The first paper generalizes and simplifies results of the second. The third paper considers more general situations, including weighted classification, regression, quantile estimation and density estimation.

`Convexity, classification, and risk bounds.'
Peter Bartlett, Mike Jordan and Jon McAuliffe.

`Statistical behavior and consistency of classification methods based
on convex risk minimization.'
Tong Zhang.

`How to compare different loss functions and their risks.'
Ingo Steinwart.
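To make the comparison concrete, the following sketch (purely illustrative) evaluates the 0-1 loss and three standard convex surrogates at a few values of the margin m = y f(x):

```python
import math

# Common surrogate losses as functions of the margin m = y * f(x)
zero_one    = lambda m: 1.0 if m <= 0 else 0.0
hinge       = lambda m: max(0.0, 1.0 - m)
logistic    = lambda m: math.log(1.0 + math.exp(-m))  # natural log, as in the course
exponential = lambda m: math.exp(-m)

for m in [-1.0, 0.0, 0.5, 2.0]:
    print(m, zero_one(m), hinge(m), round(logistic(m), 3), round(exponential(m), 3))
```

Each convex surrogate upper-bounds the 0-1 loss (the hinge loss after suitable scaling), which is what makes risk bounds for surrogate minimization informative about classification error.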
Kernel PCA and related topics
 Kernel PCA, by B. Schoelkopf et al.

Convergence of eigenspaces in kernel PCA, by L. Zwald and G. Blanchard (2005).
 Kernel ICA, by F. Bach and M. Jordan, Journal of Machine Learning Research (2003).
 Kernel Methods for Measuring Independence, by A. Gretton et al., Journal of Machine Learning Research (2005).

Statistical properties of kernel principal components analysis, by G. Blanchard et al. (2007).
 Eigenspectra of random kernel matrices, by Noureddine El Karoui, UC Berkeley Stat. technical report (2008).
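For orientation, the core kernel PCA computation can be sketched as follows (an illustrative NumPy version with a Gaussian kernel, not any particular paper's algorithm): form the Gram matrix, center it in feature space, and extract the top eigenvectors:

```python
import numpy as np

def kernel_pca(X, n_components=2, sigma=1.0):
    """Scores of the data on the top principal components in a Gaussian-kernel RKHS."""
    n = X.shape[0]
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / (2 * sigma ** 2))
    # Center the Gram matrix in feature space: K <- HKH with H = I - (1/n) 11^T
    H = np.eye(n) - np.ones((n, n)) / n
    Kc = H @ K @ H
    vals, vecs = np.linalg.eigh(Kc)          # eigh returns ascending order
    vals, vecs = vals[::-1], vecs[:, ::-1]   # largest first
    # Component scores: eigenvector scaled by the square root of its eigenvalue
    return vecs[:, :n_components] * np.sqrt(np.maximum(vals[:n_components], 0.0))

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))
Z = kernel_pca(X)
print(Z.shape)
```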
Random matrix theory
The following book chapter provides non-asymptotic bounds on the singular values of Gaussian random matrices:
 Local operator theory, random matrices, and Banach spaces, by K. R. Davidson and S. J. Szarek, in "Handbook of Banach Spaces", pages 317–336.
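As an empirical illustration of such bounds (a sketch, with illustrative dimensions), one can check the classical inequality E[sigma_max(G)] <= sqrt(m) + sqrt(n) for an m x n matrix G with i.i.d. N(0,1) entries:

```python
import numpy as np

# Monte Carlo check of E[sigma_max(G)] <= sqrt(m) + sqrt(n)
# for G an m x n matrix with i.i.d. standard Gaussian entries.
rng = np.random.default_rng(0)
m, n, trials = 50, 20, 200
smax = [np.linalg.svd(rng.normal(size=(m, n)), compute_uv=False)[0]
        for _ in range(trials)]
print(np.mean(smax) <= np.sqrt(m) + np.sqrt(n))
```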
The following papers discuss aspects of principal component analysis in high dimensions, as well as PCA with sparsity assumptions:
 On the distribution of the largest eigenvalue in principal components analysis, by I. M. Johnstone, Annals of Statistics, 29(2), pp. 295–327, 2001.
 Sparse principal components analysis, by I. M. Johnstone and A. Lu (2003; also 2009). Paper on arxiv
 A direct formulation of sparse PCA using semidefinite programming, by d'Aspremont et al.
 High-dimensional analysis of semidefinite programming relaxations for sparse principal component analysis, by A. A. Amini and M. J. Wainwright, to appear in the Annals of Statistics. Paper
 Eigenspectra of random kernel matrices, by Noureddine El Karoui, UC Berkeley Stat. technical report (2008).
High-dimensional linear regression and $\ell_1$ relaxation
The following papers discuss aspects of linear regression in high dimensions, including some under sparsity constraints.
Lasso for Gaussian graph selection and high-dimensional analysis:
 High-dimensional graphs and variable selection with the Lasso, by N. Meinshausen and P. Buhlmann, Annals of Statistics, 34, pp. 1436–1462, 2006.
 On model selection consistency of the Lasso, by P. Zhao and B. Yu, Journal of Machine Learning Research, pp. 2541–2567, 2006.
 Sharp thresholds for noisy and high-dimensional recovery of sparsity using $\ell_1$-constrained quadratic programming, by M. J. Wainwright, IEEE Transactions on Information Theory, May 2009. Full Text (PDF)
 The Dantzig Selector: Statistical estimation when $p$ is much larger than $n$, by E. Candes and T. Tao, Annals of Statistics, 35(6), pp. 2313–2351, 2007.
 Simultaneous Analysis of Lasso and Dantzig Selector, by P. Bickel, Y. Ritov and A. Tsybakov, Annals of Statistics, to appear.
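For concreteness, the Lasso objective studied in these papers can be minimized by a simple proximal-gradient (ISTA) iteration; the following is a minimal sketch on illustrative synthetic data, not the algorithm of any paper above:

```python
import numpy as np

def lasso_ista(X, y, lam, step=None, iters=500):
    """Minimize (1/2)||y - Xw||^2 + lam * ||w||_1 by proximal gradient (ISTA)."""
    n, d = X.shape
    if step is None:
        step = 1.0 / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz constant of the gradient
    w = np.zeros(d)
    for _ in range(iters):
        grad = X.T @ (X @ w - y)
        z = w - step * grad
        w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft thresholding
    return w

# Illustrative sparse recovery: only the first two coordinates are active
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
w_true = np.zeros(10)
w_true[:2] = [3.0, -2.0]
y = X @ w_true + 0.01 * rng.normal(size=100)
w_hat = lasso_ista(X, y, lam=1.0)
print(np.round(w_hat, 1))
```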
Boosting, ensemble methods and related topics
 Bagging predictors, by L. Breiman.

The boosting approach to machine learning: An overview, by R. E. Schapire.
 Experiments with
a new boosting algorithm, by Y. Freund and R. E. Schapire.

`Boosting the margin: A new explanation for the
effectiveness of voting methods.'
Robert E. Schapire, Yoav Freund, Peter Bartlett and Wee Sun Lee.

`Arcing classifiers.'
Leo Breiman.

`Additive logistic regression: a statistical view of boosting.'
Jerome Friedman, Trevor Hastie and Robert Tibshirani.
Concentration inequalities and risk bounds
`Concentration-of-measure inequalities.'
Gabor Lugosi.
`A few notes on Statistical Learning Theory.'
Shahar Mendelson
Review paper on Rademacher averages and local Rademacher averages:
`New Approaches to Statistical Learning Theory.'
Olivier Bousquet.
The following paper collects some useful properties of
Rademacher averages.
`Rademacher and Gaussian complexities:
risk bounds and structural results'
P. L. Bartlett and S. Mendelson.
Rademacher averages for large margin classifiers:
`Empirical margin distributions and bounding the generalization error
of combined classifiers.'
Vladimir Koltchinskii and Dmitriy Panchenko.
The `finite lemma' (Rademacher averages of finite sets) is Lemma 5.2
in this paper, which also introduces local Rademacher averages.
`Some applications of concentration inequalities to statistics.'
Pascal Massart.
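The `finite lemma' mentioned above can be checked numerically. This sketch (with an illustrative random finite set) compares a Monte Carlo estimate of the Rademacher average of a finite set A of vectors in R^n against the bound max_a ||a||_2 sqrt(2 log |A|) / n:

```python
import numpy as np

# Monte Carlo check of the finite lemma: for a finite set A in R^n,
#   R(A) = E_sigma[ max_{a in A} (1/n) sum_i sigma_i a_i ]
# is at most max_a ||a||_2 * sqrt(2 log |A|) / n.
rng = np.random.default_rng(0)
n, N, trials = 100, 8, 2000
A = rng.normal(size=(N, n))                        # illustrative finite set, |A| = N

sigma = rng.choice([-1.0, 1.0], size=(trials, n))  # Rademacher signs
R_hat = np.mean(np.max(sigma @ A.T, axis=1)) / n   # Monte Carlo estimate of R(A)
bound = np.max(np.linalg.norm(A, axis=1)) * np.sqrt(2 * np.log(N)) / n
print(R_hat <= bound)
```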
Model selection and related topics
`An experimental and theoretical comparison of model selection
methods.'
M. Kearns, Y. Mansour, A. Ng, and D. Ron
`Risk bounds for model selection via penalisation.'
Andrew Barron, Lucien Birge and Pascal Massart.
`Model selection and error estimation.'
P. L. Bartlett, S. Boucheron, and G. Lugosi.
`Some applications of concentration inequalities to statistics.'
Pascal Massart.
`A consistent strategy for boosting algorithms.'
Gabor Lugosi and Nicolas Vayatis.
`Consistency of support vector machines and other regularized kernel
classifiers.'
I. Steinwart.
`The consistency of greedy algorithms for classification.'
Shie Mannor, Ron Meir, Tong Zhang.
`Concentration inequalities and model selection.'
Pascal Massart.
For Bernstein's inequality, and generalizations, see:
`Concentration-of-measure inequalities.'
Gabor Lugosi.
`A Bennett concentration inequality and its application
to suprema of empirical processes',
Olivier Bousquet.
`Concentration inequalities using the entropy method.'
S. Boucheron, G. Lugosi and P. Massart.
`A sharp concentration inequality with applications.'
Stephane Boucheron, Gabor Lugosi and Pascal Massart.
Bounding the variance for excess loss classes:
`Improving the sample complexity using global data.'
S. Mendelson.