[I told you last week that neural net research was popular in the 60's, but
the 1969 book "Perceptrons" killed interest in them throughout the 70's. They
came back in the 80's, but interest was partly killed off a second time in the
00's by...guess what? By support vector machines. SVMs work well for a lot
of tasks, they're faster to train, and they more or less have only one
hyperparameter, whereas neural nets take a lot of work to tune.]
[Neural nets are now in their third wave of popularity. The single biggest
factor in bringing them back is probably big data. Thanks to the internet,
we now have absolutely huge collections of images to train neural nets with,
and researchers have discovered that neural nets often give better performance
than competing algorithms when you have huge amounts of data to train them
with. In particular, convolutional neural nets are now learning better
features than handtuned features. That's a recent change.]
[One event that brought attention back to neural nets was the ImageNet Image
Classification Challenge in 2012.]
[Show ImageNet slide (imagenet.png).]
[The winner of that competition was a neural net, and it won by a huge margin,
about 10%. It's called AlexNet, and it's surprisingly similarly to LeNet 5,
in terms of how its layers are structured. However, there are some new
innovations that led to their prizewinning performance, besides the fact that
the training set had 1.4 million images: they used ReLUs, GPUs for training,
and dropout.]
[Show AlexNet convolutional neural net diagram (alexnet.pdf).]
UNSUPERVISED LEARNING
=====================
We have sample points, but no labels!
No classes, no yvalues, nothing to predict.
Goal: Discover structure in the data.
Examples:
 Clustering: partition data into groups of similar/nearby points.
 Dimensionality reduction: data often lies near a lowdimensional subspace
(or manifold) in feature space; matrices have lowrank approximations.
[Whereas clustering is about grouping similar sample points, dimensionality
reduction is more about identifying a continuous variation from sample point
to sample point.]
 Density estimation: fit a continuous distribution to discrete data.
[When we use maximum likelihood estimation to fit Gaussians to sample points,
that's density estimation, but we can also fit functions more complicated
than Gaussians, with more local variation.]
PRINCIPAL COMPONENTS ANALYSIS (PCA) (Karl Pearson, 1901)
=============================
Goal: Given sample points in R^d, find k directions that capture most of the
variation. (Dimensionality reduction.)
[Show 3D points projected to 2D (3dpca.pdf).]
[Show MNIST digits projected to 2D (pcadigits.pdf).]
Why?
 Find a small basis for representing variations in complex things, e.g. faces.
 Reducing # of dimensions makes some computations cheaper, e.g. regression.
 Remove irrelevant dimensions to reduce overfitting in learning algs.
Like subset selection, but we can choose features that aren't axisaligned,
i.e., linear combos of input features.
[Sometimes PCA is used as a preprocess before regression or classification for
the last two reasons.]
Let X be nbyd design matrix. [No fictitious dimension.]
From now on, assume X is centered: mean X_i is zero.
[As usual, we can center the data by computing the mean xvalue, then
subtracting the mean from each sample point.]
[Let's start by seeing what happens if we pick just one principal direction.]
Let w be a unit vector. ~
The _orthogonal_projection_ of point x onto vector w is x = (x . w) w
~ x . w
If w not unit, x =  w
w^2
o x


w v ~
O> o x
[The idea is that we're going to pick the best direction w, then project all
the data down onto w so we can analyze it in a onedimensional space.
Of course, we lose a lot of information when we project down from d dimensions
to just one. So, suppose we pick several directions. Those directions span
a subspace, and we want to project points orthogonally onto the subspace.
This is easy *if* the directions are orthogonal to each other.]
~ k
Given orthonormal directions v_1, ..., v_k, x = sum (x . v ) v
i=1 i i
[The word "orthonormal" implies they're mutually orthogonal and length 1.]
[Draw picture of orthogonal projection of a point onto a plane in 3D space.]
[Usually we don't actually want the projected point in R^d;
usually we want the coordinates x . v_i in principal components space.]
X^T X is square, symmetric, positive semidefinite, dbyd matrix.
Let 0 <= lambda_1 <= lambda_2 <= ... <= lambda_d be its eigenvalues. [sorted]
Let v_1, v_2, ..., v_d be corresponding orthogonal *unit* eigenvectors.
[It turns out that the principal directions will be these eigenvectors, and
the most important ones will be the ones with the greatest eigenvalues.
I will show you this in three different ways.]
PCA derivation 1: Fit a Gaussian to data with maximum likelihood estimation.
Choose k Gaussian axes of greatest variance.
[Show Gaussian fitted to sample points (gaussfitpca.png).]
^ 1 T
Recall that MLE estimates a covariance matrix Sigma =  X X. [If X centered.]
n
PCA Alg:
 Center X.
 Optional: Normalize X. Units of measurement different?
* Yes: Normalize.
[Bad for principal components to depend on arbitrary choice of scaling.]
* No: Usually don't.
[If several features have the same unit of measurement, but some of them
have much smaller variance, that difference is usually meaningful.]
[Show difference outcomes between normalized and not (normalize.pdf).]
 Compute unit eigenvectors/values of X^T X.
 Optional: choose k based on the eigenvalue sizes.
 For the best kdimensional subspace, pick eigenvectors v_{dk+1}, ..., v_d.
 Compute the coordinates of training/test data in principal components space.
[When we do this projection, we have two choices: we can uncenter the input
data before projecting it, OR we can translate the test data by the same
vector we used to translate the training data when we centered it.]
[Show graph of # of eigenvectors vs. variance captured (variance.pdf).
In this example, just 3 eigenvectors capture 70% of the variance.]
[If you are using PCA as a preprocess for a supervised learning algorithm,
there's a more effective way to choose k: (cross)validation.]
PCA derivation 2: Find direction w that maximizes variance of projected data
[In other words, when we project the data down, we don't want it all to bunch
up; we want to keep it as spread out as possible.]
[Show projection of points (project.jpg).]
T T
~ ~ ~ 1 n w 2 1 Xw^2 1 w X X w
Var({ X , X , ..., X }) =  sum (X . ) =   =  
1 2 n n i=1 i w n w^2 n w^T w
\_______/
_Rayleigh_quotient_ of X^T X and w
[This fraction is a wellknown construction called the Rayleigh quotient. When
you see it, you should smell eigenvectors nearby. How do we maximize this?]
If w is an eigenvector v_i, Ray. quo. = lambda_i
> of all eigenvectors, v_d achieves maximum variance lambda_d / n.
One can show v_d beats every other vector too.
[Because every vector w is a linear combination of eigenvectors, and so
its Rayleigh quotient will be a convex combination of eigenvalues.
It's easy to prove this, but I don't want to take the time.
For the proof, look up "Rayleigh quotient" in Wikipedia.]
[So the top eigenvector gives us the best direction. But we typically want
k directions. After we've picked one direction, then we have to pick
a direction that's orthogonal to the best direction. But subject to that
constraint, we again pick the direction that maximizes the variance.]
What if we constrain w to be orthogonal to v_d? Then pick v_{d1}.
PCA derivation 3: Find direction w that minimizes "projection error"
[Show animation of PCA projection (PCAanimation.gif).]
[You can think of this as a sort of leastsquares linear regression, with one
important change. Instead of measuring the error in a fixed vertical
direction, we're measuring the error in a direction orthogonal to the
principal component direction we choose.]
[Show linear regression vs. PCA (mylsq.png, mypca.png).]
n  ~ 2 n  X_i . w 2 n 2 w 2
sum X  X  = sum X   w = sum (X   (X . ) )
i=1  i i i=1  i w^2  i=1 i i w
= constant  n (variance from derivation 2).
Minimizing projection error = maximizing variance.
[From this point, we carry on with the same reasoning as derivation 2.]
[Show illustration of the first two principal components of the single
nucleotide polymorphism (SNP) matrix for the genes of various Europeans
(europegenetics.pdf). The input matrix has 2,541 people from these locations
in Europe, and 309,790 SNPs. Each SNP is binary, so think of it as 309,790
dimensions of zero or one. The output shows spots on the first two principal
components where the projected people from a particular national type are
denser than a certain threshold. What's amazing about this is how closely the
projected genotypes match the geography of Europe. (From Lao et al., 2008.)]
Eigenfaces

X contains n images of faces, d pixels each.
[If we have a 200 x 200 image of a face, we represent it as a vector of length
40,000, the same way we represent the MNIST digit data.]
Face recognition: Given a query face, compare it to all training faces;
find nearest neighbor in R^d.
[This works best if you have several training photos of each person you want to
recognize, with different lighting and different facial expressions.]
Problem: Each query takes Theta(nd) time.
Solution: Run PCA on faces. Reduce to much smaller dimension d'.
Now nearest neighbor takes O(nd') time.
[Possibly even less. We'll talk about speeding up nearestneighbor
search at the end of the semester. If the dimension is small
enough, you can sometimes do better than linear time.]
[Show images of average face and eigenfaces (facerecaverage.jpg,
facereceigen0.jpg, facereceigen119.jpg, facereceigen.jpg).]
[Show images of a face projected onto the first 4 and 50 eigenvectors
(eigenfaceproject.pdf). Latter is blurry but good enough for recognition.]
For best results, equalize the intensity distributions first.
[Show image equalization (facerecequalize.jpg).]
[If each image has 40,000 pixels, and you reduce it to 40 principal components,
then each query face requires you to read 20,000 stored coordinates instead of
20 million pixels.]
[Eigenfaces are not perfect. They encode both face shape *and* lighting.
Ideally, we would have some way to factor out lighting and analyze face shape
only, but that's harder. Some people say that the first 3 eigenfaces are
usually all about lighting, and you sometimes get better facial recognition by
dropping the first 3 eigenfaces.]
[Show BlanzVetter face morphing video (morphmod.mpg).]
[Blanz and Vetter use PCA in a more sophisticated way for 3D face modeling.
They take 3D scans of people's faces and find correspondences between peoples'
faces and an idealized model. For instance, they identify the tip of your
nose, the corners of your mouth, and other facial features, which is something
the original eigenface work did not do. Instead of feeding an array of pixels
into PCA, they feed the 3D locations of various points on your face into PCA.
This works more reliably.]