[I told you last week that neural net research was popular in the 60's, but the 1969 book
 "Perceptrons" killed interest in them throughout the 70's. They came back in the 80's, but
 interest was partly killed off a second time in the 00's by...guess what? By support vector
 machines. SVMs work well for a lot of tasks, they're faster to train, and they more or less
 have only one hyperparameter, whereas neural nets take a lot of work to tune.]

[Neural nets are now in their third wave of popularity. The single biggest factor in bringing
 them back is probably big data. Thanks to the internet, we now have absolutely huge
 collections of images to train neural nets with, and researchers have discovered that neural
 nets often give better performance than competing algorithms when you have huge amounts of
 data to train them with. In particular, convolutional neural nets are now learning better
 features than hand-tuned features. That's a recent change.]

[One event that brought attention back to neural nets was the ImageNet Image Classification
 Challenge in 2012.]

[Show ImageNet slide (imagenet.png).]

[The winner of that competition was a neural net, and it won by a huge margin, about 10%.
 It's called AlexNet, and it's surprisingly similar to LeNet 5 in terms of how its layers are
 structured. However, there are some new innovations that led to its prize-winning
 performance, besides the fact that the training set had 1.4 million images: they used ReLUs,
 GPUs for training, and dropout.]

[Show AlexNet convolutional neural net diagram (alexnet.pdf).]


UNSUPERVISED LEARNING
=====================

We have sample points, but no labels!  No classes, no y-values, nothing to predict.
Goal:  Discover structure in the data.

Examples:
- Clustering:  partition data into groups of similar/nearby points.
- Dimensionality reduction:  data often lies near a low-dimensional subspace (or manifold)
  in feature space; matrices have low-rank approximations.
  [Whereas clustering is about grouping similar sample points, dimensionality reduction is
   more about identifying a continuous variation from sample point to sample point.]
- Density estimation:  fit a continuous distribution to discrete data.
  [When we use maximum likelihood estimation to fit Gaussians to sample points, that's
   density estimation, but we can also fit functions more complicated than Gaussians, with
   more local variation.]


PRINCIPAL COMPONENTS ANALYSIS (PCA) (Karl Pearson, 1901)
========================================================

Goal:  Given sample points in R^d, find k directions that capture most of the variation.
(Dimensionality reduction.)

[Show 3D points projected to 2D (3dpca.pdf).]
[Show MNIST digits projected to 2D (pcadigits.pdf).]

Why?
- Find a small basis for representing variations in complex things, e.g. faces.
- Reducing # of dimensions makes some computations cheaper, e.g. regression.
- Remove irrelevant dimensions to reduce overfitting in learning algs.
  Like subset selection, but we can choose features that aren't axis-aligned,
  i.e., linear combos of input features.

[Sometimes PCA is used as a preprocess before regression or classification for the last two
 reasons.]

Let X be the n-by-d design matrix.  [No fictitious dimension.]
From now on, assume X is centered:  the mean of the sample points X_i is zero.

[As usual, we can center the data by computing the mean x-value, then subtracting the mean
 from each sample point.]

[Let's start by seeing what happens if we pick just one principal direction.]

Let w be a unit vector.  The _orthogonal_projection_ of point x onto vector w is

    x~ = (x . w) w.        If w is not a unit vector,  x~ = ((x . w) / |w|^2) w.

[Draw a point x, the vector w, and the projection x~ of x onto the line through w.]

[The idea is that we're going to pick the best direction w, then project all the data down
 onto w so we can analyze it in a one-dimensional space. Of course, we lose a lot of
 information when we project down from d dimensions to just one. So, suppose we pick several
 directions. Those directions span a subspace, and we want to project points orthogonally
 onto the subspace. This is easy *if* the directions are orthogonal to each other.]

Given orthonormal directions v_1, ..., v_k,   x~ = sum_{i=1}^k (x . v_i) v_i.

[The word "orthonormal" implies they're mutually orthogonal and length 1.]

[Draw picture of orthogonal projection of a point onto a plane in 3D space.]

[Usually we don't actually want the projected point in R^d; usually we want the coordinates
 x . v_i in principal components space.]
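[A minimal NumPy sketch of these projection formulas; the point x and the orthonormal
 directions in V are made up for illustration.]

    import numpy as np

    x = np.array([3.0, 1.0, 2.0])               # a sample point in R^3
    w = np.array([1.0, 1.0, 0.0])
    w = w / np.linalg.norm(w)                   # make w a unit vector
    x_proj = (x @ w) * w                        # orthogonal projection of x onto w

    # Projection onto a subspace spanned by k orthonormal directions (columns of V):
    V = np.linalg.qr(np.random.randn(3, 2))[0]  # stand-in orthonormal directions v_1, v_2
    coords = x @ V                              # coordinates x . v_i in principal components space
    x_subspace = V @ coords                     # the projected point, back in R^3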
X^T X is a square, symmetric, positive semidefinite, d-by-d matrix.
Let 0 <= lambda_1 <= lambda_2 <= ... <= lambda_d be its eigenvalues.  [sorted]
Let v_1, v_2, ..., v_d be corresponding orthogonal *unit* eigenvectors.

[It turns out that the principal directions will be these eigenvectors, and the most
 important ones will be the ones with the greatest eigenvalues. I will show you this in
 three different ways.]

PCA derivation 1:  Fit a Gaussian to data with maximum likelihood estimation.
                   Choose k Gaussian axes of greatest variance.

[Show Gaussian fitted to sample points (gaussfitpca.png).]

Recall that MLE estimates a covariance matrix Sigma^ = (1/n) X^T X.  [If X is centered.]

PCA Alg:
- Center X.
- Optional:  Normalize X.  Units of measurement different?
  * Yes:  Normalize.
    [Bad for principal components to depend on arbitrary choice of scaling.]
  * No:  Usually don't.
    [If several features have the same unit of measurement, but some of them have much
     smaller variance, that difference is usually meaningful.]
  [Show different outcomes between normalized and not (normalize.pdf).]
- Compute unit eigenvectors/values of X^T X.
- Optional:  choose k based on the eigenvalue sizes.
- For the best k-dimensional subspace, pick eigenvectors v_{d-k+1}, ..., v_d.
- Compute the coordinates of training/test data in principal components space.
  [When we do this projection, we have two choices: we can un-center the input data before
   projecting it, OR we can translate the test data by the same vector we used to translate
   the training data when we centered it.]

[Show graph of # of eigenvectors vs. variance captured (variance.pdf).  In this example,
 just 3 eigenvectors capture 70% of the variance.]

[If you are using PCA as a preprocess for a supervised learning algorithm, there's a more
 effective way to choose k:  (cross-)validation.]
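[A minimal NumPy sketch of the algorithm above, assuming X is an n-by-d NumPy array; the
 function name pca_coords and the use of numpy.linalg.eigh are my choices here, not
 requirements.]

    import numpy as np

    def pca_coords(X, k):
        """X is the n-by-d design matrix. Returns the mean, the top k principal
        directions, and the training data's coordinates in principal components space."""
        mu = X.mean(axis=0)
        Xc = X - mu                                   # center X
        # (Optionally divide each column by its standard deviation here, if the
        #  features have different units of measurement.)
        eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc)  # eigh sorts eigenvalues in increasing order
        V = eigvecs[:, -k:]                           # v_{d-k+1}, ..., v_d
        return mu, V, Xc @ V

    # Test points are translated by the same mean vector before projecting:
    #   Z_test = (X_test - mu) @ V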
PCA derivation 2:  Find direction w that maximizes variance of projected data.

[In other words, when we project the data down, we don't want it all to bunch up; we want
 to keep it as spread out as possible.]

[Show projection of points (project.jpg).]

    Var({X~_1, X~_2, ..., X~_n}) = (1/n) sum_{i=1}^n (X_i . w/|w|)^2
                                 = (1/n) |Xw|^2 / |w|^2
                                 = (1/n) (w^T X^T X w) / (w^T w),

where (w^T X^T X w) / (w^T w) is the _Rayleigh_quotient_ of X^T X and w.

[This fraction is a well-known construction called the Rayleigh quotient. When you see it,
 you should smell eigenvectors nearby. How do we maximize this?]

If w is an eigenvector v_i, the Rayleigh quotient = lambda_i
-> of all eigenvectors, v_d achieves maximum variance lambda_d / n.
One can show v_d beats every other vector too.

[Because every vector w is a linear combination of eigenvectors, and so its Rayleigh
 quotient will be a convex combination of eigenvalues. It's easy to prove this, but I don't
 want to take the time. For the proof, look up "Rayleigh quotient" in Wikipedia.]

[So the top eigenvector gives us the best direction. But we typically want k directions.
 After we've picked one direction, then we have to pick a direction that's orthogonal to
 the best direction. But subject to that constraint, we again pick the direction that
 maximizes the variance.]

What if we constrain w to be orthogonal to v_d?  Then pick v_{d-1}.

PCA derivation 3:  Find direction w that minimizes "projection error".

[Show animation of PCA projection (PCAanimation.gif).]

[You can think of this as a sort of least-squares linear regression, with one important
 change. Instead of measuring the error in a fixed vertical direction, we're measuring the
 error in a direction orthogonal to the principal component direction we choose.]

[Show linear regression vs. PCA (mylsq.png, mypca.png).]

    sum_{i=1}^n |X_i - X~_i|^2 = sum_{i=1}^n |X_i - ((X_i . w) / |w|^2) w|^2
                               = sum_{i=1}^n (|X_i|^2 - (X_i . w/|w|)^2)
                               = constant - n (variance from derivation 2).

Minimizing projection error = maximizing variance.

[From this point, we carry on with the same reasoning as derivation 2.]

[Show illustration of the first two principal components of the single nucleotide
 polymorphism (SNP) matrix for the genes of various Europeans (europegenetics.pdf). The
 input matrix has 2,541 people from these locations in Europe, and 309,790 SNPs. Each SNP
 is binary, so think of it as 309,790 dimensions of zero or one. The output shows spots on
 the first two principal components where the projected people from a particular national
 type are denser than a certain threshold. What's amazing about this is how closely the
 projected genotypes match the geography of Europe. (From Lao et al., 2008.)]

Eigenfaces
----------

X contains n images of faces, d pixels each.

[If we have a 200 x 200 image of a face, we represent it as a vector of length 40,000, the
 same way we represent the MNIST digit data.]

Face recognition:  Given a query face, compare it to all training faces; find the nearest
neighbor in R^d.

[This works best if you have several training photos of each person you want to recognize,
 with different lighting and different facial expressions.]

Problem:   Each query takes Theta(nd) time.
Solution:  Run PCA on faces.  Reduce to much smaller dimension d'.
           Now nearest neighbor takes O(nd') time.

[Possibly even less. We'll talk about speeding up nearest-neighbor search at the end of the
 semester. If the dimension is small enough, you can sometimes do better than linear time.]

[Show images of average face and eigenfaces (facerecaverage.jpg, facereceigen0.jpg,
 facereceigen119.jpg, facereceigen.jpg).]

[Show images of a face projected onto the first 4 and 50 eigenvectors
 (eigenfaceproject.pdf). Latter is blurry but good enough for recognition.]

For best results, equalize the intensity distributions first.

[Show image equalization (facerecequalize.jpg).]

[If each image has 40,000 pixels, and you reduce it to 40 principal components, then each
 query face requires you to read 20,000 stored coordinates instead of 20 million pixels
 (that's with 500 stored training faces).]
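[A self-contained sketch of that recognition pipeline, run here on random stand-in data;
 the sizes n = 500, d = 1,600, k = 40 and the name recognize are made up for illustration.]

    import numpy as np

    n, d, k = 500, 1600, 40                      # stand-ins; real face images would have d = 40,000
    faces = np.random.rand(n, d)                 # n training faces, one per row

    mu = faces.mean(axis=0)
    _, eigvecs = np.linalg.eigh((faces - mu).T @ (faces - mu))
    V = eigvecs[:, -k:]                          # the top k "eigenfaces"
    train_coords = (faces - mu) @ V              # each training face stored as k coordinates

    def recognize(query_face):
        """Return the index of the nearest training face, measured in the reduced space."""
        q = (query_face - mu) @ V                # project the query into principal components space
        dists = np.linalg.norm(train_coords - q, axis=1)   # O(nk) work instead of Theta(nd)
        return np.argmin(dists)

    print(recognize(faces[123]))                 # prints 123: the query matches itself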
[Eigenfaces are not perfect. They encode both face shape *and* lighting. Ideally, we would
 have some way to factor out lighting and analyze face shape only, but that's harder. Some
 people say that the first 3 eigenfaces are usually all about lighting, and you sometimes
 get better facial recognition by dropping the first 3 eigenfaces.]

[Show Blanz-Vetter face morphing video (morphmod.mpg).]

[Blanz and Vetter use PCA in a more sophisticated way for 3D face modeling. They take 3D
 scans of people's faces and find correspondences between people's faces and an idealized
 model. For instance, they identify the tip of your nose, the corners of your mouth, and
 other facial features, which is something the original eigenface work did not do. Instead
 of feeding an array of pixels into PCA, they feed the 3D locations of various points on
 your face into PCA. This works more reliably.]