Kernel Perceptrons
------------------
Note: Everywhere below, we can replace X_i with Phi(X_i).
Recall perceptron alg:
  while some y_i X_i . w < 0
    w <- w + epsilon y_i X_i
  for each test pt z
    f(z) <- w^T z
Kernelize with w = X^T a:  X_i . w = (X X^T a)_i = (Ka)_i
Dual perceptron alg:
  a = [y_1, 0, ..., 0]^T   [starting point is arbitrary, but can't be 0]
  K_ij = k(X_i, X_j) for all i, j   <= O(n^2 d) time (kernel trick)
  while some y_i (Ka)_i < 0
    a_i <- a_i + epsilon y_i   <= optimization: can update Ka in O(n) time
  for each test pt z
    f(z) <- sum_{j=1}^{n} a_j k(X_j, z)   <= O(nd) time
[The d's above do not increase if we replace X_i with a kernelized Phi(X_i)!]
OR we can compute w = X^T a once in O(nd') time
  & evaluate test pts in O(d') time/pt, where d' is length of Phi(.)
Interpretation: a_i reflects how many times we have added epsilon X_i
(or -epsilon X_i) to w.
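The dual perceptron algorithm above can be sketched in NumPy. This is an illustrative sketch, not a definitive implementation; the function names and the `max_iter` safeguard are my own choices, and it assumes labels in {-1, +1}:

```python
import numpy as np

# Sketch of the dual (kernelized) perceptron from the notes.
def dual_perceptron(X, y, kernel, epsilon=1.0, max_iter=1000):
    n = X.shape[0]
    # Kernel matrix: K[i, j] = k(X_i, X_j) -- O(n^2 d) time for typical kernels.
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    a = np.zeros(n)
    a[0] = y[0]                          # arbitrary nonzero starting point
    Ka = K @ a                           # maintained incrementally below
    for _ in range(max_iter):
        misclassified = np.where(y * Ka < 0)[0]
        if len(misclassified) == 0:      # no y_i (Ka)_i < 0 remains
            break
        i = misclassified[0]
        a[i] += epsilon * y[i]
        Ka += epsilon * y[i] * K[:, i]   # O(n) update instead of recomputing K @ a
    return a

def predict(a, X_train, kernel, z):
    # f(z) = sum_j a_j k(X_j, z) -- O(nd) per test point
    return sum(a[j] * kernel(X_train[j], z) for j in range(len(a)))
```

With a linear kernel `k(u, v) = u . v` this reduces to the ordinary perceptron, which is a handy sanity check.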
Kernel Logistic Regression
--------------------------
[The stochastic gradient ascent step for logistic regression is just a small
modification of the step for perceptrons. However, remember that we're no
longer looking for misclassified samples; instead, we apply the gradient
ascent rule to weights in a stochastic, random order--or, alternatively, to
all the weights at once.]
Stochastic gradient ascent step:
  a_i <- a_i + epsilon (y_i - s((Ka)_i))   [where s is the logistic function]
[Just like with perceptrons, every time you update one weight a_i, if you're
clever you can update Ka in O(n) time so you don't have to compute it from
scratch on the next iteration.]
Gradient ascent step:
a <- a + epsilon (y - s(Ka)) <= applying s component-wise to vector Ka
for each test pt z:
  h(z) <- s( sum_{j=1}^{n} a_j k(X_j, z) )
[or, if you're using logistic regression as a classifier and you don't care
about the probability, you can skip the logistic fn and just compute the sum.]
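The batch gradient ascent rule above can be sketched as follows. This is a sketch under stated assumptions: labels y are 0/1 (as in the logistic-regression update y - s), the step size and iteration count are arbitrary choices of mine, and no stopping criterion is shown:

```python
import numpy as np

def s(t):
    # the logistic (sigmoid) function
    return 1.0 / (1.0 + np.exp(-t))

# Sketch of dual kernel logistic regression trained with the batch rule
# a <- a + epsilon (y - s(Ka)); K is the precomputed kernel matrix.
def kernel_logistic_regression(K, y, epsilon=0.1, iters=1000):
    a = np.zeros(len(y))
    for _ in range(iters):
        a = a + epsilon * (y - s(K @ a))   # s applied component-wise to Ka
    return a

def predict_prob(a, X_train, kernel, z):
    # h(z) = s(sum_j a_j k(X_j, z))
    return s(sum(a[j] * kernel(X_train[j], z) for j in range(len(a))))
```

For classification without probabilities, return the inner sum and threshold it at zero instead of applying `s`.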
The Gaussian Kernel
-------------------
[...But I think our next act is even more mind-blowing. Since we can now do
fast computations in spaces with exponentially large dimension, why don't we
go all the way and do computations in infinite-dimensional space?]
_Gaussian_kernel_, aka _radial_basis_fn_kernel_:
There exists a Phi(x) such that
  k(x, z) = exp( - |x - z|^2 / (2 sigma^2) )
[In case you're curious, here's the feature vector that gives you this kernel,
for the case where you have only one input feature per sample.]
e.g. for d = 1,
  Phi(x) = exp(-x^2 / (2 sigma^2)) [1, x / (sigma sqrt(1!)), x^2 / (sigma^2 sqrt(2!)), ... ]^T
[This is an infinite vector, and Phi(x)^T Phi(z) is a converging series.
Nobody actually uses this value of Phi(x) directly, or even cares about it;
they just use the kernel function k(.,.).]
[At this point, it's best *not* to think of points in a high-dimensional space.
Instead, think of the kernel k as a measure of how similar or close together
two samples are to each other.]
Key observation: hypothesis h(z) = sum_{j=1}^{n} a_j k(X_j, z) is a linear
combo of Gaussians centered at samples.
[The dual weights are the coefficients of the linear combo.]
[Think of the Gaussians as a basis for the hypothesis.]
[Show example hypothesis (gausskernel.pdf)]
Very popular in practice! Why?
- Gives very smooth h
- Behaves somewhat like k-nearest neighbor, but smoother
- Oscillates less than polynomials (depending on sigma)
- k(x, z) can be interpreted as a "similarity measure".
Gaussian is maximum when x = z, goes to 0 as distance increases.
- Samples "vote" for value at z, but closer samples get a weightier vote.
[The "standard" kernel x . y assigns high value to vectors that point in
roughly the same direction. The Gaussian kernel assigns high value to points
near each other.]
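The similarity-measure view is easy to see numerically. A minimal sketch (function name mine):

```python
import numpy as np

# The Gaussian kernel as a similarity measure: equals 1 when x = z and
# decays toward 0 as |x - z| grows; sigma controls the decay width.
def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))
```

For example, `gaussian_kernel(x, x)` is exactly 1, and points farther from `x` always score lower than nearer ones.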
sigma trades off bias vs. variance:
larger sigma -> wider Gaussians, smoother h -> more bias, less variance
Choose by (cross-)validation.
[Show decision boundary of SVM with Gauss kernel (gausskernelsvm.pdf)]
[By the way, there are many other kernels like this that are defined directly
as kernel functions without worrying about Phi. But not every function can be
a kernel function. A function is qualified only if it always generates
a positive semidefinite kernel matrix, no matter what the samples are.]
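The positive semidefinite condition can be checked numerically for a given sample set by looking at the eigenvalues of the kernel matrix. A sketch (names and the test sample set are my own; note this can only refute a candidate kernel on one sample set, never prove it valid for all):

```python
import numpy as np

# Sanity check: a valid kernel must give a positive semidefinite kernel
# matrix K for ANY set of samples. Here we test one random sample set.
def psd_on_samples(kernel, X, tol=1e-8):
    n = len(X)
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    return bool(np.linalg.eigvalsh(K).min() >= -tol)   # K is symmetric

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))
gauss = lambda x, z: np.exp(-np.sum((x - z) ** 2) / 2.0)
bad = lambda x, z: -float(x @ z)   # not a kernel: its matrix is -X X^T
```

Running `psd_on_samples(gauss, X)` passes, while `bad` fails, because `-X X^T` has nonpositive eigenvalues.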
SUBSET SELECTION
================
All features increase variance, but not all features reduce bias (much).
Idea: Identify poorly predictive features, ignore them (weight zero).
Less overfitting, lower test errors.
2nd motivation: Inference. Simpler models convey interpretable wisdom.
Useful in all classification & regression methods.
Sometimes it's hard: Different features can partly encode same information.
Combinatorially hard to choose best feature subset.
Alg: Best subset selection. Try all 2^d - 1 nonempty subsets of features.
Choose the best model by (cross-)validation. Slow.
[Obviously, best subset selection isn't tractable if we have a lot of features.
But it gives us an "ideal" algorithm to compare practical algorithms with.
If d is large, there is no algorithm that's both guaranteed to find the best
subset and that runs in acceptable time. But heuristics often work well.]
Heuristic: Forward stepwise selection.
Start with _null_model_ (0 features); repeatedly add best feature until
test errors start increasing (due to overfitting) instead of decreasing.
At each outer iteration, inner loop tries every feature and chooses the best
by cross-validation. Requires training O(d^2) models instead of O(2^d).
Not perfect: Won't find the best 2-feature model if neither of those features
yields the best 1-feature model.
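Forward stepwise selection can be sketched as follows. This is an illustrative sketch, not the canonical algorithm: the notes score subsets by cross-validation, but to keep it short I use a single held-out validation set and the mean squared error of a least-squares fit as the score:

```python
import numpy as np

# Sketch of forward stepwise selection with a validation-set score.
def forward_stepwise(X_train, y_train, X_val, y_val):
    d = X_train.shape[1]
    chosen = []                                  # start with the null model
    best_err = np.inf
    while len(chosen) < d:
        best_feature, best_trial_err = None, np.inf
        for j in range(d):                       # inner loop: try each feature
            if j in chosen:
                continue
            cols = chosen + [j]
            w, *_ = np.linalg.lstsq(X_train[:, cols], y_train, rcond=None)
            err = np.mean((X_val[:, cols] @ w - y_val) ** 2)
            if err < best_trial_err:
                best_feature, best_trial_err = j, err
        if best_trial_err >= best_err:           # stop when validation error rises
            break
        chosen.append(best_feature)
        best_err = best_trial_err
    return chosen
```

The outer loop runs at most d times and the inner loop tries at most d features, hence O(d^2) trained models.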
Heuristic: Backward stepwise selection.
Start with all d features; repeatedly remove feature whose removal gives best
reduction in test errors. Also trains O(d^2) models.
Additional heuristic: Only try to remove features with small weights.
Q: small relative to what?
Recall: variance of least-squ. regr. is proportional to sigma^2 (X^T X)^{-1}
_z-score_ of weight w_i is z_i = w_i / (sigma sqrt(v_i))
  where v_i is the ith diagonal entry of (X^T X)^{-1}.
Small z-score hints "true" w_i could be zero.
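The z-score formula above is a one-liner in NumPy. A sketch, assuming the noise standard deviation sigma is known (in practice it is estimated from the residuals):

```python
import numpy as np

# z-scores of least-squares weights: z_i = w_i / (sigma sqrt(v_i)),
# where v_i is the ith diagonal entry of (X^T X)^{-1}.
def z_scores(X, w, sigma):
    v = np.diag(np.linalg.inv(X.T @ X))
    return w / (sigma * np.sqrt(v))
```

Weights with small |z_i| are the candidates for removal in backward stepwise selection.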
[Forward stepwise selection is a better choice when you suspect only a few
features will be good predictors. Backward stepwise is better when you
suspect most of the features will be necessary. If you're lucky, you'll stop
early.]
LASSO (Tibshirani, 1996)
=====
Regression w/regularization: (1) + (A) + l_1 penalized mean loss (e).
"Least absolute shrinkage and selection operator"
[This is a regularized regression method similar to ridge regression, but it
has the advantage that it often naturally sets some of the weights to zero.]
----------------------------------------------------
| Find w that minimizes |Xw - y|^2 + lambda |w'|_1 |
----------------------------------------------------
          d
where |w'| = sum |w |   (Don't penalize alpha.)
          1 i=1   i
Recall ridge regr.: isosurfaces of |w'|^2 are hyperspheres.
The isosurfaces of |w'|_1 are _cross-polytopes_.
The unit cross-polytope is the convex hull of all the positive and negative
unit coordinate vectors.
         ^
        /|\
       / | \
2D: <--+-->    3D: [Draw 3D cross-polytope]
       \ | /
        \|/
         v
[You get other cross-polytope isosurfaces by scaling this.]
[Show plot of isocontours for the two objective fn terms (lassoridge.pdf)]
[Recall that beta is the least-squares solution, and the red ellipses are the
isocontours of |Xw - y|^2. The isocontours of |w'|_1 are diamonds centered at
the origin (blue). The solution lies where a red isocontour just touches
a blue diamond. What's interesting here is that in this example, the red
isocontour touches just the tip of the diamond. So one of the weights gets
set to zero. That's what we want to happen to weights that don't have enough
influence. This doesn't always happen--for instance, the red isosurface could
touch a side of the diamond instead of a tip of the diamond.]
[When you go to higher dimensions, you might have several weights set to zero.
For example, in 3D, if the red isosurface touches a sharp corner of the
cross-polytope, two of the three weights get set to zero. If it touches
a sharp edge of the cross-polytope, one weight gets set to zero. If it
touches a flat side of the cross-polytope, no weight is zero.]
[Show plot of weights as a function of lambda (lassoweights.pdf)]
[This shows the weights for a typical linear regression problem with about 10
variables. You can see that as lambda increases, more and more of the weights
become zero. Only four of the weights are really useful for prediction;
they're in color. Statisticians used to choose lambda by looking at a chart
like this and trying to eyeball a spot where there aren't too many predictors
and the weights aren't changing too fast. But nowadays they prefer
cross-validation, as always.]
Sometimes sets some weights to zero, especially for large lambda.
Algs: subgradient descent, least-angle regression (LARS), forward stagewise
[Lasso's objective function is not smooth, and that makes it tricky to
optimize. I'm not going to teach you an optimization method for Lasso. If
you need one, look up the last two of these names; LARS is built into the
R Programming Language for statistics.]
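To see the zeroing behavior concretely, here is a sketch of proximal gradient descent (ISTA), a standard simple lasso solver that is not one of the algorithms named above; the function names, step size, and iteration count are my own choices, and the bias term alpha is omitted for brevity:

```python
import numpy as np

def soft_threshold(w, t):
    # Proximal operator of the l_1 norm: shrink toward 0, clamp small
    # entries to exactly 0 -- this is where sparsity comes from.
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

# Sketch of ISTA for the lasso objective |Xw - y|^2 + lambda |w|_1.
def lasso_ista(X, y, lam, iters=5000):
    d = X.shape[1]
    eta = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)  # step from Lipschitz constant
    w = np.zeros(d)
    for _ in range(iters):
        grad = 2.0 * X.T @ (X @ w - y)             # gradient of |Xw - y|^2
        w = soft_threshold(w - eta * grad, eta * lam)
    return w
```

On data generated from a sparse weight vector, the irrelevant weights come out exactly zero, unlike ridge regression, which merely shrinks them.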
[As with ridge regression, you should probably normalize the features first
before applying this method.]