SOFT-MARGIN SUPPORT VECTOR MACHINES (SVMs)
==========================================

Solves 2 problems:
- Hard-margin SVMs fail if the data are not linearly separable.
- Hard-margin SVMs are sensitive to outliers.

[Show example where one outlier moves the decision boundary a lot.]

Idea:  Allow some samples to violate the margin, with _slack_variables_.
       Modified constraint for sample i:

       y_i (X_i . w + alpha) >= 1 - xi_i

[Observe that the only difference between these constraints and the
 hard-margin constraints we saw last lecture is the extra slack term xi_i.]

[We also impose new constraints, that the slack variables are never negative.]

       xi_i >= 0

[This inequality ensures that all samples that *don't* violate the margin are
 treated the same; they all have xi_i = 0.  Sample i has a nonzero xi_i if and
 only if it violates the margin.]

[Show figure of margin where some samples have slack.  For each violating
 point, the slack distance is xi*_i = xi_i / |w|.]

To prevent abuse of slack, we add a _loss_term_ to the objective fn.

Optimization problem:

--------------------------------------------------------------------------
|                                                    n                    |
|  Find w, alpha, and xi_i that minimize  |w|^2 + C sum xi_i              |
|                                                   i=1                   |
|                                                                         |
|  subject to   y_i (X_i . w + alpha) >= 1 - xi_i   for all i in [1, n]   |
|               xi_i >= 0                           for all i in [1, n]   |
--------------------------------------------------------------------------

...a quadratic program in d + n + 1 dimensions and 2n constraints.
[It's a quadratic program because its objective function is quadratic and its
 constraints are linear inequalities.]

C > 0 is a scalar _regularization_hyperparameter_ that trades off:

          |------------------------|------------------------------------------|
          |        small C         |                  big C                   |
----------|------------------------|------------------------------------------|
 desire   | maximize margin 1/|w|  | keep most slack variables zero or small  |
----------|------------------------|------------------------------------------|
 danger   | underfitting           | overfitting                              |
          | (misclassifies much    | (awesome training, awful test)           |
          |  training data)        |                                          |
----------|------------------------|------------------------------------------|
 outliers | less sensitive         | very sensitive                           |
----------|------------------------|------------------------------------------|
 boundary | more "flat"            | more sinuous                             |
-------------------------------------------------------------------------------

[The last row only applies to nonlinear decision boundaries, which we'll
 discuss next.  Obviously, a linear decision boundary can't be sinuous.]

Use validation to choose C.

[Show examples of how the slab varies with C.  Smallest C upper left; largest
 C lower right.]

[One way to think about slack is to pretend that slack is money we can spend
 to buy permission for a sample to violate the margin.  The further a sample
 penetrates the margin, the bigger the fine you have to pay.  We want to make
 the margin as big as possible, but we also want to spend as little money as
 possible.  If the regularization parameter C is small, it means we're willing
 to spend lots of money on violations so we can get a bigger margin.  If C is
 big, it means we're cheap and we want to prevent violations, even though
 we'll get a narrower margin.  If C is infinite, we're back to a hard-margin
 SVM.]
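[A minimal scikit-learn sketch of the soft-margin SVM above, not part of the
 original notes.  sklearn's SVC with a linear kernel solves essentially this
 quadratic program, with the objective scaled as (1/2)|w|^2 + C sum xi_i; the
 toy blobs and the list of C values are placeholders chosen only to
 illustrate how C is swept and chosen by validation.]

    import numpy as np
    from sklearn.datasets import make_blobs
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    # Toy 2-class data with overlapping classes, so a hard margin would fail.
    X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

    # Sweep the regularization hyperparameter C; use validation to choose it.
    for C in [0.01, 0.1, 1.0, 10.0, 100.0]:
        clf = SVC(kernel="linear", C=C).fit(X_train, y_train)
        margin = 1.0 / np.linalg.norm(clf.coef_)          # margin 1/|w|
        print(f"C = {C:7.2f}   margin = {margin:.3f}   "
              f"validation accuracy = {clf.score(X_val, y_val):.3f}")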
FEATURES
========

Q:  How to do nonlinear decision boundaries?

A:  Make nonlinear _features_ that _lift_ samples into a higher-dimensional
    space.
    High-d linear classifier -> low-d nonlinear classifier.

[Features work with all classifiers, including perceptrons, hard-margin SVMs,
 and soft-margin SVMs.]

Example 1:  The _parabolic_lifting_map_
---------------------------------------

Phi(x):  R^d -> R^{d+1}

              -       -
             |    x    |
    Phi(x) = |         |       <--- lifts x onto paraboloid x_{d+1} = |x|^2
             |  |x|^2  |
              -       -

[We've added a new feature, |x|^2.  Even though the new feature is just a
 function of other input features, it gives our linear classifier more power.]

Find a linear classifier in Phi-space.
It induces a sphere classifier in x-space.

[Draw 2D scatter here:  O's inside a circular decision boundary, X's outside.
 Draw paraboloid, lifted samples, and plane decision boundary in 3D here.
 Draw circle decision boundary.]

Theorem:  Phi(X_1), ..., Phi(X_n) are linearly separable iff X_1, ..., X_n
          are separable by a hypersphere.
          (Possibly a _degenerate_ hypersphere = hyperplane.)

Proof:  Consider hypersphere in R^d w/center c & radius rho.  Points inside:

                        |x - c|^2  <  rho^2
            |x|^2 - 2c . x + |c|^2  <  rho^2

                             -       -
                            |    x    |
               [-2c^T  1]   |         |  <  rho^2 - |c|^2
                            |  |x|^2  |
                             -       -
                    ^            ^
              normal vector    Phi(x)

        Hence points inside sphere -> same side of hyperplane in Phi-space.
        (Reverse implication works too.)

[Although the math above doesn't expose it, hyperplane separators are a
 special case of hypersphere separators, so hypersphere classifiers can do
 everything linear classifiers can do and more.  If you take a sphere and
 increase its radius to infinity while making it pass through some point, in
 the limit you get a plane; so you can think of a plane as a degenerate
 sphere.  With the parabolic lifting map, a hyperplane in x-space corresponds
 to a hyperplane in Phi-space that is parallel to the x_{d+1}-axis.]

Example 2:  Axis-aligned ellipsoid/hyperboloid decision boundaries
------------------------------------------------------------------

[Draw examples of axis-aligned ellipses & hyperbola.]

In 3D, these have the formula

    A x_1^2 + B x_2^2 + C x_3^2 + D x_1 + E x_2 + F x_3 = -G

[Here, the capital letters are scalars, not matrices.]

Phi(x):  R^d -> R^{2d}

    Phi(x) = [ x_1^2  ...  x_d^2  x_1  ...  x_d ]^T

[We've turned d input features into 2d features for our linear classifier.
 If the samples are separable by an axis-aligned ellipsoid or hyperboloid,
 per the formula above, then the samples lifted to Phi-space are separable by
 a hyperplane whose normal vector is (A, B, C, D, E, F).]

Example 3:  Ellipsoid/hyperboloid
---------------------------------

[Draw example of a non-axis-aligned ellipse.]

General formula:  [for an ellipsoid or hyperboloid]

    A x_1^2 + B x_2^2 + C x_3^2 + D x_1 x_2 + E x_2 x_3 + F x_3 x_1
                                            + G x_1 + H x_2 + I x_3 = -J

Phi(x):  R^d -> R^{(d^2 + 3d) / 2}

[The isosurface defined by this equation is called a _quadric_.  In the
 special case of 2D, it's also known as a _conic_section_.]

[You'll notice that there is a quadratic blowup in the number of features,
 because every *pair* of input features creates a new feature in Phi-space.
 If the dimension is large, these feature vectors are getting huge, and
 that's going to impose a serious computational cost.  But it might be worth
 it to find good classifiers for data that aren't linearly separable.]

Example 4:  Predictor fn is degree-p polynomial
-----------------------------------------------

E.g. a cubic in R^2:

    Phi(x) = [ x_1^3  x_1^2 x_2  x_1 x_2^2  x_2^3  x_1^2  x_1 x_2  x_2^2
               x_1  x_2 ]^T

Phi(x):  R^d -> R^{O(d^p)}

[Now we're really blowing up the number of features!  If you have, say, 100
 features per sample and you want to use degree-4 predictor functions, then
 each lifted feature vector has a length on the order of 100 million, and
 your learning algorithm will take approximately forever to run.]
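[A short sketch in code of Examples 1 and 4, not from the notes:  lift the
 samples with Phi, then hand the lifted features to any linear classifier.
 sklearn's PolynomialFeatures enumerates the monomials of degree <= p for
 Example 4; the circular toy data set is an assumption made only for
 illustration.]

    import numpy as np
    from sklearn.datasets import make_circles
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.svm import SVC

    # Toy data: one class inside a circle, one outside (not linearly separable).
    X, y = make_circles(n_samples=200, noise=0.1, factor=0.4, random_state=0)

    def parabolic_lift(X):
        """Example 1:  Phi(x) = (x, |x|^2), lifting R^d -> R^{d+1}."""
        return np.hstack([X, np.sum(X * X, axis=1, keepdims=True)])

    # Example 4:  all monomials of degree <= p (here p = 3), minus the constant.
    cubic_lift = PolynomialFeatures(degree=3, include_bias=False)

    for name, Z in [("raw features  ", X),
                    ("parabolic lift", parabolic_lift(X)),
                    ("cubic lift    ", cubic_lift.fit_transform(X))]:
        clf = SVC(kernel="linear", C=1.0).fit(Z, y)
        print(f"{name}  training accuracy: {clf.score(Z, y):.3f}")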
[However, later in the semester we will learn an extremely clever trick that
 allows us to work with these huge feature vectors very quickly, without ever
 computing them.  It's called "kernelization" or "the kernel trick".  So even
 though it appears now that working with degree-4 polynomials is
 computationally infeasible, it can actually be done very quickly.]

[Show SVMs with degree 1/2/5 predictor functions.  Observe that the margin
 tends to get wider as the degree increases.]

[Increasing the degree like this accomplishes two things.
 - First, the data might become linearly separable when you lift them to a
   high enough degree, even if the original data are not linearly separable.
 - Second, raising the degree can increase the margin, so you might get a
   more robust separator.
 However, if you raise the degree too high, you will overfit the data.]

[Show training vs. test error for degree 1/2/5 predictor functions.  In this
 example, a degree-2 predictor gives the smallest test error.]

[Sometimes you should search for the ideal degree--not too small, not too
 big.  It's a balancing act between underfitting and overfitting.  The degree
 is an example of a *hyperparameter* that can be optimized by validation.]

[If you're using both polynomial features and a soft-margin SVM, now you have
 two hyperparameters:  the degree and C.  Generally, the optimal C will be
 different for every polynomial degree, so when you change the degree, you
 have to run validation again to find the best C for that degree.]

[So far I've talked only about polynomial features.  But features can get
 much more interesting than polynomials, and they can be tailored to fit a
 specific problem.  Let's consider a type of feature you might use if you
 wanted to implement, say, a handwriting recognition algorithm.]

Example 5:  Edge detection
--------------------------

_Edge_detector_:  algorithm for approximating grayscale/color gradients in
                  an image, e.g.
                  - tap filter
                  - Sobel filter
                  - oriented Gaussian derivative filter

[Images are discrete, not continuous fields, so approximation is necessary.]

[See "Image Derivatives" on Wikipedia.]

Collect line orientations in local histograms (12 orientation bins per
region); use histograms as features (*instead* of raw pixels).

[Show picture of image histograms.]

Paper:  Maji & Malik, 2009.

[If you want to use these features (optionally) in Homework 1 and try to win
 the Kaggle competition, this paper is a good online resource.]

[When they use a linear SVM on the raw pixels, Maji & Malik get an error rate
 of 15.38% on the test set.  When they use a linear SVM on the histogram
 features, the error rate goes down to 2.64%.]

[Many applications can be improved by designing application-specific
 features.  There's no limit but your own creativity and ability to discern
 the structure hidden in your application.]
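[A rough sketch of orientation-histogram features, not the Maji & Malik
 pipeline:  approximate the image gradients with finite differences (a crude
 edge detector), collect the gradient orientations into 12-bin local
 histograms, and feed those histograms to a linear SVM instead of raw pixels.
 The 8x8 sklearn digits data set and the 4x4 regions are assumptions made
 only for illustration.]

    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.svm import LinearSVC

    def orientation_histograms(img, bins=12, cell=4):
        """Approximate gradients with finite differences (a crude edge
        detector), then build one orientation histogram per cell x cell
        region, weighted by gradient magnitude."""
        gy, gx = np.gradient(img)
        mag = np.hypot(gx, gy)
        ang = np.arctan2(gy, gx)                  # orientations in [-pi, pi]
        feats = []
        for i in range(0, img.shape[0], cell):
            for j in range(0, img.shape[1], cell):
                hist, _ = np.histogram(ang[i:i+cell, j:j+cell], bins=bins,
                                       range=(-np.pi, np.pi),
                                       weights=mag[i:i+cell, j:j+cell])
                feats.append(hist)
        feats = np.concatenate(feats)
        return feats / (np.linalg.norm(feats) + 1e-12)

    digits = load_digits()                        # 8x8 grayscale digit images
    X_raw = digits.images.reshape(len(digits.images), -1)
    X_hist = np.array([orientation_histograms(img) for img in digits.images])

    for name, X in [("raw pixels", X_raw), ("orientation histograms", X_hist)]:
        X_tr, X_te, y_tr, y_te = train_test_split(X, digits.target,
                                                  random_state=0)
        clf = LinearSVC(C=1.0, max_iter=20000).fit(X_tr, y_tr)  # linear SVM
        print(f"{name:22s}  test accuracy: {clf.score(X_te, y_te):.3f}")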