SOFT-MARGIN SUPPORT VECTOR MACHINES (SVMs)
==========================================

Solves 2 problems:
- Hard-margin SVMs fail if the data are not linearly separable.
- Hard-margin SVMs are sensitive to outliers.

[Show example where one outlier moves the decision boundary a lot.]

Idea:  Allow some samples to violate the margin, with _slack_variables_.
       Modified constraint for sample i:

       y_i (X_i . w + alpha) >= 1 - xi_i

[Observe that the only difference between these constraints and the
 hard-margin constraints we saw last lecture is the extra slack term xi_i.]

[We also impose new constraints, that the slack variables are never negative.]

       xi_i >= 0

[This inequality ensures that all samples that *don't* violate the margin are
 treated the same; they all have xi_i = 0.  Sample i has a nonzero xi_i if and
 only if it violates the margin.]

[Show figure of margin where some samples have slack.  For each violating
 point, the slack distance is xi*_i = xi_i / |w|.]

To prevent abuse of slack, we add a _loss_term_ to the objective fn.

Optimization problem:

--------------------------------------------------------------------------
|                                                    n                    |
|  Find w, alpha, and xi_i that minimize  |w|^2 + C sum xi_i              |
|                                                   i=1                   |
|                                                                         |
|  subject to   y_i (X_i . w + alpha) >= 1 - xi_i   for all i in [1, n]   |
|               xi_i >= 0                           for all i in [1, n]   |
--------------------------------------------------------------------------

...a quadratic program in d + n + 1 dimensions and 2n constraints.
[It's a quadratic program because its objective function is quadratic and its
 constraints are linear inequalities.]

C > 0 is a scalar _regularization_hyperparameter_ that trades off:

          |------------------------|------------------------------------------|
          |        small C         |                  big C                   |
----------|------------------------|------------------------------------------|
 desire   | maximize margin 1/|w|  | keep most slack variables zero or small  |
----------|------------------------|------------------------------------------|
 danger   | underfitting           | overfitting                              |
          | (misclassifies much    | (awesome training, awful test)           |
          |  training data)        |                                          |
----------|------------------------|------------------------------------------|
 outliers | less sensitive         | very sensitive                           |
----------|------------------------|------------------------------------------|
 boundary | more "flat"            | more sinuous                             |
-------------------------------------------------------------------------------

[The last row only applies to nonlinear decision boundaries, which we'll
 discuss next.  Obviously, a linear decision boundary can't be sinuous.]

Use validation to choose C.

[Show examples of how the slab varies with C.  Smallest C upper left; largest
 C lower right.]

[One way to think about slack is to pretend that slack is money we can spend
 to buy permission for a sample to violate the margin.  The further a sample
 penetrates the margin, the bigger the fine you have to pay.  We want to make
 the margin as big as possible, but we also want to spend as little money as
 possible.  If the regularization parameter C is small, it means we're willing
 to spend lots of money on violations so we can get a bigger margin.  If C is
 big, it means we're cheap and we want to prevent violations, even though
 we'll get a narrower margin.  If C is infinite, we're back to a hard-margin
 SVM.]
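[A minimal scikit-learn sketch of the soft-margin SVM above, not part of the
 original notes.  sklearn's SVC with a linear kernel solves essentially this
 quadratic program, with the objective scaled as (1/2)|w|^2 + C sum xi_i; the
 toy blobs and the list of C values are placeholders chosen only to
 illustrate how C is swept and chosen by validation.]

    import numpy as np
    from sklearn.datasets import make_blobs
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    # Toy 2-class data with overlapping classes, so a hard margin would fail.
    X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

    # Sweep the regularization hyperparameter C; use validation to choose it.
    for C in [0.01, 0.1, 1.0, 10.0, 100.0]:
        clf = SVC(kernel="linear", C=C).fit(X_train, y_train)
        margin = 1.0 / np.linalg.norm(clf.coef_)          # margin 1/|w|
        print(f"C = {C:7.2f}   margin = {margin:.3f}   "
              f"validation accuracy = {clf.score(X_val, y_val):.3f}")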
FEATURES
========

Q:  How to do nonlinear decision boundaries?

A:  Make nonlinear _features_ that _lift_ samples into a higher-dimensional
    space.
    High-d linear classifier -> low-d nonlinear classifier.

[Features work with all classifiers, including perceptrons, hard-margin SVMs,
 and soft-margin SVMs.]

Example 1:  The _parabolic_lifting_map_
---------------------------------------

Phi(x):  R^d -> R^{d+1}

              -       -
             |    x    |
    Phi(x) = |         |       <--- lifts x onto paraboloid x_{d+1} = |x|^2
             |  |x|^2  |
              -       -

[We've added a new feature, |x|^2.  Even though the new feature is just a
 function of other input features, it gives our linear classifier more power.]

Find a linear classifier in Phi-space.
It induces a sphere classifier in x-space.

[Draw 2D scatter here:  O's inside a circular decision boundary, X's outside.
 Draw paraboloid, lifted samples, and plane decision boundary in 3D here.
 Draw circle decision boundary.]

Theorem:  Phi(X_1), ..., Phi(X_n) are linearly separable iff X_1, ..., X_n
          are separable by a hypersphere.
          (Possibly a _degenerate_ hypersphere = hyperplane.)

Proof:  Consider hypersphere in R^d w/center c & radius rho.  Points inside:

                        |x - c|^2  <  rho^2
            |x|^2 - 2c . x + |c|^2  <  rho^2

                             -       -
                            |    x    |
               [-2c^T  1]   |         |  <  rho^2 - |c|^2
                            |  |x|^2  |
                             -       -
                    ^            ^
              normal vector    Phi(x)

        Hence points inside sphere -> same side of hyperplane in Phi-space.
        (Reverse implication works too.)

[Although the math above doesn't expose it, hyperplane separators are a
 special case of hypersphere separators, so hypersphere classifiers can do
 everything linear classifiers can do and more.  If you take a sphere and
 increase its radius to infinity while making it pass through some point, in
 the limit you get a plane; so you can think of a plane as a degenerate
 sphere.  With the parabolic lifting map, a hyperplane in x-space corresponds
 to a hyperplane in Phi-space that is parallel to the x_{d+1}-axis.]

Example 2:  Axis-aligned ellipsoid/hyperboloid decision boundaries
------------------------------------------------------------------

[Draw examples of axis-aligned ellipses & hyperbola.]

In 3D, these have the formula

    A x_1^2 + B x_2^2 + C x_3^2 + D x_1 + E x_2 + F x_3 = -G

[Here, the capital letters are scalars, not matrices.]

Phi(x):  R^d -> R^{2d}

    Phi(x) = [ x_1^2  ...  x_d^2  x_1  ...  x_d ]^T

[We've turned d input features into 2d features for our linear classifier.
 If the samples are separable by an axis-aligned ellipsoid or hyperboloid,
 per the formula above, then the samples lifted to Phi-space are separable by
 a hyperplane whose normal vector is (A, B, C, D, E, F).]

Example 3:  Ellipsoid/hyperboloid
---------------------------------

[Draw example of a non-axis-aligned ellipse.]

General formula:  [for an ellipsoid or hyperboloid]

    A x_1^2 + B x_2^2 + C x_3^2 + D x_1 x_2 + E x_2 x_3 + F x_3 x_1
                                            + G x_1 + H x_2 + I x_3 = -J

Phi(x):  R^d -> R^{(d^2 + 3d) / 2}

[The isosurface defined by this equation is called a _quadric_.  In the
 special case of 2D, it's also known as a _conic_section_.]

[You'll notice that there is a quadratic blowup in the number of features,
 because every *pair* of input features creates a new feature in Phi-space.
 If the dimension is large, these feature vectors are getting huge, and
 that's going to impose a serious computational cost.  But it might be worth
 it to find good classifiers for data that aren't linearly separable.]

Example 4:  Predictor fn is degree-p polynomial
-----------------------------------------------

E.g. a cubic in R^2:

    Phi(x) = [ x_1^3  x_1^2 x_2  x_1 x_2^2  x_2^3  x_1^2  x_1 x_2  x_2^2
               x_1  x_2 ]^T

Phi(x):  R^d -> R^{O(d^p)}

[Now we're really blowing up the number of features!  If you have, say, 100
 features per sample and you want to use degree-4 predictor functions, then
 each lifted feature vector has a length on the order of 100 million, and
 your learning algorithm will take approximately forever to run.]
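[A short sketch in code of Examples 1 and 4, not from the notes:  lift the
 samples with Phi, then hand the lifted features to any linear classifier.
 sklearn's PolynomialFeatures enumerates the monomials of degree <= p for
 Example 4; the circular toy data set is an assumption made only for
 illustration.]

    import numpy as np
    from sklearn.datasets import make_circles
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.svm import SVC

    # Toy data: one class inside a circle, one outside (not linearly separable).
    X, y = make_circles(n_samples=200, noise=0.1, factor=0.4, random_state=0)

    def parabolic_lift(X):
        """Example 1:  Phi(x) = (x, |x|^2), lifting R^d -> R^{d+1}."""
        return np.hstack([X, np.sum(X * X, axis=1, keepdims=True)])

    # Example 4:  all monomials of degree <= p (here p = 3), minus the constant.
    cubic_lift = PolynomialFeatures(degree=3, include_bias=False)

    for name, Z in [("raw features  ", X),
                    ("parabolic lift", parabolic_lift(X)),
                    ("cubic lift    ", cubic_lift.fit_transform(X))]:
        clf = SVC(kernel="linear", C=1.0).fit(Z, y)
        print(f"{name}  training accuracy: {clf.score(Z, y):.3f}")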
[However, later in the semester we will learn an extremely clever trick that
 allows us to work with these huge feature vectors very quickly, without ever
 computing them.  It's called "kernelization" or "the kernel trick".  So even
 though it appears now that working with degree-4 polynomials is
 computationally infeasible, it can actually be done very quickly.]

[Show SVMs with degree 1/2/5 predictor functions.  Observe that the margin
 tends to get wider as the degree increases.]

[Increasing the degree like this accomplishes two things.
 - First, the data might become linearly separable when you lift them to a
   high enough degree, even if the original data are not linearly separable.
 - Second, raising the degree can increase the margin, so you might get a
   more robust separator.
 However, if you raise the degree too high, you will overfit the data.]

[Show training vs. test error for degree 1/2/5 predictor functions.  In this
 example, a degree-2 predictor gives the smallest test error.]

[Sometimes you should search for the ideal degree--not too small, not too
 big.  It's a balancing act between underfitting and overfitting.  The degree
 is an example of a *hyperparameter* that can be optimized by validation.]

[If you're using both polynomial features and a soft-margin SVM, now you have
 two hyperparameters:  the degree and C.  Generally, the optimal C will be
 different for every polynomial degree, so when you change the degree, you
 have to run validation again to find the best C for that degree.]

[So far I've talked only about polynomial features.  But features can get
 much more interesting than polynomials, and they can be tailored to fit a
 specific problem.  Let's consider a type of feature you might use if you
 wanted to implement, say, a handwriting recognition algorithm.]

Example 5:  Edge detection
--------------------------

_Edge_detector_:  algorithm for approximating grayscale/color gradients in
                  an image, e.g.
                  - tap filter
                  - Sobel filter
                  - oriented Gaussian derivative filter

[Images are discrete, not continuous fields, so approximation is necessary.]

[See "Image Derivatives" on Wikipedia.]

Collect line orientations in local histograms (12 orientation bins per
region); use histograms as features (*instead* of raw pixels).

[Show picture of image histograms.]

Paper:  Maji & Malik, 2009.

[If you want to use these features (optionally) in Homework 1 and try to win
 the Kaggle competition, this paper is a good online resource.]

[When they use a linear SVM on the raw pixels, Maji & Malik get an error rate
 of 15.38% on the test set.  When they use a linear SVM on the histogram
 features, the error rate goes down to 2.64%.]

[Many applications can be improved by designing application-specific
 features.  There's no limit but your own creativity and ability to discern
 the structure hidden in your application.]
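[A rough sketch of orientation-histogram features, not the Maji & Malik
 pipeline:  approximate the image gradients with finite differences (a crude
 edge detector), collect the gradient orientations into 12-bin local
 histograms, and feed those histograms to a linear SVM instead of raw pixels.
 The 8x8 sklearn digits data set and the 4x4 regions are assumptions made
 only for illustration.]

    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.svm import LinearSVC

    def orientation_histograms(img, bins=12, cell=4):
        """Approximate gradients with finite differences (a crude edge
        detector), then build one orientation histogram per cell x cell
        region, weighted by gradient magnitude."""
        gy, gx = np.gradient(img)
        mag = np.hypot(gx, gy)
        ang = np.arctan2(gy, gx)                  # orientations in [-pi, pi]
        feats = []
        for i in range(0, img.shape[0], cell):
            for j in range(0, img.shape[1], cell):
                hist, _ = np.histogram(ang[i:i+cell, j:j+cell], bins=bins,
                                       range=(-np.pi, np.pi),
                                       weights=mag[i:i+cell, j:j+cell])
                feats.append(hist)
        feats = np.concatenate(feats)
        return feats / (np.linalg.norm(feats) + 1e-12)

    digits = load_digits()                        # 8x8 grayscale digit images
    X_raw = digits.images.reshape(len(digits.images), -1)
    X_hist = np.array([orientation_histograms(img) for img in digits.images])

    for name, X in [("raw pixels", X_raw), ("orientation histograms", X_hist)]:
        X_tr, X_te, y_tr, y_te = train_test_split(X, digits.target,
                                                  random_state=0)
        clf = LinearSVC(C=1.0, max_iter=20000).fit(X_tr, y_tr)  # linear SVM
        print(f"{name:22s}  test accuracy: {clf.score(X_te, y_te):.3f}")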