CLASSIFIERS
===========

You are given a data set of n _samples_, each with d _features_.
Some samples belong to _class_ O; some do not.

Example:  Samples are bank loans
          Features are income & age (d = 2)
          Some are in class "defaulted", some are not

Goal:  Predict whether future borrowers will default,
       based on their income & age.

Represent each sample as a point in d-dimensional space,
called a _feature_vector_ (aka _predictors_, _independent_variables_).

[Figure:  three scatter plots of X's and O's, income (vertical axis) vs.
 age (horizontal axis).  One plot is linearly separable; two are not; the
 third, with a sinuous boundary, is labeled "_overfitting_".]

[Draw lines/curves separating O's from X's.  Use those curves to predict
 which future borrowers will default.]

_decision_boundary_:  the boundary chosen by our classifier to separate
                      items in the class from those not in the class.

[By the way, when I underline a word or a short phrase, usually that is a
 *definition*.  If you want to do well in this course, you should *memorize*
 all the definitions I write down.]

Some (not all) classifiers work by computing a _predictor_function_:
A function f(x) that maps a sample point x to a scalar such that

     f(x) >  0   if x is in class O;
     f(x) <= 0   if x is not in class O.

Aka _decision_function_ or _discriminant_function_.

For these classifiers, the decision boundary is  {x in R^d : f(x) = 0}
[That is, the set of all points where the predictor function is zero.]
Usually, this set is a (d - 1)-dimensional surface in R^d.

{x : f(x) = 0} is also called an _isosurface_ of f for the _isovalue_ 0.

f has other isosurfaces for other isovalues, e.g. {x : f(x) = 1}.

[Show plot & *isocontours* of sqrt(x^2 + y^2) - 3.  Imagine a function in
 R^d, and imagine its (d - 1)-dimensional isosurface.]

_linear_classifier_:  The decision boundary is a line (in higher dimensions,
                      a plane).  Usually uses a linear predictor function.
                      [Sometimes no predictor fn.]

_overfitting_:  When a sinuous decision boundary fits sample data so well
                that it doesn't classify future items well.


Math Review
-----------

[I will write vectors in matrix notation.]

Vectors:

     x  =  [x_1  x_2  x_3  x_4  x_5]^T        (a column vector)

Think of x as a point in 5-dimensional space.

Conventions (often, but not always):
     uppercase roman = matrix or random variable       X
     lowercase roman = vector                          x
     Greek = scalar                                    alpha
     Other scalars:
          n = # of samples
          d = # of features (per sample) = dimension of sample points
          i, j, k = indices
     functions (often scalar)                          f( ), s( ), ...

_inner_product_ (aka _dot_product_):  x . y = x_1 y_1 + x_2 y_2 + ... + x_d y_d
                                      (also written x^T y)

Clearly, f(x) = w . x + alpha is a _linear_function_ in x.

_Euclidean_norm_:  |x| = sqrt(x . x) = sqrt(x_1^2 + x_2^2 + ... + x_d^2)

     |x| is the length (aka Euclidean length) of a vector x.

     Given a vector x, x / |x| is a _unit_ vector (length 1).

     "_Normalize_ a vector x":  replace x with x / |x|.

Use dot products to compute angles:

                  x . y       x       y
     cos theta = ------- =  ----- . -----
                 |x| |y|     |x|     |y|
                            \___/   \___/
                           length 1 length 1

[Figure:  vectors x and y with angle theta between them.]

[Figure:  three cases --  theta acute:  cos theta > 0 (positive);
 theta right:  cos theta = 0;  theta obtuse:  cos theta < 0 (negative).]

Given a linear predictor function  f(x) = w . x + alpha,
the decision boundary is

     H = {x : w . x = -alpha}.

The set H is called a _hyperplane_.  (A line in 2D, a plane in 3D.)

[I want you to understand what a hyperplane is.  In 2D, it's a line.  In 3D,
 it's a plane.  Now take that concept and generalize it to higher dimensions.
 In d dimensions, a hyperplane is a flat, infinite thing with dimension
 d - 1.  A hyperplane divides the d-dimensional space into two halves.]

Theorem:  Let x, y be two points that lie on H.  Then w . (y - x) = 0.
Proof:  x and y lie on H, so w . x = -alpha and w . y = -alpha.
        Thus w . (y - x) = w . y - w . x = -alpha - (-alpha) = 0.

w is called the _normal_vector_ of H, because (as the theorem shows) w is
normal (perpendicular) to H.  (I.e. w is normal to every line segment
connecting two points in H.)

[Figure:  the hyperplane H, drawn with its normal vector w.
 Add more isocontours to figure.]

If w is a unit vector, then w . x + alpha is the _signed_distance_ from x
to H.  I.e. it's the distance, but positive on one side of H, negative on
the other side.

Moreover, the distance from H to the origin is |alpha|.  [How do we know
that?]  Hence alpha = 0 if and only if H passes through the origin.

[w does not have to be a unit vector for the classifier to work.  If w is
 not a unit vector, w . x + alpha is a multiple of the signed distance.  If
 you want to fix that, you can _rescale_ the equation by computing |w| and
 dividing both w and alpha by |w|.]

The coefficients in w, plus alpha, are called _weights_ or sometimes
_regression_coefficients_.
[That's why I named the vector w; "w" stands for "weights".]

The input data is _linearly_separable_ if there exists a hyperplane that
separates all the samples in class O from all the samples NOT in class O.
[At the beginning of this lecture, I showed you one plot that's linearly
 separable and two that are not.]

[We will investigate some linear classifiers that only work for linearly
 separable data, then we'll move on to more sophisticated linear classifiers
 that do a decent job with non-separable data.  Obviously, if your data is
 not linearly separable, a linear classifier cannot do a *perfect* job.  But
 we're still happy if we can find a classifier that usually predicts
 correctly.]
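[A minimal NumPy sketch of these definitions.  The weights w, alpha and the
 sample points below are made up for illustration, not taken from the notes.
 It evaluates a linear predictor function f(x) = w . x + alpha, classifies by
 its sign, computes cos theta with dot products, and rescales w and alpha by
 |w| so that f(x) becomes the signed distance from x to the hyperplane H.]

    import numpy as np

    # Hypothetical weights for a linear predictor function f(x) = w . x + alpha.
    w = np.array([3.0, 4.0])
    alpha = -10.0

    def f(x, w, alpha):
        """Linear predictor (decision) function; predict class O iff f(x) > 0."""
        return np.dot(w, x) + alpha

    def cos_angle(x, y):
        """cos theta = (x . y) / (|x| |y|) = (x / |x|) . (y / |y|)."""
        return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

    x = np.array([2.0, 3.0])              # a made-up sample point
    print(f(x, w, alpha) > 0)             # True:  predict x is in class O

    # Rescale so that w is a unit vector; then f gives the signed distance
    # from x to the hyperplane H = {x : w . x + alpha = 0}.
    scale = np.linalg.norm(w)             # |w| = 5
    w_unit, alpha_unit = w / scale, alpha / scale
    print(f(x, w_unit, alpha_unit))       # signed distance from x to H (1.6)

    print(cos_angle(np.array([1.0, 0.0]), np.array([1.0, 1.0])))   # ~0.7071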
A Stupid Classifier
-------------------

_Centroid_method_:  compute mean mu_O of all vectors in class O and
                    mean mu_X of all vectors NOT in O.

We use the predictor function

                                                mu_O + mu_X
     f(x) = (mu_O - mu_X) . x - (mu_O - mu_X) . -----------
            \___________/                            2
            normal vector                       \___________/
                                           midpoint between mu_O, mu_X

so that decision boundary is the hyperplane that bisects line segment
w/endpoints mu_O, mu_X.

[Better yet, we can adjust the right hand side to minimize the number of
 misclassified points.  Same normal vector, but different position.]

[Figure:  a scatter plot of X's and O's with the centroid method's decision
 boundary.  Draw mu_O, mu_X, and the normal vector.]

[In this example, there's clearly a better linear classifier that classifies
 every sample correctly.  Note that this is hardly the worst example I could
 have given.  If you're in the mood for an easy puzzle, pull out a sheet of
 paper and think of an example, with lots of samples, where the centroid
 method misclassifies every sample but one.]

[Nevertheless, there are cases where this method works well, like when all
 your positive examples come from one Gaussian distribution, and all your
 negative examples come from another.]
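[A minimal NumPy sketch of the centroid method.  The function name and the
 2D data below are made up for illustration; the predictor function follows
 the formula above, so f(x) > 0 predicts class O.]

    import numpy as np

    def centroid_classifier(X_in, X_out):
        """Fit the centroid method.  X_in holds the samples in class O (one
        per row); X_out holds the samples not in class O.  Returns the
        predictor function f."""
        mu_O = X_in.mean(axis=0)            # mean of the class O samples
        mu_X = X_out.mean(axis=0)           # mean of the other samples
        w = mu_O - mu_X                     # normal vector of the boundary
        midpoint = (mu_O + mu_X) / 2        # the boundary bisects this segment
        return lambda x: np.dot(w, x) - np.dot(w, midpoint)

    # Made-up 2D samples (income, age):
    in_class  = np.array([[20.0, 25.0], [22.0, 30.0], [18.0, 28.0]])
    out_class = np.array([[60.0, 45.0], [55.0, 50.0], [70.0, 40.0]])

    f = centroid_classifier(in_class, out_class)
    x = np.array([30.0, 35.0])
    print("in class O" if f(x) > 0 else "not in class O", f(x))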
Perceptron Algorithm (Frank Rosenblatt, 1957)
--------------------

Slow, but correct for linearly separable samples.

Uses a _numerical_optimization_ algorithm, namely, _gradient_descent_.

[Poll:
 How many of you know what numerical optimization is?
 How many of you know what gradient descent is?
 How many of you know what Lagrange multipliers are?
 How many of you know what linear programming is?
 How many of you know what the simplex algorithm for linear programming is?
 How many of you know what convex programming is?

 We're going to learn what all these things are.  As machine learning
 people, we will be heavy users of all the optimization methods.
 Unfortunately, I won't have time to teach you *algorithms* for all these
 optimization problems, but we'll learn a few, and I'll give you some hints
 about how the other algorithms work.]

Consider n sample vectors X_1, X_2, ..., X_n.
[The reason I'm using capital X here is that we typically store these
 vectors as rows of a matrix X.  So the subscript picks out a row of X,
 representing a specific sample.  When I want to pick out one feature from a
 sample, I'll add a second subscript after the first one.]

                             /  1    if X_i is in class O, and
For each sample, let  y_i = |
                             \ -1    if X_i is not in O.

For simplicity, consider only decision boundaries that pass through the origin.
                                                       ~~~~~~~~~~~~~~~~~~~~~~~
(We'll fix this later.)

Goal:  find weights w such that    X_i . w >= 0   if y_i =  1, and
                                   X_i . w <= 0   if y_i = -1.

[Remember, X_i . w is the signed distance if w is a unit vector; otherwise
 it's a multiple of the signed distance.]

Equivalently:  y_i X_i . w >= 0.    <-- inequality called a _constraint_.

Idea:  We define a _risk_function_ R that is positive if some constraints
are violated.  Then we use _optimization_ to choose a w that minimizes R.

Define the _loss_function_

                  /  0         if y_i z >= 0, and
     L(z, y_i) = |
                  \  -y_i z    otherwise.

[Here, z is the classifier's prediction, and y_i is the correct answer.]

Idea:  if z has the same sign as y_i, the loss function is zero (happiness).
But if z has the wrong sign, the loss function is positive.

[For each sample, you want to get the loss function down to zero, or as
 close to zero as possible.  It's called the "loss function" because the
 bigger it is, the bigger a loser you are.]

Define the _risk_function_ (aka _objective_function_ or _cost_function_)

             n
     R(w) = sum L(X_i . w, y_i)
            i=1

          = sum  -y_i X_i . w,
           i in V

where V is the set of indices i for which y_i X_i . w < 0.

If w classifies all the points X_1, ..., X_n correctly, then R(w) = 0.
Otherwise, R(w) is positive, and we want to find a better value of w.

---------------------------------------------------------------------------
| Goal:  Solve this _optimization_problem_:  find w that minimizes R(w).  |
---------------------------------------------------------------------------

[Show plot of R.]
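[A minimal NumPy sketch of the loss function and risk function above,
 evaluated on made-up, linearly separable 2D samples (decision boundary
 through the origin).  It only evaluates R(w) for a given w; it does not
 yet minimize R.]

    import numpy as np

    def loss(z, y_i):
        """Perceptron loss:  0 if the prediction z has the same sign as the
        label y_i (i.e. y_i * z >= 0), and -y_i * z otherwise."""
        return 0.0 if y_i * z >= 0 else -y_i * z

    def risk(w, X, y):
        """Risk R(w):  the sum of the losses over all n samples (rows of X)."""
        return sum(loss(np.dot(X[i], w), y[i]) for i in range(len(y)))

    # Made-up, linearly separable samples with labels y_i in {+1, -1}:
    X = np.array([[ 1.0,  2.0],
                  [ 2.0,  1.0],
                  [-1.0, -2.0],
                  [-2.0, -0.5]])
    y = np.array([1, 1, -1, -1])

    print(risk(np.array([1.0, 1.0]), X, y))    # 0.0:  all constraints satisfied
    print(risk(np.array([-1.0, 0.5]), X, y))   # 3.25:  some points misclassified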