ML ABSTRACTIONS  [some meta comments on machine learning]
===============

[When you write a large computer program, you break it down into subroutines
 and modules.  Many of you know from experience that you need to have the
 discipline to impose strong abstraction barriers between different modules,
 or your program will become so complex you can no longer manage or maintain
 it.]

[When you learn a new subject, it helps to have mental abstraction barriers,
 too, so you know when you can replace one approach with a different
 approach.  I want to give you four levels of abstraction that can help you
 think about machine learning.  It's important to make mental distinctions
 between these four things, and the code you write should have modules that
 reflect these distinctions as well.]

-----------------------------------------------------------------------------
|  APPLICATION/DATA                                                         |
|                                                                           |
|  data labeled (classified) or not?                                        |
|  yes:  labels categorical (classification) or quantitative (regression)?  |
|  no:   similarity (clustering) or positioning (dimensionality reduction)? |
|---------------------------------------------------------------------------|
|  MODEL  [what kinds of hypotheses are permitted?]                         |
|                                                                           |
|  e.g.:                                                                    |
|  - predictor fns:  linear, polynomial, logistic, neural net, ...          |
|  - nearest neighbors, decision trees                                      |
|  - features                                                               |
|  - low vs. high capacity  (affects overfitting, underfitting, inference)  |
|---------------------------------------------------------------------------|
|  OPTIMIZATION PROBLEM                                                     |
|                                                                           |
|  - variables, objective fn, constraints                                   |
|    e.g., unconstrained, convex program, least squares, PCA                |
|---------------------------------------------------------------------------|
|  OPTIMIZATION ALGORITHM                                                   |
|                                                                           |
|  e.g., gradient descent, simplex, SVD                                     |
-----------------------------------------------------------------------------

[In this course, we focus primarily on the middle two levels.  As a data
 scientist, you might be given an application, and your challenge is to turn
 it into an optimization problem that we know how to solve.  We'll talk a bit
 about optimization algorithms, but usually you'll use an optimization code
 that's faster and more robust than what you would write.]

[The second level, the model, has a huge effect on the success of your
 learning algorithm.  Sometimes you can get a big improvement by tailoring
 the model or its features to fit the structure of your specific data.  The
 model also has a big effect on whether you overfit or underfit.  And if you
 want a model that you can interpret so you can do _inference_, the model has
 to be simple, not too complex.  Lastly, you have to pick a model that leads
 to an optimization problem that can be solved.  Some optimization problems
 are just too hard.]

[It's important to understand that when you change something in one level of
 this diagram, you probably have to change all the levels underneath it.  If
 you switch from a linear classifier to a neural net, your optimization
 problem changes, and your optimization algorithm probably changes too.]

[Not all machine learning methods fit this four-level decomposition.
 Nevertheless, for everything you learn in this class, think about where it
 fits in this hierarchy.  If you don't distinguish which math is part of the
 model and which math is part of the optimization algorithm, this course will
 be very confusing for you.]
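
[To make the module idea concrete, here is a minimal Python sketch (assuming
 NumPy; the names are mine, not a standard API) whose modules mirror the
 bottom three levels for a least-squares regression application.  Note that
 swapping out the model would force changes in the two modules below it.]

    import numpy as np

    # MODEL:  the permitted hypotheses are linear predictor fns x -> w . x.
    def predict(w, X):
        return X @ w

    # OPTIMIZATION PROBLEM:  variables w; objective fn = sum of squared
    # errors (unconstrained, convex, smooth).
    def objective(w, X, y):
        r = predict(w, X) - y
        return r @ r

    def gradient(w, X, y):
        return 2.0 * X.T @ (predict(w, X) - y)

    # OPTIMIZATION ALGORITHM:  blind gradient descent with learning rate
    # epsilon (see the next section).
    def fit(X, y, epsilon=1e-3, iters=10000):
        w = np.zeros(X.shape[1])
        for _ in range(iters):
            w = w - epsilon * gradient(w, X, y)
        return w
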
OPTIMIZATION PROBLEMS
=====================

[I want to familiarize you with some types of optimization problems that can
 be solved reliably and efficiently, and with the names of some of the
 optimization algorithms used to solve them.  An important skill for you to
 develop is the ability to go from an application to a well-defined
 optimization problem.]

Unconstrained
-------------

Goal:  Find w that minimizes (or maximizes) a continuous fn f(w).

f is _smooth_ if its gradient is continuous too.

A _global_minimum_ of f is a value w such that f(w) <= f(v) for every v.
A _local_minimum_  of f is a value w such that f(w) <= f(v) for every v
                   in a tiny ball centered at w.
                   [In other words, you cannot walk downhill from w.]

[Figure:  the graph of a wiggly function with several local minima; the
 lowest of them is the global minimum.]

Usually, finding a local minimum is easy; finding the global minimum is hard.
[or impossible]

Exception:  A function is _convex_ if for every x, y in R^d, the line segment
connecting (x, f(x)) to (y, f(y)) does not go below the graph of f.

[Figure:  the graph of a convex function f; the segment from (x, f(x)) to
 (y, f(y)) lies on or above the graph between x and y.]

E.g. perceptron risk fn is convex and nonsmooth.

[When you sum together convex functions, you always get a convex function.
 The perceptron risk function is a sum of convex loss functions.]

A [continuous] convex function [on a closed, convex domain] has either
- no minimum (goes to -infinity), or
- just one local minimum, or
- a connected set of local minima that are all global minima with equal f.
[The perceptron risk function has the latter.]

[In the last two cases, if you walk downhill, you eventually converge to a
 global minimum.]

[However, there are many applications where you don't have a convex
 objective function, and your machine learning algorithm has to settle for
 finding a local minimum.  For example, neural nets try to optimize an
 objective function that has *lots* of local minima; they almost never find
 a global minimum.]

Algs for smooth f:
- Steepest descent:
  = blind  [with learning rate]
      repeat:  w <- w - epsilon grad f(w)
  = with line search:
      x Secant method
      x Newton-Raphson (may need Hessian matrix of f)
  = stochastic (blind)
    [trains on one sample per iteration, or a small batch]
- Nonlinear conjugate gradient  [uses the same line search methods]
- Newton's method (needs Hessian matrix)

Algs for nonsmooth f:
- Steepest descent
  = blind
  = with direct line search (e.g. golden section search)

These algs find a local minimum.  [A code sketch of steepest descent with a
golden section line search appears at the end of this subsection.]

_line_search_:  finds a local minimum along the search direction by solving
an optimization problem in 1D.  [...instead of using a blind step size like
the perceptron algorithm does.  Solving a 1D problem is much easier than
solving a higher-dimensional one.]

[Neural nets are unconstrained optimization problems with many, many local
 minima.  They sometimes benefit from the more sophisticated optimization
 algorithms, but when the input data set is very large, researchers often
 favor the dumb, blind, stochastic versions of gradient descent.]

[If you're optimizing over a d-dimensional space, the Hessian matrix is a
 d-by-d matrix and it's usually dense, so most methods that use the Hessian
 are computationally infeasible when d is large.]
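
[Here is a minimal sketch, assuming NumPy, of steepest descent with a golden
 section line search, the direct 1D method named above; the function names
 are illustrative, not from any library.]

    import numpy as np

    def golden_section(phi, a, b, tol=1e-6):
        # Direct line search:  shrinks a bracket [a, b] around a local
        # minimum of phi using only function values (no derivatives), so it
        # also works when phi is nonsmooth.
        invphi = (np.sqrt(5.0) - 1.0) / 2.0     # 1 / golden ratio ~ 0.618
        c, d = b - invphi * (b - a), a + invphi * (b - a)
        while b - a > tol:
            if phi(c) < phi(d):                 # minimum lies in [a, d]
                b, d = d, c
                c = b - invphi * (b - a)
            else:                               # minimum lies in [c, b]
                a, c = c, d
                d = a + invphi * (b - a)
        return 0.5 * (a + b)

    def steepest_descent(f, grad_f, w, steps=100, t_max=1.0):
        for _ in range(steps):
            g = grad_f(w)
            t = golden_section(lambda t: f(w - t * g), 0.0, t_max)
            w = w - t * g                       # walk downhill
        return w

    # E.g., a smooth, convex quadratic with its global minimum at (3, -1):
    f = lambda w: (w[0] - 3.0)**2 + 4.0 * (w[1] + 1.0)**2
    grad_f = lambda w: np.array([2.0 * (w[0] - 3.0), 8.0 * (w[1] + 1.0)])
    print(steepest_descent(f, grad_f, np.zeros(2)))   # -> approx. [3, -1]
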
Constrained Optimization  (smooth equality constraints)
------------------------

Goal:  Find w that minimizes (maximizes) f(w)
       subject to g(w) = 0    [<- observe that this is an isosurface]

       where g is a smooth fn
       (may be a vector, encoding multiple constraints)

Alg:  Use _Lagrange_multipliers_.

Linear Program
--------------

Linear objective fn + linear *inequality* constraints.

Goal:  Find w that maximizes (or minimizes) c . w
       subject to A w <= b

       where A is an n-by-d matrix, b in R^n, expressing n
       _linear_constraints_:  A_i w <= b_i,  i in [1, n]

[Figure:  a two-dimensional feasible region bounded by constraint lines; the
 optimum is the vertex of the region furthest in the direction c, where two
 active constraints meet.]

The set of points that satisfy all constraints is a convex _polytope_ called
the _feasible_region_ F.  The _optimum_ is the point in F that is furthest in
the direction c.

[What does convex mean?]

A point set P is _convex_ if for every p, q in P, the line segment with
endpoints p, q lies entirely in P.

[A polytope is just a polyhedron, generalized to higher dimensions.]

We can always find a global optimum that is a vertex of the polytope.  The
optimum achieves equality for some constraints (but not most), called the
_active_constraints_ of the optimum.  [In the figure above, there are two
active constraints.  In an SVM, active constraints correspond to the samples
that touch or violate the slab.]

[Sometimes, there is more than one optimal point.  For example, in the figure
 above, if c pointed straight up, every point on the top horizontal edge
 would be optimal.  The set of optimal points is always convex.]

Example:  EVERY _feasible_point_ (w, alpha) gives a linear classifier
(see the code sketch at the end of this subsection):

-----------------------------------------------------------------
|  Find w, alpha that maximizes 0                               |
|  subject to y_i (w . X_i + alpha) >= 1    for all i in [1, n] |
-----------------------------------------------------------------

IMPORTANT:  The data are linearly separable iff the feasible region is not
the empty set.  [The same is true for the maximum margin classifier
(quadratic program).]

Algs for linear programming:
- Simplex (George Dantzig, 1947)
  [Indisputably one of the most important algorithms of the 20th century.]
  [Walks along edges of the polytope from vertex to vertex until it finds
   the optimum.]
- Interior point methods

[Linear programming is very different from unconstrained optimization; it
 has a much more combinatorial flavor.  If you knew which constraints would
 be the active constraints once you found the solution, it would be easy;
 the hard part is figuring out which constraints should be the active ones.
 There are exponentially many possibilities, so you can't afford to try them
 all.  So linear programming algorithms tend to have a very discrete,
 computer science feeling to them, like graph algorithms, whereas
 unconstrained optimization algorithms tend to have a continuous, numerical
 optimization feeling.]

[Linear programs crop up everywhere in engineering and science, but they're
 usually in disguise.  An extremely useful talent you should develop is to
 recognize when a problem is a linear program.]

[A linear program solver can find a linear classifier, but it can't find the
 maximum margin classifier.  We need something more powerful.]
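
[Here is a minimal sketch of the feasibility linear program in the box above,
 assuming SciPy; linear_classifier is my own name.  Two details:  linprog
 minimizes, and wants constraints in the form A z <= b, so the constraints
 y_i (w . X_i + alpha) >= 1 are negated; and linprog's variables are
 nonnegative by default, so we must declare them free.]

    import numpy as np
    from scipy.optimize import linprog

    def linear_classifier(X, y):
        # X is an n-by-d array of samples; y holds labels in {-1, +1}.
        # Variables z = (w, alpha); the objective fn is the constant 0.
        n, d = X.shape
        A_ub = -y[:, None] * np.hstack([X, np.ones((n, 1))])
        b_ub = -np.ones(n)
        result = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=b_ub,
                         bounds=[(None, None)] * (d + 1))
        if not result.success:      # empty feasible region:
            return None             # the data are not linearly separable
        return result.x[:d], result.x[d]      # w, alpha
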
Quadratic program
-----------------

Quadratic, convex objective fn + linear inequality constraints.

Goal:  Find w that minimizes f(w) = w^T Q w + c^T w
       subject to A w <= b

       where Q is a symmetric, positive definite matrix.

A matrix is _positive_definite_ if w^T Q w > 0 for all w != 0.

Only one local minimum!

[If Q is indefinite, then f is not convex, the minimum is not always unique,
 and quadratic programming is NP-hard.]

Example:  Find maximum margin classifier.

[Show circular contours of the objective fn; draw the polygon of constraints
 on top; show the constrained minimum.  In an SVM, we are looking for the
 point in this polygon that's closest to the origin.  Example with one active
 constraint; example with two.]

Algs for quadratic programming:
- Simplex-like
- Sequential minimal optimization (SMO, used in LIBSVM)
- Coordinate descent (used in LIBLINEAR)
[A code sketch of the maximum margin quadratic program appears at the end of
 these notes.]

[One clever idea used in SMO is that they do a line search that uses the
 Hessian, but it's cheap to compute because they don't walk in the direction
 of steepest descent; instead they walk along just one or a few coordinate
 axes at a time.]

Convex Program  (EE 127/227A/227B)
--------------

Convex objective fn + convex inequality constraints.

[What I've given you here is, roughly, a sliding scale of optimization
 problems of increasing complexity, difficulty, and computation time.  But
 even convex programs are relatively easy to solve.  When you're trying to
 address the needs of real-world applications, it's not uncommon to devise
 an optimization problem with crazy inequalities and an objective function
 that's nowhere near convex.  These are sometimes very, very hard to solve.]
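
[Returning to the maximum margin quadratic program:  here is a minimal
 sketch, assuming SciPy.  A real implementation would use one of the
 dedicated QP algorithms above; for small examples, SciPy's general-purpose
 SLSQP solver suffices.  In the notation of the quadratic program, the
 variables are z = (w, alpha), Q is the identity on the w-coordinates (and
 zero on alpha), and c = 0.]

    import numpy as np
    from scipy.optimize import minimize

    def max_margin_classifier(X, y):
        # Minimize |w|^2 subject to y_i (w . X_i + alpha) >= 1.
        # SLSQP's 'ineq' constraints have the form fun(z) >= 0.
        n, d = X.shape
        objective = lambda z: z[:d] @ z[:d]
        constraints = [{'type': 'ineq',
                        'fun': lambda z, i=i:
                            y[i] * (X[i] @ z[:d] + z[d]) - 1.0}
                       for i in range(n)]
        result = minimize(objective, np.zeros(d + 1), method='SLSQP',
                          constraints=constraints)
        return result.x[:d], result.x[d]      # w, alpha

[At the solution, the active constraints, i.e., the samples with
 y_i (w . X_i + alpha) = 1, are the support vectors that touch the slab.]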