ML ABSTRACTIONS  [some meta comments on machine learning]
===============

[When you write a large computer program, you break it down into subroutines
 and modules.  Many of you know from experience that you need to have the
 discipline to impose strong abstraction barriers between different modules,
 or your program will become so complex you can no longer manage or maintain
 it.]

[When you learn a new subject, it helps to have mental abstraction barriers,
 too, so you know when you can replace one approach with a different
 approach.  I want to give you four levels of abstraction that can help you
 think about machine learning.  It's important to make mental distinctions
 between these four things, and the code you write should have modules that
 reflect these distinctions as well.]

-----------------------------------------------------------------------------
|  APPLICATION/DATA                                                         |
|                                                                           |
|  data labeled (classified) or not?                                        |
|  yes:  labels categorical (classification) or quantitative (regression)?  |
|  no:   similarity (clustering) or positioning (dimensionality reduction)? |
|---------------------------------------------------------------------------|
|  MODEL  [what kinds of hypotheses are permitted?]                         |
|                                                                           |
|  e.g.:                                                                    |
|  - predictor fns:  linear, polynomial, logistic, neural net, ...          |
|  - nearest neighbors, decision trees                                      |
|  - features                                                               |
|  - low vs. high capacity  (affects overfitting, underfitting, inference)  |
|---------------------------------------------------------------------------|
|  OPTIMIZATION PROBLEM                                                     |
|                                                                           |
|  - variables, objective fn, constraints                                   |
|    e.g., unconstrained, convex program, least squares, PCA                |
|---------------------------------------------------------------------------|
|  OPTIMIZATION ALGORITHM                                                   |
|                                                                           |
|  e.g., gradient descent, simplex, SVD                                     |
-----------------------------------------------------------------------------

[In this course, we focus primarily on the middle two levels.  As a data
 scientist, you might be given an application, and your challenge is to turn
 it into an optimization problem that we know how to solve.  We'll talk a bit
 about optimization algorithms, but usually you'll use an optimization code
 that's faster and more robust than what you would write.]

[The second level, the model, has a huge effect on the success of your
 learning algorithm.  Sometimes you can get a big improvement by tailoring
 the model or its features to fit the structure of your specific data.  The
 model also has a big effect on whether you overfit or underfit.  And if you
 want a model that you can interpret so you can do _inference_, the model has
 to be simple, not too complex.  Lastly, you have to pick a model that leads
 to an optimization problem that can be solved.  Some optimization problems
 are just too hard.]

[It's important to understand that when you change something in one level of
 this diagram, you probably have to change all the levels underneath it.  If
 you switch from a linear classifier to a neural net, your optimization
 problem changes, and your optimization algorithm probably changes too.]

[Not all machine learning methods fit this four-level decomposition.
 Nevertheless, for everything you learn in this class, think about where it
 fits in this hierarchy.  If you don't distinguish which math is part of the
 model and which math is part of the optimization algorithm, this course will
 be very confusing for you.]
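
[To make the module idea concrete, here is a minimal Python sketch (assuming
 NumPy; the names are mine, not a standard API) whose modules mirror the
 bottom three levels for a least-squares regression application.  Note that
 swapping out the model would force changes in the two modules below it.]

    import numpy as np

    # MODEL:  the permitted hypotheses are linear predictor fns x -> w . x.
    def predict(w, X):
        return X @ w

    # OPTIMIZATION PROBLEM:  variables w; objective fn = sum of squared
    # errors (unconstrained, convex, smooth).
    def objective(w, X, y):
        r = predict(w, X) - y
        return r @ r

    def gradient(w, X, y):
        return 2.0 * X.T @ (predict(w, X) - y)

    # OPTIMIZATION ALGORITHM:  blind gradient descent with learning rate
    # epsilon (see the next section).
    def fit(X, y, epsilon=1e-3, iters=10000):
        w = np.zeros(X.shape[1])
        for _ in range(iters):
            w = w - epsilon * gradient(w, X, y)
        return w
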
OPTIMIZATION PROBLEMS
=====================

[I want to familiarize you with some types of optimization problems that can
 be solved reliably and efficiently, and with the names of some of the
 optimization algorithms used to solve them.  An important skill for you to
 develop is the ability to go from an application to a well-defined
 optimization problem.]

Unconstrained
-------------

Goal:  Find w that minimizes (or maximizes) a continuous fn f(w).

f is _smooth_ if its gradient is continuous too.

A _global_minimum_ of f is a value w such that f(w) <= f(v) for every v.
A _local_minimum_  of f is a value w such that f(w) <= f(v) for every v
                   in a tiny ball centered at w.
                   [In other words, you cannot walk downhill from w.]

[Figure:  the graph of a wiggly function with several local minima; the
 lowest of them is the global minimum.]

Usually, finding a local minimum is easy; finding the global minimum is hard.
[or impossible]

Exception:  A function is _convex_ if for every x, y in R^d, the line segment
connecting (x, f(x)) to (y, f(y)) does not go below the graph of f.

[Figure:  the graph of a convex function f; the segment from (x, f(x)) to
 (y, f(y)) lies on or above the graph between x and y.]

E.g. perceptron risk fn is convex and nonsmooth.

[When you sum together convex functions, you always get a convex function.
 The perceptron risk function is a sum of convex loss functions.]

A [continuous] convex function [on a closed, convex domain] has either
- no minimum (goes to -infinity), or
- just one local minimum, or
- a connected set of local minima that are all global minima with equal f.
[The perceptron risk function has the latter.]

[In the last two cases, if you walk downhill, you eventually converge to a
 global minimum.]

[However, there are many applications where you don't have a convex
 objective function, and your machine learning algorithm has to settle for
 finding a local minimum.  For example, neural nets try to optimize an
 objective function that has *lots* of local minima; they almost never find
 a global minimum.]

Algs for smooth f:
- Steepest descent:
  = blind  [with learning rate]
      repeat:  w <- w - epsilon grad f(w)
  = with line search:
      x Secant method
      x Newton-Raphson (may need Hessian matrix of f)
  = stochastic (blind)
    [trains on one sample per iteration, or a small batch]
- Nonlinear conjugate gradient  [uses the same line search methods]
- Newton's method (needs Hessian matrix)

Algs for nonsmooth f:
- Steepest descent
  = blind
  = with direct line search (e.g. golden section search)

These algs find a local minimum.  [A code sketch of steepest descent with a
golden section line search appears at the end of this subsection.]

_line_search_:  finds a local minimum along the search direction by solving
an optimization problem in 1D.  [...instead of using a blind step size like
the perceptron algorithm does.  Solving a 1D problem is much easier than
solving a higher-dimensional one.]

[Neural nets are unconstrained optimization problems with many, many local
 minima.  They sometimes benefit from the more sophisticated optimization
 algorithms, but when the input data set is very large, researchers often
 favor the dumb, blind, stochastic versions of gradient descent.]

[If you're optimizing over a d-dimensional space, the Hessian matrix is a
 d-by-d matrix and it's usually dense, so most methods that use the Hessian
 are computationally infeasible when d is large.]
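
[Here is a minimal sketch, assuming NumPy, of steepest descent with a golden
 section line search, the direct 1D method named above; the function names
 are illustrative, not from any library.]

    import numpy as np

    def golden_section(phi, a, b, tol=1e-6):
        # Direct line search:  shrinks a bracket [a, b] around a local
        # minimum of phi using only function values (no derivatives), so it
        # also works when phi is nonsmooth.
        invphi = (np.sqrt(5.0) - 1.0) / 2.0     # 1 / golden ratio ~ 0.618
        c, d = b - invphi * (b - a), a + invphi * (b - a)
        while b - a > tol:
            if phi(c) < phi(d):                 # minimum lies in [a, d]
                b, d = d, c
                c = b - invphi * (b - a)
            else:                               # minimum lies in [c, b]
                a, c = c, d
                d = a + invphi * (b - a)
        return 0.5 * (a + b)

    def steepest_descent(f, grad_f, w, steps=100, t_max=1.0):
        for _ in range(steps):
            g = grad_f(w)
            t = golden_section(lambda t: f(w - t * g), 0.0, t_max)
            w = w - t * g                       # walk downhill
        return w

    # E.g., a smooth, convex quadratic with its global minimum at (3, -1):
    f = lambda w: (w[0] - 3.0)**2 + 4.0 * (w[1] + 1.0)**2
    grad_f = lambda w: np.array([2.0 * (w[0] - 3.0), 8.0 * (w[1] + 1.0)])
    print(steepest_descent(f, grad_f, np.zeros(2)))   # -> approx. [3, -1]
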
Constrained Optimization  (smooth equality constraints)
------------------------

Goal:  Find w that minimizes (maximizes) f(w)
       subject to g(w) = 0    [<- observe that this is an isosurface]

       where g is a smooth fn
       (may be a vector, encoding multiple constraints)

Alg:  Use _Lagrange_multipliers_.

Linear Program
--------------

Linear objective fn + linear *inequality* constraints.

Goal:  Find w that maximizes (or minimizes) c . w
       subject to A w <= b

       where A is an n-by-d matrix, b in R^n, expressing n
       _linear_constraints_:  A_i w <= b_i,  i in [1, n]

[Figure:  a two-dimensional feasible region bounded by constraint lines; the
 optimum is the vertex of the region furthest in the direction c, where two
 active constraints meet.]

The set of points that satisfy all constraints is a convex _polytope_ called
the _feasible_region_ F.  The _optimum_ is the point in F that is furthest in
the direction c.

[What does convex mean?]

A point set P is _convex_ if for every p, q in P, the line segment with
endpoints p, q lies entirely in P.

[A polytope is just a polyhedron, generalized to higher dimensions.]

We can always find a global optimum that is a vertex of the polytope.  The
optimum achieves equality for some constraints (but not most), called the
_active_constraints_ of the optimum.  [In the figure above, there are two
active constraints.  In an SVM, active constraints correspond to the samples
that touch or violate the slab.]

[Sometimes, there is more than one optimal point.  For example, in the figure
 above, if c pointed straight up, every point on the top horizontal edge
 would be optimal.  The set of optimal points is always convex.]

Example:  EVERY _feasible_point_ (w, alpha) gives a linear classifier
(see the code sketch at the end of this subsection):

-----------------------------------------------------------------
|  Find w, alpha that maximizes 0                               |
|  subject to y_i (w . X_i + alpha) >= 1    for all i in [1, n] |
-----------------------------------------------------------------

IMPORTANT:  The data are linearly separable iff the feasible region is not
the empty set.  [The same is true for the maximum margin classifier
(quadratic program).]

Algs for linear programming:
- Simplex (George Dantzig, 1947)
  [Indisputably one of the most important algorithms of the 20th century.]
  [Walks along edges of the polytope from vertex to vertex until it finds
   the optimum.]
- Interior point methods

[Linear programming is very different from unconstrained optimization; it
 has a much more combinatorial flavor.  If you knew which constraints would
 be the active constraints once you found the solution, it would be easy;
 the hard part is figuring out which constraints should be the active ones.
 There are exponentially many possibilities, so you can't afford to try them
 all.  So linear programming algorithms tend to have a very discrete,
 computer science feeling to them, like graph algorithms, whereas
 unconstrained optimization algorithms tend to have a continuous, numerical
 optimization feeling.]

[Linear programs crop up everywhere in engineering and science, but they're
 usually in disguise.  An extremely useful talent you should develop is to
 recognize when a problem is a linear program.]

[A linear program solver can find a linear classifier, but it can't find the
 maximum margin classifier.  We need something more powerful.]
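
[Here is a minimal sketch of the feasibility linear program in the box above,
 assuming SciPy; linear_classifier is my own name.  Two details:  linprog
 minimizes, and wants constraints in the form A z <= b, so the constraints
 y_i (w . X_i + alpha) >= 1 are negated; and linprog's variables are
 nonnegative by default, so we must declare them free.]

    import numpy as np
    from scipy.optimize import linprog

    def linear_classifier(X, y):
        # X is an n-by-d array of samples; y holds labels in {-1, +1}.
        # Variables z = (w, alpha); the objective fn is the constant 0.
        n, d = X.shape
        A_ub = -y[:, None] * np.hstack([X, np.ones((n, 1))])
        b_ub = -np.ones(n)
        result = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=b_ub,
                         bounds=[(None, None)] * (d + 1))
        if not result.success:      # empty feasible region:
            return None             # the data are not linearly separable
        return result.x[:d], result.x[d]      # w, alpha
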
Quadratic program
-----------------

Quadratic, convex objective fn + linear inequality constraints.

Goal:  Find w that minimizes f(w) = w^T Q w + c^T w
       subject to A w <= b

       where Q is a symmetric, positive definite matrix.

A matrix is _positive_definite_ if w^T Q w > 0 for all w != 0.

Only one local minimum!

[If Q is indefinite, then f is not convex, the minimum is not always unique,
 and quadratic programming is NP-hard.]

Example:  Find maximum margin classifier.

[Show circular contours of the objective fn; draw the polygon of constraints
 on top; show the constrained minimum.  In an SVM, we are looking for the
 point in this polygon that's closest to the origin.  Example with one active
 constraint; example with two.]

Algs for quadratic programming:
- Simplex-like
- Sequential minimal optimization (SMO, used in LIBSVM)
- Coordinate descent (used in LIBLINEAR)
[A code sketch of the maximum margin quadratic program appears at the end of
 these notes.]

[One clever idea used in SMO is that they do a line search that uses the
 Hessian, but it's cheap to compute because they don't walk in the direction
 of steepest descent; instead they walk along just one or a few coordinate
 axes at a time.]

Convex Program  (EE 127/227A/227B)
--------------

Convex objective fn + convex inequality constraints.

[What I've given you here is, roughly, a sliding scale of optimization
 problems of increasing complexity, difficulty, and computation time.  But
 even convex programs are relatively easy to solve.  When you're trying to
 address the needs of real-world applications, it's not uncommon to devise
 an optimization problem with crazy inequalities and an objective function
 that's nowhere near convex.  These are sometimes very, very hard to solve.]
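
[Returning to the maximum margin quadratic program:  here is a minimal
 sketch, assuming SciPy.  A real implementation would use one of the
 dedicated QP algorithms above; for small examples, SciPy's general-purpose
 SLSQP solver suffices.  In the notation of the quadratic program, the
 variables are z = (w, alpha), Q is the identity on the w-coordinates (and
 zero on alpha), and c = 0.]

    import numpy as np
    from scipy.optimize import minimize

    def max_margin_classifier(X, y):
        # Minimize |w|^2 subject to y_i (w . X_i + alpha) >= 1.
        # SLSQP's 'ineq' constraints have the form fun(z) >= 0.
        n, d = X.shape
        objective = lambda z: z[:d] @ z[:d]
        constraints = [{'type': 'ineq',
                        'fun': lambda z, i=i:
                            y[i] * (X[i] @ z[:d] + z[d]) - 1.0}
                       for i in range(n)]
        result = minimize(objective, np.zeros(d + 1), method='SLSQP',
                          constraints=constraints)
        return result.x[:d], result.x[d]      # w, alpha

[At the solution, the active constraints, i.e., the samples with
 y_i (w . X_i + alpha) = 1, are the support vectors that touch the slab.]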