Perceptron Algorithm (cont'd)
-----------------------------

Recall:
- linear predictor fn  f(x) = w . x  (for simplicity, no alpha)
- decision boundary  {x : f(x) = 0}  (a hyperplane through the origin)
- samples X_1, X_2, ..., X_n (vectors); classifications y_1, ..., y_n = +-1
- goal:  find weights w such that y_i X_i . w >= 0
- goal, rewritten:  find w that minimizes

      R(w) = sum_{i in V} -y_i X_i . w     [risk fn]

  where V is the set of indices i for which y_i X_i . w < 0.

[Our original problem was to find a separating hyperplane in one space, which
 I'll call x-space.  But we've transformed this into a problem of finding an
 optimal point in a different space, which I'll call w-space.  It's important
 to understand transformations like this, where a structure in one space
 becomes a point in another space.  In this *particular* problem, there is a
 duality between hyperplanes and points.]

_Duality_ between x-space and w-space:

  |-------------------------------|-------------------------------|
  |       x-space (primal)        |        w-space (dual)         |
  |-------------------------------|-------------------------------|
  | hyperplane:  {z : w . z = 0}  | point:  w                     |
  |-------------------------------|-------------------------------|
  | point:  x                     | hyperplane:  {z : x . z = 0}  |
  |-------------------------------|-------------------------------|

If a point x lies on a hyperplane H, then its dual hyperplane x^* contains
the dual point H^*.

If we want to enforce the inequality x . w >= 0, that means
- x should be on the correct side of {z : z . w = 0} in primal x-space
- w should be on the correct side of {z : x . z = 0} in dual w-space

[Figure: the primal x-space at left and the dual w-space at right.  Observe
 that the primal points are the normal vectors for the dual lines.]

[For a sample x in class O, w and x must be on the *same* side of the dual
 hyperplane x^*.  For a sample x not in class O (X), w and x must be on
 *opposite* sides of the dual hyperplane x^*.]

[Show plots of R.  Note how R's creases match the dual chart.]

[In this example, we can choose w to be any point in the bottom pizza slice;
 all those points satisfy the inequalities.]

[We have an optimization problem; we need an optimization algorithm to
 solve it.]

An _optimization_algorithm_:  _gradient_descent_ on R.

Given a starting point w, find the gradient of R with respect to w; this is
the direction of steepest ascent.  Take a step in the opposite direction.
Recall [from your vector calculus class]

    grad R(w) = [dR/dw_1  dR/dw_2  ...  dR/dw_d]^T    and
    grad (z . w) = [z_1  z_2  ...  z_d]^T = z.

    grad R(w) = sum_{i in V} grad(-y_i X_i . w) = - sum_{i in V} y_i X_i

At any point w, we walk downhill in the direction of steepest descent,
-grad R(w).

  w = arbitrary nonzero starting point (good choice is any y_i X_i)
  while R(w) > 0
    V <- set of indices i for which y_i X_i . w < 0
    w <- w + epsilon sum_{i in V} y_i X_i
  return w

epsilon is the _step_size_ aka _learning_rate_, chosen empirically.
[Best choice depends on input problem!]

[Show plot of R again.  Show typical steps.]

Problem:  Slow!  Each step takes O(nd) time.  [Can we improve this?]

Optimization algorithm 2:  _stochastic_gradient_descent_

Idea:  each step, pick *one* misclassified X_i; do gradient descent on the
loss fn L(X_i . w, y_i).  This is called the _perceptron_algorithm_.  Each
step takes O(d) time.  [Not counting the time to search for a misclassified
X_i.]

  while some y_i X_i . w < 0
    w <- w + epsilon y_i X_i
  return w
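[For concreteness, here is a minimal NumPy sketch of the stochastic version
 above.  The function name, the default step size epsilon, and the iteration
 cap max_iter are illustrative choices, not part of the pseudocode; if no
 separating hyperplane through the origin exists, the loop gives up after
 max_iter passes.]

  import numpy as np

  def perceptron(X, y, epsilon=1.0, max_iter=1000):
      # X: n x d array of sample points; y: length-n array of labels in {-1, +1}.
      n, d = X.shape
      w = y[0] * X[0].astype(float)        # arbitrary nonzero start: any y_i X_i
      for _ in range(max_iter):
          misclassified = np.flatnonzero(y * (X @ w) < 0)   # the set V
          if misclassified.size == 0:      # R(w) = 0; all samples classified correctly
              break
          i = misclassified[0]             # pick *one* misclassified sample
          w = w + epsilon * y[i] * X[i]    # O(d) update
      return w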
[By the way, stochastic gradient descent does not work for every problem that
 gradient descent works for.  The perceptron risk function happens to have
 special properties that allow stochastic gradient descent to always succeed.]

What if the separating hyperplane doesn't pass through the origin?
Add a fictitious dimension.  The hyperplane is

    w . x + alpha = 0,  i.e.  [w_1  w_2  alpha] . [x_1  x_2  1]^T = 0.

Now we have samples in R^(d+1), all lying on the hyperplane x_{d+1} = 1.
Run the perceptron algorithm in (d + 1)-dimensional space.  [A code sketch of
 this lifting appears after the margin discussion below.]

[The perceptron algorithm was invented in 1957 by Frank Rosenblatt at the
 Cornell Aeronautical Laboratory.  It was originally designed not to be a
 program, but to be implemented in hardware for image recognition on a
 20 x 20 pixel image.  Rosenblatt built a Mark I Perceptron Machine that ran
 the algorithm, complete with electric motors to do the weight updates.]

[Show Mark I photo.  This is what it took to process a 20 x 20 image in 1957.]

[Then he had a press conference where he predicted that perceptrons would be
 "the embryo of an electronic computer that [the Navy] expects will be able
 to walk, talk, see, write, reproduce itself and be conscious of its
 existence."  We're still waiting on that.]

[One interesting aspect of the perceptron algorithm is that it's an "online
 algorithm", which means that if new data points come in while the algorithm
 is already running, you can just throw them into the mix and keep looping.]

Perceptron Convergence Theorem:  If the data is linearly separable, the
perceptron algorithm will find a linear classifier that classifies all the
data correctly in at most O(R^2 / gamma^2) iterations, where R = max |X_i| is
the "radius of the data" and gamma is the "maximum margin."
[I'll define "maximum margin" shortly.]

[We're not going to prove this, because it's obsolete.]

[Although the step size/learning rate doesn't appear in that big-O
 expression, it does have an effect on the running time, but the effect is
 hard to characterize.  The algorithm gets slower if epsilon is too small
 because it has to take lots of steps to get down the hill.  But it also gets
 slower if epsilon is too big, for a different reason:  it jumps right over
 the region with zero risk and oscillates back and forth for a long time.]

[Although stochastic gradient descent is faster for this problem than
 gradient descent, the perceptron algorithm is still slow.  There's no
 reliable way to choose a good step size epsilon.  Fortunately, optimization
 algorithms have improved a lot since 1957.  You can get rid of the step size
 by using any decent modern "line search" algorithm.  Better yet, you can
 find a better decision boundary much more quickly by quadratic programming,
 which is what we'll talk about next.]

MAXIMUM MARGIN CLASSIFIERS
==========================

The _margin_ of a linear classifier is the distance from the decision
boundary to the nearest sample point.  What if we make the margin as big as
possible?

[Figure: a maximum margin classifier.  The decision boundary
 w . x + alpha = 0 runs midway between the parallel hyperplanes
 w . x + alpha = 1 and w . x + alpha = -1; the O and X samples lie outside
 the slab between them.]

We enforce the constraints

    y_i (w . X_i + alpha) >= 1     for i in [1, n].

[Notice that the right-hand side is a 1, rather than a 0 as it was for the
 perceptron risk function.  It's not obvious, but this is a much better way
 to formulate the problem, partly because it makes it impossible for the
 weight vector w to get set to zero.]

If |w| = 1, the constraints imply the margin is at least 1 [because
w . X_i + alpha is the signed distance from X_i to the hyperplane].  BUT we
allow w to have arbitrary length, so the margin is at least 1/|w|.
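[As a concrete illustration of the fictitious-dimension lifting and of the
 margin, here is a small NumPy sketch.  It assumes the hypothetical
 perceptron function from the earlier sketch; the names
 perceptron_with_offset and margin are illustrative.]

  import numpy as np

  def perceptron_with_offset(X, y, epsilon=1.0, max_iter=1000):
      # Lift each sample to R^(d+1) by appending a fictitious coordinate of 1,
      # run the perceptron there, then split the result into (w, alpha).
      X_lift = np.hstack([X, np.ones((X.shape[0], 1))])
      w_lift = perceptron(X_lift, y, epsilon, max_iter)
      return w_lift[:-1], w_lift[-1]        # w, alpha

  def margin(X, y, w, alpha):
      # Distance from the decision boundary {x : w . x + alpha = 0} to the
      # nearest sample:  min_i y_i (X_i . w + alpha) / |w|.
      # Negative if some sample is misclassified.
      return np.min(y * (X @ w + alpha)) / np.linalg.norm(w)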
There is a _slab_ of width 2/|w| containing no samples [with the hyperplane
running along its middle].

To maximize the margin, minimize |w|.  Optimization problem:

  ----------------------------------------------------------------
  |  Find w and alpha that minimize |w|^2                         |
  |  subject to y_i (X_i . w + alpha) >= 1   for all i in [1, n]  |
  ----------------------------------------------------------------

Called a _quadratic_program_ in d + 1 dimensions and n constraints.
It has one unique solution!  [The reason we use |w|^2 as an objective
function, instead of |w|, is that the length function |w| is not smooth at
zero, whereas |w|^2 is smooth everywhere.]

The solution gives us a _maximum_margin_classifier_, aka a _hard_margin_
_support_vector_machine_ (SVM).  [Technically, this isn't really a support
vector machine yet; it doesn't fully deserve that name until we add features
and kernelization, which we'll do in later lectures.]

[Show 3D example in (w, alpha) weight space + 2D cross-section w_1 = 1/17.
 Show the optimal point on both graphs.]
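[For concreteness, a minimal sketch of solving this quadratic program with
 SciPy's general-purpose SLSQP optimizer, standing in for a dedicated QP
 solver.  The function name hard_margin_svm is illustrative, and the sketch
 assumes linearly separable data; otherwise the constraints are infeasible
 and the solver reports failure.]

  import numpy as np
  from scipy.optimize import minimize

  def hard_margin_svm(X, y):
      # Minimize |w|^2 subject to y_i (X_i . w + alpha) >= 1 for all i,
      # optimizing over z = (w_1, ..., w_d, alpha).
      n, d = X.shape
      objective = lambda z: np.dot(z[:d], z[:d])           # |w|^2
      constraints = [{'type': 'ineq',                       # fun(z) >= 0
                      'fun': lambda z, i=i: y[i] * (X[i] @ z[:d] + z[d]) - 1}
                     for i in range(n)]
      result = minimize(objective, np.zeros(d + 1), method='SLSQP',
                        constraints=constraints)
      return result.x[:d], result.x[d]                      # w, alpha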