CLASSIFIERS
===========

You are given a data set of n _samples_, each with d _features_.
Some samples belong to _class_ O; some do not.

Example:  Samples are bank loans
          Features are income & age (d = 2)
          Some are in class "defaulted", some are not

Goal:  Predict whether future borrowers will default,
       based on their income & age.

Represent each sample as a point in d-dimensional space,
called a _feature_vector_ (aka _predictors_, _independent_variables_).

[Figure:  three scatter plots of X's and O's, income (vertical axis) vs.
 age (horizontal axis).  One plot is linearly separable; two are not; the
 third, with a sinuous boundary, is labeled "_overfitting_".]

[Draw lines/curves separating O's from X's.  Use those curves to predict
 which future borrowers will default.]

_decision_boundary_:  the boundary chosen by our classifier to separate
                      items in the class from those not in the class.

[By the way, when I underline a word or a short phrase, usually that is a
 *definition*.  If you want to do well in this course, you should *memorize*
 all the definitions I write down.]

Some (not all) classifiers work by computing a _predictor_function_:
A function f(x) that maps a sample point x to a scalar such that

     f(x) >  0   if x is in class O;
     f(x) <= 0   if x is not in class O.

Aka _decision_function_ or _discriminant_function_.

For these classifiers, the decision boundary is  {x in R^d : f(x) = 0}
[That is, the set of all points where the predictor function is zero.]
Usually, this set is a (d - 1)-dimensional surface in R^d.

{x : f(x) = 0} is also called an _isosurface_ of f for the _isovalue_ 0.

f has other isosurfaces for other isovalues, e.g. {x : f(x) = 1}.

[Show plot & *isocontours* of sqrt(x^2 + y^2) - 3.  Imagine a function in
 R^d, and imagine its (d - 1)-dimensional isosurface.]

_linear_classifier_:  The decision boundary is a line (in higher dimensions,
                      a plane).  Usually uses a linear predictor function.
                      [Sometimes no predictor fn.]

_overfitting_:  When a sinuous decision boundary fits sample data so well
                that it doesn't classify future items well.


Math Review
-----------

[I will write vectors in matrix notation.]

Vectors:

     x  =  [x_1  x_2  x_3  x_4  x_5]^T        (a column vector)

Think of x as a point in 5-dimensional space.

Conventions (often, but not always):
     uppercase roman = matrix or random variable       X
     lowercase roman = vector                          x
     Greek = scalar                                    alpha
     Other scalars:
          n = # of samples
          d = # of features (per sample) = dimension of sample points
          i, j, k = indices
     functions (often scalar)                          f( ), s( ), ...

_inner_product_ (aka _dot_product_):  x . y = x_1 y_1 + x_2 y_2 + ... + x_d y_d
                                      (also written x^T y)

Clearly, f(x) = w . x + alpha is a _linear_function_ in x.

_Euclidean_norm_:  |x| = sqrt(x . x) = sqrt(x_1^2 + x_2^2 + ... + x_d^2)

     |x| is the length (aka Euclidean length) of a vector x.

     Given a vector x, x / |x| is a _unit_ vector (length 1).

     "_Normalize_ a vector x":  replace x with x / |x|.

Use dot products to compute angles:

                  x . y       x       y
     cos theta = ------- =  ----- . -----
                 |x| |y|     |x|     |y|
                            \___/   \___/
                           length 1 length 1

[Figure:  vectors x and y with angle theta between them.]

[Figure:  three cases --  theta acute:  cos theta > 0 (positive);
 theta right:  cos theta = 0;  theta obtuse:  cos theta < 0 (negative).]

Given a linear predictor function  f(x) = w . x + alpha,
the decision boundary is

     H = {x : w . x = -alpha}.

The set H is called a _hyperplane_.  (A line in 2D, a plane in 3D.)

[I want you to understand what a hyperplane is.  In 2D, it's a line.  In 3D,
 it's a plane.  Now take that concept and generalize it to higher dimensions.
 In d dimensions, a hyperplane is a flat, infinite thing with dimension
 d - 1.  A hyperplane divides the d-dimensional space into two halves.]

Theorem:  Let x, y be two points that lie on H.  Then w . (y - x) = 0.
Proof:  x and y lie on H, so w . x = -alpha and w . y = -alpha.
        Thus w . (y - x) = w . y - w . x = -alpha - (-alpha) = 0.

w is called the _normal_vector_ of H, because (as the theorem shows) w is
normal (perpendicular) to H.  (I.e. w is normal to every line segment
connecting two points in H.)

[Figure:  the hyperplane H, drawn with its normal vector w.
 Add more isocontours to figure.]

If w is a unit vector, then w . x + alpha is the _signed_distance_ from x
to H.  I.e. it's the distance, but positive on one side of H, negative on
the other side.

Moreover, the distance from H to the origin is |alpha|.  [How do we know
that?]  Hence alpha = 0 if and only if H passes through the origin.

[w does not have to be a unit vector for the classifier to work.  If w is
 not a unit vector, w . x + alpha is a multiple of the signed distance.  If
 you want to fix that, you can _rescale_ the equation by computing |w| and
 dividing both w and alpha by |w|.]

The coefficients in w, plus alpha, are called _weights_ or sometimes
_regression_coefficients_.
[That's why I named the vector w; "w" stands for "weights".]

The input data is _linearly_separable_ if there exists a hyperplane that
separates all the samples in class O from all the samples NOT in class O.
[At the beginning of this lecture, I showed you one plot that's linearly
 separable and two that are not.]

[We will investigate some linear classifiers that only work for linearly
 separable data, then we'll move on to more sophisticated linear classifiers
 that do a decent job with non-separable data.  Obviously, if your data is
 not linearly separable, a linear classifier cannot do a *perfect* job.  But
 we're still happy if we can find a classifier that usually predicts
 correctly.]
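[A minimal NumPy sketch of these definitions.  The weights w, alpha and the
 sample points below are made up for illustration, not taken from the notes.
 It evaluates a linear predictor function f(x) = w . x + alpha, classifies by
 its sign, computes cos theta with dot products, and rescales w and alpha by
 |w| so that f(x) becomes the signed distance from x to the hyperplane H.]

    import numpy as np

    # Hypothetical weights for a linear predictor function f(x) = w . x + alpha.
    w = np.array([3.0, 4.0])
    alpha = -10.0

    def f(x, w, alpha):
        """Linear predictor (decision) function; predict class O iff f(x) > 0."""
        return np.dot(w, x) + alpha

    def cos_angle(x, y):
        """cos theta = (x . y) / (|x| |y|) = (x / |x|) . (y / |y|)."""
        return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

    x = np.array([2.0, 3.0])              # a made-up sample point
    print(f(x, w, alpha) > 0)             # True:  predict x is in class O

    # Rescale so that w is a unit vector; then f gives the signed distance
    # from x to the hyperplane H = {x : w . x + alpha = 0}.
    scale = np.linalg.norm(w)             # |w| = 5
    w_unit, alpha_unit = w / scale, alpha / scale
    print(f(x, w_unit, alpha_unit))       # signed distance from x to H (1.6)

    print(cos_angle(np.array([1.0, 0.0]), np.array([1.0, 1.0])))   # ~0.7071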
A Stupid Classifier
-------------------

_Centroid_method_:  compute mean mu_O of all vectors in class O and
                    mean mu_X of all vectors NOT in O.

We use the predictor function

                                                mu_O + mu_X
     f(x) = (mu_O - mu_X) . x - (mu_O - mu_X) . -----------
            \___________/                            2
            normal vector                       \___________/
                                           midpoint between mu_O, mu_X

so that decision boundary is the hyperplane that bisects line segment
w/endpoints mu_O, mu_X.

[Better yet, we can adjust the right hand side to minimize the number of
 misclassified points.  Same normal vector, but different position.]

[Figure:  a scatter plot of X's and O's with the centroid method's decision
 boundary.  Draw mu_O, mu_X, and the normal vector.]

[In this example, there's clearly a better linear classifier that classifies
 every sample correctly.  Note that this is hardly the worst example I could
 have given.  If you're in the mood for an easy puzzle, pull out a sheet of
 paper and think of an example, with lots of samples, where the centroid
 method misclassifies every sample but one.]

[Nevertheless, there are cases where this method works well, like when all
 your positive examples come from one Gaussian distribution, and all your
 negative examples come from another.]
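[A minimal NumPy sketch of the centroid method.  The function name and the
 2D data below are made up for illustration; the predictor function follows
 the formula above, so f(x) > 0 predicts class O.]

    import numpy as np

    def centroid_classifier(X_in, X_out):
        """Fit the centroid method.  X_in holds the samples in class O (one
        per row); X_out holds the samples not in class O.  Returns the
        predictor function f."""
        mu_O = X_in.mean(axis=0)            # mean of the class O samples
        mu_X = X_out.mean(axis=0)           # mean of the other samples
        w = mu_O - mu_X                     # normal vector of the boundary
        midpoint = (mu_O + mu_X) / 2        # the boundary bisects this segment
        return lambda x: np.dot(w, x) - np.dot(w, midpoint)

    # Made-up 2D samples (income, age):
    in_class  = np.array([[20.0, 25.0], [22.0, 30.0], [18.0, 28.0]])
    out_class = np.array([[60.0, 45.0], [55.0, 50.0], [70.0, 40.0]])

    f = centroid_classifier(in_class, out_class)
    x = np.array([30.0, 35.0])
    print("in class O" if f(x) > 0 else "not in class O", f(x))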
Perceptron Algorithm (Frank Rosenblatt, 1957)
--------------------

Slow, but correct for linearly separable samples.

Uses a _numerical_optimization_ algorithm, namely, _gradient_descent_.

[Poll:
 How many of you know what numerical optimization is?
 How many of you know what gradient descent is?
 How many of you know what Lagrange multipliers are?
 How many of you know what linear programming is?
 How many of you know what the simplex algorithm for linear programming is?
 How many of you know what convex programming is?

 We're going to learn what all these things are.  As machine learning
 people, we will be heavy users of all the optimization methods.
 Unfortunately, I won't have time to teach you *algorithms* for all these
 optimization problems, but we'll learn a few, and I'll give you some hints
 about how the other algorithms work.]

Consider n sample vectors X_1, X_2, ..., X_n.
[The reason I'm using capital X here is that we typically store these
 vectors as rows of a matrix X.  So the subscript picks out a row of X,
 representing a specific sample.  When I want to pick out one feature from a
 sample, I'll add a second subscript after the first one.]

                             /  1    if X_i is in class O, and
For each sample, let  y_i = |
                             \ -1    if X_i is not in O.

For simplicity, consider only decision boundaries that pass through the origin.
                                                       ~~~~~~~~~~~~~~~~~~~~~~~
(We'll fix this later.)

Goal:  find weights w such that    X_i . w >= 0   if y_i =  1, and
                                   X_i . w <= 0   if y_i = -1.

[Remember, X_i . w is the signed distance if w is a unit vector; otherwise
 it's a multiple of the signed distance.]

Equivalently:  y_i X_i . w >= 0.    <-- inequality called a _constraint_.

Idea:  We define a _risk_function_ R that is positive if some constraints
are violated.  Then we use _optimization_ to choose a w that minimizes R.

Define the _loss_function_

                  /  0         if y_i z >= 0, and
     L(z, y_i) = |
                  \  -y_i z    otherwise.

[Here, z is the classifier's prediction, and y_i is the correct answer.]

Idea:  if z has the same sign as y_i, the loss function is zero (happiness).
But if z has the wrong sign, the loss function is positive.

[For each sample, you want to get the loss function down to zero, or as
 close to zero as possible.  It's called the "loss function" because the
 bigger it is, the bigger a loser you are.]

Define the _risk_function_ (aka _objective_function_ or _cost_function_)

             n
     R(w) = sum L(X_i . w, y_i)
            i=1

          = sum  -y_i X_i . w,
           i in V

where V is the set of indices i for which y_i X_i . w < 0.

If w classifies all the points X_1, ..., X_n correctly, then R(w) = 0.
Otherwise, R(w) is positive, and we want to find a better value of w.

---------------------------------------------------------------------------
| Goal:  Solve this _optimization_problem_:  find w that minimizes R(w).  |
---------------------------------------------------------------------------

[Show plot of R.]
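[A minimal NumPy sketch of the loss function and risk function above,
 evaluated on made-up, linearly separable 2D samples (decision boundary
 through the origin).  It only evaluates R(w) for a given w; it does not
 yet minimize R.]

    import numpy as np

    def loss(z, y_i):
        """Perceptron loss:  0 if the prediction z has the same sign as the
        label y_i (i.e. y_i * z >= 0), and -y_i * z otherwise."""
        return 0.0 if y_i * z >= 0 else -y_i * z

    def risk(w, X, y):
        """Risk R(w):  the sum of the losses over all n samples (rows of X)."""
        return sum(loss(np.dot(X[i], w), y[i]) for i in range(len(y)))

    # Made-up, linearly separable samples with labels y_i in {+1, -1}:
    X = np.array([[ 1.0,  2.0],
                  [ 2.0,  1.0],
                  [-1.0, -2.0],
                  [-2.0, -0.5]])
    y = np.array([1, 1, -1, -1])

    print(risk(np.array([1.0, 1.0]), X, y))    # 0.0:  all constraints satisfied
    print(risk(np.array([-1.0, 0.5]), X, y))   # 3.25:  some points misclassified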