DECISION THEORY =============== [Today I'm going to talk about a style of classifier very different from SVMs. The classifiers we'll cover in the next few weeks are based on probability, because sometimes a point in feature space doesn't have just one class.] [Suppose one borrower with income $30,000 and debt $15,000 defaults. another " " " " " " " doesn't default. So in your feature space, you have two samples at the same point with different classes. Obviously, in that case, you can't draw a decision boundary that classifies all points with 100% accuracy.] Multiple samples with different classes could lie at same point: we want a probabilistic classifier. Suppose 10% of population has cancer, 90% doesn't. [caps here Probability distributions for calorie intake, P(X | Y): mean random variables, calories (X) | < 1,200 | 1,200--1,600 | > 1,600 not matrices.] --------------------+-----------+--------------+---------- cancer (Y = 1) | 20% | 50% | 30% no cancer (Y = -1) | 1% | 10% | 89% [I made these numbers up. Please don't take them as medical advice.] Recall: P(X) = P(X | Y = 1) P(Y = 1) + P(X | Y = -1) P(Y = -1) P(1,200 <= X <= 1,600) = 0.5 * 0.1 + 0.1 * 0.9 = 0.14 You meet guy eating x = 1,400 calories/day. Guess whether he has cancer? [If you're in a hurry, you might see that 50% of people with cancer eat 1,400 calories, but only 10% of people with no cancer do, and conclude that someone who eats 1,400 calories probably has cancer. But that would be wrong, because that reasoning fails to take the prior probabilities into account.] Bayes' Theorem: | posterior probability | prior prob. | for 1,200 <= X <= 1,600 | v v v P(X | Y = 1) P(Y = 1) 0.05 \ P(Y = 1 | X) = --------------------- = ---- | P(X) 0.14 | > sum is 1 P(X | Y = -1) P(Y = -1) 0.09 | P(Y = -1 | X) = ----------------------- = ---- | P(X) 0.14 / P(cancer | X = 1,400 cals) = 5/14 ~ 36%. [So we probably shouldn't diagnose cancer.] [However, we've been assuming that we want to maximize the chance of a correct prediction. But that's not always the right assumption. If you're developing a cheap screening test for cancer, you'd rather have more false positives and fewer false negatives. A false negative might mean somebody misses an early diagnosis and dies of a cancer that could have been treated if caught early. A false positive just means that you spend more money on more accurate tests.] A _loss_function_ L(z, y) specifies badness if true class is y, classifier predicts z. / 1 if z = 1, y = -1 false positive is bad E.g., L(z, y) = | 5 if z = -1, y = 1 false negative is BAAAAAD \ 0 if z = y A 36% probability of loss 5 is worse than a 64% prob. of loss 1, so we recommend further cancer screening. Defs: loss fn above is _asymmetrical_. The _0-1_loss_function_ is 1 for incorrect predictions, [symmetrical] 0 for correct. [Another example where you want a very asymmetrical loss function is for spam detection. Putting a good email in the spam folder is much worse than putting spam in your inbox.] Let r : R^d -> +-1 be a _decision_rule_, aka _classifier_: a fn that maps a feature vector x to 1 ("in class") or -1 ("not in class"). The _risk_ for r is the expected loss over all values of x, y: R(r) = E[L(r(X), Y)] = sum (L(r(x), 1) P(Y = 1 | X = x) + L(r(x), -1) P(Y = -1 | X = x)) P(x) x = P(Y = 1) sum L(r(x), 1) P(X = x | Y = 1) + x P(Y = -1) sum L(r(x), -1) P(X = x | Y = -1) x The _Bayes_decision_rule_ aka _Bayes_classifier_ is the r that minimizes R(r); call it r*. Assuming L(z, y) = 0 for z = y: / 1 if L(-1, 1) P(Y = 1 | X = x) > L(1, -1) P(Y = -1 | X = x), r*(x) = | \ -1 otherwise In cancer example, r* = 1 for intakes <= 1,600; r* = -1 for intakes > 1,600. The _Bayes_risk_, aka _optimal_risk_, is the risk of the Bayes classifier. [In our cancer example, the last expression for risk gives:] R(r*) = 0.1 (5 * 0.3) + 0.9 (1 * 0.01 + 1 * 0.1) = 0.249 [It is interesting that, if we really know all these probabilities, we really can construct an ideal probabilistic classifier. But in real applications, we rarely know these probabilities; the best we can do is use statistical methods to estimate them.] Suppose X has a continuous probability density fn (PDF). Review: [Go back to your CS 70 or stats notes if you don't remember this.] / x_2 ^ P(x) ==== prob. that random variable X in [x , x ] = | P(x) dx | == |.== [shaded area] 1 2 / x_1 | = |..|= | = |..| == / inf | == |..| ===== area under whole curve = 1 = | P(x) dx | == |..| ========= / -inf +-----------+--+----------------------> x x x / inf 1 2 _expected_ value of f(X): E[f(X)] = | f(x) P(x) dx / -inf / inf _mean_ mu = E[X] = | x P(x) dx / -inf 2 2 2 2 _variance_ sigma = E[(X - mu) ] = E[X ] - mu [Perhaps our cancer statistics look like this:] ^ ==== | == == P(X | Y = 1) [area under each curve is 1] | = = | = = | = = | = = | = +=++++++++++++++ | = ++++ = ++++ P(X | Y = -1) | = ++ = ++ | = ++ == ++++ | == ++++ ====== ++++++ | ==++++++ =============== +++++ +-----------------------------------------------------------> x [Let's go back to the 0-1 loss function for a moment. In other words, you want a classifier that maximizes the chance of a correct prediction. The wrong answer would be to look where these two curves cross and make that be the decision boundary. As before, it's wrong because it doesn't take into account the prior probabilities.] Suppose P(Y = 1) = 1/3, P(Y = -1) = 2/3, 0-1 loss: ^ P(X | Y = 1) P(Y = 1) | ==== ++++++++ | == == ++++ ++++ | = =+ ++ P(X | Y = -1) P(Y = -1) | = ++|= ++ | = +..|.= + | = +...|.= + | = +....|..= ++ | = +.....|...= ++ | = ++......|....= ++ | = ++........|.....== ++++ | ==++++..........|.......====== +++++ | ==++..............|.............=============== +-------------------+----------------------------==========> x | Bayes optimal decision boundary [To maximize the chance you'll predict correctly whether somebody has cancer, the Bayes decision rule looks up x on this chart and picks the curve with the highest probability. In this example, that means you pick cancer when x is left of the optimal decision boundary, and no cancer when x is to the right.] Define _risk_ as before, replacing summations with integrals. R(r) = E[L(r(X), Y)] / = P(Y = 1) | L(r(x), 1) P(X = x | Y = 1) dx + / / P(Y = -1) | L(r(x), -1) P(X = x | Y = -1) dx / If L is 0-1 loss, [the risk has a particularly nice interpretation] R(r) = P(r(x) is wrong) [which makes sense, because R is the expected loss.] For Bayes decision rule, Bayes Risk is the area under minimum of functions above (shaded). Assuming L(z, y) = 0 for z = y: / R(r*) = | min L(-y, y) P(X = x | Y = y) P(Y = y) dx / y=+-1 [If you want to use an asymmetrical loss function, just scale the vertical reach of each curve accordingly in the figure above.] _Bayes_optimal_decision_boundary_: {x : P(Y = 1 | X = x) = 0.5} \______________/ \___/ predictor fn isovalue [Show figure of 2D Gaussians with decision boundary.] [Obviously, the accuracy of the probabilities is most important near the decision boundary. Far away from the decision boundary, a bit of error in the probabilities probably wouldn't change the classification.] [You can also have multi-class classifiers, where each sample is in one class among many. The Bayesian approach is a particularly convenient way to generate multi-class classifiers, because you can simply choose whichever class has the greatest posterior probability. Then the decision boundary lies wherever two or more classes are tied for the highest probability.] 3 WAYS TO BUILD CLASSIFIERS =========================== (1) Generative models (e.g. LDA) - Assume samples come from probability distributions, different for each class. - Guess form of distributions - For each class C, fit distribution parameters to class C samples, giving P(X | Y = C) - For each C, estimate P(Y = C) - Bayes' Theorem gives P(Y | X) - If 0-1 loss, pick class C that maximizes P(Y = C | X = x) [posterior] equivalently, maximizes P(X = x | Y = C) P(Y = C) (2) Discriminative models (e.g. logistic regression) - Model P(Y | X) directly (3) Find decision boundary (e.g. SVM) - Model r(x) directly (no posterior) Advantage of (1 & 2): P(Y | X) tells you probability your guess is wrong [This is something SVMs don't do.] Advantage of (1): you can diagnose outliers: P(X) is very small Disadvantages of (1): often hard to estimate distributions accurately; real distributions rarely match standard ones. [What I've written here doesn't actually define the phrases "generative model" or "discriminative model". The proper definitions accord with the way statisticians think about models. A _generative_model_ is a full probabilistic model of all variables, whereas a _discriminative_model_ provides a model only for the target variables.] [It's important to remember that we rarely know precisely the value of any of these probabilities. There is usually error in all of these P's, and in a generative model those errors can get compounded when we apply Bayes' Theorem to estimate P(Y | X). In practice, generative models are most popular when you have phenomena that are really well fitted by the normal distribution.]