STATISTICAL JUSTIFICATIONS FOR REGRESSION
=========================================

[So far, I've talked about regression as a way to fit curves to points. Recall how early in the semester I divided machine learning into 4 levels: the application, the model, the optimization problem, and the optimization algorithm. My last two lectures about regression were at the bottom two levels: optimization. The cost functions that we optimize are somewhat arbitrary. Today let's take a step back to the second level, the model. I will describe some models, how they lead to those optimization problems, and how they contribute to underfitting or overfitting.]

Typical model of reality:
- samples come from an unknown prob. distribution:  X_i ~ D
- y-values are the sum of an unknown, non-random surface plus random noise:
      for all X_i,  y_i = f(X_i) + epsilon_i,   epsilon_i ~ D',  D' has mean zero

[We are positing that reality is described by a function f. We don't know f, but f is not a random variable; it represents a real relationship between x and y that we can estimate. We add to that a random variable epsilon, which represents measurement errors and all the other sources of statistical error when we measure real-world phenomena. Notice that the noise is independent of x. That's a pretty big assumption, and often it does not apply in practice, but that's all we'll have time to deal with this semester. Also notice that this model leaves out systematic errors, like when your measuring device adds one to every measurement, because we usually can't diagnose systematic errors from data alone.]

Goal of regression:  find h that estimates f.

Ideal approach:  choose h(x) = E_Y[Y | X = x] = f(x) + E[epsilon] = f(x)

[If this expectation exists at all, it partly justifies our model of reality. We can retroactively define f to be this expectation.]

[Draw figure showing example f, distribution for a fixed x.]


Least-Squares Regression from Maximum Likelihood
------------------------------------------------

Suppose epsilon_i ~ N(0, sigma^2); then y_i ~ N(f(X_i), sigma^2).

Recall that the log likelihood for the normal distribution is

    ln P(y_i) = - |y_i - mu|^2 / (2 sigma^2) - constant,    where mu = f(X_i)

    ln (P(y_1) P(y_2) ... P(y_n)) = ln P(y_1) + ln P(y_2) + ... + ln P(y_n)
                                  = - (1 / (2 sigma^2)) sum_{i=1}^n |y_i - f(X_i)|^2 - constant

Takeaway:  Maximum likelihood => find f by least-squares.

[So if the noise is normally distributed, maximum likelihood justifies using the least-squares cost function.]

[However, I've told you in previous lectures that least-squares is very sensitive to outliers. If the error really is normally distributed, that's not a big deal, especially when you have a lot of samples. But in the real world, the distribution of errors often isn't normal. Outliers might come from faulty measurements, data entry errors, anomalous events, or a distribution that simply isn't normal. When you have a heavy-tailed distribution, for example, least-squares isn't a good choice.]
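[A minimal numerical sketch of this takeaway: on synthetic data, fit the weights once by minimizing the negative Gaussian log-likelihood with a generic optimizer and once by ordinary least-squares, and check that the two agree. NumPy and SciPy are assumed; the synthetic data and the helper name neg_log_likelihood are made up for illustration.]

    # Sketch: maximizing the Gaussian log-likelihood = minimizing squared error.
    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    n, d, sigma = 100, 3, 0.5
    X = rng.normal(size=(n, d))
    v = np.array([1.0, -2.0, 0.5])                 # "true" linear f(x) = v^T x
    y = X @ v + rng.normal(scale=sigma, size=n)    # y_i = f(X_i) + epsilon_i

    def neg_log_likelihood(w):
        # -sum_i ln P(y_i) for y_i ~ N(w^T X_i, sigma^2), dropping the constant
        r = y - X @ w
        return np.sum(r**2) / (2 * sigma**2)

    w_mle = minimize(neg_log_likelihood, np.zeros(d)).x   # maximum likelihood
    w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)          # least-squares
    print(np.allclose(w_mle, w_ls, atol=1e-3))            # True: same weights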
Empirical Risk
--------------

The _risk_ for hypothesis h is expected loss R(h) = E[L] over all x, y.

Discriminative model:  we don't know X's dist. D. How can we minimize risk?

[If we have a generative model, we can estimate the probability distributions for X and Y and derive the expected loss. That's what we did for Gaussian discriminant analysis. But today I'm assuming we don't have a generative model, so we don't know those probabilities. Instead, we're going to approximate the distribution in a very crude way: we pretend that the samples *are* the distribution.]

_Empirical_distribution_:  a discrete probability distribution that IS the sample set, with each sample equally likely

_Empirical_risk_:  expected loss under the empirical distribution,

    ^
    R(h) = (1/n) sum_{i=1}^n L(h(X_i), y_i)

[The hat on the R indicates it's only a cheap approximation of the true, unknown statistical risk we really want to optimize. Often, this is the best we can do. For many but not all distributions, it converges to the true risk in the limit as n -> infinity.]

Takeaway:  this is why we [usually] minimize the sum of loss fns.


Logistic Regression from Maximum Likelihood
-------------------------------------------

If we accept the logistic regression fn, what cost fn should we use?

Given arbitrary sample x, write probability it is in (not in) the class:
(Fictitious dimension:  x ends w/1; w ends w/alpha)

    P(y = 1 | x; w) = h(x; w),       where h(x; w) = s(w^T x)   [s is logistic fn]
    P(y = 0 | x; w) = 1 - h(x; w)

Combine these 2 facts into 1:

    P(y | x; w) = h(x)^y (1 - h(x))^(1-y)
    [A bit of a hack, but it works nicely for intermediate values of y]

Likelihood is

    L(w; X_1, ..., X_n) = prod_{i=1}^n P(y_i | X_i; w)

Log likelihood is

    l(w) = ln L(w) = sum_{i=1}^n ln P(y_i | X_i; w)
                   = sum_{i=1}^n ( y_i ln h(X_i) + (1 - y_i) ln (1 - h(X_i)) )

...which is the negated logistic cost fn J(w). We want to maximize the log likelihood => minimize J.


THE BIAS-VARIANCE DECOMPOSITION
===============================

There are 2 sources of error in a hypothesis h:

_bias_:      error due to inability of hypothesis h to fit f perfectly
             e.g. fitting quadratic f with a linear h
_variance_:  error due to fitting random noise in data
             e.g. we fit linear f with a linear h, yet h != f.

Model:  generate samples X_1, ..., X_n from some distribution D
        values y_i = f(X_i) + epsilon_i
        fit hypothesis h to X, y
Now h is a random variable; i.e. its weights are random.

Consider an arbitrary pt z in R^d (not necessarily a sample!) and gamma = f(z) + epsilon.
[So z is *arbitrary*, whereas gamma is *random*.]

Note:  E[gamma] = f(z);  Var(gamma) = Var(epsilon)
[So the mean comes from f, and the variance comes from epsilon.]

Risk fn when loss is squared error:

    R(h) = E[L(h(z), gamma)]
           [the expectation is taken over possible training sets X, y and values of gamma]

[Stop and take a close look at this. Remember that the hypothesis h is a random variable. We are taking a mean over the probability distribution of hypotheses. That seems pretty weird if you've never seen it before. But remember, the training data X and y come from probability distributions. We use the training data to choose weights, so the weights that define h also come from some probability distribution. It might be pretty hard to work out what that distribution is, but it exists. This "E[.]" is integrating the loss over all possible values of the weights.]

    = E[(h(z) - gamma)^2]
    = E[h(z)^2] + E[gamma^2] - 2 E[gamma h(z)]       [Observe that gamma & h(z) are independent]
    = Var(h(z)) + E[h(z)]^2 + Var(gamma) + E[gamma]^2 - 2 E[gamma] E[h(z)]
    = (E[h(z)] - E[gamma])^2 + Var(h(z)) + Var(gamma)
    = E[h(z) - f(z)]^2 + Var(h(z)) + Var(epsilon)
      \______________/   \_______/   \__________/
       bias^2 of method   variance   _irreducible_error_
                          of method

[Show bvn.pdf]

This is pointwise version. Mean version:  let z ~ D be random variable; take expectation of bias^2, variance over z.

[So you can decompose one test point's error into these three components, or you can decompose the error of the hypothesis over its entire range into three components, which tells you how big they'll be on a large test set.]
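[A simulation sketch that makes the pointwise decomposition concrete: choose an f and a noise level, repeatedly draw synthetic training sets, fit a hypothesis to each, and estimate bias^2, variance, and risk at one point z. The choice of f, the degree-1 polynomial hypothesis, and all the names here are illustrative assumptions, not part of the derivation above.]

    # Sketch: Monte Carlo estimate of bias^2 + variance + irreducible error at a point z.
    import numpy as np

    rng = np.random.default_rng(0)
    f = lambda x: np.sin(x)        # "reality" (unknown in practice)
    sigma, n, trials, z = 0.3, 30, 2000, 1.0
    degree = 1                     # a linear h can't fit sin exactly -> nonzero bias

    preds = np.empty(trials)
    for t in range(trials):
        X = rng.uniform(-3, 3, size=n)                   # X_i ~ D
        y = f(X) + rng.normal(scale=sigma, size=n)       # y_i = f(X_i) + epsilon_i
        w = np.polyfit(X, y, degree)                     # fit h to this training set
        preds[t] = np.polyval(w, z)                      # h(z) for this training set

    gamma = f(z) + rng.normal(scale=sigma, size=trials)  # test values gamma = f(z) + epsilon
    bias_sq = (preds.mean() - f(z))**2
    variance = preds.var()
    risk = np.mean((preds - gamma)**2)
    print(bias_sq, variance, sigma**2, risk)             # risk ~= bias_sq + variance + sigma^2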
- Underfitting = too much bias
- Overfitting is caused by too much variance
- Training error reflects bias but not variance; test error reflects both
  [which is why low training error can fool you when you've overfitted]
- For many distributions, variance -> 0 as n -> infinity
- If h can fit f exactly, for many distributions bias -> 0 as n -> infinity
- If h cannot fit f well, bias is large at "most" points
- Adding a good feature reduces bias; adding a bad feature rarely increases it
- Adding a feature usually increases variance
- Can't reduce irreducible error  [hence its name]
- Noise in the test set affects only Var(epsilon); noise in the training set affects only bias & Var(h)
- For real-world data, f is rarely knowable (and the noise model might be wrong)
- But we can test learning algs by choosing an f & making synthetic data (as in the simulation sketch above)

[Show trade-off figures with spline, increasing degrees of freedom.]


Example:  Least-Squares Linear Regression
-----------------------------------------

For simplicity, assume no fictitious dimension.
[This implies that our linear regression fn has to be zero at the origin.]

Model:  f(z) = v^T z    (reality is linear)
[So we could fit f perfectly with a linear h if not for the noise in the training set.]

Let e be the noise n-vector, with e_i ~ N(0, sigma^2).
Training values:  y = Xv + e
[X & y are the inputs to linear regression. We don't know v or e.]

Linear regression computes the weights

    w = X^+ y = X^+ (Xv + e) = v + X^+ e,    where X^+ e is the noise in the weights.

BIAS is

    E[h(z) - f(z)] = E[w^T z - v^T z] = E[z^T X^+ e] = z^T X^+ E[e] = 0

Warning:  This does not mean h(z) - f(z) is everywhere 0! Sometimes positive, sometimes negative; the mean over training sets is 0.
[Those deviations from the mean are captured in the variance.]

[Not all learning methods give you a bias of zero when a perfect fit is possible; here it's a benefit of the squared error loss fn. With a different loss fn, we might have a nonzero bias even fitting a linear h to a linear f.]

VARIANCE is

    Var(h(z)) = Var(z^T v + z^T X^+ e) = Var(z^T X^+ e)

[This is the dot product of a vector z^T X^+ with an isotropic, normally distributed vector e. The dot product reduces it to a one-dimensional Gaussian along the direction z^T X^+, so its variance is just the variance of the 1D Gaussian times the squared length of the vector z^T X^+.]

    = sigma^2 |z^T X^+|^2 = sigma^2 z^T (X^T X)^-1 X^T X (X^T X)^-1 z
    = sigma^2 z^T (X^T X)^-1 z

If we choose the coordinate system so E[X] = 0, then X^T X -> n Cov(D) as n -> infinity, so one can show that for z ~ D,

    Var(h(z)) ~ sigma^2 d / n

[Where d is the dimension--the number of features per sample.]

Takeaways:
- Bias can be zero when the hypothesis function can fit the real one!
  [Nice property of the squared error loss fn.]
- Variance portion of RSS (overfitting) decreases as 1 / n (samples), increases as d (features).

[I've used linear regression because it's a relatively simple example. But this bias-variance trade-off applies to nearly all learning algorithms, including classification as well as regression. Of course, for many learning algorithms the math gets a lot more complicated than this, if you can do it at all.]
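[A quick numerical check of these takeaways: simulate least-squares linear regression with a fixed design matrix X and freshly drawn noise e on each trial, and compare the empirical bias and variance of h(z) to the formulas above. Holding X fixed matches the formula sigma^2 z^T (X^T X)^-1 z, which is conditioned on X; the sizes and names here are illustrative assumptions.]

    # Sketch: empirical bias ~ 0 and Var(h(z)) ~ sigma^2 z^T (X^T X)^-1 z.
    import numpy as np

    rng = np.random.default_rng(0)
    n, d, sigma, trials = 200, 5, 1.0, 5000
    X = rng.normal(size=(n, d))            # fixed design matrix (no fictitious dimension)
    v = rng.normal(size=d)                 # reality: f(z) = v^T z
    z = rng.normal(size=d)                 # an arbitrary query point

    preds = np.empty(trials)
    for t in range(trials):
        e = rng.normal(scale=sigma, size=n)            # fresh noise, same X
        y = X @ v + e                                  # y = Xv + e
        w = np.linalg.lstsq(X, y, rcond=None)[0]       # w = X^+ y
        preds[t] = w @ z                               # h(z) for this training set

    print(preds.mean() - v @ z)                        # ~ 0: zero bias
    print(preds.var())                                 # empirical Var(h(z))
    print(sigma**2 * z @ np.linalg.solve(X.T @ X, z))  # sigma^2 z^T (X^T X)^-1 z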