STATISTICAL JUSTIFICATIONS FOR REGRESSION
=========================================
[So far, I've talked about regression as a way to fit curves to points.
 Recall how early in the semester I divided machine learning into 4 levels:
 the application, the model, the optimization problem, and the optimization
 algorithm.  My last two lectures about regression were at the bottom two
 levels:  optimization.  The cost functions that we optimize are somewhat
 arbitrary.  Today let's take a step back to the second level, the model.
 I will describe some models, how they lead to those optimization problems, and
 how they contribute to underfitting or overfitting.]

Typical model of reality:
- samples come from unknown prob. distribution:  X_i ~ D
- y-values are sum of unknown, non-random surface + random noise:  for all X_i,

  y_i = f(X_i) + epsilon_i,                 epsilon_i ~ D',  D' has mean zero

[We are positing that reality is described by a function f.  We don't know f,
 but f is not a random variable; it represents a real relationship between
 x and y that we can estimate.  We add to that a random variable epsilon, which
 represents measurement errors and all the other sources of statistical error
 when we measure real-world phenomena.  Notice that the noise is independent of
 x.  That's a pretty big assumption, and often it does not apply in practice,
 but that's all we'll have time to deal with this semester.  Also notice that
 this model leaves out systematic errors, like when your measuring device adds
 one to every measurement, because we usually can't diagnose systematic errors
 from data alone.]

Goal of regression:  find h that estimates f.

Ideal approach:  choose h(x) = E_Y[Y | X = x] = f(x) + E[epsilon] = f(x)

[If this expectation exists at all, it partly justifies our model of reality.
 We can retroactively define f to be this expectation.]
[Draw figure showing example f, distribution for a fixed x.]
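
[For concreteness, here is a minimal NumPy sketch of this model of reality;
 the particular f, sample distribution, and noise level are arbitrary choices
 made purely for illustration.]

  import numpy as np

  rng = np.random.default_rng(0)

  def f(x):                              # the unknown, non-random "true" fn
      return np.sin(2 * x) + x           # (an arbitrary choice for this demo)

  n = 100
  X = rng.uniform(-2, 2, size=n)         # samples X_i drawn from D
  epsilon = rng.normal(0, 0.3, size=n)   # noise, mean zero, independent of X
  y = f(X) + epsilon                     # y_i = f(X_i) + epsilon_i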

Least-Squares Regression from Maximum Likelihood
------------------------------------------------
Suppose epsilon_i ~ N(0, sigma^2);  then y_i ~ N(f(X_i), sigma^2)
Recall that log likelihood for normal dist. is
  ln P(y_i) = - |y_i - mu|^2 / (2 sigma^2) - constant,         <=  mu = f(X_i)

  ln (P(y_1) P(y_2) ... P(y_n)) = ln P(y_1) + ln P(y_2) + ... + ln P(y_n)

Takeaway:  Max likelihood => find f by least-squares

[So if the noise is normally distributed, maximum likelihood justifies using
 the least-squares cost function.]
[However, I've told you in previous lectures that least-squares is very
 sensitive to outliers.  If the error is truly normally distributed, that's not
 a big deal, especially when you have a lot of samples.  But in the real world,
 the noise often isn't normally distributed.  Outliers might come from faulty
 measurements, data entry errors, or anomalous events, or the noise might
 simply have a heavy-tailed distribution.  In cases like that, least-squares
 isn't a good choice.]
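
[A quick numerical sanity check of this takeaway on made-up data:  with sigma
 fixed, the Gaussian log likelihood is a negative scaled-and-shifted sum of
 squared errors, so the hypothesis that maximizes one minimizes the other.]

  import numpy as np

  rng = np.random.default_rng(1)
  sigma = 0.5
  y = rng.normal(3.0, sigma, size=50)        # made-up observations, true mean 3

  def log_likelihood(mu):                    # Gaussian log likelihood of y
      return np.sum(-(y - mu)**2 / (2 * sigma**2)
                    - 0.5 * np.log(2 * np.pi * sigma**2))

  def sum_sq_errors(mu):                     # least-squares cost
      return np.sum((y - mu)**2)

  mus = np.linspace(2, 4, 401)
  best_ml  = mus[np.argmax([log_likelihood(m) for m in mus])]
  best_lsq = mus[np.argmin([sum_sq_errors(m) for m in mus])]
  assert best_ml == best_lsq                 # same winner either way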


Empirical Risk
--------------
The _risk_ for hypothesis h is expected loss R(h) = E[L] over all x, y.
Discriminative model:  we don't know X's dist. D.  How can we minimize risk?
[If we have a generative model, we can estimate the probability distributions
 for X and Y and derive the expected loss.  That's what we did for Gaussian
 discriminant analysis.  But today I'm assuming we don't have a generative
 model, so we don't know those probabilities.  Instead, we're going to
 approximate the distribution in a very crude way:  we pretend that the samples
 *are* the distribution.]

_Empirical_distribution_:  a discrete probability distribution that IS the
                           sample set, with each sample equally likely
_Empirical_risk_:  expected loss under empirical distribution
                   R-hat(h) = (1/n) sum_{i=1}^n L(h(X_i), y_i)
[The hat on the R indicates it's only a cheap approximation of the true,
 unknown statistical risk we really want to optimize.  Often, this is the best
 we can do.  For many but not all distributions, it converges to the true risk
 in the limit as n -> infinity.]

Takeaway:  this is why we [usually] minimize the sum of loss fns.
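
[In code, the empirical risk is just the mean per-sample loss.  A small
 sketch, with squared-error loss and a hypothetical hypothesis h:]

  import numpy as np

  def empirical_risk(h, X, y, loss=lambda yhat, yi: (yhat - yi)**2):
      """Mean loss of h over the sample set (the empirical distribution)."""
      return np.mean([loss(h(xi), yi) for xi, yi in zip(X, y)])

  # e.g., with a made-up linear hypothesis h(x) = 0.8 x on training data X, y:
  #   risk_hat = empirical_risk(lambda x: 0.8 * x, X, y)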


Logistic regression from maximum likelihood
-------------------------------------------
If we accept the logistic regression fn, what cost fn should we use?

Given arbitrary sample x, write probability it is in (not in) the class:
(Fictitious dimension:  x ends w/1; w ends w/alpha)

  P(y = 1 | x; w) = h(x; w)                              <=  h(x; w) = s(w^T x)
  P(y = 0 | x; w) = 1 - h(x; w)                              [s is logistic fn]

Combine these 2 facts into 1:

  P(y | x; w) = h(x)^y (1 - h(x))^(1-y)     [A bit of a hack, but it works
                                             nicely for intermediate values of y]

Likelihood is
  L(w; X_1, ..., X_n) = prod_{i=1}^n P(y_i | X_i; w)

Log likelihood is
  l(w) = ln L(w) = sum_{i=1}^n ln P(y_i | X_i; w)

                 = sum_{i=1}^n [ y_i ln h(X_i) + (1 - y_i) ln (1 - h(X_i)) ]

...which is the negation of the logistic cost fn J(w).
We want to maximize log likelihood  =>  minimize J.
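
[A sketch of this cost fn in NumPy; it assumes the rows of X already carry the
 fictitious dimension (a trailing 1), and the small eps only guards against
 taking log(0).]

  import numpy as np

  def s(gamma):                              # logistic (sigmoid) function
      return 1 / (1 + np.exp(-gamma))

  def logistic_cost(w, X, y, eps=1e-12):
      """J(w):  negative log likelihood of labels y in {0, 1}."""
      h = s(X @ w)                           # h(X_i; w) = s(w^T X_i), each row
      return -np.sum(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))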


THE BIAS-VARIANCE DECOMPOSITION
===============================
There are 2 sources of error in a hypothesis h:
_bias_:  error due to inability of hypothesis h to fit f perfectly
         e.g. fitting quadratic f with a linear h
_variance_:  error due to fitting random noise in data
             e.g. we fit linear f with a linear h, yet h != f.

Model:  generate samples X_1 ... X_n from some distribution D
        values y_i = f(X_i) + epsilon_i
        fit hypothesis h to X, y
Now h is a random variable; i.e. its weights are random

Consider an arbitrary pt z in R^d (not necessarily a sample!) and
gamma = f(z) + epsilon.  [So z is *arbitrary*, whereas gamma is *random*.]
Note:  E[gamma] = f(z); Var(gamma) = Var(epsilon)
       [So the mean comes from f, and the variance comes from epsilon.]

Risk fn when loss is squared error:

  R(h) = E[L(h(z), gamma)]
         ^
         | we take this expectation over possible training sets X, y
         | and over values of gamma
[Stop and take a close look at this.  Remember that the hypothesis h is
 a random variable.  We are taking a mean over the probability distribution of
 hypotheses.  That seems pretty weird if you've never seen it before.  But
 remember, the training data X and y come from probability distributions.  We
 use the training data to choose weights, so the weights that define h also
 come from some probability distribution.  It might be pretty hard to work out
 what that distribution is, but it exists.  This "E[.]" is integrating the loss
 over all possible values of the weights.]
       = E[(h(z) - gamma)^2]
       = E[h(z)^2] + E[gamma^2] - 2 E[gamma h(z)]
                                    [Observe that gamma & h(z) are independent]
       = Var(h(z)) + E[h(z)]^2 + Var(gamma) + E[gamma]^2 - 2 E[gamma] E[h(z)]
       = (E[h(z)] - E[gamma])^2 + Var(h(z)) + Var(gamma)
       = E[h(z) - f(z)]^2 + Var(h(z)) + Var(epsilon)
         \______________/   \_______/   \__________/
              bias^2        variance    _irreducible_error_
             of method      of method                            [Show bvn.pdf]

This is the pointwise version.  Mean version:  let z ~ D be a random variable;
take the expectation of bias^2 and variance over z.
[So you can decompose one test point's error into these three components, or
 you can decompose the error of the hypothesis over its entire range into three
 components, which tells you how big they'll be on a large test set.]

- Underfitting = too much bias
- Overfitting caused by too much variance
- Training error reflects bias but not variance; test error reflects both
  [which is why low training error can fool you when you've overfitted]
- For many distributions, variance -> 0 as n -> infinity
- If h can fit f exactly, for many distributions bias -> 0 as n -> infinity
- If h cannot fit f well, bias is large at "most" points
- Adding a good feature reduces bias; adding a bad feature rarely increases it
- Adding a feature usually increases variance
- Can't reduce irreducible error [hence its name]
- Noise in test set affects only Var(epsilon);
  noise in training set affects only bias & Var(h)
- For real-world data, f is rarely knowable (and noise model might be wrong)
- But we can test learning algs by choosing f & making synthetic data
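
[Here is a minimal synthetic experiment in that spirit:  choose a known f,
 repeatedly draw training sets, fit a hypothesis to each, and estimate the
 bias, variance, and risk at one test point.  The particular f, noise level,
 and the choice of a linear fit are arbitrary choices for illustration.]

  import numpy as np

  rng = np.random.default_rng(2)
  f = lambda x: x**2                    # the chosen "reality" (arbitrary)
  sigma, n, trials = 0.5, 30, 2000
  z = 0.7                               # one arbitrary test point

  preds = np.empty(trials)
  for t in range(trials):
      X = rng.uniform(-1, 1, size=n)                 # draw a training set
      y = f(X) + rng.normal(0, sigma, size=n)
      w = np.polyfit(X, y, deg=1)                    # fit a linear h (underfits)
      preds[t] = np.polyval(w, z)                    # this h's prediction at z

  bias_sq  = (preds.mean() - f(z))**2                # squared bias at z
  variance = preds.var()                             # variance at z
  gamma = f(z) + rng.normal(0, sigma, size=trials)   # noisy test values at z
  risk = np.mean((preds - gamma)**2)                 # estimates E[(h(z)-gamma)^2]
  # risk should come out close to bias_sq + variance + sigma**2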

[Show trade-off figures with spline, increasing degrees of freedom.]


Example:  Least-Squares Linear Reg.
-----------------------------------
For simplicity, assume no fictitious dimension.
[This implies that our linear regression fn has to be zero at the origin.]

Model:  f(z) = v^T z   (reality is linear)
        [So we could fit f perfectly with a linear h if not for the noise in
         the training set.]
        Let e be a noise n-vector, e ~ N(0, sigma^2 I)
        Training values:  y = Xv + e
        [X & y are the inputs to linear regression.  We don't know v or e.]

Lin. reg. computes weights

  w = X^+ y = X^+ (Xv + e) = v + X^+ e      [X^+ e is the noise in the weights]

BIAS is E[h(z) - f(z)] = E[w^T z - v^T z] = E[z^T X^+ e] = z^T X^+ E[e] = 0
Warning:  This does not mean h(z) - f(z) is everywhere 0!
          Sometimes +ve, sometimes -ve, mean over training sets is 0.
          [Those deviations from the mean are captured in the variance.]
[Not all learning methods give you a bias of zero when a perfect fit is
 possible; here it's a benefit of the squared error loss fn.  With a different
 loss fn, we might have a nonzero bias even fitting a linear h to a linear f.]

VARIANCE is Var(h(z)) = Var(z^T v + z^T X^+ e) = Var(z^T X^+ e)

[This is the dot product of a vector z^T X^+ with an isotropic, normally
 distributed vector e.  The dot product reduces it to a one-dimensional
 Gaussian along the direction z^T X^+, so its variance is just the variance
 of the 1D Gaussian times the squared length of the vector z^T X^+.]

                      = sigma^2 |z^T X^+|^2
                      = sigma^2 z^T (X^T X)^-1 X^T X (X^T X)^-1 z
                      = sigma^2 z^T (X^T X)^-1 z

If we choose coordinate system so E[X] = 0,
then X^T X -> n Cov(D) as n -> infinity, so one can show that for z ~ D,

  Var(h(z)) ~ sigma^2 d / n

[Where d is the dimension--the number of features per sample.]
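
[A small simulation consistent with this analysis, with made-up n, d, and an
 identity covariance for D.  For a single fixed z the variance comes out near
 sigma^2 |z|^2 / n, which averages to sigma^2 d / n over z ~ D.]

  import numpy as np

  rng = np.random.default_rng(3)
  n, d, sigma, trials = 200, 5, 1.0, 3000
  v = rng.normal(size=d)                    # true weights defining f(z) = v^T z
  z = rng.normal(size=d)                    # a fixed test point drawn from D

  hz = np.empty(trials)
  for t in range(trials):
      X = rng.normal(size=(n, d))           # E[X] = 0, Cov(D) = I
      e = rng.normal(0, sigma, size=n)      # noise vector
      w = np.linalg.lstsq(X, X @ v + e, rcond=None)[0]   # least-squares fit
      hz[t] = w @ z                         # h(z) for this training set

  bias = hz.mean() - v @ z                  # should be near 0
  variance = hz.var()                       # near sigma**2 * (z @ z) / n;
                                            # averages to sigma**2 * d / n
                                            # over z ~ D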

Takeaways:  Bias can be zero when hypothesis function can fit the real one!
              [Nice property of squared error loss fn.]
            Variance portion of risk (overfitting) decreases as 1 / n (samples),
                                                    increases as d (features).

[I've used linear regression because it's a relatively simple example.  But
 this bias-variance trade-off applies to nearly all learning algorithms,
 including classification as well as regression.  Of course, for many learning
 algorithms the math gets a lot more complicated than this, if you can do it at
 all.]