RIDGE REGRESSION (aka Tikhonov regularization)
================
(1) + (A) + l_2 penalized mean loss (d).
--------------------------------------------------
| 2 2 |
| Find w that minimizes |Xw - y| + lambda |w'| | = J(w)
--------------------------------------------------
where w' is w with component alpha replaced by 0.
X has fictitious dimension but we DON'T penalize alpha.
Adds a penalty term to encourage small |w'|--called _shrinkage_. Why?
- Guarantees positive definite normal eq's; always unique solution.
[When samples lie on a common hyperplane in feature space.] E.g. when d > n.
[Show figure comparing +ve semidef bowl w/+ve definite bowl (ridgequad.png)]
[That was the original motivation, but the next has become more important...]
- Reduces overfitting by reducing variance. Why?
Imagine: 275238 x_1^2 - 543845 x_1 x_2 + 385832 x_2^2 is best fit for
well-spaced samples all with |y| < 1.
Small change in x => big change in y!
[Given that all the y values in the data are small, it's a sure sign of
overfitting if tiny changes in x cause huge changes in y.]
So we penalize large weights.
[Show plot of isocontours for the two terms of J (ridgeterms.pdf)]
[In this plot, beta is the least-squares solution. The red ellipses are the
isocontours of |Xw - y|^2. The isocontours of |w'| are circles centered at
the origin (blue). The solution lies where a red isocontour just touches
a blue isocontour tangentially. As lambda increases, the solution will
occur at a more outer red isocontour and a more inner blue isocontour.]
Setting grad J = 0 gives normal eq'ns
T T
(X X + lambda I') w = X y
where I' is identity matrix w/bottom right set to zero.
T
Algorithm: Solve for w. Return h(z) = w z.
Increasing lambda => more _regularization_; smaller |w'|
Given our data model y = Xv + e, where e is noise,
T T -1 T
variance of ridge reg. is Var(z (X X + lambda I') X e).
As lambda -> infinity, variance -> 0, but bias increases.
[Show graph of bias^2 & variance as lambda increases (ridgebiasvar.pdf)]
[So, as usual for the bias-variance trade-off, the test error as a function of
lambda is a U-shaped curve, and we find the bottom by validation.]
lambda is a hyperparameter; tune by (cross-)validation.
Ideally, features should be "normalized" to have same variance.
Alternative: use asymmetric penalty by replacing I' w/other diagonal matrix.
Bayesian justification for ridge reg.
-------------------------------------
Assign a prior probability on w': a Gaussian centered at 0.
Posterior prob ~ likelihood of w x prior P(w') <= Gaussian PDF
Maximize log posterior = ln likelihood + ln P(w')
= - const |Xw - y|^2 - const |w'|^2 - const
This method (using likelihood, but maximizing posterior) is called
_maximum_a_posteriori_ (MAP).
KERNELS
=======
Recall: with d input features, degree-p polynomials blow up to O(d^p) features.
[When d is large, this gets computationally intractible really fast.
As I said in Lecture 4, if you have 100 features per sample and you want to
use degree-4 predictor functions, then each lifted feature vector has a length
on the order of 100 million.]
Today we use magic to use those features without computing them!
Observation: In many learning algs,
- the weights can be written as a linear combo of input samples, &
- we can use inner products of Phi(x)'s only => don't need to compute Phi(x)!
T n n
Suppose w = X a = sum a X for some a in R .
i=1 i i
Substitute this identity into alg. and optimize n _dual_weights_ a
(aka _dual_parameters_) instead of d+1 _primal_weights_ w.
Kernel Ridge Regression
-----------------------
_Center_ X and y so their means are zero; e.g. X_i <- X_i - mu_X
This lets us replace I' with I in normal equations:
T T
(X X + lambda I) w = X y
[When we center X and y, the expected value of the intercept w_{d+1} = alpha
is zero. The actual value won't usually be exactly zero, but it will be close
enough that we won't do much harm by penalizing the intercept.]
1 T T T 1
=> w = ------ (X y - X X w) = X a where a = ------ (y - X w)
lambda lambda ^ T
| w = X a
This shows that w is a linear combo of samples. To compute a:
T T -1
lambda a = (y - XX a) => a = (XX + lambda I) y
a is the _dual_solution_; solves the _dual_form_ of ridge regression:
------------------------------------------------------
| T 2 T 2 |
| Find a that minimizes |XX a - y| + lambda |X a| |
------------------------------------------------------
Regression fn is
T T n T
h(z) = w z = a X z = sum a (X z) <= weighted sum of inner products
i=1 i i
T
Let k(x, z) = x z be _kernel_fn_.
[Later, we'll replace x and z with Phi(x) and Phi(z), and that's where the
magic will happen.]
T
Let K = XX be n-by-n _kernel_matrix_. Note K = k(X , X ).
ij i j
K is singular if n > d. [And sometimes otherwise.]
In that case, no solution if lambda = 0. [Then the penalty term is necessary.]
Summary of kernel ridge reg.:
K = k(X , X ) for all i, j <= O(n^2 d) time
ij i j
Solve (K + lambda I) a = y for a <= O(n^3) time
for each test pt z
n
h(z) = sum a k(X , z) <= O(nd) time
i=1 i i
Does not use X directly! Only k. [This will become important soon.]
Dual: solve n-by-n linear system
Primal: solve d-by-d linear system
[So we prefer the dual form when d > n. If we add new features like polynomial
terms, this d goes up, but the d in the kernel running time doesn't.]
The Kernel Trick (aka _kernelization_)
----------------
[Here's the magic part. We will see that we can compute a polynomial kernel
that involves many monomial terms without actually computing those terms.]
T p
The _polynomial_kernel_ of degree p is k(x, z) = (x z + 1)
T p T
Theorem: (x z + 1) = Phi(x) Phi(z) where Phi(x) contains every monomial in x
of degree 0...p.
Example for d = 2, p = 2:
T 2 2 2 2 2
(x z + 1) = x z + x z + 2 x z x z + 2 x z + 2 x z + 1
1 1 2 2 1 1 2 2 1 1 2 2
2 2 _ _ _
= [x x \/2 x x \/2 x \/2 x 1]
1 2 1 2 1 2
2 2 _ _ _ T
[z z \/2 z z \/2 z \/2 z 1]
1 2 1 2 1 2
T
= Phi(x) Phi(z) [This is how we're defining Phi.]
[Notice the factors of sqrt(2). If you try a higher polynomial degree p,
you'll see a wider variety of these constants. We have no control of the
constants used in Phi(x), but they don't matter very much, because the primal
weights w will scale themselves to compensate. Even though we aren't directly
computing the primal weights...they still implicitly exist.]
Key win: compute Phi(x)^T Phi(z) in O(d) time instead of O(d^p),
even though Phi(x) has length O(d^p).
Kernel ridge regression replaces X_i with Phi(X_i):
T
Let k(x, z) = Phi(x) Phi(z)
[I think what we've done here is pretty mind-blowing: we can now do polynomial
regression with an exponentially long, high-order polynomial in less time than
it would take even to compute the final polynomial. The running time is
sublinear, actually much smaller than linear, in the size of the Phi vectors.]