WEIGHTED LEAST-SQUARES REGRESSION
=================================
linear regression fn (1) + squared loss fn (A) + cost fn (c).
[The idea of weighted least-squares is that some samples might be more trusted
than others, or there might be certain samples you want to fit particularly
well. So you assign those more trusted samples a higher weight. If you
suspect some samples of being outliers, you can assign them a lower weight.]
Assign each sample a weight omega_i; collect them in nxn diagonal matrix Omega.
^ ^
Greater omega_i -> work harder to minimize |y_i - y_i| recall: y = Xw
--------------------------------------------------
| T | n 2
| Find w that minimizes (Xw - y) Omega (Xw - y) | = sum omega ((Xw) - y )
-------------------------------------------------- i=1 i i i
T T
Solve for w in normal equations: X Omega Xw = X Omega y
[Once again, you can interpret it as a projection:]
1/2 ^ 1/2
Note: Omega y is orthogonal projection of Omega y onto
1/2
subspace spanned by columns of Omega X.
[If you stretch the n-dimensional space by applying the linear transformation
Omega^1/2, it's an orthogonal projection in that stretched space.]
NEWTON'S METHOD
===============
Iterative optimization method for smooth fn J(w).
Often much faster than gradient descent. [We'll use for logistic regression.]
Idea: You're at point v. Approximate J(w) near v by quadratic fn.
Jump to its unique critical pt. Repeat until bored.
[Show method in 1D (newton1.pdf--newton3.pdf); in 2D (newton2D.png).]
Taylor series about v:
2 2
grad J(w) = grad J(v) + (grad J(v)) (w - v) + O(|w - v| )
where grad^2 J(v) is the _Hessian_matrix_ of J(w) at v.
Find critical pt w by setting grad J(w) = 0:
2 -1
w = v - (grad J(v)) grad J(v)
[This is an iterative update rule you can repeat until it converges to
a solution. As usual, we don't really want to compute a matrix inverse.]
Newton's method:
pick starting point w
repeat until convergence 2
e <- solution to linear system (grad J(w)) e = - grad J(w)
w <- w + e
Warning: Doesn't know difference between minima, maxima, saddle points.
Starting point must be "close enough" to desired solution.
[If the objective function J is actually quadratic, Newton's method needs only
one step to find the correct answer. The closer J is to quadratic, the faster
Newton's method tends to converge.]
[Newton's method is superior to blind gradient descent for some optimization
problems for several reasons. First, it tries to find the right step length
to reach the minimum, rather than just walking an arbitrary distance downhill.
Second, rather than follow the direction of steepest descent, it tries to
optimize all directions at once.]
[Nevertheless, it has some major disadvantages. The biggest one is that
computing the Hessian can be quite expensive, and it has to be recomputed
every iteration. It can work well for low-dimensional feature spaces, but
you would never use it for a neural net, because there are too many weights.
Newton's method also doesn't work for most nonsmooth functions. It
particularly fails for the perceptron risk function, whose Hessian is zero,
except where it's not even defined.]
LOGISTIC REGRESSSION (continued)
====================
Recall: s'(gamma) = s(gamma) (1 - s(gamma)) s = s(X . w)
i i
n T
grad J(w) = sum (y - s ) X = X (y - s)
w i=1 i i i
where s is n-vector with components s_i
[Now we work out the Hessian too, so we can use Newton's method.]
2 n T T
grad J(w) = - sum s (1 - s ) X X = - X Omega X
w i=1 i i i i
where Omega is diagonal matrix with components s (1 - s )
i i
T
Omega is +ve definite for all w, so X Omega X is +ve semidefinite,
so -J(w) is convex.
[Therefore, Newton's method finds an optimal point if it converges at all.]
Newton's method:
T T
Solve for e in normal equations: (X Omega X) e = X (y - s)
w <- w + e ^ ^
repeat until "convergence" Remember: Omega, s are fns of w
[Notice that this looks a lot like weighted least squares, but the weight
matrix Omega and the right-hand-side vector y-s change every iteration.]
An example of _iteratively_reweighted_least_squares_.
Omega prioritizes samples with s_i near 0.5; tunes out samples near 0/1.
[In other words, samples near the decision boundary have the biggest effect on
the iterations. Meanwhile, the iterations move the decision boundary; in
turn, that movement may change which samples have the most influence. In the
end, only the samples near the decision boundary make a big contribution to
the logistic fit.]
Idea: If n very large, save time by using a random subsample of the samples
per iteration. Increase sample size as you go.
[The principle is that the first iteration isn't going to take you all the way
to the optimal point, so why waste time looking at all the samples? Whereas
the last iteration should be the most accurate one.]
LDA vs. logistic regression
---------------------------
Advantages of LDA:
- For well-separated classes, LDA stable; log. reg. surprisingly unstable
- > 2 classes easy & elegant, log. reg. needs modifying ("softmax regression")
- Slightly more accurate when classes nearly normal, especially if n is small
Advantages of log. reg.:
- More emphasis on decision boundary [though not as much as SVMs]
[Show figure of log.reg. vs. LDA for tight boundary (logregvsLDAauni.pdf)]
- Hence less sensitive to outliers
- Easy & elegant treatment of "partial" class membership; LDA is all-or-nothing
- More robust on some non-Gaussian distributions (e.g. large skew)
ROC CURVES (for test sets)
==========
[Show ROC curve (ROC.pdf).]
[This is a _ROC_curve_. That stands for _receiver_operating_characteristics_,
which is an awful name but we're stuck with it for historical reasons.
It is made by running a classifier on the test set or validation set.
It shows the rate of false positives vs. true positives for a range of
settings.
We assume there is a knob we can turn to trade off these two types of error;
in this case, that knob is the probability threshold for Gaussian discriminant
analysis or linear regression.
However, neither axis is the probability threshold.]
x-axis: "false positive rate = % of -ve classified as +ve"
y-axis: "true positive rate = % of +ve classified as +ve aka _sensitivity_"
"false negative rate": vertical distance from curve to top
"_specificity_": horizontal distance from curve to right
[You generate this curve by trying *every* probability threshold; for each
threshold, measure the false positive & true positive rates and plot a point.]
upper right corner: "always classify +ve (Pr >= 0)"
lower left corner: "always classify -ve (Pr > 1)"
diagonal: "random classifiers"
[A rough measure of a classifier's effectiveness is the area under the curve.
For a classifier that is always correct and never assigns a probability
between 0 and 1, the area under the curve is one.
[IMPORTANT: In practice, the trade-off between false negatives and false
positives is usually negotiated by choosing a point on this plot, based on real
test data, and NOT by taking the choice of threshold that's best in theory.]
LEAST-SQUARES POLYNOMIAL REGRESSION
===================================
Replace each X with feature vector Phi(X ) with all terms of degree 1...p
i i
2 2 T
e.g. Phi(X ) = [ X X X X X X 1 ]
i i1 i1 i2 i2 i1 i2
[Notice that we've added the fictitious dimension "1" here, so we don't need
to add it again. This basis covers all polynomials quadratic in X_i1, X_i2.]
Can also use non-polynomial features (e.g. edge detectors).
Otherwise just like linear or logistic regression.
Log. reg. + quadratic features = same logistic posteriors as QDA.
Very easy to overfit! [Show overfits (overunder.png; UScensusquartic.png)]