REGRESSION aka Fitting Curves to Data
==========
Classification: given sample x, predict class (often binary)
Regression: given sample x, predict a numerical value
[Classification gives a discrete prediction, whereas regression gives us
a quantitative prediction, usually on a continuous scale.]
[We've already seen an example of regression in Gaussian discriminant analysis.
QDA and LDA don't just give us a classifier; they also give us the probability
that a particular class label is correct. So QDA and LDA implicitly do
regression on probability values.]
- Choose form of regression fn h(x; p) with parameters p (h = _hypothesis_)
- like predictor fn in classification [e.g. linear, quadratic, logistic in x]
- Choose a cost fn (objective fn) to optimize
- usually based on a loss fn; e.g. risk fn = expected loss
Some regression fns:
(1) linear: h(x; w, alpha) = w^T x + alpha
(2) polynomial
[equivalent to linear regression with added polynomial features]
(3) logistic: h(x; w, alpha) = s(w^T x + alpha)
1
recall: logistic fn s(gamma) = -----------
-gamma
1 + e
[This choice is interesting. You'll recall that LDA produces a posterior
probability fn with this equation. So this equation seems to be
a natural form for modeling certain probabilities. If we want to model
probabilities, sometimes we use LDA; but alternatively, we could skip
fitting Gaussians to samples, and instead just directly try to fit
a logistic function to a set of probabilities.]
Some loss fns: let z be prediction h(x); y be true value
(A) L(z, y) = (z - y)^2 _squared_error_
(B) L(z, y) = |z - y| _absolute_error_
(C) L(z, y) = - y ln z - (1 - y) ln (1 - z) _logistic_loss_: y in [0, 1],
z in (0, 1)
Some cost fns to minimize:
1 n
(a) J(h) = - sum L(h(X ), y ) _mean_loss_
n i=1 i i [you can leave out the "1/n"]
n
(b) J(h) = max L(h(X ), y ) _maximum_loss_
i=1 i i
n
(c) J(h) = sum omega_i L(h(X ), y ) _weighted_sum_ [some samples
i=1 i i more important than others]
1 n 2
(d) J(h) = - sum L(h(X ), y ) + lambda |w| l _penalized_/_regularized_
n i=1 i i 2
1 n
(e) J(h) = - sum L(h(X ), y ) + lambda |w| l _penalized_/_regularized_
n i=1 i i l_1 1
Least-squares linear regr.: (1) + (A) + (a) \
Weighted least-squ. linear: (1) + (A) + (c) | quadratic; minimize w/calculus
Ridge regression: (1) + (A) + (d) /
Lasso: (1) + (A) + (e) \ minimize w/gradient descent
Logistic reg.: (3) + (C) + (a) /
Least absolute deviations: (1) + (B) + (a) \ minimize w/linear programming
Chebyshev criterion: (1) + (B) + (b) /
[I have given you several choices of regression fn form, several choices of
loss fn, and several choices of objective fn. These are interchangeable parts
where you can snap one part out and replace it with a different one. But the
optimization algorithm and its speed depend crucially on which parts you pick.
Let's see some examples.]
LEAST-SQUARES LINEAR REGRESSION
===============================
linear regression fn (1) + squared loss fn (A) + cost fn (a).
-----------------------------------------------------------
| n 2 |
| Find w, alpha that minimizes sum (X . w + alpha - y ) |
| i=1 i i |
-----------------------------------------------------------
[Show linear regression example (linregress.pdf)]
Convention: X is n-by-d _design_matrix_ of samples
y is d-vector of dependent scalars.
- - - -
| X_11 X_12 ... | X_1j | ... X_1d | | y_1 |
| X_21 X_22 | X_2j | X_2d | | y_2 |
| ... | | | | |
| --------------------+--------+------------- | T | ... |
| X_i1 X_i2 | X_ij | X_id | < sample X | |
| -----------------------------+------------- | i | |
| ... | | | | |
| X_n1 X_n2 ... | X_nj | X_nd | | y_n |
- - - -
^ ^
feature column X y
*j
Usually n > d.
Recall fictitious dimension trick [from Lecture 3]: replace x . w + alpha with
- -
| w_1 |
[ x_1 x_2 1 ] . | w_2 |
| alpha |
- -
Now X is an n-by-(d+1) matrix; w is a (d+1)-vector.
-----------------------------------
| 2 |
| Find w that minimizes |Xw - y| | = RSS(w), for _residual_sum_of_squares_
-----------------------------------
[We know X and y, but w is unknown; we want to solve for w.]
Optimize by calculus:
T T T T
minimize RSS(w) = w X Xw - 2y Xw + y y
T T
grad RSS = 2X Xw - 2X y = 0
T T
==> X Xw = X y <== the _normal_equations_
\_/\_____/ [w unknown; X & y known]
(d+1)-by-(d+1) (d+1)-vectors
T
If X X is singular, problem is underconstrained [because the samples all lie on
a common hyperplane. Notice that X^T X is always positive semidefinite.]
T -1 T
We use a linear solver to find w = (X X) X y [never actually invert!]
\________/
X^+, the _pseudoinverse_ of X, (d+1)-by-n
[X is usually not square, so it can't have an inverse. However, every X has
a pseudoinverse X^+, and if X has rank d+1, then X^+ is a "left inverse".]
+ T -1 T
Observe: X X = (X X) X X = I <= (d+1)-by-(d+1) [which explains its name]
^ ^ +
Observe: the predicted values of y are y = h(x ) ==> y = Xw = XX y = Hy
i i
+
where H = XX is called the _hat_matrix_ because it puts the hat on y
n-by-n [and if n > d+1, it is not the identity matrix!]
Interpretation as a projection:
^
- y = Xw in R^n is a linear combination of columns of X (one per feature)
- For fixed X, varying w, Xw is subspace of R^n spanned by columns
O y
________|___________ Figure in n-dimensional space (1 dim/sample)
/ _| / NOT d-dimensional feature space (row space of X).
/ | | ^ /
/ ' O y = Xw / <== subspace spanned by X's columns
/ / (at most d+1 dimensions)
---------------------
^ ^
- Minimizing |y - y| finds point y nearest y on subspace
==> projects y onto subspace
[the vertical line is the direction of projection and the error vector]
T
- Error is smallest when line is perpendicular to subspace: X (Xw - y) = 0
==> the normal equations!
- Hat matrix H does the projecting. [Also sometimes called projection matrix.]
Advantages:
- Easy to compute; just solve a linear system.
- Unique, stable solution. [Except when the problem is underconstrained.]
Disadvantages:
- Very sensitive to outliers, because error is squared!
- Fails if X^T X is singular.
[Least-squares linear regression was apparently first posed and solved in 1801
by the great mathematician Carl Friedrich Gauss, who used least squares to
predict the trajectory of the planetoid Ceres. A paper he wrote on the topic
is regarded as the birth of modern linear algebra.]
LOGISTIC REGRESSION (David Cox, 1958)
===================
logistic regression fn (3) + logistic loss fn (C) + cost fn (a).
Fits "probabilities" in range (0, 1).
Usually used for classification. The input y_i's *can* be probabilities,
but in most applications they're all 0 or 1.
QDA, LDA: generative models
logistic regression: discriminative model
With X and w including the fictitious dimension; alpha is w's last component...
-------------------------------------------------------------
| Find w that maximizes |
| |
| n / \ |
| J = sum | y ln s(X . w) + (1 - y ) ln (1 - s(X . w)) | |
| i=1 \ i i i i / |
-------------------------------------------------------------
[Note that we are maximizing, not minimizing, this function because
I've flipped the sign of the logistic loss (C).]
[Show log loss plots (logloss0.pdf, logloss.7.pdf)]
-J(w) is convex! [J is "concave".] Solve by gradient ascent.
[To do gradient ascent, we'll need to compute some derivatives.]
-gamma
d 1 e
s'(gamma) = ------- ----------- = --------------
d gamma -gamma -gamma 2
1 + e (1 + e )
= s(gamma) (1 - s(gamma)) [Show s' plot (dlogistic.pdf)]
Let s = s(X . w)
i i
/ y_i 1 - y_i \
grad J = sum | --- grad s_i - ------- grad s_i |
w \ s_i 1 - s_i /
/ y_i 1 - y_i \
= sum | --- - ------- | s (1 - s ) X
\ s_i 1 - s_i / i i i
= sum (y - s ) X
i i i
n
Gradient ascent rule: w <- w + epsilon sum (y - s(X . w)) X
i=1 i i i
Stochastic gradient ascent: w <- w + epsilon (y - s(X . w)) X
i i i
Works best if we shuffle samples in random order, process one by one.
For very large n, sometimes converges before we visit all samples!
[This looks a lot like the perceptron learning rule. The only difference is
that the "- s_i" part is new.]
Starting from w = 0 works well in practice.
[Show mwascom's logistic regression plot (problogistic.png) from
http://stackoverflow.com/questions/28256058/plotting-decision-boundary-of-logistic-regression]