ANISOTROPIC GAUSSIANS
=====================
X ~ N(mu, Sigma)    <== X is a random d-vector with mean mu
P(x) = 1 / (sqrt(2 pi)^d sqrt(|Sigma|)) exp(- (1/2) (x - mu)^T Sigma^-1 (x - mu))
                              ^ determinant of Sigma
Sigma is the d-by-d SPD _covariance_matrix_.
Sigma^-1 is the d-by-d SPD _precision_matrix_; serves as metric tensor.
Write P(x) = n(q(x)), where q(x) = (x - mu)^T Sigma^-1 (x - mu)
                        ^                ^
           n : R -> R, exponential   q : R^d -> R, quadratic
[Now q(x) is a function we understand; it's just a quadratic bowl centered at
mu whose curvature is represented by the metric tensor Sigma^1. q(x) is the
squared distance from mu to x under this metric. The other function n(.) is
like a 1D Gaussian with a different normalization factor. It is helpful to
understand that this mapping n(.) does not change the isosurfaces.]
Principle:  given f : R -> R, isosurfaces of f(q(x)) are same as q(x)
(different isovalues), except that some might be "combined"
[if f maps them to the same value]
[Show example of paraboloid q(x) and bivariate Gaussian n(q(x)), w/univariate
Gaussian n(.) between them.]
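To make the decomposition concrete, here is a minimal numerical sketch (assuming numpy; the function name gaussian_pdf is mine) that evaluates P(x) literally as n(q(x)):

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Anisotropic Gaussian density, written as n(q(x))."""
    d = len(mu)
    precision = np.linalg.inv(Sigma)      # metric tensor Sigma^-1
    diff = x - mu
    q = diff @ precision @ diff           # squared distance from mu under the metric
    # n : R -> R is exponential, like a 1-D Gaussian with a different normalization
    return np.exp(-0.5 * q) / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))

mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
print(gaussian_pdf(np.array([1.5, 1.0]), mu, Sigma))
```

Because n(.) is monotonic, any two points with the same q value get the same density, which is exactly the isosurface principle above.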
_covariance_:  Cov(X, Y) = E[(X - E[X]) (Y - E[Y])^T] = E[X Y^T] - mu_X mu_Y^T
               Var(X) = Cov(X, X)
[These two definitions hold for both vectors and scalars.]
For a Gaussian, one can show Var(X) = Sigma.
[...by integrating the expectation in anisotropic spherical coordinates.]
Hence

        [   Var(X_1)     Cov(X_1, X_2)  ...  Cov(X_1, X_d) ]
Sigma = [ Cov(X_2, X_1)    Var(X_2)          Cov(X_2, X_d) ]
        [     ...                     ...        ...       ]
        [ Cov(X_d, X_1)  Cov(X_d, X_2)  ...    Var(X_d)    ]
[An important point is that statisticians didn't just arbitrarily decide to
call this thing a covariance matrix. Rather, statisticians discovered that
if you find the covariance of the normal distribution by integration, it turns
out that each covariance happens to be an entry in the matrix inverse to the
metric tensor. This is a happy discovery; it's rather elegant.]
[Observe that Sigma is symmetric, as Cov(X_i, X_j) = Cov(X_j, X_i).]
X_i, X_j independent ==> Cov(X_i, X_j) = 0
Cov(X_i, X_j) = 0 AND joint normal distribution ==> X_i, X_j independent
all features pairwise independent ==> Sigma is diagonal
Sigma is diagonal <==> axis-aligned Gaussian; squared radii on the diagonal
<==> P(X) = P(X_1) P(X_2) ... P(X_d)
\__/ \______________________/
multivariate each univariate
[So when the features are independent, you can write the multivariate Gaussian
as a product of univariate Gaussians. When they aren't, you can do a change
of coordinates to the eigenvector coordinate system, and write it as a product
of univariate Gaussians in those coordinates.]
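A quick numerical check of the diagonal case (numpy assumed; the helper names are mine): with a diagonal Sigma, the joint density equals the product of the univariate densities.

```python
import numpy as np

def univariate_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def multivariate_pdf(x, mu, Sigma):
    d = len(mu)
    diff = x - mu
    q = diff @ np.linalg.inv(Sigma) @ diff
    return np.exp(-0.5 * q) / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))

mu = np.array([0.0, 1.0, -2.0])
Sigma = np.diag([1.0, 4.0, 0.25])   # axis-aligned: squared radii on the diagonal
x = np.array([0.5, 2.0, -1.5])

joint = multivariate_pdf(x, mu, Sigma)
product = np.prod([univariate_pdf(x[i], mu[i], Sigma[i, i]) for i in range(3)])
```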
[It's tricky to keep track of the relationships between the matrices, so here's
a handy chart.]
      covariance                  precision
        matrix                     matrix
                                  (metric)

                     inverse
          Sigma   <--------->   M = Sigma^-1
          |   ^                     |   ^
   square |   | square       square |   | square
    root  |   |               root  |   |
          v   |                     v   |
                     inverse
        Sigma^1/2 <--------->   A = Sigma^-1/2
   (maps spheres to            (maps ellipsoids
     ellipsoids)                 to spheres)

   eigenvalues of Sigma^1/2 are the ellipsoid radii
     (standard deviations along the eigenvectors)
   eigenvalues of Sigma are the variances along the eigenvectors

Diagonalizing:  Sigma = V Lambda V^T        Sigma^1/2 = V Lambda^1/2 V^T
[Remember that all four of these matrices have the same eigenvectors V.
Remember that when you take the inverse or square or square root of an SPD
matrix, you do the same to its eigenvalues. So the ellipsoid radii, being the
eigenvalues of Sigma^1/2, are the square roots of the eigenvalues of Sigma;
moreover, they are the inverse square roots of the precision matrix (metric).]
[Keep this chart handy when you do Homework 3.]
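The chart's relationships are easy to verify with a single eigendecomposition (a sketch assuming numpy; all variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((3, 3))
Sigma = B @ B.T + 3 * np.eye(3)     # an SPD covariance matrix

lam, V = np.linalg.eigh(Sigma)      # Sigma = V Lambda V^T

# All four matrices in the chart share the eigenvectors V; only the
# eigenvalues change: lam, 1/lam, sqrt(lam), 1/sqrt(lam).
Sigma_half = V @ np.diag(np.sqrt(lam)) @ V.T       # maps spheres to ellipsoids
precision  = V @ np.diag(1 / lam) @ V.T            # metric tensor M = Sigma^-1
A          = V @ np.diag(1 / np.sqrt(lam)) @ V.T   # Sigma^-1/2: ellipsoids -> spheres
```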
Maximum likelihood estimation for anisotropic Gaussians
-------------------------------------------------------
Given samples x_1, ..., x_n and classes y_1, ..., y_n, find bestfit Gaussians.
[Once again, we want to fit the Gaussian that maximizes the likelihood of
generating the samples in a specified class. This time I won't derive the
maximumlikelihood Gaussian; I'll just tell you the answer.]
For QDA:

  Sigma-hat_C = (1/n_C)  sum_{i: y_i = C}  (x_i - mu_C) (x_i - mu_C)^T
                                           \_________________________/
                                              outer product matrix
      <== _conditional_covariance_ for samples in class C

  [Show example maxlike.jpg]
Priors pi_C, means mu_C: same as before
[Priors are class samples / total samples; means are perclass sample means]
Sigma-hat_C is positive semidefinite, but not always definite!
If some zero eigenvalues, must eliminate the zerovariance dimensions
(eigenvectors).
[I'm not going to discuss how to do that today, but it involves projecting the
samples onto a subspace along the eigenvectors with eigenvalue zero.]
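A sketch of the QDA fitting step (numpy assumed; qda_fit is my name, not a standard API): for each class, estimate the prior, mean, and conditional covariance by the formulas above.

```python
import numpy as np

def qda_fit(X, y):
    """Max-likelihood prior, mean, and conditional covariance for each class."""
    params = {}
    n = len(y)
    for C in np.unique(y):
        Xc = X[y == C]
        mu = Xc.mean(axis=0)                 # per-class sample mean
        diff = Xc - mu
        Sigma = diff.T @ diff / len(Xc)      # (1/n_C) sum of outer products
        params[C] = (len(Xc) / n, mu, Sigma) # (pi_C, mu_C, Sigma-hat_C)
    return params
```

Each Sigma-hat_C here is symmetric and positive semidefinite by construction, but with fewer than d + 1 samples in a class it will be singular.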
For LDA:

  Sigma-hat = (1/n)  sum_C  sum_{i: y_i = C}  (x_i - mu_C) (x_i - mu_C)^T
      <== _pooled_within-class_covariance_matrix_
[Notice that although we're computing one covariance matrix for all the data,
each sample contributes with respect to *its*own*class's*mean*. This gives
a very different result than if you simply compute one covariance matrix for
all the samples using the global mean! In the former case, the variances are
typically smaller.]
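The pooled within-class covariance can be sketched the same way (numpy assumed; the function name is mine). Note that each sample is centered at its own class's mean, not the global mean:

```python
import numpy as np

def pooled_covariance(X, y):
    """LDA: one covariance matrix, each sample centered at its own class mean."""
    n, d = X.shape
    Sigma = np.zeros((d, d))
    for C in np.unique(y):
        diff = X[y == C] - X[y == C].mean(axis=0)  # *its* class's mean
        Sigma += diff.T @ diff                     # within-class outer products
    return Sigma / n
```

Subtracting class means removes the between-class scatter, which is why the pooled variances are typically smaller than variances computed with the global mean.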
[Let's revisit QDA and LDA and see what has changed now that we know
anisotropic Gaussians. The short answer is "not much has changed".
By the way, capital X is once again a random variable.]
QDA
---
pi_C, mu_C, Sigma_C may be different for each class C.
Choosing C that maximizes P(X = x | Y = C) pi_C is equivalent to maximizing the
_quadratic_discriminant_fn_

  Q_C(x) = ln(sqrt(2 pi)^d P(x) pi_C) = - q_C(x)/2 - (ln |Sigma_C|)/2 + ln pi_C
                            ^
                      Gaussian for C              [works for any # of classes]
2 classes:  Prediction fn Q_C(x) - Q_D(x) is quadratic, but may be indefinite
            ==> Bayes decision boundary is a quadric.
            Posterior is P(Y = C | X = x) = s(Q_C(x) - Q_D(x))
            where s(.) is the logistic fn.
[Show example where decision boundary is a hyperbola: two anisotropic
Gaussians, Q_C - Q_D, logistic fn, posterior. Show anisotropic Voronoi
diagram. Separate "whiteboard".]
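As a sketch (numpy assumed; function names are mine), the quadratic discriminant and the two-class posterior look like:

```python
import numpy as np

def quadratic_discriminant(x, pi_C, mu_C, Sigma_C):
    """Q_C(x) = -q_C(x)/2 - ln|Sigma_C|/2 + ln pi_C."""
    diff = x - mu_C
    q = diff @ np.linalg.inv(Sigma_C) @ diff
    return -0.5 * q - 0.5 * np.log(np.linalg.det(Sigma_C)) + np.log(pi_C)

def qda_posterior(x, paramsC, paramsD):
    """Two-class posterior: logistic fn applied to Q_C(x) - Q_D(x)."""
    QC = quadratic_discriminant(x, *paramsC)
    QD = quadratic_discriminant(x, *paramsD)
    return 1.0 / (1.0 + np.exp(-(QC - QD)))   # s(Q_C(x) - Q_D(x))
```

Prediction picks the class with the largest Q_C(x); this works for any number of classes, while the logistic posterior form is specific to two classes.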
LDA
---
One Sigma for all classes.
[Once again, the quadratic terms cancel each other out so the predictor
function is linear and the decision boundary is a hyperplane.]
  Q_C(x) - Q_D(x) =

      (mu_C - mu_D)^T Sigma^-1 x
      \_________________________/
                w^T x
    - (mu_C^T Sigma^-1 mu_C - mu_D^T Sigma^-1 mu_D)/2 + ln pi_C - ln pi_D
    \____________________________________________________________________/
                                + alpha
Choose class C that maximizes the _linear_discriminant_fn_

  mu_C^T Sigma^-1 x - (1/2) mu_C^T Sigma^-1 mu_C + ln pi_C
                                            [works for any # of classes]
2 classes:  Decision boundary is w^T x + alpha = 0
            Posterior is P(Y = C | X = x) = s(w^T x + alpha)
[Show example where decision boundary is a line: two anisotropic Gaussians,
Q_C - Q_D, logistic fn, posterior. Show ESL, Figure 4.11 (LDAdata.pdf):
example of LDA with messy data. Separate "whiteboard".]
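Collecting the terms above into w and alpha (a numpy sketch; the names are mine):

```python
import numpy as np

def lda_two_class(pi_C, mu_C, pi_D, mu_D, Sigma):
    """Reduce Q_C - Q_D to w^T x + alpha (the quadratic terms cancel)."""
    prec = np.linalg.inv(Sigma)                      # shared precision matrix
    w = prec @ (mu_C - mu_D)                         # normal vector of boundary
    alpha = (-0.5 * (mu_C @ prec @ mu_C - mu_D @ prec @ mu_D)
             + np.log(pi_C) - np.log(pi_D))
    return w, alpha

# Posterior for class C is then s(w^T x + alpha), with s the logistic fn.
```

With equal priors, the boundary w^T x + alpha = 0 passes through the midpoint of the two means, as expected from symmetry.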
Notes:
- Changing prior pi_C (or loss) is easy: adjust alpha.
- LDA often interpreted as projecting samples onto normal vector w;
  cutting the line in half.
  [Show projection onto normal vector (project.png)]
- For 2 classes, LDA has d + 1 parameters (w, alpha);
  QDA has d (d + 3) / 2 + 1 parameters;
  QDA more likely to overfit.
  [Show comparison (ldaqda.pdf)]
- With features, LDA can give nonlinear boundaries; QDA can give nonquadratic.
- We don't get *true* optimum Bayes classifier
  - estimate distributions from finite data
  - real-world data not perfectly Gaussian
- Posterior gives decision boundaries for 10% confidence, 50%, 90%, etc.
- Choosing isovalue = probability p is same as choosing asymmetrical loss
  p for false positive, 1 - p for false negative.
[LDA & QDA are best in practice for many applications. In the STATLOG project,
LDA or QDA were in the top three classifiers for 10 out of 22 datasets. But it's
not because all those datasets are Gaussian. LDA & QDA work well when the
data can only support simple decision boundaries such as linear or quadratic,
because Gaussian models provide stable estimates.] [ESL, Section 4.3]