ANISOTROPIC GAUSSIANS
=====================
  X ~ N(mu, Sigma)                     <==  X is a random d-vector with mean mu

                     1                      1         T      -1
  P(x) = -------------------------- exp ( - - (x - mu)  Sigma   (x - mu) )
         sqrt(2 pi)^d sqrt(|Sigma|)         2
                           ^ determinant of Sigma

Sigma is the d-by-d SPD _covariance_matrix_.
Sigma^-1 is the d-by-d SPD _precision_matrix_; serves as metric tensor.

                                           T      -1
Write P(x) = n(q(x)), where q(x) = (x - mu)  Sigma   (x - mu)
             ^              ^
     R->R, exponential   R^d -> R, quadratic

[Now q(x) is a function we understand--it's just a quadratic bowl centered at
 mu whose curvature is represented by the metric tensor Sigma^-1.  q(x) is the
 squared distance from mu to x under this metric.  The other function n(.) is
 like a 1D Gaussian with a different normalization factor.  It is helpful to
 understand that this mapping n(.) does not change the isosurfaces.]

Principle:  given f: R -> R, the isosurfaces of f(q(x)) are the same as the
            isosurfaces of q(x) (with different isovalues), except that some
            might be "combined" [if f maps them to the same value]

[Show example of paraboloid q(x) and bivariate Gaussian n(q(x)), w/univariate
 Gaussian n(.) between them.]
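
[Here's a minimal NumPy sketch, not part of the notes proper, of computing P(x)
 as n(q(x)).  The function names quadratic_form and gaussian_pdf are just
 illustrative.]

  import numpy as np

  def quadratic_form(x, mu, Sigma):
      # q(x) = (x - mu)^T Sigma^-1 (x - mu):  squared distance from mu to x
      # under the metric Sigma^-1
      diff = x - mu
      return diff @ np.linalg.solve(Sigma, diff)

  def gaussian_pdf(x, mu, Sigma):
      # P(x) = n(q(x)), where n(q) = exp(-q/2) / (sqrt(2 pi)^d sqrt(|Sigma|))
      d = len(mu)
      norm = np.sqrt(2 * np.pi) ** d * np.sqrt(np.linalg.det(Sigma))
      return np.exp(-0.5 * quadratic_form(x, mu, Sigma)) / norm

  mu = np.array([1.0, 2.0])
  Sigma = np.array([[2.0, 0.5],
                    [0.5, 1.0]])
  print(gaussian_pdf(np.array([0.0, 0.0]), mu, Sigma))
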
                                                  T        T          T
_covariance_:  Cov(X, Y) = E[(X - E[X]) (Y - E[Y]) ] = E[XY ] - mu  mu
               Var(X) = Cov(X, X)                                 X   Y
               [These two definitions hold for both vectors and scalars.]

For a Gaussian, one can show Var(X) = Sigma.
[...by integrating the expectation in anisotropic spherical coordinates.]
Hence
          -                                                        -
          |   Var(X_1)       Cov(X_1, X_2)    ...    Cov(X_1, X_d) |
  Sigma = | Cov(X_2, X_1)      Var(X_2)              Cov(X_2, X_d) |
          |      ...                                      ...      |
          | Cov(X_d, X_1)    Cov(X_d, X_2)    ...      Var(X_d)    |
          -                                                        -

[An important point is that statisticians didn't just arbitrarily decide to
 call this thing a covariance matrix.  Rather, statisticians discovered that
 if you find the covariance of the normal distribution by integration, it turns
 out that each covariance happens to be an entry in the matrix inverse to the
 metric tensor.  This is a happy discovery; it's rather elegant.] 

[Observe that Sigma is symmetric, as Cov(X_i, X_j) = Cov(X_j, X_i).]
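
[A quick sanity check in NumPy, not from the notes:  build Sigma entry by
 entry from the covariances and compare with np.cov.  The random data are only
 for illustration.]

  import numpy as np

  rng = np.random.default_rng(0)
  X = rng.standard_normal((1000, 3))     # n = 1000 samples, d = 3 features
  mu = X.mean(axis=0)

  d = X.shape[1]
  Sigma = np.empty((d, d))
  for i in range(d):
      for j in range(d):
          # sample estimate of Cov(X_i, X_j) = E[(X_i - mu_i) (X_j - mu_j)]
          Sigma[i, j] = np.mean((X[:, i] - mu[i]) * (X[:, j] - mu[j]))

  # np.cov divides by n - 1 by default; bias=True divides by n to match above
  assert np.allclose(Sigma, np.cov(X, rowvar=False, bias=True))
  assert np.allclose(Sigma, Sigma.T)     # Cov(X_i, X_j) = Cov(X_j, X_i)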

X_i, X_j independent   ==>   Cov(X_i, X_j) = 0
Cov(X_i, X_j) = 0  AND  joint normal distribution   ==>   X_i, X_j independent
all features pairwise independent   ==>   Sigma is diagonal
Sigma is diagonal   <==>   axis-aligned Gaussian; squared radii on the diagonal
                    <==>   P(X) = P(X_1) P(X_2) ... P(X_d)
                           \__/   \______________________/
                     multivariate     each univariate

[So when the features are independent, you can write the multivariate Gaussian
 as a product of univariate Gaussians.  When they aren't, you can do a change
 of coordinates to the eigenvector coordinate system, and write it as a product
 of univariate Gaussians in those coordinates.]
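
[A quick numerical check of the first claim, assuming SciPy is available
 (again, not part of the notes):  for a diagonal Sigma, the multivariate
 density factors into univariate densities.]

  import numpy as np
  from scipy.stats import multivariate_normal, norm

  mu = np.array([0.0, 1.0, -2.0])
  Sigma = np.diag([1.0, 4.0, 0.25])      # axis-aligned:  variances on the diagonal
  x = np.array([0.5, 0.0, -1.5])

  joint = multivariate_normal(mean=mu, cov=Sigma).pdf(x)
  product = np.prod([norm(mu[i], np.sqrt(Sigma[i, i])).pdf(x[i]) for i in range(3)])
  assert np.isclose(joint, product)      # P(X) = P(X_1) P(X_2) P(X_3)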

[It's tricky to keep track of the relationships between the matrices, so here's
 a handy chart.]

            covariance                 precision
              matrix                    matrix
                                       (metric)
                           inverse              -1
     ----->   Sigma     <----------->  M = Sigma
    /          ^                          ^
    |          | |                        | |
    |   square | |                 square | |
    |          | | root                   | | root
    |          | |                        | |
    |            v                          v
    |             1/2      inverse              -1/2
    |        Sigma      <----------->  A = Sigma
    |   (maps spheres to             (maps ellipsoids
    |      ellipsoids)                  to spheres)
    |
    |          ^                     1/2
    |          | eigenvalues of Sigma    are ellipsoid radii
    |            (standard deviations along the eigenvectors) ---\
    |                                                            |
    eigenvalues of Sigma are variances along the eigenvectors    |
                                |                                v
                                v     T                1/2           1/2  T
    Diagonalizing:  Sigma = V Lambda V            Sigma    = V Lambda    V

[Remember that all four of these matrices have the same eigenvectors V.
 Remember that when you take the inverse or square or square root of an SPD
 matrix, you do the same to its eigenvalues.  So the ellipsoid radii, being the
 eigenvalues of Sigma^1/2, are the square roots of the eigenvalues of Sigma;
 moreover, they are the inverse square roots of the precision matrix (metric).]

[Keep this chart handy when you do Homework 3.]
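
[The chart is easy to reproduce in NumPy; here's a hedged sketch with made-up
 numbers.  All four matrices come from the same eigendecomposition.]

  import numpy as np

  Sigma = np.array([[2.0, 0.8],
                    [0.8, 1.0]])

  lam, V = np.linalg.eigh(Sigma)                     # Sigma = V Lambda V^T
  Sigma_sqrt     = V @ np.diag(lam ** 0.5)  @ V.T    # maps spheres to ellipsoids
  precision      = V @ np.diag(1.0 / lam)   @ V.T    # M = Sigma^-1  (the metric)
  Sigma_inv_sqrt = V @ np.diag(lam ** -0.5) @ V.T    # A = Sigma^-1/2 (ellipsoids to spheres)

  assert np.allclose(Sigma_sqrt @ Sigma_sqrt, Sigma)
  assert np.allclose(precision, np.linalg.inv(Sigma))
  assert np.allclose(Sigma_inv_sqrt, np.linalg.inv(Sigma_sqrt))

  print(np.sqrt(lam))   # ellipsoid radii = standard deviations along the eigenvectors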


Maximum likelihood estimation for anisotropic Gaussians
-------------------------------------------------------
Given samples x_1, ..., x_n and classes y_1, ..., y_n, find best-fit Gaussians.

[Once again, we want to fit the Gaussian that maximizes the likelihood of
 generating the samples in a specified class.  This time I won't derive the
 maximum-likelihood Gaussian; I'll just tell you the answer.]

For QDA:

    ^      1                                  T
  Sigma  = -     sum     (x  - mu ) (x  - mu )     <== _conditional_covariance_
       C   n  {i:y  = C}   i     C    i     C          for samples in class C
            C     i      \____________________/
                          outer product matrix       [Show example maxlike.jpg]

Priors pi_C, means mu_C:  same as before
[Priors are class samples / total samples; means are per-class sample means]
  ^
Sigma_C is positive semidefinite, but not always definite!
If some zero eigenvalues, must eliminate the zero-variance dimensions
(eigenvectors).

[I'm not going to discuss how to do that today, but it involves projecting the
 samples onto a subspace along the eigenvectors with eigenvalue zero.]
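
[A sketch of these estimates in NumPy, not a reference implementation; the
 function name qda_fit is mine, and y is assumed to be an array of class
 labels.]

  import numpy as np

  def qda_fit(X, y):
      # returns {class C: (pi_C, mu_C, Sigma_C)} for samples X (n x d), labels y
      n = len(y)
      params = {}
      for C in np.unique(y):
          Xc = X[y == C]
          pi_C = len(Xc) / n                  # prior:  class samples / total samples
          mu_C = Xc.mean(axis=0)              # per-class sample mean
          diff = Xc - mu_C
          Sigma_C = diff.T @ diff / len(Xc)   # mean outer product (x_i - mu_C)(x_i - mu_C)^T
          params[C] = (pi_C, mu_C, Sigma_C)
      return params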

For LDA:

    ^      1                                     T
  Sigma  = - sum    sum     (x  - mu ) (x  - mu )     <== _pooled_within-class_
           n  C  {i:y  = C}   i     C    i     C          _covariance_matrix_
                     i

[Notice that although we're computing one covariance matrix for all the data,
 each sample contributes with respect to *its*own*class's*mean*.  This gives
 a very different result than if you simply compute one covariance matrix for
 all the samples using the global mean!  In the former case, the variances are
 typically smaller.]
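
[And a matching sketch of the pooled within-class covariance; the name
 lda_pooled_covariance is mine.]

  import numpy as np

  def lda_pooled_covariance(X, y):
      # each sample is centered at *its own* class's mean, then all the outer
      # products are averaged over the total sample count n
      n, d = X.shape
      Sigma = np.zeros((d, d))
      for C in np.unique(y):
          Xc = X[y == C]
          diff = Xc - Xc.mean(axis=0)         # center at the class mean mu_C
          Sigma += diff.T @ diff              # sum of outer products for class C
      return Sigma / n

  # contrast with the covariance about the global mean, typically larger:
  #   np.cov(X, rowvar=False, bias=True)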


[Let's revisit QDA and LDA and see what has changed now that we know
 anisotropic Gaussians.  The short answer is "not much has changed".
 By the way, capital X is once again a random variable.]

QDA
---
pi , mu , Sigma  may be different for each class C.
  C    C       C

Choosing C that maximizes P(X = x | Y = C) pi  is equivalent to maximizing
the _quadratic_discriminant_fn_              C
                        d               1         1
  Q (x) = ln (sqrt(2 pi)  P(x) pi ) = - - q (x) - - ln |Sigma | + ln pi
   C                      ^      C      2  C      2          C         C
                          |
                    Gaussian for C                 [works for any # of classes]

2 classes:  Prediction fn Q (x) - Q (x) is quadratic, but may be indefinite
                           C       D
            ==>  Bayes decision boundary is a quadric.

            Posterior is P(Y = C | X = x) = s(Q (x) - Q (x))
                                               C       D
            where s(.) is logistic fn

[Show example where decision boundary is a hyperbola:  two anisotropic
 Gaussians, Q_C - Q_D, logistic fn, posterior.  Show anisotropic Voronoi
 diagram.  Separate "whiteboard".]
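
[A minimal sketch of classifying with Q_C(x), using the per-class estimates
 from the qda_fit sketch above; quadratic_discriminant and qda_predict are my
 names, not a library API.]

  import numpy as np

  def quadratic_discriminant(x, pi_C, mu_C, Sigma_C):
      # Q_C(x) = - q_C(x) / 2 - ln |Sigma_C| / 2 + ln pi_C
      diff = x - mu_C
      q = diff @ np.linalg.solve(Sigma_C, diff)    # q_C(x)
      _, logdet = np.linalg.slogdet(Sigma_C)       # ln |Sigma_C|, computed stably
      return -0.5 * q - 0.5 * logdet + np.log(pi_C)

  def qda_predict(x, params):
      # params = {class C: (pi_C, mu_C, Sigma_C)}; pick the class maximizing Q_C(x)
      return max(params, key=lambda C: quadratic_discriminant(x, *params[C]))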

LDA
---
One Sigma for all classes.
[Once again, the quadratic terms cancel each other out so the predictor
 function is linear and the decision boundary is a hyperplane.]

Q (x) - Q (x) =
 C       D
                           T       -1          T       -1
           T      -1     mu_C Sigma   mu_C - mu_D Sigma   mu_D
(mu  - mu )  Sigma   x - ------------------------------------- + ln pi  - ln pi
   C     D                                 2                          C        D
\____________________/ \_______________________________________________________/
          T
         w x          +                          alpha

Choose class C that maximizes the _linear_discriminant_fn_

    T      -1     1   T      -1
  mu  Sigma   x - - mu  Sigma   mu  + ln pi        [works for any # of classes]
    C             2   C           C        C

                                  T
2 classes:  Decision boundary is w x + alpha = 0
                                               T
            Posterior is P(Y = C | X = x) = s(w x + alpha)

[Show example where decision boundary is a line:  two anisotropic Gaussians,
 Q_C - Q_D, logistic fn, posterior.  Show ESL, Figure 4.11 (LDAdata.pdf):
 example of LDA with messy data.  Separate "whiteboard".]
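
[For two classes with a shared Sigma, the predictor reduces to w and alpha;
 here's a hedged sketch (the function names are mine).]

  import numpy as np

  def lda_two_class(pi_C, mu_C, pi_D, mu_D, Sigma):
      # returns (w, alpha) so that the decision boundary is w^T x + alpha = 0
      Sigma_inv = np.linalg.inv(Sigma)
      w = Sigma_inv @ (mu_C - mu_D)
      alpha = (-0.5 * (mu_C @ Sigma_inv @ mu_C - mu_D @ Sigma_inv @ mu_D)
               + np.log(pi_C) - np.log(pi_D))
      return w, alpha

  def posterior(x, w, alpha):
      # P(Y = C | X = x) = s(w^T x + alpha), where s is the logistic function
      return 1.0 / (1.0 + np.exp(-(w @ x + alpha)))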


Notes:
- Changing prior pi_C (or loss) is easy:  adjust alpha.
- LDA is often interpreted as projecting samples onto the normal vector w,
  then cutting the line in half.
  [Show projection onto normal vector (project.png)]
- For 2 classes, LDA has d+1 parameters (w, alpha);
                 QDA has d(d+3)/2 + 1 parameters;
                 QDA more likely to overfit.
                 [Show comparison (ldaqda.pdf)]
- With added features, LDA can give nonlinear boundaries; QDA can give
  nonquadratic ones.
- We don't get *true* optimum Bayes classifier
  -- estimate distributions from finite data
  -- real-world data not perfectly Gaussian
- Posterior gives decision boundaries for 10% confidence, 50%, 90%, etc.
  -- Choosing the isovalue = probability p is the same as choosing an
     asymmetrical loss:  p for a false positive, (1 - p) for a false negative.


[LDA & QDA are best in practice for many applications.  In STATLOG project, LDA
 or QDA were in the top three classifiers for 10 out of 22 datasets.  But it's
 not because all those datasets are Gaussian.  LDA & QDA work well when the
 data can only support simple decision boundaries such as linear or quadratic,
 because Gaussian models provide stable estimates.]  [ESL, Section 4.3]