The Singular Value Decomposition (SVD) [and its application to PCA]
--------------------------------
Problems:  Computing X^T X takes Theta(n d^2) time.
           X^T X is poorly conditioned -> numerically inaccurate eigenvectors.
           [The SVD improves both these problems.]

[Earlier this semester, we learned about the eigendecomposition of a square,
 symmetric matrix.  Unfortunately, nonsymmetric matrices don't decompose nearly
 as nicely, and non-square matrices don't have eigenvectors at all.  Happily,
 there is a similar decomposition that works for all matrices, even if they're
 not symmetric and not square.]

Fact:  If n >= d, we can find a _singular_value_decomposition_

                                                          T     d             T
         X       =       U               D               V   = sum delta  u  v
                                      diagonal                 i=1      i  i  i
    -----------     -----------     -----------     -----------    \__________/
    |         |     |^       ^|     |d_1     0|     |<-- v -->|       rank 1
    |         |     ||       ||     |  d_2    |     |     1   |   outer product
    |         |  =  ||       ||     |    ..   |     |         |       matrix
    |         |     |u       u|     |     ..  |     |         |   
    |         |     ||1      |d     |0     d_d|     |<-- v -->|
    |         |     ||       ||     -----------     ------d----
    |         |     |v       v|        d x d           d x d
    -----------     -----------                        T
       n x d           n x d                          V  V = I
                                               orthonormal v_i's are
                       T                       _right_singular_vectors_ of X
                      U  U = I
               orthonormal u 's are _left_singular_vectors_ of X
                            i

Diagonal entries delta_1, ..., delta_d of D (written d_1, ..., d_d in the
diagram above) are nonnegative _singular_values_ of X.

[Some of the singular values might be zero.  The number of nonzero singular
 values is equal to the rank of the centered design matrix X.  If all the
 sample points lie on a line, there is only one nonzero singular value.  If the
 points span a subspace of dimension r, there are r nonzero singular values.]
[If n < d, an SVD still exists, but now U is square and V is not.]

Fact:  v_i is an eigenvector of X^T X w/eigenvalue delta_i^2.
Proof:  X^T X = V D U^T U D V^T = V D^2 V^T
        which is an eigendecomposition of X^T X.
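
[A quick numerical check of this fact, as a NumPy sketch (the matrix below is
 just a made-up stand-in for a centered design matrix):]

  import numpy as np

  rng = np.random.default_rng(0)
  X = rng.standard_normal((100, 5))                      # stand-in design matrix
  X -= X.mean(axis=0)                                    # center it

  U, delta, Vt = np.linalg.svd(X, full_matrices=False)   # X = U diag(delta) V^T

  # Each column of V is an eigenvector of X^T X with eigenvalue delta_i^2.
  for i in range(X.shape[1]):
      v = Vt[i]                                   # row i of V^T = column i of V
      assert np.allclose(X.T @ X @ v, delta[i] ** 2 * v)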

[The columns of V are the eigenvectors of X^T X, which is what we need for PCA.
 If n < d, V will omit some of the eigenvectors that have eigenvalue zero, but
 those are useless for PCA.  The SVD also tells us the eigenvalues, which are
 the squares of the singular values.  By the way, that's related to why the SVD
 is more numerically stable:   because the ratios between singular values are
 smaller than the ratios between eigenvalues.]

Fact:  We can find the k greatest singular values & corresponding vectors
       in O(n d k) time.
       [So we can save time by computing some of the singular vectors without
        computing all of them.]
       [There are approximate, randomized algorithms that are even faster,
        producing an approximate SVD in O(n d log k) time.  These are starting
        to become popular in algorithms for very big data.  For example, see
        https://code.google.com/archive/p/redsvd/ ]
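
[If you just want the k largest singular values and vectors, one readily
 available tool is scipy's sparse SVD solver; a minimal sketch on made-up data:]

  import numpy as np
  from scipy.sparse.linalg import svds

  rng = np.random.default_rng(0)
  X = rng.standard_normal((2000, 100))
  X -= X.mean(axis=0)                    # center the design matrix

  k = 10
  U, delta, Vt = svds(X, k=k)            # only the k largest singular values/vectors
  order = np.argsort(delta)[::-1]        # svds doesn't guarantee a sorted order
  U, delta, Vt = U[:, order], delta[order], Vt[order]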

Row i of UD gives the coordinates of sample point X_i in principal components
space (i.e. X_i . v_j for each j).  So we don't need to project the input
points onto that space; the SVD has already done it for us.
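
[For example, a minimal PCA-by-SVD sketch in NumPy (the function name and
 variables are mine, not a standard API):]

  import numpy as np

  def pca_by_svd(X_raw, k):
      """Coordinates of each sample point in the top-k principal components
      space, computed from the SVD of the centered design matrix."""
      X = X_raw - X_raw.mean(axis=0)            # center the data
      U, delta, Vt = np.linalg.svd(X, full_matrices=False)
      coords = U[:, :k] * delta[:k]             # row i = (X_i . v_1, ..., X_i . v_k)
      assert np.allclose(coords, X @ Vt[:k].T)  # same as projecting explicitly
      return coords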

CLUSTERING
==========
Partition data into clusters so points within a cluster are more similar than
across clusters.
Why?
- Discovery:  Find songs similar to songs you like; determine market segments
- Hierarchy:  Find good taxonomy of species from genes
- _Quantization_:  Compress a data set by reducing choices
- Graph partitioning:  Image segmentation; find groups in social networks
[Show clusters that classify Barry Zito's baseball pitches (zito.pdf).
 Here we discover that there really are distinct classes of baseball pitches.]

k-Means Clustering aka Lloyd's Algorithm (Stuart Lloyd, 1957)
----------------------------------------
Goal:  Partition n points into k disjoint clusters.
       Assign each input point X_i a cluster label y_i in [1, k].
                                   1
       Cluster i's _mean_ is mu  = --  sum  X , given n  points in cluster i.
                               i   n   y =i  j         i
                                    i   j

  ----------------------------------------------
  |                        k       |        |2 |  [Sum of the squared distances
  | Find y that minimizes sum sum  |X  - mu |  |   from points to their cluster
  |                       i=1 y =i | j     i|  |   means.]
  |                            j               |
  ----------------------------------------------

NP-hard.  Solvable in O(n k^n) time.  [Try every partition.]
k-means heuristic:  Alternate between
                      (1)  y_j's are fixed; update mu_i's
                      (2)  mu_i's are fixed; update y_j's
                    Halt when step (2) changes no assignments.
[So, we have an assignment of points to clusters.  We update the cluster means.
 Then we reconsider the assignment.  A point might change clusters if some
 other cluster's mean is closer than its own cluster's mean.  Then repeat.]

Step (1):  One can show (calculus) the optimal mu_i is the mean of the points
           in cluster i.
           [This is really easy calculus, so I leave it as a short exercise.]
Step (2):  The optimal y assigns each point X_j to the closest center mu_i.
           [This should be even more obvious than step (1).]
           [If there's a tie, and one of the choices is for X_j to stay in the
            same cluster as the previous iteration, always take that choice.]

[Show example of 2-means (2means.png).]
[Show animation of 4-means with many points (4meansanimation.gif).]
Both steps decrease objective fn *unless* they change nothing.
[Therefore, the algorithm never returns to a previous assignment.]
Hence alg. must terminate.  [As there are only finitely many assignments.]
[This argument doesn't say anything optimistic about the running time, because
 we might see O(k^n) different assignments before we halt.  In theory, one can
 actually construct point sets in the plane that take an exponential number of
 iterations, but those never happen in practice.]
Usually very fast in practice.  Finds a local minimum, often not global.
[...which is not surprising, as this problem is NP-hard.]
[Show example where 4-means clustering fails (4meansbad.png).]

Getting started:
- Forgy method:  choose k random sample points to be initial mu_i's; go to (2).
- Random partition:  randomly assign each sample point to a cluster; go to (1).
[Forgy seems to be better, but Wikipedia mentions some variants of k-means for
 which random partition is better.]
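
[Putting the two steps together with the Forgy initialization, here is a
 bare-bones sketch of Lloyd's algorithm (for illustration; not optimized):]

  import numpy as np

  def kmeans(X, k, seed=0):
      rng = np.random.default_rng(seed)
      n = len(X)
      # Forgy initialization:  k random sample points become the initial means.
      mu = X[rng.choice(n, size=k, replace=False)].copy()
      y = np.full(n, -1)
      while True:
          # Step (2):  assign each point to the nearest mean.
          dist2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # n x k
          y_new = dist2.argmin(axis=1)
          if np.array_equal(y_new, y):          # step (2) changed nothing:  halt
              return y, mu
          y = y_new
          # Step (1):  recompute each cluster's mean.
          for i in range(k):
              if np.any(y == i):                # keep old mean if a cluster empties
                  mu[i] = X[y == i].mean(axis=0)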

For best results, run k-means multiple times with random starts.
[Show clusters found by running 3-means 6 times on the same sample points
 (ISL, Figure 10.7) (kmeans6times.pdf).]

[Why did we choose that particular objective function to minimize?  Partly
 because it is equivalent to minimizing the following function.]

Equivalent objective fn:  the _within-cluster_variation_

  -----------------------------------------------------
  |                        k  1            |       |2 |
  | Find y that minimizes sum -- sum  sum  |X  - X |  |
  |                       i=1 n  y =i y =i | j    m|  |
  |                            i  j    m              |
  -----------------------------------------------------
[This objective function is equal to twice the previous one.  It's a worthwhile
 exercise to show that.  The nice thing about this expression is that it
 doesn't include the means; it's a direct function of the input points and
 the clusters we assign them to.]
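
[A quick numerical check of that equivalence on made-up data and an arbitrary
 assignment (a sanity check, not a proof):]

  import numpy as np

  rng = np.random.default_rng(0)
  X = rng.standard_normal((50, 3))
  y = rng.integers(0, 4, size=50)      # an arbitrary assignment into k = 4 clusters

  kmeans_obj = sum(((X[y == i] - X[y == i].mean(axis=0)) ** 2).sum()
                   for i in range(4) if np.any(y == i))
  within_var = sum(((X[y == i][:, None, :] - X[y == i][None, :, :]) ** 2).sum()
                   / (y == i).sum()
                   for i in range(4) if np.any(y == i))

  assert np.isclose(within_var, 2 * kmeans_obj)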

Normalize the data?  Same advice as for PCA.  Sometimes yes, sometimes no.
[If some features are much larger than others, they will tend to dominate the
 Euclidean distance.  So if you have features in different units of
 measurement, you probably should normalize them.  If you have features in the
 same unit of measurement, you probably shouldn't, but it depends on context.]

k-Medoids Clustering
--------------------
Generalizes k-means beyond Euclidean distance.
Specify a _distance_fn_ d(x, y) between points x, y, aka _dissimilarity_.
Can be arbitrary; ideally satisfies
  triangle inequality d(x, y) <= d(x, z) + d(z, y).
[Sometimes people use the l_1 norm or the l_infinity norm.  Sometimes people
 specify a matrix of pairwise distances between the input points.]
[Suppose you have a database that tells you how many of each product each
 customer bought.  You'd like to cluster together customers who buy similar
 products for market analysis.  But if you cluster customers by distance,
 you'll get one big cluster of all the customers who have only ever bought one
 thing.  So distance is not a good measure of dissimilarity.  Instead, it makes
 more sense to treat each customer as a vector and measure the *angle* between
 two customers.  If there's a large angle between customers, they're
 dissimilar.]
Replace mean computation with _medoid_, the sample point that minimizes total
  distance to other points in same cluster.
[So the medoid of a cluster is always one of the input points.]
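
[For instance, a short sketch of the medoid computation for one cluster, given
 an arbitrary distance function (the names here are just illustrative):]

  import numpy as np

  def medoid(points, dist):
      """The point in `points` minimizing total distance to the other points."""
      totals = [sum(dist(p, q) for q in points) for p in points]
      return points[int(np.argmin(totals))]

  # Example with the l_1 distance as the dissimilarity.
  cluster = [np.array([0., 0.]), np.array([1., 1.]), np.array([5., 4.])]
  m = medoid(cluster, dist=lambda x, y: np.abs(x - y).sum())   # -> [1., 1.]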

[One difficulty with k-means is that you have to choose the number k of
 clusters before you start, and there isn't any reliable way to guess how many
 clusters will best fit the data.  The next method, hierarchical clustering,
 has an advantage in that respect.  By the way, there is a whole Wikipedia
 article on "Determining the number of clusters in a data set".]

Hierarchical Clustering
-----------------------
Creates a tree; every subtree is a cluster.
[So some clusters contain smaller clusters.]

Bottom-up, aka _agglomerative_clustering_:
start with each point a cluster; repeatedly fuse pairs.
[Draw figure of points in the plane; pair clusters together until all points
 are in one top-level cluster.]

Top-down, aka _divisive_clustering_:
start with all pts in one cluster; repeatedly split it.
[Draw figure of points in the plane; divide points into subsets hierarchically
 until each point is in its own subset.]

[When the input is a point set, agglomerative clustering is used much more in
 practice than divisive clustering.  But when the input is a graph, it's the
 other way around:  divisive clustering is more common.]

We need a distance fn for clusters A, B:

_complete_linkage_:  d(A, B) = max { d(w, x) : w in A, x in B }
_single_linkage_:    d(A, B) = min { d(w, x) : w in A, x in B }
                                  1
_average_linkage_:   d(A, B) = -------  sum     sum   d(w, x)
                               |A| |B| w in A  x in B
_centroid_linkage_:  d(A, B) = d(mu_A, mu_B)            where mu_S is mean of S
[The first three of these methods work for any distance function, even if the
 input is just a matrix of distances between all pairs of points.  The centroid
 linkage only really makes sense if we're using the Euclidean distance.  But
 there's a variation of the centroid linkage that uses the medoids instead of
 the means, and medoids are defined for any distance function.  Moreover,
 medoids are more robust to outliers than means.]
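
[A direct sketch of these four linkages for two tiny clusters, here with the
 Euclidean distance standing in for d(., .):]

  import numpy as np

  def d(x, y):
      return np.linalg.norm(x - y)     # Euclidean; could be any distance fn

  A = [np.array([0., 0.]), np.array([1., 0.])]
  B = [np.array([4., 3.]), np.array([6., 1.])]

  complete = max(d(w, x) for w in A for x in B)
  single   = min(d(w, x) for w in A for x in B)
  average  = sum(d(w, x) for w in A for x in B) / (len(A) * len(B))
  centroid = d(np.mean(A, axis=0), np.mean(B, axis=0))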

Greedy agglomerative alg.:
Repeatedly _fuse_ the two clusters that minimize d(A, B)
Naively takes O(n^3) time.
[But for complete and single linkage, there are more sophisticated algorithms
 called CLINK and SLINK, which run in O(n^2) time.]

_Dendrogram_:  Illustration of the cluster hierarchy (tree) in which the
vertical axis encodes all the linkage distances.
[Show example of dendrogram (ISL, Figure 10.9) (dendrogram.pdf).]
Cut dendrogram into clusters by horizontal line according to your choice of
# of clusters OR intercluster distance.
[It's important to be aware that the horizontal axis of a dendrogram has no
 meaning.  You could swap some node's left children and right children and it
 would still be the same dendrogram.  It doesn't always mean anything that two
 leaves happen to be next to each other.]
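
[As a sketch of how this looks in practice:  scipy implements the greedy
 agglomerative algorithm, dendrograms, and cutting (the data below is made up):]

  import numpy as np
  from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

  rng = np.random.default_rng(0)
  X = rng.standard_normal((30, 2))

  Z = linkage(X, method='complete')    # or 'single', 'average', 'centroid'
  dendrogram(Z, no_plot=True)          # drop no_plot=True to draw it (uses matplotlib)

  # Cut the dendrogram by choosing a number of clusters ...
  labels_by_k = fcluster(Z, t=3, criterion='maxclust')
  # ... or by choosing an intercluster distance threshold.
  labels_by_dist = fcluster(Z, t=2.5, criterion='distance')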

[Show comparison of average/complete/single linkages (ISL, Figure 10.12)
 (linkages.pdf).]
[Probably the worst of these is the single linkage, because it's very sensitive
 to outliers.  Notice that if you cut it into three clusters, two of them have
 only one node.  It also tends to give you a very unbalanced tree.]
[The complete linkage tends to be the best balanced, because when a cluster
 gets large, the furthest point in the cluster is always far away.  So large
 clusters are more resistant to growth than small ones.  If balanced clusters
 are your goal, this is your best choice.]
[In most cases you probably want the average or complete linkage.]
Warning:  centroid linkage can cause _inversions_ where a parent cluster is
          fused at a lower height than its children.
[So statisticians don't like it, but nevertheless, centroid linkage is popular
 in genomics.]

[As a final note, all the clustering algorithms we've studied so far are
 unstable, in the sense that deleting a few input points can sometimes give you
 very different results.  But these unstable heuristics are still the most
 commonly used clustering algorithms.  And it's not clear to me whether a truly
 stable clustering algorithm is even possible.]