The Singular Value Decomposition (SVD) [and its application to PCA]
--------------------------------------

Problems:  Computing X^T X takes Theta(n d^2) time.
           X^T X is poorly conditioned -> numerically inaccurate eigenvectors.
[The SVD improves both these problems.]

[Earlier this semester, we learned about the eigendecomposition of a square,
 symmetric matrix. Unfortunately, nonsymmetric matrices don't decompose nearly
 as nicely, and non-square matrices don't have eigenvectors at all. Happily,
 there is a similar decomposition that works for all matrices, even if they're
 not symmetric and not square.]

Fact:  If n >= d, we can find a _singular_value_decomposition_

           X = U D V^T = sum_{i=1}^{d} delta_i u_i v_i^T

       where U is n x d, D is a diagonal d x d matrix, and V^T is d x d;
       each term delta_i u_i v_i^T is an n x d rank-1 outer product matrix.

       V^T V = I    orthonormal v_i's are _right_singular_vectors_ of X
       U^T U = I    orthonormal u_i's are _left_singular_vectors_ of X

       Diagonal entries delta_1, ..., delta_d of D are nonnegative
       _singular_values_ of X.
       [Some of the singular values might be zero. The number of nonzero
        singular values is equal to the rank of the centered design matrix X.
        If all the sample points lie on a line, there is only one nonzero
        singular value. If the points span a subspace of dimension r, there
        are r nonzero singular values.]
       [If n < d, an SVD still exists, but now U is square and V is not.]

Fact:  v_i is an eigenvector of X^T X w/eigenvalue delta_i^2.

Proof: X^T X = V D U^T U D V^T = V D^2 V^T,
       which is an eigendecomposition of X^T X.

[The columns of V are the eigenvectors of X^T X, which is what we need for
 PCA. If n < d, V will omit some of the eigenvectors that have eigenvalue
 zero, but those are useless for PCA. The SVD also tells us the eigenvalues,
 which are the squares of the singular values. By the way, that's related to
 why the SVD is more numerically stable: the ratios between singular values
 are smaller than the ratios between eigenvalues.]

Fact:  We can find the k greatest singular values & corresponding vectors in
       O(n d k) time.
       [So we can save time by computing some of the singular vectors without
        computing all of them.]
       [There are approximate, randomized algorithms that are even faster,
        producing an approximate SVD in O(n d log k) time. These are starting
        to become popular in algorithms for very big data.
        https://code.google.com/archive/p/redsvd/ ]

Row i of UD gives the coordinates of sample point X_i in principal components
space (i.e. X_i . v_j for each j). So we don't need to project the input
points onto that space; the SVD has already done it for us.
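[To make this concrete, here is a minimal NumPy sketch of PCA via the SVD.
 The function name pca_via_svd is my own, not part of these notes; it assumes
 X is an n x d array and does the centering itself. np.linalg.svd returns the
 singular values in decreasing order, so the first k columns of UD are the
 top-k principal coordinates.]

    import numpy as np

    def pca_via_svd(X, k):
        """Return the first k principal coordinates and directions of X."""
        Xc = X - X.mean(axis=0)          # center the sample points
        U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
        # Rows of Vt are the right singular vectors v_i, i.e., the
        # eigenvectors of Xc^T Xc; s holds the singular values delta_i.
        coords = U[:, :k] * s[:k]        # rows of UD = points in PC space
        return coords, Vt[:k].T          # n x k coordinates, d x k directions

[Up to the sign of each axis, pca_via_svd(X, 2) gives the same 2D coordinates
 as projecting the centered points onto the top two eigenvectors of X^T X,
 but without ever forming X^T X.]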
CLUSTERING
==========

Partition data into clusters so points within a cluster are more similar than
across clusters.

Why?
- Discovery:  Find songs similar to songs you like; determine market segments
- Hierarchy:  Find good taxonomy of species from genes
- _Quantization_:  Compress a data set by reducing choices
- Graph partitioning:  Image segmentation; find groups in social networks

[Show clusters that classify Barry Zito's baseball pitches (zito.pdf). Here
 we discover that there really are distinct classes of baseball pitches.]


k-Means Clustering aka Lloyd's Algorithm (Stuart Lloyd, 1957)
----------------------------------------

Goal:  Partition n points into k disjoint clusters.
       Assign each input point X_i a cluster label y_i in [1, k].

       Cluster i's _mean_ is mu_i = (1 / n_i) sum_{y_j = i} X_j, given n_i
       points in cluster i.

       Find y that minimizes   sum_{i=1}^{k} sum_{y_j = i} |X_j - mu_i|^2
       [Sum of the squared distances from points to their cluster means.]

NP-hard. Solvable in O(n k^n) time.  [Try every partition.]

k-means heuristic:  Alternate between
    (1) y_j's are fixed; update mu_i's
    (2) mu_i's are fixed; update y_j's
    Halt when step (2) changes no assignments.

[So, we have an assignment of points to clusters. We update the cluster
 means. Then we reconsider the assignment. A point might change clusters if
 some other cluster's mean is closer than its own cluster's mean. Then
 repeat.]

Step (1):  One can show (calculus) the optimal mu_i is the mean of the points
           in cluster i.
           [This is really easy calculus, so I leave it as a short exercise.]

Step (2):  The optimal y assigns each point X_j to the closest center mu_i.
           [This should be even more obvious than step (1).]
           [If there's a tie, and one of the choices is for X_j to stay in
            the same cluster as the previous iteration, always take that
            choice.]

[Show example of 2-means (2means.png).]
[Show animation of 4-means with many points (4meansanimation.gif).]

Both steps decrease the objective fn *unless* they change nothing.
[Therefore, the algorithm never returns to a previous assignment.]
Hence the alg. must terminate.  [As there are only finitely many assignments.]

[This argument doesn't say anything optimistic about the running time,
 because we might see O(k^n) different assignments before we halt. In theory,
 one can actually construct point sets in the plane that take an exponential
 number of iterations, but those never happen in practice.]

Usually very fast in practice. Finds a local minimum, often not global.
[...which is not surprising, as this problem is NP-hard.]

[Show example where 4-means clustering fails (4meansbad.png).]

Getting started:
- Forgy method:  choose k random sample points to be initial mu_i's; go to (2).
- Random partition:  randomly assign each sample point to a cluster; go to (1).
[Forgy seems to be better, but Wikipedia mentions some variants of k-means
 for which random partition is better.]

For best results, run k-means multiple times with random starts.

[Show clusters found by running 3-means 6 times on the same sample points
 (ISL, Figure 10.7) (kmeans6times.pdf).]

[Why did we choose that particular objective function to minimize? Partly
 because it is equivalent to minimizing the following function.]

Equivalent objective fn:  the _within-cluster_variation_

       Find y that minimizes
           sum_{i=1}^{k} (1 / n_i) sum_{y_j = i} sum_{y_m = i} |X_j - X_m|^2

[This objective function is equal to twice the previous one. It's a
 worthwhile exercise to show that. The nice thing about this expression is
 that it doesn't include the means; it's a direct function of the input
 points and the clusters we assign them to.]

Normalize the data?  Same advice as for PCA. Sometimes yes, sometimes no.
[If some features are much larger than others, they will tend to dominate
 the Euclidean distance. So if you have features in different units of
 measurement, you probably should normalize them. If you have features in
 the same unit of measurement, you probably shouldn't, but it depends on
 context.]
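[Here is a minimal NumPy sketch of the k-means heuristic above, with Forgy
 initialization. The function name kmeans and the seed argument are my own
 choices; for simplicity, ties in step (2) are broken by lowest cluster index
 rather than by the stay-put rule mentioned above.]

    import numpy as np

    def kmeans(X, k, seed=0):
        """Lloyd's algorithm: alternate steps (1) and (2) until no
        assignment changes."""
        rng = np.random.default_rng(seed)
        n = len(X)
        mu = X[rng.choice(n, size=k, replace=False)].astype(float)  # Forgy
        y = np.full(n, -1)
        while True:
            # Step (2): assign each point to its closest center.
            dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
            y_new = dist.argmin(axis=1)
            if np.array_equal(y_new, y):     # no assignment changed: halt
                return y, mu
            y = y_new
            # Step (1): move each center to the mean of its cluster.
            for i in range(k):
                if np.any(y == i):           # skip empty clusters
                    mu[i] = X[y == i].mean(axis=0)

[As noted above, in practice one would call this several times with different
 seeds and keep the run with the smallest objective value.]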
k-Medoids Clustering
--------------------

Generalizes k-means beyond Euclidean distance.

Specify a _distance_fn_ d(x, y) between points x, y, aka _dissimilarity_.
Can be arbitrary; ideally satisfies the triangle inequality
    d(x, y) <= d(x, z) + d(z, y).

[Sometimes people use the l_1 norm or the l_infinity norm. Sometimes people
 specify a matrix of pairwise distances between the input points.]

[Suppose you have a database that tells you how many of each product each
 customer bought. You'd like to cluster together customers who buy similar
 products for market analysis. But if you cluster customers by distance,
 you'll get one big cluster of all the customers who have only ever bought
 one thing. So distance is not a good measure of dissimilarity. Instead, it
 makes more sense to treat each customer as a vector and measure the *angle*
 between two customers. If there's a large angle between customers, they're
 dissimilar.]

Replace mean computation with the _medoid_, the sample point that minimizes
total distance to the other points in the same cluster.
[So the medoid of a cluster is always one of the input points.]

[One difficulty with k-means is that you have to choose the number k of
 clusters before you start, and there isn't any reliable way to guess how
 many clusters will best fit the data. The next method, hierarchical
 clustering, has the advantage in that respect. By the way, there is a whole
 Wikipedia article on "Determining the number of clusters in a data set".]


Hierarchical Clustering
-----------------------

Creates a tree; every subtree is a cluster.
[So some clusters contain smaller clusters.]

Bottom-up, aka _agglomerative_clustering_:
    start with each point a cluster; repeatedly fuse pairs.
    [Draw figure of points in the plane; pair clusters together until all
     points are in one top-level cluster.]

Top-down, aka _divisive_clustering_:
    start with all pts in one cluster; repeatedly split it.
    [Draw figure of points in the plane; divide points into subsets
     hierarchically until each point is in its own subset.]

[When the input is a point set, agglomerative clustering is used much more in
 practice than divisive clustering. But when the input is a graph, it's the
 other way around: divisive clustering is more common.]

We need a distance fn for clusters A, B:

_complete_linkage_:  d(A, B) = max { d(w, x) : w in A, x in B }
_single_linkage_:    d(A, B) = min { d(w, x) : w in A, x in B }
_average_linkage_:   d(A, B) = (1 / (|A| |B|)) sum_{w in A} sum_{x in B} d(w, x)
_centroid_linkage_:  d(A, B) = d(mu_A, mu_B)   where mu_S is the mean of S

[The first three of these methods work for any distance function, even if
 the input is just a matrix of distances between all pairs of points. The
 centroid linkage only really makes sense if we're using the Euclidean
 distance. But there's a variation of the centroid linkage that uses the
 medoids instead of the means, and medoids are defined for any distance
 function. Moreover, medoids are more robust to outliers than means.]

Greedy agglomerative alg.:
    Repeatedly _fuse_ the two clusters that minimize d(A, B).
    Naively takes O(n^3) time.
    [But for complete and single linkage, there are more sophisticated
     algorithms called CLINK and SLINK, which run in O(n^2) time.]

_Dendrogram_:  Illustration of the cluster hierarchy (tree) in which the
vertical axis encodes all the linkage distances.

[Show example of dendrogram (ISL, Figure 10.9) (dendrogram.pdf).]

Cut the dendrogram into clusters by a horizontal line according to your
choice of # of clusters OR intercluster distance.
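[A short SciPy sketch of greedy agglomerative clustering and of cutting the
 dendrogram both ways. The library scipy.cluster.hierarchy and the toy data
 are my choices, not something these notes prescribe.]

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    rng = np.random.default_rng(0)
    X = rng.normal(size=(60, 2))        # placeholder sample points

    # Greedy agglomerative clustering; method can be 'complete', 'single',
    # 'average', or 'centroid', matching the linkages defined above.
    Z = linkage(X, method='complete')

    # Cut the dendrogram by your choice of # of clusters ...
    labels_by_count = fcluster(Z, t=3, criterion='maxclust')
    # ... or by a horizontal line at a chosen intercluster distance.
    labels_by_height = fcluster(Z, t=2.5, criterion='distance')

[Consistent with the caveat above, SciPy's 'centroid' method is defined only
 for Euclidean distance; the other three accept arbitrary dissimilarities.]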
[It's important to be aware that the horizontal axis of a dendrogram has no
 meaning. You could swap some node's left children and right children and it
 would still be the same dendrogram. It doesn't always mean anything that two
 leaves happen to be next to each other.]

[Show comparison of average/complete/single linkages (ISL, Figure 10.12)
 (linkages.pdf).]

[Probably the worst of these is the single linkage, because it's very
 sensitive to outliers. Notice that if you cut it into three clusters, two of
 them have only one node. It also tends to give you a very unbalanced tree.]

[The complete linkage tends to be the best balanced, because when a cluster
 gets large, the furthest point in the cluster is always far away. So large
 clusters are more resistant to growth than small ones. If balanced clusters
 are your goal, this is your best choice.]

[In most cases you probably want the average or complete linkage.]

Warning:  centroid linkage can cause _inversions_ where a parent cluster is
          fused at a lower height than its children.
          [So statisticians don't like it, but nevertheless, centroid linkage
           is popular in genomics. A small numerical check of an inversion
           appears at the end of these notes.]

[As a final note, all the clustering algorithms we've studied so far are
 unstable, in the sense that deleting a few input points can sometimes give
 you very different results. But these unstable heuristics are still the most
 commonly used clustering algorithms. And it's not clear to me whether a
 truly stable clustering algorithm is even possible.]
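[Here is the small numerical check of the inversion warning above, using
 SciPy (my choice of library). Take three points at the corners of an
 equilateral triangle with side length 1. The first fusion joins two corners
 at height 1, but the third corner is only sqrt(3)/2 ~ 0.87 away from that
 pair's centroid, so the second fusion happens at a *lower* height than the
 first.]

    import numpy as np
    from scipy.cluster.hierarchy import linkage

    # Corners of an equilateral triangle with side length 1.
    X = np.array([[0.0, 0.0],
                  [1.0, 0.0],
                  [0.5, np.sqrt(3) / 2]])

    Z = linkage(X, method='centroid')
    print(Z[:, 2])   # fusion heights: roughly [1.0, 0.866] -- an inversion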