SPECTRAL GRAPH CLUSTERING
=========================
Input:  Weighted, undirected graph G = (V, E).  No self-edges.
        w_ij = weight of edge (i, j) = (j, i); zero if (i, j) not in E.
[Think of the edge weights as a similarity measure.  A big weight means that
 the two vertices want to be in the same cluster.  So the circumstances are
 the opposite of the last lecture on clustering.  Then, we had a distance or
 dissimilarity function, so small numbers meant that points wanted to stay
 together.  Today, big numbers mean that vertices want to stay together.]
Goal:  Cut G into 2 (or more) pieces G_i of similar sizes,
       but don't cut too much edge weight.
       [That's a vague goal.  There are many ways to make this precise.
        Here's a typical goal, which we'll solve approximately.]

                                       Cut(G_1, G_2)
       e.g. Minimize the _sparsity_ -------------------, aka _cut_ratio_
                                    Mass(G_1) Mass(G_2)

       where Cut(G_1, G_2) = total weight of cut edges
             Mass(G_1) = # of vertices in G_1 OR assign masses to vertices
[The denominator "Mass(G_1) Mass(G_2)" penalizes imbalanced cuts.]
[Show illustration of four cuts (graph.pdf).  All edges have weight 1.
 Upper left:  the _minimum_bisection_; a _bisection_ is perfectly balanced.
 Upper right:  the _minimum_cut_.  Usually very unbalanced; not what we want.
 Lower left:  the _sparsest_cut_, which is good for many applications.
 Lower right:  the _maximum_cut_; in this case also the maximum bisection.]
       Sparsest cut is NP-hard.  [We will look for an approximate solution.]

[We will turn this combinatorial graph cutting problem into algebra.]
Let n = |V|.  Let y in R^n be an _indicator_vector_:
        / 1     vertex i in G_1,
  y  = <
   i    \ -1    vertex i in G_2.

       w_ij          2    / w       (i, j) is cut,
  Then ---- (y  - y )  = <   ij
        4     i    j      \ 0       (i, j) is not cut.

                            w_ij          2
  Cut(G , G ) =    sum      ---- (y  - y )
       1   2    (i,j) in E   4     i    j

                1                   2                      2
              = -    sum      (w   y  - 2 w   y  y  + w   y )
                4 (i,j) in E    ij  i      ij  i  j    ij  j

                1                                n   2
              = - (    sum      - 2 w   y  y  + sum y  sum  w   )
                4   (i,j) in E       ij  i  j   i=1  i k!=i  ik
                    \_______________________/   \_____________/
                y^T L y     off-diagonal terms   diagonal terms
              = -------
                   4
                                                   (1)           -            -
               / - w_ij        i != j,          1 /   \ 3        |  4  -1  -3 |
  where L   = <                                  /     \     L = | -1   6  -5 |
         ij    \ sum  w        i = j.          (2)-----(3)       | -3  -5   8 |
                 k!=i  ik                           5            -            -

L is symmetric, n-by-n _Laplacian_matrix_ for G.

[L is effectively a matrix representation of G.  For the purpose of
 partitioning a graph, there is no need to distinguish edges of weight zero
 from edges that are not in the graph.]
[We see that minimizing the weight of the cut is equivalent to minimizing the
 _Laplacian_quadratic_form_ y^T L y.  This lets us turn graph partitioning into
 a problem in matrix algebra.]
[Usually we assume there are no negative weights, in which case Cut(G_1, G_2)
 can never be negative, so it follows that L is positive semidefinite.]

If y = *1* = [ 1  1  ...  1 ]^T, then Cut(G_1, G_2) = 0, so     [*1* = bold 1.]
*1* is an eigenvector of L with eigenvalue 0.

[If G is connected and all the edge weights are positive, then this is the
 only zero eigenvalue.  But if G is not connected, L has one zero eigenvalue
 for each connected component of G.  It's easy to prove, but time prevents me.]

_Bisection_:  exactly n/2 vertices in G_1, n/2 in G_2.  Write *1*^T y = 0.
[So we have reduced graph bisection to this constrained optimization problem.]

  --------------------------------------------
  |                        T                 |
  | Find y that minimizes y  L y             |
  |                                          |
  | subject to  for all i, y  = 1 or y  = -1 | <- _binary_constraint_
  |                         i         i      |
  |                T                         |
  |        and  *1*  y = 0                   | <- _balance_constraint_
  --------------------------------------------

Also NP-hard.  We _relax_ the binary constraint.  -> fractional vertices!

[A very common approach in combinatorial optimization algorithms is to relax
 some of the constraints so a discrete problem becomes a continuous problem.
 Intuitively, this means that you can put 1/3 of vertex 7 in graph G_1 and
 the other 2/3 of vertex 7 in graph G_2.  You can even put -1/2 of vertex 7 in
 graph G_1 and 3/2 of vertex 7 in graph G_2.  This sounds crazy, but the
 continuous problem is much easier to solve than the combinatorial problem.
 After we solve it, we will _round_ the vertex values to +1/-1, and we'll hope
 that our solution is still close to optimal.]
[We can't just drop the binary constraint, though.  We still need *some*
 constraint to rule out the solution y = 0.]

New constraint:  y must lie on sphere of radius sqrt(n).
[Draw figure showing constraint before--y lies at a vertex of the hypercube--
 and the constraint after--y lies on the sphere through those vertices.]

Relaxed problem:
  --------------------------
  |           T            |                T
  | Minimize y  L y        | \             y  L y
  |              T         |  > = Minimize ------ = Rayleigh quotient of L & y
  | subject to  y  y = n   | /               T
  |                        |                y  y
  |                T       |
  |        and  *1*  y = 0 |
  --------------------------

[Show illustration of isosurfaces of y^T Ly (cylinder.pdf) and
 illustration restricted to the hyperplane 1^T y = 0 (endview.pdf).]

[You should remember this Rayleigh quotient from the lecture on PCA.  As I said
 then, when you see a Rayleigh quotient, you should smell eigenvalues nearby.
 The y that minimizes this Rayleigh quotient is the eigenvector with the
 smallest eigenvalue.  We already know what that eigenvector is:  it's *1*.
 But that violates our balance constraint.  As you should recall from PCA, when
 you've used the most extreme eigenvector and you need an orthogonal one, the
 next-best optimizer of the Rayleigh quotient is the next eigenvector.]

Let lambda_2 = second-smallest eigenvalue of L.
Eigenvector v_2 is the _Fiedler_vector_.
[It would be wonderful if every component of the Fiedler vector was 1 or -1,
 but that happens more or less never.  So we round it.  The simplest way is to
 round all positive entries to 1 and all negative entries to -1.  But in both
 theory and practice, it's better to choose the threshold as follows.]

Spectral partitioning alg.:
- Compute Fiedler vector v_2 of L
- _Round_ v_2 with a _sweep_cut_:
  * Sort components of v_2.
  * Try the n - 1 cuts between successive components.  Choose min-sparsity cut.
    [If we're clever about it, we can try all these cuts in time linear in the
     number of edges in G.]

[Show example of graph partitioned by the sweep cut (specgraph.pdf).]
[Show what the un-rounded Fiedler vector looks like (specvector.pdf).]
[One consequence of relaxing the binary constraint is that the balance
 constraint no longer forces an exact bisection.  But that's okay; we're cool
 with a slightly off-balance constraint if it means we cut fewer edges.]
[Show illustration where an off-balance cut is better (lopsided.pdf).]

                                                                 L_ii
Fact:  Sweep cut finds a cut w/sparsity <= sqrt(2 lambda_2 max_i ----);
       _Cheeger's_inequality_.                                   M_ii
       The optimal cut has sparsity >= lambda_2 / 2.
[So the spectral partitioning algorithm is an approximation algorithm, albeit
 not one with a constant factor of approximation.  Cheeger's inequality is
 a very famous result in spectral graph theory, because it's one of the
 most important cases where you can relax a combinatorial optimization problem
 to a continuous optimization problem, round the solution, and still have
 a provably decent solution to the original combinatorial problem.]

Vertex Masses
-------------
[Sometimes you want the notion of balance to accord more prominence to some
 vertices than others.  We can assign masses to vertices.]
Let M be diagonal matrix with vertex masses on diagonal.
New balance constraint:  *1*^T M y = 0.
[This new balance constraint says that G_1 and G_2 should each have the same
 total mass.  It turns out that this new balance constraint is easier
 to satisfy if we also revise the sphere constraint a little bit.]
New ellipsoid constraint:   y^T M y = Mass(G) = sum M_ii.
[Instead of a sphere, now we constrain y to lie on an axis-aligned ellipsoid.]
[Draw the ellipsoid, which passes through points of hypercube.]

Now we want Fiedler vector of _generalized_eigensystem_ Lv = lambda Mv.
[Most algorithms for computing eigenvectors and eigenvalues of symmetric
 matrices can easily be adapted to compute eigenvectors and eigenvalues of
 symmetric generalized eigensystems like this too.]


Vibration Analogy
-----------------
[Show figure of system of springs and masses (vibrate.pdf).]
[For intuition about spectral partitioning, think of the eigenvectors as
 vibrational modes in a physical system of springs and masses.  Each vertex
 models a point mass that is constrained to move freely along a vertical rod.
 Each edge models a vertical spring with rest length zero and stiffness
 proportional to its weight, pulling two point masses together.  The masses are
 free to oscillate sinusoidally on their rods.  The eigenvectors of the
 generalized eigensystem Lv = lambda Mv are the vibrational modes of this
 physical system, and their eigenvalues are proportional to their frequencies.]

[Show figure of vibrational modes in path graph and grid graph (grids.pdf).]
[These illustrations show the first four eigenvectors for two simple graphs.
 On the left, we see that the first eigenvector is the eigenvector of all 1's,
 which represents a vertical translation of all the masses in unison.  That's
 not really a vibration, which is why the eigenvalue is zero.  The second
 eigenvector is the Fiedler vector, which represents the vibrational mode with
 the lowest frequency.  Each component indicates the amplitude with which the
 corresponding point mass oscillates.  At any point in time as the masses
 vibrate, roughly half the mass is moving up while half is moving down.  So it
 makes sense to cut between the positive components and the negative
 components.  The third eigenvector also gives us a nice bisection of the grid
 graph, entirely different from the Fiedler vector.  Some more sophisticated
 graph clustering algorithms use multiple eigenvectors.]

[I want to emphasize that spectral partitioning takes a global view of a graph.
 It looks at the whole gestalt of the graph and finds a good cut.
 By comparison, the clustering algorithms we saw last lecture were much more
 local in nature, so they're easier to fool.]

Greedy Divisive Clustering
--------------------------
Partition G into 2 subgraphs; recursively cluster them.
[The sparsity is a good criterion for graph clustering.  Use G's sparsest cut
 to divide it into two subgraphs, then recursively cut them.  You can stop when
 you have the right number of clusters, or you could keep going until each
 subgraph is a single vertex and create a dendrogram.]
Can form a dendrogram, but it may have inversions.
[There's no reason to expect that the sparsity of a subgraph is smaller than
 the sparsity of the parent graph, so the dendrogram can have inversions.
 But it's still useful for getting an arbitrary number of clusters on demand.]

The Normalized Cut
------------------
Set vertex i's mass M_ii = L_ii.  [Sum of edge weights adjoining vertex i.]
[That is how we define a "normalized cut", which turns out to be a good choice
 for many different applications.]
Popular for _image_segmentation_.
[Image segmentation is the problem of looking at a photograph and separating it
 into different objects.  To do that, we define a graph on the pixels.]
For pixels with location w_i, brightness b_i, use graph weights

                        2                    2
  w   = exp( - |w  - w |  / alpha - |b  - b |  / beta )
   ij            i    j               i    j

or zero if |w_i - w_j| large.
[We choose a distance threshold, typically less than 4 to 10 pixels apart.
 Pixels that are far from each other aren't connected.  alpha and beta are
 empirically chosen constants.  It often makes sense to choose beta
 proportional to the variance of the brightness values.]

[Show segmentation of baseball image (baseballsegment.pdf).  The upper left
 figure is a photo of a scene during a baseball game.  The other figures show
 segments of the image extracted by recursive spectral partitioning.]
[Show eigenvectors 2-9 from the baseball image (baseballvectors.pdf).]

Invented by [our own] Prof. Jitendra Malik and his student Jianbo Shi.