Here is the notation we will use in this and subsequent lectures.
We consider a weighted, undirected graph G=(N, E, W_{N}, W_{E})
without self edges (i,i) or multiple edges from one node to another.
N is the set of *nodes*, E is the set of *edges* (i,j)
connecting nodes, W_{N} are the *node weights*, a nonnegative
weight for each node, and W_{E} are the *edge weights*, a nonnegative
weight for each edge. When we refer to a graph in this lecture,
we will always mean this kind of graph. If W_{N} or W_{E}
is omitted from
the definition of G, the weight of each node in N and each edge in E
is assumed to be 1. We use the notation |N| (or |E|) to mean the number
of members of the set N (or E).

We think of a node n in N as representing an independent
job to do, and weight W_{n} as measuring the cost of job n.
An edge e=(i,j) in E means that an amount of data W_{e} must be
transferred from job j to job i to complete the overall task.
Partitioning G means dividing N into the union of P disjoint pieces

N = N_{1} U N_{2} U ... U N_{P}

where the nodes (jobs) in N_{i} are assigned to processor i, such that

- The sums of the weights W_{n} of the nodes n in each N_{i} are approximately equal. This means the load is approximately balanced across processors.
- The sum of the weights W_{e} of edges connecting nodes in different N_{i} and N_{j} is minimized. This means that the total size of all messages communicated between different processors is minimized.
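To make the two criteria concrete, here is a small sketch (all names hypothetical) that represents G = (N, E, W_{N}, W_{E}) with plain dictionaries and evaluates a candidate partition against both criteria:

```python
# Hypothetical sketch: represent G = (N, E, W_N, W_E) with dicts and
# evaluate a candidate partition against the two criteria above.

def partition_quality(node_weights, edge_weights, parts):
    """node_weights: {node: weight}, edge_weights: {(i, j): weight},
    parts: list of disjoint sets of nodes covering all of N."""
    # Criterion 1: summed node weight per piece (load per processor).
    loads = [sum(node_weights[n] for n in piece) for piece in parts]
    # Criterion 2: total weight of edges whose endpoints lie in different pieces.
    owner = {n: p for p, piece in enumerate(parts) for n in piece}
    cut = sum(w for (i, j), w in edge_weights.items() if owner[i] != owner[j])
    return loads, cut

# A 4-node example: a square with unit node and edge weights.
W_N = {0: 1, 1: 1, 2: 1, 3: 1}
W_E = {(0, 1): 1, (1, 2): 1, (2, 3): 1, (3, 0): 1}
loads, cut = partition_quality(W_N, W_E, [{0, 1}, {2, 3}])
print(loads, cut)   # [2, 2] 2: balanced loads, 2 edges cross the cut
```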

Now we consider in more detail the problem of multiplying a vector x
by a sparse symmetric matrix A
to get y = A*x. We represent this problem by a graph G with one node i for
each x_{i} and y_{i}, and an edge (i,j) for each nonzero A_{i,j}.
We call G the
*(undirected) graph of the sparse (symmetric) matrix A*.
To map this problem to a parallel computer, we assume node i stores x_{i}, y_{i},
and A_{i,j} for all j such that A_{i,j} is nonzero,
and let node i also represent the job of computing
y_{i} = sum_{j} A_{i,j}*x_{j}.
We let the weight of node i be the number of floating point
multiplications needed to compute y_{i}, i.e. the number of nonzeros
in row i of A. We let the weight of each edge (i,j) be one; this represents
the cost of sending x_{j} from node j to node i.
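This model can be sketched in a few lines (data and names hypothetical): each node i stores its row of A as a dict {j: A_{i,j}}, computes y_{i}, and has node weight equal to the number of nonzeros in its row.

```python
# A small sketch (names hypothetical): each node i stores its row of A as
# {j: A_ij} and computes y_i = sum_j A_ij * x_j. The node weight is the
# number of nonzeros in row i; each edge carries unit communication cost.

rows = {   # graph of a symmetric tridiagonal A; A_ii on the diagonal
    0: {0: 2.0, 1: -1.0},
    1: {0: -1.0, 1: 2.0, 2: -1.0},
    2: {1: -1.0, 2: 2.0},
}
x = {0: 1.0, 1: 1.0, 2: 1.0}

y = {i: sum(a_ij * x[j] for j, a_ij in row.items()) for i, row in rows.items()}
node_weight = {i: len(row) for i, row in rows.items()}  # multiplies per node

print(y)            # {0: 1.0, 1: 0.0, 2: 1.0}
print(node_weight)  # {0: 2, 1: 3, 2: 2}
```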

Equivalently, we could start with graph G=(N,E), and let the problem be to compute
a weighted average y_{i} of a value x_{i} stored at each node i and values x_{j} on
neighboring nodes. The reader can confirm that this is equivalent to computing
y=A*x with a matrix A whose graph is G, and whose nonzero entries are the weights
in the weighted average.

Given a solution of the graph partitioning problem, i.e. an assignment of nodes to processors as in the figure below, the algorithm for computing y=A*x is simply

- For each i, the processor owning node i gets x_{j} whenever A_{i,j} is nonzero and x_{j} is stored on another processor.
- For each i, the processor owning node i computes y_{i} = sum_{j} A_{i,j}*x_{j}.
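The communication step can be sketched as follows (a hypothetical 2-processor layout, not any particular library's API): each processor fetches exactly those x_{j} on other processors that its nonzeros reference.

```python
# Sketch of the communication step (hypothetical names, assumed 2-processor
# layout): before the local multiplies, each processor fetches exactly the
# x_j stored elsewhere that its nonzeros A_ij reference.

rows = {0: {0: 2.0, 1: -1.0}, 1: {0: -1.0, 1: 2.0, 2: -1.0},
        2: {1: -1.0, 2: 2.0, 3: -1.0}, 3: {2: -1.0, 3: 2.0}}
owner = {0: 0, 1: 0, 2: 1, 3: 1}   # nodes 0,1 on processor 0; 2,3 on processor 1

def remote_fetches(proc):
    """Indices j of x_j that processor `proc` must receive from elsewhere."""
    return sorted({j for i, row in rows.items() if owner[i] == proc
                   for j in row if owner[j] != proc})

print(remote_fetches(0), remote_fetches(1))   # [2] [1]
```

Each crossing edge of the partition corresponds to exactly one such fetch, which is why minimizing crossing edges minimizes communication.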

- The sums of node weights are approximately equal for each processor. This means that each processor has an equal amount of floating point work to do, i.e. the problem is load balanced.
- As few edges cross processor boundaries as possible. This minimizes communication, since each crossing edge A_{i,j} means that x_{j} must be sent to the processor owning x_{i}.

We present a variety of useful algorithms for solving the graph partitioning problem.
For simplicity, we will present most of them with unit weights W_{n} = 1
and W_{e} = 1,
although most can be generalized to general weights.

Given n nodes and p processors, there are exponentially many ways to assign n nodes to p processors, some of which satisfy the optimality conditions more nearly than others. To illustrate, the following figure shows two partitions of a graph with 8 nodes onto 4 processors, with 2 nodes per processor. The partitioning on the left has 6 edges crossing processor boundaries and so is superior to the partitioning on the right, which has 10 edges crossing processor boundaries. The reader is invited to find another partition with 6 crossing edges, and to show that no partition has fewer.

In general, computing the optimal partitioning is an NP-complete problem,
which means that the best known algorithms take time which is an
exponential function of n=|N| and p, and it is widely believed that no algorithm
whose running time is a polynomial function of n=|N| and p exists
(see ``Computers and Intractability'', M. Garey and D. Johnson, W. H. Freeman,
1979, for details.)
Therefore we need to use heuristics to get approximate solutions for problems
where n is large. The picture below illustrates a larger graph
partitioning problem; it was generated using the
spectral partitioning algorithm as implemented in the
graph partitioning software by Gilbert et al,
described below. The partition
is N = N_{blue} U N_{black}, with red edges connecting
nodes in the two partitions.

Almost all the algorithms we discuss are only for *graph bisection*,
i.e. they partition N = N_{1} U N_{2}. They are intended to
be applied recursively, again bisecting N_{1} and N_{2}, etc.,
until the partitions are small and numerous enough.

There are two classes of heuristics, depending on the information available
about the graph G. In the first case, we have *coordinate information*
with each node, such as (x,y) or (x,y,z) coordinates.
This is frequently available if the graph comes from
triangulating or otherwise discretizing a physical domain. For example, in the
NASA Airfoil,
there are (x,y) coordinates associated with each node.
(To see the data and be able to manipulate it, start up Matlab and type
"airfoil". Arrays *i* and *j* contain the endpoints of edges --
edge *k* connects nodes *i _{k}* and *j _{k}*.)

For graphs with coordinate information, the partitioning heuristics are
*geometric*, in the following sense. Assume first that there are
two coordinates x and y, i.e. that the nodes lie in the plane. Then the
algorithms seek either a line or a circle that divides the nodes in half,
with (about) half the nodes on one side of the line or circle, and the
rest on the other side. All these algorithms work by finding a
geometric ``middle'' of the nodes (there are several definitions of middle)
and then passing a line (or plane) or circle (or sphere) through this middle
point. Note that these algorithms completely ignore where the edges are;
they implicitly assume that edges connect nearby nodes only.
If the graph satisfies some reasonable properties (which often arise
in finite element meshes), then one can prove that the best of
these algorithms finds good partitions (i.e. splits the nodes nearly in half
while cutting a nearly minimal number of edges).

The second kind of graph we want to partition is *coordinate free*,
i.e. we have no identification of a node with a physical point in space.
In addition, edges have no simple interpretation such as representing physical
proximity. These kinds of graphs require combinatorial rather than
geometric algorithms to partition them (although obviously combinatorial
algorithms may also be applied to graphs with coordinate information).

The earliest algorithms used a graph algorithm called *breadth first
search*, which works well for some finite element meshes. The
*Kernighan-Lin* algorithm (the same Kernighan who co-authored the classic book on the C language)
takes an initial partitioning and iteratively
improves it by trying to swap groups of nodes between N_{1} and N_{2},
greedily picking the group to swap that best minimizes the number of edge crossings
at each step. In practice it converges quickly to a (local) optimum if it
has a good starting partition. Most recently, the *spectral partitioning*
algorithm was devised, based on the intuition that the second lowest
vibrational mode of a vibrating string naturally divides the string in half:

Finally, many of the above algorithms can be accelerated by using a
*multilevel technique* analogous to the multigrid method studied in
Lecture 17.
The idea is to replace the graph by a coarser graph with many fewer nodes
(this was easy to see how to do with rectangular grids in
Lecture 17 -- take every other node -- but is a little harder for general
graphs), partition the coarser (and so easier) graph, and use the
coarse partition as a starting guess for a partition of the original graph.
The coarse graph may in turn be partitioned by using the same algorithm
recursively: a yet coarser graph is partitioned to get a starting guess,
and so on.

Which algorithm is best? As of this writing, Kernighan-Lin combined with the multilevel technique is the fastest available technique, although the geometric methods are also fast and effective on graphs with appropriate properties. The multilevel Kernighan-Lin algorithm is available both in the Chaco and METIS packages described below. After discussing these methods in subsequent sections, we will return to discuss the performance of many of these methods on a set of test problems.

Another package is called Chaco, available from Bruce Hendrickson at Sandia National Labs, bahendr@cs.sandia.gov. ("The Chaco User's Guide, Version 1.0", Technical Report SAND93-2339, Sandia National Labs, Albuquerque NM 1993). The Matlab software described above also provides an interface to the algorithms in Chaco, although Chaco must still be obtained directly from Sandia. There are more details in the file meshpart.uu.

The package METIS by George Karypis at the University of Minnesota is also available electronically.

Both METIS and Chaco include implementations of multi-level Kernighan-Lin, the currently fastest available algorithm.

Yet another package is JOSTLE, which was written by C. Walshaw, M. Cross, and M. Everett at the University of Greenwich in the UK. It is available free to researchers, but a license agreement is involved. Further information is available here, and the license agreement is here.

The other way to bisect a graph is to find a *vertex separator*,
a subset N_{s} of N, such that removing N_{s} and all incident edges from G
also results in two disconnected subgraphs G_{1} and G_{2} of G. In other words
N = N_{1} U N_{s} U N_{2}, where all three subsets of N are disjoint,
N_{1} and N_{2} are equally large, and no edges connect N_{1} and N_{2}.

The following figure illustrates these ideas. The green edges E_{s1}
form an edge separator, as do the blue edges E_{s2}. The red nodes, N_{s},
are a vertex separator, since removing them and the incident edges
(E_{s1}, E_{s2}, and the purple edges) leaves two disjoint subgraphs.

Given an edge separator E_{s}, one can produce a vertex separator N_{s} by taking one
end-point of each edge in E_{s}; N_{s} clearly has at most as many members as E_{s}.
Given a vertex separator N_{s}, one can produce an edge separator E_{s} by
taking all edges incident to N_{s}; E_{s} clearly has at most d times as many members
as N_{s}, where d is the maximum degree (number of incident edges) of any node.
Thus, if d is small, finding good edge separators and good vertex separators
are closely related problems, a fact we exploit below.
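Both conversions are one-liners; here is a sketch (graph data hypothetical) of each direction:

```python
# Sketch of the conversions described above: from an edge separator E_s take
# one endpoint per edge to get N_s; from a vertex separator N_s take all
# incident edges to get E_s (at most d*|N_s| of them).

def edge_to_vertex_separator(E_s):
    return {i for (i, j) in E_s}          # one endpoint per cut edge

def vertex_to_edge_separator(N_s, edges):
    return {e for e in edges if e[0] in N_s or e[1] in N_s}

edges = {(0, 1), (1, 2), (2, 3), (1, 3)}
E_s = {(1, 2), (1, 3)}                    # cut edges between {0,1} and {2,3}
N_s = edge_to_vertex_separator(E_s)
print(N_s)                                # {1}: |N_s| <= |E_s|
print(sorted(vertex_to_edge_separator(N_s, edges)))   # all edges touching node 1
```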

It turns out that it is generally possible to find a vertex separator containing about the square root of the number of nodes. For example, in an m-by-m mesh a single row or column of nodes is a separator, so

   |N_{s}| = m = sqrt(m^{2}) = sqrt(number of nodes).

**Theorem.** (Lipton, Tarjan, "A separator theorem for planar graphs", SIAM J. Appl. Math.,
36:177-189, April 1979). *Let G=(N,E) be a planar graph. Then we can find a vertex
separator N _{s}, so that N = N_{1} U N_{s} U N_{2} is a disjoint partition of N,
|N_{1}| <= (2/3)*|N|, |N_{2}| <= (2/3)*|N|, and |N_{s}| <= sqrt(8*|N|).*

In other words, there is a vertex separator N_{s} that separates the nodes N into two parts,
N_{1} and N_{2}, neither of which can be more than twice as large as the other, and
N_{s} contains no more than a small constant times the square root of the number of nodes.
This intuition motivates the algorithms below.

- Choose a straight line L, given by a*(x-xbar) + b*(y-ybar) = 0. This is a straight line through (xbar,ybar), with slope -a/b. We assume without loss of generality that a^{2} + b^{2} = 1.
- For each node n_{i} = (x_{i},y_{i}), compute a coordinate by computing the dot-product S_{i} = -b*(x_{i}-xbar) + a*(y_{i}-ybar). S_{i} is the distance from (xbar,ybar) of the projection of (x_{i},y_{i}) onto the line L.
- Find the median value Sbar of the S_{i}'s.
- Let the nodes (x_{i},y_{i}) satisfying S_{i} <= Sbar be in partition N_{1}, and the nodes with S_{i} > Sbar be in partition N_{2}.
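Given a line (a, b, xbar, ybar), the projection-and-median steps can be sketched as follows (numpy assumed, example data hypothetical):

```python
# A sketch of the steps above (numpy assumed): project each node onto the
# line L through (xbar, ybar) with unit normal (a, b), then split at the
# median projection value Sbar.
import numpy as np

def split_by_line(xy, a, b, xbar, ybar):
    """xy: n-by-2 array of node coordinates; (a, b) with a^2 + b^2 = 1."""
    # S_i = -b*(x_i - xbar) + a*(y_i - ybar): signed position along L
    S = -b * (xy[:, 0] - xbar) + a * (xy[:, 1] - ybar)
    Sbar = np.median(S)
    N1 = np.where(S <= Sbar)[0]
    N2 = np.where(S > Sbar)[0]
    return N1, N2

xy = np.array([[0., 0.], [1., 0.], [2., 0.], [3., 0.]])
N1, N2 = split_by_line(xy, a=0.0, b=1.0, xbar=1.5, ybar=0.0)
print(N1, N2)   # [2 3] [0 1]: the nodes split into two equal halves
```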

It remains to show how to pick the line L. Intuitively, if the nodes are located in a long, thin region of the plane, as in the above example, we want to pick L along the long axis of this region:

In mathematical
terms, we want to pick a line such that the sum of squares of the lengths of the
green lines in the figure is minimized; this is also called doing a
*total least squares fit* of a line to the nodes. In physical terms,
if we think of the nodes as unit masses, we choose L to be the axis about
which the moment of inertia of the nodes is minimized. This is why the method
is called inertial partitioning. This means choosing a, b, xbar and ybar so that
a^{2} + b^{2} = 1, and the following quantity is minimized:

   sum_{i=1,...,|N|} (length of i-th green line)^{2}
      = sum_{i=1,...,|N|} ( (x_{i}-xbar)^{2} + (y_{i}-ybar)^{2}
                            - (-b*(x_{i}-xbar) + a*(y_{i}-ybar))^{2} )
            ... by the Pythagorean theorem
      = a^{2} * ( sum_{i=1,...,|N|} (x_{i}-xbar)^{2} )
        + b^{2} * ( sum_{i=1,...,|N|} (y_{i}-ybar)^{2} )
        + 2*a*b * ( sum_{i=1,...,|N|} (x_{i}-xbar)*(y_{i}-ybar) )
      = a^{2} * X2 + b^{2} * Y2 + 2*a*b * XY
      = [ a b ] * [ X2  XY ] * [ a ]
                  [ XY  Y2 ]   [ b ]
      = [ a b ] * M * [ a ]
                      [ b ]

where X2, Y2 and XY are the summations in the previous lines. One can show that the minimum is attained by choosing

   xbar = sum_{i=1,...,|N|} x_{i} / |N|
   ybar = sum_{i=1,...,|N|} y_{i} / |N|

i.e. (xbar,ybar) is the "center of mass" of the nodes, and (a,b) is the unit eigenvector corresponding to the smallest eigenvalue of the matrix M.

The 3D (and higher) case is similar, except that we get a 3-by-3 matrix whose eigenvector we need.
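The whole line-choosing step is a small eigenproblem; here is a sketch in 2D (numpy assumed, example coordinates hypothetical):

```python
# Sketch (numpy assumed): form the 2-by-2 inertia matrix M from the centered
# coordinates and take the unit eigenvector of its smallest eigenvalue as (a, b).
import numpy as np

def inertial_line(xy):
    xbar, ybar = xy.mean(axis=0)          # center of mass
    dx, dy = xy[:, 0] - xbar, xy[:, 1] - ybar
    M = np.array([[np.sum(dx * dx), np.sum(dx * dy)],
                  [np.sum(dx * dy), np.sum(dy * dy)]])
    w, V = np.linalg.eigh(M)              # eigenvalues in ascending order
    a, b = V[:, 0]                        # eigenvector of the smallest eigenvalue
    return a, b, xbar, ybar

# A long thin cloud along the x-axis: the minimizing line should be (near)
# horizontal, i.e. its unit normal (a, b) should be close to (0, +-1).
xy = np.array([[0., 0.], [1., .1], [2., -.1], [3., 0.], [4., .05]])
a, b, xbar, ybar = inertial_line(xy)
print(round(abs(b), 2))   # close to 1
```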

To extend our intuition about planar graphs to higher dimensions,
we need a notion of graphs that are *well-shaped* in some sense.
For example, we might ask that the cells created by edges joining nodes
(triangles in 2D, simplices in 3D, etc.) cannot have angles
that are too large or too small. The following approach is taken by G. Miller,
S.-H. Teng, W. Thurston and S. Vavasis ("Automatic mesh partitioning",
in Graph Theory and Sparse Matrix Computation, Springer-Verlag 1993, edited by
J. George, J. Gilbert and J. Liu). Our presentation is drawn from
"Geometric Mesh Partitioning" by J. Gilbert, G. Miller and S.-H. Teng.

**Definition. ** A *k-ply neighborhood system in d dimensions*
is a set {D_{1},...,D_{n}} of closed disks in d-dimensional real space R^{d},
such that no point in R^{d} is strictly interior to more than k disks.

**Definition. ** An *(alpha,k) overlap graph*
is a graph defined in terms of a constant alpha >= 1,
and a k-ply neighborhood system {D_{1},...,D_{n}}. There is a node
for each disk D_{i}. There is an edge (i,j) if expanding the radius
of the smaller of D_{i} and D_{j} by a factor alpha causes the two disks
to overlap.
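The edge rule can be sketched directly (disk data hypothetical): (i,j) is an edge when expanding the smaller radius by alpha makes the two disks touch or overlap.

```python
# A sketch of the edge rule (hypothetical data): disks are given by centers
# and radii, and (i, j) is an edge when expanding the smaller radius by a
# factor alpha causes the two disks to overlap.
import math

def overlap_edges(centers, radii, alpha):
    n, edges = len(centers), []
    for i in range(n):
        for j in range(i + 1, n):
            dist = math.dist(centers[i], centers[j])
            small, big = sorted((radii[i], radii[j]))
            if alpha * small + big >= dist:   # expanded disks touch or overlap
                edges.append((i, j))
    return edges

# Unit-spaced mesh points with radius 1/2: alpha = 1 links nearest neighbors.
centers = [(0, 0), (1, 0), (2, 0)]
radii = [0.5, 0.5, 0.5]
print(overlap_edges(centers, radii, alpha=1.0))   # [(0, 1), (1, 2)]
```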

For example, an n-by-n mesh is a (1,1) overlap graph, as shown below. The same is true for any d-dimensional grid. One can show any planar graph is an (alpha,k) overlap graph for suitable values of alpha and k.

One can prove the following generalization of the Lipton-Tarjan theorem:

**Theorem.** *(Miller, Teng, Thurston, Vavasis). Let G=(N,E) be an
(alpha,k) overlap graph in d dimensions with n nodes. Then there is a vertex
separator N _{s}, so that N = N_{1} U N_{s} U N_{2},
with the following properties:
*

- N_{1} and N_{2} each have at most n*(d+1)/(d+2) nodes, and
- N_{s} has at most O(alpha * k^{1/d} * n^{(d-1)/d}) nodes.

The separator is constructed by choosing a sphere in d-dimensional space, using the edges it cuts as an edge separator, and extracting the corresponding vertex separator. There is a randomized algorithm for choosing the sphere from a particular probability distribution such that it satisfies the conditions of the theorem with high probability.

To describe the algorithm, we need to explain *stereographic projection*,
which is a way of mapping points in a d-dimensional space to the unit
sphere centered at the origin in (d+1) dimensional space. When d=2,
this is done by drawing a line from the point p to the north pole of
the sphere; the point p' where the line and sphere intersect is the
stereographic projection of p. The same idea applies to higher dimensions.
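Stereographic projection has a simple closed form; here is a sketch (numpy assumed) mapping a point p in R^{d} to the unit sphere in R^{d+1}:

```python
# A sketch of stereographic projection (numpy assumed): a point p in R^d maps
# to the intersection, other than the north pole, of the unit sphere in
# R^{d+1} with the line joining (p, 0) to the north pole (0, ..., 0, 1).
import numpy as np

def stereo_up(p):
    p = np.asarray(p, dtype=float)
    s = p @ p                               # |p|^2
    return np.append(2 * p / (s + 1), (s - 1) / (s + 1))

q = stereo_up([0.0, 0.0])
print(q)                                      # [ 0.  0. -1.]: origin -> south pole
print(np.linalg.norm(stereo_up([3.0, 4.0])))  # approximately 1: lands on the sphere
```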

Here is the algorithm. Matlab software implementing it is available via anonymous ftp from ftp.parc.xerox.com at /pub/gilbert/meshpart.uu.

- Project up. This means that we stereographically project the nodes of the graph in d dimensions to the unit sphere in d+1 dimensions, as described above.
- Find the *centerpoint* of the projected points. The centerpoint of a set of points has the property that every hyperplane through the centerpoint divides the points into two approximately equal subsets (in the sense that the number of points on one side of the hyperplane cannot exceed the number of points on the other side by a factor exceeding d). A centerpoint can be found by linear programming, but cheaper heuristics exist. See the paper by Gilbert, Miller and Teng for details.
- Conformally map the points on the sphere. This is done in two steps. First, rotate the projected points around the origin, including the centerpoint, so that the centerpoint has coordinates (0,...,0,r), for some r. Second, dilate the points. This is done by reversing the stereographic projection (projecting the rotated points on the sphere back to the plane), multiplying their coordinates in the plane by sqrt((1-r)*(1+r)), and stereographically projecting back to the sphere. One can show that the new centerpoint of the conformally mapped points lies at the origin of the sphere.
- Pick a random d-dimensional plane through the centerpoint (origin of the sphere). The intersection of this plane and the sphere is a circle. This circle is the stereographic image of another circle C in d-dimensional space.
- Convert the circle C to a vertex separator. A node i belongs to N_{s} if its disk D_{i}, with radius magnified by alpha, intersects C.

More recently, a slightly different approach was taken in
"Partitioning Meshes with Lines and Planes",
by F. Cao, J. Gilbert, and S-H. Teng.
Their algorithm is even simpler than the one above, but its analysis
requires the following slightly different notion of a *well-shaped* graph.
Assume that the nodes of the graph G have coordinates in R^{d},
as before.

Let n be a node in G, and let alpha(n) be the ratio

              Euclidean distance from n to the farthest node to which it is connected
   alpha(n) = ------------------------------------------------------------------------
              Euclidean distance from n to the nearest other node, whether connected to n or not

Note that alpha(n) >= 1. Then let alpha(G) be the largest value of alpha(n) for any n. We say that G is an *alpha(G)-well-shaped* mesh.

Next, we define the *trestle* D(G) of G as follows.
Let m be the shortest length of any edge of G, and M the longest length.
Then

   D(G) = 1 + log_{2}(M/m)

The idea is that D(G) is the number of significant bits needed to resolve the shortest and longest edges. If all edges are equally long, M = m and D(G) = 1.
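Both quantities are straightforward to compute from coordinates and adjacency; here is a sketch (helper names hypothetical):

```python
# A sketch (hypothetical helper names): compute alpha(G) from node coordinates
# and the adjacency, and the trestle D(G) from the extreme edge lengths.
import math

def alpha_and_trestle(coords, adj):
    """coords: {node: (x, y)}, adj: {node: set of neighbors}."""
    alphas, lengths = [], []
    for n, nbrs in adj.items():
        far = max(math.dist(coords[n], coords[j]) for j in nbrs)
        near = min(math.dist(coords[n], coords[j]) for j in coords if j != n)
        alphas.append(far / near)
        lengths += [math.dist(coords[n], coords[j]) for j in nbrs]
    M, m = max(lengths), min(lengths)
    return max(alphas), 1 + math.log2(M / m)   # alpha(G), D(G)

# A path of three unit-spaced nodes: all edges equally long.
coords = {0: (0, 0), 1: (1, 0), 2: (2, 0)}
adj = {0: {1}, 1: {0, 2}, 2: {1}}
print(alpha_and_trestle(coords, adj))   # (1.0, 1.0)
```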

The partitioning algorithm is simply

- Find a centerpoint c in R^{d} of the graph G.
- Pick a random hyperplane H through c.
- Partition the graph using H to find a vertex separator as follows: For each node n, let B(n) be a ball centered at n and with radius equal to the distance to the farthest node to which it is connected. If B(n) intersects H, n is part of the vertex separator N_{s}. The nodes on opposite sides of H whose balls B(n) do not intersect H form the partitions N_{1} and N_{2}.
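The last step can be sketched as follows (numpy assumed; the centerpoint c and hyperplane normal u are taken as given, hypothetical inputs):

```python
# A sketch of the last step (numpy assumed): classify each node against a
# hyperplane H through the centerpoint c with unit normal u, using the ball
# B(n) with radius equal to the distance to n's farthest neighbor.
import numpy as np

def ball_separator(coords, adj, c, u):
    """coords: n-by-d array, adj: {node: neighbors}, H = {x : u.(x - c) = 0}."""
    N1, N2, Ns = [], [], []
    for n, nbrs in adj.items():
        radius = max(np.linalg.norm(coords[n] - coords[j]) for j in nbrs)
        signed = u @ (coords[n] - c)        # signed distance of n from H
        if abs(signed) <= radius:
            Ns.append(n)                    # B(n) touches H: n joins the separator
        elif signed < 0:
            N1.append(n)
        else:
            N2.append(n)
    return N1, Ns, N2

coords = np.array([[0., 0.], [1., 0.], [2., 0.], [3., 0.]])
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(ball_separator(coords, adj, c=np.array([1.5, 0.]), u=np.array([1., 0.])))
# ([0], [1, 2], [3]): the two middle nodes form the vertex separator
```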

Finally, Cao, Gilbert and Teng prove the following

**Theorem.** Let G be an alpha(G)-well-shaped mesh
in d dimensions,
with n nodes,
and with trestle D(G).
Then the above algorithm finds a vertex separator N_{s},
so N = N_{1} U N_{2} U N_{s},
with the following properties:

- N_{1} and N_{2} each have at most n*d/(d+1) nodes.
- N_{s} has at most O( alpha(G) * D(G)^{1/d} * n^{(d-1)/d} ) nodes.

The implementation requires a data structure called a *Queue*, or a
*First-In-First-Out (FIFO)* list. It will contain a list of objects to be processed.
There are two operations one can perform on a Queue. Enqueue(x) adds an object x to the
left end of the Queue. y=Dequeue() removes the rightmost entry of the Queue and returns it in y.
In other words, if x_{1}, x_{2}, ..., x_{k} are Enqueued on the Queue in that order, then k consecutive
Dequeue operations (possibly interleaved with the Enqueue operations) will return x_{1}, x_{2}, ... , x_{k}.
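These two operations map directly onto Python's standard double-ended queue, for instance (a minimal sketch):

```python
# A minimal sketch of the Queue operations using Python's standard deque.
from collections import deque

queue = deque()
for x in (1, 2, 3):
    queue.appendleft(x)        # Enqueue(x): add to the left end
first = queue.pop()            # Dequeue(): remove and return the rightmost entry
print(first)                   # 1 -- first in, first out
```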

   NT = {(r,0)}                 ... initially T is just the root r, which is at level 0
   ET = empty set               ... T = (NT, ET) at each stage of the algorithm
   Enqueue((r,0))               ... Queue is a list of nodes to be processed
   Mark r                       ... mark the root r as having been processed
   While the Queue is nonempty  ... while nodes remain to be processed
       (n,level) = Dequeue()    ... get a node to process
       For all unmarked children c of n
           NT = NT U {(c,level+1)}   ... add child c to the node list NT of T
           ET = ET U {(n,c)}         ... add the edge (n,c) to the edge list ET of T
           Enqueue((c,level+1))      ... add child c to the Queue for later processing
           Mark c                    ... mark c as having been visited
       End for
   End while

For example, the following figure shows a graph, as well as the embedded BFS tree (edges shown in black). In addition to the tree edges, there are two other kinds of edges.
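The breadth-first search above can be sketched in Python with a standard deque (graph data hypothetical):

```python
# The BFS tree construction, sketched in Python with a standard deque.
from collections import deque

def bfs_tree(adj, r):
    """adj: {node: neighbors}; returns node levels and the tree edges ET."""
    level = {r: 0}                   # marking a node == assigning it a level
    ET = []
    queue = deque([r])
    while queue:
        n = queue.popleft()          # Dequeue the next node to process
        for c in adj[n]:
            if c not in level:       # unmarked child
                level[c] = level[n] + 1
                ET.append((n, c))
                queue.append(c)      # Enqueue c for later processing
    return level, ET

adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
level, ET = bfs_tree(adj, 0)
print(level)   # {0: 0, 1: 1, 2: 1, 3: 2}
print(ET)      # [(0, 1), (0, 2), (1, 3)]
```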

This fact means that simply partitioning the graph into nodes at level L or lower, and
nodes at level L+1 or higher, guarantees that only tree and interlevel edges will be
cut. There can be no "extra" edges connecting, say, the root to the leaves of the tree.
This is illustrated in the above figure, where the 10 nodes above the dotted blue line are
assigned to partition N_{1}, and the 10 nodes below the line are assigned to N_{2}.

For example, suppose one had an n-by-n mesh with unit distance between nodes. Choose any node r as root from which to build a BFS tree. Then the nodes at level L and above approximately form a diamond centered at r with a diagonal of length 2*L. This is shown below, where nodes are visited counterclockwise starting with the north.

We start with an edge-weighted graph G=(N,E,W_{E}), and a partitioning N = A U B into equal
parts: |A| = |B|. Let w(e) = w(i,j) be the weight of edge e=(i,j), where the weight is 0
if no edge e=(i,j) exists. The goal is to find equal-sized subsets X in A and Y in B, such
that exchanging X and Y reduces the total cost of edges from A to B. More precisely,
we let

   T = sum_{[ a in A and b in B ]} w(a,b) = cost of edges from A to B

and seek X and Y such that

   new_A = A - X U Y   and   new_B = B - Y U X

has a lower cost new_T. To compute new_T efficiently, we introduce

   E(a) = external cost of a = sum_{[ b in B ]} w(a,b)
   I(a) = internal cost of a = sum_{[ a' in A, a' != a ]} w(a,a')
   D(a) = cost of a = E(a) - I(a)

and analogously

   E(b) = external cost of b = sum_{[ a in A ]} w(a,b)
   I(b) = internal cost of b = sum_{[ b' in B, b' != b ]} w(b,b')
   D(b) = cost of b = E(b) - I(b)

Then it is easy to show that swapping a in A and b in B changes T to

   new_T = T - ( D(a) + D(b) - 2*w(a,b) ) = T - gain(a,b)

In other words, gain(a,b) = D(a) + D(b) - 2*w(a,b) measures the improvement in the partitioning from swapping a and b. D(a') and D(b') also change to

   new_D(a') = D(a') + 2*w(a',a) - 2*w(a',b)   for all a' in A, a' != a
   new_D(b') = D(b') + 2*w(b',b) - 2*w(b',a)   for all b' in B, b' != b
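The bookkeeping above can be sketched directly (graph data hypothetical, symmetric unit weights stored both ways in a dict):

```python
# Sketch of the bookkeeping above: external cost E, internal cost I,
# D = E - I, and gain(a, b) = D(a) + D(b) - 2*w(a, b).

def D(n, side, other, w):
    E = sum(w.get((n, m), 0) for m in other)            # external cost
    I = sum(w.get((n, m), 0) for m in side if m != n)   # internal cost
    return E - I

# Symmetric weights stored both ways; unit weights on the 4-cycle 0-1-2-3-0.
w = {}
for i, j in [(0, 1), (1, 2), (2, 3), (3, 0)]:
    w[(i, j)] = w[(j, i)] = 1

A, B = {0, 1}, {2, 3}
gain = D(0, A, B, w) + D(2, B, A, w) - 2 * w.get((0, 2), 0)
print(gain)   # 0: swapping 0 and 2 in this symmetric cycle gains nothing
```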

Now we can state the Kernighan/Lin algorithm (costs shown in comments)

   (0) Compute T = cost of partition N = A U B               ... cost = O(|N|^{2})
   Repeat
       (1) Compute costs D(n) for all n in N                 ... cost = O(|N|^{2})
       (2) Unmark all nodes in G                             ... cost = O(|N|)
       (3) While there are unmarked nodes                    ... |N|/2 iterations
           (3.1) Find an unmarked pair (a,b) maximizing gain(a,b)   ... cost = O(|N|^{2})
           (3.2) Mark a and b (but do not swap them)         ... cost = O(1)
           (3.3) Update D(n) for all unmarked n,
                 as though a and b had been swapped          ... cost = O(|N|)
       End while
       ... At this point, we have computed a sequence of pairs
       ... (a_{1},b_{1}), ..., (a_{k},b_{k}) and gains gain(1), ..., gain(k),
       ... where k = |N|/2, ordered by the order in which we marked them
       (4) Pick j maximizing Gain = sum_{i=1...j} gain(i)
           ... Gain is the reduction in cost from swapping (a_{1},b_{1}),...,(a_{j},b_{j})
       (5) If Gain > 0 then
           (5.1) Update A = A - {a_{1},...,a_{j}} U {b_{1},...,b_{j}}   ... cost = O(|N|)
           (5.2) Update B = B - {b_{1},...,b_{j}} U {a_{1},...,a_{j}}   ... cost = O(|N|)
           (5.3) Update T = T - Gain                         ... cost = O(1)
       End if
   Until Gain <= 0
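One pass of the Repeat loop can be sketched in Python as follows (a simple illustration with unit edge weights stored symmetrically in a dict, not Kernighan and Lin's optimized implementation):

```python
# A compact sketch of one pass of the Repeat loop above (unit edge weights
# stored both ways in a dict `w`; not an optimized implementation).

def kl_pass(A, B, w):
    """Return the best swap lists (a_1..a_j, b_1..b_j) and their Gain."""
    D = {}
    for n in A:   # D(n) = external cost - internal cost, as defined above
        D[n] = (sum(w.get((n, m), 0) for m in B)
                - sum(w.get((n, m), 0) for m in A if m != n))
    for n in B:
        D[n] = (sum(w.get((n, m), 0) for m in A)
                - sum(w.get((n, m), 0) for m in B if m != n))
    pairs, gains = [], []
    unA, unB = set(A), set(B)
    while unA and unB:
        # (3.1) greedily pick the unmarked pair with the largest gain
        a, b = max(((x, y) for x in unA for y in unB),
                   key=lambda p: D[p[0]] + D[p[1]] - 2 * w.get(p, 0))
        gains.append(D[a] + D[b] - 2 * w.get((a, b), 0))
        pairs.append((a, b))
        unA.remove(a); unB.remove(b)    # (3.2) mark a and b, do not swap yet
        for n in unA:                   # (3.3) update D as if a, b were swapped
            D[n] += 2 * w.get((n, a), 0) - 2 * w.get((n, b), 0)
        for n in unB:
            D[n] += 2 * w.get((n, b), 0) - 2 * w.get((n, a), 0)
    # (4) choose the prefix j of tentative swaps maximizing the total Gain
    best_j, best_gain, running = 0, 0, 0
    for j, g in enumerate(gains, start=1):
        running += g
        if running > best_gain:
            best_j, best_gain = j, running
    return pairs[:best_j], best_gain

# Two triangles {0,1,2} and {3,4,5} joined by edge (2,3), badly split:
w = {}
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    w[(i, j)] = w[(j, i)] = 1
swaps, Gain = kl_pass({0, 1, 5}, {2, 3, 4}, w)
print(swaps, Gain)   # [(5, 2)] 3 -- swapping 5 and 2 repairs the partition
```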

In one pass through the Repeat loop, this algorithm computes |N|/2 possible
pairs of sets to swap, X = {a_{1},...,a_{j}} and Y = {b_{1},...,b_{j}}, for j = 1 to |N|/2.
These sets are chosen greedily, so that swapping a_{i} and b_{i} makes the most progress
of any single swap of unmarked nodes at step i. The best of these |N|/2 pairs of sets
is chosen by maximizing Gain. Note that some gains gain(i) may be negative, if no improvement
can be made by swapping just two nodes. Nevertheless, later gains may be large,
and so the algorithm can escape this "local minimum".

The cost of each step is shown to the right of the algorithm. Since loop (3) is
repeated O(|N|) times, the cost of one pass through the Repeat loop is
O(|N|^{3}).
The number of passes through the Repeat loop is hard to predict. Empirical testing
by Kernighan and Lin on small graphs (|N| <= 360) showed convergence after 2 to 4
passes. For a random graph, the probability of getting the optimum in one pass
appears to shrink like 2^{-|N|/30}. A parallel implementation of this algorithm
was designed by J. Gilbert and E. Zmijewski in 1987.

More recently, other significant improvements to Kernighan/Lin were proposed in ``A fast and high quality multilevel scheme for partitioning irregular graphs'', by G. Karypis and V. Kumar (click here to get this and related papers).