CS 61B: Lecture 28 Wednesday, April 2, 2014 GRAPHS ====== A graph G is a set V of vertices (sometimes called nodes), and a set E of edges (sometimes called arcs) that each connect two vertices together. To confuse you, mathematicians often use the notation G = (V, E). Here, "(V, E)" is an _ordered_pair_ of sets. This isn't as deep and meaningful as it sounds; some people just love formalism. The notation is convenient when you want to discuss several graphs with the same vertices; e.g. G = (V, E) and T = (V, F). Graphs come in two types: _directed_ and _undirected_. In a directed graph (or _digraph_ for short), every edge e is directed from some vertex v to some vertex w. We write "e = (v, w)" (also an ordered pair), and draw an arrow pointing from v to w. The vertex v is called the _origin_ of e, and w is the _destination_ of e. In an undirected graph, edges have no favored direction, so we draw a curve connecting v and w. We still write e = (v, w), but now it's an unordered pair, which means that (v, w) = (w, v). One application of a graph is to model a street map. For each intersection, define a vertex that represents it. If two intersections are connected by a length of street with no intervening intersection, define an edge connecting them. We might use an undirected graph, but if there are one-way streets, a directed graph is more appropriate. We can model a two-way street with two edges, one pointing in each direction. On the other hand, if we want a graph that tells us which cities adjoin each other, an undirected graph makes sense. --- Bancroft --- --- -------- ------------ |1|<------------|2|<------------|3| |Albany|------|Kensington| --- --- --- -------- ------------ | ^ | ^ \ / Dana | Telegraph | Bowditch | | ------------ ---------- v | v | |Emeryville|-----|Berkeley| --- --- --- ------------ ---------- |4|------------>|5|------------>|6| \ / --- Durant --- --- --------- ---------- |Oakland|-----|Piedmont| Multiple copies of an edge are forbidden, --------- ---------- but a directed graph may contain both (v, w) and (w, v). Both types of graph can have _self-edges_ of the form (v, v), which connect a vertex to itself. (Many applications, like the two illustrated above, don't use these.) A _path_ is a sequence of vertices such that each adjacent pair of vertices is connected by an edge. If the graph is directed, the edges that form the path must all be aligned with the direction of the path. The _length_ of a path is the number of edges it traverses. Above, <4, 5, 6, 3> is a path of length 3. It is perfectly respectable to talk about a path of length zero, such as <2>. The _distance_ from one vertex to another is the length of the shortest path from one to the other. A graph is _strongly_connected_ if there is a path from every vertex to every other vertex. (This is just called _connected_ in undirected graphs.) Both graphs above are strongly connected. The _degree_ of a vertex is the number of edges incident on that vertex. (Self-edges count just once in 61B.) Hence, Berkeley has degree 4, and Piedmont has degree 1. A vertex in a directed graph has an _indegree_ (the number of edges directed toward it) and an _outdegree_ (the number of edges directed away). Intersection 6 above has indegree 2 and outdegree 1. Graph Representations --------------------- There are two popular ways to represent a graph. The first is an _adjacency_ _matrix_, a |V|-by-|V| array of boolean values (where |V| is the number of vertices in the graph). Each row and column represents a vertex of the graph. Set the value at row i, column j to true if (i, j) is an edge of the graph. If the graph is undirected (below right), the adjacency matrix is _symmetric_: row i, column j has the same value as row j, column i. 1 2 3 4 5 6 Alb Ken Eme Ber Oak Pie 1 - - - T - - Albany - T - T - - 2 T - - - - - Kensington T - - T - - 3 - T - - - T Emeryville - - - T T - 4 - - - - T - Berkeley T T T - T - 5 - T - - - T Oakland - - T T - T 6 - - T - - - Piedmont - - - - T - It should be clear that the maximum possible number of edges is |V|^2 for a directed graph, and slightly more than half that for an undirected graph. In many applications, however, the number of edges is much less than Theta(|V|^2). For instance, our maps above are _planar_graphs_ (graphs that can be drawn without edges crossing), and all planar graphs have O(|V|) edges. A graph is called _sparse_ if it has far fewer edges than the maximum possible number. ("Sparse" has no precise definition, but it usually implies that the number of edges is asymptotically smaller than |V|^2.) For a sparse graph, an adjacency matrix representation is very wasteful. A more memory-efficient data structure for sparse graphs is the _adjacency_ _list_. An adjacency list is actually a collection of lists. Each vertex v maintains a list of the edges directed out from v. The standard list representations all work--arrays (below left), linked lists (below right), even search trees (because you can traverse one in linear time). --- ----- --- ------ ------ 1 |.+-> | 4 | Albany |.+-> |Ken.+-> |Ber*| --- ===== === ====== ====== 2 |.+-> | 1 | Kensington |.+-> |Alb.+-> |Ber*| --- =====---- === ====== ====== 3 |.+-> | 2 | 6 | Emeryville |.+-> |Ber.+-> |Oak*| --- =====---- === ====== ====== ------ ------ 4 |.+-> | 5 | Berkeley |.+-> |Alb.+-> |Ken.+-> |Eme.+-> |Oak*| --- =====---- === ====== ====== ====== ------ 5 |.+-> | 2 | 6 | Oakland |.+-> |Eme.+-> |Ber.+-> |Pie*| --- =====---- === ====== ------ ------ 6 |.+-> | 3 | Piedmont |.+-> |Oak*| --- ----- --- ------ The total memory used by all the lists is Theta(|V| + |E|). If your vertices have consecutive integer names, you can declare an array of lists, and find any vertex's list in O(1) time. If your vertices have names like "Albany," you can use a hash table to map names to lists. Each entry in the hash table uses a vertex name as a key, and a List object as the associated value. You can find the list for any label in O(1) average time. An adjacency list is more space- and time-efficient than an adjacency matrix for a sparse graph, but less efficient for a _complete_graph_. A complete graph is a graph having every possible edge; that is, for every vertex u and every vertex v, (u, v) is an edge of the graph. Graph Traversals ---------------- We'll look at two types of graph traversals, which can be used on either directed or undirected graphs to visit each vertex once. Depth-first search (DFS) starts at an arbitrary vertex and searches a graph as "deeply" as possible as early as possible. For example, if your graph is an undirected tree, DFS performs a preorder (or if you prefer, postorder) tree traversal. Breadth-first search (BFS) starts by visiting an arbitrary vertex, then visits all vertices whose distance from the starting vertex is one, then all vertices whose distance from the starting vertex is two, and so on. If your graph is an undirected tree, BFS performs a level-order tree traversal. In a graph, unlike a tree, there may be several ways to get from one vertex to another. Therefore, each vertex has a boolean field called "visited" that tells us if we have visited the vertex before, so we don't visit it twice. When we say we are "marking a vertex visited", we are setting its "visited" field to true. Assume that we are traversing a strongly connected graph, thus there is a path from the starting vertex to every other vertex. When DFS visits a vertex u, it checks every vertex v that can be reached by some edge (u, v). If v has not yet been visited, DFS visits it recursively. public class Graph { // Before calling dfs(), set every "visited" flag to false; takes O(|V|) time public void dfs(Vertex u) { u.visit(); // Do some unspecified thing to u u.visited = true; // Mark the vertex u visited for (each vertex v such that (u, v) is an edge in E) { if (!v.visited) { dfs(v); } } } } In this DFS pseudocode, a "visit()" method is defined that performs some action on a specified vertex. For instance, if we want to count the total population of the city graph above, "visit()" might add the population of the visited city to the grand total. The order in which cities are visited depends partly on their order in the adjacency lists. The sequence of figures below shows the behavior of DFS on our street map, starting at vertex 1. A "V" is currently visited; an "x" is marked visited; a "*" is a vertex which we try to visit but discover has already been visited. V<-2<-3 x<-2<-3 x<-2<-3 x<-V<-3 *<-V<-3 x<-x<-3 x<-x<-V x<-*<-V x<-x<-V | ^ ^ | ^ ^ | ^ ^ | ^ ^ | ^ ^ | ^ ^ | ^ ^ | ^ ^ | ^ ^ v | v v | v v | v v | v v | v v | v v | v v | v v | v 4->5->6 V->5->6 x->V->6 x->x->6 x->x->6 x->x->V x->x->x x->x->x x->x->* DFS runs in O(|V| + |E|) time if you use an adjacency list; O(|V|^2) time if you use an adjacency matrix. Hence, an adjacency list is asymptotically faster if the graph is sparse. What's an application of DFS? Suppose you want to determine whether there is a path from a vertex u to another vertex v. Just do DFS from u, and see if v gets visited. (If not, you can't there from here.) I'll discuss BFS in the next lecture. More on the Running Time of DFS ------------------------------- Why does DFS on an adjacency list run in O(|V| + |E|) time? The O(|V|) component comes up solely because we have to initialize all the "visited" flags to false (or at least construct an array of flags) before we start. The O(|E|) component is trickier. Take a look at the "for" loop of the dfs() pseudocode above. How many times does it iterate? If the vertex u has outdegree d(u), then the loop iterates d(u) times. Different vertices have different degrees. Let the total degree D be the sum of the outdegrees of all the vertices in V. D = sum d(v). v in V A call to dfs(u) takes O(d(u)) time, NOT counting the time for the recursive calls it makes to dfs(). A depth-first search never calls dfs() more than once on the same vertex, so the total running time of all the calls to dfs() is in O(D). How large is D? - If G is a directed graph, then D = |E|, the number of edges. - If G is an undirected graph with no self-edges, then D = 2|E|, because each edge offers a path out of two vertices. - If G is an undirected graph with one or more self-edges, then D < 2|E|. In all three cases, the running time of depth-first search is in O(|E|), NOT counting the time required to initialize the "visited" flags.