Math 110 - Fall 05 - Lecture notes # 33 - Nov 16 (Wednesday)

We continue chapter 5, concluding with a brief discussion of section 5.3 on
Markov chains, including a discussion of how Google works, i.e. how they
decide which web pages to display in what order when you do a search. We
will not prove everything we tell you, but we will prove the basic results.
We will use the language of probability. Even if you have not had a course
in probability before, our results will be intuitively meaningful.

Ex 1: Suppose there is a metropolitan area with city dwellers and suburb
dwellers. For city planning purposes, it is important to understand whether
people are tending to move from the city to the suburbs or vice versa, and
what the eventual populations will be. To make it simple we will assume the
total population is constant. The data available from surveys says that in
any given year

   The probability that a city dweller stays in the city is .90
   The probability that a city dweller moves to the suburbs is .10
      (These two probabilities must be nonnegative and add up to 1.)
   The probability that a suburbanite stays in the suburbs is .98
   The probability that a suburbanite moves to the city is .02

We can also represent this information as a labelled graph (picture).

We record this data in a probability matrix:

            city  suburb
   P = [ .90   .02 ]  city
       [ .10   .98 ]  suburb

in which there is one column for each current location (city, suburb) and
one row for each location next year (in the same order), so

   P_ij = probability of moving from location j to location i

Note that each column of P must add up to 1.

Suppose we know that this year the fraction of the whole population who are
city dwellers is .7, and the remaining .3 are suburbanites. Here are the
questions we will ask and answer (see the short numpy sketch after Ex 3
below):

Q1: What will be the fractions of the population who are city dwellers and
    suburbanites after 1 year?
    (Sneak preview: do a matrix-vector multiply)
Q2: What will be the fractions of the population who are city dwellers and
    suburbanites after 5 years? 10 years?
    (Sneak preview: do more matrix-vector multiplies)
Q3: What will be the fractions of the population who are city dwellers and
    suburbanites "eventually" (after n years, as n -> infinity)?
    (Sneak preview: compute an eigenvector)

Ex 2: Instead of classifying people by location, we classify by political
party. Suppose there are 3 political parties, R, B and G. Data about party
registration says what fraction of voters registered with each party either
remain with the party or switch to another one. One can represent this data
with a labelled graph (picture) or a matrix:

             R    B    G
   P = [ .85  .05  .01 ]  R
       [ .07  .90  .04 ]  B
       [ .08  .05  .95 ]  G

Again, each column of P must add up to 1. Our 3 questions are similar to
before (and the sneak previews are identical): How many voters will be
registered in each party next year? After 5 years? 10 years? Eventually?

Ex 3: Now our row and column labels will not be locations, or political
party affiliations, but which web page you are looking at. Imagine an
"average" web user who is looking at a web page. We'd like a model of what
web pages they are most likely to look at. So we imagine that our "average"
web user takes all the outward links from the current web page (say there
are 10 such links), picks one at random (say each with probability .1), and
clicks on it. On which web pages will such an "average" user spend the most
time? This is the basic model used by Google.
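Here is a minimal numpy sketch of how Q1-Q3 of Ex 1 can be answered
numerically, using the data above. The variable names and the printed
values are illustrative, not part of the notes.

   # A minimal sketch, assuming the Ex 1 data above.
   import numpy as np

   P = np.array([[0.90, 0.02],     # row 1: next year's city fraction
                 [0.10, 0.98]])    # row 2: next year's suburb fraction
   x = np.array([0.7, 0.3])        # this year: 70% city, 30% suburb

   # Q1: one matrix-vector multiply
   print(P @ x)                                 # ~ [0.636, 0.364]

   # Q2: multiply by P repeatedly (here via matrix powers)
   print(np.linalg.matrix_power(P, 5) @ x)      # ~ [0.448, 0.552]
   print(np.linalg.matrix_power(P, 10) @ x)     # ~ [0.315, 0.685]

   # Q3: eigenvector of P for eigenvalue 1, scaled so its entries sum to 1
   w, V = np.linalg.eig(P)
   v = V[:, np.argmax(w)]                       # eigenvalues here are 1 and 0.88
   print(v / v.sum())                           # ~ [1/6, 5/6] = [0.167, 0.833]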
The matrix P for Ex 3 is just the transpose of the "incidence matrix of the
Web" that we defined in Lecture 13, where we divide each column entry by
the number of its nonzero entries so that each column adds up to 1:

   P_ij = 1/m  if there is a link on web page j pointing to web page i
               and there are a total of m such outward links on web page j
        = 0    if there is no link from web page j to web page i

(There are variations on this, depending on whether one takes into account
multiple links from web site j to web site i, but we will keep this
presentation simple.) The original algorithm, invented by the founders of
Google as students, is called Pagerank. If you type "Pagerank" into Google
there are over 22M hits, so it is easy to find out more.

Now for some formal definitions:

Def: An n by n probability matrix P, or stochastic matrix, is a square
matrix with nonnegative entries where each column adds up to 1.

Let u be the vector each of whose entries is 1. Then the definition says
that P^t * u = u, i.e. that u is an eigenvector of P^t with eigenvalue 1.
Recall that the eigenvalues of P and P^t are the same. We will soon see
that 1 is also the largest eigenvalue of P.

Now we imagine a process where at any step we have a population (say of
people), each of which has an associated "state" (say location, or
political party, or current web page). At each "step" of this process, each
member of the population randomly picks a new state to be in at the next
step of the process. If a member is in state j, the probability of picking
next state i is P_ij. This process is called a Markov process.

If the number of members now in state k is m_k, then the "expected value"
of the population of state i at the next step is given as follows:

   next m_i = m_1 * probability(switching from 1 to i)
            + m_2 * probability(switching from 2 to i)
            ...
            + m_n * probability(switching from n to i)
            = sum_{j=1 to n} P_ij * m_j

or

   next m = P * m

By dividing each m_i by the total population M, to get m_i/M = fraction of
the population in state i, we get

   vector of next population fractions = P * vector of current population fractions

In other words, figuring out the expected fractions in each state after 1
step requires a matrix-vector multiplication by P. If we want the expected
fractions after k years, we multiply by P k times:

   vector of fractions after k years
      = P*(P* ... k times ... *(P* vector of current fractions))
      = P^k * vector of current fractions

This suggests that we can think of P^k as a probability matrix too,
corresponding to taking k steps. Indeed it is:

Lemma: If P is a probability matrix, so is P^k for any k.

Proof: If P has nonnegative entries, obviously so does P^k. And if P has
column sums equal to 1, that is P^t * u = u, then

   (P^k)^t * u = (P^t)^k * u = P^t * P^t * ... * P^t * u
               = P^t * (P^t * ... * (P^t * u)) = u

as desired.

Ex: Consider the city/suburb population example:

   P     = [ .90  .02 ]      P^2     = [ .812  .038 ]
           [ .10  .98 ]                [ .188  .962 ]

   P^5   = [ .606  .079 ]    P^10    = [ .399  .120 ]
           [ .394  .921 ]              [ .601  .880 ]

   P^100 = [ .167  .167 ]    "P^inf" = [ 1/6  1/6 ]
           [ .833  .833 ]              [ 5/6  5/6 ]

Now if m is any initial population with total population M,

   "P^inf"*m = [ 1/6  1/6 ] * [m_1] = [ 1/6*M ]
               [ 5/6  5/6 ]   [m_2]   [ 5/6*M ]

In other words, the eventual distribution of populations converges to the
same answer independent of who lived where in the beginning. To see how
this final distribution is related to P, set

   m = [m_1] = [1/6]
       [m_2]   [5/6]

and note that P * m = m, i.e. m is an eigenvector of P for the eigenvalue 1.
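As a sanity check of the Lemma and the table of powers above, one could run
something like the following numpy snippet. This is only an illustrative
sketch; the initial population m used below is made up.

   # Check that P^k keeps column sums 1 (the Lemma) and approaches "P^inf".
   import numpy as np

   P = np.array([[0.90, 0.02],
                 [0.10, 0.98]])

   for k in (2, 5, 10, 100):
       Pk = np.linalg.matrix_power(P, k)
       print(k, Pk.round(3), Pk.sum(axis=0))    # column sums stay [1., 1.]

   # "P^inf" * m depends only on the total population M = m_1 + m_2:
   m = np.array([200.0, 800.0])                 # any split of M = 1000 people
   print(np.linalg.matrix_power(P, 100) @ m)    # ~ [1000/6, 5000/6] = [166.7, 833.3]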
Thm 1: If P is a probability matrix where each entry of P^k is positive for
k large enough (i.e. you can get to any state from any other, eventually),
then as k -> infinity,
  (1) P^k converges to a limiting probability matrix L where each column is
      equal to the same vector v of positive numbers (summing to 1).
  (2) v is an eigenvector of P for the eigenvalue 1, scaled so its entries
      sum to 1.
  (3) For any m, P^k * m converges to v*M, where M = sum_i m_i.

We will prove this only partially, assuming P is diagonalizable (a complete
proof would use ideas from Chapter 7, on Jordan forms, which is an
eigendecomposition for non-diagonalizable matrices). We will need a Theorem
of independent interest:

Gershgorin's Theorem: Let A be an n by n matrix with entries from C (the
complex numbers). Then any eigenvalue lambda of A must satisfy the
following inequality for some i:

   | lambda - A_ii | <= sum_{j=1 to n except i} |A_ij|

Note: each such inequality defines a disk in the complex plane. In other
words, the eigenvalues must lie in a union of disks (which might overlap).

Example of Gershgorin's Theorem: If

   A = [  1   .1   .2 ]
       [ .1    4   .5 ]
       [  1    4   20 ]

then each eigenvalue lambda of A must lie in one of 3 disks:

   | lambda -  1 | <= .3
   | lambda -  4 | <= .6
   | lambda - 20 | <= 5

Proof of Gershgorin's theorem: Suppose A*x = lambda*x, so that x is an
eigenvector. Divide x by its largest component in absolute value. If this
component was the i-th component, then x_i is now 1, and the other
components satisfy |x_j| <= 1. Then

   0 = (A*x - lambda*x)_i = (A_ii - lambda)*x_i + sum_{j=1 to n except i} A_ij*x_j

or

   (A_ii - lambda)*x_i = -sum_{j=1 to n except i} A_ij*x_j

Using x_i = 1, |x_j| <= 1 and the triangle inequality we get

   |A_ii - lambda| <= sum_{j=1 to n except i} |A_ij*x_j|
                   <= sum_{j=1 to n except i} |A_ij|

Partial proof of Thm 1: We apply Gershgorin's Theorem to P^t (which has the
same eigenvalues as P) to get that any eigenvalue lambda of P satisfies

   | lambda - P_ii | <= sum_{j=1 to n except i} |P_ji|
                      = sum_{j=1 to n except i} P_ji   ... since all P_ji >= 0

or

   | lambda | = | lambda - P_ii + P_ii | <= | lambda - P_ii | + | P_ii |
             <= sum_{j=1 to n except i} P_ji + P_ii = 1

So all the eigenvalues of P are less than or equal to 1 in magnitude. We
know from before that one eigenvalue equals 1.

Assuming that (1) P is diagonalizable, and (2) all eigenvalues but one of P
are actually < 1 in magnitude, then from diagonalizability we get

   P = Q * Lambda * Q^{-1}   where Lambda = diag(1, lambda_2, ..., lambda_n)
                             and | lambda_i | < 1 for i >= 2

so

   P^k = Q * Lambda^k * Q^{-1} = Q * diag(1, lambda_2^k, ..., lambda_n^k) * Q^{-1}

As k -> infinity, all the lambda_i^k -> 0, so

   P^k -> Q * diag(1,0,...,0) * Q^{-1}
        = Q * (e_1 * e_1^t) * Q^{-1}
        = (Q * e_1) * (e_1^t * Q^{-1})
        = (column 1 of Q) * (row 1 of Q^{-1})
        = (v) * (r^t) = L

v is an eigenvector of P corresponding to eigenvalue 1. But note that any
nonzero multiple of v is also an eigenvector, so we haven't determined it
uniquely yet. Nor have we determined r.

To determine r and then L, we note that since P^k has unit column sums for
all k, i.e. (P^k)^t * u = u, this remains true in the limit as k ->
infinity, i.e. L^t * u = u. In other words, L's column sums are 1 too. This
is all we need: Note that column i of L is v*r_i. If we scale v so that its
entries sum to 1, then the sum of the entries of column i is r_i, and this
must equal 1 for every i, so r = u. We conclude that L = v*u^t, where v is
the eigenvector whose entries sum to 1, and the entries of u are all ones.
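As a quick numerical aside before finishing the argument, here is how one
could check Gershgorin's Theorem on the 3 by 3 example above, and the bound
|lambda| <= 1 for the city/suburb probability matrix, with numpy. The code
is only an illustrative sketch.

   # Verify Gershgorin's Theorem and the eigenvalue bound numerically.
   import numpy as np

   A = np.array([[1.0, 0.1, 0.2],
                 [0.1, 4.0, 0.5],
                 [1.0, 4.0, 20.0]])

   centers = np.diag(A)
   radii = np.abs(A).sum(axis=1) - np.abs(centers)   # off-diagonal row sums: .3, .6, 5

   for lam in np.linalg.eigvals(A):
       # every eigenvalue lies in at least one disk |lambda - A_ii| <= radius_i
       print(lam, np.any(np.abs(lam - centers) <= radii))   # prints True each time

   P = np.array([[0.90, 0.02],
                 [0.10, 0.98]])
   print(np.abs(np.linalg.eigvals(P)))   # -> 1.0 and 0.88 (in some order)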
(Strictly speaking, we have assumed that there is only one eigenvector of P
corresponding to the eigenvalue 1, except for constant multiples. To prove
this we would need to use our assumption that P^k has all positive entries
for k large enough.)

Now it is easy to prove the third part of Thm 1:

   P^k*m -> L*m = (v*u^t)*m = v*(u^t*m) = v * (sum_i m_i) = v*M

A complete proof of Thm 1 would use our assumption that each entry of P^k
is positive for large enough k to show that all eigenvalues of P except 1
are less than 1 in absolute value. We would also use the Jordan form to
show that Lambda^k -> diag(1,0,...,0) even when P is not diagonalizable.

Now we discuss how Google works, roughly:

Once every few days:
  (1) Form the 8 billion by 8 billion matrix P corresponding to the web.
  (2) Compute (approximately!) the eigenvector v of P for eigenvalue 1.
      Thus v(i) assigns a number to each web page i that corresponds to the
      chance that an "average" web user will look at it.

Every time a user types a search request into Google:
  (1) Find all the pages that contain the search string (or a synonym,
      different spelling, etc).
  (2) Sort the pages in decreasing order of their values of v(i).
  (3) Display the pages in this order, i.e. the ones an "average user"
      would view most often coming first.

In Lecture 13, we said that multiplying P by itself would take millions of
years, unless we took advantage of the fact that nearly all its entries are
zeros: Most web pages have links to very few other web pages. If the
average number of such links were, say, 10, then the number of nonzero
matrix entries in P is 10*n, where n is the dimension of P. Such a matrix,
most of whose entries are zero, is called sparse. To multiply a sparse
matrix times a vector, we only need to store and multiply by the nonzero
entries. So instead of costing n^2 multiplies and n^2 additions to compute
P*x, it costs only 10*n multiplies and 10*n additions, a factor of n/10
faster, or about .8 billion times faster.

So how does Google find the eigenvector v of P corresponding to the
eigenvalue 1? According to Theorem 1 (we assume P satisfies its
conditions), if we pick an arbitrary starting vector x^(1) and repeatedly
multiply it by P:

   x^(2) = P * x^(1),  x^(3) = P * x^(2),  ...,  x^(k) = P * x^(k-1)

then x^(k) converges to a multiple of v. Once x^(k) stops changing very
much, we assume it has converged well enough, and then set
v = x^(k) / sum_i x^(k)_i.
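To make the last two paragraphs concrete, here is a toy version of this
power iteration on a made-up 4-page "web", storing only the links (the
sparse structure) rather than a dense n by n matrix. The graph, names and
iteration count are illustrative assumptions, not Google's actual code.

   # A toy sketch of the power iteration above, using a sparse link
   # structure (page j -> list of pages it links to) instead of a dense
   # matrix. The 4-page graph is made up for illustration.
   import numpy as np

   links = {0: [1, 3], 1: [2], 2: [0, 3], 3: [0]}
   n = 4

   def multiply_by_P(x):
       """Compute P @ x touching only the nonzero entries of P, where
       P_ij = 1/(number of outward links of page j) if j links to i, else 0."""
       y = np.zeros(n)
       for j, outlinks in links.items():
           for i in outlinks:
               y[i] += x[j] / len(outlinks)
       return y

   x = np.ones(n) / n            # arbitrary starting vector x^(1)
   for _ in range(100):          # x^(k) = P * x^(k-1), repeated
       x = multiply_by_P(x)
   v = x / x.sum()               # scale so the entries sum to 1
   print(v)                      # ~ [0.364, 0.182, 0.182, 0.273] for this toy graph

Note that each call to multiply_by_P costs only as many operations as there
are links, which is the point of exploiting sparsity.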