Math 55 - Spring 2004 - Lecture notes #24 - April 22 (Thursday)

Goals for today: Applications to computer science
   searching a list
   collisions in a hash table
   load balancing

EX: Suppose we have a list of n distinct items L(1),...,L(n), and want
an algorithm that takes an input x and
   (1) returns i if L(i)=x,
   (2) returns n+1 otherwise
An obvious algorithm is "linear search":

   i=0
   repeat i=i+1 until L(i)=x or i=n+1

Suppose x is chosen at random from a sample space S with probability
function P. What is the expectation of the operation count C of this
algorithm, i.e. how many times is the line "i=i+1" executed?
If we run this algorithm many times, E(C) tells us how long it will take
"on average", and sigma(C) tells us how variable the running time will be.
The answer depends on S and P:

EX 1: Suppose S = {L(1),...,L(n)} and P(L(i))=1/n; this means each input x
must be in the list L, and is equally likely. Then
   E(C) = 1*(1/n) + 2*(1/n) + 3*(1/n) + ... + n*(1/n) = (n+1)/2
and
   V(C) = E(C^2) - (E(C))^2
        = 1^2*(1/n) + 2^2*(1/n) + 3^2*(1/n) + ... + n^2*(1/n) - ((n+1)/2)^2
        = (n^3/3 + O(n^2))*(1/n) - (n^2/4 + O(n))
        = n^2/12 + O(n)
so
   sigma(C) = n/sqrt(12) + O(1) ~ .29*n + O(1)
(A short simulation checking these formulas appears at the end of this
section.)

EX 2: Suppose S = {L(1),...,L(n),z}, and P(L(i))=p/n, P(z)=1-p,
i.e. the probability that x is not in the list is 1-p.
ASK&WAIT: What is E(C)? How does it depend on p?
ASK&WAIT: What are V(C) and sigma(C)? How do they depend on p?

EX 3: Suppose the probabilities P(L(i)) = p(i) are all different. In what
order should we search the L(i) to minimize E(C)?
ASK&WAIT: Suppose n=2, and p(2) >> p(1); which item should we search first?

Thm: Searching the list in decreasing order of p(i) minimizes E(C).
Proof: Suppose that we do not search the list in decreasing order of p(i);
then we will show that there is a different search order with smaller E(C).
Let q(1),...,q(n) be the probabilities of the items in the order they are
searched, so q(1),...,q(n) is a permutation of p(1),...,p(n). Suppose that
q(1),...,q(n) is not in decreasing order; this means that for some j,
q(j) < q(j+1). Let r = 1 - p(1) - p(2) - ... - p(n) be the probability that
the search item is not in the list. Then
   C1 = E(C) for search order 1,...,j,j+1,...,n
      = sum_{i=1 to n} i*q(i) + (n+1)*r
and
   C2 = E(C) for search order 1,...,j+1,j,...,n
           (i.e. with item j+1 searched before item j)
      = sum_{i=1 to n except j and j+1} i*q(i)
        + j*q(j+1) + (j+1)*q(j) + (n+1)*r
Then
   C1 - C2 = j*q(j) + (j+1)*q(j+1) - j*q(j+1) - (j+1)*q(j)
           = q(j+1) - q(j) > 0
In other words, searching item j+1 before item j lowers the expected cost
E(C). So unless q(j) >= q(j+1) for all j, i.e. unless the list is searched
in decreasing order of probability, there is a better order to search in.

EX 4: Suppose P(L(i)) = c*q^i, q<1, where c is chosen so that
sum_{i=1 to n} P(L(i)) = 1, i.e. x is guaranteed to be in the list.
ASK&WAIT: What is c?
ASK&WAIT: What is E(C)?
ASK&WAIT: What happens to E(C) as n grows?

Recall Binary Search: Assume L(1) < L(2) < ... < L(n), by sorting if
necessary.

   istart = 1; iend = n
   repeat
      middle = floor((istart + iend)/2)
      if x < L(middle), iend = middle-1
      if x > L(middle), istart = middle+1
   until x = L(middle) or iend < istart

Cost = O(log_2 n) steps, because each time the list to search is at most
half as long.
ASK&WAIT: Which is faster for large n, linear search or binary search?
But in the worst case, linear search will take n steps, much slower than
binary search.
ASK&WAIT: Can you think of an algorithm that takes O(1) steps on average,
and O(log n) at worst, i.e. has the advantages of both kinds of search?
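Aside: a minimal Python sketch (illustrative, not part of the lecture's
pseudocode) that checks the EX 1 formulas E(C) = (n+1)/2 and
sigma(C) ~ .29*n by simulation. The list size and trial count below are
arbitrary choices.

   import math, random

   def linear_search_count(L, x):
       # Count how many times "i=i+1" runs in the linear search above:
       # return i such that L[i-1] == x, or n+1 if x is not in the list.
       i, n = 0, len(L)
       while True:
           i += 1
           if i == n + 1 or L[i-1] == x:
               return i

   n = 500
   L = list(range(1, n+1))          # n distinct items
   trials = 20000
   # EX 1: x is drawn uniformly from the list itself
   counts = [linear_search_count(L, random.choice(L)) for _ in range(trials)]
   mean = sum(counts) / trials
   var  = sum(c*c for c in counts) / trials - mean**2
   print(mean, (n+1)/2)                    # both should be near 250.5
   print(math.sqrt(var), n/math.sqrt(12))  # both should be near 144.3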
The next two topics are based on notes from CS 70.

EX: The next question we ask is about hash tables. A hash table is a data
structure that lets us store and look up data items in close to constant
time, no matter how many items there are in the table (unlike linear search
or binary search, which cost O(n) and O(log n) respectively). The hash table
uses a so-called hash function f(x), which takes a data item x and computes
an index f(x) between 1 and n = hash table size. The data item x is then
stored at H(f(x)), that is, in the f(x)-th location of the hash table H.
If more than one data item x1,x2,... all have the same index
i = f(x1) = f(x2) = ... (called a "collision"), then all these data items
are stored in a linked list starting at H(i). So for a hash table to have
the attractive property of taking a constant amount of time to find a data
item, these linked lists should be very short, ideally with one data item
each (no collisions). This means that the hash function f(x) should "spread"
the data items out as evenly as possible over all n hash table entries.

To model the behavior of a hash table, we will model our ideal hash function
f(x) as independently picking a random hash table location for each x, where
each location is chosen with equal probability 1/n. We then ask how many
items the hash table can contain before the probability that the hash
function picks the same location for two items (i.e. is no longer "perfect")
exceeds some probability, say 1/2 (we could pick any number we like). The
purpose of this is to pick the size n of the hash table we need to store a
given number m of data items without collision.

So what we want to compute is

   P(after m items are inserted randomly into n table entries,
     no collisions occur)
    = P(1st item causes no collision)
      * P(2nd item causes no collision)
      * P(3rd item causes no collision)
      * ...
      * P(mth item causes no collision)
        ... by independence of each choice
    = 1 *           ... collision impossible on first item
      (n-1)/n *     ... n-1 equally likely free locations for 2nd item
      (n-2)/n *     ... n-2 equally likely free locations for 3rd item
      ...
      (n-m+1)/n     ... n-m+1 equally likely free locations for mth item
    = (n-1)!/[(n-m)!*n^(m-1)]

We want to choose m, depending on n, so that this probability is about 1/2.
We will use Stirling's formula for this:
   n! ~ sqrt(2*pi) * n^(n+1/2) * e^(-n)
to get
   1/2 ~ (n-1)! / (n-m)! / n^(m-1)
       ~ sqrt(2*pi) * (n-1)^(n-1/2) * e^(-n+1) /
         [sqrt(2*pi) * (n-m)^(n-m+1/2) * e^(-n+m) * n^(m-1)]
           ... cancelling sqrt(2*pi), substituting m = a*n,
           ... and factoring powers of n out
       ~ n^(n-1/2) * (1 - 1/n)^n * (1 - 1/n)^(-1/2) * e /
         [n^(n*(1-a)+1/2) * (1-a)^((1-a)*n + 1/2) * e^(a*n) * n^(a*n - 1)]
           ... cancelling n^(...) and using (1 - 1/n)^n -> 1/e
       ~ (1-a)^(-1/2) * ((1-a)^(1-a)*e^a)^(-n)
Taking logs yields
   log(1/2) = -1/2*log(1-a) - n*log((1-a)^(1-a)*e^a)
            = -1/2*log(1-a) - n*[(1-a)*log(1-a) + a]
For this to be near log(1/2), the factor multiplying n has to shrink like
1/n, i.e. a has to be tiny, so we use the Taylor expansion
   log(1-a) = -a - a^2/2 - ...
Keeping terms through a^2,
   (1-a)*log(1-a) + a ~ (1-a)*(-a - a^2/2) + a = a^2/2 + O(a^3)
and the term -1/2*log(1-a) ~ a/2 is negligible by comparison, so
   log(1/2) ~ -n*a^2/2
or, solving for a:
   a ~ sqrt(2*log(2))/sqrt(n) ~ 1.18/sqrt(n)
or, solving for m = a*n:
   m ~ 1.18*sqrt(n)
In other words, we would want the size n of the hash table to be about the
square of the number m of data items to be sure of no collisions with
probability 1/2. With any other probability, we would have gotten a similar
result with a slightly different constant.
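Aside: a minimal Python sketch (illustrative, not part of the lecture) of
the model above. It hashes m items into n slots, each slot chosen
independently and uniformly at random, and estimates the probability of no
collision; with m ~ 1.18*sqrt(n) the estimate should come out near 1/2. The
table size and trial count are arbitrary choices.

   import random

   def prob_no_collision(n, m, trials=100000):
       # Estimate P(no collision) when m items are hashed into n slots.
       ok = 0
       for _ in range(trials):
           slots = [random.randrange(n) for _ in range(m)]
           if len(set(slots)) == m:     # all m chosen locations distinct
               ok += 1
       return ok / trials

   n = 10000
   m = int(1.18 * n**0.5)               # about 118 items
   print(m, prob_no_collision(n, m))    # second number should be near 0.5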
EX: We consider "load balancing." For example, suppose you run a web service
(like Google) to which large numbers of requests regularly stream in, and
which need to be assigned to processors. A typical algorithm takes each
incoming request and randomly picks a processor to assign it to. The
question, given m requests and n processors, is whether each processor will
have roughly the same amount of work to do, i.e. whether the load will be
balanced. (The reason this is often done instead of having a centralized
processor evenly divide the work among processors is that the centralized
processor becomes a bottleneck.)

A similar question would be this: suppose you are a spammer, and randomly
send m email messages to n recipients. Does each recipient get about the
same number of spam messages?

This is similar to the last example, but there we wanted each processor (or
hash table entry) to get at most one request (or data item) with probability
1/2. Here we will instead ask the following question:

   Given m requests assigned randomly to n processors, what is the smallest
   value of k such that P(some processor gets k or more requests) <= 1/2?

In other words, we have a good chance (1/2) that no processor has more than
k requests to handle. (We could change the constant 1/2 to .1 or .01 if we
wanted to be more sure.)
ASK&WAIT: Would changing 1/2 to a smaller value make the answer k larger or
smaller? Why?

We will approximate this as follows. Consider just processor 1. The
probability that processor 1 gets k or more requests is the same as the
probability that, after flipping a biased coin m times, where the coin comes
up "assign to processor 1" with probability 1/n and "assign to another
processor" with probability (n-1)/n, the number of times the coin comes up
"processor 1" is k, k+1, ..., or m. This probability is

   P(processor 1 has at least k requests) = P_1(k,m,n)
      = sum_{j=k to m} C(m,j) * (1/n)^j * ((n-1)/n)^(m-j)

Note that the analogous function for any other processor, say P_i(...) for
processor i, is the same. Therefore

   P(some processor will have at least k requests)
    = P(proc 1 has at least k requests or proc 2 has at least k requests
        or ... or proc n has at least k requests)
   <= P(proc 1 has at least k requests) + P(proc 2 has at least k requests)
      + ... + P(proc n has at least k requests)
        ... because the probability of a union of events
        ... is at most the sum of the individual probabilities.
        ... We only get an upper bound because the events can
        ... overlap, i.e. more than one processor may
        ... simultaneously have k or more requests.
    = n*P_1(k,m,n)
        ... since all the functions P_i(k,m,n) are the same

Suppose we choose k as small as possible (depending on m and n) so that
   P_1(k,m,n) <= 1/(2*n)
Then
   P(some processor will have at least k requests) <= n/(2*n) = 1/2
as desired.

The Central Limit Theorem can be used to show, in the special case m=n, that
this value of k is quite small, namely k ~ 2*log n / log log n. So if
n = 10^6, the probability is 1/2 that no processor has more than 11
requests. If 250M pieces of spam are randomly mailed to 250M recipients, the
probability is 1/2 that no one will get more than 12 pieces of spam.
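Aside: a minimal Python sketch (illustrative, not part of the lecture) that
checks the bound above. It computes the binomial tail P_1(k,m,n) exactly and
finds the smallest k with P_1(k,m,n) <= 1/(2*n), which by the union bound
guarantees P(some processor gets k or more requests) <= 1/2. The choice
n = m = 10^6 matches the example in the notes.

   import math

   def tail(k, m, n):
       # P_1(k,m,n) = P(a given processor receives at least k of the m
       # requests), where each request goes to it with probability p = 1/n.
       p = 1.0 / n
       term = (1 - p) ** m          # P(exactly 0 requests)
       cdf = 0.0
       for j in range(k):           # accumulate P(0), P(1), ..., P(k-1)
           cdf += term
           term *= (m - j) / (j + 1) * p / (1 - p)
       return 1.0 - cdf             # P(at least k requests)

   def smallest_k(m, n):
       # Smallest k with P_1(k,m,n) <= 1/(2*n).
       k = 1
       while tail(k, m, n) > 1.0 / (2 * n):
           k += 1
       return k

   n = m = 10**6
   print(smallest_k(m, n))                       # a small number, about 10
   print(2*math.log(n) / math.log(math.log(n)))  # the estimate, about 10.5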