CS 70 - Lecture 22 - Mar 11, 2011 - 10 Evans

Goal for today (Note 13): Hash functions

Recall data structures for storing n items of data:

Simplest: unsorted list
    cost = constant to add something to the end of the list
    cost = proportional to n to look something up

Better: sorted list
    cost = proportional to n*log n to sort, given all the data in
           advance, so log n per item; adding a new item later is trickier
    cost = proportional to log n to look something up (binary search)

Best: constant time to add something or to look something up: Hash table

A hash table is a data structure for quickly storing and looking up data,
usually in constant time per item, if designed properly: given keys in
some set S, a hash function h: S -> [0,n-1] maps each key to an integer
from 0 to n-1, which is used to look up the data in a list. So there is
one List(i) (possibly empty) for each i in the range [0,n-1].

    add(key,data)
        insert (key,data) into List(h(key))
    data = find(key)
        if (key,data) is stored in List(h(key)), return data, else "empty"
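To make the pseudocode concrete, here is a minimal Python sketch of this
structure (a hash table with chaining); the names HashTable, add, and
find mirror the pseudocode above, and it uses h(x) = x mod n, one of the
example hash functions discussed below.

    # A minimal sketch of a hash table with chaining, mirroring the
    # pseudocode above. Keys are assumed to be integers.

    class HashTable:
        def __init__(self, n):
            self.n = n
            self.lists = [[] for _ in range(n)]  # one List(i) for each i in [0,n-1]

        def h(self, key):
            return key % self.n                  # h(x) = x mod n

        def add(self, key, data):
            # insert (key,data) into List(h(key))
            self.lists[self.h(key)].append((key, data))

        def find(self, key):
            # if (key,data) is stored in List(h(key)), return data, else "empty"
            for k, d in self.lists[self.h(key)]:
                if k == key:
                    return d
            return "empty"

    table = HashTable(10)
    table.add(42, "some data")
    print(table.find(42))   # prints: some data
    print(table.find(7))    # prints: empty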
For this to work well, all the List(i) have to be short, so that adding
to or searching List(i) takes a constant amount of time. This depends on
the hash function h(x). A really bad h(x) would be the function h(x) = 0
for all x: everything would be stored in List(0), and inserting and
finding data would be as slow as using a single list. Ideally, if the
number of lists is at least as large as the number of data items, i.e.
n >= |S|, each list would have at most one data item. In other words,
h(x) should spread out the data as evenly as possible across
List(0),...,List(n-1).

There are many different kinds of hash functions that try to achieve
this. One example is h(x) = x mod n. This works well if the rightmost
digits of x are likely to be uniformly distributed. If this is not the
case, then h(x) = a*x mod n, where a is a large number with
gcd(a,n) = 1, might be a better choice. Designing good hash functions is
a topic for another class. Here we assume we have done a good job, so
that using h(x) is like picking a random integer from [0,n-1]. In other
words, inserting m data items is like throwing m balls at random into
n bins (Lists).

This lets us use probability theory to ask interesting questions, like:
    How long is the longest list likely to be?
    How big does n = #Lists have to be compared to m = #keys, so that
    the probability of having a long List is small?

We will start with the simpler question: how big does n have to be so
that the chance of some List having more than 1 item is less than 1/2?
Clearly n has to be at least m, even if h(x) did a perfect job of
distributing the keys uniformly across the lists. But since we are
throwing balls into bins, n will have to be larger. So let's compute
P(E), where E = {m balls thrown into n bins with no "collisions"}.
Each possible outcome (where each of the balls lands) is equally
likely, so we just need to count

    |E| =   n        ... number of ways to throw first ball
         x (n-1)     ... number of ways to throw second ball
         x (n-2)     ... number of ways to throw third ball
         ...
         x (n-m+1)   ... number of ways to throw m-th ball

and divide by

    |S| = |{all ways to throw m balls into n bins}|
        =   n        ... number of ways to throw first ball
         x  n        ... number of ways to throw second ball
         ...
         x  n        ... number of ways to throw m-th ball
        = n^m

So P(E) = |E|/|S| = prod_{i=0 to m-1} (n-i)/n.

Here is another way to compute the same result, using a result on
conditional probability from last time:

    P(A_1 inter A_2 inter ... inter A_m)
        = P(A_1) * P(A_2 | A_1) * P(A_3 | A_1 inter A_2)
          * P(A_4 | A_1 inter A_2 inter A_3) * ...
          * P(A_m | A_1 inter A_2 inter ... inter A_{m-1})

Let A_i = {i-th ball does not collide with balls 1,...,i-1}. Then

    A_1 inter ... inter A_m
      = {for each i in [1,m], ball i does not collide with balls 1 through i-1}
      = {for all i neq j, both in [1,m], ball i does not collide with ball j}

and our goal is to compute P(A_1 inter ... inter A_m). Now P(A_1) = 1,
and P(A_i | A_1 inter ... inter A_{i-1}) = (n-i+1)/n, because
A_1 inter ... inter A_{i-1} means that balls 1 to i-1 occupy i-1
different bins, so there are n-i+1 empty bins for the i-th ball to land
in to avoid a collision. Thus

    P(A_1 inter ... inter A_m) = prod_{i=1 to m} (n-i+1)/n
                               = prod_{i=0 to m-1} (n-i)/n

as above.

Given n, we seek the value of m that makes

    P(m balls thrown into n bins without collision)
        = prod_{i=1 to m-1} (n-i)/n = prod_{i=1 to m-1} (1 - i/n) = p(m,n)

equal to .5 (or just smaller). We compute instead

    ln p(m,n) = sum_{i=1 to m-1} ln(1 - i/n)

Since

    ln(1 - x) = -x - x^2/2 - x^3/3 - ...   (Taylor expansion)
              ~ -x  when x is small (i.e. when i/n is small, i.e. n is large)

we get

    ln p(m,n) = sum_{i=1 to m-1} ln(1 - i/n)
              ~ sum_{i=1 to m-1} -i/n
              = (-1/n) * sum_{i=1 to m-1} i
              = (-1/n) * m*(m-1)/2
              ~ -m^2/(2*n)

Equating ln(1/2) = ln p(m,n) = -m^2/(2*n) yields m^2/(2*n) = ln 2, or

    m = sqrt(2 * ln 2 * n) ~ 1.177 * sqrt(n)

In other words, we can probably only throw about sqrt(#bins) balls
without a collision.

ASK&WAIT: What would change if we want the probability of collision to
be 5%?

Ex: "Birthday Paradox": How many different people do you need before
the chance that two of them have a common birthday is at least .5?
(Here n = 365 possible birthdays, so m ~ 1.177 * sqrt(365) ~ 22.5,
i.e. about 23 people.)

Ex: "Coupon collector's problem": Suppose I like to buy cereal because
each box contains a random baseball card from a collection of n
baseball cards. How many boxes m do I have to buy so that the
probability that I have at least one of each card is at least 1/2?

Pick a card, any card. The chance that I do not get this particular
card in any one box is (n-1)/n, so the chance of not getting it in m
boxes is ((n-1)/n)^m = (1 - 1/n)^m. From calculus we know that

    lim_{n -> infinity} (1 - 1/n)^n = 1/e = 1/2.71828...

so when n is large

    (1 - 1/n)^m = ((1 - 1/n)^n)^(m/n) ~ exp(-m/n)

Said another way, let E_i be the event that I do not get card i in m
boxes, so P(E_i) ~ exp(-m/n). What we want is the probability that we
miss some card, i.e. card 1 or card 2 or ... or card n; this is
P(E_1 union E_2 union ... union E_n). The sets E_i and E_j are not
disjoint when i neq j, but it is still true that

Thm (Union Bound): For any events E_i, disjoint or not,
    P(E_1 union ... union E_n) <= sum_{i=1 to n} P(E_i)
Proof: Let E = E_1 union ... union E_n. Then
    P(E) = sum_{x in E} P(x)
        <= sum_{i=1 to n} sum_{x in E_i} P(x)
              ... over-counting will occur if the E_i are not all
              ... disjoint, but we still get an upper bound
         = sum_{i=1 to n} P(E_i)

Therefore

    P(some card missing) = P(E) <= sum_{i=1 to n} P(E_i) ~ n*exp(-m/n)

So if we pick m large enough to make the upper bound n*exp(-m/n) < 1/2,
then we will be sure that the probability of missing some card is < 1/2.
Solving 1/2 = n*exp(-m/n) for m, we get m = n*ln(2*n). This is clearly
a good marketing ploy for cereal.
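As a numerical sanity check on the birthday estimate m ~ 1.177*sqrt(n)
derived above, here is a short Python sketch (the function name
p_no_collision is just for illustration) that computes the exact
no-collision probability p(m,n) and finds the smallest m that pushes it
to 1/2 or below.

    import math

    def p_no_collision(m, n):
        # exact p(m,n) = prod_{i=1 to m-1} (1 - i/n)
        p = 1.0
        for i in range(1, m):
            p *= 1.0 - i / n
        return p

    for n in [365, 1000, 10**6]:
        m = 1
        while p_no_collision(m, n) > 0.5:
            m += 1
        print(n, m, 1.177 * math.sqrt(n))
    # for n = 365 this finds m = 23, close to the estimate
    # 1.177*sqrt(365) ~ 22.5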
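Similarly, the coupon collector bound m = n*ln(2*n) can be tested by
simulation. A minimal sketch, assuming n = 50 cards (the card count,
trial count, and helper name missing_some_card are arbitrary choices for
illustration):

    import math, random

    def missing_some_card(n, m):
        # buy m boxes, each with a uniformly random card out of n kinds;
        # return True if some card is still missing
        seen = set()
        for _ in range(m):
            seen.add(random.randrange(n))
        return len(seen) < n

    n = 50
    m = math.ceil(n * math.log(2 * n))   # the bound derived above
    trials = 10000
    misses = sum(missing_some_card(n, m) for _ in range(trials))
    print(misses / trials)   # should come out below 1/2, as the
                             # union bound guarantees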