```
CS 70 - Lecture 22 - Mar 11, 2011 - 10 Evans

Goal for today (Note 13): Hash functions

Recall data structures for storing n items of data:
Simplest: unsorted list
cost = constant to add something to the end of the list
cost = proportional to n to look something up
Better: sorted list
cost = proportional to n*log n to sort given all the data,
so log n per item; adding a new item is trickier
cost = proportional to log n to look something up
Best: constant time to add something, or look up something: Hash table

A Hash table is a data structure for quickly storing
and looking up data, usually in constant time per item,
if designed properly:

Given keys in some set S, a hash function h:S ->[0,n-1]
maps each key into an integer from 0 to n-1, which is used to
look up the data in a list. So there is one List(i) (possibly empty)
for each i in the range [0,n-1].
insert (key,data) into List(h(key))

data = find(key)
if (key,data) stored in List(h(key)), return data, else "empty"

For this to work well, all the List(i) have to be short, so adding to
or searching List(i) takes a constant amount of time. This depends
on the hash function h(x).

A really bad h(x) would be the function h(x) = 0 for all x, so
everything would be stored in List(0), and inserting and finding
data would be as slow as using a single list.

Ideally, if the number of lists is at least as large as the number
of data items, i.e. n >= |S|, then each list will have at most
one data item. In other words h(x) would spread out the data as
evenly as possible across List(0),...,List(n-1).

There are a lot of different kinds of hash functions that try to
achieve this. One example is h(x) = x mod n. This works well if
the rightmost digits of x are likely to be uniformly distributed.
If this is not the case then h(x) = a*x mod n, where a is a large
number such that gcd(a,n) = 1, might be a better choice.

Designing good hash functions is a topic for another class.
Here we assume we have done a good job, so that using h(x) is
like picking a random integer from [0,n-1].
In other words, inserting m data items is like throwing m balls
at random into n bins (Lists).
This lets us use probability theory to ask interesting questions, like
How long is the longest list likely to be?
How big does n = #Lists have to be compared to m = #keys, so that
the probability of having a long List is small?

We will start with the simpler question: how big does n have to be
so that the chance of a List having more than 1 item is less than 1/2?
Clearly n has to be at least m, even if h(x) did a perfect job of
distributing the keys uniformly across the lists. But since
we are throwing balls into bins at random, n will have to be larger.

So let's compute P(E) where
E = {m balls thrown into n bins with no "collisions"}
Each possible outcome (where each ball lands) is equally likely,
so we just need to count
|E| =     n    ... number of ways to throw first ball
x (n-1)  ... number of ways to throw second ball
x (n-2)  ... number of ways to throw third ball
...
x (n-m+1)  ... number of ways to throw m-th ball
and divide by
|S| = |{all ways to throw m balls into n bins}|
=   n ... number of ways to throw first ball
x n ... number of ways to throw second ball
...
x n ... number of ways to throw m-th ball
= n^m
So P(E) = |E|/|S| = prod_{i=0 to m-1} (n-i)/n
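Small check (the values m = 2, n = 3 are chosen here only for illustration):
|E| = 3 x 2 = 6 and |S| = 3^2 = 9, so P(E) = 6/9 = 2/3,
which matches prod_{i=0 to 1} (3-i)/3 = (3/3)*(2/3) = 2/3.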

Here is another way to compute the same results, using a result
on conditional probability from last time:
P(A_1 inter A_2 inter ...  inter A_m)
=  P(A_1)
* P(A_2 | A_1)
* P(A_3 | A_1 inter A_2 )
* P(A_4 | A_1 inter A_2  inter A_3 )
...
* P(A_m | A_1 inter A_2 inter ... inter A_{m-1} )
Let A_i = {i-th ball does not collide with balls 1...i-1}
Then A_1 inter ... inter A_m
= {for i in [1,m], ball i does not collide with balls 1 through i-1}
= {for i neq j, both in [1,m], ball i does not collide with ball j}
and our goal is to compute P(A_1 inter ... inter A_m):
P(A_1) = 1, and
P(A_i | A_1 inter ... inter A_{i-1}) = (n-i+1)/n
because A_1 inter ... inter A_{i-1} means that balls 1 to i-1
occupy i-1 different bins, so there are n-i+1 empty bins for the
i-th ball to land in to avoid collisions.
Thus
P(A_1 inter ... inter A_m) = prod_{i=1 to m} (n-i+1)/n
= prod_{i=0 to m-1} (n-i)/n
as above. The i=0 factor equals 1, so the product may equally start at i=1.

Given n, we seek the value of m that makes
P(m balls thrown into n bins without collision)
= prod_{i=1 to m-1} (n-i)/n
= prod_{i=1 to m-1} (1- i/n)
= p(m,n)
equal to .5 (or just smaller).

We compute instead ln p(m,n) = sum_{i=1 to m-1} ln(1 - i/n)
Since
ln(1 - x) = -x - x^2/2 - x^3/3 - ...  Taylor expansion
~ -x when x is small (i.e. when i/n is small, or n is large)
then
ln p(m,n) = sum_{i=1 to m-1} ln(1 - i/n)
~ sum_{i=1 to m-1} -i/n
= (-1/n) sum_{i=1 to m-1} i
= (-1/n) m*(m-1)/2
~ (-1/n) m^2/2
Setting ln p(m,n) ~ -m^2/(2*n) equal to ln(1/2) = -ln 2 yields
m^2/(2*n) = ln 2
or
m = sqrt(2 * ln 2 * n) ~ 1.177 * sqrt(n).
In other words, we can probably only throw about sqrt(# bins) balls without
a collision.

ASK&WAIT: What would change if we want the probability of collision to be 5%?

How many different people do you need to have before the chance that two of them
have a common birthday is at least .5?
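Worked check (a sketch, assuming n = 365 equally likely birthdays and
ignoring leap years; neither assumption is stated above):
m ~ 1.177 * sqrt(365) ~ 22.5
so about 23 people should be enough. The exact product agrees:
p(22,365) ~ .524 and p(23,365) ~ .493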

Ex: "Coupon collector's problem"
Suppose I like to buy cereal because each box contains a random baseball
card from a collection of n baseball cards. How many boxes m do I have to
buy so that the probability that I have at least one of each card is at
least 1/2?

Pick a card, any card. The chance that I do not get this particular card
in any one box is (n-1)/n, so the chance of not getting it in m boxes is
((n-1)/n)^m  = (1 - 1/n)^m
From calculus we know that
lim_{n -> infinity} (1 - 1/n)^n = 1/e = 1/2.71828...
so when n is large
(1 - 1/n)^m = ((1 - 1/n)^n)^(m/n) ~ exp(-m/n)
Said another way, let E_i be the event that I do not get card i in m boxes, so
P(E_i) ~ exp(-m/n)
What we want is the probability that some card is missing,
i.e. card 1 or card 2 or ... or card n is missing; this is
P(E_1 union E_2 union ... union E_n)

The sets E_i and E_j are not disjoint when i neq j, but it is still true that
Thm (Union Bound)  For any events E_i, disjoint or not
P(E_1 union ... union E_n) <= sum_{i=1 to n} P(E_i)
Proof: Let E = E_1 union ... union E_n
P(E) = sum_{x in E} P(x)
<= sum_{i=1 to n} sum_{x in E_i} P(x)
... over-counting will occur if E_i not all disjoint,
... but we still get an upper bound
= sum_{i=1 to n} P(E_i)
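Small check (an example added here for illustration): roll one fair die and
let E_1 = {1,2}, E_2 = {2,3}. Then P(E_1 union E_2) = P({1,2,3}) = 1/2,
while P(E_1) + P(E_2) = 1/3 + 1/3 = 2/3, because outcome 2 is counted twice.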

Therefore P(some card missing) = P(E) <= sum_{i=1 to n} P(E_i) ~ n*exp(-m/n)
So if we pick m large enough to make the upper bound n*exp(-m/n) < 1/2
then we will be sure that the probability of missing some card is < 1/2.
Solving
1/2 = n*exp(-m/n)
for m we get m = n*ln(2*n). This is clearly a good marketing ploy for cereal.
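Worked example (n = 100 is chosen here only for illustration):
m = 100*ln(200) ~ 100*5.298 ~ 530 boxes,
and indeed 100*exp(-530/100) ~ .499 < 1/2.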
```
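
The insert and find operations and the no-collision probability from the notes can be sketched in Python as follows. This is a minimal illustration, not the lecture's own code: the names ChainedHashTable, longest_list, no_collision_prob and balls_for_half are hypothetical, integer keys are assumed for the default h(x) = x mod n, and overwriting on a repeated key is an assumption the notes do not address.

```python
import math


class ChainedHashTable:
    """Hash table with n lists; key k is stored in List(h(k))."""

    def __init__(self, n, h=None):
        self.n = n
        # Default h(x) = x mod n, the example from the notes (integer keys assumed).
        self.h = h if h is not None else (lambda x: x % n)
        self.lists = [[] for _ in range(n)]

    def insert(self, key, data):
        bucket = self.lists[self.h(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                # Assumption: inserting an existing key overwrites its data.
                bucket[i] = (key, data)
                return
        bucket.append((key, data))

    def find(self, key):
        for k, data in self.lists[self.h(key)]:
            if k == key:
                return data
        return None  # plays the role of "empty" in the notes

    def longest_list(self):
        return max(len(bucket) for bucket in self.lists)


def no_collision_prob(m, n):
    """Exact p(m,n) = prod_{i=1 to m-1} (1 - i/n)."""
    p = 1.0
    for i in range(1, m):
        p *= 1.0 - i / n
    return p


def balls_for_half(n):
    """Estimate m = sqrt(2 * ln 2 * n), where p(m,n) is about 1/2."""
    return math.sqrt(2.0 * math.log(2.0) * n)


if __name__ == "__main__":
    table = ChainedHashTable(10)
    for key in (3, 13, 27):
        table.insert(key, "data for %d" % key)
    print(table.find(13), table.find(4), table.longest_list())

    # The "really bad" h(x) = 0 puts every item into List(0).
    bad = ChainedHashTable(10, h=lambda x: 0)
    for key in range(5):
        bad.insert(key, key)
    print(bad.longest_list())

    # Compare the sqrt estimate with the exact product for n = 365.
    print(balls_for_half(365), no_collision_prob(22, 365), no_collision_prob(23, 365))
```

Passing h as a parameter makes it easy to compare a reasonable hash function with the constant one; the exact product is cheap to evaluate, so the sqrt estimate can be checked directly for any n.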