CS174 Spring 99 Lecture 8 Summary

 

More on Coupon Collecting

Recall that coupon collecting is equivalent to placing m balls into n bins until no bin is empty. Last time we derived an upper bound on the probability that some bin is empty:

Pr[some bin empty] < n·e^(-m/n)

Then we showed that if you fix this probability at some small target p and rearrange the equation (solving n·e^(-m/n) ≤ p), the number of balls you need is:

m ≥ n ln(n) + n ln(1/p),  i.e.  m = n ln(n) + Ω(n)

So if m is somewhat bigger than n ln(n), we have a very good chance of hitting all the bins. In coupon terms: after about n ln(n) visits to the supermarket, you have a good chance of having collected all n coupons.
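A simulation can illustrate this bound. The following sketch (the values n = 100, the extra 2n balls, and 300 runs are arbitrary illustrative choices, not from the lecture) throws m = n ln(n) + 2n balls and checks how often every bin gets hit; for this m, the union bound above guarantees success probability at least 1 - e^(-2) ≈ 0.86.

```python
import math
import random

def all_bins_hit(n, m):
    """Throw m balls into n bins uniformly at random; True if no bin is empty."""
    hit = [False] * n
    for _ in range(m):
        hit[random.randrange(n)] = True
    return all(hit)

n = 100
m = int(n * math.log(n)) + 2 * n   # n ln(n) plus a couple of extra multiples of n
runs = 300
success = sum(all_bins_hit(n, m) for _ in range(runs)) / runs
print(success)  # success fraction; the bound guarantees at least 1 - e^(-2)
```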

That also implies the result about stable marriages that we were interested in: after about n ln(n) proposals under the random-proposal algorithm, every woman will have been proposed to, and a stable marriage will result. That means that if you start with random permutations as preferences, the proposal algorithm runs for about n ln(n) steps.

Expected number of rounds to collect all coupons

Now let's turn to an analysis of the expected value of m needed to hit all of the bins (or collect all coupons). Let X be the number of balls placed when the last bin is hit.

Let X_0 be the number of trials until the first bin is hit (this is always 1, since the first ball necessarily lands in an empty bin).

Let X_1 be the number of trials after the 1st bin is hit until the 2nd bin is hit.

Let X_2 be the number of trials after the 2nd bin is hit until the 3rd bin is hit.

...

Let X_{n-1} be the number of trials after the (n-1)st bin is hit until the nth bin is hit.

Then the total number of trials is:

X = X_0 + X_1 + ... + X_{n-1}

The stretches of trials counted by the X_i are called epochs. During the epoch counted by X_i, the probability that any given trial hits a new bin is

p_i = (n - i)/n

That's because i bins have already been hit, and n - i haven't.
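A quick numeric check of this epoch probability. The values n = 20 and i = 12 below are arbitrary illustrative choices:

```python
import random

n, i = 20, 12                 # illustrative: 12 of 20 bins already hit
hit = set(range(i))           # pretend bins 0..11 are the ones already hit
trials = 200_000
new_frac = sum(random.randrange(n) not in hit for _ in range(trials)) / trials
print(new_frac)               # should be close to (n - i)/n = 0.4
```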

X_i is called a geometrically distributed random variable. We will see why in a second.

Geometrically distributed random variables

The value of X_i is the first trial k on which a certain event happens, where the event occurs on each trial independently with the same probability p_i. So the distribution of X_i looks like this:

Pr[X_i = k] = (1 - p_i)^(k-1) · p_i,  k = 1, 2, 3, ...

which explains why it's called geometric: the sequence of probabilities forms a geometric series with ratio (1 - p_i).
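A quick numeric check of these two facts about the geometric distribution Pr[X = k] = (1 - p)^(k-1)·p, the constant ratio between consecutive probabilities and the total mass of 1 (p = 0.25 is an arbitrary illustrative value):

```python
p = 0.25  # assumed success probability for illustration

# Pr[X = k] = (1 - p)^(k-1) * p for k = 1, 2, 3, ...
pmf = [(1 - p) ** (k - 1) * p for k in range(1, 400)]

ratio = pmf[1] / pmf[0]   # consecutive terms shrink by the constant factor (1 - p)
total = sum(pmf)          # the probabilities sum to 1 (up to truncation at k = 399)
print(ratio, total)
```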

Let's compute the mean and variance for a geometric r.v. First the expected value:

E[X_i] = Σ_{k≥1} k·(1 - p_i)^(k-1)·p_i

We notice that the k·(1 - p_i)^(k-1) term looks like a derivative:

k·(1 - p_i)^(k-1) = -(d/dp_i)(1 - p_i)^k,  so  E[X_i] = -p_i · (d/dp_i) Σ_{k≥1} (1 - p_i)^k

And now we can sum the geometric series, Σ_{k≥1} (1 - p_i)^k = (1 - p_i)/p_i:

E[X_i] = -p_i · (d/dp_i)[(1 - p_i)/p_i] = -p_i · (-1/p_i²) = 1/p_i

So the expected value of a geometric r.v. is just 1/p_i. For the variance, we use the formula

Var[X_i] = E[X_i²] - (E[X_i])²

We just computed E[X_i], so all we need now is E[X_i²]:

E[X_i²] = Σ_{k≥1} k²·(1 - p_i)^(k-1)·p_i

And we do some rearranging (using k² = k(k-1) + k) to make it look like a second and a first derivative:

E[X_i²] = p_i(1 - p_i) · Σ_{k≥1} k(k-1)·(1 - p_i)^(k-2) + p_i · Σ_{k≥1} k·(1 - p_i)^(k-1)

Replacing the terms with the derivative expressions gives:

E[X_i²] = p_i(1 - p_i) · Σ_{k≥1} (d²/dp_i²)(1 - p_i)^k - p_i · Σ_{k≥1} (d/dp_i)(1 - p_i)^k

Then we can swap the derivatives and sums:

E[X_i²] = p_i(1 - p_i) · (d²/dp_i²) Σ_{k≥1} (1 - p_i)^k - p_i · (d/dp_i) Σ_{k≥1} (1 - p_i)^k

And solving for the sums of the geometric series (Σ_{k≥1} (1 - p_i)^k = (1 - p_i)/p_i, whose first derivative is -1/p_i² and whose second derivative is 2/p_i³) gives:

E[X_i²] = p_i(1 - p_i)·(2/p_i³) + p_i·(1/p_i²) = 2(1 - p_i)/p_i² + 1/p_i = (2 - p_i)/p_i²

The variance is the difference between this term and the square of E[X_i]:

Var[X_i] = (2 - p_i)/p_i² - 1/p_i² = (1 - p_i)/p_i²
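Both formulas for the geometric r.v., mean 1/p and variance (1 - p)/p², can be checked numerically by truncating the defining sums. A sketch, with an arbitrary illustrative value p = 0.2:

```python
p = 0.2  # assumed success probability for illustration

# Truncate the infinite sums; (1 - p)^(k-1) decays fast enough that 1000 terms suffice
mean = sum(k * (1 - p) ** (k - 1) * p for k in range(1, 1000))
ex2 = sum(k * k * (1 - p) ** (k - 1) * p for k in range(1, 1000))
var = ex2 - mean ** 2

print(mean)  # should be close to 1/p = 5
print(var)   # should be close to (1 - p)/p² = 20
```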

Back to Coupon Collectors

We have divided the sequence of placements into epochs, with X_i being the number of placements during the epoch in which i bins have already been hit. The expected value of X is

E[X] = Σ_{i=0}^{n-1} E[X_i] = Σ_{i=0}^{n-1} 1/p_i = Σ_{i=0}^{n-1} n/(n - i)

And we change variables using j = n - i, which gives

E[X] = n · Σ_{j=1}^{n} 1/j = n·H_n ≈ n ln(n)

for the expected number of rounds to hit all of the bins. Since the X_i are independent, their variances add:

Var[X] = Σ_{i=0}^{n-1} Var[X_i] = Σ_{i=0}^{n-1} (1 - p_i)/p_i² = Σ_{i=0}^{n-1} n·i/(n - i)²

Applying the substitution j = n - i:

Var[X] = Σ_{j=1}^{n} n(n - j)/j² = n² · Σ_{j=1}^{n} 1/j² - n · Σ_{j=1}^{n} 1/j
The first sum approaches π²/6 as n → ∞, so we have

Var[X] ≈ (π²/6)·n² - n ln(n) ≈ (π²/6)·n²

That is, the standard deviation is approximately (ignoring lower-order terms)

σ_X ≈ n·π/√6
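Both approximations, E[X] = n·H_n ≈ n ln(n) and σ_X ≈ n·π/√6, can be checked by simulation. A sketch (n = 50 and 4000 trials are arbitrary illustrative choices):

```python
import math
import random

def coupon_rounds(n):
    """Draw uniform coupons until all n types have been seen; return the count."""
    seen, rounds = set(), 0
    while len(seen) < n:
        seen.add(random.randrange(n))
        rounds += 1
    return rounds

n, trials = 50, 4000
samples = [coupon_rounds(n) for _ in range(trials)]
avg = sum(samples) / trials
std = (sum((x - avg) ** 2 for x in samples) / trials) ** 0.5

exact_mean = n * sum(1 / j for j in range(1, n + 1))  # n·H_n, roughly n ln(n)
approx_std = n * math.pi / math.sqrt(6)

print(avg, exact_mean)    # empirical mean should be close to n·H_n
print(std, approx_std)    # empirical std should be near n·π/√6
```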

Now we can use Chebyshev. E.g., suppose we want to ensure that the probability of an empty bin is less than 0.01. Pick t = 10 in the Chebyshev formula:

Pr[|X - E[X]| ≥ t·σ_X] ≤ 1/t²

Missing some bin after m placements means X > m, so taking m ≈ E[X] + t·σ_X ≈ n ln(n) + 10·n·π/√6 makes that probability at most 1/t² = 0.01.

That's roughly the same kind of bound we obtained earlier using a direct probability analysis. It's worth noting, however, that if we want to make the probability very small, the Chebyshev bound doesn't help very much. The actual probability falls off exponentially with distance from the mean, whereas the Chebyshev bound falls off only as 1/t².
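The gap between the Chebyshev bound and the true tail shows up clearly in simulation. A sketch (n = 50, 4000 trials, and t = 3 are arbitrary illustrative choices; t = 3 rather than 10 is used so the bound 1/t² is large enough to compare against empirically):

```python
import math
import random

def coupon_rounds(n):
    """Draw uniform coupons until all n types have been seen; return the count."""
    seen, rounds = set(), 0
    while len(seen) < n:
        seen.add(random.randrange(n))
        rounds += 1
    return rounds

n, trials, t = 50, 4000, 3
mean = n * sum(1 / j for j in range(1, n + 1))   # n·H_n
sigma = n * math.pi / math.sqrt(6)               # approximate standard deviation
threshold = mean + t * sigma

tail = sum(coupon_rounds(n) > threshold for _ in range(trials)) / trials
print(tail, 1 / t ** 2)  # empirical tail sits far below the Chebyshev bound 1/t²
```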