Math 55 - Fall 2007 - Lecture notes #33 - Nov 19 (Monday)

Goals for today:
   Application to politics:
      how close is an election before it is "too close to tell"?
   Variance and standard deviation
   Applications to computer science:
      searching a list

EX: Suppose there is an election between two politicians, say G and B.
Suppose that despite the best intentions (a big assumption!) there is
still a small chance of miscounting a ballot for the other person. In
fact, according to a front page NY Times article (11/17/2000) it is not
unreasonable to assume an error rate of p = .001, that is, the chance
that a ballot will be counted for G instead of B, or for B instead of G,
is .001. In fact an error rate of .01 is not unheard of.

The question is: what is the probability that the wrong person wins,
because of miscounting? The concern is that if the election is very
close, that is if the actual numbers of votes cast for G and B are very
close, then a few (randomly) miscounted ballots could swing the election
the wrong way. If the election is not close, then we don't have to worry
about this. So what we'd like to quantify is: assuming the probability
of miscounting a vote is some (tiny) p, how close does an election have
to be before there is a large enough probability of the wrong outcome to
want to do a recount? (The hope is that the recount is at least as
accurate as the first count!) There are in fact laws in some
jurisdictions that require recounts when the election is too close. We
will try to give a mathematical basis for making such a decision, using
data from the 2000 election.

To model this, we use Bernoulli trials: each ballot is independently
counted correctly with probability 1-p = .999, and incorrectly with
probability p = .001. This does not model the real political process
(e.g. it does not account for ballots thrown out entirely), but it is
intended to say whether a 537 vote margin out of ~6 million votes cast
is "statistically significant", i.e. whether there is a significant
probability that B could win by 537 votes, including miscounts, even
though more people voted for G.

To proceed, we assume there were
   vcB = 2,912,790 votes counted for B and
   vcG = 2,912,253 votes counted for G
for a total of vcB + vcG = 5,825,043 = T votes cast.

For motivation, let f(x) be the random variable equal to the total
number of incorrectly counted ballots; we want E(f). Letting f_i(x) = 1
if the i-th ballot was counted incorrectly, and 0 otherwise, we see
   f(x) = sum_{i=1 to 5,825,043} f_i(x)
and
   E(f) = sum_{i=1 to 5,825,043} E(f_i) = sum_{i=1 to 5,825,043} p
        = 5,825,043*p ~ 5,825
or more than 10 times the margin of the winner, 537 votes. So it seems
plausible that the wrong person could have won by "mistake".
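Here is a short Python sketch (not required, and not part of the
derivation above) that simulates the indicator variables f_i directly
and compares the simulated number of miscounts to E(f) = T*p:

   import random

   T = 5825043      # total ballots counted
   p = 0.001        # assumed probability of miscounting any one ballot

   # simulate f_i for each ballot and add them up, as in f = sum f_i
   miscounts = sum(1 for _ in range(T) if random.random() < p)

   print("simulated number of miscounted ballots:", miscounts)
   print("E(f) = T*p =", T * p)   # about 5825, over 10x the 537 vote margin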
We will also let
   avB = votes actually cast for B and
   avG = votes actually cast for G
which of course we don't know (though we know avB + avG = T, assuming no
votes were lost!). To see how likely it could be that the wrong person
won, we ask the following question: if avG = avB+1, that is, if G should
(barely) have won, then what is the probability that the margin for B,
vcB - vcG = 537, was as large as it was? More precisely, we want
   P(vcB - vcG >= 537 when avG = avB+1)
(Note: you may think that what we really want is
   P(avG > avB when vcB - vcG = 537)
i.e. the probability that G won given the vote counts, but we don't have
enough information to compute this: the actual number of votes is not
random, and there is no probability function for it. The most we can do
is ask, for different values of avG and avB, how likely the vote counts
are. For example, suppose there is one vote, counted for B. The chance
of this, given that the actual vote was cast for B, is 1-p. The chance
given that the actual vote was cast for G, is p. This does not mean that
the probability that G won is p, because that is not what is random.)

If avG = avB + 1 and avG + avB = T = 5,825,043, then
   avG = ceiling(5,825,043/2) = 2,912,522 votes, and
   avB = 2,912,521 votes.
Each vote is counted correctly and independently with probability 1-p.
Now we compute the probability that with miscounts, B wins by a margin
of 2,912,790 - 2,912,253 = 537 or more.

We let the random variable mvB = margin of votes for B, i.e. the number
of actual votes for B that were counted as votes for B, minus the number
that were miscounted as votes for G. Computing P(mvB=i) is identical to
the problem of taking a coin marked G and B, with probability P(B)=1-p
and P(G)=p, flipping it avB times, and computing the number of B's minus
the number of G's. To get mvB=i there need to be j of the avB votes
counted for B and avB-j miscounted for G, such that i = j-(avB-j), or
j = (i+avB)/2. Thus
   P(mvB = i) = { C( avB, j ) * (1-p)^j * p^(avB-j)  if j is an integer
                { 0                                  if j is not an integer
Similarly, we let the random variable mvG = margin of votes for G, i.e.
the number of actual votes for G that were counted as votes for G, minus
the number that were miscounted as votes for B. Again letting
j = (i+avG)/2 we get
   P(mvG = i) = { C( avG, j ) * (1-p)^j * p^(avG-j)  if j is an integer
                { 0                                  if j is not an integer

Finally, we want the probability that B wins by a margin of 537 or more
votes. Let the random variable d = mvB - mvG. Note that
   d =   number of votes for B counted correctly   (so for B)
       - number of votes for B counted incorrectly (so for G)
       - number of votes for G counted correctly   (so for G)
       + number of votes for G counted incorrectly (so for B)
     = number of votes counted for B - number of votes counted for G
Then we want
   P(d >= 537) = P(mvB - mvG >= 537)
     = sum_{m = 537 to T} P(mvB - mvG = m)
          because mvB - mvG could equal 537, 538, ..., T
     = sum_{m = 537 to T} sum_{r = m to T} P(mvB=r and mvG=r-m)
          because we could have mvB = m   and mvG = 0
                             or mvB = m+1 and mvG = 1
                             or mvB = m+2 and mvG = 2
                             or ...
                                mvB = T   and mvG = T - m
     = sum_{m = 537 to T} sum_{r = m to T} P(mvB=r) * P(mvG=r-m)
          by independence of mvB and mvG
Now you can plug in the formulas for P(mvB=r) and P(mvG=r-m) and get
something you can compute, which I invite you to try (not required). If
you write the simplest program, it will cost O(T^3) operations, too
expensive! Also, many of the C(avB,j) will overflow (be larger than
10^308, and so not be representable in double precision floating point),
and many of the p^(avB-j) will underflow (be less than 10^(-308)), so
care is needed. Later we will describe an easy approximation you can use
for this problem.

ASK&WAIT: When p = probability of error = .001, about what do you think
the probability of a margin of 537 votes or more is? (just to the
nearest power of 10)
ASK&WAIT: When p = .01, what do you think the probability of a margin of
537 or more votes is? for p = .1? for p = .5?
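One way to deal with the overflow/underflow (again a sketch, not
required) is to work with logarithms: compute log C(avB,j) from
log-factorials via Python's math.lgamma, and only exponentiate at the
end, if at all. The function name and the sample margins below are just
for illustration:

   from math import lgamma, log, exp

   def log_P_margin(av, i, p):
       # log of P(margin = i) when av actual votes are each counted
       # correctly with probability 1-p:
       #    P = C(av,j) * (1-p)^j * p^(av-j),  j = (i+av)/2
       if (i + av) % 2 != 0:        # j must be an integer
           return float('-inf')
       j = (i + av) // 2
       if j < 0 or j > av:
           return float('-inf')
       return (lgamma(av + 1) - lgamma(j + 1) - lgamma(av - j + 1)
               + j * log(1 - p) + (av - j) * log(p))

   avB, p = 2912521, 0.001
   lp1 = log_P_margin(avB, 2906697, p)  # margin near the typical value avB*(1-2p)
   lp2 = log_P_margin(avB, 537, p)      # would require ~1.5 million miscounts
   print(exp(lp1))   # small but representable probability
   print(lp2)        # hugely negative; exp(lp2) underflows to 0 in double precision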
Mean and Standard Deviation:

Many of you ask what the mean and standard deviation of the scores on
the midterm are, so you already understand that
  1) the mean tells you the average score, and
  2) the standard deviation (std) tells you how many people were "close
     to the average", i.e. you expect many scores to lie in the range
     from mean - x*std to mean + x*std where x = 1 or 2; if your score
     is < mean - 3*std you know you're in trouble, and if your score is
     > mean + 3*std you feel really good.
Now it is time to define this carefully.

DEF: Let S be a sample space, P a probability function, f a random
variable, and E(f) the expectation of f. Let g = (f - E(f))^2. Then the
variance of f is defined as
   V(f) = E( g ) = sum_{e in S} g(e)*P(e)
        = sum_{e in S} (f(e) - E(f))^2*P(e)
The standard deviation of f is defined as
   sigma(f) = ( V(f) )^(1/2)

EX: Roll a fair die 100 times, getting value fi in the set {1,2,3,4,5,6}
with probability 1/6 at roll i. Then
   E(fi) = 1/6 + 2/6 + ... + 6/6 = 3.5
and
   V(fi) = (1-3.5)^2/6 + (2-3.5)^2/6 + ... + (6-3.5)^2/6 ~ 2.9
   sigma(fi) ~ sqrt(2.9) ~ 1.7
Let g = f1 + f2 + ... + f100 and h = 100*f1. Then
   E(g) = E(f1) + ... + E(f100) = 100*E(f1) = E(h) = 350,
but
   V(g) = 100*V(f1) ~ 290   whereas   V(h) = 100^2*V(f1) ~ 29000
so sigma(g) ~ 17 and sigma(h) ~ 170 (how to do this calculation later).
The following figure (only in the pdf version of these notes)
illustrates this, where the vertical dashed lines are drawn at
mean-sigma and mean+sigma. The point is that "a lot" of the likely
values are in this range, even more in the range mean-2*sigma to
mean+2*sigma, etc. (we make this more precise later, via Chebyshev's
inequality).
   (see figure in pdf version of notes)

Here are some other formulas for computing the variance:

Thm 0: Let g(x) = a*f(x) + b where a and b are constants. Then
   E(g) = a*E(f) + b,  V(g) = a^2*V(f),  sigma(g) = |a|*sigma(f)
Proof: E(g) = sum_x (a*f(x)+b)*P(x)              by definition
            = a*sum_x f(x)*P(x) + b*sum_x P(x)   since a and b are constants
            = a*E(f) + b
       V(g) = E((g-E(g))^2)                      by definition
            = sum_x (g(x) - E(g))^2*P(x)
            = sum_x (a*f(x) + b - a*E(f) - b)^2*P(x)
            = sum_x a^2*(f(x) - E(f))^2*P(x)
            = a^2*V(f)
       sigma(g) = sqrt(V(g)) = sqrt(a^2)*sqrt(V(f)) = |a|*sigma(f)

Thm 1: V(f) = E(f^2) - (E(f))^2
Proof: V(f) = E((f-E(f))^2)                      by definition
            = sum_x (f(x)-E(f))^2*P(x)           by definition
            = sum_x f(x)^2*P(x) - sum_x 2*E(f)*f(x)*P(x) + sum_x E(f)^2*P(x)
            = sum_x f(x)^2*P(x) - 2*E(f)*sum_x f(x)*P(x) + E(f)^2*sum_x P(x)
            = E(f^2) - 2*E(f)*E(f) + E(f)^2*1
            = E(f^2) - (E(f))^2

Thm 2: Suppose f and g are independent. Then V(f+g) = V(f) + V(g)
Proof: V(f+g) = E((f+g)^2) - (E(f+g))^2          by Thm 1
              = E(f^2 + 2*f*g + g^2) - (E(f)+E(g))^2
              = E(f^2) + 2*E(f*g) + E(g^2) - (E(f))^2 - 2*E(f)*E(g) - (E(g))^2
              = E(f^2) + 2*E(f)*E(g) + E(g^2) - (E(f))^2 - 2*E(f)*E(g) - (E(g))^2
                   by independence of f and g
              = E(f^2) - (E(f))^2 + E(g^2) - (E(g))^2
              = V(f) + V(g)                      by Thm 1

EX: g = f1 + f2 + ... + f100 where fi is the result of rolling a fair
die once. By Thm 2 (applied repeatedly, since the rolls are independent),
   V(g) = V(f1) + ... + V(f100) = 100*V(f1) ~ 100*2.9 = 290
and so sigma(g) ~ sqrt(290) ~ 17. But
   V(100*f1) = 100^2*V(f1)   by Thm 0
and sigma(100*f1) ~ 100*sigma(f1) ~ 170, as claimed above.
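Here is a short Python check of these numbers (not required; it just
evaluates the formulas above directly):

   from math import sqrt

   faces = [1, 2, 3, 4, 5, 6]
   E1 = sum(v / 6 for v in faces)                 # E(f1) = 3.5
   V1 = sum((v - E1) ** 2 / 6 for v in faces)     # V(f1) ~ 2.9

   # g = f1 + ... + f100 (independent rolls): Thm 2 gives V(g) = 100*V(f1)
   # h = 100*f1:                              Thm 0 gives V(h) = 100^2*V(f1)
   print(E1, V1, sqrt(V1))        # 3.5, ~2.92, ~1.71
   print(sqrt(100 * V1))          # sigma(g) ~ 17
   print(sqrt(100 ** 2 * V1))     # sigma(h) ~ 170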
Now we quantify the notion that the value of a random variable f(x) is
unlikely to fall very far from E(f):

Thm (Chebyshev's Inequality): Let f be a random variable. Then
   P( |f(x) - E(f)| >= r*sigma(f) ) <= 1/r^2
In words, the probability that the value of a random variable f(x) is
farther than r times the standard deviation sigma(f) from its mean value
E(f) can be no larger than 1/r^2, which decreases rapidly as r
increases. Here is a table:
     r    P( |f(e) - E(f)| >= r*sigma(f) )
    ---   --------------------------------
     1    <= 1           (trivial bound on a probability!)
     2    <= 1/4  = .25
     5    <= 1/25 = .04
    10    <= 1/100= .01
Proof: Let A be the event {e: |f(e) - E(f)| >= r*sigma(f)}. We want to
bound P(A). Now compute
   V(f) = sum_{e in S} (f(e)-E(f))^2 * P(e)
        = sum_{e in A} (f(e)-E(f))^2 * P(e)
          + sum_{e not in A} (f(e)-E(f))^2 * P(e)
The second sum is at least 0, and each term in the first sum is at least
(r*sigma(f))^2*P(e), by the definition of A. Thus
   V(f) >= sum_{e in A} (r*sigma(f))^2 * P(e) = (r*sigma(f))^2 * P(A)
Dividing both sides by (r*sigma(f))^2 = r^2*V(f) gives 1/r^2 >= P(A), as
desired.

How good is Chebyshev's inequality, i.e. how close to
P( |f(x) - E(f)| >= r*sigma(f) ) can 1/r^2 be?

EX: Again consider rolling a die 100 times, and computing the sum h of
the 100 rolls. Then E(h) = 350 and sigma(h) ~ 17. Comparing the actual
P( |h-350| >= r*17 ) with 1/r^2 from Chebyshev yields the plot below:
Chebyshev is much too large for large r. (Later we will get a much
better approximation from the Central Limit Theorem.)
   (see figure only in pdf version of notes)

EX: Suppose we have a list of n distinct items L(1),...,L(n), and want
an algorithm that takes an input x and
   (1) returns i if L(i)=x,
   (2) returns n+1 otherwise.
An obvious algorithm is "linear search":
   i = 0
   repeat
      i = i+1
   until L(i) = x or i = n+1
   return i
Suppose x is chosen at random from a sample space S with probability
function P. What is the expectation of the operation count C of this
algorithm, i.e. how many times is the line "i = i+1" executed? If we run
this algorithm many times, E(C) tells us how long it will take "on
average", and sigma(C) tells us how variable the running time will be.
The answer depends on S and P:

EX 1: Suppose S = {L(1),...,L(n)} and P(L(i)) = 1/n; this means each
input x must be in the list L, and is equally likely. Then
   E(C) = 1*(1/n) + 2*(1/n) + 3*(1/n) + ... + n*(1/n) = (n+1)/2
and
   V(C) = E(C^2) - (E(C))^2
        = 1^2*(1/n) + 2^2*(1/n) + 3^2*(1/n) + ... + n^2*(1/n) - ((n+1)/2)^2
        = (n^3/3 + O(n^2))*(1/n) - (n^2/4 + O(n))
        = n^2/12 + O(n)
so sigma(C) = n/sqrt(12) + O(1) ~ .29*n
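Here is a small Python check of EX 1 (not required; the list length
n = 1000 is just an example):

   n = 1000
   probs = [1 / n] * n       # P(L(i)) = 1/n: x is always in the list
   EC  = sum((i + 1) * probs[i] for i in range(n))        # E(C)
   EC2 = sum((i + 1) ** 2 * probs[i] for i in range(n))   # E(C^2)
   VC  = EC2 - EC ** 2                                    # Thm 1
   print(EC)              # (n+1)/2 = 500.5
   print(VC, VC ** 0.5)   # ~ n^2/12 ~ 83333,  sigma(C) ~ .29*n ~ 289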
EX 2: Suppose S = {L(1),...,L(n),z}, and P(L(i)) = p/n, P(z) = 1-p,
i.e. the probability that x is not on the list is 1-p.
ASK&WAIT: What is E(C)? How does it depend on p?
ASK&WAIT: What are V(C) and sigma(C)? How do they depend on p?

EX 3: Suppose the P(L(i)) = p(i) are all different. In what order should
we search the L(i) to minimize E(C)?
ASK&WAIT: Suppose n=2, and p(2) >> p(1); which item should we search
first?

Thm: Searching the list in decreasing order of p(i) minimizes E(C).
Proof: Suppose that we do not search the list in decreasing order of
p(i); then we will show that there is a different search order with
smaller E(C). Let q(1),...,q(n) be the probabilities of the items in the
order they are searched, so q(1),...,q(n) is a permutation of
p(1),...,p(n). Suppose that q(1),...,q(n) is not in decreasing order;
this means that for some j, q(j) < q(j+1). Let
   r = 1 - p(1) - p(2) - ... - p(n)
be the probability that the search item is not in the list. Then
   C1 = E(C) for search order 1,...,j,j+1,...,n
      = sum_{i=1 to n} i*q(i) + (n+1)*r
and
   C2 = E(C) for search order 1,...,j+1,j,...,n
        (i.e. with j+1 searched before j)
      = sum_{i=1 to n except j and j+1} i*q(i) + j*q(j+1) + (j+1)*q(j)
        + (n+1)*r
Then
   C1 - C2 = j*q(j) + (j+1)*q(j+1) - j*q(j+1) - (j+1)*q(j)
           = q(j+1) - q(j) > 0
In other words, searching j+1 before j lowers the expected cost E(C).
So unless q(j) >= q(j+1) for all j, i.e. unless the list is searched in
decreasing order of probability, there is a better order to search in.

EX 4: Suppose P(L(i)) = c*q^i, q < 1, where c is chosen so that
sum_{i=1 to n} P(L(i)) = 1, i.e. x is guaranteed to be in the list.
ASK&WAIT: What is c?
ASK&WAIT: What is E(C)?
ASK&WAIT: What happens to E(C) as n grows?

Recall Binary Search: Assume L(1) < L(2) < ... < L(n), by sorting if
necessary.
   istart = 1; iend = n
   repeat
      middle = floor((istart + iend)/2)
      if x < L(middle) then iend = middle-1
      if x > L(middle) then istart = middle+1
   until x = L(middle) or iend < istart
Cost = O(log_2 n) steps, because each time the list to search is at most
half as long.
ASK&WAIT: What is faster for large n, linear search or binary search?
But in the worst case, linear search will take n steps, much slower than
binary search.
ASK&WAIT: Can you think of an algorithm that takes O(1) steps on
average, and O(log n) at worst, i.e. has the advantages of both kinds of
search?