CS 70 - Lecture 38 - Apr 25, 2011 - 10 Evans

Goals for today:
  Normal or Gaussian Distribution
  Central Limit Theorem
  Florida election in 2000: how close is a 537 vote margin out of 5.8M votes, when 5800 votes may be counted wrong, on average?
  Exponential Distribution

The most important probability distribution in many applications is the normal distribution:

Def: A random variable X with density function
  f(x) = (1/sqrt(2*pi)) * exp(-x^2/2)
is said to have a standard normal distribution, with mean 0 and standard deviation 1. More generally, a random variable Y with density function
  g(x) = (1/(sigma*sqrt(2*pi))) * exp(-(x-mu)^2/(2*sigma^2))
is said to have a normal distribution with mean mu and standard deviation sigma. This is also called a Gaussian distribution.

The plot of f(x) or g(x) is called a "bell shaped curve." f(x) is centered at its mean 0 (g(x) at mu), and the width of the main part of the bell is 2 (2*sigma for g(x)).

Def: The cumulative distribution function (cdf) for either X or Y can be written
  P(X <= z) = P( Y - mu <= z*sigma ) = (1/sqrt(2*pi)) * integral_{-inf}^z exp(-x^2/2) dx = Normal(z)

As in Chebyshev's inequality, we use the quantity z to measure the number of standard deviations sigma the random variable is away from its expectation mu. The plot below shows both f(z) and Normal(z) for the range -4 to 4. The sides of the bell curve decrease rapidly, which means that the chance of being far away from the mean is small. In the table below, the second column, Normal(-z), is the probability of being at least z standard deviations below the mean; the third column is the probability of being within z standard deviations of the mean; and the last column is the upper bound that Chebyshev's inequality would yield for Normal(-z), which is clearly much larger than the actual value.
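As a quick numerical sanity check (not part of the lecture; a sketch in Python with a helper name normal_density of our own choosing), the density g(x) really does have total mass 1, mean mu, and variance sigma^2, which a crude Riemann sum confirms:

```python
import math

def normal_density(x, mu=0.0, sigma=1.0):
    """The density g(x) of a normal distribution with mean mu, std dev sigma."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Crude Riemann sum over mu +/- 10*sigma (the mass beyond that is negligible)
mu, sigma, dx = 3.0, 2.0, 0.001
xs = [mu - 10 * sigma + i * dx for i in range(int(20 * sigma / dx))]
mass = sum(normal_density(x, mu, sigma) * dx for x in xs)                  # ~ 1
mean = sum(x * normal_density(x, mu, sigma) * dx for x in xs)              # ~ mu = 3
var = sum((x - mu) ** 2 * normal_density(x, mu, sigma) * dx for x in xs)   # ~ sigma^2 = 4
print(mass, mean, var)
```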
  z   Normal(-z)   Normal(z)-Normal(-z)   Chebyshev/2 = 1/(2*z^2)
  4   3e-5         .99994                 1/32 ~ .031
  3   .001         .997                   1/18 ~ .056
  2   .023         .95                    1/8  = .125
  1   .16          .68                    1/2  = .5

We note that there is no simple closed form formula to compute Normal(z). But it is so important that historically big tables of its values were published; today it is a built-in function in statistical software packages, and of course there are free webpages that will compute it for you.
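For example, Normal(z) can be evaluated via the error function erf, which is built into Python's math library (Normal(z) = (1 + erf(z/sqrt(2)))/2); the short sketch below, with Phi as our name for Normal, reproduces the table:

```python
import math

def Phi(z):
    """Normal(z), the standard normal cdf: (1 + erf(z / sqrt(2))) / 2."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

for z in (4, 3, 2, 1):
    below = Phi(-z)              # P(at least z std devs below the mean)
    within = Phi(z) - Phi(-z)    # P(within z std devs of the mean)
    cheb = 1.0 / (2 * z * z)     # Chebyshev's (much weaker) bound on Normal(-z)
    print(f"z={z}: Normal(-z)={below:.5f}  within={within:.5f}  Chebyshev/2={cheb:.3f}")
```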

The most surprising and important fact about the Gaussian distribution is how well it approximates so many other distributions, a fact called the Central Limit Theorem (CLT):

Central Limit Theorem: Let X_i be independent random variables, continuous or discrete, with expectations mu_i and standard deviations sigma_i. Let S_n = sum_{i=1 to n} X_i, so that E(S_n) = sum_{i=1 to n} mu_i and V(S_n) = sum_{i=1 to n} sigma_i^2 = sigma^2(S_n). Then
  lim_{n -> inf} P( S_n - E(S_n) <= z*sigma(S_n) ) = Normal(z)
In the special case when the X_i are i.i.d., and so all have the same expectation mu and standard deviation sigma, we let A_n = S_n/n and simplify this to
  lim_{n -> inf} P( A_n - mu <= z*sigma/sqrt(n) ) = Normal(z)

There are some technical assumptions needed to prove these theorems (we will not do the proofs, which are hard). It is enough to assume that all the X_i are bounded, and the sigma_i are bounded away from zero (if sigma_i were zero, X_i wouldn't be "random"!). The examples below illustrate this theorem in two cases, where the X_i are i.i.d.

Ex 1: We flip a fair coin n times, where n = 1, 2, 10, and 100 in the four pictures below, with X_i = 1 if the i-th toss is Heads, and 0 if Tails. Thus S_n = X_1 + ... + X_n is the number of Heads, and mu = .5 and sigma = .5. The black and blue curves are the same as in the figure above, and the red curve C(z) is the true cumulative distribution function for the problem, namely
  C(z) = P( #Heads/n - .5 <= z*.5/sqrt(n) )
       = P( #Heads <= .5*n + z*.5*sqrt(n) )
       = sum_{i=0 to floor(.5*n + z*.5*sqrt(n))} C(n,i)*.5^i*(1-.5)^(n-i)
       = sum_{i=0 to floor(.5*n + z*.5*sqrt(n))} C(n,i)/2^n
The red curve has steps, because it only changes when .5*n + z*.5*sqrt(n) passes an integer from 0 to n. We see that as n increases, the red curve C(z) gets closer and closer to the black curve Normal(z).
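The convergence in Ex 1 can also be checked numerically without the plots. Here is a sketch in Python (the names Phi, binom_cdf, and coin_cdf are ours; the binomial probabilities are summed in log space via lgamma so that large n does not overflow):

```python
import math

def Phi(z):
    """Normal(z) via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p); log-space pmf avoids overflow."""
    if k < 0:
        return 0.0
    total = 0.0
    for i in range(min(k, n) + 1):
        logpmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                  + i * math.log(p) + (n - i) * math.log(1 - p))
        total += math.exp(logpmf)
    return total

def coin_cdf(z, n, p=0.5):
    """The red curve C(z): P(#Heads <= n*p + z*sqrt(p*(1-p))*sqrt(n))."""
    k = math.floor(n * p + z * math.sqrt(p * (1 - p)) * math.sqrt(n))
    return binom_cdf(k, n, p)

for n in (10, 100, 10000):
    # The gap at z = 1 becomes small as n grows (though not monotonically,
    # because of the steps in the red curve).
    print(n, abs(coin_cdf(1.0, n) - Phi(1.0)))
```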

Ex 2: We make the same plots, but with a biased coin where P(H) = .9, so mu = .9 and sigma = .3. It takes a larger n, but eventually
  C(z) = P( #Heads/n - .9 <= z*.3/sqrt(n) )
       = sum_{i=0 to floor(.9*n + z*.3*sqrt(n))} C(n,i)*.9^i*.1^(n-i)
approaches Normal(z) as closely as you like.

The CLT does not require the X_i to be i.i.d. Indeed, one could flip fair coins, unfair coins, roll dice, ask random people their salaries, etc., add up all the results to get S_n, and you'd still get a bell curve. We use the general version for our final example:

Ex 3: We consider the 2000 election in Florida. Recall the historical data: Out of 5,825,043 votes cast, 2,912,790 were counted for Bush, and 2,912,253 for Gore, a margin of 537 votes. According to a NY Times article of the time, it is not unusual for 1 out of 1000 votes to be counted incorrectly; let us model this by saying that whatever an actual ballot says, it is counted correctly with probability .999 and incorrectly with probability .001, and that each vote is counted independently. So it is like flipping a coin biased to be "correct" or "Heads" with probability .999, and "incorrect" or "Tails" with probability .001. Thus, with about 5,825,043 votes, the expected number of incorrectly counted votes is about 5,825, or more than 10 times the winning margin. How can we tell if this is so close that we need to worry, and count it again more carefully?

To answer this, we will ask the following question: supposing Gore had actually won, what is the probability that the margin of counted votes for Bush would be at least 537? If this is tiny, we should feel reassured. To get an upper bound on this probability, we will assume that Gore actually won by just one vote (because if he won by more votes, the chance of miscounting enough votes to get a margin of 537 for Bush would be even smaller). In other words, we will compute
  P(at least 537 more votes are counted for Bush than Gore | there were 2,912,522 votes for Gore and 2,912,521 votes for Bush)

To answer this question, we will use the first, more general form of the Central Limit Theorem: Let G_1,...,G_2912522 be i.i.d. random variables representing the actual votes for Gore:
  G_i = +1 with probability .999 (i.e. counted correctly)
        -1 with probability .001 (i.e. counted incorrectly)
Similarly, let B_1,...,B_2912521 be i.i.d. random variables representing the actual votes for Bush:
  B_i = -1 with probability .999 (i.e. counted correctly)
        +1 with probability .001 (i.e. counted incorrectly)
Then S = G_1 + ... + G_2912522 + B_1 + ... + B_2912521 is a sum of independent random variables which counts the margin of votes counted for Gore (if positive) or Bush (if negative). The first part of the Central Limit Theorem applies to S, with
  E(G_i) = .999 - .001 = .998 = -E(B_i)
  V(G_i) = 1 - .998^2 ~ .004 = V(B_i)
so
  E(S) = 2,912,522*.998 - 2,912,521*.998 = .998
  V(S) = 5,825,043*.004 ~ 23277   and   sigma(S) ~ 153
The CLT tells us that for n large (and 5,825,043 is large!)
  P( S - E(S) <= z*sigma(S) ) ~ Normal(z)
or
  P( S - .998 <= z*153 ) ~ Normal(z)
or
  P( S <= z*153 + .998 ) ~ Normal(z)
We want to compute P(S <= -537), i.e. the probability that the counted votes gave Bush a lead of at least 537; solving z*153 + .998 = -537 yields z ~ -3.52, so
  P( S <= -537 ) = P(Bush had a margin of at least 537 counted votes | Gore won) ~ Normal(-3.52) ~ .00021
So we can be reassured...

Exponential Distribution:

Recall our discussion of the Poisson distribution: The physical situations of interest were ones where there were lots of random physical events, like
  calls to a call center
  accesses to a web service
  ticks on a Geiger counter
caused by a very large number n of entities (callers, web-surfers, radioactive nuclei) making independent decisions whether or not to call/access a website/decay, each with very low probability p. We defined lambda = n*p, and showed that it could be interpreted as the average number of events per time unit (calls per hour, web accesses per minute, ticks per second). Then the Poisson distribution told us the probability of getting i events per time unit:
  P(i events per time unit) = exp(-lambda)*lambda^i/i!
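Before moving on, two of the computations above are easy to check numerically: the election-margin estimate from Ex 3, and the Poisson probabilities just recalled. A sketch in Python (Phi is our name for Normal(z), computed via the built-in error function):

```python
import math

def Phi(z):
    """Normal(z) via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Ex 3: chance of a counted margin of at least 537 for Bush, given Gore won by 1
gore, bush = 2912522, 2912521
e_vote = 0.999 - 0.001                       # E(G_i) = .998 = -E(B_i)
v_vote = 1 - e_vote ** 2                     # V(G_i) = V(B_i) ~ .004
e_s = (gore - bush) * e_vote                 # E(S) = .998
sigma_s = math.sqrt((gore + bush) * v_vote)  # sigma(S) ~ 153
z = (-537 - e_s) / sigma_s                   # ~ -3.52
prob = Phi(z)                                # ~ .0002
print(z, prob)

# Poisson: P(i events per time unit) = exp(-lambda)*lambda^i/i!
lam = 3.0
poisson = [math.exp(-lam) * lam ** i / math.factorial(i) for i in range(60)]
print(sum(poisson))                           # probabilities sum to ~ 1
print(sum(i * p for i, p in enumerate(poisson)))  # mean ~ lambda
```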
The other natural question to ask about such a physical situation is the distribution of the time between consecutive events, e.g. between the starts of two calls, the starts of two web accesses, or two ticks on the Geiger counter. This time T is a continuous random variable, with an exponential distribution:
  P(T <= t) = 1 - exp(-lambda*t)
and so a density function
  f(t) = d(1 - exp(-lambda*t))/dt = lambda * exp(-lambda*t)
Another way to see why this intuitively makes sense is that P(T > t) = exp(-lambda*t) gets small very fast, either as t increases, or as the average number of events per time unit, lambda, increases. We can compute
  E(T) = integral_0^{infinity} t * lambda*exp(-lambda*t) dt = 1/lambda
This also makes intuitive sense, because if there are lambda events per time unit on average, the average time between events must be 1/lambda. We can also compute
  V(T) = E(T^2) - (E(T))^2 = integral_0^{infinity} t^2 * lambda*exp(-lambda*t) dt - 1/lambda^2 = 1/lambda^2
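These formulas are easy to check by simulation: Python's random.expovariate draws samples with exactly this density (a sketch; the sample size and seed are arbitrary choices of ours):

```python
import random

random.seed(0)                 # fixed seed so the run is reproducible
lam = 2.0                      # lambda: average number of events per time unit
n = 200_000
samples = [random.expovariate(lam) for _ in range(n)]

mean = sum(samples) / n                            # E(T)   = 1/lambda   = 0.5
var = sum((t - mean) ** 2 for t in samples) / n    # V(T)   = 1/lambda^2 = 0.25
tail = sum(t > 1.0 for t in samples) / n           # P(T>1) = exp(-2)    ~ 0.135
print(mean, var, tail)
```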