Math 55 - Fall 2007 - Lecture notes #36 - November 28 (Wednesday)

Goals for today:  Understand how likely a random variable f is to be far
                  from its average value E(f) - Central Limit Theorem

Last time we defined the variance of a random variable f:

   V(f) = E( (f(x) - E(f))^2 ) = sum_x (f(x) - E(f))^2 * P(x)
        = average of the square of the distance from f to its average value E(f)

and the standard deviation

   sigma(f) = sqrt(V(f))

We claimed that sigma(f) measured how "spread out" f was around E(f),
and proved Chebyshev's Inequality to quantify this idea:

   P( |f(x) - E(f)| >= r*sigma(f) ) <= 1/r^2

In words, the probability that the distance |f(x) - E(f)| from f(x) to its
average value E(f) is at least r standard deviations is low when r is large
(it shrinks at least as fast as 1/r^2).

We showed a picture for the random variable gotten by summing the values of
100 dice throws (f = f_1 + f_2 + ... + f_100, where f_i is the number on top
of the i-th die throw), where P( |f(x) - E(f)| >= r*sigma(f) ) shrank much
more quickly than 1/r^2.

We will discuss (without a complete proof) the following amazing fact:
Suppose f is gotten by summing a large number of independent random variables.
That is, f = f_1 + f_2 + ... + f_n for some large n, where each f_i comes from
flipping a coin, rolling a die, playing poker, or doing _anything_ randomly
and independently, over and over again. For example, f_1 could be the number
of heads from tossing a fair coin once, f_2 could be the number of Kings minus
the number of Queens in 5 randomly selected cards, etc. Then the function
P( |f(x) - E(f)| >= r*sigma(f) ) is almost the same for any f. In other words,
there is a single function Normal(r) such that P( |f(x) - E(f)| >= r*sigma(f) )
gets closer and closer to Normal(r) as n grows, where f = f_1 + ... + f_n.

This fact is called the Central Limit Theorem, and is one of the most important
theorems in probability theory. Among other things, it gives us a fast way to
accurately approximate all the complicated probabilities of the wrong person
winning an election that we computed earlier. There are some simple and natural
conditions on the f_i for this to be true; for example, f_i can't take one
value with probability (nearly) 1, in which case f would also take one value
with probability (nearly) 1. The theorem is true for all the examples from coin
tossing, dice rolling, card playing, etc. that we have used.

We will start by describing pictures for different f and n, just to see with
our own eyes what happens. We will look at 3 different cases (the pictures are
in the pdf version of these notes):

1) Flipping a fair coin n times, and counting the number of heads:
   f is a sum of f_i where
      P(f_i = 0) = .5
      P(f_i = 1) = .5
2) Flipping a biased coin n times, and counting the number of heads:
   f is a sum of f_i where
      P(f_i = 0) = .1
      P(f_i = 1) = .9
3) Rolling a biased die n times, and adding the numbers that come up:
   f is a sum of f_i where
      P(f_i = 0) = .05
      P(f_i = 1) = .15
      P(f_i = 2) = .25
      P(f_i = 3) = 0
      P(f_i = 4) = .1
      P(f_i = 5) = .45
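If you want to reproduce the distributions behind these pictures yourself,
here is a minimal Python sketch (the helper names pmf_of_sum and
mean_and_sigma are just made up for this illustration). It computes the exact
distribution of f = f_1 + ... + f_n by repeated convolution, along with E(f)
and sigma(f), for any of the three cases above:

    # A minimal sketch: exact distribution of f = f_1 + ... + f_n by repeated
    # convolution, plus its mean and standard deviation.  Pure Python, no plotting.

    def pmf_of_sum(single, n):
        """single[v] = P(f_i = v); returns a dict s -> P(f_1 + ... + f_n = s)."""
        total = {0: 1.0}                       # distribution of the empty sum
        for _ in range(n):
            new = {}
            for s, p in total.items():         # convolve with one more copy of f_i
                for v, q in single.items():
                    new[s + v] = new.get(s + v, 0.0) + p * q
            total = new
        return total

    def mean_and_sigma(pmf):
        m = sum(v * p for v, p in pmf.items())
        return m, sum((v - m) ** 2 * p for v, p in pmf.items()) ** 0.5

    fair_coin   = {0: .5, 1: .5}                                  # case 1
    biased_coin = {0: .1, 1: .9}                                  # case 2
    biased_die  = {0: .05, 1: .15, 2: .25, 3: 0, 4: .1, 5: .45}   # case 3

    for n in (1, 2, 3, 5, 10, 20, 50, 100, 500):
        m, s = mean_and_sigma(pmf_of_sum(fair_coin, n))
        print(n, round(m, 4), round(s, 4))     # E(f) = n/2, sigma(f) = sqrt(n/4)

For case 1 this just reproduces the binomial probabilities C(n,k)/2^n; the
same loop run on biased_coin or biased_die gives the data for the other two
sets of pictures.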
Here is an explanation of the following 3 sets of pictures. The first set
shows pictures of results for tossing a fair coin n times, for
n = 1, 2, 3, 5, 10, 20, 50, 100, and 500. For each n, there are three pictures
showing the same data in different ways.

The top plot on each page of 3 plots shows P(f(x)=i) as a function of i.
For example, with n=1, P(f(x)=0) = .5 and P(f(x)=1) = .5, and this is shown by
the two red lines at 0 and 1 (both with height .5). With n=2, P(f(x)=0)=.25,
P(f(x)=1)=.5 and P(f(x)=2)=.25, so there are two red lines at 0 and 2 (with
height .25) and one red line at 1 (with height .5). In addition, the mean E(f)
is marked by a vertical black line, and the standard deviation is indicated by
a horizontal green line stretching from E(f)-sigma(f) to E(f)+sigma(f), i.e.
the range in which we expect most values of f to lie.

The middle plot on each page plots the same data but with a different
horizontal scale:
    0 corresponds to E(f)
    1 corresponds to E(f) + sigma(f)
   -1 corresponds to E(f) - sigma(f), etc.
An equivalent way to describe the middle plot is to say that it plots the
function P( f(x) - E(f) = r*sigma(f) ) as a function of r. For large n, only
the red +'s at the tops of the red lines are shown, not the red lines, to make
the plot easier to read. It is in the middle plot that the result of the
Central Limit Theorem first becomes apparent: the points
( r, P( f(x) - E(f) = r*sigma(f) ) ) marked by red +'s will converge to lie on
a "bell curve" proportional to a function normal(r) defined below, which is
marked in blue for plots with large values of n. (When the blue curve is
shown, the maximum distance between the red +'s and the blue curve is also
shown. For example, when n=100 and flipping a fair coin, this distance is
shown on the plot as "P(sum)-limit = .0025031".) What is striking is that the
same bell-curve-shaped function proportional to normal(r) appears as the limit
of the red +'s for all the experiments: a fair coin, biased coin, or biased
die.

The bottom plot plots P( |f(x) - E(f)| >= r*sigma(f) ) versus r. When n is
large, this red curve approaches a limit which is shown in blue. This blue
curve is the plot of the function Normal(r) described in the Central Limit
Theorem. Normal(r) and normal(r) are related in a simple way described below
(Normal(r) is the area under the curve normal(r)).

The first set of plots is for a fair coin, the second set for the biased coin,
and the third set for the biased die.
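Continuing the Python sketch above (it reuses pmf_of_sum, mean_and_sigma and
the three case distributions defined there), we can tabulate the bottom-plot
quantity P( |f(x) - E(f)| >= r*sigma(f) ) for all three cases at one
moderately large n. The point is that the three values in each row already
come out fairly close to one another (and agree better and better as n grows),
and for r = 2 and 3 they are far below Chebyshev's bound 1/r^2:

    # Continuing the sketch above (reuses pmf_of_sum, mean_and_sigma and the
    # three case distributions defined there): exact tail probabilities at n = 200.

    def tail(pmf, r):
        m, s = mean_and_sigma(pmf)
        return sum(p for v, p in pmf.items() if abs(v - m) >= r * s)

    n = 200
    sums = [pmf_of_sum(d, n) for d in (fair_coin, biased_coin, biased_die)]
    for r in (1, 2, 3):
        print(r, [round(tail(pmf, r), 4) for pmf in sums],
              "  Chebyshev bound:", round(1 / r ** 2, 4))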
While it is beyond the scope of the course to prove the Central Limit Theorem
in general, we will prove a special case, and we can certainly use the general
case. Here is an easy-to-state and fairly general version of the Central Limit
Theorem (the most general version is rather more complicated to state):

Suppose we have any finite set R of possible kinds of random variables.
For example, R could be

   R = {"flip a fair coin, getting 1 for Heads and 0 for Tails",
        "flip an unfair coin, getting 1 for Heads and -3.5 for Tails",
        "roll a fair die 3 times, and add up the 3 values you get",
        "draw a random set of 5 cards, count the number of Kings k,
            and return k^2/7",
        "count an actual vote for B correctly (+1) with probability .999,
            and incorrectly (-1) with probability .001",
        "count an actual vote for G correctly (+1) with probability .999,
            and incorrectly (-1) with probability .001"}

The only restriction is that each kind of random variable takes at least 2
different values with probability > 0 (so it really is random). Now let
f_1, f_2, f_3, ..., f_n be a sequence of independent random variables, each
one of a kind taken from R. For example, each f_i could be the same kind
("flip a fair coin, getting 1 for Heads and 0 for Tails"), or they could be
different. Sum these random variables to get f = f_1 + f_2 + ... + f_n.
Then as n gets larger, the function P( |f(x)-E(f)| >= r*sigma(f) ) converges
to the function Normal(r), where

   Normal(r) = integral from r to infinity of sqrt(2/pi) * exp(-s^2/2) ds

Another way to say this is that Normal(r) is the area under the "bell curve"

   normal(s) = sqrt(1/(2*pi)) * exp(-s^2/2)

for s >= r and s <= -r, i.e. under the "tails" of the bell curve. Normal(r) is
the blue curve in the bottom plots on previous pages, and c * normal(r) is the
blue curve in the middle plots (c is a constant that depends on n and
sigma(f), as described later). Normal(r) is also called the "normal
distribution function", and because of its importance it is widely available
via subroutine libraries on computers (e.g. "normcdf" in Matlab), and in
tables in books.

We will sketch a proof of this in a simple case: fair coin flipping, where
f_i = 1 for Heads and 0 for Tails, so that f = f_1 + f_2 + ... + f_n is just
the total number of heads in n coin flips.

Let us say more precisely what we will prove. First consider the bell-shaped
curve. What it will mean to converge to the bell-shaped curve is that

   lim_{n -> infinity} .5 * sqrt(n) * P(#Heads(n) - E(f) = r*sigma(f))
      = normal(r) = sqrt(1/(2*pi)) * exp(-r^2/2)

where E(f) and sigma(f) are the mean and standard deviation of f, the number
of Heads after n flips.

ASK&WAIT: What are E(f) and sigma(f)?
ASK&WAIT: What is P(#Heads(n) - E(f) = r*sigma(f))?

Thus we have to show

(*)   lim_{n -> infinity} .5 * sqrt(n) * C(n, n/2 + r*sqrt(n/4)) * (1/2)^n
    = lim_{n -> infinity} .5 * sqrt(n)
         * n! / [ (n/2 + r*sqrt(n/4))! * (n/2 - r*sqrt(n/4))! ] * (1/2)^n
    = normal(r) = sqrt(1/(2*pi)) * exp(-r^2/2)

To do this we need another way to approximate n! for large n:

   Stirling's Formula: for large n,  n! ~ sqrt(2*pi) * n^(n+1/2) * exp(-n)

(By this we mean that the ratio n! / (sqrt(2*pi) * n^(n+1/2) * exp(-n))
approaches 1 as n gets larger and larger.)

    n       n!        n!/Stirling's formula
   --    ---------    ---------------------
    5       120              1.017
   10      3.6e6             1.008
   20      2.4e18            1.004
   40      8.2e47            1.002
   80      7.2e118           1.001

For the moment we will just use Stirling's Formula, and come back later to
(mostly) prove it.
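The table above is easy to check for yourself; here is a quick Python sketch
(using only the standard math module; the name stirling is made up here) that
recomputes the ratio n!/Stirling's formula:

    # Quick check of Stirling's formula n! ~ sqrt(2*pi) * n^(n+1/2) * exp(-n).
    import math

    def stirling(n):
        return math.sqrt(2 * math.pi) * n ** (n + 0.5) * math.exp(-n)

    for n in (5, 10, 20, 40, 80):
        f = math.factorial(n)
        print(n, "%.1e" % f, round(f / stirling(n), 3))

For n = 5 this prints a ratio of about 1.017, and the ratio creeps toward 1
as n grows, as the table indicates.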
Now we can plug Stirling's formula into (*) to get

   .5 * sqrt(n) * C(n, n/2 + r*sqrt(n/4)) * (1/2)^n
   = .5 * sqrt(n) * n! / [ (n/2 + r*sqrt(n/4))! * (n - (n/2 + r*sqrt(n/4)))! ] * 2^(-n)
   ~ .5 * sqrt(n) * sqrt(2*pi*n) * n^n * e^(-n)
       / [ sqrt(2*pi*(n/2 + r*sqrt(n/4)))
             * (n/2 + r*sqrt(n/4))^(n/2 + r*sqrt(n/4)) * e^(-(n/2 + r*sqrt(n/4)))
         * sqrt(2*pi*(n/2 - r*sqrt(n/4)))
             * (n/2 - r*sqrt(n/4))^(n/2 - r*sqrt(n/4)) * e^(-(n/2 - r*sqrt(n/4))) ]
       * 2^(-n)

Some simplification yields

   = sqrt(n/[2*pi*(n-r^2)]) * n^n
       / [ (n/2)^(n/2 + r*sqrt(n/4)) * (1 + r/sqrt(n))^(n/2 + r*sqrt(n/4))
         * (n/2)^(n/2 - r*sqrt(n/4)) * (1 - r/sqrt(n))^(n/2 - r*sqrt(n/4)) ] * 2^(-n)
              ... cancelling all the exponential terms e^(...)
   = sqrt(n/[2*pi*(n-r^2)]) * n^n
       / [ (n/2)^n * 2^n * (1 + r/sqrt(n))^(n/2 + r*sqrt(n/4))
         * (1 - r/sqrt(n))^(n/2 - r*sqrt(n/4)) ]
              ... combining the (n/2)^(...) terms
   = sqrt(n/[2*pi*(n-r^2)]) * 1
       / [ (1 - r^2/n)^(n/2) * (1 + r/sqrt(n))^(r*sqrt(n/4))
         * (1 - r/sqrt(n))^(-r*sqrt(n/4)) ]
              ... cancelling the n^n and 2^n factors

To further simplify we recall two facts from calculus:

(**)   lim_{x -> infinity} (1 + 1/x)^x = exp(1)
       lim_{x -> infinity} (1 - 1/x)^x = exp(-1)

and reorganize the last expression to fit this pattern:

   = sqrt(n/[2*pi*(n-r^2)]) * 1
       / [ (1 - r^2/n)^[(n/r^2)*(r^2/2)]
         * (1 + r/sqrt(n))^[(sqrt(n)/r)*(r^2/2)]
         * (1 - r/sqrt(n))^[-(sqrt(n)/r)*(r^2/2)] ]

Now we can let n -> infinity, and use (**) on the 3 expressions in the
denominator to get

   lim_{n -> infinity} .5 * sqrt(n) * C(n, n/2 + r*sqrt(n/4)) * (1/2)^n
   = sqrt(1/(2*pi)) / [ exp(-r^2/2) * exp(r^2/2) * exp(r^2/2) ]
   = sqrt(1/(2*pi)) * exp(-r^2/2)

as desired (whew!). For fun you can try doing this limit with an unfair coin.

Finally, we comment briefly on where the formula for
Normal(r) ~ P( |f(x) - E(f)| >= r*sigma(f) ) arises. We want to sum

   P(#Heads - E(f) = s*sigma(f)) = P(#Heads - n/2 = s*sqrt(n/4))

for all values of s >= r and s <= -r, but only those s that correspond to an
integer value of #Heads. In other words, s*sqrt(n/4) has to be an integer
(assuming n/2 is an integer, for simplicity), and so s is an integer multiple
of sqrt(4/n), say s = m*sqrt(4/n) where m is an integer. So we want to sum
these probabilities over integers m starting at m = r*sqrt(n/4). Just doing
the sum over s >= r (the sum over s <= -r has the same value):

   sum_{m=r*sqrt(n/4) to n} P(#Heads - E(f) = m)
   ~ sum_{m=r*sqrt(n/4) to n}
        sqrt(4/n) * sqrt(1/(2*pi)) * exp(-(m*sqrt(4/n))^2/2)
          ... from the limit we did above
   ~ sum_{m=r*sqrt(n/4) to infinity}
        sqrt(4/n) * sqrt(1/(2*pi)) * exp(-(m*sqrt(4/n))^2/2)
          ... since we are taking n -> infinity in the limit anyway

Now we recognize this sum as an approximation to the

   integral from r to infinity of sqrt(1/(2*pi)) * exp(-s^2/2) ds

because it is the sum of areas of rectangles filling up the area under this
curve, the rectangles having base sqrt(4/n) and height
sqrt(1/(2*pi)) * exp(-(m*sqrt(4/n))^2/2). As n -> infinity, these rectangles
get narrower and their sum (called a Riemann sum in Math 1B) becomes a better
and better approximation of the integral.

Now we return to the proof of Stirling's Formula. We will not be able to show
this completely, but instead we will show that n! is approximately
C*n^(n+1/2)*exp(-n) for some constant C. Start by noting that

   log(n!) = log(2) + log(3) + ... + log(n)

is also the area inside the black boxes in the figure shown in the pdf version
of these notes (for n=6). The area under the upper (green) curve y = log(x+1)
is clearly an upper bound for log(n!), the area under the bottom (red) curve
y = log(x) is clearly a lower bound for log(n!), and the area under the middle
(blue) curve y = log(x+1/2) is a reasonable approximation to log(n!).
Integrating log(x+1/2) from 1 to n to get the area under the blue curve yields

   log(n!) ~ integral from 1 to n of log(x+1/2) dx
           = integral from 1.5 to n+1/2 of log(s) ds
                 ... by substituting x = s - 1/2
           = [s*log(s) - s] at s = n+1/2  minus  [s*log(s) - s] at s = 3/2
           = (n+1/2)*log(n+1/2) - n + c

where c is a constant, and so

   n! = exp(log(n!)) ~ exp((n+1/2)*log(n+1/2) - n + c)
      = (n+1/2)^(n+1/2) * exp(-n) * exp(c)

This is essentially Stirling's formula except for the constant factor.
We use the fact (**) from calculus used above to simplify further and get

   (n+1/2)^(n+1/2) = n^(n+1/2) * (1 + 1/(2*n))^(n+1/2)
                   = n^(n+1/2) * (1 + 1/(2*n))^[(2*n) * (1/2 + 1/(4*n))]
                   ~ n^(n+1/2) * e^(1/2 + 1/(4*n))
                   ~ n^(n+1/2) * e^(1/2)

so n! ~ n^(n+1/2) * exp(-n) * exp(c + 1/2), which is C*n^(n+1/2)*exp(-n) with
C = exp(c + 1/2), as promised.
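To see how quickly the limit (*) kicks in, here is a small Python check (the
names lhs and normal are made up here). It computes the quantity inside the
limit in (*), using logarithms of factorials (math.lgamma) so nothing
overflows, and compares it with normal(r):

    # Numerical check of the limit (*): the quantity approaches normal(r).
    import math

    def lhs(n, r):
        """.5*sqrt(n) * C(n, n/2 + r*sqrt(n/4)) * (1/2)^n, via logs of factorials."""
        k = n // 2 + round(r * math.sqrt(n / 4))   # nearest integer number of Heads
        log_binom = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        return 0.5 * math.sqrt(n) * math.exp(log_binom - n * math.log(2))

    def normal(r):
        return math.sqrt(1 / (2 * math.pi)) * math.exp(-r * r / 2)

    for r in (0, 1, 2):
        print(r, [round(lhs(n, r), 4) for n in (100, 10000, 1000000)],
              "  normal(r) =", round(normal(r), 4))

For these values of r the two sides already agree to within about .001 at
n = 100, and the agreement improves as n grows.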
As a final application of the Central Limit Theorem, we will compute a good
approximation to the probability that the wrong person could win an election
by random miscounting of votes. Recall the situation: we are assuming there are

   avB = actual votes for B = 2,912,521
   avG = actual votes for G = 2,912,522
   T   = total number of votes = avB + avG = 5,825,043

i.e. G won by one vote, and we want to know the probability that

   margin = vcB - vcG = votes counted for B - votes counted for G >= 537

given that each vote is independently counted correctly with probability .999,
and incorrectly with probability .001.

We let f_B1, f_B2, ..., f_B,avB be avB independent random variables
representing whether the votes for B are counted correctly:

   f_B,i = { 1 if vote i for B is counted correctly, i.e. with probability .999
           {-1 if vote i for B is counted incorrectly, i.e. with probability .001

Similarly, we let f_G1, f_G2, ..., f_G,avG be avG independent random variables
representing whether the votes for G are counted correctly:

   f_G,i = {-1 if vote i for G is counted correctly, i.e. with probability .999
           { 1 if vote i for G is counted incorrectly, i.e. with probability .001

Thus we see that the margin can be represented as

   margin = f_B1 + f_B2 + ... + f_B,avB + f_G1 + f_G2 + ... + f_G,avG

the sum of T = avB + avG independent random variables. Our problem is to
compute P(margin >= 537). According to the general form of the Central Limit
Theorem, all we have to do is compute

   E(margin) = avB*E(f_B1) + avG*E(f_G1)
                  ... since we can just add expectations, and
                  ... all the f_B,i have the same expectation, and
                  ... all the f_G,i have the same expectation
             = avB*(1*.999 - 1*.001) + avG*(-1*.999 + 1*.001)
             = avB*(.998) + avG*(-.998)
             = .998*(avB - avG)
             = -.998

Similarly, by independence of all the f_B,i and f_G,i, we can add their
variances to get

   V(margin) = avB*V(f_B1) + avG*V(f_G1)
             = avB*(1^2*.999 + (-1)^2*.001 - .998^2)
                 + avG*((-1)^2*.999 + 1^2*.001 - (-.998)^2)
             ~ 23,277

   sigma(margin) = sqrt(V(margin)) ~ 153

Thus

   P(margin >= 537)
     = P( margin - E(margin) >= [(537 - E(margin))/sigma(margin)]*sigma(margin) )
     ~ P( margin - E(margin) >= 3.5 * sigma(margin) )
     ~ Normal(3.5) / 2      ... by the Central Limit Theorem
     ~ .00023

as claimed earlier in class.
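Here is a short Python sketch of the same calculation (the variable names are
made up here; the only library fact used is that the one-tailed probability
Normal(r)/2 = 1 - Phi(r), where Phi is the standard normal distribution
function, can be computed as erfc(r/sqrt(2))/2 with the standard complementary
error function math.erfc):

    # Redo the election estimate numerically.
    import math

    avB, avG = 2912521, 2912522
    p_wrong  = 0.001                          # probability a vote is miscounted

    E_one = 1 * (1 - p_wrong) - 1 * p_wrong   # E(f_B,i) = .998; E(f_G,i) = -.998
    V_one = 1 - E_one ** 2                    # each f is +-1, so E(f^2) = 1

    E_margin = E_one * (avB - avG)            # = -.998
    V_margin = (avB + avG) * V_one            # ~ 23,277
    sigma    = math.sqrt(V_margin)            # ~ 153

    r = (537 - E_margin) / sigma              # ~ 3.5
    tail = 0.5 * math.erfc(r / math.sqrt(2))  # = Normal(r)/2 = 1 - Phi(r)

    print(round(E_margin, 3), round(V_margin), round(sigma),
          round(r, 2), "%.5f" % tail)

This prints E(margin) = -0.998, V(margin) ~ 23277, sigma ~ 153, r ~ 3.53, and
a tail probability of about .0002; the .00023 above comes from rounding r down
to 3.5.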