CS 70 - Lecture 28 - Apr 1, 2011 - 10 Evans

Goals for today:  Variance and Standard Deviation (Please read Note 16)

 Mean and standard deviation: some of you asked what the mean and standard
 deviation of the scores on the midterm were, so you already understand that
   1) the mean tells you the average score
   2) the standard deviation (std) tells you how many people were 
      "close to the average"
      i.e. we expect many scores to lie in the range from mean-x*std to mean+x*std
          where x = 1 or 2; if your score is < mean-3*std you know you're
          in trouble, and if your score is > mean+3*std you feel really good
 Now it is time to define this carefully:

DEF: Let S be a sample space, P a probability function, and f a random variable,
     and let E(f) denote the expectation of f.
     Let g = (f - E(f))^2. Then the variance of f is defined as
     V(f) = E( g )
          = sum_{e in S} g(e)*P(e)
          = sum_{e in S} (f(e) - E(f))^2*P(e)
     The standard deviation of f is defined as
     sigma(f) = ( V(f) )^(1/2)

It may seem that what we want is E(| f - E(f) |), but this
is generally hard to compute, so instead we compute
V(f) = E((f-E(f))^2), which tells us how large we expect the squared
deviation (f-E(f))^2 to be, and then take the square root to get sigma(f),
which has the same units as f.
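
To make the definitions concrete, here is a small Python sketch (mine, not part
of the original notes) that computes E(f), V(f), and sigma(f) directly from the
formulas above, for a random variable given as a list of (value, probability)
pairs; the function names expectation/variance/std are just illustrative.

    import math

    def expectation(dist):
        # dist is a list of (value, probability) pairs describing f
        return sum(v * p for v, p in dist)

    def variance(dist):
        # V(f) = sum over outcomes of (f(e) - E(f))^2 * P(e)
        mu = expectation(dist)
        return sum((v - mu) ** 2 * p for v, p in dist)

    def std(dist):
        # sigma(f) = sqrt(V(f))
        return math.sqrt(variance(dist))

    # One fair $1 bet: f = +1 for H, -1 for T
    coin = [(+1, 0.5), (-1, 0.5)]
    print(expectation(coin), variance(coin), std(coin))   # 0.0 1.0 1.0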

Ex: Flip a fair coin n times, bet $1 on H each time,
    let f(x) = #H - #T = amount you win or lose.
    We know E(f) = 0, but how far from breaking even
    can we expect to be? 

    Let f = f_1 + f_2 + ... + f_n where
     f_i = { +1 if i-th throw is H
           { -1 if i-th throw is T
    Then f^2 = sum_{i=1 to n} f_i^2 + sum_{i,j = 1 to n, i neq j} f_i*f_j,
    so E(f^2) = sum_{i=1 to n} E(f_i^2)
                + sum_{i,j=1 to n, i neq j} E(f_i*f_j)
    Now f_i^2 = 1 always, so E(f_i^2) = 1
    f_i*f_j = { +1 if f_i = f_j (prob 1/2)
              { -1 if f_i =-f_j (prob 1/2)
    so E(f_i*f_j) = +1*.5 + (-1)*.5 = 0
    and E(f^2) = V(f) = n (since E(f) = 0), so sigma(f) = sqrt(n)
    In other words, we can expect to be up or down about sqrt(n) dollars
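
Here is a quick Monte Carlo check (my sketch, not from the lecture; n and the
number of trials are arbitrary) that the winnings f = #H - #T really do spread
out like sqrt(n):

    import math, random

    def winnings(n):
        # n fair $1 bets: +1 for each H, -1 for each T
        return sum(random.choice((+1, -1)) for _ in range(n))

    n, trials = 100, 20000
    samples = [winnings(n) for _ in range(trials)]
    mean = sum(samples) / trials
    std  = math.sqrt(sum((s - mean) ** 2 for s in samples) / trials)
    print(mean, std, math.sqrt(n))   # mean ~ 0, empirical std ~ sqrt(100) = 10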

Ex: What about using a biased coin in the last example, with P(H) = p?
    Need a theorem to simplify this:

Thm: V(f) = E(f^2) - (E(f))^2
Proof: V(f) = E((f - E(f))^2) 
            = E(f^2 - 2*f*E(f) + (E(f))^2)
            = E(f^2) - E(2*f*E(f)) + E((E(f))^2)
            = E(f^2) - 2*E(f)*E(f) + (E(f))^2    ... since E(f) is a constant
            = E(f^2) - (E(f))^2
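
A quick sanity check of this identity (my example, not the lecture's), using a
small, deliberately lopsided distribution:

    dist = [(0, 0.7), (1, 0.2), (10, 0.1)]          # (value, probability) pairs
    mu = sum(v * p for v, p in dist)                # E(f)
    v1 = sum((v - mu) ** 2 * p for v, p in dist)    # E((f - E(f))^2)
    v2 = sum(v * v * p for v, p in dist) - mu ** 2  # E(f^2) - (E(f))^2
    print(mu, v1, v2)                               # v1 and v2 agree (up to rounding)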

Ex: Flip a biased coin n times, bet $1 on H each time
    f = f_1 + ... + f_n 
    Proceeding as above, we need
    E(f_i^2) = E(1) = 1
    E(f_i*f_j) = (+1)*(p^2 + (1-p)^2) + (-1)*(2*p*(1-p)) 
               = 4*p^2-4*p+1 = (2*p-1)^2
    so E(f^2) = n*1 + (n^2-n)*(2*p-1)^2
    and E(f) = n*(2*p-1), so
    V(f) = E(f^2) - (E(f))^2 = n + (n^2-n)*(2*p-1)^2 - n^2*(2*p-1)^2
         = n - n*(2*p-1)^2 = n*(1 - (2*p-1)^2) = 4*n*p*(1-p)
    and sigma(f) = 2*sqrt(n*p*(1-p)) = sqrt(n) if p=1/2
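
A Monte Carlo check (mine; the choices n = 100, p = 0.3 are arbitrary) that the
empirical spread of the winnings matches 2*sqrt(n*p*(1-p)):

    import math, random

    def winnings(n, p):
        # n bets on a coin with P(H) = p: +1 for H, -1 for T
        return sum(+1 if random.random() < p else -1 for _ in range(n))

    n, p, trials = 100, 0.3, 20000
    samples = [winnings(n, p) for _ in range(trials)]
    mean = sum(samples) / trials
    std  = math.sqrt(sum((s - mean) ** 2 for s in samples) / trials)
    print(std, 2 * math.sqrt(n * p * (1 - p)))   # both should be close to 9.17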

Ex: Roll a fair die, f = value on top of die
    E(f) = (1/6)*(1+2+3+4+5+6) = 7/2
    V(f) = E(f^2) - (E(f))^2
         = (1/6)*(1^2 + ... + 6^2) - (7/2)^2
         = (91/6) - (49/4) = 35/12

Ex: Choose a number f from {1,2,...,n}, each with equal probability 1/n
    E(f) = (1/n)*(1+...+n) = (n+1)/2
    V(f) = (1/n)*(1^2+...+n^2) - ((n+1)/2)^2
         = (1/n)*(n/6 + n^2/2 + n^3/3) - (n^2/4 + n/2 + 1/4)
         = (n^2 - 1)/12 
    sigma(f) ~ n/sqrt(12) ~ .29 * n
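
An exact check of these formulas (my sketch; it uses exact fractions so no
rounding is involved).  Note that n = 6 recovers the die values 7/2 and 35/12
from the previous example.

    from fractions import Fraction

    def uniform_stats(n):
        # E(f) and V(f) for f uniform on {1,...,n}, computed from the definitions
        mu  = Fraction(sum(range(1, n + 1)), n)
        var = Fraction(sum(k * k for k in range(1, n + 1)), n) - mu ** 2
        return mu, var

    for n in (6, 10, 100):
        mu, var = uniform_stats(n)
        assert mu == Fraction(n + 1, 2) and var == Fraction(n * n - 1, 12)
        print(n, mu, var)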

Ex: Suppose we have a list of n distinct items L(1),...,L(n), and
    want an algorithm that takes an input x known to be on the list, and
    returns i if L(i)=x.
    An obvious algorithm is "linear search"
        i=0
        repeat
           i=i+1
        until L(i)=x or i=n+1

     Suppose x is chosen at random from the list, with equal probabilities.
     What is the expectation of the operation count C of this algorithm, 
     i.e. how many times is the line "i=i+1" executed? If we run this algorithm
     many times, E(C) tells us how long it will take "on average",
     and sigma(C) tells us how variable the running time will be.
      Since C has the same distribution as f in the last example (uniform on
      {1,...,n}), E(C) = (n+1)/2 and sigma(C) = sqrt((n^2-1)/12) ~ .29*n.
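
Here is a simulation sketch (mine) that translates the pseudocode above into
Python and counts how often "i=i+1" runs when x is a uniformly random entry of
the list; the average count should come out near (n+1)/2 and the spread near .29*n.

    import math, random

    def linear_search_count(L, x):
        # direct translation of the pseudocode; L(i) in the notes is 1-indexed
        n, i, count = len(L), 0, 0
        while True:
            i = i + 1
            count += 1                          # one execution of "i=i+1"
            if L[i - 1] == x or i == n + 1:
                return count

    n, trials = 100, 20000
    L = list(range(1, n + 1))
    counts = [linear_search_count(L, random.choice(L)) for _ in range(trials)]
    mean = sum(counts) / trials
    std  = math.sqrt(sum((c - mean) ** 2 for c in counts) / trials)
    print(mean, (n + 1) / 2)                    # both ~ 50.5
    print(std, math.sqrt((n * n - 1) / 12))     # both ~ 28.9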

Ex: Suppose f ~ Poiss(lambda), that is, P(f=r) = exp(-lambda)*lambda^r/r!
    for r = 0, 1, 2, ...  Then
    E(f^2) = sum_{r=1 to infinity} r^2*exp(-lambda)*lambda^r/r!
           = exp(-lambda)*sum_{r=1 to infinity} r*lambda^r/(r-1)!
    To see how to do this sum, we start with
       exp(lambda) = sum_{r=0 to infinity} lambda^r/r!
    and manipulate until we get what we want:
       lambda * exp(lambda) = sum_{r=0 to infinity} lambda^(r+1)/r!
       d/d lambda( lambda * exp(lambda) ) 
             = d/d lambda ( sum_{r=0 to infinity} lambda^(r+1)/r! )
             = sum_{r=0 to infinity} (r+1) lambda^r/r! 
    Multiply this by lambda to get
       lambda * d/d lambda( lambda * exp(lambda) ) 
             = sum_{r=0 to infinity} (r+1) lambda^(r+1)/r!
             = sum_{r=1 to infinity} r lambda^r/(r-1)!    ... shifting the index by 1
    as desired.  Simplifying, we get
       E(f^2) = exp(-lambda) * lambda * d/d lambda( lambda * exp(lambda) ) 
              = exp(-lambda) * lambda ( 1 * exp(lambda) + lambda * exp(lambda) )
              = exp(-lambda) * exp(lambda) * ( lambda + lambda^2)
              = lambda + lambda^2
    Since E(f) = lambda, V(f) = E(f^2) - (E(f))^2 = lambda + lambda^2 - lambda^2
                               = lambda, so V(f) = E(f).
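
A numerical check (mine; lambda = 3.5 is arbitrary): summing the Poisson pmf
over enough terms reproduces E(f) = lambda, E(f^2) = lambda + lambda^2, and
V(f) = lambda.

    import math

    lam = 3.5
    def pmf(r):
        return math.exp(-lam) * lam ** r / math.factorial(r)

    terms  = range(0, 60)                        # 60 terms is plenty for lam = 3.5
    mean   = sum(r * pmf(r) for r in terms)      # E(f)
    second = sum(r * r * pmf(r) for r in terms)  # E(f^2)
    print(mean, second, second - mean ** 2)      # ~ 3.5, ~ 15.75, ~ 3.5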

Now we quantify the notion that the value of a random variable f(x) is
unlikely to fall very far from E(f):

Thm (Chebyshev's Inequality): Let f be a random variable. Then for any r > 0
     P( |f - E(f)| >= r )  <= V(f)/r^2
Letting r = z*sigma(f), we can also write this as
     P( |f - E(f)| >= z*sigma(f) )  <= 1/z^2

In words, the probability that the value of a random variable f is
farther than z times the standard deviation sigma(f) from its mean value
E(f) can be no larger than 1/z^2, which decreases as z increases.
Here is a table:
              z     P( |f - E(f)| >= z*sigma(f))
             ---    -------------------------------
              1      <= 1  (trivial bound on a probability!)
              2      <= 1/4  = .25
              5      <= 1/25 = .04
             10      <= 1/100= .01

We will prove this as a corollary of another useful result:

Thm: Markov's Inequality. Let g(x) be a nonnegative random variable. Then for any s > 0
     P( g >= s ) <= E(g)/s

Proof: E(g) = sum_r  r * P(g = r)
            = sum_{r < s} r*P(g=r) + sum_{r >= s} r*P(g=r)
           >= sum_{r >= s} r*P(g=r)     ... since the omitted sum is nonnegative
           >= sum_{r >= s} s*P(g=r)     ... since r >= s in each remaining term
            = s * sum_{r >= s} P(g=r)
            = s * P(g >= s) 
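
A tiny check of Markov's inequality (my example): for a single fair die g,
which is nonnegative with E(g) = 3.5, the bound P(g >= s) <= 3.5/s holds for
every threshold s.

    faces = range(1, 7)
    Eg = sum(faces) / 6                          # E(g) = 3.5
    for s in (2, 3, 4, 5, 6):
        actual = sum(1 for v in faces if v >= s) / 6
        print(s, actual, Eg / s)                 # actual <= bound in every row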

Proof of Chebyshev: Apply Markov to g(x) = (f(x) - E(f))^2,
which by construction has E(g) = V(f), yielding
    V(f)/s >= P( g >= s ) = P( (f - E(f))^2 >= s )
Now let s = r^2, and note that (f - E(f))^2 >= r^2 exactly when |f - E(f)| >= r, so
    V(f)/r^2 >= P( |f - E(f)| >= r )
as desired.
 
 How good is Chebyshev's inequality, i.e. how close is the bound 1/z^2 to the
 actual probability P( |f - E(f)| >= z*sigma(f) )?
 Ex: Again consider rolling a die 100 times, and let h be the sum of the rolls.
     Then E(h) = 350 and sigma(h) = sqrt(100*35/12) ~ 17.
     Comparing the actual P( |h-350| >= z*17 ) with the bound 1/z^2 from Chebyshev
     yields the table below: the Chebyshev bound is much larger than the true
     probability for large z.
     (Later we will get a much better approximation from the Central Limit Theorem).
              z     P( |h - E(h)| >= z*sigma(h))
                    from Chebyshev    More accurate
             ---    -------------------------------
              1      <= 1             ~ .4
              2      <= .25           ~ .05
              5      <= .04           ~ 10^(-6)
             10      <= .01           ~ 10^(-25)
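
A simulation sketch (mine) of the comparison above: estimate P(|h-350| >= z*17)
for h = the sum of 100 fair die rolls and print it next to Chebyshev's 1/z^2.
(The z = 5 and z = 10 events are far too rare to observe with this many trials,
so their empirical estimates will just come out 0.)

    import random

    trials = 50000
    sums = [sum(random.randint(1, 6) for _ in range(100)) for _ in range(trials)]
    for z in (1, 2, 5, 10):
        frac = sum(1 for h in sums if abs(h - 350) >= z * 17) / trials
        print(z, frac, 1 / z ** 2)               # empirical tail vs. Chebyshev bound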