CS 70 - Lecture 32 - Apr 11, 2011 - 10 Evans

Goals for today:  Recall how to estimate P(H)=p given a biased coin:
                    flip it many times, take fraction of heads
                  How to generalize beyond coin flipping:
                    Law of Large Numbers 
                  Random Variables (independent and otherwise)
                  (Please read Note 18)

Review:

DEF: Let S be a sample space, P a probability function, and f a random variable.
     E(f) = the expectation of f.
     Let g = (f - E(f))^2. Then the variance of f is defined as
     V(f) = E( g ) = E(f^2) - (E(f))^2
     sigma(f) = ( V(f) )^(1/2)

Thm: Chebyshev's Inequality:
     P( |f - E(f)| >= r ) <= V(f)/r^2   
     P( |f - E(f)| >= z*sigma(f) ) <= 1/z^2   
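
To see the inequality in action, here is a minimal Python sketch (the fair
six-sided die is an assumption made purely for illustration) that computes
E(f), V(f), and P(|f - E(f)| >= r) exactly, and compares the last of these
with the bound V(f)/r^2:

    # Minimal sketch (assumes a fair six-sided die purely for illustration):
    # compute E(f), V(f), and both sides of Chebyshev's Inequality exactly.

    dist = {v: 1/6 for v in range(1, 7)}   # P(f = v) for a fair die

    E = sum(v * p for v, p in dist.items())            # E(f)
    V = sum(v**2 * p for v, p in dist.items()) - E**2  # V(f) = E(f^2) - (E(f))^2

    for r in [1, 2, 3]:
        lhs = sum(p for v, p in dist.items() if abs(v - E) >= r)  # P(|f - E(f)| >= r)
        print(r, lhs, V / r**2)   # lhs never exceeds V/r^2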

Ex: Flip a biased coin n times, let 
    X_i =  1 if i-th flip head, 0 otherwise
      E(X_i) = p, V(X_i) = p - p^2 = p*(1-p)
    S_n = X_1 + ... + X_n = number of heads
      E(S_n) = n*p, V(S_n) = n*p*(1-p)
    A_n = S_n/n = fraction of heads
      E(A_n) = p, V(A_n) = p*(1-p)/n

    P( |A_n - p| >= r ) <= p*(1-p)/(n*r^2)
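
    Worked example (the numbers .05 and .01 are chosen just for illustration):
    no matter what p is, p*(1-p) <= 1/4, so
        P( |A_n - p| >= r ) <= 1/(4*n*r^2)
    To guarantee P( |A_n - p| >= .05 ) <= .01 it therefore suffices to take
        n >= 1/(4*(.05)^2*.01) = 10000 flips.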

Law of Large Numbers - (special case)
    lim_{n -> infinity} P( |A_n - p| >= r ) = 0
In other words, A_n gets closer and closer to p 
(with high probability) as n increases.
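
A small simulation sketch in Python shows this happening (the bias p = 0.3 is
an assumption made only for this illustration, and is treated as known just so
we can measure |A_n - p|):

    # Minimal simulation sketch: flip a biased coin n times and report A_n.
    # The value p = 0.3 is an assumption made only for this illustration.
    import random

    p = 0.3
    for n in [100, 10_000, 1_000_000]:
        heads = sum(random.random() < p for _ in range(n))
        A_n = heads / n
        print(n, A_n, abs(A_n - p))   # |A_n - p| tends to shrink as n grows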

This worked because we knew how to compute V(A_n)
for coin flipping from the definition, using the fact
that each flip was independent.

We want to be able to apply Chebyshev's Inequality and the
Law of Large Numbers in more general settings than coin flipping.
    
Ex: roll a biased die n times, 
    f = value on top of die
    A_n = average value of f over n rolls
    Show that A_n -> E(f) with high probability

Ex: Survey n randomly chosen people, 
    f = annual income
    A_n = average of f over n people

In all cases, we want to show that A_n -> E(f) with high probability
as n increases (and, in the first example, even as n goes to infinity)

More generally, we want to reason about multiple random variables,
to be able to learn more from multiple random variables
(think of them as multiple measurements) than we can from looking
at each one individually.

Def: Let f1 and f2 be two random variables on sample space S.
     Then the set of values {(a,b,P({x: f1(x)=a and f2(x)=b}))}
     for all a in the range of f1 and b in the range of f2,
     is called the joint distribution of f1 and f2.

We will often abbreviate this as P(f1=a and f2=b).

Note that all these probabilities must add up to 1:
   sum_{a,b} P(f1=a and f2=b) = 1

Ex 1: Flip a fair coin twice, let fi = 1 if i-th toss = Heads, 0 if Tails
      We write the joint distribution as a table of probabilities
      (we explain the row and column sums, and how we got E(f1*f2) 
      and V(f1+f2), later)
          f1=1  f1=0   P(f2=b)
    f2=1   .25   .25    .5
    f2=0   .25   .25    .5
P(f1=a)    .5    .5
E(f1*f2) = .25, V(f1+f2) = .5

Ex 2: Ask two random people whether they want to raise taxes,
      where an (unknown) fraction p of the population would say "yes"
      Let f_i = 1 if person i says "yes", 0 if no
           f1=1     f1=0     P(f2=b)
    f2=1   p^2      p*(1-p)  p
    f2=0   p*(1-p)  (1-p)^2  (1-p)
P(f1=a)    p        (1-p)
E(f1*f2) = p^2, V(f1+f2) = 2*p*(1-p)

Ex 3: Pick a random person, determine if they have a particular disease,
      and apply a new test to see if it works (see Lecture 20 for data)
      f1(x) = 1 if person actually has disease, 0 if not
      f2(x) = 1 if person tests positive, 0 if not
                      f1=1           f1=0          P(f2=b)
                      "sick"         "healthy"
    f2=1 "positive"   .045 (=.9*.05) .19 (=.2*.95) .235
    f2=0 "negative"   .005 (=.1*.05) .76 (=.8*.95) .765
    P(f1=a)           .05            .95
E(f1*f2) = .045

Ex 4:   f1=0  f1=1  f1=2  P(f2=b)
   f2=0  .1    .2    .15  .45
   f2=1  .05   .05   .2   .3
   f2=2  .1    .1    .05  .25
P(f1=a)  .25   .35   .4
E(f1*f2) = .85
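
For instance, the value E(f1*f2) = .85 in Ex 4 comes from summing a*b times the
corresponding table entry over the cells where both a and b are nonzero:
   E(f1*f2) = 1*1*.05 + 2*1*.2 + 1*2*.1 + 2*2*.05 = .05 + .4 + .2 + .2 = .85
(this is the formula for E(f1*f2) given below).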

Given a joint probability distribution, we can compute quantities like
   P(f1=a) = sum_b P(f1=a and f2=b)
   P(f2=b) = sum_a P(f1=a and f2=b)
These are also called "marginal distributions" of f1 and f2, respectively.

Given the tables in our above examples, 
this amounts to summing columns of the table (to get P(f1=a))
or summing rows of the table (to get P(f2=b)) (see examples above)

And using the tables we can compute
   E(f1*f2) = sum_r  r*P(f1*f2=r)
            = sum_{a,b} a*b*P(f1=a and f2=b)
or indeed any function, say
    E(f1^2*f2^2) = sum_{a,b} a^2*b^2*P(f1=a and f2=b)
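
Here is a minimal Python sketch of these computations (it uses the Ex 4 table
above; storing the joint distribution as a dictionary keyed by (a,b) is just
one convenient representation):

    # Minimal sketch: store a joint distribution as a dict mapping (a, b) to
    # P(f1=a and f2=b), then compute marginals and expectations from it.
    # The table below is the one from Ex 4.

    joint = {(0, 0): .10, (1, 0): .20, (2, 0): .15,
             (0, 1): .05, (1, 1): .05, (2, 1): .20,
             (0, 2): .10, (1, 2): .10, (2, 2): .05}

    # Marginal distributions: sum over the other variable's values
    P_f1 = {}
    P_f2 = {}
    for (a, b), pr in joint.items():
        P_f1[a] = P_f1.get(a, 0) + pr   # P(f1=a) = sum_b P(f1=a and f2=b)
        P_f2[b] = P_f2.get(b, 0) + pr   # P(f2=b) = sum_a P(f1=a and f2=b)

    # E(f1*f2) = sum_{a,b} a*b*P(f1=a and f2=b); prints 0.85 for Ex 4
    E_f1f2 = sum(a * b * pr for (a, b), pr in joint.items())
    print(P_f1, P_f2, E_f1f2)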

Now we can define (again) independence of random variables,
which generalizes the notion of independence of events:

Def: f1 and f2 are independent if 
     P(f1=a and f2=b) = P(f1=a)*P(f2=b)
In words, knowing the value of f1 tells you nothing about the value of f2,
and vice versa. We can also express this using the following definition.

Def: The Conditional Distribution of f1 given f2=b is the collection
   {(a,P(f1=a | f2=b))} for all a in the range of f1

When f1 and f2 are independent, this simplifies to 
  P(f1=a | f2=b) = P(f1=a and f2=b)/P(f2=b) = P(f1=a)*P(f2=b)/P(f2=b) = P(f1=a)
  P(f2=b | f1=a) = P(f2=b and f1=a)/P(f1=a) = P(f2=b)*P(f1=a)/P(f1=a) = P(f2=b)
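
One way to decide independence from a joint table is to compare
P(f1=a and f2=b) with P(f1=a)*P(f2=b) in every cell. A minimal Python sketch
(reusing the dictionary representation from the sketch above; the tolerance
tol is only there to absorb rounding):

    # Minimal sketch: test whether f1 and f2 are independent, given their joint
    # distribution as a dict mapping (a, b) to P(f1=a and f2=b).

    def is_independent(joint, tol=1e-9):
        P_f1, P_f2 = {}, {}
        for (a, b), pr in joint.items():
            P_f1[a] = P_f1.get(a, 0) + pr
            P_f2[b] = P_f2.get(b, 0) + pr
        # Independent iff P(f1=a and f2=b) = P(f1=a)*P(f2=b) for all a, b
        return all(abs(pr - P_f1[a] * P_f2[b]) < tol
                   for (a, b), pr in joint.items())

    # Ex 1 (two fair coin flips): every cell is .25 = .5*.5, so this prints True
    print(is_independent({(1, 1): .25, (1, 0): .25, (0, 1): .25, (0, 0): .25}))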

ASK&WAIT: Looking at our 4 examples above, which pairs f1, f2 are independent,
and which are not?


Independence makes computing many functions of f1 and f2 easier,
as we showed last time:

Thm 1: If f1 and f2 are independent, then E(f1*f2) = E(f1)*E(f2)
Thm 2: If f1 and f2 are independent, then V(f1+f2) = V(f1)+V(f2)

This makes it easy to compute E(f1*f2) and V(f1+f2) in Ex 1 and 2 above.
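
For instance, in Ex 2 the two people are chosen independently, so
   E(f1*f2) = E(f1)*E(f2) = p*p = p^2
   V(f1+f2) = V(f1) + V(f2) = p*(1-p) + p*(1-p) = 2*p*(1-p)
matching the values given in the table for Ex 2.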

These ideas apply not just to two random variables, but n random variables
(we usually flip more than two coins, survey more than two prospective voters,
test more than two patients, etc.):

Def: The joint probability distribution of n random variables f1,...,fn
     on a sample space S is the collection of values
        P(f1=a1 and f2=a2 and ... and fn=an) 
     for all values of ai in the range of fi

Def: Random variables f1,...,fn are mutually independent if for 
     any subset of them (say the first m for illustration)
     P(f1=a1 and ... and fm=am) = P(f1=a1)*...*P(fm=am)

Thm 1a: If f1,...,fn are mutually independent, then E(f1*...*fn) = E(f1)*...*E(fn)
Thm 2a: If f1,...,fn are mutually independent, then V(f1+...+fn) = V(f1)+...+V(fn)

Ex 1a: Flip fair coin n times, fi(x) = 1 if i-th toss a Head, else 0
       E(f1+...+fn) = n*(1/2), V(f1+...+fn) = n*.25

Ex 2a: Ask n random people if they want to raise taxes, 
       fi(x) = 1 if i-th person says "yes", else 0
       E(f1+...+fn) = n*p, V(f1+...+fn) = n*p*(1-p)
       A_n = (f1+...+fn)/n => E(A_n) = p, V(A_n) = p*(1-p)/n
Note that the variance of A_n approaches 0 as n increases,
which we can use with Chebyshev's Inequality.

Now recall
Def: random variables f1,...,fn are called
independent and identically distributed (i.i.d.) if
     (1) they are mutually independent
     (2) they have the same distribution, i.e. P(fi=a) the same for all i

Part (2) means that i.i.d. random variables have the same mean and variance.

Thm (Law of Large Numbers) Let f1,f2,f3,... be i.i.d.
random variables with common expectation mu and finite variance.
Let A_n = (f1 + ... + fn)/n  be their average. Then for any r>0,
   lim_{n -> infinity} P(|A_n - mu| > r) = 0
In other words, the average A_n gets closer and closer to
a single value mu, with high probability.

Proof:  Write S_n = f1 + ... + fn, so A_n = S_n/n. Linearity of expectation tells us
   E(A_n) = E(S_n)/n = n*mu/n = mu
and Thm 2a (together with the rule V(c*g) = c^2*V(g)) tells us
   V(A_n) = (1/n^2)*V(S_n) = (1/n^2)*(n*V(f1)) = V(f1)/n
So by Chebyshev's Inequality, for any r>0
   P( |A_n - mu| >= r ) <= V(f1)/(n*r^2)
Taking the limit as n -> infinity yields zero as desired.
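
A Monte Carlo sketch in Python (the fair die, and the choices of r, n, and the
number of trials, are assumptions made only for illustration) estimates
P(|A_n - mu| >= r) by repeating the experiment and compares it with the
Chebyshev bound V(f1)/(n*r^2):

    # Minimal simulation sketch of the Law of Large Numbers for i.i.d. die rolls.
    # Estimates P(|A_n - mu| >= r) by repeating the experiment many times and
    # compares it with the Chebyshev bound V(f1)/(n*r^2).
    import random

    mu, var = 3.5, 35/12          # E and V of one fair die roll
    r, trials = 0.25, 2000        # illustrative choices

    for n in [10, 100, 1000]:
        bad = 0
        for _ in range(trials):
            A_n = sum(random.randint(1, 6) for _ in range(n)) / n
            if abs(A_n - mu) >= r:
                bad += 1
        print(n, bad / trials, var / (n * r**2))   # empirical prob vs. bound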

The Law of Large Numbers is still true if the variance is infinite,
but trickier to prove.