CS 70 - Lecture 32 - Apr 11, 2011 - 10 Evans

Goals for today:
  Recall how to estimate P(H)=p given a biased coin:
    flip it many times, take the fraction of heads
  How to generalize beyond coin flipping: Law of Large Numbers
  Random Variables (independent and otherwise)
  (Please read Note 18)

Review:
DEF: Let S be a sample space, P a probability function, and f a random variable.
  E(f) = the expectation of f.
  Let g = (f - E(f))^2. Then the variance of f is defined as
    V(f) = E(g) = E(f^2) - (E(f))^2
  and sigma(f) = ( V(f) )^(1/2) is the standard deviation of f.

Thm (Chebyshev's Inequality):
  P( |f - E(f)| >= r ) <= V(f)/r^2
  P( |f - E(f)| >= z*sigma(f) ) <= 1/z^2

Ex: Flip a biased coin n times, and let X_i = 1 if the i-th flip is a head, 0 otherwise.
  E(X_i) = p, V(X_i) = p - p^2 = p*(1-p)
  S_n = X_1 + ... + X_n = number of heads
  E(S_n) = n*p, V(S_n) = n*p*(1-p)
  A_n = S_n/n = fraction of heads
  E(A_n) = p, V(A_n) = p*(1-p)/n
  P( |A_n - p| >= r ) <= p*(1-p)/(n*r^2)

Law of Large Numbers (special case):
  lim_{n -> infinity} P( |A_n - p| >= r ) = 0
In other words, A_n gets closer and closer to p (with high probability) as n increases.

This worked because we knew how to compute V(A_n) for coin flipping from the definition, using the fact that each flip was independent. We want to make it possible to apply Chebyshev's Inequality and the Law of Large Numbers to more general cases than coin flipping.

Ex: Roll a biased die n times, f = value on top of the die,
    A_n = average value of f over the n rolls.
    Show that A_n -> E(f) with high probability.
Ex: Survey n randomly chosen people, f = annual income,
    A_n = average of f over the n people.

In all cases, we want to show that A_n -> E(f) with high probability as n increases (and even as n goes to infinity, as in the first example).

More generally, we want to reason about multiple random variables, to be able to learn more from multiple random variables (think of them as multiple measurements) than we can from looking at each one individually.

Def: Let f1 and f2 be two random variables on a sample space S.
Then the set of values
  {(a, b, P({x: f1(x)=a and f2(x)=b}))}
for all a in the range of f1 and b in the range of f2 is called the joint distribution of f1 and f2. We will often abbreviate this as P(f1=a and f2=b). Note that all these probabilities must add up to 1:
  sum_{a,b} P(f1=a and f2=b) = 1

Ex 1: Flip a fair coin twice, and let fi = 1 if the i-th toss is Heads, 0 if Tails.
We write the joint distribution as a table of probabilities (we explain the row and column sums, and how we got E(f1*f2) and V(f1+f2), later):

             f1=1    f1=0    P(f2=b)
  f2=1       .25     .25     .5
  f2=0       .25     .25     .5
  P(f1=a)    .5      .5

  E(f1*f2) = .25, V(f1+f2) = .5

Ex 2: Ask two random people whether they want to raise taxes, where an (unknown) fraction p of the population would say "yes". Let f_i = 1 if person i says "yes", 0 if no.

             f1=1       f1=0       P(f2=b)
  f2=1       p^2        p*(1-p)    p
  f2=0       p*(1-p)    (1-p)^2    1-p
  P(f1=a)    p          1-p

  E(f1*f2) = p^2, V(f1+f2) = 2*p*(1-p)

Ex 3: Pick a random person, determine if they have a particular disease, and apply a new test to see if it works (see Lecture 20 for the data).
  f1(x) = 1 if the person actually has the disease, 0 if not
  f2(x) = 1 if the person tests positive, 0 if not

                     f1=1 "sick"      f1=0 "healthy"   P(f2=b)
  f2=1 "positive"    .045 (=.9*.05)   .19 (=.2*.95)    .235
  f2=0 "negative"    .005 (=.1*.05)   .76 (=.8*.95)    .765
  P(f1=a)            .05              .95

  E(f1*f2) = .045

Ex 4:
             f1=0    f1=1    f1=2    P(f2=b)
  f2=0       .1      .2      .15     .45
  f2=1       .05     .05     .2      .3
  f2=2       .1      .1      .05     .25
  P(f1=a)    .25     .35     .4

  E(f1*f2) = .85

Given a joint probability distribution, we can compute quantities like
  P(f1=a) = sum_b P(f1=a and f2=b)
  P(f2=b) = sum_a P(f1=a and f2=b)
These are also called the "marginal distributions" of f1 and f2, respectively.
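The marginal formulas above can be computed mechanically from a joint table. Here is a minimal sketch in Python (not part of the original notes); storing the table as a dictionary mapping (a,b) to P(f1=a and f2=b) is my own choice, illustrated on the Ex 4 table:

```python
# Joint distribution of Ex 4: a dictionary mapping (a, b) -> P(f1=a and f2=b).
joint = {
    (0, 0): .10, (1, 0): .20, (2, 0): .15,
    (0, 1): .05, (1, 1): .05, (2, 1): .20,
    (0, 2): .10, (1, 2): .10, (2, 2): .05,
}

# Marginals: P(f1=a) = sum_b P(f1=a and f2=b), and symmetrically for f2.
P_f1, P_f2 = {}, {}
for (a, b), pr in joint.items():
    P_f1[a] = P_f1.get(a, 0) + pr   # summing a column of the table
    P_f2[b] = P_f2.get(b, 0) + pr   # summing a row of the table

# E(f1*f2) = sum_{a,b} a*b*P(f1=a and f2=b)
E_f1f2 = sum(a * b * pr for (a, b), pr in joint.items())

print({a: round(p, 2) for a, p in P_f1.items()})  # column sums: .25, .35, .4
print({b: round(p, 2) for b, p in P_f2.items()})  # row sums: .45, .3, .25
print(round(E_f1f2, 2))                           # .85, as stated in Ex 4
```

The same loop works for any finite joint table, e.g. replacing the dictionary with the entries of Ex 1 reproduces the .5/.5 marginals and E(f1*f2) = .25.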
Given the tables in our examples above, this amounts to summing columns of the table (to get P(f1=a)) or summing rows of the table (to get P(f2=b)) (see the examples above). And using the tables we can compute
  E(f1*f2) = sum_r r*P(f1*f2=r) = sum_{a,b} a*b*P(f1=a and f2=b)
or indeed the expectation of any function of f1 and f2, say
  E(f1^2*f2^2) = sum_{a,b} a^2*b^2*P(f1=a and f2=b)

Now we can define (again) independence of random variables, which generalizes the notion of independence of events:

Def: f1 and f2 are independent if P(f1=a and f2=b) = P(f1=a)*P(f2=b) for all a and b.

In words, knowing the value of f1 tells you nothing about the value of f2, and vice versa. We can also express this using the following definition.

Def: The conditional distribution of f1 given f2=b is the collection
  {(a, P(f1=a | f2=b))} for all a in the range of f1.

When f1 and f2 are independent, this simplifies to
  P(f1=a | f2=b) = P(f1=a and f2=b)/P(f2=b) = P(f1=a)*P(f2=b)/P(f2=b) = P(f1=a)
  P(f2=b | f1=a) = P(f2=b and f1=a)/P(f1=a) = P(f2=b)*P(f1=a)/P(f1=a) = P(f2=b)

ASK&WAIT: Looking at our 4 examples above, which pairs f1, f2 are independent, and which are not?

Independence makes computing many functions of f1 and f2 easier, as we showed last time:

Thm 1: If f1 and f2 are independent, then E(f1*f2) = E(f1)*E(f2).
Thm 2: If f1 and f2 are independent, then V(f1+f2) = V(f1)+V(f2).

This makes it easy to compute E(f1*f2) and V(f1+f2) in Ex 1 and Ex 2 above.

These ideas apply not just to two random variables, but to n random variables (we usually flip more than two coins, survey more than two prospective voters, test more than two patients, etc.):

Def: The joint probability distribution of n random variables f1,...,fn on a sample space S is the collection of values
  P(f1=a1 and f2=a2 and ... and fn=an)
for all values of ai in the range of fi.

Def: Random variables f1,...,fn are mutually independent if for any subset of them (say the first m for illustration)
  P(f1=a1 and ...
and fm=am) = P(f1=a1)*...*P(fm=am)

Thm 1a: If f1,...,fn are mutually independent, then E(f1*...*fn) = E(f1)*...*E(fn).
Thm 2a: If f1,...,fn are mutually independent, then V(f1+...+fn) = V(f1)+...+V(fn).

Ex 1a: Flip a fair coin n times, fi(x) = 1 if the i-th toss is a Head, else 0.
  E(f1+...+fn) = n*(1/2), V(f1+...+fn) = n*.25
Ex 2a: Ask n random people if they want to raise taxes, fi(x) = 1 if the i-th person says "yes", else 0.
  E(f1+...+fn) = n*p, V(f1+...+fn) = n*p*(1-p)
  A_n = (f1+...+fn)/n => E(A_n) = p, V(A_n) = p*(1-p)/n
Note that the variance of A_n approaches 0 as n increases, which we can use with Chebyshev's Inequality.

Now recall
Def: Random variables f1,...,fn are called independent and identically distributed (i.i.d.) if
  (1) they are mutually independent, and
  (2) they have the same distribution, i.e. P(fi=a) is the same for all i.
Part (2) means that i.i.d. random variables have the same mean and variance.

Thm (Law of Large Numbers): Let f1,f2,f3,... be i.i.d. random variables with common expectation mu and finite variance. Let A_n = (f1 + ... + fn)/n be their average. Then for any r>0,
  lim_{n -> infinity} P( |A_n - mu| > r ) = 0
In other words, the average A_n gets closer and closer to the single value mu, with high probability.

Proof: Let S_n = f1 + ... + fn, so A_n = S_n/n.
Linearity of expectation tells us
  E(A_n) = E(S_n)/n = E(f1) = mu
and Thm 2a tells us
  V(A_n) = (1/n^2)*V(S_n) = (1/n^2)*(n*V(f1)) = V(f1)/n
So by Chebyshev's Inequality, for any r>0,
  P( |A_n - mu| >= r ) <= V(f1)/(n*r^2)
Taking the limit as n -> infinity yields zero, as desired.

The Law of Large Numbers is still true if the variance is infinite, but it is trickier to prove.
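The proof can be watched in action with a short simulation (my own sketch, not part of the notes): roll a fair die n times for growing n, where mu = 3.5 and V(f1) = 35/12, and compare how often the average A_n strays from mu against the Chebyshev bound V(f1)/(n*r^2).

```python
import random

# i.i.d. rolls of a fair die: mu = E(f1) = 3.5, V(f1) = 35/12.
# For each n, estimate P(|A_n - mu| >= r) over many repeated experiments
# and compare with the Chebyshev bound V(f1)/(n*r^2), which -> 0 as n grows.
random.seed(1)
mu, var = 3.5, 35 / 12
r, trials = 0.25, 1000

results = {}
for n in (10, 100, 1000):
    bad = sum(
        abs(sum(random.randint(1, 6) for _ in range(n)) / n - mu) >= r
        for _ in range(trials)
    )
    results[n] = (bad / trials, var / (n * r * r))

for n, (empirical, bound) in results.items():
    print(n, empirical, round(bound, 3))
```

For small n the bound exceeds 1 and says nothing, but by n = 1000 both the bound and the observed frequency of |A_n - mu| >= 0.25 are close to zero, matching the limit in the theorem. The choices of r, the seed, and the trial counts are arbitrary illustration values.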
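Returning to the earlier ASK&WAIT question about which of the example joint tables are independent: independence can be checked mechanically by comparing each table entry with the product of its marginals. A minimal sketch (the dictionary encoding of the tables is my own), applied to Ex 1 and Ex 4:

```python
def is_independent(joint, tol=1e-9):
    """Check P(f1=a and f2=b) == P(f1=a)*P(f2=b) for every cell of a
    joint distribution given as a dict mapping (a, b) -> probability."""
    P1, P2 = {}, {}
    for (a, b), pr in joint.items():
        P1[a] = P1.get(a, 0) + pr   # marginal of f1: sum over b
        P2[b] = P2.get(b, 0) + pr   # marginal of f2: sum over a
    return all(abs(pr - P1[a] * P2[b]) < tol for (a, b), pr in joint.items())

# Ex 1: two fair coin flips -- every cell equals the product of its marginals.
ex1 = {(1, 1): .25, (1, 0): .25, (0, 1): .25, (0, 0): .25}

# Ex 4 -- not independent: e.g. P(f1=0 and f2=0) = .1 but .25*.45 = .1125.
ex4 = {(0, 0): .10, (1, 0): .20, (2, 0): .15,
       (0, 1): .05, (1, 1): .05, (2, 1): .20,
       (0, 2): .10, (1, 2): .10, (2, 2): .05}

print(is_independent(ex1))  # True
print(is_independent(ex4))  # False
```

The same check on Ex 2 confirms independence for any p, while Ex 3 fails it (a positive test result changes the probability of being sick, which is the whole point of the test).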