Lecture Notes (Introduction To Probability and Statistics)
Required textbook: DeGroot & Schervish, Probability and Statistics, Third Edition.
Recommended introduction to probability text: Feller, Vol. 1.
1.2-1.4. Probability, Set Operations.
What is probability?
- Classical interpretation: all outcomes have equal probability (coin, dice).
- Subjective interpretation (depends on the nature of the problem): uses a model with randomness involved, such as weather. Example: a drop of paint falls into a glass of water; a model can describe P(hit bottom before sides). Or P(survival after surgery), which is subjective and estimated by the doctor.
- Frequency interpretation: probability based on history; P(make a free shot) is based on the history of shots made.
An experiment has a random outcome.
1. Sample space S: the set of all possible outcomes. Coin: S = {H, T}; die: S = {1, 2, 3, 4, 5, 6}; two dice: S = {(i, j): i, j = 1, 2, ..., 6}.
2. Events: any subset A ⊆ S; A denotes the collection of all events.
3. Probability distribution: P: A → [0, 1]. For an event A ⊆ S, P(A) or Pr(A) is the probability of A.
Properties of probability:
1. 0 ≤ P(A) ≤ 1.
2. P(S) = 1.
3. For disjoint (mutually exclusive) events A, B (by definition A ∩ B = ∅), P(A or B) = P(A) + P(B). This extends to any number of events: for a sequence of disjoint events A_1, ..., A_n, ... (A_i ∩ A_j = ∅ for i ≠ j),
P(⋃_{i=1}^∞ A_i) = Σ_{i=1}^∞ P(A_i).
For a continuous sample space we need to group outcomes into events rather than sum over individual points, since each individual point has probability 0.
Summary of set operations (each can be illustrated with a Venn diagram):
1. Union: A ∪ B = {s ∈ S : s ∈ A or s ∈ B}
2. Intersection: A ∩ B = AB = {s ∈ S : s ∈ A and s ∈ B}
3. Complement: Aᶜ = {s ∈ S : s ∉ A}
4. Set difference: A \ B = A − B = {s ∈ S : s ∈ A and s ∉ B} = A ∩ Bᶜ
5. Symmetric difference: A Δ B = {s ∈ S : (s ∈ A and s ∉ B) or (s ∈ B and s ∉ A)} = (A ∩ Bᶜ) ∪ (B ∩ Aᶜ)
Properties of set operations:
1. A ∪ B = B ∪ A (commutativity)
2. (A ∪ B) ∪ C = A ∪ (B ∪ C) (associativity)
Note that 1. and 2. are also valid for intersections.
3. For mixed operations, distributivity holds:
(A ∪ B) ∩ C = (A ∩ C) ∪ (B ∩ C);
think of union as addition and intersection as multiplication: (A + B)C = AC + BC.
4. (A ∪ B)ᶜ = Aᶜ ∩ Bᶜ — can be seen from a Venn diagram: both sides give the same shaded region.
5. (A ∩ B)ᶜ = Aᶜ ∪ Bᶜ — prove by looking at a particular point:
s ∈ (A ∩ B)ᶜ ⇔ s ∉ A ∩ B ⇔ s ∉ A or s ∉ B ⇔ s ∈ Aᶜ or s ∈ Bᶜ ⇔ s ∈ Aᶜ ∪ Bᶜ. QED
** End of Lecture 1
1.5 Properties of Probability.
1. P(A) ∈ [0, 1]
2. P(S) = 1
3. P(⋃ A_i) = Σ P(A_i) if the events are disjoint (A_i ∩ A_j = ∅ for i ≠ j): the probability of a union of disjoint events is the sum of their probabilities.
4. P(∅) = 0: S and ∅ are disjoint by definition, so P(S) = P(S ∪ ∅) = P(S) + P(∅) = 1, and P(S) = 1 by property 2, therefore P(∅) = 0.
5. P(Aᶜ) = 1 − P(A): because A and Aᶜ are disjoint, P(A ∪ Aᶜ) = P(S) = 1 = P(A) + P(Aᶜ), so the probabilities of an event and its complement sum to 1.
6. If A ⊆ B, then P(A) ≤ P(B): by definition B = A ∪ (B \ A), a union of two disjoint sets, so
P(B) = P(A) + P(B \ A) ≥ P(A).
7. P(A ∪ B) = P(A) + P(B) − P(AB): the intersection must be subtracted because it would otherwise be counted twice.
Finite Sample Spaces
There is a finite number of outcomes, S = {s_1, ..., s_n}. Define p_i = P(s_i) as the probability function:
p_i ≥ 0, Σ_{i=1}^n p_i = 1, and P(A) = Σ_{s ∈ A} P(s).
Classical, simple sample spaces: all outcomes have equal probabilities, so P(A) = #(A)/#(S), computed by counting methods.
Multiplication rule: if the first stage of an experiment has m possible outcomes and the second has n, then there are mn possible pairs.
Sampling without replacement, one at a time, order important: from outcomes s_1, ..., s_n choose k ≤ n. The number of outcome vectors (a_1, a_2, ..., a_k) is
P_{n,k} = n(n − 1) ⋯ (n − k + 1).
Example: order the numbers 1, 2, 3 in groups of 2; (1, 2) and (2, 1) are different. P_{3,2} = 3 · 2 = 6.
P_{n,n} = n(n − 1) ⋯ 1 = n!, and in general P_{n,k} = n!/(n − k)!.
Sampling without replacement, k at once: from s_1, ..., s_n sample a subset {b_1, ..., b_k} of size k, where we are not concerned with order. Each subset can be ordered in k! ways, so divide that out of P_{n,k}:
number of subsets = C_{n,k} = (n choose k) = n!/(k!(n − k)!).
The C_{n,k} are the binomial coefficients.
Binomial Theorem: (x + y)^n = Σ_{k=0}^n (n choose k) x^k y^{n−k}.
There are (n choose k) ways to choose which k of the n factors contribute an x, which is where the binomial coefficients come from.
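As a quick sanity check of these counting formulas (not part of the original notes), a minimal Python sketch computes P_{n,k} and C_{n,k} with the standard library and verifies the binomial theorem numerically:

```python
import math

# Permutations P_{n,k} and combinations C_{n,k} for n = 3, k = 2
n, k = 3, 2
print(math.perm(n, k))   # 6 ordered pairs, matches P_{3,2} = 3*2
print(math.comb(n, k))   # 3 unordered subsets, matches 6 / 2!

# Numerical check of the binomial theorem for x = 2, y = 5, n = 7
x, y, n = 2.0, 5.0, 7
lhs = (x + y) ** n
rhs = sum(math.comb(n, j) * x**j * y**(n - j) for j in range(n + 1))
print(abs(lhs - rhs) < 1e-9)   # True
```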
Counting with repetition ("balls and boxes"): distribute n indistinguishable balls into k boxes. Fix the outer walls and rearrange the balls and the separators: if you fix the outer walls of the first and last boxes, you are arranging n balls and k − 1 separators (k boxes) in a row, which can be counted with binomial coefficients.
Number of different ways to arrange the balls and separators = (n + k − 1 choose n) = (n + k − 1 choose k − 1).
Example: f(x_1, x_2, ..., x_k); take n partial derivatives, e.g. ∂ⁿf / (∂x_1² ∂x_2⁵ ∂x_3 ⋯ ∂x_k). The k coordinates play the role of the k boxes and the n partial derivatives play the role of the n balls, so
number of different partial derivatives of order n = (n + k − 1 choose n) = (n + k − 1 choose k − 1).
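A small brute-force check of the balls-and-boxes count (my own illustration, not from the notes): multisets of size n drawn from k coordinates correspond exactly to orders of mixed partial derivatives.

```python
import math
from itertools import combinations_with_replacement

# Brute-force check of the "balls and boxes" count for small n, k:
# multisets of size n from k coordinates <-> distinct order-n partial derivatives.
n, k = 5, 3
brute = sum(1 for _ in combinations_with_replacement(range(k), n))
formula = math.comb(n + k - 1, n)
print(brute, formula, brute == formula)   # 21 21 True
```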
** End of Lecture 2.
1.9 Multinomial Coefficients
These count the ways to split objects into groups of various sizes.
s_1, s_2, ..., s_n: n elements, with n_1 in group 1, n_2 in group 2, ..., n_k in group k, where n_1 + ... + n_k = n. Choosing the groups one at a time,
(n choose n_1)(n − n_1 choose n_2)(n − n_1 − n_2 choose n_3) ⋯ (n − n_1 − ... − n_{k−2} choose n_{k−1})(n_k choose n_k)
= [n!/(n_1!(n − n_1)!)] · [(n − n_1)!/(n_2!(n − n_1 − n_2)!)] ⋯ [(n − n_1 − ... − n_{k−2})!/(n_{k−1}! n_k!)]
= n!/(n_1! n_2! ⋯ n_{k−1}! n_k!) = (n choose n_1, n_2, ..., n_k), the multinomial coefficient.
Further explanation: you have n spots, and n! ways to place your elements in them. However, you can permute the elements within a particular group and the splitting is still the same, so you must divide out these internal permutations. This is a "distinguishable permutations" situation.
Example #1: 20 members of a club need to be split into 3 committees (A, B, C) of 8, 8, and 4 people, respectively. How many ways are there to split the club into these committees?
ways to split = (20 choose 8, 8, 4) = 20!/(8! 8! 4!)
Example #2: when rolling 12 dice, what is the probability that 6 pairs are thrown, i.e. each number appears exactly twice?
There are 6¹² equally likely outcomes, since each of the 12 dice has 6 possible values. For the pairs, the only freedom is which dice show each number:
P = (12 choose 2, 2, 2, 2, 2, 2) / 6¹² = 12!/((2!)⁶ · 6¹²) ≈ 0.0034.
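The probability in Example #2 can be computed exactly with a couple of lines of Python (an illustration added here, not part of the original notes):

```python
import math

# Probability that 12 dice show six pairs (each face exactly twice):
# multinomial(12; 2,2,2,2,2,2) / 6**12
ways = math.factorial(12) // math.factorial(2) ** 6
p = ways / 6 ** 12
print(p)   # ~0.00344
```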
1.10 Calculating the Probability of a Union of Events
P(A ∪ B) = P(A) + P(B) − P(AB) (Figure 1)
P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(AB) − P(BC) − P(AC) + P(ABC) (Figure 2)
Theorem (inclusion-exclusion):
P(⋃_{i=1}^n A_i) = Σ_i P(A_i) − Σ_{i<j} P(A_i A_j) + Σ_{i<j<k} P(A_i A_j A_k) − ... + (−1)^{n+1} P(A_1 ⋯ A_n).
Idea of the proof: express the union as disjoint pieces, classified according to which sets each piece belongs to and which it does not, then add them up. A_1 ∪ ... ∪ A_n can be split into a disjoint partition of sets, so
P(⋃_{i=1}^n A_i) = Σ P(disjoint pieces).
To check that the theorem is correct, see how many times each piece is counted on the right-hand side. Suppose a piece belongs to exactly k of the sets, say A_1, ..., A_k, and to none of the others. Then:
- in P(A_1), P(A_2), ..., P(A_k) it is counted k times;
- in Σ_{i<j} P(A_i A_j) it is counted (k choose 2) times (it must lie in both A_i and A_j, and there are (k choose 2) such intersections).
Example: consider the piece A ∩ B ∩ Cᶜ. In P(A ∪ B ∪ C) it is counted once. On the right: in P(A) + P(B) + P(C) it is counted twice; in −P(AB) − P(AC) − P(BC) it is subtracted once; in +P(ABC) it is counted zero times. The total: 2 − 1 + 0 = 1, so the piece is counted exactly once.
Example: consider the piece A_1 ∩ A_2 ∩ A_3 ∩ A_4ᶜ (k = 3, n = 4). In P(A_1) + P(A_2) + P(A_3) + P(A_4) it is counted k = 3 times. In −P(A_1A_2) − P(A_1A_3) − P(A_1A_4) − P(A_2A_3) − P(A_2A_4) − P(A_3A_4) it is counted (k choose 2) = 3 times (with a minus sign). In Σ_{i<j<k} P(A_iA_jA_k) it is counted (3 choose 3) = 1 time.
In general, the total number of times such a piece is counted is
k − (k choose 2) + (k choose 3) − ... + (−1)^{k+1} (k choose k).
To simplify, use the binomial theorem:
0 = (1 − 1)^k = Σ_{i=0}^k (−1)^i (k choose i) = (k choose 0) − (k choose 1) + (k choose 2) − (k choose 3) + ...,
so 0 = 1 − [k − (k choose 2) + (k choose 3) − ...], and the sum of times counted equals 1. Therefore all disjoint pieces are counted exactly once.
** End of Lecture 3
Example (matching problem): n letters are placed at random into n envelopes, and A_i = {letter i lands in the correct envelope}. By inclusion-exclusion,
P(⋃_{i=1}^n A_i) = Σ P(A_i) − Σ_{i<j} P(A_i A_j) + Σ_{i<j<k} P(A_i A_j A_k) − ...
= 1 − 1/2! + 1/3! − ... + (−1)^{n+1}/n!.
Recall the Taylor series for e^x = 1 + x + x²/2! + x³/3! + ...; for x = −1, e^{−1} = 1 − 1 + 1/2! − 1/3! + ....
Therefore the sum above is 1 minus the limit of this Taylor series as n → ∞: when n is large, the probability converges to 1 − e^{−1} ≈ 0.63.
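A short Monte Carlo sketch (added here as an illustration, with an assumed n = 20) confirms that the probability of at least one match is close to 1 − 1/e:

```python
import random
import math

# Monte Carlo estimate of P(at least one letter lands in its own envelope)
# for n letters placed uniformly at random (the matching problem).
def at_least_one_match(n):
    perm = list(range(n))
    random.shuffle(perm)
    return any(perm[i] == i for i in range(n))

n, trials = 20, 100_000
estimate = sum(at_least_one_match(n) for _ in range(trials)) / trials
print(estimate, 1 - math.exp(-1))   # both close to 0.632
```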
2.1 Conditional Probability
Given that B happened, what is the probability that A also happened? The sample space is narrowed down to the part where B has occurred: only outcomes consistent with the event B are considered.
Definition: the conditional probability of event A given event B is
P(A|B) = P(AB)/P(B).
It is sometimes easier to calculate an intersection from a conditional probability: P(AB) = P(A|B)P(B).
Example: roll 2 dice and let T be the sum; given that T is odd, find P(T < 8). Let B = {T is odd}, A = {T < 8}. Out of 36 outcomes, 18 give an odd sum, so P(B) = 18/36 = 1/2; the odd sums below 8 are 3, 5, 7, giving 2 + 4 + 6 = 12 outcomes, so P(AB) = 12/36 = 1/3. Hence
P(A|B) = P(AB)/P(B) = (1/3)/(1/2) = 2/3.
Example (relapse data from a clinical-trial table): let A = {relapse}. Considering the placebo group, B = {placebo}, the counts in the table give P(A|B) ≈ 0.7. Considering a treatment group B, the counts give P(A|B) = 13/(13 + 25) ≈ 0.34.
2.2 Independence of Events
Recall P(A|B) = P(AB)/P(B). Definition: A and B are independent if P(A|B) = P(A), i.e.
P(A|B) = P(AB)/P(B) = P(A) ⇔ P(AB) = P(A)P(B).
Experiments can be physically independent (roll one die, then roll another die), or seem physically related and still be independent.
Example: roll one die, A = {odd}, B = {1, 2, 3, 4}. Related events, but independent:
P(A) = 1/2, P(B) = 2/3, AB = {1, 3}, so P(AB) = 1/3 = (1/2)(2/3) = P(A)P(B), therefore independent.
Independence does not imply that the sets do not intersect.
Example: toss an unfair coin twice; the two tosses are independent events. P(H) = p, 0 ≤ p ≤ 1. Find P(TH) = P(tails first, heads second):
P(TH) = P(T)P(H) = (1 − p)p.
Since the coin is unfair, this probability is not simply 1/4. If the coin is fair, each of HH, HT, TH, TT has probability 1/4.
If you have several events A_1, A_2, ..., A_n that you need to prove independent, it is necessary to show that every subset is independent: for all A_{i_1}, ..., A_{i_k} with 2 ≤ k ≤ n,
P(A_{i_1} A_{i_2} ⋯ A_{i_k}) = P(A_{i_1}) P(A_{i_2}) ⋯ P(A_{i_k}).
Proving only that any 2 events are independent is called pairwise independence, and it is not sufficient to prove that all the events are independent.
Example of pairwise independence: consider a tetrahedral die, equally weighted. Three of the faces are colored red, blue, and green respectively, and the last face is multicolored, containing red, blue, and green.
P(red) = 2/4 = 1/2 = P(blue) = P(green)
P(red and blue) = 1/4 = (1/2)(1/2) = P(red)P(blue)
Therefore the pair {red, blue} is independent, and the same can be proven for {red, green} and {blue, green}. But what about all three together?
P(red, blue, and green) = 1/4 ≠ P(red)P(blue)P(green) = 1/8, so the events are not fully independent.
Example: P(H) = p, P(T) = 1 − p for an unfair coin. Toss the coin 5 times:
P(HTHTT) = P(H)P(T)P(H)P(T)P(T) = p(1 − p)p(1 − p)(1 − p) = p²(1 − p)³.
Example: find P(get 2 H and 3 T, in any order). This is the sum of the probabilities over all orderings:
P(HHTTT) + P(HTHTT) + ... = p²(1 − p)³ + p²(1 − p)³ + ... = (5 choose 2) p²(1 − p)³.
Example: toss a coin until the result is heads; suppose there are n tosses up to and including the first H.
P(number of tosses = n) = ? The sequence must be TT...TH, with n − 1 T's, so
P(tosses = n) = P(TT...TH) = (1 − p)^{n−1} p.
Example: in a criminal case, witnesses give a specific description of a couple seen fleeing the scene.
P(a random couple meets the description) = 8.3 × 10⁻⁸ = p.
We know from the beginning that 1 such couple exists. Perhaps a better question to ask is: given that a couple exists, what is the probability that another couple fits the same description?
Let A = {at least 1 couple fits}, B = {at least 2 couples fit}; find P(B|A). Since B ⊆ A,
P(B|A) = P(BA)/P(A) = P(B)/P(A).
If n = 8 million people, P(B|A) ≈ 0.2966, which is within reasonable doubt! P(2 couples) < P(1 couple), but given that 1 couple exists, the probability that 2 exist is not insignificant.
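A hedged numerical check of this figure (my own sketch, assuming the number of matching couples is Binomial(n, p) with n = 8 million couples):

```python
# Check of the two-couples calculation, assuming the number of couples that
# fit the description is Binomial(n, p) with n = 8 million couples.
p = 8.3e-8
n = 8_000_000
p0 = (1 - p) ** n                    # P(no couple fits)
p1 = n * p * (1 - p) ** (n - 1)      # P(exactly one couple fits)
P_A = 1 - p0                         # at least one
P_B = 1 - p0 - p1                    # at least two
print(P_B / P_A)                     # ~0.3, close to the 0.2966 quoted above
```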
In the large sample space, the probability that B occurs when we know that A occurred is significant!
2.3 Bayes's Theorem
It is sometimes useful to separate a sample space S into a set of disjoint partitions B_1, ..., B_k:
B_i ∩ B_j = ∅ for i ≠ j, and S = ⋃_{i=1}^k B_i (disjoint).
Total probability: P(A) = Σ_{i=1}^k P(AB_i) = Σ_{i=1}^k P(A|B_i)P(B_i)
(all the AB_i are disjoint and ⋃_{i=1}^k AB_i = A).
** End of Lecture 5
Solutions to Problem Set #1
1-1 (pg. 12 #9): B_n = ⋃_{i=n}^∞ A_i, C_n = ⋂_{i=n}^∞ A_i.
a) B_n ⊇ B_{n+1} ⊇ ...: B_n = A_n ∪ (⋃_{i=n+1}^∞ A_i) = A_n ∪ B_{n+1}, so s ∈ B_{n+1} ⇒ s ∈ B_{n+1} ∪ A_n = B_n.
C_n ⊆ C_{n+1} ⊆ ...: C_n = A_n ∩ C_{n+1}, so s ∈ C_n ⇒ s ∈ C_{n+1}.
b) s ∈ ⋂_{n=1}^∞ B_n ⇔ s ∈ B_n for all n ⇔ for every n, s ∈ A_i for some i ≥ n ⇔ s belongs to infinitely many of the events A_i (the events A_i happen infinitely often).
c) s ∈ ⋃_{n=1}^∞ C_n ⇔ s ∈ C_n = ⋂_{i=n}^∞ A_i for some n ⇔ s ∈ A_i for all i ≥ n, i.e. s belongs to all events starting at some n.
1-2 (pg. 18 #4): P(at least 1 fails) = 1 − P(neither fails) = 1 − 0.4 = 0.6.
1-3 (pg. 18 #12): given A_1, A_2, ..., define B_1 = A_1, B_2 = A_1ᶜ ∩ A_2, ..., B_n = A_1ᶜ ⋯ A_{n−1}ᶜ ∩ A_n. Then
P(⋃_{i=1}^n A_i) = Σ_{i=1}^n P(B_i): the B_i split the union into disjoint events and cover the same set.
This follows from ⋃_{i=1}^n A_i = ⋃_{i=1}^n B_i. Take a point s in ⋃_{i=1}^n A_i; it belongs to at least one A_i. If s ∈ A_1, then s ∈ B_1; if not, s ∈ A_1ᶜ, and if s ∈ A_2 then s ∈ A_1ᶜ ∩ A_2 = B_2; if not, continue. At some point the point belongs to a set: the sequence stops when s ∈ A_1ᶜ ∩ A_2ᶜ ∩ ... ∩ A_{k−1}ᶜ ∩ A_k = B_k.
So s ∈ ⋃_{i=1}^n B_i, hence P(⋃_{i=1}^n A_i) = P(⋃_{i=1}^n B_i) = Σ_{i=1}^n P(B_i) if the B_i are disjoint. (One should also check that each B_i ⊆ A_i, so the reverse inclusion holds.)
The B_i are disjoint by construction: for i < j, B_i = A_1ᶜ ⋯ A_{i−1}ᶜ ∩ A_i while B_j = A_1ᶜ ⋯ A_iᶜ ⋯ A_{j−1}ᶜ ∩ A_j, so s ∈ B_i implies s ∈ A_i while s′ ∈ B_j implies s′ ∉ A_i, hence s ≠ s′.
1-4 (pg. 27 #5): #(S) = 6·6·6·6 = 6⁴, #(all different) = 6·5·4·3 = P_{6,4},
P(all different) = P_{6,4}/6⁴ = 5/18.
1-5 (pg. 27 #7):
12 balls are placed at random into 20 boxes.
P(no box receives more than 1 ball) means every box has 0 or 1 balls, i.e. all balls fall into different boxes.
#(S) = 20¹², #(all different) = 20·19 ⋯ 9 = P_{20,12},
P(no box receives more than 1 ball) = P_{20,12} / 20¹².
P = (98 choose 8)/(100 choose 8): the probability that a random sample of 8 out of 100 items avoids 2 particular items.
(n + (r − n) − 1 choose r − n) = (r − 1 choose r − n): the number of ways to place r indistinguishable balls into n boxes so that every box is nonempty (first put one ball in each box, then distribute the remaining r − n freely).
Bayes Formula.
Partition B_1, ..., B_k: ⋃_{i=1}^k B_i = S, B_i ∩ B_j = ∅ for i ≠ j.
P(A) = Σ_{i=1}^k P(AB_i) = Σ_{i=1}^k P(A|B_i)P(B_i)  (total probability).
Example: In box 1 there are 60 short bolts and 40 long bolts. In box 2 there are 10 short bolts and 20 long bolts. Take a box at random, and pick a bolt. What is the probability that you chose a short bolt?
B_1 = choose box 1, B_2 = choose box 2.
P(short) = P(short|B_1)P(B_1) + P(short|B_2)P(B_2) = (60/100)(1/2) + (10/30)(1/2).
Example: partitions B_1, B_2, ..., B_k with a known distribution, and events A, A, ..., A for which you know P(A|B_i) for each B_i. If you know that A happened, what is the probability that it came from a particular B_i?
P(B_i|A) = P(B_i A)/P(A) = P(A|B_i)P(B_i) / [P(A|B_1)P(B_1) + ... + P(A|B_k)P(B_k)]  : Bayes's Formula.
In the medical-test example worked in lecture: even given a positive test result, the probability that you actually have the disease is still very small.
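A minimal sketch of total probability and Bayes's formula on the bolt example above (added here as an illustration):

```python
# Total probability and Bayes's formula for the two-box bolt example.
# Priors: each box is chosen with probability 1/2.
prior = {"box1": 0.5, "box2": 0.5}
p_short = {"box1": 60 / 100, "box2": 10 / 30}

# Total probability: P(short)
total = sum(p_short[b] * prior[b] for b in prior)
print(total)                                  # 0.4666...

# Bayes: P(box1 | short)
posterior_box1 = p_short["box1"] * prior["box1"] / total
print(posterior_box1)                         # ~0.643
```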
Example: a gene has 2 alleles, A and a. The gene exhibits itself through a trait with two versions. The possible phenotypes are dominant, with genotypes AA or Aa, and recessive, with genotype aa. Alleles travel independently, each derived from a parent's genotype.
In the population, the probability of having a particular allele is P(A) = 0.5, P(a) = 0.5. Therefore the probabilities of the genotypes are P(AA) = 0.25, P(Aa) = 0.5, P(aa) = 0.25.
Partition by the genotypes of the parents: (AA, AA), (AA, Aa), (AA, aa), (Aa, Aa), (Aa, aa), (aa, aa). Assume pairs match regardless of genotype, so the probabilities of the parent pairs are:
(AA, AA): (1/4)(1/4) = 1/16
(AA, Aa): 2(1/4)(1/2) = 1/4
(AA, aa): 2(1/4)(1/4) = 1/8
(Aa, Aa): (1/2)(1/2) = 1/4
(Aa, aa): 2(1/2)(1/4) = 1/4
(aa, aa): (1/4)(1/4) = 1/16
If you see that a person shows the dominant trait (say, dark hair), predict the genotypes of the parents. The probability that a child shows the dominant trait is 1 for parents (AA, AA), (AA, Aa), (AA, aa), 3/4 for (Aa, Aa), 1/2 for (Aa, aa), and 0 for (aa, aa), so
P((AA, AA)|dominant) = (1/16)(1) / [(1/16)(1) + (1/4)(1) + (1/8)(1) + (1/4)(3/4) + (1/4)(1/2) + (1/16)(0)] = (1/16)/(3/4) = 1/12.
You can do the same computation to find the posterior probability of each type of couple: Bayes's formula gives a prediction about the parents' genotypes, which you aren't able to observe directly.
Example: you have 1 machine.
In good condition it produces defective items only 1% of the time; P(good condition) = 90%.
In broken condition it produces defective items 40% of the time; P(broken) = 10%.
You sample 6 items and find that 2 are defective. Is the machine broken? This is very similar to the medical example worked earlier in lecture:
P(good | 2 of 6 defective) = P(2 of 6|good)P(good) / [P(2 of 6|good)P(good) + P(2 of 6|broken)P(broken)]
= (6 choose 2)(0.01)²(0.99)⁴(0.9) / [(6 choose 2)(0.01)²(0.99)⁴(0.9) + (6 choose 2)(0.4)²(0.6)⁴(0.1)] ≈ 0.04.
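The same posterior can be checked numerically (an illustration added here, not part of the original notes):

```python
from math import comb

# Posterior probability that the machine is in good condition
# after observing 2 defectives in a sample of 6 (Bayes's formula).
def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p) ** (n - k)

like_good = binom_pmf(2, 6, 0.01)
like_broken = binom_pmf(2, 6, 0.40)
num = like_good * 0.9
den = like_good * 0.9 + like_broken * 0.1
print(num / den)   # ~0.04
```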
** End of Lecture 7
3.1 Random Variables and Distributions
A random variable transforms the outcome of an experiment into a number.
Definitions:
Probability Space: (S, A, P)
S - sample space, A - events, P - probability
A random variable is a function on S with values in the real numbers, X: S → R.
Examples:
Toss a coin 10 times; Sample Space = {HTH...HT, ...}, all configurations of H and T.
Random variable X = number of heads, X: S → R; for this example, X: S → {0, 1, ..., 10}.
There are fewer values than outcomes in S; you need to give the distribution of the random variable in order to get the entire picture, i.e. the probabilities of its values.
Definition: the distribution of a random variable X: S → R is defined by
P_X(A) = P(X ∈ A) = P({s ∈ S : X(s) ∈ A}) for A ⊆ R.
Example: the uniform distribution on a finite set of values {1, 2, 3, ..., n}: each outcome has equal probability, f(s_k) = 1/n (uniform probability function).
Given a random variable X: S → R with P_X(A) = P(X ∈ A), we can redefine the probability space on the distribution of the random variable: take (R, A, P_X) as the new sample space, with X(x) = x the identity map. Then P(X : X(x) ∈ A) = P(x ∈ A) = P_X(A), so all you need are the outcomes mapped to real numbers and the probabilities of the mapped outcomes.
Example: the uniform distribution on an interval [a, b], denoted U[a, b], has p.d.f.
f(x) = 1/(b − a) for x ∈ [a, b]; 0 for x ∉ [a, b].
Example: on an interval [a, b] such that a < c < d < b,
P([c, d]) = ∫_c^d 1/(b − a) dx = (d − c)/(b − a)  (probability of a subinterval).
Example: the exponential distribution E(α) has p.d.f. f(x) = αe^{−αx} for x ≥ 0, and
∫_0^∞ αe^{−αx} dx = (−e^{−αx})|_0^∞ = 1.
Real world: the exponential distribution describes the life span of quality products (electronics).
** End of Lecture 8
Discrete random variable: defined by a probability function (p.f.) on values {s_1, s_2, ...}, f(s_i) = P(X = s_i).
Continuous: probability density function (p.d.f.), also called the density function: f(x) ≥ 0, ∫ f(x)dx = 1, and P(X ∈ A) = ∫_A f(x)dx.
Cumulative distribution function (c.d.f.): F(x) = P(X ≤ x). Properties:
1. F is non-decreasing;
2. lim_{x→−∞} F(x) = 0, lim_{x→+∞} F(x) = 1;
3. F is right continuous: lim_{y→x+} F(y) = F(x), where F(y) = P(X ≤ y) is the probability of the event {X ≤ y}.
Example: X = 0 or 1 with probability 1/2 each. Then P(X ≤ x) = 0 for x < 0; P(X ≤ x) = P(X = 0) = 1/2 for x ∈ [0, 1); P(X ≤ x) = P(X = 0 or 1) = 1 for x ∈ [1, ∞).
Probability of the random variable falling in an interval:
P(x_1 < X ≤ x_2) = P({X ≤ x_2} \ {X ≤ x_1}) = P(X ≤ x_2) − P(X ≤ x_1) = F(x_2) − F(x_1).
(Note {X ≤ x_1} ⊆ {X ≤ x_2} for x_1 < x_2.)
Probability of a single point x: P(X = x) = F(x) − F(x⁻), where F(x⁻) = lim_{y→x−} F(y) and F(x⁺) = lim_{y→x+} F(y).
If F is continuous at a point, the probability of that point equals 0; if there is a jump, the probability is the size of the jump.
P(x_1 ≤ X ≤ x_2) = F(x_2) − F(x_1⁻). In general, P(A) = P(X ∈ A) for
X - random variable with distribution P
When observing a c.d.f:
Discrete: sum of probabilities at all the jumps = 1. Graph is horizontal in between the jumps, meaning that probability = 0 in those intervals.
If F is differentiable (X continuous), then f(x) = F′(x).
Quantile: for p ∈ [0, 1], the p-quantile is inf{x : F(x) = P(X ≤ x) ≥ p}: find the smallest point such that the probability up to that point is at least p. Equivalently, the area under the p.d.f. f up to this point is at least p.
If the 0.25-quantile is at x = 0, then P(X ≤ 0) ≥ 0.25. Note that if the c.d.f. jumps at x = 0 to the value 0.5, then the 0.25-quantile is at x = 0, but so are the 0.3-, 0.4-, ... quantiles, all the way up to 0.5.
What if you have 2 random variables, or several? For example, take a person and measure weight and height. The separate behavior of each tells you nothing about the pairing; you need to describe the joint distribution.
Consider a pair of random variables (X, Y)
Joint distribution of (X, Y): P((X, Y) ∈ A), where the event A is a set A ⊆ R².
Discrete distribution: (X, Y) takes values {(s_1¹, s_1²), (s_2¹, s_2²), ...} with joint p.f.
f(s_i¹, s_i²) = P((X, Y) = (s_i¹, s_i²)) = P(X = s_i¹, Y = s_i²).
This is often visualized as a table that assigns a probability to each point: rows indexed by the values of X (values such as −1, 0, 5 in the lecture's table), columns by the values of Y, and entries f(x, y).
Continuous distribution: joint p.d.f. f(x, y) ≥ 0 with ∫∫_{R²} f(x, y)dxdy = 1, and P((X, Y) ∈ A) = ∫∫_A f(x, y)dxdy.
Joint c.d.f.: F(x, y) = P(X ≤ x, Y ≤ y). In the continuous case,
F(x, y) = ∫_{−∞}^x ∫_{−∞}^y f(u, v) dv du and ∂²F/∂x∂y = f(x, y).
** End of Lecture 9
In the continuous case: F(x, y) = P(X ≤ x, Y ≤ y) = ∫_{−∞}^x ∫_{−∞}^y f(u, v) dv du.
Marginal Distributions
Given the joint distribution of (X, Y), the individual distributions of X and Y are the marginal distributions.
Discrete (X, Y): marginal probability function f_1(x) = P(X = x) = Σ_y P(X = x, Y = y) = Σ_y f(x, y).
In the table from the previous lecture of probabilities for each point (x, y): add up all values of y in the row x = 1 to determine P(X = 1).
Continuous (X, Y) with joint p.d.f. f(x, y): the p.d.f. of X is f_1(x) = ∫ f(x, y)dy, and
F_1(x) = P(X ≤ x) = P(X ≤ x, Y < ∞) = ∫_{−∞}^x ∫_{−∞}^∞ f(u, y) dy du.
Review of Distribution Types
Discrete distribution for (X, Y): joint p.f. f(x, y) = P(X = x, Y = y).
Continuous: joint p.d.f. f(x, y) ≥ 0, ∫∫_{R²} f(x, y)dxdy = 1.
Joint c.d.f.: F(x, y) = P(X ≤ x, Y ≤ y); marginal c.d.f. F_1(x) = P(X ≤ x) = lim_{y→∞} F(x, y).
Example: a marginal p.d.f. obtained by integrating out y: f_1(x) = (21/8) x²(1 − x⁴) for −1 ≤ x ≤ 1.
Discrete values for X, Y in tabular form (rows X = 1, 2; columns Y = 1, 2; last row and column are the marginals):
        Y=1   Y=2   f_1(x)
X=1:    0.5    0     0.5
X=2:     0    0.5    0.5
f_2(y): 0.5   0.5
Note: if all four entries had the value 0.25, the two marginal distributions would be exactly the same as above; different joint distributions can have identical marginals.
Independent X and Y. Definition: X, Y are independent if P(X ∈ A, Y ∈ B) = P(X ∈ A)P(Y ∈ B).
Joint c.d.f.: F(x, y) = P(X ≤ x, Y ≤ y) = P(X ≤ x)P(Y ≤ y) = F_1(x)F_2(y) (intersection of independent events), so the joint c.d.f. factors for independent random variables.
Implication: if (X, Y) is continuous with joint p.d.f. f(x, y) and marginals f_1(x), f_2(y), then
F(x, y) = ∫_{−∞}^x ∫_{−∞}^y f(u, v) dv du = F_1(x)F_2(y) = (∫_{−∞}^x f_1(u)du)(∫_{−∞}^y f_2(v)dv), hence f(x, y) = f_1(x)f_2(y).
If the support of the joint distribution is not a rectangle, X and Y cannot be independent: one can find a square with P((X, Y) ∈ square) = 0 while P(X ∈ its x-side)·P(Y ∈ its y-side) > 0.
Example: f(x, y) = kx²y² for 0 ≤ x ≤ 1, 0 ≤ y ≤ 1; 0 otherwise. This can be written as a product, so X and Y are independent:
f(x, y) = kx²y² I(0 ≤ x ≤ 1, 0 ≤ y ≤ 1) = k_1 x² I(0 ≤ x ≤ 1) · k_2 y² I(0 ≤ y ≤ 1);
the conditions on x and y can be separated.
Note: indicator notation: I(x ∈ A) = 1 if x ∈ A; 0 if x ∉ A.
For the discrete case, given a table of values, you can check independence. With rows a_1, ..., a_n (values of X), columns b_1, ..., b_m (values of Y), entries p_ij and marginals p_{i+}, p_{+j}:
p_ij = P(X = a_i, Y = b_j), p_{i+} = P(X = a_i) = Σ_{j=1}^m p_ij, p_{+j} = P(Y = b_j) = Σ_{i=1}^n p_ij.
X and Y are independent if and only if p_ij = p_{i+} p_{+j} for every i, j, i.e. for all points in the table.
** End of Lecture 10
Conditional distribution (discrete case): the conditional p.f. of Y given X = x is
f(y|x) = P(Y = y|X = x) = P(X = x, Y = y)/P(X = x) = f(x, y)/f_1(x).
Note: this is defined only when f_1(x) is positive; if the marginal probability is zero, the conditional probability is undefined.
Continuous Case:
The formulas are the same, but we can't treat them as exact probabilities at fixed points, since single points have probability zero. Consider instead probability densities. For the conditional c.d.f. of X given Y = y, condition on Y ∈ [y − δ, y + δ]:
P(X ≤ x | Y ∈ [y − δ, y + δ]) = P(X ≤ x, Y ∈ [y − δ, y + δ]) / P(Y ∈ [y − δ, y + δ])
= ∫_{−∞}^x ∫_{y−δ}^{y+δ} f(u, v)dv du / ∫_{y−δ}^{y+δ} f_2(v)dv.
As δ → 0 this converges to the conditional c.d.f.
∫_{−∞}^x f(u, y)du / f_2(y),
so the conditional p.d.f. of X given Y = y is f(x|y) = f(x, y)/f_2(y).
Bayes's formula for random variables: for each y you know the conditional distribution of X; in statistics, after observing the data, you figure out the parameter using Bayes's formula (in the discrete case this reduces to the usual Bayes formula).
Example: draw X uniformly on [0, 1], then draw Y uniformly on [X, 1].
p.d.f.s: f(x) = 1 · I(0 ≤ x ≤ 1), f(y|x) = 1/(1 − x) · I(x ≤ y ≤ 1).
Joint p.d.f.: f(x, y) = f(y|x)f(x) = 1/(1 − x) · I(0 ≤ x ≤ y ≤ 1).
Marginal: f(y) = ∫ f(x, y)dx = ∫_0^y dx/(1 − x) = −ln(1 − x)|_0^y = −ln(1 − y).
Keep in mind the support throughout: y ∈ [0, 1] and f(y) = 0 if y ∉ [0, 1].
Conditional (of X given Y): f(x|y) = f(x, y)/f(y) = 1/((1 − x)(−ln(1 − y))) · I(0 ≤ x ≤ y ≤ 1).
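A short Monte Carlo check of the marginal just derived (my own illustration; the exact c.d.f. below follows by integrating −ln(1 − y)):

```python
import random
import math

# Monte Carlo check of the marginal of Y when X ~ U[0,1] and Y ~ U[X,1]:
# the marginal p.d.f. is f(y) = -ln(1 - y), so P(Y <= t) = (1 - t) ln(1 - t) + t.
t = 0.5
trials = 200_000
hits = 0
for _ in range(trials):
    x = random.random()
    y = x + (1 - x) * random.random()   # uniform on [x, 1]
    hits += (y <= t)
print(hits / trials)                      # ~0.153
print((1 - t) * math.log(1 - t) + t)      # 0.1534...
```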
Multivariate Distributions
Consider n random variables X_1, X_2, ..., X_n.
Joint p.f.: f(x_1, x_2, ..., x_n) = P(X_1 = x_1, ..., X_n = x_n) ≥ 0, with Σ f = 1.
Joint p.d.f.: f(x_1, x_2, ..., x_n) ≥ 0, with ∫ f dx_1 dx_2 ⋯ dx_n = 1.
Marginals and conditionals are defined in the same way. Vector notation simplifies things: X = (X_1, ..., X_n), x = (x_1, ..., x_n); split into subsets of coordinates X = (Y, Z) with Y = (X_1, ..., X_k), y = (y_1, ..., y_k), Z = (X_{k+1}, ..., X_n), z = (z_1, ..., z_{n−k}). The joint p.d.f. or p.f. of X is f(x) = f(y, z). Then
f(y) = ∫ f(y, z)dz, f(z) = ∫ f(y, z)dy,
f(y|z) = f(y, z)/f(z), and f(z|y) = f(y|z)f(z) / ∫ f(y|z)f(z)dz (Bayes's formula).
Functions of Random Variables
Consider a random variable X and a function r: R → R; let Y = r(X). We want the distribution of Y.
Discrete case: the p.f. of Y is
f(y) = P(Y = y) = P(r(X) = y) = P({x : r(x) = y}) = Σ_{x: r(x) = y} f(x)
(very similar to a change of variable).
Continuous case: find the c.d.f. of Y = r(X) first:
P(Y ≤ y) = P(r(X) ≤ y) = P({x : r(x) ≤ y}) = P(A(y)) = ∫_{A(y)} f(x)dx, where A(y) = {x : r(x) ≤ y};
then the p.d.f. is f(y) = (d/dy) ∫_{A(y)} f(x)dx.
** End of Lecture 11
18.05 Lecture 12
March 2, 2005
Example: take X uniform on [−1, 1] and Y = X²; find the distribution of Y.
p.d.f. of X: f(x) = 1/2 for −1 ≤ x ≤ 1; 0 otherwise. For 0 ≤ y ≤ 1,
P(Y ≤ y) = P(X² ≤ y) = P(−√y ≤ X ≤ √y) = ∫_{−√y}^{√y} (1/2)dx = √y,
so the p.d.f. of Y is f(y) = 1/(2√y) for 0 < y ≤ 1.
If X has continuous c.d.f. F, let Y = F(X):
c.d.f.: P(Y ≤ y) = P(F(X) ≤ y) = P(X ≤ F⁻¹(y)) = F(F⁻¹(y)) = y for 0 ≤ y ≤ 1,
p.d.f.: f(y) = 1 for 0 ≤ y ≤ 1; 0 otherwise. So Y is uniform on the interval [0, 1].
Conversely, if X is uniform on [0, 1] and Y = F⁻¹(X), then P(Y ≤ y) = P(F⁻¹(X) ≤ y) = P(X ≤ F(y)) = F(y), so the random variable Y = F⁻¹(X) has c.d.f. F.
Sums of random variables: suppose (X, Y) has joint p.d.f. f(x, y) and let Z = X + Y. Then
P(Z ≤ z) = P(X + Y ≤ z) = ∫∫_{x+y≤z} f(x, y)dxdy = ∫_{−∞}^∞ ∫_{−∞}^{z−x} f(x, y)dy dx,
and differentiating, the p.d.f. of Z is f_Z(z) = (d/dz)P(Z ≤ z) = ∫_{−∞}^∞ f(x, z − x)dx.
If X, Y are independent with p.d.f.s f_1(x) and f_2(y), the joint p.d.f. is f(x, y) = f_1(x)f_2(y), so
f_Z(z) = ∫ f_1(x) f_2(z − x)dx  (the convolution formula).
Example: X, Y independent exponential E(λ), f(x) = λe^{−λx} for x ≥ 0 (check: ∫_0^∞ λe^{−λx}dx = −e^{−λx}|_0^∞ = 1). For z ≥ 0,
f_Z(z) = ∫_0^z λe^{−λx} · λe^{−λ(z−x)}dx = λ²e^{−λz} ∫_0^z dx = λ² z e^{−λz}.
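A quick simulation check of this convolution result (my own illustration; the closed-form c.d.f. used below follows by integrating λ²ze^{−λz}):

```python
import random
import math

# Simulation check that the sum of two independent Exp(lam) variables
# has density lam^2 z e^(-lam z), i.e. c.d.f. 1 - e^(-lam z)(1 + lam z).
lam, z0, trials = 2.0, 1.0, 200_000
count = sum(
    (random.expovariate(lam) + random.expovariate(lam)) <= z0 for _ in range(trials)
)
print(count / trials)                               # ~0.594
print(1 - math.exp(-lam * z0) * (1 + lam * z0))     # 0.5939...
```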
18.05 Lecture 13
March 4, 2005
Functions of random variables. If (X, Y) has joint p.d.f. f(x, y) and Z = X + Y, the p.d.f. of Z is f_Z(z) = ∫ f(x, z − x)dx; if X and Y are independent, f_Z(z) = ∫ f_1(x)f_2(z − x)dx.
Example: X, Y independent, uniform on [0, 1] (X, Y ~ U[0, 1]), Z = X + Y.
p.d.f.s of X, Y: f_1(x) = I(0 ≤ x ≤ 1), f_2(y) = I(0 ≤ y ≤ 1), so f_2(z − x) = I(0 ≤ z − x ≤ 1) and
f_Z(z) = ∫ I(0 ≤ x ≤ 1) I(0 ≤ z − x ≤ 1)dx.
Limits: 0 ≤ x ≤ 1 and z − 1 ≤ x ≤ z; both must hold, so consider the cases for values of z:
Case 1 (z ≤ 0): f_Z(z) = 0.
Case 2 (0 ≤ z ≤ 1): f_Z(z) = ∫_0^z 1 dx = z.
Case 3 (1 ≤ z ≤ 2): f_Z(z) = ∫_{z−1}^1 1 dx = 2 − z.
Case 4 (z ≥ 2): f_Z(z) = 0.
The random variables are most likely to add up near 1, the peak of the (triangular) graph of f_Z.
Example: multiplication of random variables. X ≥ 0, Y ≥ 0, Z = XY (Z is positive). First look at the c.d.f.:
P(Z ≤ z) = ∫∫_{xy ≤ z} f(x, y)dxdy = ∫_0^∞ ∫_0^{z/x} f(x, y)dy dx,
so the p.d.f. is f_Z(z) = (d/dz)P(Z ≤ z) = ∫_0^∞ f(x, z/x) (1/x) dx.
Example: ratio of random variables. Z = X/Y (all positive):
P(Z ≤ z) = P(X ≤ zY) = ∫∫_{x ≤ zy} f(x, y)dxdy = ∫_0^∞ ∫_0^{zy} f(x, y)dx dy,
p.d.f.: f_Z(z) = ∫_0^∞ f(zy, y) y dy.
In general, look at the c.d.f. and express the event in terms of x and y.
Example: X_1, X_2, ..., X_n independent with the same distribution (same c.d.f. F), f(x) = F′(x) the p.d.f. of each X_i, P(X_i ≤ x) = F(x).
Y = maximum among X_1, X_2, ..., X_n:
P(Y ≤ y) = P(max(X_1, ..., X_n) ≤ y) = P(X_1 ≤ y, X_2 ≤ y, ..., X_n ≤ y).
Now use the definition of independence to factor:
= P(X_1 ≤ y)P(X_2 ≤ y) ⋯ P(X_n ≤ y) = F(y)^n.
p.d.f. of Y: f_Y(y) = (d/dy)F(y)^n = nF(y)^{n−1}F′(y) = nF(y)^{n−1}f(y).
Y = min(X_1, ..., X_n): P(Y ≤ y) = P(min(X_1, ..., X_n) ≤ y). Instead of an intersection, this is a union; ask instead whether all are greater than y:
= 1 − P(min(X_1, ..., X_n) > y)
= 1 − P(X_1 > y, ..., X_n > y)
= 1 − P(X_1 > y)P(X_2 > y) ⋯ P(X_n > y)
= 1 − P(X_1 > y)^n
= 1 − (1 − P(X_1 ≤ y))^n
= 1 − (1 − F(y))^n.
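A simulation check of the max/min formulas for the uniform case (an illustration added here, where F(y) = y for U[0, 1]):

```python
import random

# Simulation check of P(max <= y) = F(y)^n and P(min <= y) = 1 - (1 - F(y))^n
# for n i.i.d. U[0,1] variables, where F(y) = y.
n, y0, trials = 5, 0.7, 200_000
max_hits = min_hits = 0
for _ in range(trials):
    xs = [random.random() for _ in range(n)]
    max_hits += (max(xs) <= y0)
    min_hits += (min(xs) <= y0)
print(max_hits / trials, y0 ** n)                 # ~0.168
print(min_hits / trials, 1 - (1 - y0) ** n)       # ~0.998
```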
Multivariate change of variables: X = (X_1, X_2, ..., X_n), Y = (Y_1, Y_2, ..., Y_n) = r(X), i.e.
Y_1 = r_1(X_1, ..., X_n), Y_2 = r_2(X_1, ..., X_n), ..., Y_n = r_n(X_1, ..., X_n).
Suppose the map r has an inverse s, so X = s(Y) = r⁻¹(Y). If g(y) is the joint p.d.f. of Y, then P(Y ∈ A) = ∫_A g(y)dy, and
P(Y ∈ A) = P(r(X) ∈ A) = P(X ∈ s(A)) = ∫_{s(A)} f(x)dx = ∫_A f(s(y)) |J| dy,
using the change of variable x = s(y), where J is the Jacobian determinant
J = det [ ∂s_i/∂y_j ], i, j = 1, ..., n.
Therefore g(y) = f(s(y)) |J|.
Example: (X_1, X_2) with joint p.d.f. f(x_1, x_2) = 4x_1x_2 for 0 ≤ x_1 ≤ 1, 0 ≤ x_2 ≤ 1; 0 otherwise.
Let Y_1 = r_1(X_1, X_2) = X_1/X_2 and Y_2 = r_2(X_1, X_2) = X_1X_2. The inverse map is
X_1 = √(Y_1Y_2) = s_1(Y_1, Y_2), X_2 = √(Y_2/Y_1) = s_2(Y_1, Y_2).
J = det [ ∂s_1/∂y_1  ∂s_1/∂y_2 ; ∂s_2/∂y_1  ∂s_2/∂y_2 ] = det [ (1/2)√(y_2/y_1)  (1/2)√(y_1/y_2) ; −(1/2)√(y_2)/y_1^{3/2}  1/(2√(y_1y_2)) ] = 1/(2y_1).
But keep in mind the intervals for non-zero values (0 ≤ √(y_1y_2) ≤ 1 and 0 ≤ √(y_2/y_1) ≤ 1). The joint p.d.f. of (Y_1, Y_2) is
g(y_1, y_2) = 4√(y_1y_2)·√(y_2/y_1)·|1/(2y_1)| = 2y_2/y_1 for y_1y_2 ≤ 1, 0 ≤ y_2 ≤ y_1; 0 otherwise.
18.05 Lecture 14
March 7, 2005
Linear transformations of random vectors: Y = r(X) with
(y_1, ..., y_n)ᵀ = A (x_1, ..., x_n)ᵀ,
where A is an n × n matrix. If det A ≠ 0 then X = A⁻¹Y; write B = A⁻¹ = (b_ij), so x_i = b_{i1}y_1 + ... + b_{in}y_n.
The Jacobian of the inverse map is J = det B (the b_ij are the partial derivatives of s_i with respect to y_j), and det B = det A⁻¹ = 1/det A. Therefore the p.d.f. of Y is
g(y) = f(A⁻¹y) / |det A|.
Example: X = (X_1, X_2) with p.d.f. f(x_1, x_2) = cx_1x_2 for 0 ≤ x_1 ≤ 1, 0 ≤ x_2 ≤ 1; 0 otherwise. To make the integral equal 1, c = 4.
Y_1 = X_1 + 2X_2, Y_2 = 2X_1 + X_2, so A = (1 2; 2 1) and det A = 1 − 4 = −3.
Calculate the inverse functions:
X_1 = (2Y_2 − Y_1)/3, X_2 = (2Y_1 − Y_2)/3.
New joint p.d.f.:
g(y_1, y_2) = 4 · (2y_2 − y_1)/3 · (2y_1 − y_2)/3 · (1/3) for 0 ≤ (2y_2 − y_1)/3 ≤ 1 and 0 ≤ (2y_1 − y_2)/3 ≤ 1; 0 otherwise.
Simplified:
g(y_1, y_2) = (4/27)(2y_2 − y_1)(2y_1 − y_2) for 0 ≤ 2y_2 − y_1 ≤ 3, 0 ≤ 2y_1 − y_2 ≤ 3; 0 otherwise.
Linear transformation distorts the graph from a square to a parallelogram. Note: From Lecture 13, when min() and max() functions were introduced, such functions
describe engines in series (min) and parallel (max).
When in series, the length of time a device will function is equal to the minimum life
in all the engines (weakest link).
When in parallel, this is avoided as a device can function as long as one engine functions.
Review of problems from PSet 4 for the upcoming exam (see solutions for more details):
Problem 1: f(x) = ce^{−2x} for x ≥ 0; 0 otherwise.
Find c by integrating over the range and setting the integral equal to 1:
1 = ∫_0^∞ ce^{−2x}dx = −(c/2)e^{−2x}|_0^∞ = c/2, so c = 2.
P(1 ≤ X ≤ 2) = ∫_1^2 2e^{−2x}dx = e^{−2} − e^{−4}.
Another problem finds the c.d.f. of a random variable Y defined from X; work case by case over y:
y < 0: P(Y ≤ y) = P(∅) = 0
0 ≤ y ≤ 1: P(Y ≤ y) = P(0 ≤ X ≤ 1) = 1/5
1 < y ≤ 3: P(Y ≤ y) = P(X ≤ y) = y/5
3 < y ≤ 5: P(Y ≤ y) = P(X ≤ 3) = 3/5
y > 5: P(Y ≤ y) = 1
These cases over the range of Y give its c.d.f.
Problem 8: 0 ≤ x ≤ 3, 0 ≤ y ≤ 4, joint c.d.f. F(x, y) = (1/156)xy(x² + y).
P(1 ≤ X ≤ 2, 1 ≤ Y ≤ 2) = F(2, 2) − F(2, 1) − F(1, 2) + F(1, 1)
This is the rectangle-probability formula. Alternatively, you can find the p.d.f. and integrate (more complicated).
c.d.f. of Y: P(Y ≤ y) = P(X < ∞, Y ≤ y) = P(X ≤ 3, Y ≤ y) (based on the domain of the joint c.d.f.), so
P(Y ≤ y) = F(3, y) = (1/156)·3y(9 + y) for 0 ≤ y ≤ 4.
Must also mention: for y ≤ 0, P(Y ≤ y) = 0; for y ≥ 4, P(Y ≤ y) = 1.
Find the joint p.d.f. of X and Y:
f(x, y) = ∂²F(x, y)/∂x∂y = (1/156)(3x² + 2y) for 0 ≤ x ≤ 3, 0 ≤ y ≤ 4; 0 otherwise.
(To recover probabilities from the p.d.f., integrate: P(X ≤ x, Y ≤ y) = ∫_0^y ∫_0^x f(u, v)du dv.)
Review for Exam 1 — Practice Test 1:
1. In the set of all green envelopes, only 1 card can be green; similarly, in the set of red envelopes, only 1 card can be red. The sample space has 10! ways to put the cards into the envelopes, treating each as distinct. You can't have two cards of the same color matching, as that would give 4 matches in total. Degrees of freedom: which envelope to choose (5 · 5) and which card to select (5 · 5); then arrange the remaining red cards in the green envelopes (4!) and the remaining green cards in the red envelopes (4!):
P = 5⁴ (4!)² / 10!
2. Bayes formula:
P(fair|HHH) = P(HHH|fair)P(fair) / [P(HHH|fair)P(fair) + P(HHH|unfair)P(unfair)] = (0.5³)(0.5) / [(0.5³)(0.5) + (1)(0.5)]
3. f_1(x) = 2x·I(0 < x < 1), f_2(x) = 3x²·I(0 < x < 1); Y = 1 or 2 with P(Y = 1) = 0.5, P(Y = 2) = 0.5.
f(x, y) = 0.5·I(y = 1)·2x·I(0 < x < 1) + 0.5·I(y = 2)·3x²·I(0 < x < 1)
f(x) = 0.5·2x·I(0 < x < 1) + 0.5·3x²·I(0 < x < 1) = (x + 1.5x²)·I(0 < x < 1)
P(Y = 1 | X = 1/4) = (1/2)f_1(1/4) / [(1/2)f_1(1/4) + (1/2)f_2(1/4)]
4. f(z) = 2e^{−2z}·I(z > 0), T = 1/Z; we know t > 0.
P(T ≤ t) = P(1/Z ≤ t) = P(Z ≥ 1/t) = ∫_{1/t}^∞ 2e^{−2z}dz = e^{−2/t},
p.d.f.: f(t) = (d/dt)e^{−2/t} = (2/t²)e^{−2/t} for t > 0 (0 otherwise).
Here T = r(Z) with inverse Z = s(T) = 1/T.
5. f(x) = e^{−x}·I(x > 0) for X and Y independently, so the joint p.d.f. is
f(x, y) = e^{−x}·I(x > 0)·e^{−y}·I(y > 0) = e^{−(x+y)}·I(x > 0, y > 0).
Let U = X/(X + Y), V = X + Y.
Step 1 - check the ranges of the new random variables: 0 < V < ∞, 0 < U < 1.
Step 2 - account for the change of variables: X = UV, Y = V − UV = V(1 − U). Jacobian:
J = det [ ∂X/∂U  ∂X/∂V ; ∂Y/∂U  ∂Y/∂V ] = det [ V  U ; −V  1 − U ] = V(1 − U) + UV = V.
g(u, v) = f(uv, v(1 − u))·|v|·I(uv > 0, v(1 − u) > 0) = e^{−v} v·I(v > 0, 0 < u < 1).
Problem Set #5 (practice pset, see solutions for details): p. 175 #4
f(x_1, x_2) = (x_1 + x_2)·I(0 < x_1 < 1, 0 < x_2 < 1), Y = X_1X_2 (so 0 < Y < 1).
First look at the c.d.f.: P(Y ≤ y) = P(X_1X_2 ≤ y) = ∫∫_{x_1x_2 ≤ y} f(x_1, x_2)dx_1dx_2.
Due to the complexity of the limits, you can integrate the region in pieces, or you can find the complement, which is easier since it needs only one set of limits:
P(Y ≤ y) = 1 − ∫∫_{x_1x_2 > y} f(x_1, x_2)dx_1dx_2 = 1 − ∫_y^1 ∫_{y/x_1}^1 (x_1 + x_2)dx_2 dx_1.
c.d.f.: P(Y ≤ y) = 0 for y < 0; 2y − y² for 0 < y < 1; 1 for y > 1.
p.d.f.: g(y) = 2(1 − y) for y ∈ (0, 1); 0 otherwise.
In another problem, X has p.d.f. f(x) = x/2 on [0, 2] and Y = X(2 − X), which varies from 0 to 1 as X varies from 0 to 2. Look at the c.d.f.:
P(Y ≤ y) = P(X(2 − X) ≤ y) = P(X² − 2X + 1 ≥ 1 − y) = P((1 − X)² ≥ 1 − y)
= P(|1 − X| ≥ √(1 − y)) = P(1 − X ≥ √(1 − y) or 1 − X ≤ −√(1 − y))
= P(X ≤ 1 − √(1 − y) or X ≥ 1 + √(1 − y))
= P(0 ≤ X ≤ 1 − √(1 − y)) + P(1 + √(1 − y) ≤ X ≤ 2)
= ∫_0^{1−√(1−y)} (x/2)dx + ∫_{1+√(1−y)}^2 (x/2)dx = 1 − √(1 − y), for 0 ≤ y ≤ 1 (0 for y < 0; 1 for y > 1).
p.d.f.: g(y) = (d/dy)[1 − √(1 − y)] = 1/(2√(1 − y)) for 0 ≤ y ≤ 1; 0 otherwise.
** End of Lecture 15
Expectation. For a discrete random variable X taking values s_i with probabilities p_i, E(X) = Σ s_i p_i.
Example: a fair die takes values 1, 2, 3, 4, 5, 6, each with probability 1/6:
E = 1·(1/6) + ... + 6·(1/6) = 3.5.
Consider each p_i as a weight on a horizontal bar; the expectation is the center of gravity of the bar.
If X is continuous with p.d.f. f(x), then E(X) = ∫ x f(x)dx.
Example: X uniform on [0, 1], E(X) = ∫_0^1 x·1 dx = 1/2.
Consider Y = r(X); then E(Y) = Σ_x r(x)f(x) or ∫ r(x)f(x)dx. Indeed, the p.f. of Y is g(y) = Σ_{x: r(x)=y} f(x), so
E(Y) = Σ_y y·g(y) = Σ_y y Σ_{x: r(x)=y} f(x) = Σ_y Σ_{x: r(x)=y} r(x)f(x) = Σ_x r(x)f(x),
where we can drop the grouping by y since the final sum no longer references y.
Example: X uniform on [0, 1], E(X²) = ∫_0^1 x²·1 dx = 1/3.
X_1, ..., X_n random variables with joint p.f. or p.d.f. f(x_1, ..., x_n):
E(r(X_1, ..., X_n)) = ∫ r(x_1, ..., x_n) f(x_1, ..., x_n) dx_1 ⋯ dx_n.
Example: the Cauchy distribution has p.d.f. f(x) = 1/(π(1 + x²)); it is a valid p.d.f. since
∫_{−∞}^∞ dx/(π(1 + x²)) = (1/π) tan⁻¹(x)|_{−∞}^∞ = 1.
But the expectation does not exist (is infinite), since
∫_0^∞ x/(π(1 + x²)) dx = (1/(2π)) ln(1 + x²)|_0^∞ = ∞.
Properties of expectation:
1) E(aX + b) = aE(X) + b. Proof: E(aX + b) = ∫(ax + b)f(x)dx = a∫xf(x)dx + b∫f(x)dx = aE(X) + b.
2) E(X_1 + X_2 + ... + X_n) = E(X_1) + E(X_2) + ... + E(X_n). Proof for two variables:
E(X_1 + X_2) = ∫∫(x_1 + x_2)f(x_1, x_2)dx_1dx_2 = ∫∫x_1 f(x_1, x_2)dx_1dx_2 + ∫∫x_2 f(x_1, x_2)dx_1dx_2
= ∫x_1 f_1(x_1)dx_1 + ∫x_2 f_2(x_2)dx_2 = E(X_1) + E(X_2).
Example: toss a coin n times; let X_i = 1 if toss i is T and X_i = 0 if toss i is H. Then number of tails = X_1 + X_2 + ... + X_n and
E(number of tails) = E(X_1) + ... + E(X_n), with E(X_i) = 1·P(X_i = 1) + 0·P(X_i = 0) = p, the probability of tails,
so the expectation is p + p + ... + p = np. This is natural, because you expect a fraction p of the n tosses to be tails.
Alternatively, Y = number of tails is binomial: P(Y = k) = (n choose k)p^k(1 − p)^{n−k}, and
E(Y) = Σ_{k=0}^n k (n choose k)p^k(1 − p)^{n−k} = np,
which is more difficult to see through the definition; it is better to use the sum-of-expectations method.
If two functions satisfy h(x) ≤ g(x) for all x ∈ R, then E(h(X)) ≤ E(g(X)): E(g(X) − h(X)) = ∫(g(x) − h(x))f(x)dx ≥ 0, since f(x) ≥ 0 and g(x) − h(x) ≥ 0.
If a ≤ X ≤ b, then a ≤ E(X) ≤ b.
For a set A ⊆ R, the indicator Y = I(X ∈ A) equals 1 with probability P(X ∈ A) and 0 with probability P(X ∉ A), so
E(I(X ∈ A)) = 1·P(X ∈ A) + 0·P(X ∉ A) = P(X ∈ A):
think of the expectation of an indicator as the probability that the event happens.
Chebyshev's Inequality: suppose X ≥ 0 and consider t > 0; then
P(X ≥ t) ≤ E(X)/t.
Proof: E(X) = E(X·I(X < t)) + E(X·I(X ≥ t)) ≥ E(X·I(X ≥ t)) ≥ E(t·I(X ≥ t)) = tP(X ≥ t).
** End of Lecture 16
Properties of Expectation. Law of Large Numbers.
E(X_1 + ... + X_n) = E(X_1) + ... + E(X_n).
Matching problem (n envelopes, n letters): what is the expected number of letters in correct envelopes?
Let Y = number of matches and X_i = 1 if letter i matches, 0 otherwise, so Y = X_1 + ... + X_n.
E(Y) = E(X_1) + ... + E(X_n), but E(X_i) = 1·P(X_i = 1) + 0·P(X_i = 0) = P(X_i = 1) = 1/n.
Therefore the expected number of matches is E(Y) = n·(1/n) = 1.
If X_1, ..., X_n are independent, then E(X_1 ⋯ X_n) = E(X_1) ⋯ E(X_n). As with the sum property, we prove it for two variables: the joint p.f. or p.d.f. factors, f(x_1, x_2) = f_1(x_1)f_2(x_2), so
E(X_1X_2) = ∫∫ x_1x_2 f(x_1, x_2)dx_1dx_2 = ∫∫ x_1x_2 f_1(x_1)f_2(x_2)dx_1dx_2 = (∫ f_1(x_1)x_1dx_1)(∫ f_2(x_2)x_2dx_2) = E(X_1)E(X_2).
For a discrete random variable X taking values 0, 1, 2, 3, ...:
E(X) = Σ_{n=0}^∞ n P(X = n); for n = 0 the contribution is 0, for n = 1 it is P(1), for n = 2 it is 2P(2), for n = 3 it is 3P(3), and so on. Regrouping,
E(X) = Σ_{n=1}^∞ P(X ≥ n).
Example: X = number of trials until the first success, with P(success) = p, P(failure) = 1 − p = q.
E(X) = Σ_{n=1}^∞ P(X ≥ n) = Σ_{n=1}^∞ (1 − p)^{n−1} = 1 + q + q² + ... = 1/(1 − q) = 1/p.
The formula P(X ≥ n) = (1 − p)^{n−1} is based on the reasoning that the first n − 1 trials resulted in failure. This is much easier than the original formula Σ_{n=1}^∞ n P(X = n) = Σ_{n=1}^∞ n(1 − p)^{n−1}p.
Variance. Definition: Var(X) = E(X − E(X))² = σ²(X).
It is a measure of the deviation from the expectation (mean):
Var(X) = ∫(x − E(X))² f(x)dx — a "moment of inertia" about the mean.
Standard deviation: σ(X) = √Var(X).
Var(aX + b) = a²Var(X), so σ(aX + b) = |a|σ(X). Proof by definition:
E((aX + b) − E(aX + b))² = E(aX + b − aE(X) − b)² = a²E(X − E(X))² = a²Var(X).
Property: Var(X) = E(X²) − (E(X))². Proof:
Var(X) = E(X − E(X))² = E(X² − 2X·E(X) + (E(X))²) = E(X²) − 2E(X)·E(X) + (E(X))² = E(X²) − (E(X))².
Example: X ~ U[0, 1]. E(X) = ∫_0^1 x·1 dx = 1/2, E(X²) = ∫_0^1 x²·1 dx = 1/3, so
Var(X) = 1/3 − (1/2)² = 1/12.
If X_1, ..., X_n are independent, then Var(X_1 + ... + X_n) = Var(X_1) + ... + Var(X_n). Proof for two:
Var(X_1 + X_2) = E(X_1 + X_2 − E(X_1 + X_2))² = E((X_1 − EX_1) + (X_2 − EX_2))²
= E(X_1 − EX_1)² + E(X_2 − EX_2)² + 2E[(X_1 − EX_1)(X_2 − EX_2)]
= Var(X_1) + Var(X_2) + 2E(X_1 − EX_1)·E(X_2 − EX_2) = Var(X_1) + Var(X_2),
where the cross term factors (and vanishes) by independence of X_1 and X_2.
Property: Var(a_1X_1 + ... + a_nX_n + b) = a_1²Var(X_1) + ... + a_n²Var(X_n) for independent X_i.
Example: binomial distribution B(n, p), P(X = k) = (n choose k)p^k(1 − p)^{n−k}. Write X = X_1 + ... + X_n, where X_i = 1 if trial i is a success and 0 if it is a failure. Then Var(X) = Σ_{i=1}^n Var(X_i), with
Var(X_i) = E(X_i²) − (E(X_i))², E(X_i) = 1·p + 0·(1 − p) = p, E(X_i²) = 1²·p + 0²·(1 − p) = p,
so Var(X_i) = p − p² = p(1 − p) and Var(X) = np(1 − p) = npq.
Law of Large Numbers: X_1, X_2, ..., X_n independent, identically distributed. Let
S_n = (X_1 + ... + X_n)/n → E(X_1) as n → ∞.
Take ε > 0, small; then P(|S_n − E(X_1)| > ε) → 0 as n → ∞. By Chebyshev's inequality:
P(|S_n − EX_1| > ε) ≤ (1/ε²)E(S_n − EX_1)² = (1/ε²)Var((X_1 + ... + X_n)/n)
= (1/(ε²n²))(Var(X_1) + ... + Var(X_n)) = nVar(X_1)/(ε²n²) = Var(X_1)/(ε²n) → 0
for large n.
** End of Lecture 17
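A minimal simulation of the law of large numbers (my own illustration with U[0, 1] samples):

```python
import random

# Simulation of the law of large numbers: sample means of i.i.d. U[0,1]
# variables approach E(X1) = 1/2, and the deviation shrinks as n grows.
random.seed(0)
for n in (10, 1000, 100000):
    s = sum(random.random() for _ in range(n)) / n
    print(n, s, abs(s - 0.5))
```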
18.05 Lecture 18
March 18, 2005
Law of Large Numbers: X_1, ..., X_n i.i.d. (independent, identically distributed),
(X_1 + ... + X_n)/n → E(X_1) as n → ∞.
This can be used for functions of random variables as well: consider Y_i = r(X_i), also i.i.d.; then
(r(X_1) + ... + r(X_n))/n → E(Y_1) = E(r(X_1)) as n → ∞.
Relevance for statistics: for data points x_i, as n → ∞ the average converges to the unknown expected value of the distribution, which often contains a lot (or all) of the information about the distribution.
Example: conduct a poll for 2 candidates. The unknown p ∈ [0, 1] is what we're looking for. Poll: choose n people randomly, X_1, ..., X_n with P(X_i = 1) = p, P(X_i = 0) = 1 − p. Then E(X_1) = 1·p + 0·(1 − p) = p, so
(X_1 + ... + X_n)/n → p as n → ∞.
Moment generating function (m.g.f.): φ(t) = E(e^{tX}). Expanding the exponential,
φ(t) = E(Σ_{k=0}^∞ (tX)^k/k!) = Σ_{k=0}^∞ (t^k/k!) E(X^k),
so the coefficients of the m.g.f. encode all the moments E(X^k).
Example: for the exponential distribution E(λ),
φ(t) = E(e^{tX}) = ∫_0^∞ e^{tx} λe^{−λx}dx = λ/(λ − t) = 1/(1 − t/λ) = Σ_{k=0}^∞ (t/λ)^k
for t < λ, using Σ_{k=0}^∞ x^k = 1/(1 − x) when |x| < 1. Comparing coefficients of t^k: E(X^k)/k! = 1/λ^k, so E(X^k) = k!/λ^k.
The moment generating function completely describes the distribution.
E(X^k) = ∫ x^k f(x)dx; if f(x) is unknown, the moments give a system of equations for f, and a set of moments corresponds to a unique distribution: the m.g.f. uniquely determines the distribution.
Example: X_1, X_2 from E(λ), Y = X_1 + X_2. To find the distribution of the sum we could use the convolution formula, but it is easier to find the m.g.f. of the sum:
E(e^{tY}) = E(e^{t(X_1+X_2)}) = E(e^{tX_1}e^{tX_2}) = E(e^{tX_1})E(e^{tX_2}) = (λ/(λ − t))².
Recall the exponential distribution E(λ): f(x) = λe^{−λx} for x ≥ 0, 0 for x < 0, and E(X) = 1/λ. This distribution describes the life span of quality products; λ = 1/E(X), so if λ is small, the life span is large.
Median: m ∈ R such that P(X ≤ m) ≥ 1/2 and P(X ≥ m) ≥ 1/2.
(There are times in discrete distributions when the probability cannot ever equal exactly 0.5.) When you exclude the point itself, P(X > m) ≤ 1/2, and P(X ≤ m) + P(X > m) = 1.
The median is not always uniquely defined: it can be an entire interval where no point masses occur. For a continuous distribution you can define m by P(X > m) = P(X < m) = 1/2, but there are still cases in which the median is not unique!
The mean will be pulled toward the tail of a skewed p.d.f. relative to the median.
Mean as a minimizer: find a ∈ R minimizing E(X − a)². Setting the derivative to zero, (d/da)E(X − a)² = −2E(X − a) = 0, so E(X) − a = 0 and a = E(X): the expectation minimizes the expected squared deviation.
Median as a minimizer: find a ∈ R minimizing E|X − a|. Claim: E|X − a| ≥ E|X − m| where m is a median. Indeed, for a > m,
E(|X − a| − |X − m|) = ∫(|x − a| − |x − m|)f(x)dx
≥ ∫_{x ≤ m}(a − m)f(x)dx + ∫_{x > m}(m − a)f(x)dx = (a − m)(P(X ≤ m) − P(X > m)) ≥ 0,
since both (a − m) and the difference in probabilities are non-negative (the case a < m is symmetric). The expected absolute deviation is minimized by the median.
** End of Lecture 18
Example: X takes values {−1, 0, 1} with equal probabilities {1/3, 1/3, 1/3}, and Y = X².
X and Y are dependent, but they are uncorrelated: Cov(X, Y) = E(X³) − E(X)E(X²), but E(X) = 0 and E(X³) = E(X) = 0, so the covariance is 0 even though X and Y are dependent.
Also, the correlation is always between −1 and 1:
|Cov(X, Y)| ≤ σ_Xσ_Y, so −1 ≤ ρ(X, Y) ≤ 1.
When is the correlation equal to 1 or −1? |ρ(X, Y)| = 1 only when Y − EY = c(X − EX), i.e. Y = aX + b for some constants a, b (this occurs when your data points lie on a straight line). If Y = aX + b:
ρ(X, Y) = [E(aX² + bX) − EX·E(aX + b)] / (σ_X·|a|σ_X) = aVar(X)/(|a|Var(X)) = sign(a).
If a is positive, the correlation is 1: X and Y are completely positively correlated. If a is negative, the correlation is −1: X and Y are completely negatively correlated.
Looking at the distribution of points on Y = X², there is NO linear dependence, and the correlation is 0. However, if Y = X² + cX, then some linear dependence is introduced in the skewed graph.
Property 3: Var(X + Y) = E(X + Y − EX − EY)² = E((X − EX) + (Y − EY))²
= E(X − EX)² + 2E[(X − EX)(Y − EY)] + E(Y − EY)² = Var(X) + Var(Y) + 2Cov(X, Y).
Conditional Expectation: for a random pair (X, Y), what is the average value of Y given that you know X?
If f(x, y) is the joint p.d.f. or p.f., then f(y|x) is the conditional p.d.f. or p.f., and the conditional expectation is
E(Y|X = x) = ∫ y f(y|x)dy (or Σ_y y f(y|x) in the discrete case).
As a function of X, E(Y|X) = h(X) is itself a random variable.
Property 4: E(E(Y|X)) = E(Y). Proof:
E(E(Y|X)) = E(h(X)) = ∫ h(x)f(x)dx = ∫(∫ y f(y|x)dy)f(x)dx = ∫∫ y f(y|x)f(x)dydx = ∫∫ y f(x, y)dydx
= ∫ y (∫ f(x, y)dx)dy = ∫ y f(y)dy = E(Y).
Property 5: E(a(X)Y|X) = a(X)E(Y|X). See text for proof.
Summary of Common Distributions:
1) Bernoulli distribution B(p), parameter p ∈ [0, 1]: possible values X ∈ {0, 1}, f(x) = p^x(1 − p)^{1−x}, so P(1) = p, P(0) = 1 − p. E(X) = p, Var(X) = p(1 − p).
2) Binomial distribution B(n, p), n repetitions of Bernoulli: X ∈ {0, 1, ..., n}, f(x) = (n choose x)p^x(1 − p)^{n−x}. E(X) = np, Var(X) = np(1 − p).
3) Exponential distribution E(λ), parameter λ > 0: X ∈ [0, ∞), p.d.f. f(x) = λe^{−λx} for x ≥ 0, 0 otherwise. E(X) = 1/λ, E(X^k) = k!/λ^k, Var(X) = 2/λ² − (1/λ)² = 1/λ².
4) Poisson distribution Π(λ): X ∈ {0, 1, 2, ...} with p.f. f(k) = P(X = k) = λ^k e^{−λ}/k!, which sums to 1 since Σ_{k≥0} λ^k/k! = e^λ.
Its m.g.f. is φ(t) = E(e^{tX}) = Σ_{k≥0} e^{tk}λ^k e^{−λ}/k! = e^{λ(e^t − 1)}. Moments from the m.g.f.:
E(X) = φ′(0) = λe^{λ(e^t − 1)}e^t |_{t=0} = λ,
E(X²) = φ″(0) = (λe^{λ(e^t − 1)+t})′|_{t=0} = λe^{λ(e^t − 1)+t}(λe^t + 1)|_{t=0} = λ(λ + 1),
Var(X) = E(X²) − (E(X))² = λ(λ + 1) − λ² = λ.
If X_1 ~ Π(λ_1), X_2 ~ Π(λ_2), ..., X_n ~ Π(λ_n) are all independent and Y = X_1 + ... + X_n, find the m.g.f. of Y:
φ_Y(t) = E(e^{tY}) = E(e^{t(X_1+...+X_n)}) = E(e^{tX_1}) ⋯ E(e^{tX_n}) = e^{λ_1(e^t−1)} ⋯ e^{λ_n(e^t−1)} = e^{(λ_1+...+λ_n)(e^t−1)},
which is the m.g.f. of Π(λ_1 + ... + λ_n): the sum of independent Poissons is Poisson.
Poisson approximation (counts of rare events): a region [0, T] is split into n sections, each of size |T|/n, and the count on each section is X_1, ..., X_n.
1) E(X_i) = λ|T|/n = 0·P(X_i = 0) + 1·P(X_i = 1) + 2·P(X_i = 2) + ..., but the contribution of values above 1 is very small;
2) X_1, ..., X_n are independent;
3) P(X_i ≥ 2) is small if n is large, so P(X_i = 1) ≈ λ|T|/n and P(X_i = 0) ≈ 1 − λ|T|/n.
Therefore P(count(T) = k) = P(X_1 + ... + X_n = k) ≈ B(n, λ|T|/n) → (λ|T|)^k e^{−λ|T|}/k!.
5.6 - Normal Distribution
The standard normal N(0, 1) has p.d.f. f(x) = (1/√(2π)) e^{−x²/2}. To check the normalization, compute
(∫_{−∞}^∞ e^{−x²/2}dx)² = ∫∫ e^{−(x²+y²)/2}dxdy = ∫_0^{2π}∫_0^∞ e^{−r²/2} r dr dθ = 2π ∫_0^∞ e^{−r²/2} d(r²/2) = 2π,
so ∫_{−∞}^∞ (1/√(2π)) e^{−x²/2}dx = 1.
m.g.f. of the standard normal: φ(t) = e^{t²/2}. Proof — simplify the integral by completing the square:
φ(t) = ∫ e^{tx}(1/√(2π))e^{−x²/2}dx = (1/√(2π))∫ e^{tx − x²/2}dx = (1/√(2π))∫ e^{t²/2 − t²/2 + tx − x²/2}dx = e^{t²/2}(1/√(2π))∫ e^{−(x−t)²/2}dx.
Then perform the change of variables y = x − t:
= e^{t²/2}(1/√(2π))∫ e^{−y²/2}dy = e^{t²/2}·1 = e^{t²/2}.
Use the m.g.f. to find the expectation of X and X², and therefore Var(X):
E(X) = φ′(0) = t e^{t²/2}|_{t=0} = 0,
E(X²) = φ″(0) = (t² e^{t²/2} + e^{t²/2})|_{t=0} = 1, so Var(X) = 1.
To go from the standard normal N(0, 1) to a general normal N(μ, σ), take Y = σX + μ: the p.d.f. is (1/(σ√(2π)))e^{−(y−μ)²/(2σ²)}, the peak is located at the new mean μ, and the points of inflection occur a distance σ away from μ.
Moment generating function of N(μ, σ): with Y = σX + μ,
φ_Y(t) = E(e^{tY}) = E(e^{t(σX+μ)}) = e^{μt}E(e^{(σt)X}) = e^{μt}e^{(σt)²/2} = e^{μt + σ²t²/2}.
Note: if X_1 ~ N(μ_1, σ_1), ..., X_n ~ N(μ_n, σ_n) are independent and Y = X_1 + ... + X_n, the distribution of Y follows from the m.g.f.:
E(e^{tY}) = E(e^{tX_1}) ⋯ E(e^{tX_n}) = e^{μ_1t + σ_1²t²/2} ⋯ e^{μ_nt + σ_n²t²/2} = e^{(Σμ_i)t + (Σσ_i²)t²/2},
which is the m.g.f. of a normal distribution with mean Σμ_i and variance Σσ_i².
Flip 10,000 coins; you expect 5,000 tails, but the deviation can be sizable: perhaps 4,950-5,050 is typical. Let X_i = 1 (tail) or 0 (head). Then
(number of tails)/n = (X_1 + ... + X_n)/n → E(X_1) = 1/2 by the LLN, and Var(X_1) = (1/2)(1 − 1/2) = 1/4.
But how do you describe the deviations? Let X_1, X_2, ..., X_n be independent with some distribution P, μ = E(X_1), σ² = Var(X_1), and x̄ = (1/n)Σ_{i=1}^n X_i → E(X_1) = μ. The deviation x̄ − μ is on the order of 1/√n, and
√n(x̄ − μ)/σ behaves like a standard normal: it is approximately N(0, 1) for large n, so
P(√n(x̄ − μ)/σ ≤ x) → P(standard normal ≤ x) = N(0, 1)(−∞, x).
This is useful in statistics to describe outcomes of an experiment as likely or unlikely.
Example: for the 10,000 coin flips, a deviation of 0.01 in the proportion of tails corresponds to
√n(x̄ − μ)/σ = 100·(0.01)/(1/2) = 2,
and the standard normal table gives a tail probability of about 0.023 beyond 2.
Tabulated values always give, for positive x, the area to the left; in the table, look up −2 by finding the value for 2 and taking the complement.
** End of Lecture 21
Central Limit Theorem: X_1, ..., X_n i.i.d., x̄ = (1/n)(X_1 + ... + X_n), μ = E(X_1), σ² = Var(X_1). Then
√n(x̄ − μ)/σ → N(0, 1) in distribution as n → ∞.
You can use the standard normal distribution to describe your data: writing Y for a standard normal,
√n(x̄ − μ)/σ ≈ Y, i.e. x̄ ≈ μ + σY/√n.
This expands the law of large numbers: it tells you exactly how much the average and the expected value should differ.
Sketch of proof: write
√n(x̄ − μ)/σ = (1/√n)((X_1 − μ)/σ + ... + (X_n − μ)/σ) = (1/√n)(Z_1 + ... + Z_n), where Z_i = (X_i − μ)/σ, E(Z_i) = 0, Var(Z_i) = 1.
Consider the m.g.f. and see that it is very similar to that of the standard normal:
E e^{t(Z_1+...+Z_n)/√n} = E(e^{tZ_1/√n} ⋯ e^{tZ_n/√n}) = (E e^{tZ_1/√n})^n.
Expanding, E e^{sZ_1} = 1 + sE(Z_1) + (s²/2)E(Z_1²) + (s³/6)E(Z_1³) + ... = 1 + s²/2 + (s³/6)E(Z_1³) + ...,
so with s = t/√n,
(E e^{tZ_1/√n})^n ≈ (1 + t²/(2n))^n → e^{t²/2} — the m.g.f. of the standard normal distribution!
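A quick CLT simulation (my own illustration, with U[0, 1] samples, for which μ = 1/2 and σ² = 1/12):

```python
import random
import statistics

# Simulation of the central limit theorem: standardized means of n i.i.d.
# U[0,1] samples should be roughly N(0, 1).
random.seed(1)
n, reps = 50, 20_000
mu, sigma = 0.5, (1 / 12) ** 0.5
zs = []
for _ in range(reps):
    xbar = sum(random.random() for _ in range(n)) / n
    zs.append(n ** 0.5 * (xbar - mu) / sigma)
print(statistics.mean(zs), statistics.pstdev(zs))   # ~0 and ~1
print(sum(z <= -2 for z in zs) / reps)              # ~0.023, the N(0,1) tail
```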
Gamma Distribution. The Gamma function: for α > 0,
Γ(α) = ∫_0^∞ x^{α−1}e^{−x}dx.
p.d.f. of the Gamma distribution Γ(α, β), for α > 0, β > 0:
1 = (1/Γ(α))∫_0^∞ x^{α−1}e^{−x}dx; the change of variable x = βy stretches the function,
1 = (1/Γ(α))∫_0^∞ (βy)^{α−1}e^{−βy}β dy = (β^α/Γ(α))∫_0^∞ y^{α−1}e^{−βy}dy,
so f(x) = (β^α/Γ(α)) x^{α−1}e^{−βx} for x ≥ 0; 0 for x < 0.
Integrate by parts:
Γ(α) = ∫_0^∞ x^{α−1}e^{−x}dx = −∫_0^∞ x^{α−1}d(e^{−x}) = −x^{α−1}e^{−x}|_0^∞ + (α − 1)∫_0^∞ x^{α−2}e^{−x}dx = (α − 1)Γ(α − 1).
In summary, Property 1: Γ(α) = (α − 1)Γ(α − 1). Expanding Property 1:
Γ(n) = (n − 1)Γ(n − 1) = (n − 1)(n − 2)Γ(n − 2) = ... = (n − 1)⋯(1)Γ(1) = (n − 1)!, since Γ(1) = ∫_0^∞ e^{−x}dx = 1.
In summary, Property 2: Γ(n) = (n − 1)!.
Moments: E(X^k) = ∫_0^∞ x^k (β^α/Γ(α))x^{α−1}e^{−βx}dx = (β^α/Γ(α))∫_0^∞ x^{(α+k)−1}e^{−βx}dx. Make this integral into a density to simplify:
∫_0^∞ (β^{α+k}/Γ(α + k)) x^{(α+k)−1}e^{−βx}dx = 1 — the integrand is just the Gamma distribution with parameters (α + k, β)!
So E(X^k) = Γ(α + k)/(Γ(α)β^k) = α(α + 1)⋯(α + k − 1)/β^k.
For k = 1: E(X) = α/β.
For k = 2: E(X²) = α(α + 1)/β², so Var(X) = α(α + 1)/β² − (α/β)² = α/β².
Example: if the mean = 50 and the variance = 1 are given for a Gamma distribution, solve α/β = 50, α/β² = 1 to get β = 50 and α = 2500, characterizing the distribution.
Beta Distribution B(α, β): a distribution on [0, 1] with p.d.f.
f(x) = (Γ(α + β)/(Γ(α)Γ(β))) x^{α−1}(1 − x)^{β−1}, 0 ≤ x ≤ 1,
i.e. ∫_0^1 x^{α−1}(1 − x)^{β−1}dx = Γ(α)Γ(β)/Γ(α + β), so that ∫_0^1 f(x)dx = 1.
To prove the normalization, write
Γ(α)Γ(β) = ∫_0^∞ x^{α−1}e^{−x}dx ∫_0^∞ y^{β−1}e^{−y}dy = ∫∫ x^{α−1}y^{β−1}e^{−(x+y)}dxdy,
and change variables t = x/(x + y), s = x + y (so x = ts, y = s(1 − t), with Jacobian s):
= ∫_0^1∫_0^∞ (ts)^{α−1}(s(1 − t))^{β−1}e^{−s} s ds dt = ∫_0^1 t^{α−1}(1 − t)^{β−1}dt ∫_0^∞ s^{α+β−1}e^{−s}ds = Γ(α + β)∫_0^1 t^{α−1}(1 − t)^{β−1}dt.
Moments: E(X^k) = ∫_0^1 x^k (Γ(α + β)/(Γ(α)Γ(β))) x^{α−1}(1 − x)^{β−1}dx = (Γ(α + β)/(Γ(α)Γ(β)))∫_0^1 x^{(α+k)−1}(1 − x)^{β−1}dx.
Once again, the integral is (up to constants) the density function of a Beta distribution, so
E(X^k) = (Γ(α + β)/(Γ(α)Γ(β))) · Γ(α + k)Γ(β)/Γ(α + β + k) = α(α + 1)⋯(α + k − 1) / ((α + β)(α + β + 1)⋯(α + β + k − 1)).
For k = 1: E(X) = α/(α + β).
For k = 2: E(X²) = α(α + 1)/((α + β)(α + β + 1)), so
Var(X) = α(α + 1)/((α + β)(α + β + 1)) − α²/(α + β)² = αβ/((α + β)²(α + β + 1)).
Bayesian statistics: X_1, ..., X_n ~ P_{θ_0}, θ_0 ∈ Θ.
Prior distribution: describes a distribution over the set of parameters (NOT the data); f(θ) is a p.f. or p.d.f. chosen to correspond to intuition.
Each P_θ has a p.f. or p.d.f. f(x|θ); given x_1, ..., x_n, the joint p.f. or p.d.f. is f(x_1, ..., x_n|θ) = f(x_1|θ) ⋯ f(x_n|θ).
To find the posterior distribution — the distribution of the parameter given your collected data — use Bayes's formula:
f(θ|x_1, ..., x_n) = f(x_1, ..., x_n|θ)f(θ) / ∫ f(x_1, ..., x_n|θ)f(θ)dθ.
The posterior distribution adjusts your assumption (prior distribution) based on your sample data.
Example: Bernoulli B(p), f(x|p) = p^x(1 − p)^{1−x}, so f(x_1, ..., x_n|p) = p^{Σx_i}(1 − p)^{n−Σx_i}.
Suppose your only possibilities are p = 0.4 and p = 0.6, and you make a prior distribution based on the probability that the parameter p equals each of those values. Prior assumption: f(0.4) = 0.7, f(0.6) = 0.3. You collect the data and find 9 successes out of 10 (sample proportion 0.9). Based on these data, find the probability that the actual p is 0.4 or 0.6; you would expect the posterior to shift toward the larger value.
Joint p.f. of the data for each value: f(x_1, ..., x_10|0.4) = 0.4⁹(0.6)¹, f(x_1, ..., x_10|0.6) = 0.6⁹(0.4)¹.
Then find the posterior distribution:
f(0.4|x_1, ..., x_n) = (0.4⁹·0.6)(0.7) / [(0.4⁹·0.6)(0.7) + (0.6⁹·0.4)(0.3)] = 0.08,
f(0.6|x_1, ..., x_n) = (0.6⁹·0.4)(0.3) / [(0.4⁹·0.6)(0.7) + (0.6⁹·0.4)(0.3)] = 0.92.
Note that it becomes much more likely that p = 0.6 than p = 0.4.
Example: B(p), prior distribution on [0, 1]. Choose any prior to fit intuition, but simplify by choosing the conjugate prior:
f(p|x_1, ..., x_n) = p^{Σx_i}(1 − p)^{n−Σx_i} f(p) / ∫ (…) dp.
Choose f(p) to simplify the integral; the Beta distribution works for Bernoulli data. The prior is therefore
f(p) = (Γ(α + β)/(Γ(α)Γ(β))) p^{α−1}(1 − p)^{β−1}, 0 ≤ p ≤ 1,
with α and β chosen to fit intuition: make E(X) and Var(X) of the prior fit intuition. Then
f(p|x_1, ..., x_n) = (Γ(α + β + n)/(Γ(α + Σx_i)Γ(β + n − Σx_i))) p^{(α+Σx_i)−1}(1 − p)^{(β+n−Σx_i)−1},
so the posterior distribution is Beta(α + Σx_i, β + n − Σx_i).
The conjugate prior yields a posterior in the same family as the prior.
Example: choose Beta(α, β) such that E(X) = 0.4 and Var(X) = 0.1. Use the relations between the parameters and the expectation and variance to solve:
E(X) = 0.4 = α/(α + β), Var(X) = 0.1 = αβ/((α + β)²(α + β + 1)).
With the data above (9 successes in 10 trials), the posterior distribution is therefore Beta(α + 9, β + 1), and the new expected value is shifted:
E = (α + 9)/(α + β + 10).
Once this posterior is calculated, estimate the parameter by its expected value.
Definition of Bayes Estimator: the Bayes estimator of the unknown parameter θ_0 is δ(X_1, ..., X_n) = the expectation of the posterior distribution.
Example: B(p), prior Beta(α, β), data X_1, ..., X_n; the posterior is Beta(α + Σx_i, β + n − Σx_i), so the Bayes estimator is
δ = (α + Σx_i)/(α + β + n) = (α/n + x̄)/(α/n + β/n + 1) → x̄ as n grows.
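A minimal sketch of the conjugate update just described (illustrative prior values α = β = 1, i.e. a uniform prior; these are my assumption, not from the notes):

```python
# Beta-Bernoulli conjugate update: prior Beta(a, b), data = 9 successes in 10
# trials; the posterior is Beta(a + sum(x), b + n - sum(x)) and the Bayes
# estimator is its mean.
a, b = 1.0, 1.0            # hypothetical prior parameters (uniform prior)
successes, n = 9, 10
a_post = a + successes
b_post = b + n - successes
bayes_estimator = a_post / (a_post + b_post)
print(a_post, b_post, bayes_estimator)   # 10.0 2.0 0.8333...
```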
Bayes Estimator. Prior distribution f(θ) → compute the posterior f(θ|X_1, ..., X_n); the Bayes estimator is the expectation of the posterior (recall that E(X − a)² is minimized over a by a = E(X)).
Example: B(p), prior f(p) = Beta(α, β): f(p|x_1, ..., x_n) = Beta(α + Σx_i, β + n − Σx_i), and
δ(x_1, ..., x_n) = (α + Σx_i)/(α + β + n).
Example: Poisson distribution, f(x|λ) = λ^x e^{−λ}/x!:
f(λ|x_1, ..., x_n) ∝ λ^{Σx_i}e^{−nλ} f(λ).
Need to choose an appropriate prior distribution; the Gamma distribution works for Poisson. Take f(λ) = Γ(α, β), f(λ) = (β^α/Γ(α))λ^{α−1}e^{−βλ}; then
f(λ|x_1, ..., x_n) ∝ λ^{Σx_i + α − 1}e^{−(n+β)λ}, i.e. the posterior is Γ(α + Σx_i, β + n), and the Bayes estimator is
δ(x_1, ..., x_n) = (α + Σx_i)/(β + n).
Once again this balances prior intuition and the data; by the law of large numbers,
δ(x_1, ..., x_n) = (α/n + Σx_i/n)/(β/n + 1) → E(X_1) = λ as n → ∞:
the estimator approaches what you're looking for, with large n.
Exponential E(λ), f(x|λ) = λe^{−λx}, x ≥ 0:
f(x_1, ..., x_n|λ) = Π_{i=1}^n λe^{−λx_i} = λ^n e^{−λΣx_i}.
If f(λ) is the prior, the posterior:
f(λ|x_1, ..., x_n) ∝ λ^n e^{−λΣx_i} f(λ).
Once again, a Gamma prior is implied. Choose f(λ) = Γ(u, v):
f(λ) = (v^u/Γ(u))λ^{u−1}e^{−vλ}.
New posterior: f(λ|x_1, ..., x_n) ∝ λ^{n+u−1}e^{−λ(Σx_i + v)}, i.e. Γ(u + n, v + Σx_i).
Bayes estimator: δ(x_1, ..., x_n) = (u + n)/(v + Σx_i).
Normal distribution N(μ, σ²): f(x_1, ..., x_n|μ, σ) = (1/(σ√(2π)))^n e^{−(1/(2σ²))Σ_{i=1}^n (x_i − μ)²}.
It is difficult to find a simple prior when both μ and σ are unknown. Say σ is given, and μ is the only parameter.
Prior: f(μ) = (1/(b√(2π))) e^{−(μ−a)²/(2b²)}, i.e. μ ~ N(a, b).
Posterior: f(μ|X_1, ..., X_n) ∝ e^{−(1/(2σ²))Σ(x_i − μ)² − (μ−a)²/(2b²)}.
Simplify the exponent by collecting terms in μ:
−[(1/(2σ²))Σ(x_i² − 2x_iμ + μ²) + (1/(2b²))(μ² − 2aμ + a²)] = −[Aμ² − 2Bμ + ...],
where A = n/(2σ²) + 1/(2b²) and B = Σx_i/(2σ²) + a/(2b²). Completing the square,
−A(μ² − 2(B/A)μ + ...) = −A(μ − B/A)² + ...,
so f(μ|X_1, ..., X_n) ∝ e^{−A(μ − B/A)²} = e^{−(μ − B/A)²/(2(1/√(2A))²)}, i.e. the posterior is
N(B/A, (1/√(2A))²) = N((σ²a + nb²x̄)/(σ² + nb²), σ²b²/(σ² + nb²)).
Normal Bayes estimator:
δ(X_1, ..., X_n) = (σ²a + nb²x̄)/(σ² + nb²) = (σ²a/n + b²x̄)/(σ²/n + b²) → x̄ → E(X_1) = μ as n → ∞.
** End of Lecture 24
Maximum Likelihood Estimators
X_1, ..., X_n have distribution P_{θ_0} ∈ {P_θ : θ ∈ Θ}. The joint p.f. or p.d.f.
f(x_1, ..., x_n|θ) = f(x_1|θ) ⋯ f(x_n|θ) = φ(θ)
is the likelihood function. If P_θ is discrete, f(x|θ) = P_θ(X = x), so φ(θ) is the probability of observing X_1, ..., X_n under parameter θ.
Definition: a maximum likelihood estimator (M.L.E.) is θ̂ = θ̂(X_1, ..., X_n) such that φ(θ̂) = max_θ φ(θ).
Suppose there are two possible values of the parameter, θ = 1 and θ = 2, with p.f./p.d.f.s f(x|1), f(x|2). Observe points x_1, ..., x_n and view the likelihood under the first and second parameters:
φ(1) = f(x_1, ..., x_n|1) = 0.1, φ(2) = f(x_1, ..., x_n|2) = 0.001;
the parameter is much more likely to be 1 than 2.
Example: Bernoulli distribution B(p), p ∈ [0, 1]:
φ(p) = f(x_1, ..., x_n|p) = p^{Σx_i}(1 − p)^{n−Σx_i}.
Maximizing φ is equivalent to maximizing log φ (the log-likelihood):
log φ(p) = Σx_i log p + (n − Σx_i) log(1 − p), maximized over [0, 1].
Find the critical point:
∂ log φ(p)/∂p = Σx_i/p − (n − Σx_i)/(1 − p) = 0
⇒ Σx_i(1 − p) − p(n − Σx_i) = 0 ⇒ Σx_i − np = 0 ⇒ p̂ = Σx_i/n = x̄ → E(X) = p.
For the Bernoulli distribution, the MLE converges to the actual parameter p of the distribution.
Example: Normal distribution N(μ, σ²):
f(x|μ, σ²) = (1/(σ√(2π))) e^{−(x−μ)²/(2σ²)}, so
φ(μ, σ²) = (1/(σ√(2π)))^n e^{−(1/(2σ²))Σ_{i=1}^n (x_i − μ)²}.
Note that the two parameters are decoupled. First, for a fixed σ, maximizing over μ means minimizing Σ_{i=1}^n (x_i − μ)² over μ:
Σ_{i=1}^n x_i − nμ = 0, so μ̂ = (1/n)Σx_i = x̄ → E(X) = μ.
To summarize, the MLE of μ for a normal distribution is the sample mean. To find the estimator of the variance, maximize
−n log σ − (1/(2σ²))Σ_{i=1}^n (x_i − x̄)² over σ:
−n/σ + (1/σ³)Σ_{i=1}^n (x_i − x̄)² = 0, so
σ̂² = (1/n)Σ(x_i − x̄)², the MLE of σ_0²: the sample variance.
Note that
σ̂² = (1/n)Σ(x_i − 2x_ix̄ + x̄²) = (1/n)Σx_i² − 2x̄·(1/n)Σx_i + (x̄)² = (1/n)Σx_i² − (x̄)² → E(X_1²) − (E(X_1))² = σ_0²
by the law of large numbers.
Example: uniform distribution on [0, θ], f(x|θ) = (1/θ)I(0 ≤ x ≤ θ):
φ(θ) = (1/θ^n)I(all X_i ∈ [0, θ]), maximized over θ > 0.
The likelihood function is 0 if any point falls outside the interval [0, θ] — if that happens, you chose the wrong θ for your distribution. If you graph the likelihood, notice that 1/θ^n decreases in θ, and the likelihood drops to 0 when θ drops below the maximum data point. Therefore
θ̂ = max(X_1, ..., X_n).
The estimator converges to the actual parameter θ_0: as you keep observing points, the maximum gets closer and closer to θ_0.
Sketch of the consistency of the MLE: φ(θ) → max is equivalent to maximizing
L_n(θ) = (1/n) log φ(θ) = (1/n)Σ_{i=1}^n log f(x_i|θ) → L(θ) = E_{θ_0} log f(X_1|θ) by the LLN.
L_n(θ) is maximized at θ̂, by definition of the MLE. Let us show that L(θ) is maximized at θ_0; then, evidently, θ̂ → θ_0. To show L(θ) ≤ L(θ_0), expand the inequality:
L(θ) − L(θ_0) = E_{θ_0} log [f(X|θ)/f(X|θ_0)] = ∫ log[f(x|θ)/f(x|θ_0)] f(x|θ_0)dx
≤ ∫ [f(x|θ)/f(x|θ_0) − 1] f(x|θ_0)dx = ∫ (f(x|θ) − f(x|θ_0))dx = 1 − 1 = 0.
Here we used that the graph of the logarithm lies below the line y = x − 1 except at the tangent point.
** End of Lecture 25
Confidence intervals: you can guarantee that the mean or variance lies in a particular interval with some probability.
Definition: take α ∈ [0, 1], the confidence level. If
P(S_1(X_1, ..., X_n) ≤ θ_0 ≤ S_2(X_1, ..., X_n)) = α,
then the interval [S_1, S_2] is a confidence interval for θ_0 with confidence level α.
Consider Z_0, Z_1, ..., Z_n i.i.d. N(0, 1).
Definition: the distribution of Z_1² + Z_2² + ... + Z_n² is called the chi-square (χ²_n) distribution with n degrees of freedom. As shown in 7.2, the chi-square distribution is a Gamma distribution Γ(n/2, 1/2).
Definition: the distribution of
Z_0 / √((1/n)(Z_1² + ... + Z_n²))
is the t-distribution with n degrees of freedom; it is also called Student's distribution, see 7.4 for detail.
To find confidence intervals for N(μ_0, σ_0²), we need the following fact (Fisher's theorem): if Z_1, ..., Z_n are i.i.d. N(0, 1), and
z̄ = (1/n)(Z_1 + ... + Z_n), z²‐bar = (1/n)ΣZ_i²,
then A = √n z̄ ~ N(0, 1), B = n(z²‐bar − (z̄)²) ~ χ²_{n−1}, and A and B are independent.
For a sample X_1, ..., X_n from N(μ_0, σ_0²), the standardized variables Z_i = (X_i − μ_0)/σ_0 are N(0, 1), so
A = √n z̄ = √n(x̄ − μ_0)/σ_0 and B = n(z²‐bar − (z̄)²) = (1/σ_0²)Σ(x_i − x̄)² = nσ̂²/σ_0², where σ̂² = (1/n)Σ(x_i − x̄)².
Confidence interval for the variance: choose values c_1 ≤ c_2 from the tabulated chi-square distribution such that the probability of [c_1, c_2] equals the confidence level α. Then, with probability α,
c_1 ≤ B = nσ̂²/σ_0² ≤ c_2,
and solving for σ_0²:
nσ̂²/c_2 ≤ σ_0² ≤ nσ̂²/c_1.
Choose c_1 and c_2 such that the right tail has probability (1 − α)/2, the same as the left tail; this amounts to throwing away the unlikely possibilities outside c_1 and c_2. (Or you could choose c_1, c_2 to make the interval as small as possible.)
Why wouldn't you instead throw away a small interval in between c_1 and c_2 with total probability 1 − α? Though it is the same probability, you would be throwing away very likely values of the parameter!
** End of Lecture 26
A and B are independent, with A = √n(x̄ − μ_0)/σ_0 ~ N(0, 1) and B = nσ̂²/σ_0² ~ χ²_{n−1}.
To determine a confidence interval for μ, we must eliminate σ_0 from A. The ratio
A / √(B/(n − 1)) = Z_0 / √((1/(n−1))(Z_1² + ... + Z_{n−1}²)) ~ t_{n−1},
where Z_0, Z_1, ..., Z_{n−1} are N(0, 1) (note E(Z_1²) = 1). The t_n-distribution, like the standard normal, is symmetric about zero, and for large n it looks like a normal distribution.
Given α ∈ (0, 1), find c such that t_{n−1}(−c, c) = α. Then with probability α,
−c ≤ [√n(x̄ − μ)/σ] / √(nσ̂²/((n − 1)σ²)) ≤ c,
and σ cancels, giving the confidence interval
x̄ − c·√(σ̂²/(n − 1)) ≤ μ ≤ x̄ + c·√(σ̂²/(n − 1)).
A note on the sample variance σ̂² = (1/n)Σx_i² − (x̄)²: for i ≠ j, E(X_iX_j) = E(X_i)E(X_j) = (E X_1)², and there are n(n − 1) terms with different indices; working this out,
E(σ̂²) = ((n − 1)/n)σ² < σ².
It is a good estimator, but more often than not less than the actual variance. So, to compensate for the low bias, consider
(σ̂′)² = (n/(n − 1))σ̂², for which E((σ̂′)²) = (n/(n − 1))E(σ̂²) = σ².
7.5 pg. 140 Example: Lactic Acid in Cheese. Data 0.86, 1.53, 1.57, ..., 1.58, n = 10, modeled as N(μ, σ²); x̄ = 1.379, σ̂² = (1/n)Σx_i² − (x̄)² = 0.0966.
Predict the parameters with confidence α = 95%, using a t-distribution with n − 1 = 9 degrees of freedom (c = 2.262):
x̄ − 2.262·√(σ̂²/9) ≤ μ ≤ x̄ + 2.262·√(σ̂²/9), i.e. 1.145 ≤ μ ≤ 1.613.
This is a large interval, due to the high guarantee and the small number of samples. If we change to 90% confidence, c = 1.833 and the interval is 1.189 ≤ μ ≤ 1.569 — a much better sized interval.
Confidence interval for the variance:
c_1 ≤ nσ̂²/σ² ≤ c_2,
with c_1, c_2 from the χ²_9 table (not symmetric; all points are positive for the χ² distribution). With c_1 = 2.7, c_2 = 19.02:
0.0508 ≤ σ² ≤ 0.3579,
again a wide interval, as a result of the small n and high confidence.
Sketch of Fisher's theorem: z_1, ..., z_n i.i.d. N(0, 1);
the joint density (1/√(2π))^n e^{−(1/2)Σy_i²} is symmetric with respect to rotation, so rotating the coordinates gives again an i.i.d. standard normal sequence y_1, ..., y_n. Choose the coordinate system such that
y_1 = (1/√n)(z_1 + ... + z_n), i.e. v_1 = (1/√n, ..., 1/√n) is the new first axis,
and choose all other basis vectors however you want to make a new orthogonal basis. Then
y_1² + ... + y_n² = z_1² + ... + z_n²,
since the length does not change after rotation! Therefore
√n z̄ = y_1 ~ N(0, 1), and n(z²‐bar − (z̄)²) = Σz_i² − n(z̄)² = Σy_i² − y_1² = y_2² + ... + y_n² ~ χ²_{n−1},
and the two are independent since they depend on different coordinates y_i.
** End of Lecture 27
Problem 4. f(x|θ) = e^{θ−x} for x ≥ θ, and 0 for x < θ. Find the MLE of θ.
Likelihood: φ(θ) = f(x₁|θ) · ... · f(xₙ|θ) = e^{θ−x₁} · ... · e^{θ−xₙ} I(x₁, ..., xₙ ≥ θ) = e^{nθ − Σ xᵢ} I(min(x₁, ..., xₙ) ≥ θ).
Maximize over θ. Note that e^{nθ − Σ xᵢ} increases in θ, but θ must stay below the minimum data value; if θ is greater, the indicator drops the likelihood to zero. Therefore
θ̂ = min(x₁, ..., xₙ).
This also makes sense from the original distribution: the density is largest just above θ, so the smallest observation pins down θ.
p. 415, Problem 7:
To get the confidence intervals, compute the sample average and sample variance.
Confidence interval for μ:
x̄ − c √(σ̂²/(n − 1)) ≤ μ ≤ x̄ + c √(σ̂²/(n − 1)),
where σ̂² = (1/n)Σ xᵢ² − (x̄)² and c comes from t_{n−1} = t₁₉: t₁₉((−∞, c)) = 0.95, so c = 1.729.
Confidence interval for σ²: using
√n (x̄ − μ)/σ ~ N(0, 1), n σ̂²/σ² ~ χ²_{n−1},
we get n σ̂²/c₂ ≤ σ² ≤ n σ̂²/c₁.
p. 196, Number 9. P(Xᵢ = defective) = p. Find E(X − Y), where X is the number of defective and Y the number of non-defective items among n.
Set Xᵢ = 1 if defective and −1 if not defective; then X − Y = X₁ + ... + Xₙ and
E(X − Y) = E X₁ + ... + E Xₙ = n E X₁ = n(1·p − 1·(1 − p)) = n(2p − 1).
p. 396, Number 10. X₁, ..., X₆ ~ N(0, 1). Find c so that
c((X₁ + X₂ + X₃)² + (X₄ + X₅ + X₆)²) ~ χ²₂,
i.e. (√c(X₁ + X₂ + X₃))² + (√c(X₄ + X₅ + X₆))² ~ χ²₂.
Each term inside needs to be N(0, 1): E √c(X₁ + X₂ + X₃) = √c(E X₁ + E X₂ + E X₃) = 0 and Var(√c(X₁ + X₂ + X₃)) = c(Var(X₁) + Var(X₂) + Var(X₃)) = 3c. In order to have the standard normal distribution, the variance must equal 1: 3c = 1, c = 1/3. ** End of Lecture 28
Score distribution for Test 2: 70-100 A, 40-70 B, 20-40 C, 10-20 D. Average = 45.
Hypothesis Testing. X₁, ..., Xₙ with unknown distribution P. Hypothesis possibilities:
H₁: P = P₁, H₂: P = P₂, ..., Hₖ: P = Pₖ.
There are k simple hypotheses. A simple hypothesis states that the distribution is equal to one particular probability distribution. Consider, for example, two normal distributions: N(0, 1) and N(1, 1).
Assign a weight to each hypothesis, based upon the importance of the different errors:
ξ(1), ..., ξ(k) ≥ 0, Σ ξ(i) = 1.
Bayes error: α(δ) = ξ(1)α₁ + ξ(2)α₂ + ... + ξ(k)αₖ, where αᵢ = αᵢ(δ) = Pᵢ(δ ≠ Hᵢ).
Minimize the Bayes error to choose the appropriate decision rule.
Simple solution to finding the decision rule:
X = (X₁, ..., Xₙ); let fᵢ(x) be the p.f. or p.d.f. of Pᵢ, and fᵢ(x) = fᵢ(x₁) · ... · fᵢ(xₙ) — the joint p.f./p.d.f.
Theorem (Bayes decision rule):
δ = {Hᵢ : ξ(i)fᵢ(x) = max_{1 ≤ j ≤ k} ξ(j)fⱼ(x)}.
Similar to maximum likelihood.
Find the largest of the joint densities, but weighted in this case. Proof:
α(δ) = Σᵢ ξ(i)Pᵢ(δ ≠ Hᵢ) = Σᵢ ξ(i)(1 − Pᵢ(δ = Hᵢ)) = 1 − Σᵢ ξ(i) ∫ I(δ(x) = Hᵢ) fᵢ(x) dx
= 1 − ∫ ( Σᵢ ξ(i) I(δ(x) = Hᵢ) fᵢ(x) ) dx.
To minimize α(δ), maximize the integral. The function inside the integral is
I(δ = H₁) ξ(1)f₁(x) + ... + I(δ = Hₖ) ξ(k)fₖ(x).
The indicators pick out one term; e.g. if δ = H₁ it equals 1·ξ(1)f₁(x) + 0 + ... + 0. So, to maximize the integral, at each x just choose the largest term, i.e. let δ pick the largest term in the sum.
Most of the time, we will consider 2 simple hypotheses:
δ = {H₁ : ξ(1)f₁(x) > ξ(2)f₂(x); H₂ : ξ(1)f₁(x) < ξ(2)f₂(x); H₁ or H₂ if equal}.
Example: H₁: N(0, 1), H₂: N(1, 1); minimize ξ(1)α₁(δ) + ξ(2)α₂(δ) with ξ(1) = ξ(2) = 1/2.
Comparing the joint densities and simplifying, the rule becomes
δ = {H₁ : Σ xᵢ < n/2; H₂ : Σ xᵢ > n/2; H₁ or H₂ if equal}.
For a single observation X₁ (n = 1):
δ = {H₁ : x₁ < 1/2; H₂ : x₁ > 1/2; H₁ or H₂ if equal}.
However, if one distribution were more important, it would be weighted more heavily.
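A small sketch of this weighted rule, assuming scipy; the observations and the unequal weights in the second call are placeholders for illustration.

    import numpy as np
    from scipy.stats import norm

    def bayes_rule(x, xi1=0.5, xi2=0.5):
        """Choose H1: N(0,1) or H2: N(1,1) by comparing xi(i) * f_i(x1,...,xn)."""
        w1 = xi1 * np.prod(norm.pdf(x, loc=0, scale=1))
        w2 = xi2 * np.prod(norm.pdf(x, loc=1, scale=1))
        return "H1" if w1 > w2 else "H2"

    x = np.array([0.3, -0.2, 1.1])            # hypothetical sample
    print(bayes_rule(x))                      # equal weights: H1 exactly when sum(x) < n/2
    print(bayes_rule(x, xi1=0.9, xi2=0.1))    # a heavier weight on H1 enlarges its region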
If N(0, 1) were more important, you would choose it more of the time, even on some occasions when Σ xᵢ > n/2.
Definition: for two simple hypotheses H₁, H₂:
α₁(δ) = P(δ ≠ H₁ | H₁) — the level of significance;
β(δ) = 1 − α₂(δ) = P(δ = H₂ | H₂) — the power.
For more than 2 hypotheses, α₁(δ) is still called the level of significance, because H₁ is always the most important hypothesis, and β(δ) becomes a power function, with one value for each alternative hypothesis.
Definition: H₀ — the null hypothesis.
For example, when a drug company evaluates a new drug, the null hypothesis is that it doesn't work. H₀ is what you want to disprove first and foremost — you don't want to make that error!
Next time: consider a class of decision rules
K_α = {δ : α₁(δ) ≤ α}, α ∈ [0, 1],
and minimize α₂(δ) within the class K_α.
** End of Lecture 29
Bayes decision rule: minimize ξ(1)α₁(δ) + ξ(2)α₂(δ) over δ:
δ = {H₁ : f₁(x)/f₂(x) > ξ(2)/ξ(1); H₂ : if <; H₁ or H₂ : if =}.
Example (see pg. 469, Problem 3): H₀: f₁(x) = 1 for 0 ≤ x ≤ 1; H₁: f₂(x) = 2x for 0 ≤ x ≤ 1. Sample one point x₁ and minimize 3α₀(δ) + 1α₁(δ):
δ = {H₀ : 1/(2x₁) > 1/3; H₁ : 1/(2x₁) < 1/3; either if equal}.
Simplify the expression:
δ = {H₀ : x₁ < 3/2; H₁ : x₁ > 3/2}.
Since x₁ is always between 0 and 1, H₀ is always chosen: δ = H₀ always. Errors:
α₀(δ) = P₀(δ ≠ H₀) = 0, α₁(δ) = P₁(δ ≠ H₁) = 1.
We made the error under H₀ very important in the weighting, so it ended up being 0.
Most powerful test for two simple hypotheses. Consider the class K_α = {δ : α₁(δ) ≤ α}, α ∈ [0, 1]. Take the following decision rule:
δ = {H₁ : f₁(x)/f₂(x) ≥ c; H₂ : f₁(x)/f₂(x) < c}.
Calculate the constant c from the level α:
α₁(δ) = P₁(δ ≠ H₁) = P₁(f₁(x)/f₂(x) < c) = α.
Sometimes it is difficult to find such a c in the discrete case, but consider the simplest continuous case first. Find ξ(1), ξ(2) such that ξ(1) + ξ(2) = 1 and ξ(2)/ξ(1) = c. Then δ is a Bayes decision rule:
ξ(1)α₁(δ) + ξ(2)α₂(δ) ≤ ξ(1)α₁(δ') + ξ(2)α₂(δ')
for any decision rule δ'. If δ' ∈ K_α then α₁(δ') ≤ α. Note that α₁(δ) = α, so
ξ(1)α + ξ(2)α₂(δ) ≤ ξ(1)α₁(δ') + ξ(2)α₂(δ') ≤ ξ(1)α + ξ(2)α₂(δ').
Therefore α₂(δ) ≤ α₂(δ'): δ is the best (most powerful) decision rule in K_α.
Example:
H₁: N(0, 1), H₂: N(1, 1), α₁(δ) = α = 0.05.
Always simplify the rule first: f₁(x)/f₂(x) ≥ c is equivalent to
−Σ xᵢ + n/2 ≥ log(c), i.e. Σ xᵢ ≤ n/2 − log(c) = c'.
The decision rule becomes δ = {H₁ : Σ xᵢ ≤ c'; H₂ : Σ xᵢ > c'}.
Now find c': α₁(δ) = P₁(Σ xᵢ > c') — recall, the subscript on P indicates that X₁, ..., Xₙ ~ N(0, 1). Make it standard normal:
P₁( Σ xᵢ/√n > c'/√n ) = 0.05.
Check the table: P(Z > 1.64) = 0.05, so c'/√n = 1.64 and c' = 1.64 √n.
(These two conversions are the same thing — don't combine techniques from both.) The decision rule now becomes
δ = {H₁ : Σ xᵢ ≤ 1.64 √n; H₂ : Σ xᵢ > 1.64 √n}.
Error of type 2:
α₂(δ) = P₂(Σ xᵢ ≤ c' = 1.64 √n)
(the subscript 2 indicates that X₁, ..., Xₙ ~ N(1, 1))
= P₂( (Σ xᵢ − n·1)/√n ≤ (1.64 √n − n)/√n ) = P(Z ≤ 1.64 − √n).
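A sketch checking these two error computations numerically (assuming scipy); the sample size n = 10 is an arbitrary choice for illustration.

    import numpy as np
    from scipy.stats import norm

    alpha, n = 0.05, 10

    # Under H1 the X_i are N(0,1), so sum(X_i)/sqrt(n) is standard normal
    c = np.sqrt(n) * norm.ppf(1 - alpha)            # c' = 1.64 * sqrt(n)

    # Under H2 the X_i are N(1,1), so (sum(X_i) - n)/sqrt(n) is standard normal
    alpha2 = norm.cdf((c - n) / np.sqrt(n))         # = P(Z <= 1.64 - sqrt(n)), the type 2 error
    print(c, alpha2)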
Second example: two normals that differ only in variance, H₁: N(0, 2) and H₂: N(0, 3). The joint densities have the form (1/√(2πσ²))ⁿ e^{−Σ xᵢ²/(2σ²)}, and the likelihood ratio condition f₁(x)/f₂(x) ≥ c simplifies to Σ xᵢ² ≤ c'. Then
α₁(δ) = P₁(Σ xᵢ² > c') = P₁(Σ xᵢ²/2 > c'/2) = P₁(χ²ₙ > c'/2) = 0.05.
If n = 10: P₁(χ²₁₀ > c'/2) = 0.05, so c'/2 = 18.31 and c' = 36.62.
The error of type 2 is found in the same way as earlier:
α₂(δ) = P₂(Σ xᵢ² ≤ c') = P(χ²₁₀ ≤ c'/3) ≈ P(χ²₁₀ ≤ 12.2) ≈ 0.7.
A difference of 1 in the variance is a huge deal: a large type 2 error results from such a small n. ** End of Lecture 30
t-test. X₁, ..., Xₙ — a random sample from N(μ, σ²).
Two-sided hypothesis test: H₁: μ = μ₀; H₂: μ ≠ μ₀ (the parameter can be greater or less than μ₀).
Take α ∈ (0, 1) — the level of significance (error of type 1). Construct a confidence interval with confidence 1 − α: if μ₀ falls in the interval, choose H₁; otherwise choose H₂.
To phrase the confidence interval as a decision rule, use the statistic
T = (x̄ − μ₀) / √(σ̂²/(n − 1)), where σ̂² = (1/n)Σ xᵢ² − (x̄)².
If H₁ is true, T has a t distribution with n − 1 degrees of freedom. If the true mean is μ ≠ μ₀, then
T = (x̄ − μ)/√(σ̂²/(n − 1)) + (μ − μ₀)/√(σ̂²/(n − 1)) ≈ t_{n−1} + (μ − μ₀)√(n − 1)/σ̂,
which drifts off to ±∞ as n grows, so |T| tends to be large under H₂. The decision rule is
δ = {H₁ : −c ≤ T ≤ c; H₂ : otherwise}, with c chosen so that t_{n−1}((−c, c)) = 1 − α.
For a one-sided test (say H₁: μ ≤ μ₀ against H₂: μ > μ₀), the cutoff c is chosen so that α = P₁(T > c).
p-value: still the probability of values less likely than the observed T, but since the test is one-sided you don't need to consider the area to the left of −T as you would in the two-sided case — the p-value is the area of everything to the right of T.
Example (8.5.1, 8.5.4): μ₀ = 5.2, n = 15, x̄ = 5.4, σ̂ = 0.4226. H₁: μ = 5.2, H₂: μ ≠ 5.2.
T is calculated to be 1.833, which leads to a p-value of 0.0882. From the table, c = 2.145 and
δ = {H₁ : −2.145 ≤ T ≤ 2.145; H₂ : otherwise}.
Now consider two samples whose means we want to compare:
X₁, ..., Xₘ ~ N(μ₁, σ₁²) and Y₁, ..., Yₙ ~ N(μ₂, σ₂²).
Paired t-test. Example (textbook): crash test dummies in driver and passenger seats (X, Y).
See if there is a difference in severity of head injuries depending on the seat:
(X₁, Y₁), ..., (Xₙ, Yₙ).
Observe the paired observations (each car) and calculate the differences.
Hypothesis test: H₁: μ₁ = μ₂; H₂: μ₁ ≠ μ₂.
Consider Z₁ = X₁ − Y₁, ..., Zₙ = Xₙ − Yₙ ~ N(μ₁ − μ₂ = μ, σ²) and test
H₁: μ = 0; H₂: μ ≠ 0.
This is just a regular t-test. The p-value comes out as < 10⁻⁶, so the two means are likely to be different.
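A sketch (assuming scipy) that reproduces the one-sample p-value of example 8.5 from its summary statistics; a paired test would apply the same computation to the differences Z_i = X_i − Y_i.

    import numpy as np
    from scipy.stats import t

    n, xbar, mu0, s = 15, 5.4, 5.2, 0.4226     # summary statistics from example 8.5

    T = (xbar - mu0) / (s / np.sqrt(n))        # dividing by sqrt(n) reproduces the T = 1.833 quoted above
    p_two_sided = 2 * t.sf(abs(T), df=n - 1)   # about 0.088
    print(T, p_two_sided)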
** End of Lecture 31
Recall from the one-sample case:
T = (x̄ − μ₀) / √(σ̂²/(n − 1)) ~ t_{n−1}, where σ̂² = (1/n)Σ xᵢ² − (x̄)².
For two independent samples X₁, ..., Xₘ ~ N(μ₁, σ²) and Y₁, ..., Yₙ ~ N(μ₂, σ²) (same variance), the analogous statistic is
T = [ (x̄ − ȳ) − (μ₁ − μ₂) ] / √( (1/m + 1/n)(m σ̂ₓ² + n σ̂ᵧ²)/(m + n − 2) ) ~ t_{m+n−2},
where σ̂ₓ² and σ̂ᵧ² are the sample variances of the two samples. Under H₁: μ₁ = μ₂ this becomes
T = (x̄ − ȳ) / √( (1/m + 1/n)(m σ̂ₓ² + n σ̂ᵧ²)/(m + n − 2) ) ~ t_{m+n−2}.
Decision rule: δ = {H₁ : −c ≤ T ≤ c; H₂ : otherwise}, where the c values come from the t distribution with m + n − 2 degrees of freedom; c is the value where each tail area equals α/2, since rejection occurs both below −c and above +c.
If the test were one-sided, H₁: μ₁ ≤ μ₂ versus H₂: μ₁ > μ₂, then c would correspond to an area α in one tail only, since rejection occurs only above +c.
There are different statistics you can construct to approach the problem, based on different combinations of the data. This is why statistics is entirely based on your assumptions and the resulting distribution of the statistic!
Example: testing soil types in different locations by the amount of aluminum oxide present.
m = 14, x̄ = 12.56, from N(μ₁, σ²); n = 5, ȳ = 17.32, from N(μ₂, σ²).
H₁: μ₁ ≤ μ₂; H₂: μ₁ > μ₂. T = −6.3 ~ t_{14+5−2} = t₁₇.
The c-value is 1.74 for this one-sided test. T is very negative, but we still accept H₁, since rejection happens only for T > 1.74.
If the hypotheses were H₁: μ₁ ≥ μ₂; H₂: μ₁ < μ₂, then the T value of −6.3 would be way to the left of the cutoff −1.74: reject H₁.
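A sketch (assuming scipy) checking the cutoffs used in these two one-sided versions of the test.

    from scipy.stats import t

    T, dof, alpha = -6.3, 17, 0.05             # statistic and degrees of freedom from the example

    c = t.ppf(1 - alpha, df=dof)               # about 1.74
    print("H1: mu1 <= mu2 vs H2: mu1 > mu2:", "reject H1" if T > c else "accept H1")
    print("H1: mu1 >= mu2 vs H2: mu1 < mu2:", "reject H1" if T < -c else "accept H1")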
Goodness-of-fit tests. Setup: consider r different categories B₁, ..., B_r for the random variable. The probability that a data point falls in category Bᵢ is pᵢ, with p₁ + ... + p_r = 1. Hypotheses: H₁: pᵢ = pᵢ⁰ for all i = 1, ..., r; H₂: otherwise. Example (9.1.1):
Three categories exist, regarding a family's financial situation: they are either worse, better, or the same this year as last year.
Data: Worse = 58, Same = 64, Better = 67 (n = 189).
Hypothesis: H₁: p₁ = p₂ = p₃ = 1/3; H₂: otherwise. Let Nᵢ = number of observations in category i.
You would expect, under H1 , that N1 = np1 , N2 = np2 , N3 = np3
Measure using the central limit theorem:
(N₁ − np₁)/√(np₁(1 − p₁)) → N(0, 1).
However, keep in mind that the Nᵢ are not independent (they sum to n). Ignore part of the scaling to account for this (proof beyond our scope):
(N₁ − np₁)/√(np₁) ≈ √(1 − p₁) · N(0, 1) = N(0, 1 − p₁).
Pearson's theorem: if H₁ is true, then
T = Σᵢ₌₁ʳ (Nᵢ − npᵢ⁰)²/(npᵢ⁰) ≈ χ²_{r−1};
if H₁ is not true, then T → +∞. Decision rule:
δ = {H₁ : T ≤ c; H₂ : T > c}, with c from the χ²_{r−1} table.
The example yields T = 0.666, to be compared with χ²_{r−1} = χ²_{3−1} = χ²₂; c is much larger, therefore accept H₁. The difference among the categories is not significant.
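A sketch (assuming scipy) reproducing this goodness-of-fit computation.

    from scipy.stats import chisquare, chi2

    observed = [58, 64, 67]                    # worse, same, better; n = 189
    T, p_value = chisquare(observed)           # expected counts default to 189/3 = 63 each
    print(T, p_value)                          # T is about 0.667

    c = chi2.ppf(0.95, df=len(observed) - 1)   # about 5.99 for alpha = 0.05
    print("accept H1" if T <= c else "reject H1")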
** End of Lecture 32
T = Σᵢ₌₁ʳ (Nᵢ − npᵢ⁰)²/(npᵢ⁰) ≈ χ²_{r−1}. Decision rule: δ = {H₁ : T ≤ c; H₂ : T > c}.
If the distribution is continuous or has infinitely many discrete points, with hypotheses H₁: P = P₀; H₂: P ≠ P₀:
Discretize the distribution into intervals, and count the points in each interval.
You know the probability of each interval from the area under the density, so consider a finite number of intervals; this reduces the problem to the discrete case.
Example: X ~ N(3.912, 0.25), so (X − 3.912)/0.5 ~ N(0, 1). Take dividing points c₁, c₂ = 3.912, c₃ and find the normalized dividing points by the relation
cᵢ' = (cᵢ − 3.912)/0.5.
Decision rule: δ = {H₁ : T ≤ 7.815; H₂ : T > 7.815}.
T = 3.609 < 7.815; conclusion: accept H₁.
The distribution is relatively uniform among the intervals.
Composite hypotheses: H₁: pᵢ = pᵢ(θ), i ≤ r, for some θ in a parameter set Θ; H₂: not true for any choice of θ.
Step 1: Find the θ that best describes the data — the MLE of θ. Likelihood function:
φ(θ) = p₁(θ)^{N₁} p₂(θ)^{N₂} · ... · p_r(θ)^{N_r}.
Taking the log of φ(θ) and maximizing is good enough.
Step 2: See whether the best choice θ̂ fits the data: H₁: pᵢ = pᵢ(θ̂) for i ≤ r; H₂: otherwise.
T = Σᵢ₌₁ʳ (Nᵢ − npᵢ(θ̂))²/(npᵢ(θ̂)) ≈ χ²_{r−s−1},
where s is the dimension of the parameter set.
Example (pg. 543): a gene has two possible alleles A₁, A₂, giving genotypes A₁A₁, A₁A₂, A₂A₂. Test that P(A₁) = θ and P(A₂) = 1 − θ for some θ,
but you only observe genotypes. H₁: P(A₁A₁) = θ² (count N₁), P(A₁A₂) = 2θ(1 − θ) (count N₂), P(A₂A₂) = (1 − θ)² (count N₃). Here r = 3 categories and s = 1 (only one parameter, θ).
φ(θ) = (θ²)^{N₁} (2θ(1 − θ))^{N₂} ((1 − θ)²)^{N₃} = 2^{N₂} θ^{2N₁+N₂} (1 − θ)^{2N₃+N₂},
log φ(θ) = N₂ log 2 + (2N₁ + N₂) log θ + (2N₃ + N₂) log(1 − θ).
Setting the derivative to zero: (2N₁ + N₂)/θ − (2N₃ + N₂)/(1 − θ) = 0, i.e. (2N₁ + N₂)(1 − θ) − (2N₃ + N₂)θ = 0, so
θ̂ = (2N₁ + N₂)/(2N₁ + 2N₂ + 2N₃) = (2N₁ + N₂)/(2n).
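A sketch of this estimate with hypothetical genotype counts (the counts are placeholders, not the textbook's data), using the statistic T defined just below.

    import numpy as np
    from scipy.stats import chi2

    N = np.array([10, 53, 46])                 # hypothetical counts of A1A1, A1A2, A2A2
    n = N.sum()

    theta = (2 * N[0] + N[1]) / (2 * n)        # MLE derived above
    p_hat = np.array([theta**2, 2 * theta * (1 - theta), (1 - theta)**2])

    T = np.sum((N - n * p_hat)**2 / (n * p_hat))
    print(theta, T, chi2.ppf(0.95, df=1))      # compare T with the chi-square(1) cutoff 3.841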
T = Σᵢ (Nᵢ − npᵢ(θ̂))²/(npᵢ(θ̂)) ≈ χ²_{r−s−1} = χ²₁.
For α = 0.05, c = 3.841 from the χ²₁ distribution. Decision rule: δ = {H₁ : T ≤ 3.841; H₂ : T > 3.841}. ** End of Lecture 33
Contingency tables, test of independence. Consider two features, the first with a possible values and the second with b possible values, and counts Nij of sample points with feature 1 equal to i and feature 2 equal to j:

               F2 = 1   F2 = 2   ...   F2 = b   row total
    F1 = 1      N11      N12     ...    N1b       N1+
    F1 = 2      ...      ...     ...    ...       ...
    ...         ...      ...     ...    ...       ...
    F1 = a      Na1      Na2     ...    Nab       Na+
    col. total  N+1      N+2     ...    N+b        n

Here Xᵢ¹ ∈ {1, ..., a} and Xᵢ² ∈ {1, ..., b}.
Random sample: X₁ = (X₁¹, X₁²), ..., Xₙ = (Xₙ¹, Xₙ²). Question: are X¹ and X² independent?
Example: when asked whether your finances are better, worse, or the same as last year, see if the answer depends on income range:

              < 20K   20K-30K   > 30K
    Worse      20        24       14
    Same       15        27       22
    Better     12        32       23
Check whether the differences and the subtle trend are significant or random. If the features are independent, θᵢⱼ = P(i, j) = P(i)·P(j) for all cells ij, so the independence hypothesis can be written as
H₁: θᵢⱼ = pᵢ qⱼ, where p₁ + ... + pₐ = 1 and q₁ + ... + q_b = 1; H₂: otherwise.
Here r = number of categories = ab and s = dimension of the parameter set = a + b − 2. The MLEs p̂ᵢ, q̂ⱼ need to be found, and
T = Σᵢⱼ (Nᵢⱼ − n p̂ᵢ q̂ⱼ)²/(n p̂ᵢ q̂ⱼ) ≈ χ²_{r−s−1} = χ²_{ab−(a+b−2)−1} = χ²_{(a−1)(b−1)}.
The likelihood factors:
∏ᵢⱼ (pᵢ qⱼ)^{Nᵢⱼ} = ( ∏ᵢ pᵢ^{Nᵢ₊} ) ( ∏ⱼ qⱼ^{N₊ⱼ} ),
where Nᵢ₊ = Σⱼ Nᵢⱼ and N₊ⱼ = Σᵢ Nᵢⱼ. Maximize each factor to maximize the product. Use Lagrange multipliers to solve the constrained maximization:
max_p min_λ Σᵢ Nᵢ₊ log pᵢ − λ(Σᵢ pᵢ − 1),
which gives p̂ᵢ = Nᵢ₊/n, and similarly q̂ⱼ = N₊ⱼ/n.
T = Σᵢⱼ (Nᵢⱼ − n p̂ᵢ q̂ⱼ)²/(n p̂ᵢ q̂ⱼ). Decision rule:
δ = {H₁ : T ≤ c; H₂ : T > c},
where c is chosen from the chi-square distribution with (a − 1)(b − 1) degrees of freedom, at level of significance α = tail area.
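A sketch (assuming scipy) applying this test to the finances-by-income table above; it reproduces the hand computation that follows.

    import numpy as np
    from scipy.stats import chi2, chi2_contingency

    # Rows: worse / same / better; columns: < 20K / 20K-30K / > 30K
    table = np.array([[20, 24, 14],
                      [15, 27, 22],
                      [12, 32, 23]])

    T, p_value, dof, expected = chi2_contingency(table)
    print(T, dof, p_value)                     # T is about 5.21 with (3-1)(3-1) = 4 d.o.f.
    print(chi2.ppf(0.95, df=dof))              # about 9.49, so H1 (independence) is accepted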
From the above example:
N₁₊ = 47, N₂₊ = 83, N₃₊ = 59 (income groups), N₊₁ = 58, N₊₂ = 64, N₊₃ = 67 (worse/same/better), n = 189.
Each cell contributes a term to the T statistic:
T = (20 − 58·47/189)²/(58·47/189) + ... = 5.210.
Is T too large? T ≈ χ²_{(3−1)(3−1)} = χ²₄, and for this distribution c = 9.488. According to the decision rule, accept H₁, because 5.210 ≤ 9.488.
Test of homogeneity — very similar to the independence test.
There are two sampling schemes: 1. Sample from the entire population. 2. Sample from each group separately, independently between the groups. Question: is P(category j | group i) = P(category j)? This is the same as independence testing:
P(category j, group i) = P(category j) P(group i)  ⟺  P(Cⱼ|Gᵢ) = P(Cⱼ Gᵢ)/P(Gᵢ) = P(Cⱼ)P(Gᵢ)/P(Gᵢ) = P(Cⱼ).
Consider a situation where group 1 is 99% of the population, and group 2 is 1%.
You would be better off sampling separately and independently.
Say you sample 100 of each, just need to renormalize within the population.
The test now becomes a test of independence.
Example (pg. 560): 100 people were asked whether service by a fire station was satisfactory or not. Then, after a fire occurred, the same people were asked again. See if the opinions changed.

                   Before fire   After fire
    Satisfied          80            72
    Unsatisfied        20            28

But you can't analyze it this way if you are asking the same people — the columns are not independent! A better way to arrange the data:

                            Originally satisfied   Originally unsatisfied
    After, satisfied                70                       2
    After, not satisfied            10                      18

If the sample is taken from the entire population this is fine; otherwise you are taking from a dependent population. ** End of Lecture 34
Kolmogorov-Smirnov (KS) goodness-of-fit test. The chi-square test is used with discrete distributions; if the distribution is continuous, we split it into intervals and treat it as discrete. This makes the hypothesis weaker, however, as the distribution isn't characterized fully.
The KS test uses the entire distribution, and is therefore more consistent.
Hypothesis Test:
H1 : P = P0
H₂: P ≠ P₀
P₀ — continuous. In this test, the c.d.f. is used. Reminder: the c.d.f. F(x) = P(X ≤ x) goes from 0 to 1.
The c.d.f. describes the entire distribution. Approximate the c.d.f. from the data by the empirical distribution function
Fₙ(x) = (1/n) Σᵢ₌₁ⁿ I(Xᵢ ≤ x) = #(points ≤ x)/n.
By the LLN, Fₙ(x) → E I(X₁ ≤ x) = P(X₁ ≤ x) = F(x).
From the data, the empirical c.d.f. jumps by 1/n at each point, and it converges to the true c.d.f. for large n. Look at the largest difference (supremum) between the empirical c.d.f. and the actual one:
supₓ |Fₙ(x) − F(x)| → 0 as n → ∞.
By the central limit theorem, for each fixed x,
√n (Fₙ(x) − F(x)) → N(0, Var(I(X₁ ≤ x))) = N(0, p(1 − p)) with p = F(x),
so you can tell exactly how close the values should be. Define
Dₙ = √n supₓ |Fₙ(x) − F(x)|.
(a) Under H₁, Dₙ has a proper known distribution. (b) Under H₂, Dₙ → +∞: if the true F(x) is some distance ε away from F₀(x) at some point, then eventually |Fₙ(x) − F₀(x)| > ε/2 and √n |Fₙ(x) − F₀(x)| > √n ε/2 → +∞.
Under H₁ the distribution of Dₙ does not depend on F, which is what allows us to construct the KS test:
Dₙ = √n supₓ |Fₙ(x) − F(x)| = √n sup_y |Fₙ(F⁻¹(y)) − y|, where y = F(x), x = F⁻¹(y), y ∈ [0, 1],
and
Fₙ(F⁻¹(y)) = (1/n) Σᵢ I(Xᵢ ≤ F⁻¹(y)) = (1/n) Σᵢ I(F(Xᵢ) ≤ y) = (1/n) Σᵢ I(Yᵢ ≤ y),
where the Yᵢ = F(Xᵢ) are uniform on [0, 1], whatever F is.
There is also a connection with Brownian motion: the limiting distribution of Dₙ, H(t), is the distribution of the largest deviation of a particle in liquid (Brownian motion) from its starting point. Decision rule:
δ = {H₁ : Dₙ ≤ c; H₂ : Dₙ > c}.
Choose c such that the area to the right of c equals α.
Example:
Set of data points as follows
n = 10,
0.58, 0.42, 0.52, 0.33, 0.43, 0.23, 0.58, 0.76, 0.53, 0.64
H1 : P uniform on [0, 1]
Step 1: Arrange in increasing order.
0.23, 0.33, 0.42, 0.43, 0.52, 0.53, 0.58, 0.58, 0.64, 0.76
Step 2: Find the largest difference.
Compare the c.d.f. with data.
Note: the largest difference will occur just before or just after a jump, so only consider those end points. Here F(x) = x under H₁:

    x:              0.23   0.33   0.42   ...
    F(x):           0.23   0.33   0.42   ...
    Fn(x) before:   0      0.1    0.2    ...
    Fn(x) after:    0.1    0.2    0.3    ...

Differences |Fₙ(x) − F(x)|: before the jump 0.23, 0.23, 0.22, ...; after the jump 0.13, 0.13, 0.12, ...
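A sketch (assuming scipy) reproducing this comparison; note that scipy's kstest reports the unscaled statistic sup|Fₙ − F| and its own p-value, rather than the scaled statistic √n·D used in the notes.

    import numpy as np
    from scipy.stats import kstest

    data = [0.58, 0.42, 0.52, 0.33, 0.43, 0.23, 0.58, 0.76, 0.53, 0.64]

    D, p_value = kstest(data, "uniform")       # sup |F_n(x) - x| against the uniform [0, 1] c.d.f.
    print(D, np.sqrt(len(data)) * D, p_value)  # D = 0.26 and sqrt(10) * D = 0.82, as found by hand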
The largest difference occurs near the end: |0.9 − 0.64| = 0.26, so Dₙ = √10 (0.26) = 0.82. Decision rule: δ = {H₁ : 0.82 ≤ c; H₂ : 0.82 > c}; c for α = 0.05 is 1.35. Conclusion: accept H₁. ** End of Lecture 35
Problem 5:
Confidence intervals — keep in mind the formulas!
x̄ − c √(σ̂²/(n − 1)) ≤ μ ≤ x̄ + c √(σ̂²/(n − 1)), where σ̂² = (1/n)Σ xᵢ² − (x̄)².
Find c from the t distribution with n − 1 degrees of freedom.
n/θ + n − Σ log xᵢ = 0, so θ̂ = n/(Σ log xᵢ − n).
Set up c so that the area between −c and c equals the confidence level; in this example c = 1.833. Confidence interval for the variance:
n σ̂²/c₂ ≤ σ² ≤ n σ̂²/c₁,
with c₁, c₂ from the chi-square distribution with n − 1 degrees of freedom.
Posterior distribution: Γ(α + n, β − n + Σ log xᵢ).
For a gene with three alleles, P(A₁) = θ₁, P(A₂) = θ₂, P(A₃) = 1 − θ₁ − θ₂, and six genotype categories with counts N₁, ..., N₆:
T = Σ (Nᵢ − np̂ᵢ)²/(np̂ᵢ) ≈ χ²_{r−s−1} = χ²_{6−2−1} = χ²₃.
Likelihood:
φ(θ₁, θ₂) = (θ₁²)^{N₁} (θ₂²)^{N₂} ((1 − θ₁ − θ₂)²)^{N₃} (2θ₁θ₂)^{N₄} (2θ₁(1 − θ₁ − θ₂))^{N₅} (2θ₂(1 − θ₁ − θ₂))^{N₆}
= 2^{N₄+N₅+N₆} θ₁^{2N₁+N₄+N₅} θ₂^{2N₂+N₄+N₆} (1 − θ₁ − θ₂)^{2N₃+N₅+N₆}.
Set the partial derivatives of the log-likelihood to zero and solve for θ₁, θ₂.
Decision rule: δ = {H₁ : T ≤ c; H₂ : T > c}. Find c from the chi-square distribution with r − s − 1 degrees of freedom, with area α above c: c = 7.815.
Problem 5:
There are 4 blood types (O, A, B, AB)
There are 2 Rhesus factors (+, -)
Test for independence:
             +      -    total
    O       82     13      95
    A       89     27     116
    B       54      7      61
    AB      19      9      28
    total  244     56     300

T = (82 − 95·244/300)²/(95·244/300) + ... ; find the T statistic summing over all 8 cells. T ≈ χ²_{(a−1)(b−1)} = χ²₃, and the test is the same as before. ** End of Lecture 36
1 u 1 xi ) u MLE n
Prior: f(θ) = (192/θ⁴) I(θ ≥ 4).
Data: X₁ = 5, X₂ = 3, X₃ = 8 from the uniform distribution on [0, θ]. Posterior: f(θ|x₁, ..., xₙ) ∝ f(x₁, ..., xₙ|θ) f(θ), where
f(x₁, ..., xₙ|θ) = (1/θⁿ) I(0 ≤ all x's ≤ θ) = (1/θⁿ) I(max(X₁, ..., Xₙ) ≤ θ).
So f(θ|x₁, ..., xₙ) ∝ (1/θ^{n+4}) I(θ ≥ 4) I(max(x₁, ..., xₙ) ≤ θ) = (1/θ^{n+4}) I(θ ≥ 8).
Find the constant so it integrates to 1: with n = 3,
1 = c ∫₈^∞ θ^{−7} dθ = c/(6·8⁶), so c = 6·8⁶.
3. Two observations (X₁, X₂) from f(x):
H₁: f(x) = 1/2 for 0 ≤ x ≤ 2;
H₂: f(x) = 1/3 for 0 ≤ x ≤ 1 and 2/3 for 1 < x ≤ 2;
H₃: f(x) = 3/4 for 0 ≤ x ≤ 1 and 1/4 for 1 < x ≤ 2.
Find the decision rule δ that minimizes α₁(δ) + 2α₂(δ) + 2α₃(δ), where αᵢ(δ) = P(δ ≠ Hᵢ|Hᵢ), i.e. take weights ξ(i) proportional to 1, 2, 2. The decision rule picks the hypothesis with the largest ξ(i)fᵢ(x₁, ..., xₙ) in each region:
Compare ξ(i)fᵢ(x₁)fᵢ(x₂) over the three regions (using weights 1, 2, 2):

                    both x1, x2 in [0, 1]   one in [0, 1], one in (1, 2]   both in (1, 2]
    H1:                    1/4                          1/4                      1/4
    H2:                    2/9                          4/9                      8/9
    H3:                    9/8                          3/8                      1/8

The largest value in each column determines the choice.
Decision Rule:
δ = {H₁ : never picked; H₂ : both observations in (1, 2], or one in [0, 1] and one in (1, 2]; H₃ : both in [0, 1]}.
If there are two hypotheses: choose H₁ if f₁(x)/f₂(x) > ξ(2)/ξ(1), choose H₂ if <, either if equal.
4.
f(x|θ) = (1/(x√(2π))) e^{−(ln x − θ)²/2} for x ≥ 0, and 0 for x < 0.
If X has this distribution, find the distribution of ln X. Let Y = ln X. The c.d.f. of Y is
P(Y ≤ y) = P(ln X ≤ y) = P(X ≤ e^y) = ∫₀^{e^y} f(x) dx.
However, you don't need to carry out the integral: differentiating in y, the p.d.f. of Y is
f(e^y) · e^y = (1/(e^y √(2π))) e^{−(ln e^y − θ)²/2} · e^y = (1/√(2π)) e^{−(y − θ)²/2},
so Y = ln X ~ N(θ, 1).
Now test H₁: θ = −1 against H₂: θ = 1 with the most powerful test, δ = {H₁ : f₁(x)/f₂(x) ≥ c; H₂ : otherwise}, where
f₁(x)/f₂(x) = [ ∏ (1/(xᵢ√(2π))) e^{−(ln xᵢ + 1)²/2} ] / [ ∏ (1/(xᵢ√(2π))) e^{−(ln xᵢ − 1)²/2} ] = e^{−2 Σ ln xᵢ},
so the rule reduces to Σ ln xᵢ ≤ c'. Under H₁, Σ ln xᵢ ~ N(−n, n), and with n = 10 and α₁ = 0.05,
(c' + n)/√n = 1.64, c' = 1.64√10 − 10 ≈ −4.81.
Power = 1 − type 2 error = 1 − P₂(δ ≠ H₂) = 1 − P₂(Σ ln xᵢ ≤ c')
= 1 − P₂( (Σ ln xᵢ − n)/√n ≤ (−4.81 − 10)/√10 ) ≈ 1.
6. H₁: p₁ = θ/2, p₂ = θ/3, p₃ = 1 − 5θ/6, θ ∈ [0, 1]; H₂: otherwise.
Step 1) Find the MLE θ̂.
Step 2) Set p̂₁ = θ̂/2, p̂₂ = θ̂/3, p̂₃ = 1 − 5θ̂/6.
Step 3) Calculate the T statistic:
T = Σᵢ₌₁ʳ (Nᵢ − np̂ᵢ)²/(np̂ᵢ) ≈ χ²_{r−s−1} = χ²_{3−1−1} = χ²₁.
φ(θ) = (θ/2)^{N₁} (θ/3)^{N₂} (1 − 5θ/6)^{N₃},
log φ(θ) = (N₁ + N₂) log θ + N₃ log(1 − 5θ/6) − N₁ log 2 − N₂ log 3 → max.
Setting the derivative to zero: (N₁ + N₂)/θ − (5/6)N₃/(1 − 5θ/6) = 0, i.e. (N₁ + N₂)(1 − 5θ/6) − (5/6)N₃θ = 0, so
θ̂ = (6/5)(N₁ + N₂)/n.
Compute the statistic: T = 0.586. δ = {H₁ : T ≤ 3.841; H₂ : T > 3.841}: accept H₁.
7. n = 17, x̄ = 3.2, σ̂² = 0.09, from N(μ, σ²). H₁: μ ≤ 3; H₂: μ > 3 at α = 0.05.
T = (3.2 − 3)/√((1/16)(0.09)) = 2.67 ~ t_{n−1} = t₁₆.
Choose decision rule from the chi-square table with 17-1 degrees of freedom:
= 12.1
T ≈ χ²_{(a−1)(b−1)} = χ²_{3·2} = χ²₆. At α = 0.05, c = 12.59: δ = {H₁ : T ≤ 12.59; H₂ : T > 12.59}. Accept H₁. But note that if the level changes to 0.10, then c = 10.64 and we would reject H₁.
9.
f(x) = 1/2 · I(0 ≤ x ≤ 2), so F(x) = ∫₀ˣ f(t) dt = x/2 for 0 ≤ x ≤ 2.
Tabulate, as in the earlier example, x, F(x), Fₙ(x) just before and just after each jump, and the differences |F(x) − Fₙ(x)| in both cases.
The maximum difference is max |F(x) − Fₙ(x)| = 0.295; c for α = 0.05 is 1.35, and
Dₙ = √10 (0.295) = 0.933.
δ = {H₁ : 0.933 ≤ 1.35; H₂ : 0.933 > 1.35}: accept H₁. ** End of Lecture 37
18.05. Practice test 1.
(1) Suppose that 10 cards, of which five are red and five are green, are placed at random in 10 envelopes, of which five are red and five are green. Determine the probability that exactly two envelopes will contain a card with a matching color.
(2) Suppose that a box contains one fair coin and one coin with a head on each side. Suppose that a coin is selected at random and that when it is tossed three times, a head is obtained three times. Determine the probability that the coin is the fair coin.
(3) Suppose that either of two instruments might be used for making a certain measurement. Instrument 1 yields a measurement whose p.d.f. is f1 (x) =
Suppose that one of the two instruments is chosen at random and a measurement X is made with it. (a) Determine the marginal p.d.f. of X. (b) If X = 1/4, what is the probability that instrument 1 was used?
(4) Let Z be the rate at which customers are served in a queue. Assume that Z has p.d.f. f(z) = 2e^{−2z} for z > 0, and 0 otherwise. Find the p.d.f. of the average waiting time T = 1/Z.
(5) Suppose that X and Y are independent random variables with the p.d.f. f(x) = e^{−x} for x > 0, and 0 otherwise.
18.05. Practice test 2.
(1) page 280, No. 5. (2) page 291, No. 11. (3) page 354, No. 10.
(4) Suppose that X₁, ..., Xₙ form a random sample from a distribution with p.d.f. f(x|θ) = e^{θ−x} for x ≥ θ, and 0 for x < θ. Find the MLE of the unknown parameter θ.
(5) page 415, No. 7. (Also compute the 90% confidence interval for σ².)
Extra practice page 196, No. page 346, No. page 396, No. page 409, No. page 415, No.
18.05. Test 1.
(1) Consider the events A = {HHH at least once} and B = {TTT at least once}. We want to find the probability P(A ∩ B). The complement of A ∩ B is Aᶜ ∪ Bᶜ, i.e. no TTT or no HHH, and P(A ∩ B) = 1 − P(Aᶜ ∪ Bᶜ). To find the last one we can use the probability of a union formula
P(Aᶜ ∪ Bᶜ) = P(Aᶜ) + P(Bᶜ) − P(Aᶜ ∩ Bᶜ).
The probability of Aᶜ, i.e. no HHH, means that on each toss we don't get HHH. The probability not to get HHH on one toss is 7/8 and therefore P(Aᶜ) = (7/8)¹⁰. The same for P(Bᶜ). The probability of Aᶜ ∩ Bᶜ, i.e. no HHH and no TTT, means that on each toss we get neither; the probability of that on one toss is 6/8 and, therefore, P(Aᶜ ∩ Bᶜ) = (6/8)¹⁰. Finally, we get
P(A ∩ B) = 1 − (7/8)¹⁰ − (7/8)¹⁰ + (6/8)¹⁰.
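A quick numerical check of this answer (plain Python); the Monte Carlo part is just a sanity check with simulated tosses.

    import random

    exact = 1 - 2 * (7/8)**10 + (6/8)**10
    print(exact)                               # about 0.53

    random.seed(0)
    trials = 100_000
    hits = 0
    for _ in range(trials):
        tosses = ["".join(random.choice("HT") for _ in range(3)) for _ in range(10)]
        hits += ("HHH" in tosses) and ("TTT" in tosses)
    print(hits / trials)                       # should be close to the exact value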
(2) We have P (F ) = P (M ) = 0.5, P (CB |M ) = 0.05 and P (CB |F ) = 0.0025. Using Bayes formula, P (M |CB ) = P (CB |M )P (M ) 0.05 0.5 = P (CB |M )P (M ) + P (CB |F )P (F ) 0.05 0.5 + 0.0025 0.5
which is defined only when f₁(x) > 0. To find f₁(x) we have to integrate out y, i.e. f₁(x) = ∫ f(x, y) dy.
To find the limits we notice that for a given x, 0 < y² < 1 − x², which is non-empty only if x² < 1, i.e. −1 < x < 1. Then −√(1 − x²) < y < √(1 − x²). So if −1 < x < 1 we get
f₁(x) = ∫_{−√(1−x²)}^{√(1−x²)} c(x² + y²) dy = c(x²y + y³/3) |_{−√(1−x²)}^{√(1−x²)} = 2c( x²√(1 − x²) + (1/3)(1 − x²)^{3/2} ).
(4) Let us find the c.d.f. first:
P(Y ≤ y) = P(max(X₁, X₂) ≤ y) = P(X₁ ≤ y, X₂ ≤ y) = P(X₁ ≤ y) P(X₂ ≤ y).
The c.d.f. of X₁ and X₂ is P(X₁ ≤ y) = ∫_{−∞}^{y} f(x) dx. If y ≤ 0 this is
P(X₁ ≤ y) = ∫_{−∞}^{y} eˣ dx = eˣ |_{−∞}^{y} = e^y,
and if y > 0 this is
P(X₁ ≤ y) = ∫_{−∞}^{0} eˣ dx = eˣ |_{−∞}^{0} = 1.
Finally, the c.d.f. of Y is
P(Y ≤ y) = { e^{2y}, y ≤ 0; 1, y > 0 }.
(Figure 1: the region {x ≤ zy} intersected with the unit square, for z ≤ 1 and for z > 1.)
(5) Let us find the c.d.f. of Z = X/Y first. Note that for X, Y ∈ (0, 1), Z can take only values > 0, so let z > 0. Then
P(Z ≤ z) = P(X/Y ≤ z) = P(X ≤ zY) = ∫∫_{{x ≤ zy}} f(x, y) dx dy.
To find the limits, we have to consider the intersection of the set {x ≤ zy} with the square 0 < x < 1, 0 < y < 1. When z ≤ 1,
∫₀¹ ∫₀^{zy} (x + y) dx dy = ∫₀¹ (x²/2 + xy) |₀^{zy} dy = ∫₀¹ (z²/2 + z) y² dy = z²/6 + z/3.
When z ≥ 1 the limits are different:
∫₀¹ ∫_{x/z}^{1} (x + y) dy dx = ∫₀¹ (xy + y²/2) |_{x/z}^{1} dx = 1 − 1/(6z²) − 1/(3z).
So the c.d.f. of Z is
P(Z ≤ z) = { z²/6 + z/3, 0 < z ≤ 1; 1 − 1/(6z²) − 1/(3z), z > 1 },
and the p.d.f. is
f(z) = { z/3 + 1/3, 0 < z ≤ 1; 1/(3z³) + 1/(3z²), z > 1 }.
18.05. Test 2.
(1) Let X be the player's fortune after one play. Then
P(X = 2c) = 1/2 and P(X = c/2) = 1/2,
and the expected value is E X = 2c·(1/2) + (c/2)·(1/2) = (5/4)c.
Repeating this n times, the expected fortune after n plays is (5/4)ⁿ c.
(2) Let Xᵢ, i = 1, ..., n = 1000 be the indicators of getting heads. Then Sₙ = X₁ + ... + Xₙ is the total number of heads. We want to find k such that P(440 ≤ Sₙ ≤ k) ≈ 0.5. Since μ = E Xᵢ = 0.5 and σ² = Var(Xᵢ) = 0.25, by the central limit theorem
Z = (Sₙ − nμ)/(σ√n) = (Sₙ − 500)/√250
is approximately standard normal, i.e.
P(440 ≤ Sₙ ≤ k) = P( (440 − 500)/√250 = −3.79 ≤ Z ≤ (k − 500)/√250 ) ≈ Φ((k − 500)/√250) − Φ(−3.79) = 0.5.
From the table we find that Φ(−3.79) ≈ 0.0001 and therefore Φ((k − 500)/√250) ≈ 0.5. Using the table once again we get (k − 500)/√250 ≈ 0 and k ≈ 500.
(3) The likelihood function is
φ(θ) = θⁿ e^{nθ} / (∏ Xᵢ)^{θ+1}.
The prior is the gamma density f(θ) = (β^α/Γ(α)) θ^{α−1} e^{−βθ}. Therefore the posterior is proportional to (as usual, we keep track only of the terms that depend on θ)
f(θ|X₁, ..., Xₙ) ∝ θ^{α−1} e^{−βθ} · θⁿ e^{nθ}/(∏ Xᵢ)^{θ+1} ∝ θ^{α+n−1} e^{−βθ + nθ − θ Σ log Xᵢ} = θ^{(α+n)−1} e^{−(β − n + Σ log Xᵢ)θ}.
This shows that the posterior is again a gamma distribution, with parameters
(α + n, β − n + Σ log Xᵢ),
and the Bayes estimator (posterior mean) is θ̂ = (α + n)/(β − n + Σ log Xᵢ).
(5) The confidence interval for μ is
X̄ − c √(σ̂²/(n − 1)) ≤ μ ≤ X̄ + c √(σ̂²/(n − 1)), where σ̂² = (1/n)Σ Xᵢ² − (X̄)²,
and the c that corresponds to 90% confidence is found from the condition t₁₀₋₁((−c, c)) = 0.9, i.e. t₉((−∞, c)) = 0.95, so c = 1.833. The confidence interval for σ² is
n σ̂²/c₂ ≤ σ² ≤ n σ̂²/c₁,
where c₁, c₂ satisfy χ²₁₀₋₁((−∞, c₁)) = 0.05 and χ²₁₀₋₁((−∞, c₂)) = 0.95,