
Random variables

9.07

2/19/2004

A few notes on the homework

If you work together, tell us who you're working with.
You should still be generating your own homework solutions. Don't just copy from your partner. We want to see your own words.
Turn in your MATLAB code (this helps us give you partial credit).
Label your graphs, for example:

xlabel('text')
ylabel('text')
title('text')
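A minimal sketch of a labeled plot (the variable names and label text here are just placeholders):

t = 0:0.1:10;
y = sin(t);
plot(t, y)
xlabel('time (s)')               % label for the horizontal axis
ylabel('amplitude')              % label for the vertical axis
title('Example: a labeled plot')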

More homework notes

Population vs. sample

The population to which the researcher wants to generalize can be considerably broader than might be implied by the narrow sample:

High school students who take the SAT
High school students
Anyone who wants to succeed
Anyone

More homework notes

MATLAB:

If nothing else, if you can't figure out something in MATLAB, find/email a TA, or track down one of the zillions of fine web tutorials.
Some specifics

MATLAB

Hint: MATLAB works best if you can think of your problem as an operation on a matrix. Do this instead of for loops, when possible.
E.g., the coin-flip example without for loops:

x = rand(5,10000);           % 5 flips per experiment, 10000 experiments
coinflip = x > 0.5;          % 1 = heads
numheads = sum(coinflip);    % num H in 5 flips, for each experiment
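For comparison, a sketch of the same computation written with a for loop; it produces the same numheads, just more slowly and with more code:

numheads = zeros(1,10000);        % preallocate
for i = 1:10000
    flips = rand(1,5) > 0.5;      % one experiment: 5 coin flips, 1 = heads
    numheads(i) = sum(flips);     % number of heads in that experiment
end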

MATLAB

randn(N) -> NxN matrix!
randn(1,N) -> 1xN matrix
sum(x) vs. sum(x,2)
hist(data, 1:10) vs. hist(data, 10)
plot(hist(data)) vs. [n,x]=hist(data); plot(x,n)
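A quick sketch illustrating these differences (the sizes noted in the comments are what MATLAB returns):

a = randn(3);                  % 3x3 matrix of N(0,1) samples
b = randn(1,3);                % 1x3 row vector
x = rand(5,10);
s1 = sum(x);                   % 1x10: sums down each column
s2 = sum(x,2);                 % 5x1: sums across each row
data = ceil(10*rand(1,1000));  % made-up data: integers 1..10
hist(data, 1:10)               % bins centered at 1, 2, ..., 10
hist(data, 10)                 % 10 bins spread over the range of the data
plot(hist(data))               % plots counts against bin index, not bin centers
[n,c] = hist(data);
plot(c, n)                     % plots counts against the actual bin centers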

A few more comments

Expected value can tell you whether or not you want to play the game even once.
It tells you if the game is in your favor.

In our example of testing positive for a disease, P(D) is the prior probability that you have the disease. What was the probability of you having the disease before you got tested? If you are from a risky population, P(D) may be higher than 0.001. Before you took the test you had a higher probability of having the disease, so after you test positive, your probability of having the disease, P(D|+), will be higher than 1/20.

Random Variables

Variables that take numerical values associated with events in an experiment.
Either discrete or continuous.
(Integral, not sum, in the equations below for continuous r.v.'s.)

The mean, μ, of a random variable is the sum of each possible value multiplied by its probability:
μ = Σ xi P(xi) = E(x)
Note the relation to expected value from last time.

The variance is the average of squared deviations, each multiplied by the probability of its value:
σ^2 = Σ (xi - μ)^2 P(xi) = E((x - μ)^2)
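For example, a sketch of these formulas in MATLAB for one roll of a fair die:

x = 1:6;                          % possible values
p = ones(1,6)/6;                  % their probabilities
mu = sum(x .* p)                  % mean: 3.5
sigma2 = sum((x - mu).^2 .* p)    % variance: 2.9167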

We've already talked about a few special cases

Normal r.v.'s (with normal distributions)
Uniform r.v.'s (with a flat distribution: constant p over some range of x)
Etc.

Random variables

Can be made out of functions of other random variables.
If X and Y are r.v.'s, then
Z = X + Y is an r.v.
Z = sqrt(X) + 5Y + 2 is an r.v.

Linear combinations of random

variables

We talked about this in lecture 2. Here's a review, with new E() notation.
Assume:

E(x) = μ
E((x-μ)^2) = E(x^2 - 2μx + μ^2) = σ^2

Then:

E(x+5) = E(x) + E(5) = E(x) + 5 = μ + 5 = μ_new
E((x+5-μ_new)^2) = E(x^2 + 2(5-μ_new)x + (5-μ_new)^2)
                 = E(x^2 - 2μx + μ^2) = σ^2 = (σ_new)^2

Adding a constant to x adds that constant to μ, but leaves σ unchanged.

Linear combinations of random

variables

E(2x) = 2E(x) = 2μ = μ_new
E((2x-μ_new)^2) = E(4x^2 - 8μx + 4μ^2) = 4σ^2 = (σ_new)^2
so σ_new = 2σ

Scaling x by a constant scales both μ and σ by that constant. But...

Multiplying by a negative constant

E(-2x) = -2E(x) = -2μ = μ_new
E((-2x-μ_new)^2) = E((-2x+2μ)^2) = E(4x^2 - 8μx + 4μ^2) = 4σ^2 = (σ_new)^2
so σ_new = 2σ

Scaling by a negative number multiplies the mean by that number, but multiplies the standard deviation by |that number|.
(Standard deviation is always positive.)
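A quick empirical check of these rules (just a sketch; the sample mean and standard deviation will only be approximately equal to the values predicted above):

x = 10 + 3*randn(1,100000);    % samples with mu ~= 10, sigma ~= 3
[mean(x+5)  std(x+5)]          % ~ [15 3]: adding 5 shifts the mean only
[mean(2*x)  std(2*x)]          % ~ [20 6]: scaling by 2 scales both
[mean(-2*x) std(-2*x)]         % ~ [-20 6]: the std stays positive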

What happens to z-scores when you

apply a transformation?

Changes in scale or shift do not change standard units, i.e., z-scores.
When you transform to z-scores, you're already subtracting off any mean and dividing by any standard deviation. If you change the mean or standard deviation by a shift or scaling, the new mean (std. dev.) just gets subtracted off (divided out).

Special case: Normal random

variables

Can use z-tables to figure out the area under part of a normal curve.

An example of using the table

[Figure: standard normal curve with the area below z = -0.75 and above z = +0.75 shaded: what % lies out here and here?]

Excerpt from the z-table (Area = % of the curve between -z and +z):

z      Height   Area (%)
0.70   31.23    51.61
0.75   30.11    54.67
0.80   28.97    57.63

P(-0.75 < z < 0.75) = 0.5467
P(z < -0.75 or z > 0.75) = 1 - 0.5467 ≈ 0.45
That's our answer.
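If you'd rather not use the tables, the same areas can be computed in MATLAB. This sketch uses only the built-in erf; normcdf (from the Statistics Toolbox, if you have it) would do the same job:

Phi = @(z) 0.5*(1 + erf(z/sqrt(2)));    % standard normal CDF
between = Phi(0.75) - Phi(-0.75)        % ~0.5467
tails = 1 - between                     % ~0.45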

Another way to use the z-tables

Mean SAT score = 500, std. deviation = 100.
Assuming that the distribution of scores is normal, what is the score such that 95% of the scores are below that value?

[Figure: normal curve with 95% of the area below the cutoff, 5% above; z = ?]

Using z-tables to find the 95th percentile point

[Figure: 90% of the area lies between -z and +z, leaving 5% in each tail.]

From the tables:

z      Height   Area (%)
1.65   10.23    90.11

z = 1.65 -> x = ?  Mean = 500, s.d. = 100
1.65 = (x - 500)/100;  x = 165 + 500 = 665
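The same cutoff can be computed without the tables; a sketch using the built-in erfinv (norminv from the Statistics Toolbox would also work):

z95 = sqrt(2)*erfinv(2*0.95 - 1)   % ~1.645, the z with 95% of the area below it
x95 = 500 + 100*z95                % ~664.5, close to the 665 from the table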

Normal distributions

A lot of data is normally distributed because of the central limit theorem from last time.
Data that are influenced by (i.e., the sum of) many small and unrelated random effects tend to be approximately normally distributed.
E.g., weight (I'm making up these numbers):

Overall average = 120 lbs for adult women
Women add about 1 lb/year after age 29
Illness subtracts an average of 5 lbs
Genetics can make you heavier or thinner
A given sample of weight is influenced by being an adult woman, age, health, genetics, ...

Non-normal distributions

For data that are approximately normally distributed, we can use the normal approximation to get useful information about the percent of area under some fraction of the distribution.
For non-normal data, what do we do?

Non-normal distributions

E.g., income distributions tend to be very skewed.
We can use percentiles, much like in the last z-table example (except without the tables).
What's the 10th percentile point? The 25th percentile point?

Percentiles & interquartile range

Divide the data into 4 groups, and see how far apart the extreme groups are.
Median = 50th percentile
Q1 = 25th percentile
Q3 = 75th percentile
Q3 - Q1 = IQR = 75th percentile - 25th percentile

What do you do for other

percentiles?

Median = point such that 50% of the data lies below that point.
Similarly, 10th percentile = point such that 10% of the data lies below that point.

What do you do for other

percentiles?

If you have a theory for the distribution of the data, you can use that to find the nth percentile.
Or estimate it from the data, using MATLAB (to a first approximation), where x is the data:

y = sort(x);                     % put the data in order
N = length(x);                   % how many data points there are
TenthPerc = y(round(0.10*N));    % value with ~10% of the data below it

This isn't exactly right (remember, for instance, that the median of (1 2 4 6) is 3), but it's close enough for our purposes.
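If you have the Statistics Toolbox, you can compare this rough estimate against MATLAB's own percentile function; a sketch:

x = randn(1,1000);                 % some made-up data
y = sort(x);
rough = y(round(0.10*length(x)))   % the rough estimate from above
better = prctile(x, 10)            % Statistics Toolbox version (interpolates)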

How do you judge if a distribution is

normal?

So far we've been eyeballing it. (Does it look symmetric? Is it about the right shape?) Can we do better than this?

Normal quantile Plots

A useful way to judge whether or not a set of samples comes from a normal distribution.
We'll still be eyeballing it, but with a more powerful visualization.

Normal quantile plots

For each datum, what % of the data is below this value, i.e., what's its percentile?
If this were a normal distribution, what z would correspond to that percentile?
Compare the actual data values to those predicted (from the percentiles) if it were a standard normal (z) distribution.

[Figure: a data value in a histogram, and the corresponding z = ? on a standard normal curve.]

Normal quantile plots

If the data ~ N(0, 1), the points should fall on a 45-degree line through the origin.
If the data ~ N(μ, 1), the points should fall on a 45-degree line.
If the data ~ N(μ, σ), the points will fall on a line with slope σ (or 1/σ, depending on how you plotted it).

[Figure: data vs. z for data ~ N(μ, σ); z axis marked -1, 0, 1.]

Normal Quantile Plots

Basic idea:

Order the samples from smallest to largest. Assume you have N samples. Renumber the ordered samples {x1, x2, ..., xN}.
Each sample xi has a corresponding percentile ki = (i - 0.5)/N. About a fraction ki of the data in the sample is < xi.
If the distribution is normal, we can look up ki in the z-tables and get a corresponding value zi.
Plot xi vs. zi (it doesn't matter which is on which axis).
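Putting those steps together, a sketch of building a normal quantile plot by hand (qqplot, mentioned below, does roughly the same thing for you):

x = 20 + 5*randn(1,200);         % some data to test for normality
N = length(x);
xs = sort(x);                    % ordered samples x1 <= ... <= xN
k = ((1:N) - 0.5)/N;             % fraction of the data at or below each xi
z = sqrt(2)*erfinv(2*k - 1);     % z value with that fraction of the area below it
plot(z, xs, '.')
xlabel('z-score')
ylabel('data')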

[Figure: normal quantile plot, y axis vs. z-score, with some points deviating from the line. Figure by MIT OCW.]

Let's remove those outliers

[Figure: the same normal quantile plot with the outliers removed; y axis vs. z-score. Figure by MIT OCW.]

The normal quantile plot allows us to see which points deviate strongly from a line. This helps us locate outliers.

Non-linear plots

Concave-up (with the axes as shown here) means positive skew.
Concave-down means negative skew.

[Figure: normal quantile plot of Dollars Spent vs. z-score. Figure by MIT OCW.]

[Figure: normal quantile plot of Survival Time (days) vs. z-score. Figure by MIT OCW.]

[Figure: normal quantile plot of IQ Score vs. z-score. Figure by MIT OCW.]

Granularity

When the r.v. can only take on certain values, the normal quantile plot looks like funny stair steps.
E.g., binomial distributions (we'll get there in a sec).

[Figure: normal quantile plot of Distance (thousandths of an inch) vs. z-score. Figure by MIT OCW.]

Normal quantile plots in MATLAB

qqplot(x) generates a normal quantile plot for the samples in vector x.
You should have access to this command on the MIT server computers.

The binomial distribution

An important special case of a probability distribution, and one of the most frequently encountered distributions in statistics.
There are two possible outcomes on each trial, e.g. {H, T}.
One outcome is designated a "success", the other a "failure".
The binomial distribution is the distribution of the number of successes in N trials.
E.g., the distribution of the number of heads when you flip the coin 10 times.

Example

Flip a fair coin 6 times. What is P(4H, 2T)?
Well, first, note that P(TTHHHH) = P(THHHHT) = ... = (0.5)^4 (1-0.5)^2 = (0.5)^6.
All events with 4H have the same probability.
How many such events are there?
P(4H, 2T) = (# events of this type) x (0.5)^4 (1-0.5)^2

How many events of this type are

there? The binomial coefficient

Equals the number of possible combinations of N draws such that you have k successes:

(N choose k) = N! / (k! (N-k)!)

N! = N factorial = N(N-1)(N-2)...(1)
   = factorial(N) in MATLAB
0! = 1

Intuition for the binomial coefficient

N! = number of possible ways to arrange N unique items, e.g. 6 items (a,b,c,d,e,f):
6 choices for the 1st slot, 5 remain for the 2nd slot, etc.
But the items aren't all unique: k of them are the same (successes), and the remaining (N-k) are the same (failures).
k! and (N-k)! are the number of duplicate arrangements you get from having k and N-k items be the same.
The result, N!/(k!(N-k)!), is the number of distinct arrangements with k successes.

Binomial coefficient

Number of ways of getting k heads in N tosses.
Number of ways of drawing 2 R balls out of 5 draws, with p(R) = 0.1.
Number of ways of picking 2 people out of a group of 5 (less obvious):
Associate an indicator function with each person: = 1 if picked, 0 if not.
p(p1 = 1) is like p(toss 1 = H).

The Binomial distribution

Probability of k successes in N tries.
Applies to repeatable sampling of a binomial variable (e.g., tossing a coin), where you decide the number of samples in advance.
(Versus: "I keep drawing a ball until I get 2 reds, then I quit. What was my probability of getting 2R and 3G?")

Three critical properties:
The result of each trial may be either a failure or a success.
The probability of success is the same for each trial.
The trials are independent.

Back to tossing coins

The coin-toss experiment is an example of a binomial process.
Let's arbitrarily designate heads as a "success".
p(heads) = 0.5
What is the probability of obtaining 4 heads in 6 tosses?

Example

P(4 H in 6 tosses) = (N choose k) p^k (1-p)^(N-k)
                   = (6 choose 4) (0.5)^4 (0.5)^2
                   = [6! / (4! 2!)] (1/64)
                   = 15/64 ≈ 0.23
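A sketch of the same calculation in MATLAB (binopdf from the Statistics Toolbox, if available, gives the same number):

N = 6; k = 4; p = 0.5;
P = nchoosek(N,k) * p^k * (1-p)^(N-k)   % 15/64, about 0.2344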

Kangaroo example from book

10 pairs of kangaroos.
Half of them get vitamins.
10 races (vitamin vs. no vitamin).
In 7 out of 10 races, the kangaroo taking vitamins wins.
Do the vitamins help, or is this just happening by chance?

How do we decide?

What we want to do is set a criterion number of wins, and decide that the vitamins had an effect if we see a number of wins equal to or greater than the criterion.
How do we set the criterion?
Well, what if we had set the criterion right at 7 wins? What would be our probability of saying there was an effect of the vitamins, when really the results were just due to chance?

Roo races

If we set the criterion at 7 wins, and there were no effect of vitamins, what is the probability of us concluding there was an effect?
The probability of the vitamin roo winning, if vitamins don't matter, is p = 0.5.
What is the probability, in this case, of 7 wins, or 8, or 9, or 10?

Roo races

P(7 wins out of 10) + P(8 wins out of 10) + P(9 wins out of 10) + P(10 wins out of 10)
Use the binomial formula, from before.
≈ 17% (see problem 6, p. 258; answer on p. A-71)
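A sketch of that sum in MATLAB, using only the binomial formula:

N = 10; p = 0.5; k = 7:10;
P = sum(arrayfun(@(j) nchoosek(N,j), k) .* p.^k .* (1-p).^(N-k))   % ~0.17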

Roo races

Remember, this is the probability of us thinking there was an effect, when there actually wasn't, if we set the criterion at 7 wins.
17% is a pretty big probability of error. (In statistics we like numbers more like 5%, 1%, or 0.1%.)
We probably wouldn't want to set the criterion at 7 wins. Maybe 8 or 9 would be better.
We decide that the vitamins probably have no effect.

We'll see LOTS more problems like the kangaroo problem in this class.
And this whole business of setting a criterion will become more familiar and intuitive.
For now, back to binomial random variables.

Mean and variance of a binomial

random variable

The mean number of successes in a binomial experiment is given by:
μ = np
where n is the number of trials and p is the probability of success.

The variance is given by:
σ^2 = npq
where q = 1 - p.
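A quick simulation check of these formulas (a sketch; the sample mean and variance will only be approximately np and npq):

n = 10; p = 0.5;
flips = rand(n, 50000) < p;    % each column is one binomial experiment
numsucc = sum(flips);          % number of successes in each experiment
mean(numsucc)                  % ~ np = 5
var(numsucc)                   % ~ npq = 2.5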

What happens to the binomial

distribution as you toss the coin

more times?

[Probability histograms: number of heads for 3 coins, 4 coins, and 10 coins.]

[Binomial distributions with p = .05: p(x) vs. number of successes for n = 10, n = 50, and n = 100.]

The central limit theorem, again

As the number of tosses goes up, the binomial distribution approximates a normal distribution.
The total number of heads in 100 coin tosses = the number in the first 5 tosses + the number in the next 5 tosses + ...
Thus, a binomial process can be thought of as the sum of a bunch of independent processes, the central limit theorem applies, and the distribution approaches normal for a large number of coin tosses (= trials).

The normal approximation

This means we can use z-tables to answer questions about binomial distributions!

Normal Approximation

When is it OK to use the normal approximation?
Use it when n is large and p isn't too far from 0.5.
The further p is from 0.5, the larger n you need.
Rule of thumb: use it when np ≥ 10 and nq ≥ 10.

Normal Approximation

For any value of p, the binomial distribution of n trials with probability p is approximated by the normal curve with
μ = np and σ = sqrt(npq),
where q = (1-p).

Let's try it for 25 coin flips...

25 coin flips

What is the probability that the number of heads is ≤ 14?
We can calculate from the binomial formula that P(x ≤ 14) is .7878 exactly.

Normal Approximation

Using the normal approximation with
μ = np = (25)(.5) = 12.5 and
σ = sqrt(npq) = sqrt((25)(.5)(.5)) = 2.5, we get

P(x ≤ 14) ≈ P(z ≤ (14 - 12.5)/2.5) = P(z ≤ .6) = .7257

.7878 vs. .7257 -- not great!!
We need a better approximation...

[Figure: Normal Approximation of Binomial Distribution; probability vs. number of successes (0 to 25).]

Continuity Correction

Notice that the bars are centered on the numbers.
This means that P(x ≤ 14) is actually the area under the bars to the left of x = 14.5.
We need to account for the extra 0.5:
P(x ≤ 14.5) ≈ P(z ≤ .8) = .7881 -- a much better approximation!
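A sketch checking these numbers in base MATLAB (binocdf and normcdf from the Statistics Toolbox would do the same, more directly):

n = 25; p = 0.5; k = 0:14;
exact = sum(arrayfun(@(j) nchoosek(n,j), k) .* p.^k .* (1-p).^(n-k))  % ~0.7878
mu = n*p; sigma = sqrt(n*p*(1-p));
Phi = @(z) 0.5*(1 + erf(z/sqrt(2)));     % standard normal CDF
Phi((14 - mu)/sigma)                     % ~0.7257, no continuity correction
Phi((14.5 - mu)/sigma)                   % ~0.7881, with continuity correction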

[Figure: Continuity Correction; probability vs. number of successes (0 to 25).]

# of times you do an experiment, vs.

# of trials in that experiment

In MATLAB:

x = rand(5,10000);         % 5 flips per experiment, 10000 experiments
coinflip = x > 0.5;        % 1 = heads
y = sum(coinflip);         % number of heads in each experiment

Number of coin flips = number of trials per experiment = 5 (the rows).
Number of times you do the experiment = 10000 (the columns).

Increase the number of trials per experiment, and the central limit theorem will start to apply: the distribution of the number of heads will look more normal.
Increase the number of times you do the experiment, and the empirical distribution will approach the theoretical distribution (and get less variable).

Binomial distribution and percent

You can also use the binomial distribution for the percent (proportion) of successes, by dividing by the number of samples (trials):
Mean = np/n = p
Std. deviation = sqrt(npq)/n = sqrt(pq/n)

We'll use this a lot in class, as we often have a situation like that for elections: 45% favor Kerry, 39% favor Edwards. Are these different by chance, or is there a real effect there?
