
Random variables

9.07

2/19/2004

A few notes on the homework

If you work together, tell us who you're working with.
You should still be generating your own homework solutions. Don't just copy from your partner. We want to see your own words.
Turn in your MATLAB code (this helps us give you partial credit).
Label your graphs, for example:

xlabel('text')
ylabel('text')
title('text')
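A minimal sketch of a labeled plot (the variable names and label text here are just placeholders):

t = 0:0.1:10;
y = sin(t);
plot(t, y)
xlabel('time (s)')               % label for the horizontal axis
ylabel('amplitude')              % label for the vertical axis
title('Example: a labeled plot')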

More homework notes

Population vs. sample

The population to which the researcher wants to generalize can be considerably broader than might be implied by the narrow sample:

High school students who take the SAT
High school students
Anyone who wants to succeed
Anyone

More homework notes

MATLAB:

If nothing else, if you can't figure out something in MATLAB, find/email a TA, or track down one of the zillions of fine web tutorials.
Some specifics

MATLAB

Hint: MATLAB works best if you can think of your problem as an operation on a matrix. Do this instead of for loops, when possible.
E.g., the coin-flip example without for loops:

x = rand(5,10000);           % 5 flips per experiment, 10000 experiments
coinflip = x > 0.5;          % 1 = heads
numheads = sum(coinflip);    % num H in 5 flips, for each experiment
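For comparison, a sketch of the same computation written with a for loop; it produces the same numheads, just more slowly and with more code:

numheads = zeros(1,10000);        % preallocate
for i = 1:10000
    flips = rand(1,5) > 0.5;      % one experiment: 5 coin flips, 1 = heads
    numheads(i) = sum(flips);     % number of heads in that experiment
end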

MATLAB

randn(N) -> NxN matrix!
randn(1,N) -> 1xN matrix
sum(x) vs. sum(x,2)
hist(data, 1:10) vs. hist(data, 10)
plot(hist(data)) vs. [n,x]=hist(data); plot(x,n)
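A quick sketch illustrating these differences (the sizes noted in the comments are what MATLAB returns):

a = randn(3);                  % 3x3 matrix of N(0,1) samples
b = randn(1,3);                % 1x3 row vector
x = rand(5,10);
s1 = sum(x);                   % 1x10: sums down each column
s2 = sum(x,2);                 % 5x1: sums across each row
data = ceil(10*rand(1,1000));  % made-up data: integers 1..10
hist(data, 1:10)               % bins centered at 1, 2, ..., 10
hist(data, 10)                 % 10 bins spread over the range of the data
plot(hist(data))               % plots counts against bin index, not bin centers
[n,c] = hist(data);
plot(c, n)                     % plots counts against the actual bin centers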

A few more comments

Expected value can tell you whether or not you want to play the game even once.
It tells you if the game is in your favor.

In our example of testing positive for a disease, P(D) is the prior probability that you have the disease. What was the probability of you having the disease before you got tested? If you are from a risky population, P(D) may be higher than 0.001. Before you took the test you had a higher probability of having the disease, so after you test positive, your probability of having the disease, P(D|+), will be higher than 1/20.

Random Variables

Variables that take numerical values associated with events in an experiment.
Either discrete or continuous.
(Integral, not sum, in the equations below for continuous r.v.'s.)

The mean, μ, of a random variable is the sum of each possible value multiplied by its probability:
μ = Σ xi P(xi) = E(x)
Note the relation to expected value from last time.

The variance is the average of squared deviations, each multiplied by the probability of its value:
σ^2 = Σ (xi - μ)^2 P(xi) = E((x - μ)^2)
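For example, a sketch of these formulas in MATLAB for one roll of a fair die:

x = 1:6;                          % possible values
p = ones(1,6)/6;                  % their probabilities
mu = sum(x .* p)                  % mean: 3.5
sigma2 = sum((x - mu).^2 .* p)    % variance: 2.9167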

We've already talked about a few special cases

Normal r.v.'s (with normal distributions)
Uniform r.v.'s (with a flat distribution: constant p over some range of x)
Etc.

Random variables

Can be made out of functions of other random variables.
If X and Y are r.v.'s, then
Z = X + Y is an r.v.
Z = sqrt(X) + 5Y + 2 is an r.v.

Linear combinations of random

variables

We talked about this in lecture 2. Here's a review, with new E() notation.
Assume:

E(x) = μ
E((x-μ)^2) = E(x^2 - 2μx + μ^2) = σ^2

Then:

E(x+5) = E(x) + E(5) = E(x) + 5 = μ + 5 = μ_new
E((x+5-μ_new)^2) = E(x^2 + 2(5-μ_new)x + (5-μ_new)^2)
                 = E(x^2 - 2μx + μ^2) = σ^2 = (σ_new)^2

Adding a constant to x adds that constant to μ, but leaves σ unchanged.

Linear combinations of random

variables

E(2x) = 2E(x) = 2μ = μ_new
E((2x-μ_new)^2) = E(4x^2 - 8μx + 4μ^2) = 4σ^2 = (σ_new)^2
so σ_new = 2σ

Scaling x by a constant scales both μ and σ by that constant. But...

Multiplying by a negative constant

E(-2x) = -2E(x) = -2μ = μ_new
E((-2x-μ_new)^2) = E((-2x+2μ)^2) = E(4x^2 - 8μx + 4μ^2) = 4σ^2 = (σ_new)^2
so σ_new = 2σ

Scaling by a negative number multiplies the mean by that number, but multiplies the standard deviation by |that number|.
(Standard deviation is always positive.)
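A quick empirical check of these rules (just a sketch; the sample mean and standard deviation will only be approximately equal to the values predicted above):

x = 10 + 3*randn(1,100000);    % samples with mu ~= 10, sigma ~= 3
[mean(x+5)  std(x+5)]          % ~ [15 3]: adding 5 shifts the mean only
[mean(2*x)  std(2*x)]          % ~ [20 6]: scaling by 2 scales both
[mean(-2*x) std(-2*x)]         % ~ [-20 6]: the std stays positive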

What happens to z-scores when you

apply a transformation?

Changes in scale or shift do not change standard units, i.e., z-scores.
When you transform to z-scores, you're already subtracting off any mean and dividing by any standard deviation. If you change the mean or standard deviation by a shift or scaling, the new mean (std. dev.) just gets subtracted off (divided out).

Special case: Normal random

variables

Can use z-tables to figure out the area under part of a normal curve.

An example of using the table

[Figure: standard normal curve with the area below z = -0.75 and above z = +0.75 shaded: what % lies out here and here?]

Excerpt from the z-table (Area = % of the curve between -z and +z):

z      Height   Area (%)
0.70   31.23    51.61
0.75   30.11    54.67
0.80   28.97    57.63

P(-0.75 < z < 0.75) = 0.5467
P(z < -0.75 or z > 0.75) = 1 - 0.5467 ≈ 0.45
That's our answer.
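If you'd rather not use the tables, the same areas can be computed in MATLAB. This sketch uses only the built-in erf; normcdf (from the Statistics Toolbox, if you have it) would do the same job:

Phi = @(z) 0.5*(1 + erf(z/sqrt(2)));    % standard normal CDF
between = Phi(0.75) - Phi(-0.75)        % ~0.5467
tails = 1 - between                     % ~0.45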

Another way to use the z-tables

Mean SAT score = 500, std. deviation = 100.
Assuming that the distribution of scores is normal, what is the score such that 95% of the scores are below that value?

[Figure: normal curve with 95% of the area below the cutoff, 5% above; z = ?]

Using z-tables to find the 95th percentile point

[Figure: 90% of the area lies between -z and +z, leaving 5% in each tail.]

From the tables:

z      Height   Area (%)
1.65   10.23    90.11

z = 1.65 -> x = ?  Mean = 500, s.d. = 100
1.65 = (x - 500)/100;  x = 165 + 500 = 665
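The same cutoff can be computed without the tables; a sketch using the built-in erfinv (norminv from the Statistics Toolbox would also work):

z95 = sqrt(2)*erfinv(2*0.95 - 1)   % ~1.645, the z with 95% of the area below it
x95 = 500 + 100*z95                % ~664.5, close to the 665 from the table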

Normal distributions

A lot of data is normally distributed because of the central limit theorem from last time.
Data that are influenced by (i.e., the sum of) many small and unrelated random effects tend to be approximately normally distributed.
E.g., weight (I'm making up these numbers):

Overall average = 120 lbs for adult women
Women add about 1 lb/year after age 29
Illness subtracts an average of 5 lbs
Genetics can make you heavier or thinner
A given sample of weight is influenced by being an adult woman, age, health, genetics, ...

Non-normal distributions

For data that are approximately normally distributed, we can use the normal approximation to get useful information about the percent of area under some fraction of the distribution.
For non-normal data, what do we do?

Non-normal distributions

E.g., income distributions tend to be very skewed.
We can use percentiles, much like in the last z-table example (except without the tables).
What's the 10th percentile point? The 25th percentile point?

Percentiles & interquartile range

Divide the data into 4 groups, and see how far apart the extreme groups are.
Median = 50th percentile
Q1 = 25th percentile
Q3 = 75th percentile
Q3 - Q1 = IQR = 75th percentile - 25th percentile

What do you do for other

percentiles?

Median = point such that 50% of the data lies below that point.
Similarly, 10th percentile = point such that 10% of the data lies below that point.

What do you do for other

percentiles?

If you have a theory for the distribution of the data, you can use that to find the nth percentile.
Or estimate it from the data, using MATLAB (to a first approximation), where x is the data:

y = sort(x);                     % put the data in order
N = length(x);                   % how many data points there are
TenthPerc = y(round(0.10*N));    % value with ~10% of the data below it

This isn't exactly right (remember, for instance, that the median of (1 2 4 6) is 3), but it's close enough for our purposes.
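If you have the Statistics Toolbox, you can compare this rough estimate against MATLAB's own percentile function; a sketch:

x = randn(1,1000);                 % some made-up data
y = sort(x);
rough = y(round(0.10*length(x)))   % the rough estimate from above
better = prctile(x, 10)            % Statistics Toolbox version (interpolates)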

How do you judge if a distribution is

normal?

So far we've been eyeballing it. (Does it look symmetric? Is it about the right shape?) Can we do better than this?

Normal quantile Plots

A useful way to judge whether or not a set of samples comes from a normal distribution.
We'll still be eyeballing it, but with a more powerful visualization.

Normal quantile plots

For each datum, what % of the data is below this value, i.e., what's its percentile?
If this were a normal distribution, what z would correspond to that percentile?
Compare the actual data values to those predicted (from the percentiles) if it were a standard normal (z) distribution.

[Figure: a data value in a histogram, and the corresponding z = ? on a standard normal curve.]

Normal quantile plots

If the data ~ N(0, 1), the points should fall on a 45-degree line through the origin.
If the data ~ N(μ, 1), the points should fall on a 45-degree line.
If the data ~ N(μ, σ), the points will fall on a line with slope σ (or 1/σ, depending on how you plotted it).

[Figure: data vs. z for data ~ N(μ, σ); z axis marked -1, 0, 1.]

Normal Quantile Plots

Basic idea:

Order the samples from smallest to largest. Assume you have N samples. Renumber the ordered samples {x1, x2, ..., xN}.
Each sample xi has a corresponding percentile ki = (i - 0.5)/N. About a fraction ki of the data in the sample is < xi.
If the distribution is normal, we can look up ki in the z-tables and get a corresponding value zi.
Plot xi vs. zi (it doesn't matter which is on which axis).
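Putting those steps together, a sketch of building a normal quantile plot by hand (qqplot, mentioned below, does roughly the same thing for you):

x = 20 + 5*randn(1,200);         % some data to test for normality
N = length(x);
xs = sort(x);                    % ordered samples x1 <= ... <= xN
k = ((1:N) - 0.5)/N;             % fraction of the data at or below each xi
z = sqrt(2)*erfinv(2*k - 1);     % z value with that fraction of the area below it
plot(z, xs, '.')
xlabel('z-score')
ylabel('data')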

[Figure: normal quantile plot, y axis vs. z-score, with some points deviating from the line. Figure by MIT OCW.]

Let's remove those outliers

[Figure: the same normal quantile plot with the outliers removed; y axis vs. z-score. Figure by MIT OCW.]

The normal quantile plot allows us to see which points deviate strongly from a line. This helps us locate outliers.

Non-linear plots

Concave-up (with the axes as shown here) means positive skew.
Concave-down means negative skew.

[Figure: normal quantile plot of Dollars Spent vs. z-score. Figure by MIT OCW.]

[Figure: normal quantile plot of Survival Time (days) vs. z-score. Figure by MIT OCW.]

[Figure: normal quantile plot of IQ Score vs. z-score. Figure by MIT OCW.]

Granularity

When the r.v. can only take on certain values, the normal quantile plot looks like funny stair steps.
E.g., binomial distributions (we'll get there in a sec).

[Figure: normal quantile plot of Distance (thousandths of an inch) vs. z-score. Figure by MIT OCW.]

Normal quantile plots in MATLAB

qqplot(x) generates a normal quantile plot for the samples in vector x.
You should have access to this command on the MIT server computers.

The binomial distribution

An important special case of a probability distribution, and one of the most frequently encountered distributions in statistics.
There are two possible outcomes on each trial, e.g. {H, T}.
One outcome is designated a "success", the other a "failure".
The binomial distribution is the distribution of the number of successes in N trials.
E.g., the distribution of the number of heads when you flip the coin 10 times.

Example

Flip a fair coin 6 times. What is P(4H, 2T)?
Well, first, note that P(TTHHHH) = P(THHHHT) = ... = (0.5)^4 (1-0.5)^2 = (0.5)^6.
All events with 4H have the same probability.
How many such events are there?
P(4H, 2T) = (# events of this type) x (0.5)^4 (1-0.5)^2

How many events of this type are

there? The binomial coefficient

Equals the number of possible combinations of N draws such that you have k successes:

(N choose k) = N! / (k! (N-k)!)

N! = N factorial = N(N-1)(N-2)...(1)
   = factorial(N) in MATLAB
0! = 1

Intuition for the binomial coefficient

N! = number of possible ways to arrange N unique items, e.g. 6 items (a,b,c,d,e,f):
6 choices for the 1st slot, 5 remain for the 2nd slot, etc.
But the items aren't all unique: k of them are the same (successes), and the remaining (N-k) are the same (failures).
k! and (N-k)! are the number of duplicate arrangements you get from having k and N-k items be the same.
The result, N!/(k!(N-k)!), is the number of distinct arrangements with k successes.

Binomial coefficient

Number of ways of getting k heads in N tosses.
Number of ways of drawing 2 R balls out of 5 draws, with p(R) = 0.1.
Number of ways of picking 2 people out of a group of 5 (less obvious):
Associate an indicator function with each person: = 1 if picked, 0 if not.
p(p1 = 1) is like p(toss 1 = H).

The Binomial distribution

Probability of k successes in N tries.
Applies to repeatable sampling of a binomial variable (e.g., tossing a coin), where you decide the number of samples in advance.
(Versus: "I keep drawing a ball until I get 2 reds, then I quit. What was my probability of getting 2R and 3G?")

Three critical properties:
The result of each trial may be either a failure or a success.
The probability of success is the same for each trial.
The trials are independent.

Back to tossing coins

The coin-toss experiment is an example of a binomial process.
Let's arbitrarily designate heads as a "success".
p(heads) = 0.5
What is the probability of obtaining 4 heads in 6 tosses?

Example

P(4 H in 6 tosses) = (N choose k) p^k (1-p)^(N-k)
                   = (6 choose 4) (0.5)^4 (0.5)^2
                   = [6! / (4! 2!)] (1/64)
                   = 15/64 ≈ 0.23
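A sketch of the same calculation in MATLAB (binopdf from the Statistics Toolbox, if available, gives the same number):

N = 6; k = 4; p = 0.5;
P = nchoosek(N,k) * p^k * (1-p)^(N-k)   % 15/64, about 0.2344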

Kangaroo example from book

10 pairs of kangaroos.
Half of them get vitamins.
10 races (vitamin vs. no vitamin).
In 7 out of 10 races, the kangaroo taking vitamins wins.
Do the vitamins help, or is this just happening by chance?

How do we decide?

What we want to do is set a criterion number of wins, and decide that the vitamins had an effect if we see a number of wins equal to or greater than the criterion.
How do we set the criterion?
Well, what if we had set the criterion right at 7 wins? What would be our probability of saying there was an effect of the vitamins, when really the results were just due to chance?

Roo races

If we set the criterion at 7 wins, and there were no effect of vitamins, what is the probability of us concluding there was an effect?
The probability of the vitamin roo winning, if vitamins don't matter, is p = 0.5.
What is the probability, in this case, of 7 wins, or 8, or 9, or 10?

Roo races

P(7 wins out of 10) + P(8 wins out of 10) + P(9 wins out of 10) + P(10 wins out of 10)
Use the binomial formula, from before.
≈ 17% (see problem 6, p. 258; answer on p. A-71)
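A sketch of that sum in MATLAB, using only the binomial formula:

N = 10; p = 0.5; k = 7:10;
P = sum(arrayfun(@(j) nchoosek(N,j), k) .* p.^k .* (1-p).^(N-k))   % ~0.17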

Roo races

Remember, this is the probability of us thinking there was an effect, when there actually wasn't, if we set the criterion at 7 wins.
17% is a pretty big probability of error. (In statistics we like numbers more like 5%, 1%, or 0.1%.)
We probably wouldn't want to set the criterion at 7 wins. Maybe 8 or 9 would be better.
We decide that the vitamins probably have no effect.

We'll see LOTS more problems like the kangaroo problem in this class.
And this whole business of setting a criterion will become more familiar and intuitive.
For now, back to binomial random variables.

Mean and variance of a binomial

random variable

The mean number of successes in a binomial experiment is given by:
μ = np
where n is the number of trials and p is the probability of success.

The variance is given by:
σ^2 = npq
where q = 1 - p.
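A quick simulation check of these formulas (a sketch; the sample mean and variance will only be approximately np and npq):

n = 10; p = 0.5;
flips = rand(n, 50000) < p;    % each column is one binomial experiment
numsucc = sum(flips);          % number of successes in each experiment
mean(numsucc)                  % ~ np = 5
var(numsucc)                   % ~ npq = 2.5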

What happens to the binomial

distribution as you toss the coin

more times?

[Probability histograms: number of heads for 3 coins, 4 coins, and 10 coins.]

[Binomial distributions with p = .05: p(x) vs. number of successes for n = 10, n = 50, and n = 100.]

The central limit theorem, again

As the number of tosses goes up, the binomial distribution approximates a normal distribution.
The total number of heads in 100 coin tosses = the number in the first 5 tosses + the number in the next 5 tosses + ...
Thus, a binomial process can be thought of as the sum of a bunch of independent processes, the central limit theorem applies, and the distribution approaches normal for a large number of coin tosses (= trials).

The normal approximation

This means we can use z-tables to answer questions about binomial distributions!

Normal Approximation

When is it OK to use the normal approximation?
Use it when n is large and p isn't too far from 0.5.
The further p is from 0.5, the larger n you need.
Rule of thumb: use it when np ≥ 10 and nq ≥ 10.

Normal Approximation

For any value of p, the binomial distribution of n trials with probability p is approximated by the normal curve with
μ = np and σ = sqrt(npq),
where q = (1-p).

Let's try it for 25 coin flips...

25 coin flips

What is the probability that the number of heads is ≤ 14?
We can calculate from the binomial formula that P(x ≤ 14) is .7878 exactly.

Normal Approximation

Using the normal approximation with
μ = np = (25)(.5) = 12.5 and
σ = sqrt(npq) = sqrt((25)(.5)(.5)) = 2.5, we get

P(x ≤ 14) ≈ P(z ≤ (14 - 12.5)/2.5) = P(z ≤ .6) = .7257

.7878 vs. .7257 -- not great!!
We need a better approximation...

[Figure: Normal Approximation of Binomial Distribution; probability vs. number of successes (0 to 25).]

Continuity Correction

Notice that the bars are centered on the numbers.
This means that P(x ≤ 14) is actually the area under the bars to the left of x = 14.5.
We need to account for the extra 0.5:
P(x ≤ 14.5) ≈ P(z ≤ .8) = .7881 -- a much better approximation!
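A sketch checking these numbers in base MATLAB (binocdf and normcdf from the Statistics Toolbox would do the same, more directly):

n = 25; p = 0.5; k = 0:14;
exact = sum(arrayfun(@(j) nchoosek(n,j), k) .* p.^k .* (1-p).^(n-k))  % ~0.7878
mu = n*p; sigma = sqrt(n*p*(1-p));
Phi = @(z) 0.5*(1 + erf(z/sqrt(2)));     % standard normal CDF
Phi((14 - mu)/sigma)                     % ~0.7257, no continuity correction
Phi((14.5 - mu)/sigma)                   % ~0.7881, with continuity correction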

[Figure: Continuity Correction; probability vs. number of successes (0 to 25).]

# of times you do an experiment, vs.

# of trials in that experiment

In MATLAB:

x = rand(5,10000);         % 5 flips per experiment, 10000 experiments
coinflip = x > 0.5;        % 1 = heads
y = sum(coinflip);         % number of heads in each experiment

Number of coin flips = number of trials per experiment = 5 (the rows).
Number of times you do the experiment = 10000 (the columns).

Increase the number of trials per experiment, and the central limit theorem will start to apply: the distribution of the number of heads will look more normal.
Increase the number of times you do the experiment, and the empirical distribution will approach the theoretical distribution (and get less variable).

Binomial distribution and percent

You can also use the binomial distribution for the percent (proportion) of successes, by dividing by the number of samples (trials):
Mean = np/n = p
Std. deviation = sqrt(npq)/n = sqrt(pq/n)

We'll use this a lot in class, as we often have a situation like that for elections: 45% favor Kerry, 39% favor Edwards. Are these different by chance, or is there a real effect there?
