
CL202: Introduction to Data Analysis

Mani Bhushan, Sachin Patwardhan


Department of Chemical Engineering,
Indian Institute of Technology Bombay
Mumbai, India- 400076

mbhushan,[email protected]

Acknowledgements: Santosh Noronha (some material from his slides)

Spring 2015

(IIT Bombay) CL202 Spring 2015 1 / 50


Special Random Variables

Chapter 5 of Ross.
Some material also from Montgomery and Runger, Applied Statistics and
Probability for Engineers, John Wiley, 2003.



Random Variables

Some random variables occur frequently in practice and are used to model many
different types of situations.
We will consider some of these, starting with discrete random variables.



The Bernoulli Random Variable
A Bernoulli random variable has two values 1 (success) and 0 (failure).
It models a situation with only two possible outcomes.
Example: coin toss: heads with probability p, tails with probability 1 − p.
X = 1 (heads) or 0 (tails).
This is a very simple ‘on-off’ distribution. The PMF is:

p(x) = p       if x = 1
p(x) = 1 − p   if x = 0

The PMF can also be written as

p(x) = p^x (1 − p)^(1−x),  x ∈ {0, 1}

E[X] = 1 × p + 0 × (1 − p) = p
E[X^2] = 1^2 × p + 0^2 × (1 − p) = p, which implies that
Var(X) = E[X^2] − (E[X])^2 = p − p^2 = p(1 − p)
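These moments are easy to check numerically. A minimal sketch in Python; the value p = 0.3 is an arbitrary choice for the check:

```python
# Direct check of the Bernoulli moments; p = 0.3 is an arbitrary choice.
p = 0.3
pmf = {1: p, 0: 1 - p}

mean = sum(x * px for x, px in pmf.items())       # E[X] = p
second = sum(x**2 * px for x, px in pmf.items())  # E[X^2] = p
var = second - mean**2                            # Var(X) = p(1 - p)

print(mean, var)  # ≈ 0.3 and ≈ 0.21
```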



Binomial random variable

A biased coin (P{H} = p, P{T} = 1 − p) is tossed n times.


The toss outcomes are independent.
X = number of H in an n toss sequence.

PMF: p(k) = P{X = k} = nCk p^k (1 − p)^(n−k),  k = 0, 1, 2, ..., n

Parameters of a binomial random variable: n, p.


E [X ] = np, Var(X ) = np(1 − p).



Binomial random variable: PMF sketch



Example of a Binomial RV

Q (Montgomery and Runger, 2003) Each sample of water has a 10% chance of
containing a particular organic pollutant. Assume that samples are
independent with regard to the presence or absence of the pollutant. Find the
probability that in the next 18 samples: (i) exactly two contain the pollutant,
(ii) at least four contain the pollutant.
A (i) P{X = 2} = 18C2 (0.1)^2 (0.9)^16 = 0.284
(ii) P{X ≥ 4} = 1 − P{X < 4} = 1 − Σ_{i=0}^{3} 18Ci (0.1)^i (0.9)^(18−i)
    = 1 − (0.150 + 0.300 + 0.284 + 0.168) = 0.098
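Both numbers in this answer can be reproduced in a few lines of Python, a sketch using the binomial PMF directly:

```python
from math import comb

def binom_pmf(k, n, p):
    """P{X = k} for a binomial(n, p) random variable."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 18, 0.1
p_exactly_two = binom_pmf(2, n, p)                               # part (i)
p_at_least_four = 1 - sum(binom_pmf(i, n, p) for i in range(4))  # part (ii)

print(round(p_exactly_two, 3), round(p_at_least_four, 3))  # 0.284 0.098
```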



Geometric random variable

In a series of Bernoulli trials (independent trials with constant probability p
of success), let the random variable X denote the number of trials until the
first success.
Then X is a geometric random variable with parameter 0 < p < 1 and PMF

P{X = k} = p_X(k) = (1 − p)^(k−1) p;  k = 1, 2, ...

The PMF at k is (1 − p) times the PMF at k − 1, i.e. the probabilities decrease
in a geometric progression; hence the name geometric.

E[X] = 1/p,  Var(X) = (1 − p)/p^2



Geometric RV: PMF Sketch



Geometric RV: Example

Q (Montgomery and Runger, 2003) The probability that a silicon wafer contains
a large particle (contamination) is 0.01. If it is assumed that the wafers are
independent, what is the probability that exactly 125 wafers need to be
analyzed before a large particle is detected?
A X : number of samples analyzed until a large particle is detected.
X is a geometric RV with p = 0.01. Then,

P{X = 125} = (0.99)^124 × 0.01 = 0.0029
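A quick numerical check of this answer, along with the geometric mean formula E[X] = 1/p (the truncation at 5000 terms is an arbitrary cutoff; the tail beyond it is negligible):

```python
p = 0.01  # probability a wafer is contaminated

# P{X = 125}: 124 clean wafers, then one contaminated wafer.
prob_125 = (1 - p) ** 124 * p

# Check E[X] = 1/p against a (truncated) direct sum of k * P{X = k}.
mean_direct = sum(k * (1 - p) ** (k - 1) * p for k in range(1, 5000))

print(prob_125, mean_direct)  # ≈ 0.0029 and ≈ 100
```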



Discrete uniform random variable

The result of the roll of a die is a random variable X, with PMF

p(x_i) = 1/6 if x_i = 1, 2, 3, 4, 5, 6
p(x_i) = 0   otherwise

E[X] = 3.5 (using a symmetry argument about 3.5).
Var(X) = (1/6)(1^2 + 2^2 + 3^2 + 4^2 + 5^2 + 6^2) − 3.5^2 = 35/12



Discrete uniform random variable

In general, for a < b,

p(k) = 1/(b − a + 1) if k = a, a + 1, ..., b
p(k) = 0             otherwise

The expectation of X is (a + b)/2.
The variance of X is (b − a)(b − a + 2)/12.
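These closed forms match direct enumeration. A small check in Python; a = 2 and b = 7 are arbitrary choices with a < b:

```python
# Direct enumeration check of the discrete uniform formulas;
# a = 2, b = 7 are arbitrary with a < b.
a, b = 2, 7
values = range(a, b + 1)
n = b - a + 1

mean = sum(values) / n                          # direct E[X] = (a+b)/2
var = sum(k**2 for k in values) / n - mean**2   # direct Var(X) = (b-a)(b-a+2)/12

print(mean, var)  # 4.5 and ≈ 2.9167 = 35/12
```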



Discrete RVs Studied

Bernoulli (coin toss)


Binomial (number of heads in n independent coin tosses)
Geometric (number of independent tosses till first heads)
Uniform (roll of a die)



Continuous Random Variables

Can take any value on an interval (or R).


Uniform, Exponential, Gaussian (Normal), t, χ², F.



Continuous uniform random variable

Discrete uniform law: the sample space had a finite number of equally likely
outcomes, and for discrete variables we count the number of outcomes associated
with an event.
For continuous variables we instead compute the length of a subset of the real line.
The PDF of a continuous uniform random variable is

f_X(x) = c, α ≤ x ≤ β
f_X(x) = 0, otherwise

where c must be > 0.



Continuous uniform random variable

For f to be a PDF,

1 = ∫_α^β f_X(z) dz = ∫_α^β c dz = c(β − α)  ⇒  c = 1/(β − α)



Mean of the uniform RV

The mean is

E[X] = ∫_{−∞}^{∞} x f_X(x) dx = ∫_a^b x · 1/(b − a) dx
     = (1/(b − a)) · (x^2/2)|_a^b = (b^2 − a^2)/(2(b − a))
     = (a + b)/2

The PDF is symmetric around (a + b)/2.



Variance of the uniform RV

E[X^2] = ∫_a^b x^2 · 1/(b − a) dx = (1/(b − a)) · (x^3/3)|_a^b
       = (b^3 − a^3)/(3(b − a)) = (a^2 + ab + b^2)/3

Var(X) = E[X^2] − (E[X])^2
       = (a^2 + ab + b^2)/3 − ((a + b)/2)^2
       = (b − a)^2/12
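Both moments can be sanity-checked against numerical integration. A sketch using a midpoint rule; the endpoints a = 2 and b = 5 are arbitrary choices:

```python
# Midpoint-rule check of E[X] = (a+b)/2 and Var(X) = (b-a)^2/12
# for a uniform density on [a, b]; a = 2, b = 5 are arbitrary.
a, b = 2.0, 5.0
N = 100_000
dx = (b - a) / N
xs = [a + (i + 0.5) * dx for i in range(N)]

mean = sum(x / (b - a) for x in xs) * dx       # ∫ x f(x) dx
second = sum(x * x / (b - a) for x in xs) * dx # ∫ x^2 f(x) dx
var = second - mean**2

print(mean, var)  # ≈ 3.5 and ≈ 0.75
```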



Exponential Random Variable
PDF of X is

f_X(x) = λ e^(−λx) if x ≥ 0
f_X(x) = 0         if x < 0

for some constant λ > 0, known as the rate of the exponential distribution.
Examples:
- Time till a light bulb burns out.
- Time till an accident or a failure (recall the control valve and electronics
  hardware examples).



Exponential RV

Is it a PDF?

∫_{−∞}^{∞} f_X(x) dx = ∫_0^∞ λ e^(−λx) dx = [−e^(−λx)]_0^∞ = 1

For any a ≥ 0,

P{X ≥ a} = ∫_a^∞ λ e^(−λx) dx = [−e^(−λx)]_a^∞ = e^(−λa)

or

F(a) = 1 − e^(−λa)
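The closed-form CDF can be verified numerically. A sketch; λ = 0.5 and a = 2 are arbitrary choices for the check:

```python
import math

lam, a = 0.5, 2.0  # arbitrary rate and threshold for the check

# Midpoint-rule integration of the PDF over [0, a] ...
N = 100_000
dx = a / N
cdf_numeric = sum(lam * math.exp(-lam * (i + 0.5) * dx) for i in range(N)) * dx

# ... compared against the closed form F(a) = 1 - e^(-lam*a).
cdf_closed = 1 - math.exp(-lam * a)

print(cdf_numeric, cdf_closed)  # both ≈ 0.6321
```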



Moments of Exponential RV

Mean (integrating by parts):

E[X] = ∫_0^∞ x (λ e^(−λx)) dx = [−x e^(−λx)]_0^∞ + ∫_0^∞ e^(−λx) dx
     = 0 − [e^(−λx)/λ]_0^∞ = 1/λ

Variance:

E[X^2] = ∫_0^∞ x^2 (λ e^(−λx)) dx = [−x^2 e^(−λx)]_0^∞ + ∫_0^∞ 2x e^(−λx) dx
       = 0 + (2/λ) E[X] = 2/λ^2

Var(X) = 2/λ^2 − (1/λ)^2 = 1/λ^2



Exponential RV

The moment generating function (MGF) of the exponential is

φ(t) = E[e^(tX)] = ∫_0^∞ e^(tx) λ e^(−λx) dx = λ ∫_0^∞ e^(−(λ−t)x) dx
     = λ/(λ − t)   (for t < λ)

Differentiation gives

φ′(t) = λ/(λ − t)^2,   φ″(t) = 2λ/(λ − t)^3

Therefore

E[X] = φ′(0) = 1/λ
Var(X) = φ″(0) − (E[X])^2 = 2/λ^2 − 1/λ^2 = 1/λ^2



Example: The sky is falling...

Q. Meteorites land in a certain area once every 10 days on average. What is the
probability that the first meteorite lands between 6 AM and 6 PM of the first
day, given that it is now midnight?
A. Let X = elapsed time till strike (in days), modeled as an exponential
random variable ⇒ mean = 1/λ = 10 days ⇒ λ = 1/10.

P{1/4 ≤ X ≤ 3/4} = P{X ≥ 1/4} − P{X ≥ 3/4}
                 = e^(−λ/4) − e^(−3λ/4)
                 = e^(−1/40) − e^(−3/40)
                 = 0.0476
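The arithmetic of this answer in Python:

```python
import math

lam = 1 / 10  # one meteorite per 10 days on average

# P{1/4 <= X <= 3/4} = P{X >= 1/4} - P{X >= 3/4} for an exponential RV.
prob = math.exp(-lam / 4) - math.exp(-3 * lam / 4)
print(prob)  # ≈ 0.0476
```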



The sky is falling...

Q. What is the probability that the meteorite lands between 6 AM and 6 PM of
any day?
A. For the k-th day, this time frame is (k − 3/4) ≤ X ≤ (k − 1/4). The required
probability is

Σ_{k=1}^{∞} P{k − 3/4 ≤ X ≤ k − 1/4}
= Σ_{k=1}^{∞} P{X ≥ k − 3/4} − Σ_{k=1}^{∞} P{X ≥ k − 1/4}
= Σ_{k=1}^{∞} [ e^(−(4k−3)/40) − e^(−(4k−1)/40) ]
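The series is geometric in e^(−0.1) and converges quickly, so a partial sum evaluates it to high accuracy; a numeric sketch (500 terms is an arbitrary, generous cutoff):

```python
import math

# Partial sum of sum_{k>=1} [ e^(-(4k-3)/40) - e^(-(4k-1)/40) ].
# Terms decay like e^(-k/10), so a few hundred terms are ample.
total = sum(
    math.exp(-(4 * k - 3) / 40) - math.exp(-(4 * k - 1) / 40)
    for k in range(1, 500)
)
print(total)  # ≈ 0.4998, close to (but not exactly) 1/2
```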



Important Property of Exponential RV
Q X denotes the time between detections of a particle with a Geiger counter.
Assume that X has an exponential distribution with λ = 1/1.4 per minute. What
is the probability that we detect a particle within 0.5 minutes of starting
the counter?
A
P{X < 0.5 min} = F(0.5) = 1 − e^(−0.5/1.4) = 0.30

Q Suppose we wait 3 minutes without detecting a particle after turning on the
Geiger counter. What is the probability that a particle is detected in the
next 0.5 minutes?
A
A
P{X < 3.5 | X > 3} = P{3 < X < 3.5}/P{X > 3}

Numerator: F(3.5) − F(3) = (1 − e^(−3.5/1.4)) − (1 − e^(−3/1.4)) = 0.035.
Denominator: 1 − F(3) = e^(−3/1.4) = 0.117.
Answer: 0.035/0.117 = 0.3, the same as P{X < 0.5}.
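In Python the conditional and unconditional probabilities agree to machine precision, which is the memoryless property in action:

```python
import math

lam = 1 / 1.4  # detection rate per minute

def F(t):
    """Exponential CDF: P{X < t}."""
    return 1 - math.exp(-lam * t)

cond = (F(3.5) - F(3)) / (1 - F(3))  # P{X < 3.5 | X > 3}
uncond = F(0.5)                      # P{X < 0.5}
print(cond, uncond)  # both ≈ 0.30
```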
Memoryless Property of Exponential RVs

P{X < t1 + t2 | X > t1 } = P{X < t2 }

Exponentially distributed random variables are memoryless.


Sketch plot of statement.



A Very Important Continuous RV

Normal (Gaussian) Random Variable


Distributions arising from the Normal:
- Chi-square distribution
- t-distribution
- F-distribution



Normal (Gaussian) random variables

Most useful distribution: we use it to approximate other distributions.


The PDF of X is

f_X(x) = (1/(√(2π)σ)) e^(−(x−µ)^2/(2σ^2)),  −∞ < x < ∞

for some parameters µ and σ (σ > 0).



Moments of a Normal random variables

E [X ] = mean = µ (from symmetry arguments alone).


Alternatively, show E[X − µ] = E[X] − µ = 0 (refer to the book).
For the normal, mean = median = mode.

Var(X) = σ^2 = ∫_{−∞}^{∞} (x − E[X])^2 f_X(x) dx
       = ∫_{−∞}^{∞} (x − µ)^2 · (1/(√(2π)σ)) e^(−(x−µ)^2/(2σ^2)) dx



Variance of a Normal RV

Let z = (x − µ)/σ. Then

Var(X) = (σ^2/√(2π)) ∫_{−∞}^{∞} z^2 e^(−z^2/2) dz
       = (σ^2/√(2π)) [−z e^(−z^2/2)]_{−∞}^{∞} + (σ^2/√(2π)) ∫_{−∞}^{∞} e^(−z^2/2) dz
       = (σ^2/√(2π)) ∫_{−∞}^{∞} e^(−z^2/2) dz
       = σ^2



Moment Generating Function of a Normal RV

The MGF of a Normal RV is (substituting y = (x − µ)/σ):

φ(t) = E[e^(tX)] = (1/(√(2π)σ)) ∫_{−∞}^{∞} e^(tx) e^(−(x−µ)^2/(2σ^2)) dx
     = (e^(µt)/√(2π)) ∫_{−∞}^{∞} e^(tσy) e^(−y^2/2) dy
     = (e^(µt)/√(2π)) ∫_{−∞}^{∞} exp(−(y^2 − 2tσy)/2) dy
     = (e^(µt)/√(2π)) ∫_{−∞}^{∞} exp(−(y − tσ)^2/2 + t^2σ^2/2) dy
     = exp(µt + σ^2t^2/2) · (1/√(2π)) ∫_{−∞}^{∞} e^(−(y−tσ)^2/2) dy
     = exp(µt + σ^2t^2/2)



Use of MGF of a Normal random variable
The MGF for a normal RV is

φ(t) = exp(µt + σ^2t^2/2)

Differentiation gives

φ′(t) = (µ + tσ^2) exp(µt + σ^2t^2/2)
φ″(t) = σ^2 exp(µt + σ^2t^2/2) + (µ + tσ^2)^2 exp(µt + σ^2t^2/2)

Therefore

E[X] = φ′(0) = µ
Var(X) = φ″(0) − (E[X])^2 = (σ^2 + µ^2) − µ^2 = σ^2



Normal random variables

X = normal random variable with mean µ and variance σ^2:

X ∼ N(µ, σ^2)

The maximum height of the N(µ, σ^2) density is 1/(σ√(2π)), and hence the
maximum height is ∝ 1/σ.
Another important property of normal random variables:
If X is normal with mean µ and variance σ^2, then for any constants a and
b, b ≠ 0, the random variable Y = a + bX is also a normal random variable
with

E[Y] = E[a + bX] = a + bµ
Var(Y) = b^2 σ^2

(Refer to book for proof)
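A simulation sketch of this property; µ, σ, a, and b below are arbitrary choices:

```python
import random
import statistics

random.seed(0)
mu, sigma = 2.0, 3.0   # parameters of X (arbitrary)
a, b = 1.0, -2.0       # constants for Y = a + b*X (arbitrary, b != 0)

ys = [a + b * random.gauss(mu, sigma) for _ in range(200_000)]

mean_y = statistics.fmean(ys)      # should be near a + b*mu = -3
var_y = statistics.pvariance(ys)   # should be near b^2 * sigma^2 = 36
print(mean_y, var_y)
```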



Standard Normal Distribution

A Standard Normal RV Z has mean 0 and σ^2 = 1:

Z ∼ N(0, 1)

The CDF of Z is

Φ(z) = P{Z ≤ z} = P{Z < z} = (1/√(2π)) ∫_{−∞}^z e^(−t^2/2) dt



Standard Normal Distribution

Φ(z) is available as a table.


If the table only gives the value of Φ(z) for z ≥ 0, use symmetry.



Use of Standard Normal Distribution

Given a normal random variable X with mean µ and variance σ^2, we use

Z = (X − µ)/σ

Then Z is also normal, with

E[Z] = (E[X] − µ)/σ = 0
Var(Z) = Var(X)/σ^2 = 1

⇒ Z ∼ N(0, 1), the standard normal.



Example

Q. What is Φ(−0.5)?
A.

Φ(−0.5) = P{Z ≤ −0.5} = P{Z ≥ 0.5} = 1 − P{Z ≤ 0.5}


= 1 − Φ(0.5) = 1 − 0.6915
= 0.3085
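Tables aside, Φ can be computed from the error function in Python's math module, since Φ(z) = (1 + erf(z/√2))/2; a minimal sketch:

```python
import math

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

print(phi(-0.5))     # ≈ 0.3085
print(1 - phi(0.5))  # the same value, by symmetry
```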



Example

Q. Annual rainfall at a spot is normally distributed with mean µ = 60 mm/yr
and σ = 20 mm/yr. What is the probability that we get at least 80 mm/yr?
A. Let X = rainfall per year. Then

Z = (X − µ)/σ = (X − 60)/20

So

P{X ≥ 80} = P{Z ≥ (80 − 60)/20}
          = P{Z ≥ 1} = 1 − Φ(1) = 1 − 0.8413
          = 0.1587
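The same erf-based Φ reproduces this answer:

```python
import math

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

mu, sigma = 60, 20
z = (80 - mu) / sigma   # standardize: z = 1
prob = 1 - phi(z)       # P{X >= 80} = P{Z >= 1}
print(prob)  # ≈ 0.1587
```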



Notation
In general (with Z being standard normal),

P{X ≤ x} = P{(X − µ)/σ ≤ (x − µ)/σ} = P{Z ≤ (x − µ)/σ} = Φ((x − µ)/σ)

The 100(1 − α)-th percentile of N(0, 1) is z_α, where z_α is such that

P{Z > z_α} = α,  or equivalently,  P{Z < z_α} = 1 − α

i.e. 100(1 − α) percent of the time a standard normal random variable will be
less than z_α.



The Chi-Square Distribution

If Z1, Z2, ..., Zn are independent standard normal random variables, then X,
defined as

X = Z1^2 + Z2^2 + ... + Zn^2

is said to have a chi-square distribution with n degrees of freedom, or

X ∼ χ²_n

Let X1, X2 be independent chi-square random variables with n1 and n2 degrees
of freedom, respectively. Then X1 + X2 is chi-square with n1 + n2 degrees of
freedom.
For X a chi-square RV with n degrees of freedom, the quantity χ²_{α,n} is
defined such that

P{X ≥ χ²_{α,n}} = α

Table A2 (appendix) of the textbook lists χ²_{α,n} for various values of α and n.



Chi-squared PDF Sketch



Chi-squared RV

The density function of a chi-squared RV involves the gamma function.
(Without proof) For a chi-squared RV X with n degrees of freedom:
E[X] = n, Var(X) = 2n.
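These moments can be checked by simulating sums of squared standard normals; n = 4 is an arbitrary choice for the check:

```python
import random
import statistics

random.seed(1)
n = 4  # degrees of freedom (an arbitrary choice for the check)

# Each chi-square(n) draw is a sum of n squared standard normals.
samples = [sum(random.gauss(0, 1) ** 2 for _ in range(n)) for _ in range(100_000)]

m_hat = statistics.fmean(samples)      # should be near n = 4
v_hat = statistics.pvariance(samples)  # should be near 2n = 8
print(m_hat, v_hat)
```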



The t−Distribution

If Z and χ²_n are independent random variables, with Z having a standard
normal distribution and χ²_n having a chi-squared distribution with n degrees
of freedom, then the random variable T_n defined by

T_n = Z / √(χ²_n/n)

is said to have a t-distribution with n degrees of freedom.

The t-density function is in terms of the gamma function.
The t-density is symmetric about 0, like the standard normal density.
As n becomes larger, the t-density tends to a standard normal density.



t−distribution PDF Sketch



The t−Distribution

(Without proof)

E[T_n] = 0 for n > 1, otherwise undefined

Var(T_n) = n/(n − 2) for n > 2;  ∞ for 1 < n ≤ 2;  otherwise undefined

Note that as n → ∞, the variance tends to 1 (same as that of a standard
normal RV).
Let t_{α,n} be such that

P{T_n ≥ t_{α,n}} = α,  0 < α < 1

From symmetry, t_{1−α,n} = −t_{α,n}.
Values of t_{α,n} are listed for various n, α in Table A3 (appendix) of the
textbook.



The F −Distribution

Ratio of two independent chi-square variables:
If χ²_n, χ²_m are independent chi-square RVs with n and m degrees of freedom
respectively, then the RV F_{n,m} defined by

F_{n,m} = (χ²_n/n) / (χ²_m/m)

is said to have an F-distribution with n and m degrees of freedom.

We will not worry about its density function.
For any α ∈ (0, 1), let F_{α,n,m} be such that

P{F_{n,m} > F_{α,n,m}} = α

The quantities F_{α,n,m} are tabulated in Table A4 (appendix) of the textbook
for various n, m, and α ≤ 0.5.



F −distribution PDF Sketch



Important Property of F −distribution
For α > 0.5,

α = P{(χ²_n/n)/(χ²_m/m) > F_{α,n,m}}
  = P{(χ²_m/m)/(χ²_n/n) < 1/F_{α,n,m}}
  = 1 − P{(χ²_m/m)/(χ²_n/n) ≥ 1/F_{α,n,m}}

or,

P{(χ²_m/m)/(χ²_n/n) ≥ 1/F_{α,n,m}} = 1 − α

But (χ²_m/m)/(χ²_n/n) has an F-distribution with m, n degrees of freedom, thus

1 − α = P{(χ²_m/m)/(χ²_n/n) ≥ F_{1−α,m,n}}

Thus,

1/F_{α,n,m} = F_{1−α,m,n}
Illustration

F_{0.9,5,7} = 1/F_{0.1,7,5}.
From tables, F_{0.1,7,5} = 3.37; thus F_{0.9,5,7} = 0.2967.
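A Monte Carlo sanity check of this identity under the stated table value; a sketch in which chi-square draws are simulated as sums of squared standard normals:

```python
import random

random.seed(2)
n, m = 5, 7

def chi2(df):
    """One chi-square(df) draw as a sum of df squared standard normals."""
    return sum(random.gauss(0, 1) ** 2 for _ in range(df))

# Draw F_{5,7} = (chi2_5/5)/(chi2_7/7) and estimate P{F > 1/3.37};
# by the identity, this upper-tail probability should be about 0.9.
draws = [(chi2(n) / n) / (chi2(m) / m) for _ in range(20_000)]
frac = sum(f > 1 / 3.37 for f in draws) / len(draws)
print(frac)  # ≈ 0.9
```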



THANK YOU

