Module 3 (Regression Line) and Module 4
Rank correlation is
$$\rho = 1 - \frac{6\left[\sum d^2 + \frac{1}{2} + \frac{1}{2} + 5 + \frac{1}{2} + \frac{1}{2}\right]}{12(144 - 1)} = 0.722$$
Linear Regression:
If the variables in a bivariate distribution are related, the points in the scatter diagram will
cluster round some curve called the 'curve of regression'. If the curve is a straight line, it is
called the line of regression and there is said to be linear regression between the variables;
otherwise the regression is said to be curvilinear.
The line of regression is the line which gives the best estimate of the value of one variable for
any specific value of the other variable. Thus, the line of regression is the line of 'best fit' and
is obtained by the principle of least squares.
Let us suppose that in the bivariate distribution $(x_i, y_i)$; $i = 1, 2, 3, \dots, n$; $y$ is the
dependent variable and $x$ is the independent variable. Let the line of regression of $y$ on $x$ be
$$y = a + bx \qquad (1)$$
Eq. (1) represents a family of straight lines for different values of the arbitrary constants
$a$ and $b$. The problem is to determine $a$ and $b$ so that the line in Eq. (1) is the line of best
fit.
According to the principle of least squares, we have to determine $a$ and $b$ so that
$$E = \sum_{i=1}^{n} \left(y_i - (a + b x_i)\right)^2$$
is minimum. From the principle of maxima and minima, the partial derivatives of $E$ with respect
to $a$ and $b$ should vanish separately, i.e.,
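$$\frac{\partial E}{\partial a} = 0 = -2 \sum_{i=1}^{n} \left(y_i - (a + b x_i)\right)$$
$$\frac{\partial E}{\partial b} = 0 = -2 \sum_{i=1}^{n} x_i \left(y_i - (a + b x_i)\right)$$
These give the normal equations
$$\sum_{i=1}^{n} y_i = na + b \sum_{i=1}^{n} x_i \qquad (2)$$
$$\sum_{i=1}^{n} x_i y_i = a \sum_{i=1}^{n} x_i + b \sum_{i=1}^{n} x_i^2 \qquad (3)$$
Dividing Eq. (2) by $n$, we get
$$\bar{y} = a + b\bar{x} \qquad (4)$$
Thus the line of regression passes through the point $(\bar{x}, \bar{y})$.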
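$$\mu_{11} = \mathrm{Cov}(x, y) = \frac{1}{n}\sum_{i=1}^{n} x_i y_i - \bar{x}\bar{y}
\;\Rightarrow\; \frac{1}{n}\sum_{i=1}^{n} x_i y_i = \mu_{11} + \bar{x}\bar{y} \qquad (5)$$
$$\sigma_x^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2
\;\Rightarrow\; \frac{1}{n}\sum_{i=1}^{n} x_i^2 = \sigma_x^2 + \bar{x}^2 \qquad (6)$$
Dividing Eq. (3) by $n$ and using Eqs. (5) and (6), we get
$$b = \frac{\mu_{11}}{\sigma_x^2}$$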
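Since $b$ is the slope of the line of regression of $Y$ on $X$ and since the line of regression
passes through the point $(\bar{x}, \bar{y})$, its equation is
$$Y - \bar{y} = b(X - \bar{x}) = \frac{\mu_{11}}{\sigma_x^2}(X - \bar{x})$$
Since $\mu_{11} = r\,\sigma_X \sigma_Y$, this becomes
$$Y - \bar{y} = r\,\frac{\sigma_Y}{\sigma_X}(X - \bar{x})$$
Similarly, the line of regression of $X$ on $Y$ is
$$X - \bar{x} = r\,\frac{\sigma_X}{\sigma_Y}(Y - \bar{y})$$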
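The least-squares derivation above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `regression_y_on_x` and the sample data are assumptions made for the example and do not come from the notes.

```python
from statistics import fmean


def regression_y_on_x(xs, ys):
    """Return (a, b) of the least-squares line y = a + b*x."""
    n = len(xs)
    x_bar, y_bar = fmean(xs), fmean(ys)
    # mu_11 = (1/n) * sum(x*y) - x_bar*y_bar, the covariance of x and y
    mu_11 = sum(x * y for x, y in zip(xs, ys)) / n - x_bar * y_bar
    # sigma_x^2 = (1/n) * sum(x^2) - x_bar^2, the variance of x
    var_x = sum(x * x for x in xs) / n - x_bar ** 2
    b = mu_11 / var_x          # slope b = mu_11 / sigma_x^2
    a = y_bar - b * x_bar      # the line passes through (x_bar, y_bar)
    return a, b


# Hypothetical data lying exactly on y = 2 + 3x
xs = [1, 2, 3, 4, 5]
ys = [5, 8, 11, 14, 17]
a, b = regression_y_on_x(xs, ys)
print(f"y = {a:.4f} + {b:.4f} x")
```

Because the sample points lie exactly on a straight line, the fitted intercept and slope should reproduce that line up to floating-point error.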
Problem:
Solution:
Let us denote the sales by the variable $x$ and the purchases by the variable $y$.
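| $x$ | $y$ | $dx = x - 90$ | $dy = y - 70$ | $dx^2$ | $dy^2$ | $dx\,dy$ |
|---|---|---|---|---|---|---|
| 91 | 71 | 1 | 1 | 1 | 1 | 1 |
| 97 | 75 | 7 | 5 | 49 | 25 | 35 |
| 108 | 69 | 18 | -1 | 324 | 1 | -18 |
| 121 | 97 | 31 | 27 | 961 | 729 | 837 |
| 67 | 70 | -23 | 0 | 529 | 0 | 0 |
| 124 | 91 | 34 | 21 | 1156 | 441 | 714 |
| 51 | 39 | -39 | -31 | 1521 | 961 | 1209 |
| 73 | 61 | -17 | -9 | 289 | 81 | 153 |
| 111 | 80 | 21 | 10 | 441 | 100 | 210 |
| 57 | 47 | -33 | -23 | 1089 | 529 | 759 |
| $\sum x = 900$ | $\sum y = 700$ | $\sum dx = 0$ | $\sum dy = 0$ | $\sum dx^2 = 6360$ | $\sum dy^2 = 2868$ | $\sum dx\,dy = 3900$ |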
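We have
$$\bar{x} = \frac{\sum x}{n} = \frac{900}{10} = 90, \qquad \bar{y} = \frac{\sum y}{n} = \frac{700}{10} = 70$$
$$b_{yx} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} = \frac{\sum dx\,dy}{\sum dx^2} = \frac{3900}{6360} = 0.6132$$
$$b_{xy} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (y - \bar{y})^2} = \frac{\sum dx\,dy}{\sum dy^2} = \frac{3900}{2868} = 1.3598$$
Equation of regression of $y$ on $x$ is
$$\begin{aligned}
y - \bar{y} &= b_{yx}(x - \bar{x}) \\
y - 70 &= 0.6132(x - 90) \\
y - 70 &= 0.6132x - 0.6132 \times 90 = 0.6132x - 55.188 \\
y &= 0.6132x - 55.188 + 70 \\
y &= 0.6132x + 14.812
\end{aligned}$$
Equation of regression of $x$ on $y$ is
$$\begin{aligned}
x - \bar{x} &= b_{xy}(y - \bar{y}) \\
x - 90 &= 1.3598(y - 70) \\
x - 90 &= 1.3598y - 1.3598 \times 70 = 1.3598y - 95.186 \\
x &= 1.3598y - 95.186 + 90 \\
x &= 1.3598y - 5.186
\end{aligned}$$
$$r^2 = b_{yx} \cdot b_{xy}$$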
| Marks in Economics | 25 | 28 | 35 | 32 | 31 | 36 | 29 | 38 | 34 | 32 |
|---|---|---|---|---|---|---|---|---|---|---|
| Marks in Statistics | 43 | 46 | 49 | 41 | 36 | 32 | 31 | 30 | 33 | 39 |
Solution:
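| $x$ | $y$ | $dx = x - 32$ | $dy = y - 38$ | $dx^2$ | $dy^2$ | $dx\,dy$ |
|---|---|---|---|---|---|---|
| 25 | 43 | -7 | 5 | 49 | 25 | -35 |
| 28 | 46 | -4 | 8 | 16 | 64 | -32 |
| 35 | 49 | 3 | 11 | 9 | 121 | 33 |
| 32 | 41 | 0 | 3 | 0 | 9 | 0 |
| 31 | 36 | -1 | -2 | 1 | 4 | 2 |
| 36 | 32 | 4 | -6 | 16 | 36 | -24 |
| 29 | 31 | -3 | -7 | 9 | 49 | 21 |
| 38 | 30 | 6 | -8 | 36 | 64 | -48 |
| 34 | 33 | 2 | -5 | 4 | 25 | -10 |
| 32 | 39 | 0 | 1 | 0 | 1 | 0 |
| $\sum x = 320$ | $\sum y = 380$ | $\sum dx = 0$ | $\sum dy = 0$ | $\sum dx^2 = 140$ | $\sum dy^2 = 398$ | $\sum dx\,dy = -93$ |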
$$\bar{x} = \frac{\sum x}{n} = \frac{320}{10} = 32, \qquad \bar{y} = \frac{\sum y}{n} = \frac{380}{10} = 38$$
(a) Regression coefficients:
$$b_{yx} = \frac{\sum dx\,dy}{\sum dx^2} = \frac{-93}{140} = -0.6643, \qquad b_{xy} = \frac{\sum dx\,dy}{\sum dy^2} = \frac{-93}{398} = -0.2337$$
(b) Equation of line of regression of $x$ on $y$ is
$$\begin{aligned}
x - \bar{x} &= b_{xy}(y - \bar{y}) \\
x - 32 &= -0.2337(y - 38) \\
x - 32 &= -0.2337y + 38 \times 0.2337 = -0.2337y + 8.8806 \\
x &= -0.2337y + 8.8806 + 32 \\
x &= -0.2337y + 40.8806
\end{aligned}$$
Equation of line of regression of $y$ on $x$ is
$$\begin{aligned}
y - \bar{y} &= b_{yx}(x - \bar{x}) \\
y - 38 &= -0.6643(x - 32) \\
y - 38 &= -0.6643x + 0.6643 \times 32 \\
y &= -0.6643x + 0.6643 \times 32 + 38 \\
y &= -0.6643x + 59.2576
\end{aligned}$$
(c) Correlation coefficient:
$$r^2 = b_{yx} \cdot b_{xy} = (-0.6643)(-0.2337) = 0.1552$$
$$r = \pm 0.394$$
Since both regression coefficients are negative, we discard the plus sign and get
$$r = -0.394$$
(d) In order to estimate the most likely marks in Statistics ($y$) when marks in Economics ($x$)
are 30, we use the line of regression of $y$ on $x$:
$$y = -0.6643(30) + 59.2576 = 39.3286$$
Hence the most likely marks in Statistics when the marks in Economics are 30 are 39.3286 ≈ 39.
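The arithmetic of this problem can be checked with a short Python sketch that works directly from the marks data and the deviations from the means. The variable names are illustrative assumptions; the formulas are the deviation forms of the regression coefficients used in the solution.

```python
from math import copysign, sqrt

econ = [25, 28, 35, 32, 31, 36, 29, 38, 34, 32]   # x: marks in Economics
stat = [43, 46, 49, 41, 36, 32, 31, 30, 33, 39]   # y: marks in Statistics

n = len(econ)
x_bar = sum(econ) / n
y_bar = sum(stat) / n

# Deviations from the means
dx = [x - x_bar for x in econ]
dy = [y - y_bar for y in stat]

s_xy = sum(u * v for u, v in zip(dx, dy))   # sum of dx*dy
s_xx = sum(u * u for u in dx)               # sum of dx^2
s_yy = sum(v * v for v in dy)               # sum of dy^2

b_yx = s_xy / s_xx                          # regression coefficient of y on x
b_xy = s_xy / s_yy                          # regression coefficient of x on y

# r^2 = b_yx * b_xy, and r takes the common sign of the two coefficients
r = copysign(sqrt(b_yx * b_xy), b_yx)

# Estimate of y at x = 30 from the line of regression of y on x
y_at_30 = y_bar + b_yx * (30 - x_bar)

print(round(b_yx, 4), round(b_xy, 4), round(r, 3), round(y_at_30, 4))
```

Working from unrounded coefficients, the estimate at $x = 30$ differs from the hand computation only in the later decimal places.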
Problem:
(b) From Eq. (1), the estimated supply of sugar when its price is Rs. 20 per kg is given by
$$y = 0.025 + 1.5 \times 20 = 30.025 \text{ kg}$$
(c) $r(x, y) = 1$
The relationship between $x$ and $y$ is exactly linear, i.e., all the observed values $(x, y)$
lie on a straight line.
Problem:
Solution:
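Regression equation of $y$ on $x$ is $y = x$, so
$$b_{yx} = 1$$
Regression equation of $x$ on $y$ is $4x - y = 3$, i.e.,
$$x = \frac{1}{4}y + \frac{3}{4}, \qquad b_{xy} = \frac{1}{4}$$
$$r^2 = b_{yx} \cdot b_{xy} = 1 \times \frac{1}{4} = \frac{1}{4}$$
$$r = \pm 0.5$$
Since both the regression coefficients are positive, $r = 0.5$.
(b) We are given that the second moment of $x$ about the origin is 2, i.e., $\frac{\sum x^2}{n} = 2$.
Since $(\bar{x}, \bar{y})$ is the point of intersection of the two lines of regression,
solving $y = x$ and $4x - y = 3$ gives $x = 1 = y$, so
$$\bar{x} = 1 \text{ and } \bar{y} = 1$$
$$\sigma_x^2 = \frac{\sum x^2}{n} - \bar{x}^2 = 2 - 1 = 1$$
Also, $b_{yx} = r\,\frac{\sigma_y}{\sigma_x}$, so
$$1 = \frac{1}{2}\left(\frac{\sigma_y}{1}\right)$$
$$\sigma_y = 2$$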
Coefficient of Determination:
The coefficient of correlation between two variable series is a measure of the linear relationship
between them and indicates the amount of variation of one variable which is associated with or
accounted for by another variable. A more useful and readily comprehensible measure for this
purpose is the coefficient of determination, which gives the percentage variation in the dependent
variable that is accounted for by the independent variable.
In other words, the coefficient of determination gives the ratio of the explained variance to the
total variance. The coefficient is given by the square of the correlation coefficient, i.e.,
$$r^2 = \frac{\text{explained variance}}{\text{total variance}}$$
Ex:
If $r = 0.8$, we cannot conclude that 80% of the variation in the relative series
(dependent variable) is due to the variation in the subject series (independent variable). The
coefficient of determination in this case is $r^2 = 0.64$, which implies that only 64% of the
variation in the relative series has been explained by the subject series and the remaining 36% of
the variation is due to other factors.
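The identity between $r^2$ and the ratio of explained to total variance can be illustrated numerically. The following Python sketch uses a small hypothetical data set (not taken from the notes) and fits the line of regression of $y$ on $x$ by least squares.

```python
from math import sqrt

# Hypothetical illustrative data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)

r = s_xy / sqrt(s_xx * s_yy)        # correlation coefficient

# Line of regression of y on x and its fitted values
b = s_xy / s_xx
a = y_bar - b * x_bar
fitted = [a + b * x for x in xs]

explained = sum((f - y_bar) ** 2 for f in fitted)   # explained variation
total = s_yy                                        # total variation
ratio = explained / total

print(round(r ** 2, 4), round(ratio, 4))
```

Both printed values agree, which is the statement $r^2 = \text{explained variance} / \text{total variance}$ for a least-squares line.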
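Similarly,
$$r_{13.2} = \frac{r_{13} - r_{12} r_{32}}{\sqrt{(1 - r_{12}^2)(1 - r_{32}^2)}}$$
and
$$r_{23.1} = \frac{r_{23} - r_{21} r_{31}}{\sqrt{(1 - r_{21}^2)(1 - r_{31}^2)}}$$
Note:
$$1 - R_{1.23}^2 = \frac{\omega}{\omega_{11}}$$
where
$$\omega = \begin{vmatrix} 1 & r_{12} & r_{13} \\ r_{21} & 1 & r_{23} \\ r_{31} & r_{32} & 1 \end{vmatrix}
= 1 - r_{12}^2 - r_{13}^2 - r_{23}^2 + 2 r_{12} r_{13} r_{23}$$
and
$$\omega_{11} = \begin{vmatrix} 1 & r_{23} \\ r_{32} & 1 \end{vmatrix} = 1 - r_{23}^2$$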
Problem:
From the data relating to the yield of dry bark (𝑋1 ), height (𝑋2 ) and girth (𝑋3 ) for 18 cinchona
plants, the following correlation coefficients were obtained:
$r_{12} = 0.77$, $r_{13} = 0.72$ and $r_{23} = 0.52$. Find the partial correlation coefficient
$r_{12.3}$ and the multiple correlation coefficient $R_{1.23}$.
Solution:
Problem:
In a trivariate distribution 𝜎1 = 2, 𝜎2 = 𝜎3 = 3, 𝑟12 = 0.7, 𝑟23 = 𝑟31 = 0.5.
Find (i) 𝑟23.1 (ii) 𝑅1.23 (iii) 𝑏12.3 , 𝑏13.2 and (iv) 𝜎1.23 .
Solution:
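$$R_{1.23} = +0.7211$$
(iii)
$$b_{12.3} = r_{12.3}\left(\frac{\sigma_{1.3}}{\sigma_{2.3}}\right) \text{ and } b_{13.2} = r_{13.2}\left(\frac{\sigma_{1.2}}{\sigma_{3.2}}\right) \qquad (1)$$
Eq. (1) gives
$$b_{12.3} = 0.6 \times \frac{1.7320}{2.5980} = 0.4 \text{ and } b_{13.2} = 0.2425 \times \frac{1.4282}{2.5980} = 0.1333$$
(iv)
$$\sigma_{1.23} = \sigma_1 \sqrt{\frac{\omega}{\omega_{11}}}$$
$$\omega = \begin{vmatrix} 1 & r_{12} & r_{13} \\ r_{21} & 1 & r_{23} \\ r_{31} & r_{32} & 1 \end{vmatrix}
= 1 - r_{12}^2 - r_{13}^2 - r_{23}^2 + 2 r_{12} r_{13} r_{23} = 0.36$$
and
$$\omega_{11} = \begin{vmatrix} 1 & r_{23} \\ r_{32} & 1 \end{vmatrix} = 1 - r_{23}^2 = 1 - 0.5^2 = 0.75$$
$$\sigma_{1.23} = 2 \times \sqrt{\frac{0.36}{0.75}} = 1.3856$$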
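The quantities in this trivariate problem can be recomputed with a short Python sketch. The helper `partial_r` is a hypothetical name introduced here; the expressions $\sigma_{1.3} = \sigma_1\sqrt{1 - r_{13}^2}$ (and likewise for the other two-suffix standard deviations) are an assumption inferred from the numerical values 1.7320, 1.4282 and 2.5980 used in the solution.

```python
from math import sqrt


def partial_r(r_ab, r_ac, r_bc):
    """Partial correlation r_ab.c between a and b with c held fixed."""
    return (r_ab - r_ac * r_bc) / sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))


s1, s2, s3 = 2.0, 3.0, 3.0          # sigma_1, sigma_2, sigma_3
r12, r23, r13 = 0.7, 0.5, 0.5

r23_1 = partial_r(r23, r12, r13)    # (i)  r_23.1
r12_3 = partial_r(r12, r13, r23)
r13_2 = partial_r(r13, r12, r23)

# (ii) multiple correlation from 1 - R^2 = omega / omega_11
omega = 1 - r12 ** 2 - r13 ** 2 - r23 ** 2 + 2 * r12 * r13 * r23
omega_11 = 1 - r23 ** 2
R1_23 = sqrt(1 - omega / omega_11)

# (iii) partial regression coefficients (two-suffix sigmas are assumed forms)
s1_3 = s1 * sqrt(1 - r13 ** 2)
s2_3 = s2 * sqrt(1 - r23 ** 2)
s1_2 = s1 * sqrt(1 - r12 ** 2)
s3_2 = s3 * sqrt(1 - r23 ** 2)
b12_3 = r12_3 * s1_3 / s2_3
b13_2 = r13_2 * s1_2 / s3_2

# (iv) residual standard deviation
s1_23 = s1 * sqrt(omega / omega_11)

print(round(r23_1, 4), round(R1_23, 4), round(b12_3, 4),
      round(b13_2, 4), round(s1_23, 4))
```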
Multiple regression:
Example:
1) The salary of a person in an organisation ($Y$) has to be regressed on experience ($X_1$)
and mistakes ($X_2$). It is given that
$$\bar{Y} = 3.3; \quad \bar{X}_1 = 2.7; \quad \bar{X}_2 = 13.7$$
$$S_y = 2.1; \quad S_1 = 1.5; \quad S_2 = 2.6$$
and the zero-order correlations are
$$r_{y1} = 0.5; \quad r_{y2} = -0.3; \quad r_{12} = -0.47$$
Find the linear regression and interpret the results.
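The regression equation has the form $Y = a + b_1 X_1 + b_2 X_2$. So,
$$b_1 = \beta_1 = \frac{S_y}{S_1}\left(\frac{r_{y1} - r_{y2}\, r_{12}}{1 - r_{12}^2}\right)$$
$$b_1 = \beta_1 = \frac{2.1}{1.5}\left(\frac{0.50 - (-0.3)(-0.47)}{1 - (-0.47)^2}\right) = 0.65$$
Similarly,
$$b_2 = \beta_2 = \frac{S_y}{S_2}\left(\frac{r_{y2} - r_{y1}\, r_{12}}{1 - r_{12}^2}\right)$$
$$b_2 = \beta_2 = \frac{2.1}{2.6}\left(\frac{-0.30 - (0.5)(-0.47)}{1 - (-0.47)^2}\right) = -0.07$$
Calculation of $a$:
$$a = \bar{Y} - b_1 \bar{X}_1 - b_2 \bar{X}_2$$
$$\beta_0 = \bar{Y} - \beta_1 \bar{X}_1 - \beta_2 \bar{X}_2 = 3.3 - (0.65)(2.7) - (-0.07)(13.7) \approx 2.5$$
Hence the fitted regression is $Y = 2.5 + 0.65 X_1 - 0.07 X_2$.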
Interpretation:
1) If a person has no experience and has not made any mistakes, he or she would get a salary of
2.5 units.
2) If the experience goes up by 1 unit, with the number of mistakes unchanged, the salary would
increase by 0.65 units.
3) For each additional mistake, with experience unchanged, the salary would decrease by 0.07 units.
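The coefficients of this two-predictor regression can be recomputed from the summary statistics with a short Python sketch. The variable names are illustrative assumptions; the formulas are the ones stated in the example. Because the sketch keeps unrounded coefficients, the intercept differs slightly from the hand value obtained with the rounded 0.65 and -0.07.

```python
# Summary statistics given in the example
y_bar, x1_bar, x2_bar = 3.3, 2.7, 13.7
s_y, s_1, s_2 = 2.1, 1.5, 2.6
r_y1, r_y2, r_12 = 0.5, -0.3, -0.47

denom = 1 - r_12 ** 2

# Partial regression coefficients of Y on X1 and X2
b1 = (s_y / s_1) * (r_y1 - r_y2 * r_12) / denom
b2 = (s_y / s_2) * (r_y2 - r_y1 * r_12) / denom

# Intercept: the fitted plane passes through the point of means
a = y_bar - b1 * x1_bar - b2 * x2_bar

print(f"Y = {a:.2f} + {b1:.2f} X1 + ({b2:.2f}) X2")
```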
Module-4
Bernoulli’s trials:
Suppose that associated with a random trial there is an event called 'success' and the
complementary event is called 'failure'. Let the probability of success be $p$ and the probability
of failure be $q = 1 - p$. Suppose the random trial is repeated $n$ times under identical
conditions. These are called Bernoullian trials.
Bernoulli’s Distribution:
A random variable $X$ which takes the two values 0 and 1 with probabilities $q$ and $p$
respectively, that is, $P(X = 0) = q$ and $P(X = 1) = p$ with $q = 1 - p$, is called a Bernoulli
discrete random variable. The probability function of Bernoulli's distribution can be written as
$$P(X = x) = p^x q^{1 - x}, \quad x = 0, 1$$
1. Mean of $X$ is
$$\mu = E(X) = \sum X_i \, P(X_i) = (0 \times q) + (1 \times p) = p$$
2. Variance of $X$ is
$$V(X) = E(X^2) - \mu^2 = p - p^2 = pq$$
Binomial Distribution:
Let a random experiment be performed repeatedly, and let the occurrence of an event $A$ in any
trial be called a success and its non-occurrence $\bar{A}$ a failure (Bernoulli trial). Consider a
series of $n$ independent Bernoulli trials ($n$ being finite) in which the probability of success
$P(A) = p$, and hence $P(\bar{A}) = 1 - p = q$, is constant for each trial.
Since the probabilities of $0, 1, 2, 3, \dots, n$ successes, namely
$q^n,\ {}^nC_1 q^{n-1} p,\ {}^nC_2 q^{n-2} p^2,\ \dots,\ p^n$, are the successive terms of the
binomial expansion of $(q + p)^n$, the probability distribution so obtained is called the Binomial
probability distribution.
Definition:
A random variable $X$ is said to follow the Binomial distribution, denoted by $B(n, p)$, if it
assumes only non-negative values and its probability mass function is given by
$$P(X = x) = \begin{cases} {}^nC_x\, p^x q^{n-x}, & x = 0, 1, 2, 3, \dots, n \\ 0, & \text{otherwise} \end{cases}$$
where $n$ and $p$ are known as the parameters.
Note:
1. $n$ is also sometimes known as the degree of the distribution.
2. $\sum_{x=0}^{n} {}^nC_x\, p^x q^{n-x} = (q + p)^n = 1$
3. The Binomial distribution is important not only because of its wide range of applicability,
but also because it gives rise to many other probability distributions.
4. Any variable which follows the Binomial distribution is known as a Binomial variate.
Conditions for Binomial Experiment:
The Bernoulli process, involving a series of independent trials, is based on the following
conditions:
1. There are only two mutually exclusive and collectively exhaustive outcomes of the random
variable; one of them is referred to as a success and the other as a failure.
2. The random experiment is performed under the same conditions for a fixed and finite
(also discrete) number of times, say $n$. Each observation of the random variable in a
random experiment is called a trial. Each trial generates either a success, with probability $p$,
or a failure, with probability $q$.
3. The outcome (i.e., success or failure) of any trial is not affected by the outcome of any
other trial.
4. All the observations are assumed to be independent of each other, and the probability of
each outcome remains constant throughout the process.
Example:
To understand the Bernoulli process, consider the coin-tossing problem where 3 coins are tossed.
Suppose we are interested in the probability of two heads. Two heads can be obtained in the
following three ways:
HHT, HTH, THH.
In general, for a binomial random variable $X$, the probability of success (occurrence of the
desired outcome) $r$ times in $n$ independent trials, regardless of their order of occurrence, is
given by the formula
$$P(X = r \text{ successes}) = {}^nC_r\, p^r q^{n-r} = \frac{n!}{(n - r)!\, r!}\, p^r q^{n-r}, \quad r = 0, 1, 2, 3, \dots, n$$
where
n = number of trials (specified in advance) or sample size
p = probability of success
q = (1 – p), probability of failure
X = discrete binomial random variable
r = number of successes in n trials
Mean of $X$ is
$$\mu = E(X) = \sum_{r=0}^{n} r\, p(r) = np$$
Variance of $X$ is
$$\sigma^2 = \sum_{r=0}^{n} r^2 p(r) - \mu^2 = npq$$
Problem 1:
A fair coin is tossed six times, then find the probability of getting four heads.
Solution:
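$p$ = probability of getting a head = 1/2
$q$ = probability of not getting a head = 1/2
$n = 6$, $r = 4$
$$p(r) = {}^nC_r\, p^r q^{n-r}$$
$$p(4) = {}^6C_4 \left(\frac{1}{2}\right)^4 \left(\frac{1}{2}\right)^2 = \frac{6!}{4!\, 2!} \left(\frac{1}{2}\right)^6 = \frac{15}{64}$$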
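As a check on this computation, the binomial probability mass function can be evaluated exactly with rational arithmetic. The sketch below is illustrative; `binomial_pmf` is a hypothetical helper name, not something defined in the notes.

```python
from fractions import Fraction
from math import comb


def binomial_pmf(r, n, p):
    """P(X = r) for X ~ B(n, p), i.e. nCr * p^r * q^(n-r)."""
    q = 1 - p
    return comb(n, r) * p ** r * q ** (n - r)


p = Fraction(1, 2)                       # fair coin
prob_four_heads = binomial_pmf(4, 6, p)  # four heads in six tosses

# The probabilities over r = 0..n are the terms of (q + p)^n and sum to 1
total = sum(binomial_pmf(r, 6, p) for r in range(7))

print(prob_four_heads, total)
```

Using `Fraction` keeps the result as an exact fraction, so it can be compared directly with the hand-computed value.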
Problem 2:
The incidence of an occupational disease in an industry is such that the workers have a 20%
chance of suffering from it. What is the probability that out of 6 workers chosen at random, four
or more will suffer from disease?
Solution:
The probability that four or more workers suffer from disease = 𝑃(𝑋 ≥ 4)
𝑃(𝑋 ≥ 4) = 𝑃(𝑋 = 4) + 𝑃(𝑋 = 5) + 𝑃(𝑋 = 6)
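With $n = 6$ and $p = 0.2$, the three terms of this sum can be evaluated numerically. The following Python sketch does so; the helper name `binomial_tail` is an assumption made for illustration.

```python
from math import comb


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ B(n, p), summing the pmf from k to n."""
    q = 1 - p
    return sum(comb(n, x) * p ** x * q ** (n - x) for x in range(k, n + 1))


# Six workers, each with a 20% chance of suffering from the disease
p_four_or_more = binomial_tail(4, 6, 0.2)
print(round(p_four_or_more, 5))
```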
Problem 3:
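Six dice are thrown 729 times. How many times do you expect at least three dice to show a 5 or
6?
Solution:
$p$ = probability of occurrence of 5 or 6 in one throw $= \frac{2}{6} = \frac{1}{3}$
$$q = 1 - p = 1 - \frac{1}{3} = \frac{2}{3}$$
$$n = 6$$
The probability of getting at least three dice to show a 5 or 6 is
$$\begin{aligned}
P(X \ge 3) &= P(X = 3) + P(X = 4) + P(X = 5) + P(X = 6) \\
&= {}^6C_3 \left(\frac{1}{3}\right)^3 \left(\frac{2}{3}\right)^3 + {}^6C_4 \left(\frac{1}{3}\right)^4 \left(\frac{2}{3}\right)^2 + {}^6C_5 \left(\frac{1}{3}\right)^5 \left(\frac{2}{3}\right)^1 + {}^6C_6 \left(\frac{1}{3}\right)^6 \\
&= \frac{1}{3^6}(160 + 60 + 12 + 1) = \frac{233}{729}
\end{aligned}$$
The expected number of such cases in 729 throws is
$$729 \left(\frac{233}{729}\right) = 233$$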
Problem 4:
Solution:
Given 𝑛 = 400, 𝑝 = 0.2, 𝑞 = 0.8
(i) Mean is 𝑛𝑝 = 400 × 0.2 = 80
(ii) Standard deviation is √𝑛𝑝𝑞 = √80 × 0.8 = √64 = 8
Problem 5:
Find the maximum 𝑛 such that the probability of getting no head in tossing a fair coin 𝑛 times is
greater than 0.1.
Solution:
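With a fair coin, the probability of getting no head in $n$ tosses is $q^n = (1/2)^n$, so the condition is $(1/2)^n > 0.1$. A minimal Python sketch of a direct search for the largest such $n$ (the loop and variable names are illustrative assumptions):

```python
q = 0.5            # probability of a tail on a single toss of a fair coin
threshold = 0.1

# Increase n while the next value q**(n + 1) still exceeds the threshold
n = 0
while q ** (n + 1) > threshold:
    n += 1

print(n, q ** n)
```

Since $(1/2)^3 = 0.125 > 0.1$ while $(1/2)^4 = 0.0625 < 0.1$, the search stops at the last $n$ that satisfies the inequality.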
Solution:
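The number of trials is $n = 6$ and $N = \sum f_i = 200$ is the total frequency.
$$\text{Mean} = \frac{\sum f_i x_i}{\sum f_i} = \frac{25 + 104 + 174 + 128 + 80 + 24}{200} = 2.675$$
For the binomial distribution, Mean $= np$, so
$$6p = 2.675, \quad p = \frac{2.675}{6} = 0.446$$
$$q = 1 - 0.446 = 0.554$$
The expected frequencies are the successive terms of $N(q + p)^n$:
$$\begin{aligned}
200(0.554 + 0.446)^6 &= 200\big[{}^6C_0 (0.554)^6 + {}^6C_1 (0.554)^5 (0.446) + {}^6C_2 (0.554)^4 (0.446)^2 + {}^6C_3 (0.554)^3 (0.446)^3 \\
&\qquad + {}^6C_4 (0.554)^2 (0.446)^4 + {}^6C_5 (0.554)^1 (0.446)^5 + {}^6C_6 (0.446)^6\big] \\
&= 200[0.02891 + 0.1396 + 0.2809 + 0.3016 + 0.1821 + 0.05864 + 0.007866] \\
&= 5.782 + 27.92 + 56.18 + 60.32 + 36.42 + 11.728 + 1.5732
\end{aligned}$$
The successive terms in the expansion give the expected or theoretical frequencies, which are
| $x$ | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| $f$ (expected or theoretical frequencies) | 6 | 28 | 56 | 60 | 36 | 12 | 2 |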
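The fitting procedure can be sketched in Python, starting from the mean 2.675, $n = 6$ and $N = 200$ used in the solution. Rounding $p$ to three decimals mirrors the hand computation and is an assumption of this sketch rather than a requirement of the method.

```python
from math import comb

N = 200          # total frequency
n = 6            # number of trials
mean = 2.675     # sample mean from the solution

# Equate the sample mean to n*p; p is rounded as in the hand computation
p = round(mean / n, 3)
q = 1 - p

# Expected frequencies are the successive terms of N * (q + p)^n
expected = [N * comb(n, x) * p ** x * q ** (n - x) for x in range(n + 1)]
rounded = [round(e) for e in expected]

print(p, rounded, sum(rounded))
```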
Home Work:
1. A die is tossed thrice. A success is getting 1 or 6 on a toss. Find the mean and variance of
the number of successes.
2. The mean and variance of a binomial distribution are 4 and 4/3 respectively. Then find
𝑃(𝑋 ≥ 1).
3. Fit a binomial distribution to the following frequency distribution
𝑥 0 1 2 3 4 5
𝑓 2 14 20 34 22 8
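$$M_X(t) = E(e^{tX}) = \sum_{x=0}^{n} e^{tx}\, {}^nC_x\, p^x q^{n-x} = (q + p e^t)^n$$
If the moment generating function of a random variable $X$ is of the form $(0.4 e^t + 0.6)^8$, find
the moment generating function of $3X + 2$.
Solution:
Moment generating function of a random variable $X$ is $M_X(t) = (0.4 e^t + 0.6)^8$. Hence
$$M_{3X+2}(t) = E\left(e^{t(3X + 2)}\right) = e^{2t} M_X(3t) = e^{2t} (0.4 e^{3t} + 0.6)^8$$
The cumulative binomial probability is
$$B(x; n, p) = P(X \le x) = \sum_{k=0}^{x} b(k; n, p) = \sum_{k=0}^{x} {}^nC_k\, p^k q^{n-k}, \quad x = 0, 1, 2, 3, \dots, n$$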
S.D. Poisson (1837) introduced the Poisson distribution as a distribution of rare events,
i.e., events whose probability of occurrence is very small but for which the number of trials that
could lead to the occurrence of the event is very large.
Ex:
1. The no. of printing mistakes per page in a large text
2. Number of suicides reported in a particular city
3. Number of air accidents in some unit time
4. Number of cars passing a crossing per minute during the busy hours of a day, etc.
Definition:
A random variable $X$ taking on one of the non-negative integer values 0, 1, 2, 3, 4, … (i.e.,
with no natural upper bound) is said to follow the Poisson distribution with parameter λ, λ > 0,
if its probability mass function is given by
$$P(x; \lambda) = P(X = x) = \begin{cases} \dfrac{\lambda^x e^{-\lambda}}{x!}, & x = 0, 1, 2, 3, \dots \\ 0, & \text{otherwise} \end{cases}$$
Then $X$ is called a Poisson random variable and the distribution is known as the Poisson
distribution.
The Poisson parameter is $\lambda = np > 0$.
Mean:
$$\mu = E(X) = \lambda = np$$
Variance:
$$V(X) = \sigma^2 = \sum_{x=0}^{\infty} x^2\, p(x; \lambda) - \mu^2 = \lambda$$
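Moment generating function:
$$M(t) = E(e^{tX}) = \sum_{x=0}^{\infty} e^{tx}\, P(x; \lambda) = \sum_{x=0}^{\infty} e^{tx}\, \frac{\lambda^x e^{-\lambda}}{x!} = e^{\lambda(e^t - 1)}$$
Characteristic function:
$$\phi(t) = E(e^{itX}) = \sum_{x=0}^{\infty} e^{itx}\, P(x; \lambda) = \sum_{x=0}^{\infty} e^{itx}\, \frac{\lambda^x e^{-\lambda}}{x!} = e^{\lambda(e^{it} - 1)}$$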
Problem 1:
A hospital switchboard receives an average of 4 emergency calls in a 10-minute interval. What
is the probability that
i. there are at most 2 emergency calls in a 10-minute interval
ii. there are exactly 3 emergency calls in a 10-minute interval?
Solution:
Mean $= \lambda = 4$
$$P(X = x) = p(x) = \frac{e^{-\lambda} \lambda^x}{x!}$$
i. P(at most 2 calls) $= P(X \le 2)$
$$\begin{aligned}
&= P(X = 0) + P(X = 1) + P(X = 2) \\
&= \frac{1}{e^4} + 4 \cdot \frac{1}{e^4} + 8 \cdot \frac{1}{e^4} \\
&= \frac{1}{e^4}(1 + 4 + 8) = 0.2381
\end{aligned}$$
ii. P(exactly 3 calls) $= P(X = 3) = \dfrac{1}{e^4} \cdot \dfrac{4^3}{3!} = 0.1954$
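The two probabilities in this problem can be checked with a small Python sketch of the Poisson probability mass function. The helper name `poisson_pmf` is an illustrative assumption.

```python
from math import exp, factorial


def poisson_pmf(x, lam):
    """P(X = x) for a Poisson variable with parameter lam."""
    return exp(-lam) * lam ** x / factorial(x)


lam = 4  # average number of emergency calls per 10-minute interval

p_at_most_2 = sum(poisson_pmf(x, lam) for x in range(3))   # P(X <= 2)
p_exactly_3 = poisson_pmf(3, lam)                          # P(X = 3)

print(round(p_at_most_2, 4), round(p_exactly_3, 4))
```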
Problem 2:
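If a random variable has a Poisson distribution such that $P(1) = P(2)$, find
i. Mean of the distribution
ii. $P(4)$
iii. $P(X \ge 1)$
iv. $P(1 < X < 4)$
Solution:
$P(1) = P(2)$ gives
$$\frac{e^{-\lambda} \lambda^1}{1!} = \frac{e^{-\lambda} \lambda^2}{2!}$$
$$\lambda^2 = 2\lambda$$
$$\lambda = 0 \text{ or } 2$$
But $\lambda \ne 0$, since $\lambda > 0$.
Therefore $\lambda = 2$.
i. Mean of the distribution is $\lambda = 2$.
ii. $p(x) = \dfrac{e^{-\lambda} \lambda^x}{x!}$, so
$$p(4) = \frac{e^{-2}\, 2^4}{4!} = 0.09022$$
iii. $P(X \ge 1) = 1 - P(X < 1) = 1 - P(X = 0)$
$$= 1 - \frac{e^{-2}\, 2^0}{0!} = 0.8647$$
iv. $P(1 < X < 4) = P(X = 2) + P(X = 3)$
$$= \frac{e^{-2}\, 2^2}{2!} + \frac{e^{-2}\, 2^3}{3!} = 0.4511$$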
Problem 3:
Solution:
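$$\text{Mean} = \frac{\sum f_i x_i}{\sum f_i} = \frac{0 + 156 + 138 + 81 + 20 + 5}{400} = 1$$
Hence $\lambda = 1$ and the expected frequencies are $400\, e^{-1}/x!$, $x = 0, 1, \dots, 5$.
| $x$ | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| Observed frequency | 142 | 156 | 69 | 27 | 5 | 1 |
| Expected frequency | 147 | 147 | 74 | 25 | 6 | 1 |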
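The fitting step can be sketched in Python. The sketch takes the first frequency row of the table (142, 156, 69, 27, 5, 1) as the observed counts, an assumption supported by the fact that these values reproduce the sum 0 + 156 + 138 + 81 + 20 + 5 used for the mean.

```python
from math import exp, factorial


def poisson_pmf(x, lam):
    """P(X = x) for a Poisson variable with parameter lam."""
    return exp(-lam) * lam ** x / factorial(x)


# Counts for x = 0..5, taken as the observed frequencies (assumption)
observed = [142, 156, 69, 27, 5, 1]

N = sum(observed)
lam = sum(x * f for x, f in enumerate(observed)) / N   # sample mean

# Expected frequency for each x is N * P(X = x)
expected = [round(N * poisson_pmf(x, lam)) for x in range(len(observed))]

print(N, lam, expected)
```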
Problem 4:
If the moment generating function of the random variable is $e^{4(e^t - 1)}$, find
$P(X = \mu + \sigma)$, where $\mu$ and $\sigma^2$ are the mean and variance of the Poisson random
variable $X$.
Solution:
$$M(t) = e^{\lambda(e^t - 1)} = e^{4(e^t - 1)}$$
Mean = Variance $= \lambda = 4$
Standard deviation $= \sqrt{4} = 2$
$$P(X = \mu + \sigma) = P(X = 4 + 2) = P(6)$$
$$P(X = 6) = \frac{e^{-4}\, 4^6}{6!} = 0.1042$$
Try yourself:
1. The distribution of typing mistakes committed by a typist is given below. Assuming the
distribution to be Poisson, find the expected frequencies
𝑥 0 1 2 3 4 5
𝑓 42 33 14 6 4 1
Hypergeometric Distribution:
A discrete random variable $X$ is said to follow the hypergeometric distribution with parameters
$N$, $M$ and $n$ if it assumes only non-negative values and its probability mass function is given by
$$P(X = k) = h(k; N, M, n) = \begin{cases} \dfrac{\binom{M}{k} \binom{N - M}{n - k}}{\binom{N}{n}}, & k = 0, 1, 2, 3, \dots, \min(n, M) \\ 0, & \text{otherwise} \end{cases}$$
where $N$ is a positive integer, $M$ is a positive integer not exceeding $N$ and $n$ is a positive
integer that is at most $N$.
(Or)
$$P(X = k) = h(k; N, M, n) = \frac{{}^MC_k \times {}^{N-M}C_{n-k}}{{}^NC_n}$$
Problem 1:
A batch of 10 rocker cover gaskets contains 4 defective gaskets. If we draw samples of size 3
without replacement, from the batch of 10, find the probability that a sample contains 2 defective
gaskets. Also find mean and variance.
Solution:
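$$P(X = k) = h(k; N, M, n) = \frac{{}^MC_k \times {}^{N-M}C_{n-k}}{{}^NC_n}$$
Here $N = 10$, $M = 4$, $n = 3$, $k = 2$
$$P(X = 2) = h(2; 10, 4, 3) = \frac{{}^4C_2 \times {}^6C_1}{{}^{10}C_3} = 0.3$$
Mean is
$$E(X) = np = \frac{nM}{N} = \frac{3 \times 4}{10} = \frac{12}{10} = 1.2$$
Variance is
$$\mathrm{var}(X) = \frac{nM(N - M)(N - n)}{N^2 (N - 1)} = \frac{3 \times 4\,(10 - 4)(10 - 3)}{10^2\,(10 - 1)} = 0.56$$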
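The gasket example can be verified exactly with rational arithmetic. In this Python sketch, `hypergeom_pmf` is a hypothetical helper name; the mean and variance are computed both from the closed-form expressions and directly from the probability mass function.

```python
from fractions import Fraction
from math import comb


def hypergeom_pmf(k, N, M, n):
    """P(X = k): k defectives in a sample of n drawn without replacement
    from N items of which M are defective."""
    return Fraction(comb(M, k) * comb(N - M, n - k), comb(N, n))


N, M, n = 10, 4, 3

p_two_defective = hypergeom_pmf(2, N, M, n)

# Closed-form mean and variance
mean = Fraction(n * M, N)
variance = Fraction(n * M * (N - M) * (N - n), N ** 2 * (N - 1))

# The same quantities computed directly from the pmf
support = range(0, min(n, M) + 1)
mean_from_pmf = sum(k * hypergeom_pmf(k, N, M, n) for k in support)
var_from_pmf = sum(k * k * hypergeom_pmf(k, N, M, n) for k in support) - mean_from_pmf ** 2

print(p_two_defective, mean, variance)
```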
Problem 2:
In the manufacture of car tyres, a particular production process is known to yield 10 tyres with
defective walls in every batch of 100 tyres produced. From a production batch of 100 tyres, a
sample of 4 is selected for testing to destruction. Find
Solution:
Sampling is clearly without replacement and we use the hypergeometric distribution with
$N = 100$, $M = 10$, $n = 4$, $k = 1$
i. $$P(X = k) = \frac{{}^MC_k \times {}^{N-M}C_{n-k}}{{}^NC_n}$$
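$$c = \frac{n!}{x_1!\, x_2! \dots x_k!}$$
Hence
$$p(x_1, x_2, \dots, x_k) = \frac{n!}{x_1!\, x_2! \dots x_k!}\, p_1^{x_1} p_2^{x_2} \dots p_k^{x_k}, \quad 0 \le x_i \le n$$
$$p(x_1, x_2, \dots, x_k) = p = \frac{n!}{\prod_{i=1}^{k} x_i!} \prod_{i=1}^{k} p_i^{x_i} \qquad (1)$$
which is the required probability function of the multinomial distribution. Eq. (1) is the general
term in the multinomial expansion
$$(p_1 + p_2 + \dots + p_k)^n, \qquad \sum_{i=1}^{k} p_i = 1$$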
Problem 1:
Suppose we have a bowl with 10 marbles 2 red marbles, 3 green marbles, and 5 blue marbles.
We randomly select 4 marbles from the bowl, with replacement. What is the probability of
selecting 2 green marbles and 2 blue marbles?
Solution:
To solve this problem, we apply the multinomial formula. We know the following:
On any particular trial, the probability of drawing a red, green, and blue marble is 0.2, 0.3, and
0.5, respectively.
i.e.,
$$p = \frac{n!}{x_1!\, x_2!\, x_3!}\, p_1^{x_1} p_2^{x_2} p_3^{x_3} = \frac{4!}{0!\, 2!\, 2!} (0.2)^0 (0.3)^2 (0.5)^2 = 0.135$$
$$p = 0.135$$
Thus, if we draw 4 marbles with replacement from the bowl, the probability of drawing 0 red
marbles, 2 green marbles, and 2 blue marbles is 0.135.
Problem 2:
In India, 30% of the population has a blood type of O+, 33% has A+, 12% has B+, 6% has AB+,
7% has O-, 8% has A-, 3% has B-, and 1% has AB-. If 15 Indian citizens are chosen at random,
what is the probability that 3 have a blood type of O+, 2 have A+, 3 have B+, 2 have AB+, 1 has
O-, 2 have A-, 1 has B-, and 1 has AB-?
Solution:
𝑛 = 15 trials
𝑥1 = 3 (3 O+)
𝑥2 = 2 (2 A+)
𝑥3 = 3 (3 B+)
𝑥4 = 2 (2 AB+)
𝑥5 = 1 (1 O−)
𝑥6 = 2 (2 A−)
𝑥7 = 1 (1 B−)
𝑥8 = 1 (1 𝐴B−)
𝑘 = 8 (8 𝑝𝑜𝑠𝑠𝑖𝑏𝑖𝑙𝑖𝑡𝑖𝑒𝑠)
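$$p = \frac{n!}{x_1!\, x_2!\, x_3!\, x_4!\, x_5!\, x_6!\, x_7!\, x_8!}\, p_1^{x_1} p_2^{x_2} p_3^{x_3} p_4^{x_4} p_5^{x_5} p_6^{x_6} p_7^{x_7} p_8^{x_8}$$
$$= \frac{15!}{3!\, 2!\, 3!\, 2!\, 1!\, 2!\, 1!\, 1!} \times 0.30^3 \times 0.33^2 \times 0.12^3 \times 0.06^2 \times 0.07^1 \times 0.08^2 \times 0.03^1 \times 0.01^1$$
$$p = 0.000011$$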
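Both multinomial problems above (the marbles and the blood types) can be checked with one small Python function. The name `multinomial_pmf` is an illustrative assumption; the function evaluates the multinomial coefficient times the product of the category probabilities raised to the observed counts.

```python
from math import factorial, prod


def multinomial_pmf(counts, probs):
    """n! / (x1! ... xk!) * p1^x1 * ... * pk^xk, with n = sum(counts)."""
    coeff = factorial(sum(counts))
    for x in counts:
        coeff //= factorial(x)      # exact integer division at every step
    return coeff * prod(p ** x for p, x in zip(probs, counts))


# Marbles: 0 red, 2 green, 2 blue in 4 draws with replacement
p_marbles = multinomial_pmf([0, 2, 2], [0.2, 0.3, 0.5])

# Blood types: O+, A+, B+, AB+, O-, A-, B-, AB- among 15 people
p_blood = multinomial_pmf(
    [3, 2, 3, 2, 1, 2, 1, 1],
    [0.30, 0.33, 0.12, 0.06, 0.07, 0.08, 0.03, 0.01],
)

print(round(p_marbles, 3), round(p_blood, 6))
```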
Covariance:
We are often interested in the inter-relationship, or association, between two random variables.
The covariance of two random variables X and Y is
𝐶𝑜𝑣(𝑋, 𝑌) = 𝐸(𝑋𝑌) − 𝐸(𝑋)𝐸(𝑌)
Or
𝐶𝑜𝑣(𝑋, 𝑌) = 𝐸[(𝑋 − 𝐸(𝑋))(𝑌 − 𝐸(𝑌))]
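The two expressions for the covariance give the same value, which can be illustrated numerically. The Python sketch below uses hypothetical paired observations and treats each pair as equally likely, so that expectations become simple averages.

```python
from statistics import fmean

# Hypothetical paired observations, each pair taken as equally likely
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

mean_x = fmean(xs)
mean_y = fmean(ys)

# Cov(X, Y) = E(XY) - E(X)E(Y)
cov_shortcut = fmean(x * y for x, y in zip(xs, ys)) - mean_x * mean_y

# Cov(X, Y) = E[(X - E(X))(Y - E(Y))]
cov_definition = fmean((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

print(cov_shortcut, cov_definition)
```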
Note:
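Moments:
For the moments about the mean we know that $\mu_0 = 1$ and $\mu_1 = 0$.
For the binomial distribution:
$$\text{Mean} = np$$
$$\mu_2 = npq$$
$$\mu_3 = npq(q - p)$$
$$\mu_4 = npq[1 + 3pq(n - 2)]$$
For the Poisson distribution:
$$\mu_1 = 0$$
$$\mu_2 = \lambda$$
$$\mu_3 = \lambda$$
$$\mu_4 = 3\lambda^2 + \lambda$$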
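The binomial moment formulas can be checked by computing the central moments directly from the probability mass function. The Python sketch below uses exact fractions with the illustrative choice $n = 6$, $p = 1/3$; the function name `binomial_central_moment` is an assumption made for this example.

```python
from fractions import Fraction
from math import comb


def binomial_central_moment(k, n, p):
    """k-th moment about the mean of B(n, p), summed over the pmf."""
    q = 1 - p
    mean = n * p
    return sum((x - mean) ** k * comb(n, x) * p ** x * q ** (n - x)
               for x in range(n + 1))


n, p = 6, Fraction(1, 3)
q = 1 - p

mu2 = binomial_central_moment(2, n, p)
mu3 = binomial_central_moment(3, n, p)
mu4 = binomial_central_moment(4, n, p)

# Closed forms stated in the notes
mu2_formula = n * p * q
mu3_formula = n * p * q * (q - p)
mu4_formula = n * p * q * (1 + 3 * p * q * (n - 2))

print(mu2, mu3, mu4)
```

The same approach applies to the Poisson moments if the infinite sum is truncated at a sufficiently large value of $x$.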