Statistics For Finance Notes
This handbook introduces students to basic quantitative statistical techniques that support the business decision making process. The explanations are presented in such a way that the underlying statistical theory is fully exposed, enabling a thorough understanding of the relationship between theory and application. Many illustrations apply these statistical techniques to business data, which keeps the handbook largely non-technical and non-mathematical in character. Various types of study material are given at the end of each chapter to help students apply the theories discussed in the text, develop the habit of quantitative thinking, and build the calculation skills needed for the various methods of analysis.
PREFACE
CHAPTER ONE
RANDOM VARIABLES AND PROBABILITY DISTRIBUTION
1.1 Introduction
1.2 Definition of key Terms
1.2.1 A variable
1.2.2 Random Variable (R.V)
1.2.2.1 Types Of Random Variable
1.3 Probability Distribution Of Random Variable
1.3.1 Discrete Probability Distributions
1.3.1.1 Probability Mass Function (PMF)
1.4 Cumulative Mass Function (CMF)
1.5 Characteristics Of Probability Distribution
1.5.1 Expected Value of a Discrete Probability Distribution
1.5.2 Variance of a Discrete Probability Distribution
1.8 Multivariate Or Joint Probability Distribution
1.8.1 Discrete Joint Probability Distribution
1.8.2 Marginal Discrete Probability Distribution
1.9 Covariance And Correlation
1.9.1 Covariance
1.9.2 Correlation
1.9.3 Properties Of Correlation Coefficient
1.10 Relationship Between Population And Sample
1.11 Linear Combination Of Random Variables
1.11.1 Properties Of Linear Combination Of Random Variables
1.12 Special Types Of Probability Distributions
1.12.1 Binomial Distribution
1.12.2 Mean, Variance, and Standard Deviation for a Binomial Random Variable
1.12.2 Poisson Probability Distribution
1.13 Continuous Probability Distributions
1.14 Normal Distribution
1.14.1 Properties of normal distribution
1.14.2 Standard Normal distribution
1.14.3 The Standard Normal Distribution Table
CHAPTER TWO
SAMPLING DISTRIBUTION
2.1 Introduction
2.2 Distribution of the sample mean
2.3 Distribution Of The Sample Proportion
2.4 Sampling Distribution of the Difference between Two Sample Means
2.5 Distribution for the differences between two sample proportions
CHAPTER THREE
ESTIMATION OF PARAMETERS
3.1 Introduction
3.2 Definition of Key terms
3.3 Properties of good estimator
3.3.1 Unbiasedness
3.3.2 Efficiency
3.3.3 Consistency
3.3.4 Sufficiency
3.4 Types of Estimation
3.4.1 Point Estimation
3.4.2 Interval Estimation
3.4.3 Confidence Intervals
3.5 Formula for Confidence Interval for the population mean
3.5.1 When population variance ($\sigma^2$) is known
3.5.2 When population variance ($\sigma^2$) is unknown
3.5.2.1 Confidence interval for µ when ($\sigma^2$) is unknown and the sample size is large
3.5.2.2 Small sample confidence Interval for the mean µ, when ($\sigma^2$) is unknown
3.5.3 Summary on how to estimate the confidence interval for µ
3.6 Confidence Interval estimate for the population Proportions
3.7 Estimation of the difference between two population means ($\mu_1 - \mu_2$)
3.7.1 When population variances are known
3.7.2 When population variances are unknown
3.8 Estimation of the difference between two population proportions ($p_1 - p_2$)
3.9 Estimation of the Sample size
CHAPTER FOUR
HYPOTHESIS TESTING
4.1 Introduction
4.2 Definition of key Terms
4.2.1 Statistical Hypothesis
4.2.2 A Hypotheses Testing
4.3 Types of Error
4.4 Types of Statistical Tests
4.5 Formulation of Null and Alternative hypotheses
4.6 Approaches For Hypothesis Testing
4.6.1 The Test of Significance Approach
4.6.1.1 Critical Value Approach
4.6.1.2 P-value approach
4.6.2 Confidence Interval approach
4.7 Hypothesis testing for population mean (μ)
4.7.1 Hypothesis testing for μ when $\sigma^2$ is known
5.2 Objective of Regression Analysis
5.3 Simple Linear Regression Model
5.4 The method of Ordinary Least Square (OLS)
5.5 Sampling distribution for Estimators of regression line
5.5.1 Distribution for and
5.5.2 Sampling Distribution for and
5.6 Estimation for regression coefficients
5.7 Hypothesis testing for Regression Coefficients ( )
5.8 Correlation Analysis
5.8.1 Sample correlation coefficient ( )
5.8.2 Properties of correlation coefficient
5.9 Coefficient of determination ( )
5.9.1 Properties of coefficient of determination
5.10 Hypothesis Testing For Correlation Coefficient ( )
5.11 Hypothesis Testing For Significance Of Regression Model ( )
5.12 Analysis of Variance (ANOVA)
5.13 Computer output and interpretation of the results
5.13 Multiple Linear Regression Analysis
5.13.1 Assumption of Multiple Linear Regression models
5.14 OLS estimators of Multiple Linear Regression model
5.15 Estimation for partial regression coefficients
5.16 Hypothesis testing in Multiple Linear Regression Analysis
5.16.1 Testing Individual Partial Regression Coefficients
5.16.2 Testing for two or more Partial Coefficients
5.16.3 Testing for the significance of the model
CHAPTER ONE
RANDOM VARIABLES AND PROBABILITY DISTRIBUTION
1.1 Introduction
In this chapter we discuss probability distributions of random variables, which are customarily used to model problems in various fields such as business, finance and economics, and in everyday life. To understand this chapter clearly, you need some basic knowledge of probability fundamentals.
Traditionally, random variables are denoted by capital letters, X, Y, Z or X1, X2, X3, etc.
There are two types of random variable (R.V): discrete and continuous random variables. A discrete random variable takes on only a finite (or countably infinite) number of values, and these are typically integers. Examples of discrete random variables are: the number of cars passing through a roadblock, the number of heads obtained when tossing one or more fair coins, the number of defective items in a sample, the number of deaths from COVID-19 in the year 2020, etc. On the other hand, a continuous random variable is a random variable that can take on any value (always a real number) in a given interval of values. Examples of continuous R.V are height, weight, rainfall, temperature, etc.
General examples of random variables in real life are: the unemployment rate, the consumer price index, the number of sales made in a week, the yearly profit of a company, share prices, return on investments, money supply, GDP, wages, cash flows, interest rates, etc.
If $X$ is a discrete random variable, the function $f(x) = P(X = x)$, defined for each $x$ within the range of $X$, is called the Probability Mass Function (PMF) of $X$. To see clearly what the probability distribution of a discrete random variable means, consider Example 1.1.
Example 1.1
Consider an experiment of tossing two fair coins simultaneously. Find the probability distribution of the total number of heads obtained.
Solution:
The probability distribution is built by listing the sample space, mapping each outcome to a value of the random variable, and computing the probability of each value.
The list of all possible outcomes can be obtained easily by using a tree diagram: the first coin shows either H or T and, for each of these, the second coin shows either H or T. This gives the sample space
$$S = \{HH, HT, TH, TT\}$$
From the sample space described above, the random variable (i.e. the number of heads) takes on three different values, 0, 1 or 2, depending on whether zero heads (no head), one head, or two heads were obtained in the experiment of tossing two fair coins. That is,
HH → 2 heads
HT → 1 head
TH → 1 head
TT → 0 heads
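The probability of an event can be obtained by employing the classical definition of probability, that is:
$$P(E) = \frac{n(E)}{n(S)}$$
where $n(E)$ is the number of outcomes in the event $E$ and $n(S)$ is the number of outcomes in the sample space $S$.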
Let $X$ be the number of observed heads. The probabilities of the number of heads showing up are as indicated below, and the probability distribution is shown in Table 1.
$$f(0) = P(X = 0) = \frac{n(X = 0)}{n(S)} = \frac{1}{4}$$
$$f(1) = P(X = 1) = \frac{n(X = 1)}{n(S)} = \frac{2}{4} = \frac{1}{2}$$
$$f(2) = P(X = 2) = \frac{n(X = 2)}{n(S)} = \frac{1}{4}$$
Table 1: Probability Distribution
Number of heads ($x$)     $f(x) = P(X = x)$
0                         1/4
1                         1/2
2                         1/4
Total                     1.00
Properties of PMF
1. $f(x) \ge 0$ for each $x$ in the range of $X$
2. $0 \le f(x) \le 1$
3. $\sum_x f(x) = 1$
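The construction in Example 1.1 can be sketched in Python by enumerating the sample space and counting heads. This is only an illustrative sketch; the function name `heads_pmf` is an assumption introduced here, not part of the handbook.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def heads_pmf(n_coins):
    """Return the PMF of the number of heads for n_coins fair coins."""
    # Sample space: every sequence of H/T of length n_coins, all equally likely.
    outcomes = list(product("HT", repeat=n_coins))
    # Count how many outcomes give each number of heads.
    counts = Counter(outcome.count("H") for outcome in outcomes)
    # Classical probability: n(E) / n(S).
    return {x: Fraction(counts[x], len(outcomes)) for x in sorted(counts)}


pmf = heads_pmf(2)
for x, p in pmf.items():
    print(x, p)

# The PMF properties can be checked directly.
assert all(0 <= p <= 1 for p in pmf.values())
assert sum(pmf.values()) == 1
```

Exact fractions are used instead of floats so that the probabilities match the hand calculation term by term.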
Example 1.2
The employment rate of a certain country A ($X$), in percentage, is assumed to be a discrete random variable whose probability distribution is as shown below:
$x$      -12    -10    -6     0      4      8      10     12
Prob.    0.10   0.15   0.10   0.15   0.10   0.15   0.10   0.15
Find
a) $P(X \ge 0)$
b) $P(X \ge -10)$
Solution:
a) $P(X \ge 0) = P(X = 0) + P(X = 4) + P(X = 8) + P(X = 10) + P(X = 12)$
$= 0.15 + 0.10 + 0.15 + 0.10 + 0.15$
$= 0.65$
b) $P(X \ge -10) = 1 - P(X < -10)$
$= 1 - P(X = -12)$
$= 1 - 0.10$
$= 0.90$
1.4 Cumulative Mass Function (CMF)
Definition: The Cumulative Mass Function $F(x)$ associated with the Probability Mass Function $f(x)$ of a discrete random variable $X$ is defined as follows:
$$F(x) = P(X \le x)$$
where $P(X \le x)$ means the probability that the random variable $X$ takes a value less than or equal to a specific given value $x$. For example, $P(X \le 2)$ means the probability that the random variable takes a value less than or equal to 2.
Example 1.3
Find Probability Mass Function (PMF) and Cumulative Mass Function (CMF) of a total number of
heads obtained by tossing a fair coin three times.
Solution:
A tree diagram with three levels (one level per toss, each branching into H and T) is used to obtain the sample space.
Therefore $S = \{HHH, HHT, HTH, HTT, THH, THT, TTH, TTT\}$
By letting $X$ = number of heads shown up, we find that $X$ can take the values 0, 1, 2 or 3, and hence the corresponding PMF is obtained as indicated below:
$$f(0) = P(X = 0) = \frac{1}{8}$$
$$f(1) = P(X = 1) = \frac{3}{8}$$
$$f(2) = P(X = 2) = \frac{3}{8}$$
$$f(3) = P(X = 3) = \frac{1}{8}$$
In tabular form:
$x$       0     1     2     3
$f(x)$    1/8   3/8   3/8   1/8
It follows that the Cumulative Mass Function (CMF) is obtained as indicated here.
From:
$$F(x) = P(X \le x)$$
$$F(0) = P(X \le 0) = P(X = 0) = \frac{1}{8}$$
$$F(1) = P(X \le 1) = P(X = 0) + P(X = 1) = \frac{4}{8}$$
$$F(2) = P(X \le 2) = P(X = 0) + P(X = 1) + P(X = 2) = \frac{7}{8}$$
$$F(3) = P(X \le 3) = P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3) = 1$$
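However, in plotting the CMF the following will be the ranges of $x$:
$$F(x) = \begin{cases} 0 & \text{for } x < 0 \\ \frac{1}{8} & \text{for } 0 \le x < 1 \\ \frac{4}{8} & \text{for } 1 \le x < 2 \\ \frac{7}{8} & \text{for } 2 \le x < 3 \\ 1 & \text{for } x \ge 3 \end{cases}$$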
With reference to the previous example, it can be observed that a CMF is merely an accumulation of the PMF over the values of $X$ less than or equal to a given $x$. That is,
$$F(x) = \sum_{t \le x} f(t)$$
Note: The results for both PMF and CMF can also be summarized in the following Table:
Number of heads   Values of $x$    PMF $f(x)$   Values of $X$   CMF $F(x)$
0                 $0 \le x < 1$    1/8          $X = 0$         1/8
1                 $1 \le x < 2$    3/8          $X \le 1$       4/8
2                 $2 \le x < 3$    3/8          $X \le 2$       7/8
3                 $3 \le x$        1/8          $X \le 3$       1
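The accumulation of a PMF into a CMF can be sketched as a running sum. The helper name `cmf_from_pmf` is an assumption made for illustration; the PMF values are those of Example 1.3.

```python
from fractions import Fraction
from itertools import accumulate


def cmf_from_pmf(pmf):
    """Return F(x) = P(X <= x) for every x in the support of the PMF."""
    xs = sorted(pmf)
    # Running total of the probabilities, taken in increasing order of x.
    running = accumulate(pmf[x] for x in xs)
    return dict(zip(xs, running))


# PMF of the number of heads in three tosses of a fair coin (Example 1.3).
pmf = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}
cmf = cmf_from_pmf(pmf)
for x, F in cmf.items():
    print(x, F)
```

Note that `Fraction` reduces 4/8 to 1/2 when printing; the value is the same as in the table.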
1.5 Characteristics Of Probability Distribution
Although a probability distribution shows the values taken by a random variable and their corresponding probabilities, in most cases a researcher is interested in deducing some summary characteristics from the probability distribution. These summary characteristics include, among others, the expected value (population mean), variance, covariance and correlation.
Definition:
Let $X$ be a random variable with probability $f(x) = P(X = x)$; then the expected value $E(X)$ is given by
$$E(X) = \sum_x x\,P(X = x)$$
or
$$\mu = E(X) = \sum_x x f(x)$$
In other words, if a discrete random variable $X$ has possible values $x_1, x_2, \dots, x_n$ with corresponding probabilities $P(X = x_1), P(X = x_2), P(X = x_3), \dots, P(X = x_n)$, then the expected value is obtained by multiplying each value the random variable takes by the corresponding probability of occurrence and summing, i.e.
$$E(X) = x_1 P(X = x_1) + x_2 P(X = x_2) + x_3 P(X = x_3) + \dots + x_n P(X = x_n)$$
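Here $\Sigma$ denotes the summation notation, whose properties are as indicated below:
Properties of summation notation
1. If $c$ is a constant, then
$$\sum_{i=1}^{n} c = nc$$
2. If $c$ is a constant, then
$$\sum_{i=1}^{n} c\,x_i = c \sum_{i=1}^{n} x_i$$
3. $$\sum_{i=1}^{n} (x_i + y_i) = \sum_{i=1}^{n} x_i + \sum_{i=1}^{n} y_i$$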
Properties of expected value
1. The expected value of a constant $c$ is the constant itself:
$$E(c) = c$$
2. The expected value of the sum of two random variables is equal to the sum of the expected values of the two random variables. That is, for the random variables X and Y,
$$E(X + Y) = E(X) + E(Y)$$
3. Also
$$E(XY) \ne E(X)E(Y)$$
That is, generally, the expected value of the product of two random variables is not equal to the product of the expected values of those random variables. However, there is an exception to the rule: if X and Y are independent, then
$$E(XY) = E(X)E(Y)$$
4. If $a$ is a constant, then
$$E(aX) = aE(X)$$
That is to say, the expected value of a constant times a random variable X is equal to the constant times the expected value of the R.V.
5. If $a$ and $b$ are constants, then
$$E(aX + b) = E(aX) + E(b) = aE(X) + b$$
Variance indicates how individual values are spread, dispersed or distributed around the mean value. The statistical concept of variance is also a useful measure of risk of any kind. Generally, if X is a discrete random variable, then its variance is given by:
$$\mathrm{Var}(X) = E\left[(X - E(X))^2\right]$$
$$= E(X^2) - (E(X))^2$$
$$= E(X^2) - \mu^2$$
$$= \sigma^2$$
where $E(X^2) = \sum_x x^2 f(x)$.
The standard deviation of X is therefore given by
$$\mathrm{SD}(X) = \sqrt{\mathrm{Var}(X)} = \sigma$$
Example 1.4
A company estimates the net profit for a new product to be launched, with the corresponding probabilities under different market conditions, as follows:
Market Condition                Good    Fair    Poor
Net Profit (in million Tsh.)    30      10      -3
Probability, $P(X = x)$         0.15    0.25    0.60
Required:
a) Calculate the expected value of the net profit for the company.
b) What is the standard deviation of the net profit?
Solution:
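a) We know that
$$E(X) = \sum_{i=1}^{n} x_i P(X = x_i)$$
$$= 30(0.15) + 10(0.25) + (-3)(0.60) = 4.5 + 2.5 - 1.8 = 5.2$$
Therefore the expected value of the net profit for the company under all three given market conditions is 5.2 million Tsh.
b) The variance is given by
$$\mathrm{Var}(X) = E(X^2) - (E(X))^2$$
$$E(X^2) = 30^2(0.15) + 10^2(0.25) + (-3)^2(0.60) = 135 + 25 + 5.4 = 165.4$$
$$\mathrm{Var}(X) = 165.4 - (5.2)^2 = 165.4 - 27.04 = 138.36$$
The Standard Deviation is given by:
$$\sigma = \sqrt{\mathrm{Var}(X)} = \sqrt{138.36} = 11.76$$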
Therefore the standard deviation of the net profit for the company under all three given market
conditions is 11.76 million Tsh. This tells us how much the net profit deviates from the expected
value of 5.2 million Tsh. Thus, we may say that although the expected net profit is about 5.2 million
Tsh, it may go above or below this value by 11.76 million Tsh. You may calculate the confidence
interval to estimate the interval on which the expected net profit will fall.
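The calculation in Example 1.4 can be reproduced with a few lines of Python. This is a minimal sketch; the function name `mean_var_sd` is an assumption introduced for illustration.

```python
import math


def mean_var_sd(values, probs):
    """Return (E(X), Var(X), SD(X)) of a discrete random variable."""
    mean = sum(x * p for x, p in zip(values, probs))               # E(X)
    second_moment = sum(x * x * p for x, p in zip(values, probs))  # E(X^2)
    variance = second_moment - mean ** 2                           # E(X^2) - (E(X))^2
    return mean, variance, math.sqrt(variance)


# Net profit (million Tsh.) under good, fair and poor market conditions.
profits = [30, 10, -3]
probs = [0.15, 0.25, 0.60]

mean, variance, sd = mean_var_sd(profits, probs)
print(round(mean, 2), round(variance, 2), round(sd, 2))
```

The same function can be applied to any discrete distribution given as a list of values and a matching list of probabilities.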
Example 1.5
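$x$      $f(x)$    $x f(x)$    $x^2 f(x)$
-10      0.15      -1.5        15
-6       0.2       -1.2        7.2
0        0.15      0           0
4        0.1       0.4         1.6
8        0.15      1.2         9.6
12       0.25      3           36
TOTAL              1.9         69.4
Hence:
$$E(X) = \sum x f(x) = 1.9$$
$$\mathrm{Var}(X) = E(X^2) - (E(X))^2$$
$$= \sum x^2 f(x) - \left(\sum x f(x)\right)^2$$
$$= 69.4 - (1.9)^2$$
$$= 65.79$$
$$\mathrm{SD}(X) = \sqrt{\mathrm{Var}(X)} = \sqrt{65.79} = 8.11$$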
Example 1.6
The monthly incomes of workers (in millions of TShs) in a certain sector, with their associated probabilities, are as indicated in the following probability distribution:
$x$      $f(x)$    $x f(x)$    $x^2 f(x)$
1.4      0.25      0.35        0.49
3.5      0.2       0.7         2.45
2        0.15      0.3         0.6
0.9      0.1       0.09        0.081
3        0.3       0.9         2.7
TOTAL              2.34        6.321
Hence:
$$E(X) = \sum x f(x) = 2.34$$
$$\mathrm{Var}(X) = E(X^2) - (E(X))^2$$
$$= \sum x^2 f(x) - \left(\sum x f(x)\right)^2$$
$$= 6.321 - (2.34)^2$$
$$= 0.8454$$
$$\mathrm{SD}(X) = \sqrt{\mathrm{Var}(X)} = \sqrt{0.8454} \approx 0.9195$$
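Properties of variance
2. If X and Y are two independent random variables, then
$$\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$$
and
$$\mathrm{Var}(X - Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$$
3. If $b$ is a constant, then
$$\mathrm{Var}(X + b) = \mathrm{Var}(X)$$
4. If $a$ is a constant, then
$$\mathrm{Var}(aX) = a^2\,\mathrm{Var}(X)$$
5. If X and Y are independent random variables and $a$ and $b$ are constants, then
$$\mathrm{Var}(aX + bY) = a^2\,\mathrm{Var}(X) + b^2\,\mathrm{Var}(Y)$$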
1. $f(x, y) \ge 0$
2. $\sum_x \sum_y f(x, y) = 1$
Example 1.6
Solution
Recall one of the conditions of a joint probability distribution function, that is, $\sum_x \sum_y f(x, y) = 1$.
It implies that,
( , )+ ( , )+ ( , )+ ( , )=
+ + + =
=
Therefore:
= .
$$h(x) = \sum_y f(x, y)$$
for each $x$ within the range of X is called the marginal distribution of X. Similarly the function
$$\mathrm{Cov}(X, Y) = E\left[(X - E(X))(Y - E(Y))\right]$$
$$= E(XY) - E(X)E(Y)$$
$$= E(XY) - \mu_X \mu_Y$$
where $\mu_X = E(X)$ and $\mu_Y = E(Y)$.
To compute the covariance as defined in the above equation, we use the following formula:
$$\mathrm{Cov}(X, Y) = \sum_x \sum_y (x - \mu_X)(y - \mu_Y) f(x, y) = \sum_x \sum_y x y f(x, y) - \mu_X \mu_Y$$
where $f(x, y)$ is the joint probability distribution function of the two random variables X and Y. The double summation sign in this expression indicates that the covariance requires summation over the range of values of both variables.
Properties Of Covariance
1. If X and Y are independent random variables, their covariance is zero. This can be verified as indicated here. Recall that if two random variables are independent, then
$$E(XY) = E(X)E(Y) = \mu_X \mu_Y$$
Substituting the above expression into equation (...), we see at once that the covariance of two independent random variables is zero.
2. If $a$, $b$, $c$ and $d$ are constants, then
$$\mathrm{Cov}(a + bX, c + dY) = bd\,\mathrm{Cov}(X, Y)$$
1.9.2 Correlation
The correlation coefficient is a numerical value which measures the strength of the linear relationship between two random variables. If X and Y are two random variables, their correlation coefficient, denoted by $\mathrm{Corr}(X, Y)$ or $\rho$, is given by
$$\rho = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$$
If the correlation coefficient is +1, the two variables are perfectly positively correlated, whereas if the correlation coefficient is −1, they are perfectly negatively correlated. If it is 0, there is no linear relationship at all. However, if $0.8 \le \rho < 1$ this indicates a very strong linear relationship, and the relationship is just strong when $0.6 \le \rho < 0.8$. Furthermore, if $\rho \le 0.3$ it indicates a weak linear relationship.
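The covariance and correlation formulas can be sketched for a small joint distribution. The joint probabilities below are hypothetical values chosen only for illustration (they are not taken from this handbook), and the function names are likewise assumptions.

```python
import math

# Hypothetical joint PMF f(x, y); the four probabilities sum to 1.
joint = {
    (0, 0): 0.3, (0, 1): 0.2,
    (1, 0): 0.1, (1, 1): 0.4,
}


def marginal(joint, axis):
    """Marginal PMF of X (axis=0) or Y (axis=1), obtained by summing out the other variable."""
    out = {}
    for pair, p in joint.items():
        out[pair[axis]] = out.get(pair[axis], 0.0) + p
    return out


def mean(pmf):
    return sum(v * p for v, p in pmf.items())


def variance(pmf):
    m = mean(pmf)
    return sum(v * v * p for v, p in pmf.items()) - m * m


def covariance(joint):
    """Cov(X, Y) = E(XY) - E(X) E(Y)."""
    e_xy = sum(x * y * p for (x, y), p in joint.items())
    return e_xy - mean(marginal(joint, 0)) * mean(marginal(joint, 1))


def correlation(joint):
    """rho = Cov(X, Y) / (sigma_X * sigma_Y)."""
    sd_x = math.sqrt(variance(marginal(joint, 0)))
    sd_y = math.sqrt(variance(marginal(joint, 1)))
    return covariance(joint) / (sd_x * sd_y)


print(round(covariance(joint), 4), round(correlation(joint), 4))
```

For these assumed probabilities the covariance is positive, so the two variables tend to move in the same direction.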
Example 1.7
The following Table gives the joint PDF of two random variables X and Y, where X represents the first-year rate of return (%) expected from investment A, and Y stands for the first-year rate of return expected from investment B.
X(%)
-10 0 20 30
a) Find the marginal distribution of X and hence the expected rate of return and standard
deviation
b) Find the marginal distribution of Y and hence the expected rate of return and standard
deviation
c) Compute the covariance, correlation coefficient between these cash flows and comment on
the results
d) Are the expected rates of return from the two investments independent?
Solution:
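The marginal distribution of $X$ is given below:
$x$      -10     0       20      30
Prob.    0.27    0.12    0.26    0.35

$x$      $h(x)$    $x h(x)$    $x^2 h(x)$
-10      0.27      -2.7        27
0        0.12      0           0
20       0.26      5.2         104
30       0.35      10.5        315
TOTAL              13          446
Hence:
$$E(X) = \sum x h(x) = 13$$
$$\mathrm{Var}(X) = E(X^2) - (E(X))^2$$
$$= \sum x^2 h(x) - \left(\sum x h(x)\right)^2$$
$$= 446 - 13^2$$
$$= 277$$
$$\mathrm{SD}(X) = \sqrt{\mathrm{Var}(X)} = \sqrt{277} = 16.64$$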
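The marginal distribution of $Y$ is given below:
$y$       20      50
$g(y)$    0.51    0.49
$$E(Y) = \sum y\,g(y)$$
$$= 20(0.51) + 50(0.49)$$
$$= 10.2 + 24.5$$
$$= 34.7$$
$$\mathrm{Var}(Y) = E(Y^2) - (E(Y))^2$$
$$= \sum y^2 g(y) - \left(\sum y\,g(y)\right)^2$$
$$= 20^2(0.51) + 50^2(0.49) - (34.7)^2 = 1429 - 1204.09 = 224.91$$
$$\mathrm{SD}(Y) = \sqrt{224.91} \approx 15.00$$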
So far we have seen how to compute numerous characteristics of the probability distribution of discrete random variables, such as the expected value, variance, standard deviation, covariance and correlation coefficient. All of these are population quantities. In practice, when conducting quantitative research, it is difficult to deal with the whole population unless the population is small and finite. In most cases we therefore use a sample, which is a subset of the population, to draw conclusions about the properties of the population. This is the basis of so-called inferential statistics, which will be discussed in later sections. Meanwhile, the sample counterparts, together with their corresponding population quantities, are summarized in the following Table.
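Population variable                                                           Sample counterpart (Raw Data)
$E(X) = \mu_X = \sum x f(x)$                                                  $\bar{x} = \frac{\sum x_i}{n}$
$\mathrm{Var}(X) = E(X - E(X))^2 = \sum (x - \mu_X)^2 f(x) = \sigma_X^2$      $s_x^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
$\mathrm{SD}(X) = \sqrt{\mathrm{Var}(X)} = \sigma_X$                          $\mathrm{SD}(X) = \sqrt{s_x^2} = s_x$
$\mathrm{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] = E(XY) - \mu_X \mu_Y$        sample $\mathrm{Cov}(X, Y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
$\mathrm{Corr}(X, Y) = \rho = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$   $\mathrm{Corr}(X, Y) = r = \frac{\text{sample } \mathrm{Cov}(X, Y)}{s_x s_y}$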
Example 1.8
Returns (in millions of shillings) from two samples of investments projects X and Y were recorded
as follows;
X 3 4 6 8 7
Y 2 5 7 8 10
(a) Compute the mean return and standard deviation for each Project.
(b) Compute the correlation between the returns and comment on your results.
Hint: The solution of the above problem is left as an exercise for the students.
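A hand-worked answer to Example 1.8 can be checked with a short Python sketch that applies the sample formulas (divisor $n - 1$). The helper names `sample_cov` and `sample_corr` are assumptions introduced here for illustration.

```python
import statistics

x = [3, 4, 6, 8, 7]    # returns from project X (millions of shillings)
y = [2, 5, 7, 8, 10]   # returns from project Y (millions of shillings)


def sample_cov(a, b):
    """Sample covariance with divisor n - 1."""
    a_bar, b_bar = statistics.mean(a), statistics.mean(b)
    return sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b)) / (len(a) - 1)


def sample_corr(a, b):
    """Sample correlation r = cov / (s_a * s_b)."""
    return sample_cov(a, b) / (statistics.stdev(a) * statistics.stdev(b))


print("mean X:", statistics.mean(x), "SD X:", round(statistics.stdev(x), 4))
print("mean Y:", statistics.mean(y), "SD Y:", round(statistics.stdev(y), 4))
print("r:", round(sample_corr(x, y), 4))
```

`statistics.stdev` computes the sample standard deviation, which matches the $n - 1$ divisor used for raw data.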
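$$Y = a_1 X_1 + a_2 X_2 + \dots + a_n X_n = \sum_{i=1}^{n} a_i X_i = a'X$$
is a linear combination of random variables, where $a'$ is a row vector (a row matrix) and $X$ is a column vector (column matrix). That is to say, a linear combination of random variables can also be expressed in matrix notation.
If
$$Y = a_1 X_1 + a_2 X_2 + \dots + a_n X_n = \sum_{i=1}^{n} a_i X_i$$
Then
$$E(Y) = a_1 E(X_1) + a_2 E(X_2) + \dots + a_n E(X_n) = \sum_{i=1}^{n} a_i E(X_i)$$
$$E(Y) = a_1 \mu_1 + a_2 \mu_2 + \dots + a_n \mu_n = \sum_{i=1}^{n} a_i \mu_i = a'\mu$$
$$\mathrm{Var}(Y) = \sum_{i=1}^{n} a_i^2\,\mathrm{Var}(X_i) + 2 \sum_{i<j} a_i a_j\,\mathrm{Cov}(X_i, X_j)$$
$$\mathrm{SD}(Y) = \sqrt{\mathrm{Var}(Y)}$$
For two random variables, if
$$Y = a_1 X_1 + a_2 X_2$$
Then
$$E(Y) = a_1 E(X_1) + a_2 E(X_2)$$
$$\mathrm{Var}(Y) = a_1^2\,\mathrm{Var}(X_1) + a_2^2\,\mathrm{Var}(X_2) + 2 a_1 a_2\,\mathrm{Cov}(X_1, X_2)$$
For three random variables as indicated below:
$$Y = a_1 X_1 + a_2 X_2 + a_3 X_3$$
$$E(Y) = a_1 E(X_1) + a_2 E(X_2) + a_3 E(X_3)$$
$$\mathrm{Var}(Y) = a_1^2\,\mathrm{Var}(X_1) + a_2^2\,\mathrm{Var}(X_2) + a_3^2\,\mathrm{Var}(X_3) + 2 a_1 a_2\,\mathrm{Cov}(X_1, X_2) + 2 a_1 a_3\,\mathrm{Cov}(X_1, X_3) + 2 a_2 a_3\,\mathrm{Cov}(X_2, X_3)$$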
If we have three or more random variables, it is more convenient to use matrix approach to
compute mean and variance of linear combination.
Example 1.9: Application in Portfolio mathematics:
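Eighty percent of a portfolio is invested in TBL stock and the remaining 20% is invested in UTT stock. TBL stock has an expected return of 6% and an expected standard deviation of return of 9%. UTT stock has an expected return of 20% and an expected standard deviation of 30%. The coefficient of correlation between the returns of the two securities is expected to be 0.4. Determine the expected return, the variance and the standard deviation of the portfolio return, and the variance of the portfolio return if the two securities are independent.
Solution:
Let $X_1$ be the return on TBL stock, $X_2$ the return on UTT stock, and let $w_1$ and $w_2$ denote the corresponding portfolio weights, so that the portfolio return is $R_p = w_1 X_1 + w_2 X_2$.
Data Given:
$w_1 = 80\%$, $w_2 = 20\%$, $E(X_1) = 6\%$, $E(X_2) = 20\%$, $\mathrm{SD}(X_1) = 9\%$, $\mathrm{SD}(X_2) = 30\%$, $\rho = 0.4$
a) $E(R_p) = w_1 E(X_1) + w_2 E(X_2) = 0.8(0.06) + 0.2(0.20) = 0.088$, i.e. 8.8%
b) $\mathrm{Var}(R_p) = w_1^2\,\mathrm{Var}(X_1) + w_2^2\,\mathrm{Var}(X_2) + 2 w_1 w_2\,\mathrm{Cov}(X_1, X_2)$
Where:
$$\mathrm{Cov}(X_1, X_2) = \rho\,\mathrm{SD}(X_1)\,\mathrm{SD}(X_2)$$
Therefore:
$$\mathrm{Var}(R_p) = w_1^2\,\mathrm{Var}(X_1) + w_2^2\,\mathrm{Var}(X_2) + 2 w_1 w_2\,\rho\,\mathrm{SD}(X_1)\,\mathrm{SD}(X_2)$$
$$= (0.8)^2 (0.09)^2 + (0.2)^2 (0.30)^2 + 2(0.8)(0.2)(0.4)(0.09)(0.30)$$
$$= 0.005184 + 0.0036 + 0.003456 = 0.01224$$
c) $\mathrm{SD}(R_p) = \sqrt{\mathrm{Var}(R_p)}$
$= \sqrt{0.01224}$
$\approx 0.1106$
d) If the two securities are independent, then $\mathrm{Cov}(X_1, X_2) = 0$
Therefore:
$$\mathrm{Var}(R_p) = w_1^2\,\mathrm{Var}(X_1) + w_2^2\,\mathrm{Var}(X_2)$$
$$= 0.005184 + 0.0036 = 0.008784$$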
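The two-asset portfolio formulas can be sketched in Python using the data stated in Example 1.9 (weights 0.8 and 0.2, expected returns 6% and 20%, standard deviations 9% and 30%, correlation 0.4). The function name `portfolio_stats` and its argument layout are assumptions made for illustration.

```python
import math


def portfolio_stats(weights, means, sds, rho):
    """Return (expected return, variance, standard deviation) of a two-asset portfolio."""
    w1, w2 = weights
    mean = w1 * means[0] + w2 * means[1]
    cov = rho * sds[0] * sds[1]  # Cov(X1, X2) = rho * SD(X1) * SD(X2)
    var = w1 ** 2 * sds[0] ** 2 + w2 ** 2 * sds[1] ** 2 + 2 * w1 * w2 * cov
    return mean, var, math.sqrt(var)


weights = (0.8, 0.2)   # TBL, UTT
means = (0.06, 0.20)   # expected returns
sds = (0.09, 0.30)     # standard deviations of return

mean, var, sd = portfolio_stats(weights, means, sds, rho=0.4)
print("correlated:", mean, var, sd)

# With independent securities the covariance term vanishes (rho = 0).
_, var_indep, _ = portfolio_stats(weights, means, sds, rho=0.0)
print("independent variance:", var_indep)
```

Passing the correlation as a parameter makes it easy to see how much of the portfolio variance comes from the covariance term.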
A binomial experiment is a statistical experiment which consists of repeated trials. Each trial can result in just two possible outcomes. We call one of these outcomes a success and the other a failure. The probability of success, denoted by $p$, is expected to be constant on every trial.
A binomial random variable is the number of successes, denoted by $X$, in $n$ repeated trials of a binomial experiment.
ii) Each trial can result in just two mutually exclusive possible outcomes. We call one of these outcomes a success and the other a failure.
iii) The probability of success, denoted by $p$, is the same (constant) on every trial.
iv) The trials are independent; that is, the outcome on one trial does not affect the outcome on other trials.
Consider the following random experiment. You flip a coin 10 times and count the number of times the coin lands on tails. This is a binomial experiment because it consists of a fixed number of repeated trials (10 flips), each trial has just two possible outcomes (heads or tails), the probability of tails is the same (0.5) on every trial, and the trials are independent.
$$b(x; n, p) = P(X = x) = \binom{n}{x} p^x q^{n - x} \quad \text{for } x = 0, 1, 2, \dots, n$$
where $q = 1 - p$.
Note:
$$\binom{n}{x} = {}^{n}C_x = \frac{n!}{x!\,(n - x)!}$$
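The binomial probability formula can be sketched with Python's standard library. The function names `binomial_pmf` and `binomial_cdf` are assumptions introduced for illustration; `math.comb` computes the binomial coefficient.

```python
from math import comb


def binomial_pmf(x, n, p):
    """P(X = x) for a binomial random variable with n trials and success probability p."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


def binomial_cdf(x, n, p):
    """P(X <= x), obtained by summing the PMF from 0 to x."""
    return sum(binomial_pmf(k, n, p) for k in range(x + 1))


# Coin flipped 10 times: probability of exactly 5 tails.
print(binomial_pmf(5, 10, 0.5))

# Probabilities over all possible values sum to 1.
print(sum(binomial_pmf(k, 10, 0.5) for k in range(11)))
```

Summing the PMF is adequate for the small values of $n$ used in these notes; for very large $n$ a statistics library would be a more practical choice.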
1.12.2 Mean, Variance, and Standard Deviation for a Binomial Random Variable
The expected value (mean) of a binomial random variable is given by $E(X) = np$, and the standard deviation is given by $\mathrm{SD}(X) = \sqrt{np(1 - p)}$. That is,
Mean: $\mu = np$
Variance: $\sigma^2 = npq$
Standard Deviation: $\sigma = \sqrt{npq}$
Example 1.10
Given that the expected value of a binomial distribution is 40 and the standard deviation is 6.
Required:
Calculate n, p and q.
Solution:
From
$$\mu = np = 40 \quad (1)$$
and
$$\sigma^2 = npq = 36 \quad (2)$$
Substituting (1) into (2), we have
$$40q = 36$$
$$q = 0.9, \quad p = 1 - q = 0.1$$
Thus $p = 0.1$, $q = 0.9$, and $n = 40 / p = 400$.
Example 1.11
Assume that on an average one telephone number out of 15 is busy.
Required:
What is the probability that if six randomly selected telephone numbers are dialled
a) Not more than three will be busy?
b) At least three of them will be busy?
Solution:
$p = \frac{1}{15}$, $q = \frac{14}{15}$, $n = 6$, then
a) $P(X \le 3) = P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3) = 0.9997$ (using the Binomial distribution formula)
Example 1.12
The probability that a student is accepted to a prestigious college is 0.3. If 5 students from the same
school apply, what is the probability that at most 2 are accepted?
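Data Given:
$p = 0.3$, $n = 5$
Find $P(X \le 2)$
$$P(X \le 2) = P(X = 0) + P(X = 1) + P(X = 2)$$
From
$$P(X = x) = \binom{n}{x} p^x (1 - p)^{n - x} = \frac{n!}{(n - x)!\,x!}\,p^x (1 - p)^{n - x}$$
$$P(X = 0) = \frac{5!}{(5 - 0)!\,0!} \times 0.3^0 \times 0.7^5 = \frac{5!}{5!\,0!} \times 0.3^0 \times 0.7^5 = 0.16807$$
$$P(X = 1) = \frac{5!}{(5 - 1)!\,1!} \times 0.3^1 \times 0.7^4 = \frac{5!}{4!\,1!} \times 0.3^1 \times 0.7^4 = 0.36015$$
$$P(X = 2) = \frac{5!}{(5 - 2)!\,2!} \times 0.3^2 \times 0.7^3 = \frac{5!}{3!\,2!} \times 0.3^2 \times 0.7^3 = 0.30870$$
Therefore:
$$P(X \le 2) = 0.16807 + 0.36015 + 0.30870 = 0.83692$$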
Note: the specified region could take many forms. For instance, it could be a length, an area, a
volume, a period of time, etc.
1.12.2.1 Application of Poisson Distribution
The number of deaths from COVID-19 worldwide in 2020
The number of birth defects and genetic mutations
The number of car accidents in Dar es Salaam city
The number of typing errors on a page
The spread of an endangered animal in Sub-Saharan Africa
The number of failures of a machine in one month
Notation
The following notation is helpful when we talk about the Poisson distribution.
$e$: A constant equal to approximately 2.71828. (Actually, $e$ is the base of the natural logarithm system.)
$\lambda$: The mean number of successes that occur in a specified region.
$x$: The actual number of successes that occur in a specified region.
$P(x, \lambda)$: The Poisson probability that exactly $x$ successes occur in a Poisson experiment when the mean number of successes is $\lambda$.
The probability distribution of the Poisson random variable X, representing the number of
outcomes occurring in a given time interval or a specified region of space is:
e x
P ( x, ) x 0,1,2...
x! for
Or
( , )= ( = )= = 0,1,2, ⋯
!
Where λ represent the average number of outcomes occurring in the specified time or region.
Furthermore, if X has a Poisson distribution, then ( ) = and ( ) = √
Example 1.13
The average number of days a school is closed due to snow during winter in a certain city in the USA is
4. Calculate the probability that the schools in this city will close for 6 days during a winter.
Solution:
$P(X = 6) = \frac{e^{-4}\, 4^{6}}{6!} = 0.1042$
Note: The Poisson distribution may be used to approximate the binomial distribution when n is
very large and p is very small. See the following example.
Example 1.14
Suppose that on average, 1 person in every 1000 is an alcoholic. Find the probability that a random
sample of 8000 people will yield fewer than 7 alcoholics.
Solution:
Let X represent the number of alcoholic persons in the sample.
$p = \frac{1}{1000} = 0.001$, $n = 8000$
Since p is very small and n is very large, then
$\lambda = np = 0.001 \times 8000 = 8$
Now,
$P(X < 7) = P(X=0) + P(X=1) + P(X=2) + \cdots + P(X=6)$
$= \frac{e^{-8}8^{0}}{0!} + \frac{e^{-8}8^{1}}{1!} + \frac{e^{-8}8^{2}}{2!} + \cdots + \frac{e^{-8}8^{6}}{6!} = 0.3134$
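To see how close the Poisson approximation in Example 1.14 is, the sketch below compares it with the exact binomial probability. It is an illustration only, written with the Python standard library; the helper names `poisson_pmf` and `binom_pmf` are assumptions and do not come from the notes.

```python
from math import comb, exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) for a Poisson random variable with mean lam."""
    return exp(-lam) * lam**x / factorial(x)

def binom_pmf(x, n, p):
    """P(X = x) for a binomial random variable with n trials and success probability p."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 8000, 0.001
lam = n * p  # Poisson mean used for the approximation

# "Fewer than 7" means x = 0, 1, ..., 6
poisson_prob = sum(poisson_pmf(x, lam) for x in range(7))
exact_prob = sum(binom_pmf(x, n, p) for x in range(7))

print(round(poisson_prob, 4), round(exact_prob, 4))
```

With n this large and p this small, the two printed values should differ only in the later decimal places, which is why the approximation is acceptable here.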
Example 1.15
The number of customers attended at CRDB bank follows Poisson distribution with a mean of 10
customers per hour, find the probability that in any given hour
Solution:
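Data Given
$\lambda = 10$
From:
$P(X = x) = \frac{e^{-\lambda}\lambda^{x}}{x!}$
a) $P(X = 6) = \frac{e^{-10} \times 10^{6}}{6!} = 0.0631$
b) $P(X = 0) = \frac{e^{-10} \times 10^{0}}{0!} = 0.0000454$
c) $P(X \ge 2) = 1 - P(X < 2)$
$= 1 - \left(\frac{e^{-10} \times 10^{0}}{0!} + \frac{e^{-10} \times 10^{1}}{1!}\right)$
$= 1 - (0.0000454 + 0.000454)$
$= 0.9995$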
Most often, the equation used to describe a continuous probability distribution is called a
probability density function (PDF). Sometimes, it is referred to as a density function. For a
continuous probability distribution, the density function has the following properties:
1. $f(x) \ge 0$
Since the continuous random variable is defined over a continuous range of values (called the
domain of the variable), the graph of the density function will also be continuous over that
range. Unlike a probability, a density value may exceed 1; only the area under the curve is bounded by 1.
2. $\int_{-\infty}^{\infty} f(x)\,dx = 1$
The area bounded by the curve of the density function and the x-axis is equal to 1, when
computed over the domain of the variable.
3. $P(a \le X \le b) = \int_{a}^{b} f(x)\,dx$
The probability that a random variable assumes a value between a and b is equal to the area
under the density function bounded by a and b.
1.14 Normal Distribution
The normal distribution is perhaps the single most important probability distribution involving
continuous random variables. By definition, a continuous random variable X has a normal
distribution if its probability density function (PDF) is given by:
$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$, $-\infty < x < \infty$
The standardized variable $Z = \frac{X - \mu}{\sigma}$ has a zero mean and a unit variance. The common notation for a standard normal
random variable is as indicated below:
$Z \sim N(0, 1)$
Therefore, a normally distributed random variable with a given mean and variance can be
converted to a standard normal variable (aka normal deviate), which greatly simplifies our task of
computing probabilities.
Example 1.15
Find the probability that a Z-score will be greater than 3.00 from the standard normal table.
Solution:
Required to find $P(Z > 3.00)$
$P(Z > 3.00) = 0.5 - P(0 \le Z \le 3.00)$
$= 0.5 - 0.4987$
$= 0.0013$
Example 1.16
It is given that the daily sale of bread in a bakery follows the normal distribution with a mean of
70 loaves and a variance of 9, i.e. $X \sim N(70, 9)$. What is the probability that on any given day the sale of
bread is greater than 75 loaves?
Solution:
Data Given:
$\mu = 70$, $\sigma^2 = 9$, $\sigma = 3$
Let X represent the daily sale of bread in the bakery, then:
$P(X > 75) = P\left(\frac{X - \mu}{\sigma} > \frac{75 - 70}{3}\right)$
$= P(Z > 1.67)$
$= 0.5 - P(0 \le Z \le 1.67)$
$= 0.5 - 0.4525$
$= 0.0475$
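The table lookup in Example 1.16 can be cross-checked in software. The sketch below assumes Python 3.8 or later, where the standard library provides `statistics.NormalDist`; the variable names are illustrative.

```python
from statistics import NormalDist

# Daily bread sales: mean 70 loaves, variance 9, so sigma = 3
sales = NormalDist(mu=70, sigma=3)
p_more_than_75 = 1 - sales.cdf(75)

# The same probability through the standardized variable Z = (X - mu) / sigma
z = (75 - 70) / 3
p_via_z = 1 - NormalDist().cdf(z)

print(round(p_more_than_75, 4), round(p_via_z, 4))
```

Both routes give the same number, which is the point of standardization. The result can differ from the table answer in the fourth decimal place because the table rounds z to 1.67.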
Example 1.17
An investor is considering purchasing a stock whose monthly return is approximately normally
distributed with an expected return of 0.01 and a standard deviation of 0.02. Use the standard
normal distribution table to find the probability that the stock return is positive.
Data Given:
$\mu = 0.01$, $\sigma = 0.02$
Let X represent the stock return, then required to find:
$P(X \ge 0) = P\left(\frac{X - \mu}{\sigma} \ge \frac{0 - 0.01}{0.02}\right)$
$= P(Z \ge -0.5)$
$= P(Z \le 0.5)$ (by symmetry of the normal curve)
$= 0.5 + P(0 \le Z \le 0.5)$
$= 0.5 + 0.1915$
$= 0.6915$
Example 1.18
The incomes, in thousands of dollars, of a given company are normally distributed with mean 20
and standard deviation 5. Find the probability that a selected income will be
a) More than twenty-five thousand dollars
b) Anywhere between eighteen and twenty-four thousand dollars
Data Given:
$\mu = 20$, $\sigma = 5$
Let X represent the income of the company, then required to find:
a) $P(X > 25) = P\left(\frac{X - \mu}{\sigma} > \frac{25 - 20}{5}\right)$
$= P(Z > 1)$
$= 0.5 - P(0 \le Z \le 1)$
$= 0.5 - 0.3413$
$= 0.1587$
b) $P(18 \le X \le 24) = P\left(\frac{18 - 20}{5} \le \frac{X - \mu}{\sigma} \le \frac{24 - 20}{5}\right)$
$= P(-0.4 \le Z \le 0.8)$
$= P(0 \le Z \le 0.4) + P(0 \le Z \le 0.8)$
$= 0.1554 + 0.2881$
$= 0.4435$
Example 1.19
Applicants for a certain job are given an aptitude test. Past experience shows that scores from the
test are normally distributed with a mean of 60 points and a standard deviation of 12 points. What
percentage of candidates would be expected to pass the test, if a minimum score of 75 is required?
Data Given:
$\mu = 60$, $\sigma = 12$
Let X represent the score of a candidate; required to find $P(X \ge 75)$
Solution:
$P(X \ge 75) = P\left(\frac{X - \mu}{\sigma} \ge \frac{75 - 60}{12}\right)$
$= P(Z \ge 1.25)$
$= 0.5 - P(0 \le Z \le 1.25)$
$= 0.5 - 0.3944$
$= 0.1056$
Conclusion: About 10.56% of candidates would be expected to pass the test.
Example 1.20
= 0.6611 × 1800
= 1244
Conclusion: The results indicate that 1244 candidates would be suitable for the work based on the
IQ test alone.
Review Questions:
1. Suppose that in one year the number of industrial accidents follows a Poisson distribution with
mean 3. What is the probability that in a given year there will be at least 1 accident?
2. A company owns 400 laptops. Each laptop has an 8% probability of not working. You randomly
select 20 laptops for your salespeople.
(a) What is the likelihood that 5 will be broken?
(b) What is the likelihood that they will all work?
(c) What is the likelihood that they will all be broken?
3. The LMB Company manufactures tires. They claim that only 0.007 of LMB tires are defective.
What is the probability of finding 2 defective tires in a random sample of 50 LMB tires?
4. A study was conducted at Muhimbili National Hospital by the National Institute for Medical
Research (NIMR) to examine national attitudes about “SP” drugs. The study revealed that
about 70% believe “SP” doesn’t really cure malaria; it just covers up the real trouble.
According to this study, what is the probability that at least 3 of the next 5 people selected at
random will be of the opinion that SP doesn’t really cure the problem but just covers it up?
5. On an average a certain intersection results in 3 traffic accidents per month. What is the
probability that in any given month at this intersection
i. Exactly 5 accidents will occur?
ii. Less than 3 accidents will occur?
iii. At least 2 accidents will occur?
6. An average light bulb manufactured by the Acme Corporation lasts 300 days with a standard
deviation of 50 days. Assuming that bulb life is normally distributed, what is the probability
that an Acme light bulb will last at most 365 days?
7. The heights of 1000 students are normally distributed with a mean of 174.5 centimeters and a
standard deviation of 6.9 centimeters. Assuming that the heights are recorded to the nearest
half of a centimeter, how many of these students would you expect to have heights
i. Less than 160.0 centimeters?
ii. Between 171.5 and 182.0 centimeters inclusive?
iii. Greater than or equal to 188.0 centimeters?
8. A drug manufacturer claims that a certain drug cures a blood disease on the average 80% of the
time. To check the claim, government testers used the drug on a sample of 100 individuals and
decide to accept the claim if 75 or more are cured.
i. What is the probability that the claim will be rejected when the cure probability is
in fact 0.8?
ii. What is the probability that the claim will be accepted by the government when the
cure probability is as low as 0.7?
9. A sampling scheme involves taking a sample of ten items from each batch produced and
rejecting the batch if more than two defectives are found. If in fact five per cent of all items are
defective, what is the probability that a batch will be rejected?
10. A batch of items is believed to contain 20 percent defectives. A sample of six items is taken at
random from the batch. Use binomial distribution to find the probability that the sample
contains:
(a) one defective;
(b) two or more defectives.
11. Each year a company selects a number of employees for a management-training program given
by a nearby university. On the average, 70% of those sent complete the program. Out of seven
people sent by the company, what is the probability that:
(a) Exactly five complete the program?
(b) Five or more complete the program?
12. Printing errors in the work produced by a particular firm occur randomly at an average rate of
0.6 per page. What is the probability that a seven-page pamphlet prepared by the firm contains
more than three errors?
13. The proportion of articles produced by a company, which are defective, is 0.5 per cent. They are
sold in boxes of 100 and the company guarantees to replace any box containing more than two
defectives. The cost of this replacement is TZS 20,000. The company is considering introducing
a new inspection scheme, which will cost TZS 50 per box but eliminate all defectives. Use the
Poisson approximation to the binomial distribution to decide whether this inspection is
worthwhile.
14. In a certain large factory the mean number of stoppages per week is 1.5. What is the probability
that:
(a) In a given week there will be no stoppages
(b) In a given week there will be three or more stoppages
(c) In a given two-week period there will be at most one stoppage?
15. The mean life of a certain type of electric light bulb is 1400 hours, with a standard deviation of
300 hours.
(a) If the manufacturer guarantees a life of 1000 hours, what percentage of bulbs can he
expect to have returned?
(b) At what length of life should the guarantee be set in order for 95 per cent of bulbs to be
found satisfactory?
16. Applicants for a certain job are given an aptitude test. Past experience shows that score from
the test are normally distributed with a mean of 60 points and a standard deviation of 12
points. What percentage of candidates would be expected to pass the test, if a minimum score
of 75 is required?
18. If 60 percent of the customers of a large department store charge all their purchases, what is
the probability that among 200 customers (randomly chosen), at least 125 charge all their
purchases?
19. An electrical appliance manufacturer claims that 20 percent of all appliance breakdowns are
caused by the failure of customers to follow operating instructions. If this claim is correct, what
is the probability that among 100 breakdowns, more than 25 are caused by the failure of
customers to follow operating instructions?
20. The annual salaries of employees in a large company are approximately normally
distributed with a mean of $5,000 and a standard deviation of $2,000.
(a) What percent of people earn less than $4,000?
(b) What percent of people earn between $4,500 and $6,500?
(c) What percent of people earn more than $7,000?
CHAPTER TWO
SAMPLING DISTRIBUTION
2.1 Introduction
In Chapter One we discussed probability distributions of discrete and continuous random
variables. In this chapter we extend the concept of probability distribution of a random variable to
that of a sample statistic. We usually use inferential statistics methods to estimate population
parameter values of random variables by using findings from samples. In most cases when
conducting an investigation, we select some individuals to represent the whole
group of individuals in which the investigator is interested. This is due to factors such as the cost
of dealing with the entire population, time, etc. Sampling is the process of selecting a representative
group from the population under study. The target population is the collection of all individuals
from which the sample might be drawn. A sample is the group of people who take part in the
investigation; in other words, a sample is a subset of a given target population. There are
different techniques for selecting a sample; these can be grouped into two categories: random
(probability) sampling and non-random (non-probability) sampling techniques.
When a sample is selected from a normally distributed population, the sample observations are also normally
distributed. Therefore in this section we study the sampling distribution of the sample
mean $\bar{X}$ and the sample proportion $\hat{p}$. Furthermore, since more than
one sample can be selected from the given population, the corresponding sample means are
expected to vary from one sample to the other, which means the sample mean can be treated as a
random variable, which will have its own PDF.
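The sample mean of n observations $X_1, X_2, \ldots, X_n$ is
$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i = \frac{1}{n}(X_1 + X_2 + X_3 + \cdots + X_n)$
Taking expectations, with $E(X_i) = \mu$ for each i,
$E(\bar{X}) = \frac{1}{n}\left(E(X_1) + E(X_2) + E(X_3) + \cdots + E(X_n)\right)$
$= \frac{1}{n}(\mu + \mu + \mu + \cdots + \mu)$
$= \frac{1}{n}(n\mu)$
$= \mu$
Therefore $E(\bar{X}) = \mu$
Since $Var(X_i) = \sigma^2$ for each i and the $X_i$ are independent, the variance of $\bar{X}$ is given by
$Var(\bar{X}) = \frac{1}{n^2}\left(Var(X_1) + Var(X_2) + Var(X_3) + \cdots + Var(X_n)\right)$
$= \frac{1}{n^2}(n \times \sigma^2) = \frac{\sigma^2}{n}$
Therefore $\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$
The square root of the variance of $\bar{X}$ is called the standard error of $\bar{X}$, which is given by
$SE(\bar{X}) = \frac{\sigma}{\sqrt{n}}$
Therefore the standardized normal variable in this case is given by
$Z = \frac{\bar{X} - \mu}{SE(\bar{X})} = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$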
Example 2.1
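A random variable X is normally distributed with mean 8 and variance 25. From a random sample of 36
observations, find
a) The standard error of this sampling distribution
b) The probability that the sample mean is greater than 9
Data Given
$\mu = 8$, $\sigma^2 = 25$ (so $\sigma = 5$), $n = 36$
Let $\bar{X}$ represent the sample mean calculated from a random sample of 36 observations. Then
a) $SE(\bar{X}) = \frac{\sigma}{\sqrt{n}} = \frac{5}{\sqrt{36}} = 0.833$
b) $P(\bar{X} > 9) = P\left(\frac{\bar{X} - \mu}{SE(\bar{X})} > \frac{9 - 8}{0.833}\right)$
$= P\left(Z > \frac{1}{0.833}\right)$
$= P(Z > 1.2)$
$= 0.5 - P(0 \le Z \le 1.2)$
$= 0.5 - 0.3849$
$= 0.1151$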
Example 2.2
Revenues collected from a certain firm, in thousands of dollars, are assumed to be normally
distributed with mean 20 and standard deviation 8. Suppose that a random sample of 100
revenues is selected; find
a) The standard error of the sampling distribution.
b) The probability that the mean revenue is less than 21.6 thousand dollars.
Data Given
$\mu = 20$, $\sigma = 8$, $n = 100$
Let $\bar{X}$ represent the average revenue calculated from the sample of 100 revenues. Then
a) $SE(\bar{X}) = \frac{\sigma}{\sqrt{n}} = \frac{8}{\sqrt{100}} = 0.8$
b) $P(\bar{X} < 21.6) = P\left(\frac{\bar{X} - \mu}{SE(\bar{X})} < \frac{21.6 - 20}{0.8}\right)$
$= P\left(Z < \frac{1.6}{0.8}\right)$
$= P(Z < 2)$
$= 0.5 + P(0 \le Z \le 2)$
$= 0.5 + 0.4772$
$= 0.9772$
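The two steps of Example 2.2 (standard error first, then a normal probability for the sample mean) can be sketched in Python as follows. This assumes Python 3.8 or later for `statistics.NormalDist`; the variable names are illustrative, not from the notes.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 20, 8, 100

# (a) Standard error of the sample mean
standard_error = sigma / sqrt(n)

# (b) The sample mean is N(mu, sigma^2 / n), so use the standard error as its sigma
sample_mean_dist = NormalDist(mu, standard_error)
p_less_than_21_6 = sample_mean_dist.cdf(21.6)

print(standard_error, round(p_less_than_21_6, 4))
```

Note that the spread of the sample mean (0.8) is much smaller than the spread of a single revenue (8), which is why a sample mean as high as 21.6 is unusual.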
$\hat{p} \sim N\left(p, \frac{p(1-p)}{n}\right)$
Therefore the standard variable Z becomes
$Z = \frac{\hat{p} - p}{\sqrt{p(1-p)/n}} = \frac{\hat{p} - p}{SE(\hat{p})} \sim N(0, 1)$
Example 2.3
Suppose the proportion of all college students who have used marijuana in the past 6 months is
40%. If a random sample of 200 students is taken from this population, what is the probability
that the proportion of students in the sample who have used
marijuana is less than 32%?
Solution
Given $p = 0.4$, $n = 200$. Required to find $P(\hat{p} < 0.32)$
$P(\hat{p} < 0.32) = P\left(\frac{\hat{p} - p}{\sqrt{p(1-p)/n}} < \frac{0.32 - 0.4}{\sqrt{0.4 \times 0.6 / 200}}\right)$
$= P(Z < -2.31)$
$= P(Z > 2.31)$ (by symmetry)
$= 0.5 - P(0 \le Z \le 2.31)$
$= 0.5 - 0.4896$
$= 0.0104$
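Example 2.3 can be reproduced with a few lines of Python. This is a sketch under the same normal approximation used in the solution, assuming Python 3.8 or later; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

p, n = 0.4, 200

# Standard error of the sample proportion
se_p_hat = sqrt(p * (1 - p) / n)

# Standardize the threshold 0.32 and read off the lower-tail probability
z = (0.32 - p) / se_p_hat
prob_below_32_percent = NormalDist().cdf(z)

print(round(z, 2), round(prob_below_32_percent, 4))
```

The computed z should round to -2.31, matching the table lookup in the worked solution.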
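Thus, $\bar{X} - \bar{Y}$ has mean $\mu_x - \mu_y$ and variance $\frac{\sigma_x^2}{n_x} + \frac{\sigma_y^2}{n_y}$, written as
$\bar{X} - \bar{Y} \sim N\left(\mu_x - \mu_y,\ \frac{\sigma_x^2}{n_x} + \frac{\sigma_y^2}{n_y}\right)$
and the expression is used when the variances are known but different, i.e. $\sigma_x^2 \ne \sigma_y^2$.
If instead the variances are equal, $\sigma_x^2 = \sigma_y^2 = \sigma^2$, then
$\bar{X} - \bar{Y} \sim N\left(\mu_x - \mu_y,\ \sigma^2\left(\frac{1}{n_x} + \frac{1}{n_y}\right)\right)$
And hence,
$Z = \frac{(\bar{X} - \bar{Y}) - (\mu_x - \mu_y)}{\sqrt{\sigma^2\left(\frac{1}{n_x} + \frac{1}{n_y}\right)}} \sim N(0, 1)$
When the population variances are unknown but assumed to be equal, then we have
$T = \frac{(\bar{X} - \bar{Y}) - (\mu_x - \mu_y)}{\sqrt{S_p^2\left(\frac{1}{n_x} + \frac{1}{n_y}\right)}} \sim t_{n_x + n_y - 2}$
Where $S_p^2 = \frac{(n_x - 1)S_x^2 + (n_y - 1)S_y^2}{n_x + n_y - 2}$ is called the pooled variance.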
Example 2.4
A random sample of size 60 was taken from a normal population with mean 30 and variance 121.
Another independent random sample of size 100 was taken from a normal population with mean
20 and variance 81. Find the probability that the difference between the arithmetic means is at
least 13.
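Example 2.4 is stated without a worked solution at this point, so the following is only a hedged sketch of how it could be evaluated. It assumes that "the difference" means the first sample mean minus the second, and that the known-variance result $\bar{X} - \bar{Y} \sim N\left(\mu_x - \mu_y, \frac{\sigma_x^2}{n_x} + \frac{\sigma_y^2}{n_y}\right)$ applies. It uses `statistics.NormalDist` from Python 3.8 or later.

```python
from math import sqrt
from statistics import NormalDist

# Population 1: mean 30, variance 121, sample size 60
# Population 2: mean 20, variance 81, sample size 100
mean_diff = 30 - 20
var_diff = 121 / 60 + 81 / 100   # variances of the two sample means add

# Distribution of the difference between the two sample means
diff_dist = NormalDist(mean_diff, sqrt(var_diff))
p_at_least_13 = 1 - diff_dist.cdf(13)

print(round(var_diff, 4), round(p_at_least_13, 4))
```

The variances are added rather than subtracted because the two samples are independent.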
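$\hat{p}_1 \sim N\left(p_1, \frac{p_1(1 - p_1)}{n_1}\right)$ and $\hat{p}_2 \sim N\left(p_2, \frac{p_2(1 - p_2)}{n_2}\right)$
Then it can be shown that the difference between the two sample proportions, $\hat{p}_1 - \hat{p}_2$, is also
normally distributed with mean $p_1 - p_2$ and variance
$\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}$
written as
$(\hat{p}_1 - \hat{p}_2) \sim N\left(p_1 - p_2,\ \frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}\right)$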
Review Questions
1. A Lorry carries heavy Cartons of Goods. The weights of these cartons are distributed about a
mean of 100kg and with a standard deviation of 7kg. Find how many cartons the Lorry can
carry so that the probability of the total load exceeding 4500kg is less than 0.05.
2. Consider a random sample of size 16 from a N (100, 400) distribution. Find P( X 97) .
3. The electric light bulbs of manufacturer A have a mean lifetime of 1,400 hrs with a standard
deviation of 200 hrs, while those of manufacturer B have a mean lifetime of 1,200 hrs with a
standard deviation of 100 hrs. If random sample of 125 bulbs of each brand are tested, what is
the probability that the brand A will have a mean lifetime that is at least (a) 160 hrs, (b) 250 hrs
more than brand B bulbs?
4. Ball bearings of a given brand weigh 0.5 oz with a standard deviation of 0.02 oz. What is the
probability that two lots of 1,000 ball bearings each, will differ in weight by more than 2 oz?
5. The weights of the packages received by a department store have a mean of 300 lb and a
standard deviation of 50 lb. What is the probability that 25 packages received at random and
loaded on an elevator will exceed the safety limit of the elevator, listed as 8,200 lb?
6. A random variable is normally distributed with mean 18 and variance 25. From a random
sample of 36 observations, find
Standard error of this sampling distribution
The probability that the sample mean is greater than 16
7. A certain research showed that 45 out of 60 workers from a certain company use smart phones.
If a sample of 100 workers is randomly selected, what is the probability that 80% of workers
use smart phones?
8. A random sample of size 60 was taken from a normal population with mean 30 and variance
121. Another independent random sample of size 100 was taken from a normal population
with mean 20 and variance 81. Find the probability that the difference between the arithmetic
means is at least 13
CHAPTER THREE
ESTIMATION OF PARAMETERS
3.1 Introduction
In chapter two we said something about how sampling distribution plays a big role in inferential
statistics. In this chapter we are going to start the discussion of inferential statistics by applying the
concepts of sampling distribution discussed in chapter two. As we have said earlier, inferential
statistics is the part of statistics that uses information from a sample to make decisions and draw
conclusions for the population from which the sample was drawn. There are two important parts of
inferential statistics; estimation and hypotheses testing, these two taken together are referred to as
inference making. In this chapter we will discuss the concepts of estimation (point and interval
estimation); hypothesis testing will be discussed in the next chapter.
Estimation is the process of estimating unknown parameters of the given population based on the
sample statistics. Statisticians use sample statistics to estimate population parameters. For
example, sample means are used to estimate population means; sample proportions, to estimate
population proportions. In most cases it is not always possible to work with the whole population
and determine the desirable statistical measures. This is due to factors related to time, cost and
sometimes the population might be infinite. Therefore under this situation, the findings from the
samples are used to represent the whole population.
Parameters are quantities that summarize data for an entire population.
Statistics are quantities computed from sample information that are used to estimate certain
population parameters.
An estimator is a statistic (a function of the observable sample data) that is used to estimate an
unknown population parameter.
Significance Level
Since we are using samples to estimate for the population we cannot be 100% sure that the
estimated value or interval will be true. Significance level is the probability (percent) that the true
population parameter might fall out of the confidence interval constructed. For example if the
confidence level is 95%, then, the significance level is 5%. This concept will be explicitly discussed
in the next chapter of hypothesis testing.
Degrees of Freedom
The number of data values which are allowed to vary once a statistic has been determined.
Maximum Error of the Estimate
The maximum difference between the point estimate and the actual parameter is known as the
maximum error of the estimate; it is half the width of the confidence interval for means and
proportions.
As we have already said, one area of concern in inferential statistics is the estimation of the
population parameters from the sample statistics. It is important to realize the order here. The
sample statistic is calculated from the sample data and the population parameter is inferred
(estimated) from this sample statistic. In that case statistics are usually calculated while
parameters are estimated.
Another area of inferential statistics is sample size determination. That is, how large of a sample
should be taken to make an accurate estimation. In these cases, the statistics cannot be used since
the sample has not been taken yet.
3.3.2 Efficiency
An estimator is said to be efficient if it has a relatively small variance or standard deviation.
That is, if you have two or more estimators, select the one with the smallest standard deviation
as your estimator. Let $\hat{\theta}_1$ and $\hat{\theta}_2$ be two unbiased estimators of $\theta$; then $\hat{\theta}_1$ is said to be more
efficient than $\hat{\theta}_2$ if $Var(\hat{\theta}_1) < Var(\hat{\theta}_2)$.
3.3.4. Sufficiency
An estimator is said to be sufficient if it utilizes all the information in the sample to arrive at the
estimate. For example, when estimating a population mean we would prefer the sample mean to
the sample median and the sample mode, because the last two estimators utilize only some of the
information to come to the estimate.
3.4 Types of Estimation
There are two types of estimation: Point estimation and interval estimation
Interval estimation is the process of estimating the interval within which the unknown population
parameter lies. An interval estimate is defined by two numbers, between which a population parameter is said
to lie. For example, a < µ < b is an interval estimate of the population mean µ. It indicates that the
population mean is greater than a but less than b.
A confidence level.
A statistic.
A margin of error.
Confidence Level
The confidence level is the probability value (1-α) 100% associated with a confidence interval. For
example, say α=0.05=5%, then the confidence level is equal to (1- 0.05) = 0.95, i.e. a 95%. The
confidence level describes how strongly we believe that a particular sampling method will produce
a confidence interval that includes the true population parameter.
Statistic
This is a variable that is calculated from the sample information that is used to estimate a certain
population parameter
Margin of Error.
The range of values above and below the sample statistic in a confidence interval is
called the margin of error. Therefore a confidence interval is defined by the
sample statistic ± margin of error.
When the population variance is known, the $(1-\alpha)100\%$ confidence interval estimate for $\mu$ is given
by:
$\bar{X} \pm Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$  (1)
Where $\alpha$ is the level of significance
$E = Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$ is the error term
n = Sample size
$\sigma$ = Population standard deviation
Hint: If the population variance ($\sigma^2$) is known, the above formula is applied regardless of the
sample size
Example 3.1
Let X be a normally distributed random variable with variance equal to 81. A random sample of
size 25 is drawn and the arithmetic mean is calculated to be 10.2; use this information to compute a
99% confidence interval estimate for $\mu$.
Data Given:
$n = 25$, $\bar{x} = 10.2$, $\sigma^2 = 81$, $CL = 99\%$, $\alpha = 1\%$
Solution: Since $n < 30$ and $\sigma^2$ is known, the 99% confidence interval estimate for $\mu$ will be:
$\bar{x} \pm Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$
$= 10.2 \pm Z_{0.005} \times \frac{9}{\sqrt{25}}$
$= 10.2 \pm 2.575 \times \frac{9}{\sqrt{25}}$
$= 10.2 \pm 4.635$
$= 5.565;\ 14.835$
Conclusion: We are 99% confident that the true mean lies within 5.565 and 14.835
Example 3.2
A mining Company in Zambia needs to estimate the average amount of Copper ore per ton mined.
A random sample of 50 tons gives a sample mean of 146.75 kg. The population standard deviation
is assumed to be 35.2 kg.
a) Provide a 90% CI for the average amount of copper in the population of tons mined
b) Give a 95% C.I and 99% C.I for the average amount of copper per ton
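Data Given:
$n = 50$, $\bar{x} = 146.75$, $\sigma = 35.2$, $CL = 90\%$, $\alpha = 10\%$
Solution:
Since $\sigma$ is known, the 90% confidence interval for $\mu$ is given by:
$\mu = \bar{x} \pm Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$
$= 146.75 \pm Z_{0.05} \times \frac{35.2}{\sqrt{50}}$
$= 146.75 \pm 1.645 \times \frac{35.2}{\sqrt{50}}$
$= 146.75 \pm 8.189$
$= 138.561;\ 154.94$
Conclusion: We are 90% confident that the true mean lies within 138.561 and 154.94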
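The same calculation extends to the 95% and 99% levels asked for in part (b) of Example 3.2. The sketch below is illustrative only: `z_interval` is a hypothetical helper, and the critical value is taken from `statistics.NormalDist().inv_cdf` (Python 3.8 or later) instead of a printed table, so results may differ from table-based answers in the third decimal place.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(x_bar, sigma, n, confidence):
    """Confidence interval for a mean when the population standard deviation is known."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # critical value Z_{alpha/2}
    margin = z * sigma / sqrt(n)              # margin of error E
    return x_bar - margin, x_bar + margin

for level in (0.90, 0.95, 0.99):
    low, high = z_interval(146.75, 35.2, 50, level)
    print(f"{level:.0%}: ({low:.3f}, {high:.3f})")
```

Raising the confidence level widens the interval, because a larger critical value multiplies the same standard error.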
Example 3.3
A team of efficiency experts intends to use the mean of a random sample of size $n = 150$ to estimate the
average mechanical aptitude of assembly-line workers in a large industry (as measured by a certain
standardized test). If, based on experience, the efficiency experts can assume that $\sigma = 6.2$ points on
the test scale for such data, find, with 99% confidence, the maximum error of their estimate.
Data Given:
$n = 150$, $\sigma = 6.2$, $CL = 99\%$, $\alpha = 1\%$
Solution:
From
$E = Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$
The maximum error is obtained by:
$E \le Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$
$E \le Z_{0.005} \times \frac{6.2}{\sqrt{150}}$
$E \le 2.575 \times \frac{6.2}{\sqrt{150}}$
$E \le 1.303$
Therefore the maximum error of their estimate is 1.303
3.5.2.1 Confidence interval for µ when ($\sigma$) is unknown and the sample size is large
If the sample size is large and the population standard deviation is unknown, the formula for the
$(1-\alpha)100\%$ confidence interval estimate for the population mean is given by:
$\bar{X} \pm Z_{\alpha/2}\frac{s}{\sqrt{n}}$  (2)
Data Given:
$n = 50$, $\bar{x} = 14.5$, $s = 5.6$, $CL = 95\%$, $\alpha = 5\%$
Solution:
Since $\sigma$ is unknown and the sample size is more than 30, the 95% confidence interval for $\mu$ is
given by:
$\mu = \bar{x} \pm Z_{\alpha/2}\frac{s}{\sqrt{n}}$
$= 14.5 \pm Z_{0.025} \times \frac{5.6}{\sqrt{50}}$
$= 14.5 \pm 1.96 \times \frac{5.6}{\sqrt{50}}$
$= 14.5 \pm 1.55$
$= 12.95;\ 16.05$
Conclusion: We are 95% confident that the true mean lies within 12.95 and 16.05
3.5.2.2 Small sample confidence interval for the mean µ, when ($\sigma$) is unknown
When σ is unknown and the sample size is small, the standardized sample mean
follows a distribution called Student’s t-distribution. This distribution has similar properties
to the normal distribution (it is symmetric and bell-shaped), except that it is not described by a mean and standard deviation
as the normal distribution is. Instead, the t-distribution is described by a parameter called
degrees of freedom (df). For this case the formula for the $(1-\alpha)100\%$ confidence interval for µ is
defined by:
$\bar{X} \pm t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}}$  (3)
Where n is the sample size and (n-1) is the degrees of freedom
Example 3.5
A management consulting firm needs to estimate the average number of years of experience of
executives in a given branch of management. A random sample of 25 executives gives $\bar{x} = 2.4$ years
and $s = 1.5$. Give a 99% C.I for the average number of years of experience for all executives in the
branch.
Data Given:
$n = 25$, $\bar{x} = 2.4$, $s = 1.5$, $CL = 99\%$, $\alpha = 1\%$
Solution:
Since $\sigma$ is unknown and the sample size is less than 30, the 99% confidence interval for $\mu$ is given
by:
$\mu = \bar{x} \pm t_{\alpha/2,\,(n-1)}\frac{s}{\sqrt{n}}$
$= 2.4 \pm t_{0.005,\,24} \times \frac{1.5}{\sqrt{25}}$
$= 2.4 \pm 2.797 \times \frac{1.5}{\sqrt{25}}$
$= 2.4 \pm 0.839$
$= 1.561;\ 3.239$
Conclusion: We are 99% confident that the true mean lies within 1.561 and 3.239
this case the formula for the $(1-\alpha)100\%$ confidence interval for the population
proportion, given a large sample, is given by:
$p = \hat{p} \pm Z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}}$  (5)
Where $\hat{p} = \frac{x}{n}$, $\hat{q} = 1 - \hat{p}$
x is the number of successes in n trials
Example 3.6
In a random sample of 400 cars stopped at a roadblock, 152 of the drivers were wearing their seat
belts. Construct a 95% confidence interval for the corresponding true proportion in the population
sampled.
Data given:
$n = 400$, $x = 152$
Solution:
The 95% CI estimate for P is given by:
$P = \hat{p} \pm Z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}}$, where $\hat{p} = \frac{152}{400} = 0.38$ and $\hat{q} = 0.62$
$= 0.38 \pm Z_{0.025}\sqrt{\frac{0.38 \times 0.62}{400}}$
$= 0.38 \pm 1.96\sqrt{0.000589}$
$= 0.38 \pm 1.96 \times 0.02427$
$= 0.38 \pm 0.048$
$= 0.332;\ 0.428$
Conclusion: We are 95% confident that the true proportion lies within 0.332 and 0.428
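The large-sample interval for a proportion in Example 3.6 can be sketched as a small function. The name `proportion_interval` is hypothetical, and the critical value comes from `statistics.NormalDist().inv_cdf` (Python 3.8 or later), so it is very slightly more precise than the table value 1.96.

```python
from math import sqrt
from statistics import NormalDist

def proportion_interval(successes, n, confidence):
    """Large-sample confidence interval for a population proportion."""
    p_hat = successes / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # critical value Z_{alpha/2}
    margin = z * sqrt(p_hat * (1 - p_hat) / n)           # margin of error E
    return p_hat - margin, p_hat + margin

low, high = proportion_interval(152, 400, 0.95)
print(round(low, 3), round(high, 3))
```

The same function returns the maximum error implicitly: half the width of the returned interval is the margin E.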
Example 3.7
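In a random sample of 200 vacationers interviewed at a resort, 142 said that they chose the resort
mainly because of its climate. With 99% confidence, find the maximum error we can make.
Data given:
$n = 200$, $x = 142$, $CL = 99\%$, $\alpha = 1\%$
Solution:
From
$E = Z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}}$
$E \le Z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}}$
But
$\hat{p} = \frac{x}{n} = \frac{142}{200} = 0.71$
$\hat{q} = 1 - \hat{p}$
$= 1 - 0.71$
$= 0.29$
Therefore:
$E \le Z_{0.005}\sqrt{\frac{0.71 \times 0.29}{200}}$
$E \le 2.575 \times \sqrt{0.00103}$
$E \le 0.083$
The maximum error we can make is 0.083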
When two populations are normally distributed, the means of independent samples drawn from these
populations are also normally distributed. The formula for the $(1-\alpha)100\%$ confidence interval for the
difference between the two population means $\mu_1 - \mu_2$ is given by:
$\mu_1 - \mu_2 = (\bar{x}_1 - \bar{x}_2) \pm Z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$  (8)
Example 3.8
Mr Juma, a bank official in Dar es Salaam, wants to know the difference between the average
amounts of money customers have on deposit in two branch banks. He selects a random sample of
25 customers from each branch and the results are as indicated below:
                Branch A      Branch B
Sample mean     Tshs. 450     Tshs. 325
Variance        759           850
If the two populations A and B are normally distributed, construct a 95% confidence interval for
$\mu_1 - \mu_2$.
Hint: The formula above is used when the population variances are known, regardless of the sample
sizes
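Example 3.8 is given without a worked solution, so the following is a hedged sketch of how formula (8) could be applied. It assumes the tabulated variances 759 and 850 are the known population variances, as the hint suggests; `diff_means_interval` is a hypothetical helper, and the critical value comes from `statistics.NormalDist().inv_cdf` (Python 3.8 or later).

```python
from math import sqrt
from statistics import NormalDist

def diff_means_interval(x1, x2, var1, var2, n1, n2, confidence):
    """Confidence interval for mu1 - mu2 when both population variances are known."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # critical value Z_{alpha/2}
    margin = z * sqrt(var1 / n1 + var2 / n2)
    diff = x1 - x2
    return diff - margin, diff + margin

# Branch A: mean 450, variance 759; Branch B: mean 325, variance 850; 25 customers each
low, high = diff_means_interval(450, 325, 759, 850, 25, 25, 0.95)
print(round(low, 2), round(high, 2))
```

Because the resulting interval does not contain zero, the data would suggest a real difference between the two branch averages at this confidence level.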
3.7.2 When population variances are unknown
Two situations are considered under this category
a) When both sample sizes are large, n1 30 and n2 30
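b) At least one of the samples is small
For case (a), when both sample sizes are large, the confidence interval is given by:
$\mu_1 - \mu_2 = (\bar{x}_1 - \bar{x}_2) \pm Z_{\alpha/2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$  (9)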
Example 3.9
A utility company used to send out monthly statements to its customers without addressed return
envelopes. From a random sample of 120 customers it was determined that, on average, it took 9
days for a payment to be made, with a sample standard deviation of 2 days. Wishing to speed
up receipt of payments, pre-addressed return envelopes were subsequently included with the
invoices. An independent sample of 130 customers indicated that the average payment time fell to 8
days, with a sample standard deviation of 2.2 days. Compute a 95% confidence interval estimate for
the difference between the population means.
In this situation the estimation of the difference between the two population means is done under
the assumption that the samples are drawn from two independent populations (say $X_1$ and $X_2$) and
that these populations have a common variance, i.e. $\sigma_1^2 = \sigma_2^2 = \sigma^2$. In this case the confidence
interval estimate for the difference between the two population means is given by:
$\mu_1 - \mu_2 = (\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2,\,(n_1 + n_2 - 2)}\sqrt{\frac{S_p^2}{n_1} + \frac{S_p^2}{n_2}}$
Where $S_p^2$ is known as the pooled variance (aka common variance), which is obtained by computing
the weighted average of the two sample variances; it is given by the following formula:
$S_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$ and $n_1 + n_2 - 2$ is the degrees of freedom
Example 3.10
A random sample of 14 firms belonging to a particular industry is selected and the current
percentage gross yield X is noted for each firm. The data show that $\bar{X} = 46.8$ and
confidence interval estimate for $\mu_x - \mu_y$ on the assumption that the two population variances are
equal.
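3.8 Estimation of the difference between two population proportions ($p_1 - p_2$)
The confidence interval for estimating the difference between two population proportions is given
by the following formula
$p_1 - p_2 = (\hat{p}_1 - \hat{p}_2) \pm Z_{\alpha/2}\sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1} + \frac{\hat{p}_2\hat{q}_2}{n_2}}$
Where $\hat{p}_1 = \frac{x_1}{n_1}$, $\hat{q}_1 = 1 - \hat{p}_1$ and $\hat{p}_2 = \frac{x_2}{n_2}$, $\hat{q}_2 = 1 - \hat{p}_2$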
Example 3.11
A random sample of 350 salespersons and an independent random sample of 325 personnel
managers are interviewed regarding their reading habits. Out of the 350 salespersons, 105 indicate
that they subscribe to Financial Review magazine; 130 of the personnel managers subscribe to Financial
Review magazine. Construct the 99% confidence interval for the difference between the true
proportions subscribing to this magazine.
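Example 3.11 has no worked solution in the text, so the sketch below is one hedged way to apply the formula of Section 3.8. It assumes the 130 subscribers are out of the 325 personnel managers; `diff_proportions_interval` is a hypothetical helper, and the critical value is computed with `statistics.NormalDist().inv_cdf` (Python 3.8 or later).

```python
from math import sqrt
from statistics import NormalDist

def diff_proportions_interval(x1, n1, x2, n2, confidence):
    """Large-sample confidence interval for the difference p1 - p2."""
    p1, p2 = x1 / n1, x2 / n2
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # critical value Z_{alpha/2}
    margin = z * sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - margin, diff + margin

# Salespersons: 105 of 350 subscribe; personnel managers: 130 of 325 subscribe (assumed)
low, high = diff_proportions_interval(105, 350, 130, 325, 0.99)
print(round(low, 4), round(high, 4))
```

Both limits are negative here because the first sample proportion (0.30) is smaller than the second (0.40).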
3.9 Estimation of the Sample Size
The formula for the confidence interval estimate for either the population mean $\mu$ or the population
proportion P is used to estimate a sample size that is suitable to provide a good approximation of the
population parameters. This can be done in advance if one can set the interval length and when the
population/sample variance is known in advance. For the case of a proportion, we need an estimate
of the sample proportion $\hat{p}$.
Given:
$\mu = \bar{x} \pm Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$, we obtain $\mu = \bar{x} \pm E$; for the case of a population proportion, the CI is
expressed as:
$P = \hat{p} \pm Z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}}$ and this shows that $P = \hat{p} \pm E$
Where $E = Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$ is known as the margin of error. Making n (sample size) the subject, we get
$n \ge \left(\frac{Z_{\alpha/2}\,\sigma}{E}\right)^2$  (5)
For the case of a population proportion the sample size is given by:
$n \ge \left(\frac{Z_{\alpha/2}}{E}\right)^2 \hat{p}\hat{q}$  (6)
Hint: The inequality is preferred because the larger the sample size the more precise are the
estimates
Example 3.12
What minimum sample size would be required to estimate the population mean for a large set of
company invoices to within 0.30 with 95% confidence, given that the estimated population
standard deviation is 5?
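Formulas (5) and (6) can be sketched in Python as follows. The function names are assumptions made for illustration, and the proportion case uses hypothetical inputs ($\hat{p} = 0.5$, $E = 0.05$) that do not come from the handbook. Rounding up with `math.ceil` reflects the inequality in the formulas.

```python
import math
from statistics import NormalDist

def sample_size_mean(sigma, margin, confidence):
    """Smallest n with Z_{alpha/2} * sigma / sqrt(n) <= margin (formula 5)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z * sigma / margin) ** 2)

def sample_size_proportion(p_hat, margin, confidence):
    """Smallest n for estimating a proportion to within margin (formula 6)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z / margin) ** 2 * p_hat * (1 - p_hat))

# Example 3.12: sigma = 5, E = 0.30, 95% confidence
n_invoices = sample_size_mean(sigma=5, margin=0.30, confidence=0.95)
# Hypothetical proportion case: p_hat = 0.5, E = 0.05, 95% confidence
n_prop = sample_size_proportion(p_hat=0.5, margin=0.05, confidence=0.95)
print(n_invoices, n_prop)
```

For Example 3.12 the unrounded value is about 1067.1, so at least 1068 invoices would be needed under these assumptions.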
Review Questions
1. Suppose 100 randomly selected used cars on a lot have an average of 40,000 miles on them,
with a standard deviation of 500 miles.
a) Find the standard error for the average miles of the cars in this lot.
b) Find the margin of error for the average miles of the cars in this lot.
c) Find a 95 percent confidence interval for the average miles on all the cars in this lot and
interpret your answer.
3. If the population standard deviation, σ, is not known, what standardized statistic is used to
construct a confidence interval?
4. If x = 49, s = 4.5 and n = 20, set up a 90% confidence interval estimate of the population
mean, μ.
5. The operations manager of a sugar mill in Morogoro wants to estimate the average size of an
order received. An order is measured in the number of pallets shipped. A random sample of 90
orders from customers had a sample mean value of 135.4 pallets. Assume that the population
standard deviation is 25 pallets and that order size is normally distributed.
(a) Estimate, with 95% confidence, the mean size of orders received from all the mill’s
customers.
(b) If the sugar mill receives 820 orders this year, calculate, with 95% confidence, the total
number of pallets of sugar that they will ship during the year.
6. A travel agency call centre wants to know the average number of calls received per day by its
call centre. A random sample of 25 days is selected and the sample mean number of calls
received was found to be 175.8 with a sample standard deviation of 23.5 calls. Assume that calls
received daily are normally distributed.
(a) Calculate a 95% confidence interval for the mean number of daily calls received by the call
centre. Interpret the findings.
(b) Find a 99% confidence interval for the mean number of daily calls received by the call
centre.
(c) Compare the findings of (a) and (b) and explain the reason for the difference.
(d) Estimate, with 95% confidence, the total number of calls received over a 30-day period.
Interpret the result.
7. The Department of Trade and Industry (DTI) wants to determine the percentage of
manufacturing firms that have met the employment equity charter. To assist the department,
National Bureau of Statistics (NBS) selected a random sample of 250 manufacturing firms and
established that 89 have met the employment equity charter. Determine, with 95% confidence,
the percentage of manufacturing firms that have met the employment equity charter. Prepare a
brief report to the DTI detailing your findings.
8. AQB bank analyzed a random sample of 465 business accounts at their city central branch and
found that 88 of them were overdrawn. Estimate, with 90% confidence, the percentage of all
bank accounts at the city central branch of the bank that were not overdrawn. Interpret the
findings.
9. A market research firm wants to estimate the share that foreign companies have in the TZ
market for certain products. A random sample of 100 consumers is obtained, and it is found
that 34 people in the sample are users of foreign made products; the rest are users of domestic
products. Give a 95% confidence interval for the share of foreign products in this market.
10. A random sample of 350 shoppers in a shopping mall is interviewed to identify their reasons
for coming to this particular mall. The factor of ‘I prefer the store mix’ was the most important
reason for 125 of those interviewed. Estimate the likely percentage of all shoppers who frequent
this shopping mall primarily because of the mix of stores in the mall, using 95% confidence
limits.
11. A 2008 survey of low- and middle-income households showed that consumers aged 65 years
and older had an average credit card debt of $10,235 and consumers in the 50- to 64-year age
group had an average credit card debt of $9342 at the time of the survey (USA TODAY, July 28,
2009). Suppose that these averages were based on random samples of 1200 and 1400 people
for the two groups, respectively. Further assume that the population standard deviations for the
two groups were $2800 and $2500, respectively. Let $\mu_1$ and $\mu_2$ be the respective population
means for the two groups, people aged 65 years and older and people in the 50- to 64-year age
group.
13. (a) When are the samples considered large enough for the sampling distribution of the
difference between two sample proportions to be (approximately) normal?
14. Suppose that you computed a 95% confidence interval for a population mean. The user of the
statistics claims your interval is too wide to have any meaning in the specific use for which it is
intended. Give two methods of solving this problem.
15. A research firm wants to conduct a survey to estimate the average amount spent on
entertainment by each person visiting a popular resort. The people who plan the survey would
like to be able to determine the average amount spent by all people visiting the resort to within
$120, with 95% confidence. From past operation of the resort, an estimate of the population
standard deviation is σ = $400. What is the minimum required sample size?
CHAPTER FOUR
HYPOTHESIS TESTING
4.1 Introduction
In Chapter Three we discussed parameter estimation, which is the first part of inferential statistics;
the second part is hypothesis testing, which is discussed in this chapter. When a sales
manager claims that his company’s product has a larger market share than that of a rival
company, we can use hypothesis testing techniques to test statistically the validity of the manager’s
claim. Here we need to go to the market, collect appropriate samples on the sales of the
products from the two companies, and decide to either support or refute the manager’s claim based
on the sample evidence.
Hypothesis testing is very important in the decision making process, as it involves collecting evidence
(data) on the claim and using statistical methods to determine whether there is any significant
difference between the claim and the information obtained. For example, if a person is suspected of
having committed a crime and is taken to court for trial, then based on the available evidence the jury can
make one of two possible decisions: the person is either guilty or innocent. At the outset of
the trial the person is presumed innocent until proven guilty. The role of the prosecutor is to prove
that the person is guilty, and if he can prove this beyond doubt then the defendant is convicted. This is
a non-statistical example, but in statistical terms the statement that the defendant is innocent is
known as the null hypothesis and the statement of the prosecutor that the person is guilty is termed
the alternative hypothesis. The null and alternative hypotheses are always stated in terms of the
population parameters, which are then assessed using the sample statistics. As we have seen in the
court trial example, the null hypothesis is presumed true until the evidence allows us to reject it.
In hypothesis testing, the researcher must do the following to reach the desired conclusion:
This is a verbal statement or claim made about a population parameter. In other words, this is a
premise or claim that we want to test.
statement of inequality such as ≠, >, or <. Note that since a statistical hypothesis will always be given in the
form of a verbal statement, in order to write down the two complementary hypotheses we are
required to transform the given verbal statement into a mathematical statement.
4.3 Types of Error
In making a statistical decision it is possible to commit an error. This is because the procedure for
testing a hypothesis relies on sample data, and because sample data are not completely reliable
there is a possibility of reaching a wrong conclusion about the population parameter being tested.
Hence, in carrying out hypothesis testing, there are two types of error that can be committed. These
are Type I and Type II errors.
4.3.1 Type I error
This is committed when we reject a true $H_0$, i.e. the null hypothesis was not supposed to be
rejected, but due to either insufficient or wrong data we wrongly decide to reject it. The
significance level ($\alpha$) is defined as the probability of a Type I error. Thus $\alpha$ gives us the probability
of wrongly rejecting a true null hypothesis.
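The meaning of the significance level can be illustrated by simulation. The sketch below repeatedly draws samples from a population in which the null hypothesis is true, applies a two-tailed Z test, and counts how often the true hypothesis is wrongly rejected. All parameter values (population mean 100, standard deviation 15, sample size 30, 4000 trials, the seed) are hypothetical choices made for this illustration.

```python
import math
import random
from statistics import NormalDist

def type_one_error_rate(alpha, n=30, trials=4000, mu0=100.0, sigma=15.0, seed=1):
    """Share of trials in which a TRUE H0: mu = mu0 is rejected by a two-tailed Z test."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        # The sample really comes from a population with mean mu0, so H0 is true.
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        x_bar = sum(sample) / n
        z0 = (x_bar - mu0) / (sigma / math.sqrt(n))
        if abs(z0) > z_crit:
            rejections += 1  # a Type I error
    return rejections / trials

rate = type_one_error_rate(alpha=0.05)
print(f"Observed Type I error rate: {rate:.3f}")
```

The observed rate should lie close to the chosen significance level, which is what the definition of $\alpha$ leads us to expect.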
4.3.2 Type II error
This is committed when we fail to reject $H_0$ when it is really false, i.e. we accept $H_0$ when it is not
true.
The following Table summarizes the relationship between $H_0$ and the two types of errors.
sided test if the alternative hypothesis takes the form $H_1: \mu \neq \mu_0$. That is, in general:
1. If the alternative hypothesis contains the not-equal-to symbol (≠), then the test is a two-sided
test,
i.e. $H_1: \mu \neq \mu_0$.
2. If the alternative hypothesis contains the greater-than symbol (>), then the test is a
right-tailed test,
i.e. $H_1: \mu > \mu_0$.
3. If the alternative hypothesis contains the less-than symbol (<), then the test is a
left-tailed test,
i.e. $H_1: \mu < \mu_0$.
Furthermore, a two-tailed test is one in which the hypothesis about the population mean is rejected for a
value falling into either tail of the sampling distribution, as indicated in Figure 2.
When the hypothesis about the population mean is rejected only for a value falling into one of the
tails of the sampling distribution, it is known as a one-tailed test. See Figures 2 and 3.
For one tailed hypothesis test we have
The shaded region is equivalent to the magnitude of the significance level; for a two-tailed test we
distribute the significance level equally between the upper and lower tails by dividing it by 2.
The critical value is obtained from statistical tables depending on the sampling distribution of the
test statistic. For example if normal distribution is assumed then the standard normal table should
be used to obtain the critical value, if it is the student’s t-distribution then the t-distribution table is
appropriate.
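When the standard normal distribution applies, the critical values normally read from the Z-table can also be obtained numerically. The helper below is a sketch with an assumed name (`z_critical`); it relies only on `statistics.NormalDist` from the Python standard library.

```python
from statistics import NormalDist

def z_critical(alpha, tail):
    """Critical value of the standard normal for a 'right', 'left' or 'two' tailed test."""
    std_normal = NormalDist()
    if tail == "right":
        return std_normal.inv_cdf(1 - alpha)      # all of alpha in the upper tail
    if tail == "left":
        return std_normal.inv_cdf(alpha)          # all of alpha in the lower tail
    if tail == "two":
        return std_normal.inv_cdf(1 - alpha / 2)  # alpha split equally; use +/- this value
    raise ValueError("tail must be 'right', 'left' or 'two'")

for tail in ("right", "left", "two"):
    print(tail, round(z_critical(0.05, tail), 3))
```

At the 5% level this reproduces the familiar table values of about 1.645 for a one-tailed test and 1.96 for a two-tailed test.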
4.5 Formulation of Null and Alternative hypotheses
So far we have seen the main differences between the null and alternative hypotheses; we will now
demonstrate how to formulate the two conflicting hypotheses using the following examples.
Example 4.1
State the null and alternative hypotheses and identify which represents the claim.
(i) A local government at Ubungo Municipality claims that the mean monthly
household income at its locality is at least Tshs. 50,000.
(ii) A Pepsi Cola manufacturer claims that less than a quarter of Tanzania population
drink Pepsi Cola.
(iii) The mean weight of babies born in 2019 is at most 3.5kgs
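One possible formulation for Example 4.1 is sketched below. It assumes the usual convention that the equality sign is placed in the null hypothesis and that the alternative carries the strict inequality; here $\mu$ denotes the population mean and $P$ the population proportion.

```latex
\begin{align*}
\text{(i)}\quad   & H_0: \mu \geq 50{,}000 \ \text{(claim)} & \text{vs.}\quad & H_1: \mu < 50{,}000 \\
\text{(ii)}\quad  & H_0: P \geq 0.25                        & \text{vs.}\quad & H_1: P < 0.25 \ \text{(claim)} \\
\text{(iii)}\quad & H_0: \mu \leq 3.5 \ \text{(claim)}      & \text{vs.}\quad & H_1: \mu > 3.5
\end{align*}
```

In (i) and (iii) the claim contains an equality ("at least", "at most"), so it sits in the null hypothesis; in (ii) the claim is a strict inequality, so it becomes the alternative.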
Example 4.2
A company has stated that its plastics machine makes plastics that are 8 mm in diameter. A
worker believes the machine no longer makes plastics of this size and samples 100 plastics to
perform the hypothesis test with 99% confidence. State both the null and alternative hypotheses.
Example 4.3
Doctors believe that children sleep, on average, at most 10 hours per day. A researcher
believes that children on average sleep longer. Write down $H_0$ and $H_1$.
Example 4.4
The school board claims that at least 80% of students bring a phone to school. A teacher believes
this number is too high and randomly samples 25 students to test at a level of significance of 0.02.
Write down $H_0$ and $H_1$.
The testing procedure includes choosing a suitable test statistic and dividing its range of values into two
regions known as the rejection and non-rejection regions. This partitioning is done at the critical
value(s). The size of the rejection region is just the probability of committing a Type I error. This
probability is also known as the level of significance. A critical value is the value of the random
variable that cuts off a tail area determined by the level of significance.
3. Using the given level of significance, find the critical value(s) and specify the rejection and
non-rejection regions of the distribution. Also locate the value of the test statistic on the
distribution.
4. Make a statistical decision based on where the value of the test statistic falls and then give a
managerial decision or conclusion or comment.
Alternatively, after computing the appropriate test statistic in step two, we can
also make the decision by calculating its corresponding probability value (commonly known as the Prob.
value/P-value) and comparing it with the given level of significance. If the P-value is found to be
greater than the given level of significance (i.e. P-value > $\alpha$) then we fail to reject the null
hypothesis; otherwise, if the P-value is found to be less than or equal to the level of significance
(i.e. P-value ≤ $\alpha$), then the null hypothesis is rejected. This option is often employed in
many statistical computer packages. Generally the test procedures under this technique are as
indicated below:
Test procedures under P-value approach
1. Identify the two conflicting hypotheses (i.e. $H_0$ and $H_1$) and the level of significance ($\alpha$)
2. Compute the appropriate test statistic
3. Calculate the P-value: Note that if the appropriate test statistic is $Z_0$ then the P-value will
be deduced as indicated below
Test Type       P-value
Right tailed    $P(Z > Z_0)$
Left tailed     $P(Z < Z_0)$
Two tailed      $2 \times P(Z > |Z_0|)$
4. Decision:
If P-value ≤ $\alpha$ then reject $H_0$ and accept $H_1$
If P-value > $\alpha$ then we do not reject/we fail to reject $H_0$
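The P-value procedure can be expressed compactly in code. The sketch below assumes a test statistic that follows the standard normal distribution; the function names `p_value` and `decide` and the illustrative statistic of 1.6 are assumptions made for this example.

```python
from statistics import NormalDist

def p_value(z0, tail):
    """P-value for a Z statistic under a 'right', 'left' or 'two' tailed alternative."""
    std_normal = NormalDist()
    if tail == "right":
        return 1 - std_normal.cdf(z0)             # P(Z > z0)
    if tail == "left":
        return std_normal.cdf(z0)                 # P(Z < z0)
    if tail == "two":
        return 2 * (1 - std_normal.cdf(abs(z0)))  # 2 * P(Z > |z0|)
    raise ValueError("tail must be 'right', 'left' or 'two'")

def decide(p, alpha):
    """Step 4 of the procedure: compare the P-value with alpha."""
    return "reject H0" if p <= alpha else "fail to reject H0"

# Illustrative statistic (hypothetical): z0 = 1.6 at alpha = 0.05
for tail in ("right", "two"):
    p = p_value(1.6, tail)
    print(tail, round(p, 4), decide(p, 0.05))
```

With this illustrative statistic, both the right-tailed and the two-tailed P-values exceed 0.05, so the null hypothesis would not be rejected in either case.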
4.6.2 Confidence Interval approach
The following procedures will be used under the confidence interval approach
1. Identify the null hypothesis ($H_0$), alternative hypothesis ($H_1$) and the level of significance
($\alpha$)
2. Establish or compute the appropriate confidence interval by using the given sample
statistics.
3. Make both statistical and managerial decisions based on whether the hypothesized value
falls within or outside the confidence interval
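The three steps of the confidence interval approach can be sketched for a two-sided test on a mean as follows. The function name and the sample figures are hypothetical, and a large sample (or a known standard deviation) is assumed so that the normal quantile is appropriate.

```python
import math
from statistics import NormalDist

def ci_test_mean(x_bar, sd, n, mu0, alpha):
    """Two-sided test of H0: mu = mu0 via the (1 - alpha) confidence interval."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z * sd / math.sqrt(n)
    lower, upper = x_bar - margin, x_bar + margin
    reject = not (lower <= mu0 <= upper)  # reject only if mu0 lies outside the interval
    return lower, upper, reject

# Hypothetical data: x_bar = 52, sd = 10, n = 64, H0: mu = 50, alpha = 0.05
lower, upper, reject = ci_test_mean(52, 10, 64, mu0=50, alpha=0.05)
print(f"95% CI = ({lower:.2f}, {upper:.2f}); reject H0: {reject}")
```

Here the hypothesized value 50 falls inside the interval, so the null hypothesis would not be rejected.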
Key Concepts
Test statistics
This is calculated from the sample and depends on the nature of the test and the corresponding
sampling distribution for the population parameter. The following are the formulas for calculating
the test statistic:
small sample
Level of significance ($\alpha$)
This is the maximum allowable probability of committing a Type I error. It is denoted by the lowercase
Greek letter alpha ($\alpha$). The commonly used levels of significance in research are:
$\alpha = 1\%$, $\alpha = 5\%$, $\alpha = 10\%$
Critical value
A critical value is any value that separates the normal curve into the critical region (where we reject the
null hypothesis) and the non-rejection region. This value depends on the nature of the alternative
hypothesis, the sampling distribution that applies, and the significance level, and is
obtained from statistical tables. If the normal distribution is assumed then the standard normal
table is used to obtain the critical value; if it is Student's t-distribution then the t-distribution table is
appropriate.
Rejection Region
The critical region (or rejection region) is the set of all values of the test statistic that cause us to
reject the null hypothesis. For example in the case of right tailed test, the rejection region can be
shown as indicated in the following normal curve:
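[Figure: normal curve for a right-tailed test, centred at 0, showing the non-rejection region (NRR) to the left of the critical value and the rejection region (RR) in the right tail beyond the critical value.]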
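As usual the first step will be to identify the two conflicting hypotheses ($H_0$ and $H_1$) and state the
level of significance ($\alpha$). Since the population variance ($\sigma^2$) is known in this case, regardless of
the sample size the appropriate test statistic (T.S) is
$$Z_0 = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}}$$
The critical value(s) will depend on the nature of the alternative hypothesis ($H_1$) stated in the
question. That is, if:
$H_1: \mu > \mu_0$ then the critical value will be $Z_\alpha$
$H_1: \mu < \mu_0$ then the critical value will be $-Z_\alpha$
$H_1: \mu \neq \mu_0$ then the critical values will be $\pm Z_{\alpha/2}$
Note: By using the P-value approach, after calculating the test statistic $Z_0 = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}}$,
we will be required to compute the corresponding P-value by using the expressions stated earlier in
the testing procedures.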
Example 4.5:
A random sample of size 16 was taken from a normal population of variance 4. The information
from the sample showed that x 9.8 . Test H 0 : 9.0 against the alternative H 1 : 9.0 at 5%
level of significance. (Hint: Use both critical value and P-value approaches)
Example 4.6
The average IQ for adult population is 100 with standard deviation of 15. A researcher believes this
value has changed. The researcher decides to test the IQ of 75 randomly selected adults. The
average IQ of the sample is 105. Is there enough evidence to suggest the average IQ has changed?
Use $\alpha = 0.05$
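A sketch of how Example 4.6 might be worked with both the critical value and the P-value approaches is given below. It treats the stated standard deviation of 15 as the known population value and reads "has changed" as a two-tailed alternative ($H_0: \mu = 100$ against $H_1: \mu \neq 100$); the variable names are chosen for this illustration.

```python
import math
from statistics import NormalDist

# Example 4.6: H0: mu = 100 vs H1: mu != 100 (two-tailed), sigma known
mu0, sigma, n, x_bar, alpha = 100, 15, 75, 105, 0.05

z0 = (x_bar - mu0) / (sigma / math.sqrt(n))      # test statistic
z_crit = NormalDist().inv_cdf(1 - alpha / 2)     # critical value, about 1.96
p_val = 2 * (1 - NormalDist().cdf(abs(z0)))      # two-tailed P-value

reject = abs(z0) > z_crit
print(f"Z0 = {z0:.3f}, critical value = +/-{z_crit:.3f}, P-value = {p_val:.4f}")
print("Reject H0" if reject else "Fail to reject H0")
```

The statistic of about 2.89 lies beyond 1.96 and the P-value is below 0.05, so both approaches point to rejecting the null hypothesis.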
Example 4.7
Use the confidence interval approach to test the hypothesis stipulated in example 1.5
Generally, when the population variance ($\sigma^2$) is unknown and the sample size is large, the appropriate
test statistic will be $Z_0$ and we will be testing the null hypothesis against one of the alternative
hypotheses. In this case the test statistic is again $Z_0$, but $\sigma$ is replaced by the sample standard
deviation $s$, and it is given by
$$Z_0 = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$
The critical value(s) will depend on the nature of the alternative hypothesis ($H_1$) stated in the
question. That is, if:
$H_1: \mu > \mu_0$ then the critical value will be $Z_\alpha$
$H_1: \mu < \mu_0$ then the critical value will be $-Z_\alpha$
$H_1: \mu \neq \mu_0$ then the critical values will be $\pm Z_{\alpha/2}$
Note: By using the P-value approach, after calculating the test statistic $Z_0 = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$,
we will be required to compute the corresponding P-value by using the expressions stated earlier in
the testing procedures.
Example 4.8
A study is conducted to look at the average time students spend exercising. Currently it is believed that
they spend at least 16 hours per month doing exercises. However, a researcher claims that on
average students exercise less than 16 hours per month, contrary to what is currently believed. In a
random sample of size n = 120 he finds that the mean time students exercise is $\bar{x} = 12.3$ h/month
with s = 7.43 h/month. Use the test of significance approach to test the above claim by using
$\alpha = 10\%$
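The arithmetic for Example 4.8 can be laid out as follows. This is a sketch that takes the researcher's claim as the alternative hypothesis and uses the large-sample Z statistic with $s$ in place of $\sigma$.

```latex
\begin{align*}
H_0 &: \mu \geq 16 \qquad H_1: \mu < 16 \ \text{(claim, left-tailed)} \\
Z_0 &= \frac{\bar{x} - \mu_0}{s/\sqrt{n}}
     = \frac{12.3 - 16}{7.43/\sqrt{120}}
     = \frac{-3.7}{0.6783} \approx -5.455 \\
-Z_{\alpha} &= -Z_{0.10} \approx -1.28
\end{align*}
```

Since $-5.455 < -1.28$, the test statistic falls in the rejection region, so $H_0$ is rejected at the 10% level and the sample supports the claim that students exercise less than 16 hours per month.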
Example 4.9
A manufacturing company claims that its rechargeable batteries are good for an average of more
than 1,000 charges. A random sample of 81 batteries has a mean life of 1002 charges and a
standard deviation of 14. Is there enough evidence to support this claim at $\alpha = 0.01$?
Example 4.10
A local telephone company claims that the average length of a phone call is 8 minutes. In a random
sample of 58 phone calls, the sample mean was 7.8 minutes and the standard deviation was 0.5
minutes. Is there enough evidence to support this claim at $\alpha = 0.05$?
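Generally, when the population variance ($\sigma^2$) is unknown and the sample size is small, the
$Z_0$ statistic can no longer be used. The appropriate test statistic will be $t_0$, where $t_0$ has a
t-distribution with $n - 1$ degrees of freedom and is given by
$$t_0 = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$
The critical value(s) under this situation will depend on the nature of the alternative hypothesis stated
in the question. That is, if:
$H_1: \mu > \mu_0$ then the critical value will be $t_{\alpha,\,(n-1)}$
$H_1: \mu < \mu_0$ then the critical value will be $-t_{\alpha,\,(n-1)}$
$H_1: \mu \neq \mu_0$ then the critical values will be $\pm t_{\alpha/2,\,(n-1)}$
Note that if the appropriate test statistic is $t_0$ then the P-value will be deduced as indicated below
Test Type     P-value
Right tail    $P(t > t_0)$
Left tail     $P(t < t_0)$
Two tail      $2 \times P(t > |t_0|)$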
Example 4.12
The average IQ for adult population is 100 with standard deviation of 15. A researcher believes the
average IQ is lower. A random sample of 5 adults is tested and the scores are 69, 79, 89, 99 and
109. Is there enough evidence to suggest the average IQ is lower? Hint: Use the critical value and
confidence interval approaches.
Example 4.13
Airtel claims that the average length of a phone call is 8 minutes. In a random sample of 25 phone
calls, the sample mean was 9.8 minutes and the standard deviation was 0.5 minutes. Is there
enough evidence to support this claim at $\alpha = 0.05$? (Hint: Use the test of significance approach)
Example 4.14
A manufacturer claims that its rechargeable batteries have an average life greater than 1,000
charges. A random sample of 10 batteries has a mean life of 1002 charges and a standard deviation
of 14. Is there enough evidence to support this claim at $\alpha = 0.1$? (Use the test of significance
approach)
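For the small-sample case, Example 4.14 can be sketched as below. Python's standard library has no t-distribution, so the critical value $t_{0.10,\,9} = 1.383$ is assumed to be read from a t-table, in line with the handbook's use of statistical tables. The function name is an assumption made for this sketch.

```python
import math

def one_sample_t(x_bar, mu0, s, n):
    """t statistic with n - 1 degrees of freedom for H0: mu = mu0."""
    return (x_bar - mu0) / (s / math.sqrt(n))

# Example 4.14: H0: mu <= 1000 vs H1: mu > 1000 (claim), n = 10, alpha = 0.1
t0 = one_sample_t(x_bar=1002, mu0=1000, s=14, n=10)
t_crit = 1.383  # t_{0.10, 9}, taken from a t-table
reject = t0 > t_crit  # right-tailed test
print(f"t0 = {t0:.4f}, critical value = {t_crit}")
print("Reject H0" if reject else "Fail to reject H0")
```

The statistic of about 0.45 is well below 1.383, so under these assumptions there is not enough evidence to support the manufacturer's claim.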
4.8 Hypothesis testing for the difference between two population means:
4.8.1 When population variances are known
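As usual the following procedures will be used. Step one will be to formulate the two conflicting
hypotheses. That is, we test $H_0: \mu_1 - \mu_2 = c$ against one of the alternative hypotheses
$H_1: \mu_1 - \mu_2 \neq c$ or $H_1: \mu_1 - \mu_2 > c$ or $H_1: \mu_1 - \mu_2 < c$
The appropriate test statistic is
$$Z_0 = \frac{(\bar{x}_1 - \bar{x}_2) - c}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$$
By using the critical value approach, the above test statistic is compared with the critical values. That
is, if:
$H_1: \mu_1 - \mu_2 > c$ then the critical value will be $Z_\alpha$
$H_1: \mu_1 - \mu_2 < c$ then the critical value will be $-Z_\alpha$
$H_1: \mu_1 - \mu_2 \neq c$ then the critical values will be $\pm Z_{\alpha/2}$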
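When the population variances are unknown and the samples are large, the test procedure is similar to
the previous situation except that the population variances ($\sigma_i^2$) are replaced by the sample
variances ($s_i^2$) for $i = 1, 2$.
The test statistic is thus
$$Z_0 = \frac{(\bar{x}_1 - \bar{x}_2) - c}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$
By using the critical value approach, the above test statistic is compared with the critical values in the
same way as when the population variances are known.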
Example 4.15:
Let X denote the annual income of workers from the financial sector and Y denote the annual income
of workers from the education sector. Data collected from these two sectors provide the following
information: $n_x = 210$, $\sum X = 11{,}340$, $\sum (X - \bar{X})^2 = 13{,}376$, $n_y = 190$, $\sum Y = 8{,}930$,
$\sum (Y - \bar{Y})^2 = 10{,}584$. Test the null hypothesis that the two sectors have the same annual income
against the alternative hypothesis that the financial sector pays more. Use $\alpha = 5\%$
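A possible computation for Example 4.15 is sketched below. It rests on an assumption about the printed data: the figures 11,340 and 8,930 are read as the sample totals $\sum X$ and $\sum Y$ (they divide exactly by the sample sizes, giving means of 54 and 47), and 13,376 and 10,584 are read as the sums of squared deviations. If the figures were instead intended as sample means, the numbers below would change.

```python
import math
from statistics import NormalDist

# ASSUMPTION: 11,340 and 8,930 are totals (sum of X, sum of Y), not means.
n_x, sum_x, ss_x = 210, 11340, 13376   # ss_x = sum of (X - X_bar)^2
n_y, sum_y, ss_y = 190, 8930, 10584    # ss_y = sum of (Y - Y_bar)^2

x_bar, y_bar = sum_x / n_x, sum_y / n_y
var_x, var_y = ss_x / (n_x - 1), ss_y / (n_y - 1)  # sample variances

# H0: mu_x - mu_y = 0 vs H1: mu_x - mu_y > 0 (financial sector pays more)
z0 = (x_bar - y_bar - 0) / math.sqrt(var_x / n_x + var_y / n_y)
z_crit = NormalDist().inv_cdf(1 - 0.05)  # right-tailed, alpha = 5%

print(f"means: {x_bar:.1f}, {y_bar:.1f}; variances: {var_x:.1f}, {var_y:.1f}")
print(f"Z0 = {z0:.3f}, critical value = {z_crit:.3f}")
```

Under this reading the statistic is far above the critical value, so the null hypothesis of equal incomes would be rejected in favour of the financial sector paying more.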
Example 4.16:
A random sample of size 16 showed an average of 480g with a standard deviation of 21g, on the
other hand, a sample of size 25 resulted to an average of 490g with a standard deviation of 24g.
Test a null hypotheses H 0 : 1 2 0 against the alternative H1 : 1 2 0 at 5% level of
significance.
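$$t_c = \frac{\bar{x}_d - \mu_d}{s_d / \sqrt{n}}$$
where
$$\bar{x}_d = \frac{\sum x_d}{n}$$
is the mean of the differences and $n$ is the number of pairs, and
$$s_d = \sqrt{\frac{\sum (x_d - \bar{x}_d)^2}{n - 1}} = \sqrt{\frac{\sum x_d^2 - \frac{(\sum x_d)^2}{n}}{n - 1}}$$
is the standard deviation of the differences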
Example 4.17
A company sent seven of its employees to attend a course in building self-confidence. These
employees were evaluated for their self-confidence before and after attending this course. The
following table gives the scores (on a scale of 1 to 15, 1 being the lowest and 15 being the highest
score) of these employees before and after they attended the course.
Before 9 8 11 5 7 6 10
After 9 5 9 4 5 8 5
Test whether attending the course has positively affected the self-confidence of the employees.
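Solution
Let $x_d$ = (score after) − (score before) for each employee, so that a positive effect of the course
corresponds to $\mu_d > 0$.
$H_0: \mu_{after} = \mu_{before}$ (i.e. $\mu_d = 0$)
$H_1: \mu_{after} > \mu_{before}$ (i.e. $\mu_d > 0$)
The differences are 0, −3, −2, −1, −2, 2, −5, so $\sum x_d = -11$ and $\sum x_d^2 = 47$.
$$\bar{x}_d = \frac{\sum x_d}{n} = \frac{-11}{7} = -1.57$$
$$s_d = \sqrt{\frac{\sum x_d^2 - \frac{(\sum x_d)^2}{n}}{n - 1}} = \sqrt{\frac{47 - \frac{(-11)^2}{7}}{7 - 1}} = 2.225$$
$$t_c = \frac{\bar{x}_d - \mu_d}{s_d / \sqrt{n}} = \frac{-1.57 - 0}{2.225 / \sqrt{7}} = -1.8669$$
$$t_{\alpha,\,n-1} = t_{0.05,\,6} = 1.943$$
The calculated $t_c$ value falls in the non-rejection region ($t_c < 1.943$), so we do not reject the null hypothesis at the 5%
level and we can conclude that the training had no significant positive effect on the employees’
scores.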
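The paired-sample calculation for Example 4.17 can be checked with a short script. In this sketch the differences are taken as after minus before, so a positive effect of the course corresponds to a right-tailed test; taking before minus after simply flips the sign of the mean difference and of the statistic. The critical value $t_{0.05,\,6} = 1.943$ is assumed to come from a t-table.

```python
import math

before = [9, 8, 11, 5, 7, 6, 10]
after = [9, 5, 9, 4, 5, 8, 5]

# Differences defined as after - before (an assumption about direction)
d = [a - b for a, b in zip(after, before)]
n = len(d)
d_bar = sum(d) / n
s_d = math.sqrt(sum((x - d_bar) ** 2 for x in d) / (n - 1))
t_c = (d_bar - 0) / (s_d / math.sqrt(n))

t_crit = 1.943  # t_{0.05, 6}, taken from a t-table
reject = t_c > t_crit  # H1: mu_d > 0 (course improved the scores)
print(f"d_bar = {d_bar:.2f}, s_d = {s_d:.3f}, t_c = {t_c:.3f}")
print("Reject H0" if reject else "Fail to reject H0")
```

The statistic is negative and therefore nowhere near the right-tail rejection region, which agrees with the conclusion that the data show no significant positive effect.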
1. Identify $H_0$, $H_1$ and $\alpha$
We will be testing the null hypothesis $H_0: P = P_0$ against one of the alternative
hypotheses
$H_1: P \neq P_0$ or $H_1: P > P_0$ or $H_1: P < P_0$
$$\hat{p} = \frac{x}{n}$$
Note:
The testing can be carried out if the following condition holds:
$$n\hat{p}(1 - \hat{p}) > 9$$
Example 4.18:
A marketing company claims that it receives 8% responses from its mailing. To test this claim, a
random sample of size 500 was surveyed, with 25 responses. Test this claim at the $\alpha = 0.05$
significance level. Hint: use the test of significance approach.
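Example 4.18 can be sketched as follows. The test statistic used here is the standard one-sample Z statistic for a proportion, $Z_0 = (\hat{p} - P_0) / \sqrt{P_0(1 - P_0)/n}$; this form is an assumption, since the statistic itself is not shown in the surrounding text. The claim "receives 8%" is read as a two-tailed test.

```python
import math
from statistics import NormalDist

def one_proportion_z(x, n, p0):
    """Z statistic for H0: P = p0, using the standard error under H0."""
    p_hat = x / n
    return (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)

# Example 4.18: H0: P = 0.08 vs H1: P != 0.08, 25 responses out of 500
alpha = 0.05
z0 = one_proportion_z(x=25, n=500, p0=0.08)
z_crit = NormalDist().inv_cdf(1 - alpha / 2)
p_val = 2 * (1 - NormalDist().cdf(abs(z0)))

reject = abs(z0) > z_crit
print(f"Z0 = {z0:.3f}, critical value = +/-{z_crit:.3f}, P-value = {p_val:.4f}")
print("Reject H0" if reject else "Fail to reject H0")
```

The statistic of about -2.47 lies beyond -1.96, so under these assumptions the claimed 8% response rate would be rejected at the 5% level.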
respective population having proportions $P_1$ and $P_2$. We can test the null hypothesis that there is
no difference between the population proportions.
In this case we will be testing the null hypothesis $H_0: P_1 = P_2$ against one of the
alternative hypotheses
$H_1: P_1 \neq P_2$ or $H_1: P_1 > P_2$ or $H_1: P_1 < P_2$
But, since the two populations are assumed to be independent, the appropriate test statistic
becomes
$$Z_0 = \frac{(\hat{p}_1 - \hat{p}_2) - (P_1 - P_2)}{\sqrt{\bar{p}(1 - \bar{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$
Where $\hat{p}_1 = \frac{x_1}{n_1}$ and $\hat{p}_2 = \frac{x_2}{n_2}$ are the sample proportions, and $\bar{p}$ is called the pooled
sample proportion from the two samples, and is given by:
$$\bar{p} = \frac{x_1 + x_2}{n_1 + n_2}$$
Where $x_1$ and $x_2$ are the numbers of successes in the two samples, and $n_1$ and $n_2$ are the sample sizes
Example 4.19:
In a random sample of 100 persons taken from village A, 60 are found to be consuming tea. In
another sample of 200 persons from village B, 100 are found to be consuming tea. Do the data
reveal significant difference between the two villages so far as the habit of taking tea is concerned?
Use 1% level of significance
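Example 4.19 can be worked with the pooled-proportion statistic given above. The function name is an assumption made for this sketch, and "significant difference" is read as a two-tailed alternative.

```python
import math
from statistics import NormalDist

def two_proportion_z(x1, n1, x2, n2):
    """Z statistic for H0: P1 = P2 using the pooled sample proportion."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Example 4.19: village A 60 of 100, village B 100 of 200, alpha = 1%
alpha = 0.01
z0 = two_proportion_z(60, 100, 100, 200)
z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 2.576

reject = abs(z0) > z_crit
print(f"Z0 = {z0:.3f}, critical value = +/-{z_crit:.3f}")
print("Reject H0" if reject else "Fail to reject H0")
```

The statistic of about 1.64 lies inside the non-rejection region, so the data do not reveal a significant difference between the two villages at the 1% level.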
Review Questions
1. The average salary of graduates entering the ICT field is reported to be Tsh 400,000 per
month. To test this, an ICT company manager surveys 20 graduates and finds their average
salary to be Tsh 432,280 with a standard deviation of Tsh 40,000. Using a 10% level of
significance, has he shown the reported salary to be incorrect? Hint: Use the test of significance
approach
2. The mean life time of a sample of 100 light bulbs produced by a certain company was found to
be 1,580 hrs with a standard deviation of 90 hrs. Test the hypothesis that the mean life
time of bulbs produced by the company is 1,600 hrs at the 5% level of significance. Hint: Use the
confidence interval approach
3. A company uses two machines to fill packets of crisps. A sample of 30 packets from
the first machine had a mean weight of 180g and a standard deviation of 14g. A
sample of 40 packets from the second machine had a mean weight of 170g and a
standard deviation of 10g. Does the evidence from these samples support the view
that the two machines produce packets of equal weights? Use $\alpha = 5\%$
4. Let X denote the annual income of workers from the financial sector and Y denote the annual
income of workers from the education sector. Data collected from these two sectors provide the
following information: $n_x = 25$, $\sum X = 11{,}340$, $\sum (X - \bar{X})^2 = 13{,}376$, $n_y = 35$,
$\sum Y = 8{,}930$, $\sum (Y - \bar{Y})^2 = 10{,}584$. Test the null hypothesis that the two sectors have the
same annual income against the alternative hypothesis that the financial sector pays more.
Use $\alpha = 5\%$
5. A lecturer wants to know if her introductory statistics class has a good grasp of basic math. 8
students are chosen at random from the class and given a math proficiency test. The lecturer
wants the class to be able to score above 70 on the test. The eight students get scores of 62, 92,
75, 68, 83, 75, 90 and 95. Can the lecturer have 90 percent confidence that the mean score
for the class on the test would be above 70?
the allergy was taken and showed that the medicine provided relief for 160 people.
Determine whether the manufacturer’s claim is legitimate.
7. A cereals businessman took 70 observations concerning the price of rice/kg in Mbeya region.
The results produced a mean price of Tsh. 700 with a standard deviation of Tsh. 30.
Similarly 65 observations were taken in Shinyanga region. The result revealed a mean price
of Tsh.650 with a standard deviation of Tsh. 20. Determine whether there is enough
evidence that the mean prices of rice in Mbeya region are significantly greater than those in
Shinyanga region at 1% level of significance.
8. An automatic bottling machine fills cola into 2-litre bottles. A consumer advocate wants to
test the null hypothesis that the average amount filled by the machine into a bottle is at
least 2 liters. A random sample of 40 bottles coming out of the machine is selected and the exact
contents of the selected bottles are recorded. The sample mean was 1.9996 liters. The
population standard deviation is known from past experience to be 0.0013 liters.
a) Test the hypothesis at α = 5%
b) Assume that the population is normally distributed with the same σ. Assume that
the sample size is only 20 but the sample mean is the same. Conduct the test once
again at α = 0.05.
c) If there is a difference in the two test results, explain the reason for the differences.
9. St. Michaels Swahili Medium School has 300 students. The head teacher claims that the
average IQ of students at this school is at least 110. To prove his point, he administered an
IQ test to 20 randomly selected students. The results revealed the average IQ of 108 with a
standard deviation of 10. Based on these results, at a 0.01 level, should the head teacher
stick to his original assumption?
10. The same test was given to a group of 100 scouts and to a group of 144 guides. The mean
score for the scouts was 27.53 and the mean score for the guides was 26.81. Assume a
population variance of 12.11 for both the scouts’ and the guides’ scores.
a) Test whether the scouts’ performance in the test is the same as that of the guides at the 5%
level of significance, assuming that the scores are normally distributed. Use the
general procedures for testing hypotheses.
b) If you used interval estimation to test whether the scouts’ performance in the
test is the same as that of the guides, what decision would you make from the 95%
confidence interval? Is it the same decision as in (a) above? Explain.
10. A recent article describes how finance incentives by major automakers are reducing banks’
share of the market for automobile loans. The article reports that in 1990, banks wrote
about 53% of all car loans, and in 2000 the banks’ share was only 43%. Suppose that these
data are based on a random sample of 100 car loans in 1990, where 53 of the loans were
found to be bank loans, and that the 2000 data are also based on a random sample of 100 loans, 43
of which were found to be bank loans. Carry out the test of the equality of the banks’ share
of the car loan market in 1990 and in 2000.
11. Within a District, students were randomly assigned to two Mathematics teachers; Mrs.
Smith and Mrs. Jones. After the assignment, Mrs. Smith had 30 students and Mrs. Jones
had 45 students. At the end of the year, each class took the same standardized test. Mrs.
Smith’s students had an average test score of 78 with a standard deviation of 10, and Mrs.
Jones’s students had an average test score of 85 with a standard deviation of 15. Assume that
student’s performance is approximately normal. Test the hypothesis that Mrs. Smith and
Mrs. Jones are equally effective teachers at 10% level.
12. The CEO of a large electric utility claims that 80% of his 1,000,000 customers are very
satisfied with the services they receive. To test this claim, the local Newspaper surveyed 100
customers, using simple random sampling. Among the sampled customers 73% say they are
very satisfied. Based on this finding, can we reject the CEO’s hypothesis that 80% of the
customers are very satisfied? Use 5% level of significance.
13. Suppose that the Goodyear Tire Company has historically held 42% of the market for
automobile tires in Tanzania. Recent changes in company operations, especially its
diversification to other areas of business, as well as changes in competing firms’ operations,
prompt the firm to test the validity of the assumption that it still controls 42% of the
market. A random sample of 550 automobiles on the road shows that 219 of them have
Goodyear tires. Conduct the test at α = 0.01.
CHAPTER FIVE
REGRESSION AND CORRELATION ANALYSIS
5.1 Introduction
Regression analysis is concerned with the study of the relationship between one variable, called the
dependent variable, and one or more independent variables. In other words, it is the statistical
analysis which deals with the prediction or estimation of the dependent variable based on the values of
other variables called independent variables. Other names for the dependent variable are outcome
variable, predicted variable or explained variable, whereas the independent variables are also called
explanatory variables. Thus we may be interested in studying how the
profitability of a commercial bank is related to liquidity, capital adequacy ratio, non-performing loans,
interest rate and GDP growth. Or we may be interested in examining how individual income is
related to the level of education acquired and working experience, or how sales of a certain product
are attributed to the advertising expenditure incurred. All these situations are examples where
regression analysis can be applied. Furthermore, the relationship between economic or
financial variables can be linear or non-linear. However, linear regression analysis assumes a linear
relationship among the variables of interest, more specifically linearity in the parameters. For consistency
of notation we normally use $Y$ to denote the dependent variable and $X$ to represent the independent
variables.
Regression analysis is used for the following purposes:
1. To predict, or forecast, the mean value of the dependent variable, given the value of the
independent variable(s).
2. To estimate the mean or average value of the dependent variable, given the value of the
independent variables.
3. To test hypotheses about the nature of the dependence, that is, hypotheses suggested by
economic theory.
$$E(Y \mid X_i) = \beta_0 + \beta_1 X_i$$
The above model is called the deterministic population regression function (PRF), where $\beta_0$
and $\beta_1$ are constant parameters, also known as regression coefficients. The corresponding
stochastic population regression function is as indicated below:
$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \qquad i = 1, \cdots, n$$
$$Y_i = E(Y \mid X_i) + \varepsilon_i$$
Where $\varepsilon_i$ is called the stochastic or random error term. In general, the stochastic PRF
indicates that the actual observation for each individual is equal to the average of its group plus
or minus some quantity. In other words, the stochastic version of the PRF states that any individual
$Y$ value can be expressed as the sum of two components: a deterministic/systematic component
($\beta_0 + \beta_1 X_i$) and a nonsystematic/random component ($\varepsilon_i$). Both $Y_i$ and
$\varepsilon_i$ are treated as random (stochastic) variables, whereas $\beta_0$ and $\beta_1$ as
well as $X_i$ are treated as non-random. Since $\varepsilon_i$ is treated as a random variable, it
is characterized by a probability distribution, as will be shown later. The differences between the
stochastic PRF and its deterministic counterpart can be explained in more detail using the
following example.
Example 5.1
Economic theory claims that there exists a positive linear relationship between individual
consumption and income, as indicated in the following scatter plot. From the graph, deduce both the
deterministic and the stochastic PRF. Note that individual consumption is the dependent variable
($Y$) and income is the independent variable ($X$).
Deterministic PRF:
$$E(Y \mid X_i) = \beta_0 + \beta_1 X_i$$
Stochastic PRF:
$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$
However, as we have pointed out in the previous sections, we normally use sample information to
represent the whole population due to factors related to cost, time, and the complexity of obtaining
information from the whole population. Therefore, the Sample Regression Function (SRF),
which is the estimator of the PRF, in its deterministic form is as indicated below:
$$\hat{Y}_i = \hat\beta_0 + \hat\beta_1 X_i$$
and in its stochastic form is as indicated below:
$$Y_i = \hat\beta_0 + \hat\beta_1 X_i + e_i$$
$$Y_i = \hat{Y}_i + e_i$$
Where
$\hat{Y}_i$ = the estimator of $E(Y \mid X_i)$
$\hat\beta_0$ = the estimator of $\beta_0$
$\hat\beta_1$ = the estimator of $\beta_1$
$e_i$ = the estimator of $\varepsilon_i$
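Under the ordinary least squares (OLS) method, the sample regression coefficients
($\hat\beta_0$, $\hat\beta_1$) should be chosen in such a way that the residual sum of squares (RSS)
is as small as possible. In other words, the OLS method minimizes the sum of squares of the vertical
distances between the actual data points and the corresponding points on the regression line.
Algebraically, it states that
Minimize:
$$RSS = \sum e_i^2 = \sum (Y_i - \hat{Y}_i)^2 = \sum (Y_i - \hat\beta_0 - \hat\beta_1 X_i)^2$$
The resulting estimators are
$$\hat\beta_1 = \frac{\sum X_i Y_i - n\bar{X}\bar{Y}}{\sum X_i^2 - n\bar{X}^2}
= \frac{n\sum X_i Y_i - \sum X_i \sum Y_i}{n\sum X_i^2 - \left(\sum X_i\right)^2}
= \frac{S_{xy}}{S_{xx}}$$
$$\hat\beta_0 = \bar{Y} - \hat\beta_1 \bar{X} = \frac{1}{n}\left(\sum Y_i - \hat\beta_1 \sum X_i\right)$$
Where
$$S_{xx} = \sum X_i^2 - \frac{1}{n}\left(\sum X_i\right)^2$$
$$S_{xy} = \sum X_i Y_i - \frac{1}{n}\sum X_i \sum Y_i$$
$$S_{yy} = \sum Y_i^2 - \frac{1}{n}\left(\sum Y_i\right)^2$$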
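The least squares estimators follow from the first-order conditions of the minimization problem. A
short sketch of that derivation is given here, writing $\hat\beta_0$ for the intercept estimate,
$\hat\beta_1$ for the slope estimate, and $S_{xx}$, $S_{xy}$ for the corrected sums of squares and
cross products (this notation is an assumption made for the sketch):

```latex
\begin{aligned}
\frac{\partial RSS}{\partial \hat\beta_0}
  &= -2\sum \left(Y_i - \hat\beta_0 - \hat\beta_1 X_i\right) = 0
  &&\Rightarrow\; \sum Y_i = n\hat\beta_0 + \hat\beta_1 \sum X_i \\
\frac{\partial RSS}{\partial \hat\beta_1}
  &= -2\sum X_i\left(Y_i - \hat\beta_0 - \hat\beta_1 X_i\right) = 0
  &&\Rightarrow\; \sum X_i Y_i = \hat\beta_0 \sum X_i + \hat\beta_1 \sum X_i^2 \\
\hat\beta_0 &= \bar{Y} - \hat\beta_1 \bar{X}
  &&\text{(from the first equation)} \\
\hat\beta_1 &= \frac{\sum X_i Y_i - \tfrac{1}{n}\sum X_i \sum Y_i}
                    {\sum X_i^2 - \tfrac{1}{n}\left(\sum X_i\right)^2}
             = \frac{S_{xy}}{S_{xx}}
  &&\text{(substituting } \hat\beta_0 \text{ into the second equation)}
\end{aligned}
```

The two conditions on the right are usually called the normal equations; solving them
simultaneously gives the slope first and then the intercept.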
Example 5.2
The following pairs of values $(X, Y)$ represent the amount of sales ($Y$) of a certain product, in
millions of Tshs, and the amount spent on advertising the same product ($X$).
X 5 10 15 20 25
Y 22 32 38 59 67
a) Plot a scatter diagram and estimate the line of best fit
b) Obtain the regression line by using the least squares method
c) Predict the value of Y when $X = 110$
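Solution
a) A scatter diagram and the estimated line of best fit are shown in the figure below:
[Figure: scatter diagram of Y against X with the estimated line of best fit]
b) The required sums are computed in the following table:
X      Y      XY     $X^2$   $Y^2$
5      22     110    25      484
10     32     320    100     1024
15     38     570    225     1444
20     59     1180   400     3481
25     67     1675   625     4489
Total  75     218    3855    1375    10922
(The Total row lists $\sum X = 75$, $\sum Y = 218$, $\sum XY = 3855$, $\sum X^2 = 1375$ and
$\sum Y^2 = 10922$.)
$$S_{xx} = \sum X_i^2 - \frac{1}{n}\left(\sum X_i\right)^2 = 1375 - \frac{1}{5}(75)^2 = 250$$
$$S_{xy} = \sum X_i Y_i - \frac{1}{n}\sum X_i \sum Y_i = 3855 - \frac{1}{5}(75)(218) = 585$$
$$S_{yy} = \sum Y_i^2 - \frac{1}{n}\left(\sum Y_i\right)^2 = 10922 - \frac{1}{5}(218)^2 = 1417.2$$
Hence,
$$\hat\beta_1 = \frac{S_{xy}}{S_{xx}} = \frac{585}{250} = 2.34$$
$$\hat\beta_0 = \bar{Y} - \hat\beta_1 \bar{X} = \frac{1}{n}\left(\sum Y_i - \hat\beta_1 \sum X_i\right)
= \frac{1}{5}\left(218 - 2.34(75)\right) = 8.5$$
The estimated regression line is therefore $\hat{Y} = 8.5 + 2.34X$.
c) When $X = 110$, the predicted value is $\hat{Y} = 8.5 + 2.34(110) = 265.9$.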
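The hand calculation in Example 5.2 can be cross-checked with a few lines of Python. The sketch
below is only an illustration: the function name `simple_ols` and the variable names are
assumptions made here, and only the standard library is used.

```python
def simple_ols(x, y):
    """Return (S_xx, S_xy, S_yy, slope, intercept) for paired data."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    # Corrected sums of squares and cross products.
    s_xx = sum(v * v for v in x) - sum_x ** 2 / n
    s_xy = sum(a * b for a, b in zip(x, y)) - sum_x * sum_y / n
    s_yy = sum(v * v for v in y) - sum_y ** 2 / n
    slope = s_xy / s_xx
    intercept = sum_y / n - slope * sum_x / n
    return s_xx, s_xy, s_yy, slope, intercept


# Data of Example 5.2: advertising expenditure (X) and sales (Y).
x = [5, 10, 15, 20, 25]
y = [22, 32, 38, 59, 67]

s_xx, s_xy, s_yy, slope, intercept = simple_ols(x, y)
predicted = intercept + slope * 110  # part (c): predicted Y at X = 110

print(f"S_xx = {s_xx:.1f}, S_xy = {s_xy:.1f}, S_yy = {s_yy:.1f}")
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
print(f"predicted Y at X = 110: {predicted:.1f}")
```

Note that $X = 110$ lies far outside the observed range of 5 to 25, so the prediction in part (c)
is an extrapolation and should be read with caution.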
One of the classical linear regression assumptions states that the stochastic error terms are
independently and identically normally distributed with mean zero and constant variance. That
is to say
$$\varepsilon_i \sim N(0, \sigma^2)$$
The dependent variable is also normally distributed, with variance $\sigma^2$ and mean
$\beta_0 + \beta_1 X_i$. That is
$$Y_i \sim N(\beta_0 + \beta_1 X_i, \sigma^2)$$
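5.5.2 Sampling Distribution for $\hat\beta_0$ and $\hat\beta_1$
Because $\hat\beta_0$ and $\hat\beta_1$ are computed from a random sample, these estimators are
themselves random variables, and hence they are also characterised by a probability distribution.
If the underlying assumptions concerning the probability distribution of $\varepsilon_i$ hold true,
then it can be shown that $\hat\beta_0$ is normally distributed with mean $\beta_0$ and variance
$\sigma^2 \sum X_i^2 / (n S_{xx})$. That is, symbolically
$$\hat\beta_0 \sim N\left(\beta_0, \frac{\sigma^2 \sum X_i^2}{n S_{xx}}\right)$$
Similarly, it can be shown that $\hat\beta_1$ is normally distributed with mean $\beta_1$ and
variance $\sigma^2 / S_{xx}$. That is, symbolically
$$\hat\beta_1 \sim N\left(\beta_1, \frac{\sigma^2}{S_{xx}}\right)$$
Standardizing these estimators gives
$$Z = \frac{\hat\beta_0 - \beta_0}{se(\hat\beta_0)} \sim N(0,1) \quad \text{or} \quad
Z = \frac{\hat\beta_1 - \beta_1}{se(\hat\beta_1)} \sim N(0,1)$$
However, in most cases the homoscedastic variance $\sigma^2$ is unknown, hence it is normally
estimated by the following formula:
$$\hat\sigma^2 = \frac{\sum e_i^2}{n-2} = \frac{RSS}{n-2} = \frac{1}{n-2}\left(S_{yy} - \hat\beta_1 S_{xy}\right)$$
However, if we replace $\sigma^2$ by its estimator $\hat\sigma^2$, these estimators are no longer
normally distributed; instead they follow the t-distribution with $(n-2)$ d.f. such that
$$t = \frac{\hat\beta_0 - \beta_0}{se(\hat\beta_0)} \sim t_{(n-2)} \quad \text{or} \quad
t = \frac{\hat\beta_1 - \beta_1}{se(\hat\beta_1)} \sim t_{(n-2)}$$
Where $se(\hat\beta_0) = \sqrt{\dfrac{\hat\sigma^2 \sum X_i^2}{n S_{xx}}}$ and
$se(\hat\beta_1) = \sqrt{\dfrac{\hat\sigma^2}{S_{xx}}}$; the 2 represents the number of variables
used in the regression model (in the case of the simple regression model there are two variables).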
So far we have seen the sampling distribution of the sample regression coefficients. In this
section we learn how to construct a $(1 - \alpha)100\%$ confidence interval for the corresponding
population regression coefficients. Under estimation, as we have seen before, we employ sample
statistics to estimate unknown population parameters. Therefore the sample regression coefficients
are used to estimate the corresponding population regression coefficients. Hence the
confidence interval for $\beta_0$ is given by:
$$\beta_0 = \hat\beta_0 \pm t_{\alpha/2,(n-2)}\, se(\hat\beta_0)$$
And for $\beta_1$ it is given by
$$\beta_1 = \hat\beta_1 \pm t_{\alpha/2,(n-2)}\, se(\hat\beta_1)$$
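Example 5.3
Use the results from Example 5.2 to construct 95% confidence interval estimates for $\beta_0$ and
$\beta_1$
Solution:
Recall from Example 5.2: $S_{xx} = 250$, $S_{yy} = 1417.2$, $S_{xy} = 585$, $\hat\beta_0 = 8.5$,
$\hat\beta_1 = 2.34$
From:
$$\beta_0 = \hat\beta_0 \pm t_{\alpha/2,(n-2)}\, se(\hat\beta_0)$$
Where,
$$se(\hat\beta_0) = \sqrt{\frac{\hat\sigma^2 \sum X_i^2}{n S_{xx}}}$$
$$\hat\sigma^2 = \frac{\sum e_i^2}{n-2} = \frac{1}{n-2}\left(S_{yy} - \hat\beta_1 S_{xy}\right)
= \frac{1}{3}(1417.2 - 2.34 \times 585) = 16.1$$
Therefore:
$$se(\hat\beta_0) = \sqrt{\frac{16.1 \times 1375}{5 \times 250}} = 4.21$$
$$t_{\alpha/2,(n-2)} = t_{0.025,3} = 3.182$$
Hence $\beta_0 = 8.5 \pm 3.182 \times 4.21 = 8.5 \pm 13.39$, that is $-4.89 \le \beta_0 \le 21.89$.
Similarly, for $\beta_1$:
$$se(\hat\beta_1) = \sqrt{\frac{\hat\sigma^2}{S_{xx}}} = \sqrt{\frac{16.1}{250}} = 0.254$$
Hence $\beta_1 = 2.34 \pm 3.182 \times 0.254 = 2.34 \pm 0.81$, that is $1.53 \le \beta_1 \le 3.15$.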
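The interval estimates of Example 5.3 can be reproduced with a short script. This is a minimal
sketch: the variable names are illustrative assumptions, and the critical value
$t_{0.025,3} = 3.182$ is typed in from the t table as in the text rather than computed.

```python
import math

# Quantities carried over from Example 5.2 (n = 5 observations).
n, sum_x_sq = 5, 1375
s_xx, s_xy, s_yy = 250.0, 585.0, 1417.2
b0, b1 = 8.5, 2.34

# Critical value t_{0.025, 3}, read from the t table (not computed here).
t_crit = 3.182

# Estimated error variance and standard errors of the two coefficients.
sigma2_hat = (s_yy - b1 * s_xy) / (n - 2)
se_b0 = math.sqrt(sigma2_hat * sum_x_sq / (n * s_xx))
se_b1 = math.sqrt(sigma2_hat / s_xx)

# 95% confidence intervals: estimate +/- t * standard error.
ci_b0 = (b0 - t_crit * se_b0, b0 + t_crit * se_b0)
ci_b1 = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)

print(f"sigma^2 estimate = {sigma2_hat:.2f}")
print(f"se(b0) = {se_b0:.3f}, 95% CI = ({ci_b0[0]:.2f}, {ci_b0[1]:.2f})")
print(f"se(b1) = {se_b1:.3f}, 95% CI = ({ci_b1[0]:.2f}, {ci_b1[1]:.2f})")
```

Small differences from software output in the third decimal place are expected, because the table
value 3.182 is itself rounded.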
4) The last step will be to make both statistical and managerial decision
Example 5.4
Use the information given in Example 5.2 to test the significance of each regression coefficient
Solution
Testing for significance of $\beta_0$
Procedures
Test $H_0: \beta_0 = 0$ against $H_1: \beta_0 \ne 0$ at $\alpha = 5\%$
The appropriate test statistic is:
$$t = \frac{\hat\beta_0 - \beta_0}{se(\hat\beta_0)} = \frac{8.5 - 0}{4.21} = \frac{8.5}{4.21} = 2.019$$
Critical value $= \pm t_{\alpha/2,(n-2)} = \pm t_{0.025,3} = \pm 3.182$
Since the value of the test statistic (TS) falls within the non-rejection region (NRR), the null
hypothesis should not be rejected at the 5% level. Therefore the intercept is not significant.
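The same procedure applies to both coefficients. The sketch below wraps the two-sided t-test in a
small helper; the function name `t_test` is an assumption made here, the standard errors are the
values 4.2083 and 0.2538 obtained for this data set, and the critical value 3.182 is again taken
from the t table.

```python
def t_test(estimate, std_error, critical, hypothesised=0.0):
    """Two-sided t-test: return the statistic and whether H0 is rejected."""
    t_stat = (estimate - hypothesised) / std_error
    return t_stat, abs(t_stat) > critical


t_crit = 3.182  # t_{0.025, 3} from the t table

# Intercept and slope of the advertising example with their standard errors.
t_b0, reject_b0 = t_test(8.5, 4.2083, t_crit)
t_b1, reject_b1 = t_test(2.34, 0.2538, t_crit)

print(f"intercept: t = {t_b0:.3f}, reject H0: {reject_b0}")
print(f"slope:     t = {t_b1:.3f}, reject H0: {reject_b1}")
```

With these inputs the intercept is not significant at the 5% level while the slope is, which
agrees with the t Stat column reported by Excel for the same data.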
The population correlation coefficient ($\rho$) is normally based on the probability distribution
and in most cases its value will be unknown, and hence it will be estimated or tested by using its
sample counterpart $r$.
$$r = \frac{\text{Sample Cov}(X, Y)}{\sqrt{\text{Var}(X)\,\text{Var}(Y)}} = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$$
As defined earlier:
$$S_{xx} = \sum X_i^2 - \frac{1}{n}\left(\sum X_i\right)^2$$
$$S_{xy} = \sum X_i Y_i - \frac{1}{n}\sum X_i \sum Y_i$$
$$S_{yy} = \sum Y_i^2 - \frac{1}{n}\left(\sum Y_i\right)^2$$
$$-1 \le r \le 1$$
If the correlation coefficient is +1, the two variables are perfectly positively correlated,
whereas if the correlation coefficient is −1, they are perfectly negatively correlated. If it is 0,
there is no linear relationship at all. If $0.8 \le |r| < 1$ the linear relationship is very
strong, and it is just strong when $0.6 \le |r| < 0.8$. Furthermore, if $|r| \le 0.3$ the linear
relationship is weak.
5.9.1 Properties of the coefficient of determination
1) Its value is never negative
2) Its value ranges between zero and one, i.e. $0 \le R^2 \le 1$. An $R^2$ of 1 means that the
entire variation in Y is explained by the regression. An $R^2$ of zero means that there is no
linear relationship between Y and X
Example 5.4
Use the information given in Example 5.2 to compute both $r$ and $R^2$ and interpret the results:
Solution
From
$$r = \frac{\text{Sample Cov}(X, Y)}{\sqrt{\text{Var}(X)\,\text{Var}(Y)}} = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}
= \frac{585}{\sqrt{250 \times 1417.2}} = 0.9828$$
Interpretation: The value indicates that there is a very strong positive relationship between the
two variables
$$R^2 = r^2 = (0.9828)^2 \approx 96.6\%$$
Interpretation: The value indicates that 96.6% of the variation in the dependent variable ($Y$) is
explained by the independent variable ($X$); the remaining 3.4% is explained by other factors
not currently included in the regression model.
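As a quick numerical check of the correlation coefficient and the coefficient of determination, the
following sketch recomputes both from the corrected sums of squares of the advertising data. The
variable names are illustrative assumptions.

```python
import math

# Corrected sums of squares and cross products for the advertising data.
s_xx, s_xy, s_yy = 250.0, 585.0, 1417.2

r = s_xy / math.sqrt(s_xx * s_yy)  # sample correlation coefficient
r_squared = r ** 2                 # coefficient of determination

print(f"r   = {r:.4f}")
print(f"r^2 = {r_squared:.4f}")
print(f"unexplained share of variation = {1 - r_squared:.1%}")
```

The unexplained share is simply $1 - r^2$, so it is worth computing it directly rather than
rounding the explained share first.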
$$t = r\sqrt{\frac{n-2}{1-r^2}}$$
$H_1: \rho < 0$ then the critical value will be $-t_{\alpha,(n-2)}$
4) The last step will be to make both statistical and managerial decision
Where $n$ is the total number of observations and $r$ is the sample correlation coefficient
Example 5.5
Use the results obtained in Example 5.4 to test $H_0: \rho = 0$ against $H_1: \rho \ne 0$ at the 5%
level of significance
Solution
Procedures:
i. Test $H_0: \rho = 0$ against $H_1: \rho \ne 0$ at the 5% level
ii. The appropriate test statistic is:
$$t = r\sqrt{\frac{n-2}{1-r^2}} = 0.9828\sqrt{\frac{3}{1 - 0.966}} = 9.23$$
iii. Critical value $= \pm t_{\alpha/2,(n-2)} = \pm t_{0.025,3} = \pm 3.182$
iv. Conclusion: Since the value of the test statistic falls within the rejection region (RR), the
null hypothesis should be rejected at the 5% level. Therefore there exists a linear relationship
between the two variables.
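The test statistic for the correlation coefficient can be evaluated directly from $r$ and $n$. The
sketch below keeps $r$ unrounded, so the result can differ slightly from a hand calculation that
uses rounded inputs; the variable names are assumptions, and the critical value comes from the
t table.

```python
import math

n = 5
r = 585 / math.sqrt(250 * 1417.2)  # sample correlation for the advertising data

# t statistic for H0: rho = 0, with n - 2 degrees of freedom.
t_stat = r * math.sqrt((n - 2) / (1 - r ** 2))

t_crit = 3.182  # t_{0.025, 3} from the t table
reject_h0 = abs(t_stat) > t_crit

print(f"t = {t_stat:.3f}, critical value = +/-{t_crit}, reject H0: {reject_h0}")
```

In simple linear regression this statistic coincides with the t statistic for the slope
coefficient, which is a convenient way to cross-check either calculation.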
1) The first step is to state the null hypothesis ($H_0$) against one of the three alternative
hypotheses ($H_1$). That is
$H_0: \beta_1 = 0$ against either $H_1: \beta_1 \ne 0$ or $H_1: \beta_1 > 0$ or $H_1: \beta_1 < 0$
2) The appropriate test statistic, as we have pointed out earlier, will be F such that
$$F = \frac{R^2 (n-2)}{1 - R^2}$$
3) The critical value to be compared with the above test statistic will always be
$F_{\alpha,(1,\, n-2)}$
Example 5.6
Use the results obtained in Example 5.4 to test for the significance of the model at the 5% level of
significance:
Solution
Procedures:
Test $H_0: \beta_1 = 0$ against $H_1: \beta_1 \ne 0$ at the 5% level
The appropriate test statistic is:
$$F = \frac{R^2 (n-2)}{1 - R^2} = \frac{0.966 \times 3}{1 - 0.966} = 85.24$$
Critical value $= F_{\alpha,(1,\, n-2)} = F_{0.05,(1,3)} = 10.1$
Since the value of the test statistic falls within the RR, the null hypothesis should be rejected at
the 5% level; hence the regression model is significant.
5.12 Analysis of Variance (ANOVA)
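Analysis of variance can also be employed for testing the significance of the linear regression
model. This is done by first computing a number of statistics and then creating an ANOVA table
whose general structure is as indicated below.
SV          DF      SS    MS               F-value
Estimated   k − 1   ESS   ESS / (k − 1)    MS(Estimated) / MS(Residual)
Residual    n − k   RSS   RSS / (n − k)
Total       n − 1   TSS
Where
SV = Source of variation
DF = Degree of freedom
SS = Sum of squares
MS = Mean squares
k = number of variables ($k = 2$ for simple linear regression)
The above statistics are normally obtained from the following identities
$$TSS = ESS + RSS$$
Then,
$$1 = \frac{ESS}{TSS} + \frac{RSS}{TSS}$$
But,
$$R^2 = \frac{ESS}{TSS}$$
This implies that
$$1 = R^2 + \frac{RSS}{TSS}$$
Where $TSS = S_{yy} = \sum Y_i^2 - \frac{1}{n}\left(\sum Y_i\right)^2$ and
$RSS = \sum e_i^2 = \sum (Y_i - \hat{Y}_i)^2 = S_{yy} - \hat\beta_1 S_{xy}$
The other statistics are computed as indicated below:
$$MS(\text{Estimated}) = \frac{ESS}{k-1}, \qquad MS(\text{Residual}) = \frac{RSS}{n-k}, \qquad
F = \frac{MS(\text{Estimated})}{MS(\text{Residual})}$$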
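Example 5.7
Use the results from Example 5.2 to fill in the following ANOVA table, and use it to test for the
significance of the regression model.
SV          DF   SS   MS   F-value
Estimated   A    D    G    I
Residual    B    E    H
Total       C    F
Solution:
$$A = k - 1 = 2 - 1 = 1$$
$$B = n - k = 5 - 2 = 3$$
$$C = n - 1 = 5 - 1 = 4$$
$$F = TSS = S_{yy} = 1417.2$$
$$E = RSS = S_{yy} - \hat\beta_1 S_{xy} = 1417.2 - 2.34 \times 585 = 48.3$$
$$D = ESS = TSS - RSS = 1417.2 - 48.3 = 1368.9$$
$$G = MS(\text{Estimated}) = \frac{ESS}{k-1} = \frac{1368.9}{1} = 1368.9$$
$$H = MS(\text{Residual}) = \frac{RSS}{n-k} = \frac{48.3}{3} = 16.1$$
$$I = F\text{-value} = \frac{G}{H} = \frac{1368.9}{16.1} = 85.02$$
Therefore the complete ANOVA table is as shown below:
SV          DF   SS       MS       F-value
Estimated   1    1368.9   1368.9   85.02
Residual    3    48.3     16.1
Total       4    1417.2
Since the F-value of 85.02 exceeds the critical value $F_{0.05,(1,3)} = 10.1$, the null hypothesis
is rejected at the 5% level and the regression model is significant.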
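The entries of the ANOVA table can also be generated programmatically. The following sketch builds
the table for the advertising data from $S_{yy}$, $S_{xy}$ and the slope estimate; the variable
names and the helper `fmt` are assumptions made for this illustration.

```python
n, k = 5, 2                       # observations and number of variables
s_yy, s_xy, b1 = 1417.2, 585.0, 2.34

tss = s_yy                        # total sum of squares
rss = s_yy - b1 * s_xy            # residual sum of squares
ess = tss - rss                   # estimated (explained) sum of squares

ms_est = ess / (k - 1)            # mean square, estimated
ms_res = rss / (n - k)            # mean square, residual
f_value = ms_est / ms_res

rows = [
    ("Estimated", k - 1, ess, ms_est, f_value),
    ("Residual", n - k, rss, ms_res, None),
    ("Total", n - 1, tss, None, None),
]


def fmt(value):
    """Format a number with two decimals, or leave the cell empty."""
    return "" if value is None else f"{value:.2f}"


print(f"{'SV':<10}{'DF':>4}{'SS':>10}{'MS':>10}{'F-value':>10}")
for name, df, ss, ms, f in rows:
    print(f"{name:<10}{df:>4}{ss:>10.1f}{fmt(ms):>10}{fmt(f):>10}")
```

Computing ESS as the difference TSS − RSS mirrors the order used in the worked solution and
guarantees that the two components add up to the total.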
Example 5.8
The following pairs of values $(X, Y)$ represent the amount of sales ($Y$) of a certain product, in
millions of Tshs, and the amount spent on advertising the same product ($X$)
X 5 10 15 20 25
Y 22 32 38 59 67
Use Excel to find the regression output and interpret all the statistics obtained.
Note: The computer output for the regression analysis of the relationship between sales ($Y$) of a
certain product and the amount spent on advertising the same product ($X$) for 5 days is as
indicated below:
SUMMARY OUTPUT
Regression Statistics
Multiple R           0.982811637
R Square             0.965918713
Adjusted R Square    0.954558284
Standard Error       4.01248053
Observations         5
ANOVA
SV           Df   SS       MS       F             Significance F
Regression   1    1368.9   1368.9   85.02484472   0.002698129
Residual     3    48.3     16.1
Total        4    1417.2
COEFFICIENT
               Coefficients   Standard Error   t Stat        P-value       Lower 95%     Upper 95%
Intercept      8.5            4.208325083      2.01980594    0.136681808   -4.89276860   21.89276861
X Variable 1   2.34           0.253771551      9.220891753   0.002698129   1.532385666   3.147614334
Interpretation of results
Solution
The output shown above mainly consists of three tables, namely the summary output, the ANOVA table
and the coefficients table. We will interpret only some of the information from these three tables.
Multiple $R = 0.9828$ indicates a very strong positive relationship between the dependent variable
and the independent variable.
$R^2 = 0.966$ shows that 96.6% of the variation in the amount of sales ($Y$) is explained by the
amount spent on advertisement ($X$).
The ANOVA table contains various statistics: d.f., SS, MS, the F-statistic and the significance F.
The significance F is used to draw a conclusion about the significance of the model. From an
econometric point of view, the model is said to be significant if the significance F value is less
than or equal to 10% (Significance F $\le$ 10%).
The coefficients table contains several pieces of information as well. In summary, the coefficients
column gives the values of the regression coefficients ($\hat\beta_0$ and $\hat\beta_1$), and the
standard errors of the regression coefficients are shown next. The t-statistics for testing the
significance of individual parameters are given in the column labelled t Stat.
$\hat\beta_0 = 8.5$ indicates the average value of sales ($Y$) when the amount spent on
advertisement ($X$) is zero.
$\hat\beta_1 = 2.34$ indicates the change in the mean value of sales ($Y$) when the amount spent on
advertisement ($X$) increases by 1 unit.
The P-value, like the significance F, is used to draw a conclusion about the significance of
individual parameters. From an econometric point of view, a parameter is said to be significant if
its P-value is less than or equal to 10%. In our case $\beta_0$ is not significant
(P-value = 0.1367) while $\beta_1$ is significant (P-value = 0.0027).
The confidence interval estimates are also given in the last columns. In this example the 95%
confidence interval estimate for $\beta_0$ is (-4.8927; 21.8927) while for $\beta_1$ it is
(1.5323; 3.1476). The estimated regression model is as shown below
$$\hat{Y} = 8.5 + 2.34X$$
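As usual, the population regression coefficients cannot be computed easily; therefore we normally
use their sample counterparts to estimate their values. The sample regression function
in its deterministic form is as indicated below:
$$\hat{Y}_i = \hat\beta_0 + \hat\beta_1 X_{1i} + \hat\beta_2 X_{2i} + \cdots + \hat\beta_{k-1} X_{(k-1)i}$$
and in its stochastic form:
$$Y_i = \hat\beta_0 + \hat\beta_1 X_{1i} + \hat\beta_2 X_{2i} + \cdots + \hat\beta_{k-1} X_{(k-1)i} + e_i$$
Where
$\hat{Y}_i$ = the estimator of $E(Y \mid X_{1i}, \cdots, X_{(k-1)i})$
$\hat\beta_0$ = the estimator of $\beta_0$
$\hat\beta_1$ = the estimator of $\beta_1$
$\hat\beta_2$ = the estimator of $\beta_2$
$\hat\beta_{k-1}$ = the estimator of $\beta_{k-1}$
$e_i$ = the estimator of $\varepsilon_i$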
5.13.1 Assumptions of Multiple Linear Regression Models
1. The dependent variable is linearly related to the coefficients of the regression model and the
model is correctly specified. That is, the regression coefficients should be raised to the power
of 1 only (i.e. neither $\beta_j^p$ for $p = 2, 3, \cdots$ nor products such as
$\beta_i \beta_j$ are allowed)
2. The independent variable(s) is/are uncorrelated with the random error term. That is,
mathematically
$$\text{Cov}(X_i, \varepsilon_i) = 0$$
3. The error term has a zero mean value. That is
$$E(\varepsilon_i) = 0$$
4. The error term has a constant variance (homoscedastic error), i.e. there is no
heteroscedasticity. That is
$$\text{Var}(\varepsilon_i) = \sigma^2$$
5. The error terms are uncorrelated with each other. That is, there is no autocorrelation or serial
correlation. Algebraically this assumption can be written as
$$\text{Cov}(\varepsilon_i, \varepsilon_j) = 0, \qquad i \ne j$$
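6. The independent variables are not correlated with each other (no multicollinearity). That is
$$\text{Cov}(X_i, X_j) = 0, \qquad i \ne j$$
7. The error term is normally distributed. That is, algebraically
$$\varepsilon_i \sim N(0, \sigma^2)$$
8. The independent variables are non-stochastic, that is, their values are fixed in repeated trials
Note that the above distributional assumptions made for multiple regression analysis allow us
to make inferences on the remaining model parameters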
$$\sum Y_i = n\hat\beta_0 + \hat\beta_1 \sum X_{1i} + \hat\beta_2 \sum X_{2i}$$
$$\sum X_{1i} Y_i = \hat\beta_0 \sum X_{1i} + \hat\beta_1 \sum X_{1i}^2 + \hat\beta_2 \sum X_{1i} X_{2i}$$
$$\sum X_{2i} Y_i = \hat\beta_0 \sum X_{2i} + \hat\beta_1 \sum X_{1i} X_{2i} + \hat\beta_2 \sum X_{2i}^2$$
Generally, for the case of three variables and three equations as indicated above, we can use simple
algebraic manipulation to solve the system of linear equations. However, for more than three
variables the algebra becomes complex, so we normally use matrices or vectors
to solve such systems of linear equations. Because of this complexity, the OLS estimators
and other statistics are usually obtained using computer software such as EXCEL, STATA, SAS, EVIEWS,
R, SPSS and so on.
$$\beta_j = \hat\beta_j \pm t_{\alpha/2,(n-k)}\, se(\hat\beta_j)$$
Where $n$ denotes the total number of observations and $k$ represents the number of variables
employed in a specific multiple regression model.
5.16 Hypothesis testing in Multiple Linear Regression Analysis
The following section discusses hypothesis testing for regression coefficients in multiple linear
regression analysis. As in the case of simple linear regression, these tests can only be carried out
if it can be assumed that the random error terms, $\varepsilon_i$, are normally and independently
distributed with a mean of zero and a variance of $\sigma^2$. Three types of hypothesis tests can be
carried out for multiple linear regression models:
1) The first step will be to state the null hypothesis against one of the three alternative
hypotheses. That is
$H_0: \beta_j = \beta_j^*$ against either $H_1: \beta_j \ne \beta_j^*$ or $H_1: \beta_j > \beta_j^*$
or $H_1: \beta_j < \beta_j^*$ for $j = 1, 2$
2) The appropriate test statistic will be such that
$$t = \frac{\hat\beta_j - \beta_j^*}{se(\hat\beta_j)} \sim t_{(n-k)}$$
3) The critical value will depend on the nature of the alternative hypothesis. That is, if
$H_1: \beta_j < \beta_j^*$ then the critical value will be $-t_{\alpha,(n-k)}$
4) The last step will be to make both statistical and managerial decision
Where $n$ is the total number of observations and $k$ represents the total number of variables used
in the regression model. Note that the decision can also be made using the P-value approach; these
values can be computed from the corresponding test statistics or read directly from the
computer output. Generally, if the P-value is less than the given level of significance $\alpha$,
then the null hypothesis is rejected; otherwise we fail to reject the null hypothesis.
1 against one of the alternative hypotheses. In this example we say there are ($m = 2$) restrictions
under the null hypothesis.
In order to carry out this testing procedure, two different regression analyses should be done,
one before the restrictions and the other after the parameter restrictions. From each of these
analyses the residual sum of squares (RSS) is recorded; therefore we get the residual sum of squares
for the unrestricted model, labelled ($RSS_U$), and the one for the restricted model, labelled
($RSS_R$). The appropriate test statistic for this procedure is always F such that
$$F = \frac{RSS_R - RSS_U}{RSS_U} \times \frac{n - k}{m}$$
Where $k$ denotes the number of variables or number of estimated parameters in the original
unrestricted regression model and $m$ represents the number of restrictions under the null
hypothesis
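Example 5.9
Consider the following regression model:
$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \beta_4 X_{4i} + \varepsilon_i,$$
Suppose one wishes to test the null hypothesis $H_0: \beta_3 = \beta_4 = 0$. If this hypothesis is
true then we expect the restricted model to be:
$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i$$
To test the null hypothesis at the 5% level of significance, two regressions were carried out using
time series data on 25 observations, and the resulting residual sums of squares were
$RSS_U = 6722.04$ and $RSS_R = 7088.58$
Solution
Data given: $n = 25$, $k = 5$, $m = 2$, $RSS_U = 6722.04$, $RSS_R = 7088.58$
Procedures
Test $H_0: \beta_3 = \beta_4 = 0$ against $H_1$: at least one of $\beta_3$, $\beta_4$ is not zero
The appropriate test statistic is F:
$$F = \frac{RSS_R - RSS_U}{RSS_U} \times \frac{n - k}{m}
= \frac{7088.58 - 6722.04}{6722.04} \times \frac{25 - 5}{2}
= \frac{366.54}{6722.04} \times \frac{20}{2} = 0.545$$
Critical value $= F_{\alpha,(m,\, n-k)} = F_{0.05,(2,20)} = 3.49$
Conclusion: Since the value of the test statistic falls within the non-rejection region (NRR), the
null hypothesis should not be rejected; therefore the given restriction is valid (i.e. the variables
$X_3$ and $X_4$ make no contribution to the original regression model).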
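The restricted versus unrestricted comparison reduces to a one-line formula once both residual sums
of squares are known. The sketch below evaluates it for the figures of Example 5.9; the function
name `restricted_f` and the argument names are assumptions made here, and the critical value 3.49
is the table value quoted in the example.

```python
def restricted_f(rss_restricted, rss_unrestricted, n, k, m):
    """F statistic for testing m linear restrictions.

    n: number of observations
    k: number of estimated parameters in the unrestricted model
    m: number of restrictions under the null hypothesis
    """
    numerator = (rss_restricted - rss_unrestricted) / m
    denominator = rss_unrestricted / (n - k)
    return numerator / denominator


f_stat = restricted_f(7088.58, 6722.04, n=25, k=5, m=2)
f_crit = 3.49  # F_{0.05, (2, 20)} from the F table
reject_h0 = f_stat > f_crit

print(f"F = {f_stat:.3f}, critical value = {f_crit}, reject H0: {reject_h0}")
```

The denominator degrees of freedom are $n - k = 25 - 5 = 20$, which is what ties the statistic to
the critical value $F_{0.05,(2,20)}$.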
$$H_0: \beta_1 = \beta_2 = \cdots = \beta_{k-1} = 0$$
The null hypothesis above is a joint hypothesis that $\beta_1$, $\beta_2$ up to $\beta_{k-1}$ are
jointly or simultaneously equal to zero. In other words, this hypothesis states that all the
explanatory variables together have no influence on the response variable Y. To put it differently,
it means that the explanatory variables explain zero percent of the variation in the dependent
variable. This is the same as saying that
$$H_0: R^2 = 0$$
Therefore the two sets of hypotheses are equivalent; one implies the other. The appropriate test
statistic is
$$F = \frac{R^2}{1 - R^2} \times \frac{n - k}{k - 1}$$
Where $R^2$ is the coefficient of multiple determination. The F value is also obtained from the
ANOVA table.
The above test statistic is compared with the critical F value deduced as indicated below:
$$F_{\alpha,(k-1,\, n-k)}$$
The decision can also be made using the significance F obtained from the computer output. If the
significance F is less than the given level of significance $\alpha$, then the null hypothesis is
rejected; otherwise it is not rejected.
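The overall F statistic depends only on $R^2$, the number of observations $n$ and the number of
variables $k$. As a hedged illustration, the sketch below applies the formula to the simple
regression output reported earlier for the advertising data ($R^2 = 0.965918713$, $n = 5$,
$k = 2$); the function name `overall_f` is an assumption.

```python
def overall_f(r_squared, n, k):
    """F statistic for the overall significance of a regression model.

    n: number of observations
    k: number of variables (estimated parameters, including the intercept)
    """
    return (r_squared / (k - 1)) / ((1 - r_squared) / (n - k))


# R Square, n and k taken from the Excel output for the advertising data.
f_stat = overall_f(0.965918713, n=5, k=2)
print(f"F = {f_stat:.4f}")
```

For $k = 2$ the expression collapses to $R^2 (n - 2) / (1 - R^2)$, the form used for simple linear
regression, so the result should agree with the F value in the ANOVA table of that output.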
Example 5.10
The data on sales ($Y$), in thousands of dollars, and two independent variables, customers' annual
income ($X_1$), in thousands of dollars, and the average annual local currency inflation rate
($X_2$), are shown below:
Observation   $Y$   $X_1$   $X_2$
1             25    4       0.3
2             22    3       0.4
3             26    5       0.3
4             25    4.5     0.25
5             30    5       0.2
6             33    7       0.2
7             30    6.5     0.3
8             37    8       0.06
9             40    10      0.03
10            42    12      0.02
a) Use Excel to deduce the output
b) Fit the multiple linear regression model and interpret the regression coefficients
c) Comment on the value of $R$ and $R^2$
d) Construct a 95% confidence interval for the partial coefficient of inflation rate
e) Construct a 95% confidence interval for the partial coefficient of customer annual income
f) Test at the 5% level the contribution of the average inflation rate in the model
g) Test for the significance of the model
Solution
a) Hint: The computer output is as indicated below:
SUMMARY OUTPUT
Regression Statistics
Multiple R      0.987663
R Square        0.975477
Observations    10
ANOVA
        DF   SS    MS   F   Significance F
Total   9    422
COEFFICIENT
        Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
b) The fitted model has the form
$$\hat{Y} = \hat\beta_0 + \hat\beta_1 X_1 + \hat\beta_2 X_2$$
$$\hat\beta_0 = 25.730$$
$$\hat\beta_1 = 1.478$$
$$\hat\beta_2 = -21.059$$
Therefore:
$$\hat{Y} = 25.730 + 1.478 X_1 - 21.059 X_2$$
Interpretation of coefficients
$\hat\beta_0 = 25.730$ is the average value of sales when income and inflation are both equal to
zero
$\hat\beta_1 = 1.478$ indicates that the average value of sales increases by 1.478 for a unit
increase in income when inflation is held constant.
$\hat\beta_2 = -21.059$ indicates that the average value of sales decreases by 21.059 for a unit
increase in inflation when income is held constant.
c) The value of the multiple correlation coefficient $R = 0.9877$ shows a very strong positive
relationship between the dependent variable and the independent variables. The coefficient of
multiple determination $R^2 = 97.6\%$ indicates that 97.6% of the variation in sales (the dependent
variable) is explained by income and inflation (the independent variables). The remaining
2.4% is explained by other variables not currently included in the regression model
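d) The 95% confidence interval for the partial coefficient of inflation is given by:
$$\beta_2 = \hat\beta_2 \pm t_{\alpha/2,(n-k)}\, se(\hat\beta_2)$$
$$= -21.059 \pm t_{0.025,(10-3)} \times 7.20776$$
$$= -21.059 \pm t_{0.025,7} \times 7.20776$$
$$= -21.059 \pm 2.365 \times 7.20776$$
$$= -21.059 \pm 17.046$$
$$= (-38.105\ ;\ -4.013)$$
e) The 95% confidence interval for the partial coefficient of customer annual income is given by:
$$\beta_1 = \hat\beta_1 \pm t_{\alpha/2,(n-k)}\, se(\hat\beta_1)$$
$$= 1.478 \pm t_{0.025,(10-3)} \times 0.332$$
$$= 1.478 \pm t_{0.025,7} \times 0.332$$
$$= 1.478 \pm 2.365 \times 0.332$$
$$= 1.478 \pm 0.78518$$
$$= (0.69282\ ;\ 2.26318)$$
Note that if the computer output gives the $(1 - \alpha)100\%$ confidence interval, there is no
need to compute the confidence interval manually; you can simply read it from the results.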
f) Procedures
Test $H_0: \beta_2 = 0$ against $H_1: \beta_2 \ne 0$
Test statistic from the computer output: $t = -2.922$
By using the P-value approach: since P-value $= 0.022 < 0.05$, reject the null hypothesis;
therefore inflation is statistically significant
g) Procedures
Test $H_0: R^2 = 0$ against $H_1: R^2 \ne 0$
The appropriate test statistic is from the ANOVA table: $F = 139.23$
By using the P-value approach: since Significance F $= 0.0000023 < 0.05$, reject the null
hypothesis; therefore the regression model is significant
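For readers without Excel, the coefficients of Example 5.10 can be reproduced by solving the normal
equations directly. The sketch below is one possible way to do this in plain Python, using Gaussian
elimination with partial pivoting; the function names `solve_linear_system` and `fit_ols` are
assumptions made for this illustration, not part of any library.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    size = len(b)
    # Work on an augmented copy so the inputs are left unchanged.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, size):
            factor = m[r][col] / m[col][col]
            for c in range(col, size + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * size
    for i in range(size - 1, -1, -1):
        known = sum(m[i][j] * x[j] for j in range(i + 1, size))
        x[i] = (m[i][size] - known) / m[i][i]
    return x


def fit_ols(columns, y):
    """Fit Y on the given regressors; return (coefficients, R^2, F)."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])  # number of estimated parameters
    # Normal equations: (X'X) beta = X'y.
    xtx = [[sum(row[p] * row[q] for row in design) for q in range(k)]
           for p in range(k)]
    xty = [sum(row[p] * yi for row, yi in zip(design, y)) for p in range(k)]
    beta = solve_linear_system(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in design]
    mean_y = sum(y) / n
    tss = sum((yi - mean_y) ** 2 for yi in y)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    r_squared = 1 - rss / tss
    f_stat = (r_squared / (k - 1)) / ((1 - r_squared) / (n - k))
    return beta, r_squared, f_stat


# Data of Example 5.10: sales, income and inflation rate.
sales = [25, 22, 26, 25, 30, 33, 30, 37, 40, 42]
income = [4, 3, 5, 4.5, 5, 7, 6.5, 8, 10, 12]
inflation = [0.3, 0.4, 0.3, 0.25, 0.2, 0.2, 0.3, 0.06, 0.03, 0.02]

beta, r_squared, f_stat = fit_ols([income, inflation], sales)
print("coefficients:", [round(b, 3) for b in beta])
print(f"R^2 = {r_squared:.4f}, F = {f_stat:.2f}")
```

Solving the normal equations in this way is adequate for a small, well-conditioned example; for
larger problems, statistical software typically relies on more stable matrix decompositions.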
Review Questions
QUESTION ONE
To investigate linear relationship between a certain variable Y and an explanatory variable X,
sample paired data on 100 observations were collected, and the following summary data were
obtained.
QUESTION TWO
The following is incomplete computer output to analyze the following multiple linear regression
model
$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \varepsilon_i$$
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.7810
R Square 0.6100
Observations 25
ANOVA
df SS MS F-value
Regression A D 7302.58 G
Residual B E F
Total C 35916
QUESTION THREE
The profitability of commercial banks has been claimed to be influenced by some external factors
such as GDP growth rate (GDP), inflation (INFL) and interest rate (INTR). The data for these
financial variables are as indicated below, where return on assets (ROA) is a proxy variable for
profitability.
ROA GDP INFL INTR
-17.57% 7.00% 5.20% 10.80%
-6.05% 8.50% 7.00% 9.00%
-6.05% 5.60% 10.30% 4.70%
-25.57% 5.40% 12.10% 2.90%
-39.07% 6.40% 7.40% 7.20%
-16.19% 7.90% 12.60% 2.30%
-17.25% 5.10% 16.10% -0.60%
-17.24% 7.30% 7.90% 7.90%
-27.35% 7.00% 6.10% 10.20%
-29.23% 7.00% 5.60% 10.50%
-14.98% 7.00% 5.20% 10.80%
2.29% 8.50% 7.00% 9.00%
-0.33% 5.60% 10.30% 4.70%
2.88% 5.40% 12.10% 2.90%
-1.54% 6.40% 7.40% 7.20%
2.51% 7.90% 12.60% 2.30%
-4.39% 5.10% 16.10% -0.60%
-0.82% 7.30% 7.90% 7.90%
-28.41% 7.00% 6.10% 10.20%
-18.01% 7.00% 5.60% 10.50%
-38.19% 7.00% 5.20% 10.80%
a) Use Excel to run regression analysis
b) Obtain the regression model of financial performance as explained by the given
independent variables.
c) Interpret the meaning of each regression coefficients
d) Comment on the value of multiple R and R square
e) Construct 95% confidence interval for each population regression coefficient
f) Test for the significance of each explanatory variable in the model by using both critical
value approach and P-value approach
g) Test for significance of the model
QUESTION FOUR
The table below provides some information on pairs of values (x, y):
x 8 12 15 20 25
y 25 35 40 59 67
a) Use Excel to plot a scatter diagram and estimate the line of best fit (label all necessary
information in the graph).
b) Obtain the regression line by using the Least Squares Method.
c) Predict the value of $y$ when $x = 90$.
d) Construct a 95% confidence interval estimate of $\beta_1$ for the data given.
e) Compute the correlation coefficient and comment on the result.
f) Compute the coefficient of determination and comment on the results.
g) Test $H_0: \rho = 0$ against $H_1: \rho \ne 0$ at the 5% level of significance.