2021 Introduction To Basic Statistics-1 PDF
CHAPTER 1
1. INTRODUCTION
Definition and classifications of statistics
Definition:
We can define statistics in two ways.
1. Plural sense.
It is an aggregate or collection of numerical facts.
2. Singular sense (formal definition)
Statistics is defined as the science of collecting, organizing, presenting, analyzing
and interpreting numerical data for the purpose of assisting in making a more
effective decision.
Classifications:
Depending on how the data are used, statistics is divided into two main
areas or branches.
1. Descriptive Statistics: is concerned with summary calculations, graphs, charts
and tables.
2. Inferential Statistics: is a method used to generalize from a sample to a
population. For example, the average income of all families (the population) in
Ethiopia can be estimated from figures obtained from a few hundred (the sample)
families.
Inferential statistics is important because statistical data usually arise from a sample.
It requires statistical techniques based on probability theory.
The goal of measurement systems is to structure the rule for assigning numbers to
objects in such a way that the relationship between the objects is preserved in the
numbers assigned to the objects. The different kinds of relationships preserved are
called properties of the measurement system.
Order
The property of order exists when an object that has more of the attribute than
another object, is given a bigger number by the rule system. This relationship must
hold for all objects in the "real world".
CHAPTER 2
2. METHODS OF DATA PRESENTATION
INTRODUCTION TO METHODS OF DATA COLLECTION
There are two sources of data:
1. Primary Data
Data measured or collected by the investigator or the user directly from
the source.
Two activities involved: planning and measuring.
a) Planning:
Identify source and elements of the data.
Decide whether to consider sample or census.
If sampling is preferred, decide on sample size, selection
method,… etc
Decide measurement procedure.
Set up the necessary organizational structure.
b) Measuring: there are different options.
Focus Group
Telephone Interview
Mail Questionnaires
Door-to-Door Survey
Mall Intercept
New Product Registration
Personal Interview and
Experiments are some of the sources for collecting the
primary data.
2. Secondary Data
Data gathered or compiled from published and unpublished sources or
files.
When the source is secondary data, check that:
The type and objective of the original situation suit the present one.
The purpose for which the data were collected is
compatible with the present problem.
The nature and classification of the data are appropriate to our
problem.
There are no biases or misreporting in the published
data.
Note: Data which are primary for one may be secondary for the other.
80 76 90 85 80
70 60 62 70 85
65 60 63 74 75
76 70 70 80 85
Class limit | Class boundary | Class mark | Tally  | Freq. | Cf (less than type) | Cf (more than type) | rf   | rcf (less than type)
6 – 11      | 5.5 – 11.5     | 8.5        | //     | 2     | 2                   | 20                  | 0.10 | 0.10
12 – 17     | 11.5 – 17.5    | 14.5       | //     | 2     | 4                   | 18                  | 0.10 | 0.20
18 – 23     | 17.5 – 23.5    | 20.5       | ////// | 7     | 11                  | 16                  | 0.35 | 0.55
Pictogram
- In this diagram, we represent data by means of picture symbols. We
choose a suitable picture to represent a definite number of units in which
the variable is measured.
Example:
Bar Charts:
- A set of bars (thick lines or narrow rectangles) representing some magnitude over time or
space.
- They are useful for comparing aggregates over time or space.
- Bars can be drawn either vertically or horizontally.
- There are different types of bar charts. The most common being :
Simple bar chart
Component or sub divided bar chart.
Multiple bar charts.
Simple Bar Chart
Product   Sales in $
A         12   14   18
B         24   21   18
C         24   35   54
[Figure: simple bar chart; x-axis: product (A, B, C); y-axis: sales in $]
-The bars represent total value of a variable with each total broken in to its component parts and
different colours or designs are used for identifications
Example:
Draw a component bar chart to represent the sales by product from 1957 to 1959.
Solutions:
[Figure: component bar chart "Sales by product 1957-1959"; x-axis: year of production (1957, 1958, 1959); y-axis: sales in $; components: Product A, Product B, Product C]
Solutions:
[Figure: bar chart of sales in $ for Product A, Product B and Product C; x-axis: year of production (1957, 1958, 1959); y-axis: sales in $]
[Figure: frequency plot; x-axis: value (2.5, 8.5, 14.5, 20.5, 26.5, 32.5, 38.5, 44.5); y-axis: frequency]
The "i=1" in the bottom of the summation notation tells where to begin the
sequence of summation. If the expression were written with "i=3", the summation
would start with the third number in the set. For example:
In the example set of numbers, this would give the following result:
Properties of Summation
1. $\sum_{i=1}^{n} k = nk$, where k is any constant
2. $\sum_{i=1}^{n} kX_i = k\sum_{i=1}^{n} X_i$, where k is any constant
3. $\sum_{i=1}^{n} (a + bX_i) = na + b\sum_{i=1}^{n} X_i$, where a and b are any constants
4. $\sum_{i=1}^{n} (X_i + Y_i) = \sum_{i=1}^{n} X_i + \sum_{i=1}^{n} Y_i$
5.
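Example: considering the following data, determine
X Y
5 6
7 7
7 8
6 7
8 8
a) $\sum_{i=1}^{5} X_i$
b) $\sum_{i=1}^{5} Y_i$
c) $\sum_{i=1}^{5} 10$
d) $\sum_{i=1}^{5} (X_i + Y_i)$
e) $\sum_{i=1}^{5} (X_i - Y_i)$
f) $\sum_{i=1}^{5} X_i Y_i$
g) $\sum_{i=1}^{5} X_i^2$
h) $(\sum_{i=1}^{5} X_i)(\sum_{i=1}^{5} Y_i)$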
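Solutions:
a) $\sum_{i=1}^{5} X_i = 5 + 7 + 7 + 6 + 8 = 33$
b) $\sum_{i=1}^{5} Y_i = 6 + 7 + 8 + 7 + 8 = 36$
c) $\sum_{i=1}^{5} 10 = 5 \times 10 = 50$
d) $\sum_{i=1}^{5} (X_i + Y_i) = (5 + 6) + (7 + 7) + (7 + 8) + (6 + 7) + (8 + 8) = 69 = 33 + 36$
e) $\sum_{i=1}^{5} (X_i - Y_i) = (5 - 6) + (7 - 7) + (7 - 8) + (6 - 7) + (8 - 8) = -3 = 33 - 36$
f) $\sum_{i=1}^{5} X_i Y_i = 5 \times 6 + 7 \times 7 + 7 \times 8 + 6 \times 7 + 8 \times 8 = 241$
g) $\sum_{i=1}^{5} X_i^2 = 5^2 + 7^2 + 7^2 + 6^2 + 8^2 = 223$
h) $(\sum_{i=1}^{5} X_i)(\sum_{i=1}^{5} Y_i) = 33 \times 36 = 1188$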
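The arithmetic in parts a) to h) can be checked mechanically. The following is a minimal Python sketch using the X and Y columns of the example; the variable names are illustrative only.

```python
# X and Y columns from the example table.
X = [5, 7, 7, 6, 8]
Y = [6, 7, 8, 7, 8]

sum_x = sum(X)                                 # a) sum of X_i
sum_y = sum(Y)                                 # b) sum of Y_i
sum_const = sum(10 for _ in X)                 # c) a constant summed n times = n * k
sum_plus = sum(x + y for x, y in zip(X, Y))    # d) sum of (X_i + Y_i)
sum_minus = sum(x - y for x, y in zip(X, Y))   # e) sum of (X_i - Y_i)
sum_prod = sum(x * y for x, y in zip(X, Y))    # f) sum of X_i * Y_i
sum_sq = sum(x ** 2 for x in X)                # g) sum of X_i squared
prod_of_sums = sum_x * sum_y                   # h) product of the two sums

print(sum_x, sum_y, sum_const, sum_plus, sum_minus, sum_prod, sum_sq, prod_of_sums)
```

Note that f) and h) differ: the sum of products is not the product of the sums.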
$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$
If X1 occurs f1 times
If X2 occurs f2 times
If Xn occurs fn times
$\bar{X} = \frac{\sum_{i=1}^{k} f_i X_i}{\sum_{i=1}^{k} f_i}$, where $\sum_{i=1}^{k} f_i = n$
$\bar{X} = \frac{\sum_{i=1}^{4} f_i X_i}{\sum_{i=1}^{4} f_i} = \frac{36}{7} \approx 5.14$
$\bar{X} = \frac{\sum_{i=1}^{k} f_i X_i}{\sum_{i=1}^{k} f_i}$, where $X_i$ = the class mark of the ith class and $f_i$ = the frequency of the ith class
Solutions:
First find the class marks
Find the product of frequency and class marks
Find mean using the formula.
Class fi Xi Xifi
6- 10 35 8 280
11- 15 23 13 299
16- 20 15 18 270
21- 25 12 23 276
26- 30 9 28 252
31- 35 6 33 198
Total 100 1575
$\bar{X} = \frac{\sum_{i=1}^{6} f_i X_i}{\sum_{i=1}^{6} f_i} = \frac{1575}{100} = 15.75$
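The three steps of the solution (class marks, products, mean) can be sketched in Python as follows. The tuple layout `(lower limit, upper limit, frequency)` is an assumption made for this illustration.

```python
# (lower class limit, upper class limit, frequency) from the table above.
classes = [(6, 10, 35), (11, 15, 23), (16, 20, 15),
           (21, 25, 12), (26, 30, 9), (31, 35, 6)]

# Step 1: class mark = midpoint of the class limits.
class_marks = [(lo + hi) / 2 for lo, hi, _ in classes]
freqs = [f for _, _, f in classes]

# Step 2: products f_i * X_i.
products = [f * x for f, x in zip(freqs, class_marks)]

# Step 3: mean = sum of products / total frequency.
n = sum(freqs)
grouped_mean = sum(products) / n
print(n, sum(products), grouped_mean)
```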
If the values in a series or mid values of a class are large enough, coding of values is a
good device to simplify the calculations.
Special properties of Arithmetic mean
1. The sum of the deviations of a set of items from their mean is always zero, i.e.
$\sum_{i=1}^{n} (X_i - \bar{X}) = 0$.
2. The sum of the squared deviations of a set of items from their mean is the
minimum, i.e. $\sum_{i=1}^{n} (X_i - \bar{X})^2 < \sum_{i=1}^{n} (X_i - A)^2$, $A \neq \bar{X}$
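$\bar{X}_c = \frac{\bar{X}_1 n_1 + \bar{X}_2 n_2 + \dots + \bar{X}_k n_k}{n_1 + n_2 + \dots + n_k} = \frac{\sum_{i=1}^{k} \bar{X}_i n_i}{\sum_{i=1}^{k} n_i}$
$\bar{X}_c = \frac{\bar{X}_1 n_1 + \bar{X}_2 n_2}{n_1 + n_2} = \frac{\sum_{i=1}^{2} \bar{X}_i n_i}{\sum_{i=1}^{2} n_i}$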
4. If a wrong figure has been used when calculating the mean, the correct mean can
be obtained without repeating the whole process using:
$CorrectMean = WrongMean + \frac{CorrectValue - WrongValue}{n}$
Weighted Mean
When different data items are to be given different importance, a weighted
mean is appropriate.
Weights are assigned to each item in proportion to its relative importance.
Let X1, X2, …, Xn be the values of the items of a series and W1, W2, …, Wn their
corresponding weights; then the weighted mean, denoted $\bar{X}_w$, is defined as:
$\bar{X}_w = \frac{\sum_{i=1}^{n} X_i W_i}{\sum_{i=1}^{n} W_i}$
Example:
A student obtained the following percentages in an examination:
English 60, Biology 75, Mathematics 63, Physics 59, and Chemistry 55. Find
the student's weighted arithmetic mean if weights 1, 2, 1, 3, 3 respectively are
allotted to the subjects.
Solutions:
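$\bar{X}_w = \frac{\sum_{i=1}^{5} X_i W_i}{\sum_{i=1}^{5} W_i} = \frac{60 \times 1 + 75 \times 2 + 63 \times 1 + 59 \times 3 + 55 \times 3}{1 + 2 + 1 + 3 + 3} = \frac{615}{10} = 61.5$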
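The weighted mean of the examination marks can be reproduced with a short Python sketch; the list names are illustrative.

```python
# Marks and weights in the order: English, Biology, Mathematics, Physics, Chemistry.
marks = [60, 75, 63, 59, 55]
weights = [1, 2, 1, 3, 3]

# Weighted mean = sum(X_i * W_i) / sum(W_i).
weighted_sum = sum(x * w for x, w in zip(marks, weights))
weighted_mean = weighted_sum / sum(weights)
print(weighted_sum, sum(weights), weighted_mean)
```

For comparison, the unweighted mean of the same marks is 312 / 5 = 62.4, so the heavier weights on Physics and Chemistry pull the average down.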
$G.M = \sqrt[n]{X_1 \times X_2 \times \dots \times X_n}$
The harmonic mean of X1, X2, X3, …, Xn is denoted by H.M and given by:
$H.M = \frac{n}{\sum_{i=1}^{n} \frac{1}{X_i}}$. This is called the simple harmonic mean.
$H.M = \frac{n}{\sum_{i=1}^{k} \frac{f_i}{X_i}}$, $n = \sum_{i=1}^{k} f_i$
If observations X1, X2, …, Xn have weights W1, W2, …, Wn respectively, then their
harmonic mean is given by
$H.M = \frac{\sum_{i=1}^{n} W_i}{\sum_{i=1}^{n} \frac{W_i}{X_i}}$. This is called the Weighted Harmonic Mean.
Remark: The Harmonic Mean is useful and appropriate in finding average speeds
and average rates.
Example: A cyclist pedals from his house to his college at speed of 10 km/hr and
back from the college to his house at 15 km/hr. Find the average speed.
Solution: Here the distance is constant
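Since the distance is the same in both directions, the average speed is the simple harmonic mean of the two speeds. The sketch below computes it in Python and cross-checks it against total distance divided by total time; the one-way distance of 30 km is a hypothetical value chosen only for the check, since the result does not depend on it.

```python
speeds = [10, 15]  # km/hr, outward and return trips

# Simple harmonic mean: n / sum(1 / X_i).
harmonic_mean = len(speeds) / sum(1 / s for s in speeds)

# Cross-check with a hypothetical one-way distance (any positive value works).
distance = 30                                   # km, assumed for illustration
total_time = sum(distance / s for s in speeds)  # hours
average_speed = (2 * distance) / total_time

print(harmonic_mean, average_speed)
```

Both computations give 12 km/hr, which is lower than the arithmetic mean of 12.5 km/hr because more time is spent travelling at the slower speed.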
Where:
$\hat{X}$ = the mode of the distribution
$w$ = the size of the modal class
$\Delta_1 = f_{mo} - f_1$
$\Delta_2 = f_{mo} - f_2$
$f_{mo}$ = frequency of the modal class
$f_1$ = frequency of the class preceding the modal class
$f_2$ = frequency of the class following the modal class
Note: The modal class is a class with the highest frequency.
Example: Following is the distribution of the size of certain farms selected at random
from a district. Calculate the mode of the distribution.
Size of farms No. of farms
5-15 8
15-25 12
25-35 17
35-45 29
45-55 31
$\Delta_1 = f_{mo} - f_1 = 2$
$\Delta_2 = f_{mo} - f_2 = 26$
$f_{mo} = 31$
$f_1 = 29$
$f_2 = 5$
Merits:
It is not affected by extreme observations.
Easy to calculate and simple to understand.
It can be calculated for distribution with open end class
Demerits:
It is not rigidly defined.
It is not based on all observations
It is not suitable for further mathematical treatment.
It is not stable average, i.e. it is affected by fluctuations of sampling to
some extent.
Often its value is not unique.
Note: being the point of maximum density, mode is especially useful in finding the
most popular size in studies relating to marketing, trade, business, and industry. It is
the appropriate average to be used to find the ideal size.
The Median
- In a distribution, median is the value of the variable which divides it in to two
equal halves.
- In an ordered series of data median is an observation lying exactly in the middle of the
series. It is the middle most value in the sense that the number of values less than the
median is equal to the number of values greater than it.
-If X1, X2 …Xn be the observations, then the numbers arranged in ascending
order will be X [1], X [2] …X[n], where X[i] is ith smallest value.
X[1]< X[2]< …<X[n]
$\frac{1}{2}(X_{[3]} + X_{[4]}) = \frac{1}{2}(5 + 6) = 5.5$
b) Order the data :1, 2, 3, 5, 8
Here n=5
$\tilde{X} = X_{[\frac{n+1}{2}]} = X_{[3]} = 3$
Median for grouped data
If data are given in the form of a continuous frequency
distribution, the median is defined as:
$\tilde{X} = L_{med} + \frac{w}{f_{med}} \left( \frac{n}{2} - c \right)$
Where:
$L_{med}$ = lower class boundary of the median class.
$w$ = the size of the median class.
$n$ = total number of observations.
$c$ = the cumulative frequency (less than type) preceding the median class.
$f_{med}$ = the frequency of the median class.
Remark:
The median class is the class with the smallest cumulative frequency (less than type)
greater than or equal to $\frac{n}{2}$.
Class Frequency
40-44 7
45-49 10
50-54 22
55-59 15
60-64 12
65-69 6
70-74 3
Solutions:
First find the less than cumulative frequency.
Identify the median class.
Find median using formula.
$L_{med} = 49.5$, $w = 5$
$n = 75$, $c = 17$, $f_{med} = 22$
$\tilde{X} = L_{med} + \frac{w}{f_{med}} \left( \frac{n}{2} - c \right) = 49.5 + \frac{5}{22}(37.5 - 17) = 54.16$
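The three solution steps can be sketched as a small Python function. It assumes classes are given by integer class limits, so that class boundaries lie half a unit outside the limits; the function name and the `gap` parameter are illustrative, not part of the original notes.

```python
def grouped_median(classes, gap=1):
    """Median of a grouped distribution.

    classes: list of (lower limit, upper limit, frequency), in increasing order.
    gap: distance between one class's upper limit and the next lower limit
         (assumed to be 1 for integer class limits such as 40-44, 45-49).
    """
    n = sum(f for _, _, f in classes)
    cum_before = 0  # less-than cumulative frequency preceding the current class
    for lo, hi, f in classes:
        if cum_before + f >= n / 2:          # first class whose cf reaches n/2
            lower_boundary = lo - gap / 2
            width = hi - lo + gap
            return lower_boundary + (width / f) * (n / 2 - cum_before)
        cum_before += f

classes = [(40, 44, 7), (45, 49, 10), (50, 54, 22), (55, 59, 15),
           (60, 64, 12), (65, 69, 6), (70, 74, 3)]
median = grouped_median(classes)
print(round(median, 2))
```

Here the median class is 50-54, with lower boundary 49.5, width 5, preceding cumulative frequency 17 and frequency 22, matching the values used in the solution.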
Merits and Demerits of Median
Merits:
Median is a positional average and hence not influenced by extreme observations.
Can be calculated in the case of open end intervals.
Quantiles
When a distribution is arranged in order of magnitude of items, the median is the value of
the middle term. Other measures that depend on their positions in the distribution,
namely quartiles, deciles, and percentiles, are collectively called quantiles.
Quartiles:
- Quartiles are measures that divide the frequency distribution in to four equal
parts.
- The values of the variable corresponding to these divisions are denoted Q1, Q2,
and Q3, often called the first, the second and the third quartile respectively.
- Q1 is a value which has 25% of the items less than or equal to it. Similarly, Q2
has 50% of the items with values less than or equal to it, and Q3 has 75% of the items with
values less than or equal to it.
Calculating quartiles for raw data
To calculate the three quartiles from raw data, we must first arrange the
data in increasing order. Then
$Q_i = \frac{i}{4}(n + 1)$th value, $i = 1, 2, 3$, so that
$Q_1 = \frac{1}{4}(n + 1)$th value
$Q_2 = \frac{2}{4}(n + 1)$th value
$Q_3 = \frac{3}{4}(n + 1)$th value, where n is the number of
observations.
E.g. the following data show the ages of 30 sampled patients in JUSH:
6,9,11,14,16,17,18,21,22,22,22,22,23,25,25,26,27,28,28,32,33,34,34,36,39,39,
41,45,46,49. Find the lower, middle and upper quartiles for the above data.
Solution:
1st order the data (if it hasn’t been ordered)
6,9,11,14,16,17,18,21,22,22,22,22,23,25,25,26,27,28,28,32,33,34,34,36,39,39,
41,45,46,49
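The positional rule $Q_i = \frac{i}{4}(n + 1)$th value can be sketched in Python as below. When the position is fractional, this sketch interpolates linearly between the two neighbouring ordered values; that interpolation convention is an assumption of the sketch, since the surviving text does not show how fractional positions are handled.

```python
ages = [6, 9, 11, 14, 16, 17, 18, 21, 22, 22, 22, 22, 23, 25, 25,
        26, 27, 28, 28, 32, 33, 34, 34, 36, 39, 39, 41, 45, 46, 49]

def quartile(sorted_data, i):
    """i-th quartile (i = 1, 2, 3) using the i(n + 1)/4-th ordered value."""
    position = i * (len(sorted_data) + 1) / 4   # 1-based, possibly fractional
    k = int(position)                            # whole part of the position
    fraction = position - k
    lower = sorted_data[k - 1]                   # k-th ordered value
    if fraction == 0:
        return lower
    upper = sorted_data[k]                       # (k + 1)-th ordered value
    return lower + fraction * (upper - lower)    # assumed linear interpolation

data = sorted(ages)
q1, q2, q3 = (quartile(data, i) for i in (1, 2, 3))
print(q1, q2, q3)
```

With n = 30 the positions are 7.75, 15.5 and 23.25. The sketch does not guard against very small samples where a position falls outside the data.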
$Q_i = L_{Q_i} + \frac{w}{f_{Q_i}} \left( \frac{in}{4} - c \right)$, $i = 1, 2, 3$
Where:
$L_{Q_i}$ = lower class boundary of the quartile class.
$w$ = the size of the quartile class.
$n$ = total number of observations.
$c$ = the cumulative frequency (less than type) preceding the quartile class.
$f_{Q_i}$ = the frequency of the quartile class.
Remark:
The quartile class (class containing Qi) is the class with the smallest cumulative frequency
(less than type) greater than or equal to $\frac{in}{4}$.
Deciles:
- Deciles are measures that divide the frequency distribution in to ten equal parts.
- The values of the variable corresponding to these divisions are denoted D1, D2, …,
D9, often called the first, the second, …, the ninth decile respectively.
$D_i = L_{D_i} + \frac{w}{f_{D_i}} \left( \frac{in}{10} - c \right)$, $i = 1, 2, \dots, 9$
Where:
$L_{D_i}$ = lower class boundary of the decile class.
$w$ = the size of the decile class.
$n$ = total number of observations.
$c$ = the cumulative frequency (less than type) preceding the decile class.
$f_{D_i}$ = the frequency of the decile class.
Remark:
The decile class (class containing Di) is the class with the smallest cumulative frequency
(less than type) greater than or equal to $\frac{in}{10}$.
Percentiles:
- Percentiles are measures that divide the frequency distribution in to hundred
equal parts.
- The values of the variables corresponding to these divisions are denoted P1, P2,..
P99 often called the first, the second,…, the ninety-ninth percentile respectively.
- To calculate the ninety-nine percentiles from raw data, we must first arrange the data
in increasing order.
Remark: The percentile class (class containing Pi) is the class with the
smallest cumulative frequency (less than type) greater than or equal to $\frac{in}{100}$.
Example: Considering the following distribution
Calculate:
a) All quartiles.
b) The 7th decile.
c) The 90th percentile.
Values Frequency
140- 150 17
150- 160 29
160- 170 42
170- 180 72
180- 190 84
190- 200 107
ii. Q2
- determine the class containing the second quartile.
$\frac{2 \times n}{4} = 246.5$
190 – 200 is the class containing the second quartile.
$Q_2 = L_{Q_2} + \frac{w}{f_{Q_2}} \left( \frac{2 \times n}{4} - c \right) = 190 + \frac{10}{107}(246.5 - 244) = 190.23$
iii. Q3
- determine the class containing the third quartile.
$\frac{3 \times n}{4} = 369.75$
200 – 210 is the class containing the third quartile.
$Q_3 = L_{Q_3} + \frac{w}{f_{Q_3}} \left( \frac{3 \times n}{4} - c \right) = 200 + \frac{10}{49}(369.75 - 351) = 203.83$
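b. D7
- determine the class containing the 7th decile.
$\frac{7 \times n}{10} = 345.1$
190 – 200 is the class containing the seventh decile.
$L_{D_7} = 190$, $w = 10$
$n = 493$, $c = 244$, $f_{D_7} = 107$
$D_7 = L_{D_7} + \frac{w}{f_{D_7}} \left( \frac{7 \times n}{10} - c \right) = 190 + \frac{10}{107}(345.1 - 244) = 199.45$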
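$\frac{90 \times n}{100} = 443.7$
220 – 230 is the class containing the 90th percentile.
$L_{P_{90}} = 220$, $w = 10$
$n = 493$, $c = 434$, $f_{P_{90}} = 31$
$P_{90} = L_{P_{90}} + \frac{w}{f_{P_{90}}} \left( \frac{90 \times n}{100} - c \right) = 220 + \frac{10}{31}(443.7 - 434) = 223.13$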
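Quartiles, deciles and percentiles for grouped data all share one formula, differing only in the fraction of n used. The sketch below expresses this as a single hypothetical Python helper and applies it to the class values quoted in the worked solutions (n = 493).

```python
def grouped_quantile(lower_boundary, width, n, cum_before, freq, fraction):
    """Quantile of grouped data: L + (w / f) * (fraction * n - c).

    fraction is i/4 for quartiles, i/10 for deciles, i/100 for percentiles.
    """
    return lower_boundary + (width / freq) * (fraction * n - cum_before)

n = 493
q2 = grouped_quantile(190, 10, n, 244, 107, 2 / 4)       # second quartile
q3 = grouped_quantile(200, 10, n, 351, 49, 3 / 4)        # third quartile
d7 = grouped_quantile(190, 10, n, 244, 107, 7 / 10)      # seventh decile
p90 = grouped_quantile(220, 10, n, 434, 31, 90 / 100)    # 90th percentile

print(round(q2, 2), round(q3, 2), round(d7, 2), round(p90, 2))
```

The class containing each quantile still has to be identified first from the less-than cumulative frequencies, as in the solutions.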
CHAPTER 4
4. Measures of Dispersion (Variation)
The measures of dispersion which are expressed in terms of the original unit of a
series are termed as absolute measures. Such measures are not suitable for comparing
the variability of two distributions which are expressed in different units of
measurement and different average size. Relative measures of dispersions are a ratio or
percentage of a measure of absolute dispersion to an appropriate measure of central
tendency and are thus pure numbers independent of the units of measurement. For
comparing the variability of two distributions (even if they are measured in the same
unit), we compute the relative measure of dispersion instead of absolute measures of
dispersion.
Various measures of dispersions are in use. The most commonly used measures of
dispersions are:
1) Range and relative range
2) Mean deviation
3) Standard deviation ,coefficient of variation and standard scores
Merits:
It is rigidly defined.
It is easy to calculate and simple to understand.
Demerits:
It is not based on all observation.
It is highly affected by extreme observations.
It is affected by fluctuation in sampling.
It is not liable to further algebraic treatment.
It cannot be computed in the case of open end distribution.
It is very sensitive to the size of the sample.
Relative Range (RR)
-it is also sometimes called coefficient of range and given by:
Solutions: (2)
$R = 4 \Rightarrow L - S = 4$ _________________(1)
$RR = 0.25 \Rightarrow L + S = 16$ _____________(2)
Solving (1) and (2) at the same time, one can obtain the following values:
$L = 10$ and $S = 6$
$M.D(\bar{X}) = \frac{\sum_{i=1}^{n} |X_i - \bar{X}|}{n}$
For the case of frequency distribution it is given as:
$M.D(\bar{X}) = \frac{\sum_{i=1}^{k} f_i |X_i - \bar{X}|}{n}$
$M.D(\tilde{X}) = \frac{\sum_{i=1}^{n} |X_i - \tilde{X}|}{n}$
For the case of frequency distribution it is given as:
$M.D(\tilde{X}) = \frac{\sum_{i=1}^{k} f_i |X_i - \tilde{X}|}{n}$
Steps to calculate $M.D(\tilde{X})$:
1. Find the median, $\tilde{X}$.
2. Find the deviations of each reading from $\tilde{X}$.
3. Find the arithmetic mean of the deviations, ignoring sign.
$M.D(\hat{X}) = \frac{\sum_{i=1}^{n} |X_i - \hat{X}|}{n}$
$M.D(\hat{X}) = \frac{\sum_{i=1}^{k} f_i |X_i - \hat{X}|}{n}$
Examples:
1. The following are the numbers of visits made by ten mothers to the local doctor's
surgery: 8, 6, 5, 5, 7, 4, 5, 9, 7, 4.
Find the mean deviation about the mean, median and mode.
Solutions:
First calculate the three averages
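$\bar{X} = 6$, $\tilde{X} = 5.5$, $\hat{X} = 5$
$\sum |X_i - 5.5| = 1.5 + 1.5 + 0.5 + 0.5 + 0.5 + 0.5 + 1.5 + 1.5 + 2.5 + 3.5 = 14$
$\sum |X_i - 5| = 1 + 1 + 0 + 0 + 0 + 1 + 2 + 2 + 3 + 4 = 14$
$M.D(\bar{X}) = \frac{\sum_{i=1}^{10} |X_i - 6|}{10} = \frac{14}{10} = 1.4$
$M.D(\tilde{X}) = \frac{\sum_{i=1}^{10} |X_i - 5.5|}{10} = \frac{14}{10} = 1.4$
$M.D(\hat{X}) = \frac{\sum_{i=1}^{10} |X_i - 5|}{10} = \frac{14}{10} = 1.4$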
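The three mean deviations can be checked with the Python standard library. This is a minimal sketch; `mean_deviation` is an illustrative helper name.

```python
import statistics

visits = [8, 6, 5, 5, 7, 4, 5, 9, 7, 4]

def mean_deviation(data, center):
    """Arithmetic mean of the absolute deviations from the given center."""
    return sum(abs(x - center) for x in data) / len(data)

mean = statistics.mean(visits)      # arithmetic mean
median = statistics.median(visits)  # middle value of the ordered data
mode = statistics.mode(visits)      # most frequent value

md_mean = mean_deviation(visits, mean)
md_median = mean_deviation(visits, median)
md_mode = mean_deviation(visits, mode)
print(mean, median, mode, md_mean, md_median, md_mode)
```

That all three mean deviations coincide is a feature of this particular data set, not a general property.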
2. Find mean deviation about mean, median and mode for the following
distributions.(exercise)
Class Frequency
40-44 7
45-49 10
50-54 22
55-59 15
60-64 12
65-69 6
70-74 3
$S^2 = \frac{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}{n - 1}$, for raw data.
$S^2 = \frac{\sum_{i=1}^{k} f_i X_i^2 - n\bar{X}^2}{n - 1}$, for frequency distribution.
Standard Deviation
There is a problem with variances. Recall that the deviations were squared. That
means that the units were also squared. To get the units back the same as the original
data values, the square root must be taken.
The following steps are used to calculate the sample standard deviation
1. Find the arithmetic mean.
2. Find the difference between each observation and the mean.
3. Square these differences.
4. Sum the squared differences.
5. Since the data is a sample, divide the number (from step 4 above) by the number
of observations minus one, i.e., n-1 (where n is equal to the number of observations
in the data set).
6. Square root the result obtained from step 5
Examples: Find the variance and standard deviation of the following sample data
1. 5, 17, 12, 10.
2. The data is given in the form of frequency distribution.
Class Frequency
40-44 7
45-49 10
50-54 22
55-59 15
60-64 12
65-69 6
70-74 3
Solutions:
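1. $\bar{X} = 11$
Xi              5    10   12   17   Total
(Xi − X̄)²       36   1    1    36   74
$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} = \frac{74}{3} = 24.67$
$S = \sqrt{S^2} = \sqrt{24.67} = 4.97$
2. $\bar{X} = 55$
Xi (C.M)        42     47    52    57   62    67    72    Total
fi(Xi − X̄)²     1183   640   198   60   588   864   867   4400
$S^2 = \frac{\sum_{i=1}^{k} f_i (X_i - \bar{X})^2}{n - 1} = \frac{4400}{74} = 59.46$
$S = \sqrt{S^2} = \sqrt{59.46} = 7.71$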
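The six calculation steps translate directly into code. The sketch below applies them to both examples in Python; the grouped case uses class marks (midpoints of the class limits) weighted by frequency, and the variable names are illustrative.

```python
import math

# Example 1: raw sample data.
data = [5, 17, 12, 10]
n = len(data)
mean = sum(data) / n                                   # step 1
squared_devs = [(x - mean) ** 2 for x in data]         # steps 2 and 3
variance = sum(squared_devs) / (n - 1)                 # steps 4 and 5
std_dev = math.sqrt(variance)                          # step 6

# Example 2: frequency distribution, using class marks.
class_marks = [42, 47, 52, 57, 62, 67, 72]
freqs = [7, 10, 22, 15, 12, 6, 3]
n_g = sum(freqs)
mean_g = sum(f * x for f, x in zip(freqs, class_marks)) / n_g
variance_g = sum(f * (x - mean_g) ** 2
                 for f, x in zip(freqs, class_marks)) / (n_g - 1)
std_dev_g = math.sqrt(variance_g)

print(round(variance, 2), round(std_dev, 2))
print(mean_g, round(variance_g, 2), round(std_dev_g, 2))
```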
1. $\frac{\sum (X_i - \bar{X})^2}{n - 1} < \frac{\sum (X_i - A)^2}{n - 1}$, $A \neq \bar{X}$
2. For a normal (symmetric) distribution the following holds.
Approximately 68.27% of the data values fall within one standard deviation of the
mean, i.e. within $(\bar{X} - S, \bar{X} + S)$
Solutions:
a) 38 and 62 are at equal distance from the mean, 50, and this distance is 12.
$kS = 12 \Rightarrow k = \frac{12}{S} = \frac{12}{6} = 2$
b) Similarly done.
c) It is just the complement of a), i.e. at most $\frac{1}{k^2} \times 100\% = 25\%$ of the numbers
lie below 38 or above 62.
d) Similarly done.
Example 2:
The scores on a special test of knowledge of wood refinishing have a mean of
53 and a standard deviation of 6. Find the range of values in which at least 75% of the
scores will lie. (Exercise)
Examples:
1. The mean and standard deviation of n Tetracycline Capsules X 1 , X 2 , ..... X n
are known to be 12 gm and 3 gm respectively. New set of capsules of another
drug are obtained by the linear transformation Yi = 2Xi – 0.5 ( i = 1, 2, …, n )
then what will be the standard deviation of the new set of capsules
2. The mean and the standard deviation of a set of numbers are respectively 500 and
10.
a. If 10 is added to each of the numbers in the set, then what will be
the variance and standard deviation of the new set?
b. If each of the numbers in the set are multiplied by -5, then what
will be the variance and standard deviation of the new set?
Solutions:
1. Using c) above, the new standard deviation = $|k| S = 2 \times 3 = 6$
2. a. They will remain the same.
b. New standard deviation = $|k| S = 5 \times 10 = 50$
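The rules used in these solutions (adding a constant leaves the standard deviation unchanged; multiplying by k multiplies it by |k| and the variance by k²) can be verified numerically. The data below are hypothetical values invented for this sketch, since the examples give only summary statistics.

```python
import math
import statistics

x = [9.0, 11.5, 12.0, 13.5, 14.0]    # hypothetical sample, for illustration only

y = [2 * xi - 0.5 for xi in x]       # linear transformation Y = 2X - 0.5
shifted = [xi + 10 for xi in x]      # add a constant to every value
scaled = [-5 * xi for xi in x]       # multiply every value by -5

sd_x = statistics.stdev(x)           # sample standard deviation (n - 1 divisor)
var_x = statistics.variance(x)       # sample variance

print(statistics.stdev(y) / sd_x)        # ratio is |2|
print(statistics.stdev(shifted) / sd_x)  # ratio is 1
print(statistics.stdev(scaled) / sd_x)   # ratio is |-5|
```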
Coefficient of Variation (C.V)
It is defined as the ratio of the standard deviation to the mean, usually expressed as a
percentage.
$C.V = \frac{S}{\bar{X}} \times 100$
The distribution having the smaller C.V is said to be less variable or more consistent.
Examples:
1. An analysis of the monthly wages paid (in Birr) to workers in two firms A and B
belonging to the same industry gives the following results
Solutions:
Calculate coefficient of variation for both firms.
City 1 25 24 23 26 17
City2 22 21 24 22 20
City3 32 27 35 24 28
Which city has the most consistent temperature, based on these data?
(Exercise)
$Z = \frac{X - \mu}{\sigma}$, for population.
$Z = \frac{X - \bar{X}}{S}$, for sample.
Z gives the deviations from the mean in units of standard deviation.
Z gives the number of standard deviations a particular observation lies above
or below the mean.
It is used to compare two observations coming from different groups.
Examples:
1. Two sections were given introduction to statistics examinations. The following
information was given.
Solutions:
Calculate the standard score of both students.
$Z_A = \frac{X_A - \bar{X}_1}{S_1} = \frac{90 - 78}{6} = 2$
$Z_B = \frac{X_B - \bar{X}_2}{S_2} = \frac{95 - 90}{5} = 1$
Student A performed better relative to his section because the score of student A
is two standard deviation above the mean score of his section while, the score of
student B is only one standard deviation above the mean score of his section.
2. Two groups of people were trained to perform a certain task and tested to find
out which group is faster to learn the task. For the two groups the following
information was given:
Relatively speaking:
a) Which group is more consistent in its performance
b) Suppose a person A from group one take 9.2 minutes while
person B from Group two take 9.3 minutes, who was faster
in performing the task? Why?
Solutions:
a) Use coefficient of variation.
$C.V_1 = \frac{S_1}{\bar{X}_1} \times 100 = \frac{1.2}{10.4} \times 100 = 11.54\%$
$C.V_2 = \frac{S_2}{\bar{X}_2} \times 100 = \frac{1.3}{11.9} \times 100 = 10.92\%$
Since C.V2 < C.V1, group 2 is more consistent.
b) Calculate the standard score of A and B
$Z_A = \frac{X_A - \bar{X}_1}{S_1} = \frac{9.2 - 10.4}{1.2} = -1$
$Z_B = \frac{X_B - \bar{X}_2}{S_2} = \frac{9.3 - 11.9}{1.3} = -2$
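Both parts of this solution reduce to two one-line formulas. The following Python sketch uses the group means and standard deviations that appear in the calculations above; the helper names are illustrative.

```python
def coefficient_of_variation(sd, mean):
    """C.V = S / mean * 100, expressed as a percentage."""
    return sd / mean * 100

def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

# Group 1: mean 10.4, S = 1.2. Group 2: mean 11.9, S = 1.3 (minutes).
cv1 = coefficient_of_variation(1.2, 10.4)
cv2 = coefficient_of_variation(1.3, 11.9)

z_a = z_score(9.2, 10.4, 1.2)   # person A, group 1
z_b = z_score(9.3, 11.9, 1.3)   # person B, group 2

print(round(cv1, 2), round(cv2, 2), round(z_a, 2), round(z_b, 2))
```

Because the variable is time taken, a more negative standard score means a time further below the group mean, so relative to his or her own group person B was faster than person A.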
4.3 Moments
- If X is a variable that assumes the values X1, X2, …, Xn then
1. The rth moment is defined as:
$\overline{X^r} = \frac{X_1^r + X_2^r + \dots + X_n^r}{n} = \frac{\sum_{i=1}^{n} X_i^r}{n}$
- For the case of frequency distribution this is expressed as:
$\overline{X^r} = \frac{\sum_{i=1}^{k} f_i X_i^r}{n}$
- If r = 1, it is the simple arithmetic mean; this is called the first moment.
2. The rth moment about the mean (the rth central moment)
Denoted by $M_r$ and defined as:
$M_r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^r}{n} = \frac{(n - 1)}{n} \cdot \frac{\sum_{i=1}^{n} (X_i - \bar{X})^r}{n - 1}$
- For the case of frequency distribution this is expressed as:
$M_r = \frac{\sum_{i=1}^{k} f_i (X_i - \bar{X})^r}{n}$
- If r = 2, it is the population variance; this is called the second central moment. If we
assume $n - 1 \approx n$, it is also the sample variance.
n n n 1
- For the case of frequency distribution this is expressed as:
$M'_r = \frac{\sum_{i=1}^{k} f_i (X_i - A)^r}{n}$
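$\overline{X^r} = \frac{\sum_{i=1}^{n} X_i^r}{n}$
$\overline{X^1} = \frac{2 + 3 + 7}{3} = 4 = \bar{X}$
$\overline{X^2} = \frac{2^2 + 3^2 + 7^2}{3} = 20.67$
$M_r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^r}{n}$
$M_1 = \frac{(2 - 4) + (3 - 4) + (7 - 4)}{3} = 0$
$M_2 = \frac{(2 - 4)^2 + (3 - 4)^2 + (7 - 4)^2}{3} = 4.67$
$M_3 = \frac{(2 - 4)^3 + (3 - 4)^3 + (7 - 4)^3}{3} = 6$
$M'_r = \frac{\sum_{i=1}^{n} (X_i - A)^r}{n}$
$M'_3 = \frac{(2 - 3)^3 + (3 - 3)^3 + (7 - 3)^3}{3} = 21$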
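The moment calculations for the values 2, 3, 7 can be reproduced with one general function, since a raw moment is a moment about 0 and a central moment is a moment about the mean. This is a minimal Python sketch with an illustrative function name.

```python
def moment_about(data, a, r):
    """r-th moment of the data about the point a: sum((X_i - a)^r) / n."""
    return sum((x - a) ** r for x in data) / len(data)

data = [2, 3, 7]

raw_1 = moment_about(data, 0, 1)        # first moment = arithmetic mean
raw_2 = moment_about(data, 0, 2)        # second moment
mean = raw_1

central_1 = moment_about(data, mean, 1)  # always zero
central_2 = moment_about(data, mean, 2)  # population variance
central_3 = moment_about(data, mean, 3)

about_3 = moment_about(data, 3, 3)       # third moment about A = 3

print(raw_1, round(raw_2, 2), central_1, round(central_2, 2), central_3, about_3)
```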
4.4 Skewness
Measures of Skewness
- Denoted by $\alpha_3$
- There are various measures of skewness.
1. The Pearsonian coefficient of skewness
$\alpha_3 = \frac{Mean - Mode}{Standard\ deviation} = \frac{\bar{X} - \hat{X}}{S}$
$\alpha_3 = \frac{M_3}{M_2^{3/2}} = \frac{M_3}{(\sigma^2)^{3/2}} = \frac{M_3}{\sigma^3}$, where $\sigma$ is the population standard deviation.
The shape of the curve is determined by the value of $\alpha_3$:
If $\alpha_3 > 0$ then the distribution is positively skewed.
If $\alpha_3 = 0$ then the distribution is symmetric.
If $\alpha_3 < 0$ then the distribution is negatively skewed.
Remark:
o In a positively skewed distribution, smaller observations are more
frequent than larger observations. i.e. the majority of the
observations have a value below an average.
o In a negatively skewed distribution, smaller observations are less
frequent than larger observations. i.e. the majority of the
observations have a value above an average
Examples:
4.5 Kurtosis
Solutions:
a) $\alpha_3 = \frac{M_3}{M_2^{3/2}} = \frac{-60}{16^{3/2}} = -0.94 < 0$
The distribution is negatively skewed.
b) $\alpha_4 = \frac{M_4}{M_2^2} = \frac{162}{16^2} = 0.63 < 3$
The curve is platykurtic.
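The two coefficients in this solution depend only on the central moments. The Python sketch below assumes the moment values read from the worked solution, namely M2 = 16, M3 = -60 and M4 = 162; the negative sign of M3 is inferred from the stated result and the conclusion of negative skewness.

```python
# Central moments as read from the worked solution (assumed values).
m2, m3, m4 = 16, -60, 162

alpha_3 = m3 / m2 ** 1.5   # moment coefficient of skewness: M3 / M2^(3/2)
alpha_4 = m4 / m2 ** 2     # moment coefficient of kurtosis: M4 / M2^2

skewness = ("positively skewed" if alpha_3 > 0
            else "negatively skewed" if alpha_3 < 0
            else "symmetric")
# The solution compares the kurtosis coefficient with 3.
peakedness = ("leptokurtic" if alpha_4 > 3
              else "platykurtic" if alpha_4 < 3
              else "mesokurtic")

print(round(alpha_3, 2), skewness)
print(round(alpha_4, 2), peakedness)
```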
CHAPTER 5
5. ELEMENTARY PROBABILITY
5.1 Introduction
Probability theory is the foundation upon which the logic of inference is built.
It helps us to cope with uncertainty.
In general, probability is the chance of an outcome of an experiment. It is the
measure of how likely an outcome is to occur.
5.2 Definitions of some probability terms
1. Experiment: Any process of observation or measurement or any process which
generates well defined outcome.
2. Probability Experiment: It is an experiment that can be repeated any number of
times under similar conditions, and for which it is possible to enumerate all the possible outcomes
without being able to predict an individual outcome. It is also called a random experiment.
Example: If a fair die is rolled once it is possible to list all the possible outcomes
i.e.1, 2, 3, 4, 5, 6 but it is not possible to predict which outcome will
occur.
3. Outcome :The result of a single trial of a random experiment
4. Sample Space: Set of all possible outcomes of a probability experiment
5. Event: It is a subset of sample space. It is a statement about one or more outcomes of a
random experiment .They are denoted by capital letters.
Example: Considering the above experiment, let A be the event of odd numbers,
B be the event of even numbers, and C be the event of number 8.
A = {1, 3, 5}
B = {2, 4, 6}
C = { } or ∅ (empty space, or impossible event)
Remark:
If S (sample space) has n members then there are exactly $2^n$ subsets or
events.
6. Equally Likely Events: Events which have the same chance of occurring.
7. Complement of an Event: the complement of an event A means non-occurrence of
A and is denoted by $A'$, or $A^c$, or $\bar{A}$; it contains those points of the sample space which
don't belong to A.
8. Elementary Event: an event having only a single element or sample point.
9. Mutually Exclusive Events: Two events which cannot happen at the same time.
10. Independent Events: Two events are independent if the occurrence of one does
not affect the probability of the other occurring.
11. Dependent Events: Two events are dependent if the first event affects the
outcome or occurrence of the second event in a way the probability is changed.
Example: .What is the sample space for the following experiment
To list the outcomes of the sequence of events, a useful device called tree
diagram is used.
The addition rule
Suppose that the 1st procedure, designated by 1, can be performed in n1 ways, and that the
2nd procedure, designated by 2, can be performed in n2 ways.
Suppose furthermore that it is not possible for both procedures 1 and 2 to be performed
together. Then the number of ways in which we can perform procedure 1 or procedure 2 is n1 + n2.
More generally, if there are k such procedures, the kth of which can be performed in nk ways,
and no two of them can be performed together, then there are n1 + n2 + … + nk possible ways.
Example: suppose we are planning a trip and are deciding between bus and train transportation. If
there are 3 bus routes and 2 train routes from A to B, find the number of available routes for
the trip.
Solution:
There are 3+2 =5 routes for someone to go from A to B.
Example 3
The digits 0, 1, 2, 3, and 4 are to be used in 4 digit identification card. How many
different cards are possible if
a) Repetitions are permitted.
b) Repetitions are not permitted.
Solutions
a)
1st digit 2nd digit 3rd digit 4th digit
5 5 5 5
There are four steps
1. Selecting the 1st digit, this can be made in 5 ways.
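Both parts of this example can be confirmed by brute-force enumeration. The Python sketch below counts the 4-digit cards directly; like the table above, it assumes that 0 is allowed as the first digit.

```python
from itertools import permutations, product

digits = "01234"

# a) Repetitions permitted: every position can take any of the 5 digits.
with_repetition = sum(1 for _ in product(digits, repeat=4))

# b) Repetitions not permitted: ordered selections of 4 distinct digits.
without_repetition = sum(1 for _ in permutations(digits, 4))

print(with_repetition, without_repetition)
```

The counts agree with the multiplication rule: 5 × 5 × 5 × 5 = 625 and 5 × 4 × 3 × 2 = 120.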
Permutation
$_nP_r = \frac{n!}{(n - r)!}$
$_nP_r = \frac{n!}{k_1! \times k_2! \times \dots \times k_n!}$
Example:
1. Suppose we have the letters A, B, C, D.
a) How many permutations are there taking all four?
b) How many permutations are there taking two letters at a time?
2. How many different permutations can be made from the letters in the word
“CORRECTION”?
Solutions:
1.
a)
Here n = 4; there are four distinct objects.
There are 4! = 24 permutations.
b)
Here n = 4, r = 2.
There are $_4P_2 = \frac{4!}{(4 - 2)!} = \frac{24}{2} = 12$ permutations.
2.
Here n = 10,
of which 2 are C, 2 are O, 2 are R, 1 E, 1 T, 1 I, 1 N.
$k_1 = 2$, $k_2 = 2$, $k_3 = 2$, $k_4 = k_5 = k_6 = k_7 = 1$
Using the 3rd rule of permutation, there are
$\frac{10!}{2! \times 2! \times 2! \times 1! \times 1! \times 1! \times 1!} = 453600$ permutations.
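These permutation counts can be checked with Python's standard library (`math.perm` requires Python 3.8 or later). The helper `multiset_permutations` is an illustrative name for the rule with repeated objects.

```python
import math
from collections import Counter

def multiset_permutations(word):
    """n! / (k1! * k2! * ...), where k_i counts each repeated letter."""
    total = math.factorial(len(word))
    for k in Counter(word).values():
        total //= math.factorial(k)   # each partial quotient is an integer
    return total

all_four = math.factorial(4)          # 1 a) permutations of A, B, C, D
two_at_a_time = math.perm(4, 2)       # 1 b) 4! / (4 - 2)!
correction = multiset_permutations("CORRECTION")   # 2

print(all_four, two_at_a_time, correction)
```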
Exercises:
1. Six different statistics books, seven different physics books, and 3 different
Economics books are arranged on a shelf. How many different arrangements
are possible if;
i. The books in each particular subject must all stand together
ii. Only the statistics books must stand together
Combination
Permutations:
AB BA CA DA
AC BC CB DB
AD BD CD DC
Combinations:
AB BC
AC BD
AD DC
Note that in permutation AB is different from BA, but in combination AB is the same as
BA.
Combination Rule
$\binom{n}{r} = \frac{n!}{(n - r)! \times r!}$
Examples:
1. In how many ways can a committee of 5 people be chosen out of 9 people?
Solutions:
n = 9, r = 5
$\binom{n}{r} = \frac{n!}{(n - r)! \times r!} = \frac{9!}{4! \times 5!} = 126$ ways
2. Among 15 clocks there are two defectives. In how many ways can an inspector
choose three of the clocks for inspection so that:
a) There is no restriction.
b) None of the defective clocks is included.
c) Only one of the defective clocks is included.
d) Two of the defective clocks are included.
Solutions:
$\binom{2}{0} \times \binom{13}{3} = 286$ ways.
c) Only one of the defective clocks is included.
This is equivalent to one defective and two non defective, which can be
done in:
$\binom{2}{1} \times \binom{13}{2} = 156$ ways.
d) Two of the defective clocks are included.
This is equivalent to two defective and one non-defective, which can be
done in:
$\binom{2}{2} \times \binom{13}{1} = 13$ ways.
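The combination counts in these examples can be verified with `math.comb` (Python 3.8 or later). In this sketch the clock cases are indexed by the number of defective clocks selected, and the unrestricted count for part a) is computed as the number of ways to choose any 3 of the 15 clocks.

```python
import math

committee = math.comb(9, 5)     # 5 people chosen out of 9

defective, good, chosen = 2, 13, 3
no_restriction = math.comb(defective + good, chosen)   # a) any 3 of the 15 clocks

# Ways to pick exactly d defective clocks and (3 - d) non-defective ones.
by_defectives = {
    d: math.comb(defective, d) * math.comb(good, chosen - d)
    for d in range(defective + 1)
}

print(committee, no_restriction, by_defectives)
```

The three restricted cases are mutually exclusive and exhaustive, so by the addition rule they sum to the unrestricted count.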
Exercises:
1. Out of 5 mathematicians and 7 statisticians, a committee consisting of 2
mathematicians and 3 statisticians is to be formed. In how many ways can this
be done if
N A n( A) 0
n( A)
P( A) 0 60
n( S )
2. A box of 80 candles consists of 30 defective and 50 non-defective candles. If
10 of these candles are selected at random, what is the probability that
a) All will be defective.
b) 6 will be non-defective.
c) All will be non-defective.
Solutions:
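Total selection: $N = n(S) = \binom{80}{10}$
Total ways in which A occurs: $N_A = n(A) = \binom{30}{10} \times \binom{50}{0}$
$P(A) = \frac{n(A)}{n(S)} = \frac{\binom{30}{10} \times \binom{50}{0}}{\binom{80}{10}} = 0.00001825$
Total ways in which A occurs: $N_A = n(A) = \binom{30}{0} \times \binom{50}{10}$
$P(A) = \frac{n(A)}{n(S)} = \frac{\binom{30}{0} \times \binom{50}{10}}{\binom{80}{10}} = 0.00624$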
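All three parts of the candle example follow the same counting pattern: choose the defective candles, choose the non-defective candles, and divide by the total number of selections. A minimal Python sketch, with an illustrative function name:

```python
import math

DEFECTIVE, NON_DEFECTIVE, SELECTED = 30, 50, 10
total = math.comb(DEFECTIVE + NON_DEFECTIVE, SELECTED)   # n(S)

def prob_non_defective(k):
    """P(exactly k non-defective among the 10 selected candles)."""
    favourable = math.comb(NON_DEFECTIVE, k) * math.comb(DEFECTIVE, SELECTED - k)
    return favourable / total

p_all_defective = prob_non_defective(0)        # part a)
p_six_non_defective = prob_non_defective(6)    # part b)
p_all_non_defective = prob_non_defective(10)   # part c)

print(p_all_defective, p_six_non_defective, p_all_non_defective)
```

The probabilities over k = 0, 1, …, 10 sum to 1, which is a useful sanity check on the counting.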
Exercises:
1. What is the probability that a waitress will refuse to serve alcoholic beverages
to only three minors if she randomly checks the I.D’s of five students from
among ten students of which four are not of legal age?
2. If 3 books are picked at random from a shelf containing 5 novels, 3 books of
poems, and a dictionary, what is the probability that
a) The dictionary is selected?
b) 2 novels and 1 book of poems are selected?
[Figure: Venn diagrams illustrating A ∪ B and A ∩ B]
In general $p(A \cup B) = p(A) + p(B) - p(A \cap B)$
Example: Suppose we have two red and three white balls in a bag
The conditional probability of an event A given that B has already occurred, denoted
$p(A \mid B)$, is
$p(A \mid B) = \frac{p(A \cap B)}{p(B)}$, $p(B) \neq 0$
Remark: (1) $p(A' \mid B) = 1 - p(A \mid B)$
(2) $p(B' \mid A) = 1 - p(B \mid A)$
Examples
1. For a student enrolling as a freshman at a certain university, the probability is 0.25
that he/she will get a scholarship and 0.75 that he/she will graduate. The
probability is 0.2 that he/she will get a scholarship and will also graduate. What is
the probability that a student who gets a scholarship will graduate?
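This question is a direct application of the conditional probability formula, with A = "graduates" and B = "gets a scholarship". The Python sketch below evaluates it together with the complement rule; the variable names are illustrative.

```python
import math

p_scholarship = 0.25            # p(B)
p_graduate = 0.75               # p(A), not needed for this conditional probability
p_both = 0.2                    # p(A and B)

# p(A | B) = p(A and B) / p(B), defined because p(B) != 0.
p_graduate_given_scholarship = p_both / p_scholarship

# Complement rule: p(A' | B) = 1 - p(A | B).
p_not_graduate_given_scholarship = 1 - p_graduate_given_scholarship

print(p_graduate_given_scholarship, p_not_graduate_given_scholarship)
```

So a student who gets a scholarship graduates with probability 0.2 / 0.25 = 0.8.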
Note: for any two events A and B the following relation holds.
$p(B) = p(B \mid A) \cdot p(A) + p(B \mid A') \cdot p(A')$
CHAPTER 6
6. PROBABILITY DISTRIBUTIONS
Random variables
One of the fundamental concepts of probability theory is that of a random variable.
Definition
A random variable is a variable that assumes numerical values associated with events of an
experiment. Usually denoted by capital letters.
Example 6.1 Observe 100 babies to be born in a clinic. The number of boys, which
have been born, is a random variable. It may take values from 0 to 100.
Example 6.2 Number of patients of a clinic daily is a random variable.
Example 6.3 Select one student from a university, measure his/her height and
record this height as x. Then x is a random variable, assuming values from, say,
100 cm to 250 cm depending on the specific student.
Example 6.4 the weight of babies at birth also is a random variable. It can assume
values in the interval, for example, from 800 grams to 6000 grams.
Classification of random variables: Random variables may be divided into two types:
discrete random variables and continuous random variables.
Definition
1. A discrete random variable is one that can assume only a countable number of values;
its values can be counted.
2. A continuous random variable can assume any value in one or more intervals on a
line.
Among the random variables described above, the number of boys in
Example 6.1 and the number of patients in Example 6.2 are discrete random
variables; the height of students and the weight of babies are continuous random
variables.
Examples of discrete random variables:
Toss a coin n times and count the number of heads.
Number of children in a family.
If X is a random variable, then it is a function from the elements of the sample
space to the set of real numbers, i.e.
$X: S \to \mathbb{R}$
A random variable takes a possible outcome and assigns a number to it.
Example: Flip a coin three times, and let X be the number of heads in three tosses.
$S = \{HHH, HHT, HTH, HTT, THH, THT, TTH, TTT\}$
$X(HHH) = 3$, $X(HHT) = X(HTH) = X(THH) = 2$,
$X(HTT) = X(THT) = X(TTH) = 1$,
$X(TTT) = 0$
So X takes the values {0, 1, 2, 3}.
X assumes a specific number of values with some probabilities.
Example: Consider the experiment of tossing a coin three times. Let X be the number of
heads. Construct the probability distribution of X.
Solution:
First identify the possible values that X can assume.
Calculate the probability of each possible distinct value of X and express X in
the form of a frequency distribution.
x          0     1     2     3
P(X = x)   1/8   3/8   3/8   1/8
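The construction above can be checked by enumerating the sample space. The following is a minimal sketch that assumes a fair coin; the variable names are illustrative only.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Sample space for three tosses: 2**3 = 8 equally likely outcomes (fair coin assumed).
outcomes = list(product("HT", repeat=3))

# X maps each outcome to its number of heads.
counts = Counter(outcome.count("H") for outcome in outcomes)

# P(X = x) = (number of outcomes with x heads) / (total number of outcomes)
distribution = {x: Fraction(c, len(outcomes)) for x, c in sorted(counts.items())}

for x, p in distribution.items():
    print(f"P(X = {x}) = {p}")
```

Exact fractions are used so that the probabilities can be compared directly with the table and summed to exactly 1.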
2. $\sum_x P(X = x) = 1$, if X is discrete;
$\int_x f(x)\,dx = 1$, if X is continuous.
Note:
1. If X is a continuous random variable then
$P(a < X < b) = \int_a^b f(x)\,dx$
2. The probability of a fixed value of a continuous random variable is zero, so
$P(a < X < b) = P(a \le X < b) = P(a < X \le b) = P(a \le X \le b)$
3. If X is a discrete random variable taking integer values, then
$P(a < X < b) = \sum_{x=a+1}^{b-1} P(x)$
$P(a \le X < b) = \sum_{x=a}^{b-1} P(x)$
$P(a < X \le b) = \sum_{x=a+1}^{b} P(x)$
$P(a \le X \le b) = \sum_{x=a}^{b} P(x)$
4. Probability means area for a continuous random variable.
Introduction to expectation
Definition:
1. Let a discrete random variable X assume the values $X_1, X_2, \ldots, X_n$ with the
probabilities $P(X_1), P(X_2), \ldots, P(X_n)$ respectively. Then the expected value of X,
denoted E(X), is defined as:
$E(X) = X_1 P(X_1) + X_2 P(X_2) + \cdots + X_n P(X_n) = \sum_{i=1}^{n} X_i P(X_i)$
Examples:
1. What is the expected value of a random variable X obtained by tossing a coin
three times, where X is the number of heads?
Solution:
First construct the probability distribution of X.
x          0     1     2     3
P(X = x)   1/8   3/8   3/8   1/8
$E(X) = X_1 P(X_1) + X_2 P(X_2) + \cdots + X_n P(X_n)$
$= 0 \cdot \tfrac{1}{8} + 1 \cdot \tfrac{3}{8} + 2 \cdot \tfrac{3}{8} + 3 \cdot \tfrac{1}{8} = 1.5$
2. Suppose a charity organization is mailing printed return-address stickers
to over one million homes in Ethiopia. Each recipient is asked to
donate $1, $2, $5, $10, $15, or $20. Based on past experience, the amount a
person donates is believed to follow the following probability distribution:
x          $1    $2    $5    $10   $15    $20
P(X = x)   0.1   0.2   0.3   0.2   0.15   0.05
$E(X^2) = \sum_{i=1}^{n} x_i^2 P(x_i)$, if X is discrete;
$E(X^2) = \int_x x^2 f(x)\,dx$, if X is continuous.
Examples:
1. Find the mean and the variance of the random variable X in example 2 above.
Solution:
x             $1    $2    $5    $10   $15     $20    Total
P(X = x)      0.1   0.2   0.3   0.2   0.15    0.05   1
x P(X = x)    0.1   0.4   1.5   2     2.25    1      7.25
x² P(X = x)   0.1   0.8   7.5   20    33.75   20     82.15
$E(X) = 7.25$
$Var(X) = E(X^2) - [E(X)]^2 = 82.15 - 7.25^2 = 29.59$
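The table arithmetic can be reproduced in a few lines. This is only a sketch of the same computation; the list names are illustrative.

```python
# Donation amounts (in dollars) and their probabilities, as given in the solution table.
values = [1, 2, 5, 10, 15, 20]
probs = [0.1, 0.2, 0.3, 0.2, 0.15, 0.05]

# E(X) = sum of x * P(X = x)
mean = sum(x * p for x, p in zip(values, probs))

# E(X^2) = sum of x^2 * P(X = x)
second_moment = sum(x**2 * p for x, p in zip(values, probs))

# Var(X) = E(X^2) - [E(X)]^2
variance = second_moment - mean**2

print(round(mean, 2), round(second_moment, 2), round(variance, 2))  # 7.25 82.15 29.59
```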
2. Two dice are rolled. Let X be a random variable denoting the sum of the numbers on the two dice.
i) Give the probability distribution of X.
ii) Compute the expected value of X and its variance.
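One way to approach this exercise is to enumerate the 36 equally likely outcomes. The sketch below assumes two fair six-sided dice; it is one possible route to the answer, not the only one.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 36 ordered outcomes of two fair dice, reduced to their sums.
sums = Counter(a + b for a, b in product(range(1, 7), repeat=2))

# i) Probability distribution of X = sum of the two faces.
dist = {s: Fraction(c, 36) for s, c in sorted(sums.items())}

# ii) E(X) and Var(X) = E(X^2) - [E(X)]^2, kept as exact fractions.
mean = sum(s * p for s, p in dist.items())
variance = sum(s**2 * p for s, p in dist.items()) - mean**2

for s, p in dist.items():
    print(f"P(X = {s}) = {p}")
print("E(X) =", mean, " Var(X) =", variance)
```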
There are some general rules for mathematical expectation.
Let X and Y be random variables and k be a constant.
RULE 1 $E(k) = k$
RULE 2 $Var(k) = 0$
RULE 3 $E(kX) = kE(X)$
RULE 4 $Var(kX) = k^2 Var(X)$
RULE 5 $E(X + Y) = E(X) + E(Y)$
COMMON PROBABILITY DISTRIBUTIONS
COMMON DISCRETE PROBABILITY DISTRIBUTIONS
1. Binomial Distribution
b) $P(X \ge 2) = ?$
$P(X \ge 2) = P(X = 2) + P(X = 3) + P(X = 4) + P(X = 5) + P(X = 6)$
$= 0.324 + 0.185 + 0.060 + 0.010 + 0.001$
$= 0.58$
c) $P(X \le 3) = ?$
$P(X \le 3) = P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3)$
$= 0.118 + 0.303 + 0.324 + 0.185$
$= 0.93$
d) $P(X < 5) = ?$
$P(X < 5) = 1 - P(X \ge 5)$
$= 1 - \{P(X = 5) + P(X = 6)\}$
$= 1 - (0.010 + 0.001)$
$= 0.989$
Remark: If X is a binomial random variable with parameters n and p, then
$E(X) = np$, $Var(X) = npq$, where $q = 1 - p$.
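The probabilities used in parts b) to d) above are consistent with a binomial distribution with n = 6 and p = 0.3. These parameters are not stated in the surviving text; they are inferred here from the listed values and should be treated as an assumption. Under that assumption, a short sketch reproduces the results:

```python
from math import comb

def binom_pmf(x, n, p):
    """P(X = x) for a binomial random variable with parameters n and p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

n, p = 6, 0.3  # assumed: inferred from the probabilities listed in the example
pmf = [binom_pmf(x, n, p) for x in range(n + 1)]

p_at_least_2 = sum(pmf[2:])             # b) P(X >= 2)
p_at_most_3 = sum(pmf[:4])              # c) P(X <= 3)
p_less_than_5 = 1 - (pmf[5] + pmf[6])   # d) P(X < 5) via the complement

print(round(p_at_least_2, 2), round(p_at_most_3, 2), round(p_less_than_5, 3))
print("E(X) =", n * p, " Var(X) =", n * p * (1 - p))
```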
2. Poisson Distribution
- A random variable X is said to have a Poisson distribution if its probability
distribution is given by:
$P(X = x) = \dfrac{\lambda^x e^{-\lambda}}{x!}, \quad x = 0, 1, 2, \ldots$
where $\lambda$ = the average number.
Note:
The Poisson probability distribution provides a close approximation to the binomial
probability distribution when n is large and p is quite small or quite large, with $\lambda = np$:
$P(X = x) = \dfrac{(np)^x e^{-np}}{x!}, \quad x = 0, 1, 2, \ldots$
where $\lambda = np$ = the average number.
Usually we use this approximation if $np \le 5$. In other words, if $n \ge 20$ and $np \le 5$ [or
$n(1 - p) \le 5$], then we may use the Poisson distribution as an approximation to the binomial
distribution.
Example:
1. Find the binomial probability P(X = 3) by using the Poisson distribution if
$p = 0.01$ and $n = 200$.
Solution:
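A sketch of the computation, taking $\lambda = np = 2$ as in the note above. The exact binomial value is included only for comparison and is not part of the original example.

```python
from math import comb, exp, factorial

n, p, x = 200, 0.01, 3
lam = n * p  # lambda = np = 2

# Poisson approximation: lambda^x * e^(-lambda) / x!
poisson = lam**x * exp(-lam) / factorial(x)

# Exact binomial probability, shown for comparison only.
exact = comb(n, x) * p**x * (1 - p) ** (n - x)

print(f"Poisson approximation: {poisson:.4f}")
print(f"Exact binomial:        {exact:.4f}")
```

The two values agree to about two decimal places, which is the sense in which the approximation is "close" here.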
Exercises:
1. Suppose that 4% of all TVs made by A&B Company in 2000 are defective.
If eight of these TVs are randomly selected from across the country and
tested, what is the probability that exactly three of them are defective?
Assume that each TV is made independently of the others.
2. An allergist claims that 45% of the patients she tests are allergic to some
type of weed. What is the probability that
a) Exactly 3 of her next 4 patients are allergic to weeds?
b) None of her next 4 patients are allergic to weeds?
3. Explain why the following experiments are not binomial.
a) Rolling a die until a 6 appears.
b) Drawing 5 cards from a deck for a poker hand.
4. On average, five smokers pass a certain street corner every 10 minutes.
What is the probability that during a given 10 minutes the number of smokers
passing will be
a) 6 or fewer?
b) 7 or more?
c) exactly 8?
1. Normal Distribution
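A random variable X is said to have a normal distribution if its probability density function
is given by
$f(x) = \dfrac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}, \quad -\infty < x < \infty,\ -\infty < \mu < \infty,\ \sigma > 0$
where $\mu = E(X)$ and $\sigma^2 = Variance(X)$.
$\mu$ and $\sigma^2$ are the parameters of the normal distribution.
The standard normal variable Z, with mean 0 and variance 1, has density
$f(z) = \dfrac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2} z^2}$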
$P(Z < z) = 0.9868$
$P(Z < 0) + P(0 < Z < z) = 0.9868$
$0.50 + P(0 < Z < z) = 0.9868$
$P(0 < Z < z) = 0.9868 - 0.50 = 0.4868$
and from the table
$P(0 < Z < 2.22) = 0.4868$
$z = 2.22$
Solution
X is normal with mean $\mu = 80$ and standard deviation $\sigma = 4.8$.
a)
$P(X < 87.2) = P\left(\dfrac{X - \mu}{\sigma} < \dfrac{87.2 - 80}{4.8}\right)$
$= P(Z < 1.5)$
$= P(Z < 0) + P(0 < Z < 1.5)$
$= 0.50 + 0.4332 = 0.9332$
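The table lookup in part a) can be cross-checked numerically. This sketch uses `NormalDist` from Python's standard `statistics` module in place of the printed table.

```python
from statistics import NormalDist

mu, sigma = 80, 4.8
X = NormalDist(mu=mu, sigma=sigma)

# Standardize: z = (x - mu) / sigma
z = (87.2 - mu) / sigma

# P(X < 87.2) computed directly and via the standard normal distribution.
p_direct = X.cdf(87.2)
p_via_z = NormalDist().cdf(z)

print(f"z = {z:.2f}, P(X < 87.2) = {p_direct:.4f}")
```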
b)
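Solution
$P(X > 72.9) = 0.2005 \Rightarrow P\left(\dfrac{X - \mu}{\sigma} > \dfrac{72.9 - 62.4}{\sigma}\right) = 0.2005$
$\Rightarrow P\left(Z > \dfrac{10.5}{\sigma}\right) = 0.2005$
$\Rightarrow P\left(0 < Z < \dfrac{10.5}{\sigma}\right) = 0.50 - 0.2005 = 0.2995$
And from the table, $P(0 < Z < 0.84) = 0.2995$
$\Rightarrow \dfrac{10.5}{\sigma} = 0.84 \Rightarrow \sigma = 12.5$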
5. A random variable has a normal distribution with $\sigma = 5$. Find its mean if the
probability that the random variable will assume a value less than 52.5 is 0.6915.
Solution
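A possible worked solution, following the same standardization steps as the previous problems. The table value $P(0 < Z < 0.5) = 0.1915$ is taken from a standard normal table.

```latex
P(X < 52.5) = 0.6915
\;\Rightarrow\; P\!\left(Z < \frac{52.5 - \mu}{5}\right) = 0.6915
\\
\Rightarrow\; P\!\left(0 < Z < \frac{52.5 - \mu}{5}\right) = 0.6915 - 0.50 = 0.1915
\\
\text{From the table, } P(0 < Z < 0.5) = 0.1915
\;\Rightarrow\; \frac{52.5 - \mu}{5} = 0.5
\;\Rightarrow\; \mu = 52.5 - 2.5 = 50
```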
2. Chi-square distribution
CHAPTER 7
7. Sampling and Sampling Distribution
Note:
Let N = population size and n = sample size.
1. Suppose simple random sampling is used. When sampling is with replacement there are $N^n$ possible samples.
1. From a finite population of size N , randomly draw all possible samples of size
n.
2. Calculate the mean for each sample.
3. Summarize the mean obtained in step 2 in terms of frequency distribution or
relative frequency distribution.
Example:
Suppose we have a population of size $N = 5$, consisting of the ages of five children:
6, 8, 10, 12, and 14
Population mean $\mu = 10$
Population variance $\sigma^2 = 8$
Take samples of size 2 with replacement and construct the sampling distribution of the
sample mean.
Solution:
$N = 5$, $n = 2$
We have $N^n = 5^2 = 25$ possible samples since sampling is with replacement.
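Step 1: Draw all possible samples. Each cell below is the mean of the sample formed by the
row value (first draw) and the column value (second draw):
        6    8    10   12   14
6       6    7    8    9    10
8       7    8    9    10   11
10      8    9    10   11   12
12      9    10   11   12   13
14      10   11   12   13   14
Summarizing the sample means in a frequency distribution:
$\bar{X}$    Frequency
6     1
7     2
8     3
9     4
10    5
11    4
12    3
13    2
14    1
$\sigma_{\bar{X}}^2 = \dfrac{\sum (\bar{X}_i - \mu_{\bar{X}})^2 f_i}{\sum f_i} = \dfrac{100}{25} = 4 \neq \sigma^2$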
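The three construction steps can be sketched in code for this example. The names below are illustrative; the sketch simply enumerates every ordered sample of size 2 drawn with replacement.

```python
from collections import Counter
from itertools import product
from statistics import mean, pvariance

population = [6, 8, 10, 12, 14]
n = 2

# Step 1: all N**n ordered samples of size n, drawn with replacement.
samples = list(product(population, repeat=n))

# Step 2: the mean of each sample.
sample_means = [mean(s) for s in samples]

# Step 3: frequency distribution of the sample means.
freq = Counter(sample_means)

mu_xbar = mean(sample_means)         # mean of the sample means
var_xbar = pvariance(sample_means)   # variance of the sample means

print(sorted(freq.items()))
print("mean of sample means:", mu_xbar)
print("variance of sample means:", var_xbar, "| sigma^2 / n:", pvariance(population) / n)
```

The last line illustrates that the variance of the sample means equals the population variance divided by n, which is the relation stated in the remark that follows.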
Remark:
1. In general, if sampling is with replacement,
$\bar{X} \sim N\left(\mu, \dfrac{\sigma^2}{n}\right)$
$Z = \dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$
Central Limit Theorem
Given a population of any functional form with mean $\mu$ and finite variance $\sigma^2$, the
sampling distribution of $\bar{X}$, computed from samples of size n from the population, will be
approximately normally distributed with mean $\mu$ and variance $\sigma^2/n$ when the sample size
is large.
Standard error
If all samples of size n are taken from the same population, the mean of the
sample means, denoted by $\mu_{\bar{X}}$, equals the population mean $\mu$, and the
standard deviation of the sample means, denoted by $\sigma_{\bar{X}}$, equals $\sigma/\sqrt{n}$.
“What is the size of the sample which one should study?” is the question which
comes to the mind of every researcher.
Here is the formula for the sample size, which is obtained by solving the
maximum error of the estimate formula for the population mean for n:
$n = \left(\dfrac{Z_{\alpha/2}\,\sigma}{E}\right)^2$
Where
$Z_{\alpha/2}$ = the z value for the desired level of confidence
$\sigma$ = population standard deviation
E = maximum error to be tolerated
Comment: Since $\sigma$ is often unknown, a small-scale pilot study may be required to
estimate $\sigma$ using S.
Example: How large a sample should be taken to estimate the mean waiting time of
patients at a clinic with an error of ± 1.5 minutes and 95% confidence? Use s = 8.37
minutes.
Solution: $n = \left(\dfrac{Z_{\alpha/2}\,S}{E}\right)^2 = \dfrac{(Z_{0.025} \times 8.37)^2}{(1.5)^2} = \dfrac{(1.96 \times 8.37)^2}{(1.5)^2} \approx 119.6$, which is rounded up to 120.
Therefore, a sample of 120 patients should be taken.
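The same calculation as a small sketch. The function name is illustrative, and the result is rounded up because a sample size must be a whole number that still meets the error bound.

```python
from math import ceil

def sample_size(z, sd, max_error):
    """n = (z * sd / E)^2, rounded up to the next whole number."""
    return ceil((z * sd / max_error) ** 2)

# 95% confidence -> z = 1.96; s = 8.37 minutes; E = 1.5 minutes
n = sample_size(z=1.96, sd=8.37, max_error=1.5)
print(n)  # 120
```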
CHAPTER 8
8. ESTIMATION AND HYPOTHESIS TESTING
[Figure: diagram linking the population, the sample, the numerical data, data analysis, and inference]
Data analysis is the process of extracting relevant information from the summarized
data.
Statistical Estimation
This is one way of making inference about the population parameter where the
investigator does not have any prior notion about values or characteristics of the
population parameter.
There are two kinds of estimation.
1) Point Estimation
It is a procedure that results in a single value as an estimate for a parameter.
2) Interval Estimation
It is a procedure that results in an interval of values as an estimate for a
parameter, that is, an interval that contains the likely values of the parameter.
It deals with identifying the upper and lower limits of a parameter. The limits
themselves are random variables.
Definitions
Confidence Interval: An interval estimate with a specific level of confidence
Confidence Level: The percent of the time the true value will lie in the interval
estimate given.
Consistent Estimator: An estimator which gets closer to the value of the
parameter as the sample size increases.
Degrees of Freedom: The number of data values which are allowed to vary
once a statistic has been determined.
Estimator: A sample statistic which is used to estimate a population parameter.
A good estimator should be unbiased, consistent, and relatively efficient.
Estimate: One of the possible values which an estimator can assume.
Interval Estimate: A range of values used to estimate a parameter.
Point Estimate: A single value used to estimate a parameter.
Relatively Efficient Estimator: The estimator for a parameter with the
smallest variance.
Unbiased Estimator: An estimator whose expected value is the value of the
parameter being estimated.
We can phrase the latter question differently: How confident can we be that the
value of the statistic falls within a certain "distance" of the parameter? Or, what is the
probability that the parameter's value is within a certain range of the statistic's value?
This range is the confidence interval.
The confidence level is the probability that the value of the parameter falls within
the range specified by the confidence interval surrounding the statistic.
There are different cases to be considered to construct confidence intervals.
Case 1: If the sample size is large or if the population is normal with known
variance
Recall the Central Limit Theorem, which applies to the sampling distribution of the
mean of a sample. Consider samples of size n drawn from a population whose mean
is $\mu$ and standard deviation is $\sigma$, with replacement and order important. The
population can have any frequency distribution. The sampling distribution of $\bar{X}$ will
have a mean $\mu_{\bar{x}} = \mu$ and a standard deviation $\sigma_{\bar{x}} = \sigma/\sqrt{n}$, and approaches a normal
distribution as n gets large. This allows us to use the normal distribution curve for
computing confidence intervals.
$Z = \dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}}$ has a normal distribution with mean 0 and variance 1
$\Rightarrow \mu = \bar{X} \pm Z\,\sigma/\sqrt{n} = \bar{X} \pm \varepsilon$, where $\varepsilon$ is a measure of error
$\Rightarrow \varepsilon = Z\,\sigma/\sqrt{n}$
- For the interval estimator to be good the error should be small. How can it be made small?
By making n large
By having small variability
By taking Z small
- To obtain the value of Z, we have to attach this to a theory of chance. That is, there is an
area of size $1 - \alpha$ such that
$P(-Z_{\alpha/2} < Z < Z_{\alpha/2}) = 1 - \alpha$
Here are the z values corresponding to the most commonly used confidence levels.
$100(1 - \alpha)\%$   $\alpha$     $\alpha/2$   $Z_{\alpha/2}$
90                0.10    0.05    1.645
95                0.05    0.025   1.96
99                0.01    0.005   2.58
Case 3: If the sample size is small and the population variance $\sigma^2$ is not known.
$t = \dfrac{\bar{X} - \mu}{S/\sqrt{n}}$ has a t distribution with $n - 1$ degrees of freedom.
$(\bar{X} - t_{\alpha/2}\,S/\sqrt{n},\ \bar{X} + t_{\alpha/2}\,S/\sqrt{n})$ is a $100(1 - \alpha)\%$ confidence interval for $\mu$.
The unit of
measurement of the confidence interval is the standard error. This is just the standard
deviation of the sampling distribution of the statistic.
Examples:
1. From a normal sample of size 25 a mean of 32 was found. Given that the
population standard deviation is 4.2, find
a) A 95% confidence interval for the population mean.
b) A 99% confidence interval for the population mean.
Solution:
b)
$\bar{X} = 32$, $\sigma = 4.2$, $1 - \alpha = 0.99 \Rightarrow \alpha = 0.01$, $\alpha/2 = 0.005$
$Z_{\alpha/2} = 2.58$ from table.
The required interval will be $\bar{X} \pm Z_{\alpha/2}\,\sigma/\sqrt{n}$
$= 32 \pm 2.58 \times 4.2/\sqrt{25}$
$= 32 \pm 2.17$
$= (29.83, 34.17)$
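A short sketch of the interval computation for both parts of this example. The helper name `z_interval` is hypothetical, and the z values are the table values for the 95% and 99% confidence levels (1.96 and 2.58).

```python
from math import sqrt

def z_interval(xbar, sigma, n, z):
    """Return (lower, upper) for xbar +/- z * sigma / sqrt(n)."""
    margin = z * sigma / sqrt(n)
    return xbar - margin, xbar + margin

# a) 95% and b) 99% confidence intervals for the population mean
for level, z in [(95, 1.96), (99, 2.58)]:
    lower, upper = z_interval(xbar=32, sigma=4.2, n=25, z=z)
    print(f"{level}% CI: ({lower:.2f}, {upper:.2f})")
```

The 99% interval is wider than the 95% interval, reflecting the larger z value.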
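Solution:
$\bar{X} = 2.28$, $S = 0.95$, $1 - \alpha = 0.95 \Rightarrow \alpha = 0.05$, $\alpha/2 = 0.025$
$t_{\alpha/2} = 2.571$ with df = 5 from table.
The required interval will be $\bar{X} \pm t_{\alpha/2}\,S/\sqrt{n}$
$= 2.28 \pm 2.571 \times 0.95/\sqrt{6}$
$= 2.28 \pm 0.997$
$= (1.28, 3.28)$
That is, we can be 95% confident that the mean decrease in blood pressure is between 1.28
and 3.28 points.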
Hypothesis Testing
- This is also one way of making inference about population parameter, where the
investigator has prior notion about the value of the parameter.
                 Decision
Truth        Reject H0          Don't reject H0
H0 true      Type I Error       Right Decision
H1 true      Right Decision     Type II Error
CASES:
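Case 1: When sampling is from a normal distribution with $\sigma^2$ known
- The relevant test statistic is
$Z = \dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}}$
- After specifying $\alpha$ we have the following regions (critical and acceptance) on the
standard normal distribution corresponding to the above three hypotheses.
Summary table for decision rule.
Where: $Z_{cal} = \dfrac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$
Where: $t_{cal} = \dfrac{\bar{X} - \mu_0}{S/\sqrt{n}}$
$Z_{cal} = \dfrac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$, if $\sigma^2$ is known;
$Z_{cal} = \dfrac{\bar{X} - \mu_0}{S/\sqrt{n}}$, if $\sigma^2$ is unknown.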
Examples:
1. Test the hypothesis that the average content of containers of a certain lubricant is
10 liters if the contents of a random sample of 10 containers are 10.2, 9.7, 10.1, 10.3,
10.1, 9.8, 9.9, 10.4, 10.3, and 9.8 liters. Use the 0.01 level of significance and assume
that the distribution of contents is normal.
Solution:
Let $\mu$ = population mean, $\mu_0 = 10$
Step 1: Identify the appropriate hypothesis
$H_0: \mu = 10$ vs $H_1: \mu \neq 10$
Step 2: Select the level of significance, $\alpha = 0.01$ (given)
Step 6: Decision
Accept H0, since $t_{cal}$ is in the acceptance region.
Step 7: Conclusion
At the 1% level of significance, we have no evidence to say that the average content of
containers of the given lubricant is different from 10 liters, based on the given sample data.
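The intermediate steps of this example did not survive in the text, so the following is only a sketch of how the test statistic could be computed from the sample, assuming a two-sided t test with n - 1 = 9 degrees of freedom. The critical value 3.250 is the t-table value for $\alpha/2 = 0.005$ and 9 degrees of freedom, entered by hand.

```python
from math import sqrt
from statistics import mean, stdev

contents = [10.2, 9.7, 10.1, 10.3, 10.1, 9.8, 9.9, 10.4, 10.3, 9.8]
mu0 = 10.0
t_critical = 3.250  # assumed: t-table value for alpha/2 = 0.005, df = 9

n = len(contents)
xbar = mean(contents)
s = stdev(contents)  # sample standard deviation (divisor n - 1)

# t_cal = (xbar - mu0) / (S / sqrt(n))
t_cal = (xbar - mu0) / (s / sqrt(n))

# Two-sided test: reject H0 only if |t_cal| exceeds the critical value.
reject_h0 = abs(t_cal) > t_critical
print(f"xbar = {xbar:.2f}, s = {s:.4f}, t_cal = {t_cal:.3f}, reject H0: {reject_h0}")
```

The computed statistic is well inside the acceptance region, which agrees with the decision stated in Step 6.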
2. The mean lifetime of a sample of 16 fluorescent light bulbs produced by a company is
computed to be 1570 hours. The population standard deviation is 120 hours. Suppose the
hypothesized value for the population mean is 1600 hours. Can we conclude that the
lifetime of light bulbs is decreasing?
(Use $\alpha = 0.05$ and assume the normality of the population.)
Solution:
Let $\mu$ = population mean, $\mu_0 = 1600$
Step 1: Identify the appropriate hypothesis
$H_0: \mu = 1600$ vs $H_1: \mu < 1600$
Step 2: Select the level of significance, $\alpha = 0.05$ (given)
Step 3: Select an appropriate test statistic
The Z statistic is appropriate because the population variance is known.
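The remaining steps are not shown in the text. A sketch of how they might proceed, assuming a left-tailed test with the table critical value $-Z_{0.05} = -1.645$:

```python
from math import sqrt

xbar, mu0, sigma, n = 1570, 1600, 120, 16
z_critical = -1.645  # left-tailed test at alpha = 0.05 (table value)

# Z_cal = (xbar - mu0) / (sigma / sqrt(n))
z_cal = (xbar - mu0) / (sigma / sqrt(n))

# Reject H0 only if Z_cal falls in the left tail.
reject_h0 = z_cal < z_critical
print(f"Z_cal = {z_cal:.2f}, reject H0: {reject_h0}")
```

Under these assumptions the statistic does not fall in the critical region, so the sample would not support the claim that the mean lifetime has decreased.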
3. It is known in a pharmacological experiment that rats fed with a particular diet over a
certain period gain an average of 40 gms in weight. A new diet was tried on a sample of
20 rats, yielding a mean weight gain of 43 gms with variance 7 gms². Test the hypothesis that
the new diet is an improvement, assuming normality.
a) State the appropriate hypothesis
b) What is the appropriate test statistic? Why?
c) Identify the critical region(s)
d) On the basis of the given information test the hypothesis and make
conclusion.
Solution (exercise).
Test of Association
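                        B
A      B1     B2     . .    Bj     .    Bc     Total
A1     O11    O12           O1j         O1c    R1
A2     O21    O22           O2j         O2c    R2
.
.
Ai     Oi1    Oi2           Oij         Oic    Ri
.
.
Ar     Or1    Or2           Orj         Orc
Remark:
$n = \sum_{i=1}^{r}\sum_{j=1}^{c} O_{ij} = \sum_{i=1}^{r}\sum_{j=1}^{c} e_{ij}$
- The null and alternative hypotheses may be stated as:
$H_0$: There is no association between A and B.
$H_1$: not $H_0$ (there is association between A and B).
Decision Rule:
Reject $H_0$ if $\chi^2_{cal} = \sum_{i=1}^{r}\sum_{j=1}^{c} \dfrac{(O_{ij} - e_{ij})^2}{e_{ij}} > \chi^2_{\alpha}\big((r - 1)(c - 1)\big)$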
Examples:
1. A geneticist took a random sample of 300 men to study whether there is association
between father and son regarding boldness. He obtained the following results.
            Son
Father   Bold   Not
Bold     85     59
Not      65     91
Using $\alpha = 5\%$, test whether there is association between father and son regarding
boldness.
Solution:
$H_0$: There is no association between father and son regarding boldness.
$H_1$: not $H_0$
$e_{21} = \dfrac{R_2 \times C_1}{n} = \dfrac{156 \times 150}{300} = 78$
$\alpha = 0.05$
Degrees of freedom $= (r - 1)(c - 1) = 1 \times 1 = 1$
$\chi^2_{0.05}(1) = 3.841$ from table.
- The decision is to reject $H_0$ since $\chi^2_{cal} > \chi^2_{0.05}(1)$.
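The calculated chi-square value is not shown in the surviving text. The sketch below computes the expected frequencies $e_{ij} = R_i C_j / n$ and the statistic $\sum (O_{ij} - e_{ij})^2 / e_{ij}$ for the 2 x 2 table of this example; the critical value 3.841 is the table value quoted above.

```python
observed = [[85, 59],   # father bold:     son bold, son not
            [65, 91]]   # father not bold: son bold, son not

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected frequency under H0: e_ij = R_i * C_j / n
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Chi-square statistic: sum over all cells of (O - e)^2 / e
chi2_cal = sum(
    (o - e) ** 2 / e
    for o_row, e_row in zip(observed, expected)
    for o, e in zip(o_row, e_row)
)

critical = 3.841  # chi-square table value, alpha = 0.05, df = 1
print(f"chi2_cal = {chi2_cal:.3f}, reject H0: {chi2_cal > critical}")
```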
Solution:
H 0 : There is no association between the size of the family and the level of
education attained by fathers.
H 1 : not H 0 .
CHAPTER 9
Linear regression and correlation analysis study and measure the linear relationship
among two or more variables. When only two variables are involved, the analysis is
referred to as simple correlation and simple linear regression analysis, and when
there are more than two variables the terms multiple regression and partial
correlation are used.
Correlation Analysis: deals with the measurement of the closeness of the
relationship which is described in the regression equation.
We say there is correlation when the two series of items vary together directly or
inversely.
Simple Correlation
Examples:
- Income and expenditure
- Number of hours spent in studying and the score obtained
- Height and weight
- Distance covered and fuel consumed by car.
When higher values of X are associated with lower values of Y and lower
values of X are associated with higher values of Y, then the correlation is said
to be negative or inverse.
Examples:
- Demand and supply
- Income and the proportion of income spent on food.
The correlation between X and Y may be one of the following:
1. Perfect positive (correlation coefficient r = 1)
2. Positive (r between 0 and 1)
3. No correlation (r = 0)
4. Negative (r between -1 and 0)
5. Perfect negative (r = -1)
The presence of correlation between two variables may be due to three reasons:
1. One variable being the cause of the other. The cause is called the “subject”
or “independent” variable, while the effect is called the “dependent” variable.
2. Both variables being the result of a common cause. That is, the
correlation that exists between two variables is due to their being related
to some third force.
Example:
Let X1 = ESLCE result
Y1 = rate of surviving in the university
Y2 = rate of getting a scholarship.
3. Chance:
Examples:
Price of teff in Addis Ababa and grades of students in the USA.
Weight of individuals in Ethiopia and income of individuals in Kenya.
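$r = \dfrac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2}}$
and the short cut formula is
$r = \dfrac{n\sum XY - (\sum X)(\sum Y)}{\sqrt{[n\sum X^2 - (\sum X)^2]\,[n\sum Y^2 - (\sum Y)^2]}}$
$r = \dfrac{\sum XY - n\bar{X}\bar{Y}}{\sqrt{[\sum X^2 - n\bar{X}^2]\,[\sum Y^2 - n\bar{Y}^2]}}$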
Remark:
Interpretation of r
Examples:
Solution:
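$n = 10$, $\bar{X} = 31.2$, $\bar{Y} = 32.9$, $\bar{X}^2 = 973.4$, $\bar{Y}^2 = 1082.4$
$\sum XY = 10331$, $\sum X^2 = 9920$, $\sum Y^2 = 11003$
$r = \dfrac{\sum XY - n\bar{X}\bar{Y}}{\sqrt{[\sum X^2 - n\bar{X}^2]\,[\sum Y^2 - n\bar{Y}^2]}}$
$= \dfrac{10331 - 10(31.2)(32.9)}{\sqrt{(9920 - 10(973.4))\,(11003 - 10(1082.4))}}$
$= \dfrac{66.2}{182.5} = 0.363$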
This means mid semester exam and final exam scores have a slightly positive
correlation.
2. The following data were collected from a certain household on the monthly
income (X) and consumption (Y) for the past 10 months. Compute the simple
correlation coefficient.( Exercise)
X: 650 654 720 456 536 853 735 650 536 666
Y: 450 523 235 398 500 632 500 635 450 360
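As a starting point for this exercise, the short cut formula for r can be written as a small function. The function name `pearson_r` is illustrative; the data are the income and consumption values listed above.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Simple correlation coefficient using the short cut formula:
    r = [n*Sxy - Sx*Sy] / sqrt([n*Sxx - Sx^2] * [n*Syy - Sy^2])."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx**2) * (n * syy - sy**2))

income = [650, 654, 720, 456, 536, 853, 735, 650, 536, 666]
consumption = [450, 523, 235, 398, 500, 632, 500, 635, 450, 360]

r = pearson_r(income, consumption)
print(f"r = {r:.3f}")
```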
Example:
Aster and Almaz were asked to rank 7 different types of lipsticks. See if there is
correlation between the tastes of the ladies.
Lipsticks   A   B   C   D   E   F   G
Aster       2   1   4   3   5   7   6
Almaz       1   3   2   4   5   6   7
Solution:
X (R1)   Y (R2)   R1-R2 (D)   D²
2        1        1           1
1        3        -2          4
4        2        2           4
3        4        -1          1
5        5        0           0
7        6        1           1
6        7        -1          1
Total                         12
$r_s = 1 - \dfrac{6\sum D_i^2}{n(n^2 - 1)} = 1 - \dfrac{6(12)}{7(48)} = 0.786$
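The rank correlation computation above can be sketched as follows, using the rankings given by Aster and Almaz. The variable names are illustrative.

```python
aster = [2, 1, 4, 3, 5, 7, 6]
almaz = [1, 3, 2, 4, 5, 6, 7]

n = len(aster)

# Sum of squared rank differences, sum of D_i^2
d_squared = sum((r1 - r2) ** 2 for r1, r2 in zip(aster, almaz))

# r_s = 1 - 6 * sum(D^2) / (n * (n^2 - 1))
r_s = 1 - 6 * d_squared / (n * (n**2 - 1))

print(d_squared, round(r_s, 3))  # 12 0.786
```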
- Simple linear regression refers to the linear relationship between two variables.
- We usually denote the dependent variable by Y and the independent variable by
X.
- A simple regression line is the line fitted to the points plotted in the scatter
diagram, which would describe the average relationship between the two
variables. Therefore, to see the type of relationship, it is advisable to prepare a
scatter plot before fitting the model.
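$Y = \alpha + \beta X + \varepsilon$
Where: Y = dependent variable
X = independent variable
$\alpha$ = regression constant
$\beta$ = regression slope
$\varepsilon$ = random disturbance term
$Y \sim N(\alpha + \beta X, \sigma^2)$
$\varepsilon \sim N(0, \sigma^2)$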
In the fitted line $\hat{Y} = a + bX$, a is a constant which gives the value of Y when X = 0. It is called the Y-
intercept. b is a constant indicating the slope of the regression line, and it gives a
measure of the change in Y for a unit change in X. It is also the regression coefficient of
Y on X.
- a and b are found by minimizing $SSE = \sum \varepsilon^2 = \sum (Y_i - \hat{Y}_i)^2$
Where: $Y_i$ = observed value
$\hat{Y}_i$ = estimated value $= a + bX_i$
$b = \dfrac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} = \dfrac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2}$
$a = \bar{Y} - b\bar{X}$
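The least-squares formulas for a and b can be sketched as a small function. The function name and the five data points below are hypothetical, chosen only so the arithmetic is easy to follow by hand; they are not taken from the examples in this chapter.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares estimates (a, b) for the line Y-hat = a + b*X."""
    x_bar, y_bar = mean(xs), mean(ys)
    # b = sum (X - X_bar)(Y - Y_bar) / sum (X - X_bar)^2
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    # a = Y_bar - b * X_bar
    a = y_bar - b * x_bar
    return a, b

# Hypothetical toy data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = fit_line(xs, ys)
print(f"Y-hat = {a:.2f} + {b:.2f} X")
print("prediction at X = 6:", round(a + b * 6, 2))
```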
Example 1: The following data shows the score of 12 students for Accounting and
Statistics Examinations.
Accounting Statistics
X Y
1 74.00 81.00
2 93.00 86.00
3 55.00 67.00
4 41.00 35.00
5 23.00 30.00
6 92.00 100.00
    Accounting   Statistics
    X            Y            X²        Y²         XY
1 74.00 81.00 5476.00 6561.00 5994.00
2 93.00 86.00 8649.00 7396.00 7998.00
3 55.00 67.00 3025.00 4489.00 3685.00
4 41.00 35.00 1681.00 1225.00 1435.00
5 23.00 30.00 529.00 900.00 690.00
6 92.00 100.00 8464.00 10000.00 9200.00
a)
The Coefficient of Correlation (r) has a value of 0.92. This indicates that the two
variables are positively correlated (Y increases as X increases).
b)
Using OLS:
$\hat{Y} = 7.0194 + 0.9560X$
For X = 85: $\hat{Y} = 7.0194 + 0.9560(85) = 88.28$
Example 2:
A car rental agency is interested in studying the relationship between the distance
driven in kilometer (Y) and the maintenance cost for their cars (X in birr). The
following summarized information is given based on samples of size 5.
(Exercise)
$\sum_{i=1}^{5} X_i^2 = 147{,}000{,}000$, $\sum_{i=1}^{5} Y_i^2 = 314$
- To know how far the regression equation has been able to explain the variation in Y
we use a measure called the coefficient of determination ($r^2$),
i.e. $r^2 = \dfrac{\sum (\hat{Y} - \bar{Y})^2}{\sum (Y - \bar{Y})^2}$
where r = the simple correlation coefficient.
ii. $r = \dfrac{b S_X}{S_Y} \Rightarrow b = \dfrac{r S_Y}{S_X}$
- When we fit the regression of X on Y, we interchange X and Y in all formulas,
i.e. we fit
$\hat{X} = a_1 + b_1 Y$
$b_1 = \dfrac{\sum XY - n\bar{X}\bar{Y}}{\sum Y^2 - n\bar{Y}^2}$
$a_1 = \bar{X} - b_1\bar{Y}$, $r = \dfrac{b_1 S_Y}{S_X}$
Here X is dependent and Y is independent.
Example: The regression lines between height (X) in inches and weight (Y) in lbs
of male students are:
$4Y - 15X + 530 = 0$ and
$20X - 3Y - 975 = 0$
Solution
We will assume one of the equations is the regression of X on Y and the other is the regression of Y on X,
and calculate r.
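A possible completion of this solution is sketched below. The signs in the two equations were lost in the source; the derivation assumes they read $4Y - 15X + 530 = 0$ and $20X - 3Y - 975 = 0$, which is the only sign choice that gives positive heights and weights and $|r| \le 1$. It uses the relations $r = bS_X/S_Y$ and $r = b_1 S_Y/S_X$, so that $r^2 = b \cdot b_1$.

```latex
\text{Take the first equation as the regression of } Y \text{ on } X:\quad
Y = \tfrac{15}{4}X - \tfrac{530}{4} \;\Rightarrow\; b = \tfrac{15}{4} = 3.75
\\
\text{Take the second as the regression of } X \text{ on } Y:\quad
X = \tfrac{3}{20}Y + \tfrac{975}{20} \;\Rightarrow\; b_1 = \tfrac{3}{20} = 0.15
\\
r^2 = b \cdot b_1 = 3.75 \times 0.15 = 0.5625
\;\Rightarrow\; r = 0.75 \quad (\text{positive, since both slopes are positive})
\\
\text{The opposite assignment gives } b \cdot b_1 = \tfrac{20}{3} \cdot \tfrac{4}{15} \approx 1.78 > 1,
\text{ which is impossible, so it is rejected.}
```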