
QAB - II - Lecture - Notes Statistic

Quantitative analysis of a business Notes Step by step

QUANTITATIVE ANALYSIS FOR BUSINESS TWO CIN 1207

Sampling Techniques

Simple Random Sampling

Description: - Equal chance, unbiased


Methods
Business Examples
Advantages
Disadvantages
Uses or applications of the technique

Systematic Sampling
Description: The first item is selected at random, then every kth item thereafter
Method: Calculation
Formula
Business Examples
Advantages
Disadvantages
Uses or applications of the technique

Stratified Sampling
Description: Homogeneous strata, random sampling within each stratum

Cluster Sampling
Description: Heterogeneous clusters

Overview of Statistics
Before examining the broad areas of statistics, it is necessary to become familiar
with certain terms and concepts used extensively in the subject.

1. Random Variable- A characteristic being measured or observed is called a
variable, for example weight or height. Since a variable can take on different
values at each measurement or observation, it is termed a random variable,
for example the distance travelled per day by a delivery truck.
2. Sampling Unit- A sampling unit is the item or individual being measured or
counted with respect to the random variable under study for example the
random variable is distance and the sampling unit is each delivery truck.
3. Population – The collection of all observations of a random variable
under study, about which the researcher is trying to draw conclusions.
A population must be defined in very specific terms to include only those
sampling units with characteristics that are relevant to the problem, for
example all the delivery vehicles in Zimbabwe.
4. Sample- Not every member of the population is observable or measurable for
reasons mainly of cost and time. A subset of the population on which
observations are made or measurements taken is referred to as a sample for
example a random sample of two hundred delivery vehicles is selected and
their daily distances travelled are recorded.
There are two major components in the discipline of statistics
a) Descriptive Statistics- It aims to identify the essential characteristics of a
random variable and produce a profile of its behaviour. This is achieved
through summary measures.
b) Inferential Statistics- This generalizes sample findings to the broader
population. It is the area of statistics which extends the information extracted
from a sample to the actual environment.
1. Qualitative and Quantitative Data
2. Scales of Measurement
a) Nominal-scaled data: categorizes data, for example gender and profession.
b) Ordinal-scaled data: ranks data, for example Likert-type scales.
Statement: NUST is the best university in Zimbabwe.
Options: strongly disagree, disagree, neutral, agree, strongly agree.
c) Interval-scaled data
d) Ratio-scaled data

3. Data Collection Methods


4. Data sources
Data can be classified into two classes that is discrete or continuous
DISCRETE DATA

A random variable whose observations can take only specific values, which are
integers (whole numbers), is referred to as a discrete random variable. In such
instances certain values are valid whilst others are invalid, e.g. the number of
cars in a parking lot at a given time, the number of students in a class or the
number of employees in an organization.

CONTINUOUS DATA
A random variable whose observations can take on any value in an interval is
said to generate continuous data for example the mass of a person, distance
travelled, and time taken to travel to work daily.

STRUCTURE OF THE COURSE

The course is organized under the following general headings


1. Methods to describe the characteristics of a random variable.
a) Presentation of data
b) Measures of central tendency
c) Measures of dispersion
d) Skewness
2) Quantifying Uncertainty
a) Basic Probability Concepts
b) Probability distributions- The Binomial distribution
- The Poisson distribution
- The Normal distribution
3) Inferential Statistics
a) The basics of sampling
b) Confidence Intervals
c) Hypothesis Testing
d) The chi-squared distribution

4) Forecasting
a) Regression and Correlation
Section 1

PRESENTATION OF DATA
UNGROUPED DATA
There are a number of ways in which ungrouped data can be presented, such as
frequency distribution tables and stem and leaf plots.
A frequency distribution is a table which lists each data value with its
corresponding frequency. The following data correspond to the performance of
students in a test. Construct the frequency distribution table to illustrate the
information. The random variable is marks, represented by X.
X1=10 X6=20 X11=25 X16=21
X2=20 X7=27 X12=15 X17=80
X3=25 X8=80 X13=17 X18=10
X4=15 X9=15 X14=25 X19=27
X5=10 X10=20 X15=30 X20=21

Mark    Frequency
10      3
15      3
17      1
20      3
21      2
25      3
27      2
30      1
80      2

∑f = 20
Stem and Leaf Plots
A stem and leaf plot is a way of summarizing data. It can be constructed in two
phases, a rough draft and a final draft. Each number is divided into two parts, one
called a stem and the other called a leaf.
NB: In the final draft the leaves are arranged in ascending order and it is important
to have a key.
Example: The following data show the ages in years of people who were
shopping at a supermarket one afternoon. Construct a stem and leaf plot to present
the data.

55 15 25 50 28 66 73 25 24 47 10 45 54
55 55 43 57 53 65 38 30 29 64 12 70 16
24 25 40 15 36 53 57 24 27

Rough Draft

Stem | Leaf
1    | 0 6 5 2 5
2    | 9 4 5 5 5 4 4 8 7
3    | 0 6 8
4    | 3 5 0 7
5    | 5 7 3 4 3 7 0 5 5
6    | 6 4 5
7    | 3 0

Final Draft

Stem | Leaf
1    | 0 2 5 5 6
2    | 4 4 4 5 5 5 7 8 9
3    | 0 6 8
4    | 0 3 5 7
5    | 0 3 3 4 5 5 5 7 7
6    | 4 5 6
7    | 0 3

Key: 1 | 0 = 10
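
The two-phase construction can also be checked mechanically. The following is a minimal Python sketch, assuming two-digit values so that the tens digit is the stem and the units digit is the leaf; the function name `stem_and_leaf` is illustrative and not part of the notes.

```python
from collections import defaultdict

# Ages of the shoppers from the example above
ages = [55, 15, 25, 50, 28, 66, 73, 25, 24, 47, 10, 45, 54,
        55, 55, 43, 57, 53, 65, 38, 30, 29, 64, 12, 70, 16,
        24, 25, 40, 15, 36, 53, 57, 24, 27]


def stem_and_leaf(values):
    """Return {stem: leaves in ascending order} (stem = tens, leaf = units)."""
    plot = defaultdict(list)
    for value in sorted(values):      # sorting first gives the final draft directly
        stem, leaf = divmod(value, 10)
        plot[stem].append(leaf)
    return dict(plot)


plot = stem_and_leaf(ages)
for stem in sorted(plot):
    print(stem, "|", " ".join(str(leaf) for leaf in plot[stem]))
print("Key: 1 | 0 = 10")
```

A useful check is that the total number of leaves equals the number of observations (35 here).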


Example
Twenty-five students obtained the following marks in a statistics test and an
economics test. Prepare a back-to-back stem and leaf plot for the data and comment
on the overall performance of the students in both subjects.
Marks in Stats
75 34 29 91 81 47
30 20 21 32 32 68
58 36 18 15 23 30
21 23 28 22 34 45

Marks in Economics
16 29 45 58 64 78
42 54 66 72 34 35
54 91 74 24 84 92
70 78 54 52 18 41 65

Rough Draft Back to Back Stem and Leaf Plot

Marks in Statistics | Stem | Marks in Economics
                5 8 |  1   | 6 8
    2 8 3 1 3 1 0 9 |  2   | 9 4
      4 0 6 2 2 0 4 |  3   | 4 5
                5 7 |  4   | 5 2 1
                  8 |  5   | 8 4 4 4 2
                  8 |  6   | 4 6 5
                  5 |  7   | 8 2 4 0 8
                  1 |  8   | 4
                  1 |  9   | 1 2

Final Draft Back to Back Stem and Leaf Plot

Marks in Statistics | Stem | Marks in Economics
                8 5 |  1   | 6 8
    9 8 3 3 2 1 1 0 |  2   | 4 9
      6 4 4 2 2 0 0 |  3   | 4 5
                7 5 |  4   | 1 2 5
                  8 |  5   | 2 4 4 4 8
                  8 |  6   | 4 5 6
                  5 |  7   | 0 2 4 8 8
                  1 |  8   | 4
                  1 |  9   | 1 2

Key (Statistics): 5 | 1 = 15
Key (Economics): 2 | 4 = 24


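The stem and leaf plot shows that students performed better in economics than in
statistics: most statistics marks fall in the 20s and 30s, while the economics
marks are spread across the higher stems.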

Exercise
The data below show the number of villagers interviewed in different villages of the
country.
1) Construct a frequency distribution to illustrate the data
2) Construct a stem and leaf plot of the data

176 168 168 147 156


152 180 153 165 160
140 134 168 143 171
158 136 162 166 155
170 153 174 170 169

Grouped Data
Type A: Continuous
Data is grouped into classes, for example 20≤30 and 30≤40.
If the random variable is x, the class 20≤30 contains all values with 20 ≤ x < 30,
where 20 is the lower class limit and 30 is the upper class limit.

We use the abbreviations LCL and UCL to denote these. The difference between the
UCL and the LCL is called the class interval, class length or class width.
Class width = UCL – LCL
= 30 – 20
= 10
The sum of the LCL and UCL divided by two gives the midpoint of a class, usually
denoted as x:

$\text{Midpoint } (x) = \frac{\text{LCL} + \text{UCL}}{2}$

Class Limits X
20≤30 25
30≤40 35
40≤50 45
50≤60 55
60≤70 65
70≤80 75
80≤90 85

NB. The UCL of the first class is the LCL of the succeeding class.

Type B: Discrete
Classes Adjusted Classes
5-9 4.5≤9.5
10-14 9.5≤14.5
15-19 14.5≤19.5
20-24 19.5≤24.5
25-29 24.5≤29.5

In type B classifications we need to adjust the class limits before any statistical
procedures can be carried out. We adjust the classes by adding one half (0.5) to the
original UCL and subtracting one half (0.5) from the original LCL, for example 5-9
becomes (5 – 0.5) ≤ (9 + 0.5) = 4.5≤9.5.

If the classes are already given in the "less than" form (as in type A), there is no
need to adjust them.

There are a number of ways of presenting grouped data, such as the frequency
distribution, histogram, frequency polygon and cumulative frequency distribution.

Example

The owner of a small business wants to analyse profits over the past 25-day period.
Using a class interval of 5 beginning at 20, construct a

a) Frequency distribution

b) Histogram

c) Frequency polygon

N.B. In the original classes you use an interval of four (for example 20-24) so that
the adjusted classes have an interval of 5.

21 27 35 41 23

32 30 35 28 38

36 32 33 32 34

42 29 43 37 20

32 30 20 34 35

a) Frequency Distribution

Classes    Adjusted Classes    Frequency
20-24      19.5≤24.5           4
25-29      24.5≤29.5           3
30-34      29.5≤34.5           9
35-39      34.5≤39.5           6
40-44      39.5≤44.5           3
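
As a cross-check on the tallying, here is a minimal Python sketch that builds the same table. It assumes whole-number data and discrete (type B) classes; the helper name `grouped_frequencies` is hypothetical.

```python
# Daily profits from the example above
profits = [21, 27, 35, 41, 23, 32, 30, 35, 28, 38, 36, 32, 33, 32, 34,
           42, 29, 43, 37, 20, 32, 30, 20, 34, 35]


def grouped_frequencies(values, start, width):
    """Return rows of (LCL, UCL, adjusted LCL, adjusted UCL, frequency)."""
    rows = []
    lower = start
    while lower <= max(values):
        upper = lower + width - 1                    # e.g. 20-24 for width 5
        frequency = sum(lower <= v <= upper for v in values)
        rows.append((lower, upper, lower - 0.5, upper + 0.5, frequency))
        lower += width
    return rows


table = grouped_frequencies(profits, start=20, width=5)
for lcl, ucl, adj_lcl, adj_ucl, f in table:
    print(f"{lcl}-{ucl}   {adj_lcl}≤{adj_ucl}   {f}")
```

The frequencies should add up to the number of observations, which is 25 in this example.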

b) Histogram

A histogram is a form of bar chart in which the areas (or heights, when the
classes have equal width) of the bars are proportional to the frequencies of the
classes.

c) Frequency Polygon

Before constructing a frequency polygon, find the midpoint of each class and then
plot the midpoint against the corresponding frequency.

Adjusted Classes    Frequency    Midpoint (x)
19.5≤24.5           4            22
24.5≤29.5           3            27
29.5≤34.5           9            32
34.5≤39.5           6            37
39.5≤44.5           3            42

Adjusted Classes    Frequency    Cumulative Frequency
19.5≤24.5           4            4
24.5≤29.5           3            7
29.5≤34.5           9            16
34.5≤39.5           6            22
39.5≤44.5           3            25
CUMULATIVE FREQUENCY DISTRIBUTION

Cumulate comes from the word accumulate. The graph of a less-than cumulative
frequency distribution is called an ogive. To construct an ogive, plot the UCL of
each class against the cumulative frequency and join the points freehand with a
smooth curve.
Example

The following data show the number of days on which patients visited a clinic for
counselling. Using a class interval of 15 starting at 125, construct a

a) Frequency distribution

b) Cumulative frequency distribution

c) Ogive

d) Relative frequency distribution

185 165 180 174 125

185 168 188 182 181

154 175 175 160 192

168 134 188 142 175

156 150 214 160 145

172 188 192 185 172

Original Classes    Adjusted Classes    Frequency    Cumulative Frequency    Relative Frequency (R.F)
125-139             124.5≤139.5         2            2                       0.067
140-154             139.5≤154.5         4            6                       0.133
155-169             154.5≤169.5         6            12                      0.2
170-184             169.5≤184.5         9            21                      0.3
185-199             184.5≤199.5         8            29                      0.267
200-214             199.5≤214.5         1            30                      0.033

N.B. The last figure in the cumulative frequency column should be equal to ∑f.

The relative frequency is calculated as follows:

$\text{R.F} = \frac{\text{Absolute Frequency}}{\text{Total Frequency}}$

N.B. The sum of all relative frequencies should be one.
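
The two N.B. checks can be illustrated with a short Python sketch that starts from the class frequencies in the table above. The variable names are illustrative only.

```python
from itertools import accumulate

# Class frequencies for 125-139, 140-154, ..., 200-214
frequencies = [2, 4, 6, 9, 8, 1]

total = sum(frequencies)                       # total frequency, ∑f
cumulative = list(accumulate(frequencies))     # running totals for the ogive
relative = [round(f / total, 3) for f in frequencies]

print("Cumulative:", cumulative)
print("Relative:  ", relative)
print("Last cumulative value equals ∑f:", cumulative[-1] == total)
```

Because the relative frequencies are rounded to three decimal places, their printed sum may differ from one in the last digit; the unrounded values sum to one.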

Example

The following data are the marks obtained by a group of students in a statistics
exam

68 49 69 41 79

42 60 87 65 68

50 61 85 66 63

52 56 74 59 81

57 88 47 55 65

78 90 65 72 95

a) Group the data into classes with an interval of 10 starting at 40 until all the values
have been accounted for.

b) Construct an ogive for this data

Original Classes    Adjusted Classes    Frequency    Cumulative Frequency
40-49               39.5≤49.5           4            4
50-59               49.5≤59.5           6            10
60-69               59.5≤69.5           10           20
70-79               69.5≤79.5           4            24
80-89               79.5≤89.5           4            28
90-99               89.5≤99.5           2            30

Measures of Central tendency

The behaviour of any random variable can be described by a measure of central
tendency and a measure of dispersion about a central value. Observations of a
random variable tend to group about some central value. The statistical measures
that quantify where the majority of observations are concentrated are referred to as
measures of central tendency. There are 3 main measures of central tendency,
namely the mean, mode and median. Each measure will be computed for both
ungrouped and grouped data.

Mean for Ungrouped Data

In general, the mean, which is denoted as x̅, is defined as the sum of all
observations divided by the number of observations:

$\bar{x} = \frac{\text{sum of all observations}}{\text{total number of observations}} = \frac{\sum x}{n}$

A merchandising manager of a retail clothing chain has recorded 30 observations on
the interval between orders for a particular range of women's clothing. The order
intervals in days are

18 15 7 24 10

23 28 10 16 12

5 23 24 16 19

26 17 27 17 17

29 18 23 9 26

12 22 14 26 22

a) Find the mean number of days between orders.

$\bar{x} = \frac{555}{30} = 18.5$

Therefore, the average time between orders is 18.5 days.
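
The same calculation as a minimal Python sketch, using the 30 order intervals listed above:

```python
# Order intervals in days from the example above
orders = [18, 15, 7, 24, 10, 23, 28, 10, 16, 12,
          5, 23, 24, 16, 19, 26, 17, 27, 17, 17,
          29, 18, 23, 9, 26, 12, 22, 14, 26, 22]

total = sum(orders)          # ∑x
n = len(orders)              # number of observations
mean = total / n             # x̄ = ∑x / n

print(f"sum = {total}, n = {n}, mean = {mean}")
```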

Mean for Grouped Data

Grouped data are represented by a frequency distribution. To calculate the mean for
grouped data, find the midpoint of each class and multiply it by the corresponding
frequency. The formula for calculating the mean is therefore

$\bar{x} = \frac{\sum f x}{\sum f} = \frac{\sum f x}{n}$

where x is the midpoint of each class and f is the absolute frequency of each class.

Find the mean number of days between orders using the data from the previous
example assuming that the data are grouped as shown in the following table

Order Period    Frequency (f)    Midpoint (x)    fx
5≤10            3                7.5             22.5
10≤15           5                12.5            62.5
15≤20           9                17.5            157.5
20≤25           7                22.5            157.5
25≤30           6                27.5            165

∑fx = 565

The average time between orders is $\bar{x} = \frac{565}{30} = 18.83$ days.
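
A minimal Python sketch of the grouped-mean calculation, with each class written as (LCL, UCL, frequency); the tuple layout is an illustrative choice, not something prescribed by the notes.

```python
# (LCL, UCL, frequency) for each order-period class
classes = [(5, 10, 3), (10, 15, 5), (15, 20, 9), (20, 25, 7), (25, 30, 6)]

n = sum(f for _, _, f in classes)                         # ∑f
sum_fx = sum(f * (lcl + ucl) / 2 for lcl, ucl, f in classes)  # ∑fx with x = midpoint
grouped_mean = sum_fx / n

print(f"∑f = {n}, ∑fx = {sum_fx}, mean = {grouped_mean:.2f}")
```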

The mean for grouped data is not exactly equal to the mean for ungrouped data. The
more reliable mean is the one for ungrouped data because it uses the actual values,
whereas grouping replaces every value in a class with the class midpoint.

N.B. The mean uses every value of the data set in its computation. As a result it
possesses certain useful properties which make it the most widely used measure
of central tendency.

Mode for Ungrouped data

The mode is the most frequently occurring value in a data set. If the number of
observations is not too large, the mode can be found by arranging the data in
ascending order and identifying by inspection the value that occurs most often.

The mode is denoted as Mo.

Example

Identify the mode for the following data set


74, 48, 36,74, 70, 67, 48, 74, 70, 36, 36, 40, 50, 74

Arranging in ascending order

36, 36, 36, 40, 48, 48, 50, 67, 70, 70, 74, 74, 74, 74

Therefore, Mo=74

Mode for Grouped Data

Calculating the mode for grouped data is based on a frequency distribution table.
The first step is to identify the modal interval (the class with the highest
frequency) and then determine the modal value within the modal interval. The
formula used to accomplish this is

$Mo = L_m + \frac{C_m (f_m - f_{m-1})}{2f_m - f_{m-1} - f_{m+1}}$

Where Lm = lower limit of the modal class

Cm = class width of the modal class

fm = the absolute frequency of the modal class

fm-1 = the frequency of the class preceding the modal class

fm+1 = the frequency of the class succeeding the modal class

Using the following data find the mode

Classes     f
125≤140     4
140≤155     11
155≤170     9
170≤185     9
185≤200     10
200≤215     2

The modal class is 140≤155, so

$Mo = 140 + \frac{15(11 - 4)}{2(11) - 4 - 9} = 151.67$

Find the mode for the following sets of grouped data

a)

Classes    f
5≤10       3
10≤15      5
15≤20      9
20≤25      7
25≤30      6

$Mo = 15 + \frac{5(9 - 5)}{2(9) - 5 - 7} = 18.33$

b)

Classes     f
50≤90       2
90≤130      9
130≤170     26
170≤210     27
210≤250     6

$Mo = 170 + \frac{40(27 - 26)}{2(27) - 26 - 6} = 171.82$
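
The grouped-mode formula can be wrapped in a small Python function to check the three results above. This is a sketch only; the function and parameter names are illustrative, and the caller is assumed to have already identified the modal class by inspection.

```python
def grouped_mode(lower, width, f_m, f_prev, f_next):
    """Mo = Lm + Cm(fm - fm-1) / (2fm - fm-1 - fm+1)."""
    return lower + width * (f_m - f_prev) / (2 * f_m - f_prev - f_next)


# Worked example: modal class 140≤155
mode_example = grouped_mode(lower=140, width=15, f_m=11, f_prev=4, f_next=9)
# Exercise a): modal class 15≤20
mode_a = grouped_mode(lower=15, width=5, f_m=9, f_prev=5, f_next=7)
# Exercise b): modal class 170≤210
mode_b = grouped_mode(lower=170, width=40, f_m=27, f_prev=26, f_next=6)

for label, value in [("example", mode_example), ("a", mode_a), ("b", mode_b)]:
    print(f"{label}: {value:.2f}")
```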

N.B. A major disadvantage of using the mode as a measure of central tendency is
that there can be more than one mode, making it difficult to decide which one to
use. A distribution with one mode is said to be unimodal and one with two modes is
said to be bimodal.

Median for Ungrouped Data

The median is that value of a random variable, which divides an ordered data set
into two equal parts. Half of the observations will fall below this median value and
the other half above it.

When finding the median for ungrouped data, the first step is to arrange the
observations in ascending order. If n is odd, the median is the value in the
$\frac{n+1}{2}$th position.

If n is even, identify the value in the $\frac{n}{2}$th position and average this
value and the adjacent value to its right to find the median value.

Example

27 38 12 34 42 40 24 40 23

Arranging the data in ascending order

12 23 24 27 34 38 40 40 42

n = 9, n is odd

$\frac{n+1}{2}$th position = $\frac{9+1}{2}$th position = 5th position. Median = 34

Identify the median for the following data set

27 38 12 42 40 24 40 23 18 34

Ascending order

12 18 23 24 27 34 38 40 40 42

n = 10, n is even, so the median is the average of the 5th and 6th values.

$\text{Median} = \frac{27 + 34}{2} = 30.5$
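
The odd and even cases can be sketched in Python as follows. The function name `median_ungrouped` is illustrative; positions in the notes count from 1, while Python lists index from 0, hence the `- 1` adjustments.

```python
def median_ungrouped(values):
    """Median by the position rule: (n+1)/2 for odd n, average of n/2 and n/2+1 for even n."""
    data = sorted(values)
    n = len(data)
    if n % 2 == 1:
        return data[(n + 1) // 2 - 1]            # value in the (n+1)/2 th position
    return (data[n // 2 - 1] + data[n // 2]) / 2  # average of the two middle values


median_odd = median_ungrouped([27, 38, 12, 34, 42, 40, 24, 40, 23])
median_even = median_ungrouped([27, 38, 12, 42, 40, 24, 40, 23, 18, 34])
print(median_odd, median_even)
```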

Median for Grouped Data

The formula for finding the median for grouped data using the arithmetic method is
as follows

$Me = L_m + \frac{C_m \left(\frac{n}{2} - F_{m-1}\right)}{f_m}$

Where Lm = the lower class limit of the median class

Cm = the class width of the median class

n = total number of observations

Fm-1 = cumulative frequency of the class interval before the median interval

fm = the absolute frequency of the median class

Basically there are two methods of finding the median for grouped data

1. Arithmetic Method- Both the frequency distribution and the cumulative
frequency distribution values are required. Using the cumulative frequency
distribution values, the median interval is that class interval into which the
$\frac{n}{2}$th observation falls.

Examples

Find the median for the following set of grouped data using the arithmetic
method.

Classes     Frequency (f)    Cumulative Frequency (F)
125≤140     4                4
140≤155     11               15
155≤170     9                24
170≤185     9                33
185≤200     10               43
200≤215     2                45

∑f = 45

Median position = $\frac{n}{2}$th position = $\frac{45}{2}$th = 22.5th position, which falls
in the class 155≤170.
NB: You identify the median class using the cumulative frequency.

$\text{Median} = 155 + \frac{15\left(\frac{45}{2} - 15\right)}{9} = 167.5$
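
A minimal Python sketch of the arithmetic method, assuming classes are supplied in ascending order as (LCL, UCL, frequency) tuples; `grouped_median` is an illustrative name.

```python
def grouped_median(classes):
    """Me = Lm + Cm(n/2 - Fm-1) / fm, locating the median class by cumulative frequency."""
    n = sum(f for _, _, f in classes)
    cumulative = 0                               # Fm-1: cumulative frequency so far
    for lower, upper, f in classes:
        if cumulative + f >= n / 2:              # the n/2 th observation falls here
            return lower + (upper - lower) * (n / 2 - cumulative) / f
        cumulative += f
    raise ValueError("classes must contain at least one observation")


classes = [(125, 140, 4), (140, 155, 11), (155, 170, 9),
           (170, 185, 9), (185, 200, 10), (200, 215, 2)]
median = grouped_median(classes)
print(f"Median = {median}")
```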

2. Graphical Method: The median is found from the ogive by reading off the value of
the random variable (on the horizontal axis) that corresponds to fifty percent of
the total cumulative frequency on the vertical axis.

The cumulative frequency distribution is required in both methods.

SKEWNESS

After calculating the mean, mode and median the decision has to be made as
to which one should be preferred as a measure of central tendency for a data
set. The following comparisons might help in this endeavour.

Symmetrical Distribution

If the mean = mode = median, then a symmetrical distribution has been
identified. For a symmetrical distribution the best measure of central
tendency is the mean because it uses every value in the data set.

Negatively Skewed Distribution

A distribution is negatively skewed (also known as left-skewed) if the
mean < median < mode. Most data values are concentrated to the right, and a
few low values to the left produce a long tail to the left.

Therefore, if a distribution is negatively skewed the median is preferred as the best
measure of central tendency.

Positively Skewed Distribution

A distribution is positively skewed (also known as right-skewed) if the
mean > median > mode. The data are not evenly distributed: most data values are
concentrated to the left, and a few high values to the right produce a long tail
to the right.

The best measure of central tendency in this case is the median.

Measures of Position

There are two types of measures of position that is quartiles and percentiles

QUARTILES

These values divide a data set that is ordered in ascending order into four
equal parts. There are three quartiles, which are

Q1 = Lower quartile

Q2 = Middle quartile

Q3 = Upper Quartile

Ungrouped Data

In ungrouped data the observations are arranged in ascending order before
the required quartile positions are determined. To get these positions the
following formulae are used

Q1 = $\frac{n}{4}$th position
Q2 = $\frac{n}{2}$th position
Q3 = $\frac{3n}{4}$th position

If n is odd and the value obtained after making the calculation is a whole
number, the required quartile is the value in that position. If the answer is not
a whole number, take the next whole position.

Example

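Find Q1, Q2 and Q3 for the data below

18 9 11 30 15 22 19 20 35 40 43

Ascending order

9 11 15 18 19 20 22 30 35 40 43

Q1 = $\frac{n}{4}$th position = $\frac{11}{4}$ = 2.75th position, so take the 3rd position: Q1 = 15

Q2 = $\frac{11}{2}$ = 5.5th position, so take the 6th position: Q2 = 20

Q3 = $\frac{3(11)}{4}$ = 8.25th position, so take the 9th position: Q3 = 35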

If n is even and the result you get after the calculation is a whole number, average
the value in that position and the value to its right. If it is not a whole number,
take the next whole position.

18 9 11 30 15 22 19 20 35 40 43 24

Ascending order

9 11 15 18 19 20 22 24 30 35 40 43

Q1 = $\frac{n}{4}$th = $\frac{12}{4}$ = 3rd position, so Q1 = $\frac{15 + 18}{2}$ = 16.5

Q2 = $\frac{n}{2}$th = $\frac{12}{2}$ = 6th position, so Q2 = $\frac{20 + 22}{2}$ = 21

Q3 = $\frac{3n}{4}$th = $\frac{3(12)}{4}$ = 9th position, so Q3 = $\frac{30 + 35}{2}$ = 32.5
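
The position rules for odd and even n can be sketched in Python as below. This follows the rule stated in these notes, which is one of several quartile conventions, so results may differ from spreadsheet or library functions; the function name `quartile` is illustrative.

```python
import math


def quartile(values, k):
    """k-th quartile (k = 1, 2, 3) using the kn/4 position rule from the notes."""
    data = sorted(values)
    n = len(data)
    position = k * n / 4
    if position == int(position):
        p = int(position)
        if n % 2 == 0:
            return (data[p - 1] + data[p]) / 2   # even n: average with the value to the right
        return data[p - 1]                       # odd n: value in that position
    return data[math.ceil(position) - 1]         # not whole: next whole position


odd_data = [18, 9, 11, 30, 15, 22, 19, 20, 35, 40, 43]
even_data = odd_data + [24]

odd_quartiles = [quartile(odd_data, k) for k in (1, 2, 3)]
even_quartiles = [quartile(even_data, k) for k in (1, 2, 3)]
print(odd_quartiles, even_quartiles)
```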

Grouped data

There are two methods that can be used to compute quartiles for grouped
data and these are the graphical method and arithmetic method. The
cumulative frequency distribution is required for both methods.

Graphical Method

An ogive is used to find or estimate quartiles. To find the Q1 position,
determine 25% of the total frequency and find the value that corresponds to it on
the x-axis. To find Q2 and Q3, determine 50% and 75% respectively of the total
frequency.

Example
Find Q1, Q2 and Q3 for the grouped data below

Classes     Frequency    Cumulative Frequency
125≤140     4            4
140≤155     11           15
155≤170     9            24
170≤185     9            33
185≤200     10           43
200≤215     2            45

Q1 position = $\frac{45}{4}$ = 11.25th

Q2 position = $\frac{45}{2}$ = 22.5th

Q3 position = $\frac{3(45)}{4}$ = 33.75th

Arithmetic Method

The following formulae are used to find the lower quartile Q1 and the upper
quartile Q3. Q2 is found using the median formula.

$Q_1 = L_{q1} + \frac{C_q \left(\frac{n}{4} - F_{m-1}\right)}{f_q}$

$Q_3 = L_{q3} + \frac{C_q \left(\frac{3n}{4} - F_{m-1}\right)}{f_q}$

Where Lq is the LCL of the required quartile interval and

Cq = the class width

n = the total number of observations

Fm-1 = cumulative frequency of the interval preceding the required quartile interval

fq = the frequency of the required quartile interval

The 11.25th position falls in the class 140≤155, so

$Q_1 = 140 + \frac{15\left(\frac{45}{4} - 4\right)}{11} = 149.89$

The 33.75th position falls in the class 185≤200, so

$Q_3 = 185 + \frac{15\left(\frac{3(45)}{4} - 33\right)}{10} = 186.13$
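
Since Q1, the median and Q3 all use the same interpolation with a different position, one Python helper can cover them. This is a sketch under the assumption that classes are given in ascending order as (LCL, UCL, frequency); `value_at_position` is a hypothetical name.

```python
def value_at_position(classes, position):
    """Interpolate: L + C(position - F_prev) / f within the class containing `position`."""
    cumulative = 0                                # cumulative frequency before the class
    for lower, upper, f in classes:
        if cumulative + f >= position:
            return lower + (upper - lower) * (position - cumulative) / f
        cumulative += f
    raise ValueError("position exceeds the total frequency")


classes = [(125, 140, 4), (140, 155, 11), (155, 170, 9),
           (170, 185, 9), (185, 200, 10), (200, 215, 2)]
n = sum(f for _, _, f in classes)

q1 = value_at_position(classes, n / 4)        # 11.25th position
q2 = value_at_position(classes, n / 2)        # 22.5th position (the median)
q3 = value_at_position(classes, 3 * n / 4)    # 33.75th position
print(f"Q1 = {q1:.3f}, Q2 = {q2:.3f}, Q3 = {q3:.3f}")
```

The same helper can be called with 9n/10 or 35n/100 as the position to estimate percentiles.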

In general any percentile value can be found by adjusting the median formula
to find the required percentile position and from this establishing the percentile,
for example

90th percentile position = $\frac{9n}{10}$th position

40th percentile position = $\frac{4n}{10}$th position

35th percentile position = $\frac{35n}{100}$th position


Symbolic Notation for Sample and Population Measures

A measure found from analysing sample data is called a statistic, while a
measure describing a population attribute is called a parameter. Various
symbols are used for each of these measures.

Statistical Measure    Sample Statistic    Population Parameter
Mean                   x̅                   μ
Variance               S²                  σ²
Standard Deviation     S                   σ
Size                   n                   N
Measures of Dispersion

There are several types of measures of dispersion and these include

(i) Range

(ii) Interquartile Range

(iii) Variance

(iv) Standard Deviation

(v) Coefficient of Variation

They are used to describe the extent to which the values of a random variable are
scattered about a central value. The central value can be described as more reliable
if there is a high concentration of the values of observations about it. On the other
hand, widely spread observations show low reliability of the central value.

Range

The range is the difference between the largest and the smallest observation in a
given data set.

Example

The following data show the amount in millions of dollars paid to employees in
different companies. Find the data range and interpret your solution.

16 2 38 9 20 80 3 10 50

Range = 80-2

= $78 million

Since the range of 78 is almost as large as the highest observation, the
observations are widely dispersed, hence the mean as a measure of central
tendency will be strongly unrepresentative.

Interquartile Range

It is simply the difference between the upper quartile and the lower quartile
IQR =Q3 –Q1

Variance

It is a measure of spread or dispersion that includes all the observations of a data
set in its computation. It can be computed for both ungrouped and grouped data.

Variance for Ungrouped data

The formula used to compute this is

$S^2 = \frac{\sum x^2 - n\bar{x}^2}{n - 1}$

Where

x = the value of each observation

n = sample size

x̅ = sample mean

Standard Deviation for ungrouped data

The standard deviation is found by computing the square root of the variance

$S = \sqrt{S^2} = \sqrt{\frac{\sum x^2 - n\bar{x}^2}{n - 1}}$

The following data show the weights in kg of 8 patients who visited a clinic one
afternoon. Compute the variance and standard deviation of the weights.

80 70 60 50 40 35 65 45

Mean = 445/8 = 55.63

x     x²
80    6400
70    4900
60    3600
50    2500
40    1600
35    1225
65    4225
45    2025

∑x = 445    ∑x² = 26475

$S^2 = \frac{26475 - 8(55.63)^2}{8 - 1} = 245.35$

Standard Deviation = $\sqrt{245.35}$ = 15.66 kg

Example

The following data give the time in minutes spent by a sample of 20 students to
complete a given task. Showing all workings calculate the standard deviation of the
data.

16 29 58 66 78 42 54 72 54 72 54 91 44 84 92 70
78 52 28 41

x̅ = 58.75

∑x² = 77671

$S^2 = \frac{77671 - 20(58.75)^2}{19} = 454.72$

Standard deviation = $\sqrt{454.72}$ = 21.32 minutes
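
The computational formula can be checked with a short Python sketch. Note that the code carries the mean at full precision; for the weights example this gives a variance of about 245.98 rather than the 245.35 obtained above with the mean rounded to 55.63. The function name is illustrative.

```python
import math


def sample_variance(values):
    """S² = (∑x² - n·x̄²) / (n - 1), using the unrounded mean."""
    n = len(values)
    mean = sum(values) / n
    sum_of_squares = sum(x * x for x in values)
    return (sum_of_squares - n * mean ** 2) / (n - 1)


times = [16, 29, 58, 66, 78, 42, 54, 72, 54, 72,
         54, 91, 44, 84, 92, 70, 78, 52, 28, 41]
weights = [80, 70, 60, 50, 40, 35, 65, 45]

var_times = sample_variance(times)
var_weights = sample_variance(weights)
print(f"times:   S² = {var_times:.2f}, S = {math.sqrt(var_times):.2f}")
print(f"weights: S² = {var_weights:.2f}, S = {math.sqrt(var_weights):.2f}")
```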
Variance for grouped data

$S^2 = \frac{\sum f x^2 - n\bar{x}^2}{n - 1}$

$S = \sqrt{S^2} = \sqrt{\frac{\sum f x^2 - n\bar{x}^2}{n - 1}}$

Where f is the frequency for each class and x is the midpoint of each class.

Example

Find the variance and standard deviation for the grouped data below

Classes     f     x        fx        fx²
125≤140     4     132.5    530       70225
140≤155     11    147.5    1622.5    239318.75
155≤170     9     162.5    1462.5    237656.25
170≤185     9     177.5    1597.5    283556.25
185≤200     10    192.5    1925      370562.5
200≤215     2     207.5    415       86112.5

∑fx = 7552.5    ∑fx² = 1287431.25

$\bar{x} = \frac{\sum f x}{n} = \frac{7552.5}{45} = 167.83$

$S^2 = \frac{\sum f x^2 - n\bar{x}^2}{n - 1} = \frac{1287431.25 - 45(167.83)^2}{45 - 1} = 452.74$

$S = \sqrt{452.74} = 21.28$
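
The grouped calculation is sensitive to rounding of the mean because n·x̅² is a large number. The Python sketch below carries the mean at full precision and obtains a variance of about 451.59, compared with 452.74 when the mean is first rounded to 167.83 as in the working above. The variable names are illustrative.

```python
import math

# (LCL, UCL, frequency) for each class
classes = [(125, 140, 4), (140, 155, 11), (155, 170, 9),
           (170, 185, 9), (185, 200, 10), (200, 215, 2)]

n = sum(f for _, _, f in classes)
sum_fx = sum(f * (lcl + ucl) / 2 for lcl, ucl, f in classes)          # ∑fx
sum_fx2 = sum(f * ((lcl + ucl) / 2) ** 2 for lcl, ucl, f in classes)  # ∑fx²

mean = sum_fx / n
variance = (sum_fx2 - n * mean ** 2) / (n - 1)        # unrounded mean
variance_rounded_mean = (sum_fx2 - n * round(mean, 2) ** 2) / (n - 1)

print(f"∑fx = {sum_fx}, ∑fx² = {sum_fx2}")
print(f"S² = {variance:.2f}, S = {math.sqrt(variance):.2f}")
print(f"S² with mean rounded to 2 d.p. = {variance_rounded_mean:.2f}")
```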
Coefficient of Variation

This is a measure of dispersion that we use to compare consistency in performance
between different random variables with different units of measurement.

The coefficient of variation is calculated as

$\text{Sample C.O.V} = \frac{S}{\bar{x}} \times 100$

$\text{Population C.O.V} = \frac{\sigma}{\mu} \times 100$

It is always expressed as a percentage.

The following data show the mean and standard deviation of sales per month and
of the experience of employees in years. Calculate and compare the coefficients of
variation of the two random variables. Which random variable is more consistent?

Experience of Employees (yrs)    Sales Per Month
x̅ = 20                           x̅ = 500
S = 4                            S = 80

$\text{COV for experience} = \frac{S}{\bar{x}} \times 100 = \frac{4}{20} \times 100 = 20\%$

$\text{COV for sales per month} = \frac{S}{\bar{x}} \times 100 = \frac{80}{500} \times 100 = 16\%$

NB: The one with a higher percentage has greater variability, which means that it is
less consistent. It follows therefore that the one with smaller variability is more
consistent.

Interpretation

Sales per month are more consistent than experience as shown by the variability of
16% compared to 20%.
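
The comparison can be written as a small Python sketch; the function name `coefficient_of_variation` is illustrative.

```python
def coefficient_of_variation(std_dev, mean):
    """C.O.V = (S / x̄) × 100, expressed as a percentage."""
    return 100 * std_dev / mean


cov_experience = coefficient_of_variation(std_dev=4, mean=20)
cov_sales = coefficient_of_variation(std_dev=80, mean=500)

more_consistent = "sales per month" if cov_sales < cov_experience else "experience"
print(f"experience: {cov_experience}%, sales: {cov_sales}%")
print("More consistent:", more_consistent)
```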

Coefficient of Skewness

The coefficient of skewness should lie between negative 3 and positive 3
inclusive

-3 ≤ coefficient of skewness ≤ 3

A value less than zero indicates negative skewness. A value equal to zero represents
a symmetrical distribution. A value greater than zero indicates positive skewness.

The coefficient of skewness in common use is Pearson's coefficient of
skewness. The first coefficient of skewness, denoted as Sk1, is calculated as

$Sk_1 = \frac{\text{mean} - \text{mode}}{\text{standard deviation}} = \frac{\bar{x} - \text{Mode}}{S}$

The second coefficient of skewness, denoted as Sk2, is calculated as

$Sk_2 = \frac{3(\text{mean} - \text{median})}{\text{standard deviation}} = \frac{3(\bar{x} - \text{Median})}{S}$

- When Sk1 and Sk2 are both less than zero, we have a negatively skewed
distribution.

- When Sk1 and Sk2 are both equal to zero, we have a symmetrical distribution.

- When Sk1 and Sk2 are both greater than zero, we have a positively skewed
distribution.
Example

Compute Pearson's first and second coefficients of skewness for the grouped data
used in the preceding examples, and interpret your results.

$\text{Mode} = L_m + \frac{C_m (f_m - f_{m-1})}{2f_m - f_{m-1} - f_{m+1}} = 140 + \frac{15(11 - 4)}{2(11) - 4 - 9} = 151.67$

Mean = 167.83

Mode = 151.67

Median = 167.5

Standard Deviation = 21.28

$Sk_1 = \frac{\text{mean} - \text{mode}}{\text{standard deviation}} = \frac{167.83 - 151.67}{21.28} = 0.76$

$Sk_2 = \frac{3(\text{mean} - \text{median})}{\text{standard deviation}} = \frac{3(167.83 - 167.5)}{21.28} = 0.05$

Both coefficients are greater than zero, so this distribution is skewed to the
right, that is, it is positively skewed.
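
A minimal Python sketch of both coefficients, using the summary values from the example; the function names are illustrative.

```python
def pearson_sk1(mean, mode, std_dev):
    """First coefficient: (mean - mode) / standard deviation."""
    return (mean - mode) / std_dev


def pearson_sk2(mean, median, std_dev):
    """Second coefficient: 3(mean - median) / standard deviation."""
    return 3 * (mean - median) / std_dev


sk1 = pearson_sk1(mean=167.83, mode=151.67, std_dev=21.28)
sk2 = pearson_sk2(mean=167.83, median=167.5, std_dev=21.28)

if sk1 > 0 and sk2 > 0:
    shape = "positively skewed"
elif sk1 < 0 and sk2 < 0:
    shape = "negatively skewed"
else:
    shape = "no clear direction"
print(f"Sk1 = {sk1:.2f}, Sk2 = {sk2:.2f}: {shape}")
```

Sk2 is close to zero here, so the conclusion of positive skewness rests mainly on Sk1.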

PROBABILITY THEORY

A probability can be defined as the chance or likelihood of a particular outcome
occurring out of a number of possible outcomes of a given event.

Subjective Probability

Where the probability of an event is based on an educated guess or expert opinion,
it is referred to as a subjective probability. Subjective probabilities cannot be
statistically verified.

Objective Probability

When the probability of an event can be verified statistically, it is referred to as an
objective probability. It is this type of probability that is used extensively in
statistical analysis. Mathematically, a probability is defined as the ratio of two
numbers, that is

$P(A) = \frac{r}{n}$

where A is an event of a specific type and

r = the number of outcomes of event A

n = total number of possible outcomes (the sample space)

P(A) = probability of event A occurring

Example

Out of a class of 3 girls and 4 boys, what is the probability of selecting

a) A girl

b) A boy

a) $P(G) = \frac{3}{7}$

b) $P(B) = \frac{4}{7}$

Basic Properties of Probability

1. A probability value lies only between zero and 1 inclusive, that is 0 ≤ P(A) ≤ 1

2. If an event A cannot occur, that is it is an impossible event, then P(A) = 0

3. If an event A is certain to occur, then P(A) = 1

4. The sum of the probabilities of all possible outcomes of a random experiment
equals one (the outcomes are exhaustive), for example the sum of the
probability of a girl and the probability of a boy equals one, that is P(G) + P(B) = 1

5. If P(A) is the probability of event A occurring, then the probability of event A
not occurring is defined as P(A') = 1 – P(A)

Example

Consider a random process of drawing a card from a pack of 52 playing cards. Find
the probability of selecting

a) A red card

b) A spade

c) An ace

d) Not an ace

a) $P(\text{red card}) = \frac{26}{52} = \frac{1}{2}$

b) $P(\text{spade}) = \frac{13}{52} = \frac{1}{4}$

c) $P(\text{ace}) = \frac{4}{52} = \frac{1}{13}$

d) $P(\text{ace}') = 1 - \frac{1}{13} = \frac{12}{13}$
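
The r/n definition can be illustrated by enumerating the sample space. The Python sketch below builds a standard 52-card pack and counts favourable outcomes; the helper `probability` and the card representation are illustrative choices.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))        # sample space: 52 (rank, suit) pairs


def probability(event):
    """P(A) = r / n, where r counts the cards for which `event` is true."""
    favourable = sum(1 for card in deck if event(card))
    return Fraction(favourable, len(deck))


p_red = probability(lambda card: card[1] in ("hearts", "diamonds"))
p_spade = probability(lambda card: card[1] == "spades")
p_ace = probability(lambda card: card[0] == "A")
p_not_ace = 1 - p_ace                     # complement rule: P(A') = 1 - P(A)

print(p_red, p_spade, p_ace, p_not_ace)
```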

Basic Probability Concepts

The concepts will be illustrated using the following example

Consider a random experiment of selecting companies from the Zimbabwe Stock
Exchange (ZSE). Values for the random variables, which are company size and
industry type, are summarized in the following table

Industry Type    Small    Medium    Large    Row Total
Mining           0        0         35       35
Finance          9        21        42       72
Service          6        3         1        10
Retail           14       13        6        33
Column Total     29       37        84       150

Computation of Objective Probability

Objective probability can be classified into three categories

(1) Marginal Probability

(2) Joint Probability

(3) Conditional Probability

MARGINAL PROBABILITY

A marginal probability is the probability of only a single event, e.g. the probability of event A
occurring. It is written as P(A). A single event is an event that describes outcomes of one
random variable only. If A represents the event of a small company, find P(A).

$P(A) = \frac{29}{150}$

Let B be the event of a Finance company.

$P(B) = \frac{72}{150}$

A marginal probability is therefore the probability of an event occurring regardless of the
outcome of the other random variable.

JOINT PROBABILITY

A joint probability is the probability of both events A and B occurring simultaneously in a
given random experiment. It is denoted as P(A ∩ B).

Let A be the event of a small company and B the event of a Finance company. Therefore,
P(A ∩ B) = 9/150.

CONDITIONAL PROBABILITY

A conditional probability is the probability of an event A occurring given information about
the occurrence of another event B. A conditional event describes the behaviour of a random
variable in light of additional information about a second random variable. A conditional
probability is defined as follows:

$P(A|B) = \frac{P(A \cap B)}{P(B)}$

This is the probability of event A occurring given that event B has already occurred.

The essential feature here is that the sample space is reduced to the outcomes
describing event B only, and not all possible outcomes as for marginal and joint probabilities.

Let A be the event of a large company and B the event of a retail company. Find P(A|B) using

i) Intuition ii) The formula

INTERSECTION OF EVENTS

The intersection of events A and B is the set of outcomes that belong to both A and B simultaneously. It is written as A n B [A and B].

Let A be the event of a small company and B the event of a service company. A n B is the set of all companies that are both small and service companies. There are 6 such companies.

UNION OF EVENTS

The Union of events A and B is the set of outcomes that belong to A or B or Both. It is written
as A u B [A or B].

Let A be the event of a small company and B be the event of a service company. Then A u B is the set of all companies that are small or service or both. The number of such companies is 29 + 10 - 6 = 33.

MUTUALLY EXCLUSIVE EVENTS

Events are mutually exclusive if they cannot occur together on a single trial of a random
experiment. For example, let A be event of a small company and B be the event of a medium
company. Events A and B are mutually exclusive because a randomly selected company
from ZSE cannot be both small and medium at the same time.

STATISTICALLY INDEPENDENT EVENTS

Events are said to be statistically independent if the occurrence of one event A has no effect on the outcome of event B occurring, or vice versa. For example, let L be the event of an accident occurring in London and H be the event of an industrial strike occurring in Harare. These events have no effect on each other even if they occur at the same time.

The terms statistically independent events and mutually exclusive events must not be confused. When events are mutually exclusive they are not statistically independent. They are dependent in the sense that when one event occurs the other cannot occur. In probability terms, the probability of the intersection of two mutually exclusive events is zero.

PROBABILITY RULES

There are two basic probability rules:

1. Addition Rule (u ; "or"): for both mutually exclusive and non-mutually exclusive events.

2. Multiplication Rule (n ; "and"): for both statistically independent and non-statistically independent events.

ADDITION RULE FOR MUTUALLY EXCLUSIVE EVENTS

The probability of either event A or B occurring in a single trial of a random experiment is defined as:

P(A u B) = P(A) + P(B)

For mutually exclusive events there is no intersection event, therefore P(A n B)= 0.

Let A be the event of small company and B the event of a large company. Since these two
events are mutually exclusive therefore P(A u B) = 29/150 + 84/150 = 113/150.

This is the probability that a randomly selected company from ZSE will either be a small
company or a large company.

ADDITION RULE FOR NON-MUTUALLY EXCLUSIVE EVENTS

The probability of either event A or B occurring in a single trial of a random experiment is given by:

P(A u B) = P(A) + P(B) - P(A n B)

For example: Let A be the event of a small company and B the event of a service company. These two events are not mutually exclusive as they can occur at the same time. Therefore
P(A u B) = 29/150 + 10/150 - 6/150 = 33/150 = 11/50

This is the probability that a randomly selected company from the ZSE will be either a small company or a service company or both.

MULTIPLICATION RULE FOR STATISTICALLY INDEPENDENT EVENTS

If two events are statistically independent, then the multiplication rule reduces to

P(A n B) = P(A) * P(B)

NB* Two events A and B are statistically independent if the following test is satisfied:

P(A|B) = P(A)

This means that if the marginal probability of event A equals the conditional probability of A given that event B has occurred, then these two events are statistically independent: the prior occurrence of event B does not influence the outcome of event A.

Let A be the event of a medium company and B the event of a Finance company. Determine if the two events are statistically independent or not.

Test: is P(A n B) / P(B) = P(A)?

P(A|B) = P(A n B) / P(B) = (21/150) / (72/150) = 21/72 = 7/24

P(A) = 37/150

Since P(A|B) = 7/24 is not equal to P(A) = 37/150, these events are not statistically independent.
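The independence test can be repeated in a few lines of Python. This is a sketch only; the variable names are chosen for this example and the counts come from the ZSE table (37 medium companies, 72 finance companies, 21 that are both). It also checks the equivalent product form of the test.

```python
from fractions import Fraction

total = 150
p_medium = Fraction(37, total)              # P(A): medium company
p_finance = Fraction(72, total)             # P(B): finance company
p_medium_and_finance = Fraction(21, total)  # P(A n B)

# Conditional probability P(A|B) = P(A n B) / P(B)
p_medium_given_finance = p_medium_and_finance / p_finance

# Two equivalent tests of statistical independence.
independent_by_conditional = p_medium_given_finance == p_medium
independent_by_product = p_medium_and_finance == p_medium * p_finance

print(p_medium_given_finance, p_medium)                    # 7/24 37/150
print(independent_by_conditional, independent_by_product)  # False False
```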

MULTIPLICATION RULE FOR NON-STATISTICALLY INDEPENDENT EVENTS

If two events are not statistically independent, we apply the following rule. The multiplication rule may be used to find the joint probability of events A and B occurring on a single trial of a random experiment, i.e. the intersection of the two events. By rearranging the conditional probability formula, the multiplication rule is defined as:

P(A|B) = P(A n B) / P(B)

P(A|B) * P(B) = P(A n B)

Where P(A n B) = the joint probability of A and B,

P(A|B) = the conditional probability of event A occurring given that B has already occurred,

P(B) = the marginal probability of event B occurring.

The personnel department of an insurance company analysed the qualification profile of their 129 managers. The qualifications attained by each manager are shown below.

                         MANAGEMENT LEVELS
Qualification Level    Section Head    Department Head    Division Head    Total
O’Level                     28               14                  8           50
Diploma                     20               24                  6           50
Degree                       5               10                 14           29
Total                       53               48                 28          129

What is the probability of a person selected at random:

i) Having only O’ Level

ii) Being section head and having a degree.

iii) Being a department head given that they have a diploma.

iv) Being a division head.

v) Being a division head or a section head.

vi) Having an O’ Level given that the person is a section head.

Answer:

i) P(O’Level)= 50/129

ii) Let A be the event of being a section head and B the event of having a degree. If the events were statistically independent we could use P(A n B) = P(A) * P(B); otherwise we use P(A n B) = P(A|B) * P(B).

P(A|B) = 5/29 and P(A) = 53/129

Since P(A|B) is not equal to P(A), being a section head and having a degree are not statistically independent events.

We use P(A n B) = P(A|B) * P(B)

= 5/29 * 29/129

= 5/129

iii) P(Department Head | Diploma)

P(A|B) = P(A n B) / P(B) = (24/129) / (50/129) = 24/50

iv) P(Division Head) = 28/129

v) Being a division head and being a section head are mutually exclusive events:

P(A u B) = P(A) + P(B)

= 28/129 + 53/129

= 81/129

vi) P(O’Level | Section Head)

P(A|B) = P(A n B) / P(B) = (28/129) / (53/129) = 28/53

PROBABILITY TREE DIAGRAMS

This is a diagram that helps in decision making. The diagram has the shape of a tree and each branch of the tree represents a logical outcome.

Example: A farmer has 15 cows, of which 7 are black and 8 are white. It has been a tradition that he sells a cow each month. If two cows are sold and not replaced, find the probability that

i) Both are black

ii) One is white and one is black.

iii) How will these probabilities be affected if these cows were replaced?

Solution
i) P(B) = 7/15 x 6/14 = 1/5

ii) P(W,B) = (7/15 x 8/14) +(8/15 x 7/14) = 8/15
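Part iii) can be explored with a small Python sketch that multiplies the probabilities along each branch of the tree. The function name and its arguments are assumptions made for this illustration, not part of the notes.

```python
from fractions import Fraction


def two_draw_probabilities(black, white, replace):
    """Return (P(both black), P(one black and one white)) for two draws."""
    total = black + white
    # The herd shrinks by one after the first sale unless the cow is replaced.
    second_total = total if replace else total - 1
    black_after_black = black if replace else black - 1

    p_both_black = Fraction(black, total) * Fraction(black_after_black, second_total)
    # Two branches of the tree: black then white, and white then black.
    p_black_white = Fraction(black, total) * Fraction(white, second_total)
    p_white_black = Fraction(white, total) * Fraction(black, second_total)
    return p_both_black, p_black_white + p_white_black


without_replacement = two_draw_probabilities(7, 8, replace=False)
with_replacement = two_draw_probabilities(7, 8, replace=True)
print(without_replacement)
print(with_replacement)
```

Without replacement the sketch reproduces 1/5 and 8/15. With replacement the second draw has the same probabilities as the first, giving 49/225 for both black and 112/225 for one of each colour.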

Example: If three playing cards are selected from a pack of playing cards with replacement,
what is the probability of getting at least two diamonds. How would this probability be
affected if these cards were not replaced?

Example: Suppose that we roll an unbiased die three times. Find the probability that the outcome is:

i) Three even numbers

ii) At least two even numbers

iii) At least one even number

iv) No even number at all.

The Fundamental Principles of Counting.


If an event E can be split into k sub-events E1, E2, …, Ek such that there are n1, n2, …, nk ways of performing each sub-event, then the entire event E can be performed in n1 x n2 x … x nk ways.

Example

A man walks from point A to point C via point B as illustrated in diagram below.

A B C

Qtn: In how many ways can the man move from point A to C

Solution

A to B = 2 ways

B to C = 2 ways

Therefore, A to C = 2x2 = 4 ways

Example: A restaurant menu has a choice of four starters, ten main courses and six desserts. Find the total number of possible meals that can be ordered.

Total number of possible meals = 4x10x6 = 240 meals

Example: How many different 7 place number plates are possible if the first 3 places are to
be occupied by alphabetic letters and the final four by numbers?

Solution

In the case we assume that repetition is allowed then:

= 26 x 26 x 26 x10 x 10 x 10 x 10

= 175760000 number plates

In the case we assume that repetition is not allowed then:

= 26 x 25 x 24 x 10 x9 x 8 x7

= 78624000 number plates

PERMUTATIONS OF R OBJECTS FROM N OBJECTS

A permutation is an arrangement of a group of objects in a particular order; each distinct arrangement is called a permutation. Consider the number of ways of placing 3 of the letters ABCDEFG in three empty spaces. The first space can be filled in 7 ways, the second space in 6 ways and the third space in 5 ways. Therefore, there are 7x6x5 = 210 ways of arranging three letters taken from seven letters. This is the number of permutations of three objects taken from seven objects and it is written as 7P3.

NB* With a permutation the order in which letters or numbers are arranged is very important.

In general, the number of permutations of r objects from n unlike objects is given by

nPr = n! / (n-r)!

Eg: 5! = 5x4x3x2x1 = 120

And (5 - 2)! = 3! = 3x2x1 = 6

so 5P2 = 120/6 = 20


COMBINATIONS OF R OBJECTS FROM N OBJECTS

A combination is a selection of a subset of objects from a group of objects where the order is not important. The letters ABC give rise to six permutations, ABC, ACB, CBA, CAB, BAC and BCA, but all six count as one and the same combination. The number of combinations of three letters from the seven letters ABCDEFG is denoted by 7C3.

Example: From a group of five women and seven men, how many different committees consisting of two women and three men can be formed?

Solution

5C2 x 7C3

= 10 x 35

= 350 committees

In general, the number of combinations of r objects from n unlike objects is

nCr = n! / (r!(n-r)!)
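The counting results in this section can be verified with Python's standard library (Python 3.8 or later is assumed, since that is when math.perm and math.comb were added). The variable names are chosen for this illustration.

```python
import math

# Permutations: ordered arrangements of 3 letters chosen from 7.
letters_arranged = math.perm(7, 3)   # 7!/(7-3)! = 210

# Combinations: unordered selections of 3 letters chosen from 7.
letters_selected = math.comb(7, 3)   # 7!/(3! * 4!) = 35

# Committees of 2 women from 5 and 3 men from 7 (counting principle).
committees = math.comb(5, 2) * math.comb(7, 3)   # 10 * 35 = 350

# Number plates: 3 letters followed by 4 digits.
plates_with_repetition = 26**3 * 10**4
plates_without_repetition = math.perm(26, 3) * math.perm(10, 4)

print(letters_arranged, letters_selected, committees)
print(plates_with_repetition, plates_without_repetition)
```

Each ordered arrangement of three letters corresponds to 3! = 6 orderings of the same selection, which is why 210 / 6 = 35.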
PROBABILITY DISTRIBUTION

A probability distribution is a list of all possible outcomes of a random variable and their associated probabilities of occurrence. The expected value of a random variable, which is its mean, is given by the following formula:

E(X) = µ = ∑ x P(X = x)

The Variance

The variance, denoted as σ², is calculated as follows:

σ² = ∑ (xi - µ)² P(X = xi)

or

σ² = E[(X - µ)²]
Example: A fair coin is tossed twice.

i) Construct a probability distribution of the number of heads that can occur.

ii) Find the value of the expectation and standard deviation of the number of heads
that can occur.

NB* In your probability distribution the sum of P(X=x)=1 (exhaustive probability).
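As an illustration of the coin example, the following Python sketch builds the distribution of the number of heads by listing every equally likely outcome, then applies the formulas for E(X) and σ². The function name is an assumption made for this example.

```python
import math
from collections import Counter
from fractions import Fraction
from itertools import product


def heads_distribution(n_tosses):
    """Probability distribution of the number of heads in n fair tosses."""
    outcomes = list(product("HT", repeat=n_tosses))  # e.g. ('H', 'T')
    counts = Counter(outcome.count("H") for outcome in outcomes)
    return {x: Fraction(c, len(outcomes)) for x, c in sorted(counts.items())}


dist = heads_distribution(2)

# E(X) = sum of x * P(X = x)
mean = sum(x * p for x, p in dist.items())
# Variance = sum of (x - mean)^2 * P(X = x)
variance = sum((x - mean) ** 2 * p for x, p in dist.items())
std_dev = math.sqrt(variance)

for x, p in dist.items():
    print(x, p)
print(mean, variance, round(std_dev, 4))
```

For two tosses the probabilities are 1/4, 1/2 and 1/4 for 0, 1 and 2 heads, and they sum to 1 as the NB requires. Calling heads_distribution(3) gives the corresponding distribution for three tosses.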

Example: An unbiased dice is thrown once, construct a probability distribution that shows
the possible outcomes and use it to find the expectation and standard deviation. Let X be
the possible outcomes.

Example: A coin is tossed three times, construct a probability distribution for the number of
tails that can be obtained, and use it to calculate the expectation and standard deviation for
the number of tails that occur.

Example: Each customer at a supermarket pays using one of three methods: cash, cheque or credit card. The probability of a randomly selected customer paying by cash is 0.54 and by cheque is 0.12.

i) Determine the probability of a randomly selected customer paying by credit card. If three customers are selected at random, find the probability of all three paying by cash.

ii) Exactly one paying by cheque.

iii) One paying by cash, one by cheque and one by credit card.

TYPES OF PROBABILITY DISTRIBUTIONS

The choice of a particular probability distribution function depends primarily on the nature of the random variable under study. Probability distribution functions can be classified as discrete or continuous.

Discrete Probability Distribution

These probability distribution functions assume that the outcomes of the random variable under study can take only specific values, usually integers, e.g. a car can only take 0,1,2,3,4,5,6… tyres at any time. The two common types of discrete probability distributions are the Binomial and Poisson distributions. For a random variable to follow either the Poisson or the Binomial distribution, certain conditions have to be met.

A discrete random variable can be said to follow a Binomial distribution if the following are
satisfied:

i) There are two mutually exclusive outcomes of the random variable generally
referred to as success or failure.
ii) The probability of the success outcome is denoted as p, whereas that of the failure outcome is q, where

p + q = 1

iii) The random variable is observed n times and each observation is called a trial.

iv) The trials are assumed to be independent of each other, i.e. each trial does not influence the outcome of another trial.

If a random variable satisfies all the above conditions, it is said to follow a binomial process,

i.e. X ~ Bin(n; p)

The PDF of a binomial distribution is given as:

P(X = x) = nCx p^x q^(n-x)

Where n = number of trials

x = number of success outcomes.

p = probability of a success outcome.

q = probability of a failure outcome.

Example D (Done in video lesson): Ten students sit for an exam. The probability of each student passing the exam is 0.2. What is the probability that three of them will pass the paper?

The formulas for calculating the mean and variance of a binomial distribution are µ = np and σ² = npq.

Important words or terms used in probability (USING EXAMPLE D): NB visit video notes

i) Exactly or Equals

e.g P(X=3) = 10C3 p^3 q^(10-3)

ii) More than or greater than.

e.g P(X > 3) = 1 – P(X≤ 3)

= 1 – [P(X=0)+ P(X=1) + P(X=2) + P(X=3)]

iii) Not more than or at most

e.g P(X≤ 3)= P(X=0) + P(X=1) + P(X=2) + P(X=3)

iv) Less Than

e.g P(X<2) = P(X=0) + P(X=1)

v) Between and inclusive

e.g P(2≤ X ≤ 4) = P(X=2) + P(X=3) + P(X=4)

vi) Between

e.g P(2< X < 5) = P(X=3) + P(X=4)

vii) Or (You add the probabilities)

viii) And (You multiply the probabilities)

ix) Not less than or at least

e.g P(X≥ 7) = P(X=7)+ P(X=8)+ P(X=9)+ P(X=10)
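The terms above translate directly into sums of binomial probabilities. The Python sketch below applies them to Example D (n = 10, p = 0.2); the helper name binom_pmf is an assumption for this illustration, and the mean and variance use the standard binomial results np and npq.

```python
import math


def binom_pmf(x, n, p):
    """P(X = x) = nCx * p^x * q^(n-x), with q = 1 - p."""
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)


n, p = 10, 0.2

p_exactly_3 = binom_pmf(3, n, p)                              # P(X = 3)
p_at_most_3 = sum(binom_pmf(k, n, p) for k in range(0, 4))    # P(X <= 3)
p_more_than_3 = 1 - p_at_most_3                               # P(X > 3)
p_less_than_2 = sum(binom_pmf(k, n, p) for k in range(0, 2))  # P(X < 2)
p_2_to_4_inclusive = sum(binom_pmf(k, n, p) for k in range(2, 5))

mean = n * p                  # np
variance = n * p * (1 - p)    # npq
std_dev = math.sqrt(variance)

print(round(p_exactly_3, 4), round(p_more_than_3, 4))
print(mean, round(variance, 2), round(std_dev, 4))
```

Note that "more than 3" is computed through the complement, exactly as in item ii), which avoids adding seven separate terms.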

Example: Refer to the example D above and answer the following questions.
a) What is the probability that more than two students will pass?

b) Less than two students will fail.

c) Between 2 and 4 students will pass the exam.

d) Between 1 and 3 inclusive will fail

e) Two or three students will pass.

f) Calculate the mean and standard deviation for the number of students who will pass.

POISSON DISTRIBUTION

The Poisson process measures the number of occurrences of a particular outcome of a discrete random variable in a pre-determined time, space or volume interval for which an average number of occurrences of the outcome can be determined, e.g. the number of cars arriving at a parking lot in an hourly interval or the number of telephone calls received in a ten minute interval. If a distribution follows a Poisson process, then

X ~ Poi(λ)

The PDF is defined as follows:

P(X = x) = (e^(-λ) λ^x) / x!

Where λ is the mean number of occurrences and x is the number of occurrences whose probability is being calculated.

Example: The average number of errors a junior typist can make in a page is 6. What is the
probability that she makes:

a) Two errors per page.

b) At least two errors per page.

c) Two errors in two pages.

d) Between 1 and 3 errors in three pages.

e) 1 or 2 errors in a single page.

f) Find the mean and standard deviation of errors per page.

Answer:
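A possible numerical check of parts a), b), c) and f) is sketched below in Python. The helper name poisson_pmf and the variable lam (the mean number of occurrences) are assumptions for this illustration. For part c) the sketch assumes the usual Poisson property that the mean scales with the length of the interval, so two pages have a mean of 12 errors.

```python
import math


def poisson_pmf(x, lam):
    """P(X = x) = e^(-lam) * lam^x / x! for a Poisson mean of lam."""
    return math.exp(-lam) * lam**x / math.factorial(x)


lam_per_page = 6

# a) Exactly two errors on a page.
p_two_errors = poisson_pmf(2, lam_per_page)

# b) At least two errors on a page: complement of 0 or 1 errors.
p_at_least_two = 1 - (poisson_pmf(0, lam_per_page) + poisson_pmf(1, lam_per_page))

# c) Two errors in two pages: the mean doubles (assumed scaling of the mean).
p_two_in_two_pages = poisson_pmf(2, 2 * lam_per_page)

# f) For a Poisson variable the mean and the variance both equal lam.
mean = lam_per_page
std_dev = math.sqrt(lam_per_page)

print(round(p_two_errors, 4), round(p_at_least_two, 4))
print(round(p_two_in_two_pages, 6), mean, round(std_dev, 4))
```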

Example: A textile producer has established that a spinning machine stops randomly due to
thread breakages at an average rate of 5 stoppages per hour. What is the probability that in a
given hour:

a) 3 stoppages will occur

b) At most two stoppages will occur,

c) More than four stoppages will occur.

d) Not more than two stoppages will occur in a two hour interval.

Answer:

CONTINUOUS PROBABILITY DISTRIBUTIONS

A continuous random variable can take any value in an interval. Continuous probability functions are used for probabilities associated with intervals of X values. You will encounter many business situations in which the random variable of interest can be treated as a continuous variable. There are several continuous distributions that are frequently used to describe a physical situation. The most common and useful continuous distribution function is the normal distribution, the reason being that the outputs of many processes are normally distributed.

THE NORMAL DISTRIBUTION

A normal probability distribution function finds the probability for a continuous random
variable. It has the following characteristics:

i) It is bell shaped.

ii) It is symmetrical about a central value (The Mean)

iii) The tails of the distribution never touch the X- axis.

iv) A normally distributed random variable is described by two parameters, namely


the mean and the standard deviation.

v) The area under the curve of the PDF of a normal distribution is equal to 1.

How to read off Normal Distribution tables.

1) Identify a given Z value to one decimal place on the left column.

2) The remaining Z values are given on the top row.

3) The required area of probability is found where the Z values to one decimal place on
the left column intersects the remaining Z values on the top row.
Find the following probabilities where Z is the standard normal variable.

i) P(Z<2.31)

ii) P(Z<-1.49)

iii) P(Z>2.1)

iv) P(-2.5<Z)

v) P(0<Z<2.05)

vi) P(-1.52<Z<0.69)

NB* Always sketch a normal probability distribution curve and indicate the area whose
probability is to be found.

Answer:

v) P(Z<2.05) - P(Z<0)

0,9798 – 0,5000

=0,4798

vi) P(-1,52 < Z < 0,69)

= P(Z < 0,69) - P(Z < -1,52)

= 0,7549 - 0,0643

= 0,6906

Find the following probabilities from the Z tables.

a) P(0 < Z < 1,46)

Solution

P(Z < 1,46) - P(Z < 0)

= 0,9279 - 0,5

= 0,4279

b) P ( -2,3 < Z < 0)

Solution

P (Z < 0) – [P (Z < -2,3)]

0,5-0,0107

=0,4893

c) P (-2,1 < Z < 1,32)

P (Z < 1,32) – P (Z < -2,1)

0,9066 – 0,0179

=0,8887

d) P(1,24 < Z < 2,08)

P(Z < 2,08) - P(Z < 1,24)

= 0,9812 - 0,8925

= 0,0887
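Table look-ups like these can be cross-checked with Python's statistics.NormalDist (available from Python 3.8), whose cdf method returns the area to the left of a value. This is only a sketch; the helper prob_between is a name introduced for this example.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1


def prob_between(a, b):
    """P(a < Z < b) as the difference of two left-tail areas."""
    return Z.cdf(b) - Z.cdf(a)


p_less_than_2_31 = Z.cdf(2.31)         # P(Z < 2,31)
p_greater_than_2_1 = 1 - Z.cdf(2.1)    # P(Z > 2,1) uses the complement
p_0_to_2_05 = prob_between(0, 2.05)
p_m1_52_to_0_69 = prob_between(-1.52, 0.69)
p_1_24_to_2_08 = prob_between(1.24, 2.08)

for value in (p_less_than_2_31, p_greater_than_2_1, p_0_to_2_05,
              p_m1_52_to_0_69, p_1_24_to_2_08):
    print(round(value, 4))
```

The results agree with the table answers to within rounding in the fourth decimal place.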

STANDARD NORMAL DISTRIBUTION

The trick in finding probabilities for a normal distribution is to convert the normal distribution to a standard normal distribution. Values of x associated with any normally distributed random variable can be converted into corresponding Z values by using the conversion formula

Z = (x - µ) / σ

Where µ = mean of the specific normal distribution and

σ = standard deviation of the specific normal distribution.

NB The process of converting X to Z is called standardizing.

The time taken to install a new telephone is found to be normally distributed with a mean time of 45 minutes and a standard deviation of 8 minutes. For a new installation, what is the probability that:

a) It will take less than 40 minutes.
b) It will take between 44 and 49 minutes.
c) It will take between 43 and 45 minutes.
d) It will take between 45 and 51 minutes.

X = time taken to install a new telephone

µ = 45 minutes

σ = 8 minutes

a) P(X < 40) = P(Z < (x - µ)/σ)

= P(Z < (40 - 45)/8)

= P(Z < -0,625) ≈ P(Z < -0,63)

= 0,2643

b) P(44 < X < 49) = P((44 - 45)/8 < Z < (49 - 45)/8)

= P(-0,125 < Z < 0,5) ≈ P(-0,13 < Z < 0,5)

= 0,6915 - 0,4483

= 0,2432

c) P(43 < X < 45) = P((43 - 45)/8 < Z < (45 - 45)/8)

= P(-0,25 < Z < 0)

= P(Z < 0) - P(Z < -0,25)

= 0,5 - 0,4013

= 0,0987

d) P(45 < X < 51) = P((45 - 45)/8 < Z < (51 - 45)/8)

= P(0 < Z < 0,75)

= 0,7734 - 0,5

= 0,2734
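The same example can be worked in Python, either by standardizing by hand or by letting statistics.NormalDist carry the mean and standard deviation. This is an illustrative sketch; the function and variable names are assumptions. Because the software does not round Z to two decimal places, parts a) and b) differ slightly from the table answers.

```python
from statistics import NormalDist

mu, sigma = 45, 8                      # minutes
install_time = NormalDist(mu, sigma)   # X ~ N(45, 8^2)
standard = NormalDist()                # Z ~ N(0, 1)


def z_score(x):
    """Standardize x: Z = (x - mu) / sigma."""
    return (x - mu) / sigma


# a) Both routes give the same probability.
p_a = install_time.cdf(40)
p_a_via_z = standard.cdf(z_score(40))

p_b = install_time.cdf(49) - install_time.cdf(44)
p_c = install_time.cdf(45) - install_time.cdf(43)
p_d = install_time.cdf(51) - install_time.cdf(45)

print(z_score(40), round(p_a, 4), round(p_a_via_z, 4))
print(round(p_b, 4), round(p_c, 4), round(p_d, 4))
```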

The number of customers who enter a certain supermarket in a day is normally distributed with a mean of 400 customers and a standard deviation of 80 customers.

What is the probability that on a given day the number of customers is:

a) Less than 250
b) Greater than 400
c) Between 300 and 400
d) Between 200 and 500

Solution
µ = 400, σ = 80

a) P(X < 250) = P(Z < (x - µ)/σ)

= P(Z < (250 - 400)/80)

= P(Z < -1,875) ≈ P(Z < -1,88)

= 0,0301

b) P(X > 400) = P(Z > (400 - 400)/80)

= P(Z > 0)

= 0,5

c) P(300 < X < 400) = P((300 - 400)/80 < Z < (400 - 400)/80)

= P(-1,25 < Z < 0)

= 0,5 - 0,1056

= 0,3944

d) P(200 < X < 500) = P((200 - 400)/80 < Z < (500 - 400)/80)

= P(-2,5 < Z < 1,25)

= 0,8944 - 0,0062

= 0,8882
The Central Limit Theorem

There are many situations in business where the population is not normally distributed. For a simple random sample of n observations taken from a population with mean µ and standard deviation σ, the sum of the random variables will have an approximately normal distribution when n is large. More specifically, if x1, x2, …, xn is a random sample of size n taken from a population with mean µ and standard deviation σ, the sample mean x̄ approximately follows a normal distribution with the following parameters:

x̄ ~ N(µ; σ²/n), such that P(x̄ < x) = P(Z < (x - µ) / (σ/√n))

NB Hypothesis testing and confidence interval estimation are based on this result.
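A small simulation can make the theorem concrete. The Python sketch below is purely illustrative: the exponential population (mean 1, standard deviation 1), the sample size of 40, the 2000 repetitions and the seed are all hypothetical choices made for this example. It checks that the sample means centre on µ and have a spread close to σ/√n even though the population is far from normal.

```python
import math
import random
import statistics

rng = random.Random(42)   # fixed seed so the run is repeatable

mu, sigma = 1.0, 1.0      # exponential population with rate 1
n = 40                    # size of each sample (hypothetical)
repetitions = 2000        # number of samples drawn (hypothetical)

# Draw many samples and record the mean of each one.
sample_means = [
    statistics.fmean(rng.expovariate(1.0) for _ in range(n))
    for _ in range(repetitions)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
theoretical_sd = sigma / math.sqrt(n)   # standard error of the mean

print(round(mean_of_means, 3), round(sd_of_means, 3), round(theoretical_sd, 3))
```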

Hypothesis Testing

In business it is common practice to make blanket statements about a population of interest. An example could be "workers earn $1000". Such statements can be tested using statistical methods. A hypothesis can be defined as a claim, assumption or assertion about the true value of a population parameter that can be tested. Hypothesis testing can be defined as the process of verifying the validity of a claim about the true value of a population parameter. Hypothesis tests can be carried out on the following population parameters:

• Hypothesis testing for a single population mean.

   • Large samples

   • Small samples

• Hypothesis testing for the difference between 2 population means

   • Large samples

   • Small samples

• Hypothesis testing for paired differences

   • Small samples
Basic steps for a hypothesis test

1. Formulate the hypotheses, that is the null hypothesis, denoted as H0, and the alternative hypothesis, denoted as H1.

2. Determine the type of distribution.

3. Determine the areas of acceptance and rejection.

4. Compute the test statistic.

5. Compare the test statistic with the critical value and draw a conclusion from the result obtained.

Formulating the Hypothesis

The null hypothesis is a claim made about a true value of a population parameter.

The alternative hypothesis is a statement that opposes the claim made about the true value of a population parameter.

H0: µ= 1000

H1: µ≠ 1000

The following are the different ways in which a hypothesis can be stated.

a) Two sided hypothesis test

• It is a claim that equates a population parameter to a stated value.

• It can easily be identified by taking note of words like is exactly, indeed, equal to, same as etc.

• The hypothesis for a two sided test is stated as:

H0: µ = a

H1: µ ≠ a

It has two rejection areas, one in each tail.

b) One sided lower tailed test

• This is a claim that states that a population parameter is less than a specified value.

• To identify this type of hypothesis, look for words such as smaller than, less than, below etc.

• The hypothesis for a lower tailed test is stated as:

H0: µ = a

H1: µ < a

It has one rejection area on the left hand side.

c) One sided upper tailed test

This is a claim that states that a population parameter is greater than a specified value. It is identified by taking note of words like greater than, above, beyond etc.

The hypothesis for a one sided upper tailed test is stated as:

H0: µ = a

H1: µ > a

It has one rejection region on the right hand side.


Errors associated with hypothesis testing.

There are two types of errors that can be made when carrying out a hypothesis test.

Type One Error

It is the error of rejecting a null hypothesis when it is true. The probability of committing a type one error is denoted as α (alpha), which is the level of significance.

Type Two Error

This is the error of accepting a null hypothesis when it is false. The probability of committing a type two error is denoted as β (beta).

Determining the type of distribution.

There are two common types of distribution used in hypothesis testing, i.e. the Z-distribution and the t-distribution.

The Z-distribution is used if the sample size is large, i.e. n ≥ 30

The t-distribution is used if the sample size is small, i.e. n < 30

Determining the areas of acceptance and rejection

The acceptance area is the region such that, when the calculated test statistic falls in it, H0 is not rejected. The rejection area is the region such that, when the calculated test statistic falls in it, H0 is rejected.

Critical Values

This is a value that separates the acceptance and rejection regions.

To arrive at this value, the level of significance α is given, usually as a percentage.

Two sided test

α = 10%

α/2 = 0,05 thus Z0,05 = ±1,64

α = 5%

α/2 = 0,025 thus Z0,025 = ±1,96

One sided lower tailed test

α = 1%: Zα = Z0,01 = -2,33

α = 10%: Zα = Z0,10 = -1,28

One sided upper tailed test

α = 1%

100% - 1% = 99%

Zα = Z0,99 = 2,33

Step 4

Computing the test statistic

For a large sample the test statistic is calculated as Zcalc = (x̄ - µ) / (σ/√n)

For a small sample the test statistic is calculated as tcalc = (x̄ - µ) / (s/√n)

Step 5

Drawing a conclusion.

The conclusion depends on the result obtained in the step above. If the calculated test statistic falls within the rejection region then H0 is rejected. If it falls within the acceptance region we fail to reject H0.

Hypothesis Testing for Single Population Mean.

Large Sample

Where n ≥ 30

Zcalc = (x̄ - µ) / (σ/√n)

Where x̄ = sample mean

µ = population mean

σ = standard deviation

n = sample size.

Example

A firm suspects that the average life of 28000km claimed for certain tires is too high. To check this claim, the firm puts 40 of these tires on its trucks and gets a mean life time of 27563km and a standard deviation of 1348km. Is this evidence that the mean life time of these tires is in fact less than 28000km? Carry out an appropriate test using α = 0,01.

H0: µ = 28000km

H1: µ < 28000km

n = 40 => Z-distribution

Critical Value

α = 0,01

Zα = Z0,01

= -2,33

Reject H0 if Zcalc < -2,33

Test Statistic: Zcalc = (x̄ - µ) / (σ/√n)

= (27563 - 28000) / (1348/√40)

= -2,05

Since Zcalc = -2,05 is greater than -2,33 we fail to reject H0 and conclude at the 1% level of significance that there is not enough evidence that the mean life time of these tires is less than 28000km.

A manufacturer claims that its light bulbs have an average life of 1600hrs. A sample of 100 light bulbs tested gave an average life of 1570hrs and a standard deviation of 120hrs. Test at the 5% level of significance if this claim is true.

H0: µ = 1600

H1: µ ≠ 1600

n = 100 => Z-distribution

Critical Value

α = 0,05

Zα/2 = Z0,05/2 = Z0,025 = ±1,96

Reject H0 if Zcalc < -1,96 or Zcalc > 1,96

Test Statistic: Zcalc = (x̄ - µ) / (σ/√n)

= (1570 - 1600) / (120/√100)

= -2,50

Since Zcalc = -2,5 is less than -1,96 we reject H0 and conclude at the 5% level of significance that the average life of these light bulbs is not equal to 1600hrs.
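The five steps of a large sample test can be sketched in Python as follows. The function names and the tail labels "lower", "upper" and "two" are assumptions made for this illustration. Critical values come from statistics.NormalDist.inv_cdf rather than from the printed Z tables, so they carry more decimal places (for example -2,326 instead of -2,33).

```python
import math
from statistics import NormalDist


def z_statistic(sample_mean, mu0, sd, n):
    """Zcalc = (sample mean - hypothesised mean) / (sd / sqrt(n))."""
    return (sample_mean - mu0) / (sd / math.sqrt(n))


def reject_h0(z_calc, alpha, tail):
    """Compare Zcalc with the critical value for the chosen tail."""
    z = NormalDist()
    if tail == "lower":
        return z_calc < z.inv_cdf(alpha)
    if tail == "upper":
        return z_calc > z.inv_cdf(1 - alpha)
    if tail == "two":
        return abs(z_calc) > z.inv_cdf(1 - alpha / 2)
    raise ValueError("tail must be 'lower', 'upper' or 'two'")


# Tires: H1 is mu < 28000, alpha = 0.01 (lower tailed test).
z_tires = z_statistic(27563, 28000, 1348, 40)
reject_tires = reject_h0(z_tires, 0.01, "lower")

# Light bulbs: H1 is mu != 1600, alpha = 0.05 (two sided test).
z_bulbs = z_statistic(1570, 1600, 120, 100)
reject_bulbs = reject_h0(z_bulbs, 0.05, "two")

print(round(z_tires, 2), reject_tires)
print(round(z_bulbs, 2), reject_bulbs)
```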

1. The average monthly salary paid to an employee at a certain company is $340. A study carried out amongst a sample of 300 employees produced an average monthly salary of $350 with a standard deviation of $60. Test the hypothesis at the 5% level of significance that the average monthly salary of an employee is $340.

2. The average speed of cars along a highway is 135km\h. A sample study of 200 cars along the highway showed an average speed of 130km\h with a variance of 900. Test the hypothesis at the 10% level of significance to determine if the speed of cars along the highway is below 135km\h.

Solution 1

H0: µ = 340

H1: µ ≠ 340

n = 300 => Z-distribution

Critical Value

Zα/2 = Z0,05/2 = Z0,025 = ±1,96

Reject H0 if Zcalc < -1,96 or Zcalc > 1,96

Test Statistic: Zcalc = (x̄ - µ) / (σ/√n)

= (350 - 340) / (60/√300)

= 2,89

Since Zcalc = 2,89 is greater than 1,96 we reject H0 and conclude at the 5% level of significance that the average monthly salary of the employees is not equal to $340.

Solution 2

H0: µ = 135km\h

H1: µ < 135km\h

n = 200 => Z-distribution

σ = √900 = 30

Critical Value

Zα = Z0,10 = -1,28

Reject H0 if Zcalc < -1,28

Test Statistic: Zcalc = (x̄ - µ) / (σ/√n)

= (130 - 135) / (30/√200)

= -2,36

Since Zcalc = -2,36 is less than -1,28 we reject H0 and conclude at the 10% level of significance that the average speed of cars along the highway is less than 135km\h.

How to use the t tables

• If it is a two tailed test, divide alpha by 2 (α/2).

• Look up the result obtained on the top row of the t tables; where this column intersects the row for the degrees of freedom is the critical value.

• For a single population mean the degrees of freedom are calculated as n - 1, where n is the sample size.

• The critical value is therefore denoted as:

One sided lower tail test

-tα; n-1

One sided upper tail test

tα; n-1

Two sided test

±tα/2; n-1

Example

α = 0,025, n = 24

tα; n-1 = t0,025; 23 = 2,07

α = 0,005, n = 5

tα; n-1 = t0,005; 4 = 4,60

Rejection Criterion for a t-test.

One sided lower

Reject H0 if tcalc < -tα; n-1

One sided upper

Reject H0 if tcalc > tα; n-1

Two sided test

Reject H0 if tcalc < -tα/2; n-1 or if tcalc > tα/2; n-1

On average the price of a pair of shoes is taken to be $20. From a sample of 25 pairs it was found that the mean price was $22 with a standard deviation of $5. Test the hypothesis at the 1% level of significance that the average price of a pair of shoes is more than $20.

H0: µ = 20

H1: µ > 20

n = 25 => t-distribution

Critical Value

tα; n-1 = t0,01; 24 = 2,49

Reject H0 if tcalc > 2,49

Test Statistic: tcalc = (x̄ - µ) / (s/√n)

= (22 - 20) / (5/√25)

= 2

Since tcalc = 2 is less than 2,49 we fail to reject H0: µ = 20 and conclude at the 1% level of significance that there is not enough evidence that the average price of a pair of shoes is more than $20.

The mean weight of a certain product is assumed to be 85kg. To test this claim, a random sample of 16 such products was studied and it was found that the average weight was 83kg with a standard deviation of 5kg. Test whether the claim is true or not using α = 0,05.

H0: µ = 85kg

H1: µ ≠ 85kg

n = 16 => t-distribution

Critical Value

±tα/2; n-1 = ±t0,025; 15 = ±2,13

Reject H0 if tcalc > 2,13 or tcalc < -2,13

Test Statistic: tcalc = (x̄ - µ) / (s/√n)

= (83 - 85) / (5/√16)

= -1,6

Since tcalc = -1,6 is greater than -2,13 and less than 2,13 we fail to reject H0: µ = 85kg.

Hypothesis testing for the difference between 2 means

Large Samples

The sum of the two sample sizes should be at least 30 (n1 + n2 ≥ 30), where n1 = size of sample 1 and n2 = size of sample 2. The test statistic is calculated as

Zcalc = (x̄1 - x̄2) / √(σ1²/n1 + σ2²/n2)

where

x̄1 = the sample mean for sample 1

x̄2 = the sample mean for sample 2

σ1² = the variance for sample 1

σ2² = the variance for sample 2

Small Samples

n1 + n2 < 30

The test statistic is calculated as

tcalc = (x̄1 - x̄2) / √(s1²/n1 + s2²/n2)

Critical Values.

One sided lower tail test

Large sample: -Zα

Small sample: -tα; n1 + n2 - 2

Two sided test

Large sample: ±Zα/2

Small sample: ±tα/2; n1 + n2 - 2

Example

A professor took two samples, one of 15 males and another of 12 females, from students at a college who were enrolled for a statistics course. The professor found that the mean score of the male students in an exam was 76,2 with a standard deviation of 7,4 and the mean score of the female students was 78,5 with a standard deviation of 6,7. Test at the 5% level of significance if the mean score of all male students is lower than that of the female students.

Let males be population 1

Let females be population 2

H0: µ1 = µ2

H1: µ1 < µ2

n1 = 15, n2 = 12

n1 + n2 = 27 < 30 => t-distribution

Critical Value

α = 0,05

-tα; n1 + n2 - 2 = -t0,05; 25 = -1,71

Rejection Criterion

Reject H0 if tcalc < -1,71

Test Statistic: tcalc = (x̄1 - x̄2) / √(s1²/n1 + s2²/n2)

= (76,2 - 78,5) / √(7,4²/15 + 6,7²/12)

= -0,85

Since tcalc = -0,85 is greater than -1,71 we fail to reject H0 and conclude at the 5% level of significance that the mean score of all male students is not lower than that of female students.

A transport company wants to compare the performance of 2 cars, a Nissan and a Toyota. The Nissan was used 75 times and its average number of breakdowns was recorded to be 5 with a variance of 4. The Toyota was used 63 times and its average number of breakdowns was recorded to be 4 with a variance of 3. Test the hypothesis that the performance of the two cars is the same. Use α = 0,05.

Let Nissan be population 1

Let Toyota be population 2

H0: µ1 = µ2

H1: µ1 ≠ µ2

n1 = 75, n2 = 63

n1 + n2 = 138 > 30 => Z-distribution

Critical Value

α = 0,05

Zα/2 = Z0,05/2 = Z0,025

= ±1,96

Reject H0 if Zcalc < -1,96 or if Zcalc > 1,96

Test Statistic: Zcalc = (5 - 4) / √(4/75 + 3/63)

= 3,15

Since Zcalc = 3,15 is greater than 1,96 we reject H0 and conclude at the 5% level of significance that the level of performance of these two cars is not the same.

The principal of a college wants to compare the performance of two teachers, X and Y. X
was assed 8times with the mean of 6,2 scores and a variance of 2,15 scores. Y was assed
6times with the, a mean of 5,8scores and a variance of 1,2 scores. Test the hypothesis at the
1% level of significance that they is no difference between the mean number of scores
obtained by the two teachers

Let X be population 1

Let Y be population 2

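H0: μ1 = μ2

H1: μ1 ≠ μ2

n1 = 8, n2 = 6

n1 + n2 = 14 < 30, so we use the t-distribution

Critical Value

α = 0,01, so α/2 = 0,005

$\pm t_{\alpha/2;\,n_1+n_2-2} = \pm t_{0{,}005;\,12} = \pm 3{,}05$

Reject H0 if tcalc < -3,05 or if tcalc > 3,05

Test Statistic

$$t_{calc} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} = \frac{6{,}2 - 5{,}8}{\sqrt{\dfrac{2{,}15}{8} + \dfrac{1{,}2}{6}}} = 0{,}58$$

Since tcalc = 0,58 lies between -3,05 and 3,05 we fail to reject H0 and conclude at the 1%
level of significance that there is no difference between the mean scores of the two teachers.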

Hypothesis Testing for Paired Differences.

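In some cases, it is possible to pair the measurements from one population or sample, for
example a before and an after measurement on the same item. The hypothesis test then tests
whether the mean of the differences between the two measurements in the population is zero.
We will always have small samples for this type of hypothesis test, so the t-distribution is used.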

One sided lower tailed test

H0: µd = 0

H1: µd < 0

Critical Value

$-t_{\alpha;\,n-1}$

One sided upper tailed test

H0: µd = 0

H1: µd > 0

Critical Value

$t_{\alpha;\,n-1}$

Two sided test

H0: µd = 0

H1: µd ≠ 0

Critical Value

$\pm t_{\alpha/2;\,n-1}$

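Test Statistic

$$t_{calc} = \frac{\bar{d}}{s_d/\sqrt{n}}$$

where d represents the difference between the before measurement and the after
measurement, that is d = B - A,

$$\bar{d} = \frac{\sum d}{n}, \qquad n = \text{sample size}$$

and $s_d$ is the standard deviation of the differences, calculated as

$$s_d = \sqrt{\frac{\sum d^2 - n(\bar{d})^2}{n-1}}$$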

Example

The following table shows the heart beat of a particular group of people before and after
the use of tobacco.

Person   Heart beat before   Heart beat after   d = B - A   d^2
1        81                  105                -24         576
2        81                  91                 -10         100
3        68                  87                 -19         361
4        61                  86                 -25         625
5        67                  82                 -15         225
6        74                  78                 -4          16
7        75                  87                 -12         144
8        64                  94                 -30         900
9        70                  93                 -23         529
10       60                  90                 -30         900
Sum      701                 893                -192        4376

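Does tobacco use increase the heart rates of these people? Test using α = 5%.

Since d = B - A, an increase in heart rate after tobacco use gives negative differences,
so the test is lower tailed.

H0: µd = 0 (tobacco does not cause an increase)

H1: µd < 0

Critical Value

$-t_{\alpha;\,n-1} = -t_{0{,}05;\,9} = -1{,}83$

Rejection Criterion

Reject H0 if tcalc < -1,83

Test Statistic

$$\bar{d} = \frac{\sum d}{n} = \frac{-192}{10} = -19{,}2$$

$$s_d = \sqrt{\frac{\sum d^2 - n(\bar{d})^2}{n-1}} = \sqrt{\frac{4376 - 10(-19{,}2)^2}{9}} = 8{,}75$$

$$t_{calc} = \frac{-19{,}2}{8{,}75/\sqrt{10}} = -6{,}94$$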

Since tcalc = -6,94 is less than -1,83 we reject H0 at the 5% level of significance and conclude that
the use of tobacco does cause an increase in the heart rates.
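The paired calculation can be reproduced from the raw heart beat data. This is a minimal sketch using the formulas for the mean difference, $s_d$ and tcalc given earlier; the variable names are illustrative.

```python
import math

before = [81, 81, 68, 61, 67, 74, 75, 64, 70, 60]
after = [105, 91, 87, 86, 82, 78, 87, 94, 93, 90]

# d = B - A for each person
d = [b - a for b, a in zip(before, after)]
n = len(d)

sum_d = sum(d)
sum_d2 = sum(x * x for x in d)

d_bar = sum_d / n
s_d = math.sqrt((sum_d2 - n * d_bar**2) / (n - 1))
t_calc = d_bar / (s_d / math.sqrt(n))

print(sum_d, sum_d2, round(s_d, 2), round(t_calc, 2))
```

The sums printed first are the column totals of d and d^2 in the table, which makes it easy to spot an arithmetic slip before computing tcalc.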

You have been trying to control the weight of a chocolate candy bar by intervening in the
production process. The following table shows the weights before and after the intervention.
Has the intervention managed to reduce the weight of the chocolate candy bar? Test at the
1% level of significance (upper tailed). Note that the differences in the table are
calculated as after minus before, that is d = A - B.

Before   After   d = A - B   d^2
1.62     1.55    -0.07       0.0049
1.71     1.71    0           0
1.6      1.65    0.05        0.0025
1.61     1.64    0.03        0.0009
1.62     1.62    0           0
1.55     1.63    0.08        0.0064
1.64     1.72    0.08        0.0064
1.61     1.77    0.16        0.0256
1.57     1.63    0.06        0.0036
1.67     1.73    0.06        0.0036
16.2     16.65   0.45        0.0539   (sum)

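H0: µd = 0

H1: µd > 0

Critical Value

$t_{\alpha;\,n-1} = t_{0{,}01;\,9} = 2{,}82$

Reject H0 if tcalc > 2,82

Test Statistic

$$\bar{d} = \frac{\sum d}{n} = \frac{0{,}45}{10} = 0{,}045$$

$$s_d = \sqrt{\frac{0{,}0539 - 10(0{,}045)^2}{9}} = 0{,}061$$

$$t_{calc} = \frac{\bar{d}}{s_d/\sqrt{n}} = \frac{0{,}045}{0{,}061/\sqrt{10}} = 2{,}33$$

Since tcalc = 2,33 is less than 2,82 we fail to reject H0 at the 1% level of significance. The
differences in the table are after minus before, so we conclude that there is no significant
increase in the mean weight of the chocolate candy bars after the intervention.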

Confidence Interval Estimation.

Common statements like 'the average price of petrol per liter is between $1,40 and
$1,50' are examples of interval estimates. In statistics it is customary to give not only the
interval estimate for a parameter but also the probability that the interval contains
the parameter. This probability is the level of confidence, for example 90%, 95% or 99%.

Small Sample

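n < 30

The confidence interval estimate for a small sample is given by the following formula

$$\bar{x} - t_{\alpha/2;\,n-1}\left(\frac{s}{\sqrt{n}}\right) \le \mu \le \bar{x} + t_{\alpha/2;\,n-1}\left(\frac{s}{\sqrt{n}}\right)$$

Or

$$\bar{x} \pm t_{\alpha/2;\,n-1}\left(\frac{s}{\sqrt{n}}\right)$$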

Example

From a sample of 64 car commuters, the sample mean time taken to commute to
work daily was found to be 26,5 minutes. If the standard deviation is known to be
15 minutes, find the 95% confidence interval estimate of the actual mean time µ
taken by all car commuters.

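n = 64 > 30, so we use the z-distribution

α = 100% - 95%

= 5%

= 0,05

$$\bar{x} - Z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$$

$$26{,}5 - Z_{0{,}05/2}\frac{15}{\sqrt{64}} \le \mu \le 26{,}5 + Z_{0{,}05/2}\frac{15}{\sqrt{64}}$$

$$26{,}5 - Z_{0{,}025}\frac{15}{\sqrt{64}} \le \mu \le 26{,}5 + Z_{0{,}025}\frac{15}{\sqrt{64}}$$

$$26{,}5 - 1{,}96(1{,}875) \le \mu \le 26{,}5 + 1{,}96(1{,}875)$$

$$22{,}83 \le \mu \le 30{,}18$$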

We are 95% confident that the mean time taken to commute to work daily lies
between 22,83 minutes and 30,18 minutes.
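The interval above can be reproduced with a small helper. This is only a sketch: `z_interval` is a made-up name, and the z value is taken from the standard library's `NormalDist` rather than from the rounded table value 1,96, so the upper limit may differ from the hand calculation in the second decimal place.

```python
import math
from statistics import NormalDist

def z_interval(mean, sigma, n, confidence):
    """Confidence interval mean +/- Z(alpha/2) * sigma / sqrt(n)."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z * sigma / math.sqrt(n)
    return mean - margin, mean + margin

# 64 commuters, sample mean 26.5 minutes, known standard deviation 15 minutes
lower, upper = z_interval(26.5, 15, 64, 0.95)
print(round(lower, 2), round(upper, 2))
```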

If the sample size in the example above were 25, with the mean and standard deviation
remaining the same, compute a 99% confidence interval estimate of the
population mean µ.

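n = 25 < 30, so we use the t-distribution

α = 100% - 99%

= 1%

= 0,01

$$26{,}5 - t_{0{,}01/2;\,24}\left(\frac{15}{\sqrt{25}}\right) \le \mu \le 26{,}5 + t_{0{,}01/2;\,24}\left(\frac{15}{\sqrt{25}}\right)$$

$$26{,}5 - t_{0{,}005;\,24}\left(\frac{15}{\sqrt{25}}\right) \le \mu \le 26{,}5 + t_{0{,}005;\,24}\left(\frac{15}{\sqrt{25}}\right)$$

$$26{,}5 - 2{,}80\left(\frac{15}{\sqrt{25}}\right) \le \mu \le 26{,}5 + 2{,}80\left(\frac{15}{\sqrt{25}}\right)$$

$$18{,}1 \le \mu \le 34{,}9$$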

The mean time taken to commute to work daily lies between 18,10 minutes and 34,90 minutes
with a probability of 0,99.

Confidence Interval Estimate For the difference between 2 population means

Large Samples

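The appropriate formula to be used for a large sample is given as

$$(\bar{X}_1 - \bar{X}_2) - Z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \le (\mu_1 - \mu_2) \le (\bar{X}_1 - \bar{X}_2) + Z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$

Or

$$(\bar{X}_1 - \bar{X}_2) \pm Z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$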

A company has two shops, A and B, and wants to compare the efficiency of the employees of
these two shops. 30 employees were sampled from shop A and 20 from shop B, and their
performance was observed. Shop A employees completed a given task in 30 minutes on average,
with a sample standard deviation of 6 minutes. Shop B employees took 25 minutes on average
to complete the same task, with a sample variance of 25. Construct a 95%
confidence interval estimate for the difference in the mean number of minutes taken
to complete the task by the employees from the two shops.

Let shop A be population 1

Let shop B be population 2

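n1 + n2 = 50 > 30, so we use the z-distribution

The standard deviation for shop A is 6, so its variance is 6² = 36. The variance for shop B is 25.

$$(\bar{X}_1 - \bar{X}_2) - Z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \le (\mu_1 - \mu_2) \le (\bar{X}_1 - \bar{X}_2) + Z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$

$$(30 - 25) - Z_{0{,}05/2}\sqrt{\frac{36}{30} + \frac{25}{20}} \le (\mu_1 - \mu_2) \le (30 - 25) + Z_{0{,}05/2}\sqrt{\frac{36}{30} + \frac{25}{20}}$$

$$5 - 1{,}96(1{,}565) \le (\mu_1 - \mu_2) \le 5 + 1{,}96(1{,}565)$$

$$1{,}93 \le (\mu_1 - \mu_2) \le 8{,}07$$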

We are 95% confident that the difference between the mean number of minutes taken to
complete a given task by the employees from the two shops is between 1,93 minutes and
8,07 minutes.
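A quick way to check this interval is to code the large-sample formula directly. The sketch below takes variances as inputs (so the standard deviation of 6 for shop A is squared first); the function name `difference_interval` is hypothetical.

```python
import math
from statistics import NormalDist

def difference_interval(mean1, var1, n1, mean2, var2, n2, confidence):
    """Large-sample interval (x1 - x2) +/- Z(alpha/2) * sqrt(var1/n1 + var2/n2)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * math.sqrt(var1 / n1 + var2 / n2)
    diff = mean1 - mean2
    return diff - margin, diff + margin

# Shop A: mean 30, standard deviation 6 (variance 36), n = 30
# Shop B: mean 25, variance 25, n = 20
lower, upper = difference_interval(30, 6**2, 30, 25, 25, 20, 0.95)
print(round(lower, 2), round(upper, 2))
```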

Small Samples

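The appropriate formula to be used is given as follows

n1 + n2 < 30, so we use the t-distribution

$$(\bar{X}_1 - \bar{X}_2) - t_{\alpha/2;\,n_1+n_2-2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \le (\mu_1 - \mu_2) \le (\bar{X}_1 - \bar{X}_2) + t_{\alpha/2;\,n_1+n_2-2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

Or

$$(\bar{X}_1 - \bar{X}_2) \pm t_{\alpha/2;\,n_1+n_2-2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$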

If the sample sizes in the example above were 15 employees from shop A and 10 employees
from shop B, with the means and standard deviations remaining the same, compute a 90% confidence
interval estimate for the difference in the mean number of minutes taken to complete the
given task by the employees from the two shops.

Let shop A be population 1

Let shop B be population 2

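n1 + n2 = 25 < 30, so we use the t-distribution

α = 0,10, so α/2 = 0,05 and the degrees of freedom are n1 + n2 - 2 = 23

$$(30 - 25) - t_{0{,}05;\,23}\sqrt{\frac{36}{15} + \frac{25}{10}} \le (\mu_1 - \mu_2) \le (30 - 25) + t_{0{,}05;\,23}\sqrt{\frac{36}{15} + \frac{25}{10}}$$

$$5 - 1{,}71(2{,}21) \le (\mu_1 - \mu_2) \le 5 + 1{,}71(2{,}21)$$

$$1{,}22 \le (\mu_1 - \mu_2) \le 8{,}78$$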

Determining the sample size

A sample drawn from a given population must be representative for a fair conclusion to be
made about the population being represented. It follows that, for a given confidence
level, a given standard deviation and a given margin of error within which the true mean is
expected to fall, the sample size can be calculated using the formula

$$n = \left(\frac{Z_{\alpha/2}\,\sigma}{e}\right)^2$$

where $Z_{\alpha/2}$ = the Z value associated with the given confidence level

σ = standard deviation, which shows how much variation one expects in the
responses

e = margin of error.

A recent study of the salaries of a private company's employees showed a standard deviation of
$251,35. The study would like to estimate the mean salary to within ±$80 of the true
mean with a 90% confidence level. What sample size of employees must be drawn
for the study?

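e = 80

σ = 251,35

α = 100% - 90% = 10% = 0,1

$$n = \left(\frac{Z_{0{,}1/2} \times 251{,}35}{80}\right)^2 = \left(\frac{Z_{0{,}05} \times 251{,}35}{80}\right)^2 = \left(\frac{1{,}64 \times 251{,}35}{80}\right)^2 = 26{,}55$$

Rounding up, n = 27 employees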
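The sample size formula is easy to wrap in a function. This sketch (the name `sample_size` is an assumption) always rounds up, because rounding down would give a margin of error larger than the one requested.

```python
import math
from statistics import NormalDist

def sample_size(sigma, margin_of_error, confidence):
    """n = (Z(alpha/2) * sigma / e)^2, rounded up to a whole number."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z * sigma / margin_of_error) ** 2)

# Salary study: sigma = 251.35, margin of error 80, confidence level 90%
n = sample_size(251.35, 80, 0.90)
print(n)
```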

The Chi-Square Distribution or Test

The Chi-square distribution is the distribution obtained by multiplying the ratio of the sample
variance to the population variance by the degrees of freedom when random samples are
selected. Expected frequencies, denoted E, are frequencies obtained by calculation, whereas
observed frequencies, denoted O, are obtained by observation. The Chi-square distribution is
denoted χ² and it is used to test for independence.

Test for Independence or Association

In this test the claim is that the row and column variable are independent of each other. The
hypothesis for this test is stated as follows.
H0 : row and column variables are independent of each other

H1 : row and column variables are dependent on each other

Or

H0 : there is no association between the row and column variable.

H1 : there is an association between the row and column variable.

Or

H0 : there is no relationship between the row and column variable.

H1 : there is a relationship between the row and column variable.

Critical Value

The critical value of a Chi-square test is given as

$\chi^2_{\alpha;\,(r-1)(c-1)}$

where d.f = (r-1)(c-1)

r = number of rows

c = number of columns

E.g. α = 0,01; r = 3 and c = 3

$\chi^2_{0{,}01;\,(3-1)(3-1)} = \chi^2_{0{,}01;\,4} = 13{,}3$

Rejection Criterion

Reject H0 if $\chi^2_{calc} > \chi^2_{\alpha;\,(r-1)(c-1)}$


Test Statistic

$$\chi^2_{calc} = \sum \frac{(O - E)^2}{E}$$

O – represents the observed frequencies, which are obtained by observation

E – represents the expected frequencies, which are obtained by calculation.

Expected frequencies are calculated by the formula below

$$E = \frac{\text{row total} \times \text{column total}}{\text{grand total}}$$
NB. Each observed frequency must have its own expected frequency
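The expected-frequency rule can be sketched in a few lines of Python. The 2 x 2 table below is made-up data chosen so the arithmetic is easy to follow, and `expected_frequencies` is an illustrative name.

```python
def expected_frequencies(observed):
    """E = row total * column total / grand total for every cell of the table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

# Hypothetical observed frequencies (not from the notes)
table = [[10, 20],
         [30, 40]]

expected = expected_frequencies(table)
print(expected)
```

Each observed cell gets its own expected value, and the expected values in a row add up to the same row total as the observed ones.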

Example. In order to determine whether or not a relationship exists between blood type and
the severity of winter flu, a survey was conducted and yielded the following results. Test at the 5%
level of significance if there is a relationship.

Severity of flu   A     B     AB    O     Total
Severe            34    57    82    55    228
Mild              53    45    137   57    292
No flu            213   218   211   173   815
Total             300   320   430   285   1335

Solution

H0: There is no relationship between blood type and the severity of winter flu.

H1: There is a relationship between blood type and the severity of winter flu.

Critical Value

α = 0,05

r = 3, c = 4

d.f = (r-1)(c-1) = 2 x 3 = 6

$\chi^2_{0{,}05;\,(3-1)(4-1)} = \chi^2_{0{,}05;\,6} = 12{,}6$

Rejection Criterion

Reject H0 if $\chi^2_{calc} > 12{,}6$

Test Statistic

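$$\chi^2_{calc} = \sum \frac{(O - E)^2}{E}$$

The expected frequencies for the first row (Severe) are

$$\text{Cell 1} = \frac{228 \times 300}{1335} = 51.24$$

$$\text{Cell 2} = \frac{228 \times 320}{1335} = 54.65$$

$$\text{Cell 3} = \frac{228 \times 430}{1335} = 73.44$$

$$\text{Cell 4} = \frac{228 \times 285}{1335} = 48.67$$

The expected frequencies for the remaining rows are calculated in the same way.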

O      E        O-E      (O-E)^2     (O-E)^2/E
34     51.24    -17.24   297.2176    5.8005
57     54.65    2.35     5.5225      0.101052
82     73.44    8.56     73.2736     0.997734
55     48.67    6.33     40.0689     0.823277
53     65.62    -12.62   159.2644    2.427071
45     69.99    -24.99   624.5001    8.922705
137    94.05    42.95    1844.703    19.61406
57     62.34    -5.34    28.5156     0.457421
213    183.15   29.85    891.0225    4.864988
218    195.36   22.64    512.5696    2.623718
211    262.51   -51.51   2653.28     10.10735
173    173.99   -0.99    0.9801      0.0056
Sum                                  56.7455

Since χ²calc = 56,75 is greater than 12,6 we reject H0 at the 5% level of significance and
conclude that there is a relationship between blood type and the severity of winter flu.
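The whole test can be checked from the observed table alone. The sketch below computes each expected frequency without rounding, so its total (about 56,7) can differ from the hand calculation in the second decimal place; the function name is illustrative and the critical value is the table value used above.

```python
def chi_square_statistic(observed):
    """Return (chi-square statistic, degrees of freedom) for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand_total  # expected frequency
            chi2 += (o - e) ** 2 / e
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df

# Rows: severe, mild, no flu. Columns: blood types A, B, AB, O.
flu_table = [[34, 57, 82, 55],
             [53, 45, 137, 57],
             [213, 218, 211, 173]]

chi2_calc, df = chi_square_statistic(flu_table)
critical_value = 12.6  # chi-square table value for alpha = 0,05 and 6 d.f.
reject_h0 = chi2_calc > critical_value
print(round(chi2_calc, 2), df, reject_h0)
```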

Example. A survey of 382 respondents produced the following results.

         1     2     3     Total
1        45    87    52    184
2        33    65    100   198
Total    78    152   152   382

Test the hypothesis that the row in which a response falls is independent of the column in
which it falls. Use α = 1%.

Simple Linear Regression and Correlation.

Regression analysis is a statistical method that establishes whether a linear relationship
exists between two variables. Correlation analysis looks closely at the strength of this linear
relationship between the variables.

Regression – establishes if there is a linear relationship (LR)

Correlation – tests the strength of the linear relationship between two variables.

Regression Analysis

The purpose of simple linear regression analysis is to examine some form of linear
relationship between two random variables. These variables are denoted X and Y

X (independent variable) values are always known or can easily be found, whereas
Y (dependent variable) values are estimated using the X values.

There are two methods of establishing if a linear relationship exists between 2 variables

(a) Scatter plot

(b) The linear regression function

Scatter Plot

This is a plot of x values against y values. The x values make up the horizontal axis of the
graph and the y values make up the vertical axis. The graph is drawn by plotting dots where
the values of x and y intersect. If the dots seem to lie in a linear form, then a linear
relationship exists between the two variables.
This suggests that the x values can be confidently used in predicting the y values.

[Figure: scatter plot of y-values against x-values]

This indicates that there is a linear relationship between x and y and it is positive:
as x increases, y increases.

[Figure: scatter plot of y against x]

This indicates a linear relationship between x and y and it is negative. For a negative
linear relationship, as x increases y decreases.

If the dots are scattered all over the space this suggests no linear relationship between x
and y.

If a linear relationship exists between two variables, then the x values can be relied upon in
predicting the y values. If a linear relationship does not exist, then the x
values cannot be relied upon in predicting the y values.

The following data shows the number of garments and the size of cloth in meters.

a) Identify the independent and dependent variable

b) Produce a scatter plot of the data and comment on the relationship

Number of garments   Cloth in meters
45                   25
28                   16
34                   20
42                   28
34                   19
30                   17
42                   22
39                   20
24                   14
32                   17
20                   6

The dependent variable is the number of garments and the independent variable is the cloth in
meters.

[Figure: scatter plot of cloth in meters (vertical axis, 0 to 30) against number of garments (horizontal axis, 0 to 50)]

From the scatter plot, a positive linear relationship exists between cloth in meters and
number of garments.

The following data gives the profits for a particular type of machine sold and
the number of units sold in different shops.

a) Determine the independent and dependent variables.

b) Construct a scatter plot of the data and comment.

Solution

Dependent variable = profit

Independent variable = number of units

Profit   Number of units
550      42
600      38
650      35
600      40
500      44
650      38
450      45
500      42

[Figure: scatter plot of number of units (vertical axis, 0 to 50) against profit (horizontal axis, 0 to 700)]

Linear Regression Function.

The simple linear regression function is given as

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$

where $\hat{\beta}_0$ and $\hat{\beta}_1$ are unknowns.

The estimated value $\hat{y}$ of the dependent variable y is composed of a linear function
$\hat{\beta}_0 + \hat{\beta}_1 x$ of the explanatory variable x.

The parameter $\hat{\beta}_0$ is known as the intercept parameter and the parameter $\hat{\beta}_1$ is known as the
slope parameter. The slope parameter $\hat{\beta}_1$ is of particular interest since it indicates how the
expected value of y depends on x.

If $\hat{\beta}_1 > 0$ then a positive linear relationship exists between x and y: as x increases, y also increases.

[Figure: graph of the line $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$ with a positive slope]

If $\hat{\beta}_1 < 0$ then a negative linear relationship exists between x and y: as x increases, y decreases.

[Figure: graph of the line $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$ with a negative slope]

If $\hat{\beta}_1 = 0$ the line is horizontal and there is no linear relationship between x and y.

[Figure: graph of a horizontal line]

NB: The two unknown parameters $\hat{\beta}_0$ and $\hat{\beta}_1$ are estimated from a data set.

$$\hat{\beta}_1 = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2}$$

$\hat{\beta}_0$ is calculated from $\hat{\beta}_1$ as follows

$$\hat{\beta}_0 = \frac{\sum y}{n} - \hat{\beta}_1\left(\frac{\sum x}{n}\right)$$

Thus, $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$

NB: For a specific value of the explanatory variable x, the equation provides an estimated
value of y.
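The two estimation formulas translate directly into code. The sketch below uses a small made-up data set lying exactly on the line y = 1 + 2x, so the expected intercept and slope are known in advance; `fit_line` is an illustrative name.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) using the sum formulas for the regression line."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x**2)
    intercept = sum_y / n - slope * (sum_x / n)
    return intercept, slope

# Hypothetical points on the line y = 1 + 2x (not from the notes)
b0, b1 = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(b0, b1)
```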

Example

The following is sample data obtained in a study of the relationship between the number of
years that applicants for a certain job have studied English language in high school or
college and the grades which they received in a proficiency test in that language.

Number of years   Grade in test
3                 57
4                 78
4                 72
2                 58
5                 59
3                 63
4                 73
5                 84
3                 75
2                 48

a) Plot the scatter graph and comment on the relationship

b) Compute the appropriate regression equation

c) Comment on the regression coefficient

d) Superimpose the regression line on the scatter graph

e) Predict the grade in the test for someone with 8 years in school studying English language

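[Figure: scatter plot of grade in test (vertical axis, 0 to 90) against number of years (horizontal axis, 0 to 6)]

b) $$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$

$$\hat{\beta}_1 = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2} = \frac{10(2404) - (35)(667)}{10(133) - 1225} = 6{,}62$$

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 66{,}7 - (6{,}62)(3{,}5) = 43{,}53$$

$$\hat{y} = 43{,}53 + 6{,}62x$$

c) Since $\hat{\beta}_1 > 0$ there is a positive linear relationship between x and y: as x increases by
one unit (one year), y increases by 6,62 units.

d) To superimpose the regression line on the scatter graph, calculate two points on the line.

At x = 2: $\hat{y}$ = 43,53 + 6,62(2) = 56,77

At x = 4: $\hat{y}$ = 43,53 + 6,62(4) = 70,01

e) $\hat{y}$ = 43,53 + 6,62(8) = 96,49, that is a grade of about 96%.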
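The regression equation for this example can be recomputed from the raw data. In this sketch the slope is not rounded before the intercept is calculated, so later decimals may differ slightly from the hand calculation. Note also that 8 years lies outside the observed range of 2 to 5 years, so the prediction is an extrapolation.

```python
years = [3, 4, 4, 2, 5, 3, 4, 5, 3, 2]
grades = [57, 78, 72, 58, 59, 63, 73, 84, 75, 48]

n = len(years)
sum_x, sum_y = sum(years), sum(grades)
sum_xy = sum(x * y for x, y in zip(years, grades))
sum_x2 = sum(x * x for x in years)

slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x**2)
intercept = sum_y / n - slope * sum_x / n

def predict(x):
    """Estimated grade for x years of study, from the fitted line."""
    return intercept + slope * x

prediction_8_years = predict(8)
print(round(slope, 2), round(intercept, 2), round(prediction_8_years, 1))
```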

Correlation Analysis

Correlation analysis tests the strength of the relationship between two variables. It measures the
strength of the linear relationship between the independent variable x and the dependent variable y.

The correlation coefficient is denoted r and takes values between -1 and 1, that is, r must
lie in

-1 ≤ r ≤ 1.

r:   -1             -0,5               0                0,5                1
     perfect -ve    -ve correlation    no correlation   +ve correlation    perfect +ve

-1 ≤ r < -0,5    A strong negative correlation

-0,5 ≤ r < 0     A weak negative correlation

0 ≤ r < 0,5      A weak positive correlation

0,5 ≤ r ≤ 1      A strong positive correlation

The common correlation coefficient used in statistics is the Pearson correlation coefficient

It is calculated by

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left(n\sum x^2 - (\sum x)^2\right)\left(n\sum y^2 - (\sum y)^2\right)}}$$

r = correlation coefficient

n = number of pairs of data values

x = independent variable

y = dependent variable
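The Pearson formula can be written as a short function. As a sketch, it is applied here to the five-subject drug data (amount of drug 1 to 5, reaction times 1, 1, 2, 2, 4) used in the example that follows; `pearson_r` is an illustrative name.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient from the sum formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x**2) * (n * sum_y2 - sum_y**2))
    return numerator / denominator

r = pearson_r([1, 2, 3, 4, 5], [1, 1, 2, 2, 4])
print(round(r, 2))
```

A value in the range 0,5 ≤ r ≤ 1 would be read as a strong positive correlation under the classification given above.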

Suppose an experiment involving 5 subjects is conducted to determine the relationship
between the percentage of a certain drug in the blood stream and the length of time it takes
to react to a stimulus.

a) Estimate the regression line equation.

b) Interpret the slope of the regression line.

c) Predict the reaction time of a subject with 1,05% of the drug in their blood
stream.

Subject   Amount of drug (%)   Reaction time (s)
1         1                    1
2         2                    1
3         3                    2
4         4                    2
5         5                    4

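$$\hat{\beta}_1 = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2} = \frac{5(37) - (15)(10)}{5(55) - (15)^2} = 0{,}7$$

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 2 - (0{,}7)(3) = -0{,}1$$

$$\hat{y} = -0{,}1 + 0{,}7x$$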

Given n = 11, ∑xy = 7289, ∑x = 204, ∑y = 370, ∑x² = 4120 and ∑y² = 13070,
compute and interpret the correlation coefficient.

$$r = \frac{11(7289) - (204)(370)}{\sqrt{\left(11(4120) - 204^2\right)\left(11(13070) - 370^2\right)}} = 0{,}93$$

r = 0,93 indicates a strong positive correlation.

Coefficient of determination

It is the square of Pearson's correlation coefficient, expressed as a percentage, that is

COD = r² × 100

This measurement gives the percentage of the variation in the y values that is explained by
the x values. It therefore helps in estimating the reliability of the x values in
predicting the y values.

For the example above, COD = 0,93 × 0,93 × 100 = 86,49%

The x values are 86,49% reliable in predicting the y values according to this model.
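When only the summary sums are available, r and the coefficient of determination can be computed directly from them. This sketch keeps r unrounded before squaring, so the percentage comes out a little higher than the 86,49% obtained from the rounded value 0,93.

```python
import math

# Summary sums for the n = 11 example
n, sum_xy, sum_x, sum_y, sum_x2, sum_y2 = 11, 7289, 204, 370, 4120, 13070

numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x2 - sum_x**2) * (n * sum_y2 - sum_y**2))

r = numerator / denominator
cod = r**2 * 100  # coefficient of determination as a percentage

print(round(r, 2), round(cod, 1))
```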

A lady operates a hot dog stand in the park. She suspects that there is a relationship
between the temperature on a given day and the number of hot dogs she sells on that
day. She begins to keep track of the data and obtains the following results.

a) Plot a scatter diagram of the data and comment.

b) Is it reasonable to fit a regression line? Explain.

c) Estimate the regression line equation.

d) Interpret the slope of the regression line.

e) Estimate the increase in the number of hot dogs sold when the temperature
increases from 27 degrees to 30 degrees.

f) Calculate Pearson's correlation coefficient and interpret it.

g) Compute the coefficient of determination and interpret it.


Day   Temperature   Number of hot dogs sold
1     25            67
2     23            61
3     20            49
4     21            54
5     28            65
6     32            75
7     31            72
8     33            77
9     27            64
10    25            60

a) [Figure: chart of temperature and number of hot dogs sold]

The scatter plot shows a positive linear relationship.

b) Yes, it is reasonable to fit a regression line because the scatter plot
exhibits some form of positive linear relationship between the temperature on a
given day and the number of hot dogs sold on that day.

c) $$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$

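∑x = 265, ∑x² = 7207, ∑xy = 17413, ∑y = 644, ∑y² = 42186, x̄ = 26,5, ȳ = 64,4

$$\hat{\beta}_1 = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2} = \frac{10(17413) - (265)(644)}{10(7207) - 265^2} = 1{,}88$$

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 64{,}4 - 1{,}88(26{,}5) = 14{,}58$$

$$\hat{y} = 14{,}58 + 1{,}88x$$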

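d) Since the slope 1,88 is greater than 0, there is a positive linear relationship: for every
one degree increase in temperature, the number of hot dogs sold increases by about 1,88.

e) At 30 degrees: $\hat{y}$ = 14,58 + 1,88(30) = 70,98

At 27 degrees: $\hat{y}$ = 14,58 + 1,88(27) = 65,34

Increase in the number of hot dogs sold = 70,98 - 65,34 = 5,64, that is, about 6 more hot dogs.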

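f) $$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left(n\sum x^2 - (\sum x)^2\right)\left(n\sum y^2 - (\sum y)^2\right)}} = \frac{10(17413) - (265)(644)}{\sqrt{\left(10(7207) - 265^2\right)\left(10(42186) - 644^2\right)}} = 0{,}96$$

r = 0,96 indicates a strong positive correlation between temperature and the number of hot dogs sold.

g) COD = r² × 100

= 0,96 × 0,96 × 100

= 92,16%

The x values are 92,16% reliable in predicting the y values according to this model.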
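The full hot dog example can be verified from the raw data. The sketch below carries unrounded values through every step, so the intercept comes out as roughly 14,56 rather than the 14,58 obtained from the rounded slope; the slope, the increase from 27 to 30 degrees and r agree with the hand calculation to two decimals.

```python
import math

temperature = [25, 23, 20, 21, 28, 32, 31, 33, 27, 25]
hot_dogs = [67, 61, 49, 54, 65, 75, 72, 77, 64, 60]

n = len(temperature)
sum_x, sum_y = sum(temperature), sum(hot_dogs)
sum_xy = sum(x * y for x, y in zip(temperature, hot_dogs))
sum_x2 = sum(x * x for x in temperature)
sum_y2 = sum(y * y for y in hot_dogs)

# Regression line y = intercept + slope * x
slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x**2)
intercept = sum_y / n - slope * sum_x / n

# Estimated increase in sales when the temperature rises from 27 to 30 degrees
increase = (intercept + slope * 30) - (intercept + slope * 27)

# Pearson correlation coefficient
r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x**2) * (n * sum_y2 - sum_y**2)
)

print(round(slope, 2), round(intercept, 2), round(increase, 2), round(r, 2))
```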
