QAB - II - Lecture - Notes Statistic
Sampling Techniques
Systematic Sampling
Description: The first item is selected randomly, then every kth item after it is selected
Method: Calculation
Formula
Business Examples
Advantages
Disadvantages
Uses or applications of the technique
Stratified Sampling
Description: Homogeneous strata, random sampling within each stratum
Cluster Sampling
Description: Heterogeneous clusters
Overview of Statistics
Before examining the broad areas of statistics, it is necessary to become familiar
with certain terms and concepts used extensively in the subject.
A random variable whose observations can take only specific values, usually
integers (whole numbers), is referred to as a discrete random variable. In such
instances certain values are valid whilst others are invalid, e.g. the number of
cars in a parking lot at a given time, the number of students in a class or the
number of employees in an organization.
CONTINUOUS DATA
A random variable whose observations can take on any value in an interval is
said to generate continuous data for example the mass of a person, distance
travelled, and time taken to travel to work daily.
4) Forecasting
a) Regression and Correlation
Section 1
PRESENTATION OF DATA
UNGROUPED DATA
There are a number of ways in which ungrouped data can be presented, such as
frequency distribution tables and stem and leaf plots.
A frequency distribution is a table which summarizes data with corresponding
frequencies. The following data correspond to the performance of students in a test.
Construct the frequency distribution table to illustrate the information. The random
variable is marks represented by X
X1=10 X6=20 X11=25 X16=21
X2=20 X7=27 X12=15 X17=30
X3=25 X8=30 X13=17 X18=10
X4=15 X9=15 X14=25 X19=27
X5=10 X10=20 X15=30 X20=21
Mark Frequency
10 3
15 3
17 1
20 3
21 2
25 3
27 2
30 3
∑f=20
Stem and Leaf Plots
A stem and leaf plot is a way of summarizing data. It can be constructed in two
phases, rough draft and final draft. The numbers are divided into two parts one
called a stem and the other called a leaf.
NB: In the final draft data values are arranged in ascending order and it is important
to have a key. Example: The following data show the ages in years of people who were
shopping at a supermarket one afternoon. Construct a stem and leaf plot to present
the data.
55 15 25 50 28 66 73 25 24 47 10 45 54
55 55 43 57 53 65 38 30 29 64 12 70 16
24 25 40 15 36 53 57 24 27
Rough Draft
Stem Leaf
1 0 6 5 2 5
2 9 4 5 5 5 4 4 8 7
3 0 6 8
4 3 5 0 7
5 5 7 3 4 3 7 0 5 5
6 6 4 5
7 3 0
Final Draft
Stem Leaf
1 0 2 5 5 6
2 4 4 4 5 5 5 7 8 9
3 0 6 8
4 0 3 5 7
5 0 3 3 4 5 5 5 7 7
6 4 5 6
7 0 3
Key: 1 | 0 = 10 years
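The two-phase construction above can be sketched in Python. This is only an illustrative sketch: the function name stem_and_leaf is an assumption, and it presumes two-digit whole numbers so that the tens digit is the stem and the units digit is the leaf.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return {stem: sorted leaves} for two-digit whole numbers."""
    plot = defaultdict(list)
    for v in sorted(values):          # sorting first gives the final draft directly
        stem, leaf = divmod(v, 10)    # tens digit is the stem, units digit the leaf
        plot[stem].append(leaf)
    return dict(plot)

ages = [55, 15, 25, 50, 28, 66, 73, 25, 24, 47, 10, 45, 54,
        55, 55, 43, 57, 53, 65, 38, 30, 29, 64, 12, 70, 16,
        24, 25, 40, 15, 36, 53, 57, 24, 27]

plot = stem_and_leaf(ages)
print("Stem | Leaf")
for stem in sorted(plot):
    print(f"{stem:>4} | {' '.join(str(leaf) for leaf in plot[stem])}")
print("Key: 1 | 0 = 10 years")
```

Counting the leaves is a quick check that no observation was lost: the 35 ages should give 35 leaves.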
Marks in Economics
16 29 45 58 64 78
42 54 66 72 34 35
54 91 74 24 84 92
70 78 54 52 18 41 65
Exercise
The data below show the number of villagers interviewed in different villages of the
country
1) Construct a frequency distribution to illustrate the data
2) Construct a stem and leaf plot of the data
Grouped Data
Type A: Continuous
Data are grouped into classes, for example 20≤30 and 30≤40.
If the random variable is x, then 20≤x<30 is a class where 20 is the lower class
limit and 30 is the upper class limit.
We use the abbreviations LCL and UCL to denote these. The difference between the
LCL and UCL is called the class interval or class length or class width.
Class width= UCL – LCL
= 30 - 20
=10
The sum of the LCL and UCL divided by two gives the midpoint of a class, usually
denoted as x:
$x = \frac{LCL + UCL}{2}$
Class Limits X
20≤30 25
30≤40 35
40≤50 45
50≤60 55
60≤70 65
70≤80 75
80≤90 85
NB. The UCL of the first class is the LCL of the succeeding class.
Type B: Discrete
Classes Adjusted Classes
5-9 4.5≤9.5
10-14 9.5≤14.5
15-19 14.5≤19.5
20-24 19.5≤24.5
25-29 24.5≤29.5
Example
The owner of a small business wants to analyse profits over the past 25-day period. Using
a class interval of 5 beginning at 20, construct a
a) Frequency distribution
b) Histogram
c) Frequency polygon
N.B. In the original classes use an interval of four (e.g. 20-24) so that the adjusted classes
(e.g. 19.5≤24.5) have an interval of 5.
21 27 35 41 23
32 30 35 28 38
36 32 33 32 34
42 29 43 37 20
32 30 20 34 35
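As a rough sketch of part a), the profits can be tallied into discrete classes 20-24, 25-29 and so on, with the adjusted limits and midpoints needed for the histogram and frequency polygon. The helper name group_discrete and the dictionary layout are assumptions made for illustration.

```python
profits = [21, 27, 35, 41, 23, 32, 30, 35, 28, 38, 36, 32, 33,
           32, 34, 42, 29, 43, 37, 20, 32, 30, 20, 34, 35]

def group_discrete(values, start, width):
    """Tally whole-number values into classes start..start+width-1, and so on."""
    table = {}
    lower = start
    while lower <= max(values):
        upper = lower + width - 1                       # original class, e.g. 20-24
        freq = sum(lower <= v <= upper for v in values)
        table[(lower, upper)] = {
            "adjusted": (lower - 0.5, upper + 0.5),     # continuous limits, width 5
            "midpoint": (lower + upper) / 2,
            "f": freq,
        }
        lower += width
    return table

table = group_discrete(profits, start=20, width=5)
for (lo, hi), row in table.items():
    print(f"{lo}-{hi}  adjusted={row['adjusted']}  midpoint={row['midpoint']}  f={row['f']}")
```

The frequencies should add up to the 25 days in the data set, which is a useful check before drawing the histogram.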
c. Frequency Polygon
Before constructing a frequency polygon, find the midpoint of each class and then plot the
midpoint against the corresponding frequency.
Cumulative comes from the word accumulate. The graph of a less-than cumulative frequency
distribution is called an ogive. To construct an ogive, plot the UCL of each class
against the cumulative frequency and join the points freehand (it has to be a
curve).
Example
The following data shows the number of days on which patients visited a clinic for
counselling using a class interval of 15 starting at 125 construct
a) Frequency distribution
c) Ogive
Example
The following data are the marks obtained by a group of students in a statistics
exam
68 49 69 41 79
42 60 87 65 68
50 61 85 66 63
52 56 74 59 81
57 88 47 55 65
78 90 65 72 95
a) Group the data into classes with an interval of 10 starting at 40 until all the values
have been accounted for.
In general, the mean, which is denoted as $\bar{x}$, is defined as the sum of all observations
divided by the number of observations.
18 15 7 24 10
23 28 10 16 12
5 23 24 16 19
26 17 27 17 17
29 18 23 9 26
12 22 14 26 22
a) Find the mean number of days between orders: $\bar{x} = \frac{555}{30} = 18.5$
Grouped data are represented by a frequency distribution. To calculate the mean for
grouped data, find the midpoint of each class and multiply it by the corresponding
frequency. The formula for calculating the mean is therefore
$\bar{x} = \frac{\sum f x}{\sum f} = \frac{\sum f x}{n}$
where x is the midpoint of each class and f is the absolute frequency of each class.
Find the mean number of days between orders using the data from the previous
example assuming that the data are grouped as shown in the following table
∑f(x) = 565
The average time between orders is $\bar{x} = \frac{565}{30} = 18.83$
The mean for grouped data is not exactly equal to the mean for ungrouped data. The
more reliable mean is the one for ungrouped data because it uses the actual values,
unlike grouped data, which replaces the values in each class with the class midpoint.
N.B. The mean uses every value of the data set in its computation; as a result it
possesses certain useful properties, which make it the most widely used measure
of central tendency.
Example
36, 36, 36, 40, 48, 48, 50, 67, 70, 70, 74, 74, 74, 74
Therefore, Mo=74
Calculating the mode for grouped data is based on a frequency distribution table.
The first step is to identify the modal interval and then determine the modal value
within the modal interval. The formula used to accomplish this is given as follows
Classes F
125≤140 4
140≤155 11
155≤170 9
170≤185 9
185≤200 10
200≤215 2
$\text{Mode} = 140 + \frac{15(11-4)}{2(11)-4-9} = 151.67$
Classes f F
5≤10 3
10≤15 5
15≤20 9
20≤25 7
25≤30 6
b)
Classes F
50≤90 2
90≤130 9
130≤170 26
170≤210 27
210≤250 6
a) $Mo = 15 + \frac{5(9-5)}{2(9)-5-7} = 18.33$
b) $Mo = 170 + \frac{40(27-26)}{2(27)-26-6} = 171.82$
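The two mode calculations above can be checked with a short Python sketch. The function name grouped_mode is an assumption, and treating a missing neighbouring class as frequency 0 is a choice made here for illustration, not something stated in the notes.

```python
def grouped_mode(lower_limits, freqs, width):
    """Mode = Lm + Cm(fm - f_prev) / (2fm - f_prev - f_next) for equal-width classes."""
    m = freqs.index(max(freqs))                         # modal class has the highest frequency
    f_prev = freqs[m - 1] if m > 0 else 0               # assumption: no class before means 0
    f_next = freqs[m + 1] if m < len(freqs) - 1 else 0  # assumption: no class after means 0
    return lower_limits[m] + width * (freqs[m] - f_prev) / (2 * freqs[m] - f_prev - f_next)

mode_a = grouped_mode([5, 10, 15, 20, 25], [3, 5, 9, 7, 6], width=5)
mode_b = grouped_mode([50, 90, 130, 170, 210], [2, 9, 26, 27, 6], width=40)
print(round(mode_a, 2), round(mode_b, 2))
```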
The median is that value of a random variable, which divides an ordered data set
into two equal parts. Half of the observations will fall below this median value and
the other half above it.
When finding the median for ungrouped data, the first step is to arrange the
observations in ascending order. If n is odd, the median is the value in the
$\left(\frac{n+1}{2}\right)$th position.
If n is even, identify the value in the $\left(\frac{n}{2}\right)$th position and average this value and the
adjacent value to its right to find the median value.
Example
27 38 12 34 42 40 24 40 23
Ascending order
12 23 24 27 34 38 40 40 42
n = 9, n is odd
$\left(\frac{n+1}{2}\right)$th position = $\left(\frac{9+1}{2}\right)$th position = 5th position. Median = 34
27 38 12 42 40 24 40 23 18 34
Ascending order
12 18 23 24 27 34 38 40 40 42
$\text{Median} = \frac{27+34}{2} = 30.5$
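The formula for finding the median for grouped data using the arithmetic method is
as follows:
$Me = L_m + \frac{C_m\left(\frac{n}{2} - F_{m-1}\right)}{f_m}$
where L_m is the lower limit of the median class, C_m its class width, f_m its frequency
and F_{m-1} the cumulative frequency of the class before it.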
Basically there are two methods of finding the median for grouped data
Examples
Find the median for the following set of grouped data using the arithmetic
method.
Classes f F
125≤140 4 4
140≤155 11 15
155≤170 9 24
170≤185 9 33
185≤200 10 43
200≤215 2 45
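∑f = 45
Median position = $\left(\frac{n}{2}\right)$th = $\left(\frac{45}{2}\right)$th = 22.5th position, so the median class is 155≤170
NB: You identify the median class using the cumulative frequency
$\text{Median} = 155 + \frac{15\left(\frac{45}{2} - 15\right)}{9} = 167.5$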
2. Graphical Method: The median is found by reading off the value of random
variable associated with the fifty percent cumulative frequency on the vertical
axis
SKEWNESS
After calculating the mean, mode and median the decision has to be made as
to which one should be preferred as a measure of central tendency for a data
set. The following comparisons might help in this endeavour.
Symmetrical Distribution
A distribution is left skewed if mean < median < mode.
In this situation most data values lie to the right, while a few
small values to the left produce a long left tail. This
yields a negatively skewed distribution.
Measures of Position
There are two types of measures of position that is quartiles and percentiles
QUARTILES
These values divide a data set that is ordered in ascending order into four
equal parts. There are three quartiles, which are
Q1 = Lower quartile
Q2 = Middle quartile
Q3 = Upper Quartile
Ungrouped Data
Example
18 9 11 30 15 22 19 20 35 40 43
9 11 15 18 19 20 22 30 35 40 43
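Q1 = $\left(\frac{n}{4}\right)$th position = $\frac{11}{4}$ = 2.75th position, taken up to the 3rd position
Q1 = 15
Q2 = $\frac{11}{2}$ = 5.5th position, taken up to the 6th position = 20
Q3 = $\frac{3(11)}{4}$ = 8.25th position, taken up to the 9th position = 35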
If n is even and the position you get after calculation is a whole number, average
the value in that position and the value to its right. If it is not a whole number, move
up to the next whole-number position.
18 9 11 30 15 22 19 20 35 40 43 24
Ascending order
9 11 15 18 19 20 22 24 30 35 40 43
Q1 = $\left(\frac{n}{4}\right)$th = $\frac{12}{4}$ = 3rd position = $\frac{15+18}{2}$ = 16.5
Q2 = $\left(\frac{n}{2}\right)$th = $\frac{12}{2}$ = 6th position = $\frac{20+22}{2}$ = 21
Q3 = $\left(\frac{3n}{4}\right)$th = $\frac{3(12)}{4}$ = 9th position = $\frac{30+35}{2}$ = 32.5
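The position rule used in the two worked examples above can be sketched as a small Python function. The name quartile is an assumption, and the sketch follows the rule in these notes (average when the position is whole, otherwise move up), which differs from the interpolation methods used by some software.

```python
import math

def quartile(values, k):
    """k-th quartile (k = 1, 2, 3) using the position rule described in these notes."""
    data = sorted(values)
    pos = k * len(data) / 4                  # 1-based position in the ordered data
    if pos == int(pos):                      # whole number: average with the value to its right
        i = int(pos)
        return (data[i - 1] + data[i]) / 2
    return data[math.ceil(pos) - 1]          # otherwise move up to the next whole position

eleven = [18, 9, 11, 30, 15, 22, 19, 20, 35, 40, 43]
twelve = eleven + [24]

print([quartile(eleven, k) for k in (1, 2, 3)])
print([quartile(twelve, k) for k in (1, 2, 3)])
```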
Grouped data
There are two methods that can be used to compute quartiles for grouped
data and these are the graphical method and arithmetic method. The
cumulative frequency distribution is required for both methods.
Graphical Method
Q1 position = $\frac{45}{4}$ = 11.25th
Q2 position = $\frac{45}{2}$ = 22.5th
Q3 position = $\frac{3(45)}{4}$ = 33.75th
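The following formulae are used to find the lower quartile Q1 and the upper
quartile Q3. Q2 is found using the median formula.
$Q_1 = L_{q1} + \frac{C_q\left(\frac{n}{4} - F_{q-1}\right)}{f_q}$
$Q_3 = L_{q3} + \frac{C_q\left(\frac{3n}{4} - F_{q-1}\right)}{f_q}$
where L_q is the lower limit of the quartile class, C_q its class width, f_q its frequency
and F_{q-1} the cumulative frequency of the class before it.
Using the grouped data from the median example:
$Q_1 = 140 + \frac{15\left(\frac{45}{4} - 4\right)}{11} = 149.89$
$Q_3 = 185 + \frac{15\left(\frac{3(45)}{4} - 33\right)}{10} = 186.13$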
In general any percentile value can be found by adjusting the median formula
to find the required percentile position and from this establish the percentile value,
for example 90th percentile position = $\left(\frac{9n}{10}\right)$th position
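Since the median, quartile and percentile formulae differ only in the position used, one function can cover them all. The sketch below assumes equal class widths; the name grouped_quantile and the argument fraction (0.25 for Q1, 0.5 for the median, 0.9 for the 90th percentile) are illustrative choices.

```python
def grouped_quantile(lower_limits, freqs, width, fraction):
    """Value below which `fraction` of the observations fall (0 < fraction < 1)."""
    position = fraction * sum(freqs)
    cumulative = 0                           # cumulative frequency before the current class
    for lower, f in zip(lower_limits, freqs):
        if cumulative + f >= position:       # first class whose cumulative frequency reaches the position
            return lower + width * (position - cumulative) / f
        cumulative += f
    raise ValueError("fraction must be between 0 and 1")

limits = [125, 140, 155, 170, 185, 200]
freqs = [4, 11, 9, 9, 10, 2]

q1 = grouped_quantile(limits, freqs, 15, 0.25)
median = grouped_quantile(limits, freqs, 15, 0.50)
q3 = grouped_quantile(limits, freqs, 15, 0.75)
p90 = grouped_quantile(limits, freqs, 15, 0.90)
print(f"Q1={q1:.3f}  Median={median:.3f}  Q3={q3:.3f}  P90={p90:.3f}")
```

The values agree with the hand calculations for Q1, the median and Q3 once rounded to two decimal places.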
Measures of Dispersion
(i) Range
(ii) Interquartile Range
(iii) Variance
They are used to describe the extent to which the values of a random variable are
scattered about a central value. The central value can be described as more reliable
if there is a high concentration of the values of observations about it. On the other
hand, widely spread observations show low reliability of the central value.
Range
The range is the difference between the smallest and largest observations in a given data
set.
Example
The following data show the amount in millions of dollars paid to employees in
different companies. Find the data range and interpret your solution.
16 2 38 9 20 80 3 10 50
Range = 80-2
= $78 million
Since 78 is very close to the highest observation and very far from the lowest
observation this suggests a wide dispersion hence the mean as a measure of central
tendency will be strongly unrepresentative.
Interquartile Range
It is simply the difference between the upper quartile and the lower quartile
IQR =Q3 –Q1
Variance
$s^2 = \frac{\sum x^2 - n\bar{x}^2}{n-1}$
Where
n = sample size
$\bar{x}$ = sample mean
The standard deviation is found by computing the square root of the variance
$s = \sqrt{s^2} = \sqrt{\frac{\sum x^2 - n\bar{x}^2}{n-1}}$
The following data show the weights in kgs of 8 patients who visited a clinic one
afternoon. Compute the variance and standard deviation of the weights.
80 70 60 50 40 35 65 45 Mean = 55,63
X x2
80 6400
70 4900
60 3600
50 2500
40 1600
35 1225
65 4225
45 2025
$\sum x^2 = 26475$
$s^2 = \frac{26475 - 8(55.63)^2}{8-1} = 245.35$
$s = \sqrt{245.35} = 15.66$ kg
Example
The following data give the time in minutes spent by a sample of 20 students to
complete a given task. Showing all workings calculate the standard deviation of the
data.
16 29 58 66 78 42 54 72 54 72 54 91 44 84 92 70
78 52 28 41
x̅ = 58.75
$\sum x^2 = 77671$
$s^2 = \frac{77671 - 20(58.75)^2}{19} = 454.72$
$s = \sqrt{454.72} = 21.32$
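A minimal Python sketch of the computational formula is given below, cross-checked against the standard library's statistics.variance. The function name sample_variance is an assumption. Note that the code keeps the mean unrounded, so for the weights example it gives about 245.98 rather than the 245.35 obtained by hand with the mean rounded to 55.63.

```python
import math
import statistics

def sample_variance(values):
    """s^2 = (sum of x^2 - n * mean^2) / (n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return (sum(x * x for x in values) - n * mean ** 2) / (n - 1)

times = [16, 29, 58, 66, 78, 42, 54, 72, 54, 72,
         54, 91, 44, 84, 92, 70, 78, 52, 28, 41]
weights = [80, 70, 60, 50, 40, 35, 65, 45]

var_times = sample_variance(times)
var_weights = sample_variance(weights)
print(f"times:   s^2 = {var_times:.2f}, s = {math.sqrt(var_times):.2f}")
print(f"weights: s^2 = {var_weights:.2f}, s = {math.sqrt(var_weights):.2f}")
```

The difference in the weights example shows how sensitive this formula is to rounding the mean before squaring it.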
Variance for grouped data
$s^2 = \frac{\sum f x^2 - n\bar{x}^2}{n-1}$
$s = \sqrt{s^2} = \sqrt{\frac{\sum f x^2 - n\bar{x}^2}{n-1}}$
Where f is the frequency for each class and x is the midpoint of each class.
Example
Find the variance and standard deviation for the grouped data below
Classes F X fx fx2
125≤140 4 132.5 530 70225
140≤155 11 147.5 1622.5 239318.75
155≤170 9 162.5 1462.5 237656.25
170≤185 9 177.5 1597.5 283556.25
185≤200 10 192.5 1925 370562.5
200≤215 2 207.5 415 86112.5
$\bar{x} = \frac{\sum f x}{n} = \frac{7552.5}{45} = 167.83$
$s^2 = \frac{\sum f x^2 - n\bar{x}^2}{n-1} = \frac{1287431.25 - 45(167.83)^2}{45-1} = 452.74$
$s = \sqrt{452.74} = 21.28$
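The grouped calculation above can be reproduced with the sketch below; the variable names are illustrative. Because the code does not round the mean to 167.83 before squaring, it gives a variance of about 451.59 and a standard deviation of about 21.25, slightly below the hand-calculated 452.74 and 21.28.

```python
import math

limits = [(125, 140), (140, 155), (155, 170), (170, 185), (185, 200), (200, 215)]
freqs = [4, 11, 9, 9, 10, 2]

midpoints = [(lo + hi) / 2 for lo, hi in limits]            # x for each class
n = sum(freqs)
sum_fx = sum(f * x for f, x in zip(freqs, midpoints))       # sum of f * x
sum_fx2 = sum(f * x * x for f, x in zip(freqs, midpoints))  # sum of f * x^2

mean = sum_fx / n
variance = (sum_fx2 - n * mean ** 2) / (n - 1)
std_dev = math.sqrt(variance)
print(f"mean={mean:.2f}  variance={variance:.2f}  std_dev={std_dev:.2f}")
```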
Coefficient of Variation
Sample C.O.V = $\frac{s}{\bar{x}} \times 100$
Population C.O.V = $\frac{\sigma}{\mu} \times 100$
The following data show the mean and standard deviation of sales per month and
the experience of employees in years. Calculate and compare the coefficients of
variation of the two random variables. Which random variable exhibits greater
variation?
COV for experience = $\frac{s}{\bar{x}} \times 100 = \frac{4}{20} \times 100 = 20\%$
COV for sales per month = $\frac{s}{\bar{x}} \times 100 = \frac{80}{500} \times 100 = 16\%$
NB: The one with a higher percentage has greater variability, which means that it is
less consistent. It follows therefore that the one with smaller variability is more
consistent.
Interpretation
Sales per month are more consistent than experience as shown by the variability of
16% compared to 20%.
Coefficient of Skewness
The coefficient of skewness values should lie between negative 3 and positive 3
inclusive.
A value less than zero indicates negative skewness. A value equal to zero represents
a symmetrical distribution. A value greater than zero indicates positive skewness.
$Sk_1 = \frac{\text{mean} - \text{mode}}{\text{standard deviation}} = \frac{\bar{x} - \text{Mode}}{s}$
$Sk_2 = \frac{3(\text{mean} - \text{median})}{\text{standard deviation}} = \frac{3(\bar{x} - \text{Median})}{s}$
When Sk1 and Sk2 are both less than zero, we have a negatively skewed
distribution.
If Sk1 and Sk2 are both equal to zero, we have a symmetrical distribution.
If Sk1 and Sk2 are both greater than zero, we have a positively skewed
distribution.
Example
Compute Pearson's first and second coefficients of skewness, and interpret your results.
$\text{Mode} = L_m + \frac{C_m(f_m - f_{m-1})}{2f_m - f_{m-1} - f_{m+1}} = 140 + \frac{15(11-4)}{2(11)-4-9} = 151.67$
Mean = 167.83
Mode = 151.67
Median = 167.5
$Sk_1 = \frac{\text{mean} - \text{mode}}{\text{standard deviation}} = \frac{167.83 - 151.67}{21.28} = 0.76$
$Sk_2 = \frac{3(\text{mean} - \text{median})}{\text{standard deviation}} = \frac{3(167.83 - 167.5)}{21.28} = 0.05$
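The two coefficients can be checked with a few lines of Python; the function name pearson_skewness is an assumption, and the inputs are the rounded summary values from the worked example.

```python
def pearson_skewness(mean, median, mode, std_dev):
    """Return Pearson's first and second coefficients of skewness."""
    sk1 = (mean - mode) / std_dev
    sk2 = 3 * (mean - median) / std_dev
    return sk1, sk2

sk1, sk2 = pearson_skewness(mean=167.83, median=167.5, mode=151.67, std_dev=21.28)
print(round(sk1, 2), round(sk2, 2))
```

Both values are greater than zero, which by the rule above points to a positively skewed distribution, although Sk2 is very close to zero.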
PROBABILITY THEORY
Subjective Probability
Objective Probability
Example
a) A girl
b) A boy
a) $P(G) = \frac{3}{7}$ b) $P(B) = \frac{4}{7}$
1. A probability value lies only between zero and 1 inclusive that is 0≤P(A)≤1
Example
Consider a random process of drawing a card from a pack of playing cards. Find the
probability of selecting
a) A red card
b) A spade
c) An ace
d) Not an ace
a) $P(\text{red card}) = \frac{26}{52} = \frac{1}{2}$
b) $P(\text{spade}) = \frac{13}{52} = \frac{1}{4}$
c) $P(\text{Ace}) = \frac{4}{52} = \frac{1}{13}$
d) $P(\text{not an Ace}) = 1 - \frac{1}{13} = \frac{12}{13}$
Marginal Probability
A marginal probability is the probability of only a single event e.g the probability of event A
occurring, it is written as P(A). A single event is an event that describes outcomes of one
random variable only. If A represents the event of a small company, find P(A).
$P(A) = \frac{29}{150}$
JOINT PROBABILITY
Let A be the event of a small company and B the event of a Finance company. Therefore, the
probability P(A ∩ B) = 9/150.
CONDITIONAL PROBABILITY
$P(A|B) = \frac{P(A \cap B)}{P(B)}$
This is the probability of event A occurring given that event B has already occurred.
The essential feature here is that the sample space is reduced to the outcomes
describing event B only and not all possible outcomes as for marginal and joint probabilities.
Let A be the event of a large company and B the event of a retail company. Find P(A|B) using
INTERSECTION OF EVENTS
The intersection of events A and B is the set of outcomes that belong to both A and B
simultaneously. It is written as A ∩ B [A and B].
Let A be the event of a small company and B the event of a service company. A ∩ B is the set
of all companies that are both small and service companies. n(A ∩ B) = 6.
UNION OF EVENTS
The union of events A and B is the set of outcomes that belong to A or B or both. It is written
as A ∪ B [A or B].
Let A be the event of a small company and B be the event of a service company. Then A ∪ B is
the set of all companies that are small, service or both. n(A ∪ B) = 29 + 10 - 6 = 33.
Events are mutually exclusive if they cannot occur together on a single trial of a random
experiment. For example, let A be event of a small company and B be the event of a medium
company. Events A and B are mutually exclusive because a randomly selected company
from ZSE cannot be both small and medium at the same time.
The events are said to be statistically independent if the occurrence of one event A has no
effect on the outcome of event B occurring or vice versa. For example, let L be the event of
an accident occurring in London and H be the event of an industrial strike occurring in
Harare. These scenarios have no effect on each other even if they occur at the same
time.
The terms statistically independent events and mutually exclusive events must not be
confused. When events are mutually exclusive they are not statistically independent. They
are dependent in the sense that when one event occurs the other cannot occur. In
probability terms, the probability of an intersection between two mutually exclusive events is
zero.
PROBABILITY RULES
1. Addition Rule (∪; "or"), for both mutually exclusive and non-mutually exclusive events:
P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
For mutually exclusive events there is no intersection event, therefore P(A ∩ B) = 0.
Let A be the event of a small company and B the event of a large company. Since these two
events are mutually exclusive, P(A ∪ B) = 29/150 + 84/150 = 113/150.
This is the probability that a randomly selected company from ZSE will either be a small
company or a large company.
For example: Let A be event of a small company and B the event of a service company.
These two events are not mutually exclusive as they can occur at the same time. Therefore
P(A ∪ B) = 29/150 + 10/150 - 6/150 = 33/150 = 11/50
This is the probability that a randomly selected company from the ZSE will be either a small
company or a service company or both.
If two events are statistically independent, then the multiplication rule reduces to the
probability of
NB* two events A and B are statistically independent if the following test can be satisfied.
This means that if the marginal probability of event A equals the conditional probability of A
given that event B has occurred, then these two events are statistically independent. This
means that the prior occurrence of event B does not influence the outcome of event A.
Let A be the event of a media company and B the event of a Finance company. Determine if
the two events are statistically independent or not.
Test: $\frac{P(A \cap B)}{P(B)} = P(A)$
$\frac{21/150}{72/150} = \frac{21}{72} = \frac{7}{24}$, whereas $P(A) = \frac{37}{150}$
$\frac{7}{24} \neq \frac{37}{150}$
Since P(A|B) is not equal to P(A), these events are not statistically independent.
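The independence test above can be checked exactly with Python's fractions module, which avoids rounding. The variable names are illustrative and the counts (21, 72 and 37 out of 150) are taken from the worked example.

```python
from fractions import Fraction

total = 150
p_a = Fraction(37, total)          # P(A): media companies
p_b = Fraction(72, total)          # P(B): finance companies
p_a_and_b = Fraction(21, total)    # P(A and B)

p_a_given_b = p_a_and_b / p_b      # conditional probability P(A|B)
independent = p_a_given_b == p_a   # test: P(A|B) = P(A)

print(p_a_given_b, p_a, independent)
```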
If two events are not statistically independent, we apply the following rule. The
multiplication rule may be used to find the joint probability of events A and B occurring on a
single trial of a random experiment, i.e. the intersection of the two events. By
rearranging the conditional probability formula, the multiplication rule is defined as:
$P(A|B) = \frac{P(A \cap B)}{P(B)}$, so $P(A \cap B) = P(A|B)\,P(B)$
MANAGEMENT LEVELS
Qualification Level   Section Head   Department Head   Division Head   Total
O'Level               28             14                8               50
Diploma               20             24                6               50
Degree                5              10                14              29
Total                 53             48                28              129
Answer:
i) P(O’Level)= 50/129
Therefore, being a section head and having a degree are non-statistically independent
events.
= 5/129 * 29/129
= 5/129
$P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{24/129}{50/129} = \frac{24}{50}$
iv) 28/129
v) Being division head and being section head are mutually exclusive events:
= 28/129 + 53/129
= 81/129
vi) $P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{14/129}{48/129} = \frac{14}{48}$
This is a diagram that helps in decision making. The diagram has the shape of a tree and each
branch on the tree represents a logical outcome.
Example: A farmer has 15 cows in which 7 are black and 8 are white, it has been a tradition
that he sells a cow each month, if two cows are sold and then not replaced find the
probability that
iii) How will these probabilities be affected if these cows were replaced?
Solution
i) P(B) = 7/15 x 6/14 = 1/5
Example: If three playing cards are selected from a pack of playing cards with replacement,
what is the probability of getting at least two diamonds. How would this probability be
affected if these cards were not replaced?
Example: Suppose that we rolled an unbiased dice three times, Find the probability that the
outcome is:
Example
A man walks from point A to point C via point B as illustrated in diagram below.
A B C
Qtn: In how many ways can the man move from point A to C
Solution
A to B = 2 ways
B to C = 2 ways
Example: A restaurant menu has a choice of four starters, ten main courses and six desserts.
Find the total number of possible meals that can be ordered.
Example: How many different 7 place number plates are possible if the first 3 places are to
be occupied by alphabetic letters and the final four by numbers?
Solution
If letters and digits may be repeated: 26 x 26 x 26 x 10 x 10 x 10 x 10
If letters and digits may not be repeated: 26 x 25 x 24 x 10 x 9 x 8 x 7
A permutation is an arrangement of a group of objects in which the order matters; the number
of permutations is the number of distinct ways in which the objects can be arranged. Consider
the number of ways of placing 3 of the letters ABCDEFG in three empty spaces. The first space
can be filled in 7 ways, the second space in 6 ways and the third space in 5 ways. Therefore,
there are 7x6x5=210 ways of arranging three letters taken from seven letters. This is the
number of permutations of three objects taken from seven objects and it is written as 7P3.
NB* With a permutation the order in which letters or numbers are arranged is very important.
In general, the number of permutations of r objects taken from n objects is given by
$^nP_r = \frac{n!}{(n-r)!}$
E.g. 5! = 5x4x3x2x1 = 120
A combination is a selection of a subset of objects from a group of objects where the
order is not important. ABC gives rise to many permutations: ABC, ACB, CBA, CAB, BAC, BCA.
But all of these arrangements count as a single combination. The number of combinations of
three letters from the seven letters ABCDEFG is denoted by 7C3.
Example: From a group of five women and seven men. How many different committees
consisting of two women and three men can be formed?
Solution
$^5C_2 \times {}^7C_3$
= 10 x 35
= 350 committees
$^nC_r = \frac{n!}{r!(n-r)!}$
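The counting results in this section can be verified with Python's math module (math.perm and math.comb need Python 3.8 or later). The variable names are illustrative.

```python
import math

# Permutations: 3 letters arranged from the 7 letters ABCDEFG
letters_7p3 = math.perm(7, 3)

# Combinations: committees of 2 women from 5 and 3 men from 7
committees = math.comb(5, 2) * math.comb(7, 3)

# Number plates: 3 letters followed by 4 digits
plates_with_repeats = 26 ** 3 * 10 ** 4
plates_without_repeats = math.perm(26, 3) * math.perm(10, 4)

print(letters_7p3, committees, plates_with_repeats, plates_without_repeats)
```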
PROBABILITY DISTRIBUTION
A probability distribution is a list of all possible outcomes of a random variable and their
associated probabilities of occurrence. The expected value of a random experiment which is
the mean is given by the following formula
$E(X) = \mu = \sum x\,P(X = x)$
The Variance
$\sigma^2 = \sum (x_i - \mu)^2 P(X = x_i)$
or
$\sigma^2 = E[(X - \mu)^2]$
Example: A fair coin is tossed twice.
ii) Find the value of the expectation and standard deviation of the number of heads
that can occur.
Example: An unbiased dice is thrown once, construct a probability distribution that shows
the possible outcomes and use it to find the expectation and standard deviation. Let X be
the possible outcomes.
Example: A coin is tossed three times, construct a probability distribution for the number of
tails that can be obtained, and use it to calculate the expectation and standard deviation for
the number of tails that occur.
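As a sketch of how these examples can be worked, the code below builds the probability distribution by listing every equally likely outcome and then applies the expectation and variance formulae. The function names are assumptions; exact fractions are used so the results are not affected by rounding. For a fair coin the distribution of tails is the same as that of heads.

```python
import math
from fractions import Fraction
from itertools import product

def heads_distribution(tosses):
    """P(X = x), where X is the number of heads in `tosses` tosses of a fair coin."""
    outcomes = list(product("HT", repeat=tosses))   # all equally likely sequences
    dist = {}
    for outcome in outcomes:
        x = outcome.count("H")
        dist[x] = dist.get(x, 0) + Fraction(1, len(outcomes))
    return dist

def expectation(dist):
    return sum(x * p for x, p in dist.items())

def variance(dist):
    mu = expectation(dist)
    return sum((x - mu) ** 2 * p for x, p in dist.items())

two_tosses = heads_distribution(2)
three_tosses = heads_distribution(3)
die = {face: Fraction(1, 6) for face in range(1, 7)}   # unbiased die thrown once

for name, dist in [("2 tosses", two_tosses), ("3 tosses", three_tosses), ("die", die)]:
    print(name, expectation(dist), variance(dist), round(math.sqrt(variance(dist)), 4))
```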
Example: Each customer at a supermarket pays using one of the three methods, cash,
cheque and credit. The probability of randomly selected customer paying by cash is 0.54 and
cheque is 0.12.
iii) One paying by cash, one by Cheque and one by a credit card.
The choice of a particular probability distribution function depends primarily on the nature of
the random variable under study.
Discrete or Continuous
These probability distribution functions assume that the outcomes of a random variable
under study can take only specific values, usually integers, e.g. a car can only take
0,1,2,3,4,5,6… tyres at any time. The two common types of discrete probability distributions
are the Binomial and Poisson distributions. For a random variable to follow either the Poisson
or the binomial distribution, the following conditions have to be met.
A discrete random variable can be said to follow a Binomial distribution if the following are
satisfied:
i) There are two mutually exclusive outcomes of the random variable generally
referred to as success or failure.
ii) The probability of the success outcome is denoted as p, whereas for the failure is
q.
p + q = 1
iii) The random variable is observed n times and each observation is called a trial.
iv) The trials are assumed to be independent of each other i.e each trial does not
influence the outcomes of another trial.
If a random variable satisfies all the above conditions, it is said to follow a binomial process:
$P(X = x) = {}^nC_x\, p^x q^{n-x}$
Where n = number of trials
Example D (Done in video lesson): Ten students sit for an exam. The probability that each
student passes the exam is 0.2. What is the probability that three of them will pass the paper?
The formula for calculating the mean and variance of a binomial distribution is
Important words or terms used in probability (USING EXAMPLE D): NB visit video notes
i) Exactly or Equals
vi) Between
Example: Refer to the example D above and answer the following questions.
a) What is the probability that more than two students will pass?
f) Calculate the mean and standard deviation for the number of students who will pass.
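Example D and parts a) and f) above can be sketched in Python as follows. The function name binomial_pmf is an assumption, and the mean and standard deviation use the standard binomial results np and the square root of npq.

```python
import math

def binomial_pmf(x, n, p):
    """P(X = x) = nCx * p^x * q^(n-x), with q = 1 - p."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 10, 0.2                       # ten students, each passes with probability 0.2

p_exactly_3 = binomial_pmf(3, n, p)
# "more than two" is the complement of "two or fewer"
p_more_than_2 = 1 - sum(binomial_pmf(x, n, p) for x in range(3))
mean = n * p
std_dev = math.sqrt(n * p * (1 - p))

print(f"P(X=3)={p_exactly_3:.4f}  P(X>2)={p_more_than_2:.4f}  mean={mean}  sd={std_dev:.4f}")
```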
POISSON DISTRIBUTION
X ~ Poi(λ)
Where λ is the mean number of occurrences and x is the number of occurrences whose
probability is being calculated.
Example: The average number of errors a junior typist can make in a page is 6. What is the
probability that she makes:
Answer:
Example: A textile producer has established that a spinning machine stops randomly due to
thread breakages at an average rate of 5 stoppages per hour. What is the probability that in a
given hour:
d) Not more than two stoppages will occur in a two hour interval.
Answer:
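A hedged sketch for part d) of the spinning machine example is given below. It uses the standard Poisson formula P(X = x) = e^(-λ) λ^x / x!, which is not written out in these notes, and it assumes the mean scales with the length of the interval, so 5 stoppages per hour becomes λ = 10 for two hours. The function names are illustrative.

```python
import math

def poisson_pmf(x, lam):
    """P(X = x) for a Poisson random variable with mean lam."""
    return math.exp(-lam) * lam ** x / math.factorial(x)

def poisson_cdf(x, lam):
    """P(X <= x), summing the probabilities for 0, 1, ..., x."""
    return sum(poisson_pmf(k, lam) for k in range(x + 1))

rate_per_hour = 5
lam_two_hours = rate_per_hour * 2          # mean number of stoppages in two hours

p_at_most_2 = poisson_cdf(2, lam_two_hours)
print(f"P(X <= 2 in two hours) = {p_at_most_2:.6f}")
```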
A continuous random variable can take any value in an interval. Continuous probability
functions are used for probabilities associated with intervals of X values. You will encounter
many business situations in which the random variables of interest can be treated as a
continuous variable. There are several continuous distributions that are frequently used to
describe a physical situation. The most common and useful continuous distribution function
is the normal distribution, the reason being that the output for many processes are normally
distributed.
A normal probability distribution function finds the probability for a continuous random
variable. It has the following characteristics:
i) It is bell shaped.
v) The area under the curve of the PDF of a normal distribution is equal to 1.
3) The required area of probability is found where the Z values to one decimal place on
the left column intersects the remaining Z values on the top row.
Find the following probabilities where Z is the standard normal variable.
i) P(Z<2.31)
ii) P(Z<-1.49)
iii) P(Z>2.1)
iv) P(-2.5<Z)
v) P(0<Z<2.05)
vi) P(-1.52<Z<0.69)
NB* Always sketch a normal probability distribution curve and indicate the area whose
probability is to be found.
Answer:
v) P(Z<2.05) - P(Z<0)
0,9798 – 0,5000
=0,4798
P(Z<0,69) – P(Z<-1,52)
0,7549 – 0,0643
=0,6906
a. P(0<Z<1,46)
Solution
P(Z<1,46) – P (Z <0)
0,9278 – 0,5
=0,4278
Solution
0,5-0,0107
=0,4893
0,9066 – 0,0179
=0,8887
0,9812 – 0,8925
=0,0887
The trick in finding probabilities for a normal distribution is to convert the normal distribution
to a standard normal distribution. Values of x associated with any normally distributed
random variable can be converted into corresponding Z values by using the conversion
formula.
$Z = \frac{x - \mu}{\sigma}$
The time taken to install a new telephone is found to be normally distributed with the mean
time of 45minutes and a standard deviation of 8minutes. For a new installation what is the
probability that
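µ = 45 minutes, σ = 8 minutes
$P(X < x) = P\left(Z < \frac{x-\mu}{\sigma}\right)$
a) $P(X < 40) = P\left(Z < \frac{40-45}{8}\right) = P(Z < -0.63)$
= 0.2643
b) $P(44 < X < 49) = P\left(\frac{44-45}{8} < Z < \frac{49-45}{8}\right) = P(-0.13 < Z < 0.5)$
= 0.6915 - 0.4483
= 0.2432
c) $P(-0.25 < Z < 0) = P(Z < 0) - P(Z < -0.25)$
= 0.5 - 0.4013
= 0.0987
d) $P(45 < X < 51) = P\left(\frac{45-45}{8} < Z < \frac{51-45}{8}\right) = P(0 < Z < 0.75)$
= 0.7734 - 0.5
= 0.2734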
The number of customers who enter a certain supermarket in a day is normally distributed
with a mean of 400 customers and a standard deviation of 80 customers.
a) What is the probability that on a given day the number of customers is less than 250?
b) Greater than 400
c) Between 300 and 400
d) Between 200 and 500
Solution
µ = 400, σ = 80
a) $P(X < 250) = P\left(Z < \frac{250-400}{80}\right) = P(Z < -1.875)$
= 0.0301
b) $P(X > 400) = P(Z > 0) = 0.5$
c) $P(300 < X < 400) = P\left(\frac{300-400}{80} < Z < \frac{400-400}{80}\right) = P(-1.25 < Z < 0)$
= 0.5 - 0.1056
= 0.3944
d) $P(200 < X < 500) = P\left(\frac{200-400}{80} < Z < \frac{500-400}{80}\right) = P(-2.5 < Z < 1.25)$
= 0.8882
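The table look-ups in this example can be cross-checked with statistics.NormalDist from the Python standard library (Python 3.8 or later). The variable names are illustrative. The computed values use the exact Z score rather than a Z value rounded to two decimal places, so part a) comes out near 0.0304 instead of the table value 0.0301.

```python
from statistics import NormalDist

customers = NormalDist(mu=400, sigma=80)

p_a = customers.cdf(250)                        # P(X < 250)
p_b = 1 - customers.cdf(400)                    # P(X > 400)
p_c = customers.cdf(400) - customers.cdf(300)   # P(300 < X < 400)
p_d = customers.cdf(500) - customers.cdf(200)   # P(200 < X < 500)

for label, prob in [("a", p_a), ("b", p_b), ("c", p_c), ("d", p_d)]:
    print(f"{label}) {prob:.4f}")
```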
The Central Limit Theorem
There are many situations in business where the population is not normally distributed. For a
simple random sample of n observations taken from a population with mean μ and
standard deviation σ, the sum of the random variables will have an approximately normal
distribution when n is large. More specifically, if x1, x2, ..., xn is a random sample of size n
taken from a population with mean µ and standard deviation σ, the sample mean $\bar{x}$
approximately follows a normal distribution with the following parameters:
$\bar{x} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$ such that $P(\bar{x} < x) = P\left(Z < \frac{x-\mu}{\sigma/\sqrt{n}}\right)$
Hypothesis Testing
Large samples
Small samples
Large samples
Small samples
Small samples
Basic steps for hypothesis test
1. Formulate the hypothesis that is null hypothesis which is denoted as H0 and the
alternative hypothesis which is denoted as H1.
The null hypothesis is a claim made about a true value of a population parameter.
H0: µ= 1000
H1: µ≠ 1000
It can easily be identified by taking note of words like is exactly, indeed, equal to,
same as etc.
H0: µ= a
H1: µ≠ a
To identify this type of a hypothesis, look for words such as smaller than, less than,
below etc.
H0: µ= a
H1: µ˂ a
This is a claim that states that a population parameter is greater than a specified
value. It is identified by taking note of words like greater than, above, beyond etc.
H0: µ= a
H1: µ˃a
There are two types of errors that can be made when carrying out a hypothesis test.
It is the chance of rejecting a null hypothesis when it is true. It is denoted as α(alpha) which
is the level of significance or the probability of committing a type one error.
This is the chance of accepting a null hypothesis when it is false, it is denoted as β(beta)
which is the probability of committing a type 2 error.
There are two common types of distribution used in hypothesis testing, i.e. the Z-distribution
and the t-distribution.
The acceptance area is the region in which, if the calculated test statistic falls in it,
H0 is not rejected. The rejection area is the region in which, if the calculated test
statistic falls in it, H0 is rejected.
Critical Values
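α = 10%
Two-tailed: $\frac{\alpha}{2} = 0.05$, thus $Z_{0.05} = \pm 1.64$
One-tailed (lower): $Z_{0.10} = -1.28$
α = 5%
Two-tailed: $\frac{\alpha}{2} = 0.025$, thus $Z_{0.025} = \pm 1.96$
α = 1%
One-tailed (upper): $Z_{0.01} = 2.33$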
Step 4
For a large sample the test statistic is calculated as $Z_{calc} = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$
For a small sample the test statistic is calculated as $t_{calc} = \frac{\bar{x} - \mu}{s/\sqrt{n}}$
Step 5
Drawing a conclusion.
The conclusion depends on the results obtained in the step above. If the calculated test
statistic falls within the rejection region, then H0 is rejected. If it falls within
the acceptance region, we fail to reject H0.
Large Sample
Where n1 + n2 = n ≥ 30
$Z_{calc} = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$
µ = population mean
Example
A firm suspects that the average life of 28000km claimed for certain tires is too high. To
check this claim, the firm puts 40 of these tires on its trucks and gets a mean
lifetime of 27563km and a standard deviation of 1348km. Is this evidence that the mean
lifetime of these tires is in fact less than 28000km? Carry out an appropriate test using α = 0.01
H0: µ = 28000km
H1: µ ˂ 28000km
n =40 =˃ z-distribution.
Critical Value
α = 0,01
Zα= Z0,01
= -2,33
Χ- μ
Test Statistic : Zcalc =
σ
√n
27 563-28000
=
1348
√40
= -2,05
Since Zcalc = -2,05 is greater than -2,33 we fail to reject H0 and conclude at the 1% level of
significance that the mean life time of these tires is 28000km.
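The arithmetic in the tire example can be reproduced with a few lines of Python. This is only a sketch: the function name `z_statistic` is an illustrative choice, and the critical value -2,33 is typed in from the Z table rather than computed.

```python
from math import sqrt


def z_statistic(sample_mean, hypothesised_mean, std_dev, n):
    """One-sample Z statistic: (sample mean - mu) / (sigma / sqrt(n))."""
    return (sample_mean - hypothesised_mean) / (std_dev / sqrt(n))


# Tire example: n = 40, sample mean 27563 km, standard deviation 1348 km,
# claimed mean 28000 km.
z_calc = z_statistic(27563, 28000, 1348, 40)
z_crit = -2.33  # lower-tailed critical value at alpha = 0.01, read from the Z table

# Lower-tailed test: reject H0 only if the statistic falls below the critical value.
reject_h0 = z_calc < z_crit
print(round(z_calc, 2), reject_h0)
```

Here `reject_h0` is False, matching the conclusion that H0 is not rejected at the 1% level.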
A manufacturer claims that its light bulbs have an average life of 1600hrs. A sample of 100
light bulbs tested gave an average life of 1570hrs and a standard deviation of 120hrs. Test at
the 5% level of significance if this claim is true.
H0: µ = 1600
H1: µ ≠ 1600
n = 100 ⇒ Z-distribution
Critical Value
\( Z_{\alpha/2} = Z_{0,05/2} = Z_{0,025} = \pm 1,96 \)
Reject H0 if Zcalc < -1,96 or Zcalc > 1,96
Test Statistic:
\[ Z_{calc} = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} = \frac{1570 - 1600}{120 / \sqrt{100}} = -2,50 \]
Since Zcalc = -2,50 is less than -1,96 we reject H0 and conclude at the 5% level of significance
that the average life of these light bulbs is not equal to 1600hrs.
H0: µ = 340
H1: µ ≠ 340
n = 300 ⇒ Z-distribution
Critical Value
\( Z_{\alpha/2} = Z_{0,05/2} = Z_{0,025} = \pm 1,96 \)
Reject H0 if Zcalc < -1,96 or Zcalc > 1,96
Test Statistic:
\[ Z_{calc} = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} = \frac{350 - 340}{60 / \sqrt{300}} = 2,89 \]
Since Zcalc = 2,89 is greater than 1,96 we reject H0 and conclude at the 5% level of
significance that the average monthly salary of the employees is not equal to $340.
2. The average speed of cars along a highway is 135km/h. A sample study of 200 cars
along the highway showed an average speed of 130km/h with a variance of 900.
Test the hypothesis at the 10% level of significance to determine if the speed of cars
along the highway is below 135km/h.
H0: µ = 135km/h
H1: µ < 135km/h
n = 200 ⇒ Z-distribution
Critical Value
\( Z_{\alpha} = Z_{0,10} = -1,28 \)
Reject H0 if Zcalc < -1,28
Test Statistic:
\[ Z_{calc} = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} = \frac{130 - 135}{\sqrt{900} / \sqrt{200}} = \frac{130 - 135}{30 / \sqrt{200}} = -2,36 \]
Since Zcalc = -2,36 is less than -1,28 we reject H0 and conclude at the 10% level of
significance that the average speed of cars along the highway is less than 135km/h.
If it is a two-tailed test, divide alpha by 2 (α/2).
Look up the result obtained on the top row of the t tables; where this column
intersects the row for the degrees of freedom is the critical value.
For a single population mean the degrees of freedom are calculated as n - 1, where n is the
sample size.
Lower-tailed test: \( -t_{\alpha;\,n-1} \)
Upper-tailed test: \( t_{\alpha;\,n-1} \)
Two-tailed test: \( \pm t_{\alpha/2;\,n-1} \)
Example
\( t_{0,025;\,23} = 2,07 \)
α = 0,005, n = 5
\( t_{\alpha;\,n-1} = t_{0,005;\,4} = 4,60 \)
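H0: µ = 20
H1: µ > 20
n = 25 ⇒ t-distribution
Critical Value
\( t_{\alpha;\,n-1} = t_{0,01;\,24} = 2,49 \)
Reject H0 if tcalc > 2,49
Test Statistic:
\[ t_{calc} = \frac{\bar{x} - \mu}{s / \sqrt{n}} = \frac{22 - 20}{5 / \sqrt{25}} = 2 \]
Since tcalc = 2 is less than 2,49 we fail to reject H0: µ = 20 and conclude at the 1% level
of significance that there is insufficient evidence that the average price of a pair of shoes
is more than $20.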
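The mean weight of a certain product is assumed to be 85kg. To test this claim, a
random sample of 16 such products was studied and it was found that the average
weight was 83kg with a standard deviation of 5kg. Test whether the claim is true or
not using α = 0,05.
H0: µ = 85kg
H1: µ ≠ 85kg
n = 16 ⇒ t-distribution
Critical Value
\( \pm t_{\alpha/2;\,n-1} = \pm t_{0,025;\,15} = \pm 2,13 \)
Test Statistic:
\[ t_{calc} = \frac{\bar{x} - \mu}{s / \sqrt{n}} = \frac{83 - 85}{5 / \sqrt{16}} = -1,6 \]
Since tcalc = -1,6 lies between -2,13 and 2,13 we fail to reject H0 and conclude at the 5% level
of significance that there is insufficient evidence that the mean weight differs from 85kg.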
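The decision rule for the three kinds of test can be written out as a short Python sketch. The names `t_statistic` and `reject_null` are hypothetical helpers, and the critical values 2,13 (two-tailed, α = 0,05, 15 degrees of freedom) and 2,49 (upper-tailed, α = 0,01, 24 degrees of freedom) are assumed to be read from a t table, since the standard library has no t-distribution.

```python
from math import sqrt


def t_statistic(sample_mean, hypothesised_mean, sample_std_dev, n):
    """One-sample t statistic: (sample mean - mu) / (s / sqrt(n))."""
    return (sample_mean - hypothesised_mean) / (sample_std_dev / sqrt(n))


def reject_null(statistic, critical_value, tail):
    """Apply the rejection rule; critical_value is the positive table value."""
    if tail == "lower":
        return statistic < -critical_value
    if tail == "upper":
        return statistic > critical_value
    if tail == "two":
        return abs(statistic) > critical_value
    raise ValueError("tail must be 'lower', 'upper' or 'two'")


# Product weight example: n = 16, sample mean 83, s = 5, H0: mu = 85 (two-tailed).
t_weight = t_statistic(83, 85, 5, 16)
print(t_weight, reject_null(t_weight, 2.13, "two"))

# Shoe price example: n = 25, sample mean 22, s = 5, H0: mu = 20 (upper-tailed).
t_shoes = t_statistic(22, 20, 5, 25)
print(t_shoes, reject_null(t_shoes, 2.49, "upper"))
```

In both cases the statistic stays inside the acceptance region, so H0 is not rejected.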
Large Samples
The sum of the two sample sizes should be at least 30, where n1 = size of sample 1
and n2 = size of sample 2. The test statistic is calculated as
\[ Z_{calc} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} \]
where
\( \bar{x}_1 \) = the sample mean for sample 1
\( \bar{x}_2 \) = the sample mean for sample 2
\( \sigma_1^2 \) = the variance for sample 1
\( \sigma_2^2 \) = the variance for sample 2
Small Samples
n1 + n2 < 30
The test statistic is calculated as
\[ t_{calc} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} \]
Critical Values
Large sample: \( Z_{\alpha} \)
Small sample: \( t_{\alpha;\,n_1+n_2-2} \)
Example
A professor took two samples, one of 15 males and another of 12 females, from
students at a college who were enrolled for a statistics course. The professor found
that the mean score of male students in an exam was 76,2 with a standard deviation
of 7,4 and the mean score of the female students was 78,5 with a standard deviation
of 6,7. Test at the 5% level of significance if the mean score of all male students is
lower than that of the female students.
H0: μ1 = μ2
H1: μ1 < μ2
n1 = 15, n2 = 12, n1 + n2 = 27 < 30 ⇒ t-distribution
Critical Value
α = 0,05
\( -t_{\alpha;\,n_1+n_2-2} = -t_{0,05;\,25} = -1,71 \)
Rejection Criterion
Reject H0 if tcalc < -1,71
Test Statistic:
\[ t_{calc} = \frac{76,2 - 78,5}{\sqrt{\dfrac{7,4^2}{15} + \dfrac{6,7^2}{12}}} = -0,85 \]
Since tcalc = -0,85 is greater than -1,71 we fail to reject H0 and conclude at the 5% level
of significance that the mean score of all male students is not lower than that of female
students.
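The two-sample statistic used in this example can be checked with a short sketch. The function name `two_sample_statistic` is an illustrative assumption; it takes variances, so the standard deviations 7,4 and 6,7 are squared before being passed in.

```python
from math import sqrt


def two_sample_statistic(mean1, mean2, var1, var2, n1, n2):
    """Difference of sample means divided by sqrt(var1/n1 + var2/n2)."""
    return (mean1 - mean2) / sqrt(var1 / n1 + var2 / n2)


# Professor example: 15 males (mean 76.2, s = 7.4), 12 females (mean 78.5, s = 6.7).
t_calc = two_sample_statistic(76.2, 78.5, 7.4 ** 2, 6.7 ** 2, 15, 12)
t_crit = -1.71  # lower-tailed value for alpha = 0.05 and 25 degrees of freedom (t table)

print(round(t_calc, 2), t_calc < t_crit)
```

The same function gives the Z statistic for large samples; only the table used for the critical value changes.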
A transport company wants to compare the performance of 2 cars, a Nissan and a Toyota.
The Nissan was used 75 times and its average number of breakdowns was recorded to be 5 with a
variance of 4. The Toyota was used 63 times and its average number of breakdowns was
recorded to be 4 with a variance of 3. Test the hypothesis that the performance of the two
cars is the same. Use α = 0,05.
H0: μ1 = μ2
H1: μ1 ≠ μ2
n1 = 75, n2 = 63
Critical Value
α = 0,05
\( Z_{\alpha/2} = Z_{0,05/2} = Z_{0,025} = \pm 1,96 \)
Reject H0 if Zcalc < -1,96 or if Zcalc > 1,96
Test Statistic:
\[ Z_{calc} = \frac{5 - 4}{\sqrt{\dfrac{4}{75} + \dfrac{3}{63}}} = 3,15 \]
Since Zcalc = 3,15 is greater than 1,96 we reject H0 and conclude at the 5% level of
significance that the level of performance of these two cars is not the same.
The principal of a college wants to compare the performance of two teachers, X and Y. X
was assessed 8 times with a mean score of 6,2 and a variance of 2,15. Y was assessed
6 times with a mean score of 5,8 and a variance of 1,2. Test the hypothesis at the
1% level of significance that there is no difference between the mean scores
obtained by the two teachers.
Let X be population 1
Let Y be population 2
Critical Value
α/2 = 0,01/2 = 0,005
\( \pm t_{\alpha/2;\,n_1+n_2-2} = \pm t_{0,005;\,12} = \pm 3,05 \)
In some cases, it is possible to pair the measurements from one population or sample. The
hypothesis test examines whether the mean of the differences between the two measurements
in the population is zero. We will always have small samples for this type of hypothesis test.
Lower-tailed test
H0: µd = 0
H1: µd < 0
Critical Value: \( -t_{\alpha;\,n-1} \)
Upper-tailed test
H0: µd = 0
H1: µd > 0
Critical Value: \( t_{\alpha;\,n-1} \)
Two-tailed test
H0: µd = 0
H1: µd ≠ 0
Critical Value: \( \pm t_{\alpha/2;\,n-1} \)
\[ t_{calc} = \frac{\bar{d}}{s_d / \sqrt{n}} \]
where d represents the difference between the before measurement and the after
measurement, that is d = B - A,
\[ \bar{d} = \frac{\sum d}{n}, \quad n = \text{sample size} \]
and sd is the standard deviation of the differences, calculated as
\[ s_d = \sqrt{\frac{\sum d^2 - n(\bar{d})^2}{n-1}} \]
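Example
The following table shows the heart rates before and after the use of tobacco for a particular
group of people.
Does tobacco use increase the heart rates of these people? Test using α = 5%.
H0: µd = 0
H1: µd < 0 (since d = B - A, an increase in heart rate gives negative differences)
Critical Value
\( -t_{\alpha;\,n-1} = -t_{0,05;\,9} = -1,83 \)
Rejection Criterion
Reject H0 if tcalc < -1,83
Test Statistic:
\[ \bar{d} = \frac{\sum d}{n} = \frac{-192}{10} = -19,2 \]
\[ s_d = \sqrt{\frac{\sum d^2 - n(\bar{d})^2}{n-1}} = \sqrt{\frac{4376 - 10(-19,2)^2}{9}} = 8,75 \]
\[ t_{calc} = \frac{-19,2}{8,75 / \sqrt{10}} = -6,94 \]
Since tcalc = -6,94 is less than -1,83 we reject H0 at the 5% level of significance and conclude
that the use of tobacco does cause an increase in the heart rates.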
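Because the example works from the totals Σd = -192 and Σd² = 4376 with n = 10, the paired statistic can be recomputed from those sums alone. The sketch below assumes a hypothetical helper named `paired_t_from_sums`; it is not taken from the notes.

```python
from math import sqrt


def paired_t_from_sums(sum_d, sum_d_squared, n):
    """Return (mean difference, std dev of differences, t statistic) from summary sums."""
    d_bar = sum_d / n
    # Computational form of the sample standard deviation of the differences.
    s_d = sqrt((sum_d_squared - n * d_bar ** 2) / (n - 1))
    t_calc = d_bar / (s_d / sqrt(n))
    return d_bar, s_d, t_calc


# Tobacco example: sum of d = -192, sum of d squared = 4376, n = 10.
d_bar, s_d, t_calc = paired_t_from_sums(-192, 4376, 10)
print(round(d_bar, 2), round(s_d, 2), round(t_calc, 2))
```

Working without intermediate rounding gives the same two-decimal results as the hand calculation.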
You have been trying to control the weight of a chocolate candy bar by intervening in the
production process. The following table shows the weight before and after the intervention.
Has the intervention managed to reduce the weight of the chocolate candy bar? Test at the
1% level of significance (upper-tailed).
H0: µd = 0
H1: µd > 0
Critical Value
\( t_{\alpha;\,n-1} = t_{0,01;\,9} = 2,82 \)
Test Statistic:
\[ t_{calc} = \frac{\bar{d}}{s_d / \sqrt{n}} \]
\[ \bar{d} = \frac{0,45}{10} = 0,045 \]
\[ s_d = \sqrt{\frac{0,0539 - 10(0,045)^2}{9}} = 0,061 \]
\[ t_{calc} = \frac{0,045}{0,061 / \sqrt{10}} = 2,33 \]
Since tcalc = 2,33 is less than 2,82 we fail to reject H0 at the 1% level of significance and
conclude that there is insufficient evidence that the intervention has reduced the weight of
the chocolate candy bars.
Common statements like "the average price of petrol per liter is between $1,40 and
$1,50" are examples of interval estimates. In statistics it is customary to give not only the
interval estimate for a parameter but also the probability that the procedure leads to an
interval which contains the parameter. This probability is the level of confidence, for
example 90%, 95%, 97%.
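Small Sample
n < 30
The confidence interval estimate for a small sample is given by the following formula
\[ \bar{x} - t_{\alpha/2;\,n-1}\left(\frac{s}{\sqrt{n}}\right) \le \mu \le \bar{x} + t_{\alpha/2;\,n-1}\left(\frac{s}{\sqrt{n}}\right) \]
Or
\[ \bar{x} \pm t_{\alpha/2;\,n-1}\left(\frac{s}{\sqrt{n}}\right) \]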
Example
From a sample of 64 car commuters, the sample mean time taken to commute to
work daily was found to be 26,5 minutes. The standard deviation is known to be
15 minutes. Find the 95% confidence interval estimate of the actual mean time µ
taken by all car commuters.
n = 64 ⇒ Z-distribution
α = 100% - 95% = 5% = 0,05
\[ \bar{x} - Z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + Z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \]
\[ 26,5 - Z_{0,05/2}\frac{15}{\sqrt{64}} \le \mu \le 26,5 + Z_{0,05/2}\frac{15}{\sqrt{64}} \]
\[ 26,5 - 1,96\left(\frac{15}{\sqrt{64}}\right) \le \mu \le 26,5 + 1,96\left(\frac{15}{\sqrt{64}}\right) \]
22,83 ≤ µ ≤ 30,18
We are 95% confident that the mean time taken to commute to work daily is
between 22,83 minutes and 30,18 minutes.
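The interval is simply the sample mean plus or minus the critical value times σ/√n, which the following sketch makes explicit. The helper name `mean_confidence_interval` is an assumption for illustration, and the critical values (1,96 for the Z case, 2,80 for the t case with 24 degrees of freedom) are typed in from tables.

```python
from math import sqrt


def mean_confidence_interval(sample_mean, std_dev, n, critical_value):
    """Return (lower, upper) limits: mean -/+ critical value * std_dev / sqrt(n)."""
    margin = critical_value * std_dev / sqrt(n)
    return sample_mean - margin, sample_mean + margin


# Commuter example, n = 64, 95% confidence, Z = 1.96.
lower_z, upper_z = mean_confidence_interval(26.5, 15, 64, 1.96)
print(lower_z, upper_z)  # about 22.83 and 30.18

# Same mean and standard deviation with n = 25, 99% confidence, t = 2.80.
lower_t, upper_t = mean_confidence_interval(26.5, 15, 25, 2.80)
print(lower_t, upper_t)  # about 18.1 and 34.9
```

Reducing the sample size and raising the confidence level both widen the interval, as the second call shows.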
If the sample size in the example above were 25, with the mean and standard deviation
remaining the same, compute a 99% confidence interval estimate of the
population mean µ.
n = 25 ⇒ t-distribution
α = 100% - 99% = 1% = 0,01
\[ 26,5 - t_{0,01/2;\,24}\left(\frac{15}{\sqrt{25}}\right) \le \mu \le 26,5 + t_{0,01/2;\,24}\left(\frac{15}{\sqrt{25}}\right) \]
\[ 26,5 - t_{0,005;\,24}\left(\frac{15}{\sqrt{25}}\right) \le \mu \le 26,5 + t_{0,005;\,24}\left(\frac{15}{\sqrt{25}}\right) \]
\[ 26,5 - 2,80\left(\frac{15}{\sqrt{25}}\right) \le \mu \le 26,5 + 2,80\left(\frac{15}{\sqrt{25}}\right) \]
18,1 ≤ µ ≤ 34,9
The mean time taken to commute to work daily lies between 18,10 minutes and 34,90 minutes
with a probability of 0,99.
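Large Samples
\[ (\bar{X}_1 - \bar{X}_2) - Z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \le (\mu_1 - \mu_2) \le (\bar{X}_1 - \bar{X}_2) + Z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \]
Or
\[ (\bar{X}_1 - \bar{X}_2) \pm Z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \]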
A company has two shops, A and B, and wants to compare the efficiency of the employees of
these two shops. 30 employees were sampled from shop A and 20 from shop B and their
performance was observed. Shop A employees completed a given task in 30 minutes on average
with a sample standard deviation of 6 minutes. Shop B employees took 25 minutes on average to
complete the same task with a sample variance of 25. Construct a 95%
confidence interval estimate for the difference in the mean number of minutes taken
to complete the task by the employees from the two shops.
n1 + n2 = 50 > 30 ⇒ Z-distribution
\[ (\bar{X}_1 - \bar{X}_2) - Z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \le (\mu_1 - \mu_2) \le (\bar{X}_1 - \bar{X}_2) + Z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \]
\[ (30 - 25) - Z_{0,05/2}\sqrt{\frac{6^2}{30} + \frac{25}{20}} \le (\mu_1 - \mu_2) \le (30 - 25) + Z_{0,05/2}\sqrt{\frac{6^2}{30} + \frac{25}{20}} \]
\[ 5 - 1,96\sqrt{2,45} \le (\mu_1 - \mu_2) \le 5 + 1,96\sqrt{2,45} \]
1,93 ≤ μ1 - μ2 ≤ 8,07
We are 95% confident that the difference between the mean number of minutes taken to
complete a given task by the employees from the two shops is between 1,93 minutes and
8,07 minutes.
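The interval for the difference of two means follows the same pattern as the single-mean case. The sketch below uses a hypothetical helper `difference_confidence_interval` and takes variances as inputs, so shop A's standard deviation of 6 is squared while shop B's variance of 25 is used directly; the value 1,96 is the Z-table entry for 95% confidence.

```python
from math import sqrt


def difference_confidence_interval(mean1, mean2, var1, var2, n1, n2, critical_value):
    """Return (lower, upper) limits for mu1 - mu2 from two independent samples."""
    difference = mean1 - mean2
    margin = critical_value * sqrt(var1 / n1 + var2 / n2)
    return difference - margin, difference + margin


# Shop example: A has n = 30, mean 30, s = 6; B has n = 20, mean 25, variance 25.
lower, upper = difference_confidence_interval(30, 25, 6 ** 2, 25, 30, 20, 1.96)
print(round(lower, 2), round(upper, 2))
```

Passing a t-table value and the smaller sample sizes to the same function gives the small-sample interval.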
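Small Samples
n1 + n2 < 30 ⇒ t-distribution
\[ (\bar{X}_1 - \bar{X}_2) - t_{\alpha/2;\,n_1+n_2-2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \le (\mu_1 - \mu_2) \le (\bar{X}_1 - \bar{X}_2) + t_{\alpha/2;\,n_1+n_2-2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \]
Or
\[ (\bar{X}_1 - \bar{X}_2) \pm t_{\alpha/2;\,n_1+n_2-2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \]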
If the sample sizes in the example above were 15 employees from shop A and 10 employees
from shop B, with the standard deviations remaining the same, compute a 90% confidence
interval estimate for the difference in the mean number of minutes taken to complete the
given task by the employees from the two shops.
n1 + n2 = 25 < 30 ⇒ t-distribution
\[ (30 - 25) - t_{0,05;\,23}\sqrt{\frac{6^2}{15} + \frac{25}{10}} \le (\mu_1 - \mu_2) \le (30 - 25) + t_{0,05;\,23}\sqrt{\frac{6^2}{15} + \frac{25}{10}} \]
\[ 5 - 1,71\sqrt{4,9} \le (\mu_1 - \mu_2) \le 5 + 1,71\sqrt{4,9} \]
1,21 ≤ μ1 - μ2 ≤ 8,79
A sample drawn from a given population must be representative for a fair conclusion to be
made about the population. It follows that, for a given confidence level, standard deviation
and margin of error within which the true average is expected to fall, the sample size can
be calculated using the formula
\[ n = \left(\frac{Z_{\alpha/2}\,\sigma}{e}\right)^2 \]
σ = standard deviation, which shows how much variation one expects in the responses
e = margin of error
e = 80
σ = 251,35
α = 100% - 90% = 10% = 0,1
\[ n = \left(\frac{Z_{0,1/2} \times 251,35}{80}\right)^2 = \left(\frac{Z_{0,05} \times 251,35}{80}\right)^2 = \left(\frac{1,64 \times 251,35}{80}\right)^2 = 26,55 \]
n ≈ 27 employees
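The sample-size formula always rounds up, since a fraction of a respondent cannot be sampled. A minimal sketch, assuming a hypothetical helper called `required_sample_size` and the table value Z = 1,64 used in the notes:

```python
from math import ceil


def required_sample_size(z_value, std_dev, margin_of_error):
    """Smallest whole n satisfying n >= (z * sigma / e) ** 2."""
    exact = (z_value * std_dev / margin_of_error) ** 2
    return ceil(exact)


# Example from the notes: 90% confidence (Z = 1.64), sigma = 251.35, e = 80.
n = required_sample_size(1.64, 251.35, 80)
print(n)
```

Halving the margin of error roughly quadruples the required sample size, because e enters the formula squared.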
The Chi-square distribution is the distribution obtained by multiplying the ratio of sample
variance to population variance by the degrees of freedom when random samples are
selected. Expected frequencies, denoted E, are frequencies obtained by calculation, whereas
observed frequencies, denoted O, are obtained by observation. The Chi-square distribution is
denoted χ² and it is used to test for independence.
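In this test the claim is that the row and column variables are independent of each other. The
hypotheses for this test are stated as follows.
H0: row and column variables are independent of each other
H1: row and column variables are not independent of each other
Critical Value
\( \chi^2_{\alpha;\,(r-1)(c-1)} \)
r = number of rows
c = number of columns
For example, \( \chi^2_{0,01;\,(3-1)(3-1)} = \chi^2_{0,01;\,4} = 13,3 \)
Rejection Criterion
Reject H0 if χ²calc is greater than the critical value.
Test Statistic
\[ \chi^2_{calc} = \sum \frac{(O - E)^2}{E} \]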
Example. In order to determine whether or not a relationship exists between blood type and
the severity of winter flu, a survey was conducted and yielded the following results. Test at
the 5% level of significance if there is a relationship.
H0: There is no relationship between blood type and the severity of winter flu.
H1: There is a relationship between blood type and the severity of winter flu.
Critical Value
α = 0,05
r = 3, c = 4
\( \chi^2_{0,05;\,(3-1)(4-1)} = \chi^2_{0,05;\,6} = 12,6 \)
Rejection Criterion
Reject H0 if χ²calc > 12,6
Test Statistic
\[ \chi^2_{calc} = \sum \frac{(O - E)^2}{E} \]
The expected frequency of each cell is E = (row total × column total) / grand total. For the
first row:
Cell 1: E = (228 × 300) / 1335 = 51,24
Cell 2: E = (228 × 320) / 1335 = 54,65
Cell 3: E = (228 × 430) / 1335 = 73,44
Cell 4: E = (228 × 285) / 1335 = 48,67
χ²calc = ∑ (O - E)² / E = 56,73988
Since χ²calc = 56,73988 is greater than 12,60 we reject H0 at the 5% level of significance and
conclude that there is a relationship between blood type and the severity of winter flu.
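The expected frequencies for the first row can be reproduced from the totals quoted in the calculation (row total 228, column totals 300, 320, 430 and 285). The sketch below is an illustration only: the function names are hypothetical, and since the observed table is not reproduced in these notes, `chi_square_statistic` is defined but not applied to the example.

```python
def expected_frequency(row_total, column_total, grand_total):
    """Expected cell count under independence: row total * column total / grand total."""
    return row_total * column_total / grand_total


def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over matching cells of two flat lists."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


column_totals = [300, 320, 430, 285]
grand_total = sum(column_totals)  # 1335, the total number of responses

first_row_expected = [expected_frequency(228, c, grand_total) for c in column_totals]
print([round(e, 2) for e in first_row_expected])
```

The expected counts in a row always add up to that row's total, which is a quick check on the arithmetic.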
1 2 3 total
1 45 87 52 184
2 33 65 100 198
total 79 152 152 383
Test the hypothesis that the row in which a response falls is independent of the column in
which it falls. Use α = 1%.
Correlation – tests the strength of the linear relationship between two variables.
Regression Analysis
The purpose of simple linear regression analysis is to examine some form of linear
relationship between two random variables. These variables are denoted X and Y.
X (independent variable) values are always known or can easily be found, whereas
Y (dependent variable) values are estimated using X values.
There are two methods of establishing if a linear relationship exists between 2 variables.
Scatter Plot
This is a plot of x values against y values. The x values make up the horizontal axis of the
graph and the y values make up the vertical axis. The graph is drawn by plotting dots
where the values of x and y intersect. If the dots seem to lie in a linear form, then a linear
relationship exists between the two variables.
This suggests that x values can be confidently used in predicting the y values.
[Figure: scatter plot of y-values against x-values]
This indicates that there is a positive linear relationship between x and y: as x increases,
y increases.
[Figure: scatter plot of y against x]
This indicates a negative linear relationship between x and y. For a negative linear
relationship, as x increases y decreases.
If the dots are scattered all over the space this suggests no linear relationship between x
and y.
If a linear relationship exists between two variables, then x values can be relied upon in
predicting the y values. If a linear relationship does not exist, then the x values cannot be
relied upon in predicting the y values.
The following data shows the number of garments and the size of cloth in meters.
Number of garments   Cloth in meters
45   25
28   16
34   20
42   28
34   19
30   17
42   22
39   20
24   14
32   17
20   6
The dependent variable is the number of garments; the independent variable is the cloth in
meters.
[Figure: scatter plot of cloth in meters against number of garments]
From the scatter plot, a positive linear relationship exists between cloth in meters and
number of garments.
The following data gives the profits for a particular type of machine sold and
the number of units sold in different shops.
Solution
Profit   Number of units
550   42
600   38
650   35
600   40
500   44
650   38
450   45
500   42
[Figure: scatter plot of number of units against profit]
\[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x \]
where \( \hat{\beta}_0 \) and \( \hat{\beta}_1 \) are unknowns.
The estimated value of the dependent variable y is composed of a linear function
\( \hat{\beta}_0 + \hat{\beta}_1 x \) of the explanatory variable x.
The parameter \( \hat{\beta}_0 \) is known as the intercept parameter and the parameter
\( \hat{\beta}_1 \) is known as the slope parameter. The slope parameter \( \hat{\beta}_1 \) is of
particular interest since it indicates how the expected value of y depends on x.
If \( \hat{\beta}_1 > 0 \) then a positive linear relationship exists between x and y: as x
increases y also increases.
[Figure: line \( \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x \) with positive slope]
If \( \hat{\beta}_1 < 0 \) then a negative linear relationship exists between x and y.
[Figure: line \( \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x \) with negative slope]
NB: The two unknown parameters \( \hat{\beta}_0 \) and \( \hat{\beta}_1 \) are estimated from a data set.
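\[ \hat{\beta}_1 = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2} \]
\( \hat{\beta}_0 \) is calculated from \( \hat{\beta}_1 \) as follows
\[ \hat{\beta}_0 = \frac{\sum y}{n} - \hat{\beta}_1\left(\frac{\sum x}{n}\right) \]
Thus, \( \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \)
NB: For a specific value of the explanatory variable x the equation provides an estimated
value of y.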
Example
The following is sample data obtained in a study of the relationship between the number of
years that applicants for a certain job have studied English language in high school or
college and the grades which they received in a proficiency test in that language.
d). Superimpose the regression line on the scatter graph.
e). Predict the grade in the test for someone with 8 years in school studying English language.
[Figure: scatter plot of grade in test against number of years]
\[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x \]
\[ \hat{\beta}_1 = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2} = \frac{10(2404) - (35)(667)}{10(133) - 1225} = 6,62 \]
\[ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 66,7 - (6,62)(3,5) = 43,53 \]
\[ \hat{y} = 43,53 + 6,62x \]
d). Since \( \hat{\beta}_1 > 0 \) there is a positive linear relationship between x and y: as x
increases by one unit, y increases by 6,62 units.
Two points on the line, used to draw it on the scatter graph:
For x = 2: \( \hat{y} = 43,53 + 6,62(2) = 56,77 \)
For x = 4: \( \hat{y} = 43,53 + 6,62(4) = 70,01 \)
e). \( \hat{y} = 43,53 + 6,62(8) = 96,49 \approx 96\% \)
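The slope and intercept can be recomputed from the sums used in this example (n = 10, Σx = 35, Σy = 667, Σxy = 2404, Σx² = 133). The helper `fit_line_from_sums` below is a hypothetical name for a sketch of the least-squares formulas as stated in these notes.

```python
def fit_line_from_sums(n, sum_x, sum_y, sum_xy, sum_x_squared):
    """Return (intercept, slope) of the least-squares line from summary sums."""
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x_squared - sum_x ** 2)
    intercept = sum_y / n - slope * (sum_x / n)
    return intercept, slope


# English-language example: years of study (x) against test grade (y).
intercept, slope = fit_line_from_sums(10, 35, 667, 2404, 133)
predicted_grade = intercept + slope * 8  # prediction for 8 years of study

print(round(intercept, 2), round(slope, 2), round(predicted_grade, 2))
```

Note that 8 years lies outside the range of x shown on the scatter graph, so the prediction is an extrapolation and should be treated with caution.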
Correlation Analysis
Correlation analysis measures the strength of the linear relationship between the
independent variable x and the dependent variable y.
The correlation coefficient is denoted r and takes values between -1 and 1, that is, r must
lie in -1 ≤ r ≤ 1.
The common correlation coefficient used in statistics is the Pearson correlation coefficient.
It is calculated by
\[ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left(n\sum x^2 - (\sum x)^2\right)\left(n\sum y^2 - (\sum y)^2\right)}} \]
r = correlation coefficient
x = independent variable
y = dependent variable
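\[ \hat{\beta}_1 = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2} = \frac{5(37) - (15)(10)}{5(55) - (15)^2} = 0,7 \]
\[ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 2 - (0,7)(3) = -0,1 \]
\[ \hat{y} = -0,1 + 0,7x \]
\[ r = \frac{11 \times 7289 - 204 \times 370}{\sqrt{\left(11 \times 4120 - 204^2\right)\left(11 \times 13070 - 370^2\right)}} = 0,93 \]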
Coefficient of determination
This measurement helps to determine the strength of the association between the two variables
and is expressed as a percentage. It is calculated as r² × 100 and helps in estimating the
reliability of x values in predicting the y values.
r² × 100 = 0,93 × 0,93 × 100 = 86,49%
The x values are 86,49% reliable in predicting the y values according to this model, that is,
86,49% of the variation in y is explained by x.
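The correlation coefficient and the coefficient of determination can be checked from the sums in the calculation above (n = 11, Σx = 204, Σy = 370, Σxy = 7289, Σx² = 4120, Σy² = 13070). The function name `pearson_r_from_sums` is an illustrative assumption. The notes square the rounded value r = 0,93, so the sketch does the same to reproduce 86,49%.

```python
from math import sqrt


def pearson_r_from_sums(n, sum_x, sum_y, sum_xy, sum_x_squared, sum_y_squared):
    """Pearson correlation coefficient computed from summary sums."""
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(
        (n * sum_x_squared - sum_x ** 2) * (n * sum_y_squared - sum_y ** 2)
    )
    return numerator / denominator


r = pearson_r_from_sums(11, 204, 370, 7289, 4120, 13070)
r_rounded = round(r, 2)

# Coefficient of determination as a percentage, using the rounded r as the notes do.
determination_percent = r_rounded ** 2 * 100
print(r_rounded, round(determination_percent, 2))
```

Squaring the unrounded r gives a slightly different percentage, so small discrepancies from hand calculations are a rounding effect rather than an error.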
A lady operates a hot dog stand in the park. She suspects that there is a relationship
between the temperature on a given day and the number of hot dogs she sells on that
day. She begins to keep track of the data and obtains the following results.
d) interpret
e) estimate the increase in the number of hot dogs sold when the temperature
increases from 27 degrees to 30 degrees.
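[Figure: scatter plot of number of hot dogs sold against temperature, showing a positive linear relationship]
b). Yes, it is reasonable to fit a regression line because the scatter plot exhibits some form
of positive linear relationship between the temperature on a given day and the number of
hot dogs sold on that day.
\[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x \]
∑x = 265, ∑x² = 7207, ∑xy = 17413, ∑y = 644, ∑y² = 42186, ȳ = 64,4
\[ \hat{\beta}_1 = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2} = \frac{10 \times 17413 - 265 \times 644}{10 \times 7207 - 265^2} = 1,88 \]
\[ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 64,4 - (1,88)(26,5) = 14,58 \]
\[ \hat{y} = 14,58 + 1,88x \]
e) At 30 degrees: \( \hat{y} = 14,58 + 1,88 \times 30 = 70,98 \)
At 27 degrees: \( \hat{y} = 14,58 + 1,88 \times 27 = 65,34 \)
70,98 - 65,34 = 5,64, that is, about 5 to 6 more hot dogs sold
f)
\[ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left(n\sum x^2 - (\sum x)^2\right)\left(n\sum y^2 - (\sum y)^2\right)}} = \frac{10 \times 17413 - 265 \times 644}{\sqrt{\left(10 \times 7207 - 265^2\right)\left(10 \times 42186 - 644^2\right)}} = 0,96 \]
r² × 100 = 0,96 × 0,96 × 100 = 92,16%
The x values are 92,16% reliable in predicting the y values according to this model.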