Sit 212 Lecture Note
prepared by
Musa Mohammed (PhD)
FREQUENCY DISTRIBUTION
What is a Frequency Distribution Table?
For example, in the following list of numbers, the frequency of the number 9 is 5
(because it occurs 5 times): 1, 2, 3, 4, 6, 9, 9, 8, 5, 1, 1, 9, 9, 0, 6, 9.
So the table which contains frequency and data is called a frequency distribution
table or simply a frequency table.
Types of Frequency Distributions
There are two basic types of frequency distribution in statistics, ungrouped and grouped, which are
explained below:
Marks Frequency
7 1
8 5
9 2
10 6
11 2
12 2
14 1
15 1
N = 20
Ungrouped frequency distribution CONT.
Example 2: A jar contains beads of different colors: red, green, blue, black,
red, green, blue, yellow, red, red, green, green, green, yellow, red, green, yellow.
To know the exact number of beads of each particular color, we need to classify
the beads into categories. An easy way to find the number of beads of each color
is to use tally marks. Pick the beads one by one and enter the tally marks in the
respective row and column. Then, indicate the frequency for each item in the
table.
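The tally procedure for the beads can be sketched in Python as follows. This is only an illustration, assuming the standard-library `Counter` class stands in for the hand tally; the variable names are hypothetical and not part of the notes.

```python
from collections import Counter

# Bead colors exactly as listed in Example 2.
beads = [
    "red", "green", "blue", "black", "red", "green", "blue", "yellow",
    "red", "red", "green", "green", "green", "yellow", "red", "green",
    "yellow",
]

# Counter plays the role of the tally: one count per color.
frequency = Counter(beads)

# Print a small frequency table with tally marks and frequencies.
for color, count in frequency.most_common():
    tally = "|" * count
    print(f"{color:<7} {tally:<7} {count}")
```

The frequencies must add up to the total number of beads picked, which is a quick check that no bead was missed.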
Grouped Frequency Distribution
How to make a grouped frequency table particularly for a large amount of data
For a large number of data, to draw a frequency table, we should use the
following steps:
1. mean
2. median
3. mode
Where,
Lb = lower class boundary of the median class (the median class is the one in which
the $\frac{n}{2}$th observation lies in the cumulative frequency, C.F.)
Where
Lb = lower class boundary of the modal class (i.e. the class containing the mode)
F1 = frequency of the modal class – frequency of the next preceding class (just
before)
F2 = frequency of the modal class – frequency of the next succeeding class (just
after)
Solution:
(a) $\bar{X} = \frac{\Sigma X}{n} = \frac{7+2+5+4+3+3+6}{7} = \frac{30}{7} = 4.29$, therefore mean = N429
(b) Median: arrange in ascending order 2, 3, 3, 4, 5, 6, 7; the middle value is 4, therefore median = N400
(c) Mode = N300
Mean, Median and Mode for Ungrouped Data CONT
Example 2: A project manager was saddled with the responsibility of keeping the SIT
clean and used the following numbers of cleaners over six days:
10, 40, 15, 30, 10, and 15
Compute (a) the mean number of cleaners per day, (b) the median, and (c) the mode.
(a) $\bar{X} = \frac{\Sigma X}{n} = \frac{10+40+15+30+10+15}{6} = \frac{120}{6} = 20$ cleaners
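Parts (b) and (c) of this example are not worked out in the notes. A minimal Python sketch, assuming the standard-library `statistics` module, computes all three measures for the cleaners data; the variable names are illustrative only.

```python
from statistics import mean, median, multimode

# Number of cleaners used on each of the six days.
cleaners = [10, 40, 15, 30, 10, 15]

mean_cleaners = mean(cleaners)      # sum of values divided by n
median_cleaners = median(cleaners)  # average of the two middle values when n is even
modes = multimode(cleaners)         # every value that occurs most often

print("mean:", mean_cleaners)
print("median:", median_cleaners)
print("mode(s):", modes)
```

Because 10 and 15 each occur twice, the data has two modes, which is why `multimode` is used here rather than a single-valued mode.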
(c) Median $= L_b + \left(\frac{\frac{n}{2} - \Sigma f_1}{f_m}\right)c$
$= 39.5 + \left(\frac{\frac{39}{2} - 12}{10}\right)5 = 39.5 + \left(\frac{7.5}{10}\right)5 = 43.25$ bags of cement
Mean, Median and Mode for Grouped Data CONT.
(d) Mode $= L_b + \left(\frac{f_1}{f_1+f_2}\right)C$
$= 44.5 + \left(\frac{5}{5+13}\right)5$
$= 44.5 + \frac{25}{18} = 45.89$ bags of cement
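The two grouped-data formulas can be checked with a short Python sketch. The function and parameter names below are hypothetical; the numbers are the ones substituted in the worked solution above (class width 5, n = 39).

```python
def grouped_median(lb, n, cum_freq_before, f_median, width):
    """Median = Lb + ((n/2 - sum of frequencies before the median class) / fm) * c."""
    return lb + ((n / 2 - cum_freq_before) / f_median) * width


def grouped_mode(lb, f1, f2, width):
    """Mode = Lb + (f1 / (f1 + f2)) * C, with f1 and f2 as defined in the notes."""
    return lb + (f1 / (f1 + f2)) * width


# Values taken from the worked solution.
median_bags = grouped_median(lb=39.5, n=39, cum_freq_before=12, f_median=10, width=5)
mode_bags = grouped_mode(lb=44.5, f1=5, f2=13, width=5)

print(round(median_bags, 2), round(mode_bags, 2))
```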
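The mean deviation (MD) is the sum of the absolute differences of all the values from the
arithmetic mean, divided by the number of observations.
For ungrouped data: $MD = \frac{\Sigma|X - \bar{X}|}{n}$ or $\frac{\Sigma|d|}{n}$
For grouped data: $MD = \frac{\Sigma f|X - \bar{X}|}{n}$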
Standard Deviation
It shows how far the observations are spread from, or clustered around, the mean.
When the observations are clustered around the mean, the standard deviation is
small, but when they are spread out, it is large.
It is the square root of the variance.
The formula for ungrouped data:
$S = \sqrt{\frac{\Sigma(X - \bar{X})^2}{n}}$
$\bar{X} = \frac{\Sigma X}{n} = \frac{48}{8} = 6$ milk
MD, SD and Variance for Ungrouped data CONT.
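(b) $MD = \frac{\Sigma|X - \bar{X}|}{n} = \frac{12}{8} = 1.5$
(c) $S = \sqrt{\frac{\Sigma(X - \bar{X})^2}{n}} = \sqrt{\frac{22}{8}} = \sqrt{2.75} = 1.66$
(d) $S^2 = \frac{\Sigma(X - \bar{X})^2}{n} = \frac{22}{8} = 2.75$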
MD, SD and Variance for Grouped data
Example 2: The quantity of fertilizer (in bags) used by twenty farmers in one year is
shown in the following distribution.
2, 4, 5, 5, 4, 8, 6, 7, 6, 2, 4, 5, 6, 6, 7, 10, 12, 10, 5, 6.
Required: Compute
(a) The Range
(b) The Mean Deviation
(c) The Standard Deviation
(d) The Variance
solution:
(a) Range = Highest value – Lowest value
= 12 – 2
= 10 bags
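X | f | fX | X − X̄ | |X − X̄| | f|X − X̄| | (X − X̄)² | f(X − X̄)²
2 | 2 | 4 | -4 | 4 | 8 | 16 | 32
4 | 3 | 12 | -2 | 2 | 6 | 4 | 12
5 | 4 | 20 | -1 | 1 | 4 | 1 | 4
6 | 5 | 30 | 0 | 0 | 0 | 0 | 0
7 | 2 | 14 | 1 | 1 | 2 | 1 | 2
8 | 1 | 8 | 2 | 2 | 2 | 4 | 4
10 | 2 | 20 | 4 | 4 | 8 | 16 | 32
12 | 1 | 12 | 6 | 6 | 6 | 36 | 36
Σf = 20, ΣfX = 120, Σf|X − X̄| = 36, Σf(X − X̄)² = 122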
$\bar{X} = \frac{\Sigma fX}{\Sigma f} = \frac{120}{20} = 6$ bags
(b) $MD = \frac{\Sigma f|X - \bar{X}|}{n} = \frac{36}{20} = 1.8$
(c) $S = \sqrt{\frac{\Sigma f(X - \bar{X})^2}{n}} = \sqrt{\frac{122}{20}} = \sqrt{6.1} = 2.47$
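The fertilizer example can be reproduced directly from the raw data with a short Python sketch. It assumes, as the notes do, that the variance divides by n rather than n − 1; the variable names are illustrative.

```python
from math import sqrt

# Bags of fertilizer used by the twenty farmers (Example 2).
bags = [2, 4, 5, 5, 4, 8, 6, 7, 6, 2, 4, 5, 6, 6, 7, 10, 12, 10, 5, 6]
n = len(bags)

data_range = max(bags) - min(bags)                       # (a) highest minus lowest value
mean_bags = sum(bags) / n                                # arithmetic mean
mean_deviation = sum(abs(x - mean_bags) for x in bags) / n   # (b) average absolute deviation
variance = sum((x - mean_bags) ** 2 for x in bags) / n       # (d) divides by n, as in the notes
std_dev = sqrt(variance)                                 # (c) square root of the variance

print(data_range, mean_bags, mean_deviation, round(std_dev, 2), variance)
```

Working from the raw list gives the same sums as the frequency table, since each value simply appears as many times as its frequency.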
54, 40, 38, 25, 32, 45, 46, 45, 35, 59, 42, 43, 46, 46, 28, 34, 40, 44, 44, 47, 51, 49,
49, 36, 31, 36, 41, 42, 37, 35, 45, 49, 48, 45, 46, 47, 48, 41, 44.
Group the data into class intervals of width 5 bags (e.g. 25-29, 30-34, etc.).
Required: Compute
(a) The Range
(b) The Mean Deviation
(c) The Standard Deviation
(d) The Variance
PROBABILITY
The literal meaning of probability is the chance that an event will occur or happen.
The probability of an event ‘E’ is defined as the ratio of the number of outcomes
favourable to ‘E’ to the total number of possible outcomes.
i.e. $P(E) = \frac{\text{Number of favourable outcomes}}{\text{Total number of outcomes}}$
Also, if the events are not mutually exclusive (i.e. they have points in common),
the probability of either A or B is:
$P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)$
(a) $P(O) = \frac{3}{5}$
(b) $P(M) = \frac{2}{4} = \frac{1}{2}$
$= \frac{3}{5} \times \frac{1}{2}$
$= \frac{3}{10}$
BINOMIAL DISTRIBUTION
If P is the probability that an event will happen in any single trial (called
probability of success) and q = 1 – P is the probability that it will fail to happen in
any single trial (called the probability of a failure).
The binomial equation also uses factorials. In mathematics, the factorial of a
non-negative integer k is denoted by k!, for example,
4! = 4 x 3 x 2 x 1 = 24,
2! = 2 x 1 = 2,
1!=1.
With this notation in mind, the binomial distribution model is defined as:
Then, the probability that an event will happen exactly X times in N trials (i.e. X
successes and N − X failures will occur) is given by
$P(X) = \binom{N}{X} P^X q^{N-X} = \frac{N!}{X!(N-X)!} P^X q^{N-X}$
Where
N = number of trials
X = number of favourable outcomes (successes)
P = probability of success in a single trial
q = 1 − P = probability of failure in a single trial
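Example 1: If a fair coin is tossed 6 times, what is the probability of getting exactly
2 heads?
Solution:
N = 6, P = $\frac{1}{2}$ since there are two outcomes per trial and only one is favourable, X = the
number of appearances of a head (here X = 2)
$P(X=2) = \frac{N!}{X!(N-X)!} P^X q^{N-X}$
$= \frac{6!}{2!(6-2)!} \left(\frac{1}{2}\right)^2 \left(1-\frac{1}{2}\right)^{6-2}$
$= \frac{6 \times 5 \times 4 \times 3 \times 2 \times 1}{2 \times 1\,(4 \times 3 \times 2 \times 1)} \left(\frac{1}{2}\right)^2 \left(\frac{1}{2}\right)^4$
$= \frac{30}{2} \left(\frac{1}{2}\right)^2 \left(\frac{1}{2}\right)^4$
$= \frac{15}{1} \times \frac{1}{2} \times \frac{1}{2} \times \frac{1}{2} \times \frac{1}{2} \times \frac{1}{2} \times \frac{1}{2} = \frac{15}{64}$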
Example 2: If a ludo die is thrown five times, what is the probability of getting (a)
one six, (b) two sixes, (c) three or more sixes?
Solution:
N = 5, P = $\frac{1}{6}$ since there are six outcomes per trial and only one is favourable, X =
number of appearances of a six
(a) $P(X=1) = \frac{N!}{X!(N-X)!} P^X q^{N-X}$
$P(X=1) = \frac{5!}{1!(5-1)!} \left(\frac{1}{6}\right)^1 \left(1-\frac{1}{6}\right)^{5-1}$
$= \frac{120}{24} \left(\frac{1}{6}\right) \left(\frac{5}{6}\right)^4$
$= \frac{75000}{186624}$
$= 0.40$
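(b) $P(X=2) = \frac{N!}{X!(N-X)!} P^X q^{N-X}$
$= \frac{5!}{2!(5-2)!} \left(\frac{1}{6}\right)^2 \left(1-\frac{1}{6}\right)^{5-2}$
$= \frac{120}{12} \left(\frac{1}{6}\right)^2 \left(\frac{5}{6}\right)^3$
$= \frac{15000}{93312}$
$= 0.16$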
(c) P(X=3 or more 6) = P(X=3) + P(X=4) + P(X=5)
= 0.035
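The binomial examples can be verified with a small Python sketch. The helper name `binomial_pmf` is hypothetical; `math.comb` from the standard library supplies the N!/(X!(N−X)!) term.

```python
from math import comb


def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    q = 1 - p  # probability of failure in a single trial
    return comb(n, x) * p**x * q ** (n - x)


# Example 1: exactly 2 heads in 6 tosses of a fair coin.
coin_two_heads = binomial_pmf(2, 6, 1 / 2)

# Example 2: a die thrown five times, success = a six.
one_six = binomial_pmf(1, 5, 1 / 6)
two_sixes = binomial_pmf(2, 5, 1 / 6)
three_or_more = sum(binomial_pmf(x, 5, 1 / 6) for x in range(3, 6))

print(coin_two_heads, round(one_six, 2), round(two_sixes, 2), round(three_or_more, 3))
```

Summing the three terms for part (c) mirrors P(X=3) + P(X=4) + P(X=5) in the solution.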
POISSON DISTRIBUTION
It describes the number of events that occur within a given interval.
The events in question must occur at random and must be independent of one
another.
They must be what is described as rare, i.e. their rate of occurrence must be very low.
It is possible to calculate the probability of any number of events in a defined interval if the
mean number of events per interval is known.
The formula is
$Pr(x) = e^{-m} \frac{m^x}{x!}$
Where
x = number of occurrences of the event (favourable outcome)
e = a special mathematical constant (approximately 2.718)
$e^{-m}$ = a value that can be read from an exponential table
m = mean number of events per interval
Example: Customers arrive randomly at a departmental store at an average rate
of 3.4 per minute. Assuming the customer arrivals follow a Poisson
distribution, calculate the probability that
(a) No customer arrives in any particular minute
(b) Exactly one customer arrives in any particular minute
(c) Two or more customers arrive in any particular minute, and
(d) One or more customers arrive in any 30-second period.
Solution:
For a, b, and c, the interval for the Poisson distribution is one minute, with a given
mean of 3.4. For d, the interval is just 30 seconds, thus the mean here must be
adjusted to $\frac{3.4}{2} = 1.7$
(a) $Pr(0) = e^{-m} \frac{m^x}{x!} = e^{-3.4} \frac{3.4^0}{0!} = e^{-3.4} = 0.0334$
(b) $Pr(1) = e^{-m} \frac{m^x}{x!} = e^{-3.4} \frac{3.4^1}{1!} = (0.0334)(3.4) = 0.1136$
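Parts (c) and (d) are not worked out in the notes. The Python sketch below, with a hypothetical helper `poisson_pmf`, evaluates all four parts; it assumes the usual complement rule, e.g. P(two or more) = 1 − Pr(0) − Pr(1).

```python
from math import exp, factorial


def poisson_pmf(x, m):
    """Probability of exactly x events in an interval whose mean count is m."""
    return exp(-m) * m**x / factorial(x)


p_none = poisson_pmf(0, 3.4)              # (a) no customer in a minute
p_one = poisson_pmf(1, 3.4)               # (b) exactly one customer in a minute
p_two_or_more = 1 - p_none - p_one        # (c) complement of "zero or one"
p_one_or_more_30s = 1 - poisson_pmf(0, 1.7)  # (d) mean halved for a 30-second interval

print(round(p_none, 4), round(p_one, 4),
      round(p_two_or_more, 4), round(p_one_or_more_30s, 4))
```

Without intermediate rounding, part (b) comes out at about 0.1135; the 0.1136 in the solution arises from multiplying the already rounded 0.0334 by 3.4.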
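Firstly, we find the probability for a yield of 420 (X = 420, μ = 360kg/ha, σ = 24kg/ha)
$Z = \frac{X-\mu}{\sigma} = \frac{420-360}{24} = \frac{60}{24} = 2.5$, which corresponds to an area of 0.4938 between the mean and Z in the standard normal table
$Z = \frac{X-\mu}{\sigma} = \frac{440-360}{24} = \frac{80}{24} = 3.33$, which corresponds to an area of 0.4996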
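The table look-ups above can be cross-checked in Python. This sketch assumes the tabulated value is the area between the mean and Z, which can be written as 0.5·erf(Z/√2); the function names are illustrative.

```python
from math import erf, sqrt


def z_score(x, mu, sigma):
    """Standardise a value: number of standard deviations from the mean."""
    return (x - mu) / sigma


def area_mean_to_z(z):
    """Area under the standard normal curve between 0 and z."""
    return 0.5 * erf(z / sqrt(2))


z_420 = z_score(420, 360, 24)
z_440 = z_score(440, 360, 24)
area_420 = area_mean_to_z(z_420)
area_440 = area_mean_to_z(z_440)

print(z_420, round(area_420, 4))
print(round(z_440, 2), round(area_440, 4))
```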
Hypothesis testing in statistics is a way for you to test the results of a survey or
experiment to see if you have meaningful results.
1. Null hypothesis: this is the hypothesis under investigation that the researcher
is willing to reject. It is denoted by H0. E.g.
1. Type I error: This is a situation where the researcher rejects the null hypothesis
when the null hypothesis is actually true (i.e. it should be accepted).
2. Type II error: This is a situation where the researcher accepts the null
hypothesis when the null hypothesis is actually false (i.e. it should be rejected).
Choose level of significance
Type I error can be minimised by choosing the significance level appropriately. In
practice, a significance level of 0.05 or 0.01 is customary. A level of 0.05 signifies that we
are about 95% confident that we have made the right decision; a level of 0.01 signifies
99% confidence.
Standard Normal Curve with critical region (0.05) and acceptance region (0.95)
Test Statistics CONT.
The total shaded area (0.05) is the significance level of the test.
It represents the probability of our being wrong in rejecting the hypothesis (i.e.
the probability of making a type I error).
Thus, we say the hypothesis is rejected at the 0.05 significance level, or that the z-score of
the given sample statistic is significant at the 0.05 level.
Reject the null hypothesis at the 0.05 significance level if the z-score of the statistic
lies outside the range -1.96 to 1.96 (i.e. either Z > 1.96 or Z < -1.96), that is, if
$|Z| > Z_\alpha = 1.96$.
Level of significance (α) | 0.10 | 0.05 | 0.01 | 0.005 | 0.002
Critical value of z for one-tailed tests | -1.28 or 1.28 | -1.645 or 1.645 | -2.33 or 2.33 | -2.58 or 2.58 | -2.88 or 2.88
Critical value of z for two-tailed tests | -1.645 and 1.645 | -1.96 and 1.96 | -2.58 and 2.58 | -2.81 and 2.81 | -3.08 and 3.08
Test Concerning Mean (One Sample Distribution)
We may be interested in testing whether the sample mean ($\bar{X}$) is different
from the population mean ($\mu$).
We shall examine a single sample of size n ≥ 30 (large sample).
The testing procedure can be itemised as follows
1. State the null hypothesis (H0)
2. State the alternative hypothesis (H1)
3. State the 𝛼-significance level
4. State the decision rule
5. Compute the statistics using sample data (calculated value)
6. Decision: Reject H0 if the calculated value is in the critical region otherwise,
accept H0.
When the mean and the standard deviation are known, we can use the Z statistic.
$Z = \frac{\bar{X}-\mu}{\sigma/\sqrt{n}}$
Where
$\bar{X}$ = sample mean
µ = hypothesised or population mean
σ = standard deviation
n = sample size
Example 1: A beverage manufacturing company claims that the mean weight of its
medium-size products is 450g with a standard deviation of 15g. If a sample of 40
products was obtained and the mean weight was found to be 442g, test
whether the mean differs significantly from 450g at the 0.05 level of significance.
Solution:
1. H0: 𝜇 = 450
2. H1: 𝜇 ≠ 450 (two tailed)
3. α = 0.05
4. Critical region: Z < -1.96 or Z > 1.96, i.e. reject H0 if $|Z| > Z_\alpha = 1.96$ (results
are significant)
5. Computation:
$Z = \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} = \frac{442-450}{15/\sqrt{40}} = -3.37$
6. Conclusion: Since $|Z| = 3.37$ is greater than $Z_\alpha = 1.96$, the results are highly
significant; we reject H0 and conclude that the mean weight of the beverage
is not equal to 450g.
Example 2: The average length of rods produced by a company was claimed to
be 30m with σ = 0.88m. A distributor of the rods disputed the claim and said
that it is less than 30m. A sample of 50 rods was taken and the average length
was 29.8m. Test the hypothesis at α = 0.05 to ascertain which claim is correct,
the manufacturer's or the distributor's.
Solution:
1. H0: 𝜇 = 30m
2. H1: 𝜇 < 30m (one tailed)
3. α = 0.05
4. Critical region: Z < -1.645 (one-tailed test), i.e. reject H0 if $Z < -Z_\alpha = -1.645$ (results are
significant)
5. Computation:
$Z = \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} = \frac{29.8-30}{0.88/\sqrt{50}} = -1.6071$
6. Conclusion: Since Z = -1.6071 is not less than $-Z_\alpha = -1.645$, the results are not
significant; we accept H0 and conclude that the mean length of the rods
is 30m.
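Both worked examples follow the same six-step procedure, so the computation step can be sketched once in Python. The function name is hypothetical, and the critical values (1.96 for a two-tailed test and 1.645 for a one-tailed test at the 0.05 level) are taken from the table of critical values given earlier in these notes.

```python
from math import sqrt


def z_statistic(sample_mean, mu, sigma, n):
    """Z = (sample mean - hypothesised mean) / (sigma / sqrt(n))."""
    return (sample_mean - mu) / (sigma / sqrt(n))


# Example 1: two-tailed test, H1: mu != 450, critical value 1.96.
z_beverage = z_statistic(442, 450, 15, 40)
reject_beverage = abs(z_beverage) > 1.96

# Example 2: one-tailed test, H1: mu < 30, critical value -1.645.
z_rods = z_statistic(29.8, 30, 0.88, 50)
reject_rods = z_rods < -1.645

print(round(z_beverage, 2), reject_beverage)
print(round(z_rods, 4), reject_rods)
```

The first test rejects H0 and the second does not, which agrees with the conclusions reached in the two examples.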
ANALYSIS OF VARIANCE (ANOVA)
An ANOVA test is a way to find out if survey or experiment results are significant.
In other words, it helps you to figure out whether you need to reject the null
hypothesis or accept it.
• A manufacturer has two different processes to make light bulbs. They want to
know if one process is better than the other.
• Students from four colleges take the same exam. You want to see if one college
outperforms the other.
ANOVA is basically of two types:
1. One-way classification or one-factor experiments: it has one dependent variable and one
independent variable (with 2 or 3 levels). E.g. yields in kg per acre of wheat
grown in a particular type of soil treated with chemical A, B and C.
2. Two-way classification or two-factor experiments: it has one dependent variable
and two independent variables. E.g. yields in kg per acre of wheat grown in a
particular type of soil treated with chemical A, B and C, and with little,
moderate and sufficient rainfall.
The null hypothesis (H0) of ANOVA is that there is no difference among the group
means. The alternate hypothesis (Ha) is that there is a significant difference
among the group means.
One way ANOVA
Example: The table below shows the yields in bushels per acre of a certain variety
of wheat grown in a particular type of soil treated with chemical A, B, or C.
A 48 49 50 49
B 47 49 48 48
C 49 51 50 50
H0: The mean yields in bushels per acre are the same
H1: The mean yields in bushels per acre are not the same (though, not necessary)
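The ANOVA Table:
Variation | Degrees of Freedom | Mean Square | F
Between treatments (factor), $V_B = b\Sigma(\bar{X}_j - \bar{X})^2$ | a − 1 | $S_B^2 = \frac{V_B}{a-1}$ | $F = \frac{S_B^2}{S_W^2}$
Within treatments (error), $V_W = V - V_B$ | a(b − 1) or N − a | $S_W^2 = \frac{V_W}{N-a}$ |
Total, $V = V_B + V_W = \Sigma(X_{jk} - \bar{X})^2$ | ab − 1 or N − 1 | |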
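Row means
$\bar{X}_1 = \frac{48+49+50+49}{4} = 49$, $\bar{X}_2 = \frac{47+49+48+48}{4} = 48$, $\bar{X}_3 = \frac{49+51+50+50}{4} = 50$
Grand mean $\bar{X} = \frac{49+48+50}{3} = 49$
Total variation
$V = \Sigma(X_{jk} - \bar{X})^2 = (48-49)^2+(49-49)^2+(50-49)^2+(49-49)^2$
$+(47-49)^2+(49-49)^2+(48-49)^2$
$+(48-49)^2+(49-49)^2+(51-49)^2+(50-49)^2+(50-49)^2 = 14$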
The variation between treatments:
$V_B = b\Sigma(\bar{X}_j - \bar{X})^2 = 4[(49-49)^2 + (48-49)^2 + (50-49)^2] = 8$
The variation within treatments:
$V_W = V - V_B = 14 - 8 = 6$
Substituting the answers into the ANOVA table:
Variation | Degrees of Freedom | Mean Square | F
Between treatments (factor), $V_B = 8$ | a − 1 = 3 − 1 = 2 | $S_B^2 = \frac{V_B}{a-1} = \frac{8}{2} = 4$ | $F = \frac{S_B^2}{S_W^2} = \frac{4}{0.667} = 6$, with 2 and 9 degrees of freedom
Within treatments (error), $V_W = 6$ | a(b − 1) = 3(4 − 1) = 9 or N − a = 12 − 3 = 9 | $S_W^2 = \frac{V_W}{N-a} = \frac{6}{9} = 0.667$ |
Total, $V = 14$ | ab − 1 = 3(4) − 1 = 11 or N − 1 = 12 − 1 = 11 | |
Decision Rule: If F calculated > F critical value, then reject the null hypothesis
and conclude that the means of at least two groups are significantly different
(not the same).
From the results, the F-calculated is 6. The F-critical value (the intersection of
the degrees of freedom 2 and 9 in the F-distribution table at the 0.01 significance level) is 8.02.
Therefore, since Fcal < Fvalue, we do not reject the null hypothesis and conclude that the
mean yields in bushels per acre are the same.
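The one-way ANOVA quantities can be recomputed with a short Python sketch. It assumes the last yield for chemical C is 50, as used in the row-mean calculation, and takes the critical value 8.02 from the notes rather than computing it; all variable names are illustrative.

```python
# Yields in bushels per acre for each chemical treatment.
yields = {
    "A": [48, 49, 50, 49],
    "B": [47, 49, 48, 48],
    "C": [49, 51, 50, 50],
}

a = len(yields)        # number of treatments
b = len(yields["A"])   # observations per treatment
all_values = [x for row in yields.values() for x in row]
n_total = len(all_values)

grand_mean = sum(all_values) / n_total
row_means = {name: sum(row) / len(row) for name, row in yields.items()}

# Total, between-treatment and within-treatment variation.
v_total = sum((x - grand_mean) ** 2 for x in all_values)
v_between = b * sum((m - grand_mean) ** 2 for m in row_means.values())
v_within = v_total - v_between

# Mean squares and the F ratio.
ms_between = v_between / (a - 1)
ms_within = v_within / (n_total - a)
f_ratio = ms_between / ms_within

f_critical = 8.02  # value quoted in the notes for 2 and 9 degrees of freedom
reject_h0 = f_ratio > f_critical

print(row_means, v_total, v_between, v_within, round(f_ratio, 2), reject_h0)
```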
SIMPLE REGRESSION MODEL
Simple regression is concerned with the mathematical form of the relationship between two variables.
The simple regression model is:
Y = a + bX + e or Ŷ = â + bX
where
Y = dependent Variable
X = independent variable
a = intercept
b = the slope
e = error term
a and b are constants whose values are to be estimated on the basis
of the given values of X and Y.
The intercept ‘a’ represents the part of the dependent variable Y that does not depend
on X.
The slope ‘b’ represents the change in Y per unit change in X. If b is positive, a unit
increase in X would increase Y by b units; if b is negative, a unit increase in X
would decrease Y by |b| units.
The error term ‘e’ represents the difference between the actual value of Y and its
estimate.
Various methods exist for obtaining the values of a and b; one such method is the
least squares estimate. The formulas are:
$b = \frac{n\Sigma XY - \Sigma X \Sigma Y}{n\Sigma X^2 - (\Sigma X)^2}$
$\hat{a} = \bar{Y} - b\bar{X}$ or $a = \frac{\Sigma Y - b\Sigma X}{n}$
Coefficient of determination
The coefficient of determination ($r^2$) measures the goodness of fit of the fitted
regression to a set of data.
It gives the proportion or percentage of the total variation in the dependent
variable Y explained by the explanatory variable X.
$r^2$ lies between 0 and 1. If it is 1, the fitted regression explains 100% of the variation
in Y, and if it is 0, the model does not explain any of the variation in Y.
The fit of the model is said to be “better” the closer $r^2$ is to 1.
1 = 100% error term = 0%
0.99 = 99% error term = 1%
0.56 = 56% error term = 44%
$r^2 = \frac{b^2 \Sigma x^2}{\Sigma y^2}$
or
$r^2 = \frac{b^2 \Sigma(X-\bar{X})^2}{\Sigma(Y-\bar{Y})^2}$
Example: The following table contains observations of the quantity demanded
(Y) of a certain commodity at various prices (X).
Quantity Demanded in Kg (Y) Prices in N (X)
100 4
60 5
40 9
70 6
130 4
80 8
Required:
a. Fit the simple linear regression model of quantity demanded on price.
b. Predict the quantity that would be demanded when the price is N11.
c. Compute the coefficient of determination and comment on how well the fitted
equation fits the data.
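Solution:
X | Y | XY | X² | X − X̄ = x | Y − Ȳ = y | x² | y²
4 | 100 | 400 | 16 | -2 | 20 | 4 | 400
5 | 60 | 300 | 25 | -1 | -20 | 1 | 400
9 | 40 | 360 | 81 | 3 | -40 | 9 | 1600
6 | 70 | 420 | 36 | 0 | -10 | 0 | 100
4 | 130 | 520 | 16 | -2 | 50 | 4 | 2500
8 | 80 | 640 | 64 | 2 | 0 | 4 | 0
ΣX = 36, ΣY = 480, ΣXY = 2640, ΣX² = 238, Σx² = 22, Σy² = 5000
$\bar{X} = \frac{36}{6} = 6$, $\bar{Y} = \frac{480}{6} = 80$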
a. Model: $\hat{Y} = \hat{a} + bX$
$b = \frac{n\Sigma XY - \Sigma X \Sigma Y}{n\Sigma X^2 - (\Sigma X)^2} = \frac{6(2640) - 36(480)}{6(238) - (36)^2} = -10.91$
$\hat{a} = \bar{Y} - b\bar{X}$ or $a = \frac{\Sigma Y - b\Sigma X}{n}$
$= 80 - (-10.91)(6) = 145.46$
Fitted simple linear regression is:
Y = 145.46 + (-10.91)X
b. Required to find Y when X = 11
Y = 145.46 + (-10.91)11
= 25.45 implying 25.45 kg.
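c. $r^2 = \frac{b^2 \Sigma x^2}{\Sigma y^2} = \frac{b^2 \Sigma(X-\bar{X})^2}{\Sigma(Y-\bar{Y})^2}$
$= \frac{(-10.91)^2 (22)}{5000} = 0.52$, i.e. the fitted equation explains about 52% of the variation in quantity demanded.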
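The least squares calculation for this example can be reproduced with a Python sketch. The variable names are illustrative; the formulas are the ones stated in these notes, applied without intermediate rounding, so the intercept comes out as 145.45 rather than the 145.46 obtained from the rounded slope.

```python
# Observed prices (X, in N) and quantities demanded (Y, in kg).
prices = [4, 5, 9, 6, 4, 8]
quantities = [100, 60, 40, 70, 130, 80]
n = len(prices)

sum_x = sum(prices)
sum_y = sum(quantities)
sum_xy = sum(x * y for x, y in zip(prices, quantities))
sum_x2 = sum(x * x for x in prices)

# Least squares slope and intercept.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x**2)
a = (sum_y - b * sum_x) / n

# (b) Predicted quantity demanded at a price of N11.
predicted_at_11 = a + b * 11

# (c) Coefficient of determination from deviations about the means.
mean_x = sum_x / n
mean_y = sum_y / n
sum_dev_x2 = sum((x - mean_x) ** 2 for x in prices)
sum_dev_y2 = sum((y - mean_y) ** 2 for y in quantities)
r_squared = b**2 * sum_dev_x2 / sum_dev_y2

print(round(b, 2), round(a, 2), round(predicted_at_11, 2), round(r_squared, 2))
```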
If the calculated r falls in the shaded area, reject H0, and if it falls in the unshaded area,
accept H0 at the 0.05 and 0.01 significance levels.
To Find Pearson's Correlation Coefficient (by Hand)
Example: Find the value of the correlation coefficient from the following table:
Age (𝒙) 43 21 25 42 57 59
Glucose Level (𝑦) 99 65 79 75 87 81
H0: There is no statistically significant relationship between age and glucose level
CORRELATION ANALYSIS cont.
Solution: Use the formula first to determine what is required in the table.
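The formula and the working table for this example are not shown in the notes. The Python sketch below assumes the usual computational form of Pearson's coefficient, r = (nΣxy − ΣxΣy) / √[(nΣx² − (Σx)²)(nΣy² − (Σy)²)], and builds the required sums from the age and glucose data; the variable names are illustrative.

```python
from math import sqrt

# Data from the example table.
age = [43, 21, 25, 42, 57, 59]
glucose = [99, 65, 79, 75, 87, 81]
n = len(age)

# Column totals that the working table would contain.
sum_x = sum(age)
sum_y = sum(glucose)
sum_xy = sum(x * y for x, y in zip(age, glucose))
sum_x2 = sum(x * x for x in age)
sum_y2 = sum(y * y for y in glucose)

# Pearson's correlation coefficient (assumed computational formula).
numerator = n * sum_xy - sum_x * sum_y
denominator = sqrt((n * sum_x2 - sum_x**2) * (n * sum_y2 - sum_y**2))
r = numerator / denominator

print(sum_x, sum_y, sum_xy, sum_x2, sum_y2, round(r, 4))
```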
The frequency of each category for one nominal variable is compared across the
categories of the second nominal variable.
The data can be displayed in a contingency table where each row represents a
category for one variable and each column represents a category for the other
variable.
Example of Chi-Square Test
Question: A cellular phone company conducts a survey to determine the
ownership of cellular phones in different groups. The results for 1000 households
are obtained as follows
Yes 50 80 70 50 250
$\chi^2 = 14.3$
Degrees of freedom, $\nu = (r-1)(c-1) = (2-1)(4-1) = 3$
Conclusion: $\chi^2_{.95} = 7.81$, and $\chi^2_{cal} = 14.3$. Since 14.3 is greater than 7.81, we reject
the null hypothesis and conclude that the proportions owning cellular
phones are not the same for the different age groups.