Module - Data Management (Part 2)
Objectives
1. Recognize the basic terms of statistics.
2. Determine and apply the measures of central tendency, variability, and
position.
3. Apply the measures of central tendency and variability in the normal
distribution.
4. Determine the linear regression and correlation of a set of data.
Lesson Proper
Like the descriptive measures of central tendency (mean, median, and mode),
measures of variability also come in different types, such as the range, variance, and
standard deviation, among others. The larger a measure of variability, the more
spread out, dispersed, or scattered the data are. Similarly, the smaller the measure of
variability, the closer the data values are to one another.
Ungrouped Data
1. Range
The difference between the highest value and the lowest value.
𝑅 = 𝐻𝑉 − 𝐿𝑉
For Set A (2, 3, 3, 4, 8), the range is 𝑅 = 8 − 2 = 6, while for Set B (2, 3, 3, 3, 9), the range
is 𝑅 = 9 − 2 = 7. The results imply that the data in Set A are closer together than the data in Set B.
The range is not as reliable as the other measures of variability because it does not consider
the data between the highest and lowest values. Other measures of variability take
these data into account.
Mathematics in the Modern World – Data Management (Part 2) – Madrazo, A. (2020), [email protected] | 1
For instance, consider the two sets below. Both have the same range (𝑅 = 8 − 2 = 6), but
the data between the highest value and the lowest value do not behave the
same way.
Set C            Set D
2, 2, 2, 2, 8    2, 3, 4, 6, 8
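As a quick check of this point, the following Python sketch computes the range of Sets C and D and, for contrast, their sample variance (a measure defined in the next part). The helper name `value_range` is an illustrative choice and not part of the module.

```python
from statistics import variance

set_c = [2, 2, 2, 2, 8]
set_d = [2, 3, 4, 6, 8]

def value_range(data):
    """Range: highest value minus lowest value."""
    return max(data) - min(data)

for name, data in (("Set C", set_c), ("Set D", set_d)):
    # Same range for both sets, but the sample variance differs.
    print(name, "range =", value_range(data), "sample variance =", variance(data))
```

Both sets report a range of 6, while the variance separates them, which is why the range alone is a weak description of spread.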
2. Variance
Variance of Population
$$\sigma^2 = \frac{\sum (x - \mu)^2}{N}$$
Where:
𝜎² (lowercase Greek letter sigma, squared) is the population variance
∑ (uppercase Greek letter sigma) means summation
𝑥 is each datum
𝜇 (lowercase Greek letter mu) is the mean of the population
𝑁 is the population size
Variance of Sample
$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$
Where:
∑ (uppercase Greek letter sigma) means summation
𝑠² is the sample variance
𝑥 is each datum
𝑥̅ is the mean of the sample
𝑛 is the sample size
Use the formula for the variance of a population if the data come from the whole
population. Likewise, use the formula for the variance of a sample if the data come
from a sample.
Let us determine the variance of each set, assuming the data come from a
sample.
              Set A            Set B
Data          2, 3, 3, 4, 8    2, 3, 3, 3, 9
Mean (𝑥̅)      4                4
Median (𝑥̃)    3                3
Mode (𝑥̂)      3                3
Set A
𝑥        𝑥 − 𝑥̅            (𝑥 − 𝑥̅)²
2        2 − 4 = −2        (−2)² = 4
3        3 − 4 = −1        (−1)² = 1
3        3 − 4 = −1        (−1)² = 1
4        4 − 4 = 0         (0)² = 0
8        8 − 4 = 4         (4)² = 16
𝑥̅ = 4                      ∑(𝑥 − 𝑥̅)² = 22
There are five data points (i.e., 2, 3, 3, 4, and 8), so 𝑛 = 5.
$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{22}{5 - 1} = \frac{22}{4} = \frac{11}{2} = 5.5$$
The variance of Set A is 5.5. Let us also determine the variance of Set B.
Set B
𝑥        𝑥 − 𝑥̅            (𝑥 − 𝑥̅)²
2        2 − 4 = −2        (−2)² = 4
3        3 − 4 = −1        (−1)² = 1
3        3 − 4 = −1        (−1)² = 1
3        3 − 4 = −1        (−1)² = 1
9        9 − 4 = 5         (5)² = 25
𝑥̅ = 4                      ∑(𝑥 − 𝑥̅)² = 32
$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{32}{5 - 1} = \frac{32}{4} = 8$$
The variance of Set B is 8, compared with 5.5 for Set A. Since the variance of Set
B is larger than the variance of Set A, the data in Set B are more spread out,
dispersed, or scattered. Likewise, the data in Set A are closer to one another
than the data in Set B.
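The two tables above can be reproduced with a short Python sketch. The function name `sample_variance` is an illustrative choice; it follows the sample formula (division by 𝑛 − 1) step by step.

```python
from statistics import mean

def sample_variance(data):
    """Sum of squared deviations from the mean, divided by n - 1."""
    x_bar = mean(data)
    squared_deviations = [(x - x_bar) ** 2 for x in data]
    return sum(squared_deviations) / (len(data) - 1)

set_a = [2, 3, 3, 4, 8]
set_b = [2, 3, 3, 3, 9]

print(sample_variance(set_a))  # 5.5
print(sample_variance(set_b))  # 8.0
```

Python's standard library also provides `statistics.variance`, which applies the same 𝑛 − 1 divisor.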
3. Standard Deviation
Standard Deviation of Population
$$\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$$
Where:
𝜎 (lowercase Greek letter sigma) is the standard deviation
∑ (uppercase Greek letter sigma) means summation
𝑥 is each datum
𝜇 (lowercase Greek letter mu) is the mean of the population
𝑁 is the population size
Use the formula for the standard deviation of a population if the data come from the
whole population. Likewise, use the formula for the standard deviation of a sample,
which replaces 𝜇 with 𝑥̅ and 𝑁 with 𝑛 − 1, if the data come from a sample.
Standard deviation and variance are related. The square root of variance is equal to
the standard deviation. Similarly, the square of standard deviation is equal to the
variance.
Therefore:
Set A
$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{5.5} \approx 2.35$$
Set B
$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{8} \approx 2.83$$
Similarly, the results reveal that the data in Set B are more dispersed or scattered than
the data in Set A.
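The relationship between the two measures can be checked with the Python standard library. This is a minimal sketch; `statistics.variance` and `statistics.stdev` both use the sample (𝑛 − 1) formulas.

```python
from math import sqrt
from statistics import stdev, variance

set_a = [2, 3, 3, 4, 8]
set_b = [2, 3, 3, 3, 9]

for name, data in (("Set A", set_a), ("Set B", set_b)):
    s_squared = variance(data)   # sample variance
    s = sqrt(s_squared)          # standard deviation is its square root
    print(name, "variance =", s_squared, "standard deviation =", round(s, 2))
```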
Grouped Data
Let us consider the same problem we used in measures of central tendency. The ages
of the first 50 persons who entered the mall were tallied, as shown below. Determine
the range, variance, and standard deviation of their ages.
Age Frequency
10 – 19 5
20 – 29 20
30 – 39 10
40 – 49 7
50 – 59 8
Total n=50
1. Range
$$R = \text{upper class boundary of the highest class interval} - \text{lower class boundary of the lowest class interval}$$
To solve, first determine the class boundaries.
Age Frequency Boundaries
10 – 19 5
20 – 29 20 19.5 – 29.5
30 – 39 10
40 – 49 7
50 – 59 8
Total n=50
For example, consider the class 20 – 29. The lower boundary is the average of the lower
limit of the class and the upper limit of the class below it:
$$\frac{20 + 19}{2} = \frac{39}{2} = 19.5$$
Similarly, the upper boundary is the average of the upper limit of the class and
the lower limit of the class above it:
$$\frac{29 + 30}{2} = \frac{59}{2} = 29.5$$
Therefore, the class boundaries are 19.5 – 29.5. From the previous discussions,
the class interval is 10 (e.g., 40 − 30 = 10). Simply add or subtract
the class interval to determine the other boundaries. For example:
19.5 + 10 = 29.5 and 29.5 + 10 = 39.5, which gives 29.5 – 39.5
19.5 − 10 = 9.5 and 29.5 − 10 = 19.5, which gives 9.5 – 19.5
And continue the same process to determine the succeeding class boundaries.
Age Frequency Boundaries
10 – 19 5 9.5 – 19.5
20 – 29 20 19.5 – 29.5
30 – 39 10 29.5 – 39.5
40 – 49 7 39.5 – 49.5
50 – 59 8 49.5 – 59.5
Total n=50
𝑅 = 59.5 − 9.5 = 𝟓𝟎
The range is 50.
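The boundary table and the grouped range can be reproduced with a short Python sketch. It assumes, as in the table above, that the gap between adjacent class limits is the same for every class; the variable names are illustrative.

```python
# Class limits from the age table, lowest class first.
class_limits = [(10, 19), (20, 29), (30, 39), (40, 49), (50, 59)]

# Gap between the upper limit of one class and the lower limit of the next (1 here).
gap = class_limits[1][0] - class_limits[0][1]

# Each boundary lies halfway across that gap.
boundaries = [(low - gap / 2, high + gap / 2) for low, high in class_limits]

# Range: upper boundary of the highest class minus lower boundary of the lowest class.
grouped_range = boundaries[-1][1] - boundaries[0][0]

for limits, bounds in zip(class_limits, boundaries):
    print(limits, "->", bounds)
print("Range:", grouped_range)  # Range: 50.0
```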
2. Variance
Variance of Population
$$\sigma^2 = \frac{\sum f(x - \mu)^2}{N}$$
Where:
𝜎² (lowercase Greek letter sigma, squared) is the population variance
∑ (uppercase Greek letter sigma) means summation
𝑓 is the frequency of the class
𝑥 is the class mark of each class
𝜇 (lowercase Greek letter mu) is the mean of the population
𝑁 is the population size
Variance of Sample
$$s^2 = \frac{\sum f(x - \bar{x})^2}{n - 1}$$
Where:
∑ (uppercase Greek letter sigma) means summation
𝑓 is the frequency of the class
𝑠² is the sample variance
𝑥 is the class mark of each class
𝑥̅ is the mean of the sample
𝑛 is the sample size
In the previous discussions (measures of central tendency), we already computed
the mean, which is 𝑥̅ = 33.1. Likewise, the class mark was already determined in the
previous discussions. The class mark (𝑥) is the average of the lower limit and upper limit of
each class. We will use these to complete the table first.
Age       f     x      𝑥 − 𝑥̅                   (𝑥 − 𝑥̅)²               𝑓(𝑥 − 𝑥̅)²
10 – 19   5     14.5   14.5 − 33.1 = −18.6    (−18.6)² = 345.96      5(345.96) = 1,729.8
20 – 29   20    24.5   24.5 − 33.1 = −8.6     (−8.6)² = 73.96        20(73.96) = 1,479.2
30 – 39   10    34.5   1.4                    1.96                   19.6
40 – 49   7     44.5   11.4                   129.96                 909.72
50 – 59   8     54.5   21.4                   457.96                 3,663.68
Total     n=50                                                       ∑𝑓(𝑥 − 𝑥̅)² = 7,802
$$\sum f(x - \bar{x})^2 = 1729.8 + 1479.2 + 19.6 + 909.72 + 3663.68 = 7802$$
$$s^2 = \frac{\sum f(x - \bar{x})^2}{n - 1} = \frac{7802}{50 - 1} = \frac{7802}{49} \approx 159.22$$
The variance of the sample is 159.22.
3. Standard Deviation
Standard Deviation of Population
$$\sigma = \sqrt{\frac{\sum f(x - \mu)^2}{N}}$$
Where:
𝜎 (lowercase Greek letter sigma) is the standard deviation
∑ (uppercase Greek letter sigma) means summation
𝑓 is the frequency of the class
𝑥 is the class mark of each class
𝜇 (lowercase Greek letter mu) is the mean of the population
𝑁 is the population size
We already know that variance and standard deviation are related: the standard
deviation is the square root of the variance. We can therefore determine the sample
standard deviation from the variance.
$$s = \sqrt{\frac{\sum f(x - \bar{x})^2}{n - 1}}$$
$$s = \sqrt{159.22} \approx 12.62$$
This gives us a standard deviation of 12.62.
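The grouped computation can be sketched in Python as follows. The class limits and frequencies are taken from the age table; the variable names are illustrative, and the class mark stands in for every value in its class, as in the formulas above.

```python
from math import sqrt

class_limits = [(10, 19), (20, 29), (30, 39), (40, 49), (50, 59)]
frequencies = [5, 20, 10, 7, 8]

# Class mark: average of the lower and upper limit of each class.
class_marks = [(low + high) / 2 for low, high in class_limits]

n = sum(frequencies)
mean = sum(f * x for f, x in zip(frequencies, class_marks)) / n

# Sum of f(x - mean)^2 over all classes.
weighted_squares = sum(f * (x - mean) ** 2 for f, x in zip(frequencies, class_marks))

variance = weighted_squares / (n - 1)   # sample variance
std_dev = sqrt(variance)                # sample standard deviation

print(round(mean, 2), round(weighted_squares, 2))  # 33.1 7802.0
print(round(variance, 2), round(std_dev, 2))       # 159.22 12.62
```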
Measures of location or position are used to locate the relative position of a data value in
the data set. These include standard scores, percentiles, deciles, and quartiles.
1. Standard Scores
A standard score or z-score tells how many standard deviations a data value
is above or below the mean for a specific distribution of values. If the standard
score is zero, then the data value is the same as the mean. If it is positive,
the data value is above the mean. If it is negative, the data value is below the
mean.
It is obtained by subtracting the mean from the value and dividing the result by
the standard deviation. The symbol for a standard score is z. The formula is:
$$z = \frac{\text{value} - \text{mean}}{\text{standard deviation}}$$
The formula for samples is:
$$z = \frac{x - \bar{x}}{s}$$
The formula for populations is:
$$z = \frac{x - \mu}{\sigma}$$
The z-score represents the number of standard deviations a data value falls above
or below the mean.
Example. Angelo's score in Literature is 88, compared with the class mean
of 80 and a standard deviation of 3. His score in Mathematics is
90, with a class mean of 95 and a standard deviation of 5. In which subject
did Angelo perform better?
Solution:
Literature Mathematics
Score/ value (x) 88 90
Mean 80 95
Standard Deviation 3 5
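Literature:
$$z = \frac{88 - 80}{3} = \frac{8}{3} \approx 2.67$$
Mathematics:
$$z = \frac{90 - 95}{5} = \frac{-5}{5} = -1$$
[Figure: number lines locating each score. Literature: scores 71, 74, 77, 80, 83, 86, 89 at z = −3 to 3, with 88 at z ≈ 2.67. Mathematics: scores 80, 85, 90, 95, 100, 105, 110 at z = −3 to 3, with 90 at z = −1.]
Angelo's Literature score is about 2.67 standard deviations above the class mean, while
his Mathematics score is 1 standard deviation below the class mean. Relative to his
classmates, he performed better in Literature.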
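The comparison can be checked with a few lines of Python. The function name `z_score` is an illustrative choice that mirrors the formula for a standard score.

```python
def z_score(value, mean, standard_deviation):
    """Number of standard deviations a value lies above (+) or below (-) the mean."""
    return (value - mean) / standard_deviation

literature = z_score(88, 80, 3)
mathematics = z_score(90, 95, 5)

print(round(literature, 2))  # 2.67
print(mathematics)           # -1.0

# The subject with the larger z-score is the relatively better performance.
better = "Literature" if literature > mathematics else "Mathematics"
print(better)                # Literature
```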
Ungrouped Data
A group of students obtained the following scores in their Statistics quiz:
4, 9, 7, 14, 10, 8, 12, 15, 6, 11
Determine the 1st and 3rd quartiles, the 3rd and 7th deciles, and the 25th and 75th percentiles.
2. Quartiles
Quartiles divide the group into 4 parts or quarters (𝑄1, 𝑄2, 𝑄3, 𝑄4). Every part or quarter
is equivalent to ¼ or 25%. The position of the kth quartile is:
$$Q_k = \frac{k}{4}(n + 1)$$
Where 𝑘 is the partition (𝑘 = 1, 2, 3, 4), and 𝑛 is the number of terms. Round off
to the nearest whole number.
First, arrange the scores in ascending order. Then solve for the quartiles.
4, 6, 7, 8, 9, 10, 11, 12, 14, 15
There are 10 scores (𝑛 = 10).
$$Q_1 = \frac{1}{4}(10 + 1) = 2.75 \approx 3$$
$$Q_3 = \frac{3}{4}(10 + 1) = 8.25 \approx 8$$
From the results, 𝑄1 is the 3rd term, which is 7, and 𝑄3 is the 8th term, which is 12. The
position of 𝑄2 is the difference between the positions of 𝑄3 and 𝑄1. (This holds for
positions only; the difference between the values of 𝑄3 and 𝑄1 is the interquartile range.)
The position of 𝑄2 can also be solved using the formula.
$$Q_2 = Q_3 - Q_1 = 8.25 - 2.75 = 5.5 \approx 6$$
or
$$Q_2 = \frac{2}{4}(10 + 1) = \frac{1}{2}(11) = \frac{11}{2} = 5.5 \approx 6$$
This implies that the 2nd quartile is the 6th term, which is 10. The first 6 terms (4, 6, 7,
8, 9, 10) belong to the lower 50% of the scores.
4, 6, 7, 8, 9, 10, 11, 12, 14, 15
(𝑄1 = 7, the 3rd term; 𝑄2 = 10, the 6th term; 𝑄3 = 12, the 8th term)
This means that the first 3 terms (4, 6, 7) belong to the lower 25% of the scores.
Similarly, the first 8 terms (4, 6, 7, 8, 9, 10, 11, 12) belong to the lower 75% of the
scores.
Interpolation can be used to get the exact value at a position.
For example, 𝑄3 is not actually the 8th term but the 8.25th term. What is the
score at the 8.25th term?
Interpolation
The 3rd quartile (𝑄3) is at the 8.25th term, which lies between the 8th
term and the 9th term (specifically, a quarter of the way from the 8th term to the 9th).
First, get the difference between the scores at the 9th term and the 8th term:
14 − 12 = 2
Then multiply the difference of the scores (2) by the excess after the 8th term
(0.25):
2(0.25) = 0.5
Then add the result to the 8th term:
12 + 0.5 = 12.5
Therefore, the 3rd quartile (𝑄3) is the 8.25th term, with a score of 12.5. This is
the actual position and the actual score. The same can be done for the other quartiles.
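The interpolation steps can be sketched in Python. The helper names `quartile_position` and `value_at` are illustrative, and the sketch assumes the position lies between the first and last term.

```python
from math import floor
from statistics import quantiles

scores = sorted([4, 9, 7, 14, 10, 8, 12, 15, 6, 11])
n = len(scores)

def quartile_position(k, n):
    """Position of the kth quartile: (k / 4)(n + 1)."""
    return k / 4 * (n + 1)

def value_at(sorted_data, position):
    """Score at a 1-based, possibly fractional, position (assumes 1 <= position <= n)."""
    lower = floor(position)          # whole-number part, e.g. 8 for 8.25
    excess = position - lower        # excess after the lower term, e.g. 0.25
    lower_value = sorted_data[lower - 1]
    if excess == 0:
        return lower_value
    upper_value = sorted_data[lower]
    # Add the excess times the difference between the two neighboring scores.
    return lower_value + excess * (upper_value - lower_value)

for k in (1, 2, 3):
    position = quartile_position(k, n)
    print(f"Q{k}: position {position}, score {value_at(scores, position)}")
# Q3: position 8.25, score 12.5

# Cross-check with the standard library (default method also uses n + 1).
print(quantiles(scores, n=4))
```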
3. Deciles
Deciles divide the group into 10 parts (𝐷1, 𝐷2, …, 𝐷9, 𝐷10). Each partition is equivalent
to 1/10 or 10%. The position of the kth decile is:
$$D_k = \frac{k}{10}(n + 1)$$
Where 𝑘 is the partition (𝑘 = 1, 2, 3, …, 10), and 𝑛 is the number of
terms. Round off to the nearest whole number.
From the problem, we are to find the 3rd and 7th deciles.
4, 6, 7, 8, 9, 10, 11, 12, 14, 15
$$D_3 = \frac{3}{10}(10 + 1) = 3.3 \approx 3$$
$$D_7 = \frac{7}{10}(10 + 1) = 7.7 \approx 8$$
From the results, the 3rd decile is the 3rd term, which is 7. The 7th decile is the 8th
term, which is 12. Interpolation can be used to get the score at the actual
position.
4. Percentiles
Percentiles divide the group into 100 parts (𝑃1, 𝑃2, 𝑃3, …, 𝑃99, 𝑃100). Each partition is
equivalent to 1/100 or 1%. The position of the kth percentile is:
$$P_k = \frac{k}{100}(n + 1)$$
Where 𝑘 is the partition (𝑘 = 1, 2, 3, …, 99, 100), and 𝑛 is the number of terms.
Round off to the nearest whole number.
From the given, we are to find the 25th and 75th percentiles.
4, 6, 7, 8, 9, 10, 11, 12, 14, 15
$$P_{25} = \frac{25}{100}(10 + 1) = 2.75 \approx 3$$
$$P_{75} = \frac{75}{100}(10 + 1) = 8.25 \approx 8$$
The 25th percentile is the 3rd term, which is 7. The 75th percentile is the 8th term,
which is 12. Interpolation can also be used to get the actual score at the
actual position.
Furthermore, the 50th percentile is the same as the 2nd quartile or the 5th decile,
which is the 6th term. And the 6th term is 10.
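The rounded-position rule used for quartiles, deciles, and percentiles can be sketched with one Python helper. The names `rounded_position` and `term_value` are illustrative, and the sketch assumes that a position ending in .5 is rounded up, as in 5.5 ≈ 6 above.

```python
from math import floor

scores = sorted([4, 9, 7, 14, 10, 8, 12, 15, 6, 11])
n = len(scores)

def rounded_position(k, parts, n):
    """Position (k / parts)(n + 1), rounded to the nearest whole term (halves round up)."""
    exact = k * (n + 1) / parts
    return floor(exact + 0.5)

def term_value(k, parts):
    """Score found at the rounded position (positions are 1-based)."""
    return scores[rounded_position(k, parts, n) - 1]

print("D3 :", term_value(3, 10))     # 7
print("D7 :", term_value(7, 10))     # 12
print("P25:", term_value(25, 100))   # 7
print("P75:", term_value(75, 100))   # 12

# Q2, D5 and P50 share the same position, hence the same score.
print(term_value(2, 4), term_value(5, 10), term_value(50, 100))  # 10 10 10
```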
Quartile    Decile    Percentile
            𝐷1        𝑃10
            𝐷2        𝑃20
𝑄1          𝐷2.5      𝑃25
            𝐷3        𝑃30
            𝐷4        𝑃40
𝑄2          𝐷5        𝑃50
            𝐷6        𝑃60
            𝐷7        𝑃70
𝑄3          𝐷7.5      𝑃75
            𝐷8        𝑃80
            𝐷9        𝑃90
𝑄4          𝐷10       𝑃100
Grouped Data
The computation of quartiles, deciles, and percentiles in grouped data is the same as
the computation for the median of grouped data.
Problem. The scores of 50 students in Statistics are shown in the table below.
Score Frequency
41 - 45 9
36 – 40 13
31 – 35 15
26 – 30 10
21 – 25 3
Total 50
1. Quartile
$$Q_k = lb + \left(\frac{\frac{kn}{4} - cf_b}{f}\right) i$$
Where:
𝑄𝑘 is the kth quartile
𝑙𝑏 is the lower boundary of the quartile class
𝑘 is the quartile number (𝑘 = 1, 2, 3, 4)
𝑛 is the total frequency
𝑐𝑓𝑏 is the cumulative frequency before the quartile class
𝑓 is the frequency of the quartile class
𝑖 is the class interval
Solution. To solve for the 2nd quartile (𝑄2), first complete the table with the cumulative frequency (𝑐𝑓).
Score Frequency 𝑐𝑓
41 - 45 9 50
36 – 40 13 41
31 – 35 15 28
26 – 30 10 13
21 – 25 3 3
Total 50
Then solve for $\frac{kn}{4}$:
$$\frac{kn}{4} = \frac{2(50)}{4} = \frac{100}{4} = 25$$
Class 31 – 35 holds the 14th to 28th ranks, which include the 25th rank. Therefore, the 2nd
quartile falls in this class.
The lower boundary (𝑙𝑏) is halfway between 30 and 31, which is 30.5. The frequency
(𝑓) of the class is 15. The cumulative frequency before the class (𝑐𝑓𝑏) is 13. The class
interval is 5 (e.g., 31 − 26 or 35 − 30). Therefore:
$$Q_k = lb + \left(\frac{\frac{kn}{4} - cf_b}{f}\right) i$$
$$Q_2 = 30.5 + \left(\frac{\frac{2(50)}{4} - 13}{15}\right) 5$$
$$Q_2 = 30.5 + \left(\frac{25 - 13}{15}\right) 5$$
$$Q_2 = 30.5 + \left(\frac{12}{15}\right) 5$$
$$Q_2 = 30.5 + \left(\frac{4}{5}\right) 5$$
$$Q_2 = 30.5 + 4$$
$$Q_2 = 34.5$$
The 2nd quartile is 34.5.
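The steps of this solution (cumulative frequencies, locating the class, then applying the formula) can be sketched in Python. The variable names are illustrative, and the lower boundary is taken as the lower limit minus 0.5, which assumes whole-number class limits as in the table.

```python
from itertools import accumulate

# Classes listed from lowest to highest, with their frequencies.
class_limits = [(21, 25), (26, 30), (31, 35), (36, 40), (41, 45)]
frequencies = [3, 10, 15, 13, 9]

cumulative = list(accumulate(frequencies))   # [3, 13, 28, 41, 50]
n = cumulative[-1]

k = 2
target_rank = k * n / 4                      # kn/4 = 25.0

# First class whose cumulative frequency reaches the target rank.
index = next(i for i, cf in enumerate(cumulative) if cf >= target_rank)

lb = class_limits[index][0] - 0.5            # lower boundary (assumes integer limits)
cf_b = cumulative[index - 1] if index > 0 else 0
f = frequencies[index]
i_width = class_limits[1][0] - class_limits[0][0]   # class interval

q2 = lb + (target_rank - cf_b) / f * i_width
print(class_limits[index], lb, cf_b, f, i_width)    # (31, 35) 30.5 13 15 5
print(q2)                                           # 34.5
```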
2. Decile
$$D_k = lb + \left(\frac{\frac{kn}{10} - cf_b}{f}\right) i$$
Where:
𝐷𝑘 is the kth decile
𝑙𝑏 is the lower boundary of the decile class
𝑘 is the decile number (𝑘 = 1, 2, 3, …, 9, 10)
𝑛 is the total frequency
𝑐𝑓𝑏 is the cumulative frequency before the decile class
𝑓 is the frequency of the decile class
𝑖 is the class interval
Score Frequency 𝑐𝑓
41 - 45 9 50
36 – 40 13 41
31 – 35 15 28
26 – 30 10 13
21 – 25 3 3
Total 50
To solve for the 4th decile, start with $\frac{kn}{10}$:
$$\frac{kn}{10} = \frac{4(50)}{10} = 4(5) = 20$$
The 20th rank also belongs to class 31 – 35, which holds the 14th to 28th ranks.
As in the solution for the quartile earlier, 𝑙𝑏, 𝑐𝑓𝑏, 𝑓, and 𝑖 have the same
values. Hence:
$$D_k = lb + \left(\frac{\frac{kn}{10} - cf_b}{f}\right) i$$
$$D_4 = 30.5 + \left(\frac{\frac{4(50)}{10} - 13}{15}\right) 5$$
$$D_4 = 30.5 + \left(\frac{20 - 13}{15}\right) 5$$
$$D_4 = 30.5 + \left(\frac{7}{15}\right) 5$$
$$D_4 = 30.5 + \frac{7}{3}$$
$$D_4 \approx 30.5 + 2.33 \approx 32.83$$
The 4th decile is approximately 32.83.
3. Percentile
$$P_k = lb + \left(\frac{\frac{kn}{100} - cf_b}{f}\right) i$$
Where:
𝑃𝑘 is the kth percentile
𝑙𝑏 is the lower boundary of the percentile class
𝑘 is the percentile number (𝑘 = 1, 2, 3, …, 99, 100)
𝑛 is the total frequency
𝑐𝑓𝑏 is the cumulative frequency before the percentile class
𝑓 is the frequency of the percentile class
𝑖 is the class interval
Score Frequency 𝑐𝑓
41 - 45 9 50
36 – 40 13 41
31 – 35 15 28
26 – 30 10 13
21 – 25 3 3
Total 50
To solve for the 67th percentile, start with $\frac{kn}{100}$:
$$\frac{kn}{100} = \frac{67(50)}{100} = \frac{3350}{100} = 33.5 \approx 34$$
The 34th rank is included in the class 36 – 40, since it holds the 29th to 41st ranks.
$$lb = \frac{36 + 35}{2} = \frac{71}{2} = 35.5$$
𝑛 = 50
𝑐𝑓𝑏 = 28
𝑓 = 13
𝑖 = 5
$$P_k = lb + \left(\frac{\frac{kn}{100} - cf_b}{f}\right) i$$
$$P_{67} = 35.5 + \left(\frac{\frac{67(50)}{100} - 28}{13}\right) 5$$
$$P_{67} = 35.5 + \left(\frac{33.5 - 28}{13}\right) 5$$
$$P_{67} = 35.5 + \left(\frac{5.5}{13}\right) 5$$
$$P_{67} = 35.5 + \frac{27.5}{13}$$
$$P_{67} \approx 35.5 + 2.12 \approx 37.62$$
The 67th percentile is approximately 37.62.
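Since the quartile, decile, and percentile formulas differ only in the divisor (4, 10, or 100), one Python function can cover all three. This is a sketch under stated assumptions: the name `grouped_quantile` and its parameters are illustrative, classes are listed from lowest to highest, and class limits are whole numbers so that the lower boundary is the lower limit minus 0.5.

```python
# (lower limit, upper limit, frequency), lowest class first.
classes = [(21, 25, 3), (26, 30, 10), (31, 35, 15), (36, 40, 13), (41, 45, 9)]

def grouped_quantile(classes, k, parts):
    """lb + ((k*n/parts - cf_b) / f) * i for the class that contains rank k*n/parts."""
    n = sum(f for _, _, f in classes)
    target_rank = k * n / parts
    interval = classes[1][0] - classes[0][0]   # class interval i
    cf_before = 0                              # cumulative frequency before the class
    for lower_limit, _, f in classes:
        if cf_before + f >= target_rank:
            lb = lower_limit - 0.5             # lower boundary (assumes integer limits)
            return lb + (target_rank - cf_before) / f * interval
        cf_before += f
    raise ValueError("target rank exceeds the total frequency")

print(round(grouped_quantile(classes, 2, 4), 2))     # Q2  -> 34.5
print(round(grouped_quantile(classes, 4, 10), 2))    # D4  -> 32.83
print(round(grouped_quantile(classes, 67, 100), 2))  # P67 -> 37.62
```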
2, 4, 5, 5, 7, 8, 9, 9, 10, 13, 15
L = 2, 𝑄1 = 5, 𝑄2 = 8, 𝑄3 = 10, H = 15
Step 2. Draw and translate data sets to and from a box-and-whisker plot.
a) On the number line, draw congruent vertical lines positioned at 𝑄1, 𝑄2, and 𝑄3.
c) Connect the least and highest values to the box. These lines are called whiskers.
LESSON 5. Normal Distribution
A normal distribution is a perfectly symmetric, mound-shaped distribution that forms
a “bell curve”. When data are normally distributed, the mean, median, and mode
coincide (are equal to one another), and most of the data are densely concentrated
around the center.
Empirical Rule
1. The area from z = −1.0 to z = 1.0 is approximately 68% of the total area.
2. The area from z = −2.0 to z = 2.0 is approximately 95% of the total area.
3. The area from z = −3.0 to z = 3.0 is approximately 99.7% of the total area.
Source: Almukkahal, R., et. al. (2016). CK-12 Advanced Probability and Statistics Concepts.
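The three percentages of the empirical rule can be checked numerically with `statistics.NormalDist` from the Python standard library. This is only a verification sketch; the variable names are illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

for k in (1, 2, 3):
    # Area under the standard normal curve between z = -k and z = k.
    area = standard_normal.cdf(k) - standard_normal.cdf(-k)
    print(f"-{k}.0 to {k}.0: {area:.4f}")
# -1.0 to 1.0: 0.6827
# -2.0 to 2.0: 0.9545
# -3.0 to 3.0: 0.9973
```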
Problem. Suppose the scores of 250 students in a certain exam are normally distributed
with a mean of 30 and a standard deviation of 6. Answer the following.
1. How many students scored less than 25? This can be written as P(x < 25), the
probability that a score is less than 25.
2. How many students have scores higher than 36? P(x > 36)
3. How many students have scores from 24 to 44? P(24 < x < 44)
Solution.
Use the z-score:
$$z = \frac{x - \bar{x}}{s} \quad \text{or} \quad z = \frac{x - \mu}{\sigma}$$
1. P(x < 25)
$$z = \frac{x - \bar{x}}{s} = \frac{25 - 30}{6} = \frac{-5}{6} \approx -0.83$$
Use the Standard Normal Distribution table (see Appendix).
The computed z is −0.83, which is split into −0.8 and 0.03. Since z is negative, use the
table for negative z-scores. Locate −0.8 in the leftmost column of the table, and locate
0.03 in the top row. The intersection of the two gives the area, or probability, to the
left of z under the normal curve.
Source: http://onlinestatbook.com/2/calculators/normal_dist.html
[Figure: normal curve with the area to the left of z = −0.83 (x = 25) shaded, 0.20327 or 20.33%; mean 𝑥̅ = 30.]
Since there are 250 students, 20.33% of them scored less than 25. Therefore:
250 × 0.20327 ≈ 51
About 51 students scored less than 25. The number of students is rounded
to a whole number because it is a count.
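2. P(x > 36)
$$z = \frac{x - \bar{x}}{s} = \frac{36 - 30}{6} = \frac{6}{6} = 1.00$$
Locate 1.0 and 0.00 in the table for positive z-scores. We get 0.84134 or
84.13%, which is the area to the left of z = 1.00.
[Figure: normal curve split at z = 1.00 (x = 36); 84.13% of the area lies to the left and 15.87% to the right; mean 𝑥̅ = 30.]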
But we need the number of students who scored higher than 36. This
corresponds to the area on the other side of z = 1.00, which is
100% − 84.13% = 15.87%. Therefore:
250 × 0.1587 ≈ 40
About 40 students scored higher than 36.
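3. P(24 < x < 44)
$$z = \frac{x - \bar{x}}{s} = \frac{24 - 30}{6} = \frac{-6}{6} = -1.00$$
Locate: −1.0 and 0.00
$$z = \frac{x - \bar{x}}{s} = \frac{44 - 30}{6} = \frac{14}{6} \approx 2.33$$
Locate: 2.3 and 0.03
The area to the left of z = −1.00 is 0.15866 or 15.87%. Likewise, the area to the left of
z = 2.33 is 0.99010 or 99.01%.
[Figure: normal curve with the area between z = −1.00 (x = 24) and z = 2.33 (x = 44) shaded, 83.14%; the area to the left of z = −1.00 is 15.87% and the area to the left of z = 2.33 is 99.01%; mean 𝑥̅ = 30.]
To get the area in between, subtract the smaller area from the larger area.
99.01% − 15.87% = 83.14%
So, 83.14% or 0.8314 of the students scored from 24 to 44. Therefore:
250 × 0.8314 ≈ 208
About 208 students scored from 24 to 44.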
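The three answers can be cross-checked in Python with `statistics.NormalDist`, which evaluates the cumulative area directly instead of reading it from a table. Because z is not rounded to two decimals here, the probabilities differ slightly from the table values, but the rounded student counts agree. Variable names are illustrative.

```python
from statistics import NormalDist

exam = NormalDist(mu=30, sigma=6)   # mean 30, standard deviation 6
students = 250

p_less_25 = exam.cdf(25)                  # area to the left of x = 25
p_more_36 = 1 - exam.cdf(36)              # area to the right of x = 36
p_24_to_44 = exam.cdf(44) - exam.cdf(24)  # area between x = 24 and x = 44

for label, p in (("x < 25", p_less_25), ("x > 36", p_more_36), ("24 < x < 44", p_24_to_44)):
    print(label, round(p, 4), "->", round(students * p), "students")
```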
LESSON 6. Linear Regression
Linear regression is a way of determining the equation of the relationship between two
variables, given that the two variables are really related. Draw a scatter diagram or plot
to create an approximate model of the relationship between the variables. The line of most
interest is called the line of best fit or the least-squares regression line. It is the line
for a set of bivariate data that minimizes the sum of the squares of the vertical deviations
from each point to the line.
Let us determine the equation of the line that approximately models the relationship
of the following points:
A(1,2), B(2,3), C(3,3), D(4,6), E(5,4)
[Figure: scatter plot of points A to E; x-axis from 0 to 6, y-axis from 0 to 7.]
𝑥          𝑦          𝑥𝑦           𝑥²
1          2          1(2) = 2     1² = 1
2          3          6            2² = 4
3          3          9            9
4          6          24           16
5          4          20           25
∑𝑥 = 15    ∑𝑦 = 18    ∑𝑥𝑦 = 61     ∑𝑥² = 55
$$\bar{x} = \frac{15}{5} = 3 \qquad \bar{y} = \frac{18}{5} = 3.6$$
$$a = \frac{n \sum xy - (\sum x)(\sum y)}{n \sum x^2 - (\sum x)^2}$$
$$a = \frac{5(61) - (15)(18)}{5(55) - (15)^2}$$
$$a = \frac{305 - 270}{275 - 225}$$
$$a = \frac{35}{50} = \frac{7}{10} = 0.7$$
$$b = \bar{y} - a\bar{x}$$
$$b = 3.6 - \frac{7}{10}(3)$$
$$b = 3.6 - 2.1 = \frac{3}{2} = 1.5$$
Therefore, the equation of the least-squares regression line is:
$$\hat{y} = ax + b$$
$$\hat{y} = \frac{7}{10}x + \frac{3}{2} \quad \text{or} \quad \hat{y} = 0.7x + 1.5$$
[Figure: scatter plot of points A to E with the regression line; x-axis from 0 to 6, y-axis from 0 to 7.]
The line is increasing, with a slope (steepness or inclination) of 7/10 or 0.7
per unit and a y-intercept of 3/2 or 1.5.
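The column sums and the two coefficients can be recomputed with a short Python sketch. It uses `fractions.Fraction` so that the slope and intercept come out as exact fractions; the variable names are illustrative.

```python
from fractions import Fraction

points = [(1, 2), (2, 3), (3, 3), (4, 6), (5, 4)]   # A, B, C, D, E
n = len(points)

sum_x = sum(x for x, _ in points)
sum_y = sum(y for _, y in points)
sum_xy = sum(x * y for x, y in points)
sum_x2 = sum(x * x for x, _ in points)

# a = (n*sum(xy) - sum(x)*sum(y)) / (n*sum(x^2) - sum(x)^2)
a = Fraction(n * sum_xy - sum_x * sum_y, n * sum_x2 - sum_x ** 2)
# b = y_bar - a * x_bar
b = Fraction(sum_y, n) - a * Fraction(sum_x, n)

print(sum_x, sum_y, sum_xy, sum_x2)   # 15 18 61 55
print(a, b)                           # 7/10 3/2
print(float(a), float(b))             # 0.7 1.5
```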
Linear Correlation Coefficient
The linear correlation coefficient, denoted by 𝑟, is used to determine the strength of a
linear relationship between two variables. Given the ordered pairs of related variables
(𝑥1, 𝑦1), (𝑥2, 𝑦2), (𝑥3, 𝑦3), …, (𝑥𝑛, 𝑦𝑛), the equation is
$$r = \frac{n(\sum xy) - (\sum x)(\sum y)}{\sqrt{n(\sum x^2) - (\sum x)^2} \cdot \sqrt{n(\sum y^2) - (\sum y)^2}}$$
If 𝑟 is positive, as one variable increases, the other one also increases.
If 𝑟 is negative, as one variable increases, the other one decreases.
The closer |𝑟| is to 1, the stronger the linear relationship between the variables, while
𝑟 = 0 means there is no linear correlation.
Further interpretation. The following are accepted guidelines for
interpreting the correlation coefficient:
𝑟                                      Interpretation
0                                      No linear relationship
+1                                     Perfect positive linear relationship (as one variable increases, the other variable increases)
−1                                     Perfect negative linear relationship (as one variable increases, the other variable decreases)
Between 0 and 0.3 (0 and −0.3)         Weak positive (negative) linear relationship
Between 0.3 and 0.7 (−0.3 and −0.7)    Moderate positive (negative) linear relationship
Between 0.7 and 1.0 (−0.7 and −1.0)    Strong positive (negative) linear relationship
Source: Ratner, B. (2009). The correlation coefficient: Its values range between +1/−1, or do they?
Let us consider the same example. After determining the equation of the line of best
fit for the given data, let us now determine the correlation of the data.
A(1,2), B(2,3), C(3,3), D(4,6), E(5,4)
[Figure: scatter plot of points A to E; x-axis from 0 to 6, y-axis from 0 to 7.]
The first 4 columns were already filled in the previous solution. For this, we need an
additional column for 𝑦².
𝑥          𝑦          𝑥𝑦           𝑥²          𝑦²
1          2          1(2) = 2     1² = 1      2² = 4
2          3          6            2² = 4      3² = 9
3          3          9            9           9
4          6          24           16          36
5          4          20           25          16
∑𝑥 = 15    ∑𝑦 = 18    ∑𝑥𝑦 = 61     ∑𝑥² = 55    ∑𝑦² = 74
$$r = \frac{5(61) - (15)(18)}{\sqrt{5(55) - (15)^2} \cdot \sqrt{5(74) - (18)^2}}$$
$$r = \frac{305 - 270}{\sqrt{275 - 225} \cdot \sqrt{370 - 324}}$$
$$r = \frac{35}{\sqrt{50} \cdot \sqrt{46}} = \frac{35}{\sqrt{2300}} = \frac{35}{10\sqrt{23}} = \frac{7}{2\sqrt{23}}$$
$$r = \frac{7\sqrt{23}}{46} \approx 0.73$$
The computed 𝑟 is +0.73, which means that the variables have a strong positive linear
relationship.
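The correlation coefficient for the five points can be recomputed with a short Python sketch that follows the formula term by term. The variable names are illustrative.

```python
from math import sqrt

points = [(1, 2), (2, 3), (3, 3), (4, 6), (5, 4)]   # A, B, C, D, E
n = len(points)

sum_x = sum(x for x, _ in points)
sum_y = sum(y for _, y in points)
sum_xy = sum(x * y for x, y in points)
sum_x2 = sum(x * x for x, _ in points)
sum_y2 = sum(y * y for _, y in points)

numerator = n * sum_xy - sum_x * sum_y
denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
r = numerator / denominator

print(sum_x2, sum_y2)   # 55 74
print(round(r, 2))      # 0.73
```

A value of about 0.73 falls in the 0.7 to 1.0 band of the guideline table, which corresponds to a strong positive linear relationship.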
References
Almukkahal, R., et al. (2016). CK-12 Advanced Probability and Statistics Concepts.
Flexbook: next generation textbook.
Australian Bureau of Statistics (2013). What is Variable? Retrieved 04 June 2020 from
https://www.abs.gov.au/websitedbs/a3121120.nsf/home/statistical+language+-+what+are+variables#:~:text=A%20variable%20is%20any%20characteristics,type%20are%20examples%20of%20variables.
Bluman, A. G. (2018). Elementary Statistics: A Step by Step Approach, Tenth Edition,
ISBN 978 – 1 – 259 -75533. McGraw-Hill Education, New York City, USA.
Retrieved 03 June 2020 from https://b-ok.asia/book/5009088/f236d3
Dataceutics, Inc. (2018). Sir Ronald Aylmer Fisher – The Father of Modern Statistics.
Retrieved 06 June 2020 from
https://www.dataceutics.com/blog/2018/7/24/sir-ronald-aylmer-fisher-the-father-of-modern-statistics
Encyclopedia Britannica, Inc. (2020). Sir Ronald Aylmer Fisher. Retrieved 06 June
2020 from https://www.britannica.com/science/physical-anthropology
Gupta, S. (2014). Sampling Methods. Retrieved 06 June 2020 from
https://www.slideshare.net/shubhanshug1/seminar-sampling-methods?qid=d1f11eda-cdd5-44b8-81de-f0cd88637e6e&v=&b=&from_search=1
Ratner, B. (2009). The correlation coefficient: Its values range between +1/−1, or do
they? Springer Nature Switzerland. Retrieved 17 June 2020 from
https://doi.org/10.1057/jt.2009.5
Tejada, J. J. & Punzalan, R. B. (2012). On the Misuse of Slovin’s Formula. The Philippine
Statistician, Vol. 61, No. 1, pp. 129 – 136. Retrieved 06 May 2020 from
https://www.psai.ph/docs/publications/tps/tps_2012_61_1_9.pdf
Weiss, N. A. (2012). Elementary Statistics, 8th Edition, ISBN 978 – 0 – 321 – 69123 – 1.
Pearson Education, Inc., Boston, USA. Retrieved 03 June 2020 from
https://b-ok.asia/book/1236722/d339a2
http://onlinestatbook.com/2/calculators/normal_dist.html
Appendix A
http://onlinestatbook.com/2/calculators/normal_dist.html
Appendix B
http://onlinestatbook.com/2/calculators/normal_dist.html