Data Management 1 Merged 1
Objectives
1. Recognize the basic terms of statistics.
2. Determine and apply the measures of central tendency, variability, and
position.
3. Apply the measures of central tendency and variability to the normal
distribution.
4. Determine the linear regression and correlation of the set of data.
Lesson Proper
Introduction
Data management is a process by which information is acquired and processed to ensure the
accessibility and reliability of the data for its users. One of the most important tools in processing and
managing such information is statistics. Statistics is utilized in most areas of human endeavor. It is
usually used in education, research, business, agriculture, and other fields and even in everyday life
activities.
2. Inferential Statistics
Consists of methods for drawing and measuring the reliability of conclusions
about a population based on information obtained from a sample of the
population.
After the data have been collected, organized, summarized, and presented (descriptive statistics),
inferential statistics is used to determine the findings and draw conclusions.
This means that descriptive statistics and inferential statistics are interrelated. Use
descriptive statistics to organize and summarize the information obtained from a sample before
carrying out an inferential analysis.
Population
The collection of all individuals or items under consideration in a statistical study.
Sample
That part of the population from which information is obtained.
For example, consider a study of Statistics University, which has 6,589 students. The 6,589 students are
the population. If the researcher randomly selects class A with 44 students, those 44 students are
the sample. A sample is meant to be representative of the population.
Before we go through the discussions, let us first define some basic operational terms
in statistics:
Variable – a characteristic or attribute that can assume different values; any characteristic,
number, or quantity that can be measured or counted. It is also called a data item.
The information collected for the variables describes the situation.
Example. Age, sex, business income and expenses, birth, expenditure,
class grades, and eye color, among others.
Types of Variables
1. Numeric Variables/ Quantitative Variables
Have values that describe a measurable quantity as a number, like ‘how many’
or ‘how much’. These are quantifiable variables. Data collected on a
numeric variable are called quantitative data.
a. Continuous Variable
Observations can take any value between a certain set of real numbers. The value
given to an observation for a continuous variable can include values as small as the
instrument of measurement allows.
Examples: height, time, age, and temperature
Height can be 1.62 m, time can be 3.5 hours (3 hours and 30 minutes), age can be 16 3/4
years old (16 years and 9 months), and temperature can be 36 2/5 °C or 36.40 °C.
b. Discrete Variable
Observations can take a value based on a count from a set of distinct whole values. A
discrete variable cannot take the value of a fraction between one value and the
next closest value.
Examples: number of registered cars, number of business locations,
and number of children in a family, all of which are measured
in whole units (e.g. 1, 2, 3 cars)
Data
For example, the grades of 5 students in Statistics are 94, 75, 82.5, 74.9, and 89.
In this example, the students' grade is the variable. As a numeric variable, it is
classified as a continuous variable since it can take decimal or fractional values.
Furthermore, 94, 75, 82.5, 74.9, and 89 form the data set. Each value is a data value or datum
(e.g. 94 is a data value or datum). These data are continuous data since they can take any value from a set of
real numbers.
Besides the qualitative and quantitative classification, variables can also be classified by how they
are categorized, using measurement scales (levels of measurement).
Level of Measurement
1. Nominal level of measurement
Classifies data into mutually exclusive (no overlapping) categories in which no order
or ranking can be imposed on the data. Nominal data are countable.
Example: gender, zip codes; political party; religion; nationality
One property is lacking in the interval scale: There is no true zero. For example, IQ tests
do not measure people who have no intelligence. For temperature, 0°F does not mean
no heat at all.
For example, if one person can lift 200 pounds and another can lift 100 pounds, then
the ratio between them is 2 to 1. Put another way, the first person can lift twice as
much as the second person.
There is not complete agreement among statisticians about the classification of data into one
of the four categories. For example, some researchers classify IQ data as ratio data rather
than interval. Also, data can be altered so that they fit into a different category. For instance,
if the incomes of all professors of a college are classified into the three categories of low,
average, and high, then a ratio variable becomes an ordinal variable.
Developed to mathematically determine the most effective way to acquire a sample that
would accurately reflect the population of the study.
The most common mathematical formula for determining the sample size in reference to the
population is Slovin’s Formula, which was introduced by Slovin in 1960. To this day, it is still
unknown who Slovin really is; several names have been associated with the formula: Mark Slovin, Michael Slovin, or
Kulkol Slovin.
Slovin’s Formula
\[ n = \frac{N}{1 + N e^{2}} \]
where:
𝑛 is the sample size,
𝑁 is the population size,
𝑒 is the margin of error.
Use Slovin’s formula if you have no idea about the population’s behavior. Slovin’s formula
determines sample in proportion to the population. Slovin’s formula is applicable only
when estimating a population proportion and when the confidence coefficient is 95%.
There are other sampling formulas that could be used to determine samples in relation to
the characteristics of the variables.
The margin of error tells by how many percentage points your results may differ
from the real population value. For example, a 0.05 (5%) level of significance
implies a 0.95 (95%) confidence level with respect to the real population value.
\[ n = \frac{6{,}518}{1 + 6{,}518(0.05)^{2}} \]
\[ n = \frac{6{,}518}{1 + 6{,}518(0.0025)} \]
\[ n = \frac{6{,}518}{1 + 16.295} \]
\[ n = \frac{6{,}518}{17.295} \]
\[ n = 376.87 \approx 377 \]
This implies that, using Slovin’s formula, the sample size for the given population is 377 respondents.
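The computation above can be checked with a short Python sketch. The function name `slovin_sample_size` and the choice to round up to the next whole respondent are assumptions made for illustration; the worked example rounds 376.87 to 377, which gives the same result here.

```python
import math


def slovin_sample_size(population_size: int, margin_of_error: float) -> int:
    """Slovin's formula n = N / (1 + N * e^2), rounded up to a whole respondent.

    Rounding up is an assumption of this sketch, not part of the formula itself.
    """
    if population_size <= 0:
        raise ValueError("population_size must be positive")
    if not 0 < margin_of_error < 1:
        raise ValueError("margin_of_error must be between 0 and 1")
    n = population_size / (1 + population_size * margin_of_error ** 2)
    return math.ceil(n)


# Population of 6,518 with a 5% margin of error, as in the worked example.
print(slovin_sample_size(6518, 0.05))
```

Changing `margin_of_error` shows how quickly the required sample grows as the tolerated error shrinks.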
Sampling Techniques
Sampling techniques are methods of identifying who will be the respondents of the study
(the sample). For instance, in the previous example, how do we identify the 377 respondents? This
is where sampling techniques come in.
Example: For the sake of illustration, let us limit the population size. Suppose the
population size is 10 and the sample size is 5. How can we obtain the 5
samples?
Solution: Step 1. Divide the population size by the sample size.
\[ \frac{\text{Population size}}{\text{Sample size}} = \frac{10}{5} = 2 \]
This implies that every 2nd member will be selected.
Step 2. Arrange the population in order, and
randomly select the starting (first) sample. Here the randomly selected start is 4.
1 2 3 4 5 6 7 8 9 10
From 4, every 2nd member is selected until the 5 target samples are obtained. After 10, the count
wraps around to the start of the list:
4 5 6 7 8 9 10 1 2 3
Counting "1st, 2nd" after each selected member gives 6, 8, 10, and then 2.
The target of 5 samples is now obtained: the 4th, 6th, 8th, 10th, and 2nd.
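The two steps above can be sketched in Python. The function name `systematic_sample` and its parameters are illustrative assumptions; the start position is passed in explicitly so that the example reproduces the start at 4, whereas in practice it would be drawn at random (for instance with `random.randrange`).

```python
def systematic_sample(population, sample_size, start_index):
    """Pick every k-th member starting at start_index, wrapping around the list.

    k is the population size divided by the sample size (Step 1).
    start_index is zero-based, so the 4th member has index 3.
    """
    size = len(population)
    k = size // sample_size  # sampling interval
    return [population[(start_index + j * k) % size] for j in range(sample_size)]


population = list(range(1, 11))  # members numbered 1 to 10
print(systematic_sample(population, sample_size=5, start_index=3))
```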
Example: The town has 250 homeowners of which 25, 175, and 50 are
upper income, middle income, and low income, respectively.
Explain how we can obtain a sample of 20 homeowners,
using stratified sampling with proportional allocation,
stratifying by income group.
Mathematics in the Modern World – Data Management (Part 1) – Madrazo, A. (2020), [email protected] | 9
Solution:
Step 1. Divide the population into subpopulations (strata).
Stratum 1: upper income (25)
Stratum 2: middle income (175)
Stratum 3: lower income (50)
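Under proportional allocation, each stratum contributes to the sample in proportion to its share of the population. The following is a minimal sketch of that arithmetic for the three strata listed above; the function name `proportional_allocation` and the use of simple rounding are assumptions for illustration. With other numbers, rounded shares may not add up exactly to the sample size and would need adjusting.

```python
def proportional_allocation(strata_sizes, sample_size):
    """Return the number to sample from each stratum, proportional to its size."""
    total = sum(strata_sizes.values())
    return {
        name: round(sample_size * size / total)
        for name, size in strata_sizes.items()
    }


# 250 homeowners split by income group; a sample of 20 is needed.
strata = {"upper income": 25, "middle income": 175, "lower income": 50}
print(proportional_allocation(strata, 20))
```

A simple random sample of the computed size would then be drawn within each stratum.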
The planner used the 947 blocks as the clusters, thus dividing
the population (residential portion of the city) into 947 groups.
Step 2. Obtain a simple random sample of the clusters.
Step 3. Use all the members of the clusters obtained in Step 2 as the
sample
bases of quota are age, gender, education, race, religion, & socio-
economic status.
Example: The basis of the quota is college year level, and the research needs equal
representation with a sample size of 100. The researcher must
select 25 from each year level.
After observing the initial subject, the researcher asks for assistance
from the subject to help in identifying people with a similar trait of
interest. It is like asking subjects to nominate another with the same
trait. The same process is repeated until a sufficient number of subjects is
obtained.
LESSON 3. Measures of Central Tendency
These are descriptive measures that indicate where the center, or the most typical
value, of a data set lies. They are often called averages. The three most important
measures of central tendency are the mean, median, and mode. The mean and median
apply only to quantitative data, whereas the mode can be used with either quantitative
or qualitative data.
Statistic – a characteristic or measure obtained by using data values from a sample.
Parameter – a characteristic or measure obtained by using all the data values from a
specific population.
Data Classification
a. Ungrouped/ Small Data – if the data set has 30 values or fewer.
b. Grouped/ Large Data – if the data set has more than 30 values.
Suppose Carmella’s scores in seven 100-item tests are 78, 96, 85, 91, 70, 79, and 96.
Determine the mean, median, and mode.
1. Mean
It is the sum of the observations divided by the number of observations.
Among the three measures, this is the most reliable. It is also called the average.
\[ \bar{x} = \frac{\sum x}{n} = \frac{x_1 + x_2 + x_3 + \cdots + x_{n-1} + x_n}{n} \]
\[ \mu = \frac{\sum x}{N} = \frac{x_1 + x_2 + x_3 + \cdots + x_{N-1} + x_N}{N} \]
Where:
𝑥 is the individual datum,
𝑛 is the sample size,
𝑁 is the population size.
\[ \bar{x} = \frac{\sum x}{n} = \frac{78 + 96 + 85 + 91 + 70 + 79 + 96}{7} \]
\[ \bar{x} = \frac{595}{7} = 85 \]
The mean described above is the arithmetic mean. Besides this, there
are other types of mean, such as the weighted mean and the combined/compound
mean.
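As a quick check of the arithmetic mean for Carmella's seven scores, here is a minimal Python sketch. The function name `arithmetic_mean` is an illustrative choice, not a term from this module.

```python
def arithmetic_mean(values):
    """Sum of the observations divided by the number of observations."""
    if not values:
        raise ValueError("at least one observation is required")
    return sum(values) / len(values)


scores = [78, 96, 85, 91, 70, 79, 96]
print(sum(scores))              # total of the seven scores
print(arithmetic_mean(scores))  # total divided by 7
```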
2. Median
When the data are arranged in increasing or decreasing order, it is the middlemost number.
• If the number of observations is odd, then the median is the
observation exactly in the middle of the ordered list.
• If the number of observations is even, then the median is the mean of
the two middle observations in the ordered list.
In both cases, if we let n denote the number of observations, then the median
is at position (n + 1)/2 in the ordered list. The median is denoted by 𝑥̃ (read as x-tilde).
Let us consider the given data above and arrange them in increasing order. Since
the number of data values is 7, which is odd, the first condition applies.
70 78 79 85 91 96 96
1st 2nd 3rd 4th 5th 6th 7th
The middlemost number is 85. Its position is
\[ \frac{n + 1}{2} = \frac{7 + 1}{2} = \frac{8}{2} = 4, \]
so 85 is the 4th term.
\[ \tilde{x} = 85 \]
To illustrate the 2nd condition, where the number of data values is even, let us consider
the same data and add another number; suppose the additional number
is 68.
68 70 78 79 85 91 96 96
1st 2nd 3rd 4th 5th 6th 7th 8th
The median is the average of the two numbers at the center, 79 and 85.
\[ \tilde{x} = \frac{79 + 85}{2} = \frac{164}{2} = 82 \]
The position is
\[ \frac{n + 1}{2} = \frac{8 + 1}{2} = \frac{9}{2} = 4.5. \]
The position of the median, 82, is the 4.5th. This
means that 82 is halfway between the 4th and the 5th terms.
The median is also the most stable measure among the three because it is not
affected by outliers (extremes). Outliers are data values that are either extremely
high or extremely low.
Let us consider the same example again, but this time we are going to change
the highest value, the lowest value, or both.
70 78 79 85 91 96 96
From the given, 85 is the median.
1 78 79 85 91 96 96
We changed the lowest number from 70 to 1, but the median is still 85.
70 78 79 85 91 96 500
We replaced the highest from 96 to 500, still the median is 85.
1 78 79 85 91 96 500
We replaced both the lowest and highest, still the median is 85.
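The odd and even rules for the median, and its resistance to outliers, can be reproduced with a small Python sketch. The function name `median_of` is an illustrative assumption.

```python
def median_of(values):
    """Middle value of the ordered data; mean of the two middle values if n is even."""
    if not values:
        raise ValueError("at least one observation is required")
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd n: exact middle term
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: average of the two middle terms


scores = [78, 96, 85, 91, 70, 79, 96]
print(median_of(scores))                           # seven scores (odd case)
print(median_of(scores + [68]))                    # eight scores (even case)
print(median_of([1, 78, 79, 85, 91, 96, 500]))     # both extremes replaced
```

Replacing the extremes changes the mean considerably but leaves the middle position of the ordered list untouched, which is why the median stays the same.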
3. Mode
• The most frequent data value.
• If no value occurs more than once, then the data set has no mode.
• Otherwise, any value that occurs with the greatest frequency is a mode
of the data set.
• Denoted by 𝑥̂ (read as x-hat).
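The rules above can be sketched in Python using `collections.Counter` from the standard library. The function name `modes_of` is an illustrative assumption; it returns a list because a data set may have no mode, one mode, or several.

```python
from collections import Counter


def modes_of(values):
    """Return all values with the greatest frequency, or [] if nothing repeats."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # no value occurs more than once, so there is no mode
    return sorted(value for value, count in counts.items() if count == highest)


scores = [78, 96, 85, 91, 70, 79, 96]  # Carmella's scores from the example above
print(modes_of(scores))
print(modes_of([70, 78, 79]))          # no repeated value
```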
b) Grouped/ Large Data
The ages of the first 50 persons who entered the mall were tallied, as shown below.
Determine the mean, median, and mode of their ages.
Age        Frequency
10 – 19    5
20 – 29    20
30 – 39    10
40 – 49    7
50 – 59    8
Total      n = 50
In the table above, the age groups are the classes.
1. Mean
\[ \bar{x} = \frac{\sum fx}{n} = \frac{f_1 x_1 + f_2 x_2 + f_3 x_3 + \cdots + f_{n-1} x_{n-1} + f_n x_n}{n} \]
\[ \mu = \frac{\sum fx}{N} = \frac{f_1 x_1 + f_2 x_2 + f_3 x_3 + \cdots + f_{N-1} x_{N-1} + f_N x_N}{N} \]
Where:
𝑥̄ is the sample mean
𝜇 is the population mean
𝑛 is the sample size
𝑁 is the population size
𝑓 is the class frequency
𝑥 is the class mark
To start, let us first complete the table below. In each class, for instance class
10 – 19, the smaller value is the lower limit, which is 10, and the larger value is the
upper limit, which is 19. The class mark is the average of the lower
limit and upper limit of the class. In the lowest class (the class with the lowest values), the
class mark is:
\[ \frac{10 + 19}{2} = \frac{29}{2} = 14.5 \]
You could repeat the same process for the other classes. However, there is an alternative way to
continue the process: using the class interval.
Class (Age)   Frequency (f)   Class mark (x)   fx
10 – 19       5               14.5
20 – 29       20
30 – 39       10
40 – 49       7
50 – 59       8
Total         n = 50
Here, 10 and 20 are the lower limits of two
succeeding classes, and 19 and 29 are the upper limits of two succeeding classes.
Their difference is the class interval:
\[ 20 - 10 = 29 - 19 = 10 \]
This is also true for the other classes.
To continue, just add the class interval to the previous class mark to get the next one.
Class (Age)   Frequency (f)   Class mark (x)       fx
10 – 19       5               14.5                 5 ∙ 14.5 = 72.5
20 – 29       20              14.5 + 10 = 24.5     490
30 – 39       10              34.5                 345
40 – 49       7               44.5                 311.5
50 – 59       8               54.5                 436
Total         n = 50                               ∑fx = 1,655
The last column (fx) is the product of the frequency (f) and the class mark (x). Add all the
products to get ∑ 𝑓𝑥. Hence, to get the mean:
\[ \bar{x} = \frac{\sum fx}{n} = \frac{1{,}655}{50} = 33.1 \]
This implies that the average age of those who came to the mall is more or less 33 years
old.
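The grouped mean can be recomputed with a short Python sketch. Representing each class as a `(lower limit, upper limit, frequency)` tuple and the function name `grouped_mean` are assumptions made for illustration.

```python
def grouped_mean(classes):
    """Mean of grouped data: sum of f * class mark divided by the total frequency.

    Each class is a (lower_limit, upper_limit, frequency) tuple.
    """
    n = sum(frequency for _, _, frequency in classes)
    total_fx = sum(
        frequency * (lower + upper) / 2  # class mark = average of the two limits
        for lower, upper, frequency in classes
    )
    return total_fx / n


ages = [(10, 19, 5), (20, 29, 20), (30, 39, 10), (40, 49, 7), (50, 59, 8)]
print(grouped_mean(ages))
```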
2. Median
\[ \tilde{x} = LB_{me} + \left( \frac{\frac{n}{2} - cf_b}{f_{me}} \right) i \]
Where:
𝑥̃ is the median
𝐿𝐵𝑚𝑒 is the lower boundary of the median class
𝑛 is the sample size
𝑐𝑓𝑏 is the summation of frequencies before the median class (lower
classes of median class). 𝑐𝑓 stands for cumulative frequency.
𝑓𝑚𝑒 is the frequency of the median class
𝑖 is the class interval
Let us use the previous results. Add another column for the summation of
frequencies. If you are only going to find the median, you can disregard the 3rd
column (class mark) and 4th column (fx).
Classes    f     x      fx      cf
10 – 19    5     14.5   72.5    5
20 – 29    20    24.5   490     5 + 20 = 25
30 – 39    10    34.5   345     25 + 10 = 35
40 – 49    7     44.5   311.5   35 + 7 = 42
50 – 59    8     54.5   436     42 + 8 = 50
Total      n = 50       ∑fx = 1,655
\[ \tilde{x} = LB_{me} + \left( \frac{\frac{n}{2} - cf_b}{f_{me}} \right) i \]
First, divide the sample size by 2.
\[ \frac{n}{2} = \frac{50}{2} = 25 \quad (\text{25th term}) \]
Observe the last column: class 10 – 19 holds the 1st to 5th terms, class 20 – 29
holds the 6th to 25th terms, class 30 – 39 holds the 26th to 35th terms, and so
on. Since the 25th term belongs to class 20 – 29, the median class is
class 20 – 29.
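Classes    f     x      fx      ∑f
10 – 19    5     14.5   72.5    5
20 – 29    20    24.5   490     25
30 – 39    10    34.5   345     35
40 – 49    7     44.5   311.5   42
50 – 59    8     54.5   436     50
Total      n = 50       ∑fx = 1,655
In the table, 𝑓𝑚𝑒 = 20 (the frequency of the median class 20 – 29) and 𝑐𝑓𝑏 = 5 (the cumulative
frequency of the class before it).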
The last variable with no value yet is 𝐿𝐵𝑚𝑒. This is the average of the lower
limit of the median class, which is 20 in this case, and the upper limit of the
class just below the median class, which is 19 in this case. So:
\[ LB_{me} = \frac{19 + 20}{2} = \frac{39}{2} = 19.5 \]
Then, compute the median of the given data. The value of the class interval (𝑖) is 10,
the same as what we used earlier to determine the mean.
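\[ \tilde{x} = LB_{me} + \left( \frac{\frac{n}{2} - cf_b}{f_{me}} \right) i \]
\[ \tilde{x} = 19.5 + \left( \frac{\frac{50}{2} - 5}{20} \right) 10 \]
\[ \tilde{x} = 19.5 + \left( \frac{25 - 5}{20} \right) 10 \]
\[ \tilde{x} = 19.5 + \left( \frac{20}{20} \right) 10 \]
\[ \tilde{x} = 19.5 + (1)10 \]
\[ \tilde{x} = 19.5 + 10 = 29.5 \]
Therefore, the median is 29.5. The middlemost age is more or less 30 years.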
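The grouped median formula can be checked with a Python sketch that takes the class interval as 10, the width stated in the text. The function name `grouped_median`, the `(lower limit, upper limit, frequency)` tuples, and the shortcut of taking the lower boundary as the lower limit minus 0.5 (valid when consecutive class limits differ by 1, as 19 and 20 do here) are assumptions made for illustration.

```python
def grouped_median(classes):
    """Median of grouped data: LB_me + ((n/2 - cf_b) / f_me) * i.

    Each class is a (lower_limit, upper_limit, frequency) tuple, in increasing order.
    Assumes at least two equal-width classes whose limits are 1 unit apart.
    """
    n = sum(frequency for _, _, frequency in classes)
    interval = classes[1][0] - classes[0][0]  # class interval i
    cf_before = 0                             # cumulative frequency before the current class
    for lower, _, frequency in classes:
        if cf_before + frequency >= n / 2:    # this class holds the (n/2)-th term
            lower_boundary = lower - 0.5
            return lower_boundary + ((n / 2 - cf_before) / frequency) * interval
        cf_before += frequency
    raise ValueError("classes must contain at least one observation")


ages = [(10, 19, 5), (20, 29, 20), (30, 39, 10), (40, 49, 7), (50, 59, 8)]
print(grouped_median(ages))
```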
3. Mode
\[ \hat{x} = LB_{mo} + \left( \frac{f_{mo} - f_b}{2 f_{mo} - f_b - f_a} \right) i \]
Where:
𝑥̂ is the mode
𝐿𝐵𝑚𝑜 is the lower boundary of the modal class
𝑓𝑚𝑜 is the frequency of the modal class
𝑓𝑏 is the frequency before the modal class or frequency of
immediate lower class than modal class
𝑓𝑎 is the frequency after the modal class or frequency of
immediate higher class than modal class
Class 20 – 29 has the highest frequency, so it is the modal class.
If two or more classes share the highest frequency, then all of those classes
are modal classes.
Classes    f     x      fx      ∑f
10 – 19    5     14.5   72.5    5     (𝑓𝑏 = 5)
20 – 29    20    24.5   490     25    (𝑓𝑚𝑜 = 20)
30 – 39    10    34.5   345     35    (𝑓𝑎 = 10)
40 – 49    7     44.5   311.5   42
50 – 59    8     54.5   436     50
Total      n = 50       ∑fx = 1,655
The frequency of the modal class is 20. The frequency of the class before the modal class
(the lower class immediately next to the modal class) is 5, and the frequency of the class
after the modal class (the higher class immediately next to the modal class) is 10. The class
interval is also 10 (as in the mean and median). The lower boundary of the modal class is obtained by the
same process as the lower boundary of the median class: it is the average of the lower limit of the
modal class and the upper limit of the immediate lower class next to the modal class.
\[ LB_{mo} = \frac{20 + 19}{2} = \frac{39}{2} = 19.5 \]
Then, compute the mode.
\[ \hat{x} = LB_{mo} + \left( \frac{f_{mo} - f_b}{2 f_{mo} - f_b - f_a} \right) i \]
\[ \hat{x} = 19.5 + \left( \frac{20 - 5}{2(20) - 5 - 10} \right) 10 \]
\[ \hat{x} = 19.5 + \left( \frac{15}{40 - 15} \right) 10 \]
\[ \hat{x} = 19.5 + \left( \frac{15}{25} \right) 10 \]
\[ \hat{x} = 19.5 + \left( \frac{3}{5} \right) 10 \]
\[ \hat{x} = 19.5 + \frac{30}{5} \]
\[ \hat{x} = 19.5 + 6 \]
\[ \hat{x} = 25.5 \]
The mode is 25.5. The most common age of those who entered the mall is more or less 26 years old.
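The grouped mode formula can be checked in the same way, taking the class interval as 10, the width stated in the text. The function name `grouped_mode`, the `(lower limit, upper limit, frequency)` tuples, the lower boundary taken as the lower limit minus 0.5, and the choice to use only the first class with the highest frequency are assumptions of this sketch.

```python
def grouped_mode(classes):
    """Mode of grouped data: LB_mo + ((f_mo - f_b) / (2*f_mo - f_b - f_a)) * i.

    Each class is a (lower_limit, upper_limit, frequency) tuple, in increasing order.
    Assumes at least two equal-width classes whose limits are 1 unit apart.
    """
    frequencies = [frequency for _, _, frequency in classes]
    k = frequencies.index(max(frequencies))  # first class with the highest frequency
    f_mo = frequencies[k]
    f_b = frequencies[k - 1] if k > 0 else 0                     # class just below
    f_a = frequencies[k + 1] if k < len(frequencies) - 1 else 0  # class just above
    interval = classes[1][0] - classes[0][0]                     # class interval i
    lower_boundary = classes[k][0] - 0.5
    return lower_boundary + ((f_mo - f_b) / (2 * f_mo - f_b - f_a)) * interval


ages = [(10, 19, 5), (20, 29, 20), (30, 39, 10), (40, 49, 7), (50, 59, 8)]
print(grouped_mode(ages))
```

If the modal class and both of its neighbors have equal frequencies, the denominator is zero and the formula cannot be applied; the sketch does not handle that case.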