Data Management: Objectives
Republic of the Philippines
CAVITE STATE UNIVERSITY
Cavite City Campus
Brgy. 8, Pulo II, Dalahican, Cavite City
CvSU Vision
The premier university in historic Cavite recognized for excellence in the development of
globally competitive and morally upright individuals.
CvSU Mission
Cavite State University shall provide excellent, equitable and relevant educational opportunities in
the arts, science and technology through quality instruction and relevant research and development
activities. It shall produce professional, skilled and morally upright individuals for global competitiveness.
CHAPTER 4
DATA MANAGEMENT
Objectives:
After the completion of the chapter, students should be able to:
use a variety of statistical tools to process and manage numerical data;
use methods of linear regression and correlation to predict the value of a variable given certain
conditions; and
advocate the use of statistical data in making important decisions.
EVALUATION REQUIREMENTS:
Problem Sets and Exercises
Quiz
Quantitative Research Proposal (FINAL PROJECT)
SAMPLE: You want the university to offer an online enrolment system to improve the enrolment
process. CSG asks your team to present hard data that will convince the administration. Prepare a
proposal on how you will do this task.
Statistical tools derived from mathematics are useful in processing and managing numerical data
in order to describe a phenomenon and predict values.
NATURE OF STATISTICS
General Uses of Statistics
a. Statistics aids in decision making
provides comparison
explains action that has taken place
justifies a claim or assertion
predicts future outcome
estimates unknown quantities
b. Statistics summarizes data for public use
FIELDS OF STATISTICS
a. Statistical Methods of Applied Statistics – refers to procedures and techniques used in the
collection, presentation, analysis and interpretation of data.
Descriptive statistics
- methods concerned with the collection, description and analysis of a set of data
without drawing conclusions or inferences about a larger set.
- the main concern is simply to describe the set of data.
Inferential Statistics
- methods concerned with making predictions or inferences about a larger set of data
using only the information gathered from a subset of this larger set.
- the main concern is not merely to describe but to predict and make inferences based
on the information gathered.
b. Statistical Theory of Mathematical Statistics – deals with the development and exposition of
theories that serve as bases of statistical methods.
CLASSIFICATION OF VARIABLES
1. Discrete vs. Continuous
Discrete – a variable which can assume a finite or countable number of values; usually measured by
counting or enumeration.
Continuous – a variable which can assume infinitely many values corresponding to the points on a line interval.
2. Qualitative vs. Quantitative
Qualitative – a variable that yields a categorical response.
Example: Occupation, Marital Status
Quantitative – a variable that takes on numerical values representing an amount or quantity.
Example: Weight, Height, Age, Number of cars
LEVEL OF MEASUREMENT
1. Nominal Level – the nominal level or classificatory scale is the weakest level of measurement where
numbers or symbols are used simply for categorizing subjects into different groups.
Examples: Sex: M-Male F-Female
Marital Status: 1-Single 2-Married 3-Widowed 4-Separated
2. Ordinal Level – the ordinal level of measurement contains the properties of the nominal level, and in
addition, the numbers assigned to categories of any variables may be ranked or ordered in some
low-to-high manner.
Examples: Teaching Ratings 1-poor 2-fair 3-good 4-excellent
Year Level 1-1st year 2-2nd year 3-3rd year 4-4th year
3. Interval Level – the interval level is that which the distances between any two numbers on the scale
are of known sizes.
Example: IQ level, Temperature
4. Ratio Level – the ratio level of measurement contains all the properties of the interval level, and in
addition, it has a “true zero” point.
Example: Number of correct answers in exam.
CLASSIFICATION OF DATA
2. A study to be conducted by an NGO would determine the Filipinos’ awareness about the war
against IRAQ.
Population: _________________________________________________________________________
Variable: ___________________________________________________________________________
Type of Variable: ____________________________________________________________________
SLOVIN’S FORMULA
$$n = \frac{N}{1 + N e^{2}}$$
Where:
n = sample size
N = population size
e = margin of error (0.05 or 0.01)
Example:
1. Solve for the sample size of 350 patients from Cavite Medical Center.
2. 12,345
3. 1000
4. 1203
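The formula above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original module: the function name `slovin_sample_size` is hypothetical, the margin of error is assumed to be 0.05, and the result is rounded up to the next whole respondent (a common convention, assumed here).

```python
import math


def slovin_sample_size(population_size: int, margin_of_error: float = 0.05) -> int:
    """Return n = N / (1 + N * e^2), rounded up to a whole respondent."""
    if population_size <= 0:
        raise ValueError("population_size must be positive")
    if not 0 < margin_of_error < 1:
        raise ValueError("margin_of_error must be between 0 and 1")
    raw = population_size / (1 + population_size * margin_of_error ** 2)
    return math.ceil(raw)  # rounding up is an assumption, not stated in the text


# Population sizes taken from the four example items; e = 0.05 is assumed.
for N in (350, 12345, 1000, 1203):
    print(N, slovin_sample_size(N, 0.05))
```

For the first item, 350 / (1 + 350 × 0.05²) = 350 / 1.875 ≈ 186.67, so the sketch reports a sample of 187 patients.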
2. Observation method – makes possible the recording of behavior but only at the time of occurrence.
3. Experimental method – a method designed for collecting data under controlled conditions. An
experiment is an operation where there is actual human interference with the conditions
that can affect the variable under study.
4. Use of existing studies – e.g., census, health statistics, and weather bureau reports.
Two types:
Documentary sources – published or written reports, periodicals, unpublished documents,
etc.
Field sources – researchers who have done studies on the area of interest are asked
personally or directly for information needed.
5. Registration method – e.g., car registration, student registration and hospital admission.
Advantages:
It gives emphasis to significant figures and comparisons.
It is the simplest and most appropriate approach when there are only a few numbers to be presented.
Disadvantages:
When a large mass of quantitative data is included in a text or paragraph, the presentation becomes
almost incomprehensible.
Paragraphs can be tiresome to read, especially if the same words are repeated many times.
2. Box Head –the portion of the table that contains the column heads which describe the data in each
column.
3. Stub – The portion of the table usually comprising the first column on the left. The row caption is a
descriptive title of the data on the given line.
4. Field – main part of the table; contains the substance or the figures of one’s data.
5. Source note – an exact citation of the source of data presented in the table (should always be
placed when the figures are not original).
6. Foot note – any statement or note inserted at the bottom of the table.
Index crimes        77,261  124   67,354  106   58,684   90
   Murder            8,707   14    8,293   13    7,758   12
   Homicide          8,069   13    7,912   12    7,123   11
   Physical Injury  29,862   35   20,462   32   18,722   29
   Robbery          13,817   22   11,164   18    9,856   15
   Theft            22,780   37   17,374   27   12,940   20
   Rape              2,026    3    2,149    3    2,285    4
Nonindex crimes     44,065   71   37,365   59   38,002   58
Source: Philippine National Police
(In this illustration, the column of row captions is marked as the stub and the block of figures as the field.)
Graphical Presentation – a graph or chart is a device for showing numerical values or relationships in
pictorial form.
Advantages:
Main features and implications of a body of data can be grasped at a glance.
Can attract attention and hold the reader’s interest.
Simplifies concepts that would otherwise have been expressed in so many words.
Can readily clarify data; frequently brings out hidden facts and relationships.
a. Mode – it is the observation that appears most often. Mode is the least preferred measure of central
location.
Example: Find the mode
Observations                          Mode
3 8 6 7 9 9 3 3 10                    3 (unimodal)
10 15 15 20 25 25 30 35 45            15 and 25 (bimodal)
10 15 15 20 25 25 30 30 35 45         15, 25 and 30 (trimodal)
3 8 6 6 7 7 9 9 3 6 3 10 7 9          3, 6, 7 and 9 (multimodal)
Solution:
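One way to check the modes listed in the example is to count how often each observation appears. The sketch below uses Python's standard `collections.Counter`; the helper name `modes` is an assumption made for illustration.

```python
from collections import Counter


def modes(observations):
    """Return every value that occurs with the highest frequency, sorted."""
    counts = Counter(observations)
    highest = max(counts.values())
    return sorted(value for value, count in counts.items() if count == highest)


# Data sets copied from the example table.
print(modes([3, 8, 6, 7, 9, 9, 3, 3, 10]))                 # one mode
print(modes([10, 15, 15, 20, 25, 25, 30, 35, 45]))         # two modes
print(modes([10, 15, 15, 20, 25, 25, 30, 30, 35, 45]))     # three modes
print(modes([3, 8, 6, 6, 7, 7, 9, 9, 3, 6, 3, 10, 7, 9]))  # four modes
```

Note that when every value appears exactly once, this helper returns all of the values; deciding whether such a data set has "no mode" is left to the reader.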
b. Median
Formula: $$\tilde{x} = LCB_{md} + \left[\frac{\frac{n}{2} - {<}cf_p}{f_{md}}\right] i$$
Example:
Final grades of Stat 101 students arranged in a frequency distribution table. Solve for the median.
Solution:
1. Determine the location of the median by dividing the total number of observations by 2.
$$\frac{n}{2} = \frac{110}{2} = 55$$
2. Go over the entries in the less than cumulative frequency column. The first class whose
cumulative frequency is greater than the result of step 1 is the median class.
Class      Frequency   LCB    <cf
50 – 55    10          49.5   10
56 – 61    6           55.5   16
62 – 67    8           61.5   24
68 – 73    25          67.5   49
74 – 79    22          73.5   71    (median class)
80 – 85    23          79.5   94
86 – 91    12          85.5   106
92 – 97    4           91.5   110
N = 110
$$\tilde{x} = LCB_{md} + \left[\frac{\frac{n}{2} - {<}cf_p}{f_{md}}\right] i$$
$$\tilde{x} = 73.5 + \left[\frac{\frac{110}{2} - 49}{22}\right] 6$$
$$\tilde{x} = 75.14$$
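The two steps of the solution can also be traced in code. The sketch below rebuilds the LCB and <cf columns from the class limits and then applies the median formula; the variable names are illustrative assumptions, and the lower class boundary is taken as the lower limit minus 0.5, as in the table.

```python
from itertools import accumulate

# Class limits and frequencies copied from the Stat 101 table.
classes = [(50, 55), (56, 61), (62, 67), (68, 73),
           (74, 79), (80, 85), (86, 91), (92, 97)]
frequencies = [10, 6, 8, 25, 22, 23, 12, 4]

lower_boundaries = [low - 0.5 for low, _ in classes]  # LCB column
cumulative = list(accumulate(frequencies))            # <cf column
class_size = classes[1][0] - classes[0][0]            # i = 6
n = cumulative[-1]                                    # 110

# Step 1: location of the median. Step 2: first class whose <cf exceeds it.
location = n / 2
md = next(k for k, cf in enumerate(cumulative) if cf > location)
cf_before = cumulative[md - 1] if md > 0 else 0       # <cf_p

median = lower_boundaries[md] + (location - cf_before) / frequencies[md] * class_size
print(classes[md], round(median, 2))
```

Run on this table, the sketch selects the 74 – 79 class and gives a median of about 75.14.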
c. Mode
Formula: $$\hat{x} = LCB_m + \left(\frac{d_1}{d_1 + d_2}\right) i$$
Where: $\hat{x}$ = Mode
$LCB_m$ = LCB of the modal class
$f_m$ = frequency of the modal class
$d_1$ = difference between the frequency of the modal class and the frequency of the class
before the modal class
$d_2$ = difference between the frequency of the modal class and the frequency of the class
after the modal class
$i$ = class interval/size
Example:
Final grades of Stat 101 students arranged in a frequency distribution table. Solve for the mode.
Solution:
1. Determine the modal class by identifying the class that contains the highest frequency.
Class      Frequency   LCB    <cf
50 – 55    10          49.5   10
56 – 61    6           55.5   16
62 – 67    8           61.5   24
68 – 73    25          67.5   49    (modal class)
74 – 79    22          73.5   71
80 – 85    23          79.5   94
86 – 91    12          85.5   106
92 – 97    4           91.5   110
N = 110
$$d_1 = 25 - 8 = 17 \qquad d_2 = 25 - 22 = 3$$
$$\hat{x} = 67.5 + \left(\frac{17}{17 + 3}\right) 6$$
$$\hat{x} = 72.60$$
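As a cross-check, the mode of grouped data can be computed with the standard interpolation LCB + [d1 / (d1 + d2)] i, where d1 and d2 are the differences between the modal frequency and the frequencies of the classes before and after it. The sketch below is illustrative only: the function name `grouped_mode` is hypothetical, and a missing neighbour at either end of the table is assumed to have frequency 0.

```python
def grouped_mode(lower_boundaries, frequencies, class_size):
    """Mode of grouped data: LCB_m + d1 / (d1 + d2) * i."""
    # Index of the first class with the highest frequency (the modal class).
    m = max(range(len(frequencies)), key=lambda k: frequencies[k])
    f_m = frequencies[m]
    f_before = frequencies[m - 1] if m > 0 else 0                    # assumed 0 at the edge
    f_after = frequencies[m + 1] if m + 1 < len(frequencies) else 0  # assumed 0 at the edge
    d1 = f_m - f_before
    d2 = f_m - f_after
    if d1 + d2 == 0:
        raise ValueError("modal class is not distinct from its neighbours")
    return lower_boundaries[m] + d1 / (d1 + d2) * class_size


# LCB and frequency columns copied from the Stat 101 table; class size i = 6.
lcb = [49.5, 55.5, 61.5, 67.5, 73.5, 79.5, 85.5, 91.5]
freq = [10, 6, 8, 25, 22, 23, 12, 4]
print(round(grouped_mode(lcb, freq, 6), 2))
```

With the table's values the modal class is 68 – 73, d1 = 25 − 8 = 17 and d2 = 25 − 22 = 3, so the estimate falls closer to the upper neighbour, which has the larger frequency.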
2. Complete the Frequency Distribution Table to find the mean, median and mode of the data set
given:
Class F CM (x) fx LCB <CF
10-19 3
20-29 1
30-39 3
40-49 2
50-59 9
60-69 8
70-79 35
80-89 30
90-99 9
Percentile (P)   10   20   25   30   40   50   60   70   75   80   90   100
Decile (D)        1    2         3    4    5    6    7         8    9    10
Quartile (Q)                1              2              3              4
a. Percentile – to compute for the $i^{th}$ percentile: $P_i$ is the value of the
$\left[\frac{i(n+1)}{100}\right]^{th}$ observation in the array.
Where: $P_i$ = percentile location
$i$ = percentile of interest
$n$ = number of observations
1. If the location is a whole number, the percentile is the observation at that position in the
ordered set of observations.
2. If the location is not a whole number, the percentile lies between the observation at the
whole-number position P and the observation at position P + 1. Take the difference between these
two observations, multiply it by the decimal portion of the location, and add the result to the
observation at position P.
Example:
Below is the list of the daily wages of 20 workers of XYZ Construction Company. Compute for P87.
200 200 265 285 290 300 300 315 330 350
375 450 450 500 550 550 600 615 630 650
Solution:
$$P_i = \left[\frac{i(n+1)}{100}\right]$$
$$P_{87} = \left[\frac{87(20+1)}{100}\right] = 18.27^{th} \text{ location}$$
$$P_{87} = 615 + 0.27(630 - 615)$$
$$P_{87} = 619.05 \text{ or } 619$$
b. Decile – to compute for the $i^{th}$ decile: $D_i$ is the value of the
$\left[\frac{i(n+1)}{10}\right]^{th}$ observation in the array.
Where: $D_i$ = decile location
$i$ = decile of interest
$n$ = number of observations
Example:
Below is the list of the daily wages of 20 workers of XYZ Construction Company. Compute for D7.
200 200 265 285 290 300 300 315 330 350
375 450 450 500 550 550 600 615 630 650
Solution:
$$D_i = \left[\frac{i(n+1)}{10}\right]$$
$$D_7 = \left[\frac{7(20+1)}{10}\right] = 14.70^{th} \text{ location}$$
$$D_7 = 500 + 0.7(550 - 500)$$
$$D_7 = 535$$
c. Quartile – to compute for the $i^{th}$ quartile: $Q_i$ is the value of the
$\left[\frac{i(n+1)}{4}\right]^{th}$ observation in the array.
Where: $Q_i$ = quartile location
$i$ = quartile of interest
$n$ = number of observations
Example:
Below is the list of the daily wages of 20 workers of XYZ Construction Company. Compute for Q3.
200 200 265 285 290 300 300 315 330 350
375 450 450 500 550 550 600 615 630 650
Solution:
$$Q_i = \left[\frac{i(n+1)}{4}\right]$$
$$Q_3 = \left[\frac{3(20+1)}{4}\right] = 15.75^{th} \text{ location}$$
$$Q_3 = 550 + 0.75(550 - 550)$$
$$Q_3 = 550$$
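The three location rules above differ only in the divisor (100, 10 or 4), so a single helper can cover them. The following is a minimal sketch; the name `fractile` and its parameters are assumptions, and clamping to the smallest or largest observation when the location falls outside the array is a choice made here, not something the text specifies.

```python
import math


def fractile(sorted_values, k, parts):
    """Value at location k(n + 1) / parts, interpolating between neighbours."""
    n = len(sorted_values)
    location = k * (n + 1) / parts
    position = math.floor(location)   # whole-number part of the location
    decimal = location - position     # decimal portion of the location
    if position < 1:                  # assumed: clamp to the smallest value
        return sorted_values[0]
    if position >= n:                 # assumed: clamp to the largest value
        return sorted_values[-1]
    lower = sorted_values[position - 1]   # observation at the position
    upper = sorted_values[position]       # next observation
    return lower + decimal * (upper - lower)


# Daily wages of the 20 workers, already arranged in an array.
wages = [200, 200, 265, 285, 290, 300, 300, 315, 330, 350,
         375, 450, 450, 500, 550, 550, 600, 615, 630, 650]

print(round(fractile(wages, 87, 100), 2))  # P87
print(round(fractile(wages, 7, 10), 2))    # D7
print(round(fractile(wages, 3, 4), 2))     # Q3
```

The three calls reproduce the worked values for P87, D7 and Q3 on this data set.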
Solution:
1. Determine the location of the first quartile by dividing the number of observations by 4.
$$\frac{n}{4} = \frac{110}{4} = 27.5$$
2. Go over the entries in the less than cumulative frequency column. The first class whose
cumulative frequency is greater than $\frac{n}{4}$ is the quartile 1 class.
Class      Frequency   LCB    <cf
50 – 55    10          49.5   10
56 – 61    6           55.5   16
62 – 67    8           61.5   24
68 – 73    25          67.5   49    (quartile 1 class)
74 – 79    22          73.5   71
80 – 85    23          79.5   94
86 – 91    12          85.5   106
92 – 97    4           91.5   110
N = 110
$$Q_1 = LCB_{Q_1} + \left[\frac{\frac{n}{4} - {<}cf_p}{f_{Q_1}}\right] i$$
$$Q_1 = 67.5 + \left[\frac{\frac{110}{4} - 24}{25}\right] 6$$
$$Q_1 = 68.34$$
b. Deciles
Formula: $$D_k = LCB_{D_k} + \left[\frac{\frac{kn}{10} - {<}cf_p}{f_{D_k}}\right] i$$
Where: $LCB_{D_k}$ = lower class boundary of the decile class
$n$ = number of observations
${<}cf_p$ = sum of the frequencies before the decile class
$f_{D_k}$ = frequency of the decile class
$i$ = class interval/size
Example:
Final grades of Stat 101 students arranged in a frequency distribution table. Solve for D8.
Solution:
1. Determine the location of the decile by computing $\frac{kn}{10}$.
$$\frac{kn}{10} = \frac{8 \times 110}{10} = 88$$
2. Go over the entries in the less than cumulative frequency column. The first class whose
cumulative frequency is greater than $\frac{kn}{10}$ is the decile 8 class.
Class      Frequency   LCB    <cf
50 – 55    10          49.5   10
56 – 61    6           55.5   16
62 – 67    8           61.5   24
68 – 73    25          67.5   49
74 – 79    22          73.5   71
80 – 85    23          79.5   94    (decile 8 class)
86 – 91    12          85.5   106
92 – 97    4           91.5   110
N = 110
$$D_8 = LCB_{D_8} + \left[\frac{\frac{kn}{10} - {<}cf_p}{f_{D_8}}\right] i$$
c. Percentile
Formula: $$P_k = LCB_{P_k} + \left[\frac{\frac{kn}{100} - {<}cf_p}{f_{P_k}}\right] i$$
Where: $LCB_{P_k}$ = lower class boundary of the percentile class
$n$ = number of observations
${<}cf_p$ = sum of the frequencies before the percentile class
$f_{P_k}$ = frequency of the percentile class
$i$ = class interval/size
Example:
Final grades of Stat 101 students arranged in a frequency distribution table. Solve for P57.
Solution:
1. Determine the location of the percentile by computing $\frac{kn}{100}$.
$$\frac{kn}{100} = \frac{57 \times 110}{100} = 62.7$$
2. Go over the entries in the less than cumulative frequency column. The first class whose
cumulative frequency is greater than $\frac{kn}{100}$ is the percentile 57 class.
Class      Frequency   LCB    <cf
50 – 55    10          49.5   10
56 – 61    6           55.5   16
62 – 67    8           61.5   24
68 – 73    25          67.5   49
74 – 79    22          73.5   71    (percentile 57 class)
80 – 85    23          79.5   94
86 – 91    12          85.5   106
92 – 97    4           91.5   110
N = 110
$$P_{57} = LCB_{P_{57}} + \left[\frac{\frac{kn}{100} - {<}cf_p}{f_{P_{57}}}\right] i$$
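The quartile, decile and percentile formulas for grouped data share one pattern: locate kn/parts in the <cf column, then interpolate inside that class. The sketch below expresses this as a single hypothetical helper, `grouped_fractile`; returning the upper boundary of the last class when the location equals n is an assumption made for completeness.

```python
def grouped_fractile(lower_boundaries, frequencies, class_size, k, parts):
    """Fractile of grouped data: LCB + ((k*n/parts - <cf_p) / f) * i."""
    n = sum(frequencies)
    location = k * n / parts
    cf_before = 0  # <cf_p: sum of the frequencies before the current class
    for lcb, f in zip(lower_boundaries, frequencies):
        if cf_before + f > location:  # first class whose <cf exceeds the location
            return lcb + (location - cf_before) / f * class_size
        cf_before += f
    # Location equals n (for example Q4, D10 or P100): assumed upper boundary.
    return lower_boundaries[-1] + class_size


# LCB and frequency columns copied from the Stat 101 table; class size i = 6.
lcb = [49.5, 55.5, 61.5, 67.5, 73.5, 79.5, 85.5, 91.5]
freq = [10, 6, 8, 25, 22, 23, 12, 4]

print(round(grouped_fractile(lcb, freq, 6, 1, 4), 2))     # Q1
print(round(grouped_fractile(lcb, freq, 6, 8, 10), 2))    # D8
print(round(grouped_fractile(lcb, freq, 6, 57, 100), 2))  # P57
```

The same call with k = 1 and parts = 2 gives the grouped median, since the median is the second quartile.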
Q3 P45
D3 P89
B. Complete the Frequency Distribution Table to find the Q3, D6 and P94 of the data set given:
Class F LCB <CF
10-19 3
20-29 1
30-39 3
40-49 2
50-59 9
60-69 8
70-79 35
80-89 30
90-99 9
Example:
Below is the list of the scores of two groups of students in a grammar quiz.
Group A Group B
13 10
14 10
15 15
16 18
19 18
20 19
25 26
30 36
Solution:
1. Compute the mean
$$\bar{x}_A = \frac{\sum x}{n} = \frac{152}{8} = 19 \qquad \bar{x}_B = \frac{\sum x}{n} = \frac{152}{8} = 19$$
2. Compute the deviations by subtracting the mean from each of the observations, and then
square the deviations.
Group A   $x - \bar{x}$   $(x - \bar{x})^2$      Group B   $x - \bar{x}$   $(x - \bar{x})^2$
13        -6              36                     10        -9              81
14        -5              25                     10        -9              81
15        -4              16                     15        -4              16
16        -3              9                      18        -1              1
19        0               0                      18        -1              1
20        1               1                      19        0               0
25        6               36                     26        7               49
30        11              121                    36        17              289
Sum                       244                    Sum                       518
3. Take the sum of the squared deviations, divide the sum by n – 1 to get the sample variance,
then take the square root of the sample variance.
$$s_A = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{\frac{244}{8 - 1}} = 5.90 \qquad s_B = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{\frac{518}{8 - 1}} = 8.60$$
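The three steps can be checked with a short script. This sketch follows the steps literally (mean, squared deviations, division by n − 1) and compares the result with `statistics.stdev` from Python's standard library; the helper name `sample_sd` is an assumption.

```python
import statistics

group_a = [13, 14, 15, 16, 19, 20, 25, 30]
group_b = [10, 10, 15, 18, 18, 19, 26, 36]


def sample_sd(values):
    """Sample standard deviation: sqrt(sum((x - mean)^2) / (n - 1))."""
    n = len(values)
    mean = sum(values) / n                               # step 1
    squared_deviations = [(x - mean) ** 2 for x in values]  # step 2
    variance = sum(squared_deviations) / (n - 1)         # step 3
    return variance ** 0.5


for name, scores in (("A", group_a), ("B", group_b)):
    print(name, round(sample_sd(scores), 2), round(statistics.stdev(scores), 2))
```

Both groups have the same mean of 19, yet Group B's larger standard deviation shows that its scores are more spread out.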
Example:
Final grades of students in Stat 101 arranged in FDT. Solve for the Standard deviation.
Class      Frequency   CM (x)   $fx$   $x - \bar{x}$   $(x - \bar{x})^2$   $f(x - \bar{x})^2$
50 – 55    10
56 – 61    6
62 – 67    8
68 – 73    25
74 – 79    22
80 – 85    23
86 – 91    12
92 – 97    4
N = 110
$$s = \sqrt{\frac{\sum f(x - \bar{x})^2}{n - 1}} \qquad MD = \frac{\sum f|x - \bar{x}|}{N}$$
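The blank columns of the table can be filled in programmatically. The sketch below is one possible way to do it, assuming the class mark (CM) is the midpoint of the class limits; the variable names are illustrative and not from the module.

```python
from math import sqrt

# Class limits and frequencies copied from the Stat 101 table.
classes = [(50, 55), (56, 61), (62, 67), (68, 73),
           (74, 79), (80, 85), (86, 91), (92, 97)]
frequencies = [10, 6, 8, 25, 22, 23, 12, 4]

class_marks = [(low + high) / 2 for low, high in classes]  # CM (x), assumed midpoint
n = sum(frequencies)

# Mean of grouped data: sum of f*x divided by n.
mean = sum(f * x for f, x in zip(frequencies, class_marks)) / n

# Standard deviation: sqrt(sum of f*(x - mean)^2 / (n - 1)).
sum_f_sq_dev = sum(f * (x - mean) ** 2 for f, x in zip(frequencies, class_marks))
standard_deviation = sqrt(sum_f_sq_dev / (n - 1))

# Mean deviation: sum of f*|x - mean| / N.
mean_deviation = sum(f * abs(x - mean) for f, x in zip(frequencies, class_marks)) / n

print(round(mean, 2), round(standard_deviation, 2), round(mean_deviation, 2))
```

Each intermediate list corresponds to one column of the table, so printing them row by row is a quick way to verify a hand-completed table.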
Complete the Frequency Distribution Table to find the standard deviation of the data set given:
Class F CM (x) 𝑓𝑥 𝑥 − 𝑥̅ (𝑥 − 𝑥̅ )2 𝑓(𝑥 − 𝑥̅ )2
10-19 3
20-29 1
30-39 3
40-49 2
50-59 9
60-69 8
70-79 35
80-89 30
90-99 9
a. The mean, median, and mode are all equal and are located at the center of the distribution.
b. The distribution is symmetric. The distribution depicts a bell-shaped curve where the left area is a
mirror image of the right area.
c. The total area under the normal curve is 1 or 100%.
d. The distribution is asymptotic.
e. The location of the distribution is determined by the mean, and the standard deviation determines
the dispersion of the distribution.
[Figure: normal curve with the horizontal axis marked at μ − 3σ, μ − 2σ, μ − 1σ, μ, μ + 1σ, μ + 2σ, μ + 3σ]
The mean and the standard deviation determine the shape of the distribution.
As previously stated, there are infinitely many normal curves, one for each combination of mean and
standard deviation. This may suggest that we need a different table of areas for each particular mean and
standard deviation, but we do not. Instead, we standardize a given observation. The
standardized score may also be termed the Z-value, Z statistic, standard deviate, standard normal value or
just normal value. The formula is shown below.
$$Z = \frac{x - \mu}{\sigma}$$
Where: $Z$ = normal value
$x$ = value of any particular observation
$\mu$ = mean of the distribution
$\sigma$ = standard deviation of the distribution
Z-values and Rules
1. One z-value is positive and the other is negative.
Rule: Add the areas of the corresponding z-values.
2. Both z-values are positive or both z-values are negative.
Rule: In either case, subtract the smaller area from the bigger area.
3. To the right of a positive z-value or to the left of a negative z-value.
Rule: Subtract the area from 0.5.
4. To the right of a negative z-value or to the left of a positive z-value.
Rule: Add the area to 0.5.
Examples:
Find the area under the normal distribution curve of the following z values:
1. 0 < z < 1.63 5. z > 1.63
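Areas that are normally read from a z-table can also be obtained from the standard normal cumulative distribution. The sketch below uses `statistics.NormalDist` from Python's standard library; the helper names `z_score`, `area_between` and `area_right` are assumptions introduced for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1


def z_score(x, mean, sd):
    """Standardize an observation: Z = (x - mean) / sd."""
    return (x - mean) / sd


def area_between(z_low, z_high):
    """Area under the standard normal curve between two z-values."""
    return standard_normal.cdf(z_high) - standard_normal.cdf(z_low)


def area_right(z):
    """Area under the standard normal curve to the right of z."""
    return 1 - standard_normal.cdf(z)


print(round(area_between(0, 1.63), 4))  # item 1: 0 < z < 1.63
print(round(area_right(1.63), 4))       # item 5: z > 1.63
```

The two areas add up to 0.5, which matches rule 3 above: the area to the right of a positive z-value is 0.5 minus the tabled area between 0 and z.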
2. Two-tailed test – a test where the areas of rejection are both sides of the distribution. The two-tailed
test is used if the alternate hypothesis is non-directional.
Example:
A test was administered to two groups of students – the HRM student group and the tourism
student group. At the 0.05 significance level, is there a difference between the scores obtained by the
two groups of students?
H0: There is no significant difference between the scores obtained by the two groups of students.
Ha: There is a significant difference between the scores obtained by the two groups of students.
LEVEL OF SIGNIFICANCE
It is the probability of rejecting a true null hypothesis.
If the null hypothesis is true and is rejected, it is called TYPE I ERROR. And if the null hypothesis is
false and is accepted, it is called TYPE II ERROR.
                     Decision
Null Hypothesis      Reject H0           Accept H0
H0 is true           Type I Error        Correct Decision
H0 is false          Correct Decision    Type II Error
CRITICAL VALUE
The value that divides the area of rejection and the area of acceptance.
[Figure: two-tailed test showing the region of acceptance between the critical values -1.701 and 1.701,
with a region of rejection in each tail]
STEPS IN HYPOTHESIS TESTING
1. State the null hypothesis (H0) and the alternative hypothesis (Ha).
2. Set the desired level of significance.
3. Determine the appropriate test statistic and establish the critical region.
4. Compute the test statistic as a basis for decision.
5. Formulate the decision.
Examples:
1. The soft drink dispenser of a fast food center was just readjusted. The manager, wanting to know if
the dispenser is really in good condition, got a sample of 50 cups filled by the dispenser. She would
only classify the dispenser as “in good condition” (and therefore not in need of readjustment) if
the average fill per cup of the dispenser is 8 ounces.
Solution:
Variable: The variable that will represent the information is –
X = fill per cup of the dispenser.
2. Jenny suspects that male CvSU-CCC students spend less time studying compared to their female
counterparts. She decided to conduct a study on the time that male and female
CvSU-CCC students spend doing their school work.
Solution:
Variable: The variable that will represent the information is –
X = time spent by male CvSU-CCC student in doing school work.
Y = time spent by female CvSU-CCC student in doing school work
Hypothesis: Ho: μx = μy (The average time spent by male CvSU-CCC students in doing
school work is the same as that of the female CvSU-CCC students.)
Ha: μx < μy (The average time spent by male CvSU-CCC students in doing
school work is less than that of the female CvSU-CCC students.)
TEST OF DIFFERENCE
1. Z – Test of One Population Mean
FUNCTION: Parametric. Used to determine if a given sample mean was drawn from the
population with known parameters.
LEVEL OF MEASUREMENT: Interval/Ratio
SAMPLE DATA: SATT Scores, Average, Ratings, IQ, Budget, Gross Income
RESEARCH PROBLEM: Does the group of teenagers in Makati represent Metro Manila teenagers?;
Is there enough evidence to contradict the rental company’s claim that the mean time to
rent a car on their website is 60 seconds if the mean rental time of a random sample of 36
customers was 75 seconds?; Is there a significant difference between the mean score of the
2018 LET passers from CvSU and the mean score of all LET passers of CvSU?
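The five steps in hypothesis testing can be illustrated with the car rental problem above. The sketch below assumes the usual one-sample statistic z = (x̄ − μ) / (σ / √n) and a two-tailed test at the 0.05 level. The population standard deviation of 30 seconds is a hypothetical value chosen only for illustration, since the problem does not state it, and the function name `one_sample_z_test` is likewise an assumption.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z_test(sample_mean, claimed_mean, population_sd, n, alpha=0.05):
    """Two-tailed z-test of one population mean.

    Returns the computed z, the positive critical value, and the decision.
    """
    # Step 4: compute the test statistic.
    z = (sample_mean - claimed_mean) / (population_sd / sqrt(n))
    # Step 3: critical value that leaves alpha/2 in each tail.
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    # Step 5: reject H0 when z falls in either region of rejection.
    reject_h0 = abs(z) > critical
    return z, critical, reject_h0


# Step 1: H0: mean = 60 seconds; Ha: mean != 60 seconds.
# Step 2: alpha = 0.05.
# population_sd=30 is HYPOTHETICAL; the problem statement does not give it.
z, critical, reject_h0 = one_sample_z_test(
    sample_mean=75, claimed_mean=60, population_sd=30, n=36, alpha=0.05
)
print(round(z, 2), round(critical, 2), reject_h0)
```

Under the assumed standard deviation, the computed z exceeds the critical value, so the null hypothesis would be rejected; a different standard deviation could change that decision.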
RESEARCH PROBLEM: Is there a significant difference between the number of students who are in favor of
Duterte’s war on drugs before and after the forum?; Is there a significant difference between
voters’ choice of candidate before and after the political debate?