Data Management
Our target learning outcomes are to a) use a variety of statistical tools to process and
manage numerical data; b) use the methods of linear regression and correlation to predict
the value of a variable given certain conditions; and c) advocate the use of statistical data in
making important decisions.
What is Statistics?
Statistics is the science of collecting, organizing, presenting, analyzing, and
interpreting data to assist in making more effective decisions.
A. Divisions of Statistics
1. Descriptive Statistics. It deals with the methods of organizing, summarizing, and
presenting a mass of data to yield meaningful information. It includes anything done
to the data to summarize or describe them, without any attempt to make
inferences or draw conclusions beyond the gathered data.
Activities:
• Collect data; e.g., Survey
• Present data; e.g., Tables and graphs
• Summarize data; e.g., Sample mean
2. Inferential Statistics. It is concerned with generalizing about a population or other
groups of data based on the study of the sample. It comprises those methods
concerned with the analysis of a subset of data leading to predictions or inferences
about the entire set of data.
Activities:
• Estimation; e.g., Estimate the population mean weight using the
sample mean weight
• Hypothesis testing; e.g., Test the claim that the population mean
weight is 70 kg
Property of and for the exclusive use of SLU. Reproduction, storing in a retrieval system, distributing, uploading or posting online, or transmitting in any form or by any 1
means, electronic, mechanical, photocopying, recording, or otherwise of any part of this document, without the prior written permission of SLU, is strictly prohibited.
B. Population and Sample
1. Population. It consists of the totality of the observations with which we are concerned.
It refers to the total group of people, objects, or reactions that can be
described as having a unique quality or combination of qualities. A population can be either
finite or infinite.
• A parameter is any numerical value describing a characteristic of a population. It is
usually represented by Greek letters.
Examples:
• If we consider all math classes to be the population, then the
average number of points earned per student over all the math
classes is an example of a parameter.
• There are 35,000 students enrolled in a university and 15% of
them are enrolled in math. The figure of 15% is a parameter
because it is based on the entire population of all enrolled
students.
2. Sample. It refers to a finite number of objects selected from the population. It is a
collection of some elements of a population, ideally chosen to be representative of the entire
population.
• A statistic is any numerical value describing a characteristic of a sample. It is
usually represented by ordinary letters of the English alphabet.
Examples:
• If we consider one math class to be a sample of the population
of all math classes, then the average number of points earned
by students in that one math class at the end of the term is an
example of a statistic. The statistic is an estimate of a population
parameter, in this case the mean.
• An institution polled 2.3 million adults in the Philippines and 80%
said that they would vote in the presidential election. That figure of 80%
is a statistic because it is based on a sample, not the entire
population of all adults in the Philippines.
C. Sample Size Determination
Estimating a Population Mean:
$$ n \ge \left( \frac{Z_{\alpha/2}\, s}{e} \right)^2 $$
Estimating a Population Proportion:
$$ n \ge \frac{(Z_{\alpha/2})^2\, p q}{e^2} $$
where $n$ is the sample size, $Z_{\alpha/2}$ is the two-tailed z-score corresponding to the level of
significance $\alpha$, $s$ is the known standard deviation, $e$ is the margin of error, $p$ is the past
estimate of the population proportion, and $q = 1 - p$.
NOTE
a. The level of significance, $\alpha$, can take any of the standard values, namely 0.01,
0.05, and 0.10. Theoretically, the level of significance is the probability of a
Type I error in hypothesis testing.
b. The following table presents the values of $Z_{\alpha/2}$ corresponding to the standard
values of $\alpha$:
$\alpha$      $Z_{\alpha/2}$
0.01     2.575
0.05     1.96
0.10     1.645
c. The standard deviation, s, can be estimated from a pilot data set or the value
can be adopted from a previous study that considered the same or similar
population.
d. In the same manner as s, p can be the past estimate of the population
proportion or can be computed from a pilot data set.
2. Yamane’s Formula (Simplified Formula for Proportions)
If the behavior of the population is not certain or the researcher is not familiar
with the population’s behavior, Yaro Yamen’s formula (1980) or Taro Yamane’s
formula (1967) may be used. The formula is:
$$ n = \frac{N}{1 + N e^2} $$
where $N$ is the population size and $e$ is the margin of error.
Example 78: 41% of Jacksonville residents said that they had been in a hurricane. How
many adults should be surveyed to estimate the true proportion of adults who have been
in a hurricane, with 95% confidence and a 3% margin of error?
Solution:
41% is a past estimate of the population proportion, and the population size is unknown. Hence,
we use the following formula.
$\alpha = 0.05$
$p = 0.41$
$q = 1 - 0.41 = 0.59$
$Z_{\alpha/2} = 1.96$
$e = 0.03$
$$ n \ge \frac{(Z_{\alpha/2})^2\, p q}{e^2} $$
$$ n \ge \frac{(1.96)^2 (0.41)(0.59)}{(0.03)^2} $$
$$ n \ge 1{,}032.54 \approx 1{,}033 $$
Example 79: From a population of 10,000 individuals in a certain town, what sample size
is needed in order to get an accurate result for a certain study using a margin of error of
3%?
Solution:
$$ n = \frac{N}{1 + N e^2} $$
$$ n = \frac{10{,}000}{1 + (10{,}000)(0.03)^2} $$
$$ n = 1{,}000 $$
Hence, the sample size needed in order to get an accurate result for the
study using a margin of error of 3% is 1,000 individuals.
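The two sample-size computations in Examples 78 and 79 can be sketched in Python as follows. This is a minimal illustration, not part of the original module: the function names are hypothetical, and the small rounding step before `math.ceil` is an added assumption that guards against floating-point noise (for instance, a value such as 1000.0000000000002 being pushed up to 1001).

```python
import math


def _round_up(x):
    # Round up to the next whole number; round(x, 6) first removes
    # floating-point noise so that an exact 1000 stays 1000.
    return math.ceil(round(x, 6))


def sample_size_proportion(z, p, e):
    """Minimum n for estimating a proportion: n >= z^2 * p * q / e^2."""
    q = 1 - p
    return _round_up(z ** 2 * p * q / e ** 2)


def sample_size_yamane(N, e):
    """Yamane's simplified formula: n = N / (1 + N * e^2)."""
    return _round_up(N / (1 + N * e ** 2))


print(sample_size_proportion(1.96, 0.41, 0.03))  # Example 78
print(sample_size_yamane(10_000, 0.03))          # Example 79
```

Both functions round up, in line with the usual practice of never taking fewer observations than the formula requires.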
D. Sampling Techniques
Sampling is the process of selecting units, like people, organizations, or objects from
a population of interest in order to study and fairly generalize the results back to the
population from which the sample was taken. The two types of sampling are
1. Random Sampling Techniques
Members from the population are selected in such a way that each individual
member in the population has an equal chance of being selected.
a. Simple Random Sampling. Every case in the population being sampled has an equal
chance of being chosen. It is an equal probability of selection method (EPSEM).
Basic Steps:
1. Make a list of the population units and number them from 1 to N,
where N is the population size.
2. Select n random numbers from 1 to N using some random process.
3. Employ any of the following selection procedures:
• Draw lots
• Lottery
• Usage of gadgets like the calculator or computer to generate
Random Numbers
• Table of Random Numbers
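As a rough sketch of the "computer-generated random numbers" option, the steps above can be written in Python with the standard `random` module. The function name `simple_random_sample` and the `seed` parameter are illustrative assumptions; the seed is only there so that a draw can be reproduced.

```python
import random


def simple_random_sample(N, n, seed=None):
    """Draw n distinct unit numbers from 1..N, each unit equally likely."""
    rng = random.Random(seed)
    # random.sample draws without replacement, so no unit is picked twice.
    return sorted(rng.sample(range(1, N + 1), n))


# Example: choose 10 units from a numbered list of 100 population units.
chosen = simple_random_sample(100, 10, seed=1)
print(chosen)
```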
b. Systematic Random Sampling. We select some starting point randomly and then
select every kth (such as every 50th) element in the population until the desired
sample size is achieved.
Basic Steps:
1. Construct the sampling frame.
2. Determine the sample size.
3. Determine the sample interval, k: $k = \frac{N}{n}$
4. Identify the random start, r, using SRS: $1 \le r \le k$
5. Commencing on the random start, select every kth item until the
desired sample size is reached.
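The systematic sampling steps can be sketched as below. This is an illustration under stated assumptions: the function name is hypothetical, the units are assumed to be numbered 1 to N, and when N/n is not a whole number the interval is taken as its integer part, which is one common convention rather than something the steps above prescribe.

```python
import random


def systematic_sample(N, n, seed=None):
    """Pick a random start r in 1..k, then take every k-th unit."""
    k = N // n                 # sampling interval (integer part of N/n)
    rng = random.Random(seed)
    r = rng.randint(1, k)      # random start, 1 <= r <= k (inclusive)
    return [r + j * k for j in range(n)]


# Example: 20 units from a frame of 1,000 units, so k = 50.
print(systematic_sample(1000, 20, seed=3))
```

Because the largest selected number is at most k + (n - 1)k = nk, the selection never runs past unit N.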
c. Stratified Random Sampling. We subdivide the population into at least two different
subgroups (or strata) so that subjects within the same subgroup share the same
characteristics (such as gender or age bracket), then we draw a sample from each
subgroup (or stratum).
We use Proportional Allocation to draw a sample from each stratum to
reach the desired sample size.
Example 80: Suppose a school has five departments composed of the following number
of students. Determine the number of students to be part of the sample when the
researcher needs 363 respondents.
Solution:
Department 𝑁ℎ 𝑛ℎ
Business Administration (BA) 1,500 140
Management(M) 1,200 112
Finance(F) 850 80
Entrepreneurship(E) 200 19
Culinary Arts(CA) 150 14
Total 3,900
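Each stratum sample size is computed as $n_h = \frac{N_h}{N}(n)$ and rounded up to the next whole number:
$n_{BA} = \frac{1{,}500}{3{,}900}(363) = 139.62 \approx 140$
$n_{M} = \frac{1{,}200}{3{,}900}(363) = 111.69 \approx 112$
$n_{F} = \frac{850}{3{,}900}(363) = 79.12 \approx 80$
$n_{E} = \frac{200}{3{,}900}(363) = 18.62 \approx 19$
$n_{CA} = \frac{150}{3{,}900}(363) = 13.96 \approx 14$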
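The proportional allocation in Example 80 can be reproduced with a short Python sketch. The function name and the dictionary keys are illustrative; the ceiling is computed with integer arithmetic so that no floating-point rounding is involved.

```python
def proportional_allocation(strata_sizes, n):
    """Allocate n respondents across strata in proportion to stratum size.

    Each share N_h * n / N is rounded up to the next whole number.
    """
    N = sum(strata_sizes.values())
    # -(-a // b) is integer ceiling division for positive a and b.
    return {name: -(-N_h * n // N) for name, N_h in strata_sizes.items()}


departments = {"BA": 1500, "M": 1200, "F": 850, "E": 200, "CA": 150}
allocation = proportional_allocation(departments, 363)
print(allocation)
print(sum(allocation.values()))
```

Note that rounding every stratum up makes the allocated total (365) slightly larger than the 363 respondents originally targeted.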
Sample Size Round-off
Rule: When the calculated sample size is not a whole number, it should be
rounded up to the next higher whole number. Rounding up is the conservative
choice: it guarantees that the sample is never smaller than the minimum size
required by the calculation. For instance, if a sample size calculation determined that 2006.083 data
points were necessary to represent the population, then 2007 data points
should be taken.
d. Cluster Random Sampling. Divide the population into sections (or clusters), then
randomly select some of those clusters, and then choose all members from those
selected clusters.
e. Multi-Stage Sampling. This method uses several stages or phases in getting random
samples from the general population.
It is commonly used if the research is national in scope.
• We divide the country into regions
• Regions into municipalities and cities
• Municipalities and cities into barangays
• Barangays into sitios or sections
2. Expert Sampling. It involves the assembling of a sample of persons with known
or demonstrable experience and expertise in some area.
3. Quota Sampling. It involves selecting items nonrandomly according to some fixed
quota.
4. Snowball Sampling. Begin by identifying someone who meets the criteria for
inclusion in your study. You ask them to recommend others who they may know
who also meet the criteria.
E. Statistical Data
Statistical data are the raw materials of research or of any statistical investigation. They are usually obtained by
counting or measuring items. Data are categorized as follows:
1. According to Description:
a) Qualitative (Categorical) Data are generally described by words or letters. They are not
as widely used as quantitative data because many numerical techniques do not
apply to qualitative data. For example, it does not make sense to find an average
hair color or an average of similar attributes of the population.
• The gender (male, female) of survey respondents
• The numbers 24, 28, 17, 54, and 31 sewn on the shirts of the
basketball team are categorical data. These numbers are
substitutes for names. They do not count or measure anything.
Qualitative data can be separated into two subgroups:
1. Dichotomic takes the form of a word with two options, such as gender -
male or female.
2. Polynomic takes the form of a word with more than two options, such as
education - primary school, secondary school and university.
b) Quantitative (Numerical) Data are always numbers and are the result of counting
or measuring attributes of a population.
• The ages (in years) of survey respondents
• distance traveled
• number of children in a family,
Quantitative data can be separated into two subgroups:
1. Discrete is the result of counting. It is expressed as whole numbers and is
always exact.
• The numbers of eggs that hens lay are discrete data because
they represent counts.
• The number of students of a given ethnic group in a class.
• The number of books on a shelf.
2. Continuous is the result of measuring. It is not necessarily whole numbers.
• The amounts of milk from cows are continuous data because
they are measurements that can assume any value over a
continuous span. During a year,
a cow might yield an amount of milk that can be any value
between 0 and 7000 liters. It would be possible to get
5678.1234 liters because the cow is not restricted to the
discrete amounts of 0, 1, 2, . . . , 7000 liters.
• distance traveled
• weight of luggage
2. According to Source:
a. Primary data refers to the information which is gathered directly from an original
source or which are based on direct or first- hand experience using methods like
surveys, interviews, or experiments.
b. Secondary data refers to the information taken from published or unpublished
materials that have been previously gathered by other individuals, researchers, or
agencies.
• Distances (in km) traveled by cars (0 km represents no distance traveled, and
400 km is twice as far as 200 km.)
• Prices of books (P0.00 represents no cost, and a P300.00 book costs
twice as much as a P150.00 book.)
• height
• weight
4. Experiment Method - this method is used when the objective is to determine the
cause-and-effect relationship of certain phenomena under controlled conditions. It
is usually used by scientific researchers.
5. Registration Method – this method of gathering information is enforced by law.
• registration of births
• deaths
• vehicles
• licenses
Positive:
1. Information is kept systematized.
2. Information is always made available to the public.
2. Tabular Presentation – This is a systematic way of categorizing related data in rows
and columns. This methodical arrangement, called a statistical table, presents data in a
more concise form and in greater detail than textual or graphical presentation.
2. Frequency Polygon. One type of statistical graph involves the class midpoints.
A frequency polygon uses line segments connected to points located directly
above class midpoint values. A variation of the basic frequency polygon is the
relative frequency polygon, which uses relative frequencies (proportions or
percentages) for the vertical scale. When trying to compare two data sets, it
is often very helpful to graph two relative frequency polygons on the same
axes.
3. Ogive. Another type of statistical graph called an ogive (pronounced “oh-
jive”) involves cumulative frequencies. Ogives are useful for determining the
number of values below some particular value, as illustrated in Example 3. An
ogive is a line graph that depicts cumulative frequencies. An ogive uses class
boundaries along the horizontal scale, and cumulative frequencies along the
vertical scale.
4. Pie chart is a graph that depicts qualitative data as slices of a circle, in which
the size of each slice is proportional to the frequency count for the category.
5. A stemplot (or stem-and-leaf plot) represents quantitative data by separating
each value into two parts: the stem (such as the leftmost digit) and the leaf
(such as the rightmost digit).
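A stemplot for two-digit whole numbers can be sketched in a few lines of Python. The function name and the sample values are hypothetical and chosen only for illustration; the sketch assumes non-negative integers whose last digit serves as the leaf.

```python
from collections import defaultdict


def stemplot(values):
    """Return stem-and-leaf rows, using the last digit as the leaf."""
    stems = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)   # e.g. 24 -> stem 2, leaf 4
        stems[stem].append(leaf)
    return [
        f"{stem} | {' '.join(str(leaf) for leaf in leaves)}"
        for stem, leaves in sorted(stems.items())
    ]


# Hypothetical data set used only to show the layout.
for row in stemplot([12, 15, 21, 24, 24, 30, 37]):
    print(row)
```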
Measures of central tendency are numerical values that tend to locate, in some sense, the middle of a set of data
when arranged in increasing or decreasing order. The term average is often
associated with these measures: mean, median, mode, and midrange.
1. Mean ($\mu$ or $\bar{x}$)
a. Arithmetic Mean. It is obtained by adding all the observations and dividing the sum
by the number of observations; thus it is called the computational average.
1. Population Mean: If $x_1, x_2, \ldots, x_N$ represent a finite population of size N,
the population mean is given by
$$ \mu = \frac{\sum_{i=1}^{N} x_i}{N} $$
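Example 81: Suppose you chose ten people who entered the campus and whose ages
are as follows: 15, 25, 18, 20, 25, 18, 18, 20, 20, 25. What is the mean age of this sample?
Solution:
$$ \bar{x} = \frac{15 + 25 + 18 + 20 + 25 + 18 + 18 + 20 + 20 + 25}{10} = \frac{204}{10} = 20.4 $$
The mean age of the sample is 20.4 years.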
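The arithmetic mean for the ages in Example 81 can be checked with a short Python sketch: add the observations and divide by how many there are. The variable names are illustrative.

```python
ages = [15, 25, 18, 20, 25, 18, 18, 20, 20, 25]

# Arithmetic mean: sum of the observations divided by their count.
mean_age = sum(ages) / len(ages)
print(sum(ages), len(ages), mean_age)
```

Here the sum is 204 and there are 10 observations, so the sample mean age is 20.4 years.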
b. Weighted Mean. If the data values $x_1, x_2, \ldots, x_k$ have assigned weights $w_1, w_2, \ldots, w_k$,
respectively, the mean is given by
$$ \bar{x}_w = \frac{\sum_{i=1}^{k} w_i x_i}{\sum_{i=1}^{k} w_i} $$
Example 82: A student was taking 5 subjects last semester. Find his average if his final
grades were as follows:
Solution:
Characteristics of Mean
1. It is appropriate for interval and ratio measurements.
2. All the scores or measurements are considered in the computation of the mean.
3. Very high or very low scores or measurements affect the mean.
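The weighted mean described above (each value multiplied by its weight, with the total divided by the sum of the weights) can be sketched as follows. The grades and unit weights are hypothetical values chosen for illustration; they are not the data of Example 82.

```python
def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


# Hypothetical final grades and the number of units of each subject.
grades = [90, 85, 88]
units = [3, 2, 5]
print(weighted_mean(grades, units))
```

With these made-up numbers the weighted total is 270 + 170 + 440 = 880 over 10 units, so the weighted mean is 88.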
2. Mode 𝝁̂ or 𝒙̂
It is the value in the distribution with the highest frequency. It locates the point
where the observation values occur with the greatest density. It can be used for
quantitative as well as qualitative data.
A data set can have one mode, more than one mode, or no mode.
• When two data values occur with the same greatest frequency, each
one is a mode and the data set is bimodal.
• When more than two data values occur with the same greatest
frequency, each is a mode and the data set is said to be multimodal.
• When no data value is repeated, we say that there is no mode.
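The three cases above can be sketched with `collections.Counter`. The function name `modes` is illustrative, and returning an empty list for "no mode" is a design choice made for this sketch.

```python
from collections import Counter


def modes(data):
    """Return all values tied for the highest frequency (empty if none repeat)."""
    if not data:
        return []
    counts = Counter(data)
    top = max(counts.values())
    if top == 1:
        return []          # no value is repeated, so there is no mode
    return sorted(v for v, c in counts.items() if c == top)


print(modes([1, 2, 2, 3, 3]))                            # bimodal
print(modes([15, 25, 18, 20, 25, 18, 18, 20, 20, 25]))   # multimodal
print(modes([1, 2, 3]))                                  # no mode
```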
Characteristics of Mode
1. It is very easy to compute but is seldom used because it is very unstable.
2. It is used when a rough or quick estimate of a central value is wanted.
3. It is most appropriate for nominal-scale data as a measure of popularity.
3. Median
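It is a value that divides the distribution into two equal parts (after arranging the
values in ascending or descending order). As such, it is a positional average. The median
is defined by
$$ \text{Median} = x_{\frac{n+1}{2}} \ \text{if } n \text{ is odd}, \qquad \text{Median} = \frac{x_{\frac{n}{2}} + x_{\frac{n}{2}+1}}{2} \ \text{if } n \text{ is even}, $$
where $x_j$ is the $j$th value in the ordered data and $n$ is the number of observations.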
Example 84: During the first marking period, Nicole's math quiz scores were 90, 92, 93, 88,
95, 88, 97, 87, and 98. What was the median quiz score?
Solution:
Ordering the data from least to greatest, we get:
87, 88, 88, 90, 92, 93, 95, 97, 98
Since n = 9 (odd), the median is the middle value, which is the 5th score in the ordered list.
The median quiz score is 92. (Four quiz scores were higher than 92 and four
were lower.)
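Example 85: The ages of 10 college students are listed below. Find the median.
18, 24, 20, 35, 19, 23, 26, 23, 19, 20
Solution:
Ordering the data from least to greatest, we get:
18, 19, 19, 20, 20, 23, 23, 24, 26, 35
Since n = 10 (even), the median is the average of the 5th and 6th values: (20 + 23)/2 = 21.5.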
Characteristics of Median
1. It is appropriate for ordinal or ranked measurements.
2. Only the middle scores or measurements are considered in the computation of the
median.
3. Very high or very low scores do not affect the median.
4. It is used when there are extreme cases, that is, when the distribution is markedly skewed.
5. It is used when we desire to know whether the cases fall within the upper half or the lower
half of a distribution.
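The median computations in Examples 84 and 85 can be sketched in Python as follows. The function name is illustrative; the logic sorts the data, then takes the middle value when n is odd or the average of the two middle values when n is even.

```python
def median(data):
    """Middle value of the sorted data (average of the two middle values if n is even)."""
    x = sorted(data)
    n = len(x)
    mid = n // 2
    if n % 2 == 1:
        return x[mid]                  # odd n: the single middle value
    return (x[mid - 1] + x[mid]) / 2   # even n: average the two middle values


quiz_scores = [90, 92, 93, 88, 95, 88, 97, 87, 98]      # Example 84
ages = [18, 24, 20, 35, 19, 23, 26, 23, 19, 20]         # Example 85
print(median(quiz_scores))
print(median(ages))
```

The standard library's `statistics.median` follows the same rule and can be used as a cross-check.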
Measures of relative location, also known as quantiles or fractiles, are values below which a specific
fraction or percentage of the observations in a given data set must fall. These are the
percentiles, deciles, and quartiles.
1. The Percentiles
Percentiles are values that divide a set of observations into 100 equal parts. These
values, denoted by 𝑃1 , 𝑃2 , … , 𝑃99 , are such that 1% of the data falls below 𝑃1 , 2% falls below
𝑃2 , …, and 99% falls below 𝑃99 . The 𝑘th percentile, 𝑃𝑘 (𝑘 = 1, 2, 3, … ,99), can be determined
using the following procedure:
a. Arrange the data in increasing order and compute the value of the index
$i = \left(\frac{k}{100}\right) n$, where $n$ is the number of observations.
b. If $i$ is an integer, $P_k = \frac{x_i + x_{i+1}}{2}$. If $i$ is not an integer, use the rounded-up value for $i$
and take $P_k = x_i$. Note that $x_i$ here pertains to the $i$th score in the ordered data set.
Example 86. As part of a quality-control study aimed at improving a production line, the
weights (in ounces) of 50 bars of soap are measured. The results are as follows, sorted from
smallest to largest. Find the 43rd percentile and the 10th percentile.
11.6 12.6 12.7 12.8 13.1 13.3 13.6 13.7 13.8 14.1
14.3 14.3 14.6 14.8 15.1 15.2 15.6 15.6 15.7 15.8
15.8 15.9 15.9 16.1 16.2 16.2 16.3 16.4 16.5 16.5
16.5 16.6 17.0 17.1 17.3 17.3 17.4 17.4 17.4 17.6
17.7 18.1 18.3 18.3 18.3 18.5 18.5 18.8 19.2 20.3
Solution
a. 43rd percentile, $P_{43}$
We compute the index $i$. Note that $k = 43$ and $n = 50$.
Then $i = \left(\frac{43}{100}\right) 50 = 21.5 \approx 22$ (round up).
From the data set, $x_{22} = 15.9$.
Hence, we have $P_{43} = x_{22} = 15.9$.
This means that 43% of the values lie below 15.9.
b. 10th percentile, $P_{10}$
Notice that the data are already arranged in increasing order. We compute the
index $i$. Note that $k = 10$ and $n = 50$, so $i = \left(\frac{10}{100}\right) 50 = 5$.
Since $i$ is an integer, $P_{10} = \frac{x_5 + x_6}{2}$. From the data set, $x_5 = 13.1$ and $x_6 = 13.3$ (the fifth
and sixth values in the data set). Thus, we have $P_{10} = \frac{x_5 + x_6}{2} = \frac{13.1 + 13.3}{2} = 13.2$.
This means that 10% of the bars of soap weigh less than 13.2 ounces.
2. The Deciles
Deciles are values that divide a set of observations into ten equal parts. These values,
denoted by 𝐷1 , 𝐷2 , … , 𝐷9, are such that 10% of the data falls below 𝐷1 , 20% falls below 𝐷2 , …,
and 90% falls below 𝐷9 . The 𝑘th Decile, 𝐷𝑘 (𝑘 = 1, 2, … ,9), can be determined using the
following procedure:
a. Arrange the data in increasing order and compute the value of the index
$i = \left(\frac{k}{10}\right) n$, where $n$ is the number of observations.
b. If $i$ is an integer, $D_k = \frac{x_i + x_{i+1}}{2}$. If $i$ is not an integer, use the rounded-up value for $i$
and take $D_k = x_i$.
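Example 87. Compute the 1st and 9th deciles of the data set from the previous example.
Solution
a. 9th decile, $D_9$
Let us compute the index $i$ given that $k = 9$ and $n = 50$: $i = \left(\frac{9}{10}\right) 50 = 45$.
Since $i$ is an integer, $D_9 = \frac{x_{45} + x_{46}}{2}$. From the data set, $x_{45} = 18.3$ and $x_{46} = 18.5$. Thus, we
have $D_9 = \frac{x_{45} + x_{46}}{2} = \frac{18.3 + 18.5}{2} = 18.4$.
This means that 90% of the bars of soap weigh less than 18.4 ounces.
b. 1st decile, $D_1$
Let us compute the index $i$ given that $k = 1$ and $n = 50$: $i = \left(\frac{1}{10}\right) 50 = 5$.
Since $i$ is an integer, $D_1 = \frac{x_5 + x_6}{2}$. From the data set, $x_5 = 13.1$ and $x_6 = 13.3$. Thus, we
have $D_1 = \frac{x_5 + x_6}{2} = \frac{13.1 + 13.3}{2} = 13.2$.
This means that 10% of the bars of soap weigh less than 13.2 ounces.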
3. The Quartiles
Quartiles are values that divide a set of observations into four equal parts. These
values, denoted by 𝑄1 , 𝑄2 , and 𝑄3 , are such that 25% of the data falls below 𝑄1 , 50% falls
below 𝑄2 and 75% falls below 𝑄3 . The 𝑘th quartile, 𝑄𝑘 (𝑘 = 1, 2, 3), can be determined
using the following procedure:
a. Arrange the data in increasing order and compute the value of the index
$i = \left(\frac{k}{4}\right) n$, where $n$ is the number of observations.
b. If $i$ is an integer, $Q_k = \frac{x_i + x_{i+1}}{2}$. If $i$ is not an integer, use the rounded-up value for $i$
and take $Q_k = x_i$.
Example 88. Compute the 2nd and the 3rd quartiles of the data set from the previous
example.
Solution
a. 2nd quartile, $Q_2$
Let us compute the index $i$ given that $k = 2$ and $n = 50$: $i = \left(\frac{2}{4}\right) 50 = 25$.
Since $i$ is an integer, $Q_2 = \frac{x_{25} + x_{26}}{2}$. From the data set, $x_{25} = 16.2$ and $x_{26} = 16.2$.
Thus, we have $Q_2 = \frac{x_{25} + x_{26}}{2} = \frac{16.2 + 16.2}{2} = 16.2$.
This means that 50% of the bars of soap weigh less than 16.2 ounces.
Note that the value of $Q_2$ represents the median of the data and that $Q_2 = P_{50} = D_5$.
b. 3rd quartile, $Q_3$
We compute the index $i$. Note that $k = 3$ and $n = 50$: $i = \left(\frac{3}{4}\right) 50 = 37.5 \approx 38$ (round up
since 37.5 is not an integer). Hence, we have $Q_3 = x_{38} = 17.4$.
This means that 75% of the bars of soap weigh less than 17.4 ounces. We also say that
$Q_3 = P_{75}$.
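Percentiles, deciles, and quartiles all follow the same index rule, differing only in the number of equal parts (100, 10, or 4). The sketch below captures that shared procedure in one function; the name `fractile` and its `parts` argument are illustrative assumptions. The index test uses integer arithmetic (`divmod`) so that deciding whether i is a whole number does not depend on floating-point rounding.

```python
def fractile(data, k, parts):
    """k-th fractile when the data are split into `parts` equal parts.

    parts=100 gives percentiles, parts=10 deciles, parts=4 quartiles.
    Assumes 1 <= k <= parts - 1.
    """
    x = sorted(data)
    n = len(x)
    whole, remainder = divmod(k * n, parts)   # i = k*n/parts
    if remainder == 0:
        # i is an integer: average x_i and x_(i+1) (1-based positions).
        return (x[whole - 1] + x[whole]) / 2
    # i is not an integer: round up to whole + 1 and take that value.
    return x[whole]


soap = [
    11.6, 12.6, 12.7, 12.8, 13.1, 13.3, 13.6, 13.7, 13.8, 14.1,
    14.3, 14.3, 14.6, 14.8, 15.1, 15.2, 15.6, 15.6, 15.7, 15.8,
    15.8, 15.9, 15.9, 16.1, 16.2, 16.2, 16.3, 16.4, 16.5, 16.5,
    16.5, 16.6, 17.0, 17.1, 17.3, 17.3, 17.4, 17.4, 17.4, 17.6,
    17.7, 18.1, 18.3, 18.3, 18.3, 18.5, 18.5, 18.8, 19.2, 20.3,
]

print(round(fractile(soap, 43, 100), 1))  # 43rd percentile
print(round(fractile(soap, 10, 100), 1))  # 10th percentile
print(round(fractile(soap, 9, 10), 1))    # 9th decile
print(round(fractile(soap, 3, 4), 1))     # 3rd quartile
```

The same function confirms the identities stated in the text: the 2nd quartile, 50th percentile, and 5th decile coincide, as do the 3rd quartile and the 75th percentile.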
The measures of central tendency and relative location do not by themselves give an
adequate description of the data. It is also very important for us to know how the
observations spread out from the average. The measures of variability/dispersion indicate
the extent to which individual items in a series are scattered about the average. They are used to
determine the extent of the scatter so that steps may be taken to control the existing
variation. After going through this unit, you are expected to know how to calculate
descriptive measures to explain the consistency or variability of data using a scientific
calculator.
For example, let us consider the following measurements from two samples of data:
Both samples have the same mean 𝑥̅𝐴 = 𝑥̅𝐵 = 1.00. However, looking closely at the values,
the measurements for sample A are more uniform, or the values are close to each other
compared to sample B. This is what we will quantify in this unit. There are two general
classifications of the measures of variability: (1) measures of absolute dispersion and (2)
measures of relative dispersion.
Measures of Variability indicate the extent to which individual items in a series are
scattered about the average. They are used to determine the extent of the scatter so that steps
may be taken to control the existing variation. The general classifications of measures of
variation are
a) Measures of Absolute Dispersion
b) Measures of Relative Dispersion
Population variance:
$$ \sigma^2 = \frac{\sum (X_i - \mu)^2}{N} \quad \text{or} \quad \sigma^2 = \frac{N \sum X_i^2 - (\sum X_i)^2}{N^2} $$
The two formulas will generate the same result, but the second is
computationally more convenient because it eliminates the step of
computing the deviations from the mean. The second formula is
recommended for computing the variance by hand or with a calculator since it does not require the
computation of the mean first, and it avoids the round-off errors
introduced when deviations are taken from a rounded mean.
Sample variance:
$$ s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} \quad \text{or} \quad s^2 = \frac{n \sum x_i^2 - (\sum x_i)^2}{n(n - 1)} $$
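Population standard deviation:
$$ \sigma = \sqrt{\frac{N \sum X_i^2 - (\sum X_i)^2}{N^2}} $$
Sample standard deviation:
$$ s = \sqrt{\frac{n \sum x_i^2 - (\sum x_i)^2}{n(n - 1)}} $$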
If the data are clustered around the mean, then the variance and the standard
deviation will be somewhat small. If the data are widely scattered about the mean, the
variance and the standard deviation will be somewhat large.
Let us take note of the following:
1. The standard deviation unit is the same as that of the raw data, so it is preferable to
use the standard deviation as a measure of variability instead of the variance, whose
unit is the square of the unit of the raw data.
2. In the formula for 𝑠, we divide by the quantity 𝑛 − 1 (instead of 𝑛) to make the sample
variance an unbiased estimator of the population variance (an estimator is unbiased
if its average value is equal to the parameter it is estimating). Hence, it is critical to
determine if the data set is from a population or a sample because of the difference
in the formula to be used.
3. There is a big difference between $(\sum x_i)^2$ and $\sum x_i^2$! The first, $(\sum x_i)^2$, means we add up
all the $x_i$ values first, then square the sum. The second, $\sum x_i^2$, means we should square
each of the $x_i$ values first, then add them up! To illustrate, suppose our data set is 4,
3, 6, and 7. This gives us:
$\sum x_i = 4 + 3 + 6 + 7 = 20$
$(\sum x_i)^2 = (4 + 3 + 6 + 7)^2 = 20^2 = 400$
$\sum x_i^2 = 4^2 + 3^2 + 6^2 + 7^2 = 110$
See the difference there!
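The distinction between the two sums, and the agreement between the two sample-variance formulas, can be checked with a short Python sketch on the same data set (4, 3, 6, 7). The variable names are illustrative.

```python
data = [4, 3, 6, 7]
n = len(data)

sum_x = sum(data)                        # add the values: 20
square_of_sum = sum_x ** 2               # add first, then square: 400
sum_of_squares = sum(x ** 2 for x in data)  # square first, then add: 110

# Sample variance from the definition (deviations from the mean) ...
mean = sum_x / n
s2_definition = sum((x - mean) ** 2 for x in data) / (n - 1)

# ... and from the computational formula, which needs only the two sums.
s2_computational = (n * sum_of_squares - square_of_sum) / (n * (n - 1))

print(square_of_sum, sum_of_squares)
print(s2_definition, s2_computational)
```

Both variance expressions reduce to 10/3 for this data set, which illustrates that the two formulas are algebraically equivalent.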
A scientific calculator can easily compute these quantities using the Statistics mode.
If you are not familiar with the statistical functions of your calculator, you can watch the video
demonstrations for the models Casio 991EX and Casio 991ES at the following links.
https://www.loom.com/share/85e2345a662446c9bcab064213aeb381
https://www.loom.com/share/a08e8f19258148e0ba91ef945a39521a
Example 89. A high school teacher at a small private school assigns trigonometry practice
problems to be worked on online. Students must use a password to access the problems,
and the times of log-in and log-off are automatically recorded for the teacher. At the end of
the week, the teacher examines the amount of time each student spent working on the
assigned problems. The data, in minutes, are given below. Find the range, standard
deviation, and variance of the data.
15, 28, 25, 48, 22, 43, 49, 34, 22, 33, 27, 25, 22, 20, 39
Solution
$x$      $x^2$
15       225
28       784
25       625
48       2304
22       484
43       1849
49       2401
34       1156
22       484
33       1089
27       729
25       625
22       484
20       400
39       1521
Total:   $\sum x_i = 452$     $\sum x_i^2 = 15160$
$$s = \sqrt{\frac{n\sum x_i^2 - (\sum x_i)^2}{n(n-1)}}$$
$$s = \sqrt{\frac{15(15160) - (452)^2}{15(15-1)}}$$
$$s = 10.48718$$
For the variance:
The variance is the square of the standard deviation, that is, the quantity under the
radical sign in the formula for $s$:
$$s^2 = \frac{15(15160) - (452)^2}{15(15-1)} = \frac{23096}{210} = 109.98095$$
For the range:
The longest time is 49 minutes and the shortest is 15 minutes. Hence, the range is 49 − 15 = 34 minutes.
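As a cross-check of Example 89, the following Python sketch recomputes the three quantities from the raw data. It is only an illustration: the variable names are assumptions, and the standard library's `statistics` module is used purely to confirm the hand formula.

```python
import math
import statistics

minutes = [15, 28, 25, 48, 22, 43, 49, 34, 22, 33, 27, 25, 22, 20, 39]

n = len(minutes)
sum_x = sum(minutes)                      # 452
sum_x2 = sum(x * x for x in minutes)      # 15160

# Computational formula for the sample standard deviation (divides by n - 1).
s = math.sqrt((n * sum_x2 - sum_x ** 2) / (n * (n - 1)))
variance = s ** 2
data_range = max(minutes) - min(minutes)

# statistics.stdev and statistics.variance also divide by n - 1.
assert math.isclose(s, statistics.stdev(minutes))
assert math.isclose(variance, statistics.variance(minutes))

print(round(s, 5), round(variance, 5), data_range)  # 10.48718 109.98095 34

# If the 15 students were treated as the whole population, the divisor
# would be n instead of n - 1, which gives a slightly smaller value.
print(statistics.pstdev(minutes))
```

The last line shows why it matters whether the data are a sample or a population: the two formulas give different answers for the same numbers.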
Example 90: If we have a standard deviation of 1.5 and a mean of 5, what is the coefficient
of variation?
Solution
$\bar{x} = 5$
$s = 1.5$
$$CV = \left(\frac{s}{\bar{x}}\right) \times 100\% = \left(\frac{1.5}{5}\right) \times 100\% = 30\%$$
In other words, the standard deviation is 30% of the mean. When comparing two or more
data sets, the general rule of thumb is: the higher the coefficient of variation, the higher
the variability of the data set relative to its mean. The data set with the highest coefficient
of variation can therefore be said to have the highest relative variation.
Example 91: The mean of the number of sales of cars over a 3-month period is 87, and the
standard deviation is 5. The mean of the commissions is P261,250.00 and the standard
deviation is P38,650.00. Compare the variations of the two.
Solution
Sales:
$$CV = \left(\frac{5}{87}\right) \times 100\% = 5.75\%$$
Commissions:
$$CV = \left(\frac{P38,650.00}{P261,250.00}\right) \times 100\% = 14.79\%$$
Since the coefficient of variation is larger for commissions, the commissions are more
variable than sales.
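The coefficient of variation is simple enough to wrap in a small helper. The Python sketch below is an illustration only; the function name `coefficient_of_variation` is an assumption, not something defined in this module. It reproduces Examples 90 and 91, where the two ratios in Example 91 evaluate to 5.75% and 14.79% when rounded to two decimal places.

```python
def coefficient_of_variation(std_dev: float, mean: float) -> float:
    """Return the coefficient of variation as a percentage of the mean."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return std_dev / mean * 100

cv_example_90 = coefficient_of_variation(1.5, 5)
cv_sales = coefficient_of_variation(5, 87)
cv_commissions = coefficient_of_variation(38_650, 261_250)

print(round(cv_example_90, 2))    # 30.0
print(round(cv_sales, 2))         # 5.75
print(round(cv_commissions, 2))   # 14.79

# The data set with the larger CV is the more variable one relative to its mean.
print(cv_commissions > cv_sales)  # True
```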
A. Correlation
It measures the strength of the association or relationship between variables. The
variables are not designated as dependent or independent, so the correlation is symmetric:
correl(X, Y) = correl(Y, X)
Correlation is not causation (a cause-and-effect relationship).
The association is assumed to be linear; that is, one variable increases or decreases by a
fixed amount for each unit increase or decrease in the other.
Pearson Correlation Coefficient
• denoted by $r$
• used to measure the degree of linear association or relationship
• measured on a scale that varies from +1 through 0 to −1
• formula:
$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}}$$
The sign of $r$ gives the direction of the relationship (positive or negative); its absolute
value gives the strength.
$|r|$         Interpretation
1.0           Perfect positive/negative correlation
0.80-0.99     Very strong positive/negative correlation
0.60-0.79     Strong positive/negative correlation
0.40-0.59     Moderate positive/negative correlation
0.20-0.39     Weak positive/negative correlation
0.01-0.19     Very weak positive/negative correlation
0.0           No correlation
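The interpretation table can be expressed as a small lookup. The Python sketch below is one possible reading of it, under a stated assumption: each band is treated as starting at its lower bound, so a value such as 0.795, which falls between two printed ranges, is assigned to the lower band. The function name `interpret_r` is hypothetical.

```python
def interpret_r(r: float) -> str:
    """Map a Pearson r to the verbal label used in the interpretation table."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")
    size = abs(r)
    if size == 0:
        return "No correlation"
    direction = "positive" if r > 0 else "negative"
    if size == 1:
        return f"Perfect {direction} correlation"
    # Lower bound of each band, checked from strongest to weakest (assumption).
    bands = [(0.80, "Very strong"), (0.60, "Strong"),
             (0.40, "Moderate"), (0.20, "Weak")]
    for lower, label in bands:
        if size >= lower:
            return f"{label} {direction} correlation"
    return f"Very weak {direction} correlation"

print(interpret_r(0.9625))  # Very strong positive correlation
print(interpret_r(-0.45))   # Moderate negative correlation
```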
Example 92: Given the following data on the number of hours of study (x) for an
examination and the scores (y) received by a random sample of 10 students, compute
the Pearson correlation coefficient.
Solution
Student   $x$    $y$     $xy$    $y^2$    $x^2$
1         8      56      448     3136     64
2         5      44      220     1936     25
3         11     79      869     6241     121
4         13     72      936     5184     169
5         10     70      700     4900     100
6         5      54      270     2916     25
7         18     94      1692    8836     324
8         15     85      1275    7225     225
9         2      33      66      1089     4
10        8      65      520     4225     64
Total     95     652     6996    45688    1121
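$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}}$$
$$r = \frac{10(6996) - (95)(652)}{\sqrt{\left[10(1121) - (95)^2\right]\left[10(45688) - (652)^2\right]}}$$
$$r = 0.9625$$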
There is a very strong positive linear relationship between the number of hours
of study (𝑥) for an examination and the scores (𝑦) received by a random sample of
10 students.
Example 93: Consider the scores obtained in Math and Statistics by 10 students. Compute
the Pearson correlation coefficient.
Student      1    2    3    4    5    6    7    8    9    10
Math Score   5    8    10   12   12   14   15   16   18   20
Stat Score   2    7    8    9    10   12   14   10   16   12
Solution
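Student            1    2    3    4    5    6    7    8    9    10    Total
Math Score ($x$)   5    8    10   12   12   14   15   16   18   20    130
Stat Score ($y$)   2    7    8    9    10   12   14   10   16   12    100
$xy$               10   56   80   108  120  168  210  160  288  240   1440
$x^2$              25   64   100  144  144  196  225  256  324  400   1878
$y^2$              4    49   64   81   100  144  196  100  256  144   1138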
$$r = \frac{10(1440) - (130)(100)}{\sqrt{\left[10(1878) - (130)^2\right]\left[10(1138) - (100)^2\right]}}$$
$$r = 0.8692$$
There is a very strong positive linear relationship between math and stat scores.
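The hand computations in Examples 92 and 93 can be verified with a direct translation of the formula for $r$. The Python sketch below is illustrative; the function name `pearson_r` and the list names are assumptions.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient from the raw-sum formula."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("xs and ys must have the same length (at least 2)")
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Example 92: hours of study versus examination score.
hours = [8, 5, 11, 13, 10, 5, 18, 15, 2, 8]
scores = [56, 44, 79, 72, 70, 54, 94, 85, 33, 65]
print(round(pearson_r(hours, scores), 4))            # 0.9625

# Example 93: Math score versus Stat score.
math_scores = [5, 8, 10, 12, 12, 14, 15, 16, 18, 20]
stat_scores = [2, 7, 8, 9, 10, 12, 14, 10, 16, 12]
print(round(pearson_r(math_scores, stat_scores), 4))  # 0.8692

# Swapping the roles of the two variables does not change r.
print(math.isclose(pearson_r(hours, scores), pearson_r(scores, hours)))  # True
```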
B. Regression
It is used to examine the relationship between one dependent and one independent
variable and to predict the dependent variable (Y) when the independent variable (X) is
known.
It finds the best line (regression line) that predicts Y from X.
The Regression Line
It is a line that is as close as possible to all the data points at once.
The Regression Equation
It is an equation that represents the relationship between one dependent and
one independent variable.
𝒚= 𝒂 + 𝒃𝒙
The slope is
$$b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2}$$
The y-intercept is
$$a = \frac{\sum y}{n} - b\,\frac{\sum x}{n}$$
$$SP_{XY} = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}$$
$$SS_Y = \sum y_i^2 - \frac{(\sum y_i)^2}{n}$$
$$SS_X = \sum x_i^2 - \frac{(\sum x_i)^2}{n}$$
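To see how these formulas fit together, here is a minimal Python sketch that fits a regression line to the study-hours data of Example 92. It assumes the usual definitions of the sum of products and the sum of squares (each correction term divided by $n$), under which the slope is the ratio of the sum of products to the sum of squares of $x$; this is algebraically the same as the raw-sum formula for the slope. The function name `regression_line` and the 12-hour prediction are illustrative choices, not part of the module.

```python
def regression_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("xs and ys must have the same length (at least 2)")
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sp_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    ss_x = sum(x * x for x in xs) - sum_x ** 2 / n
    if ss_x == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    b = sp_xy / ss_x                   # slope
    a = sum_y / n - b * sum_x / n      # y-intercept
    return a, b

# Data from Example 92: hours of study (x) and examination score (y).
hours = [8, 5, 11, 13, 10, 5, 18, 15, 2, 8]
scores = [56, 44, 79, 72, 70, 54, 94, 85, 33, 65]

a, b = regression_line(hours, scores)
print(round(a, 4), round(b, 4))        # 30.3304 3.6705

# Predicted score for a student who studies 12 hours (illustrative input).
print(round(a + b * 12, 2))
```

A property worth noticing is that the fitted line always passes through the point of means, because the intercept is defined as the mean of $y$ minus the slope times the mean of $x$.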
Example 94: The paired data below consist of the costs of advertising (in thousands
of pesos) and the number of products sold (in thousand units).
Solution
Cost ($x$)    Products Sold ($y$)    $xy$                $x^2$             $y^2$
9,000.00      85,000.00              765,000,000.00      81,000,000.00     7,225,000,000.00
2,000.00      52,000.00              104,000,000.00      4,000,000.00      2,704,000,000.00
3,000.00      55,000.00              165,000,000.00      9,000,000.00      3,025,000,000.00
4,000.00      68,000.00              272,000,000.00      16,000,000.00     4,624,000,000.00
2,000.00      67,000.00              134,000,000.00      4,000,000.00      4,489,000,000.00
5,000.00      86,000.00              430,000,000.00      25,000,000.00     7,396,000,000.00
9,000.00      83,000.00              747,000,000.00      81,000,000.00     6,889,000,000.00
10,000.00     73,000.00              730,000,000.00      100,000,000.00    5,329,000,000.00
Total
44,000.00     569,000.00             3,347,000,000.00    320,000,000.00    41,681,000,000.00
2. Find the equation of the regression line to predict weekly sales from advertising
expenditures.
Therefore, 50.08 % of the variance in the number of products sold is predictable from
the cost of advertising.
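The figures in Example 94 can be reproduced from the table. The Python sketch below works in thousands (for example, 9 for 9,000.00 and 85 for 85,000.00), which leaves $r$ and the slope unchanged and expresses the intercept in thousands of units. The variable names are assumptions, and the regression coefficients are computed here from the formulas for the slope and the y-intercept rather than copied from the module.

```python
import math

# Advertising cost (thousand pesos) and products sold (thousand units).
cost = [9, 2, 3, 4, 2, 5, 9, 10]
sold = [85, 52, 55, 68, 67, 86, 83, 73]

n = len(cost)
sum_x, sum_y = sum(cost), sum(sold)                 # 44, 569
sum_xy = sum(x * y for x, y in zip(cost, sold))     # 3347
sum_x2 = sum(x * x for x in cost)                   # 320
sum_y2 = sum(y * y for y in sold)                   # 41681

num = n * sum_xy - sum_x * sum_y                    # 1740
den_x = n * sum_x2 - sum_x ** 2                     # 624
den_y = n * sum_y2 - sum_y ** 2                     # 9687

r = num / math.sqrt(den_x * den_y)
r_squared = r ** 2                                  # coefficient of determination
b = num / den_x                                     # slope
a = (sum_y - b * sum_x) / n                         # y-intercept

print(round(r, 4), round(r_squared, 4))             # 0.7077 0.5009
print(round(a, 4), round(b, 4))                     # 55.7885 2.7885
```

The value of $r^2$ is about 0.5009; squaring the rounded value $r = 0.7077$ instead gives 0.5008, which accounts for the 50.08% quoted above.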
Learning Reinforcement 6
_______________ _______________ 5. The length of telephone calls made by
students to their parents