Data Management

(1) Measures of central tendency and location describe the middle or center of a data set and include the mean, median, mode, percentiles, quartiles, and deciles. (2) The mean is the average value found by adding all values and dividing by the number of values, while the median is the middle value of a data set arranged in order. (3) Measures of variability or dispersion, such as the range or standard deviation, describe how spread out the values are around the mean or median.

DATA MANAGEMENT

MEASURES OF CENTRAL TENDENCY

Measures of central tendency are numerical values that locate, in some sense, the middle of a set of
data. The term average is often associated with these measures. The most important measures of central
tendency are (1) the mean, (2) the median, and (3) the mode.

A. MEAN, 𝜇 or 𝑥̅

1. Arithmetic Mean – it is obtained by adding all the observations and dividing the sum by the number of
observations, thus it is called a computational average.
Population mean: If a set of data 𝑥1, 𝑥2, …, 𝑥𝑁 represents a finite population of size 𝑁, then the population
mean 𝜇 is

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$

Sample Mean: If a set of data 𝑥1, 𝑥2, …, 𝑥𝑛 represents a finite sample of size 𝑛, then the sample mean 𝑥̅ is

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

Example:
Suppose you are to choose ten people who enter the campus and whose ages are as follows:
15 25 18 20 25 18 18 20 25 15
What is the mean age of this sample?
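
As a quick check of the sample-mean formula, the ages above can be averaged with a few lines of Python. This is only a minimal sketch; the variable names are illustrative and not part of the original notes.

```python
# Ages of the ten people listed in the example above.
ages = [15, 25, 18, 20, 25, 18, 18, 20, 25, 15]

# Sample mean: add all observations, then divide by the number of observations.
n = len(ages)
total = sum(ages)
sample_mean = total / n

print(f"n = {n}, sum = {total}, mean = {sample_mean}")
```

Every observation enters the sum, which is why a single very large or very small age would pull the mean toward it.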

2. Weighted Mean – if the data set 𝑥1, 𝑥2, …, 𝑥𝑘 have assigned weights 𝑤1, 𝑤2, …, 𝑤𝑘, respectively, then the
weighted mean is computed as follows:

$$\bar{x} = \frac{\sum_{i=1}^{k} w_i x_i}{\sum_{i=1}^{k} w_i}$$

Example:
A student was taking five subjects in college during the first semester. Find his average grade if his final
grades were as follows:
Subject   Math   Physics   English   Speech   Statistics
Grade     1.75   2.50      2.25      1.50     3.0
Units     3      5         3         2        4
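
The weighted-mean formula can be sketched in Python as follows, assuming (as the table suggests) that the units of each subject serve as the weights. Only the five subjects listed in the table are used; the names below are illustrative.

```python
# Grades and units for the subjects listed in the table above.
grades = [1.75, 2.50, 2.25, 1.50, 3.0]
units = [3, 5, 3, 2, 4]  # assumed to be the weights w_i

# Numerator: sum of w_i * x_i; denominator: sum of w_i.
weighted_sum = sum(w * x for w, x in zip(units, grades))
total_units = sum(units)
weighted_mean = weighted_sum / total_units

print(f"sum(w*x) = {weighted_sum}, sum(w) = {total_units}")
print(f"weighted mean = {weighted_mean:.2f}")
```

A plain average of the five grades would ignore the fact that Physics (5 units) should count more than Speech (2 units).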
B. MEDIAN, 𝜇̃ or 𝑥̃
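- a value that divides the distribution into two equal parts (after arranging the values/scores in ascending or
descending order). As such, it is a positional average. The median is defined by

$$\tilde{\mu} \text{ or } \tilde{x} = \begin{cases} x_{\frac{n+1}{2}} & \text{if } n \text{ is odd} \\[6pt] \dfrac{x_{\frac{n}{2}} + x_{\frac{n}{2}+1}}{2} & \text{if } n \text{ is even} \end{cases}$$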

Example:
Find the median:
(a) 12, 15, 18, 8, 9,10, 6
(b) 23, 18, 15, 12, 10, 9, 8, 6
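
The two cases of the median definition can be illustrated with a short Python function. This is a sketch rather than a prescribed method; the function name `median` is our own, and the data are sorted first because the definition assumes ordered values.

```python
def median(values):
    """Median by the odd/even rule, applied to the sorted values."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the single middle value, x_((n+1)/2) in 1-based notation.
        return ordered[mid]
    # Even n: the average of x_(n/2) and x_(n/2 + 1) in 1-based notation.
    return (ordered[mid - 1] + ordered[mid]) / 2

data_a = [12, 15, 18, 8, 9, 10, 6]        # n = 7 (odd)
data_b = [23, 18, 15, 12, 10, 9, 8, 6]    # n = 8 (even)

print("median of (a):", median(data_a))
print("median of (b):", median(data_b))
```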

C. MODE, 𝜇̂ or 𝑥̂
- the value in the distribution with the highest frequency. It locates the point where the observation values occur
with the greatest density. It can be used for quantitative as well as qualitative data.

GMATH- Mathematics in the Modern World


Example: Find the mode of the following data:


15 12 4 9 6 10 5 15 12 4 12 6 12
5 15 12 4 15 4 6 5
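
Counting frequencies by hand is error-prone for a list of this length, so a small Python sketch may help. The function name `modes` is hypothetical, and the rule that a data set has no mode when all distinct values occur equally often is an assumed convention, not something the notes state explicitly.

```python
from collections import Counter

def modes(values):
    """Return the list of most frequent values (empty if there is no mode)."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    # Assumed convention: if several distinct values all tie, report no mode.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)

data = [15, 12, 4, 9, 6, 10, 5, 15, 12, 4, 12, 6, 12,
        5, 15, 12, 4, 15, 4, 6, 5]

print("frequencies:", dict(Counter(data)))
print("mode(s):", modes(data))
```

Returning a list rather than a single number allows for data sets with no mode or with more than one mode.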

Evidently, a distribution can have no mode, one mode, or more than one mode. Thus, the mode is not a very
reliable measure of central tendency. However, there are instances when no other measure can be used except
the mode. In determining the prevalent gender, civil status, or highest educational attainment, only the mode
can be used because no numerical values can be assigned to these variables.

D. MIDRANGE - the mean of the largest and smallest values in the data set.

Remarks
Mean:
1. All the scores or measurements are considered in the computation of the mean.
2. Very high or very low scores or measurements affect the mean.
Median:
1. Only the middle scores or measurements are considered in the computation of the median.
2. Very high or very low scores do not affect the median.
Mode:
1. It is very easy to compute but is seldom used because it is very unstable.
2. It is most appropriate for nominal scale as a measure of popularity.

MEASURES OF LOCATION

There are several other measures of location that describe or locate the position of certain non-central pieces
of data relative to the entire set of data. These measures, often referred to as quantiles or fractiles, are values
below which a specific fraction or percentage of the observations in a given set must fall.

PERCENTILES
Percentiles are values that divide a set of observations into 100 equal parts. These values, denoted by
𝑃1 , 𝑃2 , … , 𝑃99 , are such that 1% of the data falls below 𝑃1 , 2% falls below 𝑃2 , …, and 99% falls below 𝑃99 .
The 𝑘th percentile, 𝑃𝑘 (𝑘 = 1, 2, 3, …, 99), can be determined using the following procedure:
1. Arrange the data in increasing order and compute the value of the index $i = \left(\frac{k}{100}\right) n$, where 𝑛 is the
number of observations.
2. If 𝑖 is an integer, $P_k = \frac{x_i + x_{i+1}}{2}$. If 𝑖 is not an integer, use the rounded up value for 𝑖 and take $P_k = x_i$.

DECILES
Deciles are values that divide a set of observations into 10 equal parts. These values, denoted by 𝐷1 , 𝐷2 , … , 𝐷9 ,
are such that 10% of the data falls below 𝐷1 , 20% falls below 𝐷2 , …, and 90% falls below 𝐷9 .
The 𝑘th decile, 𝐷𝑘 (𝑘 = 1, 2, …, 9), can be determined using the following procedure:
1. Arrange the data in increasing order and compute the value of the index $i = \left(\frac{k}{10}\right) n$, where 𝑛 is the number
of observations.
2. If 𝑖 is an integer, $D_k = \frac{x_i + x_{i+1}}{2}$. If 𝑖 is not an integer, use the rounded up value for 𝑖 and take $D_k = x_i$.

QUARTILES
Quartiles are values that divide a set of observations into 4 equal parts. These values, denoted by 𝑄1 , 𝑄2 , and
𝑄3 , are such that 25% of the data falls below 𝑄1 , 50% falls below 𝑄2 and 75% falls below 𝑄3 .

The 𝑘th quartile, 𝑄𝑘 (𝑘 = 1, 2, 3), can be determined using the following procedure:
1. Arrange the data in increasing order and compute the value of the index $i = \left(\frac{k}{4}\right) n$, where 𝑛 is the number
of observations.
2. If 𝑖 is an integer, $Q_k = \frac{x_i + x_{i+1}}{2}$. If 𝑖 is not an integer, use the rounded up value for 𝑖 and take $Q_k = x_i$.


Examples

1. Find the quartiles, interquartile range, 3rd and 7th deciles, and 12th, 37th, 95th percentiles for the
following examination scores given in the stem-and-leaf plot.
Exam Scores
4 |568
5 |34569
6 |2356699
7 |01133455578
8 |122369

2. As part of a quality-control study aimed at improving a production line, the weights (in ounces) of 50
bars of soap are measured. The results are as follows, sorted from smallest to largest. Find the interquartile
range, the 3rd and 9th deciles, and the 12th, 43rd, and 61st percentiles.

11.6 12.6 12.7 12.8 13.1 13.3 13.6 13.7 13.8 14.1
14.3 14.3 14.6 14.8 15.1 15.2 15.6 15.6 15.7 15.8
15.8 15.9 15.9 16.1 16.2 16.2 16.3 16.4 16.5 16.5
16.5 16.6 17.0 17.1 17.3 17.3 17.4 17.4 17.4 17.6
17.7 18.1 18.3 18.3 18.3 18.5 18.5 18.8 19.2 20.3
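
The two-step procedure for percentiles, deciles, and quartiles differs only in the divisor (100, 10, or 4), so it can be sketched as one Python function. The function name `quantile` and its `parts` parameter are our own labels, and the sketch is applied to the exam scores of Example 1 as read from the stem-and-leaf plot. Integer arithmetic is used for the index so that the "is 𝑖 an integer?" test is not affected by floating-point rounding.

```python
def quantile(sorted_data, k, parts):
    """k-th quantile when the data are divided into `parts` equal parts.

    parts = 4 gives quartiles, 10 gives deciles, 100 gives percentiles.
    `sorted_data` must already be in increasing order.
    """
    n = len(sorted_data)
    if n == 0 or not 0 < k < parts:
        raise ValueError("need non-empty data and 0 < k < parts")
    product = k * n                      # i = product / parts
    if product % parts == 0:
        i = product // parts             # i is an integer
        # Average of x_i and x_(i+1) (1-based), i.e. indices i-1 and i.
        return (sorted_data[i - 1] + sorted_data[i]) / 2
    i = product // parts + 1             # round i up to the next integer
    return sorted_data[i - 1]

# Exam scores rebuilt from the stem-and-leaf plot in Example 1.
stems = {4: "568", 5: "34569", 6: "2356699", 7: "01133455578", 8: "122369"}
scores = [10 * stem + int(leaf) for stem, leaves in stems.items() for leaf in leaves]

q1, q2, q3 = (quantile(scores, k, 4) for k in (1, 2, 3))
print("n =", len(scores))
print("Q1, Q2, Q3 =", q1, q2, q3, " IQR =", q3 - q1)
print("D3, D7 =", quantile(scores, 3, 10), quantile(scores, 7, 10))
print("P12, P37, P95 =", [quantile(scores, k, 100) for k in (12, 37, 95)])
```

Other textbooks and software packages use different interpolation rules, so results from a library routine may differ slightly from this procedure.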

MEASURES OF VARIABILITY OR DISPERSION

The measures of central tendency do not by themselves give an adequate description of the data. It is also very
important for us to know how the observations spread out from the average. The measures of variation indicate
the extent to which individual items in a series are scattered about the average. They are used to determine the
extent of the scatter so that steps may be taken to control the existing variation.

Let us consider the following measurements for two samples of data:

Sample A   P24,500   20,700   22,900   26,000   24,100   23,800   22,500
Sample B   P24,900   17,500   21,600   29,700   25,300   23,800   21,700

Both samples have the same mean (P23,500), but the measurements for sample A are clearly more uniform, that is,
closer to each other, than those for sample B.
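
The claim about the two samples can be checked with a short Python sketch that uses only the mean and the distance between the largest and smallest values. The helper name `summarize` is illustrative.

```python
import statistics

sample_a = [24500, 20700, 22900, 26000, 24100, 23800, 22500]
sample_b = [24900, 17500, 21600, 29700, 25300, 23800, 21700]

def summarize(data):
    """Return the mean and the max-minus-min spread of a data set."""
    return statistics.mean(data), max(data) - min(data)

mean_a, spread_a = summarize(sample_a)
mean_b, spread_b = summarize(sample_b)

print(f"Sample A: mean = {mean_a}, max - min = {spread_a}")
print(f"Sample B: mean = {mean_b}, max - min = {spread_b}")
```

Equal means with very different spreads is exactly the situation in which a measure of central tendency alone is misleading.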

General Classifications of Measures of Variation

- Measures of Absolute Dispersion
- Measures of Relative Dispersion

Measures of Absolute Dispersion


The measures of absolute dispersion are expressed in the units of the original observations. They cannot
be used to compare variations of two data sets when the averages of these data sets differ a lot in value or when
the observations differ in units of measurement. The most common statistics for measuring the variability of a set
of data are the range, variance, and the standard deviation.

RANGE
The range measures the distance between the largest and the smallest values and, as such, gives an idea of the
spread of the data set. However, the range does not use the concept of deviation. It is affected by outliers and
does not consider all values in the data set. Thus it is not a very useful measure of variability.

𝑅𝑎𝑛𝑔𝑒 (𝑅) = 𝑚𝑎𝑥𝑖𝑚𝑢𝑚 𝑣𝑎𝑙𝑢𝑒 – 𝑚𝑖𝑛𝑖𝑚𝑢𝑚 𝑣𝑎𝑙𝑢𝑒


MEAN ABSOLUTE DEVIATION


The mean absolute deviation (MAD) utilizes deviations of the data values from the mean in its computation. The
MAD is the average of the absolute values of the deviations from the mean, computed using the formula

population: $MAD = \dfrac{\sum |x_i - \mu|}{N}$
sample: $MAD = \dfrac{\sum |x_i - \bar{x}|}{n}$

If a data set A has a greater MAD than data set B, then it is reasonable to believe that the values in data set A
are more spread out (variable) than the values in set B.
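
A minimal Python sketch of the MAD formula follows, applied to Samples A and B from the start of this section. The function name is our own; since the population and sample formulas have the same form, one function covers both.

```python
def mean_absolute_deviation(data):
    """Average of the absolute deviations of the values from their mean."""
    n = len(data)
    if n == 0:
        raise ValueError("MAD of an empty data set is undefined")
    mean = sum(data) / n
    return sum(abs(x - mean) for x in data) / n

sample_a = [24500, 20700, 22900, 26000, 24100, 23800, 22500]
sample_b = [24900, 17500, 21600, 29700, 25300, 23800, 21700]

mad_a = mean_absolute_deviation(sample_a)
mad_b = mean_absolute_deviation(sample_b)

print(f"MAD of Sample A = {mad_a:.2f}")
print(f"MAD of Sample B = {mad_b:.2f}")
```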

VARIANCE AND STANDARD DEVIATION


The variance and the standard deviation are the most common and useful measures of variability. These two
measures provide information about how the data vary about the mean. The variance 𝜎² or 𝑠² is a measure of
variation which considers the position of each observation relative to the mean of the set. It is an approximate
average of the squared deviations from the mean. The standard deviation 𝜎 or 𝑠 is the square root of the
variance.

Population Variance: Given the finite population 𝑥1, 𝑥2, …, 𝑥𝑁, the population variance, which is exact, is

$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N} \quad \text{or} \quad \sigma^2 = \frac{N \sum x_i^2 - \left(\sum x_i\right)^2}{N^2}$$

Sample Variance: Given a random sample 𝑥1, 𝑥2, …, 𝑥𝑛, the sample variance is

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1} \quad \text{or} \quad s^2 = \frac{n \sum x_i^2 - \left(\sum x_i\right)^2}{n(n-1)}$$

where:
𝜎 = population standard deviation     𝑥𝑖 = 𝑖th observation
𝑠 = sample standard deviation         𝜇 = population mean
𝑥̅ = sample mean                       𝑁 = population size
𝑛 = sample size

If the data are clustered around the mean, then the variance and the standard deviation will be somewhat
small. If, however, the data are widely scattered about the mean, the variance and the standard deviation will
be somewhat large.
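
The two forms of the sample variance (the definition and the computational shortcut) should give the same result. The Python sketch below checks this on Sample A from the start of this section and compares both against `statistics.variance` from the standard library. The function names are illustrative.

```python
import math
import statistics

def sample_variance_definition(data):
    """s^2 = sum((x_i - mean)^2) / (n - 1)."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - 1)

def sample_variance_shortcut(data):
    """s^2 = (n * sum(x_i^2) - (sum(x_i))^2) / (n * (n - 1))."""
    n = len(data)
    return (n * sum(x * x for x in data) - sum(data) ** 2) / (n * (n - 1))

sample_a = [24500, 20700, 22900, 26000, 24100, 23800, 22500]

var_def = sample_variance_definition(sample_a)
var_short = sample_variance_shortcut(sample_a)
std_dev = math.sqrt(var_def)

print(f"variance (definition) = {var_def:.2f}")
print(f"variance (shortcut)   = {var_short:.2f}")
print(f"library variance      = {statistics.variance(sample_a):.2f}")
print(f"standard deviation    = {std_dev:.2f}")
```

The shortcut form avoids computing the mean first, which is convenient for hand calculation; with floating-point data it can lose precision, so the definitional form is usually the safer choice in code.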

Notes:
1. We divide by the quantity 𝑛 − 1 in order to make the sample variance an unbiased estimator of the
population variance. (An estimator is unbiased if its average value is equal to the parameter it is
estimating.)
2. The unit of the standard deviation is the same as that of the raw data, so it is preferable to use the standard
deviation as a measure of variability instead of the variance.
3. The range is a quick but rough measure of variation since it considers only the highest value and the
lowest value of the observations.

Measures of Relative Dispersion


The measures of relative dispersion are unitless and are used when one wishes to compare the dispersion
of one distribution with that of another distribution.

COEFFICIENT OF VARIATION (CV)


The coefficient of variation standardizes the variation by dividing the standard deviation by the mean. Because
of this property, it can be used to compare variations for different variables with different units.

population: $CV = \left(\dfrac{\sigma}{\mu}\right) 100\%$
sample: $CV = \left(\dfrac{s}{\bar{x}}\right) 100\%$


A larger coefficient of variation implies a more spread out or more dispersed data set.

The CV is defined only for a non-zero mean and is most useful for variables that are always positive. It is also
known as unitized risk or the variation coefficient. Because it is unitless, it can be used to compare the dispersion
of two or more data sets with the same or different units.

Example:
Several measurements of the diameter of a spherical instrument bearing made with one micrometer
had a mean of 2.49 mm and a standard deviation of 0.12 mm, and several measurements of the
unstretched length of a spring made with another micrometer had a mean of 0.75 in. with a standard
deviation of 0.02 in. Which of the two micrometers is relatively more precise?
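
Because the two sets of measurements are in different units (mm and in.), their standard deviations cannot be compared directly, but their coefficients of variation can. A minimal Python sketch, with an illustrative function name:

```python
def coefficient_of_variation(std_dev, mean):
    """CV as a percentage: (standard deviation / mean) * 100."""
    if mean == 0:
        raise ValueError("the CV is undefined when the mean is zero")
    return std_dev / mean * 100

# Micrometer 1: bearing diameter, in mm.
cv_diameter = coefficient_of_variation(0.12, 2.49)
# Micrometer 2: unstretched spring length, in inches.
cv_length = coefficient_of_variation(0.02, 0.75)

print(f"CV, micrometer 1 = {cv_diameter:.2f}%")
print(f"CV, micrometer 2 = {cv_length:.2f}%")
```

The micrometer with the smaller CV shows less variation relative to the size of what it measures, and is in that sense relatively more precise.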

Example:
Blood samples from 10 persons were sent to each of two laboratories for cholesterol determination.
Measurements were as follows (Kuzma and Bohnenblust, 2005):

Subject 1 2 3 4 5 6 7 8 9 10
Lab1 296 268 244 272 240 244 282 254 244 262
Lab2 318 287 260 279 245 249 294 271 262 285

Compare the data sets recorded by the two laboratories by considering the following descriptive
measures: mean, median, mode, first quartile, third quartile, range, standard deviation, variance,
mean absolute deviation, and coefficient of variation.

CORRELATION and REGRESSION ANALYSIS

Correlation analysis is a technique used to describe the relationship or association between variables. If
we want to know the degree of relationship between two variables that are measured on at least an interval
scale, the Pearson Product Moment Correlation Coefficient (r) may be computed.

Interpreting the Correlation Coefficient:


The value of the correlation coefficient indicates how strongly the variables are linearly related to
each other. The correlation coefficient is a value between -1 and +1, inclusive. If r is negative, there is a
negative relationship between the variables; if r is positive, the relationship is positive.
The value of r is interpreted as follows:

Correlation Coefficient     Linear Relationship
0                           None
±0.01 to ±0.20              Very Weak
±0.21 to ±0.40              Weak
±0.41 to ±0.60              Moderate
±0.61 to ±0.80              Strong
±0.81 to ±0.99              Very Strong
±1                          Perfect Linear

Pearson Product Moment Correlation Coefficient ρ

The estimator of the true population Pearson Product Moment Correlation Coefficient (ρ) is given by

$$r = \frac{\sum xy - \dfrac{\left(\sum x\right)\left(\sum y\right)}{n}}{\sqrt{\left[\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}\right]\left[\sum y^2 - \dfrac{\left(\sum y\right)^2}{n}\right]}}$$

Properties of the Correlation Coefficient (r):


1. It is a unitless quantity.
2. It is always some number between -1 and +1, inclusive.
3. The magnitude of r is simply a measure of how closely the points cluster about a certain trend line
which is known as the regression line.

Example: Consider the scores obtained in Math (X) and Statistics (Y) by 10 students.

Student           1    2    3    4    5    6    7    8    9    10
Math Score (X)    5    8    10   12   12   14   15   16   18   20
Stat Score (Y)    2    7    8    9    10   12   14   10   16   12

Compute the correlation coefficient, r.
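
The sums required by the formula for r can be accumulated in a few lines of Python. The sketch below uses the Math and Statistics scores from the table; the function name `pearson_r` is our own.

```python
import math

def pearson_r(x, y):
    """Pearson r from the raw sums of x, y, xy, x^2, and y^2."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired and have at least 2 values")
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = sum_xy - sum_x * sum_y / n
    denominator = math.sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
    return numerator / denominator

math_scores = [5, 8, 10, 12, 12, 14, 15, 16, 18, 20]   # X
stat_scores = [2, 7, 8, 9, 10, 12, 14, 10, 16, 12]     # Y

r = pearson_r(math_scores, stat_scores)
print(f"r = {r:.4f}")
```

The result can then be read against the interpretation table given earlier in this section.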

Correlation and regression analysis are closely related since both involve the relationship between two
variables and both use paired observations obtained from the same (or matched) subjects. While
correlation is used to determine the degree as well as the direction of the relationship between variables,
regression analysis deals with the use of the relationship for forecasting or predicting the value of a dependent
variable. The primary goal of regression analysis is to develop a statistical (regression) model that will
characterize the association of the variables and also to determine the statistical relationship, if any, between
variables. If the regression model is found to be adequate, it can then be used to estimate or forecast values of
the dependent variable.
Before proceeding with regression analysis, a scatter diagram of Y versus X can be drawn. It may give an
idea of the form of the relationship between them.

* Simple Linear Regression

- A statistical tool that is used to


o Describe the dependence of variable Y on the independent variable X.
o Lend support to the hypothesis regarding the possible causation of changes in Y brought about
by changes in X.
o Predict Y in terms of X.
o Explain some of the variations of Y by X.

The Simple Linear Regression Model

In most real situations, the relation between the two variables is not perfect. For example, if a student
obtained a grade of 85%, it cannot be solely attributed to the student’s IQ. The student’s performance is also
affected by other factors aside from the student’s IQ level.
The simple linear regression model expresses the response (or dependent) variable (Y) as a function of
one predictor (or independent) variable (X), as
Yi = β0 + β1Xi + εi
where
Y = observed value of the dependent variable
X = observed value of the independent variable
β0 = true regression intercept or the value of the response variable when X is zero
β1 = true regression slope or the change (increase if positive or decrease if negative) in the
response variable brought about by an increase of one unit in the independent variable
εi = random error component which captures all other factors affecting the response variable
but were not included in the model

Estimation of the Parameters β0 and β1:


The values of the parameters in the regression equation or model are often unknown. The
common practice is to take sample observations and estimate the parameters from these sample data.
The estimate of the parameter β1 is the statistic b1 and is given by

$$b_1 = \frac{\sum x_i y_i - \dfrac{\left(\sum x_i\right)\left(\sum y_i\right)}{n}}{\sum x_i^2 - \dfrac{\left(\sum x_i\right)^2}{n}}$$

The estimate of the parameter β0, on the other hand, is given by the statistic b0 where

$$b_0 = \bar{y} - b_1 \bar{x}$$

Example:
1. A corporation administers an aptitude test to all new sales representatives. Management is interested in the
extent to which this test is able to predict their eventual success. The accompanying table records average
weekly sales (in thousands of pesos) and aptitude test scores for a random sample of eight representatives.

Test Scores 55 60 85 75 80 85 65 60
Weekly Sales 10 12 28 24 18 16 15 12

a.) Estimate the linear regression of weekly sales on aptitude test scores.
b.) Interpret the estimated slope of the regression line.
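
The formulas for b1 and b0 can be sketched in Python using the aptitude test data from the table above, with test score as X and weekly sales as Y. The function name `fit_line` is hypothetical, and the code only estimates the coefficients; interpreting the slope is left to the reader.

```python
def fit_line(x, y):
    """Least-squares estimates (b0, b1) for the line y = b0 + b1 * x."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired and have at least 2 values")
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    # Slope: b1 = (sum(xy) - sum(x)sum(y)/n) / (sum(x^2) - (sum(x))^2/n)
    b1 = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    # Intercept: b0 = mean(y) - b1 * mean(x)
    b0 = sum_y / n - b1 * sum_x / n
    return b0, b1

test_scores = [55, 60, 85, 75, 80, 85, 65, 60]     # X
weekly_sales = [10, 12, 28, 24, 18, 16, 15, 12]    # Y, in thousands of pesos

b0, b1 = fit_line(test_scores, weekly_sales)
print(f"estimated line: sales = {b0:.4f} + {b1:.4f} * score")
```

Once b0 and b1 are available, a predicted value for any test score within the observed range is obtained by substituting it into the estimated line.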

2. The IQ test scores and freshmen algebra grades of a sample of students were recorded and are given in
the following table. Find the regression equation and draw the regression line. What could be the algebra
grade of a student with IQ score of 88?

Student 1 2 3 4 5 6 7 8 9 10
IQ Test Score 80 75 90 105 97 85 92 100 94 78
Algebra 79
83 80 88 90 89 82 88 91 87
Grade
