Data Description
CHAPTER 3
UNIVERSITY OF ANTIQUE
June 26, 2021
March 2021
Introduction
In the last chapter, you gained useful information from raw data by organizing and
presenting them in charts. This chapter shows you statistical methods that can be used to
summarize data. The most familiar of these methods is the finding of averages. Measures of
average are also called measures of central tendency. In addition to knowing the average, you
must know how the data values are dispersed. The measures that determine the spread of data
values are called measures of variation, or measures of dispersion. Finally, another set of measures
is necessary to describe data. These measures are called measures of position. They tell
where a specific data value falls within the data set, or its relative position in comparison with
other data values.
At the end of this lesson, you should be able to:
1. Describe the uses of the measures of central tendency.
2. Compute and interpret the mean, median, and mode.
3. Discuss the properties of the mean, median, and mode.
4. Define and interpret the results of any measure of variability.
5. Determine the properties of the normal curve, the areas under the normal curve, and the corresponding z-scores.
6. Interpret results using the normal distribution, skewness, and kurtosis.
Properties of Mean
• Used when the data is interval or ratio.
• It is the layman’s concept of the average.
• Used when the distribution is normal or is not badly skewed. The most reliable
measure of central tendency.
• The mean is found by using all the values of the data.
• The mean varies less than the median or mode when samples are taken from the
same population and all three measures are computed for these samples.
• The mean is used in computing other statistics, such as the variance.
• The mean for the data set is unique and not necessarily one of the data values.
• The mean cannot be computed for the data in a frequency distribution that has an
open-ended class.
• The mean is affected by extremely high or low values, called outliers, and may not
be the appropriate average to use in these situations.
The mean of ungrouped data is found by adding all the scores and dividing the sum by
the number of scores in the data. In symbols,
$$\bar{X} = \frac{X_1 + X_2 + \dots + X_n}{n} = \frac{\sum_{i=1}^{n} X_i}{n}.$$
For example, the mean of 5, 7, 9, 10, 12, and 15 is
$$\bar{X} = \frac{5 + 7 + 9 + 10 + 12 + 15}{6} = \frac{58}{6} \approx 9.67.$$
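The same computation can be checked with a few lines of Python. This is only a minimal sketch; the variable names are chosen for illustration and are not part of the module.

```python
from statistics import mean

scores = [5, 7, 9, 10, 12, 15]

# Mean by the definition: sum of the scores divided by the number of scores.
x_bar = sum(scores) / len(scores)

# The standard library function should agree with the hand computation.
library_mean = mean(scores)

print(round(x_bar, 2))         # 9.67
print(round(library_mean, 2))  # 9.67
```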
There are times when each number carries a certain weight. For example, suppose you are
asked to determine your mean grade in the first semester. Since every course has a
weight, the mean is found by taking the sum of the products of each number and its weight
and dividing it by the total weight.
Suppose $X_1, X_2, X_3, \dots, X_n$ are the scores and their respective weights are
$w_1, w_2, w_3, \dots, w_n$. The weighted mean of the scores is defined as
$$\bar{X}_w = \frac{X_1 w_1 + X_2 w_2 + \dots + X_n w_n}{w_1 + w_2 + w_3 + \dots + w_n}$$
Let us take John’s grades last semester:
Subject       Grade (X)    Units (w)
Calculus      1.5          5
Filipino      2.0          3
Statistics    1.8          3
P.E.          1.3          2
NSTP          1.0          0
In this data, grades are the scores and units are the weights. The weighted mean of
John’s grades is computed as follows:
$$\bar{X}_w = \frac{(1.5)(5) + (2.0)(3) + (1.8)(3) + (1.3)(2) + (1.0)(0)}{5 + 3 + 3 + 2 + 0} = \frac{21.5}{13} \approx 1.65.$$
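The weighted mean of John’s grades can also be sketched in Python. The list layout and variable names below are illustrative assumptions; only the grades and units come from the table.

```python
# (subject, grade, units) rows taken from John's table.
grades = [
    ("Calculus", 1.5, 5),
    ("Filipino", 2.0, 3),
    ("Statistics", 1.8, 3),
    ("P.E.", 1.3, 2),
    ("NSTP", 1.0, 0),
]

# Numerator: sum of each grade multiplied by its weight (units).
weighted_sum = sum(grade * units for _, grade, units in grades)

# Denominator: total weight.
total_units = sum(units for _, _, units in grades)

weighted_mean = weighted_sum / total_units
print(f"{weighted_sum:.1f} / {total_units} = {weighted_mean:.2f}")  # 21.5 / 13 = 1.65
```

Note that NSTP has a weight of 0, so it contributes nothing to either the numerator or the denominator.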
The Median
The median is the middlemost score in the distribution. It divides the distribution into
an upper 50% and a lower 50%. Finding the median requires arranging the scores in
either ascending or descending order. If the number of scores (n) is odd, the median is the
middle value. If n is even, the median of the distribution is the average of the two middle scores
in the ordered list. There are several symbols for the median, among them MD,
Mdn, Med, and $\tilde{X}$. In this module, we use Mdn for one simple reason: it
is the symbol suggested by the American Psychological Association (APA).
Properties of the Median
• The median is used to find the centre or middle value of a data set.
• Is not amenable to algebraic manipulation
To make it easier to find the position of the median in an ordered set of values, the
following formula is used:
$$\text{Position of median} = \frac{n + 1}{2} \quad \text{(where } n \text{ is the number of scores)}$$
Let us try to find the median of the following distributions:
Example 1: 4, 6, 2, 8, 10, 7, 8, 9, 9, 3, 5
Solution:
Step 1. Arrange the scores. 2, 3, 4, 5, 6, 7, 8, 8, 9, 9, 10
Step 2. Select the middlemost score. Since there are 11 scores, the position of the median is
$$\text{Position of median} = \frac{11 + 1}{2} = \frac{12}{2} = 6,$$
which means that the median is in the 6th rank of the ordered list
2, 3, 4, 5, 6, 7, 8, 8, 9, 9, 10.
Step 3. Identify the median in the data set. Mdn = 7
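When the number of scores is even, Step 3 averages the two middle scores. With middle scores 19 and 20:
Step 3. Identify the median in the data set. Mdn = $\frac{19 + 20}{2} = \frac{39}{2} = 19.5$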
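The steps for finding the median can be sketched as a short Python function. The function name `median_by_position` and the even-length data set are assumptions made for illustration; the even-length scores were chosen only so that the two middle scores are 19 and 20.

```python
from statistics import median


def median_by_position(scores):
    """Find the median by ordering the scores and using the (n + 1) / 2 rule."""
    ordered = sorted(scores)           # Step 1: arrange the scores
    n = len(ordered)
    if n % 2 == 1:
        position = (n + 1) // 2        # Step 2: 1-based rank of the middle score
        return ordered[position - 1]   # Step 3: read off the median
    # For even n, (n + 1) / 2 falls between two ranks, so average them.
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return (lower + upper) / 2


example_1 = [4, 6, 2, 8, 10, 7, 8, 9, 9, 3, 5]
print(median_by_position(example_1))    # 7

even_scores = [18, 19, 20, 21]          # hypothetical scores, not from the module
print(median_by_position(even_scores))  # 19.5
```

The standard library function `statistics.median` follows the same rule and can be used to check the result.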
The Mode
The mode is the most frequent score in the distribution. It is used when the data are
nominal. If the data set is not too large, one can determine the modal score by mere inspection.
Like the mean and median, the mode has a variety of symbols. The most common are $\hat{x}$ and
Mo. In this module, we use Mo as the symbol for the mode.
If there is only one mode, the distribution is unimodal. If there are two modes the
distribution is bimodal. If there are three modes, the distribution is trimodal. If there are four
or more modes, the distribution is multimodal or polymodal. If there is no mode, the
distribution is called a rectangular distribution.
Properties of Mode
• The mode is used when the most typical case is desired.
• The mode is the easiest average to compute.
• The mode can be used when the data are nominal or categorical
• The mode is not always unique. A data set can have more than one mode, or the
mode may not exist for a data set
• Always located at the peak of the distribution
• Not unduly affected by extreme values
• Very unstable value
The mode of ungrouped data is the value or values that occur most frequently. It can be
found by mere inspection. For example, let us find the mode of the following scores:
(a) 3, 4, 6, 7, 7, 7, 8, 8, 9, 10: the mode is 7. (b) 10, 9, 15, 10, 8, 11, 7, 12, 11, 5, 10: the
mode is 10, since 10 occurs three times while 11 occurs only twice.
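For larger data sets, inspection can be replaced by a short Python check. This is a minimal sketch using `statistics.multimode`, which returns every value tied for the highest frequency; the second data set is hypothetical and is included only to show a bimodal case.

```python
from statistics import multimode

# Data set (a) from the example.
scores_a = [3, 4, 6, 7, 7, 7, 8, 8, 9, 10]
print(multimode(scores_a))        # [7]

# Hypothetical bimodal data set: 4 and 6 each occur twice.
scores_bimodal = [2, 4, 4, 5, 6, 6, 9]
print(multimode(scores_bimodal))  # [4, 6]
```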
Range
Range is the crudest measure of dispersion. It is the difference between the highest and
the lowest scores in the data set. This means that the range considers only two scores, which
makes it the most unstable measure of dispersion. For ungrouped data, the range is
$$R = H - L$$
where R is the range, H is the highest score, and L is the lowest score.
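As a quick illustration, the range of the scores used in Example 1 of the median section can be computed in Python. The variable names are illustrative only.

```python
scores = [4, 6, 2, 8, 10, 7, 8, 9, 9, 3, 5]

highest = max(scores)   # H
lowest = min(scores)    # L
score_range = highest - lowest

print(score_range)      # 8
```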
For the standard deviation, since it is the square root of the variance, the formulas for the
population and sample standard deviations are:
Population standard deviation:
$$\sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}}$$
Sample standard deviation:
$$s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$$
Since the variance and standard deviation are measures of variability
or spread, they are interpreted as follows: the lower the value, the more clustered the scores are,
and the higher the value, the more spread out the scores are.
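The two standard deviation formulas differ only in the divisor (N for a population, n − 1 for a sample). The sketch below applies both to the scores from the mean example; the function names are assumptions made for illustration, and the results are compared with `statistics.pstdev` and `statistics.stdev`.

```python
from math import sqrt
from statistics import pstdev, stdev

scores = [5, 7, 9, 10, 12, 15]


def population_sd(values):
    """Square root of the sum of squared deviations divided by N."""
    mu = sum(values) / len(values)
    return sqrt(sum((x - mu) ** 2 for x in values) / len(values))


def sample_sd(values):
    """Square root of the sum of squared deviations divided by n - 1."""
    x_bar = sum(values) / len(values)
    return sqrt(sum((x - x_bar) ** 2 for x in values) / (len(values) - 1))


print(round(population_sd(scores), 2))  # 3.25
print(round(sample_sd(scores), 2))      # 3.56
```

The sample value is slightly larger because the same sum of squared deviations is divided by a smaller number.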
1. As previously stated, variances and standard deviations can be used to determine the
spread of the data. If the variance or standard deviation is large, the data are more
dispersed. This information is useful in comparing two (or more) data sets to determine
which is more (most) variable.
2. The measures of variance and standard deviation are used to determine the consistency of
a variable. For example, in the manufacture of fittings, such as nuts and bolts, the
variation in the diameters must be small, or the parts will not fit together.
3. The variance and standard deviation are used to determine the number of data values
that fall within a specified interval in a distribution.
4. Finally, the variance and standard deviation are used quite often in inferential statistics.
Coefficient of Variation
Whenever two samples have the same units of measure, the variance and standard
deviation for each can be compared directly. A statistic that allows you to compare standard
deviations when the units are different is called the coefficient of variation.
The standard deviation or variance is not a reliable measure for comparing the spread of two
data sets when the two sets have different units, or have the same units but widely
dissimilar means. The coefficient of variation was developed to answer
this kind of problem. The formula for the coefficient of variation is given below:
$$CV = \frac{s}{\bar{X}}$$
where CV is the coefficient of variation, s is the standard deviation, and $\bar{X}$ is the mean.
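To see how the coefficient of variation compares data sets with different units, consider the sketch below. Both data sets are hypothetical and exist only for illustration; the function follows the formula CV = s / X̄ using the sample standard deviation.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """CV = sample standard deviation divided by the mean."""
    return stdev(values) / mean(values)


heights_cm = [150, 160, 165, 170, 180]  # hypothetical heights in centimetres
weights_kg = [45, 55, 60, 70, 95]       # hypothetical weights in kilograms

cv_heights = coefficient_of_variation(heights_cm)
cv_weights = coefficient_of_variation(weights_kg)

print(f"CV of heights: {cv_heights:.3f}")
print(f"CV of weights: {cv_weights:.3f}")
```

Because the CV has no units, the two values can be compared directly; in this made-up data the weights are relatively more variable than the heights.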
Standard Scores
A standard score or z score tells how many standard deviations a data value is above or
below the mean for a specific distribution of values. If a standard score is zero, then the data
value is the same as the mean.
A z score or standard score for a value is obtained by subtracting the mean from the
value and dividing the result by the standard deviation. The symbol for a standard score is z.
The formula is
$$z = \frac{\text{value} - \text{mean}}{\text{standard deviation}}$$
For samples, the formula is
$$z = \frac{X - \bar{X}}{s}$$
For populations, the formula is
$$z = \frac{X - \mu}{\sigma}$$
The z score represents the number of standard deviations that a data value falls above
or below the mean.
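The z score formula translates directly into code. In the sketch below, the mean of 80 and standard deviation of 5 are assumed values chosen for illustration.

```python
def z_score(value, mean, standard_deviation):
    """Number of standard deviations that value lies above or below the mean."""
    return (value - mean) / standard_deviation


# Assumed distribution: mean 80, standard deviation 5.
print(z_score(85, 80, 5))  # 1.0  (one standard deviation above the mean)
print(z_score(80, 80, 5))  # 0.0  (exactly at the mean)
print(z_score(70, 80, 5))  # -2.0 (two standard deviations below the mean)
```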
Percentiles
Percentiles are position measures used in educational and health-related fields to
indicate the position of an individual in a group.
Percentiles divide the data set into 100 equal groups. They are used to compare an
individual’s test score with the national norm.
Percentiles are not the same as percentages. That is, if a student gets 72 correct answers
out of a possible 100, she obtained a percentage score of 72. There is no indication of her
position with respect to the rest of the class. On the other hand, if a raw score of 72
Percentile Formula
In addition to dividing the data set into four groups, quartiles can be used as a rough
measurement of variability. The interquartile range (IQR) is defined as the difference
between $Q_3$ and $Q_1$, that is, $IQR = Q_3 - Q_1$, and is the range of the middle 50% of the data.
The interquartile range is used to identify outliers, and it is also used as a measurement
of variability in exploratory data analysis.
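The interquartile range can be sketched in Python with `statistics.quantiles`. Textbooks and software use slightly different conventions for locating quartiles, so treat the default method used here as an assumption rather than the module's own procedure. The data are the ordered scores from Example 1 of the median section.

```python
from statistics import quantiles

scores = [2, 3, 4, 5, 6, 7, 8, 8, 9, 9, 10]

# n=4 asks for the three cut points Q1, Q2, Q3 (default "exclusive" method).
q1, q2, q3 = quantiles(scores, n=4)

iqr = q3 - q1
print(q1, q2, q3)  # 4.0 7.0 9.0
print(iqr)         # 5.0
```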
Deciles divide the distribution into 10 groups. They are denoted by $D_1$, $D_2$, etc.
Note that $D_1$ corresponds to $P_{10}$; $D_2$ corresponds to $P_{20}$; etc. Deciles can be found by
using the formulas given for percentiles.
Taken altogether then, these are the relationships among percentiles, deciles, and
quartiles.
Deciles are denoted by $D_1, D_2, D_3, \dots, D_9$, and they correspond to $P_{10}, P_{20}, P_{30}, \dots, P_{90}$.
Quartiles are denoted by $Q_1, Q_2, Q_3$, and they correspond to $P_{25}, P_{50}, P_{75}$.
The median is the same as $P_{50}$, $Q_2$, or $D_5$.
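These correspondences can be checked numerically. The sketch below uses `statistics.quantiles` with its default method, which is an assumption about how the cut points are located; the relationships hold as long as the same method is used for percentiles, deciles, and quartiles. The data are the ordered scores from Example 1 of the median section.

```python
from math import isclose
from statistics import median, quantiles

scores = [2, 3, 4, 5, 6, 7, 8, 8, 9, 9, 10]

percentiles = quantiles(scores, n=100)  # P1 ... P99
deciles = quantiles(scores, n=10)       # D1 ... D9
quartiles = quantiles(scores, n=4)      # Q1, Q2, Q3

# D1 corresponds to P10, D2 to P20, and so on up to D9 and P90.
for k in range(1, 10):
    assert isclose(deciles[k - 1], percentiles[10 * k - 1])

# Q1, Q2, Q3 correspond to P25, P50, P75.
for k, p in zip((1, 2, 3), (25, 50, 75)):
    assert isclose(quartiles[k - 1], percentiles[p - 1])

# The median is the same as P50, Q2, and D5.
print(median(scores), percentiles[49], quartiles[1], deciles[4])  # 7 7.0 7.0 7.0
```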
Skewness
No variable fits a normal distribution perfectly, since a normal distribution is a
theoretical distribution. However, a normal distribution can be used to describe many variables,
because the deviations from a normal distribution are very small.
When the data values are evenly distributed about the mean, a distribution is said to
be a symmetric distribution. When the majority of the data values fall to the left or right of
the mean, the distribution is said to be skewed.
When the majority of the data values fall to the right of the mean, the distribution is
said to be a negatively or left-skewed distribution. The mean is to the left of the median,
and the mean and the median are to the left of the mode. mean<median<mode
When the majority of the data values fall to the left of the mean, a distribution is said
to be a positively or right-skewed distribution. The mean falls to the right of the median, and
both the mean and the median fall to the right of the mode. mean>median>mode
The “tail” of the curve indicates the direction of skewness (right is positive, left is negative).
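The ordering of the mean, median, and mode gives a rough check of the direction of skewness. The sketch below is not a formal skewness coefficient; the function name and both data sets are hypothetical and chosen only to illustrate the two orderings.

```python
from statistics import mean, median, mode


def describe_skew(values):
    """Rough skew check based on the ordering of mean, median, and mode."""
    m, mdn, mo = mean(values), median(values), mode(values)
    if m > mdn > mo:
        return "positively (right) skewed"
    if m < mdn < mo:
        return "negatively (left) skewed"
    return "no clear skew by this comparison"


right_tail = [1, 2, 2, 2, 3, 3, 4, 5, 9, 14]        # hypothetical data
left_tail = [1, 6, 10, 11, 12, 12, 13, 13, 13, 14]  # hypothetical data

print(describe_skew(right_tail))  # positively (right) skewed
print(describe_skew(left_tail))   # negatively (left) skewed
```

In the first data set the mean is 4.5, the median is 3, and the mode is 2; in the second the mean is 10.5, the median is 12, and the mode is 13.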
Kurtosis
Kurtosis is associated with the tallness rather than the flatness or weakness of the
distribution. It is also a measure that describes the tail of the distribution in relation to its
overall shape. There are three types of kurtosis. The Mesokurtic Distribution has a
kurtosis similar to that of the normal distribution. This means that the extreme value
characteristics of the distribution are the same as those of the normal distribution.
The Leptokurtic Distribution is a kind of distribution that has kurtosis greater than the
normal. Lepto means thin or skinny. Generally, the leptokurtic curve is characterized by a
narrow or thin curve that is taller than the normal. However, its thin shape is only a
consequence of the tails of the distribution which stretch along the horizontal axis. This
happens when occasional extreme outliers appear in the distribution.
Normal Distribution
A normal distribution is a continuous, symmetric, bell-shaped distribution of a
variable.
Properties of the Theoretical Normal Distribution
1. A normal distribution curve is bell-shaped
2. The mean, median, and mode are equal and are located at the centre of the
distribution.
3. A normal distribution curve is unimodal.
4. The curve is symmetric about the mean, which is equivalent to saying that its shape
is the same on both sides of a vertical line passing through the centre.
5. The curve is continuous; that is, there are no gaps or holes. For each value of X, there
is a corresponding value of Y.
6. The curve never touches the x axis, but it gets increasingly closer.
7. The total area under the normal distribution curve is equal to 1.00 or 100%.
8. The area under the part of a normal curve that lies within 1 standard deviation of
the mean is approximately 0.68, or 68%; within 2 standard deviations, about 0.95 or 95%;
and within 3 standard deviations, about 0.997, or 99.7%.
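The areas in property 8 can be verified with the standard normal distribution in Python. This is a minimal sketch using `statistics.NormalDist`; the helper name `area_within` is an assumption made for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def area_within(k):
    """Area under the normal curve within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, round(area_within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```

Because a z score measures distance from the mean in standard deviations, the same areas apply to any normal distribution, whatever its mean and standard deviation.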
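[Figure: normal distribution curve with the horizontal axis marked 65, 70, 75, 80, 85, 90, 95; the figure also shows the computation (80 − 80)/5 and the z values −1.96 and 1.96.]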