Data Description


CHAPTER 3
UNIVERSITY OF ANTIQUE - 2June 26, 2021
March 2021

Introduction
In the last chapter, you gained useful information from raw data by organizing and
presenting them in charts. This chapter will show you statistical methods that can be used to
summarize data. The most familiar of these methods is the finding of averages. Measures of
average are also called measures of central tendency. In addition to knowing the average, you
must know how the data values are dispersed. The measures that determine the spread of data
values are called measures of variation, or measures of dispersion. Finally, another set of measures
is necessary to describe data. These measures are called measures of position. They tell
where a specific data value falls within the data set, or its relative position in comparison with
other data values.
At the end of this lesson, you should be able to:
1. Describe the uses of the measures of central tendency;
2. Compute and interpret the mean, median, and mode;
3. Discuss the properties of the mean, median, and mode;
4. Define and interpret the results of any measure of variability;
5. Determine the properties of the normal curve, the areas under the normal curve, and the corresponding z-scores; and
6. Interpret results using the normal distribution, skewness, and kurtosis.

Lesson 1. Measures of Central Tendency


Statistical methods are needed for summarizing and describing gathered numerical
data. Looking at tables and graphs alone, however, one can have difficulty describing
the entire set of data. It is better if we can choose a single score that represents the
entire data set. This value is helpful in making decisions based on the data
collected. The measures of central tendency are used to select the central value that could
represent the entire set of data, thus helping the investigator in decision making.

WEEK 3 - DR. M. U, MAGBANUA 1


The Mean
The mean is the arithmetic average of all the scores in the data set. This is the most
frequently used measure of central tendency. It is used when the data are either interval or
ratio.
The symbol $\bar{X}$, read as “X bar”, represents the sample mean, while $\mu$, read as “mu”,
represents the population mean.

Properties of Mean
• Used when the data is interval or ratio.
• It is the layman’s concept of the average.
• Used when the distribution is normal or is not badly skewed. The most reliable
measure of central tendency.
• The mean is found by using all the values of the data
• The mean varies less than the median or mode when samples are taken from the
same population and all three measures are computed for these samples.
• The mean is used in computing other statistics, such as the variance.
• The mean for the data set is unique and not necessarily one of the data values.
• The mean cannot be computed for the data in a frequency distribution that has an
open-ended class.
• The mean is affected by extremely high or low values, called outliers, and may not
be the appropriate average to use in these situations.

The Mean of Ungrouped Data

The mean of ungrouped data is found by adding all the scores and
dividing the sum by the number of scores. In symbols,
$$\bar{X} = \frac{X_1 + X_2 + \dots + X_n}{n} = \frac{\sum_{i=1}^{n} X_i}{n}.$$
For example, the mean of 5, 7, 9, 10, 12, and 15 is
$$\bar{X} = \frac{5 + 7 + 9 + 10 + 12 + 15}{6} = \frac{58}{6} \approx 9.67.$$
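The computation above can be sketched in a few lines of Python. This is only an illustration; the function name `mean` and the variable names are assumptions made for this example, not part of the module.

```python
# Minimal sketch: arithmetic mean of ungrouped data.
# The names `scores`, `mean`, and `x_bar` are illustrative assumptions.

def mean(values):
    """Return the sum of the values divided by how many values there are."""
    if not values:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / len(values)

scores = [5, 7, 9, 10, 12, 15]   # the six scores from the example
x_bar = mean(scores)             # 58 / 6
print(round(x_bar, 2))
```

Because every score enters the sum, a single extreme score changes the result, which is why the mean is sensitive to outliers.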

The Weighted Mean

Sometimes each score carries a certain weight. For example, suppose you are asked to
determine your mean grade for the first semester. Since every course has a
weight, the mean is found by getting the sum of the products of each score and its weight
and dividing it by the total weight.
Suppose $X_1, X_2, X_3, \dots, X_n$ are the scores and $w_1, w_2, w_3, \dots, w_n$ are their
respective weights. The weighted mean of the scores is defined as
$$\bar{X} = \frac{X_1 w_1 + X_2 w_2 + \dots + X_n w_n}{w_1 + w_2 + w_3 + \dots + w_n}.$$
Let us take John’s grades last semester:
Subject       Grade (X)    Units (w)
Calculus      1.5          5
Filipino      2.0          3
Statistics    1.8          3
P.E.          1.3          2
NSTP          1.0          0

In this data, the grades are the scores and the units are the weights. The weighted mean of
John’s grades is computed as follows:
$$\bar{X} = \frac{(1.5)(5) + (2.0)(3) + (1.8)(3) + (1.3)(2) + (1.0)(0)}{5 + 3 + 3 + 2 + 0} = \frac{21.5}{13} \approx 1.65.$$
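As a rough check of the weighted mean, the sketch below pairs each grade with its units. The function name `weighted_mean` is an assumption made for this example.

```python
# Minimal sketch: weighted mean of John's grades.
# `weighted_mean`, `grades`, and `units` are illustrative names.

def weighted_mean(values, weights):
    """Sum of value * weight divided by the total weight."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("the weights must not sum to zero")
    return sum(x * w for x, w in zip(values, weights)) / total_weight

grades = [1.5, 2.0, 1.8, 1.3, 1.0]   # Calculus, Filipino, Statistics, P.E., NSTP
units = [5, 3, 3, 2, 0]              # weight of each subject

wm = weighted_mean(grades, units)    # 21.5 / 13
print(round(wm, 2))
```

A subject with zero units, such as NSTP here, contributes nothing to either the numerator or the denominator.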

The Median
The median is the middlemost score in the distribution. It divides the distribution into
an upper 50% and a lower 50%. Finding the median requires arranging the
scores in either ascending or descending order. If the number of scores (n) is odd, the median is the
middle value. If n is even, the median of the distribution is the average of the two middle scores
in the ordered list. There are various symbols for the median, including MD,
Mdn, Med, and $\tilde{X}$. In this module we will use Mdn for one simple reason: it
is the symbol suggested by the American Psychological Association (APA).
Properties of the Median
• The median is used to find the centre or middle value of a data set.
• It is not amenable to algebraic manipulation.
• It is used when the distribution is grossly asymmetrical or skewed
• In an open-ended distribution, the median is the most reliable measure of the central
tendency
• It is used when the data is ordinal
• The median is used when it is necessary to find out whether the data values fall into
the upper half or lower half of the distribution.
• The median is affected less than the mean by extremely high or extremely low values.

Median of Ungrouped Data

To determine the median of ungrouped data, we take these steps:

Step 1. Arrange the data set in ascending or descending order.
Step 2. If n is (a) odd, the median is the middlemost value of the ordered values; (b)
even, the median is the average of the two middlemost values.

To make it easier to find the position of the median in an ordered set of values, the
following formula is used:
$$\text{Position of median} = \frac{n + 1}{2},$$
where n is the number of scores.
Let us try to find the median of the following distributions:

Example 1: 4, 6, 2, 8, 10, 7, 8, 9, 9, 3, 5
Solution:
Step 1. Arrange the scores. 2, 3, 4, 5, 6, 7, 8, 8, 9, 9, 10
Step 2. Locate the middlemost score. Since there are 11 scores, the position of the median is
$$\text{Position of median} = \frac{11 + 1}{2} = \frac{12}{2} = 6,$$
so the median is in the 6th rank.
Step 3. Identify the median in the data set. The 6th score in the ordered list is 7, so Mdn = 7.

Example 2: 4, 10, 13, 16, 19, 20, 25, 35, 40, 40


Solution:
Step 1. Arrange the scores. 4, 10, 13, 16, 19, 20, 25, 35, 40, 40
Step 2. Locate the middlemost score. Since there are 10 scores, the position of the median is
$$\text{Position of median} = \frac{10 + 1}{2} = \frac{11}{2} = 5.5,$$
so the median lies halfway between the 5th and 6th scores, which are 19 and 20.
Step 3. Identify the median in the data set.
$$\text{Mdn} = \frac{19 + 20}{2} = \frac{39}{2} = 19.5$$
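The two examples can be reproduced with a short Python sketch. The function name `median` and the 0-based indexing are details of this illustration; the module itself counts ranks from 1.

```python
# Minimal sketch: median of ungrouped data for odd and even n.
# `median`, `example_1`, and `example_2` are illustrative names.

def median(values):
    """Middle value of the ordered data, or the average of the two middle values."""
    ordered = sorted(values)          # Step 1: arrange the scores
    n = len(ordered)
    mid = n // 2                      # 0-based index of the upper middle score
    if n % 2 == 1:                    # odd n: rank (n + 1) / 2
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # even n: average the two middle scores

example_1 = [4, 6, 2, 8, 10, 7, 8, 9, 9, 3, 5]        # n = 11
example_2 = [4, 10, 13, 16, 19, 20, 25, 35, 40, 40]   # n = 10

print(median(example_1))
print(median(example_2))
```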

The Mode
The mode is the most frequent score in the distribution. It is used when the data are
nominal. If the data set is not too large, one can determine the modal score by mere inspection.
Like the mean and median, the mode has a variety of symbols. The most common are $\hat{x}$ and
Mo. In this module we will use Mo as the symbol for the mode.
If there is only one mode, the distribution is unimodal. If there are two modes, the
distribution is bimodal. If there are three modes, the distribution is trimodal. If there are four
or more modes, the distribution is multimodal or polymodal. If there is no mode, the
distribution is called a rectangular distribution.
Properties of Mode
• The mode is used when the most typical case is desired.
• The mode is the easiest average to compute.
• The mode can be used when the data are nominal or categorical
• The mode is not always unique. A data set can have more than one mode, or the
mode may not exist for a data set
• Always located at the peak of the distribution
• Not unduly affected by extreme values
• Very unstable value

Mode of Ungrouped Data

The mode of ungrouped data is the value or values that occur most frequently. It can be
found by mere inspection. For example, let us find the mode of the following scores:
(a) 3, 4, 6, 7, 7, 7, 8, 8, 9, 10: the mode is 7. (b) 10, 9, 15, 10, 8, 11, 7, 12, 11, 5, 10: the
mode is 10, which occurs three times (11 occurs only twice).
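Counting by inspection becomes error-prone as the data set grows, so a frequency count can serve as a check. The sketch below is an illustration only: the function name `modes` and the bimodal list are assumptions, and treating "every value equally frequent" as "no mode" is one common convention rather than a rule stated in this module.

```python
# Minimal sketch: find the mode(s) of ungrouped data with a frequency count.
from collections import Counter

def modes(values):
    """Return the most frequent value(s) in sorted order, or [] if there is no mode."""
    counts = Counter(values)
    highest = max(counts.values())
    # Assumed convention: if all distinct values are equally frequent, report no mode.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)

set_a = [3, 4, 6, 7, 7, 7, 8, 8, 9, 10]   # data set (a) from the text
bimodal = [1, 1, 2, 2, 3]                 # hypothetical data with two modes
flat = [1, 2, 3, 4]                       # hypothetical data with no mode

print(modes(set_a))
print(modes(bimodal))
print(modes(flat))
```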

Lesson 2. Measure of Variability
In the previous lesson, you learned how to compute the mean, median, and mode. These
measures of centrality only tell you which score could best represent
the entire set of data. If you want to determine the spread of the scores, the measures of
variability can address that query. Thus, in this lesson you will learn the different kinds
of measures of variability, sometimes called measures of dispersion.
Closely grouped data have relatively small measures of dispersion, and more widely spread out data
have larger ones. The closest possible grouping occurs when the data have no dispersion
(all data are the same value); in this situation, the measure of dispersion will be zero. There is
no limit to how widely spread out the data can be.

Range
Range is the crudest measure of dispersion. It is the difference between the highest and
the lowest scores in the data set. This means that the range considers only two scores, which
makes it the most unstable measure of dispersion. For ungrouped data, the range is
$$R = H - L,$$
where R is the range, H is the highest score, and L is the lowest score.

Variance and Standard Deviation


Variance is defined as the average squared deviation from the mean, while standard
deviation is the square root of the variance.
Population variance: $$\sigma^2 = \frac{\sum (X - \mu)^2}{N}$$
Sample variance: $$s^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$$
Where: $\sigma^2$ is the population variance
$s^2$ is the sample variance
$X$ is an individual score
$\mu$ is the population mean
$\bar{X}$ is the sample mean
$N$ is the number of scores (population)
$n$ is the number of scores (sample)

Standard Deviation of Ungrouped Data

Since the standard deviation is the square root of the variance, the formulas for the
population and sample standard deviations are
$$\sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}} \quad \text{and} \quad s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}.$$
Since the variance and standard deviation are measures of variability
or spread, they are interpreted as follows: the lower the value, the more clustered the scores are;
the higher the value, the more spread out the scores are.
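To see the range, variance, and standard deviation side by side, here is a minimal Python sketch. The data set is hypothetical (chosen so the arithmetic is easy to follow), and the function names and the `sample` flag are assumptions made for this illustration.

```python
# Minimal sketch: range, variance, and standard deviation of ungrouped data.
import math

def value_range(values):
    """R = H - L: highest score minus lowest score."""
    return max(values) - min(values)

def variance(values, sample=True):
    """Sum of squared deviations divided by n - 1 (sample) or N (population)."""
    n = len(values)
    m = sum(values) / n
    squared_deviations = sum((x - m) ** 2 for x in values)
    return squared_deviations / (n - 1) if sample else squared_deviations / n

def std_dev(values, sample=True):
    """Square root of the variance."""
    return math.sqrt(variance(values, sample))

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical scores; the mean is 5

print(value_range(data))              # 9 - 2
print(variance(data, sample=False))   # 32 / 8
print(std_dev(data, sample=False))    # square root of 4
print(variance(data))                 # 32 / 7, the sample variance
```

Dividing by n − 1 instead of N makes the sample variance slightly larger than the population variance computed from the same numbers.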

Uses of the Variance and Standard Deviation

1. As previously stated, variances and standard deviations can be used to determine the
spread of the data. If the variance or standard deviation is large, the data are more
dispersed. This information is useful in comparing two (or more) data sets to determine
which is more (most) variable.
2. The measures of variance and standard deviation are used to determine the consistency of
a variable. For example, in the manufacture of fittings, such as nuts and bolts, the
variation in the diameters must be small, or the parts will not fit together.
3. The variance and standard deviation are used to determine the number of data values
that fall within a specified interval in a distribution.
4. Finally, the variance and standard deviation are used quite often in inferential statistics.

Coefficient of Variation
Whenever two samples have the same units of measure, the variance and standard
deviation for each can be compared directly. A statistic that allows you to compare standard
deviations when the units are different is called the coefficient of variation.
The standard deviation or variance is not a reliable measure for comparing the spread of two
data sets when the two sets have different units, or have the same units but widely
dissimilar means. The coefficient of variation was developed to address
this kind of problem. Its formula is
$$CV = \frac{s}{\bar{X}},$$
where CV is the coefficient of variation, s is the standard deviation, and $\bar{X}$ is the mean.
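A small sketch shows how the coefficient of variation lets two variables with different units be compared. The heights and weights below are hypothetical numbers invented for this example, and the function name is an assumption.

```python
# Minimal sketch: comparing spread across different units with CV = s / mean.
# All figures below are hypothetical.

def coefficient_of_variation(std_dev, mean):
    """Standard deviation divided by the mean (a unit-free ratio)."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return std_dev / mean

cv_height = coefficient_of_variation(std_dev=8.5, mean=170.0)   # centimetres
cv_weight = coefficient_of_variation(std_dev=6.0, mean=60.0)    # kilograms

print(round(cv_height, 2), round(cv_weight, 2))
# The larger ratio marks the more variable data set relative to its own mean.
print("weight varies more" if cv_weight > cv_height else "height varies more")
```

The ratio is often multiplied by 100 and reported as a percentage; that scaling does not change which data set is more variable.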

Lesson 3. Measures of Position
In addition to measures of central tendency and measures of variation, there are
measures of position or location. These measures include standard scores, percentiles, deciles,
and quartiles. They are used to locate the relative position of a data value in the data set. For
example, if a value is located at the 80th percentile, it means that 80% of the values fall below
it in the distribution and 20% of the values fall above it. The median is the value that
corresponds to the 50th percentile, since one-half of the values fall below it and one-half of
the values fall above it.

Standard Scores
A standard score or z score tells how many standard deviations a data value is above or
below the mean for a specific distribution of values. If a standard score is zero, then the data
value is the same as the mean.
A z score or standard score for a value is obtained by subtracting the mean from the
value and dividing the result by the standard deviation. The symbol for a standard score is z.
The formula is
$$z = \frac{\text{value} - \text{mean}}{\text{standard deviation}}.$$
For samples, the formula is $z = \frac{X - \bar{X}}{s}$.
For populations, the formula is $z = \frac{X - \mu}{\sigma}$.
The z score represents the number of standard deviations that a data value falls above
or below the mean.
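The formula translates directly into code. In this sketch the mean of 80 and standard deviation of 5 are hypothetical values chosen for illustration, as is the function name `z_score`.

```python
# Minimal sketch: standard (z) scores for a few hypothetical test scores.

def z_score(value, mean, std_dev):
    """Number of standard deviations that `value` lies above (+) or below (-) the mean."""
    if std_dev <= 0:
        raise ValueError("the standard deviation must be positive")
    return (value - mean) / std_dev

mean, s = 80, 5                      # hypothetical sample mean and standard deviation

z_above = z_score(90, mean, s)       # two standard deviations above the mean
z_below = z_score(72.5, mean, s)     # one and a half standard deviations below
z_at_mean = z_score(80, mean, s)     # a value equal to the mean

print(z_above, z_below, z_at_mean)
```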

Percentiles
Percentiles are position measures used in educational and health-related fields to
indicate the position of an individual in a group.
Percentiles divide the data set into 100 equal groups. They are used, for example, to compare an
individual’s test score with the national norm.
Percentiles are not the same as percentages. That is, if a student gets 72 correct answers
out of a possible 100, she obtained a percentage score of 72. There is no indication of her
position with respect to the rest of the class. On the other hand, if a raw score of 72
corresponds to the 64th percentile, then she did better than 64% of the students in her class.
Percentiles are symbolised by $P_1, P_2, P_3, \dots, P_{99}$ and divide the distribution into 100 groups.

Percentile Formula

The percentile corresponding to a given value X is computed by using the following
formula:
$$\text{Percentile} = \frac{(\text{number of values below } X) + 0.5}{\text{total number of values}} \cdot 100$$
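The percentile formula can be sketched as below. The ten scores are hypothetical, and the function name `percentile_rank` is an assumption made for this example.

```python
# Minimal sketch: percentile corresponding to a given value X.
# The data are hypothetical.

def percentile_rank(values, x):
    """((number of values below x) + 0.5) / (total number of values) * 100."""
    below = sum(1 for v in values if v < x)
    return (below + 0.5) * 100 / len(values)

scores = [18, 15, 12, 6, 8, 2, 3, 5, 20, 10]   # hypothetical test scores

rank_of_12 = percentile_rank(scores, 12)   # six scores fall below 12
rank_of_20 = percentile_rank(scores, 20)   # nine scores fall below 20

print(rank_of_12, rank_of_20)
```

A score of 12 therefore sits at the 65th percentile of this small data set, even though 12 is only 60% of the highest score, which illustrates the difference between a percentile and a percentage.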

Finding a Data Value Corresponding to a Given Percentile


Step 1 Arrange the data in order from lowest to highest.
Step 2 Substitute into the formula
$$c = \frac{n \cdot p}{100}$$
where: n is the total number of values
p is the percentile
Step 3A If c is not a whole number, round up to the next whole number. Starting at the
lowest value, count over to the number that corresponds to the rounded-up value.
Step 3B If c is a whole number, use the value halfway between the cth and (c+1)st values
when counting up from the lowest value.
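The steps above can be sketched as a function. The data set and the function name `value_at_percentile` are assumptions for this illustration, and the sketch restricts p to the range 1 to 99 so that the (c+1)st value always exists.

```python
# Minimal sketch: data value corresponding to a given percentile (Steps 1 to 3B).
import math

def value_at_percentile(values, p):
    """Return the data value at the p-th percentile using the c = n * p / 100 rule."""
    if not 0 < p < 100:
        raise ValueError("p must be between 1 and 99")
    ordered = sorted(values)                 # Step 1: lowest to highest
    n = len(ordered)
    c = n * p / 100                          # Step 2
    if not c.is_integer():                   # Step 3A: round up and count over
        position = math.ceil(c)
        return ordered[position - 1]         # ranks start at 1, list indexes at 0
    c = int(c)                               # Step 3B: halfway between cth and (c+1)st
    return (ordered[c - 1] + ordered[c]) / 2

scores = [18, 15, 12, 6, 8, 2, 3, 5, 20, 10]   # hypothetical test scores

p25 = value_at_percentile(scores, 25)   # c = 2.5, so use the 3rd value
p60 = value_at_percentile(scores, 60)   # c = 6, so average the 6th and 7th values

print(p25, p60)
```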

Quartiles and Deciles


Quartiles divide the distribution into four groups, separated by $Q_1$, $Q_2$, and $Q_3$. Note that
$Q_1$ is the same as the 25th percentile; $Q_2$ is the same as the 50th percentile, or median; and
$Q_3$ corresponds to the 75th percentile.

Finding Data Values Corresponding to $Q_1$, $Q_2$, and $Q_3$

Step 1 Arrange the data in order from lowest to highest.
Step 2 Find the median of the data values. This is the value for $Q_2$.
Step 3 Find the median of the data values that fall below $Q_2$. This is the value
for $Q_1$.
Step 4 Find the median of the data values that fall above $Q_2$. This is the value for $Q_3$.
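A sketch of the quartile steps follows. It splits the ordered data by position (the lower half and the upper half, leaving out the middle score when n is odd), which is one reading of "values that fall below or above the median" and is an assumption of this example. The data set and function names are also hypothetical.

```python
# Minimal sketch: Q1, Q2, and Q3 by the median-of-halves steps.
# Assumes at least two data values.

def median(values):
    """Middle value of the ordered data, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, Q2, Q3)."""
    ordered = sorted(values)                 # Step 1
    n = len(ordered)
    q2 = median(ordered)                     # Step 2
    lower_half = ordered[: n // 2]           # values positioned below the median
    upper_half = ordered[(n + 1) // 2 :]     # values positioned above the median
    return median(lower_half), q2, median(upper_half)   # Steps 3 and 4

data = [15, 13, 6, 5, 12, 50, 22, 18]   # hypothetical data
q1, q2, q3 = quartiles(data)

print(q1, q2, q3)
```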

In addition to dividing the data set into four groups, quartiles can be used as a rough
measurement of variability. The interquartile range (IQR) is defined as the difference
between $Q_3$ and $Q_1$ (IQR = $Q_3 - Q_1$) and is the range of the middle 50% of the data.
The interquartile range is used to identify outliers, and it is also used as a measurement
of variability in exploratory data analysis.

Deciles divide the distribution into 10 groups. They are denoted by $D_1$, $D_2$, etc.

Note that $D_1$ corresponds to $P_{10}$; $D_2$ corresponds to $P_{20}$; etc. Deciles can be found by
using the formulas given for percentiles.
Taken altogether then, these are the relationships among percentiles, deciles, and
quartiles.
Deciles are denoted by $D_1, D_2, D_3, \dots, D_9$, and they correspond to $P_{10}, P_{20}, P_{30}, \dots, P_{90}$.
Quartiles are denoted by $Q_1, Q_2, Q_3$, and they correspond to $P_{25}, P_{50}, P_{75}$.
The median is the same as $P_{50}$, $Q_2$, or $D_5$.

Summary of Position Measures

Measure                      Definition                                                                    Symbol(s)
Standard score or z score    Number of standard deviations that a data value is above or below the mean    z
Percentile                   Position in hundredths that a data value holds in the distribution            $P_n$
Decile                       Position in tenths that a data value holds in the distribution                $D_n$
Quartile                     Position in fourths that a data value holds in the distribution               $Q_n$

Outliers
A data set should be checked for extremely high or extremely low values. These values
are called outliers.
An outlier is an extremely high or an extremely low data value when compared with
the rest of the data values. It can strongly affect the mean and standard deviation of a
variable and can have an effect on other statistics as well.
There are several ways to check a data set for outliers. One method is as follows:
Step 1 Arrange the data in order and find $Q_1$ and $Q_3$.
Step 2 Find the interquartile range: IQR = $Q_3 - Q_1$.
Step 3 Multiply the IQR by 1.5.
Step 4 Subtract the value obtained in Step 3 from $Q_1$ and add it to $Q_3$.
Step 5 Check the data set for any data value that is smaller than $Q_1 - 1.5(\text{IQR})$ or larger
than $Q_3 + 1.5(\text{IQR})$.
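The five steps can be sketched as follows. The data set is hypothetical, the function names are assumptions, and the quartiles are found by splitting the ordered data into a lower and an upper half by position, which is one possible reading of the quartile procedure.

```python
# Minimal sketch: flag outliers with the 1.5 * IQR rule.
# Assumes at least two data values; the data are hypothetical.

def median(values):
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def find_outliers(values):
    """Return (outliers, lower_fence, upper_fence)."""
    ordered = sorted(values)                      # Step 1: order the data
    n = len(ordered)
    q1 = median(ordered[: n // 2])                # Q1: median of the lower half
    q3 = median(ordered[(n + 1) // 2 :])          # Q3: median of the upper half
    iqr = q3 - q1                                 # Step 2
    step = 1.5 * iqr                              # Step 3
    lower_fence = q1 - step                       # Step 4
    upper_fence = q3 + step
    outliers = [v for v in ordered if v < lower_fence or v > upper_fence]   # Step 5
    return outliers, lower_fence, upper_fence

data = [15, 13, 6, 5, 12, 50, 22, 18]   # hypothetical data
outliers, low, high = find_outliers(data)

print(low, high, outliers)
```

Here Q1 = 9 and Q3 = 20, so the IQR is 11 and any value below −7.5 or above 36.5 is flagged.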

Lesson 4 Distribution Shapes


A continuous variable can assume all values between any two given values of the
variable. Many continuous variables have distributions that are bell-shaped, and these are
called approximately normally distributed variables. The distribution is also called a bell curve or a
Gaussian distribution, named for the German mathematician Carl Friedrich Gauss (1777-1855),
who derived its equation.

Skewness
No variable fits a normal distribution perfectly, since a normal distribution is a
theoretical distribution. However, a normal distribution can be used to describe many variables,
because the deviations from a normal distribution are very small.
When the data values are evenly distributed about the mean, a distribution is said to
be a symmetric distribution. When the majority of the data values fall to the left or right of
the mean, the distribution is said to be skewed.
When the majority of the data values fall to the right of the mean, the distribution is
said to be a negatively or left-skewed distribution. The mean is to the left of the median,
and the mean and the median are to the left of the mode. mean<median<mode
When the majority of the data values fall to the left of the mean, a distribution is said
to be a positively or right-skewed distribution. The mean falls to the right of the median, and
both the mean and the median fall to the right of the mode. mean>median>mode

There are several formulas for finding the coefficient of skewness. The coefficient of
skewness is then converted to its standard score (z-score). If the alpha level ($\alpha$ level) is set to
.05, the calculated z-score must fall between $-1.96$ and $+1.96$ for the data to
approximate the normal distribution. If the z-score falls outside these values,
the data have unacceptable skewness at $\alpha = .05$.

The “tail” of the curve indicates the direction of skewness (right is positive, left is negative).

Kurtosis
Kurtosis is associated with the tallness or flatness (peakedness) of the
distribution. It is also a measure that describes the tails of the distribution in relation to its
overall shape. There are three types of kurtosis.
The Mesokurtic Distribution has a
kurtosis similar to that of the normal distribution. This means that the extreme-value
characteristics of the distribution are the same as those of the normal distribution.
The Leptokurtic Distribution is a kind of distribution that has kurtosis greater than the
normal. Lepto means thin or skinny. Generally, the leptokurtic curve is characterized by a
narrow or thin curve that is taller than the normal. However, its thin shape is only a
consequence of the tails of the distribution, which stretch along the horizontal axis. This
happens when occasional extreme outliers appear in the distribution.

The Platykurtic Distribution is a kind of distribution characterized by short tails.
Platy means broad or flat. This characteristic of the curve is due to the fact that
the middle scores have the same or nearly the same frequency. Furthermore, extreme
values are fewer than in the normal distribution.

Normal Distribution
A normal distribution is a continuous, symmetric, bell-shaped distribution of a
variable.
Properties of the Theoretical Normal Distribution
1. A normal distribution curve is bell-shaped
2. The mean, median, and mode are equal and are located at the centre of the
distribution.
3. A normal distribution curve is unimodal.
4. The curve is symmetric about the mean, which is equivalent to saying that its shape
is the same on both sides of a vertical line passing through the centre.
5. The curve is continuous; that is, there are no gaps or holes. For each value of X, there
is a corresponding value of Y.
6. The curve never touches the x axis, but it gets increasingly closer to it.
7. The total area under the normal distribution curve is equal to 1.00 or 100%.
8. The area under the part of a normal curve that lies within 1 standard deviation of
the mean is approximately 0.68, or 68%; within 2 standard deviations, about 0.95 or 95%;
and within 3 standard deviations, about 0.997, or 99.7%.
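The areas quoted in property 8 can be checked numerically. This sketch relies on the standard identity that the area within k standard deviations of the mean equals erf(k / √2), where erf is the error function; the identity is not stated in this module and is brought in only for verification. The function name `area_within` is an assumption.

```python
# Minimal sketch: verify the 68%, 95%, and 99.7% areas under a normal curve.
import math

def area_within(k):
    """Area under a normal curve within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    # Print the area as a proportion rounded to four decimal places.
    print(k, round(area_within(k), 4))
```

The same areas apply to every normal curve, whatever its mean and standard deviation, because they depend only on how many standard deviations away from the mean the boundaries lie.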

[Figure: The areas under a normal distribution curve, drawn for mean = 80 and s = 5, with the horizontal axis marked at 65, 70, 75, 80, 85, 90, and 95.]

The Standard Normal Distribution


Since each normally distributed variable has its own mean and standard deviation, the
shape and location of these curves will vary. To simplify this situation, statisticians use what
is called the standard normal distribution.
The standard normal distribution is a normal distribution with a mean of 0 and a
standard deviation of 1.
The formula for the standard normal distribution is
$$y = \frac{e^{-z^2/2}}{\sqrt{2\pi}}.$$
All normally distributed variables can be transformed into the standard normally distributed
variable by using the formula for the standard score:
$$z = \frac{\text{value} - \text{mean}}{\text{standard deviation}} \quad \text{or} \quad z = \frac{X - \mu}{\sigma}.$$
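The standard normal curve and the transformation to z can be sketched as below. The cumulative area is computed with the error function, a standard identity that is not part of this module; the function names and the example values (mean 80, standard deviation 5) are assumptions for illustration.

```python
# Minimal sketch: standard normal curve, z transformation, and area between two z values.
import math

def standard_normal_pdf(z):
    """Height of the standard normal curve: exp(-z^2 / 2) / sqrt(2 * pi)."""
    return math.exp(-z ** 2 / 2) / math.sqrt(2 * math.pi)

def standard_normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def to_z(x, mu, sigma):
    """Transform a value from a normal variable into a standard score."""
    return (x - mu) / sigma

peak = standard_normal_pdf(0)                                        # highest point, at z = 0
middle_area = standard_normal_cdf(1.96) - standard_normal_cdf(-1.96)  # area between -1.96 and 1.96
z = to_z(90, mu=80, sigma=5)                                         # hypothetical value 90

print(round(peak, 4), round(middle_area, 2), z)
```

The area between z = −1.96 and z = +1.96 is about 0.95, which is why those two values serve as cut-offs when the alpha level is .05.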

[Figure: The areas under a standard normal distribution, with z = −1.96 and z = 1.96 marked.]

Determining Normality
A normally shaped or bell-shaped distribution is only one of many shapes that a
distribution can assume; however, it is very important since many statistical methods require
that the distribution of values be normally or approximately normally shaped.
There are several ways statisticians check for normality. The easiest way is to draw a
histogram for the data and check its shape. If the histogram is not approximately bell-shaped,
then the data are not normally distributed.
Skewness can be checked by using the Pearson coefficient of skewness (PC), also called
Pearson’s index of skewness. The formula is
$$PC = \frac{3(\bar{X} - \text{median})}{s}.$$
If the index is greater than or equal to +1 or less than or equal to −1, it can be concluded
that the data are significantly skewed.
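Pearson's index can be computed with Python's standard `statistics` module, as in the sketch below. The data set is hypothetical, the function names are assumptions, and `statistics.stdev` is used because it returns the sample standard deviation (the n − 1 formula).

```python
# Minimal sketch: Pearson coefficient of skewness, PC = 3 * (mean - median) / s.
import statistics

def pearson_skewness(values):
    """Three times (mean minus median), divided by the sample standard deviation."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    s = statistics.stdev(values)          # sample standard deviation (n - 1)
    return 3 * (mean - median) / s

def is_significantly_skewed(pc):
    """Apply the rule of thumb: PC >= +1 or PC <= -1."""
    return pc >= 1 or pc <= -1

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical scores: mean 5, median 4.5

pc = pearson_skewness(data)
print(round(pc, 2), is_significantly_skewed(pc))
```

For these scores the index is positive, because the mean lies to the right of the median, but it stays below +1, so the rule does not flag the data as significantly skewed.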
Another method that is used to check normality is to draw a normal quantile plot.
Quantiles, sometimes called fractiles, are values that separate the data set into approximately
equal groups.
There are several other methods used to check for normality. If we use SPSS, the
statistical tests available include the Kolmogorov-Smirnov test, the Lilliefors test, and the
Shapiro-Wilk test. The test should show no significant result for you to conclude that the data are
approximately normal.
