MCS Lecture 3
NUMERICAL DESCRIPTIVE MEASURES
PROBABILITY AND STATISTICS
CHAPTER 3
TABLE OF CONTENTS
Mean
Median
Mode
OTHER TYPES
Geometric Mean: The nth root of the product of the data values.
Harmonic Mean: The reciprocal of the arithmetic mean of the reciprocals of the data
values.
Mid-Range: The arithmetic mean of the maximum and minimum values of a data set.
Trimmed Mean: The arithmetic mean of data values after a certain number or
proportion of the highest and lowest data values have been discarded.
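The four definitions above can be sketched in Python as follows. This is a minimal illustration, not part of the lecture: the data values are hypothetical, `mid_range` and `trimmed_mean` are helper names chosen here, and the trimmed mean is assumed to drop the same proportion of values from each end.

```python
import statistics

def mid_range(values):
    """Arithmetic mean of the maximum and minimum values."""
    return (max(values) + min(values)) / 2

def trimmed_mean(values, proportion):
    """Mean after discarding `proportion` of the values from EACH end
    (an assumption; the lecture does not fix the trimming rule)."""
    ranked = sorted(values)
    k = int(len(ranked) * proportion)   # number of values dropped per end
    kept = ranked[k:len(ranked) - k]
    return statistics.mean(kept)

data = [2, 4, 8, 16, 100]  # hypothetical values, not from the lecture

geometric = statistics.geometric_mean(data)  # nth root of the product
harmonic = statistics.harmonic_mean(data)    # reciprocal of mean of reciprocals

print(geometric, harmonic, mid_range(data), trimmed_mean(data, 0.2))
```

With one large value in the data, the mid-range is pulled far upward, while the trimmed mean discards that value before averaging.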
MEAN FOR UNGROUPED DATA
The mean is the sum of all the data values divided by the total number of data values.
For population: $\mu = \frac{\sum x}{N}$
For sample: $\bar{x} = \frac{\sum x}{n}$
Solution:
$\bar{x} = \frac{18+24+38+36+21+40+33+22}{8} = \frac{232}{8} = 29$
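The same calculation can be checked in a few lines of Python. This is only a sketch; the variable names are chosen here for illustration.

```python
import statistics

values = [18, 24, 38, 36, 21, 40, 33, 22]  # the eight values from the solution

sample_mean = sum(values) / len(values)    # sum of values / number of values
print(sample_mean)                         # 29.0
print(statistics.mean(values))             # same result from the standard library
```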
MEAN FOR GROUPED DATA
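First, the data must be arranged in a frequency distribution.
For population: $\mu = \frac{\sum mf}{N}$
For sample: $\bar{x} = \frac{\sum mf}{n}$
where m is the midpoint of a class and f is the frequency of that class.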
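As a rough sketch of the grouped-data formula, the function below multiplies each class midpoint m by its frequency f and divides the total by the number of observations. The frequency table is hypothetical and the function name `grouped_mean` is an assumption made for this example.

```python
def grouped_mean(classes):
    """classes: list of (lower_limit, upper_limit, frequency) tuples."""
    total_f = sum(f for _, _, f in classes)                        # n (or N)
    total_mf = sum(((lo + hi) / 2) * f for lo, hi, f in classes)   # sum of m*f
    return total_mf / total_f

# Hypothetical frequency distribution (not from the lecture)
table = [(0, 4, 3), (5, 9, 5), (10, 14, 2)]

print(grouped_mean(table))  # 6.5
```

Here the midpoints are 2, 7 and 12, so the sum of mf is 6 + 35 + 24 = 65 over 10 observations.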
The median divides a ranked data set into two equal parts.
To find it, rank the data in increasing order and find the middle term. The value of this term is the median.
MEDIAN FOR UNGROUPED DATA
For an odd number of observations
If the number of observations in a data set is odd, then the median is given by the value
of the middle term in the ranked data.
Example
The following data give the prices (in thousands of dollars) of seven houses selected from
all houses sold last month in a city.
Since there are seven homes in this data set and the middle term is the fourth term, the
median is given by the value of the fourth term in the ranked data.
Example
Consider the data
7 8 9 10 11 12 13 13 14 17 17 45
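This data set has twelve values, an even number, so there is no single middle term. By the usual convention the median is then the average of the two middle terms of the ranked data. A minimal sketch covering both the odd and the even case (the function name `median` is chosen here):

```python
import statistics

def median(values):
    ranked = sorted(values)          # the data must be ranked first
    n = len(ranked)
    mid = n // 2
    if n % 2 == 1:                   # odd: value of the single middle term
        return ranked[mid]
    # even: average of the two middle terms
    return (ranked[mid - 1] + ranked[mid]) / 2

data = [7, 8, 9, 10, 11, 12, 13, 13, 14, 17, 17, 45]
print(median(data))                  # 12.5
print(statistics.median(data))       # 12.5, same result from the standard library
```

The sixth and seventh ranked values are 12 and 13, so the median is 12.5. The extreme value 45 does not affect it.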
Median $= l + \frac{h}{f}\left(\frac{n}{2} - c\right)$
Where
l = Lower class boundary of the median class
h = Class width or interval
f = Frequency of the median class
n = Total number of observations
c = Cumulative frequency of the class preceding the median class
MEDIAN FOR GROUPED DATA
Example
Here $\frac{n}{2} = 75$
So the class 5-9 is the median class. The remaining values are
c = 32
f = 71
h = 9.5 – 4.5 = 5
l = 4.5
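Substituting these values into the grouped-median formula can be sketched as below. The function name `grouped_median` is chosen for illustration, and n = 150 is inferred from n/2 = 75.

```python
def grouped_median(l, h, f, n, c):
    """Median = l + (h / f) * (n / 2 - c) for grouped data."""
    return l + (h / f) * (n / 2 - c)

# Values from the example; n = 150 is inferred from n/2 = 75
median = grouped_median(l=4.5, h=5, f=71, n=150, c=32)
print(round(median, 2))  # 7.53
```

That is, 4.5 + (5/71)(75 - 32) = 4.5 + 215/71, which is about 7.53 and lies inside the median class 5-9 as expected.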
Mode is a French word that means fashion—an item that is most popular or common.
In statistics, the mode represents the most common value in a data set.
MODE FOR UNGROUPED DATA
The mode, in this case, is simply the most repeated value in the data set.
A data set can contain one value, or more than one value, that occurs with the same
highest frequency. Accordingly, a data set can be
Unimodal
Bimodal
Multimodal
A data set in which all the values are repeated with the same frequency has no modal
value.
EXAMPLES
Unimodal
77 82 74 81 79 84 74 78
Mode = 74
Bimodal
77 82 74 81 77 84 74 78
Mode = 74, 77
Multimodal
77 82 74 82 77 84 74 78
Mode = 74, 77, 82
No Mode
77 82 73 81 79 84 74 78
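The four examples above can be reproduced with a short Python sketch. The helper name `modes` is an assumption of this example; it returns an empty list when every value occurs equally often, following the "no mode" convention stated earlier.

```python
from collections import Counter

def modes(values):
    """Return the sorted list of modal values, or [] if there is no mode."""
    counts = Counter(values)
    peak = max(counts.values())
    # All distinct values equally frequent: no modal value
    if len(counts) > 1 and all(c == peak for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == peak)

print(modes([77, 82, 74, 81, 79, 84, 74, 78]))  # [74]
print(modes([77, 82, 74, 81, 77, 84, 74, 78]))  # [74, 77]
print(modes([77, 82, 74, 82, 77, 84, 74, 78]))  # [74, 77, 82]
print(modes([77, 82, 73, 81, 79, 84, 74, 78]))  # []
```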
MODE FOR GROUPED DATA
The mode for grouped data can be calculated by the following formula:
Mode $= l + \frac{f_m - f_1}{(f_m - f_1) + (f_m - f_2)} \times h$
Where
l = Lower class boundary of the modal class
h = Class width or interval
fm = Frequency of the modal class
f1 = Frequency of the class preceding the modal class
f2 = Frequency of the class succeeding the modal class
EXAMPLE
Consider the following table
l = 15
h = 5
fm = 7
f1 = 5
f2 = 2
Mode $= 15 + \frac{7 - 5}{(7 - 5) + (7 - 2)} \times 5 = 15 + \frac{10}{7} \approx 16.43$
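The substitution can be checked with a small sketch. The function name `grouped_mode` is chosen here for illustration.

```python
def grouped_mode(l, h, fm, f1, f2):
    """Mode = l + (fm - f1) / ((fm - f1) + (fm - f2)) * h for grouped data."""
    return l + (fm - f1) / ((fm - f1) + (fm - f2)) * h

mode = grouped_mode(l=15, h=5, fm=7, f1=5, f2=2)
print(mode)  # 15 + 10/7, roughly 16.4286
```

The exact value is 15 + 10/7, which is 16.4286 to four decimal places.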
Symmetric data: data equally spaced around an axis about which the mean lies.
SKEWNESS
Normal curve: mean, median and mode are in the centre and at the same point.
For symmetric data, the mean lies in the middle of the spread, but this is not true for
unsymmetrical (skewed) data.
In unsymmetrical data, the values are spread around the median.
                    Symmetric data    Unsymmetric data
Central tendency    MEAN              MEDIAN
UNGROUPED DATA
CASE 1
We have an ungrouped data set of the incomes of 7 people
10000, 12000, 15000, 20000, 25000, 20000, 50000000
RESULT:
MEDIAN explains the data better
UNGROUPED DATA
CASE 2
We again have a sample of incomes of 7 people
10000,15000,15000,20000,25000,10000,15000
RESULT:
MEAN explains the data better
NOTE: the mode (15000) is the same as the median, and the mean (about 15714) is close to both.
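The two cases can be compared numerically with a short sketch using Python's standard `statistics` module. The variable names are chosen here.

```python
import statistics

case_1 = [10000, 12000, 15000, 20000, 25000, 20000, 50000000]
case_2 = [10000, 15000, 15000, 20000, 25000, 10000, 15000]

for name, incomes in (("Case 1", case_1), ("Case 2", case_2)):
    mean = statistics.mean(incomes)
    median = statistics.median(incomes)
    mode = statistics.mode(incomes)
    print(f"{name}: mean={mean:.2f} median={median} mode={mode}")
```

In Case 1 the single income of 50000000 pulls the mean to over 7 million, far above six of the seven incomes, while the median stays at 20000. In Case 2 there is no extreme value, so the mean (about 15714) and the median (15000) nearly agree.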
MODE
Consider discrete categorical data consisting of buyers' choices among cars of
three colours:
Red, white, black
RED 20
BLACK 30
WHITE 50
HOW TO RELATE ALL THE CENTRAL TENDENCIES WITH SPREAD
MEASURES OF DISPERSION
The measures of central tendency (mean, median, and mode) are by themselves
usually not sufficient to reveal the shape of the distribution of a data set.
Two data sets with similar measures of central tendency might have different spreads, i.e.,
the variation among the data set values might be different.
Data set 1: 40 50 60 70 80    Mean: 60    Spread: 40
Data set 2: 58 59 60 61 62    Mean: 60    Spread: 4
To completely describe a data set, ‘Measures of Dispersion’ are used alongside the
measures of central tendency.
DISPERSION
• Range
• Variance
• Standard Deviation
• Inter-quartile Range
RANGE
• The range is the simplest measure of statistical dispersion; it describes the total spread of the data set.
• The range is the difference between the largest and smallest observations in the data set.
• Range = Largest value – smallest value
VARIANCE
• In probability theory and statistics, variance is the expectation of the squared deviation of a
data set value from the mean of the data set.
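• It measures how far a set of numbers is spread out from its average value.
• For ungrouped data: population variance $\sigma^2 = \frac{\sum (x - \mu)^2}{N}$; sample variance $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$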
STANDARD DEVIATION
• The standard deviation of a random variable, statistical population or data set is the
square root of its variance.
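A minimal sketch of range, variance and standard deviation, applied to the two five-value data sets shown earlier (40 to 80 and 58 to 62, both with mean 60). It assumes the usual conventions that the population variance divides by N and the sample variance by n - 1; `value_range` is a helper name chosen here.

```python
import statistics

def value_range(values):
    """Range = largest value - smallest value."""
    return max(values) - min(values)

wide = [40, 50, 60, 70, 80]
narrow = [58, 59, 60, 61, 62]

for data in (wide, narrow):
    print(
        "mean:", statistics.mean(data),
        "range:", value_range(data),
        "population variance:", statistics.pvariance(data),
        "population std dev:", round(statistics.pstdev(data), 3),
        "sample variance:", statistics.variance(data),
    )
```

Both data sets have the same mean, yet the first has a population variance of 200 and the second only 2, which is the difference the mean alone cannot show.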
Chebyshev’s Theorem
Chebyshev’s theorem gives a lower bound for the area under a curve between two
points that are on opposite sides of the mean and at the same distance from the
mean.
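In its standard form, the theorem states that for any k > 1, at least 1 - 1/k² of the data values lie within k standard deviations of the mean. A small sketch of that bound (the function name is chosen here):

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem requires k > 1")
    return 1 - 1 / k ** 2

for k in (1.5, 2, 3):
    print(k, round(100 * chebyshev_lower_bound(k), 1), "% at least")
```

For k = 2 the bound is 75%, and for k = 3 it is about 88.9%, whatever the shape of the distribution.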
Example
Empirical Rule
Whereas Chebyshev’s theorem is applicable to any kind of distribution, the empirical
rule applies only to a specific type of distribution called a bell-shaped distribution.
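The empirical rule is usually stated as roughly 68%, 95% and 99.7% of the values lying within 1, 2 and 3 standard deviations of the mean. The sketch below recovers these percentages, assuming the normal distribution as the model bell-shaped curve. The function name `normal_within` is chosen here.

```python
import math

def normal_within(k):
    """Probability that a normal variable lies within k standard
    deviations of its mean: P(|Z| < k) = erf(k / sqrt(2))."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(100 * normal_within(k), 1))
```

These percentages are much higher than the Chebyshev lower bounds for the same k, because the bell-shape assumption is far more restrictive than "any distribution".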
STANDARD DEVIATION
• In statistics, an outlier is an observation point that is distant from other observations.
• An outlier can cause serious problems in statistical analyses. (The standard deviation
might not depict the true behavior of the data set)
• If a data set consists of the values 1, 3, 5, 7, 10, 12, 15 and 10000, the value 10000 is
clearly an outlier, and it inflates the standard deviation and variance of
the data set.
• To identify outliers, another measure of dispersion is used: the inter-quartile
range.
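The effect of the outlier on the standard deviation can be seen by computing it with and without the value 10000. This is a sketch using the population standard deviation; the variable names are chosen here.

```python
import statistics

with_outlier = [1, 3, 5, 7, 10, 12, 15, 10000]
without_outlier = [v for v in with_outlier if v != 10000]

sd_with = statistics.pstdev(with_outlier)
sd_without = statistics.pstdev(without_outlier)

print("std dev with outlier:   ", round(sd_with, 2))
print("std dev without outlier:", round(sd_without, 2))
```

A single extreme value raises the standard deviation from under 5 to several thousand, so it no longer describes how the other seven values are spread.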
INTER-QUARTILE RANGE
• In statistics, the interquartile range (IQR), also called the midspread or middle 50%, or
technically H-spread, is a measure of statistical dispersion, being equal to the difference
between 75th and 25th percentiles, or between upper and lower quartiles, IQR = Q3 − Q1.
• Unlike total range, the interquartile range has a breakdown point of 25% and is thus often
preferred to the total range. The IQR is used to build box plots, simple graphical
representations of a probability distribution.
• The IQR can be used to identify outliers. The behavior of the data set values between first
and third quartiles represents the distribution of data set in a satisfactory manner.
Measures of position
Quartiles: Quartiles are three summary measures that divide a ranked data set into four
equal parts. The second quartile is the same as the median of a data set. The first quartile
is the value of the middle term among the observations that are less than the median,
and the third quartile is the value of the middle term among the observations that are
greater than the median.
Approximately 25% of the values in a ranked data set are less than Q1 and about 75% are
greater than Q1. The second quartile, Q2, divides a ranked data set into two equal parts;
hence, the second quartile and the median are the same. Approximately 75% of the data
values are less than Q3 and about 25% are greater than Q3. The difference between the
third quartile and the first quartile for a data set is called the interquartile range (IQR).
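The quartile definition above (Q1 and Q3 as the middle terms of the lower and upper halves of the ranked data) can be sketched as follows, using the twelve-value data set from the median example. The function name `quartiles` is chosen here, and other software may use slightly different interpolation rules.

```python
import statistics

def quartiles(values):
    """Return (Q1, Q2, Q3) using the middle-term-of-each-half definition."""
    ranked = sorted(values)
    n = len(ranked)
    lower = ranked[: n // 2]          # observations below the median position
    upper = ranked[(n + 1) // 2 :]    # observations above the median position
    return (statistics.median(lower),
            statistics.median(ranked),
            statistics.median(upper))

data = [7, 8, 9, 10, 11, 12, 13, 13, 14, 17, 17, 45]
q1, q2, q3 = quartiles(data)
iqr = q3 - q1
print(q1, q2, q3, iqr)  # 9.5 12.5 15.5 6.0
```

Note that the extreme value 45 changes neither the quartiles nor the IQR, whereas it would raise the range to 38.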
Percentiles and Percentile Rank
Percentiles are the summary measures that divide a ranked data set into 100 equal parts.
Each (ranked) data set has 99 percentiles that divide it into 100 equal parts. The data
should be ranked in increasing order to compute percentiles. The kth percentile is
denoted by Pk, where k is an integer in the range 1 to 99. For instance, the 25th
percentile is denoted by P25.
Thus, the kth percentile, Pk, can be defined as a value in a data set such that about k% of
the measurements are smaller than the value of Pk and about (100- k)% of the
measurements are greater than the value of Pk. The approximate value of the kth
percentile is determined as explained next.
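One common approximation takes Pk as the value of the (kn/100)th term in the ranked data set. The sketch below assumes the nearest-rank convention, rounding kn/100 up to the next whole position; this rounding rule is an assumption of the example, not something stated in the lecture. It also includes a commonly used definition of percentile rank: the percentage of values less than a given value. Both function names are chosen here.

```python
def percentile(values, k):
    """Approximate kth percentile by the nearest-rank method (assumed rule)."""
    if not 1 <= k <= 99:
        raise ValueError("k must be an integer from 1 to 99")
    ranked = sorted(values)               # data must be in increasing order
    position = -(-k * len(ranked) // 100) # ceiling of k*n/100, 1-based
    return ranked[position - 1]

def percentile_rank(values, x):
    """Percentage of values in the data set that are less than x."""
    return 100 * sum(1 for v in values if v < x) / len(values)

data = [7, 8, 9, 10, 11, 12, 13, 13, 14, 17, 17, 45]
print(percentile(data, 25))   # 9  (3rd ranked term)
print(percentile(data, 90))   # 17 (11th ranked term)
print(round(percentile_rank(data, 14), 2))  # 66.67
```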
BOX-AND-WHISKER PLOT
A box-and-whisker plot gives a graphic presentation of data using five measures: the
median, the first quartile, the third quartile, and the smallest and the largest values in
the data set between the lower and the upper inner fences.
A box-and-whisker plot can help us visualize the center, the spread, and the skewness
of a data set.
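The five measures can be computed as in the sketch below. It assumes the common convention that the inner fences lie 1.5 × IQR below Q1 and above Q3; the factor 1.5 is not stated in the lecture. The function name `box_plot_measures` is chosen here, and the data are the outlier example values used earlier.

```python
import statistics

def box_plot_measures(values, factor=1.5):
    ranked = sorted(values)
    n = len(ranked)
    q1 = statistics.median(ranked[: n // 2])        # middle of lower half
    q3 = statistics.median(ranked[(n + 1) // 2 :])  # middle of upper half
    iqr = q3 - q1
    lower_fence = q1 - factor * iqr                 # assumed inner fences
    upper_fence = q3 + factor * iqr
    inside = [v for v in ranked if lower_fence <= v <= upper_fence]
    return {
        "smallest_within_fences": inside[0],
        "q1": q1,
        "median": statistics.median(ranked),
        "q3": q3,
        "largest_within_fences": inside[-1],
        "outliers": [v for v in ranked if v < lower_fence or v > upper_fence],
    }

data = [1, 3, 5, 7, 10, 12, 15, 10000]
summary = box_plot_measures(data)
print(summary)
```

Here Q1 = 4 and Q3 = 13.5, so the IQR is 9.5 and the inner fences are at -10.25 and 27.75. The whiskers run from 1 to 15, and 10000 falls outside the upper fence, marking it as an outlier.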