3 Numerical Descriptive Measures

Uploaded by

sarikajayaswal
Numerical Descriptive Measures
Summary Definitions

• The central tendency (location) is the extent to which the values of a numerical variable group around a typical or central value. It is the central value around which the data tend to cluster.

• The variation is the amount of dispersion or scattering away from a central value that the values of a numerical variable show. It measures the spread of the data.

• The shape is the pattern of the distribution of values from the lowest value to the highest value.
Numerical Descriptive Techniques
Measure of Central Tendency: Mean

Mean (average)
• The sum of all the data entries divided by the number of entries.
• Sigma notation: $\sum x$ = add all of the data entries ($x$) in the data set.
• Population mean: $\mu = \dfrac{\sum x}{N}$
• Sample mean: $\bar{x} = \dfrac{\sum x}{n}$
Population mean µ

• The population mean is the sum of the values in the population divided by the population size, N.

$$\mu = \frac{\sum_{i=1}^{N} X_i}{N} = \frac{X_1 + X_2 + \cdots + X_N}{N}$$

Where μ = population mean
N = population size
Xi = ith value of the variable X
Sample mean

• The arithmetic mean (often just called the “mean”) is the most common measure of central tendency.

For a sample of size n:

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n}$$

Where X̄ (pronounced x-bar) = sample mean
n = sample size
Xi = ith observed value

Measures of Central Tendency : Mean

Advantages of using mean:
• easy to calculate,
• provides a good description for data on height, grades, etc.

Disadvantages of using mean:
• is sensitive to extreme values.

[Figure: two dot plots on an axis from 11 to 20. Left panel: Mean = 13. Right panel: Mean = 14.]
Measures of Central Tendency : Median
• The median is the value that divides the data into two parts: 50% of the observations have values less than the median and 50% of the observations have values greater than the median.

• The median is calculated by placing all the observations in order; the observation that falls in the middle is the median.

• The location of the median when the values are in numerical order (smallest to largest):

$$\text{Median position} = \frac{n + 1}{2} \text{ position in the ordered data}$$

• If the number of values is odd, the median is the middle number.
• If the number of values is even, the median is the average of the two middle numbers.
• Note that (n + 1)/2 is not the value of the median, only the position of the median in the ranked data.
Measures of Central Tendency: Median

• In an ordered array, the median is the “middle” number (50% above, 50% below).

[Figure: two dot plots on an axis from 11 to 20. Left panel: Median = 13. Right panel: Median = 13.]

• Less sensitive than the mean to extreme values.
• There are as many values above the median as below it in the data array.
• The sample and population medians are computed in the same way.
EXAMPLES - Mean and Median

A sample of 10 adults was asked to report the number of hours they spent on the internet the previous month. The results are listed here. Calculate the sample mean and median.

0 7 12 5 33 14 8 0 9 22

The sample mean is the sum of the observations divided by the sample size: 110/10 = 11.0.

In ranked order the data are 0 0 5 7 8 9 12 14 22 33. The median is the average of the fifth and sixth observations (the middle two), which are 8 and 9, respectively. Thus, the median is 8.5.
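
The calculation above can be checked with a short Python sketch. It is only an illustration: the variable names are our own, and the standard-library `statistics` functions are used purely as a cross-check on the hand calculation.

```python
from statistics import mean, median

# Hours on the internet last month for the 10 sampled adults (data from the example).
hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]

n = len(hours)
ranked = sorted(hours)

# Sample mean: sum of the observations divided by the sample size.
sample_mean = sum(hours) / n

# (n + 1) / 2 is the position of the median in the ranked data, not its value.
position = (n + 1) / 2

if n % 2 == 1:
    # Odd n: the position is a whole number, so take that ranked value.
    sample_median = ranked[int(position) - 1]
else:
    # Even n: average the two ranked values on either side of the position.
    lower = ranked[int(position) - 1]
    upper = ranked[int(position)]
    sample_median = (lower + upper) / 2

# Cross-check against the standard library.
assert sample_mean == mean(hours)
assert sample_median == median(hours)

print(sample_mean, position, sample_median)  # prints: 11.0 5.5 8.5
```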
Measures of Central Tendency : The Mode

 Value that occurs most often.


 Not affected by extreme values.
 Used mainly for nominal data.
 There may be no mode.
 There may be several modes. (bi-modal)
 The sample and population modes are computed in the same
way.

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 0 1 2 3 4 5 6

Mode = 9 No Mode
Copyright © 2017 Pearson Education, Ltd.
Who wins between Mean, Median, and Mode?

Out of the three measures to choose from, which one should we use?

• The mean is generally our first selection. However, there are several
circumstances when the median is better.
• The mode is seldom the best measure of central location.
• One advantage the median holds is that it is not as sensitive to extreme values as the mean is.

Find the mode for the data in the internet example.

0 7 12 5 33 14 8 0 9 22

All observations except 0 occur once. There are two 0s. Thus, the mode is 0. As you can see, this is a poor measure of central location. It is nowhere near the center of the data. Compare this with the mean of 11.0 and the median of 8.5, and we can see that the mean and median are superior measures.
Activity
The prices (in dollars) for a sample of round-trip flights from Chicago, Illinois to Cancun, Mexico are listed. What are the mean, median, and mode of the flight prices?

1872 432 397 427 388 482 397 358 432

Which measure of central tendency is best suited to describe these data?

Mean = 5185/9 = 576.111

Ranked data: 358, 388, 397, 397, 427, 432, 432, 482, 1872

Median = value in the 5th position = 427

Mode = 397 and 432 (each occurs twice)

Because of the extreme value (1872), the median is the most appropriate measure.
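
As a quick check of the activity, the sketch below recomputes the three measures with Python's standard `statistics` module. The last step, which recomputes the mean without the largest price, is our own addition to illustrate how strongly one extreme value pulls the mean.

```python
from statistics import mean, median, multimode

# Round-trip flight prices in dollars (data from the activity).
prices = [1872, 432, 397, 427, 388, 482, 397, 358, 432]

price_mean = mean(prices)                 # 5185 / 9
price_median = median(prices)             # 5th value of the 9 ranked prices
price_modes = sorted(multimode(prices))   # every value tied for most frequent

# Illustration only: the mean after dropping the single extreme price.
largest = max(prices)
mean_without_largest = mean(p for p in prices if p != largest)

print(round(price_mean, 3), price_median, price_modes, mean_without_largest)
```

The mean falls from about 576 to about 414 once the 1872 fare is left out, while the median of the full data set (427) already sits among the typical prices.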


Summary
• Compute the mean to describe the central location of a single set of interval data.
• Compute the median to describe the central location of a single set of interval or ordinal data (with extreme observations).
• Compute the mode to describe a single set of nominal or interval data.
Dispersion and Variation

Why Study Dispersion?

– A measure of location, such as the mean or the median, only describes the center of the data. It is valuable from that standpoint, but it does not tell us anything about the spread of the data.
– For example, if your nature guide told you that the river ahead averaged 3 feet in depth, would you want to wade across on foot without additional information? Probably not. You would want to know something about the variation in the depth.
– A second reason for studying the dispersion in a set of data is to compare the spread in two or more distributions.
Measures of Variation

Measures of variation covered here:
• Range
• Variance
• Standard Deviation
• Coefficient of Variation

Measures of variation give information on the spread or variability of the data values, which measures of location fail to provide.

[Figure: two distributions with the same centre but different variation.]
Measures of Variation: The Range

• Simplest measure of variation.
• Difference between the largest and the smallest values:

Range = Xlargest – Xsmallest

Example:

[Figure: dot plot of the example data on an axis from 0 to 14.]

Range = 14 - 2 = 12

• Potential problem with the range? Consider the following example on grades.
• Grades of course 1: {4, 4, 4, 4, 50}.
• Grades of course 2: {4, 8, 15, 24, 39, 50}.
• Range = 46 in both courses, but the two courses have very different distributions.

• Its major advantage is the ease with which it can be computed.

• Its major shortcoming is its failure to provide information on the dispersion of the observations between the two end points.

• Hence we need a measure of variability that incorporates all the data and not just two observations.
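
The grades example can be reproduced in a few lines of Python. This is a minimal sketch; `value_range` is a helper name of our own choosing.

```python
# Grades for the two courses (data from the example above).
course_1 = [4, 4, 4, 4, 50]
course_2 = [4, 8, 15, 24, 39, 50]


def value_range(values):
    """Range = largest value minus smallest value."""
    return max(values) - min(values)


range_1 = value_range(course_1)
range_2 = value_range(course_2)

# Both ranges are 46, although the grades in between are distributed very differently.
print(range_1, range_2)  # prints: 46 46
```

Only the two end points enter the calculation, which is why the two very different sets of grades give the same range.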
Deviation, Variance, and Standard Deviation

Deviation
• The difference between the data entry, x, and the mean of the data set.
• It gives a rough estimate of the typical distance of a data value from the mean.
• Population data set: Deviation of x = x – μ
• Sample data set: Deviation of x = x – x̄
Numerical Descriptive Measures for a Population: Variance σ²

• Average of squared deviations of values from the mean.

Population variance:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$$

Where
μ = population mean, N = population size
Xi = ith value of the variable X
Numerical Descriptive Measures for a Population: Standard Deviation σ
• Most commonly used measure of variation.
• Shows average variation about the mean.
• Is the square root of the population variance.
• Has the same units as the original data.

Population standard deviation:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$$
Measures of Variation: Sample Variance
• Average (approximately) of squared deviations of values from the mean.

Sample variance:

$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$

Where
X̄ = arithmetic mean
n = sample size
Xi = ith value of the variable X
Measures of Variation: Sample Standard Deviation
• Most commonly used measure of variation.
• Shows average variation about the mean.
• Is the square root of the variance.
• Has the same units as the original data.

Sample standard deviation:

$$S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$$
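
As a rough illustration of the variance and standard deviation formulas, the Python sketch below applies them to the internet-hours sample used in the earlier mean and median example. Reusing that data set here is our own choice, and the population figures (dividing by N) are computed from the same ten values only to show the effect of the divisor.

```python
from math import isclose, sqrt
from statistics import pstdev, pvariance, stdev, variance

# Internet-hours sample from the earlier mean/median example.
hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]

n = len(hours)
x_bar = sum(hours) / n

# Sum of squared deviations from the mean.
sum_sq = sum((x - x_bar) ** 2 for x in hours)

# Sample formulas divide by n - 1.
sample_variance = sum_sq / (n - 1)
sample_sd = sqrt(sample_variance)

# Population formulas divide by N (here treating the ten values as a whole population).
population_variance = sum_sq / n
population_sd = sqrt(population_variance)

# Cross-check against the standard library.
assert isclose(sample_variance, variance(hours))
assert isclose(sample_sd, stdev(hours))
assert isclose(population_variance, pvariance(hours))
assert isclose(population_sd, pstdev(hours))

print(round(sample_variance, 2), round(sample_sd, 2))
print(round(population_variance, 2), round(population_sd, 2))
```

The standard deviations come out in hours, the same units as the data, while the variances are in squared hours.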
Interpreting Standard Deviation
• Standard deviation is a measure of the typical amount an entry deviates from the mean.
• The more the entries are spread out, the greater the standard deviation.
Measures of Variation: Comparing Standard Deviations

[Figure: two distributions, one labelled "Smaller standard deviation" and the other "Larger standard deviation".]

Measure of Variability: Standard Deviation -
Interpretation
• Suppose that the mean and standard deviation of last year’s midterm test marks are 70 and 5, respectively.

• What can you say about the distribution of grades if the histogram is bell-shaped?
• By the Empirical Rule, approximately 68% of the marks fell between 65 and 75 (within 1 standard deviation of the mean), approximately 95% fell between 60 and 80 (within 2), and approximately 99.7% fell between 55 and 85 (within 3).

• What can you say about the distribution of grades if the shape of the histogram is not known?

• If the shape of the histogram is not known, Chebyshev's theorem says that at least 1 – 1/k² of the marks lie within k standard deviations of the mean. With k = 2 and k = 3, at least 75% of the marks fell between 60 and 80, and at least 88.9% fell between 55 and 85.
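
The intervals and percentages above can be reproduced with the following Python sketch. The function names are our own, and the Empirical Rule percentages are the usual rounded values (68%, 95%, 99.7%) rather than anything computed from data.

```python
# Mean and standard deviation of last year's midterm marks (from the example).
mean_mark = 70
sd_mark = 5

# Usual rounded Empirical Rule proportions for bell-shaped data.
EMPIRICAL_RULE = {1: 0.68, 2: 0.95, 3: 0.997}


def interval(k):
    """Interval of marks within k standard deviations of the mean."""
    return (mean_mark - k * sd_mark, mean_mark + k * sd_mark)


def chebyshev_lower_bound(k):
    """Minimum proportion within k standard deviations, for any shape (k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's bound is only informative for k > 1")
    return 1 - 1 / k**2


# Bell-shaped histogram: approximate proportions.
for k, share in EMPIRICAL_RULE.items():
    print(f"about {share:.1%} of marks between {interval(k)}")

# Unknown shape: guaranteed minimum proportions.
for k in (2, 3):
    print(f"at least {chebyshev_lower_bound(k):.1%} of marks between {interval(k)}")
```

Chebyshev's bound is weaker than the Empirical Rule because it has to hold for every possible shape of distribution.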
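The Coefficient of Variation (CV)
• Measures relative variation.
• Always in percentage (%).
• Shows variation relative to the mean.
• Is the standard deviation divided by the mean, multiplied by 100%:

$$CV = \left(\frac{S}{\bar{X}}\right) \cdot 100\%$$

Comparing Coefficients of Variation

• Stock A:
• Average price last year = $50
• Standard deviation = $5

$$CV_A = \left(\frac{S}{\bar{X}}\right) \cdot 100\% = \frac{\$5}{\$50} \cdot 100\% = 10\%$$

• Stock B:
• Average price last year = $100
• Standard deviation = $5

$$CV_B = \left(\frac{S}{\bar{X}}\right) \cdot 100\% = \frac{\$5}{\$100} \cdot 100\% = 5\%$$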
Measures of Variation: Summary Characteristics
• The more the data are spread out, the greater the range, variance, and standard deviation.

• The more the data are concentrated, the smaller the range, variance, and standard deviation.

• If the values are all the same (no variation), all these measures will be zero.

• None of these measures are ever negative.

• Measures of variability can be used for interval data; for ordinal data, use the interquartile range (IQR).
Measure of Relative Standing

• Measures of relative standing are designed to provide information about the position of particular values relative to the entire data set.
• Percentile: the Pth percentile is the value for which P% of the observations are less than that value and (100 – P)% of the observations are greater than that value.
• Suppose you scored in the 60th percentile on some exam. That means 60% of the other scores were below yours, while 40% of the scores were above yours.
Quartile Measures
• Quartiles describe the spread of the values by dividing the ranked distribution into four groups.
• Three quartile points divide the data into these four groups:
• First quartile, Q1: About one quarter of the data fall on or below Q1.
• Second quartile, Q2: About one half of the data fall on or below Q2 (median).
• Third quartile, Q3: About three quarters of the data fall on or below Q3.

[Figure: ranked data split into four segments of 25% each by Q1, Q2, and Q3.]

• Quartiles are used to calculate the interquartile range, which is a measure of variability around the median.
Quartile Measures: Locating Quartiles

Find a quartile by determining the value in the appropriate position in the ranked data, where:

First quartile position: Q1 = (n+1)/4 ranked value.

Second quartile position: Q2 = (n+1)/2 ranked value, or median.

Third quartile position: Q3 = 3(n+1)/4 ranked value.

where n is the number of observed values.


The numbers of nuclear power plants in the top 15 nuclear power-producing countries in the world are listed. Find the first, second, and third quartiles of the data set.

7 18 11 6 59 17 18 54 104 20 31 8 10 15 19

Solution:
• Rank the data (n = 15). Q2 divides the data set into two halves.

6 7 8 10 11 15 17 18 18 19 20 31 54 59 104

• First quartile: (15 + 1)/4 = 4th position, so Q1 = 10.
• Second quartile: 2(15 + 1)/4 = 8th position, so Q2 = 18.
• Third quartile: 3(15 + 1)/4 = 12th position, so Q3 = 31.

Lower half: 6 7 8 10 11 15 17 (Q1 = 10)
Q2 = 18
Upper half: 18 19 20 31 54 59 104 (Q3 = 31)

• Q1 tells us that about 25% of the countries have 10 or fewer nuclear plants; Q2 tells us that about 50% have 18 or fewer; and Q3 reveals that about 75% have 31 or fewer.
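
The position rule used in this solution can be sketched in Python as below. The helper `value_at_position` is our own, and its handling of fractional positions (linear interpolation between the two neighbouring ranked values) is an assumption, since the example only involves whole-number positions. Statistical packages often use slightly different quartile conventions, so their results may differ on other data sets.

```python
def value_at_position(ranked, position):
    """Value at a 1-based position in ranked data.

    Fractional positions are interpolated linearly between neighbours
    (an assumption; the worked example only needs whole positions).
    """
    n = len(ranked)
    if position < 1 or position > n:
        raise ValueError("position falls outside the ranked data")
    lower = int(position)
    fraction = position - lower
    if fraction == 0:
        return ranked[lower - 1]
    return ranked[lower - 1] + fraction * (ranked[lower] - ranked[lower - 1])


# Nuclear power plants in the top 15 producing countries (data from the example).
plants = [7, 18, 11, 6, 59, 17, 18, 54, 104, 20, 31, 8, 10, 15, 19]

ranked = sorted(plants)
n = len(ranked)

q1 = value_at_position(ranked, (n + 1) / 4)      # 4th position
q2 = value_at_position(ranked, (n + 1) / 2)      # 8th position
q3 = value_at_position(ranked, 3 * (n + 1) / 4)  # 12th position

print(q1, q2, q3)  # prints: 10 18 31
```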
Measure of Relative Standing: Commonly
used Percentiles
Measure of Relative
Standing: Location of
Percentiles
Interquartile Range (IQR)

• Measures the range of the middle 50% of the data, showing how spread out the data are.
• The difference between the third and first quartiles.
• IQR = Q3 – Q1
• Large values of this statistic mean that the 1st and 3rd quartiles are far apart, indicating a high level of variability.

Find the interquartile range of the nuclear power plant data set. Recall Q1 = 10, Q2 = 18, and Q3 = 31.

Solution:
• IQR = Q3 – Q1 = 31 – 10 = 21

The numbers of power plants in the middle portion of the data set vary by at most 21.
Describing Relationship between Two Variables

• One graphical technique we use to show the relationship between two variables is called a scatter diagram.
• To draw a scatter diagram we need two variables. We scale one variable along the horizontal axis (X-axis) of a graph and the other variable along the vertical axis (Y-axis).
Describing Relationship between Two Variables – Scatter Diagram Examples

We Discuss Two Measures of the Relationship Between Two Numerical Variables

• Scatter plots allow you to visually examine the relationship between two numerical variables.
• Now, we will discuss two quantitative measures of such relationships:
• The Covariance
• The Coefficient of Correlation
The Covariance
• The covariance measures the strength of the linear relationship between two numerical variables (X and Y).

Numerical Illustration

Interpreting Covariance

• Covariance between two variables: when there is no particular pattern, the covariance is a small number.
• The covariance has a major flaw: it is not possible to determine the relative strength of the relationship from the size of the covariance.
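Coefficient of Correlation

• Measures the relative strength of the linear relationship between two numerical variables.
• Sample coefficient of correlation:

$$r = \frac{\operatorname{cov}(X, Y)}{S_X S_Y}$$

where

$$\operatorname{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}, \qquad S_X = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}, \qquad S_Y = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n - 1}}$$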
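
To make the formulas concrete, here is a minimal Python sketch that computes the sample covariance and the coefficient of correlation directly from their definitions. The five (x, y) pairs are hypothetical values made up for illustration; they are not one of the data sets from the slides.

```python
from math import sqrt

# Hypothetical paired observations, invented purely for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sample covariance: sum of cross-products of deviations, divided by n - 1.
cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

# Sample standard deviations of X and Y (also dividing by n - 1).
s_x = sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
s_y = sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))

# Coefficient of correlation: covariance rescaled by both standard deviations.
r = cov_xy / (s_x * s_y)

print(round(cov_xy, 4), round(r, 4))
```

Dividing by the two standard deviations removes the units, which is why r always lies between –1 and 1 while the covariance can be any size.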
Features of the Coefficient of Correlation

• The population coefficient of correlation is referred to as ρ.
• The sample coefficient of correlation is referred to as r.
• Both ρ and r have the following features:
• Unit free
• Ranges between –1 and 1
• The closer to –1, the stronger the negative linear relationship
• The closer to 1, the stronger the positive linear relationship
• The closer to 0, the weaker the linear relationship (or no linear relationship)
Scatter Plots of Sample Data with Various Coefficients of Correlation

Coefficient of Correlation

Because we’ve already calculated the covariances, we need to compute only the standard deviations of X and Y.
For Set 1: Strong positive linear relationship
For Set 2: Strong negative linear relationship
For Set 3: Weak negative linear relationship
