3 Descriptive Statistics - Numerical

The document provides an overview of descriptive statistics, focusing on numerical measures such as location (mean, median, mode, percentiles, quartiles) and variability (range, variance, standard deviation). It emphasizes the importance of understanding these measures in analyzing data sets, particularly in the context of hotel room rates. Additionally, it discusses concepts like skewness, z-scores, and Chebyshev's theorem for assessing data distribution and variability.

Uploaded by

9nnx7khkh7
Slide 3, Part A

Descriptive Statistics: Numerical Measures

■ Measures of Location
■ Measures of Variability

Slide 1
Measures of Location

■ Mean
■ Median
■ Mode
■ Percentiles
■ Quartiles

If the measures are computed for data from a sample, they are called sample statistics.

If the measures are computed for data from a population, they are called population parameters.

A sample statistic is referred to as the point estimator of the corresponding population parameter.

Slide 2
Mean

■ Perhaps the most important measure of location is the mean.
■ The mean provides a measure of central location.
■ The mean of a data set is the average of all the data values.
■ The sample mean x̄ is the point estimator of the population mean µ.

Slide 3
Sample Mean x̄

$$\bar{x} = \frac{\sum x_i}{n}$$

where $\sum x_i$ is the sum of the values of the n observations and n is the number of observations in the sample.

Slide 4
Population Mean µ

$$\mu = \frac{\sum x_i}{N}$$

where $\sum x_i$ is the sum of the values of the N observations and N is the number of observations in the population.

Slide 5
Sample Mean

■ Example: Hotel Room Rates

Seventy hotels were randomly sampled in Shatin and Tai Po. The daily rent prices for a studio room are listed below.

Slide 6
Sample Mean

■ Example: Hotel Room Rates

$$\bar{x} = \frac{\sum x_i}{n} = \frac{34{,}356}{70} = 490.80$$
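The calculation above can be reproduced with a short Python sketch. The 70 rates are copied from the five-number-summary slide later in the deck; the names `rates` and `sample_mean` are illustrative choices, not part of the original slides.

```python
# Hotel room rates for the 70 sampled hotels, in ascending order
# (values copied from the five-number-summary slide).
rates = [
    425, 430, 430, 435, 435, 435, 435, 435, 440, 440,
    440, 440, 440, 445, 445, 445, 445, 445, 450, 450,
    450, 450, 450, 450, 450, 460, 460, 460, 465, 465,
    465, 470, 470, 472, 475, 475, 475, 480, 480, 480,
    480, 485, 490, 490, 490, 500, 500, 500, 500, 510,
    510, 515, 525, 525, 525, 535, 549, 550, 570, 570,
    575, 575, 580, 590, 600, 600, 600, 600, 615, 615,
]


def sample_mean(values):
    """Return the sum of the observations divided by their count."""
    return sum(values) / len(values)


total = sum(rates)          # numerator of the formula
x_bar = sample_mean(rates)  # sample mean
print(total, len(rates), round(x_bar, 2))
```

The printed total and count correspond to the numerator and denominator shown on the slide.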

Slide 7
Median

■ The median of a data set is the value in the middle when the data items are arranged in ascending order.
■ Whenever a data set has extreme values, the median is the preferred measure of central location.
■ The median is the measure of location most often reported for annual income and property value data.
■ A few extremely large incomes or property values can inflate the mean.

Slide 8
Median

■ For an odd number of observations:

26 18 27 12 14 27 19 (7 observations)

12 14 18 19 26 27 27 (in ascending order)

the median is the middle value.

Median = 19

Slide 9
Median

■ For an even number of observations:

26 18 27 12 14 27 30 19 (8 observations)

12 14 18 19 26 27 27 30 (in ascending order)

the median is the average of the middle two values.

Median = (19 + 26)/2 = 22.5
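The odd and even cases can be sketched in a few lines of Python. The function name `median` and the variable names are illustrative; the two data sets are the ones used on the slides.

```python
def median(values):
    """Middle value of the sorted data, or the average of the two
    middle values when the number of observations is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_data = [26, 18, 27, 12, 14, 27, 19]        # 7 observations
even_data = [26, 18, 27, 12, 14, 27, 30, 19]   # 8 observations

median_odd = median(odd_data)
median_even = median(even_data)
print(median_odd, median_even)
```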

Slide 10
Median

■ Example: Hotel Room Rates

Averaging the 35th and 36th data values:
Median = (475 + 475)/2 = 475

Note: Data is in ascending order.

Slide 11
Trimmed Mean

■ Another measure, sometimes used when extreme values are present, is the trimmed mean.
■ It is obtained by deleting a percentage of the smallest and largest values from a data set and then computing the mean of the remaining values.
■ For example, the 5% trimmed mean is obtained by removing the smallest 5% and the largest 5% of the data values and then computing the mean of the remaining values.
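A minimal sketch of the trimmed mean follows. The slides do not say how to handle a fractional number of values to remove, so rounding that count down is an assumption made here; the function name `trimmed_mean` and the sample data are hypothetical.

```python
def trimmed_mean(values, percent):
    """Mean after dropping `percent` percent of the values from each end.

    Assumption: the number of values dropped per end is rounded down
    when n * percent / 100 is not a whole number.
    """
    ordered = sorted(values)
    k = int(len(ordered) * percent / 100)
    kept = ordered[k:len(ordered) - k]
    if not kept:
        raise ValueError("percent is too large: no values remain")
    return sum(kept) / len(kept)


# Hypothetical data: 19 ordinary values plus one extreme value.
data = list(range(1, 20)) + [1000]

plain_mean = sum(data) / len(data)     # pulled upward by the extreme value
trimmed = trimmed_mean(data, 5)        # drops 1 value from each end
print(plain_mean, trimmed)
```

With 20 observations, a 5% trim removes one value from each end, so the extreme value no longer affects the result.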

Slide 12
Mode

■ The mode of a data set is the value that occurs with greatest frequency.
■ The greatest frequency can occur at two or more different values.
■ If the data have exactly two modes, the data are bimodal.
■ If the data have more than two modes, the data are multimodal.
■ Caution: If the data are bimodal or multimodal, Excel’s MODE function will incorrectly identify a single mode.
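One way to avoid the single-mode pitfall is to return every value that ties for the greatest frequency. The sketch below uses Python's `collections.Counter`; the function name `modes` and the sample data are hypothetical.

```python
from collections import Counter


def modes(values):
    """Return all values that occur with the greatest frequency."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


unimodal = [450, 450, 450, 475, 500]   # hypothetical data, one mode
bimodal = [450, 450, 475, 475, 500]    # hypothetical data, two modes

print(modes(unimodal))
print(modes(bimodal))
```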

Slide 13
Mode

■ Example: Hotel Room Rates

450 occurred most frequently (7 times)
Mode = 450

Note: Data is in ascending order.

Slide 14
Percentiles

■ A percentile provides information about how the data are spread over the interval from the smallest value to the largest value.
■ Admission test scores for colleges and universities are frequently reported in terms of percentiles.
■ The pth percentile of a data set is a value such that at least p percent of the items take on this value or less and at least (100 - p) percent of the items take on this value or more.

Slide 15
Percentiles

Arrange the data in ascending order.

Compute index i, the position of the pth percentile.

i = (p/100)n

If i is not an integer, round up. The pth percentile is the value in the ith position.

If i is an integer, the pth percentile is the average of the values in positions i and i + 1.

The 0th percentile is defined as the minimum, and the 100th percentile is defined as the maximum.
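The steps above can be sketched in Python. This follows the rule on this slide, which is only one of several percentile conventions, so library functions may return slightly different values. The function name `percentile` is illustrative, p is assumed to be a whole number, and integer arithmetic (`divmod`) is used here to avoid floating-point surprises when testing whether i is an integer.

```python
# Hotel room rates for the 70 sampled hotels (from the later slides).
rates = [
    425, 430, 430, 435, 435, 435, 435, 435, 440, 440,
    440, 440, 440, 445, 445, 445, 445, 445, 450, 450,
    450, 450, 450, 450, 450, 460, 460, 460, 465, 465,
    465, 470, 470, 472, 475, 475, 475, 480, 480, 480,
    480, 485, 490, 490, 490, 500, 500, 500, 500, 510,
    510, 515, 525, 525, 525, 535, 549, 550, 570, 570,
    575, 575, 580, 590, 600, 600, 600, 600, 615, 615,
]


def percentile(data, p):
    """pth percentile using the index rule i = (p/100)n.

    Assumes p is an integer between 0 and 100.
    """
    values = sorted(data)
    n = len(values)
    if p == 0:
        return values[0]           # defined as the minimum
    if p == 100:
        return values[-1]          # defined as the maximum
    whole, remainder = divmod(p * n, 100)
    if remainder != 0:
        # i is not an integer: round up to position whole + 1,
        # which is list index `whole` (lists are 0-based).
        return values[whole]
    # i is an integer: average the values in positions i and i + 1.
    return (values[whole - 1] + values[whole]) / 2


print(percentile(rates, 25), percentile(rates, 50),
      percentile(rates, 75), percentile(rates, 80))
```

The four printed values correspond to the first quartile, median, third quartile, and 80th percentile worked out on the slides.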
Slide 16
80th Percentile

■ Example: Hotel Room Rates

i = (p/100)n = (80/100)70 = 56
Averaging the 56th and 57th data values:
80th Percentile = (535 + 549)/2 = 542

Note: Data is in ascending order.

Slide 17
80th Percentile

■ Example: Hotel Room Rates

“At least 80% of the items take on a value of 542 or less.”
56/70 = .8 or 80%

“At least 20% of the items take on a value of 542 or more.”
14/70 = .2 or 20%

Slide 18
Quartiles

■ Quartiles are specific percentiles.
■ First Quartile = 25th Percentile
■ Second Quartile = 50th Percentile = Median
■ Third Quartile = 75th Percentile

Slide 19
Third Quartile

■ Example: Hotel Room Rates

Third quartile = 75th percentile
i = (p/100)n = (75/100)70 = 52.5, which is rounded up to 53
Third quartile = value in the 53rd position = 525

Note: Data is in ascending order.

Slide 20
Suppose the average income in Hong Kong, adjusted for inflation, has been going up over the last 10 years. Can we conclude that, in general, citizens are better off economically?

Not necessarily!

If the median income has gone down over the same period, then a typical person is actually worse off, even though the people at the top have been making more money.

A statistician has his head in the oven and his feet in the refrigerator. When he is asked how he feels, he says, “On average, pretty good!”

Slide 21
Measures of Variability

■ It is often desirable to consider measures of variability (dispersion), as well as measures of location.
■ For example, in choosing supplier A or supplier B we might consider not only the average delivery time for each, but also the variability in delivery time for each.

Slide 22
Measures of Variability

■ Range
■ Interquartile Range
■ Variance
■ Standard Deviation
■ Coefficient of Variation

Slide 23
Range

■ The range of a data set is the difference between the largest and smallest data values.
■ It is the simplest measure of variability.
■ It is very sensitive to the smallest and largest data values.

Slide 24
Range

■ Example: Hotel Room Rates

Range = largest value - smallest value
Range = 615 - 425 = 190

Note: Data is in ascending order.

Slide 25
Interquartile Range

■ The interquartile range of a data set is the difference between the third quartile and the first quartile.
■ It is the range for the middle 50% of the data.
■ It overcomes the sensitivity to extreme data values.

Slide 26
Interquartile Range

■ Example: Hotel Room Rates

3rd Quartile (Q3) = 525
1st Quartile (Q1) = 445
Interquartile Range = Q3 - Q1 = 525 - 445 = 80

Note: Data is in ascending order.

Slide 27
Variance

The variance is a measure of variability that utilizes all the data.

It is based on the difference between the value of each observation (xi) and the mean (x̄ for a sample, µ for a population).

The variance is useful in comparing the variability of two or more variables.

Slide 28
Variance

The variance is the average of the squared differences between each data value and the mean.

The variance is computed as follows:

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$ for a sample

$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$$ for a population

Slide 29
Standard Deviation

The standard deviation of a data set is the positive square root of the variance.

It is measured in the same units as the data, making it more easily interpreted than the variance.

Slide 30
Standard Deviation

The standard deviation is computed as follows:

$$s = \sqrt{s^2}$$ for a sample

$$\sigma = \sqrt{\sigma^2}$$ for a population

Slide 31
Coefficient of Variation

The coefficient of variation indicates how large the standard deviation is in relation to the mean.

The coefficient of variation is computed as follows:

$$\left(\frac{s}{\bar{x}} \times 100\right)\%$$ for a sample

$$\left(\frac{\sigma}{\mu} \times 100\right)\%$$ for a population

Slide 32
Sample Variance, Standard Deviation, and Coefficient of Variation

■ Example: Hotel Room Rates

• Variance
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = 2{,}996.16$$

• Standard Deviation
$$s = \sqrt{s^2} = \sqrt{2{,}996.16} = 54.74$$

• Coefficient of Variation
$$\left(\frac{s}{\bar{x}} \times 100\right)\% = \left(\frac{54.74}{490.80} \times 100\right)\% = 11.15\%$$

The standard deviation is about 11% of the mean.
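These three figures can be checked with Python's standard `statistics` module, whose `variance` and `stdev` functions use the sample (n − 1) divisor. The sketch below is only a check of the slide's arithmetic; the variable names are illustrative.

```python
import statistics

# Hotel room rates for the 70 sampled hotels (from the later slides).
rates = [
    425, 430, 430, 435, 435, 435, 435, 435, 440, 440,
    440, 440, 440, 445, 445, 445, 445, 445, 450, 450,
    450, 450, 450, 450, 450, 460, 460, 460, 465, 465,
    465, 470, 470, 472, 475, 475, 475, 480, 480, 480,
    480, 485, 490, 490, 490, 500, 500, 500, 500, 510,
    510, 515, 525, 525, 525, 535, 549, 550, 570, 570,
    575, 575, 580, 590, 600, 600, 600, 600, 615, 615,
]

x_bar = statistics.mean(rates)
s_squared = statistics.variance(rates)   # sample variance, divisor n - 1
s = statistics.stdev(rates)              # sample standard deviation
cv_percent = s / x_bar * 100             # coefficient of variation in %

print(round(s_squared, 2), round(s, 2), round(cv_percent, 2))
```

For population data, `statistics.pvariance` and `statistics.pstdev` apply the divisor N instead.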

Slide 33
Slide 3, Part B
Descriptive Statistics: Numerical Measures

■ Measures of Distribution Shape, Relative Location, and Detecting Outliers
■ Exploratory Data Analysis
■ Measures of Association Between Two Variables
■ The Weighted Mean and Working with Grouped Data

Slide 34
Measures of Distribution Shape, Relative Location, and Detecting Outliers

■ Distribution Shape
■ z-Scores
■ Chebyshev’s Theorem
■ Empirical Rule
■ Detecting Outliers

Slide 35
Distribution Shape: Skewness

■ An important measure of the shape of a distribution is called skewness.
■ The formula for the skewness of sample data is

$$\text{Skewness} = \frac{n}{(n - 1)(n - 2)} \sum \left(\frac{x_i - \bar{x}}{s}\right)^3$$

■ Skewness can be easily computed using statistical software.
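As a sketch of what such software does, the formula above can be written directly in Python. The function name `sample_skewness` and the three small data sets are hypothetical and chosen only to show the sign of the result.

```python
import statistics


def sample_skewness(values):
    """Skewness = n / ((n - 1)(n - 2)) * sum(((x - mean) / s) ** 3).

    Requires at least 3 observations and a nonzero standard deviation.
    """
    n = len(values)
    x_bar = statistics.mean(values)
    s = statistics.stdev(values)          # sample standard deviation
    cubes = sum(((x - x_bar) / s) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * cubes


symmetric = [1, 2, 3, 4, 5]        # hypothetical, symmetric
right_skewed = [1, 2, 3, 4, 20]    # hypothetical, long right tail
left_skewed = [-20, -4, -3, -2, -1]  # mirror image, long left tail

for name, data in [("symmetric", symmetric),
                   ("right", right_skewed),
                   ("left", left_skewed)]:
    print(name, round(sample_skewness(data), 3))
```

The symmetric data give a skewness of (essentially) zero, the long right tail gives a positive value, and the long left tail a negative one.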

Slide 36
Distribution Shape: Skewness

■ Symmetric (not skewed)
• Skewness is zero.
• Mean and median are equal.

[Figure: relative frequency histogram, Skewness = 0]

Slide 37
Distribution Shape: Skewness

■ Moderately Skewed Left
• Skewness is negative.
• Mean will usually be less than the median.

[Figure: relative frequency histogram, Skewness = −.31]

Slide 38
Distribution Shape: Skewness

■ Moderately Skewed Right
• Skewness is positive.
• Mean will usually be more than the median.

[Figure: relative frequency histogram, Skewness = .31]

Slide 39
Distribution Shape: Skewness

■ Highly Skewed Right
• Skewness is positive (often above 1.0).
• Mean will usually be more than the median.

[Figure: relative frequency histogram, Skewness = 1.25]

Slide 40
Distribution Shape: Skewness

■ Example: Hotel Room Rates

Seventy hotels were randomly sampled in Shatin and Tai Po. The daily rates for the studio rooms are listed below in ascending order.

Slide 41
Distribution Shape: Skewness

■ Example: Hotel Room Rates

[Figure: relative frequency histogram of the hotel room rates, Skewness = .92]

Slide 42
z-Scores

The z-score is often called the standardized value.

It denotes the number of standard deviations a data value xi is from the mean.

$$z_i = \frac{x_i - \bar{x}}{s}$$

Excel’s STANDARDIZE function can be used to compute the z-score.

Slide 43
z-Scores

■ An observation’s z-score is a measure of the relative location of the observation in a data set.
■ A data value less than the sample mean will have a z-score less than zero.
■ A data value greater than the sample mean will have a z-score greater than zero.
■ A data value equal to the sample mean will have a z-score of zero.

Slide 44
z-Scores

■ Example: Hotel Room Rates

• z-Score of Smallest Value (425)
$$z = \frac{x_i - \bar{x}}{s} = \frac{425 - 490.80}{54.74} = -1.20$$
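The same calculation can be sketched in Python using the sample mean and standard deviation reported on the earlier slides. The function name `z_score` is illustrative, and 615 is the largest room rate listed later in the deck.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / std_dev


X_BAR = 490.80   # sample mean from the slides
S = 54.74        # sample standard deviation from the slides

z_smallest = z_score(425, X_BAR, S)   # smallest room rate
z_largest = z_score(615, X_BAR, S)    # largest room rate
z_at_mean = z_score(X_BAR, X_BAR, S)  # a value equal to the mean

print(round(z_smallest, 2), round(z_largest, 2), z_at_mean)
```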

Standardized Values for Room Rates

Slide 45
Chebyshev’s Theorem

At least (1 − 1/z²) of the items in any data set will be within z standard deviations of the mean, where z is any value greater than 1.

Chebyshev’s theorem requires z > 1, but z need not be an integer.

Slide 46
Chebyshev’s Theorem

At least 75% of the data values must be within z = 2 standard deviations of the mean.

At least 89% of the data values must be within z = 3 standard deviations of the mean.

At least 94% of the data values must be within z = 4 standard deviations of the mean.

Slide 47
Chebyshev’s Theorem

■ Example: Hotel Room Rates

Let z = 1.5 with x̄ = 490.80 and s = 54.74.

At least (1 − 1/(1.5)²) = 1 − 0.44 = 0.56 or 56% of the rent values must be between
x̄ − z(s) = 490.80 − 1.5(54.74) = 409
and
x̄ + z(s) = 490.80 + 1.5(54.74) = 573

(Actually, 86% of the rent values are between 409 and 573.)
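The bound and the actual share can be compared with a short Python sketch. The function name `chebyshev_lower_bound` is illustrative; the data are the 70 room rates listed later in the deck.

```python
import statistics


def chebyshev_lower_bound(z):
    """Minimum share of any data set within z standard deviations."""
    if z <= 1:
        raise ValueError("Chebyshev's theorem requires z > 1")
    return 1 - 1 / z ** 2


rates = [
    425, 430, 430, 435, 435, 435, 435, 435, 440, 440,
    440, 440, 440, 445, 445, 445, 445, 445, 450, 450,
    450, 450, 450, 450, 450, 460, 460, 460, 465, 465,
    465, 470, 470, 472, 475, 475, 475, 480, 480, 480,
    480, 485, 490, 490, 490, 500, 500, 500, 500, 510,
    510, 515, 525, 525, 525, 535, 549, 550, 570, 570,
    575, 575, 580, 590, 600, 600, 600, 600, 615, 615,
]

z = 1.5
x_bar = statistics.mean(rates)
s = statistics.stdev(rates)
low, high = x_bar - z * s, x_bar + z * s

# Count how many rates actually fall inside the interval.
inside = sum(low <= r <= high for r in rates)
actual_share = inside / len(rates)

print(round(chebyshev_lower_bound(z), 2), round(low), round(high),
      inside, round(actual_share, 2))
```

The guaranteed minimum is a conservative bound, which is why the observed share is considerably larger.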

Slide 48
Empirical Rule

When the data are believed to approximate a bell-shaped distribution …

The empirical rule can be used to determine the percentage of data values that must be within a specified number of standard deviations of the mean.

The empirical rule is based on the normal distribution, which will be covered in the class about Continuous Probability Distribution.

Slide 49
Empirical Rule

For data having a bell-shaped distribution:

68.26% of the values of a normal random variable are within +/- 1 standard deviation of its mean.

95.44% of the values of a normal random variable are within +/- 2 standard deviations of its mean.

99.72% of the values of a normal random variable are within +/- 3 standard deviations of its mean.
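These percentages come from the standard normal distribution and can be checked with the error function in Python's `math` module, since P(|Z| < k) = erf(k/√2). The function name `normal_share_within` is illustrative. Computed this way, the values come out one hundredth of a percentage point higher than the slide's figures, which is consistent with the slide values having been read from a rounded normal table.

```python
import math


def normal_share_within(k):
    """P(|Z| < k) for a standard normal random variable Z."""
    return math.erf(k / math.sqrt(2))


# Percentage of values within 1, 2, and 3 standard deviations.
shares = {k: round(100 * normal_share_within(k), 2) for k in (1, 2, 3)}
print(shares)
```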

Slide 50
Empirical Rule

[Figure: bell curve over x showing 68.26% of values between µ – 1σ and µ + 1σ, 95.44% between µ – 2σ and µ + 2σ, and 99.72% between µ – 3σ and µ + 3σ]

Slide 51
Detecting Outliers

■ An outlier is an unusually small or unusually large value in a data set.
■ A data value with a z-score less than -3 or greater than +3 might be considered an outlier.
■ It might be:
• an incorrectly recorded data value
• a data value that was incorrectly included in the data set
• a correctly recorded data value that belongs in the data set

Slide 52
Detecting Outliers

■ Example: Hotel Room Rates

• The most extreme z-scores are -1.20 and 2.27.
• Using |z| > 3 as the criterion for an outlier, there are no outliers in this data set.

Standardized Values for Hotel Room Rates

Slide 53
Exploratory Data Analysis

Exploratory data analysis procedures enable us to use simple arithmetic and easy-to-draw pictures to summarize data.

We simply sort the data values into ascending order and identify the five-number summary and then construct a box plot.

Slide 54
Five-Number Summary

1 Smallest Value

2 First Quartile

3 Median

4 Third Quartile

5 Largest Value

Slide 55
Five-Number Summary

■ Example: Hotel Room Rates

Lowest Value = 425
First Quartile = 445
Median = 475
Third Quartile = 525
Largest Value = 615

425 430 430 435 435 435 435 435 440 440
440 440 440 445 445 445 445 445 450 450
450 450 450 450 450 460 460 460 465 465
465 470 470 472 475 475 475 480 480 480
480 485 490 490 490 500 500 500 500 510
510 515 525 525 525 535 549 550 570 570
575 575 580 590 600 600 600 600 615 615

Slide 56
Box Plot

A box plot is a graphical summary of data that is based on a five-number summary.

A key to the development of a box plot is the computation of the median and the quartiles Q1 and Q3.

Box plots provide another way to identify outliers.

Slide 57
Box Plot

■ Example: Hotel Room Rates

• A box is drawn with its ends located at the first and third quartiles.
• A vertical line is drawn in the box at the location of the median (second quartile).

[Figure: box plot on an axis from 400 to 625, with Q1 = 445, Q2 = 475, and Q3 = 525 marked]
Slide 58
Box Plot

■ Limits are located (not drawn) using the interquartile range (IQR).
■ Data outside these limits are considered outliers.
■ The location of each outlier is shown with the symbol *.

Slide 59
Box Plot

■ Example: Hotel Room Rates

• The lower limit is located 1.5(IQR) below Q1.

Lower Limit: Q1 - 1.5(IQR) = 445 - 1.5(80) = 325

• The upper limit is located 1.5(IQR) above Q3.

Upper Limit: Q3 + 1.5(IQR) = 525 + 1.5(80) = 645

• There are no outliers (values less than 325 or greater than 645) in the hotel room rate data.
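The limit calculation can be sketched in Python using the quartiles from the slides. The function names `box_plot_limits` and `find_outliers` are illustrative, and the second call uses hypothetical values only to show what a flagged outlier looks like.

```python
def box_plot_limits(q1, q3, factor=1.5):
    """Return (lower, upper) limits located factor * IQR beyond Q1 and Q3."""
    iqr = q3 - q1
    return q1 - factor * iqr, q3 + factor * iqr


def find_outliers(values, q1, q3):
    """Values lying outside the box plot limits."""
    lower, upper = box_plot_limits(q1, q3)
    return [v for v in values if v < lower or v > upper]


Q1, Q3 = 445, 525                      # quartiles from the slides
lower, upper = box_plot_limits(Q1, Q3)

# Checking the smallest and largest room rates is enough to show that
# no hotel rate falls outside the limits.
hotel_outliers = find_outliers([425, 615], Q1, Q3)

# Hypothetical values, two of which lie outside the limits.
demo_outliers = find_outliers([300, 500, 700], Q1, Q3)

print(lower, upper, hotel_outliers, demo_outliers)
```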

Slide 60
Box Plot

■ Example: Hotel Room Rates

• Whiskers (dashed lines) are drawn from the ends of the box to the smallest and largest data values inside the limits.

[Figure: box plot on an axis from 400 to 625, with whiskers extending to the smallest value inside the limits (425) and the largest value inside the limits (615)]
Slide 61
Box Plot

An excellent graphical technique for making comparisons among two or more groups.
Slide 62
Measures of Association Between Two Variables

Thus far we have examined numerical methods used to summarize the data for one variable at a time.

Often a manager or decision maker is interested in the relationship between two variables.

Two descriptive measures of the relationship between two variables are covariance and correlation coefficient.

Slide 63
Covariance

The covariance is a measure of the linear association between two variables.

Positive values indicate a positive relationship.

Negative values indicate a negative relationship.

Slide 64
Covariance

The covariance is computed as follows:

$$s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$ for samples

$$\sigma_{xy} = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N}$$ for populations

Slide 65
Correlation Coefficient

Correlation is a measure of linear association and not necessarily causation.

Just because two variables are highly correlated, it does not mean that one variable is the cause of the other.

Slide 66
Correlation Coefficient

The correlation coefficient is computed as follows:

$$r_{xy} = \frac{s_{xy}}{s_x s_y}$$ for samples

$$\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$$ for populations

Slide 67
Correlation Coefficient

The coefficient can take on values between -1 and +1.

Values near -1 indicate a strong negative linear relationship.

Values near +1 indicate a strong positive linear relationship.

The closer the correlation is to zero, the weaker the relationship.

Slide 68
Covariance and Correlation Coefficient

■ Example: Golfing Study

A golfer is interested in investigating the relationship, if any, between driving distance and 18-hole score.

Average Driving Distance (yds.)    Average 18-Hole Score
277.6                              69
259.5                              71
269.1                              70
267.0                              70
255.6                              71
272.9                              69

Slide 69
Covariance and Correlation Coefficient

■ Example: Golfing Study

             x        y     (xi − x̄)   (yi − ȳ)   (xi − x̄)(yi − ȳ)
           277.6     69       10.65      -1.0          -10.65
           259.5     71       -7.45       1.0           -7.45
           269.1     70        2.15       0              0
           267.0     70        0.05       0              0
           255.6     71      -11.35       1.0          -11.35
           272.9     69        5.95      -1.0           -5.95
Average    267.0     70.0                   Total      -35.40
Std. Dev.  8.2192    .8944

Slide 70
Covariance and Correlation Coefficient

■ Example: Golfing Study

• Sample Covariance
$$s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1} = \frac{-35.40}{6 - 1} = -7.08$$

• Sample Correlation Coefficient
$$r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{-7.08}{(8.2192)(.8944)} = -.9631$$
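The golfing figures can be reproduced with a short Python sketch that applies the sample formulas directly. The function names `sample_covariance` and `sample_correlation` are illustrative; the data are the six pairs from the golfing study.

```python
import math


def sample_covariance(xs, ys):
    """Sum of the products of deviations, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


def sample_correlation(xs, ys):
    """Sample covariance divided by the product of the sample
    standard deviations."""
    s_x = math.sqrt(sample_covariance(xs, xs))   # covariance of x with itself = variance
    s_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (s_x * s_y)


distance = [277.6, 259.5, 269.1, 267.0, 255.6, 272.9]  # average driving distance (yds.)
score = [69, 71, 70, 70, 71, 69]                        # average 18-hole score

s_xy = sample_covariance(distance, score)
r_xy = sample_correlation(distance, score)
print(round(s_xy, 2), round(r_xy, 4))
```

The strongly negative correlation says that, in this small sample, longer average drives go with lower (better) scores; it does not by itself establish that one causes the other.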

Slide 71
The Weighted Mean and Working with Grouped Data

■ Weighted Mean
■ Mean for Grouped Data
■ Variance for Grouped Data
■ Standard Deviation for Grouped Data

Slide 72
Weighted Mean

■ When the mean is computed by giving each data value a weight that reflects its importance, it is referred to as a weighted mean.
■ In the computation of a grade point average (GPA), the weights are the number of credit hours earned for each grade.
■ When data values vary in importance, the analyst must choose the weight that best reflects the importance of each value.

Slide 73
Weighted Mean

$$\bar{x} = \frac{\sum w_i x_i}{\sum w_i}$$

where:
xi = value of observation i
wi = weight for observation i
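As an illustration of the formula, the sketch below computes a grade point average. The grade points and credit hours are hypothetical numbers invented for this example, and the function name `weighted_mean` is illustrative.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


# Hypothetical transcript: grade points earned and credit hours per course.
grade_points = [4, 3, 2]
credit_hours = [3, 3, 4]

gpa = weighted_mean(grade_points, credit_hours)
unweighted = sum(grade_points) / len(grade_points)
print(gpa, unweighted)
```

The weighted result is lower than the simple average here because the lowest grade carries the most credit hours.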

Slide 74
Grouped Data

■ The weighted mean computation can be used to obtain approximations of the mean, variance, and standard deviation for the grouped data.
■ To compute the weighted mean, we treat the midpoint of each class as though it were the mean of all items in the class.
■ We compute a weighted mean of the class midpoints using the class frequencies as weights.
■ Similarly, in computing the variance and standard deviation, the class frequencies are used as weights.

Slide 75
Mean for Grouped Data

■ Sample Data

$$\bar{x} = \frac{\sum f_i M_i}{n}$$

■ Population Data

$$\mu = \frac{\sum f_i M_i}{N}$$

where:
fi = frequency of class i
Mi = midpoint of class i

Slide 76
Sample Mean for Grouped Data

■ Example: Hotel Room Rates

The previously presented sample of hotel room rates is shown here as grouped data in the form of a frequency distribution.

Slide 77
Sample Mean for Grouped Data

■ Example: Hotel Room Rates

$$\bar{x} = \frac{34{,}525}{70} = 493.21$$

This approximation differs by $2.41 from the actual sample mean of $490.80.

Slide 78
Variance for Grouped Data

■ For sample data

$$s^2 = \frac{\sum f_i (M_i - \bar{x})^2}{n - 1}$$

■ For population data

$$\sigma^2 = \frac{\sum f_i (M_i - \mu)^2}{N}$$

Slide 79
Sample Variance for Grouped Data

■ Example: Hotel Room Rates

(continued)
Slide 80
Sample Variance for Grouped Data

■ Example: Hotel Room Rates

• Sample Variance
s² = 208,234.29/(70 – 1) = 3,017.89

• Sample Standard Deviation
$$s = \sqrt{3{,}017.89} = 54.94$$

This approximation differs by only $.20 from the actual standard deviation of $54.74.
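The grouped-data calculations can be sketched in Python. The frequency table itself did not survive in this text, so the classes below are an assumption: ten classes of width 20, from 420–439 to 600–619, with frequencies tallied from the 70 room rates listed earlier. Under that assumption the sketch reproduces the totals shown on the slides. All variable names are illustrative.

```python
import math

# Assumed classes: 420-439, 440-459, ..., 600-619 (width 20).
# Midpoints and frequencies tallied from the 70 listed room rates.
midpoints = [429.5, 449.5, 469.5, 489.5, 509.5,
             529.5, 549.5, 569.5, 589.5, 609.5]
frequencies = [8, 17, 12, 8, 7, 4, 2, 4, 2, 6]

n = sum(frequencies)

# Weighted mean of the class midpoints, with frequencies as weights.
sum_fm = sum(f * m for f, m in zip(frequencies, midpoints))
grouped_mean = sum_fm / n

# Sample variance and standard deviation for grouped data.
sum_sq = sum(f * (m - grouped_mean) ** 2
             for f, m in zip(frequencies, midpoints))
grouped_variance = sum_sq / (n - 1)
grouped_sd = math.sqrt(grouped_variance)

print(n, sum_fm, round(grouped_mean, 2),
      round(sum_sq, 2), round(grouped_variance, 2), round(grouped_sd, 2))
```

Because every value in a class is replaced by the class midpoint, these results only approximate the statistics computed from the individual observations.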

Slide 81
Check Your Understanding
What can you conclude from the following data set?
Variable A Variable B
1 12
6 18
23 25
28 43
55 52
56 73
64 75
66 94
A. The correlation coefficient equals -0.86, so the two
variables have a strong negative linear relationship.
B. The correlation coefficient equals 0.05, so the two
variables have little linear relationship.
C. The correlation coefficient equals 0.95, so the two
variables have a strong positive linear relationship.
Slide 82
