
Topic 1 Describing Data II

This document provides an overview of key concepts for describing data, including measures of central tendency, variability, and relationships. It discusses graphical and numerical methods for describing distributions, such as the mean, median, mode, range, variance, standard deviation, and percentiles. Examples are provided to demonstrate calculating and interpreting various descriptive statistics.

Uploaded by

anneshadas2005

ECON 1280 Analysis of Economics Data

Lecture 1. Describing Data


(Chapters 1 and 2)
Xiao Betty Wang

HKU Faculty of Business and Economics


The University of Hong Kong
Today’s lecture
1. What is statistics?
   1. Why do we study it?
   2. Some key definitions in statistics
2. How do we describe data?
   1. Graphical
   2. Numerical
Describing Data Numerically

• Central tendency: mean, median, mode
• Shape: skewness
• Location: minimum, maximum, percentiles, quartiles, z-score
• Variation: range, IQR, variance, standard deviation, coefficient of variation
• Relationship: covariance, correlation
Measures of Central Tendency: Mean, Median, and Mode
• Measures of central tendency provide information about a “typical” observation in the data
• Usually computed from sample data rather than from population data

Central tendency:
• Mean: the arithmetic average, $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
• Median: the midpoint of the ranked values
• Mode: the most frequently observed value (if one exists)
Measures of Central Tendency: Mean, Median, and Mode
• The (arithmetic) mean of a set of data is the sum of the data values
divided by the number of observations
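• The population mean is a parameter given by
$\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + x_3 + \cdots + x_N}{N}$, where $N$ is the population size
• The sample mean is a statistic given by
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$, where $n$ is the sample size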
• The mean is appropriate for numerical data
Median
The median is the middle observation of a set of observations that are
arranged in increasing (or decreasing) order.
• If n is odd, the median is the middle observation.
• If n is even, the median is the average of the two middle
observations.
• In both cases, the median is the value located in the 0.5(n+1)th ordered position.
• The median is more robust to outliers than the mean. [why?]
Mode
The mode, if one exists, is the most frequently occurring value.
• A distribution with one mode is called unimodal; with two (local)
modes, it is called bimodal; and with more than two (local) modes, it
is said to be multimodal.
• The mode is most commonly used with categorical data
Question:
• You want to measure the central tendency of the following
data. Which measurement would you use? Mean, Median or
Mode?
• Mode is most commonly used with categorical data (why?)

Variable   Value         Numerical Value
Gender     Female        Female=1
           Male          Male=2
           Intersex      Intersex=3
           Transgender   Transgender=4
           Others        Others=5
The most appropriate measure of central tendency is context specific
Pick the appropriate measurement(s) of the central tendency for the
following data.
• (a) mean, (b) median, and/or (c) mode.

1. As a clothing retailer manager, you want to know what sizes are
most in-demand for inventory decisions.
2. As a student, you want to know where you are among the class from
the midterm grades.
3. You want to understand the well-being of an economy from its
income distribution. Which measure is most appropriate?
Example 2.1: Demand for Bottled Water
• The number of bottles of water sold in each of n = 12 hours at one store during
hurricane season is 60, 84, 65, 67, 75, 72, 80, 85, 63, 82, 70, 75. What
are the mean, median, and mode?
• The mean is $\bar{x} = \frac{60+84+65+67+75+72+80+85+63+82+70+75}{12} = 73.17$
• To find the median, arrange the sales from least to greatest:
• 60, 63, 65, 67, 70, 72, 75, 75, 80, 82, 84, 85
• So the median is $x_{0.5} = \frac{72+75}{2} = 73.5$
• The mode is 75.
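The three measures in Example 2.1 can be checked with a short sketch using Python's standard `statistics` module. The variable names are illustrative only and do not come from the textbook.

```python
from statistics import mean, median, mode

# Hourly bottled water sales from Example 2.1
sales = [60, 84, 65, 67, 75, 72, 80, 85, 63, 82, 70, 75]

sample_mean = mean(sales)      # sum of the values divided by n
sample_median = median(sales)  # average of the two middle values, since n is even
sample_mode = mode(sales)      # most frequently occurring value

print(round(sample_mean, 2), sample_median, sample_mode)
```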
Describing Data Numerically: Measures of Central Tendency

The most appropriate measure of central tendency is context specific.
• For categorical data: the mode is appropriate, and so is the median if the categories can be ranked (not the mean)
• For numerical data (the most popular data type in business
applications): the median (especially when outliers exist) and the mean are more
appropriate
• With the mode: it is hard to tell the center if each value occurs only once
Percentiles and Quartiles

• Percentiles and quartiles are measures that indicate the location,
or position, of a value relative to the entire set of data.
• They are generally used to describe large data sets, e.g., sales
data, survey data, or even the weights of newborn babies.
Percentiles and Quartiles
Arranging the data in order from the smallest to the largest,
the pth percentile is a value such that approximately p% of the
observations are at or below that number.
• Percentiles separate large ordered data sets into 100ths.
• The 50th percentile is the median.

$p$th percentile = value located in the $\frac{p}{100}(n+1)$th ordered position
Percentiles and Quartiles
Quartiles are descriptive measures that separate large data sets into four
quarters.
• Split the ranked data into 4 segments with an equal number of values
per segment. Note that the widths of the segments may be different.
1. The first quartile, 𝑄1 , (or 25th percentile) separates approximately
the smallest 25% of the data from the remainder of the data.
2. The second quartile, 𝑄2 , (or 50th percentile) is the median.
3. The third quartile, 𝑄3 , (or 75th percentile) separates approximately
the smallest 75% of the data from the remainder of the data.

25% | Q1 | 25% | Q2 | 25% | Q3 | 25%
Find a quartile by determining the value in the appropriate position in the
ranked data
• where n is the number of observed values
• 𝑄1 = the value in the 0.25(n+1)th ordered position.
• 𝑄2 = the value in the 0.50(n+1)th ordered position.
• 𝑄3 = the value in the 0.75 (n+1)th ordered position.

First quartile position Q1 = 0.25(n+1)

Second quartile position Q2 = 0.50(n+1)

Third quartile position Q3 = 0.75(n+1)


Example Question
Sample Ranked Data: 11 12 13 16 16 17 18 21 22

Find the first quartile


Answer:
• n=9
• Q1 is in the 0.25(9+1) = 2.5th position of the
ranked data
• Use the value halfway between the 2nd and 3rd
values (12 and 13).
• Q1 = 12.5
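The position rule can be sketched as a small function. This is a minimal illustration, not a library routine: the name `percentile` is hypothetical, and clamping to the smallest or largest value when the position falls outside 1..n is an assumption the slides do not cover.

```python
def percentile(data, p):
    """Value at the (p/100)(n+1)th ordered position, interpolating linearly."""
    xs = sorted(data)
    n = len(xs)
    pos = p / 100 * (n + 1)   # 1-based position, possibly fractional
    lower = int(pos)          # whole part of the position
    frac = pos - lower        # fractional part used for interpolation
    if lower < 1:             # assumption: clamp below the first value
        return xs[0]
    if lower >= n:            # assumption: clamp above the last value
        return xs[-1]
    return xs[lower - 1] + frac * (xs[lower] - xs[lower - 1])

ranked = [11, 12, 13, 16, 16, 17, 18, 21, 22]
q1 = percentile(ranked, 25)   # position 2.5: halfway between 12 and 13
print(q1)
```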
Five-Number Summary
The five-number summary refers to five descriptive measures:
1. minimum
2. first quartile
3. median
4. third quartile
5. maximum
• minimum < Q1 < median < Q3 < maximum
Example Question
Demand for bottled water, with sales in ascending order:
60, 63, 65, 67, 70, 72, 75, 75, 80, 82, 84, 85
The sample size n is small, but we use it for illustration.
Find the five-number summary.
Answer:
• Q1 = the value located in the 0.25(12+1) = 3.25th ordered position.
– Then Q1 = 65+0.25(67-65) = 65.5
• Q3 = the value located in the 0.75(12+1) = 9.75th ordered position.
– Then Q3 = 80+0.75(82-80) = 81.5
• The five-number summary is 60 < 65.5 < 73.5 < 81.5 < 85
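As a cross-check, the standard library's `statistics.quantiles` with `method="exclusive"` places quartiles at the same (n+1)-based positions, so it should reproduce the hand calculation. The tuple name `five_number` is illustrative.

```python
from statistics import quantiles

sales = [60, 63, 65, 67, 70, 72, 75, 75, 80, 82, 84, 85]

# n=4 asks for the three quartile cut points; "exclusive" uses (n+1)-based positions
q1, q2, q3 = quantiles(sales, n=4, method="exclusive")
five_number = (min(sales), q1, q2, q3, max(sales))
print(five_number)
```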
Measures of Variability
• Two datasets can have the same mean, but the observations in one set could
vary more from the mean than those in the other.
• Sample A: 1, 2, 1, 36
• Sample B: 8, 9, 10, 13
• Both have a mean of 10, but the spread of Sample A is obviously larger than
that of Sample B.
• Measures of variability give information on the spread or variability of the data values.
[Figure: two distributions with the same center but different variability]
Measures of Variability: Methods

Variability: Range, Interquartile Range, Variance, Standard Deviation, Coefficient of Variation
Measures of Variability: Range and Interquartile Range
The range is the difference between the largest and smallest
observations.
• The greater the spread of the data from the center of the distribution,
the larger the range will be.

Range = Xlargest – Xsmallest


Disadvantages of range:
• sensitive to outliers

1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4

1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119
Measures of Variability: Range and Interquartile Range

• Although the range measures the total spread, it is sensitive to outliers.
• One solution is to discard a few of the highest and a few of the lowest
numbers. E.g. scoring free dives in Olympic Games: drop the highest
and lowest scores.
• Or use the interquartile range (IQR): it measures the spread in the
middle 50% of the data (essentially dropping observations below the 25th
percentile and above the 75th percentile):
• IQR = Q3-Q1
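The contrast between the two measures can be seen on the outlier example used for the range. The sketch below assumes quartiles are taken at (n+1)-based positions, which `statistics.quantiles` provides via `method="exclusive"`. The helper name `spread` is illustrative.

```python
from statistics import quantiles

def spread(data):
    """Return (range, IQR) for a list of numbers."""
    q1, _, q3 = quantiles(data, n=4, method="exclusive")
    return max(data) - min(data), q3 - q1

base = [1] * 11 + [2] * 8 + [3] * 4 + [4]
without_outlier = base + [5]
with_outlier = base + [120]

print(spread(without_outlier))  # range and IQR with largest value 5
print(spread(with_outlier))     # range jumps, IQR is unchanged
```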
Box-and-Whisker Plots
A box-and-whisker plot is a graph that
describes the shape of the distribution
in terms of the five-number summary.
• The inner box shows the numbers
that span the range from the first to
the third quartile.
• A line is drawn through the box at the
median.
• There are two "whiskers": one from
the 25th percentile to the minimum,
and the other from the 75th
percentile to the maximum.
• Location 3 is skewed left. Location 4 is skewed right.
• The plot can be oriented horizontally or vertically

Example: Xminimum = 12, Q1 = 30, Median (Q2) = 45, Q3 = 57, Xmaximum = 70, with 25% of the observations in each of the four segments.
A Boxplot in Excel
• The upper whisker ends at $\min(Q_3 + 1.5 \times IQR,\ \text{maximum})$
• The lower whisker ends at $\max(Q_1 - 1.5 \times IQR,\ \text{minimum})$
• The points beyond are marked as extreme points
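The whisker rule stated above can be sketched as follows. The quartile convention ((n+1)-based positions via `method="exclusive"`) is an assumption, and Excel's own quartile calculation may differ. The function name `whisker_ends` is hypothetical.

```python
from statistics import quantiles

def whisker_ends(data):
    """Apply the 1.5 x IQR rule and list the points beyond the whiskers."""
    q1, _, q3 = quantiles(data, n=4, method="exclusive")
    iqr = q3 - q1
    upper = min(q3 + 1.5 * iqr, max(data))
    lower = max(q1 - 1.5 * iqr, min(data))
    extremes = [x for x in data if x < lower or x > upper]
    return lower, upper, extremes

data = [1] * 11 + [2] * 8 + [3] * 4 + [4, 120]
lower, upper, extremes = whisker_ends(data)
print(lower, upper, extremes)
```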
Practice Question
A small accounting office is trying to determine its staffing needs for the coming tax
season. The manager has collected the following data: 46, 27, 79, 57, 99, 75, 48, 89,
and 85. These values represent the number of returns the office completed each year
over the entire nine years it has been doing tax returns. For this data, what is the
interquartile range for the number of tax returns completed each year?
Variance and Standard Deviation
• Both range and IQR use only two of the data values. Variance uses
the distances of all observations from the mean.
• Variance: measures the average squared “deviation” from the mean.
The unit of variance is the squared unit of the observations, 𝑥𝑖
• Standard deviation: is the positive square root of the variance. By
taking square root, we get back to the “standard” (original) unit of
observations, 𝑥𝑖
Variance and Standard Deviation
• The population variance is the sum of the squared differences between each
observation and the population mean divided by the population size:
$\sigma^2 = \frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}$
• The sample variance is the sum of the squared differences between each
observation and the sample mean divided by the sample size minus 1 (why n-1? We
will explain in Topic 5):
$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}$
• Simulation showing bias in sample variance
• 6:24 (start from 1:24)
Variance and Standard Deviation
• Unbiased sample variance: $s^2_{n-1} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}$
• Biased sample variance: $s^2_{n} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n}$
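Dividing by n - 1 versus dividing by n can be compared exactly on a tiny case. The sketch below uses a hypothetical population {1, 2, 3, 4} and lists every ordered sample of size 2 drawn with replacement. Both the population and the sampling scheme are assumptions chosen for illustration. Averaged over all samples, the n - 1 version equals the population variance, while the n version falls short.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3, 4]   # hypothetical population
N = len(population)
mu = Fraction(sum(population), N)
sigma2 = sum((x - mu) ** 2 for x in population) / N   # population variance

def sum_sq_dev(sample):
    """Sum of squared deviations from the sample mean."""
    xbar = Fraction(sum(sample), len(sample))
    return sum((x - xbar) ** 2 for x in sample)

n = 2
samples = list(product(population, repeat=n))   # all 16 ordered samples
avg_unbiased = sum(sum_sq_dev(s) / (n - 1) for s in samples) / len(samples)
avg_biased = sum(sum_sq_dev(s) / n for s in samples) / len(samples)

print(sigma2, avg_unbiased, avg_biased)
```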
Variance and Standard Deviation
Sample variance, $s^2$, can be computed as follows:
• $s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n - 1}$
• $s^2 = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n - 1}$
• The population standard deviation, σ, is the (positive) square root of
the population variance:
$\sigma = \sqrt{\frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}}$
• The sample standard deviation, s, is the (positive) square root of
the sample variance:
$s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}}$

• Variance measures the average squared "deviation" from the
mean and has the unit of the squared unit of xi
• By taking the square root, we get back to the "standard" original unit of xi
• The standard deviation is a measure of the “average” scatter around the mean
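The definitional and shortcut forms of the sample variance should agree. The sketch below checks this on the bottled water sales from Example 2.1, using exact fractions so that rounding does not blur the comparison, and then takes the square root to return to the original unit. Variable names are illustrative.

```python
from fractions import Fraction
from math import sqrt
from statistics import variance

sales = [60, 84, 65, 67, 75, 72, 80, 85, 63, 82, 70, 75]
n = len(sales)
xbar = Fraction(sum(sales), n)

# Definition: sum of squared deviations from the mean, divided by n - 1
s2_definition = sum((x - xbar) ** 2 for x in sales) / (n - 1)

# Shortcut: (sum of x_i^2 minus (sum of x_i)^2 / n), divided by n - 1
s2_shortcut = (sum(x * x for x in sales) - Fraction(sum(sales) ** 2, n)) / (n - 1)

s = sqrt(s2_definition)   # standard deviation, in bottles per hour
print(float(s2_definition), round(s, 2))
```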
Which one has the smaller standard deviation?
• A
• B
Example 2.9: Gilotti’s Pizzeria Sales at Location 1

• There is a typo in Table 2.4 in the textbook: $\bar{x} = \frac{\sum x_i}{n}$
Practice Question
The following data represent scores on a 15 point aptitude test: 8, 10, 15, 12, 14, and 13.
(a). Subtract 5 from every observation and compute the sample mean for the original
data and the new data.

(b). Subtract 5 from every observation and compute the sample variance for the original
data and the new data.

(c). What effect, if any, does subtracting 5 from every observation have on the sample
mean and sample variance?
• Stock A
• Average price last year = $50
• Standard deviation = $5

• Stock B:
• Average price last year = $100
• Standard deviation = $5

Ignore the market risk and the correlations between individual
stocks and the market.
• Is the statement “stock A and B are equally risky” true or false?
Coefficient of Variation
The coefficient of variation (CV), is a measure of relative dispersion
that expresses the standard deviation as a percentage of the mean
(provided the mean is positive and NOT close to zero)
• When the means of two objects are different, it is better to compare
them using CV rather than $\sigma^2$ or $s^2$
• Always in percentage (%)
• Population coefficient of variation: $CV = \frac{\sigma}{\mu} \times 100\%$
• Sample coefficient of variation: $CV = \frac{s}{\bar{x}} \times 100\%$
• Stock A
• Average price last year = $50
• Standard deviation = $5
$CV_A = \frac{s}{\bar{x}} \times 100\% = \frac{\$5}{\$50} \times 100\% = 10\%$
• Stock B
• Average price last year = $100
• Standard deviation = $5
$CV_B = \frac{s}{\bar{x}} \times 100\% = \frac{\$5}{\$100} \times 100\% = 5\%$
Both stocks have the same standard deviation, but stock B is less variable relative
to its price.
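The two-stock comparison can be reproduced with a small helper. The function name `coefficient_of_variation` is illustrative, and rejecting non-positive means is a simple stand-in for the condition that the mean be positive and not close to zero.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation expressed as a percentage of the mean."""
    if mean <= 0:
        raise ValueError("CV is only meaningful for a positive mean")
    return 100 * std_dev / mean

cv_a = coefficient_of_variation(5, 50)    # Stock A
cv_b = coefficient_of_variation(5, 100)   # Stock B
print(cv_a, cv_b)
```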
Practice Question
For the following three samples, for which sample is the data most
closely grouped about the sample mean? Give a written explanation that
supports your conclusion. (Hint: use the coefficient of variation)
• Sample 1: 15, 16, 19, 21, 28;
• Sample 2: 44, 49, 50, 51, 57; and
• Sample 3: 122.8, 123.7, 124.6, 130.5, 135.8.
Chebyshev’s Theorem
• For any population with mean 𝜇, standard deviation 𝜎, and 𝑘 > 1, the
percent of observations that lie within the interval $[\mu - k\sigma,\ \mu + k\sigma]$ is at least
$100\left[1 - \frac{1}{k^2}\right]\%$
• where 𝑘 is the number of standard deviations
• Chebyshev’s theorem can be applied to any distribution, but it is often too
conservative.
• check k = 1 for the extreme case
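The bound is easy to tabulate. In this sketch the function name is illustrative, and k = 1 is allowed only to show the extreme case, where the guarantee drops to 0%.

```python
def chebyshev_min_percent(k):
    """Minimum percent of observations within k standard deviations of the mean."""
    if k < 1:
        raise ValueError("the bound is only informative for k >= 1")
    return 100 * (1 - 1 / k ** 2)

for k in (1, 1.5, 2, 3):
    print(k, round(chebyshev_min_percent(k), 1))
```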
Empirical Rule
An empirical rule, called the 68-95-99.7 rule, gives more precise
guidelines for the percentage of data values that lie within 1, 2, and
3 standard deviations (𝜎) of the mean (𝜇) for many large
populations
• applies to data distributions that are bell-shaped
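The three percentages can be recovered if the bell shape is taken to be exactly a normal distribution, which is an assumption beyond what the rule itself requires. For a normal distribution, the share of values within k standard deviations of the mean is erf(k/√2).

```python
from math import erf, sqrt

def share_within(k):
    """Share of a normal distribution lying within k standard deviations of the mean."""
    return erf(k / sqrt(2))

shares = {k: round(100 * share_within(k), 1) for k in (1, 2, 3)}
print(shares)
```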
z-Score
A z-score is a standardized value that indicates the number of standard
deviations a value is from the mean.
• For a population, the z-score of each value $x_i$ is
$z_i = \frac{x_i - \mu}{\sigma}$
• For the sample, the z-score of each value $x_i$ is
$z_i = \frac{x_i - \bar{x}}{s}$
• Percentiles and quartiles are measures that indicate the location or
position of a value relative to the entire set of data, while a z-score
measures the location or position of a value relative to the mean of the
distribution
• 𝑧𝑖 > 0: the value is greater than the mean
• 𝑧𝑖 < 0: the value is less than the mean
• 𝑧𝑖 = 0: the value is equal to the mean
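A short sketch of sample z-scores, reusing the bottled water sales from Example 2.1 as illustrative data. After standardizing, the z-scores have mean 0 and sample standard deviation 1.

```python
from statistics import mean, stdev

sales = [60, 84, 65, 67, 75, 72, 80, 85, 63, 82, 70, 75]
xbar = mean(sales)
s = stdev(sales)   # sample standard deviation (divides by n - 1)

z_scores = [(x - xbar) / s for x in sales]

for x, z in zip(sales, z_scores):
    print(x, round(z, 2))
```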
Shape of the Distribution: Skewness
• Skewness is defined as $\text{skewness} = \frac{1}{n} \cdot \frac{\sum_{i=1}^{n}(x_i - \bar{x})^3}{s^3}$
  • The numerator is the key. The denominator serves the purpose of standardization (free of
  units of $x_i$)
• Skewness is positive if a distribution is skewed to the right, negative if skewed
to the left, and zero if the distribution is symmetric about its mean, as in a mounded,
bell-shaped distribution (refer to figures in the previous lecture slides)
• For continuous numerical unimodal data, the mean is usually less than the
median in a skewed-left distribution, and vice versa
  • E.g. the distribution of income is usually right skewed, so the median is more appropriate
  than the mean [why?]
• For a symmetric distribution, mean = median [is the converse true?]
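The sign behaviour can be checked on small made-up samples. The sketch follows the slide's definition, with s taken as the sample standard deviation. All three data sets are hypothetical.

```python
from statistics import mean, stdev

def skewness(data):
    """(1/n) * sum((x - xbar)^3) / s^3, with s the sample standard deviation."""
    n = len(data)
    xbar = mean(data)
    s = stdev(data)
    return sum((x - xbar) ** 3 for x in data) / n / s ** 3

right_skewed = [1, 2, 2, 3, 10]          # long tail to the right
symmetric = [1, 2, 3, 4, 5]              # symmetric about 3
left_skewed = [-x for x in right_skewed]  # mirror image, long tail to the left

print(skewness(right_skewed), skewness(symmetric), skewness(left_skewed))
```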
Practice Question
A set of data is mounded, with a mean of 500 and a variance of 576.
(a) Approximately what proportion of the observations is greater than 476? (Hint: use the
Empirical Rule)
Practice Question
• A set of data is mounded, with a mean of 500 and a variance of 576.
(b) Approximately what proportion of the observations is less than 548?

(c) Approximately what proportion of the observations is greater than 572?


Practice Question
• A set of data is mounded, with a mean of 500 and a variance of 576.
(a) Approximately what proportion of the observations is between 452 and 548?

(b) Approximately what proportion of the observations is between 428 and 572?

(c) Approximately what proportion of the observations is between 476 and 524?
Practice Question
The manager of 45 sales people examined their monthly expenditures
on entertaining clients. He found that the mean amount was $237.50
with a standard deviation of $27.40. Assuming the data is bell-shaped,
would a claim for the amount of $300 be considered unlikely? Why or
why not?
Practice Question
A large sample is selected from a bell-shaped distribution. The middle
99.7% of the sample data falls between 24.2 and 69.2. Estimate the
sample mean and the sample standard deviation.
What Is And How To Use Chebyshev's Theorem And The Empirical Rule
Formula In Statistics Explained
• 3:13
Weighted Mean
The weighted mean of a set of data is
$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{n} = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_n x_n}{n}$
• where $w_i$ is the weight of the $i$th observation, and $n = \sum_{i=1}^{n} w_i$


Example 2.17: Stock Recommendation

• $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{n} = \frac{10+6+18+0+0}{19} = 1.79$
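The slide shows only the products 10, 6, 18, 0, 0 and the total weight 19, not the underlying table. The sketch below assumes rating values 1 to 5 with counts 10, 3, 6, 0, 0, which is one set of inputs consistent with those products. These inputs are a reconstruction for illustration, not data taken from the textbook.

```python
# Assumed inputs: rating values 1..5 and how many analysts gave each rating
ratings = [1, 2, 3, 4, 5]
counts = [10, 3, 6, 0, 0]   # weights w_i; they sum to n = 19

products = [w * x for w, x in zip(counts, ratings)]   # 10, 6, 18, 0, 0
n = sum(counts)
weighted_mean = sum(products) / n
print(products, n, round(weighted_mean, 2))
```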
If the data are intervals rather than specific values,
can we calculate the exact mean and variance?
• No, but we can approximate them
Measures of Grouped Data
• Suppose that data are grouped into $K$ classes, with
frequencies $f_1, f_2, \ldots, f_K$. If the midpoints of these
classes are $m_1, m_2, \ldots, m_K$, then the sample mean
and sample variance can be approximated as
$\bar{x} = \frac{\sum_{i=1}^{K} f_i m_i}{n}$ and $s^2 = \frac{\sum_{i=1}^{K} f_i (m_i - \bar{x})^2}{n - 1}$
• where $n = \sum_{i=1}^{K} f_i$
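A minimal sketch of the grouped-data approximation. The class midpoints and frequencies below are hypothetical and are not the data from the practice question.

```python
# Hypothetical classes 0-10, 10-20, 20-30, 30-40, represented by their midpoints
midpoints = [5, 15, 25, 35]
frequencies = [2, 5, 8, 5]

n = sum(frequencies)
grouped_mean = sum(f * m for f, m in zip(frequencies, midpoints)) / n
grouped_variance = sum(
    f * (m - grouped_mean) ** 2 for f, m in zip(frequencies, midpoints)
) / (n - 1)

print(n, grouped_mean, round(grouped_variance, 2))
```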
Practice Question
What is the (approximate) mean
and variance of this sample?
Measures of Relationships Between Variables: Covariance

Covariance and correlation are numerical measures of the linear
relationship between two variables as intuitively indicated in a scatter plot
• A positive value indicates a direct or increasing linear relationship
• A negative value indicates a decreasing linear relationship
Covariance
• The population covariance is
$\text{Cov}(x, y) = \sigma_{xy} = \frac{\sum_{i=1}^{N}(x_i - \mu_x)(y_i - \mu_y)}{N}$
• The sample covariance is
$\text{Cov}(x, y) = s_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n - 1}$

For any constants $a_1, a_2, b_1, b_2$ and variables $X, Y$:
• $\text{Cov}(X, a_1) = 0$
• $\text{Cov}(X, X) = \text{Var}(X)$
• $\text{Cov}(X, Y) = \text{Cov}(Y, X)$
• $\text{Cov}(a_1 + b_1 X,\ a_2 + b_2 Y) = b_1 b_2 \text{Cov}(X, Y)$
• From this property, the covariance depends on units of measurement. Its unit is the product of
the units of X and Y
  • i.e., not invariant to the scaling of X and Y
  • Value of covariance varies if a variable such as height is measured in feet or inches
• It measures the direction, but not strength, of the linear relationship between X and Y
• No causal effect is implied
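The dependence on units can be illustrated with a short sketch. The heights and weights below are made up, and `sample_cov` is an illustrative helper that implements the n - 1 formula directly.

```python
from math import isclose

def sample_cov(xs, ys):
    """Sample covariance: sum of cross-deviations divided by n - 1."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    return sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / (n - 1)

height_feet = [5, 6, 5, 7]          # hypothetical heights in feet
weight = [110, 160, 130, 180]       # hypothetical weights
height_inches = [12 * h for h in height_feet]

cov_feet = sample_cov(height_feet, weight)
cov_inches = sample_cov(height_inches, weight)   # scaling X by 12 scales the covariance by 12
cov_shifted = sample_cov([h + 3 for h in height_feet], weight)  # adding a constant changes nothing

print(cov_feet, cov_inches, cov_shifted)
```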
[Figure: four scatter plots illustrating positive covariance, negative covariance, zero covariance, and zero covariance with a quadratic relationship]
Measures of Relationships Between Variables: Correlation

The correlation coefficient gives a standardized measure of the linear
relationship between two variables.
• It is generally more useful than covariance
• free of units
• provides both the direction and strength of a linear relationship

It is also called Pearson’s product-moment correlation
coefficient, or Pearson’s r, and was developed by Karl Pearson.
Correlation
• A population correlation is
$\text{Corr}(x, y) = \rho_{xy} = \frac{\text{Cov}(x, y)}{\sigma_x \sigma_y}$
• A sample correlation is
$r_{xy} = \frac{\text{Cov}(x, y)}{s_x s_y}$
• The correlation coefficient ranges from -1 to 1.
  • Closer to 1: data points are aligned in an increasing straight line (a positive linear relationship)
  • Closer to -1: data points are aligned in a decreasing straight line (a negative linear relationship)
  • = 0: no linear relationship, which does not necessarily mean no relationship
• A useful rule: a linear relationship exists if
$|r_{xy}| \geq \frac{2}{\sqrt{n}}$
• Because $\sigma_x$ and $\sigma_y$ are always positive, $\sigma_{xy}$ and
$\rho_{xy}$ will always have the same sign.
• And $\rho_{xy} = 0$ if and only if (iff) $\sigma_{xy} = 0$. This is
also true for $r_{xy}$.
• Both $\rho_{xy}$ and $r_{xy} \in [-1, 1]$
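The sample correlation and the rule of thumb can be sketched on small made-up data sets. The helper `sample_corr` is illustrative. The n - 1 factors in the covariance and the two standard deviations cancel, so the code works with raw sums. The rule-of-thumb threshold is taken as 2 divided by the square root of n.

```python
from math import isclose, sqrt

def sample_corr(xs, ys):
    """Pearson's r: sample covariance divided by the product of sample standard deviations."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]                       # hypothetical, roughly increasing
r = sample_corr(x, y)
passes_rule = abs(r) >= 2 / sqrt(len(x))  # rule of thumb for a linear relationship

# A perfect quadratic relationship with zero linear correlation
xq = [-2, -1, 0, 1, 2]
r_quadratic = sample_corr(xq, [v ** 2 for v in xq])

print(round(r, 3), passes_rule, r_quadratic)
```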

Figure on the left: Retail sales by quarter


• When 𝑟 = 0, there is no linear
relationship between x and y, but not
necessarily a lack of relationship
Practice Question
Consider the following (x, y) sample data: (53, 37), (34, 26), (10, 29), (63, 55), (28,
36), (58, 48), (28, 41), (50, 42), (39, 21), and (35, 46).
(a) Calculate the variances $s_x^2$ and $s_y^2$ and the covariance $s_{xy}$ of the sample data.
Calculate the sample correlation coefficient.
(b) In general, which of the covariance and the sample correlation coefficient is a
more useful measure of the relationship between the two variables?
How Ice Cream Kills! Correlation vs. Causation
• 5:26
Summary of Measures
Why use (n+1) in the percentile formula instead of n?
• $p$th percentile = value in the $\frac{p}{100}(n+1)$th ordered position
• More than one formula works. There is no consensus, especially for quartiles.
• Makes up for the loss of degrees of freedom of the standard deviation on a
sample (n-1)
• The average of the numbers 1 through n is not $\frac{n}{2}$; it is $\frac{n+1}{2}$
• Enforces symmetry in the problem, so that the position of the 25th percentile mirrors that of the
75th
