Lecture01 Describing Data Ver2
Describing Data
(Chapters 1 and 2)
Ping Yu
Statistics for Business and Economics (Global Edition, 9th edition), by Paul
Newbold, William Carlson and Betty Thorne, Pearson, 2019.
Ping Yu (HKU) Describing Data 3 / 67
Software
We will use R and RStudio as the statistical software in this course; RStudio is an
Integrated Development Environment (IDE) for R.
Unlike Stata and other commercial software, both are free to download and
install.
Website for R: https://www.r-project.org/
Website for RStudio: https://www.rstudio.com/
Wiki for R: https://en.wikipedia.org/wiki/R_(programming_language)
Wiki for RStudio: https://en.wikipedia.org/wiki/RStudio
I previously taught ECON2280, but will teach ECON1280 and ECON3225 in this
academic year.
In ECON1280, I will emphasize the understanding of concepts and their empirical
applications.
To avoid repetition with ECON2280, I will not cover linear regression (Chapters
11-13 of SBE).
To avoid repetition with ECON3283, I will not cover time series analysis and
forecasting (Chapter 16 of SBE) and related materials in other chapters.
I plan to cover all the other chapters of SBE (if time allows),
roughly following the notation of the textbook.
Course Policy
In Class: (i) turn off your cell phone and keep quiet; (ii) come to class and return
from the break on time; (iii) feel free to ask questions in class, but if your question is
far outside the scope of the course or would take a long time to answer, I will answer it after class;
(iv) speak English!
Policy on Plagiarism: Plagiarism is a serious offence. If several students are
judged to have copied from each other, each gets a zero mark. I will not judge who
copied whom. So DO NOT copy others and DO NOT let others copy from you.
- You may discuss the HWs with your classmates, but DO NOT copy each
other.
- This policy applies to HW, midterm and final.
Feedback: Any feedback on my teaching (e.g., the lecturer’s English is hard to
follow, technicalities are too hard to understand, the teaching should slow down,
more interaction is needed, there are typos in the slides, etc.) is very
welcome. I will incorporate your feedback into my teaching during the
semester. You can also give your feedback (e.g., difficult points in the
lectures) to the tutor so that the tutor can discuss them in tutorial classes.
Guest Account (cannot receive announcements):
- Website: http://hkuportal.hku.hk/moodle/guest
- Guest Username: econ1280_1a_2021_guest
- Password: ECON1280@ping
Course Outline
Lecture 01: Describing Data (Chapters 1 and 2)
Lecture 02: Probability (Chapter 3)
Lecture 03: Discrete Random Variables (Chapter 4)
Lecture 04: Continuous Random Variables (Chapter 5)
Lecture 05: Sampling Distribution Theory (Chapter 6)
Midterm: usually held during the first week after the break, covering Lectures 1-4
(Note: a lecture need not be finished in one week).
Lecture 06: Hypothesis Testing (Chapters 9 and 10)
Lecture 07: Confidence Interval Estimation (Chapters 7 and 8)
Lecture 08: Nonparametric Statistics (Chapter 14)
Lecture 09: Analysis of Variance (Chapter 15)
Lecture 10: Sampling (Chapter 17)
- The first seven lectures will definitely be covered; whether, and which of, the
remaining three are covered depends on the pace of teaching.
- The final will concentrate on the material not covered by the midterm.
Slides indexed by (*): covered in the lecture or by the tutor, maybe related to the
assignments, but not tested in the midterm or final.
Slides indexed by (**): not covered in the lecture, only for after-class reading.
I won’t cite (in my slides) the section numbers in the textbook unless necessary.
Plan of This Lecture
Statistics can help us process, summarize, analyze, and interpret data to make
better decisions in an uncertain environment (although summarizing usually loses some
information in the raw data). It permits us to make sense of all the data.
Data in raw form are usually not easy to use for decision making. I will introduce
tables and graphs in the first half of this lecture to provide visual support for
improved decision making, and introduce numerical measures in the second half
for more rigorous analysis.
- Pay special attention to the differences in describing categorical and numerical
variables, both graphically and numerically.
Decisions are often made based on limited information – data (or samples).
- This may be due to the cost constraints or time constraints.
A population is the complete set of all items of interest. Population size, N, can be
very large or even infinite.
- e.g., all potential buyers of a new product.
- e.g., all stocks traded on the NYSE.
A sample is an observed subset (or portion) of a population with sample size given
by n. [figure here]
We hope the sample can represent the population, since our decisions concern
the population.
A discrete numerical variable may (but does not necessarily) have a finite number
of values.
- The most common type of discrete variable produces a response that comes
from a counting process, i.e., takes values in the infinite set 0, 1, 2, 3, ..., e.g.,
the number of customers.
A continuous numerical variable may take on any value within a given range of
real numbers.
- The continuous variable usually arises from a measurement (not a counting)
process, e.g., the salary of a worker.
- In daily life, we tend to truncate continuous variables as if they were discrete
ones due to the precision of measurement instruments or convenience.
Measurement Levels
The left column (called classes or groups) includes all possible responses on a
variable under study.
The right column is a list of the frequencies, or number of observations, for each
class.
A relative frequency distribution: $\frac{\text{frequency}}{n} \times 100\%$.
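As a rough illustration of the two columns described above, the following Python sketch builds a frequency distribution and the corresponding relative frequencies. The responses are hypothetical (not from the textbook), and the variable names are chosen here only for illustration.

```python
from collections import Counter

# Hypothetical categorical responses (an assumption, not course data).
responses = ["A", "B", "A", "C", "B", "A", "A", "C"]

n = len(responses)

# Left column: classes; right column: frequencies.
freq = Counter(responses)

# Relative frequency in percent: frequency / n * 100.
rel_freq = {c: 100 * f / n for c, f in freq.items()}

for c in sorted(freq):
    print(f"{c}: frequency = {freq[c]}, relative frequency = {rel_freq[c]:.1f}%")
```

The relative frequencies always add up to 100%, which is a quick sanity check on any such table.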
Describing Data: Graphical Graphs to Describe Categorical Variables
Bar Charts
Bar charts draw attention to the frequency itself (not proportion of frequencies) of
each category.
The height of bars represents frequency, and bars need not touch.
Pie Charts
If the focus is the proportion of frequencies, then pie charts are appropriate.
Pareto Diagrams
A Pareto diagram is a bar chart that displays the frequency of defect causes. It is
used to separate the "vital few" from the "trivial many".
Cross Tables
It lists the frequencies of all combinations of values for the two variables.
A component (or stacked) bar chart and cluster (or side-by-side) bar chart are
used to picture the information in cross tables, and are extensions of the bar chart
above.
Number of Classes
Intuition is used in practice to ensure that each class includes "not too few"
and "not too many" observations.
$n = 110$, so set $k = 8$ and $w = \frac{299 - 222}{8} \approx 10$ (rounded up).
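The class-width arithmetic can be checked with a short Python sketch. It assumes that 299 and 222 are the largest and smallest observations, which is how the formula reads; the variable names are illustrative.

```python
import math

n = 110                       # number of observations
k = 8                         # chosen number of classes
largest, smallest = 299, 222  # assumed to be the maximum and minimum observations

# Class width: (largest - smallest) / k, rounded up to a convenient integer.
raw_width = (largest - smallest) / k
w = math.ceil(raw_width)

print(raw_width, w)
```

Rounding up rather than to the nearest integer guarantees that k classes of width w cover the whole range of the data.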
Suppose the goal is 4.5 minutes; then we can tell from Table 1.8 that fewer than
3/4 (72.7%) of the employees can achieve the goal.
Histograms
Read Section 1.6 for some common mistakes in presenting histograms. These
mistakes can be easily avoided by using statistical software properly.
Ogives
An ogive (or cumulative line graph) is a line connecting points that are the
cumulative percent of observations below the upper limit of each interval in a
cumulative frequency distribution.
The stem-and-leaf display is a quick way to identify possible patterns for a small
data set. Both it and the box-and-whisker plot below were invented by John Tukey.
Scatter Plots
A scatter plot locates one point for each observation of two variables. It can
provide a picture of the data, including (i) the range of each variable, (ii) the
pattern of values over the range, (iii) a suggestion as to a possible relationship
between the two variables, and (iv) an indication of outliers (extreme points, i.e.,
data values that are much larger or smaller than other values).
Summary of Techniques
$$\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \cdots + x_N}{N}.$$
- The sample mean is a statistic given by
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}.$$
- The mean is appropriate for numerical data.
The median is the middle observation of a set of observations that are arranged in
increasing (or decreasing) order.
- If n is odd, the median is the middle observation.
- If n is even, the median is the average of the two middle observations.
- The median will be the number located in the 0.5 (n + 1)th ordered position.
- The median is more robust to outliers than the mean. [why?]
The mode, if one exists, is the most frequently occurring value.
- A distribution with one mode is called unimodal; with two (local) modes, it is
called bimodal; and with more than two (local) modes, it is said to be multimodal.
- The mode is most commonly used with categorical data. [see more discussions
below]
The most appropriate measure of central tendency is context specific.
- e.g., for clothing retailers, the mode is more informative than the mean for
inventory decisions. [why?]
For categorical data, median and mode are appropriate, but mean is not.
- e.g., what is the mean of "male" (coded 1) and "female" (coded 0)?
For numerical data (the most popular data type in business applications), the mean
and median (especially when outliers exist) are more appropriate than the mode (if each
value occurs only once, which one is the center?).
Describing Data: Numerical Measures of Central Tendency and Location
The numbers of bottles of water sold in each of n = 12 hours at one store during hurricane
season are
60, 84, 65, 67, 75, 72, 80, 85, 63, 82, 70, 75.
The mean is
$$\bar{x} = \frac{60 + 84 + 65 + 67 + 75 + 72 + 80 + 85 + 63 + 82 + 70 + 75}{12} = 73.17.$$
Arrange the sales from least to greatest:
60, 63, 65, 67, 70, 72, 75, 75, 80, 82, 84, 85,
so the median is
$$x_{.5} = \frac{72 + 75}{2} = 73.5.$$
The mode is clearly 75 bottles.
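The three measures for the bottled-water data can be reproduced with Python's standard `statistics` module. The second half of this sketch is an added illustration, not part of the example: it replaces the largest value by a hypothetical outlier (850) to show why the median is more robust than the mean.

```python
import statistics

# Hourly bottled-water sales from the example above.
sales = [60, 84, 65, 67, 75, 72, 80, 85, 63, 82, 70, 75]

mean = statistics.mean(sales)
median = statistics.median(sales)   # average of the two middle values when n is even
mode = statistics.mode(sales)       # most frequently occurring value

print(round(mean, 2), median, mode)

# Hypothetical outlier: replace 85 by 850 (an assumption for illustration only).
with_outlier = [850 if x == 85 else x for x in sales]
print(round(statistics.mean(with_outlier), 2), statistics.median(with_outlier))
```

The outlier pulls the mean up sharply, while the median does not move because the middle two ordered observations are unchanged.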
Percentiles and quartiles are measures that indicate the location, or position, of a
value relative to the entire set of data.
They are generally used to describe large data sets, e.g., sales data, survey data,
or even the weights of newborn babies.
Arranging the data in order from the smallest to the largest, the pth percentile is a
value such that approximately p% of the observations are at or below that number.
- Percentiles separate large ordered data sets into 100ths.
- The 50th percentile is the median.
- pth percentile = value located in the $\frac{p}{100}(n + 1)$th ordered position.
Quartiles are descriptive measures that separate large data sets into four quarters.
The first quartile, Q1 , (or 25th percentile) separates approximately the smallest
25% of the data from the remainder of the data. The second quartile, Q2 , (or 50th
percentile) is the median. The third quartile, Q3 , (or 75th percentile) separates
approximately the smallest 75% of the data from the remainder of the data.
- Q1 = the value in the 0.25 (n + 1)th ordered position.
- Q2 = the value in the 0.50 (n + 1)th ordered position.
- Q3 = the value in the 0.75 (n + 1)th ordered position.
Five-Number Summary
The five-number summary: minimum, first quartile, median, third quartile, and
maximum, in ascending order.
Example 2.5: Demand for Bottled Water. Sales in ascending order:
60, 63, 65, 67, 70, 72, 75, 75, 80, 82, 84, 85,
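A five-number summary for these sales can be sketched in Python using the $(n + 1)$ position rule stated earlier. When the position is not an integer, this sketch interpolates linearly between the two neighbouring ordered values; that interpolation convention, and the helper name `percentile`, are assumptions made here.

```python
def percentile(data, p):
    """Value at the p/100 * (n + 1)th ordered position (linear interpolation)."""
    xs = sorted(data)
    n = len(xs)
    pos = p / 100 * (n + 1)      # 1-based ordered position
    lower = int(pos)             # integer part of the position
    frac = pos - lower           # fractional part of the position
    if lower < 1:                # position before the first observation
        return xs[0]
    if lower >= n:               # position at or beyond the last observation
        return xs[-1]
    return xs[lower - 1] + frac * (xs[lower] - xs[lower - 1])


sales = [60, 63, 65, 67, 70, 72, 75, 75, 80, 82, 84, 85]

five_number = (
    min(sales),
    percentile(sales, 25),   # Q1: position 0.25 * 13 = 3.25
    percentile(sales, 50),   # median: position 0.50 * 13 = 6.5
    percentile(sales, 75),   # Q3: position 0.75 * 13 = 9.75
    max(sales),
)
print(five_number)
```

Other software may use a different percentile convention, so quartiles can differ slightly across packages for small samples.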
Two data sets can have the same mean but the observations in one set could vary
more from the mean than do those in the other set:
Sample A: 1, 2, 1, 36,
Sample B: 8, 9, 10, 13,
both of which have mean 10, but the spread of Sample A is obviously larger than
that of Sample B.
- Intuition: gunfire.
The range is the difference between the largest and smallest observations.
- The greater the spread of the data from the center of the distribution, the larger
the range will be.
- Although the range measures the total spread, it is sensitive to outliers.
- One solution is to discard a few of the highest and a few of the lowest numbers,
as the IQR below does.
The interquartile range (IQR) measures the spread in the middle 50% of the data:
$\mathrm{IQR} = Q_3 - Q_1$.
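To compare the two measures of spread on Samples A and B, here is a minimal Python sketch. It relies on `statistics.quantiles` with `method="exclusive"`, which places quartiles at the $(n + 1)$-based positions used in this lecture; the helper name `spread` is an assumption.

```python
import statistics

sample_a = [1, 2, 1, 36]
sample_b = [8, 9, 10, 13]


def spread(data):
    """Return (range, IQR) of the data."""
    data_range = max(data) - min(data)
    # The "exclusive" method uses the i/4 * (n + 1)th ordered positions.
    q1, _, q3 = statistics.quantiles(data, n=4, method="exclusive")
    return data_range, q3 - q1


range_a, iqr_a = spread(sample_a)
range_b, iqr_b = spread(sample_b)
print(range_a, iqr_a)
print(range_b, iqr_b)
```

With only four observations the quartiles are heavily interpolated, so the IQR values here are best read as an illustration of the calculation rather than as stable estimates.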
Box-and-Whisker Plots
Both range and IQR use only two of the data values. Variance uses the distances
of all observations from the mean.
The population variance is the sum of the squared differences between each
observation and the population mean divided by the population size:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} \overset{\text{[Exercise]}}{=} \frac{\sum_{i=1}^{N} x_i^2}{N} - \mu^2. \tag{1}$$
Population A: $\{-2, 2\}$ and Population B: $\{-1, 1\}$:
$$\sigma_A^2 = \frac{(-2)^2 + 2^2}{2} = 4 > 1 = \frac{(-1)^2 + 1^2}{2} = \sigma_B^2$$
matches the intuition that Population A is more spread out (or more risky) than
Population B, where $\mu_A = \mu_B = 0$.
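Both forms of the population variance in (1) can be checked numerically on Populations A and B. This Python sketch is only a numerical check, not a proof of the exercise; the function names are illustrative.

```python
def pop_variance(xs):
    """Definition: mean squared deviation from the population mean."""
    N = len(xs)
    mu = sum(xs) / N
    return sum((x - mu) ** 2 for x in xs) / N


def pop_variance_shortcut(xs):
    """Shortcut form: mean of squares minus squared mean."""
    N = len(xs)
    mu = sum(xs) / N
    return sum(x * x for x in xs) / N - mu ** 2


pop_a = [-2, 2]
pop_b = [-1, 1]

print(pop_variance(pop_a), pop_variance_shortcut(pop_a))
print(pop_variance(pop_b), pop_variance_shortcut(pop_b))
```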
The sample variance is the sum of the squared differences between each
observation and the sample mean divided by the sample size minus one:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n - 1} = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n - 1},$$
where the last two equalities can be shown similarly to (1).
- The reason for dividing by $n - 1$ rather than $n$ will be explained in Lecture 5.
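As a numerical check of the three equivalent expressions for $s^2$, the following Python sketch applies them to the bottled-water sales used earlier and compares the result with `statistics.variance`, which also divides by $n - 1$.

```python
import statistics

sales = [60, 84, 65, 67, 75, 72, 80, 85, 63, 82, 70, 75]
n = len(sales)
xbar = sum(sales) / n
sum_sq = sum(x * x for x in sales)

# Three algebraically equivalent forms of the sample variance.
s2_definition = sum((x - xbar) ** 2 for x in sales) / (n - 1)
s2_sums = (sum_sq - sum(sales) ** 2 / n) / (n - 1)
s2_mean = (sum_sq - n * xbar ** 2) / (n - 1)

print(round(s2_definition, 4), round(s2_sums, 4), round(s2_mean, 4))
print(round(statistics.variance(sales), 4))
```

In floating-point arithmetic the definitional form is usually the most accurate, since the shortcut forms subtract two large, nearly equal numbers.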
Describing Data: Numerical Measures of Variability
Typo: $\bar{x} = \frac{\sum x_i}{n}$.
Coefficient of Variation
Chebyshev’s Theorem
Chebyshev’s Theorem: For any population with mean $\mu$, standard deviation $\sigma$, and
$k > 1$, the percent of observations that lie within the interval $[\mu \pm k\sigma]$ is at least
$100\left[1 - 1/k^2\right]\%$, where $k$ is the number of $\sigma$. [figure here]
so
$$P(|X - \mu| \le k\sigma) \ge 1 - \frac{E\left[|X - \mu|^2\right]}{k^2\sigma^2} = 1 - \frac{\sigma^2}{k^2\sigma^2} = 1 - \frac{1}{k^2}.$$
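To get a feel for how conservative Chebyshev's bound is, the following Python sketch evaluates $100[1 - 1/k^2]\%$ for a few values of $k$ and compares it with the actual share of observations within $k$ standard deviations. Treating the bottled-water sales as a small population is an assumption made purely for illustration.

```python
import statistics


def chebyshev_lower_bound(k):
    """Minimum percent of observations within k standard deviations (k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's bound is informative only for k > 1")
    return 100 * (1 - 1 / k ** 2)


# Bottled-water sales, treated here as a whole population (an assumption).
data = [60, 84, 65, 67, 75, 72, 80, 85, 63, 82, 70, 75]
mu = statistics.mean(data)
sigma = statistics.pstdev(data)   # population standard deviation


def percent_within(k):
    """Actual percent of observations in [mu - k*sigma, mu + k*sigma]."""
    inside = sum(1 for x in data if abs(x - mu) <= k * sigma)
    return 100 * inside / len(data)


for k in (1.5, 2, 3):
    print(k, round(chebyshev_lower_bound(k), 2), round(percent_within(k), 2))
```

The actual percentages are typically well above the bound, because the theorem must hold for every possible distribution.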
Empirical Rule
An empirical rule, called the 68-95-99.7 rule, gives more precise guidelines for the
percentage of data values that lie within 1, 2, and 3 standard deviations (σ ) of the
mean (µ) for many large populations (mounded, bell-shaped).
This empirical rule actually applies to the normal distribution which will be
discussed in Lecture 4.
z-Score
Percentiles and quartiles are measures that indicate the location or position of a
value relative to the entire set of data, while a z-score measures the location or
position of a value relative to the mean of the distribution: it is a standardized
value that indicates the number of standard deviations a value is from the mean.
For the population, the z-score of each value $x_i$ is
$$z_i = \frac{x_i - \mu}{\sigma},$$
which is positive if $x_i > \mu$, negative if $x_i < \mu$, and zero if $x_i = \mu$.
For the sample, the z-score of each value $x_i$ is
$$z_i = \frac{x_i - \bar{x}}{s}.$$
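Sample z-scores can be computed directly from the definition. The Python sketch below reuses the bottled-water sales from earlier in the lecture as illustrative data.

```python
import statistics

sales = [60, 84, 65, 67, 75, 72, 80, 85, 63, 82, 70, 75]

xbar = statistics.mean(sales)
s = statistics.stdev(sales)        # sample standard deviation (divides by n - 1)

# z-score: number of sample standard deviations each value lies from the mean.
z_scores = [(x - xbar) / s for x in sales]

for x, z in zip(sales, z_scores):
    print(x, round(z, 2))
```

By construction the z-scores have mean 0 and sample standard deviation 1, so they are free of the original units.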
Shape of a Distribution
Skewness is defined as
$$\text{skewness} = \frac{1}{n} \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{s^3}.$$
- The numerator is the key, and the denominator serves the purpose of
standardization (free of units of xi ).
Skewness is positive if a distribution is skewed to the right, negative if skewed to
the left, and zero if the distribution is mounded (bell-shaped) and symmetric about its mean
[why? refer to the figures in slides 38 and 49].
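The sign behaviour of skewness can be seen in a small Python sketch of the formula above, with $s$ taken as the sample standard deviation. The three data sets are illustrative choices: Sample A from earlier (one large value on the right), its mirror image, and a symmetric set.

```python
import statistics


def skewness(data):
    """(1/n) * sum((x - xbar)^3) / s^3, with s the sample standard deviation."""
    n = len(data)
    xbar = sum(data) / n
    s = statistics.stdev(data)
    return sum((x - xbar) ** 3 for x in data) / (n * s ** 3)


right_skewed = [1, 2, 1, 36]                 # long right tail
left_skewed = [-x for x in right_skewed]     # mirror image: long left tail
symmetric = [1, 2, 3, 4, 5]                  # symmetric about its mean

print(skewness(right_skewed), skewness(left_skewed), skewness(symmetric))
```

Cubing keeps the sign of each deviation, so large deviations on one side dominate the numerator; in a symmetric data set the positive and negative cubes cancel.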
For continuous numerical unimodal data, the mean is usually less than the median
in a skewed-left distribution, and vice versa.
- e.g., the distribution of income is usually right skewed, so the median is more
appropriate than the mean since the latter is too optimistic about the economic
well-being of the community.
- For a symmetric distribution, the mean and median are equal, but the converse is
not true.
- Mean is more popular than median in practice because the former is more
straightforward and better understood than the latter.
$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{n},$$
where $w_i$ is the weight of the ith observation, and $n = \sum_{i=1}^{n} w_i$.
Example 2.17: Stock Recommendation:
$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{n} = \frac{10 + 6 + 18 + 0 + 0}{19} = 1.79.$$
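The weighted-mean calculation can be sketched in Python. The slide shows only the products $w_i x_i$ (10, 6, 18, 0, 0) and the total weight 19; the individual ratings (1 to 5) and weights (10, 3, 6, 0, 0) below are assumptions chosen to reproduce those products, not values confirmed by the source.

```python
# Assumed rating scale and number of analysts per rating (hypothetical split
# chosen so that the products match 10, 6, 18, 0, 0 and the weights sum to 19).
ratings = [1, 2, 3, 4, 5]
weights = [10, 3, 6, 0, 0]

products = [w * x for w, x in zip(weights, ratings)]
total_weight = sum(weights)

weighted_mean = sum(products) / total_weight
print(products, total_weight, round(weighted_mean, 2))
```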
If the data are intervals rather than specific values, e.g., age intervals, wage
intervals, etc., then we cannot calculate the exact mean and variance, but we can
approximate them.
Suppose that data are grouped into $K$ classes, with frequencies $f_1, f_2, \ldots, f_K$. If the
midpoints of these classes are $m_1, m_2, \ldots, m_K$, then the sample mean and sample
variance can be approximated as
$$\bar{x} = \frac{\sum_{i=1}^{K} f_i m_i}{n},$$
$$s^2 = \frac{\sum_{i=1}^{K} f_i (m_i - \bar{x})^2}{n - 1},$$
where $n = \sum_{i=1}^{K} f_i$.
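The grouped-data approximations can be sketched in Python as follows. The class midpoints and frequencies are hypothetical (e.g., intervals 0-10, 10-20, 20-30), and the function name is illustrative.

```python
def grouped_mean_variance(midpoints, freqs):
    """Approximate sample mean and variance from class midpoints and frequencies."""
    n = sum(freqs)
    mean = sum(f * m for f, m in zip(freqs, midpoints)) / n
    variance = sum(f * (m - mean) ** 2 for f, m in zip(freqs, midpoints)) / (n - 1)
    return mean, variance


# Hypothetical grouped data: midpoints of three classes and their frequencies.
midpoints = [5, 15, 25]
freqs = [2, 5, 3]

g_mean, g_var = grouped_mean_variance(midpoints, freqs)
print(g_mean, round(g_var, 3))
```

The approximation treats every observation in a class as if it sat at the class midpoint, so it is most accurate when classes are narrow.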
$$\mathrm{Cov}(x, y) = \sigma_{xy} = \frac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{N}.$$ [figure here]
A sample covariance is
$$s_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}.$$
It is easy to check that for any constants a1 , b1 , a2 and b2 ,
Cov (a1 + b1 x, a2 + b2 y ) = b1 b2 Cov (x, y ). [Exercise]
From this property, the covariance depends on units of measurement (i.e., not
invariant to the scaling of x and y ); its unit is the product of the units of x and y . In
other words, it measures the direction, but not strength, of the linear relationship
between x and y .
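The covariance formulas and the scaling property can be checked numerically. In this Python sketch the data and the constants $a_1, b_1, a_2, b_2$ are hypothetical, and the function names are illustrative; it is a numerical check, not a proof of the exercise.

```python
def pop_cov(x, y):
    """Population covariance: average product of deviations from the means."""
    N = len(x)
    mx, my = sum(x) / N, sum(y) / N
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / N


def sample_cov(x, y):
    """Sample covariance: same sum of products, divided by n - 1."""
    n = len(x)
    return pop_cov(x, y) * n / (n - 1)


# Hypothetical paired data.
x = [1, 2, 3, 4]
y = [2, 4, 5, 9]

# Hypothetical constants for the linear transformations.
a1, b1, a2, b2 = 10, 2, -3, -0.5
x_new = [a1 + b1 * xi for xi in x]
y_new = [a2 + b2 * yi for yi in y]

print(pop_cov(x, y), sample_cov(x, y))
print(pop_cov(x_new, y_new), b1 * b2 * pop_cov(x, y))
```

Shifting by $a_1$ and $a_2$ leaves the covariance unchanged, while rescaling multiplies it by $b_1 b_2$, which is why its magnitude alone says little about the strength of the relationship.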
Describing Data: Numerical Measures of Relationships Between Variables
¹ Galton was a half-cousin of Charles Darwin (1809-1882), sharing a common grandparent. He was also the
advisor of Karl Pearson, an African explorer, and the inventor of fingerprinting.
Because $\sigma_x$ and $\sigma_y$ are positive, $\sigma_{xy}$ and $\rho_{xy}$ always have the same sign, and
$\rho_{xy} = 0$ if and only if (iff) $\sigma_{xy} = 0$. This is also true for $r_{xy}$.
Both $\rho_{xy}$ and $r_{xy} \in [-1, 1]$ [proof not required]. What does $r_{xy} = 1$ mean?²
² This is why we know covariance measures the linear relationship between x and y.
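To explore what $r_{xy} = 1$ means, here is a minimal Python sketch. It assumes the standard definition of the sample correlation, $r_{xy} = s_{xy} / (s_x s_y)$, which is consistent with the sign statement above; the data sets are hypothetical.

```python
import math


def sample_corr(x, y):
    """Sample correlation: s_xy / (s_x * s_y)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x) / (n - 1))
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y) / (n - 1))
    return sxy / (sx * sy)


x = [1, 2, 3, 4]
y_increasing = [3 + 2 * xi for xi in x]   # exact increasing linear relationship
y_decreasing = [5 - xi for xi in x]       # exact decreasing linear relationship
y_noisy = [2, 4, 5, 9]                    # positive but not exactly linear

print(sample_corr(x, y_increasing), sample_corr(x, y_decreasing))
print(round(sample_corr(x, y_noisy), 4))
```

In this sketch the correlation reaches 1 or -1 (up to rounding) only when the points lie exactly on a straight line, and it falls strictly between them otherwise.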
“By 1910, frequent epidemics became regular events throughout the developed
world, primarily in cities during the summer months. At its peak in the 1940s and
1950s, polio would paralyze or kill over half a million people worldwide every year.”
- From Wiki
Summary of Measures