
Lecture01 Describing Data Ver2

This document provides an overview of the first lecture in a course on describing data. It discusses how statistics can help process and interpret data to make better decisions with limited information. It introduces the concepts of populations, which are complete sets of all items of interest, and samples, which are subsets of populations that are observed. The goal is for samples to represent populations since decisions are made based on sample information rather than having complete population data.

Uploaded by

Hongjiang Zhang
Copyright
© © All Rights Reserved

Lecture 01.

Describing Data
(Chapters 1 and 2)

Ping Yu

HKU Business School


The University of Hong Kong

Ping Yu (HKU) Describing Data 1 / 67


Course Information

Instructor: Yu, Ping


Email: [email protected]
Teaching Time: 13:30-14:45, and 14:55-16:10, Tuesday
Teaching Location: f2f at CYPP4 (but the lectures will also be recorded and
uploaded on moodle)
Office Hour: 11:00-12:00, Tuesday, KKL1108
- I will NOT answer questions by email if the answer is long or hard to explain
precisely in writing. Please stop by during my office hour.

Tutor: Zhao, Jiuqi


Email: [email protected]
Teaching Time: TBA
Teaching Location: TBA
Office Hour: TBA
- For any administrative issues (e.g., enrollment, Moodle, time clash, lab entrance,
absence from the exams, etc.) or HW issues (e.g., clarification of problems), please
contact the tutor.
The Textbook: SBE Hereafter

Statistics for Business and Economics (Global Edition, 9th edition), by Paul
Newbold, William Carlson and Betty Thorne, Pearson, 2019.
Software

We will use R and RStudio as the statistical software in this course; RStudio is an
Integrated Development Environment (IDE) for R.

Unlike Stata and some other statistical packages, both R and RStudio are free to
download and install.
Website for R: https://www.r-project.org/
Website for RStudio: https://www.rstudio.com/
Wiki for R: https://en.wikipedia.org/wiki/R_(programming_language)
Wiki for RStudio: https://en.wikipedia.org/wiki/RStudio



Evaluation

Evaluation: 4 HWs (50%), Midterm Test (20%), Final Exam (30%)


HW: Evenly distributed over the 12 weeks. Must be typed (e.g., in LaTeX or Word).
R commands need not be submitted. Turn in your HW on Moodle on the due day
(usually before midnight, 11:59pm, on a Sunday).
- Late HWs are not accepted for any reason. To avoid any risk, start your
HW early (the HWs indicate clearly which problems can be solved after each
lecture; usually, one problem is assigned to each section).
Tutorial: The answer keys to the HWs and midterm will not be posted on Moodle;
they will be discussed by the tutor instead. The tutorial class starts in week three (the
week starting from Sep. 12). Tutorial questions will be posted on Moodle one
week in advance. Tutorial questions are not HWs; there is no need to turn them in.
Examination: Mimics HWs and tutorials. Closed book and closed notes. A formula
sheet will be provided for the midterm and final, and posted on Moodle before
the midterm and final. R commands are not tested. No past exams are provided
(due to university policy).
- You must take the final to pass this course; if you cannot take the midterm due to
sickness, the weight of the midterm will be automatically shifted to the final.
- Midterm: Oct. 24, Sunday, 10:00am-12:00noon, KB223.
- Final: Dec. 22, Wednesday, 2:30pm-4:30pm, Loke Yew Hall.
Suggestion: Preview slides before the class.
UG Econometric Courses at HKU Business School

ECON1280. Analysis of Economic Data (both fall and spring): introduction to
statistics; a prerequisite for the other courses, especially ECON2280.
- You had better enroll in ECON1280 in the fall if you want to enroll in ECON2280 in
the spring, and vice versa.
ECON2280. Introductory Econometrics (both fall and spring): linear regression.
ECON3225. Big Data Economics (only spring): machine learning.
ECON3283. Economic Forecasting (only spring): time series.
ECON3284. Introduction to Causal Inference and Statistical Learning (only
spring): treatment effects evaluation.

I have taught ECON2280 before, but will teach ECON1280 and ECON3225 in this
academic year.
In ECON1280, I will emphasize conceptual understanding and empirical
applications.
To avoid repetition with ECON2280, I will not cover linear regression (Chapters
11-13 of SBE).
To avoid repetition with ECON3283, I will not cover time series analysis and
forecasting (Chapter 16 of SBE) and related materials in other chapters.
I plan to cover all the other chapters of SBE (depending on whether time allows),
roughly following the notations of the textbook.
Course Policy

In Class: (i) turn off your cell phone and keep quiet; (ii) come to class and return
from the break on time; (iii) you can ask me freely in class, but if your question is
far out of the course or will take a long time to answer, I will answer you after class;
(iv) speak English!
Policy on Plagiarism: If your work is judged to be plagiarized, you are in serious trouble. If
several students are judged to have copied from each other, each gets a zero mark. I will not judge who
copied whom. So DO NOT copy others and DO NOT let others copy you.
- You may discuss with your classmates about the HWs, but DO NOT copy each
other.
- This policy applies to HW, midterm and final.
Feedback: Any feedback on my teaching (e.g., the lecturer’s English is hard to
follow, technicalities are too hard to understand, the teaching should slow down,
more interaction is required, there are typos in the slides, etc.) is very
welcome. I will incorporate your feedback into my teaching during the
semester. You can also give your feedback (e.g., difficult points in the
lectures) to the tutor so that the tutor can discuss them in tutorial classes.
Guest Account (cannot receive announcements):
- Website: http://hkuportal.hku.hk/moodle/guest
- Guest Username: econ1280_1a_2021_guest
- Password: ECON1280@ping
Course Outline
Lecture 01: Describing Data (Chapters 1 and 2)
Lecture 02: Probability (Chapter 3)
Lecture 03: Discrete Random Variables (Chapter 4)
Lecture 04: Continuous Random Variables (Chapter 5)
Lecture 05: Sampling Distribution Theory (Chapter 6)
Midterm: usually held during the first week after the break; covers Lectures 1-4
(Note: a lecture need not be finished in one week).
Lecture 06: Hypothesis Testing (Chapters 9 and 10)
Lecture 07: Confidence Interval Estimation (Chapters 7 and 8)
Lecture 08: Nonparametric Statistics (Chapter 14)
Lecture 09: Analysis of Variance (Chapter 15)
Lecture 10: Sampling (Chapter 17)
- The first seven lectures will definitely be covered; whether, and which of, the
remaining three are covered depends on the teaching pace.
- The final will concentrate on the materials that are not covered by the midterm.
Slides indexed by (*): covered in the lecture or by the tutor, maybe related to the
assignments, but not tested in the midterm or final.
Slides indexed by (**): not covered in the lecture, only for after-class reading.
I won’t cite (in my slides) the section numbers in the textbook unless necessary.
Plan of This Lecture

Statistics can help us process, summarize, analyze, and interpret data to make
better decisions in an uncertain environment (although summarizing usually loses some
information in the raw data). It permits us to make sense of all the data.
Data in raw form are usually not easy to use for decision making. I will introduce
tables and graphs in the first half of this lecture to provide visual support for
improved decision making, and introduce numerical measures in the second half
for more rigorous analysis.
- Pay special attention to the differences in describing categorical and numerical
variables, both graphically and numerically.

Describing Data: Graphical


- Decision Making in an Uncertain Environment
- Classification of Variables
- Graphs to Describe Categorical Variables
- Graphs to Describe Numerical Variables
Describing Data: Numerical
- Measures of Central Tendency and Location
- Measures of Variability
- Weighted Mean and Measures of Grouped Data
- Measures of Relationships Between Variables
Describing Data: Graphical



Describing Data: Graphical Decision Making in an Uncertain Environment

Decision Making in an Uncertain Environment

Decisions are often made based on limited information – data (or samples).
- This may be due to the cost constraints or time constraints.
A population is the complete set of all items of interest. Population size, N, can be
very large or even infinite.
- e.g., all potential buyers of a new product.
- e.g., all stocks traded on the NYSE.
A sample is an observed subset (or portion) of a population with sample size given
by n. [figure here]
We hope the sample represents the population well, since our decisions concern
the population.


Population vs. Sample

[Figure: Population and Sample]


Random and (**) Systematic Sampling

(Simple) random sampling is a sampling scheme in which (1) each member of the
population has the same probability of being selected, (2) the selection of one
member is independent of the selection of any other member, and (3) every
possible sample of a given size, n, has the same probability of selection.
- Although random sampling is often too idealized to implement in practice (due to its cost), it
serves as a benchmark for other sampling schemes discussed in Lecture 10.
(**) Suppose that the population list is arranged in some fashion unconnected with
the subject of interest (i.e., in random order). Systematic sampling involves the
selection of every jth item in the population, where j = N/n, and the first item is
randomly selected from 1 to j.
- Suppose a sample of size n = 100 is desired and N = 5000; then j = 50. If your first item is
numbered 20, then the 20th, 70th, 120th, ..., 4970th items are sampled.
- Systematic samples provide a good representation of the population if there is
no cyclical variation in the population.
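
The systematic sampling scheme above can be sketched in a few lines of Python (used here purely for illustration; the course software is R). The function name `systematic_sample` and its arguments are hypothetical, and the sketch assumes N is an exact multiple of n, as in the example with N = 5000 and n = 100.

```python
import random


def systematic_sample(population_size, sample_size, first=None, seed=None):
    """Return the 1-based positions selected by systematic sampling.

    The sampling interval is j = N / n. The first position is drawn at
    random from 1..j unless `first` is supplied explicitly.
    Assumption of this sketch: N is an exact multiple of n.
    """
    if population_size % sample_size != 0:
        raise ValueError("this sketch assumes N is a multiple of n")
    j = population_size // sample_size
    if first is None:
        first = random.Random(seed).randint(1, j)
    if not 1 <= first <= j:
        raise ValueError("first must lie between 1 and j")
    # Every j-th item starting from `first`.
    return [first + i * j for i in range(sample_size)]


# The slide's example: N = 5000, n = 100, first item numbered 20.
positions = systematic_sample(5000, 100, first=20)
print(positions[:3], positions[-1])
```

With the first item fixed at 20, the selected positions start 20, 70, 120 and end at 4970.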


Parameter and Statistic

A parameter is a numerical measure that describes a specific characteristic of a
population.
A statistic is a numerical measure that describes a specific characteristic of a
sample.
In other words, a parameter is a function of the population and a statistic is a
function of the sample.
Descriptive statistics focus on graphical and numerical procedures that are used to
summarize and process data.
Inferential statistics focus on using the data to make predictions, forecasts, and
estimates to make better decisions.
Usually, descriptive statistics are elementary and intuitive, and inferential statistics
are more advanced and more powerful.


Sampling and Nonsampling Errors

The target of statistics is to make decisions about a population parameter based on a
sample statistic.
Because n < N, there must be some uncertainty in decisions based on
the statistic (about the parameter); the resulting error is called sampling error.
Even if the whole population were collected, there could still be errors, called
nonsampling errors.
- The population actually sampled is not the relevant one, e.g., the voting opinion
on Franklin Roosevelt.
- Survey subjects may give inaccurate or dishonest answers, e.g., the voting
opinion on Donald Trump.
- There may be no response to survey questions, e.g., income level of the rich.
Read the textbook (Page 28) for other examples of nonsampling errors, but we will
focus on sampling errors in this course.



Describing Data: Graphical Classification of Variables

Classification of Variables: Categorical and Numerical Variables

A variable is a specific characteristic (such as age or weight) of an individual or
object.
- A variable is any property or descriptor that can take multiple values.
- A variable can be thought of as a question, to which the value is the answer. E.g.,
"How old are you?", "42 years old". Here, "age" is the variable, and "42" is its value.
Based on the type and amount of information contained in the data, we classify
variables into categorical and numerical variables.
Based on the levels of measurement, we classify variables into qualitative and
quantitative variables.
Categorical variables produce responses that belong to groups or categories.
- e.g., responses to yes/no questions.
- e.g., choices from "strongly disagree" to "strongly agree".
Numerical variables include both discrete and continuous variables.


Discrete and Continuous Variables

A discrete numerical variable may (but does not necessarily) have a finite number
of values.
- The most common type of discrete variable produces a response that comes
from a counting process, i.e., takes values in the infinite set 0, 1, 2, 3, ..., e.g.,
the number of customers.
A continuous numerical variable may take on any value within a given range of
real numbers.
- The continuous variable usually arises from a measurement (not a counting)
process, e.g., the salary of a worker.
- In daily life, we tend to truncate continuous variables as if they were discrete
ones due to the precision of measurement instruments or convenience.


Qualitative and Quantitative Variables

Qualitative data do not assign measurable meaning to the "difference" in numbers.
- e.g., the numbers assigned to football players: number 10 does not play
twice as well as number 5.
Quantitative data assign measurable meaning to the "difference" in numbers.
- e.g., an exam score of 90 is twice a score of 45.
Qualitative data include nominal data and ordinal data.
- Nominal data are considered the lowest or weakest type of data, e.g., gender,
country of citizenship, phone number, etc., where numerical identification is
chosen only for convenience and does not imply ranking of responses.
- Ordinal data indicate the rank ordering of items, e.g., product quality rating (1: poor; 2:
average; 3: good), but the difference in numbers is meaningless.


Qualitative and Quantitative Variables (continued)

Quantitative data include interval data and ratio data.


- Interval data indicate rank and distance from an arbitrarily determined
benchmark or zero, e.g., Celsius and Fahrenheit degrees of temperature or the
year based on the Gregorian calendar, where the difference makes sense but ratio
is meaningless.
- Ratio data indicate both rank and distance from a natural zero, e.g., age and
weight, where the ratios of two measures have meaning.
From nominal, to ordinal, to interval, and to ratio, more and more information is
contained in the data.


Measurement Levels



Describing Data: Graphical Graphs to Describe Categorical Variables

Graphs to Describe Categorical Variables: (Relative) Frequency Distributions

A frequency distribution is a table used to organize data.

Figure: HEI: Healthy Eating Index

The left column (called classes or groups) includes all possible responses on a
variable under study.
The right column is a list of the frequencies, or number of observations, for each
class.
A relative frequency distribution reports, for each class, $\text{relative frequency} = \frac{\text{frequency}}{n} \times 100\%$.
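
As a rough illustration of how a frequency distribution and its relative version are built for a categorical variable, here is a minimal Python sketch. The responses below are hypothetical (they are not the HEI data from the figure), and `frequency_table` is a name introduced only for this example.

```python
from collections import Counter


def frequency_table(responses):
    """Return rows of (class, frequency, relative frequency in percent)."""
    counts = Counter(responses)
    n = len(responses)
    # Relative frequency = frequency / n * 100%.
    return [(cls, freq, 100 * freq / n) for cls, freq in counts.most_common()]


# Hypothetical survey responses, not data from the lecture.
responses = ["good", "poor", "good", "average", "good", "average", "good", "poor"]
table = frequency_table(responses)
for cls, freq, rel in table:
    print(f"{cls:<8} {freq:>3} {rel:6.1f}%")
```

The frequencies sum to n and the relative frequencies sum to 100%, which is a quick sanity check on any such table.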

Bar Charts

Bar charts draw attention to the frequency itself (not proportion of frequencies) of
each category.
The height of bars represents frequency, and bars need not touch.


Pie Charts

If the focus is the proportion of frequencies, then pie charts are appropriate.

[Figure: Browser Wars. Panels: European Market Share; North America Market Share]


Pareto Diagrams

A Pareto diagram is a bar chart that displays the frequency of defect causes. It is
used to separate the "vital few" from the "trivial many".

Figure: Errors in Health Care Claims Processing

The bars are arranged in the descending order of frequencies.


The first three causes contribute about 80% of errors.

The Pareto Principle or "80-20 Rule"


The Pareto principle or "80-20 rule" states that 80% of outcomes are due to 20%
of causes. [figure here]
(**) The Pareto density function: $f(x) = \frac{\alpha x_m^{\alpha}}{x^{\alpha+1}} \, 1(x \ge x_m)$.

Figure: Pareto Density Functions for Various α’s with xm = 1


Vilfredo F. D. Pareto (1848-1923), Italian, University of Lausanne


Cross Tables

Cross tables (or crosstabs/contingency tables) are used to describe relationships
between categorical or ordinal variables.

Figure: A 3 × 2 Cross Table

It lists the frequencies of all combinations of values for the two variables.


Component or Cluster Bar Charts

A component (or stacked) bar chart and cluster (or side-by-side) bar chart are
used to picture the information in cross tables, and are extensions of the bar chart
above.



Describing Data: Graphical Graphs to Describe Numerical Variables

Graphs to Describe Numerical Variables: Frequency Distributions

Unlike with categorical data, we must construct the classes ourselves.


Three Rules:
Rule 1: Determine k , the number of classes.
Rule 2: Classes should be the same width, w, which is determined by
$w = \frac{\text{Largest Observation} - \text{Smallest Observation}}{\text{Number of Classes}}$.
- w should always be rounded upward to include all observations in the frequency
distribution table.
Rule 3: Classes must be inclusive and nonoverlapping: each observation must
belong to one and only one class.
- Make sure the boundary values are clearly classified.
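
The three rules can be sketched in Python as follows. The width calculation uses the lecture's own numbers (smallest observation 222, largest 299, k = 8); the data list, the starting boundary of 220, and the convention that each class includes its lower limit and excludes its upper limit are assumptions made only for this illustration.

```python
import math


def class_width(largest, smallest, k):
    """Rule 2: (largest - smallest) / k, rounded upward."""
    return math.ceil((largest - smallest) / k)


def frequency_distribution(data, start, width, k):
    """Rule 3: count observations in k classes [start + i*w, start + (i+1)*w)."""
    counts = [0] * k
    for x in data:
        i = int((x - start) // width)
        if not 0 <= i < k:
            raise ValueError(f"{x} falls outside the classes")
        counts[i] += 1
    return counts


def cumulative(counts):
    """Running totals: observations below each class's upper limit."""
    total, out = 0, []
    for c in counts:
        total += c
        out.append(total)
    return out


w = class_width(299, 222, k=8)                   # 9.625 rounded upward
data = [222, 229, 230, 241, 255, 255, 268, 299]  # hypothetical observations
counts = frequency_distribution(data, start=220, width=w, k=8)
print(w, counts, cumulative(counts))
```

Note how the boundary value 230 is assigned to exactly one class (the second), which is the point of clearly classifying boundary values.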


Number of Classes

Sample Size n        Number of Classes k
n < 50               5 to 7
50 ≤ n ≤ 100         7 to 8
101 ≤ n ≤ 500        8 to 10
501 ≤ n ≤ 1000       10 to 11
1001 ≤ n ≤ 5000      11 to 14
n > 5000             14 to 20
Table: Rule of Thumb in Determining k

In practice, judgment is used to ensure that each class includes neither too few
nor too many observations.


(Relative) Cumulative Frequency Distributions

A cumulative frequency distribution contains the total number of observations
whose values are less than the upper limit for each class, which can be
constructed by adding the frequencies of all classes up to and including the
present class.
A relative cumulative frequency distribution: $\frac{\text{cumulative frequency}}{n} \times 100\%$.


n = 110, so set k = 8 and $w = \frac{299 - 222}{8} = 9.625$, which is rounded up to 10.


Suppose the goal is 4.5 minutes; then we can tell from Table 1.8 that fewer than
3/4 (72.7%) of the employees can achieve the goal.


Histograms

A histogram is a counterpart of a bar chart for numerical variables.

Read Section 1.6 for some common mistakes in presenting histograms. These
mistakes can be easily avoided by using statistical software properly.


Ogives

An ogive (or cumulative line graph) is a line connecting points that are the
cumulative percent of observations below the upper limit of each interval in a
cumulative frequency distribution.


Shape of a Distribution: Symmetry

A distribution is symmetric if the observations are balanced, or approximately
evenly distributed, about its center.

Figure: Symmetric Distribution


Shape of a Distribution: Skewness


A distribution is skewed (or asymmetric) if the observations are not symmetrically
distributed on either side of the center.
- A skewed-right (or positively skewed) distribution has a tail that extends farther to
the right, e.g., the income distribution.
- A skewed-left (or negatively skewed) distribution has a tail that extends farther to
the left, e.g., the GPAs in Example 1.10.

[Figure panels: Skewed-Right Distribution; Skewed-Left Distribution]


(**) Stem-and-Leaf Displays

A stem-and-leaf display is an exploratory data analysis (EDA) graph that is an
alternative to the histogram.
- Data are grouped according to their leading digits (called stems), and the final
digits (called leaves) are listed separately for each member of a class.
- The leaves are displayed individually in ascending order after each of the stems.
Accounting Final-Exam Grades: 88, 51, 63, 85, 79, 65, 79, 70, 73, 77

The stem-and-leaf display is a quick way to identify possible patterns for a small
data set. Both it and the box-and-whisker plot below were invented by John Tukey.
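
Since the slide's figure did not survive extraction, the display for the ten accounting grades can be rebuilt with a short Python sketch. It assumes two-digit values, so that the stem is the tens digit and the leaf is the units digit; `stem_and_leaf` is a name introduced only for this example.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Map each stem (tens digit) to its leaves (units digits) in ascending order."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    return dict(stems)


grades = [88, 51, 63, 85, 79, 65, 79, 70, 73, 77]
display = stem_and_leaf(grades)
for stem in range(min(display), max(display) + 1):
    leaves = " ".join(str(leaf) for leaf in display.get(stem, []))
    print(f"{stem} | {leaves}")
```

The stem 7 carries five leaves (0 3 7 9 9), so half of the grades fall in the 70s.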


Scatter Plots

A scatter plot locates one point for each observation of two variables. It can
provide a picture of the data, including (i) the range of each variable, (ii) the
pattern of values over the range, (iii) a suggestion as to a possible relationship
between the two variables, and (iv) an indication of outliers (extreme points, i.e.,
data values that are much larger or smaller than other values).

Figure: GPA at College Graduation vs. Entrance SAT Math Scores


Summary of Techniques



Describing Data: Numerical



Describing Data: Numerical Measures of Central Tendency and Location

Measures of Central Tendency: Mean, Median, and Mode

Measures of central tendency provide information about a "typical" observation in
the data.
They are usually computed from sample data rather than from population data.
The (arithmetic) mean of a set of data is the sum of the data values divided by the
number of observations.
- The population mean is a parameter given by

$\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \cdots + x_N}{N}$.

- The sample mean is a statistic given by

$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$.
- The mean is appropriate for numerical data.


Measures of Central Tendency: Mean, Median, and Mode (continued)

The median is the middle observation of a set of observations that are arranged in
increasing (or decreasing) order.
- If n is odd, the median is the middle observation.
- If n is even, the median is the average of the two middle observations.
- The median will be the number located in the 0.5 (n + 1)th ordered position.
- The median is more robust to outliers than the mean. [why?]
The mode, if one exists, is the most frequently occurring value.
- A distribution with one mode is called unimodal; with two (local) modes, it is
called bimodal; and with more than two (local) modes, it is said to be multimodal.
- The mode is most commonly used with categorical data. [see more discussions
below]
The most appropriate measure of central tendency is context specific.
- e.g., for clothing retailers, the mode is more informative than the mean for
inventory decisions. [why?]
For categorical data, median and mode are appropriate, but mean is not.
- e.g., what is the mean of "male" (coded 1) and "female" (coded 0)?
For numerical data (the most popular data type in business applications), the mean
and median (the latter especially when outliers exist) are more appropriate than the mode (each
value may occur only once, in which case which one is the center?).

Example 2.1: Demand for Bottled Water

The numbers of bottles of water sold in n = 12 hours at one store during hurricane
season are
60, 84, 65, 67, 75, 72, 80, 85, 63, 82, 70, 75.
The mean is
$\bar{x} = \frac{60 + 84 + 65 + 67 + 75 + 72 + 80 + 85 + 63 + 82 + 70 + 75}{12} = \frac{878}{12} \approx 73.17$.
Arrange the sales from least to greatest:

60, 63, 65, 67, 70, 72, 75, 75, 80, 82, 84, 85,

so the median is
$x_{.5} = \frac{72 + 75}{2} = 73.5$.
The mode is clearly 75 bottles.
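
These three measures can be checked with Python's standard `statistics` module (Python is used here only for illustration; the course software is R). The second half of the sketch replaces the largest value, 85, with a hypothetical outlier of 850 to show why the median is more robust than the mean.

```python
import statistics

sales = [60, 84, 65, 67, 75, 72, 80, 85, 63, 82, 70, 75]

mean = statistics.mean(sales)
median = statistics.median(sales)   # average of the two middle values when n is even
mode = statistics.mode(sales)
print(round(mean, 2), median, mode)

# Hypothetical outlier: one hour with 850 bottles instead of 85.
with_outlier = [850 if x == 85 else x for x in sales]
print(round(statistics.mean(with_outlier), 2), statistics.median(with_outlier))
```

The outlier pulls the mean well above every other observation, while the median stays at 73.5 because only the ordering of the middle values matters.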


Percentiles and Quartiles

Percentiles and quartiles are measures that indicate the location, or position, of a
value relative to the entire set of data.
They are generally used to describe large data sets, e.g., sales data, survey data,
or even the weights of newborn babies.
Arranging the data in order from the smallest to the largest, the pth percentile is a
value such that approximately p% of the observations are at or below that number.
- Percentiles separate large ordered data sets into 100ths.
- The 50th percentile is the median.
- pth percentile = value located in the $\frac{p}{100}(n + 1)$th ordered position.
Quartiles are descriptive measures that separate large data sets into four quarters.
The first quartile, Q1 , (or 25th percentile) separates approximately the smallest
25% of the data from the remainder of the data. The second quartile, Q2 , (or 50th
percentile) is the median. The third quartile, Q3 , (or 75th percentile) separates
approximately the smallest 75% of the data from the remainder of the data.
- Q1 = the value in the 0.25 (n + 1)th ordered position.
- Q2 = the value in the 0.50 (n + 1)th ordered position.
- Q3 = the value in the 0.75 (n + 1)th ordered position.


Five-Number Summary

The five-number summary: minimum, first quartile, median, third quartile, and
maximum, in ascending order.
Example 2.5: Demand for Bottled Water
Sales in ascending order:

60, 63, 65, 67, 70, 72, 75, 75, 80, 82, 84, 85.

We use this sample for illustration although n is small here.

Q1 = the value located in the 0.25 (12 + 1) = 3.25th ordered position.
The value in the third position is 65 and in the fourth position is 67, so

$Q_1 = 65 + 0.25\,(67 - 65) = 65.5$.

Q3 = the value located in the 0.75 (12 + 1) = 9.75th ordered position.
The value in the 9th position is 80 and in the 10th position is 82, so

$Q_3 = 80 + 0.75\,(82 - 80) = 81.5$.

The five-number summary is

60 < 65.5 < 73.5 < 81.5 < 85.
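
The (n + 1) position rule with linear interpolation can be written as a small Python function. This is a sketch: the name `percentile` is hypothetical, and returning the minimum or maximum when the position falls outside 1..n is an assumption of this example rather than something stated in the lecture.

```python
import math


def percentile(data, p):
    """Value in the (p/100)*(n+1)th ordered position, interpolating linearly."""
    ordered = sorted(data)
    n = len(ordered)
    position = p / 100 * (n + 1)      # 1-based position, possibly fractional
    lower = math.floor(position)
    if lower < 1:                     # assumption: clamp to the minimum
        return ordered[0]
    if lower >= n:                    # assumption: clamp to the maximum
        return ordered[-1]
    fraction = position - lower
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])


sales = [60, 84, 65, 67, 75, 72, 80, 85, 63, 82, 70, 75]
five_number = (
    min(sales),
    percentile(sales, 25),
    percentile(sales, 50),
    percentile(sales, 75),
    max(sales),
)
print(five_number)
```

Statistical software often uses slightly different interpolation conventions by default, so quartiles reported elsewhere may differ a little from these values.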



Describing Data: Numerical Measures of Variability

Measures of Variability: Range and Interquartile Range

Two data sets can have the same mean but the observations in one set could vary
more from the mean than do those in the other set:

Sample A: 1, 2, 1, 36,
Sample B: 8, 9, 10, 13,

both of which have mean 10, but the spread of Sample A is obviously larger than
that of Sample B.
- Intuition: gunfire.
The range is the difference between the largest and smallest observations.
- The greater the spread of the data from the center of the distribution, the larger
the range will be.
- Although the range measures the total spread, it is sensitive to outliers.
- One solution is to discard a few of the highest and a few of the lowest numbers,
as the IQR below does.
The interquartile range (IQR) measures the spread in the middle 50% of the data:

$\text{IQR} = Q_3 - Q_1$.
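
As a quick check on Samples A and B, the range and IQR can be computed in Python. The sketch relies on `statistics.quantiles` with `method="exclusive"`, which places quartiles at the 0.25(n + 1), 0.50(n + 1), and 0.75(n + 1) ordered positions; with only four observations per sample, the quartiles are purely illustrative.

```python
import statistics


def data_range(data):
    """Largest observation minus smallest observation."""
    return max(data) - min(data)


def iqr(data):
    """Q3 - Q1, with quartiles located by the (n + 1) position rule."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="exclusive")
    return q3 - q1


sample_a = [1, 2, 1, 36]
sample_b = [8, 9, 10, 13]
for name, sample in (("A", sample_a), ("B", sample_b)):
    print(name, statistics.mean(sample), data_range(sample), iqr(sample))
```

Both samples have mean 10, but both the range and the IQR are far larger for Sample A.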


Box-and-Whisker Plots

A box-and-whisker plot is a graph that describes the shape of the distribution in
terms of the five-number summary. The inner box shows the numbers that span
the range from the first to the third quartile. A line is drawn through the box at the
median. There are two "whiskers": one from the 25th percentile to the minimum,
and the other from the 75th percentile to the maximum.
Read the median and spread from Figure 2.3 below.


Variance and Standard Deviation

Both range and IQR use only two of the data values. Variance uses the distances
of all observations from the mean.
The population variance is the sum of the squared differences between each
observation and the population mean divided by the population size:

$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} \overset{\text{[Exercise]}}{=} \frac{\sum_{i=1}^{N} x_i^2}{N} - \mu^2$. (1)

Population A: $\{-2, 2\}$ and Population B: $\{-1, 1\}$:

$\sigma_A^2 = \frac{(-2)^2 + 2^2}{2} = 4 > 1 = \frac{(-1)^2 + 1^2}{2} = \sigma_B^2$

matches the intuition that Population A is more spread out (or riskier) than
Population B, where $\mu_A = \mu_B = 0$.
The sample variance is the sum of the squared differences between each
observation and the sample mean divided by the sample size minus one:

$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n - 1} = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n - 1}$,

where the last two equalities can be shown similarly to (1).
- The reason for dividing by n − 1 rather than n will be explained in Lecture 5.
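
The equivalence between the definitional and shortcut formulas can be checked numerically. The Python sketch below uses Populations A and B from the slide and, as an extra illustration, the bottled water sales from Example 2.1; the function names are hypothetical.

```python
import math
import statistics


def population_variance(data):
    """Definition: average squared deviation from the population mean."""
    mu = sum(data) / len(data)
    return sum((x - mu) ** 2 for x in data) / len(data)


def population_variance_shortcut(data):
    """Shortcut in (1): mean of squares minus squared mean."""
    mu = sum(data) / len(data)
    return sum(x * x for x in data) / len(data) - mu ** 2


def sample_variance(data):
    """Definition: squared deviations from the sample mean, divided by n - 1."""
    n = len(data)
    xbar = sum(data) / n
    return sum((x - xbar) ** 2 for x in data) / (n - 1)


def sample_variance_shortcut(data):
    """Shortcut: (sum of squares - n * xbar^2) / (n - 1)."""
    n = len(data)
    xbar = sum(data) / n
    return (sum(x * x for x in data) - n * xbar ** 2) / (n - 1)


pop_a, pop_b = [-2, 2], [-1, 1]
print(population_variance(pop_a), population_variance(pop_b))

sales = [60, 84, 65, 67, 75, 72, 80, 85, 63, 82, 70, 75]
print(math.isclose(sample_variance(sales), sample_variance_shortcut(sales)))
```

The shortcut formulas are convenient by hand, but they subtract two large numbers and can lose precision in floating-point arithmetic, which is why the comparison uses `math.isclose` rather than exact equality.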

Variance and Standard Deviation (continued)

The population standard deviation, σ, is the (positive) square root of the
population variance:

$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$.

The sample standard deviation, s, is the (positive) square root of the sample
variance:

$s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$.
The name "standard deviation" came from Karl Pearson [figure below].
- Variance measures the average squared "deviation" from the mean and has the unit of the squared unit of $x_i$.
- By taking $\sqrt{\cdot}$, we get back to the "standard" (original) unit of $x_i$.
(**) In the definition of $s^2$, why use $\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$ rather than $\frac{\sum_{i=1}^{n} (x_i - \bar{x})}{n - 1}$ or $\frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n - 1}$?
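One way to explore the (**) question numerically is to compute the three candidate sums on a small data set. The sketch below uses hypothetical data chosen so that the mean is an integer; it is only an illustration, not a full answer to the question.

```python
import math

# Hypothetical data with mean exactly 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)
mean = sum(data) / n

# Raw deviations always cancel out, so their sum carries no information.
sum_of_deviations = sum(x - mean for x in data)

# Absolute deviations do not cancel; they give another measure of spread.
mean_absolute_deviation = sum(abs(x - mean) for x in data) / n

# Squared deviations lead to the variance and the standard deviation.
population_sd = math.sqrt(sum((x - mean) ** 2 for x in data) / n)
sample_sd = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))

print(sum_of_deviations, mean_absolute_deviation, population_sd, sample_sd)
```

The standard deviations are in the same unit as the data, while the variances are in squared units.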


Example 2.9: Gilotti’s Pizzeria Sales At Location 1

Typo in the example: the sample mean should read $\bar{x} = \frac{\sum x_i}{n}$.

Coefficient of Variation

The coefficient of variation (CV) is a measure of relative dispersion that expresses the standard deviation as a percentage of the mean (provided the mean is positive).
- The population CV is
$$CV = \frac{\sigma}{\mu} \times 100\%.$$
- The sample CV is
$$CV = \frac{s}{\bar{x}} \times 100\%.$$

When the means of two data sets are different, it is better to compare their dispersion using the CV rather than $\sigma^2$ or $s^2$.
- e.g., a large store can be treated as a sum of many small stores, so both $s^2$ and $\bar{x}$ of its sales are larger than those of a small store.
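The store comparison can be sketched in a few lines. The sales figures below are hypothetical (the "large store" is simply the small one scaled by 10), and the function name is illustrative.

```python
import statistics

def coefficient_of_variation(sample):
    """Sample CV in percent: s / x_bar * 100, for a positive mean."""
    mean = statistics.mean(sample)
    if mean <= 0:
        raise ValueError("the CV is used here only when the mean is positive")
    return statistics.stdev(sample) / mean * 100

# Hypothetical sales: the large store is the small store scaled by 10.
small_store = [10, 12, 14]
large_store = [100, 120, 140]

print(statistics.stdev(small_store), statistics.stdev(large_store))
print(coefficient_of_variation(small_store), coefficient_of_variation(large_store))
```

The standard deviations differ by a factor of 10, while the two CVs coincide, which is why the CV is the more sensible basis for comparison here.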


Chebyshev’s Theorem

Chebyshev’s Theorem: For any population with mean $\mu$, standard deviation $\sigma$, and $k > 1$, the percent of observations that lie within the interval $[\mu \pm k\sigma]$ is at least $100\left[1 - 1/k^2\right]\%$, where $k$ is the number of standard deviations. [figure here]

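(**) Why? For a random variable $X$, $P(|X - \mu| \le k\sigma) = 1 - P\left(|X - \mu|^2 > k^2\sigma^2\right)$, while
$$E\left[|X - \mu|^2\right] \ge E\left[|X - \mu|^2 \, \mathbf{1}\left(|X - \mu|^2 > k^2\sigma^2\right)\right] \ge k^2\sigma^2 P\left(|X - \mu|^2 > k^2\sigma^2\right),$$
so
$$P(|X - \mu| \le k\sigma) \ge 1 - \frac{E\left[|X - \mu|^2\right]}{k^2\sigma^2} = 1 - \frac{\sigma^2}{k^2\sigma^2} = 1 - \frac{1}{k^2}.$$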

Chebyshev’s theorem can be applied to any distribution, but it is often too conservative. [check k = 1 for the extreme case]
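To see how conservative the bound can be, the following sketch compares Chebyshev's lower bound with the actual fraction of observations within k standard deviations of the mean. The population is hypothetical and the function names are illustrative.

```python
import math

def chebyshev_bound(k):
    """Lower bound 1 - 1/k^2 on the fraction within k standard deviations."""
    return 1 - 1 / k ** 2

def fraction_within(population, k):
    """Actual fraction of values in [mu - k*sigma, mu + k*sigma]."""
    n = len(population)
    mu = sum(population) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in population) / n)
    inside = [x for x in population if abs(x - mu) <= k * sigma]
    return len(inside) / n

# Hypothetical population with mu = 5 and sigma = 2.
population = [2, 4, 4, 4, 5, 5, 7, 9]
for k in (1.5, 2, 3):
    print(k, chebyshev_bound(k), fraction_within(population, k))
```

In this example the actual fractions are well above the guaranteed minimum, which illustrates why the bound is often described as too conservative.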

Pafnuty L. Chebyshev (1821-1894), St. Petersburg University


Empirical Rule

An empirical rule, called the 68-95-99.7 rule, gives more precise guidelines for the percentage of data values that lie within 1, 2, and 3 standard deviations ($\sigma$) of the mean ($\mu$) for many large populations (mounded, bell-shaped): approximately 68%, 95%, and 99.7%, respectively.

These percentages come from the normal distribution, which will be discussed in Lecture 4.
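The three percentages can be reproduced from the normal distribution. The sketch below relies on the standard identity P(|Z| ≤ k) = erf(k/√2) for a standard normal Z, which is not derived in this lecture; the function name is illustrative.

```python
import math

def normal_coverage(k):
    """P(|Z| <= k) for a standard normal variable Z, via the error function."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    # Compare the normal coverage with Chebyshev's lower bound 1 - 1/k^2.
    print(k, round(100 * normal_coverage(k), 1), round(100 * (1 - 1 / k ** 2), 1))
```

For k = 2, for example, the normal coverage is about 95%, whereas Chebyshev's theorem guarantees only 75%.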


z-Score

Percentiles and quartiles are measures that indicate the location or position of a
value relative to the entire set of data, while a z-score measures the location or
position of a value relative to the mean of the distribution: it is a standardized
value that indicates the number of standard deviations a value is from the mean.
For the population, the z-score of each value $x_i$ is
$$z_i = \frac{x_i - \mu}{\sigma},$$
which is positive if $x_i > \mu$, negative if $x_i < \mu$, and zero if $x_i = \mu$.
For the sample, the z-score of each value $x_i$ is
$$z_i = \frac{x_i - \bar{x}}{s}.$$
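A minimal sketch of the sample z-score, assuming a plain Python list as input (the data and the function name are hypothetical):

```python
import statistics

def sample_z_scores(sample):
    """Return (x_i - x_bar) / s for every value in the sample."""
    x_bar = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample standard deviation (divides by n - 1)
    return [(x - x_bar) / s for x in sample]

# Hypothetical sample with mean 4 and sample standard deviation 2.
print(sample_z_scores([2, 4, 6]))
```

Values below the mean receive negative z-scores, values above it positive ones, and a value equal to the mean receives zero.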


Shape of a Distribution

Skewness is defined as
$$\text{skewness} = \frac{1}{n} \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{s^3}.$$
- The numerator is the key, and the denominator serves the purpose of standardization (free of units of $x_i$).
Skewness is positive if a distribution is skewed to the right, negative if skewed to the left, and zero if the distribution is symmetric about its mean (e.g., mounded and bell-shaped) [why? refer to the figures in slides 38 and 49].
For continuous numerical unimodal data, the mean is usually less than the median in a skewed-left distribution, and vice versa.
- e.g., the distribution of income is usually right skewed, so the median is more appropriate than the mean since the latter is too optimistic about the economic well-being of the community.
- For a symmetric distribution, the mean and median are equal, but the converse is
not true.
- Mean is more popular than median in practice because the former is more
straightforward and better understood than the latter.
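The skewness formula above can be sketched as follows, using the sample standard deviation s in the denominator as in the definition. The data sets are hypothetical and the function name is illustrative.

```python
import statistics

def skewness(sample):
    """(1/n) * sum((x_i - x_bar)^3) / s^3, with s the sample standard deviation."""
    n = len(sample)
    x_bar = statistics.mean(sample)
    s = statistics.stdev(sample)
    return sum((x - x_bar) ** 3 for x in sample) / n / s ** 3

# Hypothetical data: one symmetric set and one with a long right tail.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 10]

print(skewness(symmetric), skewness(right_skewed))
print(statistics.mean(right_skewed), statistics.median(right_skewed))
```

In the right-skewed set the single large value pulls the mean above the median, in line with the income example.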



Describing Data: Numerical Weighted Mean and Measures of Grouped Data

(**) Weighted Mean

The weighted mean of a set of data is
$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{n},$$
where $w_i$ is the weight of the $i$th observation, and $n = \sum_{i=1}^{n} w_i$.
Example 2.17: Stock Recommendation:
$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{n} = \frac{10 + 6 + 18 + 0 + 0}{19} = 1.79.$$
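The arithmetic can be sketched as below. The slide shows only the products 10, 6, 18, 0, 0 and the total weight 19, so the individual weights and values used here are an assumption chosen to be consistent with those numbers, not the actual data of Example 2.17.

```python
def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    total_weight = sum(weights)
    return sum(w * x for x, w in zip(values, weights)) / total_weight

# Assumed (hypothetical) split: values 1..5 with weights summing to 19,
# giving the products 10, 6, 18, 0, 0 shown on the slide.
values = [1, 2, 3, 4, 5]
weights = [10, 3, 6, 0, 0]

print(round(weighted_mean(values, weights), 2))
```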


(**) Measures of Grouped Data

If the data are intervals rather than specific values, e.g., age intervals, wage
intervals, etc., then we cannot calculate the exact mean and variance, but we can
approximate them.
Suppose that data are grouped into $K$ classes, with frequencies $f_1, f_2, \ldots, f_K$. If the midpoints of these classes are $m_1, m_2, \ldots, m_K$, then the sample mean and sample variance can be approximated as
$$\bar{x} = \frac{\sum_{i=1}^{K} f_i m_i}{n},$$
$$s^2 = \frac{\sum_{i=1}^{K} f_i (m_i - \bar{x})^2}{n - 1},$$
where $n = \sum_{i=1}^{K} f_i$.
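A short sketch of the grouped-data approximation, with hypothetical class midpoints and frequencies (for instance, classes 0-10, 10-20, and 20-30); the function names are illustrative.

```python
def grouped_mean(midpoints, frequencies):
    """Approximate sample mean: sum of f_i * m_i divided by n = sum of f_i."""
    n = sum(frequencies)
    return sum(f * m for m, f in zip(midpoints, frequencies)) / n

def grouped_variance(midpoints, frequencies):
    """Approximate sample variance: sum of f_i * (m_i - x_bar)^2 over n - 1."""
    n = sum(frequencies)
    x_bar = grouped_mean(midpoints, frequencies)
    return sum(f * (m - x_bar) ** 2 for m, f in zip(midpoints, frequencies)) / (n - 1)

# Hypothetical classes with midpoints 5, 15, 25 and frequencies 2, 5, 3.
midpoints = [5, 15, 25]
frequencies = [2, 5, 3]
print(grouped_mean(midpoints, frequencies), grouped_variance(midpoints, frequencies))
```

Each observation is treated as if it sat at the midpoint of its class, which is why the results are approximations only.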



Describing Data: Numerical Measures of Relationships Between Variables

Measures of Relationships Between Variables: Covariance

Covariance and correlation are numerical measures of the linear relationship between two variables as intuitively indicated in a scatter plot. A positive value indicates a direct or increasing linear relationship, and a negative value indicates a decreasing linear relationship.
A population covariance is
$$Cov(x, y) = \sigma_{xy} = \frac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{N}. \quad \text{[figure here]}$$
A sample covariance is
$$s_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}.$$
It is easy to check that for any constants a1 , b1 , a2 and b2 ,
Cov (a1 + b1 x, a2 + b2 y ) = b1 b2 Cov (x, y ). [Exercise]
From this property, the covariance depends on units of measurement (i.e., not
invariant to the scaling of x and y ); its unit is the product of the units of x and y . In
other words, it measures the direction, but not strength, of the linear relationship
between x and y .
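The scaling property can be checked numerically. The sketch below uses hypothetical data and constants; the function name is illustrative.

```python
def sample_covariance(xs, ys):
    """Sum of (x_i - x_bar) * (y_i - y_bar) divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical paired observations.
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 9]

# Hypothetical constants for the linear transformations a + b * x.
a1, b1, a2, b2 = 3, 100, -7, 0.5
xs_scaled = [a1 + b1 * x for x in xs]
ys_scaled = [a2 + b2 * y for y in ys]

print(sample_covariance(xs, ys))
print(sample_covariance(xs_scaled, ys_scaled), b1 * b2 * sample_covariance(xs, ys))
```

Changing the units of x (here by a factor of 100) changes the covariance by the same factor, so its magnitude alone says little about the strength of the relationship.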

Figure: Positive, Negative and Zero Covariance (panels: Positive Covariance; Negative Covariance; Zero Covariance; Zero Covariance (Quadratic))


Measures of Relationships Between Variables: Correlation

The correlation (coefficient), also called Pearson’s product-moment correlation coefficient or Pearson’s r, was developed by Karl Pearson from a related idea introduced by Francis Galton in the 1880s. It gives a standardized measure of the linear relationship between two variables. [figure here]
- It is more useful than covariance because it is free of units and provides both
the direction and strength of a relationship.
The correlation is computed by dividing the covariance by the product of the
standard deviations of the two variables.
A population correlation is
$$Corr(x, y) = \rho_{xy} = \frac{Cov(x, y)}{\sigma_x \sigma_y}.$$
A sample correlation is
$$r_{xy} = \frac{s_{xy}}{s_x s_y}.$$
A useful rule of thumb for deciding whether a relationship exists is
$$|r_{xy}| \ge \frac{2}{\sqrt{n}},$$
which will be explained in Lecture 8.
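A minimal sketch of the sample correlation and the rule of thumb, using hypothetical data; the function names are illustrative, and the absolute value in the rule is the reading assumed here.

```python
import math

def sample_correlation(xs, ys):
    """Sample covariance divided by the product of sample standard deviations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    return s_xy / (s_x * s_y)

def passes_rule_of_thumb(r, n):
    """Check |r| >= 2 / sqrt(n)."""
    return abs(r) >= 2 / math.sqrt(n)

# Hypothetical data: an exact line and an exact parabola.
r_line = sample_correlation([1, 2, 3, 4], [3, 5, 7, 9])
r_parabola = sample_correlation([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])
print(r_line, r_parabola)
```

The parabola gives zero correlation even though y is an exact function of x, which matches the "Zero Covariance (Quadratic)" case: correlation measures only the linear relationship.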

Karl Pearson (1857-1936), UCL; Sir Francis Galton (1822-1911), English (see footnote 1)

Footnote 1: Galton was a half-cousin of Charles Darwin (1809-1882), sharing a common grandparent. He was also the advisor of Karl Pearson, an African explorer, and the inventor of fingerprinting.

Figure 2.4: Scatter Plots and Correlation

Because $\sigma_x$ and $\sigma_y$ are positive, $\sigma_{xy}$ and $\rho_{xy}$ always have the same sign, and $\rho_{xy} = 0$ if and only if (iff) $\sigma_{xy} = 0$. This is also true for $r_{xy}$.
Both $\rho_{xy}$ and $r_{xy} \in [-1, 1]$ [proof not required]. What does $r_{xy} = 1$ mean? (see footnote 2)
Footnote 2: This is why we know covariance measures the linear relationship between x and y.

Correlation Does NOT Imply Causation: Polio and Ice-cream

“By 1910, frequent epidemics became regular events throughout the developed
world, primarily in cities during the summer months. At its peak in the 1940s and
1950s, polio would paralyze or kill over half a million people worldwide every year.”
- From Wiki

Folk legends: (i) A pretty woman causes death? (inauspicious or unlucky?)
(ii) Old age causes death of my children? (bad omen or natural phenomenon?)
Donald Trump: More tests, more infections.
Warmonger: (larger) war induces (longer) peace.

Summary of Measures

Central Tendency   Location      Variation                  Shape      Relationship
Mean               Minimum       Range                      Skewness   Covariance
Median             Maximum       Interquartile Range                   Correlation
Mode               Percentiles   Variance
                   Quartiles     Standard Deviation
                   z-Score       Coefficient of Variation
