Lecture01 Describing Data Ver2

This document provides an overview of the first lecture in a course on describing data. It discusses how statistics can help process and interpret data to make better decisions with limited information. It introduces the concepts of populations, which are complete sets of all items of interest, and samples, which are subsets of populations that are observed. The goal is for samples to represent populations since decisions are made based on sample information rather than having complete population data.

Uploaded by

Hongjiang Zhang

Lecture 01.

Describing Data
(Chapters 1 and 2)

Ping Yu

HKU Business School

The University of Hong Kong

Ping Yu (HKU) Describing Data 1 / 67

Course Information

Instructor: Yu, Ping

Email: [email protected]
Teaching Time: 13:30-14:45, and 14:55-16:10, Tuesday
Teaching Location: f2f at CYPP4 (but the lectures will also be recorded and
uploaded on moodle)
Office Hour: 11:00-12:00, Tuesday, KKL1108
- I will NOT answer questions by email if the answer is long or hard to explain
precisely in writing. Please stop by during my office hour.

Tutor: Zhao, Jiuqi

Email: [email protected]
Teaching Time: TBA
Teaching Location: TBA
Office Hour: TBA
- For any administrative issues (e.g., enrollment, Moodle, time clash, lab entrance,
absence from the exams, etc.) or questions about the HWs (e.g., clarification of
problems), please contact the tutor.
The Textbook: SBE Hereafter

Statistics for Business and Economics (Global Edition, 9th edition), by Paul
Newbold, William Carlson and Betty Thorne, Pearson, 2019.
Software

We will use R and RStudio as the statistical software in this course; RStudio is an
Integrated Development Environment (IDE) for R.

Unlike STATA and some other software packages, both are free to download and
install.
Website for R: https://www.r-project.org/
Website for RStudio: https://www.rstudio.com/
Wiki for R: https://en.wikipedia.org/wiki/R_(programming_language)
Wiki for RStudio: https://en.wikipedia.org/wiki/RStudio

Evaluation

Evaluation: 4 HWs (50%), Midterm Test (20%), Final Exam (30%)

HW: Evenly distributed over the 12 weeks. Must be typed (e.g., in LaTeX or Word).
R commands need not be submitted. Turn in your HW on Moodle by the due date
(usually before midnight, 11:59pm, on a Sunday).
- Late HWs will not be accepted for any reason. To avoid any risk, start your
HW early (the HWs indicate clearly which problems can be solved after each
lecture; usually, one problem is assigned to each section).
Tutorial: The answer keys to the HWs and midterm will not be posted on Moodle;
they will be discussed by the tutor instead. The tutorial class starts in week three
(the week starting from Sep. 12). Tutorial questions will be posted on Moodle one
week in advance. Tutorial questions are not HWs; there is no need to turn them in.
Examination: Similar in style to the HWs and tutorials. Closed book and closed notes.
A formula sheet will be provided for the midterm and final and posted on Moodle
beforehand. R commands are not tested. No past exams are provided
(due to university policy).
- You must take the final to pass this course; if you cannot take the midterm due to
sickness, the weight of the midterm will be automatically shifted to the final.
- Midterm: Oct. 24, Sunday, 10:00am-12:00noon, KB223.
- Final: Dec. 22, Wednesday, 2:30pm-4:30pm, Loke Yew Hall.
Suggestion: Preview the slides before class.
UG Econometric Courses at HKU Business School

ECON1280. Analysis of Economic Data (both fall and spring): introduction to
statistics; prerequisite for the other courses, especially ECON2280.
- You should enroll in ECON1280 in the fall if you want to enroll in ECON2280 in
the spring, and vice versa.
ECON2280. Introductory Econometrics (both fall and spring): linear regression.
ECON3225. Big Data Economics (only spring): machine learning.
ECON3283. Economic Forecasting (only spring): time series.
ECON3284. Introduction to Causal Inference and Statistical Learning (only
spring): treatment effects evaluation.

I have taught ECON2280 before, but will teach ECON1280 and ECON3225 in this
academic year.
In ECON1280, I will emphasize the understanding of concepts and their empirical
applications.
To avoid overlap with ECON2280, I will not cover linear regression (Chapters
11-13 of SBE).
To avoid overlap with ECON3283, I will not cover time series analysis and
forecasting (Chapter 16 of SBE) and related materials in other chapters.
I plan to cover all the other chapters of SBE (time permitting),
roughly following the notation of the textbook.
Course Policy

In Class: (i) turn off your cell phone and keep quiet; (ii) come to class and return
from the break on time; (iii) you can ask me freely in class, but if your question is
far out of the course or will take a long time to answer, I will answer you after class;
(iv) speak English!
Policy on Plagiarism: If your work is judged to be “plagiarism”, you are in serious
trouble. If several students are judged to have copied from each other, each gets a
zero mark. I will not judge who copied whom. So DO NOT copy others and DO NOT let
others copy you.
- You may discuss the HWs with your classmates, but DO NOT copy each
other.
- This policy applies to the HWs, the midterm and the final.
Feedback: Any feedback on my teaching (e.g., the lecturer’s English is hard to
follow, technicalities are too hard to understand, the teaching should slow down,
more interaction is required, there are some typos in the slides, etc.) is very
welcome. I will incorporate your feedback into my teaching during the
semester. You can also give your feedback (e.g., some difficult points in the
lectures) to the tutor so that the tutor can discuss them in tutorial classes.
Guest Account (cannot receive announcements):
- Website: http://hkuportal.hku.hk/moodle/guest
- Guest Username: econ1280_1a_2021_guest
- Password: ECON1280@ping
Course Outline
Lecture 01: Describing Data (Chapters 1 and 2)
Lecture 02: Probability (Chapter 3)
Lecture 03: Discrete Random Variables (Chapter 4)
Lecture 04: Continuous Random Variables (Chapter 5)
Lecture 05: Sampling Distribution Theory (Chapter 6)
Midterm: usually held during the first week after the break; it covers Lectures 1-4
(Note: a lecture need not be finished in one week).
Lecture 06: Hypothesis Testing (Chapters 9 and 10)
Lecture 07: Confidence Interval Estimation (Chapters 7 and 8)
Lecture 08: Nonparametric Statistics (Chapter 14)
Lecture 09: Analysis of Variance (Chapter 15)
Lecture 10: Sampling (Chapter 17)
- The first seven lectures will definitely be covered; whether, and which of, the
remaining three are covered depends on how fast I teach.
- The final will concentrate on the materials that are not covered by the midterm.
Slides indexed by (*): covered in the lecture or by the tutor, maybe related to the
assignments, but not tested in the midterm or final.
Slides indexed by (**): not covered in the lecture, only for after-class reading.
I won’t cite (in my slides) the section numbers in the textbook unless necessary.
Plan of This Lecture

Statistics can help us process, summarize, analyze, and interpret data to make
better decisions in an uncertain environment (although summarizing usually loses
some of the information in the raw data). It permits us to make sense of all the data.
Data in raw form are usually not easy to use for decision making. I will introduce
tables and graphs in the first half of this lecture to provide visual support for
improved decision making, and introduce numerical measures in the second half
for more rigorous analysis.
- Pay special attention to the differences in describing categorical and numerical
variables both graphically and numerically.

Describing Data: Graphical
- Decision Making in an Uncertain Environment
- Classification of Variables
- Graphs to Describe Categorical Variables
- Graphs to Describe Numerical Variables
Describing Data: Numerical
- Measures of Central Tendency and Location
- Measures of Variability
- Weighted Mean and Measures of Grouped Data
- Measures of Relationships Between Variables
Describing Data: Graphical


Decision Making in an Uncertain Environment

Decisions are often made based on limited information – data (or samples).
- This may be due to the cost constraints or time constraints.
A population is the complete set of all items of interest. Population size, N, can be
very large or even infinite.
- e.g., all potential buyers of a new product.
- e.g., all stocks traded on the NYSE.
A sample is an observed subset (or portion) of a population with sample size given
by n. [figure here]
We hope the sample can represent the population, since our decisions concern
the population.

Population vs. Sample

[figure here: Population | Sample]

Random and (**) Systematic Sampling

(Simple) random sampling is a sampling scheme in which (1) each member of the
population has the same probability of being selected, (2) the selection of one
member is independent of the selection of any other member, and (3) every
possible sample of a given size, n, has the same probability of selection.
- Although simple random sampling is often too costly to implement in practice, it
serves as a benchmark for other sampling schemes discussed in Lecture 10.
(**) Suppose that the population list is arranged in some fashion unconnected with
the subject of interest (i.e., in random order). Systematic sampling involves the
selection of every jth item in the population, where j = N/n, and the first item is
randomly selected from 1 to j.
- Suppose a sample of n = 100 items is desired and N = 5000; then j = 50. If your
first item is numbered 20, then the 20th, 70th, 120th, ..., 4970th items are sampled.
- Systematic samples provide a good representation of the population if there is
no cyclical variation in the population.
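
The systematic sampling rule above can be sketched in a few lines of Python (used here only for illustration; the course software is R). The function name `systematic_sample` and its arguments are assumptions made for this sketch, and it assumes that n divides N exactly.

```python
import random

def systematic_sample(population, n, start=None, rng=None):
    """Select every j-th item (j = N // n), beginning at a 1-based start in 1..j."""
    N = len(population)
    j = N // n                          # sampling interval; assumes n divides N
    if start is None:
        rng = rng or random.Random()
        start = rng.randint(1, j)       # first item chosen uniformly from 1..j
    # Items are numbered from 1, so item number k sits at list index k - 1.
    return [population[start - 1 + k * j] for k in range(n)]

population = list(range(1, 5001))       # items numbered 1..5000, so N = 5000

# Reproduce the slide's example: first item numbered 20, interval j = 50.
fixed = systematic_sample(population, n=100, start=20)
print(fixed[:3], fixed[-1])             # [20, 70, 120] 4970

# Random start, as the definition requires.
randomized = systematic_sample(population, n=100, rng=random.Random(0))
print(len(randomized))                  # 100
```

Only the starting position is random; once it is drawn, the rest of the sample is determined, which is why cyclical variation in the list can distort a systematic sample.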


Parameter and Statistic

A parameter is a numerical measure that describes a specific characteristic of a
population.
A statistic is a numerical measure that describes a specific characteristic of a
sample.
In other words, a parameter is a function of the population and a statistic is a
function of the sample.
Descriptive statistics focus on graphical and numerical procedures that are used to
summarize and process data.
Inferential statistics focus on using the data to make predictions, forecasts, and
estimates to make better decisions.
Usually, descriptive statistics are elementary and intuitive, and inferential statistics
are more advanced and more powerful.


Sampling and Nonsampling Errors

The goal of statistics is to make decisions about a population parameter based on a
sample statistic.
Because n < N, there must be some uncertainty in any decision about the parameter
that is based on the statistic; the resulting error is called sampling error.
Even if the whole population were collected, some errors would remain; these are
called nonsampling errors.
- The population actually sampled is not the relevant one, e.g., the voting opinion
on Franklin Roosevelt.
- Survey subjects may give inaccurate or dishonest answers, e.g., the voting
opinion on Donald Trump.
- There may be no response to survey questions, e.g., income level of the rich.
Read the textbook (Page 28) for other examples of nonsampling errors, but we will
focus on sampling errors in this course.
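
As a rough illustration of sampling error, the sketch below compares a sample mean (a statistic) with the population mean (a parameter). The population is simulated and purely hypothetical, and the helper name `sampling_error` is an assumption of this sketch.

```python
import random
import statistics

rng = random.Random(1)

# Hypothetical population of N = 1000 integer-valued measurements.
population = [rng.randint(0, 100) for _ in range(1000)]
mu = statistics.mean(population)            # parameter: the population mean

def sampling_error(n):
    """Draw a simple random sample of size n (without replacement)
    and return the sample mean minus the population mean."""
    sample = rng.sample(population, n)
    return statistics.mean(sample) - mu     # statistic minus parameter

for n in (10, 100, 1000):
    print(n, sampling_error(n))
```

When n = N the sample is the whole population, so the sampling error is exactly zero. For n < N the error varies from one sample to the next, which is the uncertainty described above.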


Classification of Variables: Categorical and Numerical Variables

A variable is a specific characteristic (such as age or weight) of an individual or
object.
- A variable is any property or descriptor that can take multiple values.
- A variable can be thought of as a question, to which the value is the answer. E.g.,
"How old are you?", "42 years old". Here, "age" is the variable, and "42" is its value.
Based on the type and amount of information contained in the data, we classify
variables into categorical and numerical variables.
Based on the levels of measurement, we classify variables into qualitative and
quantitative variables.
Categorical variables produce responses that belong to groups or categories.
- e.g., responses to yes/no questions.
- e.g., choices from "strongly disagree" to "strongly agree".
Numerical variables include both discrete and continuous variables.


Discrete and Continuous Variables

A discrete numerical variable may (but does not necessarily) have a finite number
of values.
- The most common type of discrete variable produces a response that comes
from a counting process, i.e., takes values in the infinite set 0, 1, 2, 3, ..., e.g.,
the number of customers.
A continuous numerical variable may take on any value within a given range of
real numbers.
- The continuous variable usually arises from a measurement (not a counting)
process, e.g., the salary of a worker.
- In daily life, we tend to truncate continuous variables as if they were discrete
ones due to the precision of measurement instruments or convenience.


Qualitative and Quantitative Variables

Qualitative data do not assign measurable meaning to the "difference" in numbers.
- e.g., the numbers assigned to football players: number 10 does not play
twice as well as number 5.
Quantitative data assign measurable meaning to the "difference" in numbers.
- e.g., an exam score of 90 is twice a score of 45.
Qualitative data include nominal data and ordinal data.
- Nominal data are considered the lowest or weakest type of data, e.g., gender,
country of citizenship, phone number, etc., where numerical identification is
chosen only for convenience and does not imply ranking of responses.
- Ordinal data indicate the rank ordering of items, e.g., product quality rating
(1: poor; 2: average; 3: good), but the difference in numbers is meaningless.

Qualitative and Quantitative Variables (continued)

Quantitative data include interval data and ratio data.
- Interval data indicate rank and distance from an arbitrarily determined
benchmark or zero, e.g., Celsius and Fahrenheit degrees of temperature or the
year based on the Gregorian calendar, where differences are meaningful but ratios
are not.
- Ratio data indicate both rank and distance from a natural zero, e.g., age and
weight, where the ratios of two measures have meaning.
From nominal, to ordinal, to interval, and to ratio, more and more information is
contained in the data.
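
A small numerical check may help show why ratios are meaningless for interval data but meaningful for ratio data. The temperatures and weights below are made-up values for this sketch, and the kilogram-to-pound factor is approximate.

```python
import math

def celsius_to_fahrenheit(c):
    return c * 9 / 5 + 32

# Interval data: the zero is arbitrary, so ratios depend on the unit chosen.
hot_c, mild_c, cold_c = 20.0, 10.0, 0.0
hot_f, mild_f, cold_f = (celsius_to_fahrenheit(t) for t in (hot_c, mild_c, cold_c))

print(hot_c / mild_c)                         # 2.0
print(hot_f / mild_f)                         # 1.36 (68 / 50): the "ratio" changed

# Differences keep their relative size under the change of unit.
print((hot_c - mild_c) / (mild_c - cold_c))   # 1.0
print((hot_f - mild_f) / (mild_f - cold_f))   # 1.0 (18 / 18)

# Ratio data: the zero is natural, so ratios survive a change of unit.
KG_TO_LB = 2.20462                            # approximate conversion factor
heavy_kg, light_kg = 80.0, 40.0
ratio_kg = heavy_kg / light_kg
ratio_lb = (heavy_kg * KG_TO_LB) / (light_kg * KG_TO_LB)
print(math.isclose(ratio_kg, ratio_lb))       # True
```

Saying that 20°C is "twice as hot" as 10°C therefore has no unit-free meaning, whereas 80 kg is twice 40 kg in any unit of weight.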


Measurement Levels


Graphs to Describe Categorical Variables: (Relative) Frequency Distributions

A frequency distribution is a table used to organize data.

Figure: HEI: Healthy Eating Index

The left column (called classes or groups) includes all possible responses on a
variable under study.
The right column is a list of the frequencies, or number of observations, for each
class.
A relative frequency distribution reports, for each class, $\frac{\text{frequency}}{n} \times 100\%$.
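
A frequency distribution and its relative version can be computed as sketched below. The responses and class labels are hypothetical and are not the HEI data in the figure.

```python
from collections import Counter

# Hypothetical categorical responses, one per observation.
responses = ["Good", "Poor", "Good", "Needs improvement", "Good",
             "Poor", "Needs improvement", "Good", "Good", "Poor"]

freq = Counter(responses)               # class -> number of observations
n = len(responses)                      # sample size

# Relative frequency = frequency / n * 100%.
rel_freq = {cls: 100 * count / n for cls, count in freq.items()}

for cls, count in freq.most_common():
    print(f"{cls:<20}{count:>4}{rel_freq[cls]:>8.1f}%")
```

The frequencies sum to n and the relative frequencies sum to 100%, which is a quick check that no observation has been dropped or double counted.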

Bar Charts

Bar charts draw attention to the frequency itself (not proportion of frequencies) of
each category.
The height of bars represents frequency, and bars need not touch.


Pie Charts

If the focus is the proportion of frequencies, then pie charts are appropriate.

Figure: Browser Wars (panels: European Market Share, North America Market Share)


Pareto Diagrams

A Pareto diagram is a bar chart that displays the frequency of defect causes. It is
used to separate the "vital few" from the "trivial many".

Figure: Errors in Health Care Claims Processing

The bars are arranged in the descending order of frequencies.

The first three causes contribute about 80% of errors.
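
The ordering and cumulative percentages behind a Pareto diagram can be sketched as follows. The causes and counts are hypothetical, chosen only so that the first three causes add up to 80%; they are not the data in the figure.

```python
from itertools import accumulate

# Hypothetical error counts by cause (not the data in the figure).
error_counts = {
    "Duplicate claim": 8,
    "Diagnosis code": 40,
    "Other": 5,
    "Procedure code": 25,
    "Missing signature": 7,
    "Provider information": 15,
}

# Arrange causes in descending order of frequency, as in a Pareto diagram.
ordered = sorted(error_counts.items(), key=lambda item: item[1], reverse=True)
total = sum(error_counts.values())

# Running total of frequencies, expressed as a percentage of all errors.
cumulative_pct = [100 * c / total for c in accumulate(count for _, count in ordered)]

for (cause, count), cum in zip(ordered, cumulative_pct):
    print(f"{cause:<22}{count:>4}{cum:>8.1f}%")
```

Reading down the cumulative column shows where the "vital few" end: in this made-up data the third row reaches 80%.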

The Pareto Principle or "80-20 Rule"

The Pareto principle or "80-20 rule" states that 80% of outcomes are due to 20%
of causes. [figure here]
(**) The Pareto density function: $f(x) = \frac{\alpha x_m^{\alpha}}{x^{\alpha+1}} \mathbf{1}(x \ge x_m)$.

Figure: Pareto Density Functions for Various $\alpha$'s with $x_m = 1$
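
(**) The density above can be evaluated directly, and it can be linked to the "80-20 rule". The sketch below relies on a standard property of the Pareto distribution that is not stated on the slide: for α > 1, the top fraction p of the population accounts for a share p^(1 - 1/α) of the total. The function names are assumptions of this sketch.

```python
import math

def pareto_pdf(x, alpha, x_m=1.0):
    """Pareto density: alpha * x_m**alpha / x**(alpha + 1) for x >= x_m, else 0."""
    if x < x_m:
        return 0.0
    return alpha * x_m ** alpha / x ** (alpha + 1)

def top_share(p, alpha):
    """Share of the total held by the top fraction p (requires alpha > 1)."""
    return p ** (1 - 1 / alpha)

print(pareto_pdf(1.0, alpha=2.0))       # 2.0
print(pareto_pdf(2.0, alpha=1.0))       # 0.25
print(pareto_pdf(0.5, alpha=2.0))       # 0.0 (below x_m)

# The value of alpha for which the top 20% account for 80% of the total.
alpha_8020 = math.log(5) / math.log(4)  # about 1.16
print(alpha_8020, top_share(0.2, alpha_8020))
```

Under this property, the 80-20 rule corresponds to one particular shape parameter, α = log 5 / log 4; other values of α give a more or a less concentrated split.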


Vilfredo F. D. Pareto (1848-1923), Italian, University of Lausanne


Cross Tables

Cross tables (or crosstabs/contingency tables) are used to describe relationships
between categorical or ordinal variables.

Figure: A 3 × 2 Cross Table

It lists the frequencies of all combinations of values for the two variables.
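
A cross table can be built by counting each combination of values, as in the sketch below. The two variables (a three-level rating and a yes/no answer) and the observations are hypothetical and are not the data in the figure.

```python
from collections import Counter

# Hypothetical observations: (quality rating, would recommend?) for 10 customers.
observations = [
    ("Poor", "No"), ("Poor", "No"), ("Poor", "Yes"),
    ("Average", "Yes"), ("Average", "No"), ("Average", "Yes"),
    ("Good", "Yes"), ("Good", "Yes"), ("Good", "Yes"), ("Good", "No"),
]

rows = ["Poor", "Average", "Good"]      # 3 row categories
cols = ["Yes", "No"]                    # 2 column categories

counts = Counter(observations)          # (row, col) -> frequency
cross_table = {r: {c: counts[(r, c)] for c in cols} for r in rows}

print(f"{'':<10}" + "".join(f"{c:>6}" for c in cols))
for r in rows:
    print(f"{r:<10}" + "".join(f"{cross_table[r][c]:>6}" for c in cols))
```

Each cell is the number of observations with that pair of values, so the six cells of this 3 × 2 table sum to the sample size.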


Component or Cluster Bar Charts

A component (or stacked) bar chart and cluster (or side-by-side) bar chart are
used to picture the information in cross tables, and are extensions of the bar chart
above.