
1 Introduction

The document provides an overview of statistics, including its definition, applications in business and economics, data classification, and the importance of statistical inference. It discusses various types of data, scales of measurement, and methods for data collection and analysis, emphasizing the role of computers in statistical practices. Additionally, it covers descriptive statistics, variability in data, and the process of statistical inference.

Uploaded by

henrytsang1835

Data and Statistics (Ch.1)

• Definition of Statistics
• Applications in Real Life
• Data Classification
• Data Sources
• Descriptive Statistics
• Statistical Inference
• The Use of Computers in Statistical Analysis
• Data Mining
• Ethical Guidelines for Statistical Practice

Slide 1
Acknowledgement to Dr. William Lau
Statistics

• The term statistics can refer to numerical facts such as averages, medians, percents, and index numbers that help us understand a variety of business and economic situations.
• Statistics can also refer to the art and science of collecting, analyzing, presenting, and interpreting data.

Slide 2
Applications in Business and Economics
• Accounting
  Public accounting firms use statistical sampling procedures when conducting audits for their clients.
• Economics
  Economists use statistical information in making forecasts about the future of the economy or some aspect of it.
• Finance
  Financial advisors use price-earnings ratios and dividend yields to guide their investment advice.

Slide 3
Applications in Business and Economics
• Marketing
  Electronic point-of-sale scanners at retail checkout counters are used to collect data for a variety of marketing research applications.
• Production
  A variety of statistical quality control charts are used to monitor the output of a production process.

Slide 4
Data and Data Sets

• Data are the facts and figures collected, analyzed, and summarized for presentation and interpretation.

• All the data collected in a particular study are referred to as the data set for the study.

Slide 5
Elements, Variables, and Observations

• Elements are the entities on which data are collected.
• A variable is a characteristic of interest for the elements.
• The set of measurements obtained for a particular element is called an observation.
• A data set with n elements contains n observations.
• The total number of data values in a complete data set is the number of elements multiplied by the number of variables.

Slide 6
Data, Data Sets, Elements, Variables, and Observations

Element Names    Variables
Company          Stock Exchange   Annual Sales ($M)   Earn/Share ($)
Dataram          NQ                73.10              0.86
EnergySouth      N                 74.00              1.67
Keystone         N                365.70              0.86
LandCare         NQ               111.40              0.33
Psychemedics     N                 17.60              0.13

The company names are the element names, the remaining columns are the variables, and the full table is the data set.
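
The counting rule for data values (number of elements multiplied by number of variables) can be checked against this table. The following is only a sketch: the dictionary layout and the field names `stock_exchange`, `annual_sales_musd`, and `earnings_per_share` are assumptions made for illustration and do not come from the slides.

```python
# Each key is an element (a company). Each inner dict is that element's
# observation, holding one data value per variable.
# The field names are hypothetical labels for the three table columns.
data_set = {
    "Dataram":      {"stock_exchange": "NQ", "annual_sales_musd": 73.10,  "earnings_per_share": 0.86},
    "EnergySouth":  {"stock_exchange": "N",  "annual_sales_musd": 74.00,  "earnings_per_share": 1.67},
    "Keystone":     {"stock_exchange": "N",  "annual_sales_musd": 365.70, "earnings_per_share": 0.86},
    "LandCare":     {"stock_exchange": "NQ", "annual_sales_musd": 111.40, "earnings_per_share": 0.33},
    "Psychemedics": {"stock_exchange": "N",  "annual_sales_musd": 17.60,  "earnings_per_share": 0.13},
}

n_elements = len(data_set)  # n elements give n observations
variables = sorted({name for obs in data_set.values() for name in obs})
n_variables = len(variables)
n_data_values = sum(len(obs) for obs in data_set.values())

print(f"{n_elements} elements x {n_variables} variables = {n_data_values} data values")
```

With 5 elements and 3 variables, the complete data set holds 15 data values.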
Slide 7
Types of Data

• Categorical
  Observed values are categories
  Example: female and male attendance in this class

• Numerical
  Observed values are integer, real or complex numbers
  Examples: IQ scores of GT students (integer values)
            Lifetime of a computer chip (real values)

Slide 8
Quick check

1. For each CUHK student, we record his/her blood type.
   A. numeric B. categorical
2. For each CUHK student, we record his/her number of siblings.
   A. numeric B. categorical
3. For each CUHK student, we record his/her country of residence.
   A. numeric B. categorical
4. For each CUHK student, we record his/her height.
   A. numeric B. categorical

Slide 9
Scales of Measurement
Scales of measurement include:
  Nominal: categorical with no order
  Ordinal: categorical with order
  Interval: numerical values with a fixed unit of measure
  Ratio: numerical values with a true zero, so that ratios are meaningful

The scale determines the amount of information contained in the data.

The scale indicates the data summarization and statistical analyses that are most appropriate.

Slide 10
Scales of Measurement

• Nominal

Data are labels or names used to identify an attribute of the element.

A nonnumeric label or numeric code may be used.

Slide 11
Scales of Measurement

• Nominal

Example:
Students of a university are classified by the school in which they are enrolled using a nonnumeric label such as Business, Humanities, Education, and so on.
Alternatively, a numeric code could be used for the school variable (e.g. 1 denotes Business, 2 denotes Humanities, 3 denotes Education, and so on).

Slide 12
Scales of Measurement

• Ordinal

The data have the properties of nominal data and the order or rank of the data is meaningful.

A nonnumeric label or numeric code may be used.

Slide 13
Scales of Measurement

• Ordinal

Example:
Students of a university are classified by their class standing using a nonnumeric label such as Freshman, Sophomore, Junior, or Senior.
Alternatively, a numeric code could be used for the class standing variable (e.g. 1 denotes Freshman, 2 denotes Sophomore, and so on).

Slide 14
Scales of Measurement

• Interval

The data have the properties of ordinal data, and the interval between observations is expressed in terms of a fixed unit of measure.

Interval data are always numeric.

Slide 15
Scales of Measurement

• Interval

Example:
Melissa has an SAT score of 1885, while Kevin has an SAT score of 1780. Melissa scored 105 points more than Kevin.

Slide 16
Scales of Measurement

• Ratio

The data have all the properties of interval data and the ratio of two values is meaningful.

Variables such as distance, height, weight, and time use the ratio scale.

This scale must contain a zero value that indicates that nothing exists for the variable at the zero point.

Slide 17
Scales of Measurement

• Ratio

Example:
Melissa’s college record shows 36 credit hours earned, while Kevin’s record shows 72 credit hours earned. Kevin has twice as many credit hours earned as Melissa.
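
The two worked examples (SAT scores on the interval scale, credit hours on the ratio scale) come down to which arithmetic operation carries meaning. The short sketch below restates them using the numbers from the slides; the variable names are illustrative only.

```python
# Interval scale (SAT scores): the difference between two values is meaningful.
melissa_sat, kevin_sat = 1885, 1780
sat_difference = melissa_sat - kevin_sat  # Melissa scored this many points more

# Ratio scale (credit hours): zero means "no credit hours earned",
# so the ratio of two values is meaningful as well.
melissa_credits, kevin_credits = 36, 72
credit_ratio = kevin_credits / melissa_credits  # Kevin has this many times Melissa's hours

print(f"SAT difference: {sat_difference} points")
print(f"Credit-hour ratio: {credit_ratio:.0f}x")
```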

Slide 18
Scales of Measurement

Data
  Categorical
    Numeric: Nominal, Ordinal
    Non-numeric: Nominal, Ordinal
  Numerical
    Numeric: Interval, Ratio

Slide 19
Cross-Sectional Data

Cross-sectional data are collected at the same or approximately the same point in time.

Example: data detailing the salary distribution of all fresh graduates in 2021 from each university in Hong Kong

Slide 20
Time Series Data

Time series data are collected over several time periods.

Example: data detailing the salary distribution of all fresh graduates at CUHK in the past 3 years

Slide 21
Time Series Data

[Figure: U.S. Average Price Per Gallon for Conventional Regular Gasoline]

Source: Energy Information Administration, U.S. Department of Energy, May 2009.

Slide 22
Data Sources

• Existing Sources
  Internal company records: almost any department
  Business database services: EJFQ
  Government agencies: Census & Statistics Department
  Industry associations: The Hong Kong General Chamber of Commerce
  Special-interest organizations: Graduate Management Admission Council
  Internet: more and more firms

Slide 23
Data Sources

• Data Available From Internal Company Records

Record               Some of the Data Available
Employee records     name, address, social security number
Production records   part number, quantity produced, direct labor cost, material cost
Inventory records    part number, quantity in stock, reorder level, economic order quantity
Sales records        product number, sales volume, sales volume by region
Credit records       customer name, credit limit, accounts receivable balance
Customer profile     age, gender, income, household size

Slide 24
Data Sources

• Statistical Studies - Experimental

In experimental studies the variable of interest is first identified. Then one or more other variables are identified and controlled so that data can be obtained about how they influence the variable of interest.

The largest experimental study ever conducted is believed to be the 1954 Public Health Service experiment for the Salk polio vaccine. Nearly two million U.S. children (grades 1-3) were selected.

Slide 25
Data Sources

• Statistical Studies - Observational

In observational (nonexperimental) studies no attempt is made to control or influence the variables of interest. A survey is a good example.

Studies of smokers and nonsmokers are observational studies because researchers do not determine or control who will smoke and who will not smoke.

Slide 26
Data Acquisition Considerations

Time Requirement
• Searching for information can be time consuming.
• Information may no longer be useful by the time it
is available.
Cost of Acquisition
• Organizations often charge for information even
when it is not their primary business activity.
Data Errors
• Using any data that happen to be available or were
acquired with little care can lead to misleading
information.
Slide 27
Descriptive Statistics

• Most of the statistical information in newspapers, magazines, company reports, and other publications consists of data that are summarized and presented in a form that is easy to understand.
• Such summaries of data, which may be tabular, graphical, or numerical, are referred to as descriptive statistics.

Slide 28
Example: William Auto Repair

The manager of William Auto would like to have a better understanding of the cost of parts used in the engine tune-ups performed in his shop. He examines 50 customer invoices for tune-ups. The costs of parts, rounded to the nearest dollar, are listed on the next slide.

Slide 29
Example: William Auto Repair

• Sample of Parts Cost ($) for 50 Tune-ups

91 78 93 57 75 52 99 80 97 62
71 69 72 89 66 75 79 75 72 76
104 74 62 68 97 105 77 65 80 109
85 97 88 68 83 68 71 69 67 74
62 82 98 101 79 105 79 69 62 73

Slide 30
Tabular Summary: Frequency and Percent Frequency
• Example: William Auto Repair

Parts Cost ($)   Frequency   Percent Frequency
50-59                 2               4
60-69                13              26
70-79                16              32
80-89                 7              14
90-99                 7              14
100-109               5              10
Total                50             100

Percent frequency = (frequency/50)100; for example, (2/50)100 = 4 for the 50-59 class.
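
The tabular summary can be reproduced directly from the 50 parts costs listed earlier. The sketch below is one possible way to do it, assuming classes of width 10 starting at 50; the variable names are illustrative and the slides do not prescribe any particular tool.

```python
from collections import Counter

# Parts cost ($) for the 50 tune-ups, copied from the sample slide.
parts_cost = [
    91, 78, 93, 57, 75, 52, 99, 80, 97, 62,
    71, 69, 72, 89, 66, 75, 79, 75, 72, 76,
    104, 74, 62, 68, 97, 105, 77, 65, 80, 109,
    85, 97, 88, 68, 83, 68, 71, 69, 67, 74,
    62, 82, 98, 101, 79, 105, 79, 69, 62, 73,
]

class_width = 10
# Map each cost to the lower limit of its class, e.g. 57 -> 50 and 104 -> 100.
counts = Counter((cost // class_width) * class_width for cost in parts_cost)

n = len(parts_cost)
frequency_table = []
for lower in sorted(counts):
    frequency = counts[lower]
    percent = frequency / n * 100  # percent frequency = (frequency / n) * 100
    label = f"{lower}-{lower + class_width - 1}"
    frequency_table.append((label, frequency, percent))

for label, frequency, percent in frequency_table:
    print(f"{label:>7}  {frequency:>3}  {percent:>4.0f}")
```

The class counts produced this way match the Frequency column of the table.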

Slide 31
Graphical Summary: Histogram

• Example: William Auto Repair

[Figure: histogram titled "Tune-up Parts Cost"; x-axis: Parts Cost ($), y-axis: Frequency]

Slide 32
Numerical Descriptive Statistics

• The most common numerical descriptive statistic is the average (or mean).
• The average provides a measure of the central tendency, or central location, of the data for a variable.
• William’s average cost of parts, based on the 50 tune-ups studied, is $79 (found by summing the 50 cost values and then dividing by 50).
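
As a check on the $79 figure, the average can be computed from the 50 parts costs on the sample slide. This is a minimal sketch using Python's standard `statistics` module; the variable names are illustrative.

```python
from statistics import mean

# Parts cost ($) for the 50 tune-ups, copied from the sample slide.
parts_cost = [
    91, 78, 93, 57, 75, 52, 99, 80, 97, 62,
    71, 69, 72, 89, 66, 75, 79, 75, 72, 76,
    104, 74, 62, 68, 97, 105, 77, 65, 80, 109,
    85, 97, 88, 68, 83, 68, 71, 69, 67, 74,
    62, 82, 98, 101, 79, 105, 79, 69, 62, 73,
]

# The mean is the sum of the values divided by the number of values.
total_cost = sum(parts_cost)
average_cost = mean(parts_cost)

print(f"{total_cost} / {len(parts_cost)} = {average_cost:.2f}")
```

The costs sum to 3949, so the average is 78.98, which rounds to the $79 quoted on the slide.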

Slide 33
Variability in Data

• Samples are random
  • individual variability
  • noise
  • measurement errors
  • the world has randomness

[Figure: two panels labeled "Small Variability" and "Large Variability"]

• Randomness means variability: successive observations of a system or phenomenon do not produce exactly the same result.

• We capture the randomness by probability models

Slide 34
Probability vs Statistics

[Diagram: Probability leads from the Population to the Sample; Inferential Statistics leads from the Sample back to the Population]

Probability: given the information in the pail, what is in your hand?

Statistics: given the information in your hand, what is in the pail?

Slide 35
Statistical Inference
• Population: a finite, well-defined group of ALL objects which, although possibly large, can be enumerated in theory (e.g. investigating ALL the bearings manufactured today).

• Sample: a SUBSET of a population

[Diagram labels: Population, Sample, Observation]

Slide 36
Estimation
Hospital waiting time:

Determine a probability distribution of a population based on a random sample

Estimate parameters of a distribution based on a random sample

True parameter value: β = 5.00

Slide 37
Confidence Interval
How confident are we, given the variability of the data?

5.07

β ∈
Slide 38
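
The confidence interval slide gives neither the data nor the method, so the following is only a sketch of one common approach: a large-sample (normal approximation) 95% confidence interval for a mean. The waiting times are simulated, and the sample size, random seed, and exponential model are all assumptions made for illustration.

```python
import random
from statistics import NormalDist, mean, stdev

# Hypothetical data: 200 simulated waiting times (minutes) drawn from an
# exponential distribution with mean 5. These are not the slide's data.
rng = random.Random(0)
waiting_times = [rng.expovariate(1 / 5) for _ in range(200)]

n = len(waiting_times)
sample_mean = mean(waiting_times)                 # point estimate of the mean
standard_error = stdev(waiting_times) / n ** 0.5  # estimated std. error of the mean

confidence = 0.95
# Two-sided critical value of the standard normal distribution (about 1.96).
z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

lower = sample_mean - z * standard_error
upper = sample_mean + z * standard_error

print(f"estimate = {sample_mean:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

More variable data or a smaller sample gives a wider interval, which is the sense in which the interval expresses confidence given the variability of the data.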
Hypothesis Test

Given the average wait time 5.07

Null hypothesis: β ≤ 5

Alternative hypothesis: β > 5

Which one is true?

Data → Statistics → Decision
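
The path from data to decision can be sketched as a one-sided z-test of the null hypothesis β ≤ 5 against the alternative β > 5. Only the average wait time of 5.07 comes from the slide. The sample standard deviation, sample size, and significance level below are hypothetical values chosen for illustration, and the z-test itself is an assumed choice of method.

```python
from statistics import NormalDist

sample_mean = 5.07        # average wait time, from the slide
hypothesized_mean = 5.0   # boundary value of the null hypothesis (beta <= 5)
sample_std = 5.0          # hypothetical sample standard deviation
n = 200                   # hypothetical sample size
alpha = 0.05              # hypothetical significance level

# Test statistic: how many standard errors the sample mean lies above 5.
z = (sample_mean - hypothesized_mean) / (sample_std / n ** 0.5)

# Upper-tail p-value, because the alternative hypothesis is beta > 5.
p_value = 1 - NormalDist().cdf(z)

reject_null = p_value < alpha
print(f"z = {z:.3f}, p-value = {p_value:.3f}, reject null: {reject_null}")
```

Under these assumed values the sample mean is only a small fraction of a standard error above 5, so the null hypothesis is not rejected. A different sample size or standard deviation could lead to a different decision.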

Slide 39
Example: Comparing two populations
• Which drug is more effective?

Group 1    Group 2

Slide 40
Example: Comparing two populations
A/B testing

Slide 41
Linear Regression
• Predict a response variable based on one or more predictor variables
• Identify important factors influencing a response variable
What are the important variables affecting the waiting time in a hospital waiting room?

[Diagram: the following factors point to "Waiting time in hospital"]
  Type of emergency care
  Physicians’ experience and skills
  Time of day
  Number of nurses
  Capacity of waiting room

$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \quad i = 1, 2, \dots, n$

where $Y_i$ is the response, $X_i$ is the regressor (or predictor), $\beta_0$ is the intercept, $\beta_1$ is the slope, and $\varepsilon_i$ is the random error.

Slide 42
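
To make the simple linear regression model concrete, the intercept and slope can be estimated by ordinary least squares. The sketch below assumes a single predictor (number of nurses) and uses made-up numbers; the function name `fit_simple_linear_regression` and the data are hypothetical and do not come from the slides.

```python
def fit_simple_linear_regression(x, y):
    """Return least-squares estimates (b0, b1) for y = b0 + b1 * x + error."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Slope: sum of cross-deviations divided by sum of squared x-deviations.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    # Intercept: forces the fitted line through the point (x_bar, y_bar).
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical data: nurses on duty and average waiting time (minutes).
nurses = [2, 3, 4, 5, 6]
waiting_time = [48, 41, 37, 30, 24]

b0, b1 = fit_simple_linear_regression(nurses, waiting_time)
print(f"estimated intercept = {b0:.1f}, estimated slope = {b1:.1f}")
```

A negative estimated slope in this made-up example would be read as shorter waiting times when more nurses are on duty. With several predictors, as in the waiting-room diagram, multiple regression would be needed instead.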
Statistical Inference

Population: the set of all elements of interest in a particular study

Sample: a subset of the population

Statistical inference: the process of using data obtained from a sample to make estimates and test hypotheses about the characteristics of a population

Census: collecting data for the entire population

Sample survey: collecting data for a sample

Slide 43
Process of Statistical Inference

1. Population consists of all tune-ups. Average cost of parts is unknown.
2. A sample of 50 engine tune-ups is examined.
3. The sample data provide a sample average parts cost of $79 per tune-up.
4. The sample average is used to estimate the population average.

Slide 44
Computers and Statistical Analysis

• Statisticians often use computer software to perform the statistical computations required with large amounts of data.
• To facilitate computer usage, many of the data sets in this book are available on the website that accompanies the text.
• The data files may be downloaded in either Minitab or Excel formats.
• Also, the Excel add-in StatTools can be downloaded from the website.
• Chapter-ending appendices cover the step-by-step procedures for using Minitab, Excel, and StatTools.

Slide 45
Data Warehousing

• Organizations obtain large amounts of data on a daily basis by means of magnetic card readers, bar code scanners, point of sale terminals, and touch screen monitors.
• Wal-Mart captures data on 20-30 million transactions per day.
• Visa processes 6,800 payment transactions per second.
• Capturing, storing, and maintaining the data, referred to as data warehousing, is a significant undertaking.

Slide 46
Data Mining

• Analysis of the data in the warehouse might aid in decisions that will lead to new strategies and higher profits for the organization.
• Using a combination of procedures from statistics, mathematics, and computer science, analysts “mine the data” to convert it into useful information.
• The most effective data mining systems use automated procedures to discover relationships in the data and predict future outcomes, … prompted by only general, even vague, queries by the user.

Slide 47
Data Mining Applications

• The major applications of data mining have been made by companies with a strong consumer focus such as retail, financial, and communication firms.
• Data mining is used to identify related products that customers who have already purchased a specific product are also likely to purchase (and then pop-ups are used to draw attention to those related products).
• As another example, data mining is used to identify customers who should receive special discount offers based on their past purchasing volumes.

Slide 48
Data Mining Requirements

• Statistical methods such as multiple regression, logistic regression, and correlation are heavily used.
• Also needed are computer science technologies involving artificial intelligence and machine learning.
• A significant investment in time and money is required as well.

Slide 49
Data Mining Model Reliability

• Finding a statistical model that works well for a particular sample of data does not necessarily mean that it can be reliably applied to other data.
• With the enormous amount of data available, the data set can be partitioned into a training set (for model development) and a test set (for validating the model).
• There is, however, a danger of overfitting the model to the point that misleading associations and conclusions appear to exist.
• Careful interpretation of results and extensive testing are important.
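
Partitioning a data set into a training set and a test set can be sketched as a random shuffle followed by a cut. The helper `train_test_split` below is a hypothetical illustration written for this note, with an assumed 80/20 split and fixed seed; it is not the API of any particular library.

```python
import random


def train_test_split(records, test_fraction=0.2, seed=0):
    """Shuffle a copy of the records and split it into (training, test) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder records standing in for rows of a real data set.
records = list(range(100))
training_set, test_set = train_test_split(records)

print(f"{len(training_set)} training records, {len(test_set)} test records")
```

The model is developed on the training set only. Its performance on the held-out test set then gives a more honest indication of whether it can be applied to other data, which helps reveal overfitting.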

Slide 50
Ethical Guidelines for Statistical Practice

• In a statistical study, unethical behavior can take a variety of forms including:
  • Improper sampling
  • Inappropriate analysis of the data
  • Development of misleading graphs
  • Use of inappropriate summary statistics
  • Biased interpretation of the statistical results
• You should strive to be fair, thorough, objective, and neutral as you collect, analyze, and present data.
• As a consumer of statistics, you should also be aware of the possibility of unethical behavior by others.

Slide 51
