
Chapter 1. Defining and Collecting Data Triệu Vi Gửi

This document provides an introduction to the Applied Statistics for Business course MAS202. The textbook is Basic Business Statistics: Concepts and Applications. Topics covered include defining and collecting data, organizing and visualizing variables, numerical descriptive measures, probability, probability distributions, sampling distributions, confidence intervals, hypothesis testing, analysis of variance, and linear regression.

Course Introduction

Course name: Applied Statistics for Business (MAS202)

Textbook: Basic Business Statistics: Concepts and Applications, Pearson, 2019.

Topics covered:
Chapter 1: Defining and Collecting Data
Chapter 2: Organizing and Visualizing Variables
Chapter 3: Numerical Descriptive Measures
Chapter 4: Basic Probability
Chapter 5: Discrete Probability Distributions
Chapter 6: The Normal Distribution and Other Continuous Distributions
Chapter 7: Sampling Distributions
Chapter 8: Confidence Interval Estimation
Chapter 9: Fundamentals of Hypothesis Testing: One-Sample Tests
Chapter 10: Two-Sample Tests
Chapter 11: Analysis of Variance
Chapter 13: Simple Linear Regression
Chapter 14: Introduction to Multiple Regression
May 10, 2023
Applied Statistics for Business
Chapter 1: DEFINING AND COLLECTING DATA

FPT University
Department of Mathematics

Quy Nhon, 2023

FUQN
Table of Contents

1 Defining Variables

2 Collecting Data

3 Types of Sampling Methods

4 Data Cleaning

5 Other Data Processing Tasks

6 Types of Survey Errors

FUQN
VyNHT – FUQN MAS202 - Chapter 1 Quy Nhon, 2023 3 / 43
1 Defining Variables

Classifying Variables By Type

Variables
    Categorical
        Nominal (defined categories). Examples: Marital Status, Political Party, Eye Color.
        Ordinal (ordered categories). Examples: Ratings (Good, Better, Best; Low, Med, High).
    Numerical
        Discrete (counted items). Examples: Number of Children, Defects per hour.
        Continuous (measured characteristics). Examples: Weight, Voltage.

Categorical Variables
Categorical variables (qualitative variables) are variables whose data represent categories.
→ These variables take values that are non-numerical in nature.
1 Nominal variable: describes a name, label, or category without a natural order.
2 Ordinal variable: describes categories whose values are ranked by an order relation
between the different categories.

Examples (Categorical Variables)


1 Gender with its categories male and female is a categorical variable, as is the
variable preferred-New-Coke with its categories yes and no.
2 Table 1: Method of travel to school
Transportation    Number of students
Car as driver     1
Motorbike         20
Bicycle           2
Walked            5
Other methods     2
→ Transportation is a nominal variable.

Table 2: Student behaviour ranking
Behaviour    Number of students
Excellent    5
Very good    12
Good         10
Bad          2
Very bad     1
→ Behaviour is an ordinal variable.
Numerical Variables
Numerical variables (quantitative variables) are variables whose data represent a counted or
measured quantity.
1 Discrete variables: arise from a counting process.
2 Continuous variables: arise from a measuring process.

Examples (Numerical Variables)


1 Discrete variables:
Number of scratches on a surface.
Number of transmitted bits received in error.
Proportion of defective parts among 1000 tested (a count divided by 1000, so only the
values 0, 0.001, ..., 1 are possible).
The monthly number of smartphones sold in an electronics store.
2 Continuous variables:
Time.
Weight.
Electrical current.
Length.
Pressure.
Temperature.
Voltage.
Quiz
Determine the variable type that corresponds to the answer to each of the following questions:
1 Do you have a Facebook profile?
2 How many text messages have you sent in the past three days?
3 How long did the mobile app update take to download?

Measurement Scales
Nominal Scale
A nominal scale classifies data into distinct categories in which no ranking is implied.

Examples (Nominal Scale)


Categorical Variable ←→ Categories
Do you have a Facebook profile? ←→ Yes, No
Type of investment ←→ Growth, Value, Other
Cellular provider ←→ AT&T, Sprint, Verizon, Other, None

Ordinal Scale
An ordinal scale classifies data into distinct categories in which ranking is implied.

Examples (Ordinal Scale)


Categorical Variable ←→ Ordered Categories
Student class designation ←→ Freshman, Sophomore, Junior, Senior
Product satisfaction ←→ Very unsatisfied, Fairly unsatisfied, Neutral, Fairly satisfied, Very satisfied
Faculty rank ←→ Professor, Associate Professor, Assistant Professor, Instructor
Student grades ←→ A, B, C, D, F
Interval Scale
An interval scale is an ordered scale in which the difference between measurements is a
meaningful quantity but the measurements do not have a true zero point.

Examples (Interval Scale)


Numerical Variable Level of Measurement
Temperature (in degrees Celsius or Fahrenheit) −→ Interval
Standardized exam score (e.g., ACT or SAT) −→ Interval

Ratio Scale
A ratio scale is an ordered scale in which the difference between the measurements is a
meaningful quantity and the measurements have a true zero point.

Examples (Ratio Scale)


Numerical Variable Level of Measurement
Height (in inches or centimeters) −→ Ratio
Weight (in pounds or kilograms) −→ Ratio
Age (in years or days) −→ Ratio
Salary (in American dollars or Japanese yen) −→ Ratio
Recall that
1 Scale: Nominal, Ordinal, Interval, Ratio.
2 Type: Categorical, Numerical.

Quiz
Fill in the blank.
Data                 Scale, Type    Values
Cellular provider    ..., ...       AT&T, T-Mobile, Verizon, Other, None
Excel skills         ..., ...       novice, intermediate, expert
Temperature (°F)     ..., ...       −459.67 °F or higher
SAT Math score       ..., ...       a value between 200 and 800, inclusive
Item cost (in $)     ..., ...       $0.00 or higher

Bonus: For each of the following variables, determine whether the variable is categorical
or numerical and determine its measurement scale. If the variable is numerical, determine
whether the variable is discrete or continuous.
1 Number of cellphones in the household.
2 Monthly data usage (in MB).
3 Number of text messages exchanged per month.
4 Voice usage per month (in minutes).
5 Whether the cellphone is used for email.

2 Collecting Data

Population versus Sample

Data is collected from either a population or a sample.


Population
A population contains all of the items or individuals of interest that you seek to study.

Sample
A sample contains only a portion of a population of interest.

Examples (Population vs Sample)

1 Population: Advertisements for IT jobs in Vietnam.
  Sample: The top 50 search results for advertisements for IT jobs in Vietnam on May 1, 2022.
2 Population: Undergraduate students at FPT University.
  Sample: 200 undergraduate students majoring in AI.
3 Population: All countries of the world.
  Sample: Countries with published data available on birth rates and GDP since 2010.
Question 1. A survey will be given to 100 students randomly selected from the freshmen
class at LQD High School. What is the population?
1 The 100 selected students.
2 All freshmen at LQD High School.
3 All students at LQD High School.

Question 2. A survey will be given to 100 students randomly selected from the freshmen
class at LQD High School. What is the sample?
1 The 100 selected students.
2 All freshmen at LQD High School.
3 All students at LQD High School.

Question 3. Fifty bottles of water were randomly selected from a large collection of
bottles in a company’s warehouse. These fifty bottles are referred to as the . . . . . .

1 population. 2 sample.

Question 4. Fifty bottles of water were randomly selected from a large collection of
bottles in a company’s warehouse. The large collection of bottles is referred to as the . . .

1 population. 2 sample.
Note
1 In statistics, a population can be a group of individuals, objects, events, organizations,
countries, species, organisms, etc.
2 The size of the sample is always less than the total size of the population.

Collecting Data Via Sampling Is Useful


1 Less time consuming than selecting every item in the population.
2 Less costly than selecting every item in the population.
3 Less cumbersome and more practical than analyzing the entire population.

Parameter vs Statistic
1 A parameter summarizes the value of a specific variable for a population.
2 A statistic summarizes the value of a specific variable for sample data.
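
The distinction can be illustrated with a short Python sketch. The salary figures, the population size of 1,000, and the sample size of 50 are made-up assumptions for illustration only; the point is that the parameter uses every item in the population while the statistic uses only the sampled items.

```python
import random
from statistics import mean

random.seed(202)  # fixed seed so the sketch is repeatable

# Hypothetical population: salaries (in $) of all 1,000 workers at a made-up firm.
population = [random.randint(30_000, 60_000) for _ in range(1_000)]

# Parameter: computed from every item in the population.
population_mean = mean(population)

# Statistic: computed from a sample of 50 workers drawn without replacement.
sample = random.sample(population, 50)
sample_mean = mean(sample)

print(f"parameter (population mean): {population_mean:.2f}")
print(f"statistic (sample mean):     {sample_mean:.2f}")
```

The two printed values are usually close but not equal, which is why a statistic is treated as an estimate of the corresponding parameter.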

Question 1. A survey of 2000 American households found that 33% of the respondents
own a computer. Is this value a parameter or a statistic?

Question 2. The average salary of all automotive workers is $42,000. Is this value a
parameter or a statistic?
Data Sources
Data sources arise from the following activities:
Capturing data generated by ongoing business activities.
Distributing data compiled by an organization or individual.
Compiling the responses from a survey.
Conducting a designed experiment and recording the outcomes of the experiment.
Conducting an observational study and recording the results of the study.

Primary and Secondary Data Sources


1 Primary data source: The data collector is the one using the data for analysis.
Data from a political survey.
Data collected from an experiment.
Observed data.
2 Secondary data source: The person performing data analysis is not the data
collector.
Census data.
Data from print journals, books, or data published on the internet.

Examples of Data Collected From Ongoing Business Activities
A bank studies years of financial transactions to help identify patterns of fraud.
Economists utilize data on searches done via Google to help forecast future
economic conditions.
Marketing companies use tracking data to evaluate the effectiveness of a web site.

Examples Of Data Distributed By An Organization or Individual


Financial data on a company provided by investment services.
Industry or market data from market research firms and trade associations.
Stock prices, weather conditions, and sports statistics in daily newspapers.

Examples of Survey Data


A survey asking people which laundry detergent has the best stain-removing abilities.
Political polls of registered voters during political campaigns.
People being surveyed to determine their satisfaction with a recent product or
service experience.

Examples of Data From A Designed Experiment
Consumer testing of different versions of a product to help determine which product
should be pursued further.
Material testing to determine which supplier’s material should be used in a product.
Market testing on alternative product promotions to determine which promotion to
use more broadly.

Examples of Data Collected From Observational Studies


Market researchers utilizing focus groups to elicit unstructured responses to
open-ended questions.
Measuring the time it takes for customers to be served in a fast food establishment.
Measuring the volume of traffic through an intersection to determine if some form of
advertising at the intersection is justified.

Observational Studies & Designed Experiments
1 In both, researchers who collect data are looking for the effect of some change,
called a treatment, on a variable of interest.
2 In an observational study, the researcher collects data in a natural or neutral setting
and has no direct control of the treatment.
3 In a designed experiment, there is direct control over which items receive the
treatment.

Question (1). The American Community Survey (www.census.gov/acs) provides data every
year about communities in the United States. Addresses are randomly selected, and
respondents are required to supply answers to a series of questions.
1 Which of the sources of data best describes the American Community Survey?
2 Is the American Community Survey based on a sample or a population?

(1) Hint. Population: Return on all the IPOs in the US. Sample: Return on 250 IPOs in the US.
3 Types of Sampling Methods

A Sampling Process Begins with a Sampling Frame

1 The sampling frame is a listing of items that make up the population.
2 Frames are data sources such as population lists, directories, or maps.
3 Inaccurate or biased results can occur if a frame excludes certain groups or portions
of the population.
4 Using different frames to generate data can lead to dissimilar conclusions.

Types of Samples

Samples
    Non Probability Samples: Judgment, Convenience
    Probability Samples: Simple Random, Systematic, Stratified, Cluster

Non Probability Sample
In a non probability sample, items included are chosen without regard to their probability
of occurrence.
In a judgment sample, you get the opinions of pre-selected experts in the subject
matter.
In convenience sampling, items are selected based only on the fact that they are
easy, inexpensive, or convenient to sample.

Probability Sample
In a probability sample, items in the sample are chosen on the basis of known
probabilities.

Probability Samples: Simple Random, Stratified, Systematic, Cluster

Probability Sample: Simple Random Sample
Every individual or item from the frame has an equal chance of being selected.
Selection may be with replacement (selected individual is returned to frame for
possible reselection) or without replacement (selected individual is not returned to
the frame).
Samples are obtained from a table of random numbers or a computer random number
generator.
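
As a rough illustration of the two selection modes, the sketch below uses Python's standard random module on a hypothetical frame of 10 numbered items; the frame, the seed, and the sample size of 5 are assumptions chosen only for the example.

```python
import random

random.seed(1)
frame = list(range(1, 11))  # hypothetical frame of 10 numbered items

# Without replacement: a selected item cannot be selected again.
without_replacement = random.sample(frame, k=5)

# With replacement: each draw is made from the full frame, so repeats can occur.
with_replacement = random.choices(frame, k=5)

print("without replacement:", without_replacement)
print("with replacement:   ", with_replacement)
```

Here random.sample never returns the same item twice, while random.choices may.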

Example (Simple Random Sample)


1 The names of 20 students are drawn out of a hat from a university of 1000
students.
→ In this case, the population is all 1000 students, and the sample is simple random
because each student has an equal chance of being chosen.
2 Put 100 numbered balls into a bowl (this is the population). Select 10 balls from
the bowl without looking (this is the sample). Note that it is important not to look
as you could (unknowingly) bias the sample.

Steps to Conduct Simple Random Sampling
1 Make a list of all the members in the population (population size: N ).
2 Assign a sequential number to each member (i.e., 1, 2, . . . , N ).
→ This is the sampling frame (the list from which you draw your sample).
3 Figure out what the sample size is going to be (sample size: n).
4 Use a random number generator to select the sample.
→ For instance, if the sample size is 4 and the population is 12, generate 4 random
numbers between 1 and 12.
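
The four steps above might be sketched in Python as follows. The function name simple_random_sample, the member labels, and the seed are hypothetical; the sizes N = 12 and n = 4 follow the example in step 4.

```python
import random

def simple_random_sample(members, n, seed=None):
    """Return a list of (number, member) pairs for n members chosen at random."""
    # Steps 1-2: the sampling frame assigns the numbers 1..N to the members.
    frame = {number: member for number, member in enumerate(members, start=1)}
    # Steps 3-4: draw n distinct numbers from 1..N, each equally likely.
    rng = random.Random(seed)
    chosen = rng.sample(range(1, len(frame) + 1), n)
    return [(number, frame[number]) for number in sorted(chosen)]

members = [f"member_{i}" for i in range(1, 13)]  # hypothetical population, N = 12
sample = simple_random_sample(members, n=4, seed=2023)
print(sample)
```

This sketch selects without replacement, so no member can appear twice in the sample.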

Figure. Simple random sampling of a sample of size n = 4 from a population of size N = 12.
(Image: Dan Kernler | Wikimedia Commons)
Probability Sample: Systematic Sample
Decide on sample size: n.
Divide frame of N individuals into groups of k individuals, where k = N/n.
Randomly select one individual from the 1st group.
Select every kth individual thereafter.
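
A minimal Python sketch of this procedure is shown below. The function name, the frame of 40 numbered items, and the seed are assumptions for illustration; when N is not a multiple of n, the sketch simply uses the integer part of N/n as k, which is one possible convention and not something the slide specifies.

```python
import random

def systematic_sample(frame, n, seed=None):
    """Pick one random item from the first k items, then every k-th item after it."""
    k = len(frame) // n          # group size k = N / n (integer part)
    rng = random.Random(seed)
    start = rng.randrange(k)     # random 0-based position within the first group
    return [frame[start + i * k] for i in range(n)]

frame = list(range(1, 41))                      # hypothetical frame, N = 40
sample = systematic_sample(frame, n=8, seed=7)  # k = 40 / 8 = 5
print(sample)
```

Every selected item is exactly k = 5 positions after the previous one, so only the starting point is random.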

Figure. Systematic Sampling. (Source: Analytics Vidhya)
Probability Sample: Stratified Sample
Divide population into two or more subgroups (called strata) according to some
common characteristic.
A simple random sample is selected from each subgroup, with sample sizes
proportional to strata sizes.
Samples from subgroups are combined into one.
This method is typically used when a population has distinct differences, such as
demographics, level of education, or age, and can easily be broken into subgroups.
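
The stratified procedure with proportional sample sizes can be sketched in Python as below. The function name, the student data, and the strata (year of study) are hypothetical; rounding the per-stratum sizes is an assumption of this sketch and can make the total differ slightly from n in other data sets.

```python
import random
from collections import defaultdict

def stratified_sample(items, stratum_of, n, seed=None):
    """Draw a simple random sample from each stratum, sized in proportion to the stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[stratum_of(item)].append(item)
    sample = []
    for members in strata.values():
        # Proportional allocation; rounding can make the total differ slightly from n.
        size = round(n * len(members) / len(items))
        sample.extend(rng.sample(members, size))
    return sample

# Hypothetical population: 60 first-year, 30 second-year, and 10 third-year students.
students = (
    [("first", i) for i in range(60)]
    + [("second", i) for i in range(30)]
    + [("third", i) for i in range(10)]
)
sample = stratified_sample(students, stratum_of=lambda s: s[0], n=20, seed=3)
print(len(sample))
```

With these sizes the sample contains 12, 6, and 2 students from the three strata, matching the 60/30/10 split of the population.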

Figure. Stratified Sampling. (Source: DataScience Made Simple)
Probability Sample: Cluster Sample
Population is divided into several clusters, each representative of the population.
A simple random sample of clusters is selected.
All items in the selected clusters can be used, or items can be chosen from a cluster
using another probability sampling technique.
A common application of cluster sampling involves election exit polls, where certain
election districts are selected and sampled.
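
One way to sketch cluster sampling in Python is shown below, using the variant in which all items in the selected clusters are kept. The district names, the voters, and the function name are made up for illustration.

```python
import random

def cluster_sample(clusters, num_clusters, seed=None):
    """Select whole clusters at random and keep every item in the selected clusters."""
    rng = random.Random(seed)
    # Simple random sample of cluster names.
    chosen = rng.sample(sorted(clusters), num_clusters)
    items = [item for name in chosen for item in clusters[name]]
    return chosen, items

# Hypothetical election districts, each listing its voters.
districts = {
    "district_A": ["a1", "a2", "a3"],
    "district_B": ["b1", "b2"],
    "district_C": ["c1", "c2", "c3", "c4"],
    "district_D": ["d1", "d2", "d3"],
}
chosen, voters = cluster_sample(districts, num_clusters=2, seed=11)
print(chosen, voters)
```

Note that the random selection acts on clusters, not on individual items, which is what distinguishes this method from a stratified sample.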

Figure. Cluster Sampling. (Source: Netquest)
Probability Sample: Comparing Sampling Methods
1 Simple random sample and Systematic sample:
Simple to use.
May not be a good representation of the population’s underlying characteristics.
2 Stratified sample:
Ensures representation of individuals across the entire population.
3 Cluster sample:
More cost effective.
Less efficient (need larger sample to acquire the same level of precision).

Question. For a population containing N = 902 individuals, what code number would
you assign for
1 the first person on the list?
2 the fortieth person on the list?
3 the last person on the list?

4 Data Cleaning

Data Cleaning Is An Important Data Preprocessing Task Prior To Analysis

Data Cleaning Corrects Irregularities In The Data


1 Invalid variable values, including:
Non-numerical data for a numerical variable.
Invalid categorical values for a categorical variable.
Numeric values outside a defined range.
2 Coding errors, including:
Inconsistent categorical values.
Inconsistent case for categorical values.
Extraneous characters.
3 Data integration errors, including:
Redundant columns.
Duplicated rows.
Different column lengths.
Different units of measure or scale for numerical variables.

Examples (Coding Errors)
1 Copy-and-paste or data import can result in poor recording or entry of data.
2 Categorical variable: Gender, Correct coding: F or M.
Correctable error: Female.
Invalid data: New York.
Correctable or software tolerated: m.
Extraneous and nonprintable characters: leading or trailing space(s), such as _F or
F_ (with _ marking the space); other nonprintable characters may also lead or trail.
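
The Gender example above could be cleaned with a small Python function along these lines. The function name clean_gender and the mapping of correctable spellings are assumptions of this sketch, not a procedure from the textbook or from Excel, JMP, or Minitab.

```python
def clean_gender(raw):
    """Return 'F' or 'M' for a correctable entry, or None for invalid data."""
    # Keep only printable characters, then drop leading/trailing spaces.
    text = "".join(ch for ch in raw if ch.isprintable()).strip()
    # Hypothetical mapping of correctable spellings to the correct codes.
    corrections = {"F": "F", "M": "M", "FEMALE": "F", "MALE": "M"}
    return corrections.get(text.upper())

for entry in ["F", "Female", "m", " F", "F ", "M\t", "New York"]:
    print(repr(entry), "->", clean_gender(entry))
```

Entries that return None, such as "New York", cannot be corrected automatically and still need human review.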

Examples (Data Integration Errors)


Data integration errors come from combining two different computerized data sources.
1 Variable names or definitions may differ.
2 Duplicated rows (observations) may also occur.
3 Different units of measurement (or scale) may not be obvious without human
interpretation.
→ Fixing data integration errors often requires time-consuming manual effort.

Missing Values
1 Missing values are values that were not collected for a variable.
2 For example, survey data may include answers for which no response was given by
the survey taker. Such “no responses” are examples of missing values.

Algorithmic Cleaning of Extreme Numerical Values


1 For numerical variables without a defined range of possible values, you might find
outliers, values that seem excessively different from most of the other values.
2 Such values may or may not be errors, but all outliers require review.
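
The slide does not name a specific rule for finding extreme values. As one common convention (an assumption of this sketch, not the slide's method), the Python code below flags values lying more than 1.5 interquartile ranges outside the quartiles; the download times are made-up data.

```python
from statistics import quantiles

def flag_outliers(values, factor=1.5):
    """Return values lying more than factor * IQR outside the quartiles, for manual review."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - factor * iqr, q3 + factor * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical download times in seconds; 95.0 looks excessively different.
times = [10.2, 11.5, 9.8, 10.9, 12.1, 11.0, 10.4, 95.0]
print(flag_outliers(times))
```

The function only flags candidates. Consistent with the point above, a flagged value may or may not be an error, so it should be reviewed and not deleted automatically.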

Data Cleaning Cannot Be A Fully Automated Process
Excel, JMP, and Minitab have functionality to lessen the burden of data cleaning.
The software guides in the book explain this functionality.
When performing data cleaning, always preserve a copy of the original data for later
reference.

Cleaning Invalid Variable Values Can Be Semi-Automated


Invalid variable values can be identified by simple scanning techniques, for instance
Non-numeric entries for numerical variables.
Values for categorical variables that don’t match a pre-defined category.
Values for a numeric variable outside a pre-defined explicit range.

Features exist in Excel, JMP, or Minitab to assist in this task.
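
Outside of those packages, the three scanning checks listed above might look like the following Python sketch. The function name, the record layout (a gender code and an age as entered), and the range 0 to 120 are assumptions chosen for illustration.

```python
def find_invalid(records, allowed_categories, low, high):
    """Return (row number, reason) pairs for entries that need review."""
    problems = []
    for row, (category, raw_number) in enumerate(records, start=1):
        if category not in allowed_categories:
            problems.append((row, "category not in pre-defined list"))
        try:
            number = float(raw_number)
        except ValueError:
            problems.append((row, "non-numeric entry"))
            continue
        if not low <= number <= high:
            problems.append((row, "value outside pre-defined range"))
    return problems

# Hypothetical survey rows: (gender code, age as entered).
rows = [("F", "34"), ("M", "abc"), ("X", "29"), ("F", "210")]
print(find_invalid(rows, allowed_categories={"F", "M"}, low=0, high=120))
```

The scan only reports which rows look invalid and why; deciding how to correct them remains a manual step, which is why the process is described as semi-automated.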

5 Other Data Processing Tasks

Data Can Be Formatted and/or Encoded In More Than One Way
Some electronic formats are more readily usable than others.
Different encodings can impact the precision of numerical variables and can also
impact data compatibility.
As you identify and choose sources of data you need to consider or deal with these
issues.

Stacked versus Unstacked Data


For stacked data you create a single column for the variable of interest and create
additional columns for the potential grouping variables.
For unstacked data you create separate numerical variables for different groups (e.g.,
genders or locations).
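
To make the two layouts concrete, here is a small Python sketch that converts between them. The sales figures, the location grouping variable, and the function names stack and unstack are hypothetical.

```python
from collections import defaultdict

def unstack(stacked):
    """Turn (group, value) rows into one list of values per group."""
    columns = defaultdict(list)
    for group, value in stacked:
        columns[group].append(value)
    return dict(columns)

def stack(unstacked):
    """Turn one list per group back into (group, value) rows."""
    return [(group, value) for group, values in unstacked.items() for value in values]

# Hypothetical stacked data: a grouping column (location) and the variable of interest (sales).
stacked = [("North", 12), ("South", 7), ("North", 15), ("South", 9)]
unstacked = unstack(stacked)
print(unstacked)         # one numerical column per location
print(stack(unstacked))  # back to one row per observation
```

Both layouts hold the same values; which one is more convenient depends on what the analysis software expects.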

After Collection It Is Often Helpful To Recode Some Variables


Recoding a variable can either supplement or replace the original variable.
Recoding a categorical variable involves redefining categories.
Recoding a numerical variable involves changing this variable into a categorical
variable.
→ When recoding, be sure that the new categories are mutually exclusive (categories do
not overlap) and collectively exhaustive (categories cover all possible values).
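
As an illustration of recoding a numerical variable into a categorical one, the Python sketch below assigns ages to three groups. The cut points 18 and 65 and the category labels are assumptions for the example; the half-open ranges are what make the categories mutually exclusive and collectively exhaustive.

```python
def recode_age(age):
    """Recode a non-negative age in years into one of three categories."""
    if age < 0:
        raise ValueError("age must be non-negative")
    # Half-open ranges [0, 18), [18, 65), [65, ...) do not overlap and leave no gaps.
    if age < 18:
        return "under 18"
    if age < 65:
        return "18 to 64"
    return "65 or older"

ages = [17, 18, 64, 65, 30.5]  # hypothetical values
print([recode_age(a) for a in ages])
```

Boundary values such as 18 and 65 fall into exactly one category each, which is the usual place where overlapping or missing categories show up.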
6 Types of Survey Errors

Evaluating Survey Worthiness
What is the purpose of the survey?
Is the survey based on a probability sample?
Coverage error - appropriate frame?
Nonresponse error - follow up.
Measurement error - good questions elicit good responses.
Sampling error - always exists.

Types of Survey Errors
1 Coverage error or selection bias: Exists if some groups are excluded from the
frame and have no chance of being selected.
→ Excluded from frame.
2 Nonresponse error or bias: People who do not respond may be different from those
who do respond.
→ Follow up on nonresponses.
3 Sampling error: Variation from sample to sample will always exist.
→ Random differences from sample to sample.
4 Measurement error: Due to weaknesses in question design and/or respondent error.
→ Bad or leading question.
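
Sampling error (item 3 above) can be seen in a small simulation. In the Python sketch below, the population of 10,000 values, its mean and spread, and the sample size of 100 are all made-up assumptions; the only point is that samples drawn from the same population give different sample means.

```python
import random
from statistics import mean

rng = random.Random(40)
# Hypothetical population of 10,000 measurements.
population = [rng.gauss(50, 10) for _ in range(10_000)]

# Draw five samples of the same size from the same population.
sample_means = [mean(rng.sample(population, 100)) for _ in range(5)]

print(f"population mean: {mean(population):.2f}")
for i, m in enumerate(sample_means, start=1):
    print(f"sample {i} mean:  {m:.2f}")
```

The differences among the five sample means arise from random selection alone, not from any flaw in the frame, the questions, or the responses.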

Ethical Issues About Surveys
Coverage error and nonresponse error can be leveraged by survey designers to
purposely bias survey results.
Sampling error can be an ethical issue if the findings are purposely not reported with
the associated margin of error.
Measurement error can be an ethical issue:
Survey sponsor chooses leading questions.
Interviewer purposely leads respondents in a particular direction.
Respondent(s) willfully provide false information.

Chapter Summary

In this chapter we have discussed:


1 Understanding issues that arise when defining variables.
2 How to define variables.
3 Understanding the different measurement scales.
4 How to collect data.
5 Identifying different ways to collect a sample.
6 Understanding the issues involved in data preparation.
7 Understanding the types of survey errors.

Thank you!
