
Examining and Cleaning the Data
Dr. Javad Feizabadi
Lecture 2: this lecture draws on content from Hair, Black, Babin and Anderson (2014), "Multivariate Data Analysis", Pearson.

EXAMINING AND CLEANING THE DATA


Description of HBAT Primary Database Variables
Variable Description Variable Type
Data Warehouse Classification Variables
X1 Customer Type nonmetric
X2 Industry Type nonmetric
X3 Firm Size nonmetric
X4 Region nonmetric
X5 Distribution System nonmetric
Performance Perceptions Variables
X6 Product Quality metric
X7 E-Commerce Activities/Website metric
X8 Technical Support metric
X9 Complaint Resolution metric
X10 Advertising metric
X11 Product Line metric
X12 Salesforce Image metric
X13 Competitive Pricing metric
X14 Warranty & Claims metric
X15 New Products metric
X16 Ordering & Billing metric
X17 Price Flexibility metric
X18 Delivery Speed metric
Outcome/Relationship Measures
X19 Satisfaction metric
X20 Likelihood of Recommendation metric
X21 Likelihood of Future Purchase metric
X22 Current Purchase/Usage Level metric
X23 Consider Strategic Alliance/Partnership in Future nonmetric
Classification Variables

§ X1 Customer Type: length of time a particular customer has been buying from HBAT:
q 1 = less than 1 year
q 2 = between 1 and 5 years
q 3 = longer than 5 years
§ X2 Industry Type: type of industry that purchases HBAT's paper products:
q 0 = magazine industry
q 1 = newsprint industry
§ X3 Firm Size: employee size:
q 0 = small firm, fewer than 500 employees
q 1 = large firm, 500 or more employees
§ X4 Region: customer location:
q 0 = USA/North America
q 1 = outside North America
§ X5 Distribution System: how paper products are sold to customers:
q 0 = sold indirectly through a broker
q 1 = sold directly
Perception and Outcome Variables
§ Variables X6 to X18 were measured on a 10-point scale with endpoints labeled
"Poor" and "Excellent".
§ For example: X8 Technical Support: Extent to which technical support
is offered to help solve product/service issues.
§ Among the outcome variables:
q X19, X20 and X21 were measured on a 10-point scale
q X22: Percentage of Purchases from HBAT: measured on a 100-point percentage scale
q X23: Perception of future relationship with HBAT: Extent to which
customer/respondent perceives his/her firm would engage in strategic
alliance/partnership with HBAT
• 0 = would not consider
• 1 = yes, would consider strategic alliance or partnership
Examination Phases

• Graphical examination.

• Identify and evaluate missing values.

• Identify and deal with outliers.

• Check whether statistical assumptions are met.

• Develop a preliminary understanding of your data.


Graphical Examination

• Shape:
ü Histogram
ü Bar Chart
ü Box & Whisker plot
ü Stem and Leaf plot

• Relationships:
ü Scatterplot
ü Outliers
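The shape checks listed above can be sketched even without a plotting library. The following is a minimal, standard-library-only illustration of building a frequency distribution and a text histogram; the sample values are hypothetical, not taken from the HBAT data.

```python
# Quick text-based graphical examination of a metric variable,
# using only the standard library. The sample values below are
# illustrative, not taken from the HBAT database.
from collections import Counter

values = [6.5, 7.0, 7.0, 7.5, 8.0, 8.0, 8.0, 8.5, 9.0, 9.5]

# Bin each value to its integer part and count frequencies
bins = Counter(int(v) for v in values)

# Print a crude horizontal histogram, one row per bin
for b in sorted(bins):
    print(f"{b:>2} | {'#' * bins[b]}")
```

In practice the same examination would be done with histograms, box plots, and scatterplots from a plotting package; the point here is only that the underlying frequency distribution is a simple count per interval.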
Histograms and the Normal Curve

[Figure: histogram of the HBAT database variable X19 – Satisfaction, with a normal curve overlay. N = 100, Mean = 6.92, Std. Dev. = 1.19.]
Stem & Leaf Diagram – HBAT Variable X6

X6 - Product Quality Stem-and-Leaf Plot

Frequency    Stem &  Leaf
  3.00        5 .  012
 10.00        5 .  5567777899
 10.00        6 .  0112344444
 10.00        6 .  5567777999
  5.00        7 .  01144
 11.00        7 .  55666777899
  9.00        8 .  000122234
 14.00        8 .  55556667777778
 18.00        9 .  001111222333333444
  8.00        9 .  56699999
  2.00       10 .  00

Stem width: 1.0
Each leaf: 1 case(s)

This table shows the distribution of X6 with a stem-and-leaf diagram. Each stem is the number to the left of the point, and each digit to the right is a leaf; the length of a stem, indicated by its number of leaves, shows the frequency for that interval (the 8.5–8.9 stem, for example, has 14 leaves). The first category runs from 5.0 to 5.4, so the stem is 5. There are three observations with values in this range (5.0, 5.1 and 5.2), shown as the three leaves 0, 1 and 2; these are also the three lowest values for X6. In the next stem the stem value is again 5, with ten observations ranging from 5.5 to 5.9, corresponding to the leaves 5 through 9. At the other end of the figure the stem is 10, associated with two leaves (0 and 0), representing two values of 10.0, the two highest values for X6.
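The construction just described is easy to reproduce. Below is a minimal stem-and-leaf builder in plain Python, where the integer part of each value is the stem and the first decimal digit is the leaf; the sample values are hypothetical, and, unlike the SPSS output above, this sketch does not split each stem into two half-intervals.

```python
# A minimal stem-and-leaf builder: integer part = stem,
# first decimal digit = leaf. Sample values are illustrative.
from collections import defaultdict

def stem_and_leaf(values):
    stems = defaultdict(list)
    for v in sorted(values):
        # round(v * 10) gives the value in tenths; split into stem/leaf
        stem, leaf = divmod(round(v * 10), 10)
        stems[stem].append(str(leaf))
    return {s: "".join(leaves) for s, leaves in sorted(stems.items())}

plot = stem_and_leaf([5.0, 5.1, 5.2, 5.5, 5.7, 6.0, 6.4, 10.0])
for stem, leaves in plot.items():
    print(f"{stem:>3} | {leaves}")
```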
Frequency Distribution: Variable X6 – Product Quality

X6 - Product Quality

                       Frequency   Percent   Valid Percent   Cumulative Percent
Valid  5.0                 1         1.0          1.0               1.0
       5.1                 1         1.0          1.0               2.0
       5.2                 1         1.0          1.0               3.0
       5.5                 2         2.0          2.0               5.0
       5.6                 1         1.0          1.0               6.0
       5.7                 4         4.0          4.0              10.0
       5.8                 1         1.0          1.0              11.0
       5.9                 2         2.0          2.0              13.0
       6.0                 1         1.0          1.0              14.0
       6.1                 2         2.0          2.0              16.0
       6.2                 1         1.0          1.0              17.0
       6.3                 1         1.0          1.0              18.0
       6.4                 5         5.0          5.0              23.0
       6.5                 2         2.0          2.0              25.0
       6.6                 1         1.0          1.0              26.0
       6.7                 4         4.0          4.0              30.0
       6.9                 3         3.0          3.0              33.0
       7.0                 1         1.0          1.0              34.0
       7.1                 2         2.0          2.0              36.0
       7.4                 2         2.0          2.0              38.0
       7.5                 2         2.0          2.0              40.0
       7.6                 3         3.0          3.0              43.0
       7.7                 3         3.0          3.0              46.0
       7.8                 1         1.0          1.0              47.0
       7.9                 2         2.0          2.0              49.0
       8.0                 3         3.0          3.0              52.0
       8.1                 1         1.0          1.0              53.0
       8.2                 3         3.0          3.0              56.0
       8.3                 1         1.0          1.0              57.0
       8.4                 1         1.0          1.0              58.0
       8.5                 4         4.0          4.0              62.0
       8.6                 3         3.0          3.0              65.0
       8.7                 6         6.0          6.0              71.0
       8.8                 1         1.0          1.0              72.0
       9.0                 2         2.0          2.0              74.0
       9.1                 4         4.0          4.0              78.0
       9.2                 3         3.0          3.0              81.0
       9.3                 6         6.0          6.0              87.0
       9.4                 3         3.0          3.0              90.0
       9.5                 1         1.0          1.0              91.0
       9.6                 2         2.0          2.0              93.0
       9.9                 5         5.0          5.0              98.0
       10.0 (Excellent)    2         2.0          2.0             100.0
       Total             100       100.0        100.0
HBAT Diagnostics: Box & Whiskers Plots

[Figure: box & whisker plots of X6 – Product Quality by X1 – Customer Type (N = 32 for less than 1 year, 35 for 1 to 5 years, 33 for over 5 years). Observation #13 appears as an outlier, and group 2 (1 to 5 years) has substantially more dispersion than the other groups.]
HBAT Scatterplot: Variables X19 and X6

[Figure: scatterplot of X19 – Satisfaction against X6 – Product Quality.]
Missing Data
• Missing Data = information not available for a subject (or case) about whom other information is available. Missing data typically occur when a respondent fails to answer one or more questions in a survey.
ü Systematic?
ü Random?
• Researcher's Concern = to identify the patterns and relationships underlying the missing data in order to keep the distribution of values as close as possible to the original when any remedy is applied.
• Impact . . .
ü Reduces sample size available for analysis.
ü Can distort results.
Four-Step Process for
Identifying Missing Data

Step 1: Determine the Type of Missing Data


Step 2: Determine the Extent of Missing Data
Step 3: Diagnose the Randomness of the
Missing Data Processes
Step 4: Select the Imputation Method
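Step 2 above — determining the extent of missing data — is a straightforward tally by variable and by case. A standard-library-only sketch, using a tiny hypothetical dataset with `None` marking a missing value:

```python
# Step 2 sketch: quantify the extent of missing data by variable and
# by case. None marks a missing value; the dataset is hypothetical.
cases = [
    {"X6": 7.1, "X7": 3.2, "X19": 6.0},
    {"X6": None, "X7": 4.1, "X19": 7.2},
    {"X6": 8.3, "X7": None, "X19": None},
]
variables = ["X6", "X7", "X19"]

# Percentage of cases missing each variable
pct_missing_by_var = {
    v: 100 * sum(c[v] is None for c in cases) / len(cases) for v in variables
}

# Percentage of variables missing in each case
# (the 10% rule of thumb below is applied per case)
pct_missing_by_case = [
    100 * sum(c[v] is None for v in variables) / len(variables) for c in cases
]

print(pct_missing_by_var)
print(pct_missing_by_case)
```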
Missing Data

Strategies for handling missing data . . .
ü use observations with complete data only;
ü delete case(s) and/or variable(s);
ü estimate missing values.
Rules of Thumb
How Much Missing Data Is Too Much?
• Missing data under 10% for an individual
case or observation can generally be
ignored, except when the missing data
occurs in a specific nonrandom fashion (e.g.,
concentration in a specific set of questions,
attrition at the end of the questionnaire, etc.).
• The number of cases with no missing data
must be sufficient for the selected analysis
technique if replacement values will not be
substituted (imputed) for the missing data.
Rules of Thumb

Imputation of Missing Data

• Under 10% – Any of the imputation methods can be applied when missing data is this low, although the complete case method has been shown to be the least preferred.
• 10 to 20% – The increased presence of missing data makes the all-available, hot deck case substitution and regression methods most preferred for MCAR data, while model-based methods are necessary with MAR missing data processes.
• Over 20% – If it is necessary to impute missing data when the level is over 20%, the preferred methods are:
o the regression method for MCAR situations, and
o model-based methods when MAR missing data occurs.
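As a concrete (if deliberately simple) illustration of estimating missing values, mean substitution replaces each missing observation with the mean of the observed values. This is a standard-library-only sketch with hypothetical data; the rules of thumb above indicate when such a simple method is defensible at all.

```python
# Mean-substitution sketch: one of the simplest imputation methods.
# Standard library only; the data are hypothetical.
from statistics import mean

x6 = [7.1, None, 8.3, 6.5, None, 7.9]

# Mean of the complete observations only
observed = [v for v in x6 if v is not None]
x6_mean = mean(observed)

# Replace each missing value with the observed mean
x6_imputed = [v if v is not None else x6_mean for v in x6]
print(x6_imputed)
```

Note that mean substitution preserves the mean but shrinks the variance of the imputed variable, which is one reason the regression and model-based methods are preferred as the missing-data level rises.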
Outlier

Outlier = an observation/response with a unique combination of characteristics identifiable as distinctly different from the other observations/responses.

Issue: "Is the observation/response representative of the population?"
Why Do Outliers Occur?

• Procedural Error.
• Extraordinary Event.
• Extraordinary Observations.
• Observations unique in their
combination of values.
Dealing with Outliers

• Identify outliers.
• Describe outliers.
• Delete or Retain?
Identifying Outliers

• Standardize data and then identify outliers in terms of number of standard deviations.
• Examine data using Box Plots, Stem & Leaf, and Scatterplots.
• Multivariate detection (Mahalanobis D2).
Rules of Thumb
Outlier Detection
• Univariate methods – examine all metric variables to identify unique or extreme observations.
• For small samples (80 or fewer observations), outliers typically are defined as cases with standard scores of 2.5 or greater.
• For larger sample sizes, increase the threshold value of standard scores up to 4.
• If standard scores are not used, identify cases falling outside the ranges of 2.5 versus 4 standard deviations, depending on the sample size.
• Bivariate methods – focus their use on specific variable relationships, such as the independent versus dependent variables:
o use scatterplots with confidence intervals at a specified alpha level.
• Multivariate methods – best suited for examining a complete variate, such as the independent variables in regression or the variables in factor analysis:
o threshold levels for the D2/df measure should be very conservative (.005 or .001), resulting in values of 2.5 (small samples) versus 3 or 4 in larger samples.
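The univariate screen above — standardize, then flag cases beyond a threshold number of standard deviations — can be sketched in a few lines of standard-library Python. The data below are hypothetical; the threshold of 2.5 follows the small-sample rule of thumb (raise it toward 4 for larger samples).

```python
# Univariate outlier screen: standardize a variable and flag cases
# whose |z| exceeds a threshold (2.5 for small samples, up to 4 for
# large ones). Standard library only; data are hypothetical.
from statistics import mean, stdev

def flag_outliers(values, threshold=2.5):
    m, s = mean(values), stdev(values)
    # Return the indices of cases whose standard score exceeds the threshold
    return [i for i, v in enumerate(values) if abs((v - m) / s) > threshold]

data = [7.0, 7.2, 6.9, 7.1, 7.0, 6.8, 7.3, 7.1, 6.9, 7.0,
        7.2, 6.8, 7.1, 7.0, 6.9, 7.3, 7.2, 7.0, 6.9, 15.0]  # last value extreme
print(flag_outliers(data))
```

One caveat worth noting: with very small samples a single extreme case inflates the standard deviation enough to mask itself (the maximum possible |z| with a sample standard deviation is (n-1)/sqrt(n)), which is one reason the thresholds are sample-size dependent.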
Multivariate Assumptions

q Normality
q Linearity
q Homoscedasticity
q Non-correlated Errors

ü Data Transformations?
Testing Assumptions

• Normality assessment:
§ Visual check of histogram.
§ Kurtosis.
§ Normal probability plot.

• Homoscedasticity
§ Equal variances across independent
variables.
§ Levene test (univariate).
§ Box’s M (multivariate).
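Of the normality checks listed, kurtosis is easy to compute directly from the moment definition. A standard-library-only sketch with hypothetical data (statistical packages report essentially this quantity, sometimes with small-sample corrections not applied here):

```python
# Normality screening sketch: sample excess kurtosis from the moment
# definition. Values near 0 suggest normal-like tails; negative values
# indicate a flatter-than-normal distribution. Data are hypothetical.
from statistics import mean, pstdev

def excess_kurtosis(values):
    m, s = mean(values), pstdev(values)
    n = len(values)
    # Fourth standardized moment minus 3 (the normal benchmark)
    return sum(((v - m) / s) ** 4 for v in values) / n - 3

uniformish = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(excess_kurtosis(uniformish))  # negative: flatter than normal
```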
Rules of Thumb
Testing Statistical Assumptions
• Departures from normality can have serious effects in small samples (fewer than 50 cases), but the impact effectively diminishes when sample sizes reach 200 cases or more.
• Most cases of heteroscedasticity are a result of non-normality in one
or more variables. Thus, remedying normality may not be needed due
to sample size, but may be needed to equalize the variance.
• Nonlinear relationships can be very well defined, but seriously
understated unless the data is transformed to a linear pattern or
explicit model components are used to represent the nonlinear portion
of the relationship.
• Correlated errors arise from a process that must be treated much like
missing data. That is, the researcher must first define the “causes”
among variables either internal or external to the dataset. If they are
not found and remedied, serious biases can occur in the results, many
times unknown to the researcher.
Data Transformations ?

Data transformations . . . provide a means of modifying variables for one of two reasons:

1. To correct violations of the statistical assumptions underlying the multivariate techniques, or
2. To improve the relationship (correlation) between the variables.
Rules of Thumb
Transforming Data
• To judge the potential impact of a transformation, calculate the ratio of
the variable’s mean to its standard deviation:
o Noticeable effects should occur when the ratio is less than 4.
o When the transformation can be performed on either of two variables,
select the variable with the smallest ratio .
• Transformations should be applied to the independent variables except
in the case of heteroscedasticity.
• Heteroscedasticity can be remedied only by the transformation of the
dependent variable in a dependence relationship. If a heteroscedastic
relationship is also nonlinear, the dependent variable, and perhaps the
independent variables, must be transformed.
• Transformations may change the interpretation of the variables. For
example, transforming variables by taking their logarithm translates the
relationship into a measure of proportional change (elasticity). Always
be sure to explore thoroughly the possible interpretations of the
transformed variables.
• Use variables in their original (untransformed) format when profiling or
interpreting results.
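The first rule of thumb above is directly computable: compare the mean-to-standard-deviation ratio before and after a candidate transformation. A standard-library-only sketch with a hypothetical right-skewed sample:

```python
# Sketch of the mean/standard-deviation ratio rule of thumb, before
# and after a log transformation. Standard library only; the skewed
# sample data are hypothetical.
import math
from statistics import mean, stdev

def ratio(values):
    return mean(values) / stdev(values)

skewed = [1.0, 1.5, 2.0, 2.5, 3.0, 20.0]   # one large value drives the skew
logged = [math.log(v) for v in skewed]     # log transform compresses it

print(ratio(skewed))   # well below 4: transformation likely to have an effect
print(ratio(logged))   # ratio improves after the transformation
```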
Dummy Variable

Dummy variable . . . a nonmetric independent variable that has two (or more) distinct levels that are coded 0 and 1. These variables act as replacement variables to enable nonmetric variables to be used as metric variables.

A dummy variable is a dichotomous variable that represents one category of a nonmetric independent variable. Any nonmetric variable with k categories can be represented as k - 1 dummy variables.
Dummy Variable
Coding
Category X1 X2

Physician 1 0
Attorney 0 1
Professor 0 0
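The coding table above can be generated mechanically: with k = 3 categories, k - 1 = 2 dummies suffice, and the omitted level (Professor here) becomes the reference category coded (0, 0). A standard-library sketch:

```python
# k-1 dummy coding for the table above: three categories yield two
# dummy variables, with the last level as the reference category.
def dummy_code(category, levels):
    """Return k-1 dummies; the final level is the (0, ..., 0) reference."""
    return tuple(1 if category == lvl else 0 for lvl in levels[:-1])

levels = ["Physician", "Attorney", "Professor"]
print(dummy_code("Physician", levels))  # (1, 0)
print(dummy_code("Attorney", levels))   # (0, 1)
print(dummy_code("Professor", levels))  # (0, 0) -- reference category
```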
Simple Approaches to Understanding Data

o Tabulation = a listing of how respondents answered all possible answers to each question. This typically is shown in a frequency table.
o Cross Tabulation = a listing of how respondents answered two or more questions. This typically is shown in a two-way frequency table to enable comparisons between groups.
o Chi-Square = a statistic that tests for significant differences between the frequency distributions for two (or more) categorical (nonmetric) variables in a cross-tabulation table. Note: chi-square results will be distorted if more than 20 percent of the cells have an expected count of less than 5, or if any cell has an expected count of less than 1.
o ANOVA = a statistic that tests for significant differences between two or more group means.
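The chi-square computation from a cross-tabulation follows directly from the definitions above: expected counts come from the row and column totals, and the statistic sums the squared deviations over cells. A standard-library-only sketch with a hypothetical 2x2 observed table (in this example all expected counts are well above 5, so the distortion caveat above does not apply):

```python
# Chi-square from a cross-tabulation table, computed by hand from
# row/column totals. Standard library only; the 2x2 table is hypothetical.
observed = [[30, 20],   # rows: groups; columns: answer categories
            [10, 40]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Sum over cells of (observed - expected)^2 / expected,
# where expected[i][j] = row_total[i] * col_total[j] / n
chi_square = sum(
    (observed[i][j] - row_totals[i] * col_totals[j] / n) ** 2
    / (row_totals[i] * col_totals[j] / n)
    for i in range(len(observed))
    for j in range(len(observed[0]))
)
print(chi_square)
```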
Examining Data
Learning Checkpoint

1. Why examine your data?
2. What are the principal aspects of data that need to be examined?
3. What approaches would you use?
