
Examining and Cleaning the Data
Dr. Javad Feizabadi
Lecture 2: this lecture draws on content from Hair, Black, Babin and Anderson (2014), "Multivariate Data Analysis", Pearson.

EXAMINING AND CLEANING THE DATA


Description of HBAT Primary Database Variables
Variable Description Variable Type
Data Warehouse Classification Variables
X1 Customer Type nonmetric
X2 Industry Type nonmetric
X3 Firm Size nonmetric
X4 Region nonmetric
X5 Distribution System nonmetric
Performance Perceptions Variables
X6 Product Quality metric
X7 E-Commerce Activities/Website metric
X8 Technical Support metric
X9 Complaint Resolution metric
X10 Advertising metric
X11 Product Line metric
X12 Salesforce Image metric
X13 Competitive Pricing metric
X14 Warranty & Claims metric
X15 New Products metric
X16 Ordering & Billing metric
X17 Price Flexibility metric
X18 Delivery Speed metric
Outcome/Relationship Measures
X19 Satisfaction metric
X20 Likelihood of Recommendation metric
X21 Likelihood of Future Purchase metric
X22 Current Purchase/Usage Level metric
X23 Consider Strategic Alliance/Partnership in Future nonmetric
Classification Variables

§ X1 Customer Type: length of time a particular customer has been buying from HBAT:
q 1 = less than 1 year
q 2 = between 1 and 5 years
q 3 = longer than 5 years
§ X2 Industry Type: type of industry that purchases HBAT's paper products:
q 0 = magazine industry
q 1 = newsprint industry
§ X3 Firm Size: employee size:
q 0 = small firm, fewer than 500 employees
q 1 = large firm, 500 or more employees
§ X4 Region: customer location:
q 0 = USA/North America
q 1 = outside North America
§ X5 Distribution System: how paper products are sold to customers:
q 0 = sold indirectly through a broker
q 1 = sold directly
Perception and Outcome Variables
§ Variables X6 to X18 were measured on a 10-point scale with endpoints labeled
"Poor" and "Excellent".
§ For example: X8 Technical Support: Extent to which technical support
is offered to help solve product/service issues.
§ Among the outcome variables:
q X19, X20 and X21 were measured on a 10-point scale
q X22: Percentage of Purchases from HBAT: measured on a 100-point percentage scale
q X23: Perception of future relationship with HBAT: Extent to which
customer/respondent perceives his/her firm would engage in strategic
alliance/partnership with HBAT
• 0 = would not consider
• 1 = yes, would consider strategic alliance or partnership
Examination Phases

• Graphical examination.

• Identify and evaluate missing values.

• Identify and deal with outliers.

• Check whether statistical assumptions are met.

• Develop a preliminary understanding of your data.


Graphical Examination

• Shape:
ü Histogram
ü Bar Chart
ü Box & Whisker plot
ü Stem and Leaf plot

• Relationships:
ü Scatterplot
ü Outliers
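The shape checks listed above can be sketched even without a plotting library. The following is a minimal, standard-library-only illustration of building a frequency distribution and a text histogram; the sample values are hypothetical, not taken from the HBAT data.

```python
# Quick text-based graphical examination of a metric variable,
# using only the standard library. The sample values below are
# illustrative, not taken from the HBAT database.
from collections import Counter

values = [6.5, 7.0, 7.0, 7.5, 8.0, 8.0, 8.0, 8.5, 9.0, 9.5]

# Bin each value to its integer part and count frequencies
bins = Counter(int(v) for v in values)

# Print a crude horizontal histogram, one row per bin
for b in sorted(bins):
    print(f"{b:>2} | {'#' * bins[b]}")
```

In practice the same examination would be done with histograms, box plots, and scatterplots from a plotting package; the point here is only that the underlying frequency distribution is a simple count per interval.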
Histograms and the Normal Curve

[Figure: histogram of the HBAT database variable X19 – Satisfaction, with a normal curve overlay. N = 100, Mean = 6.92, Std. Dev. = 1.19.]
Stem & Leaf Diagram – HBAT Variable X6

X6 - Product Quality Stem-and-Leaf Plot

Frequency    Stem &  Leaf
  3.00        5 .  012
 10.00        5 .  5567777899
 10.00        6 .  0112344444
 10.00        6 .  5567777999
  5.00        7 .  01144
 11.00        7 .  55666777899
  9.00        8 .  000122234
 14.00        8 .  55556667777778
 18.00        9 .  001111222333333444
  8.00        9 .  56699999
  2.00       10 .  00

Stem width: 1.0
Each leaf: 1 case(s)

This table shows the distribution of X6 with a stem-and-leaf diagram. Each stem is the number to the left of the point, and each digit to the right is a leaf; the length of a stem, indicated by its number of leaves, shows the frequency for that interval (the 8.5–8.9 stem, for example, has 14 leaves). The first category runs from 5.0 to 5.4, so the stem is 5. There are three observations with values in this range (5.0, 5.1 and 5.2), shown as the three leaves 0, 1 and 2; these are also the three lowest values for X6. In the next stem the stem value is again 5, with ten observations ranging from 5.5 to 5.9, corresponding to the leaves 5 through 9. At the other end of the figure the stem is 10, associated with two leaves (0 and 0), representing two values of 10.0, the two highest values for X6.
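The construction just described is easy to reproduce. Below is a minimal stem-and-leaf builder in plain Python, where the integer part of each value is the stem and the first decimal digit is the leaf; the sample values are hypothetical, and, unlike the SPSS output above, this sketch does not split each stem into two half-intervals.

```python
# A minimal stem-and-leaf builder: integer part = stem,
# first decimal digit = leaf. Sample values are illustrative.
from collections import defaultdict

def stem_and_leaf(values):
    stems = defaultdict(list)
    for v in sorted(values):
        # round(v * 10) gives the value in tenths; split into stem/leaf
        stem, leaf = divmod(round(v * 10), 10)
        stems[stem].append(str(leaf))
    return {s: "".join(leaves) for s, leaves in sorted(stems.items())}

plot = stem_and_leaf([5.0, 5.1, 5.2, 5.5, 5.7, 6.0, 6.4, 10.0])
for stem, leaves in plot.items():
    print(f"{stem:>3} | {leaves}")
```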
Frequency Distribution: Variable X6 – Product Quality

X6 - Product Quality

                       Frequency   Percent   Valid Percent   Cumulative Percent
Valid  5.0                 1         1.0          1.0               1.0
       5.1                 1         1.0          1.0               2.0
       5.2                 1         1.0          1.0               3.0
       5.5                 2         2.0          2.0               5.0
       5.6                 1         1.0          1.0               6.0
       5.7                 4         4.0          4.0              10.0
       5.8                 1         1.0          1.0              11.0
       5.9                 2         2.0          2.0              13.0
       6.0                 1         1.0          1.0              14.0
       6.1                 2         2.0          2.0              16.0
       6.2                 1         1.0          1.0              17.0
       6.3                 1         1.0          1.0              18.0
       6.4                 5         5.0          5.0              23.0
       6.5                 2         2.0          2.0              25.0
       6.6                 1         1.0          1.0              26.0
       6.7                 4         4.0          4.0              30.0
       6.9                 3         3.0          3.0              33.0
       7.0                 1         1.0          1.0              34.0
       7.1                 2         2.0          2.0              36.0
       7.4                 2         2.0          2.0              38.0
       7.5                 2         2.0          2.0              40.0
       7.6                 3         3.0          3.0              43.0
       7.7                 3         3.0          3.0              46.0
       7.8                 1         1.0          1.0              47.0
       7.9                 2         2.0          2.0              49.0
       8.0                 3         3.0          3.0              52.0
       8.1                 1         1.0          1.0              53.0
       8.2                 3         3.0          3.0              56.0
       8.3                 1         1.0          1.0              57.0
       8.4                 1         1.0          1.0              58.0
       8.5                 4         4.0          4.0              62.0
       8.6                 3         3.0          3.0              65.0
       8.7                 6         6.0          6.0              71.0
       8.8                 1         1.0          1.0              72.0
       9.0                 2         2.0          2.0              74.0
       9.1                 4         4.0          4.0              78.0
       9.2                 3         3.0          3.0              81.0
       9.3                 6         6.0          6.0              87.0
       9.4                 3         3.0          3.0              90.0
       9.5                 1         1.0          1.0              91.0
       9.6                 2         2.0          2.0              93.0
       9.9                 5         5.0          5.0              98.0
       10.0 (Excellent)    2         2.0          2.0             100.0
       Total             100       100.0        100.0
HBAT Diagnostics: Box & Whiskers Plots

[Figure: box & whisker plots of X6 – Product Quality by X1 – Customer Type (N = 32 for less than 1 year, 35 for 1 to 5 years, 33 for over 5 years). Observation #13 appears as an outlier, and group 2 (1 to 5 years) has substantially more dispersion than the other groups.]
HBAT Scatterplot: Variables X19 and X6

[Figure: scatterplot of X19 – Satisfaction against X6 – Product Quality.]
Missing Data
• Missing Data = information not available for a subject (or case) about whom other information is available. Missing data typically occur when a respondent fails to answer one or more questions in a survey.
ü Systematic?
ü Random?
• Researcher's Concern = to identify the patterns and relationships underlying the missing data in order to keep the distribution of values as close as possible to the original when any remedy is applied.
• Impact . . .
ü Reduces sample size available for analysis.
ü Can distort results.
Four-Step Process for
Identifying Missing Data

Step 1: Determine the Type of Missing Data


Step 2: Determine the Extent of Missing Data
Step 3: Diagnose the Randomness of the
Missing Data Processes
Step 4: Select the Imputation Method
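Step 2 above — determining the extent of missing data — is a straightforward tally by variable and by case. A standard-library-only sketch, using a tiny hypothetical dataset with `None` marking a missing value:

```python
# Step 2 sketch: quantify the extent of missing data by variable and
# by case. None marks a missing value; the dataset is hypothetical.
cases = [
    {"X6": 7.1, "X7": 3.2, "X19": 6.0},
    {"X6": None, "X7": 4.1, "X19": 7.2},
    {"X6": 8.3, "X7": None, "X19": None},
]
variables = ["X6", "X7", "X19"]

# Percentage of cases missing each variable
pct_missing_by_var = {
    v: 100 * sum(c[v] is None for c in cases) / len(cases) for v in variables
}

# Percentage of variables missing in each case
# (the 10% rule of thumb below is applied per case)
pct_missing_by_case = [
    100 * sum(c[v] is None for v in variables) / len(variables) for c in cases
]

print(pct_missing_by_var)
print(pct_missing_by_case)
```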
Missing Data

Strategies for handling missing data . . .
ü use observations with complete data only;
ü delete case(s) and/or variable(s);
ü estimate missing values.
Rules of Thumb
How Much Missing Data Is Too Much?
• Missing data under 10% for an individual
case or observation can generally be
ignored, except when the missing data
occurs in a specific nonrandom fashion (e.g.,
concentration in a specific set of questions,
attrition at the end of the questionnaire, etc.).
• The number of cases with no missing data
must be sufficient for the selected analysis
technique if replacement values will not be
substituted (imputed) for the missing data.
Rules of Thumb

Imputation of Missing Data

• Under 10% – Any of the imputation methods can be applied when missing data is this low, although the complete case method has been shown to be the least preferred.
• 10 to 20% – The increased presence of missing data makes the all-available, hot deck case substitution and regression methods most preferred for MCAR data, while model-based methods are necessary with MAR missing data processes.
• Over 20% – If it is necessary to impute missing data when the level is over 20%, the preferred methods are:
o the regression method for MCAR situations, and
o model-based methods when MAR missing data occurs.
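As a concrete (if deliberately simple) illustration of estimating missing values, mean substitution replaces each missing observation with the mean of the observed values. This is a standard-library-only sketch with hypothetical data; the rules of thumb above indicate when such a simple method is defensible at all.

```python
# Mean-substitution sketch: one of the simplest imputation methods.
# Standard library only; the data are hypothetical.
from statistics import mean

x6 = [7.1, None, 8.3, 6.5, None, 7.9]

# Mean of the complete observations only
observed = [v for v in x6 if v is not None]
x6_mean = mean(observed)

# Replace each missing value with the observed mean
x6_imputed = [v if v is not None else x6_mean for v in x6]
print(x6_imputed)
```

Note that mean substitution preserves the mean but shrinks the variance of the imputed variable, which is one reason the regression and model-based methods are preferred as the missing-data level rises.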
Outlier

Outlier = an observation/response with a unique combination of characteristics identifiable as distinctly different from the other observations/responses.

Issue: "Is the observation/response representative of the population?"
Why Do Outliers Occur?

• Procedural Error.
• Extraordinary Event.
• Extraordinary Observations.
• Observations unique in their
combination of values.
Dealing with Outliers

• Identify outliers.
• Describe outliers.
• Delete or Retain?
Identifying Outliers

• Standardize data and then identify outliers in terms of number of standard deviations.
• Examine data using Box Plots, Stem & Leaf, and Scatterplots.
• Multivariate detection (Mahalanobis D2).
Rules of Thumb
Outlier Detection
• Univariate methods – examine all metric variables to identify unique or extreme observations.
• For small samples (80 or fewer observations), outliers typically are defined as cases with standard scores of 2.5 or greater.
• For larger sample sizes, increase the threshold value of standard scores up to 4.
• If standard scores are not used, identify cases falling outside the ranges of 2.5 versus 4 standard deviations, depending on the sample size.
• Bivariate methods – focus their use on specific variable relationships, such as the independent versus dependent variables:
o use scatterplots with confidence intervals at a specified alpha level.
• Multivariate methods – best suited for examining a complete variate, such as the independent variables in regression or the variables in factor analysis:
o threshold levels for the D2/df measure should be very conservative (.005 or .001), resulting in values of 2.5 (small samples) versus 3 or 4 in larger samples.
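The univariate screen above — standardize, then flag cases beyond a threshold number of standard deviations — can be sketched in a few lines of standard-library Python. The data below are hypothetical; the threshold of 2.5 follows the small-sample rule of thumb (raise it toward 4 for larger samples).

```python
# Univariate outlier screen: standardize a variable and flag cases
# whose |z| exceeds a threshold (2.5 for small samples, up to 4 for
# large ones). Standard library only; data are hypothetical.
from statistics import mean, stdev

def flag_outliers(values, threshold=2.5):
    m, s = mean(values), stdev(values)
    # Return the indices of cases whose standard score exceeds the threshold
    return [i for i, v in enumerate(values) if abs((v - m) / s) > threshold]

data = [7.0, 7.2, 6.9, 7.1, 7.0, 6.8, 7.3, 7.1, 6.9, 7.0,
        7.2, 6.8, 7.1, 7.0, 6.9, 7.3, 7.2, 7.0, 6.9, 15.0]  # last value extreme
print(flag_outliers(data))
```

One caveat worth noting: with very small samples a single extreme case inflates the standard deviation enough to mask itself (the maximum possible |z| with a sample standard deviation is (n-1)/sqrt(n)), which is one reason the thresholds are sample-size dependent.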
Multivariate Assumptions

q Normality
q Linearity
q Homoscedasticity
q Non-correlated Errors

ü Data Transformations?
Testing Assumptions

• Normality assessment:
§ Visual check of histogram.
§ Kurtosis.
§ Normal probability plot.

• Homoscedasticity
§ Equal variances across independent
variables.
§ Levene test (univariate).
§ Box’s M (multivariate).
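Of the normality checks listed, kurtosis is easy to compute directly from the moment definition. A standard-library-only sketch with hypothetical data (statistical packages report essentially this quantity, sometimes with small-sample corrections not applied here):

```python
# Normality screening sketch: sample excess kurtosis from the moment
# definition. Values near 0 suggest normal-like tails; negative values
# indicate a flatter-than-normal distribution. Data are hypothetical.
from statistics import mean, pstdev

def excess_kurtosis(values):
    m, s = mean(values), pstdev(values)
    n = len(values)
    # Fourth standardized moment minus 3 (the normal benchmark)
    return sum(((v - m) / s) ** 4 for v in values) / n - 3

uniformish = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(excess_kurtosis(uniformish))  # negative: flatter than normal
```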
Rules of Thumb
Testing Statistical Assumptions
• Departures from normality can have serious effects in small samples (fewer than 50 cases), but the impact effectively diminishes when sample sizes reach 200 cases or more.
• Most cases of heteroscedasticity are a result of non-normality in one
or more variables. Thus, remedying normality may not be needed due
to sample size, but may be needed to equalize the variance.
• Nonlinear relationships can be very well defined, but seriously
understated unless the data is transformed to a linear pattern or
explicit model components are used to represent the nonlinear portion
of the relationship.
• Correlated errors arise from a process that must be treated much like
missing data. That is, the researcher must first define the “causes”
among variables either internal or external to the dataset. If they are
not found and remedied, serious biases can occur in the results, many
times unknown to the researcher.
Data Transformations ?

Data transformations . . . provide a means of modifying variables for one of two reasons:

1. To correct violations of the statistical assumptions underlying the multivariate techniques, or
2. To improve the relationship (correlation) between the variables.
Rules of Thumb
Transforming Data
• To judge the potential impact of a transformation, calculate the ratio of
the variable’s mean to its standard deviation:
o Noticeable effects should occur when the ratio is less than 4.
o When the transformation can be performed on either of two variables,
select the variable with the smallest ratio .
• Transformations should be applied to the independent variables except
in the case of heteroscedasticity.
• Heteroscedasticity can be remedied only by the transformation of the
dependent variable in a dependence relationship. If a heteroscedastic
relationship is also nonlinear, the dependent variable, and perhaps the
independent variables, must be transformed.
• Transformations may change the interpretation of the variables. For
example, transforming variables by taking their logarithm translates the
relationship into a measure of proportional change (elasticity). Always
be sure to explore thoroughly the possible interpretations of the
transformed variables.
• Use variables in their original (untransformed) format when profiling or
interpreting results.
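The first rule of thumb above is directly computable: compare the mean-to-standard-deviation ratio before and after a candidate transformation. A standard-library-only sketch with a hypothetical right-skewed sample:

```python
# Sketch of the mean/standard-deviation ratio rule of thumb, before
# and after a log transformation. Standard library only; the skewed
# sample data are hypothetical.
import math
from statistics import mean, stdev

def ratio(values):
    return mean(values) / stdev(values)

skewed = [1.0, 1.5, 2.0, 2.5, 3.0, 20.0]   # one large value drives the skew
logged = [math.log(v) for v in skewed]     # log transform compresses it

print(ratio(skewed))   # well below 4: transformation likely to have an effect
print(ratio(logged))   # ratio improves after the transformation
```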
Dummy Variable

Dummy variable . . . a nonmetric independent variable that has two (or more) distinct levels that are coded 0 and 1. These variables act as replacement variables to enable nonmetric variables to be used as metric variables.

A dummy variable is a dichotomous variable that represents one category of a nonmetric independent variable. Any nonmetric variable with k categories can be represented as k - 1 dummy variables.
Dummy Variable
Coding
Category X1 X2

Physician 1 0
Attorney 0 1
Professor 0 0
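The coding table above can be generated mechanically: with k = 3 categories, k - 1 = 2 dummies suffice, and the omitted level (Professor here) becomes the reference category coded (0, 0). A standard-library sketch:

```python
# k-1 dummy coding for the table above: three categories yield two
# dummy variables, with the last level as the reference category.
def dummy_code(category, levels):
    """Return k-1 dummies; the final level is the (0, ..., 0) reference."""
    return tuple(1 if category == lvl else 0 for lvl in levels[:-1])

levels = ["Physician", "Attorney", "Professor"]
print(dummy_code("Physician", levels))  # (1, 0)
print(dummy_code("Attorney", levels))   # (0, 1)
print(dummy_code("Professor", levels))  # (0, 0) -- reference category
```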
Simple Approaches to Understanding Data

o Tabulation = a listing of how respondents answered all possible answers to each question. This typically is shown in a frequency table.
o Cross Tabulation = a listing of how respondents answered two or more questions. This typically is shown in a two-way frequency table to enable comparisons between groups.
o Chi-Square = a statistic that tests for significant differences between the frequency distributions for two (or more) categorical (nonmetric) variables in a cross-tabulation table. Note: chi-square results will be distorted if more than 20 percent of the cells have an expected count of less than 5, or if any cell has an expected count of less than 1.
o ANOVA = a statistic that tests for significant differences between two or more group means.
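The chi-square computation from a cross-tabulation follows directly from the definitions above: expected counts come from the row and column totals, and the statistic sums the squared deviations over cells. A standard-library-only sketch with a hypothetical 2x2 observed table (in this example all expected counts are well above 5, so the distortion caveat above does not apply):

```python
# Chi-square from a cross-tabulation table, computed by hand from
# row/column totals. Standard library only; the 2x2 table is hypothetical.
observed = [[30, 20],   # rows: groups; columns: answer categories
            [10, 40]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Sum over cells of (observed - expected)^2 / expected,
# where expected[i][j] = row_total[i] * col_total[j] / n
chi_square = sum(
    (observed[i][j] - row_totals[i] * col_totals[j] / n) ** 2
    / (row_totals[i] * col_totals[j] / n)
    for i in range(len(observed))
    for j in range(len(observed[0]))
)
print(chi_square)
```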
Examining Data
Learning Checkpoint

1. Why examine your data?
2. What are the principal aspects of data that need to be examined?
3. What approaches would you use?
