2 - Data Examination 20
data
Dr. Javad Feizabadi
Lecture 2: this lecture draws on content from Hair, Black, Babin and
Anderson (2014), “Multivariate Data Analysis”, Pearson.
▪ X1 Customer Type: length of time a particular customer has been buying from HBAT
❑ 1 = less than a year
❑ 2 = between 1 and 5 years
❑ 3 = longer than 5 years
▪ X2 Industry Type: type of industry that purchases HBAT’s paper products
❑ 0 = magazine industry
❑ 1 = newsprint industry
▪ X3 Firm Size: employee size
❑ 0 = small firm, fewer than 500 employees
❑ 1 = large firm, 500 or more employees
▪ X4 Region: customer location
❑ 0 = USA/North America
❑ 1 = outside North America
▪ X5 Distribution System: how paper products are sold to customers
❑ 0 = sold indirectly through a broker
❑ 1 = sold directly
Perception and Outcome Variables
▪ Variables X6 to X18 were measured on a 10-point scale whose endpoints are labeled
“Poor” and “Excellent”.
▪ For example: X8 Technical Support: extent to which technical support
is offered to help solve product/service issues.
▪ Outcome variables:
❑ X19, X20 and X21 were measured on a 10-point scale
❑ X22 Percentage of Purchases from HBAT: measured on a 100-point percentage
scale
❑ X23 Perception of Future Relationship with HBAT: extent to which the
customer/respondent perceives his/her firm would engage in a strategic
alliance/partnership with HBAT
• 0 = would not consider
• 1 = yes, would consider strategic alliance or partnership
Examination Phases
• Graphical examination.
• Shape:
✓ Histogram
✓ Bar Chart
✓ Box & Whisker plot
✓ Stem and Leaf plot
• Relationships:
✓ Scatterplot
✓ Outliers
Histograms and The Normal Curve
This is the distribution for the HBAT database variable X19 - Satisfaction.
[Figure: histogram of X19 - Satisfaction, shown with the normal curve; x-axis: X19 - Satisfaction, from 4.50 to 10.00; y-axis: Frequency; N = 100.]
Stem & Leaf Diagram – HBAT Variable X6
X6 - Product Quality Stem-and-Leaf Plot

  Frequency    Stem &  Leaf
       3.00       5 .  012
      10.00       5 .  5567777899
      10.00       6 .  0112344444
      10.00       6 .  5567777999
       5.00       7 .  01144
      11.00       7 .  55666777899
       9.00       8 .  000122234
      14.00       8 .  55556667777778
      18.00       9 .  001111222333333444
       8.00       9 .  56699999
       2.00      10 .  00

  Stem width: 1.0
  Each leaf: 1 case(s)

Callouts on the plot: each stem is shown by the numbers, and each number after it is a
leaf; the row the first callout points to has 10 leaves. The length of a stem, indicated by
its number of leaves, shows the frequency distribution; for the row 8 . 55556667777778,
the frequency is 14.

This table shows the distribution of X6 with a stem and leaf diagram. The first category
covers values from 5.0 up to, but not including, 5.5, so the stem is 5. There are three
observations with values in this range (5.0, 5.1 and 5.2). This is shown as three leaves
of 0, 1 and 2. These are also the three lowest values for X6. In the next row, the stem is
again 5 and there are ten observations, ranging from 5.5 to 5.9. These correspond to the
leaves 5 through 9. At the other end of the figure, the stem is 10. It is associated with
two leaves (0 and 0), representing two values of 10.0, the two highest values for X6.
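The way a value is split into a stem and a leaf can be sketched in a few lines of Python. This is a minimal illustration and not the SPSS procedure itself: the function name `stem_and_leaf` is hypothetical, and the sample contains only the 13 lowest and 2 highest X6 values described in the text.

```python
def stem_and_leaf(values):
    """Return (frequency, stem, leaves) rows, with two rows per stem.

    Each stem is split into a low half (leaves 0-4) and a high half
    (leaves 5-9), which is the layout used in the X6 plot.
    """
    rows = {}
    for v in sorted(values):
        tenths = round(v * 10)            # 5.7 -> 57, avoids float noise
        stem, leaf = divmod(tenths, 10)   # 57 -> stem 5, leaf 7
        half = 0 if leaf < 5 else 1       # low or high half of the stem
        rows.setdefault((stem, half), []).append(str(leaf))
    return [(len(leaves), stem, "".join(leaves))
            for (stem, _half), leaves in sorted(rows.items())]


# The 13 lowest and the 2 highest X6 values mentioned in the explanation.
x6_extremes = [5.0, 5.1, 5.2, 5.5, 5.5, 5.6, 5.7, 5.7, 5.7, 5.7,
               5.8, 5.9, 5.9, 10.0, 10.0]

for freq, stem, leaves in stem_and_leaf(x6_extremes):
    print(f"{freq:6.2f} {stem:>3} . {leaves}")
```

The three rows produced correspond to the first two rows and the last row of the plot: frequencies 3, 10 and 2.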
Frequency Distribution: Variable X6 – Product Quality
X6 - Product Quality

  Value             Frequency   Percent   Valid Percent   Cumulative Percent
  5.0                       1       1.0             1.0                  1.0
  5.1                       1       1.0             1.0                  2.0
  5.2                       1       1.0             1.0                  3.0
  5.5                       2       2.0             2.0                  5.0
  5.6                       1       1.0             1.0                  6.0
  5.7                       4       4.0             4.0                 10.0
  5.8                       1       1.0             1.0                 11.0
  5.9                       2       2.0             2.0                 13.0
  6.0                       1       1.0             1.0                 14.0
  6.1                       2       2.0             2.0                 16.0
  6.2                       1       1.0             1.0                 17.0
  6.3                       1       1.0             1.0                 18.0
  6.4                       5       5.0             5.0                 23.0
  6.5                       2       2.0             2.0                 25.0
  6.6                       1       1.0             1.0                 26.0
  6.7                       4       4.0             4.0                 30.0
  6.9                       3       3.0             3.0                 33.0
  7.0                       1       1.0             1.0                 34.0
  7.1                       2       2.0             2.0                 36.0
  7.4                       2       2.0             2.0                 38.0
  7.5                       2       2.0             2.0                 40.0
  7.6                       3       3.0             3.0                 43.0
  7.7                       3       3.0             3.0                 46.0
  7.8                       1       1.0             1.0                 47.0
  7.9                       2       2.0             2.0                 49.0
  8.0                       3       3.0             3.0                 52.0
  8.1                       1       1.0             1.0                 53.0
  8.2                       3       3.0             3.0                 56.0
  8.3                       1       1.0             1.0                 57.0
  8.4                       1       1.0             1.0                 58.0
  8.5                       4       4.0             4.0                 62.0
  8.6                       3       3.0             3.0                 65.0
  8.7                       6       6.0             6.0                 71.0
  8.8                       1       1.0             1.0                 72.0
  9.0                       2       2.0             2.0                 74.0
  9.1                       4       4.0             4.0                 78.0
  9.2                       3       3.0             3.0                 81.0
  9.3                       6       6.0             6.0                 87.0
  9.4                       3       3.0             3.0                 90.0
  9.5                       1       1.0             1.0                 91.0
  9.6                       2       2.0             2.0                 93.0
  9.9                       5       5.0             5.0                 98.0
  Excellent (10.0)          2       2.0             2.0                100.0
  Total                   100     100.0           100.0
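The Percent and Cumulative Percent columns follow directly from the frequencies. The sketch below recomputes them from the counts in the table; the names `X6_COUNTS` and `frequency_table` are hypothetical, and the value 10.0 stands in for the row labeled “Excellent”, which is an assumption based on the stem-and-leaf plot showing two values of 10.0.

```python
# Frequencies of X6 - Product Quality, copied from the table (N = 100).
X6_COUNTS = {
    5.0: 1, 5.1: 1, 5.2: 1, 5.5: 2, 5.6: 1, 5.7: 4, 5.8: 1, 5.9: 2,
    6.0: 1, 6.1: 2, 6.2: 1, 6.3: 1, 6.4: 5, 6.5: 2, 6.6: 1, 6.7: 4,
    6.9: 3, 7.0: 1, 7.1: 2, 7.4: 2, 7.5: 2, 7.6: 3, 7.7: 3, 7.8: 1,
    7.9: 2, 8.0: 3, 8.1: 1, 8.2: 3, 8.3: 1, 8.4: 1, 8.5: 4, 8.6: 3,
    8.7: 6, 8.8: 1, 9.0: 2, 9.1: 4, 9.2: 3, 9.3: 6, 9.4: 3, 9.5: 1,
    9.6: 2, 9.9: 5, 10.0: 2,
}


def frequency_table(counts):
    """Return rows of (value, frequency, percent, cumulative percent)."""
    total = sum(counts.values())
    rows, running = [], 0
    for value in sorted(counts):
        running += counts[value]
        rows.append((value, counts[value],
                     100.0 * counts[value] / total,
                     100.0 * running / total))
    return rows


table = frequency_table(X6_COUNTS)
for value, freq, pct, cum_pct in table:
    print(f"{value:5.1f} {freq:4d} {pct:6.1f} {cum_pct:6.1f}")

# First value at which the cumulative percent reaches 50, i.e. the median.
median_value = next(value for value, _f, _p, cum in table if cum >= 50.0)
```

With these counts the cumulative percent passes 50 at the value 8.0, so half of the respondents rated Product Quality at 8.0 or lower.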
HBAT Diagnostics: Box & Whiskers Plots
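[Figure: box-and-whisker plots of X6 - Product Quality (y-axis, 4 to 11) for the three groups of X1 - Customer Type (x-axis); group sizes N = 32, 35 and 33. Annotations: the line inside each box marks the median; observation #13 is flagged as an outlier; Group 2 has substantially more dispersion than the other groups.]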
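The quantities a box-and-whisker plot displays can be computed by hand. The sketch below is illustrative only: the data are hypothetical values, not HBAT observations, the function name `box_plot_summary` is made up for this example, and the 1.5 × IQR fence is the common convention for flagging outliers, assumed here rather than taken from the slide.

```python
from statistics import quantiles


def box_plot_summary(values, whisker=1.5):
    """Quartiles, interquartile range and points beyond the whisker fences."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1                       # height of the box
    low_fence = q1 - whisker * iqr      # points below this are flagged
    high_fence = q3 + whisker * iqr     # points above this are flagged
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return {"q1": q1, "median": median, "q3": q3,
            "iqr": iqr, "outliers": outliers}


# Hypothetical ratings for one customer group (NOT taken from HBAT).
hypothetical_group = [6.0, 6.4, 6.5, 6.7, 6.9, 7.0, 7.1, 7.4, 9.9]
summary = box_plot_summary(hypothetical_group)
print(summary)
```

A larger `iqr` for one group than for the others is the numeric counterpart of the "more dispersion" annotation on the figure.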
HBAT Scatterplot: Variables X19 and X6
[Figure: scatterplot of X19 against X6 - Product Quality (x-axis, 4 to 11).]
Missing Data
• Missing Data = information not available for a subject
(or case) about whom other information is available.
Typically occurs when respondent fails to answer one
or more questions in a survey.
✓ Systematic?
✓ Random?
Sources of Outliers
• Procedural Error.
• Extraordinary Event.
• Extraordinary Observations.
• Observations unique in their combination of values.
Dealing with Outliers
• Identify outliers.
• Describe outliers.
• Delete or Retain?
Identifying Outliers
❑ Normality
❑ Linearity
❑ Homoscedasticity
❑ Non-correlated Errors
✓ Data Transformations?
Testing Assumptions
• Normality assessment:
▪ Visual check of histogram.
▪ Kurtosis.
▪ Normal probability plot.
• Homoscedasticity
▪ Equal variances across independent variables.
▪ Levene test (univariate).
▪ Box’s M (multivariate).
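Two of the checks listed above can be sketched numerically. The sketch makes several assumptions: the z statistics use the approximations skewness / sqrt(6/N) and kurtosis / sqrt(24/N) given in the Hair et al. text, the moments are simple population estimators (so values will differ slightly from SPSS output), and Levene's statistic is computed from absolute deviations around group means. All function names and sample data are hypothetical.

```python
from math import sqrt
from statistics import mean


def central_moment(values, order):
    """Population central moment of the given order."""
    m = mean(values)
    return sum((v - m) ** order for v in values) / len(values)


def skewness(values):
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    # Zero for a normal distribution.
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0


def normality_z(values):
    """Approximate z statistics for skewness and kurtosis."""
    n = len(values)
    return (skewness(values) / sqrt(6.0 / n),
            excess_kurtosis(values) / sqrt(24.0 / n))


def levene_w(groups):
    """Levene's W: one-way ANOVA F on absolute deviations from group means."""
    z = [[abs(y - mean(g)) for y in g] for g in groups]
    n_total = sum(len(g) for g in z)
    k = len(z)
    grand = sum(sum(g) for g in z) / n_total
    between = sum(len(g) * (mean(g) - grand) ** 2 for g in z)
    within = sum((v - mean(g)) ** 2 for g in z for v in g)
    return (n_total - k) / (k - 1) * between / within


# Hypothetical data: two groups with very different spread.
tight, wide = [1, 2, 3], [10, 20, 30]
print(normality_z(tight + wide))
print(levene_w([tight, wide]))
```

W is compared with an F distribution on (k - 1, N - k) degrees of freedom; a large W suggests unequal variances across the groups. The z statistics are usually compared with a critical value such as ±1.96.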
Rules of Thumb
Testing Statistical Assumptions
• Departures from normality can have serious effects in small samples (fewer than 50
cases), but the impact effectively diminishes when sample sizes reach
200 cases or more.
• Most cases of heteroscedasticity are a result of non-normality in one
or more variables. Thus, a remedy for non-normality may not be needed because
of sample size, but may still be needed to equalize the variance.
• Nonlinear relationships can be very well defined, but seriously
understated unless the data are transformed to a linear pattern or
explicit model components are used to represent the nonlinear portion
of the relationship.
• Correlated errors arise from a process that must be treated much like
missing data. That is, the researcher must first define the “causes”
among variables either internal or external to the dataset. If they are
not found and remedied, serious biases can occur in the results, often
without the researcher knowing.
Data Transformations?
Physician 1 0
Attorney 0 1
Professor 0 0
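The three rows above read as dummy coding of a three-category variable, with Professor as the reference category that receives zeros on both indicators. That reading is an assumption, since the slide gives only the table. Under it, the coding can be sketched as follows; the function name `dummy_code` is hypothetical.

```python
def dummy_code(category, levels, reference):
    """Return k-1 indicator values (0/1) for one category.

    The reference level gets all zeros; every other level gets a 1 in
    its own column and 0 elsewhere.
    """
    return [1 if category == level else 0
            for level in levels if level != reference]


levels = ["Physician", "Attorney", "Professor"]
coded = {name: dummy_code(name, levels, reference="Professor")
         for name in levels}

for name, indicators in coded.items():
    print(name, *indicators)
```

Using k - 1 indicators for k categories avoids redundancy: the omitted category is identified by zeros on every indicator.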
Simple Approaches to Understanding Data