Chapter 13
The boxplot, or box-and-whisker plot, is another technique used frequently in exploratory data analysis.6
A boxplot reduces the detail of the stem-and-leaf display and provides a different visual image of the
distribution's location, spread, shape, tail length, and outliers. Boxplots are extensions of the five-number
summary of a distribution. This summary consists of the median, the upper and lower quartiles, and
the largest and smallest observations. The median and quartiles are used because they are particularly
resistant statistics. Resistant statistics are unaffected by outliers and change only slightly in response to
the replacement of small portions of the data file.9
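The resistance of the median and quartiles can be checked directly. The Python sketch below is illustrative only: the function name `five_number_summary`, the sample values, and the use of Tukey's hinges as the quartile rule (one of several conventions in use) are assumptions, not part of the PrimeSell example.

```python
from statistics import median

def five_number_summary(values):
    """Return (smallest, lower hinge, median, upper hinge, largest)."""
    data = sorted(values)
    n = len(data)
    half = (n + 1) // 2              # each half includes the median when n is odd
    lower_half = data[:half]
    upper_half = data[n - half:]
    return (data[0], median(lower_half), median(data), median(upper_half), data[-1])

# Hypothetical observations, chosen only for illustration.
sample = [54, 55, 56, 58, 59, 61, 64, 66, 69]
summary = five_number_summary(sample)
print(summary)

# Replace the largest observation with an extreme value.
contaminated = sample[:-1] + [690]
summary_contaminated = five_number_summary(contaminated)
print(summary_contaminated)
```

In this sketch only the largest observation changes between the two summaries; the median and both hinges stay where they were, which is what "resistant" means in practice.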
Assume we are examining the following data file
glance, and both shape and spread impressions are immediate. Patt
arranged to the left of a vertical line. Next, we pass through the aver
the order they were recorded and place the last digit for each item
the vertical line. Note that any digit to the right of the decimal poi
order the digits in each row, creating the stem-and-leaf display show
Each line or row is a stem, and each piece of information on the
5|455666788889
This reflects 12 items in the data file whose first digit is five: 54, 55, 55, 56, 56, 56, 57, 58, 58, 58, 58,
and 59. The second stem is
6|12466799
It shows that there are eight customers whose purchases are in the 60s: 61, 62, 64, 66, 66,
67, 69, and 69.
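The construction just described can be sketched in a few lines of Python. This is a minimal illustration, not a standard routine: the function name `stem_and_leaf` is hypothetical, the stem is assumed to be the tens digit(s) and the leaf the units digit, and the input is the 20 values spelled out by the two stems above, deliberately listed out of order.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (tens) to a string of its ordered leaves (units digits)."""
    rows = defaultdict(list)
    for value in values:
        whole = int(value)           # digits right of the decimal point are ignored
        rows[whole // 10].append(whole % 10)
    # Keep empty stems so gaps in the distribution remain visible.
    return {
        stem: "".join(str(leaf) for leaf in sorted(rows.get(stem, [])))
        for stem in range(min(rows), max(rows) + 1)
    }

purchases = [58, 61, 54, 69, 55, 66, 56, 58, 62, 55,
             57, 64, 56, 58, 66, 59, 56, 67, 58, 69]
display = stem_and_leaf(purchases)
for stem, leaves in display.items():
    print(f"{stem}|{leaves}")
# prints:
# 5|455666788889
# 6|12466799
```

Because every leaf is an actual digit from the data, the values can be read back from the display, which is the property that distinguishes it from a histogram.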
Pareto Diagrams
The Pareto diagram is a bar chart whose percentages sum to 100 percent.
multiple-choice, single-response scale; a multiple-choice, multiple
of words (or themes) from content analysis. The participants' ans
tance, with bar height in descending order from left to right. An :
is depicted as a Pareto diagram in Exhibit 13-14. The cumulative
that the top two problems (the repair did not resolve the custom
returned multiple times for repair) accounted for 80 percent of
th
vice. The pictorial array that results reveals that any attempt to
in
first two problems.
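The ordering and cumulative percentages behind a Pareto diagram can be sketched as follows. The counts are hypothetical and were chosen only so that the first two categories reach 80 percent, as in the repair example; the labels after the first two and the function name `pareto_table` are likewise assumptions.

```python
def pareto_table(counts):
    """Return (label, percent, cumulative percent) rows in descending order of count."""
    total = sum(counts.values())
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    rows, running = [], 0
    for label, count in ordered:
        running += count
        rows.append((label, 100 * count / total, 100 * running / total))
    return rows

# Hypothetical complaint counts (not the data behind Exhibit 13-14).
complaints = {
    "Other": 40,
    "Returned multiple times for repair": 300,
    "Long wait for service": 100,
    "Repair did not resolve the problem": 500,
    "Incorrect billing": 60,
}
rows = pareto_table(complaints)
for label, pct, cum_pct in rows:
    print(f"{label:<36} {pct:5.1f} {cum_pct:6.1f}")
```

Sorting before accumulating is what makes the cumulative line meaningful: the analyst can read off how few categories account for most of the total.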
 5|455666788889
 6|12466799
 7|02235678
 8|02268
 9|
10|24
11|018
12|3
13|1
14|06
15|3
16|36
17|
18|3
19|
20|6
21|8
When rotated, a stem-and-leaf display takes on the properties of a histogram.
21
800
600
SJueitee
400
edi
jo
squnN
2,500
2,000
1,500
milions)
(S,
1,000
profts
Net
500
-500
Sector
Right- and left-skewed distributions and those with reduced spread are also presented clearly in the plot
comparison. Finally, groups may be compared by means of multiple plots. One variation, in which a
notch at the median marks off a confidence interval to test the equality of group medians, takes us a step
closer to hypothesis testing. Here the sides of the box return to full width at the upper and lower
confidence intervals. When the intervals do not overlap, we can be confident, at a specified confidence level,
that the group medians differ.
In Exhibit 13-17, multiple boxplots compare five sectors of PrimeSell's customers by their average
annual purchases data. The overall impression is one of several potential problems for the analyst:
unequal variances, skewness, and extreme outliers. Note the similarities of the profiles of finance and
retailing in contrast to the high-tech and insurance sectors. If hypothesis tests are planned, further
examination of this plot for each sector would require a stem-and-leaf display and a five-number summary.
From this, we could make decisions on the types of tests to select for confirmatory analysis.
Mapping
[Figure: U.S. Population Density (By Counties)]
Increasingly, when possible, research data are being attached to their geographic dimension. Geographic
Information System (GIS) software works by linking data files to each other with at least one common
data field (e.g., a household's street address). The GIS allows the researcher to connect target and classifica
analysis, it does take specialized software and hardware, as well as the expertise to operate it. Students are
encouraged to take specialized courses on GIS to expand their skill set in this growing area.
1. The rectangular plot (encompasses 50 percent of the data values).
2. A center line (marks the median and goes through the width of the box).
3. The edges of the box (called hinges).
4. The "whiskers" (extend from the right and left hinges to the largest and smallest values).
These values may be found within 1.5 times the interquartile range (IQR) from either edge of the box.
When you are examining data, it is important to separate legitimate outliers from errors in measurement,
editing, coding, and data entry. Outliers, data points that lie more than 1.5 times the interquartile
range beyond either edge of the box, reflect unusual cases and are an important source of information for the
study. They are displayed or given special statistical treatment, or other portions of the data file are
sometimes shielded from their effects. Extreme outliers, however, can be data entry errors; these values
should be corrected during editing. Outliers may be early warning signs of a disruption, so researchers need
to take extreme care in their assessment.
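The 1.5 × IQR rule can be sketched as follows. The sample values, the function names `hinges` and `flag_outliers`, and the use of Tukey's hinges for the box edges are assumptions made for illustration; statistical packages differ slightly in how they compute quartiles.

```python
from statistics import median

def hinges(values):
    """Lower and upper hinges: medians of the lower and upper halves of the data."""
    data = sorted(values)
    half = (len(data) + 1) // 2
    return median(data[:half]), median(data[len(data) - half:])

def flag_outliers(values, k=1.5):
    """Return (outliers, whisker ends) using fences k * IQR beyond each hinge."""
    lower, upper = hinges(values)
    iqr = upper - lower
    low_fence, high_fence = lower - k * iqr, upper + k * iqr
    outliers = [v for v in values if v < low_fence or v > high_fence]
    inside = [v for v in values if low_fence <= v <= high_fence]
    return outliers, (min(inside), max(inside))

# Hypothetical observations; 150 is deliberately far from the rest.
sample = [54, 56, 58, 59, 61, 64, 66, 69, 73, 150]
outliers, whiskers = flag_outliers(sample)
print(outliers)   # values plotted individually beyond the whiskers
print(whiskers)   # the whiskers end at the most extreme values inside the fences
```

Flagging a value this way only marks it for inspection. Whether it is a legitimate unusual case or a data entry error still has to be decided by going back to the data file.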
Exhibit 13-16 summarizes several comparisons that are of help to the analyst. Boxplots are an excellent
diagnostic tool, especially when graphed on the same scale. The upper two plots in the exhibit are
both symmetric, but one is larger than the other. Larger box widths are sometimes used when the second
variable, from the same measurement scale, comes from a larger sample size. The box widths should
be proportional to the square root of the sample size, but not all plotting programs account for this.
[Figure: boxplot components. Labels: smallest observed value within 1.5 IQR of lower hinge; largest observed value within 1.5 IQR of upper hinge; whiskers; median; outside value or outlier; extreme or far outside value.]
[Exhibit 13-16 panels: Symmetric; Symmetric, larger relative size in proportion to sample size; Right skewed; Left skewed; Small spread.]
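                              OVERSEAS ASSIGNMENT
Cell content:
  Count
  Row Pct
  Col Pct                     Yes        No       Row
  Tot Pct                                         Total
GENDER
  Male      1                  22        40        62
                             35.5      64.5      62.0
                             78.6      55.6
                             22.0      40.0

  Female    2                   6        32        38
                             15.8      84.2      38.0
                             21.4      44.4
                              6.0      32.0

  Column total                 28        72       100
                             28.0      72.0     100.0

Cell 2,1 (row 2, column 1) is the Female/Yes cell. The Row Total column and the Column total row are the marginals.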
and column totals, called marginals, appear at the bottom and right "margins" of the table. They show,
separately, the counts and percentages of the rows and columns. In CFA, when cross-tabulation tables
are constructed for statistical testing, we call them contingency tables, and the test determines if the clas
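The cell percentages and marginals of a cross-tabulation follow from the cell counts alone. The sketch below uses the gender by overseas assignment counts from the exhibit; the Female/Yes count of 6 is inferred from the row total of 38, and the function name `crosstab` and the dictionary layout are assumptions made for illustration.

```python
from collections import Counter

def crosstab(counts):
    """Return per-cell percentages plus row, column, and grand totals (the marginals)."""
    row_totals, col_totals = Counter(), Counter()
    for (row, col), n in counts.items():
        row_totals[row] += n
        col_totals[col] += n
    grand = sum(row_totals.values())
    cells = {}
    for (row, col), n in counts.items():
        cells[(row, col)] = {
            "count": n,
            "row_pct": round(100 * n / row_totals[row], 1),
            "col_pct": round(100 * n / col_totals[col], 1),
            "tot_pct": round(100 * n / grand, 1),
        }
    return cells, dict(row_totals), dict(col_totals), grand

counts = {
    ("Male", "Yes"): 22, ("Male", "No"): 40,
    ("Female", "Yes"): 6, ("Female", "No"): 32,   # 6 inferred as 38 - 32
}
cells, row_totals, col_totals, grand = crosstab(counts)
print(cells[("Female", "Yes")])   # cell 2,1 (row 2, column 1)
print(row_totals, col_totals, grand)
```

Each cell carries three percentages because each answers a different question: the share of the row, the share of the column, and the share of the whole sample.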
21 years old 60 6 6 6
Any age 6 60 6 96
No opinion 7 40 4 100
During EDA, the researcher has the flexibility to respond to the patterns revealed in the preliminary
summaries of the data. This flexibility is an important attribute of the process. Because it doesn't
follow a rigid structure, EDA is free to take many paths in unraveling the mysteries in the data. While
numerical summaries may start the process, visual representations and graphical techniques offer major
contributions. Summary statistics, as you will see momentarily, may obscure, conceal, or even
misrepresent the underlying structure of the data. When numerical summaries are used exclusively and accepted
without visual inspection, the selection of confirmatory models may be based on flawed assumptions.
For these reasons, exploratory data analysis should begin with visual inspection. After that, it is not only
possible but also desirable to cycle between exploratory and confirmatory approaches.
adjusted for missing data), and cumulative percent. This example nominal variable table describes the
perceived desirable minimum age to be permitted to own a social networking account. The same data
are presented in Exhibit 13-10 using a pie chart and a bar chart. The values and percentages are more
readily understood in the graphic format. When the variable of interest is measured at an interval-ratio
level and has many potential values, these techniques are not particularly informative. Exhibit 13-11 is a
condensed frequency table of the average annual purchases of PrimeSell's top 50 customers. Only two
values, 59.9 and 66, have a frequency greater than 1. Thus, the primary contribution of a frequency
table for these data is an ordered list of values. If the table were converted to a bar chart, it would have
48 bars of equal length and two bars with two occurrences. Bar charts do not reserve spaces for values
where no observations occur within the range. Constructing a pie chart for this variable would also be
pointless with the data in its present form, but the frequency table reveals an opportunity to recode the
variable so that these techniques might have value.
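A frequency table of this kind takes only a few lines to build. The sketch below is illustrative: the values are transcribed from Exhibit 13-11, while the function name `frequency_table` and the tuple layout (value, frequency, percent, cumulative percent) are assumptions.

```python
from collections import Counter

def frequency_table(values):
    """Return (value, frequency, percent, cumulative percent) rows in value order."""
    counts = Counter(values)
    n = len(values)
    rows, cumulative = [], 0
    for value in sorted(counts):
        cumulative += counts[value]
        rows.append((value, counts[value], 100 * counts[value] / n, 100 * cumulative / n))
    return rows

# Average annual purchases of the top 50 customers, transcribed from Exhibit 13-11.
purchases = [
    54.9, 55.4, 55.6, 56.4, 56.8, 56.9, 57.8, 58.1, 58.2, 58.3, 58.5, 59.9, 59.9,
    61.5, 62.6, 64.8, 66.0, 66.0, 66.3, 67.6, 69.1, 69.2, 70.5, 72.7, 72.9, 73.5,
    75.6, 76.4, 77.5, 78.9, 80.9, 82.2, 82.5, 86.4, 88.3, 102.5, 104.1, 110.4,
    111.9, 118.6, 123.8, 131.2, 140.9, 146.2, 153.2, 163.2, 166.7, 183.2, 206.9, 218.2,
]
table = frequency_table(purchases)
repeated = [value for value, freq, _, _ in table if freq > 1]
print(repeated)      # the only values that occur more than once
print(table[-1])     # last row: the cumulative percent reaches 100
```

With almost every frequency equal to 1, the table is little more than a sorted list, which is why recoding the variable into intervals is the natural next step.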
>Exhibit 13-11 Average Annual Purchases of PrimeSell's Top 50 Customers
                            Cumulative                               Cumulative
Value  Frequency  Percent   Percent      Value  Frequency  Percent   Percent
 54.9      1         2          2         75.6      1         2         54
 55.4      1         2          4         76.4      1         2         56
 55.6      1         2          6         77.5      1         2         58
 56.4      1         2          8         78.9      1         2         60
 56.8      1         2         10         80.9      1         2         62
 56.9      1         2         12         82.2      1         2         64
 57.8      1         2         14         82.5      1         2         66
 58.1      1         2         16         86.4      1         2         68
 58.2      1         2         18         88.3      1         2         70
 58.3      1         2         20        102.5      1         2         72
 58.5      1         2         22        104.1      1         2         74
 59.9      2         4         26        110.4      1         2         76
 61.5      1         2         28        111.9      1         2         78
 62.6      1         2         30        118.6      1         2         80
 64.8      1         2         32        123.8      1         2         82
 66.0      2         4         36        131.2      1         2         84
 66.3      1         2         38        140.9      1         2         86
 67.6      1         2         40        146.2      1         2         88
 69.1      1         2         42        153.2      1         2         90
 69.2      1         2         44        163.2      1         2         92
 70.5      1         2         46        166.7      1         2         94
 72.7      1         2         48        183.2      1         2         96
 72.9      1         2         50        206.9      1         2         98
 73.5      1         2         52        218.2      1         2        100
                                         Total     50       100
of observations in each interval is on the vertical axis. The value of the start of each interval is noted
at the left of the bar on the horizontal axis. The height of the bar corresponds with the frequency of
observations in the interval above which it is erected. This histogram was constructed with intervals
20 increments wide. These values are found in PrimeSell's average annual purchases frequency table
(Exhibit 13-11). Intervals with 0 counts would show gaps in the table and alert the analyst to look for
problems with spread. When the upper tail of the distribution is compared with the frequency table, we
ber of observations in the upper tail, this histogram warns us of irregularities in the data.
Stem-and-Leaf Displays14
The stem-and-leaf display is a technique that is closely related to the histogram. It shares some of the
histogram's features but offers several unique advantages. It is easy to construct by hand for small samples
or may be produced by computer programs. In contrast to histograms, which lose information by grouping
data values into intervals, the stem-and-leaf presents actual data values that can be inspected directly,
without the use of enclosed bars or asterisks as the representation medium. This feature reveals the
distribution of values within the interval and preserves their rank order for finding the median, quartiles,
eases linking a specific observation back to the data file and to the
Histograms
The histogram is a conventional solution for the display of interval-ratio data. Histograms are used
when it is possible to group the variable's values into intervals. Histograms are constructed with bars
(or asterisks) that represent each interval, where the interval quantity determines the height of the bar
and where each interval's bar is the same width and occupies an equal amount of area within the graph. You
want the number of intervals to represent the expanse of the data. The number of intervals may be
arbitrarily chosen. One researcher suggests you calculate the number of intervals based on the square root
>Exhibit 13-10 Nominal Displays of Data (Minimum Age for Social Networking)
Minimum age     Percent
21 years old       6
18 years old      18
16 years old      33
13 years old      28
10 years old       5
Any age            6
No opinion         4
[Pie chart and bar chart of the same percentages; bar chart horizontal axis: Age]
of the number of observations (after rounding). In the example in Exhibit 13-11, the number of intervals
would be 8 (square root of 50 = 7.07, rounded up). The interval width should then be approximately
equal to the range divided by the number of intervals [(218.2 - 54.9)/8 = 20.4, rounded to 20].12 Data
analysts find histograms useful for (1) displaying all intervals in a distribution, even intervals without
observed values, and (2) examining the shape of the distribution for skewness, kurtosis, and the modal
pattern. When looking at a histogram, one might ask: Is there a single mode (a hump)? Are subgroups
identifiable when multiple modes are present? Are outliers (straggling data values) detached from the
central concentration?13
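The interval arithmetic above can be reproduced with a short sketch using the 50 values from Exhibit 13-11. The starting point of 50 for the first interval and the function name `histogram_counts` are assumptions; the text does not state where the first interval begins. With that starting point and a width of 20, the data span nine intervals rather than eight.

```python
import math

def histogram_counts(values, start, width):
    """Count observations in consecutive intervals [start + i*width, start + (i+1)*width)."""
    n_bins = math.floor((max(values) - start) / width) + 1
    counts = [0] * n_bins
    for v in values:
        counts[int((v - start) // width)] += 1
    return counts

# Average annual purchases of the top 50 customers, transcribed from Exhibit 13-11.
purchases = [
    54.9, 55.4, 55.6, 56.4, 56.8, 56.9, 57.8, 58.1, 58.2, 58.3, 58.5, 59.9, 59.9,
    61.5, 62.6, 64.8, 66.0, 66.0, 66.3, 67.6, 69.1, 69.2, 70.5, 72.7, 72.9, 73.5,
    75.6, 76.4, 77.5, 78.9, 80.9, 82.2, 82.5, 86.4, 88.3, 102.5, 104.1, 110.4,
    111.9, 118.6, 123.8, 131.2, 140.9, 146.2, 153.2, 163.2, 166.7, 183.2, 206.9, 218.2,
]

k = math.ceil(math.sqrt(len(purchases)))              # square-root rule, rounded up
raw_width = (max(purchases) - min(purchases)) / k     # range divided by number of intervals
counts = histogram_counts(purchases, start=50, width=20)   # start=50 is an assumption

print(k, round(raw_width, 1))
for i, count in enumerate(counts):
    print(f"{50 + 20 * i:>3}-{50 + 20 * (i + 1):<3} {'*' * count}")
```

Printing one asterisk per observation gives a rough text histogram, which is enough to see the concentration in the lowest intervals and the long, thin upper tail.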
The values for the average annual purchases variable presented in Exhibit 13-11 were measured on a
ratio scale and are easily grouped. Other variables possessing an underlying order are similarly
appropriate for histograms. A histogram would not be used for a nominal variable that has no order to its categories.
A histogram of the average annual purchases is shown in Exhibit 13-12. Each interval range for the
variable of interest, average annual purchases, is shown on the horizontal axis; the frequency or number