
Chapter 13

The document discusses various techniques for exploratory data analysis, including boxplots, stem-and-leaf displays, and Pareto diagrams, which help visualize data distributions, outliers, and trends. It emphasizes the importance of resistant statistics and visual representations in understanding data patterns and making informed decisions. Additionally, it highlights the role of Geographic Information Systems (GIS) in linking data to geographic dimensions for richer insights.

Uploaded by

Minh Nguyen

Boxplots

The boxplot, or box-and-whisker plot, is another technique used frequently in exploratory data analysis. A boxplot reduces the detail of the stem-and-leaf display and provides a different visual image of the distribution's location, spread, shape, tail length, and outliers. Boxplots are extensions of the five-number summary of a distribution. This summary consists of the median, the upper and lower quartiles, and the largest and smallest observations. The median and quartiles are used because they are particularly resistant statistics. Resistant statistics are unaffected by outliers and change only slightly in response to the replacement of small portions of the data file.
Assume we are examining the following data file:

[5, 6, 6, 7, 7, 7, 8, 8, 9]: mean = 7, standard deviation = 1.22, median = 7, lower quartile = 6, upper quartile = 8

If we replace the 9 with 90:

[5, 6, 6, 7, 7, 7, 8, 8, 90]: mean = 16, standard deviation = 27.78, median = 7, lower quartile = 6, upper quartile = 8

(The appendix Describing Data Statistically at the end of this chapter provides a review of these statistical concepts.)
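
The comparison above can be checked with a short Python sketch. It assumes the sample standard deviation (n - 1 denominator), which agrees with the 1.22 quoted for the first file, and it uses the default quartile rule of Python's `statistics.quantiles`; other quartile rules can give slightly different values on other data. The function name `summarize` is illustrative only.

```python
import statistics

def summarize(data):
    """Return mean, sample standard deviation, median, and quartiles."""
    lower_q, median, upper_q = statistics.quantiles(data, n=4)
    return {
        "mean": statistics.mean(data),
        "sd": statistics.stdev(data),      # sample SD (n - 1), an assumption
        "median": median,
        "lower_quartile": lower_q,
        "upper_quartile": upper_q,
    }

original = [5, 6, 6, 7, 7, 7, 8, 8, 9]
replaced = [5, 6, 6, 7, 7, 7, 8, 8, 90]   # the 9 replaced with 90

before = summarize(original)
after = summarize(replaced)
for name in before:
    print(f"{name:15s} {before[name]:8.2f} -> {after[name]:8.2f}")
```

The mean and standard deviation move sharply, while the median and quartiles stay where they were, which is what "resistant" means here.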

glance, and both shape and spread impressions are immediate. Patt [...]
no values exist, areas where values are clustered, or outlying values t [...]
data, are easily observed.

To develop a stem-and-leaf display for the data in Exhibit 13-11, [...]
arranged to the left of a vertical line. Next, we pass through the aver [...]
the order they were recorded and place the last digit for each item ( [...]
the vertical line. Note that any digit to the right of the decimal poi [...]
item is placed on the horizontal row corresponding to its first digit( [...]
order the digits in each row, creating the stem-and-leaf display show [...]

Each line or row is a stem, and each piece of information on the [...]

5|455666788889

This reflects 12 items in the data file whose first digit is five: 54, [...]
and 59. The second stem is

6|12466799

It shows that there are eight customers with purchases are in th [...]
67, 69, and 69.
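
The construction described above can be sketched in Python. This is a minimal illustration, not the procedure of any particular statistics package: the function name `stem_and_leaf` is hypothetical, decimals are dropped as the text describes, and the input is assumed to be the distinct values below 90 listed in Exhibit 13-11 (duplicates omitted, which matches the leaves printed above).

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (tens and above) to its sorted leaves (units digit).

    Digits to the right of the decimal point are dropped.
    """
    rows = defaultdict(list)
    for value in values:
        whole = int(value)                 # drop the decimal part
        rows[whole // 10].append(whole % 10)
    # Include empty stems so gaps in the distribution stay visible.
    return {stem: sorted(rows.get(stem, []))
            for stem in range(min(rows), max(rows) + 1)}

# Distinct values below 90 from Exhibit 13-11 (assumption: duplicates omitted).
purchases = [
    54.9, 55.4, 55.6, 56.4, 56.8, 56.9, 57.8, 58.1, 58.2, 58.3, 58.5, 59.9,
    61.5, 62.6, 64.8, 66.0, 66.3, 67.6, 69.1, 69.2,
    70.5, 72.7, 72.9, 73.5, 75.6, 76.4, 77.5, 78.9,
    80.9, 82.2, 82.5, 86.4, 88.3,
]

display = stem_and_leaf(purchases)
for stem, leaves in display.items():
    print(f"{stem:>2}|{''.join(str(leaf) for leaf in leaves)}")
```

Because each leaf is an actual digit from the data, the rank order within a stem is preserved, which is what makes the median and quartiles easy to read off the display.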

Pareto Diagrams
The Pareto diagram is a bar chart whose percentages sum to 100 p [...]
multiple-choice, single-response scale; a multiple-choice, multiple [...]
of words (or themes) from content analysis. The participants' ans [...]
tance, with bar height in descending order from left to right. An [...]
is depicted as a Pareto diagram in Exhibit 13-14. The cumulative [...]
that the top two problems (the repair did not resolve the custom [...]
returned multiple times for repair) accounted for 80 percent of th [...]
vice. The pictorial array that results reveals that any attempt to in [...]
first two problems.
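
A Pareto ordering with cumulative percentages can be sketched as follows. The complaint categories and counts are hypothetical, chosen only so that the top two categories account for 80 percent as in the example above; they are not the figures behind Exhibit 13-14. The function name `pareto_table` is likewise illustrative.

```python
def pareto_table(counts):
    """Sort categories by descending count and add cumulative percentages."""
    total = sum(counts.values())
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    table, running = [], 0
    for category, count in ordered:
        running += count
        table.append((category, count, 100 * count / total, 100 * running / total))
    return table

# Hypothetical complaint counts, not the data behind Exhibit 13-14.
complaints = {
    "Slow turnaround": 100,
    "Problem not resolved": 450,
    "Other": 40,
    "Multiple repairs required": 350,
    "Poor communication": 60,
}

rows = pareto_table(complaints)
for category, count, pct, cum_pct in rows:
    print(f"{category:28s} {count:4d} {pct:5.1f}% {cum_pct:6.1f}%")
```

Reading down the cumulative column shows how few categories are needed to cover most of the complaints.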


>Exhibit 13-13 A Stem-and-Leaf Display of PrimeSell's Average Annual Pu[rchases]

 5|455666788889
 6|12466799
 7|02235678
 8|02268
 9|
10|24
11|018
12|3
13|1
14|06
15|3
16|36
17|
18|3
19|
20|6
21|8

When rotated, a stem-and-leaf display ta [...] properties of a histogram.

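>Exhibit 13-14 Pareto Diagram of Laptop Repair Complaints

[Figure: bar chart with the vertical scale marked at 400, 600, and 800.]

[Figure: boxplots by sector. Vertical axis: Net profits ($, millions), scaled from -500 to 2,500. Horizontal axis: Sector (Financial, Health, High-tech, Insurance, Retailing).]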


>chapter 13 Stage 3: Collect, Prepare, and Examine Data

Right- and left-skewed distributions and those with reduced spread are also presented clearly in the plot comparison. Finally, groups may be compared by means of multiple plots. One variation, in which a notch at the median marks off a confidence interval to test the equality of group medians, takes us a step closer to hypothesis testing. Here the sides of the box return to full width at the upper and lower confidence intervals. When the intervals do not overlap, we can be confident, at a specified confidence level, that the medians of the two populations are different.
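
The notch comparison can be sketched numerically. The text does not give a formula for the notch, so this sketch assumes one widely used convention, median plus or minus 1.57 times IQR divided by the square root of n, which gives roughly a 95 percent interval for comparing medians. The two samples and the function names are hypothetical.

```python
import math
import statistics

def notch_interval(data):
    """Approximate notch limits: median +/- 1.57 * IQR / sqrt(n).

    The 1.57 multiplier is an assumed convention, not taken from the text.
    """
    lower_q, median, upper_q = statistics.quantiles(data, n=4)
    half_width = 1.57 * (upper_q - lower_q) / math.sqrt(len(data))
    return median - half_width, median + half_width

def notches_overlap(a, b):
    """True when the two notch intervals share any values."""
    return a[0] <= b[1] and b[0] <= a[1]

# Hypothetical samples for two groups.
group_a = [12, 14, 15, 15, 16, 17, 18, 19, 21]
group_b = [22, 24, 25, 26, 26, 27, 29, 30, 33]

notch_a = notch_interval(group_a)
notch_b = notch_interval(group_b)
print(notch_a, notch_b, notches_overlap(notch_a, notch_b))
```

Non-overlapping notches suggest different medians, but the check is informal; a confirmatory test would still be needed.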

In Exhibit 13-17, multiple boxplots compare five sectors of PrimeSell's customers by their average annual purchases data. The overall impression is one of several potential problems for the analyst: unequal variances, skewness, and extreme outliers. Note the similarities of the profiles of finance and retailing in contrast to the high-tech and insurance sectors. If hypothesis tests are planned, further examination of this plot for each sector would require a stem-and-leaf display and a five-number summary. From this, we could make decisions on the types of tests to select for confirmatory analysis.

Mapping
Increasingly, when possible, research data are being attached to their geographic dimension. Geographic Information System (GIS) software works by linking data files to each other with at least one common data field (e.g., a household's street address). The GIS allows the researcher to connect target and classification variables from a survey, experiment, observation study, or other research to specific geographic-based databases like U.S. Census data, to develop a richer understanding of the sample's attitudes and behavior. As radio frequency identification (RFID) data become more prevalent, much behavioral data are able to connect with these new geographically rich databases.

[Figure: map of U.S. Population Density (By Counties), with map key. Caption: Government data is often presented in map form. Source: US Census]

The most common way to display such data is with a map. Colors and patterns denoting knowledge, attitude, behavior, or demographic data arrays are superimposed over street maps (finest-level GIS); blockgroup maps; or county, state, or country maps to help identify the best locations for stores based on demographic, psychographic, and life-stage segmentation data. Florists array promotional response information geographically and use the map to plan targeted promotions. Consumer and business-to-business researchers use mapping of data on ownership, usage level, and price sensitivity in plotting geographic rollouts of new products. Although this is an attractive option for exploratory analysis, it does take specialized software and hardware, as well as the expertise to operate it. Students are encouraged to take specialized courses on GIS to expand their skill set in this growing area.
1. The rectangular plot (encompasses 50 percent of the data values).
2. A center line (marks the median and goes through the width of the box).
3. The edges of the box, called hinges.
4. The "whiskers" (extend from the right and left hinges to the largest and smallest values).

These values may be found within 1.5 times the interquartile range (IQR) from either edge of the box. These components and their relationships are shown in Exhibit 13-15.

When you are examining data, it is important to separate legitimate outliers from errors in measurement, editing, coding, and data entry. Outliers, data points that lie more than 1.5 times the interquartile range beyond the hinges, reflect unusual cases and are an important source of information for the study. They are displayed or given special statistical treatment, or other portions of the data file are sometimes shielded from their effects. Extreme outliers, however, can be data entry errors; these values should be corrected during editing. Outliers may be early warning signs of a disruption, so researchers need to take extreme care in their assessment.
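
The boxplot components and the outlier rule can be sketched in Python using the 50 observations of Exhibit 13-11. This is a sketch under stated assumptions: the quartiles from `statistics.quantiles` stand in for the hinges (Tukey's hinges can differ slightly), the fences follow the 1.5 IQR and 3 IQR labels of Exhibit 13-15, and the function name `boxplot_summary` is hypothetical.

```python
import statistics

def boxplot_summary(data):
    """Hinges, fences, whisker ends, and flagged values for one variable."""
    values = sorted(data)
    lower_hinge, median, upper_hinge = statistics.quantiles(values, n=4)
    iqr = upper_hinge - lower_hinge
    inner = (lower_hinge - 1.5 * iqr, upper_hinge + 1.5 * iqr)
    outer = (lower_hinge - 3 * iqr, upper_hinge + 3 * iqr)
    inside = [v for v in values if inner[0] <= v <= inner[1]]
    return {
        "median": median,
        "hinges": (lower_hinge, upper_hinge),
        "inner_fences": inner,
        "outer_fences": outer,
        # Whiskers end at the most extreme values inside the inner fences.
        "whiskers": (inside[0], inside[-1]),
        # Between the inner and outer fences: outside values (outliers).
        "outside": [v for v in values
                    if (outer[0] <= v < inner[0]) or (inner[1] < v <= outer[1])],
        # Beyond the outer fences: extreme or far outside values.
        "far_outside": [v for v in values if v < outer[0] or v > outer[1]],
    }

# The 50 observations of Exhibit 13-11 (59.9 and 66.0 each occur twice).
purchases = [
    54.9, 55.4, 55.6, 56.4, 56.8, 56.9, 57.8, 58.1, 58.2, 58.3, 58.5,
    59.9, 59.9, 61.5, 62.6, 64.8, 66.0, 66.0, 66.3, 67.6, 69.1, 69.2,
    70.5, 72.7, 72.9, 73.5, 75.6, 76.4, 77.5, 78.9, 80.9, 82.2, 82.5,
    86.4, 88.3, 102.5, 104.1, 110.4, 111.9, 118.6, 123.8, 131.2, 140.9,
    146.2, 153.2, 163.2, 166.7, 183.2, 206.9, 218.2,
]

summary = boxplot_summary(purchases)
for key, value in summary.items():
    print(key, value)
```

Flagged values are candidates for inspection, not automatic deletions; whether each one is a legitimate unusual case or an entry error still has to be judged against the data file.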

Exhibit 13-16 summarizes several comparisons that are of help to the analyst. Boxplots are an excellent diagnostic tool, especially when graphed on the same scale. The upper two plots in the exhibit are both symmetric, but one is larger than the other. Larger box widths are sometimes used when the second variable, from the same measurement scale, comes from a larger sample size. The box widths should be proportional to the square root of the sample size, but not all plotting programs account for this.

>Exhibit 13-15 Boxplot Components

[Figure: annotated horizontal boxplot. Labels from left to right: outer fence (lower hinge minus 3 IQR); inner fence (lower hinge minus 1.5 IQR); smallest observed value within 1.5 IQR of lower hinge; lower hinge (lower quartile); median; upper hinge (upper quartile); largest observed value within 1.5 IQR of upper hinge; inner fence (1.5 IQR plus upper hinge); outer fence (3 IQR plus upper hinge). The whiskers run from the hinges to the smallest and largest observed values within 1.5 IQR. Points beyond the whiskers are labeled "outside value or outlier" and "extreme or far outside value". 50% of observed values are within the box.]


>part IV Collect, Prepare, and Examine the Data

>Exhibit 13-16 Diagnostics with Boxplots

[Figure panels: Symmetric; Symmetric, larger relative size in proportion to sample size; Right skewed; Left skewed; Small spread; Notched at the median for a test of the equality of population medians.]

>Exhibit 13-18 Cross-Tabulation of Gender by Overseas Assignment Opportunity

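Cell content: Count, Row Pct, Col Pct, Tot Pct

                          OVERSEAS ASSIGNMENT
GENDER                    Yes       No        Row Total
Male (1)      Count        22        40         62
              Row Pct      35.5      64.5       62.0
              Col Pct      78.6      55.6
              Tot Pct      22.0      40.0
Female (2)    Count         6        32         38
              Row Pct      15.8      84.2       38.0
              Col Pct      21.4      44.4
              Tot Pct       6.0      32.0
Column Total  Count        28        72        100
              Pct          28.0      72.0      100.0

Annotations in the exhibit: the row and column totals are labeled "Marginals"; the Female/Yes cell is labeled "Cell 2,1 (row 2, column 1)".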

Row and column totals, called marginals, appear at the bottom and right "margins" of the table. They show, separately, the counts and percentages of the rows and columns. In CFA, when cross-tabulation tables are constructed for statistical testing, we call them contingency tables, and the test determines if the classification variables are independent of each other.
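
The cell percentages and marginals of a cross-tabulation can be computed as in the sketch below. The counts are those readable in Exhibit 13-18, with one assumption: the Female/Yes count of 6 is inferred from the column total of 28 and the Male/Yes count of 22. The function name `crosstab_percentages` is illustrative.

```python
def crosstab_percentages(counts):
    """Return row, column, and total percentages for a table of cell counts."""
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    grand_total = sum(row_totals)
    cells = []
    for i, row in enumerate(counts):
        cells.append([
            {
                "count": n,
                "row_pct": round(100 * n / row_totals[i], 1),
                "col_pct": round(100 * n / col_totals[j], 1),
                "tot_pct": round(100 * n / grand_total, 1),
            }
            for j, n in enumerate(row)
        ])
    return cells, row_totals, col_totals, grand_total

# Rows: Male, Female. Columns: Yes, No.
# The Female/Yes count (6) is inferred from the marginals, an assumption.
counts = [[22, 40], [6, 32]]

cells, row_totals, col_totals, grand_total = crosstab_percentages(counts)
for label, row in zip(("Male", "Female"), cells):
    for answer, cell in zip(("Yes", "No"), row):
        print(label, answer, cell)
print("Marginals:", row_totals, col_totals, grand_total)
```

Each cell carries three different percentages because each answers a different question: share of the row, share of the column, or share of all cases.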


Throughout this chapter, we have exploited the visual techniques of exploratory data analysis to look beyond numerical summaries and gain insight into the patterns of the data. Few of the approaches have stressed the need for advanced mathematics, and all have an intuitive appeal for the analyst. When the more common ways of summarizing location, spread, and shape have conveyed an inadequate picture of the data, we have used more resistant statistics to protect us from the effects of extreme scores and occasional errors. We have also emphasized the value of transforming the original scale of the data during preliminary analysis rather than at the point of hypothesis testing.

>Exhibit 13-9 A Frequency Table (Minimum Age for Social Networking)

Value Label     Value   Frequency   Percent   Valid Percent   Cumulative Percent
21 years old      1         60         6            6                  6
18 years old      2        180        18           18                 24
16 years old      3        330        33           33                 57
13 years old      4        280        28           28                 85
10 years old      5         50         5            5                 90
Any age           6         60         6            6                 96
No opinion        7         40         4            4                100
Total                    1,000       100          100

Valid Cases 1,000; Missing Cases 0

During EDA, the researcher has the flexibility to respond to the patterns revealed in the preliminary summaries of the data. This flexibility is an important attribute of the process. Because it doesn't follow a rigid structure, EDA is free to take many paths in unraveling the mysteries in the data. While numerical summaries may start the process, visual representations and graphical techniques offer major contributions. Summary statistics, as you will see momentarily, may obscure, conceal, or even misrepresent the underlying structure of the data. When numerical summaries are used exclusively and accepted without visual inspection, the selection of confirmatory models may be based on flawed assumptions. For these reasons, exploratory data analysis should begin with visual inspection. After that, it is not only possible but also desirable to cycle between exploratory and confirmatory approaches.

Frequency Tables, Bar Charts, and Pie Graphs


Several techniques are essential to any data examination. For example, a frequency table is a simple device for arraying data. An example is presented in Exhibit 13-9. It arrays data by assigned response code values, from lowest to highest value, with columns for count, percent, valid percent (percent adjusted for missing data), and cumulative percent. This example nominal variable table describes the perceived desirable minimum age to be permitted to own a social networking account. The same data are presented in Exhibit 13-10 using a pie chart and a bar chart. The values and percentages are more readily understood in the graphic format. When the variable of interest is measured at an interval-ratio level and has many potential values, these techniques are not particularly informative. Exhibit 13-11 is a condensed frequency table of the average annual purchases of PrimeSell's top 50 customers. Only two values, 59.9 and 66, have a frequency greater than 1. Thus, the primary contribution of a frequency table for these data is an ordered list of values. If the table were converted to a bar chart, it would have 48 bars of equal length and two bars with two occurrences. Bar charts do not reserve spaces for values where no observations occur within the range. Constructing a pie chart for this variable would also be pointless with the data in its present form, but the frequency table reveals an opportunity to recode the variable so that these techniques might have value.
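
The columns of a frequency table such as Exhibit 13-9 can be derived from the raw counts as in this sketch. It assumes that cumulative percent accumulates the valid percent, and the `missing` parameter and the function name `frequency_table` are illustrative; with zero missing cases, as here, percent and valid percent coincide.

```python
def frequency_table(counts, missing=0):
    """Build rows of (label, count, percent, valid percent, cumulative percent)."""
    valid_total = sum(counts.values())
    total = valid_total + missing
    rows, cumulative = [], 0.0
    for label, count in counts.items():
        valid_pct = 100 * count / valid_total   # adjusted for missing data
        cumulative += valid_pct
        rows.append((label, count, 100 * count / total, valid_pct, cumulative))
    return rows

# Counts from Exhibit 13-9, in response-code order (1 through 7).
min_age_counts = {
    "21 years old": 60,
    "18 years old": 180,
    "16 years old": 330,
    "13 years old": 280,
    "10 years old": 50,
    "Any age": 60,
    "No opinion": 40,
}

table = frequency_table(min_age_counts, missing=0)
for label, count, pct, valid_pct, cum_pct in table:
    print(f"{label:14s} {count:5d} {pct:5.0f} {valid_pct:5.0f} {cum_pct:5.0f}")
```

Applied to a variable like average annual purchases, where nearly every value occurs once, the same function would return almost as many rows as observations, which is why recoding into intervals is suggested above.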
>Exhibit 13-11 Average Annual Purchases of PrimeSell's Top 50 Customers

                             Cumulative                                 Cumulative
Value   Frequency   Percent  Percent       Value   Frequency   Percent  Percent
 54.9       1          2         2          75.6       1          2        54
 55.4       1          2         4          76.4       1          2        56
 55.6       1          2         6          77.5       1          2        58
 56.4       1          2         8          78.9       1          2        60
 56.8       1          2        10          80.9       1          2        62
 56.9       1          2        12          82.2       1          2        64
 57.8       1          2        14          82.5       1          2        66
 58.1       1          2        16          86.4       1          2        68
 58.2       1          2        18          88.3       1          2        70
 58.3       1          2        20         102.5       1          2        72
 58.5       1          2        22         104.1       1          2        74
 59.9       2          4        26         110.4       1          2        76
 61.5       1          2        28         111.9       1          2        78
 62.6       1          2        30         118.6       1          2        80
 64.8       1          2        32         123.8       1          2        82
 66.0       2          4        36         131.2       1          2        84
 66.3       1          2        38         140.9       1          2        86
 67.6       1          2        40         146.2       1          2        88
 69.1       1          2        42         153.2       1          2        90
 69.2       1          2        44         163.2       1          2        92
 70.5       1          2        46         166.7       1          2        94
 72.7       1          2        48         183.2       1          2        96
 72.9       1          2        50         206.9       1          2        98
 73.5       1          2        52         218.2       1          2       100

Total      50        100

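>Exhibit 13-12 Histogram of PrimeSell's Top 50 Customers' Average Annual Purchases

[Figure: histogram. Horizontal axis: Average Annual Purchases (in thousands), marked at 50, 70, 90, 110, 130, 150, 170, 190, 210, and 230. Vertical axis scale marked at 10, 15, 20, and 25.]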

of observations in each interval is on the vertical axis. The value of the start of each interval is noted at the left of the bar on the horizontal axis. The height of the bar corresponds with the frequency of observations in the interval above which it is erected. This histogram was constructed with intervals 20 increments wide. These values are found in PrimeSell's average annual purchases frequency table (Exhibit 13-11). Intervals with 0 counts would show gaps in the table and alert the analyst to look for problems with spread. When the upper tail of the distribution is compared with the frequency table, we find three extreme values (183.2, 206.9, and 218.2). Along with the peaked midpoint and reduced number of observations in the upper tail, this histogram warns us of irregularities in the data.

Stem-and-Leaf Displays
The stem-and-leaf display is a technique that is closely related to the histogram. It shares some of the histogram's features but offers several unique advantages. It is easy to construct by hand for small samples or may be produced by computer programs. In contrast to histograms, which lose information by grouping data values into intervals, the stem-and-leaf presents actual data values that can be inspected directly, without the use of enclosed bars or asterisks as the representation medium. This feature reveals the distribution of values within the interval and preserves their rank order for finding the median, quartiles, [...] eases linking a specific observation back to the data file and to the [...]
Histograms
The histogram is a conventional solution for the display of interval-ratio data. Histograms are used when it is possible to group the variable's values into intervals. Histograms are constructed with bars (or asterisks) that represent each interval, where the interval quantity determines the height of the bar, and where each interval's bar is the same width and occupies an equal amount of area within the graph. You want the number of intervals to represent the expanse of the data. The number of intervals may be arbitrarily chosen. One researcher suggests you calculate the number of intervals based on the square root


>Exhibit 13-10 Nominal Displays of Data (Minimum Age for Social Networking)

[Figure: pie chart and bar chart, both titled Minimum Age for Social Networking. Percent by category: 21 years old, 6; 18 years old, 18; 16 years old, 33; 13 years old, 28; 10 years old, 5; Any age, 6; No opinion, 4. The bar chart plots percent (vertical axis) against age category (horizontal axis).]

of the number of observations (after rounding). In the example in Exhibit 13-11, the number of intervals would be 8 (square root of 50 = 7.07, rounded up). The interval width should then be approximately equal to the range divided by the number of intervals [(218.2 - 54.9)/8 = 20.4, rounded to 20]. Data analysts find histograms useful for (1) displaying all intervals in a distribution, even intervals without observed values, and (2) examining the shape of the distribution for skewness, kurtosis, and the modal pattern. When looking at a histogram, one might ask: Is there a single mode (a hump)? Are subgroups identifiable when multiple modes are present? Are outliers (straggling data values) detached from the central concentration?
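
The interval arithmetic above can be reproduced with a short sketch. The square-root rule and the width of 20 come from the text; the starting point of 50 is an assumption read from the horizontal axis of Exhibit 13-12, and the function names are illustrative.

```python
import math

def suggest_intervals(values):
    """Square-root rule: number of intervals and the unrounded interval width."""
    n_intervals = math.ceil(math.sqrt(len(values)))
    raw_width = (max(values) - min(values)) / n_intervals
    return n_intervals, raw_width

def bin_counts(values, start, width):
    """Count observations in [start + k*width, start + (k+1)*width)."""
    n_bins = int((max(values) - start) // width) + 1
    counts = [0] * n_bins
    for value in values:
        counts[int((value - start) // width)] += 1
    return counts

# The 50 observations of Exhibit 13-11 (59.9 and 66.0 each occur twice).
purchases = [
    54.9, 55.4, 55.6, 56.4, 56.8, 56.9, 57.8, 58.1, 58.2, 58.3, 58.5,
    59.9, 59.9, 61.5, 62.6, 64.8, 66.0, 66.0, 66.3, 67.6, 69.1, 69.2,
    70.5, 72.7, 72.9, 73.5, 75.6, 76.4, 77.5, 78.9, 80.9, 82.2, 82.5,
    86.4, 88.3, 102.5, 104.1, 110.4, 111.9, 118.6, 123.8, 131.2, 140.9,
    146.2, 153.2, 163.2, 166.7, 183.2, 206.9, 218.2,
]

n_intervals, raw_width = suggest_intervals(purchases)
counts = bin_counts(purchases, start=50, width=20)   # start of 50 is assumed
print(n_intervals, round(raw_width, 1))
for k, count in enumerate(counts):
    print(f"{50 + 20 * k:>3} to <{70 + 20 * k:<3} {'*' * count}")
```

Printing a row of asterisks per interval gives a rough text histogram, and any interval with a zero count would appear as an empty row rather than being skipped.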

The values for the average annual purchases variable presented in Exhibit 13-11 were measured on a ratio scale and are easily grouped. Other variables possessing an underlying order are similarly appropriate for histograms. A histogram would not be used for a nominal variable that has no order to its categories, such as gender or occupation.

A histogram of the average annual purchases is shown in Exhibit 13-12. Each interval range for the variable of interest, average annual purchases, is shown on the horizontal axis; the frequency or number
