
Special Issue Article

Published online 4 August 2014 in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/qre.1706. Qual. Reliab. Engng. Int. 2014, 30:905–917

Using Visual Data Mining to Enhance the Simple Tools in Statistical Process Control: A Case Study
Huw D. Smith,a Fadel M. Megahed,a*† L. Allison Jones-Farmerb
and Mark Clarkb
Statistical process control (SPC) is a collection of problem-solving tools used to achieve process stability and improve process
capability through variation reduction. Because of its sound statistical basis and intuitive use of visual displays, SPC has been
extensively used in manufacturing, health care, and service industries. Deploying SPC involves both a technical aspect and
a proper environment for continuous improvement activities based on management support and worker empowerment.
Many of the commonly used SPC tools, including histograms, fishbone diagrams, scatter plots, and defect concentration
diagrams, were proposed prior to the advent of microcomputers as efficient methods to record and visualize data for single
(or few) variable(s) processes. As the volume, variety, and velocity of data continue to evolve, there are opportunities to
supplement and improve these methods for understanding and visualizing process variation. In this paper, we propose
enhancements to some of the basic quality tools that can be easily applied with a desktop computer. We demonstrate
how these updated tools can be used to better characterize, understand, and/or diagnose variation in a case study involving
a US manufacturer of structural tubular metal products. Finally, we create the quality visualization toolkit to allow
practitioners to implement some of these visualization tools without the need for training, extensive statistical background,
and/or specialized statistical software. Copyright © 2014 John Wiley & Sons, Ltd.

Keywords: animated graphs; fishbone diagram; phase I methods; statistical engineering; visual analytics

1. Introduction
The goal of data analysis is to gain understanding from the collected data. A feature of process data is that it varies over time. The information in this variation is important to the understanding of how the process is performing, and statistical process control (SPC) is the primary tool for monitoring and reducing this variation1 [p. 1]. Key tools used in SPC, and in process improvement in general, include the histogram, check sheet, Pareto chart, cause-and-effect diagram, defect concentration diagram, scatter diagram,
and the control chart. The collective use of these seven visual tools was popularized by Prof. Kaoru Ishikawa in the 1950s and 1960s.2
These tools are often referred to as the seven (basic) quality tools.3 Using these visual tools allowed factory workers to diagnose and
possibly eliminate their quality problems without detailed knowledge of statistics. This has been a well-documented reason for the
widespread adoption of these early SPC methods.3
Modern SPC applications are quite different from early SPC applications; see, for example, Megahed et al.,4 Wells et al.,5 and Wells et al.6 The volume of data being acquired from production systems continues to expand at exponential rates.7 Emerging measurement technologies (such as coordinate measuring machines, machine vision systems, and 3D surface scanners) diversify the types of data being collected, pushing data collection away from the historically low-dimensional data. The use of
computerized data acquisition systems has transformed the nature of the process monitoring and control problem, as real-time
process data are now available on hundreds of processes and product quality characteristics.5,6,8
Despite the changes in the nature and volume of process data, the seven basic quality tools remain the most widely used methods
in industry. These basic tools constitute the basis for much of the work in six sigma9 and lean manufacturing.10 Simple graphical tools
have been historically useful in understanding the nature of process variation and diagnosing the root causes of process changes, yet they
have received little attention in the literature.5 To address the importance of a more holistic view of SPC, Hoerl and Snee11 proposed a
new paradigm for quantitative approaches to quality improvement, statistical engineering. Statistical engineering is defined as ‘the

a Department of Industrial and Systems Engineering, Auburn University, Auburn, AL 36849, USA
b Department of Aviation and Supply Chain Management, Auburn University, Auburn, AL 36849, USA

*Correspondence to: Fadel M. Megahed, Department of Industrial and Systems Engineering, Auburn University, Auburn, AL 36849, USA.

E-mail: [email protected]


study of how to best use statistical concepts, methods and tools, and integrate them with IT and other relevant sciences to generate
improved results’.11
Aligned with the motivation behind statistical engineering, we explore how visual data mining tools can enhance some of the well-
known SPC graphical tools. We highlight several enhancements to the traditional seven basic tools. These enhancements were
deployed in a US manufacturer of structural tubular metal products. So that others may use these enhancements, we provide the
quality visualization toolkit (QVT) as supplemental material to this paper. The QVT contains an Excel program to create the measure
of risk and error (MORE) plot (as well as other plots discussed later) based on user input data. Microsoft Excel is a widely used software
package that can be found in almost any industry; its ease-of-use and widespread popularity among engineering practitioners made
it a compelling choice for our QVT.
We begin in Section 2 by providing background information on the field of visual data mining and discuss the importance of sound
data collection methods. In Sections 3 and 4, we discuss tools for visualizing discrete and continuous data, respectively. In Section 5, we discuss methods for visualizing relationships among variables. Advice to practitioners and concluding remarks are given in Sections 6 and 7.

2. A brief introduction to visual data mining


Visual data mining is an approach to exploratory data analysis that is based on the integration of concepts from computer science,
cognitive psychology, and data analysis to assist in uncovering trends and patterns that may be missed with other nonvisual
methods.12 Visual data mining also helps overcome one of the main limitations in data mining approaches, where the ‘data is
analyzed in a hypothesis testing mode in which one might have a priori notions about what the important results will be before
the analysis actually begins’13 [p. i]. The use of visualizations has proven to be a simple, effective, and assumption-free approach to
discover trends in the data.12–15 In addition, visualizing the associations and correlations among the data can provide a solid
foundation for statistical and mathematical modeling in the cases when additional analysis is needed.
The use of visualizations in statistical and mathematical sciences is not a new phenomenon, as it dates back to the visual proof of
the Pythagorean theorem. A more applied example can be seen in the work of John Snow, whose plots of the 1850s cholera outbreak on a map of London allowed him to discover the cause of cholera.16,3,4 Wickham17 discusses some of the historical foundations
of statistical graphics. Well-done graphical displays can help us to solve complex problems without making any assumptions or needing to understand complicated mathematical or statistical algorithms. The visual exploration of data can also be used to supplement model-based methods, and it often leads to better results, especially in situations when automated data mining algorithms fail.12
Not all graphical representations of data are useful, and some can be misleading. There are several factors/guidelines that can help in
choosing/developing informative statistical graphics. For example, Tufte18,13–15 introduced the term graphical excellence to reflect on
graphics that communicate complex ideas with clarity, precision, and efficiency. Keim et al.12 provided some general rules for expressive
and effective visualizations. The expressiveness of a visual relates to the constraint that all relevant attributes, without any others, must be
expressed by the visualization.19 Effective graphics are the ones that allow the viewer to interpret the information hidden within the data
correctly and quickly. We consider these guidelines when we develop our graphics for quality engineering applications.

3. Visualizing discrete data distributions


In this section, we highlight several tools that the authors have deployed in a leading US manufacturer of tubular metal products
while dealing with count data. The structural tubular metal product is created in a foundry where ingredients are combined and
heated to a specific temperature in a furnace. Once the mixture conditions are met and the molten metal has reached the correct
temperature, the molten metal is poured from the furnace into a ladle. The ladle is then transported to the molds, where the molten metal is poured and the tubular products are formed. Once the tubes have cooled to a certain temperature, they are
removed from the mold, and quality data are obtained. Critical-to-quality data are collected on the final product.
The data described throughout this paper is based on data obtained from this manufacturer but has been modified for
presentation to maintain confidentiality. Discrete as well as continuous measurements are of interest in metal casting operations.
Discrete measurements of interest include defect counts and the location of defects on the product. This data is used to drive quality
improvement through deeper process understanding. In many cases, charts of discrete count data might be the first indication that the process could be out of control and that the proportion nonconforming is increasing.

3.1. Pareto chart


Pareto charts are widely used to categorize defects or quality problems from the most frequently occurring to the least. One
limitation of Pareto charts is that they do not provide temporal information on the occurrence of these defects. For example,
consider the Pareto chart in Figure 1. Under the assumption that the cost exposure for all these defects is the same, a
practitioner may conclude that it is most important to investigate the root causes behind the Joint Flash problem. This
conclusion is correct only if an analysis of the temporal nature of each of these defects shows that there is no evidence of
an improved rate of the Joint Flash defect over the data collection period. However, if this defect only occurred in the first
week of the data, this conclusion may not be valid (it would still be interesting to identify why the process improved), and it may be more urgent to focus on improving a different aspect of the process. Accordingly, it is useful to develop software tools that allow users to observe temporal effects in the defect occurrence rate. This can be carried out through either animation or a plot of the incidence rate of different defects over time.

Figure 1. A Pareto chart of 1-year data from the metal casting operation
Our proposed method for adding a temporal aspect into Pareto charts includes either the use of animation (Figure 2) or a line
chart that plots the cumulative occurrence of defects over time (Figure 3). Although it is not possible to show the full effect of the
animation here, Figure 2 shows three screenshots of the Pareto chart as it develops over time. The same period of the company's casting defect data was plotted on the line chart in Figure 3, yielding valuable additional insight into the temporal nature of defect development. A quick inspection of a typical Pareto chart (Figure 1) leaves the practitioner prioritizing the blue (Joint Flash), red (sand hole), and finally green (rippling) bars as the top three defect areas to focus on. However, the temporal analyses given in Figures 2 and 3 yield an additional insight: the mechanical damage defect developed most of its occurrences in just a few small
jumps beginning at time 231. In this case, the same human error was made on several occasions within a short period. Additionally,
the practitioner gets an idea of the rate at which each defect is developing and can monitor whether the rate of change of the
defects accelerates over time. Both the animated Pareto chart and the cumulative line plot of defect occurrences can be created
in the QVT.
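For readers who want to prototype this outside the QVT, the following is a minimal sketch of the cumulative-occurrence line chart in base R. The data frame layout and the column names (time, defect_type) are hypothetical stand-ins for whatever logging format is available, and the defect log itself is simulated.

```r
# Minimal sketch of a cumulative-occurrence line chart (cf. Figure 3).
# Assumes one row per logged defect with a numeric 'time' index and a
# 'defect_type' label; both column names are hypothetical placeholders.
set.seed(10)
defect_log <- data.frame(
  time = sort(sample(1:365, 300, replace = TRUE)),
  defect_type = sample(c("Joint Flash", "Sand Hole", "Rippling", "Mechanical Damage"),
                       300, replace = TRUE)
)
types <- unique(defect_log$defect_type)
plot(NULL, xlim = range(defect_log$time),
     ylim = c(0, max(table(defect_log$defect_type))),
     xlab = "Time", ylab = "Cumulative defect count")
for (i in seq_along(types)) {
  d <- defect_log[defect_log$defect_type == types[i], ]
  lines(d$time, seq_len(nrow(d)), col = i, lwd = 2)  # step up at each occurrence
}
legend("topleft", legend = types, col = seq_along(types), lwd = 2)
```

The animated Pareto chart in the QVT is conceptually the same data, redrawn as bar heights at successive time points.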
It is important to note that the animated Pareto reveals a very important piece of information for the metal casting operation.
Rippling stays relatively flat in the middle of the analyzed period but starts to grow quickly around time 141. This pattern cannot
be extracted from the traditional Pareto chart in Figure 1. More importantly, this pattern indicates a possible tooling problem that is developing over time and needs to be addressed in order to reduce rippling defects in future products. Rippling is a common defect in metal casting operations and can be caused by a wear problem in the extrusion process.

Figure 2. Snapshots of the animated Pareto chart feature taken from the provided QVT tool

Figure 3. Cumulative line chart for the metal casting process

3.2. Defect concentration diagram


A defect concentration diagram is a graphical tool that is used to analyze the causes of defects on a part or product. The defect
concentration diagram includes a schematic drawing of the part of interest, showing all the relevant views. Various types of
defects are overlaid on the drawing, indicating the location of the defect on the part. The diagram is analyzed to see if the location
of the defects on the part provides any useful information about the potential root cause of the defects. A common approach to
using a defect concentration diagram includes regression analysis. Specifically, a general framework for the occurrence of a
defect (or event in a nonmanufacturing setting) can be represented by the following logistic regression model:20

$$\operatorname{logit}(p) = \log\left(\frac{p}{1-p}\right) = g(x) = \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k, \qquad (1)$$

where p is the defect rate and X1, …, Xk represent a comprehensive list of all potential process and product variables that can cause the defect. This comprehensive list can be obtained via a traditional fishbone diagram, machine learning techniques, and/or engineering process knowledge. Here, we focus on a visual data mining approach to generate insight without the need for advanced statistical knowledge, such as evaluating the addition of interaction terms or nonlinear terms to Eq. (1). Thus, we suggest that practitioners incorporate several of these characteristics (potentially the ones they think are most relevant) into the defect concentration diagram. An illustrative example highlighting the differences between the more traditional defect concentration diagram and our proposed diagram is shown in Figures 4 and 5, respectively. In Figure 4, practitioners are asked to specify only the location and type of the fault. In our modifications, we suggest encoding additional information, using size to specify the magnitude of the fault and color to indicate the shift during which the fault occurred. It is important to note, however, that our QVT allows encoding additional information pertaining to the predictor variables of Eq. (1). For example, symbol outline color
can be used to encode an additional variable, and animation can be used to show temporal effects. Inclusion of these additions
in a quality/data analysis environment simplifies root cause analysis, presents valuable information about the process, and provides
pointers to practitioners regarding the right course of action to take (because it provides more visual information regarding
important predictor variables).
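To make the connection to Eq. (1) concrete, the sketch below fits a logistic regression of the same form with R's glm function. The predictor names (temperature, shift, pour_time) and the simulated data are hypothetical illustrations, not the plant's actual variables.

```r
# Sketch of the logistic regression model in Eq. (1), fit with glm().
# 'temperature', 'shift', and 'pour_time' are hypothetical stand-ins for the
# X1, ..., Xk suggested by a fishbone diagram or process knowledge; the data
# are simulated purely for illustration.
set.seed(1)
n <- 500
casting <- data.frame(
  temperature = rnorm(n, mean = 1450, sd = 15),
  shift       = factor(sample(c("morning", "afternoon", "evening"), n, replace = TRUE)),
  pour_time   = runif(n, 20, 60)
)
# Simulated defect indicator: the odds of a defect increase with temperature
casting$defect <- rbinom(n, 1, plogis(-40 + 0.027 * casting$temperature))

fit <- glm(defect ~ temperature + shift + pour_time,
           family = binomial(link = "logit"), data = casting)
summary(fit)  # the coefficients correspond to beta_0, beta_1, ..., beta_k
```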
Figure 5 depicts a 360-degree view diagram of a tubular metal product produced in the casting operation. The diagram is
simplified for demonstration purposes. The casting operation performs visual inspections on all of its tubular metal products as part of its quality program to look for pinholes, cracks, and other defects. Similar to the Pareto chart, the multicharacteristic spatiotemporal defect concentration diagram uncovers information hidden by the aggregation process. Three snapshots of the visual inspection results are shown in Figure 6 in time order. The animated diagram again provides insight into the temporal nature of quality: as time goes on, the 'ridging' defect spreads in an orderly fashion along the length of the product. A quick look into the process shows that over time, the hydraulic support pistons are not supporting the tubular products evenly, causing an angle to develop during the extrusion process. This angle, caused by tool fatigue, creates a 'brushstroke', or ridging, along the tubular metal product.
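A static version of such a multicharacteristic diagram can be sketched in a few lines of base R, mapping magnitude to symbol size, shift to color, and defect type to the plotting symbol, as in Figure 5. The unrolled coordinate system and all column names are hypothetical placeholders.

```r
# Sketch of a multicharacteristic defect concentration diagram (cf. Figure 5).
# Each defect sits at its (x, y) location on an unrolled 360-degree view of the
# tube; size encodes magnitude, colour encodes shift, symbol encodes type.
set.seed(2)
defects <- data.frame(
  x         = runif(40, 0, 360),   # circumferential position (degrees)
  y         = runif(40, 0, 200),   # position along tube length (cm)
  magnitude = runif(40, 0.5, 3),   # e.g. defect size in mm
  shift     = factor(sample(c("morning", "afternoon", "evening"), 40, replace = TRUE)),
  type      = factor(sample(c("pinhole", "crack", "ridging"), 40, replace = TRUE))
)
plot(defects$x, defects$y,
     cex = defects$magnitude,              # size -> magnitude
     col = as.integer(defects$shift),      # colour -> shift
     pch = as.integer(defects$type),       # plotting symbol -> defect type
     xlab = "Circumferential position (degrees)",
     ylab = "Position along tube (cm)")
legend("topright", legend = levels(defects$shift),
       col = seq_along(levels(defects$shift)), pch = 16, title = "Shift")
```

Animating such a plot over an added time column would recover the spatiotemporal view shown in Figure 6.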
Other useful quality applications of the multicharacteristic defect concentration diagram include the identification of time-related defect clusters. The importance of such a phenomenon extends beyond metal casting operations. For example, when monitoring wafer production processes, it is well documented that defective chips often occur in clusters or display systematic patterns,21 resulting from variations in manufacturing process conditions.22


Figure 4. Traditional defect concentration diagram that only highlights the type and location of defects. This defect concentration diagram is from Montgomery3 [p. 212]

Figure 5. A multicharacteristic defect concentration diagram depicting the type of defect, its location, size, and shift number (i.e. morning, afternoon, and evening)

4. Visualizing continuous data distributions


4.1. The box plot
The box plot is a graphical display that can simultaneously depict several important features of the data, such as central tendency,
variability, departure from symmetry, and the identification of outliers. These features are shown through the ‘display of three quartiles,
the minimum and maximum of the data on a rectangular box, aligned either horizontally or vertically'3 [p. 75]. Box plots are commonly used in designed experiments because they can provide insight into differences in location and/or scale parameters among groups, as well as the presence of potential outliers. The NIST/SEMATECH e-Handbook of Statistical Methods23 states that the box plot is
… an important EDA (exploratory data analysis) tool for determining if a factor has a significant effect on the response with
respect to either location or variation. The box plot is also an effective tool for summarizing large quantities of information.
Although box plots provide excellent summaries of data, the appearance of the box plot is quite dependent on the sample size.
Further, a box plot gives no information as to statistical differences among groups or adequacy of the sample size. One solution to
these limitations is the MORE plot. First proposed by Nelson,24 the MORE plot differs from the typical box plot in several ways:
inclusion of the mean in addition to the median; the use of flexible quantiles (instead of the usual first and third quartiles); and the presentation
of confidence intervals for the quantiles. A comparison of a standard box plot versus the MORE plot is shown in Figure 7.
Figure 6. Snapshots of the animation of the multicharacteristic spatiotemporal defect concentration diagram

Figure 7. A box plot versus a MORE plot of casting temperature

This comparison is based on the casting temperature for one of the plant's furnaces. Temperature is an important factor in metal
casting operations as it can lead to defects, such as scabbing. Scabbing occurs when shells or irregular crusts form on the surface of a cast piece. As the temperature increases, the chance of scabs occurring also increases. The plots in Figure 7 are based
on 5706 observations obtained from temperature gages in the plant. The similarities between the two are visually obvious, but the
MORE plot gives the user flexibility to choose quantile values (here, we show the 5th and 95th percentiles on the MORE plot) and overlays the confidence intervals for the quantiles. We arbitrarily chose the 5th and 95th percentiles, but as Banks et al.25 stated, any percentiles (symmetric or asymmetric) could be chosen. The small shaded regions surrounding the upper and lower quantiles show 95% confidence intervals created using the large-sample formula of Banks et al.25
While there are some similarities between the two plots in Figure 7, the MORE plot can be more informative. Through its inclusion of confidence intervals, the MORE plot helps in verifying sample size adequacy, which is important for determining the amount of estimation error. For example, the tight confidence intervals depicted in Figure 7 indicate that the sample size for estimating
temperature is adequate and that the estimation error is negligible. This is not surprising because a sample size of ~5000 is often
considered to be sufficient to estimate the mean of a continuous variable.
Figure 8 shows a MORE plot of a random sample (without replacement) of 500 observations from the temperature data. When
comparing this plot with the MORE plot shown in Figure 7, it is clear that the width of the confidence regions for the quantiles is much
larger in Figure 8, indicating the presence of more sampling variability due to the smaller sample size. Although not used for
inference, this can help the user to better frame interpretations from the plots in terms of sample size and expected sampling
fluctuations. Plots similar to those in Figures 7 and 8 can be created in the QVT.
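For readers who want to reproduce the quantile confidence intervals outside the QVT, the sketch below computes a distribution-free, large-sample interval for a chosen quantile from the order statistics, in the spirit of the formula of Banks et al.25 (the exact expression implemented in the QVT may differ). The temperatures are simulated because the plant data are confidential.

```r
# Sketch of a large-sample, distribution-free confidence interval for a
# quantile, as overlaid on the MORE plot. The order-statistic indices follow
# the usual normal approximation: n*p -/+ z * sqrt(n*p*(1-p)).
quantile_ci <- function(x, p, conf = 0.95) {
  x <- sort(x)
  n <- length(x)
  z <- qnorm(1 - (1 - conf) / 2)
  lo <- max(1, floor(n * p - z * sqrt(n * p * (1 - p))))
  hi <- min(n, ceiling(n * p + z * sqrt(n * p * (1 - p))))
  c(estimate = quantile(x, p, names = FALSE), lower = x[lo], upper = x[hi])
}

temps <- rnorm(5706, mean = 1450, sd = 12)  # simulated casting temperatures
quantile_ci(temps, 0.05)                    # 5th percentile with 95% CI
quantile_ci(temps, 0.95)                    # 95th percentile with 95% CI
quantile_ci(sample(temps, 500), 0.95)       # wider interval for n = 500 (cf. Figure 8)
```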
To take advantage of the time ordering of SPC data, we introduce the concept of an animated MORE plot. Although we cannot
present the animation directly in the paper, we have provided the animation tool in the QVT. The animated MORE plot gives the visual
appearance of the MORE plot but is animated to show distributional changes over time. In Figure 9, we give a snapshot of six
consecutive points in the MORE animation of the temperature data using the same quantiles and confidence interval settings. When
viewed in succession, the plots give a clear picture of how the process variability is changing. These plots were used to visualize the
changing casting temperature distribution at certain time intervals (to correlate with scabbing frequency). For processes measured
by continuous variables, the animated MORE plot provides clear benefits and allows the user to quantify the impact of low precision.

Figure 8. The MORE plot for a random sample of 500 observations from the casting temperature dataset

Figure 9. A snapshot of six consecutive periods in a casting temperature data animation
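Building on the quantile_ci sketch above, one rough way to reproduce the animated MORE plot is to recompute the same summaries over successive time windows and redraw the display for each window; the window size of 500 observations and the plotting layout are arbitrary illustrative choices.

```r
# Sketch of the animated MORE plot: reuse quantile_ci() and temps from the
# sketch above, recompute the MORE summaries over successive windows of 500
# observations, and redraw; the redraw loop is what produces the animation.
window_size <- 500
starts <- seq(1, length(temps) - window_size + 1, by = window_size)
for (s in starts) {
  w  <- temps[s:(s + window_size - 1)]
  lo <- quantile_ci(w, 0.05)
  hi <- quantile_ci(w, 0.95)
  plot(1, mean(w), xlim = c(0.5, 1.5), ylim = range(temps), xaxt = "n",
       xlab = "", ylab = "Casting temperature", pch = 19,
       main = paste("Observations", s, "to", s + window_size - 1))
  points(1, median(w), pch = 4)                              # median
  segments(1, lo["estimate"], 1, hi["estimate"])             # quantile span
  rect(0.95, lo["lower"], 1.05, lo["upper"], col = "grey")   # CI for 5th percentile
  rect(0.95, hi["lower"], 1.05, hi["upper"], col = "grey")   # CI for 95th percentile
  Sys.sleep(0.5)                                             # crude animation
}
```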

4.2. The histogram


The histogram graphically displays information regarding the distribution of quantitative variables, including shape, location, and scale. The histogram can also indicate the presence of potential outliers and multiple modes within the data. As with the box plot, there are opportunities for making the histogram more informative by incorporating additional information. For example, several
statistical software packages allow users to fit a distribution to the histogram to better characterize the shape of the distribution, as seen
in Figure 10. In many cases, we do not find this feature to be effective, and sometimes, it can be misleading. For example, it is not clear
whether the shape of the histogram matches the normal fit in Figure 10. A discussion on how traditional histograms can be improved
can be found in Potter et al.,26 who created the summary plot to combine information from the box plot and the histogram.
Similar to the box plot, the histogram only provides information about the data distribution at a static point in time. In our QVT, we
have provided an animated histogram that allows the practitioner to see the binning of the data develop over time. Understanding
the temporal behavior of a quality metric provides valuable insight into what might be causing certain types of quality problems or defects, information on the rates at which certain defects change relative to others, and a much clearer understanding of what happened in the process during a certain period. An animated histogram allows the user to glean additional knowledge about their process by giving clues to the cause of problems or changes. The three snapshots in Figure 11 show a histogram of metal product weight in the midst of animation. The practitioner would expect to see the bins fill up in a random fashion, tending to normalize over time. However, in this application, there is a process shift around two-thirds of the way through the animation. This could indicate tool wear, a shift in casting temperature, or some other quality variable that merits investigation. Detecting the exact time a shift takes place in this specific example could yield many new insights into what caused the change: a sudden shift in the elemental composition of the pour mixture, a wear problem with the pour mechanism, a shift change (operator error), and so on. Insights such as these can
minimize the time the practitioner spends on exploratory data analysis, leading to faster diagnosis and corrective action.
In addition to the animated histogram, our QVT provides the user the option to overlay confidence intervals of selected quantiles of
the distribution onto a static histogram. This is similar to the use of the MORE box plot but with the visual depiction of the histogram.
Figure 12 shows an example histogram of the cast weight data overlaid with 95% confidence intervals of the fifth and 95th quantiles and
the mean of the distribution. In order to further explore the temporal nature of the histogram, we have also supplemented the MORE
histogram with a cumulative frequency plot for each histogram bin. The cumulative frequency plot of the bins provides an analysis similar to that of the animated histogram, allowing the practitioner to see how the frequency in each bin develops over time, plotting each bin's rate of development as a line. We do not provide a figure here, as we have provided a similar plot in Figure 3 along with analysis.
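As a sketch of this idea, the code below fixes the histogram bins from the full data range and plots each bin's running count against observation order; the simulated weights include a late shift in the mean to mimic the pattern discussed above.

```r
# Sketch of the cumulative frequency plot for histogram bins: observations are
# assumed to be in time order, bins are fixed from the overall range, and each
# bin's running count is drawn as a line.
set.seed(3)
weights <- c(rnorm(900, 50, 1), rnorm(444, 52, 1))   # simulated process shift
breaks  <- pretty(range(weights), n = 10)
bin     <- cut(weights, breaks = breaks, include.lowest = TRUE)

cum_counts <- sapply(levels(bin), function(b) cumsum(bin == b))
matplot(seq_along(weights), cum_counts, type = "l", lty = 1,
        col = seq_len(ncol(cum_counts)),
        xlab = "Observation order (time)", ylab = "Cumulative bin frequency")
legend("topleft", legend = levels(bin), lty = 1,
       col = seq_len(ncol(cum_counts)), cex = 0.6)
```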


Figure 10. A histogram of 100 normally distributed points N(10, 1), with a fitted distribution

Figure 11. Snapshots of the animated MORE histogram of casting weight

Figure 12. Overlaid percentiles with confidence intervals to a histogram of 1344 casting operation weights

5. Visualizing the relationships among quality variables


In this section, we discuss additional visualization tools to better understand relationships among variables. The metal casting dataset
will be used to demonstrate the power of classical visual data mining techniques (classification and regression trees) when used in
conjunction with the fishbone diagram.


5.1. Scatter plot


A scatter plot is a standard technique to show potential relationships between two variables. Recently, Xie27 suggested that incorporating animation can facilitate the identification of patterns in 2D plots. It is worth noting that Xie27 developed an R package, animation, which contains about 30 functions related to statistical analysis and simulation modeling. The reader is encouraged to read Xie's work on a computer because some animations are embedded in the PDF. In addition to Xie,27 the animated 'motion bubble chart' made famous by Rosling28,29 makes possible the visualization of multidimensional data, including a time dimension (Figure 13). Similar to a basic scatter plot, points are plotted on the X- and Y-axes of a chart as bubbles. The size of the bubbles defines the third dimension, and the colors of the bubbles may indicate separate data points or may be categorized according to a fourth grouping variable. The fifth and final dimension contained in an animated bubble chart is time. Animation provides the ability to visualize the way
the relationships among variables change over time. In Figure 13, we perform exploratory data analysis on the metal casting operation
using the motion bubble chart. For confidentiality reasons, we cannot publish the identities of the element ratios. Comparing
the ratio of two elements (x- and y-axes) over time and temperature (size) allows the user to draw inferences about the cause of failures
in the casting operation. Although we cannot present the animation directly in the paper, a snapshot from one time point is provided.
Rosling28,29 has many examples of such charts on his website, www.gapminder.org. For quality engineering applications, such
charts can be used to visualize the variation among several variables. This type of chart allows one to understand general trends
among multiple variables on a 2D chart. We do not include the motion bubble chart in our QVT because there are numerous freely
available applications to create these charts. Figure 13 was created using the free Google spreadsheet application. For a detailed introduction on the use of that application, the reader is referred to Jacks.30

Figure 13. An example of a motion bubble chart comparing two element ratios over time, using temperature as the size parameter. The colors of the bubbles correspond to a pass/fail state
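A single time slice of such a bubble chart can also be sketched in base R (the full animation is obtained by looping over time slices); the variable names and simulated values below are hypothetical.

```r
# Sketch of one time slice of a motion bubble chart (cf. Figure 13): two
# element ratios on the axes, temperature mapped to bubble size, and the
# pass/fail state mapped to colour. Looping over time slices gives the animation.
set.seed(4)
slice <- data.frame(
  ratio_a     = runif(30, 0.8, 1.2),
  ratio_b     = runif(30, 1.5, 2.5),
  temperature = runif(30, 1430, 1470),
  pass        = sample(c(TRUE, FALSE), 30, replace = TRUE, prob = c(0.8, 0.2))
)
size <- 1 + 3 * (slice$temperature - min(slice$temperature)) /
              diff(range(slice$temperature))        # rescale temperature to cex
plot(slice$ratio_a, slice$ratio_b, cex = size, pch = 19,
     col = ifelse(slice$pass, "steelblue", "firebrick"),
     xlab = "Element ratio A", ylab = "Element ratio B")
legend("topright", legend = c("pass", "fail"), pch = 19,
       col = c("steelblue", "firebrick"))
```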

5.2. Fishbone (cause-and-effect) diagram


The fishbone diagram was developed as a tool for analyzing potential root causes of a quality problem. Typically, the fishbone diagram is used as a brainstorming tool by a quality improvement team in the analyze and improve steps of the define, measure, analyze, improve, and control process3 [p. 210]. There are currently several automated tools that allow practitioners to generate computer-based fishbone diagrams. For such tools, the practitioner decides what goes into each of the bones, that is, the potential causes under the machine, materials, methods, measurement, and man groupings. Once the diagram is constructed,
in-depth discussion helps the group to reach consensus on the most likely cause of the quality problem and then investigate which
of the underlying factors led to the occurrence of the problem.
Although widely used by practitioners and quite effective in many situations, the fishbone diagram may not be useful in all cases. When the process is complex or the quality team is substantially biased towards unsubstantiated causes of
problems, it would be more beneficial to turn to the data for answers. Modern data analytic and computing tools allow the use of
machine learning methods15,31 such as classification and regression trees (CART) for fault identification and diagnosis. With data of
sufficient quality, these methods can be used to assign likelihood values for each possible root cause and may help to uncover
new insights into the source of quality problems more effectively than a fishbone diagram alone can.
Over the last few years, many software packages allowing high-powered data mining have become available and are beginning to
be used in data environments (manufacturing, health care, financial institutions, etc.). However, ‘big data’ and ‘data mining’ are
growing rapidly and many people are unfamiliar with the techniques. The ability to create a predictive model (using classification
and regression tree algorithms) to predict a dependent quality variable for a process based on several independent variables
presents a significant opportunity in the fields of quality, SPC, and data visualization. Understanding what combinations of
independent variables influence a dependent quality variable is beneficial in several ways: it deepens process understanding and offers the potential to reduce inspection costs, because the process data management/capture system can make predictions from the model in the inspector's place. A fundamental understanding of data mining is required to create such models with the currently available
software packages. Examples of this type of software include IBM’s SPSS Modeler, StatSoft’s Statistica, XLSTAT’s Excel Add-in, JMP
Pro, and SAS Enterprise Miner.
To illustrate the visualization power and intuitive appeal of CART methods, we used Statistica to generate a CART model for
chemistry data from the metal casting operation. The model predicts a categorical dependent variable with two possible values: pass
or fail. Chemical composition values obtained from spectroscopy were used as the 15 continuous predictor variables. It is important to
note here that CART does not limit the practitioner to a categorical response and can handle continuous responses also. Figure 14
shows a CART model that can help predict/distinguish pass or fail performance of tubular metal products.
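Although we used Statistica to produce Figure 14, an analogous tree can be grown in R with the rpart package, as sketched below; the element columns and the simulated pass/fail rule are hypothetical stand-ins for the confidential chemistry data. Restricting maxdepth keeps the displayed tree small, mirroring the reduced number of nodes in Figure 14.

```r
# Sketch of a pass/fail classification tree analogous to Figure 14, grown with
# the rpart package. The 15 element columns and the rule driving the simulated
# outcome (mainly element 12) are hypothetical.
library(rpart)
set.seed(5)
n <- 1000
chem <- as.data.frame(matrix(rnorm(n * 15), ncol = 15,
                             dimnames = list(NULL, paste0("element_", 1:15))))
chem$outcome <- factor(ifelse(chem$element_12 > 0 |
                              (chem$element_3 > 1 & chem$element_7 < 0),
                              "pass", "fail"))
tree <- rpart(outcome ~ ., data = chem, method = "class",
              control = rpart.control(maxdepth = 3))  # keep the tree small
plot(tree, margin = 0.1)                              # rule-based flow chart
text(tree, use.n = TRUE)                              # node labels with counts
```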
In Figure 14, one can see that the ‘rule-based flow-chart’ type of analysis provided by the tree is very intuitive to understand. From
the 15 elements used to predict a pass/fail outcome for a tubular metal product, it is instantly obvious that element 12 is extremely
important to the final quality outcome of the product. The first level of the tree splits based on element 12 having a high or low value.

Figure 14. An example of the output from a classification and regression tree analysis. This tree contains far fewer nodes than usual for demonstration purposes; as many splits and nodes as desired can be displayed with most software packages


For the high value, the CART model predicts almost entirely ‘pass’ outcomes. For the low value, the model predicts a more equal ratio
of pass versus fail. As the user traces further down the tree, it indicates various levels of elements that, in combination, lead to a
predicted 'fail' outcome. The level of classification and the number of nodes present in a CART model are flexible, so more pinpointed
classification is possible if desired. While the fishbone diagram identifies potential causes and effects of problems, the CART model
identifies the values of the independent variables that are related to the different quality classifications and gives the practitioner
information about which variables are most important to quality ratings. This gives the practitioner a simple, visual, and objective
method for identifying potential root causes using data-driven approaches.
Similar data-driven methods are currently deployed in several large manufacturing corporations. For example, in the 1990s,
General Electric and Snecma S.A. developed the CASSIOPEE troubleshooting system to diagnose and predict problems in the Boeing
737 and Airbus A340s.31,32 Their system was based on deriving families of faults using clustering methodologies. It proved to be very
successful; it was adopted by three major European airlines31 and received the European first prize for innovative applications.33 For
an introduction on using machine learning approaches in fault diagnosis, we refer the reader to Isermann.34,35 It is important to note
that the CASSIOPEE troubleshooting system is made possible by computerized data management systems. We expect the impact of data management systems on fault identification and diagnosis to continue to grow; this remains an important area for statistics research, as highlighted by Nair et al.36

6. Advice to practitioners
In this paper, we discuss potential improvements to the use of simple tools and graphics in SPC. Graphics remain a valuable exploratory data analysis toolset, but practitioners should use caution in making decisions based on graphical tools alone. For example, any of the spatiotemporal visualizations should be used to gain insight into potential factors of interest. However,
analysis based on observational data should not be taken to infer a causal relationship among variables. Conclusions indicating causal
relationships can only be obtained with carefully designed and controlled experiments, and even then, causal inference is difficult to
confirm. For an excellent discussion on the criteria required to infer causality in epidemiological studies, see Hill.37 Although written
primarily for epidemiological studies, many of the criteria relate to more general research domains.
To use many of the methods described in this paper, practitioners may need to explore using multiple software packages simultaneously. A positive aspect of the field of visual data mining is that development and updates occur quite frequently and that the field is heavily dependent on open-source software. Because of cost and other limitations, a significant portion of the SPC and statistics research community has migrated to the open-source R environment for computations and data analyses. We
believe that R and similar open-source technologies can allow for more flexibility in deploying state-of-the-art research into
practice. That being said, any migration of data management systems should be carried out with caution to avoid problems.
There exist several 'top' lists of visualization software online; see, for example, Suda.38 These lists are useful for generating insights and learning about the wide variety of tools that are available online. However, the choice of software (or data management systems) remains application dependent. For the tools developed/discussed in this paper, we chose Microsoft Excel because of its widespread popularity among engineering/business practitioners and because it was the application of choice of our customer in the case study.

7. Concluding remarks
In this paper, we provided an overview of visual data mining and the role it can play in transforming our understanding of different
data types (continuous, discrete, multivariate, and relationship among different variables). We also developed an Excel-based toolkit
that provides several enhancements over the basic quality tools. We encourage more research and case studies in how statistical
graphics can be used to explore SPC data and/or to communicate the results from statistical models and experiments to practitioners.
Work is needed on developing strategies for generating insights in big data analytics applications. In particular, additional methods are
necessary to diagnose control chart signals. The combination of visual data mining with control charting principles may prove helpful
in fault identification and diagnosis, as shown in Wells et al.5 Bersimis et al.39 provided an overview of the visualization tools used in
interpreting out-of-control signals in multivariate SPC charts. There remains significant work to be done in this area, including
integrating these methodologies into open-source software, better understanding the limitations of these approaches, and ensuring
that these approaches can be generalized to multiple domains.

Acknowledgements
The work of the first two authors was partially supported by the NIOSH Deep South Center for Occupational Safety and Ergonomics,
grant number G00007701. The authors would like to thank Mohammad Ansari, Ashkan Negahban, and Prof. Jeffrey S. Smith for their
valuable insights regarding the MORE plots. We would also like to thank Ali Dag for his valuable discussions on machine learning with
the first author. Prof. Jeffrey S. Smith is the Joe W. Forehand Jr. Professor of Industrial and Systems Engineering at Auburn University,
and the other aforementioned three researchers are PhD students at Auburn University.


Supplemental material
The QVT is located at: http://www.eng.auburn.edu/users/fmm0002/QVT_Smith_Program.zip. The contents of this link will be
continuously updated based on users’ feedback.

References
1. Stapenhurst T. Mastering Statistical Process Control: A Handbook for Performance Improvement Using Cases. 2005; Amsterdam; Boston: Elsevier
Butterworth-Heinemann. xxxv.
2. Big Data - What is it? | SAS. 2013. Available from: http://www.sas.com/big-data/. Last accessed on: 5/8/2013.
3. Montgomery DC. Introduction to Statistical Quality Control. 7th ed. 2013; Hoboken, NJ: Wiley. 754.
4. Megahed FM, Woodall WH, Camelio JA. A review and perspective on control charting with image data. Journal of Quality Technology 2011; 43(2):83–98.
5. Wells LJ, Megahed FM, Camelio JA, Woodall WH. A framework for variation visualization and understanding in complex manufacturing systems.
Journal of Intelligent Manufacturing 2012; 23(5):2025–2036.
6. Wells LJ, Megahed FM, Niziolek CB, Camelio JA, Woodall WH. Statistical process monitoring approach for high-density point clouds. Journal of
Intelligent Manufacturing 2013; 24(6):1267–1279.
7. What is Big Data? | SAS. 2014. Available from: http://www.sas.com/en_us/insights/big-data/what-is-big-data.html. Last accessed on 07/17/2013.
8. Megahed FM, Jones-Farmer LA. Statistical perspectives on ‘Big Data’. In XIth International Workshop on Intelligent Statistical Quality Control. 2013;
Sydney, Australia: Physica-Verlag Heidelberg.
9. Pyzdek T, Keller PA. The Six Sigma Handbook: A Complete Guide for Green Belts, Black Belts, and Managers at all Levels. 3rd ed. 2010; New York:
McGraw-Hill Companies. xii.
10. Spann MS, Adams M, Rahman M, Czarnecki H, Schroer BJ. Transferring Lean Manufacturing to Small Manufacturers: The Role of NIST-MEP. 1999;
University of Alabama in Huntsville. Available from: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.201.6147&rep=rep1&type=pdf. Last
accessed on 07/10/2013.
11. Hoerl RW, Snee RD. Closing the gap. Quality Progress 2010; 43(5):52–53.
12. Keim DA, Müller W, Schumann H. Visual data mining. State of the art report. In Eurographics’ 2002. 2002; Saarbruecken, Germany: Eurographics Association.
13. Simoff S, Böhlen MH, Mazeika A, Visual Data Mining: Theory, Techniques and Tools for Visual Analytics. Vol. 4404. 2008; Springer.
14. Greitzer, FL, Noonan CF, Franklin LR. Cognitive Foundations for Visual Analytics. 2011, Pacific Northwest National Laboratory. Available from: http://
www.pnl.gov/main/publications/external/technical_reports/PNNL-20207.pdf. Last accessed on 08/01/2013.
15. Han J, M Kamber, Data Mining: Concepts and Techniques. 3rd ed. 2011; Burlington, MA: Elsevier. 703.
16. Rajaraman A, Leskovec J, Ullman JD. Mining of Massive Datasets. 2012, Cambridge University Press: New York. Available from: http://i.stanford.edu/
~ullman/mmds/book.pdf. Last accessed on 08/01/2013.
17. Wickham H. Graphical criticism: some historical notes. Journal of Computational and Graphical Statistics 2013; 22(1):38–44. DOI: 10.1080/
10618600.2012.761140
18. Tufte ER. The Visual Display of Quantitative Information. 1983, Cheshire, Conn. (Box 430, Cheshire 06410): Graphics Press.
19. Mackinlay J. Automating the design of graphical presentations of relational information. ACM Transactions on Graphics 1986; 5(2):110–141. DOI:
10.1145/22949.22950
20. Steiner S, MacKay RJ. Effective monitoring of processes with parts per million defective. A hard problem!. In Frontiers in Statistical Quality Control 7,
H-J Lenz, P-T Wilrich, Editors. 2004, Physica-Verlag HD, 140–149.
21. Hansen MH, Nair VN, Friedman DJ. Monitoring wafer map data from integrated circuit fabrication processes for spatially clustered defects.
Technometrics 1997; 39(3):241–253. DOI: 10.2307/1271129
22. Cunningham SP, MacKinnon S. Statistical methods for visual defect metrology. IEEE Transactions on Semiconductor Manufacturing 1998; 11(1):48–53.
DOI: 10.1109/66.661284
23. NIST/SEMATECH e-handbook of statistical methods. 2012. Available from: http://www.itl.nist.gov/div898/handbook/eda/section3/boxplot.htm. Last
accessed on 03/10/2014.
24. Nelson BL. The MORE plot: displaying measures of risk & error from simulation output. In Proceedings of the 40th Conference on Winter Simulation.
2008. Winter Simulation Conference.
25. Banks J, Carson JS, Nelson BL, Nicol DM. Discrete-event system simulation. In Prentice-Hall International Series in Industrial and Systems
Engineering. 4th ed. 2005; Upper Saddle River, NJ: Pearson Prentice Hall. xvi.
26. Potter K, Kniss J, Riesenfeld R, Johnson CR. Visualizing summary statistics and uncertainty. In Computer Graphics Forum. 2010; Wiley Online Library.
27. Xie YH. Animation: an R package for creating animations and demonstrating statistical methods. Journal of Statistical Software 2013; 53(1):1–27.
28. Rosling H, Zhang Z. Health advocacy with Gapminder animated statistics. Journal of Epidemiology and Global Health 2011; 1(1):11–14. DOI: 10.1016/j.jegh.2011.07.001.
29. Rosling H. Hans Rosling: the best stats you have ever seen. Filmed February 2006. Available from: http://www.ted.com/talks/
hans_rosling_shows_the_best_stats_you_ve_ever_seen.html. Last accessed on: 08/20/2013.
30. Jacks J. Visualization: Student Visas in the US 2000-2008. 2013; Auburn University: Auburn, AL, USA. Available from: http://auburnbigdata.blogspot.
com/2013/02/visualization-student-visas-in-us-2000.html. Last accessed on 08/01/2013.
31. Fayyad U, Piatetsky-Shapiro G, Smyth P. From data mining to knowledge discovery in databases. AI Magazine 1996; 17(3):37–54.
32. Esprit Project 21522 – CASSIOPEE: improving methodologies for efficiently designing decision support software for aircraft maintenance. 1996
Available from: http://cordis.europa.eu/esprit/src/21522.htm. Last accessed on: 8/13/2013.
33. Manago M, Auriol E. Mining for OR: case-based reasoning and data mining techniques show their mettle in a number of real-world applications.
ORMS Today (Special Issue on Data Mining) 1996; 23(1):28–32.
34. Isermann R. Supervision, fault-detection and fault-diagnosis methods – An introduction. Control Engineering Practice 1997; 5(5):639–652.
35. Isermann R. Model-based fault-detection and diagnosis – status and applications. Annual Reviews in Control 2005; 29(1):71–85.
36. Nair V, Hansen M, Shi J. Statistics in advanced manufacturing. Journal of the American Statistical Association 2000; 95(451):1002–1005. DOI: 10.2307/
2669486
37. Hill AB. The environment and disease: association or causation? Proceedings of the Royal Society of Medicine 1965; 58:295–300.
38. Suda B. The top 20 data visualization tools. 09/17/2012. Available from: http://www.netmagazine.com/features/top-20-data-visualisation-tools.
Last accessed on: 8/15/2013.

39. Bersimis S, Panaretos J, Psarakis S. Multivariate statistical process control charts and the problem of interpretation: a short overview and some
applications in industry. In Proceedings of the 7th Hellenic European Conference on Computer Mathematics and its Applications. 2005. Athens, Greece.


Authors' biographies
Huw D. Smith is a graduate student in the Department of Industrial and Systems Engineering and is also pursuing a dual MBA degree
at Auburn University. This research was completed, while he was an undergraduate research student in the Department of Industrial
and Systems Engineering at Auburn University. His research interests include visual analytics, SPC, image monitoring, lean
manufacturing, big data, business analytics, and simulation.
Fadel M. Megahed is an Assistant Professor in the Department of Industrial and Systems Engineering at Auburn University. He
received his PhD and MS in Industrial and Systems Engineering from Virginia Tech, and his BS in Mechanical Engineering from the
American University in Cairo. He is the recipient of the Mary G. and Joseph Natrella Scholarship (2012) from the American Statistical
Association. His research interests are in the areas of data analytics, data visualization, statistical quality control, and reliability. His
work in these areas has been funded by the NIOSH Deep South Center for Occupational Safety and Ergonomics, Proctor and Gamble
(P&G) Fund of the Greater Cincinnati Foundation, Amazon Web Services (AWS), Windows Azure (Microsoft), and the National Science
Foundation.
L. Allison Jones-Farmer is the C&E Smith Associate Professor of analytics and statistics in the Raymond J. Harbert College of Business
at Auburn University. Her main research interests include business analytics, statistical surveillance methodologies, and multivariate
statistical methods. Currently, Professor Jones-Farmer is studying methods for monitoring multiple social media streams and data
quality. She is the founding director of the Auburn University Business Analytics Lab (AUBAL). Dr. Jones-Farmer has published her
work in several scholarly journals including Technometrics, Journal of Quality Technology, Quality and Reliability Engineering
International, and International Journal of Logistics Management. She served as an Associate Editor for Technometrics from
2001–2005 and is on the editorial review board for Journal of Quality Technology.
Mark Clark has a dual appointment in the Raymond J. Harbert College of Business at Auburn University. He serves as a Management
Scientist in the Auburn Technical Assistance Center, and he also serves as a visiting Assistant Professor in the Aviation and Supply
Chain Management Department. Dr Clark has a PhD in Industrial Engineering and specializes in the area of production and operations
management. He has 8 years of experience in the pulp and paper industry and served as a process engineer for approximately 3 years
before joining the management team at one of the International Paper production facilities. Dr Clark has co-authored numerous
articles accepted in Annals of Operations Research, Computers and Operations Research, and Forest Science. Dr Clark balances his
time between the classroom and his activities in the outreach office. For the last 6 years, Dr Clark has focused, primarily, on
implementing the six sigma process improvement methodologies in manufacturing companies located in Alabama. In the classroom,
Dr Clark teaches quality management, project management, and quantitative methods at both the graduate and undergraduate
levels. He is also an instructor in the Executive MBA Program. Dr Clark has been in his current role since 1998.


