
36. Data Mining Methods and Applications

36.1 The KDD Process
36.2 Handling Data
     36.2.1 Databases and Data Warehousing
     36.2.2 Data Preparation
36.3 Data Mining (DM) Models and Algorithms
     36.3.1 Supervised Learning
     36.3.2 Unsupervised Learning
     36.3.3 Software
36.4 DM Research and Applications
     36.4.1 Activity Monitoring
     36.4.2 Mahalanobis–Taguchi System
     36.4.3 Manufacturing Process Modeling
36.5 Concluding Remarks
References

In this chapter, we provide a review of the knowledge discovery process, including data handling, data mining methods and software, and current research activities. The introduction defines and provides a general background to data mining and knowledge discovery in databases. In particular, the potential for data mining to improve manufacturing processes in industry is discussed. This is followed by an outline of the entire process of knowledge discovery in databases in the second part of the chapter.

The third part presents data handling issues, including databases and preparation of the data for analysis. Although these issues are generally considered uninteresting to modelers, the largest portion of the knowledge discovery process is spent handling data. It is also of great importance since the resulting models can only be as good as the data on which they are based.

The fourth part is the core of the chapter and describes popular data mining methods, separated as supervised versus unsupervised learning. In supervised learning, the training data set includes observed output values ("correct answers") for the given set of inputs. If the outputs are continuous/quantitative, then we have a regression problem. If the outputs are categorical/qualitative, then we have a classification problem. Supervised learning methods are described in the context of both regression and classification (as appropriate), beginning with the simplest case of linear models, then presenting more complex modeling with trees, neural networks, and support vector machines, and concluding with some methods, such as nearest neighbor, that are only for classification. In unsupervised learning, the training data set does not contain output values. Unsupervised learning methods are described under two categories: association rules and clustering. Association rules are appropriate for business applications where precise numerical data may not be available, while clustering methods are more technically similar to the supervised learning methods presented in this chapter. Finally, this section closes with a review of various software options.

The fifth part presents current research projects, involving both industrial and business applications. In the first project, data is collected from monitoring systems, and the objective is to detect unusual activity that may require action. For example, credit card companies monitor customers' credit card usage to detect possible fraud. While methods from statistical process control were developed for similar purposes, the difference lies in the quantity of data. The second project describes data mining tools developed by Genichi Taguchi, who is well known for his industrial work on robust design. The third project tackles quality and productivity improvement in manufacturing industries. Although some detail is given, considerable research is still needed to develop a practical tool for today's complex manufacturing processes.

Finally, the last part provides a brief discussion on remaining problems and future trends.

Data mining (DM) is the process of exploration and analysis, by automatic or semiautomatic means, of large quantities of data to discover meaningful patterns and rules [36.1]. Statistical DM is exploratory data analysis with little or no human interaction using computationally feasible techniques, i.e., the attempt to find unknown interesting structure [36.2]. Knowledge discovery in databases (KDD) is a multidisciplinary research field for nontrivial extraction of implicit, previously unknown, and potentially useful knowledge from data [36.3]. Although some treat DM and KDD equivalently, they can be distinguished as follows. The KDD process employs DM methods (algorithms) to extract knowledge according to the specifications of measures and thresholds, using a database along with any necessary preprocessing or transformations. DM is a step in the KDD process consisting of particular algorithms (methods) that, under some acceptable objective, produce particular patterns or knowledge over the data. The two primary fields that develop DM methods are statistics and computer science. Statisticians support DM by mathematical theory and statistical methods, while computer scientists develop computational algorithms and relevant software [36.4]. Prerequisites for DM include: (1) advanced computer technology (large CPU, parallel architecture, etc.) to allow fast access to large quantities of data and enable computationally intensive algorithms and statistical methods; (2) knowledge of the business or subject matter to formulate the important business questions and interpret the discovered knowledge.

With competition increasing, DM and KDD have become critical for companies to retain customers and ensure profitable growth. Although most companies are able to collect vast amounts of business data, they are often unable to leverage this data effectively to gain new knowledge and insights. DM is the process of applying sophisticated analytical and computational techniques to discover exploitable patterns in complex data. In many cases, the process of DM results in actionable knowledge and insights. Examples of DM applications include fraud detection, risk assessment, customer relationship management, cross selling, insurance, banking, retail, etc.

While many of these applications involve customer relationship management in the service industry, a potentially fruitful area is performance improvement and cost reduction through DM in industrial and manufacturing systems. For example, in the fast-growing and highly competitive electronics industry, total revenue worldwide in 2003 was estimated to be $900 billion, and the growth rate is estimated at 8% per year (www.selectron.com). However, economies of scale, purchasing power, and global competition are making the business such that one must either be a big player or serve a niche market. Today, extremely short life cycles and constantly declining prices are pressuring the electronics industry to manufacture their products with high quality, high yield, and low production cost.

To be successful, industry will require improvements at all phases of manufacturing. Figure 36.1 illustrates the three primary phases: design, ramp-up, and production. In the production phase, maintenance of a high performance level via improved system diagnosis is needed. In the ramp-up phase, reduction in new product development time is sought by achieving the required performance as quickly as possible. Market demands have been forcing reduced development time for new product and production system design. For example, in the computer industry, a product's life cycle has been shortened to 2–3 years recently, compared to a life cycle of 3–5 years a few years ago. As a result, there are a number of new concepts in the area of production systems, such as flexible and reconfigurable manufacturing systems. Thus, in the design phase, improved system performance integrated at both the ramp-up and production phases is desired. Some of the most critical factors and barriers in the competitive development of modern manufacturing systems lie in the largely uncharted area of predicting system performance during the design phase [36.5, 6]. Consequently, current systems necessitate that a large number of design/engineering changes be made after the system has been designed.

Fig. 36.1 Manufacturing system development phases: define and validate product (KPCs), define and validate process (KCCs), design and refinement (KPCs, KCCs), launch/ramp-up (KPCs, KCCs), and production; product and process design time plus ramp-up make up the lead time. KPCs = key product characteristics; KCCs = key control characteristics.

At all phases, system performance depends on many manufacturing process stages and hundreds or thousands of variables whose interactions are not well understood. For example, in the multi-stage printed circuit board (PCB) industry, the stages include process operations, such as paste printing, chip placement, and wave soldering, and also include test operations, such as optical inspection, vision inspection, and functional test. Due to advancements in information technology, sophisticated software and hardware technologies are available to record and process huge amounts of daily data in these process and testing stages. This makes it possible to extract important and useful information to improve process and product performance through DM and quality improvement technologies.

36.1 The KDD Process


The KDD process consists of four main steps:

1. Determination of business objectives,
2. Data preparation,
   a) create target data sets,
   b) data quality, cleaning, and preprocessing,
   c) data reduction and projection,
3. Data mining,
   a) identify DM tasks,
   b) apply DM tools,
4. Consolidation and application,
   a) consolidate discovered knowledge,
   b) implement in business decisions.

Fig. 36.2 Data preparation flow chart: driven by the business objectives, data from the source systems (legacy and external systems) pass through the steps identify data needed and sources, extract data from source systems, and cleanse and aggregate data, producing the model discovery file and the model evaluation file.

Fig. 36.3 Data mining flow chart: explore the data, construct the model from the model discovery file, evaluate the model on the model evaluation file, and transform the model into a usable format, yielding ideas, reports, and models.

Fig. 36.4 Consolidation and application flow chart: extract knowledge from the ideas, reports, and models, communicate/transport the knowledge to a knowledge database, and make business decisions and improve the model.



As an example of formulating business objectives, consider a telecommunications company. It is critically important to identify those customer traits that retain profitable customers and predict fraudulent behavior, credit risks, and customer churn. This knowledge may be used to improve programs in target marketing, marketing channel management, micro-marketing, and cross selling. Finally, continually updating this knowledge will enable the company to meet the challenges of new product development effectively in the future. Steps 2–4 are illustrated in Figs. 36.2–36.4. Approximately 20–25% of effort is spent on determining business objectives, 50–60% of effort is spent on data preparation, 10–15% is spent on DM, and about 10% is spent on consolidation/application.

36.2 Handling Data


The largest percentage effort of the KDD process is spent on processing and preparing the data. In this section, common forms of data storage and tools for accessing the data are described, and the important issues in data preparation are discussed.

36.2.1 Databases and Data Warehousing

A relational database system contains one or more objects called tables. The data or information for the database are stored in these tables. Tables are uniquely identified by their names and are comprised of columns and rows. Columns contain the column name, data type, and any other attributes for the column. Rows contain the records or data for the columns. The structured query language (SQL) is the communication tool for relational database management systems. SQL statements are used to perform tasks such as updating data in a database, or retrieving data from a database. Some common relational database management systems that use SQL are: Oracle, Sybase, Microsoft SQL Server, Access, and Ingres. Standard SQL commands, such as Select, Insert, Update, Delete, Create, and Drop, can be used to accomplish almost everything that one needs to do with a database.
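As a small illustration of these standard commands (not from the chapter), the sketch below runs them through Python's built-in sqlite3 module against an in-memory database; the customers table and its columns are hypothetical examples.

```python
import sqlite3

# Minimal sketch of the standard SQL commands named above, using an
# in-memory SQLite database; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT, revenue REAL)")
cur.execute("INSERT INTO customers (id, region, revenue) VALUES (1, 'NE', 120.0)")
cur.execute("INSERT INTO customers (id, region, revenue) VALUES (2, 'SW', 85.5)")
cur.execute("UPDATE customers SET revenue = 90.0 WHERE id = 2")

# Retrieve rows with a SELECT statement.
cur.execute("SELECT id, region, revenue FROM customers WHERE revenue > 100")
print(cur.fetchall())          # [(1, 'NE', 120.0)]

cur.execute("DELETE FROM customers WHERE id = 1")
cur.execute("DROP TABLE customers")
conn.close()
```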
A data warehouse holds local databases assembled in a central facility. A data cube is a multidimensional array of data, where each dimension is a set of sets representing domain content, such as time or geography. The dimensions are scaled categorically, for example, region of country, state, quarter of year, week of quarter. The cells of the cube contain aggregated measures (usually counts) of variables. To explore the data cube, one can drill down, drill up, and drill through. Drill down involves splitting an aggregation into subsets, e.g., splitting region of country into states. Drill up involves consolidation, i.e., aggregating subsets along a dimension. Drill through involves subsets crossing multiple sets, e.g., the user might investigate statistics within a state subset by time. Other databases and tools include object-oriented databases, transactional databases, time series and spatial databases, online analytical processing (OLAP), multidimensional OLAP (MOLAP), and relational OLAP using extended SQL (ROLAP). See Chapt. 2 of Han and Kamber [36.7] for more details.

36.2.2 Data Preparation

The purpose of this step in the KDD process is to identify data quality problems, sources of noise, data redundancy, missing data, and outliers. Data quality problems can involve inconsistency with external data sets, uneven quality (e.g., if a respondent fakes an answer), and biased opportunistically collected data. Possible sources of noise include faulty data collection instruments (e.g., sensors), transmission errors (e.g., intermittent errors from satellite or internet transmissions), data entry errors, technology limitations errors, misused naming conventions (e.g., using the same names for different meanings), and incorrect classification.

Redundant data exists when the same variables have different names in different databases, when a raw variable in one database is a derived variable in another, and when changes in a variable over time are not reflected in the database. These irrelevant variables impede the speed of the KDD process because dimension reduction is needed to eliminate them. Missing data may be irrelevant if we can extract useful knowledge without imputing the missing data. In addition, most statistical methods for handling missing data may fail for massive data sets, so new or modified methods still need to be developed. In detecting outliers, sophisticated methods like the Fisher information matrix or convex hull peeling are available, but are too complex for massive data sets. Although outliers may be easy to visualize in low dimensions, high-dimensional outliers may not show up in low-dimensional projections.

Currently, clustering and other statistical modeling are used.

The data preparation process involves three steps: data cleaning, database sampling, and database reduction and transformation. Data cleaning includes removal of duplicate variables, imputation of missing values, identification and correction of data inconsistencies, identification and updating of stale data, and creating a unique record (case) identification (ID). Via database sampling, the KDD process selects appropriate parts of the databases to be examined. For this to work, the data must satisfy certain conditions (e.g., no systematic biases). The sampling process can be expensive if the data have been stored in a database system such that it is difficult to sample the data the way you want and many operations need to be executed to obtain the targeted data. One must balance a trade-off between the costs of the sampling process and the mining process. Finally, database reduction is used for data cube aggregation, dimension reduction, elimination of irrelevant and redundant attributes, data compression, and encoding mechanisms via quantizations, wavelet transformation, principal components, etc.
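A toy sketch of a few of the cleaning steps listed above (duplicate removal, simple imputation, record IDs), using pandas; the data frame and its columns are hypothetical and the mean imputation is only one of many possible choices.

```python
import numpy as np
import pandas as pd

# Hypothetical raw records with a duplicate row and missing values.
raw = pd.DataFrame({
    "machine":   ["A", "A", "B", "B", "B"],
    "temp":      [71.2, 71.2, np.nan, 68.9, 70.4],
    "yield_pct": [98.1, 98.1, 97.2, np.nan, 96.8],
})

clean = raw.drop_duplicates()                         # remove duplicate records
clean = clean.fillna(clean.mean(numeric_only=True))   # simple mean imputation
clean = clean.reset_index(drop=True)
clean["case_id"] = clean.index + 1                    # unique record (case) ID
print(clean)
```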

36.3 Data Mining (DM) Models and Algorithms


The DM process is illustrated in Fig. 36.5. In this process, one will start by choosing an appropriate class of models. To fit the best model, one needs to split the sample data into two parts: the training data and the testing data. The training data will be used to fit the model and the testing data is used to refine and tune the fitted model. After the final model is obtained, it is recommended to use an independent data set to evaluate the goodness of the final model, such as comparing the prediction error to the accuracy requirement. (If independent data are not available, one can use the cross-validation method to compute prediction error.) If the accuracy requirement is not satisfied, then one must revisit earlier steps to reconsider other classes of models or collect additional data.

Fig. 36.5 Data mining process: choose models, build/fit the model on the training data, refine/tune the model (model size and diagnostics) on the testing/validation data, and evaluate the model (e.g., prediction error) on evaluation data; if the accuracy requirement is not met, consider alternate models or collect more data, otherwise score the data, make predictions, and make decisions.
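A minimal sketch of this fit/tune/evaluate loop on simulated data with scikit-learn; the model class, split fraction, and error measure are arbitrary choices for illustration, not values from the chapter.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split

# Split the sample data into training and testing parts, fit on the training
# data, and measure prediction error on the held-out testing data.
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LinearRegression().fit(X_train, y_train)
test_error = mean_squared_error(y_test, model.predict(X_test))

# If independent evaluation data are not available, cross-validation
# estimates the prediction error from the sample data alone.
cv_error = -cross_val_score(LinearRegression(), X, y,
                            scoring="neg_mean_squared_error", cv=5).mean()
print(test_error, cv_error)
```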
Before implementing any sophisticated DM methods, data description and visualization are used for initial exploration. Tools include descriptive statistical measures for central tendency/location, dispersion/spread, and distributional shape and symmetry; class characterizations and comparisons using analytical approaches, attribute relevance analysis, and class discrimination and comparisons; and data visualization using scatter-plot matrices, density plots, 3-D stereoscopic scatter-plots, and parallel coordinate plots. Following this initial step, DM methods take two forms: supervised versus unsupervised learning. Supervised learning is described as learning with a teacher, where the teacher provides data with correct answers. For example, if we want to classify online shoppers as buyers or non-buyers using an available set of variables, our data would include actual instances of buyers and non-buyers for training a DM method. Unsupervised learning is described as learning without a teacher. In this case, correct answers are not available, and DM methods would search for patterns or clusters of similarity that could later be linked to some explanation.
36.3.1 Supervised Learning

In supervised learning, we have a set of input variables (also known as predictors, independent variables, x) that are measured or preset, and a set of output variables (also known as responses, dependent variables, y) that are measured and assumed to be influenced by the inputs. If the outputs are continuous/quantitative, then we have a regression or prediction problem. If the outputs are categorical/qualitative, then we have a classification problem. First, a DM model/system is established based on the collected input and output data. Then, the established model is used to predict output values at new input values. The predicted values are denoted by ŷ.
The DM perspective of learning with a teacher follows these steps:

• The student presents an answer (ŷi given xi);
• The teacher provides the correct answer yi or an error ei for the student's answer;
• The result is characterized by some loss function or lack-of-fit criterion, LOF(y, ŷ);
• The objective is to minimize the expected loss.

Supervised learning includes the common engineering task of function approximation, in which we assume that the output is related to the input via some function f(x, ε), where ε represents a random error, and we seek to approximate f(·).
Below, we describe several supervised learning methods. All can be applied to both the regression and classification cases, except for those presented below under Other Classification Methods. We maintain the following notation. The j-th input variable is denoted by xj (or random variable Xj), and the corresponding boldface x (or X) denotes the vector of p input variables (x1, x2, ..., xp)T, where boldface xi denotes the i-th sample point; N is the number of sample points, which corresponds to the number of observations of the response variable; the response variable is denoted by y (or random variable Y), where yi denotes the i-th response observation. For the regression case, the response y is quantitative, while for the classification case, the response values are indices for C classes (c = 1, ..., C). An excellent reference for these methods is Hastie et al. [36.8].
Linear and Additive Methods
In the regression case, the basic linear method is simply the multiple linear regression model form

\mu(x; \beta) = E[Y \mid X = x] = \beta_0 + \sum_{m=1}^{M} \beta_m b_m(x) ,

where the model terms bm(x) are pre-specified functions of the input variables, for example, a simple linear term bm(x) = xj or a more complex interaction term bm(x) = xj xk². The key is that the model is linear in the parameters β. Textbooks that cover linear regression are abundant (e.g., [36.9, 10]). In particular, Neter et al. [36.11] provides a good background on residual analysis, model diagnostics, and model selection using best subsets and stepwise methods. In model selection, insignificant model terms are eliminated; thus, the final model may be a subset of the original pre-specified model. An alternate approach is to use a shrinkage method that employs a penalty function to shrink estimated model parameters towards zero, essentially reducing the influence of less important terms. Two options are ridge regression [36.12], which uses the penalty form Σm βm², and the lasso [36.13], which uses the penalty form Σm |βm|.
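A short sketch of these two shrinkage penalties with scikit-learn on simulated data; the penalty weights (alpha) and the simulated coefficients are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Ordinary least squares versus the two shrinkage methods: ridge penalizes
# the sum of squared coefficients, the lasso the sum of absolute coefficients
# (which can set unimportant coefficients exactly to zero).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)  # two real terms

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)   # alpha controls the penalty strength
lasso = Lasso(alpha=0.1).fit(X, y)

print(np.round(ols.coef_, 2))
print(np.round(ridge.coef_, 2))      # shrunk towards zero
print(np.round(lasso.coef_, 2))      # several coefficients exactly zero
```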
In the classification case, linear methods generate linear decision boundaries to separate the C classes. Although a direct linear regression approach could be applied, it is known not to work well. A better method is logistic regression [36.14], which uses log-odds (or logit transformations) of the posterior probabilities μc(x) = P(Y = c|X = x) for classes c = 1, ..., C − 1 in the form

\log\frac{\mu_c(x)}{\mu_C(x)} = \log\frac{P(Y = c \mid X = x)}{P(Y = C \mid X = x)} = \beta_{c0} + \sum_{j=1}^{p} \beta_{cj} x_j ,

where the C posterior probabilities μc(x) must sum to one. The decision boundary between class c < C and class C is defined by the hyperplane {x | βc0 + Σj βcj xj = 0}, where the log-odds are zero. Similarly, the decision boundary between classes c ≠ C and d ≠ C, derived from the log-odds for classes c and d, is defined by {x | βc0 + Σj βcj xj = βd0 + Σj βdj xj}. In the binary case (C = 2), if we define μ(x) = P(Y = 1|X = x), then 1 − μ(x) = P(Y = 2|X = x). The logit transformation is then defined as g(μ) = log[μ/(1 − μ)].
Closely related to logistic regression is linear discriminant analysis [36.15], which utilizes exactly the same linear form for the log-odds ratio, and defines linear discriminant functions δc(x), such that x is classified to class c if its maximum discriminant is δc(x). The difference between the two methods is how the parameters are estimated. Logistic regression maximizes the conditional likelihood involving the posterior probabilities P(Y = c|X), while linear discriminant analysis maximizes the full log-likelihood involving the unconditional probabilities P(Y = c, X). More general forms of discriminant analysis are discussed below under Other Classification Methods.
Finally, it should be noted that the logistic regression model is one form of generalized linear model (GLM) [36.16]. GLM forms convert what appear to be nonlinear models into linear models, using tools such as transformations (e.g., logit) or conditioning on nonlinear parameters. This then enables the modeler to use traditional linear modeling analysis techniques. However, real data often do not satisfy the restrictive conditions of these models.

Rather than using pre-specified model terms, as in a linear model, a generalized additive model (GAM) [36.17] provides a more flexible statistical method to enable modeling of nonlinear patterns in each input dimension. In the regression case, the basic GAM form is

\mu(x) = \beta_0 + \sum_{j=1}^{p} f_j(x_j) ,

where the fj(·) are unspecified (smooth) univariate functions, one for each input variable. The additive restriction prohibits inclusion of any interaction terms. Each function is fitted using a nonparametric regression modeling method, such as running-line smoothers (e.g., lowess [36.18]), smoothing splines, or kernel smoothers [36.19–21]. In the classification case, an additive logistic regression model utilizes the logit transformation for classes c = 1, ..., C − 1 as above,

\log\frac{\mu_c(x)}{\mu_C(x)} = \log\frac{P(Y = c \mid X = x)}{P(Y = C \mid X = x)} = \beta_0 + \sum_{j=1}^{p} f_j(x_j) ,

where an additive model is used in place of the linear model. However, even with the flexibility of nonparametric regression, GAM may still be too restrictive. The following sections describe methods that have essentially no assumptions on the underlying model form.
Trees and Related Methods
One DM decision tree model is chi-square automatic interaction detection (CHAID) [36.22, 23], which builds non-binary trees using a chi-square test for the classification case and an F-test for the regression case. The CHAID algorithm first creates categorical input variables out of any continuous inputs by dividing them into several categories with approximately the same number of observations. Next, input variable categories that are not statistically different are combined, while a Bonferroni p-value is calculated for those that are statistically different. The best split is determined by the smallest p-value. CHAID continues to select splits until the smallest p-value is greater than a pre-specified significance level (α).
The popular classification and regression trees (CART) [36.24] utilize recursive partitioning (binary splits), which evolved from the work of Morgan and Sonquist [36.25] and Fielding [36.26] on analyzing survey data. CARTs have a forward stepwise procedure that adds model terms and a backward procedure for pruning. The model terms partition the x-space into disjoint hyper-rectangular regions via indicator functions: b+(x; t) = 1{x > t}, b−(x; t) = 1{x ≤ t}, where the split-point t defines the borders between regions. The resulting model terms are

f_m(x) = \prod_{l=1}^{L_m} b_{s_{l,m}}\!\left(x_{v(l,m)}; t_{l,m}\right) ,    (36.1)

where Lm is the number of univariate indicator functions multiplied in the m-th model term, xv(l,m) is the input variable corresponding to the l-th indicator function in the m-th model term, tl,m is the split-point corresponding to xv(l,m), and sl,m is +1 or −1 to indicate the direction of the partition. The CART model form is then

f(x; \beta) = \beta_0 + \sum_{m=1}^{M} \beta_m f_m(x) .    (36.2)

The partitioning of the x-space does not keep the parent model terms because they are redundant. For example, suppose the current set has the model term

f_a(x) = 1\{x_3 > 7\} \cdot 1\{x_4 \le 10\} ,

and the forward stepwise algorithm chooses to add

f_b(x) = f_a(x) \cdot 1\{x_5 > 13\} = 1\{x_3 > 7\} \cdot 1\{x_4 \le 10\} \cdot 1\{x_5 > 13\} .

Then the model term fa(x) is dropped from the current set. Thus, the recursive partitioning algorithm follows a binary tree with the current set of model terms fm(x) consisting of the M leaves of the tree, each of which corresponds to a different region Rm.

In the regression case, CART minimizes the squared error loss function

\mathrm{LOF}(\hat{f}) = \sum_{i=1}^{N} \left(y_i - \hat{f}(x_i)\right)^2 ,

and the approximation is a piecewise-constant function. In the classification case, each region Rm is classified into one of the C classes. Specifically, define the proportion of class c observations in region Rm as

\hat{\delta}_{mc} = \frac{1}{N_m} \sum_{x_i \in R_m} 1\{y_i = c\} ,

where Nm is the number of observations in the region Rm. Then the observations in region Rm are classified into the class c corresponding to the maximum proportion \hat{\delta}_{mc}. The algorithm is exactly the same as for regression, but with a different loss function. Appropriate choices include minimizing the misclassification error (i.e., the number of misclassified observations), the Gini index, \sum_{c=1}^{C} \hat{\delta}_{mc}(1 - \hat{\delta}_{mc}), or the deviance, \sum_{c=1}^{C} \hat{\delta}_{mc} \log(\hat{\delta}_{mc}).
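A small sketch of CART-style binary recursive partitioning using scikit-learn's decision trees on simulated data; the depth limit and pruning parameter are arbitrary, and scikit-learn's implementation is a variant of CART rather than the original algorithm.

```python
from sklearn.datasets import make_friedman1
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor, export_text

# Binary splits on (variable, split-point) pairs yield a piecewise-constant
# approximation; cost-complexity pruning is controlled through ccp_alpha.
X, y = make_friedman1(n_samples=400, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeRegressor(max_depth=3, ccp_alpha=0.01).fit(X_train, y_train)
print(export_text(tree))            # the fitted partition (the tree's leaves)
print(tree.score(X_test, y_test))   # accuracy (R^2) on the held-out testing data
```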
The exhaustive search algorithms for CART simultaneously conduct variable selection (x) and split-point selection (t). To reduce computational effort, the fast algorithm for classification trees (FACT) [36.27] separates the two tasks. At each existing model term (leaf of the tree), F-statistics are calculated for variable selection. Then linear discriminant analysis is used to identify the split-point. A version for logistic and Poisson regression was presented by Chaudhuri et al. [36.28].
The primary drawback of CART and FACT is a bias towards selecting higher-order interaction terms due to the property of keeping only the leaves of the tree. As a consequence, these tree methods do not provide robust approximations and can have poor prediction accuracy. Loh and Shih [36.29] address this issue for FACT with a variant of their classification algorithm called QUEST that clusters classes into superclasses before applying linear discriminant analysis. For CART, Friedman et al. [36.30] introduced to the statistics literature the concepts of boosting [36.31] and bagging [36.32] from the machine learning literature. The bagging approach generates many bootstrap samples, fits a tree to each, then uses their average prediction. In the framework of boosting, a model term, called a base learner, is a small tree with only L disjoint regions (L is selected by the user), call it B(x, a), where a is the vector of tree coefficients. The boosting algorithm begins by fitting a small tree B(x, a) to the data, and the first approximation, f̂1(x), is then this first small tree. In the m-th iteration, residuals are calculated, then a small tree B(x, a) is fitted to the residuals and combined with the latest approximation to create the m-th approximation:

\hat{f}_m(x; \beta_0, \beta_1, \ldots, \beta_m) = \hat{f}_{m-1}(x; \beta_0, \beta_1, \ldots, \beta_{m-1}) + \beta_m B(x, a) ,

where a line search is used to solve for βm. The resulting boosted tree, called a multiple additive regression tree (MART) [36.33], then consists of much lower-order interaction terms. Friedman [36.34] presents stochastic gradient boosting, with a variety of loss functions, in which a bootstrap-like bagging procedure is included in the boosting algorithm.
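A sketch of boosted regression trees in this spirit with scikit-learn; the number of trees, tree depth, learning rate, and subsampling fraction are arbitrary illustrative settings, and subsample < 1 corresponds to the bootstrap-like randomization of stochastic gradient boosting.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Each base learner is a small tree fitted to the current residuals and
# added to the running approximation with a shrunken step (learning_rate).
X, y = make_friedman1(n_samples=600, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

mart = GradientBoostingRegressor(n_estimators=300, max_depth=2,
                                 learning_rate=0.05, subsample=0.7,
                                 random_state=0).fit(X_train, y_train)
print(mart.score(X_test, y_test))
```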
Finally, for the regression case only, multivariate adaptive regression splines (MARS) [36.35] evolved from CART as an alternative to its piecewise-constant approximation. Like CART, MARS utilizes a forward stepwise algorithm to select model terms followed by a backward procedure to prune the model. A univariate version (appropriate for additive relationships) was presented by Friedman and Silverman [36.36]. The MARS approximation bends to model curvature at knot locations, and one of the objectives of the forward stepwise algorithm is to select appropriate knots. An important difference from CART is that MARS maintains the parent model terms, which are no longer redundant, but are simply lower-order terms.

MARS model terms have the same form as (36.1), except the indicator functions are replaced with truncated linear functions,

b^{+}(x; t) = [+(x - t)]_{+} , \qquad b^{-}(x; t) = [-(x - t)]_{+} ,

where [q]+ = max(0, q) and t is a univariate knot. The search for new model terms can be restricted to interactions of a maximum order (e.g., Lm ≤ 2 permits up through two-factor interactions). The resulting MARS approximation, following (36.2), is a continuous, piecewise-linear function. After selection of the model terms is completed, smoothness to achieve a certain degree of continuity may be applied.

Hastie et al. [36.8] demonstrate significant improvements in accuracy using MART over CART. For the regression case, comparisons between MART and MARS yield comparable results [36.34]. Thus, the primary decision between these two methods is whether a piecewise-constant approximation is satisfactory or if a continuous, smooth approximation would be preferred.
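A minimal numpy illustration of the truncated linear (hinge) basis functions used by MARS; the knot location and coefficients below are hypothetical, and this is only the basis construction, not a MARS fitting algorithm.

```python
import numpy as np

# Truncated linear basis functions: [q]_+ = max(0, q).
def hinge_pos(x, t):
    return np.maximum(0.0, x - t)   # b+(x; t) = [+(x - t)]_+

def hinge_neg(x, t):
    return np.maximum(0.0, t - x)   # b-(x; t) = [-(x - t)]_+

# A pair of hinges at a knot t gives a continuous, piecewise-linear fit
# that "bends" at the knot (one-dimensional illustration).
x = np.linspace(0.0, 5.0, 11)
t = 2.0
basis = np.column_stack([np.ones_like(x), hinge_pos(x, t), hinge_neg(x, t)])
beta = np.array([1.0, 0.5, -1.5])   # hypothetical coefficients
print(basis @ beta)
```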
Artificial Neural Networks
Artificial neural network (ANN) models have been very popular for modeling a variety of physical relationships (for a general introduction see Lippmann [36.37] or Haykin [36.38]; for statistical perspectives see White [36.39], Baron et al. [36.40], Ripley [36.23], or Cheng and Titterington [36.41]). The original motivation for ANNs comes from how learning strengthens connections along neurons in the brain. Commonly, an ANN model is represented by a diagram of nodes in various layers with weighted connections between nodes in different layers (Fig. 36.6). At the input layer, the nodes are the input variables, and at the output layer, the nodes are the response variable(s). In between, there is usually at least one hidden layer, which induces flexibility into the modeling. Activation functions define transformations between layers (e.g., input to hidden). Connections between nodes can feed back to previous layers, but for supervised learning the typical ANN is feedforward only, with at least one hidden layer.

Fig. 36.6 Diagram of a typical artificial neural network for function approximation, with input nodes X1, X2, X3, hidden nodes Z1, Z2, and output nodes Y1, Y2, Y3. The input nodes correspond to the input variables, and the output node(s) correspond to the output variable(s). The number of hidden nodes in the hidden layer must be specified by the user.

The general form of a feedforward ANN with one hidden layer and activation functions b1(·) (input to hidden) and b2(·) (hidden to output) is

f_c(x; w, v, \theta, \gamma_c) = b_2\!\left[\sum_{h=1}^{H} w_{hc} \cdot b_1\!\left(\sum_{j=1}^{p} v_{jh} x_j + \theta_h\right) + \gamma_c\right] ,    (36.3)

where c = 1, ..., C and C is the number of output variables, p is the number of input variables, H is the number of hidden nodes, the weights vjh link input nodes j to hidden nodes h and whc link hidden nodes h to output nodes c, and θh and γc are constant terms called bias nodes (like intercept terms). The number of coefficients to be estimated is (p + 1)H + (H + 1)C, which is often larger than N. The simplest activation function is a linear function b(z) = z, which reduces the ANN model in (36.3) with one response variable to a multiple linear regression equation. For more flexibility, the recommended activation functions between the input and hidden layer(s) are the S-shaped sigmoidal functions or the bell-shaped radial basis functions. Commonly used sigmoidal functions are the logistic function

b(z) = \frac{1}{1 + e^{-z}}

and the hyperbolic tangent

b(z) = \tanh(z) = \frac{1 - e^{-2z}}{1 + e^{-2z}} .

The most common radial basis function is the Gaussian probability density function.

In the regression case, each node in the output layer represents a quantitative response variable. The output activation function may be either a linear, sigmoidal, or radial basis function. Using a logistic activation function from input to hidden and from hidden to output, the ANN model in (36.3) becomes

f_c(x; w, v, \theta, \gamma_c) = \left[1 + \exp\!\left(-\sum_{h=1}^{H} w_{hc} z_h - \gamma_c\right)\right]^{-1} ,

where for each hidden node h

z_h = \left[1 + \exp\!\left(-\sum_{j=1}^{p} v_{jh} x_j - \theta_h\right)\right]^{-1} .

In the classification case with C classes, each class is represented by a different node in the output layer. The recommended output activation function is the softmax function. For output node c, this is defined as

b(z_1, \ldots, z_C; c) = \frac{e^{z_c}}{\sum_{d=1}^{C} e^{z_d}} .

This produces output values between zero and one that sum to one and, consequently, permits the output values to be interpreted as posterior probabilities for a categorical response variable.
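A short numpy sketch of the forward pass in (36.3) for the classification case, with a logistic activation from input to hidden and a softmax output; the dimensions and random weights are hypothetical and no training is performed here.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def ann_forward(x, v, theta, w, gamma):
    """One-hidden-layer feedforward pass following (36.3):
    logistic activation input->hidden, softmax hidden->output."""
    z = logistic(v.T @ x + theta)      # hidden-node values z_h, h = 1..H
    return softmax(w.T @ z + gamma)    # C outputs that sum to one

# Hypothetical dimensions: p = 3 inputs, H = 2 hidden nodes, C = 3 classes.
rng = np.random.default_rng(0)
p, H, C = 3, 2, 3
v, theta = rng.normal(size=(p, H)), rng.normal(size=H)   # input-to-hidden weights, biases
w, gamma = rng.normal(size=(H, C)), rng.normal(size=C)   # hidden-to-output weights, biases
print(ann_forward(rng.normal(size=p), v, theta, w, gamma))
```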
Mathematically, an ANN model is a nonlinear statistical model, and a nonlinear method must be used to estimate the coefficients (weights vjh and whc, biases θh and γc) of the model. This estimation process is called network training. Typically, the objective is to minimize the squared error lack-of-fit criterion

\mathrm{LOF}(\hat{f}) = \sum_{c=1}^{C} \sum_{i=1}^{N} \left(y_i - \hat{f}_c(x_i)\right)^2 .

The most common method for training is backpropagation, which is based on gradient descent.
At each iteration, each coefficient (say w) is adjusted according to its contribution to the lack-of-fit,

\Delta w = \alpha \,\frac{\partial(\mathrm{LOF})}{\partial w} ,

where the user-specified α controls the step size; see Rumelhart et al. [36.42] for more details. More efficient training procedures are a subject of current ANN research.

Another major issue is the network architecture, defined by the number of hidden nodes. If too many hidden nodes are permitted, the ANN model will overfit the data. Many model discrimination methods have been tested, but the most reliable is validation of the model on a testing data set separate from the training data set. Several ANN architectures are fitted to the training data set and then prediction error is measured on the testing data set. Although ANNs are generally flexible enough to model anything, they are computationally intensive, and a significant quantity of representative data is required to both fit and validate the model. From a statistical perspective, the primary drawback is the overly large set of coefficients, none of which provide any intuitive understanding for the underlying model structure. In addition, since the nonlinear model form is not motivated by the true model structure, too few training data points can result in ANN approximations with extraneous nonlinearity. However, given enough good data, ANNs can outperform other modeling methods.
Support Vector Machines
Referring to the linear methods for classification described earlier, the decision boundary between two classes is a hyperplane of the form {x | β0 + Σj βj xj = 0}. The support vectors are the points that are most critical to determining the optimal decision boundary because they lie close to the points belonging to the other class. With support vector machines (SVM) [36.43, 44], the linear decision boundary is generalized to the more flexible form

f(x; \beta) = \beta_0 + \sum_{m=1}^{M} \beta_m g_m(x) ,    (36.4)

where the gm(x) are transformations of the input vector. The decision boundary is then defined by {x | f(x; β) = 0}. To solve for the optimal decision boundary, it turns out that we do not need to specify the transformations gm(x), but instead require only the kernel function [36.21, 45],

K(x, x') = \left\langle \left(g_1(x), \ldots, g_M(x)\right), \left(g_1(x'), \ldots, g_M(x')\right) \right\rangle .

Two popular kernel functions for SVM are polynomials of degree d, K(x, x') = (1 + ⟨x, x'⟩)^d, and radial basis functions, K(x, x') = exp(−‖x − x'‖²/c).

Given K(x, x'), we maximize the following Lagrangian dual-objective function:

\max_{\alpha_1, \ldots, \alpha_N} \; \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{i'=1}^{N} \alpha_i \alpha_{i'} y_i y_{i'} K(x_i, x_{i'})
\quad \text{s.t. } 0 \le \alpha_i \le \gamma \text{ for } i = 1, \ldots, N \text{ and } \sum_{i=1}^{N} \alpha_i y_i = 0 ,

where γ is an SVM tuning parameter. The optimal solution allows us to rewrite f(x; β) as

f(x; \beta) = \beta_0 + \sum_{i=1}^{N} \alpha_i y_i K(x, x_i) ,

where β0 and α1, ..., αN are determined by solving f(x; β) = 0. The support vectors are those xi corresponding to nonzero αi. A smaller SVM tuning parameter γ leads to more support vectors and a smoother decision boundary. A testing data set may be used to determine the best value for γ.
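A brief sketch of an SVM classifier with a radial basis kernel using scikit-learn on simulated data. Note the naming mismatch: the box-constraint tuning parameter denoted γ above corresponds to the C argument in scikit-learn, while scikit-learn's gamma argument is instead the width of the RBF kernel.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Fit an RBF-kernel SVM; the support vectors are the training points with
# nonzero dual coefficients.
X, y = make_classification(n_samples=400, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)
print(len(svm.support_))            # number of support vectors
print(svm.score(X_test, y_test))    # a testing set can guide the choice of C
```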
The SVM extension to more than two classes solves multiple two-class problems. SVM for regression utilizes the model form in (36.4) and requires specification of a loss function appropriate for a quantitative response [36.8, 46]. Two possibilities are the ε-insensitive function

V_\epsilon(e) = \begin{cases} 0 & \text{if } |e| < \epsilon ,\\ |e| - \epsilon & \text{otherwise} , \end{cases}

which ignores errors smaller than ε, and the Huber [36.47] function

V_H(e) = \begin{cases} e^2/2 & \text{if } |e| \le 1.345 ,\\ 1.345\,|e| - e^2/2 & \text{otherwise} , \end{cases}

which is used in robust regression to reduce model sensitivity to outliers.

Other Classification Methods
In this section, we briefly discuss some other concepts that are applicable to DM classification problems. The basic intuition behind a good classification method is derived from the Bayes classifier, which utilizes the posterior distribution P(Y = c|X = x). Specifically, if P(Y = c|X = x) is the maximum over c = 1, ..., C, then x would be classified to class c.

Nearest neighbor (NN) [36.48] classifiers seek to estimate the Bayes classifier directly without specification of any model form. The k-NN classifier identifies the k closest points to x (using Euclidean distance) as the neighborhood about x, then estimates P(Y = c|X = x) with the fraction of these k points that are of class c. As k increases, the decision boundaries become smoother; however, the neighborhood becomes less local (and less relevant) to x. This problem of local representation is even worse in high dimensions, and modifications to the distance measure are needed to create a practical k-NN method for DM. For this purpose, Hastie and Tibshirani [36.49] proposed the discriminant adaptive NN distance measure to reshape the neighborhood adaptively at a given x to capture the critical points to distinguish between the classes.
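A small k-NN sketch with scikit-learn on simulated data, showing the effect of the neighborhood size k; the data settings and the values of k are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# k-NN estimates P(Y = c | X = x) by the fraction of the k nearest training
# points of class c; larger k gives smoother (but less local) boundaries.
X, y = make_classification(n_samples=500, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for k in (1, 5, 25):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(k, knn.score(X_test, y_test))
```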
As mentioned earlier, linear discriminant analysis may be too restrictive in practice. Flexible discriminant analysis replaces the linear decision boundaries with more flexible regression models, such as GAM or MARS. Mixture discriminant analysis relaxes the assumption that classes are more or less spherical in shape by allowing a class to be represented by multiple (spherical) clusters; see Hastie et al. [36.50] and Ripley [36.23] for more details.

K-means clustering classification applies the K-means clustering algorithm separately to the data for each of the C classes. Each class c will then be represented by K clusters of points. Consequently, non-spherical classes may be modeled. For a new input vector x, determine the closest cluster, then assign x to the class associated with that cluster.

Genetic algorithms [36.51, 52] use processes such as genetic combination, mutation, and natural selection in an optimization based on the concepts of natural evolution. One generation of models competes to pass on characteristics to the next generation of models, until the best model is found. Genetic algorithms are useful in guiding DM algorithms, such as neural networks and decision trees [36.53].
36.3.2 Unsupervised Learning

In unsupervised learning, correct answers are not available, so there is no clear measure of success. Success must be judged subjectively by the value of discovered knowledge or the effectiveness of the algorithm. The statistical perspective is to observe N vectors from the population distribution, then conduct direct inferences on the properties (e.g., relationship, grouping) of the population distribution. The number of variables or attributes is often very high (much higher than that in supervised learning). In describing the methods, we denote the j-th variable by xj (or random variable Xj), and the corresponding boldface x (or X) denotes the vector of p variables (x1, x2, ..., xp)T, where boldface xi denotes the i-th sample point. These variables may be either quantitative or qualitative.
Association Rules
Association rules or affinity groupings seek to find associations between the values of the variables X that provide knowledge about the population distribution. Market basket analysis is a well-known special case, for which the extracted knowledge may be used to link specific products. For example, consider all the items that may be purchased at a store. If the analysis identifies that items A and B are commonly purchased together, then sales promotions could exploit this to increase revenue.

In seeking these associations, a primary objective is to identify variable values that occur together with high probability. Let Sj be the set of values for Xj, and consider a subset sj ⊆ Sj. Then we seek subsets s1, ..., sp such that

P\!\left[\bigcap_{j=1}^{p} (X_j \in s_j)\right]    (36.5)

is large. In market basket analysis, the variables X are converted to a set of binary variables Z, where each attainable value of each Xj corresponds to a variable Zk. Thus, the number of Zk variables is K = Σj |Sj|. If binary variable Zk corresponds to Xj = v, then Zk = 1 when Xj = v and Zk = 0 otherwise. An item set κ is a realization of Z. For example, if the Zk represent the possible products that could be purchased from a store, then an item set would be the set of items purchased together by a customer. Note that the number of Zk = 1 in an item set is at most p. Equation (36.5) now becomes

P\!\left[\bigcap_{k \in \kappa} (Z_k = 1)\right] ,

which is estimated by

T(\kappa) = \frac{\text{number of observations for which item set } \kappa \text{ occurs}}{N} .

T(κ) is called the support for the rule. We can select a lower bound t such that item sets with T(κ) > t would be considered to have large support.

Further knowledge may be extracted via the a priori algorithm [36.54] in the form of if–then statements. For an item set κ, the items with Zk = 1 would be partitioned into two disjoint item subsets such that A ∪ B = κ. The association rule would be stated as "if A, then B" and denoted by A ⇒ B, where A is called the antecedent and B is called the consequent. This rule's support T(A ⇒ B) is the same as T(κ) calculated above, an estimate of the joint probability. The confidence or predictability of this rule is

C(A \Rightarrow B) = \frac{T(A \Rightarrow B)}{T(A)} ,

which is an estimate of the conditional probability P(B|A). The expected confidence is the support of B, T(B), an estimate of the unconditional probability P(B). The lift is the ratio of the confidence over the expected confidence,

L(A \Rightarrow B) = \frac{C(A \Rightarrow B)}{T(B)} ,

which, if greater than one, can be interpreted as the increased prevalence of B when associated with A. For example, if T(B) = 5%, then B is estimated to occur unconditionally 5% of the time. If C(A ⇒ B) = 40%, then given that A occurs, B is estimated to occur 40% of the time. This results in a lift of 8, implying that B is 8 times more likely to occur if we know that A occurs.
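A toy computation of support, confidence, and lift directly from a handful of market baskets; the baskets and items below are hypothetical, and a full a priori implementation would additionally enumerate candidate item sets efficiently.

```python
# Support, confidence, and lift for a single rule A => B on toy baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "beer", "diapers"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]
N = len(baskets)

def support(item_set):
    """T(kappa): fraction of baskets containing every item in item_set."""
    return sum(item_set <= b for b in baskets) / N

A, B = {"diapers"}, {"beer"}
conf = support(A | B) / support(A)   # C(A => B), estimates P(B | A)
lift = conf / support(B)             # L(A => B)
print(support(A | B), conf, lift)
```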
Cluster Analysis
The objective of cluster analysis is to partition the N observations of x into groups or clusters such that the dissimilarities within each cluster are smaller than the dissimilarities between different clusters [36.55]. Typically the variables x are all quantitative, and a distance measure (e.g., Euclidean) is used to measure dissimilarity. For categorical x variables, a dissimilarity measure must be explicitly defined. Below, we describe some of the more common methods.

K-means [36.56] is the best-known clustering tool. It is appropriate when the variables x are quantitative. Given a prespecified value K, the method partitions the N observations of x into exactly K clusters by minimizing within-cluster dissimilarity. Squared Euclidean distance

d(x_i, x_{i'}) = \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2

is used to measure dissimilarity. For a specific clustering assignment C = (C1, ..., CK), the within-cluster dissimilarity is measured by calculating d(xi, xi') for all points xi, xi' within a cluster Ck, then summing over the K clusters. This is equivalent to calculating

W(C) = \sum_{k=1}^{K} \sum_{i \in C_k} d(x_i, \bar{x}_k) ,

where the cluster mean x̄k is the sample mean vector of the points in cluster Ck. Given a current set of cluster means, the K-means algorithm assigns each point to the closest cluster mean, calculates the new cluster means, and iterates until the cluster assignments do not change. Unfortunately, because of its dependence on the squared Euclidean distance measure, K-means clustering is sensitive to outliers (i.e., is not robust). K-medoids [36.57] is a generalized version that utilizes an alternately defined cluster center in place of the cluster means and an alternate distance measure.
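A brief K-means sketch with scikit-learn on simulated data; K = 3 is a deliberate match to the simulated number of groups and would otherwise need to be chosen by the analyst.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# K-means with a prespecified K: assign each point to the nearest cluster
# mean and recompute the means until the assignments stop changing.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # the cluster means
print(km.inertia_)           # within-cluster dissimilarity W(C)
```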
Density-based clustering (DBSCAN) [36.58] algorithms are less sensitive to outliers and can discover clusters of irregular (p-dimensional) shapes. DBSCAN is designed to discover clusters and noise in a spatial database. The advantage of DBSCAN over other clustering methods is its ability to represent specific structure in the analysis explicitly. DBSCAN has two key parameters: neighborhood size (ε) and minimum cluster size (nmin). The neighborhood of an object within a radius ε is called the ε-neighborhood of the object. If the ε-neighborhood of an object contains at least nmin objects, then the object is called a core object. To find a cluster, DBSCAN starts with an arbitrary object o in the database. If the object o is a core object with respect to ε and nmin, then a new cluster with o as the core object is created. DBSCAN continues to retrieve all density-reachable objects from the core object and add them to the cluster.
man expert to gain insight into the clustering structure S+ language, which is only recommended for users that
of the data. are familiar with statistical modeling. For statisticians,
Self-Organizing (Feature) Maps (SOMs) [36.61] be- the advantage is that Insightful Miner can be integrated
long to the class of ANNs called unsupervised learning with more sophisticated DM methods available with S+,
networks. SOMs can be organized as a single layer or such as flexible and mixture discriminant analysis. All
as two layers of neuron nodes. In this arrangement, six packages include trees and clustering, and all except
the input layer consists of p nodes corresponding to Quadstone include ANN modeling. The SAS, SPSS,
the real-valued input vector of dimension p. The input and XLMiner packages include discriminant analysis
layer nodes are connected to a second layer of nodes U. and association rules. Ghostminer is the only one that
By means of lateral connections, the nodes in U form offers SVM tools.

Part D 36.3
a lattice structure of dimensionality M. Typically M SAS, SPSS, and Quadstone are the most expensive
is much smaller than p. By means of a learning al- (over $ 40 000) while XLMiner is a good deal for the
gorithm, the network discovers the clusters within the price (under $ 2 000). The disadvantage of XLMiner is
data. It is possible to alter the discovered clusters by that it cannot handle very large data sets. Each pack-
varying the learning parameters of the network. The age has certain specializations, and potential users must
SOM is especially suitable for data survey because it has carefully investigate these choices to find the package
appealing visualization properties. It creates a set of pro- that best fits their KDD/DM needs. Below we de-
totype vectors representing the data set and carries out scribe some other software options for the DM modeling
a topology-preserving projection of the prototypes from methods presented.
the p-dimensional input space onto a low-dimensional GLM or linear models are the simplest of DM tools
(typically two-dimensional) grid. This ordered grid can and most statistical software can fit them, such as SAS,
be used as a convenient visualization surface for show- SPSS, S+, and Statistica [www.statsoftinc.com/]. How-
ing different features of the SOM (and thus of the data), ever, it should be noted that Quadstone only offers a
for example, the cluster structure. While the axes of such regression tool via scorecards, which is not the same as
a grid do not correspond to any measurement, the spa- statistical linear models. GAM requires access to more
tial relationships among the clusters do correspond to sophisticated statistical software, such as S+.
relationships in p-dimensional space. Another attractive Software for CART, MART, and MARS is avail-
feature of the SOM is its ability to discover arbitrarily able from Salford Systems [www.salford-systems.com].
shaped clusters organized in a nonlinear space. SAS Enterprise Miner includes CHAID, CART, and the
36.3.3 Software

Several DM software packages are available at a wide range of prices, of which six of the most popular packages are:

• SAS Enterprise Miner (www.sas.com/technologies/analytics/datamining/miner/),
• SPSS Clementine (www.spss.com/clementine/),
• XLMiner in Excel (www.xlminer.net),
• Ghostminer (www.fqspl.com.pl/ghostminer/),
• Quadstone (www.quadstone.com/),
• Insightful Miner (www.splus.com/products/iminer/).

Haughton et al. [36.62] present a review of the first five listed above. The SAS and SPSS packages have the most complete set of KDD/DM tools (data handling, DM modeling, and graphics), while Quadstone is the most limited. Insightful Miner was developed by S+ [www.splus.com], but does not require knowledge of the S+ language, which is only recommended for users that are familiar with statistical modeling. For statisticians, the advantage is that Insightful Miner can be integrated with more sophisticated DM methods available with S+, such as flexible and mixture discriminant analysis. All six packages include trees and clustering, and all except Quadstone include ANN modeling. The SAS, SPSS, and XLMiner packages include discriminant analysis and association rules. Ghostminer is the only one that offers SVM tools.

SAS, SPSS, and Quadstone are the most expensive (over $40,000), while XLMiner is a good deal for the price (under $2,000). The disadvantage of XLMiner is that it cannot handle very large data sets. Each package has certain specializations, and potential users must carefully investigate these choices to find the package that best fits their KDD/DM needs. Below we describe some other software options for the DM modeling methods presented.

GLM or linear models are the simplest of DM tools and most statistical software can fit them, such as SAS, SPSS, S+, and Statistica [www.statsoftinc.com/]. However, it should be noted that Quadstone only offers a regression tool via scorecards, which is not the same as statistical linear models. GAM requires access to more sophisticated statistical software, such as S+.

Software for CART, MART, and MARS is available from Salford Systems [www.salford-systems.com]. SAS Enterprise Miner includes CHAID, CART, and the machine learning program C4.5 [www.rulequest.com] [36.63], which uses classifiers to generate decision trees and if–then rules. SPSS Clementine and Insightful Miner also include CART, but Ghostminer and XLMiner utilize different variants of decision trees. QUEST [www.stat.wisc.edu/~loh/quest.html] is available in SPSS's AnswerTree software and Statistica.

Although ANN software is widely available, the most complete package is Matlab's [www.mathworks.com] Neural Network Toolbox. Information on SVM software is available at [www.support-vector.net/software.html]. One good option is Matlab's SVM Toolbox.

36.4 DM Research and Applications


Many industrial and business applications require modeling and monitoring processes with real-time data of different types: real values, categorical, and even text. DM is an effective tool for extracting process knowledge and discovering data patterns to provide a control aid for these processes. Advanced DM research involves complex system modeling of heterogeneous objects, where adaptive algorithms are necessary to capture dynamic system behavior.

36.4.1 Activity Monitoring

One important DM application is the development of an effective data modeling and monitoring system for understanding customer profiles and detecting fraudulent behavior. This is generally referred to as activity monitoring for interesting events requiring action [36.64]. Other activity monitoring examples include credit card or insurance fraud detection, computer intrusion detection, some forms of fault detection, network performance monitoring, and news story monitoring.
Part D 36.4

credit card or insurance fraud detection, computer intru- extraction or DM tools.


sion detection, some forms of fault detection, network While SPC ideas can be applied to business data,
performance monitoring, and news story monitoring. SPC methods are not directly applicable. Existing SPC
Although activity monitoring has only recently re- theories are based on small or medium-sized samples,
ceived attention in the information industries, solutions and the basic hypothesis testing approach is intended to
to similar problems were developed long ago in the detect only simple shifts in a process mean or variance.
manufacturing industries, under the moniker statisti- Recently, Jiang et al. [36.70] successfully generalized
cal process control (SPC). SPC techniques have been the SPC framework to model and track thousands of di-
used routinely for online process control and monitor- versified customer behaviors in the telecommunication
ing to achieve process stability and to improve process industry. The challenge is to develop an integrated strat-
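To make the detection of special cause variation concrete, here is a deliberately simple Python sketch of a Shewhart individuals chart with three-sigma limits estimated from the average moving range; the simulated data, the shift at period 80, and the chosen constants are illustrative assumptions.

```python
import numpy as np

def shewhart_individuals(x, k=3.0):
    """Individuals control chart: estimate sigma from the average moving range
    and flag points outside the k-sigma limits as possible special causes."""
    x = np.asarray(x, dtype=float)
    center = x.mean()
    sigma_hat = np.abs(np.diff(x)).mean() / 1.128   # d2 constant for moving ranges of size 2
    ucl, lcl = center + k * sigma_hat, center - k * sigma_hat
    signals = np.where((x > ucl) | (x < lcl))[0]
    return center, lcl, ucl, signals

# simulated process: in control for 80 periods, then a sustained mean shift
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(10.0, 1.0, 80), rng.normal(13.0, 1.0, 20)])
center, lcl, ucl, signals = shewhart_individuals(x)
print(f"CL={center:.2f}  LCL={lcl:.2f}  UCL={ucl:.2f}  signals at {signals}")
```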
Although the principle of SPC can be applied to service industries, such as business process monitoring, fewer applications exist for two basic reasons that Montgomery [36.65] identified. First, the system that needs to be monitored and improved is obvious in manufacturing applications, while it is often difficult to define and observe in service industries. Second, even if the system can be clearly specified, most non-manufacturing operations do not have natural measurement systems that reflect the performance of the system. However, these obstacles no longer exist, due to the many natural and advanced measurement systems that have been developed. In the telecommunications industry, for example, advanced software and hardware technologies make it possible to record and process huge amounts of daily data in business transactions and service activities. These databases contain potentially useful information to the company that may not be discovered without knowledge extraction or DM tools.

While SPC ideas can be applied to business data, SPC methods are not directly applicable. Existing SPC theories are based on small or medium-sized samples, and the basic hypothesis testing approach is intended to detect only simple shifts in a process mean or variance. Recently, Jiang et al. [36.70] successfully generalized the SPC framework to model and track thousands of diversified customer behaviors in the telecommunication industry. The challenge is to develop an integrated strategy to monitor the performance of an entire multi-stage system and to develop effective and efficient techniques for detecting the systematic changes that require action. A dynamic business process can be described by the dynamic linear models introduced by West [36.71]:

Observation equation: X_t = A_t θ_t + Δ_t ,
System evolution equation: θ_t = B_t θ_{t−1} + Λ_t ,
Initial information: π(S_0) ,

where A_t and B_t represent observation and state transition matrices, respectively, and Δ_t and Λ_t represent observation and system transition errors, respectively. Based on the dynamic system model, a model-based process monitoring and root-cause identification method can be developed. Monitoring and diagnosis includes fault pattern generation and feature extraction, isolation of the critical processes, and root-cause identification. Jiang et al. [36.70] utilize this for individual customer prediction and monitoring. In general, individual modeling is computationally intractable and cluster models should be developed with mixture distributions [36.72].
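The following Python sketch gives generic Kalman-filter recursions for a dynamic linear model of this form; the time-invariant matrices A and B, the covariances Q and R assumed for Λ_t and Δ_t, and the local-level usage example are illustrative assumptions, not the monitoring scheme of Jiang et al. [36.70]. The standardized one-step forecast errors it returns are the kind of statistic one would track for systematic change.

```python
import numpy as np

def kalman_monitor(X, A, B, Q, R, theta0, P0):
    """Generic Kalman recursions for X_t = A theta_t + Delta_t,
    theta_t = B theta_{t-1} + Lambda_t (Q and R are the assumed covariances of
    Lambda_t and Delta_t). Returns filtered states and standardized one-step
    forecast errors."""
    theta, P = np.asarray(theta0, float), np.asarray(P0, float)
    states, errors = [], []
    for x_t in np.atleast_2d(X):
        theta = B @ theta                      # system evolution (time update)
        P = B @ P @ B.T + Q
        x_hat = A @ theta                      # one-step forecast of the observation
        S = A @ P @ A.T + R                    # forecast error covariance
        e = np.linalg.solve(np.linalg.cholesky(S), x_t - x_hat)   # standardized error
        K = P @ A.T @ np.linalg.inv(S)         # Kalman gain (measurement update)
        theta = theta + K @ (x_t - x_hat)
        P = (np.eye(len(theta)) - K @ A) @ P
        states.append(theta.copy())
        errors.append(e.copy())
    return np.array(states), np.array(errors)

# usage sketch: a local-level model for one customer metric
# A = B = np.eye(1); Q = 0.01 * np.eye(1); R = np.eye(1)
# states, errs = kalman_monitor(series.reshape(-1, 1), A, B, Q, R, np.zeros(1), np.eye(1))
```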
One particularly competitive industry is telecommunications. Since divestiture and government deregulation, various telephone services, such as cellular, local and long distance, domestic and commercial, have become battle grounds for telecommunication service providers. Because of the data and information oriented nature of the industry, DM methods for knowledge extraction are critical. To remain competitive, it is important for companies to develop business planning systems that help managers make good decisions. In particular, these systems will allow sales and marketing people to establish successful customer loyalty programs for churn prevention and to develop fraud detection modules for reducing revenue loss through market segmentation and customer profiling.

A major task in this research is to develop and implement DM tools within the business planning system. The objectives are to provide guidance for targeting business growth, to forecast year-end usage volume and revenue growth, and to value risks associated with the business plan periodically. Telecommunication business services include voice and non-voice services, which can be further categorized to include domestic, local, international, products, toll-free calls, and calling cards. For usage forecasting, a minutes growth model is utilized to forecast domestic voice usage. For revenue forecasting, the average revenue per minute on a log scale is used as a performance measure and is forecasted by a double exponential smoothing growth function. A structural model is designed to decompose the business growth process into three major subprocesses: add, disconnect, and base. To improve explanatory power, the revenue unit is further divided into different customer groups. To compute confidence and prediction intervals, bootstrapping and simulation methods are used.
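As an illustration of a double exponential smoothing growth forecast, the sketch below implements a generic Holt-type recursion in Python; the smoothing constants, the synthetic log-revenue series, and the forecast horizon are assumptions, and this is not the production model described above.

```python
import numpy as np

def double_exponential_smoothing(y, alpha=0.3, beta=0.1, horizon=6):
    """Holt's double exponential smoothing: track a level and a trend (growth)
    component, then extrapolate the last level plus trend for the forecast."""
    level, trend = y[0], y[1] - y[0]
    for t in range(1, len(y)):
        prev_level = level
        level = alpha * y[t] + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return np.array([level + (h + 1) * trend for h in range(horizon)])

# example: a synthetic monthly series of average revenue per minute on a log scale
y = np.log(np.linspace(100, 130, 24) + np.random.default_rng(2).normal(0, 1, 24))
print(double_exponential_smoothing(y, horizon=3))
```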
To understand the day effect and seasonal effect, the concept of bill-month equivalent business days (EBD) is defined and estimated. To estimate EBD, the factor characteristics of holidays (non-EBD) are identified and eliminated and the day effect is estimated. For seasonality, the US Bureau of the Census X-11 seasonal adjustment procedure is used.

36.4.2 Mahalanobis–Taguchi System

Genichi Taguchi is best known for his work on robust design and design of experiments. The Taguchi robust design methods have generated a considerable amount of discussion and controversy and are widely used in manufacturing [36.73–77]. The general consensus among statisticians seems to be that, while many of Taguchi's overall ideas on experimental design are very important and influential, the techniques he proposed are not necessarily the most effective statistical methods. Nevertheless, Taguchi has made significant contributions in the area of quality control and quality engineering. For DM, Taguchi has recently popularized the Mahalanobis–Taguchi System (MTS), a new set of tools for diagnosis, classification, and variable selection. The method is based on a Mahalanobis distance scale that is utilized to measure the level of abnormality in abnormal items as compared to a group of normal items. First, it must be demonstrated that a Mahalanobis distance measure based on all available variables is able to separate the abnormal from the normal items. Should this be successfully achieved, orthogonal arrays and signal-to-noise ratios are used to select an optimal combination of variables for calculating the Mahalanobis distances.

The MTS method has been claimed to be very powerful for solving a wide range of problems, including manufacturing inspection and sensing, medical diagnosis, face and voice recognition, weather forecasting, credit scoring, fire detection, earthquake forecasting, and university admissions. Two recent books have been published on the MTS method by Taguchi et al. [36.78] and Taguchi and Jugulum [36.79]. Many successful case studies in MTS have been reported in engineering and science applications in many large companies, such as Nissan Motor Co., Mitsubishi Space Software Co., Xerox, Delphi Automotive Systems, ITT Industries, Ford Motor Company, Fuji Photo Film Company, and others. While the method is getting a lot of attention in many industries, very little research [36.80] has been conducted to investigate how and when the method is appropriate.
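A minimal sketch of the first MTS step, assuming simulated data: it scales candidate items by the mean and covariance of the normal group and reports Mahalanobis distances divided by the number of variables, so normal items average about one; the subsequent variable-selection step with orthogonal arrays and signal-to-noise ratios is not shown.

```python
import numpy as np

def mts_mahalanobis(normal, items):
    """Scaled Mahalanobis distances of items relative to the normal group:
    distances are divided by the number of variables, so normal items average
    about 1 and abnormal items should score well above 1."""
    mu = normal.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(normal, rowvar=False))
    z = items - mu
    md2 = np.einsum("ij,jk,ik->i", z, cov_inv, z)   # squared Mahalanobis distance
    return md2 / normal.shape[1]

rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(200, 5))    # reference group of normal items
abnormal = rng.normal(3.0, 1.0, size=(10, 5))   # shifted (abnormal) items
print(mts_mahalanobis(normal, normal[:10]).round(2))   # near 1
print(mts_mahalanobis(normal, abnormal).round(2))      # much larger
```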
36.4.3 Manufacturing Process Modeling

One area of DM research in manufacturing industries is quality and productivity improvement through DM and knowledge discovery. Manufacturing systems nowadays are often very complicated and involve many manufacturing process stages where hundreds or thousands of in-process measurements are taken to indicate or initiate process control of the system.
For example, a modern semiconductor manufacturing process typically consists of over 300 steps, and in each step, multiple pieces of equipment are used to process the wafer. Inappropriate understanding of interactions among in-process variables will create inefficiencies at all phases of manufacturing, leading to long product/process realization cycle times and long development times, and resulting in excessive system costs.

Current approaches to DM in electronics manufacturing include neural networks, decision trees, Bayesian models and rough set theory [36.81, 82]. Each of these approaches carries certain advantages and disadvantages. Decision trees, for instance, produce intelligible rules and hence are very appropriate for generating process control or design of experiments strategies. They are, however, generally prone to outlier and imperfect data influences. Neural networks, on the other hand, are robust against data abnormalities but do not produce readily intelligible knowledge. These methods also differ in their ability to handle high-dimensional data, to discover arbitrarily shaped clusters [36.58] and to provide a basis for intuitive visualization [36.83]. They can also be sensitive to training and model building parameters [36.60]. Finally, the existing approaches do not take into consideration the localization of process parameters. The patterns or clusters identified by existing approaches may include parameters from a diverse set of components in the system. Therefore, a combination of methods that complement each other to provide a complete set of desirable features is necessary.

It is crucial to understand process structure and yield components in manufacturing, so that problem localization can permit reduced production costs. For example, semiconductor manufacturing practice shows that over 70% of all fatal defects and close to 90% of yield excursions are caused by problems related to process equipment [36.84]. Systematic defects can be attributed to many categories that are generally associated with technologies and combinations of different process operations. To implement DM methods successfully for knowledge discovery, some future research for manufacturing process control must include yield modeling, defect modeling and variation propagation.
Yield Modeling
In electronics manufacturing, the ANSI standards [36.85] and practice generally assume that the number of defects on an electronics product follows a Poisson distribution with mean λ. The Poisson random variable is an approximation of the sum of independent Bernoulli trials, but defects on different components may be correlated since process yield critically depends on product groups, process steps, and types of defects [36.86]. Unlike traditional defect models, an appropriate logit model can be developed as follows. Let the number of defects of category X on an electronics product be

U_X = Σ Y_X

and

logit[E(Y_X)] = α_0^X + α_O^X · O_X + α_C^X · C_X + α_OC^X · O_X · C_X ,

where logit(z) = log[z/(1 − z)] is the link function for Bernoulli distributions, and Y_X is a Bernoulli random variable representing a defect from defect category X. The default logit of the failure probability is α_0^X, and α_O^X and α_C^X are the main effects of operations (O_X) and components (C_X). Since the Y_X are correlated, this model will provide more detailed information about defects.
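One way such a logit model could be fit is sketched below with statsmodels on simulated defect indicators; the binary coding of the operation (op) and component (comp) covariates, the effect sizes, and the data are illustrative assumptions.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# simulate Bernoulli defect indicators Y_X with an operation covariate (op = O_X)
# and a component covariate (comp = C_X); coding and effect sizes are assumptions
rng = np.random.default_rng(3)
n = 5000
op = rng.integers(0, 2, n)
comp = rng.integers(0, 2, n)
eta = -4.0 + 0.8 * op + 0.5 * comp + 0.6 * op * comp   # alpha_0 + main effects + interaction
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))

df = pd.DataFrame({"y": y, "op": op, "comp": comp})
fit = smf.logit("y ~ op * comp", data=df).fit(disp=False)   # logit link for Bernoulli Y_X
print(fit.params)   # estimates of alpha_0^X, alpha_O^X, alpha_C^X, alpha_OC^X
```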
Multivariate Defect Modeling
Since different types of defects may be caused by the same operations, multivariate Poisson models are necessary to account for correlations among different types of defects. The trivariate reduction method suggests an additive Poisson model for the vector of Poisson counts U = (U_1, U_2, ..., U_k)′,

U = AV ,

where A is a matrix of zeros and ones, and V = (v_1, v_2, ..., v_p)′ consists of independent Poisson variables v_i. The variance–covariance matrix takes the form Var(U) = AΣA′ = Φ + νν′, where Φ = diag(µ_i) is a diagonal matrix with the mean of the individual series, and ν is the common covariance term. Note that the v_i are essentially latent variables, and a factor analysis model can be developed for analyzing multivariate discrete Poisson variables such that

log[E(U)] = µ + L · F ,

where U is the vector of defects, L is the matrix of factor loadings, and F contains common factors representing effects of specific operations. By using factor analysis, it is possible to relate product defects to the associated packages and operations.
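A small numerical sketch of the additive Poisson construction U = AV with independent Poisson components: the matrix A and the rates are illustrative assumptions, and the empirical covariance of the simulated counts is compared with the theoretical form discussed above.

```python
import numpy as np

# Additive Poisson construction U = A V: two defect-type counts share the common
# latent count v3, which induces positive correlation (rates and A are illustrative).
rng = np.random.default_rng(4)
A = np.array([[1, 0, 1],
              [0, 1, 1]])                       # zeros/ones: which v_i feed each U_j
rates = np.array([2.0, 3.0, 1.5])               # means of the independent Poisson v_i
V = rng.poisson(rates, size=(100_000, 3))       # independent latent counts
U = V @ A.T                                     # observed, correlated defect counts

print(np.cov(U, rowvar=False).round(2))         # empirical variance-covariance of U
print(A @ np.diag(rates) @ A.T)                 # theoretical Var(U) = A diag(rate) A'
```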
Multistage Variation Propagation
Inspection tests in an assembly line usually have functional overlap, and defects from successive inspection stations exhibit strong correlations. Modeling serially correlated defect counts is an important task for defect localization and yield prediction. Poisson regression models, such as the generalized event-count method [36.87] and its alternatives, can be utilized to account for serial correlations of defects in different inspection stations. Factor analysis methods based on hidden Markov models [36.88] can also be constructed to investigate how variations are propagated through assembly lines.
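As a stand-in illustration (not the generalized event-count method itself), the sketch below fits a generic Poisson regression of downstream defect counts on counts from the previous inspection station using statsmodels; the simulated data and lag structure are assumptions.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# defect counts at a downstream inspection station, regressed on the count
# observed at the previous station to capture serial correlation
rng = np.random.default_rng(5)
n = 400
upstream = rng.poisson(3.0, n)
lam = np.exp(0.2 + 0.25 * upstream)          # assumed dependence on upstream counts
downstream = rng.poisson(lam)

df = pd.DataFrame({"downstream": downstream, "upstream": upstream})
fit = smf.poisson("downstream ~ upstream", data=df).fit(disp=False)
print(fit.params)   # intercept and upstream coefficient (log-rate scale)
```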

36.5 Concluding Remarks


While DM and KDD methods are gaining recognition and have become very popular in many companies and enterprises, the success of these methods is still somewhat limited. Below, we discuss a few obstacles.

First, the success of DM depends on a close collaboration of subject-matter experts and data modelers. In practice, it is often easy to identify the right subject-matter expert, but difficult to find the qualified data modeler. While the data modeler must be knowledgeable and familiar with DM methods, it is more important to be able to formulate real problems such that the existing methods can be applied. In reality, traditional academic training mainly focuses on knowledge of modeling algorithms and lacks training in problem formulation and interpretation of results. Consequently, many modelers are very efficient in fitting models and algorithms to data, but have a hard time determining when and why they should use certain algorithms. Similarly, the existing commercial DM software systems include many sophisticated algorithms, but lack guidance on which algorithms to use.

Second, DM is difficult to implement effectively across an industry. Although it is clear that extracting hidden knowledge and trends across an industry would be useful and beneficial to all companies in the industry, it is typically impossible to integrate the detailed data from competing companies due to confidentiality and proprietary issues. Currently, the industry practice is that each company will integrate its own detailed data with the more general, aggregated industry-wide data for knowledge extraction. It is obvious that this approach will be significantly less effective than the approach of integrating the detailed data from all competing companies. It is expected that, if these obstacles can be overcome, the impact of the DM and KDD methods will be much more prominent in industrial and commercial applications.

References

36.1 M. J. A. Berry, G. Linoff: Mastering Data Mining: The Art and Science of Customer Relationship Management (Wiley, New York 2000)
36.2 E. Wegman: Data Mining Tutorial, Short Course Notes, Interface 2001 Symposium, Costa Mesa, California (2001)
36.3 P. Adriaans, D. Zantinge: Data Mining (Addison-Wesley, New York 1996)
36.4 J. H. Friedman: Data Mining and Statistics: What is the Connection? Technical Report (Stat. Dep., Stanford University 1997)
36.5 K. B. Clark, T. Fujimoto: Product Development and Competitiveness, J. Jpn Int. Econ. 6(2), 101–143 (1992)
36.6 D. W. LaBahn, A. Ali, R. Krapfel: New Product Development Cycle Time. The Influence of Project and Process Factors in Small Manufacturing Companies, J. Business Res. 36(2), 179–188 (1996)
36.7 J. Han, M. Kamber: Data Mining: Concept and Techniques (Morgan Kaufmann, San Francisco 2001)
36.8 T. Hastie, J. H. Friedman, R. Tibshirani: Elements of Statistical Learning: Data Mining, Inference, and Prediction (Springer, Berlin Heidelberg New York 2001)
36.9 S. Weisberg: Applied Linear Regression (Wiley, New York 1980)
36.10 G. Seber: Multivariate Observations (Wiley, New York 1984)
36.11 J. Neter, M. H. Kutner, C. J. Nachtsheim, W. Wasserman: Applied Linear Statistical Models, 4th edn. (Irwin, Chicago 1996)
36.12 A. E. Hoerl, R. Kennard: Ridge Regression: Biased Estimation of Nonorthogonal Problems, Technometrics 12, 55–67 (1970)
36.13 R. Tibshirani: Regression Shrinkage and Selection via the Lasso, J. R. Stat. Soc. Series B 58, 267–288 (1996)
36.14 A. Agresti: An Introduction to Categorical Data Analysis (Wiley, New York 1996)
36.15 D. Hand: Discrimination and Classification (Wiley, Chichester 1981)
36.16 P. McCullagh, J. A. Nelder: Generalized Linear Models, 2nd edn. (Chapman Hall, New York 1989)
36.17 T. Hastie, R. Tibshirani: Generalized Additive Models (Chapman Hall, New York 1990)
36.18 W. S. Cleveland: Robust Locally-Weighted Regression and Smoothing Scatterplots, J. Am. Stat. Assoc. 74, 829–836 (1979)
36.19 R. L. Eubank: Spline Smoothing and Nonparametric Regression (Marcel Dekker, New York 1988)
36.20 G. Wahba: Spline Models for Observational Data, Applied Mathematics, Vol. 59 (SIAM, Philadelphia 1990)
36.21 W. Härdle: Applied Non-parametric Regression (Cambridge Univ. Press, Cambridge 1990)
36.22 D. Biggs, B. deVille, E. Suen: A Method of Choosing Multiway Partitions for Classification and Decision Trees, J. Appl. Stat. 18(1), 49–62 (1991)
36.23 B. D. Ripley: Pattern Recognition and Neural Networks (Cambridge Univ. Press, Cambridge 1996)
36.24 L. Breiman, J. H. Friedman, R. A. Olshen, C. J. Stone: Classification and Regression Trees (Wadsworth, Belmont, California 1984)
36.25 J. N. Morgan, J. A. Sonquist: Problems in the Analysis of Survey Data, and a Proposal, J. Am. Stat. Assoc. 58, 415–434 (1963)
36.26 A. Fielding: Binary Segmentation: The Automatic Interaction Detector and Related Techniques for Exploring Data Structure. In: The Analysis of Survey Data, Volume I: Exploring Data Structures, ed. by C. A. O'Muircheartaigh, C. Payne (Wiley, New York 1977) pp. 221–258
36.27 W. Y. Loh, N. Vanichsetakul: Tree-Structured Classification Via Generalized Discriminant Analysis, J. Am. Stat. Assoc. 83, 715–728 (1988)
36.28 P. Chaudhuri, W. D. Lo, W. Y. Loh, C. C. Yang: Generalized Regression Trees, Stat. Sin. 5, 643–666 (1995)
36.29 W. Y. Loh, Y. S. Shih: Split-Selection Methods for Classification Trees, Statistica Sinica 7, 815–840 (1997)
36.30 J. H. Friedman, T. Hastie, R. Tibshirani: Additive Logistic Regression: a Statistical View of Boosting, Ann. Stat. 28, 337–407 (2000)
36.31 Y. Freund, R. Schapire: Experiments with a New Boosting Algorithm, Machine Learning: Proceedings of the Thirteenth International Conference, Bari, Italy 1996 (Morgan Kaufmann 1996) 148–156
36.32 L. Breiman: Bagging Predictors, Machine Learning 26, 123–140 (1996)
36.33 J. H. Friedman: Greedy Function Approximation: a Gradient Boosting Machine, Ann. Stat. 29, 1189–1232 (2001)
36.34 J. H. Friedman: Stochastic Gradient Boosting, Computational Statistics and Data Analysis 38(4), 367–378 (2002)
36.35 J. H. Friedman: Multivariate Adaptive Regression Splines (with Discussion), Ann. Stat. 19, 1–141 (1991)
36.36 J. H. Friedman, B. W. Silverman: Flexible Parsimonious Smoothing and Additive Modeling, Technometrics 31, 3–39 (1989)
36.37 R. P. Lippmann: An Introduction to Computing with Neural Nets, IEEE ASSP Magazine April, 4–22 (1987)
36.38 S. S. Haykin: Neural Networks: A Comprehensive Foundation, 2nd edn. (Prentice Hall, Upper Saddle River 1999)
36.39 H. White: Learning in Neural Networks: a Statistical Perspective, Neural Computation 1, 425–464 (1989)
36.40 A. R. Barron, R. L. Barron, E. J. Wegman: Statistical Learning Networks: A Unifying View, Computer Science and Statistics: Proceedings of the 20th Symposium on the Interface 1992, ed. by E. J. Wegman, D. T. Gantz, J. J. Miller (American Statistical Association, Alexandria, VA 1992) 192–203
36.41 B. Cheng, D. M. Titterington: Neural Networks: A Review from a Statistical Perspective (with discussion), Stat. Sci. 9, 2–54 (1994)
36.42 D. Rumelhart, G. Hinton, R. Williams: Learning Internal Representations by Error Propagation. In: Parallel Distributed Processing: Explorations in the Microstructures of Cognition, Vol. 1: Foundations, ed. by D. E. Rumelhart, J. L. McClelland (MIT, Cambridge 1986) pp. 318–362
36.43 V. Vapnik: The Nature of Statistical Learning (Springer, Berlin Heidelberg New York 1996)
36.44 C. J. C. Burges: A Tutorial on Support Vector Machines for Pattern Recognition, Knowledge Discovery and Data Mining 2(2), 121–167 (1998)
36.45 J. Shawe-Taylor, N. Cristianini: Kernel Methods for Pattern Analysis (Cambridge Univ. Press, Cambridge 2004)
36.46 N. Cristianini, J. Shawe-Taylor: An Introduction to Support Vector Machines (Cambridge Univ. Press, Cambridge 2000)
36.47 P. Huber: Robust Estimation of a Location Parameter, Ann. Math. Stat. 53, 73–101 (1964)
36.48 B. V. Dasarathy: Nearest Neighbor Pattern Classification Techniques (IEEE Computer Society, New York 1991)
36.49 T. Hastie, R. Tibshirani: Discriminant Adaptive Nearest-Neighbor Classification, IEEE Pattern Recognition and Machine Intelligence 18, 607–616 (1996)
36.50 T. Hastie, R. Tibshirani, A. Buja: Flexible Discriminant and Mixture Models. In: Statistics and Artificial Neural Networks, ed. by J. Kay, M. Titterington (Oxford Univ. Press, Oxford 1998)
36.51 J. R. Koza: Genetic Programming: On the Programming of Computers by Means of Natural Selection (MIT, Cambridge 1992)
36.52 W. Banzhaf, P. Nordin, R. E. Keller, F. D. Francone: Genetic Programming: An Introduction (Morgan Kaufmann, San Francisco 1998)
36.53 P. W. H. Smith: Genetic Programming as a Data-Mining Tool. In: Data Mining: A Heuristic Approach, ed. by H. A. Abbass, R. A. Sarker, C. S. Newton (Idea Group Publishing, London 2002) pp. 157–173
36.54 R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, A. I. Verkamo: Fast Discovery of Association Rules: Advances in Knowledge Discovery and Data Mining (MIT, Cambridge 1995) Chap. 12
36.55 A. Gordon: Classification, 2nd edn. (Chapman Hall, New York 1999)
36.56 J. A. Hartigan, M. A. Wong: A K-Means Clustering Algorithm, Appl. Stat. 28, 100–108 (1979)
36.57 L. Kaufman, P. Rousseeuw: Finding Groups in Data: An Introduction to Cluster Analysis (Wiley, New York 1990)
36.58 M. Ester, H.-P. Kriegel, J. Sander, X. Xu: A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases, Proceedings of 1996 International Conference on Knowledge Discovery and Data Mining (KDD96), Portland 1996, ed. by E. Simoudis, J. Han, U. Fayyad (AAAI Press, Menlo Park 1996) 226–231
36.59 J. Sander, M. Ester, H.-P. Kriegel, X. Xu: Density-Based Clustering in Spatial Databases: The Algorithm GDBSCAN and its Applications, Data Mining and Knowledge Discovery 2(2), 169–194 (1998)
36.60 M. Ankerst, M. M. Breunig, H.-P. Kriegel, J. Sander: OPTICS: Ordering Points to Identify the Clustering Structure, Proc. ACM SIGMOD Int. Conf. on Management of Data, Philadelphia, Pennsylvania, June 1-3, 1999 (ACM Press, New York 1999) 49–60
36.61 T. Kohonen: Self-Organization and Associative Memory, 3rd edn. (Springer, Berlin Heidelberg New York 1989)
36.62 D. Haughton, J. Deichmann, A. Eshghi, S. Sayek, N. Teebagy, H. Topi: A Review of Software Packages for Data Mining, Amer. Stat. 57(4), 290–309 (2003)
36.63 J. R. Quinlan: C4.5: Programs for Machine Learning (Morgan Kaufmann, San Mateo 1993)
36.64 T. Fawcett, F. Provost: Activity Monitoring: Noticing Interesting Changes in Behavior, Proceedings of KDD-99, San Diego 1999 (San Diego, CA 1999) 53–62
36.65 D. C. Montgomery: Introduction to Statistical Quality Control, 5th edn. (Wiley, New York 2001)
36.66 W. H. Woodall, K.-L. Tsui, G. R. Tucker: A Review of Statistical and Fuzzy Quality Control Based on Categorical Data, Frontiers in Statistical Quality Control 5, 83–89 (1997)
36.67 D. C. Montgomery, W. H. Woodall: A Discussion on Statistically-Based Process Monitoring and Control, J. Qual. Technol. 29, 121–162 (1997)
36.68 A. J. Hayter, K.-L. Tsui: Identification and Quantification in Multivariate Quality Control Problems, J. Qual. Tech. 26(3), 197–208 (1994)
36.69 R. L. Mason, C. W. Champ, N. D. Tracy, S. J. Wierda, J. C. Young: Assessment of Multivariate Process Control Techniques, J. Qual. Technol. 29, 140–143 (1997)
36.70 W. Jiang, S.-T. Au, K.-L. Tsui: A Statistical Process Control Approach for Customer Activity Monitoring, Technical Report, AT&T Labs (2004)
36.71 M. West, J. Harrison: Bayesian Forecasting and Dynamic Models, 2nd edn. (Springer, New York 1997)
36.72 C. Fraley, A. E. Raftery: Model-based Clustering, Discriminant Analysis, and Density Estimation, J. Amer. Stat. Assoc. 97, 611–631 (2002)
36.73 G. Taguchi: Introduction to Quality Engineering: Designing Quality into Products and Processes (Asian Productivity Organization, Tokyo 1986)
36.74 G. E. P. Box, R. N. Kacker, V. N. Nair, M. S. Phadke, A. C. Shoemaker, C. F. Wu: Quality Practices in Japan, Qual. Progress March, 21–29 (1988)
36.75 V. N. Nair: Taguchi's Parameter Design: A Panel Discussion, Technometrics 34, 127–161 (1992)
36.76 K.-L. Tsui: An Overview of Taguchi Method and Newly Developed Statistical Methods for Robust Design, IIE Trans. 24, 44–57 (1992)
36.77 K.-L. Tsui: A Critical Look at Taguchi's Modeling Approach for Robust Design, J. Appl. Stat. 23, 81–95 (1996)
36.78 G. Taguchi, S. Chowdhury, Y. Wu: The Mahalanobis–Taguchi System (McGraw-Hill, New York 2001)
36.79 G. Taguchi, R. Jugulum: The Mahalanobis–Taguchi Strategy: A Pattern Technology System (Wiley, New York 2002)
36.80 W. H. Woodall, R. Koudelik, K.-L. Tsui, S. B. Kim, Z. G. Stoumbos, C. P. Carvounis: A Review and Analysis of the Mahalanobis–Taguchi System, Technometrics 45(1), 1–15 (2003)
36.81 A. Kusiak, C. Kurasek: Data Mining of Printed-Circuit Board Defects, IEEE Transactions on Robotics and Automation 17(2), 191–196 (2001)
36.82 A. Kusiak: Rough Set Theory: A Data Mining Tool for Semiconductor Manufacturing, IEEE Transactions on Electronics Packaging Manufacturing 24(1), 44–50 (2001)
36.83 A. Ultsch: Information and Classification: Concepts, Methods and Applications (Springer, Berlin Heidelberg New York 1993)
36.84 A. Y. Wong: A Statistical Approach to Identify Semiconductor Process Equipment Related Yield Problems, IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems, Paris 1997 (IEEE Computer Society, Paris 1997) 20–22
36.85 ANSI: IPC-9261, In-Process DPMO and Estimated Yield for PWB (American National Standards Institute 2002)
36.86 M. Baron, C. K. Lakshminarayan, Z. Chen: Markov Random Fields in Pattern Recognition for Semiconductor Manufacturing, Technometrics 43, 66–72 (2001)
36.87 G. King: Event Count Models for International Relations: Generalizations and Applications, International Studies Quarterly 33(2), 123–147 (1989)
36.88 P. Smyth: Hidden Markov Models for Fault Detection in Dynamic Systems, Pattern Recognition 27(1), 149–164 (1994)
