36. Data Mining Methods and Applications
… for analysis. Although these issues are generally considered uninteresting to modelers, the largest portion of the knowledge discovery process is spent handling data. It is also of great importance since the resulting models can only be as good as the data on which they are based.

The fourth part is the core of the chapter and describes popular data mining methods, separated as supervised versus unsupervised learning. In supervised learning, the training data set includes observed output values ("correct answers") for the given set of inputs. If the outputs are continuous/quantitative, then we have a regression problem. If the outputs are categorical/qualitative, then we have a classification problem. Supervised learning methods are described in the context of both regression and classification (as appropriate), beginning with the simplest case of linear models, then presenting more complex modeling with trees, neural networks, and support vector machines, and concluding with some methods, such as nearest neighbor, that are only for classification. In unsupervised learning, the training data set does not contain output values. Unsupervised learning methods are described under two categories: association rules and clustering. Association rules are appropriate for business applications where precise numerical data may not be available, while clustering methods are more technically similar to the supervised learning methods presented in this chapter. Finally, this section closes with a review of various software options.

The fifth part presents current research projects, involving both industrial and business applications. In the first project, data is collected from monitoring systems, and the objective is to detect unusual activity that may require action. For example, credit card companies monitor customers' credit card usage to detect possible fraud. While methods from statistical process control were developed for similar purposes, the difference lies in the quantity of data. The second project describes data mining tools developed by Genichi Taguchi, who is well known for his industrial work on robust design. The third project tackles quality and productivity improvement in manufacturing industries. Although some detail is given, considerable research is still needed to develop a practical tool for today's complex manufacturing processes.

Finally, the last part provides a brief discussion on remaining problems and future trends.
Data mining (DM) is the process of exploration and analysis, by automatic or semiautomatic means, of large quantities of data to discover meaningful patterns and rules [36.1]. Statistical DM is exploratory data analysis with little or no human interaction using computationally feasible techniques, i.e., the attempt to find unknown interesting structure [36.2]. Knowledge discovery in databases (KDD) is a multidisciplinary research field for nontrivial extraction of implicit, previously unknown, and potentially useful knowledge from data [36.3]. Although some treat DM and KDD equivalently, they can be distinguished as follows. The KDD process employs DM methods (algorithms) to extract knowledge according to the specifications of measures and thresholds, using a database along with any necessary preprocessing or transformations. DM is a step in the KDD process consisting of particular algorithms (methods) that, under some acceptable objective, produce particular patterns or knowledge over the data. The two primary fields that develop DM methods are statistics and computer science. Statisticians support DM by mathematical theory and statistical methods, while computer scientists develop computational algorithms and relevant software [36.4]. Prerequisites for DM include: (1) advanced computer technology (large CPU, parallel architecture, etc.) to allow fast access to large quantities of data and to enable computationally intensive algorithms and statistical methods; and (2) knowledge of the business or subject matter to formulate the important business questions and interpret the discovered knowledge.

With competition increasing, DM and KDD have become critical for companies to retain customers and ensure profitable growth. Although most companies are able to collect vast amounts of business data, they are often unable to leverage this data effectively to gain new knowledge and insights. DM is the process of applying sophisticated analytical and computational techniques to discover exploitable patterns in complex data. In many cases, the process of DM results in actionable knowledge and insights. Examples of DM applications include fraud detection, risk assessment, customer relationship management, cross selling, insurance, banking, retail, etc.

While many of these applications involve customer relationship management in the service industry, a potentially fruitful area is performance improvement and cost reduction through DM in industrial and manufacturing systems. For example, in the fast-growing and highly competitive electronics industry, total revenue worldwide in 2003 was estimated to be $900 billion, and the growth rate is estimated at 8% per year (www.selectron.com). However, economies of scale, purchasing power, and global competition are making the business such that one must either be a big player or serve a niche market. Today, extremely short life cycles and constantly declining prices are pressuring the electronics industry to manufacture their products with high quality, high yield, and low production cost.

To be successful, industry will require improvements at all phases of manufacturing. Figure 36.1 illustrates the three primary phases: design, ramp-up, and production. In the production phase, maintenance of a high performance level via improved system diagnosis is needed. In the ramp-up phase, reduction in new product development time is sought by achieving the required performance as quickly as possible. Market demands have been forcing reduced development time for new product and production system design. For example, in the computer industry, a product's life cycle has been shortened to 2–3 years recently, compared to a life cycle of 3–5 years a few years ago. As a result, there are a number of new concepts in the area of production systems, such as flexible and reconfigurable manufacturing systems. Thus, in the design phase, improved system performance integrated at both the ramp-up and production phases is desired. Some of the most critical factors and barriers in the competitive development of modern manufacturing systems lie in the largely uncharted area of predicting system performance during the design phase [36.5, 6]. Consequently, current systems necessitate that a large number of design/engineering changes be made after the system has been designed.

[Fig. 36.1 Manufacturing system development phases: define and validate product (KPCs), define and validate process (KCCs), design and refinement (KPCs, KCCs), launch/ramp-up (KPCs, KCCs), and production, shown against product and process design time, ramp-up, production, and overall lead time. KPCs = key product characteristics; KCCs = key control characteristics]
At all phases, system performance depends on many manufacturing process stages and hundreds or thousands of variables whose interactions are not well understood. For example, in the multi-stage printed circuit board (PCB) industry, the stages include process operations such as paste printing, chip placement, and wave soldering; and also include test operations such as optical inspection, vision inspection, and functional test. Due to advancements in information technology, sophisticated software and hardware technologies are available to record and process huge amounts of daily data in these process and testing stages. This makes it possible to extract important and useful information to improve process and product performance through DM and quality improvement technologies.
[Figs. 36.2–36.4 KDD process flow: starting from business objectives and the source systems (legacy and external systems), the data are explored and placed into model discovery and model evaluation files; models are constructed and evaluated; the extracted knowledge is transformed into a usable format and communicated (as ideas, reports, and models) to a knowledge database so that business decisions can be made and the models improved]
As an example of formulating business objectives, consider a telecommunications company. It is critically important to identify those customer traits that retain profitable customers and predict fraudulent behavior, credit risks and customer churn. This knowledge may be used to improve programs in target marketing, marketing channel management, micro-marketing, and cross selling. Finally, continually updating this knowledge will enable the company to meet the challenges of new product development effectively in the future. Steps 2–4 are illustrated in Figs. 36.2–36.4. Approximately 20–25% of effort is spent on determining business objectives, 50–60% of effort is spent on data preparation, 10–15% is spent on DM, and about 10% is spent on consolidation/application.
Currently, clustering and other statistical modeling are used.

The data preparation process involves three steps: data cleaning, database sampling, and database reduction and transformation. Data cleaning includes removal of duplicate variables, imputation of missing values, identification and correction of data inconsistencies, identification and updating of stale data, and creating a unique record (case) identification (ID). Via database sampling, the KDD process selects appropriate parts of the databases to be examined. For this to work, the data must satisfy certain conditions (e.g., no systematic biases). The sampling process can be expensive if the data have been stored in a database system such that it is difficult to sample the data the way you want and many operations need to be executed to obtain the targeted data. One must balance a trade-off between the costs of the sampling process and the mining process. Finally, database reduction is used for data cube aggregation, dimension reduction, elimination of irrelevant and redundant attributes, data compression, and encoding mechanisms via quantizations, wavelet transformation, principal components, etc.
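To make these preparation steps concrete, the following is a minimal sketch using the pandas and scikit-learn libraries (a tooling choice for this illustration only, not something prescribed by the chapter); the file name and column contents are hypothetical, and the data are assumed to contain at least five numeric attributes.

import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical raw extract pulled from the source systems.
raw = pd.read_csv("customer_extract.csv")

# --- Data cleaning ---
raw = raw.drop_duplicates()                      # remove duplicate records
raw = raw.loc[:, ~raw.columns.duplicated()]      # remove duplicate variables
num_cols = raw.select_dtypes("number").columns
raw[num_cols] = raw[num_cols].fillna(raw[num_cols].median())  # impute missing values
raw["case_id"] = range(1, len(raw) + 1)          # unique record (case) ID

# --- Database sampling ---
# A simple random sample; in practice the sample must be checked for systematic bias.
sample = raw.sample(frac=0.10, random_state=1)

# --- Database reduction / transformation ---
# Project the numeric attributes onto a few principal components.
pca = PCA(n_components=5)
reduced = pca.fit_transform(sample[num_cols])
print(reduced.shape, pca.explained_variance_ratio_.round(3))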
… sample data into two parts: the training data and the testing data. The training data will be used to fit the model and the testing data is used to refine and tune the fitted model. After the final model is obtained, it is recommended to use an independent data set to evaluate the goodness of the final model, such as comparing the prediction error to the accuracy requirement. (If independent data are not available, one can use the cross-validation method to compute prediction error.) If the accuracy requirement is not satisfied, then one must revisit earlier steps to reconsider other classes of models or collect additional data.

Before implementing any sophisticated DM methods, data description and visualization are used for initial exploration. Tools include descriptive statistical measures for central tendency/location, dispersion/spread, and distributional shape and symmetry; class characterizations and comparisons using analytical approaches, attribute relevance analysis, and class discrimination and comparisons; and data visualization using scatter-plot matrices, density plots, 3-D stereoscopic scatter-plots, and parallel coordinate plots. Following this initial step, DM methods take two forms: supervised versus unsupervised learning. Supervised learning is described as learning with a teacher, where the teacher provides data with correct answers. For example, if we want to classify online shoppers as buyers or non-buyers using an available set of variables, our data would include actual instances of buyers and non-buyers for training a DM method. Unsupervised learning is described as learning without a teacher. In this case, correct answers are not available, and DM methods would search for patterns or clusters of similarity that could later be linked to some explanation.

[Fig. 36.5 Data mining process: sample data are split into train data (build/fit model), test (validation) data (refine/tune model, e.g., model size and diagnostics), and evaluation (test) data (evaluate model, e.g., prediction error); if the accuracy requirement is not met, alternate models are chosen or more data are collected; once it is met, the model is used to score data, make predictions, and support decisions]
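As a rough illustration of the splitting and evaluation strategy in Fig. 36.5, the sketch below uses scikit-learn (an implementation choice made here, not the chapter's prescription); the synthetic data, the tree model class, and the candidate model sizes are assumptions for the example only.

import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                     # synthetic inputs (illustration only)
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=500)  # synthetic response

# Split into training data (fit), validation data (tune), and evaluation data (final check).
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=1)
X_val, X_eval, y_val, y_eval = train_test_split(X_rest, y_rest, test_size=0.5, random_state=1)

# Tune a model-size parameter on the validation data.
best_depth, best_err = None, np.inf
for depth in (2, 3, 4, 5):
    model = DecisionTreeRegressor(max_depth=depth).fit(X_train, y_train)
    err = np.mean((model.predict(X_val) - y_val) ** 2)
    if err < best_err:
        best_depth, best_err = depth, err

# Independent evaluation data estimate the prediction error of the final model.
final = DecisionTreeRegressor(max_depth=best_depth).fit(X_train, y_train)
print("evaluation MSE:", np.mean((final.predict(X_eval) - y_eval) ** 2))

# If no independent data were available, cross-validation gives a comparable estimate.
print("5-fold CV MSE:", -cross_val_score(final, X, y,
                                          scoring="neg_mean_squared_error", cv=5).mean())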
36.3.1 Supervised Learning

In supervised learning, we have a set of input variables (also known as predictors, independent variables, x) that are measured or preset, and a set of output variables (also known as responses, dependent variables, y) that are measured and assumed to be influenced by the inputs. If the outputs are continuous/quantitative, then we have a regression or prediction problem. If the outputs are categorical/qualitative, then we have a classification problem. First, a DM model/system is established based on the collected input and output data. Then, the established model is used to predict output values at new input values. The predicted values are denoted by ŷ.

The DM perspective of learning with a teacher follows these steps:

• Student presents an answer (ŷi given xi);
• Teacher provides the correct answer yi or an error ei for the student's answer;
• The result is characterized by some loss function or lack-of-fit criterion: LOF(y, ŷ);
• The objective is to minimize the expected loss.

Supervised learning includes the common engineering task of function approximation, in which we assume that the output is related to the input via some function f(x, ε), where ε represents a random error, and seek to …

… regression are abundant (e.g., [36.9, 10]). In particular, Neter et al. [36.11] provides a good background on residual analysis, model diagnostics, and model selection using best subsets and stepwise methods. In model selection, insignificant model terms are eliminated; thus, the final model may be a subset of the original pre-specified model. An alternate approach is to use a shrinkage method that employs a penalty function to shrink estimated model parameters towards zero, essentially reducing the influence of less important terms. Two options are ridge regression [36.12], which uses the penalty form Σ βm², and the lasso [36.13], which uses the penalty form Σ |βm|.

In the classification case, linear methods generate linear decision boundaries to separate the C classes. Although a direct linear regression approach could be applied, it is known not to work well. A better method is logistic regression [36.14], which uses log-odds (or logit transformations) of the posterior probabilities µc(x) = P(Y = c|X = x) for classes c = 1, …, C − 1 in the form …
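To make the shrinkage and logistic-regression ideas concrete, here is a minimal scikit-learn sketch (the library, the penalty weights, and the synthetic data are assumptions for illustration, not recommendations from the chapter).

import numpy as np
from sklearn.linear_model import Ridge, Lasso, LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=200)     # quantitative response
y_class = (y > 0).astype(int)                              # categorical response

# Shrinkage methods: both penalize the size of the coefficients beta_m.
ridge = Ridge(alpha=1.0).fit(X, y)     # penalty on the sum of beta_m^2
lasso = Lasso(alpha=0.1).fit(X, y)     # penalty on the sum of |beta_m|; some coefficients shrink to 0
print("ridge coefficients:", ridge.coef_.round(2))
print("lasso coefficients:", lasso.coef_.round(2))

# Logistic regression models the log-odds of the posterior class probabilities.
logit = LogisticRegression().fit(X, y_class)
print("estimated P(Y=1|x) for the first case:", logit.predict_proba(X[:1])[0, 1].round(3))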
… (GLM) [36.16]. GLM forms convert what appear to be nonlinear models into linear models, using tools such as transformations (e.g., logit) or conditioning on nonlinear parameters. This then enables the modeler to use traditional linear modeling analysis techniques. However, real data often do not satisfy the restrictive conditions of these models.

Rather than using pre-specified model terms, as in a linear model, a generalized additive model (GAM) [36.17] provides a more flexible statistical method to enable modeling of nonlinear patterns in each input dimension. In the regression case, the basic GAM form is

µ(x) = β0 + Σ_{j=1}^p f_j(x_j) ,

where the f_j(·) are unspecified (smooth) univariate functions, one for each input variable. The additive restriction prohibits inclusion of any interaction terms. Each function is fitted using a nonparametric regression modeling method, such as running-line smoothers (e.g., lowess [36.18]), smoothing splines or kernel smoothers [36.19–21]. In the classification case, an additive logistic regression model utilizes the logit transformation for classes c = 1, …, C − 1 as above,

log[µ_c(x)/µ_C(x)] = log[P(Y = c|X = x)/P(Y = C|X = x)] = β0 + Σ_{j=1}^p f_j(x_j) ,

where an additive model is used in place of the linear model. However, even with the flexibility of nonparametric regression, GAM may still be too restrictive. The following sections describe methods that have essentially no assumptions on the underlying model form.

Trees and Related Methods
One DM decision tree model is chi-square automatic interaction detection (CHAID) [36.22, 23], which builds non-binary trees using a chi-square test for the classification case and an F-test for the regression case. The CHAID algorithm first creates categorical input variables out of any continuous inputs by dividing them into several categories with approximately the same number of observations. Next, input variable categories that are not statistically different are combined, while a Bonferroni p-value is calculated for those that are statistically different. The best split is determined by the smallest p-value. CHAID continues to select splits until the smallest p-value is greater than a pre-specified significance level (α).

The popular classification and regression trees (CART) [36.24] utilize recursive partitioning (binary splits), which evolved from the work of Morgan and Sonquist [36.25] and Fielding [36.26] on analyzing survey data. CARTs have a forward stepwise procedure that adds model terms and a backward procedure for pruning. The model terms partition the x-space into disjoint hyper-rectangular regions via indicator functions: b+(x; t) = 1{x > t}, b−(x; t) = 1{x ≤ t}, where the split-point t defines the borders between regions. The resulting model terms are

f_m(x) = ∏_{l=1}^{L_m} b_{s_{l,m}}(x_{v(l,m)}; t_{l,m}) ,   (36.1)

where L_m is the number of univariate indicator functions multiplied in the m-th model term, x_{v(l,m)} is the input variable corresponding to the l-th indicator function in the m-th model term, t_{l,m} is the split-point corresponding to x_{v(l,m)}, and s_{l,m} is +1 or −1 to indicate the direction of the partition. The CART model form is then

f(x; β) = β0 + Σ_{m=1}^M β_m f_m(x) .   (36.2)

The partitioning of the x-space does not keep the parent model terms because they are redundant. For example, suppose the current set has the model term

f_a(x) = 1{x3 > 7} · 1{x4 ≤ 10} ,

and the forward stepwise algorithm chooses to add

f_b(x) = f_a(x) · 1{x5 > 13} = 1{x3 > 7} · 1{x4 ≤ 10} · 1{x5 > 13} .

Then the model term f_a(x) is dropped from the current set. Thus, the recursive partitioning algorithm follows a binary tree with the current set of model terms f_m(x) consisting of the M leaves of the tree, each of which corresponds to a different region R_m.

In the regression case, CART minimizes the squared error loss function

LOF(f̂) = Σ_{i=1}^N (y_i − f̂(x_i))² ,

and the approximation is a piecewise-constant function. In the classification case, each region R_m is classified into one of the C classes. Specifically, define the proportion of class c observations in region R_m as

δ̂_mc = (1/N_m) Σ_{x_i ∈ R_m} 1{y_i = c} ,

…

… (MART) [36.33], then consists of much lower-order interaction terms. Friedman [36.34] presents stochastic gradient boosting, with a variety of loss functions, in which a bootstrap-like bagging procedure is included in …
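As a rough illustration of recursive binary partitioning, the sketch below fits a classification tree with scikit-learn, used here only as a stand-in for the original CART software; the synthetic rectangular regions and the pruning parameter are assumptions made for the example.

import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
X = rng.uniform(0, 20, size=(400, 3))
# Class membership depends on rectangular regions of the x-space,
# exactly the structure that recursive partitioning recovers.
y = ((X[:, 0] > 7) & (X[:, 1] <= 10)).astype(int)

# Recursive binary splits; cost-complexity pruning plays the role of the backward step.
tree = DecisionTreeClassifier(max_depth=4, ccp_alpha=0.01).fit(X, y)
print(export_text(tree, feature_names=["x1", "x2", "x3"]))

# Each leaf corresponds to a region R_m; a new point receives the majority class of its leaf.
print(tree.predict([[9.0, 5.0, 1.0]]))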
… layers with weighted connections between nodes in different layers (Fig. 36.6). At the input layer, the nodes are the input variables and at the output layer, the nodes are the response variable(s). In between, there is usually at least one hidden layer which induces flexibility into the modeling. Activation functions define transformations between layers (e.g., input to hidden). Connections between nodes can feed back to previous layers, but for supervised learning, the typical ANN is feedforward only with at least one hidden layer.

The general form of a feedforward ANN with one hidden layer and activation functions b1(·) (input to hidden) and b2(·) (hidden to output) is

f_c(x; w, v, θ, γ_c) = b2[ Σ_{h=1}^H w_{hc} · b1( Σ_{j=1}^p v_{jh} x_j + θ_h ) + γ_c ] ,   (36.3)

where c = 1, …, C and C is the number of output variables, p is the number of input variables, H is the number of hidden nodes, the weights v_{jh} link input nodes j to hidden nodes h, and w_{hc} link hidden nodes h to output nodes c …

… the bell-shaped radial basis functions. Commonly used sigmoidal functions are the logistic function

b(z) = 1/(1 + e^{−z})

and the hyperbolic tangent

b(z) = tanh(z) = (1 − e^{−2z})/(1 + e^{−2z}) .

The most common radial basis function is the Gaussian probability density function.

In the regression case, each node in the output layer represents a quantitative response variable. The output activation function may be either a linear, sigmoidal, or radial basis function. Using a logistic activation function from input to hidden and from hidden to output, the ANN model in (36.3) becomes

f_c(x; w, v, θ, γ_c) = [1 + exp(−(Σ_{h=1}^H w_{hc} z_h + γ_c))]^{−1} ,

where for each hidden node h

z_h = [1 + exp(−(Σ_{j=1}^p v_{jh} x_j + θ_h))]^{−1} …
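A minimal numpy sketch of this feedforward form with logistic activations for both layers follows; the weights and dimensions below are arbitrary illustrations (random numbers), not fitted values, and the function only evaluates the model in (36.3), it does not train it.

import numpy as np

def ann_predict(x, v, theta, w, gamma):
    """Feedforward ANN with one hidden layer and logistic activations,
    following the form of (36.3): hidden values z_h, then outputs f_c."""
    logistic = lambda z: 1.0 / (1.0 + np.exp(-z))
    z = logistic(x @ v + theta)          # hidden-node values z_h, h = 1..H
    return logistic(z @ w + gamma)       # outputs f_c, c = 1..C

rng = np.random.default_rng(0)
p, H, C = 4, 3, 2                        # inputs, hidden nodes, outputs
v = rng.normal(size=(p, H))              # weights v_jh: input j -> hidden h
theta = rng.normal(size=H)               # hidden-node constants theta_h
w = rng.normal(size=(H, C))              # weights w_hc: hidden h -> output c
gamma = rng.normal(size=C)               # output constants gamma_c

x = rng.normal(size=p)
print(ann_predict(x, v, theta, w, gamma))   # C outputs, each in (0, 1)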
… iteration, each coefficient (say w) is adjusted according to its contribution to the lack-of-fit,

∆w = α ∂(LOF)/∂w ,

where the user-specified α controls the step size; see Rumelhart et al. [36.42] for more details. More efficient training procedures are a subject of current ANN research.

Another major issue is the network architecture, defined by the number of hidden nodes. If too many hidden nodes are permitted, the ANN model will overfit the data. Many model discrimination methods have been tested, but the most reliable is validation of the model on a testing data set separate from the training data set. Several ANN architectures are fitted to the training data set and then prediction error is measured on the testing data set. Although ANNs are generally flexible enough to model anything, they are computationally intensive, and a significant quantity of representative data is required to both fit and validate the model. From a statistical perspective, the primary drawback is the overly large set of coefficients, none of which provide any intuitive understanding for the underlying model structure. In addition, since the nonlinear model form is not motivated by the true model structure, too few training data points can result in ANN approximations with extraneous nonlinearity. However, given enough good data, ANNs can outperform other modeling methods.

Support Vector Machines
Referring to the linear methods for classification described earlier, the decision boundary between two classes is a hyperplane of the form {x | β0 + Σ β_j x_j = 0}. The support vectors are the points that are most critical to determining the optimal decision boundary because they lie close to the points belonging to the other class. With support vector machines (SVM) [36.43, 44], the linear decision boundary is generalized to the more flexible form

f(x; β) = β0 + Σ_{m=1}^M β_m g_m(x) ,   (36.4)

where the g_m(x) are transformations of the input vector. The decision boundary is then defined by {x | f(x; β) = 0}. To solve for the optimal decision boundary, it turns out that we do not need to specify the transformations g_m(x), but instead require only the kernel function [36.21, 45]

K(x, x′) = ⟨(g_1(x), …, g_M(x)), (g_1(x′), …, g_M(x′))⟩ .

Two popular kernel functions for SVM are polynomials of degree d, K(x, x′) = (1 + ⟨x, x′⟩)^d, and radial basis functions, K(x, x′) = exp(−‖x − x′‖²/c).

Given K(x, x′), we maximize the following Lagrangian dual-objective function:

max_{α_1,…,α_N} Σ_{i=1}^N α_i − (1/2) Σ_{i=1}^N Σ_{i′=1}^N α_i α_{i′} y_i y_{i′} K(x_i, x_{i′})
s.t. 0 ≤ α_i ≤ γ for i = 1, …, N, and Σ_{i=1}^N α_i y_i = 0 ,

where γ is an SVM tuning parameter. The optimal solution allows us to rewrite f(x; β) as

f(x; β) = β0 + Σ_{i=1}^N α_i y_i K(x, x_i) ,

where β0 and α_1, …, α_N are determined by solving f(x; β) = 0. The support vectors are those x_i corresponding to nonzero α_i. A smaller SVM tuning parameter γ leads to more support vectors and a smoother decision boundary. A testing data set may be used to determine the best value for γ.

The SVM extension to more than two classes solves multiple two-class problems. SVM for regression utilizes the model form in (36.4) and requires specification of a loss function appropriate for a quantitative response [36.8, 46]. Two possibilities are the ε-insensitive function

V_ε(e) = 0 if |e| < ε, |e| − ε otherwise ,

which ignores errors smaller than ε, and the Huber [36.47] function

V_H(e) = e²/2 if |e| ≤ 1.345, 1.345|e| − e²/2 otherwise ,

which is used in robust regression to reduce model sensitivity to outliers.

Other Classification Methods
In this section, we briefly discuss some other concepts that are applicable to DM classification problems. The basic intuition behind a good classification method is derived from the Bayes classifier, which utilizes the posterior distribution P(Y = c|X = x). Specifically, if P(Y = c|X = x) is the maximum over c = 1, …, C, then x would be classified to class c.
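A minimal sketch of an SVM classifier with a radial basis kernel follows, using scikit-learn as one possible implementation (not software endorsed by the chapter); the synthetic data and the candidate tuning values are assumptions. Note a naming caveat: scikit-learn's C plays the role of the box constraint that the chapter denotes γ, while scikit-learn's gamma is a kernel-width parameter.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)   # classes not linearly separable

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Radial basis kernel; the box constraint is selected on a testing data set.
best = None
for C in (0.1, 1, 10, 100):
    svm = SVC(kernel="rbf", C=C).fit(X_train, y_train)
    acc = svm.score(X_test, y_test)
    if best is None or acc > best[1]:
        best = (C, acc, len(svm.support_))
print("best box constraint:", best[0], " test accuracy:", round(best[1], 3),
      " number of support vectors:", best[2])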
Nearest neighbor (NN) [36.48] classifiers seek to estimate the Bayes classifier directly without specification of any model form. The k-NN classifier identifies the k closest points to x (using Euclidean distance) as the neighborhood about x, then estimates P(Y = c|X = x) with the fraction of these k points that are of class c. As k increases, the decision boundaries become smoother; however, the neighborhood becomes less local (and less relevant) to x. This problem of local representation is even worse in high dimensions, and modifications to the distance measure are needed to create a practical k-NN method for DM. For this purpose, Hastie and Tibshirani [36.49] proposed the discriminant adaptive NN distance measure to reshape the neighborhood adaptively at a given x to capture the critical points to distinguish between the classes.

As mentioned earlier, linear discriminant analysis may be too restrictive in practice. Flexible discriminant analysis replaces the linear decision boundaries with more flexible regression models, such as GAM or MARS. Mixture discriminant analysis relaxes the assumption that classes are more or less spherical in shape by allowing a class to be represented by multiple (spherical) clusters; see Hastie et al. [36.50] and Ripley [36.23] for more details.

K-means clustering classification applies the K-means clustering algorithm separately to the data for each of the C classes. Each class c will then be represented by K clusters of points. Consequently, non-spherical classes may be modeled. For a new input vector x, determine the closest cluster, then assign x to the class associated with that cluster.

Genetic algorithms [36.51, 52] use processes such as genetic combination, mutation, and natural selection in an optimization based on the concepts of natural evolution. One generation of models competes to pass on characteristics to the next generation of models, until the best model is found. Genetic algorithms are useful in guiding DM algorithms, such as neural networks and decision trees [36.53].
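A minimal sketch of the k-NN classifier described above, using scikit-learn on synthetic data (both are assumptions for illustration); it also shows how P(Y = c|X = x) is estimated by the class fractions within the neighborhood.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Larger k gives smoother decision boundaries but a less local neighborhood.
for k in (1, 5, 25):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(f"k={k:2d}  test accuracy={knn.score(X_test, y_test):.3f}")

# Estimated P(Y=c|X=x): the fraction of the k nearest training points in each class.
print(knn.predict_proba(X_test[:1]))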
36.3.2 Unsupervised Learning

… attributes is often very high (much higher than that in supervised learning). In describing the methods, we denote the j-th variable by x_j (or random variable X_j), and the corresponding boldface x (or X) denotes the vector of p variables (x_1, x_2, …, x_p)^T, where boldface x_i denotes the i-th sample point. These variables may be either quantitative or qualitative.

Association Rules
Association rules or affinity groupings seek to find associations between the values of the variables X that provide knowledge about the population distribution. Market basket analysis is a well-known special case, for which the extracted knowledge may be used to link specific products. For example, consider all the items that may be purchased at a store. If the analysis identifies that items A and B are commonly purchased together, then sales promotions could exploit this to increase revenue.

In seeking these associations, a primary objective is to identify variable values that occur together with high probability. Let S_j be the set of values for X_j, and consider a subset s_j ⊆ S_j. Then we seek subsets s_1, …, s_p such that

P[ ∩_{j=1}^p (X_j ∈ s_j) ]   (36.5)

is large. In market basket analysis, the variables X are converted to a set of binary variables Z, where each attainable value of each X_j corresponds to a variable Z_k. Thus, the number of Z_k variables is K = Σ_j |S_j|. If binary variable Z_k corresponds to X_j = v, then Z_k = 1 when X_j = v and Z_k = 0 otherwise. An item set κ is a realization of Z. For example, if the Z_k represent the possible products that could be purchased from a store, then an item set would be the set of items purchased together by a customer. Note that the number of Z_k = 1 in an item set is at most p. Equation (36.5) now becomes

P[ ∩_{k∈κ} (Z_k = 1) ] ,

…

Further knowledge may be extracted via the a priori algorithm [36.54] in the form of if–then statements. For an item set κ, the items with Z_k = 1 would be partitioned into two disjoint item subsets such that A ∪ B = κ. The association rule would be stated as "if A, then B" and denoted by A ⇒ B, where A is called the antecedent and B is called the consequent. This rule's support T(A ⇒ B) is the same as T(κ) calculated above, an estimate of the joint probability. The confidence or predictability of this rule is

C(A ⇒ B) = T(A ⇒ B)/T(A) ,

which is an estimate of the conditional probability P(B|A). The expected confidence is the support of B, T(B), an estimate of the unconditional probability P(B). The lift is the ratio of the confidence over the expected confidence,

L(A ⇒ B) = C(A ⇒ B)/T(B) ,

…

… dissimilarity is measured by calculating d(x_i, x_{i′}) for all points x_i, x_{i′} within a cluster C_k, then summing over the K clusters. This is equivalent to calculating

W(C) = Σ_{k=1}^K Σ_{i∈C_k} d(x_i, x̄_k) ,

where the cluster mean x̄_k is the sample mean vector of the points in cluster C_k. Given a current set of cluster means, the K-means algorithm assigns each point to the closest cluster mean, calculates the new cluster means, and iterates until the cluster assignments do not change. Unfortunately, because of its dependence on the squared Euclidean distance measure, K-means clustering is sensitive to outliers (i.e., is not robust). K-medoids [36.57] is a generalized version that utilizes an alternately defined cluster center in place of the cluster means and an alternate distance measure. Density-based clustering (DBSCAN) [36.58] algorithms are less sensitive to outliers and can discover …
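To make the rule measures defined above concrete, here is a small self-contained sketch that estimates support, confidence, and lift from simulated binary purchase indicators; the item names and purchase probabilities are invented purely for illustration.

import numpy as np

# Hypothetical market-basket data: rows are customers, columns are the binary Z_k
# indicating whether item k was purchased.
rng = np.random.default_rng(0)
bread = rng.random(1000) < 0.4
butter = np.where(bread, rng.random(1000) < 0.7, rng.random(1000) < 0.2)
Z = np.column_stack([bread, butter])

def support(cols):
    """T(item set): fraction of baskets containing every item in the set."""
    return Z[:, cols].all(axis=1).mean()

# Rule "if bread, then butter" (A => B)
T_A, T_B, T_AB = support([0]), support([1]), support([0, 1])
confidence = T_AB / T_A          # estimate of P(B | A)
lift = confidence / T_B          # confidence relative to the expected confidence
print(f"support={T_AB:.3f}  confidence={confidence:.3f}  lift={lift:.2f}")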
… a lattice structure of dimensionality M. Typically M is much smaller than p. By means of a learning algorithm, the network discovers the clusters within the data. It is possible to alter the discovered clusters by varying the learning parameters of the network. The SOM is especially suitable for data survey because it has appealing visualization properties. It creates a set of prototype vectors representing the data set and carries out a topology-preserving projection of the prototypes from the p-dimensional input space onto a low-dimensional (typically two-dimensional) grid. This ordered grid can be used as a convenient visualization surface for showing different features of the SOM (and thus of the data), for example, the cluster structure. While the axes of such a grid do not correspond to any measurement, the spatial relationships among the clusters do correspond to relationships in p-dimensional space. Another attractive feature of the SOM is its ability to discover arbitrarily shaped clusters organized in a nonlinear space.

36.3.3 Software

Several DM software packages are available at a wide range of prices, of which six of the most popular packages are:

• SAS Enterprise Miner (www.sas.com/technologies/analytics/datamining/miner/),
• SPSS Clementine (www.spss.com/clementine/),
• XLMiner in Excel (www.xlminer.net),
• …

SAS, SPSS, and Quadstone are the most expensive (over $40 000) while XLMiner is a good deal for the price (under $2 000). The disadvantage of XLMiner is that it cannot handle very large data sets. Each package has certain specializations, and potential users must carefully investigate these choices to find the package that best fits their KDD/DM needs. Below we describe some other software options for the DM modeling methods presented.

GLM or linear models are the simplest of DM tools and most statistical software can fit them, such as SAS, SPSS, S+, and Statistica [www.statsoftinc.com/]. However, it should be noted that Quadstone only offers a regression tool via scorecards, which is not the same as statistical linear models. GAM requires access to more sophisticated statistical software, such as S+.

Software for CART, MART, and MARS is available from Salford Systems [www.salford-systems.com]. SAS Enterprise Miner includes CHAID, CART, and the machine learning program C4.5 [www.rulequest.com] [36.63], which uses classifiers to generate decision trees and if–then rules. SPSS Clementine and Insightful Miner also include CART, but Ghostminer and XLMiner utilize different variants of decision trees. QUEST [www.stat.wisc.edu/~loh/quest.html] is available in SPSS's AnswerTree software and Statistica.

Although ANN software is widely available, the most complete package is Matlab's [www.mathworks.com] Neural Network Toolbox. Information on SVM software is available at [www.support-vector.net/software.html]. One good option is Matlab's SVM Toolbox.
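Returning to the SOM described earlier in this subsection, a minimal toy numpy sketch of the training idea follows (this is not any of the packages just listed, and the grid size, learning-rate schedule, and neighborhood schedule are arbitrary assumptions).

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))                 # p = 3 input dimensions
grid_w, grid_h, p = 6, 6, X.shape[1]
proto = rng.normal(size=(grid_w, grid_h, p))  # prototype vectors on a 2-D grid
gi, gj = np.indices((grid_w, grid_h))         # grid coordinates of each prototype

for t in range(2000):
    lr = 0.5 * (1 - t / 2000)                 # decaying learning rate
    sigma = 2.0 * (1 - t / 2000) + 0.5        # decaying neighborhood radius
    x = X[rng.integers(len(X))]
    # Best-matching unit: the prototype closest to x.
    d = ((proto - x) ** 2).sum(axis=2)
    bi, bj = np.unravel_index(d.argmin(), d.shape)
    # Move the BMU and its grid neighbors toward x (topology-preserving update).
    h = np.exp(-((gi - bi) ** 2 + (gj - bj) ** 2) / (2 * sigma ** 2))
    proto += lr * h[..., None] * (x - proto)

# Each data point can now be mapped to its BMU cell to visualize the cluster structure.
print(proto.shape)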
… cluster models should be developed with mixture distributions [36.72].

One particularly competitive industry is telecommunications. Since divestiture and government deregulation, various telephone services, such as cellular, local and long distance, domestic and commercial, have become battle grounds for telecommunication service providers. Because of the data and information oriented nature of the industry, DM methods for knowledge extraction are critical. To remain competitive, it is important for companies to develop business planning systems that help managers make good decisions. In particular, these systems will allow sales and marketing people to establish successful customer loyalty programs for churn prevention and to develop fraud detection modules for reducing revenue loss through market segmentation and customer profiling.

A major task in this research is to develop and implement DM tools within the business planning system. The objectives are to provide guidance for targeting business growth, to forecast year-end usage volume and revenue growth, and to value risks associated with the business plan periodically. Telecommunication business services include voice and non-voice services, which can be further categorized to include domestic, local, international, products, toll-free calls, and calling cards. For usage forecasting, a minutes growth model is utilized to forecast domestic voice usage. For revenue forecasting, the average revenue per minute on a log scale is used as a performance measure and is forecasted by a double exponential smoothing growth function. A structural model is designed to decompose the business growth process into three major subprocesses: add, disconnect, and base. To improve explanatory power, the revenue unit is further divided into different customer groups. To compute confidence and prediction intervals, bootstrapping and simulation methods are used.

To understand the day effect and seasonal effect, the concept of bill-month equivalent business days (EBD) is defined and estimated. To estimate EBD, the factor characteristics of holidays (non-EBD) are identified and eliminated and the day effect is estimated. For seasonality, the US Bureau of the Census X-11 seasonal adjustment procedure is used.

36.4.2 Mahalanobis–Taguchi System

Genichi Taguchi is best known for his work on robust design and design of experiments. The Taguchi robust design methods have generated a considerable amount of discussion and controversy and are widely used in manufacturing [36.73–77]. The general consensus among statisticians seems to be that, while many of Taguchi's overall ideas on experimental design are very important and influential, the techniques he proposed are not necessarily the most effective statistical methods. Nevertheless, Taguchi has made significant contributions in the area of quality control and quality engineering. For DM, Taguchi has recently popularized the Mahalanobis–Taguchi System (MTS), a new set of tools for diagnosis, classification, and variable selection. The method is based on a Mahalanobis distance scale that is utilized to measure the level of abnormality in abnormal items as compared to a group of normal items. First, it must be demonstrated that a Mahalanobis distance measure based on all available variables is able to separate the abnormal from the normal items. Should this be successfully achieved, orthogonal arrays and signal-to-noise ratios are used to select an optimal combination of variables for calculating the Mahalanobis distances.

The MTS method has been claimed to be very powerful for solving a wide range of problems, including manufacturing inspection and sensing, medical diagnosis, face and voice recognition, weather forecasting, credit scoring, fire detection, earthquake forecasting, and university admissions. Two recent books have been published on the MTS method by Taguchi et al. [36.78] and Taguchi and Jugulum [36.79]. Many successful case studies in MTS have been reported in engineering and science applications in many large companies, such as Nissan Motor Co., Mitsubishi Space Software Co., Xerox, Delphi Automotive Systems, ITT Industries, Ford Motor Company, Fuji Photo Film Company, and others. While the method is getting a lot of attention in many industries, very little research [36.80] has been conducted to investigate how and when the method is appropriate.

36.4.3 Manufacturing Process Modeling

One area of DM research in manufacturing industries is quality and productivity improvement through DM and knowledge discovery. Manufacturing systems nowadays are often very complicated and involve many manufacturing process stages where hundreds or thousands of in-process measurements are taken to indicate or initiate process control of the system. For example, a modern semiconductor manufacturing process typically consists of over 300 steps, and in each step, multiple pieces of equipment are used to process the wafer.
Inappropriate understanding of interactions among in-process variables will create inefficiencies at all phases of manufacturing, leading to long product/process realization cycle times and long development times, and resulting in excessive system costs.

Current approaches to DM in electronics manufacturing include neural networks, decision trees, Bayesian models and rough set theory [36.81, 82]. Each of these approaches carries certain advantages and disadvantages. Decision trees, for instance, produce intelligible rules and hence are very appropriate for generating process control or design of experiments strategies. They are, however, generally prone to outlier and imperfect data influences. Neural networks, on the other hand, are robust against data abnormalities but do not produce readily intelligible knowledge. These methods also differ in their ability to handle high-dimensional data, to discover arbitrarily shaped clusters [36.58] and to provide a basis for intuitive visualization [36.83]. They can also be sensitive to training and model building parameters [36.60]. Finally, the existing approaches do not take into consideration the localization of process parameters. The patterns or clusters identified by existing approaches may include parameters from a diverse set of components in the system. Therefore, a combination of methods that complement each other to provide a complete set of desirable features is necessary.

It is crucial to understand process structure and yield components in manufacturing, so that problem localization can permit reduced production costs. For example, semiconductor manufacturing practice shows that over 70% of all fatal defects and close to 90% of yield excursions are caused by problems related to process equipment [36.84]. Systematic defects can be attributed to many categories that are generally associated with technologies and combinations of different process operations. To implement DM methods successfully for knowledge discovery, some future research for manufacturing process control must include yield modeling, defect modeling and variation propagation.

Yield Modeling
In electronics manufacturing, the ANSI standards [36.85] and practice generally assume that the number of defects on an electronics product follows a Poisson distribution with mean λ. The Poisson random variable is an approximation of the sum of independent Bernoulli trials, but defects on different components may be correlated since process yield critically depends on product groups, process steps, and types of defects [36.86]. Unlike traditional defect models, an appropriate logit model can be developed as follows. Let the number of defects of category X on an electronics product be

U_X = Σ Y_X

and

logit[E(Y_X)] = α_0^X + α_O^X · O_X + α_C^X · C_X + α_OC^X · O_X · C_X ,

where logit(z) = log[z/(1 − z)] is the link function for Bernoulli distributions, and Y_X is a Bernoulli random variable representing a defect from defect category X. The default logit of the failure probability is α_0^X, and α_O^X and α_C^X are the main effects of operations (O_X) and components (C_X). Since the Y_X s are correlated, this model will provide more detailed information about defects.

Multivariate Defect Modeling
Since different types of defects may be caused by the same operations, multivariate Poisson models are necessary to account for correlations among different types of defects. The trivariate reduction method suggests an additive Poisson model for the vector of Poisson counts U = (U_1, U_2, …, U_k)′,

U = AV ,

where A is a matrix of zeros and ones, and V = (v_1, v_2, …, v_p)′ consists of independent Poisson variables v_i. The variance–covariance matrix takes the form Var(U) = AΣA′ = Φ + νν′, where Φ = diag(µ_i) is a diagonal matrix with the mean of the individual series, and ν is the common covariance term. Note that the v_i are essentially latent variables, and a factor analysis model can be developed for analyzing multivariate discrete Poisson variables such that

log[E(U)] = µ + L · F ,

where U is the vector of defects, L is the matrix of factor loadings, and F contains common factors representing effects of specific operations. By using factor analysis, it is possible to relate product defects to the associated packages and operations.

Multistage Variation Propagation
Inspection tests in an assembly line usually have functional overlap, and defects from successive inspection stations exhibit strong correlations.
Modeling serially correlated defect counts is an important task for defect localization and yield prediction. Poisson regression models, such as the generalized event-count method [36.87] and its alternatives, can be utilized to account for serial correlations of defects in different inspection stations. Factor analysis methods based on hidden Markov models [36.88] can also be constructed to investigate how variations are propagated through assembly lines.
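To illustrate the correlated-defect structure discussed in the defect-modeling subsections above, here is a small simulation sketch of the additive (trivariate-reduction) Poisson model U = AV; the matrix A, the Poisson rates, and the sample size are made-up values for illustration only.

import numpy as np

rng = np.random.default_rng(0)

# U = A V, with V a vector of independent Poisson variables; the third one is shared
# by both defect counts and therefore induces correlation between them.
A = np.array([[1, 0, 1],
              [0, 1, 1]])                 # zeros and ones
rates = np.array([2.0, 3.0, 1.5])         # means of v1, v2, v3

V = rng.poisson(rates, size=(100_000, 3))
U = V @ A.T                               # defect counts U1, U2 for each simulated product

# Diagonal of the sample covariance matches the Poisson means (3.5 and 4.5);
# the off-diagonal term is approximately Var(v3) = 1.5, the common covariance.
print("sample means:", U.mean(axis=0).round(2))
print("sample covariance:\n", np.cov(U.T).round(2))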
… be able to formulate real problems such that the existing methods can be applied. In reality, traditional academic training mainly focuses on knowledge of modeling algorithms and lacks training in problem formulation and interpretation of results. Consequently, many modelers are very efficient in fitting models and algorithms to data, but have a hard time determining when and why they should use certain algorithms. Similarly, the ex- …

… detailed data with the more general, aggregated industry-wide data for knowledge extraction. It is obvious that this approach will be significantly less effective than the approach of integrating the detailed data from all competing companies. It is expected that, if these obstacles can be overcome, the impact of the DM and KDD methods will be much more prominent in industrial and commercial applications.
References
36.1 M. J. A. Berry, G. Linoff: Mastering Data Mining: The Art and Science of Customer Relationship Management (Wiley, New York 2000)
36.2 E. Wegman: Data Mining Tutorial, Short Course Notes, Interface 2001 Symposium, Costa Mesa, California (2001)
36.3 P. Adriaans, D. Zantinge: Data Mining (Addison-Wesley, New York 1996)
36.4 J. H. Friedman: Data Mining and Statistics: What is the Connection? Technical Report (Stat. Dep., Stanford University 1997)
36.5 K. B. Clark, T. Fujimoto: Product Development and Competitiveness, J. Jpn Int. Econ. 6(2), 101–143 (1992)
36.6 D. W. LaBahn, A. Ali, R. Krapfel: New Product Development Cycle Time. The Influence of Project and Process Factors in Small Manufacturing Companies, J. Business Res. 36(2), 179–188 (1996)
36.7 J. Han, M. Kamber: Data Mining: Concept and Techniques (Morgan Kaufmann, San Francisco 2001)
36.8 T. Hastie, J. H. Friedman, R. Tibshirani: Elements of Statistical Learning: Data Mining, Inference, and Prediction (Springer, Berlin Heidelberg New York 2001)
36.9 S. Weisberg: Applied Linear Regression (Wiley, New York 1980)
36.10 G. Seber: Multivariate Observations (Wiley, New York 1984)
36.11 J. Neter, M. H. Kutner, C. J. Nachtsheim, W. Wasserman: Applied Linear Statistical Models, 4th edn. (Irwin, Chicago 1996)
36.12 A. E. Hoerl, R. Kennard: Ridge Regression: Biased Estimation of Nonorthogonal Problems, Technometrics 12, 55–67 (1970)
36.13 R. Tibshirani: Regression Shrinkage and Selection via the Lasso, J. R. Stat. Soc. Series B 58, 267–288 (1996)
36.14 A. Agresti: An Introduction to Categorical Data Analysis (Wiley, New York 1996)
36.15 D. Hand: Discrimination and Classification (Wiley, Chichester 1981)
36.16 P. McCullagh, J. A. Nelder: Generalized Linear Models, 2nd edn. (Chapman Hall, New York 1989)
36.17 T. Hastie, R. Tibshirani: Generalized Additive Models (Chapman Hall, New York 1990)
36.18 W. S. Cleveland: Robust Locally-Weighted Regression and Smoothing Scatterplots, J. Am. Stat. Assoc. 74, 829–836 (1979)
36.19 R. L. Eubank: Spline Smoothing and Nonparametric Regression (Marcel Dekker, New York 1988)
36.20 G. Wahba: Spline Models for Observational Data, Applied Mathematics, Vol. 59 (SIAM, Philadelphia 1990)
36.21 W. Härdle: Applied Non-parametric Regression (Cambridge Univ. Press, Cambridge 1990)
36.22 D. Biggs, B. deVille, E. Suen: A Method of Choosing Multiway Partitions for Classification and Decision Trees, J. Appl. Stat. 18(1), 49–62 (1991)
36.23 B. D. Ripley: Pattern Recognition and Neural Networks (Cambridge Univ. Press, Cambridge 1996)
36.24 L. Breiman, J. H. Friedman, R. A. Olshen, C. J. Stone: Classification and Regression Trees (Wadsworth, Belmont, California 1984)
36.25 J. N. Morgan, J. A. Sonquist: Problems in the Analysis of Survey Data, and a Proposal, J. Am. Stat. Assoc. 58, 415–434 (1963)
36.26 A. Fielding: Binary Segmentation: The Automatic Interaction Detector and Related Techniques for Exploring Data Structure. In: The Analysis of Survey Data, Volume I: Exploring Data Structures, ed. by C. A. O'Muircheartaigh, C. Payne (Wiley, New York 1977) pp. 221–258
36.27 W. Y. Loh, N. Vanichsetakul: Tree-Structured Classification Via Generalized Discriminant Analysis, J. Am. Stat. Assoc. 83, 715–728 (1988)
36.28 P. Chaudhuri, W. D. Lo, W. Y. Loh, C. C. Yang: Generalized Regression Trees, Stat. Sin. 5, 643–666 (1995)
36.29 W. Y. Loh, Y. S. Shih: Split-Selection Methods for Classification Trees, Statistica Sinica 7, 815–840 (1997)
36.30 J. H. Friedman, T. Hastie, R. Tibshirani: Additive Logistic Regression: A Statistical View of Boosting, Ann. Stat. 28, 337–407 (2000)
36.31 Y. Freund, R. Schapire: Experiments with a New Boosting Algorithm, Machine Learning: Proceedings of the Thirteenth International Conference, Bari, Italy 1996 (Morgan Kaufmann, San Francisco 1996) 148–156
36.32 L. Breiman: Bagging Predictors, Machine Learning 26, 123–140 (1996)
36.33 J. H. Friedman: Greedy Function Approximation: A Gradient Boosting Machine, Ann. Stat. 29, 1189–1232 (2001)
36.34 J. H. Friedman: Stochastic Gradient Boosting, Computational Statistics and Data Analysis 38(4), 367–378 (2002)
36.35 J. H. Friedman: Multivariate Adaptive Regression Splines (with Discussion), Ann. Stat. 19, 1–141 (1991)
36.36 J. H. Friedman, B. W. Silverman: Flexible Parsimonious Smoothing and Additive Modeling, Technometrics 31, 3–39 (1989)
36.37 R. P. Lippmann: An Introduction to Computing with Neural Nets, IEEE ASSP Magazine, April, 4–22 (1987)
36.38 S. S. Haykin: Neural Networks: A Comprehensive Foundation, 2nd edn. (Prentice Hall, Upper Saddle River 1999)
36.39 H. White: Learning in Neural Networks: A Statistical Perspective, Neural Computation 1, 425–464 (1989)
36.40 A. R. Barron, R. L. Barron, E. J. Wegman: Statistical Learning Networks: A Unifying View, Computer Science and Statistics: Proceedings of the 20th Symposium on the Interface, ed. by E. J. Wegman, D. T. Gantz, J. J. Miller (American Statistical Association, Alexandria, VA 1992) 192–203
36.41 B. Cheng, D. M. Titterington: Neural Networks: A Review from a Statistical Perspective (with discussion), Stat. Sci. 9, 2–54 (1994)
36.42 D. Rumelhart, G. Hinton, R. Williams: Learning Internal Representations by Error Propagation. In: Parallel Distributed Processing: Explorations in the Microstructures of Cognition, Vol. 1: Foundations, ed. by D. E. Rumelhart, J. L. McClelland (MIT, Cambridge 1986) pp. 318–362
36.43 V. Vapnik: The Nature of Statistical Learning (Springer, Berlin Heidelberg New York 1996)
36.44 C. J. C. Burges: A Tutorial on Support Vector Machines for Pattern Recognition, Knowledge Discovery and Data Mining 2(2), 121–167 (1998)
36.45 J. Shawe-Taylor, N. Cristianini: Kernel Methods for Pattern Analysis (Cambridge Univ. Press, Cambridge 2004)
36.46 N. Cristianini, J. Shawe-Taylor: An Introduction to Support Vector Machines (Cambridge Univ. Press, Cambridge 2000)
36.47 P. Huber: Robust Estimation of a Location Parameter, Ann. Math. Stat. 35, 73–101 (1964)
36.48 B. V. Dasarathy: Nearest Neighbor Pattern Classification Techniques (IEEE Computer Society, New York 1991)
36.49 T. Hastie, R. Tibshirani: Discriminant Adaptive Nearest-Neighbor Classification, IEEE Pattern Recognition and Machine Intelligence 18, 607–616 (1996)
36.50 T. Hastie, R. Tibshirani, A. Buja: Flexible Discriminant and Mixture Models. In: Statistics and Artificial Neural Networks, ed. by J. Kay, M. Titterington (Oxford Univ. Press, Oxford 1998)
36.51 J. R. Koza: Genetic Programming: On the Programming of Computers by Means of Natural Selection (MIT, Cambridge 1992)
36.52 W. Banzhaf, P. Nordin, R. E. Keller, F. D. Francone: Genetic Programming: An Introduction (Morgan Kaufmann, San Francisco 1998)
36.53 P. W. H. Smith: Genetic Programming as a Data-Mining Tool. In: Data Mining: A Heuristic Approach, ed. by H. A. Abbass, R. A. Sarker, C. S. Newton (Idea Group Publishing, London 2002) pp. 157–173
36.54 R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, A. I. Verkamo: Fast Discovery of Association Rules. In: Advances in Knowledge Discovery and Data Mining (MIT, Cambridge 1995) Chap. 12
36.55 A. Gordon: Classification, 2nd edn. (Chapman Hall, New York 1999)
36.56 J. A. Hartigan, M. A. Wong: A K-Means Clustering Algorithm, Appl. Stat. 28, 100–108 (1979)
36.57 L. Kaufman, P. Rousseeuw: Finding Groups in Data: An Introduction to Cluster Analysis (Wiley, New York 1990)
36.58 M. Ester, H.-P. Kriegel, J. Sander, X. Xu: A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases, Proceedings of the 1996 International Conference on Knowledge Discovery and Data Mining (KDD-96), Portland 1996, ed. by E. Simoudis, J. Han, U. Fayyad (AAAI Press, Menlo Park 1996) 226–231
36.59 J. Sander, M. Ester, H.-P. Kriegel, X. Xu: Density-Based Clustering in Spatial Databases: The Algorithm GDBSCAN and Its Applications, Data Mining and Knowledge Discovery 2(2), 169–194 (1998)
36.60 M. Ankerst, M. M. Breunig, H.-P. Kriegel, J. Sander: OPTICS: Ordering Points to Identify the Clustering Structure, Proc. ACM SIGMOD Int. Conf. on Management of Data, Philadelphia, Pennsylvania, June 1–3, 1999 (ACM Press, New York 1999) 49–60
36.61 T. Kohonen: Self-Organization and Associative Memory, 3rd edn. (Springer, Berlin Heidelberg New York 1989)
36.62 D. Haughton, J. Deichmann, A. Eshghi, S. Sayek, N. Teebagy, H. Topi: A Review of Software Packages for Data Mining, Amer. Stat. 57(4), 290–309 (2003)
36.63 J. R. Quinlan: C4.5: Programs for Machine Learning (Morgan Kaufmann, San Mateo 1993)
36.64 T. Fawcett, F. Provost: Activity Monitoring: Noticing Interesting Changes in Behavior, Proceedings of KDD-99, San Diego 1999 (San Diego, CA 1999) 53–62
36.65 D. C. Montgomery: Introduction to Statistical Quality Control, 5th edn. (Wiley, New York 2001)
36.66 W. H. Woodall, K.-L. Tsui, G. R. Tucker: A Review of Statistical and Fuzzy Quality Control Based on Categorical Data, Frontiers in Statistical Quality Control 5, 83–89 (1997)
36.67 D. C. Montgomery, W. H. Woodall: A Discussion on Statistically-Based Process Monitoring and Control, J. Qual. Technol. 29, 121–162 (1997)
36.68 A. J. Hayter, K.-L. Tsui: Identification and Qualification in Multivariate Quality Control Problems, J. Qual. Technol. 26(3), 197–208 (1994)
36.69 R. L. Mason, C. W. Champ, N. D. Tracy, S. J. Wierda, J. C. Young: Assessment of Multivariate Process Control Techniques, J. Qual. Technol. 29, 140–143 (1997)
36.70 W. Jiang, S.-T. Au, K.-L. Tsui: A Statistical Process Control Approach for Customer Activity Monitoring, Technical Report, AT&T Labs (2004)
36.71 M. West, J. Harrison: Bayesian Forecasting and Dynamic Models, 2nd edn. (Springer, New York 1997)
36.72 C. Fraley, A. E. Raftery: Model-Based Clustering, Discriminant Analysis, and Density Estimation, J. Amer. Stat. Assoc. 97, 611–631 (2002)
36.73 G. Taguchi: Introduction to Quality Engineering: Designing Quality into Products and Processes (Asian Productivity Organization, Tokyo 1986)
36.74 G. E. P. Box, R. N. Kacker, V. N. Nair, M. S. Phadke, A. C. Shoemaker, C. F. Wu: Quality Practices in Japan, Qual. Progress, March, 21–29 (1988)
36.75 V. N. Nair: Taguchi's Parameter Design: A Panel Discussion, Technometrics 34, 127–161 (1992)
36.76 K.-L. Tsui: An Overview of Taguchi Method and Newly Developed Statistical Methods for Robust Design, IIE Trans. 24, 44–57 (1992)
36.77 K.-L. Tsui: A Critical Look at Taguchi's Modeling Approach for Robust Design, J. Appl. Stat. 23, 81–95 (1996)
36.78 G. Taguchi, S. Chowdhury, Y. Wu: The Mahalanobis–Taguchi System (McGraw-Hill, New York 2001)
36.79 G. Taguchi, R. Jugulum: The Mahalanobis–Taguchi Strategy: A Pattern Technology System (Wiley, New York 2002)
36.80 W. H. Woodall, R. Koudelik, K.-L. Tsui, S. B. Kim, Z. G. Stoumbos, C. P. Carvounis: A Review and Analysis of the Mahalanobis–Taguchi System, Technometrics 45(1), 1–15 (2003)
36.81 A. Kusiak, C. Kurasek: Data Mining of Printed-Circuit Board Defects, IEEE Transactions on Robotics and Automation 17(2), 191–196 (2001)
36.82 A. Kusiak: Rough Set Theory: A Data Mining Tool for Semiconductor Manufacturing, IEEE Transactions on Electronics Packaging Manufacturing 24(1), 44–50 (2001)
36.83 A. Ultsch: Information and Classification: Concepts, Methods and Applications (Springer, Berlin Heidelberg New York 1993)
36.84 A. Y. Wong: A Statistical Approach to Identify Semiconductor Process Equipment Related Yield Problems, IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems, Paris 1997 (IEEE Computer Society, Paris 1997) 20–22
36.85 ANSI: IPC-9261, In-Process DPMO and Estimated Yield for PWB (American National Standards Institute 2002)
36.86 M. Baron, C. K. Lakshminarayan, Z. Chen: Markov Random Fields in Pattern Recognition for Semiconductor Manufacturing, Technometrics 43, 66–72 (2001)
36.87 G. King: Event Count Models for International Relations: Generalizations and Applications, International Studies Quarterly 33(2), 123–147 (1989)
36.88 P. Smyth: Hidden Markov Models for Fault Detection in Dynamic Systems, Pattern Recognition 27(1), 149–164 (1994)