Handbook of Statistics 24: Data Mining and Data Visualization
Preface
It has long been a philosophical theme that statisticians ought to be data centric as op-
posed to methodology centric. Throughout the history of the statistical discipline, the
most innovative methodological advances have come when brilliant individuals have
wrestled with new data structures. Inferential statistics, linear models, sequential analy-
sis, nonparametric statistics, robust statistical methods, and exploratory data analysis
have all come about by a focus on a puzzling new data structure. The computer rev-
olution has brought forth a myriad of new data structures for researchers to contend
with including massive datasets, high-dimensional datasets, opportunistically collected
datasets, image data, text data, genomic and proteomic data, and a host of other data
challenges that could not be dealt with without modern computing resources.
This volume presents a collection of chapters that focus on data; in our words, it
is data-centric. Data mining and data visualization are both attempts to handle non-
standard statistical data, that is, data that do not satisfy traditional assumptions
of independence, stationarity, identical distribution, or parametric formulations. We
believe it is desirable for statisticians to embrace such data and bring innovative per-
spectives to these emerging data types.
This volume is conceptually divided into three sections. The first focuses on aspects
of data mining, the second on statistical and related analytical methods applicable to
data mining, and the third on data visualization methods appropriate to data mining. In
Chapter 1, Wegman and Solka present an overview of data mining including both statis-
tical and computer science-based perspectives. We call attention to their description of
the emerging field of massive streaming datasets. Kaufman and Michalski approach data
mining from a machine learning perspective and emphasize computational intelligence
and knowledge mining. Marchette describes exciting methods for mining computer se-
curity data with the important application to cybersecurity. Martinez turns our attention
to mining of text data and some approaches to feature extraction from text data. Solka
et al. also focus on text mining, applying these methods to cross-corpus discovery.
They describe methods and software for discovering subtle but significant associations
between two corpora covering disparate fields. Finally, Duric et al. round out the data
mining methods with a discussion of information hiding known as steganography.
The second section, on statistical methods and related methods applicable to data
mining begins with Rao’s description of methods applicable to dimension reduction
and graphical representation. Hand presents an overview of methods of statistical pat-
tern recognition, while Scott and Sain present an update of Scott’s seminal 1992 book
on multivariate density estimation. Hubert et al., in turn, describe the difficult problem
C.R. Rao
E.J. Wegman
J.L. Solka
Table of contents
Preface v
Contributors xiii
1. Introduction 1
2. Computational complexity 2
3. The computer science roots of data mining 9
4. Data preparation 14
5. Databases 19
6. Statistical methods for data mining 21
7. Visual data mining 29
8. Streaming data 37
9. A final word 44
Acknowledgements 44
References 44
1. Introduction 47
2. Knowledge generation operators 49
3. Strong patterns vs. complete and consistent rules 60
4. Ruleset visualization via concept association graphs 62
5. Integration of knowledge generation operators 66
6. Summary 69
Acknowledgements 70
References 71
1. Introduction 77
2. Basic TCP/IP 78
3. The threat 84
4. Network monitoring 92
5. TCP sessions 97
6. Signatures versus anomalies 101
7. User profiling 102
8. Program profiling 104
9. Conclusions 107
References 107
Introduction 133
1. Approach 133
2. Results 140
3. Conclusions 167
Acknowledgements 168
References 169
1. Introduction 171
2. Image formats 172
3. Steganography 174
4. Steganalysis 179
5. Relationship of steganography to watermarking 181
6. Literature survey 184
7. Conclusions 186
References 186
1. Introduction 189
2. Canonical coordinates 190
3. Principal component analysis 197
1. Background 213
2. Basics 214
3. Practical classification rules 216
4. Other issues 226
5. Further reading 227
References 227
1. Introduction 229
2. Classical density estimators 230
3. Kernel estimators 239
4. Mixture density estimation 248
5. Visualization of densities 252
6. Discussion 258
References 258
1. Introduction 263
2. Multivariate location and scatter 264
3. Multiple regression 272
4. Multivariate regression 278
5. Classification 282
6. Principal component analysis 283
7. Principal component regression 292
8. Partial Least Squares Regression 296
9. Some other multivariate frameworks 297
10. Availability 297
Acknowledgements 300
References 300
Ch. 11. Classification and Regression Trees, Bagging, and Boosting 303
Clifton D. Sutton
1. Introduction 303
2. Using CART to create a classification tree 306
3. Using CART to create a regression tree 315
4. Other issues pertaining to CART 316
5. Bagging 317
6. Boosting 323
References 327
Ch. 12. Fast Algorithms for Classification Using Class Cover Catch Digraphs 331
David J. Marchette, Edward J. Wegman and Carey E. Priebe
1. Introduction 331
2. Class cover catch digraphs 332
3. CCCD for classification 334
4. Cluster catch digraph 338
5. Fast algorithms 340
6. Further enhancements 343
7. Streaming data 344
8. Examples using the fast algorithms 346
9. Sloan Digital Sky Survey 351
10. Text processing 355
11. Discussion 357
Acknowledgements 357
References 358
Ch. 15. Some Recent Graphics Templates and Software for Showing Statistical
Summaries 415
Daniel B. Carr
1. Introduction 415
2. Background for quantitative graphics design 417
Ch. 16. Interactive Statistical Graphics: the Paradigm of Linked Views 437
Adalbert Wilhelm
1. Introduction 539
2. Computer graphics 539
3. Graphics software tools 543
4. Data visualization 547
5. Virtual reality 556
Contributors
Martinez, Angel R., Aegis Metrics Coordinator, NAVSEA Dahlgren – N20P, 17320
Dahlgren Rd., Dahlgren, VA 22448-5100 USA; e-mail: [email protected]
(Ch. 4).
Michalski, Ryszard, School of Computational Sciences, George Mason University,
4400 University Drive, Fairfax, VA 22030-4444 USA; e-mail: richard.michalski@
gmail.com (Ch. 2).
Priebe, Carey E., Department of Applied Mathematics and Statistics, Whitehead Hall
Room 201, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD
21218-2682 USA; e-mail: [email protected] (Ch. 12).
Rao, C.R., 326 Thomas Building, Pennsylvania State University, University Park, PA
16802 USA; e-mail: [email protected] (Ch. 7).
Rousseeuw, Peter J., Department of Mathematics and Computer Science, University of
Antwerp, Middleheimlaan 1, B-2020 Antwerpen, Belgium; e-mail: peter.rousseeuw@
ua.ac.be (Ch. 10).
Said, Yasmin, School of Computational Sciences, George Mason University, 4400
University Drive, Fairfax, VA 22030-4444 USA; e-mail: [email protected],
[email protected] (Ch. 13).
Sain, Stephan R., Department of Mathematics, University of Colorado at Denver,
PO Box 173364, Denver, CO 80217-3364 USA; e-mail: [email protected]
(Ch. 9).
Scott, David W., Department of Statistics, MS-138, Rice University, PO Box 1892,
Houston, TX 77251-1892 USA; e-mail: [email protected] (Ch. 9).
Solka, Jeffrey L., Code B10, Naval Surface Warfare Center, DD, Dahlgren, VA 22448
USA; e-mail: [email protected] (Chs. 1, 5).
Sutton, Clifton D., Department of Applied and Engineering Statistics, MS 4A7, George
Mason University, 4400 University Drive, Fairfax, VA 22030-4444 USA; e-mail:
[email protected] (Ch. 11).
Van Aelst, Stefan, Department of Applied Mathematics and Computer Science, Ghent
University, Krijslaan 281 S9, B-9000 Ghent, Belgium; e-mail: stefan.vanaelst@
ughent.be (Ch. 10).
Wegman, Edward J., Center for Computational Statistics, MS 4A7, George Ma-
son University, 4400 University Drive, Fairfax, VA 22030-4444 USA; e-mail:
[email protected], [email protected] (Chs. 1, 5, 12).
Wilhelm, Adalbert, School of Humanities and Social Sciences, International Univer-
sity Bremen, PO Box 750 561, D-28725 Bremen, Germany; e-mail: a.wilhelm@
iu-bremen.de (Ch. 16).
Handbook of Statistics, Vol. 24
ISSN: 0169-7161
2005 Published by Elsevier B.V.
DOI 10.1016/S0169-7161(04)24001-9
Abstract
This paper provides an overview of data mining methodologies. We have been care-
ful during our exposition to focus on many of the recent breakthroughs that might
not be known to the majority of the community. We have also attempted to provide
a brief overview of some of the techniques that we feel are particularly beneficial to
the data mining process. Our exposition runs the gamut from algorithmic complex-
ity considerations, to data preparation, databases, pattern recognition, clustering,
and the relationship between statistical pattern recognition and artificial neural sys-
tems.
1. Introduction
The phrases ‘data mining’, and, in particular, ‘statistical data mining’, have been at once
a pariah for statisticians and also a darling. For many classically trained statisticians,
data mining has meant the abandonment of the probabilistic roots of statistical analy-
sis. Indeed, this is exactly the case simply because the datasets to which data mining
techniques are typically applied are opportunistically acquired and were meant origi-
nally for some other purpose, for example, administrative records or inventory control.
These datasets are typically not collected according to widely accepted random sam-
pling schemes and hence inferences to general situations from specific datasets are not
valid in the usual statistical sense. Nonetheless, data mining techniques have proven
their value in the marketplace. On the other hand, there has been considerable interest
in the statistics community in recent years for approaches to analyzing this new data
paradigm.
The landmark paper of Tukey (1962), entitled “The future of data analysis,” and later
his book Exploratory Data Analysis (Tukey, 1977) set forth a new paradigm
for statistical analysis. In contrast to what has come to be called confirmatory analysis
in which a statistical model is assumed and inference is made on the parameters of
that model, exploratory data analysis (EDA) is predicated on the fact that we do not
necessarily know that model assumptions actually hold for data under investigation.
Because the data may not conform to the assumptions of the confirmatory analysis,
inferences made with invalid model assumptions are subject to (potentially gross) errors.
The idea then is to explore the data to verify that the model assumptions actually hold for
the data in hand. It is a very short leap of logic to use exploratory techniques to discover
unanticipated structure in the data. With the rise of powerful personal computing, this
more aggressive form of EDA has come into vogue. EDA is no longer used to simply
verify underlying model assumptions, but also to uncover unanticipated structure in the
data.
Within the last decade, computer scientists operating in the framework of databases
and information systems have similarly come to the conclusion that a more powerful
form of data analysis could be used to exploit data residing in databases. That work has
been formulated as knowledge discovery in databases (KDD) and data mining. A land-
mark book in this area is (Fayyad et al., 1996). The convergence of EDA from the
statistical community and KDD from the computer science community has given rise
to a rich if somewhat tense collaboration widely recognized as data mining.
There are many definitions of data mining. The one we prefer was given in (Wegman,
2003). Data mining is an extension of exploratory data analysis and has basically the
same goals, the discovery of unknown and unanticipated structure in the data. The chief
distinction lies in the size and dimensionality of the data sets involved. Data mining, in
general, deals with much more massive data sets for which highly interactive analysis
is not fully feasible.
The implication of the distinction mentioned above is the sheer size and dimension-
ality of data sets. Because scalability to massive datasets is one of the cornerstones of
the data mining activity, it is worthwhile for us to begin our discussion of statistical data
mining with a discussion of the taxonomy of data set sizes and their implications for
scalability and algorithmic complexity.
2. Computational complexity
Table 1
Huber–Wegman taxonomy of data set sizes
Table 2
Algorithmic complexity
Complexity Algorithm
O(r) Plot a scatterplot
O(n) Calculate means, variances, kernel density estimates
O(n log(n)) Calculate fast Fourier transforms
O(nc) Calculate singular value decomposition of an r × c matrix; solve a multiple linear regression
O(nr), O(n^(3/2)) Solve a clustering algorithm with r ∝ sqrt(n)
O(n^2) Solve a clustering algorithm with c fixed and small so that r ∝ n
Let us consider a data matrix consisting of r rows and c columns. One can calculate
the total number of entries n as n = rc. In the case of higher-dimensional data, we write
d = c and refer to the data as d-dimensional. There are numerous types of operations
or algorithms that we would like to utilize as part of the data mining process. We would
like to be able to plot our data as a scatterplot; to compute summary statistics for our
data such as means and variances; to perform probability density estimation on our data
set using the standard kernel density estimator approach; to apply the
fast Fourier transform to our dataset; to obtain the singular
value decomposition of a multi-dimensional data set in order to ascertain appropriate
linear subspaces that capture the nature of the data; to perform a multiple
linear regression on a multi-dimensional dataset; to apply a cluster-
ing methodology with r proportional to sqrt(n); or, finally, to solve a
clustering algorithm with c fixed and small so that r is proportional to n. This list of
algorithms/data analysis techniques is not exhaustive but the list does represent many
of the tasks that one may need to do as part of the data mining process. In Table 2 we
examine algorithmic complexity as a function of these various statistical/data mining
algorithms.
Now it is interesting to match the computational requirements of these various al-
gorithms against the various different dataset sizes that we discussed previously. This
allows one to ascertain the necessary computational resources in order to have a hope
of applying each of the particular algorithms to the various datasets. Table 3 details the
number of operations (within a constant multiplicative factor) for algorithms of various
computational complexities and various data set sizes.
Table 3
Number of operations for algorithms of various computational complexities and various data set sizes
Next we would like to take these computational performance figures of merit and exam-
ine the possibility of executing them on some current hardware. It is difficult to know
exactly which machine should be included in the list of current hardware since the
machine performance capabilities are constantly changing as the push for faster CPU
speeds continues. We will use 4 machines for our computational feasibility analysis. The
first machine will be a 1 GHz Pentium IV (PIV) machine with a sustainable performance
of 1 gigaflop. The second machine will consist of a hypothetical flat neighborhood net-
work of 12 1.4 GHz Athlons with a sustainable performance of 24 gigaflops. The third
machine will be a hypothetical flat neighborhood network of 64 0.7 GHz Athlons with
a sustained performance of 64 gigaflops. The fourth and final machine will be a hypo-
thetical massively distributed grid type architecture with a sustained processing speed
of 1 teraflop. This list of machines runs the gamut from a relatively low-end system to
a state of the art high performance “super computer”.
In Table 4 we present execution times for a hypothetical 1 GHz PIV machine with a
sustained performance level of 1 gigaflop when applied to the various algorithm/dataset
combinations. In Table 5 we present execution times for a hypothetical Flat Neighbor-
hood Network of 12 1.4 GHz Athlons with a sustained performance of 24 gigaflops
when applied to the various algorithm/dataset combinations. In Table 6 we present ex-
ecution times for a hypothetical Flat Neighborhood Network of 64 0.7 GHz Athlons
with a sustained performance of 64 gigaflops. And finally in Table 7 we present minimum
execution times for a massively distributed grid type architecture with a sustained
processing speed of 1 teraflop.
Table 4
Execution speed of the various algorithm/dataset combinations on a Pentium IV 1 GHz machine with 1 gigaflop performance assumed
Table 5
Execution speed of the various algorithm/dataset combinations on a Flat Neighborhood Network of 12 1.4 GHz Athlons with a 24 gigaflop performance assumed
Table 6
Execution speed for the various algorithm/dataset combinations on a Flat Neighborhood Network of 64 700 MHz Athlons with a 64 gigaflop performance assumed
Table 7
Execution speed for the various algorithm/dataset combinations on a massively distributed grid type architecture with a 1 teraflop performance assumed
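As a rough illustration of how such tables are produced, the following Python sketch divides operation counts for several of the complexity classes in Table 2 by the hypothetical sustained flop rates assumed above; the dataset sizes chosen here are merely illustrative, not a reproduction of the tables.

import math

# Operation counts (within a constant factor) for several complexity classes of Table 2.
def operation_count(n):
    return {
        "O(n)":        n,
        "O(n log n)":  n * math.log(n),
        "O(n^(3/2))":  n ** 1.5,
        "O(n^2)":      n ** 2,
    }

# Hypothetical sustained rates quoted in the text.
machines = {
    "Pentium IV, 1 gigaflop":          1e9,
    "FNN of 12 Athlons, 24 gigaflops": 24e9,
    "FNN of 64 Athlons, 64 gigaflops": 64e9,
    "Distributed grid, 1 teraflop":    1e12,
}

for n in (1e6, 1e8, 1e10):   # illustrative dataset sizes
    for label, ops in operation_count(n).items():
        for machine, flops in machines.items():
            print(f"n = {n:.0e}  {label:10s}  {machine:32s}  {ops / flops:10.3g} s")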
By way of comparison and to give a sense of scale, as of June, 2004, a NEC com-
puter in the Earth Simulator Center in Japan achieved a speed of 35.860 teraflops
in 2002 and has a theoretical maximum speed of 40.960 teraflops. The record in
the United States is held by Lawrence Livermore National Laboratory with a California
Digital Corporation computer that achieved 19.940 teraflops in 2004 with a theoretical
peak performance of 22.038 teraflops. These records of course are highly dependent
on specific computational tasks and highly optimized code and do not represent per-
formance that would be achieved with ordinary algorithms and ordinary code. See
http://www.top500.org/sublist/ for details.
Tables 4–7 give computation times as a function of the overall data set size and
the algorithmic complexity at least to a reasonable order of magnitude for a variety
of computer configurations. This is essentially an updated version of (Wegman, 1995).
What is perhaps of more interest for scalability issues is a discussion of what is com-
Table 8
Types of computers for interactive feasibility with a response time less than one second
Table 9
Types of computers for computational feasibility with a response time less than one week
methodology are applied to medium or smaller datasets, i.e. falling in the range of more
traditional EDA applications.
Table 10
Transfer rates for a variety of data transfer regimes
are related to local network and data transfer within devices internal to the local com-
puter.
Table 11
Resolvable number of pixels across screen for several viewing scenarios
resolvable, thus 0.486 × 9 = 4.38. Because the angular separation between two foveal
cones is approximately one minute of arc (2 × 0.486), we include this angle in Table 11.
The standard aspect ratio for computer monitors and standard NTSC television is
4 : 3, width to height. If we take the Wegman angular resolution in an immersive setting,
i.e. 2333 pixels horizontal resolution, then the vertical resolution would be approxi-
mately 1750 pixels for a total of 4.08 × 10^6 resolvable pixels. Notice that taking the
high definition TV aspect ratio of 16 : 9 would actually yield fewer resolvable pixels.
Even if we took the most optimistic resolution of one minute of arc (implying each
pixel falls on a single foveal cone) in an immersive setting, the horizontal resolution
would be 8400 pixels and the vertical resolution would be 6300 pixels yielding the to-
tal number of resolvable pixels at 5.29 × 10^7 resolvable pixels. Thus as far as using
graphical data mining techniques, it would seem that there is an insurmountable upper
bound around 10^6 to 10^7 data points, i.e. somewhere between medium to large datasets.
Interestingly enough, this coincides with the approximate number of cones in the retina.
According to Osterberg (1935), there are approximately 6.4 million cones in the retina
and somewhere around 110 to 120 million rods.
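The pixel totals quoted above follow directly from the 4 : 3 aspect ratio; a minimal sketch of the arithmetic, using only the numbers given in the text:

def resolvable_pixels(horizontal, aspect=(4, 3)):
    # Vertical resolution follows from the aspect ratio; the total is the product.
    vertical = horizontal * aspect[1] // aspect[0]
    return vertical, horizontal * vertical

print(resolvable_pixels(2333))   # -> (1749, 4080417): about 1750 and 4.08 x 10^6
print(resolvable_pixels(8400))   # -> (6300, 52920000): 6300 and 5.29 x 10^7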
As we have pointed out before, data mining in some sense flows from a conjunction
of both computer science and statistical frameworks. The development of relational
database theory and the structured query language (SQL) among information systems
specialists allowed for logical queries to databases and the subsequent exploitation of
knowledge in the databases. However, being limited to logical queries (and, or, not)
was a handicap and the desire to exploit more numerical and even statistically oriented
queries led to the early development of data mining. Simultaneously, the early exploita-
tion of supercomputers for physical system modeling using partial differential equations
had run its course and by the early 1990s, supercomputer manufacturers were looking
for additional marketing outlets. The exploitation of large scale commercial databases
was a natural application of supercomputer technology. So there was both an academic
pull and a commercial push to develop data mining in the context of computer science.
In later sections, we describe relational databases and SQL.
As discussed above, data mining is often defined in terms of approaches that can
deal with large to massive data set sizes. An important implication of this definition is
that analysis almost by definition has to be automated so that interactive approaches
and approaches that exploit very complex algorithms are prohibited in a data mining
framework.
on the data preparation portion of the process. The amount of effort associated with the
data collection, cleaning, standardization, etc., can be somewhat daunting and actually
far outweigh those steps associated with the rest of the data analysis.
Our current electronic age has allowed for the easy production of copious amounts
of data that can be subject to analysis. Each time that an individual sets out on a trip
he/she generates a paper trail of credit card receipts, hotel and flight reservation infor-
mation, and cell phone call logs. If one also includes website utilization preferences,
then one can begin to create a unique “electronic fingerprint” for the individual. Once
the information is collected then it must be stored in some sort of data repository. His-
torically, data were initially stored and manipulated as flat files
without any sort of associated database structure. There are some data analysis purists
who might insist that this is still the best strategy. Currently data mining trends have
focused on the use of relational databases as a convenient storage facility. These data-
bases or data warehouses provide the data analyst with a ready supply of real material
for use in knowledge discovery. The discovery of interesting patterns within the data
provides evidence as to the utility of the collected data. In addition, a well thought-out
data warehouse could provide a convenient framework for integration of the knowledge
discovery process into the organization.
The application areas of interest to the data mining community have been driven
by the business community. Market-based analysis, which is an example of tool-based
machine learning, has been one of the prominent applications of interest. In this case,
one can analyze either the customers, in order to discern what a customer purchases
and thereby gain insight into the psychological motivations behind purchasing strategies,
or the products actually purchased. Product-based analysis is usually referred
to as market basket analysis. This type of analysis gives insight into the merchandise
by revealing those products that tend to be purchased together and those that are most
amenable to purchase.
Some of the applications of these types of market-based analysis include focused
mailing in direct/email marketing, fraud detection, warranty claims analysis, department
store floor/shelf layout, catalog design, segmentation based on transaction patterns, and
performance comparison between stores. Some of the questions that might be pertinent
to floor/shelf layout include the following. Where should detergents be placed in the
store in order to maximize their sales? Are window cleaning products purchased when
detergents and orange juice are bought together? Is soda purchased with bananas? Does
the brand of the soda make a difference? How are the demographics of the neighborhood
affecting what customers are buying?
Table 12
Co-occurrence of products
together, to the inexplicable, when a new super store opens, one of the most commonly
sold items is light bulbs.
The creation of association rules often proceeds from the analysis of grocery point-
of-sale transactions. Table 12 provides a hypothetical co-occurrence of products matrix.
A cursory examination of the co-occurrence table suggests some simple patterns that
might be resident in the data. First we note that orange juice and soda are more likely
to be purchases together than any other two items. Next we note that detergent is never
purchased with window cleaner or milk. Finally we note that milk is never purchased
with soda or detergent. These simple observations are examples of associations and
may suggest a formal rule like: if a customer purchases soda, then the customer does
not purchase milk.
In the data, two of the five transactions include both soda and orange juice. These
two transactions support the rule. The support for the rule is two out of five or 40%.
In general, of course, data subject to data mining algorithms is usually collected for
some other administrative purpose other than market research. Consequently, these data
are not considered as a random sample and probability statements and confirmatory
inference procedures in the usual statistical sense may not be associated with such
data.
This caveat aside, it is useful to understand what might be the probabilistic underpin-
nings of the association rules. The support of a product corresponds to the unconditional
probability, P (A), that a product is purchased. The support for a pair of products corre-
sponds to the unconditional probability, P (A ∩ B), that both occur simultaneously.
Because two of the three transactions that contain soda also contain orange juice,
there is 67% confidence in the rule ‘If soda, then orange juice.’ The confidence corre-
sponds to the conditional probability P (A | B) = P (A ∩ B)/P (B).
Typically, the data miner would require a rule to have some minimum user-specified
confidence. Rule 1 & 2 → 3 has a 90% confidence if when a customer bought 1 and 2,
in 90% of the cases, the customer also bought 3. A rule must have some minimum
user-specified support. By this we mean that the rule 1 & 2 → 3 should hold in some
minimum percentage of transactions to have value.
Consider the simple transaction table in Table 13, and suppose we require a minimum
support of 50% (2 transactions) and a minimum confidence of 50%. Table 14 provides
a simple frequency count for each of the items. The rule 1 → 3 has a support of 50%
and a confidence given by Support({1, 3})/Support({1}) = 66%.
Table 13
Simple transaction table
Table 14
Simple item frequency table
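The support and confidence calculations above are straightforward to automate. The following sketch uses a hypothetical transaction list (standing in for Table 13, whose entries are not reproduced here) and recovers the 50% support and 66% confidence quoted for the rule 1 → 3:

def support(transactions, itemset):
    # Fraction of transactions containing every item in the itemset.
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    # confidence(A -> B) = support(A and B) / support(A)
    return (support(transactions, set(antecedent) | set(consequent))
            / support(transactions, antecedent))

# Hypothetical transactions with items coded 1-4.
transactions = [{1, 2, 3}, {1, 3}, {1, 2}, {2, 4}]
print(support(transactions, {1, 3}))       # 0.5   -> 50% support for the rule 1 -> 3
print(confidence(transactions, {1}, {3}))  # 0.667 -> 66% confidence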
With this sort of tool in hand one can proceed forward with the identification of
interesting associations between various products. For example, one might search for
all of the rules that contain “Diet coke” as a result. The results of this analysis might
help the store better boost the sales of Diet coke. Alternatively one might wish to find
all rules that have “Yogurt” in the condition. These rules may help determine what
products may be impacted if the store discontinues selling “Yogurt”. As another exam-
ple one might wish to find all rules that have “Brats” in the condition and “mustard”
in the result. These rules may help in determining the additional items that have to be
sold together to make it highly likely that mustard will also be sold. Sometimes one
may wish to qualify the analysis to identify the top k rules. For example, one might wish
to find the top k rules that contain the word “Yogurt” in the result. Figure 2 presents
a tree-based representation of item associations from the specific level to the general
level.
Association rules may take various forms. They can be quantitative in nature. A good
example of a quantitative rule is “Age[35,40] and Married[Yes] → NumCars[2].” As-
sociation rules may involve constraints like “Find all association rules where the prices
of items are > 100 dollars.” Association rules may vary temporally. For example, we
may have an association rule “Diaper → Beer (1% support, 80% confidence).” This
rule can be modified to accommodate a different time slot such as “Diaper → Beer
(20% support) 7:00–9:00 pm weekdays.” They can be generalized association rules
consisting of hierarchies over items (UPC codes). For example, the rule
“clothes → footwear” may hold even if “clothes → shoes” does not.
Association rules may be represented as Bayesian networks. These provide for the
efficient representation of a probability distribution as a directed acyclic graph where
the nodes represent attributes of interest, the edges represent direct causal influence between the
nodes, and conditional probabilities for the nodes are given for all possible values of their parent nodes.
One may actually be interested in optimization of association rules. For example,
given a rule (l < A < u) and X → Y, find values for l and u such that the rule has a support
greater than a certain threshold and maximizes support, confidence, or gain. For example,
suppose we have a rule “ChkBal[l, u] → DVDPlayer.” We might then ascertain that
choosing l = $30 000 and u = $50 000 optimizes the support, confidence, or gain of
this rule.
Some of the strengths of market basket analysis are that it produces easy to under-
stand results, it supports undirected data mining, it works on variable length data, and
rules are relatively easy to compute. It should be noted that if there are n items under
consideration and rules of the form “A → B” are considered, there will be n^2 possible
association rules. Similarly, if rules of the form “A & B → C” are considered, there will
be n^3 possible association rules. Clearly if all possible association rules are considered,
the number grows exponentially with n. Some of the other weaknesses of market basket
analysis are that it is difficult to determine the optimal number of items, it discounts rare
items, it is limited on the support that it provides.
The other computer science area that has contributed significantly to the roots of data
mining is the area of text classification. Text classification has become a particularly
important topic given the plethora of readily available information from the World Wide
Web. Several other chapters in this volume discuss text mining as another aspect of data
mining.
A discussion of the historical roots of the data mining methodologies within the com-
puter science community would not be complete without touching upon terminology.
One usually starts the data analysis process with a set of n observations in p space.
What is usually referred to as dimensions in the mathematical community are known as
variables in statistics or attributes in computer science. Observations within the math-
ematics community, i.e., one row of the data matrix, are referred to as cases in the
statistics community and records in the computer science community. Unsupervised
learning in computer science is known as clustering in statistics. Supervised learning
in computer science is known as classification or discriminant analysis in the statistics
community. It is worthwhile to note that statistical pattern recognition usually refers to
both clustering and classification.
4. Data preparation
Much time, effort, and money is usually associated with the actual collection of data
prior to data mining analysis. We will assume for the discussions below that the data
has already been collected and that we will merely have to obtain the data from its
storage facility prior to conducting our analysis. With the data in hand, the first step of
the analysis procedure is data preparation. Our experiences seem to indicate that the data
preparation phase may require up to 60% of the effort associated with a given project.
In fact, data preparation can often determine the success or failure of our analytical
efforts.
Some of the issues associated with data preparation include data cleaning/data
quality assurance, identification of appropriate data type (continuous or categorical),
handling of missing data points, identifying and dealing with outliers, dimensionality
reduction, standardization, quantization, and potentially subsampling. We will briefly
examine each of these issues in turn. First we note that data might be of such poor quality
that it does not contain statistically significant patterns or relationships. Even if there are
meaningful patterns in the data, these patterns might be inconsistent with results ob-
tained using other data sets. Data might also have been collected in a biased manner, or,
since in many cases the data are based on human respondents, the data may be of uneven
quality. We finally note that one has to be careful that the discovered patterns are not
too specific or too general for the application at hand.
Even when the researcher is presented with meaningful information, one still often
must remove noise from the dataset. This noise can take the form of faulty instru-
ment/sensor readings, transmission errors, data entry errors, technology limitations, or
misused naming conventions. In many cases, numerous variables are stored in the data-
base that may have nothing whatsoever to do with the particular task at hand. In this
situation, one must be willing to identify those variables that are germane to the current
analysis while ignoring or destroying the other confounding information.
Some of the other issues associated with data preparation include duplicate data re-
moval, missing value imputation (manual or statistical), identification and removal of
data inconsistencies, identification and refreshment of stale or untimely data, and the
creation of a unique record or case id. In many cases, the human has a role in interactive
procedures that accomplish these goals.
Next we consider the distinction between continuous and categorical data. Most
statistical theory and many graphics tools have been developed for continuous data.
However, most of the data that is of particular interest to the data mining community is
categorical. Those data miners that have their roots in the computer science community
often take a set of continuous data and transform the data into categorical data such
as low, medium, and high. We will not focus on the analysis of categorical data here
but the reader is referred to Agresti (2002) for a thorough treatment of categorical data
analysis.
Fig. 3. Missing data plot for an artificial dataset. Each observation is plotted as a vertical bar. Missing values
are plotted in black.
histogram. The color histogram first appeared in the paper on the use of parallel coor-
dinates for statistical data analysis (Wegman, 1990). This data analysis technique was
subsequently rediscovered by Minnotte and West (1998). They coined the phrase “data
image” for this “new” visualization technique in their paper.
Another important issue in data preparation is the removal/identification of outliers.
While outliers are easy to detect in low dimensions, d = 1, 2, or 3, their identification in
high-dimensional spaces may be more tenuous. In fact, high-dimensional outliers may
not actually manifest their presence in low-dimensional projections. For example, one
could imagine points uniformly distributed on a hyper-dimensional sphere of large ra-
dius with a single outlier point at the center of the sphere. The minimum volume ellipsoid
(MVE) (see Poston et al., 1997) has previously been proposed as a methodology for
outlier detection, but such methods have exponential computational complexity. Meth-
ods based on the use of the Fisher information matrix and convex hull peeling are more
feasible but still too complex for massive datasets.
There are even visualization strategies for the identification of outliers in high-
dimensional spaces. Marchette and Solka (2003) have proposed a method based on
an examination of the color histogram of the interpoint distance matrix. This method
provides the capability to identify outliers in extremely high-dimensional spaces for
moderately sized, fewer than 10 000 observations, datasets. In Figure 4 we present a
plot of the interpoint data image for a collection of 100 observations uniformly dis-
tributed along the surface of a 5-dimensional hypersphere along with a single outlier
point positioned at the center of the hypersphere. We have notated the outlier in the data
Fig. 4. Interpoint distance data image for a set of 100 observations sampled uniformly from the surface of
a 5-dimensional hypersphere with an outlier placed at the center of the hypersphere. We have indicated the
location of the outlier by a rectangle placed at the center of the data image.
image via a square. The presence of the outlier is indicated in the data image via the
characteristic cross like structure.
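A minimal sketch of such an interpoint distance data image, assuming numpy and matplotlib are available (this illustrates the idea rather than reproducing the authors' code):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# 100 points on the surface of a 5-dimensional hypersphere plus one outlier at the center.
X = rng.normal(size=(100, 5))
X /= np.linalg.norm(X, axis=1, keepdims=True)
X = np.vstack([X, np.zeros(5)])

# Interpoint distance matrix rendered as an image (the "data image").  The outlier's row
# and column form a nearly constant band, giving the characteristic cross-like structure.
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
plt.imshow(D, cmap="gray")
plt.title("Interpoint distance data image")
plt.show()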
4.2. Quantization
Once the missing values and outliers have been removed from the dataset, one may be
faced with the task of subsampling from the dataset in the case when there are too many
observations to be processed. Sampling from a dataset can be particularly expensive
in the case where the dataset is stored in some sort of relational database management
system. So we may be faced with squishing the dataset, reducing the number of cases, or
squashing the dataset, reducing the number of variables or dimensions associated with
the data. One normally thinks of subsampling as the standard way of squishing a dataset
but quantization or binning is an equally viable approach. Quantization has a rich history
of success in the computer science community in application areas including signal and
image processing.
Quantization does possess some useful statistical properties. For example, if the quantizer
value for the j-th bin is taken to be y_j = E[W | Q = y_j], the mean of the observations
falling in that bin, then the quantizer can be shown to be self-consistent, i.e.
E[W | Q] = Q. The reader is referred
to (Khumbah and Wegman, 2003) for a full development of the statistical properties
associated with the quantization process. See also (Braverman, 2002).
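A small numerical illustration of the self-consistency property; the equal-frequency binning here is purely a convenient choice for the sketch:

import numpy as np

rng = np.random.default_rng(1)
w = rng.normal(size=10_000)

# Quantize into 10 equal-frequency bins and represent each bin by the mean of the
# observations falling in it.
edges = np.quantile(w, np.linspace(0, 1, 11))
bins = np.clip(np.digitize(w, edges) - 1, 0, 9)
bin_means = np.array([w[bins == j].mean() for j in range(10)])
q = bin_means[bins]                                  # the quantized version of w

# Self-consistency: averaging W within each quantizer level recovers that level.
print("E[W | Q] = Q holds:",
      np.allclose([w[q == v].mean() for v in bin_means], bin_means))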
A perhaps more interesting topic is geometry-based tessellation. One needs space-
filling tessellations made up of congruent tiles that are as spherical as possible in order
5. Databases
5.1. SQL
Knowledge discovery and data mining have many of their roots in database technology.
Relational databases and structured query language (SQL) have a 25+ year history.
However, the boolean relations (and, or, not) commonly used in relational databases
and SQL are inadequate for fully exploring data.
Relational databases and SQL are typically not well-known to the practicing statisti-
cian. SQL, pronounced “ess-que-el”, is used to communicate with a database according
to certain American National Standards Institute (ANSI) standards. SQL statements
can be used to store records in a database, access these records, and retrieve the records.
Common SQL commands such as “Select”, “Insert”, “Update”, “Delete”, “Create”, and
“Drop” can be used to accomplish most of the tasks that one needs to do with a database.
Some common relational database management systems that use SQL include Oracle,
Sybase, Microsoft SQL Server, Access, Ingres, and the public license servers MySQL
and MSQL.
A relational database system contains one or more objects called tables. These tables
store the information in the database. Tables are uniquely identified by their names and
are comprised of rows and columns. The columns in the table contain the column name,
the data type and any other attribute for the columns. We, statisticians, would refer
to the columns as the variable identifiers. Rows contain the records of the database.
Statisticians would refer to the rows as cases. An example database table is given in
Table 15.
The select SQL command is one of the standard ways that data is extracted from a
table. The format of the command is: select “column1”[,“column2”, etc.] from “table-
name” [where “condition”]; . The arguments given in the square brackets are optional.
One can select as many column names as one likes or use “*” to choose all of the
columns. The optional where clause is used to indicate which data values or rows
should be returned. The operators that are typically used with where include = (equal),
> (greater than), < (less than), >= (greater than or equal to), <= (less than or equal
to), <> (not equal to), and LIKE. The LIKE operator is a pattern-matching operator that
does support the use of wild-card characters through the use of %. The % can appear at
the start or end of a string. Some of the other handy SQL operators include create table
(to create a new table), insert (to insert a row into a table), and update (to update or change
those records that match a specified criterion). The delete operator is used to delete
designated rows or records from a table.
Table 15
Example relational database
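A self-contained illustration of the select syntax described above, using the SQLite engine available in the Python standard library (any of the SQL servers named above accepts similar statements); the table and its contents are made up for the example:

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# A toy table in the spirit of Table 15; the column names and values are hypothetical.
cur.execute("CREATE TABLE customers (last_name TEXT, city TEXT, balance REAL)")
cur.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                [("Smith", "Fairfax", 1200.00),
                 ("Jones", "Dahlgren", 350.50),
                 ("Smythe", "Fairfax", 87.25)])

# select "column1", "column2" from "tablename" where "condition";
cur.execute("SELECT last_name, balance FROM customers WHERE balance >= 100")
print(cur.fetchall())

# LIKE with the % wild card matches a pattern at the start or end of a string.
cur.execute("SELECT last_name FROM customers WHERE last_name LIKE 'Sm%'")
print(cur.fetchall())
conn.close()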
measures like counts, means, proportions and standard deviations. ROLAP refers to
relational OLAP using extended SQL.
In summary, we note that the relational database technology is fairly compute inten-
sive. Because of this, commercial database technology is challenged by the analysis of
datasets above about 10^8 observations. This computational limitation applies to many
of the algorithms developed by computer scientists for data mining.
The hunt for structure in numerical data has had a long history within the statistical
community. Examples of methodologies that may be used for data exploration in a data
mining scenario include correlation and regression, discriminant analysis, classifica-
tion, clustering, outlier detection, classification and regression trees, correspondence
analysis, multivariate nonparametric density estimation for hunting bump and ridges,
nonparametric regression, statistical pattern recognition, categorical data analysis, time-
series methods for trend and periodicity, and artificial neural networks. In this volume,
we have a number of detailed discussions on these techniques including nonparamet-
ric density estimation (Scott and Sain, 2005), multivariate outlier detection (Hubert et
al., 2005), classification and regression trees (Sutton, 2005), pattern recognition (Hand,
2005), classification (Marchette et al., 2005), and correspondence analysis (Rao, 2005).
To a large extent, these methods have been developed and treated historically as
confirmatory analysis techniques with emphasis on their statistical optimality and as-
ymptotic properties. In the context of data mining, where data are often not collected
according to accepted statistical sampling procedures, of course, traditional interpre-
tations as probability models with emphasis on statistical properties are inappropriate.
However, most of these methodologies can be reinterpreted as exploratory tools. For
example, while a nonparametric density estimator is frequently interpreted to be an es-
timator that asymptotically converges to the true underlying density, as a data mining
tool, we can simply think of the density estimator as a smoother which helps us hunt for
bumps and other features in the data, where we have no reason to assume that there is a
unique underlying density.
No discussion on statistical methods would be complete without referring to data
visualization as an exploratory tool. This area also is treated by several authors in this
volume, including Carr (2005), Buja et al. (2005), Chen (2005), and Wilhelm (2005).
We shall discuss visual data mining briefly in a later section of this paper, reserving
this section for some brief reviews of some analytical tools which are not covered so
thoroughly in other parts of this volume and which are perhaps a bit less common in
the usual statistical literature. Of further interest is the emerging attention in approaches
to streaming data. This topic will be discussed separately in yet another section of this
chapter.
µ_i = (1/(n π_i)) Σ_{j=1}^{n} τ_ij x_j ,      Σ_i = (1/(n π_i)) Σ_{j=1}^{n} τ_ij (x_j − µ_i)(x_j − µ_i)† ,
where τij is the estimated posterior probability that x j belongs to component i, πi is the
estimated mixing coefficient, µi and Σ i are the estimated mean vector and covariance
matrix, respectively. The EM is applied until convergence is obtained. A visualization of
this is given in Figure 7. Mixture estimates are L1 consistent only if the mixture density
is correct and if the number of mixture terms is correct.
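A compact sketch of one EM iteration implementing the updates above for a multivariate normal mixture, assuming numpy and scipy are available; the two-component test data and starting values are our own illustrative choices:

import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, pis, mus, sigmas):
    # One EM iteration for a finite normal mixture, following the updates above.
    n, _ = X.shape
    # E-step: posterior probability tau_ij that x_j belongs to component i.
    dens = np.array([pi * multivariate_normal.pdf(X, mean=mu, cov=sig)
                     for pi, mu, sig in zip(pis, mus, sigmas)])       # shape (k, n)
    tau = dens / dens.sum(axis=0, keepdims=True)
    # M-step: mixing coefficients, means, and covariances.
    pis_new = tau.sum(axis=1) / n
    mus_new = (tau @ X) / (n * pis_new)[:, None]
    sigmas_new = []
    for i in range(len(pis)):
        d = X - mus_new[i]
        sigmas_new.append((tau[i, :, None] * d).T @ d / (n * pis_new[i]))
    return pis_new, mus_new, np.array(sigmas_new)

# Two-component toy data; iterate until the parameters stop changing appreciably.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(4, 1, (200, 2))])
pis, mus, sigmas = np.array([0.5, 0.5]), np.array([[0., 0.], [3., 3.]]), np.array([np.eye(2)] * 2)
for _ in range(50):
    pis, mus, sigmas = em_step(X, pis, mus, sigmas)
print(pis, mus)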
Adaptive mixtures are used in another, semiparametric, recursive method discussed
by Priebe and Marchette (1993) and Priebe (1994) and provide an alternate formulation
avoiding issues of the number of terms being correct. The recursive update equations
become:
τ_{i,n+1} = π_{i,n} φ(x_{n+1}; θ_{i,n}) / Σ_{k=1}^{N} π_{k,n} φ(x_{n+1}; θ_{k,n}),
π_{i,n+1} = π_{i,n} + (1/n)(τ_{i,n+1} − π_{i,n}),
µ_{i,n+1} = µ_{i,n} + (τ_{i,n+1}/(n π_{i,n})) (x_{n+1} − µ_{i,n}),
Σ_{i,n+1} = Σ_{i,n} + (τ_{i,n+1}/(n π_{i,n})) [(x_{n+1} − µ_{i,n})(x_{n+1} − µ_{i,n})† − Σ_{i,n}].
Fig. 7. Visualization of normal mixture model parameters in a one-dimensional setting. For a color reproduction of this figure see the color figures section, page 565.
In addition, it is necessary to have a create rule. The basic idea is that if a new
observation is too far from the previously established components (usually in terms of
Mahalanobis distance), then the new observation will not be adequately represented by
the current mixing terms and a new mixing term should be added. We create a new
component centered at x t with a nominal starting covariance matrix. The basic form of
the update/create rule is as follows:
θ_{t+1} = θ_t + (1 − P_t(x_{t+1}, θ_t)) U_t(x_{t+1}, θ_t) + P_t(x_{t+1}, θ_t) C_t(x_{t+1}, θ_t),
where Ut is the previously given update rule, Ct is the create rule, and Pt is the decision
rule taking on values either 0 or 1. Figure 8 presents a three-dimensional illustration.
This procedure is nonparametric because of the adaptive character. It is generally L1
consistent under relatively mild conditions, but almost always too many terms are cre-
ated and pruning is needed.
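A one-dimensional sketch of the recursive update together with a simple distance-based create rule; the threshold, starting variance, and the rescaling of the mixing coefficients when a new term is created are illustrative choices, not the authors' specification:

import numpy as np

def normal_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def adaptive_mixture(data, create_threshold=3.0, start_var=1.0):
    # One-pass recursive estimation following the update/create rules above (1-D sketch).
    pis, mus, vars_ = [1.0], [float(data[0])], [start_var]
    for n, x in enumerate(data[1:], start=1):
        # Distance (in standard deviations) to the nearest existing component.
        dist = min(abs(x - m) / np.sqrt(v) for m, v in zip(mus, vars_))
        if dist > create_threshold:
            # Create rule: a new component centered at the new observation.
            pis = [p * n / (n + 1) for p in pis] + [1.0 / (n + 1)]
            mus.append(float(x)); vars_.append(start_var)
            continue
        # Update rule (recursive, EM-style stochastic approximation).
        dens = np.array([p * normal_pdf(x, m, v) for p, m, v in zip(pis, mus, vars_)])
        tau = dens / dens.sum()
        for i in range(len(pis)):
            step = tau[i] / (n * pis[i])
            mus[i] += step * (x - mus[i])
            vars_[i] = max(vars_[i] + step * ((x - mus[i]) ** 2 - vars_[i]), 1e-3)  # numerical floor
            pis[i] += (tau[i] - pis[i]) / n
    return pis, mus, vars_

rng = np.random.default_rng(3)
data = rng.permutation(np.concatenate([rng.normal(0, 1, 500), rng.normal(8, 1, 500)]))
pis, mus, vars_ = adaptive_mixture(data)
print(len(pis), "terms created; pruning would typically reduce this number")

As the text notes, the one-pass procedure tends to create more terms than necessary, which is why a pruning step normally follows.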
Solka et al. (1995) proposed the visualization methods for mixture models shown
in Figures 7 and 8. Solka et al. (1998) and Ahn and Wegman (1998) proposed alter-
native effective methods for eliminating redundant mixture terms. We note finally that
mixture models are valuable from the perspective that they naturally suggest clusters
centered at the mixture means. However, while the model complexity may be reduced,
Euclidean distance: d(x_i·, x_j·) = [Σ_{k=1}^{d} (x_ik − x_jk)^2]^(1/2);
City block metric: d(x_i·, x_j·) = Σ_{k=1}^{d} |x_ik − x_jk|;
Canberra metric: d(x_i·, x_j·) = Σ_{k=1}^{d} |x_ik − x_jk| / (x_ik + x_jk);
Angular separation metric: d(x_i·, x_j·) = Σ_{k=1}^{d} x_ik x_jk / [Σ_{k=1}^{d} x_ik^2 · Σ_{k=1}^{d} x_jk^2]^(1/2).
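A direct transcription of these metrics (the Canberra metric as written assumes positive coordinates); the test vectors are arbitrary:

import numpy as np

def euclidean(x, y):          return np.sqrt(np.sum((x - y) ** 2))
def city_block(x, y):         return np.sum(np.abs(x - y))
def canberra(x, y):           return np.sum(np.abs(x - y) / (x + y))
def angular_separation(x, y): return np.sum(x * y) / np.sqrt(np.sum(x ** 2) * np.sum(y ** 2))

x, y = np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 4.0])
for metric in (euclidean, city_block, canberra, angular_separation):
    print(metric.__name__, round(float(metric(x, y)), 4))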
In this matrix, the smallest distance is now 4.0, so that 3 is joined into (4, 5). This yields
the penultimate distance matrix of
(1, 2) (3, 4, 5)
(1, 2) 0.0
(3, 4, 5) 5.0 0.0
Of course, the last agglomerative step is to join (1, 2) to (3, 4, 5) and to yield the single
cluster (1, 2, 3, 4, 5). This hierarchical cluster yields a dendrogram given in Figure 9. In
this example, the complete linkage clustering yields the same cluster sequence and the
same dendrogram. The intermediate distance matrices are different however.
An alternate approach could be to use group-average clustering. The distance be-
tween clusters is the average of the distance between all pairs of individuals between
the two groups. We also note that there are methods for divisive clustering, i.e. begin-
ning with every individual in one large cluster and recursively separating into smaller,
multiple clusters, until every individual is in a singleton cluster.
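Single-linkage, complete-linkage, and group-average agglomerative clustering of this kind are readily available; a minimal sketch using scipy's hierarchical clustering routines on synthetic data:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(5, 1, (10, 2))])

# Agglomerative clustering on the interpoint distances; method="single" is single linkage,
# "complete" and "average" (group-average) are the other criteria discussed above.
Z = linkage(pdist(X), method="single")
dendrogram(Z)
plt.title("Single-linkage dendrogram")
plt.show()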
W = (1/(n − g)) Σ_{i=1}^{g} Σ_{j=1}^{n_i} (x_ij − x̄_i)(x_ij − x̄_i)† ,
B = Σ_{i=1}^{g} n_i (x̄_i − x̄)(x̄_i − x̄)† ,
T = W + B.
Some optimization strategies are:
(1) minimize trace(W ),
(2) maximize det(T )/det(W ),
(3) minimize det(W ),
(4) maximize trace(BW^(-1)).
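A sketch of the scatter matrices and two of these criteria for a given partition; the toy data and labels below are our own:

import numpy as np

def scatter_matrices(X, labels):
    # Within-group matrix W (normalized by n - g, as above) and between-group matrix B.
    n, g = len(X), len(np.unique(labels))
    grand_mean = X.mean(axis=0)
    W = np.zeros((X.shape[1], X.shape[1]))
    B = np.zeros_like(W)
    for lab in np.unique(labels):
        Xi = X[labels == lab]
        mi = Xi.mean(axis=0)
        W += (Xi - mi).T @ (Xi - mi)
        B += len(Xi) * np.outer(mi - grand_mean, mi - grand_mean)
    return W / (n - g), B

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(3, 1, (50, 3))])
labels = np.repeat([0, 1], 50)
W, B = scatter_matrices(X, labels)
print("trace(W)      =", np.trace(W))                     # criterion (1)
print("trace(B W^-1) =", np.trace(B @ np.linalg.inv(W)))  # criterion (4)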
It should be noted that all approaches of this discussion have important computa-
tional complexity requirements. The number of partitions of n individuals into g groups
N(n, g) can be quite daunting. For example,
N (15, 3) = 2 375 101,
N (20, 4) = 45 232 115 901,
N (25, 8) = 690 223 721 118 368 580,
N (100, 5) ≈ 10^68 .
We note that many references to clustering algorithms exist. Two important ones are
(Everitt et al., 2001, Hartigan, 1975).
Other approaches to clustering involve minimal spanning trees (see, for example,
Solka et al., 2005, this volume), clustering based on mixture densities mentioned in
the previous section, and clustering based on Voronoi tessellations with centroids deter-
mined by estimating modes (Sikali, 2004).
a weight (or strength) in analogy to the synaptic efficiency of the biological neuron.
Each artificial neuron has a threshold value from which the weighted sum of the inputs
minus the threshold is formed. The artificial neuron is not binary as is the biologi-
cal neuron. Instead, the weighted sum of inputs minus threshold is passed through a
transfer function (also called an activation function), which produces the output of the
neuron. Although it is possible to use a step-activation function in order to produce a
binary output, step activation functions are rarely used. Generally speaking, the most
used activation function is a translation of the sigmoid curve given by
f(t) = 1 / (1 + e^(−t)).
See Figure 10. The sigmoid curve is a special case of the logistic curve.
The most common form of an artificial neural network is called a feed-forward net-
work. The first layer of neurons receives input from the data. The first layer is connected
to one or more additional layers, called hidden layers, and the last hidden layer is con-
nected to the output layer. The signal is fed forward through the network from the input
layer, through the hidden layer to the units in the output layer. This network is stable be-
cause it has no feedback. A network is recurrent if it has a feedback from later neurons
to earlier neurons. Recurrent networks can be unstable and, although interesting from a
research perspective, are rarely useful in solving real problems.
Neural networks are essentially nonlinear nonparametric regression tools. If the func-
tional form of the relationship between input variables and output variables were known,
it would be modeled directly. Neural networks learn the relationship between input and
output through training, which determines the weights of the neurons in the network. In
a supervised learning scenario, a set of inputs and outputs is assembled and the network
is trained (establishes neuron weights and thresholds) to minimize the error of its pre-
dictions on the training dataset. The best known example of a training method is back
propagation. After the network is trained, it models the unknown function and can be
used to make predictions on input values for which the output is not known.
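A minimal numpy sketch of a one-hidden-layer feed-forward network trained by back propagation on a toy regression problem; the layer sizes, learning rate, and target function are arbitrary illustrative choices:

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(6)
X = rng.uniform(-2, 2, (200, 1))
y = np.sin(X)                                   # the "unknown" function to be learned
W1, b1 = rng.normal(scale=0.5, size=(1, 8)), np.zeros(8)
W2, b2 = rng.normal(scale=0.5, size=(8, 1)), np.zeros(1)
lr = 0.1

for _ in range(2000):
    # Forward pass: weighted sums plus a bias term (the negative of a threshold),
    # passed through the sigmoid activation in the hidden layer.
    h = sigmoid(X @ W1 + b1)
    out = h @ W2 + b2
    err = out - y
    # Back propagation: gradients of the squared error with respect to the weights.
    gW2 = h.T @ err / len(X); gb2 = err.mean(axis=0)
    dh = (err @ W2.T) * h * (1 - h)
    gW1 = X.T @ dh / len(X); gb1 = dh.mean(axis=0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

print("training MSE:", float(np.mean((sigmoid(X @ W1 + b1) @ W2 + b2 - y) ** 2)))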
From this description it is clear that artificial neural networks operate on numerical
data. Although more difficult, categorical data can be modeled by representing the data
by numeric values. The number of neurons required for the artificial neural network
is related to the complexity of the unknown function that is modeled as well as to the
variability of the additive noise. The number of required training cases increases
nonlinearly as the number of connections in the network increases. A rule-of-thumb is
that the number of cases should be approximately ten times the number of connections.
Obviously, the more variables involved, the more neurons are required. Thus, variables
should be chosen carefully. Missing values can be accommodated if necessary. If there is
enough data, observations with missing values should be discarded. Outliers can cause
problems and should be discarded.
As much as we earlier protested in Section 2.4 that data mining of massive datasets is
unlikely to be successful, still the upper bound of approximately 10^6 allows for visually
mining relatively large datasets. Indeed, we have pursued this fairly aggressively. See
Wegman (2003). Of course, there are many situations where it is useful to attempt to
apply data mining techniques to more modest datasets. In this section, we would like
to give some perspective on our views of visual data mining. As pointed out earlier,
this volume contains a number of thoughtful chapters on data visualization and the
reader is strongly encouraged to examine the variety of perspectives that these chapters
represent.
plots, grand tour on all plot devices, linked views, saturation brushing, and pruning and
cropping. We have particularly exploited the combination of parallel coordinate plots,
k-dimensional grand tours (1 ≤ k ≤ d, where d is the dimension of the data), and
saturation brushing. We briefly describe these below.
Parallel coordinates are a multi-dimensional data plotting device. Ordinary cartesian
coordinates fail as a plot device after three dimensions because we live in a world with
3 orthogonal spatial dimensions. The basic idea of parallel coordinates is to give up the
orthogonality requirement and draw the axes as parallel to each other. A d-dimensional
point is plotted by locating its appropriate component on each of the corresponding par-
allel axes and interpolating with a straight line between axes. Thus a multi-dimensional
point is uniquely represented by a broken line. Much of the elegance of this graph de-
vice is due to a projective geometry duality, which allows for interpretation of structure
in the parallel coordinate plot. Details are available in (Wegman, 1990, 2003).
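A parallel coordinate plot can be produced directly from a data frame; a sketch using the pandas plotting helper, with two synthetic clusters standing in for real data:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

rng = np.random.default_rng(7)
# Two hypothetical 4-dimensional clusters; each observation is drawn as a broken line.
a = pd.DataFrame(rng.normal(0, 1, (50, 4)), columns=list("wxyz")).assign(cluster="A")
b = pd.DataFrame(rng.normal(3, 1, (50, 4)), columns=list("wxyz")).assign(cluster="B")
parallel_coordinates(pd.concat([a, b]), class_column="cluster", alpha=0.4)
plt.title("Parallel coordinate plot of two clusters")
plt.show()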
The grand tour is a method for animating a static plot by forming a generalized ro-
tation in a k-subspace of the d-space where the data live. The basic idea is that we
wish to examine the data from different views uncovering features that may not be vis-
ible in a static plot, literally taking a grand tour of the data, i.e. seeing the data from
all perspectives. We might add that the grand tour is especially effective at uncovering
outliers and clusters, tasks that are difficult analytically because of the computational
complexity of the algorithms involved. A discussion of the mathematics underpinning
both parallel coordinates and the grand tour can be found in (Wegman and Solka,
2002).
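The full grand tour interpolates smoothly through the space of k-dimensional frames; the toy sketch below is only a rough stand-in for that idea, not Wegman's algorithm. It repeatedly applies a small rotation in a randomly chosen coordinate 2-plane of d-space and replots the first two coordinates, so that over time the viewer sees the point cloud from many orientations.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 6))                 # hypothetical 6-dimensional data
d = X.shape[1]

plt.ion()
fig, ax = plt.subplots()
for step in range(200):
    # Rotate by a small angle in a randomly chosen coordinate 2-plane (i, j).
    i, j = rng.choice(d, size=2, replace=False)
    eps = 0.05
    R = np.eye(d)
    R[i, i] = R[j, j] = np.cos(eps)
    R[i, j], R[j, i] = -np.sin(eps), np.sin(eps)
    X = X @ R                                 # rotate the data in d-space

    ax.clear()
    ax.scatter(X[:, 0], X[:, 1], s=5)         # 2-dimensional projection of the tour
    ax.set_title(f"tour step {step}")
    plt.pause(0.01)
```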
The final concept is saturation brushing. Ordinary brushing involves brushing (or
painting) a subset of the data with a color. Using multiple colors in this mode allows
for designation of clusters within the dataset or other subsets, for example, negative
and positive values of a given variable by means of color. Ordinary brushing does not
distinguish between a pixel that represents one point and a pixel that represents 10 000
points. The idea of saturation brushing is to desaturate the color so that it is nearly black
and brush with the desaturated color. Then, using the so-called α-channel found on
modern graphics cards, add up the color components. Heavy overplotting is represented
by a fully saturated pixel whereas a single observation or a small amount of overplotting
will remain nearly black. Thus saturation brushing is an effective way of seeing the
structure of large datasets.
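Exact saturation brushing adds color components in the α-channel of the graphics hardware. A crude software approximation (matplotlib here, purely our assumption) is to draw points in a nearly transparent color on a black background, so that a pixel hit once stays close to black while a pixel hit thousands of times builds up toward the full color; matplotlib composites with "over" blending rather than true addition, so this only approximates the effect.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
# Hypothetical data: a heavily overplotted cluster on a sparse background.
dense = rng.normal(0, 0.2, size=(50_000, 2))
sparse = rng.uniform(-3, 3, size=(500, 2))
X = np.vstack([dense, sparse])

fig, ax = plt.subplots(facecolor="black")
ax.set_facecolor("black")
# Nearly transparent markers: isolated points remain almost black, while
# coincident points accumulate toward a saturated green.
ax.scatter(X[:, 0], X[:, 1], s=2, color="lime", alpha=0.02, edgecolors="none")
ax.set_xticks([]); ax.set_yticks([])
plt.show()
```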
Combining these methods leads to several strategies for interactive data analysis.
The BRUSH-TOUR strategy is a recursive method for uncovering cluster structure. The
basic idea is to brush all visible clusters with distinct colors. If the parallel axes are
drawn horizontally, then any gap in any horizontal slice separates two clusters. (Some
authors draw the parallel coordinates vertically, so in this case any gap in any vertical
slice separates two clusters.) Once all visible clusters are marked, initiate the grand
tour until more gaps appear. Stop the tour and brush the new clusters. Repeat until no
unmarked clusters appear. An example of the use of the BRUSH-TOUR strategy may
be found in (Wilhelm et al., 1999).
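The "gap in a slice" criterion can be mimicked offline. The hypothetical helper below (our own stand-in for what the analyst does by eye) looks for the widest gap along any single axis of the current view and, if it exceeds a threshold, brushes the two sides with different labels; if no gap is wide enough, the analyst would continue the tour.

```python
import numpy as np

def split_on_largest_gap(X, min_gap=1.0):
    """Split X at the widest single-axis gap, returning 0/1 labels,
    or None if no axis shows a gap wider than min_gap (hypothetical helper)."""
    best_gap, best_axis, best_cut = 0.0, None, None
    for j in range(X.shape[1]):
        v = np.sort(X[:, j])
        gaps = np.diff(v)
        k = int(np.argmax(gaps))
        if gaps[k] > best_gap:
            best_gap, best_axis, best_cut = gaps[k], j, (v[k] + v[k + 1]) / 2.0
    if best_gap < min_gap:
        return None                              # no visible gap: keep touring
    return (X[:, best_axis] > best_cut).astype(int)

# Toy use: two well-separated clusters in 3-space are brushed apart immediately.
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 0.3, (100, 3)), rng.normal(5, 0.3, (100, 3))])
labels = split_on_largest_gap(X)
print(labels[:5], labels[-5:])
```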
A second strategy is the TOUR-PRUNE strategy, which is useful for forming tree
structures. An example of the use of TOUR-PRUNE is to recursively build a deci-
sion tree based on demographic data. In the case illustrated in (Wegman, 2003), we
considered a profit variable and demographic data for a number of customers of a bank.
The profit variable was binarized by brushing the profit data with red for customers
that lost money for the bank and green for customers that made a profit for the bank.
The profit variable was taken out of the grand tour and the tour allowed to run on the
remaining demographic variables until either a strongly red or strongly green region
was found. A strongly red region indicated a combination of demographic variables
that represented customers who lost money, whereas a strongly green region indicated a
combination of demographic variables that represented customers who made profits for
the bank. By recursively touring and pruning, a decision tree can be built from combi-
nations of demographic variables for the purpose of avoiding unnecessary risks for the
bank.
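The TOUR-PRUNE loop can be caricatured, with heavy simplification, as recursive partitioning in which "finding a strongly red or strongly green region" becomes a purity test and "pruning" becomes a median split on the most separating variable. The sketch below is a hypothetical illustration of that recursion, not the interactive procedure of Wegman (2003).

```python
import numpy as np

def tour_prune(X, y, depth=0, max_depth=3, purity=0.9):
    """Recursively prune regions that are strongly red (y=0) or green (y=1)."""
    p = y.mean()
    if depth == max_depth or p >= purity or p <= 1 - purity or len(y) < 20:
        print("  " * depth + f"leaf: n={len(y)}, share green={p:.2f}")
        return
    # Choose the variable whose median split best separates red from green.
    cuts = [np.median(X[:, j]) for j in range(X.shape[1])]
    scores = [abs(y[X[:, j] <= c].mean() - y[X[:, j] > c].mean())
              for j, c in enumerate(cuts)]
    j = int(np.argmax(scores))
    print("  " * depth + f"split on variable {j} at {cuts[j]:.2f}")
    mask = X[:, j] <= cuts[j]
    tour_prune(X[mask], y[mask], depth + 1, max_depth, purity)
    tour_prune(X[~mask], y[~mask], depth + 1, max_depth, purity)

rng = np.random.default_rng(5)
X = rng.normal(size=(500, 4))                    # hypothetical demographic variables
y = (X[:, 0] + 0.5 * X[:, 2] > 0).astype(float)  # hypothetical profit indicator
tour_prune(X, y)
```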
Friedman and Tukey (1974) introduced the concept of projection pursuit and used this
method to explore the PRIM 7 data. This dataset has become something of a challenge
dataset for statisticians seeking to uncover multi-dimensional structure. In addition to
Friedman and Tukey, Carr et al. (1986) and Carr and Nicholson (1988) found linear
features, Scott (1992, p. 213) reported on a triangular structure found by his student
Rod Jee in an unpublished thesis (Jee, 1985, 1987), and Cook et al. (1995) found the
linear features hanging off the vertices of the Jee–Scott triangle.
The PRIM 7 data is taken from a high-energy particle-physics scattering experiment.
A beam of positively charged pi-mesons with an energy of 16 BeV is collided with a sta-
tionary target of protons contained in hydrogen nuclei. In such an experiment, quarks
can be exchanged between the pi-meson and proton, with overall conservation of en-
ergy and momentum. The data consists of 500 examples. For this experiment, seven
independent variables are sufficient to fully characterize the reaction products. A de-
tailed description of the physics of the reaction is given in (Friedman and Tukey, 1974,
p. 887). The seven-dimensional structure of this data has been investigated over the
years.
The initial configuration of the data is given in a 7-dimensional scatterplot matrix
in Figure 13. The BRUSH-TOUR strategy was used to identify substructures of the
data. A semifinal brushed view of the data is given in Figure 11. Of particular interest
is the view illustrating three triangular features given in Figure 12, which is a view of
the data after a GRAND-TOUR rotation. The coloring in Figure 12 is the same as in
Figure 11. Of course, two-dimensional features such as a triangle will often collapse
into a one-dimensional linear feature or a zero-dimensional point feature in many of
the two-dimensional projections. The fundamental question arising from the exploration of this
data is: what is the underlying geometric structure? The presence of triangular features
suggests that the data form a simplex. It is our conjecture that the data actually form a
truncated six-dimensional simplex. Based on this conjecture, we constructed simulated
data. The initial configuration of the simulated data is shown in Figure 14, which can
be compared directly with the real data in Figure 13.
Fig. 11. Scatterplot matrix of the PRIM 7 data after GRAND-TOUR rotation with features highlighted in
different colors. For a color reproduction of this figure see the color figures section, page 566.
Fig. 12. A scatterplot of the PRIM 7 data after GRAND-TOUR illustrating the three triangular features in the
data, again highlighted in different colors. For a color reproduction of this figure see the color figures section,
page 566.
Fig. 13. A scatterplot matrix of the initial unrotated configuration of the PRIM 7 data. After considerable
exploration, we conjecture a truncated 6-dimensional simplex based on the multiple triangular features.
Fig. 14. A scatterplot matrix of the initial unrotated configuration of our simulated PRIM 7 data based on our
conjectured truncated 6-dimensional simplex.
Fig. 15. The first two principal components of the hyperspectral imagery. There are 7 classes of pixels includ-
ing runway, water, swamp, grass, scrub, pine, and unknown (really oaks). The water and runway are isolated
in this figure. For a color reproduction of this figure see the color figures section, page 567.
Fig. 16. The recomputed principal components after denoising by removing the water and runway pixels. The
swamp and grass pixels are colored cyan and blue, respectively. These may be removed and the principal
components once again computed. For a color reproduction of this figure see the color figures section,
page 567.
Fig. 17. The penultimate denoised image. The scrub pixels are shown in blue. They are removed and one
final computation of the principal components is completed. For a color reproduction of this figure see the
color figures section, page 568.
so that pines and oaks have similar spectra. This also explains why scrub is closer to the
pines and oaks than say the grass or other categories. One last image (Figure 19) is of
interest. In this image, we took ten principal components and brushed them with distinct
colors for the seven distinct classes. This rotation shows that there is additional structure
in the data that is not encoded by the seven class labels. The source of this structure is
unknown to the experts who provided this hyperspectral data. Finally, we note in closing
this section that the results in Examples 1 and 2 have not previously been published.
Fig. 18. The final denoised image shows heavy overlap between the pine and unknown (oak) pixels. Based
on this analysis, we classify the unknowns as being closest to pines. In fact, both are trees and were found to be
intermingled when the hyperspectral imagery was ground-truthed. For a color reproduction of this figure see
the color figures section, page 568.
Fig. 19. The same hyperspectral data but based on 10 principal components instead of just 3. The plot of PC 5
versus PC 6 shows that there is additional structure in this dataset not captured by the seven classes originally
conjectured. For a color reproduction of this figure see the color figures section, page 569.
8. Streaming data
Statisticians have flirted with the concept of streaming data in the past. Statistical
process control considers data to be streaming, but at a comparatively low rate because
of the limits on the speed at which physical manufacturing can take place. Sequential analysis,
on the other hand, takes a more abstract perspective: it assumes that the data are unending
but very highly structured, and it seeks to make decisions about the underlying structure as quickly as possible.
Many classical estimators can be updated recursively as each new observation arrives, which is exactly what is needed for streaming data. For example, the sample mean satisfies
$$\bar{X}_n = \frac{n-1}{n}\,\bar{X}_{n-1} + \frac{1}{n}\,X_n.$$
Similarly, sums of powers (and hence sample moments) can be updated recursively:
$$\sum_{i=1}^{n} X_i^k = \sum_{i=1}^{n-1} X_i^k + X_n^k.$$
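A streaming implementation of these updates stores only the running quantities, never the raw history; the small class below is our own illustration of the two recursions.

```python
class StreamingMoments:
    """Maintain the recursive sample mean and running sums of powers X**k
    as observations arrive one at a time (illustrative sketch)."""

    def __init__(self, max_power=4):
        self.n = 0
        self.mean = 0.0
        self.power_sums = [0.0] * (max_power + 1)   # sum of X_i**k, k = 0..max_power

    def update(self, x):
        self.n += 1
        # Recursive mean: mean_n = ((n-1)/n) * mean_{n-1} + (1/n) * x_n
        self.mean = (self.n - 1) / self.n * self.mean + x / self.n
        # Recursive sums of powers: S_k(n) = S_k(n-1) + x_n**k
        for k in range(len(self.power_sums)):
            self.power_sums[k] += x ** k

sm = StreamingMoments()
for x in [1.0, 2.0, 4.0, 8.0]:
    sm.update(x)
print(sm.mean, sm.power_sums[2] / sm.n)   # running mean and second raw moment
```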
A recursive form of the kernel density estimator was formulated by Wolverton and
Wagner (1969) and independently by Yamato (1971):
$$f_n^{*}(x) = \frac{n-1}{n}\,f_{n-1}^{*}(x) + \frac{1}{n h_n}\,K\!\left(\frac{x - X_n}{h_n}\right),$$
where $K$ is the smoothing kernel and $h_n$ is the usual bandwidth parameter. Wegman
and Davies (1979) proposed an additional recursive formulation and showed strong
consistency and asymptotic convergence rates for
$$f_n^{\dagger}(x) = \frac{n-1}{n}\left(\frac{h_{n-1}}{h_n}\right)^{1/2} f_{n-1}^{\dagger}(x) + \frac{1}{n h_n}\,K\!\left(\frac{x - X_n}{h_n}\right),$$
where the interpretation of $K$ and $h_n$ is as above. Finally, we note that the adaptive
mixtures approach described in Section 6.1 is also a recursive formulation.
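Both recursive density estimators are one-line updates on a fixed evaluation grid. The sketch below uses a Gaussian kernel and the bandwidth sequence h_n = n^(-1/5); these choices, and the grid, are ours, made purely for illustration.

```python
import numpy as np

grid = np.linspace(-4, 4, 201)                               # fixed evaluation grid
K = lambda u: np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)     # Gaussian kernel

f_star = np.zeros_like(grid)    # Wolverton-Wagner / Yamato recursive estimate
f_dag = np.zeros_like(grid)     # Wegman-Davies recursive estimate
h_prev = None

rng = np.random.default_rng(7)
for n, x in enumerate(rng.normal(size=2000), start=1):
    h = n ** (-1 / 5)                                        # illustrative bandwidths
    kern = K((grid - x) / h) / (n * h)
    # f*_n = ((n-1)/n) f*_{n-1} + (1/(n h_n)) K((x - X_n)/h_n)
    f_star = (n - 1) / n * f_star + kern
    # f'_n = ((n-1)/n) (h_{n-1}/h_n)^(1/2) f'_{n-1} + (1/(n h_n)) K((x - X_n)/h_n)
    ratio = 1.0 if h_prev is None else np.sqrt(h_prev / h)
    f_dag = (n - 1) / n * ratio * f_dag + kern
    h_prev = h

dx = grid[1] - grid[0]
print(f_star.sum() * dx, f_dag.sum() * dx)   # crude numerical integral checks
```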
The difficulty with all of these procedures is that they do not discount old data. In fact,
both $1/n$ and $1/(nh_n)$ converge to 0 so that new data rather than old data is discounted.
The exponential smoother has traditionally been used to discount older data. The general
formulation is
$$Y_t = (1-\theta)\sum_{i=0}^{\infty} \theta^{i}\, X_{t-i}^{k}, \qquad 0 < \theta < 1,$$
which can be computed recursively as
$$Y_t = \theta\, Y_{t-1} + (1-\theta)\, X_t^{k}.$$
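The recursive form is what one would actually code in a streaming setting. In the toy sketch below (our own example) the smoother tracks the second moment (k = 2) of a stream whose variance shifts halfway through; the value of theta and the synthetic stream are assumptions.

```python
import numpy as np

def exponential_smoother(stream, theta=0.95, k=2):
    """Yield Y_t = theta * Y_{t-1} + (1 - theta) * x_t**k for each x_t,
    so older observations are discounted geometrically (illustrative sketch)."""
    y = 0.0
    for x in stream:
        y = theta * y + (1 - theta) * x ** k
        yield y

# Toy stream whose variance jumps from 1 to 9 halfway through; the smoothed
# second moment tracks the change and forgets the earlier regime.
rng = np.random.default_rng(8)
stream = np.concatenate([rng.normal(0, 1, 500), rng.normal(0, 3, 500)])
smoothed = list(exponential_smoother(stream))
print(round(smoothed[499], 2), round(smoothed[-1], 2))   # roughly 1, then roughly 9
```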
Fig. 20. Waterfall diagram of source port as a function of time. The diagonal lines exhibit different slopes, characteristic of different operating systems.