Handbook of Statistics 24: Data Mining and Data Visualization
Edited by C.R. Rao, E.J. Wegman and J.L. Solka

Preface

It has long been a philosophical theme that statisticians ought to be data centric as opposed to methodology centric. Throughout the history of the statistical discipline, the most innovative methodological advances have come when brilliant individuals have wrestled with new data structures. Inferential statistics, linear models, sequential analysis, nonparametric statistics, robust statistical methods, and exploratory data analysis have all come about by a focus on a puzzling new data structure. The computer revolution has brought forth a myriad of new data structures for researchers to contend with, including massive datasets, high-dimensional datasets, opportunistically collected datasets, image data, text data, genomic and proteomic data, and a host of other data challenges that could not be dealt with without modern computing resources.

This volume presents a collection of chapters that focus on data; in our words, it is data-centric. Data mining and data visualization are both attempts to handle nonstandard statistical data, that is, data that do not satisfy traditional assumptions of independence, stationarity, identical distribution, or parametric formulations. We believe it is desirable for statisticians to embrace such data and bring innovative perspectives to these emerging data types.

This volume is conceptually divided into three sections. The first focuses on aspects of data mining, the second on statistical and related analytical methods applicable to data mining, and the third on data visualization methods appropriate to data mining. In Chapter 1, Wegman and Solka present an overview of data mining, including both statistical and computer science-based perspectives. We call attention to their description of the emerging field of massive streaming datasets. Kaufman and Michalski approach data mining from a machine learning perspective and emphasize computational intelligence and knowledge mining. Marchette describes exciting methods for mining computer security data, with the important application to cybersecurity. Martinez turns our attention to mining of text data and some approaches to feature extraction from text data. Solka et al. also focus on text mining, applying these methods to cross-corpus discovery; they describe methods and software for discovering subtle but significant associations between two corpora covering disparate fields. Finally, Duric et al. round out the data mining methods with a discussion of information hiding known as steganography.

The second section, on statistical and related methods applicable to data mining, begins with Rao's description of methods applicable to dimension reduction and graphical representation. Hand presents an overview of methods of statistical pattern recognition, while Scott and Sain present an update of Scott's seminal 1992 book on multivariate density estimation. Hubert et al., in turn, describe the difficult problem of analytically determining multivariate outliers and their impact on robustness. Sutton describes recent developments in classification and regression trees, especially the concepts of bagging and boosting. Marchette et al. describe some new computationally effective classification tools, and, finally, Said gives an overview of genetic algorithms.

The final section, on data visualization, begins with a description of rotations (grand-tour methods) for high-dimensional visualization by Buja et al. This is followed by Carr's description of templates and software for showing statistical summaries, perhaps the most novel of current approaches to visual data presentation. Wilhelm describes in depth a framework for interactive statistical graphics. Finally, Chen describes a computer scientist's approach to data visualization coupled with virtual reality.

The editors sincerely hope that this combination of philosophical approaches and technical descriptions stimulates, and perhaps even irritates, our readers to encourage them to think deeply and with innovation about these emerging data structures and develop even better approaches to enrich our discipline.

C.R. Rao
E.J. Wegman
J.L. Solka
Table of contents

Preface v
Contributors xiii

Ch. 1. Statistical Data Mining 1


Edward J. Wegman and Jeffrey L. Solka

1. Introduction 1
2. Computational complexity 2
3. The computer science roots of data mining 9
4. Data preparation 14
5. Databases 19
6. Statistical methods for data mining 21
7. Visual data mining 29
8. Streaming data 37
9. A final word 44
Acknowledgements 44
References 44

Ch. 2. From Data Mining to Knowledge Mining 47


Kenneth A. Kaufman and Ryszard S. Michalski

1. Introduction 47
2. Knowledge generation operators 49
3. Strong patterns vs. complete and consistent rules 60
4. Ruleset visualization via concept association graphs 62
5. Integration of knowledge generation operators 66
6. Summary 69
Acknowledgements 70
References 71

Ch. 3. Mining Computer Security Data 77


David J. Marchette

1. Introduction 77
2. Basic TCP/IP 78


3. The threat 84
4. Network monitoring 92
5. TCP sessions 97
6. Signatures versus anomalies 101
7. User profiling 102
8. Program profiling 104
9. Conclusions 107
References 107

Ch. 4. Data Mining of Text Files 109


Angel R. Martinez

1. Introduction and background 109


2. Natural language processing at the word and sentence level 110
3. Approaches beyond the word and sentence level 119
4. Summary 129
References 130

Ch. 5. Text Data Mining with Minimal Spanning Trees 133


Jeffrey L. Solka, Avory C. Bryant and Edward J. Wegman

Introduction 133
1. Approach 133
2. Results 140
3. Conclusions 167
Acknowledgements 168
References 169

Ch. 6. Information Hiding: Steganography and Steganalysis 171


Zoran Duric, Michael Jacobs and Sushil Jajodia

1. Introduction 171
2. Image formats 172
3. Steganography 174
4. Steganalysis 179
5. Relationship of steganography to watermarking 181
6. Literature survey 184
7. Conclusions 186
References 186

Ch. 7. Canonical Variate Analysis and Related Methods for Reduction of


Dimensionality and Graphical Representation 189
C. Radhakrishna Rao

1. Introduction 189
2. Canonical coordinates 190
3. Principal component analysis 197

4. Two-way contingency tables (correspondence analysis) 201


5. Discussion 209
References 210

Ch. 8. Pattern Recognition 213


David J. Hand

1. Background 213
2. Basics 214
3. Practical classification rules 216
4. Other issues 226
5. Further reading 227
References 227

Ch. 9. Multidimensional Density Estimation 229


David W. Scott and Stephan R. Sain

1. Introduction 229
2. Classical density estimators 230
3. Kernel estimators 239
4. Mixture density estimation 248
5. Visualization of densities 252
6. Discussion 258
References 258

Ch. 10. Multivariate Outlier Detection and Robustness 263


Mia Hubert, Peter J. Rousseeuw and Stefan Van Aelst

1. Introduction 263
2. Multivariate location and scatter 264
3. Multiple regression 272
4. Multivariate regression 278
5. Classification 282
6. Principal component analysis 283
7. Principal component regression 292
8. Partial Least Squares Regression 296
9. Some other multivariate frameworks 297
10. Availability 297
Acknowledgements 300
References 300

Ch. 11. Classification and Regression Trees, Bagging, and Boosting 303
Clifton D. Sutton

1. Introduction 303
2. Using CART to create a classification tree 306
3. Using CART to create a regression tree 315
4. Other issues pertaining to CART 316

5. Bagging 317
6. Boosting 323
References 327

Ch. 12. Fast Algorithms for Classification Using Class Cover Catch Digraphs 331
David J. Marchette, Edward J. Wegman and Carey E. Priebe
1. Introduction 331
2. Class cover catch digraphs 332
3. CCCD for classification 334
4. Cluster catch digraph 338
5. Fast algorithms 340
6. Further enhancements 343
7. Streaming data 344
8. Examples using the fast algorithms 346
9. Sloan Digital Sky Survey 351
10. Text processing 355
11. Discussion 357
Acknowledgements 357
References 358

Ch. 13. On Genetic Algorithms and their Applications 359


Yasmin H. Said
1. Introduction 359
2. History 360
3. Genetic algorithms 361
4. Generalized penalty methods 372
5. Mathematical underpinnings 378
6. Techniques for attaining optimization 381
7. Closing remarks 386
Acknowledgements 387
References 387

Ch. 14. Computational Methods for High-Dimensional Rotations in Data


Visualization 391
Andreas Buja, Dianne Cook, Daniel Asimov and Catherine Hurley
1. Introduction 391
2. Tools for constructing plane and frame interpolations: orthonormal frames and planar rotations 397
3. Interpolating paths of planes 403
4. Interpolating paths of frames 406
5. Conclusions 411
References 412

Ch. 15. Some Recent Graphics Templates and Software for Showing Statistical
Summaries 415
Daniel B. Carr
1. Introduction 415
2. Background for quantitative graphics design 417

3. The template for linked micromap (LM) plots 420


4. Dynamically conditioned choropleth maps 426
5. Self-similar coordinates plots 431
6. Closing remarks 434
Acknowledgements 435
References 435

Ch. 16. Interactive Statistical Graphics: the Paradigm of Linked Views 437
Adalbert Wilhelm

1. Graphics, statistics and the computer 437


2. The interactive paradigm 444
3. Data displays 446
4. Direct object manipulation 467
5. Selection 469
6. Interaction at the frame level 475
7. Interaction at the type level 476
8. Interactions at the model level 483
9. Interaction at sample population level 486
10. Indirect object manipulation 487
11. Internal linking structures 488
12. Querying 494
13. External linking structure 496
14. Linking frames 498
15. Linking types 499
16. Linking models 499
17. Linking sample populations 503
18. Visualization of linked highlighting 505
19. Visualization of grouping 508
20. Linking interrogation 508
21. Bi-directional linking in the trace plot 509
22. Multivariate graphical data analysis using linked low-dimensional views 509
23. Conditional probabilities 511
24. Detecting outliers 518
25. Clustering and classification 519
26. Geometric structure 520
27. Relationships 522
28. Conclusion 532
Future work 533
References 534

Ch. 17. Data Visualization and Virtual Reality 539


Jim X. Chen

1. Introduction 539
2. Computer graphics 539
3. Graphics software tools 543
4. Data visualization 547
5. Virtual reality 556

6. Some examples of visualization using VR 560


References 561

Colour Figures 565

Subject Index 609

Contents of Previous Volumes 619


Contributors

Asimov, Daniel, Department of Mathematics, University of California, Berkeley, CA 94720 USA; e-mail: [email protected] (Ch. 14).
Bryant, Avory C., Code B10, Naval Surface Warfare Center, DD, Dahlgren, VA 22448 USA; e-mail: [email protected] (Ch. 5).
Buja, Andreas, The Wharton School, University of Pennsylvania, 471 Huntsman Hall, Philadelphia, PA 19104-6302 USA; e-mail: [email protected] (Ch. 14).
Carr, Daniel B., Department of Applied and Engineering Statistics, MS 4A7, George Mason University, 4400 University Drive, Fairfax, VA 22030-4444 USA; e-mail: [email protected] (Ch. 15).
Chen, Jim X., Department of Computer Science, MS 4A5, George Mason University, 4400 University Drive, Fairfax, VA 22030-4444 USA; e-mail: [email protected] (Ch. 17).
Cook, Dianne, Department of Statistics, Iowa State University, Ames, IA 50011 USA; e-mail: [email protected] (Ch. 14).
Duric, Zoran, Center for Secure Information Systems, MS 4A4, George Mason University, 4400 University Drive, Fairfax, VA 22030-4444 USA; e-mail: [email protected] (Ch. 6).
Hand, David J., Department of Mathematics, The Huxley Building, Imperial College London, 180 Queen's Gate, London SW7 2BZ, UK; e-mail: [email protected] (Ch. 8).
Hubert, Mia, Department of Mathematics, Katholieke Universiteit Leuven, W. de Croylaan 54, B-3001 Leuven, Belgium; e-mail: [email protected] (Ch. 10).
Hurley, Catherine, Mathematics Department, National University of Ireland, Maynooth, Co. Kildare, Ireland; e-mail: [email protected] (Ch. 14).
Jacobs, Michael, Center for Secure Information Systems, MS 4A4, George Mason University, 4400 University Drive, Fairfax, VA 22030-4444 USA; e-mail: [email protected] (Ch. 6).
Jajodia, Sushil, Center for Secure Information Systems, MS 4A4, George Mason University, 4400 University Drive, Fairfax, VA 22030-4444 USA; e-mail: [email protected] (Ch. 6).
Kaufman, Kenneth A., School of Computational Sciences, George Mason University, 4400 University Drive, Fairfax, VA 22030-4444 USA (Ch. 2).
Marchette, David J., Code B10, Naval Surface Warfare Center, DD, Dahlgren, VA 22448 USA; e-mail: [email protected] (Chs. 3, 12).

Martinez, Angel R., Aegis Metrics Coordinator, NAVSEA Dahlgren – N20P, 17320 Dahlgren Rd., Dahlgren, VA 22448-5100 USA; e-mail: [email protected] (Ch. 4).
Michalski, Ryszard, School of Computational Sciences, George Mason University, 4400 University Drive, Fairfax, VA 22030-4444 USA; e-mail: [email protected] (Ch. 2).
Priebe, Carey E., Department of Applied Mathematics and Statistics, Whitehead Hall Room 201, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD 21218-2682 USA; e-mail: [email protected] (Ch. 12).
Rao, C.R., 326 Thomas Building, Pennsylvania State University, University Park, PA 16802 USA; e-mail: [email protected] (Ch. 7).
Rousseeuw, Peter J., Department of Mathematics and Computer Science, University of Antwerp, Middelheimlaan 1, B-2020 Antwerpen, Belgium; e-mail: [email protected] (Ch. 10).
Said, Yasmin, School of Computational Sciences, George Mason University, 4400 University Drive, Fairfax, VA 22030-4444 USA; e-mail: [email protected], [email protected] (Ch. 13).
Sain, Stephan R., Department of Mathematics, University of Colorado at Denver, PO Box 173364, Denver, CO 80217-3364 USA; e-mail: [email protected] (Ch. 9).
Scott, David W., Department of Statistics, MS-138, Rice University, PO Box 1892, Houston, TX 77251-1892 USA; e-mail: [email protected] (Ch. 9).
Solka, Jeffrey L., Code B10, Naval Surface Warfare Center, DD, Dahlgren, VA 22448 USA; e-mail: [email protected] (Chs. 1, 5).
Sutton, Clifton D., Department of Applied and Engineering Statistics, MS 4A7, George Mason University, 4400 University Drive, Fairfax, VA 22030-4444 USA; e-mail: [email protected] (Ch. 11).
Van Aelst, Stefan, Department of Applied Mathematics and Computer Science, Ghent University, Krijgslaan 281 S9, B-9000 Ghent, Belgium; e-mail: [email protected] (Ch. 10).
Wegman, Edward J., Center for Computational Statistics, MS 4A7, George Mason University, 4400 University Drive, Fairfax, VA 22030-4444 USA; e-mail: [email protected], [email protected] (Chs. 1, 5, 12).
Wilhelm, Adalbert, School of Humanities and Social Sciences, International University Bremen, PO Box 750 561, D-28725 Bremen, Germany; e-mail: [email protected] (Ch. 16).
Handbook of Statistics, Vol. 24
ISSN: 0169-7161
© 2005 Published by Elsevier B.V.
DOI: 10.1016/S0169-7161(04)24001-9

Statistical Data Mining

Edward J. Wegman and Jeffrey L. Solka

Abstract
This paper provides an overview of data mining methodologies. We have been careful during our exposition to focus on many of the recent breakthroughs that might not be known to the majority of the community. We have also attempted to provide a brief overview of some of the techniques that we feel are particularly beneficial to the data mining process. Our exposition runs the gamut from algorithmic complexity considerations, to data preparation, databases, pattern recognition, clustering, and the relationship between statistical pattern recognition and artificial neural systems.

Keywords: dimensionality reduction; algorithmic complexity; statistical pattern recognition; clustering; artificial neural systems

1. Introduction

The phrases 'data mining' and, in particular, 'statistical data mining' have been at once a pariah for statisticians and also a darling. For many classically trained statisticians, data mining has meant the abandonment of the probabilistic roots of statistical analysis. Indeed, this is exactly the case, simply because the datasets to which data mining techniques are typically applied are opportunistically acquired and were meant originally for some other purpose, for example, administrative records or inventory control. These datasets are typically not collected according to widely accepted random sampling schemes, and hence inferences to general situations from specific datasets are not valid in the usual statistical sense. Nonetheless, data mining techniques have proven their value in the marketplace. On the other hand, there has been considerable interest in the statistics community in recent years for approaches to analyzing this new data paradigm.

The landmark paper of Tukey (1962), entitled "The future of data analysis," and later the book Exploratory Data Analysis (Tukey, 1977) set forth a new paradigm for statistical analysis. In contrast to what has come to be called confirmatory analysis, in which a statistical model is assumed and inference is made on the parameters of that model, exploratory data analysis (EDA) is predicated on the fact that we do not necessarily know that model assumptions actually hold for data under investigation. Because the data may not conform to the assumptions of the confirmatory analysis, inferences made with invalid model assumptions are subject to (potentially gross) errors. The idea then is to explore the data to verify that the model assumptions actually hold for the data in hand. It is a very short leap of logic to use exploratory techniques to discover unanticipated structure in the data. With the rise of powerful personal computing, this more aggressive form of EDA has come into vogue. EDA is no longer used simply to verify underlying model assumptions, but also to uncover unanticipated structure in the data.

Within the last decade, computer scientists operating in the framework of databases and information systems have similarly come to the conclusion that a more powerful form of data analysis could be used to exploit data residing in databases. That work has been formulated as knowledge discovery in databases (KDD) and data mining. A landmark book in this area is (Fayyad et al., 1996). The convergence of EDA from the statistical community and KDD from the computer science community has given rise to a rich if somewhat tense collaboration widely recognized as data mining.

There are many definitions of data mining. The one we prefer was given in (Wegman, 2003): data mining is an extension of exploratory data analysis and has basically the same goals, the discovery of unknown and unanticipated structure in the data. The chief distinction lies in the size and dimensionality of the data sets involved. Data mining, in general, deals with much more massive data sets for which highly interactive analysis is not fully feasible. Because scalability to massive datasets is one of the cornerstones of the data mining activity, it is worthwhile to begin our discussion of statistical data mining with a taxonomy of data set sizes and its implications for scalability and algorithmic complexity.

2. Computational complexity

2.1. Order of magnitude considerations


In Table 1 we present Peter Huber's taxonomy of data set complexity (Huber, 1992, 1994), as expanded by Wegman (1995). This taxonomy provides a feel for the magnitudes of the datasets that one is often interested in analyzing. We present a descriptor for the dataset, the size of the dataset in bytes, and the appropriate storage mode for such a dataset. Originally, the statistics community was often concerned with datasets that resided in the Small to Medium classes. Numerous application areas such as computer security and text processing have helped push the community into analyzing datasets that reside in the Large or Huge classes. Ultimately we will all be interested in the analysis of datasets in the Massive or Supermassive class. Such monstrous datasets will necessitate the development of recursive methodologies and will require the use of those approaches that are usually associated with streaming data.
Statistical data mining 3

Table 1
Huber–Wegman taxonomy of data set sizes

Descriptor      Data set size in bytes   Storage mode

Tiny            10^2                     Piece of paper
Small           10^4                     A few pieces of paper
Medium          10^6                     A floppy disk
Large           10^8                     Hard disk
Huge            10^10                    Multiple hard disks, e.g. RAID storage
Massive         10^12                    Disk farms/tape storage silos
Supermassive    10^15                    Distributed data centers

Table 2
Algorithmic complexity

Complexity            Algorithm

O(r)                  Plot a scatterplot
O(n)                  Calculate means, variances, kernel density estimates
O(n log(n))           Calculate fast Fourier transforms
O(nc)                 Calculate singular value decomposition of an r × c matrix; solve a multiple linear regression
O(nr), O(n^(3/2))     Solve a clustering algorithm with r ∝ sqrt(n)
O(n^2)                Solve a clustering algorithm with c fixed and small so that r ∝ n

Let us consider a data matrix consisting of r rows and c columns. One can calculate the total number of entries n as n = rc. In the case of higher-dimensional data, we write d = c and refer to the data as d-dimensional. There are numerous types of operations or algorithms that we would like to utilize as part of the data mining process. We would like to plot our data as a scatterplot; to compute summary statistics such as means and variances; to perform probability density estimation using the standard kernel density estimator; to apply the fast Fourier transform; to obtain the singular value decomposition of a multi-dimensional data set in order to ascertain appropriate linear subspaces that capture the nature of the data; to perform a multiple linear regression on a multi-dimensional dataset; to apply a clustering methodology with r proportional to sqrt(n); or, finally, to solve a clustering algorithm with c fixed and small so that r is proportional to n. This list of algorithms/data analysis techniques is not exhaustive, but it does represent many of the tasks that one may need to do as part of the data mining process. In Table 2 we examine algorithmic complexity as a function of these various statistical/data mining algorithms.

Now it is interesting to match the computational requirements of these various algorithms against the various dataset sizes that we discussed previously. This allows one to ascertain the computational resources necessary to have any hope of applying each of the particular algorithms to the various datasets. Table 3 details the number of operations (within a constant multiplicative factor) for algorithms of various computational complexities and various data set sizes.

Table 3
Number of operations for algorithms of various computational complexities and various data set sizes

n              n^(1/2)    n        n log(n)       n^(3/2)    n^2

Tiny           10         10^2     2 × 10^2       10^3       10^4
Small          10^2       10^4     4 × 10^4       10^6       10^8
Medium         10^3       10^6     6 × 10^6       10^9       10^12
Large          10^4       10^8     8 × 10^8       10^12      10^16
Huge           10^5       10^10    10^11          10^15      10^20
Massive        10^6       10^12    1.2 × 10^13    10^18      10^24
Supermassive   10^7.5     10^15    1.5 × 10^16    10^22.5    10^30
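
The entries in Table 3 can be reproduced in a few lines of code; the sketch below assumes, as the n log(n) column itself suggests, that log denotes the base-10 logarithm.

    import math

    # Dataset sizes in bytes from the Huber-Wegman taxonomy (Table 1).
    sizes = {"Tiny": 1e2, "Small": 1e4, "Medium": 1e6, "Large": 1e8,
             "Huge": 1e10, "Massive": 1e12, "Supermassive": 1e15}

    # Operation counts for each complexity class of Table 3.
    for name, n in sizes.items():
        ops = (n ** 0.5, n, n * math.log10(n), n ** 1.5, n ** 2)
        print(f"{name:>12}: " + "  ".join(f"{x:.2e}" for x in ops))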

2.2. Feasibility limits due to CPU performance

Next we would like to take these computational performance figures of merit and examine the possibility of executing the algorithms on some current hardware. It is difficult to know exactly which machines should be included in the list of current hardware, since machine performance capabilities are constantly changing as the push for faster CPU speeds continues. We will use four machines for our computational feasibility analysis. The first machine is a 1 GHz Pentium IV (PIV) with a sustainable performance of 1 gigaflop. The second is a hypothetical flat neighborhood network of 12 1.4 GHz Athlons with a sustainable performance of 24 gigaflops. The third is a hypothetical flat neighborhood network of 64 0.7 GHz Athlons with a sustained performance of 64 gigaflops. The fourth and final machine is a hypothetical massively distributed grid type architecture with a sustained processing speed of 1 teraflop. This list of machines runs the gamut from a relatively low end system to a state of the art high performance "supercomputer".

In Table 4 we present execution times for the hypothetical 1 GHz PIV machine with a sustained performance level of 1 gigaflop when applied to the various algorithm/dataset combinations. In Table 5 we present execution times for the hypothetical flat neighborhood network of 12 1.4 GHz Athlons with a sustained performance of 24 gigaflops. In Table 6 we present execution times for the hypothetical flat neighborhood network of 64 0.7 GHz Athlons with a sustained performance of 64 gigaflops. And finally, in Table 7 we present minimum execution times for a massively distributed grid type architecture with a sustained processing speed of 1 teraflop.

Table 4
Execution speed of the various algorithm/dataset combinations on a Pentium IV 1 GHz machine with 1 gigaflop performance assumed

n        n^(1/2)    n          n log(n)       n^(3/2)      n^2

Tiny     10^-8 s    10^-7 s    2 × 10^-7 s    10^-6 s      10^-5 s
Small    10^-7 s    10^-5 s    4 × 10^-5 s    0.001 s      0.1 s
Medium   10^-6 s    0.001 s    0.006 s        1.002 s      16.74 min
Large    10^-5 s    0.1 s      0.78 s         16.74 min    115.7 days
Huge     10^-4 s    10.02 s    1.668 min      11.57 days   3170 years

Table 5
Execution speed of the various algorithm/dataset combinations on a Flat Neighborhood Network of 12 1.4 GHz Athlons with a 24 gigaflop performance assumed

n        n^(1/2)          n               n log(n)        n^(3/2)         n^2

Tiny     4.2 × 10^-10 s   4.2 × 10^-9 s   8.3 × 10^-9 s   4.2 × 10^-8 s   4.2 × 10^-7 s
Small    4.2 × 10^-9 s    4.2 × 10^-7 s   1.7 × 10^-6 s   4.2 × 10^-5 s   4.2 × 10^-3 s
Medium   4.2 × 10^-8 s    4.2 × 10^-5 s   2.5 × 10^-4 s   4.2 × 10^-2 s   42 s
Large    4.2 × 10^-7 s    4.2 × 10^-3 s   0.03 s          42 s            4.86 days
Huge     4.2 × 10^-6 s    0.42 s          4.2 s           11.67 h         133.22 years

Table 6
Execution speed for the various algorithm/dataset combinations on a Flat Neighborhood Network of 64 700 MHz Athlons with a 64 gigaflop performance assumed

n        n^(1/2)          n               n log(n)        n^(3/2)         n^2

Tiny     1.6 × 10^-10 s   1.6 × 10^-9 s   3.1 × 10^-9 s   1.6 × 10^-8 s   1.6 × 10^-7 s
Small    1.6 × 10^-9 s    1.6 × 10^-7 s   6.3 × 10^-7 s   1.6 × 10^-5 s   1.6 × 10^-3 s
Medium   1.6 × 10^-8 s    1.6 × 10^-5 s   9.4 × 10^-5 s   1.6 × 10^-2 s   16 s
Large    1.6 × 10^-7 s    1.6 × 10^-3 s   0.01 s          16 s            1.85 days
Huge     1.6 × 10^-6 s    0.16 s          1.6 s           4.34 h          49.54 years

Table 7
Execution speed for the various algorithm/dataset combinations on a massively distributed grid type architecture with a 1 teraflop performance assumed

n        n^(1/2)        n              n log(n)       n^(3/2)       n^2

Tiny     1 × 10^-11 s   1 × 10^-10 s   2 × 10^-10 s   1 × 10^-9 s   1 × 10^-8 s
Small    1 × 10^-10 s   1 × 10^-8 s    4 × 10^-8 s    1 × 10^-6 s   1 × 10^-4 s
Medium   1 × 10^-9 s    1 × 10^-6 s    6 × 10^-6 s    1 × 10^-3 s   1 s
Large    1 × 10^-8 s    1 × 10^-4 s    0.0008 s       1 s           2.78 h
Huge     1 × 10^-7 s    0.01 s         0.1 s          16.67 min     3.17 years

By way of comparison and to give a sense of scale, as of June 2004, an NEC computer in the Earth Simulator Center in Japan had achieved a speed of 35.860 teraflops in 2002 and has a theoretical maximum speed of 40.960 teraflops. The record in the United States is held by Lawrence Livermore National Laboratory with a California Digital Corporation computer that achieved 19.940 teraflops in 2004, with a theoretical peak performance of 22.038 teraflops. These records are of course highly dependent on specific computational tasks and highly optimized code and do not represent performance that would be achieved with ordinary algorithms and ordinary code. See http://www.top500.org/sublist/ for details.

Tables 4–7 give computation times as a function of the overall data set size and the algorithmic complexity, at least to a reasonable order of magnitude, for a variety of computer configurations. This is essentially an updated version of (Wegman, 1995). What is perhaps of more interest for scalability issues is a discussion of what is computationally feasible and what is feasible from an interactive computation perspective.

Table 8
Types of computers for interactive feasibility with a response time of less than one second (PC = Pentium IV class machine, FNN24 = 24 gigaflop flat neighborhood network, TFC = 1 teraflop computer)

n        n^(1/2)   n       n log(n)   n^(3/2)   n^2

Tiny     PC        PC      PC         PC        PC
Small    PC        PC      PC         PC        PC
Medium   PC        PC      PC         FNN24     TFC
Large    PC        PC      PC         TFC       –
Huge     PC        FNN24   TFC        –         –

Table 9
Types of computers for computational feasibility with a response time of less than one week

n        n^(1/2)   n       n log(n)   n^(3/2)   n^2

Tiny     PC        PC      PC         PC        PC
Small    PC        PC      PC         PC        PC
Medium   PC        PC      PC         PC        PC
Large    PC        PC      PC         PC        FNN24
Huge     PC        PC      PC         FNN24     –

We consider a procedure to be computationally feasible if the execution time is less than one week. Of course, this would imply that there is essentially no human interaction with the computation process once it has begun. A procedure is thought to be feasible in an interactive mode if the computation time is less than one second. In Table 8 we present the computational resources needed for a response time of less than one second, i.e. the resources needed for interactive feasibility. In Table 9 we present the computational resources needed for a response time of less than one week, that is, the resources needed for computational feasibility.
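
These two thresholds are easy to mechanize. The following sketch (the sustained speeds are the hypothetical figures used above, and the classification labels are ours) computes execution time as operation count divided by floating-point operations per second; it reproduces, for example, the infeasibility of O(n^2) clustering on a huge dataset on all four machines.

    SECONDS_PER_WEEK = 7 * 24 * 3600

    def feasibility(ops, flops):
        """Classify an algorithm/machine pair by its execution time in seconds."""
        t = ops / flops
        if t < 1:
            return "interactively feasible"
        if t < SECONDS_PER_WEEK:
            return "computationally feasible"
        return "infeasible"

    # Hypothetical sustained speeds of the four machines discussed above.
    machines = {"PC": 1e9, "FNN24": 24e9, "FNN64": 64e9, "TFC": 1e12}

    n = 1e10                                       # a huge dataset
    for name, flops in machines.items():
        print(name, feasibility(n ** 2, flops))    # O(n^2) clustering
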
It is clear from Table 8 that interactive feasibility is possible for order O(n) algorithms on relatively simple computational resources, but that for order O(n^2) algorithms, feasibility begins to disappear already for medium datasets, i.e. a million bytes. Note that although polynomial-time algorithms are often regarded as feasible, the combination of dataset size and algorithmic complexity severely limits practical interactive feasibility. Algorithms such as the minimum volume ellipsoid for determining multivariate outliers are exponentially complex and are certainly out of the range of feasibility except possibly for tiny datasets.

Table 9 suggests that even for settings in which we can be satisfied with computational feasibility, by the time we consider order O(n^2) algorithms and huge (10^10) datasets, we are already out of the realm of computational feasibility. Even the world's fastest supercomputers mentioned above are capable of only scalar multiples of our hypothetical teraflop computer, and of course, it is highly unlikely that dedicated access for a full week would be given to such an algorithm, e.g. clustering. The message is clear: in order to be scalable to massive or larger datasets, data mining algorithms must be of order no more than O(n log(n)) and preferably of order O(n). That having been said, although data mining is often justified on the basis of dealing with huge, massive or larger datasets, it is often the case that exemplars of data mining methodology are applied to medium or smaller datasets, i.e. falling in the range of more traditional EDA applications.

2.3. Feasibility limits due to file transfer performance


File transfer speed is one of the key issues in the feasibility of dealing with large datasets. In (Wegman, 1995), network transfer was thought to be a major factor blocking work with massive datasets. With the innovation of gigabit networks and the National LambdaRail (see http://www.nlr.net/), the major infrastructure backbone for transport of massive datasets seems to be emerging. In Table 10 we present transfer rates associated with several of our dataset sizes for various types of devices. The National LambdaRail technology promises a transfer rate of 10 gigabits per second, which translates into a massive dataset moving over the backbone in 13.33 min. This is certainly feasible. Similarly, in a local computer with a 4 GHz clock, data can move from cache memory to CPU in 4.17 min. The difficulty lies not in the two ends of the data transfer chain, but in the intermediate steps. Read times from a hard drive have improved considerably, but read times even for the fastest disk arrays are still on the order of hours for massive datasets, and with local fast Ethernet operating at 100 megabits per second, transfer of a massive dataset would take almost a whole day. This assumes there is no contention for resources on the fast Ethernet, which is not normally the case.

The conclusion is that with the most esoteric technology in place, dealing with huge and massive datasets may become feasible in the next few years. However, with current end-to-end technology, datasets on the order of 10^8 bytes, i.e. what we have called large datasets, are probably the upper limit that can be reasonably dealt with. It is notable that while current storage capabilities (hard drives) have improved according to Moore's Law at a rate faster than CPU improvements, their data transfer speed has not improved nearly as dramatically.

It is also worth noting that in Table 10 we assume a cache transfer according to a 4 GHz clock. However, data transfer from and to peripheral devices such as hard drives and NIC cards is often at a dramatically slower clock speed. For example, high-speed USB connections operate at 480 megabits per second, although standard full speed USB connections operate at 12 megabits per second. This is in contrast to a theoretical LambdaRail transfer rate of 10 gigabits per second. So clearly the bottlenecks are related to the local network and to data transfer within devices internal to the local computer.

Table 10
Transfer rates for a variety of data transfer regimes

n         Standard Ethernet   Fast Ethernet     Cache transfer   National LambdaRail
          10 Mb/s             100 Mb/s          at 4 GHz         at 10 Gb/s
          1.25 × 10^6 B/s     1.25 × 10^7 B/s   4 × 10^9 B/s     1.25 × 10^9 B/s

Tiny      8 × 10^-5 s         8 × 10^-6 s       2.5 × 10^-8 s    8 × 10^-8 s
Small     8 × 10^-3 s         8 × 10^-4 s       2.5 × 10^-6 s    8 × 10^-6 s
Medium    8 × 10^-1 s         8 × 10^-2 s       2.5 × 10^-4 s    8 × 10^-4 s
Large     1.3 min             8 s               2.5 × 10^-2 s    8 × 10^-2 s
Huge      2.22 h              13.33 min         2.50 s           8 s
Massive   9.26 days           22.22 h           4.17 min         13.33 min
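
The entries of Table 10 are simple divisions of dataset size by channel bandwidth expressed in bytes per second; a sketch for the massive (10^12 byte) case:

    # Channel rates in bytes per second, as in Table 10.
    rates = {"Standard Ethernet (10 Mb/s)": 1.25e6,
             "Fast Ethernet (100 Mb/s)": 1.25e7,
             "Cache transfer at 4 GHz": 4e9,
             "National LambdaRail (10 Gb/s)": 1.25e9}

    massive = 1e12  # bytes
    for name, rate in rates.items():
        minutes = massive / rate / 60
        print(f"{name}: {minutes:,.2f} minutes")  # LambdaRail gives 13.33 min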

2.4. Feasibility limits due to visual resolution


Perhaps the most interesting and common methodology for exploratory data analysis is based on graphical methods and data visualization. One immediately comes to the issue of how scalable graphical methods are. This question leads naturally to the issue of how acute human vision is. Consider the following thought experiment. Suppose by some immensely clever technique we could map a multi-dimensional observation into a single pixel. The question of how much data we could see becomes how many pixels can we see?

The answer to this question of course depends on the ability of the eye to resolve small objects and the ability of the display device to display them. It is most convenient to think in terms of angular resolution. It is a matter of simple trigonometry to calculate the angle that a display device subtends, based on the size of the display device and the distance of the viewer from it. Several hypothetical scenarios are given in Table 11: watching a 19 inch monitor at 24 inches, watching a 25 inch television at 12 feet, watching a 15 foot home theater screen at 20 feet, and, perhaps most optimistic, being in an immersive (virtual reality) environment. In the latter scenario, the field of view is approximately 140°. One can test this empirically by spreading one's arms straight out parallel to the plane of the body and then slowly bringing them forward until they appear in one's peripheral vision.

Of course, once the view angle is determined, it need simply be divided by the (horizontal) angular resolution of the human eye to obtain the number of pixels across that can be seen. Experts differ on the angular resolution: Valyus (1962) claims 5 seconds of arc, Wegman (1995) puts the angular resolution at 3.6 minutes of arc, while Maar (1982) puts it at 4.38 minutes of arc. The Valyus figure seems to be based on the minimal size of an object that can be seen at all with the naked eye; too small an object will not yield enough photons to stimulate the retinal cells. The Wegman and Maar figures are instead based on the minimal angular resolution that can distinguish two adjacent features, which is more realistic for describing how many pixels can be resolved. Maar asserts that two objects must be separated by approximately nine foveal cones to be resolvable, thus 0.486 × 9 = 4.38. Because the angular separation between two foveal cones is approximately one minute of arc (2 × 0.486), we include this angle in Table 11.

Table 11
Resolvable number of pixels across screen for several viewing scenarios

                                            19 inch monitor   25 inch TV   15 foot screen   Immersion
                                            at 24 inches      at 12 feet   at 20 feet

Angle                                       39.005°           9.922°       41.112°          140°
5 seconds of arc resolution (Valyus)        28 084            7144         29 601           100 800
1 minute of arc resolution                  2340              595          2467             8400
3.6 minutes of arc resolution (Wegman)      650               165          685              2333
4.38 minutes of arc resolution (Maar 1)     534               136          563              1918
0.486 minutes of arc/foveal cone (Maar 2)   4815              1225         5076             17 284

The standard aspect ratio for computer monitors and standard NTSC television is 4 : 3, width to height. If we take the Wegman angular resolution in an immersive setting, i.e. 2333 pixels of horizontal resolution, then the vertical resolution would be approximately 1750 pixels, for a total of 4.08 × 10^6 resolvable pixels. Notice that taking the high definition TV aspect ratio of 16 : 9 would actually yield fewer resolvable pixels. Even if we took the most optimistic resolution of one minute of arc (implying each pixel falls on a single foveal cone) in an immersive setting, the horizontal resolution would be 8400 pixels and the vertical resolution would be 6300 pixels, yielding a total of 5.29 × 10^7 resolvable pixels. Thus, as far as using graphical data mining techniques is concerned, there would seem to be an insurmountable upper bound of around 10^6 to 10^7 data points, i.e. somewhere between medium and large datasets. Interestingly enough, this coincides with the approximate number of cones in the retina. According to Osterberg (1935), there are approximately 6.4 million cones in the retina and somewhere around 110 to 120 million rods.
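
The trigonometry underlying Table 11 is easily verified. In the sketch below, the screen width is our assumption (the table itself does not state one); a display roughly 17 inches wide viewed at 24 inches reproduces the 39.005° angle in the first column.

    import math

    def resolvable_pixels(width, distance, resolution_arcmin):
        """Pixels across the screen: subtended angle divided by eye resolution."""
        angle_deg = 2 * math.degrees(math.atan((width / 2) / distance))
        return angle_deg * 60 / resolution_arcmin

    # Monitor about 17 inches wide at 24 inches, Wegman's 3.6 minutes of arc:
    print(resolvable_pixels(17, 24, 3.6))   # about 650 pixels, as in Table 11

    # Immersive 140 degree field of view at one minute of arc resolution:
    print(140 * 60 / 1.0)                   # 8400 pixels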

3. The computer science roots of data mining

As we have pointed out before, data mining in some sense flows from a conjunction of both computer science and statistical frameworks. The development of relational database theory and the structured query language (SQL) among information systems specialists allowed for logical queries to databases and the subsequent exploitation of knowledge in the databases. However, being limited to logical queries (and, or, not) was a handicap, and the desire to exploit more numerical and even statistically oriented queries led to the early development of data mining. Simultaneously, the early exploitation of supercomputers for physical system modeling using partial differential equations had run its course, and by the early 1990s supercomputer manufacturers were looking for additional marketing outlets. The exploitation of large scale commercial databases was a natural application of supercomputer technology. So there was both an academic pull and a commercial push to develop data mining in the context of computer science. In later sections, we describe relational databases and SQL.

As discussed above, data mining is often defined in terms of approaches that can deal with large to massive data set sizes. An important implication of this definition is that analysis almost by definition has to be automated, so that interactive approaches and approaches that exploit very complex algorithms are prohibited in a data mining framework.

3.1. Knowledge discovery in databases and data mining


Knowledge discovery in databases consists of identifying those patterns or models that meet the goals of the knowledge discovery process. So a knowledge discovery engine needs the ability to measure the validity of a discovered pattern, the utility of the pattern, the simplicity or complexity of the pattern, and the novelty of the pattern. These "metrics" help to identify the degree of "interestingness" of a discovered pattern. Data mining itself can be defined as a step in the knowledge discovery process consisting of particular algorithms (methods) that, under some acceptable objective, produce a particular enumeration of patterns (models) over the data. The knowledge discovery process can be defined as the process of using data mining methods (algorithms) to extract (identify) what is deemed knowledge according to the specifications of measures and thresholds, using a database along with any necessary preprocessing or transformations.

The steps in the data mining process are usually described as follows. First, an understanding of the application domain must be obtained, including relevant prior domain knowledge, problem objectives, success criteria, current solutions, inventory of resources, constraints, terminology, and costs and benefits. The next step focuses on the creation of a target dataset. This step might involve an initial dataset collection, producing an adequate description of the data, verifying the data quality, and focusing on a subset of possible measured variables. Following this is data cleaning and preprocessing. In this step the data is denoised, outliers are removed from the data, missing fields are handled, time sequence information is obtained, known trends are removed from the data, and any needed data integration is performed. Next, data reduction and projection is performed. This step consists of feature subset selection, feature construction, feature discretization, and feature aggregation. Finally, one must identify the purpose of the data mining effort, such as classification, segmentation, deviation detection, or link analysis. The appropriate data mining approaches must be identified and used to extract patterns or models from the data. These models then need to be interpreted and evaluated, and finally the discovered knowledge must be consolidated.

Figure 1 portrays an in-depth flow chart for the data mining process. It might be surprising to note that the vast majority of effort associated with this process is focused on the data preparation portion of the process. The amount of effort associated with data collection, cleaning, standardization, etc., can be somewhat daunting and actually far outweighs the effort associated with the rest of the data analysis.

Fig. 1. An in-depth data mining flowchart.

Our current electronic age has allowed for the easy production of copious amounts of data that can be subjected to analysis. Each time an individual sets out on a trip, he/she generates a paper trail of credit card receipts, hotel and flight reservation information, and cell phone call logs. If one also includes website utilization preferences, then one can begin to create a unique "electronic fingerprint" for the individual. Once the information is collected, it must be stored in some sort of data repository. Historical precedent indicates that data was initially stored and manipulated as flat files without any sort of associated database structure. There are some data analysis purists who might insist that this is still the best strategy. Current data mining trends have focused on the use of relational databases as a convenient storage facility. These databases or data warehouses provide the data analyst with a ready supply of raw material for use in knowledge discovery. The discovery of interesting patterns within the data provides evidence as to the utility of the collected data. In addition, a well thought-out data warehouse could provide a convenient framework for integration of the knowledge discovery process into the organization.

The application areas of interest to the data mining community have been driven by the business community. Market-based analysis, which is an example of tool-based machine learning, has been one of the prominent applications of interest. In this case, one can analyze either the customers, in order to discern what a customer purchases and thereby gain insight into the psychological motivations for purchasing strategies, or the products actually purchased. Product-based analysis is usually referred to as market basket analysis. This type of analysis gives insight into the merchandise by revealing those products that tend to be purchased together and those that are most amenable to purchase.

Some of the applications of these types of market-based analysis include focused mailing in direct/email marketing, fraud detection, warranty claims analysis, department store floor/shelf layout, catalog design, segmentation based on transaction patterns, and performance comparison between stores. Some of the questions that might be pertinent to floor/shelf layout include the following. Where should detergents be placed in the store in order to maximize their sales? Are window cleaning products purchased when detergents and orange juice are bought together? Is soda purchased with bananas? Does the brand of the soda make a difference? How are the demographics of the neighborhood affecting what customers are buying?

3.2. Association rules


The general market-based analysis problem may be stated mathematically as follows. Given a database of transactions in which each transaction contains a set of items, find all rules X → Y that correlate the presence of one set of items X with another set of items Y. One example of such a rule is: when a customer buys bread and butter, they buy milk 85% of the time. Mined association rules run the gamut from the trivial (customers who purchase large appliances are likely to purchase maintenance agreements), to the useful (on Friday afternoons, convenience store customers often purchase diapers and beer together), to the inexplicable (when a new superstore opens, one of the most commonly sold items is light bulbs).

Table 12
Co-occurrence of products

                 OJ   Window cleaner   Milk   Soda   Detergent

OJ               4    1                1      2      1
Window cleaner   1    2                1      1      0
Milk             1    1                1      0      0
Soda             2    1                0      3      1
Detergent        1    0                0      1      2

The creation of association rules often proceeds from the analysis of grocery point-of-sale transactions. Table 12 provides a hypothetical co-occurrence of products matrix. A cursory examination of the co-occurrence table suggests some simple patterns that might be resident in the data. First we note that orange juice and soda are more likely to be purchased together than any other two items. Next we note that detergent is never purchased with window cleaner or milk. Finally we note that milk is never purchased with soda or detergent. These simple observations are examples of associations and may suggest a formal rule like: if a customer purchases soda, then the customer does not purchase milk.

In the data, two of the five transactions include both soda and orange juice. These two transactions support the rule; the support for the rule is two out of five, or 40%. In general, of course, data subject to data mining algorithms is usually collected for some administrative purpose other than market research. Consequently, these data cannot be considered a random sample, and probability statements and confirmatory inference procedures in the usual statistical sense may not be associated with such data.

This caveat aside, it is useful to understand what might be the probabilistic underpinnings of the association rules. The support of a product corresponds to the unconditional probability, P(A), that the product is purchased. The support for a pair of products corresponds to the unconditional probability, P(A ∩ B), that both occur simultaneously. Because two of the three transactions that contain soda also contain orange juice, there is 67% confidence in the rule 'If soda, then orange juice.' The confidence corresponds to the conditional probability P(A | B) = P(A ∩ B)/P(B).

Typically, the data miner would require a rule to have some minimum user-specified confidence. The rule 1 & 2 → 3 has 90% confidence if, when a customer bought 1 and 2, in 90% of the cases the customer also bought 3. A rule must also have some minimum user-specified support; by this we mean that the rule 1 & 2 → 3 should hold in some minimum percentage of transactions to have value.

Consider the simple transaction data in Table 13, and suppose we require for the rule 1 → 3 a minimum support of 50% (two transactions) and a minimum confidence of 50%. Table 14 provides a simple frequency count for each item set. The rule 1 → 3 has a support of 50% and a confidence given by Support({1, 3})/Support({1}) = 66%.

Table 13
Simple transaction table

Transaction ID   Items

1                {1, 2, 3}
2                {1, 3}
3                {1, 4}
4                {2, 5, 6}

Table 14
Simple item frequency table

Item set   Support

{1}        75%
{2}        50%
{3}        50%
{4}        25%
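
A minimal sketch of the support and confidence computations over the transactions of Table 13 (the helper names are ours, not part of any standard library):

    # The four transactions of Table 13, with items coded as integers.
    transactions = [{1, 2, 3}, {1, 3}, {1, 4}, {2, 5, 6}]

    def support(itemset):
        """Fraction of transactions containing every item of the itemset."""
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(antecedent, consequent):
        """Support of the combined itemset divided by support of the antecedent."""
        return support(antecedent | consequent) / support(antecedent)

    print(support({1, 3}))        # 0.50 -> the 50% support of the rule 1 -> 3
    print(confidence({1}, {3}))   # 0.67 -> the 66% confidence quoted above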

With this sort of tool in hand, one can proceed with the identification of interesting associations between various products. For example, one might search for all of the rules that contain "Diet coke" in the result; the results of this analysis might help the store better boost the sales of Diet coke. Alternatively, one might wish to find all rules that have "Yogurt" in the condition; these rules may help in determining what products may be impacted if the store discontinues selling "Yogurt". As another example, one might wish to find all rules that have "Brats" in the condition and "mustard" in the result; these rules may help in determining the additional items that have to be sold together to make it highly likely that mustard will also be sold. Sometimes one may wish to qualify the analysis to identify the top k rules, for example, the top k rules that contain "Yogurt" in the result. Figure 2 presents a tree-based representation of item associations from the specific level to the general level.

Association rules may take various forms. They can be quantitative in nature; a good example of a quantitative rule is "Age[35, 40] and Married[Yes] → NumCars[2]". Association rules may involve constraints, like "Find all association rules where the prices of items are > 100 dollars." Association rules may vary temporally. For example, we may have an association rule "Diaper → Beer (1% support, 80% confidence)"; this rule can be modified to accommodate a different time slot, such as "Diaper → Beer (20% support) 7:00–9:00 pm weekdays." There can also be generalized association rules based on hierarchies over items (UPC codes); for example, the rule "clothes → footwear" may hold even if "clothes → shoes" does not.

Fig. 2. Item association granularities from general to specific.

Association rules may be represented as Bayesian networks. These provide for the efficient representation of a probability distribution as a directed acyclic graph, where the nodes represent attributes of interest, the edges represent direct causal influence between the nodes, and conditional probabilities are specified for each node given all possible values of its parent nodes.

One may actually be interested in the optimization of association rules. For example, given a rule (l < A < u) and X → Y, find values for l and u such that the rule has support greater than a certain threshold and maximizes a support, confidence, or gain. For example, suppose we have a rule "ChkBal[l, u] → DVDPlayer." We might then ascertain that choosing l = $30,000 and u = $50,000 optimizes the support, confidence, or gain of this rule.

Some of the strengths of market basket analysis are that it produces easy to understand results, it supports undirected data mining, it works on variable length data, and rules are relatively easy to compute. It should be noted that if there are n items under consideration and rules of the form "A → B" are considered, there will be C(n, 2) = n(n − 1)/2 possible association rules. Similarly, if rules of the form "A & B → C" are considered, there will be C(n, 3) possible association rules. Clearly, if all possible association rules are considered, the number grows exponentially with n. Some of the other weaknesses of market basket analysis are that it is difficult to determine the optimal number of items, it discounts rare items, and it is limited in the support that it provides.
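
The combinatorial growth is easy to see numerically (math.comb requires Python 3.8 or later):

    import math

    # Candidate rules among n items: pairs "A -> B" grow as C(n, 2),
    # triples "A & B -> C" as C(n, 3), and so on.
    for n in (10, 100, 1000):
        print(n, math.comb(n, 2), math.comb(n, 3))
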
The other computer science area that has contributed significantly to the roots of data mining is text classification. Text classification has become a particularly important topic given the plethora of readily available information on the World Wide Web. Several other chapters in this volume discuss text mining as another aspect of data mining.

A discussion of the historical roots of data mining methodologies within the computer science community would not be complete without touching upon terminology. One usually starts the data analysis process with a set of n observations in p space. What are usually referred to as dimensions in the mathematical community are known as variables in statistics and attributes in computer science. Observations, i.e. the rows of the data matrix, are referred to as cases in the statistics community and records in the computer science community. Unsupervised learning in computer science is known as clustering in statistics. Supervised learning in computer science is known as classification or discriminant analysis in the statistics community. It is worthwhile to note that statistical pattern recognition usually refers to both clustering and classification.

4. Data preparation

Much time, effort, and money is usually associated with the actual collection of data
prior to data mining analysis. We will assume for the discussions below that the data
has already been collected and that we will merely have to obtain the data from its
storage facility prior to conducting our analysis. With the data in hand, the first step of
the analysis procedure is data preparation. Our experiences seem to indicate that the data
preparation phase may require up to 60% of the effort associated with a given project.
In fact, data preparation can often determine the success or failure of our analytical
efforts.
Some of the issues associated with data preparation include data cleaning/data
quality assurance, identification of appropriate data type (continuous or categorical),
handling of missing data points, identifying and dealing with outliers, dimensionality
reduction, standardization, quantization, and potentially subsampling. We will briefly
examine each of these issues in turn. First we note that data might be of such a quality
that it does not contain statistically significant patterns or relationships. Even if there are
meaningful patterns in the data, these patterns might be inconsistent with results ob-
tained using other data sets. Data might also have been collected in a biased manner or, since in many cases the data are based on human respondents, may be of uneven
quality. We finally note that one has to be careful that the discovered patterns are not
too specific or too general for the application at hand.
Even when the researcher is presented with meaningful information, one still often
must remove noise from the dataset. This noise can take the form of faulty instru-
ment/sensor readings, transmission errors, data entry errors, technology limitations, or
misused naming conventions. In many cases, numerous variables are stored in the data-
base that may have nothing whatsoever to do with the particular task at hand. In this
situation, one must be willing to identify those variables that are germane to the current
analysis while ignoring or destroying the other confounding information.
Some of the other issues associated with data preparation include duplicate data re-
moval, missing value imputation (manually or statistical), identification and removal of
data inconsistencies, identification and refreshment of stale or untimely data, and the
creation of a unique record or case id. In many cases, the human has a role in interactive
procedures that accomplish these goals.
Next we consider the distinction between continuous and categorical data. Most
statistical theory and many graphics tools have been developed for continuous data.
However, most of the data that is of particular interest to the data mining community is
categorical. Those data miners that have their roots in the computer science community
often take a set of continuous data and transform the data into categorical data such
as low, medium, and high. We will not focus on the analysis of categorical data here
but the reader is referred to Agresti (2002) for a thorough treatment of categorical data
analysis.

4.1. Missing values and outliers


One way to check for missing values in the case of categorical or continuous data is
with a missing values plot. An example missing data plot is provided in Figure 3. Each
observation is plotted as a vertical line. Missing values are plotted in black. The data
was artificially generated. The missing data plot actually is a special case of the color

Fig. 3. Missing data plot for an artificial dataset. Each observation is plotted as a vertical bar. Missing values
are plotted in black.

histogram. The color histogram first appeared in the paper on the use of parallel coor-
dinates for statistical data analysis (Wegman, 1990). This data analysis technique was
subsequently rediscovered by Minnotte and West (1998). They coined the phrase “data
image” for this “new” visualization technique in their paper.
Another important issue in data preparation is the identification and removal of outliers. While outliers are easy to detect in low dimensions, d = 1, 2, or 3, their identification in high-dimensional spaces may be more tenuous. In fact, high-dimensional outliers may not actually manifest their presence in low-dimensional projections. For example, one could imagine points uniformly distributed on a hyper-dimensional sphere of large radius with a single outlier point at the center of the sphere. The minimum volume ellipsoid (MVE) (see Poston et al., 1997) has previously been proposed as a methodology for outlier detection, but such methods are exponentially computationally complex. Methods based on the use of the Fisher information matrix and convex hull peeling are more feasible but still too complex for massive datasets.
There are even visualization strategies for the identification of outliers in high-
dimensional spaces. Marchette and Solka (2003) have proposed a method based on
an examination of the color histogram of the interpoint distance matrix. This method
provides the capability to identify outliers in extremely high-dimensional spaces for
moderately sized (fewer than 10 000 observations) datasets. In Figure 4 we present a
plot of the interpoint data image for a collection of 100 observations uniformly dis-
tributed along the surface of a 5-dimensional hypersphere along with a single outlier
point positioned at the center of the hypersphere. We have notated the outlier in the data

Fig. 4. Interpoint distance data image for a set of 100 observations sampled uniformly from the surface of a 5-dimensional hypersphere with an outlier placed at the center of the hypersphere. We have indicated the
location of the outlier by a rectangle placed at the center of the data image.

image via a square. The presence of the outlier is indicated in the data image via the
characteristic cross like structure.

4.2. Quantization
Once the missing values and outliers have been removed from the dataset, one may be
faced with the task of subsampling from the dataset in the case when there are too many
observations to be processed. Sampling from a dataset can be particularly expensive
in the case where the dataset is stored in some sort of relational database management
system. So we may be faced with squishing the dataset, reducing the number of cases, or
squashing the dataset, reducing the number of variables or dimensions associated with
the data. One normally thinks of subsampling as the standard way of squishing a dataset
but quantization or binning is an equally viable approach. Quantization has a rich history
of success in the computer science community in application areas including signal and
image processing.
Quantization does possess some useful statistical properties. For example, if the quantizer $Q$ takes the value $y_j = E[W \mid Q = y_j]$, the mean of the observations in the $j$th bin, then the quantizer can be shown to be self-consistent, i.e. $E[W \mid Q] = Q$. The reader is referred
to (Khumbah and Wegman, 2003) for a full development of the statistical properties
associated with the quantization process. See also (Braverman, 2002).
A perhaps more interesting topic is geometry-based tessellation. One needs space-
filling tessellations made up of congruent tiles that are as spherical as possible in order

Fig. 5. Tessellation of 3-space by truncated octahedra.

to perform efficient space-filling tessellation. In one dimension, one is relegated to a straight line segment and in two dimensions one may choose equilateral triangles,
squares, or hexagons. The situation becomes more interesting in three dimensions. In
this case, one can use a tetrahedron, a cube, a hexagonal prism, a rhombic dodecahe-
dron, or a truncated octahedron. We present a visualization of three-space tessellated by
truncated octahedra in Figure 5.
A brief analysis of the computational complexity of the geometry-based quantization is in order. First we note that the use of up to $10^6$ bins is both computationally and visually feasible. The index of $x_i$ in one dimension is given by $j = \operatorname{fix}[k(x_i - a)/(b - a)]$ for data in the range $[a, b]$ and $k$ bins. The computational complexity of this method is $4n + 1 = O(n)$, and the memory requirements drop to $3k$: the location of the bin, plus the number of items in the bin, plus a representer of the bin.
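A minimal sketch of this one-dimensional binning rule follows; clipping the right endpoint b into the last bin is our implementation choice, not something specified above.

```python
def bin_index(x, a, b, k):
    """Index of the bin containing x, for k equal-width bins on [a, b].
    Implements j = fix[k * (x - a) / (b - a)]; the right endpoint b is
    clipped into bin k - 1 (an implementation choice)."""
    j = int(k * (x - a) / (b - a))
    return min(j, k - 1)

# Each bin keeps a count and a representer (running mean), illustrating
# the 3k storage figure quoted above.
a, b, k = 0.0, 10.0, 5
counts = [0] * k
reps = [0.0] * k
for x in [0.3, 2.7, 9.99, 5.0, 10.0]:
    j = bin_index(x, a, b, k)
    counts[j] += 1
    reps[j] += (x - reps[j]) / counts[j]  # recursive mean as the representer
print(counts, [round(r, 2) for r in reps])
```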
In two dimensions, each hexagon is indexed by 3 parameters. The computational
complexity is 3 times the one-dimensional complexity, i.e. 12n + 3 = O(n). The com-
plexity for the square is two times the one-dimensional complexity and the storage
complexity is still 3k.
In three dimensions, there are 3 pairs of square sides and 4 pairs of hexagonal sides
on a truncated octahedron. The computational complexity of the process is still O(n),
28n + 7. The computational complexity for the cube is 12n + 3 and the storage com-
plexity is still 3k.
In summary, we present the following guidelines. First, optimality in terms of mini-
mizing distortion is obtained using the roundest polytope in d-dimensions. Second, the
complexity is always O(n) with an associated storage complexity of 3k. Third, the num-
ber of tiles grows exponentially with dimension, another manifestation of the so-called
curse of dimensionality. Fourth, for ease of implementation always use a hypercube or
d-dimensional simplex. The hypercube approach is known as a datacube in computer
science literature and is closely related to multivariate histograms in the statistical liter-
ature. Fifth, this sort of geometric approach to binning is applicable up to around 4- or
5-dimensional space. Sixth, adaptive tilings may improve the rate at which the number
of tiles grows, but probably destroys the spherical structure associated with the data. This
property relegates its use to large n but makes its use problematic in large d.

5. Databases

5.1. SQL
Knowledge discovery and data mining have many of their roots in database technology.
Relational databases and structured query language (SQL) have a 25+ year history.
However, the boolean relations (and, or, not) commonly used in relational databases
and SQL are inadequate for fully exploring data.
Relational databases and SQL are typically not well-known to the practicing statisti-
cian. SQL, pronounced “ess-que-el” is used to communicate with a database according
to certain American National Standards Institute (ANSI) standards. SQL statements
can be used to store records in a database, access these records, and retrieve the records.
Common SQL commands such as “Select”, “Insert”, “Update”, “Delete”, “Create”, and
“Drop” can be used to accomplish most of the tasks that one needs to do with a database.
Some common relational database management systems that use SQL include Oracle,
Sybase, Microsoft SQL Server, Access, Ingres, and the public license servers MySQL
and MSQL.
A relational database system contains one or more objects called tables. These tables
store the information in the database. Tables are uniquely identified by their names and
are comprised of rows and columns. The columns in the table contain the column name,
the data type and any other attribute for the columns. We, statisticians, would refer
to the columns as the variable identifiers. Rows contain the records of the database.
Statisticians would refer to the rows as cases. An example database table is given in
Table 15.
The select SQL command is one of the standard ways that data is extracted from a table. The format of the command is: select “column1” [, “column2”, etc.] from “tablename” [where “condition”]; . The arguments given in the square brackets are optional. One can select as many column names as desired or use “*” to choose all of the columns. The optional where clause is used to indicate which data values or rows should be returned. The operators that are typically used with where include = (equal), > (greater than), < (less than), >= (greater than or equal to), <= (less than or equal to), <> (not equal to), and LIKE. The LIKE operator is a pattern matching operator that supports the use of wild-card characters through the use of %. The % can appear at the start or end of a string. Some of the other handy SQL operators include create table (to create a new table), insert (to insert a row into a table), and update (to update or change those records that match specified criteria). The delete operator is used to delete designated rows or records from a table.
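As an illustration, the following sketch builds the table of Table 15 in an in-memory SQLite database and issues select statements with a where clause and the LIKE operator; the table name and the particular queries are our own inventions.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Create the table of Table 15 and insert its rows.
cur.execute("create table weather (city text, state text, high int, low int)")
rows = [("Phoenix", "Arizona", 105, 90),
        ("Tucson", "Arizona", 101, 92),
        ("Flagstaff", "Arizona", 88, 69),
        ("San Diego", "California", 77, 60),
        ("Albuquerque", "New Mexico", 80, 72)]
cur.executemany("insert into weather values (?, ?, ?, ?)", rows)

# Select with a where clause: cities with a high above 100.
print(cur.execute("select city, high from weather where high > 100").fetchall())

# The LIKE operator with the % wild card: states starting with 'New'.
print(cur.execute("select city from weather where state like 'New%'").fetchall())
```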

Table 15
Example relational database

City State High Low


Phoenix Arizona 105 90
Tucson Arizona 101 92
Flagstaff Arizona 88 69
San Diego California 77 60
Albuquerque New Mexico 80 72

Fig. 6. A datacube with various hierarchical summarization rules illustrated.


5.2. Data cubes and OLAP


Computer scientists tend to deal with relational databases, accessing them through SQL.
Statisticians tend to deal with flat, text files that are space, tab, or comma delimited. Re-
lational databases have more structure, data protection, and flexibility but these incur a
large computational overhead. Relational databases are not typically suited for massive
dataset analysis except as a means to extract a flat file.
Data cubes and online analytical processing (OLAP) are ideas that have grown out of
the database industry. These tools/ideas are most often perceived as a response to busi-
ness management. These ideas are often applied to a data warehouse. A data warehouse
is a central data repository wherein several local databases are assembled.
A data cube is quite simply a multi-dimensional array of data. Each dimension of a
data cube is a set of sets representing domain content such as time or geography. The
dimensions scaled categorically are such as region of country, state, quarter of year,
week of quarter. The cells of the cube represent aggregated measures (usually counts)
of variables. An example data cube is illustrated in Figure 6.
Exploration of a data cube involves the operations of drill down, drill up, and drill
through. Drill down involves splitting an aggregation into subsets, e.g. splitting region
of country into states. Drill up involves consolidation, i.e. aggregating subsets along a
dimension. Drill through involves subsets of crossing of sets, i.e. the user might inves-
tigate statistics within a state subsetted by time.
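A small sketch of these operations, treating a hypothetical three-dimensional array of counts as the data cube: drill up is aggregation along an axis, drill down splits an aggregate into its subtotals, and drill through subsets before aggregating. The dimensions and their sizes are invented for illustration.

```python
import numpy as np

# Hypothetical cube of counts: region x quarter x product (2 x 4 x 3).
rng = np.random.default_rng(0)
cube = rng.integers(0, 100, size=(2, 4, 3))
regions = ["East", "West"]

# Drill up: consolidate (aggregate) over quarters and products,
# leaving one total per region.
totals_by_region = cube.sum(axis=(1, 2))
print(dict(zip(regions, totals_by_region)))

# Drill down: split the East total into its per-quarter subtotals.
print(cube[0].sum(axis=1))

# Drill through: subset to West in quarters 3-4, then summarize by product.
print(cube[1, 2:, :].sum(axis=0))
```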
OLAP and MOLAP, multi-dimensional OLAP, are techniques that are typically per-
formed on data cubes in an online fashion. Operations are usually limited to simple
measures like counts, means, proportions and standard deviations. ROLAP refers to
relational OLAP using extended SQL.
In summary, we note that the relational database technology is fairly compute inten-
sive. Because of this, commercial database technology is challenged by the analysis of
datasets above about $10^8$ observations. This computational limitation applies to many
of the algorithms developed by computer scientists for data mining.

6. Statistical methods for data mining

The hunt for structure in numerical data has had a long history within the statistical
community. Examples of methodologies that may be used for data exploration in a data
mining scenario include correlation and regression, discriminant analysis, classifica-
tion, clustering, outlier detection, classification and regression trees, correspondence
analysis, multivariate nonparametric density estimation for hunting bump and ridges,
nonparametric regression, statistical pattern recognition, categorical data analysis, time-
series methods for trend and periodicity, and artificial neural networks. In this volume,
we have a number of detailed discussions on these techniques including nonparamet-
ric density estimation (Scott and Sain, 2005), multivariate outlier detection (Hubert et
al., 2005), classification and regression trees (Sutton, 2005), pattern recognition (Hand,
2005), classification (Marchette et al., 2005), and correspondence analysis (Rao, 2005).
To a large extent, these methods have been developed and treated historically as
confirmatory analysis techniques with emphasis on their statistical optimality and as-
ymptotic properties. In the context of data mining, where data are often not collected
according to accepted statistical sampling procedures, of course, traditional interpre-
tations as probability models with emphasis on statistical properties are inappropriate.
However, most of these methodologies can be reinterpreted as exploratory tools. For
example, while a nonparametric density estimator is frequently interpreted to be an es-
timator that asymptotically converges to the true underlying density, as a data mining
tool, we can simply think of the density estimator as a smoother which helps us hunt for
bumps and other features in the data, where we have no reason to assume that there is a
unique underlying density.
No discussion on statistical methods would be complete without referring to data
visualization as an exploratory tool. This area also is treated by several authors in this
volume, including Carr (2005), Buja et al. (2005), Chen (2005), and Wilhelm (2005).
We shall discuss visual data mining briefly in a later section of this paper, reserving
this section for some brief reviews of some analytical tools which are not covered so
thoroughly in other parts of this volume and which are perhaps a bit less common in
the usual statistical literature. Of further interest is the emerging attention to approaches
to streaming data. This topic will be discussed separately in yet another section of this
chapter.

6.1. Density estimation


As noted earlier in Section 2.1, kernel smoothers are in general $O(n)$ algorithms and thus should in principle be easily adapted to service in a data mining framework. A straightforward adaptation of the kernel density smoother to high dimensions is given by
$$f(x) = \frac{1}{n h_1 h_2 \cdots h_d} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right), \qquad x \in \mathbb{R}^d,$$
where $n$ is the sample size, $d$ is the dimension, $K$ is the smoothing kernel, $h_j$ is the smoothing parameter in the $j$th dimension, $x_i$ is the $i$th observation, $x$ is the point at which the density is estimated, and $h = (h_1, h_2, \ldots, h_d)$, the division being taken componentwise. In general, kernel densities must be computed on a grid since there is no direct closed form.
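A direct, if naive, implementation of this estimator is sketched below; the choice of a product Gaussian kernel and the hand-picked bandwidths are our assumptions.

```python
import numpy as np

def kde(query, data, h):
    """Product Gaussian kernel density estimate.
    query: (m, d) points at which to evaluate; data: (n, d) observations;
    h: (d,) per-dimension bandwidths."""
    n, d = data.shape
    # u[j, i, :] = (query_j - data_i) / h, componentwise
    u = (query[:, None, :] - data[None, :, :]) / h
    k = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)   # 1-d Gaussian kernels
    return k.prod(axis=2).sum(axis=1) / (n * h.prod())

rng = np.random.default_rng(1)
data = rng.normal(size=(500, 2))
h = np.array([0.4, 0.4])            # hand-picked bandwidths (an assumption)
query = np.array([[0.0, 0.0], [2.0, 2.0]])
print(kde(query, data, h))          # the density is higher at the origin
```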
The so-called curse of dimensionality dramatically affects the requirement for com-
putation on a grid. Consider, for example, a one-dimensional problem in which we
have a grid with k points. In d dimensions, in order to maintain the same density on
a side, one would require k d grid points implying that the computational complexity
explodes exponentially in the dimension of the data. Moreover, suppose that we have
in one dimension 10 cells with an average of 10 observations per cell or 100 observa-
tions. In order to maintain the same average cell density in d dimensions, we would
required a 10d+1 observations. Thus, not only does the computational complexity in-
crease exponentially with dimension, the data requirements also increase exponentially
with dimension.
One approach which circumvents this requirement is the one which exploits a finite
mixture of normal densities:

$$f(x; \psi) = \sum_{i=1}^{N} \pi_i\, \phi(x; \theta_i),$$
where $\psi = (\theta_1, \ldots, \theta_N, \pi_1, \ldots, \pi_N)$, $\phi$ is usually taken to be a normal density, and $N$ is the number of mixing terms ($N \ll n$). If $\phi$ is normal, then $\theta_i = (\mu_i, \Sigma_i)$, where $\mu_i$ and $\Sigma_i$ are respectively the mean vector and covariance matrix. The parameters are
re-estimated using the EM algorithm. Specifically, we have:
$$\tau_{ij} = \frac{\pi_i\, \phi(x_j; \theta_i)}{\sum_{i=1}^{N} \pi_i\, \phi(x_j; \theta_i)}, \qquad \pi_i = \frac{1}{n} \sum_{j=1}^{n} \tau_{ij},$$
$$\mu_i = \frac{1}{n \pi_i} \sum_{j=1}^{n} \tau_{ij}\, x_j, \qquad \Sigma_i = \frac{1}{n \pi_i} \sum_{j=1}^{n} \tau_{ij}\, (x_j - \mu_i)(x_j - \mu_i)^\dagger,$$
where $\tau_{ij}$ is the estimated posterior probability that $x_j$ belongs to component $i$, $\pi_i$ is the estimated mixing coefficient, and $\mu_i$ and $\Sigma_i$ are the estimated mean vector and covariance matrix, respectively. The EM algorithm is applied until convergence is obtained. A visualization of this is given in Figure 7. Mixture estimates are $L_1$ consistent only if the mixture density is correct and if the number of mixture terms is correct.
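The update equations translate directly into code. The sketch below runs a fixed number of EM iterations on one-dimensional data with N = 2 components; the initialization and the iteration count are arbitrary choices rather than recommendations.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 200)])
n, N = len(x), 2

# Arbitrary initialization of mixing weights, means, and variances.
pi = np.full(N, 1 / N)
mu = np.array([-1.0, 1.0])
var = np.ones(N)

for _ in range(50):
    # E step: tau[i, j] = posterior probability that x_j came from component i.
    dens = np.array([p * norm.pdf(x, m, np.sqrt(v))
                     for p, m, v in zip(pi, mu, var)])
    tau = dens / dens.sum(axis=0)
    # M step: re-estimate the mixing weights, means, and variances.
    nk = tau.sum(axis=1)
    pi = nk / n
    mu = (tau @ x) / nk
    var = np.array([(t * (x - m)**2).sum() / k
                    for t, m, k in zip(tau, mu, nk)])

print(np.round(pi, 2), np.round(mu, 2), np.round(var, 2))
```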
Adaptive mixtures are used in another, semiparametric, recursive method discussed
by Priebe and Marchette (1993) and Priebe (1994) and provide an alternate formulation
avoiding issues of the number of terms being correct. The recursive update equations
become:
$$\tau_{i,n+1} = \frac{\pi_{i,n}\, \phi(x_{n+1}; \theta_{i,n})}{\sum_{i=1}^{N} \pi_{i,n}\, \phi(x_{n+1}; \theta_{i,n})}, \qquad \pi_{i,n+1} = \pi_{i,n} + \frac{1}{n}\,(\tau_{i,n+1} - \pi_{i,n}),$$
$$\mu_{i,n+1} = \mu_{i,n} + \frac{\tau_{i,n+1}}{n \pi_{i,n}}\,(x_{n+1} - \mu_{i,n}),$$
$$\Sigma_{i,n+1} = \Sigma_{i,n} + \frac{\tau_{i,n+1}}{n \pi_{i,n}}\,\big[(x_{n+1} - \mu_{i,n})(x_{n+1} - \mu_{i,n})^\dagger - \Sigma_{i,n}\big].$$

Fig. 7. Visualization of normal mixture model parameters in a one-dimensional setting. For a color reproduction of this figure see the color figures section, page 565.
In addition, it is necessary to have a create rule. The basic idea is that if a new
observation is too far from the previously established components (usually in terms of
Mahalanobis distance), then the new observation will not be adequately represented by
the current mixing terms and a new mixing term should be added. We create a new
component centered at $x_t$ with a nominal starting covariance matrix. The basic form of the update/create rule is as follows:
$$\theta_{t+1} = \theta_t + \big[1 - P_t(x_{t+1}, \theta_t)\big]\, U_t(x_{t+1}, \theta_t) + P_t(x_{t+1}, \theta_t)\, C_t(x_{t+1}, \theta_t),$$
where $U_t$ is the previously given update rule, $C_t$ is the create rule, and $P_t$ is the decision
rule taking on values either 0 or 1. Figure 8 presents a three-dimensional illustration.
This procedure is nonparametric because of the adaptive character. It is generally $L_1$
consistent under relatively mild conditions, but almost always too many terms are cre-
ated and pruning is needed.
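A one-dimensional sketch of the recursive update with a simple create rule is given below; the create threshold, the nominal starting variance, and the renormalization of the weights after a create are our own illustrative choices, and no pruning is attempted.

```python
import numpy as np
from scipy.stats import norm

THRESH = 3.0      # create rule: standardized distance beyond which we add a term
START_VAR = 1.0   # nominal starting variance for a new component (our choice)
pis, mus, vars_ = [1.0], [0.0], [START_VAR]

def update(x, n):
    d = np.abs(x - np.array(mus)) / np.sqrt(vars_)
    if d.min() > THRESH:                       # create a new component at x
        pis.append(1 / n); mus.append(x); vars_.append(START_VAR)
        total = sum(pis)
        pis[:] = [p / total for p in pis]      # renormalize the weights
        return
    dens = np.array([p * norm.pdf(x, m, np.sqrt(v))
                     for p, m, v in zip(pis, mus, vars_)])
    tau = dens / dens.sum()
    for i in range(len(pis)):
        pis[i] += (tau[i] - pis[i]) / n
        mus[i] += tau[i] * (x - mus[i]) / (n * pis[i])
        vars_[i] += tau[i] * ((x - mus[i])**2 - vars_[i]) / (n * pis[i])

rng = np.random.default_rng(3)
stream = np.concatenate([rng.normal(0, 1, 500), rng.normal(8, 1, 500)])
rng.shuffle(stream)
for n, x in enumerate(stream, start=1):
    update(x, n)
print(len(pis), np.round(mus, 2))   # typically more terms than true components
```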
Solka et al. (1995) proposed the visualization methods for mixture models shown
in Figures 7 and 8. Solka et al. (1998) and Ahn and Wegman (1998) proposed alter-
native effective methods for eliminating redundant mixture terms. We note finally that
mixture models are valuable from the perspective that they naturally suggest clusters
centered at the mixture means. However, while the model complexity may be reduced,

Fig. 8. Visualization of adaptive normal mixture parameters in a two-dimensional setting.

the computational complexity may be increased dramatically. Nonetheless, once established, a mixture model is computationally simple because it does not have to be
estimated on a grid.

6.2. Cluster analysis


There are many methods for cluster analysis. In this section, we give a brief overview
of classical distance-based cluster analysis. It should be remarked at the outset that
traditional distance-based clustering of a dataset of size $n$ requires distances between points to be calculated. Because there are $\binom{n}{2}$ point pairs, the computational complexity of just computing the distances is $O(n^2)$. The implication is immediately that distance-based clustering is not suitable for truly massive datasets.
The basic clustering problem is, then, given a collection of n objects each of which is
described by a set of d characteristics, derive a useful division into a number of classes.
Both the number of classes and the properties of the classes are to be determined. We are
interested in doing this for several reasons, including organizing the data, determining
the internal structure of the dataset, prediction, and discovery of causes. In general, be-
cause we do not know a priori the number of clusters, the utility of a particular clustering
scheme can only be measured by the quality. In general, there is no absolute measure of
quality.
Suppose that our data is of the form $x_{ik}$, $i = 1, \ldots, n$ and $k = 1, \ldots, d$. Here $i$ indexes the observations and $k$ indexes the set of characteristics. Distances (or dissimilarities) can be measured in multiple ways. Some frequently used metrics are
the following:
Euclidean distance: $d(x_{i\cdot}, x_{j\cdot}) = \sqrt{\sum_{k=1}^{d} (x_{ik} - x_{jk})^2}$;

City block metric: $d(x_{i\cdot}, x_{j\cdot}) = \sum_{k=1}^{d} |x_{ik} - x_{jk}|$;

Canberra metric: $d(x_{i\cdot}, x_{j\cdot}) = \sum_{k=1}^{d} \dfrac{|x_{ik} - x_{jk}|}{x_{ik} + x_{jk}}$;

Angular separation metric: $d(x_{i\cdot}, x_{j\cdot}) = \dfrac{\sum_{k=1}^{d} x_{ik}\, x_{jk}}{\sqrt{\sum_{k=1}^{d} x_{ik}^2 \sum_{k=1}^{d} x_{jk}^2}}$.
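These metrics translate directly into code; as written, the Canberra metric assumes positive coordinates so that the denominator does not vanish.

```python
import numpy as np

def euclidean(xi, xj):
    return np.sqrt(np.sum((xi - xj) ** 2))

def city_block(xi, xj):
    return np.sum(np.abs(xi - xj))

def canberra(xi, xj):
    # Assumes positive coordinates so the denominator is nonzero.
    return np.sum(np.abs(xi - xj) / (xi + xj))

def angular_separation(xi, xj):
    # As defined above; note this is a cosine-type similarity measure.
    return np.dot(xi, xj) / np.sqrt(np.sum(xi**2) * np.sum(xj**2))

xi = np.array([1.0, 2.0, 3.0])
xj = np.array([2.0, 2.0, 5.0])
for f in (euclidean, city_block, canberra, angular_separation):
    print(f.__name__, round(f(xi, xj), 3))
```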

6.2.1. Hierarchical clustering


The agglomerative clustering algorithm begins with clusters C1 , C2 , . . . , Cn , each clus-
ter initially with a single point. Find the nearest pair Ci and Cj , merge Ci and Cj
into Cij , delete Ci and Cj , and decrement cluster count by one. If the number of clus-
ters is greater than one, repeat the previous step until there is only one cluster remaining.
In single linkage (nearest neighbor) clustering, the distance is defined as that of the
closest pair of individuals, where we consider one individual from each group. We could
also form complete linkage (furthest neighbor) clustering where the distance between
groups is defined as the most distant pair, one individual from each cluster. Consider the
following example with initial distance matrix:
1 2 3 4 5
1 0.0
2 2.0 0.0
3 6.0 5.0 0.0
4 10.0 9.0 4.0 0.0
5 9.0 8.0 5.0 3.0 0.0
The distance 2.0 is the smallest distance, so that we join individuals 1 and 2 in the
first round of agglomeration. This yields the following distance matrix based on single
linkage:
(1, 2) 3 4 5
(1, 2) 0.0
3 5.0 0.0
4 9.0 4.0 0.0
5 8.0 5.0 3.0 0.0
Now the distance 3.0 is the smallest, so that individuals 4 and 5 are joined in the second
round of agglomeration. This yields the following single linkage distance matrix:
(1, 2) 3 (4, 5)
(1, 2) 0.0
3 5.0 0.0
(4, 5) 8.0 4.0 0.0

Fig. 9. Dendrogram resulting from single linkage agglomerative clustering algorithm.

In this matrix, the smallest distance is now 4.0, so that 3 is joined into (4, 5). This yields
the penultimate distance matrix of
(1, 2) (3, 4, 5)
(1, 2) 0.0
(3, 4, 5) 5.0 0.0
Of course, the last agglomerative step is to join (1, 2) to (3, 4, 5) and to yield the single
cluster (1, 2, 3, 4, 5). This hierarchical cluster yields a dendrogram given in Figure 9. In
this example, the complete linkage clustering yields the same cluster sequence and the
same dendrogram. The intermediate distance matrices are different however.
An alternate approach could be to use group-average clustering. The distance be-
tween clusters is the average of the distance between all pairs of individuals between
the two groups. We also note that there are methods for divisive clustering, i.e. begin-
ning with every individual in one large cluster and recursively separating into smaller,
multiple clusters, until every individual is in a singleton cluster.
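The worked five-point example can be reproduced with standard library routines; here we feed the initial distance matrix, in condensed form, to SciPy's single linkage routine.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# The initial 5 x 5 distance matrix from the worked example.
D = np.array([[0,  2,  6, 10,  9],
              [2,  0,  5,  9,  8],
              [6,  5,  0,  4,  5],
              [10, 9,  4,  0,  3],
              [9,  8,  5,  3,  0]], dtype=float)

Z = linkage(squareform(D), method="single")
print(Z)
# Merge distances come out as 2.0, 3.0, 4.0, 5.0, matching the
# agglomeration sequence worked out above; scipy.cluster.hierarchy.dendrogram(Z)
# reproduces the tree of Figure 9 (plotting requires matplotlib).
```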

6.2.2. The number of groups problem


A significant question is how do we decide on the number of groups. One approach is
to maximize or minimize some criteria. Suppose g is the number of groups and ni is the
number of items in the ith group. Consider
$$T = \sum_{i=1}^{g} \sum_{j=1}^{n_i} (x_{ij} - \bar x)(x_{ij} - \bar x)^\dagger,$$
$$W = \sum_{i=1}^{g} \sum_{j=1}^{n_i} (x_{ij} - \bar x_i)(x_{ij} - \bar x_i)^\dagger,$$
$$B = \sum_{i=1}^{g} n_i\, (\bar x_i - \bar x)(\bar x_i - \bar x)^\dagger,$$
$$T = W + B,$$
where $\bar x$ is the overall mean vector and $\bar x_i$ is the mean vector of the $i$th group.
Some optimization strategies are:
(1) minimize trace(W ),
(2) maximize det(T )/det(W ),
(3) minimize det(W ),
(4) maximize trace($BW^{-1}$).

It should be noted that all approaches of this discussion have important computa-
tional complexity requirements. The number of partitions of n individuals into g groups
N(n, g) can be quite daunting. For example,
N(15, 3) = 2 375 101,
N(20, 4) = 45 232 115 901,
N(25, 8) = 690 223 721 118 368 580,
N(100, 5) ≈ $10^{68}$.
We note that many references to clustering algorithms exist. Two important ones are
(Everitt et al., 2001; Hartigan, 1975).
Other approaches to clustering involve minimal spanning trees (see, for example,
Solka et al., 2005, this volume), clustering based on mixture densities mentioned in
the previous section, and clustering based on Voronoi tessellations with centroids deter-
mined by estimating modes (Sikali, 2004).

6.3. Artificial neural networks


6.3.1. The biological basis
Artificial neural networks have been inspired by mathematicians attempting to create
an inference network modeled on the human brain. The brain contains approximately
$10^{10}$ neurons that are highly interconnected with an average of several thousand inter-
connects per neuron. The neuron is composed of three major parts, the cell body, an
input structure called dendrites, and an output structure called the axon. The axon of a
given cell is connected to the dendrites of other cells. The cells communicate by elec-
trochemical signals. When a given cell is activated, it fires the electrochemical signal
along its axon which is transferred to all of the cells with dendrites connected to the
cell’s axon. The receiving cells may or may not fire depending on whether a receiving
cell has received enough total stimulus to exceed its firing threshold.
There is actually a gap called the synapse with the neurotransmitters prepared to
transmit the signal across the synapse. The strength of a signal that a neuron receives
depends on the efficiency of the synapses. The neurotransmitters are released from the
neuron at presynaptic nerve terminal on the axon and cross the synapse to dendrites of
the next neuron. The dendrite of the next neuron may have a receptor site for the neu-
rotransmitter. The activation of the receptor site may be either inhibitory or excitatory,
which may lower or raise the possibility of the next neuron firing. Learning is thought
to take place when the efficiency of the synaptic connection is increased. This process
is called neuronal plasticity.
The brain is capable of extremely complex tasks based on these simple units that
transmit binary signals (fire, don’t fire). Artificial neural networks in principle work the
same way except that the overall complexity in terms of processing units and intercon-
nects does not realistically approximate the scale of the brain.

6.3.2. Functioning of an artificial neural network


An artificial neuron has a number of receptors that receive inputs either from data or
from the outputs of other neurons. Each input comes by way of a connection with

Fig. 10. The sigmoid curve.

a weight (or strength) in analogy to the synaptic efficiency of the biological neuron.
Each artificial neuron has a threshold value from which the weighted sum of the inputs
minus the threshold is formed. The artificial neuron is not binary as is the biologi-
cal neuron. Instead, the weighted sum of inputs minus threshold is passed through a
transfer function (also called an activation function), which produces the output of the
neuron. Although it is possible to use a step-activation function in order to produce a
binary output, step activation functions are rarely used. Generally speaking, the most
used activation function is a translation of the sigmoid curve given by
$$f(t) = \frac{1}{1 + e^{-t}}.$$
See Figure 10. The sigmoid curve is a special case of the logistic curve.
The most common form of an artificial neural network is called a feed-forward net-
work. The first layer of neurons receives input from the data. The first layer is connected
to one or more additional layers, called hidden layers, and the last hidden layer is con-
nected to the output layer. The signal is fed forward through the network from the input
layer, through the hidden layer to the units in the output layer. This network is stable be-
cause it has no feedback. A network is recurrent if it has a feedback from later neurons
to earlier neurons. Recurrent networks can be unstable and, although interesting from a
research perspective, are rarely useful in solving real problems.
Neural networks are essentially nonlinear nonparametric regression tools. If the func-
tional form of the relationship between input variables and output variables were known,
it would be modeled directly. Neural networks learn the relationship between input and
output through training, which determines the weights of the neurons in the network. In
a supervised learning scenario, a set of inputs and outputs is assembled and the network
is trained (establishes neuron weights and thresholds) to minimize the error of its pre-
dictions on the training dataset. The best known example of a training method is back
propagation. After the network is trained, it models the unknown function and can be
used to make predictions on input values for which the output is not known.
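A minimal sketch of the feed-forward pass with one hidden layer and sigmoid activations follows; the layer sizes and the random weights and thresholds are placeholders, since setting them is the business of training.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def forward(x, weights, thresholds):
    """Feed the input through each layer: weighted sum of inputs minus
    threshold, passed through the sigmoid activation."""
    a = x
    for W, theta in zip(weights, thresholds):
        a = sigmoid(W @ a - theta)
    return a

rng = np.random.default_rng(4)
# Placeholder network: 3 inputs -> 5 hidden units -> 1 output.
weights = [rng.normal(size=(5, 3)), rng.normal(size=(1, 5))]
thresholds = [rng.normal(size=5), rng.normal(size=1)]
print(forward(np.array([0.2, -1.0, 0.5]), weights, thresholds))
```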

From this description it is clear that artificial neural networks operate on numerical
data. Although more difficult, categorical data can be modeled by representing the data
by numeric values. The number of neurons required for the artificial neural network
is related to the complexity of the unknown function that is modeled as well as to the
variability of the additive noise. The number of the required training cases increases
nonlinearly as the number of connections in the network increases. A rule-of-thumb is
that the number of cases should be approximately ten times the number of connections.
Obviously, the more variables involved, the more neurons are required. Thus, variables
should be chosen carefully. Missing values can be accommodated if necessary. If there is
enough data, observations with missing values should be discarded. Outliers can cause
problems and should be discarded.

6.3.3. Back propagation


The error generated by a particular configuration (geometry, neural weights, thresholds)
of the network is determined by processing all of the training cases through the network
and comparing the network’s actual output with the known target outputs of the train-
ing cases. The differences are combined by an error function, often a sum of squared
errors or a cross entropy function, which is used for maximum likelihood classification.
The error surfaces of neural networks are often complicated by such features as local
minima, saddle points, and flat or nearly flat spots. Starting at a random point on the
error surface, the training algorithm attempts to find a global minimum. In back prop-
agation, the gradient of the error surface is calculated and a steepest descent trajectory
is followed. Generally, one takes small steps along the steepest descent trajectory mak-
ing smaller steps as the minimum is approached. The size of the steps is crucial in the
sense that a step size that is too big may alternate between sides of the local minimum
well and never converge, or may jump from one local minimum well to another. Step
sizes that are too small may converge properly, but require too many iterations. Usually
the step size is chosen proportional to the slope of the gradient vector and to a con-
stant known as the learning rate. The learning rate is essentially a fudge factor. It is
application dependent and is chosen experimentally.
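A sketch of the descent step on a toy error surface is given below; the finite-difference gradient stands in for the analytic gradient that back propagation computes, and the learning rate is simply a constant we picked.

```python
import numpy as np

def gradient_descent(error, w0, learning_rate=0.1, steps=1000, eps=1e-6):
    """Steepest descent on an error surface with a finite-difference gradient.
    The learning rate is application dependent and chosen experimentally."""
    w = np.asarray(w0, dtype=float)
    for _ in range(steps):
        grad = np.array([
            (error(w + eps * e) - error(w - eps * e)) / (2 * eps)
            for e in np.eye(len(w))
        ])
        w -= learning_rate * grad   # step size proportional to the slope
    return w

# Toy error surface with a known minimum at (1, -2).
error = lambda w: (w[0] - 1.0) ** 2 + (w[1] + 2.0) ** 2
print(gradient_descent(error, [0.0, 0.0]))
```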

7. Visual data mining

As much as we earlier protested in Section 2.4 that data mining of massive datasets is unlikely to be successful, the upper bound of approximately $10^6$ observations still allows for visually mining relatively large datasets. Indeed, we have pursued this fairly aggressively. See
Wegman (2003). Of course, there are many situations where it is useful to attempt to
apply data mining techniques to more modest datasets. In this section, we would like
to give some perspective on our views of visual data mining. As pointed out earlier,
this volume contains a number of thoughtful chapters on data visualization and the
reader is strongly encouraged to examine the variety of perspectives that these chapters
represent.

7.1. The four stages of data graphics


We note that data visualization has a long history in the statistics profession. Michael
Friendly’s website (http://www.math.yorku.ca/SCS/friendly.html) has a wealth of inter-
esting material on the history of data graphics. Anyone interested in this topic should
browse Michael’s website. As a preface to our view of data graphics, we offer the fol-
lowing perspective. In general, we think that statisticians tend to be methodology centric
rather than data centric. By this we mean that statisticians tend to identify themselves
by the tools they employ (i.e. their methodology) rather than by the problem they are
trying to solve (i.e. the data with which they are presented). In general, we believe that
new data types generate new data analysis tools.
The most common form of data graphics is the static graphic. Tufte (1983, 1990,
1997) provides a long significant discussion on the art of static graphics. Wilkinson
(1999), on the other hand, presents an elegant discussion on the mathematics of static
graphics. In addition to the usual line plots and bar charts, static graphics includes a
host of plot devices like scatterplot matrices, parallel coordinate plots, trellis plots, and
density plots. Static graphics may include color and anaglyph stereoscopic plots. The
second, third, and fourth stages of data graphics have come to being with the computer
age. Interactive graphics, the second stage, involves manipulation of a graph data object
but not necessarily the underlying data from which the graph data object was created.
Think of the data residing on a server and the graph data object residing on a client
(computer). Interactive graphics is possible with relatively large datasets where the re-
sulting graph data object is relatively much smaller. Interactive graphics devices include
brushing, 3D stereoscopic plots, rocking and rotation, cropping and cutting, and linked
views.
Dynamic graphics involves graphics devices in which the underlying data can be
interacted with. There is perhaps a subtle distinction between dynamic graphics and
interactive graphics. However, the requirement for dynamic graphics that the underlying
data must be engaged to complete the graphic device implies that the scale of data must
be considerably smaller than in the case with interactive graphics. Such techniques as
grand tour, dynamic smoothing of plots, conditioned choropleth maps, and pixel tours
are examples of dynamic graphics.
The final category of graphics is what we have begun to call evolutionary graphics.
In this setting, the underlying datasets are no longer static but evolving either by being
dynamically adjusted or because the data are streaming. Examples in which the data are
dynamically adjusted include data set mapping (Wegman and King, 1990) and iterative
denoising (Priebe et al., 2004). Examples of streaming data applications include water-
fall plots, transient geographic mapping, and skyline plots. Evolutionary graphics with
respect to streaming data is discussed more thoroughly in Section 8.

7.2. Graphics constructs for visual data mining


We take as a given in the visual data mining scenario that we will have relatively
high-dimensional data and relatively large datasets. We can routinely handle up to 30
dimensions and 500 000 to 1 million observations. The graphics constructs we routinely
use include scatterplot matrices, parallel coordinate plots, 3D stereoscopic plots, density
plots, grand tour on all plot devices, linked views, saturation brushing, and pruning and
cropping. We have particularly exploited the combination of parallel coordinate plots,
k-dimensional grand tours ($1 \leq k \leq d$, where $d$ is the dimension of the data), and
saturation brushing. We briefly describe these below.
Parallel coordinates are a multi-dimensional data plotting device. Ordinary cartesian
coordinates fail as a plot device after three dimensions because we live in a world with
3 orthogonal spatial dimensions. The basic idea of parallel coordinates is to give up the
orthogonality requirement and draw the axes as parallel to each other. A d-dimensional
point is plotted by locating its appropriate component on each of the corresponding par-
allel axes and interpolating with a straight line between axes. Thus a multi-dimensional
point is uniquely represented by a broken line. Much of the elegance of this graph de-
vice is due to a projective geometry duality, which allows for interpretation of structure
in the parallel coordinate plot. Details are available in (Wegman, 1990, 2003).
The grand tour is a method for animating a static plot by forming a generalized ro-
tation in a k-subspace of the d-space where the data live. The basic idea is that we
wish to examine the data from different views uncovering features that may not be vis-
ible in a static plot, literally taking a grand tour of the data, i.e. seeing the data from
all perspectives. We might add that the grand tour is especially effective at uncovering
outliers and clusters, tasks that are difficult analytically because of the computational
complexity of the algorithms involved. A discussion of the mathematics underpinning
both parallel coordinates and the grand tour can be found in (Wegman and Solka,
2002).
The final concept is saturation brushing. Ordinary brushing involves brushing (or
painting) a subset of the data with a color. Using multiple colors in this mode allows
for designation of clusters within the dataset or other subsets, for example, negative
and positive values of a given variable by means of color. Ordinary brushing does not
distinguish between a pixel that represents one point and a pixel that represents 10 000
points. The idea of saturation brushing is to desaturate the color so that it is nearly black
and brush with the desaturated color. Then, using the so-called α-channel found on
modern graphics cards, add up the color components. Heavy overplotting is represented
by a fully saturated pixel whereas a single observation or a small amount of overplotting
will remain nearly black. Thus saturation brushing is an effective way of seeing the
structure of large datasets.
Combining these methods leads to several strategies for interactive data analysis.
The BRUSH-TOUR strategy is a recursive method for uncovering cluster structure. The
basic idea is to brush all visible clusters with distinct colors. If the parallel axes are
drawn horizontally, then any gap in any horizontal slice separates two clusters. (Some
authors draw the parallel coordinates vertically, so in this case any gap in any vertical
slice separates two clusters.) Once all visible clusters are marked, initiate the grand
tour until more gaps appear. Stop the tour and brush the new clusters. Repeat until no
unmarked clusters appear. An example of the use of the BRUSH-TOUR strategy may
be found in (Wilhelm et al., 1999).
A second strategy is the TOUR-PRUNE strategy, which is useful for forming tree structures. An example of the use of TOUR-PRUNE is to recursively build a decision tree based on demographic data. In the case illustrated in (Wegman, 2003), we
considered a profit variable and demographic data for a number of customers of a bank.
The profit variable was binarized by brushing the profit data with red for customers
that lost money for the bank and green for customers that made a profit for the bank.
The profit variable was taken out of the grand tour and the tour allowed to run on the
remaining demographic variables until either a strongly red or strongly green region
was found. A strongly red region indicated a combination of demographic variables
that represented customers who lost money whereas strongly green region indicated a
combination of demographic variables that represented customers who made profits for
the bank. By recursively touring and pruning, a decision tree can be built from combi-
nations of demographic variables for the purpose of avoiding unnecessary risks for the
bank.

7.3. Example 1 – PRIM 7 data

Friedman and Tukey (1974) introduced the concept of projection pursuit and used this
methods to explore the PRIM 7 data. This dataset has become something of a challenge
dataset for statisticians seeking to uncover multi-dimensional structure. In addition to
Friedman and Tukey, Carr et al. (1986) and Carr and Nicholson (1988) found linear
features, Scott (1992, p. 213) reported on a triangular structure found by his student
Rod Jee in an unpublished thesis (Jee, 1985, 1987), and Cook et al. (1995) found the
linear features hanging off the vertices of the Jee–Scott triangle.
The PRIM 7 data is taken from a high-energy particle-physics scattering experiment.
A beam of positively charged pi-mesons with an energy of 16 BeV is collided with a sta-
tionary target of protons contained in hydrogen nuclei. In such an experiment, quarks
can be exchanged between the pi-meson and proton, with overall conservation of en-
ergy and momentum. The data consists of 500 examples. For this experiment, seven
independent variables are sufficient to fully characterize the reaction products. A de-
tailed description of the physics of the reaction is given in (Friedman and Tukey, 1974,
p. 887). The seven-dimensional structure of this data has been investigated over the
years.
The initial configuration of the data is given in a 7-dimensional scatterplot matrix
in Figure 13. The BRUSH-TOUR strategy was used to identify substructures of the
data. A semifinal brushed view of the data is given in Figure 11. Of particular interest
is the view illustrating three triangular features given in Figure 12, which is a view of
the data after a GRAND-TOUR rotation. The coloring in Figure 12 is the same as in
Figure 11. Of course, two-dimensional features such as a triangle will often collapse
into a one-dimensional linear feature or a zero-dimensional point feature in many of
the two-dimensional projections. The fundamental question from the exploration of this
data is what is the underlying geometric structure. The presence of triangular features
suggests that the data form a simplex. It is our conjecture that the data actually form a
truncated six-dimensional simplex. Based on this conjecture, we constructed simulated
data. The initial configuration of the simulated data is shown in Figure 14, which can
be compared directly with the real data in Figure 13.

Fig. 11. Scatterplot matrix of the PRIM 7 data after GRAND-TOUR rotation with features highlighted in
different colors. For a color reproduction of this figure see the color figures section, page 566.

Fig. 12. A scatterplot of the PRIM 7 data after GRAND-TOUR illustrating the three triangular features in the
data, again highlighted in different colors. For a color reproduction of this figure see the color figures section,
page 566.

Fig. 13. A scatterplot matrix of the initial unrotated configuration of the PRIM 7 data. After considerable
exploration, we conjecture a truncated 6-dimensional simplex based on the multiple triangular features.

7.4. Example 2 – iterative denoising with hyperspectral data


The hyperspectral imagery data consists of 14 478 observations selected from higher-
resolution images in 126 dimensions (1 824 228 numbers) arising from six known
classes and one unknown class. The six known classes were determined by ground truth
to represent pixels coming from runway, water, swamp, grass, scrub, and pine. The goal
is to assign a class to the data from the unknown class. The approach here is to use
a modified TOUR-PRUNE strategy. We first begin by reducing dimension. We do this
by forming the first three principal components. Figure 15 is a scatterplot of the first
two principal components. Three components clearly stand apart from the rest. They
are respectively water (blue and green) and runway (red). Once these are pruned away,
we recompute the principal components based on the now reduced (denoised) dataset.
Figure 16 is the scatterplot for the remaining data. The prominent blue points are
the swamp and the prominent cyan points correspond to grass. These are pruned away
leaving only scrub, pines and unknown. Again the first three principal components are
recomputed, and the resulting (toured) image is displayed in Figure 17. Notice that the
blue is scrub, which overlaps somewhat with the other two classes. Removing the scrub
and one final recomputation of the principal components, yields the (toured) image in
Figure 18. Notice that the red (unknown) and green (pines) are thoroughly intermingled,
which suggests that the unknowns are pines. In fact, we do know that they are oak trees,

Fig. 14. A scatterplot matrix of the initial unrotated configuration of our simulated PRIM 7 data based on our
conjectured truncated 6-dimensional simplex.

Fig. 15. The first two principal components of the hyperspectral imagery. There are 7 classes of pixels includ-
ing runway, water, swamp, grass, scrub, pine, and unknown (really oaks). The water and runway are isolated
in this figure. For a color reproduction of this figure see the color figures section, page 567.

Fig. 16. The recomputed principal component after denoising by removing water and runway pixels. The
swamp and grass pixels are colored respectively by cyan and blue. These may be removed and the princi-
pal components once again computed. For a color reproduction of this figure see the color figures section,
page 567.

Fig. 17. The penultimate denoised image. The scrub pixels are shown in blue. They are removed and one
final computation of the principal components is completed. For a color reproduction of this figure see the
color figures section, page 568.

so that pines and oaks have similar spectra. This also explains why scrub is closer to the
pines and oaks than say the grass or other categories. One last image (Figure 19) is of
interest. In this image, we took ten principal components and brushed them with distinct
colors for the seven distinct classes. This rotation shows that there is additional structure
in the data that is not encoded by the seven class labels. The source of this structure is
unknown by the experts who provided this hyperspectral data. Finally, we note in clos-
ing this section that the results in Examples 1 and 2 have not previously been published.

Fig. 18. The final denoised image show heavy overlap between the pine and unknown (oak) pixels. Based
on this analysis, we classify the unknowns as being closest to pines. In fact, both are trees and are actually
intermingled when the hyperspectral imagery was ground truthed. For a color reproduction of this figure see
the color figures section, page 568.

Fig. 19. The same hyperspectral data but based on 10 principal components instead of just 3. The plot of PC 5
versus PC 6 shows that there is additional structure in this dataset not captured by the seven classes originally
conjectured. For a color reproduction of this figure see the color figures section, page 569.

8. Streaming data

Statisticians have flirted with the concept of streaming data in the past. Statistical
process control considers data to be streaming, but at a comparatively low rate because
of the limits with which physical manufacturing can take place. Sequential analysis on
the other hand takes a more abstract perspective and assumes that the data is unending,
but very highly structured and seeks to make decisions about the underlying structure
as quickly as possible. Data acquisition techniques based on electronic and high-speed computer resources have changed the picture considerably. Data acquired may not be
well structured in the sense that the underlying probabilistic structure may not exist and,
if it does, is likely to be highly nonstationary. Examples of such data streams abound in-
cluding Internet traffic data, point of sales inventory data, telephone traffic billing data,
weather data, military and civilian intelligence data, NASA’s Earth observing system
satellite instrument data, high-energy physics particle collider data, and large-scale sim-
ulation data. It is our contention that massive streaming data represent a fundamentally
new data paradigm and consequently require fundamentally new tools.
In December, 2002 and again in July of 2004, the Committee on Applied and Theo-
retical Statistics of the US National Academy of Science held workshops on streaming
data. Many of the papers presented at the December workshop were fleshed out and published in a special issue of the Journal of Computational and Graphical Statistics (December, 2003, vol. 12, no. 4). The reader is directed to this interesting issue.
Our own experience with streaming data has focused on Internet packet header data
gathered at both the Naval Surface Warfare Center and at George Mason University.
These data are highly non-stationary with time-of-day effects, day-of-week effects, and
seasonal effects as well. At current data rates, our estimate in the year 2004 is that at
George Mason University we could collect 24 terabytes of streaming Internet traffic
header data. Clearly, this volume of streaming data is not easily stored, so new formulations
of data analysis and visualization must be formulated. In the subsequent sections we
give some suggestions.

8.1. Recursive analytic formulations


Much of the discussion in this and the next sections is based on (Wegman and Marchette, 2003; Marchette and Wegman, 2004; Kafadar and Wegman, 2004). Each of these
articles, but particularly the first, describe the structure of TCP/IP packet headers. The
graphics illustrations we give later are based on Internet traffic data. In essence, the
variables involved are source IP (SIP), destination IP (DIP), source port, destination
port, length of session, number of packets, and number of bytes. We will not address
specific details in this chapter because we are interested in general principles.
Because the data are streaming at a high rate, algorithmic issues must embrace two
concepts. First, no data is permanently stored. The implication is that algorithms must
operate on the data items and then discard. If each datum is processed and then dis-
carded, we have a purely recursive algorithm. If a small amount of data for a limited
time period are stored as in a moving window, we have a block recursion. Second, the
algorithms must be relatively computationally simple in order to keep up with the data
rates. With these two principles in mind, we can describe some pertinent algorithms.

8.1.1. Counts, moments and densities


Suppose we first agree that Xi , i = 1, 2, . . . , represents the incoming data stream.
Clearly, the count of the number of items can be accumulated recursively. In addition,
the traditional sample mean $\bar X_n$ can be computed recursively by
$$\bar X_n = \frac{n-1}{n}\, \bar X_{n-1} + \frac{1}{n}\, X_n.$$
Also clear is that moments of all orders can be computed recursively by
$$\sum_{i=1}^{n} X_i^k = \sum_{i=1}^{n-1} X_i^k + X_n^k.$$

A recursive form of the kernel density estimator was formulated by Wolverton and Wagner (1969) and independently by Yamato (1971):
$$f_n^{*}(x) = \frac{n-1}{n}\, f_{n-1}^{*}(x) + \frac{1}{n h_n}\, K\!\left(\frac{x - X_n}{h_n}\right),$$
where $K$ is the smoothing kernel and $h_n$ is the usual bandwidth parameter. Wegman and Davies (1979) proposed an additional recursive formulation and showed strong consistency and asymptotic convergence rates for
$$f_n^{\dagger}(x) = \frac{n-1}{n} \left(\frac{h_{n-1}}{h_n}\right)^{1/2} f_{n-1}^{\dagger}(x) + \frac{1}{n h_n}\, K\!\left(\frac{x - X_n}{h_n}\right),$$
where the interpretation of $K$ and $h_n$ is as above. Finally, we note that the adaptive
mixtures described in Section 6.1 is also a recursive formulation.
The difficulty with all of these procedures is that they do not discount old data. In fact,
both $1/n$ and $1/(nh_n)$ converge to 0, so that new data rather than old data is discounted.
The exponential smoother has been traditionally used to discount older data. The general
formulation is


$$Y_t = (1 - \theta) \sum_{i=0}^{\infty} \theta^i\, X_{t-i}^{k}, \qquad 0 < \theta < 1.$$

This may be reformulated recursively as

$$Y_t = \theta\, Y_{t-1} + (1 - \theta)\, X_t^{k}.$$

It is straightforward to verify that if $E[X_t^k] = E[X^k]$ is independent of $t$, then $E[Y_t] = E[X_t^k]$.
θ is the parameter that controls the rate of discounting. Small values, i.e. close to
zero, discount older data rapidly while values close to one discount older data more
slowly. The recursive density formulation can also be reformulated as an exponential
smoother:

$$f_n(x) = \theta\, f_{n-1}(x) + \frac{1 - \theta}{h_n}\, K\!\left(\frac{x - X_n}{h_n}\right).$$
Of course, discounting older data may be done simply by keeping a moving window
of data, thus totally discarding data of a certain age. This is the so-called block recur-
sion form. An additional approach is to use the geometric quantization as described in
Section 4.2.
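A moving-window statistic of this block recursion form can be maintained with one subtraction and one addition per datum; in the minimal sketch below, the window length W is an arbitrary illustrative choice:

```python
from collections import deque

# Block recursion: keep only the last W data items; anything older is
# discarded outright rather than smoothly discounted.
W = 1000
window = deque(maxlen=W)   # appending beyond W evicts the oldest datum
running_sum = 0.0

def update(x):
    """Consume one datum, maintaining a windowed sum in O(1) per item."""
    global running_sum
    if len(window) == W:
        running_sum -= window[0]   # datum about to be evicted
    window.append(x)
    running_sum += x
```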

8.2. Evolutionary graphics


8.2.1. Waterfall diagrams and transient geographic mapping
Waterfall diagrams and transient geographic mapping are examples of evolutionary
graphics. The idea of waterfall diagrams is to record data for a small epoch of time.
In Figure 20, we study the source port as a function of time. In this particular example,
we record all source ports for a short epoch. This is essentially binary data. If a
source port is observed, it is recorded as a black pixel. If a source port is not observed,
the corresponding pixel is left white. At the end of the short epoch, the top line of the
waterfall diagram is recorded and the next epoch begins. As the second epoch ends, the
first line is pushed down in the diagram and the results of the second epoch are recorded
on the top line. This procedure repeats until, say, 1000 lines are recorded. The bottom
of the diagram represents the oldest epoch, the top the newest. As new data come in,
the oldest epoch is dropped off the bottom and the most recent is appended to the top.
The diagonal streaks in Figure 20 correspond to increments in the source port, which
is characteristic of the operating system of the particular computer. From such a diagram
one can readily make inferences about the operating system and detect potential intruders
and unauthorized activity.
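The underlying data structure can be sketched as a binary matrix updated once per epoch; the dimensions and update details below are our own illustrative choices, not a specification from the chapter:

```python
import numpy as np

# One row per epoch, one column per source port (the full 16-bit range).
N_EPOCHS, N_PORTS = 1000, 65536
waterfall = np.zeros((N_EPOCHS, N_PORTS), dtype=bool)

def end_of_epoch(waterfall, ports_seen):
    """Push every row down one line; record the newest epoch on the top row."""
    waterfall = np.roll(waterfall, 1, axis=0)  # oldest row wraps to the top ...
    waterfall[0] = False                       # ... and is cleared
    waterfall[0, list(ports_seen)] = True      # black pixel = port observed
    return waterfall

# Example: one epoch in which ports 1024-1027 were observed.
waterfall = end_of_epoch(waterfall, {1024, 1025, 1026, 1027})
```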
Transient geographic mapping is much harder to illustrate, but comparatively easy to
understand. Most users belong to a class B network. Class A networks are typically reserved for
very large providers of backbone services such as AT&T, Sprint, MCI and the like. Thus
the first two octets can typically be identified with a corporate entity, including ISPs and universities.

Fig. 20. Waterfall diagram of source port as a function of time. The diagonal lines indicate distinct operating
systems with different slopes characteristic of different operating systems.
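Grouping traffic by network owner then reduces to a simple prefix extraction on each address; a minimal sketch follows (the sample address is hypothetical):

```python
# Extract the first two octets (the /16 prefix) of a dotted-quad IP address,
# which for a class B network typically identifies the owning organization.
def prefix16(ip: str) -> str:
    a, b, _, _ = ip.split(".")
    return f"{a}.{b}"

print(prefix16("192.168.1.1"))  # -> "192.168"
```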
You might also like