
Introduction To Data Mining 2005th Edition Pang-Ning Tan PDF Download

The document provides links to various eBooks related to data mining and other topics, including 'Introduction to Data Mining' by Pang-Ning Tan and several other titles. It features instant digital downloads in multiple formats such as PDF, ePub, and MOBI. The content includes a detailed table of contents for the 'Introduction to Data Mining' book, outlining key concepts and chapters.

Uploaded by faunihenak1d
Copyright © All Rights Reserved

Introduction to Data Mining 2005th Edition Pang-Ning Tan pdf download
https://ebookname.com/product/introduction-to-data-mining-2005th-edition-pang-ning-tan/

Get Instant Ebook Downloads – Browse at https://ebookname.com
Instant digital products (PDF, ePub, MOBI) available. Download now and explore formats that suit you.

Intelligent Data Warehousing: From Data Preparation to Data Mining 1st Edition Zhengxin Chen
https://ebookname.com/product/intelligent-data-warehousing-from-data-preparation-to-data-mining-1st-edition-zhengxin-chen-author/

Visual Data Mining: Techniques and Tools for Data Visualization and Mining 1st Edition Tom Soukup
https://ebookname.com/product/visual-data-mining-techniques-and-tools-for-data-visualization-and-mining-1st-edition-tom-soukup/

Data Mining Using SAS Applications (Chapman & Hall/CRC Data Mining and Knowledge Discovery Series) 1st Edition George Fernandez
https://ebookname.com/product/data-mining-using-sas-applications-chapman-hall-crc-data-mining-and-knowledge-discovery-series-1st-edition-george-fernandez/

Supercontinent: Ten Billion Years in the Life of Our Planet 1st Edition Ted Nield
https://ebookname.com/product/supercontinent-ten-billion-years-in-the-life-of-our-planet-1st-edition-ted-nield/

The Tool Steel Guide First Edition Szumera
https://ebookname.com/product/the-tool-steel-guide-first-edition-szumera/

Police and Community in Japan Walter L. Ames
https://ebookname.com/product/police-and-community-in-japan-walter-l-ames/

Crisis at Sea: The United States Navy in European Waters in World War I 1st Edition William N. Still Jr.
https://ebookname.com/product/crisis-at-sea-the-united-states-navy-in-european-waters-in-world-war-i-1st-edition-william-n-still-jr/

Antibiotic and Chemotherapy 9th Edition Roger G. Finch
https://ebookname.com/product/antibiotic-and-chemotherapy-9th-edition-roger-g-finch/

The Best of Betty Saw 1st Edition Betty Saw
https://ebookname.com/product/the-best-of-betty-saw-1st-edition-betty-saw/

A Cup of Comfort for Cat Lovers Colleen Sell
https://ebookname.com/product/a-cup-of-comfort-for-cat-lovers-colleen-sell/
Introduction to Data Mining

PANG-NING TAN
Michigan State University

MICHAEL STEINBACH
University of Minnesota

VIPIN KUMAR
University of Minnesota
and Army High Performance Computing Research Center

Boston San Francisco New York
London Toronto Sydney Tokyo Singapore Madrid
Mexico City Munich Paris Cape Town Hong Kong Montreal
Contents

Preface vii

1 Introduction 1
1.1 What Is Data Mining? 2
1.2 Motivating Challenges 4
1.3 The Origins of Data Mining 6
1.4 Data Mining Tasks 7
1.5 Scope and Organization of the Book 11
1.6 Bibliographic Notes 13
1.7 Exercises 16

2 Data 19
2.1 Types of Data 22
2.1.1 Attributes and Measurement 23
2.1.2 Types of Data Sets 29
2.2 Data Quality 36
2.2.1 Measurement and Data Collection Issues 37
2.2.2 Issues Related to Applications 43
2.3 Data Preprocessing 44
2.3.1 Aggregation 45
2.3.2 Sampling 47
2.3.3 Dimensionality Reduction 50
2.3.4 Feature Subset Selection 52
2.3.5 Feature Creation 55
2.3.6 Discretization and Binarization 57
2.3.7 Variable Transformation 63
2.4 Measures of Similarity and Dissimilarity 65
2.4.1 Basics 66
2.4.2 Similarity and Dissimilarity between Simple Attributes 67
2.4.3 Dissimilarities between Data Objects 69
2.4.4 Similarities between Data Objects 72
2.4.5 Examples of Proximity Measures 73
2.4.6 Issues in Proximity Calculation 80
2.4.7 Selecting the Right Proximity Measure 83
2.5 Bibliographic Notes 84
2.6 Exercises 88

3 Exploring Data 97
3.1 The Iris Data Set 98
3.2 Summary Statistics 98
3.2.1 Frequencies and the Mode 99
3.2.2 Percentiles 100
3.2.3 Measures of Location: Mean and Median 101
3.2.4 Measures of Spread: Range and Variance 102
3.2.5 Multivariate Summary Statistics 104
3.2.6 Other Ways to Summarize the Data 105
3.3 Visualization 105
3.3.1 Motivations for Visualization 105
3.3.2 General Concepts 106
3.3.3 Techniques 110
3.3.4 Visualizing Higher-Dimensional Data 124
3.3.5 Do's and Don'ts 130
3.4 OLAP and Multidimensional Data Analysis 131
3.4.1 Representing Iris Data as a Multidimensional Array 131
3.4.2 Multidimensional Data: The General Case 133
3.4.3 Analyzing Multidimensional Data 135
3.4.4 Final Comments on Multidimensional Data Analysis 139
3.5 Bibliographic Notes 139
3.6 Exercises 141

4 Classification: Basic Concepts, Decision Trees, and Model Evaluation 145
4.1 Preliminaries 146
4.2 General Approach to Solving a Classification Problem 148
4.3 Decision Tree Induction 150
4.3.1 How a Decision Tree Works 150
4.3.2 How to Build a Decision Tree 151
4.3.3 Methods for Expressing Attribute Test Conditions 155
4.3.4 Measures for Selecting the Best Split 158
4.3.5 Algorithm for Decision Tree Induction 164
4.3.6 An Example: Web Robot Detection 166
4.3.7 Characteristics of Decision Tree Induction 168
4.4 Model Overfitting 172
4.4.1 Overfitting Due to Presence of Noise 175
4.4.2 Overfitting Due to Lack of Representative Samples 177
4.4.3 Overfitting and the Multiple Comparison Procedure 178
4.4.4 Estimation of Generalization Errors 179
4.4.5 Handling Overfitting in Decision Tree Induction 184
4.5 Evaluating the Performance of a Classifier 186
4.5.1 Holdout Method 186
4.5.2 Random Subsampling 187
4.5.3 Cross-Validation 187
4.5.4 Bootstrap 188
4.6 Methods for Comparing Classifiers 188
4.6.1 Estimating a Confidence Interval for Accuracy 189
4.6.2 Comparing the Performance of Two Models 191
4.6.3 Comparing the Performance of Two Classifiers 192
4.7 Bibliographic Notes 193
4.8 Exercises 198

5 Classification: Alternative Techniques 207
5.1 Rule-Based Classifier 207
5.1.1 How a Rule-Based Classifier Works 209
5.1.2 Rule-Ordering Schemes 211
5.1.3 How to Build a Rule-Based Classifier 212
5.1.4 Direct Methods for Rule Extraction 213
5.1.5 Indirect Methods for Rule Extraction 221
5.1.6 Characteristics of Rule-Based Classifiers 223
5.2 Nearest-Neighbor Classifiers 223
5.2.1 Algorithm 225
5.2.2 Characteristics of Nearest-Neighbor Classifiers 226
5.3 Bayesian Classifiers 227
5.3.1 Bayes Theorem 228
5.3.2 Using the Bayes Theorem for Classification 229
5.3.3 Naïve Bayes Classifier 231
5.3.4 Bayes Error Rate 238
5.3.5 Bayesian Belief Networks 240
5.4 Artificial Neural Network (ANN) 246
5.4.1 Perceptron 247
5.4.2 Multilayer Artificial Neural Network 251
5.4.3 Characteristics of ANN 255
5.5 Support Vector Machine (SVM) 256
5.5.1 Maximum Margin Hyperplanes 256
5.5.2 Linear SVM: Separable Case 259
5.5.3 Linear SVM: Nonseparable Case 266
5.5.4 Nonlinear SVM 270
5.5.5 Characteristics of SVM 276
5.6 Ensemble Methods 276
5.6.1 Rationale for Ensemble Method 277
5.6.2 Methods for Constructing an Ensemble Classifier 278
5.6.3 Bias-Variance Decomposition 281
5.6.4 Bagging 283
5.6.5 Boosting 285
5.6.6 Random Forests 290
5.6.7 Empirical Comparison among Ensemble Methods 294
5.7 Class Imbalance Problem 294
5.7.1 Alternative Metrics 295
5.7.2 The Receiver Operating Characteristic Curve 298
5.7.3 Cost-Sensitive Learning 302
5.7.4 Sampling-Based Approaches 305
5.8 Multiclass Problem 306
5.9 Bibliographic Notes 309
5.10 Exercises 315

6 Association Analysis: Basic Concepts and Algorithms 327
6.1 Problem Definition 328
6.2 Frequent Itemset Generation 332
6.2.1 The Apriori Principle 333
6.2.2 Frequent Itemset Generation in the Apriori Algorithm 335
6.2.3 Candidate Generation and Pruning 338
6.2.4 Support Counting 342
6.2.5 Computational Complexity 345
6.3 Rule Generation 349
6.3.1 Confidence-Based Pruning 350
6.3.2 Rule Generation in Apriori Algorithm 350
6.3.3 An Example: Congressional Voting Records 352
6.4 Compact Representation of Frequent Itemsets 353
6.4.1 Maximal Frequent Itemsets 354
6.4.2 Closed Frequent Itemsets 355
6.5 Alternative Methods for Generating Frequent Itemsets 359
6.6 FP-Growth Algorithm 363
6.6.1 FP-Tree Representation 363
6.6.2 Frequent Itemset Generation in FP-Growth Algorithm 366
6.7 Evaluation of Association Patterns 370
6.7.1 Objective Measures of Interestingness 371
6.7.2 Measures beyond Pairs of Binary Variables 382
6.7.3 Simpson's Paradox 384
6.8 Effect of Skewed Support Distribution 386
6.9 Bibliographic Notes 390
6.10 Exercises 404

7 Association Analysis: Advanced Concepts 415
7.1 Handling Categorical Attributes 415
7.2 Handling Continuous Attributes 418
7.2.1 Discretization-Based Methods 418
7.2.2 Statistics-Based Methods 422
7.2.3 Non-discretization Methods 424
7.3 Handling a Concept Hierarchy 426
7.4 Sequential Patterns 429
7.4.1 Problem Formulation 429
7.4.2 Sequential Pattern Discovery 431
7.4.3 Timing Constraints 436
7.4.4 Alternative Counting Schemes 439
7.5 Subgraph Patterns 442
7.5.1 Graphs and Subgraphs 443
7.5.2 Frequent Subgraph Mining 444
7.5.3 Apriori-like Method 447
7.5.4 Candidate Generation 448
7.5.5 Candidate Pruning 453
7.5.6 Support Counting 457
7.6 Infrequent Patterns 457
7.6.1 Negative Patterns 458
7.6.2 Negatively Correlated Patterns 458
7.6.3 Comparisons among Infrequent Patterns, Negative Patterns, and Negatively Correlated Patterns 460
7.6.4 Techniques for Mining Interesting Infrequent Patterns 461
7.6.5 Techniques Based on Mining Negative Patterns 463
7.6.6 Techniques Based on Support Expectation 465
7.7 Bibliographic Notes 469
7.8 Exercises 473

8 Cluster Analysis: Basic Concepts and Algorithms 487
8.1 Overview 490
8.1.1 What Is Cluster Analysis? 490
8.1.2 Different Types of Clusterings 491
8.1.3 Different Types of Clusters 493
8.2 K-means 496
8.2.1 The Basic K-means Algorithm 497
8.2.2 K-means: Additional Issues 506
8.2.3 Bisecting K-means 508
8.2.4 K-means and Different Types of Clusters 510
8.2.5 Strengths and Weaknesses 510
8.2.6 K-means as an Optimization Problem 513
8.3 Agglomerative Hierarchical Clustering 515
8.3.1 Basic Agglomerative Hierarchical Clustering Algorithm 516
8.3.2 Specific Techniques 518
8.3.3 The Lance-Williams Formula for Cluster Proximity 524
8.3.4 Key Issues in Hierarchical Clustering 524
8.3.5 Strengths and Weaknesses 526
8.4 DBSCAN 526
8.4.1 Traditional Density: Center-Based Approach 527
8.4.2 The DBSCAN Algorithm 528
8.4.3 Strengths and Weaknesses 530
8.5 Cluster Evaluation 532
8.5.1 Overview 533
8.5.2 Unsupervised Cluster Evaluation Using Cohesion and Separation 536
8.5.3 Unsupervised Cluster Evaluation Using the Proximity Matrix 542
8.5.4 Unsupervised Evaluation of Hierarchical Clustering 544
8.5.5 Determining the Correct Number of Clusters 546
8.5.6 Clustering Tendency 547
8.5.7 Supervised Measures of Cluster Validity 548
8.5.8 Assessing the Significance of Cluster Validity Measures 553
8.6 Bibliographic Notes 555
8.7 Exercises 559

9 Cluster Analysis: Additional Issues and Algorithms 569
9.1 Characteristics of Data, Clusters, and Clustering Algorithms 570
9.1.1 Example: Comparing K-means and DBSCAN 570
9.1.2 Data Characteristics 571
9.1.3 Cluster Characteristics 573
9.1.4 General Characteristics of Clustering Algorithms 575
9.2 Prototype-Based Clustering 577
9.2.1 Fuzzy Clustering 577
9.2.2 Clustering Using Mixture Models 583
9.2.3 Self-Organizing Maps (SOM) 594
9.3 Density-Based Clustering 600
9.3.1 Grid-Based Clustering 601
9.3.2 Subspace Clustering 604
9.3.3 DENCLUE: A Kernel-Based Scheme for Density-Based Clustering 608
9.4 Graph-Based Clustering 612
9.4.1 Sparsification 613
9.4.2 Minimum Spanning Tree (MST) Clustering 614
9.4.3 OPOSSUM: Optimal Partitioning of Sparse Similarities Using METIS 616
9.4.4 Chameleon: Hierarchical Clustering with Dynamic Modeling 616
9.4.5 Shared Nearest Neighbor Similarity 622
9.4.6 The Jarvis-Patrick Clustering Algorithm 625
9.4.7 SNN Density 627
9.4.8 SNN Density-Based Clustering 629
9.5 Scalable Clustering Algorithms 630
9.5.1 Scalability: General Issues and Approaches 630
9.5.2 BIRCH 633
9.5.3 CURE 635
9.6 Which Clustering Algorithm? 639
9.7 Bibliographic Notes 643
9.8 Exercises 647

10 Anomaly Detection 651
10.1 Preliminaries 653
10.1.1 Causes of Anomalies 653
10.1.2 Approaches to Anomaly Detection 654
10.1.3 The Use of Class Labels 655
10.1.4 Issues 656
10.2 Statistical Approaches 658
10.2.1 Detecting Outliers in a Univariate Normal Distribution 659
10.2.2 Outliers in a Multivariate Normal Distribution 661
10.2.3 A Mixture Model Approach for Anomaly Detection 662
10.2.4 Strengths and Weaknesses 665
10.3 Proximity-Based Outlier Detection 666
10.3.1 Strengths and Weaknesses 666
10.4 Density-Based Outlier Detection 668
10.4.1 Detection of Outliers Using Relative Density 669
10.4.2 Strengths and Weaknesses 670
10.5 Clustering-Based Techniques 671
10.5.1 Assessing the Extent to Which an Object Belongs to a Cluster 672
10.5.2 Impact of Outliers on the Initial Clustering 674
10.5.3 The Number of Clusters to Use 674
10.5.4 Strengths and Weaknesses 674
10.6 Bibliographic Notes 675
10.7 Exercises 680

Appendix A Linear Algebra 685
A.1 Vectors 685
A.1.1 Definition 685
A.1.2 Vector Addition and Multiplication by a Scalar 685
A.1.3 Vector Spaces 687
A.1.4 The Dot Product, Orthogonality, and Orthogonal Projections 688
A.1.5 Vectors and Data Analysis 690
A.2 Matrices 691
A.2.1 Matrices: Definitions 691
A.2.2 Matrices: Addition and Multiplication by a Scalar 692
A.2.3 Matrices: Multiplication 693
A.2.4 Linear Transformations and Inverse Matrices 695
A.2.5 Eigenvalue and Singular Value Decomposition 697
A.2.6 Matrices and Data Analysis 699
A.3 Bibliographic Notes 700

Appendix B Dimensionality Reduction 701
B.1 PCA and SVD 701
B.1.1 Principal Components Analysis (PCA) 701
B.1.2 SVD 706
B.2 Other Dimensionality Reduction Techniques 708
B.2.1 Factor Analysis 708
B.2.2 Locally Linear Embedding (LLE) 710
B.2.3 Multidimensional Scaling, FastMap, and ISOMAP 712
B.2.4 Common Issues 715
B.3 Bibliographic Notes 716

Appendix C Probability and Statistics 719
C.1 Probability 719
C.1.1 Expected Values 722
C.2 Statistics 723
C.2.1 Point Estimation 724
C.2.2 Central Limit Theorem 724
C.2.3 Interval Estimation 725
C.3 Hypothesis Testing 726

Appendix D Regression 729
D.1 Preliminaries 729
D.2 Simple Linear Regression 730
D.2.1 Least Square Method 731
D.2.2 Analyzing Regression Errors 733
D.2.3 Analyzing Goodness of Fit 735
D.3 Multivariate Linear Regression 736
D.4 Alternative Least-Square Regression Methods 737

Appendix E Optimization 739
E.1 Unconstrained Optimization 739
E.1.1 Numerical Methods 742
E.2 Constrained Optimization 746
E.2.1 Equality Constraints 746
E.2.2 Inequality Constraints 747

Author Index 750
Subject Index 758
Copyright Permissions 769


1

Introduction

Rapid advances in data collection and storage technology have enabled organizations to accumulate vast amounts of data. However, extracting useful information has proven extremely challenging. Often, traditional data analysis tools and techniques cannot be used because of the massive size of a data set. Sometimes, the non-traditional nature of the data means that traditional approaches cannot be applied even if the data set is relatively small. In other situations, the questions that need to be answered cannot be addressed using existing data analysis techniques, and thus, new methods need to be developed.

Data mining is a technology that blends traditional data analysis methods with sophisticated algorithms for processing large volumes of data. It has also opened up exciting opportunities for exploring and analyzing new types of data and for analyzing old types of data in new ways. In this introductory chapter, we present an overview of data mining and outline the key topics to be covered in this book. We start with a description of some well-known applications that require new techniques for data analysis.

Business Point-of-sale data collection (bar code scanners, radio frequency identification (RFID), and smart card technology) have allowed retailers to collect up-to-the-minute data about customer purchases at the checkout counters of their stores. Retailers can utilize this information, along with other business-critical data such as Web logs from e-commerce Web sites and customer service records from call centers, to help them better understand the needs of their customers and make more informed business decisions.

Data mining techniques can be used to support a wide range of business intelligence applications such as customer profiling, targeted marketing, workflow management, store layout, and fraud detection. It can also help retailers

2 Chapter 1 Introduction 1.1 What Is Data Min ing? 3

answer important business questions such as "Who are the most profitable fut ure observation, such as predicting whether a newly arrived customer will
customers?" "What products can be cross-sold or up-sold?" and "What is the spend more t han $100 at a department store.
revenue outlook of the company for next year?" Some of these questions mo- Not all information d iscovery tasks are considered to be data mining . For
tivated the creation of association analysis (Chapters 6 and 7), a new data example, looking up ind ividual records using a database management system
analysis technique. or fi nding particular Web pages via a query to an Int ernet search engine are
tasks related to the area of information r etr ieval. Although such tasks are
Med icine, Science, and Eng ineering Researchers in medicine, science, important and may involve the use of the sophisticated algorithms and data
and engineering are rapidly accumulating data that is key to important new struct ures, t hey rely on traditional com puter science techniques and obvious
discoveries. For example, as an important step toward improving our under- fe at ures of the data to create index structures for efficiently organizing and
standing of the Earth's climate system, NASA has deployed a series of Earth- retrieving information. Nonetheless, d ata m ining techniques have been used
orbiting satellites that continuously generate global observations of the land to enhance information retrieval systems.
surface, oceans, and atmosphere. However, because of the size and spatio-temporal nature of the data, traditional methods are often not suitable for analyzing these data sets. Techniques developed in data mining can aid Earth scientists in answering questions such as "What is the relationship between the frequency and intensity of ecosystem disturbances, such as droughts and hurricanes, and global warming?" "How is land surface precipitation and temperature affected by ocean surface temperature?" and "How well can we predict the beginning and end of the growing season for a region?"

As another example, researchers in molecular biology hope to use the large amounts of genomic data currently being gathered to better understand the structure and function of genes. In the past, traditional methods in molecular biology allowed scientists to study only a few genes at a time in a given experiment. Recent breakthroughs in microarray technology have enabled scientists to compare the behavior of thousands of genes under various situations. Such comparisons can help determine the function of each gene and perhaps isolate the genes responsible for certain diseases. However, the noisy and high-dimensional nature of the data requires new types of data analysis. In addition to analyzing gene array data, data mining can also be used to address other important biological challenges such as protein structure prediction, multiple sequence alignment, the modeling of biochemical pathways, and phylogenetics.

1.1 What Is Data Mining?

Data mining is the process of automatically discovering useful information in large data repositories. Data mining techniques are deployed to scour large databases in order to find novel and useful patterns that might otherwise remain unknown. They also provide capabilities to predict the outcome of a future observation.

Data Mining and Knowledge Discovery

Data mining is an integral part of knowledge discovery in databases (KDD), which is the overall process of converting raw data into useful information, as shown in Figure 1.1. This process consists of a series of transformation steps, from data preprocessing to postprocessing of data mining results.

[Figure 1.1 shows the KDD pipeline: Input Data feeds a preprocessing step (feature selection, dimensionality reduction, normalization, data subsetting), followed by data mining and a postprocessing step (filtering patterns, visualization, pattern interpretation) that yields Information.]

Figure 1.1. The process of knowledge discovery in databases (KDD).

The input data can be stored in a variety of formats (flat files, spreadsheets, or relational tables) and may reside in a centralized data repository or be distributed across multiple sites. The purpose of preprocessing is to transform the raw input data into an appropriate format for subsequent analysis. The steps involved in data preprocessing include fusing data from multiple sources, cleaning data to remove noise and duplicate observations, and selecting records and features that are relevant to the data mining task at hand.
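The preprocessing steps just listed (fusing data from multiple sources, cleaning out duplicate observations, and selecting relevant records and features) can be sketched in a few lines of Python. This is only an illustration; the records and field names below are hypothetical, not from the text.

```python
# Sketch of the preprocessing stage of the KDD process: fuse data from
# multiple sources, drop duplicate observations, and keep only the
# features relevant to the mining task. Records and fields are made up.

def preprocess(sources, relevant_features):
    # Fuse: combine records gathered from multiple sources.
    fused = [record for source in sources for record in source]

    # Clean: remove duplicate observations.
    seen, cleaned = set(), []
    for record in fused:
        key = tuple(sorted(record.items()))
        if key not in seen:
            seen.add(key)
            cleaned.append(record)

    # Select: project each record onto the relevant features.
    return [{f: r[f] for f in relevant_features if f in r} for r in cleaned]

store_a = [{"id": 1, "age": 34, "spend": 120.0, "clerk": "ann"}]
store_b = [{"id": 1, "age": 34, "spend": 120.0, "clerk": "ann"},
           {"id": 2, "age": 51, "spend": 80.0, "clerk": "bob"}]

data = preprocess([store_a, store_b], ["id", "age", "spend"])
# The duplicate record is removed and the irrelevant "clerk" field dropped.
print(data)
```

In a real application each of these steps is far more involved (entity resolution, noise filtering, sampling), but the fuse-clean-select structure is the same.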
Because of the many ways data can be collected and stored, data preprocessing is perhaps the most laborious and time-consuming step in the overall knowledge discovery process.

"Closing the loop" is the phrase often used to refer to the process of integrating data mining results into decision support systems. For example, in business applications, the insights offered by data mining results can be integrated with campaign management tools so that effective marketing promotions can be conducted and tested. Such integration requires a postprocessing step that ensures that only valid and useful results are incorporated into the decision support system. An example of postprocessing is visualization (see Chapter 3), which allows analysts to explore the data and the data mining results from a variety of viewpoints. Statistical measures or hypothesis testing methods can also be applied during postprocessing to eliminate spurious data mining results.

1.2 Motivating Challenges

As mentioned earlier, traditional data analysis techniques have often encountered practical difficulties in meeting the challenges posed by new data sets. The following are some of the specific challenges that motivated the development of data mining.

Scalability Because of advances in data generation and collection, data sets with sizes of gigabytes, terabytes, or even petabytes are becoming common. If data mining algorithms are to handle these massive data sets, then they must be scalable. Many data mining algorithms employ special search strategies to handle exponential search problems. Scalability may also require the implementation of novel data structures to access individual records in an efficient manner. For instance, out-of-core algorithms may be necessary when processing data sets that cannot fit into main memory. Scalability can also be improved by using sampling or by developing parallel and distributed algorithms.

High Dimensionality It is now common to encounter data sets with hundreds or thousands of attributes instead of the handful common a few decades ago. In bioinformatics, progress in microarray technology has produced gene expression data involving thousands of features. Data sets with temporal or spatial components also tend to have high dimensionality. For example, consider a data set that contains measurements of temperature at various locations. If the temperature measurements are taken repeatedly for an extended period, the number of dimensions (features) increases in proportion to the number of measurements taken. Traditional data analysis techniques that were developed for low-dimensional data often do not work well for such high-dimensional data. Also, for some data analysis algorithms, the computational complexity increases rapidly as the dimensionality (the number of features) increases.

Heterogeneous and Complex Data Traditional data analysis methods often deal with data sets containing attributes of the same type, either continuous or categorical. As the role of data mining in business, science, medicine, and other fields has grown, so has the need for techniques that can handle heterogeneous attributes. Recent years have also seen the emergence of more complex data objects. Examples of such non-traditional types of data include collections of Web pages containing semi-structured text and hyperlinks; DNA data with sequential and three-dimensional structure; and climate data that consists of time series measurements (temperature, pressure, etc.) at various locations on the Earth's surface. Techniques developed for mining such complex objects should take into consideration relationships in the data, such as temporal and spatial autocorrelation, graph connectivity, and parent-child relationships between the elements in semi-structured text and XML documents.

Data Ownership and Distribution Sometimes, the data needed for an analysis is not stored in one location or owned by one organization. Instead, the data is geographically distributed among resources belonging to multiple entities. This requires the development of distributed data mining techniques. The key challenges faced by distributed data mining algorithms include (1) how to reduce the amount of communication needed to perform the distributed computation, (2) how to effectively consolidate the data mining results obtained from multiple sources, and (3) how to address data security issues.

Non-traditional Analysis The traditional statistical approach is based on a hypothesize-and-test paradigm. In other words, a hypothesis is proposed, an experiment is designed to gather the data, and then the data is analyzed with respect to the hypothesis. Unfortunately, this process is extremely labor-intensive. Current data analysis tasks often require the generation and evaluation of thousands of hypotheses, and consequently, the development of some data mining techniques has been motivated by the desire to automate the process of hypothesis generation and evaluation.

6 C hapter 1 Introduction 1.4 Data Mining Tasks 7

experiment and often represent opportunistic samples of the data, rat her than 1.4 D ata Mining Tasks
random samples. Also, t he data sets frequently involve non-traditional types
of data and data distributions. Data mining tasks are generally divided into two major categories:

Pred ictive tasks. The objective of these tasks is to predict the value of a par-
1.3 The Origins of Data Mining ticular attribute based on the values of other attributes. The attribute
to be predicted is commonly known as the target or depen dent vari-
Brought together by the goal of meeting the challenges of the previous sec- able, while the attributes used for making t he prediction are known as
tion, researchers from different disciplines began to focus on developing more the explanatory or independ ent variables.
efficient and scalable tools that could handle diverse types of data. This work,
which culminated in the field of data mining, built upon the methodology and Descr iptive tasks. Here, t he objective is to derive pat terns (correlations,
algorithms that researchers had previously used. In particular, data mining trends, clusters, trajectories, and anomalies) that summarize the un-
draws upon ideas, such as (1) sampling, estimation, and hypothesis testing derlying relationships in data. Descri ptive data mining tasks are often
from statistics and (2) search algorithms, modeling techniques, and learning exploratory in nature a nd frequently require postprocessing techniques
theories from artificial intelligence, pattern recognition, and machine learning. to validate and explain the results.
Data mining has also been quick to adopt ideas from other areas, including Figure 1.3 illustrates four of the core data mining tasks that are described
optimization, evolutionary computing, information theory, signal processing, in the remainder of t.his book.
visualization, and information retrieval.
A number of other areas also play key supporting roles. In particular,
database systems are needed to provide support for efficient. storage, index- •••
•• •
ing, and query processing. Techniques from high performance (parallel) com- •••
puting are often important in addressing the massive size of some data sets.

•••
• •• •
Distributed techniques can also help address the issue of size and are essential
when the data cannot be gathered in one location.
• ltl 110 "'0
Dala
tl,\riUI Anrru..\1 DtiJ,.II('d
O..ntr!ir..IIIO ~Oomtfler

Figure 1.2 shows the relationship of data mining to other areas.


~- - 11101 ...
....... !20M ...
~ltSI(

.
.....

,......, lOOt "'-

....... ...
-- ..
~~x- ...
_ ,..
.

Q 0

Figure 1.3. Four of the core data mining tasks.


Figure 1.2. Data mining as a confluence of many disciplines.
-----==========~========

Predictive modeling refers to the task of building a model for the target variable as a function of the explanatory variables. There are two types of predictive modeling tasks: classification, which is used for discrete target variables, and regression, which is used for continuous target variables. For example, predicting whether a Web user will make a purchase at an online bookstore is a classification task because the target variable is binary-valued. On the other hand, forecasting the future price of a stock is a regression task because price is a continuous-valued attribute. The goal of both tasks is to learn a model that minimizes the error between the predicted and true values of the target variable. Predictive modeling can be used to identify customers that will respond to a marketing campaign, predict disturbances in the Earth's ecosystem, or judge whether a patient has a particular disease based on the results of medical tests.

Example 1.1 (Predicting the Type of a Flower). Consider the task of predicting a species of flower based on the characteristics of the flower. In particular, consider classifying an Iris flower as to whether it belongs to one of the following three Iris species: Setosa, Versicolour, or Virginica. To perform this task, we need a data set containing the characteristics of various flowers of these three species. A data set with this type of information is the well-known Iris data set from the UCI Machine Learning Repository at http://www.ics.uci.edu/~mlearn. In addition to the species of a flower, this data set contains four other attributes: sepal width, sepal length, petal length, and petal width. (The Iris data set and its attributes are described further in Section 3.1.) Figure 1.4 shows a plot of petal width versus petal length for the 150 flowers in the Iris data set. Petal width is broken into the categories low, medium, and high, which correspond to the intervals [0, 0.75), [0.75, 1.75), and [1.75, ∞), respectively. Also, petal length is broken into the categories low, medium, and high, which correspond to the intervals [0, 2.5), [2.5, 5), and [5, ∞), respectively. Based on these categories of petal width and length, the following rules can be derived:

Petal width low and petal length low implies Setosa.
Petal width medium and petal length medium implies Versicolour.
Petal width high and petal length high implies Virginica.

While these rules do not classify all the flowers, they do a good (but not perfect) job of classifying most of the flowers. Note that flowers from the Setosa species are well separated from the Versicolour and Virginica species with respect to petal width and length, but the latter two species overlap somewhat with respect to these attributes. •

[Figure 1.4 is a scatter plot of petal width (cm) versus petal length (cm) for the 150 Iris flowers, with the Setosa, Versicolour, and Virginica species marked and dashed lines at the category boundaries (petal width 0.75 and 1.75; petal length 2.5 and 5).]

Figure 1.4. Petal width versus petal length for 150 Iris flowers.
of the following three Iris species: Setosa, Versicolour, or Virginica. To per- Figure 1.4. Petal width versus petal length tor 150 Iris flowers.
form this task, we need a data set containing the characteristics of various
flowers of these three species. A data set with this type of information is
the well-known Iris data set from the UCI Machine Learning Reposit ory at
ht t p://W\Iw. i cs.uci.edu/~m1earn. In addition to the species of a flower, Association analysis is used to discover patterns that descri be strongly as-
this data set contains four other attributes: sepal width , sepal length, petal sociated features in the data. The discovered patterns are typically represented
length, and petal width. (The Iris data set and its attributes are described in the form of implication rules or feature subsets. Because of the exponential
further in Section 3.1.) Figure 1.4 shows a plot of petal width versus petal size of its search space, the goal of association analysis is to extract the most
length for the 150 flowers in the Iris data set. Petal width is broken into the interesting patterns in an efficient manner. Useful applications of association
categories low, medium, and high, which correspond to the intervals [0, 0.75) , analysis include finding grou ps of genes that have related functionality, identi-
[0.75, 1.75), [1.75, oo), respectively. Also, petal length is broken into categories fying Web pages that are accessed together, or understanding the relationships
low, medium, and high, which correspond to the intervals [0, 2.5), [2.5, 5), [5, between different elements of Earth's climate system.
oo), respectively. Based on these categories of petal width and length , the
following rules can be derived: Example 1. 2 (Market Basket Analysis) . The transactions shown in Ta-
ble 1.1 illustrate point-of-sale data collected at the checkout counters of a
Petal width low and petal length low implies Setosa. grocery store. Association analysis can be applied to find items that are fre-
Petal width medium and petal length medium implies Versicolour. quently bought together by customers. For example, we may discover the
Petal width high and petal length high implies Virginica. rule {Di apers} --+ {Milk}, which suggests that customers who buy diapers
While these rules do not classify all the flowers, they do a good (but not also tend to buy milk. This type of rule can be used to identify potential
perfect) job of classifying most of the flowers. Note that flowers from the cross-selling opportunities among related items. •
Setosa species are well separated from the Versicolour and Virginica species
with respect to petal width and length, but the latter two species overlap Cluster a nalysis seeks to find groups of closely related observations so that
somewhat with respect to these attributes. • observations that belong to the same cluster are more similar to each other
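The strength of a rule such as {Diapers} → {Milk} is usually summarized by its support and confidence, measures that are treated formally in Chapter 6. As a quick sketch, they can be computed directly on the ten transactions of Table 1.1:

```python
# Support and confidence of the rule {Diapers} -> {Milk} over the
# transactions of Table 1.1.

transactions = [
    {"Bread", "Butter", "Diapers", "Milk"},
    {"Coffee", "Sugar", "Cookies", "Salmon"},
    {"Bread", "Butter", "Coffee", "Diapers", "Milk", "Eggs"},
    {"Bread", "Butter", "Salmon", "Chicken"},
    {"Eggs", "Bread", "Butter"},
    {"Salmon", "Diapers", "Milk"},
    {"Bread", "Tea", "Sugar", "Eggs"},
    {"Coffee", "Sugar", "Chicken", "Eggs"},
    {"Bread", "Diapers", "Milk", "Salt"},
    {"Tea", "Eggs", "Cookies", "Diapers", "Milk"},
]

def support(itemset):
    # Fraction of transactions containing every item of the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    # Of the transactions containing the antecedent, the fraction that
    # also contain the consequent.
    return support(antecedent | consequent) / support(antecedent)

print(support({"Diapers", "Milk"}))       # 0.5: half of all baskets
print(confidence({"Diapers"}, {"Milk"}))  # 1.0: every Diapers basket has Milk
```

Real association analysis algorithms avoid enumerating every itemset this way; Chapter 6 discusses how the exponential search space is pruned.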
Table 1.1. Market basket data.

Transaction ID   Items
1                {Bread, Butter, Diapers, Milk}
2                {Coffee, Sugar, Cookies, Salmon}
3                {Bread, Butter, Coffee, Diapers, Milk, Eggs}
4                {Bread, Butter, Salmon, Chicken}
5                {Eggs, Bread, Butter}
6                {Salmon, Diapers, Milk}
7                {Bread, Tea, Sugar, Eggs}
8                {Coffee, Sugar, Chicken, Eggs}
9                {Bread, Diapers, Milk, Salt}
10               {Tea, Eggs, Cookies, Diapers, Milk}

Cluster analysis seeks to find groups of closely related observations so that observations that belong to the same cluster are more similar to each other than observations that belong to other clusters. Clustering has been used to group sets of related customers, find areas of the ocean that have a significant impact on the Earth's climate, and compress data.

Example 1.3 (Document Clustering). The collection of news articles shown in Table 1.2 can be grouped based on their respective topics. Each article is represented as a set of word-frequency pairs (w, c), where w is a word and c is the number of times the word appears in the article. There are two natural clusters in the data set. The first cluster consists of the first four articles, which correspond to news about the economy, while the second cluster contains the last four articles, which correspond to news about health care. A good clustering algorithm should be able to identify these two clusters based on the similarity between words that appear in the articles. •

Table 1.2. Collection of news articles.

Article   Words
1         dollar: 1, industry: 4, country: 2, loan: 3, deal: 2, government: 2
2         machinery: 2, labor: 3, market: 4, industry: 2, work: 3, country: 1
3         job: 5, inflation: 3, rise: 2, jobless: 2, market: 3, country: 2, index: 3
4         domestic: 3, forecast: 2, gain: 1, market: 2, sale: 3, price: 2
5         patient: 4, symptom: 2, drug: 3, health: 2, clinic: 2, doctor: 2
6         pharmaceutical: 2, company: 3, drug: 2, vaccine: 1, flu: 3
7         death: 2, cancer: 4, drug: 3, public: 4, health: 3, director: 2
8         medical: 2, cost: 3, increase: 2, patient: 2, health: 3, care: 1
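One common way to quantify the word-based similarity that Example 1.3 relies on is the cosine similarity between word-frequency vectors (the specific measure is our choice here; similarity measures are covered in Chapter 2). Articles on the same topic share vocabulary and therefore score higher, as a check on two economy articles and one health care article from Table 1.2 shows:

```python
from math import sqrt

# Cosine similarity between articles represented as word-frequency
# pairs, as in Table 1.2.

def cosine(a, b):
    dot = sum(c * b[w] for w, c in a.items() if w in b)
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm

economy_1 = {"dollar": 1, "industry": 4, "country": 2, "loan": 3, "deal": 2, "government": 2}
economy_2 = {"machinery": 2, "labor": 3, "market": 4, "industry": 2, "work": 3, "country": 1}
health_1  = {"patient": 4, "symptom": 2, "drug": 3, "health": 2, "clinic": 2, "doctor": 2}

print(cosine(economy_1, economy_2))  # about 0.25: shared economy vocabulary
print(cosine(economy_1, health_1))   # 0.0: no words in common
```

A clustering algorithm built on such a measure would place articles 1-4 in one group and articles 5-8 in another, matching the two natural clusters described in the example.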
Anomaly detection is the task of identifying observations whose characteristics are significantly different from the rest of the data. Such observations are known as anomalies or outliers. The goal of an anomaly detection algorithm is to discover the real anomalies and avoid falsely labeling normal objects as anomalous. In other words, a good anomaly detector must have a high detection rate and a low false alarm rate. Applications of anomaly detection include the detection of fraud, network intrusions, unusual patterns of disease, and ecosystem disturbances.

Example 1.4 (Credit Card Fraud Detection). A credit card company records the transactions made by every credit card holder, along with personal information such as credit limit, age, annual income, and address. Since the number of fraudulent cases is relatively small compared to the number of legitimate transactions, anomaly detection techniques can be applied to build a profile of legitimate transactions for the users. When a new transaction arrives, it is compared against the profile of the user. If the characteristics of the transaction are very different from the previously created profile, then the transaction is flagged as potentially fraudulent. •
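A minimal version of the profile-based scheme in Example 1.4 can be sketched by summarizing a user's legitimate transaction amounts with their mean and standard deviation and flagging a new amount that deviates too far. The z-score threshold of 3 is a common rule of thumb, not from the text, and the transaction amounts are invented.

```python
from statistics import mean, stdev

# Profile-based anomaly detection in the spirit of Example 1.4: build a
# profile of a user's legitimate transaction amounts, then flag a new
# transaction whose amount lies far outside that profile.

def build_profile(amounts):
    return {"mean": mean(amounts), "stdev": stdev(amounts)}

def is_anomalous(profile, amount, threshold=3.0):
    z = abs(amount - profile["mean"]) / profile["stdev"]
    return z > threshold

history = [22.0, 35.5, 18.0, 41.0, 27.5, 30.0, 25.0, 33.0]
profile = build_profile(history)

print(is_anomalous(profile, 29.0))   # False: close to the usual spending
print(is_anomalous(profile, 950.0))  # True: far outside the profile
```

A real fraud detector would profile many characteristics at once (merchant, location, time of day), but the idea is the same: score the deviation from a model of legitimate behavior. Chapter 10 covers statistical and other approaches in detail.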
1.5 Scope and Organization of the Book

This book introduces the major principles and techniques used in data mining from an algorithmic perspective. A study of these principles and techniques is essential for developing a better understanding of how data mining technology can be applied to various kinds of data. This book also serves as a starting point for readers who are interested in doing research in this field.

We begin the technical discussion of this book with a chapter on data (Chapter 2), which discusses the basic types of data, data quality, preprocessing techniques, and measures of similarity and dissimilarity. Although this material can be covered quickly, it provides an essential foundation for data analysis. Chapter 3, on data exploration, discusses summary statistics, visualization techniques, and On-Line Analytical Processing (OLAP). These techniques provide the means for quickly gaining insight into a data set.

Chapters 4 and 5 cover classification. Chapter 4 provides a foundation by discussing decision tree classifiers and several issues that are important to all classification: overfitting, performance evaluation, and the comparison of different classification models. Using this foundation, Chapter 5 describes a number of other important classification techniques: rule-based systems, nearest-neighbor classifiers, Bayesian classifiers, artificial neural networks, support vector machines, and ensemble classifiers, which are collections of classifiers. The multiclass and imbalanced class problems are also discussed. These topics can be covered independently.

Association analysis is explored in Chapters 6 and 7. Chapter 6 describes the basics of association analysis: frequent itemsets, association rules, and some of the algorithms used to generate them. Specific types of frequent itemsets (maximal, closed, and hyperclique) that are important for data mining are also discussed, and the chapter concludes with a discussion of evaluation measures for association analysis. Chapter 7 considers a variety of more advanced topics, including how association analysis can be applied to categorical and continuous data or to data that has a concept hierarchy. (A concept hierarchy is a hierarchical categorization of objects, e.g., store items, clothing, shoes, sneakers.) This chapter also describes how association analysis can be extended to find sequential patterns (patterns involving order), patterns in graphs, and negative relationships (if one item is present, then the other is not).

Cluster analysis is discussed in Chapters 8 and 9. Chapter 8 first describes the different types of clusters and then presents three specific clustering techniques: K-means, agglomerative hierarchical clustering, and DBSCAN. This is followed by a discussion of techniques for validating the results of a clustering algorithm. Additional clustering concepts and techniques are explored in Chapter 9, including fuzzy and probabilistic clustering, Self-Organizing Maps (SOM), graph-based clustering, and density-based clustering. There is also a discussion of scalability issues and factors to consider when selecting a clustering algorithm.

The last chapter, Chapter 10, is on anomaly detection. After some basic definitions, several different types of anomaly detection are considered: statistical, distance-based, density-based, and clustering-based. Appendices A through E give a brief review of important topics that are used in portions of the book: linear algebra, dimensionality reduction, statistics, regression, and optimization.

The subject of data mining, while relatively young compared to statistics or machine learning, is already too large to cover in a single book. Selected references to topics that are only briefly covered, such as data quality, are provided in the bibliographic notes of the appropriate chapter. References to topics not covered in this book, such as data mining for streams and privacy-preserving data mining, are provided in the bibliographic notes of this chapter.

1.6 Bibliographic Notes

The topic of data mining has inspired many textbooks. Introductory textbooks include those by Dunham [10], Han and Kamber [21], Hand et al. [23], and Roiger and Geatz [36]. Data mining books with a stronger emphasis on business applications include the works by Berry and Linoff [2], Pyle [34], and Parr Rud [33]. Books with an emphasis on statistical learning include those by Cherkassky and Mulier [6] and Hastie et al. [24]. Some books with an emphasis on machine learning or pattern recognition are those by Duda et al. [9], Kantardzic [25], Mitchell [31], Webb [41], and Witten and Frank [42]. There are also some more specialized books: Chakrabarti [4] (web mining), Fayyad et al. [13] (collection of early articles on data mining), Fayyad et al. [11] (visualization), Grossman et al. [18] (science and engineering), Kargupta and Chan [26] (distributed data mining), Wang et al. [40] (bioinformatics), and Zaki and Ho [44] (parallel data mining).

There are several conferences related to data mining. Some of the main conferences dedicated to this field include the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), the IEEE International Conference on Data Mining (ICDM), the SIAM International Conference on Data Mining (SDM), the European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), and the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD). Data mining papers can also be found in other major conferences such as the ACM SIGMOD/PODS conference, the International Conference on Very Large Data Bases (VLDB), the Conference on Information and Knowledge Management (CIKM), the International Conference on Data Engineering (ICDE), the International Conference on Machine Learning (ICML), and the National Conference on Artificial Intelligence (AAAI).

Journal publications on data mining include IEEE Transactions on Knowledge and Data Engineering, Data Mining and Knowledge Discovery, Knowledge and Information Systems, Intelligent Data Analysis, Information Systems, and the Journal of Intelligent Information Systems.

There have been a number of general articles on data mining that define the field or its relationship to other fields, particularly statistics. Fayyad et al. [12] describe data mining and how it fits into the total knowledge discovery process. Chen et al. [5] give a database perspective on data mining. Ramakrishnan and Grama [35] provide a general discussion of data mining and present several viewpoints. Hand [22] describes how data mining differs from statistics, as does Friedman [14]. Lambert [29] explores the use of statistics for large data sets and provides some comments on the respective roles of data mining and statistics.
Glymour et al. [16] consider the lessons that statistics may have for data mining. Smyth et al. [38] describe how the evolution of data mining is being driven by new types of data and applications, such as those involving streams, graphs, and text. Emerging applications in data mining are considered by Han et al. [20], and Smyth [37] describes some research challenges in data mining. A discussion of how developments in data mining research can be turned into practical tools is given by Wu et al. [43]. Data mining standards are the subject of a paper by Grossman et al. [17]. Bradley [3] discusses how data mining algorithms can be scaled to large data sets.

With the emergence of new data mining applications have come new challenges that need to be addressed. For instance, concerns about privacy breaches as a result of data mining have escalated in recent years, particularly in application domains such as Web commerce and health care. As a result, there is growing interest in developing data mining algorithms that maintain user privacy. Developing techniques for mining encrypted or randomized data is known as privacy-preserving data mining. Some general references in this area include papers by Agrawal and Srikant [1], Clifton et al. [7], and Kargupta et al. [27]. Vassilios et al. [39] provide a survey.

Recent years have witnessed a growing number of applications that rapidly generate continuous streams of data. Examples of stream data include network traffic, multimedia streams, and stock prices. Several issues must be considered when mining data streams, such as the limited amount of memory available, the need for online analysis, and the change of the data over time. Data mining for stream data has become an important area in data mining. Some selected publications are Domingos and Hulten [8] (classification), Giannella et al. [15] (association analysis), Guha et al. [19] (clustering), Kifer et al. [28] (change detection), Papadimitriou et al. [32] (time series), and Law et al. [30] (dimensionality reduction).

Bibliography

[1] R. Agrawal and R. Srikant. Privacy-preserving data mining. In Proc. of 2000 ACM-SIGMOD Intl. Conf. on Management of Data, pages 439-450, Dallas, Texas, 2000. ACM Press.
[2] M. J. A. Berry and G. Linoff. Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management. Wiley Computer Publishing, 2nd edition, 2004.
[3] P. S. Bradley, J. Gehrke, R. Ramakrishnan, and R. Srikant. Scaling mining algorithms to large databases. Communications of the ACM, 45(8):38-43, 2002.
[4] S. Chakrabarti. Mining the Web: Discovering Knowledge from Hypertext Data. Morgan Kaufmann, San Francisco, CA, 2003.
[5] M.-S. Chen, J. Han, and P. S. Yu. Data Mining: An Overview from a Database Perspective. IEEE Transactions on Knowledge and Data Engineering, 8(6):865-883, 1996.
[6] V. Cherkassky and F. Mulier. Learning from Data: Concepts, Theory, and Methods. Wiley Interscience, 1998.
[7] C. Clifton, M. Kantarcioglu, and J. Vaidya. Defining privacy for data mining. In National Science Foundation Workshop on Next Generation Data Mining, pages 125-133, Baltimore, MD, November 2002.
[8] P. Domingos and G. Hulten. Mining high-speed data streams. In Proc. of the 6th Intl. Conf. on Knowledge Discovery and Data Mining, pages 71-80, Boston, Massachusetts, 2000. ACM Press.
[9] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley & Sons, Inc., New York, 2nd edition, 2001.
[10] M. H. Dunham. Data Mining: Introductory and Advanced Topics. Prentice Hall, 2002.
[11] U. M. Fayyad, G. G. Grinstein, and A. Wierse, editors. Information Visualization in Data Mining and Knowledge Discovery. Morgan Kaufmann Publishers, San Francisco, CA, September 2001.
[12] U. M. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. From Data Mining to Knowledge Discovery: An Overview. In Advances in Knowledge Discovery and Data Mining, pages 1-34. AAAI Press, 1996.
[13] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
[14] J. H. Friedman. Data Mining and Statistics: What's the Connection? Unpublished. www-stat.stanford.edu/~jhf/ftp/dm-stat.ps, 1997.
[15] C. Giannella, J. Han, J. Pei, X. Yan, and P. S. Yu. Mining Frequent Patterns in Data Streams at Multiple Time Granularities. In H. Kargupta, A. Joshi, K. Sivakumar, and Y. Yesha, editors, Next Generation Data Mining, pages 191-212. AAAI/MIT, 2003.
[16] C. Glymour, D. Madigan, D. Pregibon, and P. Smyth. Statistical Themes and Lessons for Data Mining. Data Mining and Knowledge Discovery, 1(1):11-28, 1997.
[17] R. L. Grossman, M. F. Hornick, and G. Meyer. Data mining standards initiatives. Communications of the ACM, 45(8):59-61, 2002.
[18] R. L. Grossman, C. Kamath, P. Kegelmeyer, V. Kumar, and R. Namburu, editors. Data Mining for Scientific and Engineering Applications. Kluwer Academic Publishers, 2001.
[19] S. Guha, A. Meyerson, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering Data Streams: Theory and Practice. IEEE Transactions on Knowledge and Data Engineering, 15(3):515-528, May/June 2003.
[20] J. Han, R. B. Altman, V. Kumar, H. Mannila, and D. Pregibon. Emerging scientific applications in data mining. Communications of the ACM, 45(8):54-58, 2002.
[21] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, San Francisco, 2001.
[22] D. J. Hand. Data Mining: Statistics and More? The American Statistician, 52(2):112-118, 1998.
[23] D. J. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. MIT Press, 2001.
[24] T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning: Data Mining, Inference, Prediction. Springer, New York, 2001.
[25] M. Kantardzic. Data Mining: Concepts, Models, Methods, and Algorithms. Wiley-IEEE Press, Piscataway, NJ, 2003.
[26] H. Kargupta and P. K. Chan, editors. Advances in Distributed and Parallel Knowledge Discovery. AAAI Press, September 2002.
[27] H. Kargupta, S. Datta, Q. Wang, and K. Sivakumar. On the Privacy Preserving Properties of Random Data Perturbation Techniques. In Proc. of the 2003 IEEE Intl. Conf. on Data Mining, pages 99-106, Melbourne, Florida, December 2003. IEEE Computer Society.
[28] D. Kifer, S. Ben-David, and J. Gehrke. Detecting Change in Data Streams. In Proc. of the 30th VLDB Conf., pages 180-191, Toronto, Canada, 2004. Morgan Kaufmann.
[29] D. Lambert. What Use is Statistics for Massive Data? In ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pages 54-62, 2000.
[30] M. H. C. Law, N. Zhang, and A. K. Jain. Nonlinear Manifold Learning for Data Streams. In Proc. of the SIAM Intl. Conf. on Data Mining, Lake Buena Vista, Florida, April 2004. SIAM.
[31] T. Mitchell. Machine Learning. McGraw-Hill, Boston, MA, 1997.
[32] S. Papadimitriou, A. Brockwell, and C. Faloutsos. Adaptive, unsupervised stream mining. VLDB Journal, 13(3):222-239, 2004.
[33] O. Parr Rud. Data Mining Cookbook: Modeling Data for Marketing, Risk and Customer Relationship Management. John Wiley & Sons, New York, NY, 2001.
[34] D. Pyle. Business Modeling and Data Mining. Morgan Kaufmann, San Francisco, CA, 2003.
[35] N. Ramakrishnan and A. Grama. Data Mining: From Serendipity to Science-Guest Editors' Introduction. IEEE Computer, 32(8):34-37, 1999.
[36] R. Roiger and M. Geatz. Data Mining: A Tutorial Based Primer. Addison-Wesley, 2002.
[37] P. Smyth. Breaking out of the Black-Box: Research Challenges in Data Mining. In Proc. of the 2001 ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, 2001.
[38] P. Smyth, D. Pregibon, and C. Faloutsos. Data-driven evolution of data mining algorithms. Communications of the ACM, 45(8):33-37, 2002.
[39] V. S. Verykios, E. Bertino, I. N. Fovino, L. P. Provenza, Y. Saygin, and Y. Theodoridis. State-of-the-art in privacy preserving data mining. SIGMOD Record, 33(1):50-57, 2004.
[40] J. T. L. Wang, M. J. Zaki, H. Toivonen, and D. E. Shasha, editors. Data Mining in Bioinformatics. Springer, September 2004.
[41] A. R. Webb. Statistical Pattern Recognition. John Wiley & Sons, 2nd edition, 2002.
[42] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, 1999.
[43] X. Wu, P. S. Yu, and G. Piatetsky-Shapiro. Data Mining: How Research Meets Practical Development? Knowledge and Information Systems, 5(2):248-261, 2003.
[44] M. J. Zaki and C.-T. Ho, editors. Large-Scale Parallel Data Mining. Springer, September 2002.

1.7 Exercises

1. Discuss whether or not each of the following activities is a data mining task.

(a) Dividing the customers of a company according to their gender.
(b) Dividing the customers of a company according to their profitability.
(c) Computing the total sales of a company.
(d) Sorting a student database based on student identification numbers.
(e) Predicting the outcomes of tossing a (fair) pair of dice.
(f) Predicting the future stock price of a company using historical records.
(g) Monitoring the heart rate of a patient for abnormalities.
(h) Monitoring seismic waves for earthquake activities.
(i) Extracting the frequencies of a sound wave.

2. Suppose that you are employed as a data mining consultant for an Internet search engine company. Describe how data mining can help the company by giving specific examples of how techniques such as clustering, classification, association rule mining, and anomaly detection can be applied.

3. For each of the following data sets, explain whether or not data privacy is an important issue.

(a) Census data collected from 1900-1950.
(b) IP addresses and visit times of Web users who visit your Website.
(c) Images from Earth-orbiting satellites.
(d) Names and addresses of people from the telephone book.
(e) Names and email addresses collected from the Web.
2

Data
This chapter discusses several data-related issues that are important for successful data mining:

The Type of Data  Data sets differ in a number of ways. For example, the attributes used to describe data objects can be of different types-quantitative or qualitative-and data sets may have special characteristics; e.g., some data sets contain time series or objects with explicit relationships to one another. Not surprisingly, the type of data determines which tools and techniques can be used to analyze the data. Furthermore, new research in data mining is often driven by the need to accommodate new application areas and their new types of data.

The Quality of the Data  Data is often far from perfect. While most data mining techniques can tolerate some level of imperfection in the data, a focus on understanding and improving data quality typically improves the quality of the resulting analysis. Data quality issues that often need to be addressed include the presence of noise and outliers; missing, inconsistent, or duplicate data; and data that is biased or, in some other way, unrepresentative of the phenomenon or population that the data is supposed to describe.

Preprocessing Steps to Make the Data More Suitable for Data Mining  Often, the raw data must be processed in order to make it suitable for analysis. While one objective may be to improve data quality, other goals focus on modifying the data so that it better fits a specified data mining technique or tool. For example, a continuous attribute, e.g., length, may need to be transformed into an attribute with discrete categories, e.g., short, medium, or long, in order to apply a particular technique. As another example, the
number of attributes in a data set is often reduced because many techniques are more effective when the data has a relatively small number of attributes.

Analyzing Data in Terms of Its Relationships  One approach to data analysis is to find relationships among the data objects and then perform the remaining analysis using these relationships rather than the data objects themselves. For instance, we can compute the similarity or distance between pairs of objects and then perform the analysis-clustering, classification, or anomaly detection-based on these similarities or distances. There are many such similarity or distance measures, and the proper choice depends on the type of data and the particular application.

Example 2.1 (An Illustration of Data-Related Issues). To further illustrate the importance of these issues, consider the following hypothetical situation. You receive an email from a medical researcher concerning a project that you are eager to work on.

    Hi,
    I've attached the data file that I mentioned in my previous email. Each line contains the information for a single patient and consists of five fields. We want to predict the last field using the other fields. I don't have time to provide any more information about the data since I'm going out of town for a couple of days, but hopefully that won't slow you down too much. And if you don't mind, could we meet when I get back to discuss your preliminary results? I might invite a few other members of my team.
    Thanks and see you in a couple of days.

Despite some misgivings, you proceed to analyze the data. The first few rows of the file are as follows:

012 232 33.5 0 10.7
020 121 16.9 2 210.1
027 165 24.0 0 427.6

A brief look at the data reveals nothing strange. You put your doubts aside and start the analysis. There are only 1000 lines, a smaller data file than you had hoped for, but two days later, you feel that you have made some progress. You arrive for the meeting, and while waiting for others to arrive, you strike up a conversation with a statistician who is working on the project. When she learns that you have also been analyzing the data from the project, she asks if you would mind giving her a brief overview of your results.

Statistician: So, you got the data for all the patients?
Data Miner: Yes. I haven't had much time for analysis, but I do have a few interesting results.
Statistician: Amazing. There were so many data issues with this set of patients that I couldn't do much.
Data Miner: Oh? I didn't hear about any possible problems.
Statistician: Well, first there is field 5, the variable we want to predict. It's common knowledge among people who analyze this type of data that results are better if you work with the log of the values, but I didn't discover this until later. Was it mentioned to you?
Data Miner: No.
Statistician: But surely you heard about what happened to field 4? It's supposed to be measured on a scale from 1 to 10, with 0 indicating a missing value, but because of a data entry error, all 10's were changed into 0's. Unfortunately, since some of the patients have missing values for this field, it's impossible to say whether a 0 in this field is a real 0 or a 10. Quite a few of the records have that problem.
Data Miner: Interesting. Were there any other problems?
Statistician: Yes, fields 2 and 3 are basically the same, but I assume that you probably noticed that.
Data Miner: Yes, but these fields were only weak predictors of field 5.
Statistician: Anyway, given all those problems, I'm surprised you were able to accomplish anything.
Data Miner: True, but my results are really quite good. Field 1 is a very strong predictor of field 5. I'm surprised that this wasn't noticed before.
Statistician: What? Field 1 is just an identification number.
Data Miner: Nonetheless, my results speak for themselves.
Statistician: Oh, no! I just remembered. We assigned ID numbers after we sorted the records based on field 5. There is a strong connection, but it's meaningless. Sorry.

•
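The trap the dialogue ends on (ID numbers assigned after the records were sorted on the variable to be predicted) is easy to reproduce. The sketch below uses synthetic, made-up values rather than the researcher's file, and the `pearson` helper is just a plain implementation of the sample correlation coefficient.

```python
import random

# Hypothetical illustration of the ID-number pitfall from Example 2.1:
# records are sorted by the target (field 5) *before* IDs are assigned,
# so the ID becomes a spuriously strong "predictor."
random.seed(0)
field5 = sorted(random.uniform(10, 500) for _ in range(1000))
ids = list(range(1, 1001))  # IDs handed out after sorting

def pearson(x, y):
    """Plain sample (Pearson) correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

r = pearson(ids, field5)
print(round(r, 3))  # very close to 1.0, yet the relationship is meaningless
```

Any model fit on such an ID "feature" would look impressive on this data while carrying no predictive value for new patients.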
Although this scenario represents an extreme situation, it emphasizes the importance of "knowing your data." To that end, this chapter will address each of the four issues mentioned above, outlining some of the basic challenges and standard approaches.

2.1 Types of Data

A data set can often be viewed as a collection of data objects. Other names for a data object are record, point, vector, pattern, event, case, sample, observation, or entity. In turn, data objects are described by a number of attributes that capture the basic characteristics of an object, such as the mass of a physical object or the time at which an event occurred. Other names for an attribute are variable, characteristic, field, feature, or dimension.

Example 2.2 (Student Information). Often, a data set is a file, in which the objects are records (or rows) in the file and each field (or column) corresponds to an attribute. For example, Table 2.1 shows a data set that consists of student information. Each row corresponds to a student and each column is an attribute that describes some aspect of a student, such as grade point average (GPA) or identification number (ID).

Table 2.1. A sample data set containing student information.

Student ID    Year         Grade Point Average (GPA)
1034262       Senior       3.24
1052663       Sophomore    3.51
1082246       Freshman     3.62

•

Although record-based data sets are common, either in flat files or relational database systems, there are other important types of data sets and systems for storing data. In Section 2.1.2, we will discuss some of the types of data sets that are commonly encountered in data mining. However, we first consider attributes.

2.1.1 Attributes and Measurement

In this section we address the issue of describing data by considering what types of attributes are used to describe data objects. We first define an attribute, then consider what we mean by the type of an attribute, and finally describe the types of attributes that are commonly encountered.

What Is an Attribute?

We start with a more detailed definition of an attribute.

Definition 2.1. An attribute is a property or characteristic of an object that may vary, either from one object to another or from one time to another.

For example, eye color varies from person to person, while the temperature of an object varies over time. Note that eye color is a symbolic attribute with a small number of possible values {brown, black, blue, green, hazel, etc.}, while temperature is a numerical attribute with a potentially unlimited number of values.

At the most basic level, attributes are not about numbers or symbols. However, to discuss and more precisely analyze the characteristics of objects, we assign numbers or symbols to them. To do this in a well-defined way, we need a measurement scale.

Definition 2.2. A measurement scale is a rule (function) that associates a numerical or symbolic value with an attribute of an object.

Formally, the process of measurement is the application of a measurement scale to associate a value with a particular attribute of a specific object. While this may seem a bit abstract, we engage in the process of measurement all the time. For instance, we step on a bathroom scale to determine our weight, we classify someone as male or female, or we count the number of chairs in a room to see if there will be enough to seat all the people coming to a meeting. In all these cases, the "physical value" of an attribute of an object is mapped to a numerical or symbolic value.

With this background, we can now discuss the type of an attribute, a concept that is important in determining if a particular data analysis technique is consistent with a specific type of attribute.

The Type of an Attribute

It should be apparent from the previous discussion that the properties of an attribute need not be the same as the properties of the values used to measure it. In other words, the values used to represent an attribute may have properties that are not properties of the attribute itself, and vice versa. This is illustrated with two examples.

Example 2.3 (Employee Age and ID Number). Two attributes that might be associated with an employee are ID and age (in years). Both of these attributes can be represented as integers. However, while it is reasonable to talk about the average age of an employee, it makes no sense to talk about the average employee ID. Indeed, the only aspect of employees that we want to capture with the ID attribute is that they are distinct. Consequently, the only valid operation for employee IDs is to test whether they are equal. There is no hint of this limitation, however, when integers are used to represent the employee ID attribute. For the age attribute, the properties of the integers used to represent age are very much the properties of the attribute. Even so, the correspondence is not complete since, for example, ages have a maximum, while integers do not. •

Example 2.4 (Length of Line Segments). Consider Figure 2.1, which shows some objects-line segments-and how the length attribute of these objects can be mapped to numbers in two different ways. Each successive line segment, going from the top to the bottom, is formed by appending the topmost line segment to itself. Thus, the second line segment from the top is formed by appending the topmost line segment to itself twice, the third line segment from the top is formed by appending the topmost line segment to itself three times, and so forth. In a very real (physical) sense, all the line segments are multiples of the first. This fact is captured by the measurements on the right-hand side of the figure, but not by those on the left-hand side. More specifically, the measurement scale on the left-hand side captures only the ordering of the length attribute, while the scale on the right-hand side captures both the ordering and additivity properties. Thus, an attribute can be measured in a way that does not capture all the properties of the attribute. •

The type of an attribute should tell us what properties of the attribute are reflected in the values used to measure it. Knowing the type of an attribute is important because it tells us which properties of the measured values are consistent with the underlying properties of the attribute, and therefore, it allows us to avoid foolish actions, such as computing the average employee ID. Note that it is common to refer to the type of an attribute as the type of a measurement scale.

[Figure 2.1. The measurement of the length of line segments on two different scales of measurement: a mapping of lengths to numbers that captures only the order properties of length (left), and a mapping that captures both the order and additivity properties of length (right).]

The Different Types of Attributes

A useful (and simple) way to specify the type of an attribute is to identify the properties of numbers that correspond to underlying properties of the attribute. For example, an attribute such as length has many of the properties of numbers. It makes sense to compare and order objects by length, as well as to talk about the differences and ratios of length. The following properties (operations) of numbers are typically used to describe attributes.

1. Distinctness: = and ≠
2. Order: <, ≤, >, and ≥
3. Addition: + and -
4. Multiplication: * and /

Given these properties, we can define four types of attributes: nominal, ordinal, interval, and ratio. Table 2.2 gives the definitions of these types, along with information about the statistical operations that are valid for each type. Each attribute type possesses all of the properties and operations of the attribute types above it. Consequently, any property or operation that is valid for nominal, ordinal, and interval attributes is also valid for ratio attributes. In other words, the definition of the attribute types is cumulative. However,
this does not mean that the operations appropriate for one attribute type are appropriate for the attribute types above it.

Table 2.2. Different attribute types.

Categorical (Qualitative)
  Nominal   Description: The values of a nominal attribute are just different names; i.e., nominal values provide only enough information to distinguish one object from another. (=, ≠)
            Examples: zip codes, employee ID numbers, eye color, gender
            Operations: mode, entropy, contingency correlation, chi-square test
  Ordinal   Description: The values of an ordinal attribute provide enough information to order objects. (<, >)
            Examples: hardness of minerals, {good, better, best}, grades, street numbers
            Operations: median, percentiles, rank correlation, run tests, sign tests
Numeric (Quantitative)
  Interval  Description: For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, -)
            Examples: calendar dates, temperature in Celsius or Fahrenheit
            Operations: mean, standard deviation, Pearson's correlation, t and F tests
  Ratio     Description: For ratio variables, both differences and ratios are meaningful. (*, /)
            Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current
            Operations: geometric mean, harmonic mean, percent variation

Nominal and ordinal attributes are collectively referred to as categorical or qualitative attributes. As the name suggests, qualitative attributes, such as employee ID, lack most of the properties of numbers. Even if they are represented by numbers, i.e., integers, they should be treated more like symbols. The remaining two types of attributes, interval and ratio, are collectively referred to as quantitative or numeric attributes. Quantitative attributes are represented by numbers and have most of the properties of numbers. Note that quantitative attributes can be integer-valued or continuous.

The types of attributes can also be described in terms of transformations that do not change the meaning of an attribute. Indeed, S. S. Stevens, the psychologist who originally defined the types of attributes shown in Table 2.2, defined them in terms of these permissible transformations. For example, the meaning of a length attribute is unchanged if it is measured in meters instead of feet.

The statistical operations that make sense for a particular type of attribute are those that will yield the same results when the attribute is transformed using a transformation that preserves the attribute's meaning. To illustrate, the average length of a set of objects is different when measured in meters rather than in feet, but both averages represent the same length. Table 2.3 shows the permissible (meaning-preserving) transformations for the four attribute types of Table 2.2.

Table 2.3. Transformations that define attribute levels.

Categorical (Qualitative)
  Nominal   Transformation: Any one-to-one mapping, e.g., a permutation of values.
            Comment: If all employee ID numbers are reassigned, it will not make any difference.
  Ordinal   Transformation: An order-preserving change of values, i.e., new_value = f(old_value), where f is a monotonic function.
            Comment: An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.
Numeric (Quantitative)
  Interval  Transformation: new_value = a * old_value + b, where a and b are constants.
            Comment: The Fahrenheit and Celsius temperature scales differ in the location of their zero value and the size of a degree (unit).
  Ratio     Transformation: new_value = a * old_value.
            Comment: Length can be measured in meters or feet.

Example 2.5 (Temperature Scales). Temperature provides a good illustration of some of the concepts that have been described. First, temperature can be either an interval or a ratio attribute, depending on its measurement scale. When measured on the Kelvin scale, a temperature of 2° is, in a physically meaningful way, twice that of a temperature of 1°. This is not true when temperature is measured on either the Celsius or Fahrenheit scales, because, physically, a temperature of 1° Fahrenheit (Celsius) is not much different than a temperature of 2° Fahrenheit (Celsius). The problem is that the zero points of the Fahrenheit and Celsius scales are, in a physical sense, arbitrary, and therefore, the ratio of two Celsius or Fahrenheit temperatures is not physically meaningful. •
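The idea behind these permissible transformations can be checked directly in code. The sketch below (hypothetical temperature readings, plain Python) applies the interval transformation new_value = a * old_value + b that converts Celsius to Fahrenheit, and verifies that a statistic appropriate for interval data (the z-score) is unchanged, while a ratio of raw values, which is only meaningful for ratio attributes, gives a different answer on each scale.

```python
import statistics

# Hypothetical Celsius readings and their Fahrenheit equivalents
# under the interval transformation new = a*old + b with a = 9/5, b = 32.
celsius = [0.0, 10.0, 20.0, 30.0, 40.0]
fahrenheit = [c * 9 / 5 + 32 for c in celsius]

def zscores(xs):
    """Standardized values; well defined (up to sign of a) for interval data."""
    mu, sigma = statistics.mean(xs), statistics.pstdev(xs)
    return [(x - mu) / sigma for x in xs]

# The z-scores agree: the interval transformation preserved them.
print(all(abs(a - b) < 1e-9 for a, b in zip(zscores(celsius), zscores(fahrenheit))))

# But ratios of raw values are not preserved, so "twice as hot"
# is not meaningful on these scales:
print(celsius[2] / celsius[1])        # 2.0
print(fahrenheit[2] / fahrenheit[1])  # a different answer (1.36)
```

The same check run on Kelvin values (a pure ratio transformation, b = 0) would leave the ratios intact as well, which is exactly what makes temperature in Kelvin a ratio attribute.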
Describing Attributes by the Number of Values

An independent way of distinguishing between attributes is by the number of values they can take.

Discrete  A discrete attribute has a finite or countably infinite set of values. Such attributes can be categorical, such as zip codes or ID numbers, or numeric, such as counts. Discrete attributes are often represented using integer variables. Binary attributes are a special case of discrete attributes and assume only two values, e.g., true/false, yes/no, male/female, or 0/1. Binary attributes are often represented as Boolean variables, or as integer variables that only take the values 0 or 1.

Continuous  A continuous attribute is one whose values are real numbers. Examples include attributes such as temperature, height, or weight. Continuous attributes are typically represented as floating-point variables. Practically, real values can only be measured and represented with limited precision.

In theory, any of the measurement scale types-nominal, ordinal, interval, and ratio-could be combined with any of the types based on the number of attribute values-binary, discrete, and continuous. However, some combinations occur only infrequently or do not make much sense. For instance, it is difficult to think of a realistic data set that contains a continuous binary attribute. Typically, nominal and ordinal attributes are binary or discrete, while interval and ratio attributes are continuous. However, count attributes, which are discrete, are also ratio attributes.

Asymmetric Attributes

For asymmetric attributes, only presence-a non-zero attribute value-is regarded as important. Consider a data set where each object is a student and each attribute records whether or not a student took a particular course at a university. For a specific student, an attribute has a value of 1 if the student took the course associated with that attribute and a value of 0 otherwise. Because students take only a small fraction of all available courses, most of the values in such a data set would be 0. Therefore, it is more meaningful and more efficient to focus on the non-zero values. To illustrate, if students are compared on the basis of the courses they don't take, then most students would seem very similar, at least if the number of courses is large. Binary attributes where only non-zero values are important are called asymmetric binary attributes. This type of attribute is particularly important for association analysis, which is discussed in Chapter 6. It is also possible to have discrete or continuous asymmetric features. For instance, if the number of credits associated with each course is recorded, then the resulting data set will consist of asymmetric discrete or continuous attributes.

2.1.2 Types of Data Sets

There are many types of data sets, and as the field of data mining develops and matures, a greater variety of data sets become available for analysis. In this section, we describe some of the most common types. For convenience, we have grouped the types of data sets into three groups: record data, graph-based data, and ordered data. These categories do not cover all possibilities and other groupings are certainly possible.

General Characteristics of Data Sets

Before providing details of specific kinds of data sets, we discuss three characteristics that apply to many data sets and have a significant impact on the data mining techniques that are used: dimensionality, sparsity, and resolution.

Dimensionality  The dimensionality of a data set is the number of attributes that the objects in the data set possess. Data with a small number of dimensions tends to be qualitatively different than moderate or high-dimensional data. Indeed, the difficulties associated with analyzing high-dimensional data are sometimes referred to as the curse of dimensionality. Because of this, an important motivation in preprocessing the data is dimensionality reduction. These issues are discussed in more depth later in this chapter and in Appendix B.

Sparsity  For some data sets, such as those with asymmetric features, most attributes of an object have values of 0; in many cases, fewer than 1% of the entries are non-zero. In practical terms, sparsity is an advantage because usually only the non-zero values need to be stored and manipulated. This results in significant savings with respect to computation time and storage. Furthermore, some data mining algorithms work well only for sparse data.

Resolution  It is frequently possible to obtain data at different levels of resolution, and often the properties of the data are different at different resolutions. For instance, the surface of the Earth seems very uneven at a resolution of a
few meters, but is relatively smooth at a resolution of tens of kilometers. The patterns in the data also depend on the level of resolution. If the resolution is too fine, a pattern may not be visible or may be buried in noise; if the resolution is too coarse, the pattern may disappear. For example, variations in atmospheric pressure on a scale of hours reflect the movement of storms and other weather systems. On a scale of months, such phenomena are not detectable.

Record Data

Much data mining work assumes that the data set is a collection of records (data objects), each of which consists of a fixed set of data fields (attributes). See Figure 2.2(a). For the most basic form of record data, there is no explicit relationship among records or data fields, and every record (object) has the same set of attributes. Record data is usually stored either in flat files or in relational databases. Relational databases are certainly more than a collection of records, but data mining often does not use any of the additional information available in a relational database. Rather, the database serves as a convenient place to find records. Different types of record data are described below and are illustrated in Figure 2.2.

Figure 2.2. Different variations of record data.

(a) Record data:

Tid  Refund  Marital Status  Taxable Income  Defaulted Borrower
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

(b) Transaction data:

1  Bread, Soda, Milk
2  Beer, Bread
3  Beer, Soda, Diaper, Milk
4  Beer, Bread, Diaper, Milk
5  Soda, Diaper, Milk

(c) Data matrix:

Projection of x Load  Projection of y Load  Distance  Load  Thickness
10.23                 5.27                  15.22     27    1.2
12.65                 6.25                  16.22     22    1.1
13.54                 7.23                  17.34     23    1.2
14.27                 8.43                  18.45     25    0.9

(d) Document-term matrix: [entries not recoverable from the scan]

Transaction or Market Basket Data  Transaction data is a special type of record data, where each record (transaction) involves a set of items. Consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items. This type of data is called market basket data because the items in each record are the products in a person's "market basket." Transaction data is a collection of sets of items, but it can be viewed as a set of records whose fields are asymmetric attributes. Most often, the attributes are binary, indicating whether or not an item was purchased, but more generally, the attributes can be discrete or continuous, such as the number of items purchased or the amount spent on those items. Figure 2.2(b) shows a sample transaction data set. Each row represents the purchases of a particular customer at a particular time.

The Data Matrix  If the data objects in a collection of data all have the same fixed set of numeric attributes, then the data objects can be thought of as points (vectors) in a multidimensional space, where each dimension represents a distinct attribute describing the object. A set of such data objects can be interpreted as an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute. (A representation that has data objects as columns and attributes as rows is also fine.) This matrix is called a data matrix or a pattern matrix. A data matrix is a variation of record data, but because it consists of numeric attributes, standard matrix operations can be applied to transform and manipulate the data. Therefore, the data matrix is the standard data format for most statistical data. Figure 2.2(c) shows a sample data matrix.

The Sparse Data Matrix  A sparse data matrix is a special case of a data matrix in which the attributes are of the same type and are asymmetric; i.e., only non-zero values are important. Transaction data is an example of a sparse data matrix that has only 0-1 entries. Another common example is document data. In particular, if the order of the terms (words) in a document is ignored,
32 Chapter 2 Data 2.1 Types of Data 33

then a document can be represented as a term vector, where each term is


a component (attribute) of the vector and the value of each component is
the number of times the corresponding term occurs in t he document. This
representation of a collection of documents is often called a document-term
matrix. Figure 2.2(d) shows a sample document-term matrix. The documents
Useful links:
are the rows of this matrix, while the terms are the columns. In practice, only Knowledge Discovery and
the non-zero entries of sparse data matrices are stored. - - -- -1+ Data Mining Bibliography
tGn.vpN!ff f.....-•IJ, .o ...itdlu'}

Graph-Based Data
A graph can sometimes be a convenient and powerful representation for dat a.
We consider two specific cases: (1) the graph captures relationships among
u.- ,.,.,.. Ori&O') l'i-.q..~
data objects and (2) the data objects themselves are represented as graphs.

Data with Relationships among Objects The relationships among objects frequently convey important information. In such cases, the data is often represented as a graph. In particular, the data objects are mapped to nodes of the graph, while the relationships among objects are captured by the links between objects and link properties, such as direction and weight. Consider Web pages on the World Wide Web, which contain both text and links to other pages. In order to process search queries, Web search engines collect and process Web pages to extract their contents. It is well known, however, that the links to and from each page provide a great deal of information about the relevance of a Web page to a query, and thus, must also be taken into consideration. Figure 2.3(a) shows a set of linked Web pages.

Figure 2.3. Different variations of graph data. (a) Linked Web pages. (b) Benzene molecule.

Data with Objects That Are Graphs If objects have structure, that is, the objects contain subobjects that have relationships, then such objects are frequently represented as graphs. For example, the structure of chemical compounds can be represented by a graph, where the nodes are atoms and the links between nodes are chemical bonds. Figure 2.3(b) shows a ball-and-stick diagram of the chemical compound benzene, which contains atoms of carbon (black) and hydrogen (gray). A graph representation makes it possible to determine which substructures occur frequently in a set of compounds and to ascertain whether the presence of any of these substructures is associated with the presence or absence of certain chemical properties, such as melting point or heat of formation. Substructure mining, which is a branch of data mining that analyzes such data, is considered in Section 7.5.

Ordered Data

For some types of data, the attributes have relationships that involve order in time or space. Different types of ordered data are described next and are shown in Figure 2.4.

Sequential Data Sequential data, also referred to as temporal data, can be thought of as an extension of record data, where each record has a time associated with it. Consider a retail transaction data set that also stores the time at which the transaction took place. This time information makes it possible to find patterns such as "candy sales peak before Halloween." A time can also be associated with each attribute. For example, each record could be the purchase history of a customer, with a listing of items purchased at different times. Using this information, it is possible to find patterns such as "people who buy DVD players tend to buy DVDs in the period immediately following the purchase."
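As a small illustration of this kind of time-stamped record data, the two views shown in Figure 2.4(a) can be sketched in a few lines of Python. This is not code from the book; the grouping function and variable names are illustrative, and the records mirror the small example data set.

```python
# Sketch (not from the book): grouping time-stamped transaction records
# (time, customer, items) into per-customer purchase histories, i.e.,
# converting the "one row per transaction" view into the
# "one row per customer" view of sequential transaction data.

from collections import defaultdict

# One record per (time, customer) transaction.
records = [
    ("t1", "C1", ["A", "B"]),
    ("t2", "C3", ["A", "C"]),
    ("t2", "C1", ["C", "D"]),
    ("t3", "C2", ["A", "D"]),
    ("t4", "C2", ["E"]),
    ("t5", "C1", ["A", "E"]),
]

def by_customer(recs):
    """Return each customer's list of (time, items) pairs, in record order."""
    history = defaultdict(list)
    for time, customer, items in recs:
        history[customer].append((time, items))
    return dict(history)

histories = by_customer(records)
print(histories["C3"])       # [('t2', ['A', 'C'])]
print(len(histories["C1"]))  # 3
```

The per-customer view is the natural input for sequential pattern mining, since each row is an ordered purchase history rather than an isolated transaction.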
Figure 2.4(a) shows an example of sequential transaction data. There are five different times (t1, t2, t3, t4, and t5); three different customers (C1, C2, and C3); and five different items (A, B, C, D, and E). In the top table, each row corresponds to the items purchased at a particular time by each customer. For instance, at time t3, customer C2 purchased items A and D. In the bottom table, the same information is displayed, but each row corresponds to a particular customer. Each row contains information on each transaction involving the customer, where a transaction is considered to be a set of items and the time at which those items were purchased. For example, customer C3 bought items A and C at time t2.

(a) Sequential transaction data:

    Time   Customer   Items Purchased
    t1     C1         A, B
    t2     C3         A, C
    t2     C1         C, D
    t3     C2         A, D
    t4     C2         E
    t5     C1         A, E

    Customer   Time and Items Purchased
    C1         (t1: A,B) (t2: C,D) (t5: A,E)
    C2         (t3: A,D) (t4: E)
    C3         (t2: A,C)

(b) Genomic sequence data:

    GGTTCCGCCTTCAGCCCCGCGCC
    CGCAGGGCCCGCCCCGCGCCGTC
    GAGAAGGGCCCGCCTGGCGGGCG
    GGGGGAGGCGGGGCCGCCCGAGC
    CCAACCGAGTCCGACCAGGTGCC
    CCCTCTGCTCGGCCTAGACCTGA
    GCTCATTAGGCGGCAGCGGACAG
    GCCAAGTAGAACACGCGAAGCGC
    TGGGCTGCCTGCTGCGACCAGGG

Figure 2.4. Different variations of ordered data. (a) Sequential transaction data. (b) Genomic sequence data. (c) Temperature time series. (d) Spatial temperature data.

Sequence Data Sequence data consists of a data set that is a sequence of individual entities, such as a sequence of words or letters. It is quite similar to sequential data, except that there are no time stamps; instead, there are positions in an ordered sequence. For example, the genetic information of plants and animals can be represented in the form of sequences of nucleotides that are known as genes. Many of the problems associated with genetic sequence data involve predicting similarities in the structure and function of genes from similarities in nucleotide sequences. Figure 2.4(b) shows a section of the human genetic code expressed using the four nucleotides from which all DNA is constructed: A, T, G, and C.

Time Series Data Time series data is a special type of sequential data in which each record is a time series, i.e., a series of measurements taken over time. For example, a financial data set might contain objects that are time series of the daily prices of various stocks. As another example, consider Figure 2.4(c), which shows a time series of the average monthly temperature for Minneapolis during the years 1982 to 1994. When working with temporal data, it is important to consider temporal autocorrelation; i.e., if two measurements are close in time, then the values of those measurements are often very similar.

Spatial Data Some objects have spatial attributes, such as positions or areas, as well as other types of attributes. An example of spatial data is weather data (precipitation, temperature, pressure) that is collected for a variety of geographical locations. An important aspect of spatial data is spatial autocorrelation; i.e., objects that are physically close tend to be similar in other ways as well. Thus, two points on the Earth that are close to each other usually have similar values for temperature and rainfall.

Important examples of spatial data are the science and engineering data sets that are the result of measurements or model output taken at regularly or irregularly distributed points on a two- or three-dimensional grid or mesh. For instance, Earth science data sets record the temperature or pressure measured at points (grid cells) on latitude-longitude spherical grids of various resolutions, e.g., 1° by 1°. (See Figure 2.4(d).) As another example, in the simulation of the flow of a gas, the speed and direction of flow can be recorded for each grid point in the simulation.
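The temporal autocorrelation discussed under Time Series Data can be quantified. A common estimate is the lag-k sample autocorrelation; the sketch below (not from the book) uses an illustrative, seasonal-looking series rather than real temperature measurements.

```python
# Sketch (not from the book): estimating lag-k temporal autocorrelation,
# the tendency of measurements that are close in time to be similar.

def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    # Covariance between the series and a copy of itself shifted by `lag`.
    cov = sum((series[i] - mean) * (series[i + lag] - mean)
              for i in range(n - lag))
    return cov / var

# An illustrative smooth monthly series: neighboring values are similar,
# so the lag-1 autocorrelation is strongly positive.
temps = [10, 12, 15, 19, 24, 28, 30, 29, 25, 20, 14, 11]
r1 = autocorrelation(temps, 1)
assert r1 > 0.5  # measurements close in time have similar values
```

For spatial data, spatial autocorrelation plays the same role, with distance on the grid taking the place of the time lag.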
Exploring the Variety of Random Documents with Different Content
this that motor boat and automobile races are won and lost.
Now the Dartaway was creeping up on her rival. True it was but a
slow advance, for there were still five cylinders in the Tortoise
against her four. But the boys’ craft was doing nobly, and their
hearts beat high with hope.
Mr. Smith was not going to give up without a struggle. His two
companions worked like Trojans over the silent cylinder, but could
not get it to respond.
Then to the boys’ delight they found themselves on even terms
with the redoubtable Tortoise. They were on the home stretch with
less than a mile to go. Already they could hear the shouts, the cries
and the applause of the watching throngs, with which mingled the
shrill whistles of steam and motor boats.
Three minutes later the Dartaway had regained the lead she had
at the start, and thirty seconds later had increased it. With two big
waves rolling away on either side of her cut-water she forged ahead.
Foot by foot she approached the stake boat. With one last look back,
which showed him the Tortoise five lengths to the rear, Jerry with a
final turn of the wheel to clear the judges’ boat safely, sent the
Dartaway over the line a winner.
CHAPTER XXII
THE COLLISION

What shouting and cheers greeted the motor boys as they slowed
up their craft! The din was deafening, augmented as it was by the
shrill whistles. The Tortoise, too, was received with an ovation as she
came over the line second, but it was easy to see the victory of the
smaller boat was popular.
“Congratulations, boys!” called Mr. Smith as he ran his craft
alongside. “You beat me fair and square.”
He did not refer to the fact that one of his cylinders went out of
commission, but for which fact he undoubtedly would have won. The
boys appreciated this.
The boys accepted their victory modestly, and when they were
sent for to go aboard the judges’ boat and get the prize Bob was for
backing out, while neither Ned nor Jerry felt much like going through
the ceremony.
“Tell ’em to send it over,” suggested Bob.
“That would hardly look nice,” replied Jerry. “Come on, let’s all go
together. It will soon be over. Who’d have thought we could have
butted into the lime-light so soon?”
Having received the cup and stowed it safely away Jerry was
about to steer the Dartaway back to Deer Island when he was hailed
by Mr. Smith.
“Oh I say, you’re not going away, are you?” asked the skipper of
the Tortoise.
“I think we’d better be getting back,” replied Jerry. “We have to
straighten out the camp.”
“Nonsense,” said Mr. Smith. “The fun’s not half over. Why there’s
no end of good things to eat over there. The committee made
arrangements to dine all contestants, and I’m sure you boys are the
chief ones after the handy way in which you won that race. Really
now, you must stop a bit with us.”
“I guess we’d better,” said Bob, in a whisper. “It wouldn’t be polite
to refuse.”
“You were willing enough when it came to sliding out of the cup
proposition,” said Jerry, “but now, when there’s something to eat,
you’re right on the job, Chunky.”
“Guess we might as well,” put in Ned. “I could dally with a bit of
chicken myself.”
“Well, far be it from me to stand in the way,” said Jerry, and,
throwing the wheel around he followed the Tortoise, which, with the
other boats, was making toward shore.
In the grove the boys found Mr. Smith had not exaggerated
matters when he said there “was no end of good things to eat.”
Large tables had been spread under the trees and waiters were
flying here and there. The boys were a bit confused by all the
excitement, but Mr. Smith soon found them, and introducing them to
some of his friends, got places for them at one of the best tables.
“I guess you boys will have plenty of chances to race while you’re
here,” said Mr. Smith. “I hear a number of skippers want to try issues
with you.”
“Well, they’ll find us ready,” said Jerry. “We’re rather new at the
game, but we’ll do our best.”
“That’s the way to talk,” cried Mr. Smith. “Play the game to the
limit, no matter what it is. I’d like another brush myself. Your boat
can certainly go.”
“I think you could beat us,” said Jerry frankly. “If you hadn’t had
that accident you would have won.”
But now the dinner was almost over. Ice cream was being served,
and when every one had eaten their fill, there arose from the head
table where the regatta committee sat a cry of:
“Speeches! Speeches!”
Then came applause and cheers. The chairman of the committee
arose and, looking down toward where the motor boys were sitting,
began:
“I’m sure it would give us all pleasure to hear a few words from
the winners of the motor boat race. They are newcomers to our
midst, and, as such we welcome them.”
“Hear! Hear!” cried the crowd. “Speech! Speech!”
For a moment the boys felt a sort of cold chill go down their
backs. It was the first time they had been placed in such a position.
Bob looked at Ned, Ned looked at Jerry, and Jerry glanced down at
Bob.
“Say something, Jerry!” whispered Ned.
“Yes; go ahead; talk!” exclaimed Bob.
“Wait until I get you both back to camp!” muttered Jerry, as he
pushed back his chair and arose.
His heart was beating fast and there was a roaring in his ears. He
was greatly embarrassed, but he felt he must say something to show
that he appreciated the honor paid him and his comrades.
“I’m sure my friends and I are deeply sensible of this welcome,”
he said. “We didn’t expect to win the race, though we did our best.
We’re very glad to be here among you, and we hope to continue the
acquaintances we have made. And I want to say that if one of Mr.
Smith’s cylinders—I mean if one of Mr. Cylinder’s smith—er—that is if
the boat Mr. Smith cylinders—I mean owns—if his cylinder—er—that
is if his boat’s culander—cylinder—hadn’t cracked Mr. Smith’s head—I
would say if the cylinder—”
“What he means,” said Mr. Smith gallantly coming to the relief of
poor Jerry, “is that if I hadn’t had the misfortune to crack the
forward cylinder I might not have been beaten so badly. But I want
to say that that’s all nonsense. It was a fair race, and won fairly, and
the Dartaway did it. So I ask you to join with me in giving three
cheers for the owners.”
The cheers were given with a will, and the boys felt the blushes
coming to their cheeks. Altogether it was a jolly time, and one the
lads never forgot.
“We didn’t make any mistake coming here,” said Jerry, who had
taken his place at the wheel as they started for their camp. “It’s
almost as much fun as automobiling in Mexico or crossing the
plains.”
The boys were proceeding rather slowly as they had not yet
familiarized themselves with the lake and their bearings, and they
did not want to run into anything.
For a while the Dartaway skimmed along, there being no other
craft near. The water lapped the sides and broke away in a ripple of
silver waves.
Suddenly Jerry threw out the gear clutch, and began spinning the
wheel around. At the same instant Bob and Ned, who had been
looking to the rear, turned around and saw a big black shape in front
of them.
“Ahoy there! Schooner ahoy!” called Jerry. “What do you mean by
cruising about without a light? You’ve no right to do that. Look out
there. You’ll foul us!”
The sound of feet running about on a deck could be heard. Then
there came a moment of silence followed by a sudden jar and a
grinding crash.
CHAPTER XXIII
THE MYSTERIOUS VOICE

The shock threw the Dartaway back. Jerry had already turned off
the power, and was slowing down for the reverse when the smash
came. The motor boat had fairly poked her nose into the side of the
schooner.
“Are we damaged?” cried Ned.
“I guess not,” replied Jerry, seizing one of the oil lanterns and
holding it over the side of the bow. He could see a big dent in the
wooden hull of the motor boat, and a larger one in the schooner.
The two boats were now drifting apart.
Aboard the schooner there was much confusion. Several persons
seemed to be talking at once. Lights flashed here and there.
“Look out, I’m going to back away,” said Jerry to Bob and Ned. “Is
it all clear to the rear?”
He swung the search lantern so that the beams cut a path of light
aft.
“Nothing in the road,” sung out Ned.
Slowly the Dartaway separated from the side of the schooner. As
she did so the stern of the larger vessel swung over toward the
motor boat, and Bob, who was watching it gave a sudden cry.
“What’s the matter? Is she going to hit us again?” called Jerry,
slowing up the engine.
“No!” cried Bob. Then lowering his voice and crawling to where
Jerry stood he whispered:
“This boat has the name of Bluebird on her stern!”
At the same instant there came floating over the water the sound
of a voice from some one aboard the larger craft.
“We’re sinking! Quick Bill! Get the boat over and find me a life
preserver. I don’t want to drown!”
At the sound of the mysterious voice, coming so plainly amid the
stillness that followed the crash the boys were startled.
“Doesn’t that sound just like—” began Bob.
“Hush!” cautioned Jerry in a whisper. “Wait a while before you
talk.”
“I tell you we’re sinking!” the voice went on. “They rammed a hole
clear through us. They did it on purpose! They want to capture me!”
“Keep quiet, you numbskull!” the boys heard some one exclaim in
reply. “You’ll be caught quick enough if you don’t keep still. Do you
want to give the whole thing away? Get below before they flash that
search light on the deck and see who you are!”
Silence ensued, broken only by the sound of some one moving
about on the deck of the schooner.
“Flash the light on ’em!” called Ned.
Jerry swung the big gas lamp around on its pivot, and the blinding
white glare illuminated the schooner. The only person to be seen on
deck was a man at the helm, and, by the beams the boys could see
he was roughly dressed.
For an instant the steersman stood plainly revealed in the beams.
He wore nothing on his head, but, as soon as the glare set him out
from the darkness he caught up from the rail a slouch hat which he
pulled over his eyes, screening the upper part of his face.
“What’s the matter with you?” demanded Jerry with a pretense of
anger, as he wanted to hear the man’s reply. “Couldn’t you see our
boat?”
“If I could have d’ye s’pose I’d a stood here an’ let ye run int’
me?” the man asked in answer. “Them gasolene boats is gittin’ too
dangerous. I’ll have th’ law on ye for this.”
“What about the law requiring sailing boats to carry lights at
night?” asked Jerry. “I guess if there’s going to be any suing done
we can do our share.”
The steersman made no answer. The wind freshened just then,
and the schooner gathered way. The helmsman put her about, and
she heeled over as the breeze came in powerful gusts.
While the after part of the sailing vessel was still in the zone of the
search light the boys observed a second figure aboard. It came up
the companionway leading down into a small cabin.
“Git down there!” the steersman exclaimed. “They’ll see you!”
The figure disappeared suddenly. The boys, seeing it would be no
further use to argue with the surly skipper, put their boat on her
course and resumed the trip to the island. They found beyond a
slight loosening of the engine, due to the shock, no damage had
resulted.
“Well, I think we ran into something that time,” remarked Ned.
“Two things I would say,” put in Jerry. “If that mysterious voice,
the steersman tried to hush, wasn’t that of Noddy Nixon’s I’ll eat my
hat.”
“I was just going to say the same thing,” added Bob. “I was sure I
recognized it.”
“Then he isn’t kidnapped at all,” said Ned.
“I never believed he was,” came from Jerry.
“I wonder who the other person was,” said Bob.
“I have an idea it was Bill Berry,” said Jerry.
“It didn’t sound like his voice,” interposed Ned.
“If you noticed,” went on Jerry, “he talked with two voices. When
he spoke to Noddy his tones and words were much different than
when he addressed us and threatened to have the law on us. I’m
sure it was Bill Berry.”
“Then those two are up to some mischief, I’ll bet,” ventured Ned.
“There must be some game afoot when Noddy lets it be thought he
is kidnapped, and when we find him away off here in a schooner.”
“There is,” spoke Jerry. “It’s the same game that began with the
reference to something ‘blue’ that Bill Berry made that day. It’s the
same game that we nearly discovered when we almost ran into the
Bluebird, and now we have the same schooner away down here on
the lake and we nearly sink in consequence of hitting her, or of her
hitting us, for I believe they got in the way on purpose.”
“But what is the game?” asked Bob.
“That’s what’s puzzling me,” replied Jerry. “I’m inclined to think
that the gang Chief Dalton is after will be found to have some
connection with this vessel, and while I have only a mere suspicion
of it, I believe the robbery of Mr. Slade’s store is—”
“Look out there! You’re going to hit me! Keep to the left!”
exclaimed an excited voice.
Jerry rapidly spun the wheel around and the Dartaway veered to
one side with a swish of water, just grazing a rowboat with a man in
it, that loomed up dead ahead.
“Almost had me that time,” said the rower pleasantly as the
Dartaway slowed up. “It was my fault though, I ought to have had a
light.”
His frank admission of his error, and his failure to abuse the boys
for nearly colliding with him, as most rowers would have done under
the circumstances, made the boys feel at ease.
“Sorry we caused you such a fright,” said Jerry. “Can we give you
a tow?”
He swung the search light about to illuminate the rowboat. As he
did so he gave an exclamation of astonishment. The rower was none
other than the ragged tramp who had been rescued from the hay
barge, and who had been given a ride in the Terror following the
unsuccessful chase after the motor boat thieves. He recognized the
boys at once.
“Oh it’s you, my young preservers!” the tramp said. “Well, we
seem fated to meet at odd moments. First you save my life, and
then you nearly take it from me. Well, it evens matters up.”
“Can we tow you anywhere?” asked Jerry again.
“Thanks, noble sir,” replied the tramp with the same assumed
grand air he had used when talking to Chief Dalton. “I fain would
dine, and if you can take me to some palace where the beds are not
too hard, and where I could have a broiled fowl, or a bit of planked
whale, with a sip or two of ambrosial nectar, I would forever call you
blessed.”
“Do you mean you’re hungry?” asked Bob, who had a fellow
feeling for all starved persons.
“As the proverbial bear,” answered the tramp. “You haven’t a stray
cracker about your person, have you?”
“No, but I’ve got a couple of ham sandwiches,” said Bob.
“Well if you’re not at it again, Chunky,” said Jerry. “Where’d you
get ’em?”
“I put ’em in my pocket at the feed this afternoon,” replied Bob,
taking the sandwiches out and passing them to the tramp, whose
boat was now alongside. “I thought they’d come in handy.”
“As indeed they do,” the ragged man put in, munching away at the
bread and meat with right good appetite. “I thank you most
heartily.”
“If you care to come to our camp we can give you something
more and a little coffee,” said Jerry. “You could also sleep under
shelter. We have a tent ashore you can use and we can sleep on
board the boat.”
“If it would not discommode you, I would be glad of the
opportunity,” the tramp said, dropping his assumed manner and
speaking sincerely. “I was about to spend the night in the woods,”
he went on, “but I much prefer shelter. I have a mission here, and
while I am on it I have to rough it at times. But I am almost
finished.”
“Will you come aboard, or shall we tow you?” asked Ned.
“Perhaps it would be as well to tow me,” replied the tramp. “I
have some things in my boat I would not like to lose.”
The tow line was soon made fast to the Dartaway, and the boys
resumed their trip which had twice been interrupted by accidents.
They reached the island in safety, and soon were preparing some
coffee and a light supper. The tramp fastened his boat to a tree that
projected over the water, and, then sat at the rough table the boys
had constructed under a canvas awning.
“I don’t believe I have been presented to you gentlemen,” said the
tramp, as the night dinner was about to begin. Jerry laughing,
introduced himself and his chums.
“Are you Aaron Slade’s son?” asked the tramp excitedly, as Ned’s
name was mentioned.
CHAPTER XXIV
A QUEER MESSAGE

“Aaron Slade is my father,” replied Ned, wondering what object the


tramp could have in asking.
“The one who was recently robbed?”
“The same.”
“Well if this isn’t—” began the tramp more excited than before. “I
must—no I must not. Pray excuse me,” he went on, with an
assumption of his former grand air, “I must not refer to that. It
escaped me before I was aware of it. Pay no attention to what I
said. I was going to tell you something, but the time is not yet ripe.
Now let’s fall to, for I’m still imitating the bear in the predilection of
my appetite,” and he attacked the food with every evidence that he
was speaking the truth.
The boys looked at each other in surprise. Ned, in particular,
wondered what the tramp meant by starting as if he intended to tell
some secret and then stopping. Seeing that their guest was not
observing him, Jerry made a gesture that indicated the tramp might
not be altogether right in his head. In this view Bob and Ned
coincided.
They were not alarmed, however, as the man did not seem to be
dangerous. He was too busy eating to talk, and the boys soon forgot
their curiosity in making away with the food, for the trip across the
lake had given them all appetites.
It was arranged that the tramp should sleep in the shelter tent,
while the boys made use of the bunks on board the boat. It was
nearly midnight before they turned in, and the motor boys, at least,
slept soundly until morning.
As for the tramp he may have rested well, but at any rate he was
not a late sleeper, for, when the boys crawled out of their
comfortable beds for a plunge into the lake they found he had built a
fire on shore and was boiling their tea kettle over it.
“That’s very good of you, but you needn’t have gone to that
trouble,” said Jerry. “We have a gasolene stove.”
“Tut, tut!” exclaimed the ragged man. “Water for coffee should
always be boiled over an open fire. It has more flavor.”
Thinking this was only one of the tramp’s odd conceits the boys
did not argue further with him. They took their bath, their odd guest
meanwhile making coffee.
“If you’ll tell me where the bacon and other things are I’ll finish
getting this meal,” he called to them where they were splashing in
the lake.
“Shall we let him?” asked Jerry of his chums in a low voice.
“Guess he won’t poison the stuff,” said Bob. “Besides it will be
ready while we are dressing and we’ll not have to wait.”
Accordingly Jerry called out directions how to find the victuals, and
soon the savory smell of sizzling bacon and frying eggs was wafted
over the water. They had a breakfast fit for a king, and
complimented the tramp on his skill.
A little later the tramp proposed that the boys take his rowboat
and go fishing on the other side of the island. They were doubtful
about leaving him in charge of the camp.
“I see you’re a little suspicious of me,” the tramp said. “Well I
don’t blame you. However to show you that I’m all right read that.”
He held out a slip of paper, on which was written:
“This man can be trusted. Henry Dalton, Chief of
Police, Cresville, Mass.”
“If the chief says you’re all right, I guess that’s enough for us,”
spoke Jerry, as he handed the paper back. “We’ll take a day off and
go fishing. Don’t let any one come bothering around our camp. We
have reason to believe an enemy of ours is on this lake. He would do
us some harm if he could.”
“There are enemies of mine, also,” said the tramp. “But have no
fear. I’ll look after things.”
Getting some bait and fishing tackle the boys started off in the
tramp’s rowboat. They did not take any lunch, as they planned
coming back at noon.
“Do you think it’s all right to trust him?” asked Ned.
“I’m sure it is,” replied Jerry. “That note from the chief was
genuine. I know his writing, and the paper was the same as the
chief uses in his private office. I got a permit once from him to carry
a revolver. You remember, when we made our first auto trip.”
Satisfied that their belongings had been left in good hands, and
were safe from any chance intrusion from Noddy Nixon or his
cronies, the boys put in an enjoyable morning fishing. They made
several good catches, and when the sun indicated that it was nearly
noon, they rowed around the island to camp.
“I hope he has a good fire going so we can cook some of these
fish,” observed Bob.
“I guess he will be ready for us,” said Ned. “He seems to be a
willing worker.”
Sure enough, when the boys rowed to shore they found their odd
guest had built a fine fire in an improvised oven, and was all ready
to proceed with cooking the fish. It was the best meal the boys had
eaten since coming to camp, and they had the tramp to thank for
the major part of it. The ragged man proved he had a better
appetite even than Chunky, which is saying a great deal. The fish
were done to a turn, and the bacon gravy gave them a most
excellent flavor.
So heartily did all eat that they were too lazy to do anything but
lounge around after dinner. They stretched out under the trees and
before they knew it the boys had dozed off.
Jerry was the first to awaken. It was about three o’clock when he
sat up, rubbing his eyes, and, for a moment wondering where he
was. Then he saw the lake through the trees and remembered. He
looked around and saw Bob and Ned still stretched out on the
sward. The tramp was nowhere in sight.
“I wonder if he’s gone fishing,” thought Jerry. “He’s a queer duck.
I must take a look at our motor boat.”
Slowly he walked to where the Dartaway was moored. He saw she
was riding safely. Then he looked for the rowboat. It was nowhere to
be seen, though it had been tied close to the motor craft.
“I guess he’s slipped away,” thought Jerry.
At that instant the sound of oars being worked caught his ears. He
looked up and saw, coming around the point of the island, the
tramp’s craft. But the tramp did not seem to be in it. Instead it held
a fisherman, with a broad brimmed hat, a corduroy coat, green
goggles on, and a big basket hung over one shoulder. In the boat
two poles could be seen, also a gaff sticking up.
“Some one has stolen his boat,” thought Jerry. “Hi there!” he
called. “Where you going?”
“Fare thee well!” called back the fisherman. “I must away on my
mission.”
“Come back with that boat!” yelled Jerry.
“Why so? ’Tis mine,” came back the answer over the waters as the
fisherman rowed farther out from shore. “Sorry to leave you in this
fashion, but my mission calls.”
“Why it’s the tramp!” exclaimed Jerry, as he recognized the voice
of the ragged man in spite of his queer disguise. “But where in the
world did he get that rig?”
“What’s the matter?” asked Ned, having awakened and coming
down to join Jerry.
“There goes our tramp,” said Jerry.
The tramp was now quite a distance out. He stood up in his boat.
“Look—in—your—coffee—pot!” he called. “I—left—a—message!”
Then he sat down and began rowing hard.
“Hurry up, get the coffee pot!” cried Jerry. “We must get at the
bottom of this!”
He and Ned ran back to the tent. They found the pot set in the
middle of the table. Jerry threw back the cover. Inside was a piece of
birch bark, on which was written in pencil:
“Where the bluebird spreads her wings, there you’ll
find the stolen things. Search her deep, and search her
through, you will find I’m speaking true.”
CHAPTER XXV
SEARCHING FOR THE SCHOONER

“Well if this isn’t mystery and more of it!” exclaimed Bob. “What in
the world does it all mean, and the tramp going off in this fashion?”
The boys gathered close together, their heads bent over the
mysterious message on the birch bark.
“Let’s call to him to explain,” suggested Ned.
“It’s too late,” said Jerry. “He’s too far out. Besides I don’t believe
he’d come back. Anyhow I think I know what the message means.”
“What?” asked Ned and Bob in a chorus.
“Isn’t it plain enough?” asked Jerry with a smile. “If Andy Rush
was here he’d have half a dozen explanations.”
“Let me read it once more?” came from Ned.
“‘Where the bluebird spreads her wings, there you’ll find the
stolen things. Search her deep and search her through, you will find
I’m speaking true.’”
“Why of course!” exclaimed Bob. “It must be the schooner
Bluebird he’s referring to, and he means your father’s things will be
found in her, Ned. It’s as plain as the nose on your face.”
“That’s so,” agreed Ned. “Is that what you make of it Jerry?”
“Sure. That part is easy enough. What does puzzle me though is
that tramp. I can’t quite make him out. He’s a funny character, and
his latest effort is stranger than any since his adventure on the hay
barge. I wonder how he knew there was stolen stuff aboard the
Bluebird?”
“Well that seems simple enough to me,” spoke Ned. “He’s probably
been a criminal in his time, and knows some of the crooks who
robbed my father’s store. In some way he found out they had the
stolen stuff on the schooner, and he wanted to let us know to pay
for our favors to him. You remember how excited he got when he
found out my name was Slade.”
“Yes, that’s all right as far as it goes,” said Jerry, “but you’ll never
get me to believe that tramp is either a criminal or one who travels
with thieves. He’s a different character altogether. You’ll see I’m
right. He may have found out where the stolen stuff is, but it was in
some other way than being a companion of the thieves.”
“Well, maybe, you’re right,” came from Ned. “That part can be
settled later. The main thing is to find the Bluebird and see what
there is aboard.”
“Which isn’t going to be such an easy thing as it sounds,” Jerry
remarked.
“Why not?”
“Well, it may be a simple matter to locate the vessel, as the lake is
not very large, but when we get to her have you thought of what we
will do with her?”
“Go aboard, of course, and demand my father’s goods and
money,” said Ned boldly.
“You seem to forget there is a difficulty in the way,” Jerry went on.
“The men who stole the stuff, provided it is aboard the ship, are not
likely to let us come over the side as if we were on a visit, and
search for incriminating evidence. Then, too, there is Noddy, and he
is not likely to welcome a call from us. No, I think we’ll have our
hands full in getting aboard the Bluebird.”
“What would you advise?” asked Bob, as both he and Ned had
come to regard Jerry’s ideas as being a little better than their own
on important matters.
“I think it would do no harm to make a search and find where the
Bluebird is lying,” said Jerry after a little thought. “Then, perhaps we
can decide on a plan of action. It’s a sort of following the old recipe
of making a rabbit pot-pie,—to first catch the rabbit.”
The other boys agreed this was the best idea. They watched the
boat with the tramp-fisherman growing smaller and smaller as he
rowed out on the lake, and puzzled more than ever over the queer
character.
“Well, shall we start right away?” asked Ned.
“I don’t believe it would do any good,” said Jerry. “Let’s get ready
for supper, and this evening we can take a run out on the lake. We
probably will not discover anything, but it will be fun, and we may
gain a clue.”
Shortly after sunset, the evening meal having been finished, the
boys made the Dartaway ready and started away from camp. The
lake was alive with power and other boats and the boys met a
number of new acquaintances they had made at the luncheon
following the winning of the prize. They speeded back and forth until
dusk, and then accepted an invitation of a party that was bound for
one of the resorts on the shore of the lake.
They spent some time there and when they reached their island
dock and made a landing it was as dark as pitch. The boat was
made fast to the wharf and then, lighting some oil lanterns, the boys
walked up to their camp, which was a little way from shore.
As the gleam of the lamps fell on the place Jerry who was in the
lead uttered an exclamation:
“Some one has been paying us a visit!” he said. “And they haven’t
been friends of ours either.”
This was soon evident, for the camp was topsy-turvy. The shelter
tent was pulled down, the utensils and camp stuff were scattered all
about, and the place looked as if a small cyclone had struck it.
“I wonder who did this?” came from Ned. “I’d like to get hold of
them for a few minutes.”
“Maybe this tells,” said Jerry, taking up a piece of paper from the
planks that served as a table. The scrap had evidently been placed
where it would be easily seen. It read:
“You had better clear out of here before something
worse happens to you and your boat.”
“Who signs it?” asked Ned.
“It has ‘The River Pirates’ at the bottom,” said Jerry, “but I’d be
willing to bet a new hat against a cookie that it’s Noddy Nixon’s
writing.”
“Then the Bluebird has been here in our absence,” said Bob.
“Looks so,” admitted Jerry. “Now let’s see if any great damage has
been done.”
They made a hasty examination, but beyond tearing up the camp,
and upsetting things, nothing appeared to have been stolen or
seriously damaged. It seemed that the visitors merely wanted to
annoy the boys.
There was nothing much that could be done until morning, so the
boys, seeing that the Dartaway was securely made fast, went to
sleep on board. They rested undisturbed until morning.
“Now to hunt for the mysterious schooner!” exclaimed Ned after
breakfast. “Do you know I have a good scheme?”
“Let’s hear it,” said Jerry.
“We ought to disguise ourselves,” went on Ned. “If we go hunting
for the schooner in our motor boat the way we are now, they can
see us coming and get on their guard. We ought to make up as
fishermen, just as the tramp did, and steam around slowly.”
“They know the boat by this time,” objected Jerry.
“We can disguise her a bit by hanging strips of canvas over the
sides,” went on Ned, “and by taking the canopy off.”
“I believe that’s a good suggestion,” said Jerry. “Then we could
take the thieves by surprise. Come on, we’ll see what we can do to
the boat.”
By removing the awning, and putting strips of dirty canvas over
the bright clean paint on the sides of the Dartaway the whole
appearance of the craft was changed.
“Now for ourselves,” said Bob. “We’ll wear our oldest clothes.”
If the boys hoped to succeed with little effort they were doomed
to disappointment. They spent all the morning cruising around the
lake and did not get a glimpse of the craft they wanted. They did not
go back to camp for lunch, having brought some eatables with them.
In the afternoon the cruise was resumed, but with no better luck.
For three days the boys went forth every morning disguised as
fishermen, and came back at night having had their trouble for their
pains.
“This is getting tiresome,” said Ned, on the evening of the third
day. “We’re having no fun out of this trip at all. Let’s let the thieves
go. I don’t believe they have any stuff on the boat.”
“Let’s try one more day,” pleaded Jerry. “We’ll go away down to
the other end of the lake.”
So it was agreed. They made an early start the next morning and
in the afternoon found themselves cruising around at the extreme
southern end of the lake. There the body of water narrowed in one
place because of an island close to shore. It was a spot seldom
visited, and there were no camps in that vicinity.
“Let’s take a look around the other side of that island,” suggested
Jerry, when his companions proposed going home. “There might be
a dozen schooners there.”
The Dartaway was headed through the narrow channel. Jerry, who
was steering, was proceeding slowly, as he was in unfamiliar waters,
and the channel seemed rather shallow.
Suddenly, as the motor boat emerged from the strait, the three
boys could hardly refrain from uttering an exclamation. There,
moored to the shore, was the Bluebird.
“We’ve found her!” whispered Bob excitedly.
“Hush!” cautioned Jerry. “Pretend to be fishing while I work the
boat nearer. Don’t look at the schooner. They may be watching us.”
With swiftly beating hearts the boys listened to the throb of the
propeller that brought them nearer and nearer to the Bluebird.
CHAPTER XXVI
THE PIECE OF SILK

“Are you going right up close?” asked Bob. “Maybe we had better
wait a while.”
“Keep quiet,” said Jerry. “Just watch.”
The Dartaway continued to approach the schooner. In the stern
Bob and Ned pretended to be trolling. Jerry held the motor craft on
her course, going at first speed, and kept her headed right for the
sailing vessel.
“You’re going to bump!” exclaimed Bob in a low tone, looking over
his shoulder at Jerry.
The next instant the Dartaway hit the side of the schooner with a
resounding thump, but not hard enough to do any damage, as Jerry,
on the alert, reversed the screw just in time.
“I told you we were going to hit,” said Bob in reproachful accents,
for he had nearly been tossed overboard by the recoil when the
motor boat backed away from the Bluebird from the force of the
blow.
“That’s all right, I meant to hit ’em,” said Jerry coolly, as he caught
hold of a rope that hung over the schooner’s side. “I did it on
purpose,” he went on in a lower voice. “It will seem as if it was an
accident and we can get a chance to see who’s aboard. That knock
ought to bring ’em out.”
The boys, making the motor boat fast to the sailing vessel with
the rope, waited for a hail from those they supposed to be aboard.
But a silence ensued after the noise of the collision and the
throbbing of the motor died away. All that could be heard was the
sound of the wind in the trees, birds singing in the woods, and the
lap of little waves against the sides of the boats.
“Queer,” muttered Jerry, “I thought that would arouse them. Must
be sound asleep. Here goes for another.”
He pushed the Dartaway back from the side of the schooner and
then, holding to the rope pulled her forward again so that the nose
of the motor craft hit the sailing vessel a resounding blow. Still there
was silence on the Bluebird.
The boys waited for several minutes, listening intently, but there
was no sign of life other than on their craft.
“I’m going aboard the schooner,” said Jerry at last.
“Do you think it’s safe?” asked Ned.
“I don’t see why not,” replied Jerry. “There doesn’t seem to be any
one in her. Maybe they’ve only gone away for a little while, but it’s
our best chance. So here goes.”
With that he scrambled up the rope hand over hand, and soon
stood on the schooner’s deck.
“Come on up,” he called to Ned and Bob. “The schooner is
deserted!”
Up came the other two boys. They found the hatches tightly
closed, and, as the day was hot, they reasoned that no one would
be below with all the openings shut. The schooner was in good
order, everything on deck being neatly arranged, and showing that
those who had deserted her had not gone off in any haste. The
vessel was moored to shore with bow and stern lines.
“Well, now that we have things to ourselves,” said Jerry, “let’s see
what we can find. It ought to be an easy matter to get below.”
“I wonder if we have any right to,” said Bob.
“I don’t see why not,” came from Ned. “We suspect that some
things from my father’s store are here. If we take a look and don’t
do any damage where’s the harm. The thieves ought to be caught,
and we may get a clue to them in this way.”
“I say, let’s go below,” put in Jerry. “Try all the hatches. Maybe
some of them are not locked.”
Whoever had deserted the schooner had evidently not felt any
alarm about leaving their property without the protection of lock and
key, for the first hatch cover the boys tried slid back easily, disclosing
a rather dark and steep companionway.
“Who’s going ahead?” asked Jerry. “Don’t all speak at once.”
There was a moment’s hesitancy on the part of all three. There
was no telling what they might meet with, or who might be below.
“Pshaw!” exclaimed Ned. “I don’t believe any one’s there. I’ll make
a break.”
He started down the companion steps, and, after a second, Bob
and Jerry followed.
“It’s as dark as a pocket!” said Ned. “I wish we had a lantern.”
“Hold on!” called Bob, who was in the rear. “I have a candle-end in
my pocket.”
He brought it forth and lighted it, sending a rather faint
illumination through the cabin in which the boys found themselves.
No one was to be seen, but, as was the case on deck, everything
was neatly in place, and no disorder evident.
“Now for the search!” exclaimed Ned. “We’ll see if that tramp
knew what he was writing about with his funny message.”
Around the cabin were several lockers. These the boys opened in
succession, only to find them empty. Clearly the booty, if it was
aboard, was not in this part of the vessel.
But there were many other places to search. The craft was not a
large one, but there was a forecastle, and a small hold amidships.
The boys decided to try the hold first. To get into it they found they
would have to slide back the deck hatch, and then lower themselves
into the black hole by means of a rope which hung from the gaff,
and which was evidently used to hoist cargo in or out of the
schooner.
With the hatches open the dark hole was made lighter but at best
it was not a pleasant place. Still the boys were determined to
explore it. Seeing that the rope was securely fastened to the gaff,
Jerry swung himself over the hatchway, and went down hand over
hand. It was about ten feet from the deck to the bottom. Bob and
Ned followed.
In his descent Bob dropped the candle, which, after burning a
little while on the bottom of the hold, went out.
“That’s nice,” said Jerry. “Don’t move now until we get a light. No
telling what sort of a hole you may fall into. Stay under the patch of
sunshine.”
The boys remained immediately under the hatchway until Jerry,
groping around, had found the candle end and lighted it. Then the
boys peered around them, Jerry holding the tallow illuminator above
his head.
“Forward!” cried Ned.
The next instant there sounded a scurrying as if some one was
running about the hold.
“Some one’s coming!” cried Bob. “Come on! They’re after us!”
The noise increased, and Jerry and Ned peered forward expecting
to see some one approaching out of the darkness. Then came a
series of shrill cries.
“Rats!” exclaimed Jerry with a laugh. “I forgot that all vessels are
full of them.”
“Are you sure?” asked Bob, who had grabbed hold of the rope.
“Sure; can’t you see them?” asked Jerry, and, moving his candle
back and forth close to the floor, he pointed out where several big
gray rodents were huddled in one corner.
“Only rats, eh,” muttered Bob. “Well I wouldn’t want a lot of them
to get after me. They’re as big as cats.”
But the animals were probably more frightened than Bob had
been, for the next instant they all disappeared down some hole. The
boys began a systematic search of the hold of the vessel. It did not
take long to show that no booty was contained in it, unless, as Ned
suggested, there was a secret hiding place.
“Well, we’ll try the fo’castle now,” said Jerry as he blew out the
candle to save it, and ascended the rope. Bob and Ned followed.
By opening bull’s-eyes in the forecastle the place was made light
enough to see fairly well in. There were several bunks, and a small
table which could be folded against the side out of the way. The
bunks were provided with bed clothes, and a hasty examination of
them showed nothing to be hidden among them. The whole place
was well looked through, but there was no sign of the goods stolen
from Mr. Slade’s store.
“I guess that tramp must have had a dream,” said Ned, “or else he
wanted to write some poetry.”
“Looks that way,” admitted Jerry, who was idly looking at a figure
of Neptune carved in the middle of a panel on the forward bulkhead.
“Still I don’t believe—”
But what Jerry believed he did not state, for, the next instant he
nearly fell as the panel containing the representation of the sea god
slid back and disclosed a dark opening.
“Why—why—” exclaimed Jerry recovering his balance with
difficulty. “This is queer. I was just pressing on the trident when all
of a sudden—it happened.”
“Well I guess it did!” cried Ned. “I’ll bet it’s the secret hiding place.
Come on, let’s have a look!”
“Light the candle!” said Jerry. “It’s as dark as two pockets.”
In the gleam of the light there was disclosed a place about five
feet square, which had been built forward of the forecastle
bulkhead.
“Now for the stolen stuff!” cried Ned, as he stepped inside. He
flashed the candle around, but it took only an instant to show that
there was nothing in the secret hiding place so opportunely
discovered by Jerry.
“Well of all the—” began Ned, when he suddenly made a grab into
one of the corners. “This looks like something!” he went on. “Let me
get to the light.”
He stepped into the forecastle and held up to the view of his
comrades a piece of cloth.
“What is it?” asked Jerry.
“A piece of red silk!” exclaimed Ned. “It’s just like some that was
stolen from my father’s store! The things have been here, but they
are gone!”
“Perhaps they are here yet,” suggested Jerry, “only we can’t find
them. Maybe there are other secret hiding places. What had we
better do?”
The boys were much excited over their find. That they were on
the trail of the thieves they were certain, but what to do next
puzzled them.
“How would it do for one of us to stay here, and the others go and
get police assistance?” suggested Ned. “We ought to have the
detectives on this case at once.”
“I have a better plan,” said Jerry. “Let two of us stay here, and the
other take the motor boat and go after Chief Dalton in Cresville.”
“How will we decide who are to stay and who is to go?” asked
Ned.
“We’ll draw lots,” replied Jerry. “Those who get the longest will
stay on the schooner, and the one who gets the shortest will start in
the motor boat.”
The lots were made from three straws. Jerry got the shortest.
“Well, the sooner I get off the quicker the chief will be back here,”
he observed.
“Hold on a minute,” put in Bob. “Have you figured how long we’ll
have to stay here, and not a thing to eat? You can’t get back here
before this time to-morrow.”
“That’s so,” admitted Jerry, for once forgetting to laugh at Bob’s
concern over the food question. “I’ll tell you what we’ll do. We’ll run
back to camp and bring enough stuff here to last until I come back.”
“Good idea,” said Ned. “Only there’s no use in us all going. I’ll stay
here, while you and Bob go back to camp. Bring some lanterns, and
some cold victuals. Maybe we can find some food on board. We
certainly can make coffee for there’s a stove in the galley, and I saw
a coffee pot. All we need is some coffee.”
So it was arranged. Jerry and Bob made a fast run to Deer Island,
and were soon back to the schooner with enough provisions to last
the two boys a day or more. In the meanwhile Ned had been all over
the schooner, but had made no new discoveries.
He had found a good supply of canned goods, and even some
coffee, so there was no danger of starving even if the victuals Jerry
and Bob brought gave out. The bunks were clean and there was
plenty of clothing, though it would hardly be needed for the nights
were warm.
It was now getting dusk and, after seeing that his boat was in
good shape Jerry prepared for the long run back to Cresville.
“Take care of yourselves,” said he. “Keep a good watch and if
Noddy and the gang come back, don’t run any chances. They’re
desperate men, and it would be better to retreat than run the
chance of a fight. If I were you I’d sleep in the cabin or on deck in
hammocks. I’ll come back as soon as I can.”