Dirty Data Processing for Machine Learning
Zhixin Qi • Hongzhi Wang • Zejiao Dong

Dirty Data Processing for Machine Learning

Zhixin Qi
School of Transportation Science and Engineering
Harbin Institute of Technology
Harbin, Heilongjiang, China

Hongzhi Wang
School of Computer Science and Technology
Harbin Institute of Technology
Harbin, Heilongjiang, China

Zejiao Dong
School of Transportation Science and Engineering
Harbin Institute of Technology
Harbin, Heilongjiang, China

ISBN 978-981-99-7656-0 ISBN 978-981-99-7657-7 (eBook)


https://doi.org/10.1007/978-981-99-7657-7

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore
Pte Ltd. 2024
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore

Paper in this product is recyclable.


This book is dedicated to all contributors in
this field.
Preface

With the rapid growth of data in various domains, massive amounts of data have been accumulated for data analysis and data mining. Even though massive data bring new opportunities, data with data quality problems, called dirty data, occur in different forms, such as missing values, inconsistent values, and conflicting values.
Obviously, for a data mining or machine learning task, dirty data in both training
and test data sets affect the accuracy of results. Motivated by this, the subject of this
book is to analyze the impacts of dirty data and explore the proper methods for dirty
data processing.
Although existing data cleaning methods improve data quality dramatically, the cleaning costs are still high. If we know how dirty data affect the accuracy of machine learning models, we could clean data selectively according to the accuracy requirements instead of cleaning the entire dirty data set at large cost. However, no existing research studies the impacts of dirty data on machine learning models in terms of data quality dimensions. Focusing on filling this gap, the book is intended for a broad audience ranging from researchers in the database and machine learning communities to industry practitioners. Therefore, the book is necessary and important for the research community.
In our book, we first design three evaluation metrics to quantify the dirty-data
sensibility, tolerability, and expected accuracy of a model. Based on the metrics, we
propose a generalized framework to evaluate dirty-data impacts on models. Using
the framework, we conduct an experimental comparison for the effects of missing,
inconsistent, and conflicting data on classification, clustering, and regression models. Based on the experimental findings, we provide guidelines for model selection
and data cleaning. Then, we present a generic classification model for incomplete
data where existing classification methods can be effectively incorporated. Next, we
develop a density-based clustering approach for incomplete data based on Bayesian
theory, which conducts imputation and clustering concurrently and makes use of
intermediate clustering results. In addition, we incorporate data quality issues into feature selection. We model feature selection on inconsistent data as an optimization problem that embeds consistency constraints and the value consistencies of features, so as to make sufficient use of the information provided by consistency issues. The problem is proved to be NP-hard, and an efficient algorithm with a ratio bound is developed. Moreover, we consider data quality in the problem of cost-sensitive
decision tree induction. To solve the problem, we present three decision tree
induction methods integrated with data cleaning algorithms.
Readers will be interested in the take-away suggestions of model selection and
data cleaning, the incomplete data classification with view-based decision tree, the
density-based clustering for incomplete data, the feature selection method which
reduces the time costs and guarantees the accuracy of machine learning models, and
the cost-sensitive decision tree induction approaches under different scenarios.
This book opens many noteworthy avenues for the further study of dirty data analysis, such as on-demand data cleaning, constructing a model to predict dirty-data impacts, and incorporating data quality issues into other machine learning models. From this book, readers can learn the state-of-the-art dirty data processing
techniques for machine learning, capture the research advances, and be inspired to
find new ideas in this area.
Basic knowledge in data management and machine learning is sufficient to follow
this book. We hope the ideas discussed in this book can inspire a broad range of readers, from researchers in the database and machine learning communities
to industry practitioners, and further prompt them to join the field of dirty data
processing for machine learning.

Harbin, P.R. China
July 2023

Zhixin Qi
Hongzhi Wang
Zejiao Dong
Acknowledgments

This book was partially supported by the Harbin Institute of Technology Start-
Up Research Funds for Assistant Professors (No. AUGA5630109723) and
the Transportation Investment Group Project of Heilongjiang Province (No.
QTQQ2575109021).

Contents

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Why Dirty Data Processing for Machine Learning? . . . . . . . . . . . . . . . . . . 1
1.2 Summary of Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.1 Data Cleaning and Noise Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.2 Impacts of Noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.3 Noise-Robust Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.4 Cost-Sensitive Decision Tree Induction . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Overview of the Book. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2 Impacts of Dirty Data on Classification and Clustering Models . . . . . . . . 7
2.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 How to Evaluate Dirty-Data Impacts on Classification and
Clustering Models? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.1 Data Sets, Models, and Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.2 Dimensions of Data Quality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.3 Evaluation Measures. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3 How Does Dirty Data Affect Classification and Clustering Models? . . . . . . 14
2.3.1 Results and Analysis of Classification Models . . . . . . . . . . . . . . . . 14
2.3.2 Results and Analysis of Clustering Models . . . . . . . . . . . . . . . . . . . 24
2.4 What Do We Learn from Evaluation Results? . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.4.1 Lessons Learned from Evaluation on Classification
Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.4.2 Guidelines of Classification Model Selection
and Data Cleaning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.4.3 Lessons Learned from Evaluation on Clustering Models . . . . . 34
2.4.4 Guidelines of Clustering Model Selection
and Data Cleaning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.4.5 Suggestions for Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36


3 Dirty Data Impacts on Regression Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39


3.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.2 How to Evaluate Dirty Data Impacts on Regression Models? . . . . . . . . 40
3.3 How Does Dirty Data Affect Regression Models? . . . . . . . . . . . . . . . . . . . . . . . 44
3.3.1 Data Sets, Models, and Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.3.2 Varying Missing Rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.3.3 Varying Inconsistent Rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.3.4 Varying Conflicting Rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.3.5 Lessons Learned . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4 Incomplete Data Classification with View-Based Decision Tree . . . . . . . . 51
4.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.2 How to Organize Tree-Structured Views? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.3 How to Select Views? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.4 Evaluation of Incomplete Data Classification
with View-Based Decision Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.4.1 Comparison Experiments. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.4.2 Influence of Parameters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5 Density-Based Clustering for Incomplete Data . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.2 Background Knowledge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.2.1 Definitions in DBSCAN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.2.2 DBSCAN Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.3 Approach of Concurrently Imputation Clustering . . . . . . . . . . . . . . . . . . . . . 78
5.3.1 Method Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.3.2 Algorithm Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.4 Approach of Local Imputation Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.4.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.4.2 Method Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.4.3 Algorithm Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.5 Evaluation of Density-Based Clustering for Incomplete Data . . . . . . . . 85
5.5.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.5.2 Evaluation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6 Feature Selection on Inconsistent Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
6.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
6.2 Background Knowledge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6.2.1 Mutual Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.2.2 Consistency Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

6.3 How to Select Features on Inconsistent Data? . . . . . . . . . . . . . . . . . . . . . . . . . 98


6.3.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.3.2 Our Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.4 Evaluation of Feature Selection on Inconsistent Data . . . . . . . . . . . . . . . . . 103
6.4.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.4.2 Efficiency of FRIEND . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
6.4.3 Accuracy of Machine Learning Models . . . . . . . . . . . . . . . . . . . . . . . 104
6.4.4 Parameters of SFS Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
6.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
7 Cost-Sensitive Decision Tree Induction on Dirty Data . . . . . . . . . . . . . . . . . . . 111
7.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
7.2 How to Define the Problem of Cost-Sensitive Decision Tree
Induction on Dirty Data? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
7.2.1 Decision Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
7.2.2 Misclassification Cost and Testing Cost . . . . . . . . . . . . . . . . . . . . . . . 114
7.2.3 Detection Cost and Repair Cost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
7.2.4 Cost-Sensitive Decision Tree Building Problem on
Poor Quality Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
7.3 Cost-Sensitive Decision Tree Induction Methods Integrated
with Data Cleaning Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
7.3.1 Cost-Sensitive Decision Tree Building Method
Incorporating Stepwise Cleaning Algorithm Based
on Split Attribute Gain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
7.3.2 Cost-Sensitive Decision Tree Building Method
Incorporating One-time Cleaning Algorithm Based
on Split Attribute Gain and Cleaning Cost . . . . . . . . . . . . . . . . . . . . 120
7.3.3 Cost-Sensitive Decision Tree Building Method
Incorporating Stepwise Cleaning Algorithm Based
on Split Attribute Gain and Cleaning Cost . . . . . . . . . . . . . . . . . . . . 121
7.3.4 Applicability Discussion and Time Complexity
Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
7.4 Evaluation of Cost-Sensitive Decision Tree Induction
on Dirty Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
7.4.1 Total Cost Incurred by the Classification Task . . . . . . . . . . . . . . . . 126
7.4.2 Accuracy of Classification Tasks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
7.4.3 Efficiency of Classification Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
7.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
Chapter 1
Introduction

Abstract In recent years, with the development of the information age, the amount
of data has grown dramatically. At the same time, dirty data have already existed
in various types of databases. Due to the negative impacts of dirty data on data
mining and machine learning results, data quality issues have attracted widespread
attention. Motivated by this, this book aims to analyze the impacts of dirty
data on machine learning models and explore the proper methods for dirty data
processing. This chapter discusses the background of dirty data processing for
machine learning. In Sect. 1.1, we analyze three basic dimensions of data quality
to motivate the necessity of processing dirty data in the database and machine
learning communities. In Sect. 1.2, we summarize the existing studies and explain the differences between our research and current work. We conclude the chapter with an
overview of the structure of this book in Sect. 1.3.

1.1 Why Dirty Data Processing for Machine Learning?

Data quality has many dimensions. The three basic dimensions are completeness,
consistency, and entity identity. For these dimensions, the corresponding dirty data
types are missing data, inconsistent data, and conflicting data. In this section, we
first introduce these three kinds of dirty data.
Missing data refer to values that are missing from databases. For example, in Table 1.1, the values of t1[Country] and t2[City] are missing data.
Inconsistent data are identified as violations of consistency rules which describe the semantic constraints of data. For example, a consistency rule "[Student No.] → [Name]" in Table 1.1 means that Student No. determines Name. As the table shows, t1[Student No.] = t2[Student No.], but t1[Name] ≠ t2[Name]. Thus, the values of t1[Student No.], t1[Name], and t2[Name] are inconsistent.

Conflicting data refer to different values which describe an attribute of the same entity. For example, in Table 1.1, both t3 and t4 describe Bob's information, but t3[City] and t4[City] are different. Thus, t3[City] and t4[City] are conflicting data.
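To make the three kinds of dirty data concrete, the following sketch checks a small relation such as the one in Table 1.1 for missing values, violations of the consistency rule [Student No.] → [Name], and conflicting City values for the same entity. It is only an illustrative Python/pandas sketch of the definitions above, not code from this book; the column names and the use of (Student No., Name) as the entity key for conflict detection are assumptions made for the example.

```python
import pandas as pd

# Toy relation modeled on Table 1.1; None marks a missing value (assumed encoding).
df = pd.DataFrame({
    "Student No.": ["170302", "170302", "170304", "170304"],
    "Name":        ["Alice",  "Steven", "Bob",    "Bob"],
    "City":        ["NYC",    None,     "NYC",    "LA"],
    "Country":     [None,     "FR",     "U.S.A",  "U.S.A"],
})

# Missing data: cells without a value, e.g., t1[Country] and t2[City].
missing_cells = [(row, col) for col in df.columns for row in df.index[df[col].isna()]]

# Inconsistent data: violations of the rule [Student No.] -> [Name],
# i.e., one Student No. associated with more than one Name.
fd_violations = (
    df.groupby("Student No.")["Name"].nunique()
      .loc[lambda counts: counts > 1].index.tolist()
)

# Conflicting data: the same entity (identified here by Student No. and Name)
# described by different City values, e.g., t3[City] and t4[City].
conflicts = (
    df.dropna(subset=["City"])
      .groupby(["Student No.", "Name"])["City"].nunique()
      .loc[lambda counts: counts > 1].index.tolist()
)

print("missing cells:", missing_cells)
print("Student No. violating [Student No.] -> [Name]:", fd_violations)
print("entities with conflicting City values:", conflicts)
```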

In addition to the three data quality dimensions explained above, there are also other dimensions, such as currency, validity, credibility, etc.


Table 1.1 Student information

Tuple  Student No.  Name    City  Country
t1     170302       Alice   NYC
t2     170302       Steven        FR
t3     170304       Bob     NYC   U.S.A
t4     170304       Bob     LA    U.S.A

Data that violate the corresponding rules of data quality dimensions are considered as dirty data. Due to the negative impacts of dirty data on data mining and machine learning tasks, poor-quality data directly affect the decision-making and economic benefits of a business or group. In contrast, high-quality data not only guarantee the correctness of the decisions made by data analysis but also yield potential and valuable rules from the results of data mining or machine learning approaches. Therefore, it is necessary to analyze the impacts of different dirty data on different data mining or machine learning models and explore the relationship between data quality and the accuracy of model results.
Even though existing dirty data detection and repairing methods improve data quality dramatically, the time costs are still high. Thus, this book attempts to explore ways of processing dirty data for machine learning that not only guarantee the accuracy and high efficiency of machine learning models but also reduce the costs of data cleaning.

1.2 Summary of Related Work

This section summarizes the related studies of dirty data processing for machine
learning. We overview these studies and analyze the differences between our
proposed methods and the existing work.

1.2.1 Data Cleaning and Noise Reduction

A lot of studies go into cleaning up training or test data to reduce noise. In the
database world, various data cleaning methods have been proposed in different data
quality dimensions. For example, Wang et al. propose a hybrid human–machine
approach to identify the same entity in a data set [1]. Beskales et al. discuss
simultaneous modifications to inconsistent data and functional dependencies [2].
Chu et al. combine heterogeneous rules to propose a holistic approach for the automatic repair of dirty data [3]. Chu et al. also build a data cleaning system driven by knowledge bases and crowdsourcing [4]. Hao et al. propose detection rules for detecting and repairing erroneous relational data [5]. These methods greatly improve data quality, but because they repair the entire dirty data set, data cleaning is costly. Instead of proposing a

data cleaning method, we investigate the relationship between the quality of the data
set and the accuracy of the model. This relationship provides guidance for selective
data cleansing, which reduces costs.
In the machine learning community, reducing the consequences of noise is
important. Gamberger and Lavrač first test a series of noise detection and elimination algorithms in the data preprocessing of inductive learning models and propose that a loose consensus saturation filter is a good noise reduction scheme [6]. García-Laencina et al. analyze the problem of missing data in pattern classification tasks [7]. They compare a number of methods used to deal with missing values and offer solutions based on the experimental results. Lim develops an automatic correction algorithm to clean noisy city names [8]. Instead of proposing ways to
reduce noise, our book focuses on exploring the relationship between noise and
learning models. Based on this relationship and given data, we can choose a suitable
machine learning model to achieve high accuracy.

1.2.2 Impacts of Noise

Some researchers have focused on the impact of noise on data mining and machine
learning. Song et al. investigated the interesting problem of clustering and
repairing dirty data simultaneously [9]. They formalized it as an integer linear
programming (ILP) problem and proposed an LP solution without calling a solver.
After solving this problem, they designed an approximation algorithm that improved
the accuracy of clustering and cleaning. The work of Zhu and Wu is most closely
related to our work [10]. They investigate the relationship between attribute noise
and classification accuracy, the effects of different attribute noise, and possible
solutions to deal with attribute noise. Building on this work, we sought to explore the impacts of dirty data in terms of data completeness, consistency, and entity identity.
To achieve this goal, we conducted an experimental evaluation of the relationship
between dirty data and model accuracy for different types of dirty data, including
missing, inconsistent, and conflicting data. Based on the evaluation results, we
provide users with guidelines for model selection and data cleaning.

1.2.3 Noise-Robust Models

Many studies have focused on analyzing which data mining and machine learning
models are more robust to noise. For example, Frénay and Verleysen discuss the
potential impact of label noise on classification and propose some noise-robust and
noise-tolerant models [11]. However, existing studies distinguish between class noise and attribute noise and analyze noise-robust models for classification or regression tasks. Instead, our book classifies dirty data into missing values,
inconsistent values, and conflicting values from the perspective of data quality

dimension. We then test classification and clustering models on dirty data to explore
the effects of dirty data on the model. Based on the relationship between model
accuracy and dirty data, we recommend using dirty data-supported models for data
mining and machine learning tasks.

1.2.4 Cost-Sensitive Decision Tree Induction

At present, the research objectives of the cost-sensitive decision tree building problem mainly include three kinds: the first one is to minimize the misclassification
cost of the decision tree, the second one is to minimize the testing cost of the
decision tree, and the third one is to minimize the sum of misclassification cost
and testing cost of the decision tree [12]. The research objective of this book is the
third kind.
Many cost-sensitive decision tree building methods have been proposed for different problem objectives. These methods can be divided into two main categories.
The first one uses a greedy approach to build a single decision tree. For example,
the CS-ID3 algorithm [13] uses an entropy-based selection method to minimize the
cost during decision tree building, and the AUCSplit algorithm [14] minimizes the
cost after the decision tree has been built. The second one is a non-greedy method
that generates multiple decision trees, such as the genetic algorithm ICET, and the
MetaCost algorithm that wraps together existing accuracy-based methods [15].
In chronological order, Hunt et al. first found that misclassification and testing
have some influence on people’s decision-making and proposed the concept learning
system framework [16]. Subsequently, the ID3 algorithm [17] adopted some of the
ideas of the conceptual learning system framework and used information-theoretic
evaluation parameters to select features. After the information-theoretic approach
was proposed, many methods to minimize the cost in the decision tree building
process were proposed successively, such as CS-ID3 algorithm [13] and EG2
algorithm [18]. Then, methods to minimize the cost after the decision tree building
were proposed, such as AUCSplit algorithm [14]. Then, genetic methods such as
ICET algorithm [19], boosting methods such as UBoost algorithm [20], bagging
methods such as MetaCost algorithm [15], etc. were proposed. Subsequently,
multistructured methods such as LazyTree algorithm [21], randomized methods
such as ACT algorithm [22], and TATA algorithm [23] were proposed one after
another.
Based on minimizing the misclassification cost and testing cost of cost-sensitive
decision trees, some studies have focused on the problem of building cost-sensitive
decision trees with other constraints. For example, since some classification tasks
need to be completed within a specified time, cost-sensitive decision tree building
methods under time constraints have been proposed. Sometimes, users have expectations on the accuracy of the classification task, so cost-sensitive decision tree building methods for user requirements are proposed [24]. However, no research
has focused on cost-sensitive decision tree building on poor-quality data. Therefore,
this book will fill this gap.

1.3 Overview of the Book

In Chaps. 2 and 3 of this book, we first design three evaluation metrics to quantify
the dirty-data sensibility, tolerability, and expected accuracy of a model. Based on
the metrics, we propose a generalized framework to evaluate dirty-data impacts
on models. Using the framework, we conduct an experimental comparison for the
effects of missing, inconsistent, and conflicting data on classification, clustering,
and regression models. Based on the experimental findings, we provide guidelines
for model selection and data cleaning. In Chap. 4, we present a generic classification model for incomplete data where existing classification methods can be
effectively incorporated. In Chap. 5, we develop a density-based clustering approach
for incomplete data based on Bayesian theory, which conducts imputation and
clustering concurrently and makes use of intermediate clustering results. In Chap. 6,
we incorporate data quality issues into feature selection. We model feature selection on inconsistent data as an optimization problem that embeds consistency constraints and the value consistencies of features, so as to make sufficient use of the information provided by consistency issues. The problem is proved to be NP-hard, and an efficient algorithm with a ratio bound is developed. In Chap. 7, we consider data quality in the
problem of cost-sensitive decision tree induction. To solve the problem, we present
three decision tree induction methods integrated with data cleaning algorithms.

References

1. J. Wang, T. Kraska, M.J. Franklin, J. Feng, CrowdER: crowdsourcing entity resolution. PVLDB
5(11), 1483–1494 (2012)
2. G. Beskales, I.F. Ilyas, L. Golab, A. Galiullin, On the relative trust between inconsistent data
and inaccurate constraints, in 2013 IEEE 29th International Conference on Data Engineering
(ICDE) (IEEE, New York, 2013), pp. 541–552
3. X. Chu, I.F. Ilyas, P. Papotti, Holistic data cleaning: Putting violations into context, in 2013
IEEE 29th International Conference on Data Engineering (ICDE) (2013), pp. 458–469
4. X. Chu, J. Morcos, I.F. Ilyas, M. Ouzzani, P. Papotti, N. Tang, Y. Ye, Katara: a data cleaning
system powered by knowledge bases and crowdsourcing, in Proceedings of the 2015 ACM
SIGMOD International Conference on Management of Data (SIGMOD) (2015), pp. 1247–
1261
5. S. Hao, N. Tang, G. Li, J. Li, Cleaning relations using knowledge bases, in 2017 IEEE 33rd
International Conference on Data Engineering (ICDE) (2017), pp. 933–944
6. D. Gamberger, N. Lavrač, Conditions for Occam’s razor applicability and noise elimination, in
European Conference on Machine Learning (Springer, Berlin, 1997), pp. 108–123
7. P.J. García-Laencina, J. Sancho-Gómez, A.R. Figueiras-Vidal, Pattern classification with
missing data: a review. Neural Comput. Appl. 19(2), 263–282 (2010)
8. S. Lim, Cleansing noisy city names in spatial data mining, in International Conference on
Information Science and Applications (IEEE, New York, 2010), pp. 1–8
9. S. Song, C. Li, X. Zhang, Turn waste into wealth: On simultaneous clustering and cleaning over
dirty data, in Proceedings of the 21st ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining (2015), pp. 1115–1124

10. X. Zhu, X. Wu, Class noise vs. attribute noise: a quantitative study. Artif. Intell. Rev. 22(3),
177–210 (2004)
11. B. Frénay, M. Verleysen, Classification in the presence of label noise: a survey. IEEE Trans.
Neural Networks Learn. Syst. 25(5), 845–869 (2014)
12. S. Lomax, S. Vadera, A survey of cost-sensitive decision tree induction algorithms. ACM
Comput. Surv. (CSUR) 45(2), 1–35 (2013)
13. M. Tan, Cost-sensitive learning of classification knowledge and its applications in robotics.
Mach. Learn. 13, 7–33 (1993)
14. C. Ferri, P. Flach, J. Hernández-Orallo, Learning decision trees using the area under the ROC
curve, in ICML, vol. 2 (2002), pp. 139–146
15. P. Domingos, Metacost: a general method for making classifiers cost-sensitive, in Proceedings
of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining (1999), pp. 155–164
16. J. Mingers, An empirical comparison of selection measures for decision-tree induction. Mach.
Learn. 3, 319–342 (1989)
17. J.R. Quinlan, Induction of decision trees. Mach. Learn. 1(1), 81–106 (1986)
18. M. Núñez, The use of background knowledge in decision tree induction. Mach. Learn. 6, 231–
250 (1991)
19. P.D. Turney, Cost-sensitive classification: empirical evaluation of a hybrid genetic decision tree
induction algorithm. J. Artif. Intell. Res. 2, 369–409 (1994)
20. K.M. Ting, Z. Zheng, Boosting cost-sensitive trees, in International Conference on Discovery
Science (Springer, Berlin, 1998), pp. 244–255
21. C.X. Ling, V.S. Sheng, Q. Yang, Test strategies for cost-sensitive decision trees. IEEE Trans.
Knowl. Data Eng. 18(8), 1055–1067 (2006)
22. S. Esmeir, S. Markovitch, Anytime induction of low-cost, low-error classifiers: a sampling-
based approach. J. Artif. Intell. Res. 33, 1–31 (2008)
23. S. Esmeir, S. Markovitch, Anytime learning of anycost classifiers. Mach. Learn. 82, 445–473
(2011)
24. Y.-L. Chen, C.-C. Wu, K. Tang, Time-constrained cost-sensitive decision tree induction. Inf.
Sci. 354, 140–152 (2016)
Chapter 2
Impacts of Dirty Data on Classification
and Clustering Models

Abstract Since dirty data have a negative influence on the accuracy of machine learning models, the relation between data quality and model results could be used in the selection of the proper model and of data cleaning strategies. However, little work has focused on this topic. Motivated by this, this chapter compares the impacts of missing, inconsistent, and conflicting data on basic classification and clustering models. Based on the evaluation observations, we suggest to users in the database and machine learning communities how to select appropriate classification and clustering models and how to clean dirty data. Section 2.1 gives the research motivation of this chapter. Section 2.2 describes our assessment methodology. Section 2.3 presents our experimental results and analyses. We discuss the lessons learned from the evaluation in Sect. 2.4 and provide strategies for model selection and data cleaning, followed by a brief summary in Sect. 2.5.

2.1 Motivation

Data quality has become a serious problem that cannot be ignored in the database and machine learning communities. We call data that have data quality issues "dirty data." Obviously, for classification or clustering tasks, dirty data in both training and test data sets can affect accuracy. Therefore, we must know the relationship between the quality of the input data set and the accuracy of the results. Based on this relationship, we can choose an appropriate model that takes data quality issues into account and determine the proportion of data that needs to be cleaned.
Due to the large number of classification and clustering models, it is difficult for
users to decide which model should be adopted. The impact of data quality on the
model is helpful in the selection of the model. Therefore, it is necessary to discuss
the influence of dirty data on the model.
Before performing a classification or clustering task, it is necessary to perform
data cleaning to ensure data quality. Various data cleaning methods have been
proposed, for example, data repair with integrity constraints [1, 2], knowledge-
based cleaning systems [3, 4], and crowdsourced data cleaning [3, 5]. These methods
greatly improve data quality, but the cost of data cleaning is still expensive [6]. If

we know how dirty data affects the accuracy of the model, we can selectively clean
the data based on accuracy requirements, rather than cleaning the entire dirty data.
As a result, the cost of data cleaning is reduced. Therefore, there is an urgent need
to study the relationship between data quality and the accuracy of results.
Most existing research is devoted to developing methods for data cleaning and noise reduction [1–9]. Some literature also focuses on noise-robust models from the perspective of class labels [10]. Little work has been done to study the effects of attribute noise [11]. Unfortunately, no research has explored the impact of dirty data from a data quality perspective. Therefore, this chapter aims to fill this gap, which brings the following challenges:
(1) There are not enough experiments to comprehensively compare and analyze
the sensitivity and tolerance of classification and clustering models. This makes
it difficult for users to determine which model is the best and decide which
model to use. Therefore, the first challenge is to design appropriate experimental
methods to find models that are least sensitive to and most tolerant of dirty data.
(2) Existing measures of classification and clustering models, such as precision,
recall, and F-measure, have been proposed with the aim of testing the accuracy
of the model, while none has been able to quantify the dirty data sensitivity and
tolerance of the model. Defining appropriate evaluation indicators is the second
challenge.
Given these challenges, we selected 14 classical data mining and machine learning models and tried to explore their sensitivity and tolerance to dirty data.
To achieve this, we generated dirty data sets based on nine typical data sets for
classification and clustering, taking into account various factors such as data quality
dimensions, dirty data rate, and data size. We then experimentally compared the
performance of different models on various dirty data. We propose two new metrics
to measure a model’s sensitivity and tolerance to dirty data. Based on the evaluation
results, we provide recommendations for model selection and data cleaning.
To sum up, our contributions in this chapter are listed below.
(1) To assess the impact of dirty data on a model, we propose two new metrics, namely sensibility and data quality inflection point. These two indicators are used to measure the dirty-data sensibility of a model and its tolerance to dirty data.
(2) Using the proposed indicators, we experimentally compare the effects of dirty
data from different data quality dimensions on classification and clustering
models. As far as we know, this is the first work to focus on this problem.
(3) Based on the experimental results, we find some factors that affect the performance of the models and provide users with guidance on model selection and data cleaning.

2.2 How to Evaluate Dirty-Data Impacts on Classification and Clustering Models?

In this section, we describe our experimental approach, including data sets (Sect. 2.2.1), classification and clustering models (Sect. 2.2.1), settings (Sect. 2.2.1), data quality dimensions (Sect. 2.2.2), and evaluation measures (Sect. 2.2.3).

2.2.1 Data Sets, Models, and Setup

We selected 9 typical data sets of different types and sizes from the UCI public repository.^1 Their basic information is shown in Table 2.1. Since these raw data sets are complete and correct, we inject errors at different rates into them along different data quality dimensions and produce different kinds of dirty data sets. We then compare how the various models perform on them. In the experimental evaluation, the original data set is used as a baseline, and the accuracy of each model is measured against its results on the original data set.
We selected 14 classical classification and clustering models. Their types and parameter settings are shown in Table 2.2. We chose these models because they are commonly used as competitive baselines [12–15].
All experiments were conducted on a machine with two Intel® Xeon® E5-2609 CPUs and 32 GB of memory, running CentOS 7. All models are implemented in C++ and compiled with g++ 4.8.5.

2.2.2 Dimensions of Data Quality

Data quality has many aspects [16]. For each aspect, there is a corresponding type of erroneous data. In the research literature, dirty data are classified into various types. Most existing research focuses on missing, inconsistent, and conflicting data to improve data quality [17, 18]. Therefore, in this chapter, we focus on these three basic types. For these types, the corresponding data quality dimensions are completeness, consistency, and entity identity.
Missing data refer to values that are missing from the database. For example, in Table 2.3, the values of t1[Country] and t2[City] are missing data.
Inconsistent data are identified as violations of functional dependencies, which describe semantic constraints on the data. For example, the functional dependency "[Student No.] → [Name]" in Table 2.3 means that Student No. determines Name. As shown in Table 2.3, t1[Student No.] = t2[Student No.], but t1[Name] ≠ t2[Name]. Therefore, the values of t1[Student No.], t1[Name], and t2[Name] are inconsistent.

^1 http://archive.ics.uci.edu/ml/datasets.html

Table 2.1 Data sets information

Name      Number of attributes  Number of records
Iris      4                     150
Ecoli     8                     336
Car       6                     1728
Chess     36                    3196
Adult     14                    48842
Seeds     7                     210
Abalone   8                     4177
HTRU      9                     17898
Activity  3                     67651

Table 2.2 Models information

Name                    Type            Parameter setting
Decision Tree           Classification  purity = 0.9
K-Nearest Neighbor      Classification  k = 600, distance_type = euclidean
Naive Bayes             Classification  –
Bayesian Network        Classification  –
Logistic Regression     Classification  α = 0.001, max_iteration = 7
Random Forests          Classification  max_height = 3, train_times = 100, max_container = 40
XGBoost                 Classification  learning_rate = 0.1, max_depth = 10, random_state = 30
Multi-layer Perceptron  Classification  n_hidden = 1024, epoch_num = 100, learning_rate = 0.1
K-Means                 Clustering      distance_type = euclidean
LVQ                     Clustering      learning_rate = 0.2, max_iteration = 1000000, distance_type = euclidean
CLARANS                 Clustering      max_times = 30
DBSCAN                  Clustering      λ = 2, min_pts = 4, max_cluster = 100
BIRCH                   Clustering      max_cf = 20, max_radius = 30000
CURE                    Clustering      α = 0.4, point_num = 10

Table 2.3 Student information

Tuple  Student No.  Name    City  Country
t1     170302       Alice   NYC
t2     170302       Steven        FR
t3     170304       Bob     NYC   U.S.A
t4     170304       Bob     LA    U.S.A

Conflicting data refer to different values of an attribute that describe the same entity. For example, in Table 2.3, both t3 and t4 describe Bob's information, but t3[City] and t4[City] are different. Therefore, t3[City] and t4[City] are conflicting data.

2.2.3 Evaluation Measures

Since there are class labels for classification and clustering in the selected raw data sets, we use the standard precision, recall, and F-measure to evaluate the effectiveness of the classification and clustering models. These measures are calculated as follows:
$$\text{Precision} = \frac{\sum_{i=1}^{n_c} \frac{r_{c_i}}{r_{n_i}}}{n_c}, \qquad (2.1)$$

where $r_{c_i}$ is the number of records correctly classified or clustered into class $i$, $r_{n_i}$ is the number of records classified or clustered into class $i$, and $n_c$ is the number of classes.

$$\text{Recall} = \frac{\sum_{i=1}^{n_c} \frac{r_{c_i}}{r_i}}{n_c}, \qquad (2.2)$$

where $r_{c_i}$ is the number of records that have been correctly classified or clustered into class $i$, $r_i$ is the number of records of class $i$, and $n_c$ is the number of classes.

$$\text{F-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}. \qquad (2.3)$$
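As a cross-check of Eqs. (2.1)–(2.3), the short sketch below computes the three measures from true and predicted class labels by macro-averaging the per-class ratios exactly as the formulas state. It is an illustrative Python sketch under the assumption that the classes are taken from the true labels and that a class with no assigned records contributes zero to the precision sum; the names are ours, not the book's C++ implementation.

```python
from collections import Counter

def precision_recall_f(true_labels, pred_labels):
    """Macro-averaged precision, recall, and F-measure following Eqs. (2.1)-(2.3)."""
    classes = sorted(set(true_labels))              # the n_c classes
    n_c = len(classes)
    r_i = Counter(true_labels)                      # records of class i
    r_n = Counter(pred_labels)                      # records assigned to class i
    r_c = Counter(t for t, p in zip(true_labels, pred_labels) if t == p)  # correct ones

    precision = sum(r_c[c] / r_n[c] for c in classes if r_n[c]) / n_c
    recall = sum(r_c[c] / r_i[c] for c in classes) / n_c
    f_measure = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f_measure

# Small usage example with three classes.
y_true = ["a", "a", "b", "b", "c", "c"]
y_pred = ["a", "b", "b", "b", "c", "a"]
print(precision_recall_f(y_true, y_pred))
```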

However, these metrics only show us changes in accuracy. They cannot measure the degree of fluctuation quantitatively. Therefore, we propose new metrics to evaluate the impact of dirty data on a model. The first metric, sensibility, is defined as follows.
Definition Given the values $y_a$, $y_{a+x}$, $\ldots$, $y_{a+bx}$ of a measure $y$ of a model with error rates $a\%$, $(a+x)\%$, $(a+2x)\%$, $\ldots$, $(a+bx)\%$ ($a \geq 0$, $x > 0$, $b > 0$), the sensibility of a model on dirty data is computed as $|y_a - y_{a+x}| + |y_{a+x} - y_{a+2x}| + \cdots + |y_{a+(b-1)x} - y_{a+bx}|$. □

Note that in this chapter, the error rate represents the proportion of dirty values in the given data, where a determines the starting point of the error rate, x represents the step size of the error rate, and b represents the number of error rates used in the experimental evaluation minus one.
The purpose of sensibility is to measure how much a model fluctuates on dirty data. The greater the sensibility, the greater the fluctuation. Accordingly, dirty data have a greater impact on the model. Sensibility is therefore able to assess the extent to which dirty data affect a model. Here, we use Fig. 2.1 as an example to explain the calculation of sensibility.

Fig. 2.1 Results of k-nearest neighbor: varying missing rate. (a) Precision. (b) Recall. (c) F-measure. (d) Running time (x-axes: missing rate (%) in (a)–(c); data sets with missing rate in (d))

Example 2.1 Given the precision (P) values of the decision tree model when the missing rate of the data is 0%, 10%, ..., 50%, we calculate the sensibility on each data set, as shown in Table 2.4. From Table 2.4, we get that the average sensibility of the decision tree is 25.89%.
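The numbers in Table 2.4 can be reproduced directly from the definition by summing the absolute differences between the measure values at consecutive error rates. The following minimal Python sketch does exactly that; the Iris precision values are the ones listed in Table 2.4, and the function name is our own.

```python
def sensibility(values):
    """Sensibility: sum of absolute differences between consecutive error-rate values."""
    return sum(abs(prev - curr) for prev, curr in zip(values, values[1:]))

# Precision (%) of the decision tree on Iris at missing rates 0%, 10%, ..., 50% (Table 2.4).
iris_precision = [78.37, 84.16, 78.08, 74.36, 64.99, 58.71]
print(round(sensibility(iris_precision), 2))  # 31.24
```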
Although sensibility measures how sensitive a model is, it cannot tell us which error rate is unacceptable for a model. With this motivation in mind, we define a second novel metric, the data quality inflection point, as follows.
Definition Given the values $y_a$, $y_{a+x}$, $\ldots$, $y_{a+bx}$ of a measure $y$ of a model with error rates $a\%$, $(a+x)\%$, $(a+2x)\%$, $\ldots$, $(a+bx)\%$ ($a \geq 0$, $x > 0$, $b > 0$), respectively, a model accuracy $M$, and a number $k$ ($k > 0$): if $M_1 \geq M_2$ when $y_1 \geq y_2$ and $y_{a\%} - y_{(a+ix)\%} > k$ $(0 < i \leq b)$, the data quality inflection point (DQIP for brief) is $\min\{(a+(i-1)x)\%\}$; if $y_{a\%} - y_{(a+bx)\%} \leq k$, DQIP is $\min\{(a+bx)\%\}$. If $M_1 \leq M_2$ when $y_1 \geq y_2$ and $y_{(a+ix)\%} - y_{a\%} > k$ $(0 < i \leq b)$, DQIP is $\min\{(a+(i-1)x)\%\}$; if $y_{(a+bx)\%} - y_{a\%} \leq k$, DQIP is $\min\{(a+bx)\%\}$. □



Table 2.4 Sensibility calculation on each data set

Data set   Sensibility
Iris       |78.37% − 84.16%| + |84.16% − 78.08%| + |78.08% − 74.36%| + |74.36% − 64.99%| + |64.99% − 58.71%| = 31.24%
Ecoli      |63.47% − 62.93%| + |62.93% − 53.97%| + |53.97% − 50.93%| + |50.93% − 48.07%| + |48.07% − 34.5%| = 28.97%
Car        |81.33% − 60.93%| + |60.93% − 43.7%| + |43.7% − 42.87%| + |42.87% − 40.47%| + |40.47% − 35.47%| = 45.86%
Chess      |82.17% − 78.17%| + |78.17% − 76.53%| + |76.53% − 75.77%| + |75.77% − 75.9%| + |75.9% − 75.57%| = 6.86%
Adult      |80.5% − 75.27%| + |75.27% − 71.3%| + |71.3% − 72.93%| + |72.93% − 71.53%| + |71.53% − 67.23%| = 16.53%

Table 2.5 DQIP calculation on each data set

Data set   DQIP
Iris       Because y_0% − y_40% = 78.37% − 64.99% = 13.38% > 10%, DQIP = 40% − 10% = 30%
Ecoli      Because y_0% − y_30% = 63.47% − 50.93% = 12.54% > 10%, DQIP = 30% − 10% = 20%
Car        Because y_0% − y_10% = 81.33% − 60.93% = 20.4% > 10%, DQIP = 10% − 10% = 0%
Chess      Because y_0% − y_50% = 82.17% − 75.57% = 6.6% ≤ 10%, DQIP = 50%
Adult      Because y_0% − y_50% = 80.5% − 67.23% = 13.27% > 10%, DQIP = 50% − 10% = 40%


Note that k represents an acceptable degree of decrement. DQIP is defined as a measure of the acceptable error rate of a model. The greater the value of DQIP, the greater the acceptable error rate of the model, and correspondingly, the higher the tolerance of the model to dirty data. Therefore, DQIP is useful for evaluating the dirty-data tolerance of models. Here, we use Fig. 2.1 as an example to explain the DQIP of the model.
Example 2.2 Given the precision values of the decision tree model when the missing rate of the data is 0%, 10%, ..., 50%, and setting k to 10%, we calculate the DQIP on each data set, as shown in Table 2.5. The average DQIP of the decision tree is 28%.
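Following the definition and Example 2.2, the sketch below computes the DQIP of a model from its measure values at increasing error rates and reproduces the per-data-set values and the 28% average of Table 2.5. It is a minimal illustrative Python sketch; the function name and the higher_is_better flag (covering the second branch of the definition, where a larger measure value means lower accuracy) are our own assumptions.

```python
def dqip(values, rates, k, higher_is_better=True):
    """Data quality inflection point (DQIP) for measure `values` observed at `rates` (%)."""
    base = values[0]
    for i in range(1, len(values)):
        decrement = (base - values[i]) if higher_is_better else (values[i] - base)
        if decrement > k:
            return rates[i - 1]      # the rate just before the first drop from the baseline exceeding k
    return rates[-1]                 # the decrement never exceeds k

# Precision (%) of the decision tree at missing rates 0%-50% (values from Table 2.4).
rates = [0, 10, 20, 30, 40, 50]
data = {
    "Iris":  [78.37, 84.16, 78.08, 74.36, 64.99, 58.71],
    "Ecoli": [63.47, 62.93, 53.97, 50.93, 48.07, 34.50],
    "Car":   [81.33, 60.93, 43.70, 42.87, 40.47, 35.47],
    "Chess": [82.17, 78.17, 76.53, 75.77, 75.90, 75.57],
    "Adult": [80.50, 75.27, 71.30, 72.93, 71.53, 67.23],
}
dqips = {name: dqip(vals, rates, k=10) for name, vals in data.items()}
print(dqips)                             # {'Iris': 30, 'Ecoli': 20, 'Car': 0, 'Chess': 50, 'Adult': 40}
print(sum(dqips.values()) / len(dqips))  # 28.0, the average DQIP in Example 2.2
```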

2.3 How Does Dirty Data Affect Classification and Clustering Models?

We studied the influence of dirty data on the 14 selected models and analyzed the experimental results. In this chapter, we present some representative results; the others can be found online.^2

2.3.1 Results and Analysis of Classification Models


2.3.1.1 Varying Missing Rate

To assess the impact of missing data on the classification models, we randomly removed values from the original data sets and generated five data sets with missing rates of 10%, 20%, 30%, 40%, and 50%. For each tuple, we randomly select one or more attributes and remove the corresponding values. We use tenfold cross-validation and randomly generate training and test data. During training and testing, we replace numeric missing values with average values and categorical missing values with maximum values. The experimental results of KNN are shown in Fig. 2.1, Fig. 2.2 depicts the experimental results of DT, and Fig. 2.3 shows those of RF.
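The dirty-data generation and imputation steps described above can be sketched as follows: blank out randomly chosen attribute values until the target missing rate is reached, then fill numeric gaps with the column mean and categorical gaps with a representative value. This is only an assumed Python/pandas illustration (the book's experiments are implemented in C++); in particular, we read the "maximum values" used for categorical attributes as the most frequent (modal) value, which is our interpretation.

```python
import numpy as np
import pandas as pd

def inject_missing(df, missing_rate, seed=0):
    """Blank out randomly chosen cells until roughly `missing_rate` of all cells are missing."""
    rng = np.random.default_rng(seed)
    dirty = df.copy()
    n_remove = int(missing_rate * dirty.shape[0] * dirty.shape[1])
    rows = rng.integers(0, dirty.shape[0], size=n_remove)
    cols = rng.integers(0, dirty.shape[1], size=n_remove)
    for r, c in zip(rows, cols):
        dirty.iat[r, c] = np.nan          # remove the value of one attribute of this tuple
    return dirty

def impute(df):
    """Fill numeric missing values with the column mean and categorical ones with the mode."""
    filled = df.copy()
    for col in filled.columns:
        if pd.api.types.is_numeric_dtype(filled[col]):
            filled[col] = filled[col].fillna(filled[col].mean())
        else:
            filled[col] = filled[col].fillna(filled[col].mode().iloc[0])
    return filled

# Usage: build a 30%-missing version of a data set and impute it before training and testing.
clean = pd.DataFrame({"x1": np.arange(10.0), "x2": list("ababababab")})
dirty = inject_missing(clean, missing_rate=0.3)
ready = impute(dirty)
```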
Based on these results, we make the following observations. First, for well-performing models (precision, recall, or F-measure greater than 80% on the original data set), the precision, recall, or F-measure of the model becomes less sensitive
as the data size increases, except for logistic regression. The reason is that if the
data are large, the amount of clean data is large enough to reduce the impact
of missing data. However, logistic regression requires parameters to be set in its
regression function, and the calculation of parameters is easily affected by missing
data. Therefore, when the data scale increases, the amount of missing data will
become larger, which has a greater impact on logistic regression.
Second, from Table 2.6, we obtain the sensibility ranking of the different classification models in terms of precision, recall, and F-measure. For example, the order of sensibility in terms of precision is "Bayesian Network > Logistic Regression > Naive Bayes > Decision Tree > Random Forests > XGBoost > Multi-Layer Perceptron > KNN." Therefore, the least sensitive model is KNN. This is because, as the missing rate rises, the increasing number of missing values may not affect the k nearest neighbors. Even if the k nearest neighbors are affected, they will not necessarily vote for the final category label. In addition, the most sensitive model is the Bayesian network. The reason is that the increasing missing data may affect the calculation of the posterior probability, which directly affects the classification results.
probability, which will directly affect the classification results.

² https://github.com/qizhixinhit/Dirty-dataImpacts/blob/master/impacts%20of%20dirty%20data.pdf, June 2021.


Fig. 2.2 Results of decision tree: varying missing rate. (a) Precision. (b) Recall. (c) F-measure. (d) Running time

Third, in Table 2.7, we get the DQIP order of the classification models in terms of precision, recall, and F-measure. For example, the order of DQIP in precision is "Decision Tree > Naive Bayes = Random Forest > KNN = XGBoost > Logistic Regression = Multi-Layer Perceptron > Bayesian Network." Therefore, the model most tolerant of incompleteness for precision and F-measure is the decision tree. This is because the decision tree model only uses split features for classification, and as the missing rate rises, more and more missing data may still not affect the split features. The model most tolerant of incompleteness for recall is the random forest. This is because increasing missing values may not affect the split attributes, and even if the split attributes are affected, the impact of missing data is reduced because there are multiple base classifiers. For precision and recall, the model most intolerant of incompleteness is the Bayesian network. This is because the increase in missing data changes the posterior probability, which directly affects the classification result.
Fig. 2.3 Results of random forest: varying missing rate. (a) Precision. (b) Recall. (c) F-measure. (d) Running time

For F-measure, the model most intolerant of incompleteness is the random forest. The reason is its high F-measure on the original data set (error rate of 0%); when even a small number of missing values appear in the data set, the F-measure drops a lot.
Fourth, the Bayesian network’s precision, recall, and F-measure on Ecoli are much lower when the missing rate is 0% than when the missing rate is 10%. In addition, when the missing rate is 10%, the precision of the random forest on Iris is much higher than when the missing rate is 0%. This is because Ecoli and Iris contain relatively small amounts of data, which easily leads to overfitting: although a better model is trained on data of higher quality, the model may perform worse on the test data. This observation further confirms that it is not necessary to clean the entire dirty data set.

Table 2.6 Sensibility results of classification and clustering models


Missing(%) Inconsistent(%) Conflicting(%)
Model Prec. Rec. F-measure Prec. Rec. F-measure Prec. Rec. F-measure
DT 25.89 31.11 26.64 35.41 40.94 38.33 16.09 21.56 16.45
KNN 18.09 13.18 17.45 21.84 19.21 20.93 11.39 6.70 9.32
NB 27.04 23.37 26.40 29.48 37.18 35.49 15.10 21.85 20.33
BN 46.40 34.04 35.37 33.29 21.53 23.15 17.26 15.18 16.01
LR 38.26 18.73 30.69 37.84 28.10 38.83 31.74 18.51 25.60
RF 25.77 24.57 29.39 39.21 34.86 40.74 27.93 15.85 27.53
XGBoost 24.47 25.90 25.81 31.64 33.42 32.76 12.27 12.64 10.52
MLP 22.81 18.14 18.57 28.94 22.92 25.24 17.70 9.22 11.54
KM 31.06 27.80 32.08 31.83 32.21 35.63 23.79 21.86 25.17
LVQ 11.94 21.14 19.61 20.55 18.83 21.41 9.20 19.57 20.13
CLARANS 34.26 40.16 39.48 31.11 29.45 31.56 20.67 22.64 24.04
DBSCAN 15.89 22.88 17.16 20.40 10.39 12.34 18.64 9.55 16.10
BIRCH 32.58 44.56 32.90 24.32 22.48 19.40 15.16 22.44 16.52
CURE 38.68 32.71 39.23 28.81 32.90 32.67 32.74 29.11 32.62
The bold values are the minimum sensibility values of classification and clustering models, which
correspond to the best performance compared to the other models

Table 2.7 DQIP results of classification and clustering models (k = 10%)
Missing(%) Inconsistent(%) Conflicting(%)
Model Prec. Rec. F-measure Prec. Rec. F-measure Prec. Rec. F-measure
DT 28 26 28 18 16 16 50 50 50
KNN 24 32 20 22 22 22 40 50 40
NB 26 28 24 22 12 12 50 40 40
BN 20 26 24 16 26 26 46 50 50
LR 22 28 16 16 14 16 32 34 32
RF 26 50 10 26 14 8 42 38 34
XGBoost 24 26 22 16 14 12 48 48 48
MLP 22 28 22 18 16 16 46 48 48
KM 38 32 32 28 22 22 44 38 38
LVQ 44 40 48 28 14 20 44 44 40
CLARANS 2 2 0 22 18 18 34 34 28
DBSCAN 30 40 30 32 44 34 36 50 36
BIRCH 24 20 24 20 26 26 50 34 38
CURE 18 18 16 20 18 16 32 34 24
The bold values are the maximum DQIP values of classification and clustering models, which
correspond to the best performance compared to the other models

Fifth, as the amount of data increases, the running time of the classification models grows with the missing rate. This is because, as the data size increases, the amount of missing data becomes larger, which introduces more uncertainty to the algorithm; accordingly, the uncertainty of the running time also increases.

2.3.1.2 Varying Inconsistent Rate

To assess the impact of inconsistency on the classification model, we injected inconsistent values into the original data set and randomly varied the inconsistency
rate to generate five data sets with inconsistency rates of 10%, 20%, 30%, 40%, and
50%, respectively. First, we randomly select a certain number of tuples. For each
selected tuple, we build a corresponding tuple with an inconsistent value based on
the functional dependency. Then, we insert all the new tuples into the given data.
In this way, we generate inconsistent data with controllable remediation [19]. We
use tenfold cross validation and randomly generate training data and test data. Since
inconsistent data has no effect on the training and testing process, we train and test
the model on the generated inconsistent data. The experimental results of KNN, DT,
and RF are shown in Figs. 2.4, 2.5, and 2.6.
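The injection step described above can be sketched as follows (not the authors' code). It assumes a single functional dependency whose dependent attribute is given by name: each sampled tuple is copied, the copy keeps the determinant attributes but receives a different dependent value, and the copies are appended so that each pair violates the dependency.

import numpy as np
import pandas as pd

def inject_inconsistent(df, rate, rhs, seed=0):
    # `rhs` is the dependent attribute of a functional dependency lhs -> rhs;
    # the copied tuple keeps all other attributes (including lhs) unchanged.
    rng = np.random.default_rng(seed)
    new_tuples = df.sample(n=int(rate * len(df)), random_state=seed).copy()
    domain = df[rhs].unique()
    for idx in new_tuples.index:
        alternatives = [v for v in domain if v != new_tuples.at[idx, rhs]]
        if alternatives:
            new_tuples.at[idx, rhs] = rng.choice(alternatives)   # violates lhs -> rhs
    return pd.concat([df, new_tuples], ignore_index=True)

# e.g. inject_inconsistent(data, 0.10, rhs="city")  # "city" is a hypothetical attribute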

Fig. 2.4 Results of k-nearest neighbor: varying inconsistent rate. (a) Precision. (b) Recall. (c) F-measure. (d) Running time
Fig. 2.5 Results of decision tree: varying inconsistent rate. (a) Precision. (b) Recall. (c) F-measure. (d) Running time

Based on these results, we make the following observations. First, Table 2.6 gives the sensibility ranking of the classification models in terms of precision, recall, and F-measure. The least sensitive model is again KNN, for the same reason as when the missing rate is varied. For precision and F-measure, the most sensitive model is the random forest, and for recall it is the decision tree. This is because, as the inconsistency rate increases, more and more incorrect values override the correct values during training, which leads to inaccurate classification results. Since the base classifier of a random forest is a decision tree, the explanation for the random forest is the same as that for the decision tree.
Second, in Table 2.7, we get the DQIP ordering of the classification models in terms of precision, recall, and F-measure. For precision, the model most tolerant of inconsistency is the random forest; the reason is similar to that for the model most tolerant of incompleteness when the missing rate is varied.
Fig. 2.6 Results of random forest: varying inconsistent rate. (a) Precision. (b) Recall. (c) F-measure. (d) Running time

For recall and F-measure, the model most tolerant of inconsistencies is the Bayesian network. This is because inconsistent values contain both incorrect and correct values, so an incorrect value has little effect on the calculation of the posterior probability, and the classification results may not be affected. In terms of precision, the models least tolerant of inconsistencies are the Bayesian network, logistic regression, and XGBoost. For recall, the model least tolerant of inconsistencies is Naive Bayes. For F-measure, the most intolerant model is the random forest. This is because these models have high precision, recall, and F-measure on the original data set (error rate 0%), and when some inconsistent values are injected, the precision, recall, and F-measure drop dramatically.
Third, the observations on running time when the inconsistent rate varies are the same as those when the missing rate varies.

2.3.1.3 Varying Conflicting Rate

To assess the impact of conflicting data on the classification model, we randomly injected conflicting values into the original data set and generated five data sets
with conflict rates of 10%, 20%, 30%, 40%, and 50%. First, we randomly select
a certain number of tuples. For each tuple, we build a corresponding tuple and
modify an attribute value. We then insert the new tuple into the given data. We
used tenfold cross validation and randomly generated training data and test data.
Since the conflict data has no effect on the training and testing process, we train and
test the model on the generated conflict data. The experimental results of KNN, DT,
and RF are shown in Figs. 2.7, 2.8, and 2.9.
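A similar sketch for the conflict-injection step is given below (again not the authors' code). Unlike the inconsistency case, no functional dependency is needed: each sampled tuple is duplicated and one randomly chosen attribute of the duplicate is changed to a different value from the same column, so the two copies conflict.

import numpy as np
import pandas as pd

def inject_conflicting(df, rate, seed=0):
    rng = np.random.default_rng(seed)
    duplicates = df.sample(n=int(rate * len(df)), random_state=seed).copy()
    for idx in duplicates.index:
        col = rng.choice(list(df.columns))                      # attribute to modify
        other_values = df.loc[df[col] != duplicates.at[idx, col], col].to_numpy()
        if len(other_values):
            duplicates.at[idx, col] = rng.choice(other_values)  # now the pair conflicts
    return pd.concat([df, duplicates], ignore_index=True)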
First, the relationship between data size and model sensitivity observed when the conflicting rate varies is the same as that observed when the missing rate varies.

Fig. 2.7 Results of k-nearest neighbor: varying conflicting rate. (a) Precision. (b) Recall. (c) F-measure. (d) Running time
Fig. 2.8 Results of decision tree: varying conflicting rate. (a) Precision. (b) Recall. (c) F-measure. (d) Running time

Second, in Table 2.6, we get the sensibility ranking of the classification models in terms of precision, recall, and F-measure. The least sensitive model is again KNN, for the same reason as when the missing rate is varied. The most sensitive model for precision is logistic regression. This is because the parameter calculation of the regression function is easily affected by the increasing number of conflicting values, resulting in an inaccurate logistic regression model. The most sensitive model for recall is Naive Bayes. This is because, with an ever-increasing amount of conflicting data, incorrect values can affect the calculation of the posterior probabilities in Bayes’ theorem. For F-measure, the most sensitive model is the random forest, for the same reason as when the inconsistency rate is varied.
Third, in Table 2.7, we get the DQIP ranking of the classification models in terms of precision, recall, and F-measure.
Fig. 2.9 Results of random forest: varying conflicting rate. (a) Precision. (b) Recall. (c) F-measure. (d) Running time

The most conflict-tolerant model is the decision tree; the reasons are similar to those for the model most tolerant of incompleteness. The least conflict-tolerant model is logistic regression. This is because the incorrect values in the conflicting data affect the parameter calculation of the logistic regression model, so the classification precision drops sharply.
Fourth, the Decision Tree, XGBoost, and Multi-Layer Perceptron achieve far lower precision, recall, and F-measure on Ecoli when the conflicting rate is 0% than when it is 10%. This is because the amount of data in Ecoli is relatively small, which easily causes overfitting of the models. This observation further confirms that there is no need to clean the entire dirty data set.
Fifth, the observations on running time when the conflicting rate varies are the same as those when the missing rate varies.

2.3.2 Results and Analysis of Clustering Models


2.3.2.1 Varying Missing Rate

To assess the impact of missing data on the clustering model, we randomly deleted
the values from the original data set and generated five data sets with missing rates
of 10%, 20%, 30%, 40%, and 50%. For each tuple, we randomly select one or more
properties and remove the corresponding value. In the clustering process, we use the
mean value to replace the numeric missing value, and the maximum value to replace
the classification missing value. Figures 2.10, 2.11, and 2.12 depict the experimental
results of LVQ, DBSCAN, and CLARANS.

Fig. 2.10 Results of LVQ: varying missing rate. (a) Precision. (b) Recall. (c) F-measure. (d) Running time
Fig. 2.11 Results of DBSCAN: varying missing rate. (a) Precision. (b) Recall. (c) F-measure. (d) Running time

Based on these results, we make the following observations. First, in Table 2.6,
we get the sensitivity ranking of the clustering model for accuracy, recall rate, and
F-measure. Therefore, the least sensitive model for accuracy and recall is the LVQ.
This is because LVQ is a supervised clustering model based on labels. Therefore, the
possibility of being affected by missing values is small. The least sensitive model
for F-measure is DBSCAN. This is due to DBSCAN eliminating all noise points at
the beginning of the model, which makes it more resistant to missing values. The
most sensitive model for accuracy is the CURE. This is because the positions of
representative points in the CURE are susceptible to missing values, resulting in
inaccurate clustering results. The most sensitive model for recall rates is BIRCH.
This is because missing data will affect the construction of clustering feature trees
in BIRCH, which directly leads to wrong clustering results. For F-measure, the most
sensitive model is CLARANS. This is because the calculation of cost differences in
Fig. 2.12 Results of CLARANS: varying missing rate. (a) Precision. (b) Recall. (c) F-measure. (d) Running time

CLARANS is susceptible to missing values, which makes the clustering results of some points incorrect.
Second, in Table 2.7, we get the DQIP ordering of the clustering models in terms of precision, recall, and F-measure. The model with the highest tolerance for incompleteness is LVQ. This is because LVQ is a supervised clustering model based on class labels, so it is unlikely to be affected by missing values. The model most intolerant of incompleteness is CLARANS. This is because the calculation of cost differences in CLARANS is easily affected by missing data, resulting in inaccurate clustering results.
Third, K-Means results for accuracy, recall, and F-measure on Abalone are
significantly lower when the miss rate is 0% than when the miss rate is 10%.
BIRCH’s recall rate and F-measure results for HTRU were much higher with a 10%
miss rate than with a 0% miss rate. In addition, when the miss rate was 0%, CURE’s

accuracy, recall rate, and F-measure on the Activity were significantly lower than the
results when the miss rate was 10%. This phenomenon suggests that fewer missing
values may lead to better clustering results. This observation further confirms that
there is no need to clean up the entire dirty data.
Fourth, with the increase of data volume, the running time of the clustering model
increases with the increase of the missing rate. This is because as the amount of
data increases, the amount of missing data becomes larger, which introduces more
uncertainty into the model. Accordingly, the uncertainty of running time will also
increase.

2.3.2.2 Varying Inconsistent Rate

To assess the impact of inconsistent data on the clustering model, we randomly injected inconsistent values into the original data set and generated five data sets
with inconsistent rates of 10%, 20%, 30%, 40%, and 50%. First, we randomly
select a certain number of tuples. For each selected tuple, we build a corresponding
tuple with inconsistent values based on the functional dependencies. Then, we insert
all the new tuples into the given data. In this way, we generate inconsistent data
with controllable remediation [19]. Since the inconsistent data has no effect on
the clustering process, we train the clustering model on the generated inconsistent
data. The experimental results of DBSCAN, LVQ, and CLARANS are shown in
Figs. 2.13, 2.14, and 2.15.
Based on these results, we make the following observations. First, for well-
performing models (accuracy, recall, or F-measure greater than 80% on the original
data set), the model’s accuracy, recall, or F-measure fluctuates considerably as the
data size increases, except for DBSCAN. This is because as the amount of data
increases, the inconsistent values become larger. The increasing amount of incorrect
data has a greater impact on the clustering process. However, DBSCAN abandons
noise points at the beginning of the model. As the amount of data increases, the
number of correct values increases. Accordingly, the impact on DBSCAN is reduced
when the proportion of points that are excluded decreases.
Second, in Table 2.6, we get the sensibility ranking of the clustering models in terms of precision, recall, and F-measure. The least sensitive model is DBSCAN, for a reason similar to that of the least sensitive model when the missing rate is varied. For precision and F-measure, the most sensitive model is K-Means. This is because the calculation of the cluster centers is susceptible to incorrect values, resulting in incorrect clustering results. The most sensitive model for recall is CURE, for reasons similar to those of the most sensitive model when the missing rate is varied.
Third, in Table 2.7, we get the DQIP ordering of the clustering model in terms
of precision, recall rate, and F-measure. Therefore, the most tolerant model for
inconsistencies is DBSCAN. This is because DBSCAN eliminates all noise points at
the beginning of the model, which makes it more resistant to inconsistent data. For
accuracy, the models most intolerant of inconsistencies are BIRCH and CURE. For
Fig. 2.13 Results of DBSCAN: varying inconsistent rate. (a) Precision. (b) Recall. (c) F-measure. (d) Running time

recall, the model most intolerant of inconsistencies is LVQ, and for F-measure it is CURE. These are because the distance calculations of these models are prone to incorrect values, which leads to inaccurate clustering results.
Fourth, K-Means results for Abalone’s accuracy, recall rate, and F-measure are
much lower when the inconsistency rate is 0% than when the inconsistency rate is
10%. The accuracy of DBSCAN on the Activity with an inconsistency rate of 10%
is much higher than that with an inconsistency rate of 0. In addition, CLARANS
and CURE’s results for accuracy, recall, and F-measure on the Activity were
significantly lower when the inconsistency rate was 0% than when the inconsistency
rate was 10%. This phenomenon suggests that fewer inconsistent values may lead
to better clustering results. This observation further confirms that there is no need to
clean up the entire dirty data.
Fig. 2.14 Results of LVQ: varying inconsistent rate. (a) Precision. (b) Recall. (c) F-measure. (d) Running time

Fifth, the observations on running time when the inconsistent rate varies are the same as those when the missing rate varies.

2.3.2.3 Varying Conflicting Rate

To assess the impact of conflicting data on the clustering model, we randomly injected conflicting values into the original data set and generated five data sets with
conflict rates of 10%, 20%, 30%, 40%, and 50%, respectively. First, we randomly
select a certain number of tuples. For each tuple, we build a corresponding tuple and
modify an attribute value. We then insert the new tuple into the given data. Since the
conflicting data has no effect on the clustering process, we train the clustering model
Fig. 2.15 Results of CLARANS: varying inconsistent rate. (a) Precision. (b) Recall. (c) F-measure. (d) Running time

on the generated conflicting data. The experimental results of DBSCAN, LVQ, and
CLARANS are shown in Figs. 2.16, 2.17, and 2.18.
Based on these results, we make the following observations. First, in Table 2.6, we get the sensibility ranking of the clustering models in terms of precision, recall, and F-measure. The least sensitive model for precision is LVQ, for a reason similar to that of the least sensitive model when the missing rate is varied. The least sensitive model for recall and F-measure is DBSCAN; the reasons have been discussed in Sect. 2.3.2.1. The most sensitive model is CURE, for a reason similar to that of the most sensitive model when the missing rate is varied.
Second, in Table 2.7, we get the DQIP ordering of the clustering model in terms
of precision, recall rate, and F-measure. Therefore, for accuracy, the most conflict-
tolerant model is BIRCH. This is because the conflicting data contains both correct
Fig. 2.16 Results of DBSCAN: varying conflicting rate. (a) Precision. (b) Recall. (c) F-measure. (d) Running time

data and incorrect data, which makes the construction of the clustering feature tree less susceptible to incorrect values. For recall, the most conflict-tolerant model is DBSCAN; the reason is similar to that of the model most tolerant of inconsistency when the inconsistent rate is varied. For F-measure, the most conflict-tolerant model is LVQ; the reason is similar to that of the model most tolerant of incompleteness when the missing rate is varied. The most conflict-intolerant model is CURE. This is because the positions of the representative points in CURE are susceptible to conflicting values, making the clustering of data points inaccurate.
Third, the precision, recall, and F-measure of K-Means on Abalone are much lower when the conflicting rate is 0% than when it is 10%. At the same time, the precision, recall, and F-measure of CLARANS and CURE for campaigns at a 0% conflicting rate are also much lower than the results at a 10% conflicting rate. This phenomenon suggests that fewer conflicting values may lead to
Fig. 2.17 Results of LVQ: varying conflicting rate. (a) Precision. (b) Recall. (c) F-measure. (d) Running time

better clustering results. This observation further confirms that there is no need to
clean up the entire dirty data.
Fourth, the observations on running time when the conflicting rate varies are the same as those when the missing rate varies.

2.4 What Do We Learn from Evaluation Results?

In this section, we first discuss the lessons learned from the evaluation. Based on the
discussion, we provide users with a guide for model selection and data cleaning. At
the same time, we also provide researchers and practitioners with advice on future
work.
Fig. 2.18 Results of CLARANS: varying conflicting rate. (a) Precision. (b) Recall. (c) F-measure. (d) Running time

2.4.1 Lessons Learned from Evaluation on Classification Models

According to the evaluation results and analysis of the classification model, we have
the following conclusions:
• The impact of dirty data is related to the error type and error rate. Therefore, it is
necessary to detect the error rate for each error type in a given data.
• For models with accuracy, recall, or F-measure greater than 80% on the original
data set, the model’s accuracy, recall, or F-measure become less sensitive as the
amount of data increases, except for logistic regression. Since the parameter k in
DQIP is set to 10%, candidate models with accuracy, recall, or F-measure greater
than 70% are acceptable.

• As the amount of data increases, the accuracy, recall rate, and F-measure of the
Logistic Regression become more sensitive.
• It is not necessary to clean up the entire dirty data before proceeding with the
classification task.
• Because the accuracy, recall, or F-measure of the selected classification model
becomes unacceptable when the error rate is higher than its corresponding DQIP,
the error rate of each dirty data type needs to be cleaned below the value of its
DQIP.
• As the amount of data increases, the running time of the classification model
increases with the increase of the error rate.

2.4.2 Guidelines of Classification Model Selection and Data Cleaning

Based on the discussion, we recommend that users choose a classification model and follow the steps below to clean dirty data (a minimal sketch of this workflow is given after the list):
(1) The user is advised to detect the error rates (e.g., missing rate, inconsistency
rate, and conflict rate) of the given data [2, 3].
(2) Based on a given task requirement (e.g., good performance on accuracy, recall,
or F-measure), we recommend that the user selects a candidate model with
better accuracy, recall, or F-measure than 70% on a given data.
(3) If the given data size is larger than 100M, logistic regression is not recom-
mended.
(4) According to the task requirements and the types of errors with the largest
proportion, we suggest that users use our proposed experimental method to
obtain the corresponding sensitivity ranking and select the least sensitive
classification model.
(5) Based on the selected model, the task requirements, and the error rates of the given data, we recommend that users use our evaluation method to obtain the corresponding DQIP ranking and clean each class of dirty data until its error rate falls below the corresponding DQIP.
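The following is a minimal, self-contained sketch of the decision logic behind steps (1)–(5). The error rates, candidate scores, sensibility rankings, and DQIP values are placeholders standing in for the outputs of the detection and evaluation steps described above; only the selection and cleaning-target logic is illustrated.

def select_and_plan(error_rates, scores, sensibility, dqip, data_size_mb):
    # error_rates: {"missing": 20.0, ...} in %; scores: {model: metric in % on the given data};
    # sensibility[error_type][model] and dqip[error_type][model] are read from Tables 2.6/2.7.
    candidates = [m for m, s in scores.items() if s > 70.0]                 # step (2)
    if data_size_mb > 100:
        candidates = [m for m in candidates if m != "LogisticRegression"]   # step (3)
    dominant = max(error_rates, key=error_rates.get)                        # step (1): largest error type
    model = min(candidates, key=lambda m: sensibility[dominant][m])         # step (4): least sensitive model
    targets = {e: dqip[e][model] for e, r in error_rates.items()
               if r > dqip[e][model]}                                       # step (5): clean below DQIP
    return model, targets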

2.4.3 Lessons Learned from Evaluation on Clustering Models

According to the evaluation results and the analysis of the clustering model, we have
the following conclusions:
• The impact of dirty data is related to the error type and error rate. Therefore, it is
necessary to detect the error rate for each error type in a given data.
Other documents randomly have
different content
the floor. There were nuggets of gold almost solid, and some as
large as a goose egg. They were scattered about in reckless
profusion. There were diamonds of small size, uncut, and great
rubies of pigeon-blood colour. It was a cave of riches, and Edgar and
Will feasted their eyes on it in amazement. They held the rubies in
their hands, and gloated over their wondrous colour. They handled
the gold and felt its weight, and were bewildered with the nature of
the discovery.
‘How did all this come here?’ said Edgar. ‘To whom does it belong?’
‘It is mine,’ said Yacka. ‘I am the son of Enooma, and the tribe
collected it. None of them know its value. They do not wish for gold
or stones. All they wish for is to live a savage life, and to have a
country of their own. They cannot be taught what such things as
these mean. Yacka has been in great cities and knows. He has seen
the white man kill for love of gold; he has seen the women of the
white men sell themselves for these,’ and he held up some rubies
and diamonds. ‘It is better for the Enooma to remain as they are.
Gold would make them fight amongst themselves, now they fight
their enemies.’
‘You may be right,’ said Edgar. ‘All the same, I should like a few
samples of your wealth, Yacka.’
‘Take what you will,’ said Yacka. ‘It is far to carry it. Do not take too
much, or you will not reach Yanda again. Water is more precious
than gold sometimes.’
‘May we return and take away more?’ asked Will.
‘If you can find the place,’ said the black; ‘but Yacka will show you
no more.’
‘Then I am afraid we shall not have much chance,’ said Will. ‘It is a
pity all this wealth should be wasted.’
‘Others may find it, and take their share,’ said Yacka. ‘It is not good
for one man to have too much.’
‘We can carry enough away with us,’ said Edgar, ‘to give us a start in
life, anyhow. Perhaps Yacka is right. It is not good for a man to have
too much. Will you help us, Yacka?’
‘To carry gold for you?’ said the black.
‘Yes,’ said Edgar.
‘I will carry some, and stones for you, but I will not use any,’ Yacka
said.
‘You’re a strange being,’ said Edgar; ‘but the black man lives not as
the white man.’
‘No,’ said Yacka; ‘he does not slay his friend for gold.’
Edgar dropped the subject. Whatever the cruel, cowardly conduct of
the blacks might be, he knew enough about the pursuit of wealth to
refrain from arguing with Yacka.
‘The tribe will be waiting for us,’ said Yacka. ‘We must return.’
‘Perhaps the earthquake has frightened them away,’ said Will.
‘They would not feel it so much as we did, being underground,’ said
Edgar.
‘It was no earthquake,’ said Yacka. ‘It was the White Spirit
welcoming you.’
‘A strange welcome,’ said Edgar.
‘Had it been an earthquake you would have been killed,’ said Yacka.
‘I have seen what an earthquake does. It swallows up mountains
and trees, and heaves up other mountains in their place. All the
plains of Australia were formed by earthquakes, and the mountains
were thrown up to make that part smooth.’
‘How long will it take us to return to the tribe?’ said Edgar.
‘Not long,’ replied Yacka. ‘We will go now. We can return for the
gold.’
‘We had better take some now,’ said practical Will.
Edgar was nothing loath, and they filled what pockets they had left
in their torn clothes with gold, rubies, and diamonds.
Yacka watched them and said:
‘I will return for more. You need not come again.’
‘You mean you do not wish us to return,’ said Edgar.
‘That is it,’ said Yacka. ‘I will return alone.’
To this they agreed, acknowledging that Yacka had the right to do as
he pleased, as it was undoubtedly his find. They were not long in
getting out of this strange labyrinth of caves and passages, and
Edgar wondered why they had not come in this way. Before they
reached the exit Yacka said they must be blindfolded. To this at first
they protested, but as Yacka was firm, and they were in his power,
they consented.
Yacka led Will by the hand, Edgar holding Will’s other hand. They
tramped in this way for a considerable time, and then Yacka
removed the covering from their eyes.
They were on the grassy plain once more, but the whole scene had
been changed by the wondrous forces of Nature. Huge masses of
rock were strewn about, and trees were felled and torn up by the
roots. Where they had entered the mountains there was no other
means of passing through. The blacks had retreated before the
terrible storm, and were encamped a long way off. They could just
see the camp fires in the distance. Several dead blacks lay around,
evidently killed by falling rocks, but Yacka took very little notice of
them. Death ended all for these men, and, being dead, Yacka
thought no more of them.
When Edgar looked round to see where they had come out of the
caves, there was no opening anywhere. Yacka smiled as he said:
‘You will never find the entrance. It is known only to me, and once I
lost it and never found it again.’
‘Then that is the reason we went in the other way,’ said Edgar.
‘Yes,’ said Yacka. ‘Now I have the way out, I can find the way in
again.’
They marched towards the camp, and the Enooma rushed to meet
them, uttering loud cries of delight. They had never expected to see
them return alive after such a terrific earthquake. These blacks were
strange people. Terrified as they had recently been, they had in a
very few hours forgotten their experiences. The sudden changes in
this climate had made them familiar with the working of the forces
of Nature, which are truly marvellous.
In the stillness of the night, as Edgar and Will sat side by side, they
returned thanks for their merciful escape. It was an experience they
would never forget, and now that it was over both felt untold gold
would not tempt them to brave it again.
CHAPTER XX.
THE RETURN TO YANDA.

Before they were awake next morning Yacka, true to his promise,
went to the cave and returned with some of the finest rubies and
purest lumps of gold. He roused Edgar and Will, and showed them
what he had done.
‘It is as much as we can carry,’ he said, and they agreed with him.
The gold was heavy, and they had a long tramp before them.
Without further delay they collected their treasure, and made it
secure in a strong skin loin-cloth, which was fastened by dried strips
of leather, so that none of the stones could fall out.
‘This is like putting all our eggs in one basket,’ said Edgar. ‘I think we
had better carry the best of the rubies about us.’
This was done, and the bag again fastened securely.
The Enooma accompanied them, and left them about a couple of
days’ journey from the ranges.
At this point Edgar and Will bade them farewell, and Yacka promised
to return and travel with them further north. The black had
explained to them all that had taken place in the caves, and they did
not care to remain longer in that district.
Yacka led them safely through the MacDonnell Ranges, and they
reached Alice Springs, where they had a hearty welcome.
‘We never expected to see you alive again,’ said Walter Hepburn.
‘You have been away close upon six months, and we thought you
were gone for good. I hope you are satisfied with your experiences.’
‘We are,’ said Edgar. ‘We have seen many strange and wonderful
sights.’
‘You must tell me about your adventures to-night,’ said Hepburn. ‘I
have kept your horses safe, and they will be ready for the journey.’
It was a relief to Edgar and Will to obtain fresh clothes, for those
they wore were almost in rags.
The night of their arrival they related to Walter Hepburn all that had
befallen them, and he was amazed. He could hardly credit the
account Edgar gave of the wealth found in the cave of Enooma; but
when he saw the precious stones and gold spread out before him,
he was completely overwhelmed.
‘This is pure gold,’ he said, as he handled a large lump of the
precious metal. ‘And these rubies are exceedingly rich in colour, and
worth a heap of money. We have found rubies in the creeks here,
but nothing to be compared to these. Of course, you will return with
a properly equipped expedition, and carry the bulk of it away?’
‘I am afraid that will be out of the question,’ said Edgar. ‘Yacka will
not guide us there again, and I am sure we could not find the place.’
‘Yacka must be forced to act as guide,’ said Hepburn. ‘Such a
treasure as you have discovered cannot be allowed to remain
buried.’
‘I shall not be the one to use force against Yacka,’ said Edgar. ‘The
black has acted honestly by us, and we must do the same by him.’
‘If you fellows do not have another try to find the place I shall,’ said
Hepburn.
Edgar laughed as he said:
‘You are welcome to do so. For my part I have had enough of it, and
am glad to have got back again with a whole skin.’
‘You must be careful not to let anyone know about here what you
have with you. There are some desperate characters, and a mere
hint as to the wealth you have, and your lives would not be safe,’
said Hepburn.
‘We have told no one but yourself,’ said Edgar; ‘and we know we can
trust you. You are an old Redbank boy.’
After some persuasion Walter Hepburn agreed to accept a couple of
fine rubies and a heavy nugget in return for the keep of the horses,
and as a remembrance of their visit. As well as he was able Edgar
described the country they had traversed and the appearance of the
place where the caves were.
‘Even if you reach there safely,’ said Edgar, ‘you will not be able to
find the entrance. We could see nothing of it, and even Yacka lost
the run of it once.’
‘It is worth the risk,’ said Hepburn. ‘I wish I had gone with you. I am
used to these wilds, and once I had been over the ground I am sure
I could find my way back.’
They did not remain long at Alice Springs, as they were eager to
return to Yanda and learn how their friends had got on during their
absence.
The return journey passed in much the same way as their ride to
Alice Springs from Yanda.
They had a plentiful supply of ammunition, which Walter Hepburn
had given them, and consequently were not afraid to shoot when in
need of provisions.
Edgar noticed Yacka was restless, and did not seem at his ease
during their journey, and he questioned him as to the reason.
‘I have a fear we are being followed,’ said Yacka. ‘I have seen no
one, but still I fear it. Did anyone know you had gold and stones at
Alice Springs?’
‘Only Walter Hepburn,’ said Edgar. ‘We were careful not to tell
anyone else.’
‘You showed him the stones?’ asked Yacka.
‘Yes,’ said Edgar; ‘we spread them out on the table in his house, but
no one else was there.’
‘But there are windows,’ said Yacka, ‘and someone may have looked
in. It was foolish.’
‘I think you are wrong about anyone following us,’ said Will. ‘They
would have attacked us before now.’
Yacka explained that he had not slept at night since they left the
Springs. He had watched and waited and heard strange sounds. He
felt sure they were being followed, but at some distance.
‘You must have a sleep to-night, anyhow,’ said Edgar, ‘or you will
knock up. We can keep watch in turns.’
Yacka assented, for he felt much in need of sleep.
They camped on a level patch of ground, where there was not much
surrounding shelter, and where they felt secure against any surprise.
Worn out from want of sleep, Yacka stretched himself on the ground,
and quickly fell into a deep slumber.
‘He’s dead tired,’ said Edgar. ‘I have never seen him drop off into
such a sound sleep. He generally has an eye open, and his ears
catch every sound.’
‘Are you going to take first watch?’ said Will.
‘If you like,’ said Edgar. ‘I will rouse you when I become drowsy.’
Will soon followed Yacka into the land of dreams, and Edgar, leaning
his back against the trunk of a tree, watched them. The treasure
was close to him, and the sight of it brought back to him the scenes
they had witnessed. From these experiences his thoughts wandered
to Wal Jessop and Eva, and he wondered how they had gone on
during his absence. He was anxious to see them again, and when he
reached Yanda meant to take a trip to Sydney as early as possible.
Then he thought of home, and his father and sister, and hoped to
have letters from them at Yanda. They would be anxious to hear
how his exploit had turned out, and what a glowing account he
would give them! Lost in these pleasant reflections, he did not hear
the stealthy tread of two men behind the tree.
These men kept well in the shadow of the trunk of the tree against
which Edgar sat, all unconscious of their approach. They were
desperate-looking fellows, dressed in bush fashion, and had
evidently ridden after Edgar and his companions from Alice Springs.
Cautiously they approached, avoiding the loose twigs on the ground,
and halting to listen intently at every few yards. Each man had a
revolver in his hand, and a knife in his belt.
The taller of the two motioned to the knife at his side, and pointed
to Edgar. The other nodded, and drew out his formidable blade. He
then crept, knife in hand, towards Edgar, and his companion made
towards Will.
Edgar, who began to feel drowsy, rose to his feet and leaned on his
shoulder against the tree, his back still to the man stealing up, knife
in hand. Edgar little knew the peril he was in, and dreaded nothing.
Nearer and nearer drew the man with his murderous weapon. He
was now close to the tree, and had his knife uplifted ready to strike.
Suddenly a laughing jackass, perched in the branches above Edgar’s
head, gave his mocking laugh. The sound startled him, and he
turned round; as he did so he saw the man, and the knife he had in
his uplifted hand flashed in the faint moonlight.
He shouted, ‘Yacka! Yacka! Will! Will!’ and sprang backwards.
The man rushed upon him just as Will opened his eyes in a half-
drowsy way, and dimly realized that a man was pointing his revolver
at him.
‘Move, and I fire!’ said the man to Yacka, as he saw the black spring
to his feet.
Yacka dared not move; he knew it would be instant death to Will.
Meanwhile Edgar grappled with his assailant, and a desperate
struggle was going on.
The man covering Will called out to his mate and Edgar:
‘Drop struggling, or I fire!’
Edgar glanced at him, and saw the danger Will was in.
‘Hands off!’ he said, and the man ceased to struggle with him.
Unfortunately, neither Edgar or Will had their revolvers handy, and
their guns were against the trunk of the tree—the revolvers being
luckily hidden from sight in the long rank grass.
‘We want that bag,’ said the tall man, still covering Will. ‘Let my mate
get the bag and your guns, and then you can go.’
In a moment it flashed across Edgar that if the men took the bag
and the guns there would still be the revolvers, and that gave them
a chance before the thieves reached their horses. He was not,
however, too eager, and said:
‘You are a cowardly pair to rob us like this.’
‘You are three to one,’ said the man with a grin. ‘Nothing very
cowardly about that. Will you “ante up” the “boodle”?’
‘How do we know you will not fire on us? We shall be unarmed,’ said
Edgar.
‘We want the plunder, not your lives,’ said the man. ‘Come, be quick.
We have no time to waste.’
The man was evidently impatient, and Edgar thought: ‘Perhaps they
are afraid of someone following them from the Springs.’ Aloud he
said:
‘We agree. Take the bag and our guns and go.’
The man who had attacked Edgar picked up the bag and the two
guns. It was an anxious moment for Edgar. The revolvers were lying
near the tree, and the man might kick them as he went along. With
a sigh of relief, Edgar saw the man had not discovered them. Yacka
was on the alert, but saw no chance of making a move without
injuring Will, and Edgar was in the same fix. The tall man ‘bailed’
them up until his companion returned with their horses.
Having fixed the bag firmly in front of the saddle the man mounted,
placing the guns also in front of him. He then led the other horse up
to the man covering Will, and levelled his revolver at him while his
mate mounted.
Yacka stood at the other side of the horses, and for a brief moment
the man covering Will could not see him, and the taller man was
mounting with his back to Yacka. In an instant Yacka bounded
between the man with the revolver and Will, and jerked the horse’s
bridle, which caused the animal to suddenly back. The man fired,
but the movement of the horse spoilt his aim and the shot did no
harm.
Seeing how matters stood, Edgar ran for the revolvers, and reached
them before the thieves could realize what had happened.
A desperate fight now took place. The mounted men, whose horses
plunged at the sound of firing, aimed at Will and Edgar, and the
former felt a sharp pain in his left arm.
Yacka still hung on to the horse’s bridle, and the man on it fired
point-blank at him, the bullet grazing his head.
Edgar approached this man, and when close to him fired. The shot
told, and the man’s right arm fell to his side, his revolver dropping
on to the ground.
‘Winged!’ shouted Edgar. ‘Hold on, Yacka!’
But Yacka had let go of the horse and pulled the man out of the
saddle. The horse, finding itself free, galloped off, with the bag still
fast to the front of the saddle.
The other man, seeing how matters were going, and knowing the
loose horse had the bag still fast to the saddle, turned tail and
galloped after it.
‘The horses—the horses! Quick, Will!’ said Edgar. ‘We must be after
them.’
Will brought up the horses, and they were quickly in the saddle.
‘You keep guard over this fellow, Yacka,’ said Edgar. ‘Don’t let him
go.’
For answer Yacka smiled savagely, and gripped the man by the
throat so hard that his eyes started from his head.
‘He’s in safe hands,’ said Edgar. ‘Come along, Will, or we shall lose
our treasure after all.’
They rode away after the other man and the runaway horse as fast
as their nags could carry them.
CHAPTER XXI.
AN EXCITING CHASE.

It proved an exciting chase they had commenced. The thief knew he


need expect no mercy if caught, and rode desperately. He knew the
country better than Edgar and Will, which gave him a decided
advantage; moreover, he had a good horse, probably stolen, and
knew how to ride.
‘He is gaining on us,’ said Edgar. ‘I am afraid we shall lose him.
There is no chance of hitting either man or horse from this distance.’
Mile after mile was traversed, and still the chase went on. The
riderless horse stuck close to his companion, but when he began to
flag the man took hold of the bridle and urged him on. Edgar took
no heed where they were going, nor did Will. They were too excited
to take much notice of the country they passed through. At last the
fugitive turned his horse to the left, and plunged into a much more
difficult country to travel. The undergrowth became denser and
tangled, and it was with difficulty the horses could be forced to go
through it. It was not long before they lost sight of the man they
were in pursuit of.
‘Where can he have got to?’ said Will. ‘He would never hide here
with two of us after him.’
‘We must ride on,’ replied Edgar. ‘It is easy to miss a man and come
across his track again in a very short time.’
They rode on at a slow pace, and presently came to a narrow
opening in the scrub. Here they halted and found recent tracks of
horses, so they determined to follow in this direction. The tracks led
them in a roundabout way, and presently they came to the
conclusion the man had doubled back.
‘He must be heading for our camp again,’ said Edgar. ‘Strange he
should do this unless he fancies we are put off the scent, and he is
riding back to rescue his mate.’
‘If that is his game,’ said Will, ‘we must follow him hard. He might
shoot Yacka before we arrive.’
It was, however, difficult for them to find their way. They were not
experienced bushmen, and had failed to notice certain signs by
which they would know they were on the right track. They saw no
signs of the man, nor could they now observe in which direction the
horses had gone. To ride on and trust to chance was their only hope.
It was quite light now, and this aided them. As time passed they
became anxious, and wondered what would become of Yacka if they
did not arrive on the scene in time, for they had not the least doubt
now that their man was heading for the camp to rescue his mate.
‘This chase he has led us has been a blind,’ said Edgar. ‘If we had
taken ordinary precautions we ought to have found out he was
doubling back.’
‘Only a bushman would have found that out,’ said Will. ‘I do not see
how we can blame ourselves.’
‘We have had enough experience the last few months to have found
that out,’ said Edgar. ‘By Jove! there he is, I believe.’
There was a horseman in front of them, but they could not see the
second horse. They rode on faster now, but did not gain much
ground. A rise in the land hid the man from view, and soon after he
disappeared they heard a shot. This made them ride all the faster,
and they quickly reached the top of the rise, and had a good view of
the plain beyond.
‘He fired that shot to warn his mate,’ said Will. ‘We cannot be far
from the camp now.’
‘I’ll fire,’ said Edgar; ‘and if Yacka hears the two shots he will
probably divine we are in pursuit.’
He fired a shot from his revolver as they rode on.
‘There’s the place we camped at,’ said Edgar, pointing to two or
three tall trees: ‘but I see nothing of Yacka or the other men.’
They rode up to the place, and found the camp deserted. There was
blood upon the ground and signs of a struggle, but they imagined
this must have been caused by Yacka dragging the wounded man
along. Edgar called out ‘Yacka!’ and gave a loud ‘cooee,’ and after
waiting a few moments they heard a faint response. They rode in
the direction of the sound, and, rounding a clump of trees on a
mound, came upon a strange sight.
Stretched on the ground was one of the robbers, the man they
supposed they had left with Yacka. This man had been strangled,
and was dead. Near him sat Yacka with a strange expression on his
face. When the black saw them he gave a faint moan, and pressed
his hand to his side.
‘Good God! he’s shot!’ said Edgar, dismounting and running to the
black. He found blood streaming from a deep wound in his side
evidently inflicted with a knife. ‘How did this happen?’ asked Edgar,
as he endeavoured to stanch the flow of blood with a neckerchief he
had rapidly pulled off.
Yacka pointed to the dead man, and Will, who had come up,
exclaimed:
‘This is not the fellow we left with Yacka. It is the man we have been
chasing all this time.’
‘Where is the other man?’ asked Edgar, who could hardly believe his
eyes.
‘I killed him,’ said Yacka faintly.
‘Where is he?’ asked Will.
Yacka pointed to some bushes, and Will went across and found the
body of the man they had left with Yacka. This man had also been
strangled.
They managed to stop the flow of blood from the deep wound in
Yacka’s side, but it was some hours before he had sufficiently
recovered strength to relate what had happened.
When Yacka heard the shot fired, he at once thought the man’s
mate had doubled back to rescue him, and had given Edgar and Will
the slip. He knew how easily it could be done by an old hand, and
his surmise was confirmed by the expression on the man’s face
when he heard the shot. In a moment Yacka had made up his mind
how to act. He had no gun, for he found that all three had been
taken, instead of only those belonging to Edgar and Will. He seized
his prisoner by the throat, and strangled him. Then he propped the
dead man up with his back to a tree, and tied him to it with one of
the tethering ropes. He hid himself behind the tree and waited, and
in a short time the other robber came on to the scene. When this
man saw his mate bound to the tree, he dismounted and came
towards him, evidently thinking Yacka had made him fast, that he
had fallen asleep, and Yacka had gone away.
Yacka awaited his coming, crouching down behind the tree. No
sooner did the man see his mate was dead than he realized that a
trap had been set for him, and ran back to the horses. Yacka was
quickly after him, and before the man could reach the horses had
caught him up. Finding Yacka at such close quarters, the man drew
his knife instead of his revolver, no doubt thinking it would be more
effective. A desperate struggle ensued, which Yacka described
graphically.
‘We rolled over and over,’ said Yacka. ‘I had no knife, and he was a
powerful man. I caught him by the throat, and he lost the grip of his
knife. I clung to him with both hands, and he managed to get his
knife and stuck it in my side. I did not let go my hold. I became
fainter and fainter, but clung to his throat. Then I fell across him,
and when I came to my senses again, which could not have been
long, he was dead. It was their lives or mine, and they were not fit
to live.’
As they listened to Yacka’s story of this terrible struggle and awful
end of the thieves, they wondered if many men would have had the
courage to act as he had done.
‘The horses will not have gone far,’ said Yacka. ‘They were dead
tired, I could see, when the man dismounted.’
While Will attended to Yacka, Edgar went in search of the two stray
horses, and found them about a couple of miles away, quietly
cropping the scanty herbage. He secured them without trouble, and
was glad to see their precious treasure was safe, and also their
guns.
They had to remain in this spot for a week before Yacka was fit to
be removed, and during that time they buried the bodies of the
robbers as well as they were able with the primitive means at hand.
Their progress was slow, because Yacka could not ride far, and had
to be helped off one of the horses at different times to rest. It was
lucky for them they had the two captured horses in addition to their
own. Yacka guided them, and seemed to take a delight in hiding
from them how far they were from Yanda.
‘Surely we must be somewhere near Yanda by this time,’ said Edgar.
‘I almost fancy I can recognise the country.’
‘You ought to,’ said Yacka, ‘for we are on Yanda Station now, and we
shall reach the homestead to-night.’
They could not suppress their feelings, and gave a loud hurrah.
Yacka had spoken correctly, for towards sundown the familiar
homestead came in sight.
Yacka wished them to gallop on and leave him, but this they
declined to do, saying he had done so much for them, it was only
making a small return to remain with him.
As they neared the homestead they noticed several figures moving
about, evidently in an excited way, on the veranda.
‘There’s Ben Brody!’ said Edgar eagerly. ‘He has recognised us. What
a time we shall have to-night!’
Ben Brody was standing leaning against the door-post when he saw
something moving across the plain in front of him. He went inside
for his glasses, and, after looking through them for several minutes,
he gave a loud shout.
It was such an unusual thing for Ben Brody to shout, except when
issuing orders, or expressing his feelings to some unfortunate new-
chum, that the hands about the place fancied the homestead must
have caught fire. Several of them rushed round to the front, and
found Ben Brody executing a kind of war-dance on the veranda.
‘What’s up now?’ asked Will Henton. ‘Something stinging you?’
‘No, you fool,’ roared Brody. ‘Do you think I’m as tender as you? It’s
them lads coming back!’
‘Not Foster and Brown?’ asked Will.
‘That’s just it, you bet,’ said Brody.
Off ran Will Henton, and in a few moments Harry Noke, Jim Lee, and
two or three more came round.
‘Give me the glasses,’ said Noke.
‘No need for that,’ said Jim Lee. ‘I can spot ’em from here.’
‘We must go and meet them,’ said Will Henton.
‘Right you are,’ said Brody. ‘Boys, we’ll have a terrible night of it.’
They mounted their horses, and in less time than it takes to write it
down were galloping towards the home-comers.
The scene was one to be remembered. They sprang from their
horses, and pulled Edgar and Will out of their saddles, and shook
them by the hands, cheered and hallooed until the plain rang with
their hearty shouts. Yacka stood quietly looking on, and when they
had almost wrung Edgar’s and Will’s hands off they tackled him.
‘Don’t handle Yacka as roughly as you have handled us,’ laughed
Edgar; ‘he’s got a bad wound.’
Then came a string of questions as to how Yacka received his
wound, and who had given it him. Such a rain of questions was
showered at them that at last Ben Brody said:
‘Give them breathing-time, lads. We shall hear all about their
adventures later on. We’re right glad to see you back again safe and
sound.’
A general chorus of assent followed this remark.
‘Expect you have not come back loaded with wealth?’ said Will
Henton.
‘Wait and see,’ said Edgar. ‘I rather fancy we have a surprise in store
for you.’
‘Have you had a good time?’ said Ben Brody.
‘It has been a wonderful time, and we have seen many strange
things, and gone through a good deal of hard work. I’m heartily glad
to see Yanda again, but I would not have missed our experiences for
the world.’
‘Same here,’ said Will Brown, ‘but I never wish to go through such a
time again.’
Yacka rode quietly behind, a lonely black figure, the pain in his face
showing how he still suffered. He was glad to see this hearty
welcome, but it made him feel lonely. He had no friends such as
these men at Yanda were. He was a wanderer, an outcast, a black, a
despised native of the country these white men had taken from his
people. But Yacka was, through all this, white enough at heart to
know it was all for the best. His people could never become like
these people, and the country in the hands of blacks, he knew,
would still have been wild and desolate.
CHAPTER XXII.
TIME FLIES.
The hands at Yanda marvelled greatly at the tale Edgar told of their
adventures, and they marvelled still more when the treasure they
brought with them was shown.
‘And to think that black fellow knew all about it, and kept the secret
so long,’ said Ben Brody. ‘I can hardly believe it is true. You must
have travelled thousands of miles. All I can say is you deserve what
you have got.’
After staying a few weeks at Yanda, where he received letters from
home, and from Wal Jessop, Edgar decided to go to Sydney and see
Eva again. Will Brown remained at Yanda, in order to gain more
experience of station life.
When Edgar arrived in Sydney, he at once went to Watson’s Bay. Wal
Jessop did not know Edgar had left Yanda. Eva had constantly
inquired for Edgar during his absence, and been comforted by the
assurance he would return to her.
Edgar walked up the steep path to the cottage, intending to give the
inmates a surprise, but Eva, who was looking out of the window,
recognised him, and gave a joyful cry that brought Mrs. Jessop to
her. Together they rushed out to greet Edgar, and he soon had little
Eva crowing delightedly in his arms, Mrs. Jessop looking on, her
motherly face beaming with satisfaction.
‘How you have grown, Eva!’ said Edgar, holding her up in his arms to
have a better look at her. ‘You have had a good home, and Mrs.
Jessop has taken great care of you.’
Eva began to prattle in her pretty childish way, and asked Edgar
numerous questions, some of which he found a difficulty in
answering.
When Wal Jessop returned home and found Edgar installed in the
cottage he was delighted. He had been longing to see him again,
and to hear all about his adventures. These Edgar had to relate over
and over again, and little Eva, too, was interested in hearing about
Yacka and the blacks, and the White Spirit in the wonderful cave.
When she saw the precious stones and gold Edgar brought with him,
she clapped her hands with joy, and wanted to play with all the
pretty things.
‘You’ll not be short of money for a time with such rubies as these to
sell,’ said Wal Jessop, as he took some of the stones in his hand.
‘They are the finest I ever saw. You’ll get more for them in London
than you will here.’
‘I shall keep the bulk of them,’ said Edgar; ‘but we must dispose of
some of them, Wal, in order to keep things going.’
‘Captain Fife will be able to do that for you,’ said Wal. ‘He knows the
best market for such things. What a wonderful chap that black must
be! There are not many like him here.’
‘You will see him before long,’ said Edgar. ‘He has promised to come
to Sydney when his wound has quite healed.’
‘A knife-thrust like that will take some time to get well,’ said Wal. ‘I
wonder if he will ever take you back again to find more of the
treasure?’
‘I shall not go,’ said Edgar; ‘but I have no doubt there will be search
made for it, even if Yacka declines to lead the way.’
The evening of Edgar’s arrival at the cottage he had a walk on the
cliffs with Wal Jessop, and again looked down upon the terrible rocks
where the Distant Shore was dashed to pieces, and himself and Eva
were so miraculously saved. As he looked into the depths below, the
scene came vividly to mind again, and he could not resist grasping
Wal Jessop by the hand, while the tears stood in his eyes.
Wal Jessop knew what he meant better than if he had spoken, and
returned the pressure of his hand. They walked back to the cottage,
and once more talked over the scenes of that awful night.
When Edgar saw Captain Fife that gentleman received him cordially,
and promised to dispose of some of the rubies to the best
advantage.
‘They are wonderfully good stones,’ said Captain Fife, ‘and there will
be no difficulty in obtaining a stiff price for them. By the way, what
are you going to do with yourself now? Are you returning to the
station, or would you prefer to remain in Sydney?’
‘If I can obtain a suitable billet,’ said Edgar, ‘I should like to remain
here.’
Captain Fife had been on the look-out for a private secretary for
some time, and he offered Edgar the post, which he willingly
accepted, thinking himself fortunate, as indeed he was, to gain such
a position.
Time flies quickly, and when Edgar Foster had been private secretary
to Captain Fife for over two years, he had become quite at home in
Sydney, and was recognised as one of the best of good fellows.
Edgar was fond of sports of all kinds, and he liked fun as well as any
young fellow of his age, but he shunned the fast sets in the city, and
one of his constant companions was Wal Jessop. Two or three times
a week he went to Wal’s cottage to see Eva, who was rapidly
growing into a very pretty girl. He heard regularly from home, and
also had news from Yanda—for Will Brown was still there. Yacka had
tried Sydney life, but quickly tired of it, and returned to the West.
Two or three expeditions had been fitted out to try and find the Cave
of Enooma, as it was called, for the adventures of Edgar Foster and
Will Brown had been related in the Sydney Mail, and naturally there
was a desire to obtain the wealth stated to be there. These
expeditions had, however, been failures, and nothing came of them.
Yacka refused to lead anyone into the Enooma country, and Edgar
and Will, when approached upon the subject, expressed their
inability to do so. When the second expedition failed in its object,
people said the discovery was a myth, but others knew better, and
Edgar only smiled when he heard disparaging remarks made.
Although Edgar stuck well to his work during the time he had been
with Captain Fife, he found ample opportunity to indulge in his
favourite pastime, cricket, and, much to his delight, had been
selected captain of the South Sydney team. In this capacity he not
only proved himself a good all-round cricketer, but a splendid leader,
and no one, it was generally acknowledged, placed his men to more
advantage in the field. He was selected to play for New South Wales
against Victoria, but, like many a good cricketer before him, he failed
at his first attempt. There was, however, no doubt about his ability,
and he now stood an excellent chance of being selected as one of
the next Australian eleven. This is the height of every cricketer’s
ambition in the colonies, and Edgar felt anxious as to whether his
performances during the season would warrant the selection
committee including him in the team. So far he had done fairly well.
There remained one inter-Colonial match to play against South
Australia, and Edgar knew upon this match would depend the final
decision as to his being a member of the Australian eleven.
He had practised steadily, and felt confident, and was encouraged by
Wal Jessop and Captain Fife. Will Brown wrote from Yanda, saying
they were coming down in force to see him play, and Ben Brody
added a postscript to the effect that the honour of the Yanda boys
was in Edgar’s hands.
When the eventful day arrived Edgar’s feelings can be imagined. The
match took place on the Association ground at Sydney, and the
South Australians placed a formidable team on the field. Several men
on either side were on their best mettle and playing for a place in
the Australian eleven.
Ben Brody appeared on the ground resplendent in a new cabbage-
tree hat, which he had bought in honour of the occasion. He was as
anxious as anyone to see Edgar successful. Will Brown vowed if
Edgar Foster went home with the team, he should go by the same
boat. Will Henton, Harry Noke, and Jim Lee all came up from Yanda
for the match, and consequently there was a family party on the
ground. In Wal Jessop Ben Brody found a man after his own heart,
and they got on well together.
Edgar felt encouraged by their presence to do his best, and
something seemed to tell him he would succeed.
The New South Wales captain won the toss and elected to bat. This
gave Edgar a chance to sit and chat with his friends. He hardly knew
how popular he had become in Sydney, owing to his numerous
adventures and his sterling character, until he saw the number of
people who were only too proud to recognise him.
‘You must be a favourite with the ladies,’ said Ben Brody. ‘All the
pretty girls are smiling at you. Lucky dog!’
It was true Edgar knew several nice girls, but he had not yet found
one he preferred to any of the others. He thought there was time
enough for that in another five or six years.
The home team commenced badly, and lost two wickets for thirty
runs. At the fall of the fourth wicket Edgar Foster went in, and his
appearance on the ground, from the pavilion, was the signal for a
loud outburst of applause. As he walked to the crease Edgar vowed
he would do his utmost to merit this reception. He was cool and
collected, and had seldom felt so confident. He commenced well by
making a couple of boundary hits in his first over. His partner, Frank
Highdale, was well set, and the pair looked like making a big stand.
Edgar roused the spectators by hitting a ball into the pavilion, and
Highdale had completely mastered the bowling. Runs came rapidly,
and the South Australian captain seemed puzzled to know how to
effect a separation.
Although Highdale had been batting some time before Edgar came
in, the latter was first to reach the coveted fifty. When this number
of runs appeared to Edgar’s name on the scoring-board, Ben Brody,
to use his own expression, ‘broke loose.’ He cheered in the most
frantic manner, and waved his huge hat in delight.
The New South Wales eleven were at the wickets all day, and when
stumps were drawn Edgar Foster was ‘not out, one hundred and
nine’! He was congratulated on all sides, and Captain Fife said, as he
shook hands with him:
‘Your place in the team is assured. I shall cable to your father as
soon as the selection is made. He will be mighty proud of his son.’
On the renewal of the match next day, Edgar added another fifty to
his score, and was clean bowled, after making one hundred and
fifty-nine, a magnificent innings.
The match ended in a win for the home colony by two hundred runs.
In the second innings Edgar Foster placed fifty-six to his credit; he
also bowled well during the match, and came out with a very good
average.
Consequently, it was no surprise when he found his name amongst
the favoured thirteen cricketers picked to make up the Australian
team. He received a cablegram from his father congratulating him,
and this gave him more pleasure than anything else.
As usual, there was some grumbling about the composition of the
team, but no one had anything to say about Edgar Foster’s inclusion.
‘We are to go home in the Cuzco,’ said Edgar to Will Brown; ‘so you
had better book your passage.’
‘You bet!’ said Will; ‘and who do you think is going home for a trip
with us?’
‘Don’t know,’ said Edgar. ‘I wish we could take Yacka. He would
create a sensation there.’
‘Yacka is far happier camping out at Yanda,’ said Will. ‘Ben Brody is
going home with us. He says he has never had a holiday since he
was a lad, over forty years ago, and he thinks it is about time he
took one now.’
‘I am glad,’ said Edgar. ‘Ben Brody is a real good sort; he’s a rough
diamond, but I like him better than if he were polished.’
The hands on Yanda were in high glee about Ben leaving them for a
time. They fancied the mutton diet would be knocked off, but Ben
said he should leave strict injunctions behind about that.
The time passed quickly, and the morning the Cuzco was to leave
Circular Quay a large crowd of people assembled to see the New
South Wales members of the team leave for London. There was so
much hand-shaking, and so many parting good-byes, that Edgar felt
sure some of them would be left behind.
Wal Jessop and his wife brought Eva down to see Edgar off, and the
child did not like to see him leave her in the big steamer.
‘I will come back for you, Eva,’ said Edgar; ‘I promise you I will come
back. Be a good girl while I am away, and I will bring you back the
best doll I can find in London.’
‘With brown hair, and blue eyes?’ said Eva.
‘Yes,’ said Edgar. ‘It shall have bonny blue eyes, and bright brown
hair like yours, Eva.’
He took her in his arms, and kissed her over and over again, and
then handed her to Mrs. Jessop. Just as the gangway was about to
be raised they saw a tall figure flying up it with long strides. It was
Ben Brody.
‘You nearly missed us,’ said Edgar, laughing. ‘Where have you been?
I thought I saw you on board some time back.’
‘So I was,’ said Ben, gasping for breath; ‘but I left my ’bacca behind
in a box at the hotel, and I’d sooner have gone back to Yanda than
been on board without my usual brand.’
The Cuzco had now cast off, and as she left the wharf Edgar singled
out Eva, hoisted high on Wal Jessop’s shoulder, and waved her a
hearty farewell.
CHAPTER XXIII.
AN EVENTFUL NIGHT.
An Australian team bound for England always has a good time on
board the steamer, and the eleven of which Edgar was a member
was no exception to the rule. At Melbourne and Adelaide they were
joined by the members of the team hailing from Victoria and South
Australia.
On arriving at Colombo they went ashore to play a match against a
team selected from the leading local cricketers. Being out of practice
they did not play up to their usual form, and the Colombo team
nearly defeated them, and were much elated in consequence.
At this time the mail steamers did not pass through the Suez Canal
at night-time, and the Cuzco anchored off Ismailia. A run ashore to
pass away the time was only natural, and Edgar, accompanied by
Will Brown and other members of the team, made up a party. This
night ashore at Ismailia was destined to effect a change in Edgar’s
future life.
The population of Ismailia is a mixture of different nationalities,
some of them being of a rather desperate and fierce nature. An
Egyptian wedding-party passed through one of the streets; it was a
curious sight to unaccustomed eyes. The men, swathed in long
white garments, with turbans on their heads, and sandals on their
feet, carried long poles, at the ends of which lanterns were fixed.
Their brown arms and faces shone in the reflected light, and offered
a strong contrast to the colour of their garments. Fierce eyes
gleamed from under dark, bushy eyebrows, and as the men
marched, uttering a wild chant in peculiar tones, the effect was
somewhat weird. The bridegroom, who was being escorted to his
bride, was a tall, powerful young fellow, of a better caste than his
friends.
All went well until the procession approached the bride’s house,
when a party of young fellows from the Cuzco, who had been
revelling not wisely but too well, barred the road. It was a foolhardy
thing to do. To stop such a procession was exceedingly dangerous,
and could only be construed as an insult by the natives, who are not
slow to avenge any slight put upon them.
Edgar and those with him saw the danger, and shouted to the
obstructionists to move out of the way. It was, however, too late,
and the warning would probably not have been heeded in any case.
Seeing how matters stood, the Egyptians grew furious. Knives
flashed in the light, and a rush was made at the foolish young
fellows, who so recklessly hindered the procession.
‘Come on,’ shouted Edgar, ‘or there will be murder done!’
He rushed forward, followed by his companions, but they found it
impossible to render much assistance, owing to the confusion. Edgar
became separated from the others, and was drawing back from the
crowd, when he heard a cry for help, followed by a woman’s shriek.
Rushing in the direction of the sound, he saw a girl of about
eighteen struggling in the grasp of a powerful Egyptian. He
recognised her as Miss Muriel Wylde, a passenger on the Cuzco, with
whom he had had pleasant chats on deck. In a moment Edgar had
the ruffian by the throat, and forced him to loose his hold. No
sooner, however, was the girl free, than another man seized her and
attempted to carry her off. She struggled violently, and shouted
again for help. Edgar had his work cut out with the man he first
tackled. He was unarmed, and had to rely upon his fists. The furious
Egyptian rushed upon him with an uplifted knife in his hand. Edgar
did not flinch, but caught the fellow by the wrist, and the knife flew
from his grasp. Then, with his left fist, he dealt the man a savage
blow between the eyes that well-nigh stunned him.
Turning to see what had become of Miss Wylde, Edgar saw that she
had fainted, and her captor was hurrying away with her. Edgar gave
chase, and quickly came up with him. The Egyptian dropped his
burden, and turned on Edgar, aiming a terrific blow at him with his
knife. Edgar sprang backwards, and the man over-reached himself.
Before he recovered, Edgar had him on the ground, and stunned
him by knocking his head on the hard road.
He then sprang to his feet, and went to the assistance of Miss
Wylde, who had luckily been thrown on the soft sand by the side of
the road, and found she had recovered from her faint.
‘Can you walk?’ said Edgar; ‘are you much hurt?’
She was trembling and alarmed, and could hardly answer him.
‘We must make our way to the quay,’ he said, ‘and get a boat back
to the ship as quickly as possible. These fellows are frantic at being
interfered with, and are in a dangerous state. Lean on me, and try
and walk.’
She put her hand on his shoulder, and Edgar supported her by
placing his arm round her waist.
They had not gone many yards before Edgar heard loud shouting
behind them. It was evident some of the Egyptians were coming
that way, and they must be avoided if possible. A few paces straight
ahead Edgar saw a high wall, and what looked like a doorway. He
lifted his companion off her feet, and ran as fast as he could towards
the archway.
On reaching it he knocked loudly. The door was opened by an old
native woman, who peered curiously into his face.
Without saying a word Edgar stepped inside, and closed the door
behind him.
‘What do you here?’ asked the old woman, in broken English. ‘Are
you from the ship?’
‘Yes,’ said Edgar, not knowing what else to say, or what excuse to
give for his conduct.
The old woman’s eyes gleamed, and her wrinkled, parchment-like
skin seemed to crumple up and almost crack. Her mouth expanded
in what she no doubt meant for a smile, but Edgar thought it a
diabolical grin, and Muriel Wylde shrank back.
‘Money—gold!’ said the woman hoarsely, her skinny hands extended
like a couple of claws. ‘Gold, and you shall hear your fortune. The
oldest Egyptian in Ismailia can speak truth.’
Edgar felt relieved; had the old woman guessed they were fugitives
she might not have been so friendly. He looked at his companion,
and said:
‘We shall be glad to hear our fortunes from you, mother. That is
what we came for,’ and he took a sovereign out of his pocket.
The old Egyptian’s eyes fastened upon it, and her hand was
stretched out.
‘Give me your hand,’ she said to Miss Wylde.
The girl put out her open hand reluctantly, and the Egyptian gazed
at it so attentively that she appeared to have forgotten the coin.
‘You have been in trouble, and he has saved you,’ croaked the
woman.
The girl started, and the Egyptian smiled at this corroborative
evidence. She had hazarded a guess at the situation, and hit the
mark.
She then proceeded to give an account of what would follow this
adventure, and caused Muriel Wylde to blush, and wish she was
safely on board again.
Edgar’s future was soon told, in the usual strain. He was the hero of
the story, and would be rewarded in due time by the hand of the
lady he had rescued.