Data Science Concepts and Techniques with Applications
Usman Qamar • Muhammad Summair Raza

Usman Qamar
Knowledge and Data Science Research Centre
National University of Sciences and Technology (NUST)
Islamabad, Pakistan

Muhammad Summair Raza
Department of Computer Science
Virtual University of Pakistan
Lahore, Pakistan
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature
Singapore Pte Ltd. 2020
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of
illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt from
the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, express or implied, with respect to the material contained herein or
for any errors or omissions that may have been made. The publisher remains neutral with regard to
jurisdictional claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
This book is dedicated to our students.
Preface
As this book is about data science, the first question it immediately begs is: What is
data science? It is a surprisingly hard definition to nail down. However, for us, data
science is perhaps the best label for the cross-disciplinary set of skills that
are becoming increasingly important in many applications across industry and
academia. It comprises three distinct but overlapping areas: the statistician, who knows how to model and summarize data; the data scientist, who can design and use algorithms to efficiently process and visualize data; and the domain expert, who formulates the right questions and puts the answers in context. With
this in mind, we would encourage all to think of data science not as a new domain
of knowledge to learn, but a new set of skills that you can apply within your current
area of expertise.
The book is divided into three parts. The first part consists of the first three
chapters. In Chap. 1, we will discuss the data analytics process. Starting from the
basic concepts, we will highlight the types of data, its use, its importance, and issues
that are normally faced in data analytics. Efforts have been made to present the concepts in the simplest possible way, as conceptual clarity is very much necessary before studying the advanced concepts of data science and related techniques.
Data analytics has a wide range of applications, which are discussed in Chap. 2. Today, when we have already entered the era of information, the concept of big data has taken over in organizations. With information being generated at an immense rate, it has become very necessary to discuss the analytics process from a big data point of view, so in this chapter we provide some common applications of the data analytics process from a big data perspective. Chapter 3 introduces widely used techniques for data analytics. Prior to the discussion of common data analytics techniques, we first explain the three types of learning under which the majority of data analytics algorithms fall.
The second part is composed of Chaps. 4–7. Chapter 4 is on data preprocessing.
Data may contain noise, missing values, redundant attributes, etc., so data preprocessing is one of the most important steps in making data ready for final processing. Feature selection is an important task used in data preprocessing; it helps reduce noise and remove redundant and misleading features. Chapter 5 is on classification concepts. Classification is an important step that forms the core of data analytics and machine
learning activities. The focus of Chap. 6 is on clustering. Clustering is the process
of dividing objects and entities into meaningful and logically related groups. In contrast to classification, where we already have labeled classes in the data, clustering involves unsupervised learning, i.e. we do not have any prior classes. Chapter 7 introduces text mining as well as opinion mining.
Finally, the third part of the book is composed of Chap. 8, which focuses on two programming languages commonly used for data science projects, i.e. Python and R.
Data science is an umbrella term that encompasses data analytics, data mining, machine learning, and several other related disciplines, so the contents have been devised with this perspective in mind. An attempt has been made to keep the book as self-contained as possible. The book is suitable for both undergraduate and postgraduate students as well as those carrying out research in data science. It can be used as a textbook for undergraduate students in computer science, engineering, and mathematics. It is also accessible to undergraduate students from other areas with an adequate background. The more advanced chapters can be used by postgraduate researchers seeking a deeper theoretical understanding.
Contents
1 Introduction
1.1 Data
1.2 Analytics
1.3 Big Data Versus Small Data
1.4 Role of Data Analytics
1.5 Types of Data Analytics
1.6 Challenges of Data Analytics
1.6.1 Large Volumes of Data
1.6.2 Processing Real-Time Data
1.6.3 Visual Representation of Data
1.6.4 Data from Multiple Sources
1.6.5 Inaccessible Data
1.6.6 Poor Quality Data
1.6.7 Higher Management Pressure
1.6.8 Lack of Support
1.6.9 Budget
1.6.10 Shortage of Skills
1.7 Top Tools in Data Analytics
1.8 Business Intelligence (BI)
1.9 Data Analytics Versus Data Analysis
1.10 Data Analytics Versus Data Visualization
1.11 Data Analyst Versus Data Scientist
1.12 Data Analytics Versus Business Intelligence
1.13 Data Analysis Versus Data Mining
1.14 What Is ETL?
1.14.1 Extraction
1.14.2 Transformation
1.14.3 Loading
Dr. Usman Qamar has over 15 years of experience in data engineering and decision sciences, both in academia and industry. He has a Masters in Computer Systems Design from the University of Manchester Institute of Science and Technology (UMIST), UK. His MPhil in Computer Systems was a joint degree between UMIST and the University of Manchester and focused on feature selection in big data. In 2008 he was awarded a PhD by the University of Manchester, UK. His post-PhD work at the University of Manchester involved various research projects, including hybrid mechanisms for statistical disclosure (feature selection merged with outlier analysis) for the Office of National Statistics (ONS), London, UK, churn prediction for Vodafone UK, and customer profile analysis for shopping with the University of Ghent, Belgium. He is currently Associate Professor of Data Engineering at the National University of Sciences and Technology (NUST), Pakistan. He has authored over 200 peer-reviewed publications, including 3 books published by Springer. He is on the editorial board of many journals, including Applied Soft Computing, Neural Computing and Applications, Computers in Biology and Medicine, and Array. He has successfully supervised 5 PhD students and over 100 master's students.
Dr. Muhammad Summair Raza has been affiliated with the Virtual University of Pakistan for more than 8 years and has taught a number of subjects to graduate-level students. He has authored several articles in quality journals and is currently working in the fields of data analysis and big data, with a focus on rough sets.
Chapter 1
Introduction
In this chapter we will discuss the data analytics process. Starting from the basic concepts, we will highlight the types of data, their use, their importance, and the issues that are normally faced in data analytics. Efforts have been made to present the concepts in the simplest possible way, as conceptual clarity is very much necessary before studying the advanced concepts of data science and related techniques.
1.1 Data
Data is an essential need in all domains of life. From the research community to business markets, data is always required for analysis and decision-making purposes. However, emerging developments in data storage, processing, and transmission technology have changed the entire scenario. A bulk of data is now produced on a daily basis. Whenever you type a message, upload a picture, browse the web, or post on social media, you are producing data which is stored somewhere and is available online for processing. Couple this with the development of advanced software applications and inexpensive hardware. With the emergence of concepts like the Internet of Things (IoT), where the focus is on connected data, the flood of data has grown even further. From writing something on paper to online distributed storage, data is everywhere.
Every second, the amount of data increases at an immense rate. By 2020, the overall amount of data is predicted to be 44 zettabytes; just to give an idea, 1.0 ZB is equal to 1.0 trillion gigabytes. With such huge volumes of data, apart from challenges and issues like the curse of dimensionality, we also have various opportunities to dig deep into these volumes and extract useful information and knowledge for the good of society, academia, and business.
Figure 1.1 shows a few representations of data.
Fig. 1.1 A few representations of data: structured (tabular) data and graphical data
1.2 Analytics
In the previous section we discussed the huge volumes of data that are produced on a daily basis. Data is not useful until we have some mechanism to extract knowledge from it and make decisions. This is where the data analytics process steps in. There are various definitions of data analytics; we will use a simple one. Data analytics is the process of taking data (in any form), processing it through various tools and techniques, and then extracting useful knowledge from it. This knowledge ultimately helps in decision making.
Overall, data analytics is the process of generating knowledge from raw data, which includes various steps from storage to final knowledge extraction. Apart from this, the process involves concepts from various other domains of science and computing. Starting from basic statistical measures, e.g. means, medians, and variances, up to advanced data mining and machine learning techniques, each step transforms data to extract knowledge.
This process of data analytics has also opened the door to new ideas, e.g. how to mimic the human brain with a computer so that tasks performed by humans could be performed by machines at the same level. The artificial neural network was an important development in this regard.
With the advent of such advanced tools and techniques for data analytics, several other problems have also emerged, e.g. how to use computing resources efficiently and enhance performance, and how to deal with various data-related problems such as inaccuracy, huge volumes, and anomalies. Figure 1.2 shows the data analytics process at a high level.
1.3 Big Data Versus Small Data
The nature of the data that applications need has now totally changed. Starting from basic databases, which used to store daily transaction data, distributed connected data has become a reality. This change has impacted all aspects related to data, including storage mechanisms, processing approaches, and knowledge extraction. Table 1.1 presents a brief comparison of "small" data and "big" data.
As discussed above, data has grown from a few gigabytes to zettabytes. The change is not only in size; it has changed the entire scenario, e.g. how do we process distributed connected data? How do we ensure the security of data when we do not even know where it is stored in the cloud? How do we make the most of it with limited resources? These challenges have opened new windows of opportunity. Grid computing, clouds, fog computing, etc., are the results of such challenges.
The mere expansion of resources is not sufficient. Strong support from software is also essential, because conventional software applications cannot cope with the size and nature of big data. For example, a simple software application that performs data distribution in a single-server environment will not be effective for distributed data. Similarly, an application that just collects data from a single server in response to a query will not be able to extract distributed data from different distributed nodes. Various other factors should also be considered, e.g. how to integrate such data, where to fetch data from efficiently in case of a query, and whether data should be replicated.
Even the above-mentioned issues are relatively simple ones; we also come across more complex issues. In cloud computing, for example, your data is stored on a cloud node you know nothing about, so how do you ensure sufficient security and availability of your data? One of the common techniques to deal with such big data is the MapReduce model. Hadoop, based on MapReduce, is one of the common platforms for processing and managing such data. Data is stored on different systems as per needs, and the processed results are then integrated.
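To give a feel for the MapReduce model mentioned above, the following minimal Python sketch imitates its conceptual steps (map, shuffle/group, reduce) for a word count on a single machine; a real Hadoop or Spark job distributes the same logic across many nodes, and the documents here are invented for illustration.

# A toy, single-machine imitation of the MapReduce word-count pattern
from collections import defaultdict

documents = ["big data needs big ideas", "data is the new oil"]

# Map step: emit (word, 1) pairs from every document
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle step: group all emitted values belonging to the same key
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce step: aggregate the grouped values for each key
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)   # e.g. {'big': 2, 'data': 2, ...}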
If we consider the algorithmic nature of the analytics process, data analytics for small data seems basic, as we have data only in a single structured format, e.g. we may be provided with a certain number of objects where each object is defined through precisely specified features and class labels. However, when it comes to big data, various issues need to be considered. Considering just the feature selection process, a simple feature selection algorithm for homogeneous data will be different from one dealing with data of a heterogeneous nature.
Similarly, small data in the majority of cases is structured, i.e. you have well-defined schemas for the data; in big data, however, you may have both structured and unstructured data. So, an algorithm working on a simple data structure will be far simpler than one working on different structures.
So, as compared to simple small data, big data is characterized by four features as follows:
Volume: We deal with petabytes, exabytes, and zettabytes of data.
Velocity: Data is generated at an immense rate.
Veracity: Refers to bias and anomalies in big data.
Variety: Refers to the number of different types and formats of data.
Compared to big data, small data is far simpler and less complex.
1.4 Role of Data Analytics
The job of a data analyst draws on knowledge from domains like statistics, mathematics, artificial intelligence, and machine learning. The ultimate intention is to extract knowledge for the success of the business. This is done by extracting patterns from data.
It involves complete interaction with data throughout the entire analysis process. So, a data analyst works with data in various ways. This may include data storage, data cleansing, data mining for knowledge extraction, and finally presenting the knowledge through measures and figures.
Data mining forms the core of the entire data analytics process. It may include extraction of data from heterogeneous sources including texts, videos, numbers, and figures. The data is extracted from the sources, transformed into a form that can be easily processed, and finally loaded so that the required processing can be performed. Overall, this is called the extract, transform, and load (ETL) process. Note, however, that the entire process is time consuming and requires a lot of resources, so one of the ultimate goals is to perform it efficiently.
Statistics and machine learning are two of the major components of data analytics. They help in the analysis and extraction of knowledge from data. We input data and use statistics and machine learning techniques to develop models. These models are then used for analysis and prediction. Nowadays, many tools and libraries are available for this purpose, including R and Python.
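As a hedged sketch of this model-building idea, the snippet below uses the scikit-learn library (assumed to be installed; the tiny customer dataset and column meanings are invented purely for illustration) to fit a classifier on a few labelled examples and predict the label of a new observation.

# Fit a simple classifier and use it for prediction (scikit-learn assumed installed)
from sklearn.tree import DecisionTreeClassifier

# Toy training data: [age, monthly_spend] -> 1 = likely buyer, 0 = unlikely buyer
X_train = [[25, 200], [40, 800], [35, 650], [22, 150], [50, 900]]
y_train = [0, 1, 1, 0, 1]

model = DecisionTreeClassifier()
model.fit(X_train, y_train)          # learn patterns from the historical data

print(model.predict([[30, 700]]))    # predict for a new, unseen customer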
The final phase in data analytics is data presentation. Data presentation involves visual representation of the results for the comprehension of the customer. As the customer is the intended audience of the data representation, the techniques used should be simple, informative, and aligned with the customer's requirements.
Talking about the applications of data analytics, its major role is to enhance the performance and efficiency of businesses and organizations.
One of the major roles data analytics plays is in the banking sector, where it can be used to compute credit scores, predict potential customers for a certain policy, detect outliers, etc.
However, apart from its role in finance, data analytics plays a critical role in security management, health care, emergency management, etc.
1.5 Types of Data Analytics
Data analytics is a broad domain. It has four types: descriptive analytics, diagnostic analytics, predictive analytics, and prescriptive analytics. Each has its own nature and its own type of tasks. Here we will provide a brief description of each.
• Descriptive analytics: Descriptive analytics helps us find out "What happened" or "What is happening". In simple words, these techniques take raw data as input and summarize it in the form of knowledge useful for customers, e.g. they may help find the total time spent by the company on each customer, or the total sales made by each region in a certain season (see the sketch after this list). So, the descriptive analytics process comprises data input, processing, and results generation. The generated results are presented in visual form for better understanding by the customer.
• Diagnostic analytics: Taking the analytics process a step further, diagnostic analytics helps analyze "Why did it happen?". By performing analysis on historical and current data, we may get details of why a certain event actually happened at a certain period in time. For example, we can find out the reasons for a certain drop in sales over the third quarter of the year. Similarly, we can find the reasons behind a low crop yield in agriculture. Special measures and metrics can be defined for this purpose, e.g. yield per quarter, profit per six months, etc. Overall, the process is completed in three steps:
– Data collection
– Anomaly detection
– Data analysis and identification of the reasons.
• Predictive analytics: Predictive analytics, as the name indicates, helps in predicting the future. It helps in finding "What may happen". Using current and historical data, predictive analytics finds patterns and trends by using statistical and machine learning techniques and tries to predict whether the same circumstances may occur in the future. Various machine learning techniques, like artificial neural networks, classification algorithms, etc., may be used. The overall process comprises the following steps:
– Data collection
– Anomaly detection
– Application of machine learning techniques to predict patterns.
• Prescriptive analytics: Prescriptive analytics, as the name implies, suggests the necessary actions that need to be taken in case of a certain predicted event, e.g. what should be done to increase the predicted low yield in the last quarter of the year, or what measures should be taken to increase sales in the off season.
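The sketch below illustrates the descriptive-analytics example mentioned in the list above (total sales per region); it assumes the pandas library and uses an invented toy dataset.

# Summarize raw transaction records into sales per region (pandas assumed installed)
import pandas as pd

transactions = pd.DataFrame({
    "region": ["North", "South", "North", "East", "South"],
    "sales":  [250, 400, 300, 150, 350],
})

sales_per_region = transactions.groupby("region")["sales"].sum()
print(sales_per_region)   # descriptive summary: what happened in each region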
So, the different types of analytics help at different stages for the good of businesses and organizations. One thing common to all types of analytics is the data required for applying the analytics process. The better the quality of the data, the better the decisions and results. Figure 1.3 shows the scope of the different types of data analytics.
1.6 Challenges of Data Analytics
Although data analytics has been widely adopted by organizations and a lot of research is underway in this domain, there are still many challenges that need to be addressed. Here, we will discuss a few of these challenges.
1.6.1 Large Volumes of Data
In this data-driven era, organizations are collecting large amounts of data on a daily basis, sufficient to pose significant challenges for the analytics process. The volumes of data available these days require significant resources for storage and processing. Although analytics processes have come up with various solutions to address the issue, a lot of work still needs to be done to solve the problem.
1.6.2 Processing Real-Time Data
In the majority of cases, data remains significant only for a certain period of time. Coupling this problem with the large amount of data, it becomes difficult to capture such huge volumes in real time and process them for meaningful insights. The problem becomes more critical when we do not have sufficient resources to collect and process real-time data. This is another research domain, where data analytics processes must handle huge volumes of real-time data and process them in real time to yield meaningful information while its significance remains intact.
1.6.3 Visual Representation of Data
For clear understanding by customers and organizations, data should be presented in simple and easy ways. This is where visual representations like graphs and charts may help. However, presenting information in a simple way is not an easy task, especially when the complexity of the data increases and we need to present information at various levels. Just feeding the data into tools and generating default charts is not sufficient on its own.
1.6.4 Data from Multiple Sources
Another issue that needs to be addressed is distributed data, i.e. data stored at different geographical locations. This may create problems, as doing the job manually is very cumbersome. From the data analytics perspective, data distribution may raise many issues, including data integration, mismatches in data formats, different data semantics, etc.
The problem lies not only in data distribution but also in the resources required to process this data. For example, processing distributed data in real time may require expensive devices to ensure high-speed connectivity.
1.6.5 Inaccessible Data
For an effective data analytics process, data should be accessible 24/7. Accessibility of data is another issue that needs to be addressed. Backup storage and communication devices need to be purchased to ensure the data is available whenever required. Even if you have the data, if it is not accessible for any reason, the analytics process will not deliver significant value.
1.6.6 Poor Quality Data
Data quality lies at the heart of the data analytics process. Incorrect and inaccurate data means inaccurate results. It is common to have anomalies in data. Anomalies may include missing values, incorrect values, irrelevant values, etc. Anomalies may occur for various reasons, e.g. defects in sensors that result in incorrect data collection, or users not willing to enter correct values.
Dealing with this poor-quality data is a big challenge for data analytics algorithms. Various preprocessing techniques are already available, but dealing with anomalies is still a challenge, especially when it comes to large volumes and distributed data with unknown data sources.
1.6.7 Higher Management Pressure
As the results of data analytics are realized and the benefits become evident, higher management demands more results, which ultimately increases the pressure on data analysts, and work under pressure always has its negatives.
1.6.8 Lack of Support
Lack of support from higher management and peers is another issue to deal with. Data analytics is not useful if higher management is not supportive and does not give the authority to act on the knowledge extracted from the analytics process. Similarly, if peers, e.g. other departments, are not willing to provide data, the analytics process will not be very useful.
1.6.9 Budget
Budget is one of the core issues to deal with. The data analytics process requires expensive systems with the capacity to deal with large volumes of data, hiring consultants whenever needed, purchasing data and tools, etc., all of which involves a significant budget. Unless the required budget is provided and organizations are willing to spend on the data analytics process, it is not possible to reap the fruits of data analytics.
1.6.10 Shortage of Skills
Data analytics is a rich field involving skill sets from various domains like mathematics, statistics, artificial intelligence, and machine learning. It is therefore difficult to find experts having knowledge and experience in all of these domains, so finding the right people for the right job is an issue that organizations and businesses still face.
1.7 Top Tools in Data Analytics
With the increasing trend toward data analytics, various tools have been developed for building data analytics systems. Here we will discuss a few tools that are most commonly used for this purpose.
R programming R is one of the leading tools for analytics and data modeling. It has compatible versions for Microsoft Windows, Mac OS, and Unix. In addition, it has many libraries available for different scientific tasks.
Python Python is another programming language that is widely used for writing programs related to data analytics. This open-source, object-oriented language has a number of libraries from high-profile developers for performing different data analytics-related tasks. A few of the most common libraries used in Python are NLTK, NumPy, SciPy, scikit-learn, etc.
Tableau Public Free software that can connect to any data source and create visualizations, including graphs, charts, and maps, in real time.
QlikView It allows for data processing, thus enhancing efficiency and performance. It also offers data association and data visualization with compressed data.
SAS SAS is another data analytics tool that can analyze and process data from any source.
Microsoft Excel One of the most common tools used for organizational data processing and visualization. The tool is developed by Microsoft and is part of the Microsoft Office suite. It integrates a number of mathematical and statistical functions.
RapidMiner Mostly used for predictive analytics, the tool can be integrated with any data source, including Excel, Oracle, SQL Server, etc.
KNIME An open-source platform that lets you analyze and model data. Through its modular data pipeline concept, KNIME provides a platform for reporting and integration of data.
Apache Spark Apache Spark is one of the largest large-scale data processing tools. It executes applications in Hadoop clusters 100 times faster in memory and 10 times faster on disk.
Splunk Splunk is a specialized tool to search, analyze, and manage machine-generated data. Splunk collects, indexes, and analyzes real-time data into a repository from which it generates information visualizations as per requirements.
Talend Talend is a powerful tool for automating big data integration. Talend uses native code generation and helps you run your data pipelines across all cloud providers to get optimized performance on all platforms.
Splice Machine Splice Machine is a scalable SQL database that lets you modernize your legacy and custom applications to be agile, data-rich, and intelligent without modifications. It lets you unify machine learning and analytics, consequently reducing ETL costs.
1.8 Business Intelligence (BI)
Business intelligence deals with analyzing data and presenting the extracted information so that business decisions can be made. It is a process that includes the technical infrastructure to collect, store, and analyze data for different business-related activities.
The overall objective of the process is to make better decisions for the good of the business. Some benefits include
• Effective decision making
• Business process optimization
• Enhanced performance and efficiency
• Increased revenues
• Potential advantages over competitors
• Making effective future policies.
These are just a few benefits to mention. However, in order to achieve them, effective business intelligence needs to meet four major criteria.
Accuracy Accuracy is the core of any successful process and product. In the case of the business intelligence process, accuracy refers to the accuracy of the input data and of the produced output. A process with inaccurate data may not reflect the actual scenario and may produce inaccurate output, which may lead to ineffective business decisions. So, we should be especially careful about the accuracy of the input data.
Here the term error is used in a general sense. It refers to erroneous data that may contain missing values, redundant values, and outliers. All of these significantly affect the accuracy of the process. For this purpose, we need to apply different cleansing techniques as per requirements in order to ensure the accuracy of the input data.
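As one possible illustration of such cleansing, the following pandas sketch removes duplicate rows, fills missing values, and filters an implausible outlier before the data is fed into the BI process; the column names, the imputation choice, and the age threshold are assumptions made only for this example.

# A simple cleansing pass before feeding data into a BI process (pandas assumed)
import pandas as pd

raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "age":         [34, None, None, 29, 210],   # None = missing, 210 = outlier
})

clean = raw.drop_duplicates().copy()                        # remove redundant rows
clean["age"] = clean["age"].fillna(clean["age"].median())   # impute missing ages
clean = clean[clean["age"].between(0, 120)]                 # drop implausible ages
print(clean)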
Valuable Insights The process should generate valuable insights from the data. The insights generated by the business intelligence process should be aligned with the requirements of the business to help it make effective future policies; e.g. for a medical store owner, information about customers' medical conditions is more valuable than it is for a grocery store.
BI Process It has four broad steps that loop over and over. Figure 1.4 shows the
process.
Fig. 1.4 The BI process: a loop of analysis, actions, measurement, and feedback
1.9 Data Analytics Versus Data Analysis
Below is a list of points that describe the key differences between data analytics and data analysis:
• Data analytics is a general term that refers to the process of making decisions from data, whereas data analysis is a sub-component of data analytics that tends to analyze the data and get the required insight.
• Data analytics refers to data collection and general analysis, whereas data analysis refers to collecting, cleaning, and transforming the data to gain deep insight from it.
• Tools required for data analytics may include Python, R, TensorFlow, etc., whereas tools required for data analysis may include RapidMiner, KNIME, Google Fusion Tables, etc.
• Data analysis deals with examining, transforming, and arranging given data to extract useful information, whereas data analytics deals with the complete management of data, including collection, organization, and storage.
Figure 1.5 shows the relationship between data analytics and data analysis.
1.10 Data Analytics Versus Data Visualization
In this section we will discuss the difference between data analytics and data visualization.
• Data analytics deals with tools, techniques, and methods to derive deep insight from data by finding the relationships in it, whereas data visualization deals with presenting the data in a format (mostly graphical) that is easy to understand.
• Data visualization helps organizational management to visually perceive the analytics and concepts present in the data.
• Data analytics is the process that can help organizations increase operational performance, make policies, and take decisions that may provide advantages over business competitors.
• Descriptive analytics may, for example, help organizations find out what has happened and identify its root causes.
• Prescriptive analytics may help organizations find the available prospects and opportunities and consequently make decisions in favor of the business.
• Predictive analytics may help organizations predict future scenarios by looking into the current data and analyzing it.
• Visualizations can be both static and interactive. Static visualizations normally provide a single view that the visualization is intended for; the user normally cannot see beyond the lines and figures.
• Interactive visualizations, as the name suggests, let users interact with the visualization and obtain views according to their specified criteria and requirements.
• Data visualization techniques like charts, graphs, and other figures help us see the trends and relationships in the data much more easily. We can say that visualization is part of the output of the analytics process. For example, a bar chart of sales per month is far easier for a person to understand than the raw numbers and text.
So, in general, data analytics performs the analytics-related tasks and derives the information, which is then presented to the user in the form of visualizations. Figure 1.6 shows the relationship between data analytics and data visualization.
1.11 Data Analyst Versus Data Scientist
Both are prominent jobs in the market these days. A data scientist is someone who can predict the future based on the data and the relationships in it, whereas a data analyst is someone who tries to find meaningful insights from the data. Let us look into both of them.
• A data analyst deals with the analysis of data for report generation, whereas a data scientist has a research-oriented job, responsible for understanding the data and the relationships in it.
• Data analysts normally look into known information from a new perspective, whereas a data scientist's work may involve finding the unknown in the data.
• The skill set required for a data analyst includes concepts from statistics and mathematics and various data representation and visualization techniques. The skill set for a data scientist includes advanced data science programming languages and frameworks like Python, R, and TensorFlow, and various libraries related to data science like NLTK, NumPy, SciPy, etc.
• A data analyst's job includes data analysis and visualization, whereas the job of a data scientist requires the skills to understand the data and find the relationships in it for deep insight.
• In terms of complexity, the data scientist's job is more complex and technical than the data analyst's.
• A data analyst normally deals with structured data, whereas a data scientist may have to deal with structured, unstructured, and hybrid data.
It should be noted that we cannot prioritize one of these jobs over the other: both are essential, each has its own roles and responsibilities, and both help an organization grow its business based on its needs and requirements.
Now we will explain the difference between data science and some other domains.
1.12 Data Analytics Versus Business Intelligence
Apparently the two terms seem synonymous; however, there are many differences, which are discussed below.
• Business intelligence refers to a generic process that is useful for making decisions from the historical information of any business, whereas data analytics is the process of finding relationships in data to gain deep insight.
• The main focus of business intelligence is to help in decision making for further growth of the business, whereas data analytics deals with gathering, cleaning, modeling, and using data as per the business needs of the organization.
• The key difference between data analytics and business intelligence is that business intelligence deals with historical data to help organizations make intelligent decisions, whereas the data analytics process tends to find relationships in the data.
• Business intelligence tries to look into the past and tends to answer questions like: What happened? When did it happen? How many times? Data analytics, on the other hand, tries to look into the future and tends to answer questions like: When will it happen again? What will be the consequences? How much will sales increase if we take this action?
• Business intelligence deals with tools and techniques like reporting, dashboards, scorecards, and ad hoc queries, whereas data analytics deals with tools and techniques like text mining, data mining, multivariate analysis, and big data analytics.
In short, business intelligence is the process of helping organizations make intelligent decisions from historical data, normally stored in data warehouses and organizational data repositories, whereas data analytics deals with finding relationships in data for deep insight.
1.13 Data Analysis Versus Data Mining
Data mining and data analysis are two different processes and terms, each having its own scope and flow. Here we will present some differences between the two.
• Data mining is the process of finding existing patterns in data, whereas data analysis tends to analyze the data and obtain the required insight.
• Data mining may require a skill set including mathematics, statistics, machine learning, etc., whereas the data analysis process involves a skill set including statistics, mathematics, machine learning, and subject knowledge.
• A data mining specialist is responsible for mining patterns from the data, whereas a data analyst performs data collection, cleaning, and transformation to gain deep insight from it.
Figure 1.7 shows the relation between data mining and data analysis.
1.14 What Is ETL?
In data warehouses, data comes from various sources. These sources may be homogeneous or heterogeneous. Homogeneous sources share the same data semantics, whereas heterogeneous sources are ones where the data semantics and schemas differ. Furthermore, the data from different sources may contain anomalies like missing values, redundant values, and outliers. The data warehouse, however, should contain homogeneous and accurate data in order to provide this data for further processing. The main process that enables data to be stored in the data warehouse is called the extract, transform, and load (ETL) process.
ETL is the process of collecting data from various homogeneous and heterogeneous sources and then applying a transformation process to load it into the data warehouse.
It should be noted that the data in different sources may not be directly storable in the data warehouse, primarily due to different data formats and semantics. Here we will present some examples of anomalies that may be present in source data and that require a complex ETL process.
Different data formats Consider two databases, both of which store the customer's date of birth. One database may store the date in the format "mm/dd/yyyy" while the other database may use a different format like "yyyy/mm/dd" or "d/m/yy". Due to the different date formats, it may not be possible for us to integrate their data without involving the ETL process.
Different data semantics In the previous case the data formats were different. It may also be the case that the data formats are the same but the data semantics are different. For example, consider two shopping cart databases where currencies are represented as floating-point numbers. Although the format of the data is the same, the semantics may be different: the floating-point currency value in one database may represent dollars, whereas the same value in the other database may represent euros. So, again we need the ETL process to convert the data to a homogeneous format.
Missing values Databases may have missing values as well. This may happen for several reasons, e.g. people are normally not willing to disclose their salaries or personal contact numbers. Similarly, gender and date of birth are among the fields that people may not be willing to provide while filling in different online forms. All this results in missing values, which ultimately affect the quality of the output of any process performed on such data.
Incorrect values Similarly, we may have incorrect values, e.g. outliers resulting from a stolen credit card or from the malfunctioning of a weather data collection sensor. Again, such values affect the quality of any analysis performed on this data. So, before storing such data for analysis purposes, the ETL process is performed to fix such issues.
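A minimal Python sketch of the kind of fixes just described, normalizing two different date formats and converting one currency into a common one before loading; the field names, formats, and exchange rate are assumptions for illustration only.

# Normalize heterogeneous source records before loading (illustrative only)
from datetime import datetime

EUR_TO_USD = 1.1   # assumed fixed exchange rate, purely for the example

def normalize(record, date_format, currency):
    # Convert a source record into the warehouse's common format
    dob = datetime.strptime(record["dob"], date_format).date()   # unify dates
    amount = record["amount"] * (EUR_TO_USD if currency == "EUR" else 1.0)
    return {"dob": dob.isoformat(), "amount_usd": round(amount, 2)}

source_a = {"dob": "03/25/1990", "amount": 100.0}   # mm/dd/yyyy, dollars
source_b = {"dob": "1990/03/25", "amount": 100.0}   # yyyy/mm/dd, euros

print(normalize(source_a, "%m/%d/%Y", "USD"))
print(normalize(source_b, "%Y/%m/%d", "EUR"))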
1.14.1 Extraction
Data extraction is the first step of the ETL process. In this step data is read from the source database and stored in intermediate storage, and the transformation is then performed. Note that the transformation is performed on a separate system so that the source database and the systems using it are not affected. Once all the transformations are performed, the data becomes ready for the next stage; however, after completing the transformation, it is essential to validate the extracted data before storing it in the data warehouse. The extraction and transformation process is needed when the data comes from different DBMSs, hardware, operating systems, and communication protocols. The source of the data may be any relational database, conventional file system repositories, spreadsheets, document files, CSVs, etc.
So, it becomes evident that we need the schemas of all the source data as well as the schema of the target system before performing the ETL process. This will help us identify the nature of the required transformations.
Three data extraction methods: The three extraction methods are listed below. It should be noted that, irrespective of the extraction method used, the performance and working of the source and target databases should not be affected. A critical aspect in this regard is when to perform the extraction, as it may make the company's source database unavailable to customers or, in a milder case, may affect the performance of the system. Keeping all this in mind, the following are the three types of extraction performed.
1. Full extraction
2. Partial extraction—without update notification
3. Partial extraction—with update notification.
1.14.2 Transformation
Data extracted from source systems may be in a raw format that is not useful for the target BI process. The primary reason is that the data schemas are designed according to local organizational systems, and various anomalies may be present in those systems. Hence, before moving this data to the target system for the BI process, we need to transform it. This is the core step of the entire ETL process, where value is added to the data for BI and analysis purposes. We perform different functions to transform the data; however, sometimes transformation may not be required, and such data is called direct move or pass-through data.
The transformation functions performed in this stage are defined according to requirements, e.g. we may require the monthly gross sales of each store by the end of the month, while the source database may only contain individual time-stamped transactions on a daily basis. Here we have two options: we may simply pass the data to the target data warehouse and calculate the monthly gross sales at runtime whenever required, or we may perform the aggregation and store the monthly aggregated data in the data warehouse. The latter provides higher performance compared to calculating the monthly sales at runtime. Now we will provide some examples of why transformation may be required.
1. The same person may have different name spellings, e.g. Jon or John.
2. A company can be represented in different ways, e.g. HBL or HBL Inc.
3. Different spellings of the same name may be used, e.g. Cleaveland and Cleveland.
4. The same person may have different account numbers generated by different applications.
5. Data may have different semantics.
6. The same name, e.g. "Khursheed", can belong to either a male or a female.
7. Fields may be missing.
8. We may use derived attributes in the target system, e.g. "Age", which is not present in the source system. We can apply the expression "current date minus DOB" and store the resulting value in the target data warehouse (again, to enhance performance), as in the sketch below.
There are two types of transformations, as follows:
Multistage data transformation—This is the conventional method, where data is extracted from the source, stored in intermediate storage, transformed, and then moved to the data warehouse.
In-warehouse data transformation—This is a slight modification of the conventional method. Data is extracted from the source and moved into the data warehouse, and all the transformations are performed there. Formally it may be called ELT, i.e. the extract, load, and transform process.
Each method has its own merits and demerits, and the selection of either may depend upon the requirements.
1.14.3 Loading
This is the last step of the ETL process. In this step the transformed data is loaded into the data warehouse. It normally involves loading huge amounts of data, so the process should be optimized and performance should not be degraded.
However, if for some reason the process results in a failure, measures are taken to restart the loading process from the last checkpoint, and the failure should not affect the integrity of the data. The entire loading process should therefore be monitored to ensure its success. Based on its nature, the loading process can be categorized into two types:
Full load: All the data from the source is loaded into the data warehouse for the first time. However, it takes more time.
Incremental load: Data is loaded into the data warehouse in increments. Checkpoints are recorded; a checkpoint represents the time stamp from which onward data will be stored into the data warehouse.
A full load takes more time but is relatively less complex, whereas an incremental load takes less time but is relatively more complex.
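A minimal sketch of an incremental load driven by a checkpoint time stamp; the in-memory lists stand in for the source system and the warehouse, and all names and dates are assumptions for the example.

# Incremental load: only records newer than the last checkpoint are moved
from datetime import datetime

source_rows = [
    {"id": 1, "updated": datetime(2020, 1, 1)},
    {"id": 2, "updated": datetime(2020, 2, 1)},
    {"id": 3, "updated": datetime(2020, 3, 1)},
]
warehouse = []                        # stand-in for the target data warehouse
checkpoint = datetime(2020, 1, 15)    # time stamp recorded after the last load

new_rows = [r for r in source_rows if r["updated"] > checkpoint]
warehouse.extend(new_rows)                          # load only the increment
checkpoint = max(r["updated"] for r in new_rows)    # record the new checkpoint
print(len(warehouse), checkpoint)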
ETL Challenges:
Apparently the ETL process seems to be an interesting and simple tool-oriented approach, where a tool is configured and the process starts automatically. However, there are certain aspects that are challenging and need substantial consideration. Here are a few:
• Quality of data
• Unavailability of accurate schemas
• Complex dependencies between data
• Unavailability of technical persons
• ETL tool cost
• Cost of the storage
• Schedule of the load process
• Maintaining the integrity of the data.
1.15 Data Science
Data science is a multidisciplinary field that focuses on the study of all aspects of data, right from its generation and processing to converting it into a valuable source of knowledge.
Being a multidisciplinary field, it uses concepts from mathematics, statistics, machine learning, data mining, artificial intelligence, etc. Having a wide range of applications, data science has become a buzzword now. With the realization of the worth of insights from data, organizations and businesses are striving for the best data science and analytics techniques for the good of the business.
Fig. 1.9 The life cycle of a data science project: discovery, data preparation, model planning, model development, operationalization, and communication of results
Bigger companies like Google, Amazon, and Yahoo have shown that carefully storing data and then using it to extract knowledge, make decisions, and adopt new policies is always worthwhile. So, small companies are also striving for such use of data, which has ultimately increased the demand for data science techniques and skills in the market, and efforts are underway to decrease the cost of providing data science tools and techniques.
Now we will explain the different phases of the life cycle of a data science project. Figure 1.9 shows the process in pictorial form.
• Phase 1—Discovery: The first phase of a data science project is discovery. It is the discovery of the available sources that you have and that you need, your requirements, your output (i.e. what you want out of the project), its feasibility, the required infrastructure, etc.
• Phase 2—Data preparation: In this phase you explore your data and its worth. You may need to perform some preprocessing, including the ETL process.
• Phase 3—Model plan: Now you think about the model that will be implemented to extract the knowledge from the data. The model will work as the basis for finding patterns and correlations in the data as per your required output.
• Phase 4—Model development: The next phase is the actual model building based on training and testing data. You may use various techniques like classification and clustering, based on your requirements and the nature of the data.
• Phase 5—Operationalize: The next stage is to implement the project. This may include delivery of the code, installation of the project, delivery of the documents, and giving demonstrations, etc.
• Phase 6—Communicate results: The final stage is evaluation, i.e. you evaluate your project based on various measures like customer satisfaction, achievement of the goal the project was developed for, accuracy of the project's results, and so on.
1.17 Summary
In this chapter, we have provided some basic concepts of data science and analytics. Starting from the basic definition of data, we discussed several of its representations. We provided details of different types of data analytics techniques and the challenges that are normally faced in implementing the data analytics process. We also looked at various tools available for developing a data science project at any level. Finally, we provided a broad overview of the phases of a data science project.
Chapter 2
Applications of Data Science
Data science has a wide range of applications. Today, when we have already entered the era of information, all walks of life, right from small businesses to mega industries, are using data science applications for their business needs. In this chapter we will discuss different applications of data science and their corresponding benefits.
2.1 Data Science Applications in Healthcare
Data science has played a vital role in healthcare by predicting patterns that help in curing contagious and long-lasting diseases using fewer resources.
Extracting meaningful information and patterns from patients' histories stored in hospitals, clinics, and surgeries helps doctors refine their decision making and advise the best medical treatment, which helps people live longer than before. These patterns provide the data well before time, which also enables insurance companies to offer packages suitable for patients.
Emerging technologies are providing mature directions in healthcare to dig deep for insight and reveal more accurate, critical medical schemes. These schemes are taking treatment to the next level, where the patient also has sufficient knowledge about existing and upcoming diseases, and it becomes easier for doctors to guide patients in a specific way. Nations are taking advantage of data science applications to anticipate and monitor medical facilities in order to prevent and solve healthcare issues before it is too late.
Due to the rising costs of medical treatment, the use of data science has become necessary. Data science applications in healthcare help reduce costs by various means; for example, providing information at earlier stages helps patients avoid expensive medical treatments and medications.
The rise in costs has become a serious problem for healthcare companies over the last 20 years. Healthcare companies are now rethinking how to optimize treatment procedures for patients by using data science applications. Similarly, insurance companies are trying to cut down costs and provide customized plans as per patient status. All this is guided by data-driven decision making, where insights are taken from the data to make plans and policies.
With this, healthcare companies are getting true benefits by using the power of analytical tools, software as a service (SaaS), and business intelligence to extract patterns and devise schemes that help all the stakeholders in reducing costs and increasing benefits. Doctors now not only have their individual educational knowledge and experience but also have access to a number of verified treatment models from other experts with the same specialization and in the same domain.
(1) Patient Predictions for Improved Staffing
Overstaffing and understaffing are typical problems faced by many hospitals. Overstaffing inflates salary and wage costs, while understaffing means the hospital is compromising on the quality of care, which is dangerous given sensitive treatments and procedures. Data science applications make it possible to analyze admission records and patient visiting patterns with respect to weather, day, month, time, and location, and to provide meaningful insights that help the staffing manager optimize staff placement in line with patient visits.
A Forbes report shows how hospitals and clinics, by using patient data and history, can predict future visits and place staff according to the expected number of patient visits. This not only results in optimized staff placement but also helps patients by reducing waiting times. Patients will have immediate access to their doctor. The same can be especially helpful in emergencies, where doctors will already be available.
(2) Electronic Health Records (EHRs)
Using data science applications in EHRs, data can be archived according to patient demographics, laboratory tests, allergies, and complete medical history in an information system, and doctors can be provided with this data through secure protocols, which ultimately helps them diagnose and treat their patients with more accuracy. This use of data science applications is called electronic health records (EHRs), where patients' data is shared with doctors and physicians through secure systems.
Most developed nations like the US have already implemented it; European countries are on the way to implementing it, while others are in the process of developing the rules and policies for implementation. Although EHR is apparently an interesting application of data science, it requires special consideration with respect to security, because an EHR involves the patient's personal data, which should be handled with care. This is also one of the reasons that various hospitals and clinics, even in developed countries, are still reluctant to participate in EHR systems. However, with secure systems and keeping in mind the benefits of EHRs to hospitals, clinics, and patients, the day is not far away when EHRs will be implemented in the majority of countries across the globe.
(3) Real-Time Alerting
Real-time alerting is one of the core benefits of data science applications in healthcare. In such applications, the patient's data is gathered and analyzed in real time, and the medical staff is informed about the patient's condition so that decisions can be taken well before time. Patients can use GPS-guided wearables that report the patient's medical state, for example blood pressure and heart rate, to the patient and the doctor. So, if the patient's blood pressure changes abruptly or reaches a dangerous level, the doctor can contact the patient and advise medication accordingly.
Similarly, for patients with asthma, such a system records asthma trends by using GPS-guided inhalers. This data is used for further research at the clinical level and at the national level to form a general policy for asthma patients.
(4) Enhancing Patient Engagement
With the availability of more advanced wearables, and by convincing patients of their benefits, we can develop patients' interest in using these devices. With these wearables, we can track changes to the human body and give feedback, with initial treatment, to reduce risk by avoiding critical situations regarding the patient's health.
Insurance companies are also advertising and guiding their clients to use these wearables, and many companies even provide them for free to promote these trackable devices, which are proving a great help in reducing patient visits and laboratory tests. These devices are gaining huge attention in the health market and giving fruitful results. They are also engaging researchers, who keep adding more and more features to these devices with the passage of time.
(5) Prevent Opioid Abuse
Drug addiction is an extreme problem in many countries, including developed nations, where billions of dollars have already been spent and programs are still underway to devise solutions for it.
The problem is getting worse each year, and research is in progress to come up with solutions. Many risk factors have already been identified that predict, with high accuracy, the patients at risk of opioid abuse.
Although it is challenging to identify and reach such persons and convince them to avoid drug abuse, we can hope that success can be achieved with a little more effort by both doctors and the public.
(6) Using Health Data for Informed Strategic Planning
With the help of data science, health organizations create strategic plans for patients in order to provide better treatment and reduce costs. These plans help the organizations examine the latest situation in a particular area or region, for example the current state of an existing chronic disease in a certain region.
Using advanced data science tools and techniques, we can come up with proactive approaches for handling emergencies and critical situations. For example, by analyzing the data of a certain region, we can predict coming heat waves or dengue virus outbreaks and can establish clinics and temporary facilities beforehand, ultimately avoiding a critical situation in the region.
(7) Disease Cure
Data science applications provide great opportunities to treat diseases like cancer and thus give relief to cancer patients. We can predict the disease and provide directions to cure and treat different patients at different stages. However, all this requires cooperation at various levels. For example, individuals should be convinced to participate in the information-providing process, so that if any information is required from a cancer patient, he or she is willing to provide it. Furthermore, organizations that already have security policies may be reluctant to share such data. Beyond this there are many other issues, e.g. the technical incompatibility of diagnostic systems, which are often developed locally without keeping their integration with other systems in mind. Similarly, there may be legal issues in sharing such information. Furthermore, we also need to change the mindset that hinders an organization from sharing its successes with others, perhaps for business reasons.
All this requires a significant effort to resolve such issues and come up with solutions. However, once all such data is available and interlinked, researchers can come up with models and cures to help cancer patients with more effectiveness and accuracy.
(8) Reduce Fraud and Enhance Security
Data breaches are common because data is often poorly secured, while it has great value in terms of money and competitive advantage and can be used for unfair means. A data breach can occur for many reasons, e.g. viruses, cyber-attacks, penetration of a company's network through friendly-pretending nodes, etc. However, it should be noted that as systems become more frequent victims of cyber-attacks and data hacks, security solutions are also maturing day by day through the use of encryption methodologies, anti-virus software, firewalls, and other advanced technologies.
By using data science applications, organizations may predict possible attacks on their organizational data. Similarly, by analyzing patients' data, possible frauds in patients' insurance claims can also be identified.
(9) Telemedicine
Telemedicine is the process of consulting a doctor or physician using advanced technologies, without personal physical visits. It is one of the most important applications of data science. It helps deliver health services in remote areas and is a special help for developing countries where health facilities are not available in rural areas.
Patients can contact their doctor through video conferencing, smart mobile devices, or any other available service.
The above-mentioned use is just a trivial example. Doctors can now perform surgeries through robots even while sitting far away from the actual operating room. People are getting the best and most immediate treatment, which is more comfortable and less costly than arranging visits and standing in long lines. Telemedicine also helps hospitals reduce costs, manage other critical patients with more care and quality, and place staff as per requirements. It also allows the healthcare industry to predict which diseases, and which stages of disease, may be treated remotely so that personal visits can be reduced as much as possible.
(10) Medical Imaging
Medical images are becoming increasingly important in diagnosing disease, a task that requires high skill and experience when performed manually. Moreover, hospitals need a huge budget to store these images for a long time, as they may be needed at any point in the future for the treatment of the patient concerned. Data science tools make it possible to store these images in an optimized way that takes less storage; the algorithms also generate patterns from the pixels and convert them into numbers that help medical assistants and doctors analyze a particular image and compare it with other images to perform the diagnosis.
Radiologists are the people most directly involved in interpreting these images. As their judgment may vary due to mood swings and many other human factors, the accuracy of the extracted information may be affected. Computers, on the other hand, do not get tired and behave consistently in interpreting images, resulting in the extraction of quality information. Furthermore, the process is more efficient and time saving.
(11) A Way to Prevent Unnecessary ER Visits
Better and optimized use of resources such as money, staff, and energy is another main benefit of data science applications, and implementing such systems makes it possible. In one reported case, a woman with a mental illness visited hospitals 900 times in a single year. This is just one case of its type that has been reported, but there can be many similar cases that place an extra burden on healthcare institutions and other taxpayers.
A healthcare system can tackle this problem by sharing the patient's information among the emergency departments and clinics of hospitals. Hospital staff can then check a patient's visits to other hospitals and the dates and times of laboratory tests, and decide whether or not to suggest re-testing based on recently conducted tests. In this way hospitals can save time and other resources.
An ER system may help the staff in the following ways:
• It may help in finding the medical history and the patient's visits to nearby hospitals.
• Similarly, we can find out whether the patient has already been assigned to a specialist in a nearby hospital.
• We can see what medical advice was given to the patient previously, or is currently being given, by another hospital.
Another benefit of such a healthcare system is that it helps utilize all resources in a better way for the care of patients and taxpayers. Prior to these systems, a patient could get the same medical checkup again and again, causing a great burden on all stakeholders.
It is not only students who benefit: data science applications can be a great assistance to teachers and faculty members in identifying the areas where they need to give more focus in class. By identifying patterns, mentors can help a particular student or group of students. For example, if a question was attempted by only a few students out of 1000, there may be an issue with the question itself, so the teacher can modify it, change it, or re-explain its contents, perhaps in more detail.
3. To Help Developing Curriculum and Learning Processes
Course development and the learning process are not static procedures; both require continuous effort to develop or enhance courses and learning processes. Educational organizations can gain insights for developing new courses according to the competency and levels of students and for enhancing learning procedures. After examining students' learning patterns, institutions will be in a better position to develop courses that give learners confidence in acquiring the skills they desire, according to their interests.
4. To Help Administrators
An institution's administration can devise policies with the help of data science applications: which courses should be promoted more, which areas are currently in demand, what marketing or resource optimization policies should be implemented, and which students may benefit most from the offered courses and programs.
5. Prevent Dropouts and Learn From the Results
Data science applications can help educational institutes find current and future trends in the market and industry, which in turn can help them shape the future of their students. They can study industry requirements for the near future and help students select the best study programs. Industry may even share its future trends to help launch a particular study program. This ultimately gives students greater confidence and fewer worries regarding their job security in the future.
Industries can also develop recruitment strategies and share them with institutes so that new courses can be designed according to industry needs. In addition, industry can offer internship opportunities to students by assessing their learning skills.
7. It Helps You Find Answers to Hard Questions
8. It's Accessible
As many departments are interlinked and generate large volumes of data, analyzing this data helps the institution develop strategic plans to place the required staff in the right place. Data science applications can thus help optimize resource allocation, the hiring process, transportation systems, etc. You are in a better position to develop infrastructure according to the needs of existing and upcoming students at minimum cost.
Similarly, you can develop a new admission policy according to the existing available resources, mentor the staff, and polish their skills to build goodwill in the community.
So, with the help of data science applications, your institution can extract deep insight from its information and make the best use of its resources, gaining maximum benefit at minimum cost.
10. It's Quick
There is no manual work: all departments generate their own bulk data, and this data is stored in centralized repositories. So, should you need any information, it is already available at hand without any delay. This ultimately means that you have enough time to act on any event.
So, with data science tools, you have central repositories that let you take timely decisions and make accurate action plans for the betterment of all stakeholders of the institution.
consequently were reluctant to apply all this technology. However, the growing population and the increase in manufactured goods have resulted in large amounts of data. Combined with the lower cost of technological equipment, it has now become possible for industries to adopt intelligent tools and apply them in the manufacturing process at minimum cost.
With this, data science applications are now helping industries optimize their processes and the quality of their products, resulting in increased satisfaction for customers and for the industry itself. This advancement is playing a vital role in expanding businesses, since gaining maximum benefit at minimum cost is a great attraction. Without data science tools, various factors added to the cost, such as the large amount of labor required, the lack of prediction mechanisms, and the absence of resource optimization policies.
1. Predictive Maintenance
The use of intelligent sensors in industry provides great benefits by predicting potential faults and errors in machinery. A prompt response to fix these minor or major issues saves industry owners from having to re-install or buy new machinery, which would require huge financial expenditure. Moreover, predictive analysis of the manufacturing process can enhance production with high accuracy, giving buyers the confidence to predict their sales and take corrective actions where required. With further advancement in data science tools and techniques, industry is gaining more from this day by day.
2. Performance Analyses
A fault in any part of the manufacturing equipment may cause great loss and change all the delivery dates and production levels. However, the inclusion of technology in all parts of industry, e.g. biometric attendance recording, handling and fixing of errors with the help of robotics and smart repair tools, fault prediction, etc., brings excellent benefits in reducing the downtime of manufacturing units, which in turn builds confidence in using these technologies in the manufacturing process.
4. Improved Strategic Decision Making
Data science helps businesses make strategic decisions based on ground realities and organizational contexts. Various tools are available, including data cleanup tools, profiling tools, data mining tools, data mapping tools, data analysis platforms, data visualization resources, data monitoring solutions, and many more. All these tools make it possible to extract deep insight from the available information and make the best possible decisions.
5. Asset Optimization
With the emergence of the Internet of Things (IoT) and data science applications, businesses can now enhance the production process by automating tasks, resulting in optimized use of their assets.
With data science tools, it can now be determined when to use a certain resource and to what level of utilization it will be most beneficial.
6. Product Design
Launching a new product design is always a risky job. Without thorough analysis you cannot simply purchase a new production unit for a new product, as it may involve a huge financial budget and an equally large risk of failure. You will want to avoid the cost of purchasing equipment for a product whose success or failure you are not sure about.
However, with the help of data science applications, by looking into the available information you can determine whether or not to launch a new design. Similarly, you can estimate the success or failure ratio of a product that has not been launched so far. This provides a potential advantage over competing businesses, as you can perform pre-analysis and launch an in-demand product before someone else does, giving you a better opportunity to capture a certain portion of the market.
7. Product Quality
Designing a high-quality product with the help of end users always produces good financial benefits. End users provide feedback through different channels such as social media, online surveys, review platforms, video streaming feedback, and other digital media tools. Gathering user feedback in this way always helps in identifying the most-needed products and their features. Even a product that failed in the market can be enhanced and updated effectively after incorporating user feedback and investigating customer trends in the market using data science applications.
8. Demand Forecasting
9. Customer Experience
Data science tools provide deep insight into customer data: buying trends, preferences, priorities, behaviors, etc. The data can be obtained from various sources and is then consolidated into central repositories, where data science tools are applied to extract the customer experience. These insights then help in policy and decision making to enhance the business. For example, customer experience may help us maintain a minimum stock level of a certain product in a specific season.
A very common example of such insight is to place together the products that are normally sold together, e.g. milk, butter, and bread.
10. Supply Chain Optimization
The supply chain is always a complex process. Data science applications help find the necessary information, for example the suppliers with high efficiency, the quality of the products they deliver, their production capacity, etc.
Data science plays a critical role in strategic decision making when it comes to sports. Predicting a player's performance using previous data is nowadays a common application of data science. As each game and its complete context can be stored in databases, data science applications can help team coaches assess the validity of their decisions and identify the weak areas and discrepancies that resulted in losing a certain game. This can be a great help in overcoming these discrepancies in the future and thus increasing the performance of players and teams.
You can find not only your own team's discrepancies but also the gray areas in opposing teams. This will help you come up with a better strategy next time, which will ultimately increase your chances of success in the future.
2. Deep Insight
Data science applications require a central repository where data from multiple sources is stored for analysis. The more data is stored, the more accurate the results. However, this has its own cost: storing and processing such large amounts of data requires large storage volumes, memory, and processing power. But once all this is available, the insight obtained from the data using data science tools is always worthwhile, as it helps you predict future events and make decisions and policies accordingly.
3. Marketing Edge
Nowadays, sport has become a big business. The more popular you are, the more companies you will attract to place their ads. Data science applications now provide better opportunities to target your ads than in earlier days. By identifying fan clubs, player and team ratings, and the interests of the people, you are in a better position to devise marketing strategies that will ultimately give your product greater reach.
4. On-site Assistance and Extra Services
Collecting information from different sources such as ticket booths, shopping malls, and parking areas, and properly storing and analyzing it, can help in providing on-site assistance to people and ultimately increasing revenue. For example, you can provide better car parking arrangements based on a person's seating preference at a match. Similarly, serving a spectator's favorite food at their seat during a particular match makes them happier, since they enjoy their favorite food and the game at the same time; this will increase your revenue, as people will try to attend such matches again in the future.
Historical data is the backbone of any data science application, and the same is true in the domain of cyber-security. You have to store a large amount of historical data to determine what a normal service request is and what an attack is. Once the data is available, you can analyze it to find malicious patterns in service requests, resource allocations, etc.
Once a pattern is verified to be an attack, you can monitor for the same pattern in the future and deny it if it occurs again. This can ultimately save companies from huge data and business losses, especially in sensitive domains such as defense-related, finance-related, or healthcare-related data.
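As a hedged illustration of this pattern-based approach (not a system described in this book), the sketch below trains an unsupervised anomaly detector on invented "normal" traffic statistics and then flags a request profile that deviates strongly from them; the feature choice and values are assumptions.

# Illustrative sketch: flagging unusual requests with an anomaly detector
# trained on historical traffic. Features and values are hypothetical.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Historical "normal" traffic: [requests_per_minute, kilobytes_transferred]
normal_traffic = rng.normal(loc=[50, 200], scale=[10, 40], size=(1000, 2))

detector = IsolationForest(contamination=0.01, random_state=0)
detector.fit(normal_traffic)

# One typical observation and one that looks like a flood/exfiltration attempt.
new_requests = np.array([[55, 210], [900, 5000]])
for row, label in zip(new_requests, detector.predict(new_requests)):
    print(row, "attack-like" if label == -1 else "normal")  # -1 marks anomalies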
2. Monitoring and Automating Workflows
All the workflows in a network need to be monitored, either in real time or offline. Studies show that data breaches are mostly caused by local employees. One solution to this problem is to implement an authorization mechanism so that only authorized persons have access to sensitive information. However, whether a flow is generated from local systems or from some external source, data science tools can help you automatically identify the patterns of these flows and consequently help you avoid the worst consequences.
3. Deploying an Intrusion Detection System
Data science applications in the airline industry are proving to be a great help in devising policies according to customer preferences. A simple example is the analysis of the booking system: by analyzing it, airlines can offer customers personalized deals and thus increase their revenue. You can identify the most frequent customers and their preferences and provide them with the best experience as per their demands.
Furthermore, airlines are not just about ticketing. By using data science applications, you can find optimized routes and fares and enhance your customer base. You can come up with deals that most travelers will want to take, ultimately enhancing your revenue.
1. Smart Maintenance
Baggage handling is a big problem at airports, but data science tools provide better solutions for tracking baggage in real time using radio frequencies. As air traffic is increasing day by day and new problems emerge daily, data science tools can be a big help in all these scenarios. For example, with increased air traffic, intelligent routing applications are required; such applications can make the journey much safer than before. Furthermore, you can predict future issues that are normally difficult to handle at runtime; once these issues are identified beforehand, you can have contingency plans to fix them. For example, sudden maintenance is hard to handle at runtime, but predicting it ahead of time enables the industry to take reasonable measures. Similarly, you can predict weather conditions and inform customers about delays in advance in order to avoid an unhappy experience or customer dissatisfaction.
2. Cost Reduction
There are many costs that airlines have to bear, one aspect being lost baggage. With real-time bag tracking, such costs can be significantly reduced, thereby avoiding customer dissatisfaction.
One of the major costs for airline companies is fuel. Using more fuel is expensive, while using too little is dangerous, so maintaining the right level is essential. Data science applications can dig out the relevant data, such as jet engine information, weather conditions, altitude, route information, distance, etc., and come up with an optimal fuel consumption plan, ultimately helping companies decrease fuel costs.
3. Customer Satisfaction
Airline companies take every measure to satisfy their customers by enhancing their experience with the company. Satisfaction depends on a number of factors, as everybody has an individual level of satisfaction, and creating an environment that satisfies a large number of customers is difficult. However, by analyzing customers' previous data, data science tools can provide maximum ease, such as their favorite food, preferred boarding, preferred movies, etc., which will ultimately make customers choose the same airline again.
4. Digital Transformation
The transformation of existing processes into a digital model is giving the airline industry a strong edge, whether related to customers or to other monitoring factors. Attractive dashboards and smart technological gadgets make it possible to provide a greater level of service on time, and airline companies can receive and analyze instant feedback in order to provide a better experience and enhanced service quality.
5. Performance Measurements
Airline companies normally operate at an international level and thus face tough competition. So, in order to remain in business, they not only have to measure performance but also ensure that they stay ahead of their competitors.
With the help of data science applications, airlines can automatically generate and analyze their performance reports, e.g. how many of the passengers who travelled last week chose the same airline again, or the fuel consumption on the same route compared to the previous flight with respect to the number of customers, etc.
6. Risk Management
Risk management is an important area where data science applications can help the airline industry. Whenever a plane takes off, multiple risks are attached to the flight, including changing weather conditions, the sudden unavailability of a route, malfunctions, and, most importantly, pilot fatigue due to constant flying.
Data science applications can help airlines overcome these issues and come up with contingency plans to manage all these risks, e.g. using dynamic routing mechanisms to route the flight along a different path at runtime in order to avoid any disaster. Similarly, to prevent pilots from flying while fatigued after long flying hours, optimal staff scheduling can be done by analyzing the pilots' medical data.
7. Control and Verification
In order to reduce costs and make the business successful, airlines need in-depth analysis of their historical data. Here data science applications can help by using a central repository containing data from various flights. One example is verifying the expected number of customers against the actual number of customers who travelled with the airline.
8. Load Forecasting
2.8 Summary
In this chapter we discussed a few of the applications of data science. As the size of data is increasing every second at an immense rate, manual and other conventional automation mechanisms are no longer sufficient, so the concepts of data science are now very relevant. We have seen how organizations in all walks of life are benefiting from the entire process. The overall intention was to emphasize the importance of data science applications in daily life.
Chapter 3
Widely Used Techniques in Data Science
Applications
The majority of machine learning algorithms these days are based on supervised machine learning techniques. Although it is a complete domain requiring separate in-depth discussion, here we will only provide a brief overview of the topic.
In supervised machine learning, the program already knows the desired output during training. Note that this differs from conventional programming, where we feed input to the program and the program produces output. Here we give the input and the corresponding output at the same time, in order to make the program learn what it should output for this or similar input.
This learning process is called model building. It means that, from the provided input and output, the system has to build a model that maps the input to the output, so that the next time the input is given to the system, the system produces the output using this model. Mathematically speaking, the task of a machine learning algorithm is to find the value of the dependent variable from the provided independent variables using the model. The more accurate the model, the more accurate the predictions, and the better the decisions based on it.
The dependent and independent variables are provided to the algorithm through a training dataset; we will discuss training datasets in upcoming sections. Two important techniques that use supervised machine learning are:
• Classification: Classification is one of the core machine learning techniques that
use supervised learning. We classify the unknown data using supervised learning,
e.g. we may classify the students of a class into male and female, or classify emails into two classes such as spam and non-spam.
• Regression: In regression, we also try to find the value of the dependent variable from the independent variables using the already provided data; the basic difference is that here the predicted output is a continuous (real) value rather than a discrete class label (see the short sketch below).
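As a brief, hedged illustration of these two settings (using scikit-learn and made-up toy values, not an example from this book), the following sketch trains a classifier that outputs a discrete label and a regressor that outputs a continuous value.

# Classification vs. regression on invented toy data.
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

# Classification: features -> a discrete class label (spam / not spam).
X_cls = [[0, 1], [1, 3], [5, 0], [6, 1]]      # e.g. [links, misspellings]
y_cls = ["not spam", "not spam", "spam", "spam"]
clf = DecisionTreeClassifier().fit(X_cls, y_cls)
print(clf.predict([[4, 2]]))                  # -> a class label

# Regression: features -> a real-valued output (e.g. a price).
X_reg = [[50], [80], [120], [200]]            # e.g. [area in square metres]
y_reg = [100.0, 155.0, 240.0, 410.0]
reg = LinearRegression().fit(X_reg, y_reg)
print(reg.predict([[100]]))                   # -> a continuous value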
Besides supervised learning, there is another technique called unsupervised learning. Although not as common as supervised learning, there are still a number of applications that use it.
As discussed above, in supervised learning we have a training dataset comprising both the dependent and the independent variables. The dependent variable is called the class, whereas the independent variables are called the features. However, this is not always the case: we may have applications where the class labels are not known. This scenario is called unsupervised learning.
In unsupervised learning, the machine learns from the training dataset and groups objects on the basis of similar features, e.g. we may group fruits on the basis of color, weight, size, etc.
This makes the task challenging, as we do not have labels to guide the algorithm. However, it also opens new opportunities to work with scenarios where the outcomes are not known beforehand; the only thing available is the data itself, on the basis of which the group of an unknown object must be predicted.
Let us discuss this with the help of an example. Suppose we are given a few shapes, including rectangles, triangles, circles, lines, etc., and the problem is to make the system learn about the shapes so that it may recognize them in the future.
In a supervised learning scenario the problem is simple, because the system is fed with labeled data: the system is told that a shape with four sides is a rectangle, a shape with three sides is a triangle, a closed round shape is a circle, and so on. So, the next time a shape with four sides is provided, the system will recognize it as a rectangle; similarly, a shape with three sides will be recognized as a triangle, and so on.
However, things are a little messier in the case of unsupervised learning, as we do not provide any prior labels to the system. The system has to examine the figures and group similar shapes by recognizing their properties. The shapes are grouped and given system-generated labels. That is why it is more challenging than supervised classification.
This also makes it more error prone compared to supervised machine learning techniques. The more accurately the algorithm groups the input, the more accurate the output and thus the decisions based on it. One of the most common techniques using unsupervised learning is clustering. There are a number of clustering algorithms that cluster data using its features; we devote a complete chapter to clustering and related techniques later. Table 3.1 shows the difference between supervised and unsupervised learning.
A/B testing is a strategy used to test your online promotions, advertising campaigns, Web site designs, application interfaces, etc.; fundamentally, the test analyzes user experience. We present two different versions of the same thing to users and try to analyze the user experience of both. The version that performs better is considered the best.
This testing strategy therefore helps you prioritize your policies: you can find out what is more effective compared to the alternatives, giving you the chance to improve your advertisements and business policies. We will now discuss some important concepts related to A/B testing.
Conducting an A/B test requires proper planning, including what to test, how to test, and when to test. Giving thorough consideration to these aspects will let you run a successful test with more meaningful results, because it helps you narrow your experiment down to the exact details you want to learn from your customers.
For example, decide whether you want to test your sales promotion or your email template, and whether the test will be conducted online or offline. If it is to be conducted online, decide whether you will present your actual site to the user or adopt some other way. Suppose you want to test a sales promotion through your online Web site: you can then focus on those parts of the Web site that present sales-related content to users. You can provide different copies and find out which design converts into business. This will give you not only more online business but also the justification to spend more on effective sales promotions.
Once you have decided what to test, you then identify the variables to include in your test. For example, if you want to test a sales ad, your variables may include:
• Color scheme
• The text of the advertisement
• The celebrities to be hired for the campaign.
It is important to carefully identify all the variables that may have a strong impact on your campaign and include them in the test.
Similarly, you should know the possible outcomes you are testing for. All the possibilities for each option should be tested, so that the option which provides better results can be chosen.
You should also conduct your tests simultaneously and continually, so that the effectiveness of whatever you are testing can be tracked over time and you can keep making dynamic decisions appropriate to the current scenario and situation.
A/B testing can have a huge impact on your business by revealing the user experience related to whatever you are testing. Launching something without knowing the user experience can be expensive. A/B testing gives you the opportunity to learn customer interests before launching. By knowing user preferences and the outcomes of the ads that performed better, you have justification for spending more on the campaigns that actually convert into profitable business.
It will also help you avoid strategies that offer little value to customers and thus contribute little. You can find out customer priorities and, in turn, give customers what they want and how they want it.
In the context of A/B testing, an important question is what you can test. Apparently you can test anything, from the format of your sales letter to a single image on your business Web site. However, this does not mean you should spend time and resources on testing everything related to your business. Carefully identify the things that have a strong impact on your business; only those are worth testing.
Similarly, once you have decided to test, for example, two newsletters A and B, make sure to test all possible combinations, e.g. the header of newsletter A with B and the header of newsletter B with A, and so on. This may require some careful preparation before conducting the tests.
Some examples of what you can test:
• Newsletter:
– Header
– Body text
– Body format
– Bottom image
• Web site:
– Web site header
– Web site body
– Sales advertisement
– Product information
– Web site footers
• Sales advertisement:
– Advertisement text
– Product images
– Celebrities images
– Position of the advertisement in newspaper.
You should carefully select the time period for which the test will be conducted; it depends on the frequency of responses you get. For example, if you are conducting a test on your Web site and you have a lot of traffic per day, you can run your test for just a few days, and vice versa.
If you allocate insufficient time to your test, you will get insufficient responses and thus skewed results. So, before selecting the time period, make sure the test stays open long enough to collect enough responses for accurate decisions.
Similarly, giving a test too much time may also give you skewed results. Note that there are no exact guidelines or heuristics about the time period of a test; as discussed above, you should select the period carefully, keeping in mind the frequency of responses you get. Past experience in this regard can be helpful.
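One common way to decide whether the observed difference between the two versions is real rather than noise is a two-proportion z-test. The sketch below is a minimal illustration with invented conversion counts; it is one possible analysis, not a procedure prescribed by this chapter.

# Comparing the conversion rates of two ad versions with a two-proportion z-test.
from math import sqrt, erf

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return the z statistic and two-sided p-value for the difference
    between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided
    return z, p_value

# Version A: 120 conversions from 2400 visitors; version B: 165 from 2500.
z, p = two_proportion_z_test(120, 2400, 165, 2500)
print(f"z = {z:.2f}, p-value = {p:.4f}")  # a small p-value favours keeping B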
Association rules are another machine learning technique; they try to find relationships between items. An important application of association rules is market basket analysis. Before going into the details of association rules, let us discuss an example.
In retail stores it is always important to place similar items (i.e. items that sell together) close to each other. This not only saves customers' time but also promotes cross-item sales. However, finding these item relationships requires some prior processing of the data.
It should be noted that association rules do not reflect an individual's interests and preferences, only the relationships between products. We find these relationships by using data from previous transactions.
Now let us discuss the details of association rules. A rule consists of two parts, an antecedent and a consequent, as shown below; both are lists of items.
Antecedent → Consequent
The rule simply states that if the antecedent is present, then the consequent is present as well, i.e. the implication represents co-occurrence. For a given rule, the itemset is the list of all the items in the antecedent and the consequent, for example:
{Bread, Egg} → {Milk}
Here the itemset comprises bread, egg, and milk. The rule may simply mean that customers who purchased bread and eggs also purchased milk most of the time, which suggests placing these products close to each other in the store.
There are a number of metrics that can be used to measure the accuracy and other aspects of association rules. Here we will discuss a few.
3.5.1 Support
Support measures how frequently an itemset appears in the data: it is the fraction of all transactions that contain the itemset. For example, if 100 transactions are recorded and 6 of them contain both butter and milk, then Support({Butter, Milk}) = 6/100 = 0.06.
3.5.2 Confidence
Suppose we want to find the confidence of the rule {Butter} → {Milk}. If there are 100 transactions in total, out of which 6 have both milk and butter, 60 have milk without butter, and 10 have butter but no milk, then the confidence is the fraction of butter-containing transactions that also contain milk:
Confidence({Butter} → {Milk}) = 6/(6 + 10) = 0.375
3.5.3 Lift
Lift is the ratio of the probability of the consequent being present given the antecedent, to the probability of the consequent being present without any knowledge of the antecedent. Mathematically:
Lift(A → B) = P(B | A) / P(B)
Again, we take the previous example. Suppose there are 100 transactions in total, out of which 6 have both milk and butter, 60 have milk without butter, and 10 have butter but no milk. The probability of having milk, given that butter is present, is 6/(10 + 6) = 0.375.
Similarly, the probability of having milk without any knowledge about butter is 66/100 = 0.66.
Now Lift = 0.375/0.66 ≈ 0.57.
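The sketch below shows how these three metrics can be computed directly from a small list of transactions; the transactions are invented for illustration, and the helper functions are not from any particular library.

# Support, confidence, and lift computed from invented transactions.
transactions = [
    {"bread", "egg", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "egg"},
    {"milk", "bread", "butter"},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    return confidence(antecedent, consequent) / support(consequent)

rule = ({"bread", "egg"}, {"milk"})
print("support   :", support(rule[0] | rule[1]))
print("confidence:", confidence(*rule))
print("lift      :", lift(*rule))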
A decision tree is used to show the possible outcomes of different choices that are made on the basis of certain conditions. Normally we have to make decisions by weighing different factors, e.g. cost, benefit, availability of resources, etc. This can be done using decision trees, where following a complete path leads us to a specific decision.
Figure 3.1 shows a sample decision tree. We start with a single node that splits into different branches. Each branch refers to an outcome, and once an outcome is reached we may have further branches; thus all the nodes are connected with each other. We follow a path on the basis of the different values of interest.
There are three different types of nodes:
Chance nodes: Chance nodes show the probabilities of certain outcomes. They are represented by circles.
Decision nodes: As the name implies, a decision node represents a decision that is made. It is represented by a square.
End nodes: An end node represents the final outcome of a decision path. It is represented by a triangle.
Figure 3.2 shows some decision tree notations.
To draw a decision tree, you first have to identify all the input variables on which your decisions are based, along with their possible values. Once you have all this, you are ready to construct the tree. Following are the steps to draw a decision tree:
1. Start with the first (main) decision. Draw the decision node symbol and then draw different branches out of it based on the possible options. Label each branch accordingly.
2. Keep adding chance and decision nodes using the following rules:
• If the previous decision leads to another decision, draw a decision node.
• If you are still not sure about the outcome, draw a chance node.
• Stop if the problem is solved.
3. Continue until each path has an end node, which means that all the possible paths on which decisions will be based have been drawn. Now assign values to the branches; these may be probabilities, business profits, expenditures, or any other quantities guiding the decision.
Decision trees also have applications in machine learning, data mining, and statistics. We can build various prediction models using decision trees. These models take as input the values of different attributes of an item and try to predict the output on the basis of those values.
In the context of machine learning, these types of trees are normally called classification trees; we will discuss them in detail in upcoming chapters. The nodes in these trees represent the attributes on the basis of which classification is performed, and the branches are made on the basis of the values of those attributes. They can be represented in the form of if-then-else conditions, e.g. if X > 75, then TaxRate = 0.34. The end nodes are called leaf nodes and represent the final classification.
Just as a decision tree may contain a series of events, its machine learning counterpart, the classification tree, may also have a number of connected nodes, thus forming a hierarchy. The deeper the tree structure, the more closely it may fit the data used to build it.
However, note that we may not always have discrete values; there may be scenarios where continuous quantities, such as the price or height of a person, etc., need to be modeled. For such situations we have another version of decision trees called regression trees.
It should be noted that an optimal classification tree is one which models most of the data with the minimum number of levels. There are multiple algorithms for creating classification trees; some common ones are CART, ASSISTANT, CLS, and ID3/4/5.
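For a hedged, concrete illustration, the sketch below trains a small classification tree with scikit-learn's CART implementation on an invented two-feature dataset and prints the learned if-then-else rules; the feature names and values are assumptions for demonstration only.

# A classification tree on invented data, printed as nested rules.
from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [income, age]; label: tax bracket (purely illustrative values).
X = [[30, 25], [45, 32], [80, 41], [95, 50], [60, 29], [120, 45]]
y = ["low", "low", "high", "high", "low", "high"]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["income", "age"]))  # the learned rules
print(tree.predict([[75, 38]]))  # classify a new, unseen object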
There are a number of methods that can be used to perform cluster analysis. Here we will discuss a few of them.
Hierarchical methods
These methods can be further divided into two clustering techniques as follows:
Agglomerative methods: In this method each individual object initially forms its own cluster. Similar clusters are then merged, and the process goes on until we are left with one big cluster or K larger clusters.
Divisive methods: These work in the opposite, top-down direction: all objects start in a single cluster, which is repeatedly split until the desired number of clusters is reached.
The objects in cluster analysis can have different types of data. In order to measure similarity you need some measure that determines the distance between objects in some coordinate system; objects that are close to each other are considered similar. There are a number of different measures that can be used to find the distance between objects, and the choice of measure depends on the type of data. One of the most common measures is the Euclidean distance.
The Euclidean distance between two points is simply the length of the straight line between the points in Euclidean space. So, if we have points P1, P2, P3, …, Pn, and point i is represented by (yi1, yi2, …, yin) while point j is represented by (yj1, yj2, …, yjn), then the Euclidean distance dij between points i and j is calculated as:
$d_{ij} = \sqrt{(y_{i1} - y_{j1})^2 + (y_{i2} - y_{j2})^2 + \cdots + (y_{in} - y_{jn})^2}$
Here, Euclidean space is a two-, three-, or n-dimensional space in which each point or vector is represented by n real numbers (x1, x2, x3, …, xn).
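The formula translates directly into code; the short sketch below computes the Euclidean distance between two n-dimensional points (the example coordinates are arbitrary).

# Euclidean distance between two n-dimensional points.
from math import sqrt

def euclidean_distance(p, q):
    return sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

print(euclidean_distance((1, 2, 1), (2, 3, 1)))  # ≈ 1.414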
Different linkage techniques are used in hierarchical clustering. Each has its own mechanism for deciding how to join or split clusters. Here we provide a brief overview of a few.
This is one of the simplest methods for measuring the distance between two clusters. The method considers the points (objects) that are in different clusters but closest to each other; the distance between two clusters is defined as the distance between their two closest members or neighbors, which can be measured using the Euclidean distance. The two clusters with the smallest such distance are considered the most similar. This method is often criticized because it does not take the cluster structure into account.
This is just the inverse of the single linkage method. Here we consider the two points in different clusters that are farthest from each other, and the distance between these points is taken as the distance between the two clusters. Again it is a simple method, but just like single linkage it does not take cluster structure into account.
Here the distance between two clusters is taken as the average of the distances between each pair of objects, one from each cluster. This is normally considered a robust approach to clustering.
In this method we calculate the centroid of the points in each cluster. The distance between the centroids of two clusters is then taken as the distance between the clusters, and the clusters with the minimum distance between centroids are considered closest to each other. This method is also generally considered better than the single linkage (nearest neighbor) or furthest neighbor methods.
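As a small, hedged illustration of these linkage criteria (using SciPy on invented two-dimensional points), the sketch below builds an agglomerative hierarchy with each method and cuts it into three clusters.

# Agglomerative clustering with different linkage criteria on invented points.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[1.0, 1.0], [1.2, 1.1], [5.0, 5.0], [5.1, 4.8], [9.0, 1.0]])

for method in ["single", "complete", "average", "centroid"]:
    Z = linkage(points, method=method)               # build the merge hierarchy
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut it into 3 clusters
    print(f"{method:>8}: {labels}")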
In this method the desired number of clusters, i.e. the value of K, is determined in advance as user input, and we try to find the solution that generates the best K clusters. We will explain this method with the help of an example.
Consider the following example using hypothetical data. Table 3.3 shows an unsupervised dataset containing objects {A, B, C, D}, where each object is characterized by the features F = {X, Y, Z}.
Using K-means clustering, objects {A, D} belong to cluster C1 and objects {B, C} belong to cluster C2. If computed, the feature subsets {X, Y}, {Y, Z}, and {X, Z} produce the same clustering structure, so any of them can be used as the selected feature subset. Note that we may have several feature subsets that fulfill the same criterion; any of them can be selected, but efforts are made to find the optimal one, i.e. the one with the minimum number of features.
Following are the steps to calculate clusters in an unsupervised dataset. The steps of the algorithm are given below:
Do {
Step-1: Calculate the centroids
Step-2: Calculate the distance of each object from the centroids
Step-3: Group objects based on their minimum distance from the centroids
} until no object moves from one group to another.
First of all we calculate the centroids; a centroid is the center point of a cluster. For the first iteration we may take any K points as centroids. Then we calculate the distance of each point from each centroid; here the distance is calculated using the Euclidean distance measure. Once the distance of each object from each centroid has been calculated, each object is assigned to the cluster whose centroid is closest to it. The iteration is then repeated by calculating new centroids for every cluster, using all the points that fall in that cluster after the completed iteration. The process continues until no point changes its cluster.
Now consider the data points given in the above dataset:
Step-1: We first take points A and B as the two centroids C1 and C2.
Step-2: We calculate the Euclidean distance of each point from these centroids using the Euclidean distance equation:
$d(x, y) = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2 + \cdots}$
In the resulting distance matrix, the value at row 1, column 1 represents the distance of point A from the first centroid (here point A itself is the first centroid, so it is the distance of point A from itself, which is zero); the value at row 1, column 2 represents the Euclidean distance between point B and the first centroid; similarly, the value at row 2, column 1 shows the Euclidean distance between point A and the second centroid, and so on.
Step-3: Looking at the third column, point C is closer to the second centroid than to the first. On the basis of the distance matrix, the following groups are formed:
$G^1 = \begin{pmatrix} 1 & 0 & 0 & 1 \\ 0 & 1 & 1 & 0 \end{pmatrix}$, with $C_1 = \{1, 2, 1\}$ and $C_2 = \{2, 3, 1\}$
where the value 1 shows that the corresponding point falls in that group. From the group matrix it is clear that points A and D are in one group (cluster), whereas points B and C are in the other group (cluster).
Now the second iteration starts. Since the first cluster contains the two points A and D, we first calculate their centroid by averaging their feature values, which gives C1 = {1, 2, 2}. Similarly, the centroid of the second cluster, containing points B and C, is C2 = {2, 2, 2}. On the basis of these new centroids we calculate the distance matrix, which is as follows:
$D^2 = \begin{pmatrix} 1 & 2 & 2 & 1 \\ 1 & 1 & 1 & 1 \end{pmatrix}$, with $C_1 = \{1, 2, 2\}$ and $C_2 = \{2, 2, 2\}$

$G^2 = \begin{pmatrix} 1 & 0 & 0 & 1 \\ 0 & 1 & 1 & 0 \end{pmatrix}$, with $C_1 = \{1, 2, 2\}$ and $C_2 = \{2, 2, 2\}$
Since $G^1 = G^2$, we stop here. Note that the first column of the $D^2$ matrix shows that point A is at an equal distance from both centroids $C_1$ and $C_2$, so it can be placed in either cluster.
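The same kind of grouping can be reproduced with an off-the-shelf K-means implementation. The sketch below uses scikit-learn on four invented three-feature objects (the values are assumptions, not the contents of Table 3.3).

# K-means with K = 2 on four invented objects A-D.
import numpy as np
from sklearn.cluster import KMeans

objects = ["A", "B", "C", "D"]
X = np.array([[1.0, 2.0, 1.0],   # A
              [6.0, 7.0, 6.0],   # B
              [6.5, 7.0, 5.5],   # C
              [1.0, 2.5, 1.5]])  # D

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
for name, label in zip(objects, kmeans.labels_):
    print(name, "-> cluster", label)   # A and D share one cluster, B and C the other
print("centroids:\n", kmeans.cluster_centers_)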
In general, the pattern recognition process comprises the steps shown in Fig. 3.4. We will now explain these steps one by one.
Data Acquisition Data acquisition is the process of obtaining and storing the data from which the patterns are to be identified. The data collection process may be automated, e.g. a sensor sensing real-time data, or manual, e.g. a person entering employee records. Normally the data is stored in the form of objects and their corresponding feature values, just like the data in Table 3.4.
Data Preprocessing The data collected for pattern recognition may not be in a ready-to-process form. It may contain anomalies, noise, missing values, etc. All of these deficiencies may affect the accuracy of the pattern recognition process and thus lead to inaccurate decisions. This is where the preprocessing step helps: all such anomalies are fixed, and the data is made ready for processing.
[Fig. 3.4: Pre-processing → Feature extraction → Classification → Post-processing]
Feature Extraction/Selection Once the data has been preprocessed, the next step is to extract or select the features that will be used to build the model. A feature is a characteristic, property, or attribute of an object of interest. Normally the entire dataset is not used, as datasets are large, with up to hundreds of thousands of features, and may contain redundant and irrelevant features as well. So, feature extraction/selection is the process of obtaining the features relevant to the problem. It should be noted that the accuracy of the model for pattern extraction depends on the accuracy of the feature values and their relevance to the problem.
Classification Once we have derived the feature vector from the entire dataset, the next step is to develop a classification model for identification of the patterns. The classification step is discussed in detail in upcoming chapters. Here it should be noted that a classifier simply reads the values of the data and, on the basis of those values, tries to assign a label to the data according to the identified pattern. Various classifiers are available, e.g. decision trees, artificial neural networks, Naïve Bayes, etc. Building a classifier requires two types of data, i.e. training data and testing data; we will discuss both in the next section.
Post-Processing Once the patterns have been identified and decisions made, we try to validate and justify the decisions made on the basis of the patterns. Post-processing normally includes steps to evaluate the confidence in these decisions.
The process of building a classification model requires training and testing the model on the available data. So, the data may be divided into two categories.
Training dataset: The training dataset, as the name implies, is required to train the model so that it can predict with maximum accuracy on unknown data in the future. We provide the training data to the model, check its output, and compare that output with the actual output to measure the error. On the basis of this comparison, we adjust (train) different parameters of the model. The process continues until the difference (error) between the produced output and the actual output is minimized.
Test dataset: Once the model is trained, the next step is to evaluate or test it, and for this purpose we use the test dataset. Note that both training and test data are already available, and we know the feature values and classes of their objects. The test dataset is provided to the model, and the produced output is compared with the actual output. We then compute different measures such as precision, accuracy, recall, etc.
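A minimal sketch of this train/test workflow is shown below, using scikit-learn with a standard built-in labelled dataset standing in for the kind of data described above; the split ratio and the choice of classifier are arbitrary assumptions.

# Train/test split and evaluation of a classifier.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

X, y = load_iris(return_X_y=True)             # a standard labelled dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)      # hold out 30% as the test set

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
y_pred = model.predict(X_test)                # predictions on unseen data

print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred, average="macro"))
print("recall   :", recall_score(y_test, y_pred, average="macro"))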
There are various applications of pattern recognition. Here we will mention just a few:
Optical Character Readers Optical character recognition models read a textual image and try to recognize the actual text, e.g. the postal codes written on letters. In this way the model can sort letters according to their postal codes.
Biometrics Biometrics may include face recognition, fingerprint identification, retina scans, etc. It is one of the most common and widely used applications of pattern recognition. Such models are helpful in automatic attendance marking, personal identification, criminal investigations, etc.
Diagnostic Systems Diagnostic systems are an advanced form of application that scans medical images (e.g. X-rays, images of internal organs, etc.) for medical diagnostics. This facilitates diagnosing diseases such as brain tumors, cancer, bone fractures, etc.
Speech Recognition Speech recognition is another interesting application of pattern
recognition. This helps in creating virtual assistants, speech-to-text converters, etc.
3.11 Summary
In this chapter, we provided a broad overview of the various techniques used in data
science applications. We discussed both classification and clustering methods and
techniques used for this purpose. It should be noted that each technique has its own
advantages and disadvantages. So, the selection of a particular technique depends on
the requirements of the organization, including the expected results, the type of
analysis required, and the nature of the available data.
Chapter 4
Data Preprocessing
It is essential to extract useful knowledge from data for decision making. However,
data is not always ready for processing: it may contain noise, missing values,
redundant attributes, etc. Data preprocessing is therefore one of the most important
steps to make data ready for final processing. Feature selection is an important task
used for data preprocessing; it helps remove noisy, redundant, and misleading features.
Based on its importance, in this chapter we will focus on feature selection and the
different concepts associated with it.
4.1 Feature
4.1.2 Categorical
Categorical features consist of symbols that represent domain values. For example,
to represent "Gender" we can use "M" or "F". Similarly, to represent employee type,
we can use "H" for hourly, "P" for permanent, etc. Categorical attributes are further
of two types.
Nominal In the nominal category of attributes, order does not make sense. For example,
for the "Gender" attribute there is no order: the equality operator applies, but the
less-than or greater-than comparisons do not make any sense.
We are living in a world where we are bombarded with tons of data every second.
With digitization in every field of life, the pace at which data originates is staggering.
It is common to have datasets with hundreds of thousands of records just for
experimental purposes. This increase in data results in the phenomenon called the
curse of dimensionality. Moreover, the increase is two-dimensional: we are not only
storing more attributes of real-world objects, but the number of objects and entities
being stored is also increasing. The ultimate drawback of these huge volumes of data
is that processing them for knowledge extraction and analytics becomes very tough
and requires a lot of resources. So, we have to find alternate ways to reduce the size,
ideally without losing information. One of the solutions is feature selection.
In this process only those features are selected that provide most of the useful
information. Ideally, we should get the same amount of information that the entire
set of features in the dataset would otherwise provide. Once such features have been
found, we can use them in place of the entire dataset. Thus, the process helps identify
and eliminate irrelevant and redundant features. Based on the facts mentioned above,
efforts are always made to select the best quality features. Overall, we can categorize
dimensionality reduction techniques into two categories.
(a) Feature Selection: A process which selects features from the given feature
set without transforming them or losing information. So, we preserve the data
semantics in the process.
(b) Feature Extraction: Feature extraction techniques, on the other hand, project
the current feature space onto a new feature subspace. This can be achieved by
combining features or applying some other mechanism. The process, however, has a
major drawback: we may lose information in the transformation, which means the
reverse process may not recover the information present in the original dataset. A
small sketch contrasting the two is given below.
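A minimal sketch of the distinction, assuming scikit-learn; the Iris data is only a stand-in, and mutual information and PCA are one possible pair of selection and extraction techniques.

```python
# Feature selection keeps original columns (semantics preserved);
# feature extraction projects onto new, combined axes (semantics lost).
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)

# Feature selection: keeps 2 of the original columns, so their meaning is preserved.
selector = SelectKBest(mutual_info_classif, k=2).fit(X, y)
X_selected = selector.transform(X)
print("Selected original feature indices:", selector.get_support(indices=True))

# Feature extraction: projects the data onto 2 new axes (linear combinations of the
# original features), so the transform is generally not exactly reversible.
X_projected = PCA(n_components=2).fit_transform(X)
print("Shape after projection:", X_projected.shape)
```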
As mentioned earlier, class labels are not always given; we then have to proceed
without labeling information, which makes unsupervised feature selection a somewhat
tougher task. To select features here, we can use clustering as the selection criterion,
i.e. we can select the features that give the same clustering structure that would be
obtained from the entire feature set. Note that a cluster is just a group of similar objects.
A simple unsupervised dataset is shown in Table 4.3.
The dataset contains four objects {X1, X2, X3, X4} and three features {C1, C2, C3}.
Objects {X1, X4} form one cluster and {X2, X3} form the other. Note that clustering
is explained in detail in upcoming chapters. By applying the nearest neighbor algorithm,
we can see that the same clusters are obtained if we use only the features {C1, C2},
{C2, C3}, or {C1, C3}. So, we can use any of these feature subsets, as the sketch below illustrates.
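A minimal sketch of this idea, assuming scikit-learn; KMeans and the adjusted Rand index are one possible way to compare the clustering structure obtained from a feature subset with that of the full feature set.

```python
# Keep a feature subset if clustering on it reproduces the clusters
# obtained from the full feature set.
from itertools import combinations
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

X, _ = load_iris(return_X_y=True)
full_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

for subset in combinations(range(X.shape[1]), 2):
    sub_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[:, subset])
    score = adjusted_rand_score(full_labels, sub_labels)
    if score > 0.95:   # subset gives (almost) the same clustering structure
        print("Feature subset", subset, "preserves the clustering, ARI =", round(score, 3))
```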
The filter-based approach is the simplest method for feature subset selection. In this
approach features are selected without considering the learning algorithm, so the
selection remains independent of the algorithm. Each feature can be evaluated either
individually or as part of a complete subset, and various feature selection criteria can
be used for this purpose. In individual feature selection, each feature is assigned a rank
according to a specified criterion and the features with the highest ranks are selected.
In the other case the entire feature subset is evaluated. It should be noted that the
selected feature set should always be minimal, i.e. the fewer the features in the
subset, the better.
A generic filter-based approach is shown in Fig. 4.3.
Fig. 4.3 Generic filter-based approach taken from: John, George H., Ron Kohavi, and Karl Pfleger.
“Irrelevant features and the subset selection problem.” Machine learning: proceedings of the eleventh
international conference. 1994
It should be noted that only a unidirectional link exists between the feature selection
process and the induction algorithm: the selected features are passed on, but no
feedback flows back. A small ranking sketch is given below.
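A minimal sketch of the filter idea, assuming scikit-learn; mutual information is one possible ranking criterion, and the Iris data is only illustrative.

```python
# Rank features by a criterion computed from the data alone,
# independently of any classifier, then keep the top-ranked ones.
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

X, y = load_iris(return_X_y=True)
scores = mutual_info_classif(X, y, random_state=0)

ranking = sorted(enumerate(scores), key=lambda p: p[1], reverse=True)
print("Features ranked by relevance:", ranking)

top_k = [idx for idx, _ in ranking[:2]]   # keep the k highest-ranked features
print("Selected features:", top_k)
```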
The pseudocode of a generic feature selection algorithm is shown in Listing 4.1
below:
Listing 4.1 A Generic Filter Approach taken from: Ladha, L., and T. Deepa. “Feature
selection methods and algorithms.” International journal on computer science and
engineering 3.5 (2011): 1787–1797.
Input:
S—data sample with C number of features
E—evaluation measure
SGO—successor generation operator
Output:
S'—output solution
I := Start I;
S' := {best of I with respect to E};
repeat
    I := Search(I, SGO(I), C);
    C' := {best of I according to E};
    if E(C') ≥ E(S') or (E(C') = E(S') and |C'| < |S'|) then S' := C';
until Stop(E, I).
Here S is the input data sample, "E" is the measure to be optimized (the selection
criterion), S' is the final result that will be output, and SGO is the operator used to
generate the next feature subset. We may start with an empty feature set and keep
adding features according to the criterion or, alternatively, start with the full feature
set and keep removing unnecessary features. The process continues until we obtain
the optimized feature subset.
Fig. 4.4 Generic wrapper-based approach taken from: John, George H., Ron Kohavi, and Karl
Pfleger. “Irrelevant features and the subset selection problem.” Machine learning: proceedings of
the eleventh international conference. 1994
As can be seen, the feature selection process, comprising feature search and evalu-
ation, is interconnected with the induction algorithm, and the overall process comprises
three steps:
1. Feature subset search
2. Feature evaluation considering the induction algorithm
3. Repetition of the process until the optimization criterion is met.
It should be noted that the feature subset is passed to the induction algorithm, and the
quality of the subset is determined on the basis of the algorithm's feedback. The
feedback may be in the form of any measure, e.g. the error rate. A small sketch of
this idea follows.
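A minimal sketch of a wrapper approach, assuming scikit-learn; the greedy forward search, the decision tree as induction algorithm, and cross-validated accuracy as the feedback measure are all illustrative choices.

```python
# Greedy forward selection: the induction algorithm (decision tree with
# cross-validation) provides the feedback on each candidate subset.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
remaining, selected = set(range(X.shape[1])), []
best_score = 0.0

while remaining:
    # Try adding each remaining feature and keep the one that improves accuracy most.
    trials = {f: cross_val_score(DecisionTreeClassifier(random_state=0),
                                 X[:, selected + [f]], y, cv=5).mean()
              for f in remaining}
    f_best, score = max(trials.items(), key=lambda p: p[1])
    if score <= best_score:        # stop when no feature improves the feedback measure
        break
    selected.append(f_best)
    remaining.remove(f_best)
    best_score = score

print("Wrapper-selected features:", selected, "accuracy:", round(best_score, 3))
```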
Both of the approaches discussed above have their own advantages and disadvan-
tages. Filter-based approaches are efficient, but the selected features may not be
quality features because feedback from the learning algorithm is not considered. In
wrapper-based approaches quality features are selected, but repeatedly getting
feedback and updating features is computationally inefficient.
Embedded methods take advantage of the strong points of both approaches. In an
embedded method, features are selected as part of the classification algorithm itself,
without a separate feedback loop. A generic embedded approach works as follows
(a small sketch is given after the list):
1. Feature subset initialization
2. Feature subset evaluation using an evaluation measure
3. Repetition of the process as part of model training until the criteria are met.
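A minimal sketch of an embedded approach, assuming scikit-learn; L1-regularised logistic regression is one possible choice (tree-based feature importances would be another), and the breast cancer data is only a stand-in.

```python
# Features are chosen as a by-product of training the model itself:
# coefficients driven to zero by the L1 penalty mark discarded features.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
kept = np.flatnonzero(model.coef_[0])     # features with non-zero weights survive
print("Features kept by the embedded method:", kept)
```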
The core objective of the feature selection process is to select those features that can
represent the entire dataset in terms of the information they provide. So, the fewer
the features, the better the performance. Various objectives of feature selection are
provided in the literature. Some of these are as follows:
1. Efficiency: Feature selection results in a smaller number of features which provide
the same information while enhancing quality. This is shown using an example
in Table 4.5.
Note that if we classify the objects in the above dataset using all three features, we
get the classification given on the left side of the diagram. However, if we use only
features C1 and C2, we get the same classification, as shown in Fig. 4.5. So, we can
use the feature subset {C1, C2} instead of the entire feature set.
The above is a very simple example; consider the case when we have hundreds or
thousands of features.
2. Avoid overfitting: Since feature selection helps us remove noisy and irrelevant
features, the accuracy of the classification model also increases. In the example
given above, we can safely remove feature C3, as removing it neither increases
nor decreases the classification accuracy.
Fig. 4.5 Classification: Using selected features versus entire feature set
3. Identifying the relation between data and process: Feature selection helps us
understand the relationship between features and the process that generated them.
For example, in the table we can see that "Class" is fully dependent on {C1, C2},
so to fully predict the class we only need the values of these two features.
4.5.1 Information Gain
Information gain relates to the uncertainty of a feature. For two features X and
Y, feature X will be preferable if IG(X) > IG(Y). Mathematically:
IG(X) = Σi U(P(Ci)) − E[ Σi U(P(Ci | X)) ]
Here
U = Uncertainty function.
P(Ci) = Probability of Ci before considering feature "X".
P(Ci | X) = Probability of Ci after considering feature "X".
A small computational sketch follows.
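A minimal sketch of information gain with entropy playing the role of the uncertainty function U; the toy feature and class arrays are made up purely for illustration.

```python
# Information gain = uncertainty before observing the feature
#                    - expected uncertainty after observing it.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    total = entropy(labels)
    n = len(labels)
    expected = 0.0
    for v in set(feature):
        subset = [c for f, c in zip(feature, labels) if f == v]
        expected += (len(subset) / n) * entropy(subset)   # weighted uncertainty given X = v
    return total - expected

X_feature = ["a", "a", "b", "b", "b"]      # hypothetical feature values
classes   = ["yes", "yes", "no", "no", "yes"]
print("IG =", round(information_gain(X_feature, classes), 3))
```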
4.5.2 Distance
The distance measure represents how effectively a feature can discriminate between
classes. A feature with high discrimination power is preferable. Discrimination, here,
can be defined in terms of the difference between the probabilities P(X|Ci) and
P(X|Cj), where Ci and Cj are two classes, P(X|Ci) is the probability of X when the
class is Ci, and, similarly, P(X|Cj) is the probability of X when the class is Cj. So,
for two features X and Y, X will be preferable if D(X) > D(Y).
4.5.3 Dependency
4.5.4 Consistency
Consistency is another measure that can be used for feature subset selection. A
consistent feature subset is one that provides the same class structure as that provided
by the entire feature set.
One of the important steps in feature selection algorithms is to generate the next
subset of features. It should be noted that the next feature subset should contain
better quality features than the current one. Normally, three feature subset generation
schemes are used for this purpose.
Firstly, we have the forward feature generation scheme, where the algorithm starts
with an empty set and keeps adding new features one by one. The process continues
until the desired criteria are met.
A generic forward feature generation algorithm is given in Listing 4.2 below:
Input:
S—data set containing X features
E—evaluation measure
Output:
S'—output solution
a) S' ← {∅}
b) Repeat: add the next best feature x ∈ S − S' according to E, i.e. S' ← S' ∪ {x}
c) Until Stop(E, S')
d) Return S'.
Here S' is the final feature subset that the algorithm will output. Initially it is an
empty set, and features are added one by one until the solution meets the evaluation
criterion "E".
Secondly, we can use the backward feature generation scheme. This is the exact
opposite of the forward feature generation mechanism: we start with the full feature
set and keep eliminating features until we reach a subset from which no more features
can be eliminated without affecting the criterion.
Listing 4.3 shows a generic backward feature selection algorithm below:
Note that we assign the entire feature set to S' and then remove features one by one.
A feature can safely be removed if, after removing it, the criterion remains intact.
We can also use a combination of both approaches, where the algorithm works with
an empty set and a full set at the same time: on one side we keep adding features
while on the other we keep removing them.
Thirdly, we have the random approach. As the name implies, in the random feature
generation scheme we randomly include and skip features. The process continues
until the mentioned criterion is met. Features can be selected or skipped using any
scheme; Listing 4.4 below shows a simple one in which a random value is generated
between 0 and 1 for each feature: if the value is less than or equal to 0.5, the feature
is included, otherwise it is excluded. This gives each feature an equal opportunity to
be part of the solution.
(Fig. 4.6, fragment: search organization—heuristic, random, or exhaustive.)
Listing 4.4 A random feature generation algorithm using hit and trial approach
Input:
S—data set containing X features
E—evaluation measure
Output:
S'—output solution
a) S' ← {∅}
b) For i = 1 to n
       If Random(0, 1) ≤ 0.5 then S' ← S' ∪ {Xi}
c) Until Stop(E, S')
d) Return S'.
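A minimal runnable version of the hit-and-trial scheme in Listing 4.4; the evaluation measure E is left as a placeholder function supplied by the caller, and the toy criterion used in the example is made up.

```python
# Each feature is included with probability 0.5; the subset is kept only
# if it satisfies the evaluation measure E (the stopping criterion).
import random

def random_feature_generation(features, E, max_tries=1000, seed=0):
    rng = random.Random(seed)
    for _ in range(max_tries):
        subset = [f for f in features if rng.random() <= 0.5]  # include or skip at random
        if subset and E(subset):            # stopping criterion met
            return subset
    return None                             # no acceptable subset found in time

# Example use with a toy criterion: "the subset must contain feature 'a'".
print(random_feature_generation(list("abcde"), E=lambda s: "a" in s))
```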
The details given above provide the basis for three types of feature selection algo-
rithms. First, we have exhaustive search algorithms, where the entire feature space
is searched to select an appropriate feature subset. Exhaustive search provides an
optimal solution, but due to resource limitations it becomes infeasible for anything
beyond small datasets.
Random search, on the other hand, randomly searches the feature space and keeps
the process going until some solution is found within the mentioned time. The
process is fast, but the drawback is that we may not obtain an optimal solution.
The third and most common strategy is to use heuristics-based search, where the
search mechanism is guided by some heuristic function. A common example of
this type of algorithm is the genetic algorithm. The process continues until we find a
solution or a specified time threshold has been reached.
So, overall feature selection algorithms can be characterized by three parts as
follows:
1. Search organization
2. Generation strategy for next feature subset selection
3. Selection criteria.
Figure 4.6 shows different parts of a typical feature selection algorithm.
Now we will provide some concepts related to features and feature selection.
S(C − {Xi}) = S(X)
Here:
S is selection criteria (e.g. dependency)
C is current feature subset
X is the entire feature set.
So, if after removing the feature the selection criterion computed on the current
feature subset remains equal to that of the entire feature set, the feature can be termed
irrelevant and thus safely removed. Depending on the selection criterion, a feature can
be either strongly relevant, weakly relevant, or irrelevant.
As can be seen, the selection criterion remains the same even after removing the
feature, so the feature Xj will be declared redundant and, like an irrelevant feature,
it can also be removed.
Today we are living in a world where the curse of dimensionality is a common problem
for applications to deal with. Within limited resources it becomes infeasible
to process such huge volumes of data, so feature selection remains an effective
approach to deal with this issue. It has become a common preprocessing task for
the majority of domain applications. Here we will discuss only a few of the domains
using feature selection.
With the widespread use of the Internet, systems have become increasingly vulnerable
to attacks by hackers and intruders. Static security mechanisms become insufficient
when hackers and intruders find new attack mechanisms on a daily basis, so we need
dynamic intrusion detection systems. These systems may need to monitor the huge
volume of traffic passing through them every minute and second. Feature selection
can help them identify the important and more relevant features to inspect for
detection, thus enhancing their performance.
Information systems are one of the main application areas of feature selection. It is
common to have information systems processing hundreds or thousands of features
for different tasks, so feature selection becomes a handy tool for such systems. To
get the idea, simply compare a classification task using 1000 features with the same
classification performed using 50 features while obtaining the same or sufficient
classification accuracy.
Feature selection is of great help in the era of the curse of dimensionality; however,
the process has some issues which still need to be handled. Here we will discuss a few.
4.8.1 Scalability
Although feature selection provides a solution for handling huge volumes of data,
an inherent issue a feature selection algorithm faces is the amount of resources it
requires.
For a feature selection algorithm to rank the features, the entire dataset has to be
kept in memory until the algorithm completes. Keeping in mind the large volume of
data, this requires a large amount of memory. The issue is unavoidable because, to
produce quality features, the entire data has to be considered. So, the scalability of
feature selection algorithms with respect to dataset size is still a challenging task.
4.8.2 Stability
A feature selection algorithm should be stable, i.e. it should produce the same results
even with small perturbations in the data. Various factors such as the amount of data
available, the number of features, and the distribution of the data may affect the
stability of a particular feature selection algorithm.
Feature selection algorithms make an important assumption that the available data
points are independent and identically distributed. An important factor that is often
ignored is that
data may be linked with other data, perhaps even in other datasets, e.g. a user may be
linked with their posts, and posts may be liked by other users. Such scenarios especially
arise in the domain of social media applications. Although research is underway to
deal with this issue, handling linked data is still a challenge that should be considered.
Based on how a feature selection algorithm searches the solution space, feature
selection algorithms can be divided into three categories.
• Exhaustive algorithms
• Random search-based algorithms
• Heuristics-based algorithms.
Exhaustive algorithms search the entire feature space to select a feature subset.
These algorithms provide optimal results, i.e. the resultant feature subset contains the
minimum number of features. This is one of their greatest benefits; unfortunately,
however, exhaustive search is usually not possible as it requires a lot of processing
resources. The issue becomes more serious as the size of the dataset increases: for a
dataset with n attributes, an exhaustive algorithm has to explore 2^n candidate
subsets. Beyond computational power, such algorithms also require a lot of memory,
and it should be noted that it is now common to have datasets with tens of thousands
or even millions of features. Exhaustive algorithms become practically impossible
to apply to such datasets.
For example, consider the following dataset given in Table 4.6.
An exhaustive algorithm will have to check the following subsets:
{}, {a}, {b}, {c}, {d}, {e}, {a, b}, {a, c}, {a, d}, {a, e}, {b, c}, {b, d}, {b, e}, {c, d},
{c, e}, {d, e}, {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c, e}, {a, d, e}, {b, c, d},
{b, c, e}, {b, d, e}, {c, d, e}, {a, b, c, d}, {a, b, c, e}, {a, b, d, e}, {a, c, d, e},
{b, c, d, e}, {a, b, c, d, e} — that is, 2^5 = 32 subsets in total.
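A minimal sketch of why this enumeration explodes; the feature names match the example above, and the count doubles with every additional feature.

```python
# Enumerate every subset of {a, b, c, d, e}: 2^5 = 32 candidates.
from itertools import combinations

features = ["a", "b", "c", "d", "e"]
subsets = [list(c) for r in range(len(features) + 1)
                   for c in combinations(features, r)]
print("Number of candidate subsets:", len(subsets))   # 32

# For 1000 features the number of subsets is 2**1000 — far beyond any feasible search.
print("2**1000 has", len(str(2 ** 1000)), "digits")
```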
Now consider a dataset with a thousand features or objects: the computational
time will increase substantially. As discussed earlier, feature selection is used
as a preprocessing step for various data analytics algorithms, so if this step is
computationally expensive, it will create serious performance bottlenecks for the
underlying algorithm using it. However, we may have semi-exhaustive algorithms,
which do not explore the entire solution space. Such algorithms search the solution
space only until the required feature subset is found, and they can use both
forward selection and backward elimination strategies.
The algorithm keeps adding or removing features until a specified criterion is
met. Although such algorithms are more efficient than fully exhaustive algorithms,
it is still not possible to use them for medium or larger dataset sizes. Furthermore,
such algorithms suffer from a serious dilemma, i.e. the distribution of the features:
if a high-quality feature appears at the beginning, the algorithm is expected to stop
earlier than when the high-quality attributes are indexed at later positions in the
dataset.
One optimization for such algorithms, provided in filter-based approaches, is that
instead of starting directly with subsets, we first rank all the features. Note that here
we will use the rough set-based dependency measure for ranking and as the feature
selection criterion. We check the dependency of the decision class ("Z" in the dataset
given above) on each attribute one by one. Once all the attributes are ranked, we start
combining the features. However, instead of combining the features in the sequence
given above, we combine them in decreasing order of their rank. Here we have made
two assumptions.
• Features having high ranks are high-quality features, i.e. having minimum noise
and high classification accuracy.
• Combining the features with high ranks is assumed to generate high-quality
feature subsets more quickly as compared to combining the low ranked features.
Features having the same rank may be considered as redundant.
This approach results in the following benefits:
• The algorithm does not depend on the distribution of features, as the order of the
features becomes insignificant when combining them for subset generation.
• The algorithm is expected to generate results quickly, thereby increasing the
performance of the corresponding algorithm using it. A small sketch of this
rank-then-combine idea is given below.
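A minimal sketch of the rank-then-combine idea using a rough-set dependency measure: the dependency of the decision on a feature subset B is the fraction of objects whose B-equivalence class is pure with respect to the decision. The toy dataset (features a–e, decision Z) is made up for illustration and is not Table 4.6.

```python
# Rank single features by dependency, then combine them in decreasing
# rank order until the subset fully determines the decision.
from collections import defaultdict

data = [  # (a, b, c, d, e, Z)
    (1, 0, 1, 0, 1, "yes"),
    (1, 0, 0, 0, 1, "yes"),
    (0, 1, 1, 1, 0, "no"),
    (0, 1, 0, 1, 0, "no"),
    (1, 1, 1, 0, 0, "yes"),
]

def dependency(feature_idx):
    groups = defaultdict(list)
    for row in data:
        key = tuple(row[i] for i in feature_idx)      # equivalence class under the subset
        groups[key].append(row[-1])
    pure = sum(len(g) for g in groups.values() if len(set(g)) == 1)
    return pure / len(data)

ranked = sorted(range(5), key=lambda i: dependency([i]), reverse=True)
subset = []
for i in ranked:
    subset.append(i)
    if dependency(subset) == 1.0:      # subset fully determines the decision Z
        break
print("Ranked features:", ranked, "selected subset:", subset)
```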
Although these algorithms are more efficient than the previously discussed semi-
exhaustive algorithms, even they can become computationally expensive when the
number of features and objects in the dataset grows beyond a small size.
Then we have random feature selection algorithms. These algorithms use a hit-and-
trial method for feature subset selection. A generic random feature selection
algorithm is given in Listing 4.5.
The advantage of these algorithms is that they can generate feature subsets more
quickly than other types of algorithms, but the downside is that the generated
feature subset may not be optimized, i.e. it may contain redundant or irrelevant
features that contribute nothing to the feature subset.
4.10 Genetic Algorithm
In a genetic algorithm, each candidate feature subset is encoded as a chromosome: a
binary string in which "1" represents the presence of a feature and "0" represents its
absence. These features are selected randomly. For the dataset given in Table 4.6, a
random chromosome may be
10011
Here the first "1" represents the presence of feature "a", while the second and third
values, "0", show that features "b" and "c" will not be part of the solution. Similarly,
"d" and "e" will be included, as shown by the "1" values at the fourth and fifth
places. Each value is called a "gene", so a gene represents a feature. The chromosome
shown above therefore represents the feature subset {a, d, e}.
A population may contain many chromosomes, depending on the requirements of
the algorithm.
Fitness Function
Once the population is initialized, the next task is to check the fitness of each chromo-
some. The chromosomes having higher fitness are retained and those with lower
fitness are discarded. The fitness function is the one that specifies our selection
criterion; it may be information gain, the Gini index, or dependency in the case of
rough set theory. Here we will consider the dependency measure as the selection
criterion. Chromosomes with higher dependency are preferable. An ideal chromosome
is one having a fitness value of "1", i.e. the decision class depends fully on the set of
features encoded by the chromosome.
After this we start our iterations and keep checking until the required chromosome
is found.
It should be noted that when chromosomes are initialized, different chromosomes
may have different fitness. We select the best ones, having higher dependency,
and use them for crossover. This is in line with the concept of survival of the fittest,
inspired by nature.
Crossover
Applying crossover is one of the important steps of a genetic algorithm; through it
we generate offspring, which form the next population. It should be noted
that only the solutions having high fitness participate in crossover, so that the resulting
offspring are also healthy solutions. There are various types of crossover operators.
Following is an example of a simple one-point crossover operator in which two
chromosomes simply exchange one part with each other to generate new offspring.
10101
11010
We exchange the parts of the chromosomes after the third gene, so the resulting
offspring become
10110
11001
Similarly, in two-point crossover, we select two random points and the chromo-
somes exchange the genes between these points, as shown below:
10011
11100
There are other types of crossover operators as well, e.g. uniform crossover,
order-based crossover, etc. The choice of a specific crossover operator depends on
the requirements.
Mutation
Once crossover is performed, the next step is to perform mutation. Mutation is the
process of randomly changing a few bits. The reason behind this is to introduce diver-
sity into the genetic algorithm and to introduce new solutions so that previous solutions
are not simply repeated. In its simplest form, a single bit is flipped in the offspring. For
example, consider the following offspring that resulted from the crossover of parent
chromosomes:
10110
After flipping the second gene (bit) the resultant chromosome becomes:
11110
Just like crossover, there are many types of mutation operators; a few include
flip-bit mutation, uniform, non-uniform, and Gaussian mutation.
We then evaluate the fitness of each chromosome in the population, and the best
chromosomes replace the bad ones. The process continues until the stopping criteria
are met.
Stopping Criteria
There can be three stopping criteria in a genetic algorithm. The first and ideal one is
that we find the ideal chromosomes, i.e. chromosomes that have the required fitness.
In the case of the dependency measure, chromosomes having a dependency value of
"1" will be the ideal ones.
The second possibility is that we do not get an ideal chromosome; in such cases the
algorithm is executed until the chromosomes start repeating, and we then select the
chromosomes with maximum fitness. However, in some cases even this scenario does
not occur, so we can use a generation threshold, i.e. if the algorithm does not produce
a solution after n generations, we stop and the chromosomes with maximum fitness
may be considered the potential solution. A small end-to-end sketch of the whole
procedure is given below.
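A minimal sketch of the genetic algorithm above applied to feature selection with binary chromosomes; the fitness function is a made-up placeholder standing in for a real measure such as rough-set dependency, and the population size, generation count, and mutation rate are illustrative.

```python
# Population of binary chromosomes -> selection -> one-point crossover ->
# flip-bit mutation, repeated for a fixed number of generations.
import random

rng = random.Random(0)
N_FEATURES, POP_SIZE, GENERATIONS = 5, 6, 30

def fitness(chrom):
    # Placeholder criterion: reward chromosomes containing features 0 and 3,
    # penalise large subsets (stand-in for dependency, information gain, etc.).
    return (chrom[0] + chrom[3]) - 0.1 * sum(chrom)

def crossover(p1, p2):
    point = rng.randint(1, N_FEATURES - 1)          # one-point crossover
    return p1[:point] + p2[point:], p2[:point] + p1[point:]

def mutate(chrom, rate=0.1):
    return [1 - g if rng.random() < rate else g for g in chrom]  # flip-bit mutation

population = [[rng.randint(0, 1) for _ in range(N_FEATURES)] for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):
    population.sort(key=fitness, reverse=True)
    parents = population[:POP_SIZE // 2]            # survival of the fittest
    children = []
    while len(children) < POP_SIZE - len(parents):
        c1, c2 = crossover(*rng.sample(parents, 2))
        children += [mutate(c1), mutate(c2)]
    population = parents + children[:POP_SIZE - len(parents)]

best = max(population, key=fitness)
print("Best chromosome:", best, "-> selected features:", [i for i, g in enumerate(best) if g])
```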
Heuristics-based algorithms are the most commonly used. However, they may have
certain drawbacks, e.g.
• They may not produce the optimal solution. In the case of the genetic algorithm, the
resulting chromosomes may contain irrelevant features that contribute nothing to
the feature subset.
• As heuristics-based algorithms are random in nature, different executions may
take different execution times and the same solutions may not be produced again.
Table 4.7 shows the comparison of both exhaustive and heuristics-based
approaches.
Book presents recent developments and research trends in the field of feature selec-
tion for data and pattern recognition, highlighting a number of latest advances.
Divided into four parts—nature and representation of data; ranking and explo-
ration of features; image, shape, motion, and audio detection and recognition;
decision support systems, it is of great interest to a large section of researchers
including students, professors, and practitioners.
• Hierarchical Feature Selection for Knowledge Discovery (Authors: Cen Wan)
Book systematically describes the procedure of data mining and knowledge
discovery on bioinformatics databases by using the state-of-the-art hierarchical
feature selection algorithms. Furthermore, this book discusses the mined biolog-
ical patterns by the hierarchical feature selection algorithms relevant to the aging-
associated genes. Those patterns reveal the potential aging-associated factors that
inspire future research directions for the biology of aging research.
• Feature Selection and Enhanced Krill Herd Algorithm for Text Document
Clustering (Authors: Laith Mohammad Qasim Abualigah)
Presents a new method for solving the text document clustering problem and
demonstrates that it can outperform other comparable methods.
4.12 Summary
In this chapter we discussed feature selection as a core data preprocessing task: its
objectives, the filter, wrapper, and embedded approaches, the feature generation
schemes and selection criteria, and exhaustive, random, and heuristics-based (e.g.
genetic) algorithms, along with the related issues of scalability, stability, and linked
data.
Chapter 5
Classification
Classification is the process of grouping objects and entities on the basis of the
available information. It is an important step that forms the core of data analytics and
machine learning activities. In this chapter we will discuss some of the basic concepts
of the classification process. We will discuss the decision tree technique in depth for
this purpose. Decision trees have already been discussed at an abstract level in the
previous chapters; here we will provide in-depth details. Although the concepts will
be discussed from the decision tree point of view, they are applicable to other
classification techniques as well. Along with decision trees, we will provide a brief
overview of other classification techniques including naïve Bayes, support vector
machines, and artificial neural networks.
5.1 Classification
Attributes such as "has garden", "occupation", "home type", etc. are the attributes
on the basis of which the class of a person is decided.
Classification can be used for predictive modeling, i.e. on the basis of the classifi-
cation process we can predict the class of a record whose class is not already
known.
For example, consider the record given in Table 5.2.
We can use the dataset given in Table 5.1 to train the classification model and
then use the same model to predict the class of the above-mentioned record. It should
be noted that classification techniques are most suited for predicting binary or nominal
classes; they are less effective for other types of attributes, such as ordinal ones,
because they do not take the implicit ordering into account.
A classification process is a systematic approach to building a classifier model on the
basis of training records. The classifier model is built in such a way that it predicts
the class of unknown data with a certain accuracy. The model identifies the class
of unknown records on the basis of the attribute values and classes in the training
dataset. The accuracy of the classification model depends on many factors, including
the size and quality of the training dataset.
Figure 5.3 shows a generic classification model.
The accuracy of a classification model is measured in terms of the records correctly
classified compared to the total number of records. The tool used for this purpose
is called the confusion matrix, given in Table 5.3.
A confusion matrix shows the number of records of class i that are predicted as
belonging to class j. For example, f01 is the total number of records that belong
to class "0" but to which the model has assigned class "1".
The confusion matrix can be used to derive a number of metrics for classification
model evaluation. Here we will present a few.
Accuracy: Accuracy is the ratio of the number of correctly predicted records to the
total number of records. In our confusion matrix, f11 and f00 are the correctly
predicted records:

Accuracy = (Number of correct predictions) / (Total number of predictions) = (f11 + f00) / (f11 + f10 + f01 + f00)
Error Rate: The error rate is the ratio of incorrectly predicted records to the
total number of records. In our confusion matrix, f10 and f01 are the incorrectly
predicted records:
Error rate = (Number of wrong predictions) / (Total number of predictions) = (f10 + f01) / (f11 + f10 + f01 + f00)
A model that provides maximum accuracy and minimum error rate is always
desirable. A small numeric sketch follows.
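A minimal sketch of the two formulas above; the confusion-matrix counts are made up for illustration.

```python
# Accuracy and error rate from a 2x2 confusion matrix using the f_ij notation.
f11, f10, f01, f00 = 40, 5, 8, 47     # f_ij = records of class i predicted as class j

total = f11 + f10 + f01 + f00
accuracy = (f11 + f00) / total        # correctly predicted records
error_rate = (f10 + f01) / total      # incorrectly predicted records

print("Accuracy  :", accuracy)
print("Error rate:", error_rate)
assert abs(accuracy + error_rate - 1.0) < 1e-9   # the two measures always sum to 1
```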
The decision tree is one of the most commonly used classifiers. We have already
discussed decision trees in previous chapters; here we will provide details of the
concept along with the relevant algorithms.
Before going into the details of how decision trees perform classification, suppose
we want to classify a person, "Shelby", on the basis of the values of the attributes
presented in Table 5.1. We may start with the first attribute and consider whether her
home has a garden or not; we may then consider the value of the second attribute,
i.e. her occupation; and we may continue through the attributes one by one until we
reach the decision class of the person. This is exactly how a decision tree works.
Figure 5.4 shows a sample decision tree of the dataset given in Table 5.1.
Building an optimal decision tree is a computationally expensive job because the
number of potential trees can be extremely large, especially when the number of
attributes grows beyond a small number.
So, for this purpose, we normally use a greedy search approach based on local
optima. Many such solutions exist; one of the algorithms is Hunt's algorithm. Hunt's
algorithm uses a recursive approach and builds the final tree by incrementally adding
sub-trees.
Suppose Dt is the training set and y = {y1, y2, …, yc} are the class labels; then,
according to Hunt's algorithm:
Step 1: If all records in Dt belong to a single class, then the node is a leaf node
labeled with that class yt.
Step 2: In the case of multiple classes, we select an attribute to partition the dataset
into smaller subsets and create a child node for each value of the attribute. The
algorithm proceeds recursively on each child until we get the complete tree. A
minimal sketch of this recursion is given below.
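A minimal recursive sketch of Hunt's algorithm in pure Python. The attribute-selection rule here simply takes attributes in a fixed order, standing in for a measure such as information gain, and the tiny dataset is a made-up fragment in the spirit of Table 5.4.

```python
# Records are dicts of attribute values plus a "class" key.
from collections import Counter

def hunt(records, attributes):
    classes = [r["class"] for r in records]
    if len(set(classes)) == 1:                      # Step 1: single class -> leaf node
        return classes[0]
    if not attributes:                              # no attribute left -> majority class
        return Counter(classes).most_common(1)[0][0]
    attr, rest = attributes[0], attributes[1:]      # Step 2: split on the next attribute
    tree = {}
    for value in set(r[attr] for r in records):
        subset = [r for r in records if r[attr] == value]
        tree[value] = hunt(subset, rest)            # recurse on each child node
    return {attr: tree}

data = [
    {"Student Level": "UG", "Marks": "<80",  "class": "Not-permitted"},
    {"Student Level": "PG", "Marks": ">=80", "class": "Permitted"},
    {"Student Level": "PG", "Marks": "<80",  "class": "Not-permitted"},
]
print(hunt(data, ["Student Level", "Marks"]))
```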
Example: Suppose we want to create a classification tree for determining whether a
person will be permitted to register or not. For this purpose, we consider the dataset
of previous students shown in Table 5.4 and try to make the determination on this
basis. The attributes considered are {Student Level, Student Type, Marks,
Registration}.
We start with an initial tree that contains only a single node for the class Registration =
Not-permitted. Note that the dataset contains more than one class label, so we
select our first attribute, i.e. "Student Level", assuming for now that this is the best
attribute for splitting the dataset. We keep splitting the dataset until we get the final
tree. The tree formed after each step is given in Fig. 5.5.
Hunt's algorithm assumes that the training dataset is complete, which means that all
the values of the attributes are present. Similarly, it assumes the dataset is consistent,
which means that unique attribute values lead to unique decision classes. Note that
in our training dataset the value "Student Level = PG" leads to both Registration =
Permitted and Registration = Not-permitted. Such ideal conditions normally do not
exist in datasets, for many reasons. Here we present some additional provisions of
Hunt's algorithm.
1. It may happen that a created child node is empty. Normally this happens when
the training dataset does not contain the combination of attribute values that leads
to this child node. In such cases, the node is declared a leaf node and is assigned
the majority class of the training records associated with its parent.
(Fig. 5.5 Hunt's algorithm applied step by step: (a) a single node labeled Registration
= Not-permitted; (b) a split on Student Level (PG / UG); (c) a further split on Student
Type (self sponsored / partial sponsored / scholarship); (d) a final split on Marks
(< 80 / ≥ 80), whose leaves give Registration = Permitted or Not-permitted.)
2. If the records in Dt have identical attribute values but differ in class labels, then
we cannot split the records further and the node is declared a leaf node. The class
assigned to this node is the one with the maximum occurrences in the training
dataset.
A learning algorithm for inducing decision trees must address the following two
issues.
• The algorithm incrementally builds the classification tree by selecting the next
attribute on which to split the dataset. However, as there are a number of attributes
and each attribute can be used to split the dataset, there should be some criterion
for selecting the attributes.
• There should be some criterion to stop the algorithm. There are two natural possi-
bilities: we can stop when all the records have the same class label, or when all the
records have the same attribute values. There can also be other criteria, such as a
threshold value.
It should be noted that there can be different types of attributes, and splitting on
the basis of these attributes differs in each case. Now we will consider the attribute
types and their corresponding splits.
Binary Attributes: Binary attributes are perhaps the simplest ones. Splitting on
such attributes results in two splits, as shown in Fig. 5.6a.
Nominal Attributes: Nominal attributes may have multiple values, so we can have
as many splits as there are possible values; if an attribute has three values, there may
be three splits. Consider the attribute "Student Type": it can have three possible
values, i.e. "self sponsored", "partial sponsored", and "scholarship based", so the
possible splits are given in Fig. 5.6b. It should be noted that some algorithms, e.g.
CART, produce only binary splits, considering all 2^(k−1) − 1 ways of creating a
binary partition of k attribute values. Figure 5.6c illustrates the binary split of the
"Student Type" attribute, which originally had three values.
Ordinal Attributes: Ordinal attributes have an order between the attribute values;
e.g. a Likert scale may have the values do not agree, partially agree, agree, and
strongly agree. So, we can split an ordinal attribute into two or more splits; however,
the split should not violate the order of the attribute values. Figure 5.7 shows the
possible splits of the ordinal attribute shirt size.
Continuous Attributes: Continuous attributes are those that can take any value
within a specified range, e.g. temperature, height, weight, etc. In the context of
decision trees, such attributes may result in either a binary or a multiway split. For a
binary split, the test condition may be of the form (C < v) or (C ≥ v): there are two
splits, i.e. either the value of attribute "C" is less than the value "v" or it is greater
than or equal to "v". Figure 5.8a shows a typical test condition resulting in a binary
split.
Similarly, we can have test conditions with multiple splits. Such conditions may be
of the form vi ≤ C < vi+1, giving a number of splits over different ranges. Figure 5.8b
shows an example of such a test condition.
There are many measures that can be used to select the best attribute for the next
split. Examples include entropy, information gain, the Gini index, and classification
error.
Entropy(t) = − Σ_{i=0}^{c−1} p(i|t) log2 p(i|t)

Gini(t) = 1 − Σ_{i=0}^{c−1} [p(i|t)]^2
Here are some examples of class distributions and the corresponding measure values
that can be used to compare attributes; a small sketch computing them follows.
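A minimal sketch computing the impurity measures above for a few example class distributions at a node t; the distributions are illustrative.

```python
# Entropy and Gini index for a node given its class probability distribution.
from math import log2

def entropy(p):
    return -sum(pi * log2(pi) for pi in p if pi > 0)

def gini(p):
    return 1 - sum(pi ** 2 for pi in p)

for dist in [(1.0, 0.0), (0.5, 0.5), (0.9, 0.1)]:
    print(dist, "entropy =", round(entropy(dist), 3), "gini =", round(gini(dist), 3))
```

A pure node (1.0, 0.0) gives zero impurity under both measures, while the evenly mixed node (0.5, 0.5) gives the maximum.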
A decision tree can suffer from three types of errors: training errors, i.e. the misclas-
sifications the tree makes on the training dataset; test errors, i.e. the misclassifica-
tions made on the test dataset; and generalization errors, i.e. the errors made on
unseen data. The best classification model is the one that avoids all of these types of
errors. However, for small trees we may have high training and test errors; we call
this model underfitting, because the tree is not trained well enough to generalize to
all possible examples, mainly due to insufficient training data. Model overfitting, on
the other hand, results when the tree becomes too large and complex. In this scenario,
although the training error decreases, the test error may increase. One reason is that
the training dataset may contain noise, and classes accidentally assigned to data points
may be learned by the tree, resulting in misclassification of unseen data. Model
overfitting can result from many factors; here we will discuss a few.
Overfitting can be due to noise in the training dataset. For example, consider Tables
5.5 and 5.6, representing the training and test datasets. Note that the training dataset
has two misclassified records representing noise. The tree built from this training
dataset will have no training error, but when the test dataset is run, the incorrect class
assignments learned from the noise will result in misclassification of test records.
Figure 5.9 shows these classification trees.
Similarly, models developed from smaller training datasets may also overfit, mainly
because of the insufficient number of representative samples in the training data.
Such models may have zero training error, but due to the immaturity of the classifi-
cation model the test error rate may be high; e.g. consider the dataset given in
Table 5.7, whose few training samples result in classification errors on the test dataset
given in Table 5.6. Figure 5.10 shows the resulting classification tree.
5.2.3 Entropy
Now we will provide details of the measure called information gain and explain how
it can be used to develop a decision tree. But before we dwell on this measure, let us
discuss what entropy is.
Entropy defines the purity/impurity of a dataset. Suppose a dataset Dt has a binary
classification (e.g. positive and negative classes); the entropy of Dt can be defined as:

Entropy(Dt) = −p+ log2(p+) − p− log2(p−)

Where:
p+ is the proportion of positive examples in Dt
p− is the proportion of negative examples in Dt.
Suppose Dt is a collection of 14 data points with binary classification, including
9 positive and 5 negative examples; then the entropy of Dt will be:

Entropy(Dt) = −(9/14) log2(9/14) − (5/14) log2(5/14) ≈ 0.940
It should be noted that entropy is zero (0) when all data points belong to a single
class, and entropy is maximum, i.e. 1, when the dataset contains an equal number of
examples of both classes in a binary classification.
Given that entropy defines the impurity of a dataset, we can define the measure
information gain (IG) as follows:

IG(D, X) = Entropy(D) − Σ_v (|Dv| / |D|) Entropy(Dv)

where the sum runs over the values v of attribute X and Dv is the subset of records
for which X takes the value v.
(Figs. 5.9 and 5.10 Decision trees induced from the training data: the trees split on
"Has garden", then on "Home type" (single story / double story), "Taxation"
(tax payer / non-tax payer), and "Electric system" (grid / solar), with leaves
Registration = Permitted or Not-permitted.)
Now we will calculate the IG of each attribute using this entropy; the attribute having
the maximum information gain will be placed at the root of the tree.
In the information gain formula given above, |Dx1| denotes the number of records
for which attribute "X" takes the value "x1". So, in D:
|D| = 5
|Dx1| = 1
In our example:
E(Dx1 ) = 0
E(Dx2 ) = 1
E(Dx3 ) = 0
IG(D, X ) = 0.57
IG(D, Y ) = 0.02
IG(D, Z ) = 0.02
This means that attribute "X" has the maximum information gain, so it will be
selected as the root node. Figure 5.11 shows the resulting classification tree.
The tree successfully classifies three of the five samples; it correctly classifies the
samples D1, D3, and D4. For the remaining samples we perform another iteration,
considering the other two attributes and only the remaining two samples.
IG(D’, Y ) = 0
IG(D’, Z ) = 1
So, here attribute Z has the highest information gain, so we will consider it as the
next candidate on which to split for the remaining two records. The decision tree will
be as follows (Fig. 5.12).
Now all the records have been successfully classified, so we will stop our iterations,
and the decision tree in Fig. 5.12 will be our final output.
Now we will discuss some other classification techniques.
Now we will discuss some other classification techniques.
5.3 Regression Analysis
In many problems the value of a dependent variable can be expressed as a function
of an independent variable, for example:
Y = X² + 1
Now coming to regression analysis: it is one of the most common techniques used
for analyzing the relationship between two or more variables, i.e. how the value of an
independent variable affects the value of the dependent variable. The main purpose of
regression analysis is to predict the value of the dependent variable for some unknown
data. The value is predicted on the basis of a model developed using the available
training data. Overall, regression analysis provides us with the following information:
(1) What is the relation between the independent variables and the dependent
variable? Note that independent variables are also called predictors and dependent
variables are called outcomes.
(2) How good are the predictors at predicting the outcome?
(3) Which predictors are more important than others in predicting the outcome?
There are different types of regression analysis:
• Simple linear regression
• Multiple linear regression
• Logistic regression
• Ordinal regression
• Multinomial regression.
For the sake of this discussion we will confine ourselves to simple linear regression.
Simple Linear Regression: Simple linear regression is the simplest form of
regression analysis, involving one dependent variable and one independent variable.
The aim of the regression process is to model the relation between the two.
The model of simple linear regression is a straight line passing through the points
in a two-dimensional space. The analysis process tries to find the line that is closest
to the majority of the points in order to increase accuracy. The model is expressed
in the form:
Y = a + bX
Here:
Y —Dependent variable
X—Independent (explanatory) variable
a—Intercept
b—Slope.
If you remember, this is the equation of a simple straight line, i.e. Y = mx + c. Now
we will explain it with the help of an example. Consider the simple dataset comprising
two variables, "height" and "weight", shown in Table 5.10.
Now we have to predict the weight of a person with height 5.4. First we draw the
first three points in a two-dimensional space, as shown in Fig. 5.13.
Using the model discussed above, we try to predict the weight of the person with
height 5.4. If we take a = 6 and b = 7, the predicted weight of the person will be
6 + 7 × 5.4 = 43.8 kg, as shown in Fig. 5.14. A small least-squares sketch is given
below.
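A minimal sketch of fitting the line by least squares with NumPy; the height/weight numbers are made up (Table 5.10 is not reproduced here) but are chosen to be consistent with the intercept a = 6 and slope b = 7 used in the text.

```python
# Fit Y = a + bX by least squares and predict the weight for height 5.4.
import numpy as np

height = np.array([5.0, 5.2, 5.6])        # independent variable X
weight = np.array([41.0, 42.4, 45.2])     # dependent variable Y

b, a = np.polyfit(height, weight, 1)      # slope b and intercept a of Y = a + bX
print("a =", round(a, 2), "b =", round(b, 2))
print("Predicted weight for height 5.4:", round(a + b * 5.4, 1), "kg")
```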
5.4 Support Vector Machine
Support vector machines (SVM) are another classification technique, commonly used
for classifying data in an N-dimensional space. The aim of the classifier is to identify
the maximal-margin hyperplane that separates the data. Before discussing support
vector machines, let us discuss some prerequisite concepts.
Hyperplane: A hyperplane is a decision boundary that separates two or more
decision classes; data points that fall on a certain side of the hyperplane belong to
a particular decision class. If the points are described by two features, the hyperplane is just
a line; with three features it becomes a two-dimensional plane, and in general it is an
(N − 1)-dimensional surface. A linearly separable dataset is one whose data points
can be separated by a simple linear (straight) boundary. Consider the figure showing
a dataset comprising two decision classes, "rectangle" and "circle": Fig. 5.15a shows
three hyperplanes H1, H2, and H3.
Note that each hyperplane can classify the dataset without any error. However, the
selection of a particular hyperplane depends upon its margin. Each hyperplane Hi is
associated with two more hyperplanes, hi1 and hi2, obtained by drawing two parallel
hyperplanes, one close to each decision class. In Fig. 5.15b, the two dotted lines h21
and h22 are the parallel hyperplanes for H2. Parallel hyperplanes are obtained such
that each one is close to one decision class. Figure 5.16 shows the hyperplanes drawn
close to the filled circle and rectangle.
Support Vectors: Points on the other side of the hyperplane belong to the opponent
class. Support vectors are the points of each class that lie closest to the opponent
class; they are the points that actually determine the position of the margin lines. In
the figure above, the points represented by the filled circle and rectangle are the
support vectors, as each of them is closest to the opponent class.
Selection of hyperplane: Although a number of hyperplanes may exist, we select
the hyperplane with the maximum margin. The reason is that such hyperplanes tend
to be more accurate on unknown data than hyperplanes with a small margin. The
margin is the distance between the two margin lines of a hyperplane, so:
D = d1 + d2
For a point Ps on the circle side of the hyperplane we have W · Ps + b = h with h < 0,
while points on the square side give a positive value. So, if we label all the squares
with S and the circles with C, then the model for prediction is:

f(x) = C if W · x + b < 0, and f(x) = S if W · x + b > 0
As we can see, the above data points cannot be classified using a linear SVM. What
we can do is convert the data into a higher-dimensional feature space by introducing
a new dimension, called the Z feature, as follows:
Z = x² + y²
Now the dataset can clearly be classified using a linear SVM. If the separating
hyperplane lies at "k" along the Z-axis, then:
k = x² + y²
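A minimal sketch of this lifting idea, assuming scikit-learn; the circular toy data is generated for illustration, and adding z = x² + y² explicitly is one way to make the classes linearly separable.

```python
# Points inside a circle vs outside cannot be separated by a line in (x, y),
# but adding the feature z = x^2 + y^2 makes a linear SVM work.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)   # circle / non-circle labels

linear_acc = SVC(kernel="linear").fit(X, y).score(X, y)

Z = np.c_[X, X[:, 0] ** 2 + X[:, 1] ** 2]             # explicit z = x^2 + y^2 feature
lifted_acc = SVC(kernel="linear").fit(Z, y).score(Z, y)

print("Linear SVM in (x, y):    accuracy =", round(linear_acc, 2))
print("Linear SVM with z added: accuracy =", round(lifted_acc, 2))
```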
5.5 Naive Bayes
The naïve Bayes classifier is a probabilistic classifier that predicts the class of data
based on previous data by using probability measures. It assumes conditional
independence between every pair of features given the value of the class variable.
The technique can be used both for binary and multiclass classification.
The model of the classifier is given below:
P(c|x) = P(x|c) P(c) / P(x)
Here:
P(c|x) is the probability of class c given feature x, P(x|c) is the probability of feature
x given class c, P(c) is the prior probability of class c, and P(x) is the probability of
feature x.
Listing 5.7 below shows the pseudocode of the algorithm.
Now we will explain it with an example. Consider the dataset below, which shows
the result of each student in a particular subject. We need to verify the following
statement:
A student will get high grades if he takes "mathematics".
Here we have P(Math | High) = 3/9 = 0.33, P(Math) = 5/14 = 0.36, and P(High) =
9/14 = 0.64.
Now, P(High | Math) = P(Math | High) × P(High) / P(Math) = 0.33 × 0.64 / 0.36 ≈ 0.60.
So it is likely that the student will get high grades with mathematics. Table 5.11a–c
shows the dataset, frequency, and likelihood tables for this example.
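A minimal sketch reproducing the Bayes computation above in code; the probabilities are those quoted in the text.

```python
# P(High | Math) = P(Math | High) * P(High) / P(Math)
p_math_given_high = 3 / 9     # P(Math | High)
p_math = 5 / 14               # P(Math)
p_high = 9 / 14               # P(High)

p_high_given_math = p_math_given_high * p_high / p_math
print("P(High | Math) =", round(p_high_given_math, 2))   # about 0.60
```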
Now we will discuss some advantages and disadvantages of the naïve Bayes
approach.
Advantages:
• Simple and easy to implement
• It is a probability-based approach that can handle both continuous and discrete
features
• Can be used for multiclass prediction
• It assumes that features are independent, so it works well for datasets where this
assumption holds
• Requires less training data.
Disadvantages:
• Assumes that features are independent, which may not hold in the majority
of datasets
• May have low accuracy
• If a categorical feature value never appears with a class in the training data, the
model assigns it zero probability and consequently we will be unable to make a
prediction (this is usually handled by smoothing techniques such as Laplace
smoothing).
5.6 Artificial Neural Networks
Scientists have always been interested in making machines that can learn just as
humans do. This resulted in the development of artificial neural networks. The first
neural network was the "perceptron". In this section we will discuss the multilayer
perceptron with backpropagation.
A simple neural network model is shown in Fig. 5.19.
As shown, a simple neural network model comprises three layers, i.e. an input
layer, a hidden layer, and an output layer.
The input layer comprises the n features given as input to the network; as shown in
the diagram above, the input comprises the values {x0, x1, x2, …, xm}. Each input
is multiplied by the corresponding weight. The weight determines how important the
input is in the classification. All the inputs are provided to the summation function,
which is shown below:
Y = Σ_{i=0}^{m} wi xi + bias
The value of Y is then provided to the activation function, which enables the neuron
to be activated based on the summation value. One of the most common activation
functions is the sigmoid:
z = Sigmoid(Y) = 1 / (1 + e^(−Y))
The sigmoid activation function returns a value between zero and one, as shown in
Fig. 5.20.
A value above 0.5 activates the neuron and a value below 0.5 does not. The sigmoid
is the simplest activation function; there are many others as well, such as the rectified
linear unit (ReLU), but we will not discuss them here.
Loss function: The loss function determines how much the predicted value differs
from the actual value. This function is used as the criterion to train the neural
network: we calculate the loss function again and again using backpropagation
and adjust the weights of the neural network accordingly. The process continues until
there is no further decrease in the loss function. There are multiple loss functions;
here we will discuss only the mean squared error (MSE).
MSE is the average squared difference between the predicted value and the
true value. Mathematically:
MSE = (1/m) Σ_{i=1}^{m} (yi − ŷi)²
Here yi is the actual output and ŷi is the predicted output. We calculate the
loss function and adjust the weights accordingly in order to minimize the loss. For this
purpose, we use backpropagation: we calculate the derivative of the loss function with
respect to the weights, starting from the last layer.
Consider the neural network model shown in Fig. 5.21.
Here O11 represents the output of the first neuron of the first hidden layer and O21
represents the output of the first neuron of the second layer. With backpropagation,
we adjust w11^3 as follows:

w11^3(new) = w11^3(old) − η ∂L/∂w11^3

Now ∂L/∂w11^3 can be expanded using the chain rule as follows:

∂L/∂w11^3 = (∂L/∂O31) × (∂O31/∂w11^3)
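A minimal sketch of one forward pass and one gradient-descent weight update for a single sigmoid neuron with squared error, using NumPy; the input values, weights, and learning rate are made up for illustration.

```python
# Forward pass: weighted sum -> sigmoid; backward pass: chain rule
# dL/dw = dL/dz * dz/dY * dY/dw, then a gradient-descent update.
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

x = np.array([0.5, 0.2, 0.1])          # inputs x0, x1, x2
w = np.array([0.4, -0.3, 0.8])         # weights
bias, target, eta = 0.1, 1.0, 0.5      # bias, true output, learning rate

y_sum = np.dot(w, x) + bias            # weighted summation Y
z = sigmoid(y_sum)                     # activation
loss = (target - z) ** 2               # squared error for this single example

grad_w = -2 * (target - z) * z * (1 - z) * x   # chain rule applied to each weight
w = w - eta * grad_w                           # gradient-descent update of the weights

print("prediction:", round(z, 3), "loss:", round(loss, 3), "updated weights:", np.round(w, 3))
```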
to provide a framework that will enable the reader to recognize the assumptions
and constraints that are implicit in all such techniques.
• Inductive Inference for Large Scale Text Classification: Kernel Approaches and
Techniques (Authors: Catarina Silva, Bernadete Ribeiro)
Book gives a concise view on how to use kernel approaches for inductive inference
in large scale text classification; it presents a series of new techniques to enhance,
scale and distribute text classification tasks.
5.8 Summary
In this chapter we discussed the concept of classification in depth. The reason is that
the classification tree is one of the most commonly used classifiers. All the concepts
necessary for developing a classification tree were presented using examples. It
should be noted that most of the concepts related to classification trees also apply to
other classification techniques.
Chapter 6
Clustering
Clustering is the process of dividing objects and entities into meaningful and logi-
cally related groups. In contrast with classification, where we already have labeled
classes in the data, clustering involves unsupervised learning, i.e. we do not have any
prior classes; we simply collect similar objects into the same groups. For example,
all fruits with yellow color and a certain length may be placed in one group, perhaps
the group of bananas. Just like classification, clustering is another important technique
used in many sub-domains encompassed by data science, such as data mining, machine
learning, and artificial intelligence.
Figure 6.1 shows some sample clusters of a dataset.
In this chapter we will discuss the concept of a cluster, cluster analysis, and different
algorithms used for clustering.
A cluster is a group of data objects that share some common properties. For example, in Fig. 6.1 two clusters are shown: all the objects in the left cluster are closed shapes, whereas all the objects in the right-most cluster are open shapes. Cluster analysis is the process of grouping objects on the basis of their properties and their relationships with each other, and studying their behavior to extract information for analysis purposes. Ideally, all the objects within a cluster have similar properties, which differ from the properties of the objects in other groups.
However, it should be noted that the idea of dividing objects into groups (clusters) may vary from application to application. A dataset that is divided into two clusters may also be divided into more clusters at a more refined level; e.g. consider Fig. 6.2, where the same dataset is divided into four and six clusters.
Clustering can also be called unsupervised classification in that it also classifies objects (into clusters), but here the classification is done on the basis of the properties of the objects themselves, and we do not have any predefined model developed from training data. It should be noted that terms like partitioning or segregation are sometimes used as synonyms for clustering, but technically they do not capture the true meaning of the actual clustering process.
6.2 Types of Clusters
Clusters can be of different types based on their properties and how they are developed. Here we will provide some detail on each of these types.
We may also have other cluster types, such as fuzzy clusters. In fuzzy set theory an object belongs to a set to some degree, defined by its membership function. So the objects in fuzzy clusters belong to all clusters to a degree defined by the membership function, the value of which lies between zero and one.
Similarly, we may have complete and partial clustering. In complete clustering every object is assigned to some cluster, whereas in partial clustering some objects may not be. The reason is that such objects may not have well-defined properties that would assign them to a particular cluster. Note that we will discuss only partitional and hierarchical clusters.
There are some other notions used for different types of clusters. Here we will discuss a few of them.
Since the objects in a cluster share common properties, well-separated clusters are those in which the objects in one group (cluster) are closer to each other than to the objects in any other cluster. For example, if we use the distance between objects as the proximity measure to determine how close they are, then in well-separated clusters the objects within a cluster are closer to one another than to objects in other clusters. Figure 6.1 is an example of two well-separated clusters.
6.2.3.2 Prototype-Based
Sometimes we need one object from each cluster to represent it. This representative object is called the prototype, centroid, or medoid of the cluster, and it reflects the properties of the other objects in the same cluster. So, in prototype-based clusters, all the objects in a cluster are more similar to the center (centroid) of their own cluster than to the center (centroid) of any other cluster. Figure 6.4 shows four center-based clusters.
In contiguous clusters, objects are connected to each other because they lie within a specified distance of one another: an object in a cluster is closer to some other object in the same cluster than to any object in a different cluster. This may lead to irregular and intertwined clusters. However, it should be noted that if there is noise in the data, two different clusters may appear as one cluster, because a bridge between them may appear due to the presence of noise. Figure 6.5 shows examples of contiguous clusters.
In density-based clustering, a cluster is formed where the objects in a region are dense (much closer to each other) compared to the objects outside that region. The area surrounding a density-based cluster is less dense, i.e. it contains fewer objects, as shown in Fig. 6.6. This type of clustering is also used when clusters are irregular or intertwined.
Finally, there are shared-property clusters, also called conceptual clusters. Clusters of this type are formed by objects having some common property. Figure 6.7 shows an example of shared-property, or conceptual, clusters.
6.3 K-Means
Now we will discuss one of the most common clustering algorithms, called K-means. K-means is a prototype-based clustering technique, i.e. it creates prototypes and then arranges objects into clusters according to those prototypes. It has two versions, K-means and K-medoid. K-means defines prototypes in terms of centroids, which are normally the mean points of the objects in a cluster. K-medoid, on the other hand, defines the prototypes in terms of medoids, which are the most representative points of the objects in a cluster.
Algorithm 6.1 shows the pseudocode of the K-means clustering algorithm. Here K is the number of centroids, which in turn defines the number of clusters that will result. The value of K is specified by the user.
The algorithm initially selects K centroids and then assigns each object to the centroid closest to it. Once all the objects are assigned to their closest centroids, the algorithm starts the second iteration and the centroids are updated. The process continues until the centroids do not change any more, i.e. no point changes its cluster. Figure 6.8 shows the working of the algorithm.
In the figure, the algorithm completes its operation in four steps. In the first step, the points are assigned to the corresponding centroids. Note that here the value of K is three, so three centroids are identified; centroids are represented by the "+" symbol. Once the objects are assigned to centroids, in the second and third steps the centroids are updated according to the points assigned to them. Finally, the algorithm terminates after the fourth iteration because no more changes occur. This was a generic introduction to the K-means algorithm. Now we will discuss each step in detail.
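As a quick illustration of the procedure just described, the following minimal sketch (assuming scikit-learn is installed; the data points are invented) runs K-means with K = 3 and prints the resulting centroids and cluster assignments:

import numpy as np
from sklearn.cluster import KMeans

# Toy two-dimensional dataset (illustrative values only)
points = np.array([[1, 2], [1, 4], [2, 3],
                   [8, 8], [9, 9], [8, 9],
                   [0, 10], [1, 11], [0, 12]])

# K = 3 centroids; n_init random initializations are tried and the best result is kept
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)

print("Centroids:", kmeans.cluster_centers_)
print("Cluster labels:", kmeans.labels_)
print("SSE (inertia):", kmeans.inertia_)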
As discussed above, centroids are the central points of a cluster. All the objects in
a cluster are closer to their central points as compared to the centroids of the other
clusters. To identify this closeness of an object, we normally need some proximity measure. One of the most commonly used measures is the Euclidean distance. According to the Euclidean distance, the distance between two points in the plane with coordinates (x, y) and (a, b) is given by:

dist((x, y), (a, b)) = √((x − a)² + (y − b)²)
As an example, the (Euclidean) distance between the points (2, −1) and (−2, 2) is found to be:

dist((2, −1), (−2, 2)) = √((2 − (−2))² + ((−1) − 2)²)
                       = √((4)² + (−3)²)
                       = √(16 + 9)
                       = √25
                       = 5
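The same calculation can be wrapped in a small helper function; a quick sketch in plain Python (the function name is chosen only for illustration):

import math

def euclidean_distance(p, q):
    """Euclidean distance between two points given as (x, y) tuples."""
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

print(euclidean_distance((2, -1), (-2, 2)))   # prints 5.0, as in the worked example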
Once the objects are assigned to centroids in all the clusters, the next step is to re-compute or adjust the centroids. This re-computation is done on the basis of some objective function that we need to maximize or minimize. One such function is to minimize the squared distance between the centroid and each point in the cluster. The choice of a particular objective function depends on the proximity measure used to calculate the distance between points. The objective "minimize the squared distance between the centroid and each point" can be realized in terms of the sum of squared error (SSE). For the SSE, we find the Euclidean distance of each point from its nearest centroid and then compute the total. We prefer the clustering for which the sum of squared error (SSE) is minimum. Mathematically:
SSE = Σ_{i=1}^{n} Σ_{x∈C_i} dist(c_i, x)²

where the centroid c_i of the ith cluster is the mean of its points:

c_i = (1/m_i) Σ_{x∈C_i} x

Here:
x = an object.
C_i = the ith cluster.
c_i = centroid of cluster C_i.
c = centroid of all points.
m_i = number of objects in the ith cluster.
m = number of objects in the dataset.
n = number of clusters.
Similarly, if cosine similarity is used as the proximity measure (e.g. for document data), the objective becomes maximizing the total cohesion:

Total Cohesion = Σ_{i=1}^{n} Σ_{x∈C_i} cosine(c_i, x)

A number of proximity functions can be used, depending on the nature of the data and the requirements. Table 6.2 shows some sample proximity measures.
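As a rough sketch (NumPy assumed; the arrays below are invented), the SSE of a given clustering can be computed directly from its points, labels, and centroids:

import numpy as np

def sse(points, labels, centroids):
    """Sum of squared Euclidean distances of each point to its assigned centroid."""
    total = 0.0
    for i, c in enumerate(centroids):
        members = points[labels == i]          # points assigned to cluster i
        total += np.sum((members - c) ** 2)    # squared distances to centroid c
    return total

# Example usage with made-up values
points = np.array([[1.0, 2.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[1.5, 1.5], [8.5, 8.5]])
print(sse(points, labels, centroids))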
When we use the SSE as the objective, it may increase due to outliers. One strategy is to increase the value of K, i.e. produce more clusters, so that points lie closer to their centroids and the SSE decreases. Here we will discuss some strategies that can reduce the SSE by increasing the number of clusters, and complementary strategies that decrease the number of clusters again.
We can split a cluster. Different strategies can be used to choose which one: we can split the cluster having the largest SSE value, or, alternatively, the standard deviation measure can be used, e.g. the cluster having the largest standard deviation with respect to a particular attribute can be split.
We can introduce a new cluster centroid. A point far from any cluster center can be used for this purpose; however, to select such a point, we have to calculate the SSE contribution of each point.
We can disperse a cluster, i.e. remove its centroid and reassign its points to other clusters. To disperse a cluster, we select the one whose removal increases the total SSE the least.
We can merge two clusters. For this, we choose the clusters that have the closest centroids; we can also choose the two clusters whose merging results in the smallest increase in the total SSE.
6.5 Bisecting K-Means
Bisecting K-means is a hybrid approach that uses both K-means and hierarchical clustering. It starts with one main cluster comprising all the points and keeps splitting clusters into two at each step. The process continues until we obtain a prespecified number of clusters. Algorithm 6.2 gives the pseudocode of the algorithm.
We will now explain the algorithm with a simple and generic example. Consider the dataset P = {P1, P2, P3, P4, P5, P6, P7}, which we will cluster using the bisecting K-means algorithm.
Suppose we want three resulting clusters, i.e. the algorithm will terminate once all the data points are grouped into three clusters. We will use the SSE as the split criterion, i.e. the cluster having the higher SSE value will be split further.
So our main cluster will be C = {P1 , P2 , P3 , P4 , P5 , P6 , P7 } as shown in Fig. 6.9.
Now by applying K = 2, we will split the cluster C into two sub-clusters, i.e. C 1
and C 2 , using K-means algorithm, and suppose the resulting clusters are as shown
in Fig. 6.10.
Now we calculate the SSE of both clusters; the cluster having the higher value will be split further into sub-clusters. Suppose the cluster C1 has the higher SSE value. It will then be split into two sub-clusters as shown in Fig. 6.11.
Now the algorithm stops, as we have obtained the desired number of resulting clusters.
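A minimal sketch of this procedure (assuming scikit-learn's KMeans for the two-way splits and NumPy; the function name, data, and parameter values are illustrative) could look like this:

import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(points, k):
    """Repeatedly split the cluster with the largest SSE until k clusters remain."""
    clusters = [points]                          # start with one all-inclusive cluster
    while len(clusters) < k:
        # pick the cluster with the largest SSE (sum of squared distances to its mean)
        sses = [np.sum((c - c.mean(axis=0)) ** 2) for c in clusters]
        worst = clusters.pop(int(np.argmax(sses)))
        # split it into two sub-clusters using K-means with K = 2
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(worst)
        clusters.append(worst[labels == 0])
        clusters.append(worst[labels == 1])
    return clusters

data = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [9, 8], [15, 1], [16, 2]])
for i, cluster in enumerate(bisecting_kmeans(data, 3)):
    print("Cluster", i, ":", cluster.tolist())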
Hierarchical clustering also has widespread use. After partitional clustering, these techniques are the second most important family of clustering techniques. There are two common methods for hierarchical clustering.
Agglomerative These approaches start by considering each point as an individual cluster and keep merging the closest clusters on the basis of some proximity measure.
Divisive These approaches consider all the data points as a single all-inclusive cluster and keep splitting it. The process continues until we get singleton clusters that cannot be split further.
Normally, hierarchical clusters are shown using a diagram called a dendrogram, which shows the cluster–sub-cluster relationships along with the order in which the clusters were merged. However, we can also use a nested view of the hierarchical clusters. Figure 6.12 shows both views.
There are many agglomerative hierarchical clustering techniques; however, they all work in the same way: each individual point is initially treated as a cluster, and clusters are then merged repeatedly until only one cluster remains. Algorithm 6.3 shows the pseudocode of such approaches:
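For illustration, a brief sketch of agglomerative clustering using SciPy (assuming SciPy is installed; the data and parameter choices are invented):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

points = np.array([[1, 1], [1.5, 1], [5, 5], [5.5, 5], [9, 1]])

# Agglomerative clustering: repeatedly merge the two closest clusters
Z = linkage(points, method='single')   # 'single' uses the closest-pair (MIN) proximity

# Cut the resulting hierarchy to obtain, say, three flat clusters
labels = fcluster(Z, t=3, criterion='maxclust')
print(labels)

# dendrogram(Z) would draw the merge order as a dendrogram (requires matplotlib)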
There are many definitions of density, but we will consider only the center-based density measure. In this measure, the density for a particular point is measured by counting the number of points (including the point itself) within a specified radius called Eps (Epsilon). For example, Fig. 6.14 shows that the number of points within the Eps radius of point P is six, including point P itself.
Algorithm 6.4 shows the pseudocode of the algorithm.
Noise points If a point is neither a core point nor a border point, then it is considered a noise point. In Fig. 6.15, point P4 is a noise point.
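As a quick sketch of density-based clustering in practice (assuming scikit-learn; the point coordinates and the eps/min_samples values are invented):

import numpy as np
from sklearn.cluster import DBSCAN

points = np.array([[1.0, 1.0], [1.2, 1.1], [0.9, 1.0], [1.1, 0.9],   # dense region 1
                   [8.0, 8.0], [8.1, 8.2], [7.9, 8.1],               # dense region 2
                   [4.0, 5.0]])                                      # an isolated point

# eps plays the role of Eps; min_samples is the density threshold for core points
db = DBSCAN(eps=0.5, min_samples=3).fit(points)

# A label of -1 marks noise points; other labels identify the dense clusters
print(db.labels_)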
So far we have discussed different clustering algorithms. Now we will discuss some
generic characteristics of clustering algorithms.
Order Dependence For some algorithms, the quality and structure of the resulting clustering may vary depending on the order in which the data points are processed. Although this is sometimes desirable, in general we should avoid such algorithms.
Non-determinism Many clustering algorithms, such as K-means, require an initial random initialization. Thus the results they produce may vary from run to run. For all such algorithms, we may need multiple runs to obtain an optimal clustering structure.
Scalability It is common to have datasets with millions of data points and attributes. Therefore, the time complexity of a clustering algorithm should be linear or near-linear in order to avoid performance degradation.
Parameter Setting Clustering algorithms should require minimal parameter setting, especially for parameters that must be provided by the user and that significantly affect the result. The fewer parameters that need to be provided, the better.
Treating Clustering as an Optimization Problem Clustering is an optimization
problem, i.e. we try to find the clustering structures that minimize or maximize an
objective function. Normally such algorithms use some heuristic-based approach to
optimize the search space in order to avoid exhaustive search which is infeasible.
Now we will discuss some important characteristics of the clusters.
Data Distribution Some techniques assume that the data follows certain distributions, where each cluster corresponds to a particular distribution. Such techniques, however, require a strong conceptual basis in statistics and probability.
Shape Clusters can be of any shape: some may be regularly shaped, e.g. triangular or rectangular, while others may be irregularly shaped. In a dataset, a cluster may appear in any arbitrary shape.
Different Sizes Clusters can be of different sizes. The same algorithm may produce clusters of different sizes in different runs, based on many factors; one such factor is the random initialization in the case of the K-means algorithm.
Different Densities Clusters may have different densities, and clusters with varying densities commonly cause problems for methods such as DBSCAN and K-means.
Poorly Separated Clusters When clusters are close to each other, some techniques may combine them into a single cluster, which results in the disappearance of the true clusters.
Cluster Relationship Clusters may have particular relationships (e.g. their relative position); however, the majority of techniques normally ignore this factor.
Subspace Clusters Such techniques assume that we can cluster the data using a subset of attributes from the entire attribute set. The problem with these techniques is that using a different subset of attributes may result in different clusters in the same dataset. Furthermore, they may not be feasible for datasets with a large number of dimensions.
The book presents the bi-partial approach to data analysis, which is both uniquely general and enables the development of techniques for many data analysis problems, including related models and algorithms. The book offers a valuable resource for all data scientists who wish to broaden their perspective on basic approaches and essential problems, and thus find answers to questions that are often overlooked or have yet to be solved convincingly.
6.11 Summary
Clustering is an important concept used for analysis in data science applications in many domains, including psychology and other social sciences, biology, statistics, pattern recognition, information retrieval, machine learning, and data mining. In this chapter we provided an in-depth discussion of clustering and related concepts. We discussed different cluster types and the relevant clustering techniques. Efforts were made to explain the concepts through simple examples, especially in pictorial form.
Chapter 7
Text Mining
The major driving factor in the exploration of text data is the availability of textual data in digitized format. From a knowledge discovery perspective, this is similar to knowledge discovery in databases. Text is a rich and natural way to transfer and store information. The Internet is an outstanding example of this: it was estimated that the most popular Internet search engines indexed 3 billion documents in textual format. The availability of data in textual format presents an opportunity to improve decision making by tapping text data sources. Automated ways of discovering new knowledge from these large volumes of text are therefore likely to be sought. This need initiated the development of a new branch of text-based knowledge discovery, which leads toward text data mining.
The key issue in using text data sources to explore knowledge is that the data is unstructured, whereas data in databases is organized in a semantic manner, based on predefined data types, ranges of values, and labels. Text documents, by contrast, are unquestionably more versatile and richer in their expressive power; however, these benefits come with constraints, including the complexity inherent in the vagueness, fuzziness, and uncertainty of any natural language. For this reason the knowledge discovery, or text mining, discipline leverages multiple areas of research in computer science, including artificial intelligence, machine learning, computational linguistics, and information retrieval. What is understood as text mining differs from, and sometimes overlaps with, other fields that handle the computational treatment of text data, such as natural language processing and information retrieval.
A task-oriented perspective on text mining incorporates the following:
The examination of large sets of text data is a valuable method for gaining insights that are not typically possible through manual assessment. This type of investigation can be used to create knowledge, for example by mining relationships across documents belonging to research from different areas, which has led to new hypotheses in the medical domain. The literature contains many approaches, prototypes, and reports of successful systems, and also incorporates data description and data visualization.
Text classification is used to predict the class of a given text document and involves the application of classification techniques to text data. Automatic text classification is used to reduce manual intervention and to accelerate routing. In automatic text classification, text is categorized according to topics; a common example is the automatic routing of support requests according to the type of product they concern. In short, text classification involves the application of classification techniques to text data in order to predict a class for a given document.
The field of text mining opens up prospects for creating new knowledge extracted from text. In decision support, data extracted from textual documents enriches the available data sources and optimizes the information used. For knowledge management systems, large collections of text are available in digital format. Previous studies suggest that knowledge management, competitiveness, and innovation can all benefit from the application of data mining methods to text for generating new knowledge.
In knowledge management, text mining technology can be used as an assistive technology for the automated classification of documents, while text summarization and knowledge organization are used to mitigate information overload.
Text categorization has three main types, depending on how many categories from the total set of categories can be assigned to a text document.
When each document is assigned exactly one category from the given set of categories, this is called single-label text categorization. Note that the total number of available categories can still be more than one.
In multi-label text categorization, each text document may be assigned more than one category from the given pool of categories. A scoring mechanism together with a threshold may be used for document assignment: the generated score, compared against the threshold, determines the categories. For example, a news article may belong to both the entertainment category and the politics category.
In binary text categorization, as the name depicts, only two categories are available, and every document is assigned to one of them. For example, for patients in a cancer hospital there are only two categories, cancer and non-cancer; similarly, items in a newsfeed can simply be categorized as "interesting" or "uninteresting".
In machine learning, sample labeled documents are used to automatically define the classifier; the classifier learned from these labeled documents is then used to classify unseen documents.
In supervised machine learning, labeled data participates in learning, and unseen instances are then classified. The system in this technique holds the knowledge contained in the labeled dataset; it is the most frequently and commonly used technique. The labeled data, also termed training data, comes with different challenges, which include dealing with variations, choosing the right data, and generalization. This technique establishes a complete, concrete process to guide decision making.
When learning is performed on both labeled and unlabeled data, it is called semi-supervised machine learning. It is a technique in which both learning paradigms contribute their strengths: similarity-based learning from the unlabeled data and instructor-based learning from the labeled data. This approach combines the best outcomes of both.
• Document Pivoted
In document-pivoted categorization, for a given document we find all the categories to which it belongs. This approach can be used where the documents arrive in sequential form and the set of categories is stable, as intended for a junk mail system.
• Category Pivoted
In category-pivoted categorization, all the documents that belong to each category are found. When new categories are added to an existing system for which a set of documents is already given, category-pivoted categorization is a suitable method.
In text filtering, text is filtered by categorizing the documents one by one with the help of classifiers. For example, we can filter junk mail by categorizing each message as "spam" or "not spam". Automated text categorization is also used in services where a user subscribes to information such as emails, news articles, scientific papers, and event notifications.
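A rough sketch of such a filter (assuming scikit-learn; the toy messages and labels below are invented purely for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny labeled training set (invented examples)
messages = ["win a free prize now", "cheap offer click here",
            "meeting agenda attached", "lunch tomorrow?"]
labels = ["spam", "spam", "not spam", "not spam"]

# TF-IDF representation of each message followed by a Naive Bayes classifier
spam_filter = make_pipeline(TfidfVectorizer(), MultinomialNB())
spam_filter.fit(messages, labels)

print(spam_filter.predict(["free prize offer", "agenda for the meeting"]))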
• Computational Linguistics
The World Wide Web comprises countless sites on uncountable subjects. Web directories (for example, the Yahoo! directory) are well-known starting points for navigating to an appropriate site. These portals permit a user to browse through a hierarchical structure of categories, narrowing the set of potentially relevant sites with every click. The user can submit a query within a selected category, or can directly open a Web site by clicking the available hyperlinks. Web sites are growing very fast, which makes manual categorization infeasible. Text categorization techniques are therefore used in hierarchical Web page categorization, applying classifiers at each node to remove obsolete categories and add new ones to the hierarchy.
The vector space model is broadly used in text categorization; common operations carried out with it include document categorization, text clustering, and document ranking in search engines. In this model, a document set is represented as a set of vectors: each vector corresponds to a single document, and each component of the vector corresponds to a term in the document.
Consider a document set D that contains documents D_i, i.e. D = {D_1, D_2, ..., D_n}, where each document D_i is represented as a vector of term weights. These weights can be calculated through different approaches, which include term frequency–inverse document frequency (TF-IDF).
A new query is described in the same vector space as the documents, which allows quick similarity comparisons. In a search engine, each query is usually represented by the same form of vector as the documents, allowing such comparisons to be made efficiently. In text categorization, the documents in the training and test sets, as well as the documents to be categorized, are indexed and represented by the same model, usually the vector space model.
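A brief sketch of this idea (scikit-learn assumed; the documents and the query are invented) that represents documents and a query in the same vector space and ranks the documents by cosine similarity:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = ["the cat sat on the mat",
             "dogs and cats are pets",
             "stock markets fell sharply today"]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)    # one TF-IDF vector per document

# Represent the query in the same vector space, then compare it with every document
query_vector = vectorizer.transform(["cat on a mat"])
print(cosine_similarity(query_vector, doc_vectors))  # highest score for the first document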
For information retrieval, terms are allocated different weights based on some weighting scheme. The weight assigned to a term reflects the importance of that term. Text categorization uses different weighting techniques, some of which are given below.
(i) Term Frequency
Term frequency is the simplest term weighting method used in text categorization to measure the importance of a term in a given document. The weight of a term is calculated from the total number of times it appears in the document. Terms that occur very frequently across documents can help recall but have very little discriminative power. In the categorization process, terms with too high a frequency and terms with too low a frequency are generally not very useful.
(ii) Inverse Document Frequency
Inverse document frequency measures the importance of a term across the whole set of documents. It is related to the document frequency of the term: the importance of a term is inversely proportional to the number of documents in which it occurs. The relationship is inverse because a term that occurs in fewer documents has greater discriminative power and is therefore more significant than a term that occurs in a larger number of documents. Suppose a term "t" appears in "n" documents out of a dataset of "N" documents; then the inverse document frequency is calculated as:

idf_t = log(N/n)

Different terms carry different importance for different documents in a given dataset, depending on their discriminative capacity. Term frequency–inverse document frequency calculates the weight of each term while taking all the documents into consideration. Under this scheme, a term "t" is comparatively more important if it occurs frequently in one document and rarely in the other documents. Within a document d, the weight of a term "t" is calculated as:

tf-idf_{t,d} = tf_{t,d} × idf_t
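A small sketch of this calculation in plain Python (the tiny corpus is invented, and no smoothing is applied, unlike some library implementations):

import math

corpus = ["data science uses data",
          "text mining mines text data",
          "clustering groups similar objects"]
docs = [doc.split() for doc in corpus]

def tf(term, doc):
    return doc.count(term)                    # raw term frequency within one document

def idf(term, docs):
    n = sum(1 for d in docs if term in d)     # number of documents containing the term
    return math.log(len(docs) / n)            # idf_t = log(N / n)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)    # tf-idf_{t,d} = tf_{t,d} * idf_t

print(tf_idf("data", docs[0], docs))          # frequent locally, but present in 2 of 3 docs
print(tf_idf("clustering", docs[2], docs))    # rare term, occurs in only one document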
Classifying text documents in raw form would probably not be effective, and we would probably not achieve ideal performance, because of the presence of noise terms, for example extra white space and special characters. It is therefore preferable to preprocess the documents and convert them into a refined form so that optimal performance can be achieved; such preprocessing reduces the number of elements in the input text documents considerably.
(i) Dataset Cleaning
Dataset cleaning is carried out because public datasets sometimes cannot be categorized directly. This happens because of the presence of terms that are not significantly important for the categorization process; such terms affect both effectiveness and performance. To perform the categorization process efficiently and to achieve better results, it is important to clean such datasets. Cleaning removes unnecessary terms, which reduces the size of the vocabulary. Through this process noise terms are reduced, which eases the categorization process and eventually improves performance. In one study, the following steps were carried out to clean the dataset of these types of terms (a short sketch of these steps follows the list).
(i) Replace special characters (e.g. @, _, etc.) with white space
(ii) Replace multiple white spaces with a single one
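A minimal sketch of these two cleaning steps using the standard-library re module (the sample string is invented):

import re

raw = "Special   offer!!! contact us @ sales_dept"

cleaned = re.sub(r"[^A-Za-z0-9\s]", " ", raw)    # (i) replace special characters with white space
cleaned = re.sub(r"\s+", " ", cleaned).strip()   # (ii) collapse multiple white spaces into one

print(cleaned)   # "Special offer contact us sales dept"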
The preprocessing steps to apply depend on the text mining task at hand. The treatment of the source data dictates the characteristics of the model and the information that the model can provide. Hence, matching the preparation steps with the overall objective of the exercise is vital.
One of the primary tools for gathering relevant information from the documents is the development of stop lists. Such lists contain terms that are likely to appear in almost every record and therefore hold no knowledge when trying to spot trends. Basic English words like "the", "and", and "of" are ordinary candidates for a stop list. However, stop words are not always useless; in some cases they become significant for describing a particular scenario, so the list should be built carefully with the specific objective of the data mining task in mind. Stemming and lemmatization purposefully reduce term variation by mapping similar occurrences to a lemma, stem, or canonical form, or by reducing words to their inflectional root. Through this process noisy signals are reduced, the number of attributes that need analysis in a text collection is reduced, and the dataset dimensionality is reduced. For example, stemming maps singular and plural forms to a single form, and present and past tense forms can likewise be merged. The preprocessing phase depends on the objectives of the text mining exercise, and any loss of relevant information due to stemming should be evaluated.
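For illustration, a short stemming sketch (assuming the NLTK package is installed; the word list is invented):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["banana", "bananas", "connected", "connecting", "connection"]:
    print(word, "->", stemmer.stem(word))
# singular/plural and related variations are mapped to a common stem,
# e.g. the "connect*" variants all reduce to the same form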
Noise needs to be removed from the data, and when the environment is uncontrolled the amount of noisy data may increase considerably. Different measures can be adopted to clean this data. When the data is in text form, the issues may include spelling errors, correcting inconsistent spellings, expanding abbreviations, resolving term shortenings, converting between uppercase and lowercase where required, and stripping markup language tags.
Another issue in natural language is that different meanings can be conveyed by the same term, depending on how the words are used within the sentence or on the type of document. Many examples can illustrate this; in the following example, the word "book" is used with different meanings in two different sentences.
• “This is a great book.”
• “You can book your flights from this website.”
When the meaning of a term needs to be discovered, the sentence in which it appears is the means of understanding it. Determining the intended meaning of a term in context is called word sense disambiguation, and it is a topic of research in both machine learning and natural language processing.
Tokenization is the process of segmenting input text into its atomic components. The tokenization approach used for a document depends on the mining objectives. The most common approach is to use individual words as tokens, with punctuation marks and spaces used as separators. The unit of analysis can also be a collection of more than one word: when text sets are examined, statistical measures of co-occurrence produce collocations, which can also be obtained through dictionaries and information extraction techniques.
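A minimal tokenization sketch using individual words as tokens (standard-library re; the sentence is invented):

import re

sentence = "Text mining, like data mining, discovers knowledge from text!"

# Use individual words as tokens; punctuation marks and spaces act as separators
tokens = re.findall(r"[A-Za-z0-9]+", sentence.lower())
print(tokens)
# ['text', 'mining', 'like', 'data', 'mining', 'discovers', 'knowledge', 'from', 'text']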
Opinion mining is a cutting-edge, innovative field of research aimed at gathering opinion-related information from textual data sources. It has many interesting applications covering both academia and commerce. Because it addresses novel intellectual challenges, a considerable amount of research considers it an interesting subject. This section introduces and discusses the opinion mining research field, its challenges, key tasks, and the motivations behind it. It then presents the SentiWordNet lexical resource in detail, including its applications, limitations, and potential advantages.
Information regarding people's opinions can be a really important input for making decisions more correctly in a variety of realms. For instance, businesses have a strong interest in finding out what their consumers think about a new product introduced in a marketing campaign. Buyers, on the other hand, may benefit from reading the thoughts and feedback of others about a given product they intend to buy, since suggestions from other consumers strongly affect buying decisions. Knowledge of the views of other citizens is also important in the political sphere, where one might want to gauge, for instance, the sentiment toward a new law, or toward an individual such as an official or demonstrators.
Recently, the Web has made it feasible for opinions to be found in written text from a spread of sources and on a much more extensive scale. It has likewise made it simpler for individuals to express their perspectives on pretty much every issue through dedicated sites, blogs, discussion forums, and product review pages. Opinion sources are not constrained to specialty review sites; they also include Web users' blog posts and discussion forums, and are embedded in online social networks.
Plainly, the Web is an enormous archive of content produced by its users, much of it devoted to communicating opinions on any subject of concern. Moreover, opinions are generally expressed in the form of text, which makes them rich ground for text mining and natural language processing techniques for the analysis of opinion information.
The clearest use of opinion mining techniques is searching within documents for opinions. Discovering the subjective content related to a subject and its polarity can turn regular search engines into opinion-aware engines, returning results on a given topic that hold only positive or only negative sentiments, for instance when searching for items that have received the best reviews in a specific field, for example a user query for digital cameras with better battery life that have also received good feedback.
On the contrary, information retrieval systems that need to offer factual information on a subject of interest can identify and remove opinion-related information to increase the relevance of the results.
Opinion mining can also be used to classify subjective statements in collaborative environments such as email lists or discussion groups, which may contain inappropriate remarks known as flaming behavior. In online advertising, the technique can be used for ad campaign placement: the placement of an advertisement can be avoided on content that is not related to the advertisement or that carries opinions unfavorable to the product or brand.
Systems that handle customer relationships can also be made more responsive by using sentiment detection as a way to accurately gauge customer satisfaction from feedback. Another example is the automated sorting of customer feedback received through email into messages expressing positive and negative sentiments, which can then be routed automatically to the appropriate teams for taking the necessary corrective measures.
Opinion mining has the potential to add great value in the field of knowledge management, and the examples illustrated above make it clear that it is a value-adding field of research for a wide range of activities in companies around the globe. Knowledge-based systems can extend their query interfaces so that explicitly stored content also carries opinion information, yielding more relevant results; they can likewise exclude subjective documents when more factual results are required. Knowledge management systems need less administration effort when sentiment detection is applied, since unwanted user behavior and flaming can be avoided, and the exchange of tacit and explicit knowledge in collaborative environments can become more fluid. Lastly, knowledge discovery systems can leverage opinion information to create knowledge for an organization and to improve the decision-making process based on relevant user feedback.
according to a spectrum of possible opinions, such as film reviews rated on a scale from zero to five stars.
The use of keyword lists is the most common approach for subjectivity detection and sentiment classification; such lists give an indication of both positive and negative bias and of general subjectivity. Approaches based on word lists do not require training data to make predictions, since they rely on a predefined sentiment lexicon, and they can therefore be applied where no training data is available. For this reason these approaches are considered unsupervised learning methods.
Making word lists manually is a time-consuming procedure. The literature contains approaches that automatically create resources comprising opinion information, building word lists on the basis of available lexicons; this is labeled lexical induction. Other approaches derive opinion information by examining term relationships in WordNet, starting from a few terms assumed a priori to carry opinion information. Common core words of this kind are "poor", "bad", "excellent", and "good".
7.10.1 SentiWordNet
SentiWordNet is one example of a lexical resource intended to aid opinion mining tasks. It aims to supply opinion polarity information at the term level by deriving that information semi-automatically from the relationships in the WordNet database of English terms. For each term in WordNet, SentiWordNet provides a positive and a negative score ranging from 0 to 1, mirroring the term's polarity: higher scores indicate terms that convey strong opinion bias, while lower scores indicate a term that is less subjective.
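For a quick look at these scores, a small sketch (assuming NLTK is installed and the wordnet and sentiwordnet corpora have been downloaded) might look like this:

import nltk
from nltk.corpus import sentiwordnet as swn

# One-time downloads of the required corpora
nltk.download("wordnet")
nltk.download("sentiwordnet")

# Each synset of a term carries a positive, a negative, and an objectivity score
for synset in swn.senti_synsets("excellent"):
    print(synset, synset.pos_score(), synset.neg_score(), synset.obj_score())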
WordNet is a lexical database of the English language in which terms are organized according to their semantic relationships. It has been broadly applied to natural language processing problems, and a comprehensive list of such work is available in the literature.
The WordNet lexicon is the product of work in linguistics and psychology at Princeton University aimed at better understanding the semantic relations between English words. It furthermore provides a full lexicon of the English language in which terms can be retrieved and explored by concept and by their semantic connections.
In its third version, WordNet is accessible as a database, through a Web interface, or by means of an assortment of software APIs, providing a complete database of over 150,000 unique terms organized into more than 117,000 distinct meanings. WordNet has additionally grown through extensions of its structure applied to various other languages.
The main relationship between terms in WordNet is synonymy. Synonymous terms are grouped together into sets of synonyms known as synsets. The main criterion for gathering terms into a synset is whether a term used within a sentence in a particular context can be replaced by another term of the same synset without changing the meaning of the sentence.
Terms are first separated by syntactic category, since nouns, adjectives, verbs, and adverbs are not interchangeable within a sentence. Each synset also contains a short description (gloss) of its terms, which helps in determining the meaning of the different terms; this is particularly valuable for synsets with just a single term or with few relations.
Another relationship between terms in WordNet is antonymy. In the special case of adjectives, there is a distinction between direct and indirect antonyms, i.e. a term can be opposed to another directly or indirectly by means of another conceptual relationship. For example, "wet/dry" and "heavy/light" are direct antonyms, while "heavy/weightless" are conceptually opposed and therefore indirect antonyms: they belong to synsets between which a direct antonym relation exists ("heavy/light"), but they are not directly linked.
A further class of relationship between terms in WordNet is hyponymy, which represents the hierarchical "is-a" type of relation between terms, as in the case of "oak/plant" and "car/vehicle". Another relation is meronymy, which represents the "part-of" relationship between terms.
An attribute type of relation is also present for the special case of adjectives, indicating that an adjective acts as a modifier of a certain generic quality; for example, for the attribute "weight" the adjectives "heavy" and "light" are used as modifiers. This relationship links the noun representing an attribute to the adjectives that modify it.
Building on WordNet's semantic relationships, SentiWordNet derives synset opinion ratings using a semi-supervised approach in which only a small set of synset terms, known as the paradigmatic words, is manually labeled, and the rest of the database is labeled using an automated process. The procedure is outlined below:
1. Paradigmatic words extracted from the WordNet-Affect lexical resource are manually labeled as positive or negative according to their opinion polarity.
2. Iteratively extend each label set by inserting WordNet terms that are linked to already labeled terms by a relation considered to preserve term orientation. To expand the labels, the following relationships are used:
a. Direct antonym
b. Attribute
c. Hyponymy (pertains-to and derived-from), also-see, and similarity.
3. For the newly added terms, according to the direct antonym relation, add the terms representing the directly opposite opinion orientation to the opposite label.
4. Repeat steps 2 and 3 for a fixed number of iterations K.
On completion of steps 1–4, a subset of WordNet synsets is labeled positive or negative. To assess scores for all the remaining terms, a committee of classifiers is trained on the synset glosses, i.e. the textual definitions that WordNet provides for each synset meaning. The process then continues by classifying new entries according to this training data and computing an aggregated value, as described below:
5. Each synset labeled in steps 1–4 is converted into a word vector representation with its negative/positive label. A committee of classifiers can be trained on this dataset:
(a) A pair of classifiers is trained to make the following predictions: positive/non-positive and negative/non-negative. Synsets that belong to neither the positive nor the negative label set are not present in the training set and are allocated to the "objective" class, with zero-valued positive and negative scores.
(b) The process is repeated with training sets of different sizes, obtained by changing the value of K in the last step: 0, 2, 4, 6, and 8.
(c) The support vector machine and Rocchio classification algorithms are used for each training set.
6. When the set of classifiers is applied to new terms, each resulting classifier returns a prediction score. These scores are summed and normalized to 1.0 in order to produce the final positive and negative scores for a word.
The method outlined above for building SentiWordNet highlights the dependence of the term scores on two distinct factors. Firstly, the choice of paradigmatic words that will produce the entire set of positive and negative scores must be carefully examined, since the extension of scores to the remainder of the WordNet terms relies on this core set of terms for making scoring decisions. Secondly, the machine learning stage of the process relies on the textual interpretations of synsets, or glosses, to determine the similarity of a new term to positive or negative words.
SentiWordNet is a lexical tool used for opinion mining that can be beneficial in certain situations. In opinion mining, term-level sentiment information has gained enormous research attention. SentiWordNet can be used as a substitute for manually constructed sentiment lexicons, which are often built on an individual basis for specific opinion mining tasks from sources such as Internet pages, official documents such as laws and regulations, books and newspapers, and the social Web.
7.12 Summary
Text mining involves the digital processing of text for the production of novel information and incorporates technologies from natural language processing, machine learning, computational linguistics, and knowledge analysis. Applications of text mining to information exploration were discussed on the basis of exploratory research and other conventional data mining techniques. Opinion mining was introduced as a new research area: a modern field of research using components of text mining, natural language processing, and data mining, with a great range of feasible applications for deriving opinions from documents, as mentioned in this chapter. These range from developing business intelligence in organizations to recommender technologies, more effective online advertising, spam detection, and information management frameworks. It has been shown that opinion mining can help knowledge management programs explicitly, by raising the quality of information archives through opinion-aware applications, and indirectly, by incorporating data derived from textual data sources and thus providing further incentives for knowledge development within the business. Finally, WordNet and SentiWordNet were introduced, along with their potential uses and a presentation of their building blocks. SentiWordNet is a well-known extension of the popular WordNet database of terms and relationships; it is a freely available lexical source of sentiment information. It can be used in opinion mining research, where a number of compatible approaches have otherwise been developed in an ad hoc style. SentiWordNet is one of the most important and prominent components discussed in this chapter. The next chapter discusses the strengths and function of this tool in depth, taking into account the complexities of opinion mining discussed here. The final outcome is the implementation of a set of features that incorporate sentiment knowledge derived from SentiWordNet and can be applied to sentiment classification problems.
Text classification, which is also known as text categorization, was also introduced in this chapter. The reader has seen different approaches for text document categorization, document representation schemes, and text document preprocessing mechanisms. Text categorization is the process of assigning a text document to a group from a list of specified categories on account of its contents.
Chapter 8
Data Science Programming Languages
In this chapter we will discuss two programming languages commonly used for data science projects, i.e. the Python and R programming languages. The reason is that a large community uses these languages and there are a lot of libraries available online for them. First Python will be discussed, and in the later part we will discuss the R programming language.
8.1 Python
Python is one of the most common programming languages for data science projects. A number of libraries and tools are available for coding in Python. Python is open source; its source code is available under the OSI-approved, GPL-compatible Python Software Foundation License.
To understand and test the programs that we will learn in this chapter, we need the following resources and tools.
Python: https://www.python.org/downloads/
Python Documentation: https://www.python.org/doc/
PyCharm, an IDE for Python: https://www.jetbrains.com/pycharm/download/
Installation of these tools is very simple; you just need to follow the instructions given on screen during installation.
After installation we are ready to write our first Python program.
Listing 8.1: First Python program
1. print ("Hello World!");
Like other programming languages, Python has some reserved words which cannot be used for user-defined variables, functions, arrays, or classes. Some of Python's reserved words are given in Table 8.1.
Unlike many other programming languages, Python does not use braces "{" and "}" for decision-making and looping statements, function definitions, and classes. Blocks of code are denoted by line indentation. The number of spaces in the indentation can vary, but all statements within a block must be indented by the same amount.
Normally a statement in Python ends at the end of the line, but when we want to spread a statement over multiple lines, Python allows us to use the line continuation character "\" (shown below in combination with the "+" operator for string concatenation). The following example demonstrates the use of the line continuation character.
Listing 8.2: Line continuation in Python
1. name = "Testing " + \
2. "Line " + \
3. "Continuation"
4. print (name);
If the statements enclosed in brackets like (), {}, and [] are spread into multiple
lines, then we do not need to use line continuation character.
Listing 8.3: Statements enclosed in brackets
1. friends = ['Amber', 'Baron', 'Christian',
2. 'Crash', 'Deuce',
3. 'Evan', 'Hunter', 'Justice', 'Knight']
4. print (friends[0]+" and " + friends[5] + " are friends")
Python allows different types of quotation marks for strings. Single (') and double (") quotes can be used interchangeably, but if one type of quotation mark occurs inside the string, then we should enclose the string with the other type. Triple quotations, of either the single (') or the double (") type, allow the string to contain line breaks, i.e. to span multiple lines.
Listing 8.4: Different types of quotations in Python
1. word = "Don't"
2. sentence = 'Testing double (") quotation'
3. paragraph1 = """Testing paragraph
4. with triple double (") quotations"""
5. paragraph2 = '''Testing paragraph
6. with triple single (') quotations'''
7.
8. print (word)
9. print (sentence)
10. print (paragraph1)
11. print (paragraph2)
Any line in Python starting with the # sign is considered a comment. If the # sign appears inside a string literal, then it is not treated as a comment. The Python interpreter ignores all comments.
For multi-line comments in Python we can use three single (') quotation marks at the start and end of the comment.
Listing 8.6: Multi-line comments in Python
1. '''
2. These are multi-line comments
3. span on
4. multiple lines
5. '''
6. print ("Testing multi-line comments")
In Python, we declare a variable by assigning a value to it. Variables in Python are loosely typed, which means we do not need to specify the type of a variable while declaring it; the type is decided by the value assigned to it. For assignment we use the equals (=) sign: the operand on the left of the = sign is the name of the variable, and the operand on the right is its value.
Listing 8.7: Variables in Python
1. age = 25 # Declaring an integer variable
2. radius =10.5 # Declaring a floating point variable
3. name = "Khalid" # Declaring a string
4.
5. print (age)
6. print (radius)
7. print (name)
The data saved in a variable or memory location can be of different types. For example, we can save the age, height, and name of a person: the type of age is integer, the type of height is float, and the name is a string.
Python has five standard data types:
• Numbers
• String
• List
• Tuple
• Dictionary.
In Python, when we assign a numeric value to a variable, a numeric object is created. A numeric value can be written in decimal, octal, binary, or hexadecimal notation and can be an integer, a floating-point number, or a complex number. Python 3 supports three built-in numerical types (the separate long type of Python 2 has been merged into int):
• int
• float
• complex
Strings in Python are sets of contiguous characters enclosed in single or double quotation marks. A subset of a string can be extracted with the help of the slice operators [ ] and [:]. The first character of each string is located at index 0, which we provide inside the slice operators. For strings, the plus (+) sign is used for concatenation and the asterisk (*) sign is used for repetition.
Listing 8.9: String in Python
1. test = "Testing strings in Python"
2.
3. print (test) # Displays complete string
4. print (test[0]) # Displays first character of the string
5. print (test[2:5]) # Displays the characters from index 2 up to (but not including) index 5
6. print (test[2:]) # Displays all characters from index 2 to end of string
7. print (test * 2) # Displays the string two times
8. print (test + " concatenated string") # Display concatenation of two strings
Lists are flexible, compound data types in Python. The items are enclosed in square brackets ([ ]) and separated by commas. The concept of a list is similar to an array in C or C++, with some differences: an array in C can hold only one type of element, but a Python list can hold elements of different types. Similar to strings, the elements stored in a list are accessed with the help of the slice operators ([ ] and [:]), and indexing starts from zero. For lists too, the plus (+) sign is used for concatenation and the asterisk (*) sign for repetition.
Another sequence data type, similar to the list, is the tuple. Like a list, a tuple is a combination of elements separated by commas, but instead of square brackets the elements of a tuple are enclosed in parentheses ( ). As modification is not allowed in a tuple, we can say a tuple is a read-only list.
Listing 8.11: Tuples in Python
1. # This tuple contains the information of a student in the order Name, Age, CGPA, Marks, State
2. tuple = ('John', 20, 2.23, 85, 'New York')
3.
4. print (tuple) # prints the complete tuple
5. print (tuple [0]) # prints the first element of the tuple
6. print (tuple [2:5]) # prints the elements from index 2 up to (but not including) index 5
7. print (tuple [3:]) # prints the elements from index 3 to the end of the tuple
8. print (tuple * 2) # prints the tuple two times
9. print (tuple + tuple) # concatenates the tuple with itself
Python has a data type which is similar to the hash table data structure. It works like an associative array of key–value pairs. The key of a dictionary can be almost any Python type (it must be immutable), but mostly numbers or strings are used as keys, since, compared with other data types, numbers and strings are more meaningful as keys. The value part of a dictionary can be any Python object. A dictionary is enclosed in curly braces ({ }), while the values of a dictionary are accessed using square brackets ([ ]).
Listing 8.12: Dictionary in Python
testdict = {'Name': 'Mike', 'Age': 20, 'CGPA': 3.23, 'Marks': 85}
testdict['State'] = 'New York'

print(testdict)            # prints the complete dictionary
print(testdict['Age'])     # prints the value stored under the key 'Age'
print(testdict.keys())     # prints all the keys of the dictionary
print(testdict.values())   # prints all the values of the dictionary
We have some built-in functions in Python to convert one data type into another. Some of them are given in Table 8.2. After conversion, these functions return a new object with the converted value.
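A few of these conversion functions are shown below (a minimal sketch; the values are illustrative, and only conversions available in standard Python are used).
marks = "85"
print(int(marks) + 5)          # string to integer, prints 90
print(float(marks))            # string to float, prints 85.0
print(str(3.14) + " approx")   # float to string, prints 3.14 approx
print(list("abc"))             # string to list, prints ['a', 'b', 'c']
print(tuple([1, 2, 3]))        # list to tuple, prints (1, 2, 3)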
Operators are constructs used to perform specific mathematical, logical, or other manipulations on operands. Python has many types of operators; some of them are given below.
• Arithmetic operators
• Comparison operators (also called relational operators)
• Assignment operators
• Logical operators.
Suppose we have two variables a and b with numeric values 10 and 5, respectively. Table 8.3 shows the results of applying the arithmetic operators to these variables.
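The same operations can be tried directly, as in the following sketch (the exact set of operators listed in Table 8.3 may differ slightly).
a = 10
b = 5
print(a + b)    # addition: 15
print(a - b)    # subtraction: 5
print(a * b)    # multiplication: 50
print(a / b)    # division: 2.0
print(a % b)    # remainder: 0
print(a ** b)   # exponent: 100000
print(a // b)   # floor division: 2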
Comparison operators (relational operators) in Python are used to compare two operands or values.
Suppose we have two variables a and b with numeric values 10 and 5, respectively. In Table 8.4, we use these variables as operands and perform some comparison operations.
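A minimal sketch of these comparisons, using the same values of a and b:
a = 10
b = 5
print(a == b)   # equal to: False
print(a != b)   # not equal to: True
print(a > b)    # greater than: True
print(a < b)    # less than: False
print(a >= b)   # greater than or equal to: True
print(a <= b)   # less than or equal to: False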
Logical operators in Python combine the truth values of the operands on the left and right sides of the operator.
Suppose we have two variables a and b, both with the Boolean value True. In Table 8.6 we use these variables as operands and perform some logical operations.
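A minimal sketch of the logical operators, using the same Boolean values:
a = True
b = True
print(a and b)   # True, because both operands are true
print(a or b)    # True, because at least one operand is true
print(not a)     # False, the negation of a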
Like other languages, Python also has operator precedence. Table 8.7 shows the precedence of some operators in Python, listed from highest to lowest.
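Precedence can be observed directly; the following sketch (the values are illustrative) shows how it changes a result.
print(2 + 3 * 4)     # multiplication binds tighter than addition: 14
print((2 + 3) * 4)   # parentheses change the evaluation order: 20
print(-2 ** 2)       # exponentiation binds tighter than unary minus: -4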
8.1.20.1 if Statement
If the Boolean expression of the "if" statement evaluates to true, then the statements following the expression execute. If it evaluates to false, the statements do not execute.
Syntax
if expression:
    statement(s)
Listing 8.13: if statement in Python
booleanTrueTest = True
if booleanTrueTest:   # testing a true expression using a Boolean value
    print("Expression test of if using Boolean true. If expression is true then this message will show.")

numericTrueTest = 1
if numericTrueTest:   # testing a true expression using a non-zero number
    print("Expression test of if using numeric true. If expression is true then this message will show.")

print("Testing finished")
If the Boolean expression of "if" evaluates to true, then the statements of "if" execute. If it evaluates to false, then the statements of "else" execute.
Syntax
if expression:
    statement(s)
else:
    statement(s)
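For example (a minimal sketch; the variable name and values are illustrative):
marks = 45
if marks >= 50:
    print("Result: Pass")
else:
    print("Result: Fail")   # executes here because the expression is false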
In Python the "elif" statement is used to check multiple expressions for truth and execute one or more statements. Like "else", the "elif" statement is optional. The expression of the first "elif" is tested only if the expression of "if" evaluates to false. If the expression of an "elif" is false, the next "elif" is tested, and so on, until the optional "else" statement at the end is reached. If the expression of "if" or of any "elif" evaluates to true, all remaining "elif" branches and the "else" are skipped.
Syntax
if expression1:
    statement(s)
elif expression2:
    statement(s)
elif expression3:
    statement(s)
else:
    statement(s)
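For example (a minimal sketch; the variable name and thresholds are illustrative):
cgpa = 3.2
if cgpa >= 3.5:
    print("Grade: A")
elif cgpa >= 3.0:
    print("Grade: B")   # this branch executes for 3.2
elif cgpa >= 2.0:
    print("Grade: C")
else:
    print("Grade: F")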
A while loop in Python repeats execution of its statements as long as the provided expression or condition remains true.
Syntax
while expression:
    statement(s)
A while loop, just like in other programming languages, executes a statement or set of statements as long as the specified condition remains true. It is possible that the while loop does not run even once; this happens if the expression evaluates to false the first time it is tested.
Listing 8.16: while in Python
character = input("Enter 'y' or 'Y' to iterate loop, any other key to exit: ")
counter = 1
while character == 'y' or character == 'Y':
    print('Loop iteration :', counter)
    counter += 1
    character = input("Enter 'y' or 'Y' to iterate loop again, any other key to exit: ")

print("Testing Finished")
In a for loop, each element of the sequence is assigned in turn to the iterating variable, and the loop statement(s) execute. This continues until the end of the sequence (list) is reached.
Listing 8.17: Iterating list through elements using for loop in Python
Colors = ['Red', 'White', 'Black', 'Pink']   # a list of some colors
# iterate directly over the elements of the list
for color in Colors:
    print(color)

print("Loop iterations finished")
Iterating by Sequence Index
In Python there is another way to iterate each item of the list by index offset. In the
following example we are iterating “for” using index offset method.
Listing 8.18: Iterating list through index using for loop in Python
Colors = ['Blue', 'Green', 'White', 'Black']   # a list of some colors
# this loop prints each element of the list together with its index
for index in range(len(Colors)):
    print(Colors[index] + " is at index " + str(index) + " in the list")

print("Loop iterations finished")
The range( ) function controls the number of iterations. If we want the loop to iterate ten times, we provide 10 as the argument of range( ); with this configuration the loop iterates over indexes 0 to 9. We can also use range( ) for another specific range: if we want to iterate over indexes 5 to 9, we provide two arguments to range( ), the first being 5 and the second 10 (the end value is not included).
The second built-in function we used is len( ); this function counts the number of elements in the list.
Finally we used the str( ) function, a data-type conversion function that converts a number into a string.
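These three functions can be tried on their own, as in the following sketch (the list is the one used above):
print(list(range(10)))      # indexes 0 to 9, ten values in total
print(list(range(5, 10)))   # indexes 5 to 9
print(len(['Blue', 'Green', 'White', 'Black']))   # prints 4, the number of elements
print("Index: " + str(2))   # the number is converted to a string before concatenation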
Python provides a feature of loops that is not available in most popular programming languages: using an else statement with a loop. In other languages we can use an else statement only with an if statement.
We can use an else statement with a for loop, but it executes only when the loop has exhausted the iterating list. In other words, if the loop iterates over the complete list, the else block executes; if a break statement stops the loop iterations, the else block does not execute.
We can use an else statement with a while loop too. If the condition becomes false and the loop stops normally, the else block executes; if the loop stops because of a break, the else block does not execute.
The following demonstrates the case in which the else block does not execute because the loop is stopped with a break statement.
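A minimal sketch of such a program (the list values are illustrative, not the book's original listing); its output appears below the code.
for number in [1, 2, 3, 4, 5]:
    if number == 4:
        print("using break statement")
        print("Loop iterations stopped with the use of break statement")
        break
    print(number)
else:
    print("Loop finished without break, so else executed")   # skipped because of the break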
1
2
3
using break statement
Loop iterations stopped with the use of break statement
Python allows us to put a loop inside another loop. When the outer loop executes, the inner loop may also execute, depending on the truth value of the inner loop's expression. Normally a nested loop is used when we need to process the smallest items of the data. Suppose we have a list that contains different color names and we want to print each character of each color name. In this scenario we need two loops: an outer loop to extract each color name from the list, and an inner loop to get every character from a color name.
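A minimal sketch of this scenario (the color names are illustrative):
colors = ['Red', 'Blue']
for color in colors:              # outer loop: one color name per iteration
    print("Characters of", color)
    for character in color:       # inner loop: one character per iteration
        print(character)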
To define a user-defined function, we need to provide some details, which are given below.
• To define a function, we first use the def keyword.
• Next we give the name of the function. It is always recommended to use a meaningful name, so that by looking at the name anyone can judge what the function does.
• Then we provide the arguments of the function. Not all functions need arguments, and the number of arguments depends on our requirements.
• The next step is providing a string called the documentation string, or docstring. This string provides information about the functionality of the function.
• Then we write the statement(s) that we want to execute on a call of this function. These statements are also called the function suite.
The last statement can be a return value. Not all functions have a return value; in that case we can use only the return keyword.
Syntax
def functionname(parameters):
    "function_docstring"
    function_suite
    return [expression]
In the code examples we have already used some functions, such as print( ), range( ), and len( ), but these are built-in functions: functions provided by the language and its libraries. We can also design our own functions, which are called user-defined functions, as the sketch below shows.
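The function name, message, and argument in this sketch are illustrative assumptions; it simply follows the components listed above.
def greet(name):
    """Prints a greeting for the given name."""   # the docstring
    print("Hello,", name)                         # the function suite
    return                                        # no value is returned

greet("Khalid")   # calling the user-defined function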
Pass by value means the argument passed to a function is a copy of the original object: if we change the value of the object inside the function, the original object does not change. Pass by reference means the argument passed to a function refers to the original object: if we change the object inside the function, the original object changes as well.
In Python, a function receives a reference to the object that is passed in. If the object is mutable (such as a list) and we modify it inside the function, the change is visible to the caller; rebinding the parameter to a new object, however, does not affect the original. Let us test the mutable case with the help of an example.
Listing 8.24: function argument pass by value and reference in Python
def myfunc(score):
    """This changes the value of a passed list"""
    score.append(40)   # the modification is visible to the caller
    return

score = [10, 20, 30]
print("Status of list before calling the function: ", score)
myfunc(score)
print("Status of list after calling the function: ", score)
There are some types of arguments that can be used while calling user-defined
functions which are given below.
• Required arguments
• Keyword arguments
• Default arguments
• Variable-length arguments.
With keyword arguments it is not compulsory for the function definition and the function call to match the order of arguments; the arguments are matched by their names instead. In the sketch given below, the order of arguments in the function call differs from the order in the function definition: the argument in first place in the call is in second place in the definition, and vice versa.
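A minimal sketch, with illustrative function and parameter names:
def display_student(name, age):
    """Prints the name and age of a student."""
    print("Name:", name)
    print("Age:", age)

# keyword arguments: the order in the call differs from the definition
display_student(age=20, name="Mike")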
Sometimes we want to call a function with different numbers of arguments, but none of the arguments has a default value. In this scenario, a function call that provides fewer arguments than the definition expects raises an error. To handle this problem, we use variable-length arguments: the first parameter is an ordinary variable, and the remaining arguments are collected into a tuple. In the function call, the first argument is assigned to the first parameter of the function definition, and the rest of the arguments are assigned to the tuple.
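A minimal sketch of variable-length arguments, with illustrative names and values:
def print_marks(first, *rest):
    """Prints the first mark and then any additional marks."""
    print("First mark:", first)
    for mark in rest:            # rest is a tuple holding the extra arguments
        print("Additional mark:", mark)

print_marks(85)            # only the required argument
print_marks(85, 90, 78)    # the extra arguments 90 and 78 are collected into the tuple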
A user-defined function in Python can end with a return statement, which returns a value to the caller. If we do not want to return any value from a specific function, we can use the bare return keyword or omit the return statement entirely; in that case the function returns None. Providing an expression to return is optional.
Listing 8.29: return statement in Python function
def findaverage(num1, num2):
    """finds the average of two numbers"""
    total = num1 + num2
    return total / 2

# calling the function
result = findaverage(10, 20)
print("Average of two numbers is: ", result)
R is another important programming language used for statistical analysis and data
visualization. Similar to Python it is also a common language for data science-related
projects. R was created by Ross Ihaka and Robert Gentleman at the University of
Auckland, New Zealand, and is currently developed by the R Development Core
Team.
To use R language, we need to install two applications.
First, we need to install R language precompiled binary to run the code of R
language. R language precompiled binaries are available on following link:
https://cran.r-project.org/
RStudio is an integrated development environment (IDE) intended for the devel-
opment of R programming projects. Major components of RStudio include a console,
syntax-highlighting editor that supports direct code execution, as well as tools for
plotting, history, debugging, and workspace management. For more information on
RStudio, you can follow the link:
https://rstudio.com/products/rstudio/
Everything is set, and we are ready to create our first R script program.
Listing 8.30: First R program
mystr<- "Hello, World!"
print ( mystr)
8.2.2 Comments in R
In R, a comment starts with the # symbol and extends to the end of the line; R has no separate syntax for multi-line comments.
# print the greeting string stored in mystr
print(mystr)
In any programming language we need variables to store information that we can later use in our program. Variables are reserved memory locations where we can store data. The information we store in a variable can be a string, a number, a Boolean, etc.; the amount of memory used by a variable depends on its type and is reserved through the operating system.
In other programming languages such as C++ and Java we decide the type of a variable when declaring it. In R, however, the type of a variable is decided by the data type of the R-object assigned to it; the value of a variable is called an R-object. Similar to a variable in Python, a variable in R is loosely typed, which means we can assign any type of value to it. If we assign an integer, the type of the variable becomes integer; if we assign a string, the type of the variable becomes character (string).
Listing 8.31: checking type of variable after assigning R-object
var <- 10
print (paste("Type of variable:", class(var)))
There are many types of R-objects, and most frequently used ones are given below.
• Vectors
• Lists
• Matrices
• Arrays
• Factors
• Data frames.
8.2.4 Vectors in R
To create a vector with more than one element, we use the built-in function c( ), which combines the given elements into a vector.
Listing 8.32: Vectors example in R
#declaring a vector
colors <- c('red', 'green', 'yellow')
print(colors)
8.2.5 Lists in R
A list is an R-object that can contain elements of different types. An element of a list can itself be another list, a vector, or even a function.
# a list containing a vector, a number, and a function (the element values are example data)
list1 <- list(c("red", "green"), 21.3, sin)
print(list1)
print(class(list1))   # prints [1] "list"
8.2.6 Matrices in R
In R, a matrix is a two-dimensional collection of data elements arranged in rows and columns. The following is an example of a matrix with 3 rows and 2 columns.
Listing 8.34: Matrices example in R
# create a matrix with 3 rows and 2 columns, filled row by row
M <- matrix(c('a', 'a', 'b', 'c', 'b', 'a'), nrow = 3, ncol = 2, byrow = TRUE)
print(M)
print(class(M))   # prints [1] "matrix"
8.2.7 Arrays in R
An array in R is similar to a matrix, but unlike matrices, which have exactly two dimensions, arrays can have any number of dimensions. To create an array we call the built-in function array( ), which takes an argument called dim holding the dimension information.
# create an array consisting of two 3 x 3 matrices (the values are example data)
A <- array(1:18, dim = c(3, 3, 2))
print(A)
print(class(A))   # prints [1] "array"
8.2.8 Factors in R
Factors are R-objects that are created with the help of a vector. To create a factor we use the factor( ) function, which takes a vector as its argument and returns a factor object. Numeric and character variables can be made into factors, but the levels of a factor are always stored as characters. To get the levels of a factor we use the levels( ) function, and to get the number of levels we use nlevels( ).
Data frames are R-objects that hold data in tabular form, similar to matrices in R, but in a data frame each column can have a different type of data. For example, the first column can be logical, the second numeric, and the third character. To create a data frame we use the built-in function data.frame( ).
Listing 8.37: Data frames example in R
# creating a data frame
Frm <- data.frame(
  gender = c("Male", "Male", "Female"),
  height = c(145, 161, 143.3),
  weight = c(70, 65, 63),
  Age = c(35, 30, 25)
)
print(Frm)
print(class(Frm))   # prints [1] "data.frame"
R supports the usual arithmetic operators. In the following example they are applied to two numeric variables whose values match the output shown afterwards.
var1 <- 10
var2 <- 4
print('Addition:')
print(var1 + var2)
print('Subtraction:')
print(var1 - var2)
print('Multiplication:')
print(var1 * var2)
print('Division:')
print(var1 / var2)
print('Remainder:')
print(var1 %% var2)
print('Quotient:')
print(var1 %/% var2)
[1] "Subtraction:"
[1] 6
[1] "Multiplication:"
[1] 40
[1] "Division:"
[1] 2.5
[1] "Remainder:"
[1] 2
[1] "Quotient:"
[1] 2
The R language supports relational operators, which are used to compare two R-objects. A comparison returns a logical value, TRUE or FALSE.
Table 8.11 shows relational operators in R.
Let us see code example of some relational operators.
var1 <- 10   # example operand values (assumed)
var2 <- 4
print('Greater than:')
print(var1 > var2)
print('Less than:')
print(var1 < var2)
print('Equal to:')
print(var1 == var2)
8.2.11.1 if Statement in R
The statements inside an "if" block execute if the condition evaluates to true. The "if" statement is used for decision making in R, and its syntax is given below.
Syntax
if(boolean_expression) {
   # statement(s) executed if the Boolean expression returns true
}
Listing 8.40: if statement in R
num <- 50L
if(is.integer(num)) {
print("num is an integer")
}
8.2.11.2 if…else in R
Syntax
if(boolean_expression) {
   # statement(s) executed if the Boolean expression returns true
} else {
   # statement(s) executed if the Boolean expression returns false
}
Listing 8.41: if…else statement in R
num <- "50"
if(is.integer(num)) {
print("num is an integer")
} else {
print("num is not an integer")
}
In R, "if" and "else if" statements are used as a chain of statements to test various conditions. Execution starts from the top; if the condition of "if" or of any "else if" returns true, the remaining "else if" and "else" statements are not run. We can use many "else if" statements after the first "if" statement, ending with an optional "else" statement. The following is the syntax of chained if, else if, and else statements.
Syntax
if(First Condition) {
   # statement(s) executed if First Condition is true
} else if(Second Condition) {
   # statement(s) executed if Second Condition is true
} else if(Third Condition) {
   # statement(s) executed if Third Condition is true
} else {
   # statement(s) executed if all of the above conditions are false
}
8.2.12 Loops in R
In R, when we need to execute a statement or a group of statements again and again at the same place, we can use the repeat loop. The syntax of the repeat loop is given below.
Syntax
repeat {
statement(s)
if(boolean_expression) {
break
}
}
vector <- c("Loop Test")
counter <- 0
# repeat loop
repeat {
   print(vector)
   counter <- counter + 1
   if(counter > 4) {
      break
   }
}
In R, a repeat loop stops execution with the help of a break statement placed inside an "if" statement.
A while loop in R stops execution when the Boolean expression or condition of the while loop returns false. The following is the syntax of the while loop in R.
Syntax
while (boolean_expression) {
statement(s)
}
vector <- c("Loop Test")
counter <- 0
# while loop
while(counter < 5) {
   print(vector)
   counter <- counter + 1
}
In the R language, when we need to iterate one or more statements a defined number of times, we should use the for loop. The syntax of the for loop in R is given below.
Syntax
for (value in vector) {
Body of Loop (Statements)
}
R's for loop can iterate over a sequence of any data type, including integers, strings, lists, etc.
vector <- c("Loop Test")
counter <- 0
# for loop: runs once for each value in the sequence 6:10
for(i in 6:10) {
   print(vector)
   counter <- counter + 1
}
If we want to stop execution of a loop at a particular iteration, we can use the break statement. Normally we use an "if" statement to test a specific condition; when a loop iteration reaches that condition, we use break to stop the loop iterations. The break statement can be used with repeat, for, and while loops.
Listing 8.46: break statement in R
vector <- c("Loop Test")
counter <- 0
# repeat loop
repeat {
print(vector)
counter <- counter + 1
if(counter > 4) {
break
}
}
8.2.14 Functions in R
There are two types of functions in the R language: built-in and user-defined functions. The functions provided by the R language library are called built-in or predefined functions. The functions created and defined by the user are called user-defined functions.
The following are components of user-defined function.
• Function name
• Arguments
• Function body
• Return value.
The following is syntax of user-defined function.
function_name<- function(arg_1, arg_2, …) {
Function body
}
Listing 8.47: Function in R
# creating a user-defined function
my.function <- function(arg) {
   print(arg)
}
# calling the function
my.function(10)
my.function(5.8)
my.function("Hello World!")
programming abilities will find the exercises and solutions provided in this book
to be ideal for their needs.
• Solving PDEs in Python (Authors: Hans Petter Langtangen and Anders Logg)
The book offers a concise and gentle introduction to finite element programming
in Python based on the popular FEniCS software library. Using a series of exam-
ples, including the Poisson equation, the equations of linear elasticity, the incom-
pressible Navier–Stokes equations, and systems of nonlinear advection–diffu-
sion–reaction equations, it guides readers through the essential steps to quickly
solving a PDE in FEniCS, such as how to define a finite variational problem, how
to set boundary conditions, how to solve linear and nonlinear systems, and how
to visualize solutions and structure finite element Python programs.
8.4 Summary
In this chapter, we have discussed two very important and most commonly used data
science programming languages. The first one is Python, and the second one is R
programming language. We have provided the details of basic code structures of both
programming languages along with programming examples. The overall intention
was to provide you with a simple tutorial to enable you to code complex data science
programs through basic coding structures.