Data Science Concepts and Techniques with Applications
Usman Qamar • Muhammad Summair Raza

Usman Qamar
Knowledge and Data Science Research Centre
National University of Sciences and Technology (NUST)
Islamabad, Pakistan

Muhammad Summair Raza
Department of Computer Science
Virtual University of Pakistan
Lahore, Pakistan
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature
Singapore Pte Ltd. 2020
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of
illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt from
the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, express or implied, with respect to the material contained herein or
for any errors or omissions that may have been made. The publisher remains neutral with regard to
jurisdictional claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
This book is dedicated to our students.
Preface
As this book is about data science, the first question it immediately begs is: What is
data science? It is a surprisingly hard definition to nail down. However, for us, data
science is perhaps the best label for the cross-disciplinary set of skills that
are becoming increasingly important in many applications across industry and
academia. It comprises three distinct but overlapping areas: the statistician, who knows how to model and summarize data; the data scientist, who can design and use algorithms to efficiently process and visualize data; and the domain expert, who formulates the right questions and puts the answers in context. With
this in mind, we would encourage all to think of data science not as a new domain
of knowledge to learn, but a new set of skills that you can apply within your current
area of expertise.
The book is divided into three parts. The first part consists of the first three
chapters. In Chap. 1, we will discuss the data analytics process. Starting from the
basic concepts, we will highlight the types of data, its use, its importance, and issues
that are normally faced in data analytics. Efforts have been made to present the concepts in the simplest possible way, as conceptual clarity is very much necessary before studying the advanced concepts of data science and related techniques.
Data analytics has a wide range of applications, which are discussed in Chap. 2. Today, when we have already entered the era of information, the concept of big data has taken over in organizations. With information being generated at an immense rate, it has become very necessary to discuss the analytics process from a big data point of view, so in this chapter we provide some common applications of the data analytics process from a big data perspective. Chapter 3 introduces widely used techniques for data analytics. Prior to the discussion of common data analytics techniques, we first explain the three types of learning under which the majority of data analytics algorithms fall.
The second part is composed of Chaps. 4–7. Chapter 4 is on data preprocessing.
Data may contain noise, missing values, redundant attributes, etc., so data preprocessing is one of the most important steps in making data ready for final processing. Feature selection is an important task used in data preprocessing; it helps reduce noise and remove redundant and misleading features. Chapter 5 is on classification concepts. Classification is an important step that forms the core of data analytics and machine
learning activities. The focus of Chap. 6 is on clustering. Clustering is the process
of dividing objects and entities into meaningful and logically related groups. In contrast to classification, where we already have labeled classes in the data, clustering involves unsupervised learning, i.e. we do not have any prior classes. Chapter 7 introduces text mining as well as opinion mining.
Finally, the third part of the book is composed of Chap. 8, which focuses on two programming languages commonly used for data science projects, i.e. Python and R.
Data science is an umbrella term that encompasses data analytics, data mining, machine learning, and several other related disciplines, so the contents have been devised with this perspective in mind. An attempt has been made to keep the book as self-contained as possible. The book is suitable for both undergraduate and postgraduate students as well as those carrying out research in data science. It can be used as a textbook for undergraduate students in computer science, engineering, and mathematics. It is also accessible to undergraduate students from other areas with an adequate background. The more advanced chapters can be used by postgraduate researchers seeking a deeper theoretical understanding.
Contents
1 Introduction
1.1 Data
1.2 Analytics
1.3 Big Data Versus Small Data
1.4 Role of Data Analytics
1.5 Types of Data Analytics
1.6 Challenges of Data Analytics
1.6.1 Large Volumes of Data
1.6.2 Processing Real-Time Data
1.6.3 Visual Representation of Data
1.6.4 Data from Multiple Sources
1.6.5 Inaccessible Data
1.6.6 Poor Quality Data
1.6.7 Higher Management Pressure
1.6.8 Lack of Support
1.6.9 Budget
1.6.10 Shortage of Skills
1.7 Top Tools in Data Analytics
1.8 Business Intelligence (BI)
1.9 Data Analytics Versus Data Analysis
1.10 Data Analytics Versus Data Visualization
1.11 Data Analyst Versus Data Scientist
1.12 Data Analytics Versus Business Intelligence
1.13 Data Analysis Versus Data Mining
1.14 What Is ETL?
1.14.1 Extraction
1.14.2 Transformation
1.14.3 Loading
Dr. Usman Qamar has over 15 years of experience in data engineering and decision sciences, both in academia and industry. He has a Masters in Computer Systems Design from the University of Manchester Institute of Science and Technology (UMIST), UK. His MPhil in Computer Systems was a joint degree between UMIST and the University of Manchester and focused on feature selection in big data. In 2008 he was awarded a PhD by the University of Manchester, UK. His post-PhD work at the University of Manchester involved various research projects, including hybrid mechanisms for statistical disclosure (feature selection merged with outlier analysis) for the Office of National Statistics (ONS), London, UK, churn prediction for Vodafone UK, and customer profile analysis for shopping with the University of Ghent, Belgium. He is currently Associate Professor of Data Engineering at the National University of Sciences and Technology (NUST), Pakistan. He has authored over 200 peer-reviewed publications, including 3 books published by Springer. He is on the editorial board of many journals, including Applied Soft Computing, Neural Computing and Applications, Computers in Biology and Medicine, and Array. He has successfully supervised 5 PhD students and over 100 master's students.
Dr. Muhammad Summair Raza has been affiliated with the Virtual University of Pakistan for more than 8 years and has taught a number of subjects to graduate-level students. He has authored several articles in quality journals and is currently working in the fields of data analysis and big data, with a focus on rough sets.
Chapter 1
Introduction
In this chapter we will discuss the data analytics process. Starting from the basic concepts, we will highlight the types of data, their use, their importance, and the issues that are normally faced in data analytics. Efforts have been made to present the concepts in the simplest possible way, as conceptual clarity is very much necessary before studying the advanced concepts of data science and related techniques.
1.1 Data
Data is an essential need in all domains of life. From the research community to business markets, data is always required for analysis and decision-making purposes. However, emerging developments in data storage, processing, and transmission technology have changed the entire scenario. A bulk of data is now produced on a daily basis. Whenever you type a message, upload a picture, browse the web, or post on social media, you are producing data which is stored somewhere and is available online for processing. Couple this with the development of advanced software applications and inexpensive hardware. With the emergence of concepts like the Internet of Things (IoT), where the focus is on connected data, the flood of data has grown even further. From writing something on paper to online distributed storage, data is everywhere.
Every second, the amount of data increases at an immense rate. By 2020, the overall amount of data is predicted to be 44 zettabytes; just to give an idea, 1.0 ZB is equal to 1.0 trillion gigabytes. With such huge volumes of data, apart from challenges and issues like the curse of dimensionality, we also have various opportunities to dig deep into these volumes and extract useful information and knowledge for the good of society, academia, and business.
Figure 1.1 shows a few representations of data.
Fig. 1.1 A few representations of data: structured (tabular) data and graphical data
1.2 Analytics
In the previous section we discussed the huge volumes of data that are produced on a daily basis. Data is not useful until we have some mechanism to extract knowledge from it and make decisions. This is where the data analytics process steps in. There are various definitions of data analytics; we will use a simple one. Data analytics is the process of taking data (in any form), processing it through various tools and techniques, and then extracting useful knowledge from it. This knowledge ultimately helps in decision making.
Overall, data analytics is the process of generating knowledge from raw data, which includes various steps from storage to final knowledge extraction. Apart from this, the process involves concepts from various other domains of science and computing. Starting from basic statistical measures, e.g. means, medians, and variances, up to advanced data mining and machine learning techniques, each step transforms data to extract knowledge.
This process of data analytics has also opened the door to new ideas, e.g. how to mimic the human brain with a computer so that tasks performed by humans could be performed by machines at the same level. The artificial neural network was an important development in this regard.
With the advent of such advanced tools and techniques for data analytics, several other problems have also emerged, e.g. how to use computing resources efficiently and enhance performance, and how to deal with various data-related problems such as inaccuracy, huge volumes, and anomalies. Figure 1.2 shows the data analytics process at a high level.
1.3 Big Data Versus Small Data
The nature of the data that applications need has now totally changed. Starting from basic databases, which used to store daily transaction data, distributed connected data has become a reality. This change has impacted all aspects related to data, including storage mechanisms, processing approaches, and knowledge extraction. Table 1.1 presents a brief comparison of "small" data and "big" data.
As discussed above, data has grown from a few gigabytes to zettabytes. The change is not only in size; it has changed the entire scenario, e.g. how do we process distributed connected data? How do we ensure the security of data when we do not even know where it is stored in the cloud? How do we make the most of it with limited resources? These challenges have opened new windows of opportunity. Grid computing, clouds, fog computing, etc., are the results of such challenges.
The mere expansion of resources is not sufficient. Strong support from software is also essential, because conventional software applications cannot cope with the size and nature of big data. For example, a simple software application that performs data distribution in a single-server environment will not be effective for distributed data. Similarly, an application that just collects data from a single server in response to a query will not be able to extract distributed data from different distributed nodes. Various other factors should also be considered, e.g. how to integrate such data, where to fetch data from efficiently in case of a query, and whether data should be replicated.
Even the above-mentioned issues are relatively simple ones; we also come across more complex issues. In cloud computing, for example, your data is stored on a cloud node you know nothing about, so how do you ensure sufficient security and availability of your data? One of the common techniques to deal with such big data is the MapReduce model. Hadoop, based on MapReduce, is one of the common platforms for processing and managing such data. Data is stored on different systems as per needs, and the processed results are then integrated.
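To give a feel for the MapReduce model mentioned above, the following minimal Python sketch imitates its conceptual steps (map, shuffle/group, reduce) for a word count on a single machine; a real Hadoop or Spark job distributes the same logic across many nodes, and the documents here are invented for illustration.

# A toy, single-machine imitation of the MapReduce word-count pattern
from collections import defaultdict

documents = ["big data needs big ideas", "data is the new oil"]

# Map step: emit (word, 1) pairs from every document
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle step: group all emitted values belonging to the same key
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce step: aggregate the grouped values for each key
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)   # e.g. {'big': 2, 'data': 2, ...}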
If we consider the algorithmic nature of the analytics process, data analytics for small data seems basic, as we have data only in a single structured format, e.g. we may be provided with a certain number of objects where each object is defined through precisely specified features and class labels. However, when it comes to big data, various issues need to be considered. Considering just the feature selection process, a simple feature selection algorithm for homogeneous data will be different from one dealing with data of a heterogeneous nature.
Similarly, small data in the majority of cases is structured, i.e. you have well-defined schemas for the data; in big data, however, you may have both structured and unstructured data. So, an algorithm working on a simple data structure will be far simpler than one working on different structures.
So, as compared to simple small data, big data is characterized by four features as follows:
Volume: We deal with petabytes, exabytes, and zettabytes of data.
Velocity: Data is generated at an immense rate.
Veracity: Refers to bias and anomalies in big data.
Variety: Refers to the number of different types and formats of data.
Compared to big data, small data is far simpler and less complex.
1.4 Role of Data Analytics
The job of a data analyst draws on knowledge from domains like statistics, mathematics, artificial intelligence, and machine learning. The ultimate intention is to extract knowledge for the success of the business. This is done by extracting patterns from data.
It involves complete interaction with data throughout the entire analysis process. So, a data analyst works with data in various ways. This may include data storage, data cleansing, data mining for knowledge extraction, and finally presenting the knowledge through measures and figures.
Data mining forms the core of the entire data analytics process. It may include extraction of data from heterogeneous sources including texts, videos, numbers, and figures. The data is extracted from the sources, transformed into a form that can be easily processed, and finally loaded so that the required processing can be performed. Overall, this is called the extract, transform, and load (ETL) process. Note, however, that the entire process is time consuming and requires a lot of resources, so one of the ultimate goals is to perform it efficiently.
Statistics and machine learning are two of the major components of data analytics. They help in the analysis and extraction of knowledge from data. We input data and use statistics and machine learning techniques to develop models. These models are then used for analysis and prediction. Nowadays, many tools and libraries are available for this purpose, including R and Python.
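As a hedged sketch of this model-building idea, the snippet below uses the scikit-learn library (assumed to be installed; the tiny customer dataset and column meanings are invented purely for illustration) to fit a classifier on a few labelled examples and predict the label of a new observation.

# Fit a simple classifier and use it for prediction (scikit-learn assumed installed)
from sklearn.tree import DecisionTreeClassifier

# Toy training data: [age, monthly_spend] -> 1 = likely buyer, 0 = unlikely buyer
X_train = [[25, 200], [40, 800], [35, 650], [22, 150], [50, 900]]
y_train = [0, 1, 1, 0, 1]

model = DecisionTreeClassifier()
model.fit(X_train, y_train)          # learn patterns from the historical data

print(model.predict([[30, 700]]))    # predict for a new, unseen customer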
The final phase in data analytics is data presentation. Data presentation involves visual representation of the results for the comprehension of the customer. As the customer is the intended audience of the data representation, the techniques used should be simple, informative, and aligned with the customer's requirements.
Talking about the applications of data analytics, its major role is to enhance the performance and efficiency of businesses and organizations.
One of the major roles data analytics plays is in the banking sector, where it can be used to compute credit scores, predict potential customers for a certain policy, detect outliers, etc.
However, apart from its role in finance, data analytics plays a critical role in security management, health care, emergency management, etc.
1.5 Types of Data Analytics
Data analytics is a broad domain. It has four types: descriptive analytics, diagnostic analytics, predictive analytics, and prescriptive analytics. Each has its own nature and its own type of tasks. Here we will provide a brief description of each.
• Descriptive analytics: Descriptive analytics helps us find out "What happened" or "What is happening". In simple words, these techniques take raw data as input and summarize it in the form of knowledge useful for customers, e.g. they may help find the total time spent by the company on each customer, or the total sales made by each region in a certain season (see the sketch after this list). So, the descriptive analytics process comprises data input, processing, and results generation. The generated results are presented in visual form for better understanding by the customer.
• Diagnostic analytics: Taking the analytics process a step further, diagnostic analytics helps analyze "Why did it happen?". By performing analysis on historical and current data, we may get details of why a certain event actually happened at a certain period in time. For example, we can find out the reasons for a certain drop in sales over the third quarter of the year. Similarly, we can find the reasons behind a low crop yield in agriculture. Special measures and metrics can be defined for this purpose, e.g. yield per quarter, profit per six months, etc. Overall, the process is completed in three steps:
– Data collection
– Anomaly detection
– Data analysis and identification of the reasons.
• Predictive analytics: Predictive analytics, as the name indicates, helps in predicting the future. It helps in finding "What may happen". Using current and historical data, predictive analytics finds patterns and trends by using statistical and machine learning techniques and tries to predict whether the same circumstances may occur in the future. Various machine learning techniques, like artificial neural networks, classification algorithms, etc., may be used. The overall process comprises the following steps:
– Data collection
– Anomaly detection
– Application of machine learning techniques to predict patterns.
• Prescriptive analytics: Prescriptive analytics, as the name implies, suggests the necessary actions that need to be taken in case of a certain predicted event, e.g. what should be done to increase the predicted low yield in the last quarter of the year, or what measures should be taken to increase sales in the off season.
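The sketch below illustrates the descriptive-analytics example mentioned in the list above (total sales per region); it assumes the pandas library and uses an invented toy dataset.

# Summarize raw transaction records into sales per region (pandas assumed installed)
import pandas as pd

transactions = pd.DataFrame({
    "region": ["North", "South", "North", "East", "South"],
    "sales":  [250, 400, 300, 150, 350],
})

sales_per_region = transactions.groupby("region")["sales"].sum()
print(sales_per_region)   # descriptive summary: what happened in each region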
So, the different types of analytics help at different stages for the good of businesses and organizations. One thing common to all types of analytics is the data required for applying the analytics process. The better the quality of the data, the better the decisions and results. Figure 1.3 shows the scope of the different types of data analytics.
1.6 Challenges of Data Analytics
Although data analytics has been widely adopted by organizations and a lot of research is underway in this domain, there are still many challenges that need to be addressed. Here, we will discuss a few of these challenges.
1.6.1 Large Volumes of Data
In this data-driven era, organizations are collecting large amounts of data on a daily basis, sufficient to pose significant challenges for the analytics process. The volumes of data available these days require significant resources for storage and processing. Although analytics processes have come up with various solutions to address the issue, a lot of work still needs to be done to solve the problem.
1.6.2 Processing Real-Time Data
In the majority of cases, data remains significant only for a certain period of time. Coupling this problem with the large amount of data, it becomes difficult to capture such huge volumes in real time and process them for meaningful insights. The problem becomes more critical when we do not have sufficient resources to collect and process real-time data. This is another research domain, where data analytics processes must handle huge volumes of real-time data and process them in real time to yield meaningful information while its significance remains intact.
1.6.3 Visual Representation of Data
For clear understanding by customers and organizations, data should be presented in simple and easy ways. This is where visual representations like graphs and charts may help. However, presenting information in a simple way is not an easy task, especially when the complexity of the data increases and we need to present information at various levels. Just feeding the data into tools and generating default charts is not sufficient on its own.
1.6.4 Data from Multiple Sources
Another issue that needs to be addressed is distributed data, i.e. data stored at different geographical locations. This may create problems, as doing the job manually is very cumbersome. From the data analytics perspective, data distribution may raise many issues, including data integration, mismatches in data formats, different data semantics, etc.
The problem lies not only in data distribution but also in the resources required to process this data. For example, processing distributed data in real time may require expensive devices to ensure high-speed connectivity.
1.6.5 Inaccessible Data
For an effective data analytics process, data should be accessible 24/7. Accessibility of data is another issue that needs to be addressed. Backup storage and communication devices need to be purchased to ensure the data is available whenever required. Even if you have the data, if it is not accessible for any reason, the analytics process will not deliver significant value.
1.6.6 Poor Quality Data
Data quality lies at the heart of the data analytics process. Incorrect and inaccurate data means inaccurate results. It is common to have anomalies in data. Anomalies may include missing values, incorrect values, irrelevant values, etc. Anomalies may occur for various reasons, e.g. defects in sensors that result in incorrect data collection, or users not willing to enter correct values.
Dealing with this poor-quality data is a big challenge for data analytics algorithms. Various preprocessing techniques are already available, but dealing with anomalies is still a challenge, especially when it comes to large volumes and distributed data with unknown data sources.
1.6.7 Higher Management Pressure
As the results of data analytics are realized and the benefits become evident, higher management demands more results, which ultimately increases the pressure on data analysts, and work under pressure always has its negatives.
1.6.8 Lack of Support
Lack of support from higher management and peers is another issue to deal with. Data analytics is not useful if higher management is not supportive and does not give the authority to act on the knowledge extracted from the analytics process. Similarly, if peers, e.g. other departments, are not willing to provide data, the analytics process will not be very useful.
1.6.9 Budget
Budget is one of the core issues to deal with. The data analytics process requires expensive systems with the capacity to deal with large volumes of data, hiring consultants whenever needed, purchasing data and tools, etc., all of which involves a significant budget. Unless the required budget is provided and organizations are willing to spend on the data analytics process, it is not possible to reap the fruits of data analytics.
1.6.10 Shortage of Skills
Data analytics is a rich field involving skill sets from various domains like mathematics, statistics, artificial intelligence, and machine learning. It is therefore difficult to find experts having knowledge and experience in all of these domains, so finding the right people for the right job is an issue that organizations and businesses still face.
1.7 Top Tools in Data Analytics
With the increasing trend toward data analytics, various tools have been developed for building data analytics systems. Here we will discuss a few tools that are most commonly used for this purpose.
R programming R is one of the leading tools for analytics and data modeling. It has compatible versions for Microsoft Windows, Mac OS, and Unix. In addition, it has many libraries available for different scientific tasks.
Python Python is another programming language that is widely used for writing programs related to data analytics. This open-source, object-oriented language has a number of libraries from high-profile developers for performing different data analytics-related tasks. A few of the most common libraries used in Python are NLTK, NumPy, SciPy, scikit-learn, etc.
Tableau Public Free software that can connect to any data source and create visualizations, including graphs, charts, and maps, in real time.
QlikView It allows for data processing, thus enhancing efficiency and performance. It also offers data association and data visualization with compressed data.
SAS SAS is another data analytics tool that can analyze and process data from any source.
Microsoft Excel One of the most common tools used for organizational data processing and visualization. The tool is developed by Microsoft and is part of the Microsoft Office suite. It integrates a number of mathematical and statistical functions.
RapidMiner Mostly used for predictive analytics, the tool can be integrated with any data source, including Excel, Oracle, SQL Server, etc.
KNIME An open-source platform that lets you analyze and model data. Through its modular data pipeline concept, KNIME provides a platform for reporting and integration of data.
Apache Spark Apache Spark is one of the largest large-scale data processing tools. It executes applications in Hadoop clusters 100 times faster in memory and 10 times faster on disk.
Splunk Splunk is a specialized tool to search, analyze, and manage machine-generated data. Splunk collects, indexes, and analyzes real-time data into a repository from which it generates information visualizations as per requirements.
Talend Talend is a powerful tool for automating big data integration. Talend uses native code generation and helps you run your data pipelines across all cloud providers to get optimized performance on all platforms.
Splice Machine Splice Machine is a scalable SQL database that lets you modernize your legacy and custom applications to be agile, data-rich, and intelligent without modifications. It lets you unify machine learning and analytics, consequently reducing ETL costs.
1.8 Business Intelligence (BI)
Business intelligence deals with analyzing data and presenting the extracted information so that business decisions can be made. It is a process that includes the technical infrastructure to collect, store, and analyze data for different business-related activities.
The overall objective of the process is to make better decisions for the good of the business. Some benefits include
• Effective decision making
• Business process optimization
• Enhanced performance and efficiency
• Increased revenues
• Potential advantages over competitors
• Making effective future policies.
These are just a few benefits to mention. However, in order to achieve them, effective business intelligence needs to meet four major criteria.
Accuracy Accuracy is the core of any successful process and product. In the case of the business intelligence process, accuracy refers to the accuracy of the input data and of the produced output. A process with inaccurate data may not reflect the actual scenario and may produce inaccurate output, which may lead to ineffective business decisions. So, we should be especially careful about the accuracy of the input data.
Here the term error is used in a general sense. It refers to erroneous data that may contain missing values, redundant values, and outliers. All of these significantly affect the accuracy of the process. For this purpose, we need to apply different cleansing techniques as per requirements in order to ensure the accuracy of the input data.
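As one possible illustration of such cleansing, the following pandas sketch removes duplicate rows, fills missing values, and filters an implausible outlier before the data is fed into the BI process; the column names, the imputation choice, and the age threshold are assumptions made only for this example.

# A simple cleansing pass before feeding data into a BI process (pandas assumed)
import pandas as pd

raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "age":         [34, None, None, 29, 210],   # None = missing, 210 = outlier
})

clean = raw.drop_duplicates().copy()                        # remove redundant rows
clean["age"] = clean["age"].fillna(clean["age"].median())   # impute missing ages
clean = clean[clean["age"].between(0, 120)]                 # drop implausible ages
print(clean)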
Valuable Insights The process should generate valuable insights from the data. The insights generated by the business intelligence process should be aligned with the requirements of the business to help it make effective future policies; e.g. for a medical store owner, information about customers' medical conditions is more valuable than it is for a grocery store.
BI Process It has four broad steps that loop over and over. Figure 1.4 shows the
process.
Fig. 1.4 The BI process: a loop of analysis, actions, measurement, and feedback
1.9 Data Analytics Versus Data Analysis
Below is a list of points that describe the key differences between data analytics and data analysis:
• Data analytics is a general term that refers to the process of making decisions from data, whereas data analysis is a sub-component of data analytics that tends to analyze the data and get the required insight.
• Data analytics refers to data collection and general analysis, whereas data analysis refers to collecting, cleaning, and transforming the data to gain deep insight from it.
• Tools required for data analytics may include Python, R, TensorFlow, etc., whereas tools required for data analysis may include RapidMiner, KNIME, Google Fusion Tables, etc.
• Data analysis deals with examining, transforming, and arranging given data to extract useful information, whereas data analytics deals with the complete management of data, including collection, organization, and storage.
Figure 1.5 shows the relationship between data analytics and data analysis.
1.10 Data Analytics Versus Data Visualization
In this section we will discuss the difference between data analytics and data visualization.
• Data analytics deals with tools, techniques, and methods to derive deep insight from data by finding the relationships in it, whereas data visualization deals with presenting the data in a format (mostly graphical) that is easy to understand.
• Data visualization helps organizational management to visually perceive the analytics and concepts present in the data.
• Data analytics is the process that can help organizations increase operational performance, make policies, and take decisions that may provide advantages over business competitors.
• Descriptive analytics may, for example, help organizations find out what has happened and identify its root causes.
• Prescriptive analytics may help organizations find the available prospects and opportunities and consequently make decisions in favor of the business.
• Predictive analytics may help organizations predict future scenarios by looking into the current data and analyzing it.
• Visualizations can be both static and interactive. Static visualizations normally provide a single view that the visualization is intended for; the user normally cannot see beyond the lines and figures.
• Interactive visualizations, as the name suggests, let users interact with the visualization and obtain views according to their specified criteria and requirements.
• Data visualization techniques like charts, graphs, and other figures help us see the trends and relationships in the data much more easily. We can say that visualization is part of the output of the analytics process. For example, a bar chart of sales per month is far easier for a person to understand than the raw numbers and text.
So, in general, data analytics performs the analytics-related tasks and derives the information, which is then presented to the user in the form of visualizations. Figure 1.6 shows the relationship between data analytics and data visualization.
1.11 Data Analyst Versus Data Scientist
Both are prominent jobs in the market these days. A data scientist is someone who can predict the future based on the data and the relationships in it, whereas a data analyst is someone who tries to find meaningful insights from the data. Let us look into both of them.
• A data analyst deals with the analysis of data for report generation, whereas a data scientist has a research-oriented job, responsible for understanding the data and the relationships in it.
• Data analysts normally look into known information from a new perspective, whereas a data scientist's work may involve finding the unknown in the data.
• The skill set required for a data analyst includes concepts from statistics and mathematics and various data representation and visualization techniques. The skill set for a data scientist includes advanced data science programming languages and frameworks like Python, R, and TensorFlow, and various libraries related to data science like NLTK, NumPy, SciPy, etc.
• A data analyst's job includes data analysis and visualization, whereas the job of a data scientist requires the skills to understand the data and find the relationships in it for deep insight.
• In terms of complexity, the data scientist's job is more complex and technical than the data analyst's.
• A data analyst normally deals with structured data, whereas a data scientist may have to deal with structured, unstructured, and hybrid data.
It should be noted that we cannot prioritize one of these jobs over the other: both are essential, each has its own roles and responsibilities, and both help an organization grow its business based on its needs and requirements.
Now we will explain the difference between data science and some other domains.
1.12 Data Analytics Versus Business Intelligence
Apparently the two terms seem synonymous; however, there are many differences, which are discussed below.
• Business intelligence refers to a generic process that is useful for making decisions from the historical information of any business, whereas data analytics is the process of finding relationships in data to gain deep insight.
• The main focus of business intelligence is to help in decision making for further growth of the business, whereas data analytics deals with gathering, cleaning, modeling, and using data as per the business needs of the organization.
• The key difference between data analytics and business intelligence is that business intelligence deals with historical data to help organizations make intelligent decisions, whereas the data analytics process tends to find relationships in the data.
• Business intelligence tries to look into the past and tends to answer questions like: What happened? When did it happen? How many times? Data analytics, on the other hand, tries to look into the future and tends to answer questions like: When will it happen again? What will be the consequences? How much will sales increase if we take this action?
• Business intelligence deals with tools and techniques like reporting, dashboards, scorecards, and ad hoc queries, whereas data analytics deals with tools and techniques like text mining, data mining, multivariate analysis, and big data analytics.
In short, business intelligence is the process of helping organizations make intelligent decisions from historical data, normally stored in data warehouses and organizational data repositories, whereas data analytics deals with finding relationships in data for deep insight.
1.13 Data Analysis Versus Data Mining
Data mining and data analysis are two different processes and terms, each having its own scope and flow. Here we will present some differences between the two.
• Data mining is the process of finding existing patterns in data, whereas data analysis tends to analyze the data and obtain the required insight.
• Data mining may require a skill set including mathematics, statistics, machine learning, etc., whereas the data analysis process involves a skill set including statistics, mathematics, machine learning, and subject knowledge.
• A data mining specialist is responsible for mining patterns from the data, whereas a data analyst performs data collection, cleaning, and transformation to gain deep insight from it.
Figure 1.7 shows the relation between data mining and data analysis.
1.14 What Is ETL?
In data warehouses, data comes from various sources. These sources may be homogeneous or heterogeneous. Homogeneous sources share the same data semantics, whereas heterogeneous sources are ones where the data semantics and schemas differ. Furthermore, the data from different sources may contain anomalies like missing values, redundant values, and outliers. The data warehouse, however, should contain homogeneous and accurate data in order to provide this data for further processing. The main process that enables data to be stored in the data warehouse is called the extract, transform, and load (ETL) process.
ETL is the process of collecting data from various homogeneous and heterogeneous sources and then applying a transformation process to load it into the data warehouse.
It should be noted that the data in different sources may not be directly storable in the data warehouse, primarily due to different data formats and semantics. Here we will present some examples of anomalies that may be present in source data and that require a complex ETL process.
Different data formats Consider two databases, both of which store the customer's date of birth. One database may store the date in the format "mm/dd/yyyy" while the other database may use a different format like "yyyy/mm/dd" or "d/m/yy". Due to the different date formats, it may not be possible for us to integrate their data without involving the ETL process.
Different data semantics In the previous case the data formats were different. It may also be the case that the data formats are the same but the data semantics are different. For example, consider two shopping cart databases where currencies are represented as floating-point numbers. Although the format of the data is the same, the semantics may be different: the floating-point currency value in one database may represent dollars, whereas the same value in the other database may represent euros. So, again we need the ETL process to convert the data to a homogeneous format.
Missing values Databases may have missing values as well. This may happen for several reasons, e.g. people are normally not willing to disclose their salaries or personal contact numbers. Similarly, gender and date of birth are among the fields that people may not be willing to provide while filling in different online forms. All this results in missing values, which ultimately affect the quality of the output of any process performed on such data.
Incorrect values Similarly, we may have incorrect values, e.g. outliers resulting from a stolen credit card or from the malfunctioning of a weather data collection sensor. Again, such values affect the quality of any analysis performed on this data. So, before storing such data for analysis purposes, the ETL process is performed to fix such issues.
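A minimal Python sketch of the kind of fixes just described, normalizing two different date formats and converting one currency into a common one before loading; the field names, formats, and exchange rate are assumptions for illustration only.

# Normalize heterogeneous source records before loading (illustrative only)
from datetime import datetime

EUR_TO_USD = 1.1   # assumed fixed exchange rate, purely for the example

def normalize(record, date_format, currency):
    # Convert a source record into the warehouse's common format
    dob = datetime.strptime(record["dob"], date_format).date()   # unify dates
    amount = record["amount"] * (EUR_TO_USD if currency == "EUR" else 1.0)
    return {"dob": dob.isoformat(), "amount_usd": round(amount, 2)}

source_a = {"dob": "03/25/1990", "amount": 100.0}   # mm/dd/yyyy, dollars
source_b = {"dob": "1990/03/25", "amount": 100.0}   # yyyy/mm/dd, euros

print(normalize(source_a, "%m/%d/%Y", "USD"))
print(normalize(source_b, "%Y/%m/%d", "EUR"))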
1.14.1 Extraction
Data extraction is the first step of the ETL process. In this step data is read from the source database and stored in intermediate storage, and the transformation is then performed. Note that the transformation is performed on a separate system so that the source database and the systems using it are not affected. Once all the transformations are performed, the data becomes ready for the next stage; however, after completing the transformation, it is essential to validate the extracted data before storing it in the data warehouse. The extraction and transformation process is needed when the data comes from different DBMSs, hardware, operating systems, and communication protocols. The source of the data may be any relational database, conventional file system repositories, spreadsheets, document files, CSVs, etc.
So, it becomes evident that we need the schemas of all the source data as well as the schema of the target system before performing the ETL process. This will help us identify the nature of the required transformations.
Three data extraction methods: The three extraction methods are listed below. It should be noted that, irrespective of the extraction method used, the performance and working of the source and target databases should not be affected. A critical aspect in this regard is when to perform the extraction, as it may make the company's source database unavailable to customers or, in a milder case, may affect the performance of the system. Keeping all this in mind, the following are the three types of extraction performed.
1. Full extraction
2. Partial extraction—without update notification
3. Partial extraction—with update notification.
1.14.2 Transformation
Data extracted from source systems may be in a raw format that is not useful for the target BI process. The primary reason is that the data schemas are designed according to local organizational systems, and various anomalies may be present in those systems. Hence, before moving this data to the target system for the BI process, we need to transform it. This is the core step of the entire ETL process, where value is added to the data for BI and analysis purposes. We perform different functions to transform the data; however, sometimes transformation may not be required, and such data is called direct move or pass-through data.
The transformation functions performed in this stage are defined according to requirements, e.g. we may require the monthly gross sales of each store by the end of the month, while the source database may only contain individual time-stamped transactions on a daily basis. Here we have two options: we may simply pass the data to the target data warehouse and calculate the monthly gross sales at runtime whenever required, or we may perform the aggregation and store the monthly aggregated data in the data warehouse. The latter provides higher performance compared to calculating the monthly sales at runtime. Now we will provide some examples of why transformation may be required.
1. The same person may have different name spellings, e.g. Jon or John.
2. A company can be represented in different ways, e.g. HBL or HBL Inc.
3. Different spellings of the same name may be used, e.g. Cleaveland and Cleveland.
4. The same person may have different account numbers generated by different applications.
5. Data may have different semantics.
6. The same name, e.g. "Khursheed", can belong to either a male or a female.
7. Fields may be missing.
8. We may use derived attributes in the target system, e.g. "Age", which is not present in the source system. We can apply the expression "current date minus DOB" and store the resulting value in the target data warehouse (again, to enhance performance), as in the sketch below.
There are two types of transformations, as follows:
Multistage data transformation—This is the conventional method, where data is extracted from the source, stored in intermediate storage, transformed, and then moved to the data warehouse.
In-warehouse data transformation—This is a slight modification of the conventional method. Data is extracted from the source and moved into the data warehouse, and all the transformations are performed there. Formally it may be called ELT, i.e. the extract, load, and transform process.
Each method has its own merits and demerits, and the selection of either may depend upon the requirements.
1.14.3 Loading
This is the last step of the ETL process. In this step the transformed data is loaded into the data warehouse. It normally involves loading huge amounts of data, so the process should be optimized and performance should not be degraded.
However, if for some reason the process results in a failure, measures are taken to restart the loading process from the last checkpoint, and the failure should not affect the integrity of the data. The entire loading process should therefore be monitored to ensure its success. Based on its nature, the loading process can be categorized into two types:
Full load: All the data from the source is loaded into the data warehouse for the first time. However, it takes more time.
Incremental load: Data is loaded into the data warehouse in increments. Checkpoints are recorded; a checkpoint represents the time stamp from which onward data will be stored into the data warehouse.
A full load takes more time but is relatively less complex, whereas an incremental load takes less time but is relatively more complex.
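A minimal sketch of an incremental load driven by a checkpoint time stamp; the in-memory lists stand in for the source system and the warehouse, and all names and dates are assumptions for the example.

# Incremental load: only records newer than the last checkpoint are moved
from datetime import datetime

source_rows = [
    {"id": 1, "updated": datetime(2020, 1, 1)},
    {"id": 2, "updated": datetime(2020, 2, 1)},
    {"id": 3, "updated": datetime(2020, 3, 1)},
]
warehouse = []                        # stand-in for the target data warehouse
checkpoint = datetime(2020, 1, 15)    # time stamp recorded after the last load

new_rows = [r for r in source_rows if r["updated"] > checkpoint]
warehouse.extend(new_rows)                          # load only the increment
checkpoint = max(r["updated"] for r in new_rows)    # record the new checkpoint
print(len(warehouse), checkpoint)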
ETL Challenges:
Apparently the ETL process seems to be an interesting and simple tool-oriented approach, where a tool is configured and the process starts automatically. However, there are certain aspects that are challenging and need substantial consideration. Here are a few:
• Quality of data
• Unavailability of accurate schemas
• Complex dependencies between data
• Unavailability of technical persons
• ETL tool cost
• Cost of the storage
• Schedule of the load process
• Maintaining the integrity of the data.
1.15 Data Science
Data science is a multidisciplinary field that focuses on the study of all aspects of data, right from its generation and processing to converting it into a valuable source of knowledge.
Being a multidisciplinary field, it uses concepts from mathematics, statistics, machine learning, data mining, artificial intelligence, etc. Having a wide range of applications, data science has become a buzzword now. With the realization of the worth of insights from data, organizations and businesses are striving for the best data science and analytics techniques for the good of the business.
Fig. 1.9 The life cycle of a data science project: discovery, data preparation, model planning, model development, operationalization, and communication of results
Bigger companies like Google, Amazon, and Yahoo have shown that carefully storing data and then using it to extract knowledge, make decisions, and adopt new policies is always worthwhile. So, small companies are also striving for such use of data, which has ultimately increased the demand for data science techniques and skills in the market, and efforts are underway to decrease the cost of providing data science tools and techniques.
Now we will explain the different phases of the life cycle of a data science project. Figure 1.9 shows the process in pictorial form.
• Phase 1—Discovery: The first phase of a data science project is discovery. It is the discovery of the available sources that you have and that you need, your requirements, your output (i.e. what you want out of the project), its feasibility, the required infrastructure, etc.
• Phase 2—Data preparation: In this phase you explore your data and its worth. You may need to perform some preprocessing, including the ETL process.
• Phase 3—Model plan: Now you think about the model that will be implemented to extract the knowledge from the data. The model will work as the basis for finding patterns and correlations in the data as per your required output.
• Phase 4—Model development: The next phase is the actual model building based on training and testing data. You may use various techniques like classification and clustering, based on your requirements and the nature of the data.
• Phase 5—Operationalize: The next stage is to implement the project. This may include delivery of the code, installation of the project, delivery of the documents, and giving demonstrations, etc.
• Phase 6—Communicate results: The final stage is evaluation, i.e. you evaluate your project based on various measures like customer satisfaction, achievement of the goal the project was developed for, accuracy of the project's results, and so on.
1.17 Summary
In this chapter, we have provided some basic concepts of data science and analytics. Starting from the basic definition of data, we discussed several of its representations. We provided details of different types of data analytics techniques and the challenges that are normally faced in implementing the data analytics process. We also looked at various tools available for developing a data science project at any level. Finally, we provided a broad overview of the phases of a data science project.
Chapter 2
Applications of Data Science
Data science has a wide range of applications. Today, when we have already entered the era of information, all walks of life, right from small businesses to mega industries, are using data science applications for their business needs. In this chapter we will discuss different applications of data science and their corresponding benefits.
2.1 Data Science Applications in Healthcare
Data science has played a vital role in healthcare by predicting patterns that help in curing contagious and long-lasting diseases using fewer resources.
Extracting meaningful information and patterns from patients' histories stored in hospitals, clinics, and surgeries helps doctors refine their decision making and advise the best medical treatment, which helps people live longer than before. These patterns provide the data well before time, which also enables insurance companies to offer packages suitable for patients.
Emerging technologies are providing mature directions in healthcare to dig deep for insight and reveal more accurate, critical medical schemes. These schemes are taking treatment to the next level, where the patient also has sufficient knowledge about existing and upcoming diseases, and it becomes easier for doctors to guide patients in a specific way. Nations are taking advantage of data science applications to anticipate and monitor medical facilities in order to prevent and solve healthcare issues before it is too late.
Due to the rising costs of medical treatment, the use of data science has become necessary. Data science applications in healthcare help reduce costs by various means; for example, providing information at earlier stages helps patients avoid expensive medical treatments and medications.
The rise in costs has become a serious problem for healthcare companies over the last 20 years. Healthcare companies are now rethinking how to optimize treatment procedures for patients by using data science applications. Similarly, insurance companies are trying to cut down costs and provide customized plans as per patient status. All this is guided by data-driven decision making, where insights are taken from the data to make plans and policies.
With this, healthcare companies are getting true benefits by using the power of analytical tools, software as a service (SaaS), and business intelligence to extract patterns and devise schemes that help all the stakeholders in reducing costs and increasing benefits. Doctors now not only have their individual educational knowledge and experience but also have access to a number of verified treatment models from other experts with the same specialization and in the same domain.
(1) Patient Predictions for Improved Staffing
Overstaffing and understaffing are typical problems faced by many hospitals. Overstaffing inflates salary and wage costs, while understaffing means the hospital is compromising on the quality of care, which is dangerous given sensitive treatments and procedures. Data science applications make it possible to analyze admission records and patient visiting patterns with respect to weather, day, month, time, and location, and to provide meaningful insights that help the staffing manager optimize staff placement in line with patient visits.
A Forbes report shows how hospitals and clinics, by using patient data and history, can predict future visits and place staff according to the expected number of patient visits. This not only results in optimized staff placement but also helps patients by reducing waiting times. Patients will have immediate access to their doctor. The same can be especially helpful in emergencies, where doctors will already be available.
(2) Electronic Health Records (EHRs)
Using data science applications in EHRs, data can be archived according to patient demographics, laboratory tests, allergies, and complete medical history in an information system, and doctors can be provided with this data through secure protocols, which ultimately helps them diagnose and treat their patients with more accuracy. This use of data science applications is called electronic health records (EHRs), where patients' data is shared with doctors and physicians through secure systems.
Most developed nations like the US have already implemented it; European countries are on the way to implementing it, while others are in the process of developing the rules and policies for implementation. Although EHR is apparently an interesting application of data science, it requires special consideration with respect to security, because an EHR involves the patient's personal data, which should be handled with care. This is also one of the reasons that various hospitals and clinics, even in developed countries, are still reluctant to participate in EHR systems. However, with secure systems and keeping in mind the benefits of EHRs to hospitals, clinics, and patients, the day is not far away when EHRs will be implemented in the majority of countries across the globe.
(3) Real-Time Alerting
Real-time alerting is one of the core benefits of data science applications in healthcare. In such applications, the patient's data is gathered and analyzed in real time, and the medical staff is informed about the patient's condition so that decisions can be taken well before time. Patients can use GPS-guided wearables that report the patient's medical state, for example blood pressure and heart rate, to the patient and the doctor. So, if the patient's blood pressure changes abruptly or reaches a dangerous level, the doctor can contact the patient and advise medication accordingly.
Similarly, for patients with asthma, such a system records asthma trends by using GPS-guided inhalers. This data is used for further research at the clinical level and at the national level to form a general policy for asthma patients.
(4) Enhancing Patient Engagement
With the availability of more advanced wearables, and by convincing patients of their benefits, we can develop patients' interest in using these devices. With these wearables, we can track changes to the human body and give feedback, with initial treatment, to reduce risk by avoiding critical situations regarding the patient's health.
Insurance companies are also advertising and guiding their clients to use these wearables, and many companies even provide them for free to promote these trackable devices, which are proving a great help in reducing patient visits and laboratory tests. These devices are gaining huge attention in the health market and giving fruitful results. They are also engaging researchers, who keep adding more and more features to these devices with the passage of time.
(5) Prevent Opioid Abuse
Drug addiction is an extreme problem in many countries, including developed nations, where billions of dollars have already been spent and programs are still underway to devise solutions for it.
The problem is getting worse each year, and research is in progress to come up with solutions. Many risk factors have already been identified that predict, with high accuracy, the patients at risk of opioid abuse.
Although it is challenging to identify and reach such persons and convince them to avoid drug abuse, we can hope that success can be achieved with a little more effort by both doctors and the public.
(6) Using Health Data for Informed Strategic Planning
With the help of data science, health organizations create strategic plans for patients in order to provide better treatment and reduce costs. These plans help the organizations examine the latest situation in a particular area or region, for example the current state of an existing chronic disease in a certain region.
Using advanced data science tools and techniques, we can come up with proactive approaches for handling emergencies and critical situations. For example, by analyzing the data of a certain region, we can predict coming heat waves or dengue virus outbreaks and can establish clinics and temporary facilities beforehand, ultimately avoiding a critical situation in the region.
(7) Disease Cure
Data science applications provide great opportunities to treat diseases like cancer and thus give relief to cancer patients. We can predict the disease and provide directions to cure and treat different patients at different stages. However, all this requires cooperation at various levels. For example, individuals should be convinced to participate in the information-providing process, so that if any information is required from a cancer patient, he or she is willing to provide it. Furthermore, organizations that already have security policies may be reluctant to share such data. Beyond this there are many other issues, e.g. the technical incompatibility of diagnostic systems, which are often developed locally without keeping their integration with other systems in mind. Similarly, there may be legal issues in sharing such information. Furthermore, we also need to change the mindset that hinders an organization from sharing its successes with others, perhaps for business reasons.
All this requires a significant effort to resolve such issues and come up with solutions. However, once all such data is available and interlinked, researchers can come up with models and cures to help cancer patients with more effectiveness and accuracy.
(8) Reduce Fraud and Enhance Security
Data breaches are common because data is often poorly secured, while it has great value in terms of money and competitive advantage and can be used for unfair means. A data breach can occur for many reasons, e.g. viruses, cyber-attacks, penetration of a company's network through friendly-pretending nodes, etc. However, it should be noted that as systems become more frequent victims of cyber-attacks and data hacks, security solutions are also maturing day by day through the use of encryption methodologies, anti-virus software, firewalls, and other advanced technologies.
By using data science applications, organizations may predict possible attacks on their organizational data. Similarly, by analyzing patients' data, possible frauds in patients' insurance claims can also be identified.
(9) Telemedicine
Telemedicine is the process of consulting a doctor or physician using advanced technologies, without personal physical visits. It is one of the most important applications of data science. It helps deliver health services in remote areas and is a special help for developing countries where health facilities are not available in rural areas.
Patients can contact their doctor through video conferencing, smart mobile devices, or any other available service.
The above-mentioned use is just a trivial example. Doctors can now perform surgeries through robots even while sitting far away from the actual operating room. People are getting the best and most immediate treatment, which is more comfortable and less costly than arranging visits and standing in long lines. Telemedicine also helps hospitals reduce costs, manage other critical patients with more care and quality, and place staff as per requirements. It also allows the healthcare industry to predict which diseases, and which stages of disease, may be treated remotely so that personal visits can be reduced as much as possible.
(10) Medical Imaging
Medical images are becoming increasingly important in diagnosing disease, a task that requires high skill and experience when performed manually. Moreover, hospitals need a huge budget to store these images for a long time, as they may be needed at any point in the future for the treatment of the patient concerned. Data science tools make it possible to store these images in an optimized way that takes less storage; the algorithms also generate patterns from the pixels and convert them into numbers that help medical assistants and doctors analyze a particular image and compare it with other images to perform the diagnosis.
Radiologists are the people most directly involved in interpreting these images. As their judgment may vary due to mood swings and many other human factors, the accuracy of the extracted information may be affected. Computers, on the other hand, do not get tired and behave consistently in interpreting images, resulting in the extraction of quality information. Furthermore, the process is more efficient and time saving.
(11) A Way to Prevent Unnecessary ER Visits
Better and optimized use of resources such as money, staff, and energy is another main benefit of data science applications, and implementing such systems makes it possible. In one reported case, a woman with a mental illness visited hospitals 900 times in a single year. This is just one case of its type that has been reported, but there can be many similar cases that place an extra burden on healthcare institutions and other taxpayers.
A healthcare system can tackle this problem by sharing the patient's information among the emergency departments and clinics of hospitals. Hospital staff can then check a patient's visits to other hospitals and the dates and times of laboratory tests, and decide whether or not to suggest re-testing based on recently conducted tests. In this way hospitals can save time and other resources.
An ER system may help the staff in the following ways:
• It may help in finding the medical history and the patient's visits to nearby hospitals.
• Similarly, we can find out whether the patient has already been assigned to a specialist in a nearby hospital.
• We can see what medical advice was given to the patient previously, or is currently being given, by another hospital.
Another benefit of such a healthcare system is that it helps utilize all resources in a better way for the care of patients and taxpayers. Prior to these systems, a patient could get the same medical checkup again and again, causing a great burden on all stakeholders.
It is not only students who benefit: data science applications can be a great assistance to teachers and faculty members in identifying the areas where they need to give more focus in class. By identifying patterns, mentors can help a particular student or group of students. For example, if a question was attempted by only a few students out of 1000, there may be an issue with the question itself, so the teacher can modify it, change it, or re-explain its contents, perhaps in more detail.
3. To Help Developing Curriculum and Learning Processes
Course development and the learning process are not static procedures; both require continuous effort to develop or enhance courses and learning processes. Educational organizations can gain insights for developing new courses according to the competency and levels of students and for enhancing learning procedures. After examining students' learning patterns, institutions will be in a better position to develop courses that give learners confidence in acquiring the skills they desire, according to their interests.
4. To Help Administrators
An institution's administration can devise policies with the help of data science applications: which courses should be promoted more, which areas are currently in demand, what marketing or resource optimization policies should be implemented, and which students may benefit most from the offered courses and programs.
5. Prevent Dropouts and Learn From the Results
Data science applications can help educational institutes find current and future trends in the market and industry, which in turn can help them shape the future of their students. They can study industry requirements for the near future and help students select the best study programs. Industry may even share its future trends to help launch a particular study program. This ultimately gives students greater confidence and fewer worries regarding their job security in the future.
Industries can also develop recruitment strategies and share them with institutes so that new courses can be designed according to industry needs. In addition, industry can offer internship opportunities to students by assessing their learning skills.
7. It Helps You Find Answers to Hard Questions
8. It's Accessible
As many departments are interlinked and generate large volumes of data, analyzing this data helps the institution develop strategic plans to place the required staff in the right place. Data science applications can thus help optimize resource allocation, the hiring process, transportation systems, etc. You are in a better position to develop infrastructure according to the needs of existing and upcoming students at minimum cost.
Similarly, you can develop a new admission policy according to the existing available resources, mentor the staff, and polish their skills to build goodwill in the community.
So, with the help of data science applications, your institution can extract deep insight from its information and make the best use of its resources, gaining maximum benefit at minimum cost.
10. It's Quick
There is no manual work: all departments generate their own bulk data, and this data is stored in centralized repositories. So, should you need any information, it is already available at hand without any delay. This ultimately means that you have enough time to act on any event.
So, with data science tools, you have central repositories that let you take timely decisions and make accurate action plans for the betterment of all stakeholders of the institution.
consequently were reluctant to apply all this technology. However, the growing population and the increase in manufactured goods have resulted in large amounts of data. Combined with the lower cost of technological equipment, it has now become possible for industries to adopt intelligent tools and apply them in the manufacturing process at minimum cost.
With this, data science applications are now helping industries optimize their processes and the quality of their products, resulting in increased satisfaction for customers and for the industry itself. This advancement is playing a vital role in expanding businesses, since gaining maximum benefit at minimum cost is a great attraction. Without data science tools, various factors added to the cost, such as the large amount of labor required, the lack of prediction mechanisms, and the absence of resource optimization policies.
1. Predictive Maintenance
The use of intelligent sensors in industry provides great benefits by predicting potential faults and errors in machinery. A prompt response to fix these minor or major issues saves industry owners from having to re-install or buy new machinery, which would require huge financial expenditure. Moreover, predictive analysis of the manufacturing process can enhance production with high accuracy, giving buyers the confidence to predict their sales and take corrective actions where required. With further advancement in data science tools and techniques, industry is gaining more from this day by day.
2. Performance Analyses
A fault in any part of the manufacturing equipment may cause great loss and change all the delivery dates and production levels. However, the inclusion of technology in all parts of industry, e.g. biometric attendance recording, handling and fixing of errors with the help of robotics and smart repair tools, fault prediction, etc., brings excellent benefits in reducing the downtime of manufacturing units, which in turn builds confidence in using these technologies in the manufacturing process.
4. Improved Strategic Decision Making
Data science helps businesses make strategic decisions based on ground realities and organizational contexts. Various tools are available, including data cleanup tools, profiling tools, data mining tools, data mapping tools, data analysis platforms, data visualization resources, data monitoring solutions, and many more. All these tools make it possible to extract deep insight from the available information and make the best possible decisions.
5. Asset Optimization
With the emergence of the Internet of Things (IoT) and data science applications, businesses can now enhance the production process by automating tasks, resulting in optimized use of their assets.
With data science tools, it can now be determined when to use a certain resource and to what level of utilization it will be most beneficial.
6. Product Design
Launching a new product design is always a risky job. Without thorough analysis you cannot simply purchase a new production unit for a new product, as it may involve a huge financial budget and an equally large risk of failure. You will want to avoid the cost of purchasing equipment for a product whose success or failure you are not sure about.
However, with the help of data science applications, by looking into the available information you can determine whether or not to launch a new design. Similarly, you can estimate the success or failure ratio of a product that has not been launched so far. This provides a potential advantage over competing businesses, as you can perform pre-analysis and launch an in-demand product before someone else does, giving you a better opportunity to capture a certain portion of the market.
7. Product Quality
Designing a high-quality product with the help of end users always produces good financial benefits. End users provide feedback through different channels such as social media, online surveys, review platforms, video streaming feedback, and other digital media tools. Gathering user feedback in this way always helps in identifying the most-needed products and their features. Even a product that failed in the market can be enhanced and updated effectively after incorporating user feedback and investigating customer trends in the market using data science applications.
8. Demand Forecasting
9. Customer Experience
Data science tools provide deep insight into customer data: buying trends, preferences, priorities, behaviors, etc. The data can be obtained from various sources and is then consolidated into central repositories, where data science tools are applied to extract the customer experience. These insights then help in policy and decision making to enhance the business. For example, customer experience may help us maintain a minimum stock level of a certain product in a specific season.
A very common example of such insight is to place together the products that are normally sold together, e.g. milk, butter, and bread.
10. Supply Chain Optimization
The supply chain is always a complex process. Data science applications help find the necessary information, for example the suppliers with high efficiency, the quality of the products they deliver, their production capacity, etc.
Data science plays a critical role in strategic decision making when it comes to sports. Predicting a player's performance using previous data is nowadays a common application of data science. As each game and its complete context can be stored in databases, data science applications can help team coaches assess the validity of their decisions and identify the weak areas and discrepancies that resulted in losing a certain game. This can be a great help in overcoming these discrepancies in the future and thus increasing the performance of players and teams.
You can find not only your own team's discrepancies but also the gray areas in opposing teams. This will help you come up with a better strategy next time, which will ultimately increase your chances of success in the future.
2. Deep Insight
Data science applications require a central repository where data from multiple sources is stored for analysis. The more data is stored, the more accurate the results. However, this has its own cost: storing and processing such large amounts of data requires large storage volumes, memory, and processing power. But once all this is available, the insight obtained from the data using data science tools is always worthwhile, as it helps you predict future events and make decisions and policies accordingly.
3. Marketing Edge
Nowadays, sport has become a big business. The more popular you are, the more companies you will attract to place their ads. Data science applications now provide better opportunities to target your ads than in earlier days. By identifying fan clubs, player and team ratings, and the interests of the people, you are in a better position to devise marketing strategies that will ultimately give your product greater reach.
4. On-site Assistance and Extra Services
Collecting information from different sources such as ticket booths, shopping malls, and parking areas, and properly storing and analyzing it, can help in providing on-site assistance to people and ultimately increasing revenue. For example, you can provide better car parking arrangements based on a person's seating preference at a match. Similarly, serving a spectator's favorite food at their seat during a particular match makes them happier, since they enjoy their favorite food and the game at the same time; this will increase your revenue, as people will try to attend such matches again in the future.
Historical data is the backbone of any data science application, and the same is true in the domain of cyber-security. You have to store a large amount of historical data to determine what a normal service request is and what an attack is. Once the data is available, you can analyze it to find malicious patterns in service requests, resource allocations, etc.
Once a pattern is verified to be an attack, you can monitor for the same pattern in the future and deny it if it occurs again. This can ultimately save companies from huge data and business losses, especially in sensitive domains such as defense-related, finance-related, or healthcare-related data.
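As a hedged illustration of this pattern-based approach (not a system described in this book), the sketch below trains an unsupervised anomaly detector on invented "normal" traffic statistics and then flags a request profile that deviates strongly from them; the feature choice and values are assumptions.

# Illustrative sketch: flagging unusual requests with an anomaly detector
# trained on historical traffic. Features and values are hypothetical.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Historical "normal" traffic: [requests_per_minute, kilobytes_transferred]
normal_traffic = rng.normal(loc=[50, 200], scale=[10, 40], size=(1000, 2))

detector = IsolationForest(contamination=0.01, random_state=0)
detector.fit(normal_traffic)

# One typical observation and one that looks like a flood/exfiltration attempt.
new_requests = np.array([[55, 210], [900, 5000]])
for row, label in zip(new_requests, detector.predict(new_requests)):
    print(row, "attack-like" if label == -1 else "normal")  # -1 marks anomalies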
2. Monitoring and Automating Workflows
All the workflows in a network need to be monitored, either in real time or offline. Studies show that data breaches are mostly caused by local employees. One solution to this problem is to implement an authorization mechanism so that only authorized persons have access to sensitive information. However, whether a flow is generated from local systems or from some external source, data science tools can help you automatically identify the patterns of these flows and consequently help you avoid the worst consequences.
3. Deploying an Intrusion Detection System
Data science applications in the airline industry are proving to be a great help in devising policies according to customer preferences. A simple example is the analysis of the booking system: by analyzing it, airlines can offer customers personalized deals and thus increase their revenue. You can identify the most frequent customers and their preferences and provide them with the best experience as per their demands.
Furthermore, airlines are not just about ticketing. By using data science applications, you can find optimized routes and fares and enhance your customer base. You can come up with deals that most travelers will want to take, ultimately enhancing your revenue.
1. Smart Maintenance
Baggage handling is a big problem at airports, but data science tools provide better solutions for tracking baggage in real time using radio frequencies. As air traffic is increasing day by day and new problems emerge daily, data science tools can be a big help in all these scenarios. For example, with increased air traffic, intelligent routing applications are required; such applications can make the journey much safer than before. Furthermore, you can predict future issues that are normally difficult to handle at runtime; once these issues are identified beforehand, you can have contingency plans to fix them. For example, sudden maintenance is hard to handle at runtime, but predicting it ahead of time enables the industry to take reasonable measures. Similarly, you can predict weather conditions and inform customers about delays in advance in order to avoid an unhappy experience or customer dissatisfaction.
2. Cost Reduction
There are many costs that airlines have to bear, one aspect being lost baggage. With real-time bag tracking, such costs can be significantly reduced, thereby avoiding customer dissatisfaction.
One of the major costs for airline companies is fuel. Using more fuel is expensive, while using too little is dangerous, so maintaining the right level is essential. Data science applications can dig out the relevant data, such as jet engine information, weather conditions, altitude, route information, distance, etc., and come up with an optimal fuel consumption plan, ultimately helping companies decrease fuel costs.
3. Customer Satisfaction
Airline companies take every measure to satisfy their customers by enhancing their experience with the company. Satisfaction depends on a number of factors, as everybody has an individual level of satisfaction, and creating an environment that satisfies a large number of customers is difficult. However, by analyzing customers' previous data, data science tools can provide maximum ease, such as their favorite food, preferred boarding, preferred movies, etc., which will ultimately make customers choose the same airline again.
4. Digital Transformation
The transformation of existing processes into a digital model is giving the airline industry a strong edge, whether related to customers or to other monitoring factors. Attractive dashboards and smart technological gadgets make it possible to provide a greater level of service on time, and airline companies can receive and analyze instant feedback in order to provide a better experience and enhanced service quality.
5. Performance Measurements
Airline companies normally operate at an international level and thus face tough competition. So, in order to remain in business, they not only have to measure performance but also ensure that they stay ahead of their competitors.
With the help of data science applications, airlines can automatically generate and analyze their performance reports, e.g. how many of the passengers who travelled last week chose the same airline again, or the fuel consumption on the same route compared to the previous flight with respect to the number of customers, etc.
6. Risk Management
Risk management is an important area where data science applications can help the airline industry. Whenever a plane takes off, multiple risks are attached to the flight, including changing weather conditions, the sudden unavailability of a route, malfunctions, and, most importantly, pilot fatigue due to constant flying.
Data science applications can help airlines overcome these issues and come up with contingency plans to manage all these risks, e.g. using dynamic routing mechanisms to route the flight along a different path at runtime in order to avoid any disaster. Similarly, to prevent pilots from flying while fatigued after long flying hours, optimal staff scheduling can be done by analyzing the pilots' medical data.
7. Control and Verification
In order to reduce costs and make the business successful, airlines need in-depth analysis of their historical data. Here data science applications can help by using a central repository containing data from various flights. One example is verifying the expected number of customers against the actual number of customers who travelled with the airline.
8. Load Forecasting
2.8 Summary
In this chapter we discussed a few of the applications of data science. As the size of data is increasing every second at an immense rate, manual and other conventional automation mechanisms are no longer sufficient, so the concepts of data science are now very relevant. We have seen how organizations in all walks of life are benefiting from the entire process. The overall intention was to emphasize the importance of data science applications in daily life.
Chapter 3
Widely Used Techniques in Data Science
Applications
The majority of machine learning algorithms these days are based on supervised machine learning techniques. Although it is a complete domain requiring separate in-depth discussion, here we will only provide a brief overview of the topic.
In supervised machine learning, the program already knows the desired output during training. Note that this differs from conventional programming, where we feed input to the program and the program produces output. Here we give the input and the corresponding output at the same time, in order to make the program learn what it should output for this or similar input.
This learning process is called model building. It means that, from the provided input and output, the system has to build a model that maps the input to the output, so that the next time the input is given to the system, the system produces the output using this model. Mathematically speaking, the task of a machine learning algorithm is to find the value of the dependent variable from the provided independent variables using the model. The more accurate the model, the more accurate the predictions, and the better the decisions based on it.
The dependent and independent variables are provided to the algorithm through a training dataset; we will discuss training datasets in upcoming sections. Two important techniques that use supervised machine learning are:
• Classification: Classification is one of the core machine learning techniques that
use supervised learning. We classify the unknown data using supervised learning,
e.g. we may classify the students of a class into male and female, or classify emails into two classes such as spam and non-spam.
• Regression: In regression, we also try to find the value of the dependent variable from the independent variables using the already provided data; the basic difference is that here the predicted output is a continuous (real) value rather than a discrete class label (see the short sketch below).
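As a brief, hedged illustration of these two settings (using scikit-learn and made-up toy values, not an example from this book), the following sketch trains a classifier that outputs a discrete label and a regressor that outputs a continuous value.

# Classification vs. regression on invented toy data.
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

# Classification: features -> a discrete class label (spam / not spam).
X_cls = [[0, 1], [1, 3], [5, 0], [6, 1]]      # e.g. [links, misspellings]
y_cls = ["not spam", "not spam", "spam", "spam"]
clf = DecisionTreeClassifier().fit(X_cls, y_cls)
print(clf.predict([[4, 2]]))                  # -> a class label

# Regression: features -> a real-valued output (e.g. a price).
X_reg = [[50], [80], [120], [200]]            # e.g. [area in square metres]
y_reg = [100.0, 155.0, 240.0, 410.0]
reg = LinearRegression().fit(X_reg, y_reg)
print(reg.predict([[100]]))                   # -> a continuous value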
Besides supervised learning, there is another technique called unsupervised learning. Although not as common as supervised learning, there are still a number of applications that use it.
As discussed above, in supervised learning we have a training dataset comprising both the dependent and the independent variables. The dependent variable is called the class, whereas the independent variables are called the features. However, this is not always the case: we may have applications where the class labels are not known. This scenario is called unsupervised learning.
In unsupervised learning, the machine learns from the training dataset and groups objects on the basis of similar features, e.g. we may group fruits on the basis of color, weight, size, etc.
This makes the task challenging, as we do not have labels to guide the algorithm. However, it also opens new opportunities to work with scenarios where the outcomes are not known beforehand; the only thing available is the data itself, on the basis of which the group of an unknown object must be predicted.
Let us discuss this with the help of an example. Suppose we are given a few shapes, including rectangles, triangles, circles, lines, etc., and the problem is to make the system learn about the shapes so that it may recognize them in the future.
In a supervised learning scenario the problem is simple, because the system is fed with labeled data: the system is told that a shape with four sides is a rectangle, a shape with three sides is a triangle, a closed round shape is a circle, and so on. So, the next time a shape with four sides is provided, the system will recognize it as a rectangle; similarly, a shape with three sides will be recognized as a triangle, and so on.
However, things are a little messier in the case of unsupervised learning, as we do not provide any prior labels to the system. The system has to examine the figures and group similar shapes by recognizing their properties. The shapes are grouped and given system-generated labels. That is why it is more challenging than supervised classification.
This also makes it more error prone compared to supervised machine learning techniques. The more accurately the algorithm groups the input, the more accurate the output and thus the decisions based on it. One of the most common techniques using unsupervised learning is clustering. There are a number of clustering algorithms that cluster data using its features; we devote a complete chapter to clustering and related techniques later. Table 3.1 shows the difference between supervised and unsupervised learning.
A/B testing is a strategy used to test your online promotions, advertising campaigns, Web site designs, application interfaces, etc.; fundamentally, the test analyzes user experience. We present two different versions of the same thing to users and try to analyze the user experience of both. The version that performs better is considered the best.
This testing strategy therefore helps you prioritize your policies: you can find out what is more effective compared to the alternatives, giving you the chance to improve your advertisements and business policies. We will now discuss some important concepts related to A/B testing.
Conducting an A/B test requires proper planning, including what to test, how to test, and when to test. Giving thorough consideration to these aspects will let you run a successful test with more meaningful results, because it helps you narrow your experiment down to the exact details you want to learn from your customers.
For example, decide whether you want to test your sales promotion or your email template, and whether the test will be conducted online or offline. If it is to be conducted online, decide whether you will present your actual site to the user or adopt some other way. Suppose you want to test a sales promotion through your online Web site: you can then focus on those parts of the Web site that present sales-related content to users. You can provide different copies and find out which design converts into business. This will give you not only more online business but also the justification to spend more on effective sales promotions.
Once you have decided what to test, you then identify the variables to include in your test. For example, if you want to test a sales ad, your variables may include:
• Color scheme
• The text of the advertisement
• The celebrities to be hired for the campaign.
It is important to carefully identify all the variables that may have a strong impact on your campaign and include them in the test.
Similarly, you should know the possible outcomes you are testing for. All the possibilities for each option should be tested, so that the option which provides better results can be chosen.
You should also conduct your tests simultaneously and continually, so that the effectiveness of whatever you are testing can be tracked over time and you can keep making dynamic decisions appropriate to the current scenario and situation.
A/B testing can have a huge impact on your business by revealing the user experience related to whatever you are testing. Launching something without knowing the user experience can be expensive. A/B testing gives you the opportunity to learn customer interests before launching. By knowing user preferences and the outcomes of the ads that performed better, you have justification for spending more on the campaigns that actually convert into profitable business.
It will also help you avoid strategies that offer little value to customers and thus contribute little. You can find out customer priorities and, in turn, give customers what they want and how they want it.
In the context of A/B testing, an important question is what you can test. Apparently you can test anything, from the format of your sales letter to a single image on your business Web site. However, this does not mean you should spend time and resources on testing everything related to your business. Carefully identify the things that have a strong impact on your business; only those are worth testing.
Similarly, once you have decided to test, for example, two newsletters A and B, make sure to test all possible combinations, e.g. the header of newsletter A with B and the header of newsletter B with A, and so on. This may require some careful preparation before conducting the tests.
Some examples of what you can test:
• Newsletter:
– Header
– Body text
– Body format
– Bottom image
• Web site:
– Web site header
– Web site body
– Sales advertisement
– Product information
– Web site footers
• Sales advertisement:
– Advertisement text
– Product images
– Celebrities images
– Position of the advertisement in newspaper.
You should carefully select the time period for which the test will be conducted; it depends on the frequency of responses you get. For example, if you are conducting a test on your Web site and you have a lot of traffic per day, you can run your test for just a few days, and vice versa.
If you allocate insufficient time to your test, you will get insufficient responses and thus skewed results. So, before selecting the time period, make sure the test stays open long enough to collect enough responses for accurate decisions.
Similarly, giving a test too much time may also give you skewed results. Note that there are no exact guidelines or heuristics about the time period of a test; as discussed above, you should select the period carefully, keeping in mind the frequency of responses you get. Past experience in this regard can be helpful.
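One common way to decide whether the observed difference between the two versions is real rather than noise is a two-proportion z-test. The sketch below is a minimal illustration with invented conversion counts; it is one possible analysis, not a procedure prescribed by this chapter.

# Comparing the conversion rates of two ad versions with a two-proportion z-test.
from math import sqrt, erf

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return the z statistic and two-sided p-value for the difference
    between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided
    return z, p_value

# Version A: 120 conversions from 2400 visitors; version B: 165 from 2500.
z, p = two_proportion_z_test(120, 2400, 165, 2500)
print(f"z = {z:.2f}, p-value = {p:.4f}")  # a small p-value favours keeping B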
Association rules are another machine learning technique; they try to find relationships between items. An important application of association rules is market basket analysis. Before going into the details of association rules, let us discuss an example.
In retail stores it is always important to place similar items (i.e. items that sell together) close to each other. This not only saves customers' time but also promotes cross-item sales. However, finding these item relationships requires some prior processing of the data.
It should be noted that association rules do not reflect an individual's interests and preferences, only the relationships between products. We find these relationships by using data from previous transactions.
Now let us discuss the details of association rules. A rule consists of two parts, an antecedent and a consequent, as shown below; both are lists of items.
Antecedent → Consequent
The rule simply states that if the antecedent is present, then the consequent is present as well, i.e. the implication represents co-occurrence. For a given rule, the itemset is the list of all the items in the antecedent and the consequent, for example:
{Bread, Egg} → {Milk}
Here the itemset comprises bread, egg, and milk. The rule may simply mean that customers who purchased bread and eggs also purchased milk most of the time, which suggests placing these products close to each other in the store.
There are a number of metrics that can be used to measure the accuracy and other aspects of association rules. Here we will discuss a few.
3.5.1 Support
Support measures how frequently an itemset appears in the data: it is the fraction of all transactions that contain the itemset. For example, if 100 transactions are recorded and 6 of them contain both butter and milk, then Support({Butter, Milk}) = 6/100 = 0.06.
3.5.2 Confidence
Suppose we want to find the confidence of the rule {Butter} → {Milk}. If there are 100 transactions in total, out of which 6 have both milk and butter, 60 have milk without butter, and 10 have butter but no milk, then the confidence is the fraction of butter-containing transactions that also contain milk:
Confidence({Butter} → {Milk}) = 6/(6 + 10) = 0.375
3.5.3 Lift
Lift is the ratio of the probability of the consequent being present given the antecedent, to the probability of the consequent being present without any knowledge of the antecedent. Mathematically:
Lift(A → B) = P(B | A) / P(B)
Again, we take the previous example. Suppose there are 100 transactions in total, out of which 6 have both milk and butter, 60 have milk without butter, and 10 have butter but no milk. The probability of having milk, given that butter is present, is 6/(10 + 6) = 0.375.
Similarly, the probability of having milk without any knowledge about butter is 66/100 = 0.66.
Now Lift = 0.375/0.66 ≈ 0.57.
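The sketch below shows how these three metrics can be computed directly from a small list of transactions; the transactions are invented for illustration, and the helper functions are not from any particular library.

# Support, confidence, and lift computed from invented transactions.
transactions = [
    {"bread", "egg", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "egg"},
    {"milk", "bread", "butter"},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    return confidence(antecedent, consequent) / support(consequent)

rule = ({"bread", "egg"}, {"milk"})
print("support   :", support(rule[0] | rule[1]))
print("confidence:", confidence(*rule))
print("lift      :", lift(*rule))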
A decision tree is used to show the possible outcomes of different choices that are made on the basis of certain conditions. Normally we have to make decisions by weighing different factors, e.g. cost, benefit, availability of resources, etc. This can be done using decision trees, where following a complete path leads us to a specific decision.
Figure 3.1 shows a sample decision tree. We start with a single node that splits into different branches. Each branch refers to an outcome, and once an outcome is reached we may have further branches; thus all the nodes are connected with each other. We follow a path on the basis of the different values of interest.
There are three different types of nodes:
Chance nodes: Chance nodes show the probabilities of certain outcomes. They are represented by circles.
Decision nodes: As the name implies, a decision node represents a decision that is made. It is represented by a square.
End nodes: An end node represents the final outcome of a decision path. It is represented by a triangle.
Figure 3.2 shows some decision tree notations.
To draw a decision tree, you first have to identify all the input variables on which your decisions are based, along with their possible values. Once you have all this, you are ready to construct the tree. Following are the steps to draw a decision tree:
1. Start with the first (main) decision. Draw the decision node symbol and then draw different branches out of it based on the possible options. Label each branch accordingly.
2. Keep adding chance and decision nodes using the following rules:
• If the previous decision leads to another decision, draw a decision node.
• If you are still not sure about the outcome, draw a chance node.
• Stop if the problem is solved.
3. Continue until each path has an end node, which means that all the possible paths on which decisions will be based have been drawn. Now assign values to the branches; these may be probabilities, business profits, expenditures, or any other quantities guiding the decision.
Decision trees also have applications in machine learning, data mining, and statistics. We can build various prediction models using decision trees. These models take as input the values of different attributes of an item and try to predict the output on the basis of those values.
In the context of machine learning, these types of trees are normally called classification trees; we will discuss them in detail in upcoming chapters. The nodes in these trees represent the attributes on the basis of which classification is performed, and the branches are made on the basis of the values of those attributes. They can be represented in the form of if-then-else conditions, e.g. if X > 75, then TaxRate = 0.34. The end nodes are called leaf nodes and represent the final classification.
Just as a decision tree may contain a series of events, its machine learning counterpart, the classification tree, may also have a number of connected nodes, thus forming a hierarchy. The deeper the tree structure, the more closely it may fit the data used to build it.
However, note that we may not always have discrete values; there may be scenarios where continuous quantities, such as the price or height of a person, etc., need to be modeled. For such situations we have another version of decision trees called regression trees.
It should be noted that an optimal classification tree is one which models most of the data with the minimum number of levels. There are multiple algorithms for creating classification trees; some common ones are CART, ASSISTANT, CLS, and ID3/4/5.
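For a hedged, concrete illustration, the sketch below trains a small classification tree with scikit-learn's CART implementation on an invented two-feature dataset and prints the learned if-then-else rules; the feature names and values are assumptions for demonstration only.

# A classification tree on invented data, printed as nested rules.
from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [income, age]; label: tax bracket (purely illustrative values).
X = [[30, 25], [45, 32], [80, 41], [95, 50], [60, 29], [120, 45]]
y = ["low", "low", "high", "high", "low", "high"]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["income", "age"]))  # the learned rules
print(tree.predict([[75, 38]]))  # classify a new, unseen object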
There are a number of methods that can be used to perform cluster analysis. Here we will discuss a few of them.
Hierarchical methods
These methods can be further divided into two clustering techniques as follows:
Agglomerative methods: In this method each individual object initially forms its own cluster. Similar clusters are then merged, and the process goes on until we are left with one big cluster or K larger clusters.
Divisive methods: These work in the opposite, top-down direction: all objects start in a single cluster, which is repeatedly split until the desired number of clusters is reached.
The objects in cluster analysis can have different types of data. In order to measure similarity you need some measure that determines the distance between objects in some coordinate system; objects that are close to each other are considered similar. There are a number of different measures that can be used to find the distance between objects, and the choice of measure depends on the type of data. One of the most common measures is the Euclidean distance.
The Euclidean distance between two points is simply the length of the straight line between the points in Euclidean space. So, if we have points P1, P2, P3, …, Pn, and point i is represented by (yi1, yi2, …, yin) while point j is represented by (yj1, yj2, …, yjn), then the Euclidean distance dij between points i and j is calculated as:
$d_{ij} = \sqrt{(y_{i1} - y_{j1})^2 + (y_{i2} - y_{j2})^2 + \cdots + (y_{in} - y_{jn})^2}$
Here, Euclidean space is a two-, three-, or n-dimensional space in which each point or vector is represented by n real numbers (x1, x2, x3, …, xn).
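The formula translates directly into code; the short sketch below computes the Euclidean distance between two n-dimensional points (the example coordinates are arbitrary).

# Euclidean distance between two n-dimensional points.
from math import sqrt

def euclidean_distance(p, q):
    return sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

print(euclidean_distance((1, 2, 1), (2, 3, 1)))  # ≈ 1.414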
Different linkage techniques are used in hierarchical clustering. Each has its own mechanism for deciding how to join or split clusters. Here we provide a brief overview of a few.
This is one of the simplest methods for measuring the distance between two clusters. The method considers the points (objects) that are in different clusters but closest to each other; the distance between two clusters is defined as the distance between their two closest members or neighbors, which can be measured using the Euclidean distance. The two clusters with the smallest such distance are considered the most similar. This method is often criticized because it does not take the cluster structure into account.
This is just the inverse of the single linkage method. Here we consider the two points in different clusters that are farthest from each other, and the distance between these points is taken as the distance between the two clusters. Again it is a simple method, but just like single linkage it does not take cluster structure into account.
Here the distance between two clusters is taken as the average of the distances between each pair of objects, one from each cluster. This is normally considered a robust approach to clustering.
In this method we calculate the centroid of the points in each cluster. The distance between the centroids of two clusters is then taken as the distance between the clusters, and the clusters with the minimum distance between centroids are considered closest to each other. This method is also generally considered better than the single linkage (nearest neighbor) or furthest neighbor methods.
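As a small, hedged illustration of these linkage criteria (using SciPy on invented two-dimensional points), the sketch below builds an agglomerative hierarchy with each method and cuts it into three clusters.

# Agglomerative clustering with different linkage criteria on invented points.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[1.0, 1.0], [1.2, 1.1], [5.0, 5.0], [5.1, 4.8], [9.0, 1.0]])

for method in ["single", "complete", "average", "centroid"]:
    Z = linkage(points, method=method)               # build the merge hierarchy
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut it into 3 clusters
    print(f"{method:>8}: {labels}")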
In this method the desired number of clusters, i.e. the value of K, is determined in advance as user input, and we try to find the solution that generates the best K clusters. We will explain this method with the help of an example.
Consider the following example using hypothetical data. Table 3.3 shows an unsupervised dataset containing objects {A, B, C, D}, where each object is characterized by the features F = {X, Y, Z}.
Using K-means clustering, objects {A, D} belong to cluster C1 and objects {B, C} belong to cluster C2. If computed, the feature subsets {X, Y}, {Y, Z}, and {X, Z} produce the same clustering structure, so any of them can be used as the selected feature subset. Note that we may have several feature subsets that fulfill the same criterion; any of them can be selected, but efforts are made to find the optimal one, i.e. the one with the minimum number of features.
Following are the steps to calculate clusters in an unsupervised dataset. The steps of the algorithm are given below:
Do {
Step-1: Calculate the centroids
Step-2: Calculate the distance of each object from the centroids
Step-3: Group objects based on their minimum distance from the centroids
} until no object moves from one group to another.
First of all we calculate the centroids; a centroid is the center point of a cluster. For the first iteration we may take any K points as centroids. Then we calculate the distance of each point from each centroid; here the distance is calculated using the Euclidean distance measure. Once the distance of each object from each centroid has been calculated, each object is assigned to the cluster whose centroid is closest to it. The iteration is then repeated by calculating new centroids for every cluster, using all the points that fall in that cluster after the completed iteration. The process continues until no point changes its cluster.
Now consider the data points given in the above dataset:
Step-1: We first take points A and B as the two centroids C1 and C2.
Step-2: We calculate the Euclidean distance of each point from these centroids using the Euclidean distance equation:
$d(x, y) = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2 + \cdots}$
In the resulting distance matrix, the value at row 1, column 1 represents the distance of point A from the first centroid (here point A itself is the first centroid, so it is the distance of point A from itself, which is zero); the value at row 1, column 2 represents the Euclidean distance between point B and the first centroid; similarly, the value at row 2, column 1 shows the Euclidean distance between point A and the second centroid, and so on.
Step-3: Looking at the third column, point C is closer to the second centroid than to the first. On the basis of the distance matrix, the following groups are formed:
$G^1 = \begin{pmatrix} 1 & 0 & 0 & 1 \\ 0 & 1 & 1 & 0 \end{pmatrix}$, with $C_1 = \{1, 2, 1\}$ and $C_2 = \{2, 3, 1\}$
where the value 1 shows that the corresponding point falls in that group. From the group matrix it is clear that points A and D are in one group (cluster), whereas points B and C are in the other group (cluster).
Now the second iteration starts. Since the first cluster contains the two points A and D, we first calculate their centroid by averaging their feature values, which gives C1 = {1, 2, 2}. Similarly, the centroid of the second cluster, containing points B and C, is C2 = {2, 2, 2}. On the basis of these new centroids we calculate the distance matrix, which is as follows:
$D^2 = \begin{pmatrix} 1 & 2 & 2 & 1 \\ 1 & 1 & 1 & 1 \end{pmatrix}$, with $C_1 = \{1, 2, 2\}$ and $C_2 = \{2, 2, 2\}$

$G^2 = \begin{pmatrix} 1 & 0 & 0 & 1 \\ 0 & 1 & 1 & 0 \end{pmatrix}$, with $C_1 = \{1, 2, 2\}$ and $C_2 = \{2, 2, 2\}$
Since $G^1 = G^2$, we stop here. Note that the first column of the $D^2$ matrix shows that point A is at an equal distance from both centroids $C_1$ and $C_2$, so it can be placed in either cluster.
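The same kind of grouping can be reproduced with an off-the-shelf K-means implementation. The sketch below uses scikit-learn on four invented three-feature objects (the values are assumptions, not the contents of Table 3.3).

# K-means with K = 2 on four invented objects A-D.
import numpy as np
from sklearn.cluster import KMeans

objects = ["A", "B", "C", "D"]
X = np.array([[1.0, 2.0, 1.0],   # A
              [6.0, 7.0, 6.0],   # B
              [6.5, 7.0, 5.5],   # C
              [1.0, 2.5, 1.5]])  # D

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
for name, label in zip(objects, kmeans.labels_):
    print(name, "-> cluster", label)   # A and D share one cluster, B and C the other
print("centroids:\n", kmeans.cluster_centers_)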
In general, the pattern recognition process comprises the steps shown in Fig. 3.4. We will now explain these steps one by one.
Data Acquisition Data acquisition is the process of obtaining and storing the data from which the patterns are to be identified. The data collection process may be automated, e.g. a sensor sensing real-time data, or manual, e.g. a person entering employee records. Normally the data is stored in the form of objects and their corresponding feature values, just like the data in Table 3.4.
Data Preprocessing The data collected for pattern recognition may not be in a ready-to-process form. It may contain anomalies, noise, missing values, etc. All of these deficiencies may affect the accuracy of the pattern recognition process and thus lead to inaccurate decisions. This is where the preprocessing step helps: all such anomalies are fixed, and the data is made ready for processing.
[Fig. 3.4: Pre-processing → Feature extraction → Classification → Post-processing]
Feature Extraction/Selection Once the data has been preprocessed, the next step is to extract or select the features that will be used to build the model. A feature is a characteristic, property, or attribute of an object of interest. Normally the entire dataset is not used, as datasets are large, with up to hundreds of thousands of features, and may contain redundant and irrelevant features as well. So, feature extraction/selection is the process of obtaining the features relevant to the problem. It should be noted that the accuracy of the model for pattern extraction depends on the accuracy of the feature values and their relevance to the problem.
Classification Once we have derived the feature vector from the entire dataset, the next step is to develop a classification model for identification of the patterns. The classification step is discussed in detail in upcoming chapters. Here it should be noted that a classifier simply reads the values of the data and, on the basis of those values, tries to assign a label to the data according to the identified pattern. Various classifiers are available, e.g. decision trees, artificial neural networks, Naïve Bayes, etc. Building a classifier requires two types of data, i.e. training data and testing data; we will discuss both in the next section.
Post-Processing Once the patterns have been identified and decisions made, we try to validate and justify the decisions made on the basis of the patterns. Post-processing normally includes steps to evaluate the confidence in these decisions.
The process of building a classification model requires training and testing the model on the available data. So, the data may be divided into two categories.
Training dataset: The training dataset, as the name implies, is required to train the model so that it can predict with maximum accuracy on unknown data in the future. We provide the training data to the model, check its output, and compare that output with the actual output to measure the error. On the basis of this comparison, we adjust (train) different parameters of the model. The process continues until the difference (error) between the produced output and the actual output is minimized.
Test dataset: Once the model is trained, the next step is to evaluate or test it, and for this purpose we use the test dataset. Note that both training and test data are already available, and we know the feature values and classes of their objects. The test dataset is provided to the model, and the produced output is compared with the actual output. We then compute different measures such as precision, accuracy, recall, etc.
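A minimal sketch of this train/test workflow is shown below, using scikit-learn with a standard built-in labelled dataset standing in for the kind of data described above; the split ratio and the choice of classifier are arbitrary assumptions.

# Train/test split and evaluation of a classifier.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

X, y = load_iris(return_X_y=True)             # a standard labelled dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)      # hold out 30% as the test set

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
y_pred = model.predict(X_test)                # predictions on unseen data

print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred, average="macro"))
print("recall   :", recall_score(y_test, y_pred, average="macro"))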
There are various applications of pattern recognition. Here we will mention just a few:
Optical Character Readers Optical character recognition models read a textual image and try to recognize the actual text, e.g. the postal codes written on letters. In this way the model can sort letters according to their postal codes.
Biometrics Biometrics may include face recognition, fingerprint identification, retina scans, etc. It is one of the most common and widely used applications of pattern recognition. Such models are helpful in automatic attendance marking, personal identification, criminal investigations, etc.
Diagnostic Systems Diagnostic systems are an advanced form of application that scans medical images (e.g. X-rays, images of internal organs, etc.) for medical diagnostics. This facilitates diagnosing diseases such as brain tumors, cancer, bone fractures, etc.
Speech Recognition Speech recognition is another interesting application of pattern
recognition. This helps in creating virtual assistants, speech-to-text converters, etc.
3.11 Summary
In this chapter, we provided a broad overview of the various techniques used in data
science applications. We discussed both classification and clustering methods and
techniques used for this purpose. It should be noted that each technique has its own
advantages and disadvantages. So, the selection of a particular technique depends on
the requirements of the organization, including the expected results, the type of
analysis required, and the nature of the available data.
Chapter 4
Data Preprocessing
It is essential to extract useful knowledge from data for decision making. However,
data is not always ready for processing: it may contain noise, missing values,
redundant attributes, etc. Data preprocessing is therefore one of the most important
steps to make data ready for final processing. Feature selection is an important task
used for data preprocessing; it helps remove noisy, redundant, and misleading features.
Based on its importance, in this chapter we will focus on feature selection and the
different concepts associated with it.
4.1 Feature
4.1.2 Categorical
Categorical features consist of symbols that represent domain values. For example,
to represent "Gender" we can use "M" or "F". Similarly, to represent employee type,
we can use "H" for hourly, "P" for permanent, etc. Categorical attributes are further
of two types.
Nominal In the nominal category of attributes, order does not make sense. For example,
for the "Gender" attribute there is no order: the equality operator applies, but the
less-than or greater-than comparisons do not make any sense.
We are living in a world where we are bombarded with tons of data every second.
With digitization in every field of life, the pace at which data originates is staggering.
It is common to have datasets with hundreds of thousands of records just for
experimental purposes. This increase in data results in the phenomenon called the
curse of dimensionality. Moreover, the increase is two-dimensional: we are not only
storing more attributes of real-world objects, but the number of objects and entities
being stored is also increasing. The ultimate drawback of these huge volumes of data
is that processing them for knowledge extraction and analytics becomes very tough
and requires a lot of resources. So, we have to find alternate ways to reduce the size,
ideally without losing information. One of the solutions is feature selection.
In this process only those features are selected that provide most of the useful
information. Ideally, we should get the same amount of information that the entire
set of features in the dataset would otherwise provide. Once such features have been
found, we can use them in place of the entire dataset. Thus, the process helps identify
and eliminate irrelevant and redundant features. Based on the facts mentioned above,
efforts are always made to select the best quality features. Overall, we can categorize
dimensionality reduction techniques into two categories.
(a) Feature Selection: A process which selects features from the given feature
set without transforming them or losing information. So, we preserve the data
semantics in the process.
(b) Feature Extraction: Feature extraction techniques, on the other hand, project
the current feature space onto a new feature subspace. This can be achieved by
combining features or applying some other mechanism. The process, however, has a
major drawback: we may lose information in the transformation, which means the
reverse process may not recover the information present in the original dataset. A
small sketch contrasting the two is given below.
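A minimal sketch of the distinction, assuming scikit-learn; the Iris data is only a stand-in, and mutual information and PCA are one possible pair of selection and extraction techniques.

```python
# Feature selection keeps original columns (semantics preserved);
# feature extraction projects onto new, combined axes (semantics lost).
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)

# Feature selection: keeps 2 of the original columns, so their meaning is preserved.
selector = SelectKBest(mutual_info_classif, k=2).fit(X, y)
X_selected = selector.transform(X)
print("Selected original feature indices:", selector.get_support(indices=True))

# Feature extraction: projects the data onto 2 new axes (linear combinations of the
# original features), so the transform is generally not exactly reversible.
X_projected = PCA(n_components=2).fit_transform(X)
print("Shape after projection:", X_projected.shape)
```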
As mentioned earlier, class labels are not always given; we then have to proceed
without labeling information, which makes unsupervised feature selection a somewhat
tougher task. To select features here, we can use clustering as the selection criterion,
i.e. we can select the features that give the same clustering structure that would be
obtained from the entire feature set. Note that a cluster is just a group of similar objects.
A simple unsupervised dataset is shown in Table 4.3.
The dataset contains four objects {X1, X2, X3, X4} and three features {C1, C2, C3}.
Objects {X1, X4} form one cluster and {X2, X3} form the other. Note that clustering
is explained in detail in upcoming chapters. By applying the nearest neighbor algorithm,
we can see that the same clusters are obtained if we use only the features {C1, C2},
{C2, C3}, or {C1, C3}. So, we can use any of these feature subsets, as the sketch below illustrates.
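A minimal sketch of this idea, assuming scikit-learn; KMeans and the adjusted Rand index are one possible way to compare the clustering structure obtained from a feature subset with that of the full feature set.

```python
# Keep a feature subset if clustering on it reproduces the clusters
# obtained from the full feature set.
from itertools import combinations
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

X, _ = load_iris(return_X_y=True)
full_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

for subset in combinations(range(X.shape[1]), 2):
    sub_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[:, subset])
    score = adjusted_rand_score(full_labels, sub_labels)
    if score > 0.95:   # subset gives (almost) the same clustering structure
        print("Feature subset", subset, "preserves the clustering, ARI =", round(score, 3))
```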
The filter-based approach is the simplest method for feature subset selection. In this
approach features are selected without considering the learning algorithm, so the
selection remains independent of the algorithm. Each feature can be evaluated either
individually or as part of a complete subset, and various feature selection criteria can
be used for this purpose. In individual feature selection, each feature is assigned a rank
according to a specified criterion and the features with the highest ranks are selected.
In the other case the entire feature subset is evaluated. It should be noted that the
selected feature set should always be minimal, i.e. the fewer the features in the
subset, the better.
A generic filter-based approach is shown in Fig. 4.3.
Fig. 4.3 Generic filter-based approach taken from: John, George H., Ron Kohavi, and Karl Pfleger.
“Irrelevant features and the subset selection problem.” Machine learning: proceedings of the eleventh
international conference. 1994
It should be noted that only a unidirectional link exists between the feature selection
process and the induction algorithm: the selected features are passed on, but no
feedback flows back. A small ranking sketch is given below.
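A minimal sketch of the filter idea, assuming scikit-learn; mutual information is one possible ranking criterion, and the Iris data is only illustrative.

```python
# Rank features by a criterion computed from the data alone,
# independently of any classifier, then keep the top-ranked ones.
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

X, y = load_iris(return_X_y=True)
scores = mutual_info_classif(X, y, random_state=0)

ranking = sorted(enumerate(scores), key=lambda p: p[1], reverse=True)
print("Features ranked by relevance:", ranking)

top_k = [idx for idx, _ in ranking[:2]]   # keep the k highest-ranked features
print("Selected features:", top_k)
```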
The pseudocode of a generic feature selection algorithm is shown in Listing 4.1
below:
Listing 4.1 A Generic Filter Approach taken from: Ladha, L., and T. Deepa. “Feature
selection methods and algorithms.” International journal on computer science and
engineering 3.5 (2011): 1787–1797.
Input:
S—data sample with C number of features
E—evaluation measure
SGO—successor generation operator
Output:
S'—output solution
I := Start I;
S' := {best of I with respect to E};
repeat
    I := Search(I, SGO(I), C);
    C' := {best of I according to E};
    if E(C') ≥ E(S') or (E(C') = E(S') and |C'| < |S'|) then S' := C';
until Stop(E, I).
Here S is the input data sample, "E" is the measure to be optimized (the selection
criterion), S' is the final result that will be output, and SGO is the operator used to
generate the next feature subset. We may start with an empty feature set and keep
adding features according to the criterion or, alternatively, start with the full feature
set and keep removing unnecessary features. The process continues until we obtain
the optimized feature subset.
Fig. 4.4 Generic wrapper-based approach taken from: John, George H., Ron Kohavi, and Karl
Pfleger. “Irrelevant features and the subset selection problem.” Machine learning: proceedings of
the eleventh international conference. 1994
As can be seen, the feature selection process, comprising feature search and evalu-
ation, is interconnected with the induction algorithm, and the overall process comprises
three steps:
1. Feature subset search
2. Feature evaluation considering the induction algorithm
3. Repetition of the process until the optimization criterion is met.
It should be noted that the feature subset is passed to the induction algorithm, and the
quality of the subset is determined on the basis of the algorithm's feedback. The
feedback may be in the form of any measure, e.g. the error rate. A small sketch of
this idea follows.
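A minimal sketch of a wrapper approach, assuming scikit-learn; the greedy forward search, the decision tree as induction algorithm, and cross-validated accuracy as the feedback measure are all illustrative choices.

```python
# Greedy forward selection: the induction algorithm (decision tree with
# cross-validation) provides the feedback on each candidate subset.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
remaining, selected = set(range(X.shape[1])), []
best_score = 0.0

while remaining:
    # Try adding each remaining feature and keep the one that improves accuracy most.
    trials = {f: cross_val_score(DecisionTreeClassifier(random_state=0),
                                 X[:, selected + [f]], y, cv=5).mean()
              for f in remaining}
    f_best, score = max(trials.items(), key=lambda p: p[1])
    if score <= best_score:        # stop when no feature improves the feedback measure
        break
    selected.append(f_best)
    remaining.remove(f_best)
    best_score = score

print("Wrapper-selected features:", selected, "accuracy:", round(best_score, 3))
```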
Both of the approaches discussed above have their own advantages and disadvan-
tages. Filter-based approaches are efficient, but the selected features may not be
quality features because feedback from the learning algorithm is not considered. In
wrapper-based approaches quality features are selected, but repeatedly getting
feedback and updating features is computationally inefficient.
Embedded methods take advantage of the strong points of both approaches. In an
embedded method, features are selected as part of the classification algorithm itself,
without a separate feedback loop. A generic embedded approach works as follows
(a small sketch is given after the list):
1. Feature subset initialization
2. Feature subset evaluation using an evaluation measure
3. Repetition of the process as part of model training until the criteria are met.
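A minimal sketch of an embedded approach, assuming scikit-learn; L1-regularised logistic regression is one possible choice (tree-based feature importances would be another), and the breast cancer data is only a stand-in.

```python
# Features are chosen as a by-product of training the model itself:
# coefficients driven to zero by the L1 penalty mark discarded features.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
kept = np.flatnonzero(model.coef_[0])     # features with non-zero weights survive
print("Features kept by the embedded method:", kept)
```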
The core objective of the feature selection process is to select those features that can
represent the entire dataset in terms of the information they provide. So, the fewer
the features, the better the performance. Various objectives of feature selection are
provided in the literature. Some of these are as follows:
1. Efficiency: Feature selection results in a smaller number of features which provide
the same information while enhancing quality. This is shown using an example
in Table 4.5.
Note that if we classify the objects in the above dataset using all three features, we
get the classification given on the left side of the diagram. However, if we use only
features C1 and C2, we get the same classification, as shown in Fig. 4.5. So, we can
use the feature subset {C1, C2} instead of the entire feature set.
The above is a very simple example; consider the case when we have hundreds or
thousands of features.
2. Avoid overfitting: Since feature selection helps us remove noisy and irrelevant
features, the accuracy of the classification model also increases. In the example
given above, we can safely remove feature C3, as removing it neither increases
nor decreases the classification accuracy.
Fig. 4.5 Classification: Using selected features versus entire feature set
3. Identifying the relation between data and process: Feature selection helps us
understand the relationship between features and the process that generated them.
For example, in the table we can see that "Class" is fully dependent on {C1, C2},
so to fully predict the class we only need the values of these two features.
4.5.1 Information Gain
Information gain relates to the uncertainty of a feature. For two features X and
Y, feature X will be preferable if IG(X) > IG(Y). Mathematically:
IG(X) = Σi U(P(Ci)) − E[ Σi U(P(Ci | X)) ]
Here
U = Uncertainty function.
P(Ci) = Probability of Ci before considering feature "X".
P(Ci | X) = Probability of Ci after considering feature "X".
A small computational sketch follows.
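A minimal sketch of information gain with entropy playing the role of the uncertainty function U; the toy feature and class arrays are made up purely for illustration.

```python
# Information gain = uncertainty before observing the feature
#                    - expected uncertainty after observing it.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    total = entropy(labels)
    n = len(labels)
    expected = 0.0
    for v in set(feature):
        subset = [c for f, c in zip(feature, labels) if f == v]
        expected += (len(subset) / n) * entropy(subset)   # weighted uncertainty given X = v
    return total - expected

X_feature = ["a", "a", "b", "b", "b"]      # hypothetical feature values
classes   = ["yes", "yes", "no", "no", "yes"]
print("IG =", round(information_gain(X_feature, classes), 3))
```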
4.5.2 Distance
The distance measure represents how effectively a feature can discriminate between
classes. A feature with high discrimination power is preferable. Discrimination, here,
can be defined in terms of the difference between the probabilities P(X|Ci) and
P(X|Cj), where Ci and Cj are two classes, P(X|Ci) is the probability of X when the
class is Ci, and, similarly, P(X|Cj) is the probability of X when the class is Cj. So,
for two features X and Y, X will be preferable if D(X) > D(Y).
4.5.3 Dependency
4.5.4 Consistency
Consistency is another measure that can be used for feature subset selection. A
consistent feature subset is one that provides the same class structure as that provided
by the entire feature set.
One of the important steps in feature selection algorithms is to generate the next
subset of features. It should be noted that the next feature subset should contain
better quality features than the current one. Normally, three feature subset generation
schemes are used for this purpose.
Firstly, we have the forward feature generation scheme, where the algorithm starts
with an empty set and keeps adding new features one by one. The process continues
until the desired criteria are met.
A generic forward feature generation algorithm is given in Listing 4.2 below:
Input:
S—data set containing X features
E—evaluation measure
Output:
S'—output solution
a) S' ← {∅}
b) Repeat: add the next best feature x ∈ S − S' according to E, i.e. S' ← S' ∪ {x}
c) Until Stop(E, S')
d) Return S'.
Here S' is the final feature subset that the algorithm will output. Initially it is an
empty set, and features are added one by one until the solution meets the evaluation
criterion "E".
Secondly, we can use the backward feature generation scheme. This is the exact
opposite of the forward feature generation mechanism: we start with the full feature
set and keep eliminating features until we reach a subset from which no more features
can be eliminated without affecting the criterion.
Listing 4.3 shows a generic backward feature selection algorithm below:
Note that we assign the entire feature set to S' and then remove features one by one.
A feature can safely be removed if, after removing it, the criterion remains intact.
We can also use a combination of both approaches, where the algorithm works with
an empty set and a full set at the same time: on one side we keep adding features
while on the other we keep removing them.
Thirdly, we have the random approach. As the name implies, in the random feature
generation scheme we randomly include and skip features. The process continues
until the mentioned criterion is met. Features can be selected or skipped using any
scheme; Listing 4.4 below shows a simple one in which a random value is generated
between 0 and 1 for each feature: if the value is less than or equal to 0.5, the feature
is included, otherwise it is excluded. This gives each feature an equal opportunity to
be part of the solution.
(Fig. 4.6, fragment: search organization—heuristic, random, or exhaustive.)
Listing 4.4 A random feature generation algorithm using hit and trial approach
Input:
S—data set containing X features
E—evaluation measure
Output:
S'—output solution
a) S' ← {∅}
b) For i = 1 to n
       If Random(0, 1) ≤ 0.5 then S' ← S' ∪ {Xi}
c) Until Stop(E, S')
d) Return S'.
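A minimal runnable version of the hit-and-trial scheme in Listing 4.4; the evaluation measure E is left as a placeholder function supplied by the caller, and the toy criterion used in the example is made up.

```python
# Each feature is included with probability 0.5; the subset is kept only
# if it satisfies the evaluation measure E (the stopping criterion).
import random

def random_feature_generation(features, E, max_tries=1000, seed=0):
    rng = random.Random(seed)
    for _ in range(max_tries):
        subset = [f for f in features if rng.random() <= 0.5]  # include or skip at random
        if subset and E(subset):            # stopping criterion met
            return subset
    return None                             # no acceptable subset found in time

# Example use with a toy criterion: "the subset must contain feature 'a'".
print(random_feature_generation(list("abcde"), E=lambda s: "a" in s))
```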
The details given above provide the basis for three types of feature selection algo-
rithms. First, we have exhaustive search algorithms, where the entire feature space
is searched to select an appropriate feature subset. Exhaustive search provides an
optimal solution, but due to resource limitations it becomes infeasible for anything
beyond small datasets.
Random search, on the other hand, randomly searches the feature space and keeps
the process going until some solution is found within the mentioned time. The
process is fast, but the drawback is that we may not obtain an optimal solution.
The third and most common strategy is to use heuristics-based search, where the
search mechanism is guided by some heuristic function. A common example of
this type of algorithm is the genetic algorithm. The process continues until we find a
solution or a specified time threshold has been reached.
So, overall feature selection algorithms can be characterized by three parts as
follows:
1. Search organization
2. Generation strategy for next feature subset selection
3. Selection criteria.
Figure 4.6 shows different parts of a typical feature selection algorithm.
Now we will provide some concepts related to features and feature selection.
S(C − {Xi}) = S(X)
Here:
S is selection criteria (e.g. dependency)
C is current feature subset
X is the entire feature set.
So, if after removing the feature the selection criterion computed on the current
feature subset remains equal to that of the entire feature set, the feature can be termed
irrelevant and thus safely removed. Depending on the selection criterion, a feature can
be either strongly relevant, weakly relevant, or irrelevant.
As can be seen, the selection criterion remains the same even after removing the
feature, so the feature Xj will be declared redundant and, like an irrelevant feature,
it can also be removed.
Today we are living in a world where the curse of dimensionality is a common problem
for applications to deal with. Within limited resources it becomes infeasible
to process such huge volumes of data, so feature selection remains an effective
approach to deal with this issue. It has become a common preprocessing task for
the majority of domain applications. Here we will discuss only a few of the domains
using feature selection.
With the widespread use of the Internet, systems have become increasingly vulnerable
to attacks by hackers and intruders. Static security mechanisms become insufficient
when hackers and intruders find new attack mechanisms on a daily basis, so we need
dynamic intrusion detection systems. These systems may need to monitor the huge
volume of traffic passing through them every minute and second. Feature selection
can help them identify the important and more relevant features to inspect for
detection, thus enhancing their performance.
Information systems are one of the main application areas of feature selection. It is
common to have information systems processing hundreds or thousands of features
for different tasks, so feature selection becomes a handy tool for such systems. To
get the idea, simply compare a classification task using 1000 features with the same
classification performed using 50 features while obtaining the same or sufficient
classification accuracy.
Feature selection is of great help in the era of the curse of dimensionality; however,
the process has some issues which still need to be handled. Here we will discuss a few.
4.8.1 Scalability
Although feature selection provides a solution for handling huge volumes of data,
an inherent issue a feature selection algorithm faces is the amount of resources it
requires.
For a feature selection algorithm to rank the features, the entire dataset has to be
kept in memory until the algorithm completes. Keeping in mind the large volume of
data, this requires a large amount of memory. The issue is unavoidable because, to
produce quality features, the entire data has to be considered. So, the scalability of
feature selection algorithms with respect to dataset size is still a challenging task.
4.8.2 Stability
A feature selection algorithm should be stable, i.e. it should produce the same results
even with small perturbations in the data. Various factors such as the amount of data
available, the number of features, and the distribution of the data may affect the
stability of a particular feature selection algorithm.
Feature selection algorithms make an important assumption that the available data
points are independent and identically distributed. An important factor that is often
ignored is that
data may be linked with other data, perhaps even in other datasets, e.g. a user may be
linked with their posts, and posts may be liked by other users. Such scenarios especially
arise in the domain of social media applications. Although research is underway to
deal with this issue, handling linked data is still a challenge that should be considered.
Based on how a feature selection algorithm searches the solution space, feature
selection algorithms can be divided into three categories.
• Exhaustive algorithms
• Random search-based algorithms
• Heuristics-based algorithms.
Exhaustive algorithms search the entire feature space to select a feature subset.
These algorithms provide optimal results, i.e. the resultant feature subset contains the
minimum number of features. This is one of their greatest benefits; unfortunately,
however, exhaustive search is usually not possible as it requires a lot of processing
resources. The issue becomes more serious as the size of the dataset increases: for a
dataset with n attributes, an exhaustive algorithm has to explore 2^n candidate
subsets. Beyond computational power, such algorithms also require a lot of memory,
and it should be noted that it is now common to have datasets with tens of thousands
or even millions of features. Exhaustive algorithms become practically impossible
to apply to such datasets.
For example, consider the following dataset given in Table 4.6.
An exhaustive algorithm will have to check the following subsets:
{}, {a}, {b}, {c}, {d}, {e}, {a, b}, {a, c}, {a, d}, {a, e}, {b, c}, {b, d}, {b, e}, {c, d},
{c, e}, {d, e}, {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c, e}, {a, d, e}, {b, c, d},
{b, c, e}, {b, d, e}, {c, d, e}, {a, b, c, d}, {a, b, c, e}, {a, b, d, e}, {a, c, d, e},
{b, c, d, e}, {a, b, c, d, e} — that is, 2^5 = 32 subsets in total.
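A minimal sketch of why this enumeration explodes; the feature names match the example above, and the count doubles with every additional feature.

```python
# Enumerate every subset of {a, b, c, d, e}: 2^5 = 32 candidates.
from itertools import combinations

features = ["a", "b", "c", "d", "e"]
subsets = [list(c) for r in range(len(features) + 1)
                   for c in combinations(features, r)]
print("Number of candidate subsets:", len(subsets))   # 32

# For 1000 features the number of subsets is 2**1000 — far beyond any feasible search.
print("2**1000 has", len(str(2 ** 1000)), "digits")
```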
Now consider a dataset with a thousand features or objects: the computational
time will increase substantially. As discussed earlier, feature selection is used
as a preprocessing step for various data analytics algorithms, so if this step is
computationally expensive, it will create serious performance bottlenecks for the
underlying algorithm using it. However, we may have semi-exhaustive algorithms,
which do not explore the entire solution space. Such algorithms search the solution
space only until the required feature subset is found, and they can use both
forward selection and backward elimination strategies.
The algorithm keeps adding or removing features until a specified criterion is
met. Although such algorithms are more efficient than fully exhaustive algorithms,
it is still not possible to use them for medium or larger dataset sizes. Furthermore,
such algorithms suffer from a serious dilemma, i.e. the distribution of the features:
if a high-quality feature appears at the beginning, the algorithm is expected to stop
earlier than when the high-quality attributes are indexed at later positions in the
dataset.
One optimization for such algorithms, provided in filter-based approaches, is that
instead of starting directly with subsets, we first rank all the features. Note that here
we will use the rough set-based dependency measure for ranking and as the feature
selection criterion. We check the dependency of the decision class ("Z" in the dataset
given above) on each attribute one by one. Once all the attributes are ranked, we start
combining the features. However, instead of combining the features in the sequence
given above, we combine them in decreasing order of their rank. Here we have made
two assumptions.
• Features having high ranks are high-quality features, i.e. having minimum noise
and high classification accuracy.
• Combining the features with high ranks is assumed to generate high-quality
feature subsets more quickly as compared to combining the low ranked features.
Features having the same rank may be considered as redundant.
This approach results in the following benefits:
• The algorithm does not depend on the distribution of features, as the order of the
features becomes insignificant when combining them for subset generation.
• The algorithm is expected to generate results quickly, thereby increasing the
performance of the corresponding algorithm using it. A small sketch of this
rank-then-combine idea is given below.
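A minimal sketch of the rank-then-combine idea using a rough-set dependency measure: the dependency of the decision on a feature subset B is the fraction of objects whose B-equivalence class is pure with respect to the decision. The toy dataset (features a–e, decision Z) is made up for illustration and is not Table 4.6.

```python
# Rank single features by dependency, then combine them in decreasing
# rank order until the subset fully determines the decision.
from collections import defaultdict

data = [  # (a, b, c, d, e, Z)
    (1, 0, 1, 0, 1, "yes"),
    (1, 0, 0, 0, 1, "yes"),
    (0, 1, 1, 1, 0, "no"),
    (0, 1, 0, 1, 0, "no"),
    (1, 1, 1, 0, 0, "yes"),
]

def dependency(feature_idx):
    groups = defaultdict(list)
    for row in data:
        key = tuple(row[i] for i in feature_idx)      # equivalence class under the subset
        groups[key].append(row[-1])
    pure = sum(len(g) for g in groups.values() if len(set(g)) == 1)
    return pure / len(data)

ranked = sorted(range(5), key=lambda i: dependency([i]), reverse=True)
subset = []
for i in ranked:
    subset.append(i)
    if dependency(subset) == 1.0:      # subset fully determines the decision Z
        break
print("Ranked features:", ranked, "selected subset:", subset)
```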
Although these algorithms are more efficient than the previously discussed semi-
exhaustive algorithms, even they can become computationally expensive when the
number of features and objects in the dataset grows beyond a small size.
Then we have random feature selection algorithms. These algorithms use a hit-and-
trial method for feature subset selection. A generic random feature selection
algorithm is given in Listing 4.5.
The advantage of these algorithms is that they can generate feature subsets more
quickly than other types of algorithms, but the downside is that the generated
feature subset may not be optimized, i.e. it may contain redundant or irrelevant
features that contribute nothing to the feature subset.
4.10 Genetic Algorithm
In a genetic algorithm, each candidate feature subset is encoded as a chromosome: a
binary string in which "1" represents the presence of a feature and "0" represents its
absence. These features are selected randomly. For the dataset given in Table 4.6, a
random chromosome may be
10011
Here the first "1" represents the presence of feature "a", while the second and third
values, "0", show that features "b" and "c" will not be part of the solution. Similarly,
"d" and "e" will be included, as shown by the "1" values at the fourth and fifth
places. Each value is called a "gene", so a gene represents a feature. The chromosome
shown above therefore represents the feature subset {a, d, e}.
A population may contain many chromosomes, depending on the requirements of
the algorithm.
Fitness Function
Once the population is initialized, the next task is to check the fitness of each chromo-
some. The chromosomes having higher fitness are retained and those with lower
fitness are discarded. The fitness function is the one that specifies our selection
criterion; it may be information gain, the Gini index, or dependency in the case of
rough set theory. Here we will consider the dependency measure as the selection
criterion. Chromosomes with higher dependency are preferable. An ideal chromosome
is one having a fitness value of "1", i.e. the decision class depends fully on the set of
features encoded by the chromosome.
After this we start our iterations and keep checking until the required chromosome
is found.
It should be noted that when chromosomes are initialized, different chromosomes
may have different fitness. We select the best ones, having higher dependency,
and use them for crossover. This is in line with the concept of survival of the fittest,
inspired by nature.
Crossover
Applying crossover is one of the important steps of a genetic algorithm; through it
we generate offspring, which form the next population. It should be noted
that only the solutions having high fitness participate in crossover, so that the resulting
offspring are also healthy solutions. There are various types of crossover operators.
Following is an example of a simple one-point crossover operator in which two
chromosomes simply exchange one part with each other to generate new offspring.
10101
11010
We exchange the parts of the chromosomes after the third gene, so the resulting
offspring become
10110
11001
Similarly, in two-point crossover, we select two random points and the chromo-
somes exchange the genes between these points, as shown below:
10011
11100
There are other types of crossover operators as well, e.g. uniform crossover,
order-based crossover, etc. The choice of a specific crossover operator depends on
the requirements.
Mutation
Once crossover is performed, the next step is to perform mutation. Mutation is the
process of randomly changing a few bits. The reason behind this is to introduce diver-
sity into the genetic algorithm and to introduce new solutions so that previous solutions
are not simply repeated. In its simplest form, a single bit is flipped in the offspring. For
example, consider the following offspring that resulted from the crossover of parent
chromosomes:
10110
After flipping the second gene (bit) the resultant chromosome becomes:
11110
Just like crossover, there are many types of mutation operators; a few include
flip-bit mutation, uniform, non-uniform, and Gaussian mutation.
We then evaluate the fitness of each chromosome in the population, and the best
chromosomes replace the bad ones. The process continues until the stopping criteria
are met.
Stopping Criteria
There can be three stopping criteria in a genetic algorithm. The first and ideal one is
that we find the ideal chromosomes, i.e. chromosomes that have the required fitness.
In the case of the dependency measure, chromosomes having a dependency value of
"1" will be the ideal ones.
The second possibility is that we do not get an ideal chromosome; in such cases the
algorithm is executed until the chromosomes start repeating, and we then select the
chromosomes with maximum fitness. However, in some cases even this scenario does
not occur, so we can use a generation threshold, i.e. if the algorithm does not produce
a solution after n generations, we stop and the chromosomes with maximum fitness
may be considered the potential solution. A small end-to-end sketch of the whole
procedure is given below.
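A minimal sketch of the genetic algorithm above applied to feature selection with binary chromosomes; the fitness function is a made-up placeholder standing in for a real measure such as rough-set dependency, and the population size, generation count, and mutation rate are illustrative.

```python
# Population of binary chromosomes -> selection -> one-point crossover ->
# flip-bit mutation, repeated for a fixed number of generations.
import random

rng = random.Random(0)
N_FEATURES, POP_SIZE, GENERATIONS = 5, 6, 30

def fitness(chrom):
    # Placeholder criterion: reward chromosomes containing features 0 and 3,
    # penalise large subsets (stand-in for dependency, information gain, etc.).
    return (chrom[0] + chrom[3]) - 0.1 * sum(chrom)

def crossover(p1, p2):
    point = rng.randint(1, N_FEATURES - 1)          # one-point crossover
    return p1[:point] + p2[point:], p2[:point] + p1[point:]

def mutate(chrom, rate=0.1):
    return [1 - g if rng.random() < rate else g for g in chrom]  # flip-bit mutation

population = [[rng.randint(0, 1) for _ in range(N_FEATURES)] for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):
    population.sort(key=fitness, reverse=True)
    parents = population[:POP_SIZE // 2]            # survival of the fittest
    children = []
    while len(children) < POP_SIZE - len(parents):
        c1, c2 = crossover(*rng.sample(parents, 2))
        children += [mutate(c1), mutate(c2)]
    population = parents + children[:POP_SIZE - len(parents)]

best = max(population, key=fitness)
print("Best chromosome:", best, "-> selected features:", [i for i, g in enumerate(best) if g])
```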
Heuristics-based algorithms are the most commonly used. However, they may have
certain drawbacks, e.g.
• They may not produce the optimal solution. In the case of the genetic algorithm, the
resulting chromosomes may contain irrelevant features that contribute nothing to
the feature subset.
• As heuristics-based algorithms are random in nature, different executions may
take different execution times and the same solutions may not be produced again.
Table 4.7 shows the comparison of both exhaustive and heuristics-based
approaches.
Book presents recent developments and research trends in the field of feature selec-
tion for data and pattern recognition, highlighting a number of latest advances.
Divided into four parts—nature and representation of data; ranking and explo-
ration of features; image, shape, motion, and audio detection and recognition;
decision support systems, it is of great interest to a large section of researchers
including students, professors, and practitioners.
• Hierarchical Feature Selection for Knowledge Discovery (Authors: Cen Wan)
Book systematically describes the procedure of data mining and knowledge
discovery on bioinformatics databases by using the state-of-the-art hierarchical
feature selection algorithms. Furthermore, this book discusses the mined biolog-
ical patterns by the hierarchical feature selection algorithms relevant to the aging-
associated genes. Those patterns reveal the potential aging-associated factors that
inspire future research directions for the biology of aging research.
• Feature Selection and Enhanced Krill Herd Algorithm for Text Document
Clustering (Authors: Laith Mohammad Qasim Abualigah)
Presents a new method for solving the text document clustering problem and
demonstrates that it can outperform other comparable methods.
4.12 Summary
In this chapter we discussed feature selection as a core data preprocessing task: its
objectives, the filter, wrapper, and embedded approaches, the feature generation
schemes and selection criteria, and exhaustive, random, and heuristics-based (e.g.
genetic) algorithms, along with the related issues of scalability, stability, and linked
data.
Chapter 5
Classification
Classification is the process of grouping objects and entities on the basis of the
available information. It is an important step that forms the core of data analytics and
machine learning activities. In this chapter we will discuss some of the basic concepts
of the classification process. We will discuss the decision tree technique in depth for
this purpose. Decision trees have already been discussed at an abstract level in the
previous chapters; here we will provide in-depth details. Although the concepts will
be discussed from the decision tree point of view, they are applicable to other
classification techniques as well. Along with decision trees, we will provide a brief
overview of other classification techniques including naïve Bayes, support vector
machines, and artificial neural networks.
5.1 Classification
Attributes such as "has garden", "occupation", "home type", etc. are the attributes
on the basis of which the class of a person is decided.
Classification can be used for predictive modeling, i.e. on the basis of the classifi-
cation process we can predict the class of a record whose class is not already
known.
For example, consider the record given in Table 5.2.
We can use the dataset given in Table 5.1 to train the classification model and
then use the same model to predict the class of the above-mentioned record. It should
be noted that classification techniques are most suited for predicting binary or nominal
classes; they are less effective for other types of attributes, such as ordinal ones,
because they do not take the implicit ordering into account.
A classification process is a systematic approach to building a classifier model on the
basis of training records. The classifier model is built in such a way that it predicts
the class of unknown data with a certain accuracy. The model identifies the class
of unknown records on the basis of the attribute values and classes in the training
dataset. The accuracy of the classification model depends on many factors, including
the size and quality of the training dataset.
Figure 5.3 shows a generic classification model.
The accuracy of a classification model is measured in terms of the records correctly
classified compared to the total number of records. The tool used for this purpose
is called the confusion matrix, given in Table 5.3.
A confusion matrix shows the number of records of class i that are predicted as
belonging to class j. For example, f01 is the total number of records that belong
to class "0" but to which the model has assigned class "1".
The confusion matrix can be used to derive a number of metrics for classification
model evaluation. Here we will present a few.
Accuracy: Accuracy is the ratio of the number of correctly predicted records to the
total number of records. In our confusion matrix, f11 and f00 are the correctly
predicted records:

Accuracy = (Number of correct predictions) / (Total number of predictions) = (f11 + f00) / (f11 + f10 + f01 + f00)
Error Rate: The error rate is the ratio of incorrectly predicted records to the
total number of records. In our confusion matrix, f10 and f01 are the incorrectly
predicted records:
Error rate = (Number of wrong predictions) / (Total number of predictions) = (f10 + f01) / (f11 + f10 + f01 + f00)
A model that provides maximum accuracy and minimum error rate is always
desirable. A small numeric sketch follows.
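A minimal sketch of the two formulas above; the confusion-matrix counts are made up for illustration.

```python
# Accuracy and error rate from a 2x2 confusion matrix using the f_ij notation.
f11, f10, f01, f00 = 40, 5, 8, 47     # f_ij = records of class i predicted as class j

total = f11 + f10 + f01 + f00
accuracy = (f11 + f00) / total        # correctly predicted records
error_rate = (f10 + f01) / total      # incorrectly predicted records

print("Accuracy  :", accuracy)
print("Error rate:", error_rate)
assert abs(accuracy + error_rate - 1.0) < 1e-9   # the two measures always sum to 1
```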
The decision tree is one of the most commonly used classifiers. We have already
discussed decision trees in previous chapters; here we will provide details of the
concept along with the relevant algorithms.
Before going into the details of how decision trees perform classification, suppose
we want to classify a person, "Shelby", on the basis of the values of the attributes
presented in Table 5.1. We may start with the first attribute and consider whether her
home has a garden or not; we may then consider the value of the second attribute,
i.e. her occupation; and we may continue through the attributes one by one until we
reach the decision class of the person. This is exactly how a decision tree works.
Figure 5.4 shows a sample decision tree of the dataset given in Table 5.1.
Building an optimal decision tree is a computationally expensive job because the
number of potential trees can be extremely large, especially when the number of
attributes grows beyond a small number.
So, for this purpose, we normally use a greedy search approach based on local
optima. Many such solutions exist; one of the algorithms is Hunt's algorithm. Hunt's
algorithm uses a recursive approach and builds the final tree by incrementally adding
sub-trees.
Suppose Dt is the training set and y = {y1, y2, …, yc} are the class labels; then,
according to Hunt's algorithm:
Step 1: If all records in Dt belong to a single class, then the node is a leaf node
labeled with that class yt.
Step 2: In the case of multiple classes, we select an attribute to partition the dataset
into smaller subsets and create a child node for each value of the attribute. The
algorithm proceeds recursively on each child until we get the complete tree. A
minimal sketch of this recursion is given below.
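A minimal recursive sketch of Hunt's algorithm in pure Python. The attribute-selection rule here simply takes attributes in a fixed order, standing in for a measure such as information gain, and the tiny dataset is a made-up fragment in the spirit of Table 5.4.

```python
# Records are dicts of attribute values plus a "class" key.
from collections import Counter

def hunt(records, attributes):
    classes = [r["class"] for r in records]
    if len(set(classes)) == 1:                      # Step 1: single class -> leaf node
        return classes[0]
    if not attributes:                              # no attribute left -> majority class
        return Counter(classes).most_common(1)[0][0]
    attr, rest = attributes[0], attributes[1:]      # Step 2: split on the next attribute
    tree = {}
    for value in set(r[attr] for r in records):
        subset = [r for r in records if r[attr] == value]
        tree[value] = hunt(subset, rest)            # recurse on each child node
    return {attr: tree}

data = [
    {"Student Level": "UG", "Marks": "<80",  "class": "Not-permitted"},
    {"Student Level": "PG", "Marks": ">=80", "class": "Permitted"},
    {"Student Level": "PG", "Marks": "<80",  "class": "Not-permitted"},
]
print(hunt(data, ["Student Level", "Marks"]))
```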
Example: Suppose we want to create a classification tree for determining whether a
person will be permitted to register or not. For this purpose, we consider the dataset
of previous students shown in Table 5.4 and try to make the determination on this
basis. The attributes considered are {Student Level, Student Type, Marks,
Registration}.
We start with an initial tree that contains only a single node for the class Registration =
Not-permitted. Note that the dataset contains more than one class label, so we
select our first attribute, i.e. "Student Level", assuming for now that this is the best
attribute for splitting the dataset. We keep splitting the dataset until we get the final
tree. The tree formed after each step is given in Fig. 5.5.
Hunt's algorithm assumes that the training dataset is complete, which means that all
the values of the attributes are present. Similarly, it assumes the dataset is consistent,
which means that unique attribute values lead to unique decision classes. Note that
in our training dataset the value "Student Level = PG" leads to both Registration =
Permitted and Registration = Not-permitted. Such ideal conditions normally do not
exist in datasets, for many reasons. Here we present some additional provisions of
Hunt's algorithm.
1. It may happen that a created child node is empty. Normally this happens when
the training dataset does not contain the combination of attribute values that leads
to this child node. In such cases, the node is declared a leaf node and is assigned
the majority class of the training records associated with its parent.
(Fig. 5.5 Hunt's algorithm applied step by step: (a) a single node labeled Registration
= Not-permitted; (b) a split on Student Level (PG / UG); (c) a further split on Student
Type (self sponsored / partial sponsored / scholarship); (d) a final split on Marks
(< 80 / ≥ 80), whose leaves give Registration = Permitted or Not-permitted.)
2. If the records in Dt have identical attribute values but differ in class labels, then
we cannot split the records further and the node is declared a leaf node. The class
assigned to this node is the one with the maximum occurrences in the training
dataset.
A learning algorithm for inducing decision trees must address the following two
issues.
• The algorithm incrementally builds the classification tree by selecting the next
attribute on which to split the dataset. However, as there are a number of attributes
and each attribute can be used to split the dataset, there should be some criterion
for selecting the attributes.
• There should be some criterion to stop the algorithm. There are two natural possi-
bilities: we can stop when all the records have the same class label, or when all the
records have the same attribute values. There can also be other criteria, such as a
threshold value.
It should be noted that there can be different types of attributes, and splitting on
the basis of these attributes differs in each case. Now we will consider the attribute
types and their corresponding splits.
Binary Attributes: Binary attributes are perhaps the simplest ones. Splitting on
such attributes results in two splits, as shown in Fig. 5.6a.
Nominal Attributes: Nominal attributes may have multiple values, so we can have
as many splits as there are possible values; if an attribute has three values, there may
be three splits. Consider the attribute "Student Type": it can have three possible
values, i.e. "self sponsored", "partial sponsored", and "scholarship based", so the
possible splits are given in Fig. 5.6b. It should be noted that some algorithms, e.g.
CART, produce only binary splits, considering all 2^(k−1) − 1 ways of creating a
binary partition of k attribute values. Figure 5.6c illustrates the binary split of the
"Student Type" attribute, which originally had three values.
Ordinal Attributes: Ordinal attributes have an order between the attribute values;
e.g. a Likert scale may have the values do not agree, partially agree, agree, and
strongly agree. So, we can split an ordinal attribute into two or more splits; however,
the split should not violate the order of the attribute values. Figure 5.7 shows the
possible splits of the ordinal attribute shirt size.
Continuous Attributes: Continuous attributes are those that can take any value
within a specified range, e.g. temperature, height, weight, etc. In the context of
decision trees, such attributes may result in either a binary or a multiway split. For a
binary split, the test condition may be of the form (C < v) or (C ≥ v): there are two
splits, i.e. either the value of attribute "C" is less than the value "v" or it is greater
than or equal to "v". Figure 5.8a shows a typical test condition resulting in a binary
split.
Similarly, we can have test conditions with multiple splits. Such conditions may be
of the form vi ≤ C < vi+1, giving a number of splits over different ranges. Figure 5.8b
shows an example of such a test condition.
There are many measures that can be used to select the best attribute for the next
split. Examples include entropy, information gain, the Gini index, and classification
error.
Entropy(t) = − Σ_{i=0}^{c−1} p(i|t) log2 p(i|t)

Gini(t) = 1 − Σ_{i=0}^{c−1} [p(i|t)]^2
Here are some examples of class distributions and the corresponding measure values
that can be used to compare attributes; a small sketch computing them follows.
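A minimal sketch computing the impurity measures above for a few example class distributions at a node t; the distributions are illustrative.

```python
# Entropy and Gini index for a node given its class probability distribution.
from math import log2

def entropy(p):
    return -sum(pi * log2(pi) for pi in p if pi > 0)

def gini(p):
    return 1 - sum(pi ** 2 for pi in p)

for dist in [(1.0, 0.0), (0.5, 0.5), (0.9, 0.1)]:
    print(dist, "entropy =", round(entropy(dist), 3), "gini =", round(gini(dist), 3))
```

A pure node (1.0, 0.0) gives zero impurity under both measures, while the evenly mixed node (0.5, 0.5) gives the maximum.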
A decision tree can suffer from three types of errors: training errors, i.e. the misclas-
sifications the tree makes on the training dataset; test errors, i.e. the misclassifica-
tions made on the test dataset; and generalization errors, i.e. the errors made on
unseen data. The best classification model is the one that avoids all of these types of
errors. However, for small trees we may have high training and test errors; we call
this model underfitting, because the tree is not trained well enough to generalize to
all possible examples, mainly due to insufficient training data. Model overfitting, on
the other hand, results when the tree becomes too large and complex. In this scenario,
although the training error decreases, the test error may increase. One reason is that
the training dataset may contain noise, and classes accidentally assigned to data points
may be learned by the tree, resulting in misclassification of unseen data. Model
overfitting can result from many factors; here we will discuss a few.
Overfitting can be due to noise in the training dataset. For example, consider Tables
5.5 and 5.6, representing the training and test datasets. Note that the training dataset
has two misclassified records representing noise. The tree built from this training
dataset will have no training error, but when the test dataset is run, the incorrect class
assignments learned from the noise will result in misclassification of test records.
Figure 5.9 shows these classification trees.
Similarly, models developed from smaller training datasets may also overfit, mainly
because of the insufficient number of representative samples in the training data.
Such models may have zero training error, but due to the immaturity of the classifi-
cation model the test error rate may be high; e.g. consider the dataset given in
Table 5.7, whose few training samples result in classification errors on the test dataset
given in Table 5.6. Figure 5.10 shows the resulting classification tree.
5.2.3 Entropy
Now we will provide details of the measure called information gain and explain how
it can be used to develop a decision tree. But before we dwell on this measure, let us
discuss what entropy is.
Entropy defines the purity/impurity of a dataset. Suppose a dataset Dt has a binary
classification (e.g. positive and negative classes); the entropy of Dt can be defined as:

Entropy(Dt) = −p+ log2(p+) − p− log2(p−)

Where:
p+ is the proportion of positive examples in Dt
p− is the proportion of negative examples in Dt.
Suppose Dt is a collection of 14 data points with binary classification, including
9 positive and 5 negative examples; then the entropy of Dt will be:

Entropy(Dt) = −(9/14) log2(9/14) − (5/14) log2(5/14) ≈ 0.940
It should be noted that entropy is zero (0) when all data points belong to a single
class, and entropy is maximum, i.e. 1, when the dataset contains an equal number of
examples of both classes in a binary classification.
Given that entropy defines the impurity of a dataset, we can define the measure
information gain (IG) as follows:

IG(D, X) = Entropy(D) − Σ_v (|Dv| / |D|) Entropy(Dv)

where the sum runs over the values v of attribute X and Dv is the subset of records
for which X takes the value v.
(Figs. 5.9 and 5.10 Decision trees induced from the training data: the trees split on
"Has garden", then on "Home type" (single story / double story), "Taxation"
(tax payer / non-tax payer), and "Electric system" (grid / solar), with leaves
Registration = Permitted or Not-permitted.)
Now we will calculate the IG of each attribute using this entropy; the attribute having
the maximum information gain will be placed at the root of the tree.
In the information gain formula given above, |Dx1| denotes the number of records
for which attribute "X" takes the value "x1". So, in D:
|D| = 5
|Dx1| = 1
In our example:
E(Dx1 ) = 0
E(Dx2 ) = 1
E(Dx3 ) = 0
IG(D, X ) = 0.57
IG(D, Y ) = 0.02
IG(D, Z ) = 0.02
This means that attribute "X" has the maximum information gain, so it will be
selected as the root node. Figure 5.11 shows the resulting classification tree.
The tree successfully classifies three of the five samples; it correctly classifies the
samples D1, D3, and D4. For the remaining samples we perform another iteration,
considering the other two attributes and only the remaining two samples.
IG(D’, Y ) = 0
IG(D’, Z ) = 1
So, here attribute Z has the highest information gain, so we will consider it as the
next candidate on which to split for the remaining two records. The decision tree will
be as follows (Fig. 5.12).
Now all the records have been successfully classified, so we will stop our iterations,
and the decision tree in Fig. 5.12 will be our final output.
Now we will discuss some other classification techniques.
Now we will discuss some other classification techniques.
5.3 Regression Analysis
In many problems the value of a dependent variable can be expressed as a function
of an independent variable, for example:
Y = X² + 1
Now coming to regression analysis: it is one of the most common techniques used
for analyzing the relationship between two or more variables, i.e. how the value of an
independent variable affects the value of the dependent variable. The main purpose of
regression analysis is to predict the value of the dependent variable for some unknown
data. The value is predicted on the basis of a model developed using the available
training data. Overall, regression analysis provides us with the following information:
(1) What is the relation between the independent variables and the dependent
variable? Note that independent variables are also called predictors and dependent
variables are called outcomes.
(2) How good are the predictors at predicting the outcome?
(3) Which predictors are more important than others in predicting the outcome?
There are different types of regression analysis:
• Simple linear regression
• Multiple linear regression
• Logistic regression
• Ordinal regression
• Multinomial regression.
For the sake of this discussion we will confine ourselves to simple linear regression.
Simple Linear Regression: Simple linear regression is the simplest form of
regression analysis, involving one dependent variable and one independent variable.
The aim of the regression process is to model the relation between the two.
The model of simple linear regression is a straight line passing through the points
in a two-dimensional space. The analysis process tries to find the line that is closest
to the majority of the points in order to increase accuracy. The model is expressed
in the form:
Y = a + bX
Here:
Y —Dependent variable
X—Independent (explanatory) variable
a—Intercept
b—Slope.
If you remember, this is the equation of a simple straight line, i.e. Y = mx + c. Now
we will explain it with the help of an example. Consider the simple dataset comprising
two variables, "height" and "weight", shown in Table 5.10.
Now we have to predict the weight of a person with height 5.4. First we draw the
first three points in a two-dimensional space, as shown in Fig. 5.13.
Using the model discussed above, we try to predict the weight of the person with
height 5.4. If we take a = 6 and b = 7, the predicted weight of the person will be
6 + 7 × 5.4 = 43.8 kg, as shown in Fig. 5.14. A small least-squares sketch is given
below.
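A minimal sketch of fitting the line by least squares with NumPy; the height/weight numbers are made up (Table 5.10 is not reproduced here) but are chosen to be consistent with the intercept a = 6 and slope b = 7 used in the text.

```python
# Fit Y = a + bX by least squares and predict the weight for height 5.4.
import numpy as np

height = np.array([5.0, 5.2, 5.6])        # independent variable X
weight = np.array([41.0, 42.4, 45.2])     # dependent variable Y

b, a = np.polyfit(height, weight, 1)      # slope b and intercept a of Y = a + bX
print("a =", round(a, 2), "b =", round(b, 2))
print("Predicted weight for height 5.4:", round(a + b * 5.4, 1), "kg")
```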
5.4 Support Vector Machine
Support vector machines (SVM) are another classification technique, commonly used
for classifying data in an N-dimensional space. The aim of the classifier is to identify
the maximal-margin hyperplane that separates the data. Before discussing support
vector machines, let us discuss some prerequisite concepts.
Hyperplane: A hyperplane is a decision boundary that separates two or more
decision classes; data points that fall on a certain side of the hyperplane belong to
a particular decision class. If the points are described by two features, the hyperplane is just
a line; with three features it becomes a two-dimensional plane, and in general it is an
(N − 1)-dimensional surface. A linearly separable dataset is one whose data points
can be separated by a simple linear (straight) boundary. Consider the figure showing
a dataset comprising two decision classes, "rectangle" and "circle": Fig. 5.15a shows
three hyperplanes H1, H2, and H3.
Note that each hyperplane can classify the dataset without any error. However, the
selection of a particular hyperplane depends upon its margin. Each hyperplane Hi is
associated with two more hyperplanes, hi1 and hi2, obtained by drawing two parallel
hyperplanes, one close to each decision class. In Fig. 5.15b, the two dotted lines h21
and h22 are the parallel hyperplanes for H2. Parallel hyperplanes are obtained such
that each one is close to one decision class. Figure 5.16 shows the hyperplanes drawn
close to the filled circle and rectangle.
Support Vectors: Points on the other side of the hyperplane belong to the opponent
class. Support vectors are the points of each class that lie closest to the opponent
class; they are the points that actually determine the position of the margin lines. In
the figure above, the points represented by the filled circle and rectangle are the
support vectors, as each of them is closest to the opponent class.
Selection of hyperplane: Although a number of hyperplanes may exist, we select
the hyperplane with the maximum margin. The reason is that such hyperplanes tend
to be more accurate on unknown data than hyperplanes with a small margin. The
margin is the distance between the two margin lines of a hyperplane, so:
D = d1 + d2
For a point Ps on the circle side of the hyperplane we have W · Ps + b = h with h < 0,
while points on the square side give a positive value. So, if we label all the squares
with S and the circles with C, then the model for prediction is:

f(x) = C if W · x + b < 0, and f(x) = S if W · x + b > 0
As we can see, the above data points cannot be classified using a linear SVM. What
we can do is convert the data into a higher-dimensional feature space by introducing
a new dimension, called the Z feature, as follows:
Z = x² + y²
Now the dataset can clearly be classified using a linear SVM. If the separating
hyperplane lies at "k" along the Z-axis, then:
k = x² + y²
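A minimal sketch of this lifting idea, assuming scikit-learn; the circular toy data is generated for illustration, and adding z = x² + y² explicitly is one way to make the classes linearly separable.

```python
# Points inside a circle vs outside cannot be separated by a line in (x, y),
# but adding the feature z = x^2 + y^2 makes a linear SVM work.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)   # circle / non-circle labels

linear_acc = SVC(kernel="linear").fit(X, y).score(X, y)

Z = np.c_[X, X[:, 0] ** 2 + X[:, 1] ** 2]             # explicit z = x^2 + y^2 feature
lifted_acc = SVC(kernel="linear").fit(Z, y).score(Z, y)

print("Linear SVM in (x, y):    accuracy =", round(linear_acc, 2))
print("Linear SVM with z added: accuracy =", round(lifted_acc, 2))
```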
5.5 Naive Bayes
The naïve Bayes classifier is a probabilistic classifier that predicts the class of data
based on previous data by using probability measures. It assumes conditional
independence between every pair of features given the value of the class variable.
The technique can be used both for binary and multiclass classification.
The model of the classifier is given below:
P(c|x) = P(x|c) P(c) / P(x)
Here:
P(c|x) is the probability of class c given feature x, P(x|c) is the probability of feature
x given class c, P(c) is the prior probability of class c, and P(x) is the probability of
feature x.
Listing 5.7 below shows the pseudocode of the algorithm.
Now we will explain it with an example. Consider the dataset below, which shows
the result of each student in a particular subject. We need to verify the following
statement:
A student will get high grades if he takes "mathematics".
Here we have P(Math | High) = 3/9 = 0.33, P(Math) = 5/14 = 0.36, and P(High) =
9/14 = 0.64.
Now, P(High | Math) = P(Math | High) × P(High) / P(Math) = 0.33 × 0.64 / 0.36 ≈ 0.60.
So it is likely that the student will get high grades with mathematics. Table 5.11a–c
shows the dataset, frequency, and likelihood tables for this example.
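A minimal sketch reproducing the Bayes computation above in code; the probabilities are those quoted in the text.

```python
# P(High | Math) = P(Math | High) * P(High) / P(Math)
p_math_given_high = 3 / 9     # P(Math | High)
p_math = 5 / 14               # P(Math)
p_high = 9 / 14               # P(High)

p_high_given_math = p_math_given_high * p_high / p_math
print("P(High | Math) =", round(p_high_given_math, 2))   # about 0.60
```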
Now we will discuss some advantages and disadvantages of the naïve Bayes
approach.
Advantages:
• Simple and easy to implement
• It is a probability-based approach that can handle both continuous and discrete
features
• Can be used for multiclass prediction
• It assumes that features are independent, so it works well for datasets where this
assumption holds
• Requires less training data.
Disadvantages:
• Assumes that features are independent, which may not hold in the majority
of datasets
• May have low accuracy
• If a categorical feature value never appears with a class in the training data, the
model assigns it zero probability and consequently we will be unable to make a
prediction (this is usually handled by smoothing techniques such as Laplace
smoothing).
5.6 Artificial Neural Networks
Scientists have always been interested in making machines that can learn just as
humans do. This resulted in the development of artificial neural networks. The first
neural network was the "perceptron". In this section we will discuss the multilayer
perceptron with backpropagation.
A simple neural network model is shown in Fig. 5.19.
As shown, a simple neural network model comprises three layers, i.e. an input
layer, a hidden layer, and an output layer.
The input layer comprises the n features given as input to the network; as shown in
the diagram above, the input comprises the values {x0, x1, x2, …, xm}. Each input
is multiplied by the corresponding weight. The weight determines how important the
input is in the classification. All the inputs are provided to the summation function,
which is shown below:
Y = Σ_{i=0}^{m} wi xi + bias
The value of Y is then provided to the activation function, which enables the neuron
to be activated based on the summation value. One of the most common activation
functions is the sigmoid:
z = Sigmoid(Y) = 1 / (1 + e^(−Y))
The sigmoid activation function returns a value between zero and one, as shown in
Fig. 5.20.
A value above 0.5 activates the neuron and a value below 0.5 does not. The sigmoid
is the simplest activation function; there are many others as well, such as the rectified
linear unit (ReLU), but we will not discuss them here.
Loss function: The loss function determines how much the predicted value differs
from the actual value. This function is used as the criterion to train the neural
network: we calculate the loss function again and again using backpropagation
and adjust the weights of the neural network accordingly. The process continues until
there is no further decrease in the loss function. There are multiple loss functions;
here we will discuss only the mean squared error (MSE).
MSE is the average squared difference between the predicted value and the
true value. Mathematically:
MSE = (1/m) Σ_{i=1}^{m} (yi − ŷi)²
Here yi is the actual output and ŷi is the predicted output. We calculate the
loss function and adjust the weights accordingly in order to minimize the loss. For this
purpose, we use backpropagation: we calculate the derivative of the loss function with
respect to the weights, starting from the last layer.
Consider the neural network model shown in Fig. 5.21.
Here O11 represents the output of the first neuron of the first hidden layer and O21
represents the output of the first neuron of the second layer. With backpropagation,
we adjust w11^3 as follows:

w11^3(new) = w11^3(old) − η ∂L/∂w11^3

Now ∂L/∂w11^3 can be expanded using the chain rule as follows:

∂L/∂w11^3 = (∂L/∂O31) × (∂O31/∂w11^3)
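A minimal sketch of one forward pass and one gradient-descent weight update for a single sigmoid neuron with squared error, using NumPy; the input values, weights, and learning rate are made up for illustration.

```python
# Forward pass: weighted sum -> sigmoid; backward pass: chain rule
# dL/dw = dL/dz * dz/dY * dY/dw, then a gradient-descent update.
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

x = np.array([0.5, 0.2, 0.1])          # inputs x0, x1, x2
w = np.array([0.4, -0.3, 0.8])         # weights
bias, target, eta = 0.1, 1.0, 0.5      # bias, true output, learning rate

y_sum = np.dot(w, x) + bias            # weighted summation Y
z = sigmoid(y_sum)                     # activation
loss = (target - z) ** 2               # squared error for this single example

grad_w = -2 * (target - z) * z * (1 - z) * x   # chain rule applied to each weight
w = w - eta * grad_w                           # gradient-descent update of the weights

print("prediction:", round(z, 3), "loss:", round(loss, 3), "updated weights:", np.round(w, 3))
```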
to provide a framework that will enable the reader to recognize the assumptions
and constraints that are implicit in all such techniques.
• Inductive Inference for Large Scale Text Classification: Kernel Approaches and
Techniques (Authors: Catarina Silva, Bernadete Ribeiro)
Book gives a concise view on how to use kernel approaches for inductive inference
in large scale text classification; it presents a series of new techniques to enhance,
scale and distribute text classification tasks.
5.8 Summary
In this chapter we discussed the concept of classification in depth. The reason is that
the classification tree is one of the most commonly used classifiers. All the concepts
necessary for developing a classification tree were presented using examples. It
should be noted that most of the concepts related to classification trees also apply to
other classification techniques.
Chapter 6
Clustering
Clustering is the process of dividing objects and entities into meaningful and logi-
cally related groups. In contrast with classification, where we already have labeled
classes in the data, clustering involves unsupervised learning, i.e. we do not have any
prior classes; we simply collect similar objects into the same groups. For example,
all fruits with yellow color and a certain length may be placed in one group, perhaps
the group of bananas. Just like classification, clustering is another important technique
used in many sub-domains encompassed by data science, such as data mining, machine
learning, and artificial intelligence.
Figure 6.1 shows some sample clusters of a dataset.
In this chapter we will discuss the concept of a cluster, cluster analysis, and different
algorithms used for clustering.
A cluster is a group of data objects that share some common properties. For example, in Fig. 6.1 two clusters are shown: all the objects in the left cluster are closed shapes, whereas all the objects in the right-most cluster are open shapes. Cluster analysis is the process of grouping objects on the basis of their properties and their relationships with each other, and studying their behavior to extract information for analysis purposes. Ideally, all the objects within a cluster have similar properties, which differ from the properties of the objects in other groups.
However, it should be noted that the idea of dividing objects into groups (clusters) may vary from application to application. A dataset that is divided into two clusters may also be divided into more clusters at a more refined level; e.g. consider Fig. 6.2, where the same dataset is divided into four and six clusters.
Clustering can also be called unsupervised classification in that it also classifies objects (into clusters), but here the classification is done on the basis of the properties of the objects themselves, and we do not have any predefined model developed from training data. It should be noted that terms like partitioning or segregation are sometimes used as synonyms for clustering, but technically they do not capture the true meaning of the actual clustering process.
6.2 Types of Clusters
Clusters can be of different types based on their properties and how they are developed. Here we will provide some detail on each of these types.
We may also have other cluster types, such as fuzzy clusters. In fuzzy set theory an object belongs to a set to some degree, defined by its membership function. So the objects in fuzzy clusters belong to all clusters to a degree defined by the membership function, the value of which lies between zero and one.
Similarly, we may have complete and partial clustering. In complete clustering every object is assigned to some cluster, whereas in partial clustering some objects may not be. The reason is that such objects may not have well-defined properties that would assign them to a particular cluster. Note that we will discuss only partitional and hierarchical clusters.
There are some other notions used for different types of clusters. Here we will discuss a few of them.
Since the objects in a cluster share common properties, well-separated clusters are those in which the objects in one group (cluster) are closer to each other than to the objects in any other cluster. For example, if we use the distance between objects as the proximity measure to determine how close they are, then in well-separated clusters the objects within a cluster are closer to one another than to objects in other clusters. Figure 6.1 is an example of two well-separated clusters.
6.2.3.2 Prototype-Based
Sometimes we need one object from each cluster to represent it. This representative object is called the prototype, centroid, or medoid of the cluster, and it reflects the properties of the other objects in the same cluster. So, in prototype-based clusters, all the objects in a cluster are more similar to the center (centroid) of their own cluster than to the center (centroid) of any other cluster. Figure 6.4 shows four center-based clusters.
In contiguous clusters, objects are connected to each other because they lie within a specified distance of one another: an object in a cluster is closer to some other object in the same cluster than to any object in a different cluster. This may lead to irregular and intertwined clusters. However, it should be noted that if there is noise in the data, two different clusters may appear as one cluster, because a bridge between them may appear due to the presence of noise. Figure 6.5 shows examples of contiguous clusters.
In density-based clustering, a cluster is formed where the objects in a region are dense (much closer to each other) compared to the objects outside that region. The area surrounding a density-based cluster is less dense, i.e. it contains fewer objects, as shown in Fig. 6.6. This type of clustering is also used when clusters are irregular or intertwined.
Finally, there are shared-property clusters, also called conceptual clusters. Clusters of this type are formed by objects having some common property. Figure 6.7 shows an example of shared-property, or conceptual, clusters.
6.3 K-Means
Now we will discuss one of the most common clustering algorithms, called K-means. K-means is a prototype-based clustering technique, i.e. it creates prototypes and then arranges objects into clusters according to those prototypes. It has two versions, K-means and K-medoid. K-means defines prototypes in terms of centroids, which are normally the mean points of the objects in a cluster. K-medoid, on the other hand, defines the prototypes in terms of medoids, which are the most representative points of the objects in a cluster.
Algorithm 6.1 shows the pseudocode of the K-means clustering algorithm. Here K is the number of centroids, which in turn defines the number of clusters that will result. The value of K is specified by the user.
The algorithm initially selects K centroids and then assigns each object to the centroid closest to it. Once all the objects are assigned to their closest centroids, the algorithm starts the second iteration and the centroids are updated. The process continues until the centroids do not change any more, i.e. no point changes its cluster. Figure 6.8 shows the working of the algorithm.
In the figure, the algorithm completes its operation in four steps. In the first step, the points are assigned to the corresponding centroids. Note that here the value of K is three, so three centroids are identified; centroids are represented by the "+" symbol. Once the objects are assigned to centroids, in the second and third steps the centroids are updated according to the points assigned to them. Finally, the algorithm terminates after the fourth iteration because no more changes occur. This was a generic introduction to the K-means algorithm. Now we will discuss each step in detail.
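As a quick illustration of the procedure just described, the following minimal sketch (assuming scikit-learn is installed; the data points are invented) runs K-means with K = 3 and prints the resulting centroids and cluster assignments:

import numpy as np
from sklearn.cluster import KMeans

# Toy two-dimensional dataset (illustrative values only)
points = np.array([[1, 2], [1, 4], [2, 3],
                   [8, 8], [9, 9], [8, 9],
                   [0, 10], [1, 11], [0, 12]])

# K = 3 centroids; n_init random initializations are tried and the best result is kept
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)

print("Centroids:", kmeans.cluster_centers_)
print("Cluster labels:", kmeans.labels_)
print("SSE (inertia):", kmeans.inertia_)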
As discussed above, centroids are the central points of a cluster. All the objects in
a cluster are closer to their central points as compared to the centroids of the other
clusters. To identify this closeness of an object, we normally need some proximity measure. One of the most commonly used measures is the Euclidean distance. According to the Euclidean distance, the distance between two points in the plane with coordinates (x, y) and (a, b) is given by:

dist((x, y), (a, b)) = √((x − a)² + (y − b)²)
As an example, the (Euclidean) distance between the points (2, −1) and (−2, 2) is found to be:

dist((2, −1), (−2, 2)) = √((2 − (−2))² + ((−1) − 2)²)
                       = √((4)² + (−3)²)
                       = √(16 + 9)
                       = √25
                       = 5
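The same calculation can be wrapped in a small helper function; a quick sketch in plain Python (the function name is chosen only for illustration):

import math

def euclidean_distance(p, q):
    """Euclidean distance between two points given as (x, y) tuples."""
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

print(euclidean_distance((2, -1), (-2, 2)))   # prints 5.0, as in the worked example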
Once the objects are assigned to centroids in all the clusters, the next step is to re-compute or adjust the centroids. This re-computation is done on the basis of some objective function that we need to maximize or minimize. One such function is to minimize the squared distance between the centroid and each point in the cluster. The choice of a particular objective function depends on the proximity measure used to calculate the distance between points. The objective "minimize the squared distance between the centroid and each point" can be realized in terms of the sum of squared error (SSE). For the SSE, we find the Euclidean distance of each point from its nearest centroid and then compute the total. We prefer the clustering for which the sum of squared error (SSE) is minimum. Mathematically:
SSE = Σ_{i=1}^{n} Σ_{x∈C_i} dist(c_i, x)²

where the centroid c_i of the ith cluster is the mean of its points:

c_i = (1/m_i) Σ_{x∈C_i} x

Here:
x = an object.
C_i = the ith cluster.
c_i = centroid of cluster C_i.
c = centroid of all points.
m_i = number of objects in the ith cluster.
m = number of objects in the dataset.
n = number of clusters.
Similarly, if cosine similarity is used as the proximity measure (e.g. for document data), the objective becomes maximizing the total cohesion:

Total Cohesion = Σ_{i=1}^{n} Σ_{x∈C_i} cosine(c_i, x)

A number of proximity functions can be used, depending on the nature of the data and the requirements. Table 6.2 shows some sample proximity measures.
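As a rough sketch (NumPy assumed; the arrays below are invented), the SSE of a given clustering can be computed directly from its points, labels, and centroids:

import numpy as np

def sse(points, labels, centroids):
    """Sum of squared Euclidean distances of each point to its assigned centroid."""
    total = 0.0
    for i, c in enumerate(centroids):
        members = points[labels == i]          # points assigned to cluster i
        total += np.sum((members - c) ** 2)    # squared distances to centroid c
    return total

# Example usage with made-up values
points = np.array([[1.0, 2.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[1.5, 1.5], [8.5, 8.5]])
print(sse(points, labels, centroids))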
When we use the SSE as the objective, it may increase due to outliers. One strategy is to increase the value of K, i.e. produce more clusters, so that points lie closer to their centroids and the SSE decreases. Here we will discuss some strategies that can reduce the SSE by increasing the number of clusters, and complementary strategies that decrease the number of clusters again.
We can split a cluster. Different strategies can be used to choose which one: we can split the cluster having the largest SSE value, or, alternatively, the standard deviation measure can be used, e.g. the cluster having the largest standard deviation with respect to a particular attribute can be split.
We can introduce a new cluster centroid. A point far from any cluster center can be used for this purpose; however, to select such a point, we have to calculate the SSE contribution of each point.
We can disperse a cluster, i.e. remove its centroid and reassign its points to other clusters. To disperse a cluster, we select the one whose removal increases the total SSE the least.
We can merge two clusters. For this, we choose the clusters that have the closest centroids; we can also choose the two clusters whose merging results in the smallest increase in the total SSE.
6.5 Bisecting K-Means
Bisecting K-means is a hybrid approach that uses both K-means and hierarchical clustering. It starts with one main cluster comprising all the points and keeps splitting clusters into two at each step. The process continues until we obtain a prespecified number of clusters. Algorithm 6.2 gives the pseudocode of the algorithm.
We will now explain the algorithm with a simple and generic example. Consider the dataset P = {P1, P2, P3, P4, P5, P6, P7}, which we will cluster using the bisecting K-means algorithm.
Suppose we want three resulting clusters, i.e. the algorithm will terminate once all the data points are grouped into three clusters. We will use the SSE as the split criterion, i.e. the cluster having the higher SSE value will be split further.
So our main cluster will be C = {P1 , P2 , P3 , P4 , P5 , P6 , P7 } as shown in Fig. 6.9.
Now by applying K = 2, we will split the cluster C into two sub-clusters, i.e. C 1
and C 2 , using K-means algorithm, and suppose the resulting clusters are as shown
in Fig. 6.10.
Now we calculate the SSE of both clusters; the cluster having the higher value will be split further into sub-clusters. Suppose the cluster C1 has the higher SSE value. It will then be split into two sub-clusters as shown in Fig. 6.11.
Now the algorithm stops, as we have obtained the desired number of resulting clusters.
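A minimal sketch of this procedure (assuming scikit-learn's KMeans for the two-way splits and NumPy; the function name, data, and parameter values are illustrative) could look like this:

import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(points, k):
    """Repeatedly split the cluster with the largest SSE until k clusters remain."""
    clusters = [points]                          # start with one all-inclusive cluster
    while len(clusters) < k:
        # pick the cluster with the largest SSE (sum of squared distances to its mean)
        sses = [np.sum((c - c.mean(axis=0)) ** 2) for c in clusters]
        worst = clusters.pop(int(np.argmax(sses)))
        # split it into two sub-clusters using K-means with K = 2
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(worst)
        clusters.append(worst[labels == 0])
        clusters.append(worst[labels == 1])
    return clusters

data = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [9, 8], [15, 1], [16, 2]])
for i, cluster in enumerate(bisecting_kmeans(data, 3)):
    print("Cluster", i, ":", cluster.tolist())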
Hierarchical clustering also has widespread use. After partitional clustering, these techniques are the second most important family of clustering techniques. There are two common methods for hierarchical clustering.
Agglomerative These approaches start by considering each point as an individual cluster and keep merging the closest clusters on the basis of some proximity measure.
Divisive These approaches consider all the data points as a single all-inclusive cluster and keep splitting it. The process continues until we get singleton clusters that cannot be split further.
Normally, hierarchical clusters are shown using a diagram called a dendrogram, which shows the cluster–sub-cluster relationships along with the order in which the clusters were merged. However, we can also use a nested view of the hierarchical clusters. Figure 6.12 shows both views.
There are many agglomerative hierarchical clustering techniques; however, they all work in the same way: each individual point is initially treated as a cluster, and clusters are then merged repeatedly until only one cluster remains. Algorithm 6.3 shows the pseudocode of such approaches:
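For illustration, a brief sketch of agglomerative clustering using SciPy (assuming SciPy is installed; the data and parameter choices are invented):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

points = np.array([[1, 1], [1.5, 1], [5, 5], [5.5, 5], [9, 1]])

# Agglomerative clustering: repeatedly merge the two closest clusters
Z = linkage(points, method='single')   # 'single' uses the closest-pair (MIN) proximity

# Cut the resulting hierarchy to obtain, say, three flat clusters
labels = fcluster(Z, t=3, criterion='maxclust')
print(labels)

# dendrogram(Z) would draw the merge order as a dendrogram (requires matplotlib)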
There are many definitions of density, but we will consider only the center-based density measure. In this measure, the density for a particular point is measured by counting the number of points (including the point itself) within a specified radius called Eps (Epsilon). For example, Fig. 6.14 shows that the number of points within the Eps radius of point P is six, including point P itself.
Algorithm 6.4 shows the pseudocode of the algorithm.
Noise points If a point is neither a core point nor a border point, then it is considered a noise point. In Fig. 6.15, point P4 is a noise point.
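As a quick sketch of density-based clustering in practice (assuming scikit-learn; the point coordinates and the eps/min_samples values are invented):

import numpy as np
from sklearn.cluster import DBSCAN

points = np.array([[1.0, 1.0], [1.2, 1.1], [0.9, 1.0], [1.1, 0.9],   # dense region 1
                   [8.0, 8.0], [8.1, 8.2], [7.9, 8.1],               # dense region 2
                   [4.0, 5.0]])                                      # an isolated point

# eps plays the role of Eps; min_samples is the density threshold for core points
db = DBSCAN(eps=0.5, min_samples=3).fit(points)

# A label of -1 marks noise points; other labels identify the dense clusters
print(db.labels_)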
So far we have discussed different clustering algorithms. Now we will discuss some
generic characteristics of clustering algorithms.
Order Dependence For some algorithms, the quality and structure of the resulting clustering may vary depending on the order in which the data points are processed. Although this is sometimes desirable, in general we should avoid such algorithms.
Non-determinism Many clustering algorithms, such as K-means, require an initial random initialization. Thus the results they produce may vary from run to run. For all such algorithms, we may need multiple runs to obtain an optimal clustering structure.
Scalability It is common to have datasets with millions of data points and attributes. Therefore, the time complexity of a clustering algorithm should be linear or near-linear in order to avoid performance degradation.
Parameter Setting Clustering algorithms should require minimal parameter setting, especially for parameters that must be provided by the user and that significantly affect the result. The fewer parameters that need to be provided, the better.
Treating Clustering as an Optimization Problem Clustering is an optimization
problem, i.e. we try to find the clustering structures that minimize or maximize an
objective function. Normally such algorithms use some heuristic-based approach to
optimize the search space in order to avoid exhaustive search which is infeasible.
Now we will discuss some important characteristics of the clusters.
Data Distribution Some techniques assume that the data follows certain distributions, where each cluster corresponds to a particular distribution. Such techniques, however, require a strong conceptual basis in statistics and probability.
Shape Clusters can be of any shape: some may be regularly shaped, e.g. triangular or rectangular, while others may be irregularly shaped. In a dataset, a cluster may appear in any arbitrary shape.
Different Sizes Clusters can be of different sizes. The same algorithm may produce clusters of different sizes in different runs, based on many factors; one such factor is the random initialization in the case of the K-means algorithm.
Different Densities Clusters may have different densities, and clusters with varying densities commonly cause problems for methods such as DBSCAN and K-means.
Poorly Separated Clusters When clusters are close to each other, some techniques may combine them into a single cluster, which results in the disappearance of the true clusters.
Cluster Relationship Clusters may have particular relationships (e.g. their relative position); however, the majority of techniques normally ignore this factor.
Subspace Clusters Such techniques assume that we can cluster the data using a subset of attributes from the entire attribute set. The problem with these techniques is that using a different subset of attributes may result in different clusters in the same dataset. Furthermore, they may not be feasible for datasets with a large number of dimensions.
The book presents the bi-partial approach to data analysis, which is both uniquely general and enables the development of techniques for many data analysis problems, including related models and algorithms. The book offers a valuable resource for all data scientists who wish to broaden their perspective on basic approaches and essential problems, and thus find answers to questions that are often overlooked or have yet to be solved convincingly.
6.11 Summary
Clustering is an important concept used for analysis in data science applications in many domains, including psychology and other social sciences, biology, statistics, pattern recognition, information retrieval, machine learning, and data mining. In this chapter we provided an in-depth discussion of clustering and related concepts. We discussed different cluster types and the relevant clustering techniques. Efforts were made to explain the concepts through simple examples, especially in pictorial form.
Chapter 7
Text Mining
The major driving factor in the exploration of text data is the availability of textual data in digitized format. From a knowledge discovery perspective, this is similar to knowledge discovery in databases. Text is a rich and natural way to transfer and store information. The Internet is an outstanding example of this: it was estimated that the most popular Internet search engines indexed 3 billion documents in textual format. The availability of data in textual format presents an opportunity to improve decision making by tapping text data sources. Automated ways of discovering new knowledge from these large volumes of text are therefore likely to be sought. This need initiated the development of a new branch of text-based knowledge discovery, which leads toward text data mining.
The key issue in using text data sources to explore knowledge is that the data is unstructured, whereas data in databases is organized in a semantic manner, based on predefined data types, ranges of values, and labels. Text documents, by contrast, are unquestionably more versatile and richer in their expressive power; however, these benefits come with constraints, including the complexity inherent in the vagueness, fuzziness, and uncertainty of any natural language. For this reason the knowledge discovery, or text mining, discipline leverages multiple areas of research in computer science, including artificial intelligence, machine learning, computational linguistics, and information retrieval. What is understood as text mining differs from, and sometimes overlaps with, other fields that handle the computational treatment of text data, such as natural language processing and information retrieval.
A task-oriented perspective on text mining incorporates the following:
The examination of large sets of text data is a valuable method for gaining insights that are not typically possible through manual assessment. This type of investigation can be used to create knowledge, for example by mining relationships across documents belonging to research from different areas, which has led to new hypotheses in the medical domain. The literature contains many approaches, prototypes, and reports of successful systems, and also incorporates data description and data visualization.
Text classification is used to predict the class of a given text document and involves the application of classification techniques to text data. Automatic text classification is used to reduce manual intervention and to accelerate routing. In automatic text classification, text is categorized according to topics; a common example is the automatic routing of support requests according to the type of product they concern. In short, text classification involves the application of classification techniques to text data in order to predict a class for a given document.
The field of text mining opens up prospects for creating new knowledge extracted from text. In decision support, data extracted from textual documents enriches the available data sources and optimizes the information used. For knowledge management systems, large collections of text are available in digital format. Previous studies suggest that knowledge management, competitiveness, and innovation can all benefit from the application of data mining methods to text for generating new knowledge.
In knowledge management, text mining technology can be used as an assistive technology for the automated classification of documents, while text summarization and knowledge organization are used to mitigate information overload.
Text categorization has three main types, depending on how many categories from the total set of categories can be assigned to a text document.
When each document is assigned exactly one category from the given set of categories, this is called single-label text categorization. Note that the total number of available categories can still be more than one.
In multi-label text categorization, each text document may be assigned more than one category from the given pool of categories. A scoring mechanism together with a threshold may be used for document assignment: the generated score, compared against the threshold, determines the categories. For example, a news article may belong to both the entertainment category and the politics category.
In binary text categorization, as the name depicts, only two categories are available, and every document is assigned to one of them. For example, for patients in a cancer hospital there are only two categories, cancer and non-cancer; similarly, items in a newsfeed can simply be categorized as "interesting" or "uninteresting".
In machine learning, sample labeled documents are used to automatically define the classifier; the classifier learned from these labeled documents is then used to classify unseen documents.
In supervised machine learning, labeled data participates in learning, and unseen instances are then classified. The system in this technique holds the knowledge contained in the labeled dataset; it is the most frequently and commonly used technique. The labeled data, also termed training data, comes with different challenges, which include dealing with variations, choosing the right data, and generalization. This technique establishes a complete, concrete process to guide decision making.
When learning is performed on both labeled and unlabeled data, it is called semi-supervised machine learning. It is a technique in which both learning paradigms contribute their strengths: similarity-based learning from the unlabeled data and instructor-based learning from the labeled data. This approach combines the best outcomes of both.
• Document Pivoted
In document-pivoted categorization, for a given document we find all the categories to which it belongs. This approach can be used where the documents arrive in sequential form and the set of categories is stable, as intended for a junk mail system.
• Category Pivoted
In category-pivoted categorization, all the documents that belong to each category are found. When new categories are added to an existing system for which a set of documents is already given, category-pivoted categorization is a suitable method.
In text filtering, text is filtered by categorizing the documents one by one with the help of classifiers. For example, we can filter junk mail by categorizing each message as "spam" or "not spam". Automated text categorization is also used in services where a user subscribes to information such as emails, news articles, scientific papers, and event notifications.
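A rough sketch of such a filter (assuming scikit-learn; the toy messages and labels below are invented purely for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny labeled training set (invented examples)
messages = ["win a free prize now", "cheap offer click here",
            "meeting agenda attached", "lunch tomorrow?"]
labels = ["spam", "spam", "not spam", "not spam"]

# TF-IDF representation of each message followed by a Naive Bayes classifier
spam_filter = make_pipeline(TfidfVectorizer(), MultinomialNB())
spam_filter.fit(messages, labels)

print(spam_filter.predict(["free prize offer", "agenda for the meeting"]))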
• Computational Linguistics
The World Wide Web comprises countless sites on uncountable subjects. Web directories (for example, the Yahoo! directory) are well-known starting points for navigating to an appropriate site. These portals permit a user to browse through a hierarchical structure of categories, narrowing the set of potentially relevant sites with every click. The user can submit a query within a selected category, or can directly open a Web site by clicking the available hyperlinks. Web sites are growing very fast, which makes manual categorization infeasible. Text categorization techniques are therefore used in hierarchical Web page categorization, applying classifiers at each node to remove obsolete categories and add new ones to the hierarchy.
The vector space model is broadly used in text categorization; common operations carried out with it include document categorization, text clustering, and document ranking in search engines. In this model, a document set is represented as a set of vectors: each vector corresponds to a single document, and each component of the vector corresponds to a term in the document.
Consider a document set D that contains documents D_i, i.e. D = {D_1, D_2, ..., D_n}, where each document D_i is represented as a vector of term weights. These weights can be calculated through different approaches, which include term frequency–inverse document frequency (TF-IDF).
A new query is described in the same vector space as the documents, which allows quick similarity comparisons. In a search engine, each query is usually represented by the same form of vector as the documents, allowing such comparisons to be made efficiently. In text categorization, the documents in the training and test sets, as well as the documents to be categorized, are indexed and represented by the same model, usually the vector space model.
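A brief sketch of this idea (scikit-learn assumed; the documents and the query are invented) that represents documents and a query in the same vector space and ranks the documents by cosine similarity:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = ["the cat sat on the mat",
             "dogs and cats are pets",
             "stock markets fell sharply today"]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)    # one TF-IDF vector per document

# Represent the query in the same vector space, then compare it with every document
query_vector = vectorizer.transform(["cat on a mat"])
print(cosine_similarity(query_vector, doc_vectors))  # highest score for the first document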
For information retrieval, terms are allocated different weights based on some weighting scheme. The weight assigned to a term reflects the importance of that term. Text categorization uses different weighting techniques, some of which are given below.
(i) Term Frequency
Term frequency is the simplest term weighting method used in text categorization to measure the importance of a term in a given document. The weight of a term is calculated from the total number of times it appears in the document. Terms that occur very frequently across documents can help recall but have very little discriminative power. In the categorization process, terms with too high a frequency and terms with too low a frequency are generally not very useful.
(ii) Inverse Document Frequency
Inverse document frequency measures the importance of a term across the whole set of documents. It is related to the document frequency of the term: the importance of a term is inversely proportional to the number of documents in which it occurs. The relationship is inverse because a term that occurs in fewer documents has greater discriminative power and is therefore more significant than a term that occurs in a larger number of documents. Suppose a term "t" appears in "n" documents out of a dataset of "N" documents; then the inverse document frequency is calculated as:

idf_t = log(N/n)

Different terms carry different importance for different documents in a given dataset, depending on their discriminative capacity. Term frequency–inverse document frequency calculates the weight of each term while taking all the documents into consideration. Under this scheme, a term "t" is comparatively more important if it occurs frequently in one document and rarely in the other documents. Within a document d, the weight of a term "t" is calculated as:

tf-idf_{t,d} = tf_{t,d} × idf_t
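A small sketch of this calculation in plain Python (the tiny corpus is invented, and no smoothing is applied, unlike some library implementations):

import math

corpus = ["data science uses data",
          "text mining mines text data",
          "clustering groups similar objects"]
docs = [doc.split() for doc in corpus]

def tf(term, doc):
    return doc.count(term)                    # raw term frequency within one document

def idf(term, docs):
    n = sum(1 for d in docs if term in d)     # number of documents containing the term
    return math.log(len(docs) / n)            # idf_t = log(N / n)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)    # tf-idf_{t,d} = tf_{t,d} * idf_t

print(tf_idf("data", docs[0], docs))          # frequent locally, but present in 2 of 3 docs
print(tf_idf("clustering", docs[2], docs))    # rare term, occurs in only one document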
Classifying text documents in raw form would probably not be effective, and we would probably not achieve ideal performance, because of the presence of noise terms, for example extra white space and special characters. It is therefore preferable to preprocess the documents and convert them into a refined form so that optimal performance can be achieved; such preprocessing reduces the number of elements in the input text documents considerably.
(i) Dataset Cleaning
Dataset cleaning is carried out because public datasets sometimes cannot be categorized directly. This happens because of the presence of terms that are not significantly important for the categorization process; such terms affect both effectiveness and performance. To perform the categorization process efficiently and to achieve better results, it is important to clean such datasets. Cleaning removes unnecessary terms, which reduces the size of the vocabulary. Through this process noise terms are reduced, which eases the categorization process and eventually improves performance. In one study, the following steps were carried out to clean the dataset of these types of terms (a short sketch of these steps follows the list).
(i) Replace special characters (e.g. @, _, etc.) with white space
(ii) Replace multiple white spaces with a single one
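A minimal sketch of these two cleaning steps using the standard-library re module (the sample string is invented):

import re

raw = "Special   offer!!! contact us @ sales_dept"

cleaned = re.sub(r"[^A-Za-z0-9\s]", " ", raw)    # (i) replace special characters with white space
cleaned = re.sub(r"\s+", " ", cleaned).strip()   # (ii) collapse multiple white spaces into one

print(cleaned)   # "Special offer contact us sales dept"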
The preprocessing steps to apply depend on the text mining task at hand. The treatment of the source data dictates the characteristics of the model and the information that the model can provide. Hence, matching the preparation steps with the overall objective of the exercise is vital.
One of the primary tools for gathering relevant information from the documents is the development of stop lists. Such lists contain terms that are likely to appear in almost every record and therefore hold no knowledge when trying to spot trends. Basic English words like "the", "and", and "of" are ordinary candidates for a stop list. However, stop words are not always useless; in some cases they become significant for describing a particular scenario, so the list should be built carefully with the specific objective of the data mining task in mind. Stemming and lemmatization purposefully reduce term variation by mapping similar occurrences to a lemma, stem, or canonical form, or by reducing words to their inflectional root. Through this process noisy signals are reduced, the number of attributes that need analysis in a text collection is reduced, and the dataset dimensionality is reduced. For example, stemming maps singular and plural forms to a single form, and present and past tense forms can likewise be merged. The preprocessing phase depends on the objectives of the text mining exercise, and any loss of relevant information due to stemming should be evaluated.
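For illustration, a short stemming sketch (assuming the NLTK package is installed; the word list is invented):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["banana", "bananas", "connected", "connecting", "connection"]:
    print(word, "->", stemmer.stem(word))
# singular/plural and related variations are mapped to a common stem,
# e.g. the "connect*" variants all reduce to the same form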
Noise needs to be removed from the data, and when the environment is uncontrolled the amount of noisy data may increase considerably. Different measures can be adopted to clean this data. When the data is in text form, the issues may include spelling errors, correcting inconsistent spellings, expanding abbreviations, resolving term shortenings, converting between uppercase and lowercase where required, and stripping markup language tags.
Another issue in natural language is that different meanings can be conveyed by the same term, depending on how the words are used within the sentence or on the type of document. Many examples can illustrate this; in the following example, the word "book" is used with different meanings in two different sentences.
• “This is a great book.”
• “You can book your flights from this website.”
When the meaning of a term needs to be discovered, the sentence in which it appears is the means of understanding it. Determining the intended meaning of a term in context is called word sense disambiguation, and it is a topic of research in both machine learning and natural language processing.
Tokenization is the process of segmenting input text into its atomic components. The tokenization approach used for a document depends on the mining objectives. The most common approach is to use individual words as tokens, with punctuation marks and spaces used as separators. The unit of analysis can also be a collection of more than one word: when text sets are examined, statistical measures of co-occurrence produce collocations, which can also be obtained through dictionaries and information extraction techniques.
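A minimal tokenization sketch using individual words as tokens (standard-library re; the sentence is invented):

import re

sentence = "Text mining, like data mining, discovers knowledge from text!"

# Use individual words as tokens; punctuation marks and spaces act as separators
tokens = re.findall(r"[A-Za-z0-9]+", sentence.lower())
print(tokens)
# ['text', 'mining', 'like', 'data', 'mining', 'discovers', 'knowledge', 'from', 'text']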
Opinion mining is a cutting-edge, innovative field of research aimed at gathering opinion-related information from textual data sources. It has many interesting applications covering both academia and commerce. Because it addresses novel intellectual challenges, a considerable amount of research considers it an interesting subject. This section introduces and discusses the opinion mining research field, its challenges, key tasks, and the motivations behind it. It then presents the SentiWordNet lexical resource in detail, including its applications, limitations, and potential advantages.
Information regarding people's opinions can be a really important input for making decisions more correctly in a variety of realms. For instance, businesses have a strong interest in finding out what their consumers think about a new product introduced in a marketing campaign. Buyers, on the other hand, may benefit from reading the thoughts and feedback of others about a given product they intend to buy, since suggestions from other consumers strongly affect buying decisions. Knowledge of the views of other citizens is also important in the political sphere, where one might want to gauge, for instance, the sentiment toward a new law, or toward an individual such as an official or demonstrators.
Recently, the Web has made it feasible for opinions to be found in written text from a spread of sources and on a much more extensive scale. It has likewise made it simpler for individuals to express their perspectives on pretty much every issue through dedicated sites, blogs, discussion forums, and product review pages. Opinion sources are not constrained to specialty review sites; they also include Web users' blog posts and discussion forums, and are embedded in online social networks.
Plainly, the Web is an enormous archive of content produced by its users, much of it devoted to communicating opinions on any subject of concern. Moreover, opinions are generally expressed in the form of text, which makes them rich ground for text mining and natural language processing techniques for the analysis of opinion information.
The clearest use of opinion mining techniques is searching within documents for opinions. Discovering the subjective content related to a subject and its polarity can turn regular search engines into opinion-aware engines, returning results on a given topic that hold only positive or only negative sentiments, for instance when searching for items that have received the best reviews in a specific field, for example a user query for digital cameras with better battery life that have also received good feedback.
On the contrary, information retrieval systems that need to offer factual information on a subject of interest can identify and remove opinion-related information to increase the relevance of the results.
Opinion mining can also be used to classify subjective statements in collaborative environments such as email lists or discussion groups, which may contain inappropriate remarks known as flaming behavior. In online advertising, the technique can be used for ad campaign placement: the placement of an advertisement can be avoided on content that is not related to the advertisement or that carries opinions unfavorable to the product or brand.
Systems that handle customer relationships can also be made more responsive by using sentiment detection as a way to accurately gauge customer satisfaction from feedback. Another example is the automated sorting of customer feedback received through email into messages expressing positive and negative sentiments, which can then be routed automatically to the appropriate teams for taking the necessary corrective measures.
Opinion mining has the potential to add great value in the field of knowledge management, and the examples illustrated above make it clear that it is a value-adding field of research for a wide range of activities in companies around the globe. Knowledge-based systems can extend their query interfaces so that explicitly stored content also carries opinion information, yielding more relevant results; they can likewise exclude subjective documents when more factual results are required. Knowledge management systems need less administration effort when sentiment detection is applied, since unwanted user behavior and flaming can be avoided, and the exchange of tacit and explicit knowledge in collaborative environments can become more fluid. Lastly, knowledge discovery systems can leverage opinion information to create knowledge for an organization and to improve the decision-making process based on relevant user feedback.
according to a spectrum of possible opinions, such as film reviews rated on a scale from zero to five stars.
The use of keyword lists is the most common approach for subjectivity detection and sentiment classification; such lists give an indication of both positive and negative bias and of general subjectivity. Approaches based on word lists do not require training data to make predictions, since they rely on a predefined sentiment lexicon, and they can therefore be applied where no training data is available. For this reason these approaches are considered unsupervised learning methods.
Making word lists manually is a time-consuming procedure. The literature contains approaches that automatically create resources comprising opinion information, building word lists on the basis of available lexicons; this is labeled lexical induction. Other approaches derive opinion information by examining term relationships in WordNet, starting from a few terms assumed a priori to carry opinion information. Common core words of this kind are "poor", "bad", "excellent", and "good".
7.10.1 SentiWordNet
SentiWordNet is one example of a lexical resource intended to aid opinion mining tasks. It aims to supply opinion polarity information at the term level by deriving that information semi-automatically from the relationships in the WordNet database of English terms. For each term in WordNet, SentiWordNet provides a positive and a negative score ranging from 0 to 1, mirroring the term's polarity: higher scores indicate terms that convey strong opinion bias, while lower scores indicate a term that is less subjective.
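For a quick look at these scores, a small sketch (assuming NLTK is installed and the wordnet and sentiwordnet corpora have been downloaded) might look like this:

import nltk
from nltk.corpus import sentiwordnet as swn

# One-time downloads of the required corpora
nltk.download("wordnet")
nltk.download("sentiwordnet")

# Each synset of a term carries a positive, a negative, and an objectivity score
for synset in swn.senti_synsets("excellent"):
    print(synset, synset.pos_score(), synset.neg_score(), synset.obj_score())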
WordNet is a lexical database of the English language in which terms are organized according to their semantic relationships. It has been broadly applied to natural language processing problems, and a comprehensive list of such work is available in the literature.
The WordNet lexicon is the product of work in linguistics and psychology at Princeton University aimed at better understanding the semantic relations between English words. It furthermore provides a full lexicon of the English language in which terms can be retrieved and explored by concept and by their semantic connections.
In its third version, WordNet is accessible as a database, through a Web interface, or by means of an assortment of software APIs, providing a complete database of over 150,000 unique terms organized into more than 117,000 distinct meanings. WordNet has additionally grown through extensions of its structure applied to various other languages.
The main relationship between terms in WordNet is synonymy. Synonymous terms are grouped together into sets of synonyms known as synsets. The main criterion for gathering terms into a synset is whether a term used within a sentence in a particular context can be replaced by another term of the same synset without changing the meaning of the sentence.
Terms are first separated by syntactic category, since nouns, adjectives, verbs, and adverbs are not interchangeable within a sentence. Each synset also contains a short description (gloss) of its terms, which helps in determining the meaning of the different terms; this is particularly valuable for synsets with just a single term or with few relations.
Another relationship between terms in WordNet is antonymy. In the special case of adjectives, there is a distinction between direct and indirect antonyms, i.e. a term can be opposed to another directly or indirectly by means of another conceptual relationship. For example, "wet/dry" and "heavy/light" are direct antonyms, while "heavy/weightless" are conceptually opposed and therefore indirect antonyms: they belong to synsets between which a direct antonym relation exists ("heavy/light"), but they are not directly linked.
A further class of relationship between terms in WordNet is hyponymy, which represents the hierarchical "is-a" type of relation between terms, as in the case of "oak/plant" and "car/vehicle". Another relation is meronymy, which represents the "part-of" relationship between terms.
An attribute type of relation is also present for the special case of adjectives, indicating that an adjective acts as a modifier of a certain generic quality; for example, for the attribute "weight" the adjectives "heavy" and "light" are used as modifiers. This relationship links the noun representing an attribute to the adjectives that modify it.
Building on WordNet's semantic relationships, SentiWordNet derives synset opinion ratings using a semi-supervised approach in which only a small set of synset terms, known as the paradigmatic words, is manually labeled, and the rest of the database is labeled using an automated process. The procedure is outlined below:
1. Paradigmatic words extracted from the WordNet-Affect lexical resource are manually labeled as positive or negative according to their opinion polarity.
2. Iteratively extend each label set by inserting WordNet terms that are linked to already labeled terms by a relation considered to preserve term orientation. To expand the labels, the following relationships are used:
a. Direct antonym
b. Attribute
c. Hyponymy (pertains-to and derived-from), also-see, and similarity.
3. For the newly added terms, according to the direct antonym relation, add the terms representing the directly opposite opinion orientation to the opposite label.
4. Repeat steps 2 and 3 for a fixed number of iterations K.
On completion of steps 1–4, a subset of WordNet synsets is labeled positive or negative. To assess scores for all the remaining terms, a committee of classifiers is trained on the synset glosses, i.e. the textual definitions that WordNet provides for each synset meaning. The process then continues by classifying new entries according to this training data and computing an aggregated value, as described below:
5. Each synset labeled in steps 1–4 is converted into a word vector representation with its negative/positive label. A committee of classifiers can be trained on this dataset:
(a) A pair of classifiers is trained to make the following predictions: positive/non-positive and negative/non-negative. Synsets that belong to neither the positive nor the negative label set are not present in the training set and are allocated to the "objective" class, with zero-valued positive and negative scores.
(b) The process is repeated with training sets of different sizes, obtained by changing the value of K in the last step: 0, 2, 4, 6, and 8.
(c) The support vector machine and Rocchio classification algorithms are used for each training set.
6. When the set of classifiers is applied to new terms, each resulting classifier returns a prediction score. These scores are summed and normalized to 1.0 in order to produce the final positive and negative scores for a word.
The method outlined above for building SentiWordNet highlights the dependence of the term scores on two distinct factors. Firstly, the choice of paradigmatic words that will produce the entire set of positive and negative scores must be carefully examined, since the extension of scores to the remainder of the WordNet terms relies on this core set of terms for making scoring decisions. Secondly, the machine learning stage of the process relies on the textual interpretations of synsets, or glosses, to determine the similarity of a new term to positive or negative words.
SentiWordNet is a lexical tool used for opinion mining that can be beneficial in certain situations. In opinion mining, term-level sentiment information has gained enormous research attention. SentiWordNet can be used as a substitute for manually constructed sentiment lexicons, which are often built on an individual basis for specific opinion mining tasks from sources such as Internet pages, official documents such as laws and regulations, books and newspapers, and the social Web.
7.12 Summary
Text mining involves the digital processing of text for the production of novel information and incorporates technologies from natural language processing, machine learning, computational linguistics, and knowledge analysis. Applications of text mining to information exploration were discussed on the basis of exploratory research and other conventional data mining techniques. Opinion mining was introduced as a new research area: a modern field of research using components of text mining, natural language processing, and data mining, with a great range of feasible applications for deriving opinions from documents, as mentioned in this chapter. These range from developing business intelligence in organizations to recommender technologies, more effective online advertising, spam detection, and information management frameworks. It has been shown that opinion mining can help knowledge management programs explicitly, by raising the quality of information archives through opinion-aware applications, and indirectly, by incorporating data derived from textual data sources and thus providing further incentives for knowledge development within the business. Finally, WordNet and SentiWordNet were introduced, along with their potential uses and a presentation of their building blocks. SentiWordNet is a well-known extension of the popular WordNet database of terms and relationships; it is a freely available lexical source of sentiment information. It can be used in opinion mining research, where a number of compatible approaches have otherwise been developed in an ad hoc style. SentiWordNet is one of the most important and prominent components discussed in this chapter. The next chapter discusses the strengths and function of this tool in depth, taking into account the complexities of opinion mining discussed here. The final outcome is the implementation of a set of features that incorporate sentiment knowledge derived from SentiWordNet and can be applied to sentiment classification problems.
Text classification, which is also known as text categorization, was also introduced in this chapter. The reader has seen different approaches for text document categorization, document representation schemes, and text document preprocessing mechanisms. Text categorization is the process of assigning a text document to a group from a list of specified categories on account of its contents.
Chapter 8
Data Science Programming Languages
In this chapter we will discuss two programming languages commonly used for data science projects, i.e. the Python and R programming languages. The reason is that a large community uses these languages and there are a lot of libraries available online for them. First Python will be discussed, and in the later part we will discuss the R programming language.
8.1 Python
Python is one of the most common programming languages for data science projects. A number of libraries and tools are available for coding in Python. Python is open source; its source code is available under the OSI-approved, GPL-compatible Python Software Foundation License.
To understand and test the programs that we will learn in this chapter, we need the following resources and tools.
Python: https://www.python.org/downloads/
Python Documentation: https://www.python.org/doc/
PyCharm, an IDE for Python: https://www.jetbrains.com/pycharm/download/
Installation of these tools is very simple; you just need to follow the instructions given on screen during installation.
After installation we are ready to write our first Python program.
Listing 8.1: First Python program
1. print ("Hello World!");
Like other programming languages, Python has some reserved words which cannot be used for user-defined variables, functions, arrays, or classes. Some of Python's reserved words are given in Table 8.1.
Unlike many other programming languages, Python does not use braces "{" and "}" for decision-making and looping statements, function definitions, and classes. Blocks of code are denoted by line indentation. The number of spaces in the indentation can vary, but all statements within a block must be indented by the same amount.
Normally a statement in Python ends at the end of the line, but when we want to spread a statement over multiple lines, Python allows us to use the line continuation character "\" (shown below in combination with the "+" operator for string concatenation). The following example demonstrates the use of the line continuation character.
Listing 8.2: Line continuation in Python
1. name = "Testing " + \
2. "Line " + \
3. "Continuation"
4. print (name);
If the statements enclosed in brackets like (), {}, and [] are spread into multiple
lines, then we do not need to use line continuation character.
Listing 8.3: Statements enclosed in brackets
1. friends = ['Amber', 'Baron', 'Christian',
2. 'Crash', 'Deuce',
3. 'Evan', 'Hunter', 'Justice', 'Knight']
4. print (friends[0]+" and " + friends[5] + " are friends")
Python allows different types of quotation marks for strings. Single (') and double (") quotes can be used interchangeably, but if one type of quotation mark occurs inside the string, then we should enclose the string with the other type. Triple quotations, of either the single (') or the double (") type, allow the string to contain line breaks, i.e. to span multiple lines.
Listing 8.4: Different types of quotations in Python
1. word = "Don't"
2. sentence = 'Testing double (") quotation'
3. paragraph1 = """Testing paragraph
4. with triple double (") quotations"""
5. paragraph2 = '''Testing paragraph
6. with triple single (') quotations'''
7.
8. print (word)
9. print (sentence)
10. print (paragraph1)
11. print (paragraph2)
Any line in Python starting with the # sign is considered a comment. If the # sign appears inside a string literal, then it is not treated as a comment. The Python interpreter ignores all comments.
For multi-line comments in Python we can use three single (') quotation marks at the start and end of the comment.
Listing 8.6: Multi-line comments in Python
1. '''
2. These are multi-line comments
3. span on
4. multiple lines
5. '''
6. print ("Testing multi-line comments")
In Python, we declare a variable by assigning a value to it. Variables in Python are loosely typed, which means we do not need to specify the type of a variable while declaring it; the type is decided by the value assigned to it. For assignment we use the equals (=) sign: the operand on the left of the = sign is the name of the variable, and the operand on the right is its value.
Listing 8.7: Variables in Python
1. age = 25 # Declaring an integer variable
2. radius =10.5 # Declaring a floating point variable
3. name = "Khalid" # Declaring a string
4.
5. print (age)
6. print (radius)
7. print (name)
The data saved in a variable or memory location can be of different types. For example, we can save the age, height, and name of a person: the type of age is integer, the type of height is float, and the name is a string.
Python has five standard data types:
• Numbers
• String
• List
• Tuple
• Dictionary.
In Python, when we assign a numeric value to a variable, a numeric object is created. A numeric value can be written in decimal, octal, binary, or hexadecimal notation and can be an integer, a floating-point number, or a complex number. Python 3 supports three built-in numerical types (the separate long type of Python 2 has been merged into int):
• int
• float
• complex
Strings in Python are sets of contiguous characters enclosed in single or double quotation marks. A subset of a string can be extracted with the help of the slice operators [ ] and [:]. The first character of each string is located at index 0, which we provide inside the slice operators. For strings, the plus (+) sign is used for concatenation and the asterisk (*) sign is used for repetition.
Listing 8.9: String in Python
1. test = "Testing strings in Python"
2.
3. print (test) # Displays complete string
4. print (test[0]) # Displays first character of the string
5. print (test[2:5]) # Displays the characters from index 2 up to (but not including) index 5
6. print (test[2:]) # Displays all characters from index 2 to end of string
7. print (test * 2) # Displays the string two times
8. print (test + " concatenated string") # Display concatenation of two strings
Lists are flexible, compound data types in Python. The items are enclosed in square brackets ([ ]) and separated by commas. The concept of a list is similar to an array in C or C++, with some differences: an array in C can hold only one type of element, but a Python list can hold elements of different types. Similar to strings, the elements stored in a list are accessed with the help of the slice operators ([ ] and [:]), and indexing starts from zero. For lists too, the plus (+) sign is used for concatenation and the asterisk (*) sign for repetition.
Another sequence data type, similar to the list, is the tuple. Like a list, a tuple is a combination of elements separated by commas, but instead of square brackets the elements of a tuple are enclosed in parentheses ( ). As modification is not allowed in a tuple, we can say a tuple is a read-only list.
Listing 8.11: Tuples in Python
1. # This tuple contains the information of a student in the order Name, Age, CGPA, Marks, State
2. tuple = ('John', 20, 2.23, 85, 'New York')
3.
4. print (tuple) # prints the complete tuple
5. print (tuple [0]) # prints the first element of the tuple
6. print (tuple [2:5]) # prints the elements from index 2 up to (but not including) index 5
7. print (tuple [3:]) # prints the elements from index 3 to the end of the tuple
8. print (tuple * 2) # prints the tuple two times
9. print (tuple + tuple) # concatenates the tuple with itself
Python has a data type which is similar to the hash table data structure. It works like an associative array of key–value pairs. The key of a dictionary can be almost any Python type (it must be immutable), but mostly numbers or strings are used as keys, since, compared with other data types, numbers and strings are more meaningful as keys. The value part of a dictionary can be any Python object. A dictionary is enclosed in curly braces ({ }), while the values of a dictionary are accessed using square brackets ([ ]).
Listing 8.12: Dictionary in Python
testdict = {'Name': 'Mike', 'Age': 20, 'CGPA': 3.23, 'Marks': 85}
testdict['State'] = 'New York'

print(testdict)            # prints the complete dictionary
print(testdict['Age'])     # prints the value stored under the key 'Age'
print(testdict.keys())     # prints all the keys of the dictionary
print(testdict.values())   # prints all the values of the dictionary
We have some built-in functions in Python to convert one data type into another. Some of them are given in Table 8.2. After conversion, these functions return a new object with the converted value.
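A few of these conversion functions are shown below (a minimal sketch; the values are illustrative, and only conversions available in standard Python are used).
marks = "85"
print(int(marks) + 5)          # string to integer, prints 90
print(float(marks))            # string to float, prints 85.0
print(str(3.14) + " approx")   # float to string, prints 3.14 approx
print(list("abc"))             # string to list, prints ['a', 'b', 'c']
print(tuple([1, 2, 3]))        # list to tuple, prints (1, 2, 3)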
Operators are constructs used to perform specific mathematical, logical, or other manipulations on operands. Python has many types of operators; some of them are given below.
• Arithmetic operators
• Comparison operators (also called relational operators)
• Assignment operators
• Logical operators.
Suppose we have two variables a and b with numeric values 10 and 5, respectively. Table 8.3 shows the results of applying the arithmetic operators to these variables.
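The same operations can be tried directly, as in the following sketch (the exact set of operators listed in Table 8.3 may differ slightly).
a = 10
b = 5
print(a + b)    # addition: 15
print(a - b)    # subtraction: 5
print(a * b)    # multiplication: 50
print(a / b)    # division: 2.0
print(a % b)    # remainder: 0
print(a ** b)   # exponent: 100000
print(a // b)   # floor division: 2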
Comparison operators (relational operators) in Python are used to compare two operands or values.
Suppose we have two variables a and b with numeric values 10 and 5, respectively. In Table 8.4, we use these variables as operands and perform some comparison operations.
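A minimal sketch of these comparisons, using the same values of a and b:
a = 10
b = 5
print(a == b)   # equal to: False
print(a != b)   # not equal to: True
print(a > b)    # greater than: True
print(a < b)    # less than: False
print(a >= b)   # greater than or equal to: True
print(a <= b)   # less than or equal to: False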
Logical operators in Python combine the truth values of the operands on the left and right sides of the operator.
Suppose we have two variables a and b, both with the Boolean value True. In Table 8.6 we use these variables as operands and perform some logical operations.
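A minimal sketch of the logical operators, using the same Boolean values:
a = True
b = True
print(a and b)   # True, because both operands are true
print(a or b)    # True, because at least one operand is true
print(not a)     # False, the negation of a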
Like other languages, Python also has operator precedence. Table 8.7 shows the precedence of some operators in Python, listed from highest to lowest.
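Precedence can be observed directly; the following sketch (the values are illustrative) shows how it changes a result.
print(2 + 3 * 4)     # multiplication binds tighter than addition: 14
print((2 + 3) * 4)   # parentheses change the evaluation order: 20
print(-2 ** 2)       # exponentiation binds tighter than unary minus: -4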
8.1.20.1 if Statement
If the Boolean expression of the "if" statement evaluates to true, then the statements following the expression execute. If it evaluates to false, the statements do not execute.
Syntax
if expression:
    statement(s)
Listing 8.13: if statement in Python
booleanTrueTest = True
if booleanTrueTest:   # testing a true expression using a Boolean value
    print("Expression test of if using Boolean true. If expression is true then this message will show.")

numericTrueTest = 1
if numericTrueTest:   # testing a true expression using a non-zero number
    print("Expression test of if using numeric true. If expression is true then this message will show.")

print("Testing finished")
If the Boolean expression of "if" evaluates to true, then the statements of "if" execute. If it evaluates to false, then the statements of "else" execute.
Syntax
if expression:
    statement(s)
else:
    statement(s)
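For example (a minimal sketch; the variable name and values are illustrative):
marks = 45
if marks >= 50:
    print("Result: Pass")
else:
    print("Result: Fail")   # executes here because the expression is false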
In Python the "elif" statement is used to check multiple expressions for truth and execute one or more statements. Like "else", the "elif" statement is optional. The expression of the first "elif" is tested only if the expression of "if" evaluates to false. If the expression of an "elif" is false, the next "elif" is tested, and so on, until the optional "else" statement at the end is reached. If the expression of "if" or of any "elif" evaluates to true, all remaining "elif" branches and the "else" are skipped.
Syntax
if expression1:
    statement(s)
elif expression2:
    statement(s)
elif expression3:
    statement(s)
else:
    statement(s)
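For example (a minimal sketch; the variable name and thresholds are illustrative):
cgpa = 3.2
if cgpa >= 3.5:
    print("Grade: A")
elif cgpa >= 3.0:
    print("Grade: B")   # this branch executes for 3.2
elif cgpa >= 2.0:
    print("Grade: C")
else:
    print("Grade: F")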
A while loop in Python repeats execution of its statements as long as the provided expression or condition remains true.
Syntax
while expression:
    statement(s)
A while loop, just like in other programming languages, executes a statement or set of statements as long as the specified condition remains true. It is possible that the while loop does not run even once; this happens if the expression evaluates to false the first time it is tested.
Listing 8.16: while in Python
character = input("Enter 'y' or 'Y' to iterate loop, any other key to exit: ")
counter = 1
while character == 'y' or character == 'Y':
    print('Loop iteration :', counter)
    counter += 1
    character = input("Enter 'y' or 'Y' to iterate loop again, any other key to exit: ")

print("Testing Finished")
In a for loop, each element of the sequence is assigned in turn to the iterating variable, and the loop statement(s) execute. This continues until the end of the sequence (list) is reached.
Listing 8.17: Iterating list through elements using for loop in Python
Colors = ['Red', 'White', 'Black', 'Pink']   # a list of some colors
# iterate directly over the elements of the list
for color in Colors:
    print(color)

print("Loop iterations finished")
Iterating by Sequence Index
In Python there is another way to iterate each item of the list by index offset. In the
following example we are iterating “for” using index offset method.
Listing 8.18: Iterating list through index using for loop in Python
Colors = ['Blue', 'Green', 'White', 'Black']   # a list of some colors
# this loop prints each element of the list together with its index
for index in range(len(Colors)):
    print(Colors[index] + " is at index " + str(index) + " in the list")

print("Loop iterations finished")
The range( ) function controls the number of iterations. If we want the loop to iterate ten times, we provide 10 as the argument of range( ); with this configuration the loop iterates over indexes 0 to 9. We can also use range( ) for another specific range: if we want to iterate over indexes 5 to 9, we provide two arguments to range( ), the first being 5 and the second 10 (the end value is not included).
The second built-in function we used is len( ); this function counts the number of elements in the list.
Finally we used the str( ) function, a data-type conversion function that converts a number into a string.
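These three functions can be tried on their own, as in the following sketch (the list is the one used above):
print(list(range(10)))      # indexes 0 to 9, ten values in total
print(list(range(5, 10)))   # indexes 5 to 9
print(len(['Blue', 'Green', 'White', 'Black']))   # prints 4, the number of elements
print("Index: " + str(2))   # the number is converted to a string before concatenation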
Python provides a feature of loops that is not available in most popular programming languages: using an else statement with a loop. In other languages we can use an else statement only with an if statement.
We can use an else statement with a for loop, but it executes only when the loop has exhausted the iterating list. In other words, if the loop iterates over the complete list, the else block executes; if a break statement stops the loop iterations, the else block does not execute.
We can use an else statement with a while loop too. If the condition becomes false and the loop stops normally, the else block executes; if the loop stops because of a break, the else block does not execute.
The following demonstrates the case in which the else block does not execute because the loop is stopped with a break statement.
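A minimal sketch of such a program (the list values are illustrative, not the book's original listing); its output appears below the code.
for number in [1, 2, 3, 4, 5]:
    if number == 4:
        print("using break statement")
        print("Loop iterations stopped with the use of break statement")
        break
    print(number)
else:
    print("Loop finished without break, so else executed")   # skipped because of the break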
1
2
3
using break statement
Loop iterations stopped with the use of break statement
Python allows us to put a loop inside another loop. When the outer loop executes, the inner loop may also execute, depending on the truth value of the inner loop's expression. Normally a nested loop is used when we need to process the smallest items of the data. Suppose we have a list that contains different color names and we want to print each character of each color name. In this scenario we need two loops: an outer loop to extract each color name from the list, and an inner loop to get every character from a color name.
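A minimal sketch of this scenario (the color names are illustrative):
colors = ['Red', 'Blue']
for color in colors:              # outer loop: one color name per iteration
    print("Characters of", color)
    for character in color:       # inner loop: one character per iteration
        print(character)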
To define a user-defined function, we need to provide some details, which are given below.
• To define a function, we first use the def keyword.
• Next we give the name of the function. It is always recommended to use a meaningful name, so that by looking at the name anyone can judge what the function does.
• Then we provide the arguments of the function. Not all functions need arguments, and the number of arguments depends on our requirements.
• The next step is providing a string called the documentation string, or docstring. This string provides information about the functionality of the function.
• Then we write the statement(s) that we want to execute on a call of this function. These statements are also called the function suite.
The last statement can be a return value. Not all functions have a return value; in that case we can use only the return keyword.
Syntax
def functionname(parameters):
    "function_docstring"
    function_suite
    return [expression]
In the code examples we have already used some functions, such as print( ), range( ), and len( ), but these are built-in functions: functions provided by the language and its libraries. We can also design our own functions, which are called user-defined functions, as the sketch below shows.
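The function name, message, and argument in this sketch are illustrative assumptions; it simply follows the components listed above.
def greet(name):
    """Prints a greeting for the given name."""   # the docstring
    print("Hello,", name)                         # the function suite
    return                                        # no value is returned

greet("Khalid")   # calling the user-defined function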
Pass by value means the argument passed to a function is a copy of the original object: if we change the value of the object inside the function, the original object does not change. Pass by reference means the argument passed to a function refers to the original object: if we change the object inside the function, the original object changes as well.
In Python, a function receives a reference to the object that is passed in. If the object is mutable (such as a list) and we modify it inside the function, the change is visible to the caller; rebinding the parameter to a new object, however, does not affect the original. Let us test the mutable case with the help of an example.
Listing 8.24: function argument pass by value and reference in Python
def myfunc(score):
    """This changes the value of a passed list"""
    score.append(40)   # the modification is visible to the caller
    return

score = [10, 20, 30]
print("Status of list before calling the function: ", score)
myfunc(score)
print("Status of list after calling the function: ", score)
There are some types of arguments that can be used while calling user-defined
functions which are given below.
• Required arguments
• Keyword arguments
• Default arguments
• Variable-length arguments.
With keyword arguments it is not compulsory for the function definition and the function call to match the order of arguments; the arguments are matched by their names instead. In the sketch given below, the order of arguments in the function call differs from the order in the function definition: the argument in first place in the call is in second place in the definition, and vice versa.
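A minimal sketch, with illustrative function and parameter names:
def display_student(name, age):
    """Prints the name and age of a student."""
    print("Name:", name)
    print("Age:", age)

# keyword arguments: the order in the call differs from the definition
display_student(age=20, name="Mike")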
Sometimes we want to call a function with different numbers of arguments, but none of the arguments has a default value. In this scenario, a function call that provides fewer arguments than the definition expects raises an error. To handle this problem, we use variable-length arguments: the first parameter is an ordinary variable, and the remaining arguments are collected into a tuple. In the function call, the first argument is assigned to the first parameter of the function definition, and the rest of the arguments are assigned to the tuple.
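A minimal sketch of variable-length arguments, with illustrative names and values:
def print_marks(first, *rest):
    """Prints the first mark and then any additional marks."""
    print("First mark:", first)
    for mark in rest:            # rest is a tuple holding the extra arguments
        print("Additional mark:", mark)

print_marks(85)            # only the required argument
print_marks(85, 90, 78)    # the extra arguments 90 and 78 are collected into the tuple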
A user-defined function in Python can end with a return statement, which returns a value to the caller. If we do not want to return any value from a specific function, we can use the bare return keyword or omit the return statement entirely; in that case the function returns None. Providing an expression to return is optional.
Listing 8.29: return statement in Python function
def findaverage(num1, num2):
    """finds the average of two numbers"""
    total = num1 + num2
    return total / 2

# calling the function
result = findaverage(10, 20)
print("Average of two numbers is: ", result)
R is another important programming language used for statistical analysis and data
visualization. Similar to Python it is also a common language for data science-related
projects. R was created by Ross Ihaka and Robert Gentleman at the University of
Auckland, New Zealand, and is currently developed by the R Development Core
Team.
To use R language, we need to install two applications.
First, we need to install R language precompiled binary to run the code of R
language. R language precompiled binaries are available on following link:
https://cran.r-project.org/
RStudio is an integrated development environment (IDE) intended for the devel-
opment of R programming projects. Major components of RStudio include a console,
syntax-highlighting editor that supports direct code execution, as well as tools for
plotting, history, debugging, and workspace management. For more information on
RStudio, you can follow the link:
https://rstudio.com/products/rstudio/
Everything is set, and we are ready to create our first R script program.
Listing 8.30: First R program
mystr<- "Hello, World!"
print ( mystr)
8.2.2 Comments in R
In R, a comment starts with the # symbol and extends to the end of the line; R has no separate syntax for multi-line comments.
# print the greeting string stored in mystr
print(mystr)
In any programming language we need variables to store information that we can later use in our program. Variables are reserved memory locations where we can store data. The information we store in a variable can be a string, a number, a Boolean, etc.; the amount of memory used by a variable depends on its type and is reserved through the operating system.
In other programming languages such as C++ and Java we decide the type of a variable when declaring it. In R, however, the type of a variable is decided by the data type of the R-object assigned to it; the value of a variable is called an R-object. Similar to a variable in Python, a variable in R is loosely typed, which means we can assign any type of value to it. If we assign an integer, the type of the variable becomes integer; if we assign a string, the type of the variable becomes character (string).
Listing 8.31: checking type of variable after assigning R-object
var <- 10
print (paste("Type of variable:", class(var)))
There are many types of R-objects, and most frequently used ones are given below.
• Vectors
• Lists
• Matrices
• Arrays
• Factors
• Data frames.
8.2.4 Vectors in R
To create a vector with more than one element, we use the built-in function c( ), which combines the given elements into a vector.
Listing 8.32: Vectors example in R
#declaring a vector
colors <- c('red', 'green', 'yellow')
print(colors)
8.2.5 Lists in R
A list is an R-object that can contain elements of different types. An element of a list can itself be another list, a vector, or even a function.
# a list containing a vector, a number, and a function (the element values are example data)
list1 <- list(c("red", "green"), 21.3, sin)
print(list1)
print(class(list1))   # prints [1] "list"
8.2.6 Matrices in R
In R, a matrix is a two-dimensional collection of data elements arranged in rows and columns. The following is an example of a matrix with 3 rows and 2 columns.
Listing 8.34: Matrices example in R
# create a matrix with 3 rows and 2 columns, filled row by row
M <- matrix(c('a', 'a', 'b', 'c', 'b', 'a'), nrow = 3, ncol = 2, byrow = TRUE)
print(M)
print(class(M))   # prints [1] "matrix"
8.2.7 Arrays in R
An array in R is similar to a matrix, but unlike matrices, which have exactly two dimensions, arrays can have any number of dimensions. To create an array we call the built-in function array( ), which takes an argument called dim holding the dimension information.
# create an array consisting of two 3 x 3 matrices (the values are example data)
A <- array(1:18, dim = c(3, 3, 2))
print(A)
print(class(A))   # prints [1] "array"
8.2.8 Factors in R
Factors are R-objects that are created with the help of a vector. To create a factor we use the factor( ) function, which takes a vector as its argument and returns a factor object. Numeric and character variables can be made into factors, but the levels of a factor are always stored as characters. To get the levels of a factor we use the levels( ) function, and to get the number of levels we use nlevels( ).
Data frames are R-objects that hold data in tabular form, similar to matrices in R, but in a data frame each column can have a different type of data. For example, the first column can be logical, the second numeric, and the third character. To create a data frame we use the built-in function data.frame( ).
Listing 8.37: Data frames example in R
# creating a data frame
Frm <- data.frame(
  gender = c("Male", "Male", "Female"),
  height = c(145, 161, 143.3),
  weight = c(70, 65, 63),
  Age = c(35, 30, 25)
)
print(Frm)
print(class(Frm))   # prints [1] "data.frame"
R supports the usual arithmetic operators. In the following example they are applied to two numeric variables whose values match the output shown afterwards.
var1 <- 10
var2 <- 4
print('Addition:')
print(var1 + var2)
print('Subtraction:')
print(var1 - var2)
print('Multiplication:')
print(var1 * var2)
print('Division:')
print(var1 / var2)
print('Remainder:')
print(var1 %% var2)
print('Quotient:')
print(var1 %/% var2)
[1] "Subtraction:"
[1] 6
[1] "Multiplication:"
[1] 40
[1] "Division:"
[1] 2.5
[1] "Remainder:"
[1] 2
[1] "Quotient:"
[1] 2
The R language supports relational operators, which are used to compare two R-objects. A comparison returns a logical value, TRUE or FALSE.
Table 8.11 shows relational operators in R.
Let us see code example of some relational operators.
var1 <- 10   # example operand values (assumed)
var2 <- 4
print('Greater than:')
print(var1 > var2)
print('Less than:')
print(var1 < var2)
print('Equal to:')
print(var1 == var2)
8.2.11.1 if Statement in R
The statements inside an "if" block execute if the condition evaluates to true. The "if" statement is used for decision making in R, and its syntax is given below.
Syntax
if(boolean_expression) {
   # statement(s) executed if the Boolean expression returns true
}
Listing 8.40: if statement in R
num <- 50L
if(is.integer(num)) {
print("num is an integer")
}
8.2.11.2 if…else in R
Syntax
if(boolean_expression) {
   # statement(s) executed if the Boolean expression returns true
} else {
   # statement(s) executed if the Boolean expression returns false
}
Listing 8.41: if…else statement in R
num <- "50"
if(is.integer(num)) {
print("num is an integer")
} else {
print("num is not an integer")
}
In R, "if" and "else if" statements are used as a chain of statements to test various conditions. Execution starts from the top; if the condition of "if" or of any "else if" returns true, the remaining "else if" and "else" statements are not run. We can use many "else if" statements after the first "if" statement, ending with an optional "else" statement. The following is the syntax of chained if, else if, and else statements.
Syntax
if(First Condition) {
   # statement(s) executed if First Condition is true
} else if(Second Condition) {
   # statement(s) executed if Second Condition is true
} else if(Third Condition) {
   # statement(s) executed if Third Condition is true
} else {
   # statement(s) executed if all of the above conditions are false
}
8.2.12 Loops in R
In R, when we need to execute a statement or a group of statements again and again at the same place, we can use the repeat loop. The syntax of the repeat loop is given below.
Syntax
repeat {
statement(s)
if(boolean_expression) {
break
}
}
vector <- c("Loop Test")
counter <- 0
# repeat loop
repeat {
   print(vector)
   counter <- counter + 1
   if(counter > 4) {
      break
   }
}
In R, a repeat loop stops execution with the help of a break statement placed inside an "if" statement.
A while loop in R stops execution when the Boolean expression or condition of the while loop returns false. The following is the syntax of the while loop in R.
Syntax
while (boolean_expression) {
statement(s)
}
vector <- c("Loop Test")
counter <- 0
# while loop
while(counter < 5) {
   print(vector)
   counter <- counter + 1
}
In the R language, when we need to iterate one or more statements a defined number of times, we should use the for loop. The syntax of the for loop in R is given below.
Syntax
for (value in vector) {
Body of Loop (Statements)
}
R's for loop can iterate over a sequence of any data type, including integers, strings, lists, etc.
vector <- c("Loop Test")
counter <- 0
# for loop: runs once for each value in the sequence 6:10
for(i in 6:10) {
   print(vector)
   counter <- counter + 1
}
If we want to stop execution of a loop at a particular iteration, we can use the break statement. Normally we use an "if" statement to test a specific condition; when a loop iteration reaches that condition, we use break to stop the loop iterations. The break statement can be used with repeat, for, and while loops.
Listing 8.46: break statement in R
vector <- c("Loop Test")
counter <- 0
# repeat loop
repeat {
print(vector)
counter <- counter + 1
if(counter > 4) {
break
}
}
8.2.14 Functions in R
There are two types of functions in the R language: built-in and user-defined functions. The functions provided by the R language library are called built-in or predefined functions. The functions created and defined by the user are called user-defined functions.
The following are components of user-defined function.
• Function name
• Arguments
• Function body
• Return value.
The following is syntax of user-defined function.
function_name<- function(arg_1, arg_2, …) {
Function body
}
Listing 8.47: Function in R
# creating a user-defined function
my.function <- function(arg) {
   print(arg)
}
# calling the function
my.function(10)
my.function(5.8)
my.function("Hello World!")
programming abilities will find the exercises and solutions provided in this book
to be ideal for their needs.
• Solving PDEs in Python (Authors: Hans Petter Langtangen and Anders Logg)
The book offers a concise and gentle introduction to finite element programming
in Python based on the popular FEniCS software library. Using a series of exam-
ples, including the Poisson equation, the equations of linear elasticity, the incom-
pressible Navier–Stokes equations, and systems of nonlinear advection–diffu-
sion–reaction equations, it guides readers through the essential steps to quickly
solving a PDE in FEniCS, such as how to define a finite variational problem, how
to set boundary conditions, how to solve linear and nonlinear systems, and how
to visualize solutions and structure finite element Python programs.
8.4 Summary
In this chapter, we have discussed two very important and most commonly used data
science programming languages. The first one is Python, and the second one is R
programming language. We have provided the details of basic code structures of both
programming languages along with programming examples. The overall intention
was to provide you with a simple tutorial to enable you to code complex data science
programs through basic coding structures.