
AI&ML, 21CS752, 7th Semester ECE

Module 3

INTRODUCTION TO MACHINE LEARNING
1.1 NEED FOR MACHINE LEARNING
Business organizations use huge amounts of data for their daily activities. They have now started to use the latest technology, machine learning, to manage this data.
Machine learning has become so popular for three reasons:


1. High volume of available data to manage: Big companies such as Facebook, Twitter, and YouTube generate huge amounts of data that grow at a phenomenal rate. It is estimated that this data approximately doubles every year.
2. The second reason is that the cost of storage has reduced and hardware costs have also dropped. Therefore, it is now easier to capture, process, store, distribute, and transmit digital information.
3. The third reason is the availability of complex algorithms. Especially with the advent of deep learning, many algorithms are available for machine learning.
Let us establish the terms data, information, knowledge, intelligence, and wisdom using a knowledge pyramid, as shown in Figure 1.1.

Figure 1.1: The Knowledge Pyramid


• All facts are data. Data can be numbers or text that can be processed by a computer. Today, organizations accumulate vast and growing amounts of data from sources such as flat files, databases, or data warehouses, in different storage formats.
• Processed data is called information. This includes patterns, associations, or relationships among data. For example, sales data can be analyzed to extract information such as which is the fastest-selling product.
• Condensed information is called knowledge. For example, the historical patterns and future trends obtained from the above sales data can be called knowledge. Unless knowledge is extracted, data is of no use. Similarly, knowledge is not useful unless it is put into action.
• Intelligence is knowledge applied for actions. An actionable form of knowledge is called intelligence. Computer systems have been successful up to this stage.
• The ultimate objective of the knowledge pyramid is wisdom, which represents a maturity of mind that is, so far, exhibited only by humans.


The objective of machine learning is to process this archival data so that organizations can take better decisions, design new products, improve business processes, and develop effective decision support systems.

1.2 MACHINE LEARNING EXPLAINED


Machine learning is an important sub-branch of Artificial Intelligence (AI). A frequently quoted definition of machine learning is by Arthur Samuel, one of the pioneers of Artificial Intelligence. He stated that "Machine learning is the field of study that gives computers the ability to learn without being explicitly programmed."
The key to this definition is that the system should learn by itself without explicit programming. How is this possible? It is widely known that to perform a computation, one needs to write programs that teach the computer how to do that computation.
In conventional programming, after understanding the problem, a detailed design of the program, such as a flowchart or an algorithm, needs to be created and converted into a program using a suitable programming language. This approach can be difficult for many real-world problems such as puzzles, games, and complex image recognition applications. Initially, artificial intelligence aimed to understand these problems and develop general-purpose rules manually. These rules were then formulated into logic and implemented in a program to create intelligent systems. This idea of developing intelligent systems by using logic and reasoning, converting an expert's knowledge into a set of rules and programs, is called an expert system. An expert system like MYCIN was designed for medical diagnosis after converting the expert knowledge of many doctors into a system. However, this approach did not progress much, as the programs lacked real intelligence. The name MYCIN comes from the fact that most antibiotic names end with 'mycin'.
The above approach was impractical in many domains as the programs still depended on human expertise and hence did not truly exhibit intelligence. The momentum then shifted to machine learning in the form of data-driven systems. The focus of AI became developing intelligent systems using a data-driven approach, where data is used as the input to develop intelligent models. The models can then be used on new inputs for prediction. Thus, the aim of machine learning is to learn a model or set of rules from the given dataset automatically so that it can predict unknown data correctly.
Just as humans take decisions based on experience, computers build models based on patterns extracted from the input data and then use these models for prediction and decision making. For computers, the learnt model is equivalent to human experience. This is shown in Figure 1.2.

Figure 1.2: (a) A Learning System for Humans (b) A Learning System for Machine Learning
Often, the quality of data determines the quality of experience and, therefore, the quality of the learning system. In statistical learning, the relationship between the input x and output y is modeled as a function of the form y = f(x). Here, f is the learning function that maps the input x to the output y. Learning the function f is the crucial aspect of forming a model in statistical learning. In machine learning, this is simply called mapping of input to output.
The learning program summarizes the raw data in a model. Formally stated, a model is an explicit description of patterns within the data in the form of:
1. Mathematical equation
2. Relational diagrams like trees/graphs
3. Logical if/else rules, or
4. Groupings called clusters
In summary, a model can be a formula, procedure, or representation that can generate decisions from data. The difference between a pattern and a model is that the former is local and applicable only to certain attributes, while the latter is global and fits the entire dataset. For example, a model can be helpful to examine whether a given email is spam or not. The point is that the model is generated automatically from the given data.
Another pioneer of AI, Tom Mitchell, defines machine learning as follows: "A computer program is said to learn from experience E, with respect to task T and some performance measure P, if its performance on T, measured by P, improves with experience E." The important components of this definition are experience E, task T, and performance measure P.
For example, the task T could be detecting an object in an image. The machine can gain knowledge of the object using a training dataset of thousands of images. This is called experience E. So, the focus is to use this experience E for the task of object detection T. The ability of the system to detect the object is measured by performance measures like precision and recall. Based on the performance measures, course correction can be done to improve the performance of the system.
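To make the performance measure P concrete, here is a minimal sketch (not from the text) that computes precision and recall for a hypothetical object detector from raw detection counts.

```python
# Minimal sketch: precision and recall as performance measures P for a
# hypothetical object-detection task (the counts below are made up).

def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) from raw detection counts."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall

# Hypothetical counts: 80 correct detections, 10 false alarms, 20 missed objects.
p, r = precision_recall(80, 10, 20)
print(f"precision = {p:.2f}, recall = {r:.2f}")   # precision = 0.89, recall = 0.80
```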
Models of computer systems are equivalent to human experience. Experience is based on data. Humans gain experience by various means. They gain knowledge by rote learning. They observe others and imitate them. Humans gain a lot of knowledge from teachers and books. We learn many things by trial and error. Once the knowledge is gained, when a new problem is encountered, humans search for similar past situations, formulate heuristics, and use them for prediction. In systems, however, experience is gathered by the following steps:
1. Collection of data


2. Once data is gathered, abstract concepts are formed out of that data. Abstraction is used to generate concepts. This is equivalent to humans' ideas of objects; for example, we have some idea of what an elephant looks like.
3. Generalization converts the abstraction into an actionable form of intelligence. It can be viewed as an ordering of all possible concepts. So, generalization involves ranking of concepts, inferencing from them, and formation of heuristics, an actionable aspect of intelligence. Heuristics are educated guesses for tasks. For example, if one runs on encountering danger, it is the result of human experience, or heuristics formation. In machines, it happens the same way.
4. Heuristics normally work! But occasionally they may fail too. It is not the fault of the heuristics, as they are just 'rules of thumb'. The course correction is done by taking evaluation measures. Evaluation checks the thoroughness of the models and does course correction, if necessary, to generate better formulations.

1.3 MACHINE LEARNING IN RELATION TO OTHER FIELDS


Machine learning primarily uses the concepts of Artificial Intelligence, Data Science, and Statistics. It is the result of combining ideas from these diverse fields.

1.3.1 Machine Learning and Artificial Intelligence


Machine learning is an important branch of AI, which is a much broader subject. The aim of AI is to develop intelligent agents. An agent can be a robot, a human, or any autonomous system. Initially, the idea of AI was ambitious, that is, to develop intelligent systems like human beings. The focus was on logic and logical inference. The field has seen many ups and downs; the down periods were called AI winters.
The resurgence in AI happened due to the development of data-driven systems. The aim is to find the relations and regularities present in data. Machine learning is the sub-branch of AI whose aim is to extract patterns for prediction. It is a broad field that includes learning from examples and other areas like reinforcement learning. The relationship of AI and machine learning is shown in Figure 1.3. The resulting model can take an unknown instance and generate results.
Figure 1.3: Relationship of AI with Machine Learning


Deep learning is a sub-branch of machine learning. In deep learning, the models are constructed using neural network technology. Neural networks are based on the human neuron model. Many neurons form a network connected with activation functions that trigger further neurons to perform tasks.


1.3.2 Machine Learning, Data Science, Data Mining, and Data Analytics
Data science is an 'umbrella' term that encompasses many fields. Machine learning starts with data; therefore, data science and machine learning are interlinked. Machine learning is a branch of data science. Data science deals with the gathering of data for analysis. It is a broad field that includes:
Big Data Data science is concerned with the collection of data. Big data is a field of data science that deals with data having the following characteristics:
1. Volume: Huge amounts of data are generated by big companies like Facebook, Twitter, and YouTube.
2. Variety: Data is available in a variety of forms like images and videos, and in different formats.
3. Velocity: It refers to the speed at which the data is generated and processed.
Big data is used by many machine learning algorithms for applications such as language translation and image recognition. Big data influences the growth of subjects like deep learning. Deep learning is a branch of machine learning that deals with constructing models using neural networks.

Data Mining Data mining's original genesis is in business. Just as mining the earth yields precious resources, it is often believed that 'mining' data unearths hidden information that would otherwise have eluded the attention of management. Nowadays, many consider data mining and machine learning to be the same. There is little difference between these fields, except that data mining aims to extract the hidden patterns that are present in the data, whereas machine learning aims to use those patterns for prediction.

Data Analytics Another branch of data science is data analytics. It aims to extract useful knowledge from raw data. There are different types of analytics. Predictive data analytics is used for making predictions. Machine learning is closely related to this branch of analytics and shares almost all of its algorithms.
Pattern Recognition It is an engineering field. It uses machine learning algorithms to extract features for pattern analysis and pattern classification. One can view pattern recognition as a specific application of machine learning.
These relations are summarized in Figure 1.4.
Figure 1.4: Relationship of Machine Learning with Other Major Fields

1.3.3 Machine Learning and Statistics


Statistics is a branch of mathematics that has a solid theoretical foundation for statistical learning. Like machine learning (ML), it can learn from data. But the difference between statistics and ML is that statistical
methods look for regularity in data, called patterns. Initially, statistics sets a hypothesis and performs experiments to verify and validate the hypothesis in order to find relationships among the data.
Statistics requires knowledge of statistical procedures and the guidance of a good statistician. It is mathematically intensive, and its models are often complicated equations involving many assumptions. Statistical methods are developed in relation to the data being analyzed. In addition, statistical methods are coherent and rigorous. Statistics has strong theoretical foundations and interpretations that require strong statistical knowledge.
Machine learning, comparatively, makes fewer assumptions and requires less statistical knowledge. But it often requires interaction with various tools to automate the process of learning.
Nevertheless, there is a school of thought that machine learning is just the latest version of ‘old
Statistics’ and hence this relationship should be recognized.

1.4 TYPES OF MACHINE LEARNING


What does the word 'learn' mean? Learning, like adaptation, occurs as the result of interaction of the program with its environment. It can be compared with the interaction between a teacher and a student.

Figure 1.5: Types of Machine Learning (supervised, unsupervised including cluster analysis, semi-supervised, and reinforcement learning)


Before discussing the types of learning, it is necessary to discuss data.

Labelled and Unlabeled Data Data is a raw fact. Normally, data is represented in the form of a table. Data can also be referred to as a data point, sample, or example. Each row of the table represents a data point. Features are attributes or characteristics of an object. Normally, the columns of the table are attributes. Out of all the attributes, one attribute is important and is called the label. The label is the feature that we aim to predict. Thus, there are two types of data – labelled and unlabeled.

Labelled Data To illustrate labelled data, let us take an example dataset called the Iris flower dataset or Fisher's Iris dataset. The dataset has 50 samples of each of three Iris classes, with four attributes: the length and width of sepals and petals. The target variable is called class. The three classes are Iris setosa, Iris virginica, and Iris versicolor.

The partial data of Iris dataset is shown in Table 1.1.


Table 1.1: Iris Flower Dataset


S.No. | Length of Petal | Width of Petal | Length of Sepal | Width of Sepal | Class
1.    | 5.5             | 4.2            | 1.4             | 0.2            | Setosa
2.    | 7.0             | 3.2            | 4.7             | 1.4            | Versicolor
3.    | 7.3             | 2.9            | 6.3             | 1.8            | Virginica
A dataset need not always be numbers. It can be images or video frames. Deep neural networks can handle images with labels. In the following Figure 1.6, a deep neural network takes images of dogs and cats with labels for classification.
Figure 1.6: (a) Labelled Dataset (b) Unlabeled Dataset


In unlabeled data, there are no labels in the dataset.

1.4.1 Supervised Learning


Supervised algorithms use a labelled dataset. As the name suggests, there is a supervisor or teacher component in supervised learning. A supervisor provides labelled data so that a model can be constructed and then tested on test data.
In supervised learning algorithms, learning takes place in two stages. In layman's terms, during the first stage, the teacher communicates the information that the student is supposed to master. The student receives the information and understands it. During this stage, the teacher has no knowledge of whether the information has been grasped by the student.
This leads to the second stage of learning. The teacher then asks the student a set of questions to find out how much information has been grasped. Based on these questions, the student is tested, and the teacher informs the student about his or her assessment. This kind of learning is typically called supervised learning.
Supervised learning has two methods:
1. Classification
2. Regression


Classification
Classification is a supervised learning method. The input attributes of classification algorithms are called independent variables. The target attribute is called the label or dependent variable. The relationship between the input and target variables is represented in the form of a structure called a classification model. So, the focus of classification is to predict the 'label', which is in a discrete form (a value from a finite set of values). An example is shown in Figure 1.7, where a classification algorithm takes a set of labelled images such as dogs and cats to construct a model that can later be used to classify an unknown test image.


In classification, learning takes place in two stages. During the first stage, called the training stage, the learning algorithm takes a labelled dataset and starts learning. After the training samples are processed, the model is generated. In the second stage, the constructed model is tested with a test or unknown sample and assigns it a label. This is the classification process.
This is illustrated in Figure 1.7. Initially, the classification learning algorithm learns from the collection of labelled data and constructs the model. Then, a test case is selected, and the model assigns a label.
Similarly, in the case of the Iris dataset, if a test instance is given as (6.3, 2.9, 5.6, 1.8, ?), classification will generate the label for it. This is called classification. One example of classification is image recognition, which includes classification of diseases like cancer, classification of plants, etc.
Classification models can be categorized based on the implementation technology, such as decision trees, probabilistic methods, distance measures, and soft computing methods. Classification models can also be classified as generative models and discriminative models. Generative models deal with the process of data generation and its distribution; probabilistic models are examples of generative models. Discriminative models do not care about the generation of data. Instead, they simply concentrate on classifying the given data.
Some of the key algorithms of classification are:
• Decision Tree
• Random Forest
• Support Vector Machines
• Naïve Bayes
• Artificial Neural Network and Deep Learning networks like CNN
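As an illustration of the two-stage classification workflow described above, the sketch below (assuming scikit-learn is available) trains a decision tree on the labelled Iris dataset and then labels an unknown sample; the algorithm choice and parameters are illustrative only, not prescribed by the text.

```python
# Sketch: training stage on labelled Iris data, then testing/labelling stage.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)

model = DecisionTreeClassifier()            # training stage: learn from labelled data
model.fit(X_train, y_train)

print("Test accuracy:", model.score(X_test, y_test))          # testing stage
unknown = [[6.3, 2.9, 5.6, 1.8]]                               # unknown test sample
print("Predicted class:", iris.target_names[model.predict(unknown)[0]])
```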

Regression Models
Regression models, unlike classification algorithms, predict continuous variables like price. In other words, the output is a number. A fitted regression model is shown in Figure 1.8 for a dataset that represents week number (x) and product sales (y).
[Plot: product sales (y) against week (x) for weeks 1 to 5, with the fitted regression line y = 0.66x + 0.54.]
Figure 1.8: A Regression Model of the Form y = ax + b


The regression model takes the input x and generates a model in the form of a fitted line y = f(x). Here, x is the independent variable, which may be one or more attributes, and y is the dependent variable. In Figure 1.8, linear regression takes the training set and tries to fit it with a line: product sales = 0.66 × week + 0.54. Here, 0.66 and 0.54 are regression coefficients that are learnt from the data. The advantage of this model is that prediction for product sales (y) can be made for an unknown week
data (x). For example, the prediction for an unknown eighth week can be made by substituting x = 8 in the regression formula to get y.
One of the most important regression algorithms is linear regression, which is explained in a later section.
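A minimal sketch of fitting such a straight line with NumPy is shown below; the weekly sales values are hypothetical and chosen only so that the fit comes out close to the line quoted in Figure 1.8.

```python
# Sketch: least-squares fit of y = ax + b on hypothetical weekly sales data.
import numpy as np

x = np.array([1, 2, 3, 4, 5])             # week number
y = np.array([1.2, 1.8, 2.6, 3.1, 3.9])   # product sales (hypothetical values)

a, b = np.polyfit(x, y, deg=1)            # fitted slope and intercept
print(f"fitted line: y = {a:.2f}x + {b:.2f}")

# Prediction for an unseen week, e.g. week 8:
print("predicted sales for week 8:", round(a * 8 + b, 2))
```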
Both regression and classification models are supervised algorithms. Both have a supervisor, and the concepts of training and testing apply to both. What is the difference between classification and regression models? The main difference is that regression models predict continuous variables such as product price, while classification concentrates on assigning labels such as classes.
1.4.2 Unsupervised Learning
The second kind of learning is by self-instruction. As the name suggests, there is no supervisor or teacher component. In the absence of a supervisor or teacher, self-instruction is the most common kind of learning process. This process of self-instruction is based on the concept of trial and error.
Here, the program is supplied with objects, but no labels are defined. The algorithm itself observes the examples and recognizes patterns based on the principles of grouping. Grouping is done in such a way that similar objects form the same group.
Cluster analysis and dimensionality reduction algorithms are examples of unsupervised algorithms.

Cluster Analysis
Cluster analysis is an example of unsupervised learning. It aims to group objects into disjoint clusters or groups. Cluster analysis clusters objects based on their attributes. All the data objects of a partition are similar in some aspect and vary significantly from the data objects of the other partitions.
Some examples of clustering processes are segmentation of a region of interest in an image, detection of abnormal growth in a medical image, and determining clusters of signatures in a gene database.
An example of a clustering scheme is shown in Figure 1.9, where the clustering algorithm takes a set of dog and cat images and groups them into two clusters: dogs and cats. It can be observed that the samples belonging to a cluster are similar, while samples differ radically across clusters.

Some of the key clustering algorithms are listed below (a short k-means sketch follows this list):
• k-means algorithm
• Hierarchical algorithms
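Below is the k-means sketch referred to above, assuming scikit-learn is available; the two-dimensional points are hypothetical and serve only to show that grouping happens without any labels.

```python
# Sketch: k-means grouping of unlabelled 2-D points into two clusters.
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],    # one natural group
                   [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]])   # another natural group

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print("cluster labels: ", kmeans.labels_)          # e.g. [0 0 0 1 1 1]
print("cluster centres:", kmeans.cluster_centers_)
```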
Dimensionality Reduction
Dimensionality reduction algorithms are also examples of unsupervised algorithms. They take higher-dimensional data as input and output the data in a lower dimension by taking advantage of the variance of the data. It is the task of reducing a dataset to fewer features without losing generality.
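As one common example of dimensionality reduction, the sketch below (assuming scikit-learn is available) uses Principal Component Analysis (PCA) to project the four Iris attributes onto the two directions of highest variance; PCA is used here only as an illustration.

```python
# Sketch: reduce the 4-dimensional Iris data to 2 dimensions with PCA.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                 # 150 samples x 4 attributes
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)     # 150 samples x 2 attributes

print("reduced shape:", X_reduced.shape)
print("variance explained by each component:", pca.explained_variance_ratio_)
```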


The differences between supervised and unsupervised learning are listed in the following
Table 1.2.
Table 1.2: Differences between Supervised and Unsupervised Learning

S.No. | Supervised Learning | Unsupervised Learning
1. | There is a supervisor component | No supervisor component
2. | Uses labelled data | Uses unlabelled data
3. | Assigns categories or labels | Performs a grouping process such that similar objects fall in one cluster

1.4.3 Semi-supervised Learning


There are circumstances where the dataset has a huge collection of unlabelled data and only some labelled data. Labelling is a costly process and difficult for humans to perform. Semi-supervised algorithms use the unlabelled data by assigning pseudo-labels. The labelled and pseudo-labelled datasets can then be combined for learning.
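A rough sketch of this pseudo-labelling idea is given below, assuming scikit-learn is available; the way the Iris data is split into labelled and unlabelled parts, and the choice of logistic regression, are illustrative assumptions only.

```python
# Sketch: pseudo-labelling - train on the small labelled set, label the rest,
# then retrain on the combined data.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
labelled_X, labelled_y = X[::10], y[::10]           # pretend only a few labels exist
unlabelled_X = np.delete(X, np.s_[::10], axis=0)    # the rest is treated as unlabelled

base = LogisticRegression(max_iter=1000).fit(labelled_X, labelled_y)
pseudo_y = base.predict(unlabelled_X)               # assign pseudo-labels

combined_X = np.vstack([labelled_X, unlabelled_X])
combined_y = np.concatenate([labelled_y, pseudo_y])
final_model = LogisticRegression(max_iter=1000).fit(combined_X, combined_y)
```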
1.4.4 Reinforcement Learning
Reinforcement learning mimics human beings. Just as human beings use their ears and eyes to perceive the world and take actions, reinforcement learning allows an agent to interact with the environment to get rewards. The agent can be a human, an animal, a robot, or any independent program. The rewards enable the agent to gain experience. The agent aims to maximize the reward.
The reward can be positive or negative (punishment). When the rewards are higher, the behavior gets reinforced and learning becomes possible.
Consider the following example of a Grid game as shown in Figure 1.10.
Figure 1.10: A Grid Game (tiles: Block, Goal, and Danger)


In this grid game, the gray tile indicates danger, the black tile is a block, and the tile with diagonal lines is the goal. The aim is to start, say, from the bottom-left tile and use the actions left, right, up, and down to reach the goal state.
To solve this sort of problem, there is no data. The agent interacts with the environment to get experience. In the above case, the agent tries to create a model by simulating many paths and finding rewarding paths. This experience helps in constructing a model.
In summary, compared to supervised learning, there is no supervisor or labelled dataset. Many sequential decisions need to be taken to reach the final decision. Therefore, reinforcement algorithms are reward-based, goal-oriented algorithms.

1.5 CHALLENGES OF MACHINE LEARNING


What are the challenges of machine learning? Let us discuss them now.

Problems that can be Dealt with Machine Learning


Computers are better than humans at performing tasks like computation. For example, while calculating the square root of large numbers, an average human may struggle, but a computer can display the result in seconds. Computers can play games like chess and GO, and even beat professional players of those games.
However, humans are better than computers in many aspects, like recognition, although deep learning systems now challenge human beings in this aspect as well. Machines can recognize human faces in a second. Still, there are tasks where humans are better, as machine learning systems still require quality data for model construction. The quality of a learning system depends on the quality of data. This is a challenge. Some of the challenges are listed below:
1. Problems – Machine learning can deal with 'well-posed' problems, where the specifications are complete and available. Computers cannot solve 'ill-posed' problems. Consider one simple example (shown in Table 1.3):
Table 1.3: An Example

Input (x1, x2) | Output (y)
1, 1 | 1
2, 1 | 2
3, 1 | 3
4, 1 | 4
5, 1 | 5
Can the model for this data be multiplication, that is, y = x1 × x2? Well, it is true! But it is equally true that y may be y = x1 ÷ x2, or y = x1^x2 (x1 raised to the power x2). So, there are three functions that fit the data (a short sketch after this list verifies this). This means that the problem is ill-posed. To solve this problem, one needs more examples to check the model. Puzzles and games that do not have sufficient specification may become ill-posed problems, and scientific computation has many ill-posed problems.


2. Huge data – This is a primary requirement of machine learning. The availability of quality data is a challenge. Quality data means the data should be large and should not have problems such as missing or incorrect values.
3. High computation power – With the availability of Big Data, the computational resource
requirement has also increased. Systems with Graphics Processing Unit (GPU) or even Tensor
Processing Unit (TPU) are required to execute machine learning algorithms. Also, machine
learning tasks have become complex and hence time complexity has increased, and that can
be solved only with high computing power.
4. Complexity of the algorithms – The selection of algorithms, describing the algorithms, applying algorithms to solve a machine learning task, and comparison of algorithms have become necessary for machine learning engineers and data scientists. Algorithms have become a big topic of discussion, and it is a challenge for machine learning professionals to design, select, and evaluate optimal algorithms.
5. Bias/Variance – Bias is the error due to overly simple assumptions, and variance is the error due to the model's sensitivity to the training data. This leads to a problem called the bias/variance tradeoff. A model that fits the training data correctly but fails for test data, and thus lacks generalization, is said to be overfitting. The reverse problem, underfitting, occurs when the model fails even on the training data. Overfitting and underfitting are great challenges for machine learning algorithms.
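The sketch referred to in item 1 above is given below; it simply checks that all three candidate functions agree on the five samples of Table 1.3, which is why the data alone cannot decide between them.

```python
# Sketch: three different functions fit the same five samples, so the problem
# is ill-posed without more examples.
samples = [((1, 1), 1), ((2, 1), 2), ((3, 1), 3), ((4, 1), 4), ((5, 1), 5)]

candidates = {
    "y = x1 * x2":  lambda x1, x2: x1 * x2,
    "y = x1 / x2":  lambda x1, x2: x1 / x2,
    "y = x1 ** x2": lambda x1, x2: x1 ** x2,
}

for name, f in candidates.items():
    fits = all(f(x1, x2) == y for (x1, x2), y in samples)
    print(f"{name}: fits all samples = {fits}")     # True for every candidate
```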

1.6 MACHINE LEARNING PROCESS


The emerging process model for data mining solutions in business organizations is CRISP-DM. Since machine learning is like data mining, except for the aim, this process can also be used for machine learning. CRISP-DM stands for CRoss Industry Standard Process for Data Mining. The process involves six steps, shown in Figure 1.11 and listed below.

[Flow: understand the business → understand the data → data preprocessing → modelling → model evaluation → model deployment]
Figure 1.11: A Machine Learning/Data Mining Process


1. Understanding the business – This step involves understanding the objectives and
requirements of the business organization. Generally, a single data mining algorithm is enough
for giving the solution. This step also involves the formulation of the problem statement for
the data mining process.
2. Understanding the data – This involves steps like data collection, study of the characteristics of the data, formulation of hypotheses, and matching of patterns to the selected hypotheses.
3. Preparation of data – This step involves producing the final dataset by cleaning the raw data and preparing it for the data mining process. Missing values may cause problems during both the training and testing phases. Missing data forces classifiers to produce inaccurate results. This is a perennial problem for classification models. Hence, suitable strategies should be adopted to handle missing data.
4. Modelling – This step plays a role in the application of a data mining algorithm to the data to obtain a model or pattern.
5. Evaluation – This step involves the evaluation of the data mining results using statistical analysis and visualization methods. The performance of a classifier is determined by evaluating its accuracy. The process of classification can be a fuzzy issue. For example, classification of emails requires extensive domain knowledge and domain experts. Hence, the performance of the classifier is very crucial.
6. Deployment – This step involves the deployment of results of the data mining algorithm to
improve the existing process or for a new situation.

1.7 MACHINE LEARNING APPLICATIONS


Machine learning technologies are now used widely in different domains. Machine learning applications are everywhere! One encounters many machine learning applications in day-to-day life. Some applications are listed below:
1. Sentiment analysis – This is an application of natural language processing (NLP) where the words of documents are mapped to sentiments like happy, sad, and angry, which are captured effectively by emoticons. For movie or product reviews, ratings such as five stars or one star are attached automatically using sentiment analysis programs.
2. Recommendation systems – These are systems that make personalized purchases possible. For example, Amazon recommends related books or books bought by people who have the same taste as you, and Netflix suggests shows or movies of your taste. These recommendation systems are based on machine learning.
3. Voice assistants – Products like Amazon Alexa, Microsoft Cortana, Apple Siri, and Google
Assistant are all examples of voice assistants. They take speech commands and perform tasks.
These chatbots are the result of machine learning technologies.
4. Technologies like Google Maps and those used by Uber are also examples of machine learning; they locate and navigate the shortest paths to reduce travel time.
The machine learning applications are enormous. The following Table 1.4 summarizes some of the machine learning applications.


Table 1.4: Applications’ Survey Table


S.No. | Problem Domain | Applications
1. | Business | Predicting the bankruptcy of a business firm
2. | Banking | Prediction of bank loan defaulters and detection of credit card frauds
3. | Image Processing | Image search engines, object identification, image classification, and generating synthetic images
4. | Audio/Voice | Chatbots like Alexa and Microsoft Cortana; developing chatbots for customer support, speech to text, and text to voice
5. | Telecommunication | Trend analysis, identification of bogus calls, fraudulent calls and their callers, and churn analysis
6. | Marketing | Retail sales analysis, market basket analysis, product performance analysis, market segmentation analysis, and study of travel patterns of customers for marketing tours
7. | Games | Game programs for Chess, GO, and Atari video games
8. | Natural Language Translation | Google Translate, text summarization, and sentiment analysis
9. | Web Analysis and Services | Identification of access patterns, detection of e-mail spams and viruses, personalized web services, search engines like Google, detection of promotion of user websites, and finding loyalty of users after web page layout modification
10. | Medicine | Prediction of diseases, given disease symptoms, such as cancer or diabetes; prediction of the effectiveness of treatment using patient history; chatbots like IBM Watson that interact with patients using machine learning technologies
11. | Multimedia and Security | Face recognition/identification, biometric projects like identification of a person from a large image or video database, and applications involving multimedia retrieval
12. | Scientific Domain | Discovery of new galaxies, identification of groups of houses based on house type/geographical location, identification of earthquake epicenters, and identification of similar land use

Key Terms:
• Machine Learning – A branch of AI that enables machines to learn automatically without being explicitly programmed.
• Data – A raw fact.
• Model – An explicit description of patterns in data.
• Experience – A collection of knowledge and heuristics in humans, and historical training data in the case of machines.
• Predictive Modelling – A technique of developing models and making predictions on unseen data.
• Deep Learning – A branch of machine learning that deals with constructing models using neural networks.
• Data Science – A field of study that encompasses everything from the capture of data to its analysis, covering all stages of data management.
• Data Analytics – A field of study that deals with the analysis of data.


• Big Data – A study of data that has the characteristics of volume, variety, and velocity.
• Statistics – A branch of mathematics that deals with learning from data using statistical methods.
• Hypothesis – An initial assumption of an experiment.
• Learning – Adapting to the environment as a result of the interaction of an agent with the environment.
• Label – A target attribute.
• Labelled Data – Data that is associated with labels.
• Unlabelled Data – Data without labels.
• Supervised Learning – A type of machine learning that uses labelled data and learns with the help of a supervisor or teacher component.
• Classification – A supervised learning method that takes an unknown input and assigns a label to it; in simple words, it finds the category or class of the input attributes.
• Regression Analysis – A supervised method that predicts continuous variables based on the input variables.
• Unsupervised Learning – A type of machine learning that uses unlabelled data and groups the objects into clusters using a trial and error approach.
• Cluster Analysis – An unsupervised approach that groups objects based on attributes, so that similar objects or data points form a cluster.
• Semi-supervised Learning – A type of machine learning that uses limited labelled data and large amounts of unlabelled data. It first labels the unlabelled data using the labelled data and then combines them for learning purposes.
• Reinforcement Learning – A type of machine learning that uses agent–environment interaction for creating labelled data for learning.
• Well-posed Problem – A problem that has well-defined specifications; otherwise, the problem is called ill-posed.
• Bias/Variance – Bias is the error arising from overly simple assumptions that prevent the model from generalizing; variance is the error arising from the model's sensitivity to the training data. Together they lead to the problems of underfitting and overfitting.
• Model Deployment – A method of deploying machine learning algorithms to improve existing business processes or for a new situation.


2.1 WHAT IS DATA?


All facts are data. In computer systems, bits encode facts present in numbers, text, images, audio, and video. Data can be directly human-interpretable (such as numbers or text) or diffused data such as images or video that can be interpreted only by a computer.
Data is available in different data sources like flat files, databases, or data warehouses. It can be either operational data or non-operational data. Operational data is the data encountered in normal business procedures and processes; for example, daily sales data is operational data. Non-operational data, on the other hand, is the kind of data that is used for decision making.
Data by itself is meaningless. It has to be processed to generate any information. A string of bytes is meaningless; only when a label is attached, such as height of students of a class, does the data become meaningful. Processed data is called information, which includes patterns, associations, or relationships among data. For example, sales data can be analyzed to extract information such as which product sold the most in the last quarter of the year.
Elements of Big Data
Data whose volume is small and which can be stored and processed by a small-scale computer is called 'small data'. Such data is collected from several sources, and integrated and processed by a small-scale computer. Big data, on the other hand, is data whose volume is much larger than that of 'small data' and is characterized as follows:
1. Volume – Since there is a reduction in the cost of storage devices, there has been tremendous growth of data. Small traditional data is measured in terms of gigabytes (GB) and terabytes (TB), but big data is measured in terms of petabytes (PB) and exabytes (EB). One exabyte is 1 million terabytes.
2. Velocity – The fast arrival speed of data and its increase in data volume is noted as velocity.
The availability of IoT devices and Internet power ensures that the data is arriving at a faster rate.
Velocity helps to understand the relative growth of big data and its accessibility by users, systems
and applications.
3. Variety – The variety of Big Data includes:
• Form – There are many forms of data. Data types range from text, graph, audio, video, to
maps. There can be composite data too, where one media can have many other sources of
data, for example, a video can have an audio song.
• Function – These are data from various sources like human conversations, transaction
records, and old archive data.
• Source of data – This is the third aspect of variety. There are many sources of data. Broadly,
the data source can be classified as open/public data, social media data and multimodal
data.

Some of the other Vs that are often quoted in the literature as characteristics of big data are:
4. Veracity of data – Veracity deals with aspects like conformity to facts, truthfulness, believability, and confidence in the data. There may be many sources of error, such as technical errors, typographical errors, and human errors. So, veracity is one of the most important aspects of data.
5. Validity – Validity is the accuracy of the data for taking decisions or for any other goals that
are needed by the given problem.
6. Value – Value is the characteristic of big data that indicates the value of the information that
is extracted from the data and its influence on the decisions that are taken based on it.
Thus, these 6 Vs are helpful to characterize the big data. The data quality of the numeric
attributes is determined by factors like precision, bias, and accuracy.


• Precision is defined as the closeness of repeated measurements. Often, standard deviation is used to measure precision.
• Bias is a systematic result due to erroneous assumptions of the algorithms or procedures.
• Accuracy is the degree of measurement error; it refers to the closeness of measurements to the true value of the quantity. Normally, the significant digits used to store and manipulate data indicate the accuracy of the measurement.

2.1.1 Types of Data


In Big Data, there are three kinds of data. They are structured data, unstructured data, and semi-
structured data.

Structured Data
In structured data, data is stored in an organized manner such as a database where it is available in
the form of a table. The data can also be retrieved in an organized manner using tools like SQL. The
structured data frequently encountered in machine learning are listed below:

Record Data A dataset is a collection of measurements taken from a process. We have a collection of objects in a dataset, and each object has a set of measurements. The measurements can be arranged in the form of a matrix. Rows of the matrix represent objects and can be called entities, cases, or records. The columns of the dataset are called attributes, features, or fields. The table is filled with observed data. It is also useful to note the general jargon associated with a dataset: label is the term used to describe an individual observation.

Data Matrix It is a variation of the record type because it consists of numeric attributes. The
standard matrix operations can be applied on these data. The data is thought of as points or vectors
in the multidimensional space where every attribute is a dimension describing the object.

Graph Data It involves relationships among objects. For example, a web page can refer to another web page; this can be modeled as a graph. The nodes are web pages and a hyperlink is an edge that connects the nodes.

Ordered Data Ordered data objects involve attributes that have an implicit order among them. Examples of ordered data are:
• Temporal data – It is data whose attributes are associated with time. For example, customer purchasing patterns during festival time are sequential data. Time series data is a special type of sequence data where the data is a series of measurements over time.
• Sequence data – It is like sequential data but does not have time stamps. This data involves sequences of words or letters. For example, DNA data is a sequence of four characters – A, T, G, C.
• Spatial data – It has attributes such as positions or areas. For example, maps are spatial data where the points are related by location.

Unstructured Data
Unstructured data includes video, image, and audio. It also includes textual documents, programs,
and blog data. It is estimated that 80% of the data are unstructured data.

Semi-Structured Data
Semi-structured data are partially structured and partially unstructured. These include data like
XML/JSON data, RSS feeds, and hierarchical data.

2.1.2 Data Storage and Representation


Once the dataset is assembled, it must be stored in a structure that is suitable for data analysis. The
goal of data storage management is to make data available for analysis. There are different
approaches to organize and manage data in storage files and systems from flat file to data
warehouses. Some of them are listed below:

Flat Files These are the simplest and most commonly available data sources. They are also the cheapest way of organizing data. Flat files are files where data is stored in plain ASCII or EBCDIC format. Minor changes of data in flat files affect the results of the data mining algorithms. Hence, flat files are suitable only for storing small datasets and are not desirable if the dataset becomes larger.
Some of the popular spreadsheet formats are listed below:
• CSV files – CSV stands for comma-separated value files where the values are separated by
commas. These are used by spreadsheet and database applications. The first row may have
attributes and the rest of the rows represent the data.
• TSV files – TSV stands for Tab separated values files where values are separated by Tab. Both
CSV and TSV files are generic in nature and can be shared. There are many tools like Google Sheets
and Microsoft Excel to process these files.
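A small sketch of loading such a flat file with pandas is shown below; the file name iris.csv is hypothetical.

```python
# Sketch: reading a comma-separated flat file into a table of records.
import pandas as pd

df = pd.read_csv("iris.csv")      # first row is taken as the attribute names
print(df.head())                  # peek at the first few records
print(df.shape)                   # (number of rows, number of attributes)

# A tab-separated (TSV) file can be read the same way by changing the separator:
# df = pd.read_csv("iris.tsv", sep="\t")
```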
Database System It normally consists of database files and a database management system (DBMS). Database files contain the original data and metadata. The DBMS aims to manage the data and improve operator performance through tools such as a database administrator, query processing, and a transaction manager. A relational database consists of sets of tables. The tables have rows and columns; the columns represent attributes and the rows represent tuples. A tuple corresponds to either an object or a relationship between objects. A user can access and manipulate the data in the database using SQL.

Different types of databases are listed below:


1. A transactional database is a collection of transactional records. Each record is a transaction. A transaction may have a time stamp, an identifier, and a set of items, which may have links to other tables. Normally, transaction databases are created for performing associational analysis that indicates the correlation among the items.
2. Time-series database stores time related information like log files where data is associated
with a time stamp. This data represents the sequences of data, which represent values or events
obtained over a period (for example, hourly, weekly or yearly) or repeated time span. Observing
sales of product continuously may yield a time-series data.
3. Spatial databases contain spatial information in a raster or vector format. Raster formats are
either bitmaps or pixel maps. For example, images can be stored as a raster data. On the other hand,
the vector format can be used to store maps as maps use basic geometric primitives like points, lines,
polygons and so forth.
World Wide Web (WWW) It provides a diverse, worldwide online information source.
The objective of data mining algorithms is to mine interesting patterns of information present in
WWW.
XML (eXtensible Markup Language) It is both human and machine interpretable data format that
can be used to represent data that needs to be shared across the platforms.
Data Stream It is dynamic data that flows in and out of the observing environment. Typical characteristics of a data stream are huge volume of data, dynamic nature, fixed order movement, and real-time constraints.
RSS (Really Simple Syndication) It is a format for sharing instant feeds across services.
JSON (JavaScript Object Notation) It is another useful data interchange format that is often used for
many machine learning algorithms.

2.2 BIG DATA ANALYTICS AND TYPES OF ANALYTICS


The primary aim of data analysis is to assist business organizations in taking decisions. For example, a business organization may want to know which is its fastest selling product, so that it can plan its
marketing activities accordingly. Data analysis is an activity that takes the data and generates useful information and insights for assisting the organization.
Data analysis and data analytics are terms that are often used interchangeably to refer to the same concept. However, there is a subtle difference. Data analytics is the more general term and data analysis is a part of it. Data analytics refers to the process of data collection, preprocessing, and analysis, and deals with the complete cycle of data management. Data analysis is just the analysis part of data analytics; it takes historical data and does the analysis. Data analytics, in addition, concentrates more on the future and helps in prediction.
There are four types of data analytics:
1. Descriptive analytics
2. Diagnostic analytics
3. Predictive analytics
4. Prescriptive analytics

Descriptive Analytics It is about describing the main features of the data. After data collection is done, descriptive analytics deals with the collected data and quantifies it. It is often stated that analytics is essentially statistics, and there are two aspects of statistics – description and inference. Descriptive analytics focuses only on the description part of the data, not the inference part.
Diagnostic Analytics It deals with the question – ‘Why?’. This is also known as causal analysis, as
it aims to find out the cause and effect of the events. For example, if a product is not selling,
diagnostic analytics aims to find out the reason. There may be multiple reasons and associated
effects are analyzed as part of it.
Predictive Analytics It deals with the future. It deals with the question – ‘What will happen in
future given this data?’. This involves the application of algorithms to identify the patterns to predict
the future. The entire course of machine learning is mostly about predictive analytics and forms the
core of this book.
Prescriptive Analytics It is about finding the best course of action for the business organization. Prescriptive analytics goes beyond prediction and helps in decision making by suggesting a set of actions. It helps organizations to plan better for the future and to mitigate the risks that are involved.

2.3 BIG DATA ANALYSIS FRAMEWORK


For performing data analytics, many frameworks have been proposed. All proposed analytics frameworks have some common factors. A big data framework is a layered architecture; such an architecture has many advantages, such as generality. A 4-layer architecture has the following layers:
1. Data connection layer
2. Data management layer
3. Data analytics layer
4. Presentation layer
Data Connection Layer It has data ingestion mechanisms and data connectors. Data ingestion means taking raw data and importing it into appropriate data structures. This layer performs the tasks of the ETL process, that is, extract, transform, and load operations.

Data Management Layer It performs preprocessing of data. The purpose of this layer is to
allow parallel execution of queries, and read, write and data management tasks. There may be
many schemes that can be implemented by this layer such as data-in-place, where the data is
not moved at all, or constructing data repositories such as data warehouses and pull data
on-demand mechanisms.
Data Analytics Layer It has many functionalities, such as statistical tests, machine learning algorithms, and the construction of machine learning models. This layer also implements many model validation mechanisms. The processing is done as shown in Box 2.1.

Presentation Layer It has mechanisms such as dashboards, and applications that display the

results of analytical engines and machine learning algorithms.


Thus, the Big Data processing cycle involves data management that consists of the following
steps.
1. Data collection
2. Data preprocessing
3. Applications of machine learning algorithm
4. Interpretation of results and visualization of machine learning algorithm
This is an iterative process and is carried out on a permanent basis to ensure that data is suitable
for data mining.
Application and interpretation of machine learning algorithms constitute the basis for the rest
of the book. So, primarily, data collection and data preprocessing are covered as part of this chapter.

2.3.1 Data Collection


The first task in gathering datasets is the collection of data. It is often estimated that most of the time is spent on the collection of good-quality data, and good-quality data yields better results. It is often difficult to characterize 'good data'; good data is data that has the following properties:
1. Timeliness – The data should be current and not stale or obsolete.
2. Relevancy – The data should be relevant and ready for the machine learning or data mining algorithms. All the necessary information should be available and there should be no bias in the data.
3. Knowledge about the data – The data should be understandable and interpretable, and should be self-sufficient for the required application as desired by the domain knowledge engineer.

Broadly, the data source can be classified as open/public data, social media data and multimodal
data.
1. Open or public data source – It is a data source that does not have any stringent copyright
rules or restrictions. Its data can be primarily used for many purposes. Government census
data are good examples of open data:
• Digital libraries that have huge amount of text data as well as document images
• Scientific domains with a huge collection of experimental data like genomic data
and biological data
• Healthcare systems that use extensive databases like patient databases, health insurance
data, doctors’ information, and bioinformatics information
2. Social media – It is the data generated by various social media platforms like Twitter, Facebook, YouTube, and Instagram. An enormous amount of data is generated by these platforms.
3. Multimodal data – It includes data that involves many modes such as text, video, audio
and mixed types. Some of them are listed below:
• Image archives contain larger image databases along with numeric and text data
• The World Wide Web (WWW) has huge amount of data that is distributed on the Internet.
These data are heterogeneous in nature.

2.3.2 Data Preprocessing


In the real world, the available data is 'dirty'. By the word 'dirty', it means:
• Incomplete data
• Inaccurate data
• Outlier data
• Data with missing values
• Data with inconsistent values
• Duplicate data
Data preprocessing improves the quality of the results of data mining techniques; the raw data must be preprocessed to give accurate results. The process of detection and removal of errors in data is called data cleaning. Data wrangling means making the data processable for machine learning algorithms. Some of the data errors include human errors such as typographical errors or incorrect
measurement, and structural errors like improper data formats. Data errors can also arise from omission and duplication of attributes. Noise is a random component and involves distortion of a value or the introduction of spurious objects. Often, the term noise is used when the data has a spatial or temporal component. Certain deterministic distortions in the form of a streak are known as artifacts.
Consider, for example, the patient data in Table 2.1. 'Bad' or 'dirty' data can be observed in this table.


It can be observed that data like Salary = ' ' is incomplete data. The DoB of the patients John, Andre,
and Raju is missing data. The age of David is recorded as '5', but his DoB indicates it is 10/10/1980;
this is called inconsistent data.

Inconsistent data occurs due to problems in conversions, inconsistent formats, and difference in
units. Salary for John is -1500. It cannot be less than ‘0’. It is an instance of noisy data. Outliers are
data that exhibit the characteristics that are different from other data and have very unusual values.
The age of Raju cannot be 136. It might be a typographical error. It is often required to distinguish
between noise and outlier data.
Outliers may be legitimate data and are sometimes of interest to data mining algorithms. Such errors
often creep in during the data collection stage. They must be removed so that machine learning
algorithms yield better results, as the quality of the results is determined by the quality of the input
data. This removal process is called data cleaning.

Missing Data Analysis


The primary data cleaning process is missing data analysis. Data cleaning routines attempt to fill in
the missing values, smooth the noise while identifying the outliers, and correct inconsistencies in the
data. This enables data mining to avoid overfitting of the models.
The procedures that are given below can solve the problem of missing data:
1. Ignore the tuple – A tuple with missing data, especially the class label, is ignored. This
method is not effective when the percentage of the missing values increases.
2. Fill in the values manually – Here, the domain expert can analyse the data tables and carry
out the analysis and fill in the values manually. But, this is time consuming and may not
be feasible for larger sets.
3. A global constant can be used to fill in the missing attributes. The missing values may be labelled
'Unknown' or 'Infinity'. But, some data mining algorithms may give spurious results by analysing
these labels.
4. The attribute mean may be used to fill in the missing value. Say, the average income can replace
a missing income value.
5. Use the attribute mean of all samples belonging to the same class. Here, the average value of the
class replaces the missing values of all tuples that fall in this group.
6. Use the most probable value to fill in the missing value. The most probable value can be
obtained from other methods like classification and decision tree prediction.
Some of these methods introduce bias in the data. The filled value may not be correct and could
be just an estimated value. Hence, the difference between the estimated and the original value is
called an error or bias.
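The strategies above can be sketched with pandas. The following is a minimal, hedged illustration; the patient records and column names are hypothetical stand-ins and not the actual Table 2.1.

```python
# A minimal sketch of common missing-value strategies using pandas.
# The records and column names here are hypothetical, not Table 2.1.
import numpy as np
import pandas as pd

patients = pd.DataFrame({
    "Name":   ["John", "Andre", "Raju", "David"],
    "Age":    [21, 36, 43, 40],
    "Salary": [np.nan, 25000.0, 28000.0, 35000.0],   # np.nan marks a missing value
})

# Strategy 1: ignore the tuple - drop rows that have any missing value.
dropped = patients.dropna()

# Strategy 3: fill with a global constant such as 'Unknown'.
constant_filled = patients.fillna({"Salary": "Unknown"})

# Strategy 4: fill with the attribute mean (mean of the non-missing salaries).
mean_filled = patients.fillna({"Salary": patients["Salary"].mean()})

print(dropped, constant_filled, mean_filled, sep="\n\n")
```

Strategies 5 and 6 follow the same pattern, except that the replacement value is computed per class (for example, via a groupby on the class label) or predicted by a model.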


Removal of Noisy or Outlier Data


Noise is a random error or variance in a measured value. It can be removed by binning, a method
where the given data values are sorted and distributed into equal-frequency bins. The bins are also
called buckets. The binning method then uses the neighbouring values to smooth the noisy data.
Some of the techniques commonly used are 'smoothing by bin means', where the mean of the bin
replaces the values in the bin; 'smoothing by bin medians', where the bin median replaces the bin
values; and 'smoothing by bin boundaries', where each bin value is replaced by the closest bin
boundary. The maximum and minimum values of a bin are called its bin boundaries. Binning methods
may also be used as a discretization technique. Example 2.1 illustrates this principle.

Example 2.1: Consider the following set: S = {12, 14, 19, 22, 24, 26, 28, 31, 34}. Apply various
binning techniques and show the result.
Solution: By the equal-frequency bin method, the data is distributed across the bins. Assuming
bins of size 3, the above data is distributed across the bins as shown below:
Bin 1: 12, 14, 19
Bin 2: 22, 24, 26
Bin 3: 28, 31, 34
By the smoothing by bin means method, the values of each bin are replaced by the bin mean. This
method results in:
Bin 1: 15, 15, 15
Bin 2: 24, 24, 24
Bin 3: 31, 31, 31
Using the smoothing by bin boundaries method, the bins' values would be:
Bin 1: 12, 12, 19
Bin 2: 22, 22, 26
Bin 3: 28, 28, 34
As per this method, the minimum and maximum values of each bin are taken as the bin boundaries
and do not change. The remaining values are transformed to the nearest boundary. It can be observed
in Bin 1 that the middle value 14 is compared with the boundary values 12 and 19 and changed to the
closer value, that is 12. When a value is equidistant from both boundaries, as 24 in Bin 2 and 31 in
Bin 3, it is mapped to the lower boundary here. This process is repeated for all bins.
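A short plain-Python sketch of these binning operations is given below; it assumes, as in the solution above, that a value equidistant from both boundaries is mapped to the lower boundary.

```python
# A minimal sketch of equal-frequency binning and the smoothing methods
# of Example 2.1 (plain Python, no external libraries).
S = [12, 14, 19, 22, 24, 26, 28, 31, 34]
bin_size = 3
bins = [S[i:i + bin_size] for i in range(0, len(S), bin_size)]

# Smoothing by bin means: every value becomes the bin mean.
by_means = [[sum(b) / len(b)] * len(b) for b in bins]

# Smoothing by bin medians: every value becomes the bin median (bins of size 3).
by_medians = [[sorted(b)[len(b) // 2]] * len(b) for b in bins]

# Smoothing by bin boundaries: each value snaps to the nearer of min/max,
# with ties going to the lower boundary.
by_boundaries = [
    [min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
    for b in bins
]

print(bins)           # [[12, 14, 19], [22, 24, 26], [28, 31, 34]]
print(by_means)       # [[15.0, 15.0, 15.0], [24.0, 24.0, 24.0], [31.0, 31.0, 31.0]]
print(by_boundaries)  # [[12, 12, 19], [22, 22, 26], [28, 28, 34]]
```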

Data Integration and Data Transformations


Data integration involves routines that merge data from multiple sources into a single data source.
So, this may lead to redundant data. The main goal of data integration is to detect and remove
redundancies that arise from integration. Data transformation routines perform operations like
normalization to improve the performance of the data mining algorithms. It is necessary to
transform data so that it can be processed. This can be considered as a preliminary stage of data
conditioning. Normalization is one such technique. In normalization, the attribute values are
scaled to fit in a range (say 0-1) to improve the performance of the data mining algorithm. Often, in
neural networks, these techniques are used. Some of the normalization procedures used are:
1. Min-Max
2. z-Score
Min-Max Procedure It is a normalization technique where each value v of a variable V is shifted by
the minimum value and scaled by the range into a new range, say 0–1. Often, neural networks require
this kind of normalization. The formula to implement this normalization is given as:

v' = ((v − min) / (max − min)) × (new_max − new_min) + new_min (2.1)

Here, max − min is the range. min and max are the minimum and maximum of the given data, and
new_max and new_min are the maximum and minimum of the target range, say 1 and 0.

Example 2.2: Consider the set: V = {88, 90, 92, 94}. Apply the Min-Max procedure and map the
marks to a new range 0–1.


Solution: The minimum of the list V is 88 and the maximum is 94. The new min and new max are
0 and 1, respectively. The mapping can be done using Eq. (2.1); for example, the mark 90 is mapped
to (90 − 88) / (94 − 88) = 0.33.

So, it can be observed that the marks {88, 90, 92, 94} are mapped to the new range {0, 0.33, 0.67, 1}.
Thus, the Min-Max normalized values lie between 0 and 1.

z-Score Normalization This procedure works by taking the difference between the field value and
the mean value, and scaling this difference by the standard deviation of the attribute:

v' = (v − m) / s (2.2)

Here, s is the standard deviation of the list V and m is the mean of the list V.
Example 2.3: Consider the mark list V = {10, 20, 30}; convert the marks to z-scores.
Solution: The mean and sample standard deviation (s) of the list V are 20 and 10, respectively. So
the z-scores of these marks are calculated using Eq. (2.2) as:

(10 − 20)/10 = −1, (20 − 20)/10 = 0, (30 − 20)/10 = 1

Hence, the z-scores of the marks 10, 20, 30 are −1, 0 and 1, respectively.
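Both normalization procedures can be sketched with NumPy; the short functions below reproduce Examples 2.2 and 2.3 and are only an illustrative sketch, not library routines.

```python
# A minimal sketch of Min-Max and z-score normalization using NumPy,
# reproducing Examples 2.2 and 2.3.
import numpy as np

def min_max(values, new_min=0.0, new_max=1.0):
    v = np.asarray(values, dtype=float)
    return (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

def z_score(values):
    v = np.asarray(values, dtype=float)
    # ddof=1 gives the sample standard deviation, as used in Example 2.3.
    return (v - v.mean()) / v.std(ddof=1)

print(min_max([88, 90, 92, 94]))   # [0.   0.33 0.67 1.  ] approximately
print(z_score([10, 20, 30]))       # [-1.  0.  1.]
```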

Data Reduction
Data reduction reduces the size of the data while producing almost the same analytical results. There
are different ways in which data reduction can be carried out, such as data aggregation, feature
selection, and dimensionality reduction.
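As a hedged illustration, feature selection and dimensionality reduction can be sketched with scikit-learn on synthetic data; the threshold and component count below are arbitrary choices, not values from the text.

```python
# A minimal sketch of two data reduction approaches using scikit-learn.
# The random data and the parameter values are only illustrative.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold

X = np.random.rand(100, 10)   # 100 samples with 10 attributes

X_reduced = PCA(n_components=3).fit_transform(X)                  # dimensionality reduction
X_selected = VarianceThreshold(threshold=0.05).fit_transform(X)   # feature selection

print(X_reduced.shape, X_selected.shape)
```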

2.4 DESCRIPTIVE STATISTICS


Descriptive statistics is a branch of statistics that does dataset summarization. It is used to
summarize and describe data. Descriptive statistics are just descriptive and do not go beyond that.
In other words, descriptive statistics is not concerned with the machine learning algorithms
themselves or their functioning.
Let us discuss descriptive statistics with the fundamental concepts of datatypes.
Dataset and Data Types
A dataset can be assumed to be a collection of data objects. The data objects may be records,
points, vectors, patterns, events, cases, samples or observations. These records contain many
attributes. An attribute can be defined as the property or characteristics of an object.


For example, consider the following database shown in sample Table 2.2.

Every attribute should be associated with a value. This process is called measurement.
The type of attribute determines the data types, often referred to as measurement scale types.
The data types are shown in Figure 2.1.

Broadly, data can be classified into two types:


1. Categorical or qualitative data
2. Numerical or quantitative data


Categorical or Qualitative Data The categorical data can be divided into two types. They are
nominal type and ordinal type.
• Nominal Data – In Table 2.2, patient ID is nominal data. Nominal data are symbols and
cannot be processed like a number. For example, the average of a patient ID does not make
any statistical sense. Nominal data type provides only information but has no ordering
among data. Only operations like (=, ≠) are meaningful for these data. For example, the
patient ID can be checked for equality and nothing else.
• Ordinal Data – It provides enough information and has a natural order. For example, Fever
= {Low, Medium, High} is ordinal data. Certainly, Low is less than Medium and Medium is less
than High, irrespective of the underlying values. Any order-preserving transformation can be
applied to these data to get a new value.
Numeric or Quantitative Data It can be divided into two categories: interval type and ratio type.
• Interval Data – Interval data is numeric data for which the differences between values are
meaningful. For example, the difference between 30 degrees and 40 degrees is meaningful. The
only permissible operations are + and −.
• Ratio Data – For ratio data, both differences and ratios are meaningful. The difference between
ratio and interval data is the position of zero in the scale. For example, take the Centigrade–
Fahrenheit conversion: the zeroes of the two scales do not match. Hence, these temperature
scales are interval data, not ratio data.

Another way of classifying the data is to classify it as:


1. Discrete value data
2. Continuous data
Discrete Data This kind of data is recorded as integers. For example, the responses of the survey
can be discrete data. Employee identification number such as 10001 is discrete data.
Continuous Data It can be fitted into a range and includes decimal point. For example, age is a
continuous data. Though age appears to be discrete data, one may be 12.5 years old and it makes
sense. Patient height and weight are all continuous data.
A third way of classifying the data is based on the number of variables used in the dataset. Based
on that, the data can be classified as univariate data, bivariate data, and multivariate data. This is
shown in Figure 2.2.

2.5 UNIVARIATE DATA ANALYSIS AND VISUALIZATION


Univariate analysis is the simplest form of statistical analysis. As the name indicates, the dataset
has only one variable. A variable can also be called a category. Univariate analysis does not deal with
causes or relationships; its aim is to describe the data and find patterns.
Univariate data description involves finding the frequency distributions, central tendency
measures, dispersion or variation, and shape of the data.

2.5.1 Data Visualization


Let us consider some common forms of graphs.

Bar Chart A Bar chart (or Bar graph) is used to display the frequency distribution for variables.
Bar charts are used to illustrate discrete data. The charts can also help to explain the counts of
nominal data. It also helps in comparing the frequency of different groups.
The bar chart for students' marks {45, 60, 60, 80, 85} with Student ID = {1, 2, 3, 4, 5} is shown
below in Figure 2.3.

Pie Chart These are equally helpful in illustrating the univariate data. The percentage frequency
distribution of students' marks {22, 22, 40, 40, 70, 70, 70, 85, 90, 90} is below in Figure 2.4.

It can be observed that the number of students with 22 marks is 2. The total number of students
is 10. So, 2/10 × 100 = 20% of the pie of 100% is allotted for the mark 22 in Figure 2.4.


Histogram It plays an important role in data mining for showing frequency distributions.
The histogram for students’ marks {45, 60, 60, 80, 85} in the group range of 0-25, 26-50, 51-75,
76-100 is given below in Figure 2.5. One can visually inspect from Figure 2.5 that the number of
students in the range 76-100 is 2.

Histogram conveys useful information like nature of data and its mode. Mode indicates the
peak of dataset. In other words, histograms can be used as charts to show frequency, skewness
present in the data, and shape.

Dot Plots These are similar to bar charts. They are less clustered as compared to bar charts,
as they illustrate the bars only with single points. The dot plot of English marks for five students
with ID as {1, 2, 3, 4, 5} and marks {45, 60, 60, 80, 85} is given in Figure 2.6. The advantage
is that by visual inspection one can find out who got more marks.
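These univariate charts can be reproduced with matplotlib; the following hedged sketch uses the same marks as in Figures 2.3–2.6.

```python
# A minimal matplotlib sketch of the univariate plots discussed above,
# using the students' marks from the examples.
import matplotlib.pyplot as plt

ids = [1, 2, 3, 4, 5]
marks = [45, 60, 60, 80, 85]

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].bar(ids, marks)                         # bar chart, as in Figure 2.3
axes[0].set_title("Bar chart")
axes[1].hist(marks, bins=[0, 25, 50, 75, 100])  # histogram, as in Figure 2.5
axes[1].set_title("Histogram")
axes[2].plot(ids, marks, "o")                   # dot plot, as in Figure 2.6
axes[2].set_title("Dot plot")
plt.tight_layout()
plt.show()
```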

2.5.2 Central Tendency


Therefore, a condensation or summary of the data is necessary. This makes the data analysis easy
and simple. One such summary is called central tendency. Thus, central tendency can explain the
characteristics of data and further helps in comparison. Mass data have a tendency to concentrate
around certain values, normally in the central location; measures of this concentration are called
measures of central tendency (or averages). Popular measures are the mean, median and mode.
1. Mean – The arithmetic average (or mean) is a measure of central tendency that represents the
'center' of the dataset. Mathematically, the average of all the values in the sample (population) is
denoted as x̄. Let x1, x2, …, xN be a set of 'N' values or observations; then the arithmetic mean is
given as:

x̄ = (x1 + x2 + … + xN) / N = (1/N) Σ xi

For example, the mean of the three numbers 10, 20, and 30 is 20.

• Weighted mean – Unlike the arithmetic mean, which weights all items equally, the weighted
mean gives different importance to different items, as their importance varies; hence, different
weightages can be assigned to the items. In the case of a frequency distribution, the mid values of
the ranges are taken for computation. In the weighted mean, the mean is computed by adding the
products of the proportions (weights) and the group means. It is mostly used when the sample
sizes are unequal.

• Geometric mean – Let x1, x2, …, xN be a set of 'N' values or observations. The geometric mean
is the Nth root of the product of the N items. The formula for computing the geometric mean is
given as follows:

GM = (x1 × x2 × … × xN)^(1/N)

Here, N is the number of items and xi are the values. For example, if the values are 6 and 8, the
geometric mean is √(6 × 8) = √48 ≈ 6.93. For larger datasets, computing the product directly is
difficult. Hence, the geometric mean is usually calculated through logarithms as:

GM = antilog( (1/N) Σ log xi )

The problem with the mean is its extreme sensitivity to noise. Even small changes in the input
affect the mean drastically. Hence, often the top 2% of the values is chopped off and then the mean
is calculated for the larger dataset.

2. Median – The middle value in the distribution is called the median. If the total number of items
in the distribution is odd, the middle value is the median; if it is even, the average of the two middle
values is taken as the median. For grouped (continuous) data, the median is given by the formula:

Median = L1 + ((N/2 − cf) / f) × i

The median class is the class in which the (N/2)th item is present. Here, i is the class interval of the
median class, L1 is the lower limit of the median class, f is the frequency of the median class, and
cf is the cumulative frequency of all classes preceding the median class.
3. Mode – Mode is the value that occurs most frequently in the dataset. In other words, the value
that has the highest frequency is called the mode.
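These central tendency measures can be computed with Python's standard statistics module; the numbers below are illustrative only.

```python
# A minimal sketch of the central tendency measures using the standard
# statistics module (geometric_mean requires Python 3.8+).
import statistics as st

values = [10, 20, 30]

print(st.mean(values))               # arithmetic mean -> 20
print(st.median(values))             # median -> 20
print(st.mode([1, 2, 2, 3]))         # mode -> 2
print(st.geometric_mean([6, 8]))     # geometric mean -> about 6.93

# Weighted mean: sum of value * weight divided by the sum of weights.
vals, weights = [60, 80], [0.3, 0.7]
print(sum(v * w for v, w in zip(vals, weights)) / sum(weights))   # -> 74.0
```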

2.5.3 Dispersion
The spread of a set of data around the central tendency (mean, median or mode) is called
dispersion. Dispersion is represented in various ways such as range, variance, standard deviation,
and standard error. These are second order measures. The most common measures of dispersion
are listed below:

Range Range is the difference between the maximum and minimum of values of the given list
of data.

Standard Deviation The mean does not convey much more than a middle point. For example,
the following datasets {10, 20, 30} and {10, 50, 0} both have a mean of 20. The difference between
these two sets is the spread of data. Standard deviation is the average distance from the mean of
the dataset to each point.


The formula for the standard deviation is given by:

s = √( (1/N) Σ (xi − m)² ) (2.8)

Here, N is the size of the population, xi is an observation or value from the population and m is
the population mean. Often, N − 1 is used instead of N in the denominator of Eq. (2.8), which gives
the sample standard deviation.

Quartiles and Inter Quartile Range It is sometimes convenient to subdivide the dataset using
coordinates called percentiles. The kth percentile is the value Xi such that k% of the data lies at or
below Xi. For example, the median is the 50th percentile and can be denoted as Q0.50. The 25th
percentile is called the first quartile (Q1) and the 75th percentile is called the third quartile (Q3).
Another measure that is useful for measuring dispersion is the Inter Quartile Range (IQR), which is
the difference between Q3 and Q1:
IQR = Q3 − Q1 (2.9)
Outliers are normally the values falling at least 1.5 × IQR above the third quartile or below the
first quartile.
The interquartile range can also be written as Q0.75 − Q0.25. (2.10)

Example 2.4: For the patients' age list {12, 14, 19, 22, 24, 26, 28, 31, 34}, find the IQR.
Solution: The median is in the fifth position; in this case, 24 is the median. The first quartile is the
median of the scores below the median, i.e., of {12, 14, 19, 22}. In this case, it is the average of the
second and third values, that is, Q0.25 = 16.5. Similarly, the third quartile is the median of the
values above the median, that is, of {26, 28, 31, 34}. So, Q0.75 is the average of the seventh and
eighth scores of the full list, i.e., (28 + 31)/2 = 29.5.
Hence, the IQR using Eq. (2.10) is:
IQR = Q0.75 − Q0.25 = 29.5 − 16.5 = 13

Five-point Summary and Box Plots The median, quartiles Q1 and Q3, and minimum
and maximum written in the order < Minimum, Q1, Median, Q3, Maximum > is known as
five-point summary.

Example 2.5: Find the 5-point summary of the list {13, 11, 2, 3, 4, 8, 9}.
Solution: The minimum is 2 and the maximum is 13. The Q1, Q2 and Q3 are 3, 8 and 11, respectively.
Hence, 5-point summary is {2, 3, 8, 11, 13}, that is, {minimum, Q1, median, Q3, maximum}. Box plots
are useful for describing 5-point summary. The Box plot for the set is given in
Figure 2.7.
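The quartiles and the five-point summary can also be obtained with NumPy, as sketched below. Note that np.percentile interpolates between values by default, so its quartiles need not match the median-of-halves method used in Example 2.4.

```python
# A minimal sketch of quartiles, IQR and the five-point summary with NumPy.
# np.percentile interpolates, so results can differ slightly from the
# hand-computed quartiles in Examples 2.4 and 2.5.
import numpy as np

ages = np.array([12, 14, 19, 22, 24, 26, 28, 31, 34])
q1, q2, q3 = np.percentile(ages, [25, 50, 75])
print(q1, q2, q3, "IQR =", q3 - q1)

marks = np.array([13, 11, 2, 3, 4, 8, 9])
summary = [marks.min(), *np.percentile(marks, [25, 50, 75]), marks.max()]
print("Five-point summary:", summary)
```

A box plot of the same five-point summary can then be drawn with matplotlib's plt.boxplot.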

2.5.4 Shape
Skewness and Kurtosis (called moments) indicate the symmetry/asymmetry and peak location of
the dataset.
Skewness
The measures of direction and degree of symmetry are called measures of third order. Ideally,
skewness should be zero as in ideal normal distribution. More often, the given dataset may not
have perfect symmetry (consider the following Figure 2.8).

Generally, for a negatively skewed distribution, the median is more than the mean. The relationship
between skew and the relative sizes of the mean and median can be summarized by a convenient
numerical skew index known as the Pearson 2 skewness coefficient:

Skewness coefficient = 3 × (mean − median) / s

Also, the following measure is more commonly used to measure skewness. Let x1, x2, …, xN be a
set of 'N' values or observations; then the skewness can be given as:

Skewness = (1/N) Σ ((xi − m) / s)³

Here, m is the population mean and s is the population standard deviation of the univariate data.
Sometimes, for bias correction, N − 1 is used instead of N.

Kurtosis
Kurtosis also indicates the peakedness of the data. If the data has a high peak, it indicates higher
kurtosis and vice versa. Kurtosis is measured using the formula given below:

Kurtosis = (1/N) Σ ((xi − x̄) / s)⁴ (2.14)

Here, x̄ and s are the mean and standard deviation of the univariate data, respectively. For bias
correction, N − 1 is sometimes used instead of N in Eq. (2.14).
Some of the other useful measures for finding the shape of the univariate dataset are mean
absolute deviation (MAD) and coefficient of variation (CV).

Mean Absolute Deviation (MAD)


MAD is another dispersion measure and is robust to outliers. Normally, an outlier point is detected
by computing its deviation from the median and dividing it by the MAD. Here, the absolute deviation
between the data and the mean is taken. Thus, the mean absolute deviation is given as:

MAD = (1/N) Σ |xi − x̄|
Coefficient of Variation (CV)


Coefficient of variation is used to compare datasets with different units. CV is the ratio of the
standard deviation to the mean, CV = s / x̄, and %CV = (s / x̄) × 100 is the coefficient of variation
expressed as a percentage.
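These shape and dispersion measures can be computed directly with SciPy and NumPy; the sample values below are illustrative only.

```python
# A minimal sketch of skewness, kurtosis, MAD and %CV using SciPy and NumPy.
import numpy as np
from scipy import stats

x = np.array([13, 11, 2, 3, 4, 8, 9], dtype=float)

print("skewness:", stats.skew(x))                    # third-moment skewness
print("kurtosis:", stats.kurtosis(x, fisher=False))  # fourth-moment (Pearson) kurtosis
print("MAD:", np.mean(np.abs(x - x.mean())))         # mean absolute deviation
print("%CV:", x.std(ddof=1) / x.mean() * 100)        # coefficient of variation in %
```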

2.5.5 Special Univariate Plots


The ideal way to check the shape of the dataset is a stem and leaf plot. A stem and leaf plot is a
display that helps us to know the shape and distribution of the data. In this method, each value is
split into a 'stem' and a 'leaf'. The last digit is usually the leaf and the digits to the left of the leaf
mostly form the stem. For example, the mark 45 is divided into stem 4 and leaf 5 in Figure 2.9.
The stem and leaf plot for the English subject marks, say, {45, 60, 60, 80, 85} is given in


Figure 2.9.

It can be seen from Figure 2.9 that the first column is the stem and the second column is the leaf.
For the given English marks, the two students with 60 marks are shown in the stem and leaf plot as
stem 6 with two leaves of 0. A normal Q-Q plot compares the quantiles of the data against the
quantiles of a normal distribution. The normal Q-Q plot for the marks x = [13 11 2 3 4 8 9] is given
below in Figure 2.10.
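A plot similar to Figure 2.10 can be generated with SciPy's probplot, as in the hedged sketch below.

```python
# A minimal sketch of a normal Q-Q plot for the marks, using SciPy and matplotlib.
import matplotlib.pyplot as plt
from scipy import stats

x = [13, 11, 2, 3, 4, 8, 9]
stats.probplot(x, dist="norm", plot=plt)   # sample quantiles vs. normal quantiles
plt.title("Normal Q-Q plot of marks")
plt.show()
```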

2.6 BIVARIATE DATA AND MULTIVARIATE DATA


Bivariate data involves two variables. Bivariate analysis deals with causes and relationships; the aim
is to find relationships among the data. Consider the following Table 2.3, with data on the
temperature in a shop and the sales of sweaters.

Here, the aim of bivariate analysis is to find relationships among variables. The relationships can
then be used in comparisons, finding causes, and in further explorations. To do that, graphical
display of the data is necessary. One such graph method is called scatter plot.

Scatter plot is used to visualize bivariate data. It is useful to plot two variables with or without
