Data Science IV

Unit – I

What is Machine Learning? How does it work? What are the features of Machine Learning?

Machine Learning:
Machine learning is a growing technology which enables computers to
learn automatically from past data. Machine learning uses various
algorithms for building mathematical models and making
predictions using historical data or information. Currently, it is being
used for various tasks such as image recognition, speech
recognition, email filtering, Facebook auto-tagging, recommender
system, and many more.
In the real world, we are surrounded by humans who can learn everything
from their experiences, and we have computers or machines which work on
our instructions. But can a machine also learn from experiences or past
data the way a human does? This is where Machine Learning comes in.
Machine Learning is a subset of artificial intelligence that is
mainly concerned with the development of algorithms which allow a
computer to learn from data and past experiences on its own. The
term machine learning was first introduced by Arthur Samuel in 1959.
We can define it in a summarized way as:
Machine learning enables a machine to automatically learn from
data, improve performance from experiences, and predict things
without being explicitly programmed.
With the help of sample historical data, which is known as training data,
machine learning algorithms build a mathematical model that helps in
making predictions or decisions without being explicitly programmed.
Machine learning brings computer science and statistics together for
creating predictive models. Machine learning constructs or uses the
algorithms that learn from historical data. The more information we
provide, the better the performance will be.
A machine has the ability to learn if it can improve its
performance by gaining more data.
How does Machine Learning work?
A Machine Learning system learns from historical data, builds the
prediction models, and whenever it receives new data, predicts the
output for it. The accuracy of the predicted output depends upon the
amount of data, as a larger amount of data helps to build a better model
which predicts the output more accurately.
Suppose we have a complex problem where we need to perform some
predictions. Instead of writing code for it, we just need to feed the
data to generic algorithms, and with the help of these algorithms, the
machine builds the logic from the data and predicts the output. Machine
learning has changed our way of thinking about such problems. In short, a
Machine Learning algorithm works as follows: gather past data, train the
algorithm on it to build a model, and use the model to predict the output
for new data.
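The flow just described can be illustrated with a short, hedged Python sketch (assuming scikit-learn is installed; the house-size/price numbers are made up for illustration):

# Minimal sketch of the "learn from historical data, predict on new data" flow.
from sklearn.linear_model import LinearRegression

# Historical (training) data: house size in square feet -> price (hypothetical units)
X_train = [[500], [750], [1000], [1250], [1500]]
y_train = [25, 37, 50, 62, 75]

model = LinearRegression()      # generic algorithm
model.fit(X_train, y_train)     # build the logical model from past data

# New, unseen data: the model predicts the output for it
print(model.predict([[1100]]))  # predicted price for an unseen house size (about 55)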

Features of Machine Learning:


o Machine learning uses data to detect various patterns in a given
dataset.
o It can learn from past data and improve automatically.
o It is a data-driven technology.
o Machine learning is similar to data mining, as it also deals with huge
amounts of data.
Need for Machine Learning
The need for machine learning is increasing day by day. The reason
behind the need for machine learning is that it is capable of doing tasks
that are too complex for a person to implement directly. As humans, we
have some limitations: we cannot access and process huge amounts of data
manually, so we need computer systems, and machine learning comes in to
make things easy for us.
We can train machine learning algorithms by providing them huge amounts
of data and letting them explore the data, construct the models, and
predict the required output automatically. The performance of a
machine learning algorithm depends on the amount of data, and it can be
measured by the cost function. With the help of machine learning, we
can save both time and money.
The importance of machine learning can be easily understood from its use
cases. Currently, machine learning is used in self-driving cars, cyber
fraud detection, face recognition, friend suggestion by Facebook,
etc. Various top companies such as Netflix and Amazon have built
machine learning models that use vast amounts of data to analyze
user interests and recommend products accordingly.
Following are some key points which show the importance of
Machine Learning:
o Rapid increase in the production of data
o Solving complex problems which are difficult for a human
o Decision making in various sectors, including finance
o Finding hidden patterns and extracting useful information from data.
Classification of Machine Learning
At a broad level, machine learning can be classified into three types:
1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning

1) Supervised Learning
Supervised learning is a type of machine learning method in which we
provide sample labeled data to the machine learning system in order to
train it, and on that basis, it predicts the output.
The system creates a model using labeled data to understand the
datasets and learn about each one. Once the training and processing are
done, we test the model by providing sample data to check whether it
predicts the correct output or not.
The goal of supervised learning is to map input data to the output data.
Supervised learning is based on supervision, much like a student
learning things under the supervision of a teacher. An example of
supervised learning is spam filtering.
Supervised learning can be grouped further into two categories of
algorithms:
o Classification
o Regression
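A minimal supervised-learning sketch is given below, assuming scikit-learn is available; the hours-studied/pass-fail data is hypothetical:

# Supervised learning sketch: the model learns from labeled examples.
from sklearn.neighbors import KNeighborsClassifier

X_train = [[1], [2], [3], [6], [7], [8]]    # input feature: hours studied (hypothetical)
y_train = [0, 0, 0, 1, 1, 1]                # labels: 0 = fail, 1 = pass

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)                   # training under "supervision" (known labels)

print(clf.predict([[5]]))                   # predicted class for an unseen input

A regression task would be set up the same way, except the targets would be continuous values and a regressor (for example, LinearRegression) would replace the classifier.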
2) Unsupervised Learning
Unsupervised learning is a learning method in which a machine learns
without any supervision.
The training is provided to the machine with the set of data that has not
been labeled, classified, or categorized, and the algorithm needs to act on
that data without any supervision. The goal of unsupervised learning is to
restructure the input data into new features or a group of objects with
similar patterns.
In unsupervised learning, we don't have a predetermined result. The
machine tries to find useful insights from the huge amount of data. It can
be further classified into two categories of algorithms:
o Clustering
o Association
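A minimal unsupervised-learning sketch, assuming scikit-learn; the unlabeled 2-D points are hypothetical and K-Means is one common clustering choice:

# Unsupervised learning sketch: no labels are given; the algorithm groups the data itself.
from sklearn.cluster import KMeans

X = [[1, 2], [1, 4], [1, 0],
     [10, 2], [10, 4], [10, 0]]            # unlabeled data

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(X)                              # the machine finds the grouping on its own

print(kmeans.labels_)                      # cluster assignment for each point
print(kmeans.cluster_centers_)             # the discovered group centres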
3) Reinforcement Learning
Reinforcement learning is a feedback-based learning method, in which a
learning agent gets a reward for each right action and gets a penalty for
each wrong action. The agent learns automatically from this feedback
and improves its performance. In reinforcement learning, the agent
interacts with the environment and explores it. The goal of the agent is to
collect the most reward points, and hence it improves its performance.
A robotic dog that automatically learns the movement of its limbs is
an example of reinforcement learning.
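As a toy, hedged illustration of the reward/penalty idea (a two-action "bandit" sketch, not a full reinforcement learning framework; the reward probabilities are made up):

# The agent is not told the right action; it tries actions, receives rewards or
# penalties, and learns estimated action values from that feedback.
import random

true_reward_prob = {"action_A": 0.2, "action_B": 0.8}   # hidden from the agent
value_estimate = {"action_A": 0.0, "action_B": 0.0}
counts = {"action_A": 0, "action_B": 0}
epsilon = 0.1                                            # exploration rate

random.seed(0)
for step in range(1000):
    # explore occasionally, otherwise exploit the best-known action
    if random.random() < epsilon:
        action = random.choice(list(value_estimate))
    else:
        action = max(value_estimate, key=value_estimate.get)

    # the environment gives a reward (+1) or a penalty (0) for the chosen action
    reward = 1 if random.random() < true_reward_prob[action] else 0

    # update the running-average value estimate for that action
    counts[action] += 1
    value_estimate[action] += (reward - value_estimate[action]) / counts[action]

print(value_estimate)   # the agent learns which action tends to yield more reward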

Applications of Machine Learning (Real-World Applications of Machine Learning)

Machine learning is a buzzword in today's technology, and it is growing
very rapidly day by day. We use machine learning in our daily life
even without knowing it, for example in Google Maps, Google Assistant,
Alexa, etc. Below are some of the most trending real-world applications of
Machine Learning.
1. Image Recognition:
Image recognition is one of the most common applications of machine
learning. It is used to identify objects, persons, places, digital images, etc.
A popular use case of image recognition and face detection
is the automatic friend tagging suggestion:
Facebook provides us a feature of automatic friend tagging suggestions.
Whenever we upload a photo with our Facebook friends, we
automatically get a tagging suggestion with names, and the technology
behind this is machine learning's face detection and recognition
algorithm.
It is based on the Facebook project named "DeepFace," which is
responsible for face recognition and person identification in the picture.
2. Speech Recognition
While using Google, we get an option of "Search by voice," it comes
under speech recognition, and it's a popular application of machine
learning.
Speech recognition is a process of converting voice instructions into text,
and it is also known as "Speech to text", or "Computer speech
recognition." At present, machine learning algorithms are widely used by
various applications of speech recognition. Google
assistant, Siri, Cortana, and Alexa are using speech recognition
technology to follow the voice instructions.
3. Traffic prediction:
If we want to visit a new place, we take the help of Google Maps, which shows
us the correct path with the shortest route and predicts the traffic
conditions.
It predicts the traffic conditions, such as whether traffic is clear, slow-
moving, or heavily congested, in two ways:
o Real-time location of the vehicle from the Google Maps app and
sensors
o Average time taken on past days at the same time.
Everyone who uses Google Maps is helping to make the app better. It
takes information from the user and sends it back to its database to
improve the performance.
4. Product recommendations:
Machine learning is widely used by various e-commerce and
entertainment companies such as Amazon, Netflix, etc., for product
recommendation to the user. Whenever we search for some product on
Amazon, we start getting advertisements for the same product
while surfing the internet on the same browser, and this is because of
machine learning.
Google understands the user's interest using various machine learning
algorithms and suggests products as per the customer's interest.
Similarly, when we use Netflix, we find recommendations for
entertainment series, movies, etc., and this is also done with the help of
machine learning.
5. Self-driving cars:
One of the most exciting applications of machine learning is self-driving
cars. Machine learning plays a significant role in self-driving cars. Tesla,
the most popular car manufacturing company, is working on self-driving
cars. It uses an unsupervised learning method to train the car models to
detect people and objects while driving.
6. Email Spam and Malware Filtering:
Whenever we receive a new email, it is filtered automatically as
important, normal, and spam. We always receive an important mail in our
inbox with the important symbol and spam emails in our spam box, and
the technology behind this is Machine learning. Below are some spam
filters used by Gmail:
o Content Filter
o Header filter
o General blacklists filter
o Rules-based filters
o Permission filters
Some machine learning algorithms such as Multi-Layer
Perceptron, Decision tree, and Naïve Bayes classifier are used for
email spam filtering and malware detection.
7. Virtual Personal Assistant:
We have various virtual personal assistants such as Google
assistant, Alexa, Cortana, Siri. As the name suggests, they help us in
finding information using our voice instructions. These assistants can
help us in various ways just by our voice instructions, such as playing
music, calling someone, opening an email, scheduling an appointment, etc.
These virtual assistants use machine learning algorithms as an important
part.
These assistants record our voice instructions, send them to a server in
the cloud, decode them using ML algorithms, and act accordingly.
8. Online Fraud Detection:
Machine learning is making our online transactions safe and secure by
detecting fraudulent transactions. Whenever we perform an online
transaction, there are various ways a fraudulent transaction can
take place, such as fake accounts, fake IDs, and stealing money in the
middle of a transaction. To detect this, a Feed Forward Neural
Network helps us by checking whether a transaction is genuine or
fraudulent.
For each genuine transaction, the output is converted into some hash
values, and these values become the input for the next round. For each
genuine transaction, there is a specific pattern which changes for a
fraudulent transaction; hence, the network detects it and makes our online
transactions more secure.
9. Stock Market trading:
Machine learning is widely used in stock market trading. In the stock
market, there is always a risk of ups and downs in shares, so machine
learning's long short-term memory (LSTM) neural network is used
for the prediction of stock market trends.
10. Medical Diagnosis:
In medical science, machine learning is used for disease diagnosis. With
this, medical technology is growing very fast and is able to build 3D models
that can predict the exact position of lesions in the brain.
It helps in finding brain tumors and other brain-related diseases easily.
11. Automatic Language Translation:
Nowadays, if we visit a new place and are not aware of the language,
it is not a problem at all, as machine learning helps us here too by
converting the text into a language we know. Google's GNMT (Google
Neural Machine Translation) provides this feature; it is a neural
machine translation system that translates text into our familiar language,
and this is called automatic translation.
The technology behind automatic translation is a sequence-to-sequence
learning algorithm, which is used with image recognition and
translates text from one language to another.

What are some Canonical Learning Problems in ML

Machine learning Life cycle


Machine learning has given computer systems the ability to
learn automatically without being explicitly programmed. But how does a
machine learning system work? It can be described using the machine
learning life cycle. The machine learning life cycle is a cyclic process to
build an efficient machine learning project. The main purpose of the life
cycle is to find a solution to the problem or project.
Machine learning life cycle involves seven major steps, which are given
below:
o Gathering Data
o Data preparation
o Data Wrangling
o Analyse Data
o Train the model
o Test the model
o Deployment

The most important thing in the complete process is to understand the
problem and to know the purpose of the problem. Therefore, before
starting the life cycle, we need to understand the problem, because a
good result depends on a good understanding of the problem.
In the complete life cycle process, to solve a problem, we create a
machine learning system called a "model", and this model is created by
providing "training". But to train a model we need data; hence, the life
cycle starts with collecting data.
1. Gathering Data:
Data gathering is the first step of the machine learning life cycle. The goal
of this step is to identify and obtain all the data related to the problem.
In this step, we need to identify the different data sources, as data can be
collected from various sources such as files, databases, the internet,
or mobile devices. It is one of the most important steps of the life cycle.
The quantity and quality of the collected data will determine the efficiency
of the output: the more data we have, the more accurate the prediction
will be.
This step includes the below tasks:
o Identify various data sources
o Collect data
o Integrate the data obtained from different sources
By performing the above tasks, we get a coherent set of data, also called
a dataset. It will be used in further steps.
2. Data preparation
After collecting the data, we need to prepare it for further steps. Data
preparation is a step where we put our data into a suitable place and
prepare it to use in our machine learning training.
In this step, first, we put all data together, and then randomize the
ordering of data.
This step can be further divided into two processes:
o Data exploration:
It is used to understand the nature of data that we have to work
with. We need to understand the characteristics, format, and quality
of data.
A better understanding of data leads to an effective outcome. In
this step, we find correlations, general trends, and outliers.
o Data pre-processing:
Now the next step is preprocessing of data for its analysis.
3. Data Wrangling
Data wrangling is the process of cleaning and converting raw data into a
usable format. It is the process of cleaning the data, selecting the
variables to use, and transforming the data into a proper format to make it
more suitable for analysis in the next step. It is one of the most important
steps of the complete process. Cleaning the data is required to address
quality issues.
The data we have collected is not always useful to us, as some of
the data may not be relevant. In real-world applications, collected
data may have various issues, including:
o Missing Values
o Duplicate data
o Invalid data
o Noise
So, we use various filtering techniques to clean the data.
It is mandatory to detect and remove the above issues because they can
negatively affect the quality of the outcome.
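A minimal data-wrangling sketch with pandas (assumed installed); the file "raw.csv" and its column names ("target", "age", "salary") are hypothetical:

# Cleaning raw data: duplicates, missing values, and invalid values.
import pandas as pd

df = pd.read_csv("raw.csv")

df = df.drop_duplicates()                       # remove duplicate records
df = df.dropna(subset=["target"])               # drop rows whose label is missing
df["age"] = df["age"].fillna(df["age"].mean())  # fill missing numeric values
df = df[df["salary"] > 0]                       # filter out invalid values

print(df.isna().sum())                          # check the remaining missing values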
4. Data Analysis
Now the cleaned and prepared data is passed on to the analysis step. This
step involves:
o Selection of analytical techniques
o Building models
o Review the result
The aim of this step is to build a machine learning model to analyze the
data using various analytical techniques and review the outcome. It starts
with determining the type of problem, where we select
machine learning techniques such
as Classification, Regression, Cluster analysis, Association, etc.,
then build the model using the prepared data, and evaluate the model.
5. Train Model
Now the next step is to train the model. In this step we train our model to
improve its performance and obtain a better outcome for the problem.
We use datasets to train the model using various machine learning
algorithms. Training a model is required so that it can learn the
various patterns, rules, and features.
6. Test Model
Once our machine learning model has been trained on a given dataset,
then we test the model. In this step, we check for the accuracy of our
model by providing a test dataset to it.
Testing the model determines the percentage accuracy of the model as
per the requirement of project or problem.
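A minimal train/test sketch of steps 5 and 6, assuming scikit-learn; the built-in Iris dataset is used only for illustration:

# Hold out part of the data so the test measures performance on unseen samples.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)                 # step 5: train the model

accuracy = model.score(X_test, y_test)      # step 6: test the model
print(f"Test accuracy: {accuracy:.2f}")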
7. Deployment
The last step of machine learning life cycle is deployment, where we
deploy the model in the real-world system.
If the above-prepared model is producing an accurate result as per our
requirement with acceptable speed, then we deploy the model in the real
system. But before deploying the project, we will check whether it is
improving its performance using available data or not. The deployment
phase is similar to making the final report for a project.

Key differences between Artificial Intelligence (AI) and Machine Learning (ML):

Artificial Intelligence: Artificial intelligence is a technology which enables a machine to
simulate human behavior.
Machine Learning: Machine learning is a subset of AI which allows a machine to
automatically learn from past data without being programmed explicitly.

Artificial Intelligence: The goal of AI is to make a smart computer system like humans to
solve complex problems.
Machine Learning: The goal of ML is to allow machines to learn from data so that they can
give accurate output.

Artificial Intelligence: In AI, we make intelligent systems to perform any task like a human.
Machine Learning: In ML, we teach machines with data to perform a particular task and give
an accurate result.

Artificial Intelligence: Machine learning and deep learning are the two main subsets of AI.
Machine Learning: Deep learning is a main subset of machine learning.

Artificial Intelligence: AI has a very wide range of scope.
Machine Learning: Machine learning has a limited scope.

Artificial Intelligence: AI is working to create an intelligent system which can perform
various complex tasks.
Machine Learning: Machine learning is working to create machines that can perform only
those specific tasks for which they are trained.

Artificial Intelligence: An AI system is concerned with maximizing the chances of success.
Machine Learning: Machine learning is mainly concerned with accuracy and patterns.

Artificial Intelligence: The main applications of AI are Siri, customer support using
chatbots, expert systems, online game playing, intelligent humanoid robots, etc.
Machine Learning: The main applications of machine learning are online recommender
systems, Google search algorithms, Facebook auto friend tagging suggestions, etc.

Artificial Intelligence: On the basis of capabilities, AI can be divided into three types:
Weak AI, General AI, and Strong AI.
Machine Learning: Machine learning can also be divided into mainly three types:
Supervised learning, Unsupervised learning, and Reinforcement learning.

Artificial Intelligence: It includes learning, reasoning, and self-correction.
Machine Learning: It includes learning and self-correction when introduced with new data.

Artificial Intelligence: AI completely deals with structured, semi-structured, and
unstructured data.
Machine Learning: Machine learning deals with structured and semi-structured data.

What is a dataset?
A dataset is a collection of data in which data is arranged in some order.
A dataset can contain any data from a series of an array to a database
table. Below table shows an example of the dataset:

Country Age Salary Purchased

India 38 48000 No

France 43 45000 Yes


Germany 30 54000 No

France 48 65000 No

Germany 40 Yes

India 35 58000 Yes


A tabular dataset can be understood as a database table or matrix, where
each column corresponds to a particular variable, and each row
corresponds to a record (observation) of the dataset. The most common
file type for a tabular dataset is the "Comma-Separated Values" file, or
CSV. But to store tree-like data, we can use a JSON file more efficiently.
Types of data in datasets
o Numerical data: such as house price, temperature, etc.
o Categorical data: such as Yes/No, True/False, Blue/Green, etc.
o Ordinal data: similar to categorical data but can be
measured on the basis of comparison.
Note: A real-world dataset is of huge size, which is difficult to
manage and process at the initial level. Therefore, to practice
machine learning algorithms, we can use any dummy dataset.
Need of Dataset
To work with machine learning projects, we need a huge amount of data,
because, without the data, one cannot train ML/AI models. Collecting and
preparing the dataset is one of the most crucial parts while creating an
ML/AI project.
The technology applied behind any ML projects cannot work properly if
the dataset is not well prepared and pre-processed.
During the development of the ML project, the developers completely rely
on the datasets. In building ML applications, datasets are divided into two
parts:
o Training dataset
o Test dataset
Note: The datasets are of large size, so to download these
datasets, you must have fast internet on your computer.
Popular sources for Machine Learning datasets
Below is a list of dataset sources which are freely available for the public to
work with:
Kaggle Datasets
Kaggle is one of the best sources for providing datasets for Data Scientists
and Machine Learners. It allows users to find, download, and publish
datasets in an easy way. It also provides the opportunity to work with
other machine learning engineers and solve difficult Data Science related
tasks.
Kaggle provides a high-quality dataset in different formats that we can
easily find and download.
The link for the Kaggle datasets is https://www.kaggle.com/datasets
UCI Machine Learning Repository
UCI Machine learning repository is one of the great sources of machine
learning datasets. This repository contains databases, domain theories,
and data generators that are widely used by the machine learning
community for the analysis of ML algorithms.
Since 1987, it has been widely used by students, professors, and
researchers as a primary source of machine learning datasets.
It classifies the datasets as per the problems and tasks of machine
learning such as Regression, Classification, Clustering, etc. It also
contains some of the popular datasets such as the Iris dataset, Car
Evaluation dataset, Poker Hand dataset, etc.
Datasets via AWS
We can search, download, access, and share the datasets that are publicly
available via AWS resources. These datasets can be accessed through
AWS resources but are provided and maintained by different government
organizations, researchers, businesses, or individuals.
Anyone can analyze and build various services using shared data via AWS
resources. The shared dataset on the cloud helps users to spend more time on
data analysis rather than on acquisition of data.
This source provides various types of datasets with examples and ways
to use the datasets. It also provides a search box with which we can
search for the required dataset. Anyone can add any dataset or example
to the Registry of Open Data on AWS.
The link for the resource is https://registry.opendata.aws/
Google's Dataset Search Engine
Google dataset search engine is a search engine launched
by Google on September 5, 2018. This source helps researchers to get
online datasets that are freely available for use.
The link for the UCI Machine Learning Repository mentioned above
is https://archive.ics.uci.edu/ml/index.php

Formalizing the Learning Problem


The first step in any project is defining your problem. You can use the
most powerful and shiniest algorithms available, but the results will be
meaningless if you are solving the wrong problem.
Problem Definition Framework
I use a simple framework when defining a new problem to address with
machine learning. The framework helps me to quickly understand the
elements and motivation for the problem and whether machine learning is
suitable or not.
The framework involves answering three questions to varying degrees of
thoroughness:
 Step 1: What is the problem?
 Step 2: Why does the problem need to be solved?
 Step 3: How would I solve the problem?
Step 1: What is the Problem
The first step is defining the problem. I use a number of tactics to collect
this information.
Informal description
Describe the problem as though you were describing it to a friend or
colleague. This can provide a great starting point for highlighting areas
that you might need to fill. It also provides the basis for a one sentence
description you can use to share your understanding of the problem.
For example: I need a program that will tell me which tweets will get
retweets.
Formalism
A computer program is said to learn from experience E with respect to
some class of tasks T and performance measure P, if its performance at
tasks in T, as measured by P, improves with experience E.
Use this formalism to define the T, P, and E for your problem.
For example:
 Task (T): Classify a tweet that has not been published as going to
get retweets or not.
 Experience (E): A corpus of tweets for an account where some
have retweets and some do not.
 Performance (P): Classification accuracy, the number of tweets
predicted correctly out of all tweets considered as a percentage.
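As a hedged sketch of how T, E and P might map to code for this tweet example (scikit-learn assumed; the tweets and labels below are made up):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Experience (E): a corpus of past tweets, some retweeted and some not (hypothetical)
tweets = ["big announcement coming soon", "just had lunch",
          "we are hiring, please share", "nice weather today",
          "free tickets giveaway, retweet to win", "reading a book"]
retweeted = [1, 0, 1, 0, 1, 0]              # 1 = got retweets, 0 = did not

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(tweets)

# Task (T): classify an unpublished tweet as "will get retweets" or not
model = LogisticRegression()
model.fit(X, retweeted)

# Performance (P): classification accuracy on tweets with known outcomes
predictions = model.predict(X)
print("Accuracy (P):", accuracy_score(retweeted, predictions))

new_tweet = vectorizer.transform(["retweet to win free tickets"])
print("Prediction for a new tweet:", model.predict(new_tweet))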
Assumptions
Create a list of assumptions about the problem and its phrasing. These
may be rules of thumb and domain-specific information that you think will
get you to a viable solution faster.
It can be useful to highlight questions that can be tested against real data
because breakthroughs and innovation occur when assumptions and best
practice are demonstrated to be wrong in the face of real data. It can also
be useful to highlight areas of the problem specification that may need to
be challenged, relaxed or tightened.
For example:
 The specific words used in the tweet matter to the model.
 The specific user that retweets does not matter to the model.
 The number of retweets may matter to the model.
 Older tweets are less predictive than more recent tweets.
Similar problems
What other problems have you seen or can you think of that are like the
problem you are trying to solve? Other problems can inform the problem
you are trying to solve by highlighting limitations in your phrasing of the
problem such as time dimensions and conceptual drift (where the concept
being modeled changes over time). Other problems can also point to
algorithms and data transformations that could be adopted to spot check
performance.
For example: A related problem would be email spam discrimination, which
uses text messages as input data and needs a binary classification decision.
Step 2: Why does the problem need to be solved?
The second step is to think deeply about why you want or need the
problem solved.
Motivation
Consider your motivation for solving the problem. What need will be
fulfilled when the problem is solved?
For example, you may be solving the problem as a learning exercise. This
is useful to clarify as you can decide that you don’t want to use the most
suitable method to solve the problem, but instead you want to explore
methods that you are not familiar with in order to learn new skills.
Alternatively, you may need to solve the problem as part of a duty at
work, ultimately to keep your job.
Solution Benefits
Consider the benefits of having the problem solved. What capabilities
does it enable?
It is important to be clear on the benefits of the problem being solved to
ensure that you capitalize on them. These benefits can be used to sell the
project to colleagues and management to get buy in and additional time
or budget resources.
If it benefits you personally, then be clear on what those benefits are and
how you will know when you have got them. For example, if it’s a tool or
utility, then what will you be able to do with that utility that you can’t do
now and why is that meaningful to you?
Solution Use
Consider how the solution to the problem will be used and what type of
lifetime you expect the solution to have. As programmers we often think
the work is done as soon as the program is written, but really the project
is just beginning it’s maintenance lifetime.
The way the solution will be used will influence the nature and
requirements of the solution you adopt.
Consider whether you are looking to write a report to present results or
you want to operationalize the solution. If you want to operationalize the
solution, consider the functional and nonfunctional requirements you have
for a solution, just like a software project.
Step 3: How would I solve the problem?
In this third and final step of the problem definition, explore how you
would solve the problem manually.
List out step-by-step what data you would collect, how you would prepare
it and how you would design a program to solve the problem. This may
include prototypes and experiments you would need to perform which are
a gold mine because they will highlight questions and uncertainties you
have about the domain that could be explored.
This is a powerful tool. It can highlight problems that actually can be
solved satisfactorily using a manually implemented solution. It also
flushes out important domain knowledge that has been trapped up until
now like where the data is actually stored, what types of features would
be useful and many other details.
Collect all of these details as they occur to you and update the previous
sections of the problem definition. Especially the assumptions and rules of
thumb.
We have considered a manually specified solution before when describing
complex problems in why machine learning matters.
Summary
 Step 1: What is the problem? Describe the problem informally
and formally and list assumptions and similar problems.
 Step 2: Why does the problem need to be solved? List your
motivation for solving the problem, the benefits a solution provides
and how the solution will be used.
 Step 3: How would I solve the problem? Describe how the
problem would be solved manually to flush domain knowledge.

Decision Tree Classification Algorithm


o Decision Tree is a Supervised learning technique that can be
used for both classification and Regression problems, but mostly it
is preferred for solving Classification problems. It is a tree-structured
classifier, where internal nodes represent the features of a
dataset, branches represent the decision rules and each leaf
node represents the outcome.
o In a Decision tree, there are two nodes, which are the Decision
Node and Leaf Node. Decision nodes are used to make any
decision and have multiple branches, whereas Leaf nodes are the
output of those decisions and do not contain any further branches.
o The decisions or the test are performed on the basis of features of
the given dataset.
o It is a graphical representation for getting all the possible
solutions to a problem/decision based on given conditions.
o It is called a decision tree because, similar to a tree, it starts with
the root node, which expands on further branches and constructs a
tree-like structure.
o A decision tree simply asks a question, and based on the answer
(Yes/No), it further splits the tree into subtrees.
o Below diagram explains the general structure of a decision tree:
A decision tree can contain categorical data (YES/NO) as well as numeric data.

Why use Decision Trees?


There are various algorithms in Machine learning, so choosing the best
algorithm for the given dataset and problem is the main point to
remember while creating a machine learning model. Below are the two
reasons for using the Decision tree:
o Decision Trees usually mimic human thinking ability while making a
decision, so it is easy to understand.
o The logic behind the decision tree can be easily understood because it
shows a tree-like structure.
Decision Tree Terminologies
Root Node: Root node is from where the decision tree starts. It represents the entire dataset,
which further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further
after getting a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes
according to the given conditions.
Branch/Sub Tree: A tree formed by splitting the tree.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: The root node of the tree is called the parent node, and other nodes are
called the child nodes.
How does the Decision Tree algorithm Work?
In a decision tree, for predicting the class of the given dataset, the
algorithm starts from the root node of the tree. This algorithm compares
the values of root attribute with the record (real dataset) attribute and,
based on the comparison, follows the branch and jumps to the next node.
For the next node, the algorithm again compares the attribute value with
the other sub-nodes and moves further. It continues the process until it
reaches a leaf node of the tree. The complete process can be better
understood using the below algorithm:
o Step-1: Begin the tree with the root node, says S, which contains the
complete dataset.
o Step-2: Find the best attribute in the dataset using Attribute Selection
Measure (ASM).
o Step-3: Divide S into subsets that contain the possible values of the
best attribute.
o Step-4: Generate the decision tree node which contains the best
attribute.
o Step-5: Recursively make new decision trees using the subsets of the
dataset created in Step-3. Continue this process until a stage is reached
where you cannot further classify the nodes; call the final node a
leaf node.
Example: Suppose there is a candidate who has a job offer and wants to
decide whether he should accept the offer or not. To solve this
problem, the decision tree starts with the root node (the Salary attribute,
chosen by ASM). The root node splits further into the next decision node
(distance from the office) and one leaf node based on the corresponding
labels. The next decision node further gets split into one decision node
(cab facility) and one leaf node. Finally, the decision node splits into two
leaf nodes (Accepted offer and Declined offer).
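A hedged sketch of this job-offer example as a decision tree in scikit-learn follows; the feature names and values are hypothetical, chosen only to mirror the attributes mentioned above:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.DataFrame({
    "salary_lakhs":   [12, 30, 28, 35, 15, 32],
    "distance_km":    [5, 25, 8, 30, 10, 12],
    "cab_facility":   [0, 1, 0, 0, 1, 1],      # 1 = cab provided, 0 = not
    "accepted_offer": [0, 1, 1, 0, 0, 1],      # label: 1 = accepted, 0 = declined
})

X = data[["salary_lakhs", "distance_km", "cab_facility"]]
y = data["accepted_offer"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(X, y)

# print the learned decision rules (root node, decision nodes, leaf nodes)
print(export_text(tree, feature_names=list(X.columns)))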
Attribute Selection Measures
While implementing a decision tree, the main issue that arises is how to
select the best attribute for the root node and for the sub-nodes. To solve
such problems there is a technique called the Attribute
Selection Measure, or ASM. Using this measure, we can easily select
the best attribute for the nodes of the tree. There are two popular
techniques for ASM, which are:
o Information Gain
o Gini Index
1. Information Gain:
o Information gain is the measurement of changes in entropy after the
segmentation of a dataset based on an attribute.
o It calculates how much information a feature provides us about a class.
o According to the value of information gain, we split the node and build the
decision tree.
o A decision tree algorithm always tries to maximize the value of
information gain, and a node/attribute having the highest information gain
is split first. It can be calculated using the below formula:

Information Gain = Entropy(S) - [(Weighted Avg) * Entropy(each feature)]
Entropy: Entropy is a metric to measure the impurity in a given attribute.
It specifies randomness in data. Entropy can be calculated as:
Entropy(S) = -P(yes) log2 P(yes) - P(no) log2 P(no)
Where,
o S = the set of all samples
o P(yes)= probability of yes
o P(no)= probability of no
2. Gini Index:
o Gini index is a measure of impurity or purity used while creating a decision
tree in the CART(Classification and Regression Tree) algorithm.
o An attribute with a low Gini index should be preferred over one with a
high Gini index.
o It only creates binary splits, and the CART algorithm uses the Gini index to
create binary splits.
o Gini index can be calculated using the below formula:
Gini Index = 1 - Σj (Pj)²
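A minimal Python sketch of these measures, written directly from the formulas above; the example labels (9 "yes" and 5 "no") and the three-way split are illustrative:

from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def gini_index(labels):
    total = len(labels)
    return 1 - sum((n / total) ** 2 for n in Counter(labels).values())

def information_gain(parent_labels, subsets):
    # subsets: the parent labels split into groups by one attribute's values
    total = len(parent_labels)
    weighted_avg = sum(len(s) / total * entropy(s) for s in subsets)
    return entropy(parent_labels) - weighted_avg

parent = ["yes"] * 9 + ["no"] * 5                 # 9 "yes" and 5 "no" samples
print(round(entropy(parent), 3))                  # 0.94
print(round(gini_index(parent), 3))               # 0.459
split = [["yes"] * 2 + ["no"] * 3, ["yes"] * 4, ["yes"] * 3 + ["no"] * 2]
print(round(information_gain(parent, split), 3))  # 0.247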
Advantages of the Decision Tree
o It is simple to understand, as it follows the same process which a human
follows while making any decision in real life.
o It can be very useful for solving decision-related problems.
o It helps to think about all the possible outcomes for a problem.
o There is less requirement of data cleaning compared to other algorithms.
Disadvantages of the Decision Tree
o The decision tree contains lots of layers, which makes it complex.
o It may have an overfitting issue, which can be resolved using the Random
Forest algorithm.
o For more class labels, the computational complexity of the decision tree
may increase.
What are the steps in ID3 algorithm?
The steps in ID3 algorithm are as follows:
1. Calculate entropy for dataset.
2. For each attribute/feature.
2.1. Calculate entropy for all its categorical values.
2.2. Calculate information gain for the feature.
3. Find the feature with maximum information gain.
4. Repeat it until we get the desired tree.
Use ID3 algorithm on a data
We'll discuss it here mathematically and later see its implementation in
Python.
So, let's take an example to make it clearer.
Here, the dataset has binary classes (yes and no), where 9 out of 14
samples are "yes" and 5 out of 14 are "no".
Complete entropy of dataset is:

H(S) = - p(yes) * log2(p(yes)) - p(no) * log2(p(no))


= - (9/14) * log2(9/14) - (5/14) * log2(5/14)
= - (-0.41) - (-0.53)
= 0.94

For each attribute of the dataset, let's follow the step-2 of pseudocode : -
First Attribute - Outlook

Categorical values - sunny, overcast and rain


H(Outlook=sunny)= -(2/5)*log(2/5)-(3/5)*log(3/5) =0.971
H(Outlook=rain) = -(3/5)*log(3/5)-(2/5)*log(2/5) =0.971
H(Outlook=overcast) = -(4/4)*log(4/4)-0 = 0

Average Entropy Information for Outlook -


I(Outlook) = p(sunny) * H(Outlook=sunny) + p(rain) * H(Outlook=rain) +
p(overcast) * H(Outlook=overcast)
= (5/14)*0.971 + (5/14)*0.971 + (4/14)*0
= 0.693

Information Gain = H(S) - I(Outlook)


= 0.94 - 0.693
= 0.247

Second Attribute - Temperature

Categorical values - hot, mild, cool


H(Temperature=hot)= -(2/4)*log(2/4)-(2/4)*log(2/4) = 1
H(Temperature=cool) = -(3/4)*log(3/4)-(1/4)*log(1/4) = 0.811
H(Temperature=mild) = -(4/6)*log(4/6)-(2/6)*log(2/6) = 0.9179
Average Entropy Information for Temperature -
I(Temperature) = p(hot)*H(Temperature=hot) + p(mild)*H(Temperature=mild) +
p(cool)*H(Temperature=cool)
= (4/14)*1 + (6/14)*0.9179 + (4/14)*0.811
= 0.9108

Information Gain = H(S) - I(Temperature)


= 0.94 - 0.9108
= 0.0292

Third Attribute - Humidity

Categorical values - high, normal


H(Humidity=high)= -(3/7)*log(3/7)-(4/7)*log(4/7) = 0.983
H(Humidity=normal) = -(6/7)*log(6/7)-(1/7)*log(1/7) = 0.591

Average Entropy Information for Humidity -


I(Humidity) = p(high)*H(Humidity=high) + p(normal)*H(Humidity=normal)
= (7/14)*0.983 + (7/14)*0.591
= 0.787

Information Gain = H(S) - I(Humidity)


= 0.94 - 0.787
= 0.153

Fourth Attribute - Wind

Categorical values - weak, strong


H(Wind=weak) = -(6/8)*log(6/8)-(2/8)*log(2/8) = 0.811
H(Wind=strong) = -(3/6)*log(3/6)-(3/6)*log(3/6) = 1

Average Entropy Information for Wind -


I(Wind) = p(weak)*H(Wind=weak) + p(strong)*H(Wind=strong)
= (8/14)*0.811 + (6/14)*1
= 0.892

Information Gain = H(S) - I(Wind)


= 0.94 - 0.892
= 0.048
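The four information gains above can be checked with a short Python snippet that uses only the (yes, no) counts given in the working; it assumes the same 14-example dataset:

from math import log2

def H(yes, no):
    total = yes + no
    result = 0.0
    for n in (yes, no):
        if n:
            p = n / total
            result -= p * log2(p)
    return result

def info_gain(value_counts, total_yes=9, total_no=5):
    total = total_yes + total_no
    avg = sum((y + n) / total * H(y, n) for y, n in value_counts)
    return H(total_yes, total_no) - avg

# (yes, no) counts per attribute value, taken from the working above
print(round(info_gain([(2, 3), (4, 0), (3, 2)]), 3))   # Outlook     -> 0.247
print(round(info_gain([(2, 2), (4, 2), (3, 1)]), 3))   # Temperature -> 0.029
print(round(info_gain([(3, 4), (6, 1)]), 3))           # Humidity    -> about 0.152
                                                       # (0.153 above uses rounded intermediates)
print(round(info_gain([(6, 2), (3, 3)]), 3))           # Wind        -> 0.048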
Here, the attribute with maximum information gain is Outlook. So, the
decision tree built so far -

Here, when Outlook == overcast, it is of pure class (Yes).

Now, we have to repeat the same procedure for the rows with Outlook
value Sunny, and then for the rows with Outlook value Rain.
Now, finding the best attribute for splitting the data with Outlook=Sunny
values {Dataset rows = [1, 2, 8, 9, 11]}.

Complete entropy of Sunny is -


H(S) = - p(yes) * log2(p(yes)) - p(no) * log2(p(no))
= - (2/5) * log2(2/5) - (3/5) * log2(3/5)
= 0.971

First Attribute - Temperature

Categorical values - hot, mild, cool


H(Sunny, Temperature=hot)= -0-(2/2)*log(2/2) = 0
H(Sunny, Temperature=cool) = -(1)*log(1)- 0 = 0
H(Sunny, Temperature=mild) = -(1/2)*log(1/2)-(1/2)*log(1/2) = 1
Average Entropy Information for Temperature -
I(Sunny, Temperature) = p(Sunny, hot)*H(Sunny, Temperature=hot) + p(Sunny,
mild)*H(Sunny, Temperature=mild) + p(Sunny, cool)*H(Sunny,
Temperature=cool)
= (2/5)*0 + (1/5)*0 + (2/5)*1
= 0.4

Information Gain = H(Sunny) - I(Sunny, Temperature)


= 0.971 - 0.4
= 0.571
Second Attribute - Humidity

Categorical values - high, normal


H(Sunny, Humidity=high)= - 0 - (3/3)*log(3/3) = 0
H(Sunny, Humidity=normal) = -(2/2)*log(2/2)-0 = 0

Average Entropy Information for Humidity -


I(Sunny, Humidity) = p(Sunny, high)*H(Sunny, Humidity=high) + p(Sunny,
normal)*H(Sunny, Humidity=normal)
= (3/5)*0 + (2/5)*0
= 0

Information Gain = H(Sunny) - I(Sunny, Humidity)


= 0.971 - 0
= 0.971

Third Attribute - Wind

Categorical values - weak, strong


H(Sunny, Wind=weak) = -(1/3)*log(1/3)-(2/3)*log(2/3) = 0.918
H(Sunny, Wind=strong) = -(1/2)*log(1/2)-(1/2)*log(1/2) = 1

Average Entropy Information for Wind -


I(Sunny, Wind) = p(Sunny, weak)*H(Sunny, Wind=weak) + p(Sunny,
strong)*H(Sunny, Wind=strong)
= (3/5)*0.918 + (2/5)*1
= 0.9508

Information Gain = H(Sunny) - I(Sunny, Wind)


= 0.971 - 0.9508
= 0.0202
Here, the attribute with maximum information gain is Humidity. So, the
decision tree built so far -

Here, when Outlook = Sunny and Humidity = High, it is a pure class of


category "no". And When Outlook = Sunny and Humidity = Normal, it is
again a pure class of category "yes". Therefore, we don't need to do further
calculations.
Now, finding the best attribute for splitting the data with Outlook=Rain
values {Dataset rows = [4, 5, 6, 10, 14]}.

Complete entropy of Rain is -


H(S) = - p(yes) * log2(p(yes)) - p(no) * log2(p(no))
= - (3/5) * log(3/5) - (2/5) * log(2/5)
= 0.971

First Attribute - Temperature

Categorical values - mild, cool


H(Rain, Temperature=cool)= -(1/2)*log(1/2)- (1/2)*log(1/2) = 1
H(Rain, Temperature=mild) = -(2/3)*log(2/3)-(1/3)*log(1/3) = 0.918
Average Entropy Information for Temperature -
I(Rain, Temperature) = p(Rain, mild)*H(Rain, Temperature=mild) + p(Rain,
cool)*H(Rain, Temperature=cool)
= (2/5)*1 + (3/5)*0.918
= 0.9508

Information Gain = H(Rain) - I(Rain, Temperature)


= 0.971 - 0.9508
= 0.0202
Second Attribute - Wind

Categorical values - weak, strong


H(Wind=weak) = -(3/3)*log(3/3)-0 = 0
H(Wind=strong) = 0-(2/2)*log(2/2) = 0

Average Entropy Information for Wind -


I(Wind) = p(Rain, weak)*H(Rain, Wind=weak) + p(Rain, strong)*H(Rain,
Wind=strong)
= (3/5)*0 + (2/5)*0
= 0

Information Gain = H(Rain) - I(Rain, Wind)


= 0.971 - 0
= 0.971

Here, the attribute with maximum information gain is Wind. So, the decision
tree built so far -

Here, when Outlook = Rain and Wind = Strong, it is a pure class of


category "no". And When Outlook = Rain and Wind = Weak, it is again a
pure class of category "yes".
And this is our final desired tree for the given dataset.
What are the characteristics of ID3 algorithm?
Finally, I am concluding with Characteristics of ID3.
Characteristics of ID3 Algorithm are as follows:
1. ID3 uses a greedy approach; that's why it does not guarantee
an optimal solution and can get stuck in local optima.
2. ID3 can overfit to the training data (to avoid overfitting,
smaller decision trees should be preferred over larger ones).
3. This algorithm usually produces small trees, but it does not
always produce the smallest possible tree.
4. ID3 is harder to use on continuous data (if the values of a
given attribute are continuous, then there are many more
places to split the data on this attribute, and searching for
the best value to split on can be time consuming).

Accuracy, Precision, Recall & F1 Score: Interpretation of Performance


Measures
How to evaluate the performance of a ML and understanding
“Confusion Metrics

True positives and true negatives are the observations that are correctly
predicted. We want to minimize false positives and false negatives. These
terms are a bit confusing, so let's take each term one by one and
understand it fully.
True Positives (TP) - These are the correctly predicted positive
values which means that the value of actual class is yes and the value
of predicted class is also yes. E.g. if actual class value indicates that
this passenger survived and predicted class tells you the same thing.
True Negatives (TN) - These are the correctly predicted negative
values which means that the value of actual class is no and value of
predicted class is also no. E.g. if actual class says this passenger did
not survive and predicted class tells you the same thing.
False positives and false negatives, these values occur when
your actual class contradicts with the predicted class.
False Positives (FP) – When actual class is no and predicted class is
yes. E.g. if actual class says this passenger did not survive but
predicted class tells you that this passenger will survive.
False Negatives (FN) – When actual class is yes but predicted class
is no. E.g. if actual class value indicates that this passenger survived
and predicted class tells you that the passenger will die.
Once you understand these four parameters then we can
calculate Accuracy, Precision, Recall and F1 score.
Accuracy - Accuracy is the most intuitive performance measure,
and it is simply the ratio of correctly predicted observations to the
total observations. One may think that if we have high
accuracy then our model is the best. Yes, accuracy is a great
measure, but only when you have symmetric datasets where the
counts of false positives and false negatives are almost the same.
Therefore, you have to look at other parameters to evaluate
the performance of your model. For our model, we have got
0.803, which means our model is approx. 80% accurate.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision - Precision is the ratio of correctly predicted positive
observations to the total predicted positive observations. The question
that this metric answers is: of all passengers labeled as survived,
how many actually survived? High precision relates to a low false
positive rate. We have got 0.788 precision, which is pretty good.
Precision = TP / (TP + FP)
Recall (Sensitivity) - Recall is the ratio of correctly predicted positive
observations to all observations in the actual class - yes. The question
recall answers is: of all the passengers that truly survived, how many did
we label as survived? We have got a recall of 0.631, which is good for this
model as it's above 0.5.
Recall = TP / (TP + FN)
F1 score - F1 Score is the weighted average of Precision and Recall.
Therefore, this score takes both false positives and false negatives into
account. Intuitively it is not as easy to understand as accuracy, but F1
is usually more useful than accuracy, especially if you have an uneven
class distribution. Accuracy works best if false positives and false
negatives have similar cost. If the costs of false positives and false
negatives are very different, it's better to look at both Precision and
Recall. In our case, the F1 score is 0.701.
F1 Score = 2*(Recall * Precision) / (Recall + Precision)
So, whenever you build a model, this article should help you to figure
out what these parameters mean and how good your model has
performed.
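A minimal sketch of the four measures computed from a confusion matrix; the TP/FP/FN/TN counts below are hypothetical and are not the counts behind the 0.803/0.788/0.631/0.701 figures quoted above:

# Compute the metrics directly from the four confusion-matrix counts.
TP, FP, FN, TN = 80, 20, 30, 70

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1_score = 2 * (precision * recall) / (precision + recall)

print(f"Accuracy:  {accuracy:.3f}")    # 0.750
print(f"Precision: {precision:.3f}")   # 0.800
print(f"Recall:    {recall:.3f}")      # 0.727
print(f"F1 score:  {f1_score:.3f}")    # about 0.762

# The same values can be obtained from predictions with scikit-learn:
# from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score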

Data Generating Distributions


In statistics and in empirical sciences, a data generating
process is a process in the real world that "generates" the data
one is interested in. Usually, scholars do not know the real data
generating model. However, it is assumed that those real models
have observable consequences. Those consequences are
the distributions of the data in the population. Those distributions or
models can be represented via mathematical functions. There are
many data distribution functions, for example the normal
distribution, Bernoulli distribution, Poisson distribution, etc.
Datasets are composed of two main types of data: numerical (e.g.
integers, floats) and categorical (e.g. names, laptop brands).
Numerical data can additionally be divided into two other
categories: discrete and continuous. Discrete data can take only certain
values (e.g. the number of students in a school), while continuous data can
take any real or fractional value (e.g. heights and weights).
From discrete random variables it is possible to calculate Probability
Mass Functions, while from continuous random variables
Probability Density Functions can be derived.
A Probability Mass Function gives the probability that a variable is
equal to a certain value; in contrast, the values of a Probability Density
Function are not themselves probabilities, because they first need to be
integrated over a given range.
There exist many different probability distributions in nature (Figure 1);
this section introduces the ones most commonly used in Data
Science.

Figure 1: Probability Distributions Flowchart [1]
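A brief sketch of a discrete PMF versus a continuous PDF, assuming SciPy is installed; the parameter values are illustrative only:

from scipy import stats

# Discrete: Bernoulli distribution with success probability p = 0.3
bern = stats.bernoulli(0.3)
print(bern.pmf(1), bern.pmf(0))         # P(X = 1) = 0.3, P(X = 0) = 0.7

# Discrete: Poisson distribution with rate lambda = 4
pois = stats.poisson(4)
print(pois.pmf(2))                      # probability of exactly 2 events

# Continuous: normal distribution with mean 0 and standard deviation 1
norm = stats.norm(loc=0, scale=1)
print(norm.pdf(0))                      # a density value, not a probability
print(norm.cdf(1) - norm.cdf(-1))       # probability mass inside [-1, 1], about 0.68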


Inductive bias
The inductive bias (also known as learning bias) of a learning
algorithm is the set of assumptions that the learner uses to predict
outputs of given inputs that it has not encountered.
In machine learning, one aims to construct algorithms that are able
to learn to predict a certain target output. To achieve this, the learning
algorithm is presented some training examples that demonstrate the
intended relation of input and output values. Then the learner is supposed
to approximate the correct output, even for examples that have not been
shown during training. Without any additional assumptions, this problem
cannot be solved since unseen situations might have an arbitrary output
value. The kind of necessary assumptions about the nature of the target
function are subsumed in the phrase inductive bias.
Types
The following is a list of common inductive biases in machine learning algorithms.
 Maximum conditional independence: if the hypothesis can be cast in
a Bayesian framework, try to maximize conditional independence. This is
the bias used in the Naive Bayes classifier.
 Minimum cross-validation error: when trying to choose among
hypotheses, select the hypothesis with the lowest cross-validation error.
Although cross-validation may seem to be free of bias, the "no free
lunch" theorems show that cross-validation must be biased.
 Maximum margin: when drawing a boundary between two classes,
attempt to maximize the width of the boundary. This is the bias used
in support vector machines. The assumption is that distinct classes tend to
be separated by wide boundaries.
 Minimum description length: when forming a hypothesis, attempt to
minimize the length of the description of the hypothesis.
 Minimum features: unless there is good evidence that a feature is
useful, it should be deleted. This is the assumption behind feature
selection algorithms.
 Nearest neighbors: assume that most of the cases in a small
neighborhood in feature space belong to the same class. Given a case for
which the class is unknown, guess that it belongs to the same class as the
majority in its immediate neighborhood. This is the bias used in the k-
nearest neighbors algorithm. The assumption is that cases that are near
each other tend to belong to the same class.

NOT EVERYTHING IS LEARNABLE


We are always amazed at how machine learning has made such an impact
on our lives. There is no doubt that ML will completely change the face of
various industries, as well as job profiles. While it offers a promising
future, there are some inherent problems at the heart of ML and AI
advancements that put these technologies at a disadvantage. While ML can
solve a plethora of challenges, there are a few tasks which it fails to
answer. Five such problems are listed here.
1. Reasoning Power
One area where ML has not mastered successfully is reasoning power, a
distinctly human trait. Algorithms available today are mainly oriented
towards specific use-cases and are narrowed down when it comes to
applicability. They cannot reason about why a particular result comes out
the way it does, or 'introspect' their own outcomes.
For instance, if an image recognition algorithm identifies apples and
oranges in a given scenario, it cannot say whether the apple (or orange)
has gone bad or not, or why that fruit is an apple or an orange.
Mathematically, we can explain this learning process, but from an
algorithmic perspective, the innate property cannot be articulated by the
algorithms, or even by us.
In other words, ML algorithms lack the ability to reason beyond their
intended application.
2. Contextual Limitation
If we consider the area of natural language processing (NLP), text and
speech information are the means to understand languages by NLP
algorithms. They may learn letters, words, sentences or even the syntax,
but where they fall short is the context of the language. Algorithms do not
understand the context of the language used.
So, ML does not have an overall idea of the situation. It is limited to
mnemonic interpretations rather than thinking about what is actually
going on.
3. Scalability
Although we see ML implementations being deployed on a significant
basis, it all depends on data as well as its scalability. Data is growing at an
enormous rate and has many forms, which largely affects the scalability of
an ML project. Algorithms cannot do much about this unless they are
updated constantly to handle new changes in the data. This is where ML
regularly requires human intervention in terms of scalability, and this
remains mostly unsolved.
In addition, growing data has to be dealt with in the right way if shared on
an ML platform, which again needs examination through knowledge and
intuition apparently lacking in current ML.
4. Regulatory Restriction For Data In ML
ML usually needs considerable (in fact, massive) amounts of data in stages
such as training, cross-validation, etc. Sometimes, data includes private as
well as general information. This is where it gets complicated. Most tech
companies have privatised data, and these data are the ones which are
actually useful for ML applications. But there is also the risk of wrong
usage of data, especially in critical areas such as medical research, health
insurance, etc.
Even though data are anonymised at times, they can still be
vulnerable. This is the reason regulatory rules are imposed heavily
when it comes to using private data.
5. Internal Working Of Deep Learning
This sub-field of ML is actually responsible for today’s AI growth. What was
once just a theory has appeared to be the most powerful aspect of ML.
Deep Learning (DL) now powers applications such as voice recognition,
image recognition and so on through artificial neural networks.
But the internal working of DL is still unknown and yet to be solved.
Advanced DL algorithms still baffle researchers in terms of their working
and efficiency. Millions of neurons that form the neural networks in DL
increase abstraction at every level, which cannot be fully comprehended.
This is why deep learning is dubbed a 'black box': its internal workings
are unknown.
Conclusion
All of these problems are very challenging for computer scientists and
researchers to solve. The reason for this is uncertainty. If researchers aim
at more groundwork related to ML rather than improving this field, we
might have an answer to the unsolved problems listed here. After all, ML
should be realised apart from being utilitarian.

Overfitting and Underfitting in Machine Learning


Overfitting and Underfitting are the two main problems that occur in
machine learning and degrade the performance of the machine learning
models.
The main goal of each machine learning model is to generalize well. Here, generalization defines the ability of an ML model to provide a suitable output when adapting to new, unseen input. It means that after being trained on the dataset, the model can produce reliable and accurate output. Hence, underfitting and overfitting are the two conditions that need to be checked to judge whether the model is generalizing well or not.
Before understanding overfitting and underfitting, let's understand some basic terms that will help to understand this topic well:
o Signal: It refers to the true underlying pattern of the data that
helps the machine learning model to learn from the data.
o Noise: Noise is unnecessary and irrelevant data that reduces the
performance of the model.
o Bias: Bias is a prediction error that is introduced in the model due
to oversimplifying the machine learning algorithms. Or it is the
difference between the predicted values and the actual values.
o Variance: If the machine learning model performs well with the
training dataset, but does not perform well with the test dataset,
then variance occurs.
Overfitting
Overfitting occurs when our machine learning model tries to cover all the data points, or more data points than required, present in the given dataset. Because of this, the model starts capturing the noise and inaccurate values present in the dataset, and all these factors reduce the efficiency and accuracy of the model. The overfitted model has low bias and high variance.
The chance of overfitting increases the more training we provide to our model: the longer we train, the more likely we are to end up with an overfitted model.
Overfitting is the main problem that occurs in supervised learning.
Example: The concept of the overfitting can be understood by the below
graph of the linear regression output:

As we can see from the above graph, the model tries to cover all the data points present in the scatter plot. It may look efficient, but in reality it is not. Because the goal of the regression model is to find the best-fit line, and here we have not found one, the model will generate prediction errors.

How to avoid the Overfitting in Model


Both overfitting and underfitting cause the degraded performance of the
machine learning model. But the main cause is overfitting, so there are
some ways by which we can reduce the occurrence of overfitting in our
model.
o Training with more data
o Removing features
o Early stopping the training
o Regularization
Underfitting
Underfitting occurs when our machine learning model is not able to capture the underlying trend of the data. To avoid overfitting in the model, the feeding of training data can be stopped at an early stage, due to which the model may not learn enough from the training data. As a result, it may fail to find the best fit of the dominant trend in the data.
In the case of underfitting, the model is not able to learn enough from the
training data, and hence it reduces the accuracy and produces unreliable
predictions.
An underfitted model has high bias and low variance.
Example: We can understand the underfitting using below output of the
linear regression model:

As we can see from the above diagram, the model is unable to capture
the data points present in the plot.
How to avoid underfitting:
o By increasing the training time of the model.
o By increasing the number of features.
Goodness of Fit
The "Goodness of fit" term is taken from the statistics, and the goal of the
machine learning models to achieve the goodness of fit. In statistics
modeling, it defines how closely the result or predicted values match the
true values of the dataset.
The model with a good fit is between the underfitted and overfitted model,
and ideally, it makes predictions with 0 errors, but in practice, it is difficult
to achieve it.
As when we train our model for a time, the errors in the training data go
down, and the same happens with test data. But if we train the model for
a long duration, then the performance of the model may decrease due to
the overfitting, as the model also learn the noise present in the dataset.
The errors in the test dataset start increasing, so the point, just before the
raising of errors, is the good point, and we can stop here for achieving a
good model.
There are two other methods by which we can get a good point for our model,
which are the resampling method to estimate model accuracy and validation
dataset.
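To make this concrete, here is a rough sketch (assuming NumPy and scikit-learn are available; the sine-shaped toy data and the polynomial degrees 1, 4 and 15 are illustrative choices, not part of these notes). A degree-1 fit underfits, a moderate degree fits well, and a very high degree overfits, which shows up as low training error but high test error:

# Sketch: underfitting vs. overfitting via polynomial degree on noisy toy data
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 60)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=60)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for degree in (1, 4, 15):                      # underfit, good fit, overfit (roughly)
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")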
What is a Model Parameter?
A model parameter is a configuration variable that is internal to the model and whose
value can be estimated from data.
 They are required by the model when making predictions.
 Their values define the skill of the model on your problem.
 They are estimated or learned from data.
 They are often not set manually by the practitioner.
 They are often saved as part of the learned model.
Parameters are key to machine learning algorithms. They are the part of the model that
is learned from historical training data.
In classical machine learning literature, we may think of the model as the hypothesis and
the parameters as the tailoring of the hypothesis to a specific set of data.
Often model parameters are estimated using an optimization algorithm, which is a type
of efficient search through possible parameter values.
 Statistics: In statistics, you may assume a distribution for a variable, such as a
Gaussian distribution. Two parameters of the Gaussian distribution are the mean
(mu) and the standard deviation (sigma). This holds in machine learning, where
these parameters may be estimated from data and used as part of a predictive
model.
 Programming: In programming, you may pass a parameter to a function. In this
case, a parameter is a function argument that could have one of a range of
values. In machine learning, the specific model you are using is the function and
requires parameters in order to make a prediction on new data.
Whether a model has a fixed or variable number of parameters determines whether it
may be referred to as “parametric” or “nonparametric“.
Some examples of model parameters include:
 The weights in an artificial neural network.
 The support vectors in a support vector machine.
 The coefficients in a linear regression or logistic regression.
What is a Model Hyperparameter?
A model hyperparameter is a configuration that is external to the model and whose value
cannot be estimated from data.
 They are often used in processes to help estimate model parameters.
 They are often specified by the practitioner.
 They can often be set using heuristics.
 They are often tuned for a given predictive modeling problem.
We cannot know the best value for a model hyperparameter on a given problem. We
may use rules of thumb, copy values used on other problems, or search for the best
value by trial and error.
When a machine learning algorithm is tuned for a specific problem, such as when you are using a grid search or a random search, you are tuning the hyperparameters of the model in order to discover the parameters of the model that result in the most skillful predictions.
Many models have important parameters which cannot be directly estimated from the
data. For example, in the K-nearest neighbor classification model … This type of model
parameter is referred to as a tuning parameter because there is no analytical formula
available to calculate an appropriate value.
Model hyperparameters are often referred to as model parameters which can make
things confusing. A good rule of thumb to overcome this confusion is as follows:
If you have to specify a model parameter manually then
it is probably a model hyperparameter.
Some examples of model hyperparameters include:
 The learning rate for training a neural network.
 The C and sigma hyperparameters for support vector machines.
 The k in k-nearest neighbors.
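As a small illustration of the difference (a sketch assuming scikit-learn; the synthetic dataset and the value C=0.5 are arbitrary choices): the hyperparameter is set by hand before training, while the parameters are estimated from the data by fit().

# Sketch: hyperparameters are chosen by the practitioner, parameters are learned
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=42)

model = LogisticRegression(C=0.5, max_iter=1000)   # C, max_iter: hyperparameters
model.fit(X, y)                                    # coefficients estimated from data

print("learned coefficients (parameters):", model.coef_)
print("learned intercept (parameter):", model.intercept_)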

Feature Vector
Feature: an individual measurable property of the data, e.g. age, name, height, weight; in a relational table, every column is a feature.
Feature Vector
A feature vector is a vector that stores the features for a particular
observation in a specific order.

For example, Alice is 26 years old and she is 5' 6" tall. Her feature vector could be [26, 5.5] or [5.5, 26] depending on your choice of how to order the elements. The order is only important insofar as it is consistent.
A feature vector is the representation of a particular row in a relational table: each row is a feature vector, and row 'n' is the feature vector for the 'n'th sample.
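In code, a set of feature vectors is usually stored as one row per observation (a minimal sketch with NumPy; the [age, height] ordering and the second person's values are made up for illustration):

# Sketch: each row of the array is one observation's feature vector
import numpy as np

X = np.array([
    [26, 5.5],   # Alice: [age in years, height in feet]
    [32, 6.0],   # another (made-up) person, same feature order
])
print(X[0])      # feature vector of the first sample
print(X.shape)   # (n_samples, n_features)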
Feature Set: the set of features that helps to predict the output variable.
Example: to predict the age of a particular person, we need to know the year of birth. Here the feature set = {Year of Birth}.

Normally good feature set can be identified using expert domain knowledge
or mathematical approach.

K-Nearest Neighbor(KNN) Algorithm for Machine Learning


o K-Nearest Neighbour is one of the simplest Machine Learning
algorithms based on Supervised Learning technique.
o K-NN algorithm assumes the similarity between the new case/data and the available cases and puts the new case into the category that is most similar to the available categories.
o K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can be easily classified into a well-suited category by using the K-NN algorithm.
o K-NN algorithm can be used for Regression as well as for Classification, but mostly it is used for Classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any assumption about the underlying data.
o It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead it stores the dataset and, at the time of classification, it performs an action on the dataset.
o KNN algorithm at the training phase just stores the dataset, and when it gets new data, it classifies that data into the category that is most similar to the new data.
o Example: Suppose we have an image of a creature that looks similar to a cat and a dog, and we want to know whether it is a cat or a dog. For this identification, we can use the KNN algorithm, as it works on a similarity measure. Our KNN model will find the features of the new image that are similar to the cat and dog images, and based on the most similar features it will put the image in either the cat or the dog category.
Why do we need a K-NN Algorithm?

Suppose there are two categories, i.e., Category A and Category B, and we have a new data point x1. In which of these categories does this data point lie? To solve this type of problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a particular data point. Consider the below diagram:

How does K-NN work?

The K-NN working can be explained on the basis of the below algorithm:

o Step-1: Select the number K of the neighbors


o Step-2: Calculate the Euclidean distance of K number of
neighbors
o Step-3: Take the K nearest neighbors as per the calculated
Euclidean distance.
o Step-4: Among these k neighbors, count the number of the data
points in each category.
o Step-5: Assign the new data points to that category for which the
number of the neighbor is maximum.
o Step-6: Our model is ready.

Suppose we have a new data point and we need to put it in the required
category. Consider the below image:

o Firstly, we will choose the number of neighbors, so we will choose


the k=5.
o Next, we will calculate the Euclidean distance between the data
points. The Euclidean distance is the distance between two points,
which we have already studied in geometry. It can be calculated as:
o By calculating the Euclidean distance we got the nearest neighbors,
as three nearest neighbors in category A and two nearest neighbors
in category B. Consider the below image:

o As we can see the 3 nearest neighbors are from category A, hence


this new data point must belong to category A.
How to select the value of K in the K-NN Algorithm?

Below are some points to remember while selecting the value of K in the
K-NN algorithm:

o There is no particular way to determine the best value for "K", so we need to try some values to find the best out of them. The most preferred value for K is 5.
o A very low value for K, such as K=1 or K=2, can be noisy and lead to the effects of outliers in the model.
o Large values for K are good, but the model may face some difficulties.

Advantages of KNN Algorithm:

o It is simple to implement.
o It is robust to the noisy training data
o It can be more effective if the training data is large.

Disadvantages of KNN Algorithm:

o It always needs a value of K to be determined, which may be complex at times.
o The computation cost is high because of calculating the distance between the data points for all the training samples.
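A minimal sketch of K-NN in code (assuming scikit-learn; the iris dataset, the Euclidean metric and k=5 are illustrative choices, not part of these notes):

# Sketch: K-NN classification with scikit-learn
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)            # "training" just stores the data (lazy learner)
print("test accuracy:", knn.score(X_test, y_test))
print("prediction for one new point:", knn.predict(X_test[:1]))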

What is K-Means Algorithm?

K-Means Clustering is an Unsupervised Learning algorithm, which groups


the unlabeled dataset into different clusters. Here K defines the number of
pre-defined clusters that need to be created in the process, as if K=2,
there will be two clusters, and for K=3, there will be three clusters, and so
on.

It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each data point belongs to only one group, whose members have similar properties.

It allows us to cluster the data into different groups and provides a convenient way to discover the categories of groups in the unlabeled dataset on its own, without the need for any labelled training data.

It is a centroid-based algorithm, where each cluster is associated with a


centroid. The main aim of this algorithm is to minimize the sum of
distances between the data point and their corresponding clusters.
The algorithm takes the unlabeled dataset as input, divides the dataset into a k number of clusters, and repeats the process until the best clusters are found. The value of k should be predetermined in this algorithm.

The k-means clustering algorithm mainly performs two tasks:

o Determines the best value for K center points or centroids by an


iterative process.
o Assigns each data point to its closest k-center. Those data points
which are near to the particular k-center, create a cluster.

Hence each cluster has datapoints with some commonalities, and it is


away from other clusters.

The below diagram explains the working of the K-means Clustering


Algorithm:

How does the K-Means Algorithm Work?

The working of the K-Means algorithm is explained in the below steps:

Step-1: Select the number K to decide the number of clusters.

Step-2: Select K random points or centroids. (They can be points other than those from the input dataset.)

Step-3: Assign each data point to their closest centroid, which will form
the predefined K clusters.

Step-4: Calculate the variance and place a new centroid of each cluster.
Step-5: Repeat the third step, which means reassign each data point to the new closest centroid of each cluster.

Step-6: If any reassignment occurs, then go to step-4 else go to FINISH.

Step-7: The model is ready.

Let's understand the above steps by considering the visual plots:

Suppose we have two variables M1 and M2. The x-y axis scatter plot of
these two variables is given below:

o Let's take number k of clusters, i.e., K=2, to identify the dataset and
to put them into different clusters. It means here we will try to group
these datasets into two different clusters.
o We need to choose some random k points or centroid to form the
cluster. These points can be either the points from the dataset or
any other point. So, here we are selecting the below two points as k
points, which are not the part of our dataset. Consider the below
image:

o Now we will assign each data point of the scatter plot to its closest
K-point or centroid. We will compute it by applying some
mathematics that we have studied to calculate the distance
between two points. So, we will draw a median between both the
centroids. Consider the below image:
From the above image, it is clear that the points on the left side of the line are nearer to the K1 or blue centroid, and the points to the right of the line are closer to the yellow centroid. Let's color them blue and yellow for clear visualization.

o As we need to find the closest cluster, so we will repeat the process


by choosing a new centroid. To choose the new centroids, we will
compute the center of gravity of these centroids, and will find new
centroids as below:

o Next, we will reassign each datapoint to the new centroid. For this,
we will repeat the same process of finding a median line. The
median will be like below image:
From the above image, we can see that one yellow point is on the left side of the line, and two blue points are to the right of the line. So, these three points will be assigned to new centroids.

As reassignment has taken place, so we will again go to the step-4, which


is finding new centroids or K-points.
o We will repeat the process by finding the center of gravity of
centroids, so the new centroids will be as shown in the below image:

o As we got the new centroids so again will draw the median line and
reassign the data points. So, the image will be:

o We can see in the above image; there are no dissimilar data points
on either side of the line, which means our model is formed.
Consider the below image:

As our model is ready, so we can now remove the assumed centroids, and
the two final clusters will be as shown in the below image:
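The same procedure in code looks roughly as follows (a sketch assuming scikit-learn; the synthetic two-variable blobs stand in for M1 and M2, and K=2 matches the walkthrough above):

# Sketch: K-Means clustering with scikit-learn
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=100, centers=2, n_features=2, random_state=42)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)       # assigns each point to its closest centroid

print("final centroids:\n", kmeans.cluster_centers_)
print("first 10 cluster assignments:", labels[:10])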
What is Dimensionality Reduction?

The number of input features, variables, or columns present in a given


dataset is known as dimensionality, and the process to reduce these
features is called dimensionality reduction.

A dataset contains a huge number of input features in various cases,


which makes the predictive modeling task more complicated. It is very difficult to visualize or make predictions for a training dataset with a high number of features, and for such cases dimensionality reduction techniques are required.

Dimensionality reduction technique can be defined as, "It is a way of


converting the higher dimensions dataset into lesser dimensions
dataset ensuring that it provides similar information." These
techniques are widely used in machine learning

for obtaining a better fit predictive model while solving the classification
and regression problems.

It is commonly used in the fields that deal with high-dimensional data,


such as speech recognition, signal processing, bioinformatics, etc.
It can also be used for data visualization, noise reduction, cluster
analysis, etc

The Curse of Dimensionality

Handling the high-dimensional data is very difficult in practice, commonly


known as the curse of dimensionality. If the dimensionality of the input
dataset increases, any machine learning algorithm and model becomes
more complex. As the number of features increases, the number of samples needed to cover the feature space also grows, and the chance of overfitting increases. If the machine learning model is trained on high-dimensional data, it becomes overfitted and results in poor performance.

Hence, it is often required to reduce the number of features, which can be


done with dimensionality reduction.

Benefits of applying Dimensionality Reduction

Some benefits of applying dimensionality reduction technique to the given


dataset are given below:

o By reducing the dimensions of the features, the space required to


store the dataset also gets reduced.
o Less Computation training time is required for reduced dimensions
of features.
o Reduced dimensions of features of the dataset help in visualizing
the data quickly.
o It removes the redundant features (if present) by taking care of
multicollinearity.

Disadvantages of dimensionality Reduction

There are also some disadvantages of applying the dimensionality


reduction, which are given below:

o Some data may be lost due to dimensionality reduction.


o In the PCA dimensionality reduction technique, the number of principal components that should be retained is sometimes unknown.
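A rough sketch of dimensionality reduction with PCA (assuming scikit-learn; the digits dataset and n_components=2 are illustrative choices):

# Sketch: reducing 64 input features to 2 principal components
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)       # 64 features per sample
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)          # same rows, fewer columns

print("original shape:", X.shape)
print("reduced shape:", X_reduced.shape)
print("variance explained:", pca.explained_variance_ratio_)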

DECISION BOUNDARY

While training a classifier on a dataset, using a specific classification


algorithm, it is required to define a set of hyper-planes, called the Decision Boundary, that separates the data points into specific classes, where the algorithm switches from one class to another. On one side of a decision boundary, a data point is more likely to be labelled as class A; on the other side of the boundary, it is more likely to be labelled as class B.

Let’s take an example of a Logistic Regression.

The goal of logistic regression is to figure out some way to split the data points so as to have an accurate prediction of a given observation's class using the information present in the features.
Let's suppose we define a line that describes the decision boundary. Then all of the points on one side of the boundary belong to class A, and all of the points on the other side of the boundary belong to class B.

S(z)=1/(1+e^-z)

 S(z) = Output between 0 and 1 (probability estimate)


 z = Input to the function (z= mx + b)
 e = Base of natural log
Our current prediction function returns a probability score between 0 and
1. In order to map this to a discrete class (A/B), we select a threshold value
or tipping point above which we will classify values into class A and below
which we classify values into class B.

p >= 0.5, class = A
p < 0.5, class = B

If our threshold was 0.5 and our prediction function returned 0.7, we would classify this observation as belonging to class A. If our prediction was 0.2, we would classify the observation as belonging to class B.

So, the line where p = 0.5 is called the decision boundary.

In order to map predicted values to probabilities, we use the Sigmoid


function.
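A small sketch of this thresholding (assuming NumPy; the weights m, b and the sample inputs are made-up values, not a fitted model):

# Sketch: sigmoid score plus a 0.5 threshold as the decision boundary
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

m, b = 2.0, -1.0                        # hypothetical model weights
x = np.array([0.2, 0.5, 0.9])           # hypothetical feature values

p = sigmoid(m * x + b)                  # probability estimates in (0, 1)
labels = np.where(p >= 0.5, "A", "B")   # p = 0.5 is the decision boundary
print(list(zip(p.round(3), labels)))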
IMPORTANCE OF DECISION BOUNDARY

A decision boundary is a surface that separates data points belonging to different class labels. Decision boundaries are not confined to just the data points that we have provided; they span the entire feature space we trained on. The model can predict a value for any possible combination of inputs in our feature space. If the data we train on is not 'diverse', the overall topology of the model will generalize poorly to new instances. So, it is important to analyse which models are best suited for a 'diverse' dataset before putting a model into production.

Examining decision boundaries is a great way to learn how the training


data we select affects performance and the ability for our model to
generalize. Visualization of decision boundaries can illustrate how sensitive
models are to each dataset, which is a great way to understand how
specific algorithms work, and their limitations for specific datasets.
UNIT – II

Bio-inspired Learning and Perception:


Consider the following neuron.
Similar to the above model, an artificial neuron called a 'perceptron' is designed:
Characteristics of Perceptron

The perceptron model has the following characteristics.

1. Perceptron is a machine learning algorithm for supervised learning


of binary classifiers.
2. In Perceptron, the weight coefficient is automatically learned.
3. Initially, weights are multiplied with input features, and the decision
is made whether the neuron is fired or not.
4. The activation function applies a step rule to check whether the
weight function is greater than zero.
5. The linear decision boundary is drawn, enabling the distinction
between the two linearly separable classes +1 and -1.
6. If the weighted sum of all input values is greater than the threshold value, the neuron produces an output signal; otherwise, no output is produced.

Limitations of Perceptron Model

A perceptron model has limitations as follows:

o The output of a perceptron can only be a binary number (0 or 1) due


to the hard limit transfer function.
o Perceptron can only be used to classify the linearly separable sets of
input vectors. If input vectors are non-linear, it is not easy to classify
them properly.

Geometric Interpretation
(The geometric interpretation of the perceptron is illustrated step by step, Steps 1 to 6, in the accompanying figures.)
Perceptron Algorithm :
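Since the algorithm itself is given as a figure in the original notes, here is a minimal sketch of the perceptron update rule (the tiny AND-gate dataset, the learning rate and the epoch count are illustrative choices): weights are nudged only when a point is misclassified.

# Sketch: perceptron learning rule on a linearly separable toy problem (AND gate)
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])   # inputs
y = np.array([0, 0, 0, 1])                       # AND labels

w = np.zeros(X.shape[1])
b = 0.0
lr = 0.1

for epoch in range(10):
    for xi, target in zip(X, y):
        pred = 1 if np.dot(w, xi) + b > 0 else 0  # step activation
        update = lr * (target - pred)             # zero when correctly classified
        w += update * xi
        b += update

print("learned weights:", w, "bias:", b)
print("predictions:", [1 if np.dot(w, xi) + b > 0 else 0 for xi in X])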

Perceptron Convergence and Linear Separability

Answer is shared in separate PDF

What are characteristic of good features?


Here are some characteristics of good features:

 Features must be found in most of the data samples: Great features represent characteristics which apply across different types of data samples and are not limited to just one data sample. For example, can the "red" color of an apple act as a feature? Not really, because apples can be found in different colors. It might have happened that the sample of apples taken for evaluation contained only "red" apples. If this criterion is not met, we may end up creating models with high bias.
 Features must be unique and not prevalent in other (different) forms: Great features are the ones which are unique to apples and are not applicable to other fruits. The toughness characteristic of an apple, such as being "hard to bite", may not be a good feature, because a guava can also be described using this feature.
 Features must exist in reality: There can be features which are accidental in nature and are not features at all when considering the population. For example, in a particular sample of data, a particular kind of feature can be found to be prevalent; however, when multiple data samples are taken, the feature goes missing.
A great feature must satisfy all of the above criteria. From that perspective, one can design derived features appropriately if the features represented in the raw data do not satisfy the above criteria. Creating or deriving good features from raw data is also called feature engineering. The following are the two most important aspects of feature engineering:

Feature Selection Techniques in Machine Learning

Feature selection is a way of selecting the subset of the most relevant features from the
original features set by removing the redundant, irrelevant, or noisy features.

So, we can define feature Selection as, "It is a process of


automatically or manually selecting the subset of most
appropriate and relevant features to be used in model building."
Feature selection is performed by either including the important features
or excluding the irrelevant features in the dataset without changing them.

Need for Feature Selection

Before implementing any technique, it is really important to understand the need for it, and the same is true for feature selection. As we know, in machine learning it is necessary to provide a pre-processed, good-quality input dataset in order to get better outcomes. We collect a huge amount of data to train our model and help it learn better. Generally, the dataset consists of noisy data, irrelevant data, and some useful data. Moreover, the huge amount of data slows down the training process of the model, and with noise and irrelevant data the model may not predict and perform well. So, it is very necessary to remove such noise and less important data from the dataset, and to do this, feature selection techniques are used.

Below are some benefits of using feature selection in machine learning:

o It helps in avoiding the curse of dimensionality.


o It helps in the simplification of the model so that it can be easily
interpreted by the researchers.
o It reduces the training time.
o It reduces overfitting and hence enhances generalization.

Feature Selection Techniques


There are mainly two types of Feature Selection techniques, which are:
o Supervised Feature Selection technique
Supervised Feature selection techniques consider the target variable and
can be used for the labelled dataset.
o Unsupervised Feature Selection technique
Unsupervised Feature selection techniques ignore the target variable and
can be used for the unlabelled dataset.

Feature Selection Method

For machine learning engineers, it is very important to understand which feature selection method will work properly for their model. The more we know about the data types of the variables, the easier it is to choose the appropriate statistical measure for feature selection.
To know this, we first need to identify the types of the input and output variables. In machine learning, variables are mainly of two types:

o Numerical Variables: Variable with continuous values such as integer,


float
o Categorical Variables: Variables with categorical values such as
Boolean, ordinal, nominals.

We can summarise the above cases with appropriate measures in


the below table:

Input Variable | Output Variable | Feature Selection technique
Numerical | Numerical | Pearson's correlation coefficient (for linear correlation); Spearman's rank coefficient (for non-linear correlation)
Numerical | Categorical | ANOVA correlation coefficient (linear); Kendall's rank coefficient (non-linear)
Categorical | Numerical | Kendall's rank coefficient (linear); ANOVA correlation coefficient (non-linear)
Categorical | Categorical | Chi-Squared test (contingency tables); Mutual Information
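As a small illustration of one such measure in code (a sketch assuming scikit-learn; the iris data, the chi-squared score function and k=2 are illustrative choices):

# Sketch: filter-style feature selection with a chi-squared test
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)

selector = SelectKBest(score_func=chi2, k=2)   # keep the 2 highest-scoring features
X_selected = selector.fit_transform(X, y)

print("chi2 score per feature:", selector.scores_)
print("original shape:", X.shape, "-> selected shape:", X_selected.shape)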

Pruning
Pruning is a data compression technique in machine learning and search
algorithms that reduces the size of decision trees by removing sections of the tree
that are non-critical and redundant to classify instances. Pruning reduces the
complexity of the final classifier, and hence improves predictive accuracy by the
reduction of overfitting.
One of the questions that arises in a decision tree algorithm is the optimal size of the
final tree. A tree that is too large risks overfitting the training data and poorly
generalizing to new samples. A small tree might not capture important structural
information about the sample space. However, it is hard to tell when a tree algorithm
should stop because it is impossible to tell if the addition of a single extra node will
dramatically decrease error. This problem is known as the horizon effect. A common
strategy is to grow the tree until each node contains a small number of instances
then use pruning to remove nodes that do not provide additional information. [1]
Pruning should reduce the size of a learning tree without reducing predictive
accuracy as measured by a cross-validation set. There are many techniques for tree
pruning that differ in the measurement that is used to optimize performance.
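One such technique, shown here only as a hedged sketch, is cost-complexity pruning in scikit-learn's decision trees (the breast-cancer dataset and ccp_alpha=0.01 are illustrative choices; in practice the alpha would be tuned with cross-validation):

# Sketch: comparing an unpruned tree with a cost-complexity-pruned tree
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X_train, y_train)

for name, tree in [("unpruned", full_tree), ("pruned", pruned_tree)]:
    print(name, "nodes:", tree.tree_.node_count,
          "test accuracy:", round(tree.score(X_test, y_test), 3))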

What is Feature Scaling?

Feature scaling is a method used to normalize the range of independent


variables or features of data. In data processing, it is also known as data
normalization and is generally performed during the data preprocessing
step. Just to give you an example — if you have multiple independent
variables like age, salary, and height; With their range as (18–100 Years),
(25,000–75,000 Euros), and (1–2 Meters) respectively, feature scaling
would help them all to be in the same range, for example- centered
around 0 or in the range (0,1) depending on the scaling technique.

Methods for Scaling

Now, since you have an idea of what is feature scaling. Let us explore
what methods are available for doing feature scaling. Of all the methods
available, the most common ones are:

Normalization

Also known as min-max scaling or min-max normalization, it is the


simplest method and consists of rescaling the range of features to scale
the range in [0, 1]. The general formula for normalization is given as:

x_norm = (x - min(x)) / (max(x) - min(x))

Here, max(x) and min(x) are the maximum and the minimum values of
the feature respectively.

We can also do a normalization over different intervals, e.g. choosing to


have the variable laying in any [a, b] interval, a and b being real numbers.
To rescale a range between an arbitrary set of values [a, b], the formula
becomes:

x_scaled = a + ((x - min(x)) * (b - a)) / (max(x) - min(x))
Standardization

Feature standardization makes the values of each feature in the data have
zero mean and unit variance. The general method of calculation is to
determine the distribution mean and standard deviation for each feature
and calculate the new data point by the following formula:

x_std = (x - x̄) / σ

Here, σ is the standard deviation of the feature vector, and x̄ is the


average of the feature vector.

Scaling to unit length

The aim of this method is to scale the components of a feature vector


such that the complete vector has length one. This usually means dividing
each component by the Euclidean length of the vector:

x_scaled = x / ||x||

In addition to the above 3 widely-used methods, there are some other


methods to scale the features viz. Power Transformer, Quantile
Transformer, Robust Scaler, etc. For the scope of this discussion, we are
deliberately not diving into the details of these techniques.

Normalization is good to use when the distribution of data does not


follow a Gaussian distribution. It can be useful in algorithms that do not
assume any distribution of the data like K-Nearest Neighbors.

In neural network algorithms that require data on a 0–1 scale,


normalization is an essential pre-processing step. Another popular
example of data normalization is image processing, where pixel intensities
have to be normalized to fit within a certain range (i.e., 0 to 255 for the
RGB color range).

Standardization can be helpful in cases where the data follows a Gaussian


distribution. Though this does not have to be necessarily true. Since
standardization does not have a bounding range, so, even if there are
outliers in the data, they will not be affected by standardization.

In clustering analyses, standardization comes in handy to compare


similarities between features based on certain distance measures.
Another prominent example is the Principal Component Analysis, where
we usually prefer standardization over Min-Max scaling since we are
interested in the components that maximize the variance.
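A short sketch of both techniques in code (assuming scikit-learn; the toy age/salary/height values are made up to mirror the earlier example):

# Sketch: min-max normalization and standardization
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[18, 25000, 1.60],
              [45, 50000, 1.75],
              [100, 75000, 2.00]])   # age (years), salary (euros), height (m)

print("min-max scaled to [0, 1]:\n", MinMaxScaler().fit_transform(X))
print("standardized (mean 0, std 1):\n", StandardScaler().fit_transform(X))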

Combinatorial explosions

Combinatorial explosions occur in some numeric problems when the complexity rapidly increases, caused by the growth in the number of possible combinations of inputs. This explosion in complexity can make some mathematical problems intractable to brute-force solutions. Combinatorial explosions are a manifestation of the curse of dimensionality.
The problem of combinatorial explosions occurs frequently in insurance
pricing. For example, I have data for an auto/motor insurance pricing
project, and it has 27 rating factors. My rating structure could use
anywhere from 0 to all 27 of these rating factors, and I want to find the
best combination of rating factors. How many combinations will I have to
search through?

If it took me only 1 minute to analyze each combination (and that’s faster


than I’ve ever been able to work), then it would take me approximately
21,515,067,731,468 billion years to try each combination. To put this into
perspective, the universe is only 13.8 billion years old!
But this is only a small part of the problem. Some rating factors interact
with each other. For example, auto/motor insurers often find an
interaction between the age of the driver and their gender.
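A tiny sketch of how quickly the space grows (this counts only which of the factors to include, before considering interactions or how each factor enters the model, which enlarge the search space much further):

# Sketch: number of possible subsets of n rating factors
n_factors = 27
n_subsets = 2 ** n_factors
print(f"{n_factors} rating factors -> {n_subsets:,} possible subsets of factors")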

How to evaluate ML models?

Models can be evaluated using multiple metrics. However, the right choice
of an evaluation metric is crucial and often depends upon the problem
that is being solved. A clear understanding of a wide range of metrics can
help the evaluator to chance upon an appropriate match of the problem
statement and a metric.

Classification metrics

For every classification model prediction, a matrix called the confusion


matrix can be constructed which demonstrates the number of test cases
correctly and incorrectly classified.

It looks something like this (considering 1 -Positive and 0 -Negative are


the target classes):

              | Actual 0             | Actual 1
Predicted 0   | True Negatives (TN)  | False Negatives (FN)
Predicted 1   | False Positives (FP) | True Positives (TP)

 TN: Number of negative cases correctly classified


 TP: Number of positive cases correctly classified
 FN: Number of positive cases incorrectly classified as negative
 FP: Number of negative cases incorrectly classified as positive

Accuracy

Accuracy is the simplest metric and can be defined as the number of test
cases correctly classified divided by the total number of test cases.
It can be applied to most generic problems but is not very useful when it
comes to unbalanced datasets.

For instance, if we are detecting frauds in bank data, the ratio of fraud to
non-fraud cases can be 1:99. In such cases, if accuracy is used, the model
will turn out to be 99% accurate by predicting all test cases as non-fraud.
The 99% accurate model will be completely useless.

If a model is poorly trained such that it predicts all the 1000 (say) data points as non-frauds, it will be missing out on the 10 fraud data points. If accuracy is measured, it will show that the model correctly predicts 990 data points and thus it will have an accuracy of (990/1000)*100 = 99%!

This is why accuracy is a false indicator of the model’s health.

Therefore, for such a case, a metric is required that can focus on the ten
fraud data points which were completely missed by the model.

Precision

Precision is the metric used to identify the correctness of classification:

Precision = TP / (TP + FP)

Intuitively, this equation is the ratio of correct positive classifications to


the total number of predicted positive classifications. The greater the
fraction, the higher is the precision, which means better is the ability of
the model to correctly classify the positive class.

In the problem of predictive maintenance (where one must predict in


advance when a machine needs to be repaired), precision comes into
play. The cost of maintenance is usually high and thus, incorrect
predictions can lead to a loss for the company. In such cases, the ability of
the model to correctly classify the positive class and to lower the number
of false positives is paramount!

Recall

Recall tells us the number of positive cases correctly identified out of the total number of positive cases:

Recall = TP / (TP + FN)

Going back to the fraud problem, the recall value will be very useful in
fraud cases because a high recall value will indicate that a lot of fraud
cases were identified out of the total number of frauds.
F1 Score

F1 score is the harmonic mean of Recall and Precision and therefore balances out the strengths of each:

F1 = 2 * (Precision * Recall) / (Precision + Recall)

It is useful in cases where both recall and precision can be valuable – like
in the identification of plane parts that might require repairing. Here,
precision will be required to save on the company’s cost (because plane
parts are extremely expensive) and recall will be required to ensure that
the machinery is stable and not a threat to human lives.
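A small sketch of these metrics in code (assuming scikit-learn; the imbalanced fraud-style labels below are made up, with 1 meaning fraud):

# Sketch: accuracy can look great while recall exposes the missed frauds
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

y_true = [0] * 95 + [1] * 5          # 95 non-fraud cases, 5 fraud cases
y_pred = [0] * 95 + [1, 0, 0, 1, 0]  # the model catches only 2 of the 5 frauds

print("confusion matrix:\n", confusion_matrix(y_true, y_pred))
print("accuracy :", accuracy_score(y_true, y_pred))   # high, but misleading
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))     # reveals the missed frauds
print("f1 score :", f1_score(y_true, y_pred))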

Regression metrics

Regression models provide a continuous output variable, unlike


classification models that have discrete output variables. Therefore, the
metrics for assessing the regression models are accordingly designed.

Mean Squared Error or MSE

MSE is a simple metric that calculates the difference between the actual
value and the predicted value (error), squares it and then provides the
mean of all the errors.

MSE is very sensitive to outliers and will show a very high error value even
if a few outliers are present in the otherwise well-fitted model predictions.

Root Mean Squared Error or RMSE

RMSE is the root of MSE and is beneficial because it helps to bring down
the scale of the errors closer to the actual values, making it more
interpretable.

Mean Absolute Error or MAE

MAE is the mean of the absolute error values (actuals – predictions).


If one wants to ignore the outlier values to a certain degree, MAE is the
choice since it reduces the penalty of the outliers significantly with the
removal of the square terms.

Root Mean Squared Log Error or RMSLE

In RMSLE, the same equation as that of RMSE is followed except for an


added log function applied to the actual and predicted values:

RMSLE = sqrt( mean( (log(y + 1) - log(x + 1))^2 ) )

x is the actual value and y is the predicted value. This helps to scale down
the effect of the outliers by downplaying the higher error rates with the
log function. Also, RMSLE helps to capture a relative error (by comparing
all the error values) through the use of logs.
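A short sketch of these regression metrics (assuming NumPy and scikit-learn; the actual and predicted values are made up):

# Sketch: MSE, RMSE, MAE and RMSLE on toy values
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

actual = np.array([3.0, 5.0, 7.5, 10.0])
predicted = np.array([2.5, 5.5, 7.0, 12.0])

mse = mean_squared_error(actual, predicted)
rmse = np.sqrt(mse)
mae = mean_absolute_error(actual, predicted)
rmsle = np.sqrt(np.mean((np.log1p(predicted) - np.log1p(actual)) ** 2))

print(f"MSE={mse:.3f}  RMSE={rmse:.3f}  MAE={mae:.3f}  RMSLE={rmsle:.3f}")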

Cross-Validation in Machine Learning

Cross-validation is a technique for validating the model efficiency by


training it on the subset of input data and testing on previously unseen
subset of the input data. We can also say that it is a technique to
check how a statistical model generalizes to an independent
dataset.

In machine learning, there is always the need to test the stability of the model; we cannot judge a model based only on the training dataset it was fitted on. For this purpose, we reserve a particular sample of the dataset which was not part of the training dataset. After that, we test our model on that sample before deployment, and this complete process comes under cross-validation. This is somewhat different from the general train-test split.

Hence the basic steps of cross-validations are:

o Reserve a subset of the dataset as a validation set.


o Provide the training to the model using the training dataset.
o Now, evaluate model performance using the validation set. If the
model performs well with the validation set, perform the further
step, else check for the issues.

Methods used for Cross-Validation

There are some common methods that are used for cross-validation.
These methods are given below:

1. Validation Set Approach


2. Leave-P-out cross-validation
3. Leave one out cross-validation
4. K-fold cross-validation
5. Stratified k-fold cross-validation

Validation Set Approach

We divide our input dataset into a training set and test or validation set in
the validation set approach. Both the subsets are given 50% of the
dataset.

But it has one big disadvantage: we are using only 50% of the dataset to train our model, so the model may fail to capture important information in the dataset. It also tends to give an underfitted model.

Leave-P-out cross-validation

In this approach, p data points are left out of the training data. It means that if there are a total of n data points in the original input dataset, then n-p data points will be used as the training dataset and the p data points as the validation set. This complete process is repeated for all possible samples, and the average error is calculated to know the effectiveness of the model.

There is a disadvantage of this technique; that is, it can be


computationally difficult for the large p.

Leave one out cross-validation

This method is similar to leave-p-out cross-validation, but instead of p, we take 1 data point out of training. It means that in this approach, for each learning set, only one data point is reserved, and the remaining dataset is used to train the model. This process repeats for each data point. Hence, for n samples, we get n different training sets and n test sets. It has the following features:

o In this approach, the bias is minimum as all the data points are
used.
o The process is executed for n times; hence execution time is high.
o This approach leads to high variation in testing the effectiveness of
the model as we iteratively check against one data point.

K-Fold Cross-Validation

K-fold cross-validation approach divides the input dataset into K groups of


samples of equal sizes. These samples are called folds. For each learning
set, the prediction function uses k-1 folds, and the rest of the folds are
used for the test set. This approach is a very popular CV approach
because it is easy to understand, and the output is less biased than other
methods.

The steps for k-fold cross-validation are:

o Split the input dataset into K groups


o For each group:
o Take one group as the reserve or test data set.
o Use remaining groups as the training dataset
o Fit the model on the training set and evaluate the
performance of the model using the test set.
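In code, the whole procedure is usually a one-liner (a sketch assuming scikit-learn; the estimator and k=5 are illustrative choices):

# Sketch: 5-fold cross-validation
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)   # each fold serves once as the test set

print("per-fold accuracy:", scores.round(3))
print("mean accuracy:", scores.mean().round(3))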

Comparison of Cross-validation to train/test split in Machine


Learning

o Train/test split: The input data is divided into two parts, the training set and the test set, in a ratio such as 70:30 or 80:20. It provides high variance, which is one of its biggest disadvantages.
o Training Data: The training data is used to train the model,
and the dependent variable is known.
o Test Data: The test data is used to make predictions from the model that has already been trained on the training data. It has the same features as the training data but is not part of it.
o Cross-Validation dataset: It is used to overcome the disadvantage of the train/test split by splitting the dataset into groups of train/test splits and averaging the results. It can be used if we want to optimize a model that has been trained on the training dataset for the best performance. It is more efficient than a single train/test split because every observation is used for both training and testing.

Hypothesis Testing and Statistical Significance,

Hypothesis Testing?

Any data science project starts with exploring the data. When we perform
an analysis on a sample through exploratory data analysis and inferential
statistics, we get information about the sample. Now, we want to use this
information to predict values for the entire population.

Hypothesis testing is done to confirm our observation about the


population using sample data, within the desired error level. Through
hypothesis testing, we can determine whether we have enough statistical
evidence to conclude if the hypothesis about the population is true or not.

How to perform hypothesis testing in machine learning?

To trust your model and make predictions, we utilize hypothesis testing.


When we will use sample data to train our model, we make assumptions
about our population. By performing hypothesis testing, we validate these
assumptions for a desired significance level.
Let’s take the case of regression models: When we fit a straight line
through a linear regression model, we get the slope and intercept for the
line. Hypothesis testing is used to confirm if our beta coefficients are
significant in a linear regression model. Every time we run the linear
regression model, we test if the line is significant or not by checking if the
coefficient is significant

Key steps to perform hypothesis test are as follows:

1. Formulate a Hypothesis
2. Determine the significance level
3. Determine the type of test
4. Calculate the Test Statistic values and the p values
5. Make Decision

Now let’s look into the steps in detail:

Formulating the hypothesis

One of the key steps to do this is to formulate the below two hypotheses:

The null hypothesis represented as H₀ is the initial claim that is based


on the prevailing belief about the population.
The alternate hypothesis represented as H₁ is the challenge to the null
hypothesis. It is the claim which we would like to prove as True

Select the type of Hypothesis test


We choose the type of test statistic based on the predictor variable –
quantitative or categorical. Below are a few of the commonly used test
statistics for quantitative data

Type of predictor variable | Distribution type | Desired Test | Attributes
Quantitative | Normal distribution | Z-Test | Large sample size; population standard deviation known
Quantitative | T distribution | T-Test | Sample size less than 30; population standard deviation unknown
Quantitative | Positively skewed distribution | F-Test | When you want to compare 3 or more variables
Quantitative | Negatively skewed distribution | NA | Requires feature transformation to perform a hypothesis test
Categorical | NA | Chi-Square test | Test of independence; goodness of fit

1) Type 1 Error – This occurs when the null hypothesis is true but we reject it. The probability of a Type I error is denoted by alpha (α). The Type 1 error rate is also known as the level of significance of the hypothesis test.

2) Type 2 Error – This occurs when the null hypothesis is false but we fail to reject it. The probability of a Type II error is denoted by beta (β).
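A minimal sketch of the steps above for a one-sample test (assuming SciPy; the sample values, the hypothesised population mean of 50 and alpha = 0.05 are made-up choices):

# Sketch: two-sided one-sample t-test
from scipy import stats

sample = [51.2, 49.8, 50.5, 52.1, 48.9, 50.7, 51.5, 49.4]

# H0: population mean = 50   vs   H1: population mean != 50
t_stat, p_value = stats.ttest_1samp(sample, popmean=50)

alpha = 0.05                      # significance level (Type 1 error rate)
print("t statistic:", round(t_stat, 3), " p-value:", round(p_value, 3))
print("reject H0" if p_value < alpha else "fail to reject H0")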
Debugging Learning Algorithms,

Why do we need debugging?

Debugging is a key part of software development, regardless of what type


of software it is—so, naturally, it also applies to machine learning.

In machine learning, poor model performance can have a wide range of


causes, and debugging might take a lot of work. No predictive power or
suboptimal values can cause models to perform poorly. So, we debug our
model to find the root cause of the issue.

Models are being deployed on increasingly larger tasks and datasets, and
the more the scale grows, the more important it is to debug your model.

What are the general steps for debugging?

Common bugs in machine learning models

 Dimension error

The most common issue in models is the dimension, which happens


because of the nature of linear algebra. Most popular libraries can spot
inconsistent shapes. While working with matrices, say with shapes (n1, n2) and (n3, n4), it is important in matrix multiplication that n2 matches n3 for the matmul to work. It's also critical to find where the shapes are coming from.

 Variable

Data goes through a long process starting from preparation, cleaning, and
more. In this process, developers often get confused or forget correct data
variables. So, to stay on the correct path, it’s good practice to use a data
flow diagram before architecting our models. This will help us find
the correct data variable names, model flow, and expected results.

 Flaws in input data

To figure out if our model contains predictive information or not, try with
humans first. If humans can’t predict the data (image or text), then our ML
model won’t make any difference. If we try to feed it more inputs, it still
won’t make a difference. Chances are that the model will lose its
accuracy.

Once we get adequate predictive information, we need to know if our data


is adequate to train a model and get the signal. In general, we need a
minimum of 30 samples per class and 10 samples for specific features.

Size Of Dataset ∝ Number Of Parameters In Model


The exact shape of the above equation depends on your machine learning
application.

 Learn from minimal data

External data, let’s say data found on the internet or open-sourced, can be
useful. Once you collect that data and label it, you can then use it for
training. It can also be used for many other tasks. Just like external data,
we can also use an external model which was trained by another person
and reuse it for our task.

Using a high-quality but small dataset is the best way to train a simple
model. Sometimes, when you use large training data sets you can waste
too many resources and money.

 Preparing data and avoiding the common issue

When preparing features, it’s crucial to measure the scaling factors,


mean, and standard deviation on the test dataset. Measuring these will
improve the performance on the test data set. Standardization will help
you make sure that all data has an SD of 1 and a mean of 0. This is more
effective if the data has outliers, and it’s the most effective way to scale a
feature.

 Hyperparameter Tuning

Hyperparameter tuning is about improving the hyperparameters. These


parameters control the behavior of a learning algorithm. For example,
learning rate (alpha), or the complexity parameter (m) in gradient descent
optimization.

A common hyperparameter tuning case is to select optimal values by


using cross-validation in order to choose what works best on unseen data.
This evaluates how model parameters get updated during the training
period. Often this task is carried out manually, using a simple trial and
error method. There are plenty of different ways to tune hyperparameters,
such as grid searches, random search methods, Bayesian optimization
methods, and a simple educated guess.
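As a hedged sketch of one of these approaches, a grid search with cross-validation (assuming scikit-learn; the SVC estimator and the candidate C/gamma values are illustrative choices):

# Sketch: hyperparameter tuning with GridSearchCV
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(SVC(), param_grid, cv=5)   # tries every combination with 5-fold CV
search.fit(X, y)

print("best hyperparameters:", search.best_params_)
print("best CV accuracy:", round(search.best_score_, 3))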

 Verification strategy

With verification strategies, we can find issues that aren’t related to the
actual model. We can verify the integrity of a model (i.e. verifying that it
hasn’t been changed or corrupted), or if the model is correct and
maintainable. Many practices have evolved for verification, like automated
generation of test data sequences, running multiple analyses with
different sets of input values, and performing validation checks when
importing data into a file.
Understanding the Bias-Variance Tradeoff

Whenever we discuss model prediction, it’s important to understand


prediction errors (bias and variance). There is a tradeoff between a
model’s ability to minimize bias and variance. Gaining a proper
understanding of these errors would help us not only to build accurate
models but also to avoid the mistake of overfitting and underfitting.

So let’s start with the basics and see how they make difference to our
machine learning Models.

Bias

Bias is the difference between the average prediction of our model and the
correct value which we are trying to predict. Model with high bias pays
very little attention to the training data and oversimplifies the model. It
always leads to high error on training and test data.

Variance

Variance is the variability of model prediction for a given data point or a


value which tells us spread of our data. Model with high variance pays a lot
of attention to training data and does not generalize on the data which it
hasn’t seen before. As a result, such models perform very well on training
data but has high error rates on test data.

Bias and variance using bulls-eye diagram


In the above diagram, the center of the target is a model that perfectly predicts the correct values. As we move away from the bulls-eye, our predictions become worse and worse. We can repeat our process of model building to get separate hits on the target.

In supervised learning, underfitting happens when a model is unable to capture the underlying pattern of the data. These models usually have high bias and low variance. It happens when we have too little data to build an accurate model or when we try to build a linear model with nonlinear data. Such models are also too simple to capture complex patterns in data, for example linear and logistic regression.

In supervised learning, overfitting happens when our model captures the


noise along with the underlying pattern in data. It happens when we train
our model a lot over a noisy dataset. These models have low bias and high
variance. These models are very complex like Decision trees which are
prone to overfitting.
Bias Variance Tradeoff

If our model is too simple and has very few parameters then it may have
high bias and low variance. On the other hand, if our model has a large
number of parameters then it’s going to have high variance and low bias.
So we need to find the right/good balance without overfitting and
underfitting the data.

This tradeoff in complexity is why there is a tradeoff between bias and


variance. An algorithm can’t be more complex and less complex at the
same time.

CONVEX SURROGATE FUNCTION

For a univariate function, convexity means that the line segment connecting two points on the function's curve lies on or above the curve (it does not cross it). If it does cross the curve, the function has a local minimum which is not a global one.

Mathematically, for two points x₁, x₂ laying on the function’s curve this
condition is expressed as:

f(λx₁ + (1 - λ)x₂) ≤ λf(x₁) + (1 - λ)f(x₂)

where λ denotes a point’s location on a section line and its value has to be
between 0 (left point) and 1 (right point), e.g. λ=0.5 means a location in
the middle.

Below there are two functions with exemplary section lines.


Exemplary convex and non-convex functions (figure)

Another way to check mathematically if a univariate function is convex is


to calculate the second derivative and check if its value is always bigger
than 0.

Let’s investigate a simple quadratic function given by:

Its first and second derivative are:

Because the second derivative is always bigger than 0, our function is


strictly convex.

It is also possible to use quasi-convex functions with a gradient descent


algorithm. However, often they have so-called saddle points (called also
minimax points) where the algorithm can get stuck (we will demonstrate it
later in the article). An example of a quasi-convex function is:

f(x) = x⁴ - 2x³ + 2

Let's stop here for a moment. Its first derivative, f'(x) = 4x³ - 6x², equals zero at x=0 and x=1.5. These places are candidates for the function's extrema (a minimum or a maximum), since the slope is zero there. But first we have to check the second derivative, f''(x) = 12x² - 12x.

The value of this expression is zero for x=0 and x=1. These locations are
called an inflexion point — a place where the curvature changes sign —
meaning it changes from convex to concave or vice-versa. By analysing
this equation we conclude that :

 for x<0: function is convex

 for 0<x<1: function is concave

 for x>1: function is convex again

Now we see that the point x = 0 has both first and second derivatives equal
to zero, meaning it is a saddle point, while the point x = 1.5 is a global
minimum.

Let’s look at the graph of this function. As calculated before, the saddle
point is at x = 0 and the minimum at x = 1.5.

Quasi-convex function with a saddle point


Gradient

In the case of a univariate function, the gradient is simply the first
derivative at a selected point. In the case of a multivariate function,
it is a vector of derivatives in each main direction (along the variable
axes). Because each of these derivatives measures the slope along one axis
only, ignoring the others, they are called partial derivatives.

A gradient for an n-dimensional function f(x) at a given point p is defined
as follows:

∇f(p) = [∂f/∂x₁(p), ∂f/∂x₂(p), …, ∂f/∂xₙ(p)]

The upside-down triangle is the so-called nabla symbol, read as “del”.
To better understand how to calculate it, let’s do a hand calculation for an
exemplary 2-dimensional function, for instance:

f(x, y) = 0.5x² + y²

Let’s assume we are interested in the gradient at the point p(10, 10). The
partial derivatives are

∂f/∂x = x,  ∂f/∂y = 2y

so consequently:

∇f(10, 10) = [10, 20]

By looking at these values we conclude that the slope is twice as steep
along the y axis.
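A minimal sketch of checking this gradient numerically with finite differences (assuming NumPy is available; the function f is the illustrative one used above):

import numpy as np

def f(p):
    x, y = p
    return 0.5 * x**2 + y**2   # illustrative function from the hand calculation

def numerical_gradient(func, p, h=1e-6):
    """Central-difference approximation of the gradient at point p."""
    p = np.asarray(p, dtype=float)
    grad = np.zeros_like(p)
    for i in range(p.size):
        step = np.zeros_like(p)
        step[i] = h
        grad[i] = (func(p + step) - func(p - step)) / (2 * h)
    return grad

print(numerical_gradient(f, [10.0, 10.0]))  # approximately [10., 20.]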

Gradient Descent Algorithm

The Gradient Descent algorithm iteratively calculates the next point using
the gradient at the current position, scales it (by a learning rate) and
subtracts the obtained value from the current position (makes a step). It
subtracts the value because we want to minimise the function (to maximise
it, we would add). This process can be written as:

pₙ₊₁ = pₙ − η∇f(pₙ)

There’s an important parameter η which scales the gradient and thus
controls the step size. In machine learning, it is called the learning rate
and it has a strong influence on performance.

 The smaller the learning rate, the longer GD takes to converge, and it may
reach the maximum number of iterations before reaching the optimum point.

 If the learning rate is too big, the algorithm may not converge to the
optimal point (it jumps around) or may even diverge completely.

In summary, Gradient Descent method’s steps are:

1. choose a starting point (initialisation)

2. calculate gradient at this point

3. make a scaled step in the opposite direction to the gradient (objective:


minimise)

4. repeat points 2 and 3 until one of the criteria is met:

 maximum number of iterations is reached

 step size is smaller than the tolerance (a short Python sketch of these steps follows below).
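A minimal Python sketch of these steps, under the assumption that the gradient is supplied as a plain function (the 2-dimensional function used earlier is only illustrative):

import numpy as np

def gradient_descent(grad, start, learning_rate=0.1, max_iter=100, tol=1e-6):
    """Plain gradient descent: step against the gradient until the step is tiny."""
    x = np.asarray(start, dtype=float)
    for _ in range(max_iter):                 # criterion 1: maximum number of iterations
        step = learning_rate * grad(x)        # scale the gradient by the learning rate
        x = x - step                          # move in the opposite direction (minimise)
        if np.linalg.norm(step) < tol:        # criterion 2: step size below tolerance
            break
    return x

# Illustrative use: minimise f(x, y) = 0.5*x**2 + y**2, whose gradient is (x, 2y).
grad_f = lambda p: np.array([p[0], 2.0 * p[1]])
print(gradient_descent(grad_f, start=[10.0, 10.0]))   # converges towards [0, 0]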

Support Vector Machine Algorithm


Support Vector Machine or SVM is one of the most popular Supervised
Learning algorithms, which is used for Classification as well as Regression
problems. However, primarily, it is used for Classification problems in
Machine Learning.

The goal of the SVM algorithm is to create the best line or decision
boundary that can segregate n-dimensional space into classes so that we
can easily put the new data point in the correct category in the future.
This best decision boundary is called a hyperplane.

SVM chooses the extreme points/vectors that help in creating the
hyperplane. These extreme cases are called support vectors, and hence the
algorithm is termed Support Vector Machine. Consider the below diagram in
which there are two different categories that are classified using a
decision boundary or hyperplane:

SVM can be of two types:

o Linear SVM: Linear SVM is used for linearly separable data, which
means if a dataset can be classified into two classes by using a
single straight line, then such data is termed linearly separable
data, and the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable
data, which means if a dataset cannot be classified by using a
straight line, then such data is termed non-linear data, and the
classifier used is called a Non-linear SVM classifier.

Hyperplane and Support Vectors in the SVM algorithm:

Hyperplane: There can be multiple lines/decision boundaries to


segregate the classes in n-dimensional space, but we need to find out the
best decision boundary that helps to classify the data points. This best
boundary is known as the hyperplane of SVM.

The dimensions of the hyperplane depend on the number of features present
in the dataset, which means if there are 2 features (as shown in the image),
then the hyperplane will be a straight line. And if there are 3 features,
then the hyperplane will be a 2-dimensional plane.
We always create a hyperplane that has the maximum margin, i.e., the
maximum distance between the hyperplane and the nearest data points of
either class.

Support Vectors:

The data points or vectors that are closest to the hyperplane and which
affect the position of the hyperplane are termed support vectors. Since
these vectors support the hyperplane, they are called support vectors.

How does SVM work?

Linear SVM:

The working of the SVM algorithm can be understood by using an
example. Suppose we have a dataset that has two tags (green and blue),
and the dataset has two features x1 and x2. We want a classifier that can
classify the pair (x1, x2) of coordinates as either green or blue. Consider
the below image:

As it is a 2-d space, by just using a straight line we can easily separate
these two classes. But there can be multiple lines that can separate these
classes. Consider the below image:
Hence, the SVM algorithm helps to find the best line or decision boundary;
this best boundary or region is called a hyperplane. The SVM algorithm
finds the closest points of the lines from both the classes. These points
are called support vectors. The distance between the vectors and the
hyperplane is called the margin. And the goal of SVM is to maximize this
margin. The hyperplane with maximum margin is called the optimal
hyperplane.
Non-Linear SVM:

If data is linearly arranged, then we can separate it by using a straight


line, but for non-linear data, we cannot draw a single straight line.
Consider the below image:

So to separate these data points, we need to add one more dimension. For
linear data, we have used two dimensions x and y, so for non-linear data,
we will add a third dimension z. It can be calculated as:

z = x² + y²

By adding the third dimension, the sample space will become as in the below
image:
So now, SVM will divide the datasets into classes in the following way.
Consider the below image:

Since we are in a 3-d space, it looks like a plane parallel to the x-axis.
If we convert it back to 2-d space with z = 1, it becomes:

Hence we get a circle of radius 1 in the case of non-linear data.
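A minimal sketch of this idea with scikit-learn (assuming it is installed): a linear kernel for a straight-line boundary, and an RBF kernel, which performs a comparable non-linear mapping implicitly, for data that is not linearly separable.

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two interleaving half-moons: a classic example of non-linearly separable data.
X, y = datasets.make_moons(n_samples=300, noise=0.15, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear").fit(X_train, y_train)
rbf_svm = SVC(kernel="rbf", gamma=2.0).fit(X_train, y_train)

print("linear kernel accuracy:", linear_svm.score(X_test, y_test))
print("RBF kernel accuracy:   ", rbf_svm.score(X_test, y_test))
# The RBF kernel typically separates the moons much better than a straight line.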


UNIT – III
Classification:
In machine learning, classification refers to a predictive modeling problem
where a class label is predicted for given input data.
Following are examples of cases where the data analysis task is
Classification −
 A bank loan officer wants to analyze the data in order to know which
customers (loan applicants) are risky and which are safe.
 A marketing manager at a company needs to analyze whether a customer
with a given profile will buy a new computer.
In both of the above examples, a model or classifier is constructed to
predict the categorical labels. These labels are risky or safe for the loan
application data and yes or no for the marketing data.
Prediction:
Following are the examples of cases where the data analysis task is
Prediction −
Suppose the marketing manager needs to predict how much a given
customer will spend during a sale at his company. In this example we are
asked to predict a numeric value. Therefore the data analysis task is
an example of numeric prediction. In this case, a model or predictor will
be constructed that predicts a continuous-valued function or ordered
value.
Note − Regression analysis is a statistical methodology that is most often
used for numeric prediction.
Classification Works as follows:
With the help of the bank loan application that we have discussed above,
let us understand the working of classification. The Data Classification
process includes two steps −
 Building the Classifier or Model
 Using Classifier for Classification
Building the Classifier or Model
 This step is the learning step or the learning phase.
 In this step the classification algorithms build the classifier.
 The classifier is built from the training set made up of database
tuples and their associated class labels.
 Each tuple that constitutes the training set belongs to a predefined
category or class. These tuples can also be referred to as samples,
objects or data points.

Using Classifier for Classification


In this step, the classifier is used for classification. Here the test data is
used to estimate the accuracy of classification rules. The classification
rules can be applied to the new data tuples if the accuracy is considered
acceptable.

Naïve Bayes Classifier:

The Naïve Bayes Classifier is a classification algorithm based on
conditional probability (Bayes’ theorem). After being trained on labelled
data, it predicts the most probable class for new test data.
The below example demonstrates the concept.
The below example demonstrates the concept.

Class:

C1:buys_computer = ‘yes’

C2:buys_computer = ‘no’

Data to be classified:

X = (age <=30,

Income = medium,

Student = yes

Credit_rating = Fair)

Training set:

AGE      INCOME    STUDENT    CREDIT_RATING    BUYS_COMPUTER
<=30 HIGH NO FAIR NO
<=30 HIGH NO EXCELLENT NO
31..40 HIGH NO FAIR YES
>40 MEDIUM NO FAIR YES
>40 LOW YES FAIR YES
>40 LOW YES EXCELLENT NO
31..40 LOW YES EXCELLENT YES
<=30 MEDIUM NO FAIR NO
<=30 LOW YES FAIR YES
>40 MEDIUM YES FAIR YES
<=30 MEDIUM YES EXCELLENT YES
31..40 MEDIUM NO EXCELLENT YES
31..40 HIGH YES FAIR YES
>40 MEDIUM NO EXCELLENT NO

 P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643


P(buys_computer = “no”) = 5/14= 0.357

 Compute P(X|Ci) for each class


P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222

P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6

P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444

P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4

P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667

P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2

P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667

P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4

X = (age <= 30 , income = medium, student = yes, credit_rating


= fair)

P(X|Ci) : P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 =


0.044

P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019

P(X|Ci)*P(Ci) : P(X|buys_computer = “yes”) * P(buys_computer = “yes”)


= 0.028

P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007

Therefore, X belongs to class (“buys_computer = yes”)
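A small sketch that reproduces this hand calculation in plain Python (the training tuples are the ones from the table above):

from collections import Counter

# (age, income, student, credit_rating, buys_computer) tuples from the training set above
data = [
    ("<=30", "high", "no", "fair", "no"),     ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),  (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),     (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"), (">40", "medium", "no", "excellent", "no"),
]
x = ("<=30", "medium", "yes", "fair")          # the tuple X to classify

class_counts = Counter(row[-1] for row in data)
scores = {}
for c, n_c in class_counts.items():
    score = n_c / len(data)                    # prior P(Ci)
    for i, value in enumerate(x):              # multiply conditional probabilities P(xk | Ci)
        matches = sum(1 for row in data if row[-1] == c and row[i] == value)
        score *= matches / n_c
    scores[c] = score

print(scores)                                   # {'no': ~0.007, 'yes': ~0.028}
print("predicted class:", max(scores, key=scores.get))   # yes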

Comments on Naïve Bayes Classifier:

 Advantages

 Easy to implement

 Good results obtained in most of the cases

 Disadvantages

 Assumption: class conditional independence, therefore loss of


accuracy

 Practically, dependencies exist among variables

 E.g., hospitals: patients: Profile: age, family history, etc.

Symptoms: fever, cough etc., Disease: lung cancer, diabetes, etc.


 Dependencies among these cannot be modeled by
Naïve Bayes Classifier

Multi-Layer Networks

Multi-Layer perceptron defines the most complex architecture of artificial


neural networks. It is substantially formed from multiple layers of the
perceptron. The pictorial representation of multi-layer perceptron learning
is as shown below-

MLP networks are used in a supervised learning setting. A typical learning
algorithm for MLP networks is the backpropagation algorithm.

A multilayer perceptron (MLP) is a feed-forward artificial neural network
that generates a set of outputs from a set of inputs. An MLP is
characterized by several layers of nodes connected as a directed graph
between the input and output layers. MLP uses backpropagation for
training the network. MLP is a deep learning method.
Method of working.

Backpropagation Process in Deep Neural Network

Backpropagation is one of the important concepts of a neural network.
Our task is to classify our data as well as possible. For this, we have to
update the weights and biases, but how can we do that in a deep neural
network? In the linear regression model, we use gradient descent to
optimize the parameters. Similarly, here we also use the gradient descent
algorithm, with backpropagation used to compute the gradients.

For a single training example, Backpropagation algorithm calculates the


gradient of the error function. Backpropagation can be written as a
function of the neural network. Backpropagation algorithms are a set of
methods used to efficiently train artificial neural networks following a
gradient descent approach which exploits the chain rule.

The main feature of backpropagation is that it is an iterative, recursive
and efficient method for calculating the weight updates that improve the
network until it is able to perform the task for which it is being trained.
The derivatives of the activation function must be known at network design
time for backpropagation to be applicable.
Now, how is the error function used in backpropagation, and how does
backpropagation work? Let us start with an example and work through it
mathematically to understand exactly how the weights are updated using
backpropagation.


Input values

X1=0.05
X2=0.10

Initial weight

W1=0.15 w5=0.40
W2=0.20 w6=0.45
W3=0.25 w7=0.50
W4=0.30 w8=0.55

Bias Values

b1=0.35 b2=0.60

Target Values

T1=0.01
T2=0.99

Now, we first calculate the values of H1 and H2 by a forward pass.


Forward Pass

To find the value of H1, we first multiply the input values by the
corresponding weights and add the bias:

H1=x1×w1+x2×w2+b1
H1=0.05×0.15+0.10×0.20+0.35
H1=0.3775

To calculate the final output of H1, we apply the sigmoid function:

out_H1 = 1 / (1 + e^(−H1)) = 1 / (1 + e^(−0.3775)) = 0.593269992

We will calculate the value of H2 in the same way as H1

H2=x1×w3+x2×w4+b1
H2=0.05×0.25+0.10×0.30+0.35
H2=0.3925

To calculate the final output of H2, we apply the sigmoid function:

out_H2 = 1 / (1 + e^(−H2)) = 1 / (1 + e^(−0.3925)) = 0.596884378

Now, we calculate the values of y1 and y2 in the same way as we


calculate the H1 and H2.

To find the value of y1, we multiply the inputs to the output layer, i.e.,
the outputs of H1 and H2, by the corresponding weights and add the bias:

y1=H1×w5+H2×w6+b2
y1=0.593269992×0.40+0.596884378×0.45+0.60
y1=1.10590597
To calculate the final output of y1, we apply the sigmoid function:

out_y1 = 1 / (1 + e^(−1.10590597)) = 0.75136507

We will calculate the value of y2 in the same way as y1

y2=H1×w7+H2×w8+b2
y2=0.593269992×0.50+0.596884378×0.55+0.60
y2=1.2249214

To calculate the final output of y2, we apply the sigmoid function:

out_y2 = 1 / (1 + e^(−1.2249214)) = 0.772928465

Our target values are 0.01 and 0.99. Our y1 and y2 values do not match
the target values T1 and T2.

Now, we will find the total error, which is computed from the differences
between the outputs and the target outputs:

E_total = Σ ½(target − output)²

So, the total error is

E_total = ½(0.01 − 0.75136507)² + ½(0.99 − 0.772928465)²
        = 0.274811083 + 0.023560026 = 0.298371109


Now, we will backpropagate this error to update the weights using a
backward pass.
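A small NumPy sketch of this forward pass and total error, using the weights, biases and targets listed above:

import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

x1, x2 = 0.05, 0.10
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55
b1, b2 = 0.35, 0.60
t1, t2 = 0.01, 0.99

# Forward pass through the hidden layer ...
out_h1 = sigmoid(x1 * w1 + x2 * w2 + b1)   # 0.5932699...
out_h2 = sigmoid(x1 * w3 + x2 * w4 + b1)   # 0.5968843...

# ... and through the output layer.
out_y1 = sigmoid(out_h1 * w5 + out_h2 * w6 + b2)   # 0.7513650...
out_y2 = sigmoid(out_h1 * w7 + out_h2 * w8 + b2)   # 0.7729284...

# Squared-error loss summed over both outputs.
e_total = 0.5 * (t1 - out_y1) ** 2 + 0.5 * (t2 - out_y2) ** 2
print(out_h1, out_h2, out_y1, out_y2, e_total)      # e_total is approximately 0.2983711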
UNIT – IV
Mining Frequent Patterns, Association Rules

 Frequent patterns are patterns (such as itemsets, subsequences,


or substructures) that appear in a data set frequently.
 For example, a set of items, such as milk and bread, that appear
frequently together in a transaction data set is a frequent itemset.
 A subsequence, such as buying first a PC, then a digital camera, and
then a memory card, if it occurs frequently in a shopping history
database, is a (frequent) sequential pattern.
 A substructure can refer to different structural forms, such as
subgraphs, subtrees, or sublattices, which may be combined with
itemsets or subsequences. If a substructure occurs frequently, it is
called a (frequent) structured pattern.
 Finding such frequent patterns plays an essential role in mining
associations, correlations, and many other interesting relationships
among data.
 Moreover, it helps in data classification, clustering, and other data
mining tasks as well. Thus, frequent pattern mining has become an
important data mining task
 Applications:

 Basket data analysis, cross-marketing, catalog design, sale


campaign analysis, Web log (click stream) analysis, and DNA
sequence analysis.
Example for frequent patterns:

Consider the following transactional data:

Transaction-id    Items bought

10 A, B, D

20 A, C, D

30 A, D, E

40 B, E, F

50 B, C, D, E, F
An itemset is considered frequent if its occurrence in the transactions is
greater than or equal to the minimum support threshold.

Let supmin = 50%, confmin = 50%

Frequent patterns: {A}:3, {B}:3, {D}:4, {E}:3, {A, D}:3

Support and confidence are defined as follows:

 Support, s: the probability that a transaction contains A ∪ B.

 Confidence, c: the conditional probability that a transaction containing A
also contains B.

For the association rules, the support and confidence are given below:

Association rule    Support               Confidence
A → D               60% = (3/5) × 100     100% = (3/3) × 100
D → A               60% = (3/5) × 100     75% = (3/4) × 100
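A tiny sketch that computes these support and confidence values directly from the five transactions above:

transactions = [
    {"A", "B", "D"}, {"A", "C", "D"}, {"A", "D", "E"},
    {"B", "E", "F"}, {"B", "C", "D", "E", "F"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """P(consequent | antecedent) = support(A U B) / support(A)."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"A", "D"}))                  # 0.6  -> 60%
print(confidence({"A"}, {"D"}))             # 1.0  -> 100%
print(confidence({"D"}, {"A"}))             # 0.75 -> 75%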

Closed Patterns and Max-Patterns:

 A long pattern contains a combinatorial number of sub-patterns,
e.g., {a1, …, a100} contains 2^100 − 1 ≈ 1.27 × 10^30 sub-patterns!
Forming this many association rules is difficult, and mining them all is
useless.

 So solution to reduce the association rules is closed patterns and


Max- patterns.

 An itemset X is closed if X is frequent and there exists no super-pattern
Y ⊃ X with the same support as X.

 An itemset X is a max-pattern if X is frequent and there exists no
frequent super-pattern Y ⊃ X.

The downward closure property of frequent patterns

 Any subset of a frequent itemset must be frequent


 If {beer, diaper, nuts} is frequent, so is {beer, diaper}

 i.e., every transaction having {beer, diaper, nuts} also


contains {beer, diaper}

Apriori Algorithm:

 Apriori pruning (cut/trim) principle: If there is any itemset


which is infrequent, its superset should not be generated/tested!

 Method:

 Initially, scan DB once to get frequent 1-itemset

 Generate length (k+1) candidate itemsets from length k


frequent itemsets

 Test the candidates against DB

 Terminate when no frequent or candidate set can be


generated

Pseudo-code:

Ck: candidate itemset of size k
Lk: frequent itemset of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
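A compact Python sketch of this loop (a simplified, unoptimised illustration rather than a production implementation such as the one in the mlxtend library):

from itertools import combinations

def apriori(transactions, min_support_count):
    """Return {itemset: support_count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    current = {frozenset([i]) for i in items}   # candidate 1-itemsets
    frequent = {}
    k = 1
    while current:
        # Count support of the current candidates against the database.
        counts = {c: sum(c <= t for t in transactions) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support_count}
        frequent.update(level)
        # Self-join Lk with itself to build (k+1)-candidates, then prune any
        # candidate that has an infrequent k-subset (the Apriori property).
        k += 1
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        current = {c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return frequent

transactions = [{"A", "B", "D"}, {"A", "C", "D"}, {"A", "D", "E"},
                {"B", "E", "F"}, {"B", "C", "D", "E", "F"}]
print(apriori(transactions, min_support_count=3))
# e.g. {A}:3, {B}:3, {D}:4, {E}:3, {A, D}:3 as listed earlier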
Important Details of Apriori:

 How to generate candidates?

 Step 1: self-joining Lk

 Step 2: pruning

 How to count supports of candidates?

 Example of Candidate-generation

 L3={abc, abd, acd, ace, bcd}

 Self-joining: L3*L3

 abcd from abc and abd

 acde from acd and ace

 Pruning:

 acde is removed because ade is not in L3

 C4={abcd}

The Apriori Algorithm—An Example

Supmin = 2, Confmin = 50%


In the previous example we found that the following is the max frequent
pattern with min. support:

Itemset sup

{B, C, E} 2

Now we find the Association Rules for the above pattern

 L3= {B,C,E} here min. support is 2

 From L2 we have {B,E} with Min support 3.

So we consider L2 also for Association rules.


Given that the minimum confidence is 50%, all the above rules satisfy the
minimum threshold confidence.

So the Association Rules = { B → CE, C → BE, E → CB, B → E, E → B }

Hence these rules can be used for Basket data analysis, cross-marketing,
catalog design, sale campaign analysis

Association rules
• An association rule analyzes and predicts customer behavior
• Rules are like if/then statements
Example:
Bread → butter
Buys {onions, potatoes} → buys tomatoes
Parts of an Association Rule:
Bread → butter [20%, 45%]
Bread : antecedent
Butter : consequent
20% : support
45% : confidence
Support and Confidence:
A → B
 Support denotes the probability that a transaction contains both A and B.
 Confidence denotes the probability that a transaction containing A also
contains B.
Example
Consider a supermarket:
 Total transactions: 100
 Transactions containing bread: 20
 So, (20/100) × 100 = 20%, which is the support.
 Of those 20 transactions, 9 also contain butter.
So (9/20) × 100 = 45%, which is the confidence.
Classification of Association Rule:
Single Dimensional Association Rule
 Bread → butter
 Dimension: buying
Multidimensional Association Rule
 With 2 or more predicates or dimensions,
 Occupation(I.T.), Age(>22) → buys(laptop)
Hybrid Association Rule
 Time(5 o'clock), buys(tea) → buys(biscuits)
Applications where the association rules are used:
 Web Usage Mining
 Banking
 Bio Informatics
 Market based Analysis
 Credit / debit card analysis
 Product clustering
 Catalog design
Clustering applications and requirements.

Clustering:
• Partitioning the data into subclasses
• It is grouping of similar objects
• Partitioning of data based on similarity
Eg: library
Here the books are grouped by subject, author etc.
Cluster Analysis:
• Cluster: a collection of data objects
– Here objects similar to one another are kept within the same
cluster
– Dissimilar objects are kept in other clusters
• Cluster analysis
– It is the process of finding similarities between data according
to the characteristics found in the data and grouping similar
data objects into clusters.
• Clustering is unsupervised learning, which means there are no
predefined classes.

RAW DATA → CLUSTERING ALGORITHM → CLUSTERS OF DATA
Different Representations Of Clustering
APPLICATIONS OF CLUSTERING
• MARKET RESEARCH
• WWW
• PATTERN RECOGNITION
• IMAGE PROCESSING
• DATA MINING
EXAMPLES OF CLUSTERING
 Search Engine
 Social Network Analysis
 Genetics
 Marketing
Requirements of Clustering in Data Mining
Scalability
We need highly scalable clustering algorithms to deal with large
databases.
Ability to deal with different kinds of attributes
Algorithms should be capable of being applied to any kind of data, such as
interval-based (numerical), categorical, and binary data.

Discovery of clusters with attribute shape


The clustering algorithm should be capable of detecting clusters of
arbitrary shape. It should not be bound to distance measures that tend to
find only spherical clusters of small size.
High dimensionality
The clustering algorithm should be able to handle not only low-dimensional
data but also high-dimensional data.
Ability to deal with noisy data
Databases contain noisy, missing or erroneous data. Some algorithms are
sensitive to such data and may lead to poor quality clusters.
Interpretability
The clustering results should be interpretable, comprehensible, and
usable.
Types of Data in Cluster Analysis
In general, we use two data structures to represent data for clustering.
The two structures are given below.

Data matrix (two modes): n objects described by p variables, stored as an
n × p matrix:

    [ x11 ... x1f ... x1p ]
    [ ...      ...    ... ]
    [ xi1 ... xif ... xip ]
    [ ...      ...    ... ]
    [ xn1 ... xnf ... xnp ]

Dissimilarity matrix (one mode): an n × n table storing the dissimilarity
d(i, j) between every pair of objects; only the lower triangle is needed
because d(i, j) = d(j, i) and d(i, i) = 0:

    [ 0                             ]
    [ d(2,1)  0                     ]
    [ d(3,1)  d(3,2)  0             ]
    [ :       :       :             ]
    [ d(n,1)  d(n,2)  ...   ...   0 ]
Type of data in clustering analysis
 Interval-valued variables
 Binary variables
 Nominal, ordinal, and ratio variables
 Variables of mixed types
Interval-valued variables:
Interval-scaled variables are continuous measurements of a roughly linear
scale. Typical examples include weight and height, latitude and longitude
coordinates. To standardize measurements to make all variables have
equal weights, Mean absolute deviation and z-score are computed by:

where

and

The mean absolute deviation, sf , is more robust to outliers than the


standard deviation. When computing the mean absolute deviation, the
deviations from the mean are not squared; hence, the effect of outliers is
somewhat reduced. There are more robust measures of dispersion, such
as the median absolute deviation.

However, the advantage of using the mean absolute deviation is that the
z-scores of outliers do not become too small. So the outliers are
detectable.

After standardization, or without standardization in certain applications,
the dissimilarity (or similarity) between objects described by interval-
scaled variables is typically computed based on the distance between
each pair of objects, using one of the following measures (a short Python
sketch follows the list):

 Manhattan distance(L1 norm)

 Euclidean distance(L2 norm)

 Minkowski distance (Lp norm)


 Weighted distance
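A minimal sketch of these distance measures in plain Python (no standardization step included; the points x and y are only illustrative):

def manhattan(x, y):
    """L1 norm: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    """L2 norm: square root of the sum of squared differences."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def minkowski(x, y, p):
    """Lp norm: generalises both (p=1 is Manhattan, p=2 is Euclidean)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def weighted_euclidean(x, y, w):
    """Euclidean distance with one weight per variable."""
    return sum(wi * (a - b) ** 2 for wi, a, b in zip(w, x, y)) ** 0.5

x, y = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
print(manhattan(x, y))                       # 7.0
print(euclidean(x, y))                       # 5.0
print(minkowski(x, y, 3))                    # about 4.5
print(weighted_euclidean(x, y, (1, 1, 2)))   # 5.0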

Binary Variables
A binary variable has only two states: 0 or 1, where 0 means that the
variable is absent, and 1 means that it is present.

A binary variable can be divided into two kinds, symmetric and asymmetric.
A binary variable is symmetric if both of its states are equally valuable
and carry the same weight, such as gender; a binary variable is asymmetric
if the outcomes of the states are not equally important, such as the
positive and negative outcomes of a disease test.

Consider the contingency table for binary variables, where, for objects i
and j, q counts the variables that equal 1 for both objects, r counts those
that equal 1 for i and 0 for j, s counts those that equal 0 for i and 1 for
j, and t counts those that equal 0 for both.

For symmetric binary dissimilarity, we can calculate:

    d(i, j) = (r + s) / (q + r + s + t)

For asymmetric binary dissimilarity, we can calculate:

    d(i, j) = (r + s) / (q + r + s)

because, given two asymmetric binary variables, the agreement of two 1s
(a positive match) is considered more significant than that of two 0s
(a negative match), so the negative matches t are ignored. Complementarily,
the asymmetric binary similarity (the Jaccard coefficient) can be computed as:

    sim(i, j) = q / (q + r + s) = 1 − d(i, j)
Categorical, Ordinal, and Ratio-Scaled Variables:
Categorical Variables:

 Nominal variables are also called categorical or qualitative


 Eg: sex, preferred type of chocolate, color
 On nominal data we compute frequency, or percentage
 We can’t compute mean.
Computing dissimilarity between nominal variables is given
below:
 m: no. of matches, p: total no. of variables.

    d(i, j) = (p − m) / p
Ordinal variables:
 Ordinal variables have a meaningful order.
 Eg: rank, satisfaction
 Here the size of the gap between values may mislead us.
 Eg: the difference between ranks 1 and 2 and between ranks 2 and 3 is not
necessarily the same.
 The difference between very satisfied, satisfied and not satisfied is not
the same.
 Like nominal data, for ordinal data we compute frequencies; the mean is
rarely computed.
Dissimilarity matrices (for categorical/nominal and for ordinal data)
Ratio / Interval variables:
 They are also called quantitative, scale, or parametric variables
 Eg. no. of customers, age, weight etc.
 The values may be discrete, or continuous
Eg: no. of customers =20 (discrete)
no of stores = 3 ( discrete)
weight = 2.5kg (continuous)
Vector Objects
To measure the distance between complex objects, it is often desirable to
abandon traditional metric distance computation and introduce a nonmetric
similarity function. There are several ways to define such a similarity
function, s(x, y), to compare two vectors x and y. One popular way is to
define the similarity function as a cosine measure:

    s(x, y) = x^t y / (||x|| · ||y||)

where x^t is the transpose of vector x, ||x|| is the Euclidean norm of
vector x, ||y|| is the Euclidean norm of vector y, and s is essentially the
cosine of the angle between vectors x and y. The Euclidean norm of a vector
x = (x_1, x_2, ..., x_p) is defined as

    ||x|| = sqrt(x_1² + x_2² + … + x_p²)

Conceptually, it is the length of the vector.
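A short sketch of the cosine measure in Python (the vectors are chosen only for illustration):

import math

def cosine_similarity(x, y):
    """s(x, y) = x.y / (||x|| * ||y||), the cosine of the angle between x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

print(cosine_similarity((1, 1, 0, 0), (1, 1, 1, 0)))   # about 0.816: similar directions
print(cosine_similarity((1, 0), (0, 1)))               # 0.0: orthogonal vectors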


Categorization of Major Clustering Methods

Partitioning method. It classifies the data into k groups, which


together satisfy the following requirements:

o Each group must contain at least one object,


o Each object must belong to exactly one group.

Examples are the k-means and k-medoids algorithms. The disadvantage is that
most partitioning methods cluster objects based on the distance between
objects; such methods can find only spherical-shaped clusters and encounter
difficulty in discovering clusters of arbitrary shapes.

Hierarchical methods: A hierarchical method creates a hierarchical


decomposition of the given set of data objects. Hierarchical methods
suffer from the fact that once a step (merge or split) is done, it can
never be undone.

Density-based methods: The general idea is to continue growing the


given cluster as long as the density (number of objects or data points)
in the “neighborhood” exceeds some threshold; that is, for each data
point within a given cluster, the neighborhood of a given radius has to
contain at least a minimum number of points. Such a method can be
used to filter out noise (outliers) and discover clusters of arbitrary
shape.

Grid-based methods: Grid-based methods quantize the object space


into a finite number of cells that form a grid structure. The main
advantage of this approach is its fast processing time, which is typically
independent of the number of data objects and dependent only on the
number of cells in each dimension in the quantized space.

Model-based methods: Model-based methods hypothesize a model


for each of the clusters and find the best fit of the data to the given
model. A model-based algorithm may locate clusters by constructing a
density function that reflects the spatial distribution of the data points.

Partitioning Methods

Given D, a data set of n objects, and k, the number of clusters to form, a


partitioning algorithm organizes the objects into k partitions (k ≤ n),
where each partition represents a cluster. The usually used methods are
k-Means and k-Medoids.

K-Means Algorithm
Refer Unit - I
k-Medoids Algorithm
Instead of taking the mean value of the objects in a cluster as a reference
point, we can pick actual objects to represent the clusters, using one
representative object per cluster. Each remaining object is clustered with
the representative object to which it is the most similar. The partitioning
method is then performed based on the principle of minimizing the sum of
the dissimilarities between each object and its corresponding reference
point. That is, an absolute-error criterion is used, defined as

    E = Σ_{j=1..k} Σ_{p ∈ C_j} |p − o_j|

where E is the sum of the absolute error for all objects in the data set;
p is the point in space representing a given object in cluster C_j ;
and o_j is the representative object of C_j . In general, the algorithm
iterates until, eventually, each representative object is actually the
medoid, or most centrally located object, of its cluster. This is the basis of
the k-medoids method for grouping n objects into k clusters.

The iterative process of replacing representative objects by non


representative objects continues as long as the quality of the resulting
clustering is improved.

Comparing against k-means: k-means requires that all the data lie in a
Euclidean space, with dissimilarity measured by Euclidean distance.
However, not all datasets meet this demand; consider categorical features,
where you cannot exactly say how far apart an apple and a pear are, so no
Euclidean distance formula applies. For k-medoids, you only need the
degrees of dissimilarity, which are represented by a dissimilarity matrix.

Also, the k-medoids method is more robust than k-means in the presence
of noise and outliers, because a medoid is less influenced by outliers or
other extreme values than a mean.
Hierarchical approach for clustering:

This method creates a hierarchical decomposition of the given set of data


objects. We can classify hierarchical methods on the basis of how the
hierarchical decomposition is formed. There are two approaches here −

 Agglomerative Approach

 Divisive Approach

Agglomerative Approach

This approach is also known as the bottom-up approach. In this, we start
with each object forming a separate group. It keeps on merging the objects
or groups that are close to one another. It keeps doing so until all of the
groups are merged into one or until the termination condition holds.

Divisive Approach

This approach is also known as the top-down approach. In this, we start
with all of the objects in the same cluster. In each iteration, a cluster
is split into smaller clusters. This is done until each object is in its
own cluster or the termination condition holds. This method is rigid, i.e.,
once a merging or splitting is done, it can never be undone.

Hierarchical Clustering

It uses a distance matrix as the clustering criterion. This method does not
require the number of clusters k as an input, but it needs a termination
condition.
(Diagram: AGNES merges bottom-up, step by step: a and b into ab, d and e
into de, then c with de into cde, and finally ab with cde into abcde;
DIANA splits top-down in the reverse order, from step 4 back to step 0.)
AGNES (Agglomerative Nesting)

 It is implemented in statistical analysis packages, e.g., Splus


 Use the Single-Link method and the dissimilarity matrix.
 Merge nodes that have the least dissimilarity
 Go on in a non-descending fashion
 Eventually all nodes belong to the same cluster
Dendrogram: Shows How the Clusters are Merged
 Decompose the data objects into several levels of nested
partitionings (a tree of clusters), called a dendrogram.
 A clustering of the data objects is obtained by cutting the
dendrogram at the desired level; then each connected
component forms a cluster.
Dendrogram showing a sample cluster
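A minimal sketch of agglomerative (AGNES-style) clustering with single linkage using SciPy, assuming it is available; the five points below are only illustrative:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Five illustrative 2-D points: two tight groups plus one in-between point.
points = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [2.5, 2.5]])

# Single-link agglomerative clustering on the pairwise Euclidean dissimilarities.
merges = linkage(points, method="single")
print(merges)   # each row: the two clusters merged, their distance, and the new cluster size

# Cut the dendrogram so that 2 clusters remain.
labels = fcluster(merges, t=2, criterion="maxclust")
print(labels)   # cluster label for each point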

DIANA (Divisive Analysis):

 It is implemented in statistical analysis packages, e.g., Splus

 Inverse order of AGNES

 Eventually each node forms a cluster on its own

What is a Minimum Spanning Tree?


A minimum spanning tree is a special kind of tree that minimizes the
lengths (or “weights”) of the edges of the tree. An example is a cable
company wanting to lay line to multiple neighborhoods; by minimizing the
amount of cable laid, the cable company will save money.
In a tree, exactly one path joins any two vertices. A spanning tree of a graph is
a tree that:
 Contains all the original graph’s vertices.
 Reaches out to (spans) all vertices.
 Is acyclic. In other words, the subgraph doesn’t contain any cycles.
Even the simplest of graphs can contain many spanning trees. For
example, the following graph:

…has many possibilities for spanning trees, including:

Finding Minimum Spanning Trees


As we can probably imagine, larger graphs have more nodes and many more
possibilities for subgraphs. The number of subgraphs can quickly reach
into the millions, or billions, making it very difficult (and sometimes
impossible) to find the minimum spanning tree by brute force.
A few popular algorithms for finding this minimum spanning tree include
Kruskal’s algorithm and Prim’s algorithm.
Kruskal’s algorithm example
Find the edge with the least weight and highlight it. For this example
graph, we have highlighted the top edge (from A to C) in red. It has the
lowest weight (of 1):

Find the next edge with the lowest weight and highlight it:

Continue selecting the lowest-weight edges, skipping any edge that would create a cycle, until all nodes are in the same tree.

The finished minimum spanning tree for this example looks like this:

What is Prim’s Algorithm?


Prim’s algorithm is one way to find a minimum spanning tree (MST).
A minimum spanning tree (shown in red) minimizes the edges (weights) of
a tree.

How to Run Prim’s Algorithm


Step 1: Choose a random node and highlight it. For this example, I’m
choosing node C.

Step 2: Find all of the edges that go to un-highlighted nodes. For this
example, node C has three edges with weights 1, 2, and 3. Highlight the
edge with the lowest weight. For this example, that’s 1.

Step 3: Highlight the node you just reached (in this example, that’s node
A).
Step 4: Look at all of the nodes highlighted so far (in this example, that’s
A And C). Highlight the edge with lowest weight (in this example, that’s
the edge with weight 2).
Note: if you have more than one edge with the same weight, pick a
random one.

Step 5: Highlight the node you just reached.

Step 6: Highlight the edge with the lowest weight. Choose from all of the
edges that:
1. Come from any of the highlighted nodes.
2. Reach a node that you haven’t highlighted yet.
Step 7: Repeat steps 5 and 6 until you have no more un-highlighted
nodes. For this particular example, the specific steps remaining are:
 a. Highlight node E.
 b. Highlight edge 3 and then node D.
 c. Highlight edge 5 and then node B.
 d. Highlight edge 6 and then node F.
 e. Highlight edge 9 and then node G.
The finished graph is shown at the bottom right of this image:
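A short Python sketch of Prim's algorithm on an adjacency list. The exact graph from the figures is not reproduced here, so the weights below are only an assumed, illustrative graph:

import heapq

def prim(graph, start):
    """Grow an MST from `start`, always taking the cheapest edge to a new node."""
    visited = {start}
    mst = []
    # Priority queue of candidate edges (weight, from_node, to_node).
    edges = [(w, start, v) for v, w in graph[start]]
    heapq.heapify(edges)
    while edges and len(visited) < len(graph):
        w, u, v = heapq.heappop(edges)
        if v in visited:            # this edge would close a cycle, skip it
            continue
        visited.add(v)
        mst.append((u, v, w))
        for nxt, nw in graph[v]:    # add the newly reached node's edges as candidates
            if nxt not in visited:
                heapq.heappush(edges, (nw, v, nxt))
    return mst

# Assumed illustrative weighted graph (undirected, stored as adjacency lists).
graph = {
    "A": [("C", 1), ("B", 5)],
    "B": [("A", 5), ("C", 4), ("D", 5)],
    "C": [("A", 1), ("B", 4), ("D", 3), ("E", 2)],
    "D": [("B", 5), ("C", 3), ("E", 7)],
    "E": [("C", 2), ("D", 7)],
}
print(prim(graph, "C"))   # [('C','A',1), ('C','E',2), ('C','D',3), ('C','B',4)]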

Real Life Applications


Minimum spanning trees are used for network designs (i.e. telephone or
cable networks). They are also used to find approximate solutions for
complex mathematical problems like the Traveling Salesman Problem.
Other, diverse applications include:
 Cluster Analysis.
 Real-time face tracking and verification (i.e. locating human faces in
a video stream).
 Protocols in computer science to avoid network cycles.
 Entropy based image registration.
 Max bottleneck paths.
 Dithering (adding white noise to a digital recording in order to reduce
distortion).
