CSE3013 Module6


"Artificial Intelligence"

Learning Systems

Dr. Rabindra Kumar Singh


Associate Professor (Sr.)
School of Computer Science and Engineering
VIT - Chennai

Dr. Rabindra Kumar Singh "Artificial Intelligence" 1/ 127


Contents...

Introduction to Machine Learning
Traditional Learning vs Machine Learning
Types of Learning (Machine)
Features and Applications of ML
AI Versus ML
Types of Machine Learning
Supervised Learning
Unsupervised Learning
Semi-Supervised Learning
Reinforcement Learning
Decision Trees and their types


Introduction

Definition-1
It is a system of computer algorithms that can learn from examples through
self-improvement, without being explicitly coded by a programmer.

Definition-2
It is about teaching computers to learn from data in order to make decisions,
make predictions, or identify patterns without being explicitly programmed to.

How Does Machine Learning Work?


Machine Learning vs. Traditional Programming



Machine Learning...!

Definition-3
Machine learning enables a machine to automatically learn from data, improve
performance from experiences, and predict things without being explicitly
programmed.

A machine learning system learns from historical data, builds prediction
models, and, whenever it receives new data, predicts the output for it.
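
As a minimal sketch of "learn from historical data, then predict for new data", the following fits a straight line by ordinary least squares and uses it on an unseen input. The data values and the names `fit_line` and `predict` are illustrative assumptions, not part of the slides.

```python
# Minimal sketch: learn a straight line from historical data, then predict.
# All numbers and names here are made up for illustration.

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(model, x):
    """Apply the learned line to a new input."""
    slope, intercept = model
    return slope * x + intercept

# Historical data: hours studied -> exam score (hypothetical).
hours = [1, 2, 3, 4, 5]
scores = [52, 54, 56, 58, 60]

model = fit_line(hours, scores)   # learning step
new_score = predict(model, 6)     # prediction for an unseen input
print(model, new_score)
```

Nothing in the code states the rule "score = 2 * hours + 50"; it is recovered from the examples, which is the point of the definitions above.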



Introduction to ML

Features of Machine Learning

Machine learning uses data to detect patterns in a given dataset.
It can learn from past data and improve automatically.
It is a data-driven technology.
Machine learning is similar to data mining in that it also deals with
huge amounts of data.

Types of Machine Learning

Supervised Learning (Classification, Regression)
Unsupervised Learning (Clustering, Association)
Semi-supervised Learning (a class of supervised learning)
Reinforcement Learning


Types of ML


Supervised Learning
In supervised learning, we provide the data along with the desired output (i.e.,
labelled data). For instance, if we want our system to learn cat detection, we
collect thousands of images, draw a bounding box around each cat, and feed the
entire labelled dataset to the machine so that it can learn from the examples.


Supervised Learning...



Unsupervised Learning

Here, we provide data and let the machine find the patterns in the dataset.
For instance, we provide three kinds of shapes (circles, triangles, and squares)
and let the machine group them. Such a technique is called clustering.


Unsupervised Learning...



Semi-Supervised Learning

Semi-supervised learning is a class of supervised learning tasks and techniques
that also make use of unlabelled data for training. Here, the machine learns
from partially labelled data and applies what it has learned to the unlabelled data.

For instance, a photo-storage service groups all the photos of an individual,
so you only have to label one image and all the rest are labelled with the
same name because they show the same person.


Reinforcement Learning

Here, the machine is commonly referred to as an agent, and the agent receives
a reward (or a penalty) for each of its actions. It then learns which actions
are best in order to maximize the rewards and minimize the penalties.


Overfitting and Underfitting

After being trained on data, the goal of our model is to generalize to
unseen data as accurately as possible.
If the model yields very accurate results on the training data but fails to
generalize to unseen data, it is called over-fitting, because the model
over-fits the training data.
If the model does not predict accurately even on the training data, the model
has not captured the underlying pattern, which is known as under-fitting.
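
The difference can be seen by comparing training error with test error. The sketch below is only an illustration under assumed data: a constant model (too simple), a "memorizer" that returns the label of the nearest training point, and a straight-line fit. The dataset, the three models, and the use of mean squared error are all assumptions made for this example.

```python
# Sketch: compare training error and test error for three hypothetical models.

def mse(model, data):
    """Mean squared error of a model (a function x -> y) on (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

# Training data: roughly y = 2x with alternating noise of +/- 0.5 (made up).
train = [(0, 0.5), (1, 1.5), (2, 4.5), (3, 5.5), (4, 8.5), (5, 9.5)]
# Unseen test data lying on the noise-free line y = 2x.
test = [(0.4, 0.8), (1.6, 3.2), (2.4, 4.8), (3.6, 7.2), (4.4, 8.8)]

# Under-fitting candidate: always predict the mean training label.
mean_y = sum(y for _, y in train) / len(train)
def constant_model(x):
    return mean_y

# Over-fitting candidate: memorize the training set (nearest-point lookup).
def memorizer(x):
    _, nearest_y = min(train, key=lambda p: abs(p[0] - x))
    return nearest_y

# Reasonable candidate: least-squares straight line.
mean_x = sum(x for x, _ in train) / len(train)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in train)
         / sum((x - mean_x) ** 2 for x, _ in train))
intercept = mean_y - slope * mean_x
def linear_model(x):
    return slope * x + intercept

models = {"constant": constant_model, "memorizer": memorizer, "linear": linear_model}
results = {name: (mse(m, train), mse(m, test)) for name, m in models.items()}
for name, (train_err, test_err) in results.items():
    print(f"{name:10s} train MSE = {train_err:.3f}   test MSE = {test_err:.3f}")
```

On this data the memorizer has zero training error but a clearly larger test error (the over-fitting pattern), while the constant model is poor on both sets (the under-fitting pattern).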

Challenges encountered in machine learning

Insufficient data
Poor-quality data: reduce noise and discard outliers (points that differ from
all the other members of the same group)
Irrelevant features


Applications of ML I

Augmentation:
Machine learning assists humans with their day-to-day tasks, personally or
commercially, without having complete control of the output. It is used in
virtual assistants, data analysis, and software solutions. The primary use
is to reduce errors due to human bias.

Automation:
Machine learning works entirely autonomously in a field, without the need
for any human intervention. For example, robots perform the essential
process steps in manufacturing plants.

Finance industry:
Machine learning is growing in popularity in the finance industry. Banks
mainly use ML to find patterns in their data, but also to prevent fraud.


Applications of ML II

Government organizations:
Governments make use of ML to manage public safety and utilities. Take the
example of China, with its large-scale face recognition: the government uses
artificial intelligence to deter jaywalking.

Healthcare industry:
Healthcare was one of the first industries to use machine learning, with
image detection.

Marketing:
AI is broadly used in marketing thanks to abundant access to data. Before
the age of mass data, researchers developed advanced mathematical tools such
as Bayesian analysis to estimate the value of a customer. With the boom of
data, marketing departments rely on AI to optimize customer relationships
and marketing campaigns.


Applications of Machine Learning



History of Machine Learning



Machine Learning Life Cycle I



Machine Learning Life Cycle II

1 Gathering Data: the first step is to identify the data sources and obtain
the data needed for the problem. The quantity and quality of the collected
data determine the efficiency of the output. In general, the more data we
have, the more accurate the prediction will be.
Identify the various data sources (structured/unstructured;
files/databases/Internet)
Collect the data
Integrate the data obtained from the different sources into a coherent set
of data (the dataset)

2 Data Preparation: a step where we put our data into a suitable place
and prepare it for use in machine learning training.
Data exploration: used to understand the nature of the data that we have
to work with. We need to understand the characteristics, format, and
quality of the data. A better understanding of the data leads to a more
effective outcome. Here we find correlations, general trends, and outliers.
Data pre-processing: the next step is pre-processing the data for
analysis.


Machine Learning Life Cycle III

3 Data Wrangling: the process of cleaning the data, selecting the
variables to use, and transforming the data into a proper format to make it
more suitable for analysis (to avoid a negative effect on the quality of the
outcome).
Collected data may have various issues, including:
Missing values
Duplicate data
Invalid data
Noise
So, we use various filtering techniques to clean the data.

4 Analysis of Data: build an ML model to analyze the data using various
analytical techniques and review the outcome. It involves:
Selection of analytical techniques (Classification, Regression, Cluster
Analysis, Association...)
Building models
Review the result
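
The data-wrangling issues listed in step 3 (missing values, duplicates, invalid data) can be sketched with a small filter. The record layout (`age`, `city`), the valid age range, and the function name `clean` are assumptions chosen for illustration only.

```python
# Sketch of simple data wrangling: drop missing, invalid, and duplicate rows.
# The fields and the validity rule (0 <= age <= 120) are hypothetical.

def clean(records):
    """Return records with missing/invalid values and exact duplicates removed."""
    seen = set()
    cleaned = []
    for row in records:
        age, city = row.get("age"), row.get("city")
        if age is None or city is None:                       # missing value
            continue
        if not isinstance(age, int) or not 0 <= age <= 120:   # invalid value
            continue
        key = (age, city)
        if key in seen:                                       # duplicate row
            continue
        seen.add(key)
        cleaned.append({"age": age, "city": city})
    return cleaned

raw = [
    {"age": 34, "city": "Chennai"},
    {"age": None, "city": "Delhi"},     # missing age
    {"age": 34, "city": "Chennai"},     # duplicate of the first row
    {"age": -5, "city": "Mumbai"},      # invalid age
    {"age": 51, "city": "Pune"},
]
cleaned = clean(raw)
print(cleaned)
```

Dropping rows is only one possible policy; filling in missing values is another common choice and is not shown here.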



Machine Learning Life Cycle IV

5 Train Model: using datasets, train the model to improve its
performance and obtain a better outcome for the problem. Training a model is
required so that it can learn the various patterns, rules, and
features.

6 Test Model: check the accuracy of the trained model by providing
a test dataset to it. Testing the model determines its percentage accuracy
against the requirements of the project or problem.

7 Deployment: the last step of the machine learning life cycle is deployment,
where we deploy the model in a real-world system.
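
Steps 5 and 6 can be sketched with a deliberately simple model: a single threshold learned from training data and then scored on held-out test data. The dataset, the threshold rule, and the names `train_threshold` and `accuracy` are assumptions for illustration; a real project would normally shuffle the data before splitting.

```python
# Sketch of "train model" and "test model" with a one-parameter classifier.
# Rule: predict 1 (pass) if x >= threshold, else 0 (fail). Data is made up.

def train_threshold(data):
    """Choose the threshold (among observed x values) with the best training accuracy."""
    best_t, best_correct = None, -1
    for t in sorted(x for x, _ in data):
        correct = sum((x >= t) == bool(label) for x, label in data)
        if correct > best_correct:
            best_t, best_correct = t, correct
    return best_t

def accuracy(threshold, data):
    """Fraction of examples the threshold rule classifies correctly."""
    return sum((x >= threshold) == bool(label) for x, label in data) / len(data)

# (hours studied, passed?) pairs; the last three are held out for testing.
dataset = [(1, 0), (2, 0), (3, 0), (4, 1), (5, 1), (6, 1), (7, 1),
           (2.5, 0), (4.5, 1), (3.5, 0)]
train_set, test_set = dataset[:7], dataset[7:]

threshold = train_threshold(train_set)          # step 5: train
test_accuracy = accuracy(threshold, test_set)   # step 6: test
print(threshold, test_accuracy)
```

The test set is kept separate from the training set so that the reported accuracy reflects performance on data the model has not seen.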



AI versus ML

Artificial Intelligence (AI) compared with Machine Learning (ML):

1. AI: AI is a technology which enables a machine to simulate human behavior.
   ML: ML is a subset of AI which allows a machine to automatically learn from past data without being explicitly programmed.
2. AI: The goal is to make a smart system like humans to solve complex problems.
   ML: The goal is to allow machines to learn from data so that they can give accurate output.
3. AI: We make intelligent systems to perform any task like a human.
   ML: We teach machines with data to perform a particular task and give an accurate result.
4. AI: ML and DL are the two main subsets of AI.
   ML: Deep learning is a main subset of ML.
5. AI: AI has a very wide scope.
   ML: Machine learning has a limited scope.
6. AI: AI is working to create an intelligent system which can perform various complex tasks.
   ML: Machine learning is working to create machines that can perform only those specific tasks for which they are trained.
7. AI: An AI system is concerned with maximizing the chances of success.
   ML: Machine learning is mainly concerned with accuracy and patterns.
8. AI: The main applications of AI are Siri, customer support using chatbots, expert systems, online game playing, intelligent humanoid robots, etc.
   ML: The main applications of ML are online recommender systems, Google search algorithms, Facebook auto friend tagging suggestions, etc.
9. AI: AI can be divided into three types: Weak AI, General AI, and Strong AI.
   ML: ML can be divided into Supervised learning, Unsupervised learning, and Reinforcement learning.
10. AI: It includes learning, reasoning, and self-correction.
    ML: It includes learning and self-correction when introduced to new data.
11. AI: AI deals with structured, semi-structured, and unstructured data.
    ML: ML deals with structured and semi-structured data.

Data Sets I

What is a Dataset?
A dataset is a collection of data in which the data is arranged in some order.
A dataset can contain anything from a simple array to a database
table.

A tabular dataset can be understood as a database table or matrix, where
each column corresponds to a particular variable and each row
corresponds to one record of the dataset.
The most widely supported file type for a tabular dataset is the
"Comma-Separated Values" (CSV) file.
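
As a small illustration of a tabular dataset in CSV form, the sketch below parses CSV text with Python's standard `csv` module. The column names and values are invented for the example, and the text is read from a string rather than a file so that the block is self-contained.

```python
# Sketch: a tiny tabular dataset in CSV form, read with the standard library.
import csv
import io

# Hypothetical housing data: each column is a variable, each row a record.
csv_text = """area_sqft,bedrooms,price
850,2,40000
1200,3,65000
1500,3,80000
"""

reader = csv.DictReader(io.StringIO(csv_text))
rows = list(reader)              # one dict per record
columns = reader.fieldnames      # the variable names from the header row
prices = [int(row["price"]) for row in rows]   # CSV values arrive as strings

print(columns)
print(len(rows), "records; prices:", prices)
```

Note that every value read from a CSV file is a string, so numeric columns have to be converted explicitly.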



Data Sets II

Types of data in datasets


Numerical data: Such as house price, temperature, etc.
Categorical data: Such as Yes/No, True/False, Blue/green, etc.
Ordinal data: similar to categorical data, but the categories have a
meaningful order and can be compared (e.g., low/medium/high).
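
One common way to hand these three kinds of data to a model is sketched below: numerical values are used as they are, categorical values are one-hot encoded, and ordinal values are mapped to integers that preserve their order. The record, the category lists, and the helper `one_hot` are assumptions made for this example.

```python
# Sketch: turning numerical, categorical, and ordinal values into features.

PRIORITY_ORDER = {"low": 0, "medium": 1, "high": 2}   # ordinal: order matters
COLOURS = ["blue", "green"]                           # categorical: no order

def one_hot(value, categories):
    """Encode a categorical value as a list with a single 1."""
    return [1 if value == c else 0 for c in categories]

# A hypothetical record with one field of each kind.
record = {"temperature_c": 31.5, "colour": "green", "priority": "medium"}

features = (
    [record["temperature_c"]]                  # numerical: used directly
    + one_hot(record["colour"], COLOURS)       # categorical: one-hot
    + [PRIORITY_ORDER[record["priority"]]]     # ordinal: ordered integer
)
print(features)
```

Mapping an unordered category such as colour to 0, 1, 2 would suggest an order that does not exist, which is why it is one-hot encoded here instead.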
Note: Real-world datasets are often huge, which makes them difficult to manage
and process at first. Therefore, to practice ML algorithms, we can use
a small dummy dataset.

Need of Dataset: We need a lot of data to work on ML projects because
ML/AI models cannot be trained without data. Gathering and preparing the dataset
is one of the most important aspects of building an ML/AI project.


Data Sets III

During the development of an ML project, the developers rely completely on
the datasets. In building ML applications, datasets are divided into two parts:
Training dataset
Test dataset

Note: These datasets are often large, so downloading them requires a fast
internet connection.


Data Sets IV

Popular sources for Machine Learning datasets


Kaggle Datasets: https://www.kaggle.com/datasets
UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/index.php
Datasets via AWS: https://registry.opendata.aws/
Google’s Dataset Search Engine: https://toolbox.google.com/datasetsearch
Microsoft Datasets: https://msropendata.com/
Awesome Public Dataset Collection: https://github.com/awesomedata/awesome-public-datasets
Computer Vision Datasets: https://www.visualdata.io/
Scikit-learn datasets: https://scikit-learn.org/stable/datasets/index.html
Government Datasets:
https://data.gov.in/
https://www.data.gov/
https://data.europa.eu/euodp/data/dataset

Types of Machine Learning



1. Supervised Learning

Supervised learning is a type of ML in which machines are trained using
well-labelled training data and then predict the output based on that data.

Labelled data means that the input data has already been tagged with the
appropriate output.

In supervised learning, the training data provided to the machines acts as a
supervisor, teaching the machines how to predict the output correctly. It
applies the same concept as a student learning under the supervision of
a teacher.

Supervised learning is the process of providing correct input and output data to
a machine learning model. The goal is to find a mapping function that
maps the input variable (X) to the output variable (Y).

In the real world, supervised learning can be used for risk assessment, image
classification, fraud detection, spam filtering, etc.


How Supervised Learning Works?

Models are trained using labelled datasets, where the model learns about each
type of data. After the training process is completed, the model is tested on
test data (a held-out part of the labelled dataset that was not used for
training) and predicts the output.

The working of supervised learning can be understood from the following
example:

Assume we have a dataset with various shapes such as squares, rectangles,
triangles, and other polygons. The first step is to train the model on each
shape:
If the given shape has four sides, and all the sides are equal, then it is
labelled as a square.
If the given shape has three sides, then it is labelled as a triangle.
If the given shape has six equal sides, then it is labelled as a hexagon.
After training, we test the model using the test dataset, and the
model’s task is to identify each shape.

The machine has already been trained on all types of shapes, so when it
encounters a new one, it classifies it based on its number of sides and predicts the
output.
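
The shape example can be sketched as a tiny nearest-neighbour classifier: each shape is described by two features (number of sides, and whether all sides are equal), and a new shape receives the label of the closest labelled example. The feature encoding, the training examples, and the choice of nearest-neighbour are assumptions for illustration; the slides do not prescribe a particular algorithm.

```python
# Sketch: classify shapes from labelled examples with a 1-nearest-neighbour rule.
# Features: (number of sides, 1 if all sides are equal else 0).

def distance(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def predict(training, features):
    """Return the label of the training example closest to the given features."""
    nearest = min(training, key=lambda example: distance(example[0], features))
    return nearest[1]

# Labelled training data (hypothetical).
training = [
    ((4, 1), "square"),
    ((4, 0), "rectangle"),
    ((3, 0), "triangle"),
    ((3, 1), "triangle"),
    ((6, 1), "hexagon"),
]

# Test shapes whose labels the model has to work out.
test_shapes = [(3, 1), (4, 1), (6, 1), (4, 0)]
predictions = [predict(training, shape) for shape in test_shapes]
print(predictions)
```

The labels come only from the training examples; no rule such as "four equal sides means square" is written in the code.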


Steps Involved in Supervised Learning

First, determine the type of training dataset.
Collect the labelled training data.
Split the labelled dataset into a training dataset, a validation dataset, and
a test dataset.
Determine the input features of the training dataset, which should carry
enough information for the model to predict the output accurately.
Determine a suitable algorithm for the model, such as a support vector
machine, decision tree, etc.
Execute the algorithm on the training dataset. Sometimes we need
a validation set, held out from the training data, to tune the control
parameters.
Evaluate the accuracy of the model on the test set. If the model
predicts the correct outputs, the model is accurate.


Advantages and Disadvantages of Supervised Learning

Advantages of Supervised learning

With the help of supervised learning, the model can predict the output on
the basis of prior experiences.
In supervised learning, we can have an exact idea about the classes of
objects.
Supervised learning model helps us to solve various real-world problems
such as fraud detection, spam filtering, etc.

Disadvantages of Supervised learning

Supervised learning models are not suitable for handling very complex tasks.
Supervised learning may fail to predict the correct output if the test data
differs from the training dataset.
Training requires a lot of computation time.
In supervised learning, we need enough knowledge about the classes of
objects.


Types of Supervised Learning

1 Classification: used when the output variable is categorical, which
means it takes one of a fixed set of classes, such as Yes/No, Male/Female,
True/False, etc.
Random Forest
Decision Trees
Logistic Regression
Support Vector Machines
2 Regression: used when there is a relationship between the input variable
and the output variable. It is used for the prediction of continuous
variables, as in weather forecasting, market trends, etc.
Linear Regression
Regression Trees
Non-Linear Regression
Bayesian Linear Regression
Polynomial Regression


2. Unsupervised Learning

As we know, supervised ML is a type of learning in which models are trained using
labelled data under the supervision of training data.

However, there may be many cases where we do not have labelled data and
must find hidden patterns in the given dataset. Unsupervised learning
techniques are required to solve such cases in machine learning.

It is an ML technique in which models are not supervised using a training dataset.
Instead, the models themselves find the hidden patterns and insights in the given data. It
can be compared to the learning that takes place in the human brain while
learning new things.

It is a type of ML in which models are trained using an unlabelled dataset and are
allowed to act on that data without any supervision.

The goal of unsupervised learning is to find the underlying structure of the
dataset, group the data according to similarities, and represent the
dataset in a compressed format.


Example

Suppose we are given a dataset containing images of various types of cats and dogs. The
algorithm is never trained on labels for the given dataset, so it has no idea about the
dataset’s characteristics.

The task of the algorithm is to identify the image features on its own. It
does this by clustering the image dataset into groups according to the
similarities between images.

Why use Unsupervised Learning?

It is helpful for finding useful insights from the data.
It is similar to how a human learns to think from their own experiences,
which makes it closer to real AI.
It works on unlabelled and uncategorized data, which makes unsupervised
learning more important.
In the real world, we do not always have input data with the corresponding
output, so to solve such cases we need unsupervised learning.


Working of Unsupervised Learning

Here, the input data is unlabelled, i.e., it is not categorized and the corresponding
outputs are not given.

This unlabelled data is fed to the ML model in order to train it. First, the model
interprets the raw data to find the hidden patterns in it, and then
applies a suitable algorithm such as k-means clustering, hierarchical clustering, etc.

Once the suitable algorithm is applied, it divides the data objects
into groups according to the similarities and differences between the objects.
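
As a sketch of how such an algorithm groups unlabelled data, the following is a minimal k-means loop on one-dimensional points. The points, the starting centroids, and the fixed iteration count are assumptions for illustration; practical implementations usually choose starting centroids randomly and stop when assignments no longer change.

```python
# Sketch: k-means clustering on 1-D points (no labels are used anywhere).

def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            idx = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[idx].append(p)
        centroids = [
            sum(c) / len(c) if c else centroids[i]   # keep an empty cluster's centroid
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Unlabelled data with two visible groups (made-up values).
points = [1.0, 1.5, 2.0, 8.0, 9.0, 10.0]
centroids, clusters = k_means(points, centroids=[0.0, 5.0])
print(centroids, clusters)
```

The algorithm is told only how many groups to look for (two starting centroids); which point belongs to which group is discovered from the similarities between the points.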



Types of Unsupervised Learning

Clustering: a method of grouping objects into clusters such that
objects with the most similarities remain in one group and have few or no
similarities with the objects of another group. It finds the commonalities
between the data objects and categorizes them according to the presence and
absence of those commonalities.
Association: used for finding relationships between variables in
a large database. It determines the sets of items that occur together in
the dataset, which makes marketing strategies more effective. For example, people
who buy item X (say, bread) also tend to purchase item Y
(butter/jam). A typical example of association rules is Market
Basket Analysis.
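
The bread-and-butter example can be made concrete with the two standard measures used in market basket analysis: support (how often an itemset appears) and confidence (how often the rule holds when its left-hand side appears). The transactions below are invented, and the helper names `support` and `confidence` are assumptions for this sketch.

```python
# Sketch: support and confidence for the rule {bread} -> {butter}.

# Hypothetical shopping baskets.
transactions = [
    {"bread", "butter"},
    {"bread", "jam"},
    {"bread", "butter", "milk"},
    {"milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

bread_butter_support = support({"bread", "butter"})
rule_confidence = confidence({"bread"}, {"butter"})
print(bread_butter_support, rule_confidence)
```

Here bread and butter appear together in 2 of 4 baskets (support 0.5), and 2 of the 3 baskets containing bread also contain butter (confidence 2/3). Algorithms such as Apriori search for all rules whose support and confidence exceed chosen thresholds.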



Unsupervised Learning Algorithms

K-means clustering
KNN (k-nearest neighbors; note that it is usually classed as a supervised method)
Hierarchical clustering
Anomaly detection
Neural Networks
Principal Component Analysis
Independent Component Analysis
Apriori algorithm
Singular value decomposition


Advantages and Disadvantages of Unsupervised Learning

Advantages of Unsupervised learning

Unsupervised learning is used for more complex tasks than
supervised learning because, in unsupervised learning, we do not have
labelled input data.
Unsupervised learning is preferable when labels are scarce, as it is easier
to get unlabelled data than labelled data.

Disadvantages of Unsupervised learning

Unsupervised learning is intrinsically more difficult than supervised learning,
as it does not have corresponding outputs.
The result of an unsupervised learning algorithm might be less accurate,
as the input data is not labelled and the algorithm does not know the exact output
in advance.


Supervised Versus Unsupervised I

Supervised learning compared with unsupervised learning:

1. Supervised: These algorithms are trained using labelled data.
   Unsupervised: They are trained using unlabelled data.
2. Supervised: The model takes direct feedback to check whether it is predicting the correct output.
   Unsupervised: It does not take any feedback.
3. Supervised: The model predicts the output.
   Unsupervised: It finds the hidden patterns in data.
4. Supervised: Input data is provided to the model along with the output.
   Unsupervised: Only input data is provided to the model.
5. Supervised: The goal is to train the model so that it can predict the output when it is given new data.
   Unsupervised: The goal is to find the hidden patterns and useful insights in the unknown dataset.
6. Supervised: It needs supervision to train the model.
   Unsupervised: It does not need any supervision to train the model.
7. Supervised: It can be categorized into Classification and Regression problems.
   Unsupervised: It can be categorized into Clustering and Association problems.
8. Supervised: It can be used for those cases where we know the input and its corresponding output.
   Unsupervised: It can be used for those cases where we have only input data and no corresponding output data.
9. Supervised: It generally produces an accurate result.
   Unsupervised: It may give a less accurate result.
10. Supervised: It is not close to true AI, as we first train the model on each data item, and only then can it predict the correct output.
    Unsupervised: It is closer to true AI, as it learns in the way a child learns daily routine things from experience.
11. Supervised: It includes various algorithms such as Linear Regression, Logistic Regression, Support Vector Machine, Multi-class Classification, Decision tree, Bayesian Logic, etc.
    Unsupervised: It includes various algorithms such as Clustering and the Apriori algorithm.

Reinforcement Learning



Introduction...!

The best way to train your dog is by using a reward system. You give the dog a
treat when it behaves well, and you chastise it when it does something wrong.

The same policy can be applied to ML models too. This type of learning method,
where we use a reward system to train our model, is called Reinforcement
Learning.

Need for Reinforcement Learning

A major drawback of ML is that a tremendous amount of data is needed to
train models. The more complex a model, the more data it may require. But
this data may not exist, may not be accessible, may be unreliable, may
contain false or missing values, or may be outdated.

Also, learning from a small subset of actions will not help expand the vast
realm of solutions that may work for a particular problem. Machines need to
learn to perform actions by themselves and not just learn from humans.

Reinforcement learning addresses these problems. Here, we
introduce our model to a controlled environment which is modeled after the
problem statement to be solved, instead of using actual data to solve it.
What is Reinforcement Learning? I

It is a sub-branch of ML that trains a model to return an optimum solution for
a problem by taking a sequence of decisions by itself.

We model an environment after the problem statement. The model interacts
with this environment and comes up with solutions on its own, without
human interference.

To push it in the right direction, we simply give it a positive reward if it
performs an action that brings it closer to its goal, or a negative reward if it
moves away from its goal.

Consider a dog that we have to house train. Here, the dog is the agent and the
house is the environment.


What is Reinforcement Learning? II

We can get the dog to perform various actions by offering incentives such as
dog biscuits as a reward.

The dog will follow a policy to maximize its reward and hence will follow every
command and might even learn a new action, like begging, all by itself.



What is Reinforcement Learning? III

The dog will also want to run around and play and explore its environment.
This quality of a model is called Exploration.

The tendency of the dog to maximize rewards is called Exploitation.

There is always a tradeoff between exploration and exploitation, as
exploration actions may lead to lower rewards.
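
One common way to balance the two is the epsilon-greedy rule, sketched below on a small multi-armed bandit: with probability epsilon the agent explores a random action, otherwise it exploits the action with the best estimated reward. The reward means, the value of epsilon, and the function names are assumptions for illustration; the slides do not name a specific strategy.

```python
# Sketch: epsilon-greedy action selection on a hypothetical 3-armed bandit.
import random

def epsilon_greedy(estimates, epsilon, rng):
    """Explore with probability epsilon, otherwise exploit the best estimate."""
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))                       # exploration
    return max(range(len(estimates)), key=lambda a: estimates[a])  # exploitation

def run(true_means, epsilon, steps, seed=0):
    """Simulate the agent; rewards are noisy samples around each action's mean."""
    rng = random.Random(seed)
    estimates = [0.0] * len(true_means)   # running average reward per action
    counts = [0] * len(true_means)        # how often each action was taken
    for _ in range(steps):
        action = epsilon_greedy(estimates, epsilon, rng)
        reward = rng.gauss(true_means[action], 1.0)
        counts[action] += 1
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates, counts

estimates, counts = run(true_means=[0.2, 0.8, 0.5], epsilon=0.1, steps=1000)
print(estimates, counts)
```

With epsilon set to 0 the agent never explores and may stay with a poor early choice; with epsilon set to 1 it never uses what it has learned. Values in between trade some immediate reward for better estimates.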



Supervised vs Unsupervised vs Reinforcement Learning

Table: Differences between Supervised, Unsupervised, and Reinforcement Learning

1. Supervised: The data provided is labelled, with output values specified.
   Unsupervised: The data provided is unlabelled; the outputs are not specified, and the machine makes its own predictions.
   Reinforcement: The machine learns from its environment using rewards and errors.
2. Supervised: Used to solve regression and classification problems.
   Unsupervised: Used to solve association and clustering problems.
   Reinforcement: Used to solve reward-based problems.
3. Supervised: Labelled data is used.
   Unsupervised: Unlabelled data is used.
   Reinforcement: No predefined data is used.
4. Supervised: External supervision.
   Unsupervised: No supervision.
   Reinforcement: No supervision.
5. Supervised: Solves problems by mapping labelled input to known output.
   Unsupervised: Solves problems by understanding patterns and discovering outputs.
   Reinforcement: Follows a trial-and-error problem-solving approach.


Important Terms

Agent: the model that is being trained via reinforcement learning
Environment: the training situation that the model must optimize for
Action: any of the possible steps that can be taken by the model
State: the current position/condition returned by the environment
Reward: points given to the model to appraise an action and help it move
in the right direction
Policy: determines how the agent will behave at any time. It acts as
a mapping from the present State to an Action


Markov’s Decision Process I

It is a mathematical framework for reinforcement learning problems, in which
a policy maps the current state to an action and the agent continuously
interacts with the environment to produce new solutions and receive rewards.

It relies on the Markov property, which states that the future is independent
of the past, given the present. This means that, given the present state, the
next state can be predicted without the need for the previous states.
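
Written as a formula, the statement that the future is independent of the past given the present can be expressed as follows. The notation (S_t for the state and A_t for the action at time step t) is an assumption introduced here, since the slides do not fix one.

```latex
P\left(S_{t+1} \mid S_t, A_t\right)
  = P\left(S_{t+1} \mid S_1, A_1, S_2, A_2, \dots, S_t, A_t\right)
```

The left-hand side conditions only on the present state and action; the right-hand side conditions on the whole history. The Markov property says the two are equal.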



Markov’s Decision Process II

It uses the following:

A set of states (S)
A model (the transition model, describing the effect of each action in each state)
A set of all possible actions (A)
A reward function that depends on the state and action: R(S, A)
A policy, which is the solution of the MDP
The policy of a Markov Decision Process aims to maximize the reward at
each state.
The agent interacts with the environment and takes an action while it is in
one state to reach the next state. The reward
returned depends on the action taken.
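
These components can be sketched in a few lines. The example below defines a tiny deterministic MDP (states, actions, transition model, rewards) and computes a policy with value iteration, one standard way of solving an MDP. The states, rewards, discount factor `GAMMA`, and function names are all assumptions for illustration; the slides do not specify a solution method.

```python
# Sketch: a tiny deterministic MDP solved with value iteration.

GAMMA = 0.9  # discount factor (assumed)

# Transition model and reward function together:
# MODEL[state][action] = (next_state, reward). "C" is a terminal state.
MODEL = {
    "A": {"stay": ("A", 0.0), "go": ("B", 1.0)},
    "B": {"back": ("A", 0.0), "go": ("C", 10.0)},
    "C": {},
}

def value_iteration(model, gamma, sweeps=50):
    """Repeatedly set each state's value to the best 'reward + discounted next value'."""
    values = {state: 0.0 for state in model}
    for _ in range(sweeps):
        for state, actions in model.items():
            if actions:
                values[state] = max(
                    reward + gamma * values[nxt] for nxt, reward in actions.values()
                )
    return values

def greedy_policy(model, values, gamma):
    """The policy: for each state, the action with the best one-step lookahead."""
    return {
        state: max(actions, key=lambda a: actions[a][1] + gamma * values[actions[a][0]])
        for state, actions in model.items()
        if actions
    }

values = value_iteration(MODEL, GAMMA)
policy = greedy_policy(MODEL, values, GAMMA)
print(values, policy)
```

The resulting policy maps each non-terminal state to an action, matching the description of a policy as a mapping from the present state to an action.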



Example-1 I

In the diagram shown, we need to find the shortest path between nodes A and D.

Each path has a reward associated with it, and the path with the maximum reward
is the one we want to choose.

The nodes A, B, C, and D denote the states. Travelling from node to node (e.g., A to
B) is an action. The reward is the cost of each edge, and the policy is the path
taken.



Example-1 II

The process will maximize the output based on the reward at each step and will
traverse the path with the highest reward. This process does not explore but
maximizes reward.
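
A sketch of this greedy behaviour is given below. Because the original diagram is not reproduced here, the graph and its edge rewards are hypothetical; each reward is written as a negative cost so that maximizing reward corresponds to preferring cheaper edges.

```python
# Sketch: greedy traversal that always takes the edge with the highest reward.
# The graph is hypothetical; REWARDS[node][neighbour] is the reward of that edge.

REWARDS = {
    "A": {"B": -2, "C": -5},
    "B": {"D": -4, "C": -1},
    "C": {"D": -1},
    "D": {},
}

def greedy_path(rewards, start, goal):
    """From each node, move along the outgoing edge with the highest reward."""
    path, total, node = [start], 0, start
    while node != goal:
        nxt = max(rewards[node], key=rewards[node].get)
        total += rewards[node][nxt]
        path.append(nxt)
        node = nxt
    return path, total

path, total = greedy_path(REWARDS, "A", "D")
print(path, total)
```

Since this process only exploits the best immediate reward and never explores, it is not guaranteed to find the best overall path on every graph; on this small graph the greedy path happens to be the best one.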



Examples of where to apply reinforcement learning I

Rocket engineering – Explore how reinforcement learning is used in the
field of rocket engine development. You’ll find a lot of valuable
information on the use of machine learning in manufacturing industries.
See why reinforcement learning is favored over other machine learning
algorithms when it comes to manufacturing rocket engines.
Traffic Light Control – This site provides multiple research papers and
project examples that highlight the use of core reinforcement learning and
deep reinforcement learning in traffic light control. It has tutorials,
datasets, and relevant example papers that use RL as a backbone so that
you can make a new finding of your own.
Marketing and advertising – See how to make an AI system learn from a
pre-existing dataset which may be infeasible or unavailable, and how to
make AI learn in real-time by creating advertising content. This is where
they have made use of reinforcement learning.



Examples of where to apply reinforcement learning II

Reinforcement Learning in Marketing | by Deepthi A R – This example
focuses on the changing business dynamics to which marketers need to
adapt. The AI equipped with a reinforcement learning scheme can learn
from real-time changes and help devise a proper marketing strategy. This
article highlights the changing business environment as a problem and
reinforcement learning as a solution to it.
Robotics – This video demonstrates the use of reinforcement learning in
robotics. The aim is to show the implementation of autonomous
reinforcement learning agents for robotics. A prime example of using
reinforcement learning in robotics.
Recommendation – Recommendation systems are widely used in
eCommerce and business sites for product advertisement. There’s always a
recommendation section displayed in many popular platforms such as
YouTube, Google, etc. The ability of AI to learn from real-time user
interactions, and then suggest them content, would not have been possible
without reinforcement learning. This article shows the use of
reinforcement learning algorithms and practical implementations in
recommendation systems.



Examples of where to apply reinforcement learning III

Healthcare – Healthcare is a huge industry with many state-of-the-art
technologies bound to it, where the use of AI is not new. The main
question here is how to optimize AI in healthcare, and make it learn based
on real-time experiences. This is where reinforcement learning comes in.
Reinforcement learning has undeniable value for healthcare, with its ability
to regulate ultimate behaviors. With RL, healthcare systems can provide
more detailed and accurate treatment at reduced costs.
NLP – This article shows the use of reinforcement learning in combination
with Natural Language Processing to beat a question and answer
adventure game. This example might be an inspiration for learners
engaged in Natural Language Processing and gaming solutions.
Trading – Deep reinforcement learning is increasingly applied to the
stock trading market. The example here demonstrates how deep
reinforcement learning techniques can be used to analyze the stock
trading market and produce investment reports. An agent equipped with
reinforcement learning can keep adapting these reports as market
conditions change.



Applications of Reinforcement Learning



"Artificial Intelligence"
Decision Trees



Introduction...! I

A decision tree is a supervised learning technique that can be used for both
classification and regression problems, but it is mostly preferred for
solving classification problems.
It is a tree-structured classifier, where internal nodes represent the features
of a dataset, branches represent the decision rules, and each leaf node
represents the outcome.
It contains two types of nodes: decision nodes and leaf nodes.
Decision nodes are used to make a decision and have multiple branches.
Leaf nodes are the outputs of those decisions and do not contain any
further branches.
The decisions or tests are performed on the basis of the features of the
given dataset.
It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
It is called a decision tree because, similar to a tree, it starts with the root
node, which expands on further branches and constructs a tree-like
structure.



Introduction...! II

In order to build a tree, we use the CART algorithm, which stands for
Classification and Regression Tree algorithm.
A decision tree simply asks a question and, based on the answer (Yes/No),
splits further into subtrees. The diagram below shows the general
structure of a decision tree:



Decision Trees

Why use Decision Trees?


Decision trees usually mimic human thinking while making a
decision, so they are easy to understand.
The logic behind the decision tree can be easily understood because it
shows a tree-like structure.

Decision Tree Terminologies


Root Node: The starting point of the decision tree. It represents the entire
dataset, which is further divided into two or more homogeneous sets.
Leaf Node: A final output node. The tree cannot be split any further
after reaching a leaf node.
Splitting: The process of dividing a decision node into sub-nodes according to
the given conditions. A subtree is formed by splitting the tree.
Pruning: The process of removing unwanted branches from the tree.
Parent/Child node: A node that is divided into sub-nodes is called the parent
node of those sub-nodes, and the sub-nodes are its child nodes. The root node
has no parent.



Decision Trees

How does the Decision Tree algorithm Work?


Step-1: Begin the tree with the root node, say S, which contains the
complete dataset.
Step-2: Find the best attribute in the dataset using an Attribute Selection
Measure (ASM).
Step-3: Divide S into subsets that contain the possible values of the
best attribute.
Step-4: Generate the decision tree node that contains the best
attribute.
Step-5: Recursively build new decision trees using the subsets of the
dataset created in Step-3. Continue this process until the nodes cannot be
classified any further; these final nodes are called leaf nodes.



Decision Tree : Example
Example: Suppose a candidate has a job offer and wants to
decide whether to accept the offer or not.
To solve this, the decision tree starts with the root node (the Salary attribute, chosen by ASM).
The root node splits into the next decision node (distance from the office) and
one leaf node, based on the corresponding labels. The next decision node is in turn
split into one decision node (Cab facility) and one leaf node. Finally, that decision node
splits into two leaf nodes (Accepted offer and Declined offer).

Consider the diagram below:



Attribute Selection Measures (ASM)

While implementing a decision tree, the main issue is how to select the
best attribute for the root node and for the sub-nodes. This is solved with a
technique called the Attribute Selection Measure (ASM), which lets us select
the best attribute for each node of the tree. There are two popular ASM
techniques:
Information Gain
Gini Index



Information Gain

Information gain is the change in entropy after a dataset is segmented
on an attribute.
It measures how much information a feature provides about a class.
Nodes are split, and the decision tree is built, according to the value of information gain.
A decision tree algorithm always tries to maximize information gain, so the
node/attribute with the highest information gain is split first.

It can be calculated using the formula below:

Information Gain = Entropy(S) - [(Weighted Avg) * Entropy(each feature)]

Entropy is a metric that measures the impurity of a given attribute. It specifies
the randomness in the data and, for a two-class problem, can be calculated as:

Entropy(S) = -P('yes') * log2 P('yes') - P('no') * log2 P('no')

where S is the set of samples, P('yes') is the probability of 'yes', and P('no') is the probability of 'no'.
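The two formulas above can be sketched in a few lines of Python. This is only an illustration: the function names `entropy` and `information_gain` and the toy `outlook`/`play` lists are assumptions made for this example and do not come from the slides.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(values, labels):
    """Entropy of the labels minus the weighted entropy after grouping by attribute value.

    `values` holds one attribute value per sample, aligned with `labels`.
    """
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    weighted = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - weighted

# Hypothetical toy data: the attribute separates the two classes perfectly.
outlook = ["sunny", "sunny", "rain", "rain"]
play = ["no", "no", "yes", "yes"]
print(entropy(play))                    # 1.0
print(information_gain(outlook, play))  # 1.0
```

A perfectly separating attribute removes all impurity, so its gain equals the entropy of the parent set.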



Gini Index

Gini index is a measure of impurity or purity used while creating a decision
tree in the CART (Classification and Regression Tree) algorithm.
An attribute with a low Gini index should be preferred over one with
a high Gini index.
It only creates binary splits, and the CART algorithm uses the Gini index
to create binary splits.

Gini index can be calculated using the formula below:

$GI = 1 - \sum_{j=1}^{C} P_j^2$

where $C$ is the number of classes and $P_j$ is the probability associated with the $j$-th class.
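As a quick check of the formula, here is a minimal Python sketch. The function name `gini_index` is an assumption for this example; the 7 'yes' / 3 'no' split mirrors the job-offer data used later in these slides.

```python
from collections import Counter

def gini_index(labels):
    """Gini index: 1 minus the sum of squared class probabilities."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

print(round(gini_index(["yes"] * 7 + ["no"] * 3), 2))  # 0.42
print(gini_index(["yes"] * 4))                         # 0.0 (a pure node)
```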

Pruning (2 types): Pruning is the process of deleting unnecessary nodes
from a tree in order to obtain the optimal decision tree. It decreases the
size of the learned tree. The two types are:
Cost Complexity Pruning
Reduced Error Pruning



Merits and DeMerits

Advantages of the Decision Tree


It is simple to understand, as it follows the same process a human
follows while making a decision in real life.
It can be very useful for solving decision-related problems.
It helps to think about all the possible outcomes for a problem.
There is less requirement of data cleaning compared to other algorithms.
Quick to Train

Disadvantages of the Decision Tree


A decision tree may contain many layers, which makes it complex.
It may overfit the training data, which can be mitigated using the Random
Forest algorithm.
With more class labels, the computational complexity of the decision tree
may increase.



DT Induction Algorithms

There are many decision tree algorithms, such as ID3, C4.5, CART, CHAID,
QUEST, GUIDE, CRUISE and CTREE, that are used for classification in
real-time environments.

The most commonly used decision tree algorithms are ID3, C4.5, and CART.

ID3 (Iterative Dichotomiser 3) was developed by J.R. Quinlan in 1986.

C4.5 is an advancement of ID3 presented by the same author in 1993.

CART (Classification and Regression Trees) is another algorithm, developed by
Breiman et al. in 1984.

The accuracy depends upon the selection of the best split attribute.

Both ID3 and C4.5 are univariate decision trees (one feature/attribute is
used to split at each decision).

CART can also be multivariate (more than one feature/attribute is used to split).



ID3 Algorithms I

ID3 stands for Iterative Dichotomiser 3. It is so named because the
algorithm iteratively (repeatedly) dichotomizes (divides) features into two or
more groups at each step.

ID3 uses a top-down greedy approach to build a decision tree.

Algorithm
Compute "Entropy_Info" for the whole training dataset based on the target variable.
Compute "Entropy_Info" and "Information Gain" for each attribute in the dataset.
Choose the attribute for which the entropy is minimum, and therefore the gain is
maximum, as the best split attribute and make it the root node.
The root node is branched into subtrees, with each subtree an outcome of the
test condition on the root node attribute. Accordingly, the training dataset is also
split into subsets.
Recursively repeat the process for each subset until a leaf node is derived or no more
training instances are available in the subset.



ID3 Algorithms II

Definitions :

Let $T$ be the training dataset.
Let $A$ be the set of attributes, $A = \{A_1, A_2, A_3, \ldots, A_n\}$.
Let $m$ be the number of classes in the training dataset.
Let $P_i$ be the probability that a data instance (tuple) $d$ belongs to class
$C_i$, i.e. $P_i = \frac{|d_{C_i}|}{|T|}$, where $|d_{C_i}|$ is the number of instances in class $C_i$.

$\text{Entropy\_Info}(T) = -\sum_{i=1}^{m} P_i \log_2 P_i$

$\text{Entropy\_Info}(T, A) = \sum_{i=1}^{v} \frac{|A_i|}{|T|} \cdot \text{Entropy\_Info}(A_i)$

where
the attribute $A$ has $v$ distinct values $\{a_1, a_2, a_3, \ldots, a_v\}$,
$|A_i|$ is the number of instances with the distinct value $a_i$ of attribute $A$, and
$\text{Entropy\_Info}(A_i)$ is the entropy of that set of instances.

$\text{Information\_Gain}(A) = \text{Entropy\_Info}(T) - \text{Entropy\_Info}(T, A)$



Example-1 I

Problem :

Consider the table below, used to assess a student's performance during the course
of study and to predict whether the student will get a job offer in the final
year of the course. The training dataset T consists of 10 data instances with the
attributes 'CGPA (C)', 'Interactiveness (I)', 'Practical Knowledge (Pk)', and
'Communication Skills (Cs)', as shown below:

Sl.No  CGPA  Interactiveness  Practical Knowledge  Communication Skills  Job Offer
1      ≥ 9   Yes              Very Good            Good                  Yes
2      ≥ 8   No               Good                 Moderate              Yes
3      ≥ 9   No               Average              Poor                  No
4      < 8   No               Average              Good                  No
5      ≥ 8   Yes              Good                 Moderate              Yes
6      ≥ 9   Yes              Good                 Moderate              Yes
7      < 8   Yes              Good                 Poor                  No
8      ≥ 9   No               Very Good            Good                  Yes
9      ≥ 8   Yes              Good                 Good                  Yes
10     ≥ 8   Yes              Average              Good                  Yes



Example-1 II

Solution :

Step-1: Calculate the entropy for the target class 'Job Offer':

Entropy_Info(Target Variable = Job Offer) = Entropy_Info(7, 3)

$= -\left(\frac{7}{10}\log_2\frac{7}{10} + \frac{3}{10}\log_2\frac{3}{10}\right)$
$= -(-0.3599 + -0.5208)$
$= 0.8807$

Step-2: Calculate the Entropy_Info and Gain (Information_Gain) for each
attribute in the training dataset.

Iteration-1:
Table: Entropy Information for CGPA

CGPA   Job Offer = 'Yes'   Job Offer = 'No'   Total   Entropy
>=9    3                   1                  4       0.8108
>=8    4                   0                  4       0
<8     0                   2                  2       0



Example-1 III

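Entropy_Info(T, CGPA)
$= \frac{4}{10}\left(-\frac{3}{4}\log_2\frac{3}{4} - \frac{1}{4}\log_2\frac{1}{4}\right) + \frac{4}{10}\left(-\frac{4}{4}\log_2\frac{4}{4} - \frac{0}{4}\log_2\frac{0}{4}\right) + \frac{2}{10}\left(-\frac{0}{2}\log_2\frac{0}{2} - \frac{2}{2}\log_2\frac{2}{2}\right)$
$= \frac{4}{10}(0.3111 + 0.4997) + 0 + 0 = 0.3243$

Gain(CGPA) = 0.8807 - 0.3243 = 0.5564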

Table: Entropy Information for Interactiveness (I)

I     Job Offer = 'Yes'   Job Offer = 'No'   Total   Entropy
Yes   5                   1                  6       0.6497
No    2                   2                  4       0.9994

Entropy_Info(T, Interactiveness)
$= \frac{6}{10}\left(-\frac{5}{6}\log_2\frac{5}{6} - \frac{1}{6}\log_2\frac{1}{6}\right) + \frac{4}{10}\left(-\frac{2}{4}\log_2\frac{2}{4} - \frac{2}{4}\log_2\frac{2}{4}\right)$
$= \frac{6}{10}(0.2191 + 0.4306) + \frac{4}{10}(0.4997 + 0.4997) = 0.7896$

Gain(Interactiveness) = 0.8807 - 0.7896 = 0.0911



Example-1 IV

Table: Entropy Information for Practical Knowledge (Pk)

Pk          Job Offer = 'Yes'   Job Offer = 'No'   Total   Entropy
Very Good   2                   0                  2       0
Average     1                   2                  3       0.9177
Good        4                   1                  5       0.7215

Entropy_Info(T, Practical Knowledge)
$= \frac{2}{10}\left(-\frac{2}{2}\log_2\frac{2}{2} - \frac{0}{2}\log_2\frac{0}{2}\right) + \frac{3}{10}\left(-\frac{1}{3}\log_2\frac{1}{3} - \frac{2}{3}\log_2\frac{2}{3}\right) + \frac{5}{10}\left(-\frac{4}{5}\log_2\frac{4}{5} - \frac{1}{5}\log_2\frac{1}{5}\right)$
$= \frac{2}{10}(0) + \frac{3}{10}(0.5280 + 0.3897) + \frac{5}{10}(0.2574 + 0.4641) = 0.6361$

Gain(Practical Knowledge) = 0.8807 - 0.6361 = 0.2446



Example-1 V
Table: Entropy Information for Communication Skills (Cs)

Cs         Job Offer = 'Yes'   Job Offer = 'No'   Total   Entropy
Good       4                   1                  5       0.7215
Moderate   3                   0                  3       0
Poor       0                   2                  2       0

Entropy_Info(T, Communication Skills)
$= \frac{5}{10}\left(-\frac{4}{5}\log_2\frac{4}{5} - \frac{1}{5}\log_2\frac{1}{5}\right) + \frac{3}{10}\left(-\frac{3}{3}\log_2\frac{3}{3} - \frac{0}{3}\log_2\frac{0}{3}\right) + \frac{2}{10}\left(-\frac{0}{2}\log_2\frac{0}{2} - \frac{2}{2}\log_2\frac{2}{2}\right)$
$= \frac{5}{10}(0.2574 + 0.4641) + \frac{3}{10}(0) + \frac{2}{10}(0) \approx 0.3609$

Gain(Communication Skills) = 0.8807 - 0.3609 ≈ 0.5203

Table: Gain for all attributes

Attributes             Gain     Entropy_Info(T, A)
CGPA                   0.5564   0.3243
Interactiveness        0.0911   0.7896
Practical Knowledge    0.2446   0.6361
Communication Skills   0.5203   0.3609

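The gains in the table above can be reproduced with a short Python sketch. The names `DATA`, `ATTRIBUTES`, `entropy`, and `gain` are assumptions made for this illustration; the rows are the 10 instances of the training table. Because the hand calculation rounds intermediate logarithms, the computed values may differ from the slide in the third or fourth decimal place.

```python
import math
from collections import Counter, defaultdict

# Rows: (CGPA, Interactiveness, Practical Knowledge, Communication Skills, Job Offer)
DATA = [
    (">=9", "Yes", "Very Good", "Good", "Yes"),
    (">=8", "No", "Good", "Moderate", "Yes"),
    (">=9", "No", "Average", "Poor", "No"),
    ("<8", "No", "Average", "Good", "No"),
    (">=8", "Yes", "Good", "Moderate", "Yes"),
    (">=9", "Yes", "Good", "Moderate", "Yes"),
    ("<8", "Yes", "Good", "Poor", "No"),
    (">=9", "No", "Very Good", "Good", "Yes"),
    (">=8", "Yes", "Good", "Good", "Yes"),
    (">=8", "Yes", "Average", "Good", "Yes"),
]
ATTRIBUTES = ["CGPA", "Interactiveness", "Practical Knowledge", "Communication Skills"]

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())

def gain(rows, column):
    """Information gain of splitting `rows` on the attribute at index `column`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row[-1])
    weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[-1] for row in rows]) - weighted

gains = {name: gain(DATA, i) for i, name in enumerate(ATTRIBUTES)}
for name, value in gains.items():
    print(f"{name:22s} {value:.4f}")
best = max(gains, key=gains.get)
print("best split attribute:", best)  # CGPA
```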
Example-1 VI

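Step-3: Choose the attribute for which the entropy is minimum, and therefore the gain is
maximum, as the best split attribute. CGPA has the highest gain (0.5564), so we choose CGPA as the root node.

The branches CGPA >= 8 and CGPA < 8 are already pure (all 'Yes' and all 'No'), so they become leaf nodes.
Now continue the same process for the subset of data instances on the branch
CGPA >= 9.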



Example-1 VII

Iteration-2 :

Once again, the same process of computing Entropy_Info and Gain
is repeated on this subset of the training set.

The subset consists of 4 data instances.

Entropy_Info(T) = Entropy_Info(3, 1)
$= -\left(\frac{3}{4}\log_2\frac{3}{4} + \frac{1}{4}\log_2\frac{1}{4}\right) = -(-0.3111 + -0.4997) = 0.8108$

Entropy_Info(T, Interactiveness)
$= \frac{2}{4}\left(-\frac{2}{2}\log_2\frac{2}{2} - \frac{0}{2}\log_2\frac{0}{2}\right) + \frac{2}{4}\left(-\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2}\right) = 0 + 0.4997 = 0.4997$

Gain(Interactiveness) = 0.8108 - 0.4997 = 0.3111



Example-1 VIII

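Entropy_Info(T, Practical Knowledge)
$= \frac{2}{4}\left(-\frac{2}{2}\log_2\frac{2}{2} - \frac{0}{2}\log_2\frac{0}{2}\right) + \frac{1}{4}\left(-\frac{0}{1}\log_2\frac{0}{1} - \frac{1}{1}\log_2\frac{1}{1}\right) + \frac{1}{4}\left(-\frac{1}{1}\log_2\frac{1}{1} - \frac{0}{1}\log_2\frac{0}{1}\right)$
$= 0$

Gain(Practical Knowledge) = 0.8108 - 0 = 0.8108

Entropy_Info(T, Communication Skills)
$= \frac{2}{4}\left(-\frac{2}{2}\log_2\frac{2}{2} - \frac{0}{2}\log_2\frac{0}{2}\right) + \frac{1}{4}\left(-\frac{0}{1}\log_2\frac{0}{1} - \frac{1}{1}\log_2\frac{1}{1}\right) + \frac{1}{4}\left(-\frac{1}{1}\log_2\frac{1}{1} - \frac{0}{1}\log_2\frac{0}{1}\right)$
$= 0$

Gain(Communication Skills) = 0.8108 - 0 = 0.8108

Table: Gain for all attributes

Attributes             Gain
Interactiveness        0.3111
Practical Knowledge    0.8108
Communication Skills   0.8108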



Example-1 IX

Here, both attributes 'Practical Knowledge' and 'Communication Skills' have
the same gain, so we can construct the decision tree using either 'Practical Knowledge' or
'Communication Skills'.

Figure: Final Decision Tree
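The whole procedure can be sketched as a small recursive function. This is an illustrative implementation written for these notes, not the original ID3 code: the names `id3`, `split`, `gain`, `DATA`, and `ATTRIBUTES` are assumptions, and ties between attributes are broken by taking the first attribute in column order, so 'Practical Knowledge' is chosen over 'Communication Skills' on the CGPA >= 9 branch.

```python
import math
from collections import Counter, defaultdict

# Rows: (CGPA, Interactiveness, Practical Knowledge, Communication Skills, Job Offer)
DATA = [
    (">=9", "Yes", "Very Good", "Good", "Yes"),
    (">=8", "No", "Good", "Moderate", "Yes"),
    (">=9", "No", "Average", "Poor", "No"),
    ("<8", "No", "Average", "Good", "No"),
    (">=8", "Yes", "Good", "Moderate", "Yes"),
    (">=9", "Yes", "Good", "Moderate", "Yes"),
    ("<8", "Yes", "Good", "Poor", "No"),
    (">=9", "No", "Very Good", "Good", "Yes"),
    (">=8", "Yes", "Good", "Good", "Yes"),
    (">=8", "Yes", "Average", "Good", "Yes"),
]
ATTRIBUTES = ["CGPA", "Interactiveness", "Practical Knowledge", "Communication Skills"]

def entropy(labels):
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())

def split(rows, column):
    """Group rows by the value they take in the given column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row)
    return groups

def gain(rows, column):
    weighted = sum(len(g) / len(rows) * entropy([r[-1] for r in g])
                   for g in split(rows, column).values())
    return entropy([r[-1] for r in rows]) - weighted

def id3(rows, columns):
    """Return a nested dict {attribute: {value: subtree}} or a class label (leaf)."""
    labels = [r[-1] for r in rows]
    if len(set(labels)) == 1 or not columns:
        return Counter(labels).most_common(1)[0][0]    # leaf: majority class
    best = max(columns, key=lambda c: gain(rows, c))   # ties: first column wins
    remaining = [c for c in columns if c != best]
    return {ATTRIBUTES[best]: {value: id3(subset, remaining)
                               for value, subset in split(rows, best).items()}}

tree = id3(DATA, list(range(len(ATTRIBUTES))))
print(tree)
```

The printed tree has CGPA at the root, leaf nodes for CGPA >= 8 ('Yes') and CGPA < 8 ('No'), and a 'Practical Knowledge' subtree under CGPA >= 9.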



Merits and DeMerits of ID3

Merits:
Understandable prediction rules are created from the training data.
Builds a short tree in relatively small time.
It only needs to test enough attributes until all data is classified.
Finding leaf nodes enables test data to be pruned, reducing the number of
tests.

Demerits:
Data may be over-fitted or over-classified, if a small sample is tested.
Only one attribute at a time is tested for making a decision.

Overfitting
It is an undesirable learning behavior that occurs when the machine learning
model gives accurate predictions for training data but not for new data.



C4.5 Algorithm

It is an improvement over ID3 and works with continuous as well as discrete
attributes. It also handles missing values, which are marked as '?'.
It also supports post-pruning.
C5.0 is the successor of C4.5; it is more efficient and builds
smaller decision trees.

Example:

Given a training dataset T, the Split_Info(T, A) and Gain_Ratio(A) of an attribute A are

$\text{Split\_Info}(T, A) = -\sum_{i=1}^{v} \frac{|A_i|}{|T|} \log_2 \frac{|A_i|}{|T|}$

and

$\text{Gain\_Ratio}(A) = \frac{\text{Info\_Gain}(A)}{\text{Split\_Info}(T, A)}$
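A minimal Python sketch of these two formulas follows. The function names and the choice to describe an attribute by its partition sizes |A_i| are assumptions made for this example. The guard for a single partition (where Split_Info is zero) is also an added assumption, since the formula itself is undefined in that case.

```python
import math

def split_info(partition_sizes):
    """Split_Info for an attribute whose values partition the data into the given sizes."""
    total = sum(partition_sizes)
    return -sum(n / total * math.log2(n / total) for n in partition_sizes if n > 0)

def gain_ratio(info_gain, partition_sizes):
    """Gain_Ratio = Info_Gain / Split_Info; returns 0.0 when there is only one partition."""
    si = split_info(partition_sizes)
    return info_gain / si if si > 0 else 0.0

print(split_info([2, 2]))       # 1.0
print(split_info([2, 1, 1]))    # 1.5
print(gain_ratio(0.5, [2, 2]))  # 0.5
```

Dividing by Split_Info penalizes attributes that break the data into many small partitions, which plain information gain tends to favor.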



C4.5 Algorithm

Compute Entropy_Info(T), where T is the training dataset.

Compute Entropy_Info(T, A_i), Info_Gain(A_i), Split_Info(T, A_i), and
Gain_Ratio(A_i) for each attribute A_i in the training dataset.

Choose the attribute for which Gain_Ratio is maximum as the best split
attribute.

The root node is branched into subtrees, with each subtree an outcome
of the test condition on the root node attribute. Accordingly, the training
dataset is also split into subsets.

Recursively apply the same operation to each subset of the training set
with the remaining attributes until a leaf node is derived or no more
training instances are available in the subset.



Example-1 (Using C4.5 Algorithm) I

Problem :

Consider the table below, used to assess a student's performance during the course
of study and to predict whether the student will get a job offer in the final
year of the course. The training dataset T consists of 10 data instances with the
attributes 'CGPA (C)', 'Interactiveness (I)', 'Practical Knowledge (Pk)', and
'Communication Skills (Cs)', as shown below:

Sl.No  CGPA  Interactiveness  Practical Knowledge  Communication Skills  Job Offer
1      ≥ 9   Yes              Very Good            Good                  Yes
2      ≥ 8   No               Good                 Moderate              Yes
3      ≥ 9   No               Average              Poor                  No
4      < 8   No               Average              Good                  No
5      ≥ 8   Yes              Good                 Moderate              Yes
6      ≥ 9   Yes              Good                 Moderate              Yes
7      < 8   Yes              Good                 Poor                  No
8      ≥ 9   No               Very Good            Good                  Yes
9      ≥ 8   Yes              Good                 Good                  Yes
10     ≥ 8   Yes              Average              Good                  Yes



Example-1 (Using C4.5 Algorithm) II

Solution :

Step-1: Calculate the entropy for the target class 'Job Offer':

Entropy_Info(Target Variable = Job Offer) = Entropy_Info(7, 3)

$= -\left(\frac{7}{10}\log_2\frac{7}{10} + \frac{3}{10}\log_2\frac{3}{10}\right)$
$= -(-0.3599 + -0.5208)$
$= 0.8807$

Step-2: Calculate the Entropy_Info and Gain (Information_Gain) for each
attribute in the training dataset.

Iteration-1:
Table: Entropy Information for CGPA

CGPA   Job Offer = 'Yes'   Job Offer = 'No'   Total   Entropy
>=9    3                   1                  4       0.8108
>=8    4                   0                  4       0
<8     0                   2                  2       0



Example-1 (Using C4.5 Algorithm) III

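Entropy_Info(T, CGPA)
$= \frac{4}{10}\left(-\frac{3}{4}\log_2\frac{3}{4} - \frac{1}{4}\log_2\frac{1}{4}\right) + \frac{4}{10}\left(-\frac{4}{4}\log_2\frac{4}{4} - \frac{0}{4}\log_2\frac{0}{4}\right) + \frac{2}{10}\left(-\frac{0}{2}\log_2\frac{0}{2} - \frac{2}{2}\log_2\frac{2}{2}\right)$
$= \frac{4}{10}(0.3111 + 0.4997) + 0 + 0 = 0.3243$

Gain(CGPA) = 0.8807 - 0.3243 = 0.5564

Split_Info(T, CGPA) $= -\frac{4}{10}\log_2\frac{4}{10} - \frac{4}{10}\log_2\frac{4}{10} - \frac{2}{10}\log_2\frac{2}{10} = 1.5211$

Gain_Ratio(CGPA) $= \frac{\text{Gain}(\text{CGPA})}{\text{Split\_Info}(T, \text{CGPA})} = \frac{0.5564}{1.5211} = 0.3658$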

Table: Entropy Information for Interactiveness (I)

I     Job Offer = 'Yes'   Job Offer = 'No'   Total   Entropy
Yes   5                   1                  6       0.6497
No    2                   2                  4       0.9994



Example-1 (Using C4.5 Algorithm) IV

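Entropy_Info(T, Interactiveness)
$= \frac{6}{10}\left(-\frac{5}{6}\log_2\frac{5}{6} - \frac{1}{6}\log_2\frac{1}{6}\right) + \frac{4}{10}\left(-\frac{2}{4}\log_2\frac{2}{4} - \frac{2}{4}\log_2\frac{2}{4}\right)$
$= \frac{6}{10}(0.2191 + 0.4306) + \frac{4}{10}(0.4997 + 0.4997) = 0.7896$

Gain(Interactiveness) = 0.8807 - 0.7896 = 0.0911

Split_Info(T, Interactiveness) $= -\frac{6}{10}\log_2\frac{6}{10} - \frac{4}{10}\log_2\frac{4}{10} = 0.9704$

Gain_Ratio(Interactiveness) $= \frac{\text{Gain}(\text{Interactiveness})}{\text{Split\_Info}(T, \text{Interactiveness})} = \frac{0.0911}{0.9704} = 0.0939$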

Table: Entropy Information for Practical Knowledge (Pk)

Pk          Job Offer = 'Yes'   Job Offer = 'No'   Total   Entropy
Very Good   2                   0                  2       0
Average     1                   2                  3       0.9177
Good        4                   1                  5       0.7215



Example-1 (Using C4.5 Algorithm) V

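Entropy_Info(T, Practical Knowledge)
$= \frac{2}{10}\left(-\frac{2}{2}\log_2\frac{2}{2} - \frac{0}{2}\log_2\frac{0}{2}\right) + \frac{3}{10}\left(-\frac{1}{3}\log_2\frac{1}{3} - \frac{2}{3}\log_2\frac{2}{3}\right) + \frac{5}{10}\left(-\frac{4}{5}\log_2\frac{4}{5} - \frac{1}{5}\log_2\frac{1}{5}\right)$
$= \frac{2}{10}(0) + \frac{3}{10}(0.5280 + 0.3897) + \frac{5}{10}(0.2574 + 0.4641) = 0.6361$

Gain(Practical Knowledge) = 0.8807 - 0.6361 = 0.2446

Split_Info(T, Practical Knowledge) $= -\frac{2}{10}\log_2\frac{2}{10} - \frac{5}{10}\log_2\frac{5}{10} - \frac{3}{10}\log_2\frac{3}{10} = 1.4853$

Gain_Ratio(Practical Knowledge) $= \frac{\text{Gain}(\text{Practical Knowledge})}{\text{Split\_Info}(T, \text{Practical Knowledge})} = \frac{0.2448}{1.4853} = 0.1648$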

Table: Entropy Information for Communication Skills (Cs)

Cs         Job Offer = 'Yes'   Job Offer = 'No'   Total   Entropy
Good       4                   1                  5       0.7215
Moderate   3                   0                  3       0
Poor       0                   2                  2       0



Example-1 (Using C4.5 Algorithm) VI

Entropy_Info(T, Communication Skills)
$= \frac{5}{10}\left(-\frac{4}{5}\log_2\frac{4}{5} - \frac{1}{5}\log_2\frac{1}{5}\right) + \frac{3}{10}\left(-\frac{3}{3}\log_2\frac{3}{3} - \frac{0}{3}\log_2\frac{0}{3}\right) + \frac{2}{10}\left(-\frac{0}{2}\log_2\frac{0}{2} - \frac{2}{2}\log_2\frac{2}{2}\right)$
$= \frac{5}{10}(0.2574 + 0.4641) + \frac{3}{10}(0) + \frac{2}{10}(0) \approx 0.3609$

Gain(Communication Skills) = 0.8807 - 0.3609 ≈ 0.5203

Split_Info(T, Communication Skills) $= -\frac{5}{10}\log_2\frac{5}{10} - \frac{3}{10}\log_2\frac{3}{10} - \frac{2}{10}\log_2\frac{2}{10} = 1.4853$

Gain_Ratio(Communication Skills) $= \frac{\text{Gain}(\text{Communication Skills})}{\text{Split\_Info}(T, \text{Communication Skills})} = \frac{0.5203}{1.4853} \approx 0.3502$

Table: Gain_Ratio for all attributes

Attributes             Gain_Ratio
CGPA                   0.3658
Interactiveness        0.0939
Practical Knowledge    0.1648
Communication Skills   0.3502
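The Gain_Ratio column can be checked with a short Python sketch. The names `DATA`, `ATTRIBUTES`, and `gain_ratio` are assumptions made for this illustration; computed values may differ from the hand-worked ones in the third or fourth decimal place because the slides round intermediate logarithms.

```python
import math
from collections import Counter, defaultdict

# Rows: (CGPA, Interactiveness, Practical Knowledge, Communication Skills, Job Offer)
DATA = [
    (">=9", "Yes", "Very Good", "Good", "Yes"),
    (">=8", "No", "Good", "Moderate", "Yes"),
    (">=9", "No", "Average", "Poor", "No"),
    ("<8", "No", "Average", "Good", "No"),
    (">=8", "Yes", "Good", "Moderate", "Yes"),
    (">=9", "Yes", "Good", "Moderate", "Yes"),
    ("<8", "Yes", "Good", "Poor", "No"),
    (">=9", "No", "Very Good", "Good", "Yes"),
    (">=8", "Yes", "Good", "Good", "Yes"),
    (">=8", "Yes", "Average", "Good", "Yes"),
]
ATTRIBUTES = ["CGPA", "Interactiveness", "Practical Knowledge", "Communication Skills"]

def entropy(labels):
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())

def gain_ratio(rows, column):
    """Info_Gain divided by Split_Info for the attribute at index `column`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row[-1])
    n = len(rows)
    weighted = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy([row[-1] for row in rows]) - weighted
    split_info = -sum(len(g) / n * math.log2(len(g) / n) for g in groups.values())
    return info_gain / split_info if split_info > 0 else 0.0

ratios = {name: gain_ratio(DATA, i) for i, name in enumerate(ATTRIBUTES)}
for name, value in ratios.items():
    print(f"{name:22s} {value:.4f}")
best = max(ratios, key=ratios.get)
print("best split attribute:", best)  # CGPA
```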



Example-1 (Using C4.5 Algorithm) VII

Step-3: Choose the attribute for which Gain_Ratio is maximum as the best
split attribute. CGPA has the highest Gain_Ratio (0.3658), so we choose CGPA as the root node.

Now continue the same process for the subset of data instances on the branch
CGPA >= 9.



Example-1 (Using C4.5 Algorithm) VIII

Iteration-2 :

Once again, the same process of computing Entropy_Info, Gain,
Split_Info, and Gain_Ratio is repeated on this subset of the training set.
The subset consists of 4 data instances.

Entropy_Info(T) = Entropy_Info(3, 1)
$= -\left(\frac{3}{4}\log_2\frac{3}{4} + \frac{1}{4}\log_2\frac{1}{4}\right) = -(-0.3111 + -0.4997) = 0.8108$

Entropy_Info(T, Interactiveness)
$= \frac{2}{4}\left(-\frac{2}{2}\log_2\frac{2}{2} - \frac{0}{2}\log_2\frac{0}{2}\right) + \frac{2}{4}\left(-\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2}\right) = 0 + 0.4997 = 0.4997$

Gain(Interactiveness) = 0.8108 - 0.4997 = 0.3111

Split_Info(T, Interactiveness) $= -\frac{2}{4}\log_2\frac{2}{4} - \frac{2}{4}\log_2\frac{2}{4} = 1$

Gain_Ratio(Interactiveness) $= \frac{\text{Gain}(\text{Interactiveness})}{\text{Split\_Info}(T, \text{Interactiveness})} = \frac{0.3112}{1} = 0.3112$



Example-1 (Using C4.5 Algorithm) IX

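Entropy_Info(T, Practical Knowledge)
$= \frac{2}{4}\left(-\frac{2}{2}\log_2\frac{2}{2} - \frac{0}{2}\log_2\frac{0}{2}\right) + \frac{1}{4}\left(-\frac{0}{1}\log_2\frac{0}{1} - \frac{1}{1}\log_2\frac{1}{1}\right) + \frac{1}{4}\left(-\frac{1}{1}\log_2\frac{1}{1} - \frac{0}{1}\log_2\frac{0}{1}\right) = 0$

Gain(Practical Knowledge) = 0.8108 - 0 = 0.8108

Split_Info(T, Practical Knowledge) $= -\frac{2}{4}\log_2\frac{2}{4} - \frac{1}{4}\log_2\frac{1}{4} - \frac{1}{4}\log_2\frac{1}{4} = 1.5$

Gain_Ratio(Practical Knowledge) $= \frac{\text{Gain}(Pk)}{\text{Split\_Info}(T, Pk)} = \frac{0.8108}{1.5} \approx 0.5408$

Entropy_Info(T, Communication Skills)
$= \frac{2}{4}\left(-\frac{2}{2}\log_2\frac{2}{2} - \frac{0}{2}\log_2\frac{0}{2}\right) + \frac{1}{4}\left(-\frac{0}{1}\log_2\frac{0}{1} - \frac{1}{1}\log_2\frac{1}{1}\right) + \frac{1}{4}\left(-\frac{1}{1}\log_2\frac{1}{1} - \frac{0}{1}\log_2\frac{0}{1}\right) = 0$

Gain(Communication Skills) = 0.8108 - 0 = 0.8108

Split_Info(T, Communication Skills) $= -\frac{2}{4}\log_2\frac{2}{4} - \frac{1}{4}\log_2\frac{1}{4} - \frac{1}{4}\log_2\frac{1}{4} = 1.5$

Gain_Ratio(Communication Skills) $= \frac{\text{Gain}(Cs)}{\text{Split\_Info}(T, Cs)} = \frac{0.8108}{1.5} \approx 0.5408$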



Example-1 (Using C4.5 Algorithm) X

Table: Gain Ratio for all attributes


Attributes Gain_Ratio
Interactiveness 0.3111
Practical Knowledge 0.5408
Communication Skills 0.5408

Figure: Final Decision Tree



Dealing with Continuous Attributes in C4.5 I

The C4.5 algorithm is further improved by handling attributes that are
continuous; these are discretized by finding a split point or threshold.

When an attribute A has continuous numerical values, a
threshold or split point s is found such that the set of values is categorized
into 2 sets, A <= s and A > s. The best split point is the attribute value
that gives the maximum information gain for that attribute.

Table: Sample Data Set

S.No CGPA Job Offer


1 9.5 Yes
2 8.2 Yes
3 9.1 No
4 6.8 No
5 8.5 Yes
6 9.5 Yes
7 7.9 No
8 9.1 Yes
9 8.8 Yes
10 8.8 Yes



Dealing with Continuous Attributes in C4.5 II

First, sort the values in ascending order:

6.8 7.9 8.2 8.5 8.8 8.8 9.1 9.1 9.5 9.5

Remove the duplicates and consider only the unique values of the attribute:

6.8 7.9 8.2 8.5 8.8 9.1 9.5

Now compute the gain for each distinct value of this continuous attribute:

Split value        6.8         7.9         8.2         8.5         8.8         9.1         9.5
Range              <=    >     <=    >     <=    >     <=    >     <=    >     <=    >     <=    >
Yes                0     7     0     7     1     6     2     5     4     3     5     2     7     0
No                 1     2     2     1     2     1     2     1     2     1     3     0     3     0
Entropy            0     0.76  0     0.54  0.91  0.59  1     0.64  0.91  0.81  0.95  0     0.88  0
Entropy_Info(S,T)  0.6873      0.4346      0.6892      0.7898      0.8749      0.7630      0.8808
Gain               0.1935      0.4462      0.1916      0.091       0.0059      0.1178      0

From the table above, we observe that CGPA = 7.9 has the maximum gain,
0.4462. So we choose 7.9 as the split point, giving the two ranges CGPA <= 7.9 and CGPA > 7.9.
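The search for the split point can be sketched in Python as follows. The names `CGPA`, `OFFER`, and `gain_at` are assumptions made for this example; the values are the 10 instances of the sample dataset. The computed gain may differ from the table in the last decimal place because the table rounds the intermediate entropies.

```python
import math
from collections import Counter

CGPA = [9.5, 8.2, 9.1, 6.8, 8.5, 9.5, 7.9, 9.1, 8.8, 8.8]
OFFER = ["Yes", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes"]

def entropy(labels):
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())

def gain_at(threshold, values, labels):
    """Information gain of the binary split value <= threshold vs value > threshold."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    weighted = sum(len(part) / len(labels) * entropy(part)
                   for part in (left, right) if part)
    return entropy(labels) - weighted

candidates = sorted(set(CGPA))  # unique values in ascending order
gains = {t: gain_at(t, CGPA, OFFER) for t in candidates}
best = max(gains, key=gains.get)
print("best split point:", best)  # 7.9
```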



Dealing with Continuous Attributes in C4.5 III

Table: Discretized Instances

S.No  CGPA (Continuous)  CGPA (Discretized)  Job Offer
1     9.5                > 7.9               Yes
2     8.2                > 7.9               Yes
3     9.1                > 7.9               No
4     6.8                <= 7.9              No
5     8.5                > 7.9               Yes
6     9.5                > 7.9               Yes
7     7.9                <= 7.9              No
8     9.1                > 7.9               Yes
9     8.8                > 7.9               Yes
10    8.8                > 7.9               Yes



CART - Classification and Regression Tree I

What category of algorithms does CART belong to? CART (Classification
and Regression Trees) can be used for both classification and regression
problems. The difference lies in the target variable:
Classification is an attempt to predict a class label. In other words,
classification is used for problems where the output (target variable) takes
a finite set of values, e.g., whether it will rain tomorrow or not.
Regression is used to predict a numerical label. This means the output
can take an infinite set of values, e.g., a house price.

How does CART work? Example: Assume you have a bunch of oranges
and mandarins with labels on them, and you need to identify a set of simple rules
that you can use in the future to distinguish between these two types of fruit.

Typically, oranges (diameter 6 to 10 cm) are bigger than mandarins (diameter
4 to 8 cm), so the first rule found by your algorithm might be based on size:
Diameter ≤ 7 cm.



CART - Classification and Regression Tree II

Next, you may notice that mandarins tend to be slightly darker in color than
oranges. So, you can use a color scale (1 = dark to 10 = light) to split your tree
further:
Color ≤ 5 for the left side of the sub-tree
Color ≤ for the right side of the sub-tree



CART - Classification and Regression Tree III

How does CART find the best split?


Several methods can be used in CART to identify the best splits. Here are two
of the most common ones for classification trees:

Gini Impurity (Index): $\text{Gini} = 1 - \sum_{i=1}^{n} p_i^2$
Entropy: $-\sum_{i=1}^{n} p_i \log_2(p_i)$

where $p_i$ is the fraction of items in class $i$.

The Gini Impurity for the leftmost leaf node (in the figure above) would be:

$\text{Gini Impurity} = 1 - \sum_{i=1}^{n} p_i^2 = 1 - (0.027^2 + 0.973^2) = 0.053$

To find the best split, we need to calculate the weighted sum of Gini Impurity
for both child nodes. We do this for all possible splits and then take the one
with the lowest weighted Gini Impurity as the best split.



CART - Classification and Regression Tree IV

The weighted sum of Gini Impurity for the two child nodes is

$\text{Gini Impurity} = \frac{37}{46} \cdot 0.053 + \frac{9}{46} \cdot 0.345 = 0.110$

Note
If the best weighted Gini Impurity for the two child nodes is not lower than Gini
Impurity for the parent node, you should not split the parent node any further.
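The weighted sum can be sketched in Python. Note the assumption: the slides give only the class fractions (0.027 and 0.973 for the 37-item leaf) and the impurity 0.345 for the 9-item leaf, so the class counts [1, 36] and [2, 7] below are inferred to match those figures and are not stated in the source. The function names are also illustrative.

```python
def gini(counts):
    """Gini impurity of a node given its per-class item counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def weighted_gini(children):
    """Weighted sum of the Gini impurity of the child nodes of a split."""
    total = sum(sum(c) for c in children)
    return sum(sum(c) / total * gini(c) for c in children)

# Assumed class counts, inferred from the fractions quoted on the slide.
left, right = [1, 36], [2, 7]
print(round(gini(left), 3))                    # 0.053
print(round(weighted_gini([left, right]), 3))  # 0.11
```

Following the note above, a split would be kept only if this weighted value is lower than the Gini Impurity of the parent node.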



CART - Classification and Regression Tree V

The Entropy approach is essentially the same as Gini Impurity, except it uses a
slightly different formula:

$-\sum_{i=1}^{n} p_i \log_2(p_i)$

To identify the best split, you would have to follow all the same steps outlined
above. The split with the lowest weighted entropy is the best one. Similarly, if the
weighted entropy of the two child nodes is not lower than that of the parent node, you
should not split any further.



CART Algorithm

Compute $\text{Gini\_Index}(T) = 1 - \sum_{i=1}^{n} P_i^2$, where T is the training dataset.
Compute $\text{Gini\_Index}(T, A) = \frac{|S_1|}{|T|}\,\text{Gini}(S_1) + \frac{|S_2|}{|T|}\,\text{Gini}(S_2)$, where A is the
attribute and $S_1$, $S_2$ are the two subsets produced by a binary split on A.
Choose the best splitting subset, i.e. the one with the minimum Gini_Index for the
attribute.
Compute ΔGini = Gini(T) - Gini(T, A) for the best splitting subset of
each attribute, and take the attribute with the largest ΔGini as the root node.
The root node is branched into 2 subtrees, with each subtree an outcome
of the test condition on the root node attribute. Accordingly, the training
dataset is also split into 2 subsets.
Recursively apply the same operation to each subset of the training set
with the remaining attributes until a leaf node is derived or no more
training instances are available in the subset.



CART - Example I

Problem :

Consider the table below, used to assess a student's performance during the course
of study and to predict whether the student will get a job offer in the final
year of the course. The training dataset T consists of 10 data instances with the
attributes 'CGPA (C)', 'Interactiveness (I)', 'Practical Knowledge (Pk)', and
'Communication Skills (Cs)', as shown below:

Sl.No  CGPA  Interactiveness  Practical Knowledge  Communication Skills  Job Offer
1      ≥ 9   Yes              Very Good            Good                  Yes
2      ≥ 8   No               Good                 Moderate              Yes
3      ≥ 9   No               Average              Poor                  No
4      < 8   No               Average              Good                  No
5      ≥ 8   Yes              Good                 Moderate              Yes
6      ≥ 9   Yes              Good                 Moderate              Yes
7      < 8   Yes              Good                 Poor                  No
8      ≥ 9   No               Very Good            Good                  Yes
9      ≥ 8   Yes              Good                 Good                  Yes
10     ≥ 8   Yes              Average              Good                  Yes



CART - Example II

Solution :

Step-1 : Calculate the Gini_Index for the data set, consists of 10 data
instances. The target attribute ’Job Offer’ has 7 instances as ’Yes’ and 3
instances as ’No’.

∴ Gini_Index(T) $= 1 - \left(\frac{7}{10}\right)^2 - \left(\frac{3}{10}\right)^2 = 0.42$

Step-2 : Compute Gini_Index for each of the attribute and each of the subset
in the attribute.

CGPA : It has 3 categories, so there are 6 subsets and hence 3 combinations of
subsets as below

Table: Categories of CGPA

CGPA   Job Offer = 'Yes'   Job Offer = 'No'
>= 9   3                   1
>= 8   4                   0
< 8    0                   2



CART - Example III

Gini_Index(T, CGPA ∈ {≥ 9, ≥ 8}) $= 1 - \left(\frac{7}{8}\right)^2 - \left(\frac{1}{8}\right)^2 \approx 0.2194$
Gini_Index(T, CGPA ∈ {< 8}) $= 1 - \left(\frac{0}{2}\right)^2 - \left(\frac{2}{2}\right)^2 = 0$
Gini_Index(T, CGPA ∈ {(≥ 9, ≥ 8), < 8}) $= \frac{8}{10} \cdot 0.2194 + \frac{2}{10} \cdot 0 = 0.17552$

Gini_Index(T, CGPA ∈ {≥ 9, < 8}) $= 1 - \left(\frac{3}{6}\right)^2 - \left(\frac{3}{6}\right)^2 = 0.5$
Gini_Index(T, CGPA ∈ {≥ 8}) $= 1 - \left(\frac{4}{4}\right)^2 - \left(\frac{0}{4}\right)^2 = 0$
Gini_Index(T, CGPA ∈ {(≥ 9, < 8), ≥ 8}) $= \frac{6}{10} \cdot 0.5 + \frac{4}{10} \cdot 0 = 0.3$

Gini_Index(T, CGPA ∈ {≥ 8, < 8}) $= 1 - \left(\frac{4}{6}\right)^2 - \left(\frac{2}{6}\right)^2 \approx 0.445$
Gini_Index(T, CGPA ∈ {≥ 9}) $= 1 - \left(\frac{3}{4}\right)^2 - \left(\frac{1}{4}\right)^2 = 0.375$
Gini_Index(T, CGPA ∈ {(≥ 8, < 8), ≥ 9}) $= \frac{6}{10} \cdot 0.445 + \frac{4}{10} \cdot 0.375 = 0.417$

Subsets                 Gini_Index
(≥ 9, ≥ 8)   < 8        0.1755
(≥ 9, < 8)   ≥ 8        0.3
(≥ 8, < 8)   ≥ 9        0.417

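The subset enumeration for CGPA can be sketched in Python. The names `ROWS`, `binary_partitions`, and `weighted_gini` are assumptions made for this illustration; `ROWS` keeps only the CGPA and Job Offer columns of the training table. Exact arithmetic gives 0.175 for the best split, which the hand calculation above reports as 0.1755 after rounding intermediate values.

```python
from collections import Counter
from itertools import combinations

# (CGPA category, Job Offer) for the 10 training instances
ROWS = [(">=9", "Yes"), (">=8", "Yes"), (">=9", "No"), ("<8", "No"), (">=8", "Yes"),
        (">=9", "Yes"), ("<8", "No"), (">=9", "Yes"), (">=8", "Yes"), (">=8", "Yes")]

def gini(labels):
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def binary_partitions(values):
    """Yield one side of every two-way split of a set of category values."""
    first, *rest = sorted(values)
    for r in range(len(rest)):  # r == len(rest) would leave the other side empty
        for combo in combinations(rest, r):
            yield frozenset((first, *combo))

def weighted_gini(rows, left_values):
    """Weighted Gini index of splitting rows into left_values vs everything else."""
    left = [y for x, y in rows if x in left_values]
    right = [y for x, y in rows if x not in left_values]
    return (len(left) * gini(left) + len(right) * gini(right)) / len(rows)

categories = {x for x, _ in ROWS}
scores = {left: weighted_gini(ROWS, left) for left in binary_partitions(categories)}
best = min(scores, key=scores.get)
delta = gini([y for _, y in ROWS]) - scores[best]
print(sorted(best), round(scores[best], 4), round(delta, 4))  # ['<8'] 0.175 0.245
```

With 3 categories there are exactly 3 two-way partitions, matching the 3 rows of the table above.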
CART - Example IV

Step-3 : Choose the splitting subset that has the minimum Gini_Index for the
attribute. ∴ the subset CGPA ∈ {(≥ 9, ≥ 8), < 8} is chosen as the best splitting subset.

Step-4 : Compute △Gini for the best splitting subset of that attribute.

△Gini(CGPA) = Gini(T) - Gini(T, CGPA) = 0.42 - 0.1755 = 0.2445

Now repeat the same process for the other attributes in the training data set.

Interactiveness : It has 2 categories, as shown below :

Table: Categories for Interactiveness (I)

Interactiveness(I) Job Offer = Yes Job Offer = No


Yes 5 1
No 2 2



CART - Example V

Gini_Index(T, I ∈ {Yes}) = 1 - (5/6)^2 - (1/6)^2 = 0.28
Gini_Index(T, I ∈ {No}) = 1 - (2/4)^2 - (2/4)^2 = 0.5
Gini_Index(T, I ∈ {Yes, No}) = (6/10) * 0.28 + (4/10) * 0.5 = 0.368

△Gini(I) = Gini(T) - Gini(T, I) = 0.42 - 0.368 = 0.052

Practical Knowledge (Pk) : It has 3 categories, as shown below :


Practical Knowledge (Pk) Job Offer = Yes Job Offer = No
Very Good 2 0
Good 4 1
Average 1 2



CART - Example VI

Gini_Index(T, Pk ∈ {VG, G}) = 1 - (6/7)^2 - (1/7)^2 = 0.2456
Gini_Index(T, Pk ∈ {Avg}) = 1 - (1/3)^2 - (2/3)^2 = 0.445
Gini_Index(T, Pk ∈ {(VG, G), Avg}) = (7/10) * 0.2456 + (3/10) * 0.445 = 0.3054

Gini_Index(T, Pk ∈ {VG, Avg}) = 1 - (3/5)^2 - (2/5)^2 = 0.48
Gini_Index(T, Pk ∈ {G}) = 1 - (4/5)^2 - (1/5)^2 = 0.32
Gini_Index(T, Pk ∈ {(VG, Avg), G}) = (5/10) * 0.48 + (5/10) * 0.32 = 0.4

Gini_Index(T, Pk ∈ {G, Avg}) = 1 - (5/8)^2 - (3/8)^2 = 0.4688
Gini_Index(T, Pk ∈ {VG}) = 1 - (2/2)^2 - (0/2)^2 = 0
Gini_Index(T, Pk ∈ {(G, Avg), VG}) = (8/10) * 0.4688 + (2/10) * 0 = 0.3750

Subsets Gini_Index
(VG, G) Avg 0.3054
(VG, Avg) G 0.40
(G, Avg) VG 0.3750

∴ △Gini(Pk ) = Gini(T) - Gini(T, Pk ) = 0.42 - 0.3054 = 0.1146



CART - Example VII

Communication Skills (Cs ) : It has 3 categories as shown below :


Table: Categories for Communication Skills (Cs )

Communication Skills (Cs) Job Offer = Yes Job Offer = No


Good 4 1
Moderate 3 0
Poor 0 2

Gini_Index(T, Cs ∈ {G, M}) = 1 - (7/8)^2 - (1/8)^2 = 0.2194
Gini_Index(T, Cs ∈ {P}) = 1 - (0/2)^2 - (2/2)^2 = 0
Gini_Index(T, Cs ∈ {(G, M), P}) = (8/10) * 0.2194 + (2/10) * 0 = 0.1755

Gini_Index(T, Cs ∈ {G, P}) = 1 - (4/7)^2 - (3/7)^2 = 0.4899
Gini_Index(T, Cs ∈ {M}) = 1 - (3/3)^2 - (0/3)^2 = 0
Gini_Index(T, Cs ∈ {(G, P), M}) = (7/10) * 0.4899 + (3/10) * 0 = 0.3429

Gini_Index(T, Cs ∈ {M, P}) = 1 - (3/5)^2 - (2/5)^2 = 0.48
Gini_Index(T, Cs ∈ {G}) = 1 - (4/5)^2 - (1/5)^2 = 0.32
Gini_Index(T, Cs ∈ {(M, P), G}) = (5/10) * 0.48 + (5/10) * 0.32 = 0.4
CART - Example VIII

Table: Gini_Index for subsets of Communication Skills

Subsets Gini_Index
{G, M} P 0.1755
{G, P} M 0.3429
{M, P} G 0.40

∴ △Gini(Cs) = Gini(T) - Gini(T, Cs) = 0.42 - 0.1755 = 0.2445

Table: Gini_Index and △Gini for all Attributes

Attribute Gini_Index △Gini


CGPA 0.1755 0.2445
Interactiveness 0.368 0.052
Practical Knowledge 0.3054 0.1146
Communication Skills 0.1755 0.2445
Step-5 : Choose the best splitting attribute, i.e., the one with the maximum △Gini.
CGPA and Communication Skills tie at 0.2445, so either can be chosen; CGPA is
taken as the root here (as shown in the figure).
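
Steps 2 to 5 can be reproduced over the whole training set with a short script. This is a minimal sketch, not the method of any particular library: the names `ROWS`, `gini`, `best_binary_split` and `delta_gini_table` are assumptions for illustration, and the rows are copied from the training dataset T. Exact arithmetic is used, so values may differ slightly from hand calculations that round intermediate results.

```python
from collections import Counter
from itertools import combinations

# (CGPA, Interactiveness, Practical Knowledge, Communication Skills, Job Offer)
ROWS = [
    (">=9", "Yes", "Very Good", "Good", "Yes"),
    (">=8", "No", "Good", "Moderate", "Yes"),
    (">=9", "No", "Average", "Poor", "No"),
    ("<8", "No", "Average", "Good", "No"),
    (">=8", "Yes", "Good", "Moderate", "Yes"),
    (">=9", "Yes", "Good", "Moderate", "Yes"),
    ("<8", "Yes", "Good", "Poor", "No"),
    (">=9", "No", "Very Good", "Good", "Yes"),
    (">=8", "Yes", "Good", "Good", "Yes"),
    (">=8", "Yes", "Average", "Good", "Yes"),
]
ATTRIBUTES = ["CGPA", "Interactiveness", "Practical Knowledge", "Communication Skills"]

def gini(labels):
    """Gini index of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_binary_split(rows, col):
    """Return (subset, weighted Gini) of the best two-way split on column col."""
    values = sorted({r[col] for r in rows})
    best = None
    for size in range(1, len(values)):
        for left in combinations(values, size):
            inside = [r[-1] for r in rows if r[col] in left]
            outside = [r[-1] for r in rows if r[col] not in left]
            g = (len(inside) * gini(inside) + len(outside) * gini(outside)) / len(rows)
            if best is None or g < best[1]:
                best = (left, g)
    return best

def delta_gini_table(rows):
    """Map each attribute to (best subset, weighted Gini, delta Gini)."""
    parent = gini([r[-1] for r in rows])
    table = {}
    for col, name in enumerate(ATTRIBUTES):
        subset, g = best_binary_split(rows, col)
        table[name] = (subset, g, parent - g)
    return table

table = delta_gini_table(ROWS)
for name, (subset, g, delta) in table.items():
    print(f"{name}: split on {subset}, Gini = {g:.4f}, delta Gini = {delta:.4f}")
```

CGPA and Communication Skills come out with the same, largest △Gini, which matches the tie in Step-5.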



CART - Example IX

Figure: DT after first Iteration



CART - Example X

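Iteration-2 : The branch CGPA ∈ {< 8} contains only ’No’ instances, so it becomes a
leaf. The branch CGPA ∈ {≥ 9, ≥ 8} now has 8 instances. Repeat the same process on
this subset to find the best splitting attribute and the splitting subset for that attribute.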
Sl.No  CGPA  Interactiveness  Practical Knowledge  Communication Skills  Job Offer
1      ≥ 9   Yes              Very Good            Good                  Yes
2      ≥ 8   No               Good                 Moderate              Yes
3      ≥ 9   No               Average              Poor                  No
5      ≥ 8   Yes              Good                 Moderate              Yes
6      ≥ 9   Yes              Good                 Moderate              Yes
8      ≥ 9   No               Very Good            Good                  Yes
9      ≥ 8   Yes              Good                 Good                  Yes
10     ≥ 8   Yes              Average              Good                  Yes

Gini_Index(T) = 1 - (7/8)^2 - (1/8)^2 = 1 - 0.766 - 0.0156 = 0.2184

Table: Categories of Interactiveness

Interactiveness (I) Job Offer = Yes Job Offer = No


Yes 5 0
No 2 1



CART - Example XI

Gini_Index(T, I ∈ {Yes}) = 1 - (5/5)^2 - (0/5)^2 = 0
Gini_Index(T, I ∈ {No}) = 1 - (2/3)^2 - (1/3)^2 = 0.4444
Gini_Index(T, I ∈ {Yes, No}) = (5/8) * 0 + (3/8) * 0.4444 = 0.1667

△Gini(Interactiveness) = Gini(T) - Gini(T, Interactiveness) = 0.2184 - 0.1667 = 0.0517

Table: Categories for Practical Knowledge (Pk )

Practical Knowledge Job Offer = Yes Job Offer = No


Very Good 2 0
Good 4 0
Average 1 1

Gini_Index(T, Pk ∈ {VG, G}) = 1 - (6/6)^2 - (0/6)^2 = 0
Gini_Index(T, Pk ∈ {Avg}) = 1 - (1/2)^2 - (1/2)^2 = 0.5
Gini_Index(T, Pk ∈ {(VG, G), Avg}) = (6/8) * 0 + (2/8) * 0.5 = 0.125



CART - Example XII

Gini_Index(T, Pk ∈ {VG, Avg}) = 1 - (3/4)^2 - (1/4)^2 = 0.375
Gini_Index(T, Pk ∈ {G}) = 1 - (4/4)^2 - (0/4)^2 = 0
Gini_Index(T, Pk ∈ {(VG, Avg), G}) = (4/8) * 0.375 + (4/8) * 0 = 0.1875

Gini_Index(T, Pk ∈ {G, Avg}) = 1 - (5/6)^2 - (1/6)^2 = 0.278
Gini_Index(T, Pk ∈ {VG}) = 1 - (2/2)^2 - (0/2)^2 = 0
Gini_Index(T, Pk ∈ {(G, Avg), VG}) = (6/8) * 0.278 + (2/8) * 0 = 0.2085

Table: Gini_Index for subsets of Practical Knowledge Pk

Subsets Gini_Index
{VG, G} Avg 0.125
{VG, Avg} G 0.1875
{G, Avg} VG 0.2085

∴ △Gini(Pk ) = Gini(T) - Gini(T, Pk ) = 0.2184 - 0.125 = 0.0934



CART - Example XIII

Table: Categories for Communication Skills (Cs )

Communication Skills Job Offer = Yes Job Offer = No


Good 4 0
Moderate 3 0
Poor 0 1

Gini_Index(T, Cs ∈ {G, M}) = 1 - (7/7)^2 - (0/7)^2 = 0
Gini_Index(T, Cs ∈ {P}) = 1 - (0/1)^2 - (1/1)^2 = 0
Gini_Index(T, Cs ∈ {(G, M), P}) = (7/8) * 0 + (1/8) * 0 = 0

Gini_Index(T, Cs ∈ {G, P}) = 1 - (4/5)^2 - (1/5)^2 = 0.32
Gini_Index(T, Cs ∈ {M}) = 1 - (3/3)^2 - (0/3)^2 = 0
Gini_Index(T, Cs ∈ {(G, P), M}) = (5/8) * 0.32 + (3/8) * 0 = 0.2

Gini_Index(T, Cs ∈ {M, P}) = 1 - (3/4)^2 - (1/4)^2 = 0.375
Gini_Index(T, Cs ∈ {G}) = 1 - (4/4)^2 - (0/4)^2 = 0
Gini_Index(T, Cs ∈ {(M, P), G}) = (4/8) * 0.375 + (4/8) * 0 = 0.1875



CART - Example XIV

Table: Gini_Index for subsets of Communication Skills

Subsets Gini_Index
{G, M} P 0
{G, P} M 0.2
{M, P} G 0.1875

∴ △Gini(Cs ) = Gini(T) - Gini(T, Cs ) = 0.2184 - 0 = 0.2184

Table: Gini_Index and △Gini Values of all attributes

Attribute Gini_Index △Gini


Interactiveness 0.1667 0.0517
Practical Knowledge 0.125 0.0934
Communication Skills 0 0.2184
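
The second iteration can be checked the same way. The sketch below works directly from the (Yes, No) counts in the three category tables for the 8-instance subset; the names `gini`, `best_split` and `subset_counts` are assumptions for illustration. Each branch is weighted by its share of the 8 instances (for Interactiveness, 5/8 for ’Yes’ and 3/8 for ’No’), and exact arithmetic is used throughout.

```python
from itertools import combinations

def gini(yes, no):
    """Gini index of a node with the given 'Yes' and 'No' counts."""
    n = yes + no
    if n == 0:
        return 0.0
    return 1.0 - (yes / n) ** 2 - (no / n) ** 2

def best_split(counts):
    """Return (left categories, weighted Gini) of the best two-way split."""
    cats = list(counts)
    total = sum(y + n for y, n in counts.values())
    best = None
    for size in range(1, len(cats)):
        for left in combinations(cats, size):
            ly = sum(counts[c][0] for c in left)
            ln = sum(counts[c][1] for c in left)
            ry = sum(counts[c][0] for c in cats if c not in left)
            rn = sum(counts[c][1] for c in cats if c not in left)
            g = ((ly + ln) * gini(ly, ln) + (ry + rn) * gini(ry, rn)) / total
            if best is None or g < best[1]:
                best = (left, g)
    return best

# (Yes, No) counts in the 8-instance subset, from the category tables
subset_counts = {
    "Interactiveness": {"Yes": (5, 0), "No": (2, 1)},
    "Practical Knowledge": {"Very Good": (2, 0), "Good": (4, 0), "Average": (1, 1)},
    "Communication Skills": {"Good": (4, 0), "Moderate": (3, 0), "Poor": (0, 1)},
}
parent = gini(7, 1)  # the subset has 7 'Yes' and 1 'No'
delta = {name: parent - best_split(c)[1] for name, c in subset_counts.items()}
best_attribute = max(delta, key=delta.get)
for name, d in delta.items():
    print(f"{name}: delta Gini = {d:.4f}")
print("best:", best_attribute)
```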



CART - Example XV

Communication Skills has the highest △Gini value. The tree is further
branched based on the attribute ’Communication Skills’.

Here, all the branches end in a leaf node, so the construction of the tree is
complete.
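
The finished tree can be written as two nested tests. This is a sketch that assumes the splits derived above (CGPA ∈ {< 8} versus {≥ 9, ≥ 8} at the root, then Communication Skills ∈ {Poor} versus {Good, Moderate}); the function name `predict_job_offer` and the string encodings of the categories are illustrative choices.

```python
def predict_job_offer(cgpa, communication_skills):
    """Classify one student with the two-level tree from the worked example.

    cgpa: one of ">=9", ">=8", "<8"
    communication_skills: one of "Good", "Moderate", "Poor"
    """
    if cgpa == "<8":
        return "No"   # root split: the CGPA < 8 branch is a pure 'No' leaf
    if communication_skills == "Poor":
        return "No"   # second split: the 'Poor' branch is a pure 'No' leaf
    return "Yes"      # Good or Moderate communication skills

# (CGPA, Communication Skills, Job Offer) for the 10 training instances
TRAINING = [
    (">=9", "Good", "Yes"), (">=8", "Moderate", "Yes"), (">=9", "Poor", "No"),
    ("<8", "Good", "No"), (">=8", "Moderate", "Yes"), (">=9", "Moderate", "Yes"),
    ("<8", "Poor", "No"), (">=9", "Good", "Yes"), (">=8", "Good", "Yes"),
    (">=8", "Good", "Yes"),
]
correct = sum(predict_job_offer(c, cs) == label for c, cs, label in TRAINING)
print(f"{correct} of {len(TRAINING)} training instances classified correctly")
```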



Regression Trees

Regression trees (RT) are a variant of decision trees (DT) in which the target
feature is a continuous-valued variable.

These trees are constructed using an algorithm called reduction in variance,
which uses the standard deviation (SD) to choose the best splitting attribute.

Algorithm
1. Compute the standard deviation (SD) of the target variable for the data set.
2. For each attribute, compute the SD of the target variable for the data instances
   of each distinct value of the attribute, and from these the weighted SD for the
   attribute.
3. Compute the SD reduction for each attribute by subtracting its weighted SD from
   the SD of the data set.
4. Choose the attribute with the highest SD reduction as the best split attribute.
5. Place the best split attribute as the root node.
6. Branch the root node into subtrees, one for each outcome of the test condition
   on the root node attribute.
7. Recursively apply the same operation to the subset of the training set with the
   remaining attributes, until a leaf node is derived or no more training instances
   are available in the subset.
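
The attribute-selection part of the algorithm can be sketched as follows. The function names and the toy records are hypothetical, chosen only for illustration. The sample standard deviation (`statistics.stdev`, divisor n - 1) is assumed here, and a group with a single instance is given an SD of 0.

```python
from collections import defaultdict
from statistics import stdev

def sample_sd(values):
    """Sample standard deviation; a single value is treated as SD 0."""
    return stdev(values) if len(values) > 1 else 0.0

def sd_reduction(rows, attribute, target):
    """SD of the target over all rows minus the weighted SD after splitting on attribute."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row[target])
    weighted = sum(len(g) / len(rows) * sample_sd(g) for g in groups.values())
    return sample_sd([row[target] for row in rows]) - weighted

def best_split_attribute(rows, attributes, target):
    """Attribute with the highest SD reduction."""
    return max(attributes, key=lambda a: sd_reduction(rows, a, target))

# Hypothetical toy records, not taken from the slides
toy = [
    {"size": "small", "colour": "red", "price": 10.0},
    {"size": "small", "colour": "blue", "price": 12.0},
    {"size": "large", "colour": "red", "price": 30.0},
    {"size": "large", "colour": "blue", "price": 32.0},
]
chosen = best_split_attribute(toy, ["size", "colour"], "price")
print(chosen)
```

Splitting on `size` groups similar prices together, so it gives the larger reduction; building the full tree would repeat this choice recursively on each branch.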



Regression Trees : Example I

Construct an RT using the following table, which consists of 10 data instances
and 3 attributes: ’Assessment’, ’Assignment’ and ’Project’. The target
variable is ’Result’, which is a continuous attribute.

S.No. Assessment Assignment Project Result(%)


1 Good Yes Yes 95
2 Average Yes No 70
3 Good No Yes 75
4 Poor No No 45
5 Good Yes Yes 98
6 Average No Yes 80
7 Good No No 75
8 Poor Yes Yes 65
9 Average No No 58
10 Good Yes Yes 89



Regression Trees : Example II

Solution :

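Step-1 : Compute the SD for each attribute w.r.t. the target attribute.

Standard Deviation for the DS

Average = (95 + 70 + 75 + 45 + 98 + 80 + 75 + 65 + 58 + 89) / 10 = 75

SD = sqrt( [(95-75)^2 + (70-75)^2 + (75-75)^2 + (45-75)^2 + (98-75)^2 + (80-75)^2 + (75-75)^2 + (65-75)^2 + (58-75)^2 + (89-75)^2] / (10 - 1) )
SD = 16.55

Note : The values in this example correspond to the sample standard deviation (divisor n - 1).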
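
The value 16.55 can be checked with Python's standard `statistics` module. The sketch compares the sample standard deviation (`stdev`, divisor n - 1) with the population standard deviation (`pstdev`, divisor n); the variable names are illustrative.

```python
from statistics import pstdev, stdev

# 'Result' values of the 10 data instances
results = [95, 70, 75, 45, 98, 80, 75, 65, 58, 89]

mean = sum(results) / len(results)
sample_sd = stdev(results)       # divides the squared deviations by n - 1
population_sd = pstdev(results)  # divides the squared deviations by n

print(mean, round(sample_sd, 2), round(population_sd, 2))
```

Only the sample standard deviation rounds to 16.55; dividing by n = 10 gives about 15.70.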

Assessment = Good
Table: Attribute Assessment = Good

S.No. Assessment Assignment Project Result(%)


1 Good Yes Yes 95
3 Good No Yes 75
5 Good Yes Yes 98
7 Good No No 75
10 Good Yes Yes 89



Regression Trees : Example III

Standard Deviation for Assessment = Good


Average = (95 + 75 + 98 + 75 + 89) / 5 = 86.4

SD = sqrt( [(95-86.4)^2 + (75-86.4)^2 + (98-86.4)^2 + (75-86.4)^2 + (89-86.4)^2] / (5 - 1) )
SD = 10.9

Assessment = Average
Table: Attribute Assessment = Average

S.No. Assessment Assignment Project Result(%)


2 Average Yes No 70
6 Average No Yes 80
9 Average No No 58

Standard Deviation for Assessment = Average


Average = (70 + 80 + 58) / 3 = 69.3

SD = sqrt( [(70-69.3)^2 + (80-69.3)^2 + (58-69.3)^2] / (3 - 1) )
SD = 11.01



Regression Trees : Example IV

Assessment = Poor
Table: Attribute Assessment = Poor

S.No. Assessment Assignment Project Result(%)


4 Poor No No 45
8 Poor Yes Yes 65

Standard Deviation for Assessment = Poor


Average = (45 + 65) / 2 = 55

SD = sqrt( [(45-55)^2 + (65-55)^2] / (2 - 1) ) = 14.14

Table: Standard Deviation for Assessment

Assessment Standard Deviation Data Instances


Good 10.9 5
Average 11.01 3
Poor 14.14 2



Regression Trees : Example V

Weighted SD for Assessment = (5/10) * 10.9 + (3/10) * 11.01 + (2/10) * 14.14 = 11.58
SD reduction for Assessment = 16.55 - 11.58 = 4.97

Assignment = Yes
Table: Attribute Assignment = Yes

S.No. Assessment Assignment Project Result(%)


1 Good Yes Yes 95
2 Average Yes No 70
5 Good Yes Yes 98
8 Poor Yes Yes 65
10 Good Yes Yes 89

Standard Deviation for Assignment = Yes


Average = (95 + 70 + 98 + 65 + 89) / 5 = 83.4

SD = sqrt( [(95-83.4)^2 + (70-83.4)^2 + (98-83.4)^2 + (65-83.4)^2 + (89-83.4)^2] / (5 - 1) ) = 14.98



Regression Trees : Example VI

Assignment = No
Table: Assignment = No

S.No. Assessment Assignment Project Result(%)


3 Good No Yes 75
4 Poor No No 45
6 Average No Yes 80
7 Good No No 75
9 Average No No 58

Standard Deviation for Assignment = No


Average = (75 + 45 + 80 + 75 + 58) / 5 = 66.6

SD = sqrt( [(75-66.6)^2 + (45-66.6)^2 + (80-66.6)^2 + (75-66.6)^2 + (58-66.6)^2] / (5 - 1) ) = 14.7

Table: Standard Deviation for Assignment

Assignment Standard Deviation Data Instances


Yes 14.98 5
No 14.7 5
Regression Trees : Example VII

Weighted SD for Assignment = (5/10) * 14.98 + (5/10) * 14.7 = 14.84
SD reduction for Assignment = 16.55 - 14.84 = 1.71

Project = Yes
Table: Project = Yes

S.No. Assessment Assignment Project Result(%)


1 Good Yes Yes 95
3 Good No Yes 75
5 Good Yes Yes 98
6 Average No Yes 80
8 Poor Yes Yes 65
10 Good Yes Yes 89

Standard Deviation for Project = Yes


Average = (95 + 75 + 98 + 80 + 65 + 89) / 6 = 83.7

SD = sqrt( [(95-83.7)^2 + (75-83.7)^2 + (98-83.7)^2 + (80-83.7)^2 + (65-83.7)^2 + (89-83.7)^2] / (6 - 1) ) = 12.6



Regression Trees : Example VIII

Project = No

Table: Project = No

S.No. Assessment Assignment Project Result(%)


2 Average Yes No 70
4 Poor No No 45
7 Good No No 75
9 Average No No 58

Standard Deviation for Project = No


Average = (70 + 45 + 75 + 58) / 4 = 62

SD = sqrt( [(70-62)^2 + (45-62)^2 + (75-62)^2 + (58-62)^2] / (4 - 1) ) = 13.39

Table: Standard Deviation for Project

Project Standard Deviation Data Instances


Yes 12.6 6
No 13.39 4



Regression Trees : Example IX

Weighted SD for Project = (6/10) * 12.6 + (4/10) * 13.39 = 12.92
SD reduction for Project = 16.55 - 12.92 = 3.63

Table: Standard Deviation Reduction for all the attributes in the DataSet

Attributes Standard Deviation Reduction
Assessment 4.97
Assignment 1.71
Project 3.63

The attribute ’Assessment’ has the maximum SD reduction and hence it is
chosen as the best splitting attribute, as shown in the figure.
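
The three SD reductions can be recomputed from the data table with the sketch below. The names `ROWS`, `ATTRIBUTES` and `sd_reduction` are assumptions for illustration, and the sample standard deviation (`statistics.stdev`) is assumed. Because no intermediate value is rounded, the results (about 4.96, 1.72 and 3.60) differ slightly from the hand-computed 4.97, 1.71 and 3.63, but the ranking is the same.

```python
from collections import defaultdict
from statistics import stdev

# (Assessment, Assignment, Project, Result %)
ROWS = [
    ("Good", "Yes", "Yes", 95),
    ("Average", "Yes", "No", 70),
    ("Good", "No", "Yes", 75),
    ("Poor", "No", "No", 45),
    ("Good", "Yes", "Yes", 98),
    ("Average", "No", "Yes", 80),
    ("Good", "No", "No", 75),
    ("Poor", "Yes", "Yes", 65),
    ("Average", "No", "No", 58),
    ("Good", "Yes", "Yes", 89),
]
ATTRIBUTES = {"Assessment": 0, "Assignment": 1, "Project": 2}

def sd_reduction(rows, col):
    """SD of Result over all rows minus the weighted SD after splitting on column col."""
    total_sd = stdev([r[-1] for r in rows])
    groups = defaultdict(list)
    for r in rows:
        groups[r[col]].append(r[-1])
    # Every group in this data set has at least two instances, so stdev is defined.
    weighted_sd = sum(len(g) / len(rows) * stdev(g) for g in groups.values())
    return total_sd - weighted_sd

reductions = {name: sd_reduction(ROWS, col) for name, col in ATTRIBUTES.items()}
best_attribute = max(reductions, key=reductions.get)
for name, value in reductions.items():
    print(f"{name}: SD reduction = {value:.2f}")
print("best:", best_attribute)
```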



Regression Trees : Example X

Figure: RT with Assessment as Root Node

Note : The training DS is split into subsets based on the attribute
’Assessment’, and this process is continued until the entire tree is constructed.