CSE3013 Module6
Learning Systems
Definition-1
It is a system of computer algorithms that can learn from examples through
self-improvement, without being explicitly coded by a programmer.
Definition-2
It is all about teaching computers how to learn from data so they can make
decisions, make predictions, or identify patterns without being explicitly
programmed to.
Definition-3
Machine learning enables a machine to automatically learn from data, improve
performance from experiences, and predict things without being explicitly
programmed.
A Machine Learning system learns from historical data, builds prediction
models, and whenever it receives new data, predicts the output for it.
Here, we provide data and let the machine find out the patterns in the dataset.
For instance, given 3 different shapes (circles, triangles, and squares), we
let the machine cluster them. Such a technique is called clustering.
Here, the machine is commonly referred to as an agent, and the agent receives
a reward (or a penalty) based on each of its actions. It then learns what would
be the best actions to maximize the rewards and alleviate the penalties.
After getting trained on data, the goal of our trained model is to
generalize to unseen data as accurately as possible.
If the model yields very accurate results on training data but fails to
generalize to unseen data, it’s called over-fitting, because the model
over-fits the training data.
If the model doesn’t even predict accurately on training data, that means
the model has not learned anything, which is known as under-fitting.
Augmentation:
Machine learning that assists humans with their day-to-day tasks,
personally or commercially, without having complete control of the output.
Such machine learning is used in different ways, such as virtual assistants,
data analysis, and software solutions. The primary use is to reduce errors
due to human bias.
Automation:
Machine learning that works entirely autonomously in any field, without
the need for any human intervention. For example, robots performing the
essential process steps in manufacturing plants.
Finance Industry :
Machine learning is growing in popularity in the finance industry. Banks
mainly use ML to find patterns inside the data and also to prevent
fraud.
Government organizations :
The government makes use of ML to manage public safety and utilities.
Take the example of China with its massive face-recognition system. The
government uses artificial intelligence to prevent jaywalking.
Healthcare industry
Healthcare was one of the first industries to use machine learning, with
image detection.
Marketing
Broad use of AI is done in marketing thanks to abundant access to data.
Before the age of mass data, researchers developed advanced mathematical
tools like Bayesian analysis to estimate the value of a customer. With the
boom of data, marketing departments rely on AI to optimize customer
relationships and marketing campaigns.
1 Gathering Data : the first step, in which we identify and obtain the data
related to the problem. The quantity and quality of the collected data
determine the efficiency of the output: the more data we have, the more
accurate the prediction will be.
Identify various data sources (structured/unstructured): files, databases,
the internet
Collect the data
Integrate the data obtained from different sources into a coherent set of
data (a dataset)
2 Data Preparation : a step where we put our data into a suitable place
and prepare it for use in our machine learning training.
Data exploration: It is used to understand the nature of data that we have
to work with. We need to understand the characteristics, format, and
quality of data. A better understanding of data leads to an effective
outcome. In this, we find Correlations, general trends, and outliers.
Data pre-processing: Now the next step is preprocessing of data for its
analysis.
6 Test Model : check the accuracy of the trained model by providing
a test dataset to it. Testing the model determines the percentage accuracy
of the model as per the requirements of the project or problem.
What is a Dataset?
A dataset is a collection of data in which data is arranged in some order.
A dataset can contain any data from a series of an array to a database
table.
Note: These datasets are large, so to download them you must have a fast
internet connection.
Labeled data indicates that some input data has already been tagged with the
appropriate output.
Supervised learning is the process of providing correct input and output data to
a machine learning model. The goal is to find a mapping function that
maps the input variable (X) to the output variable (Y).
In the real-world, supervised learning can be used for Risk Assessment, Image
classification, Fraud Detection, spam filtering, etc.
Models are trained using labelled datasets, where the model learns about each
type of data. After the training process is completed, the model is tested on
test data (a held-out subset of the data, kept separate from training) and
predicts the output.
The machine has already been trained on all types of shapes, and when it
discovers a new one, it classifies it based on the number of sides and predicts
the output.
With the help of supervised learning, the model can predict the output on
the basis of prior experiences.
In supervised learning, we can have an exact idea about the classes of
objects.
Supervised learning model helps us to solve various real-world problems
such as fraud detection, spam filtering, etc.
Supervised learning models are not suitable for handling complex tasks.
Supervised learning cannot predict the correct output if the test data is
different from the training dataset.
Training requires a lot of computation time.
In supervised learning, we need enough knowledge about the classes of
objects.
However, there may be many cases where we do not have labelled data and
must find hidden patterns in the given dataset. Unsupervised learning
techniques are required to solve such types of cases in machine learning.
It is a type of ML in which models are trained using an unlabeled dataset and
are allowed to act on that data without any supervision.
Given a dataset containing images of various types of cats and dogs, the
algorithm is never trained on it, so it has no idea about the dataset’s
characteristics.
The task of this learning is to identify the image features on its own, which
it performs by clustering the image dataset into groups according to the
similarities between images.
Now, this unlabeled data is fed to the ML model in order to train it. Firstly, it
will interpret the raw data to find the hidden patterns from the data and then
will apply suitable algorithms such as k-means clustering, Decision tree, etc.
Once it applies the suitable algorithm, the algorithm divides the data objects
into groups according to the similarities and differences between the objects.
K-means clustering
KNN (k-nearest neighbors)
Hierarchical clustering
Anomaly detection
Neural Networks
Principal Component Analysis
Independent Component Analysis
Apriori algorithm
Singular value decomposition
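As an illustration of the clustering idea, here is a minimal sketch of k-means in plain Python. The data points, the choice of k, and the iteration count are invented for the example; real uses would rely on a library implementation.

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal 2-D k-means: repeatedly assign each point to its
    nearest centroid, then move each centroid to its cluster mean."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: nearest centroid by Euclidean distance.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[i].append(p)
        # Update step: recompute each non-empty cluster's centroid.
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = (sum(p[0] for p in cl) / len(cl),
                                sum(p[1] for p in cl) / len(cl))
    return centroids, clusters

# Two well-separated groups of points; k-means should recover them.
data = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = kmeans(data, k=2)
```

On this toy data the two recovered clusters are the two groups, with one centroid near each group's mean.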
The best way to train your dog is by using a reward system. You give the dog a
treat when it behaves well, and you chastise it when it does something wrong.
Same policy can be applied to ML models too! This type of learning method,
where we use a reward system to train our model, is called Reinforcement
Learning.
Also, learning from a small subset of actions will not help expand the vast
realm of solutions that may work for a particular problem. Machines need to
learn to perform actions by themselves and not just learn from humans.
Consider a dog that we have to house train. Here, the dog is the agent and the
house, the environment.
We can get the dog to perform various actions by offering incentives such as
dog biscuits as a reward.
The dog will follow a policy to maximize its reward and hence will follow every
command and might even learn a new action, like begging, all by itself.
The dog will also want to run around and play and explore its environment.
This quality of a model is called Exploration.
Types of Learning

                 Supervised                    Unsupervised                  Reinforcement
Data             Labeled data, with output     Unlabeled data; outputs are   The machine learns from its
                 values specified              not specified, and the        environment using rewards
                                               machine makes its own         and errors
                                               predictions
Problems         Regression and                Association and               Reward-based problems
                 Classification problems       Clustering problems
Data used        Labeled data                  Unlabeled data                No predefined data
Supervision      External supervision          No supervision                No supervision
Approach         Solves problems by mapping    Solves problems by            Follows a trial-and-error
                 labeled input to known        understanding patterns        problem-solving approach
                 output                        and discovering outputs
Agent: Agent is the model that is being trained via reinforcement learning
Environment: The training situation that the model must optimize to is
called its environment
Action: All possible steps that can be taken by the model
State: The current position/ condition returned by the model
Reward: To help the model move in the right direction, it is rewarded
(given points) to appraise a desirable action
Policy: Policy determines how an agent will behave at any time. It acts as
a mapping between Action and present State
It states that the future is independent of the past, given the present. This
means that, given the present state, the next state can be predicted easily,
without the need for the previous state.
In the diagram shown, we need to find the shortest path between node A and D.
Each path has a reward associated with it, and the path with maximum reward
is what we want to choose.
The process will maximize the output based on the reward at each step and will
traverse the path with the highest reward. This process does not explore but
maximizes reward.
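The idea of choosing the maximum-reward path can be sketched with a few lines of Python. The edge rewards below are invented, since the values in the lecture diagram are not reproduced here; the sketch also shows why a purely greedy choice can miss the best path.

```python
# Hypothetical rewards for a small A-to-D graph (made-up values).
graph = {'A': ['B', 'C'], 'B': ['D'], 'C': ['D'], 'D': []}
rewards = {('A', 'B'): 2, ('A', 'C'): 5, ('B', 'D'): 10, ('C', 'D'): 1}

def best_path(node, goal):
    """Enumerate every path from node to goal and return the pair
    (total reward, path) with the maximum accumulated reward."""
    if node == goal:
        return 0, [goal]
    best_total, best_route = float('-inf'), None
    for nxt in graph[node]:
        total, route = best_path(nxt, goal)
        total += rewards[(node, nxt)]
        if total > best_total:
            best_total, best_route = total, [node] + route
    return best_total, best_route

total, path = best_path('A', 'D')
# A greedy agent would grab the larger first-step reward (A -> C, 5)
# and finish with 6; exhaustive search finds A -> B -> D with 12.
```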
In order to build a tree, we use the CART algorithm, which stands for
Classification and Regression Tree algorithm.
A decision tree simply asks a question, and based on the answer (Yes/No),
it further splits the tree into subtrees. The diagram below explains the
general structure of a decision tree:
While implementing a DT, the main issue is how to select the
best attribute for the root node and for the sub-nodes. To solve such
problems, there is a technique called Attribute Selection Measure (ASM). By
this measure, we can easily select the best attribute for the nodes of the
tree. There are two popular techniques for ASM, which are:
Information Gain
Gini Index
GI = 1 − Σ_{j=1}^{C} P_j²
where C is the number of classes and P_j is the probability associated with the j-th class.
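The formula translates directly into code; a small sketch:

```python
def gini_index(probs):
    """Gini index: GI = 1 - sum of P_j^2 over the C classes."""
    return 1 - sum(p * p for p in probs)

# A pure node (one class) has GI = 0; the 7-Yes/3-No 'Job Offer'
# target used in the examples later gives GI = 0.42.
```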
There are many decision tree algorithms, such as ID3, C4.5, CART, CHAID,
QUEST, GUIDE, CRUISE and CTREE, that are used for classification in
real-time environment.
The accuracy depends upon the selection of the best split attribute.
ID3 stands for Iterative Dichotomiser 3 and is named such because the
algorithm iteratively (repeatedly) dichotomizes (divides) features into two or
more groups at each step.
Algorithm
Compute "Entropy_Info" for the whole training DS based on the target variable.
Compute "Entropy_Info" and "Information Gain" for each attribute in the DS.
Choose the attribute for which entropy is minimum and gain is maximum as the
best split attribute and consider it as the root node.
The root node is branched into subtrees, with each subtree as an outcome of the
test condition of the root node attribute. Accordingly, the training DS is also
split into subsets.
Recursively repeat the process for each subset till a leaf node is derived or no
more training instances are available in the subset.
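The entropy and information-gain steps can be sketched in Python. The helper names `entropy` and `info_gain` are chosen here for illustration; the dataset is the 10-row Job Offer table used in the worked example.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy_Info of a list of class labels: -sum p * log2(p)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, target):
    """Information gain of splitting `rows` (a list of dicts) on `attr`."""
    n = len(rows)
    weighted = 0.0
    for value in set(r[attr] for r in rows):
        subset = [r[target] for r in rows if r[attr] == value]
        weighted += (len(subset) / n) * entropy(subset)
    return entropy([r[target] for r in rows]) - weighted

# The 10-instance 'Job Offer' training set from the example.
data = [
    {"CGPA": ">=9", "I": "Yes", "PK": "Very Good", "CS": "Good",     "Job": "Yes"},
    {"CGPA": ">=8", "I": "No",  "PK": "Good",      "CS": "Moderate", "Job": "Yes"},
    {"CGPA": ">=9", "I": "No",  "PK": "Average",   "CS": "Poor",     "Job": "No"},
    {"CGPA": "<8",  "I": "No",  "PK": "Average",   "CS": "Good",     "Job": "No"},
    {"CGPA": ">=8", "I": "Yes", "PK": "Good",      "CS": "Moderate", "Job": "Yes"},
    {"CGPA": ">=9", "I": "Yes", "PK": "Good",      "CS": "Moderate", "Job": "Yes"},
    {"CGPA": "<8",  "I": "Yes", "PK": "Good",      "CS": "Poor",     "Job": "No"},
    {"CGPA": ">=9", "I": "No",  "PK": "Very Good", "CS": "Good",     "Job": "Yes"},
    {"CGPA": ">=8", "I": "Yes", "PK": "Good",      "CS": "Good",     "Job": "Yes"},
    {"CGPA": ">=8", "I": "Yes", "PK": "Average",   "CS": "Good",     "Job": "Yes"},
]
best = max(["CGPA", "I", "PK", "CS"], key=lambda a: info_gain(data, a, "Job"))
# CGPA has the largest gain, so it becomes the root node.
```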
Problem :
Consider the table below. We want to assess a student’s performance during his
course of study and predict whether the student will get a job offer or not in
his final year of the course. The training dataset T consists of 10 data
instances with attributes ’CGPA (C)’, ’Interactiveness (I)’, ’Practical
Knowledge (Pk)’ and ’Communication Skills (Cs)’, as shown in the dataset below:
Sl.No   CGPA   Interactiveness   Practical Knowledge   Communication Skills   Job Offer
1       ≥ 9    Yes               Very Good             Good                   Yes
2       ≥ 8    No                Good                  Moderate               Yes
3       ≥ 9    No                Average               Poor                   No
4       < 8    No                Average               Good                   No
5       ≥ 8    Yes               Good                  Moderate               Yes
6       ≥ 9    Yes               Good                  Moderate               Yes
7       < 8    Yes               Good                  Poor                   No
8       ≥ 9    No                Very Good             Good                   Yes
9       ≥ 8    Yes               Good                  Good                   Yes
10      ≥ 8    Yes               Average               Good                   Yes
Solution :
Step-1 : Calculate the Entropy for the Target Class ’Job Offer’ :
Iteration-1 :
Table: Entropy Information for CGPA
Entropy_Info(T, CGPA)
= 4/10 (−3/4 log₂ 3/4 − 1/4 log₂ 1/4) + 4/10 (−4/4 log₂ 4/4 − 0/4 log₂ 0/4) + 2/10 (−0/2 log₂ 0/2 − 2/2 log₂ 2/2)
= 4/10 (0.3111 + 0.4997) + 0 + 0 = 0.3243
Entropy_Info(T, Interactiveness)
= 6/10 (−5/6 log₂ 5/6 − 1/6 log₂ 1/6) + 4/10 (−2/4 log₂ 2/4 − 2/4 log₂ 2/4)
= 6/10 (0.2191 + 0.4306) + 4/10 (0.4997 + 0.4997) = 0.7896
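These hand computations can be checked with a few lines of Python; the tiny differences from the slide values come from rounding.

```python
import math

def H(counts):
    """Entropy of a class-count tuple; empty classes contribute 0."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)

# CGPA partitions the 10 rows into >=9 (3 Yes, 1 No),
# >=8 (4 Yes, 0 No) and <8 (0 Yes, 2 No).
e_cgpa = 4/10 * H((3, 1)) + 4/10 * H((4, 0)) + 2/10 * H((0, 2))
# Interactiveness: Yes (5 Yes, 1 No), No (2 Yes, 2 No).
e_inter = 6/10 * H((5, 1)) + 4/10 * H((2, 2))
```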
Step-3 : Choose the attribute for which entropy is minimum and ∴ the gain is
maximum as the best split attribute. So, we choose CGPA as the root node.
Now continue the same process for the subset of data instances branched with
CGPA ≥ 9.
Iteration-2 :
Here, once again, the same process of computing the Entropy_Info and Gain
is repeated with the subset of the training set.
Entropy_Info(T) = Entropy_Info(3, 1) = −3/4 log₂ 3/4 − 1/4 log₂ 1/4 = 0.8108
Entropy_Info(T, Interactiveness) = 0.5
Entropy_Info(T, Practical Knowledge) = 0
Entropy_Info(T, Communication Skills) = 0
Here, both the attributes ’Practical Knowledge’ and ’Communication Skills’ have
the same Gain, so we can construct the DT using either ’Practical Knowledge’ or
’Communication Skills’.
Merits:
Understandable prediction rules are created from the training data.
Builds a short tree in relatively little time.
It only needs to test enough attributes until all data is classified.
Finding leaf nodes enables test data to be pruned, reducing the number of
tests.
Demerits:
Data may be over-fitted or over-classified, if a small sample is tested.
Only one attribute at a time is tested for making a decision.
Overfitting
It is an undesirable learning behavior that occurs when the machine learning
model gives accurate predictions for training data but not for new data.
Gain_Ratio(A) = Info_Gain(A) / Split_Info(T, A)
Choose the attribute for which Gain_Ratio is maximum as the best split
attribute.
The root node is branched into subtrees, with each subtree as an outcome
of the test condition of the root node attribute. Accordingly, the training
DS is also split into subsets.
Recursively apply the same operation to the subsets of the training set
with the remaining attributes until a leaf node is derived or no more
training instances are available in the subset.
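The Gain_Ratio computation can be sketched as below. The partition sizes 4/4/2 are how CGPA splits the 10 training rows, and 0.5568 is CGPA's information gain (0.8813 for the target minus the 0.3243 weighted entropy computed in the example); both are taken from the worked example.

```python
import math

def split_info(sizes):
    """Split_Info(T, A): entropy of the partition an attribute induces."""
    n = sum(sizes)
    return -sum((s / n) * math.log2(s / n) for s in sizes if s)

def gain_ratio(info_gain, sizes):
    """C4.5: Gain_Ratio(A) = Info_Gain(A) / Split_Info(T, A)."""
    return info_gain / split_info(sizes)

# CGPA splits the 10 rows into groups of sizes 4, 4 and 2.
gr = gain_ratio(0.5568, [4, 4, 2])   # reproduces the 0.3658 in the table
```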
Problem :
Consider the table below. We want to assess a student’s performance during his
course of study and predict whether the student will get a job offer or not in
his final year of the course. The training dataset T consists of 10 data
instances with attributes ’CGPA (C)’, ’Interactiveness (I)’, ’Practical
Knowledge (Pk)’ and ’Communication Skills (Cs)’, as shown in the dataset below:
Sl.No   CGPA   Interactiveness   Practical Knowledge   Communication Skills   Job Offer
1       ≥ 9    Yes               Very Good             Good                   Yes
2       ≥ 8    No                Good                  Moderate               Yes
3       ≥ 9    No                Average               Poor                   No
4       < 8    No                Average               Good                   No
5       ≥ 8    Yes               Good                  Moderate               Yes
6       ≥ 9    Yes               Good                  Moderate               Yes
7       < 8    Yes               Good                  Poor                   No
8       ≥ 9    No                Very Good             Good                   Yes
9       ≥ 8    Yes               Good                  Good                   Yes
10      ≥ 8    Yes               Average               Good                   Yes
Solution :
Step-1 : Calculate the Entropy for the Target Class ’Job Offer’ :
Iteration-1 :
Table: Entropy Information for CGPA
Entropy_Info(T, CGPA)
= 4/10 (−3/4 log₂ 3/4 − 1/4 log₂ 1/4) + 4/10 (−4/4 log₂ 4/4 − 0/4 log₂ 0/4) + 2/10 (−0/2 log₂ 0/2 − 2/2 log₂ 2/2)
= 4/10 (0.3111 + 0.4997) + 0 + 0 = 0.3243
Entropy_Info(T, Interactiveness)
= 6/10 (−5/6 log₂ 5/6 − 1/6 log₂ 1/6) + 4/10 (−2/4 log₂ 2/4 − 2/4 log₂ 2/4)
= 6/10 (0.2191 + 0.4306) + 4/10 (0.4997 + 0.4997) = 0.7896
Attributes              Gain_Ratio
CGPA                    0.3658
Interactiveness         0.0939
Practical Knowledge     0.1648
Communication Skills    0.3502
Step-3 : Choose the attribute for which Gain_Ratio is maximum as the best
split attribute. So, we choose CGPA as the root node.
Now continue the same process for the subset of data instances branched with
CGPA ≥ 9.
Iteration-2 :
Here, once again, the same process of computing the Entropy_Info, Gain,
Split_Info and Gain_Ratio is repeated with the subset of the training set.
The subset consists of 4 data instances.
Entropy_Info(T) = Entropy_Info(3, 1) = −3/4 log₂ 3/4 − 1/4 log₂ 1/4 = 0.8108
Entropy_Info(T, Interactiveness) = 0.5
When an attribute ’A’ has numerical values which are continuous, a
threshold or split point ’s’ is found such that the set of values is categorized
into 2 sets, A < s and A ≥ s. The best split point is the attribute value
which has the maximum information gain for that attribute.
Remove the duplicates and consider only the unique values of the attribute:
6.8  7.9  8.2  8.5  8.8  9.1  9.5
Now compute the gain for the distinct values of this continuous attribute.
From the above table, we observe that CGPA 7.9 has the maximum gain,
0.4462. So we choose CGPA = 7.9 as the split point, giving the sets > 7.9
and <= 7.9.
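The threshold search can be sketched as follows. The labels attached to each CGPA value below are invented for illustration (the lecture's full numeric table is not reproduced here), so the resulting gain will not match the 0.4462 above; the mechanism is the same.

```python
import math

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)

def best_split_point(values, labels):
    """Try each unique value as a threshold s, split into A <= s and
    A > s, and keep the threshold with maximum information gain."""
    n = len(values)
    classes = set(labels)
    base = entropy([labels.count(c) for c in classes])
    best_s, best_gain = None, -1.0
    for s in sorted(set(values))[:-1]:     # last value would leave A > s empty
        left = [l for v, l in zip(values, labels) if v <= s]
        right = [l for v, l in zip(values, labels) if v > s]
        w = (len(left) / n) * entropy([left.count(c) for c in classes]) \
          + (len(right) / n) * entropy([right.count(c) for c in classes])
        if base - w > best_gain:
            best_s, best_gain = s, base - w
    return best_s, best_gain

# Invented labels: low CGPAs get no offer, higher ones do.
cgpa = [6.8, 7.9, 8.2, 8.5, 8.8, 9.1, 9.5]
offer = ["No", "No", "Yes", "Yes", "Yes", "Yes", "Yes"]
split, gain = best_split_point(cgpa, offer)   # splits at 7.9
```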
Next, you may notice that mandarins tend to be slightly darker in color than
oranges. So, you can use a color scale (1 = dark to 10 = light) to split your
tree further:
Color ≤ 5 for the left side of the sub-tree
Color > 5 for the right side of the sub-tree
Gini Impurity for the leftmost leaf node (above figure) would be:
Gini Impurity = 1 − Σ_{i=1}^{n} P_i² = 1 − (0.027² + 0.973²) = 0.053
To find the best split, we need to calculate the weighted sum of Gini Impurity
for both child nodes. We do this for all possible splits and then take the one
with the lowest Gini Impurity as the best split.
Gini Impurity = (37/46) · 0.053 + (9/46) · 0.345 = 0.110
Note
If the best weighted Gini Impurity for the two child nodes is not lower than Gini
Impurity for the parent node, you should not split the parent node any further.
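The weighted-impurity computation can be sketched as below. The class counts (1, 36) and (2, 7) are back-solved assumptions chosen to reproduce the impurities 0.053 and 0.345 and the 37/46 and 9/46 weights used above.

```python
def gini(counts):
    """Gini Impurity of a node from its per-class counts."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def weighted_gini(left, right):
    """Weighted sum of the two child nodes' Gini Impurity."""
    nl, nr = sum(left), sum(right)
    return (nl / (nl + nr)) * gini(left) + (nr / (nl + nr)) * gini(right)

# Assumed counts: left child 1 vs 36 (impurity ~0.053),
# right child 2 vs 7 (impurity ~0.345).
w = weighted_gini((1, 36), (2, 7))   # ~0.110
```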
The Entropy approach is essentially the same as Gini Impurity, except it uses a
slightly different formula:
Entropy = − Σ_{i=1}^{n} p_i log₂(p_i)
To identify the best split, you would have to follow all the same steps outlined
above. The split with the lowest entropy is the best one. Similarly, if the
entropy of the two child nodes is not lower than that of a parent node, you
should not split any further.
Compute Gini_Index(T) = 1 − Σ_{i=1}^{n} P_i², where T is the training DS.
Compute Gini_Index(T, A) = (|S₁|/|T|) Gini(S₁) + (|S₂|/|T|) Gini(S₂), where A
is the attribute.
Choose the best splitting subset, which has the minimum Gini_Index for an
attribute.
Compute △Gini = Gini(T) − Gini(T, A) for the best splitting subset of
that attribute and consider it as the root node.
The root node is branched into 2 subtrees, with each subtree as an outcome
of the test condition of the root node attribute. Accordingly, the training
dataset is also split into 2 subsets.
Recursively apply the same operation to the subsets of the training set
with the remaining attributes until a leaf node is derived or no more
training instances are available in the subset.
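The first two steps above can be sketched in Python; the class counts used here correspond to the CGPA subset split worked out in the example that this section uses (7 Yes / 3 No overall; the subset {≥ 9, ≥ 8} holds 7 Yes / 1 No and {< 8} holds 0 Yes / 2 No).

```python
def gini(counts):
    """Gini Index of a set from its per-class counts: 1 - sum P_i^2."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def gini_index_split(s1, s2):
    """Gini_Index(T, A) = |S1|/|T| * Gini(S1) + |S2|/|T| * Gini(S2)."""
    n1, n2 = sum(s1), sum(s2)
    n = n1 + n2
    return (n1 / n) * gini(s1) + (n2 / n) * gini(s2)

# Whole training set: 7 Yes / 3 No.
g_t = gini((7, 3))                        # 0.42
# CGPA subset {>=9, >=8} vs {<8}.
g_split = gini_index_split((7, 1), (0, 2))
delta = g_t - g_split                     # the Gini gain of this split
```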
Problem :
Consider the table below. We want to assess a student’s performance during his
course of study and predict whether the student will get a job offer or not in
his final year of the course. The training dataset T consists of 10 data
instances with attributes ’CGPA (C)’, ’Interactiveness (I)’, ’Practical
Knowledge (Pk)’ and ’Communication Skills (Cs)’, as shown in the dataset below:
Sl.No   CGPA   Interactiveness   Practical Knowledge   Communication Skills   Job Offer
1       ≥ 9    Yes               Very Good             Good                   Yes
2       ≥ 8    No                Good                  Moderate               Yes
3       ≥ 9    No                Average               Poor                   No
4       < 8    No                Average               Good                   No
5       ≥ 8    Yes               Good                  Moderate               Yes
6       ≥ 9    Yes               Good                  Moderate               Yes
7       < 8    Yes               Good                  Poor                   No
8       ≥ 9    No                Very Good             Good                   Yes
9       ≥ 8    Yes               Good                  Good                   Yes
10      ≥ 8    Yes               Average               Good                   Yes
Solution :
Step-1 : Calculate the Gini_Index for the dataset, which consists of 10 data
instances. The target attribute ’Job Offer’ has 7 instances as ’Yes’ and 3
instances as ’No’.
∴ Gini_Index(T) = 1 − (7/10)² − (3/10)² = 0.42
Step-2 : Compute Gini_Index for each of the attribute and each of the subset
in the attribute.
Subsets                 Gini_Index
{≥ 9, ≥ 8} vs {< 8}     0.1755
{≥ 9, < 8} vs {≥ 8}     0.3
{≥ 8, < 8} vs {≥ 9}     0.417
Dr. Rabindra Kumar Singh "Artificial Intelligence" 103/ 127
CART - Example IV
Step-3 : Choose the best splitting subset, which has the minimum Gini_Index for
an attribute. ∴ the subset CGPA ∈ {(≥ 9, ≥ 8), < 8} is chosen as the best split.
Now repeat the same process for the other attributes in the training dataset.
Subsets           Gini_Index
{G, M} vs {P}     0.1755
{G, P} vs {M}     0.3429
{M, P} vs {G}     0.40
Iteration-2 : Now the DS has 8 instances. Repeat the same process to find the
best splitting attribute and the splitting subset for that attribute.
Sl.No   CGPA   Interactiveness   Practical Knowledge   Communication Skills   Job Offer
1       ≥ 9    Yes               Very Good             Good                   Yes
2       ≥ 8    No                Good                  Moderate               Yes
3       ≥ 9    No                Average               Poor                   No
5       ≥ 8    Yes               Good                  Moderate               Yes
6       ≥ 9    Yes               Good                  Moderate               Yes
8       ≥ 9    No                Very Good             Good                   Yes
9       ≥ 8    Yes               Good                  Good                   Yes
10      ≥ 8    Yes               Average               Good                   Yes
Gini_Index(T, I ∈ {Yes}) = 1 − (5/5)² − (0/5)² = 0
Gini_Index(T, I ∈ {No}) = 1 − (2/3)² − (1/3)² = 0.444
Gini_Index(T, I ∈ {Yes, No}) = (5/8) · 0 + (3/8) · 0.444 = 0.167
Subsets             Gini_Index
{VG, G} vs {Avg}    0.125
{VG, Avg} vs {G}    0.1875
{G, Avg} vs {VG}    0.2085
Subsets           Gini_Index
{G, M} vs {P}     0
{G, P} vs {M}     0.2
{M, P} vs {G}     0.1875
Communication Skills has the highest △Gini value. The tree is further
branched based on the attribute ’Communication Skills’.
Here, all the branches end up in a leaf node and the process of construction
is completed.
Algorithm
Compute the Standard Deviation (SD) of the target variable for the whole dataset.
Compute the SD over the data instances of each distinct value of an attribute,
and the weighted SD for each attribute.
Compute the SD reduction by subtracting the weighted SD of each attribute from
the SD of the target variable.
Choose the attribute with the highest SD reduction as the best split attribute.
The best split attribute is placed as the root node.
The root node is branched into subtrees with each subtree as an outcome of the
test condition of the root node attribute.
Recursively apply the same operation for the subset of the training set with the
remaining attributes until a leaf node is derived or no more training instances are
available in the subset.
Solution :
Assessment = Good
Table: Attribute Assessment = Good
Assessment = Average
Table: Attribute Assessment = Average
Assessment = Poor
Table: Attribute Assessment = Poor
Weighted SD for Assessment = (5/10) · 10.9 + (3/10) · 11.01 + (2/10) · 14.14 = 11.58
SD reduction for Assessment = 16.55 - 11.58 = 4.97
Assignment = Yes
Table: Attribute Assignment = Yes
Assignment = No
Table: Assignment = No
Weighted SD for Assignment = (5/10) · 14.98 + (5/10) · 14.7 = 14.84
SD reduction for Assignment = 16.55 - 14.84 = 1.71
Project = Yes
Table: Project = Yes
Project = No
Table: Project = No
Weighted SD for Project = (6/10) · 12.6 + (4/10) · 13.39 = 12.92
SD reduction for Project = 16.55 - 12.92 = 3.63
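The weighted-SD arithmetic above follows a simple pattern that can be sketched in Python. The marks below are invented, since the underlying marks table is not reproduced in these notes; the point is that an attribute which separates low scorers from high scorers yields a large SD reduction and is therefore preferred as the split.

```python
import math

def sd(values):
    """Population standard deviation."""
    m = sum(values) / len(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))

def sd_reduction(target, groups):
    """SD(all targets) minus the weighted SD of the groups that an
    attribute splits the target values into."""
    n = len(target)
    weighted = sum((len(g) / n) * sd(g) for g in groups)
    return sd(target) - weighted

# Invented marks: the split cleanly separates low and high scorers,
# so the SD reduction is large.
marks = [10, 12, 30, 32]
reduction = sd_reduction(marks, [[10, 12], [30, 32]])
```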