ML Unit 3 Notes

UNIT – III

Lecture - 13
Pallavi Shukla
Assistant Professor
Introduction to Decision Tree
• A decision tree is used to create a learning model that can be used to predict
(test) the class or value of the target variable.
• The decision tree uses prior training data to predict the class of new
examples.
A DECISION TREE
• “It is a flowchart-like structure in which each internal node represents a test on
an attribute and each branch represents the outcome of the test.”
• The end node (called a leaf node) represents a class label.
• Decision tree is a supervised learning method.
• It is used for both classification and regression tasks in machine learning.
DECISION TREE LEARNING
• It is a method for approximating a discrete-valued target function (concept),
in which the learned function is represented by a decision tree.
TERMINOLOGIES IN DECISION
TREE
• Root Node: It represents the entire population (or sample) which gets
further divided into two or more sets.
• Splitting: It is the process of dividing a node into two or more sub-nodes to
grow the tree.
• Decision Nodes: When a sub-node splits into further sub-nodes then it is
called a decision node.
• Leaf/ Terminal Node: The end nodes that do not split are called leaf or
terminal nodes.
TERMINOLOGIES IN DECISION
TREE
• Pruning: The removal of sub-nodes, done to reduce the size of the tree, is called pruning.

• Branch (sub-tree): A sub-section of the entire tree is called a branch or sub-tree.

• Parent Node: A node that is divided into sub-nodes is called a parent node.

• Child Nodes: The sub nodes of a parent node are called child nodes.
How does the Decision Tree algorithm
Work?
• In a decision tree, for predicting the class of the given dataset, the algorithm
starts from the root node of the tree.
• This algorithm compares the value of the root attribute with the corresponding
attribute of the record (from the real dataset) and, based on the comparison, follows the
branch and jumps to the next node.
• For the next node, the algorithm again compares the attribute value with the
other sub-nodes and moves further.
• It continues the process until it reaches the leaf node of the tree.
• The complete process can be better understood using the below algorithm:
How does the Decision Tree algorithm
Work?
• Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
• Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
• Step-3: Divide S into subsets that contain the possible values of the best attribute.
• Step-4: Generate the decision tree node that contains the best attribute.
• Step-5: Recursively make new decision tree nodes using the subsets of the dataset created in
Step-3. Continue this process until a stage is reached where the nodes cannot be classified
further; such a final node is called a leaf node.
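The prediction process described above can be illustrated with a minimal Python sketch. This is an added illustration, not part of the original notes; the tree is stored as nested dictionaries and the attribute names (Salary, Distance, Cab) are hypothetical, loosely matching the job-offer example on the next slide.

# A small tree stored as nested dicts: {attribute: {attribute value: subtree or leaf label}}
tree = {
    "Salary": {
        "low": "Decline",
        "high": {"Distance": {"near": "Accept",
                              "far": {"Cab": {"yes": "Accept", "no": "Decline"}}}},
    }
}

def predict(node, record):
    # Start at the root node; at each node, follow the branch that matches the
    # record's value for the tested attribute, until a leaf is reached.
    while isinstance(node, dict):
        attribute = next(iter(node))            # the attribute tested at this node
        node = node[attribute][record[attribute]]
    return node                                  # leaf: the predicted class label

print(predict(tree, {"Salary": "high", "Distance": "far", "Cab": "yes"}))   # Accept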
Example
• Example: Suppose there is a candidate who has a job offer and wants to
decide whether he should accept the offer or Not. So, to solve this
problem, the decision tree starts with the root node (Salary attribute by
ASM). The root node splits further into the next decision node (distance
from the office) and one leaf node based on the corresponding labels.
The next decision node further splits into one decision node (Cab
facility) and one leaf node. Finally, the decision node splits into two leaf
nodes (Accepted offer and Declined offer). Consider the diagram below.
Attribute Selection Measures
• While implementing a decision tree, the main issue that arises is how to select the
best attribute for the root node and for the sub-nodes.
• To solve such problems, there is a technique called Attribute Selection
Measure (ASM).
• By this measurement, we can easily select the best attribute for the nodes of
the tree.
• There are two popular techniques for ASM, which are:
• Information Gain
• Gini Index
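As a quick illustration of one such measure, here is a minimal Python sketch (not from the original notes) of the Gini index computed from a node's class labels; entropy and information gain are worked through in detail in the following slides.

from collections import Counter

def gini_index(labels):
    # Gini index = 1 - sum(p_i^2), where p_i is the proportion of class i at the node.
    total = len(labels)
    return 1 - sum((count / total) ** 2 for count in Counter(labels).values())

print(round(gini_index(["Yes"] * 9 + ["No"] * 5), 3))   # 0.459 for a mixed 9-Yes / 5-No node
print(gini_index(["Yes"] * 4))                          # 0.0 for a pure node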
Advantages of the Decision Tree
• It is simple to understand as it follows the same process that a human
follows while making any decision in real life.
• It can be very useful for solving decision-related problems.
• It helps to think about all the possible outcomes for a problem.
• There is less requirement of data cleaning compared to other algorithms.
Disadvantages of the Decision Tree

• The decision tree contains lots of layers, which makes it complex.


• It may have an overfitting issue, which can be resolved using the Random
Forest algorithm.
• For a larger number of class labels, the computational complexity of the decision tree
may increase.
UNIT III
Lecture 14
PALLAVI SHUKLA
Assistant Professor
UCER, Prayagraj
ENTROPY E(x)
• Average amount of information contained by a random variable(x) is called
Entropy.
• “Measure of randomness of information” of a variable.
• Denoted by E or H
EXAMPLE 1

Let’s have a dataset made up of three colors: red, purple, and yellow. If we have
one red, three purple, and four yellow observations in our set, our equation becomes:
• Pr = 1/8 (Red in the dataset)
• Pp = 3/8 (Purple in the dataset)
• Py = 4/8 (Yellow in the dataset)

• E = -(1/8 log2(1/8) + 3/8 log2(3/8) + 4/8 log2(4/8)) ≈ 1.41
• What happens when all observations belong to the same class? In such a case, the entropy will
always be zero.
• Such a dataset has no impurity.
• This implies that such a dataset would not be useful for learning.
• If we have a dataset with, say, two classes, half yellow and half purple,
the entropy will be one.
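These three cases (E ≈ 1.41, E = 0, E = 1) can be checked with a minimal Python sketch; this is an illustration added to the notes, not part of the original slides.

import math

def entropy(probabilities):
    # E = -sum(p * log2(p)); classes with probability 0 contribute nothing.
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

print(round(entropy([1/8, 3/8, 4/8]), 2))   # 1.41 : one red, three purple, four yellow
print(entropy([1.0]))                       # 0.0  : all observations in one class
print(entropy([0.5, 0.5]))                  # 1.0  : half yellow, half purple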
Example 2
• Calculate the entropy E of the single attribute “Play Golf” when the following data is given.

Play Golf: Yes = 9, No = 5
Solution
E(S) = -Σ pi log2(pi)
where S = current state, pi = probability of event i in state S
Entropy(Play Golf) = E(9, 5)
Probability of Play Golf = Yes: 9/14 = 0.64
Probability of Play Golf = No: 5/14 = 0.36
Entropy = -(0.36 log2(0.36) + 0.64 log2(0.64))
E(S) = 0.94
Example 3
• Calculate the entropy of multiple attributes of the “Play Golf” problem with the
given data set.

Outlook vs. Play Golf (Yes / No):
Sunny: 3 / 2
Overcast: 4 / 0
Rainy: 2 / 3
Solution
E(Play Golf, Outlook) = P(Sunny)·E(3,2) + P(Overcast)·E(4,0) + P(Rainy)·E(2,3)
E(3,2) = -(3/5 log2(3/5) + 2/5 log2(2/5)) = 0.971; E(4,0) = 0; E(2,3) = 0.971
E(Play Golf, Outlook) = (5/14)(0.971) + (4/14)(0) + (5/14)(0.971) ≈ 0.693
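The same weighted (conditional) entropy can be computed with a short Python sketch. The counts are taken from the table above; the function and variable names are illustrative additions, not part of the original notes.

import math

def entropy(counts):
    # Entropy from raw class counts, e.g. (3, 2) -> 0.971.
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

outlook = {"Sunny": (3, 2), "Overcast": (4, 0), "Rainy": (2, 3)}
total = sum(sum(c) for c in outlook.values())                         # 14 examples in all
weighted = sum(sum(c) / total * entropy(c) for c in outlook.values())
print(round(weighted, 3))                                             # 0.693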
INFORMATION GAIN
• In machine learning and decision trees, the information gain (IG) is
defined as the reduction (decrease) in entropy.
• Information gain is the measurement of changes in entropy after the
segmentation of a dataset based on an attribute.
• It calculates how much information a feature provides us about a class.
• Information gain helps to determine the order of attributes in the nodes
of a decision tree.
• According to the value of information gain, we split the node and build
the decision tree.
INFORMATION GAIN
• The main node is referred to as the parent node, whereas sub-nodes are
known as child nodes.
• We can use information gain to determine how good the splitting of
nodes in a decision tree is.

Information Gain = E(parent) − [weighted average] E(children)

• E(parent) is the entropy of the parent node and E(children) is the weighted average
entropy of the child nodes.
Example
Suppose we have a dataset with two classes.
This dataset has 5 purple and 5 yellow examples.
Since the dataset is balanced, we expect its entropy to be 1.

Say we split the dataset into two branches.


One branch ends up having four values while the other has six.
The left branch has four purples while the right one has five yellows and one
purple.
Example
• We mentioned that when all the observations belong to the same class, the
entropy is zero since the dataset is pure. As such, the entropy of the left
branch Eleft= 0
• On the other hand, the right branch has five yellows and one purple.
• Thus: E(right) = -(5/6 log2(5/6) + 1/6 log2(1/6)) ≈ 0.65
Example
• A perfect split would have five examples on each branch.
• This is clearly not a perfect split, but we can determine how good the split is.
• We know the entropy of each of the two branches.
• We weight the entropy of each branch by the number of elements each
contains.
• This helps us calculate the quality of the split.
Example
• The branch on the left has 4 elements, while the other has 6, out of a total of 10. Therefore, the
weighting goes as shown below:

E(children) = (4/10)(0) + (6/10)(0.65) = 0.39

• The entropy before the split, which we referred to as the initial entropy, is E(initial) = 1.
• After splitting, the weighted entropy of the children is 0.39.
• We can now get our information gain, which is the entropy we “lost” after splitting:
Information Gain = E(initial) − E(children) = 1 − 0.39 = 0.61
Example

The more the entropy removed, the greater the information gain.
The higher the information gain, the better the split.
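The split above can be checked numerically with a minimal Python sketch; this is an added illustration, not part of the original slides, using the purple/yellow class counts from the example.

import math

def entropy(counts):
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

parent = (5, 5)                  # 5 purple, 5 yellow
left, right = (4, 0), (1, 5)     # left branch: 4 purple; right branch: 1 purple, 5 yellow
n = sum(parent)

e_children = sum(left) / n * entropy(left) + sum(right) / n * entropy(right)
info_gain = entropy(parent) - e_children
print(round(entropy(parent), 2))   # 1.0  (initial entropy)
print(round(e_children, 2))        # 0.39 (weighted entropy after the split)
print(round(info_gain, 2))         # 0.61 (information gain)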
Decision Tree
Algorithms
Pallavi Shukla
Assistant Professor
Types of Decision Tree Algorithms
• 1. Iterative Dichotomizer 3(ID 3) Algorithm
• 2. C4.5 Algorithm
• 3. Classification and Regression Tree (CART) Algorithm
General Decision Tree Algorithm Steps
1. Calculate the entropy E of every attribute A of the dataset S.
2. Split the dataset S into subsets using the attribute for which the resulting
entropy after splitting is minimized (equivalently, the information gain is maximized).
Iterative Dichotomizer 3 Algorithm
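A minimal sketch of the core ID3 step, selecting the attribute with the highest information gain, is given below. This is an added illustration under assumed toy data; the row values and attribute names are hypothetical, not taken from the original notes.

import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attr, target="Play"):
    # Gain = entropy before the split minus the weighted entropy after splitting on attr.
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in set(r[attr] for r in rows):
        subset = [r[target] for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

def id3_choose_attribute(rows, attributes):
    # ID3 picks the attribute whose split leaves the least entropy (highest gain).
    return max(attributes, key=lambda a: information_gain(rows, a))

rows = [
    {"Outlook": "Sunny", "Windy": "False", "Play": "No"},
    {"Outlook": "Sunny", "Windy": "True", "Play": "No"},
    {"Outlook": "Overcast", "Windy": "False", "Play": "Yes"},
    {"Outlook": "Rainy", "Windy": "False", "Play": "Yes"},
    {"Outlook": "Rainy", "Windy": "True", "Play": "No"},
]
print(id3_choose_attribute(rows, ["Outlook", "Windy"]))   # Outlook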
Inductive Bias
Pallavi Shukla
Assistant Professor
Inductive Bias in Decision Tree Learning

• The inductive bias of a machine learning algorithm is the set of assumptions that the
learner uses to predict outputs for inputs that it has not encountered.
• An approximation of inductive bias of ID3 decision tree algorithm: “Shorter
trees are preferred over longer trees. Trees that place high information gain
attributes close to the root are preferred over those that do not.”
• Inductive bias is a “policy” by which the decision tree algorithm generalizes
from observed training examples to classify unseen instances.
Inductive Bias in Decision Tree Learning

• Inductive bias is an essential requirement in machine learning. With
inductive bias, a learning algorithm can generalize to new, unseen examples.
• Bias: The assumptions made by a model to make a function easier to learn
are called bias.
• Variance: If a model has very low error on the training data but very high error on the
test data, this difference is called variance in machine learning.
ISSUES IN DECISION TREE
LEARNING
1. Avoiding Over Fitting of Data
a) Reduced Error Pruning
b) Rule Post Pruning
2. Incorporating continuous valued attributes.
3. Alternative measure for selecting attributes.
4. Handling training examples with missing attribute values.
5. Handling attributes with differing costs.
OVERFITTING OF DATA IN
DECISION TREES
• “Overfitting” of data is a condition in which the model fits the
training data completely but fails to generalize to the test data.
• “Given a hypothesis space H, a hypothesis h1 ∊ H is said to
‘overfit’ the training data if there exists some alternative
hypothesis h2 ∊ H, such that h1 has smaller error than h2 over
the training examples, but h2 has smaller error than h1 over
the entire distribution of instances (i.e., the
training + testing data sets).”
Technique to reduce Overfitting
• 1. Reduce model complexity
• 2. Early stopping of the training process before final classification
• 3. Post-pruning the decision tree nodes after the overfit has occurred
• 4. Ridge regularization
• 5. Lasso regularization
• 6. Use of artificial neural networks (ANNs)
METHODS /APPROACHES TO AVOID
OVER FITTING IN DECISION TREES
1. Pre-pruning (avoiding overfitting) – Stop growing the decision tree
before it fully classifies the training data.
2. Post-pruning after overfitting – Allow the decision tree to overfit the data
and then post-prune the tree's leaf nodes.
ALGORITHM STEPS FOR DECISION
TREE FOR BOOLEAN FUNCTIONS
• Every variable in a Boolean function (e.g., A, B, C) has two possible values:
True (1) or False (0).
• If the Boolean function evaluates to true, we write Yes (Y) in the leaf node.
• If the Boolean function evaluates to false, we write No (N) in the leaf node.
• Boolean functions are evaluated from left to right, i.e., the first variable on the
LHS is always the root node.
Example
• Make decision tree for following Boolean function(expression)
a) A ⋀ ¬ B b) A ⋁ [B ⋀ C]
c) A XOR B d) [A ⋀ B] ⋁ [C ⋀ D]
A | B | A ⋀ B | ¬B | A ⋀ ¬B | Y or N
F | F | F | T | F | NO
F | T | F | F | F | NO
T | F | F | T | T | YES
T | T | T | F | F | NO
Step 1 If A = True and B = True , then
final decision in truth table is “NO”.
Step 2 If A = True and B = False, then final
decision in truth table is “YES”
Step 3 If A = False , B = Any value , Then
final decision in truth table is “No”
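The three steps above correspond directly to nested conditional tests. Here is a minimal Python sketch, added for illustration, of the decision tree for A ⋀ ¬B; the function name is made up.

def decide_a_and_not_b(a: bool, b: bool) -> str:
    # Root node tests A (the left-most variable); the True branch then tests B.
    if a:
        if b:
            return "No"    # A = True,  B = True  -> A AND (NOT B) is False
        return "Yes"       # A = True,  B = False -> True
    return "No"            # A = False, B = any   -> False (single leaf)

for a in (False, True):
    for b in (False, True):
        print(a, b, decide_a_and_not_b(a, b))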
A ⋁[B ⋀C] -
Instance Based
Learning
Pallavi Shukla
Assistant Professor
Instance Based Learning
• It is a supervised learning technique which is used for classification and
regression tasks.
• It performs its operations by comparing new (current) instances with
previously stored instances.
• It is also called “lazy learning” or “memory-based learning”.
Types of Instance Based Learning
• K-Nearest Neighbour (KNN) Algorithm
• Locally Weighted Regression Algorithm
• Radial Basis Function Network
• Case Based Learning
• Learning Vector Quantization(LVQ)
• Self Organizing Map(SOM)
KNN Example
Pallavi Shukla
Assistant Professor
Example
• To solve the numerical example on the K-nearest neighbor i.e. KNN
classification algorithm, we will use the following dataset.
• In the Given dataset, we have fifteen data points with three class labels.
Now, suppose that we have to find the class label of the point P= (5, 7).
Point Coordinates Class Label
A1 (2,10) C2
A2 (2, 6) C1
A3 (11,11) C3
A4 (6, 9) C2
A5 (6, 5) C1
A6 (1, 2) C1
A7 (5, 10) C2
A8 (4, 9) C2
A9 (10, 12) C3
A10 (7, 5) C1
A11 (9, 11) C3
A12 (4, 6) C1
A13 (3, 10) C2
A14 (3, 8) C2
A15 (6, 11) C2
• For this, we will first specify the number of nearest neighbors i.e. k. Let us
take k to be 3.
• Now, we will find the distance of P to each data point in the dataset.
• For this KNN classification numerical example, we will use the euclidean
distance metric.
• The following table shows the euclidean distance of P to each data point
in the dataset.
Point Coordinates Distance from P (5, 7)
A1 (2, 10) 4.24
A2 (2, 6) 3.16
A3 (11, 11) 7.21
A4 (6, 9) 2.23
A5 (6, 5) 2.23
A6 (1, 2) 6.40
A7 (5, 10) 3.0
A8 (4, 9) 2.23
A9 (10, 12) 7.07
A10 (7, 5) 2.82
A11 (9, 11) 5.65
A12 (4, 6) 1.41
A13 (3, 10) 3.60
A14 (3, 8) 2.23
A15 (6, 11) 4.12
• After finding the distance of each point in the dataset to P, we will sort
the above points according to their distance from P (5, 7).
• After sorting, we get the following table.
Point Coordinates Distance from P (5, 7)
A12 (4, 6) 1.41
A4 (6, 9) 2.23
A5 (6, 5) 2.23
A8 (4, 9) 2.23
A14 (3, 8) 2.23
A10 (7, 5) 2.82
A7 (5, 10) 3
A2 (2, 6) 3.16
A13 (3, 10) 3.6
A15 (6, 11) 4.12
A1 (2, 10) 4.24
A11 (9, 11) 5.65
A6 (1, 2) 6.4
A9 (10, 12) 7.07
A3 (11, 11) 7.21
• As we have taken k = 3, we will now consider the class labels of the three
points in the dataset nearest to point P to classify P. In the above table,
A12, A4, and A5 are the three closest neighbors of point P.
• Hence, we will use the class labels of points A12, A4, and A5 to decide the
class label for P.
Point Coordinates Distance from P (5, 7)
A12 (4, 6) 1.41
A4 (6, 9) 2.23
A5 (6, 5) 2.23
• Now, points A12, A4, and A5 have the class labels C1, C2, and C1
respectively. Among these points, the majority class label is C1.
• Therefore, we will specify the class label of point P = (5, 7) as C1.
• Hence, we have successfully used KNN classification to classify point P
according to the given dataset.
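The whole calculation above can be reproduced with a minimal Python sketch; this is an added illustration, not part of the original slides. Note that A4, A5, A8, and A14 are all at distance 2.24 from P, so which two of them join A12 among the three nearest is a tie; taking A4 and A5, as in the worked example, still gives C1.

import math
from collections import Counter

points = {
    "A1": ((2, 10), "C2"),  "A2": ((2, 6), "C1"),   "A3": ((11, 11), "C3"),
    "A4": ((6, 9), "C2"),   "A5": ((6, 5), "C1"),   "A6": ((1, 2), "C1"),
    "A7": ((5, 10), "C2"),  "A8": ((4, 9), "C2"),   "A9": ((10, 12), "C3"),
    "A10": ((7, 5), "C1"),  "A11": ((9, 11), "C3"), "A12": ((4, 6), "C1"),
    "A13": ((3, 10), "C2"), "A14": ((3, 8), "C2"),  "A15": ((6, 11), "C2"),
}

def knn_classify(query, k=3):
    # Rank all labelled points by Euclidean distance to the query point.
    ranked = sorted(points.items(), key=lambda item: math.dist(item[1][0], query))
    nearest = ranked[:k]                                  # the k closest points
    votes = Counter(label for _, (_, label) in nearest)   # majority vote of their labels
    return votes.most_common(1)[0][0]

print(knn_classify((5, 7), k=3))   # C1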
Question
Locally Weighted
Regression
Pallavi Shukla
Assistant Professor
• Linear regression is a supervised learning algorithm used for computing
linear relationships between the input (X) and output (Y).
• Ordinary linear regression fits a single global line to all training points by
minimizing the squared error between predictions and targets.
• This algorithm cannot be used for making predictions when there exists a
non-linear relationship between X and Y. In such cases, locally weighted
linear regression is used, as sketched below.
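A minimal Python sketch of locally weighted regression, added for illustration: it uses NumPy, a Gaussian weighting kernel, and made-up toy data; the bandwidth tau and the function names are assumptions, not from the original notes.

import numpy as np

def lwr_predict(x_query, X, y, tau=0.5):
    # Fit a weighted least-squares line around x_query, weighting nearby
    # training points more heavily (Gaussian kernel of bandwidth tau).
    Xb = np.column_stack([np.ones_like(X), X])            # add a bias column
    w = np.exp(-((X - x_query) ** 2) / (2 * tau ** 2))    # local weights
    W = np.diag(w)
    theta = np.linalg.pinv(Xb.T @ W @ Xb) @ Xb.T @ W @ y  # weighted normal equations
    return np.array([1.0, x_query]) @ theta               # local prediction at x_query

# Toy non-linear data where a single global line would fit poorly.
X = np.linspace(0, 6, 60)
y = np.sin(X) + 0.1 * np.random.randn(60)
print(round(float(lwr_predict(3.0, X, y)), 2))            # close to sin(3) ≈ 0.14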
RADIAL BASIS FUNCTION(RBF) –

• It is a mathematical function whose value depends only on the distance from


the origin.
• An RBF works by computing the distance of its input from the centre or origin point.
• This is done by using absolute values (the norm of the input).
• It is denoted by φ(x) = φ(‖x‖).
• The RBF is used for the approximation of a multivariate target function.
RADIAL BASIS FUNCTION(RBF) –

• The target function approximation is given as

f(x) = w0 + Σu wu Ku(d(xu, x))

• where f(x) = approximation of the multivariate target function
w0 = initial (bias) weight
wu = weight of unit u
Ku(d(xu, x)) = kernel function of unit u
d(xu, x) = distance between the centre xu and the input x
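A minimal sketch of this approximation with a Gaussian kernel is shown below, added for illustration; the centres xu and weights wu are arbitrary example values rather than learned ones.

import numpy as np

def gaussian_kernel(distance, width=1.0):
    # K(d) = exp(-d^2 / (2 * width^2)) depends only on the distance d.
    return np.exp(-(distance ** 2) / (2 * width ** 2))

def rbf_approx(x, centres, weights, w0=0.0):
    # f(x) = w0 + sum_u w_u * K(d(x_u, x)), with d the absolute distance.
    distances = np.abs(centres - x)
    return w0 + np.sum(weights * gaussian_kernel(distances))

centres = np.array([0.0, 1.0, 2.0])    # the x_u (example values)
weights = np.array([0.5, -1.0, 2.0])   # the w_u (example values)
print(float(rbf_approx(1.5, centres, weights, w0=0.1)))   # the approximation at x = 1.5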
RADIAL BASIS FUNCTION NETWORKS -

• They are used in Artificial neural networks(ANN)


• It is used for classification task in ANN.
• Commonly used in ANN for function approximation also.
• RBF networks are different from simple ANNs due to their universal
approximation capability and faster learning speed.
RADIAL BASIS FUNCTION NETWORKS -

• An RBF network is a feed-forward neural network.

• It consists of three layers: an input layer, a middle (hidden) layer, and an output layer.
CASE BASE REASONING(CBR)
• Also called Case-based Learning.
• Used for classification and regression.
• It is the process of solving new problems based on the solutions of similar
past problems.
• It is an advanced instance-based learning method that is used to solve more
complex problems.
• It does not use the Euclidean distance metric.
STEPS IN CBR -
• Retrieve – Gather Data from memory. Check any previous solution similar
to the current problem.
• Reuse – Suggest a solution based on the experience. Adapt it to meet the
demands of the new situation.
• Revise – Evaluate the use of the solution in a new context.
• Retain – Store this new problem-solving method in the memory system.
Applications of CBR -
• Customer service helpdesk for diagnosis of problems.

• Engineering and law for technical design and legal rules.

• Medical science for patient case histories and treatment.
