Unit-7 ML
Decision Tree
Decision Tree is a supervised learning technique that can be used
for both classification and regression problems, but it is mostly
preferred for solving classification problems. It is a tree-structured
classifier in which internal nodes represent the features of a
dataset, branches represent the decision rules, and each leaf node
represents the outcome.
In a decision tree, there are two types of nodes: the decision
node and the leaf node. Decision nodes are used to make a
decision and have multiple branches, whereas leaf nodes are the
outputs of those decisions and do not contain any further
branches.
It is called a decision tree because, like a tree, it starts with
the root node, which expands into further branches and constructs
a tree-like structure.
In order to build a tree, we use the CART algorithm, which stands
for Classification and Regression Tree algorithm.
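As a quick illustration, here is a minimal sketch using scikit-learn's
DecisionTreeClassifier, which uses an optimised version of CART; the iris
dataset and the hyperparameters below are only examples, not part of these notes.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# criterion="gini" is the CART default; "entropy" switches to information gain.
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
clf.fit(X_train, y_train)

print("Test accuracy:", clf.score(X_test, y_test))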
Decision Tree Algorithm
Input: Training data set, test data set (or data points)
Steps:
    Initialise Emin = ∞
    Do for all attributes Fi
        calculate the entropy Ei of the attribute Fi
        if Ei < Emin then
            Emin = Ei and Fmin = Fi
        end if
    End do
    Split the data set on Fmin (the lowest-entropy attribute) and repeat
    the procedure for each branch until the leaf nodes are pure or a
    stopping criterion is met.
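The loop above picks the attribute with the lowest entropy (equivalently,
the highest information gain). A rough from-scratch sketch of that selection
step is shown below; the helper functions and the toy "play tennis" rows are
made up for illustration.

import math
from collections import Counter

def entropy(values):
    # Shannon entropy of a list of class labels.
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def weighted_entropy(rows, attribute, target):
    # Entropy of the target after splitting the rows on one attribute.
    total = len(rows)
    result = 0.0
    for value in set(row[attribute] for row in rows):
        subset = [row[target] for row in rows if row[attribute] == value]
        result += (len(subset) / total) * entropy(subset)
    return result

def best_attribute(rows, attributes, target):
    # Return the attribute with the minimum weighted entropy,
    # i.e. the split chosen at the current node.
    return min(attributes, key=lambda a: weighted_entropy(rows, a, target))

# Hypothetical toy data: whether to play depending on outlook and wind.
data = [
    {"outlook": "sunny", "wind": "weak", "play": "no"},
    {"outlook": "sunny", "wind": "strong", "play": "no"},
    {"outlook": "rain", "wind": "weak", "play": "yes"},
    {"outlook": "rain", "wind": "strong", "play": "no"},
    {"outlook": "overcast", "wind": "weak", "play": "yes"},
]
print(best_attribute(data, ["outlook", "wind"], "play"))  # -> "outlook"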
K-Nearest Neighbour (KNN) Algorithm
Step-1: Select the number K of neighbours.
Step-2: Calculate the Euclidean distance from the new data point to each point in the
training data.
Step-3: Take the K nearest neighbours as per the calculated Euclidean distance.
Step-4: Among these K neighbours, count the number of data points in each
category.
Step-5: Assign the new data point to the category with the maximum number of
neighbours.
Step-6: Our model is ready.
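A minimal sketch of these steps, assuming scikit-learn is available; the movie
features (IMDb rating, duration) and genre labels are hypothetical and only
mirror the worked example that follows.

from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training data: [IMDb rating, duration] and genre labels.
X = [[7.2, 120], [7.8, 100], [6.9, 130], [8.1, 95], [6.5, 140]]
y = ["Action", "Comedy", "Comedy", "Comedy", "Action"]

# Step-1/2: choose K and the (default) Euclidean distance metric.
knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(X, y)

# Steps 3-6: find the 3 nearest neighbours of the new entry and
# assign it to the majority category.
print(knn.predict([[7.4, 114]]))  # -> ['Comedy'] for this toy data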
We have a new entry but it doesn't have a class yet. To know its class,
we have to calculate the distance from the new entry to other entries
in the data set using the Euclidean distance formula.
Here's the formula (Euclidean distance): √((X₂ − X₁)² + (Y₂ − Y₁)²)
Where:
X₂ = New entry's IMDb (7.4).
X₁= Existing entry's IMDb.
Y₂ = New entry's Duration (114).
Y₁ = Existing entry's Duration.
For k=3
As you can see above, the 3 nearest neighbours to the new entry are the
entries at distances (41, 46, 54).
Majority voting over their genres (Action, Comedy, Comedy) = Comedy, so we
classify the new entry as Comedy.
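The same worked example can be reproduced from scratch; the existing entries
below are hypothetical stand-ins for the table in the notes, chosen so that
the vote also comes out (Action, Comedy, Comedy) = Comedy.

import math
from collections import Counter

# Hypothetical existing entries: ((IMDb, Duration), genre).
existing = [
    ((7.2, 120), "Action"),
    ((7.8, 100), "Comedy"),
    ((6.9, 130), "Comedy"),
    ((8.1, 95), "Comedy"),
    ((6.5, 140), "Action"),
]
new_entry = (7.4, 114)  # the movie we want to classify

def euclidean(p, q):
    # sqrt((X2 - X1)^2 + (Y2 - Y1)^2), exactly the formula above.
    return math.sqrt((q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)

# Keep the 3 entries closest to the new entry.
nearest = sorted(existing, key=lambda e: euclidean(e[0], new_entry))[:3]

# Majority vote over the genres of the 3 nearest neighbours:
# here the votes are (Action, Comedy, Comedy), so the result is Comedy.
genres = [genre for _, genre in nearest]
print(Counter(genres).most_common(1)[0][0])  # -> Comedy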
SSR measures the variability in the dependent variable that is explained by the
regression model. It is the difference between the total variability (SST) and the
unexplained variability (SSE).
Formula: SSR=SST−SSE
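A small numeric check of the relationship SSR = SST − SSE; the observed values
and the model predictions below are made up for illustration.

import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])      # observed values (made up)
y_pred = np.array([2.8, 5.3, 6.9, 9.0])      # regression predictions (made up)

sst = np.sum((y_true - y_true.mean()) ** 2)  # total variability
sse = np.sum((y_true - y_pred) ** 2)         # unexplained variability (residuals)
ssr = sst - sse                              # variability explained by the model

print(f"SST={sst:.2f}  SSE={sse:.2f}  SSR={ssr:.2f}")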