
UNIT-7

Decision Tree
Decision Tree is a supervised learning technique that can be used
for both classification and regression problems, though it is most
often preferred for classification. It is a tree-structured
classifier in which internal nodes represent the features of a
dataset, branches represent decision rules, and each leaf node
represents an outcome.
A decision tree contains two types of nodes: decision nodes and
leaf nodes. Decision nodes are used to make decisions and have
multiple branches, whereas leaf nodes are the outputs of those
decisions and contain no further branches.
It is called a decision tree because, like a tree, it starts from a
root node that expands into further branches, building a tree-like
structure.
To build the tree, we use the CART algorithm, which stands for
Classification And Regression Tree.
Decision Tree Algorithm
Input: training data set, test data set (or data points)

Steps:
Set Emin to a very large value
Do for all attributes Fi
    Calculate the entropy Ei of the attribute Fi
    If Ei < Emin then
        Emin = Ei and Fmin = Fi
    End if
End do

Draw a decision tree node containing the attribute Fmin and split
the data set into subsets using Fmin

Repeat the above steps on each subset until the full tree is drawn,
covering all attributes of the original table
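A minimal sketch of the entropy calculation and attribute selection described above, assuming a small in-memory dataset; the rows, attribute values, and labels are hypothetical:

import math
from collections import Counter

def entropy(labels):
    # Shannon entropy: -sum(p * log2(p)) over the class proportions
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def attribute_entropy(rows, attr_index):
    # Weighted average entropy of the subsets produced by splitting on one attribute
    subsets = {}
    for row in rows:
        subsets.setdefault(row[attr_index], []).append(row[-1])  # last column = label
    total = len(rows)
    return sum(len(lbls) / total * entropy(lbls) for lbls in subsets.values())

# Hypothetical toy data: (Outlook, Windy, Play)
rows = [("Sunny", "No", "Yes"), ("Sunny", "Yes", "No"),
        ("Rainy", "No", "Yes"), ("Rainy", "Yes", "No")]

# Choose Fmin, the attribute with the lowest entropy Emin, as in the steps above
Fmin = min(range(2), key=lambda i: attribute_entropy(rows, i))
print("Split on attribute index:", Fmin)  # prints 1 (Windy separates the classes)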

Strengths of Decision Trees:

1. Interpretability: Decision trees are easy to understand and
interpret, making them a valuable tool for explaining the
reasoning behind a particular decision or prediction.
2. Non-linearity: Decision trees can model complex, non-linear
relationships in data without requiring extensive data
preprocessing or feature engineering.
3. Handling both numerical and categorical data: Decision trees can
handle a mix of numerical and categorical data without the need
for one-hot encoding or other data transformations.
4. Feature selection: Decision trees implicitly perform feature
selection by giving more importance to the most informative
features at the top of the tree, which can help reduce
dimensionality and improve model performance.
5. Can handle missing values: Decision trees can handle missing
values in the dataset by considering alternative paths when a
feature's value is missing.
6. Low computational complexity for prediction: Once a decision tree
is trained, making predictions on new data is fast and efficient;
the cost is proportional to the depth of the tree, roughly O(log n)
for a reasonably balanced tree with n nodes.
Weaknesses of Decision Trees:
1. Overfitting: Decision trees are prone to overfitting, especially
when they are deep and complex. To mitigate overfitting,
techniques like pruning and setting a maximum depth or minimum
samples per leaf are often used.
2. Instability: Small changes in the data can lead to significant
changes in the structure of the tree, making decision trees
unstable compared to some other algorithms like random forests
or gradient boosting.
3. Limited expressiveness: While decision trees can model non-linear
relationships, they may struggle with highly complex patterns in
the data, which other algorithms like neural networks may handle
better.
4. Not always the most accurate: Decision trees may not always
produce the most accurate predictions, especially when the data
relationships are highly complex. Ensemble methods like random
forests or gradient boosting often outperform standalone decision
trees.
5. Lack of global optimization: Decision trees make local decisions at
each node, which may not necessarily lead to a globally optimal
model.
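To address weakness 1 (overfitting), libraries expose depth and leaf-size controls. A minimal scikit-learn sketch, assuming scikit-learn is installed; the parameter values are illustrative only, not recommended settings:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# max_depth and min_samples_leaf limit tree complexity to mitigate overfitting
clf = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))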
Classification Algorithms
K-Nearest Neighbour (kNN)
Decision Tree
Random forest classification

Supervised Learning and Classification Steps

Supervised learning is where the model is trained on a labelled
dataset. A labelled dataset is one that contains both input
features and their corresponding output labels.
KNN Algorithm(K-Nearest Neighbour)
o K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on
Supervised Learning technique.
o K-NN algorithm assumes the similarity between the new case/data and available
cases and put the new case into the category that is most similar to the available
categories.
o K-NN algorithm stores all the available data and classifies a new data point based
on the similarity. This means when new data appears then it can be easily classified
into a well suite category by using K- NN algorithm.
o K-NN algorithm can be used for Regression as well as for Classification but mostly
it is used for the Classification problems.
o It is also called a lazy learner algorithm because it does not learn from the training
set immediately instead it stores the dataset and at the time of classification, it
performs an action on the dataset.

Step-1: Select the number K of neighbours.
Step-2: Calculate the Euclidean distance from the new data point to every point in the dataset.
Step-3: Take the K nearest neighbours as per the calculated Euclidean distances.
Step-4: Among these K neighbours, count the number of data points in each category.
Step-5: Assign the new data point to the category for which the number of neighbours is maximum.
Step-6: Our model is ready.
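These steps translate almost line for line into code. A minimal sketch with 2-D numeric features; the training points and labels below are hypothetical:

import math
from collections import Counter

def knn_predict(train, new_point, k=3):
    # Steps 2-3: Euclidean distance to every training point, keep the k nearest
    nearest = sorted(train, key=lambda p: math.dist(p[0], new_point))[:k]
    # Steps 4-5: majority vote among the k nearest neighbours' labels
    labels = [label for _, label in nearest]
    return Counter(labels).most_common(1)[0][0]

# Hypothetical training data: ((IMDb, Duration), genre)
train = [((7.2, 120), "Comedy"), ((8.0, 130), "Action"),
         ((6.9, 110), "Comedy"), ((7.8, 160), "Action")]
print(knn_predict(train, (7.4, 114), k=3))  # prints "Comedy"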

We have a new entry, but it doesn't have a class yet. To find its class,
we calculate the distance from the new entry to every other entry
in the data set using the Euclidean distance formula.
Here's the formula (Euclidean distance): √((X₂−X₁)² + (Y₂−Y₁)²)
Where:
X₂ = New entry's IMDb (7.4).
X₁ = Existing entry's IMDb.
Y₂ = New entry's Duration (114).
Y₁ = Existing entry's Duration.

For k = 3

The three nearest neighbours to the new entry are the entries at
distances 41, 46, and 54. Their genres are Action, Comedy, and
Comedy, so by majority voting we classify the new entry as Comedy.
Majority voting (Action, Comedy, Comedy) = Comedy
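The majority-vote step can be checked in one line; the genre list mirrors the example above:

from collections import Counter
print(Counter(["Action", "Comedy", "Comedy"]).most_common(1)[0][0])  # Comedy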

Define linear regression. Also explain Sum of Squares with its formula.
• Linear regression is a fundamental statistical and machine
learning technique used for modelling the relationship between a
dependent variable (target) and one or more independent variables
(features or predictors).
• The primary goal of linear regression is to find the linear equation
that best describes how the independent variables influence the
dependent variable. In simple linear regression there is one
independent variable, while in multiple linear regression there are
two or more.
• The general form of a simple linear regression equation is:
y = b₀ + b₁x + ε
Where:
• y is the dependent variable (the variable you want to predict).
• x is the independent variable (the feature used for prediction).
• b₀ is the intercept (the value of y when x = 0).
• b₁ is the slope (the change in y for a one-unit change in x).
• ε represents the error term (the part of y that cannot be explained by x).

The goal in linear regression is to determine the values of b₀ and b₁
that minimize the sum of squares of the error term, also known as
the "Sum of Squares of Residuals."

Sum of Squares (SS) is a measure of the total variation in a dataset.
In the context of linear regression, there are two important sums of
squares:

(1) Total Sum of Squares (SST):
SST measures the total variability in the dependent variable y. It is
the sum of the squared differences between each data point and the
mean of y:
SST = Σ(yᵢ − ȳ)²
(2) Residual Sum of Squares (SSE):
SSE measures the unexplained variability in the dependent variable after
applying the linear regression model. It is the sum of the squared differences
between each data point's actual value and the predicted value from the
regression model:
SSE = Σ(yᵢ − ŷᵢ)²
These two sums of squares can be used to calculate a third important measure:

(3) Explained Sum of Squares (SSR):

SSR measures the variability in the dependent variable that is explained by the
regression model. It is the difference between the total variability (SST) and the
unexplained variability (SSE).
Formula: SSR = SST − SSE
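Continuing the NumPy sketch above, the three sums of squares can be computed and the identity SSR = SST − SSE checked numerically:

y_hat = b0 + b1 * x                  # predictions from the fitted line
SST = np.sum((y - y.mean()) ** 2)    # total variability
SSE = np.sum((y - y_hat) ** 2)      # unexplained (residual) variability
SSR = SST - SSE                      # explained variability
print(SST, SSE, SSR)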
