MACHINE LEARNING (BCS602)

Module 3
Chapter 4 : Similarity Based Learning
4.2 Nearest-Neighbor Learning:
K-Nearest Neighbors (KNN) is a supervised machine learning algorithm that classifies
unlabeled data by finding the most similar labeled examples
The value of k is critical in KNN, as it determines the number of neighbors to consider when
making predictions (this choice governs the bias-variance trade-off).
Example
Imagine you’re deciding which fruit it is based on its shape and size. You compare it to fruits
you already know.
• If k = 3, the algorithm looks at the 3 closest fruits to the new one.
• If 2 of those 3 fruits are apples and 1 is a banana, the algorithm says the new fruit is an
apple, because most of its neighbors are apples. A small code sketch of this idea follows.
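The sketch below is illustrative only; the toy fruit data, the feature values, and the knn_predict helper are assumptions made for this example, not part of the original notes:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3):
    """Classify `query` by a majority vote among its k nearest training points."""
    distances = np.linalg.norm(X_train - query, axis=1)    # Euclidean distance to every point
    nearest = np.argsort(distances)[:k]                    # indices of the k closest points
    return Counter(y_train[nearest]).most_common(1)[0][0]  # majority label

# Toy "fruit" data: [size in cm, roundness score]
X_train = np.array([[7.0, 0.90], [6.8, 0.95], [7.2, 0.85], [15.0, 0.30]])
y_train = np.array(["apple", "apple", "apple", "banana"])

print(knn_predict(X_train, y_train, np.array([7.1, 0.9]), k=3))   # -> apple
```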

4.1 KNN Algorithm

Problem based on KNN
Example: You are trying to classify a new point based on its features, using k = 3.

4.3 Weighted K-Nearest-Neighbor Algorithm


• Normal k-NN: In regular k-NN, each of the k nearest neighbors gets equal weight in
making the decision.
• Weighted k-NN: In weighted k-NN, closer neighbors have a higher weight, meaning their
influence on the final prediction is stronger. Typically, weights are inversely proportional
to the distance from the new point, as in the sketch below.
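This sketch uses inverse-distance weights (one common choice; kernel-based weights are another); the toy data and the small eps guard are assumptions for illustration:

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, query, k=3, eps=1e-8):
    """Weighted k-NN: each of the k nearest neighbours votes with weight 1/distance."""
    distances = np.linalg.norm(X_train - query, axis=1)
    nearest = np.argsort(distances)[:k]
    votes = {}
    for idx in nearest:
        weight = 1.0 / (distances[idx] + eps)               # closer neighbour -> larger weight
        votes[y_train[idx]] = votes.get(y_train[idx], 0.0) + weight
    return max(votes, key=votes.get)

# One very close point of class A outweighs two distant points of class B
X_train = np.array([[1.0, 1.0], [4.0, 4.0], [4.2, 4.1]])
y_train = np.array(["A", "B", "B"])
print(weighted_knn_predict(X_train, y_train, np.array([1.2, 1.1]), k=3))   # -> A
```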

Problems on Weighted KNN


4.4 Nearest Centroid Classifier


• The Nearest Centroid Classifier is a simple machine learning classification algorithm that
assigns a class to a data point by finding the centroid (mean) of each class and then
classifying the point based on which centroid it is closest to.
• This classifier is often used for classification tasks where the classes have well-separated
centroids.

Problems on Nearest Centroid Classifier:


1. We want to classify a new data point with x=5 and y=5

Centroid of Class A:
We calculate the centroid for Class A using the average of all x and y values for the points in
Class A.
The data points for Class A are:
• (1, 2)
• (2, 3)
• (3, 3)
The centroid of Class A (μA) is the average of the x-coordinates and y-coordinates:
μA = (6, 8)/3 ≈ (2, 2.67)
Centroid of Class B:
We calculate the centroid for Class B using the average of all x and y values for the points in
Class B.
The data points for Class B are:
• (6, 6)
• (7, 7)
• (8, 6)
The centroid of Class B (μB) is:
μB = (21, 19)/3 ≈ (7, 6.33)
Step 2: Calculate the Distance to Each Centroid
Now, we calculate the Euclidean distance from the new point (5, 5) to each of the centroids.
Distance to Class A Centroid (2,2.67):
The Euclidean distance is given by:
dA = sqrt((5 - 2)² + (5 - 2.67)²) = sqrt(14.43) ≈ 3.8
Distance to Class B Centroid (7,6.33):
The Euclidean distance is:
dB = sqrt((5 - 7)² + (5 - 6.33)²) = sqrt(5.77) ≈ 2.4
Step 3: Classify the New Point
• The distance to the Class A centroid is 3.8.
• The distance to the Class B centroid is 2.4.
Since the new point is closer to the Class B centroid, we classify the new point as Class B. The
new point (5, 5) is classified as Class B using the Nearest Centroid Classifier.
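The worked example above can be checked with a few lines of Python (the values are taken directly from the problem; the variable names are illustrative):

```python
import numpy as np

class_A = np.array([[1, 2], [2, 3], [3, 3]])
class_B = np.array([[6, 6], [7, 7], [8, 6]])
new_point = np.array([5, 5])

mu_A = class_A.mean(axis=0)                 # ~ (2.00, 2.67)
mu_B = class_B.mean(axis=0)                 # = (7.00, 6.33)

d_A = np.linalg.norm(new_point - mu_A)      # ~ 3.8
d_B = np.linalg.norm(new_point - mu_B)      # ~ 2.4
print("Class A" if d_A < d_B else "Class B")   # -> Class B
```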

2. Consider the sample data shown in the table, with two features x and y. The target classes
are A or B. Predict the class of the given point (6, 5) using the Nearest Centroid Classifier.

4.5 Locally Weighted Regression (LWR)


• Locally Weighted Linear Regression (LWLR) is a non-parametric, memory-based
algorithm designed to capture non-linear relationships in data. Unlike traditional regression
models that fit a single global line across the dataset, LWLR creates localized models for
subsets of data points near the query point. Each query point has its own regression line
based on weighted contributions from nearby data points.
• LWLR assigns weights to data points based on their proximity to the query point:
• Points closer to the query point have higher weights.
• Points farther away have lower weights.
Steps Involved in Locally Weighted Linear Regression
1. Data Collection and Preparation
• Gather a dataset with relevant features and a target variable.
• Preprocess the data by handling missing values and normalizing features to ensure a
consistent scale, which improves the weighting process.
2. Choose the Kernel and Bandwidth (Tau)
• Kernel Function: Determines how weights are assigned to data points based on their
distance from the query point.
 Gaussian Kernel: Assigns weights using the formula:
wi = exp( -(xi - x)² / (2τ²) ), where x is the query point

• Here, wi is the weight for data point xi, and τ (the bandwidth) controls the rate of weight
decay.
• Bandwidth (τ): A critical parameter that governs how localized the regression is:
• Small τ: focuses on nearby points, capturing finer detail but risking overfitting.
• Large τ: includes more distant points, reducing variance but increasing bias.
3. Weight Calculation
For a given query point x, compute weights for all data points using the chosen kernel
function. Points closer to x will have higher weights.
4. Model Fitting: Using the computed weights, fit a weighted least squares regression to the
data. The goal is to minimize the weighted sum of squared errors, Σ wi (yi - β0 - β1xi)².
5. Prediction: Once the localized model is fitted, use it to predict the target value for the query
point. A minimal sketch of these steps is given below.
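This sketch applies a Gaussian kernel and a weighted least squares fit at a single query point; the synthetic data, the bandwidth value, and the lwr_predict helper are assumptions for illustration:

```python
import numpy as np

def lwr_predict(X, y, x_query, tau=0.5):
    """Locally weighted linear regression prediction at a single query point."""
    Xb = np.c_[np.ones(len(X)), X]                      # design matrix with intercept column
    w = np.exp(-(X - x_query) ** 2 / (2 * tau ** 2))    # Gaussian kernel weights
    W = np.diag(w)
    # Weighted normal equations: beta = (X^T W X)^-1 X^T W y
    beta = np.linalg.pinv(Xb.T @ W @ Xb) @ Xb.T @ W @ y
    return np.array([1.0, x_query]) @ beta

# Noisy non-linear data (illustrative)
rng = np.random.default_rng(0)
X = np.linspace(0, 6, 60)
y = np.sin(X) + 0.1 * rng.standard_normal(60)
print(lwr_predict(X, y, x_query=3.0))                   # close to sin(3.0) ≈ 0.14
```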
LWR Algorithm:

Problems on LWR:


Chapter 5:
Regression Analysis
5.1 Introduction to Regression
Regression analysis is a statistical technique used to model and analyse the relationships
between variables. It is one of the oldest supervised learning techniques.
It helps in understanding how the dependent variable (outcome) changes when one or more
independent variables (predictors) are modified. Given a training dataset D containing N
training points (xi, yi), where i = 1…N, regression models the relationship between one or more
independent variables xi and a dependent variable yi, represented by a function y = f(x).
This technique is widely used in various fields such as economics, finance, machine learning,
and social sciences to make predictions and infer relationships between variables.
Types of Regression
 Simple Linear Regression: Involves one independent variable and models a linear
relationship (e.g., predicting sales based on advertising spend).
 Multiple Linear Regression: Involves multiple independent variables (e.g., predicting
house prices based on location, size, and number of rooms).
 Logistic Regression: Used when the dependent variable is categorical (e.g., predicting
whether a customer will buy a product: Yes/No).
 Polynomial Regression: Models nonlinear relationships by adding polynomial terms.
 Ridge and Lasso Regression: Used for regularization to prevent overfitting.
5.2 Introduction to Linearity, Correlation, and Causation
Understanding the relationships between variables is crucial in data analysis and statistics.
Three fundamental concepts that help explain these relationships are linearity, correlation, and
causation. Each concept plays a unique role in interpreting data and making accurate
predictions.
1. Linearity
Linearity refers to a straight-line relationship between two variables. In a linear relationship,
changes in one variable result in proportional changes in another. This relationship can be
expressed using a linear equation:
Y=β0+β1X+ε
Where:
 Y is the dependent variable, X is the independent variable, β0 is the intercept, β1 is the slope
(rate of change), and ε is the error term.
Examples of Linearity:
 A company’s revenue increasing proportionally with its marketing budget.

 Temperature rising linearly with time during a steady heat increase.
Nonlinear Relationships:
 Some relationships do not follow a straight-line pattern. For example, the relationship
between study hours and test scores might flatten after a certain point.
2. Correlation
Correlation between two variables can be examined effectively using a scatter plot, which plots
the explanatory variable against the response variable. It is a 2D graph showing the relationship
between the two variables: the x-axis of the scatter plot carries the independent (input or
predictor) variable, and the y-axis carries the dependent (output or predicted) variable.
Typical scatter plots are shown in the figure below.

The correlation coefficient (r) measures the strength and direction of the relationship between
two variables and ranges from -1 to 1:
 r=1 → Perfect positive correlation (as one variable increases, the other increases).
 r=−1 → Perfect negative correlation (as one variable increases, the other decreases).
 r=0 → No correlation (no relationship between variables).
Examples of Correlation:
 Height and weight often show a positive correlation.
 Increase in gas prices and demand for electric cars may have a negative correlation.
However, correlation does not imply causation—just because two variables move together
does not mean one causes the other.
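As a quick illustration, the correlation coefficient r can be computed with NumPy; the small height/weight dataset below is invented for demonstration:

```python
import numpy as np

height = np.array([150, 160, 165, 170, 180])   # cm
weight = np.array([50, 58, 63, 66, 75])        # kg

r = np.corrcoef(height, weight)[0, 1]          # Pearson correlation coefficient
print(round(r, 3))                             # close to +1: strong positive correlation
```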
3. Causation
Causation means that a change in one variable directly causes a change in another. Unlike
correlation, which only shows an association, causation requires experimental or observational
evidence.

Examples of Causation:
 Smoking causes lung cancer.
 Increasing fertilizer use leads to higher crop yield.
Mistaking Correlation for Causation:
 Ice cream sales and drowning incidents are correlated—but ice cream does not cause
drowning. Instead, hot weather (a third factor) increases both.
 Higher education levels correlate with higher income—but education alone does not
"cause" wealth; factors like skills, experience, and networking also play roles.
Linear and Non-Linear Relationships
Linearity means the relationship between the dependent and independent variables can be
visualized as a straight line. A line of the form y = ax + b can be fitted to the data points to
indicate the relationship between x and y. By linearity, it is meant that as one variable
increases, the other variable also increases in a linear manner.
A linear relationship is shown in figure (a). A non-linear relationship exists in functions such as
the exponential function and the power function, shown in figures (b) and (c); in each plot the
x-axis carries the x data and the y-axis the y data.

Types of Regression Methods:

1. Linear Regression
 Purpose: Predict a continuous outcome based on one predictor.
 Equation:
y=β0+β1x+ε
Use case: Straight-line relationship (e.g., predicting salary based on years of experience).
2. Multiple Linear Regression
 Purpose: Predict a continuous outcome using multiple predictors.
 Equation:
y=β0+β1x1+β2x2+…+βnxn+ε
Use case: Predicting house price based on area, number of rooms, location, etc.
3. Polynomial Regression
 Purpose: Handle nonlinear relationships by adding polynomial terms.
 Equation (2nd degree example):
y=β0+β1x+β2x²+ε
Use case: Modelling curves like the growth rate of a business over time.
4. Logistic Regression
 Purpose: Classification (not traditional regression).
 Output: Probability (between 0 and 1) that is mapped to classes (e.g., 0 or 1).
 Equation (Sigmoid function):
p = 1 / (1 + e^-(β0 + β1x))
 Use case: Predicting whether an email is spam or not.


5. Lasso Regression (L1 Regularization)
 Purpose: Prevent overfitting & perform feature selection by shrinking some coefficients
to 0.

 Penalty term added: λ Σ |βj|
 Use case: When you have many features, but only a few are actually useful.
6. Ridge Regression (L2 Regularization)
 Purpose: Handle multicollinearity and reduce model complexity by shrinking
coefficients.

 Penalty term added: λ Σ βj²
 Use case: When all features contribute a bit, but you want to control overfitting. A small
sketch contrasting the two penalties follows.
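The sketch below assumes scikit-learn is available; the synthetic data and the alpha values are illustrative choices, not taken from the notes:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
# Only the first two features matter; the remaining three are pure noise
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.standard_normal(100)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("Ridge:", np.round(ridge.coef_, 2))   # all coefficients shrunk, none exactly zero
print("Lasso:", np.round(lasso.coef_, 2))   # irrelevant coefficients driven to exactly zero
```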
Limitations of Regression Model:
1. Outliers
Outliers are extreme values that can distort regression results by heavily influencing the
regression line. They can lead to misleading coefficients and poor predictions. Handling them
involves detection, removal, or using robust regression techniques.
2. Number of Cases (Sample Size)
A small sample size can cause overfitting and unreliable results. Regression models need
enough data to be accurate and generalizable. Ideally, there should be at least 10–15
observations per predictor variable.
3. Missing Data
Missing values reduce the effectiveness of a regression model and can bias results. Simply
dropping them wastes data, while imputation methods like mean filling or multiple imputation
help retain valuable information.
4. Multicollinearity
When predictors are highly correlated, it becomes hard to separate their individual effects. This
causes unstable coefficients and inflated standard errors. It can be addressed by removing
correlated variables or using techniques like Ridge regression.

5.3 Introduction to Linear Regression
Linear regression is a method used to find the best-fitting straight line through a set of data
points. The goal is to model the relationship between a dependent variable y and an independent
variable x using a straight line:
y=a0+a1x + e
 a0 is the intercept
 a1 is the slope
 e is the error in prediction
The assumptions of linear regression are listed as follows:
1. The observations (y) are random and are mutually independent.
2. The difference between the predicted and true values is called the error. The errors are
mutually independent and share the same distribution, such as a normal distribution with
zero mean and constant variance.
3. The distribution of the error term is independent of the joint distribution of explanatory
variables.
4. The unknown parameters of the regression models are constants.
The idea of linear regression is based on the Ordinary Least Squares (OLS) approach. The data
points are modelled using a straight line, but any arbitrarily drawn line is not the optimal line.
In Fig. 5.4, three data points and their errors (e1, e2, e3) are shown.
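A short sketch of estimating a0 and a1 with the OLS closed-form formulas; the toy data are an assumption for illustration:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])     # roughly y = 2x

# OLS estimates for y = a0 + a1*x
a1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a0 = y.mean() - a1 * x.mean()
print(a0, a1)    # intercept ~ 0.15, slope ~ 1.95
```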

5.4 Validation of Regression Methods:
1. Mean Absolute Error (MAE)
• The average of the absolute differences between actual values yi and predicted values ŷi:
MAE = (1/n) Σ |yi - ŷi|
Interpretation: Lower MAE means better model performance.


2. Mean Squared Error (MSE)
• The average of the squared differences between actual and predicted values:
MSE = (1/n) Σ (yi - ŷi)²
Interpretation: Larger errors get penalized more due to squaring. MSE is always positive.
3. Standard Error (explained in words)
• Refers to the standard deviation of residuals (errors between actual and predicted
values).
• Ideally, it should be small, meaning predictions are close to actual values.
• If residuals follow a normal distribution with mean zero, it's a good sign.
4. Root Mean Squared Error (RMSE)
Measures the average magnitude of prediction errors, in other words how far off the model's
predictions are from the actual values. It is especially useful because it penalizes larger errors
more than smaller ones (due to squaring).
RMSE = sqrt( (1/n) Σ (yi - ŷi)² )
5. Relative Mean Squared Error (RelMSE)
RelMSE = Σ (yi - ŷi)² / Σ (yi - ȳ)²
RelMSE shows how well your model performs compared to a naive model that always
predicts the average value. It’s a normalized metric, useful when you want to compare error
relative to the scale of your data.

6. Coefficient of Variation (CV)
CV = RMSE / ȳ
CV helps you understand the error size relative to the average value of the data. It makes
RMSE dimensionless by dividing it by the mean of the target variable. So, you can use it to
compare errors across different datasets or scales.
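A small sketch computing all of the above metrics for a set of actual and predicted values; the numbers are illustrative:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.0, 9.5])

mae    = np.mean(np.abs(y_true - y_pred))          # Mean Absolute Error
mse    = np.mean((y_true - y_pred) ** 2)           # Mean Squared Error
rmse   = np.sqrt(mse)                              # Root Mean Squared Error
relmse = np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)  # Relative MSE
cv     = rmse / y_true.mean()                      # Coefficient of Variation

print(mae, mse, rmse, relmse, cv)
```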


5.6 Polynomial Regression


If the relationship between the independent and dependent variables is not linear, then linear
regression cannot be used directly, as it will result in large errors. The problem of non-linear
regression can be solved by two methods (a sketch of the second follows):
1. Transformation of the non-linear data, so that linear regression can handle it.
2. Using polynomial regression.
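The sketch below uses numpy.polyfit for a degree-2 fit; the data are generated from a known quadratic (an assumption for illustration) so the recovered coefficients can be checked:

```python
import numpy as np

x = np.array([0, 1, 2, 3, 4, 5], dtype=float)
y = 1.0 + 2.0 * x + 0.5 * x ** 2          # underlying quadratic relationship

coeffs = np.polyfit(x, y, deg=2)          # fits y = b2*x^2 + b1*x + b0
print(np.round(coeffs, 3))                # ~ [0.5, 2.0, 1.0]
print(np.polyval(coeffs, 2.5))            # prediction at a new point, ~ 9.125
```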


Chapter 6
Decision Tree Learning
Decision Tree Learning is a widely used predictive model for supervised learning that spans
a number of practical applications in various areas. It is used for both classification and
regression tasks. The decision tree model basically represents logical rules that predict the value
of a target variable by inferring from data features. This chapter provides a keen insight into
how to construct a decision tree and infer knowledge from the tree.
Learning Objectives
 Understand the structure of the decision tree
 Know about the fundamentals of Entropy
 Learn and understand popular univariate Decision Tree Induction algorithms such as
ID3, C4.5, and multivariate decision tree algorithm such as CART
 Deal with continuous attributes using improved C4.5 algorithm
 Construct Classification and Regression Tree (CART) for classifying both categorical
and continuous-valued target variables
 Construct regression trees where the target feature is a continuous-valued variable
 Understand the basics of validating and pruning of decision trees

6.1 INTRODUCTION TO DECISION TREE LEARNING MODEL

 Decision tree learning model, one of the most popular supervised predictive learning
models, classifies data instances with high accuracy and consistency. The model
performs an inductive inference that reaches a general conclusion from observed
examples. This model is widely used for solving complex classification applications.
 Decision tree is a concept tree which summarizes the information contained in the
training dataset in the form of a tree structure. Once the concept model is built, test data
can be easily classified.
 This model can be used to classify both categorical target variables and continuous-
valued target variables. Given a training dataset X, this model computes a hypothesis
function f(X) as a decision tree.
 Inputs to the model are data instances or objects with a set of features or attributes which
can be discrete or continuous, and the output of the model is a decision tree which
predicts or classifies the target class for the test data object.
 In statistical terms, attributes or features are called independent variables. The target
feature or target class is called the response variable, which indicates the category we
need to predict for a test object.
 The decision tree learning model generates a complete hypothesis space in the form of
a tree structure with the given training dataset and allows us to search through the
possible set of hypotheses which in fact would be a smaller decision tree as we walk
through the tree. This kind of search bias is called preference bias.

6.1.1 Structure of a Decision Tree

A decision tree has a structure that consists of a root node, internal nodes/decision nodes,
branches, and terminal nodes/leaf nodes. The topmost node in the tree is the root node. Internal
nodes are the test nodes and are also called decision nodes. These nodes represent a choice
or test of an input attribute and the outcome or outputs of the test condition are the branches
emanating from this decision node. The branches are labelled as per the outcomes or output
values of the test condition. Each branch represents a sub-tree or subsection of the entire tree.
Every decision node is part of a path to a leaf node. The leaf nodes represent the labels or the
outcome of a decision path. The labels of the leaf nodes are the different target classes a data
instance can belong to.

Every path from root to a leaf node represents a logical rule that corresponds to a conjunction
of test attributes and the whole tree represents a disjunction of these conjunctions. The decision
tree model, in general, represents a collection of logical rules of classification in the form of a
tree structure.

Decision networks, otherwise called influence diagrams, have a directed graph structure with
nodes and links. It is an extension of Bayesian belief networks that represents information about
each node’s current state, its possible actions, the possible outcome of those actions, and their
utility. The concept of Bayesian Belief Network (BBN) is discussed in Chapter 9.

Figure 6.1 shows symbols that are used in this book to represent different nodes in the
construction of a decision tree. A circle is used to represent a root node, a diamond symbol is
used to represent a decision node or the internal nodes, and all leaf nodes are represented with
a rectangle.

Building the Tree

Goal: Construct a decision tree with the given training dataset. The tree is constructed in a top-
down fashion. It starts from the root node. At every level of tree construction, we need to find
the best split attribute or best decision node among all attributes. This process is recursive and
continues until we reach the last level of the tree or find a leaf node which cannot be
split further. The tree construction is complete when all the test conditions lead to a leaf node.

The leaf node contains the target class or output of classification.


Output: Decision tree representing the complete hypothesis space.

Knowledge Inference or Classification

Goal: Given a test instance, infer to the target class it belongs to.

Classification: Inferring the target class for the test instance or object is based on inductive
inference on the constructed decision tree. In order to classify an object, we need to start
traversing the tree from the root. We traverse as we evaluate the test condition on every decision
node with the test object attribute value and walk to the branch corresponding to the test’s
outcome. This process is repeated until we end up in a leaf node which contains the target class
of the test object.
Output: Target label of the test instance.
Advantages of Decision Trees
1. Easy to model and interpret
2. Simple to understand
3. The input and output attributes can be discrete or continuous predictor variables
4. Can model a high degree of nonlinearity in the relationship between the target variables
and the predictor variables
5. Quick to train
Disadvantages of Decision Trees
Some of the issues that generally arise with a decision tree learning are that:
1. It is difficult to determine how deeply a decision tree can be grown or when to stop growing
it.
2. If training data has errors or missing attribute values, then the decision tree constructed may
become unstable or biased.
3. If the training data has continuous-valued attributes, handling it is computationally complex
and has to be discretized.
4. A complex decision tree may also be over-fitting with the training data.
5. Decision tree learning is not well suited for classifying multiple output classes.
6. Learning an optimal decision tree is also known to be NP-complete.
How to draw a decision tree to predict a student's academic performance based on the given
information such as class attendance, class assignments, home-work assignments, tests,
participation in competitions or other events, group activities such as projects and
presentations, etc.
Solution:
The target feature is the student performance in the final examination—whether he will pass or
fail in the examination. The decision nodes are test nodes which check for conditions like:
 “What’s the student’s class attendance?”
 “How did he perform in his class assignments?”

 “Did he do his home assignments properly?”
 “What about his assessment results?”
 “Did he participate in competitions or other events?”
 “What is the performance rating in group activities such as projects and presentations?”

The leaf nodes represent the outcomes, that is, either ‘pass’, or ‘fail’.
A decision tree would be constructed by following a set of if-else conditions which may or
may not include all the attributes, and the decision nodes can have two or more outcomes.
Hence, the tree is not a binary tree.
Note: A decision tree is not always a binary tree; it is a tree which can have more than two
branches.
Example 6.2:
Predict a student’s academic performance of whether he will pass or fail based on the given
information such as ‘Assessment’ and ‘Assignment’. The following Table 6.2 shows the
independent variables, Assessment and Assignment, and the target variable Exam Result with
their values. Draw a binary decision tree.
Table 6.2: Attributes and Associated Values


6.1.2 Fundamentals of Entropy


When constructing a decision tree, we need to select the best feature to split the dataset at each
level. The goal is to best describe the target class for the given test instances.
 The best split feature is the one that contains more information about how to divide
the data effectively so the target class is accurately identified.
 This means choosing features that make the resulting subsets purer in terms of their
classification.
 This continues recursively until a stopping condition is met.
Entropy is used as a measure of this information:
 It is a measure of uncertainty or randomness in a system.
 It also reflects the homogeneity of the data.
 Lower entropy indicates less randomness and more homogeneity (better
classification).
Example:
 A coin flip (2 outcomes) has lower entropy than rolling a die (6 outcomes), because
it's simpler and has fewer possible outcomes.
Higher Entropy → Higher Uncertainty
Lower Entropy → Lower Uncertainty
Entropy is a measure of the purity or impurity in a dataset:
 If all instances belong to the same class, entropy = 0 (pure).
 If classes are evenly distributed (e.g., 50%-50%), entropy = 1 (impure, max
uncertainty).

Example:
Given 10 data instances:
 6 belong to the positive class
 4 belong to the negative class
Entropy is calculated using:
Entropy = -Σ pi log2(pi) = -(6/10) log2(6/10) - (4/10) log2(4/10) ≈ 0.971
Interpretation:
 If the dataset is homogeneous, entropy = 0
 If the dataset is evenly split, entropy = 1
 The value of entropy lies between 0 and 1
 A lower entropy means a better (purer) split
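A quick check of the 6-positive / 4-negative example above (a minimal sketch; the entropy helper is written here only for illustration):

```python
import math

def entropy(counts):
    """Entropy of a class distribution given as a list of counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

print(round(entropy([6, 4]), 3))   # 0.971 -> fairly impure, close to the maximum of 1
print(entropy([10, 0]))            # 0.0   -> homogeneous (pure)
print(entropy([5, 5]))             # 1.0   -> evenly split (maximum uncertainty)
```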


6.2 Decision Tree Induction Algorithms


Several decision tree algorithms are commonly used:
 ID3 (Iterative Dichotomiser 3)
 C4.5
 CART (Classification and Regression Trees)
 Others: CHAID, QUEST, GUIDE, CRUISE, CTREE
 ID3 (1986) by J.R. Quinlan uses Information Gain to decide splits.
 C4.5 (1993) is an improvement over ID3, using Gain Ratio as the criterion.
 CART (1984) uses GINI Index and supports both classification and regression.
 C4.5 is better suited for handling continuous values and missing data.

Univariate vs. Multivariate Decision Trees:

 Univariate trees (like ID3, C4.5): Split based on one attribute at a time.
 Multivariate trees (like CART): Split based on a combination of attributes.

ID3 and Attribute Types

 ID3 works well when features are categorical (discrete).


 For continuous attributes, they need to be discretized (partitioned into ranges).
 It uses a purity measure called Information Gain to build trees.
 Ideal for nominal attributes with no missing values.
 Best for large datasets, but:
o Can lead to overfitting on small datasets.
o Performs poorly with missing values and outliers.

Note: No pruning is done in ID3, making it prone to overfitting.


C4.5 and CART handle both categorical and continuous attributes, and can handle missing
values.

6.2.1 ID3 Tree Construction

 Supervised learning algorithm.


 Uses a greedy approach to choose the best attribute (one at a time).
 Only works with categorical features.
 Builds axis-aligned splits (one feature per decision).

Procedure to Construct a Decision Tree using ID3

1. Compute Entropy_Info for the entire dataset (based on the target attribute).
2. Compute Entropy_Info and Information Gain for each attribute.
3. Choose the attribute with the lowest entropy / highest gain as the best split.
4. Place it as the root node.
5. Branch the dataset into subsets based on the values of the root attribute.
6. Repeat recursively for each subset until:
o A leaf node is formed.
o No instances remain.
o Entropy is 0.

Note: Stop branching when entropy = 0. At each step, choose the attribute with the highest
Information Gain; a small sketch of this computation follows the definitions below.

Definitions

 Let T be the training dataset.


 Let A = {A₁, A₂, ..., Aₙ} be the set of attributes.
 Let m be the number of classes.
 Let Pᵢ be the probability that a data instance belongs to class Cᵢ:
Pᵢ = dcᵢ / |T|
where dcᵢ is the number of instances of class Cᵢ and |T| is the total number of instances in the
dataset.
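The sketch below computes the dataset entropy, the weighted entropy after a split, and the resulting Information Gain used in steps 1-3; the tiny Assessment/Assignment dataset is invented for illustration and is not the worked example from the notes:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    """Information gain obtained by splitting `rows` on the attribute at `attr_index`."""
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attr_index], []).append(label)
    weighted = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted

# Illustrative dataset: (Assessment, Assignment) -> Exam Result
rows   = [("Good", "Yes"), ("Good", "No"), ("Average", "Yes"), ("Poor", "No")]
labels = ["Pass", "Pass", "Pass", "Fail"]

print(information_gain(rows, labels, 0))   # split on Assessment -> higher gain, chosen as root
print(information_gain(rows, labels, 1))   # split on Assignment -> lower gain
```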
