
Indian Institute of Technology, Ropar

AI211
Machine Learning

LAB REPORT 8
Random Forest
I attended the lecture

Student Name: Kamakshi Gupta
Student ID: 2023AIB1008

Submission Date: 4/04/2025


1 Overview
1. Decision Trees - A decision tree is a supervised learning algorithm used for classification and regression tasks that recursively splits data based on feature values. It consists of a root node representing the entire dataset, internal nodes making decisions, branches indicating possible outcomes, and leaf nodes providing the final prediction. The splitting criterion can be Entropy and Information Gain (for classification) or Mean Squared Error (MSE) (for regression). Key hyperparameters include max depth, minimum samples split, and pruning, which help control overfitting and improve generalization. Decision trees are easy to interpret and handle both numerical and categorical data, but can overfit if not properly tuned. Despite their simplicity, they serve as the foundation for more advanced models such as Random Forests and Gradient Boosting.

2. Random Forest - A random forest is an ensemble learning method that builds multiple decision trees and combines their predictions to improve accuracy and reduce overfitting. It works by randomly selecting subsets of data and features for training each tree, ensuring diversity in decision-making. The final prediction is made by majority voting for classification or averaging for regression. Key hyperparameters include the number of trees, maximum depth, and minimum samples per split, which help balance bias and variance. Random forests are robust to noise, handle missing values well, and provide feature importance rankings, making them widely used in various applications. However, they require more computational power compared to a single decision tree (a brief reference example follows this list).
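
For reference, the hyperparameters mentioned above map directly onto scikit-learn's off-the-shelf implementation. The snippet below is only an illustrative example on synthetic data; the report itself implements the trees and the forest from scratch in the later sections, and the parameter values here are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, used only to illustrate the hyperparameters discussed above
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# n_estimators = number of trees; max_depth and min_samples_split limit tree growth;
# max_features="sqrt" gives each split a random subset of features
rf = RandomForestClassifier(n_estimators=100, max_depth=10,
                            min_samples_split=5, max_features="sqrt",
                            random_state=0)
rf.fit(X_train, y_train)

print("test accuracy:", rf.score(X_test, y_test))
print("top feature importances:", sorted(rf.feature_importances_)[-3:])
```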

2 Data Preprocessing
1. Loaded the dataset as a dataframe and added headings.

2. Encoded >50K as 1 and ≤50K as 0 for ease of classification.

3. Next we visualised the data.

4. Plotted histograms of the features.

5. Next we removed outliers from the education-num column.

6. We then analyzed countplots of the categorical columns and found a mismatch in the number of unique values between the training (left) and testing (right) datasets. We fixed it by grouping the extra categories into an Others category and adding a corresponding dummy column to the training dataset.

7. Some issues were faced in keeping the number of columns equal between the training and testing sets.

8. Finally, dummy variables were added and the numerical columns were scaled (a code sketch of these steps is given below).
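
The preprocessing code itself is not included in this extract. The sketch below shows one way the steps above could look, assuming the UCI Adult (census income) dataset (suggested by the >50K/≤50K labels and the education-num column) and pandas/scikit-learn; the file names, column list, outlier threshold, and choice of StandardScaler are assumptions.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Assumed column names for the UCI Adult dataset; the report does not list them explicitly
COLUMNS = ["age", "workclass", "fnlwgt", "education", "education-num",
           "marital-status", "occupation", "relationship", "race", "sex",
           "capital-gain", "capital-loss", "hours-per-week", "native-country", "income"]

# Hypothetical file names; the Adult test split may need its leading comment row skipped
train = pd.read_csv("adult.data", names=COLUMNS, skipinitialspace=True)
test = pd.read_csv("adult.test", names=COLUMNS, skipinitialspace=True, skiprows=1)

# Step 2: encode the target, >50K -> 1 and <=50K -> 0 (the test split labels carry a trailing '.')
for df in (train, test):
    df["income"] = df["income"].str.rstrip(".").map({">50K": 1, "<=50K": 0})

# Step 5: drop education-num outliers (the cutoff used here is an assumption)
train = train[train["education-num"] > 2]

# Steps 6-8: one-hot encode categoricals, then align train/test so both have the same
# dummy columns; categories present in only one split become all-zero columns,
# which plays the same role as the "Others" dummy column described in step 6
train_enc = pd.get_dummies(train)
test_enc = pd.get_dummies(test)
train_enc, test_enc = train_enc.align(test_enc, join="outer", axis=1, fill_value=0)

# Scale the numerical columns with statistics from the training split only
num_cols = ["age", "fnlwgt", "education-num", "capital-gain",
            "capital-loss", "hours-per-week"]
scaler = StandardScaler()
train_enc[num_cols] = scaler.fit_transform(train_enc[num_cols])
test_enc[num_cols] = scaler.transform(test_enc[num_cols])
```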

3 Decision Trees
• Entropy Formula:

  H(Y) = -\sum_{i=1}^{k} p_i \log_2(p_i)

  where H(Y) is the entropy, p_i is the probability of class i, and k is the number of classes.

• Information Gain Calculation:

  IG(X_f) = H(Y) - \sum_{v \in V} \frac{|Y_v|}{|Y|} H(Y_v)

  where V is the set of unique values of feature X_f and Y_v is the subset of labels for which X_f = v.


• Decision Trees are initialized with optional max_depth, min_samples_split, and pruning_ratio parameters.
• The fit method calls build_tree to construct the decision tree recursively.
• The algorithm stops splitting if the depth limit is reached, too few samples remain, or all labels are the same.
• The best split is found by iterating over all features and selecting candidate thresholds using percentiles.
• Entropy and information gain are calculated for each possible split, and the best feature-threshold pair is chosen.
• Random pruning is applied based on pruning_ratio to reduce overfitting.

• The dataset is split into left and right subsets, and subtrees are built recursively.
• TreeNode objects store feature indices, thresholds, and child nodes for decision-making (see the sketch below).
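
The implementation code is not reproduced in this extract, so the sketch below is one plausible reading of the bullets above, using NumPy and the names mentioned in the report (TreeNode, build_tree, max_depth, min_samples_split, pruning_ratio). The specific percentile grid for thresholds and the interpretation of random pruning (turning a node into a leaf with probability pruning_ratio) are assumptions.

```python
import numpy as np

class TreeNode:
    """Internal nodes store a feature index, threshold and children; leaves store a class label."""
    def __init__(self, feature=None, threshold=None, left=None, right=None, value=None):
        self.feature, self.threshold = feature, threshold
        self.left, self.right, self.value = left, right, value

def entropy(y):
    # H(Y) = -sum_i p_i log2(p_i) over the class probabilities
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(y, y_left, y_right):
    # IG = H(parent) - weighted average entropy of the two children
    w = len(y_left) / len(y)
    return entropy(y) - (w * entropy(y_left) + (1 - w) * entropy(y_right))

def build_tree(X, y, depth=0, max_depth=10, min_samples_split=2,
               pruning_ratio=0.0, rng=np.random.default_rng(0)):
    # Stop at the depth limit, when too few samples remain, when the node is pure,
    # or (random pruning) with probability pruning_ratio
    if (depth >= max_depth or len(y) < min_samples_split
            or len(np.unique(y)) == 1 or rng.random() < pruning_ratio):
        return TreeNode(value=np.bincount(y).argmax())  # majority class (integer labels assumed)

    # Search all features; candidate thresholds are percentiles of the feature values
    best_gain, best_feature, best_threshold = 0.0, None, None
    for f in range(X.shape[1]):
        for t in np.percentile(X[:, f], [25, 50, 75]):
            mask = X[:, f] <= t
            if mask.all() or not mask.any():
                continue
            gain = information_gain(y, y[mask], y[~mask])
            if gain > best_gain:
                best_gain, best_feature, best_threshold = gain, f, t

    if best_feature is None:                            # no split improves entropy -> leaf
        return TreeNode(value=np.bincount(y).argmax())

    # Split into left/right subsets and build the subtrees recursively
    mask = X[:, best_feature] <= best_threshold
    left = build_tree(X[mask], y[mask], depth + 1, max_depth,
                      min_samples_split, pruning_ratio, rng)
    right = build_tree(X[~mask], y[~mask], depth + 1, max_depth,
                       min_samples_split, pruning_ratio, rng)
    return TreeNode(best_feature, best_threshold, left, right)

def predict_one(node, x):
    # Follow thresholds down the tree until a leaf is reached
    while node.value is None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.value
```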

4 Random Forest
1. The Random Forest Classifier is initialized with parameters like the number of trees, tree depth, minimum samples to split, pruning ratio, and number of features to sample at each split.

2. The fit method creates n_trees decision trees using bootstrap sampling of the training data.

3. Each tree is trained using the DecisionTreeClassifier class with the specified hyperparameters.

4. During training, each tree considers a random subset of features (typically sqrt(n_features)) for finding the best split.

5. Random pruning is applied to each tree to prevent overfitting (a minimal sketch of this procedure is given below).
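
Again as a hedged sketch rather than the report's actual code: the class below reuses build_tree and predict_one from the previous section (the report instead wraps each tree in its DecisionTreeClassifier class), draws a bootstrap sample for each tree, and aggregates by majority vote. For brevity it samples the random feature subset once per tree, whereas the report describes sampling sqrt(n_features) features at each split; integer class labels are assumed.

```python
import numpy as np

class RandomForestClassifier:
    """Minimal sketch: n_trees bootstrap-sampled decision trees combined by majority vote."""
    def __init__(self, n_trees=50, max_depth=10, min_samples_split=2,
                 pruning_ratio=0.0, max_features=None, seed=0):
        self.n_trees = n_trees
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.pruning_ratio = pruning_ratio
        self.max_features = max_features          # defaults to sqrt(n_features) in fit
        self.rng = np.random.default_rng(seed)
        self.trees = []                           # (feature_subset, root_node) pairs

    def fit(self, X, y):
        n_samples, n_features = X.shape
        k = self.max_features or int(np.sqrt(n_features))
        for _ in range(self.n_trees):
            rows = self.rng.integers(0, n_samples, size=n_samples)      # bootstrap sample
            feats = self.rng.choice(n_features, size=k, replace=False)  # random feature subset
            root = build_tree(X[rows][:, feats], y[rows],
                              max_depth=self.max_depth,
                              min_samples_split=self.min_samples_split,
                              pruning_ratio=self.pruning_ratio,
                              rng=self.rng)
            self.trees.append((feats, root))
        return self

    def predict(self, X):
        # Majority vote over the per-tree predictions for each sample
        votes = np.array([[predict_one(root, x[feats]) for feats, root in self.trees]
                          for x in X])
        return np.array([np.bincount(v).argmax() for v in votes])
```

Usage would be along the lines of rf = RandomForestClassifier(n_trees=100, max_depth=10).fit(X_train, y_train) followed by preds = rf.predict(X_test), with accuracy computed as (preds == y_test).mean().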

5 Testing & Final Result

