Team 5
Krishnan KM (20031)
Spandana M (20038)
Sathvika P (20050)
Rishekesan SV (20058)
Decision Tree
Introduction:
Decision Trees are a popular machine learning algorithm used for classification and
regression problems. A decision tree is a flowchart-like model that uses a tree structure to
represent decisions and their possible consequences. In this document, we will discuss the
key concepts of decision trees, including entropy, information gain, Gini impurity, working
with categorical and numerical features, overfitting, hyperparameter tuning techniques, and
the impact of outliers and missing values.
Entropy:
Entropy is a measure of the impurity or randomness of the data. In decision trees, entropy
is used to determine the best split by measuring the homogeneity of the data at each split
point. The formula for entropy is:
Entropy = -p1*log2(p1) - p2*log2(p2) - ... - pn*log2(pn)
Where p1, p2, ..., pn are the proportions of the different classes in the data.
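As a concrete illustration (not from the original report), the formula can be computed directly from the class proportions; a minimal sketch in Python using NumPy:
import numpy as np

def entropy(labels):
    # Entropy of an array of class labels, using the formula above.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()            # class proportions p1, ..., pn
    return -np.sum(p * np.log2(p))       # -sum(pi * log2(pi))

print(entropy(np.array([0, 0, 1, 1])))   # 1.0, a maximally mixed binary node
print(entropy(np.array([0, 0, 0, 0])))   # 0.0, a pure node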
Information Gain:
Information gain is a measure of the reduction in entropy achieved by splitting the data
based on a specific feature. Information gain is calculated as the difference between the
entropy of the parent node and the weighted average of the entropy of the child nodes. The
higher the information gain, the better the split.
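For example, information gain for a candidate split can be sketched as follows, reusing the entropy helper above; representing the split simply as a list of child label arrays is an illustrative simplification:
def information_gain(parent_labels, child_label_groups):
    # Entropy of the parent minus the weighted average entropy of the children.
    n = len(parent_labels)
    weighted_child_entropy = sum(
        len(child) / n * entropy(child) for child in child_label_groups
    )
    return entropy(parent_labels) - weighted_child_entropy

# A split that produces two pure children gives the maximum gain of 1.0.
parent = np.array([0, 0, 1, 1])
children = [np.array([0, 0]), np.array([1, 1])]
print(information_gain(parent, children))   # 1.0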
Gini Impurity:
Gini impurity is another measure of the impurity of the data, similar to entropy. In decision
trees, Gini impurity is used to determine the best split by measuring the homogeneity of
the data at each split point. The formula for Gini impurity is:
Gini Impurity = 1 - (p1^2 + p2^2 + ... + pn^2)
Where p1, p2, ..., pn are the proportions of the different classes in the data.
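A matching sketch for Gini impurity, again assuming the labels are given as a NumPy array:
def gini_impurity(labels):
    # Gini impurity of an array of class labels, using the formula above.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)               # 1 - sum(pi^2)

print(gini_impurity(np.array([0, 0, 1, 1])))  # 0.5, a maximally mixed binary node
print(gini_impurity(np.array([1, 1, 1, 1])))  # 0.0, a pure node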
Working with categorical and numerical features:
Decision trees can handle both categorical and numerical features. For categorical features,
the algorithm either creates a branch for each possible value of the feature or groups the
values into subsets. For numerical features, the algorithm determines the best split point by
calculating the information gain or Gini impurity for each candidate threshold.
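One simple way to see this for a numerical feature is to scan the midpoints between consecutive sorted values and keep the threshold with the highest information gain; the rough sketch below reuses the helpers defined earlier and is only an illustration, not how any particular library implements it:
def best_numeric_split(values, labels):
    # Try each midpoint between consecutive sorted values as a split threshold.
    order = np.argsort(values)
    values, labels = np.asarray(values)[order], np.asarray(labels)[order]
    best_gain, best_threshold = -1.0, None
    for i in range(1, len(values)):
        if values[i] == values[i - 1]:
            continue                          # no threshold between equal values
        threshold = (values[i] + values[i - 1]) / 2
        gain = information_gain(labels, [labels[:i], labels[i:]])
        if gain > best_gain:
            best_gain, best_threshold = gain, threshold
    return best_threshold, best_gain

print(best_numeric_split([1.0, 2.0, 3.0, 4.0], [0, 0, 1, 1]))   # (2.5, 1.0)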
Overfitting:
One of the biggest challenges with decision trees is overfitting, where the tree is too
complex and captures noise in the data instead of the underlying patterns. To avoid
overfitting, techniques such as pruning, setting a minimum number of samples required to
split an internal node, and setting a maximum depth for the tree can be used.
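In scikit-learn, for instance, these controls correspond to constructor arguments such as max_depth, min_samples_split, and ccp_alpha (cost-complexity pruning); the synthetic dataset and parameter values below are illustrative only:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unconstrained tree tends to memorize the training data (overfitting);
# limiting depth and split size and applying pruning keeps the tree simpler.
deep_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned_tree = DecisionTreeClassifier(
    max_depth=4, min_samples_split=10, ccp_alpha=0.01, random_state=0
).fit(X_train, y_train)

print(deep_tree.score(X_train, y_train), deep_tree.score(X_test, y_test))
print(pruned_tree.score(X_train, y_train), pruned_tree.score(X_test, y_test))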
Hyperparameter tuning techniques:
Hyperparameter tuning can be used to improve the performance of decision trees.
Some common hyperparameters include the minimum number of samples required to split
an internal node, the maximum depth of the tree, and the maximum number of features to
consider when looking for the best split. Hyperparameter tuning can be done using
techniques such as grid search and random search.
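As a sketch, grid search over these hyperparameters can be done with scikit-learn's GridSearchCV; the parameter grid below is an illustrative choice, and X_train, y_train are the synthetic data from the previous sketch:
from sklearn.model_selection import GridSearchCV

param_grid = {
    "max_depth": [3, 5, 10, None],
    "min_samples_split": [2, 5, 10],
    "max_features": [None, "sqrt", "log2"],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)
print(search.best_score_)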
Impact of outliers and missing values:
Outliers and missing values can have a significant impact on the performance of decision
trees. Outliers can cause the algorithm to create a split that is not representative of the
majority of the data. Missing values can also cause problems, as the algorithm may not
know how to handle them. One approach to handling missing values is to impute them with
the mean or median value of the feature. Outliers can be handled by removing them or
using a robust decision tree algorithm that is less sensitive to outliers.
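A common way to perform the imputation step in practice is scikit-learn's SimpleImputer; the small array below, including the value 100.0 standing in for an outlier, is made up for illustration:
import numpy as np
from sklearn.impute import SimpleImputer

X_with_missing = np.array([[1.0, 2.0],
                           [np.nan, 3.0],
                           [4.0, np.nan],
                           [100.0, 5.0]])    # 100.0 plays the role of an outlier

# Replace missing values with the median of each column; the median is also
# less affected by the outlier than the mean would be.
imputer = SimpleImputer(strategy="median")
print(imputer.fit_transform(X_with_missing))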
Conclusion:
In conclusion, decision trees are a powerful machine learning algorithm that can be used
for classification and regression problems. Key concepts include entropy, information gain,
and Gini impurity, as well as techniques for working with categorical and numerical
features, avoiding overfitting, tuning hyperparameters, and handling outliers and missing
values. Decision trees are a popular and effective tool for building predictive models, and
understanding these concepts is essential for using them effectively.
Random Forest
Introduction:
Random Forest is an ensemble learning algorithm that combines multiple decision trees to
improve the accuracy of predictions. It is a popular machine learning algorithm used for
both classification and regression tasks. In this document, we will discuss the key concepts
of Random Forest, including ensemble techniques (boosting and bagging), working as a
classifier and regressor, and hyperparameter tuning (Gridsearch and Random search).
Ensemble Techniques:
Ensemble techniques are used in machine learning to improve the accuracy of a model.
Two common ensemble techniques are Boosting and Bagging; Random Forest itself is built on
bagging, while boosting is used by related tree ensembles such as AdaBoost and gradient
boosting.
Boosting is an ensemble technique in which the algorithm trains multiple weak learners in
sequence, with each learner correcting the errors of the previous ones. Boosting focuses on
the examples that are hard to predict, allowing the ensemble to improve its performance on
these difficult cases.
Bagging (bootstrap aggregating) is an ensemble technique that trains multiple learners
independently and in parallel, each on a random bootstrap sample of the data. The final
prediction is made by averaging the learners' predictions (or by majority vote for
classification). Bagging reduces the variance of the model and limits the impact of outliers
and noise in the data.
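To make the distinction concrete, scikit-learn provides both styles of tree ensemble; the sketch below uses AdaBoostClassifier for boosting and BaggingClassifier for bagging on a synthetic dataset, with illustrative settings only:
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Boosting: shallow trees trained in sequence, each focusing on earlier errors.
boosted = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Bagging: trees trained independently on bootstrap samples, then combined
# (the default base estimator of BaggingClassifier is a decision tree).
bagged = BaggingClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

print("Boosting accuracy:", boosted.score(X_test, y_test))
print("Bagging accuracy:", bagged.score(X_test, y_test))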
Working as Classifier and Regressor:
Random Forest can be used for both classification and regression tasks. In classification
tasks, Random Forest generates a set of decision trees and assigns the class label based on
the majority vote of the trees. In regression tasks, Random Forest generates a set of
decision trees and assigns the predicted value based on the average of the values predicted
by each tree.
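In scikit-learn this corresponds to two separate estimators, RandomForestClassifier (majority vote) and RandomForestRegressor (average of the trees' predictions); a minimal sketch on synthetic data:
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Classification: each tree votes for a class and the majority wins.
Xc, yc = make_classification(n_samples=300, n_features=8, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(Xc, yc)
print(clf.predict(Xc[:5]))

# Regression: the prediction is the average of the individual trees' outputs.
Xr, yr = make_regression(n_samples=300, n_features=8, random_state=0)
reg = RandomForestRegressor(n_estimators=100, random_state=0).fit(Xr, yr)
print(reg.predict(Xr[:5]))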
Hyperparameter Tuning:
Hyperparameter tuning is an essential step in improving the performance of Random
Forest. Hyperparameters are parameters that are not learned from the data, but rather set
before the training process begins. Two common hyperparameter tuning techniques are
Gridsearch and Random search.
Gridsearch is a hyperparameter tuning technique that involves exhaustively searching
through a specified range of hyperparameters to find the best combination that maximizes
the model's performance. Gridsearch is a brute-force method that can be time-consuming,
but it is guaranteed to find the best combination among the values in the specified grid.
Random search is another hyperparameter tuning technique that randomly samples
hyperparameter combinations from a specified range or distribution. Random search is less
computationally expensive than Gridsearch and can explore values that a coarse grid would
miss.
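Both approaches are available in scikit-learn as GridSearchCV and RandomizedSearchCV; the sketch below tunes a Random Forest on the synthetic classification data from the previous sketch, and the parameter ranges are illustrative only:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

param_grid = {"n_estimators": [100, 200], "max_depth": [5, 10, None]}

# Grid search: tries every combination in the grid.
grid = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
grid.fit(Xc, yc)
print("Grid search best params:", grid.best_params_)

# Random search: samples a fixed number of combinations from the same ranges.
rand = RandomizedSearchCV(
    RandomForestClassifier(random_state=0), param_grid, n_iter=4, cv=5, random_state=0
)
rand.fit(Xc, yc)
print("Random search best params:", rand.best_params_)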
Conclusion:
Random Forest is a powerful ensemble learning algorithm that can improve the accuracy of
predictions for both classification and regression tasks. Key concepts include ensemble
techniques (boosting and bagging), working as a classifier and regressor, and
hyperparameter tuning (Gridsearch and Random search). Understanding these concepts is
essential for using Random Forest effectively and improving its performance.
# Import libraries for data handling and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the kyphosis dataset and inspect its columns
raw_data = pd.read_csv('kyphosis.csv')
print(raw_data.columns)
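The evaluation output below is reported without the code that produced it; a minimal sketch of the kind of workflow that could generate such output is shown here, assuming 'Kyphosis' is the target column and using an ordinary train/test split (the model choice and split settings are assumptions, not the original code):
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Assumed: 'Kyphosis' is the target column and the remaining columns are features.
X = raw_data.drop('Kyphosis', axis=1)
y = raw_data['Kyphosis']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

model = DecisionTreeClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

print(classification_report(y_test, predictions))
print(confusion_matrix(y_test, predictions))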
              precision    recall  f1-score   support

    accuracy                           0.81        21
   macro avg       0.75      0.81      0.77        21
weighted avg       0.84      0.81      0.82        21

Confusion Matrix:
[[13  3]
 [ 1  4]]
Accuracy: 0.8095238095238095
Precision: 0.9285714285714286
Recall: 0.8125
F1 Score: 0.8125

Confusion Matrix:
[[3 0]
 [0 4]]
Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1 Score: 1.0
Random Forest Regressor
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import matplotlib.pyplot as plt
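The regression metrics below are likewise reported without the fitting code; a minimal sketch of the workflow suggested by the imports above is given here, where the file name 'data.csv' and the column name 'target' are placeholders rather than the original dataset:
# Placeholder data source: substitute the actual dataset and target column.
data = pd.read_csv('data.csv')
X = data.drop('target', axis=1)
y = data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

regressor = RandomForestRegressor(n_estimators=100, random_state=42)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

print('MSE:', mean_squared_error(y_test, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('MAE:', mean_absolute_error(y_test, y_pred))
print('R2 Score:', r2_score(y_test, y_pred))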
MSE: 10.374371921259836
RMSE: 3.2209271834768067
MAE: 2.1481259842519673
R2 Score: 0.8518521336172665