Team 5
Krishnan KM (20031)
Spandana M (20038)
Sathvika P (20050)
Rishekesan SV (20058)
Decision Tree
Introduction:
Decision Trees are a popular machine learning algorithm used for classification and
regression problems. A decision tree is a flowchart-like model that uses a tree structure to
represent decisions and their possible consequences. In this document, we will discuss the
key concepts of decision trees, including entropy, information gain, Gini impurity, working
with categorical and numerical features, overfitting, hyperparameter tuning techniques, and
the impact of outliers and missing values.
Entropy:
Entropy is a measure of the impurity or randomness of the data. In decision trees, entropy
is used to determine the best split by measuring the homogeneity of the data at each split
point. The formula for entropy is:
Entropy = -p1*log2(p1) - p2*log2(p2) - ... - pn*log2(pn)
Where p1, p2, ..., pn are the proportions of the different classes in the data.
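As a concrete illustration (not from the original report), the formula can be computed directly from the class proportions; a minimal sketch in Python using NumPy:
import numpy as np

def entropy(labels):
    # Entropy of an array of class labels, using the formula above.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()            # class proportions p1, ..., pn
    return -np.sum(p * np.log2(p))       # -sum(pi * log2(pi))

print(entropy(np.array([0, 0, 1, 1])))   # 1.0, a maximally mixed binary node
print(entropy(np.array([0, 0, 0, 0])))   # 0.0, a pure node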
Information Gain:
Information gain is a measure of the reduction in entropy achieved by splitting the data
based on a specific feature. Information gain is calculated as the difference between the
entropy of the parent node and the weighted average of the entropy of the child nodes. The
higher the information gain, the better the split.
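For example, information gain for a candidate split can be sketched as follows, reusing the entropy helper above; representing the split simply as a list of child label arrays is an illustrative simplification:
def information_gain(parent_labels, child_label_groups):
    # Entropy of the parent minus the weighted average entropy of the children.
    n = len(parent_labels)
    weighted_child_entropy = sum(
        len(child) / n * entropy(child) for child in child_label_groups
    )
    return entropy(parent_labels) - weighted_child_entropy

# A split that produces two pure children gives the maximum gain of 1.0.
parent = np.array([0, 0, 1, 1])
children = [np.array([0, 0]), np.array([1, 1])]
print(information_gain(parent, children))   # 1.0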
Gini Impurity:
Gini impurity is another measure of the impurity of the data, similar to entropy. In decision
trees, Gini impurity is used to determine the best split by measuring the homogeneity of
the data at each split point. The formula for Gini impurity is:
Gini Impurity = 1 - (p1^2 + p2^2 + ... + pn^2)
Where p1, p2, ..., pn are the proportions of the different classes in the data.
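A matching sketch for Gini impurity, again assuming the labels are given as a NumPy array:
def gini_impurity(labels):
    # Gini impurity of an array of class labels, using the formula above.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)               # 1 - sum(pi^2)

print(gini_impurity(np.array([0, 0, 1, 1])))  # 0.5, a maximally mixed binary node
print(gini_impurity(np.array([1, 1, 1, 1])))  # 0.0, a pure node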
Working with categorical and numerical features:
Decision trees can handle both categorical and numerical features. For categorical features,
the algorithm either creates a branch for each possible value of the feature or groups the
values into subsets. For numerical features, the algorithm determines the best split point by
calculating the information gain or Gini impurity for each candidate threshold.
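One simple way to see this for a numerical feature is to scan the midpoints between consecutive sorted values and keep the threshold with the highest information gain; the rough sketch below reuses the helpers defined earlier and is only an illustration, not how any particular library implements it:
def best_numeric_split(values, labels):
    # Try each midpoint between consecutive sorted values as a split threshold.
    order = np.argsort(values)
    values, labels = np.asarray(values)[order], np.asarray(labels)[order]
    best_gain, best_threshold = -1.0, None
    for i in range(1, len(values)):
        if values[i] == values[i - 1]:
            continue                          # no threshold between equal values
        threshold = (values[i] + values[i - 1]) / 2
        gain = information_gain(labels, [labels[:i], labels[i:]])
        if gain > best_gain:
            best_gain, best_threshold = gain, threshold
    return best_threshold, best_gain

print(best_numeric_split([1.0, 2.0, 3.0, 4.0], [0, 0, 1, 1]))   # (2.5, 1.0)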
Overfitting:
One of the biggest challenges with decision trees is overfitting, where the tree is too
complex and captures noise in the data instead of the underlying patterns. To avoid
overfitting, techniques such as pruning, setting a minimum number of samples required to
split an internal node, and setting a maximum depth for the tree can be used.
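In scikit-learn, for instance, these controls correspond to constructor arguments such as max_depth, min_samples_split, and ccp_alpha (cost-complexity pruning); the synthetic dataset and parameter values below are illustrative only:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unconstrained tree tends to memorize the training data (overfitting);
# limiting depth and split size and applying pruning keeps the tree simpler.
deep_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned_tree = DecisionTreeClassifier(
    max_depth=4, min_samples_split=10, ccp_alpha=0.01, random_state=0
).fit(X_train, y_train)

print(deep_tree.score(X_train, y_train), deep_tree.score(X_test, y_test))
print(pruned_tree.score(X_train, y_train), pruned_tree.score(X_test, y_test))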
Hyperparameter tuning techniques:
Hyperparameter tuning can be used to improve the performance of decision trees.
Some common hyperparameters include the minimum number of samples required to split
an internal node, the maximum depth of the tree, and the maximum number of features to
consider when looking for the best split. Hyperparameter tuning can be done using
techniques such as grid search and random search.
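As a sketch, grid search over these hyperparameters can be done with scikit-learn's GridSearchCV; the parameter grid below is an illustrative choice, and X_train, y_train are the synthetic data from the previous sketch:
from sklearn.model_selection import GridSearchCV

param_grid = {
    "max_depth": [3, 5, 10, None],
    "min_samples_split": [2, 5, 10],
    "max_features": [None, "sqrt", "log2"],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)
print(search.best_score_)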
Impact of outliers and missing values:
Outliers and missing values can have a significant impact on the performance of decision
trees. Outliers can cause the algorithm to create a split that is not representative of the
majority of the data. Missing values can also cause problems, as the algorithm may not
know how to handle them. One approach to handling missing values is to impute them with
the mean or median value of the feature. Outliers can be handled by removing them or
using a robust decision tree algorithm that is less sensitive to outliers.
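A common way to perform the imputation step in practice is scikit-learn's SimpleImputer; the small array below, including the value 100.0 standing in for an outlier, is made up for illustration:
import numpy as np
from sklearn.impute import SimpleImputer

X_with_missing = np.array([[1.0, 2.0],
                           [np.nan, 3.0],
                           [4.0, np.nan],
                           [100.0, 5.0]])    # 100.0 plays the role of an outlier

# Replace missing values with the median of each column; the median is also
# less affected by the outlier than the mean would be.
imputer = SimpleImputer(strategy="median")
print(imputer.fit_transform(X_with_missing))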
Conclusion:
In conclusion, decision trees are a powerful machine learning algorithm that can be used
for classification and regression problems. Key concepts include entropy, information gain,
and Gini impurity, as well as techniques for working with categorical and numerical
features, avoiding overfitting, tuning hyperparameters, and handling outliers and missing
values. Decision trees are a popular and effective tool for building predictive models, and
understanding these concepts is essential for using them effectively.
Random Forest
Introduction:
Random Forest is an ensemble learning algorithm that combines multiple decision trees to
improve the accuracy of predictions. It is a popular machine learning algorithm used for
both classification and regression tasks. In this document, we will discuss the key concepts
of Random Forest, including ensemble techniques (boosting and bagging), working as a
classifier and regressor, and hyperparameter tuning (Gridsearch and Random search).
Ensemble Techniques:
Ensemble techniques are used in machine learning to improve the accuracy of a model.
Two common ensemble techniques are Boosting and Bagging; Random Forest itself is built on
bagging, while boosting is used by related tree ensembles such as AdaBoost and gradient
boosting.
Boosting is an ensemble technique in which the algorithm trains multiple weak learners in
sequence, with each learner correcting the errors of the previous ones. Boosting focuses on
the examples that are hard to predict, allowing the ensemble to improve its performance on
these difficult cases.
Bagging (bootstrap aggregating) is an ensemble technique that trains multiple learners
independently and in parallel, each on a random bootstrap sample of the data. The final
prediction is made by averaging the learners' predictions (or by majority vote for
classification). Bagging reduces the variance of the model and limits the impact of outliers
and noise in the data.
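To make the distinction concrete, scikit-learn provides both styles of tree ensemble; the sketch below uses AdaBoostClassifier for boosting and BaggingClassifier for bagging on a synthetic dataset, with illustrative settings only:
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Boosting: shallow trees trained in sequence, each focusing on earlier errors.
boosted = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Bagging: trees trained independently on bootstrap samples, then combined
# (the default base estimator of BaggingClassifier is a decision tree).
bagged = BaggingClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

print("Boosting accuracy:", boosted.score(X_test, y_test))
print("Bagging accuracy:", bagged.score(X_test, y_test))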
Working as Classifier and Regressor:
Random Forest can be used for both classification and regression tasks. In classification
tasks, Random Forest generates a set of decision trees and assigns the class label based on
the majority vote of the trees. In regression tasks, Random Forest generates a set of
decision trees and assigns the predicted value based on the average of the values predicted
by each tree.
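In scikit-learn this corresponds to two separate estimators, RandomForestClassifier (majority vote) and RandomForestRegressor (average of the trees' predictions); a minimal sketch on synthetic data:
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Classification: each tree votes for a class and the majority wins.
Xc, yc = make_classification(n_samples=300, n_features=8, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(Xc, yc)
print(clf.predict(Xc[:5]))

# Regression: the prediction is the average of the individual trees' outputs.
Xr, yr = make_regression(n_samples=300, n_features=8, random_state=0)
reg = RandomForestRegressor(n_estimators=100, random_state=0).fit(Xr, yr)
print(reg.predict(Xr[:5]))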
Hyperparameter Tuning:
Hyperparameter tuning is an essential step in improving the performance of Random
Forest. Hyperparameters are parameters that are not learned from the data, but rather set
before the training process begins. Two common hyperparameter tuning techniques are
Gridsearch and Random search.
Gridsearch is a hyperparameter tuning technique that involves exhaustively searching
through a specified range of hyperparameters to find the best combination that maximizes
the model's performance. Gridsearch is a brute-force method that can be time-consuming,
but it is guaranteed to find the best combination among the values in the specified grid.
Random search is another hyperparameter tuning technique that randomly samples
hyperparameter combinations from a specified range or distribution. Random search is less
computationally expensive than Gridsearch and can explore values that a coarse grid would
miss.
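Both approaches are available in scikit-learn as GridSearchCV and RandomizedSearchCV; the sketch below tunes a Random Forest on the synthetic classification data from the previous sketch, and the parameter ranges are illustrative only:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

param_grid = {"n_estimators": [100, 200], "max_depth": [5, 10, None]}

# Grid search: tries every combination in the grid.
grid = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
grid.fit(Xc, yc)
print("Grid search best params:", grid.best_params_)

# Random search: samples a fixed number of combinations from the same ranges.
rand = RandomizedSearchCV(
    RandomForestClassifier(random_state=0), param_grid, n_iter=4, cv=5, random_state=0
)
rand.fit(Xc, yc)
print("Random search best params:", rand.best_params_)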
Conclusion:
Random Forest is a powerful ensemble learning algorithm that can improve the accuracy of
predictions for both classification and regression tasks. Key concepts include ensemble
techniques (boosting and bagging), working as a classifier and regressor, and
hyperparameter tuning (Gridsearch and Random search). Understanding these concepts is
essential for using Random Forest effectively and improving its performance.
# Import libraries for data handling and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the kyphosis dataset and inspect its columns
raw_data = pd.read_csv('kyphosis.csv')
print(raw_data.columns)
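The evaluation output below is reported without the code that produced it; a minimal sketch of the kind of workflow that could generate such output is shown here, assuming 'Kyphosis' is the target column and using an ordinary train/test split (the model choice and split settings are assumptions, not the original code):
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Assumed: 'Kyphosis' is the target column and the remaining columns are features.
X = raw_data.drop('Kyphosis', axis=1)
y = raw_data['Kyphosis']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

model = DecisionTreeClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

print(classification_report(y_test, predictions))
print(confusion_matrix(y_test, predictions))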
              precision    recall  f1-score   support

    accuracy                           0.81        21
   macro avg       0.75      0.81      0.77        21
weighted avg       0.84      0.81      0.82        21

Confusion Matrix:
[[13  3]
 [ 1  4]]
Accuracy: 0.8095238095238095
Precision: 0.9285714285714286
Recall: 0.8125
F1 Score: 0.8125

Confusion Matrix:
[[3 0]
 [0 4]]
Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1 Score: 1.0
Random Forest Regressor
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import matplotlib.pyplot as plt
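The regression metrics below are likewise reported without the fitting code; a minimal sketch of the workflow suggested by the imports above is given here, where the file name 'data.csv' and the column name 'target' are placeholders rather than the original dataset:
# Placeholder data source: substitute the actual dataset and target column.
data = pd.read_csv('data.csv')
X = data.drop('target', axis=1)
y = data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

regressor = RandomForestRegressor(n_estimators=100, random_state=42)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

print('MSE:', mean_squared_error(y_test, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('MAE:', mean_absolute_error(y_test, y_pred))
print('R2 Score:', r2_score(y_test, y_pred))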
MSE: 10.374371921259836
RMSE: 3.2209271834768067
MAE: 2.1481259842519673
R2 Score: 0.8518521336172665