
MLA Lab 6:- Implementation of Decision Tree

Name: Tushar Patil
Roll no: A254
Batch: B

Theory:-
Decision trees are a popular machine learning algorithm used for both
classification and regression tasks. They operate by recursively partitioning the
input space into regions, with each partition corresponding to a decision based
on the values of input features. Here's a concise overview:

Splitting Criteria: Decision trees make decisions based on splitting criteria, such as Gini impurity or entropy for classification and mean squared error for regression. These criteria quantify the impurity or uncertainty in a dataset.
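
As a quick illustration, here is a minimal sketch of computing Gini impurity for a list of class labels (the helper name gini_impurity is my own, not part of the lab code):

import numpy as np

def gini_impurity(labels):
    # Gini impurity = 1 - sum of squared class proportions.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_impurity(['acc', 'acc', 'acc']))             # 0.0 (pure node)
print(gini_impurity(['acc', 'unacc', 'acc', 'unacc']))  # 0.5 (maximally mixed binary node)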

Tree Construction: Decision trees are constructed recursively. At each step, the algorithm selects the best feature and corresponding threshold to split the data into two or more subsets. This process continues until a stopping criterion is met, such as reaching a maximum depth or a minimum number of samples in a node.
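
To make this greedy search concrete, here is a simplified sketch of finding the best single split (it assumes numeric numpy arrays X and y and the gini_impurity helper above; real implementations add recursion and stopping rules):

def best_split(X, y):
    # Try every (feature, threshold) pair and keep the one that
    # minimizes the weighted Gini impurity of the two children.
    best_feature, best_threshold, best_score = None, None, float('inf')
    n = len(y)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            score = (len(left) * gini_impurity(left)
                     + len(right) * gini_impurity(right)) / n
            if score < best_score:
                best_feature, best_threshold, best_score = j, t, score
    return best_feature, best_threshold, best_score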

Pruning: Decision trees can suffer from overfitting, especially when they grow
too deep. Pruning techniques help to prevent overfitting by removing nodes
that do not significantly improve the tree's performance on a validation set.
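
In scikit-learn, post-pruning is exposed through cost-complexity pruning; a minimal sketch (the ccp_alpha value is an arbitrary example and would normally be chosen by cross-validation):

from sklearn.tree import DecisionTreeClassifier

# Larger ccp_alpha prunes more aggressively; the default 0.0 disables pruning.
pruned_clf = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)
# DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
# returns the candidate alpha values to cross-validate over.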

Tree Interpretability: One of the main advantages of decision trees is their interpretability. The resulting tree structure can be easily visualized and understood, making it valuable for explaining the decision-making process to stakeholders.

Handling Categorical Features: Decision trees can, in principle, split directly on categorical features, but some implementations (including scikit-learn's) require categorical features to be encoded as numbers first, as is done later in this lab.
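
A minimal sketch of the kind of ordinal encoding used later in this lab (the toy DataFrame is made up for illustration):

import pandas as pd
import category_encoders as ce

toy = pd.DataFrame({'safety': ['low', 'high', 'med', 'high']})
enc = ce.OrdinalEncoder(cols=['safety'])
print(enc.fit_transform(toy))  # each category is mapped to an integer code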

Ensemble Methods: Decision trees can be combined into ensemble methods like Random Forests or Gradient Boosted Trees, which often result in improved performance by aggregating the predictions of multiple trees.
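
For example, a random forest can be swapped in on the same encoded data with a couple of lines (a hypothetical extension; this lab trains single trees):

from sklearn.ensemble import RandomForestClassifier

# 100 trees, each trained on a bootstrap sample with random feature subsets.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
# rf.fit(X_train, y_train); rf.score(X_test, y_test)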

Scalability: While decision trees are efficient for small to medium-sized datasets, training can become slow on very large datasets, since every candidate split of every feature must be evaluated at each node.

Handling Missing Values: Decision trees can handle missing values by either
ignoring them during the splitting process or imputing them based on certain
criteria.
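
The car evaluation dataset used below has no missing values, but if it did, a common approach with scikit-learn trees is to impute before training; a minimal sketch:

from sklearn.impute import SimpleImputer

# Replace missing entries with the most frequent value in each column.
imputer = SimpleImputer(strategy='most_frequent')
# X_train = imputer.fit_transform(X_train); X_test = imputer.transform(X_test)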

Handling Imbalanced Classes: Decision trees can be biased towards the majority class in imbalanced datasets. Techniques such as class weights or resampling can be employed to mitigate this issue.
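
In scikit-learn this is a one-line change (relevant here, since the car evaluation classes are quite skewed):

from sklearn.tree import DecisionTreeClassifier

# 'balanced' reweights samples inversely proportional to class frequency.
clf_weighted = DecisionTreeClassifier(class_weight='balanced', random_state=0)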

Hyperparameter Tuning: Decision trees have hyperparameters that can be tuned to optimize performance, such as maximum depth, minimum samples per leaf, and maximum features considered for splitting.
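
A minimal grid-search sketch over the hyperparameters just mentioned (the grid values are arbitrary examples):

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {
    'max_depth': [3, 5, 10, None],
    'min_samples_leaf': [1, 5, 10],
    'max_features': [None, 'sqrt'],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
# search.fit(X_train, y_train); print(search.best_params_)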

Code(Python):-
"""MLA_lAB6_a254ipynb

Automatically generated by Colaboratory.

Original file is located at


https://colab.research.google.com/drive/1rd0tGaEJq0VTrq4QvnCeWS-tpCzqOi9B
"""

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import os
# Kaggle-style input listing; on Colab this directory usually does not exist, so nothing is printed.
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

import warnings

warnings.filterwarnings('ignore')

"""# **8. Import dataset** <a class="anchor" id="8"></a>

[Table of Contents](#0.1)
"""

data = '/content/car_evaluation.csv'

df = pd.read_csv(data, header=None)

"""# **9. Exploratory data analysis** <a class="anchor" id="9"></a>

[Table of Contents](#0.1)

Now, I will explore the data to gain insights about the data.
"""

df.shape

"""We can see that there are 1728 instances and 7 variables in the data
set."""

df.head()

col_names = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']

df.columns = col_names

col_names

# let's again preview the dataset
df.head()

"""We can see that the column names are renamed. Now, the columns have
meaningful names."""

df.info()

for col in col_names:
    print(df[col].value_counts())

"""We can see that the `doors` and `persons` are categorical in nature. So, I
will treat them as categorical variables.

### Explore `class` variable


"""

df['class'].value_counts()

df.isnull().sum()

X = df.drop(['class'], axis=1)

y = df['class']

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

X_train.shape, X_test.shape

X_train.dtypes

X_train.head()

# install and import category encoders
!pip install category_encoders
import category_encoders as ce

encoder = ce.OrdinalEncoder(cols=['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety'])

X_train = encoder.fit_transform(X_train)

X_test = encoder.transform(X_test)

X_train.head()

X_test.head()

# import DecisionTreeClassifier

from sklearn.tree import DecisionTreeClassifier

# instantiate the DecisionTreeClassifier model with criterion gini index

clf_gini = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=0)

# fit the model
clf_gini.fit(X_train, y_train)

y_pred_gini = clf_gini.predict(X_test)

from sklearn.metrics import accuracy_score

print('Model accuracy score with criterion gini index: {0:0.4f}'.format(accuracy_score(y_test, y_pred_gini)))

y_pred_train_gini = clf_gini.predict(X_train)

y_pred_train_gini

print('Training-set accuracy score: {0:0.4f}'.format(accuracy_score(y_train, y_pred_train_gini)))

# print the scores on training and test set

print('Training set score: {:.4f}'.format(clf_gini.score(X_train, y_train)))

print('Test set score: {:.4f}'.format(clf_gini.score(X_test, y_test)))

plt.figure(figsize=(12,8))

from sklearn import tree

# the classifier is already fitted above, so it can be plotted directly
tree.plot_tree(clf_gini)

import graphviz
dot_data = tree.export_graphviz(clf_gini, out_file=None,
                                feature_names=X_train.columns,
                                class_names=clf_gini.classes_,  # class labels, not the raw y_train Series
                                filled=True, rounded=True,
                                special_characters=True)

graph = graphviz.Source(dot_data)

graph

# instantiate the DecisionTreeClassifier model with criterion entropy

clf_en = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=0)

# fit the model
clf_en.fit(X_train, y_train)

y_pred_en = clf_en.predict(X_test)

from sklearn.metrics import accuracy_score

print('Model accuracy score with criterion entropy: {0:0.4f}'.format(accuracy_score(y_test, y_pred_en)))

y_pred_train_en = clf_en.predict(X_train)

y_pred_train_en

print('Training-set accuracy score: {0:0.4f}'.format(accuracy_score(y_train, y_pred_train_en)))

# print the scores on training and test set

print('Training set score: {:.4f}'.format(clf_en.score(X_train, y_train)))

print('Test set score: {:.4f}'.format(clf_en.score(X_test, y_test)))

plt.figure(figsize=(12,8))

from sklearn import tree

# again, the fitted classifier can be plotted directly
tree.plot_tree(clf_en)

import graphviz
dot_data = tree.export_graphviz(clf_en, out_file=None,
                                feature_names=X_train.columns,
                                class_names=clf_en.classes_,  # class labels, not the raw y_train Series
                                filled=True, rounded=True,
                                special_characters=True)

graph = graphviz.Source(dot_data)

graph

# Print the Confusion Matrix and slice it into four pieces

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred_en)

print('Confusion matrix\n\n', cm)

from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred_en))
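
# Optional extra (my addition, not part of the original lab): seaborn is
# imported above but unused, so the confusion matrix can also be shown
# as a heatmap for easier reading.

plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=clf_en.classes_, yticklabels=clf_en.classes_)
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.show()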

OP:-

(Output screenshots omitted: dataset preview, X_train.head() and X_test.head().)
