The document demonstrates the construction of a decision tree classifier in Python on the Iris and Diabetes datasets. It covers data loading, preprocessing, feature selection, model training, prediction, and evaluation with a confusion matrix, classification report, and accuracy score. The results show high accuracy on the Iris data and moderate accuracy on the Diabetes data.


Task-7

DEMONSTRATE THE PROCESS OF DECISION TREE CONSTRUCTION FOR CLASSIFICATION PROBLEMS USING PYTHON PROGRAMMING.

# Import necessary libraries
import pandas as pd  # For data handling and analysis

# Load the dataset
df = pd.read_csv("iris.csv")  # Load the dataset into a DataFrame

# Display the data
print(df)

sepal.length sepal.width petal.length petal.width variety
0 5.1 3.5 1.4 0.2 Setosa
1 4.9 3.0 1.4 0.2 Setosa
2 4.7 3.2 1.3 0.2 Setosa
3 4.6 3.1 1.5 0.2 Setosa
4 5.0 3.6 1.4 0.2 Setosa
.. ... ... ... ... ...
145 6.7 3.0 5.2 2.3 Virginica
146 6.3 2.5 5.0 1.9 Virginica
147 6.5 3.0 5.2 2.0 Virginica
148 6.2 3.4 5.4 2.3 Virginica
149 5.9 3.0 5.1 1.8 Virginica

[150 rows x 5 columns]

# Check for missing values in the dataset
print("\nMissing Values Count:")
print(df.isnull().sum())  # Displays the count of NaN values per column

Missing Values Count:
sepal.length 0
sepal.width 0
petal.length 0
petal.width 0
variety 0
dtype: int64
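
The Iris data has no missing values, so no cleanup is needed here. If a dataset did report nonzero counts, a minimal sketch like the following could handle them (illustrative only, not part of the task code):

# Hypothetical cleanup, only needed when isnull().sum() reports nonzero counts
df_dropped = df.dropna()                            # option 1: discard incomplete rows
df_filled = df.fillna(df.mean(numeric_only=True))   # option 2: mean-impute numeric columns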

print(df.columns) # Check actual column names

Index(['sepal.length', 'sepal.width', 'petal.length', 'petal.width',
       'variety'],
      dtype='object')

# Feature Selection
# Define independent variables (features) and dependent variable (target/class label)
x = df.drop(['variety'], axis=1)
print(x)

sepal.length sepal.width petal.length petal.width
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
.. ... ... ... ...
145 6.7 3.0 5.2 2.3
146 6.3 2.5 5.0 1.9
147 6.5 3.0 5.2 2.0
148 6.2 3.4 5.4 2.3
149 5.9 3.0 5.1 1.8

[150 rows x 4 columns]

y = df[['variety']]
print(y)

variety
0 Setosa
1 Setosa
2 Setosa
3 Setosa
4 Setosa
.. ...
145 Virginica
146 Virginica
147 Virginica
148 Virginica
149 Virginica

[150 rows x 1 columns]

# Import necessary library for splitting data
from sklearn.model_selection import train_test_split

# Split the dataset into training and testing sets (80% train, 20% test)
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, random_state=45)

# Display the result
print(x_train)
print(y_train)
print(x_test)
print(y_test)

sepal.length sepal.width petal.length petal.width
76 6.8 2.8 4.8 1.4
96 5.7 2.9 4.2 1.3
119 6.0 2.2 5.0 1.5
101 5.8 2.7 5.1 1.9
25 5.0 3.0 1.6 0.2
.. ... ... ... ...
68 6.2 2.2 4.5 1.5
95 5.7 3.0 4.2 1.2
32 5.2 4.1 1.5 0.1
124 6.7 3.3 5.7 2.1
131 7.9 3.8 6.4 2.0

[120 rows x 4 columns]


variety
76 Versicolor
96 Versicolor
119 Virginica
101 Virginica
25 Setosa
.. ...
68 Versicolor
95 Versicolor
32 Setosa
124 Virginica
131 Virginica

[120 rows x 1 columns]


sepal.length sepal.width petal.length petal.width
0 5.1 3.5 1.4 0.2
43 5.0 3.5 1.6 0.6
129 7.2 3.0 5.8 1.6
3 4.6 3.1 1.5 0.2
34 4.9 3.1 1.5 0.2
44 5.1 3.8 1.9 0.4
38 4.4 3.0 1.3 0.2
105 7.6 3.0 6.6 2.1
123 6.3 2.7 4.9 1.8
140 6.7 3.1 5.6 2.4
28 5.2 3.4 1.4 0.2
125 7.2 3.2 6.0 1.8
113 5.7 2.5 5.0 2.0
103 6.3 2.9 5.6 1.8
133 6.3 2.8 5.1 1.5
35 5.0 3.2 1.2 0.2
145 6.7 3.0 5.2 2.3
142 5.8 2.7 5.1 1.9
40 5.0 3.5 1.3 0.3
87 6.3 2.3 4.4 1.3
84 5.4 3.0 4.5 1.5
85 6.0 3.4 4.5 1.6
115 6.4 3.2 5.3 2.3
51 6.4 3.2 4.5 1.5
4 5.0 3.6 1.4 0.2
112 6.8 3.0 5.5 2.1
92 5.8 2.6 4.0 1.2
64 5.6 2.9 3.6 1.3
10 5.4 3.7 1.5 0.2
91 6.1 3.0 4.6 1.4
variety
0 Setosa
43 Setosa
129 Virginica
3 Setosa
34 Setosa
44 Setosa
38 Setosa
105 Virginica
123 Virginica
140 Virginica
28 Setosa
125 Virginica
113 Virginica
103 Virginica
133 Virginica
35 Setosa
145 Virginica
142 Virginica
40 Setosa
87 Versicolor
84 Versicolor
85 Versicolor
115 Virginica
51 Versicolor
4 Setosa
112 Virginica
92 Versicolor
64 Versicolor
10 Setosa
91 Versicolor
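
Note that train_test_split samples rows at random, so the class proportions in the two sets can drift from those of the full data. A hedged variant (stratify is not used in this task) passes the labels so the proportions stay matched:

# Stratified split: train and test keep the same class proportions as the full dataset
x_tr, x_te, y_tr, y_te = train_test_split(x, y, train_size=0.8, random_state=45, stratify=y)
print(y_tr['variety'].value_counts())
print(y_te['variety'].value_counts())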

# Import decision tree classifier
from sklearn.tree import DecisionTreeClassifier

# Create and train the tree (Gini impurity criterion, depth capped at 2)
dt = DecisionTreeClassifier(criterion='gini', max_depth=2, random_state=1)
dt.fit(x_train, y_train)

# Predict on the test data
pr = dt.predict(x_test)

# Display the predicted values
print(pr)

['Setosa' 'Setosa' 'Versicolor' 'Setosa' 'Setosa' 'Setosa' 'Setosa'
 'Virginica' 'Virginica' 'Virginica' 'Setosa' 'Virginica' 'Virginica'
 'Virginica' 'Versicolor' 'Setosa' 'Virginica' 'Virginica' 'Setosa'
 'Versicolor' 'Versicolor' 'Versicolor' 'Virginica' 'Versicolor' 'Setosa'
 'Virginica' 'Versicolor' 'Versicolor' 'Setosa' 'Versicolor']
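
This tree splits on Gini impurity (criterion='gini'); Example-2 below switches to entropy. For intuition, both measures score how mixed a node's class proportions p are: Gini is 1 - sum(p_i^2) and entropy is -sum(p_i * log2(p_i)). A small sketch, separate from the task code:

import numpy as np

def gini(p):
    # Gini impurity of a node with class proportions p
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    # Shannon entropy (in bits) of a node with class proportions p
    p = np.asarray(p, dtype=float)
    p = p[p > 0]              # skip zero proportions to avoid log2(0)
    return -np.sum(p * np.log2(p))

print(gini([1/3, 1/3, 1/3]), entropy([1/3, 1/3, 1/3]))  # uniform node: ~0.667, ~1.585
print(gini([1.0, 0.0, 0.0]), entropy([1.0, 0.0, 0.0]))  # pure node: 0 under both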

# Import metrics for evaluation
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

# Evaluate the model

# Confusion Matrix
cf = confusion_matrix(y_test, pr)
print("Confusion Matrix:\n", cf)

# Classification Report
cr = classification_report(y_test, pr)
print("Classification Report:\n", cr)

# Accuracy Score
print("Accuracy Score:", accuracy_score(y_test, pr) * 100)

Confusion Matrix:
 [[11  0  0]
 [ 0  7  0]
 [ 0  2 10]]
Classification Report:
               precision    recall  f1-score   support

      Setosa       1.00      1.00      1.00        11
  Versicolor       0.78      1.00      0.88         7
   Virginica       1.00      0.83      0.91        12

    accuracy                           0.93        30
   macro avg       0.93      0.94      0.93        30
weighted avg       0.95      0.93      0.93        30

Accuracy Score: 93.33333333333333
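
As a sanity check, the accuracy can be recomputed from the confusion matrix itself: the diagonal holds the correct predictions (11 + 7 + 10 = 28 of 30 test samples).

import numpy as np
print(np.trace(cf) / cf.sum())  # 28 / 30 = 0.9333...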

# Predict the output for the training data using the trained model
f = dt.predict(x_train)
# Check the accuracy on the training data to assess potential overfitting:
# a very high accuracy on the training data but a low accuracy on the
# test data could indicate overfitting
print('Training accuracy is', accuracy_score(y_train, f))
# Predict the output for the test data to evaluate model performance on unseen data
pr = dt.predict(x_test)
# Check the accuracy on the testing data to see how well the model generalizes
print('Testing accuracy is', accuracy_score(y_test, pr))

Training accuracy is 0.9583333333333334
Testing accuracy is 0.9666666666666667
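
The comments above describe the overfitting check; one way to see the trade-off directly is to sweep max_depth and compare both accuracies. This loop is a sketch on top of the task's variables, not part of the original code:

# Compare train/test accuracy across depths; a widening gap suggests overfitting
for depth in range(1, 8):
    m = DecisionTreeClassifier(criterion='gini', max_depth=depth, random_state=1)
    m.fit(x_train, y_train)
    print('depth', depth,
          'train', accuracy_score(y_train, m.predict(x_train)),
          'test', accuracy_score(y_test, m.predict(x_test)))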

from matplotlib import pyplot as plt
from sklearn.tree import plot_tree

# Set up the figure size for the plot
plt.figure(figsize=(12, 8))

# Plot the fitted decision tree
plot_tree(dt, filled=True, impurity=True)

# Show the plot
plt.show()
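
If a text dump of the rules is easier to read than the figure, scikit-learn's export_text prints the same splits as indented if/else rules (an optional extra, not in the original task):

from sklearn.tree import export_text
print(export_text(dt, feature_names=list(x.columns)))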

# Example-2
# Load the dataset: "diabetes.csv" contains information about patients'
# health parameters and diabetes status
df = pd.read_csv("diabetes.csv")  # Load the dataset into a DataFrame

# Display the data
print(df)

     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0              6      148             72             35        0  33.6
1              1       85             66             29        0  26.6
2              8      183             64              0        0  23.3
3              1       89             66             23       94  28.1
4              0      137             40             35      168  43.1
..           ...      ...            ...            ...      ...   ...
763           10      101             76             48      180  32.9
764            2      122             70             27        0  36.8
765            5      121             72             23      112  26.2
766            1      126             60              0        0  30.1
767            1       93             70             31        0  30.4

     DiabetesPedigreeFunction  Age  Outcome
0                       0.627   50        1
1                       0.351   31        0
2                       0.672   32        1
3                       0.167   21        0
4                       2.288   33        1
..                        ...  ...      ...
763                     0.171   63        0
764                     0.340   27        0
765                     0.245   30        0
766                     0.349   47        1
767                     0.315   23        0

[768 rows x 9 columns]

# Check for missing values in the dataset
print("\nMissing Values Count:")
print(df.isnull().sum())  # Displays the count of NaN values per column

Missing Values Count:
Pregnancies 0
Glucose 0
BloodPressure 0
SkinThickness 0
Insulin 0
BMI 0
DiabetesPedigreeFunction 0
Age 0
Outcome 0
dtype: int64

print(df.columns) # Check actual column names

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')

# Feature Selection
# Define independent variables (features) and dependent variable (target/class label)
x = df.drop(['Outcome'], axis=1)
print(x)

     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0              6      148             72             35        0  33.6
1              1       85             66             29        0  26.6
2              8      183             64              0        0  23.3
3              1       89             66             23       94  28.1
4              0      137             40             35      168  43.1
..           ...      ...            ...            ...      ...   ...
763           10      101             76             48      180  32.9
764            2      122             70             27        0  36.8
765            5      121             72             23      112  26.2
766            1      126             60              0        0  30.1
767            1       93             70             31        0  30.4

     DiabetesPedigreeFunction  Age
0                       0.627   50
1                       0.351   31
2                       0.672   32
3                       0.167   21
4                       2.288   33
..                        ...  ...
763                     0.171   63
764                     0.340   27
765                     0.245   30
766                     0.349   47
767                     0.315   23

[768 rows x 8 columns]

y = df[['Outcome']]
print(y)

Outcome
0 1
1 0
2 1
3 0
4 1
.. ...
763 0
764 0
765 0
766 1
767 0

[768 rows x 1 columns]

# Import necessary library for splitting data
from sklearn.model_selection import train_test_split

# Split the dataset into training and testing sets (80% train, 20% test)
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, random_state=45)

# Display the result
print(x_train)
print(y_train)
print(x_test)
print(y_test)

     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
727            0      141             84             26        0  32.4
327           10      179             70              0        0  35.1
721            1      114             66             36      200  38.1
210            2       81             60             22        0  27.7
683            4      125             80              0        0  32.3
..           ...      ...            ...            ...      ...   ...
725            4      112             78             40        0  39.4
607            1       92             62             25       41  19.5
544            1       88             78             29       76  32.0
643            4       90              0              0        0  28.0
414            0      138             60             35      167  34.6

     DiabetesPedigreeFunction  Age
727                     0.433   22
327                     0.200   37
721                     0.289   21
210                     0.290   25
683                     0.536   27
..                        ...  ...
725                     0.236   38
607                     0.482   25
544                     0.365   29
643                     0.610   31
414                     0.534   21

[614 rows x 8 columns]


Outcome
727 0
327 0
721 0
210 0
683 1
.. ...
725 0
607 0
544 0
643 0
414 1

[614 rows x 1 columns]


     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
195            5      158             84             41      210  39.4
51             1      101             50             15       36  24.2
66             0      109             88             30        0  32.5
437            5      147             75              0        0  29.9
665            1      112             80             45      132  34.8
..           ...      ...            ...            ...      ...   ...
246           10      122             68              0        0  31.2
556            1       97             70             40        0  38.1
298           14      100             78             25      184  36.6
339            7      178             84              0        0  39.9
146            9       57             80             37        0  32.8

     DiabetesPedigreeFunction  Age
195                     0.395   29
51                      0.526   26
66                      0.855   38
437                     0.434   28
665                     0.217   24
..                        ...  ...
246                     0.258   41
556                     0.218   30
298                     0.412   46
339                     0.331   41
146                     0.096   41

[154 rows x 8 columns]


Outcome
195 1
51 0
66 1
437 0
665 0
.. ...
246 0
556 0
298 1
339 1
146 0
[154 rows x 1 columns]

# Import decision tree classifier
from sklearn.tree import DecisionTreeClassifier

# Create and train the tree (entropy criterion, depth capped at 2)
dt = DecisionTreeClassifier(criterion='entropy', max_depth=2, random_state=1)
dt.fit(x_train, y_train)

# Predict on the test data
pr = dt.predict(x_test)

# Display the predicted values
print(pr)

[0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 1 1 0 0 0 0 0 1 0
 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0
 0 0 0 0 1 0]

# Import metrics for evaluation
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

# Evaluate the model

# Confusion Matrix
cf = confusion_matrix(y_test, pr)
print("Confusion Matrix:\n", cf)

# Classification Report
cr = classification_report(y_test, pr)
print("Classification Report:\n", cr)

# Accuracy Score
print("Accuracy Score:", accuracy_score(y_test, pr) * 100)

Confusion Matrix:
 [[97  7]
 [36 14]]
Classification Report:
               precision    recall  f1-score   support

           0       0.73      0.93      0.82       104
           1       0.67      0.28      0.39        50

    accuracy                           0.72       154
   macro avg       0.70      0.61      0.61       154
weighted avg       0.71      0.72      0.68       154

Accuracy Score: 72.07792207792207
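
The weak recall on class 1 can be read straight off the confusion matrix: of the 50 actual positives, only 14 are predicted correctly. As a quick check:

# cf rows are actual classes, columns are predicted classes
tn, fp, fn, tp = cf.ravel()
print('precision(1) =', tp / (tp + fp))  # 14 / 21 = 0.67
print('recall(1)    =', tp / (tp + fn))  # 14 / 50 = 0.28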

# Predict the output for the training data using the trained model
f = dt.predict(x_train)
# Check the accuracy on the training data to assess potential overfitting:
# a very high accuracy on the training data but a low accuracy on the
# test data could indicate overfitting
print('Training accuracy is', accuracy_score(y_train, f))
# Predict the output for the test data to evaluate model performance on unseen data
pr = dt.predict(x_test)
# Check the accuracy on the testing data to see how well the model generalizes
print('Testing accuracy is', accuracy_score(y_test, pr))

Training accuracy is 0.7752442996742671
Testing accuracy is 0.7337662337662337

from matplotlib import pyplot as plt
from sklearn.tree import plot_tree

# Set up the figure size for the plot
plt.figure(figsize=(12, 8))

# Plot the fitted decision tree
plot_tree(dt, filled=True, impurity=True)

# Show the plot
plt.show()
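
Because the classes are imbalanced (104 negatives vs. 50 positives in the test set) and recall on class 1 is low, one possible follow-up, sketched here rather than taken from the task, is to re-weight the classes:

# class_weight='balanced' penalizes errors on the rarer class more heavily,
# typically trading some precision for recall on class 1; results will vary
dt_bal = DecisionTreeClassifier(criterion='entropy', max_depth=2,
                                class_weight='balanced', random_state=1)
dt_bal.fit(x_train, y_train)
print(classification_report(y_test, dt_bal.predict(x_test)))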
