8. Program: Decision Tree

The document outlines the development of a program to demonstrate the decision tree algorithm, using the Breast Cancer dataset for classification. It explains the structure and functioning of decision trees, including feature selection, data splitting, and making predictions, and discusses the advantages, challenges, optimization techniques, and applications of decision trees.


Experiment 8

Develop a program to demonstrate the working of the decision tree algorithm. Use the Breast Cancer dataset for building the decision tree and apply this knowledge to classify a new sample.

Introduction to Decision Trees


What is a Decision Tree?
A Decision Tree is a supervised machine learning algorithm used for
classification and regression tasks. It models decisions using a tree-like
structure where:

Nodes represent decision points based on feature values.


Edges represent possible outcomes (branches).
Leaves represent the final decision or classification.

Decision trees work by recursively splitting data into subsets based on the most
significant feature, ensuring maximum information gain at each step.

Working of the Decision Tree Algorithm


1. Selecting the Best Feature for Splitting
At each step, the algorithm selects the feature that best separates the data.
Common methods for choosing the best feature include:

Gini Impurity
Gini = 1 - ∑ pᵢ², where pᵢ is the proportion of samples of class i in the node.

Measures how often a randomly chosen element would be incorrectly classified.

Entropy (Information Gain)


Entropy = - ∑ p(x) log₂ p(x)

Measures the uncertainty in a dataset and selects splits that maximize information
gain.

Chi-Square Test
Evaluates the statistical significance of the feature split.
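
To make these impurity measures concrete, here is a minimal sketch (not part of the original lab code, assuming only NumPy) that computes the Gini impurity and entropy of a small list of class labels:

import numpy as np

def gini_impurity(labels):
    # Gini = 1 - sum of squared class proportions in the node
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

def label_entropy(labels):
    # Entropy = -sum of p * log2(p) over the class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

labels = ['M', 'M', 'B', 'B', 'B', 'B']      # 2 malignant, 4 benign
print(gini_impurity(labels))                 # 1 - (1/3)^2 - (2/3)^2 ≈ 0.444
print(label_entropy(labels))                 # ≈ 0.918 bits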

2. Splitting the Data

The dataset is divided into subsets based on the selected feature. The process continues recursively until a stopping condition is met, for example the subset contains only one class or the tree reaches a predefined maximum depth.

3. Making Predictions

For a new sample, traverse the tree from the root to a leaf node; the leaf node contains the predicted class label, as sketched below.
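
As an illustration of this traversal, the following sketch walks one sample from the root to a leaf using the internal arrays of a fitted scikit-learn DecisionTreeClassifier. The helper is hypothetical, not part of the original lab code, and assumes a fitted model and X_test like the ones built in the cells later in this document:

import numpy as np

def predict_one(model, x):
    # Start at the root (node 0) and descend until a leaf is reached.
    tree = model.tree_
    node = 0
    while tree.children_left[node] != -1:            # -1 marks a leaf node
        if x[tree.feature[node]] <= tree.threshold[node]:
            node = tree.children_left[node]          # left branch: value <= threshold
        else:
            node = tree.children_right[node]         # right branch: value > threshold
    # The leaf stores class counts; the majority class is the prediction.
    return int(np.argmax(tree.value[node]))

# Example usage (after the model is fitted below):
# predict_one(model, X_test.iloc[0].values)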

Advantages of Decision Trees


✔ Easy to interpret – Mimics human decision-making.
✔ Handles both numerical & categorical data.
✔ Requires little data preprocessing – No need for feature scaling.
✔ Works well with missing values.

Challenges of Decision Trees


❌ Overfitting – Deep trees may memorize noise instead of patterns.
❌ Bias towards dominant features – Features with more categories can lead
to biased splits.
❌ Instability – Small data variations can lead to different trees.

Optimizing Decision Trees


1. Pruning

Pre-Pruning: Stop growing the tree early using conditions (e.g., minimum samples per split).
Post-Pruning: Remove unnecessary branches after the tree is built.

2. Setting Tree Depth

Limiting the maximum depth prevents overfitting.

3. Using Ensemble Methods

Random Forest: Combines multiple trees for better generalization.
Gradient Boosting: Sequentially improves predictions.

These options are sketched in code after this list.
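
A minimal sketch of these controls in scikit-learn (illustrative parameter values, not tuned for this dataset; each estimator is fitted the same way as the plain tree used later in this experiment):

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Pre-pruning: stop growth early with depth and sample-count limits
pre_pruned = DecisionTreeClassifier(max_depth=4, min_samples_split=10)

# Post-pruning: cost-complexity pruning; a larger ccp_alpha gives a smaller tree
post_pruned = DecisionTreeClassifier(ccp_alpha=0.01)

# Ensembles: many trees combined for better generalization
forest = RandomForestClassifier(n_estimators=100)
boosted = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1)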

Applications of Decision Trees


Medical Diagnosis – Classifying diseases based on symptoms.
Fraud Detection – Identifying fraudulent transactions.
Customer Segmentation – Categorizing users based on behavior.
In [40]: # Importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split


from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

from sklearn.tree import export_graphviz


from IPython.display import Image
import pydotplus

import warnings
warnings.filterwarnings('ignore')
In [5]: data = pd.read_csv(r'C:\Users\Admin\OneDrive\Documents\Machine Learning Lab\Dataset
In [10]: pd.set_option('display.max_columns', None)

In [11]: data.head()

Out[11]:
       id        diagnosis  radius_mean  texture_mean  perimeter_mean  area_mean  ...
0      842302    M          17.99        10.38         122.80          1001.0     ...
1      842517    M          20.57        17.77         132.90          1326.0     ...
2      84300903  M          19.69        21.25         130.00          1203.0     ...
3      84348301  M          11.42        20.38         77.58           386.1      ...
4      84358402  M          20.29        14.34         135.10          1297.0     ...

In [7]: data.shape

Out[7]: (569, 32)

In [12]: data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 32 columns):
 #   Column                   Non-Null Count  Dtype
 0   id                       569 non-null    int64
 1   diagnosis                569 non-null    object
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave_points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             569 non-null    float64
 15  area_se                  569 non-null    float64
 16  smoothness_se            569 non-null    float64
 17  compactness_se           569 non-null    float64
 18  concavity_se             569 non-null    float64
 19  concave_points_se        569 non-null    float64
 20  symmetry_se              569 non-null    float64
 21  fractal_dimension_se     569 non-null    float64
 22  radius_worst             569 non-null    float64
 23  texture_worst            569 non-null    float64
 24  perimeter_worst          569 non-null    float64
 25  area_worst               569 non-null    float64
 26  smoothness_worst         569 non-null    float64
 27  compactness_worst        569 non-null    float64
 28  concavity_worst          569 non-null    float64
 29  concave_points_worst     569 non-null    float64
 30  symmetry_worst           569 non-null    float64
 31  fractal_dimension_worst  569 non-null    float64
dtypes: float64(30), int64(1), object(1)
memory usage: 142.4+ KB

In [13]: data.diagnosis.unique()

Out[13]: array(['M', 'B'], dtype=object)

Data Preprocessing
Data Cleaning

In [14]: data.isnull().sum()
Out[14]: id 0
diagnosis 0
radius_mean 0
texture_mean 0
perimeter_mean 0
area_mean 0
smoothness_mean 0
compactness_mean 0
concavity_mean 0
concave_points_mean 0
symmetry_mean 0
fractal_dimension_mean 0
radius_se 0
texture_se 0
perimeter_se 0
area_se 0
smoothness_se 0
compactness_se 0
concavity_se 0
concave_points_se 0
symmetry_se 0
fractal_dimension_se 0
radius_worst 0
texture_worst 0
perimeter_worst 0
area_worst 0
smoothness_worst 0
compactness_worst 0
concavity_worst 0
concave_points_worst 0
symmetry_worst 0
fractal_dimension_worst 0
dtype: int64

In [15]: data.duplicated().sum()

Out[15]: np.int64(0)

In [ ]: df = data.drop(['id'], axis=1)

In [ ]: df['diagnosis'] = df['diagnosis'].map({'M':1, 'B':0})  # Malignant:1, Benign:0

Descriptive Statistics

In [18]: df.describe().T

Out[18]: (transposed table of summary statistics: count, mean, std, min, and percentiles for the diagnosis label and the 30 numeric features; every column has count 569)

Model Building

In [28]: X = df.drop('diagnosis', axis=1) # Drop the 'diagnosis' column (target)


y = df['diagnosis']

In [29]: # Split the dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)
In [30]: # Fit the decision tree model
model = DecisionTreeClassifier(criterion='entropy')  # criterion can be 'gini' or 'entropy'
model.fit(X_train, y_train)
model
Out[30]: DecisionTreeClassifier(criterion='entropy')
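
As an optional check (not part of the original cells), the fitted tree's split importances show which features it actually relied on; a short sketch using the model and X_train defined above:

importances = pd.Series(model.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))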

In [34]: import math

# Function to calculate the entropy of a label column
def entropy(column):
    counts = column.value_counts()
    probabilities = counts / len(column)
    return -sum(probabilities * probabilities.apply(math.log2))

# Function to calculate the conditional entropy of the target given a feature
def conditional_entropy(data, feature, target):
    feature_values = data[feature].unique()
    weighted_entropy = 0
    for value in feature_values:
        subset = data[data[feature] == value]
        weighted_entropy += (len(subset) / len(data)) * entropy(subset[target])
    return weighted_entropy

# Function to calculate information gain
def information_gain(data, feature, target):
    total_entropy = entropy(data[target])
    feature_conditional_entropy = conditional_entropy(data, feature, target)
    return total_entropy - feature_conditional_entropy

# Calculate and print the information gain for each feature
for feature in X:   # iterating a DataFrame yields its column names
    ig = information_gain(df, feature, 'diagnosis')
    print(f"Information Gain for {feature}: {ig}")
Information Gain for radius_mean: 0.8607815854835991
Information Gain for texture_mean: 0.8357118798482908
Information Gain for perimeter_mean: 0.9267038614138748
Information Gain for area_mean: 0.9280305529818247
Information Gain for smoothness_mean: 0.7761788341876101
Information Gain for compactness_mean: 0.9091291689709926
Information Gain for concavity_mean: 0.9350604299589776
Information Gain for concave_points_mean: 0.9420903069361305
Information Gain for symmetry_mean: 0.735036638169654
Information Gain for fractal_dimension_mean: 0.8361770160635639
Information Gain for radius_se: 0.9337337383910278
Information Gain for texture_se: 0.8642965239721755
Information Gain for perimeter_se: 0.9315454914704012
Information Gain for area_se: 0.925377169845925
Information Gain for smoothness_se: 0.9350604299589776
Information Gain for compactness_se: 0.9231889229252984
Information Gain for concavity_se: 0.9280305529818247
Information Gain for concave_points_se: 0.8585933385629725
Information Gain for symmetry_se: 0.8181371874054084
Information Gain for fractal_dimension_se: 0.9174857375160954
Information Gain for radius_worst: 0.9003074642106167
Information Gain for texture_worst: 0.8634349686194988
Information Gain for perimeter_worst: 0.8985843535052632
Information Gain for area_worst: 0.9350604299589776
Information Gain for smoothness_worst: 0.7197189097252679
Information Gain for compactness_worst: 0.9183472928687721
Information Gain for concavity_worst: 0.9302187999024514
Information Gain for concave_points_worst: 0.9148323543801957
Information Gain for symmetry_worst: 0.8453951399613433
Information Gain for fractal_dimension_worst: 0.8915544765281104

In [35]: # Export the tree to DOT format


dot_data = export_graphviz(model, out_file=None,
feature_names=X_train.columns,
rounded=True, proportion=False,
precision=2, filled=True)

# Convert DOT data to a graph


graph = pydotplus.graph_from_dot_data(dot_data)

# Display the graph


Image(graph.create_png())
Out[35]: (rendered decision tree graph)

In [41]: # Visualize the Decision Tree (optional)
plt.figure(figsize=(12, 8))
plot_tree(model, filled=True, feature_names=X.columns, class_names=['Benign', 'Malignant'])
plt.show()

In [36]: y_pred = model.predict(X_test)
         y_pred

Out[36]: array([0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0,
       1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0,
       1, 0, 0, 1])

In [38]:
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred) * 100
classification_rep = classification_report(y_test, y_pred)

# Print the results


print("Accuracy:", accuracy)
print("Classification Report:\n", classification_rep)

Accuracy: 94.73684210526315
Classification Report:
               precision    recall  f1-score   support

0 0.93 0.99 0.96 71


1 0.97 0.88 0.93 43

accuracy 0.95 114


macro avg 0.95 0.93 0.94 114
weighted avg 0.95 0.95 0.95 114
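
The imports at the top include confusion_matrix and seaborn, which are not used in the cells shown; a minimal sketch of a confusion-matrix heatmap for the same test predictions (an optional addition, not part of the original output) could look like this:

cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(4, 3))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Benign', 'Malignant'],
            yticklabels=['Benign', 'Malignant'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()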

In [45]: df.head(1)

Out[45]:    diagnosis  radius_mean  texture_mean  perimeter_mean  area_mean  smoothness_mean  ...
         0  1          17.99        10.38         122.8           1001.0     0.1184           ...

In [44]: new = [[12.5, 19.2, 80.0, 500.0, 0.085, 0.1, 0.05, 0.02, 0.17, 0.06,
                 0.4, 1.0, 2.5, 40.0, 0.006, 0.02, 0.03, 0.01, 0.02, 0.003,
                 16.0, 25.0, 105.0, 900.0, 0.13, 0.25, 0.28, 0.12, 0.29, 0.08]]
         y_pred = model.predict(new)

         # Output the prediction (0 = Benign, 1 = Malignant)
         if y_pred[0] == 0:
             print("Prediction: Benign")
         else:
             print("Prediction: Malignant")

Prediction: Benign
