8. Program: Decision Tree
Decision trees work by recursively splitting data into subsets based on the most
significant feature, ensuring maximum information gain at each step.
Gini Impurity
Gini = 1 - Σ pᵢ²
Measures the impurity (class mixing) at a node; the split that yields the largest reduction in impurity, i.e. the largest information gain, is selected (a short computation sketch follows the criteria list).
Chi-Square Test
Evaluates the statistical significance of the feature split.
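To make the split criteria concrete, here is a small self-contained sketch (hypothetical helper functions, not part of the lab program) that computes Gini impurity and the impurity reduction of a candidate split:

import numpy as np

def gini_impurity(labels):
    # Gini = 1 - sum(p_i^2) over the class proportions at a node
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_gain(parent, left, right):
    # Reduction in impurity achieved by splitting `parent` into `left` and `right`
    n = len(parent)
    weighted = (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - weighted

print(gini_impurity([0, 0, 1, 1]))              # 0.5 (perfectly mixed node)
print(gini_gain([0, 0, 1, 1], [0, 0], [1, 1]))  # 0.5 (a perfect split removes all impurity)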
2. Setting Tree Depth
Pre-Pruning: Stop the tree early using conditions (e.g., min samples per split).
Post-Pruning: Remove unnecessary branches after the tree is built (see the sketch after this list).
3. Making Predictions
For a new sample, traverse the tree from the root to a leaf node. The leaf node contains the predicted class label.
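A minimal sketch of both pruning styles using scikit-learn's DecisionTreeClassifier parameters (the specific values are illustrative assumptions, not taken from the lab code):

from sklearn.tree import DecisionTreeClassifier

# Pre-pruning: limit growth while the tree is being built
pre_pruned = DecisionTreeClassifier(max_depth=4, min_samples_split=10, min_samples_leaf=5)

# Post-pruning: grow the full tree, then prune weak branches via cost-complexity pruning;
# a larger ccp_alpha removes more branches
post_pruned = DecisionTreeClassifier(ccp_alpha=0.01)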
import warnings
warnings.filterwarnings('ignore')
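The import cell for pandas and scikit-learn is not reproduced in this extract; the cells below assume at least the following imports (a minimal sketch):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report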
In [5]: data = pd.read_csv(r'C:\Users\Admin\OneDrive\Documents\Machine Learning Lab\Dataset
In [10]: pd.set_option('display.max_columns', None)
In [11]: data.head()
In [7]: data.shape
In [12]: data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 32 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   id                       569 non-null    int64
 1   diagnosis                569 non-null    object
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave_points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             569 non-null    float64
 15  area_se                  569 non-null    float64
 16  smoothness_se            569 non-null    float64
 17  compactness_se           569 non-null    float64
 18  concavity_se             569 non-null    float64
 19  concave_points_se        569 non-null    float64
 20  symmetry_se              569 non-null    float64
 21  fractal_dimension_se     569 non-null    float64
 22  radius_worst             569 non-null    float64
 23  texture_worst            569 non-null    float64
 24  perimeter_worst          569 non-null    float64
 25  area_worst               569 non-null    float64
 26  smoothness_worst         569 non-null    float64
 27  compactness_worst        569 non-null    float64
 28  concavity_worst          569 non-null    float64
 29  concave_points_worst     569 non-null    float64
 30  symmetry_worst           569 non-null    float64
 31  fractal_dimension_worst  569 non-null    float64
dtypes: float64(30), int64(1), object(1)
memory usage: 142.4+ KB
In [13]: data.diagnosis.unique()
Data Preprocessing
Data Cleaning
In [14]: data.isnull().sum()
Out[14]: id 0
diagnosis 0
radius_mean 0
texture_mean 0
perimeter_mean 0
area_mean 0
smoothness_mean 0
compactness_mean 0
concavity_mean 0
concave_points_mean 0
symmetry_mean 0
fractal_dimension_mean 0
radius_se 0
texture_se 0
perimeter_se 0
area_se 0
smoothness_se 0
compactness_se 0
concavity_se 0
concave_points_se 0
symmetry_se 0
fractal_dimension_se 0
radius_worst 0
texture_worst 0
perimeter_worst 0
area_worst 0
smoothness_worst 0
compactness_worst 0
concavity_worst 0
concave_points_worst 0
symmetry_worst 0
fractal_dimension_worst 0
dtype: int64
In [15]: data.duplicated().sum()
Out[15]: np.int64(0)
In [ ]: df = data.drop(['id'], axis=1)
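The describe() output below treats diagnosis as numeric and the later predictions are 0/1, so the diagnosis column was evidently label-encoded before modelling; that cell is not shown in this extract. A minimal sketch, assuming the usual B/M labels of this dataset map to 0/1:

# Assumed encoding step (not shown in the original notebook): benign -> 0, malignant -> 1
df['diagnosis'] = df['diagnosis'].map({'B': 0, 'M': 1})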
Descriptive Statistics
In [18]: df.describe().T
Out[18]:
                          count        mean         std         min         25%         50%
diagnosis                 569.0    0.372583    0.483918    0.000000    0.000000    0.000000
radius_mean               569.0   14.127292    3.524049    6.981000   11.700000   13.370000
texture_mean              569.0   19.289649    4.301036    9.710000   16.170000   18.840000
perimeter_mean            569.0   91.969033   24.298981   43.790000   75.170000   86.240000
area_mean                 569.0  654.889104  351.914129  143.500000  420.300000  551.100000
smoothness_mean           569.0    0.096360    0.014064    0.052630    0.086370    0.095870
compactness_mean          569.0    0.104341    0.052813    0.019380    0.064920    0.092630
concavity_mean            569.0    0.088799    0.079720    0.000000    0.029560    0.061540
concave_points_mean       569.0    0.048919    0.038803    0.000000    0.020310    0.033500
symmetry_mean             569.0    0.181162    0.027414    0.106000    0.161900    0.179200
fractal_dimension_mean    569.0    0.062798    0.007060    0.049960    0.057700    0.061540
radius_se                 569.0    0.405172    0.277313    0.111500    0.232400    0.324200
texture_se                569.0    1.216853    0.551648    0.360200    0.833900    1.108000
perimeter_se              569.0    2.866059    2.021855    0.757000    1.606000    2.287000
area_se                   569.0   40.337079   45.491006    6.802000   17.850000   24.530000
smoothness_se             569.0    0.007041    0.003003    0.001713    0.005169    0.006380
compactness_se            569.0    0.025478    0.017908    0.002252    0.013080    0.020450
concavity_se              569.0    0.031894    0.030186    0.000000    0.015090    0.025890
concave_points_se         569.0    0.011796    0.006170    0.000000    0.007638    0.010930
symmetry_se               569.0    0.020542    0.008266    0.007882    0.015160    0.018730
fractal_dimension_se      569.0    0.003795    0.002646    0.000895    0.002248    0.003187
radius_worst              569.0   16.269190    4.833242    7.930000   13.010000   14.970000
texture_worst             569.0   25.677223    6.146258   12.020000   21.080000   25.410000
perimeter_worst           569.0  107.261213   33.602542   50.410000   84.110000   97.660000
area_worst                569.0  880.583128  569.356993  185.200000  515.300000  686.500000
smoothness_worst          569.0    0.132369    0.022832    0.071170    0.116600    0.131300
compactness_worst         569.0    0.254265    0.157336    0.027290    0.147200    0.211900
concavity_worst           569.0    0.272188    0.208624    0.000000    0.114500    0.226700
concave_points_worst      569.0    0.114606    0.065732    0.000000    0.064930    0.099930
symmetry_worst            569.0    0.290076    0.061867    0.156500    0.250400    0.282200
fractal_dimension_worst   569.0    0.083946    0.018061    0.055040    0.071460    0.080040
Model Building
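The cell that separates the features from the target is not included in this extract; a minimal sketch of the X and y definitions assumed by the split below:

X = df.drop('diagnosis', axis=1)   # 30 feature columns
y = df['diagnosis']                # encoded target (assumed 0 = benign, 1 = malignant)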
In [29]: # Split the dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
In [30]: # Fit the decision tree model
model = DecisionTreeClassifier(criterion='entropy')  # criterion can be 'gini' or 'entropy'
model.fit(X_train, y_train)
model
Out[30]: DecisionTreeClassifier(criterion='entropy')
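The information_gain helper called in the next cell is defined elsewhere in the notebook and not shown here. A minimal entropy-based sketch with the same call signature (frame, feature, target); it treats each distinct feature value as its own branch, which is one common ID3-style definition and an assumption about the original helper:

import numpy as np

def entropy(labels):
    # Shannon entropy (base 2) of a label column
    p = labels.value_counts(normalize=True)
    return -np.sum(p * np.log2(p))

def information_gain(frame, feature, target):
    # IG = H(target) - weighted average of H(target | feature value)
    weighted = 0.0
    for _, subset in frame.groupby(feature):
        weighted += (len(subset) / len(frame)) * entropy(subset[target])
    return entropy(frame[target]) - weighted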
for feature in X:
    ig = information_gain(df, feature, 'diagnosis')
    print(f"Information Gain for {feature}: {ig}")
Information Gain for radius_mean: 0.8607815854835991
Information Gain for texture_mean: 0.8357118798482908
Information Gain for perimeter_mean: 0.9267038614138748
Information Gain for area_mean: 0.9280305529818247
Information Gain for smoothness_mean: 0.7761788341876101
Information Gain for compactness_mean: 0.9091291689709926
Information Gain for concavity_mean: 0.9350604299589776
Information Gain for concave_points_mean: 0.9420903069361305
Information Gain for symmetry_mean: 0.735036638169654
Information Gain for fractal_dimension_mean: 0.8361770160635639
Information Gain for radius_se: 0.9337337383910278
Information Gain for texture_se: 0.8642965239721755
Information Gain for perimeter_se: 0.9315454914704012
Information Gain for area_se: 0.925377169845925
Information Gain for smoothness_se: 0.9350604299589776
Information Gain for compactness_se: 0.9231889229252984
Information Gain for concavity_se: 0.9280305529818247
Information Gain for concave_points_se: 0.8585933385629725
Information Gain for symmetry_se: 0.8181371874054084
Information Gain for fractal_dimension_se: 0.9174857375160954
Information Gain for radius_worst: 0.9003074642106167
Information Gain for texture_worst: 0.8634349686194988
Information Gain for perimeter_worst: 0.8985843535052632
Information Gain for area_worst: 0.9350604299589776
Information Gain for smoothness_worst: 0.7197189097252679
Information Gain for compactness_worst: 0.9183472928687721
Information Gain for concavity_worst: 0.9302187999024514
Information Gain for concave_points_worst: 0.9148323543801957
Information Gain for symmetry_worst: 0.8453951399613433
Information Gain for fractal_dimension_worst: 0.8915544765281104
In [36]: y_pred = model.predict(X_test)
y_pred
Out[36]: array([0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0,
                1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0,
                0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0,
                1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1,
                0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0,
                1, 0, 0, 1])
In [38]: # Evaluate the model
accuracy = accuracy_score(y_test, y_pred) * 100
classification_rep = classification_report(y_test, y_pred)
print("Accuracy:", accuracy)
print("Classification Report:\n", classification_rep)
Accuracy: 94.73684210526315
Classification Report:
              precision    recall  f1-score   support
In [45]: df.head(1)
In [44]: new = [[12.5, 19.2, 80.0, 500.0, 0.085, 0.1, 0.05, 0.02, 0.17, 0.06,
                 0.4, 1.0, 2.5, 40.0, 0.006, 0.02, 0.03, 0.01, 0.02, 0.003,
                 16.0, 25.0, 105.0, 900.0, 0.13, 0.25, 0.28, 0.12, 0.29, 0.08]]
y_pred = model.predict(new)
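Under the assumed 0/1 encoding of diagnosis, the numeric prediction for the new sample can be mapped back to a readable label; a minimal sketch (the mapping is an assumption, since the encoding cell is not shown in this extract):

# Hypothetical reverse mapping, assuming diagnosis was encoded as B -> 0, M -> 1
label = {0: 'Benign', 1: 'Malignant'}[int(y_pred[0])]
print("Predicted diagnosis:", label)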