Binarization

The document outlines a process for analyzing a dataset using a Decision Tree Classifier with Python's sklearn library. It includes data preprocessing steps such as handling missing values, creating a new feature for family size, and applying binarization. The model's accuracy is evaluated using cross-validation and accuracy scores before and after applying transformations.

Uploaded by Rudraksh Amar
Copyright © All Rights Reserved

In [28]:

import numpy as np
import pandas as pd

In [29]:
from sklearn.model_selection import train_test_split,cross_val_score
from sklearn.tree import DecisionTreeClassifier

from sklearn.metrics import accuracy_score

from sklearn.compose import ColumnTransformer

In [30]:
df = pd.read_csv('train.csv')[['Age','Fare','SibSp','Parch','Survived']]

In [31]:
df.dropna(inplace=True)

In [32]:
df.head()

Out[32]: Age Fare SibSp Parch Survived

0 22.0 7.2500 1 0 0

1 38.0 71.2833 1 0 1

2 26.0 7.9250 0 0 1

3 35.0 53.1000 1 0 1

4 35.0 8.0500 0 0 0

In [33]:
df['family'] = df['SibSp'] + df['Parch']

In [34]:
df.head()

Out[34]: Age Fare SibSp Parch Survived family

0 22.0 7.2500 1 0 0 1

1 38.0 71.2833 1 0 1 1

2 26.0 7.9250 0 0 1 0

3 35.0 53.1000 1 0 1 1

4 35.0 8.0500 0 0 0 0

In [35]:
df.drop(columns=['SibSp','Parch'],inplace=True)

In [36]:
df.head()

Out[36]: Age Fare Survived family

0 22.0 7.2500 0 1

1 38.0 71.2833 1 1

2 26.0 7.9250 1 0

3 35.0 53.1000 1 1

4 35.0 8.0500 0 0

In [37]:
X = df.drop(columns=['Survived'])
y = df['Survived']

In [38]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_stat

In [39]:
X_train.head()

Out[39]: Age Fare family

328 31.0 20.5250 2

73 26.0 14.4542 1

253 30.0 16.1000 1

719 33.0 7.7750 0

666 25.0 13.0000 0

In [40]:
# Without binarization

clf = DecisionTreeClassifier()

clf.fit(X_train,y_train)

y_pred = clf.predict(X_test)

accuracy_score(y_test,y_pred)

Out[40]: 0.6293706293706294

In [41]:
np.mean(cross_val_score(DecisionTreeClassifier(),X,y,cv=10,scoring='accuracy'))

Out[41]: 0.6429381846635367

In [20]:
# Applying Binarization

from sklearn.preprocessing import Binarizer


In [42]:
trf = ColumnTransformer([
('bin',Binarizer(copy=False),['family'])
],remainder='passthrough')
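To make the effect of the transformer above concrete, here is a minimal sketch of what Binarizer does to a single column. The sample values are made up for illustration and are not taken from train.csv; the threshold of 0.0 is sklearn's default, so any family size above zero becomes 1 (travelling with family) and zero stays 0 (travelling alone).

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical family sizes (not from the dataset), one column, float dtype
family = np.array([[0.0], [1.0], [2.0], [5.0]])

# Values strictly greater than the threshold map to 1, everything else to 0
binarizer = Binarizer(threshold=0.0)
with_family = binarizer.fit_transform(family)

print(with_family.ravel())
```

Because the mapping depends only on the fixed threshold, fit learns nothing from the data; it is called here only to follow the usual transformer interface.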

In [43]:
X_train_trf = trf.fit_transform(X_train)
X_test_trf = trf.transform(X_test)

In [44]:
pd.DataFrame(X_train_trf,columns=['family','Age','Fare'])

Out[44]: family Age Fare

0 1.0 31.0 20.5250

1 1.0 26.0 14.4542

2 1.0 30.0 16.1000

3 0.0 33.0 7.7750

4 0.0 25.0 13.0000

... ... ... ...

566 1.0 46.0 61.1750

567 0.0 25.0 13.0000

568 0.0 41.0 134.5000

569 1.0 33.0 20.5250

570 0.0 33.0 7.8958

571 rows × 3 columns

In [45]:
clf = DecisionTreeClassifier()
clf.fit(X_train_trf,y_train)
y_pred2 = clf.predict(X_test_trf)

accuracy_score(y_test,y_pred2)

Out[45]: 0.6363636363636364

In [46]:
X_trf = trf.fit_transform(X)
np.mean(cross_val_score(DecisionTreeClassifier(),X_trf,y,cv=10,scoring='accuracy'))

Out[46]: 0.6304186228482003
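The cross-validation step above can also be sketched with the transformer and the classifier wrapped in a single pipeline, so the preprocessing is applied inside each fold. This is an illustrative alternative, not the notebook's own method: the data below is randomly generated stand-in data with the same column names (X_demo, y_demo and pipe are hypothetical names), and random_state=0 is an assumption added to make the run repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Binarizer
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): same columns as the notebook, random values
rng = np.random.default_rng(0)
n = 200
X_demo = pd.DataFrame({
    'Age': rng.uniform(1, 80, n),
    'Fare': rng.uniform(5, 150, n),
    'family': rng.integers(0, 6, n),
})
y_demo = pd.Series(rng.integers(0, 2, n), name='Survived')

# Binarize 'family', pass the other columns through, then fit the tree
pipe = make_pipeline(
    ColumnTransformer([('bin', Binarizer(), ['family'])], remainder='passthrough'),
    DecisionTreeClassifier(random_state=0),
)

# 10-fold cross-validated accuracy of the whole pipeline
scores = cross_val_score(pipe, X_demo, y_demo, cv=10, scoring='accuracy')
print(scores.mean())
```

Since the labels here are random, the printed accuracy says nothing about the Titanic data. Fixing random_state on the tree matters for the comparison, though: an unseeded DecisionTreeClassifier can give slightly different scores from run to run, which makes small differences between the binarized and non-binarized results hard to interpret.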

In [ ]:
