
Lab 3

This document uses a Naive Bayes classifier model to predict survival on the Titanic using passenger data. It loads and cleans a dataset, explores the data, encodes categorical variables, splits data into training and test sets, and fits a Gaussian Naive Bayes model to make predictions.

Uploaded by

alishacalista238

titanic-naive-bayes-1

April 21, 2024

[ ]: import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

[ ]: df = pd.read_csv('/content/train.csv')
df

[ ]: PassengerId Survived Pclass \


0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3
.. … … …
886 887 0 2
887 888 1 1
888 889 0 3
889 890 1 1
890 891 0 3

Name Sex Age SibSp \


0 Braund, Mr. Owen Harris male 22.0 1
1 Cumings, Mrs. John Bradley (Florence Briggs Th… female 38.0 1
2 Heikkinen, Miss. Laina female 26.0 0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1
4 Allen, Mr. William Henry male 35.0 0
.. … … … …
886 Montvila, Rev. Juozas male 27.0 0
887 Graham, Miss. Margaret Edith female 19.0 0
888 Johnston, Miss. Catherine Helen "Carrie" female NaN 1
889 Behr, Mr. Karl Howell male 26.0 0
890 Dooley, Mr. Patrick male 32.0 0

Parch Ticket Fare Cabin Embarked

0 0 A/5 21171 7.2500 NaN S
1 0 PC 17599 71.2833 C85 C
2 0 STON/O2. 3101282 7.9250 NaN S
3 0 113803 53.1000 C123 S
4 0 373450 8.0500 NaN S
.. … … … … …
886 0 211536 13.0000 NaN S
887 0 112053 30.0000 B42 S
888 2 W./C. 6607 23.4500 NaN S
889 0 111369 30.0000 C148 C
890 0 370376 7.7500 NaN Q

[891 rows x 12 columns]

[ ]: df.shape

[ ]: (891, 12)

[ ]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

[ ]: df.describe(include='all')

[ ]: PassengerId Survived Pclass Name Sex \


count 891.000000 891.000000 891.000000 891 891
unique NaN NaN NaN 891 2
top NaN NaN NaN Braund, Mr. Owen Harris male
freq NaN NaN NaN 1 577

mean 446.000000 0.383838 2.308642 NaN NaN
std 257.353842 0.486592 0.836071 NaN NaN
min 1.000000 0.000000 1.000000 NaN NaN
25% 223.500000 0.000000 2.000000 NaN NaN
50% 446.000000 0.000000 3.000000 NaN NaN
75% 668.500000 1.000000 3.000000 NaN NaN
max 891.000000 1.000000 3.000000 NaN NaN

Age SibSp Parch Ticket Fare Cabin \


count 714.000000 891.000000 891.000000 891 891.000000 204
unique NaN NaN NaN 681 NaN 147
top NaN NaN NaN 347082 NaN B96 B98
freq NaN NaN NaN 7 NaN 4
mean 29.699118 0.523008 0.381594 NaN 32.204208 NaN
std 14.526497 1.102743 0.806057 NaN 49.693429 NaN
min 0.420000 0.000000 0.000000 NaN 0.000000 NaN
25% 20.125000 0.000000 0.000000 NaN 7.910400 NaN
50% 28.000000 0.000000 0.000000 NaN 14.454200 NaN
75% 38.000000 1.000000 0.000000 NaN 31.000000 NaN
max 80.000000 8.000000 6.000000 NaN 512.329200 NaN

Embarked
count 889
unique 3
top S
freq 644
mean NaN
std NaN
min NaN
25% NaN
50% NaN
75% NaN
max NaN

[ ]: df.dtypes

[ ]: PassengerId      int64
     Survived         int64
     Pclass           int64
     Name            object
     Sex             object
     Age            float64
     SibSp            int64
     Parch            int64
     Ticket          object
     Fare           float64
     Cabin           object
     Embarked        object
     dtype: object

[ ]: df.duplicated().sum()

[ ]: 0

[ ]: df.drop(["Sex","Name","Ticket","Cabin"], axis=1, inplace=True)


df

[ ]: PassengerId Survived Pclass Age SibSp Parch Fare Embarked


0 1 0 3 22.0 1 0 7.2500 S
1 2 1 1 38.0 1 0 71.2833 C
2 3 1 3 26.0 0 0 7.9250 S
3 4 1 1 35.0 1 0 53.1000 S
4 5 0 3 35.0 0 0 8.0500 S
.. … … … … … … … …
886 887 0 2 27.0 0 0 13.0000 S
887 888 1 1 19.0 0 0 30.0000 S
888 889 0 3 NaN 1 2 23.4500 S
889 890 1 1 26.0 0 0 30.0000 C
890 891 0 3 32.0 0 0 7.7500 Q

[891 rows x 8 columns]

[ ]: df.isnull().sum()

[ ]: PassengerId 0
Survived 0
Pclass 0
Age 177
SibSp 0
Parch 0
Fare 0
Embarked 2
dtype: int64

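[ ]: # Assign the result back: calling fillna(inplace=True) on a column selection
# raises a FutureWarning in recent pandas and may not modify df at all.
df['Age'] = df['Age'].fillna(df['Age'].mean())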

[ ]: df.dropna(inplace=True)

[ ]: df.isnull().sum()

[ ]: PassengerId    0
     Survived       0
     Pclass         0
     Age            0
     SibSp          0
     Parch          0
     Fare           0
     Embarked       0
     dtype: int64
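
The two cleaning steps above (fill missing ages with the mean, then drop the rows that still contain a missing value) can be sketched on a small stand-in frame. The `toy` DataFrame and its values are made up for illustration and are not taken from train.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Titanic columns; values are illustrative only.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Embarked": ["S", "C", None, "S"],
})

# Fill missing ages with the column mean and assign the result back.
toy["Age"] = toy["Age"].fillna(toy["Age"].mean())

# Drop any row that still has a missing value (here, the missing Embarked).
toy = toy.dropna()

print(toy.isnull().sum())
print(toy)
```

Filling `Age` first matters: calling `dropna()` before the fill would also discard every row with a missing age.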

[ ]: Q1=np.percentile(df['Age'], 25)
Q2=np.percentile(df['Age'], 50)
Q3=np.percentile(df['Age'], 75)

[ ]: IQR=Q3-Q1

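[ ]: # Tukey fences; named lower_bound/upper_bound so the built-in min() and max() are not shadowed
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

[ ]: # Clip ages outside the fences to the nearest fence value
df['Age'] = df['Age'].clip(lower_bound, upper_bound)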
sns.boxplot(x=df['Age'])
plt.show()
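
The outlier step can be sketched as a single function. This is a minimal illustration on made-up ages; `clip_iqr` is a hypothetical helper, not part of the notebook or of any library.

```python
import numpy as np

def clip_iqr(values, k=1.5):
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower_bound = q1 - k * iqr
    upper_bound = q3 + k * iqr
    return np.clip(values, lower_bound, upper_bound), lower_bound, upper_bound

# Illustrative ages: Q1 = 22, Q3 = 33, IQR = 11, so the fences are 5.5 and 49.5.
ages = [2.0, 20.0, 24.0, 28.0, 30.0, 36.0, 80.0]
clipped, lower_bound, upper_bound = clip_iqr(ages)

print(lower_bound, upper_bound)  # 5.5 49.5
print(clipped)                   # 2.0 becomes 5.5 and 80.0 becomes 49.5
```

Clipping keeps every row and only moves extreme values to the fence, which is why the row count stays the same after this step.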

[ ]: df=pd.get_dummies(df,columns=['Embarked'],dtype=int)
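
As a small illustration of what the encoding call produces, the sketch below one-hot encodes a three-row frame. The frame is a stand-in; the `drop_first=True` variant is an optional alternative that the notebook does not use.

```python
import pandas as pd

# Three illustrative rows, one per port of embarkation.
toy = pd.DataFrame({"Fare": [7.25, 71.2833, 7.75], "Embarked": ["S", "C", "Q"]})

# Same call as in the notebook: one 0/1 column per category.
encoded = pd.get_dummies(toy, columns=["Embarked"], dtype=int)
print(list(encoded.columns))  # ['Fare', 'Embarked_C', 'Embarked_Q', 'Embarked_S']

# Optional variant: drop the first level, since it is implied by the others.
reduced = pd.get_dummies(toy, columns=["Embarked"], dtype=int, drop_first=True)
print(list(reduced.columns))  # ['Fare', 'Embarked_Q', 'Embarked_S']
```

With all three dummy columns kept, each one is fully determined by the other two. Naive Bayes assumes the features are independent given the class, so dropping one level may be worth trying.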

[ ]: x=df.drop(columns=['Survived'])
x

[ ]: PassengerId Pclass Age SibSp Parch Fare Embarked_C \


0 1 3 22.000000 1 0 7.2500 0
1 2 1 38.000000 1 0 71.2833 1
2 3 3 26.000000 0 0 7.9250 0
3 4 1 35.000000 1 0 53.1000 0
4 5 3 35.000000 0 0 8.0500 0
.. … … … … … … …
886 887 2 27.000000 0 0 13.0000 0
887 888 1 19.000000 0 0 30.0000 0
888 889 3 29.699118 1 2 23.4500 0
889 890 1 26.000000 0 0 30.0000 1
890 891 3 32.000000 0 0 7.7500 0

Embarked_Q Embarked_S
0 0 1
1 0 0
2 0 1
3 0 1
4 0 1
.. … …
886 0 1
887 0 1
888 0 1
889 0 0
890 1 0

[889 rows x 9 columns]

[ ]: y=df['Survived']
y

[ ]: 0 0
1 1
2 1
3 1
4 0
..
886 0
887 1
888 0
889 1
890 0
Name: Survived, Length: 889, dtype: int64

[ ]: x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

[ ]: print("Training set shape:", x_train.shape, y_train.shape)

Training set shape: (711, 9) (711,)

[ ]: print("Testing set shape:", x_test.shape, y_test.shape)

Testing set shape: (178, 9) (178,)
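
The split above is purely random. As an optional variation that the notebook does not use, `train_test_split` accepts a `stratify` argument that keeps the class ratio the same in both parts. The sketch below uses synthetic data (`X_toy`, `y_toy` are made up) to show the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 3 features, 60 negatives and 40 positives.
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(100, 3))
y_toy = np.array([0] * 60 + [1] * 40)

# stratify=y_toy preserves the 60/40 class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(X_tr.shape, X_te.shape)
print("positives in train:", y_tr.sum(), "positives in test:", y_te.sum())
```

On a small test set, a stratified split makes the reported accuracy less sensitive to how many survivors happen to land in the test portion.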

[ ]: model=GaussianNB()

[ ]: model.fit(x_train,y_train)

[ ]: GaussianNB()
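
To make clearer what fitting a Gaussian Naive Bayes model involves, here is a minimal NumPy sketch of the idea: estimate a prior, a per-feature mean, and a per-feature variance for each class, then predict the class with the highest log posterior. It is an illustration under simplifying assumptions, not scikit-learn's implementation. The function names and toy data are hypothetical, and the small variance floor only mirrors the idea behind `GaussianNB`'s `var_smoothing` parameter.

```python
import numpy as np

def fit_gaussian_nb(X, y, var_smoothing=1e-9):
    """Estimate class priors and per-class feature means/variances (sketch)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # Small floor added to every variance to avoid division by zero.
    eps = var_smoothing * X.var(axis=0).max()
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + eps
    return classes, priors, means, variances

def predict_gaussian_nb(params, X):
    """Pick the class maximizing log P(c) + sum_j log N(x_j | mean_cj, var_cj)."""
    classes, priors, means, variances = params
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - means[None, :, :]          # (n_samples, n_classes, n_features)
    log_lik = -0.5 * (np.log(2 * np.pi * variances)[None, :, :]
                      + diff ** 2 / variances[None, :, :]).sum(axis=2)
    scores = np.log(priors)[None, :] + log_lik        # (n_samples, n_classes)
    return classes[np.argmax(scores, axis=1)]

# Hypothetical, well-separated toy data: two features, two classes.
X_toy = [[1.0, 20.0], [2.0, 22.0], [1.5, 21.0], [8.0, 40.0], [9.0, 42.0], [8.5, 41.0]]
y_toy = [0, 0, 0, 1, 1, 1]

params = fit_gaussian_nb(X_toy, y_toy)
pred = predict_gaussian_nb(params, [[1.2, 20.5], [8.8, 41.5]])
print(pred)  # [0 1]
```

Because every feature is modelled as a normal distribution within each class, identifier-like columns such as `PassengerId` and 0/1 dummy columns fit this assumption poorly, which may limit the accuracy of the model.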

[ ]: y_pred=model.predict(x_test)
y_pred

[ ]: array([0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1,
0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0,
0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0,
0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0,
0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0,
0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0,
1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0,
0, 0])

[ ]: from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, y_pred)


print("Accuracy:", accuracy)

Accuracy: 0.6348314606741573
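
An accuracy figure is easier to judge next to a baseline. The `describe` output above shows a mean of 0.383838 for `Survived`, so predicting "did not survive" for every passenger would already be right for about 62% of the full dataset. The class balance of the test split is not shown, so this is only a rough comparison. The sketch below computes accuracy, a majority-class baseline, and a confusion matrix on made-up labels; `evaluate` is a hypothetical helper.

```python
import numpy as np

def evaluate(y_true, y_pred):
    """Return accuracy, majority-class baseline, and a 2x2 confusion matrix (sketch)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    accuracy = float(np.mean(y_true == y_pred))
    # Baseline: always predict the most frequent class in y_true.
    _, counts = np.unique(y_true, return_counts=True)
    baseline = float(counts.max() / counts.sum())
    # Rows are true labels (0, 1); columns are predicted labels (0, 1).
    cm = np.zeros((2, 2), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return accuracy, baseline, cm

# Illustrative labels only; not the notebook's y_test or y_pred.
y_true = [0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

accuracy, baseline, cm = evaluate(y_true, y_pred)
print(accuracy, baseline)  # 0.5 0.625
print(cm)                  # [[3 2]
                           #  [2 1]]
```

In the notebook, the same comparison could be made by passing `y_test` and `y_pred` to this helper, or by using `confusion_matrix` from `sklearn.metrics`.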
