Data Cleaning and Exploratory Analysis On A Public Dataset

The document is a Python script that performs data analysis on the Titanic dataset using pandas, matplotlib, and seaborn. It includes steps for data cleaning, handling missing values, correcting data types, and exploratory data analysis (EDA) through visualizations such as histograms, count plots, and bar plots. The analysis focuses on survival rates by passenger class, sex, and age distribution.

Uploaded by

Shaikh Firdous

#import the necessary libraries

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#set the style of the plot


sns.set(style = "whitegrid")

#load the dataframe


titanic_data = pd.read_csv('titanic.csv')

#data cleaning
#1.Inspecting the data

#displaying first few rows of dataset


print(titanic_data.head())

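#displaying the summary
#info is a method and prints the summary itself; print(titanic_data.info)
#only shows the bound-method repr, as in the recorded output below


titanic_data.info()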

#checking for null values


print(titanic_data.isnull().sum())

   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare  Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500         S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833         C
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250         S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000         S
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500         S

<bound method DataFrame.info of
     PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare  Embarked
  0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500         S
  1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833         C
  2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250         S
  3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000         S
  4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500         S
..           ...       ...     ...                                                ...     ...   ...    ...    ...               ...      ...       ...
886          887         0       2                              Montvila, Rev. Juozas    male  27.0      0      0            211536  13.0000         S
887          888         1       1                       Graham, Miss. Margaret Edith  female  19.0      0      0            112053  30.0000         S
888          889         0       3           Johnston, Miss. Catherine Helen "Carrie"  female  28.0      1      2        W./C. 6607  23.4500         S
889          890         1       1                              Behr, Mr. Karl Howell    male  26.0      0      0            111369  30.0000         C
890          891         0       3                                Dooley, Mr. Patrick    male  32.0      0      0            370376   7.7500         Q

[891 rows x 11 columns]>


PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 0
SibSp 0
Parch 0
Ticket 0
Fare 0
Embarked 0
dtype: int64

#2.handling missing values


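#assign the result back to the column; calling fillna(inplace=True) on a
#selected column is deprecated chained assignment in recent pandas versions
titanic_data['Age'] = titanic_data['Age'].fillna(titanic_data['Age'].median())

titanic_data['Embarked'] = titanic_data['Embarked'].fillna(titanic_data['Embarked'].mode()[0])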

# titanic_data.drop(columns = ['Cabin'],inplace = True)

titanic_data.dropna(subset=['Fare'],inplace = True)
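
The imputation steps above can be checked on a small frame. The following is a minimal sketch, not the author's code: the `toy` frame and its values are made up for illustration and stand in for `titanic.csv`. It fills the numeric column with its median, the text column with its most frequent value, and drops rows that have no fare.

```python
import pandas as pd

# Hypothetical toy frame standing in for titanic.csv (all values are made up).
toy = pd.DataFrame({
    "Age": [22.0, None, 38.0, 26.0],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, None, 8.05],
})

# Median for the numeric column, most frequent value for the text column.
toy["Age"] = toy["Age"].fillna(toy["Age"].median())
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

# Rows without a fare are dropped rather than imputed.
toy = toy.dropna(subset=["Fare"])

# Total number of missing cells left in the frame.
remaining_nulls = int(toy.isnull().sum().sum())
print(remaining_nulls)
```

Re-running `isnull().sum()` after cleaning, as in the inspection step, is a quick way to confirm that nothing was missed.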

#3.Correcting DataTypes
#converting 'Survived' and 'Pclass' to categorical
titanic_data['Survived'] = titanic_data['Survived'].astype('category')
titanic_data['Pclass'] = titanic_data['Pclass'].astype('category')
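
As a rough illustration of what the conversion does, the sketch below uses a made-up `pclass` Series (a hypothetical stand-in for the real column). A category column stores its distinct values as labels, and casting back to `int`, as the script does later before plotting survival rates, restores a numeric column that can be averaged.

```python
import pandas as pd

# Hypothetical stand-in for the Pclass column (values are made up).
pclass = pd.Series([3, 1, 3, 2, 1], name="Pclass")

# Convert to a categorical dtype; the distinct values become the categories.
as_category = pclass.astype("category")
print(as_category.dtype)
print(list(as_category.cat.categories))

# Casting back to int restores a plain numeric column.
back_to_int = as_category.astype(int)
print(back_to_int.tolist())
```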

#4.Exploratory Data Analysis (EDA)

#a.Distribution of Numerical Features:


titanic_data.hist(bins = 20,figsize=(14,10))
plt.show()

#b.Count Plot for Categorical Features


#draw both count plots side by side in one figure
plt.figure(figsize=(14,6))
plt.subplot(1,2,1)
sns.countplot(data = titanic_data,x='Survived')

plt.subplot(1,2,2)
sns.countplot(data = titanic_data,x='Pclass')
plt.show()

#c.Survival Rate by Passenger Class
titanic_data['Survived'] = titanic_data['Survived'].astype(int)
titanic_data['Pclass'] = titanic_data['Pclass'].astype(int)

plt.figure(figsize=(8, 6))
sns.barplot(data=titanic_data, x='Pclass', y='Survived')
plt.title('Survival Rate by Passenger Class')
plt.show()

#d.Survival Rate by Sex


plt.figure(figsize=(8,6))
sns.barplot(data = titanic_data , x = 'Sex',y = 'Survived')
plt.title('Survival Rate by Sex')
plt.show()
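
By default `sns.barplot` draws the mean of `y` for each `x` group, so with 0/1 values in `Survived` the bar heights in the two plots above are survival rates. The same numbers can be read off directly with a group-by. This is a minimal sketch on a made-up `toy` frame, not results from the real dataset.

```python
import pandas as pd

# Hypothetical toy rows (made up); the real script would use titanic_data.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Sex":      ["female", "male", "female", "male",
                 "female", "male", "male", "male"],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# With 0/1 values, the group mean of Survived is that group's survival rate.
rate_by_class = toy.groupby("Pclass")["Survived"].mean()
rate_by_sex = toy.groupby("Sex")["Survived"].mean()

print(rate_by_class)
print(rate_by_sex)
```

Printing the table alongside the plot makes it easier to quote exact rates instead of estimating them from bar heights.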

#e.Age distribution by survival


plt.figure(figsize=(8,6))
sns.histplot(data=titanic_data,x='Age',hue='Survived',multiple = 'stack',bins = 30)
plt.title('Age Distribution by Survival')
plt.show()

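#f.Correlation Matrix
#numeric_only=True skips text columns such as Name and Sex, which corr() cannot handle
plt.figure(figsize=(8,4))
sns.heatmap(titanic_data.corr(numeric_only=True),annot = True,cmap='coolwarm',fmt = '.2f')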
plt.title('Correlation Matrix')
plt.show()
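
A correlation matrix is only defined for numeric columns. One way to make that explicit is to select them before calling `corr()`, which computes Pearson correlation by default. The frame below is a made-up stand-in for illustration, not data from the real file.

```python
import pandas as pd

# Hypothetical toy frame (values are made up) with one text column.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1],
    "Fare": [7.25, 71.28, 7.92, 8.05, 53.10],
    "Sex": ["male", "female", "female", "male", "female"],
})

# Keep only numeric columns, then compute pairwise Pearson correlations.
numeric = toy.select_dtypes(include="number")
corr = numeric.corr()
print(corr.round(2))
```

The resulting frame is symmetric with ones on the diagonal, which is the shape `sns.heatmap` expects for an annotated correlation plot.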
