
Assignment2 DMS672

The document outlines an exploratory data analysis (EDA) of the Titanic dataset, focusing on various statistical analyses and visualizations to understand survival rates among passengers. Key findings include significant associations between survival and factors such as sex, passenger class, and embarkation port, with women and first-class passengers having higher survival rates. Visualizations such as histograms, boxplots, and bar plots illustrate the relationships between age, fare, and survival, emphasizing the impact of family size and socio-economic status on survival chances.

Uploaded by

Ansh Jain
Copyright: © All Rights Reserved

In [11]: # Titanic EDA Assignment - Final Version

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import chi2_contingency

# Load the dataset
df = pd.read_csv('titanic_train_.csv') # Update path if needed

# 1. Summary Table
summary = []
for col in df.columns:
    dtype = df[col].dtype
    distinct = df[col].nunique(dropna=False)
    if pd.api.types.is_numeric_dtype(df[col]):
        mean = df[col].mean()
        median = df[col].median()
        std = df[col].std()
        rng = (df[col].min(), df[col].max())
    else:
        mean = median = std = None
        rng = (None, None)
    summary.append({
        'Attribute': col,
        'Type': str(dtype),
        'Distinct': distinct,
        'Mean': mean,
        'Median': median,
        'Std': std,
        'Range': f"{rng[0]} – {rng[1]}"
    })

summary_df = pd.DataFrame(summary)
display(summary_df)
    Attribute    Type     Distinct  Mean        Median    Std         Range
0   PassengerId  int64    891       446.000000  446.0000  257.353842  1 – 891
1   Survived     int64    2         0.383838    0.0000    0.486592    0 – 1
2   Pclass       int64    3         2.308642    3.0000    0.836071    1 – 3
3   Name         object   891       NaN         NaN       NaN         None – None
4   Sex          object   2         NaN         NaN       NaN         None – None
5   Age          float64  89        29.699118   28.0000   14.526497   0.42 – 80.0
6   SibSp        int64    7         0.523008    0.0000    1.102743    0 – 8
7   Parch        int64    7         0.381594    0.0000    0.806057    0 – 6
8   Ticket       object   681       NaN         NaN       NaN         None – None
9   Fare         float64  248       32.204208   14.4542   49.693429   0.0 – 512.3292
10  Cabin        object   148       NaN         NaN       NaN         None – None
11  Embarked     object   4         NaN         NaN       NaN         None – None
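
A note on reading the Distinct column: the summary cell calls `nunique(dropna=False)`, which counts a missing value as one more distinct value. That is the likely reason Embarked shows 4 here while only three ports (C, Q, S) appear in the later survival tables. The sketch below illustrates the difference on a small hypothetical column; `ports` and its values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column standing in for a column such as Embarked.
ports = pd.Series(["S", "C", "Q", "S", np.nan])

# With dropna=False, NaN is counted as its own distinct value.
distinct_with_nan = ports.nunique(dropna=False)

# With the default dropna=True, only non-missing values are counted.
distinct_without_nan = ports.nunique()

# Number of missing entries, useful to report next to the distinct count.
missing = int(ports.isna().sum())

print(distinct_with_nan, distinct_without_nan, missing)
```

Reporting the missing count in its own column would keep the Distinct column from mixing the two ideas.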

In [14]: import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Set plot style
sns.set(style="whitegrid")

# Histogram of Age
plt.figure(figsize=(6, 4))
df['Age'].hist(bins=20, color='skyblue', edgecolor='black')
plt.title('Histogram of Age')
plt.xlabel('Age')
plt.ylabel('Count')
plt.grid(True)
plt.tight_layout()
plt.show()

# Boxplot of Fare
plt.figure(figsize=(6, 4))
df.boxplot(column='Fare', grid=False)
plt.title('Boxplot of Fare')
plt.ylabel('Fare')
plt.tight_layout()
plt.show()

# Violin Plot of Age by Survival


plt.figure(figsize=(6, 4))
sns.violinplot(data=df, x='Survived', y='Age', palette='Set3')
plt.title('Age Distribution by Survival')
plt.xlabel('Survived (0 = No, 1 = Yes)')
plt.ylabel('Age')
plt.tight_layout()
plt.show()

# Scatter Plot: Age vs. Fare


plt.figure(figsize=(6, 4))
plt.scatter(df['Age'], df['Fare'], alpha=0.5, c='green')
plt.title('Age vs. Fare')
plt.xlabel('Age')
plt.ylabel('Fare')
plt.grid(True)
plt.tight_layout()
plt.show()

# QQ Plot of Age
plt.figure(figsize=(6, 4))
stats.probplot(df['Age'].dropna(), dist="norm", plot=plt)
plt.title('QQ Plot of Age')
plt.grid(True)
plt.tight_layout()
plt.show()

# Correlation Heatmap
plt.figure(figsize=(6, 5))
corr = df[['Survived', 'Age', 'Fare', 'Pclass', 'SibSp', 'Parch']].corr()
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap')
plt.tight_layout()
plt.show()

# Pair Plot of Numeric Features Colored by Survival


clean_df = df[['Survived', 'Age', 'Fare', 'Pclass']].dropna()
sns.pairplot(clean_df, hue='Survived', palette='husl')
plt.suptitle('Pair Plot of Numeric Features Colored by Survival', y=1.02)
plt.tight_layout()
plt.show()
<ipython-input-14-c9bafd225947>:31: FutureWarning:

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.violinplot(data=df, x='Survived', y='Age', palette='Set3')


Histogram (Age):
The Age distribution is right-skewed, with most passengers between 20–40
years. Some very young (babies) and very old passengers are present.

Boxplot (Fare):
Shows many outliers, especially in the higher fare range. Median Fare is low,
indicating most passengers paid a small amount.

Violin Plot (Age vs. Survived):
Survivors include a higher proportion of young children and some older adults.
Non-survivors were mostly aged 20–40.

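Scatter Plot (Age vs. Fare):
No clear linear relationship between Age and Fare. Because every point is drawn
in the same color, this plot on its own does not show where survivors fall; the
pair plot, which is colored by Survived, is the one to read for that.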

QQ Plot (Age):
Age does not follow a perfect normal distribution; tails deviate from the normal
line.

Correlation Heatmap:
Positive correlation between Fare and Survived. Negative correlation between
Pclass and Survived (i.e., lower class = lower survival).

Pair Plot:
Survivors tend to be in higher classes (Pclass=1) and paid higher fares.
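
The visual readings above ("right-skewed", "many outliers") can be backed with numbers. The sketch below shows one way to do that with the sample skewness and the 1.5 × IQR rule that boxplot whiskers use by default. It runs on a synthetic log-normal sample, an assumption chosen only because it resembles a fare-like distribution; `fare_like` and its parameters are hypothetical, and on the real data the same calls would be applied to `df['Fare']` or `df['Age'].dropna()`.

```python
import numpy as np
import pandas as pd

# Synthetic, fare-like sample (assumption: log-normal, for illustration only).
rng = np.random.default_rng(0)
fare_like = pd.Series(rng.lognormal(mean=2.7, sigma=1.0, size=500))

# Positive skewness means a long right tail.
skewness = fare_like.skew()

# Points above Q3 + 1.5 * IQR are the ones a default boxplot draws as high outliers.
q1, q3 = fare_like.quantile([0.25, 0.75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr
n_high_outliers = int((fare_like > upper_fence).sum())

print(f"skewness={skewness:.2f}, upper fence={upper_fence:.2f}, "
      f"high outliers={n_high_outliers}")
```

For a right-skewed variable the mean also sits above the median, which matches the Fare row of the summary table (mean 32.20 against a median of 14.45).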

In [18]: import pandas as pd
from scipy.stats import chi2_contingency

# Categorical features to analyze
categorical_features = ['Sex', 'Pclass', 'Embarked', 'SibSp', 'Parch']

# Store chi-square test results
chi_square_summary = []

print("🔍 Chi-Square Test Results:\n")
for feature in categorical_features:
    # Create contingency table
    contingency_table = pd.crosstab(df[feature], df['Survived'])

    # Perform Chi-square test
    chi2_stat, p_val, dof, expected = chi2_contingency(contingency_table)

    # Append results
    chi_square_summary.append({
        'Feature': feature,
        'Chi-Square': round(chi2_stat, 2),
        'p-value': round(p_val, 4),
        'DoF': dof,
        'Significant (< 0.05)': 'Yes' if p_val < 0.05 else 'No'
    })

# Convert results to DataFrame
chi_square_df = pd.DataFrame(chi_square_summary)
display(chi_square_df)

# -------------------------------
# Survival Rate by Each Feature
# -------------------------------
print("\n📊 Survival Rates by Category:\n")
for feature in ['Sex', 'Pclass', 'Embarked']:
    survival_rate = df.groupby(feature)['Survived'].mean().reset_index()
    survival_rate.columns = [feature, 'Survival Rate']
    print(f"\nSurvival Rate by {feature}:")
    display(survival_rate)

🔍 Chi-Square Test Results:

   Feature   Chi-Square  p-value  DoF  Significant (< 0.05)
0  Sex       260.72      0.0000   1    Yes
1  Pclass    102.89      0.0000   2    Yes
2  Embarked  26.49       0.0000   2    Yes
3  SibSp     37.27       0.0000   6    Yes
4  Parch     27.93       0.0001   6    Yes

📊 Survival Rates by Category:

Survival Rate by Sex:
   Sex     Survival Rate
0  female  0.742038
1  male    0.188908

Survival Rate by Pclass:
   Pclass  Survival Rate
0  1       0.629630
1  2       0.472826
2  3       0.242363

Survival Rate by Embarked:
   Embarked  Survival Rate
0  C         0.553571
1  Q         0.389610
2  S         0.336957

Tested the relationship between each categorical variable and Survived using
the chi-square test of independence.

Significant associations (p < 0.05) were found with all five features:

Sex: Strong relation with survival; females survived more.
Pclass: Higher classes had higher survival.
Embarked: Slight effect; passengers from Cherbourg (C) had higher survival.
SibSp: Moderate impact; people traveling with 1–2 family members had
better chances.
Parch: Also significant (p = 0.0001), so the number of parents/children
aboard is associated with survival.

No feature tested failed the 0.05 threshold. SibSp and Parch include sparsely
populated categories, so their chi-square results are best read with some
caution.
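
Words such as "strong", "slight" and "moderate" describe effect size, which a p-value alone does not measure. One common companion statistic is Cramér's V, computed as sqrt(chi2 / (n * min(r - 1, c - 1))). The sketch below applies it to the chi-square values reported in the results table. It is an illustration under stated assumptions rather than part of the original analysis: `cramers_v` is a helper written for this example, the category counts are derived from the reported DoF (each table has two Survived columns), and `N = 891` is taken from the PassengerId count, which slightly overstates the sample size for Embarked because rows with a missing port drop out of the crosstab.

```python
import math


def cramers_v(chi2: float, n: int, n_rows: int, n_cols: int) -> float:
    """Cramér's V effect size for a chi-square statistic on an r x c table."""
    return math.sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))


# (chi-square from the results table, number of feature categories = DoF + 1)
reported = {
    "Sex": (260.72, 2),
    "Pclass": (102.89, 3),
    "Embarked": (26.49, 3),
    "SibSp": (37.27, 7),
    "Parch": (27.93, 7),
}

N = 891  # assumption: all passengers; Embarked actually uses slightly fewer rows

# Survived has two levels, so every table has n_cols = 2.
effect_sizes = {
    name: cramers_v(chi2, N, n_categories, 2)
    for name, (chi2, n_categories) in reported.items()
}

# Print features from largest to smallest effect size.
for name, v in sorted(effect_sizes.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<9} V = {v:.3f}")
```

On this measure Sex stands well apart from the other features, while Parch and Embarked come out close to each other, which fits reading all five associations as significant but of quite different strength.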

In [19]: # Set plot style


sns.set(style="whitegrid")

# 1. Survival Count Plot


plt.figure(figsize=(6, 4))
sns.countplot(x='Survived', data=df, palette='Set2')
plt.title("Survival Distribution")
plt.xticks([0, 1], ['Did Not Survive', 'Survived'])
plt.show()

# 2. Survival by Sex
plt.figure(figsize=(6, 4))
sns.barplot(x='Sex', y='Survived', data=df, palette='pastel')
plt.title("Survival Rate by Sex")
plt.ylabel("Survival Rate")
plt.show()

# 3. Survival by Passenger Class


plt.figure(figsize=(6, 4))
sns.barplot(x='Pclass', y='Survived', data=df, palette='muted')
plt.title("Survival Rate by Passenger Class")
plt.ylabel("Survival Rate")
plt.show()

# 4. Survival by Embarked Port


plt.figure(figsize=(6, 4))
sns.barplot(x='Embarked', y='Survived', data=df, palette='coolwarm')
plt.title("Survival Rate by Embarked Port")
plt.ylabel("Survival Rate")
plt.show()

# 5. Age Distribution: Survived vs Not


plt.figure(figsize=(8, 5))
sns.histplot(data=df, x='Age', hue='Survived', bins=30, kde=True, palette='h
plt.title("Age Distribution by Survival")
plt.xlabel("Age")
plt.show()

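# 6. Combine Sex & Class
# Note: sns.catplot is figure-level and creates its own figure, so the
# plt.figure() call below only produces the empty
# "<Figure size 800x600 with 0 Axes>" seen in the output.
plt.figure(figsize=(8, 6))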
sns.catplot(x="Pclass", hue="Survived", col="Sex", data=df, kind="count", pa
plt.subplots_adjust(top=0.85)
plt.suptitle("Survival by Class & Gender")
plt.show()

# 7. Survival by Family (SibSp + Parch)


df['FamilySize'] = df['SibSp'] + df['Parch']
plt.figure(figsize=(6, 4))
sns.barplot(x='FamilySize', y='Survived', data=df, palette='flare')
plt.title("Survival Rate by Family Size")
plt.ylabel("Survival Rate")
plt.show()

<ipython-input-19-cf2e0f97e5be>:6: FutureWarning:

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.countplot(x='Survived', data=df, palette='Set2')

<ipython-input-19-cf2e0f97e5be>:13: FutureWarning:

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(x='Sex', y='Survived', data=df, palette='pastel')

<ipython-input-19-cf2e0f97e5be>:20: FutureWarning:

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(x='Pclass', y='Survived', data=df, palette='muted')

<ipython-input-19-cf2e0f97e5be>:27: FutureWarning:

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(x='Embarked', y='Survived', data=df, palette='coolwarm')

<Figure size 800x600 with 0 Axes>

<ipython-input-19-cf2e0f97e5be>:49: FutureWarning:

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(x='FamilySize', y='Survived', data=df, palette='flare')

Women had a survival rate of ~74%, compared to ~19% for men.

1st class passengers had a ~63% survival rate, much higher than 3rd class
(~24%).

Children (age < 10) appear to have had higher survival rates, consistent with
the "women and children first" approach.

Passengers who paid higher fares were more likely to survive.

Passengers who embarked at Cherbourg (C) had better outcomes (~55% survival).

Passengers with small families (1–2 relatives aboard) showed higher survival
rates than those traveling alone or in large groups.

Overall, being female, young, in a higher class, and having a few family
members aboard were associated with better chances of survival.
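
The statement about children under 10 is read off the age histogram and is not computed directly in any cell. A minimal sketch of how it could be checked with a grouped mean follows. The `toy` DataFrame and the `AgeGroup` labels are hypothetical and exist only to make the example runnable; on the real data the same steps would be applied to `df`.

```python
import pandas as pd

# Hypothetical toy rows standing in for the Age and Survived columns.
toy = pd.DataFrame({
    "Age": [2.0, 8.0, 25.0, 31.0, 40.0, 9.0, 60.0, None],
    "Survived": [1, 1, 0, 0, 1, 0, 0, 1],
})

# Rows with unknown age cannot be assigned to a group, so drop them first.
known_age = toy.dropna(subset=["Age"]).copy()

# Label each passenger as a child (under 10) or not.
known_age["AgeGroup"] = known_age["Age"].lt(10).map(
    {True: "child (<10)", False: "10 and over"}
)

# The mean of a 0/1 column is the survival rate within each group.
rate_by_group = known_age.groupby("AgeGroup")["Survived"].mean()
print(rate_by_group)
```

Dropping rows with a missing age first matters here, because a missing value compared with 10 evaluates to False and would otherwise be counted in the "10 and over" group.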

This notebook was converted with convert.ploomber.io
