Homework2

The document outlines a data analysis project on the Titanic dataset, focusing on data preprocessing, feature engineering, statistical testing, and modeling. Key analyses include handling missing data, creating new features, performing outlier detection, and conducting hypothesis tests to explore survival rates based on various factors. The results indicate significant differences in survival rates across passenger classes and a positive correlation between fare and survival chances.

Submission by Kanit Mann

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from scipy import stats
from scipy.stats import kruskal
from scipy.stats import chi2_contingency

df = pd.read_csv('Titanic-Dataset.csv')  # read_csv already returns a DataFrame
print(df.head(5))

   PassengerId  Survived  Pclass                                               Name  \
0            1         0       3                            Braund, Mr. Owen Harris
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...
2            3         1       3                             Heikkinen, Miss. Laina
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)
4            5         0       3                           Allen, Mr. William Henry

      Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1  female  38.0      1      0          PC 17599  71.2833   C85        C
2  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3  female  35.0      1      0            113803  53.1000  C123        S
4    male  35.0      0      0            373450   8.0500   NaN        S
Part 1: Advanced Data Preprocessing (6 Marks)
1. Handling Complex Missing Data (2 Marks)
(a) Instead of simple mean or median imputation, use a predictive model (e.g., KNN or
Regression) to fill missing values in the Age column.

(b) Compare the distribution of imputed values with the original Age column. Do they look
realistic?
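Before imputing, it is worth checking where values are missing. A quick supplementary look (not part of the original brief):

# Count missing values per column; on the standard Kaggle Titanic file,
# Age, Cabin and Embarked are the columns with gaps
print(df.isna().sum())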

# a) Performing KNN imputation: neighbours are found on the scaled
# Pclass, Fare, SibSp and Parch columns, and Age is imputed from them

features = ['Pclass', 'Fare', 'SibSp', 'Parch']
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(df[features]), columns=features)
X_scaled['Age'] = df['Age']

imputer = KNNImputer(n_neighbors=5)
df['Age_KNN'] = imputer.fit_transform(X_scaled)[:, -1]

# b) Plotting the original and KNN imputed age distributions


plt.figure(figsize=(12,6))
plt.subplot(1,2,1)
sns.histplot(df['Age'].dropna(), kde=True)
plt.title('Original Age Distribution')
plt.subplot(1,2,2)
sns.histplot(df['Age_KNN'], kde=True)
plt.title('KNN Imputed Age Distribution')
plt.tight_layout()
plt.show()
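To go beyond the visual check, a small supplementary comparison (an addition, not required by the brief) of summary statistics, plus a two-sample Kolmogorov-Smirnov test between the observed ages and the imputed column:

from scipy.stats import ks_2samp

# Compare summary statistics of observed vs. imputed Age column
print(df[['Age', 'Age_KNN']].describe())

# Two-sample KS test: a large p-value suggests the imputed column's
# distribution is consistent with the observed one
stat, p = ks_2samp(df['Age'].dropna(), df['Age_KNN'])
print(f"KS statistic: {stat:.4f}, p-value: {p:.4f}")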
2. Feature Engineering with Binning (2 Marks)
(a) Create a new categorical feature ‘Age Group’ by binning Age into the following groups:

• 0-12: Child
• 13-19: Teen
• 20-35: Young Adult
• 36-60: Middle Aged
• 60+: Senior

(b) Calculate the survival rate for each Age Group and analyze whether age impacts survival
probability.

# (a) Create Age Groups


def age_category(age):
    if age <= 12: return 'Child'
    elif age <= 19: return 'Teen'
    elif age <= 35: return 'Young Adult'
    elif age <= 60: return 'Middle Aged'
    else: return 'Senior'

df['Age_Group'] = df['Age_KNN'].apply(age_category)
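An equivalent, more idiomatic construction (a sketch of an alternative, not the submission's method; the column name Age_Group_cut is hypothetical) uses pd.cut with explicit bin edges:

# Same grouping via pd.cut; bins are right-inclusive, so (0, 12] is
# Child, (12, 19] is Teen, etc., matching the <= comparisons above
bins = [0, 12, 19, 35, 60, np.inf]
labels = ['Child', 'Teen', 'Young Adult', 'Middle Aged', 'Senior']
df['Age_Group_cut'] = pd.cut(df['Age_KNN'], bins=bins, labels=labels)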

# (b) Calculate survival rates by age group


survival_rates = df.groupby('Age_Group')['Survived'].mean()
print("\nSurvival Rates by Age Group:")
print(survival_rates)

# Plotting the survival rates for better visualization


plt.figure(figsize=(8,6))
sns.barplot(x=survival_rates.index, y=survival_rates.values)
plt.title('Survival Rates by Age Group')
plt.ylabel('Survival Rate')
plt.show()

Survival Rates by Age Group:

Age_Group
Child          0.579710
Middle Aged    0.400000
Senior         0.227273
Teen           0.410526
Young Adult    0.352941
Name: Survived, dtype: float64

Children show the highest survival rate (0.58) and seniors the lowest (0.23), so age does affect survival probability, in line with a "women and children first" evacuation priority.
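Note that groupby returns the categories in alphabetical order; for a bar chart that reads in life-stage order, a small reindex (a cosmetic addition, not in the original submission) can be applied before plotting:

# Reorder the groups so the bars follow the natural age progression
order = ['Child', 'Teen', 'Young Adult', 'Middle Aged', 'Senior']
print(survival_rates.reindex(order))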
3. Encoding and Interaction Features (2 Marks)
(a) Convert the categorical ‘Cabin’ column into a new feature called ‘Cabin Indicator’ (1 if the
passenger has a cabin, 0 if missing).

(b) Create a new interaction feature between Pclass and Fare by multiplying them together.
Does this new feature have a stronger correlation with survival?

# (a) Cabin Indicator


df['Cabin_Indicator'] = df['Cabin'].notna().astype(int)

# (b) Interaction feature


df['Pclass_Fare'] = df['Pclass'] * df['Fare']

# Calculating the correlations


print("\nCorrelation with Survival:")
print("Pclass:", df['Pclass'].corr(df['Survived']))
print("Fare:", df['Fare'].corr(df['Survived']))
print("Pclass_Fare:", df['Pclass_Fare'].corr(df['Survived']))
Correlation with Survival:
Pclass: -0.3384810359610148
Fare: 0.2573065223849622
Pclass_Fare: 0.18362691096549183

No: at 0.184, the Pclass * Fare interaction correlates with Survived more weakly than Fare alone (0.257). Multiplying a feature where low values signal wealth (Pclass) by one where high values do (Fare) partly cancels the two effects, so the product carries less signal than either input.

Part 2: Advanced Statistical & Outlier Detection (6 Marks)


4. Advanced Outlier Detection (2 Marks)
(a) Detect outliers in Fare using both IQR method and Z-score method (consider Z-score > 3 as
outliers).

(b) Compare the number of outliers detected by both methods. Which method do you think is
more robust?

# (a) IQR Method


Q1 = df['Fare'].quantile(0.25)
Q3 = df['Fare'].quantile(0.75)
IQR = Q3 - Q1
outliers_iqr = df[(df['Fare'] < Q1 - 1.5 * IQR) | (df['Fare'] > Q3 + 1.5 * IQR)]

# Z-score Method
z_scores = np.abs(stats.zscore(df['Fare']))
outliers_zscore = df[z_scores > 3]

print("\nNumber of outliers:")
print("IQR method:", len(outliers_iqr))
print("Z-score method:", len(outliers_zscore))

Number of outliers:
IQR method: 116
Z-score method: 20

The IQR method flags far more points because Fare is heavily right-skewed: the extreme fares inflate the standard deviation behind the z-score, pushing its cutoff deep into the tail, while the quartile-based IQR fences are unaffected by extreme values. For skewed data like Fare, the IQR method is the more robust choice.
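Printing the two upper cutoffs in fare units makes the gap concrete (a supplementary check, not part of the original brief; pandas' std() uses ddof=1 while scipy's zscore uses ddof=0, so the z-cutoff below is approximate):

# Upper fence / cutoff for each method, in fare units
upper_iqr = Q3 + 1.5 * IQR
upper_z = df['Fare'].mean() + 3 * df['Fare'].std()
print(f"IQR upper fence:      {upper_iqr:.2f}")
print(f"Z-score upper cutoff: {upper_z:.2f}")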

5. Skewness Correction (2 Marks)


(a) Log-transform the Fare column to reduce skewness.

(b) Plot histograms of the original and transformed Fare. Did the transformation make the data
more normal?

# (a) Log transform


df['Fare_Log'] = np.log1p(df['Fare'])

# (b) Plot distributions


plt.figure(figsize=(12,6))
plt.subplot(1,2,1)
sns.histplot(df['Fare'], kde=True)
plt.title('Original Fare Distribution')
plt.subplot(1,2,2)
sns.histplot(df['Fare_Log'], kde=True)
plt.title('Log-transformed Fare Distribution')
plt.tight_layout()
plt.show()
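The visual impression can be quantified with the skewness statistic (a supplementary check, not in the original submission):

# Skewness near 0 indicates a roughly symmetric distribution
print(f"Skewness before: {df['Fare'].skew():.2f}")
print(f"Skewness after:  {df['Fare_Log'].skew():.2f}")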

6. Feature Selection using Correlation and Variance Threshold (2 Marks)
(a) Remove low-variance features (features that have the same value for >95% of samples).

(b) Identify the top three features most correlated with Survival and justify their importance.

# (a) Remove low-variance features


def low_variance_features(df, threshold=0.95):
    n_samples = len(df)
    selector = []
    for column in df.columns:
        most_common = df[column].value_counts().iloc[0]
        if most_common / n_samples < threshold:
            selector.append(column)
    return selector

variance_features = low_variance_features(df)
print("\nFeatures with sufficient variance:", variance_features)

# (b) Top correlations with Survival


numeric_cols = df.select_dtypes(include=['float64', 'int64']).columns
correlations = df[numeric_cols].corr()['Survived'].sort_values(ascending=False)
print("\nTop 3 correlations with Survival:")
print(correlations[1:4])  # Excluding Survived itself

Features with sufficient variance: ['PassengerId', 'Survived',
'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare',
'Cabin', 'Embarked', 'Age_KNN', 'Age_Group', 'Cabin_Indicator',
'Pclass_Fare', 'Fare_Log']

Top 3 correlations with Survival:
Fare_Log           0.329862
Cabin_Indicator    0.316912
Fare               0.257307
Name: Survived, dtype: float64

All three are proxies for wealth and accommodation: Fare and its log transform measure ticket price (the log transform tames extreme fares, which strengthens the linear correlation), while Cabin_Indicator marks passengers with a recorded cabin, who were predominantly travelling first class.
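The section title mentions a variance threshold, while the submission implements a frequency-based filter. For reference, a sketch using scikit-learn's VarianceThreshold on the numeric columns (an alternative approach; the 0.05 cutoff is chosen arbitrarily for illustration):

from sklearn.feature_selection import VarianceThreshold

numeric = df[['Pclass', 'SibSp', 'Parch', 'Fare', 'Age_KNN',
              'Cabin_Indicator', 'Fare_Log']]
selector = VarianceThreshold(threshold=0.05)  # drop near-constant columns
selector.fit(numeric)
print("Kept:", list(numeric.columns[selector.get_support()]))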

Part 3: Hypothesis Testing & Statistical Modeling (6 Marks)


7. Survival Dependency on Passenger Class (Kruskal-Wallis Test) (2 Marks)
(a) Perform a Kruskal-Wallis Test to check if the median survival rates differ across Pclass
groups.

(b) Interpret the p-value: Does class significantly impact survival?

# (a) Kruskal-Wallis Test


class1_survival = df[df['Pclass'] == 1]['Survived']
class2_survival = df[df['Pclass'] == 2]['Survived']
class3_survival = df[df['Pclass'] == 3]['Survived']

h_statistic, p_value = kruskal(class1_survival, class2_survival, class3_survival)

print("\nKruskal-Wallis Test Results:")
print(f"H-statistic: {h_statistic}")
print(f"p-value: {p_value}")

# (b) Interpretation
print("\nInterpretation:")
if p_value < 0.05:
    print("There is a significant difference in survival rates across passenger classes")
else:
    print("No significant difference in survival rates across passenger classes")
Kruskal-Wallis Test Results:
H-statistic: 102.77351289976991
p-value: 4.819647000539969e-23

Interpretation:
There is a significant difference in survival rates across passenger classes
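The Kruskal-Wallis test only establishes that at least one class differs. A follow-up sketch (an addition, not required by the brief) using pairwise Mann-Whitney U tests with a Bonferroni correction identifies which pairs differ:

from itertools import combinations
from scipy.stats import mannwhitneyu

# Compare each pair of classes; Bonferroni-corrected alpha = 0.05 / 3
groups = {c: df.loc[df['Pclass'] == c, 'Survived'] for c in (1, 2, 3)}
for a, b in combinations(groups, 2):
    u_stat, p = mannwhitneyu(groups[a], groups[b])
    print(f"Class {a} vs {b}: U={u_stat:.1f}, p={p:.3e}, "
          f"significant={p < 0.05 / 3}")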

8. Does Fare Predict Survival? (Logistic Regression) (2 Marks)
(a) Build a logistic regression model where Fare is the predictor and Survived is the response
variable.

(b) Report the model coefficients and interpret whether higher fares increase survival chances.

# (a) Build logistic regression model


X = df['Fare'].values.reshape(-1, 1)
y = df['Survived']

model = LogisticRegression()
model.fit(X, y)

# (b) Report and interpret coefficients

print("\nLogistic Regression Results:")
print(f"Coefficient: {model.coef_[0][0]:.4f}")
print(f"Intercept: {model.intercept_[0]:.4f}")
print("\nInterpretation:")
if model.coef_[0][0] > 0:
    print("Higher fares are associated with increased survival chances")
else:
    print("Higher fares are associated with decreased survival chances")

Logistic Regression Results:
Coefficient: 0.0152
Intercept: -0.9413

Interpretation:
Higher fares are associated with increased survival chances
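The coefficient is on the log-odds scale, so exponentiating it gives an odds ratio that is easier to read (a supplementary calculation from the fitted values above):

# exp(coef) is the multiplicative change in survival odds per $1 of fare;
# a $10 fare increase scales the odds by exp(10 * coef)
coef = model.coef_[0][0]
print(f"Odds ratio per $1 of fare:  {np.exp(coef):.4f}")       # ~1.015
print(f"Odds ratio per $10 of fare: {np.exp(10 * coef):.4f}")  # ~1.16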

9. Advanced Chi-Square Test (2 Marks)
(a) Create a contingency table for Survived vs. Embarked and perform a Chi-Square Test of
Independence.

(b) Based on the results, is survival dependent on embarkation point?


# (a) Create contingency table and perform chi-square test
contingency_table = pd.crosstab(df['Survived'], df['Embarked'])
chi2, p_value, dof, expected = chi2_contingency(contingency_table)

print("\nChi-Square Test Results:")


print(f"Chi-square statistic: {chi2:.4f}")
print(f"p-value: {p_value:.4f}")

# (b) Interpretation
print("\nInterpretation:")
if p_value < 0.05:
    print("Survival is dependent on embarkation point")
else:
    print("Survival is independent of embarkation point")

Chi-Square Test Results:
Chi-square statistic: 26.4891
p-value: 0.0000

Interpretation:
Survival is dependent on embarkation point

Note that embarkation point is entangled with class and fare (Cherbourg passengers were disproportionately first class), so the association need not be causal.
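To see the direction of the effect, the same contingency table can be normalised per port (a supplementary view; C = Cherbourg, Q = Queenstown, S = Southampton):

# Each column shows the died/survived split for one embarkation port
print(pd.crosstab(df['Survived'], df['Embarked'], normalize='columns'))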

Part 4: Dimensionality Reduction & Clustering (2 Marks)

10. Principal Component Analysis (PCA) & K-Means Clustering (2 Marks)
(a) Apply PCA to reduce the dataset to 2 principal components and plot the variance explained by
each component.

(b) Perform K-Means clustering (k=2) on the PCA-transformed data and analyze whether
clusters align with survival.

# Prepare numeric data for PCA


numeric_features = ['Age_KNN', 'Fare', 'Pclass', 'SibSp', 'Parch']
X = df[numeric_features]
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# (a) Performing PCA


pca = PCA()
X_pca = pca.fit_transform(X_scaled)

# Plot variance explained


plt.figure(figsize=(8,6))
plt.plot(range(1, len(pca.explained_variance_ratio_) + 1),
         np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance Ratio')
plt.title('PCA Explained Variance Ratio')
plt.show()
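Part (a) asks for the variance explained by each component, while the curve above shows the cumulative total. A small addition printing the per-component ratios for the first two components:

# Share of total variance captured by each of the first two components
for i, ratio in enumerate(pca.explained_variance_ratio_[:2], start=1):
    print(f"PC{i}: {ratio:.2%}")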

# (b) K-Means Clustering


X_pca_2d = X_pca[:, :2]
kmeans = KMeans(n_clusters=2, random_state=42)
clusters = kmeans.fit_predict(X_pca_2d)

# Plot clusters and analyze alignment with survival


plt.figure(figsize=(8,6))
scatter = plt.scatter(X_pca_2d[:, 0], X_pca_2d[:, 1],
                      c=clusters, cmap='viridis')
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.title('K-Means Clustering on PCA-transformed Data')
plt.colorbar(scatter)
plt.show()

# Calculate cluster alignment with survival


cluster_survival = pd.DataFrame({
'Cluster': clusters,
'Survived': df['Survived']
})
print("\nSurvival rates by cluster:")
print(cluster_survival.groupby('Cluster')['Survived'].mean())
Survival rates by cluster:
Cluster
0    0.488000
1    0.366841
Name: Survived, dtype: float64

The two clusters have fairly similar survival rates (0.49 vs 0.37), so the unsupervised clusters align only weakly with survival: the first two principal components are dominated by fare, class and family size, which carry some but not all of the survival signal.
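The alignment can be quantified with the adjusted Rand index, where 0 is chance-level agreement and 1 is perfect (a supplementary check, not in the original brief):

from sklearn.metrics import adjusted_rand_score

# Compare unsupervised cluster labels against the Survived column
ari = adjusted_rand_score(df['Survived'], clusters)
print(f"Adjusted Rand index: {ari:.3f}")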
