Homework2

The document outlines a data analysis project on the Titanic dataset, focusing on data preprocessing, feature engineering, statistical testing, and modeling. Key analyses include handling missing data, creating new features, performing outlier detection, and conducting hypothesis tests to explore survival rates based on various factors. The results indicate significant differences in survival rates across passenger classes and a positive correlation between fare and survival chances.

Submission by Kanit Mann

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from scipy import stats
from scipy.stats import kruskal
from scipy.stats import chi2_contingency

df = pd.read_csv('Titanic-Dataset.csv')  # read_csv already returns a DataFrame
print(df.head(5))

   PassengerId  Survived  Pclass                                               Name  \
0            1         0       3                            Braund, Mr. Owen Harris
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...
2            3         1       3                             Heikkinen, Miss. Laina
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)
4            5         0       3                           Allen, Mr. William Henry

      Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1  female  38.0      1      0          PC 17599  71.2833   C85        C
2  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3  female  35.0      1      0            113803  53.1000  C123        S
4    male  35.0      0      0            373450   8.0500   NaN        S
Part 1: Advanced Data Preprocessing (6 Marks)
1. Handling Complex Missing Data (2 Marks)
(a) Instead of simple mean or median imputation, use a predictive model (e.g., KNN or
Regression) to fill missing values in the Age column.

(b) Compare the distribution of imputed values with the original Age column. Do they look
realistic?
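Before imputing, it is worth checking where values are missing. A quick supplementary look (not part of the original brief):

# Count missing values per column; on the standard Kaggle Titanic file,
# Age, Cabin and Embarked are the columns with gaps
print(df.isna().sum())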

# a) Performing KNN imputation: neighbours are found on the scaled
# Pclass, Fare, SibSp and Parch columns, and Age is imputed from them

features = ['Pclass', 'Fare', 'SibSp', 'Parch']
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(df[features]), columns=features)
X_scaled['Age'] = df['Age']

imputer = KNNImputer(n_neighbors=5)
df['Age_KNN'] = imputer.fit_transform(X_scaled)[:, -1]

# b) Plotting the original and KNN imputed age distributions


plt.figure(figsize=(12,6))
plt.subplot(1,2,1)
sns.histplot(df['Age'].dropna(), kde=True)
plt.title('Original Age Distribution')
plt.subplot(1,2,2)
sns.histplot(df['Age_KNN'], kde=True)
plt.title('KNN Imputed Age Distribution')
plt.tight_layout()
plt.show()
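To go beyond the visual check, a small supplementary comparison (an addition, not required by the brief) of summary statistics, plus a two-sample Kolmogorov-Smirnov test between the observed ages and the imputed column:

from scipy.stats import ks_2samp

# Compare summary statistics of observed vs. imputed Age column
print(df[['Age', 'Age_KNN']].describe())

# Two-sample KS test: a large p-value suggests the imputed column's
# distribution is consistent with the observed one
stat, p = ks_2samp(df['Age'].dropna(), df['Age_KNN'])
print(f"KS statistic: {stat:.4f}, p-value: {p:.4f}")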
2. Feature Engineering with Binning (2 Marks)
(a) Create a new categorical feature ‘Age Group’ by binning Age into the following groups:

• 0-12: Child
• 13-19: Teen
• 20-35: Young Adult
• 36-60: Middle Aged
• 60+: Senior

(b) Calculate the survival rate for each Age Group and analyze whether age impacts survival
probability.

# (a) Create Age Groups


def age_category(age):
    if age <= 12: return 'Child'
    elif age <= 19: return 'Teen'
    elif age <= 35: return 'Young Adult'
    elif age <= 60: return 'Middle Aged'
    else: return 'Senior'

df['Age_Group'] = df['Age_KNN'].apply(age_category)
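An equivalent, more idiomatic construction (a sketch of an alternative, not the submission's method; the column name Age_Group_cut is hypothetical) uses pd.cut with explicit bin edges:

# Same grouping via pd.cut; bins are right-inclusive, so (0, 12] is
# Child, (12, 19] is Teen, etc., matching the <= comparisons above
bins = [0, 12, 19, 35, 60, np.inf]
labels = ['Child', 'Teen', 'Young Adult', 'Middle Aged', 'Senior']
df['Age_Group_cut'] = pd.cut(df['Age_KNN'], bins=bins, labels=labels)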

# (b) Calculate survival rates by age group


survival_rates = df.groupby('Age_Group')['Survived'].mean()
print("\nSurvival Rates by Age Group:")
print(survival_rates)

# Plotting the survival rates for better visualization


plt.figure(figsize=(8,6))
sns.barplot(x=survival_rates.index, y=survival_rates.values)
plt.title('Survival Rates by Age Group')
plt.ylabel('Survival Rate')
plt.show()

Survival Rates by Age Group:

Age_Group
Child          0.579710
Middle Aged    0.400000
Senior         0.227273
Teen           0.410526
Young Adult    0.352941
Name: Survived, dtype: float64

Children show the highest survival rate (0.58) and seniors the lowest (0.23), so age does affect survival probability, in line with a "women and children first" evacuation priority.
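Note that groupby returns the categories in alphabetical order; for a bar chart that reads in life-stage order, a small reindex (a cosmetic addition, not in the original submission) can be applied before plotting:

# Reorder the groups so the bars follow the natural age progression
order = ['Child', 'Teen', 'Young Adult', 'Middle Aged', 'Senior']
print(survival_rates.reindex(order))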
3. Encoding and Interaction Features (2 Marks)
(a) Convert the categorical ‘Cabin’ column into a new feature called ‘Cabin Indicator’ (1 if the
passenger has a cabin, 0 if missing).

(b) Create a new interaction feature between Pclass and Fare by multiplying them together.
Does this new feature have a stronger correlation with survival?

# (a) Cabin Indicator


df['Cabin_Indicator'] = df['Cabin'].notna().astype(int)

# (b) Interaction feature


df['Pclass_Fare'] = df['Pclass'] * df['Fare']

# Calculating the correlations


print("\nCorrelation with Survival:")
print("Pclass:", df['Pclass'].corr(df['Survived']))
print("Fare:", df['Fare'].corr(df['Survived']))
print("Pclass_Fare:", df['Pclass_Fare'].corr(df['Survived']))
Correlation with Survival:
Pclass: -0.3384810359610148
Fare: 0.2573065223849622
Pclass_Fare: 0.18362691096549183

No: at 0.184, the Pclass * Fare interaction correlates with Survived more weakly than Fare alone (0.257). Multiplying a feature where low values signal wealth (Pclass) by one where high values do (Fare) partly cancels the two effects, so the product carries less signal than either input.

Part 2: Advanced Statistical & Outlier Detection (6 Marks)


4. Advanced Outlier Detection (2 Marks)
(a) Detect outliers in Fare using both IQR method and Z-score method (consider Z-score > 3 as
outliers).

(b) Compare the number of outliers detected by both methods. Which method do you think is
more robust?

# (a) IQR Method


Q1 = df['Fare'].quantile(0.25)
Q3 = df['Fare'].quantile(0.75)
IQR = Q3 - Q1
outliers_iqr = df[(df['Fare'] < Q1 - 1.5 * IQR) | (df['Fare'] > Q3 + 1.5 * IQR)]

# Z-score Method
z_scores = np.abs(stats.zscore(df['Fare']))
outliers_zscore = df[z_scores > 3]

print("\nNumber of outliers:")
print("IQR method:", len(outliers_iqr))
print("Z-score method:", len(outliers_zscore))

Number of outliers:
IQR method: 116
Z-score method: 20

The IQR method flags far more points because Fare is heavily right-skewed: the extreme fares inflate the standard deviation behind the z-score, pushing its cutoff deep into the tail, while the quartile-based IQR fences are unaffected by extreme values. For skewed data like Fare, the IQR method is the more robust choice.
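Printing the two upper cutoffs in fare units makes the gap concrete (a supplementary check, not part of the original brief; pandas' std() uses ddof=1 while scipy's zscore uses ddof=0, so the z-cutoff below is approximate):

# Upper fence / cutoff for each method, in fare units
upper_iqr = Q3 + 1.5 * IQR
upper_z = df['Fare'].mean() + 3 * df['Fare'].std()
print(f"IQR upper fence:      {upper_iqr:.2f}")
print(f"Z-score upper cutoff: {upper_z:.2f}")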

5. Skewness Correction (2 Marks)


(a) Log-transform the Fare column to reduce skewness.

(b) Plot histograms of the original and transformed Fare. Did the transformation make the data
more normal?

# (a) Log transform


df['Fare_Log'] = np.log1p(df['Fare'])

# (b) Plot distributions


plt.figure(figsize=(12,6))
plt.subplot(1,2,1)
sns.histplot(df['Fare'], kde=True)
plt.title('Original Fare Distribution')
plt.subplot(1,2,2)
sns.histplot(df['Fare_Log'], kde=True)
plt.title('Log-transformed Fare Distribution')
plt.tight_layout()
plt.show()
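The visual impression can be quantified with the skewness statistic (a supplementary check, not in the original submission):

# Skewness near 0 indicates a roughly symmetric distribution
print(f"Skewness before: {df['Fare'].skew():.2f}")
print(f"Skewness after:  {df['Fare_Log'].skew():.2f}")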

6. Feature Selection using Correlation and Variance Threshold (2 Marks)
(a) Remove low-variance features (features that have the same value for >95% of samples).

(b) Identify the top three features most correlated with Survival and justify their importance.

# (a) Remove low-variance features


def low_variance_features(df, threshold=0.95):
    n_samples = len(df)
    selector = []
    for column in df.columns:
        most_common = df[column].value_counts().iloc[0]
        if most_common / n_samples < threshold:
            selector.append(column)
    return selector

variance_features = low_variance_features(df)
print("\nFeatures with sufficient variance:", variance_features)

# (b) Top correlations with Survival


numeric_cols = df.select_dtypes(include=['float64', 'int64']).columns
correlations = df[numeric_cols].corr()['Survived'].sort_values(ascending=False)
print("\nTop 3 correlations with Survival:")
print(correlations[1:4])  # Excluding Survived itself

Features with sufficient variance: ['PassengerId', 'Survived',
'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare',
'Cabin', 'Embarked', 'Age_KNN', 'Age_Group', 'Cabin_Indicator',
'Pclass_Fare', 'Fare_Log']

Top 3 correlations with Survival:
Fare_Log           0.329862
Cabin_Indicator    0.316912
Fare               0.257307
Name: Survived, dtype: float64

All three are proxies for wealth and accommodation: Fare and its log transform measure ticket price (the log transform tames extreme fares, which strengthens the linear correlation), while Cabin_Indicator marks passengers with a recorded cabin, who were predominantly travelling first class.
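The section title mentions a variance threshold, while the submission implements a frequency-based filter. For reference, a sketch using scikit-learn's VarianceThreshold on the numeric columns (an alternative approach; the 0.05 cutoff is chosen arbitrarily for illustration):

from sklearn.feature_selection import VarianceThreshold

numeric = df[['Pclass', 'SibSp', 'Parch', 'Fare', 'Age_KNN',
              'Cabin_Indicator', 'Fare_Log']]
selector = VarianceThreshold(threshold=0.05)  # drop near-constant columns
selector.fit(numeric)
print("Kept:", list(numeric.columns[selector.get_support()]))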

Part 3: Hypothesis Testing & Statistical Modeling (6 Marks)


7. Survival Dependency on Passenger Class (Kruskal-Wallis Test) (2 Marks)
(a) Perform a Kruskal-Wallis Test to check if the median survival rates differ across Pclass
groups.

(b) Interpret the p-value: Does class significantly impact survival?

# (a) Kruskal-Wallis Test


class1_survival = df[df['Pclass'] == 1]['Survived']
class2_survival = df[df['Pclass'] == 2]['Survived']
class3_survival = df[df['Pclass'] == 3]['Survived']

h_statistic, p_value = kruskal(class1_survival, class2_survival, class3_survival)

print("\nKruskal-Wallis Test Results:")
print(f"H-statistic: {h_statistic}")
print(f"p-value: {p_value}")

# (b) Interpretation
print("\nInterpretation:")
if p_value < 0.05:
    print("There is a significant difference in survival rates across passenger classes")
else:
    print("No significant difference in survival rates across passenger classes")
Kruskal-Wallis Test Results:
H-statistic: 102.77351289976991
p-value: 4.819647000539969e-23

Interpretation:
There is a significant difference in survival rates across passenger classes
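The Kruskal-Wallis test only establishes that at least one class differs. A follow-up sketch (an addition, not required by the brief) using pairwise Mann-Whitney U tests with a Bonferroni correction identifies which pairs differ:

from itertools import combinations
from scipy.stats import mannwhitneyu

# Compare each pair of classes; Bonferroni-corrected alpha = 0.05 / 3
groups = {c: df.loc[df['Pclass'] == c, 'Survived'] for c in (1, 2, 3)}
for a, b in combinations(groups, 2):
    u_stat, p = mannwhitneyu(groups[a], groups[b])
    print(f"Class {a} vs {b}: U={u_stat:.1f}, p={p:.3e}, "
          f"significant={p < 0.05 / 3}")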

8. Does Fare Predict Survival? (Logistic Regression) (2 Marks)
(a) Build a logistic regression model where Fare is the predictor and Survived is the response
variable.

(b) Report the model coefficients and interpret whether higher fares increase survival chances.

# (a) Build logistic regression model


X = df['Fare'].values.reshape(-1, 1)
y = df['Survived']

model = LogisticRegression()
model.fit(X, y)

# (b) Report and interpret coefficients

print("\nLogistic Regression Results:")
print(f"Coefficient: {model.coef_[0][0]:.4f}")
print(f"Intercept: {model.intercept_[0]:.4f}")
print("\nInterpretation:")
if model.coef_[0][0] > 0:
    print("Higher fares are associated with increased survival chances")
else:
    print("Higher fares are associated with decreased survival chances")

Logistic Regression Results:
Coefficient: 0.0152
Intercept: -0.9413

Interpretation:
Higher fares are associated with increased survival chances
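The coefficient is on the log-odds scale, so exponentiating it gives an odds ratio that is easier to read (a supplementary calculation from the fitted values above):

# exp(coef) is the multiplicative change in survival odds per $1 of fare;
# a $10 fare increase scales the odds by exp(10 * coef)
coef = model.coef_[0][0]
print(f"Odds ratio per $1 of fare:  {np.exp(coef):.4f}")       # ~1.015
print(f"Odds ratio per $10 of fare: {np.exp(10 * coef):.4f}")  # ~1.16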

9. Advanced Chi-Square Test (2 Marks)
(a) Create a contingency table for Survived vs. Embarked and perform a Chi-Square Test of
Independence.

(b) Based on the results, is survival dependent on embarkation point?


# (a) Create contingency table and perform chi-square test
contingency_table = pd.crosstab(df['Survived'], df['Embarked'])
chi2, p_value, dof, expected = chi2_contingency(contingency_table)

print("\nChi-Square Test Results:")


print(f"Chi-square statistic: {chi2:.4f}")
print(f"p-value: {p_value:.4f}")

# (b) Interpretation
print("\nInterpretation:")
if p_value < 0.05:
    print("Survival is dependent on embarkation point")
else:
    print("Survival is independent of embarkation point")

Chi-Square Test Results:
Chi-square statistic: 26.4891
p-value: 0.0000

Interpretation:
Survival is dependent on embarkation point

Note that embarkation point is entangled with class and fare (Cherbourg passengers were disproportionately first class), so the association need not be causal.
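To see the direction of the effect, the same contingency table can be normalised per port (a supplementary view; C = Cherbourg, Q = Queenstown, S = Southampton):

# Each column shows the died/survived split for one embarkation port
print(pd.crosstab(df['Survived'], df['Embarked'], normalize='columns'))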

Part 4: Dimensionality Reduction & Clustering (2 Marks)

10. Principal Component Analysis (PCA) & K-Means Clustering (2 Marks)
(a) Apply PCA to reduce the dataset to 2 principal components and plot the variance explained by
each component.

(b) Perform K-Means clustering (k=2) on the PCA-transformed data and analyze whether
clusters align with survival.

# Prepare numeric data for PCA


numeric_features = ['Age_KNN', 'Fare', 'Pclass', 'SibSp', 'Parch']
X = df[numeric_features]
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# (a) Performing PCA


pca = PCA()
X_pca = pca.fit_transform(X_scaled)

# Plot variance explained


plt.figure(figsize=(8,6))
plt.plot(range(1, len(pca.explained_variance_ratio_) + 1),
         np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance Ratio')
plt.title('PCA Explained Variance Ratio')
plt.show()
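Part (a) asks for the variance explained by each component, while the curve above shows the cumulative total. A small addition printing the per-component ratios for the first two components:

# Share of total variance captured by each of the first two components
for i, ratio in enumerate(pca.explained_variance_ratio_[:2], start=1):
    print(f"PC{i}: {ratio:.2%}")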

# (b) K-Means Clustering


X_pca_2d = X_pca[:, :2]
kmeans = KMeans(n_clusters=2, random_state=42)
clusters = kmeans.fit_predict(X_pca_2d)

# Plot clusters and analyze alignment with survival


plt.figure(figsize=(8,6))
scatter = plt.scatter(X_pca_2d[:, 0], X_pca_2d[:, 1],
                      c=clusters, cmap='viridis')
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.title('K-Means Clustering on PCA-transformed Data')
plt.colorbar(scatter)
plt.show()

# Calculate cluster alignment with survival


cluster_survival = pd.DataFrame({
'Cluster': clusters,
'Survived': df['Survived']
})
print("\nSurvival rates by cluster:")
print(cluster_survival.groupby('Cluster')['Survived'].mean())
Survival rates by cluster:
Cluster
0    0.488000
1    0.366841
Name: Survived, dtype: float64

The two clusters have fairly similar survival rates (0.49 vs 0.37), so the unsupervised clusters align only weakly with survival: the first two principal components are dominated by fare, class and family size, which carry some but not all of the survival signal.
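The alignment can be quantified with the adjusted Rand index, where 0 is chance-level agreement and 1 is perfect (a supplementary check, not in the original brief):

from sklearn.metrics import adjusted_rand_score

# Compare unsupervised cluster labels against the Survived column
ari = adjusted_rand_score(df['Survived'], clusters)
print(f"Adjusted Rand index: {ari:.3f}")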
