Assignment 2

Uploaded by

lavanyagowdau
Copyright
© All Rights Reserved

1. Dataset Selection

We'll analyze the Titanic dataset, which lists passengers from the Titanic, including whether or
not they survived.

2. Data Loading and Cleaning


# Importing necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

# Load the Titanic dataset
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)

# Display the first few rows of the dataset


print(df.head())

# Clean the data: handle missing values
# Assign the filled column back instead of calling fillna(..., inplace=True) on a selection
df['Age'] = df['Age'].fillna(df['Age'].median())  # Fill missing Age with the median
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])  # Fill missing Embarked with the mode

# Drop 'Cabin' (too sparse) and 'Name' (not used as a model feature)
df.drop(columns=['Cabin', 'Name'], inplace=True)

# Convert 'Sex' and 'Embarked' into numerical values


df['Sex'] = df['Sex'].map({'female': 0, 'male': 1})
df['Embarked'] = df['Embarked'].map({'C': 0, 'Q': 1, 'S': 2})

# Display cleaned data


print(df.info())

   PassengerId  Survived  Pclass  \
0            1         0       3
1            2         1       1
2            3         1       3
3            4         1       1
4            5         0       3

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1
2                             Heikkinen, Miss. Laina  female  26.0      0
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1
4                           Allen, Mr. William Henry    male  35.0      0

   Parch            Ticket     Fare  Cabin  Embarked
0      0         A/5 21171   7.2500    NaN         S
1      0          PC 17599  71.2833    C85         C
2      0  STON/O2. 3101282   7.9250    NaN         S
3      0            113803  53.1000   C123         S
4      0            373450   8.0500    NaN         S
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Sex 891 non-null int64
4 Age 891 non-null float64
5 SibSp 891 non-null int64
6 Parch 891 non-null int64
7 Ticket 891 non-null object
8 Fare 891 non-null float64
9 Embarked 891 non-null int64
dtypes: float64(2), int64(7), object(1)
memory usage: 69.7+ KB
None

Note: calling fillna as df[col].fillna(value, inplace=True) triggers a pandas
FutureWarning, reported here for both the Age and the Embarked lines:

FutureWarning: A value is trying to be set on a copy of a DataFrame or
Series through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never
work because the intermediate object on which we are setting values
always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try
using 'df.method({col: value}, inplace=True)' or df[col] =
df[col].method(value) instead, to perform the operation inplace on the
original object.
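The assignment form recommended by the pandas warning can be sketched on a small frame. This is a minimal illustration, not the assignment's own code: the frame `toy` and its values are made up, and only the `Age` and `Embarked` column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps; the values are invented for illustration.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Embarked": ["S", "C", None, "S", "S"],
})

# Assign the filled column back to the frame rather than using
# inplace=True on a column selection.
toy["Age"] = toy["Age"].fillna(toy["Age"].median())
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

print(toy)
print(toy.isna().sum())  # every count should be zero after filling
```

Because the result is assigned back explicitly, the code does not depend on whether the column selection is a view or a copy.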

3. String Manipulation

# Example of string manipulation (if applicable)
# In this dataset, we did not keep the 'Name' column, but if we had, we could do:
# df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)  # Extract titles like Mr, Mrs
# df['Title'] = df['Title'].str.lower()  # Convert to lowercase
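To make the commented-out idea concrete, here is a small runnable sketch of the title extraction. It assumes a standalone Series named `names` (a hypothetical stand-in for the dropped `Name` column) holding three of the names shown in the `df.head()` output.

```python
import pandas as pd

# Stand-in for the dropped 'Name' column, using names from the head() output.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Th...",
    "Heikkinen, Miss. Laina",
])

# Capture the first word that follows a space and is followed by a period,
# then lowercase it. expand=False returns a Series instead of a DataFrame.
titles = names.str.extract(r" ([A-Za-z]+)\.", expand=False).str.lower()

print(titles.tolist())
```

The raw-string prefix (`r"..."`) keeps `\.` from being read as a string escape, and `str.extract` returns only the first match per row, so the later `Th...` in the second name is ignored.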

4. Using NumPy for Basic Statistics


# Convert relevant columns to NumPy arrays
age_array = df['Age'].to_numpy()
fare_array = df['Fare'].to_numpy()

# Calculate basic statistics


print(f"Mean Age: {np.mean(age_array)}, Median Age: {np.median(age_array)}")
print(f"Mean Fare: {np.mean(fare_array)}, Median Fare: {np.median(fare_array)}")

Mean Age: 29.36158249158249, Median Age: 28.0


Mean Fare: 32.204207968574636, Median Fare: 14.4542
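The mean fare (about 32.2) is more than twice the median (about 14.45), which suggests a few expensive tickets pull the mean upward. A minimal sketch of that effect, using only the five fares visible in the `df.head()` output (the array name `sample_fares` is an illustrative assumption):

```python
import numpy as np

# The five fares shown in the head() output; two of them are much larger.
sample_fares = np.array([7.25, 71.2833, 7.925, 53.1, 8.05])

mean_fare = np.mean(sample_fares)
median_fare = np.median(sample_fares)

print(f"Mean: {mean_fare:.4f}, Median: {median_fare:.4f}")
# The mean sits far above the median because of the two high fares.
print(f"Mean exceeds median: {mean_fare > median_fare}")
```

For skewed columns like this, the median is usually the more representative "typical" value, which is also why it was a reasonable choice for imputing Age.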

5. Data Splitting

# Define features and target variable
X = df[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']]
y = df['Survived'] # Target variable

# Splitting the dataset into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2, random_state=42)
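As a quick check on what an 80/20 split means for 891 rows, the sketch below splits a placeholder array of row indices (`rows`, a hypothetical stand-in for the real features) with the same `test_size` and `random_state`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder for the 891 passenger rows reported by df.info().
rows = np.arange(891)

train_rows, test_rows = train_test_split(rows, test_size=0.2, random_state=42)

# The test share is rounded up, so 20% of 891 gives 179 test rows.
print(len(train_rows), len(test_rows))
```

The 179 test rows match the total of the four cells in the confusion matrix reported for the model. Passing `stratify=y` to `train_test_split` is an optional refinement that keeps the survival ratio equal in both sets; the original split does not use it.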
6. Build a Model
We'll use Logistic Regression since it works well with binary outcomes.

# Build the model


model = LogisticRegression()

# Train the model on the training set


model.fit(X_train, y_train)

# Make predictions on the test set


y_pred = model.predict(X_test)

# Evaluate the model


accuracy = accuracy_score(y_test, y_pred)
confusion = confusion_matrix(y_test, y_pred)

print(f"Accuracy of the model: {accuracy*100:.2f}%")


print("Confusion Matrix:")
print(confusion)

Accuracy of the model: 81.01%


Confusion Matrix:
[[90 15]
[19 55]]
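The accuracy figure can be recovered from the confusion matrix itself, along with precision and recall for the "survived" class. This sketch relies on scikit-learn's convention that rows are true labels and columns are predicted labels, in label order 0 then 1; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix as printed above: rows = true class, columns = predicted.
cm = np.array([[90, 15],
               [19, 55]])

# For a 2x2 matrix, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)  # of predicted survivors, the share who survived
recall = tp / (tp + fn)     # of actual survivors, the share the model found

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
```

Recall is lower than precision here, meaning the model misses more actual survivors (19) than it wrongly predicts as survivors (15).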

REPORT:

Title: Analysis of Titanic Dataset for Survival Prediction


Objective: The aim of this analysis was to predict passenger survival
from the Titanic dataset using machine learning techniques. The
primary focus was on data cleaning, string manipulation, and model
building.

Dataset: The Titanic dataset, widely distributed through Kaggle, was loaded
from a public CSV copy hosted on GitHub. It contains information about 891
passengers, including features like age, gender, ticket class, and whether
they survived.

[1] Data Cleaning:

Missing values were addressed: missing Age values were filled with the median age, and missing
Embarked values with the mode. Preprocessing included dropping the Cabin and Name columns
and converting categorical variables (Sex, Embarked) into numerical format.

String Manipulation: Although the Name column was dropped, typical string manipulations
could involve extracting titles for gender and class analysis.

[2] Statistical Analysis:

Basic statistics were computed using NumPy. After imputation, the mean passenger age was
approximately 29.4 years (median 28.0), while the mean fare was about 32.2 (median 14.45).

Model Building: We employed logistic regression to model passenger survival. The dataset was
split into training (80%) and testing (20%) sets.

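[3] Results: The model achieved an accuracy of 81.01% on the test set, indicating reasonably
good predictive capability given the structured features. The confusion matrix shows that, of the
179 test passengers, 90 non-survivors and 55 survivors were classified correctly, while 15
non-survivors were predicted to survive and 19 survivors were predicted not to survive.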

[4] Conclusion: This analysis demonstrates how data preprocessing and machine learning
techniques can be applied to derive insights from historical datasets. Future work could explore
hyperparameter tuning and alternative models for better accuracy.
