Assignment 2
We'll analyze the Titanic dataset, which lists passengers from the Titanic, including whether or
not they survived.
# Dropping 'Cabin' since it's too sparse, and 'Name' since it is not used as a
# feature (any title extraction would have to happen before this drop)
df.drop(columns=['Cabin', 'Name'], inplace=True)
# Filling missing Embarked with mode; assigning the result back instead of
# using inplace=True avoids the pandas chained-assignment FutureWarning
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
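As an illustration of the cleaning steps in this section, together with the median fill for Age and the numeric encoding described in the report, here is a minimal sketch on a toy frame. The rows are hypothetical stand-ins rather than real passengers, and the integer codes chosen for Sex and Embarked are an assumption; the notebook's actual encoding is not shown.

```python
import pandas as pd

# Hypothetical toy rows standing in for the Titanic data (not real passengers)
toy = pd.DataFrame({
    "Age": [22.0, None, 35.0, 54.0],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", None, "S"],
})

# Fill missing Age with the median of the observed ages
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

# Fill missing Embarked with the most frequent value (the mode)
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

# Convert categorical columns to numbers. This particular mapping is an
# assumption; any consistent encoding would serve the same purpose.
toy["Sex"] = toy["Sex"].map({"male": 0, "female": 1})
toy["Embarked"] = toy["Embarked"].map({"S": 0, "C": 1, "Q": 2})

print(toy)
```

Assigning each result back to its column, rather than calling fillna with inplace=True on a column selection, keeps the code free of chained-assignment warnings.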
3. String Manipulation
# Example of string manipulation (if applicable)
# In this dataset, we did not keep the 'Name' column, but if we had, we could do:
# df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)  # Extracting titles like Mr, Mrs
# df['Title'] = df['Title'].str.lower()  # Convert to lowercase
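To show what the commented-out extraction would produce, here is a small runnable sketch. The names are hypothetical examples written in a "Surname, Title. Given names" layout, which is assumed to match the dataset's Name column.

```python
import pandas as pd

# Hypothetical names in the "Surname, Title. Given names" layout
names = pd.Series([
    "Smith, Mr. John",
    "Jones, Mrs. Anna (Mary Lee)",
    "Brown, Miss. Clara",
])

# Capture the word that follows a space and precedes a period.
# The raw string keeps the backslash in the pattern intact, and
# expand=False returns a Series instead of a one-column DataFrame.
titles = names.str.extract(r" ([A-Za-z]+)\.", expand=False).str.lower()

print(titles.tolist())
```

In a real pipeline this step would need to run before the Name column is dropped.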
5. Data Splitting
from sklearn.model_selection import train_test_split

# Define features and target variable
X = df[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']]
y = df['Survived']  # Target variable

# Splitting the dataset into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
6. Build a Model
We'll use Logistic Regression since it works well with binary outcomes.
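The model-fitting cell is not shown here, so the following is only a sketch of how a logistic regression could be fit and used for prediction with scikit-learn. It runs on synthetic data rather than the Titanic features, and the names X_demo and y_demo and the max_iter value are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for an encoded feature matrix (not Titanic data)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
# Binary label driven mostly by the first feature, plus some noise
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# max_iter=1000 is an assumption, chosen to give the solver room to converge
model = LogisticRegression(max_iter=1000)
model.fit(X_tr, y_tr)

y_pred = model.predict(X_te)              # hard 0/1 predictions
y_prob = model.predict_proba(X_te)[:, 1]  # estimated probability of class 1

print("Test accuracy:", model.score(X_te, y_te))
```

With the notebook's own variables, the same calls would be applied to X_train, y_train, and X_test once Sex and Embarked have been encoded as numbers.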
REPORT:
Missing values were addressed by filling missing Age values with the median age and missing
Embarked values with the mode. Preprocessing included dropping unused columns (Cabin, Name)
and converting categorical variables (Sex, Embarked) into numerical format. String Manipulation:
Although the Name column was dropped, typical string manipulations could involve extracting
titles for gender and class analysis.
Basic statistics were computed using NumPy, revealing that the average age of passengers was
approximately 29.7 years, while the average fare was about 32.2. Model Building: We employed
logistic regression to model passenger survival. The dataset was split into training
(80%) and testing (20%) sets.
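For the summary statistics mentioned in this part of the report, a NumPy computation might look like the sketch below. The arrays hold hypothetical values, not the dataset's columns, so the numbers will not match the reported 29.7 and 32.2.

```python
import numpy as np

# Hypothetical values; the NaN stands for a missing age
ages = np.array([22.0, 38.0, np.nan, 35.0])
fares = np.array([7.25, 71.28, 7.92, 53.10])

mean_age = np.nanmean(ages)      # mean that skips NaN entries
median_age = np.nanmedian(ages)  # median that skips NaN entries
mean_fare = np.mean(fares)

print(f"Mean age: {mean_age:.1f}, median age: {median_age:.1f}")
print(f"Mean fare: {mean_fare:.1f}")
```

Whether the mean is taken before or after filling missing ages changes the result slightly, so it is worth stating which one a report uses.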
[3] Results: The model achieved an accuracy of approximately 80%, indicating a reasonably
good prediction capability given the structured features. The confusion matrix provided further
insight into the classification performance.
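As a sketch of how accuracy and a confusion matrix relate, the example below uses scikit-learn's metrics on a small set of hypothetical labels and predictions. These are not the model's actual outputs; they are chosen only to make the arithmetic easy to follow.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions (1 = survived, 0 = did not)
y_true = [0, 0, 1, 1, 1, 0, 1, 0, 0, 1]
y_hat = [0, 0, 1, 0, 1, 0, 1, 1, 0, 1]

acc = accuracy_score(y_true, y_hat)

# Rows are true classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm.ravel()

print("Accuracy:", acc)
print(cm)
```

Accuracy is (TN + TP) divided by the total count, so the matrix shows which kind of error, false positives or false negatives, accounts for the misclassified cases.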
[4] Conclusion: This analysis demonstrates how data preprocessing and machine learning
techniques can be applied to derive insights from historical datasets. Future work could explore
hyperparameter tuning and alternative models for better accuracy.