Data Pre Processing

The document outlines a data preprocessing workflow for a dataset related to survival analysis, including loading the dataset, handling missing values, and encoding categorical variables. It also involves splitting the dataset into training and testing sets and standardizing numerical features. Finally, a pie chart is generated to visualize the proportion of survival outcomes.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

dataset = pd.read_csv("D:/preethi/BTech/SUBJECTS/ML/LAB/train.csv")

print(dataset.head())

# Check the dimensions of the dataset


print(dataset.shape)

# Display summary statistics


print(dataset.describe())

# Check for missing values


print(dataset.isnull().sum())
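A quick way to decide between imputing a column and dropping it (as is done for 'cabin' below) is the share of missing values per column. A minimal sketch on a toy frame (hypothetical values, not the real train.csv):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the dataset, with deliberate gaps
df = pd.DataFrame({
    'age': [22.0, np.nan, 38.0, np.nan],
    'cabin': [np.nan, np.nan, np.nan, 'C85'],
    'embarked': ['S', 'C', np.nan, 'S'],
})

# isnull().mean() gives the fraction of missing values per column
missing_pct = df.isnull().mean() * 100
print(missing_pct)
```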

# Impute missing 'age' values with the median
# (.mode() returns a Series, not a scalar, so it would not fill correctly here)


dataset['age'] = dataset['age'].fillna(dataset['age'].median())
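Median imputation can be sanity-checked on a toy Series (hypothetical values, for illustration only):

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, 30.0, 40.0])
# The median of the non-missing values [10, 30, 40] is 30,
# so the NaN is replaced by 30.
filled = s.fillna(s.median())
print(filled.tolist())
```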

# Drop the 'cabin' column due to excessive missing values


dataset.drop(columns=['cabin'], inplace=True)

# Fill missing 'embarked' values with the mode (most frequent port)


dataset['embarked'] = dataset['embarked'].fillna(dataset['embarked'].mode()[0])

print(dataset.isnull().sum())

from sklearn.preprocessing import LabelEncoder

# LabelEncoder is used to convert categorical labels into integer codes


# Encode 'gender' column
labelencoder = LabelEncoder()
dataset['gender'] = labelencoder.fit_transform(dataset['gender'])
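On a toy list of labels, LabelEncoder assigns integer codes in alphabetical order of the classes (a small sketch, assuming 'male'/'female' values as in Titanic-style data):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# Classes are sorted alphabetically: 'female' -> 0, 'male' -> 1
codes = le.fit_transform(['male', 'female', 'female', 'male'])
print(list(le.classes_))
print(list(codes))
```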

# One-hot encode the 'embarked' column (get_dummies, dropping the first level)


dataset = pd.get_dummies(dataset, columns=['embarked'], drop_first=True)
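The effect of `drop_first=True` is easiest to see on a toy column (assumed C/Q/S port values): the alphabetically first category becomes the implicit baseline and gets no column of its own.

```python
import pandas as pd

toy = pd.DataFrame({'embarked': ['S', 'C', 'Q', 'S']})
# Categories sort as C, Q, S; drop_first removes the 'embarked_C' column
encoded = pd.get_dummies(toy, columns=['embarked'], drop_first=True)
print(encoded.columns.tolist())
```

Dropping one level avoids redundant (perfectly collinear) dummy columns, which matters for linear models.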
from sklearn.model_selection import train_test_split

# Define features and target variable


X = dataset.drop(columns=['name', 'ticket', 'survived'])
y = dataset['survived']

# Split into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)
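For an imbalanced target such as survival, `stratify=y` keeps the class ratio roughly equal in both splits. A minimal sketch on synthetic data (not the actual dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.array([0] * 7 + [1] * 3)

# stratify=y preserves the 70/30 class balance across both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
print(X_tr.shape, X_te.shape)
```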

print(f"Training set shape: {X_train.shape}")


print(f"Testing set shape: {X_test.shape}")

from sklearn.preprocessing import StandardScaler

# StandardScaler standardizes numerical data to zero mean and unit variance.
# Note: ideally the scaler is fit on the training set only, so that
# test-set statistics do not leak into preprocessing.
scaler = StandardScaler()
dataset[['age', 'fare']] = scaler.fit_transform(dataset[['age', 'fare']])
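Fitting the scaler on the training data and only transforming the test data avoids leaking test-set statistics; a minimal sketch on toy arrays:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[20.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learn mean/std from train only
test_scaled = scaler.transform(test)        # reuse the training statistics
print(train_scaled.ravel(), test_scaled.ravel())
```

Because the test value equals the training mean (20), it maps to exactly 0 under the training statistics.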

# value_counts() orders by frequency, so derive labels from its index
# rather than hardcoding ['1', '0']
counts = dataset['survived'].value_counts()
plt.pie(counts, labels=counts.index, autopct='%.f%%', shadow=True)


plt.title('Outcome Proportionality')
plt.show()
