1 - Data Preprocessing and Cleaning
Introduction:
Background Information:
Data preprocessing is a necessary step before feeding data into a training algorithm; it aims to improve data quality so that models perform better. It covers tasks such as handling missing values, normalization, feature scaling, and data transformation. Without preprocessing, low-quality data hinders model building, whereas a preprocessed dataset is fit for analysis.
Hypothesis:
Preprocessing will improve the quality of the data and therefore increase the performance of the machine learning techniques applied to it. Different approaches, such as handling missing values and feature scaling, will affect the model in different ways.
Relevance:
Historically, data preprocessing has been one of the critical steps in developing high-performance machine learning models. It is an important step in improving model outcomes because it removes inconsistencies and converts the data into a standard format for feeding to the models.
Methods:
Assignment 1: Data Cleaning and Handling Missing Values:
1. Load the Dataset
Use Python’s Pandas library to load a dataset (e.g., a CSV file) that contains
missing values.
2. Identify Missing Values
3. Handle Missing Values
4. Implement various strategies to handle missing data, such as removing rows/columns with missing data or imputing replacement values.
5. Compare Impact
See how the dataset changes after handling missing values (a minimal sketch of these steps follows this list).
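A minimal sketch of these steps, assuming a placeholder file name data.csv and using mean imputation as one of the strategies (neither the file nor the strategy is specified in this report):

import pandas as pd

# 1. Load the dataset (placeholder path)
df = pd.read_csv('data.csv')
# 2. Identify missing values per column
print(df.isnull().sum())
# 3./4. Strategy A: remove rows that contain missing data
dropped = df.dropna()
# Strategy B: impute numeric columns with their column means
filled = df.fillna(df.mean(numeric_only=True))
# 5. Compare the impact of each strategy on the dataset
print(df.shape, dropped.shape, filled.shape)
print(filled.isnull().sum())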
Assignment 2: Data Transformation and Feature Scaling:
1. Load a given dataset containing numerical and categorical features.
2. Perform data transformation tasks such as log transformation, power
transformation, and one-hot encoding.
3. Apply feature scaling techniques like Min-Max Scaling, Standardization, and
Robust Scaling.
4. Evaluate the impact of different transformations and scaling techniques on a
simple machine learning model (e.g., Linear Regression) – not covered in this report; a possible sketch is given below.
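Step 4 was not covered; the following is only a sketch of how such a comparison could be run, assuming scikit-learn, a placeholder file data.csv, 'High' as the single feature, 'Close' as the target, and an 80/20 train/test split (all assumptions, not taken from this report):

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

# Assumed feature/target columns on a placeholder dataset
df = pd.read_csv('data.csv').dropna()
X, y = df[['High']], df['Close']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the same Linear Regression model with each scaling technique and compare test error
for scaler in (None, MinMaxScaler(), StandardScaler(), RobustScaler()):
    X_tr = scaler.fit_transform(X_train) if scaler else X_train
    X_te = scaler.transform(X_test) if scaler else X_test
    model = LinearRegression().fit(X_tr, y_train)
    mse = mean_squared_error(y_test, model.predict(X_te))
    print(type(scaler).__name__ if scaler else 'No scaling', mse)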
Programming
Assignment 1: Data Cleaning and Handling Missing Values
import pandas as pd
import numpy as np
from scipy import stats
from sklearn.impute import KNNImputer

# Load the dataset (placeholder file name; the report does not state which CSV was used)
df = pd.read_csv('data.csv')
# Identify missing values in each column
print(df.isnull().sum())
# KNN imputation of missing values in the 'High' column
imputer = KNNImputer()
df['High'] = imputer.fit_transform(df[['High']])
# Remove outliers: keep only rows whose 'High' z-score is at most 3
z_scores = np.abs(stats.zscore(df['High']))
df = df[z_scores <= 3]
print(df.head())

Assignment 2: Data Transformation and Feature Scaling
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Log transformation of the 'Close' prices
df['Close'] = np.log(df['Close'])
# One-hot encoding (get_dummies creates one indicator column per distinct value;
# in practice it is applied to categorical columns rather than a numerical price column)
df = pd.get_dummies(df, columns=['Close'])
# Min-Max Scaling (rescales 'High' to the [0, 1] range)
scaler = MinMaxScaler()
df['High'] = scaler.fit_transform(df[['High']])
# Standardization (zero mean, unit variance; overwrites the Min-Max result above)
scaler = StandardScaler()
df['High'] = scaler.fit_transform(df[['High']])
print("\nData after Standardization:")
print(df.head())

Output: