
Data Preprocessing Steps for Machine Learning in Python (Part 1)

Learn with Nas
Published in Women in Technology · 14 min read · Sep 30, 2023


Data Preprocessing, also recognized as Data Preparation or Data Cleaning, encompasses the practice of identifying and rectifying erroneous or
misleading records within a dataset. This involves pinpointing flawed,
incomplete, or irrelevant segments of the data and subsequently modifying,
substituting, or eliminating this impure or coarse data [1]. Data
Preprocessing techniques have been adapted to train AI models, including
machine learning models. The techniques are generally used at the early
stages to ensure accurate results [2]. Please be aware that data preprocessing
is a comprehensive term covering a wide range of tasks, spanning from
formatting the data to creating features, all depending on the nature of your
AI project.

This preparatory phase not only enhances the overall quality of the data but
also streamlines the modelling process, ultimately leading to more reliable
and accurate predictive models. This article delves into the vital role that
Data Preprocessing plays in the context of Machine Learning, shedding light
on its various aspects and emphasizing its necessity for achieving
meaningful and impactful results.

Why is it important?
The significance of Data Preprocessing in Machine Learning cannot be
overstated, as it forms the cornerstone of any successful data analysis or
machine learning endeavour. In the realm of data-driven technologies, the
quality and suitability of data directly influence the outcomes and
effectiveness of machine learning models.

Data Preprocessing involves a series of steps such as:


1. Data Collection

2. Data Cleaning

3. Data Transformation

4. Feature Engineering: Scaling, Normalization and Standardization

5. Feature Selection

6. Handling Imbalanced Data

7. Encoding Categorical Features

8. Data Splitting

Step 1: Data Collection


The cornerstone of machine learning is rooted in data. Collecting data
involves gathering information aligned with the goals and objectives of your
AI project. If you feed subpar or low-quality data into your model, it will not
produce satisfactory outcomes. This holds true regardless of the model’s
complexity, the expertise of the data scientist, or the financial investment in
the project [3].

While some companies have been accumulating data for years, ensuring a
steady supply for machine learning, those lacking sufficient data can turn to
reference datasets available online to complete their AI projects. Discovering
new data and sharing it can be achieved through three methods:
collaborative analysis (DataHub), web (Google Fusion Tables, CKAN, Quandl,
and Data Market), and a combination of collaboration and web use (Kaggle).
Additionally, there are specialized data retrieval systems, including data
lakes (Google Data Search) and web-based platforms (WebTables) [3].

Suppose we have all the necessary data; we can proceed with creating a
dataset.

# import library
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# load the data


df = pd.read_csv('data/credit_scoring_eng.csv')

df.head(5)

Result:
Data Description

children - number of children in the family

days_employed - number of days employed

dob_years - client's age in years

education - client's education level

education_id - education identifier

family_status - marital status

family_status_id - marital status identifier

gender - client's gender

income_type - type of employment

debt - whether the client has a loan debt

total_income - monthly income

purpose - purpose of the loan application

Step 2: Data Cleaning


This involves identifying and correcting errors or inconsistencies in the data,
such as missing values, outliers and duplicates. Various techniques can be
used for data cleaning, such as imputation, removal or transformation [4].
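
Before applying these steps to the credit-scoring dataset, here is a minimal sketch of what those three options typically look like in pandas; the DataFrame raw_df and its values are hypothetical and used only for illustration:

import pandas as pd
import numpy as np

# hypothetical example data with a missing value and an extreme income
raw_df = pd.DataFrame({'age': [25, 32, np.nan, 41],
                       'income': [30000, 45000, 52000, 9000000]})

# imputation: fill the missing age with the column median
raw_df['age'] = raw_df['age'].fillna(raw_df['age'].median())

# removal: drop any rows that still contain missing values
raw_df = raw_df.dropna()

# transformation: cap extreme incomes at the 95th percentile
raw_df['income'] = raw_df['income'].clip(upper=raw_df['income'].quantile(0.95))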

Implementing in Python:
2-a: Handling missing values

# check the dataset information


df.info()

Result:

Findings:
There are missing values in the days_employed and total_income columns, since these columns contain fewer non-null rows than the expected 21,525.

# check the percentage


df.isna().sum() / len(df)

Result:

Findings:

The missing-value percentage for both columns is around 10%.

# Visualizing Missing Data using a seaborn heatmap.
plt.figure(figsize=(10,6))
sns.heatmap(df.isna().transpose(),
            cmap="YlGnBu",
            cbar_kws={'label': 'Missing Data'})

Result:

Findings:

1. The missing values form a pattern: clients with the job types 'student' and 'unemployed' do not have any income, so the 'days_employed' and 'total_income' columns are left empty for them.
2. This conclusion is reinforced by the pattern shown in the seaborn
heatmap, indicating that when the value in the ‘days_employed’ column is
missing, the data in the same row for ‘total_income’ is also missing
(symmetrical).

3. Since the missing values are only present in the ‘days_employed’ and
‘total_income’ columns, and both of these columns have float data types,
which fall under the Numeric/Ratio category, the missing data will be filled
using statistical calculations (such as Mean, Median).

4. The median is chosen to fill in the missing values because, unlike the mean, it is robust to outliers [5].

# function to fill in missing values using median
def data_imputation(data, column_grouping, column_selected):
    # Parameter meaning
    # data => The name of the dataframe to be processed
    # column_grouping => The column used to group values and take the median
    # column_selected => The column in which we will fill its NaN values

    # Get unique category groups
    group = data[column_grouping].unique()

    # Loop through each value in the group category
    for value in group:
        # get the median of the non-missing values within this group
        median = data.loc[(data[column_grouping]==value) & ~(data[column_selected].isna()), column_selected].median()

        # change missing values in this group to the group median
        data.loc[(data[column_grouping]==value) & (data[column_selected].isna()), column_selected] = median

    # Return the dataframe after filling the missing values
    return data

# apply the function to 'total_income' column
df = data_imputation(data=df, column_grouping='age_category', column_selected='total_income')

# apply the function to 'days_employed' column
df = data_imputation(data=df, column_grouping='age_category', column_selected='days_employed')

# check the dataset information again
df.info()
Result:
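
As a side note, the same group-wise median imputation can be written more compactly with pandas' groupby().transform(); a minimal sketch, assuming the age_category column used above already exists:

# equivalent group-wise median imputation using transform()
for col in ['total_income', 'days_employed']:
    df[col] = df[col].fillna(df.groupby('age_category')[col].transform('median'))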

2-b: Handling outliers


# check outlier in children column
sns.boxplot(df['children'])

# check statistical data in children column


df['children'].describe()
Findings:

1. Based on the statistical data above, I will replace the value 20 with the
value 2, assuming it was an input error.

2. I will remove the minus sign (-), assuming it was an input error

# replace the value 20 with the value 2


condition_children = df['children']==20
df['children'] = df['children'].mask(condition_children, 2)

# remove minus sign


df['children'] = abs(df['children'])
# check outliers in days_employed column
sns.boxplot(df['days_employed'])

# check the percentage of rows with negative values or extreme outliers
len(df.loc[(df['days_employed'] < 0) | (df['days_employed'] > 200000)]) / len(df)

Result:

0.8990011614401858

Findings:

1. There are 2 issues identified in the ‘days_employed’ column:

Too many digits after the decimal point.

Existence of negative values and outliers, with a high percentage of rows having these conditions (approximately 89%).

2. The steps to solve these issues are as follows:

Remove the minus sign (-).

Perform rounding.

Replace the outlier values.

# remove minus sign (-), assuming it was an input error


df['days_employed'] = abs(df['days_employed'])
# round
df['days_employed'] = round(df['days_employed'],0)

# check data distribution


df['days_employed'].describe()

Result:

Findings:

The mean does not represent the data well because it is distorted by the outliers. Therefore, the outliers will be replaced with the median value.

# Replace outlier with median
condition_de = (df['days_employed'] > 200000) & (df['days_employed'].notnull())
df['days_employed'] = df['days_employed'].mask(condition_de, df['days_employed'].median())

# verify the result


sns.boxplot(df['days_employed'])

Result:
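
The threshold of 200,000 days used above was chosen by inspecting the data. A more general alternative for flagging outliers is the interquartile range (IQR) rule; the sketch below illustrates it on the same column and is an addition to the article's approach, not the method used above:

# IQR-based outlier fences for days_employed (alternative approach)
q1 = df['days_employed'].quantile(0.25)
q3 = df['days_employed'].quantile(0.75)
iqr = q3 - q1

lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# share of rows falling outside the IQR fences
outlier_mask = (df['days_employed'] < lower_bound) | (df['days_employed'] > upper_bound)
print('Share of IQR outliers:', outlier_mask.mean())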

2-c: Handling Duplicates


Findings

1. There are 72 identified duplicate data entries.

2. These duplicate data entries will be removed, and the index will be reset.

# remove duplicate data and do reset index


df = df.drop_duplicates().reset_index(drop=True)
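
To double-check the cleanup, the duplicate count can be recomputed with duplicated(); after drop_duplicates() it should be zero. A minimal sketch:

# verify that no fully duplicated rows remain after the cleanup
print('Remaining duplicates:', df.duplicated().sum())  # expected: 0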

You can visit my GitHub account to access the complete code related to the
above example:

Step 3: Data Transformation


Data transformation involves technically converting data from one format,
standard, or structure to another, without changing the dataset’s content.
This is typically done to prepare the data for consumption by an application
or user, or to enhance data quality. The specifics of data transformation can
vary based on the techniques employed. In this article, I will utilize a data
aggregation technique for data transformation [6].

Data aggregation is a method used to present data in a summarized form.


Given the likelihood of data originating from diverse sources, combining all
incoming data into a cohesive description is the essence of data aggregation.
This facet of data processing holds significance as it hinges on the quality
and quantity of the data at hand. An illustrative example of this process is
generating an annual sales report by consolidating quarterly or monthly data
[7].
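
As a tiny illustration of that idea, the sketch below aggregates hypothetical monthly sales figures into an annual report; the DataFrame sales_df and its columns are invented for this example:

import pandas as pd

# hypothetical monthly sales records
sales_df = pd.DataFrame({
    'year':  [2022, 2022, 2022, 2023, 2023],
    'month': [10, 11, 12, 1, 2],
    'sales': [120, 95, 180, 60, 75],
})

# aggregate the monthly figures into annual totals
annual_sales = sales_df.groupby('year')['sales'].sum().reset_index()
print(annual_sales)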

There are many ways to aggregate data in Pandas, including:

a. Utilizing the groupby() function: Grouping involves breaking down a dataset into smaller subsets depending on specific variables. This approach
is widely employed for data exploration and analysis purposes. The pandas’
groupby() function is highly versatile and allows for the grouping of data
based on one or more columns. By using the groupby() function, we can
group data according to selected variables and subsequently apply a range of
aggregation functions to these groups [8].
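
For instance, a single groupby() call can apply several aggregation functions at once; a minimal self-contained sketch with invented game-sales data:

import pandas as pd

# hypothetical game sales data
games_df = pd.DataFrame({
    'platform': ['PS4', 'PS4', 'XOne', 'XOne', 'PC'],
    'total_sales': [1.2, 0.8, 0.5, 1.1, 0.3],
})

# group by platform and apply several aggregation functions at once
summary = games_df.groupby('platform')['total_sales'].agg(['count', 'mean', 'sum'])
print(summary)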
Implementing in Python:

We will use groupby() to analyze whether user reviews and professional (critic) reviews influence platform sales:

# preparing the dataset
# (the trailing columns and the aggregation are reconstructed, as the original lines are cut off)
top2_ref_df = reference_df.groupby(['platform', 'name'])[['total_sales', 'critic_score', 'user_score']].sum().reset_index()
top2_ref_df = top2_ref_df[['name', 'platform', 'total_sales', 'critic_score', 'user_score']]
top2_ref_df

b. Using the pivot_table() function: We've explored how the GroupBy concept enables us to investigate connections within a dataset. A pivot table is a
similar operation commonly encountered in spreadsheet software and other
programs working with tabular data. When using a pivot table, input data in
a column-wise format is organized into a two-dimensional table, offering a
multidimensional summary of the information. Distinguishing between
pivot tables and GroupBy can be confusing at times. It’s helpful to consider
pivot tables as essentially a multidimensional form of GroupBy aggregation.
In other words, you perform the split-apply-combine process, but in this
case, both the splitting and combining occur not along a one-dimensional
index, but across a two-dimensional grid [9].
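
Put differently, a pivot table arranges one grouping variable along the rows and another along the columns; a minimal self-contained sketch with invented data:

import pandas as pd

# hypothetical sales data with a region dimension
games_df = pd.DataFrame({
    'platform': ['PS4', 'PS4', 'XOne', 'XOne'],
    'region':   ['NA', 'EU', 'NA', 'EU'],
    'sales':    [1.2, 0.8, 0.5, 1.1],
})

# pivot: platforms as rows, regions as columns, summed sales in the cells
pivot = pd.pivot_table(data=games_df, index='platform', columns='region',
                       values='sales', aggfunc='sum')
print(pivot)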

Implementing in Python:

In order to analyze the top 5 platforms in the NA, EU, and JP regions and to visualize the variation in market share from one region to another:

# data aggregation of sales for each platform in NA, EU, JP regions
# (the regional sales column names below are assumed, as the original line is cut off)
agg_selected_region_platform = pd.pivot_table(data=reference_df, index='platform',
                                              values=['na_sales', 'eu_sales', 'jp_sales'],
                                              aggfunc='sum')
agg_selected_region_platform

We can visualize the data:

# visualizing the sales for each platform in NA, EU, JP regions
plt.figure(figsize=(20,6))
plt.title('Distribution of game sales on each platform in the EU, NA, and JP regions')
sns.lineplot(data=agg_selected_region_platform)
plt.show()

You can visit my GitHub account to access the complete code related to the
above example:


Step 4: Feature Engineering: Scaling, Normalization and Standardization

Feature engineering constitutes a pivotal stage in the creation of accurate
and efficient machine learning models. A significant facet of feature
engineering involves scaling, normalization, and standardization,
encompassing the alteration of data to enhance its suitability for modeling.
Employing these methods can enhance model accuracy, mitigate the
influence of outliers, and ensure uniformity in data scale. This article delves
into the fundamentals of scaling, normalization, and standardization [10].

Feature Scaling
Feature scaling is a crucial step in data preprocessing, aiming to standardize
the values of features or variables within a dataset to a uniform scale. The
primary objective is to ensure that all features have a fair influence on the
model, avoiding the dominance of features with higher values. The necessity
for feature scaling arises when working with datasets that encompass
features having diverse ranges, units of measurement, or orders of
magnitude. In such scenarios, discrepancies in feature values can introduce
bias in model performance or hinder the learning process. Through the
application of feature scaling, the features in a dataset can be harmonized to
a consistent scale, simplifying the construction of precise and efficient
machine learning models. Scaling promotes meaningful feature
comparisons, enhances model convergence, and prevents specific features
from dominating others solely based on their magnitude [10].

Why Should We Use Feature Scaling?


Certain machine learning algorithms exhibit sensitivity to feature scaling,
whereas others remain mostly unaffected by it. Let’s delve into a detailed
examination of this aspect.

1. Gradient Descent Based Algorithms

Machine learning algorithms that use gradient descent as an optimization technique (like linear regression, logistic regression, etc.) require data to be scaled [10].

2. Distance-Based Algorithms
Algorithms based on distance metrics, such as K-nearest neighbors (KNN),
K-means clustering, and support vector machines (SVM), are highly
influenced by the range of features. This is because these algorithms rely on
calculating distances between data points to ascertain their similarity [10].

Implementing in Python:

A function will be created to calculate the distance using the k-nearest neighbors algorithm based on two distance metrics: Euclidean and Manhattan. We will then compare the distance results on both unscaled and scaled data.

the df dataset:
# import library
import sklearn.neighbors

# function for calculating kNN distance
def get_knn(df, n, k, metric):

    """
    Display k nearest neighbors:
    param df: Pandas DataFrame used to find similar objects within it
    param n: number of the object for which k nearest neighbors are sought
    param k: number of k nearest neighbors to be displayed
    param metric: name of the distance metric
    """

    nbrs = sklearn.neighbors.NearestNeighbors(n_neighbors=k, metric=metric)
    nbrs.fit(df[feature_names])
    nbrs_distances, nbrs_indices = nbrs.kneighbors([df.iloc[n][feature_names]], k, return_distance=True)

    df_res = pd.concat([
        df.iloc[nbrs_indices[0]],
        pd.DataFrame(nbrs_distances.T, index=nbrs_indices[0], columns=['distance'])
    ], axis=1)

    return df_res

Using unscaled data (df):

# euclidean metric - unscaled data


get_knn(df, 1, 50, 'euclidean')

Result:
# manhattan metric - unscaled data
get_knn(df, 1, 50, 'manhattan')
Findings:

When using unscaled data, the results are the same for both distance metrics: the object at index 1 has the same nearest neighbors (indices 3920, 4948, 2528, 3593) under both Euclidean and Manhattan distance.

Using scaled data:

For instance, age and income have different scales (age = years, income =
dollars), hence data scaling is necessary.

MaxAbsScaler is utilized to scale the data by its maximum absolute value; that is, each observation is divided by the maximum absolute value of the variable. The result of this transformation is a distribution whose values roughly vary within the range of -1 to 1.

# scaling the data using MaxAbsScaler
import sklearn.preprocessing

feature_names = ['gender', 'age', 'income', 'family_members']

transformer_mas = sklearn.preprocessing.MaxAbsScaler().fit(df[feature_names].to_numpy())
df_scaled = df.copy()
df_scaled.loc[:, feature_names] = transformer_mas.transform(df[feature_names].to_numpy())

the df_scaled dataset:

# euclidean metric - scaled data


get_knn(df_scaled, 1, 10, 'euclidean')
# manhattan metric - scaled data
get_knn(df_scaled, 1, 10, 'manhattan')

The question is: does unscaled data affect the kNN algorithm? If it does, how does it affect it?

Yes. When the data is not scaled, the results are the same regardless of the metric used, because the feature with the largest values dominates the distance calculation. As a result, the nearest neighbors found may be inaccurate due to the differences in scale between the columns.

In calculations, it is important to maintain a consistent scale as much as possible. For example, age and income have different scales (age = years, income = dollars).

You can visit my GitHub account to access the complete code related to the
above example:


Normalization
Normalization, a data preprocessing approach, standardizes feature values
within a dataset to a consistent scale. This is carried out to streamline data
analysis and modeling, mitigating the influence of disparate scales on
machine learning model accuracy [10].

Implementing in Python:

# import library
from sklearn.preprocessing import MinMaxScaler

# fit scaler on training data
std = MinMaxScaler().fit(X_train)

# transform training data
X_train_std = std.transform(X_train)

# transform testing data
X_test_std = std.transform(X_test)
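
For reference, MinMaxScaler implements the familiar min-max formula, x_scaled = (x - x_min) / (x_max - x_min), which maps each feature to the range [0, 1]. A minimal sketch of the equivalent manual calculation on a hypothetical DataFrame sample_df (invented for illustration):

import pandas as pd

# hypothetical toy data
sample_df = pd.DataFrame({'age': [23, 35, 47, 59],
                          'income': [20000, 35000, 52000, 90000]})

# manual min-max normalization: (x - min) / (max - min), column by column
sample_scaled = (sample_df - sample_df.min()) / (sample_df.max() - sample_df.min())
print(sample_scaled)  # every column now lies in the range [0, 1]

Note also that in the snippet above the scaler is fit on the training data only and then reused to transform the test data; fitting it on the full dataset would leak information from the test set into training.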

Standardization
Standardization, a form of scaling, involves centering values around the
mean and adjusting the standard deviation to one unit. Consequently, the
attribute’s mean becomes zero, and the resulting distribution maintains a
unit standard deviation [10].
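
In formula form, standardization computes z = (x - mean) / standard deviation for every value. A minimal sketch on a hypothetical Series (invented for illustration):

import pandas as pd

# hypothetical toy data
income = pd.Series([20000, 35000, 52000, 90000])

# manual standardization: subtract the mean, divide by the standard deviation
income_std = (income - income.mean()) / income.std()
print(round(income_std.mean(), 10), round(income_std.std(), 10))  # mean ~ 0, std = 1

(Pandas uses the sample standard deviation by default, whereas scikit-learn's StandardScaler uses the population standard deviation, so the two can differ very slightly.)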

Now, let’s proceed with utilizing scikit-learn’s StandardScaler for


standardizing features. This process involves eliminating the mean and
adjusting the scale to unit variance, ultimately resulting in a mean of 0 and a
standard deviation of 1. This aligns the data with a standard normal
distribution. [11].

Implementing in Python:

We will compare the metric results before and after applying StandardScaler.

Before standardizing the features:

# import libraries
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score

beforeScaling_lr = LogisticRegression(random_state = 42)

# train model on training set
beforeScaling_lr.fit(features_train, target_train)
# predict using validation set
y_predict_valid_lr = beforeScaling_lr.predict(features_valid)
# measuring probability using validation set
y_probability_valid_lr = beforeScaling_lr.predict_proba(features_valid)[:, 1]
# evaluate performance using the F1 score and the AUC-ROC score
print('F1 score =', f1_score(target_valid, y_predict_valid_lr))
print('AUC-ROC score =', roc_auc_score(target_valid, y_probability_valid_lr))

Result:

Standardizing the features:

# standardizing the features using StandardScaler
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

features_train[df_numerical] = scaler.fit_transform(features_train[df_numerical])
features_valid[df_numerical] = scaler.transform(features_valid[df_numerical])
features_test[df_numerical] = scaler.transform(features_test[df_numerical])

After standardizing the features:

afterScaling_lr = LogisticRegression(random_state = 42)

# train model on training set
afterScaling_lr.fit(features_train, target_train)
# predict using validation set
y_predict_valid_lr = afterScaling_lr.predict(features_valid)
# measuring probability using validation set
y_probability_valid_lr = afterScaling_lr.predict_proba(features_valid)[:, 1]
# test performance algorithm using F1 score and auc_score
print('F1 score =', f1_score(target_valid, y_predict_valid_lr))
print('AUC-ROC score =', roc_auc_score(target_valid, y_probability_valid_lr))

Result:

Findings:

Both the F1 score and the AUC-ROC score increased after standardizing the features.

You can visit my GitHub account to access the complete code related to the
above example:

In Part 2, I will delve into topics such as Feature Selection, Handling Imbalanced Data, Encoding Categorical Features, and Data Splitting. Keep an eye out for this continuation, where we'll explore these essential steps in detail!

References:

1. Shaomin Wu, A review on coarse warranty data and analysis (2013)

2. George Lawton, Data Preprocessing (2022)
https://www.techtarget.com/searchdatamanagement/definition/data-preprocessing

3. Yuliia Kniazieva, What is Data Collection in Machine Learning (2022)
https://labelyourdata.com/articles/data-collection-methods-AI

4. Deepak Jain, Data Preprocessing in Data Mining (2023)
https://www.geeksforgeeks.org/data-preprocessing-in-data-mining/

5. https://stats.stackexchange.com/questions/143700/which-is-better-replacement-by-mean-and-replacement-by-median

6. Chiradeep BasuMallick, What Is Data Transformation? Types, Tools, and Importance (2022)
https://www.spiceworks.com/tech/big-data/articles/what-is-data-transformation/

7. Data Science Wizards, Introduction to Data Transformation (2023)
medium.com

8. Jake VanderPlas, Aggregation and Grouping, Python Data Science Handbook
https://jakevdp.github.io/PythonDataScienceHandbook/03.08-aggregation-and-grouping.html

9. Jake VanderPlas, Pivot Tables, Python Data Science Handbook
https://jakevdp.github.io/PythonDataScienceHandbook/03.09-pivot-tables.html

10. Aniruddha Bhandari, Feature Engineering: Scaling, Normalization and Standardization (2023)
www.analyticsvidhya.com

11. Scikit-learn documentation, StandardScaler
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

12. Nate Rosidi, Advanced Feature Selection Techniques for Machine Learning Models (2023)
www.kdnuggets.com
