Data Preprocessing Steps for Machine Learning in Python (Part 1)
by Learn with Nas | Women in Technology
This preparatory phase not only enhances the overall quality of the data but
also streamlines the modelling process, ultimately leading to more reliable
and accurate predictive models. This article delves into the vital role that
Data Preprocessing plays in the context of Machine Learning, shedding light
on its various aspects and emphasizing its necessity for achieving
meaningful and impactful results.
Why is it important?
The significance of Data Preprocessing in Machine Learning cannot be
overstated, as it forms the cornerstone of any successful data analysis or
machine learning endeavour. In the realm of data-driven technologies, the
quality and suitability of data directly influence the outcomes and
effectiveness of machine learning models.
The key data preprocessing steps covered in this series are:
1. Data Collection
2. Data Cleaning
3. Data Transformation
4. Feature Scaling
5. Feature Selection
6. Handling Imbalanced Data
7. Encoding Features
8. Data Splitting
While some companies have been accumulating data for years, ensuring a
steady supply for machine learning, those lacking sufficient data can turn to
reference datasets available online to complete their AI projects. Discovering
new data and sharing it can be achieved through three methods:
collaborative analysis (DataHub), web (Google Fusion Tables, CKAN, Quandl,
and Data Market), and a combination of collaboration and web use (Kaggle).
Additionally, there are specialized data retrieval systems, including data
lakes (Google Data Search) and web-based platforms (WebTables) [3].
Suppose we have all the necessary data; we can proceed with creating a
dataset.
# import libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# load the dataset (illustrative file name; the original path is not shown)
df = pd.read_csv('credit_data.csv')

df.head(5)
Result:
Data Description
Implementing in Python:
2-a: Handling missing values
Result:
Findings:
There are missing values in the days_employed and total_income columns: both contain fewer non-null entries than the expected 21,525 rows.
Result:
Findings:
The missing-value percentage for both columns is around 10%.
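As a quick sketch, these counts and percentages can be reproduced with pandas:

# count of missing values per column
print(df.isna().sum())

# percentage of missing values per column
print((df.isna().mean() * 100).round(2))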
# Visualizing missing data using a seaborn heatmap.
plt.figure(figsize=(10, 6))
sns.heatmap(df.isna().transpose(),
            cmap="YlGnBu",
            cbar_kws={'label': 'Missing Data'})
Result:
Findings:
1. Missing values form a pattern. The missing values are caused by job types
where clients with the job types ‘student’ and ‘unemployed’ do not have any
income, leading them to leave the ‘days_employed’ and ‘total_income’
columns empty.
2. This conclusion is reinforced by the pattern shown in the seaborn
heatmap, indicating that when the value in the ‘days_employed’ column is
missing, the data in the same row for ‘total_income’ is also missing
(symmetrical).
3. Since the missing values are only present in the ‘days_employed’ and
‘total_income’ columns, and both of these columns have float data types,
which fall under the Numeric/Ratio category, the missing data will be filled
using statistical calculations (such as Mean, Median).
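As a minimal sketch of such a fill, grouping by job type so that each gap receives the median of comparable clients (the job-type column name income_type is an assumption, as the actual name is not shown above):

# fill gaps with the median of each job-type group
# ('income_type' is an assumed column name)
for col in ['days_employed', 'total_income']:
    df[col] = df.groupby('income_type')[col].transform(
        lambda s: s.fillna(s.median()))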
1. Based on the statistical data above, I will replace the value 20 with the value 2, assuming it was an input error.
2. I will remove the minus sign (-) from negative values, assuming it was an input error.
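A rough sketch of both fixes (the column holding the value 20 is assumed to be children, which is not named above; days_employed is taken from the findings):

# replace the presumed input error 20 with 2 ('children' is an assumed column name)
df['children'] = df['children'].replace(20, 2)

# drop the minus sign by taking absolute values
df['days_employed'] = df['days_employed'].abs()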
Result:
0.8990011614401858
Findings:
The resulting value will be rounded.
Result:
Findings:
The mean value does not represent the data because it is skewed by outliers. Therefore, the outliers will be replaced with the median value.
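One possible sketch of this replacement, assuming outliers are flagged with the common 1.5 * IQR rule (the detection method is not specified above):

# flag values outside the 1.5 * IQR fences and replace them with the median
q1 = df['total_income'].quantile(0.25)
q3 = df['total_income'].quantile(0.75)
iqr = q3 - q1
outliers = (df['total_income'] < q1 - 1.5 * iqr) | (df['total_income'] > q3 + 1.5 * iqr)
df.loc[outliers, 'total_income'] = df['total_income'].median()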
Result:
1. The dataset contains duplicate rows.
2. These duplicate entries will be removed, and the index will be reset.
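A minimal sketch of this cleanup step:

# count, drop, and re-index fully duplicated rows
print(df.duplicated().sum())
df = df.drop_duplicates().reset_index(drop=True)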
You can visit my GitHub account to access the complete code related to the
above example:
We will use groupby() to analyze whether user reviews and professional reviews influence platform sales.
Implementing in Python:
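As a hedged sketch of this analysis (the column names platform, user_score, critic_score, and total_sales are illustrative, since the actual dataset is not shown here):

# average review scores and summed sales per platform
platform_stats = df.groupby('platform').agg(
    avg_user_score=('user_score', 'mean'),
    avg_critic_score=('critic_score', 'mean'),
    total_sales=('total_sales', 'sum'))
print(platform_stats.sort_values('total_sales', ascending=False))

# correlation between review scores and sales
print(df[['user_score', 'critic_score', 'total_sales']].corr())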
You can visit my GitHub account to access the complete code related to the
above example:
Feature Scaling
Feature scaling is a crucial step in data preprocessing, aiming to standardize
the values of features or variables within a dataset to a uniform scale. The
primary objective is to ensure that all features have a fair influence on the
model, avoiding the dominance of features with higher values. The necessity
for feature scaling arises when working with datasets that encompass
features having diverse ranges, units of measurement, or orders of
magnitude. In such scenarios, discrepancies in feature values can introduce
bias in model performance or hinder the learning process. Through the
application of feature scaling, the features in a dataset can be harmonized to
a consistent scale, simplifying the construction of precise and efficient
machine learning models. Scaling promotes meaningful feature
comparisons, enhances model convergence, and prevents specific features
from dominating others solely based on their magnitude [10].
2. Distance-Based Algorithms
Algorithms based on distance metrics, such as K-nearest neighbors (KNN),
K-means clustering, and support vector machines (SVM), are highly
influenced by the range of features. This is because these algorithms rely on
calculating distances between data points to ascertain their similarity [10].
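A tiny illustration of why this matters (hypothetical numbers: a one-year age difference versus a 1,000-dollar income difference):

import numpy as np

a = np.array([30, 50_000])   # [age in years, income in dollars]
b = np.array([31, 51_000])

# the Euclidean distance is dominated almost entirely by income,
# even though both features changed by 'one step' on their own scale
print(np.linalg.norm(a - b))  # ~1000.0005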
Implementing in Python:
The df dataset:
# function for calculating kNN distance
from sklearn.neighbors import NearestNeighbors

def get_knn(df, n, k, metric):
    """
    Display k nearest neighbors:
    :param df: pandas DataFrame used to find similar objects within it
    :param n: number of the object for which k nearest neighbors are sought
    :param k: number of k nearest neighbors to be displayed
    :param metric: name of the distance metric
    """
    # fit a NearestNeighbors model and query it; this step was missing from
    # the snippet, and 'feature_names' is assumed to be defined earlier
    nbrs = NearestNeighbors(n_neighbors=k, metric=metric)
    nbrs.fit(df[feature_names])
    nbrs_distances, nbrs_indices = nbrs.kneighbors(
        [df.iloc[n][feature_names]], k, return_distance=True)

    df_res = pd.concat([
        df.iloc[nbrs_indices[0]],
        pd.DataFrame(nbrs_distances.T, index=nbrs_indices[0],
                     columns=['distance'])
    ], axis=1)
    return df_res
Result:
# manhattan metric - unscaled data
get_knn(df, 1, 50, 'manhattan')
Findings:
When using unscaled data, the results are the same for both distance metrics: referring to the generated indices, index 1 has the same nearest neighbours (indices 3920, 4948, 2528, and 3593) in both cases.
For instance, age and income are on different scales (age in years, income in dollars), hence data scaling is necessary.
MaxAbsScaler scales the data by its maximum absolute value; that is, each observation is divided by the maximum absolute value of the variable. The result of this transformation is a distribution whose values roughly vary within the range of -1 to 1.

import sklearn.preprocessing

transformer_mas = sklearn.preprocessing.MaxAbsScaler().fit(df[feature_names].to_numpy())

df_scaled = df.copy()
df_scaled.loc[:, feature_names] = transformer_mas.transform(df[feature_names].to_numpy())
The question is: Does non-scaled data affect the kNN algorithm? If it does,
how does it affect it?
Yes. When the data is not scaled, features with larger numeric ranges dominate the distance calculation, so the neighbours come out the same regardless of the metric used, and the results may be inaccurate because of the differing scales of the columns.
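As a follow-up sketch, re-running the helper above on the scaled copy shows how the metric choice starts to matter once the feature ranges are comparable:

# neighbours on the scaled data, for both metrics
print(get_knn(df_scaled, 1, 50, 'euclidean').head())
print(get_knn(df_scaled, 1, 50, 'manhattan').head())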
You can visit my GitHub account to access the complete code related to the
above example:
Normalization
Normalization, a data preprocessing approach, standardizes feature values
within a dataset to a consistent scale. This is carried out to streamline data
analysis and modeling, mitigating the influence of disparate scales on
machine learning model accuracy [10].
Implementing in Python:
# import library
from sklearn.preprocessing import MinMaxScaler
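A minimal usage sketch (df_numerical stands for a list of numeric column names, an assumption borrowed from the standardization example below):

# rescale numeric columns to the [0, 1] range
scaler = MinMaxScaler()
df[df_numerical] = scaler.fit_transform(df[df_numerical])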
Standardization
Standardization, a form of scaling, involves centering values around the mean and adjusting the standard deviation to one unit, i.e. each value x is transformed to z = (x - mean) / std. Consequently, the attribute's mean becomes zero, and the resulting distribution maintains a unit standard deviation [10].
Implementing in Python:
Result:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()  # fit on the training set only, then reuse
features_train[df_numerical] = scaler.fit_transform(features_train[df_numerical])
features_valid[df_numerical] = scaler.transform(features_valid[df_numerical])
features_test[df_numerical] = scaler.transform(features_test[df_numerical])
Result:
Findings:
Both the F1 score and the AUC-ROC score increased after the features were standardized.
You can visit my GitHub account to access the complete code related to the
above example:
In Part 2, I will delve into topics such as Feature Selection, handling imbalanced datasets, Encoding Features, and Data Splitting. Keep an eye out for this continuation, where we'll explore these essential steps in detail!
References:
2. https://www.techtarget.com/searchdatamanagement/definition/data-preprocessing
3. https://labelyourdata.com/articles/data-collection-methods-AI
4. https://www.geeksforgeeks.org/data-preprocessing-in-data-mining/
5. https://stats.stackexchange.com/questions/143700/which-is-better-replacement-by-mean-and-replacement-by-median
6. https://www.spiceworks.com/tech/big-data/articles/what-is-data-transformation/
7. Data Science Wizards, Introduction to Data Transformation (2023)
8. https://jakevdp.github.io/PythonDataScienceHandbook/03.08-aggregation-and-grouping.html
9. https://jakevdp.github.io/PythonDataScienceHandbook/03.09-pivot-tables.html
11. https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
12. Nate Rosidi, Advanced Feature Selection Techniques for Machine Learning Models (2023)