Data Preprocessing in Machine Learning
Data preprocessing in Machine Learning is a crucial step that improves the quality of
data so that meaningful insights can be extracted from it. It refers to the technique of
preparing (cleaning and organizing) raw data to make it suitable for building and training
Machine Learning models. In simple words, data preprocessing in Machine Learning is a data
mining technique that transforms raw data into an understandable and readable format.
When it comes to creating a Machine Learning model, data preprocessing is the first
step of the process. Typically, real-world data is incomplete, inconsistent, inaccurate
(contains errors or outliers), and often lacks specific attribute values or trends. This is
where data preprocessing comes in: it helps to clean, format, and organize the raw data,
thereby making it ready for Machine Learning models. Let’s explore the various steps of
data preprocessing in machine learning.
1. Acquiring the dataset:
Acquiring the dataset is the first step in data preprocessing in machine learning. To
build and develop Machine Learning models, you must first acquire the relevant dataset.
This dataset typically comprises data gathered from multiple, disparate sources, which is
then combined into a single, consistent format.
2. Importing the libraries:
Since Python is the most extensively used and most preferred language among Data
Scientists around the world, we’ll show you how to import Python libraries for data
preprocessing in Machine Learning. These predefined libraries each perform specific
data preprocessing jobs. Importing all the crucial libraries is the second step in data
preprocessing in machine learning. The three core Python libraries used for data
preprocessing in Machine Learning are:
NumPy – NumPy is the fundamental package for scientific computing in Python. It is used
for performing mathematical operations in the code, and it provides support for large
multidimensional arrays and matrices.
Pandas – Pandas is an excellent open-source Python library for data manipulation
and analysis. It is extensively used for importing and managing the datasets. It packs in
high-performance, easy-to-use data structures and data analysis tools for Python.
Matplotlib – Matplotlib is a Python 2D plotting library that is used to plot many types of
charts in Python. It can deliver publication-quality figures in numerous hard copy formats
and interactive environments across platforms (IPython shells, Jupyter notebooks, web
application servers, etc.).
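The three libraries above are conventionally imported as sketched below. The aliases np, pd,
and plt are a widely used convention rather than a requirement, and the small array and
table are made-up values that only confirm the imports work.

```python
import numpy as np               # numerical arrays and mathematical operations
import pandas as pd              # importing and managing datasets
import matplotlib.pyplot as plt  # plotting charts

# Hypothetical values, used only to check that the libraries are available.
matrix = np.array([[1, 2, 3], [4, 5, 6]])               # a 2 x 3 array
table = pd.DataFrame(matrix, columns=["a", "b", "c"])   # the same data as a table

print(matrix.shape)    # dimensions of the array
print(table.mean())    # column-wise means
```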
3. Importing the dataset:
In this step, you need to import the dataset(s) that you have gathered for the ML
project at hand. Importing the dataset is one of the important steps in data preprocessing in
machine learning. However, before you can import the dataset(s), you must set the directory
that contains them as the current working directory.
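A minimal sketch of this step with pandas is shown below. The file name "Data.csv", the
column names, and the values are all hypothetical; an in-memory string stands in for the
file so that the example runs anywhere. Treating the last column as the output is also an
assumption made for illustration.

```python
import io
import os

import pandas as pd

# os.chdir("path/to/project")   # hypothetical path; uncomment to change the working directory
print(os.getcwd())              # shows the current working directory

# Stand-in for a file on disk. With a real file you would call
# pd.read_csv("Data.csv"), where "Data.csv" is a hypothetical file name.
csv_text = io.StringIO(
    "Country,Age,Salary,Purchased\n"
    "France,44,72000,No\n"
    "Spain,27,48000,Yes\n"
    "Germany,30,,No\n"
)
dataset = pd.read_csv(csv_text)

# Assumption: every column except the last is a feature, the last is the output.
X = dataset.iloc[:, :-1]
y = dataset.iloc[:, -1]

print(dataset.head())
```

Empty fields in the file, such as the missing salary in the last row, are read as NaN.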
4. Handling missing values:
Two common ways of handling missing values are:
Deleting a particular row or column – In this method, you remove a specific row that has a
null value for a feature, or a particular column where more than 75% of the values are
missing. However, this method discards data, so it is recommended that you use it only
when the dataset has adequate samples. You must also ensure that deleting the data does
not introduce bias.
Calculating the mean – This method is useful for features having numeric data like
age, salary, year, etc. Here, you calculate the mean, median, or mode of the feature
(column) that contains a missing value and substitute the result for the missing value.
Because no rows or columns are discarded, the loss of data is avoided, which often yields
better results than the first method (omission of rows/columns). Note, however, that
filling every gap with the same value tends to understate the feature's variance. Another
way of approximation is to estimate the missing value from its neighboring values.
However, this works best for linear data.
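Both approaches above can be sketched with pandas as follows. The column names and values
are hypothetical, and the 75% threshold is the one quoted in the text; this is an
illustration under those assumptions rather than a recommended recipe.

```python
import numpy as np
import pandas as pd

# Hypothetical data; np.nan marks a missing value.
df = pd.DataFrame({
    "Age":    [44.0, 27.0, np.nan, 38.0, 50.0],
    "Salary": [72000.0, np.nan, 54000.0, 61000.0, 58000.0],
    "Notes":  [np.nan, np.nan, np.nan, np.nan, "ok"],   # 80% missing
})

# Method 1: deletion.
missing_share = df.isna().mean()              # fraction of missing values per column
kept = df.loc[:, missing_share <= 0.75]       # drop columns that are more than 75% missing
rows_dropped = kept.dropna()                  # drop rows that still contain a null value

# Method 2: replace each missing value with the mean of its column.
filled = kept.fillna(kept.mean())

# Approximation from neighboring values (linear interpolation between them).
interpolated = kept["Age"].interpolate()

print(rows_dropped)
print(filled)
print(interpolated)
```

Swapping `kept.mean()` for `kept.median()` gives the median-based variant of the second
method.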
5. Encoding categorical data:
Machine Learning models are primarily based on mathematical equations. Thus, you
can intuitively understand that keeping categorical data in the equations will cause
issues, since the equations can only work with numbers. Categorical values therefore
have to be converted into numeric form first.
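One common way to turn categories into numbers is one-hot encoding, sketched here with
`pandas.get_dummies`. The text does not name a specific technique, so this choice, along
with the column names and values, is an assumption made for illustration.

```python
import pandas as pd

# Hypothetical data with two categorical columns.
df = pd.DataFrame({
    "Country":   ["France", "Spain", "Germany", "Spain"],
    "Purchased": ["No", "Yes", "No", "Yes"],
})

# One-hot encode "Country": one 0/1 column per category.
encoded = pd.get_dummies(df, columns=["Country"], dtype=int)

# A two-valued column can simply be mapped to 0 and 1.
encoded["Purchased"] = encoded["Purchased"].map({"No": 0, "Yes": 1})

print(encoded)
```

One-hot columns avoid implying an order between categories, which a single numeric code
such as France = 0, Germany = 1, Spain = 2 would suggest.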
6. Splitting the dataset:
Splitting the dataset is the next step in data preprocessing in machine learning. Every
dataset for a Machine Learning model must be split into two separate sets: a training set
and a test set.
The training set is the subset of the dataset that is used for training the machine
learning model. Here, the outputs are already known to the model. The test set, on the
other hand, is the subset of the dataset that is used for testing the machine learning
model. The ML model predicts outcomes for the test set, and these predictions are compared
with the actual outputs to evaluate the model.
Usually, the dataset is split so that 80% of the data is used for training the model and
the remaining 20% is held out for testing.
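The 80/20 split can be sketched with NumPy and pandas alone, as below. The dataset, its
column names, and the fixed seed are hypothetical choices made so that the example is
repeatable.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with 10 rows.
dataset = pd.DataFrame({
    "feature": np.arange(10),
    "output":  np.arange(10) % 2,
})

rng = np.random.default_rng(seed=0)         # fixed seed so the split is repeatable
shuffled = rng.permutation(len(dataset))    # row positions in random order
n_train = int(0.8 * len(dataset))           # 80% of the rows go to training

train_set = dataset.iloc[shuffled[:n_train]]
test_set = dataset.iloc[shuffled[n_train:]]

print(len(train_set), len(test_set))
```

Shuffling before splitting keeps any ordering in the original file from leaking into one
of the two sets. If scikit-learn is available, its `train_test_split` function with
`test_size=0.2` performs the same kind of split.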
7. Feature scaling:
Normalization: