Data Preprocessing in Machine Learning
Data preprocessing is the process of preparing raw data and making it suitable for a
machine learning model. It is the first and most crucial step in creating a machine learning
model.
Data Pre-Processing
• Import the data
• Clean the data
• Split into training & test sets
• Feature Scaling
Importing Data & Libraries
Data: DHC_Data.csv
Load Data with Python Standard Library
With the Python Standard Library, you will use the csv module (Comma-Separated
Values) and its reader() function to load your CSV files. Upon loading, the rows can be
converted to a NumPy array, which can then be used for machine learning.
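A minimal sketch of this approach, assuming DHC_Data.csv sits in the working directory and that its first row is a header:

import csv
import numpy as np

# Read all rows of the CSV file with the standard-library reader
with open('DHC_Data.csv', 'r', newline='') as f:
    reader = csv.reader(f)
    rows = list(reader)

header, data_rows = rows[0], rows[1:]   # assume the first row holds column names
data = np.array(data_rows)              # convert the remaining rows to a NumPy array (of strings)
print(header)
print(data.shape)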
Importing Libraries:
NumPy: The NumPy library is used for performing mathematical operations in the code.
It is the fundamental package for scientific computing in Python, and it provides support
for large, multidimensional arrays and matrices.
Matplotlib: The second library is Matplotlib, a Python 2D plotting library; from this
library we need to import the sub-module pyplot.
Pandas: The last library is Pandas, which is one of the most widely used Python
libraries, and it is used for importing and managing datasets.
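A minimal sketch of the three imports and of loading the dataset with pandas; treating the last column as the target and the rest as features is an assumption made for illustration:

import numpy as np                 # mathematical operations and arrays
import matplotlib.pyplot as plt    # 2D plotting via the pyplot sub-module
import pandas as pd                # importing and managing datasets

# Load the dataset and separate features (X) from the target (y)
dataset = pd.read_csv('DHC_Data.csv')
X = dataset.iloc[:, :-1].values    # all columns except the last (assumed feature columns)
y = dataset.iloc[:, -1].values     # last column (assumed target, e.g. Purchased)
print(dataset.head())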
Handling Missing Data
There are mainly two ways to handle missing data, which are:
By deleting the particular row:
The first way is commonly used to deal with null values. If less than about 1% of the data
consists of null values, you can simply delete the affected rows.
By calculating the mean:
In this way, we calculate the mean of the column (or row) that contains the missing
value and put it in place of the missing value.
To handle missing values, we will use the scikit-learn library in our code, which contains
various tools for building machine learning models. Here we will use the Imputer class
from the sklearn.preprocessing module (replaced by SimpleImputer in sklearn.impute in
recent versions of scikit-learn).
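A minimal sketch of mean imputation, written against the current SimpleImputer API; the assumption that the numeric columns with missing values are columns 1 and 2 of X is made purely for illustration:

import numpy as np
from sklearn.impute import SimpleImputer   # successor of sklearn.preprocessing.Imputer

# Replace missing values (NaN) with the mean of each column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X[:, 1:3] = imputer.fit_transform(X[:, 1:3])   # assumed numeric columns containing missing values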
Encoding Categorical Data
Encoding categorical data is the process of converting categorical data into a numeric
(integer) format so that the data can be provided to the different models. Categorical data
comes in the form of strings or object data types, but machine learning and deep learning
algorithms can work only on numbers.
Categorical Data: Department & Purchased
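A minimal sketch of encoding these two columns, assuming Department is column 0 of X and Purchased is the target vector y; one-hot encoding is used for the feature and label encoding for the target:

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

# One-hot encode the 'Department' feature (assumed to be column 0 of X)
ct = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), [0])],
    remainder='passthrough')            # keep the remaining columns unchanged
X = np.array(ct.fit_transform(X))

# Label encode the 'Purchased' target (e.g. 'No' -> 0, 'Yes' -> 1)
le = LabelEncoder()
y = le.fit_transform(y)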
Split into training & test sets
To accurately assess your ML model’s performance without overfitting or underfitting
issues, it’s necessary to split your dataset into two separate sets:
• Training set: helps train the algorithm on real-world examples.
• Testing set: used later for evaluating its generalization capabilities on unseen instances.
It is common to use a train/test split of 70/30 or 80/20.
Splitting data into training and testing sets is an essential step in the development of
machine learning models. It involves dividing the available dataset into separate
subsets for training, validation, and testing the model. The most common approach is to
split the dataset into a training set and a testing set. The training set is used to train
the model, while the testing set is used to evaluate the model's performance. The
usual split is 70-80% for training and 20-30% for testing, but this may vary depending
on the size of the dataset and the specific use case.
The primary reason for splitting data into training and testing sets is to prevent
overfitting. Overfitting occurs when a model fits the training data too closely,
resulting in poor performance on new, unseen data. By evaluating the model's
performance on a separate testing set, we can estimate how well it will perform on new
data.
It's important to note that splitting data into training and testing sets is not enough on its
own to prevent overfitting. Other techniques, such as cross-validation and regularization,
are also used to prevent overfitting.
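A minimal sketch of an 80/20 split with scikit-learn's train_test_split; the fixed random_state is only there to make the split reproducible:

from sklearn.model_selection import train_test_split

# 80% of the observations for training, 20% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)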
Why do we apply feature scaling after splitting the data into training and test sets?
The test set is supposed to be a brand-new set on which your machine learning model
is evaluated. You train your machine learning model on the training set, and only
afterwards do you deploy it on new observations, so the test set should not take part in
training. Feature scaling is a technique that computes statistics such as the mean and
standard deviation of the features. If we applied feature scaling before the split, those
statistics would be computed over all the values, including those of the test set.
Applying feature scaling to the original data before the split therefore causes information
leakage from the test set: we would be using information from data that is not supposed
to be available, because the test set is meant to represent new, unseen observations.
Point to be noted: apply feature scaling after splitting the data into test and training sets
to prevent information leakage from the test set.
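A minimal sketch of scaling applied after the split: the scaler is fitted on the training set only, so the test set is transformed with the training-set mean and standard deviation and no information leaks from it. In practice you may want to restrict scaling to the numeric columns only:

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)   # mean and standard deviation computed from X_train only
X_test = sc.transform(X_test)         # test set scaled with the training statistics (no leakage)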
Feature Scaling
Feature scaling is a technique used in machine learning to standardize the
independent features present in the data in a fixed range. It is performed during the data
pre-processing stage. The purpose of feature scaling is to bring all the features to the
same level of magnitude, which helps improve the performance of machine learning
algorithms that rely on optimization methods or on some kind of distance metric.
There are different methods for feature scaling, including standardization, min-max
scaling, and unit vector scaling. Standardization scales the data to have a mean of
zero and a standard deviation of one. Min-max scaling scales the data to a fixed range,
usually between 0 and 1. Unit vector scaling scales the data to have a length of 1.
Feature scaling is important because it helps in avoiding bias towards features with
higher magnitudes, which can lead to poor performance of machine learning models. It
also helps in reducing the time required for training machine learning models.
Normalization
Xn = (X - Xmin) / (Xmax - Xmin)
Xn = normalized value
X = current value of the feature
Xmax = maximum value of the feature
Xmin = minimum value of the feature
Standardization
Standardization = (Current_value – Mean) / Standard Deviation.
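A minimal sketch applying both formulas to a small made-up feature vector with NumPy, showing that normalization maps the values into [0, 1] while standardization yields zero mean and unit standard deviation:

import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])   # made-up feature values

# Normalization (min-max scaling): Xn = (X - Xmin) / (Xmax - Xmin)
x_norm = (x - x.min()) / (x.max() - x.min())
print(x_norm)                        # [0.   0.25 0.5  0.75 1.  ]

# Standardization: (Current_value - Mean) / Standard Deviation
x_std = (x - x.mean()) / x.std()
print(x_std.mean(), x_std.std())     # approximately 0.0 and exactly 1.0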