Data Preprocessing

Data preprocessing is essential for preparing raw data for machine learning models, involving steps such as importing, cleaning, and splitting data into training and test sets. Key techniques include handling missing data, encoding categorical data, and applying feature scaling to prevent information leakage and improve model performance. Feature scaling standardizes features to the same magnitude, utilizing methods like normalization and standardization to enhance the efficiency of machine learning algorithms.


Data Preprocessing in Machine learning

Data preprocessing is the process of preparing raw data and making it suitable for a
machine learning model. It is the first and a crucial step in building a machine learning
model.

Data Pre-Processing
• Import the data
• Clean the data
• Split into training & test sets
• Feature Scaling

Importing Data & Libraries


Data: DHC_Data.csv
Load Data with Python Standard Library

With the Python Standard Library, you can use the csv module (Comma-Separated
Values) and its reader() function to load your CSV files. The reader returns each row as
a list of strings; these rows can then be converted into a NumPy array for use in machine learning.
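
A minimal sketch of this approach for the dataset named above, assuming DHC_Data.csv has a header row (the actual column layout is not shown in this document):

import csv
import numpy as np

with open('DHC_Data.csv', newline='') as f:
    reader = csv.reader(f)
    header = next(reader)      # skip the header row
    rows = list(reader)        # each row is a list of strings

data = np.array(rows)          # convert to a NumPy array (string dtype until cast)
print(header)
print(data.shape)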

Importing Libraries:

Numpy: NumPy is the fundamental package for scientific computing in Python. It is used
for performing mathematical operations in the code and provides support for large,
multidimensional arrays and matrices.

Matplotlib: The second library is Matplotlib, a Python 2D plotting library; from this
library we import the sub-module pyplot.

Pandas: The last library is Pandas, one of the most popular Python libraries, used for
importing and managing datasets.
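
A typical import block for these three libraries, followed by loading the dataset with Pandas. Treating the last column as the dependent variable is an assumption made here for illustration, since the layout of DHC_Data.csv is not shown:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

dataset = pd.read_csv('DHC_Data.csv')
X = dataset.iloc[:, :-1].values   # independent features (all columns except the last)
y = dataset.iloc[:, -1].values    # dependent variable (assumed to be the last column)
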
Handling Missing data

There are mainly two ways to handle missing data, which are:

By deleting the particular row:

The first way is commonly used to deal with null values. If less than 1% of the data
consists of null values, you can simply delete those rows.

By calculating the mean:

In this way, we calculate the mean of the column (or row) that contains missing values
and put that mean in place of each missing value.

To handle missing values, we will use the Scikit-learn library in our code, which contains
various tools for building machine learning models. Here we will use the Imputer class of
the sklearn.preprocessing module; note that in recent versions of scikit-learn this class
has been replaced by SimpleImputer in sklearn.impute.
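
A short sketch using the current SimpleImputer API, continuing from the X array loaded above; the choice of columns 1 and 2 as the numeric columns with missing values is purely illustrative:

import numpy as np
from sklearn.impute import SimpleImputer

# Replace missing values (np.nan) with the mean of each column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X[:, 1:3] = imputer.fit_transform(X[:, 1:3])   # columns 1 and 2 chosen for illustration
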
Encoding Categorical Data
Encoding categorical data is the process of converting categorical data into an integer format
so that it can be provided to different models. Categorical data is typically stored as
strings or object data types, but machine learning and deep learning algorithms can
work only with numbers.

Categorical Data: Department & Purchased
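
One common sketch for the two columns named above, assuming Department is an independent feature (one-hot encoded, here assumed to sit at column index 0 of X) and Purchased is the dependent variable y (label encoded):

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

# One-hot encode the 'Department' column (assumed to be at index 0 of X)
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])],
                       remainder='passthrough')   # keep the remaining columns unchanged
X = np.array(ct.fit_transform(X))

# Label encode the 'Purchased' column (the dependent variable y)
le = LabelEncoder()
y = le.fit_transform(y)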

Split into training & test sets

To accurately assess your ML model’s performance without overfitting or underfitting
issues, it is necessary to split your dataset into two separate sets:
 Training set: used to train the algorithm on real-world examples.
 Testing set: used later to evaluate its generalization capability on unseen
instances. It is common to use a train/test split of 70/30 or 80/20.

Splitting data into training and testing sets is an essential step in the development of
machine learning models. It involves dividing the available dataset into separate
subsets for training, validation, and testing the model. The most common approach is to
split the dataset into a training set and a testing set. The training set is used to train
the model, while the testing set is used to evaluate the model’s performance. The
regular split is 70-80% for training and 20-30% for testing, but this may vary depending
on the size of the dataset and the specific use case.
The primary reason for splitting data into training and testing sets is to prevent
overfitting. Overfitting occurs when a model is fitted too closely to the training data,
resulting in poor performance on new, unseen data. By evaluating the model’s
performance on a separate testing set, we can estimate how well it will perform on new
data.
It’s important to note that splitting data into training and testing sets is not enough on its own to
prevent overfitting. Other techniques such as cross-validation and regularization are
also used to prevent overfitting.
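
A minimal sketch of the split using scikit-learn's train_test_split with an 80/20 split; the test_size and random_state values here are illustrative choices:

from sklearn.model_selection import train_test_split

# Hold out 20% of the data for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)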

Why do we apply feature scaling after splitting the data into training and test sets?

The test set is supposed to be a brand-new set on which your machine learning model is
evaluated. You train your machine learning model on the training set and then deploy it
on new observations, so the test set is not supposed to take part in training. Feature
scaling is a technique that uses the mean and standard deviation of the features. If we
apply feature scaling before the split, those statistics are computed over all the values,
including the ones in the test set. Applying feature scaling to the original data before the
split therefore causes information leakage from the test set: we take information from
data that is supposed to represent new, unseen observations.

Point to note: apply feature scaling after splitting the data into test and training sets,
to prevent information leakage from the test set.
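
A sketch of how this is usually done with scikit-learn's StandardScaler: the scaler is fitted on the training set only, and the test set is transformed using the statistics learned from the training set. Scaling all columns of X_train and X_test here is an assumption; in practice only the numeric columns that need it are scaled:

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)   # learn mean/std from the training set and scale it
X_test = sc.transform(X_test)         # reuse the training statistics; no information from the test set is used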
Feature scaling

Feature scaling is a technique used in machine learning to standardize the independent
features present in the data to a fixed range. It is performed during the data
pre-processing stage. The purpose of feature scaling is to bring all the features to the
same level of magnitude, which helps improve the performance of machine learning
algorithms that rely on optimization methods or on some kind of distance metric.
There are different methods for feature scaling, including standardization, min-max
scaling, and unit vector scaling. Standardization scales the data to have a mean of
zero and a standard deviation of one. Min-max scaling scales the data to a fixed range,
usually between 0 and 1. Unit vector scaling scales each sample to have a length of 1.
Feature scaling is important because it helps avoid bias towards features with
higher magnitudes, which can lead to poor performance of machine learning models. It
also helps reduce the time required for training machine learning models.

Normalization

Xn = (X - Xmin) / (Xmax - Xmin)

Xn = normalized value of the feature

X = original value of the feature

Xmax = maximum value of the feature

Xmin = minimum value of the feature

Standardization
Xs = (X - Mean) / Standard Deviation

Xs = standardized value of the feature

X = original value of the feature

Mean, Standard Deviation = mean and standard deviation of that feature
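
A small worked example of both formulas on a single illustrative feature (the values are made up for demonstration):

import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])   # illustrative feature values

# Normalization (min-max scaling): results lie in [0, 1]
x_norm = (x - x.min()) / (x.max() - x.min())
print(x_norm)    # [0.   0.25 0.5  0.75 1.  ]

# Standardization: zero mean and unit standard deviation
x_std = (x - x.mean()) / x.std()
print(x_std)     # approximately [-1.41 -0.71  0.    0.71  1.41]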
