Data Preprocessing in Machine Learning
Data preprocessing is the process of preparing raw data and making it suitable for a
machine learning model. It is the first and most crucial step in creating a machine learning
model.
Data Pre-Processing
• Import the data
• Clean the data
• Split into training & test sets
• Feature Scaling
Importing Data & Libraries
Data: DHC_Data.csv
Load Data with Python Standard Library
With the Python Standard Library, you will use the csv module (Comma-Separated
Values) and its reader() function to load your CSV files. Upon loading, the rows can be
converted to a NumPy array, which can then be used for machine learning.
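A minimal sketch of this approach, assuming DHC_Data.csv sits in the working directory and that its first row is a header:

import csv
import numpy as np

# Read all rows of the CSV file with the standard-library reader
with open('DHC_Data.csv', 'r', newline='') as f:
    reader = csv.reader(f)
    rows = list(reader)

header, data_rows = rows[0], rows[1:]   # assume the first row holds column names
data = np.array(data_rows)              # convert the remaining rows to a NumPy array (of strings)
print(header)
print(data.shape)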
Importing Libraries:
NumPy: The NumPy library is used for performing mathematical operations in the code.
It is the fundamental package for scientific computing in Python, and it provides support
for large, multidimensional arrays and matrices.
Matplotlib: The second library is Matplotlib, a Python 2D plotting library; from this
library we need to import the sub-module pyplot.
Pandas: The last library is Pandas, which is one of the most widely used Python
libraries, and it is used for importing and managing datasets.
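A minimal sketch of the three imports and of loading the dataset with pandas; treating the last column as the target and the rest as features is an assumption made for illustration:

import numpy as np                 # mathematical operations and arrays
import matplotlib.pyplot as plt    # 2D plotting via the pyplot sub-module
import pandas as pd                # importing and managing datasets

# Load the dataset and separate features (X) from the target (y)
dataset = pd.read_csv('DHC_Data.csv')
X = dataset.iloc[:, :-1].values    # all columns except the last (assumed feature columns)
y = dataset.iloc[:, -1].values     # last column (assumed target, e.g. Purchased)
print(dataset.head())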
Handling Missing Data
There are mainly two ways to handle missing data, which are:
By deleting the particular row:
The first way is commonly used to deal with null values. If less than about 1% of the data
consists of null values, you can simply delete the affected rows.
By calculating the mean:
In this way, we calculate the mean of the column (or row) that contains the missing
value and put it in place of the missing value.
To handle missing values, we will use the scikit-learn library in our code, which contains
various tools for building machine learning models. Here we will use the Imputer class
from the sklearn.preprocessing module (replaced by SimpleImputer in sklearn.impute in
recent versions of scikit-learn).
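A minimal sketch of mean imputation, written against the current SimpleImputer API; the assumption that the numeric columns with missing values are columns 1 and 2 of X is made purely for illustration:

import numpy as np
from sklearn.impute import SimpleImputer   # successor of sklearn.preprocessing.Imputer

# Replace missing values (NaN) with the mean of each column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X[:, 1:3] = imputer.fit_transform(X[:, 1:3])   # assumed numeric columns containing missing values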
Encoding Categorical Data
Encoding categorical data is the process of converting categorical data into a numeric
(integer) format so that the data can be provided to the different models. Categorical data
comes in the form of strings or object data types, but machine learning and deep learning
algorithms can work only on numbers.
Categorical Data: Department & Purchased
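A minimal sketch of encoding these two columns, assuming Department is column 0 of X and Purchased is the target vector y; one-hot encoding is used for the feature and label encoding for the target:

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

# One-hot encode the 'Department' feature (assumed to be column 0 of X)
ct = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), [0])],
    remainder='passthrough')            # keep the remaining columns unchanged
X = np.array(ct.fit_transform(X))

# Label encode the 'Purchased' target (e.g. 'No' -> 0, 'Yes' -> 1)
le = LabelEncoder()
y = le.fit_transform(y)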
Split into training & test sets
To accurately assess your ML model’s performance without overfitting or underfitting
issues, it’s necessary to split your dataset into two separate sets:
• Training set: helps train the algorithm on real-world examples.
• Testing set: used later for evaluating its generalization capabilities on unseen instances.
It is common to use a train/test split of 70/30 or 80/20.
Splitting data into training and testing sets is an essential step in the development of
machine learning models. It involves dividing the available dataset into separate
subsets for training, validation, and testing the model. The most common approach is to
split the dataset into a training set and a testing set. The training set is used to train
the model, while the testing set is used to evaluate the model's performance. The
usual split is 70-80% for training and 20-30% for testing, but this may vary depending
on the size of the dataset and the specific use case.
The primary reason for splitting data into training and testing sets is to prevent
overfitting. Overfitting occurs when a model fits the training data too closely,
resulting in poor performance on new, unseen data. By evaluating the model's
performance on a separate testing set, we can estimate how well it will perform on new
data.
It's important to note that splitting data into training and testing sets is not enough on its
own to prevent overfitting. Other techniques, such as cross-validation and regularization,
are also used to prevent overfitting.
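A minimal sketch of an 80/20 split with scikit-learn's train_test_split; the fixed random_state is only there to make the split reproducible:

from sklearn.model_selection import train_test_split

# 80% of the observations for training, 20% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)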
Why do we apply feature scaling after splitting the data into training and test sets?
The test set is supposed to be a brand-new set on which your machine learning model
is evaluated. You train your machine learning model on the training set, and only
afterwards do you deploy it on new observations, so the test set should not take part in
training. Feature scaling is a technique that computes statistics such as the mean and
standard deviation of the features. If we applied feature scaling before the split, those
statistics would be computed over all the values, including those of the test set.
Applying feature scaling to the original data before the split therefore causes information
leakage from the test set: we would be using information from data that is not supposed
to be available, because the test set is meant to represent new, unseen observations.
Point to be noted: apply feature scaling after splitting the data into test and training sets
to prevent information leakage from the test set.
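A minimal sketch of scaling applied after the split: the scaler is fitted on the training set only, so the test set is transformed with the training-set mean and standard deviation and no information leaks from it. In practice you may want to restrict scaling to the numeric columns only:

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)   # mean and standard deviation computed from X_train only
X_test = sc.transform(X_test)         # test set scaled with the training statistics (no leakage)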
Feature Scaling
Feature scaling is a technique used in machine learning to standardize the
independent features present in the data in a fixed range. It is performed during the data
pre-processing stage. The purpose of feature scaling is to bring all the features to the
same level of magnitude, which helps improve the performance of machine learning
algorithms that rely on optimization methods or on some kind of distance metric.
There are different methods for feature scaling, including standardization, min-max
scaling, and unit vector scaling. Standardization scales the data to have a mean of
zero and a standard deviation of one. Min-max scaling scales the data to a fixed range,
usually between 0 and 1. Unit vector scaling scales the data to have a length of 1.
Feature scaling is important because it helps in avoiding bias towards features with
higher magnitudes, which can lead to poor performance of machine learning models. It
also helps in reducing the time required for training machine learning models.
Normalization
Xn = (X - Xmin) / (Xmax - Xmin)
Xn = normalized value
X = current value of the feature
Xmax = maximum value of the feature
Xmin = minimum value of the feature
Standardization
Standardization = (Current_value – Mean) / Standard Deviation.
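A minimal sketch applying both formulas to a small made-up feature vector with NumPy, showing that normalization maps the values into [0, 1] while standardization yields zero mean and unit standard deviation:

import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])   # made-up feature values

# Normalization (min-max scaling): Xn = (X - Xmin) / (Xmax - Xmin)
x_norm = (x - x.min()) / (x.max() - x.min())
print(x_norm)                        # [0.   0.25 0.5  0.75 1.  ]

# Standardization: (Current_value - Mean) / Standard Deviation
x_std = (x - x.mean()) / x.std()
print(x_std.mean(), x_std.std())     # approximately 0.0 and exactly 1.0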