Data Preprocessing in Machine Learning

Data preprocessing in machine learning involves 7 key steps: 1) acquiring the dataset, 2) importing relevant libraries, 3) importing the dataset, 4) identifying and handling missing values, 5) encoding categorical data, 6) splitting the dataset into training and test sets, and 7) performing feature scaling to standardize variables. These steps clean, organize, and transform raw data into a readable format suitable for building and training machine learning models.

Uploaded by

Musto


Data Preprocessing in Machine Learning: 7 Easy Steps To Follow

Data preprocessing in Machine Learning is a crucial step that improves the quality of
data so that meaningful insights can be extracted from it. It refers to the technique of
preparing (cleaning and organizing) raw data to make it suitable for building and training
Machine Learning models. In simple words, data preprocessing in Machine Learning is a data
mining technique that transforms raw data into an understandable and readable format.

Why Data Preprocessing in Machine Learning?

Data preprocessing is the first step in creating a Machine Learning model. Typically,
real-world data is incomplete, inconsistent, inaccurate (contains errors or outliers), and
often lacks specific attribute values/trends. This is where data preprocessing enters the
scenario – it helps to clean, format, and organize the raw data, thereby making it ready
for Machine Learning models. Let’s explore the various steps of data preprocessing in
machine learning.

Steps in Data Preprocessing in Machine Learning

1. Acquire the dataset:

Acquiring the dataset is the first step in data preprocessing in machine learning. To
build and develop Machine Learning models, you must first acquire the relevant dataset.
This dataset typically consists of data gathered from multiple, disparate sources and then
combined into a single, consistent format.

2. Import all the crucial libraries:

Since Python is one of the most widely used and preferred programming languages among Data
Scientists around the world, we’ll show you how to import Python libraries for data
preprocessing in Machine Learning. Importing all the crucial libraries is the second step
in data preprocessing in machine learning. These predefined Python libraries each perform
specific data preprocessing jobs. The three core Python libraries used for data
preprocessing in Machine Learning are:
NumPy – NumPy is the fundamental package for scientific computing in Python. It is used
to perform mathematical operations in your code, and it provides large multidimensional
arrays and matrices along with functions that operate on them.
Pandas – Pandas is an excellent open-source Python library for data manipulation
and analysis. It is extensively used for importing and managing the datasets. It packs in
high-performance, easy-to-use data structures and data analysis tools for Python.
Matplotlib – Matplotlib is a Python 2D plotting library that is used to plot any type of
charts in Python. It can deliver publication-quality figures in numerous hard copy formats
and interactive environments across platforms (IPython shells, Jupyter notebook, web
application servers, etc.).
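
As a minimal sketch of this step, the three libraries can be imported under their conventional aliases and each used once. The tiny array, the column names "x" and "y", and the in-memory PNG are illustrative assumptions, not part of the original article; the "Agg" backend is chosen only so the sketch runs without a display.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# NumPy: a 2-D array and a vectorized operation on it
matrix = np.array([[1.0, 2.0], [3.0, 4.0]])
column_means = matrix.mean(axis=0)

# Pandas: a labeled table built from the same values
frame = pd.DataFrame(matrix, columns=["x", "y"])

# Matplotlib: plot one column against the other and render to an in-memory PNG
fig, ax = plt.subplots()
ax.plot(frame["x"], frame["y"], marker="o")
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
```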

3. Import the Dataset:

In this step, you need to import the dataset/s that you have gathered for the ML
project at hand. Importing the dataset is one of the important steps in data preprocessing in
machine learning. However, before you can import the dataset/s, you must set the working
directory to the folder that contains them (or refer to the files by their full path).
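
A minimal sketch of importing a dataset with Pandas might look like the following. The CSV content, its values, and the column names (Country, Age, Salary, Purchased) are hypothetical stand-ins; the text is read from memory so the example runs anywhere, and the file name in the comment is likewise an assumption.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,,No
Germany,40,61000,Yes
France,,58000,Yes
"""

# pd.read_csv accepts a path or a file-like object; with a real file in the
# working directory you would pass its name, e.g. pd.read_csv("Data.csv")
dataset = pd.read_csv(io.StringIO(csv_text))

# Separate the independent variables (features) from the dependent variable
X = dataset.iloc[:, :-1]
y = dataset.iloc[:, -1]
```

Empty fields in the CSV are read as missing values (NaN), which is what the next step deals with.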

4. Identifying and handling the missing values:

In data preprocessing, it is pivotal to identify and correctly handle missing values.
If you fail to do this, you might draw inaccurate and faulty conclusions and inferences
from the data, which will hamper your ML project.

Basically, there are two ways to handle missing data:

Deleting a particular row – In this method, you remove a specific row that has a null
value for a feature, or a particular column in which more than 75% of the values are missing.
However, this method discards information, so it is recommended that you use it only
when the dataset has adequate samples. You must also ensure that deleting the data does not
introduce bias.

Calculating the mean – This method is useful for features having numeric data like
age, salary, year, etc. Here, you calculate the mean, median, or mode of the feature
(column) that contains a missing value and substitute the result for the missing value.
Because no rows are discarded, this method avoids the loss of data, and it often yields
better results than the first method (omission of rows/columns), although filling in a
constant tends to reduce the variance of that feature. Another way of approximation is
through the deviation of neighboring values. However, this works best for linear data.
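
Both approaches can be sketched with Pandas as follows. The small Age/Salary table is a hypothetical example, not data from the original article.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric features with one missing value each
data = pd.DataFrame({
    "Age": [44.0, 27.0, np.nan, 38.0, 40.0],
    "Salary": [72000.0, 48000.0, 54000.0, np.nan, 61000.0],
})

# Option 1: delete every row that contains at least one missing value
dropped = data.dropna()

# Option 2: replace each missing value with the mean of its column
# (data.mean() skips NaN entries when computing the column means)
imputed = data.fillna(data.mean())
```

Swapping `data.mean()` for `data.median()` gives median imputation with the same call pattern.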

5. Encoding the categorical data:

Categorical data refers to information that takes one of a limited set of categories
within the dataset. In the dataset cited above, there are two categorical variables – country
and purchased.

Machine Learning models are primarily based on mathematical equations. Thus, you
can intuitively understand that keeping the categorical data in the equation will cause
certain issues, since the equations can only work with numbers. The categorical values
therefore have to be encoded as numbers.
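
One common way to do this, shown here as a sketch rather than the article's own method, is to one-hot encode a multi-valued column and map a two-valued column to 0/1. The sample values for Country and Purchased are hypothetical.

```python
import pandas as pd

# Hypothetical categorical columns
data = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain"],
    "Purchased": ["No", "Yes", "No", "Yes"],
})

# One-hot encode the country: one 0/1 column per distinct value, so no
# artificial ordering is implied between countries
country_encoded = pd.get_dummies(data["Country"], prefix="Country", dtype=int)

# Map the two-valued label to 0/1
purchased_encoded = data["Purchased"].map({"No": 0, "Yes": 1})
```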

6. Splitting the dataset:

Splitting the dataset is the next step in data preprocessing in machine learning. Every
dataset for a Machine Learning model must be split into two separate sets – a training set
and a test set.

The training set is the subset of the dataset that is used for training the machine
learning model. Here, you are already aware of the output. The test set, on the other hand,
is the subset of the dataset that is used for testing the machine learning model. The ML
model predicts outcomes for the test set, and these predictions are compared with the known
outputs to evaluate it.

Usually, the dataset is split so that 80% of the data is used for training the model and
the remaining 20% is held out for testing.
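
An 80/20 split can be sketched with NumPy alone by shuffling the row indices and cutting them at the 20% mark. The helper name `train_test_split_indices`, the toy arrays, and the fixed seed are assumptions made for illustration; in practice, scikit-learn's `train_test_split` is commonly used for the same job.

```python
import numpy as np


def train_test_split_indices(n_samples, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices) after a random shuffle."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Toy data: 10 samples with 2 features each, plus one label per sample
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

train_idx, test_idx = train_test_split_indices(len(X))
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
```

Shuffling before the cut keeps the split from depending on the order in which rows happen to be stored.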

7. Feature scaling:

Feature scaling marks the end of the data preprocessing in Machine Learning. It is a
method to standardize the independent variables of a dataset within a specific range. In
other words, feature scaling limits the range of variables so that you can compare them on
common grounds.

You can perform feature scaling in Machine Learning in two ways:

Standardization:

Normalization:
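
Assuming the usual definitions, standardization rescales each feature to zero mean and unit standard deviation, z = (x - mean) / std, while (min-max) normalization rescales each feature to the range 0 to 1, x' = (x - min) / (max - min). The sketch below applies both with NumPy to hypothetical age and salary columns. Computing the statistics on the training set only and reusing them for the test set is a common practice adopted here as an assumption; scikit-learn's `StandardScaler` and `MinMaxScaler` provide the same transformations.

```python
import numpy as np

# Hypothetical features: age and salary, on very different scales
X_train = np.array([[20.0, 30000.0],
                    [30.0, 50000.0],
                    [40.0, 70000.0]])
X_test = np.array([[25.0, 40000.0]])

# Standardization: subtract the column mean, divide by the column std
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)
X_train_std = (X_train - mean) / std
X_test_std = (X_test - mean) / std

# Normalization (min-max): map each column of the training set onto [0, 1]
col_min = X_train.min(axis=0)
col_max = X_train.max(axis=0)
X_train_norm = (X_train - col_min) / (col_max - col_min)
X_test_norm = (X_test - col_min) / (col_max - col_min)
```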
