Data Preparation For Machine Learning A Step by Step Guide

The document outlines a step-by-step guide for data preparation in machine learning, emphasizing the importance of data collection, cleaning, transformation, and splitting. It discusses various data types and techniques for handling missing data, outliers, and irrelevant information. The guide also highlights different strategies for data splitting and the significance of thorough data preparation for effective machine learning models.

Uploaded by

Mohamed BEN HLIMA

Data Preparation for Machine Learning: A Step-by-Step Guide
Years back, when Spotify was working on its recommendation
engine, they faced challenges related to the quality of the data
used for training ML algorithms. Thoroughly preparing data for
machine learning allowed the streaming platform to train a
powerful ML engine that accurately predicts users’ listening
preferences and offers highly personalized music
recommendations.
by Arij MEFTAH
How to Prepare Data for Machine Learning

1. Data Collection
Data preparation for machine learning starts with data collection. During
this stage, you gather data for training and tuning the future ML model.

2. Data Cleaning
The next step is to clean the data, which involves finding and correcting
errors, inconsistencies, and missing values.

3. Data Transformation
During this stage, you convert raw data into a format suitable for
machine learning algorithms.

4. Data Splitting
The final step is data splitting: dividing all gathered data into subsets.
Data Types
1. Structured Data
Data organized in a specific way, typically in a table or
spreadsheet format.

2. Unstructured Data
Includes images, videos, audio recordings, and other
information that does not follow conventional data models.

3. Semi-Structured Data
Doesn’t follow the format of a tabular data model but contains some
structural elements, like tags or metadata.
Data Collection
Collecting data from internal sources:
If you have information stored in your enterprise data warehouse, you can use it
for training ML algorithms. This data could include sales transactions, customer
interactions, data from social media platforms, and other sources.

Collecting data from external sources:
You can turn to publicly available data sources, such as government data portals,
academic data repositories, and data sharing communities, such as Kaggle, UCI
Machine Learning Repository, or Google Dataset Search.

Web scraping:
This technique involves extracting data from websites using automated tools. This
approach may be useful for collecting data from sources that are not accessible
through other means, such as product reviews, news articles, and social media.

Surveys:
This approach can be used to collect specific data points from a specific target
audience. It is especially useful for collecting information on user preferences or
behavior.
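
To make the web scraping idea concrete, here is a minimal sketch using only Python's standard library. The page content, the "review" CSS class, and the ReviewExtractor name are all hypothetical; a real scraper would download the page first and would need to respect the site's terms of use.

```python
from html.parser import HTMLParser


class ReviewExtractor(HTMLParser):
    """Collects the text of every <p class="review"> element (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.reviews = []
        self._in_review = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "p" and ("class", "review") in attrs:
            self._in_review = True
            self.reviews.append("")

    def handle_data(self, data):
        # Only keep text that appears inside a review paragraph.
        if self._in_review:
            self.reviews[-1] += data

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_review = False


# Hard-coded page so the sketch runs offline (an assumption for illustration).
SAMPLE_PAGE = """
<html><body>
  <h1>Headphones</h1>
  <p class="review">Great sound, weak battery.</p>
  <p class="note">Shipping takes 3 days.</p>
  <p class="review">Comfortable for long sessions.</p>
</body></html>
"""

parser = ReviewExtractor()
parser.feed(SAMPLE_PAGE)
reviews = [text.strip() for text in parser.reviews]
print(reviews)
```

The extractor keeps only the two review paragraphs and ignores the shipping note, which is the kind of filtering a scraper typically does before the data enters the cleaning stage.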
Data Collection
Data augmentation:
Generates more data from existing samples by transforming them in
a variety of ways, for example, rotating, translating, or scaling.

Active learning:
Selects the most informative data samples for labeling by a human expert.

Transfer learning:
Uses a pre-trained ML model that solves a related task
as a starting point for training a new ML model, followed by fine-tuning the new
model on new data.

Collaborative data sharing:
Involves working with other researchers and organizations to collect and
share data for a common goal.
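
As an illustration of data augmentation, the sketch below applies the three transformations named above (rotating, translating, scaling) to a tiny image stored as a list of pixel rows. The function names and the 2x3 sample image are assumptions made for this example; real projects usually rely on an image library instead.

```python
def rotate_90(image):
    """Rotate a 2D pixel grid 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def translate_right(image, shift, fill=0):
    """Shift every row `shift` pixels to the right, padding with `fill`."""
    width = len(image[0])
    return [([fill] * shift + row)[:width] for row in image]


def scale_up(image, factor):
    """Enlarge the grid by repeating each pixel `factor` times in both directions."""
    return [
        [pixel for pixel in row for _ in range(factor)]
        for row in image
        for _ in range(factor)
    ]


# Hypothetical 2x3 grayscale image.
image = [
    [1, 2, 3],
    [4, 5, 6],
]

# One original sample becomes three additional training samples.
augmented = [rotate_90(image), translate_right(image, 1), scale_up(image, 2)]
for sample in augmented:
    print(sample)
```

Each transformed copy keeps the label of the original sample, which is what makes augmentation a cheap way to enlarge a training set.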
Data Cleaning Techniques
Handling missing data
Missing values are a common issue in machine learning. They can be handled by
imputation, interpolation, or deletion.
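
The three options for missing data can be sketched as follows. This is a minimal illustration in plain Python, assuming missing values are marked with None; the sample readings and function names are made up for the example, and mean imputation is only one of several possible imputation rules.

```python
from statistics import mean

# Hypothetical sensor readings; None marks a missing value.
readings = [20.0, None, 22.0, None, None, 28.0]


def delete_missing(values):
    """Deletion: drop every missing entry."""
    return [v for v in values if v is not None]


def impute_mean(values):
    """Imputation: replace each missing entry with the mean of the known ones."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def interpolate_linear(values):
    """Interpolation: fill gaps on a straight line between the nearest known
    neighbours. Leading or trailing gaps are left as None."""
    result = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            result[i] = values[left] + step * (i - left)
    return result


print(delete_missing(readings))
print(impute_mean(readings))
print(interpolate_linear(readings))
```

Deletion shrinks the dataset, mean imputation flattens variation, and interpolation only makes sense when the values are ordered (for example, a time series), so the right choice depends on the data.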

Handling outliers
Outliers are data points that significantly differ from the rest of the
dataset. Outliers can occur due to measurement errors, data entry errors,
or simply because they represent unusual or extreme observations.
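
The text above does not name a detection method, so the sketch below uses one common choice, the interquartile range (IQR) rule, as an assumption. The multiplier k=1.5, the function names, and the height data are illustrative only.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return the (low, high) fence: k * IQR below Q1 and above Q3."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def split_outliers(values, k=1.5):
    """Separate values inside the fence from values outside it."""
    low, high = iqr_bounds(values, k)
    inliers = [v for v in values if low <= v <= high]
    outliers = [v for v in values if v < low or v > high]
    return inliers, outliers


# Hypothetical body heights in cm; 250 looks like a data entry error.
heights_cm = [160, 162, 165, 167, 168, 170, 171, 173, 175, 250]

inliers, outliers = split_outliers(heights_cm)
print(outliers)
```

Whether a flagged point should be removed, capped, or kept still needs a human decision, since an extreme value may be a genuine observation rather than an error.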

Removing duplicates
Duplicates don’t only skew ML predictions, but also waste storage space
and increase processing time, especially in large datasets. To remove
duplicates, data scientists resort to a variety of duplicate identification
techniques (like exact matching, fuzzy matching, hashing, or record
linkage). Once identified, they can be either dropped or merged.
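
Two of the identification techniques mentioned above, exact matching and fuzzy matching, can be sketched with the standard library. The customer names, the normalization rule, and the 0.9 similarity threshold are assumptions for this example; here duplicates are dropped rather than merged.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(name.lower().split())


def drop_exact_duplicates(records):
    """Exact matching: keep the first record for each normalized key."""
    seen = set()
    unique = []
    for record in records:
        key = normalize(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def drop_fuzzy_duplicates(records, threshold=0.9):
    """Fuzzy matching: drop a record if it is at least `threshold` similar
    to a record that was already kept."""
    kept = []
    for record in records:
        candidate = normalize(record)
        is_duplicate = any(
            SequenceMatcher(None, candidate, normalize(other)).ratio() >= threshold
            for other in kept
        )
        if not is_duplicate:
            kept.append(record)
    return kept


# Hypothetical customer list with a spacing variant and a likely typo.
customers = ["Anna Schmidt", "anna  schmidt", "Anna Schmitt", "Omar Haddad"]

print(drop_exact_duplicates(customers))
print(drop_fuzzy_duplicates(customers))
```

Fuzzy matching can also merge records that belong to different people, so the threshold should be tuned and the flagged pairs reviewed before anything is dropped.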
Handling irrelevant data
Irrelevant data refers to the data that is not useful or applicable to
solving the problem. Handling irrelevant data can help reduce noise and
improve prediction accuracy. To identify irrelevant data, data teams
resort to such techniques as principal component analysis, correlation
analysis, or simply rely on their domain knowledge. Once identified, such
data points are removed from the dataset.
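
Correlation analysis, one of the techniques listed above, can be sketched as a simple feature filter. The feature names, the housing numbers, and the 0.3 cutoff are hypothetical; Pearson correlation only detects linear relationships, so this is a first screen rather than a final verdict.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long columns."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / sqrt(var_x * var_y)


def select_features(features, target, min_abs_corr=0.3):
    """Keep the features whose absolute correlation with the target reaches the cutoff."""
    return [
        name
        for name, column in features.items()
        if abs(pearson(column, target)) >= min_abs_corr
    ]


# Hypothetical housing data: the last column is an arbitrary identifier digit.
features = {
    "area_m2": [50, 60, 80, 100, 120],
    "rooms": [2, 2, 3, 4, 5],
    "listing_id_last_digit": [3, 9, 1, 9, 3],
}
price = [150, 175, 240, 310, 360]

print(select_features(features, price))
```

Domain knowledge would reach the same conclusion here without any computation: an identifier digit cannot explain a price.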
Handling incorrect data
Common techniques of dealing with such data include data
Data Transformation Techniques
Scaling
Transforms all data points to fit a specified range, typically between 0
and 1.

Normalization
Changes the distribution of a dataset.

Encoding
Converts categorical data into a numerical format.

Discretization
Transforming continuous variables, such as time, temperature, or
weight, into discrete ones.

Dimensionality reduction
Limiting the number of features or variables in a dataset and only
preserving the information relevant for solving the problem.
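
Three of these transformations are easy to show in a few lines. The sketch below assumes min-max scaling for "scaling", one-hot encoding for "encoding", and fixed bin edges for "discretization"; these are common choices rather than the only ones, and all names and sample values are illustrative.

```python
def min_max_scale(values, low=0.0, high=1.0):
    """Scaling: map values linearly into the range [low, high]."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        return [low for _ in values]  # constant column: nothing to spread out
    span = v_max - v_min
    return [low + (v - v_min) * (high - low) / span for v in values]


def one_hot_encode(values):
    """Encoding: turn each category into a 0/1 vector, one slot per category."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return encoded, categories


def discretize(values, edges, labels):
    """Discretization: give each value the label of the first bin whose upper
    edge it does not exceed; values above every edge get the last label."""
    result = []
    for v in values:
        for edge, label in zip(edges, labels):
            if v <= edge:
                result.append(label)
                break
        else:
            result.append(labels[-1])
    return result


ages = [18, 30, 42, 66]
colors = ["red", "green", "red", "blue"]
temperatures_c = [5, 18, 31]

print(min_max_scale(ages))
print(one_hot_encode(colors))
print(discretize(temperatures_c, edges=[10, 25], labels=["cold", "mild", "hot"]))
```

When the data is later split, the scaling range and category list should be derived from the training subset only and then reused on the other subsets, so that no information from the evaluation data influences training.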
Data Splitting Strategies
1. Training dataset
Used to teach an ML model to recognize patterns and relationships
between input and target variables.

2. Validation dataset
Subset of data that is used to evaluate the performance of the
model during training.

3. Testing dataset
Subset of data that is used to evaluate the performance of the trained model.
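
A random three-way split into these subsets can be sketched as follows. The 70/15/15 proportions, the fixed seed, and the function name are assumptions for illustration; the source does not prescribe specific ratios.

```python
import random


def train_val_test_split(samples, val_fraction=0.15, test_fraction=0.15, seed=42):
    """Shuffle a copy of the samples and cut it into train, validation, and test parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


# Hypothetical dataset of 100 sample identifiers.
samples = list(range(100))
train, val, test = train_val_test_split(samples)
print(len(train), len(val), len(test))
```

Every sample lands in exactly one subset, which is the property that keeps the test score an honest estimate of performance on unseen data.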
Data Splitting Strategies
Random Sampling
Data is split randomly, often applied to large datasets representative of
the population being modeled.

Stratified Sampling
Data is divided into subsets based on class labels or other characteristics,
followed by randomly sampling these subsets.

Time-based Sampling
Data collected up to a certain point makes a training dataset, while the
data collected after the set point is formed into a testing dataset.

Cross-validation
The data is divided into multiple subsets, or folds. Some folds are used to
train the model, while the remaining are used for performance evaluation.
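
The strategies above differ mainly in how sample indices are assigned to subsets. The sketch below shows stratified sampling, time-based sampling, and k-fold cross-validation in plain Python; the function names, fractions, and toy data are assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_split(samples, labels, test_fraction=0.25, seed=0):
    """Group samples by label, then randomly sample the same fraction from each group."""
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)
    rng = random.Random(seed)
    train, test = [], []
    for label, group in by_label.items():
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test += [(s, label) for s in group[:n_test]]
        train += [(s, label) for s in group[n_test:]]
    return train, test


def time_based_split(records, cutoff):
    """records are (timestamp, value) pairs; data up to the cutoff trains, later data tests."""
    train = [r for r in records if r[0] <= cutoff]
    test = [r for r in records if r[0] > cutoff]
    return train, test


def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) so that each fold is the test set exactly once."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


# Hypothetical imbalanced dataset: 8 samples of class "a", 4 of class "b".
samples = list(range(12))
labels = ["a"] * 8 + ["b"] * 4
strat_train, strat_test = stratified_split(samples, labels)

# Hypothetical (day, value) records split at day 2.
records = [(1, 10.0), (2, 11.5), (3, 12.1), (4, 13.0)]
past, future = time_based_split(records, cutoff=2)

folds = list(k_fold_indices(6, 3))
print(strat_test, future, folds)
```

Stratification keeps the 2:1 class ratio in both subsets, the time-based split never trains on data from after the cutoff, and with k folds the model is trained and evaluated k times.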
Importance of Data Preparation for Machine Learning
In this course, we highlighted the importance of preparing data
for machine learning and shared our approach to collecting,
cleaning, and transforming data.
