Data Preparation for Machine Learning: A Step-by-Step Guide
Years ago, when Spotify was working on its recommendation
engine, it faced challenges related to the quality of the data
used for training ML algorithms. Thoroughly preparing data for
machine learning allowed the streaming platform to train a
powerful ML engine that accurately predicts users’ listening
preferences and offers highly personalized music
recommendations.
by Arij MEFTAH
How to Prepare Data for Machine Learning
1. Data Collection
Data preparation for machine learning starts with data collection. During
this stage, you gather data for training and tuning the future ML model.
2. Data Cleaning
The next step is to clean the data, which involves finding and correcting
errors, inconsistencies, and missing values.
3. Data Transformation
During this stage, you convert raw data into a format suitable for
machine learning algorithms.
4. Data Splitting
The final step involves dividing all gathered data into subsets, a
process known as data splitting.
Data Types
1. Structured Data
Data organized in a specific way, typically in a table or spreadsheet format.
2. Unstructured Data
Includes images, videos, audio recordings, and other information that does
not follow conventional data models.
3. Semi-Structured Data
Doesn’t follow the format of a tabular data model but contains some
structural elements, like tags or metadata.
Data Collection
Collecting data from internal sources:
If you have information stored in your enterprise data warehouse, you can use it
for training ML algorithms. This data could include sales transactions, customer
interactions, data from social media platforms, and other sources.
Web scraping:
This technique involves extracting data from websites using automated tools. This
approach may be useful for collecting data from sources that are not accessible
through other means, such as product reviews, news articles, and social media.
Surveys:
This approach can be used to collect specific data points from a specific target
audience. It is especially useful for collecting information on user preferences or
behavior.
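As a rough illustration of the web scraping option above, the sketch below pulls product reviews out of an HTML fragment using only Python's standard library. The markup, the `review` and `text` class names, and the `ReviewParser` helper are all hypothetical; a real scraper would first download the page and would need to match that site's actual structure.

```python
from html.parser import HTMLParser

# Hypothetical page fragment; a real scraper would download this HTML first.
SAMPLE_HTML = """
<div class="review"><p class="text">Great sound quality.</p></div>
<div class="review"><p class="text">Battery died after a week.</p></div>
<div class="ad"><p class="text">Buy now!</p></div>
"""


class ReviewParser(HTMLParser):
    """Collects the text of <p class="text"> elements inside <div class="review">."""

    def __init__(self):
        super().__init__()
        self.in_review = False
        self.in_text = False
        self.reviews = []

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class", "")
        if tag == "div":
            # Only divs marked as reviews are of interest; ads are skipped.
            self.in_review = css_class == "review"
        elif tag == "p" and self.in_review and css_class == "text":
            self.in_text = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_text = False
        elif tag == "div":
            self.in_review = False

    def handle_data(self, data):
        if self.in_text and data.strip():
            self.reviews.append(data.strip())


def extract_reviews(html):
    """Return the review texts found in an HTML string."""
    parser = ReviewParser()
    parser.feed(html)
    return parser.reviews


reviews = extract_reviews(SAMPLE_HTML)
print(reviews)
```

The extracted strings could then be stored alongside data from internal sources and surveys before cleaning.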
Data Collection
Data augmentation:
This technique generates more data from existing samples by transforming them in
a variety of ways, for example, rotating, translating, or scaling.
Active learning:
This technique selects the most informative data samples for labeling by a human expert.
Transfer learning:
This technique uses an ML model pre-trained on a related task as a starting point
for training a new ML model, followed by fine-tuning the new model on new data.
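To make the data augmentation idea concrete, here is a minimal sketch that rotates, translates, and scales a tiny grayscale image stored as a list of rows. The function names and the 2x2 sample image are illustrative assumptions; in practice an image library would normally do this work.

```python
def rotate_90(image):
    """Rotate a 2D grid of pixels 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def translate_right(image, shift, fill=0):
    """Shift pixels right by `shift` columns, padding the left edge with `fill`."""
    width = len(image[0])
    shift = min(shift, width)
    return [[fill] * shift + row[: width - shift] for row in image]


def scale_up(image, factor):
    """Enlarge the image by repeating every pixel `factor` times in both directions."""
    return [
        [pixel for pixel in row for _ in range(factor)]
        for row in image
        for _ in range(factor)
    ]


def augment(image):
    """Return several transformed copies of one labeled sample."""
    return [rotate_90(image), translate_right(image, 1), scale_up(image, 2)]


image = [[1, 2],
         [3, 4]]
augmented = augment(image)
for variant in augmented:
    print(variant)
```

Each variant keeps the label of the original sample, so one labeled image yields several training examples.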
Handling outliers
Outliers are data points that significantly differ from the rest of the
dataset. Outliers can occur due to measurement errors, data entry errors,
or simply because they represent unusual or extreme observations.
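One common way to flag outliers, which the slides do not name and which is used here only as an assumption, is the interquartile range (IQR) rule: values more than 1.5 IQRs below the first quartile or above the third quartile are treated as suspicious. The sample readings are made up for illustration.

```python
import statistics


def find_outliers_iqr(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical sensor readings with one suspicious value.
readings = [10, 12, 11, 13, 12, 11, 95]
outliers = find_outliers_iqr(readings)
print(outliers)
```

Whether a flagged value is removed, capped, or kept depends on whether it is an error or a genuine extreme observation.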
Removing duplicates
Duplicates not only skew ML predictions but also waste storage space
and increase processing time, especially in large datasets. To remove
duplicates, data scientists use a variety of duplicate identification
techniques (like exact matching, fuzzy matching, hashing, or record
linkage). Once identified, duplicates can be either dropped or merged.
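The sketch below illustrates two of the techniques mentioned above: exact matching on a normalized key (the `set` lookup relies on hashing) and fuzzy matching with the standard library's `difflib.SequenceMatcher`. The normalization rule, the 0.85 similarity threshold, and the customer names are assumptions chosen for the example.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def drop_exact_duplicates(records):
    """Keep the first record for each normalized key."""
    seen = set()
    unique = []
    for record in records:
        key = normalize(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def is_fuzzy_match(a, b, threshold=0.85):
    """Treat two strings as duplicates when their similarity ratio is high."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold


def drop_fuzzy_duplicates(records, threshold=0.85):
    """Keep a record only if it is not similar to one already kept."""
    kept = []
    for record in records:
        if not any(is_fuzzy_match(record, other, threshold) for other in kept):
            kept.append(record)
    return kept


customers = ["Jane Smith", "jane  smith", "Jane Smyth", "Bob Lee"]
exact = drop_exact_duplicates(customers)
fuzzy = drop_fuzzy_duplicates(exact)
print(exact)
print(fuzzy)
```

The fuzzy pass compares every record with every kept record, so it suits small datasets; larger ones usually need blocking or record linkage tooling.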
Handling irrelevant data
Irrelevant data is data that is not useful or applicable to
solving the problem. Handling irrelevant data can help reduce noise and
improve prediction accuracy. To identify irrelevant data, data teams
use techniques such as principal component analysis and correlation
analysis, or simply rely on their domain knowledge. Once identified, such
data points are removed from the dataset.
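As one possible reading of correlation analysis, the sketch below computes the Pearson correlation between each feature column and the target and keeps only features whose absolute correlation reaches a threshold. The feature names, the sample values, and the 0.3 cutoff are assumptions for illustration, and a low linear correlation alone does not prove a feature is useless.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def select_relevant(features, target, min_abs_corr=0.3):
    """Keep the feature columns that correlate with the target."""
    return {
        name: column
        for name, column in features.items()
        if abs(pearson(column, target)) >= min_abs_corr
    }


# Hypothetical housing data: only the area should relate to the price.
features = {
    "area_m2": [50, 60, 80, 100, 120],
    "listing_id": [4, 1, 3, 5, 2],
    "constant": [1, 1, 1, 1, 1],
}
price = [150, 180, 240, 300, 360]
kept = select_relevant(features, price)
print(list(kept))
```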
Handling incorrect data
Common techniques of dealing with such data include data
Data Transformation Techniques
Scaling
Transforms all data points to fit a specified range, typically between 0
and 1.
Normalization
Changes the distribution of a dataset.
Encoding
Converts categorical data into a numerical format.
Discretization
Transforming continuous variables, such as time, temperature, or
weight, into discrete ones.
Dimensionality reduction
Limiting the number of features or variables in a dataset and only
preserving the information relevant for solving the problem.
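Three of the techniques above can be sketched in a few lines each: min-max scaling to the 0 to 1 range, one-hot encoding of categories, and discretization of a continuous variable into labeled bins. The function names, bin edges, and sample values are assumptions for illustration; one-hot encoding is only one of several encoding schemes.

```python
from bisect import bisect_left


def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # avoid dividing by zero on constant data
    return [(v - lo) / (hi - lo) for v in values]


def one_hot_encode(values):
    """Turn categories into 0/1 vectors; returns (vectors, category order)."""
    categories = sorted(set(values))
    vectors = [[1 if v == c else 0 for c in categories] for v in values]
    return vectors, categories


def discretize(values, edges, labels):
    """Map each value to a bin label; `labels` has one more item than `edges`.

    A value equal to an edge falls into the lower bin.
    """
    return [labels[bisect_left(edges, v)] for v in values]


scaled = min_max_scale([0, 5, 10])
encoded, categories = one_hot_encode(["red", "green", "red"])
binned = discretize([5, 18, 31], edges=[10, 25], labels=["cold", "mild", "hot"])
print(scaled)
print(categories, encoded)
print(binned)
```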
Data Splitting Strategies
1. Training dataset
Subset of data that is used to teach an ML model to recognize patterns and
relationships between input and target variables.
2. Validation dataset
Subset of data that is used to evaluate the performance of the model
during training.
3. Testing dataset
Subset of data that is used to evaluate the performance of the trained model.
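A minimal sketch of producing the three subsets is shown below. The 70/15/15 proportions, the fixed seed, and the `train_val_test_split` helper are assumptions for illustration; the slides do not prescribe specific ratios.

```python
import random


def train_val_test_split(rows, val_ratio=0.15, test_ratio=0.15, seed=42):
    """Shuffle a copy of the rows and cut it into train, validation, and test."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducible splits
    n_test = round(len(shuffled) * test_ratio)
    n_val = round(len(shuffled) * val_ratio)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


rows = list(range(100))
train, val, test = train_val_test_split(rows)
print(len(train), len(val), len(test))
```

The three subsets never overlap, which keeps the testing dataset unseen until the final evaluation.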
Data Splitting Strategies
Random Sampling
Data is split randomly, often applied to large datasets representative of
the population being modeled.
Stratified Sampling
Data is divided into subsets based on class labels or other characteristics,
followed by randomly sampling these subsets.
Time-based Sampling
Data collected up to a certain point forms the training dataset, while the
data collected after that point forms the testing dataset.
Cross-validation
The data is divided into multiple subsets, or folds. Some folds are used to
train the model, while the remaining are used for performance evaluation.
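The strategies above can be sketched as follows: a stratified split that samples within each class, a time-based split around a cutoff, and k-fold index generation for cross-validation. The helper names, the ratios, and the sample rows are assumptions for illustration; the k-fold sketch uses contiguous folds without shuffling.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_of, test_ratio=0.2, seed=0):
    """Split so each class keeps roughly the same share in train and test."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)
    rng = random.Random(seed)
    train, test = [], []
    for group in by_label.values():
        rng.shuffle(group)  # random sampling within each class
        n_test = round(len(group) * test_ratio)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


def time_based_split(rows, time_of, cutoff):
    """Rows up to the cutoff train the model; later rows test it."""
    train = [r for r in rows if time_of(r) <= cutoff]
    test = [r for r in rows if time_of(r) > cutoff]
    return train, test


def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


# Hypothetical rows of (day, label): 5 "spam" and 15 "ham" samples.
rows = [(day, "spam" if day % 4 == 0 else "ham") for day in range(20)]
strat_train, strat_test = stratified_split(rows, label_of=lambda r: r[1])
time_train, time_test = time_based_split(rows, time_of=lambda r: r[0], cutoff=15)
folds = list(k_fold_indices(10, 3))
print(len(strat_train), len(strat_test))
print(len(time_train), len(time_test))
print([len(test_idx) for _, test_idx in folds])
```

In cross-validation, the model is trained and evaluated once per fold and the scores are averaged.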
Importance of Data Preparation for Machine Learning
In this course, we highlighted the importance of preparing data
for machine learning and shared our approach to collecting,
cleaning, and transforming data.