ML Lecture 6 7 Preprocess
What is Data Preprocessing?
Real-world data generally contains noise and missing values, and
may be in a format that cannot be used directly by machine
learning models.
Data preprocessing involves cleaning and transforming the data into a
structured, useful, and efficient format to make it suitable for
analysis and machine learning models.
It is a crucial step in machine learning because the quality of a
model depends on the quality of its data.
Done well, it can improve the accuracy and efficiency of a machine
learning model.
Data Preprocessing Techniques
Data cleaning (fix noise, outliers, missing values, and duplicates in
the data)
Aggregation
Sampling
Dimensionality reduction
Feature subset selection
Feature creation
Discretization and binarization
Attribute transformation
Aggregation
Combining two or more attributes (or objects) into a single
attribute (or object)
Purpose
◦ Data reduction
◦ Reduce the number of attributes or objects
◦ Change of scale
◦ Cities aggregated into regions, states, countries, etc.
◦ Less memory, less processing time
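As a minimal sketch of change of scale, the following Python snippet aggregates city-level records into state-level totals. The city names, states, and sales figures are invented purely for illustration, and summing is only one possible way to combine objects.

```python
from collections import defaultdict

# Hypothetical city-level records: (city, state, sales). Values are made up.
city_sales = [
    ("Austin", "Texas", 120),
    ("Dallas", "Texas", 200),
    ("Miami", "Florida", 150),
    ("Orlando", "Florida", 90),
    ("Tampa", "Florida", 60),
]

def aggregate_by_state(records):
    """Combine city-level objects into one object per state (sum of sales)."""
    totals = defaultdict(int)
    for _city, state, sales in records:
        totals[state] += sales
    return dict(totals)

state_sales = aggregate_by_state(city_sales)
print(state_sales)  # five city rows reduced to two state rows
```

The aggregated table has fewer objects than the original, which is where the savings in memory and processing time come from.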
Details: Analytics Vidhya
Types of Sampling Methods
Simple Random Sampling
Systematic Sampling
Stratified Sampling
Cluster Sampling
Multistage Sampling
Simple Random Sampling
Select a subset of items randomly from a
population
There is an equal probability of selecting
any particular item
◦ Sampling without replacement
◦ As each item is selected, it is removed from
the population
◦ Sampling with replacement
◦ Objects are not removed from the population
as they are selected for the sample
◦ In sampling with replacement, the same
object can be picked more than once
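The difference between the two variants can be sketched with Python's standard `random` module. The population of 20 item IDs, the sample size of 5, and the seed are assumptions chosen only for illustration.

```python
import random

rng = random.Random(0)           # fixed seed so the run is repeatable
population = list(range(1, 21))  # hypothetical population of 20 item IDs

# Without replacement: each item can appear at most once in the sample.
without_replacement = rng.sample(population, k=5)

# With replacement: items stay in the population, so repeats are possible.
with_replacement = rng.choices(population, k=5)

print(without_replacement)
print(with_replacement)
```

`random.sample` never returns the same element twice, while `random.choices` may, which mirrors the two bullet points above.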
Systematic Sampling
Samples are drawn using a pre-specified pattern, such as at fixed
intervals.
Suppose the population has 20 people, we begin with person
number 3, and we want a sample of size 5. The sampling interval
is then 20/5 = 4, so the next individual selected is person 7
(3 + 4), and so on:
3, 3+4=7, 7+4=11, 11+4=15, 15+4=19
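The worked example can be reproduced with a few lines of Python. This is a sketch that assumes a population of 20 people numbered 1 to 20 (the size implied by the interval 20/5); the function name `systematic_sample` is hypothetical.

```python
def systematic_sample(population, start, sample_size):
    """Pick every k-th item, where k = population size // sample size.

    `start` is the 1-based position of the first selected item.
    """
    interval = len(population) // sample_size
    # Convert the 1-based start to a 0-based index, then step by the interval.
    return population[start - 1::interval][:sample_size]

people = list(range(1, 21))  # persons numbered 1..20
picked = systematic_sample(people, start=3, sample_size=5)
print(picked)  # [3, 7, 11, 15, 19]
```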
Stratified Sampling
Split the data into several
partitions called strata, based on
traits such as gender or
category.
Then draw random samples from
each partition.
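One way to sketch this in Python is to group the records by the stratifying trait and then sample each group separately. Drawing the same fraction from every stratum (proportional allocation) is an assumption made here, not something the slide prescribes, and the records and the helper name `stratified_sample` are invented for illustration.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, fraction, seed=0):
    """Draw the same fraction at random from every stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)  # partition by the chosen trait
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))  # at least one per stratum
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical records: (person_id, gender) with 12 "F" and 8 "M".
records = [(i, "F") for i in range(12)] + [(i, "M") for i in range(12, 20)]
strat = stratified_sample(records, key=lambda r: r[1], fraction=0.25)
print(strat)
```

With a fraction of 0.25 the sample holds 3 records from the "F" stratum and 2 from the "M" stratum, so every stratum is represented.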
Cluster Sampling
The population is divided into
some groups called clusters.
Then we select a fixed number of
clusters randomly and include all
observations from each of the
clusters in our sample.
In the example, one cluster is
selected as our sample, but we can
include more clusters depending on
the required sample size.
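A minimal Python sketch of cluster sampling follows. The four clusters of five observation IDs and the helper name `cluster_sample` are hypothetical; the point is only that whole clusters are chosen at random and every observation inside them is kept.

```python
import random

def cluster_sample(clusters, n_clusters, seed=0):
    """Pick whole clusters at random and keep all of their observations."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = [obs for name in chosen for obs in clusters[name]]
    return chosen, sample

# Hypothetical clusters: four groups of five observation IDs each.
clusters = {
    "A": [1, 2, 3, 4, 5],
    "B": [6, 7, 8, 9, 10],
    "C": [11, 12, 13, 14, 15],
    "D": [16, 17, 18, 19, 20],
}
chosen, cl_sample = cluster_sample(clusters, n_clusters=1)
print(chosen, cl_sample)
```

Raising `n_clusters` includes more clusters and therefore a larger sample.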
Multistage Sampling
Multistage sampling: This is very similar to cluster sampling, but
instead of keeping all the observations in each selected cluster, we
draw a random sample within each selected cluster.
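The two stages can be sketched in Python as follows. The clusters, the sizes chosen at each stage, and the helper name `multistage_sample` are assumptions made for illustration only.

```python
import random

def multistage_sample(clusters, n_clusters, n_per_cluster, seed=0):
    """Stage 1: pick clusters at random. Stage 2: sample within each one."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)  # stage 1
    sample = []
    for name in chosen:
        # Stage 2: keep only a random subset of the chosen cluster.
        sample.extend(rng.sample(clusters[name], n_per_cluster))
    return chosen, sample

# Hypothetical clusters: four groups of five observation IDs each.
groups = {
    "A": [1, 2, 3, 4, 5],
    "B": [6, 7, 8, 9, 10],
    "C": [11, 12, 13, 14, 15],
    "D": [16, 17, 18, 19, 20],
}
picked_groups, ms_sample = multistage_sample(groups, n_clusters=2, n_per_cluster=2)
print(picked_groups, ms_sample)
```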
Progressive sampling: Start with a small sample, and then increase the size
until a sufficient sample has been obtained
Curse of Dimensionality
Many types of data analysis become harder as the dimensionality
increases. As dimensionality grows, the data becomes increasingly
sparse in the space that it occupies.
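The sparsity effect can be illustrated with a small experiment, sketched here under stated assumptions: 1000 points are drawn uniformly from the unit hypercube, each axis is split into 10 bins, and we measure what fraction of the resulting grid cells contain at least one point. The point count, bin count, and function name `occupied_fraction` are arbitrary choices for illustration.

```python
import random

def occupied_fraction(n_points, dims, bins=10, seed=0):
    """Fraction of grid cells in the unit hypercube that contain a point."""
    rng = random.Random(seed)
    cells = set()
    for _ in range(n_points):
        # Map each coordinate in [0, 1) to a bin index 0..bins-1.
        cell = tuple(int(rng.random() * bins) for _ in range(dims))
        cells.add(cell)
    return len(cells) / bins ** dims

fractions = {d: occupied_fraction(1000, d) for d in (1, 2, 3, 6)}
for d, f in fractions.items():
    print(d, f)
```

Because the number of cells grows as 10^d while the number of points stays fixed, at most 1000 of the 1,000,000 cells can be occupied in 6 dimensions, so nearly all of the space is empty.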