Data Preprocessing
Large datasets often contain various types of data, including structured tables, images, audio
files, and videos.
However, machine learning algorithms cannot directly process raw text, images, or videos, as
they only understand numerical representations (1s and 0s).
Therefore, it is essential to transform or encode the dataset into a suitable format before
applying machine learning techniques.
By converting the data into meaningful numerical features, the algorithm can effectively
interpret and learn patterns, enabling accurate predictions and analysis. Common data preprocessing techniques include:
• Feature Aggregation
• Feature Discretization
• Feature Sampling
• Dimensionality Reduction
• Feature Encoding
• Feature Scaling
The first step when working with a dataset is to check the quality of the data. We may face several challenges with these datasets:
• Missing values
• Outliers
• Inconsistent values
• Duplicate values
We need to address these issues programmatically to ensure data quality and improve the performance of machine learning models.
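A minimal sketch of how these checks might be done with pandas is shown below; the file name patients.csv and the column names are placeholders, not part of the original notes.

```python
import pandas as pd

# Hypothetical dataset; the file and column names are only for illustration.
df = pd.read_csv("patients.csv")

print(df.isnull().sum())        # missing values per column
print(df.duplicated().sum())    # number of duplicate rows

df = df.drop_duplicates()                          # remove duplicate rows
df["age"] = df["age"].fillna(df["age"].median())   # fill missing ages with the median

# Flag outliers in 'age' using the 1.5 * IQR rule
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)]
```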
Feature Aggregation
We need to aggregate values to organize the data and present it from a better perspective.
For example:
• A dataset may record the day-to-day transactions of a product, i.e. the daily sales of that product in various store locations over the year.
• Aggregating these transactions into store-wise monthly or yearly totals helps us reduce the number of data objects.
Aggregating feature values has several advantages: it compresses the dataset, so it takes less memory and requires less computation power.
Aggregation cannot be applied to every dataset; we need to decide whether aggregation is required. If we need the day-to-day transactional detail, aggregation is not a good step.
Feature aggregation provides a high-level representation of the original dataset, making it
easier
to analyze.
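The sketch below, assuming a pandas DataFrame with store, date, and amount columns, shows how daily transactions could be aggregated into store-wise monthly totals.

```python
import pandas as pd

# Assumed day-to-day transaction data; columns are illustrative only.
sales = pd.DataFrame({
    "store":  ["A", "A", "B", "B"],
    "date":   pd.to_datetime(["2024-01-03", "2024-01-15", "2024-01-07", "2024-02-02"]),
    "amount": [120.0, 80.0, 200.0, 150.0],
})

# Aggregate transactions into store-wise monthly totals
monthly = (
    sales.groupby(["store", sales["date"].dt.to_period("M")])["amount"]
         .sum()
         .reset_index()
)
print(monthly)
```

Each row of monthly now represents one store-month instead of one transaction, so the number of data objects shrinks.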
Feature Discretization
Sometimes features have continuous values, which can be converted into discrete values.
For example, if age is a feature in our dataset, it is usually represented in years or months.
However, in many cases, we may not need the exact number to represent age. Instead, it can
be categorized into groups such as "Young," "Middle-aged," and "Old".
Discretizing age into these categories improves efficiency by eliminating the need to handle
continuous values for that feature.
We must check whether a continuous value is necessary or if it can be discretized. While
converting continuous values to discrete ones often improves efficiency, it is not always
applicable to all features.
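As a sketch, pandas.cut can perform this kind of binning; the bin edges and labels below are assumptions chosen for illustration.

```python
import pandas as pd

ages = pd.Series([12, 25, 37, 45, 61, 78])

# Discretize continuous ages into three categories (assumed cut points)
age_group = pd.cut(ages, bins=[0, 30, 60, 120], labels=["Young", "Middle-aged", "Old"])
print(age_group)
```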
Feature Sampling
Sampling is a very common method for selecting a subset of the dataset that we are
analyzing.
In most cases, working with the complete dataset can turn out to be too expensive
considering the memory and time constraints.
Sampling should be done in such a manner that the generated sample has approximately the same properties as the original dataset, meaning that the sample is representative.
We have different techniques for sampling:
• Simple Random Sampling
• Sampling without Replacement: once an instance is selected, it is removed from the dataset, so it cannot be chosen again.
• Sampling with Replacement: selected instances are not removed, so the same instance can be chosen multiple times.
If the dataset is too large to work with as a whole, we use sampling. Depending on the problem, we need to choose a suitable sampling technique.
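A minimal sketch of simple random sampling with pandas, with and without replacement; the toy DataFrame and sample sizes are assumptions.

```python
import pandas as pd

df = pd.DataFrame({"x": range(1000)})   # stand-in for a large dataset

# Sampling without replacement: each row can appear at most once
sample_without = df.sample(n=100, replace=False, random_state=42)

# Sampling with replacement: the same row can be chosen multiple times
sample_with = df.sample(n=100, replace=True, random_state=42)
```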
In the case of simple random sampling, there is a chance of producing an imbalanced dataset. An imbalanced dataset is one where the number of instances of one class (or classes) is significantly higher than that of another, leading to an imbalance. For example, in a patient dataset there may be many instances of healthy people but only a few of actual patients.
Stratified Sampling
Stratified sampling ensures that each class is proportionally represented in both the training and testing sets, reducing the risk of an imbalanced dataset. With large data, we can use stratified sampling to obtain a balanced split.
For instance, suppose a patient dataset contains several disease classes. With simple random sampling, the class proportions are not maintained, and Lung Disease patients may be underrepresented, which can lead to biased predictions. With stratified sampling, the original proportions are preserved, ensuring fair representation of all disease classes in both the training and testing sets.
Stratified sampling significantly reduces the chance of an imbalanced dataset, leading to a well-
trained machine learning model that performs better across all classes.
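One way to apply stratified sampling is scikit-learn's train_test_split with the stratify argument; the small patient dataset below is made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical patient data with an imbalanced 'disease' column
df = pd.DataFrame({
    "age":     [34, 51, 29, 62, 45, 38, 70, 55, 41, 66, 23, 58],
    "disease": ["Healthy"] * 9 + ["Lung Disease"] * 3,
})
X, y = df[["age"]], df["disease"]

# stratify=y keeps the class proportions roughly the same in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
```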
Dimensionality Reduction
Most real-world datasets have a large number of features, and we often do not know whether each feature is important. In such cases, we can reduce the number of features; this is called dimensionality reduction.
For example, in an image processing problem, we might have to deal with thousands of features, also called dimensions.
Two widely accepted techniques are Principal Component Analysis (PCA) and Singular Value Decomposition (SVD).
Principal Component Analysis (PCA) is a technique used to reduce the number of features in a dataset while keeping the most important information. Unlike feature subset selection, where we directly keep a few of the original features and drop the rest, PCA finds new features, called principal components, that capture the most variation in the data. This helps in visualization, noise reduction, and improving model performance.
Singular Value Decomposition (SVD) is a method of breaking a matrix down into three smaller matrices, making it useful for data compression, noise removal, and recommendation systems. PCA often uses SVD to find principal components efficiently, especially for large datasets.
Both methods help in handling high-dimensional data in machine learning.
In this technique, we do not know which features are important or which ones to select or
remove. Instead, we create a new dataset with new features, where each new feature is a
combination of some original features.
It is also possible that the new features are formed by combining multiple original features.
As a result, the number of features in the new dataset will be less than in the original dataset.
These are the dimensionality reduction techniques. Features represent dimensions in a dataset. If
there is one feature, it is one-dimensional; if there are two features, it is two-dimensional; and
if there are five features, it is five-dimensional. More features mean more dimensions in the
data.
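A minimal PCA sketch with scikit-learn (whose PCA implementation is based on SVD); the random 5-feature data and the choice of 2 components are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))      # 100 samples with 5 original features

pca = PCA(n_components=2)          # keep the 2 directions of largest variance
X_reduced = pca.fit_transform(X)   # new features = principal components

print(X_reduced.shape)                 # (100, 2)
print(pca.explained_variance_ratio_)   # variance captured by each component
```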
Feature Encoding
Feature encoding is the process of converting categorical or numerical features into a form that
machine learning models can understand. Different types of features require different encoding
techniques.
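For example, one-hot encoding and label encoding are two commonly used techniques; the sketch below assumes hypothetical 'color' and 'size' columns.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"color": ["red", "green", "blue"], "size": ["S", "M", "L"]})

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(df, columns=["color"])

# Label encoding: each category mapped to an integer
df["size_encoded"] = LabelEncoder().fit_transform(df["size"])
print(one_hot)
print(df)
```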
Feature Scaling
Feature scaling is the process of transforming numerical features into a specific range to ensure that
no feature dominates others due to differences in scale. Because sometimes features with larger
ranges can dominate those with smaller ranges.
For example, if one height value is recorded as 140 cm and another as 8.2 feet, the algorithm will give more weight to 140 simply because it is the larger number, even though 8.2 feet is actually the greater height.
Feature scaling ensures all numerical features are on the same scale, which improves model performance.
Many ML algorithms, such as K-means, SVM, and KNN, are sensitive to the scale of the data.
Feature scaling speeds up training and improves model accuracy.
Types of Scaling
1. Min-Max Scaling (Normalization)
X' = (X - Xmin) / (Xmax - Xmin)
Example: If Height ranges from 160 to 190 cm, after Min-Max Scaling:
o 160 cm → 0.0
o 190 cm → 1.0
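A sketch of Min-Max Scaling with scikit-learn's MinMaxScaler, using height values consistent with the example above (the 175 cm value is added for illustration).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

heights = np.array([[160.0], [175.0], [190.0]])   # heights in cm
scaled = MinMaxScaler().fit_transform(heights)
print(scaled.ravel())   # [0.  0.5 1. ]
```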
2. Standardization (Z-Score Scaling)
X' = (X - μ) / σ
Example: If the average Salary is 65,000 with a standard deviation of 35,000, then:
o Salary of 30,000 → -1.0 (below average)
o Salary of 120,000 → +1.57 (above average)
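A sketch of standardization with scikit-learn's StandardScaler; note that the scaler computes μ and σ from the data it is fitted on, not from the 65,000 / 35,000 figures quoted above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

salaries = np.array([[30_000.0], [65_000.0], [120_000.0]])   # assumed salary values
standardized = StandardScaler().fit_transform(salaries)      # (X - mean) / std
print(standardized.ravel())
```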
3. Robust Scaling (Using Median & IQR - Handles Outliers)
Robust Scaling is a feature scaling method that handles outliers by using the Interquartile Range
(IQR) instead of mean and standard deviation.
X' = (X - median(X)) / IQR(X)
The Interquartile Range (IQR) is a measure of statistical dispersion, which tells us how spread out
the middle 50% of a dataset is. It is useful for detecting outliers and understanding data distribution.
IQR=Q3−Q1
Where:
Q1 (1st Quartile / 25th Percentile): The value below which 25% of the data falls.
Q2 (Median / 50th Percentile): The middle value (not directly used in IQR calculation).
Q3 (3rd Quartile / 75th Percentile): The value below which 75% of the data falls.
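A sketch of robust scaling with scikit-learn's RobustScaler, which centers on the median and divides by the IQR; the salary values (including one extreme outlier) are assumptions.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

salaries = np.array([[40_000.0], [50_000.0], [60_000.0], [70_000.0], [500_000.0]])
robust = RobustScaler().fit_transform(salaries)   # (X - median) / IQR
print(robust.ravel())
```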
4. Log Transformation (Reduces Skewness)
X' = log(X + 1)
Example: If Salary varies from 25,000 to 120,000, log transformation reduces the gap
between small and large values.
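A sketch of the log transformation using NumPy's log1p, which computes log(X + 1); the salary values follow the example above (with one value added for illustration).

```python
import numpy as np

salaries = np.array([25_000.0, 60_000.0, 120_000.0])
log_scaled = np.log1p(salaries)   # log(X + 1)
print(log_scaled)                 # roughly [10.13, 11.0, 11.7]
```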