
Data Preprocessing

• Large datasets often contain various types of data, including structured tables, images, audio files, and videos.
• However, machine learning algorithms cannot directly process raw text, images, or videos, as they only understand numerical representations (1s and 0s).
• Therefore, it is essential to transform or encode the dataset into a suitable format before applying machine learning techniques.
• By converting the data into meaningful numerical features, the algorithm can effectively interpret and learn patterns, enabling accurate predictions and analysis.

Steps of Data Preprocessing

Not all of these steps are applicable to every problem.

• Data Quality Assessment

• Feature Aggregation

• Feature Discretization

• Feature Sampling

• Dimensionality Reduction

• Feature Encoding

• Feature Scaling

Data Quality Assessment

The first step when working with a dataset is to assess its quality. Common issues include:

• Missing values
• Outliers
• Inconsistent values
• Duplicate values
These issues need to be addressed (typically with data-handling tools) to ensure data quality and improve the performance of machine learning models.
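As a minimal sketch of these quality checks (assuming the data is loaded into a pandas DataFrame with hypothetical age and salary columns; the thresholds are illustrative):

import pandas as pd

# Hypothetical dataset with typical quality problems
df = pd.DataFrame({
    "age":    [25, 30, None, 45, 45, 200],             # a missing value and an outlier (200)
    "salary": [30000, 32000, 40000, None, None, 35000],
})

# Missing values: count them, then fill (or drop)
print(df.isna().sum())
df["age"] = df["age"].fillna(df["age"].median())
df["salary"] = df["salary"].fillna(df["salary"].median())

# Duplicate values: drop exact duplicate rows
df = df.drop_duplicates()

# Outliers: a simple rule-of-thumb filter using the IQR
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[(df["age"] >= q1 - 1.5 * iqr) & (df["age"] <= q3 + 1.5 * iqr)]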
Feature Aggregation

Sometimes it is useful to aggregate values to organize the data and present it from a better perspective.

For example:

• Day-to-day transactions of a product record the daily sales of that product in various store locations over the year.
• Aggregating these transactions into single store-wise monthly or yearly figures reduces the number of data objects.

Aggregating feature values has several advantages: it compresses the dataset, uses less memory, and requires less computation. However, aggregation cannot be applied to every dataset; we need to decide whether it is appropriate. If we actually need the day-to-day transactional detail, aggregation is not a good step.
Feature aggregation provides a high-level representation of the original dataset, making it easier to analyze.
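As a minimal sketch of this idea (assuming a hypothetical transactions DataFrame with store, date, and sales columns), daily transactions can be aggregated into store-wise monthly totals with pandas:

import pandas as pd

# Hypothetical day-to-day transactions for several stores
transactions = pd.DataFrame({
    "store": ["A", "A", "B", "B", "A"],
    "date":  pd.to_datetime(["2024-01-01", "2024-01-15", "2024-01-03", "2024-02-10", "2024-02-20"]),
    "sales": [120, 80, 200, 150, 90],
})

# Aggregate to store-wise monthly sales: far fewer data objects than the daily rows
monthly = (
    transactions
    .groupby(["store", transactions["date"].dt.to_period("M")])["sales"]
    .sum()
    .reset_index()
)
print(monthly)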

Feature Discretization
• Sometimes features have continuous values, and it is often easy to convert continuous values into discrete ones.
• For example, if age is a feature in our dataset, it is usually represented in years or months. However, in many cases we may not need the exact number to represent age. Instead, it can be categorized into groups such as "Young", "Middle-aged", and "Old".
• Discretizing age into these categories improves efficiency by eliminating the need to handle continuous values for that feature.
We must check whether a continuous value is necessary or whether it can be discretized. While converting continuous values to discrete ones often improves efficiency, it is not applicable to all features.
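A minimal sketch of discretization, assuming a hypothetical age column; the bin edges are illustrative, not taken from the original notes:

import pandas as pd

ages = pd.DataFrame({"age": [12, 25, 38, 47, 63, 81]})

# Bin the continuous age values into three ordered categories
ages["age_group"] = pd.cut(
    ages["age"],
    bins=[0, 30, 60, 120],                       # illustrative cut-offs
    labels=["Young", "Middle-aged", "Old"],
)
print(ages)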

Feature Sampling

• Sampling is a very common method for selecting a subset of the dataset we are analyzing.
• In most cases, working with the complete dataset can be too expensive given memory and time constraints.
• Sampling should be done in such a manner that the generated sample has approximately the same properties as the original dataset, i.e. the sample is representative.
There are different techniques for sampling (a small sketch of random sampling follows this list):
• Simple Random Sampling
• Sampling without Replacement: once an instance is selected, it is removed from the original dataset and cannot be chosen again.
• Sampling with Replacement: the same instance can be chosen multiple times.
• If the dataset is too large to work with as a whole, we use sampling.
• The sampling technique should be chosen depending on the problem.
• With simple random sampling there is a chance of getting an imbalanced dataset. An imbalanced dataset is one where the number of instances of one class (or classes) is significantly higher than another, thus leading to an imbalance. For example, in a patient dataset there may be many instances related to healthy people but only a few related to patients.
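As a minimal sketch of simple random sampling with and without replacement (assuming a hypothetical DataFrame df; the fraction and random_state are arbitrary):

import pandas as pd

df = pd.DataFrame({"feature": range(100), "label": [0] * 90 + [1] * 10})

# Sampling without replacement: each row can appear at most once in the sample
sample_without = df.sample(frac=0.2, replace=False, random_state=42)

# Sampling with replacement: the same row can be chosen multiple times
sample_with = df.sample(frac=0.2, replace=True, random_state=42)

# With a 90/10 class split, a small random sample can easily under-represent the minority class
print(sample_without["label"].value_counts())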
Stratified Sampling

Stratified sampling ensures that each class is proportionally represented in both the training and testing sets, reducing the risk of an imbalanced split. With large datasets, stratified sampling can be used to obtain a balanced, representative split.

Instance:

Suppose we have a patient dataset with three disease categories:

• Diabetes: 500 patients
• Heart Disease: 300 patients
• Lung Disease: 200 patients
• Total Patients: 1,000

Without Stratified Sampling (Random Split - Risk of Imbalance)

If we randomly split 20% of the data for testing, we might get:

• Diabetes: 120 patients
• Heart Disease: 50 patients
• Lung Disease: 30 patients

Here, the proportions are not maintained, and Lung Disease patients are underrepresented, which may
lead to biased predictions.

With Stratified Sampling (Balanced Split)

Using stratified sampling (20% test data), we get:

• Diabetes: 100 patients (from 500)
• Heart Disease: 60 patients (from 300)
• Lung Disease: 40 patients (from 200)

Now, the original proportions are preserved, ensuring fair representation of all disease classes in both
the training and testing sets.

Stratified sampling significantly reduces the chance of an imbalanced dataset, leading to a well-
trained machine learning model that performs better across all classes.
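A minimal sketch of a stratified train/test split (assuming scikit-learn is available and using hypothetical labels that match the counts above):

import pandas as pd
from sklearn.model_selection import train_test_split

# 500 Diabetes, 300 Heart Disease, 200 Lung Disease patients (labels only, for illustration)
labels = ["Diabetes"] * 500 + ["Heart Disease"] * 300 + ["Lung Disease"] * 200
patients = pd.DataFrame({"patient_id": range(1000), "disease": labels})

# stratify=... keeps the 50/30/20 class proportions in both splits
train, test = train_test_split(
    patients, test_size=0.2, stratify=patients["disease"], random_state=42
)
print(test["disease"].value_counts())   # approximately 100 / 60 / 40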

Dimensionality Reduction

• Most real-world datasets have a large number of features, and we often do not know which features are important. In such cases the number of features can be reduced; this is called dimensionality reduction.
• For example, in an image processing problem we might have to deal with thousands of features, also called dimensions.
Two widely accepted techniques are:

• Principal Component Analysis
• Singular Value Decomposition

Principal Component Analysis

• Principal Component Analysis (PCA) is a technique used to reduce the number of features in a dataset while keeping the most important information. Unlike feature subset selection, where we directly keep a few of the original features and discard the rest, PCA finds new features, called principal components, that capture the most variation in the data.

• This helps in visualization, noise reduction, and improving model performance. Singular Value Decomposition (SVD) is a method of breaking a matrix down into three smaller matrices, making it useful for data compression, noise removal, and recommendation systems. PCA often uses SVD to find the principal components efficiently, especially for large datasets. Both methods help in handling high-dimensional data in machine learning.
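As a minimal sketch of PCA with scikit-learn (the data and the choice of two components are purely illustrative):

import numpy as np
from sklearn.decomposition import PCA

# Illustrative dataset: 100 samples with 5 correlated features
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
X = np.hstack([base, base @ rng.normal(size=(2, 3)) + 0.01 * rng.normal(size=(100, 3))])

# Project the 5 original features onto 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                     # (100, 2)
print(pca.explained_variance_ratio_)       # share of variance captured by each component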

Singular Value Decomposition

• In this technique, we do not know which features are important or which ones to select or remove. Instead, we create a new dataset with new features, where each new feature is a combination of some of the original features.
• A single new feature may be formed by combining multiple original features.
• As a result, the number of features in the new dataset is smaller than in the original dataset.

These are the dimensionality reduction tasks. Features represent dimensions in a dataset: if there is one feature, it is one-dimensional; if there are two features, it is two-dimensional; and if there are five features, it is five-dimensional. More features mean more dimensions in the data.
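A minimal sketch of SVD-based reduction, here using NumPy's SVD to build a smaller set of combined features (the matrix and the choice of two components are illustrative):

import numpy as np

# Illustrative data matrix: 6 samples x 4 original features
X = np.array([
    [2.0, 4.0, 1.0, 3.0],
    [1.0, 3.0, 0.0, 2.0],
    [4.0, 8.0, 2.0, 6.0],
    [3.0, 5.0, 1.0, 4.0],
    [0.0, 1.0, 0.0, 1.0],
    [2.0, 5.0, 1.0, 3.0],
])

# Decompose X into three matrices: X ≈ U * diag(S) * Vt
U, S, Vt = np.linalg.svd(X, full_matrices=False)

# Keep only the top-2 components: each new feature is a combination of the originals
k = 2
X_reduced = U[:, :k] * S[:k]
print(X_reduced.shape)   # (6, 2) -- fewer features than the original 4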

Feature Encoding

Feature encoding is the process of converting categorical or numerical features into a form that machine learning models can understand. Different types of features require different encoding techniques; a short sketch of the categorical encodings follows the list below.

Types of Features in Machine Learning

Machine Learning deals with four types of features:


1. Nominal Features (Categorical - No Order)
o These are labels without any specific order.
o Example: Colors (Red, Blue, Green), Cities (New York, London, Paris).
o Encoding Methods:
   - One-Hot Encoding
   - Label Encoding
   - Binary Encoding

2. Ordinal Features (Categorical - With Order)
o These features have a meaningful order, but the difference between them is not measurable.
o Example: Customer service ratings (Poor < Average < Good < Excellent).
o Encoding Methods:
   - Ordinal Encoding
   - Label Encoding

3. Interval Features (Numerical - No True Zero)
o These are numeric values where the difference is meaningful, but there is no true zero.
o Example: Temperature in Celsius (0°C is not the absence of temperature).
o Encoding/transformation: new_value = a * old_value + b, where a and b are constants.

4. Ratio Features (Numerical - True Zero Exists)
o These are numeric values where both the difference and the ratio are meaningful, and zero means the absence of the value.
o Example: Age, Salary, Weight, Height.
o Encoding/transformation: new_value = a * old_value.
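As a minimal sketch of the categorical encodings named above (one-hot, label, and ordinal), using pandas and scikit-learn on a hypothetical toy frame:

import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

df = pd.DataFrame({
    "color":  ["Red", "Blue", "Green", "Blue"],            # nominal (no order)
    "rating": ["Poor", "Good", "Excellent", "Average"],    # ordinal (ordered)
})

# One-hot encoding for the nominal feature: one 0/1 column per category
one_hot = pd.get_dummies(df["color"], prefix="color")

# Label encoding: assigns an arbitrary integer to each category
label_encoded = LabelEncoder().fit_transform(df["color"])

# Ordinal encoding: integers that respect the declared order Poor < Average < Good < Excellent
ordinal = OrdinalEncoder(categories=[["Poor", "Average", "Good", "Excellent"]])
rating_encoded = ordinal.fit_transform(df[["rating"]])

print(one_hot)
print(label_encoded)           # e.g. [2 0 1 0]
print(rating_encoded.ravel())  # [0. 2. 3. 1.]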

Feature Scaling

Feature scaling is the process of transforming numerical features onto a common range so that no feature dominates others because of differences in scale; features with larger ranges can otherwise dominate those with smaller ranges.

For example, if one height value is recorded as 140 (in cm) and another as 8.2 (in feet), an ML algorithm gives more weight to 140 simply because the number is larger.
• Feature scaling ensures all numerical features are on the same scale, which helps model performance.
• Many ML algorithms, such as K-means, SVM, and KNN, are sensitive to the scale of the data.
• Feature scaling can speed up training and improve model accuracy.

Emp ID    Height (cm)    Salary (USD)
101       160            25,000
102       170            30,000
103       175            60,000
104       180            90,000
105       190            120,000

Height (cm) varies between 160 and 190.
Salary (USD) varies between 25,000 and 120,000 (a much larger scale).
Since the Salary values are numerically much larger than the Height values, Salary may dominate machine learning models.

Types of Scaling

1. Min-Max Scaling (Normalization: 0 to 1)

• Converts all values to a fixed range [0, 1] using the formula:

X' = (X - X_min) / (X_max - X_min)

• Example: If Height ranges from 160 to 190 cm, after Min-Max Scaling:
o 160 cm → 0.0
o 190 cm → 1.0
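A minimal sketch with scikit-learn's MinMaxScaler on the height values from the table above:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

heights = np.array([[160.0], [170.0], [175.0], [180.0], [190.0]])

# Rescale to [0, 1]: (X - X_min) / (X_max - X_min)
scaled = MinMaxScaler().fit_transform(heights)
print(scaled.ravel())   # approximately [0.  0.333  0.5  0.667  1. ]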

2. Standardization (Z-score Normalization)

• Converts values to have mean = 0 and standard deviation = 1:

X' = (X - μ) / σ

• Example: If the average Salary is 65,000 with a standard deviation of 35,000, then:
o Salary of 30,000 → -1.0 (below average)
o Salary of 120,000 → +1.57 (above average)
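A minimal sketch with StandardScaler on the salary column (scikit-learn uses the actual computed mean and standard deviation, so the numbers differ slightly from the rounded 35,000 used above):

import numpy as np
from sklearn.preprocessing import StandardScaler

salaries = np.array([[25000.0], [30000.0], [60000.0], [90000.0], [120000.0]])

# Standardize: (X - mean) / std
scaled = StandardScaler().fit_transform(salaries)
print(scaled.ravel())   # roughly [-1.11 -0.97 -0.14  0.69  1.53]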
3. Robust Scaling (Using Median & IQR - Handles Outliers)

Robust Scaling is a feature scaling method that handles outliers by using the median and the Interquartile Range (IQR) instead of the mean and standard deviation.

X' = (X - median(X)) / IQR(X)

Interquartile Range (IQR) – Explained Simply

The Interquartile Range (IQR) is a measure of statistical dispersion, which tells us how spread out
the middle 50% of a dataset is. It is useful for detecting outliers and understanding data distribution.

Formula for IQR

IQR=Q3−Q1

Where:

• Q1 (1st Quartile / 25th Percentile): The value below which 25% of the data falls.
• Q2 (Median / 50th Percentile): The middle value (not directly used in the IQR calculation).
• Q3 (3rd Quartile / 75th Percentile): The value below which 75% of the data falls.
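A minimal sketch of robust scaling with scikit-learn's RobustScaler, which implements the median/IQR formula above (the salary values are taken from the table, purely for illustration):

import numpy as np
from sklearn.preprocessing import RobustScaler

salaries = np.array([[25000.0], [30000.0], [60000.0], [90000.0], [120000.0]])

# (X - median) / IQR, so a few extreme values have little influence on the scaling
scaled = RobustScaler().fit_transform(salaries)
print(scaled.ravel())   # approximately [-0.58 -0.5  0.   0.5  1. ]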

4. Log Transformation (For Skewed Data)

• Converts data using a logarithm to reduce large value differences:

X' = log(X + 1)

• Example: If Salary varies from 25,000 to 120,000, the log transformation reduces the gap between small and large values.
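A minimal sketch of the log transformation using NumPy's log1p, which computes log(X + 1):

import numpy as np

salaries = np.array([25000.0, 30000.0, 60000.0, 90000.0, 120000.0])

# log(X + 1) compresses the range: a ~5x spread in salary becomes a difference of about 1.6 in log units
log_salaries = np.log1p(salaries)
print(log_salaries)   # roughly [10.13 10.31 11.   11.41 11.7 ]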
