Data Preprocessing

 Data preprocessing is a crucial step in data mining, involving the preparation and transformation of raw data into a clean, structured, and analyzable format.
 In most real-world scenarios, data is incomplete, noisy, and inconsistent, making it unsuitable for direct analysis.
 Data preprocessing improves the quality of the data and ensures reliable and meaningful patterns can be extracted.
Key steps involved in data preprocessing:
1. Data Cleaning
Data often contains missing values, errors and outliers that need to be addressed.
Cleaning ensures data consistency and accuracy.
Handling Missing Data:

o Ignore the tuples with missing values (only if minimal impact).
o Fill missing values using:
 Mean, median, or mode (for numerical data).
 Most frequent value (for categorical data).
 Interpolation or regression-based techniques.
o Use imputation models or algorithms to predict the missing
values.
 Imputation refers to the process of replacing
missing or inconsistent data with substitute
values to ensure that the dataset remains
complete and suitable for analysis.

 Missing values are common in real-world datasets due to various reasons, such as:
 data entry errors,
 sensor malfunction, or
 skipped survey questions.
 Imputation provides a strategy to handle these gaps without discarding valuable data.
Types of Imputation Techniques:
1. Simple Imputation
 Mean Imputation: Replacing missing values
with the mean of the available data in the
same column.
Example: If a column has values [5, NaN, 10],
replace the missing value with (5 + 10) / 2 = 7.5.

 Median Imputation: Replacing missing values with the median of the column.
 Mode Imputation: For categorical variables, replacing missing values with the most frequent value (mode).
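A minimal sketch of simple imputation using pandas (the column names and values here are illustrative, not from the slides):

import pandas as pd

df = pd.DataFrame({"age": [25, None, 40, 35], "city": ["NY", None, "NY", "LA"]})

# Mean imputation for a numerical column
df["age"] = df["age"].fillna(df["age"].mean())

# Mode (most frequent value) imputation for a categorical column
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df)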
2. Advanced Imputation
• K-Nearest Neighbors (KNN) Imputation:
Uses the values of the K closest observations
to estimate the missing value.
• Regression Imputation: Predicts missing
values based on a regression model trained on
other variables in the dataset.
• Multiple Imputation: Generates several
possible values for the missing data using
statistical models, creating multiple complete
datasets, and pooling the results.
• Hot-deck Imputation: Fills in missing data by
randomly selecting values from similar (donor) records.
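A minimal sketch of KNN imputation with scikit-learn's KNNImputer (the small array is illustrative):

import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [2.0, np.nan],
              [3.0, 6.0],
              [4.0, 8.0]])

# Each missing entry is estimated from the 2 most similar complete rows
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)
print(X_imputed)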
3. Domain-specific or Logical Imputation
• Imputing based on business rules or domain
knowledge
(for example, if "Age" is missing, infer a value based on
job title or location).

4. Time-series Imputation
• Forward fill (FFill):
Propagates the last known value forward.
• Backward fill (BFill):
Uses the next known value to fill previous missing
ones.
• Interpolation:
Estimates missing values using linear or other interpolation
methods based on neighboring time points.
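A short pandas sketch of forward fill, backward fill, and linear interpolation (the series is illustrative):

import pandas as pd

s = pd.Series([1.0, None, None, 4.0])

print(s.ffill())        # forward fill:         1, 1, 1, 4
print(s.bfill())        # backward fill:        1, 4, 4, 4
print(s.interpolate())  # linear interpolation: 1, 2, 3, 4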
Why Use Imputation?
• Preserves data:
Avoids dropping rows or columns with
missing data, which could reduce statistical
power.
• Reduces bias:
Ensures that the dataset remains
representative and unbiased.
• Improves model performance:
Many machine learning algorithms can’t
handle missing data directly, so imputation
ensures the model performs well.
Challenges of Imputation
• Bias introduction:
Imputation may introduce bias if the data is not
missing at random (i.e., there are systematic
missing-data patterns).
• Overfitting:
Some advanced imputation techniques
(like multiple imputation) can lead to
overfitting if not used carefully.
• Loss of variability:
Techniques like mean imputation reduce
the natural variance in the data.
Removing Noise:
o Apply smoothing techniques (e.g., moving average,
binning) to reduce random noise.
o Use clustering to identify and remove outliers.
Correcting Inconsistencies:
o Fix data entry errors (e.g., typos).
o Ensure uniform data formats (e.g., dates, units).
Importance of data cleaning:
 To ensure high data integrity
 To check whether any bias was introduced during data collection
 Data should be in line with the problem statement
 Data should have a complete sample size
 There should be uniformity in the data

2. Data Integration
When data comes from multiple sources (databases,
spreadsheets, APIs), it must be combined into a unified
format.
 Schema Integration: Align similar attributes across
datasets (e.g., "customer_id" vs. "client_id").
 Entity Matching: Identify and merge duplicate records
(e.g., same person under two slightly different names).
 Conflict Resolution: Resolve data conflicts by setting
rules (e.g., prioritize the latest value when duplicate
entries exist).
3. Data Transformation
Transformations make the data more suitable for mining algorithms.
 Normalization (or Scaling):
Transform data into a specific range (e.g., 0 to 1 or -1 to 1).
o Techniques: Min-Max normalization, Z-score standardization, or
decimal scaling.
Normalization refers to transforming data so that it falls within a
specific range (e.g., 0 to 1 or -1 to 1).

This process ensures that numerical values across features are scaled
uniformly, which is essential when working with algorithms that are
sensitive to differences in magnitude (e.g., K-nearest neighbors, neural
networks, and gradient-based methods like logistic regression).
Why is Normalization Important?
1. Prevents Bias in Models:
o Features with larger ranges (e.g., income in millions vs. age in years)
can dominate the learning process if not scaled.
2. Improves Model Convergence:
o Many machine learning algorithms (e.g., neural networks) converge
faster when the data is normalized.
3. Makes Distance Metrics Meaningful:
o Algorithms like K-means clustering and K-nearest neighbors (KNN)
use distance metrics. Normalization ensures no feature
disproportionately affects the result.
4. Reduces Computational Complexity:
o Scaling data helps models operate more efficiently by keeping the
feature values within a comparable range.
Types of Normalization Techniques
1. Min-Max Normalization
2. Z-Score Normalization (Standardization)
3. Decimal Scaling Normalization
4. Max Absolute Scaling

Min-Max Normalization
Min-Max normalization scales the data within a specified
range, usually between 0 and 1.
Formula: Xnormalized = (X − Xmin) / (Xmax − Xmin)
Explanation
• X: The original value.
• Xmin​: The minimum value in the dataset/column.
• Xmax​: The maximum value in the
dataset/column.
• Xnormalized​: The normalized value, which will be
scaled between 0 and 1.
Example:
o Data: [10, 20, 30, 40, 50]
o If we want to normalize the value X=30:
o Xnormalized = (30 − 10) / (50 − 10) = 20/40 = 0.5
o After normalization, the value 30 becomes 0.5.
Advantages:
o Simple and easy to implement.
o Preserves the relationship between original values.
Disadvantages:
o Sensitive to outliers.
o A single outlier can significantly affect the range and distort the
results.
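A minimal NumPy sketch of min-max normalization, reproducing the slide's example data:

import numpy as np

data = np.array([10, 20, 30, 40, 50], dtype=float)

# X_normalized = (X - X_min) / (X_max - X_min)
normalized = (data - data.min()) / (data.max() - data.min())
print(normalized)  # [0.   0.25 0.5  0.75 1.  ] -- the value 30 maps to 0.5, as in the example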
Z-Score Normalization (Standardization)

 Z-score normalization centers the data around the mean and
scales it according to the standard deviation.
 The transformed data will have a mean of 0 and a standard
deviation of 1.
Formula: Z = (X − μ) / σ
​Explanation
• Z: The normalized value (Z-score).
• X: The original value.
• μ: The mean of the dataset/column.
• σ: The standard deviation of the dataset/column.
Example:
o Data: [10, 20, 30, 40, 50]
o Mean (μ) = 30, sample standard deviation (σ) ≈ 15.81
o Normalized value of 40:
o Z = (40 − 30) / 15.81 ≈ 0.63
 Advantages:
o Useful when the data follows a normal distribution.
o Handles outliers better than Min-Max normalization.
 Disadvantages:
o If the data is not normally distributed, Z-score normalization
might not be effective.
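A minimal NumPy sketch of z-score standardization on the slide's example data (using the sample standard deviation, ddof=1, which gives the ≈15.81 quoted above):

import numpy as np

data = np.array([10, 20, 30, 40, 50], dtype=float)

mu = data.mean()            # 30
sigma = data.std(ddof=1)    # ~15.81 (sample standard deviation)
z = (data - mu) / sigma
print(z)                    # the value 40 maps to ~0.63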
Why Use Z-score Normalization?
• Centers the data around 0 with a standard
deviation of 1.
• Useful for algorithms like
k-means clustering or
principal component analysis (PCA).
• Handles outliers better than min-max
normalization.
Decimal Scaling Normalization
Decimal scaling normalization scales the data by moving the
decimal point, based on the largest absolute value in the data.
Xnormalized = X / 10^j
Explanation
• X: The original value.
• Xnormalized​: The normalized value.
• j: The smallest integer such that X/10^j results in
a value between -1 and 1.
How to Determine j
• j is the number of digits in the largest absolute
value in the dataset.
• For example, if the largest absolute value is 987,
then j=3 because 10^3 =1000.
Example
 Consider a dataset: [5,50,500].
• The largest absolute value is 500, so j=3.
 To normalize X=50:
 Xnormalized = 50 / 10^3 = 50 / 1000 = 0.05
 So, the normalized value of 50 is 0.05.
Advantages:
o Simple to implement.
o Effective when the data varies over different magnitudes
(e.g., 100 vs. 1,000,000).
Disadvantages:
o Less flexible for datasets with outliers or complex
distributions.
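A minimal Python sketch of decimal scaling, using the slide's rule that j is the number of digits in the largest absolute value (the data is the slide's example):

import numpy as np

data = np.array([5, 50, 500], dtype=float)

# j = number of digits in the largest absolute value (here 500 -> j = 3)
j = len(str(int(np.abs(data).max())))
scaled = data / 10**j
print(j, scaled)  # 3 [0.005 0.05  0.5  ]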
4. Max Absolute Scaling
Max Absolute Scaling scales the data to the range [-1, 1]
based on the maximum absolute value in the feature.
Formula:
X′ = X / Xmax
where Xmax is the maximum absolute value of the feature.

Example:
o Data: [2, 5, -3, 10]
o Maximum absolute value = 10
o Normalized data: [0.2, 0.5, -0.3, 1.0]
Advantages:
o Useful when data contains both positive and
negative values.
o Maintains data sparsity for sparse datasets.
Disadvantages:
o Sensitive to outliers, as the maximum absolute value
determines the scaling.
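A minimal NumPy sketch of max absolute scaling on the slide's example data (scikit-learn's MaxAbsScaler applies the same idea per feature):

import numpy as np

data = np.array([2, 5, -3, 10], dtype=float)

# Divide by the maximum absolute value so the result lies in [-1, 1]
scaled = data / np.abs(data).max()
print(scaled)  # [ 0.2  0.5 -0.3  1. ]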
When to Use Each Normalization Technique:
Min-Max Normalization:
o When data has no outliers and you need to scale it within a specific range (e.g., 0
to 1).
o Useful for neural networks and other algorithms that expect input values to be
bounded.
Z-Score Normalization:
o When the data follows a normal distribution and you need to normalize for
algorithms like logistic regression, k-means, or principal component analysis (PCA).
Decimal Scaling:
o Useful when the data varies over several orders of magnitude, like in financial
data.
Max Absolute Scaling:
o When the data has both positive and negative values and you want to maintain
sparsity (e.g., in recommendation systems).
Applications of Normalization in Data Mining
1. Clustering:
o Algorithms like K-means use Euclidean distance to measure similarity, which
is sensitive to the magnitude of features. Normalization ensures no feature
dominates the distance calculation.
2. Classification:
o Models like K-nearest neighbors (KNN) and support vector machines (SVM)
perform better with normalized data.
3. Neural Networks:
o Inputs to neural networks are often normalized to ensure faster convergence
and better training performance.
4. Principal Component Analysis (PCA):

o PCA is sensitive to the variance of features, so normalization ensures that all
features contribute equally.
 Note:
Normalization is a critical step in data preprocessing that
ensures fair treatment of features, speeds up convergence,
and improves the performance of data mining algorithms.
Depending on the data characteristics and the algorithm being
used, different normalization techniques such as:
 Min-Max,
 Z-score, or
 decimal scaling
can be applied to make the data more suitable for analysis.
Encoding Categorical Data:
o Label Encoding: Assign unique numeric values to categories.
o One-Hot Encoding: Create binary columns for each category.
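A minimal pandas sketch of both encodings (the "color" column is illustrative; scikit-learn's LabelEncoder and OneHotEncoder are common alternatives):

import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Label encoding: map each category to an integer code
df["color_label"] = df["color"].astype("category").cat.codes

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(df["color"], prefix="color")
print(df.join(one_hot))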
Discretization and Binning:
o Convert continuous attributes into discrete intervals or
categories.
o Example: Age groups (0–18, 19–35, 36–60, etc.).
Feature Engineering:
Derive new features from raw data to improve analysis.
For example, creating a “yearly growth” metric from monthly sales
data.
4. Data Reduction
To reduce the computational cost and improve model efficiency, redundant and irrelevant
data must be removed.
 Dimensionality Reduction: Reduce the number of features while retaining essential
information.
Techniques:
o Principal Component Analysis (PCA),
o Linear Discriminant Analysis (LDA).
 Feature Selection: Use statistical methods to identify the most relevant attributes.
Techniques:
 Correlation analysis,
 forward selection,
 backward elimination.
 Sampling: Select a subset of the data to reduce size.
Types:
o Random sampling,
o stratified sampling.
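A minimal scikit-learn sketch of dimensionality reduction with PCA (the random 5-feature matrix is purely illustrative):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # hypothetical dataset: 100 rows, 5 features

# Standardize first, since PCA is sensitive to the variance of each feature
X_scaled = StandardScaler().fit_transform(X)

# Keep only the top 2 principal components
X_reduced = PCA(n_components=2).fit_transform(X_scaled)
print(X_reduced.shape)                 # (100, 2)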
5. Data Discretization
This involves dividing continuous data into smaller intervals, making it easier
to analyze.
 Equal-Width Binning: Divides data into intervals of equal width.
 Equal-Frequency Binning: Divides data so that each bin contains an equal
number of records.
 Cluster-based Binning: Uses clustering to determine bin boundaries.

 Binning is a data preprocessing technique used to convert continuous
numerical data into discrete intervals or categories.
It groups a range of numeric values into smaller, manageable bins, which
makes the data easier to analyze and helps reduce noise or outliers.
Binning is widely used in data mining, machine learning, and statistical
analysis, especially when converting numerical features for classification or
association rule mining.
Why Use Binning?
1. Simplifies data:
Continuous data is often complex to analyze; binning reduces this
complexity.
2. Reduces Noise:
Small variations in the data (noise) are smoothed by grouping
values into bins.
3. Improves Interpretability:
Grouped intervals (e.g., "Age: 18–35") provide a more intuitive
understanding of the data.
4. Enables Compatibility:
Some algorithms work better with discrete data (e.g., decision
trees, Naive Bayes).
Types of Binning Techniques
1. Equal-Width Binning
o The range of the data is divided into equal-sized intervals (bins).
o Formula for bin width:
Bin Width = (Max Value − Min Value) / Number of Bins
o Example:
 Data: [3, 7, 15, 20, 22, 30, 40]
 Bins: [0–10], [10–20], [20–30], [30–40]
 Grouping:
 3 → Bin 1, 7 → Bin 1, 15 → Bin 2, etc.
Advantages:
o Simple to implement.
o Ensures each bin has the same range.
Disadvantages:
o Can result in uneven distribution (some bins may contain many values, others
few or none).
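A short pandas sketch of equal-width binning on the slide's example data (the slide uses rounded boundaries 0–40 rather than the exact formula-based width):

import pandas as pd

data = pd.Series([3, 7, 15, 20, 22, 30, 40])

# Equal-width bins with the boundaries used in the example; pd.cut(data, bins=4)
# would instead compute four equal-width bins from the data's min and max
binned = pd.cut(data, bins=[0, 10, 20, 30, 40])
print(binned.value_counts().sort_index())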
2. Equal-Frequency Binning (or Quantile Binning)
o Each bin contains approximately the same number of data points (frequencies).
o Useful when the data is skewed or unevenly distributed.
o Example:
 Data: [2, 5, 8, 15, 18, 20, 35]
 If we create 3 bins, each bin will contain approximately 2–3 data points.
 Bins:
 Bin 1: [2, 5]
 Bin 2: [8, 15]
 Bin 3: [18, 20, 35]
Advantages:
o Balances the number of data points per bin, preventing sparsely populated bins.
Disadvantages:
o Intervals may vary in width, making them harder to interpret.
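A short pandas sketch of equal-frequency (quantile) binning on the slide's example data:

import pandas as pd

data = pd.Series([2, 5, 8, 15, 18, 20, 35])

# Three quantile-based bins, each holding roughly the same number of points
binned = pd.qcut(data, q=3)
print(binned.value_counts().sort_index())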


3. Cluster-Based Binning
o Uses clustering algorithms (e.g., K-means) to group similar
values into bins.
o Clusters are treated as bins, where data points in each cluster
are closer to each other.
Example: A set of house prices might be clustered into low, medium,
and high price bins.
Advantages:
o Results in natural groupings based on data patterns.
o Handles non-uniform distributions better than other binning
methods.
Disadvantages:
o Computationally more expensive than simple binning, and the
number of clusters must be chosen in advance.
4. Adaptive Binning
o Bins are created based on specific conditions or
domain knowledge (e.g., ages grouped as "child,"
"teen," "adult").
o Often used for domain-specific datasets like
demographics or medical data.
Binning Methods: Smoothing Techniques
 Smoothing by Binning Mean: Each value in a bin is replaced by the
mean of the bin.
 Smoothing by Binning Median: Each value is replaced by the median
of the bin.
 Smoothing by Binning Boundaries: Each value is replaced by the
closest boundary value (either the bin’s minimum or maximum).
Example of Smoothing:
Data: [4, 8, 9, 15, 17, 22]
Bins: [4–10], [11–20], [21–30]
Smoothing by mean: Replace values in each bin with the bin’s mean.
o Bin 1 → Mean = (4 + 8 + 9) / 3 = 7
o Bin 2 → Mean = (15 + 17) / 2 = 16
o Smoothed data: [7, 7, 7, 16, 16, 22]
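A short pandas sketch of smoothing by bin means, reproducing the slide's example:

import pandas as pd

data = pd.Series([4, 8, 9, 15, 17, 22])

# Assign each value to a bin, then replace it with the mean of its bin
bins = pd.cut(data, bins=[0, 10, 20, 30])
smoothed = data.groupby(bins).transform("mean")
print(smoothed.tolist())  # [7.0, 7.0, 7.0, 16.0, 16.0, 22.0]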
Challenges in Binning
 Loss of Information:
Aggregation can cause a loss of detailed information.
 Choosing the Right Number of Bins:
Too many bins might retain noise, while too few might
oversimplify data.
 Handling Skewed Data:
Requires careful selection of binning techniques (e.g.,
equal-frequency binning works better on skewed data).
6. Data Balancing
In cases of imbalanced datasets (e.g., rare events like
fraud detection), balancing improves the performance of
mining models.
 Oversampling:
Duplicate minority class samples.
 Undersampling:
Reduce the number of majority class samples.
 SMOTE (Synthetic Minority Over-sampling Technique):
Generate synthetic samples for the minority class.
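A minimal sketch of SMOTE using the imbalanced-learn package (the synthetic dataset is illustrative; this assumes imbalanced-learn is installed):

from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# A hypothetical 90/10 imbalanced binary classification dataset
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print(Counter(y))

# SMOTE generates synthetic minority-class samples until the classes are balanced
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print(Counter(y_res))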
7. Data Partitioning
Before analysis, the dataset is divided into training, validation, and
test sets to ensure unbiased model evaluation.
Training Set: Used to train the model.
Validation Set: Used to tune hyperparameters and avoid overfitting.
Test Set: Used to evaluate the final model performance.
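A minimal scikit-learn sketch of a 70/15/15 train/validation/test split (the Iris dataset is just a stand-in for any feature matrix and label vector):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 15% as the test set, then split the remainder into train / validation
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.15, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.15 / 0.85, random_state=0
)
print(len(X_train), len(X_val), len(X_test))  # roughly 70% / 15% / 15% of the data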

 In data mining and machine learning, a validation set is a crucial
part of the data split (usually alongside a training set and test set).
 It plays an essential role in tuning hyperparameters and helps
prevent overfitting, ensuring the model generalizes well to new,
unseen data.
What is a Validation Set?

 The validation set is a portion of the original dataset, separate from the
training set and test set.
 It is used to evaluate the model during training to fine-tune hyperparameters
(e.g., learning rate, regularization strength, number of layers).
 Unlike the test set (which assesses the final model’s performance), the
validation set helps to choose the best model configuration.

Hyperparameter Tuning Using a Validation Set

What Are Hyperparameters?

 Hyperparameters are parameters that control how the model is trained (e.g.,
the learning rate of a neural network, the number of trees in a random forest).
 These values are not learned by the model itself but must be specified before
training starts.
How the Validation Set Helps in Tuning Hyperparameters:
1. Train the Model on the Training Set:
o Use the training set to fit the model based on the data
patterns.
2. Evaluate on the Validation Set:
o After each iteration or training session, the model’s
performance is checked on the validation set to assess
how well it generalizes to unseen data.
o Common metrics include accuracy, RMSE (Root Mean
Squared Error), AUC (Area Under the Curve), etc.
3. Select the Best Hyperparameters:
o Multiple models are trained using different hyperparameter
values (e.g., different learning rates or depths).
o The hyperparameters that produce the best performance on
the validation set are chosen.
o This process is often automated using:
 Grid Search: Tries all possible combinations of
hyperparameters.
 Random Search: Randomly selects hyperparameter
combinations.
 Bayesian Optimization: Uses probabilistic models to search
more efficiently.
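A minimal sketch of automated hyperparameter search with scikit-learn's GridSearchCV, which uses cross-validation folds as its validation data (the Iris dataset and n_neighbors grid are illustrative):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Try several values of the hyperparameter n_neighbors, scored by 5-fold cross-validation
param_grid = {"n_neighbors": [1, 3, 5, 7]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)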
Example of Hyperparameter Tuning:
oTraining a neural network:
 Hyperparameter: Learning rate.
 Models trained with learning rates of 0.001,
0.01, and 0.1.
 Validation set shows 0.01 gives the best
accuracy.
 This learning rate is selected for the final
model.
Avoiding Overfitting Using the Validation Set
What is Overfitting?

 Overfitting occurs when the model performs exceptionally well on the training
data but fails to generalize to new, unseen data (validation or test sets).
 It indicates the model has memorized the training data rather than learning
general patterns.

How Validation Helps Avoid Overfitting:

1. Early Stopping:
o Early stopping monitors the performance of the model on the validation set
during training.
o If validation accuracy stops improving (or starts to decline), the training is
stopped early to prevent overfitting.
o This ensures the model doesn’t learn noise or irrelevant patterns in the
training data.
2. Hyperparameter Regularization:
o Regularization techniques (e.g., L1/L2 regularization, dropout in
neural networks) help reduce overfitting.
o The strength of regularization is a hyperparameter that is fine-
tuned using the validation set.
o Example: In a regression model, increasing the L2 penalty (Ridge
regularization) might prevent the model from assigning overly
large weights to features.
3. Model Selection:
o Sometimes different models (e.g., decision trees vs. random
forests) are tested.
o The validation set helps select the best-performing model to
avoid overfitting on a particular type of data.
Example Process: Train, Validate, Test Workflow
1. Data Split:
o 70% Training Set, 15% Validation Set, 15% Test Set.
2. Training and Validation Loop:
o Train the model on the training set with initial hyperparameters.
o Evaluate on the validation set.
o Tune hyperparameters based on validation performance.
o Repeat the process with different hyperparameters until the best
combination is found.
3. Final Model Evaluation:
o After the hyperparameters are optimized, train the model on both the
training + validation sets.
o Use the test set only once for final performance evaluation (to get an
unbiased estimate).
Challenges of Using a Validation Set
 Data Leakage: If the same data is used in both the validation and training
phases, it can lead to biased results.
 Insufficient Data: Splitting the data into training, validation, and test sets can
result in smaller datasets, especially if the data is limited.
o Solution: Use techniques like cross-validation.

Cross-Validation as an Alternative

 In k-fold cross-validation, the dataset is divided into k folds.

 Each fold acts as the validation set exactly once, while the remaining data is
used for training.
 This method ensures that every data point is used for both training and
validation, improving the robustness of hyperparameter tuning.
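A minimal scikit-learn sketch of k-fold cross-validation (the Iris dataset and logistic regression model are illustrative):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: each fold serves as the validation set exactly once
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores, scores.mean())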
Importance of Data Preprocessing in Data Mining
1. Improves Data Quality: Clean and consistent data leads to more
accurate and meaningful results.
2. Enhances Model Performance: Normalization and feature
engineering ensure that models learn effectively.
3. Reduces Computational Complexity: Data reduction techniques
make algorithms more efficient.
4. Prevents Bias and Overfitting: Data balancing and partitioning
ensure the model generalizes well to new data.
5. Handles Real-World Challenges: Preprocessing deals with noise,
missing data, and inconsistencies that occur in most real-world
datasets.
THANK YOU
