Data Preprocessing
4. Time-series Imputation
• Forward fill (FFill):
Propagates the last known value forward.
• Backward fill (BFill):
Uses the next known value to fill previous missing
ones.
• Interpolation:
Estimates missing values using linear or other
interpolation methods between known points (see the
sketch below).
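A minimal pandas sketch of the three strategies on a small hypothetical daily series with gaps (values and dates are made up for illustration):

import numpy as np
import pandas as pd

# Hypothetical daily sensor readings with missing values
s = pd.Series(
    [21.0, np.nan, np.nan, 24.5, np.nan, 23.0],
    index=pd.date_range("2024-01-01", periods=6, freq="D"),
)

ffilled = s.ffill()                            # forward fill: carry the last known value forward
bfilled = s.bfill()                            # backward fill: pull the next known value backward
interpolated = s.interpolate(method="linear")  # estimate each gap on a line between known points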
Why Use Imputation?
• Preserves data:
Avoids dropping rows or columns with
missing data, which could reduce statistical
power.
• Reduces bias:
Ensures that the dataset remains
representative and unbiased.
• Improves model performance:
Many machine learning algorithms can’t
handle missing data directly, so imputation
ensures the model performs well.
Challenges of Imputation
• Bias introduction:
Imputation can introduce bias when data is not
missing at random (e.g., values missing
systematically for certain groups).
• Overfitting:
Some advanced imputation techniques
(like multiple imputation) can lead to
overfitting if not used carefully.
• Loss of variability:
Techniques like mean imputation reduce
the natural variance in the data.
Removing Noise:
o Apply smoothing techniques (e.g., moving average,
binning) to reduce random noise (see the sketch below).
o Use clustering to identify and remove outliers.
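A short sketch of smoothing on synthetic data, assuming pandas and NumPy are available (the signal is made up; binning here means replacing each value by its bin's mean):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical noisy signal: a slow trend plus random noise
signal = pd.Series(np.linspace(0, 10, 100) + rng.normal(0, 1.0, 100))

# Moving-average smoothing over a 5-point window
smoothed = signal.rolling(window=5, center=True).mean()

# Bin smoothing: cut into 10 equal-width bins and replace values by the bin mean
bins = pd.cut(signal, bins=10)
bin_smoothed = signal.groupby(bins, observed=True).transform("mean")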
Correcting Inconsistencies:
o Fix data entry errors (e.g., typos).
o Ensure uniform data formats (e.g., dates, units); see the sketch below.
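An illustrative pandas sketch; the column names, the typos, and the pound-to-kilogram fix are hypothetical examples of such inconsistencies:

import pandas as pd

# Hypothetical records with inconsistent entries
df = pd.DataFrame({
    "city": ["New York", " new york", "NEW  YORK"],
    "date": ["05/01/2024", "06/01/2024", "07/01/2024"],
    "weight_kg": [70.0, 154.0, 68.0],   # suppose 154 was mistakenly entered in pounds
})

# Fix entry errors: trim whitespace, collapse double spaces, standardize casing
df["city"] = (df["city"].str.strip()
                        .str.replace(r"\s+", " ", regex=True)
                        .str.title())

# Enforce a uniform date type from day/month/year strings
df["date"] = pd.to_datetime(df["date"], format="%d/%m/%Y")

# Unify units: convert the out-of-range value from pounds to kilograms
df.loc[df["weight_kg"] > 140, "weight_kg"] *= 0.4536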
Importance of data cleaning:
To ensure high data integrity, so that analyses and models are built on
accurate, consistent values.
Data Normalization
Normalization ensures that numerical values across features are scaled
uniformly, which is essential when working with algorithms that are
sensitive to differences in magnitude (e.g., K-nearest neighbors, neural
networks, and gradient-based methods like logistic regression).
Why is Normalization Important?
1. Prevents Bias in Models:
o Features with larger ranges (e.g., income in millions vs. age in years)
can dominate the learning process if not scaled.
2. Improves Model Convergence:
o Many machine learning algorithms (e.g., neural networks) converge
faster when the data is normalized.
3. Makes Distance Metrics Meaningful:
o Algorithms like K-means clustering and K-nearest neighbors (KNN)
use distance metrics. Normalization ensures no feature
disproportionately affects the result.
4. Reduces Computational Complexity:
o Scaling data helps models operate more efficiently by keeping the
values within a small, consistent numeric range.
Types of Normalization Techniques
1. Min-Max Normalization
2. Z-Score Normalization (Standardization)
3. Decimal Scaling Normalization
4. Max Absolute Scaling
Min-Max Normalization
Min-Max normalization scales the data within a specified
range, usually between 0 and 1.
Formula: X_normalized = (X − X_min) / (X_max − X_min)
Explanation
• X: The original value.
• X_min: The minimum value in the feature/column.
• X_max: The maximum value in the feature/column.
• X_normalized: The normalized value, scaled
between 0 and 1.
Example:
o Data: [10, 20, 30, 40, 50]
o If we want to normalize the value X=30:
o X_norm = (30 − 10) / (50 − 10) = 20 / 40 = 0.5
o After normalization, the value 30 becomes 0.5.
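The same calculation in NumPy, reproducing the worked example above (NumPy assumed available):

import numpy as np

data = np.array([10, 20, 30, 40, 50], dtype=float)

# Min-max normalization: (X - X_min) / (X_max - X_min)
x_min, x_max = data.min(), data.max()
normalized = (data - x_min) / (x_max - x_min)

print(normalized)   # [0.   0.25 0.5  0.75 1.  ] -> 30 maps to 0.5, as above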
Advantages:
o Simple and easy to implement.
o Preserves the relationship between original values.
Disadvantages:
o Sensitive to outliers.
o A single outlier can significantly affect the range and distort the
results.
Z-Score Normalization (Standardization)
Z-score normalization rescales each value using the feature's mean (μ) and
standard deviation (σ): X_standardized = (X − μ) / σ, so the transformed
feature has mean 0 and standard deviation 1.
Max Absolute Scaling
Max absolute scaling divides each value by the maximum absolute value in
the feature, mapping the data into the range [−1, 1].
Example:
o Data: [2, 5, -3, 10]
o Maximum absolute value = 10
o Normalized data: [0.2, 0.5, -0.3, 1.0]
Advantages:
o Useful when data contains both positive and
negative values.
o Maintains data sparsity for sparse datasets.
Disadvantages:
o Sensitive to outliers, as the maximum absolute value determines the
scaling.
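A brief scikit-learn sketch of both transforms on the example data above (scikit-learn assumed available):

import numpy as np
from sklearn.preprocessing import StandardScaler, MaxAbsScaler

X = np.array([[2.0], [5.0], [-3.0], [10.0]])     # one feature as a column vector

# Z-score standardization: subtract the mean, divide by the standard deviation
z_scored = StandardScaler().fit_transform(X)

# Max absolute scaling: divide by max(|X|) = 10, giving [0.2, 0.5, -0.3, 1.0]
max_abs = MaxAbsScaler().fit_transform(X)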
When to Use Each Normalization Technique:
Min-Max Normalization:
o When data has no outliers and you need to scale it within a specific range (e.g., 0
to 1).
o Useful for neural networks and other algorithms that expect input values to be
bounded.
Z-Score Normalization:
o When the data follows a normal distribution and you need to normalize for
algorithms like logistic regression, k-means, or principal component analysis (PCA).
Decimal Scaling:
o Useful when the data varies over several orders of magnitude, like in financial
data.
Max Absolute Scaling:
o When the data has both positive and negative values and you want to maintain
sparsity (e.g., in recommendation systems).
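Decimal scaling is not defined elsewhere in these notes, so the sketch below follows the usual textbook definition: divide by 10^j, where j is the smallest integer that brings every absolute value below 1 (the helper name is my own):

import numpy as np

def decimal_scale(values):
    # Divide by 10**j so that max(|value|) / 10**j < 1
    x = np.asarray(values, dtype=float)
    j = int(np.ceil(np.log10(np.abs(x).max())))
    return x / (10 ** j)

print(decimal_scale([734, -245, 89]))   # max |x| = 734 -> j = 3 -> [0.734, -0.245, 0.089]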
Applications of Normalization in Data Mining
1. Clustering:
o Algorithms like K-means use Euclidean distance to measure similarity, which
is sensitive to the magnitude of features. Normalization ensures no feature
dominates the distance calculation.
2. Classification:
o Models like K-nearest neighbors (KNN) and support vector machines (SVM)
perform better with normalized data.
3. Neural Networks:
o Inputs to neural networks are often normalized to ensure faster convergence
and better training performance.
4. Principal Component Analysis (PCA):
o PCA is sensitive to feature scales, so inputs are standardized first to keep
high-variance features from dominating the principal components (see the sketch below).
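A sketch tying normalization into these applications with scikit-learn pipelines (the dataset and parameter choices are illustrative, not prescribed by the notes):

from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Clustering: scale first so no feature dominates the Euclidean distances
kmeans = make_pipeline(StandardScaler(), KMeans(n_clusters=3, n_init=10, random_state=0)).fit(X)

# Classification: the same idea for a distance-based classifier
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)).fit(X_tr, y_tr)
print(knn.score(X_te, y_te))

# PCA: standardize so high-variance features do not dominate the components
pca = make_pipeline(StandardScaler(), PCA(n_components=2)).fit(X)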
The Validation Set
The validation set is a portion of the original dataset, separate from the
training set and test set.
It is used to evaluate the model during training to fine-tune hyperparameters
(e.g., learning rate, regularization strength, number of layers).
Unlike the test set (which assesses the final model’s performance), the
validation set helps to choose the best model configuration.
Hyperparameters are parameters that control how the model is trained (e.g.,
the learning rate of a neural network, the number of trees in a random forest).
These values are not learned by the model itself but must be specified before
training starts.
How the Validation Set Helps in Tuning Hyperparameters:
1. Train the Model on the Training Set:
o Use the training set to fit the model based on the data
patterns.
2. Evaluate on the Validation Set:
o After each iteration or training session, the model’s
performance is checked on the validation set to assess
how well it generalizes to unseen data.
o Common metrics include accuracy, RMSE (Root Mean
Squared Error), AUC (Area Under the Curve), etc.
3. Select the Best Hyperparameters:
o Multiple models are trained using different hyperparameter
values (e.g., different learning rates or depths).
o The hyperparameters that produce the best performance on
the validation set are chosen.
o This process is often automated using:
Grid Search: Tries all possible combinations of
hyperparameters.
Random Search: Randomly selects hyperparameter
combinations.
Bayesian Optimization: Uses probabilistic models to search
more efficiently.
Example of Hyperparameter Tuning:
o Training a neural network:
Hyperparameter: Learning rate.
Models trained with learning rates of 0.001,
0.01, and 0.1.
Validation set shows 0.01 gives the best
accuracy.
This learning rate is selected for the final
model.
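A minimal scikit-learn version of this loop, using MLPClassifier on synthetic data (the dataset and max_iter are my own choices for illustration):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

best_lr, best_acc = None, -1.0
for lr in [0.001, 0.01, 0.1]:                      # candidate learning rates
    model = MLPClassifier(learning_rate_init=lr, max_iter=300, random_state=0)
    model.fit(X_train, y_train)                    # fit on the training set
    acc = model.score(X_val, y_val)                # evaluate on the validation set
    if acc > best_acc:
        best_lr, best_acc = lr, acc

print(f"selected learning rate: {best_lr} (validation accuracy {best_acc:.3f})")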
Avoiding Overfitting Using the Validation Set
What is Overfitting?
Overfitting occurs when the model performs exceptionally well on the training
data but fails to generalize to new, unseen data (validation or test sets).
It indicates the model has memorized the training data rather than learning
general patterns.
1. Early Stopping:
o Early stopping monitors the performance of the model on the validation set
during training.
o If validation accuracy stops improving (or starts to decline), the training is
stopped early to prevent overfitting.
o This ensures the model doesn’t learn noise or irrelevant patterns in the
training data (see the sketch after this list).
2. Regularization as a Hyperparameter:
o Regularization techniques (e.g., L1/L2 regularization, dropout in
neural networks) help reduce overfitting.
o The strength of regularization is a hyperparameter that is fine-
tuned using the validation set (also illustrated in the sketch after this list).
o Example: In a regression model, increasing the L2 penalty (Ridge
regularization) might prevent the model from assigning overly
large weights to features.
3. Model Selection:
o Sometimes different models (e.g., decision trees vs. random
forests) are tested.
o The validation set helps select the best-performing model to
avoid overfitting on a particular type of data.
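The sketch below illustrates points 1 and 2 above in one place: scikit-learn's MLPClassifier supports built-in early stopping, and its L2 penalty (alpha) is tuned on a separate validation set (the data and candidate values are illustrative):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

best_alpha, best_acc = None, -1.0
for alpha in [0.0001, 0.001, 0.01, 0.1]:   # candidate L2 regularization strengths
    model = MLPClassifier(
        alpha=alpha,                  # L2 penalty strength (a hyperparameter)
        early_stopping=True,          # stop when the internal validation score stops improving
        validation_fraction=0.15,     # share of training data held out for early stopping
        n_iter_no_change=10,
        max_iter=500,
        random_state=0,
    )
    model.fit(X_train, y_train)
    acc = model.score(X_val, y_val)   # compare configurations on the external validation set
    if acc > best_acc:
        best_alpha, best_acc = alpha, acc

print(f"chosen alpha: {best_alpha} (validation accuracy {best_acc:.3f})")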
Example Process: Train, Validate, Test Workflow
1. Data Split:
o 70% Training Set, 15% Validation Set, 15% Test Set.
2. Training and Validation Loop:
o Train the model on the training set with initial hyperparameters.
o Evaluate on the validation set.
o Tune hyperparameters based on validation performance.
o Repeat the process with different hyperparameters until the best
combination is found.
3. Final Model Evaluation:
o After the hyperparameters are optimized, train the model on both the
training + validation sets.
o Use the test set only once for final performance evaluation (to get an
unbiased estimate).
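A compact sketch of this 70/15/15 workflow with scikit-learn (the model, metric, and candidate values are illustrative):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=2000, random_state=0)

# 70% train, 15% validation, 15% test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.30, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.50, random_state=0)

# Tune a hyperparameter (here, k) on the validation set
best_k = max([1, 3, 5, 7, 9],
             key=lambda k: KNeighborsClassifier(n_neighbors=k)
                           .fit(X_train, y_train)
                           .score(X_val, y_val))

# Refit on train + validation with the chosen k, then evaluate once on the test set
final = KNeighborsClassifier(n_neighbors=best_k).fit(
    np.vstack([X_train, X_val]), np.concatenate([y_train, y_val]))
print(final.score(X_test, y_test))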
Challenges of Using a Validation Set
Data Leakage: If the same data is used in both the validation and training
phases, it can lead to biased results.
Insufficient Data: Splitting the data into training, validation, and test sets can
result in smaller datasets, especially if the data is limited.
o Solution: Use techniques like cross-validation.
Cross-Validation as an Alternative
In k-fold cross-validation, the data is split into k equal folds. Each fold acts as
the validation set exactly once, while the remaining k−1 folds are used for
training.
This method ensures that every data point is used for both training and
validation, improving the robustness of hyperparameter tuning.
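A minimal 5-fold cross-validation sketch with scikit-learn (the model and fold count are illustrative):

from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)

# 5-fold CV: each fold serves as the validation set exactly once
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(scores.mean(), scores.std())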
Importance of Data Preprocessing in Data Mining
1. Improves Data Quality: Clean and consistent data leads to more
accurate and meaningful results.
2. Enhances Model Performance: Normalization and feature
engineering ensure that models learn effectively.
3. Reduces Computational Complexity: Data reduction techniques
make algorithms more efficient.
4. Prevents Bias and Overfitting: Data balancing and partitioning
ensure the model generalizes well to new data.
5. Handles Real-World Challenges: Preprocessing deals with noise,
missing data, and inconsistencies that occur in most real-world
datasets.