MSDS Module 2

Data Preprocessing

Exploring data cleaning, handling missing values, and dealing with outliers.
Data Exploration
• Before we can clean or preprocess data, we need to understand it. This
involves examining the dataset to identify potential issues and anomalies.
Here's what we can do:
• Load the Data: Import the dataset into a data analysis tool such as Python (using libraries like Pandas) or R.
• Check Data Types: Ensure that the data types of each column are appropriate.
Numeric values should be stored as numbers, dates as date objects, and so on.
• Summary Statistics: Calculate summary statistics (mean, median, standard
deviation, etc.) for numeric columns to get a sense of the data distribution.
• Visualizations: Create visualizations like histograms, box plots, scatter plots,
and heatmaps to spot patterns, outliers, and relationships in the data.
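A minimal exploration sketch of the checklist above, using Pandas; the file name data.csv and its columns are placeholders for your own dataset:

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("data.csv")      # Load the data (file name is a placeholder)
    print(df.dtypes)                  # Check data types of each column
    print(df.describe())              # Summary statistics for numeric columns
    print(df.isna().sum())            # Missing-value counts per column

    df.hist(figsize=(10, 6))          # Histograms to inspect distributions and spot outliers
    plt.tight_layout()
    plt.show()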
Data Preprocessing
• Data preprocessing is a crucial step in machine learning that
involves cleaning, transforming, and organizing raw data
into a format that is suitable for training and testing machine
learning models.
• High-quality, well-preprocessed data is essential for building
accurate and robust machine learning models.
Common Data Preprocessing Steps
1. Data Cleaning
2. Data Transformation
3. Data Reduction
4. Data Splitting
5. Data Imbalance Handling
6. Data Normalization and Standardization
7. Handling Categorical Data
8. Handling Time-Series Data
9. Encoding Text Data
10. Data Validation and Quality Checks
11. Feature Scaling and Selection
12. Data Augmentation (for Image Data)
Common Data Preprocessing Steps
1. Data Cleaning
• It involves identifying and addressing issues or errors in the dataset
to ensure that the data is accurate, consistent, and suitable for
training machine learning models.
• Common data cleaning tasks and techniques:
• Handling Missing Data, Handling Outliers, Dealing with Duplicate Records,
Addressing Inconsistent Data, Handling Incomplete or Inaccurate Data,
Dealing with Irrelevant Data, Dealing with Data Skewness, Handling Noisy
Data, Handling Inconsistent Scales, Dealing with Text Data, Encoding
Categorical Data, Validation and Testing, Documenting Changes
1.1 Handling Missing Data
• Identify missing values: Use tools like summary statistics or data
visualization to detect missing data.
• Imputation: Fill in missing values with appropriate data, such as
mean, median, mode, or more advanced methods like regression or
imputing based on similar observations.
• Deletion: If missing values are few and don't significantly impact the
dataset's integrity, you may opt to remove rows or columns with
missing data.
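A small Pandas sketch of imputation and deletion; the age and city columns and their values are invented for illustration:

    import pandas as pd

    df = pd.DataFrame({"age": [25, None, 40, 31],
                       "city": ["NY", "LA", None, "NY"]})

    print(df.isna().sum())                                 # Identify missing values per column
    df["age"] = df["age"].fillna(df["age"].median())       # Impute numeric column with the median
    df["city"] = df["city"].fillna(df["city"].mode()[0])   # Impute categorical column with the mode
    df_dropped = df.dropna()                               # Or simply delete rows with missing values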
1.2 Handling Outliers
• Identify outliers: Use statistical methods, box plots, or scatter
plots to detect data points that deviate significantly from the
rest.
• Treatment: Depending on the nature of the data and the
problem, outliers can be removed, transformed (e.g.,
winsorization), or kept as-is.
• Winsorization is a statistical technique used to handle outliers in a dataset. Instead of deleting extreme values, it caps them at specified percentiles (for example, setting every value below the 5th percentile to the 5th-percentile value and every value above the 95th percentile to the 95th-percentile value). This helps to reduce the influence of outliers on statistical analysis and modeling.
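A sketch of outlier detection with the IQR rule followed by winsorization via Pandas clip; the series values are invented, and the 5th/95th percentile cut-offs are one common but arbitrary choice:

    import pandas as pd

    s = pd.Series([1, 2, 2, 3, 3, 4, 120])                       # 120 is an obvious outlier

    q1, q3 = s.quantile([0.25, 0.75])                            # Identify outliers with the IQR rule
    iqr = q3 - q1
    outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

    lower, upper = s.quantile([0.05, 0.95])                      # Winsorize: cap at the 5th/95th percentiles
    s_winsorized = s.clip(lower=lower, upper=upper)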
1.3 Dealing with Duplicate Records
• Identify and remove duplicates: Search for duplicate rows in the dataset and drop them, ensuring that each record is unique.
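In Pandas this is typically a duplicated() check plus drop_duplicates(); a toy example with invented data:

    import pandas as pd

    df = pd.DataFrame({"id": [1, 2, 2, 3], "value": [10, 20, 20, 30]})

    print(df.duplicated().sum())       # Count fully duplicated rows
    df_unique = df.drop_duplicates()   # Keep only the first occurrence of each row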
1.4 Addressing Inconsistent Data
• Inconsistent Formatting: Ensure that data is consistently
formatted (e.g., date formats, capitalization).
• Standardizing Categories: Group similar categories together
(e.g., "Male" and "M" into "Male").
1.5 Handling Incomplete or Inaccurate Data
• Check for incomplete or incorrect data and correct it where
possible.
• Cross-checking: Identify and correct data issues by comparing values across multiple sources or records.
1.6 Dealing with Irrelevant Data
• Remove irrelevant columns or features that don't contribute to
the machine-learning task.
1.7 Dealing with Data Skewness
• If the target variable is heavily skewed (e.g., in classification
tasks with imbalanced classes), consider resampling techniques
like oversampling or undersampling.
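One simple resampling sketch uses scikit-learn's resample utility to oversample the minority class; the tiny DataFrame and its 8-to-2 class split are purely illustrative (libraries such as imbalanced-learn offer more advanced options like SMOTE):

    import pandas as pd
    from sklearn.utils import resample

    df = pd.DataFrame({"feature": range(10),
                       "label":   [0] * 8 + [1] * 2})     # class 1 is the minority

    majority = df[df["label"] == 0]
    minority = df[df["label"] == 1]

    # Oversample the minority class (with replacement) to match the majority size
    minority_upsampled = resample(minority, replace=True,
                                  n_samples=len(majority), random_state=42)
    df_balanced = pd.concat([majority, minority_upsampled])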
1.8 Handling Noisy Data
• Remove or smooth noisy data points that may have resulted
from errors or sensor inaccuracies.
1.9 Handling Inconsistent Scales
• Standardize or normalize numerical features to ensure they
have similar scales, preventing certain features from dominating
others during model training.
1.10 Dealing with Text Data
• Tokenization: Split text data into individual words or tokens.
• Removing special characters, stopwords, and irrelevant words.
• Lemmatization or stemming to reduce words to their root form.
• Tokenization is a natural language processing (NLP) technique that involves breaking down a
text or sentence into smaller units, typically words or subwords, known as tokens.
• Lemmatization is a natural language processing (NLP) technique used to reduce words to
their base or root form, known as the lemma. The lemma represents the canonical or
dictionary form of a word, which makes it easier to analyze and compare words with similar
meanings.
• Stemming is a natural language processing (NLP) technique used to reduce words to their
root or base form, known as the "stem." The goal of stemming is to remove prefixes, suffixes,
and other affixes from words in order to simplify them and group similar words together.
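A short NLTK sketch of tokenization, stopword removal, stemming, and lemmatization; the sentence is invented, and the exact nltk.download resource names can vary slightly between NLTK versions:

    import nltk
    from nltk.tokenize import word_tokenize
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    nltk.download("punkt"); nltk.download("stopwords"); nltk.download("wordnet")

    text = "The cats were running quickly across the gardens."
    tokens = word_tokenize(text.lower())                               # Tokenization
    tokens = [t for t in tokens
              if t.isalpha() and t not in stopwords.words("english")]  # Drop stopwords and punctuation

    stems  = [PorterStemmer().stem(t) for t in tokens]                 # Stemming, e.g. "running" -> "run"
    lemmas = [WordNetLemmatizer().lemmatize(t) for t in tokens]        # Lemmatization, e.g. "cats" -> "cat"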
1.11 Encoding Categorical Data
• Convert categorical data into numerical format using
techniques like one-hot encoding or label encoding.
• One-hot encoding is a technique used to represent categorical data, such as words, labels, or categories, as binary vectors. The term "one-hot" refers to the fact that only one element in the binary vector is "hot" or "on" (set to 1), while all others are "cold" or "off" (set to 0). Each unique category or word is represented by a unique binary vector.
• Label encoding is a technique used to convert categorical data into
numerical values. It is particularly useful when working with algorithms
that require numerical input, as most machine learning models can only
process numeric data.
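A brief sketch of both encodings with Pandas and scikit-learn; the color column is a placeholder:

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder

    df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

    one_hot = pd.get_dummies(df["color"], prefix="color")   # One-hot: one binary column per category
    labels  = LabelEncoder().fit_transform(df["color"])     # Label encoding: blue=0, green=1, red=2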
1.12 Validation and Testing
• Use validation and testing datasets to detect and correct data
issues that may not be apparent during initial exploration.
1.13 Documenting Changes
• Keep a record of all data cleaning and preprocessing steps
applied to maintain transparency and reproducibility.
Common Data Preprocessing Steps
(Continued)
2. Data Transformation
• Feature Scaling: Scale numerical features to a similar range, which can prevent certain features from dominating others in models that rely on distance-based metrics (e.g., k-nearest neighbors) or gradient-based optimization (e.g., gradient descent).
• One-Hot Encoding: Convert categorical variables into a binary format (0 or 1) for each category,
allowing them to be used in machine learning algorithms.
• Label Encoding: Encode categorical variables with integer labels when the order of categories
matters, but be cautious about using this when the order doesn't make sense.
• Feature Engineering: Create new features from existing ones to capture more relevant
information or reduce dimensionality.
• Binning: Group continuous numerical data into bins to convert them into categorical features.
• Log Transformation: Apply logarithmic transformations to features that have skewed
distributions to make them more symmetric.
• Standardization: Scale features to have a mean of 0 and a standard deviation of 1, which is particularly important for algorithms like Principal Component Analysis (PCA).
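A compact sketch of several of these transformations (min-max scaling, standardization, a log transform, and binning) with scikit-learn and Pandas; the income values are invented and deliberately skewed:

    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler, StandardScaler

    df = pd.DataFrame({"income": [20_000, 35_000, 50_000, 1_200_000]})

    df["income_minmax"] = MinMaxScaler().fit_transform(df[["income"]]).ravel()    # Scale to [0, 1]
    df["income_std"]    = StandardScaler().fit_transform(df[["income"]]).ravel()  # Mean 0, std 1
    df["income_log"]    = np.log1p(df["income"])                                  # Log transform for skewed data
    df["income_bin"]    = pd.cut(df["income"], bins=3,
                                 labels=["low", "mid", "high"])                   # Binning into categories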
Common Data Preprocessing Steps
(Continued)
3. Data Reduction
• It is the process of reducing the volume of the data while producing the same or similar analytical results as the original dataset.
• This reduction in data size can be helpful in various scenarios, such as
speeding up model training, reducing memory and storage
requirements, and improving the efficiency of machine learning
algorithms.
• Some common techniques for data reduction are given in next slides.
3.1 Dimensionality Reduction
• Principal Component Analysis (PCA): PCA is a widely used technique to
reduce the dimensionality of the dataset by projecting it onto a lower-
dimensional subspace while preserving as much variance as possible.
• Linear Discriminant Analysis (LDA): LDA is used for dimensionality
reduction and feature extraction while maximizing class separability in
classification problems.
• LDA aims to find the linear combinations of features that best separate different classes in a
dataset while maximizing the distance between the class means and minimizing the variance
within each class.
• t-Distributed Stochastic Neighbor Embedding (t-SNE): t-SNE is a
nonlinear dimensionality reduction technique that is particularly effective
for visualizing high-dimensional data in two or three dimensions.
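A minimal PCA sketch on scikit-learn's built-in iris dataset; features are standardized first, since PCA is sensitive to scale:

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    X, y = load_iris(return_X_y=True)
    X_scaled = StandardScaler().fit_transform(X)     # Standardize before PCA

    pca = PCA(n_components=2)
    X_reduced = pca.fit_transform(X_scaled)          # Project 4 features onto 2 components
    print(pca.explained_variance_ratio_)             # Fraction of variance each component preserves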
3.2 Feature Selection
• Filter Methods: These methods assess the relevance of individual features
with respect to the target variable and select the most informative ones
based on statistical tests or scoring metrics (e.g., Chi-squared, Mutual
Information).
• Wrapper Methods: Wrapper methods evaluate subsets of features by
training and evaluating machine learning models with different feature
combinations. Examples include forward selection, backward elimination,
and recursive feature elimination.
• Embedded Methods: Embedded methods incorporate feature selection
directly into the model training process, with algorithms like Lasso (L1
regularization) penalizing and automatically selecting relevant features.
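A sketch with one example from each family, using scikit-learn on the iris data; the particular estimators and the choice of k=2 are arbitrary illustration choices:

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectKBest, mutual_info_classif, RFE
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import LinearSVC

    X, y = load_iris(return_X_y=True)

    # Filter method: keep the 2 features with the highest mutual information
    X_filtered = SelectKBest(mutual_info_classif, k=2).fit_transform(X, y)

    # Wrapper method: recursive feature elimination around a simple model
    rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2).fit(X, y)
    print(rfe.support_)                               # Boolean mask of selected features

    # Embedded method: L1 regularization drives irrelevant coefficients to zero
    l1_model = LinearSVC(penalty="l1", dual=False, max_iter=5000).fit(X, y)
    print(l1_model.coef_)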
3.3 Sampling Techniques
• Random Sampling: Subsampling a random subset of the data can reduce
the dataset's size while preserving its overall characteristics. However, this
may lead to information loss.
• Stratified Sampling: When dealing with imbalanced datasets, stratified
sampling ensures that the class distribution is preserved in the reduced
dataset.
• Cluster-Based Sampling: Cluster-based sampling selects representative
samples from clusters to maintain data distribution.
3.4 Data Aggregation: Aggregate data over certain time intervals or spatial regions to reduce the
dataset's size while retaining essential statistical properties.
3.5 Binning: Grouping continuous numerical data into bins or intervals, which can reduce the
number of unique values and simplify the data.
3.6 Frequent Pattern Mining: Identifying and keeping only frequently occurring patterns or
associations in the data, such as frequent itemsets in market basket analysis.
3.7 Data Compression: Using compression techniques like Singular Value Decomposition (SVD) to
represent data more compactly while minimizing information loss.
3.8 Summary Statistics: Instead of keeping detailed records, summarize data using statistics like
means, medians, or percentiles.
3.9 Time-Series Aggregation: For time-series data, aggregate values over longer time intervals
(e.g., hourly or daily) instead of keeping high-frequency data.
3.10 Feature Engineering: Create new features that capture essential information while reducing
the dimensionality. For example, combining related features or using domain-specific knowledge
to derive new ones.
Common Data Preprocessing Steps
(Continued)
4. Data Splitting
• It is the process of dividing a dataset into multiple subsets for
different purposes, typically for training, validation, and testing
machine learning models. Proper data splitting is essential to assess
the performance and generalization of a model accurately.
4.1 Training Set
• The training set is the largest subset of the data and is used to train
the machine learning model. It contains both the features (input data)
and their corresponding target labels (output or response variable).
• The model learns patterns and relationships in the training data,
adjusting its internal parameters to minimize a chosen objective
function (e.g., loss function).
4.2 Validation Set
• The validation set is a separate subset of the data that is not used
during training but is used to tune hyperparameters and monitor
the model's performance.
• After training the model on the training set, it is evaluated on the
validation set to assess its ability to generalize to new, unseen
data.
• Hyperparameters, such as learning rates or regularization
strength, can be adjusted based on validation set performance.
4.3 Test Set
• The test set is another distinct subset of the data that is not used
during training or hyperparameter tuning.
• It is used as an independent evaluation dataset to provide an
unbiased estimate of the model's performance on unseen data.
• The test set helps assess how well the model is expected to
perform in real-world scenarios.
• Common Split Ratio: A common split ratio is 70-80% for training, 10-15% for validation, and 10-15% for
testing. However, the exact split ratio can vary depending on the size and quality of the dataset.
• Stratified Splitting: Stratified splitting is essential when dealing with imbalanced datasets, where one class is
significantly underrepresented. In such cases, the split ensures that each subset (training, validation, and test)
maintains the same class distribution as the original dataset.
• Cross-Validation: Cross-validation is a more advanced technique that involves splitting the dataset into
multiple subsets, often referred to as "folds." The model is trained and validated multiple times, with each
fold serving as the validation set once while the others are used for training. Cross-validation helps provide a
more robust estimate of model performance.
• Leave-One-Out Cross-Validation (LOOCV): In LOOCV, each data point is treated as a separate validation
set, and the model is trained on the remaining data points. This approach provides a comprehensive
assessment of model performance but can be computationally expensive for large datasets.
• Time-Series Data Splitting: For time-series data, data splitting is often done chronologically. Earlier data is
used for training, intermediate data for validation, and the most recent data for testing to simulate real-world
scenarios where the model must make predictions on unseen future data.

• Random vs. Stratified Sampling: The choice between random and stratified sampling for splitting depends
on the dataset's characteristics and the problem at hand. Stratified sampling is preferred when dealing with
class imbalance, while random sampling is more common for balanced datasets.
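A stratified 70/15/15 split plus a 5-fold cross-validation sketch with scikit-learn on the iris data; the ratios and random_state are arbitrary choices:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split, cross_val_score
    from sklearn.linear_model import LogisticRegression

    X, y = load_iris(return_X_y=True)

    # First split off 30%, then split that portion half-and-half into validation and test
    X_train, X_temp, y_train, y_temp = train_test_split(
        X, y, test_size=0.30, stratify=y, random_state=42)
    X_val, X_test, y_val, y_test = train_test_split(
        X_temp, y_temp, test_size=0.50, stratify=y_temp, random_state=42)

    # 5-fold cross-validation for a more robust performance estimate
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
    print(scores.mean())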
Common Data Preprocessing Steps
(Continued)
5. Data Imbalance Handling: In classification tasks, deal with imbalanced datasets by oversampling the minority
class, undersampling the majority class, or using synthetic data generation techniques.
6. Data Normalization and Standardization: Normalize and standardize features to ensure that they have similar
scales. Normalization typically scales features to a [0, 1] range, while standardization makes features have a mean
of 0 and a standard deviation of 1.
7. Handling Categorical Data: Convert categorical data into a numerical format (e.g., one-hot encoding) to make
it suitable for machine learning algorithms.
8. Handling Time-Series Data: Resample, interpolate, or aggregate time-series data to align it with the desired
frequency or to fill in missing values.
9. Encoding Text Data: Convert text data into numerical format using techniques like TF-IDF (Term Frequency-Inverse Document Frequency) or word embeddings (e.g., Word2Vec, GloVe) for natural language processing tasks; a TF-IDF sketch follows this list.
10. Data Validation and Quality Checks: Ensure that the preprocessed data is free from anomalies, errors, or
inconsistencies.
11. Feature Scaling and Selection: Choose appropriate features and apply scaling methods to avoid issues with
algorithms sensitive to feature scales (e.g., gradient-based optimization algorithms).
12. Data Augmentation (for Image Data): Generate additional training examples by applying random
transformations to images (e.g., rotation, cropping, flipping) to improve model generalization.
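As flagged in step 9 above, a minimal TF-IDF sketch with scikit-learn; the two example sentences are placeholders:

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["data preprocessing cleans raw data",
            "models learn from preprocessed data"]

    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(docs)            # Sparse document-term matrix
    print(vectorizer.get_feature_names_out())         # Vocabulary learned from the documents
    print(tfidf.toarray().round(2))                   # TF-IDF weights per document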
