MSDS Module 2

Data Preprocessing

Exploring data cleaning, handling missing values, and dealing with outliers.
Data Exploration
• Before we can clean or preprocess data, we need to understand it. This
involves examining the dataset to identify potential issues and anomalies.
Here's what we can do:
• Load the Data: Import the dataset into a data analysis tool such as Python (using libraries like Pandas) or R.
• Check Data Types: Ensure that the data types of each column are appropriate.
Numeric values should be stored as numbers, dates as date objects, and so on.
• Summary Statistics: Calculate summary statistics (mean, median, standard
deviation, etc.) for numeric columns to get a sense of the data distribution.
• Visualizations: Create visualizations like histograms, box plots, scatter plots,
and heatmaps to spot patterns, outliers, and relationships in the data.
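A minimal exploration sketch of the checklist above, using Pandas; the file name data.csv and its columns are placeholders for your own dataset:

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("data.csv")      # Load the data (file name is a placeholder)
    print(df.dtypes)                  # Check data types of each column
    print(df.describe())              # Summary statistics for numeric columns
    print(df.isna().sum())            # Missing-value counts per column

    df.hist(figsize=(10, 6))          # Histograms to inspect distributions and spot outliers
    plt.tight_layout()
    plt.show()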
Data Preprocessing
• Data preprocessing is a crucial step in machine learning that
involves cleaning, transforming, and organizing raw data
into a format that is suitable for training and testing machine
learning models.
• High-quality, well-preprocessed data is essential for building
accurate and robust machine learning models.
Common Data Preprocessing Steps
1. Data Cleaning
2. Data Transformation
3. Data Reduction
4. Data Splitting
5. Data Imbalance Handling
6. Data Normalization and Standardization
7. Handling Categorical Data
8. Handling Time-Series Data
9. Encoding Text Data
10. Data Validation and Quality Checks
11. Feature Scaling and Selection
12. Data Augmentation (for Image Data)
Common Data Preprocessing Steps
1. Data Cleaning
• It involves identifying and addressing issues or errors in the dataset
to ensure that the data is accurate, consistent, and suitable for
training machine learning models.
• Common data cleaning tasks and techniques:
• Handling Missing Data, Handling Outliers, Dealing with Duplicate Records,
Addressing Inconsistent Data, Handling Incomplete or Inaccurate Data,
Dealing with Irrelevant Data, Dealing with Data Skewness, Handling Noisy
Data, Handling Inconsistent Scales, Dealing with Text Data, Encoding
Categorical Data, Validation and Testing, Documenting Changes
1.1 Handling Missing Data
• Identify missing values: Use tools like summary statistics or data
visualization to detect missing data.
• Imputation: Fill in missing values with appropriate data, such as
mean, median, mode, or more advanced methods like regression or
imputing based on similar observations.
• Deletion: If missing values are few and don't significantly impact the
dataset's integrity, you may opt to remove rows or columns with
missing data.
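A small Pandas sketch of imputation and deletion; the age and city columns and their values are invented for illustration:

    import pandas as pd

    df = pd.DataFrame({"age": [25, None, 40, 31],
                       "city": ["NY", "LA", None, "NY"]})

    print(df.isna().sum())                                 # Identify missing values per column
    df["age"] = df["age"].fillna(df["age"].median())       # Impute numeric column with the median
    df["city"] = df["city"].fillna(df["city"].mode()[0])   # Impute categorical column with the mode
    df_dropped = df.dropna()                               # Or simply delete rows with missing values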
1.2 Handling Outliers
• Identify outliers: Use statistical methods, box plots, or scatter
plots to detect data points that deviate significantly from the
rest.
• Treatment: Depending on the nature of the data and the
problem, outliers can be removed, transformed (e.g.,
winsorization), or kept as-is.
• Winsorization is a statistical technique used to handle outliers in a dataset. Instead of deleting extreme values, it caps them at specified percentiles (for example, setting every value below the 5th percentile to the 5th-percentile value and every value above the 95th percentile to the 95th-percentile value). This helps to reduce the influence of outliers on statistical analysis and modeling.
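A sketch of outlier detection with the IQR rule followed by winsorization via Pandas clip; the series values are invented, and the 5th/95th percentile cut-offs are one common but arbitrary choice:

    import pandas as pd

    s = pd.Series([1, 2, 2, 3, 3, 4, 120])                       # 120 is an obvious outlier

    q1, q3 = s.quantile([0.25, 0.75])                            # Identify outliers with the IQR rule
    iqr = q3 - q1
    outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

    lower, upper = s.quantile([0.05, 0.95])                      # Winsorize: cap at the 5th/95th percentiles
    s_winsorized = s.clip(lower=lower, upper=upper)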
1.3 Dealing with Duplicate Records
• Identify and remove duplicates: Search for duplicate rows in the dataset and drop them, ensuring that each record is unique.
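In Pandas this is typically a duplicated() check plus drop_duplicates(); a toy example with invented data:

    import pandas as pd

    df = pd.DataFrame({"id": [1, 2, 2, 3], "value": [10, 20, 20, 30]})

    print(df.duplicated().sum())       # Count fully duplicated rows
    df_unique = df.drop_duplicates()   # Keep only the first occurrence of each row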
1.4 Addressing Inconsistent Data
• Inconsistent Formatting: Ensure that data is consistently
formatted (e.g., date formats, capitalization).
• Standardizing Categories: Group similar categories together
(e.g., "Male" and "M" into "Male").
1.5 Handling Incomplete or Inaccurate Data
• Check for incomplete or incorrect data and correct it where
possible.
• Cross-checking: Identify and correct data issues by comparing values across multiple sources or records.
1.6 Dealing with Irrelevant Data
• Remove irrelevant columns or features that don't contribute to
the machine-learning task.
1.7 Dealing with Data Skewness
• If the target variable is heavily skewed (e.g., in classification
tasks with imbalanced classes), consider resampling techniques
like oversampling or undersampling.
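One simple resampling sketch uses scikit-learn's resample utility to oversample the minority class; the tiny DataFrame and its 8-to-2 class split are purely illustrative (libraries such as imbalanced-learn offer more advanced options like SMOTE):

    import pandas as pd
    from sklearn.utils import resample

    df = pd.DataFrame({"feature": range(10),
                       "label":   [0] * 8 + [1] * 2})     # class 1 is the minority

    majority = df[df["label"] == 0]
    minority = df[df["label"] == 1]

    # Oversample the minority class (with replacement) to match the majority size
    minority_upsampled = resample(minority, replace=True,
                                  n_samples=len(majority), random_state=42)
    df_balanced = pd.concat([majority, minority_upsampled])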
1.8 Handling Noisy Data
• Remove or smooth noisy data points that may have resulted
from errors or sensor inaccuracies.
1.9 Handling Inconsistent Scales
• Standardize or normalize numerical features to ensure they
have similar scales, preventing certain features from dominating
others during model training.
1.10 Dealing with Text Data
• Tokenization: Split text data into individual words or tokens.
• Removing special characters, stopwords, and irrelevant words.
• Lemmatization or stemming to reduce words to their root form.
• Tokenization is a natural language processing (NLP) technique that involves breaking down a
text or sentence into smaller units, typically words or subwords, known as tokens.
• Lemmatization is a natural language processing (NLP) technique used to reduce words to
their base or root form, known as the lemma. The lemma represents the canonical or
dictionary form of a word, which makes it easier to analyze and compare words with similar
meanings.
• Stemming is a natural language processing (NLP) technique used to reduce words to their
root or base form, known as the "stem." The goal of stemming is to remove prefixes, suffixes,
and other affixes from words in order to simplify them and group similar words together.
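A short NLTK sketch of tokenization, stopword removal, stemming, and lemmatization; the sentence is invented, and the exact nltk.download resource names can vary slightly between NLTK versions:

    import nltk
    from nltk.tokenize import word_tokenize
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    nltk.download("punkt"); nltk.download("stopwords"); nltk.download("wordnet")

    text = "The cats were running quickly across the gardens."
    tokens = word_tokenize(text.lower())                               # Tokenization
    tokens = [t for t in tokens
              if t.isalpha() and t not in stopwords.words("english")]  # Drop stopwords and punctuation

    stems  = [PorterStemmer().stem(t) for t in tokens]                 # Stemming, e.g. "running" -> "run"
    lemmas = [WordNetLemmatizer().lemmatize(t) for t in tokens]        # Lemmatization, e.g. "cats" -> "cat"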
1.11 Encoding Categorical Data
• Convert categorical data into numerical format using
techniques like one-hot encoding or label encoding.
• One-hot encoding is a technique used to represent categorical data, such as words, labels, or categories, as binary vectors. The term "one-hot" refers to the fact that only one element in the binary vector is "hot" or "on" (set to 1), while all others are "cold" or "off" (set to 0). Each unique category or word is represented by a unique binary vector.
• Label encoding is a technique used to convert categorical data into
numerical values. It is particularly useful when working with algorithms
that require numerical input, as most machine learning models can only
process numeric data.
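A brief sketch of both encodings with Pandas and scikit-learn; the color column is a placeholder:

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder

    df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

    one_hot = pd.get_dummies(df["color"], prefix="color")   # One-hot: one binary column per category
    labels  = LabelEncoder().fit_transform(df["color"])     # Label encoding: blue=0, green=1, red=2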
1.12 Validation and Testing
• Use validation and testing datasets to detect and correct data
issues that may not be apparent during initial exploration.
1.13 Documenting Changes
• Keep a record of all data cleaning and preprocessing steps
applied to maintain transparency and reproducibility.
Common Data Preprocessing Steps
(Continued)
2. Data Transformation
• Feature Scaling: Scale numerical features to a similar range, which can prevent certain features from dominating others in models that rely on distance-based metrics (e.g., k-nearest neighbors) or gradient-based optimization (e.g., gradient descent).
• One-Hot Encoding: Convert categorical variables into a binary format (0 or 1) for each category,
allowing them to be used in machine learning algorithms.
• Label Encoding: Encode categorical variables with integer labels when the order of categories
matters, but be cautious about using this when the order doesn't make sense.
• Feature Engineering: Create new features from existing ones to capture more relevant
information or reduce dimensionality.
• Binning: Group continuous numerical data into bins to convert them into categorical features.
• Log Transformation: Apply logarithmic transformations to features that have skewed
distributions to make them more symmetric.
• Standardization: Scale features to have a mean of 0 and a standard deviation of 1, which is particularly important for algorithms like Principal Component Analysis (PCA).
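A compact sketch of several of these transformations (min-max scaling, standardization, a log transform, and binning) with scikit-learn and Pandas; the income values are invented and deliberately skewed:

    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler, StandardScaler

    df = pd.DataFrame({"income": [20_000, 35_000, 50_000, 1_200_000]})

    df["income_minmax"] = MinMaxScaler().fit_transform(df[["income"]]).ravel()    # Scale to [0, 1]
    df["income_std"]    = StandardScaler().fit_transform(df[["income"]]).ravel()  # Mean 0, std 1
    df["income_log"]    = np.log1p(df["income"])                                  # Log transform for skewed data
    df["income_bin"]    = pd.cut(df["income"], bins=3,
                                 labels=["low", "mid", "high"])                   # Binning into categories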
Common Data Preprocessing Steps
(Continued)
3. Data Reduction
• It is the process of reducing the volume of the data while producing the same or similar analytical results as the original dataset.
• This reduction in data size can be helpful in various scenarios, such as
speeding up model training, reducing memory and storage
requirements, and improving the efficiency of machine learning
algorithms.
• Some common techniques for data reduction are given in next slides.
3.1 Dimensionality Reduction
• Principal Component Analysis (PCA): PCA is a widely used technique to
reduce the dimensionality of the dataset by projecting it onto a lower-
dimensional subspace while preserving as much variance as possible.
• Linear Discriminant Analysis (LDA): LDA is used for dimensionality
reduction and feature extraction while maximizing class separability in
classification problems.
• LDA aims to find the linear combinations of features that best separate different classes in a
dataset while maximizing the distance between the class means and minimizing the variance
within each class.
• t-Distributed Stochastic Neighbor Embedding (t-SNE): t-SNE is a
nonlinear dimensionality reduction technique that is particularly effective
for visualizing high-dimensional data in two or three dimensions.
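A minimal PCA sketch on scikit-learn's built-in iris dataset; features are standardized first, since PCA is sensitive to scale:

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    X, y = load_iris(return_X_y=True)
    X_scaled = StandardScaler().fit_transform(X)     # Standardize before PCA

    pca = PCA(n_components=2)
    X_reduced = pca.fit_transform(X_scaled)          # Project 4 features onto 2 components
    print(pca.explained_variance_ratio_)             # Fraction of variance each component preserves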
3.2 Feature Selection
• Filter Methods: These methods assess the relevance of individual features
with respect to the target variable and select the most informative ones
based on statistical tests or scoring metrics (e.g., Chi-squared, Mutual
Information).
• Wrapper Methods: Wrapper methods evaluate subsets of features by
training and evaluating machine learning models with different feature
combinations. Examples include forward selection, backward elimination,
and recursive feature elimination.
• Embedded Methods: Embedded methods incorporate feature selection
directly into the model training process, with algorithms like Lasso (L1
regularization) penalizing and automatically selecting relevant features.
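A sketch with one example from each family, using scikit-learn on the iris data; the particular estimators and the choice of k=2 are arbitrary illustration choices:

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectKBest, mutual_info_classif, RFE
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import LinearSVC

    X, y = load_iris(return_X_y=True)

    # Filter method: keep the 2 features with the highest mutual information
    X_filtered = SelectKBest(mutual_info_classif, k=2).fit_transform(X, y)

    # Wrapper method: recursive feature elimination around a simple model
    rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2).fit(X, y)
    print(rfe.support_)                               # Boolean mask of selected features

    # Embedded method: L1 regularization drives irrelevant coefficients to zero
    l1_model = LinearSVC(penalty="l1", dual=False, max_iter=5000).fit(X, y)
    print(l1_model.coef_)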
3.3 Sampling Techniques
• Random Sampling: Subsampling a random subset of the data can reduce
the dataset's size while preserving its overall characteristics. However, this
may lead to information loss.
• Stratified Sampling: When dealing with imbalanced datasets, stratified
sampling ensures that the class distribution is preserved in the reduced
dataset.
• Cluster-Based Sampling: Cluster-based sampling selects representative
samples from clusters to maintain data distribution.
3.4 Data Aggregation: Aggregate data over certain time intervals or spatial regions to reduce the
dataset's size while retaining essential statistical properties.
3.5 Binning: Grouping continuous numerical data into bins or intervals, which can reduce the
number of unique values and simplify the data.
3.6 Frequent Pattern Mining: Identifying and keeping only frequently occurring patterns or
associations in the data, such as frequent itemsets in market basket analysis.
3.7 Data Compression: Using compression techniques like Singular Value Decomposition (SVD) to
represent data more compactly while minimizing information loss.
3.8 Summary Statistics: Instead of keeping detailed records, summarize data using statistics like
means, medians, or percentiles.
3.9 Time-Series Aggregation: For time-series data, aggregate values over longer time intervals
(e.g., hourly or daily) instead of keeping high-frequency data.
3.10 Feature Engineering: Create new features that capture essential information while reducing
the dimensionality. For example, combining related features or using domain-specific knowledge
to derive new ones.
Common Data Preprocessing Steps
(Continued)
4. Data Splitting
• It is the process of dividing a dataset into multiple subsets for
different purposes, typically for training, validation, and testing
machine learning models. Proper data splitting is essential to assess
the performance and generalization of a model accurately.
4.1 Training Set
• The training set is the largest subset of the data and is used to train
the machine learning model. It contains both the features (input data)
and their corresponding target labels (output or response variable).
• The model learns patterns and relationships in the training data,
adjusting its internal parameters to minimize a chosen objective
function (e.g., loss function).
4.2 Validation Set
• The validation set is a separate subset of the data that is not used
during training but is used to tune hyperparameters and monitor
the model's performance.
• After training the model on the training set, it is evaluated on the
validation set to assess its ability to generalize to new, unseen
data.
• Hyperparameters, such as learning rates or regularization
strength, can be adjusted based on validation set performance.
4.3 Test Set
• The test set is another distinct subset of the data that is not used
during training or hyperparameter tuning.
• It is used as an independent evaluation dataset to provide an
unbiased estimate of the model's performance on unseen data.
• The test set helps assess how well the model is expected to
perform in real-world scenarios.
• Common Split Ratio: A common split ratio is 70-80% for training, 10-15% for validation, and 10-15% for
testing. However, the exact split ratio can vary depending on the size and quality of the dataset.
• Stratified Splitting: Stratified splitting is essential when dealing with imbalanced datasets, where one class is
significantly underrepresented. In such cases, the split ensures that each subset (training, validation, and test)
maintains the same class distribution as the original dataset.
• Cross-Validation: Cross-validation is a more advanced technique that involves splitting the dataset into
multiple subsets, often referred to as "folds." The model is trained and validated multiple times, with each
fold serving as the validation set once while the others are used for training. Cross-validation helps provide a
more robust estimate of model performance.
• Leave-One-Out Cross-Validation (LOOCV): In LOOCV, each data point is treated as a separate validation
set, and the model is trained on the remaining data points. This approach provides a comprehensive
assessment of model performance but can be computationally expensive for large datasets.
• Time-Series Data Splitting: For time-series data, data splitting is often done chronologically. Earlier data is
used for training, intermediate data for validation, and the most recent data for testing to simulate real-world
scenarios where the model must make predictions on unseen future data.

• Random vs. Stratified Sampling: The choice between random and stratified sampling for splitting depends
on the dataset's characteristics and the problem at hand. Stratified sampling is preferred when dealing with
class imbalance, while random sampling is more common for balanced datasets.
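A stratified 70/15/15 split plus a 5-fold cross-validation sketch with scikit-learn on the iris data; the ratios and random_state are arbitrary choices:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split, cross_val_score
    from sklearn.linear_model import LogisticRegression

    X, y = load_iris(return_X_y=True)

    # First split off 30%, then split that portion half-and-half into validation and test
    X_train, X_temp, y_train, y_temp = train_test_split(
        X, y, test_size=0.30, stratify=y, random_state=42)
    X_val, X_test, y_val, y_test = train_test_split(
        X_temp, y_temp, test_size=0.50, stratify=y_temp, random_state=42)

    # 5-fold cross-validation for a more robust performance estimate
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
    print(scores.mean())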
Common Data Preprocessing Steps
(Continued)
5. Data Imbalance Handling: In classification tasks, deal with imbalanced datasets by oversampling the minority
class, undersampling the majority class, or using synthetic data generation techniques.
6. Data Normalization and Standardization: Normalize and standardize features to ensure that they have similar
scales. Normalization typically scales features to a [0, 1] range, while standardization makes features have a mean
of 0 and a standard deviation of 1.
7. Handling Categorical Data: Convert categorical data into a numerical format (e.g., one-hot encoding) to make
it suitable for machine learning algorithms.
8. Handling Time-Series Data: Resample, interpolate, or aggregate time-series data to align it with the desired
frequency or to fill in missing values.
9. Encoding Text Data: Convert text data into numerical format using techniques like TF-IDF (Term Frequency-Inverse Document Frequency) or word embeddings (e.g., Word2Vec, GloVe) for natural language processing tasks; a TF-IDF sketch follows this list.
10. Data Validation and Quality Checks: Ensure that the preprocessed data is free from anomalies, errors, or
inconsistencies.
11. Feature Scaling and Selection: Choose appropriate features and apply scaling methods to avoid issues with
algorithms sensitive to feature scales (e.g., gradient-based optimization algorithms).
12. Data Augmentation (for Image Data): Generate additional training examples by applying random
transformations to images (e.g., rotation, cropping, flipping) to improve model generalization.
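As flagged in step 9 above, a minimal TF-IDF sketch with scikit-learn; the two example sentences are placeholders:

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["data preprocessing cleans raw data",
            "models learn from preprocessed data"]

    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(docs)            # Sparse document-term matrix
    print(vectorizer.get_feature_names_out())         # Vocabulary learned from the documents
    print(tfidf.toarray().round(2))                   # TF-IDF weights per document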
