Data Preprocessing: A primer

• Data preprocessing is a foundational step in the data mining process.
• It entails preparing raw data for analysis by transforming and refining it. This presentation delves into the intricacies of cleaning, structuring, and addressing imbalances to ensure data is ready for analysis.
What is Data Pre-processing?

• Data preprocessing is the process of converting raw data into a clean and analyzable format.
• It involves multiple steps, including cleaning, transformation, and reduction.
• This initial phase is pivotal, as the quality and precision of data preprocessing can dictate the success of subsequent analytical procedures.
Steps of Data Pre-Processing
Data Profiling

• Data profiling is the process of examining datasets to gather descriptive statistics about the data.
• It provides a summary of a dataset's attributes, patterns, anomalies, and unique values.
• By understanding the data's structure, relationships, and inconsistencies, data profiling lays the groundwork for further data preprocessing and quality enhancement, ensuring that the data is well understood before any advanced processing or analysis.
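A minimal profiling sketch in pandas, assuming a tabular dataset loaded into a DataFrame (the file name and columns are placeholders, not from the slides):

```python
import pandas as pd

# Hypothetical input file; any tabular dataset would do.
df = pd.read_csv("customers.csv")

# Summary statistics for numeric and categorical columns.
print(df.describe(include="all"))

# Structure: column types and non-null counts.
df.info()

# Unique values and missing counts per column, a quick anomaly scan.
for col in df.columns:
    print(col, df[col].nunique(), "unique,", df[col].isna().sum(), "missing")
```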
Data Cleaning

• Dirty data can be a major impediment in data analysis. Errors, inconsistencies, and redundancies can mislead analysts and produce skewed results.
• Data cleaning becomes imperative to ensure the integrity of data by spotting and correcting inaccuracies.
Handling Missing Values

• Missing values are a common issue in datasets. Their presence can distort data analysis and lead to incorrect interpretations.
• Techniques like imputation, predictive filling, and elimination are employed based on the nature and pattern of the missing data to ensure completeness.
Handling Missing Values

• Imputation: Replace missing values with statistical measures like mean, median, or mode. For categorical data, the mode is often used.
• Deletion: Remove rows with missing values, especially if the data is randomly missing and its absence doesn't create bias.
• Prediction: Use algorithms to predict and fill missing values based on other attributes.
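As an illustration only, the three approaches could look roughly like this with pandas and scikit-learn; the file, column names, and the choice of a k-nearest-neighbours imputer for the prediction step are assumptions, not part of the slides:

```python
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.read_csv("survey.csv")  # hypothetical input

# Imputation: fill numeric gaps with the median, categorical gaps with the mode.
df["income"] = df["income"].fillna(df["income"].median())
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Deletion: drop rows where a critical field is missing.
df = df.dropna(subset=["customer_id"])

# Prediction: estimate remaining numeric gaps from the other numeric attributes.
numeric = df.select_dtypes("number")
df[numeric.columns] = KNNImputer(n_neighbors=5).fit_transform(numeric)
```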
Handling Outliers

• Outliers can introduce variance and bias.
• While they sometimes carry significant information, they can also distort results. Various statistical methods are available to detect outliers.
• Based on their nature, outliers can be adjusted, removed, or even retained.
Handling Outliers

• Identifying Outliers:
  ◦ Use statistical measures like the IQR (interquartile range) or Z-score, or visual tools like box plots and scatter plots, to detect outliers.
  ◦ Understand if the outliers are genuine or data errors.
• Handling Outliers:
  ◦ Deletion: Remove outliers if they are the result of data entry errors.
  ◦ Transformation: Use log or square-root transformations to reduce the impact of outliers.
  ◦ Capping: Limit the maximum and minimum values for certain attributes.
  ◦ Imputation: Replace outliers with mean/median/mode values.
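A small sketch of IQR-based detection plus the handling options above, on a hypothetical numeric column (file name, column name, and thresholds are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.read_csv("sales.csv")  # hypothetical input
col = df["amount"]

# Identify outliers with the IQR rule (1.5 * IQR fences).
q1, q3 = col.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = df[(col < lower) | (col > upper)]

# Deletion: keep only in-range rows.
trimmed = df[(col >= lower) & (col <= upper)]

# Transformation: log-transform to dampen extreme values.
df["amount_log"] = np.log1p(col)

# Capping: clip values to the IQR fences.
df["amount_capped"] = col.clip(lower, upper)
```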
Data Reduction

• Data reduction refers to the process of transforming large volumes of data into a reduced representation, retaining as much meaningful information as possible.
• The goal is to simplify, compress, or condense the original data, making it more manageable and easier to analyze, without losing significant information.
• Common Techniques:
  ◦ Dimensionality Reduction: Reducing the number of random variables under consideration. Techniques include Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and feature selection methods.
  ◦ Binarization: Reducing data by turning numerical values into binary values (0 or 1).
  ◦ Histogram Analysis: Dividing data into bins and then representing the data by its bin.
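A brief sketch of PCA, binarization, and binning on synthetic data (the dimensions, threshold, and bin count are arbitrary, chosen only for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))  # synthetic 10-dimensional data

# Dimensionality reduction: keep enough components to explain 95% of the variance.
X_reduced = PCA(n_components=0.95).fit_transform(X)
print(X_reduced.shape)

# Binarization: threshold one numeric attribute into 0/1.
binary = (X[:, 0] > 0).astype(int)

# Histogram-style reduction: represent a column by the index of its bin.
bins = pd.cut(pd.Series(X[:, 1]), bins=5, labels=False)
```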
Data Reduction (continued)

• Common Techniques (continued):
  ◦ Clustering: Grouping similar data points together. Algorithms include K-means, hierarchical clustering, and DBSCAN.
  ◦ Aggregation: Summarizing and grouping data in various ways, like computing the sum, average, or count for groups of data (a short sketch follows this slide).
  ◦ Sampling: Using a subset of the data that's representative of the entire dataset.
  ◦ Data Compression: Techniques like Run-Length Encoding (RLE) or lossy formats such as JPEG for images.
• Challenges:
  ◦ Loss of Information: Some data reduction techniques can lead to the loss of original data.
  ◦ Complexity: The process can be computationally intensive or complex, especially with high-dimensional data.
  ◦ Reversibility: Some reduction techniques are irreversible, meaning the original data cannot be reconstructed from the reduced data.
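Aggregation and sampling from the techniques above are the most direct row-count reducers; a rough pandas sketch (file and column names are invented for illustration):

```python
import pandas as pd

df = pd.read_csv("transactions.csv")  # hypothetical input

# Aggregation: one summary row per store instead of every transaction.
summary = df.groupby("store_id").agg(
    total_sales=("amount", "sum"),
    avg_sale=("amount", "mean"),
    n_transactions=("amount", "count"),
)

# Sampling: keep a representative 10% subset for faster analysis.
sample = df.sample(frac=0.10, random_state=42)
```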
Data Transformations

• Data transformation is the process of converting data from one format, structure, or value to another to make it suitable for various analytical needs or specific tasks.
• The aim is to improve the data's quality and usability by ensuring it is in the most appropriate form to meet the requirements of different operations, such as data analysis, machine learning, or visualization.
• Reasons for Data Transformation:
  ◦ Compatibility: Ensuring data from different sources aligns well for consolidated analyses.
  ◦ Performance: Optimizing data for faster queries or algorithm processing.
  ◦ Analysis Requirements: Certain algorithms or analytical techniques require data in specific formats.
Data Transformations

• Common Techniques:
  ◦ Normalization: Scaling numerical data to fall within a smaller, standard range, like 0-1.
  ◦ Standardization (Z-Score Normalization): Rescaling data so it has a mean of 0 and a standard deviation of 1.
  ◦ One-Hot Encoding: Converting categorical variables into a numerical format that machine learning algorithms can work with.
  ◦ Binning: Converting continuous data into discrete intervals or bins.
  ◦ Log Transformation: Used to transform skewed data into a more normal (Gaussian-like) distribution.
  ◦ Feature Extraction: Creating new variables from the existing ones, for example with Principal Component Analysis (PCA).
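A compact sketch of these techniques with pandas and scikit-learn on a toy DataFrame (the columns and bin edges are made up for the example):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.DataFrame({
    "age": [23, 45, 31, 62],
    "income": [30_000, 85_000, 52_000, 110_000],
    "segment": ["A", "B", "A", "C"],
})  # toy data

# Normalization: scale income into the 0-1 range.
df["income_norm"] = MinMaxScaler().fit_transform(df[["income"]]).ravel()

# Standardization: rescale age to mean 0, standard deviation 1.
df["age_z"] = StandardScaler().fit_transform(df[["age"]]).ravel()

# One-hot encoding of the categorical segment column.
dummies = pd.get_dummies(df["segment"], prefix="segment")

# Binning: continuous ages into discrete intervals.
df["age_bin"] = pd.cut(df["age"], bins=[0, 30, 50, 100], labels=["young", "mid", "senior"])

# Log transformation to compress the skewed income attribute.
df["income_log"] = np.log1p(df["income"])
```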
Data Transformations

• Handling Complex Data:
  ◦ Date and Time Transformation: Extracting specific components like day, month, year, or time of day.
  ◦ Text Transformation: Techniques like tokenization, stemming, or encoding to convert text into numerical data.
  ◦ Spatial Transformation: Converting spatial data (like latitude and longitude) into distinct zones or distances.
• Data Integration:
  ◦ Aggregation: Combining multiple data rows into a single row, often using methods like sum, average, or count.
  ◦ Pivoting: Rotating data from a long format to a wide format, or vice versa.
  ◦ Joining: Combining data from multiple sources based on common attributes.
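A short sketch of date extraction, joining, aggregation, and pivoting in pandas (the two toy tables and their columns are assumptions for illustration):

```python
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer_id": [10, 11, 10],
    "ordered_at": pd.to_datetime(["2024-01-05", "2024-02-17", "2024-02-20"]),
    "amount": [120.0, 80.0, 45.0],
})
customers = pd.DataFrame({"customer_id": [10, 11], "region": ["North", "South"]})

# Date/time transformation: extract components from the timestamp.
orders["month"] = orders["ordered_at"].dt.month
orders["weekday"] = orders["ordered_at"].dt.day_name()

# Joining: combine the two sources on a common attribute.
merged = orders.merge(customers, on="customer_id", how="left")

# Aggregation: one row per customer with total spend.
per_customer = merged.groupby("customer_id")["amount"].sum()

# Pivoting: long to wide (total amount per region and month).
wide = merged.pivot_table(index="region", columns="month", values="amount", aggfunc="sum")
```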
Data Enrichment

• Data enrichment is the process of enhancing, refining, and improving raw data by supplementing it with relevant information from external sources. The main goal of data enrichment is to add value to the original dataset, making it more comprehensive, accurate, and insightful for decision-making or analysis.
• Purpose of Data Enrichment:
  ◦ Completeness: Filling gaps in datasets with missing or incomplete information.
  ◦ Accuracy: Correcting or verifying existing data entries.
  ◦ Enhanced Insights: Adding depth and context to facilitate better analyses and informed decision-making.
Data Enrichment

• Common Techniques & Sources:
  ◦ Third-party Databases: Leveraging external databases to pull in relevant data, such as demographic information or industry statistics.
  ◦ Web Scraping: Extracting data from websites to supplement existing datasets.
  ◦ Geospatial Enrichment: Augmenting datasets with geographical or locational data.
  ◦ Social Media & Online Platforms: Pulling user-generated content or sentiment to enhance consumer data.
• Benefits:
  ◦ Enhanced Decision-making: Offers a broader perspective by adding layers of context to existing data.
  ◦ Personalization: Helps in tailoring products, services, or content to individual preferences or profiles.
  ◦ Better Segmentation: Facilitates a deeper understanding of customer segments.
  ◦ Improved Data Quality: Boosts the reliability and accuracy of the dataset.
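In practice, enrichment often reduces to joining the internal dataset with an external lookup table; the demographic file and its columns below are entirely hypothetical:

```python
import pandas as pd

customers = pd.read_csv("customers.csv")            # internal data (hypothetical)
demographics = pd.read_csv("zip_demographics.csv")  # external source (hypothetical)

# Enrich each customer record with regional statistics from the external table.
enriched = customers.merge(demographics, on="zip_code", how="left")

# Flag records the external source could not cover, so gaps remain visible.
enriched["was_enriched"] = enriched["median_income"].notna()
```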
Data Validation

• Data validation is the process of ensuring that data is accurate, reliable, and meets the specified criteria before it's used in any system or analysis. It checks the quality and integrity of the data, ensuring that it's free from errors and inconsistencies.
• Purpose of Data Validation:
  ◦ Accuracy: Ensure that the data collected or entered is correct.
  ◦ Consistency: Make sure data is logical and consistent across datasets.
  ◦ Completeness: Check that no essential data points are missing.
  ◦ Reliability: Ensure that the data can be trusted for decision-making and analysis.
Data Validation

• Common Techniques:
  ◦ Range Check: Verifying that a data value falls within a specified range.
  ◦ Format Check: Ensuring data is in a specific format, like a valid email address or phone number.
  ◦ List Check: Validating data against a predefined list of acceptable values.
  ◦ Consistency Check: Ensuring data doesn't contain contradictions, such as a date of birth indicating a person is 150 years old.
  ◦ Uniqueness Check: Verifying that entries in a unique field, like a user ID, are not duplicated.
  ◦ Logical Check: Confirming that data combinations make logical sense, such as gender and salutation alignment.
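These checks map naturally onto boolean masks in pandas; a minimal sketch, assuming a hypothetical sign-up dataset with the columns shown (the email pattern is deliberately simple, not a full validator):

```python
import pandas as pd

df = pd.read_csv("signups.csv")  # hypothetical input

# Range check: ages must fall within a plausible interval.
bad_age = ~df["age"].between(0, 120)

# Format check: a simple (not exhaustive) e-mail pattern.
bad_email = ~df["email"].str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", na=False)

# List check: country codes must come from an approved list.
bad_country = ~df["country"].isin(["PK", "US", "GB", "DE"])

# Uniqueness check: user IDs must not repeat.
duplicate_id = df["user_id"].duplicated(keep=False)

# Consistency check: sign-up date cannot precede date of birth.
inconsistent_dates = pd.to_datetime(df["signup_date"]) < pd.to_datetime(df["date_of_birth"])

# One row per record, one flag per rule; column sums give an error report.
report = pd.DataFrame({
    "bad_age": bad_age,
    "bad_email": bad_email,
    "bad_country": bad_country,
    "duplicate_id": duplicate_id,
    "inconsistent_dates": inconsistent_dates,
})
print(report.sum())
```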
Data Validation

• Types of Data Validation:
  ◦ Manual Verification: Humans checking data for errors, often used for subjective data.
  ◦ Automated Validation: Using software or algorithms to check data against certain rules or patterns.
  ◦ Real-time Validation: Validating data immediately as it's entered into a system, often seen in online forms.
• Benefits:
  ◦ Reduced Errors: Minimizing the number of inaccuracies and mistakes in datasets.
  ◦ Improved Decision-making: Reliable data leads to more accurate analyses and better decisions.
  ◦ Efficiency: Identifying and correcting errors early can save time and resources later on.
  ◦ Compliance: Meeting regulatory and industry standards that require accurate data.
Data Validation

• Challenges & Considerations:
  ◦ False Positives/Negatives: Validation rules might incorrectly flag valid data or miss invalid data.
  ◦ Complexity: As data grows in volume and variety, validation can become more complex.
  ◦ Balancing Rigor with Flexibility: Overly strict validation rules can reject data that's slightly off but still valuable.
• Post-validation Activities:
  ◦ Data Cleansing: Once data validation identifies errors, the next step is often to clean or correct the data.
  ◦ Feedback Loops: Especially in real-time validation, providing users with immediate feedback can help correct errors at the source.
