
DATA EXPLORATION, CLEANING OF DATA

Contents

• Data Exploration
• Missing Data
• Outlier Data
• Data exploration, cleaning and preparation can take up to 70% of your total project time.
• Steps involved to understand, clean and prepare the data for building your predictive model:
1. Variable Identification
2. Univariate Analysis
3. Bi-variate Analysis
4. Missing values treatment
5. Outlier treatment
6. Variable transformation
7. Variable creation
We need to iterate over steps 4-7 multiple times before we come up with our refined model.
Variable Identification

• First, identify Predictor (Input) and Target (Output) variables. Next, identify the data type and category of the variables.

[Figure: example data table with several predictor columns and an Output/Target/Class label column.]
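As a minimal sketch of this step in Python (pandas assumed; the file name "data.csv" and the `label` target column are hypothetical):

```python
import pandas as pd

# Hypothetical dataset: "data.csv" and its "label" target column are assumptions.
df = pd.read_csv("data.csv")

target = df["label"]                    # Target (Output) variable
predictors = df.drop(columns="label")   # Predictor (Input) variables

# Identify data types: numeric columns are continuous candidates,
# everything else is a categorical candidate.
print(predictors.dtypes)
print(predictors.select_dtypes(include="number").columns)
print(predictors.select_dtypes(exclude="number").columns)
```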
Univariate Analysis

• Univariate analysis is used to explore variables one by one. It is also used to highlight missing and outlier values.
• The method used to perform univariate analysis depends on whether the variable type is categorical or continuous.
• Continuous Variables:- For continuous variables, we need to understand the central tendency and spread of the variable.
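A minimal sketch of univariate analysis for a continuous variable, using a made-up pandas Series:

```python
import pandas as pd

age = pd.Series([23, 25, 31, 35, 35, 40, 47, 52, 61, 70])

# Central tendency and spread in one call:
print(age.describe())   # count, mean, std, min, quartiles, max
print(age.median())     # robust measure of central tendency
print(age.skew())       # asymmetry of the distribution
```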
Univariate Analysis

• Categorical Variables:- For categorical variables, we use a frequency table to understand the distribution of each category. We can also read it as the percentage of values under each category. It can be measured using two metrics, Count and Count%, against each category. A bar chart can be used as visualization.

• The most common representation of a distribution is a histogram, which is a graph that shows the frequency of each value. For example, we can show the ages of working men and women separately.
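A minimal sketch of both ideas, using hypothetical `gender` and `age` columns:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"gender": ["M", "F", "M", "M", "F", "F"],
                   "age": [25, 31, 40, 22, 36, 29]})

# Frequency table: Count and Count% against each category.
counts = df["gender"].value_counts()
percents = df["gender"].value_counts(normalize=True) * 100
print(pd.DataFrame({"Count": counts, "Count%": percents.round(1)}))

# Histograms of age, drawn per gender on the same axes.
df.groupby("gender")["age"].plot(kind="hist", alpha=0.5, legend=True)
plt.xlabel("age")
plt.show()
```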
Bivariate Analysis

• Bi-variate analysis finds out the relationship between two variables.
• Here, we look for association and disassociation between variables at a pre-defined significance level.
• We can perform bi-variate analysis for any combination of categorical and continuous variables. The combination can be:
  • Categorical & Categorical,
  • Categorical & Continuous, and
  • Continuous & Continuous.
Bivariate Analysis:

• Continuous & Continuous: While doing bi-variate analysis between two continuous variables, we should look at a scatter plot. It is a nifty way to find out the relationship between two variables. The pattern of the scatter plot indicates the relationship between the variables. The relationship can be linear or non-linear.

• A scatter plot is a set of points that represents the values obtained for two different variables plotted on horizontal and vertical axes. Scatter plots are capable of showing thousands of bivariate (two-dimensional) points in a clear, understandable manner.

When to use scatter plots?

• A scatterplot is a type of data display that shows the relationship between two numerical variables. Each member of the dataset gets plotted as a point whose (x, y) coordinates relate to its values for the two variables.
• Scatter plots are sometimes called correlation plots because they show how two variables are correlated.
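A minimal scatter-plot sketch with matplotlib, using made-up x and y values:

```python
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]

plt.scatter(x, y)            # each (x, y) pair becomes one point
plt.xlabel("X variable")
plt.ylabel("Y variable")
plt.title("Scatter plot: roughly linear positive relationship")
plt.show()
```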
What is correlation?

• We often see patterns or relationships in scatterplots.

• Positive correlation: When the y variable tends to increase as the x variable increases, we say there is a positive correlation between the variables.

• Negative correlation: When the y variable tends to decrease as the x variable increases, we say there is a negative correlation between the variables.

• No correlation: When there is no clear relationship between the two variables, we say there is no correlation between the two variables.
Scatter Plots

• A scatter plot shows the relationship between two variables but does not indicate the strength of the relationship between them.

• To find the strength of the relationship, we use Correlation. Correlation varies between -1 and +1.
  -1 : perfect negative linear correlation
  +1 : perfect positive linear correlation
   0 : no correlation

Correlation = Covariance(X, Y) / SQRT( Var(X) * Var(Y) )
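A minimal sketch of this formula with numpy, cross-checked against numpy's built-in correlation coefficient:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1])

# Correlation = Cov(X, Y) / sqrt(Var(X) * Var(Y))
cov_xy = np.cov(x, y)[0, 1]
r = cov_xy / np.sqrt(np.var(x, ddof=1) * np.var(y, ddof=1))

print(r)                        # manual computation
print(np.corrcoef(x, y)[0, 1])  # same value from numpy's built-in
```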
Correlation

Example: a correlation of 0.65 indicates a good positive relationship between the two variables X and Y.
• Categorical & Categorical:
• Two-way table: We can start analyzing the relationship by creating a two-way table of count and count%. The rows represent the categories of one variable and the columns represent the categories of the other variable. We show the count or count% of observations available in each combination of row and column categories.
• Stacked Column Chart: This method is more of a visual form of the two-way table.
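A minimal two-way table sketch with pandas, using hypothetical `gender` and `plays_cricket` columns:

```python
import pandas as pd

df = pd.DataFrame({"gender": ["M", "M", "F", "F", "M", "F"],
                   "plays_cricket": ["Y", "Y", "N", "Y", "N", "N"]})

# Two-way table of counts.
print(pd.crosstab(df["gender"], df["plays_cricket"]))

# Same table as row-wise count% (each row sums to 100).
print(pd.crosstab(df["gender"], df["plays_cricket"], normalize="index") * 100)
```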
Stacked Bar Plots

Stacked bar charts illustrating the survivorship rate on the doomed ship Titanic, by ticket class:
• The histogram (left) informs us of the size of each class.
• The scaled bars (right) better capture the proportions.
Chi-Square Test:

• This test is used to derive the statistical significance of the relationship between the variables.
• It returns a probability for the computed chi-square statistic with its degrees of freedom.
• Probability of 0: It indicates that both categorical variables are dependent.
• Probability of 1: It shows that both variables are independent.
• Probability less than 0.05: It indicates that the relationship between the variables is significant at 95% confidence.

Statistical measures used to analyze the power of the relationship are:
• Cramer's V for nominal categorical variables
• Mantel-Haenszel Chi-Square for ordinal categorical variables
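A minimal sketch of the test with scipy, reusing the hypothetical two-way table from above:

```python
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.DataFrame({"gender": ["M", "M", "F", "F", "M", "F"],
                   "plays_cricket": ["Y", "Y", "N", "Y", "N", "N"]})

table = pd.crosstab(df["gender"], df["plays_cricket"])
chi2, p_value, dof, expected = chi2_contingency(table)

print(f"chi2={chi2:.3f}, p={p_value:.3f}, dof={dof}")
if p_value < 0.05:
    print("Relationship is significant at 95% confidence")
else:
    print("No significant relationship detected")
```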
• Categorical & Continuous: While exploring the relation between categorical and continuous variables, we can draw box plots for each level of the categorical variable. If the number of levels is small, the plot alone will not show statistical significance. To look at statistical significance we can perform a Z-test, T-test or ANOVA.

• Z-Test/T-Test: Either test assesses whether the means of two groups are statistically different from each other or not.
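A minimal two-sample t-test sketch with scipy, using made-up values for the two groups:

```python
from scipy.stats import ttest_ind

group_a = [23, 25, 31, 35, 28, 30]   # e.g., variable values for category A
group_b = [40, 47, 52, 38, 45, 49]   # e.g., variable values for category B

t_stat, p_value = ttest_ind(group_a, group_b)
print(f"t={t_stat:.3f}, p={p_value:.4f}")
if p_value < 0.05:
    print("Group means are statistically different at 95% confidence")
```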
Data Errors

• Data cleansing or data cleaning is the process of detecting and correcting (or removing) corrupted or inaccurate records from the data set.

• It refers to identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data.
Missing Data

• Missing data in the training data set can reduce the power / fit of a model, or can lead to a biased model.

• Reason: we have not analyzed the behavior and relationship with other variables correctly.

• Missing data can lead to wrong prediction or classification.

[Figure: two versions of the same gender vs. playing-cricket frequency table, one leading to the inference that the chances of playing cricket by males are higher than by females, the other leading to the opposite inference, that the chances of playing cricket by females are higher than by males, illustrating how missing data can flip a conclusion.]
Reason for Missing Data?

• Missing data may occur at two stages: Data Extraction and Data Collection.

Data Extraction:
• It is easy to find the errors at this stage.
• Hashing procedures can also be used to make sure data extraction is correct.

Data Collection:
• These errors are harder to correct. They are categorized in four types:
• Missing completely at random: the probability of a missing value is the same for all observations (e.g., people may or may not disclose their earnings, independently of everything else).
• Missing at random: the missing ratio varies with other observed variables (e.g., the age variable has higher missing values for females compared to males).
• Missing that depends on unobserved predictors (e.g., a survey using "discomfort" as an input variable for all patients).
• Missing that depends on the missing value itself (e.g., people with higher or lower income may not disclose their earnings).
Dealing with Missing Values

• An important aspect of data cleaning is identifying fields for which data isn't there, and then properly compensating for them.
• If missing entries are replaced with NAN, they can get misinterpreted as data when the model is built (e.g., a Gender column reading M, M, NAN, F, NAN).
• Using a value like -1 as a no-data symbol has exactly the same deficiencies as zero.
• Separately maintain both the raw data and its cleaned version. The raw data is the ground truth, and must be preserved intact for future analysis.
• The cleaned data may be improved using imputation to fill in missing values.
Approaches - Missing Values

Drop
• If the missing values in a column rarely happen and occur at random, then the easiest and most straightforward solution is to drop the observations (rows) that have missing values.
• Otherwise, dropping these records will lead to biased results.

Deletion - List Wise Deletion and Pair Wise Deletion
• List wise deletion: Simplicity is one of the major advantages of this method, but it reduces the power of the model because it reduces the sample size.
• Pair wise deletion: Advantage - it keeps as many cases as possible available for analysis. Disadvantage - it uses different sample sizes for different variables.
Approaches - Missing Values

Impute
• It means to calculate the missing value based on other observations, for example:
• Using statistical values like the mean or median (quantitative attributes)
• Using statistical values like the mode (qualitative attributes)
• Using a linear regression
• Copying values from other similar records
Imputation

Do-Nothing:
• Let the algorithm handle the missing data
• Some algorithms learn the best imputation values for the missing data based
on the training loss reduction
Heuristic-based imputation:
• Make a reasonable guess of the missing value, given sufficient knowledge of the
underlying domain
Imputation

• Mean value imputation:
• This works by calculating the mean/median of the non-missing values in a column and then replacing the missing values within each column, separately and independently from the others. It can only be used with numeric data.
Imputation

• Mean value imputation:

Pros:
• Easy and fast.
• Works well with small numerical datasets.
Cons:
• Doesn't factor in the correlations between features. It only works on the column level.
• Will give poor results on encoded categorical features (do NOT use it on categorical features).
• Not very accurate.
• Doesn't account for the uncertainty in the imputations.
Imputation
Random value imputation:
• Another approach is to select a random value from the column to replace the missing value.
Imputation using (Most Frequent) or (Zero/Constant) values:
• Most Frequent is another statistical strategy to impute missing values, replacing missing data with the most frequent value within each column.
Pros:
• Works well with categorical features.
Cons:
• It also doesn't factor in the correlations between features.
• It can introduce bias in the data.
• It is the least preferred option.
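A minimal most-frequent imputation sketch using scikit-learn's SimpleImputer:

```python
import numpy as np
from sklearn.impute import SimpleImputer

gender = np.array([["M"], ["M"], [np.nan], ["F"], [np.nan]], dtype=object)

# Replace each missing entry with the most frequent value in its column.
imputer = SimpleImputer(strategy="most_frequent")
print(imputer.fit_transform(gender).ravel())  # ['M' 'M' 'M' 'F' 'M']
```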
Imputation - k-nearest neighbor
Imputation by nearest neighbor:
• Identify the complete record which matches most closely on all fields present, and use this nearest neighbor to infer the values of what is missing.
• This approach requires a distance function to identify the most similar records. Nearest neighbor methods are an important technique in data science.
Pros:
• k-nearest neighbor can predict both qualitative & quantitative attributes.
• Can be much more accurate than the mean, median or most frequent imputation methods (it depends on the dataset).
• Attributes with multiple missing values can be easily treated.
DEFINITION OF NEAREST NEIGHBOR

[Figure: a test input surrounded by training points of a square class and a triangle class, with neighborhoods drawn for k = 1, k = 3 and k = 7.]

• k = 1: the single nearest neighbor is a square (Square - 1, Triangle - 0), so the vote is Square and the test input belongs to the square class.
• k = 3: the three nearest neighbors are 1 square and 2 triangles (Triangle > Square), so the test input belongs to the triangle class.
• k = 7: among seven neighbors the majority is squares again, so the test input belongs to the square class.
Example

Training data, with squared Euclidean distances from the test input (located at (3, 7) in the figure):

X1  X2  Distance, D  Rank (minimum distance)  Included in 3-nearest neighbors?  Label
7   7   16           3                        Yes                               Bad
7   4   25           4                        No                                -
3   4   9            1                        Yes                               Good
1   4   13           2                        Yes                               Good

With k = 3, the majority label among the included neighbors is Good (2 Good vs. 1 Bad).
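A minimal sketch of this computation (squared Euclidean distance and majority vote; the test input at (3, 7) is read off the figure):

```python
from collections import Counter

# Training records: (x1, x2, label); test input assumed at (3, 7).
train = [(7, 7, "Bad"), (7, 4, "Bad"), (3, 4, "Good"), (1, 4, "Good")]
test = (3, 7)
k = 3

# Rank records by squared Euclidean distance to the test input.
dist = lambda r: (r[0] - test[0]) ** 2 + (r[1] - test[1]) ** 2
neighbors = sorted(train, key=dist)[:k]

# Majority vote among the k nearest labels.
votes = Counter(label for _, _, label in neighbors)
print(votes)                       # Counter({'Good': 2, 'Bad': 1})
print(votes.most_common(1)[0][0])  # Good
```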
Imputation - k-nearest neighbor
Cons:
• Computationally expensive.
• May be sensitive to outliers in the data.
• The KNN algorithm is very time-consuming when analyzing a large database: it searches through the whole dataset looking for the most similar instances.
• The choice of the k-value is very critical. A higher value of k would include attributes which are significantly different from what we need, whereas a lower value of k implies missing out on significant attributes.
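For comparison, scikit-learn ships a ready-made KNN imputer; a minimal sketch on a made-up numeric table:

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[25.0, 50.0],
              [np.nan, 60.0],
              [40.0, np.nan],
              [22.0, 45.0],
              [36.0, 70.0]])

# Fill each missing entry using the values of the k nearest rows,
# measured by Euclidean distance over the fields both rows share.
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))
```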
Imputation

Imputation by interpolation:
• Use a method like linear regression to predict the values of the target column, given the other fields in the record.
• Such models can be trained over full records and then applied to those with missing values.
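A minimal regression-imputation sketch with scikit-learn, assuming a hypothetical `income` column with gaps and a complete `age` column:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({"age": [25, 31, 40, 22, 36],
                   "income": [50, 62, np.nan, 45, np.nan]})

full = df[df["income"].notna()]     # train on complete records
missing = df[df["income"].isna()]   # apply to records with gaps

model = LinearRegression().fit(full[["age"]], full["income"])
df.loc[df["income"].isna(), "income"] = model.predict(missing[["age"]])
print(df)
```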
Outlier Detection

• An outlier is a data point that differs significantly from other observations.
• An outlier may be due to variability in the measurement or it may indicate experimental error.
• An outlier can cause serious problems in statistical analyses.
• It may be due to data entry mistakes.
• Outliers tend to make your data skewed and reduce accuracy.

Identifying Outliers
• Plot the frequency histogram and look at the location of the extreme elements.
• Treat it as an unsupervised learning problem, like clustering.
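A minimal sketch of the histogram approach; the z-score cutoff at the end is a common rule of thumb and an assumption of this sketch, not something prescribed by these slides:

```python
import numpy as np
import matplotlib.pyplot as plt

values = np.array([23, 25, 31, 35, 28, 30, 27, 26, 95])  # 95 looks extreme

# Histogram: the outlier sits far from the main mass of the data.
plt.hist(values, bins=10)
plt.xlabel("value")
plt.show()

# Assumed rule of thumb: flag points more than 2.5 std devs from the mean.
z = (values - values.mean()) / values.std()
print(values[np.abs(z) > 2.5])  # [95]
```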
Outlier Detection

• An outlier is an observation that lies outside the overall pattern of a distribution.
• Outliers are often easy to spot in histograms: for example, a point sitting far to the left of the main mass of a histogram is an outlier.
• Outliers can also occur when comparing relationships between two sets of data. Outliers of this type can be easily identified on a scatter diagram.
Various types of outliers

• Data Entry Errors: Human errors, such as errors caused during data collection, recording, or entry.
• Measurement/Experiment Error: The most common source of outliers. This is caused when the measurement instrument used turns out to be faulty.
• Intentional Outlier: Commonly found in self-reported measures that involve sensitive data.
• Data Processing Error: Whenever we perform data mining, we extract data from multiple sources. It is possible that some manipulation or extraction errors may lead to outliers in the dataset.
• Sampling Error: For instance, we have to measure the height of athletes, and by mistake we include a few basketball players in the sample. This inclusion is likely to cause outliers in the dataset.
Impact of Outlier Data

• It increases the error variance.
• It reduces the power of statistical tests.
• It decreases normality.
• Outliers can bias or influence estimates that may be of substantive interest.
• They can also violate the basic assumptions of Regression, ANOVA and other statistical models.
