Data Cleaning
Contents
Data Exploration
Missing Data
Outlier Data
Data exploration, cleaning and preparation can take up to 70% of your total
project time.
Steps involved to understand, clean and prepare the data for building your
predictive model:
1. Variable Identification
2. Univariate Analysis
3. Bi-variate Analysis
4. Missing values treatment
5. Outlier treatment
6. Variable transformation
7. Variable creation
We need to iterate over steps 4–7 multiple times before we come up with our refined model.
Variable Identification
First, identify Predictor (Input) and Target (output) variables. Next, identify the
data type and category of the variables.
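As a sketch of this step, the snippet below separates continuous from categorical variables by inspecting value types; the dataset, column names, and predictor/target roles are made-up examples, not part of the original material.

```python
# Hypothetical dataset: "age" and "gender" as predictors, "churn" as target.
data = {
    "age":    [45, 1, 65, 10, 75],        # predictor, continuous
    "gender": ["M", "F", "M", "F", "M"],  # predictor, categorical
    "churn":  ["Y", "N", "Y", "N", "N"],  # target, categorical
}

def variable_type(values):
    """Return 'continuous' if all values are numeric, else 'categorical'."""
    if all(isinstance(v, (int, float)) for v in values):
        return "continuous"
    return "categorical"

types = {col: variable_type(vals) for col, vals in data.items()}
print(types)
# {'age': 'continuous', 'gender': 'categorical', 'churn': 'categorical'}
```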
Univariate Analysis
Univariate analysis is used to explore variables one by one. It is also used to highlight
missing and outlier values.
The method used to perform univariate analysis depends on whether the variable is
categorical or continuous.
Continuous Variables: in the case of continuous variables, we need to understand the central
tendency and spread of the variable.
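As a minimal sketch of summarizing central tendency and spread with only the standard library; the sample values are made up, and the deliberately large last value shows why the median can be more robust than the mean:

```python
import statistics

values = [12, 15, 14, 10, 18, 14, 13, 95]  # note the suspicious 95

summary = {
    "mean":   statistics.mean(values),    # pulled upward by the outlier
    "median": statistics.median(values),  # robust to the outlier
    "stdev":  statistics.stdev(values),
    "min":    min(values),
    "max":    max(values),
}
print(summary)
# mean is 23.875 but median is 14, hinting at an outlier in the data
```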
Bi-variate Analysis
Continuous & Continuous: While doing bi-variate analysis between two continuous
variables, we should look at a scatter plot. It is a nifty way to find out the relationship
between two variables. The pattern of the scatter plot indicates the relationship between
the variables, which can be linear or non-linear.
Correlation = Covariance(X, Y) / sqrt(Var(X) * Var(Y))
Correlation Example
A correlation of 0.65 indicates a good positive relationship between the two variables X and Y.
−1: perfect negative linear correlation
+1: perfect positive linear correlation
0: no correlation
Categorical & Categorical:
• Two-way table: We can start analyzing the relationship by creating a two-way table of count
and count%. The rows represent the categories of one variable and the columns represent the
categories of the other variable. We show the count or count% of observations available in each
combination of row and column categories.
• Stacked Column Chart: This method is more of a visual form of Two-way table.
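A two-way table of counts can be sketched with the standard library; the Titanic-style class/survival values below are invented for illustration (in practice a routine like pandas `crosstab` does this directly):

```python
from collections import Counter

# Made-up observations of two categorical variables.
ticket_class = ["1st", "1st", "2nd", "3rd", "3rd", "3rd", "2nd", "1st"]
survived     = ["yes", "yes", "no",  "no",  "no",  "yes", "yes", "no"]

# Count each (row category, column category) combination.
table = Counter(zip(ticket_class, survived))
classes = sorted(set(ticket_class))
outcomes = sorted(set(survived))

print("class  " + "  ".join(outcomes))
for c in classes:
    row = "  ".join(str(table[(c, o)]) for o in outcomes)
    print(f"{c}    {row}")
```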
Stacked Bar Plots
Stacked bar charts illustrating the survivorship rate on the doomed ship Titanic, by
ticket class. The histogram (left) informs us of the size of each class, while the
scaled bars (right) better capture the proportions.
Chi-Square Test:
This test is used to derive the statistical significance of the relationship between two categorical variables.
It returns the probability for the computed chi-square statistic with the appropriate degrees of freedom.
Probability of 0: indicates that the two categorical variables are dependent.
Probability of 1: indicates that the two variables are independent.
Probability less than 0.05: indicates that the relationship between the variables is
significant at the 95% confidence level.
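As a sketch, the chi-square statistic for a contingency table can be computed from expected counts; the table values below are made up. In practice a library routine (e.g. `scipy.stats.chi2_contingency`) would also return the p-value.

```python
table = [[20, 30],   # e.g. group A: outcome yes / no (invented counts)
         [30, 20]]   # e.g. group B: outcome yes / no

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
n = sum(row_totals)

# Sum of (observed - expected)^2 / expected over all cells.
chi_square = 0.0
for i, row in enumerate(table):
    for j, observed in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n
        chi_square += (observed - expected) ** 2 / expected

print(chi_square)  # 4.0 for this table
```

For a 2×2 table there is 1 degree of freedom; 4.0 exceeds the 95% critical value of about 3.84, so this relationship would be called significant at 95% confidence.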
Z-Test / T-Test: Either test assesses whether the means of two groups are
statistically different from each other.
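As a sketch, a two-sample t statistic (Welch's form) can be computed with the standard library; both samples are invented, and a library routine would normally supply the p-value as well:

```python
import math
import statistics

def t_statistic(a, b):
    """Welch's t: difference of means over its standard error."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    se = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se

group_a = [5.1, 4.9, 5.4, 5.0, 5.2]  # made-up measurements, group A
group_b = [4.1, 4.3, 3.9, 4.2, 4.0]  # made-up measurements, group B

print(t_statistic(group_a, group_b))  # large positive t: means differ
```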
Data Errors
An important aspect of data cleaning is identifying fields for which data isn't there, and
then compensating properly.
If missing values are replaced with NaN, they can get misinterpreted as data when the model is built.
Using a value like −1 as a no-data symbol has exactly the same deficiencies as using zero.
Separately maintain both the raw data and its cleaned version. The raw data is the ground
truth, and must be preserved intact for future analysis.
The cleaned data may be improved using imputation to fill in missing values.
(Example: a Gender column with values M, M, NAN, F, NAN.)
Approaches - Missing Values
Drop
• If the missing values in a column rarely happen and occur at random, then the easiest and
most straightforward solution is to drop the observations (rows) that have missing values.
• If the values are not missing at random, however, dropping these records will lead to biased results.
Deletion: List-wise Deletion and Pair-wise Deletion
List-wise deletion: simplicity is one of the major advantages of this method, but it reduces
the power of the model because it reduces the sample size.
Pair-wise deletion: its advantage is that it keeps as many cases as possible available for
analysis; its disadvantage is that it uses a different sample size for different variables.
Approaches - Missing Values
Impute
• It means to calculate the missing value based on other observations.
• Using statistical values like mean, median (quantitative attribute)
• Using statistical values like mode (qualitative attribute)
• Using a linear regression
• Copying values from other similar records
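The statistical-value imputations above can be sketched with the standard library; the age and gender columns are invented examples:

```python
import statistics

ages    = [25, None, 31, 47, None, 29]   # quantitative attribute
genders = ["M", "M", None, "F", None, "M"]  # qualitative attribute

# Quantitative: fill with the median of the known values (robust to outliers).
known_ages = [a for a in ages if a is not None]
age_fill = statistics.median(known_ages)
ages_imputed = [a if a is not None else age_fill for a in ages]

# Qualitative: fill with the mode (most frequent category).
known_genders = [g for g in genders if g is not None]
gender_fill = statistics.mode(known_genders)
genders_imputed = [g if g is not None else gender_fill for g in genders]

print(ages_imputed)     # [25, 30.0, 31, 47, 30.0, 29]
print(genders_imputed)  # ['M', 'M', 'M', 'F', 'M', 'M']
```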
Imputation
Do-Nothing:
• Let the algorithm handle the missing data
• Some algorithms learn the best imputation values for the missing data based
on the training loss reduction
Heuristic-based imputation:
• Make a reasonable guess of the missing value, given sufficient knowledge of the
underlying domain
Imputation
Imputation by k-nearest neighbours (kNN): estimate a missing value from the k most
similar complete records, letting the neighbours vote.
(Figure: with k = 1 the test input belongs to the square class (Square 1, Triangle 0);
with k = 3 it belongs to the triangle class (Square 1, Triangle 2); with k = 7 it
belongs to the square class.)
Example
Test input classified as Bad, with distances to the training records:
X1  X2  Label  Distance
7   7   Bad    D1 = 9
7   4   Bad    D2 = 13
3   4   Good   D3 = 16
1   4   Good   D4 = 25
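The kNN vote can be sketched as follows; the training records follow the example table above, while the test point and the choice of Euclidean distance are assumptions for illustration:

```python
import math
from collections import Counter

# Training records from the example table: ((X1, X2), Label).
train = [((7, 7), "Bad"), ((7, 4), "Bad"), ((3, 4), "Good"), ((1, 4), "Good")]

def knn_classify(point, train, k):
    # Rank training records by Euclidean distance to the query point,
    # then take a majority vote among the k nearest labels.
    ranked = sorted(train, key=lambda rec: math.dist(point, rec[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical test point near the "Good" cluster.
print(knn_classify((3, 5), train, k=1))  # 'Good' (nearest neighbour)
print(knn_classify((3, 5), train, k=3))  # 'Good' (2 Good vs 1 Bad)
```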
Imputation by interpolation:
Use a method like linear regression to predict the
values of the target column, given the other fields
in the record.
Such models can be trained over full records and
then applied to those with missing values.
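A minimal sketch of this idea: fit a least-squares line over complete records, then use it to predict the missing value. All numbers here are invented, and the column roles (experience predicting salary) are hypothetical.

```python
def fit_line(xs, ys):
    """Return slope and intercept of the least-squares line y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

# Complete records: (years of experience, salary in thousands).
complete = [(1, 30), (2, 40), (3, 50), (4, 60)]
slope, intercept = fit_line([x for x, _ in complete],
                            [y for _, y in complete])

# A record with 5 years of experience is missing its salary: predict it.
missing_x = 5
imputed = slope * missing_x + intercept
print(imputed)  # 70.0
```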
Outlier Detection
Outliers are often easy to spot in histograms; for example, a single point far to the left
of the rest of the distribution is an outlier.
Outliers can also occur when comparing relationships between two sets of data; outliers of
this type can be easily identified on a scatter diagram.
Various types of outliers
• Data Entry Errors:- Human errors such as errors caused during data collection, recording, or
entry
• Measurement/Experiment Error: It is the most common source of outliers. This is caused when
the measurement instrument used turns out to be faulty.
• Intentional Outlier: This is commonly found in self-reported measures that involve sensitive
data.
• Data Processing Error: Whenever we perform data mining, we extract data from multiple sources.
It is possible that some manipulation or extraction errors may lead to outliers in the dataset.
• Sampling error: For instance, we have to measure the height of athletes. By mistake, we include a
few basketball players in the sample. This inclusion is likely to cause outliers in the dataset.
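As a sketch of spotting such outliers numerically, the common 1.5 × IQR rule flags values far outside the interquartile range; the height sample below (with one basketball player mistakenly included, echoing the sampling-error example) is made up:

```python
import statistics

def iqr_outliers(values):
    """Flag values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo or v > hi]

heights_cm = [170, 172, 168, 175, 171, 169, 174, 210]  # one basketball player
print(iqr_outliers(heights_cm))  # [210]
```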
Impact of Outlier Data