
DATA EXPLORATION, CLEANING OF DATA

Contents

• Data Exploration
• Missing Data
• Outlier Data
• Data exploration, cleaning and preparation can take up to 70% of your total project time.
• Steps involved to understand, clean and prepare the data for building your predictive model:
1. Variable Identification
2. Univariate Analysis
3. Bi-variate Analysis
4. Missing values treatment
5. Outlier treatment
6. Variable transformation
7. Variable creation
We need to iterate over steps 4-7 multiple times before we come up with our refined model.
Variable Identification

• First, identify Predictor (Input) and Target (Output) variables. Next, identify the data type and category of the variables.

[Figure: example data table with several predictor columns and an Output/Target/Class label column.]
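As a minimal sketch of this step in Python (pandas assumed; the file name "data.csv" and the `label` target column are hypothetical):

```python
import pandas as pd

# Hypothetical dataset: "data.csv" and its "label" target column are assumptions.
df = pd.read_csv("data.csv")

target = df["label"]                    # Target (Output) variable
predictors = df.drop(columns="label")   # Predictor (Input) variables

# Identify data types: numeric columns are continuous candidates,
# everything else is a categorical candidate.
print(predictors.dtypes)
print(predictors.select_dtypes(include="number").columns)
print(predictors.select_dtypes(exclude="number").columns)
```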
Univariate Analysis

• Univariate analysis is used to explore variables one by one. It is also used to highlight missing and outlier values.
• The method used to perform univariate analysis depends on whether the variable type is categorical or continuous.
• Continuous Variables:- For continuous variables, we need to understand the central tendency and spread of the variable.
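A minimal sketch of univariate analysis for a continuous variable, using a made-up pandas Series:

```python
import pandas as pd

age = pd.Series([23, 25, 31, 35, 35, 40, 47, 52, 61, 70])

# Central tendency and spread in one call:
print(age.describe())   # count, mean, std, min, quartiles, max
print(age.median())     # robust measure of central tendency
print(age.skew())       # asymmetry of the distribution
```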
Univariate Analysis

• Categorical Variables:- For categorical variables, we use a frequency table to understand the distribution of each category. We can also read it as the percentage of values under each category. It can be measured using two metrics, Count and Count%, against each category. A bar chart can be used as visualization.

• The most common representation of a distribution is a histogram, which is a graph that shows the frequency of each value. For example, we can show the ages of working men and women separately.
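A minimal sketch of both ideas, using hypothetical `gender` and `age` columns:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"gender": ["M", "F", "M", "M", "F", "F"],
                   "age": [25, 31, 40, 22, 36, 29]})

# Frequency table: Count and Count% against each category.
counts = df["gender"].value_counts()
percents = df["gender"].value_counts(normalize=True) * 100
print(pd.DataFrame({"Count": counts, "Count%": percents.round(1)}))

# Histograms of age, drawn per gender on the same axes.
df.groupby("gender")["age"].plot(kind="hist", alpha=0.5, legend=True)
plt.xlabel("age")
plt.show()
```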
Bivariate Analysis

• Bi-variate analysis finds out the relationship between two variables.
• Here, we look for association and disassociation between variables at a pre-defined significance level.
• We can perform bi-variate analysis for any combination of categorical and continuous variables. The combination can be:
  • Categorical & Categorical,
  • Categorical & Continuous, and
  • Continuous & Continuous.
Bivariate Analysis:

• Continuous & Continuous: While doing bi-variate analysis between two continuous variables, we should look at a scatter plot. It is a nifty way to find out the relationship between two variables. The pattern of the scatter plot indicates the relationship between the variables. The relationship can be linear or non-linear.

• A scatter plot is a set of points that represents the values obtained for two different variables plotted on horizontal and vertical axes. Scatter plots are capable of showing thousands of bivariate (two-dimensional) points in a clear, understandable manner.

When to use scatter plots?

• A scatterplot is a type of data display that shows the relationship between two numerical variables. Each member of the dataset gets plotted as a point whose (x, y) coordinates relate to its values for the two variables.
• Scatter plots are sometimes called correlation plots because they show how two variables are correlated.
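A minimal scatter-plot sketch with matplotlib, using made-up x and y values:

```python
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]

plt.scatter(x, y)            # each (x, y) pair becomes one point
plt.xlabel("X variable")
plt.ylabel("Y variable")
plt.title("Scatter plot: roughly linear positive relationship")
plt.show()
```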
What is correlation?

• We often see patterns or relationships in scatterplots.

• Positive correlation: When the y variable tends to increase as the x variable increases, we say there is a positive correlation between the variables.

• Negative correlation: When the y variable tends to decrease as the x variable increases, we say there is a negative correlation between the variables.

• No correlation: When there is no clear relationship between the two variables, we say there is no correlation between the two variables.
Scatter Plots

• A scatter plot shows the relationship between two variables but does not indicate the strength of the relationship between them.

• To find the strength of the relationship, we use Correlation. Correlation varies between -1 and +1.
  -1 : perfect negative linear correlation
  +1 : perfect positive linear correlation
   0 : no correlation

Correlation = Covariance(X, Y) / SQRT( Var(X) * Var(Y) )
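A minimal sketch of this formula with numpy, cross-checked against numpy's built-in correlation coefficient:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1])

# Correlation = Cov(X, Y) / sqrt(Var(X) * Var(Y))
cov_xy = np.cov(x, y)[0, 1]
r = cov_xy / np.sqrt(np.var(x, ddof=1) * np.var(y, ddof=1))

print(r)                        # manual computation
print(np.corrcoef(x, y)[0, 1])  # same value from numpy's built-in
```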
Correlation

Example: a correlation of 0.65 indicates a good positive relationship between the two variables X and Y.
• Categorical & Categorical:
• Two-way table: We can start analyzing the relationship by creating a two-way table of count and count%. The rows represent the categories of one variable and the columns represent the categories of the other variable. We show the count or count% of observations available in each combination of row and column categories.
• Stacked Column Chart: This method is more of a visual form of the two-way table.
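A minimal two-way table sketch with pandas, using hypothetical `gender` and `plays_cricket` columns:

```python
import pandas as pd

df = pd.DataFrame({"gender": ["M", "M", "F", "F", "M", "F"],
                   "plays_cricket": ["Y", "Y", "N", "Y", "N", "N"]})

# Two-way table of counts.
print(pd.crosstab(df["gender"], df["plays_cricket"]))

# Same table as row-wise count% (each row sums to 100).
print(pd.crosstab(df["gender"], df["plays_cricket"], normalize="index") * 100)
```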
Stacked Bar Plots

Stacked bar charts illustrating the survivorship rate on the doomed ship Titanic, by ticket class:
• The histogram (left) informs us of the size of each class.
• The scaled bars (right) better capture the proportions.
Chi-Square Test:

• This test is used to derive the statistical significance of the relationship between the variables.
• It returns a probability for the computed chi-square statistic with its degrees of freedom.
• Probability of 0: It indicates that both categorical variables are dependent.
• Probability of 1: It shows that both variables are independent.
• Probability less than 0.05: It indicates that the relationship between the variables is significant at 95% confidence.

Statistical measures used to analyze the power of the relationship are:
• Cramer's V for nominal categorical variables
• Mantel-Haenszel Chi-Square for ordinal categorical variables
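A minimal sketch of the test with scipy, reusing the hypothetical two-way table from above:

```python
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.DataFrame({"gender": ["M", "M", "F", "F", "M", "F"],
                   "plays_cricket": ["Y", "Y", "N", "Y", "N", "N"]})

table = pd.crosstab(df["gender"], df["plays_cricket"])
chi2, p_value, dof, expected = chi2_contingency(table)

print(f"chi2={chi2:.3f}, p={p_value:.3f}, dof={dof}")
if p_value < 0.05:
    print("Relationship is significant at 95% confidence")
else:
    print("No significant relationship detected")
```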
• Categorical & Continuous: While exploring the relation between categorical and continuous variables, we can draw box plots for each level of the categorical variable. If the number of levels is small, the plot alone will not show statistical significance. To look at statistical significance we can perform a Z-test, T-test or ANOVA.

• Z-Test/T-Test: Either test assesses whether the means of two groups are statistically different from each other or not.
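A minimal two-sample t-test sketch with scipy, using made-up values for the two groups:

```python
from scipy.stats import ttest_ind

group_a = [23, 25, 31, 35, 28, 30]   # e.g., variable values for category A
group_b = [40, 47, 52, 38, 45, 49]   # e.g., variable values for category B

t_stat, p_value = ttest_ind(group_a, group_b)
print(f"t={t_stat:.3f}, p={p_value:.4f}")
if p_value < 0.05:
    print("Group means are statistically different at 95% confidence")
```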
Data Errors

• Data cleansing or data cleaning is the process of detecting and correcting (or removing) corrupted or inaccurate records from the data set.

• It refers to identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data.
Missing Data

• Missing data in the training data set can reduce the power / fit of a model, or can lead to a biased model.

• Reason: we have not analyzed the behavior and relationship with other variables correctly.

• Missing data can lead to wrong prediction or classification.

[Figure: two versions of the same gender vs. playing-cricket frequency table, one leading to the inference that the chances of playing cricket by males are higher than by females, the other leading to the opposite inference, that the chances of playing cricket by females are higher than by males, illustrating how missing data can flip a conclusion.]
Reason for Missing Data?

• Missing data may occur at two stages: Data Extraction and Data Collection.

Data Extraction:
• It is easy to find the errors at this stage.
• Hashing procedures can also be used to make sure data extraction is correct.

Data Collection:
• These errors are harder to correct. They are categorized in four types:
• Missing completely at random: the probability of a missing value is the same for all observations (e.g., people may or may not disclose their earnings, independently of everything else).
• Missing at random: the missing ratio varies with other observed variables (e.g., the age variable has higher missing values for females compared to males).
• Missing that depends on unobserved predictors (e.g., a survey using "discomfort" as an input variable for all patients).
• Missing that depends on the missing value itself (e.g., people with higher or lower income may not disclose their earnings).
Dealing with Missing Values

• An important aspect of data cleaning is identifying fields for which data isn't there, and then properly compensating for them.
• If missing entries are replaced with NAN, they can get misinterpreted as data when the model is built (e.g., a Gender column reading M, M, NAN, F, NAN).
• Using a value like -1 as a no-data symbol has exactly the same deficiencies as zero.
• Separately maintain both the raw data and its cleaned version. The raw data is the ground truth, and must be preserved intact for future analysis.
• The cleaned data may be improved using imputation to fill in missing values.
Approaches - Missing Values

Drop
• If the missing values in a column rarely happen and occur at random, then the easiest and most straightforward solution is to drop the observations (rows) that have missing values.
• Otherwise, dropping these records will lead to biased results.

Deletion - List Wise Deletion and Pair Wise Deletion
• List wise deletion: Simplicity is one of the major advantages of this method, but it reduces the power of the model because it reduces the sample size.
• Pair wise deletion: Advantage - it keeps as many cases as possible available for analysis. Disadvantage - it uses different sample sizes for different variables.
Approaches - Missing Values

Impute
• It means to calculate the missing value based on other observations, for example:
• Using statistical values like the mean or median (quantitative attributes)
• Using statistical values like the mode (qualitative attributes)
• Using a linear regression
• Copying values from other similar records
Imputation

Do-Nothing:
• Let the algorithm handle the missing data
• Some algorithms learn the best imputation values for the missing data based
on the training loss reduction
Heuristic-based imputation:
• Make a reasonable guess of the missing value, given sufficient knowledge of the
underlying domain
Imputation

• Mean value imputation:
• This works by calculating the mean/median of the non-missing values in a column and then replacing the missing values within each column, separately and independently from the others. It can only be used with numeric data.
Imputation

• Mean value imputation:

Pros:
• Easy and fast.
• Works well with small numerical datasets.
Cons:
• Doesn't factor in the correlations between features. It only works on the column level.
• Will give poor results on encoded categorical features (do NOT use it on categorical features).
• Not very accurate.
• Doesn't account for the uncertainty in the imputations.
Imputation
Random value imputation:
• Another approach is to select a random value from the column to replace the missing value.
Imputation using (Most Frequent) or (Zero/Constant) values:
• Most Frequent is another statistical strategy to impute missing values, replacing missing data with the most frequent value within each column.
Pros:
• Works well with categorical features.
Cons:
• It also doesn't factor in the correlations between features.
• It can introduce bias in the data.
• It is the least preferred option.
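A minimal most-frequent imputation sketch using scikit-learn's SimpleImputer:

```python
import numpy as np
from sklearn.impute import SimpleImputer

gender = np.array([["M"], ["M"], [np.nan], ["F"], [np.nan]], dtype=object)

# Replace each missing entry with the most frequent value in its column.
imputer = SimpleImputer(strategy="most_frequent")
print(imputer.fit_transform(gender).ravel())  # ['M' 'M' 'M' 'F' 'M']
```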
Imputation - k-nearest neighbor
Imputation by nearest neighbor:
• Identify the complete record which matches most closely on all fields present, and use this nearest neighbor to infer the values of what is missing.
• This approach requires a distance function to identify the most similar records. Nearest neighbor methods are an important technique in data science.
Pros:
• k-nearest neighbor can predict both qualitative & quantitative attributes.
• Can be much more accurate than the mean, median or most frequent imputation methods (it depends on the dataset).
• Attributes with multiple missing values can be easily treated.
DEFINITION OF NEAREST NEIGHBOR

[Figure: a test input surrounded by training points of a square class and a triangle class, with neighborhoods drawn for k = 1, k = 3 and k = 7.]

• k = 1: the single nearest neighbor is a square (Square - 1, Triangle - 0), so the vote is Square and the test input belongs to the square class.
• k = 3: the three nearest neighbors are 1 square and 2 triangles (Triangle > Square), so the test input belongs to the triangle class.
• k = 7: among seven neighbors the majority is squares again, so the test input belongs to the square class.
Example

Training data, with squared Euclidean distances from the test input (located at (3, 7) in the figure):

X1  X2  Distance, D  Rank (minimum distance)  Included in 3-nearest neighbors?  Label
7   7   16           3                        Yes                               Bad
7   4   25           4                        No                                -
3   4   9            1                        Yes                               Good
1   4   13           2                        Yes                               Good

With k = 3, the majority label among the included neighbors is Good (2 Good vs. 1 Bad).
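A minimal sketch of this computation (squared Euclidean distance and majority vote; the test input at (3, 7) is read off the figure):

```python
from collections import Counter

# Training records: (x1, x2, label); test input assumed at (3, 7).
train = [(7, 7, "Bad"), (7, 4, "Bad"), (3, 4, "Good"), (1, 4, "Good")]
test = (3, 7)
k = 3

# Rank records by squared Euclidean distance to the test input.
dist = lambda r: (r[0] - test[0]) ** 2 + (r[1] - test[1]) ** 2
neighbors = sorted(train, key=dist)[:k]

# Majority vote among the k nearest labels.
votes = Counter(label for _, _, label in neighbors)
print(votes)                       # Counter({'Good': 2, 'Bad': 1})
print(votes.most_common(1)[0][0])  # Good
```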
Imputation - k-nearest neighbor
Cons:
• Computationally expensive.
• May be sensitive to outliers in the data.
• The KNN algorithm is very time-consuming when analyzing a large database: it searches through the whole dataset looking for the most similar instances.
• The choice of the k-value is very critical. A higher value of k would include attributes which are significantly different from what we need, whereas a lower value of k implies missing out on significant attributes.
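For comparison, scikit-learn ships a ready-made KNN imputer; a minimal sketch on a made-up numeric table:

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[25.0, 50.0],
              [np.nan, 60.0],
              [40.0, np.nan],
              [22.0, 45.0],
              [36.0, 70.0]])

# Fill each missing entry using the values of the k nearest rows,
# measured by Euclidean distance over the fields both rows share.
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))
```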
Imputation

Imputation by interpolation:
• Use a method like linear regression to predict the values of the target column, given the other fields in the record.
• Such models can be trained over full records and then applied to those with missing values.
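A minimal regression-imputation sketch with scikit-learn, assuming a hypothetical `income` column with gaps and a complete `age` column:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({"age": [25, 31, 40, 22, 36],
                   "income": [50, 62, np.nan, 45, np.nan]})

full = df[df["income"].notna()]     # train on complete records
missing = df[df["income"].isna()]   # apply to records with gaps

model = LinearRegression().fit(full[["age"]], full["income"])
df.loc[df["income"].isna(), "income"] = model.predict(missing[["age"]])
print(df)
```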
Outlier Detection

• An outlier is a data point that differs significantly from other observations.
• An outlier may be due to variability in the measurement or it may indicate experimental error.
• An outlier can cause serious problems in statistical analyses.
• It may be due to data entry mistakes.
• Outliers tend to make your data skewed and reduce accuracy.

Identifying Outliers
• Plot the frequency histogram and look at the location of the extreme elements.
• Treat it as an unsupervised learning problem, like clustering.
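A minimal sketch of the histogram approach; the z-score cutoff at the end is a common rule of thumb and an assumption of this sketch, not something prescribed by these slides:

```python
import numpy as np
import matplotlib.pyplot as plt

values = np.array([23, 25, 31, 35, 28, 30, 27, 26, 95])  # 95 looks extreme

# Histogram: the outlier sits far from the main mass of the data.
plt.hist(values, bins=10)
plt.xlabel("value")
plt.show()

# Assumed rule of thumb: flag points more than 2.5 std devs from the mean.
z = (values - values.mean()) / values.std()
print(values[np.abs(z) > 2.5])  # [95]
```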
Outlier Detection

• An outlier is an observation that lies outside the overall pattern of a distribution.
• Outliers are often easy to spot in histograms: for example, a point sitting far to the left of the main mass of a histogram is an outlier.
• Outliers can also occur when comparing relationships between two sets of data. Outliers of this type can be easily identified on a scatter diagram.
Various types of outliers

• Data Entry Errors: Human errors, such as errors caused during data collection, recording, or entry.
• Measurement/Experiment Error: The most common source of outliers. This is caused when the measurement instrument used turns out to be faulty.
• Intentional Outlier: Commonly found in self-reported measures that involve sensitive data.
• Data Processing Error: Whenever we perform data mining, we extract data from multiple sources. It is possible that some manipulation or extraction errors may lead to outliers in the dataset.
• Sampling Error: For instance, we have to measure the height of athletes, and by mistake we include a few basketball players in the sample. This inclusion is likely to cause outliers in the dataset.
Impact of Outlier Data

• It increases the error variance.
• It reduces the power of statistical tests.
• It decreases normality.
• Outliers can bias or influence estimates that may be of substantive interest.
• They can also violate the basic assumptions of Regression, ANOVA and other statistical models.
