0% found this document useful (0 votes)

51 views23 pages

Data Exploration

Uploaded by

Shivansh Ghelani

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

51 views23 pages

Data Exploration

Uploaded by

Shivansh Ghelani

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 23

Connect, Message, Like,Follow&Share, 100% Free Counselling➔ThankYou

Data Exploration
• A complete tutorial on data exploration (EDA)
• We cover several data exploration aspects, including missing value
imputation, outlier removal and the art of feature engineering

There are no shortcuts for data exploration. If you are in a state of mind, that
machine learning can sail you away from every data storm, trust me, it won’t.
After some point of time, you’ll realize that you are struggling at improving
model’s accuracy. In such situation, data exploration techniques will come to
your rescue.

Contents
1. Steps of Data Exploration and Preparation
2. Missing Value Treatment
o Why missing value treatment is required ?
o Why data has missing values?
o Which are the methods to treat missing value ?
3. Techniques of Outlier Detection and Treatment
o What is an outlier?
o What are the types of outliers ?
o What are the causes of outliers ?
o What is the impact of outliers on dataset ?
o How to detect outlier ?
o How to remove outlier ?
4. The Art of Feature Engineering
o What is Feature Engineering ?
o What is the process of Feature Engineering ?
o What is Variable Transformation ?
o When should we use variable transformation ?
o What are the common methods of variable transformation ?
o What is feature variable creation and its benefits ?

1. Steps of Data Exploration and Preparation

Remember the quality of your inputs decide the quality of your output. So, once
you have got your business hypothesis ready, it makes sense to spend lot of time
and efforts here. With my personal estimate, data exploration, cleaning and
preparation can take up to 70% of your total project time.

Below are the steps involved to understand, clean and prepare your data for
building your predictive model:

1. Variable Identification
2. Univariate Analysis
3. Bi-variate Analysis
4. Missing values treatment
5. Outlier treatment
6. Variable transformation
7. Variable creation

Finally, we will need to iterate over steps 4 – 7 multiple times before we come
up with our refined model.

Let’s now study each stage in detail:-

Variable Identification

First, identify Predictor (Input) and Target (output) variables. Next, identify
the data type and category of the variables.
Let’s understand this step more clearly by taking an example.

Example:- Suppose, we want to predict, whether the students will play cricket
or not (refer below data set). Here you need to identify predictor variables,
target variable, data type of variables and category of variables.

Below, the variables have been defined in different category:

Univariate Analysis

At this stage, we explore variables one by one. Method to perform uni-variate

analysis will depend on whether the variable type is categorical
or continuous. Let’s look at these methods and statistical measures for
categorical and continuous variables individually:
Continuous Variables:- In case of continuous variables, we need to understand
the central tendency and spread of the variable. These are measured using
various statistical metrics visualization methods as shown below:

Note: Univariate analysis is also used to highlight missing and outlier values. In
the upcoming part of this series, we will look at methods to handle missing and
outlier values. To know more about these methods, you can refer
course descriptive statistics from Udacity.

Categorical Variables:- For categorical variables, we’ll use frequency table to

understand distribution of each category. We can also read as percentage of
values under each category. It can be be measured using two
metrics, Count and Count% against each category. Bar chart can be used as
visualization.

Bi-variate Analysis

Bi-variate Analysis finds out the relationship between two variables. Here, we
look for association and disassociation between variables at a pre-defined
significance level. We can perform bi-variate analysis for any combination of
categorical and continuous variables. The combination can be: Categorical &
Categorical, Categorical & Continuous and Continuous & Continuous. Different
methods are used to tackle these combinations during analysis process.

Let’s understand the possible combinations in detail:

Continuous & Continuous: While doing bi-variate analysis between two

continuous variables, we should look at scatter plot. It is a nifty way to find out
the relationship between two variables. The pattern of scatter plot indicates the
relationship between variables. The relationship can be linear or non-linear.

Scatter plot shows the relationship

between two variable but does not indicates the strength of relationship amongst
them. To find the strength of the relationship, we use Correlation. Correlation
varies between -1 and +1.

• -1: perfect negative linear correlation

• +1:perfect positive linear correlation and

• 0: No correlation

Correlation can be derived using following formula:

Correlation = Covariance(X,Y) / SQRT( Var(X)* Var(Y))

Various tools have function or functionality to identify correlation between

variables. In Excel, function CORREL() is used to return the correlation
between two variables and SAS uses procedure PROC CORR to identify the
correlation. These function returns Pearson Correlation value to identify the
relationship between two variables:

In above example, we have good positive relationship(0.65) between two

variables X and Y.

Categorical & Categorical: To find the relationship between two categorical

variables, we can use following methods:

• Two-way table: We can start analyzing the relationship by creating a two-way

table of count and count%. The rows represents the category of one variable and
the columns represent the categories of the other variable. We show count or
count% of observations available in each combination of row and column
categories.
• Stacked Column Chart: This method is more of a visual form of Two-way
table.
• Chi-Square Test: This test is used to derive the statistical significance of
relationship between the variables. Also, it tests whether the evidence in the
sample is strong enough to generalize that the relationship for a larger
population as well. Chi-square is based on the difference between the expected
and observed frequencies in one or more categories in the two-way table. It
returns probability for the computed chi-square distribution with the degree of
freedom.

Probability of 0: It indicates that both categorical variable are dependent

Probability of 1: It shows that both variables are independent.

Probability less than 0.05: It indicates that the relationship between the variables
is significant at 95% confidence. The chi-square test statistic for a test of
independence of two categorical variables is found by:

where O represents the observed frequency. E is the expected

frequency under the null hypothesis and computed by:

From previous two-way table, the expected count for product category 1 to be
of small size is 0.22. It is derived by taking the row total for Size (9) times the
column total for Product category (2) then dividing by the sample size (81).
This is procedure is conducted for each cell. Statistical Measures used to
analyze the power of relationship are:
• Cramer’s V for Nominal Categorical Variable
• Mantel-Haenszed Chi-Square for ordinal categorical variable.

Different data science language and tools have specific methods to perform chi-
square test. In SAS, we can use Chisq as an option with Proc freq to perform
this test.

Categorical & Continuous: While exploring relation between categorical and

continuous variables, we can draw box plots for each level of categorical
variables. If levels are small in number, it will not show the statistical
significance. To look at the statistical significance we can perform Z-test, T-test
or ANOVA.

• Z-Test/ T-Test:- Either test assess whether mean of two groups are statistically

different from each other or not. If the probability of Z is

small then the difference of two averages is more significant. The T-test is very
similar to Z-test but it is used when number of observation for both categories is
less than 30.
• ANOVA:- It assesses whether the average of more than two groups is
statistically different.

Example: Suppose, we want to test the effect of five different exercises. For
this, we recruit 20 men and assign one type of exercise to 4 men (5 groups).
Their weights are recorded after a few weeks. We need to find out whether the
effect of these exercises on them is significantly different or not. This can be
done by comparing the weights of the 5 groups of 4 men each.

Till here, we have understood the first three stages of Data Exploration,
Variable Identification, Uni-Variate and Bi-Variate analysis. We also looked at
various statistical and visual methods to identify the relationship between
variables.

Now, we will look at the methods of Missing values Treatment. More

importantly, we will also look at why missing values occur in our data and why
treating them is necessary.

2. Missing Value Treatment

Why missing values treatment is required?

Missing data in the training data set can reduce the power / fit of a model or can
lead to a biased model because we have not analysed the behavior and
relationship with other variables correctly. It can lead to wrong prediction or
classification.
Notice the missing values in the image shown above: In the left scenario, we
have not treated missing values. The inference from this data set is that the
chances of playing cricket by males is higher than females. On the other hand, if
you look at the second table, which shows data after treatment of missing values
(based on gender), we can see that females have higher chances of playing
cricket compared to males.

Why my data has missing values?

We looked at the importance of treatment of missing values in a dataset. Now,

let’s identify the reasons for occurrence of these missing values. They may
occur at two stages:

1. Data Extraction: It is possible that there are problems with extraction

process. In such cases, we should double-check for correct data with data
guardians. Some hashing procedures can also be used to make sure data
extraction is correct. Errors at data extraction stage are typically easy to
find and can be corrected easily as well.
2. Data collection: These errors occur at time of data collection and are
harder to correct. They can be categorized in four types:
o Missing completely at random: This is a case when the probability of
missing variable is same for all observations. For example: respondents
of data collection process decide that they will declare their earning after
tossing a fair coin. If an head occurs, respondent declares his / her
earnings & vice versa. Here each observation has equal chance of missing
value.
o Missing at random: This is a case when variable is missing at random
and missing ratio varies for different values / level of other input
variables. For example: We are collecting data for age and female has
higher missing value compare to male.
o Missing that depends on unobserved predictors: This is a case when
the missing values are not random and are related to the unobserved input
variable. For example: In a medical study, if a particular diagnostic
causes discomfort, then there is higher chance of drop out from the study.
This missing value is not at random unless we have included
“discomfort” as an input variable for all patients.
o Missing that depends on the missing value itself: This is a case when
the probability of missing value is directly correlated with missing value
itself. For example: People with higher or lower income are likely to
provide non-response to their earning.

Which are the methods to treat missing values ?

1. Deletion: It is of two types: List Wise Deletion and Pair Wise Deletion.
o In list wise deletion, we delete observations where any of the variable is
missing. Simplicity is one of the major advantage of this method, but this
method reduces the power of model because it reduces the sample size.
o In pair wise deletion, we perform analysis with all cases in which the
variables of interest are present. Advantage of this method is, it keeps as
many cases available for analysis. One of the disadvantage of this
method, it uses different sample size for different variables.

o Deletion methods are used when the nature of missing data is “Missing
completely at random” else non random missing values can bias the
model output.
2. Mean/ Mode/ Median Imputation: Imputation is a method to fill in the
missing values with estimated ones. The objective is to employ known
relationships that can be identified in the valid values of the data set to
assist in estimating the missing values. Mean / Mode / Median imputation
is one of the most frequently used methods. It consists of replacing the
missing data for a given attribute by the mean or median (quantitative
attribute) or mode (qualitative attribute) of all known values of that
variable. It can be of two types:-
o Generalized Imputation: In this case, we calculate the mean or median
for all non missing values of that variable then replace missing value with
mean or median. Like in above table, variable “Manpower” is missing so
we take average of all non missing values of “Manpower” (28.33) and
then replace missing value with it.
o Similar case Imputation: In this case, we calculate average for gender
“Male” (29.75) and “Female” (25) individually of non missing values
then replace the missing value based on gender. For “Male“, we will
replace missing values of manpower with 29.75 and for “Female” with
25.
3. Prediction Model: Prediction model is one of the sophisticated
method for handling missing data. Here, we create a predictive model to
estimate values that will substitute the missing data. In this case, we
divide our data set into two sets: One set with no missing values for the
variable and another one with missing values. First data set become
training data set of the model while second data set with missing values is
test data set and variable with missing values is treated as target variable.
Next, we create a model to predict target variable based on other
attributes of the training data set and populate missing values of test data
set.We can use regression, ANOVA, Logistic regression and various
modeling technique to perform this. There are 2 drawbacks for this
approach:
o The model estimated values are usually more well-behaved than
the true values
o If there are no relationships with attributes in the data set and the
attribute with missing values, then the model will not be precise for
estimating missing values.
4. KNN Imputation: In this method of imputation, the missing values of
an attribute are imputed using the given number of attributes that are most
similar to the attribute whose values are missing. The similarity of two
attributes is determined using a distance function. It is also known to have
certain advantage & disadvantages.
o Advantages:
▪ k-nearest neighbour can predict both qualitative & quantitative attributes
▪ Creation of predictive model for each attribute with missing data is not
required
▪ Attributes with multiple missing values can be easily treated
▪ Correlation structure of the data is taken into consideration
o Disadvantage:
▪ KNN algorithm is very time-consuming in analyzing large database. It
searches through all the dataset looking for the most similar instances.
▪ Choice of k-value is very critical. Higher value of k would include
attributes which are significantly different from what we need whereas
lower value of k implies missing out of significant attributes.

After dealing with missing values, the next task is to deal with outliers. Often,
we tend to neglect outliers while building models. This is a discouraging
practice. Outliers tend to make your data skewed and reduces accuracy. Let’s
learn more about outlier treatment.

3. Techniques of Outlier Detection and Treatment

What is an Outlier?

Outlier is a commonly used terminology by analysts and data scientists as it

needs close attention else it can result in wildly wrong estimations. Simply
speaking, Outlier is an observation that appears far away and diverges from an
overall pattern in a sample.

Let’s take an example, we do customer profiling and find out that the average
annual income of customers is $0.8 million. But, there are two customers
having annual income of $4 and $4.2 million. These two customers annual
income is much higher than rest of the population. These two observations will
be seen as Outliers.
What are the types of Outliers?

Outlier can be of two types: Univariate and Multivariate. Above, we have

discussed the example of univariate outlier. These outliers can be found when
we look at distribution of a single variable. Multi-variate outliers are outliers in
an n-dimensional space. In order to find them, you have to look at distributions
in multi-dimensions.

Let us understand this with an example. Let us say we are understanding the
relationship between height and weight. Below, we have univariate and
bivariate distribution for Height, Weight. Take a look at the box plot. We do not
have any outlier (above and below 1.5*IQR, most common method). Now look
at the scatter plot. Here, we have two values below and one above the average
in a specific segment of weight and height.
What causes Outliers?

Whenever we come across outliers, the ideal way to tackle them is to find out
the reason of having these outliers. The method to deal with them would then
depend on the reason of their occurrence. Causes of outliers can be classified in
two broad categories:

1. Artificial (Error) / Non-natural

2. Natural.

Let’s understand various types of outliers in more detail:

• Data Entry Errors:- Human errors such as errors caused during data
collection, recording, or entry can cause outliers in data. For example: Annual
income of a customer is $100,000. Accidentally, the data entry operator puts an
additional zero in the figure. Now the income becomes $1,000,000 which is 10
times higher. Evidently, this will be the outlier value when compared with rest
of the population.
• Measurement Error: It is the most common source of outliers. This is caused
when the measurement instrument used turns out to be faulty. For example:
There are 10 weighing machines. 9 of them are correct, 1 is faulty. Weight
measured by people on the faulty machine will be higher / lower than the rest of
people in the group. The weights measured on faulty machine can lead to
outliers.
• Experimental Error: Another cause of outliers is experimental error. For
example: In a 100m sprint of 7 runners, one runner missed out on concentrating
on the ‘Go’ call which caused him to start late. Hence, this caused the runner’s
run time to be more than other runners. His total run time can be an outlier.
• Intentional Outlier: This is commonly found in self-reported measures that
involves sensitive data. For example: Teens would typically under report the
amount of alcohol that they consume. Only a fraction of them would report
actual value. Here actual values might look like outliers because rest of the
teens are under reporting the consumption.
• Data Processing Error: Whenever we perform data mining, we extract data
from multiple sources. It is possible that some manipulation or extraction errors
may lead to outliers in the dataset.
• Sampling error: For instance, we have to measure the height of athletes. By
mistake, we include a few basketball players in the sample. This inclusion is
likely to cause outliers in the dataset.
• Natural Outlier: When an outlier is not artificial (due to error), it is a natural
outlier. For instance: In my last assignment with one of the renowned insurance
company, I noticed that the performance of top 50 financial advisors was far
higher than rest of the population. Surprisingly, it was not due to any error.
Hence, whenever we perform any data mining activity with advisors, we used to
treat this segment separately.

What is the impact of Outliers on a dataset?

Outliers can drastically change the results of the data analysis and statistical
modeling. There are numerous unfavourable impacts of outliers in the data set:

• It increases the error variance and reduces the power of statistical tests
• If the outliers are non-randomly distributed, they can decrease normality
• They can bias or influence estimates that may be of substantive interest
• They can also impact the basic assumption of Regression, ANOVA and other
statistical model assumptions.

To understand the impact deeply, let’s take an example to check what happens
to a data set with and without outliers in the data set.

Example:
As you can see, data set with outliers has significantly different mean and
standard deviation. In the first scenario, we will say that average is 5.45. But
with the outlier, average soars to 30. This would change the estimate
completely.

How to detect Outliers?

Most commonly used method to detect outliers is visualization. We use various

visualization methods, like Box-plot, Histogram, Scatter Plot (above, we have
used box plot and scatter plot for visualization). Some analysts also various
thumb rules to detect outliers. Some of them are:

• Any value, which is beyond the range of -1.5 x IQR to 1.5 x IQR
• Use capping methods. Any value which out of range of 5th and 95th percentile
can be considered as outlier
• Data points, three or more standard deviation away from mean are considered
outlier
• Outlier detection is merely a special case of the examination of data for
influential data points and it also depends on the business understanding
• Bivariate and multivariate outliers are typically measured using either an index
of influence or leverage, or distance. Popular indices such as Mahalanobis’
distance and Cook’s D are frequently used to detect outliers.
• In SAS, we can use PROC Univariate, PROC SGPLOT. To identify outliers and
influential observation, we also look at statistical measure like STUDENT,
COOKD, RSTUDENT and others.

How to remove Outliers?

Most of the ways to deal with outliers are similar to the methods of missing
values like deleting observations, transforming them, binning them, treat them
as a separate group, imputing values and other statistical methods. Here, we
will discuss the common techniques used to deal with outliers:

Deleting observations: We delete outlier values if it is due to data entry error,

data processing error or outlier observations are very small in numbers. We can
also use trimming at both ends to remove outliers.

Transforming and binning values: Transforming variables can also eliminate

outliers. Natural log of a value reduces the variation caused by extreme values.
Binning is also a form of variable transformation. Decision Tree algorithm
allows to deal with outliers well due to binning of variable. We can also use the
process of assigning weights to different observations.

Imputing: Like imputation of missing values, we can also impute outliers. We

can use mean, median, mode imputation methods. Before imputing values, we
should analyse if it is natural outlier or artificial. If it is artificial, we can go with
imputing values. We can also use statistical model to predict values of outlier
observation and after that we can impute it with predicted values.

Treat separately: If there are significant number of outliers, we should treat

them separately in the statistical model. One of the approach is to treat both
groups as two different groups and build individual model for both groups and
then combine the output.

Till here, we have learnt about steps of data exploration, missing value
treatment and techniques of outlier detection and treatment. These 3 stages will
make your raw data better in terms of information availability and accuracy.
Let’s now proceed to the final stage of data exploration. It is Feature
Engineering.

4. The Art of Feature Engineering

What is Feature Engineering?

Feature engineering is the science (and art) of extracting more information from
existing data. You are not adding any new data here, but you are actually
making the data you already have more useful.

For example, let’s say you are trying to predict foot fall in a shopping mall
based on dates. If you try and use the dates directly, you may not be able to
extract meaningful insights from the data. This is because the foot fall is less
affected by the day of the month than it is by the day of the week. Now this
information about day of week is implicit in your data. You need to bring it out
to make your model better.

This exercising of bringing out information from data in known as feature

engineering.
What is the process of Feature Engineering ?

You perform feature engineering once you have completed the first 5 steps in
data exploration – Variable Identification, Univariate, Bivariate
Analysis, Missing Values Imputation and Outliers Treatment. Feature
engineering itself can be divided in 2 steps:

• Variable transformation.
• Variable / Feature creation.

These two techniques are vital in data exploration and have a remarkable impact
on the power of prediction. Let’s understand each of this step in more details.

What is Variable Transformation?

In data modelling, transformation refers to the replacement of a variable by a

function. For instance, replacing a variable x by the square / cube root or
logarithm x is a transformation. In other words, transformation is a process that
changes the distribution or relationship of a variable with others.

Let’s look at the situations when variable transformation is useful.

When should we use Variable Transformation?

Below are the situations where variable transformation is a requisite:

• When we want to change the scale of a variable or standardize the values of a

variable for better understanding. While this transformation is a must if you
have data in different scales, this transformation does not change the shape of
the variable distribution
• When we can transform complex non-linear relationships into linear
relationships. Existence of a linear relationship between variables is easier to
comprehend compared to a non-linear or curved relation. Transformation helps
us to convert a non-linear relation into linear relation. Scatter plot can be used to
find the relationship between two continuous variables. These transformations
also improve the prediction. Log transformation is one of the commonly used
transformation technique used in these situations.

• Symmetric
distribution is preferred over skewed distribution as it is easier to interpret
and generate inferences. Some modeling techniques requires normal distribution
of variables. So, whenever we have a skewed distribution, we can
use transformations which reduce skewness. For right skewed distribution, we
take square / cube root or logarithm of variable and for left skewed, we take
square / cube or exponential of variables.

• Variable Transformation is also done from an implementation point of

view (Human involvement). Let’s understand it more clearly. In one of my
project on employee performance, I found that age has direct correlation with
performance of the employee i.e. higher the age, better the performance. From
an implementation stand point, launching age based progamme might present
implementation challenge. However, categorizing the sales agents in three age
group buckets of <30 years, 30-45 years and >45 and then formulating three
different strategies for each group is a judicious approach. This categorization
technique is known as Binning of Variables.

What are the common methods of Variable Transformation?

There are various methods used to transform variables. As discussed, some of
them include square root, cube root, logarithmic, binning, reciprocal and many
others. Let’s look at these methods in detail by highlighting the pros and cons of
these transformation methods.

• Logarithm: Log of a variable is a common transformation method used to

change the shape of distribution of the variable on a distribution plot. It is
generally used for reducing right skewness of variables. Though, It can’t be
applied to zero or negative values as well.

• Square / Cube root: The square and cube root of a variable has a sound effect
on variable distribution. However, it is not as significant as logarithmic
transformation. Cube root has its own advantage. It can be applied to negative
values including zero. Square root can be applied to positive values including
zero.

• Binning: It is used to categorize variables. It is performed on original values,

percentile or frequency. Decision of categorization technique is based on
business understanding. For example, we can categorize income in three
categories, namely: High, Average and Low. We can also perform co-variate
binning which depends on the value of more than one variables.

What is Feature / Variable Creation & its Benefits?

Feature / Variable creation is a process to generate a new variables / features

based on existing variable(s). For example, say, we have date(dd-mm-yy) as an
input variable in a data set. We can generate new variables like day, month,
year, week, weekday that may have better relationship with target variable. This
step is used to highlight the hidden relationship in a variable:
There are various techniques to create new features. Let’s look at the some of
the commonly used methods:

• Creating derived variables: This refers to creating new variables from existing
variable(s) using set of functions or different methods. Let’s look at it through
“Titanic – Kaggle competition”. In this data set, variable age has missing
values. To predict missing values, we used the salutation (Master, Mr, Miss,
Mrs) of name as a new variable. How do we decide which variable to create?
Honestly, this depends on business understanding of the analyst, his curiosity
and the set of hypothesis he might have about the problem. Methods such as
taking log of variables, binning variables and other methods of variable
transformation can also be used to create new variables.

• Creating dummy variables: One of the most common application of dummy

variable is to convert categorical variable into numerical variables. Dummy
variables are also called Indicator Variables. It is useful to take categorical
variable as a predictor in statistical models. Categorical variable can take
values 0 and 1. Let’s take a variable ‘gender’. We can produce two variables,
namely, “Var_Male” with values 1 (Male) and 0 (No male) and “Var_Female”
with values 1 (Female) and 0 (No Female). We can also create dummy variables
for more than two classes of a categorical variables with n or n-1 dummy
variables.

For further read, here is a list of transformation / creation ideas which can be
applied to your data.

Data Science Presentation
100% (3)
Data Science Presentation
113 pages
Data Processing and Analysis of Data
100% (1)
Data Processing and Analysis of Data
43 pages
Amit Khilare Used Device Data PM Project
No ratings yet
Amit Khilare Used Device Data PM Project
25 pages
Research Analytics
25% (4)
Research Analytics
2 pages
Karina e Zindagi - Hindi 1
50% (2)
Karina e Zindagi - Hindi 1
229 pages
Using Multivariate Statistics: Barbara G. Tabachnick
100% (1)
Using Multivariate Statistics: Barbara G. Tabachnick
22 pages
A Comprehensive Guide To Data Exploration
100% (2)
A Comprehensive Guide To Data Exploration
18 pages
Practical - 1 - Data Exploration and Data Preparation - DAL - Lab
100% (1)
Practical - 1 - Data Exploration and Data Preparation - DAL - Lab
8 pages
Data Exploration & Visualization
No ratings yet
Data Exploration & Visualization
23 pages
A Comprehensive Guide To Data Exploration: Steps of Data Exploration and Preparation Missing Value Treatment
100% (2)
A Comprehensive Guide To Data Exploration: Steps of Data Exploration and Preparation Missing Value Treatment
8 pages
Exploratory Data Analysis
No ratings yet
Exploratory Data Analysis
45 pages
Data Mining Technical
No ratings yet
Data Mining Technical
45 pages
DSE 3 Unit 4
No ratings yet
DSE 3 Unit 4
8 pages
Guide Data Exploration
No ratings yet
Guide Data Exploration
16 pages
Everest Group Intelligent Document Processing and Unstructured Document Processing Products PEAK Matrix Assessment 2023
No ratings yet
Everest Group Intelligent Document Processing and Unstructured Document Processing Products PEAK Matrix Assessment 2023
15 pages
Why Exploratory Data Analysis Is Important
No ratings yet
Why Exploratory Data Analysis Is Important
2 pages
L18&19 Data Exploration
No ratings yet
L18&19 Data Exploration
50 pages
Unit 4 Notes
No ratings yet
Unit 4 Notes
9 pages
SPSS Outputs Summary
No ratings yet
SPSS Outputs Summary
14 pages
Module 3 Data Preparation
No ratings yet
Module 3 Data Preparation
33 pages
Data Analysis Guide
No ratings yet
Data Analysis Guide
4 pages
Big Data Chapter 3
No ratings yet
Big Data Chapter 3
29 pages
Exploratory Data Analysis
No ratings yet
Exploratory Data Analysis
3 pages
BRM CS
No ratings yet
BRM CS
4 pages
CLC - Data Cleansing and Data Summary
No ratings yet
CLC - Data Cleansing and Data Summary
17 pages
Em (601) Report# 9
No ratings yet
Em (601) Report# 9
6 pages
570 AS02 NguyenManhQuan GBS210099 GBS1004B
No ratings yet
570 AS02 NguyenManhQuan GBS210099 GBS1004B
34 pages
Module 01 Introduction To Business Statistics
No ratings yet
Module 01 Introduction To Business Statistics
16 pages
A Guide To Data Exploration
No ratings yet
A Guide To Data Exploration
20 pages
Mvda - Question Bank
No ratings yet
Mvda - Question Bank
14 pages
Presentation1HOD SIR-1
No ratings yet
Presentation1HOD SIR-1
13 pages
12-Exploratory Data Analysis, Anomaly Detection-28!03!2023
No ratings yet
12-Exploratory Data Analysis, Anomaly Detection-28!03!2023
79 pages
Unit 4
No ratings yet
Unit 4
25 pages
Exploratory Data Analysis - v3 - Part1
No ratings yet
Exploratory Data Analysis - v3 - Part1
36 pages
Data Analysis - Selecting A Test
No ratings yet
Data Analysis - Selecting A Test
5 pages
Multivariant Data.
No ratings yet
Multivariant Data.
36 pages
Data Cleaning
No ratings yet
Data Cleaning
39 pages
Probability and Stat Unit 1
No ratings yet
Probability and Stat Unit 1
12 pages
EDA Feature Eng - Estimation Inference and Hypothesis
No ratings yet
EDA Feature Eng - Estimation Inference and Hypothesis
53 pages
Qunatitative Analysis
No ratings yet
Qunatitative Analysis
30 pages
Dva 2
No ratings yet
Dva 2
13 pages
BRM Presentation Group 5 - Univariate & Bivariate Analysis
No ratings yet
BRM Presentation Group 5 - Univariate & Bivariate Analysis
26 pages
DataAnalytics (Unit 2)
No ratings yet
DataAnalytics (Unit 2)
131 pages
Unit 1
No ratings yet
Unit 1
24 pages
Regression
No ratings yet
Regression
86 pages
STAT22209 - Chapter 01-Correlation Analyisis - 2022
No ratings yet
STAT22209 - Chapter 01-Correlation Analyisis - 2022
53 pages
Unit 4
No ratings yet
Unit 4
21 pages
Not 1
No ratings yet
Not 1
8 pages
CH 5
No ratings yet
CH 5
26 pages
Data Analysis Notes
No ratings yet
Data Analysis Notes
67 pages
Module Ii
No ratings yet
Module Ii
31 pages
Marketing Engineering and Analytics
No ratings yet
Marketing Engineering and Analytics
52 pages
Business Club: Basic Statistics
No ratings yet
Business Club: Basic Statistics
26 pages
Wisdom and StatisticsTecq-Amitava
No ratings yet
Wisdom and StatisticsTecq-Amitava
18 pages
M6 Predictive Analytics Presentation
No ratings yet
M6 Predictive Analytics Presentation
49 pages
IOT-Domain Analyst
No ratings yet
IOT-Domain Analyst
68 pages
6 Data Analysis
No ratings yet
6 Data Analysis
24 pages
Unit - 4 - Quantitative Methods
No ratings yet
Unit - 4 - Quantitative Methods
8 pages
Business Research Methods Unit 4
No ratings yet
Business Research Methods Unit 4
25 pages
A Project Report: in Partial Fulfillment For The Award of The Degree
No ratings yet
A Project Report: in Partial Fulfillment For The Award of The Degree
50 pages
Missing Value Treatment
No ratings yet
Missing Value Treatment
22 pages
GSM + Accessories Price List: Table of Content
100% (1)
GSM + Accessories Price List: Table of Content
27 pages
EarthExplorer Helptutorial
100% (1)
EarthExplorer Helptutorial
67 pages
Scandinavian
No ratings yet
Scandinavian
2 pages
1-19#-LS-909-SIZE ANALYSIS MACHINE-user Manual PDF
No ratings yet
1-19#-LS-909-SIZE ANALYSIS MACHINE-user Manual PDF
101 pages
Virtual Private Network (VPN) : Ipsec
No ratings yet
Virtual Private Network (VPN) : Ipsec
4 pages
Dell LCD Monitor 1503FP Service Manual PDF
No ratings yet
Dell LCD Monitor 1503FP Service Manual PDF
8 pages
Trung PN Wdu202c GD 1301 Question Bank
No ratings yet
Trung PN Wdu202c GD 1301 Question Bank
24 pages
Bio 11 Syllabus
No ratings yet
Bio 11 Syllabus
4 pages
PWM Signal Generator ESR 1.2
No ratings yet
PWM Signal Generator ESR 1.2
4 pages
MCA NewProgrammeGuideCRCFinalJanuary2022
No ratings yet
MCA NewProgrammeGuideCRCFinalJanuary2022
81 pages
Midterm Solution
No ratings yet
Midterm Solution
8 pages
Project Report On Golden Bricks
No ratings yet
Project Report On Golden Bricks
29 pages
Notes On C Prog
No ratings yet
Notes On C Prog
28 pages
Supplier
No ratings yet
Supplier
117 pages
3
No ratings yet
3
45 pages
User Manual For Amazfit Band 5
No ratings yet
User Manual For Amazfit Band 5
25 pages
214 Implementing Dark Mode On Ios
No ratings yet
214 Implementing Dark Mode On Ios
142 pages
Unit1 and Unit2
No ratings yet
Unit1 and Unit2
85 pages
Publisher Lab Instructions
No ratings yet
Publisher Lab Instructions
3 pages
Share Whitepaper 7
No ratings yet
Share Whitepaper 7
14 pages
10-04-2023 User Pppoe DHCP
No ratings yet
10-04-2023 User Pppoe DHCP
48 pages
Factors For Success in Customer Relationship Management (CRM) Systems
No ratings yet
Factors For Success in Customer Relationship Management (CRM) Systems
27 pages
TX Ty TZ Ty TZ TX TZ TX Ty Matlab Code:: % Symbolizing The Variables
No ratings yet
TX Ty TZ Ty TZ TX TZ TX Ty Matlab Code:: % Symbolizing The Variables
3 pages
Implementation of An E-Commerce System For The Automation and Improvement of Commercial Management at A Business Level
No ratings yet
Implementation of An E-Commerce System For The Automation and Improvement of Commercial Management at A Business Level
7 pages
Dev Ops 2
No ratings yet
Dev Ops 2
1 page
BrightEye 5 - Analog Composite TBC and Frame Sync BE5
No ratings yet
BrightEye 5 - Analog Composite TBC and Frame Sync BE5
2 pages
Od 427305939568238100
No ratings yet
Od 427305939568238100
1 page
Understanding Analysis: Foundations and Applications
From Everand
Understanding Analysis: Foundations and Applications
Tanmay Shroff
No ratings yet
Multivariate Analysis – The Simplest Guide in the Universe: Bite-Size Stats, #6
From Everand
Multivariate Analysis – The Simplest Guide in the Universe: Bite-Size Stats, #6
Lee Baker
No ratings yet

Data Exploration

Uploaded by

Data Exploration

Uploaded by

Connect, Message, Like,Follow&Share, 100% Free Counselling➔ThankYou

1. Steps of Data Exploration and Preparation

Let’s now study each stage in detail:-

Below, the variables have been defined in different category:

At this stage, we explore variables one by one. Method to perform uni-variate

Categorical Variables:- For categorical variables, we’ll use frequency table to

Let’s understand the possible combinations in detail:

Continuous & Continuous: While doing bi-variate analysis between two

Scatter plot shows the relationship

• -1: perfect negative linear correlation

Correlation can be derived using following formula:

Correlation = Covariance(X,Y) / SQRT( Var(X)* Var(Y))

Various tools have function or functionality to identify correlation between

In above example, we have good positive relationship(0.65) between two

Categorical & Categorical: To find the relationship between two categorical

• Two-way table: We can start analyzing the relationship by creating a two-way

Probability of 0: It indicates that both categorical variable are dependent

Probability of 1: It shows that both variables are independent.

where O represents the observed frequency. E is the expected

Categorical & Continuous: While exploring relation between categorical and

different from each other or not. If the probability of Z is

Now, we will look at the methods of Missing values Treatment. More

2. Missing Value Treatment

Why missing values treatment is required?

Why my data has missing values?

We looked at the importance of treatment of missing values in a dataset. Now,

1. Data Extraction: It is possible that there are problems with extraction

Which are the methods to treat missing values ?

3. Techniques of Outlier Detection and Treatment

Outlier is a commonly used terminology by analysts and data scientists as it

Outlier can be of two types: Univariate and Multivariate. Above, we have

1. Artificial (Error) / Non-natural

Let’s understand various types of outliers in more detail:

What is the impact of Outliers on a dataset?

How to detect Outliers?

Most commonly used method to detect outliers is visualization. We use various

How to remove Outliers?

Deleting observations: We delete outlier values if it is due to data entry error,

Transforming and binning values: Transforming variables can also eliminate

Imputing: Like imputation of missing values, we can also impute outliers. We

Treat separately: If there are significant number of outliers, we should treat

4. The Art of Feature Engineering

What is Feature Engineering?

This exercising of bringing out information from data in known as feature

What is Variable Transformation?

In data modelling, transformation refers to the replacement of a variable by a

Let’s look at the situations when variable transformation is useful.

When should we use Variable Transformation?

Below are the situations where variable transformation is a requisite:

• When we want to change the scale of a variable or standardize the values of a

• Variable Transformation is also done from an implementation point of

What are the common methods of Variable Transformation?

• Logarithm: Log of a variable is a common transformation method used to

• Binning: It is used to categorize variables. It is performed on original values,

What is Feature / Variable Creation & its Benefits?

Feature / Variable creation is a process to generate a new variables / features

• Creating dummy variables: One of the most common application of dummy

You might also like