Data Exploration & Visualization

Exploratory data analysis is an important part of the data analytics process. It involves steps like data preparation, missing value treatment, outlier detection, variable transformation, and feature engineering. Visualization is key to exploratory data analysis, with techniques like univariate analysis using histograms and box plots to understand individual variables, and bivariate analysis using scatter plots, correlation, and chi-square tests to explore relationships between variables. The goal of exploratory data analysis is to understand the data, identify patterns and insights, and prepare the data for building predictive models.


Data Exploration &

Visualization
Data Science & Analytics
Introduction
 Exploratory data analysis is a concept developed by John Tukey (1977) that
offers a new perspective on statistics. Tukey's idea was that in traditional
statistics, the data was not being explored graphically; it was just being used to
test hypotheses. The first attempt to develop a tool was made at Stanford, in a
project called PRIM-9. The tool was able to visualize data in nine dimensions,
and could therefore provide a multivariate perspective of the data.

 These days, exploratory data analysis is a must and has been included in the
big data analytics life cycle. The ability to find insight and communicate it
effectively in an organization is fueled by strong EDA capabilities.

 Based on Tukey's ideas, Bell Labs developed the S programming language in
order to provide an interactive interface for doing statistics. The idea of S was to
provide extensive graphical capabilities with an easy-to-use language. In
today's world, in the context of Big Data, R, which is based on the S programming
language, is among the most popular software for analytics.
Table of Contents
 Steps of Data Exploration and Preparation
 Missing Value Treatment
 Why missing value treatment is required?
 Why data has missing values?
 Which are the methods to treat missing values?
 Techniques of Outlier Detection and Treatment
 What is an outlier?
 What are the types of outliers?
 What are the causes of outliers?
 What is the impact of outliers on a dataset?
 How to detect outliers?
 How to remove outliers?
 The Art of Feature Engineering
 What is Feature Engineering?
 What is the process of Feature Engineering?
 What is Variable Transformation?
 When should we use variable transformation?
 What are the common methods of variable transformation?
 What is feature variable creation and its benefits?
Steps for Data Exploration &
Visualization
 Remember: the quality of your inputs decides the quality of your output. So, once
you have your business hypothesis ready, it makes sense to spend a lot of time
and effort here. By my personal estimate, data exploration, cleaning and
preparation can take up to 70% of your total project time.
 Below are the steps involved to understand, clean and prepare your data for
building your predictive model:
1. Variable Identification
2. Univariate Analysis
3. Bi-variate Analysis
4. Missing values treatment
5. Outlier treatment
6. Variable transformation
7. Variable creation
 Finally, we will need to iterate over steps 4 – 7 multiple times before we come up
with our refined model.
 Let's now study each stage in detail:
Steps for Data Exploration &
Visualization (contd)
Variable Identification

 First, identify Predictor (Input) and Target (output) variables. Next, identify
the data type and category of the variables.
 Let’s understand this step more clearly by taking an example.
 Example:- Suppose we want to predict whether the students will play
cricket or not (refer to the data set below). Here you need to identify the predictor
variables, the target variable, the data type of the variables and the category of the variables.
Steps for Data Exploration &
Visualization (contd)
 Below, the variables have been classified into different categories:
Steps for Data Exploration &
Visualization (contd)
Univariate Analysis
 At this stage, we explore variables one by one. The method used to perform
univariate analysis depends on whether the variable type is categorical or
continuous. Let's look at these methods and statistical measures for
categorical and continuous variables individually:

 Continuous Variables:- In the case of continuous variables, we need to
understand the central tendency and spread of the variable. These are
measured using various statistical metrics and visualization methods, as shown
below:
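As a concrete illustration, the central-tendency and spread measures above can be computed directly with pandas; the data below is hypothetical, not the slide's dataset:

```python
import pandas as pd

# Hypothetical ages of students (illustrative data only)
ages = pd.Series([14, 15, 15, 16, 16, 16, 17, 18, 25])

# Central tendency
print(ages.mean())     # arithmetic mean
print(ages.median())   # middle value, robust to the outlier 25
print(ages.mode()[0])  # most frequent value

# Spread / dispersion
print(ages.std())                                  # standard deviation
print(ages.quantile(0.75) - ages.quantile(0.25))   # interquartile range
```

A histogram (`ages.plot.hist()`) or box plot (`ages.plot.box()`) would visualize the same distribution.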
Steps for Data Exploration &
Visualization (contd)
Univariate Analysis

 Note: Univariate analysis is also used to highlight missing and outlier values.
In the upcoming part of this series, we will look at methods to handle
missing and outlier values. To know more about these methods, you can
refer to the Descriptive Statistics course from Udacity.

 Categorical Variables:- For categorical variables, we use a frequency table
to understand the distribution of each category. We can also read it as the
percentage of values under each category. It can be measured using
two metrics, Count and Count%, against each category. A bar chart can be
used as the visualization.
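A frequency table with Count and Count% can be built with pandas `value_counts`; the gender data here is made up for illustration:

```python
import pandas as pd

# Hypothetical "Gender" column (illustrative, not the slide's dataset)
gender = pd.Series(["Male", "Female", "Male", "Male", "Female"])

count = gender.value_counts()                           # Count per category
count_pct = gender.value_counts(normalize=True) * 100   # Count% per category

freq_table = pd.DataFrame({"Count": count, "Count%": count_pct})
print(freq_table)

# A bar chart visualizes the same table, e.g. count.plot.bar()
```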
Steps for Data Exploration &
Visualization (contd)
Bi-variate Analysis

 Bi-variate analysis finds the relationship between two variables. Here,
we look for association and disassociation between variables at a pre-
defined significance level. We can perform bi-variate analysis for any
combination of categorical and continuous variables. The combination
can be: Categorical & Categorical, Categorical & Continuous, or
Continuous & Continuous. Different methods are used to tackle these
combinations during the analysis process.

 Let’s understand the possible combinations in detail:

 Continuous & Continuous: While doing bi-variate analysis between two
continuous variables, we should look at a scatter plot. It is a nifty way to find
the relationship between two variables. The pattern of the scatter plot
indicates the relationship between the variables, which can be linear
or non-linear.
Steps for Data Exploration &
Visualization (contd)
Bi-variate Analysis: Continuous & Continuous
A scatter plot shows the relationship between two variables but does not indicate the strength of the relationship between them. To find the strength of the relationship, we use correlation. Correlation varies between -1 and +1.

• -1: perfect negative linear correlation
• +1: perfect positive linear correlation
• 0: no correlation

Correlation can be derived using the following formula:

Correlation = Covariance(X, Y) / SQRT(Var(X) * Var(Y))
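The formula can be checked numerically; the sketch below (with hypothetical x/y values) computes the correlation from covariance and variance and compares it to NumPy's built-in Pearson correlation:

```python
import numpy as np

# Hypothetical paired observations (illustrative data only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

# Correlation = Covariance(X, Y) / SQRT(Var(X) * Var(Y))
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))  # population covariance
r = cov_xy / np.sqrt(x.var() * y.var())            # np.var is population variance

# Same value as NumPy's built-in Pearson correlation
assert np.isclose(r, np.corrcoef(x, y)[0, 1])
print(r)
```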
Steps for Data Exploration &
Visualization (contd)
Bi-variate Analysis: Continuous & Continuous
Various tools have functions to identify the correlation between variables. In
Excel, the function CORREL() returns the correlation between two variables,
and SAS uses the procedure PROC CORR. These functions return the Pearson
correlation value to identify the relationship between two variables:

 In the above example, we have a good positive relationship (0.65) between the two
variables X and Y.
Steps for Data Exploration &
Visualization (contd)
Bi-variate Analysis: Categorical & Categorical

Categorical & Categorical: To find the relationship between two categorical
variables, we can use the following methods:
• Two-way table: We can start analyzing the relationship by creating a two-way
table of count and count%. The rows represent the categories of one variable and
the columns represent the categories of the other variable. We show the count or
count% of observations available in each combination of row and column
categories.
• Stacked Column Chart: This method is more of a visual form of the two-way table.
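Both the two-way table and its count% variant can be produced with pandas `crosstab`; the data below is hypothetical:

```python
import pandas as pd

# Hypothetical categorical data (illustrative only)
df = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female", "Male", "Female"],
    "Plays":  ["Yes",  "No",   "Yes",    "Yes",    "Yes",  "No"],
})

# Two-way table of counts
two_way = pd.crosstab(df["Gender"], df["Plays"])
print(two_way)

# Two-way table of count% (row-wise percentages)
two_way_pct = pd.crosstab(df["Gender"], df["Plays"], normalize="index") * 100
print(two_way_pct)

# A stacked column chart is the visual form: two_way.plot.bar(stacked=True)
```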
Steps for Data Exploration &
Visualization (contd)
Bi-variate Analysis: Categorical & Categorical
Chi-Square Test: This test is used to derive the statistical significance of the relationship between the variables.
It also tests whether the evidence in the sample is strong enough to generalize the relationship to a
larger population. Chi-square is based on the difference between the expected and observed
frequencies in one or more categories of the two-way table. It returns the probability for the computed chi-
square statistic with the given degrees of freedom.
Probability of 0: Indicates that both categorical variables are dependent.
Probability of 1: Shows that both variables are independent.
Probability less than 0.05: Indicates that the relationship between the variables is significant at 95%
confidence.
Steps for Data Exploration &
Visualization (contd)
Bi-variate Analysis: Categorical & Categorical

The chi-square test statistic for a test of independence of two categorical variables is
found by:

χ² = Σ (O − E)² / E

where O represents the observed frequency and E is the expected
frequency under the null hypothesis, computed by:

E = (row total × column total) / sample size

From the previous two-way table, the expected count for product category 1 to be of small size
is 0.22. It is derived by taking the row total for Size (9) times the column total for Product
category (2), then dividing by the sample size (81). This procedure is conducted for each cell.
Statistical measures used to analyze the strength of the relationship are:

 Cramer's V for nominal categorical variables
 Mantel-Haenszel chi-square for ordinal categorical variables.
Different data science languages and tools have specific methods to perform the chi-square test. In
SAS, we can use Chisq as an option with Proc freq to perform this test.
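Outside SAS, the same test can be run in Python with `scipy.stats.chi2_contingency`; the observed counts below are hypothetical:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 two-way table of observed counts (illustrative only)
observed = np.array([[30, 10],
                     [20, 40]])

chi2, p, dof, expected = chi2_contingency(observed)

# Expected count per cell = row total * column total / sample size,
# e.g. expected[0, 0] = 40 * 50 / 100 = 20
print(chi2, p, dof)
print(expected)

if p < 0.05:
    print("Relationship is significant at 95% confidence")
```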
Steps for Data Exploration &
Visualization (contd)
Bi-variate Analysis: Categorical & Continuous

 While exploring the relationship between categorical and continuous variables, we can
draw box plots for each level of the categorical variable. If the number of levels is small, the
plots will not show the statistical significance. To look at the statistical significance, we can
perform a Z-test, T-test or ANOVA.
 Z-Test / T-Test:- Either test assesses whether the means of two groups are statistically different
from each other or not.
Steps for Data Exploration &
Visualization (contd)
Bi-variate Analysis: Categorical & Continuous

 If the probability of Z is small, then the difference of the two averages is more significant.
The T-test is very similar to the Z-test, but it is used when the number of observations for both
categories is less than 30.
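A two-sample t-test can be run with `scipy.stats.ttest_ind`; the group scores below are hypothetical:

```python
from scipy.stats import ttest_ind

# Hypothetical continuous measurements for two categories (illustrative only)
group_a = [82, 85, 88, 90, 79, 84]
group_b = [70, 72, 68, 75, 71, 69]

# Two-sample t-test: are the two group means statistically different?
t_stat, p_value = ttest_ind(group_a, group_b)
print(t_stat, p_value)

if p_value < 0.05:
    print("Means differ significantly at 95% confidence")
```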
Missing Value Treatment

Why missing values treatment is required?

 Missing data in the training data set can reduce the power / fit of a model or can
lead to a biased model because we have not analysed the behavior and
relationship with other variables correctly. It can lead to wrong prediction or
classification.
Missing Value Treatment (contd)
Why missing values treatment is required?

Notice the missing values in the image shown above: in the left scenario, we have not treated the missing values.
The inference from that data set is that the chances of playing cricket are higher for males than for females. On the
other hand, if you look at the second table, which shows the data after treatment of missing values (based on gender),
we can see that females have a higher chance of playing cricket compared to males.
Missing Value Treatment (contd)
Why my data has missing values?

 We looked at the importance of treating missing values in a dataset. Now, let's identify the reasons for the
occurrence of these missing values. They may occur at two stages:
 Data Extraction: It is possible that there are problems with the extraction process. In such cases, we should
double-check for correct data with the data guardians. Some hashing procedures can also be used to make
sure the data extraction is correct. Errors at the data extraction stage are typically easy to find and can be
corrected easily as well.
 Data collection: These errors occur at the time of data collection and are harder to correct. They can be
categorized into four types:

 Missing completely at random: This is the case when the probability of a value being missing is the same for all
observations. For example: respondents in a data collection process decide to declare their earnings
after tossing a fair coin. If a head occurs, the respondent declares his / her earnings, and vice versa. Here each
observation has an equal chance of a missing value.
 Missing at random: This is the case when a variable is missing at random and the missing ratio varies for different values
/ levels of other input variables. For example: we are collecting data for age, and females have a higher missing-value
rate compared to males.
 Missing that depends on unobserved predictors: This is the case when the missing values are not random and
are related to an unobserved input variable. For example: in a medical study, if a particular diagnostic
causes discomfort, then there is a higher chance of dropout from the study. This missingness is not at random
unless we have included "discomfort" as an input variable for all patients.
 Missing that depends on the missing value itself: This is the case when the probability of a missing value is directly
correlated with the missing value itself. For example: people with higher or lower incomes are likely to provide a
non-response to questions about their earnings.
Missing Value Treatment (contd)
Which are the methods to treat missing values ?
 Deletion: It is of two types: list-wise deletion and pair-wise deletion.

 In list-wise deletion, we delete observations where any of the variables is missing. Simplicity is one
of the major advantages of this method, but it reduces the power of the model because it
reduces the sample size.
 In pair-wise deletion, we perform analysis with all cases in which the variables of interest are
present. The advantage of this method is that it keeps as many cases as possible available for
analysis. One disadvantage is that it uses different sample sizes for different variables.

Deletion methods are used when the nature of the missing data is "missing completely at random"; otherwise, non-random missing values can bias the model output.
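Both deletion styles can be sketched with pandas; the small data frame below is hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with scattered missing values (illustrative only)
df = pd.DataFrame({
    "Age":      [25, 30, np.nan, 40, 35],
    "Income":   [50, np.nan, 60, 80, 70],
    "Spending": [20, 25, 22, np.nan, 28],
})

# List-wise deletion: drop any row with at least one missing value
listwise = df.dropna()
print(len(listwise))  # only rows complete across all variables remain

# Pair-wise deletion: each pairwise statistic uses all rows where that pair
# of variables is present; pandas' corr() behaves this way by default,
# so each correlation may be based on a different sample size
pairwise_corr = df.corr()
print(pairwise_corr)
```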
Missing Value Treatment (contd)
Which are the methods to treat missing values ? (contd)

 Mean / Mode / Median Imputation: Imputation is a method to fill in the missing values with
estimated ones. The objective is to employ known relationships that can be identified in
the valid values of the data set to assist in estimating the missing values. Mean / mode /
median imputation is one of the most frequently used methods. It consists of replacing
the missing data for a given attribute with the mean or median (quantitative attribute) or
mode (qualitative attribute) of all known values of that variable. It can be of two types:-

 Generalized Imputation: In this case, we calculate the mean or median of all non-missing values
of the variable, then replace the missing values with it. As in the table above, the variable
"Manpower" has missing values, so we take the average of all non-missing values of "Manpower" (28.33) and
then replace the missing values with it.
 Similar case Imputation: In this case, we calculate the average for gender "Male" (29.75) and
"Female" (25) individually over non-missing values, then replace the missing values based on
gender. For "Male" we replace missing values of Manpower with 29.75, and for "Female"
with 25.
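Both imputation types can be expressed with pandas `fillna`; the Gender/Manpower values below are hypothetical, not the slide's table:

```python
import numpy as np
import pandas as pd

# Hypothetical "Manpower" column with missing values by Gender (illustrative)
df = pd.DataFrame({
    "Gender":   ["Male", "Male", "Female", "Female", "Male"],
    "Manpower": [30.0, np.nan, 25.0, np.nan, 28.0],
})

# Generalized imputation: replace with the overall mean of non-missing values
overall = df["Manpower"].fillna(df["Manpower"].mean())

# Similar-case imputation: replace with the mean of the same gender
by_gender = df["Manpower"].fillna(
    df.groupby("Gender")["Manpower"].transform("mean")
)
print(overall.tolist())
print(by_gender.tolist())
```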
Missing Value Treatment (contd)
Which are the methods to treat missing values ? (contd)

 Prediction Model: A prediction model is one of the more sophisticated methods for handling
missing data. Here, we create a predictive model to estimate values that will substitute for
the missing data. In this case, we divide our data set into two sets: one with no
missing values for the variable and another with missing values. The first data set
becomes the training data set of the model, while the second data set with missing values becomes the test
data set, and the variable with missing values is treated as the target variable. Next, we create a
model to predict the target variable based on the other attributes of the training data set and
populate the missing values of the test data set. We can use regression, ANOVA, logistic
regression and various other modeling techniques for this. There are two drawbacks to this
approach:
 The model-estimated values are usually more well-behaved than the true values
 If there are no relationships between the other attributes in the data set and the attribute with missing values,
then the model will not be precise in estimating the missing values.
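A minimal sketch of this idea on hypothetical data, using a simple one-predictor linear fit via `np.polyfit` as a stand-in for the richer regression models mentioned above:

```python
import numpy as np

# Hypothetical data: fill missing `y` values from `x` (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 4.0, 6.2, np.nan, 10.1, np.nan])

present = ~np.isnan(y)

# "Training set": rows where y is present; fit a simple linear model
slope, intercept = np.polyfit(x[present], y[present], deg=1)

# "Test set": rows where y is missing; predict and fill them in
y_filled = y.copy()
y_filled[~present] = slope * x[~present] + intercept
print(y_filled)
```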
Missing Value Treatment (contd)
Which are the methods to treat missing values ? (contd)

 KNN Imputation: In this method of imputation, the missing values of an attribute are imputed
using a given number of instances that are most similar to the instance whose values are
missing. The similarity of two instances is determined using a distance function. The method
has certain advantages and disadvantages.
 Advantages:
 k-nearest neighbour can predict both qualitative & quantitative attributes
 Creation of a predictive model for each attribute with missing data is not required
 Attributes with multiple missing values can be easily treated
 The correlation structure of the data is taken into consideration
 Disadvantages:
 The KNN algorithm is very time-consuming when analyzing a large database. It searches through the whole
dataset looking for the most similar instances.
 The choice of k-value is critical. A higher value of k would include neighbours that are significantly
different from what we need, whereas a lower value of k implies missing out on significant neighbours.
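A hand-rolled sketch of KNN imputation on hypothetical data, assuming Euclidean distance over the observed attributes and averaging the k nearest complete rows:

```python
import numpy as np

# Hypothetical data matrix with one missing entry (illustrative only);
# rows are observations, columns are attributes
X = np.array([
    [1.0, 2.0],
    [1.1, 2.1],
    [5.0, 8.0],
    [1.05, np.nan],  # value to impute
])

k = 2
row = X[3]                 # view into X: filling it updates X in place
missing = np.isnan(row)

# Distance to every complete row, using only the observed attributes
complete = X[~np.isnan(X).any(axis=1)]
dists = np.sqrt(((complete[:, ~missing] - row[~missing]) ** 2).sum(axis=1))

# Impute with the mean of the k nearest neighbours' values
nearest = complete[np.argsort(dists)[:k]]
row[missing] = nearest[:, missing].mean(axis=0)
print(X[3])  # missing value replaced by the neighbours' average
```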

 After dealing with missing values, the next task is to deal with outliers. Often, we tend to
neglect outliers while building models. This is a discouraging practice. Outliers tend to skew
your data and reduce accuracy. Let's learn more about outlier treatment.
