Data Exploration & Visualization
Data Science & Analytics
Introduction
Exploratory data analysis is a concept developed by John Tukey (1977) that offers a new perspective on statistics. Tukey's idea was that in traditional statistics the data was not being explored graphically; it was just being used to test hypotheses. The first attempt to develop a tool was made at Stanford, in a project called PRIM-9. The tool was able to visualize data in nine dimensions and could therefore provide a multivariate perspective of the data.
Today, exploratory data analysis is a must and has been included in the big data analytics life cycle. The ability to find insights and communicate them effectively across an organization is fueled by strong EDA capabilities.
First, identify the predictor (input) and target (output) variables. Next, identify the data type and category of each variable.
Let's understand this step more clearly with an example.
Example: Suppose we want to predict whether students will play cricket or not (refer to the data set below). Here you need to identify the predictor variables, the target variable, the data type of the variables and the category of the variables.
Steps for Data Exploration & Visualization (contd)
Below, the variables have been defined in their different categories:
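As a minimal, hedged sketch of this step (assuming a hypothetical students table with columns such as Gender, Prev_Exam_Marks and Play_Cricket, used here for illustration only), one could identify the predictor and target variables and their data types in pandas:

```python
import pandas as pd

# Hypothetical students data set (illustrative values only)
df = pd.DataFrame({
    "Gender":          ["Male", "Female", "Male", "Female"],
    "Prev_Exam_Marks": [65, 72, 58, 80],
    "Play_Cricket":    ["Yes", "No", "Yes", "Yes"],
})

target = "Play_Cricket"                                  # target (output) variable
predictors = [c for c in df.columns if c != target]     # predictor (input) variables

# object dtypes are usually categorical; numeric dtypes are usually continuous
print(df.dtypes)
print("Predictors:", predictors, "| Target:", target)
```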
Univariate Analysis
At this stage, we explore variables one by one. The method used to perform univariate analysis depends on whether the variable type is categorical or continuous. Let's look at these methods and statistical measures for categorical and continuous variables individually:
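As a hedged sketch (again using a hypothetical students table), univariate analysis of a categorical variable typically means frequency tables and bar charts, while a continuous variable is summarized with measures of central tendency and spread plus distribution plots:

```python
import pandas as pd

df = pd.DataFrame({
    "Gender":          ["Male", "Female", "Male", "Female", "Male"],
    "Prev_Exam_Marks": [65, 72, 58, 80, 67],
})

# Categorical variable: frequency counts and percentages
print(df["Gender"].value_counts())
print(df["Gender"].value_counts(normalize=True))

# Continuous variable: mean, std, quartiles, min/max
print(df["Prev_Exam_Marks"].describe())

# Distribution plot (requires matplotlib); a box plot (kind="box") also flags outliers
df["Prev_Exam_Marks"].plot(kind="hist", title="Prev_Exam_Marks")
```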
Note: Univariate analysis is also used to highlight missing and outlier values. In the upcoming part of this series, we will look at methods to handle missing and outlier values. To know more about these methods, you can refer to the Descriptive Statistics course from Udacity.
Bi-variate analysis finds out the relationship between two variables. Here, we look for association and disassociation between variables at a pre-defined significance level. We can perform bi-variate analysis for any combination of categorical and continuous variables. The combination can be: Categorical & Categorical, Categorical & Continuous, and Continuous & Continuous. Different methods are used to tackle these combinations during the analysis process.
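A hedged sketch of the three combinations, using hypothetical columns (the formal test applied to each combination would depend on the data):

```python
import pandas as pd

df = pd.DataFrame({
    "Gender":          ["Male", "Female", "Male", "Female", "Male", "Female"],
    "Play_Cricket":    ["Yes", "No", "Yes", "Yes", "No", "Yes"],
    "Prev_Exam_Marks": [65, 72, 58, 80, 61, 75],
    "Height_cm":       [170, 160, 175, 158, 168, 162],
})

# Continuous & Continuous: scatter plot / correlation coefficient
print(df["Prev_Exam_Marks"].corr(df["Height_cm"]))

# Categorical & Categorical: two-way (contingency) table, input to a chi-square test
print(pd.crosstab(df["Gender"], df["Play_Cricket"]))

# Categorical & Continuous: compare group means (formally via a Z-test, T-test or ANOVA)
print(df.groupby("Gender")["Prev_Exam_Marks"].mean())
```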
The chi-square test statistic for a test of independence of two categorical variables is found by:

χ² = Σ (O − E)² / E

where O represents the observed frequency and E is the expected frequency under the null hypothesis, computed by:

E = (row total × column total) / sample size
From the previous two-way table, the expected count for product category 1 to be of small size is 0.22. It is derived by taking the row total for Size (9) times the column total for Product category (2), then dividing by the sample size (81). This procedure is conducted for each cell.
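For illustration, scipy.stats.chi2_contingency performs exactly this computation on a two-way table; the observed counts below are made up for the sketch and are not the slide's actual data:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical two-way table: rows = Size (small, large), columns = product category
observed = np.array([
    [2, 7],
    [30, 42],
])

chi2, p_value, dof, expected = chi2_contingency(observed)
# expected[i, j] = row_total[i] * col_total[j] / grand_total,
# the same hand calculation described above
print("Expected counts:\n", expected)
print("Chi-square:", chi2, "p-value:", p_value, "dof:", dof)
```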
Statistical measures used to analyze the power of the relationship are:
Z-Test / T-Test: either test assesses whether the means of two groups are statistically different from each other, using
z = (x̄1 − x̄2) / √(σ1²/n1 + σ2²/n2)
If the probability of Z is small, then the difference between the two averages is more significant. The T-test is very similar to the Z-test, but it is used when the number of observations for both categories is less than 30.
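A minimal sketch of a two-sample t-test with SciPy (illustrative numbers only); with 30 or more observations per group a Z-test would normally be used instead:

```python
from scipy.stats import ttest_ind

# Hypothetical continuous measurements for two categories
marks_male   = [65, 58, 70, 61, 67]
marks_female = [72, 80, 75, 69, 78]

t_stat, p_value = ttest_ind(marks_male, marks_female)
print("t-statistic:", t_stat, "p-value:", p_value)
# A small p-value indicates the difference between the two averages is significant
```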
Missing Value Treatment
Missing data in the training data set can reduce the power / fit of a model or can lead to a biased model, because we have not analysed the behaviour and relationships with other variables correctly. It can lead to incorrect predictions or classifications.
Missing Value Treatment (contd)
Why missing values treatment is required?
Notice the missing values in the image shown above: in the left scenario, we have not treated missing values. The inference from this data set is that the chances of playing cricket are higher for males than for females. On the other hand, if you look at the second table, which shows the data after treatment of missing values (based on gender), we can see that females have a higher chance of playing cricket compared to males.
Missing Value Treatment (contd)
Why my data has missing values?
We looked at the importance of treating missing values in a data set. Now, let's identify the reasons these missing values occur. They may occur at two stages:
Data extraction: It is possible that there are problems with the extraction process. In such cases, we should double-check for correct data with the data guardians. Some hashing procedures can also be used to make sure the data extraction is correct. Errors at the data extraction stage are typically easy to find and can be corrected easily as well.
Data collection: These errors occur at the time of data collection and are harder to correct. They can be categorized into four types:
Missing completely at random: This is a case where the probability of a missing value is the same for all observations. For example: respondents in a data collection process decide that they will declare their earnings after tossing a fair coin. If heads comes up, the respondent declares his/her earnings, and vice versa. Here each observation has an equal chance of having a missing value.
Missing at random: This is a case where a variable is missing at random and the missing ratio varies for different values / levels of other input variables. For example: when collecting data on age, females have a higher missing-value rate compared to males.
Missing that depends on unobserved predictors: This is a case where the missing values are not random and are related to an unobserved input variable. For example: in a medical study, if a particular diagnostic causes discomfort, then there is a higher chance of dropping out of the study. This missingness is not at random unless we have included "discomfort" as an input variable for all patients.
Missing that depends on the missing value itself: This is a case where the probability of a missing value is directly correlated with the missing value itself. For example: people with higher or lower income are likely to give a non-response about their earnings.
Missing Value Treatment (contd)
Which are the methods to treat missing values ?
Deletion: It is of two types: list-wise deletion and pair-wise deletion.
In list-wise deletion, we delete observations where any of the variables is missing. Simplicity is one of the major advantages of this method, but it reduces the power of the model because it reduces the sample size.
In pair-wise deletion, we perform each analysis with all cases in which the variables of interest are present. The advantage of this method is that it keeps as many cases as possible available for analysis. One disadvantage is that it uses a different sample size for different variables.
Deletion methods are used when the nature of the missing data is "missing completely at random"; otherwise, non-random missing values can bias the model output.
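A small pandas sketch of both deletion strategies, using a hypothetical Manpower / Sales table for illustration only:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Manpower": [25, np.nan, 30, 32],
    "Sales":    [100, 120, np.nan, 150],
    "Gender":   ["Male", "Female", "Male", "Female"],
})

# List-wise deletion: drop every observation that has any missing value
listwise = df.dropna()

# Pair-wise deletion: each analysis uses all rows where its own variables are present;
# pandas' correlation, for example, already works pair-wise by default
pairwise_corr = df[["Manpower", "Sales"]].corr()

print(listwise)
print(pairwise_corr)
```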
Missing Value Treatment (contd)
Which are the methods to treat missing values ? (contd)
Mean / Mode / Median Imputation: Imputation is a method to fill in the missing values with estimated ones. The objective is to employ known relationships that can be identified in the valid values of the data set to assist in estimating the missing values. Mean / mode / median imputation is one of the most frequently used methods. It consists of replacing the missing data for a given attribute with the mean or median (quantitative attribute) or mode (qualitative attribute) of all known values of that variable. It can be of two types:
Generalized imputation: In this case, we calculate the mean or median of all non-missing values of the variable and then replace the missing values with it. For example, in the table above, the variable "Manpower" has missing values, so we take the average of all non-missing values of "Manpower" (28.33) and replace the missing values with it.
Similar case imputation: In this case, we calculate the average of the non-missing values for gender "Male" (29.75) and "Female" (25) individually and then replace the missing values based on gender. For "Male", we replace missing values of manpower with 29.75 and for "Female" with 25.
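A hedged pandas sketch of both variants, assuming a hypothetical Gender / Manpower table like the one described above (the values are illustrative, not the slide's numbers):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Gender":   ["Male", "Male", "Female", "Female", "Male"],
    "Manpower": [28.0, 31.0, 25.0, np.nan, np.nan],
})

# Generalized imputation: one overall mean replaces every missing value
df["Manpower_generalized"] = df["Manpower"].fillna(df["Manpower"].mean())

# Similar case imputation: a separate mean per gender replaces the missing values
df["Manpower_similar_case"] = df["Manpower"].fillna(
    df.groupby("Gender")["Manpower"].transform("mean")
)
print(df)
```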
Missing Value Treatment (contd)
Which are the methods to treat missing values ? (contd)
Prediction Model: A prediction model is one of the more sophisticated methods for handling missing data. Here, we create a predictive model to estimate values that will substitute for the missing data. In this case, we divide our data set into two sets: one set with no missing values for the variable and another with missing values. The first data set becomes the training data set for the model, while the second data set, with missing values, is the test data set, and the variable with missing values is treated as the target variable. Next, we create a model to predict the target variable based on the other attributes of the training data set and populate the missing values of the test data set. We can use regression, ANOVA, logistic regression and various other modeling techniques to do this. There are two drawbacks to this approach:
The model-estimated values are usually more well-behaved than the true values.
If there are no relationships between the attributes in the data set and the attribute with missing values, then the model will not be precise for estimating the missing values.
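A minimal scikit-learn sketch of this idea, using a simple linear regression and hypothetical Sales / Manpower columns (any other modeling technique could be substituted):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "Sales":    [100, 120, 130, 150, 160, 170],
    "Manpower": [25, 28, np.nan, 33, np.nan, 37],
})

known   = df[df["Manpower"].notna()]   # training set: target variable observed
unknown = df[df["Manpower"].isna()]    # "test" set: target variable missing

# Fit a model on the complete rows, then predict the missing values
model = LinearRegression().fit(known[["Sales"]], known["Manpower"])
df.loc[df["Manpower"].isna(), "Manpower"] = model.predict(unknown[["Sales"]])
print(df)
```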
Missing Value Treatment (contd)
Which are the methods to treat missing values ? (contd)
KNN Imputation: In this method of imputation, the missing values of an attribute are imputed using the given number of attributes that are most similar to the attribute whose values are missing. The similarity of two attributes is determined using a distance function. It is also known to have certain advantages & disadvantages.
Advantages:
k-nearest neighbour can predict both qualitative & quantitative attributes
Creation of predictive model for each attribute with missing data is not required
Attributes with multiple missing values can be easily treated
Correlation structure of the data is taken into consideration
Disadvantages:
The KNN algorithm is very time-consuming when analyzing large databases. It searches through the entire dataset looking for the most similar instances.
The choice of k-value is very critical. A higher value of k would include attributes which are significantly different from what we need, whereas a lower value of k implies missing out on significant attributes.
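A short sketch using scikit-learn's KNNImputer on hypothetical numeric data; the choice of k (n_neighbors) is the critical parameter discussed above:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical numeric data with missing entries
X = np.array([
    [25.0, 100.0],
    [28.0, 120.0],
    [np.nan, 130.0],
    [33.0, 150.0],
    [np.nan, 165.0],
])

imputer = KNNImputer(n_neighbors=2)   # the k = 2 nearest neighbours are averaged
X_filled = imputer.fit_transform(X)
print(X_filled)
```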
After dealing with missing values, the next task is to deal with outliers. Often, we tend to neglect outliers while building models; this is a discouraged practice. Outliers tend to skew your data and reduce accuracy. Let's learn more about outlier treatment.