Exploratory Data Analysis
Chapter 4
Business Analytics
This dataset contains information about individuals and details about their
dwellings.
Looking at the dataset, several questions come to mind — and many more will follow, along with their answers, as we dig deeper into the data. This process of mining the data is called exploratory analysis.
What is Data Exploration?
• No programming language or machine learning algorithm can deliver an impeccable predictive model unless you first perform data exploration.
• Data exploration not only uncovers hidden trends and insights, but also lets you take the first steps towards building a highly accurate model.
• There are 7 steps involved in cleaning and preparing the data for building a predictive model:
Variable Identification
Univariate Analysis
Bivariate Analysis
Missing values treatment
Outlier treatment
Variable transformation
Variable creation
Name | Age | Gender | Education | Salary | AppraisedValue | Location | Landacres | HouseSizeSqft | Rooms | Baths | Garage
Tony | 25 | M | Grad | 50 | 700 | Glen Cove | 0.2297 | 2448 | 8 | 3.5 | 2
Harret | 52 | F | PostGrad | 95 | 364 | Glen Cove | 0.2192 | 1942 | 7 | 2.5 | 1
Jane | 26 | F | PostGrad | 65 | 600 | Glen Cove | 0.163 | 2073 | 7 | 3 | 2
Rose | 45 | F | Grad | 100 | 548.4 | Long Beach | 0.4608 | 2707 | 8 | 2.5 | 1
John | 42 | M | Grad | 77 | 405.9 | Long Beach | 0.2549 | 2042 | | 1.5 | 1
Mark | 62 | M | PostGrad | 118 | 374.1 | Glen Cove | 0.229 | 2089 | 7 | 2 | 0
Bruce | 51 | M | Grad | 101 | 600 | Glen Cove | 0.1714 | 1344 | 8 | 1 | 0
Steve | 43 | M | Grad | 108 | 299 | Roslyn | 0.175 | 1120 | 5 | 1.5 | 0
Carol | 24 | F | PostGrad | 51 | 471 | Roslyn | 0.213 | 1817 | 6 | 2 | 0
Henry | 25 | M | PostGrad | 68 | 510.7 | Roslyn | 0.1377 | 2496 | | 2 | 1
Donald | 41 | M | Grad | 86 | 517.7 | Long Beach | 0.2497 | 1615 | 7 | 2 | 1
Maria | 51 | F | Grad | 122 | 1200 | Long Beach | 0.4116 | 4067 | 9 | 4 | 1
Janet | 49 | F | PostGrad | 112 | 700 | Roslyn | 0.3372 | 3130 | 8 | 3 | 1
Sophia | 32 | F | Grad | 85 | 374.8 | Roslyn | 0.1503 | 1423 | | 2 | 0
Jeffery | 37 | M | Grad | 90 | 543 | Long Beach | 0.2348 | 1799 | 6 | 2.5 | 1
Below, the variables have been classified into different categories:
Note: A numeric variable can be of two types, discrete or continuous, depending on the nature of the values it takes.
Univariate Analysis
• Univariate analysis is the simplest form of analyzing data. “Uni” means “one”, so we analyze one variable at a time.
• It doesn’t deal with causes or relationships among variables; its purpose is mostly to describe, summarize, and find patterns in the data.
• Used to highlight missing and outlier values
• Method to perform univariate analysis depends on whether the variable type
is categorical or continuous
Continuous Variables
These measures (below) help in determining the central value and also the dispersion of continuous variables:

Central Tendency | Measure of Dispersion | Visualization Method
Mean | Range | Histogram
Mode | IQR |
Categorical Variables
A frequency table is used to understand the distribution of each category under a variable; we can produce a count and count% against each category.
Bar plots can be used to visualize the frequency table.
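As a sketch, the frequency table and the continuous-variable summaries above can be computed with Python's standard library (the Gender and Salary values are taken from the sample dataset; the variable names are assumptions):

```python
from collections import Counter
import statistics

# Categorical variable: frequency table with count and count%
gender = ["M", "F", "F", "F", "M", "M", "M", "M", "F", "M", "M", "F", "F", "F", "M"]
counts = Counter(gender)
for category, n in counts.items():
    print(category, n, f"{100 * n / len(gender):.1f}%")

# Continuous variable: central tendency and dispersion
salary = [50, 95, 65, 100, 77, 118, 101, 108, 51, 68, 86, 122, 112, 85, 90]
print("mean:", statistics.mean(salary))
print("range:", max(salary) - min(salary))
q = statistics.quantiles(salary, n=4)   # quartiles Q1, Q2, Q3
print("IQR:", q[2] - q[0])
```

A histogram or bar plot of the same columns would typically be drawn with a plotting library; the numbers above are the underlying summaries.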
Bivariate Analysis
• In Univariate Analysis, we study one variable at a time, like we did in earlier slides, but if we
want to find if there is any relation between two variables we need to perform bivariate
analysis.
• Bivariate analysis can be performed for any combination of categorical and continuous variables.
• Different methods are used to tackle different combinations during analysis process.
• Possible Combinations are:-
– Continuous & Continuous
– Continuous & Categorical
– Categorical & Categorical
Bivariate Analysis - Continuous & Continuous
• Scatter plot
– Used to find the relationship between two variables.
– The pattern of a scatter plot indicates the relationship between the variables, but does not indicate its strength.
– The relationship can be linear or non-linear.
– To find the strength of the relationship, we use correlation (-1 = perfect negative linear correlation, +1 = perfect positive linear correlation, 0 = no correlation).
– This gives us an idea of the relation and pattern between two variables in the dataset.
Bivariate Analysis - Categorical & Categorical
Methods to identify the relationship between two categorical
variables.
• Two-way table: we create a two-way table of count and count%, where the rows and columns represent the categories of their respective variables.
• Stacked Column Chart: one of the most visual forms of the two-way table.
O = observed frequency, E = expected frequency
Each cell of the two-way table (Education x Gender) contributes (O - E)^2 / E.
Adding up the four cell contributions, we get the chi-square value:
Chi-square = 0.342857 + 0.3 + 0.514286 + 0.45 = 1.607143
The p-value corresponding to this chi-square value with 1 df and alpha = 0.05 is 0.2049.
Since the p-value > 0.05, we fail to reject the null hypothesis and conclude that Education and Gender are independent variables.
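The calculation above can be reproduced with Python's standard library; for 1 degree of freedom, the chi-square survival function has a closed form via the complementary error function, so no statistics package is needed:

```python
import math

# Cell contributions (O - E)^2 / E from the two-way table in the slides
contributions = [0.342857, 0.3, 0.514286, 0.45]
chi_sq = sum(contributions)

# For df = 1, p-value = P(X > chi_sq) = erfc(sqrt(chi_sq / 2))
p_value = math.erfc(math.sqrt(chi_sq / 2))

print(f"chi-square = {chi_sq:.6f}, p-value = {p_value:.4f}")
```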
Missing Values
Missing Value Treatment
• There may be situations where there are missing values in your data.
• As a rule of thumb: if less than 1% of the data is missing, it will barely affect the result; 1-5% is manageable with simple methods; for 5-15%, more complex techniques are needed to handle the missing data; above 15%, missing data will seriously hinder the results of any data mining technique.
• Handling such values is very important, as ignoring them can lead to wrong results.
Example 1 - Location is mostly missing:

Obs | Age | Salary (in 1000s) | Location
1 | 24 | 15 | North
2 | 28 | 20 | NA
3 | 36 | 45 | NA
4 | 30 | 35 | NA
5 | 25 | 20 | South
6 | 35 | 54 | NA
7 | 41 | 60 | NA
8 | 38 | 52 | NA
9 | 28 | 26 | NA
10 | 29 | 25 | NA

Example 2 - only a few Salary values are missing:

Obs | Age | Salary (in 1000s)
1 | 24 | 15
2 | 28 | 20
3 | 36 | 45
4 | 30 | NA
5 | 25 | 20
6 | 35 | 54
7 | 41 | NA
…
1000 | 24 | 18
• Single Imputation: in single imputation, we replace the missing values with the mean, median, or mode.
If the variable is continuous, replace the missing values with the mean or median:
– if the variable is roughly normally distributed (in particular, not noticeably skewed), choose the mean;
– if the data is skewed, median imputation is suggested.
If the variable is categorical, replace the missing values with the most frequently occurring value of that variable, i.e. the mode.
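As a sketch, mean imputation for the Rooms variable from the sample dataset could look like this (None marks the 3 missing values):

```python
import statistics

# Rooms column from the sample dataset; None marks the 3 missing values
rooms = [8, 7, 7, 8, None, 7, 8, 5, 6, None, 7, 9, 8, None, 6]

observed = [r for r in rooms if r is not None]
mean_rooms = statistics.mean(observed)

# Mean imputation: fill each gap with the mean of the observed values
imputed = [r if r is not None else round(mean_rooms, 6) for r in rooms]
print(imputed)
```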
Single Imputation - by Mean/Median/Mode

Name | Age | Gender | Education | Salary | AppraisedValue | Location | Landacres | HouseSizeSqft | Rooms | Baths | Garage
Tony | 25 | M | Grad | 50 | 700 | Glen Cove | 0.2297 | 2448 | 8 | 3.5 | 2
Harret | 52 | F | PostGrad | 95 | 364 | Glen Cove | 0.2192 | 1942 | 7 | 2.5 | 1
Jane | 26 | F | PostGrad | 65 | 600 | Glen Cove | 0.163 | 2073 | 7 | 3 | 2
Rose | 45 | F | Grad | 100 | 548.4 | Long Beach | 0.4608 | 2707 | 8 | 2.5 | 1
John | 42 | M | Grad | 77 | 405.9 | Long Beach | 0.2549 | 2042 | | 1.5 | 1
Mark | 62 | M | PostGrad | 118 | 374.1 | Glen Cove | 0.229 | 2089 | 7 | 2 | 0
Bruce | 51 | M | Grad | 101 | 600 | Glen Cove | 0.1714 | 1344 | 8 | 1 | 0
Steve | 43 | M | Grad | 108 | 299 | Roslyn | 0.175 | 1120 | 5 | 1.5 | 0
Carol | 24 | F | PostGrad | 51 | 471 | Roslyn | 0.213 | 1817 | 6 | 2 | 0
Henry | 25 | M | PostGrad | 68 | 510.7 | Roslyn | 0.1377 | 2496 | | 2 | 1
Donald | 41 | M | Grad | 86 | 517.7 | Roslyn | 0.2497 | 1615 | 7 | 2 | 1
Maria | 51 | F | Grad | 122 | 1200 | Roslyn | 0.4116 | 4067 | 9 | 4 | 1
Janet | 49 | F | PostGrad | 112 | 700 | Roslyn | 0.3372 | 3130 | 8 | 3 | 1
Sophia | 32 | F | Grad | 85 | 374.8 | Roslyn | 0.1503 | 1423 | | 2 | 0
Jeffery | 37 | M | Grad | 90 | 543 | Roslyn | 0.2348 | 1799 | 6 | 2.5 | 1
We can see that the variable Rooms has 3 missing values; we need to find a way to replace them.
[Histogram of Rooms (non-missing values)]
Looking at the histogram of the variable Rooms (non-missing values), we see that it is roughly normally distributed. Hence we can impute the missing values with the mean of the non-missing data.
Name | Age | Gender | Education | Salary | AppraisedValue | Location | Landacres | HouseSizeSqft | Rooms | Baths | Garage
Tony | 25 | M | Grad | 50 | 700.0 | Glen Cove | 0.2297 | 2448 | 8.000000 | 3.5 | 2
Harret | 52 | F | PostGrad | 95 | 364.0 | Glen Cove | 0.2192 | 1942 | 7.000000 | 2.5 | 1
Jane | 26 | F | PostGrad | 65 | 600.0 | Glen Cove | 0.1630 | 2073 | 7.000000 | 3.0 | 2
Rose | 45 | F | Grad | 100 | 548.4 | Long Beach | 0.4608 | 2707 | 8.000000 | 2.5 | 1
John | 42 | M | Grad | 77 | 405.9 | Long Beach | 0.2549 | 2042 | 7.166667 | 1.5 | 1
Mark | 62 | M | PostGrad | 118 | 374.1 | Glen Cove | 0.2290 | 2089 | 7.000000 | 2.0 | 0
Bruce | 51 | M | Grad | 101 | 600.0 | Glen Cove | 0.1714 | 1344 | 8.000000 | 1.0 | 0
Steve | 43 | M | Grad | 108 | 299.0 | Roslyn | 0.1750 | 1120 | 5.000000 | 1.5 | 0
Carol | 24 | F | PostGrad | 51 | 471.0 | Roslyn | 0.2130 | 1817 | 6.000000 | 2.0 | 0
Henry | 25 | M | PostGrad | 68 | 510.7 | Roslyn | 0.1377 | 2496 | 7.166667 | 2.0 | 1
Donald | 41 | M | Grad | 86 | 517.7 | Long Beach | 0.2497 | 1615 | 7.000000 | 2.0 | 1
Maria | 51 | F | Grad | 122 | 1200.0 | Long Beach | 0.4116 | 4067 | 9.000000 | 4.0 | 1
Janet | 49 | F | PostGrad | 112 | 700.0 | Roslyn | 0.3372 | 3130 | 8.000000 | 3.0 | 1
Sophia | 32 | F | Grad | 85 | 374.8 | Roslyn | 0.1503 | 1423 | 7.166667 | 2.0 | 0
Jeffery | 37 | M | Grad | 90 | 543.0 | Long Beach | 0.2348 | 1799 | 6.000000 | 2.5 | 1
Constant: this choice allows us to provide our own default value to fill in the gaps. This might be an integer or real number for numeric variables, or a special marker or a category other than the majority category for categorical variables.
Closest fit: the closest fit algorithm replaces a missing value with the value of the same attribute in the most similar case. The main idea is to search the dataset for cases similar to the one with the missing attribute value and take the value from the closest match.
Treating Missing Values :: Closest Fit

Area (sq. ft) | Rent
275 | 8000
500 | 10000
850 | 12000
900 | (missing)
1000 | 17000
1225 | 19000
1500 | 20000

Note: this method is more useful for a small dataset.
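A minimal sketch of the closest-fit idea on the Area/Rent table above: fill the missing Rent for Area = 900 from the record whose Area is most similar.

```python
# Records with known Rent, as (area_sq_ft, rent) pairs from the table
data = [(275, 8000), (500, 10000), (850, 12000), (1000, 17000),
        (1225, 19000), (1500, 20000)]

def closest_fit_rent(area):
    # Pick the record with the most similar Area and reuse its Rent
    nearest = min(data, key=lambda row: abs(row[0] - area))
    return nearest[1]

print(closest_fit_rent(900))  # nearest Area is 850, so Rent = 12000
```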
Outliers
• What is an Outlier?
An outlier is an observation that appears far away from, and diverges from, the overall pattern in a sample.
• Outliers can drastically change the results of the data analysis and
statistical modeling. There are numerous unfavorable impacts of
outliers in the data set:
o They increase the error variance and reduce the power of statistical tests
o If the outliers are non-randomly distributed, they can decrease
normality
o They can bias or influence estimates that may be of substantive
interest
Causes of outliers
• Data Entry Errors - Human errors such as errors caused during data collection,
recording, or entry can cause outliers in data.
• Measurement Error - When the measurement instrument used turns out to be
faulty.
• Intentional Error - This is commonly found in self-reported measures that involve sensitive data.
• Data Processing Error - When data is collected and combined from different sources.
• Sampling Error - When data that is not part of the intended sample is included.
• Natural Outlier - When an outlier is not artificial (due to error), it is a natural
outlier.
Example
Let's examine what can happen to a data set with outliers. For the sample data set:
1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4
We find the following mean, median, mode, and standard deviation:
Mean = 2.58
Median = 2.5
Mode = 2
Standard Deviation = 1.08
Outliers often have a significant effect on the mean and standard deviation: adding even one extreme value to this set would shift both noticeably, while barely moving the median and mode. Because of this, we must take steps to identify and treat outliers in our data sets.
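These figures can be verified, and the effect of an outlier demonstrated, with Python's statistics module (the appended value 120 is a hypothetical outlier, not part of the original data):

```python
import statistics

data = [1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4]
print(round(statistics.mean(data), 2))    # 2.58
print(statistics.median(data))            # 2.5
print(statistics.mode(data))              # 2
print(round(statistics.stdev(data), 2))   # 1.08

# Append a hypothetical outlier and recompute: the mean jumps,
# while the median barely moves
with_outlier = data + [120]
print(round(statistics.mean(with_outlier), 2))
print(statistics.median(with_outlier))
```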
Example
Suppose you want to take admission to an MBA school, and your criterion for selecting the best MBA school is the average package received by its students.
School 1
Student size: 20
Packages (in lakhs p.a.): 10, 9, 7, 10, 5, 5, 9, 9, 8, 5, 8, 9, 7, 9, 9, 10, 8, 5, 8, 10
Avg. Package = 8
School 2
Student size: 20
Packages (in lakhs p.a.): 7, 6, 8, 10, 10, 10, 9, 50, 9, 7, 50, 8, 7, 10, 7, 8, 8, 10, 6, 8
Avg. Package = 12.4
Looking at the numbers we would decide that School 2 is the best, but the average package of School 2 has gone up just because two students got hired by an MNC (say Google).
These are outliers, which are skewing our average on the higher side.
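A quick check in Python shows why the median is a more robust selection criterion than the mean here (the package lists follow the example above):

```python
import statistics

school1 = [10, 9, 7, 10, 5, 5, 9, 9, 8, 5, 8, 9, 7, 9, 9, 10, 8, 5, 8, 10]
school2 = [7, 6, 8, 10, 10, 10, 9, 50, 9, 7, 50, 8, 7, 10, 7, 8, 8, 10, 6, 8]

# The two 50-lakh packages inflate School 2's mean...
print(statistics.mean(school1), statistics.mean(school2))
# ...but the median is barely affected by these outliers
print(statistics.median(school1), statistics.median(school2))
```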
Outlier Detection - Viz
• Outliers can be detected using boxplots and scatter plots
• In our data, we plot a scatter plot of AppraisedValue against Baths (bivariate analysis) and also a boxplot of AppraisedValue (univariate analysis)
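A boxplot flags points outside Q1 - 1.5*IQR and Q3 + 1.5*IQR; the same rule can be applied directly to the AppraisedValue column (a sketch using the sample dataset):

```python
import statistics

# AppraisedValue column from the sample dataset
appraised = [700, 364, 600, 548.4, 405.9, 374.1, 600, 299, 471,
             510.7, 517.7, 1200, 700, 374.8, 543]

# Boxplot rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, _, q3 = statistics.quantiles(appraised, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in appraised if v < lower or v > upper]
print(outliers)  # Maria's house at 1200 is flagged
```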
Feature Engineering
• Feature engineering is the science (and art) of extracting more information from existing
data.
• Example
– Several variables could be generated from a date variable i.e. Day, month, year, day
of the week etc. This information helps a lot in getting idea about different
characteristics of the data under study
• It can be divided into two steps,
– Variable Transformation
– Variable Creation
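The date example above can be sketched with Python's standard library (the purchase_date variable and the chosen date are hypothetical):

```python
from datetime import date

# A hypothetical purchase_date variable
purchase_date = date(2023, 7, 14)

# Several new variables derived from the single date variable
features = {
    "day": purchase_date.day,
    "month": purchase_date.month,
    "year": purchase_date.year,
    "day_of_week": purchase_date.strftime("%A"),
    "is_weekend": purchase_date.weekday() >= 5,  # Saturday/Sunday
}
print(features)
```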
Feature Engineering – Variable Transformation
• In data modelling, transformation refers to the replacement of a variable by a function of it. For instance, replacing a variable x by its square root, cube root, or logarithm is a transformation.
• When do we transform?
– When we want to change the scale of a variable or standardize its values for better understanding. This kind of transformation is a must if you have data in different scales; it does not change the shape of the variable's distribution.
– When a linear relationship between variables is easier to comprehend than a non-linear or curved relation.
– Variables can be transformed by applying functions like log, square, cube, etc. These transformations help in reducing skewness: for a right-skewed distribution, we take the square root, cube root, or logarithm of the variable, and for a left-skewed one, we take the square.
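A small illustration with made-up right-skewed data: taking the logarithm noticeably reduces skewness. The skewness function below is the population-moment form (mean of cubed standardized deviations), a common sketch rather than any particular library's definition:

```python
import math

def skewness(xs):
    # Population skewness: mean of cubed standardized deviations
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return sum(((x - mean) / sd) ** 3 for x in xs) / n

# A hypothetical right-skewed variable
values = [1, 1, 2, 2, 3, 3, 4, 5, 8, 20, 60]
logged = [math.log(x) for x in values]

# Skewness drops substantially after the log transform
print(round(skewness(values), 2), round(skewness(logged), 2))
```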
Feature Engineering - Variable and Dummy Variable Creation

Name | Age | Gender | Education | Salary | AppraisedValue | Location | Landacres | HouseSizeSqft | Rooms | Baths | Garage | Glen Cove | Long Beach | Roslyn
Rose | 45 | F | Grad | 100 | 548.4 | Long Beach | 0.4608 | 2707 | 8.000000 | 2.5 | 1 | 0 | 1 | 0
Mark | 62 | M | PostGrad | 118 | 374.1 | Glen Cove | 0.2290 | 2089 | 7.000000 | 2.0 | 0 | 1 | 0 | 0
Bruce | 51 | M | Grad | 101 | 600.0 | Glen Cove | 0.1714 | 1344 | 8.000000 | 1.0 | 0 | 1 | 0 | 0
Maria | 51 | F | Grad | 122 | 1200.0 | Long Beach | 0.4116 | 4067 | 9.000000 | 4.0 | 1 | 0 | 1 | 0

(The last three columns are dummy variables created from Location.)
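Dummy variable creation for Location can be sketched in plain Python; each category becomes its own 0/1 column:

```python
# One-hot (dummy) encoding of the Location variable - a minimal sketch
locations = ["Long Beach", "Glen Cove", "Glen Cove", "Long Beach"]
categories = ["Glen Cove", "Long Beach", "Roslyn"]   # all categories in the dataset

# One 0/1 column per category; exactly one 1 per row
dummies = [[1 if loc == cat else 0 for cat in categories] for loc in locations]
for loc, row in zip(locations, dummies):
    print(loc, row)
```

In practice a library routine (such as a one-hot encoder) would be used, but the logic is exactly this.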
•If the response variable is not a linear function of the predictors, try a different
function. For example, polynomial regression involves transforming one or more
predictor variables while remaining within the multiple linear regression framework.
•For another example, applying a logarithmic transformation to the response variable
also allows for a nonlinear relationship between the response and the predictors while
remaining within the multiple linear regression framework.
•Transforming the response and/or predictor variables therefore has the potential to remedy a number of model problems.
•The use of transformation will become clearer later in the course, when we deal with model building.
Transforming a variable involves using a mathematical operation to change its measurement
scale.
In regression, a transformation to achieve linearity is a special kind of nonlinear transformation: one that increases the linear relationship between two variables.
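As an illustration with made-up data: if y = 3·x², taking logs of both variables makes the relationship linear, and an ordinary least-squares fit on the transformed data recovers the exponent and coefficient:

```python
import math

# Hypothetical data following y = 3 * x^2 exactly
xs = [1, 2, 3, 4, 5]
ys = [3 * x ** 2 for x in xs]

# Transform to achieve linearity: log y = log a + b * log x
lx = [math.log(x) for x in xs]
ly = [math.log(y) for y in ys]

# Ordinary least-squares slope and intercept on the transformed data
n = len(lx)
mx, my = sum(lx) / n, sum(ly) / n
b = sum((u - mx) * (v - my) for u, v in zip(lx, ly)) / sum((u - mx) ** 2 for u in lx)
log_a = my - b * mx
print(round(b, 3), round(math.exp(log_a), 3))  # recovers b = 2 and a = 3
```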