Exploratory Data Analysis
Chapter 4
Business Analytics
This dataset contains information about individuals and details about their
dwellings.
Looking at the dataset, several questions come to mind — and many more will follow, along with their answers, as we dig deeper into the data. This process of mining the data is called exploratory analysis.
What is Data Exploration?
• No programming language or machine learning algorithm can deliver an impeccable predictive model unless you first perform data exploration.
• Data exploration not only uncovers hidden trends and insights, but also lets you take the first steps towards building a highly accurate model.
• There are 7 steps involved in cleaning and preparing the data for building a predictive model:
Variable Identification
Univariate Analysis
Bivariate Analysis
Missing values treatment
Outlier treatment
Variable transformation
Variable creation
Name | Age | Gender | Education | Salary | AppraisedValue | Location | Landacres | HouseSizeSqft | Rooms | Baths | Garage
Tony | 25 | M | Grad | 50 | 700 | Glen Cove | 0.2297 | 2448 | 8 | 3.5 | 2
Harret | 52 | F | PostGrad | 95 | 364 | Glen Cove | 0.2192 | 1942 | 7 | 2.5 | 1
Jane | 26 | F | PostGrad | 65 | 600 | Glen Cove | 0.163 | 2073 | 7 | 3 | 2
Rose | 45 | F | Grad | 100 | 548.4 | Long Beach | 0.4608 | 2707 | 8 | 2.5 | 1
John | 42 | M | Grad | 77 | 405.9 | Long Beach | 0.2549 | 2042 | | 1.5 | 1
Mark | 62 | M | PostGrad | 118 | 374.1 | Glen Cove | 0.229 | 2089 | 7 | 2 | 0
Bruce | 51 | M | Grad | 101 | 600 | Glen Cove | 0.1714 | 1344 | 8 | 1 | 0
Steve | 43 | M | Grad | 108 | 299 | Roslyn | 0.175 | 1120 | 5 | 1.5 | 0
Carol | 24 | F | PostGrad | 51 | 471 | Roslyn | 0.213 | 1817 | 6 | 2 | 0
Henry | 25 | M | PostGrad | 68 | 510.7 | Roslyn | 0.1377 | 2496 | | 2 | 1
Donald | 41 | M | Grad | 86 | 517.7 | Long Beach | 0.2497 | 1615 | 7 | 2 | 1
Maria | 51 | F | Grad | 122 | 1200 | Long Beach | 0.4116 | 4067 | 9 | 4 | 1
Janet | 49 | F | PostGrad | 112 | 700 | Roslyn | 0.3372 | 3130 | 8 | 3 | 1
Sophia | 32 | F | Grad | 85 | 374.8 | Roslyn | 0.1503 | 1423 | | 2 | 0
Jeffery | 37 | M | Grad | 90 | 543 | Long Beach | 0.2348 | 1799 | 6 | 2.5 | 1
Below, the variables have been classified into different categories:
Note: A numeric variable can be of two types, discrete or continuous, depending on the nature of the values it takes.
Univariate Analysis
• Univariate analysis is the simplest form of analyzing data. “Uni” means “one”, so we analyze one variable at a time.
• It doesn’t deal with causes or relationships among variables; its purpose is mostly to describe, summarize, and find patterns in the data.
• Used to highlight missing and outlier values
• Method to perform univariate analysis depends on whether the variable type
is categorical or continuous
Continuous Variables
These measures (below) help in determining the central value and also the dispersion of continuous variables:

Central Tendency | Measure of Dispersion | Visualization Method
Mean | Range | Histogram
Mode | IQR |
Categorical Variables
A frequency table is used to understand the distribution of each category under a variable; we can produce a count and count% against each category.
Bar plots can be used to visualize the frequency table.
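As a sketch, the frequency table and the continuous-variable summaries above can be computed with Python's standard library (the Gender and Salary values are taken from the sample dataset; the variable names are assumptions):

```python
from collections import Counter
import statistics

# Categorical variable: frequency table with count and count%
gender = ["M", "F", "F", "F", "M", "M", "M", "M", "F", "M", "M", "F", "F", "F", "M"]
counts = Counter(gender)
for category, n in counts.items():
    print(category, n, f"{100 * n / len(gender):.1f}%")

# Continuous variable: central tendency and dispersion
salary = [50, 95, 65, 100, 77, 118, 101, 108, 51, 68, 86, 122, 112, 85, 90]
print("mean:", statistics.mean(salary))
print("range:", max(salary) - min(salary))
q = statistics.quantiles(salary, n=4)   # quartiles Q1, Q2, Q3
print("IQR:", q[2] - q[0])
```

A histogram or bar plot of the same columns would typically be drawn with a plotting library; the numbers above are the underlying summaries.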
Bivariate Analysis
• In Univariate Analysis, we study one variable at a time, like we did in earlier slides, but if we
want to find if there is any relation between two variables we need to perform bivariate
analysis.
• Bivariate analysis can be performed for any combination of categorical and continuous variables.
• Different methods are used to tackle different combinations during analysis process.
• Possible Combinations are:-
– Continuous & Continuous
– Continuous & Categorical
– Categorical & Categorical
Bivariate Analysis - Continuous & Continuous
• Scatter plot
– Used to find the relationship between two variables.
– The pattern of a scatter plot indicates the relationship between the variables, but does not indicate its strength.
– The relationship can be linear or non-linear.
– To find the strength of the relationship, we use correlation (-1 = perfect negative linear correlation, +1 = perfect positive linear correlation, 0 = no correlation).
– This gives us an idea of the relation and pattern between two variables in the dataset.
Bivariate Analysis - Categorical & Categorical
Methods to identify the relationship between two categorical
variables.
• Two-way table: we create a two-way table of count and count%, where the rows and columns represent the categories of their respective variables.
• Stacked Column Chart: one of the most visual forms of the two-way table.
O = observed frequency, E = expected frequency
Each cell of the two-way table (Education x Gender) contributes (O - E)^2 / E.
Adding up the four cell contributions, we get the chi-square value:
Chi-square = 0.342857 + 0.3 + 0.514286 + 0.45 = 1.607143
The p-value corresponding to this chi-square value with 1 df and alpha = 0.05 is 0.2049.
Since the p-value > 0.05, we fail to reject the null hypothesis and conclude that Education and Gender are independent variables.
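The calculation above can be reproduced with Python's standard library; for 1 degree of freedom, the chi-square survival function has a closed form via the complementary error function, so no statistics package is needed:

```python
import math

# Cell contributions (O - E)^2 / E from the two-way table in the slides
contributions = [0.342857, 0.3, 0.514286, 0.45]
chi_sq = sum(contributions)

# For df = 1, p-value = P(X > chi_sq) = erfc(sqrt(chi_sq / 2))
p_value = math.erfc(math.sqrt(chi_sq / 2))

print(f"chi-square = {chi_sq:.6f}, p-value = {p_value:.4f}")
```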
Missing Values
Missing Value Treatment
• There may be situations where there are missing values in your data.
• As a rule of thumb: if less than 1% of the data is missing, it will barely affect the result; 1-5% is manageable with simple methods; for 5-15%, more complex techniques are needed to handle the missing data; above 15%, missing data will seriously hinder the results of any data mining technique.
• Handling such values is very important, as ignoring them can lead to wrong results.
Example 1 - Location is mostly missing:

Obs | Age | Salary (in 1000s) | Location
1 | 24 | 15 | North
2 | 28 | 20 | NA
3 | 36 | 45 | NA
4 | 30 | 35 | NA
5 | 25 | 20 | South
6 | 35 | 54 | NA
7 | 41 | 60 | NA
8 | 38 | 52 | NA
9 | 28 | 26 | NA
10 | 29 | 25 | NA

Example 2 - only a few Salary values are missing:

Obs | Age | Salary (in 1000s)
1 | 24 | 15
2 | 28 | 20
3 | 36 | 45
4 | 30 | NA
5 | 25 | 20
6 | 35 | 54
7 | 41 | NA
…
1000 | 24 | 18
• Single Imputation: in single imputation, we replace the missing values with the mean, median, or mode.
If the variable is continuous, replace the missing values with the mean or median:
– if the variable is roughly normally distributed (in particular, not noticeably skewed), choose the mean;
– if the data is skewed, median imputation is suggested.
If the variable is categorical, replace the missing values with the most frequently occurring value of that variable, i.e. the mode.
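As a sketch, mean imputation for the Rooms variable from the sample dataset could look like this (None marks the 3 missing values):

```python
import statistics

# Rooms column from the sample dataset; None marks the 3 missing values
rooms = [8, 7, 7, 8, None, 7, 8, 5, 6, None, 7, 9, 8, None, 6]

observed = [r for r in rooms if r is not None]
mean_rooms = statistics.mean(observed)

# Mean imputation: fill each gap with the mean of the observed values
imputed = [r if r is not None else round(mean_rooms, 6) for r in rooms]
print(imputed)
```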
Single Imputation - by Mean/Median/Mode

Name | Age | Gender | Education | Salary | AppraisedValue | Location | Landacres | HouseSizeSqft | Rooms | Baths | Garage
Tony | 25 | M | Grad | 50 | 700 | Glen Cove | 0.2297 | 2448 | 8 | 3.5 | 2
Harret | 52 | F | PostGrad | 95 | 364 | Glen Cove | 0.2192 | 1942 | 7 | 2.5 | 1
Jane | 26 | F | PostGrad | 65 | 600 | Glen Cove | 0.163 | 2073 | 7 | 3 | 2
Rose | 45 | F | Grad | 100 | 548.4 | Long Beach | 0.4608 | 2707 | 8 | 2.5 | 1
John | 42 | M | Grad | 77 | 405.9 | Long Beach | 0.2549 | 2042 | | 1.5 | 1
Mark | 62 | M | PostGrad | 118 | 374.1 | Glen Cove | 0.229 | 2089 | 7 | 2 | 0
Bruce | 51 | M | Grad | 101 | 600 | Glen Cove | 0.1714 | 1344 | 8 | 1 | 0
Steve | 43 | M | Grad | 108 | 299 | Roslyn | 0.175 | 1120 | 5 | 1.5 | 0
Carol | 24 | F | PostGrad | 51 | 471 | Roslyn | 0.213 | 1817 | 6 | 2 | 0
Henry | 25 | M | PostGrad | 68 | 510.7 | Roslyn | 0.1377 | 2496 | | 2 | 1
Donald | 41 | M | Grad | 86 | 517.7 | Roslyn | 0.2497 | 1615 | 7 | 2 | 1
Maria | 51 | F | Grad | 122 | 1200 | Roslyn | 0.4116 | 4067 | 9 | 4 | 1
Janet | 49 | F | PostGrad | 112 | 700 | Roslyn | 0.3372 | 3130 | 8 | 3 | 1
Sophia | 32 | F | Grad | 85 | 374.8 | Roslyn | 0.1503 | 1423 | | 2 | 0
Jeffery | 37 | M | Grad | 90 | 543 | Roslyn | 0.2348 | 1799 | 6 | 2.5 | 1
We can see that the variable Rooms has 3 missing values; we need to find a way to replace them.
[Histogram of Rooms (non-missing values)]
Looking at the histogram of the variable Rooms (non-missing values), we see that it is roughly normally distributed. Hence we can impute the missing values with the mean of the non-missing data.
Name | Age | Gender | Education | Salary | AppraisedValue | Location | Landacres | HouseSizeSqft | Rooms | Baths | Garage
Tony | 25 | M | Grad | 50 | 700.0 | Glen Cove | 0.2297 | 2448 | 8.000000 | 3.5 | 2
Harret | 52 | F | PostGrad | 95 | 364.0 | Glen Cove | 0.2192 | 1942 | 7.000000 | 2.5 | 1
Jane | 26 | F | PostGrad | 65 | 600.0 | Glen Cove | 0.1630 | 2073 | 7.000000 | 3.0 | 2
Rose | 45 | F | Grad | 100 | 548.4 | Long Beach | 0.4608 | 2707 | 8.000000 | 2.5 | 1
John | 42 | M | Grad | 77 | 405.9 | Long Beach | 0.2549 | 2042 | 7.166667 | 1.5 | 1
Mark | 62 | M | PostGrad | 118 | 374.1 | Glen Cove | 0.2290 | 2089 | 7.000000 | 2.0 | 0
Bruce | 51 | M | Grad | 101 | 600.0 | Glen Cove | 0.1714 | 1344 | 8.000000 | 1.0 | 0
Steve | 43 | M | Grad | 108 | 299.0 | Roslyn | 0.1750 | 1120 | 5.000000 | 1.5 | 0
Carol | 24 | F | PostGrad | 51 | 471.0 | Roslyn | 0.2130 | 1817 | 6.000000 | 2.0 | 0
Henry | 25 | M | PostGrad | 68 | 510.7 | Roslyn | 0.1377 | 2496 | 7.166667 | 2.0 | 1
Donald | 41 | M | Grad | 86 | 517.7 | Long Beach | 0.2497 | 1615 | 7.000000 | 2.0 | 1
Maria | 51 | F | Grad | 122 | 1200.0 | Long Beach | 0.4116 | 4067 | 9.000000 | 4.0 | 1
Janet | 49 | F | PostGrad | 112 | 700.0 | Roslyn | 0.3372 | 3130 | 8.000000 | 3.0 | 1
Sophia | 32 | F | Grad | 85 | 374.8 | Roslyn | 0.1503 | 1423 | 7.166667 | 2.0 | 0
Jeffery | 37 | M | Grad | 90 | 543.0 | Long Beach | 0.2348 | 1799 | 6.000000 | 2.5 | 1
Constant: this choice allows us to provide our own default value to fill in the gaps. This might be an integer or real number for numeric variables, or a special marker or a category other than the majority category for categorical variables.
Closest fit: the closest fit algorithm replaces a missing value with the value of the same attribute in the most similar case. The main idea is to search the dataset for cases similar to the one with the missing attribute value and take the value from the closest match.
Treating Missing Values :: Closest Fit

Area (sq. ft) | Rent
275 | 8000
500 | 10000
850 | 12000
900 | (missing)
1000 | 17000
1225 | 19000
1500 | 20000

Note: this method is more useful for a small dataset.
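A minimal sketch of the closest-fit idea on the Area/Rent table above: fill the missing Rent for Area = 900 from the record whose Area is most similar.

```python
# Records with known Rent, as (area_sq_ft, rent) pairs from the table
data = [(275, 8000), (500, 10000), (850, 12000), (1000, 17000),
        (1225, 19000), (1500, 20000)]

def closest_fit_rent(area):
    # Pick the record with the most similar Area and reuse its Rent
    nearest = min(data, key=lambda row: abs(row[0] - area))
    return nearest[1]

print(closest_fit_rent(900))  # nearest Area is 850, so Rent = 12000
```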
Outliers
• What is an Outlier?
An outlier is an observation that appears far away from, and diverges from, the overall pattern in a sample.
• Outliers can drastically change the results of the data analysis and
statistical modeling. There are numerous unfavorable impacts of
outliers in the data set:
o They increase the error variance and reduce the power of statistical tests
o If the outliers are non-randomly distributed, they can decrease
normality
o They can bias or influence estimates that may be of substantive
interest
Causes of outliers
• Data Entry Errors - Human errors such as errors caused during data collection,
recording, or entry can cause outliers in data.
• Measurement Error - When the measurement instrument used turns out to be
faulty.
• Intentional Error - This is commonly found in self-reported measures that involve sensitive data.
• Data Processing Error - When data is collected and combined from different sources.
• Sampling Error - When data that is not part of the intended sample is included.
• Natural Outlier - When an outlier is not artificial (due to error), it is a natural
outlier.
Example
Let's examine what can happen to a data set with outliers. For the sample data set:
1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4
We find the following mean, median, mode, and standard deviation:
Mean = 2.58
Median = 2.5
Mode = 2
Standard Deviation = 1.08
Outliers often have a significant effect on the mean and standard deviation: adding even one extreme value to this set would shift both noticeably, while barely moving the median and mode. Because of this, we must take steps to identify and treat outliers in our data sets.
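These figures can be verified, and the effect of an outlier demonstrated, with Python's statistics module (the appended value 120 is a hypothetical outlier, not part of the original data):

```python
import statistics

data = [1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4]
print(round(statistics.mean(data), 2))    # 2.58
print(statistics.median(data))            # 2.5
print(statistics.mode(data))              # 2
print(round(statistics.stdev(data), 2))   # 1.08

# Append a hypothetical outlier and recompute: the mean jumps,
# while the median barely moves
with_outlier = data + [120]
print(round(statistics.mean(with_outlier), 2))
print(statistics.median(with_outlier))
```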
Example
Suppose you want to take admission to an MBA school, and your criterion for selecting the best MBA school is the average package received by its students.
School 1
Student size: 20
Packages (in lakhs p.a.): 10, 9, 7, 10, 5, 5, 9, 9, 8, 5, 8, 9, 7, 9, 9, 10, 8, 5, 8, 10
Avg. Package = 8
School 2
Student size: 20
Packages (in lakhs p.a.): 7, 6, 8, 10, 10, 10, 9, 50, 9, 7, 50, 8, 7, 10, 7, 8, 8, 10, 6, 8
Avg. Package = 12.4
Looking at the numbers we would decide that School 2 is the best, but the average package of School 2 has gone up just because two students got hired by an MNC (say Google).
These are outliers, which are skewing our average on the higher side.
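A quick check in Python shows why the median is a more robust selection criterion than the mean here (the package lists follow the example above):

```python
import statistics

school1 = [10, 9, 7, 10, 5, 5, 9, 9, 8, 5, 8, 9, 7, 9, 9, 10, 8, 5, 8, 10]
school2 = [7, 6, 8, 10, 10, 10, 9, 50, 9, 7, 50, 8, 7, 10, 7, 8, 8, 10, 6, 8]

# The two 50-lakh packages inflate School 2's mean...
print(statistics.mean(school1), statistics.mean(school2))
# ...but the median is barely affected by these outliers
print(statistics.median(school1), statistics.median(school2))
```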
Outlier Detection - Viz
• Outliers can be detected using boxplots and scatter plots
• In our data, we plot a scatter plot of AppraisedValue against Baths (bivariate analysis) and also a boxplot of AppraisedValue (univariate analysis)
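A boxplot flags points outside Q1 - 1.5*IQR and Q3 + 1.5*IQR; the same rule can be applied directly to the AppraisedValue column (a sketch using the sample dataset):

```python
import statistics

# AppraisedValue column from the sample dataset
appraised = [700, 364, 600, 548.4, 405.9, 374.1, 600, 299, 471,
             510.7, 517.7, 1200, 700, 374.8, 543]

# Boxplot rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, _, q3 = statistics.quantiles(appraised, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in appraised if v < lower or v > upper]
print(outliers)  # Maria's house at 1200 is flagged
```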
Feature Engineering
• Feature engineering is the science (and art) of extracting more information from existing
data.
• Example
– Several variables could be generated from a date variable i.e. Day, month, year, day
of the week etc. This information helps a lot in getting idea about different
characteristics of the data under study
• It can be divided into two steps,
– Variable Transformation
– Variable Creation
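The date example above can be sketched with Python's standard library (the purchase_date variable and the chosen date are hypothetical):

```python
from datetime import date

# A hypothetical purchase_date variable
purchase_date = date(2023, 7, 14)

# Several new variables derived from the single date variable
features = {
    "day": purchase_date.day,
    "month": purchase_date.month,
    "year": purchase_date.year,
    "day_of_week": purchase_date.strftime("%A"),
    "is_weekend": purchase_date.weekday() >= 5,  # Saturday/Sunday
}
print(features)
```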
Feature Engineering – Variable Transformation
• In data modelling, transformation refers to the replacement of a variable by a function of it. For instance, replacing a variable x by its square root, cube root, or logarithm is a transformation.
• When do we transform?
– When we want to change the scale of a variable or standardize its values for better understanding. This kind of transformation is a must if you have data in different scales; it does not change the shape of the variable's distribution.
– When a linear relationship between variables is easier to comprehend than a non-linear or curved relation.
– Variables can be transformed by applying functions like log, square, cube, etc. These transformations help in reducing skewness: for a right-skewed distribution, we take the square root, cube root, or logarithm of the variable, and for a left-skewed one, we take the square.
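A small illustration with made-up right-skewed data: taking the logarithm noticeably reduces skewness. The skewness function below is the population-moment form (mean of cubed standardized deviations), a common sketch rather than any particular library's definition:

```python
import math

def skewness(xs):
    # Population skewness: mean of cubed standardized deviations
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return sum(((x - mean) / sd) ** 3 for x in xs) / n

# A hypothetical right-skewed variable
values = [1, 1, 2, 2, 3, 3, 4, 5, 8, 20, 60]
logged = [math.log(x) for x in values]

# Skewness drops substantially after the log transform
print(round(skewness(values), 2), round(skewness(logged), 2))
```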
Feature Engineering - Variable and Dummy Variable Creation

Name | Age | Gender | Education | Salary | AppraisedValue | Location | Landacres | HouseSizeSqft | Rooms | Baths | Garage | Glen Cove | Long Beach | Roslyn
Rose | 45 | F | Grad | 100 | 548.4 | Long Beach | 0.4608 | 2707 | 8.000000 | 2.5 | 1 | 0 | 1 | 0
Mark | 62 | M | PostGrad | 118 | 374.1 | Glen Cove | 0.2290 | 2089 | 7.000000 | 2.0 | 0 | 1 | 0 | 0
Bruce | 51 | M | Grad | 101 | 600.0 | Glen Cove | 0.1714 | 1344 | 8.000000 | 1.0 | 0 | 1 | 0 | 0
Maria | 51 | F | Grad | 122 | 1200.0 | Long Beach | 0.4116 | 4067 | 9.000000 | 4.0 | 1 | 0 | 1 | 0

(The last three columns are dummy variables created from Location.)
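Dummy variable creation for Location can be sketched in plain Python; each category becomes its own 0/1 column:

```python
# One-hot (dummy) encoding of the Location variable - a minimal sketch
locations = ["Long Beach", "Glen Cove", "Glen Cove", "Long Beach"]
categories = ["Glen Cove", "Long Beach", "Roslyn"]   # all categories in the dataset

# One 0/1 column per category; exactly one 1 per row
dummies = [[1 if loc == cat else 0 for cat in categories] for loc in locations]
for loc, row in zip(locations, dummies):
    print(loc, row)
```

In practice a library routine (such as a one-hot encoder) would be used, but the logic is exactly this.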
•If the response variable is not a linear function of the predictors, try a different
function. For example, polynomial regression involves transforming one or more
predictor variables while remaining within the multiple linear regression framework.
•For another example, applying a logarithmic transformation to the response variable
also allows for a nonlinear relationship between the response and the predictors while
remaining within the multiple linear regression framework.
•Transforming the response and/or predictor variables therefore has the potential to remedy a number of model problems.
•The use of transformation will become clearer later in the course, when we deal with model building.
Transforming a variable involves using a mathematical operation to change its measurement
scale.
In regression, a transformation to achieve linearity is a special kind of nonlinear transformation: one that increases the linear relationship between two variables.
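As an illustration with made-up data: if y = 3·x², taking logs of both variables makes the relationship linear, and an ordinary least-squares fit on the transformed data recovers the exponent and coefficient:

```python
import math

# Hypothetical data following y = 3 * x^2 exactly
xs = [1, 2, 3, 4, 5]
ys = [3 * x ** 2 for x in xs]

# Transform to achieve linearity: log y = log a + b * log x
lx = [math.log(x) for x in xs]
ly = [math.log(y) for y in ys]

# Ordinary least-squares slope and intercept on the transformed data
n = len(lx)
mx, my = sum(lx) / n, sum(ly) / n
b = sum((u - mx) * (v - my) for u, v in zip(lx, ly)) / sum((u - mx) ** 2 for u in lx)
log_a = my - b * mx
print(round(b, 3), round(math.exp(log_a), 3))  # recovers b = 2 and a = 3
```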