Data Analytics and Visualization Unit-II
Simple Linear Regression
The mathematical equation for the simple linear regression model is shown below.
y=ax+b
• where y is the dependent variable
• x is an independent variable
• a, b are the regression coefficients
a is also called the slope and b the intercept of the linear equation, since the simple linear regression equation has the same form as the slope-intercept equation of a line, y = mx + c. The slope of the equation may be positive or negative (i.e., the value of a may be positive or negative).
Let us now look at an example of fitting the linear regression curve y = b + ax to the provided data.
x:  8   5   4   6   7   9  10   3   2  12
y: 11  10   4   8   9  13  15   6  12   7
In order to fit the linear regression equation we need to find the values of a (slope) and b (intercept). We can find these values by using the normal equations of linear regression.
The normal equations of the linear regression equation y = b + ax are:
∑ y = n*b + a ∑ x
∑ x*y = b ∑ x + a ∑ x^2
where n is the total number of observations in the provided data. For the data given above, n = 10.
Let us now calculate the value of a and b by solving the normal equations of the linear
regression curve.
x     y     x^2    xy
8     11    64     88
5     10    25     50
4     4     16     16
6     8     36     48
7     9     49     63
9     13    81     117
10    15    100    150
3     6     9      18
2     12    4      24
12    7     144    84
∑x = 66, ∑y = 95, ∑x^2 = 528, ∑xy = 658
Substituting these totals into the normal equations:
95 = 10b + 66a
658 = 66b + 528a
Solving the two equations gives a ≈ 0.335 and b ≈ 7.286, so the fitted line is y ≈ 7.286 + 0.335x.
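The same fit can be reproduced in R with lm(), which solves these normal equations internally. A minimal sketch using the data above (the variable names are chosen here for illustration):
R
# simple linear regression on the worked example above
independentX <- c(8, 5, 4, 6, 7, 9, 10, 3, 2, 12)
dependentY <- c(11, 10, 4, 8, 9, 13, 15, 6, 12, 7)
# fit y = b + a*x
simpleLinear <- lm(dependentY ~ independentX)
# prints the intercept b (about 7.286) and slope a (about 0.335),
# matching the hand calculation above
coef(simpleLinear)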
Multiple Linear Regression
Let us now fit the multiple linear regression equation y = b + a1*x1 + a2*x2 to the following data.
x1:  1   2   3   4   5
x2:  8   6   4   2  10
y:   3   7   5   9  11
In order to fit the multiple linear regression curve we need the normal equations to calculate the coefficients and the intercept. For y = b + a1*x1 + a2*x2 the normal equations are:
∑y = n*b + a1 ∑x1 + a2 ∑x2
∑x1*y = b ∑x1 + a1 ∑x1^2 + a2 ∑x1*x2
∑x2*y = b ∑x2 + a1 ∑x1*x2 + a2 ∑x2^2
x1    x2    y     x1^2   x2^2   x1*x2   x1*y   x2*y
1     8     3     1      64     8       3      24
2     6     7     4      36     12      14     42
3     4     5     9      16     12      15     20
4     2     9     16     4      8       36     18
5     10    11    25     100    50      55     110
∑x1 = 15, ∑x2 = 30, ∑y = 35, ∑x1^2 = 55, ∑x2^2 = 220, ∑x1*x2 = 90, ∑x1*y = 123, ∑x2*y = 214
Substituting these totals (with n = 5) into the normal equations:
35 = 5b + 15a1 + 30a2
123 = 15b + 55a1 + 90a2
214 = 30b + 90a1 + 220a2
Solving gives a1 = 1.8, a2 = 0.1 and b = 1, so the fitted equation is y = 1 + 1.8x1 + 0.1x2.
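In R the same fit is obtained with lm(). A minimal sketch using the table data above (variable names chosen for illustration):
R
# multiple linear regression on the worked example above
independentX1 <- c(1, 2, 3, 4, 5)
independentX2 <- c(8, 6, 4, 2, 10)
dependentY <- c(3, 7, 5, 9, 11)
multipleLinear <- lm(dependentY ~ independentX1 + independentX2)
# prints intercept 1.0 and coefficients 1.8 and 0.1,
# matching the hand calculation above
coef(multipleLinear)
Note that the summary output below comes from a different three-variable example, lm(dependentY ~ independentX1 + independentX2 + independentX3), whose underlying data is not shown here.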
Output:
Call:
lm(formula = dependentY ~ independentX1 + independentX2 + independentX3)
Residuals:
Min 1Q Median 3Q Max
-21.862 -2.466 2.124 6.983 10.232
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 38.76188 35.43412 1.094 0.316
independentX1 0.46033 1.00476 0.458 0.663
independentX2 -0.09301 0.63260 -0.147 0.888
independentX3 -0.27250 1.55802 -0.175 0.867
Residual standard error: 12.22 on 6 degrees of freedom
Multiple R-squared: 0.04444, Adjusted R-squared: -0.4333
F-statistic: 0.093 on 3 and 6 DF, p-value: 0.9612
Polynomial Regression
Polynomial regression analysis is a non-linear regression analysis. It allows flexible curve fitting by fitting a polynomial equation to the data. Polynomial regression is an extension of simple linear regression, obtained by adding extra terms in which the independent variable is raised to higher powers.
The mathematical expression for polynomial regression analysis is shown below.
y = a0 + a1*x + a2*x^2 + ... + an*x^n
• where y is the dependent variable
• x is the independent variable
• a0, a1, a2, ..., an are the coefficients of the independent variable.
Let us now look at an example of fitting a polynomial regression curve to the provided data.
x: 10  12  15  23  20
y: 14  23  25  21   7
Let us now fit a second degree polynomial curve for the above provided information.
In order to fit the polynomial regression curve we need the normal equations for the second degree polynomial. We know the second degree polynomial can be represented as y = a0 + a1*x + a2*x^2.
To fit the above second degree equation we need to calculate the coefficient values a0, a1, and a2 using the normal equations.
The normal equations for the second degree polynomial are:
∑y = n*a0 + a1∑x + a2 ∑x^2
∑xy = a0∑x + a1∑x^2 + a2 ∑x^3
∑x^2y = a0∑x^2 + a1∑x^3 + a2 ∑x^4
where n is the total number of observations in the provided data. For the data given above, n = 5.
Let us now calculate the values of a0,a1 and a2.
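A second degree fit can also be obtained directly in R. This is a minimal sketch using the data above; raw = TRUE makes poly() fit the coefficients a0, a1, a2 of the equation directly:
R
# second degree polynomial regression on the worked example above
independentX <- c(10, 12, 15, 23, 20)
dependentY <- c(14, 23, 25, 21, 7)
polynomial <- lm(dependentY ~ poly(independentX, degree = 2, raw = TRUE))
summary(polynomial)
The summary output below, by contrast, comes from a third degree fit on a different seven-point dataset that is not shown here (poly() without raw = TRUE uses orthogonal polynomial terms).
Output: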
Call:
lm(formula = dependentY ~ poly(independentX, degree = 3))
Residuals:
1 2 3 4 5 6 7
-0.4872 0.6943 1.1420 -1.6521 -1.0555 1.7218 -0.3632
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 22.0000 0.6533 33.673 5.76e-05 ***
poly(independentX, degree = 3)1 -20.8398 1.7286 -12.056 0.00123 **
poly(independentX, degree = 3)2 1.1339 1.7286 0.656 0.55866
poly(independentX, degree = 3)3 1.2054 1.7286 0.697 0.53578
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.729 on 3 degrees of freedom
Multiple R-squared: 0.9799, Adjusted R-squared: 0.9598
F-statistic: 48.76 on 3 and 3 DF, p-value: 0.004808
Exponential Regression
Exponential regression is a non-linear type of regression. Exponential regression can be expressed in two ways; let us discuss both types in detail with examples. Exponential regression is used in fields such as finance, biology, and physics. The mathematical expression for the first form of exponential regression is shown below.
y=ae^(bx)
• where y is dependent variable
• x is independent variable
• a , b are the regression coefficients.
While fitting the exponential curve, we can convert the above equation into the slope-intercept form of a straight line (simple linear regression) by applying "ln" (logarithm with base e) on both sides of the equation y = ae^(bx).
By applying ln on both sides we get:
• ln(y) = ln(ae^(bx)) → ln(y) = ln(a) + ln(e^(bx))
• ln(y) = ln(a) + bx
We can compare the above equation with Y = A + BX,
where Y = ln(y), X = x, A = ln(a), B = b, and hence a = e^A and b = B.
Normal equations will be
•∑ Y = n*A + B ∑ X
•∑ X*Y = A ∑ X + B ∑ X^2
Now let us try to fit an exponential regression for the given data
x 1 5 7 9 12
y 10 15 12 15 21
x     y     X = x   Y = ln(y)   XY       X^2
1     10    1       2.302       2.302    1
5     15    5       2.708       13.540   25
7     12    7       2.484       17.388   49
9     15    9       2.708       24.372   81
12    21    12      3.044       36.528   144
∑X = 34, ∑Y = 13.246, ∑XY = 94.130, ∑X^2 = 300
Substituting into the normal equations with n = 5 gives B ≈ 0.059 and A ≈ 2.248, so b = B ≈ 0.059 and a = e^A ≈ 9.47. The fitted curve is y ≈ 9.47 e^(0.059x).
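In R, this first form is fitted by applying the natural log to y inside lm(). A minimal sketch using the data above:
R
# exponential regression, first form y = a*e^(bx)
independentX <- c(1, 5, 7, 9, 12)
dependentY <- c(10, 15, 12, 15, 21)
exponential <- lm(log(dependentY, exp(1)) ~ independentX)
summary(exponential)
# recover a and b on the original scale: a = e^A, b = B
exp(coef(exponential)[1])   # a, about 9.47
coef(exponential)[2]        # b, about 0.059
The output below, however, comes from a different seven-point dataset that is not shown here; its first two printed values are the recovered a = e^A ≈ 66.54 and the slope b ≈ -0.1186.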
Output:
(Intercept)
66.54395
independentX
-0.1185602
Call:
lm(formula = log(dependentY, exp(1)) ~ independentX)
Residuals:
1 2 3 4 5 6 7
-0.108554 0.033256 0.082823 -0.016529 -0.003329 0.116008 -0.103675
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.19786 0.10862 38.65 2.19e-07 ***
independentX -0.11856 0.01026 -11.55 8.53e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.09406 on 5 degrees of freedom
Multiple R-squared: 0.9639, Adjusted R-squared: 0.9567
F-statistic: 133.4 on 1 and 5 DF, p-value: 8.527e-05
The second form of exponential regression is y = a*b^x. Applying log10 on both sides gives log10(y) = log10(a) + x*log10(b), which can be compared with Y = A + BX, where Y = log10(y), X = x, A = log10(a) and B = log10(b), so that a = 10^A and b = 10^B. The same normal equations as above apply, with the table now built from Y = log10(y). For data such as x = 2, 3, 4, 5, 6 and the corresponding y values, the table has the columns:
x | y | X = x | Y = log10(y) | XY | X^2
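In R, the second form is fitted with log10(). A minimal sketch; the dataset behind the output below is not shown in the source, so hypothetical placeholder values are used here:
R
# exponential regression, second form y = a*b^x
# NOTE: hypothetical placeholder data; the original dataset is not shown
independentX <- c(2, 3, 4, 5, 6)
dependentY <- c(5, 8, 12, 20, 33)
exponential2 <- lm(log10(dependentY) ~ independentX)
summary(exponential2)
# recover a and b on the original scale: a = 10^A, b = 10^B
10^coef(exponential2)[1]   # a
10^coef(exponential2)[2]   # b
In the output below, the first two printed values are the recovered a = 10^A ≈ 66.54 and b = 10^B ≈ 0.888.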
Output:
(Intercept)
66.54395
independentX
0.8881984
Call:
lm(formula = log10(dependentY) ~ independentX)
Residuals:
1 2 3 4 5 6 7
-0.047144 0.014443 0.035970 -0.007178 -0.001446 0.050382 -0.045026
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.823109 0.047171 38.65 2.19e-07 ***
independentX -0.051490 0.004457 -11.55 8.53e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.04085 on 5 degrees of freedom
Multiple R-squared: 0.9639, Adjusted R-squared: 0.9567
F-statistic: 133.4 on 1 and 5 DF, p-value: 8.527e-05
Logistic Regression
Logistic regression can be used for classification as well as regression. It models the probability p of a binary outcome, and its equation can be denoted in two equivalent ways, as shown below.
p = 1 / (1 + e^-(β0 + β1x))
ln(p / (1 - p)) = β0 + β1x
The second (logit) form shows how the logistic regression equation can be solved using a linear regression representation: the log-odds are a linear function of the independent variable.
• where p is the probability of the event (the dependent outcome)
• x is the independent variable
• β0, β1, ... are the constants/regression coefficients
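The source gives no code for this section, but a minimal logistic regression sketch in R, with hypothetical data, might look like the following. glm() with family = binomial fits the logit form:
R
# logistic regression sketch
# NOTE: hypothetical placeholder data; binary outcome coded 0/1
independentX <- c(1, 2, 3, 4, 5, 6, 7, 8)
dependentY <- c(0, 0, 0, 1, 0, 1, 1, 1)
logistic <- glm(dependentY ~ independentX, family = binomial)
summary(logistic)
# predicted probability of the event for a new observation x = 4.5
predict(logistic, data.frame(independentX = 4.5), type = "response")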
Logarithmic Regression
Logarithmic regression is another non-linear regression, in which the dependent variable is modelled as a linear function of the natural logarithm of the independent variable:
y = a + b*ln(x)
• where y is the dependent variable
• x is the independent variable
• a, b are the regression coefficients
R
# logarithmic regression
# storing dependent and independent variables
independentX <- c(10, 20, 30, 40, 50, 60, 70, 80, 90)
dependentY <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
# fitting the logarithmic regression equation
logarithmic <- lm(dependentY ~ log(independentX, exp(1)))
# printing the summary of the result
summary(logarithmic)
Output:
Call:
lm(formula = dependentY ~ log(independentX, exp(1)))
Residuals:
1 2 3 4 5 6 7
-2.4883 1.7209 2.5819 -0.6370 -0.5949 0.9843 -1.5668
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 69.972 4.710 14.86 2.5e-05 ***
log(independentX, exp(1)) -21.426 2.076 -10.32 0.000147 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2 on 5 degrees of freedom
Multiple R-squared: 0.9552, Adjusted R-squared: 0.9462
F-statistic: 106.5 on 1 and 5 DF, p-value: 0.000147
Steps in Regression Analysis
1. First, identify the problem, objective, or research question for which regression analysis is to be applied.
2. Collect the relevant data for the regression analysis, and make sure the data is free from errors and missing values.
3. Perform exploratory data analysis (EDA), including data visualization and statistical summaries, to understand the characteristics of the data.
4. Select the independent variables.
5. After selecting the independent variables, perform data preprocessing, which includes handling missing data and outliers.
6. Build the model based on the type of regression being performed.
7. Estimate the parameters (coefficients) of the regression model using suitable estimation methods.
8. After calculating the parameters, check whether the model is a good fit for the data (i.e., evaluate the performance of the model).
9. Finally, use the model on unseen data.
Applications of regression analysis
Regression analysis has applications in many fields, such as economics, finance, real estate, healthcare, marketing, business, science, education, psychology, sports analysis, and agriculture. Let us now discuss a few applications of regression analysis.
1. Regression analysis is used to predict stock prices from past data and to analyze the relationship between interest rates and consumer spending.
2. It can be used to analyze the impact of price changes on product demand and to predict sales based on advertising expenditure.
3. It can be used in real estate to predict the value of a property based on its location.
4. Regression is also used in weather forecasting.
5. It is used to predict crop yield based on weather conditions and on the impact of fertilizers and irrigation on plants.
6. It can be used to analyze product quality and model the relationship between manufacturing variables and product quality.
7. It can be used to predict the performance of sports players based on historical data and to assess the impact of coaching strategies on team success.
Advantages of regression analysis
1. It provides insights into how one or more independent variables relate to the dependent variable.
2. It helps in analyzing the model and describes the relationships among the variables.
3. It supports forecasting and decision making by predicting the dependent variable from the independent variables.
4. Regression analysis can help identify the most important predictor among all the variables.
5. It gives information about the strength and the direction (positive or negative) of the relationship between the variables.
6. The goodness of fit can be assessed, and potential issues with the model identified, using diagnostic tools such as residual analysis.
Disadvantages of regression analysis
1. Regression analysis is sensitive to outliers, which can strongly influence the coefficient estimates.
2. Regression analysis depends on several assumptions, such as linearity and normality of residuals, and the reliability of the results suffers when these assumptions are violated.
3. Regression analysis depends heavily on the quality of the data; the results will be inaccurate and unreliable when the data is biased.
4. Regression analysis cannot provide accurate results for extremely complex relationships.
5. In regression analysis, multicollinearity can inflate the standard errors, making it a challenge to identify the contribution of each variable in the data.
What is multivariate analysis?
In data analytics, we look at different variables (or factors) and how they might impact one another. For example, in marketing, you might look at how the variable "money spent on advertising" impacts the variable "number of sales." In the healthcare sector, you might want to explore whether there's a correlation between "weekly hours of exercise" and "cholesterol level." This helps us to understand why certain outcomes occur, which in turn allows us to make informed predictions and decisions.
Multivariate analysis encompasses all statistical techniques that are used to analyze more than two variables at once. The aim is to find patterns and correlations between several variables simultaneously, allowing for a much deeper, more complex understanding of a given scenario than you'll get with bivariate analysis.
1. An example of multivariate analysis
Let's imagine you're interested in the relationship between a person's social media habits and their self-esteem. You could carry out a bivariate analysis, comparing the following two variables:
• How many hours a day a person spends on social media
• Their self-esteem score
However, this would be a simplistic approach, as it fails to consider that, in reality, self-esteem is a complex concept. It's likely impacted by many different factors, not just how many hours a person spends on Instagram. You might also want to consider factors such as age, employment status, how often a person exercises, and relationship status (for example). In order to deduce the extent to which each of these variables correlates with self-esteem, and with each other, you'd need to run a multivariate analysis.
So we know that multivariate analysis is used when you want to explore more than two
variables at once. Now let’s consider some of the different techniques you might use to do
this.
2. Multivariate data analysis techniques and examples
There are many different techniques for multivariate analysis, and they can be divided into
two categories:
•Dependence techniques
•Interdependence techniques
When we use the terms "dependence" and "interdependence," we're referring to different types of relationships within the data.
Dependence methods
Dependence methods are used when one or some of the variables are dependent on others. Dependence looks at cause and effect; in other words, can the values of two or more independent variables be used to explain, describe, or predict the value of another, dependent variable? To give a simple example, the dependent variable of "weight" might be predicted by independent variables such as "height" and "age."
In machine learning, dependence techniques are used to build predictive models. The
analyst enters input data into the model, specifying which variables are independent and
which ones are dependent—in other words, which variables they want the model to
predict, and which variables they want the model to use to make those predictions.
Interdependence methods
Interdependence methods are used to understand the structural makeup and underlying
patterns within a dataset. In this case, no variables are dependent on others, so you’re not
looking for causal relationships. Rather, interdependence methods seek to give meaning to a set of variables or to group them together in meaningful ways.
So: One is about the effect of certain variables on others, while the other is all about the structure of the dataset.
With that in mind, let's consider some useful multivariate analysis techniques. We'll look at:
• Multiple linear regression
• Logistic regression
• Multivariate analysis of variance (MANOVA)
• Factor analysis
• Cluster analysis
Multiple linear regression
Multiple linear regression is a dependence method which looks at the relationship between
one dependent variable and two or more independent variables. A multiple regression
model will tell you the extent to which each independent variable has a linear relationship
with the dependent variable. This is useful as it helps you to understand which factors are most influential on the outcome of interest.
As a data analyst, you could use multiple regression to predict crop growth. In this
example, crop growth is your dependent variable and you want to see how different
factors affect it. Your independent variables could be rainfall, temperature, amount of
sunlight, and amount of fertilizer added to the soil. A multiple regression model would
show you the proportion of variance in crop growth that each independent variable
accounts for.
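A sketch of how such a model might be fitted in R. The variable names and data here are hypothetical, purely to illustrate the shape of the call:
R
# multiple regression sketch for the crop growth example
# NOTE: hypothetical placeholder data
cropData <- data.frame(
  growth = c(2.1, 3.4, 2.8, 4.0, 3.6, 4.4, 3.1, 4.8),
  rainfall = c(10, 14, 12, 16, 15, 18, 13, 19),
  temperature = c(24, 21, 26, 22, 27, 23, 28, 25),
  sunlight = c(6, 8, 7, 9, 8, 10, 7, 11),
  fertilizer = c(1.0, 1.5, 1.2, 2.0, 1.8, 2.2, 1.4, 2.5)
)
cropModel <- lm(growth ~ rainfall + temperature + sunlight + fertilizer, data = cropData)
# the summary shows each independent variable's estimated contribution
summary(cropModel)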
Logistic regression
Logistic regression analysis is used to calculate (and predict) the probability of a binary
event occurring. A binary outcome is one where there are only two possible outcomes;
either the event occurs (1) or it doesn’t (0). So, based on a set of independent variables,
logistic regression can predict how likely it is that a certain scenario will arise. It is also
used for classification. You can learn about the difference between regression and
classification here.
Example of logistic regression:
Let’s imagine you work as an analyst within the insurance sector and you need to predict
how likely it is that each potential customer will make a claim. You might enter a range of
independent variables into your model, such as age, whether or not they have a serious
health condition, their occupation, and so on. Using these variables, a logistic regression
analysis will calculate the probability of the event (making a claim) occurring. Another oft-cited example is the filters used to classify email as "spam" or "not spam."
Multivariate analysis of variance (MANOVA)
Multivariate analysis of variance (MANOVA) is used to measure the effect of multiple independent variables on two or more dependent variables. With MANOVA, it's important to note that the independent variables are categorical, while the dependent variables are metric in nature. A categorical variable takes a fixed number of distinct values; for example, the variable "employment status" could be categorized into certain units, such as "employed full-time," "employed part-time," and "unemployed."
Let’s imagine you work for an engineering company that is on a mission to build a super-
fast, eco-friendly rocket. You could use MANOVA to measure the effect that various design
combinations have on both the speed of the rocket and the amount of carbon dioxide it emits. The categorical independent variables could be engine type (E1, E2, E3), material (M1, M2, M3), and fuel type (F1, F2, F3); the metric dependent variables are speed, measured in miles per hour, and carbon dioxide, measured in parts per million. Using MANOVA, you'd test different combinations (e.g. E1,
M1, and F1 vs. E1, M2, and F1, vs. E1, M3, and F1, and so on) to calculate the effect of all
the independent variables. This should help you to find the optimal design solution for your
rocket.
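In R, the built-in manova() function performs this kind of test. A minimal sketch with hypothetical data for the rocket example (only two of the three design factors are included, to keep it short):
R
# MANOVA sketch for the rocket design example
# NOTE: hypothetical placeholder data
rocket <- data.frame(
  engine = factor(rep(c("E1", "E2", "E3"), each = 6)),
  fuel = factor(rep(c("F1", "F2"), times = 9)),
  speed = c(510, 525, 498, 531, 504, 519,
            560, 548, 572, 555, 566, 543,
            601, 588, 615, 594, 607, 580),
  co2 = c(30, 32, 29, 33, 31, 30,
          27, 26, 28, 25, 27, 26,
          35, 36, 34, 37, 35, 36)
)
# two metric dependent variables, two categorical independent variables
fit <- manova(cbind(speed, co2) ~ engine + fuel, data = rocket)
summary(fit)   # reports Pillai's trace by default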
Factor analysis
Factor analysis is an interdependence technique which seeks to reduce the number of variables in a dataset. If you have too many variables, it can be difficult to find patterns in
your data. At the same time, models created using datasets with too many variables are
susceptible to overfitting. Overfitting is a modeling error that occurs when a model fits too closely to the training data, making it unreliable when applied to new, unseen data.
Factor analysis works by detecting sets of variables which correlate highly with each other.
These variables may then be condensed into a single variable. Data analysts will often
carry out factor analysis to prepare the data for subsequent analyses.
Factor analysis example:
Let’s imagine you have a dataset containing data pertaining to a person’s income,
education level, and occupation. You might find a high degree of correlation among each of
these variables, and thus reduce them to the single factor “socioeconomic status.” You
might also have data on how happy they were with customer service, how much they like
a certain product, and how likely they are to recommend the product to a friend. Each of
these variables could be grouped into the single factor “customer satisfaction” (as long as
they are found to correlate strongly with one another). Even though you’ve reduced
several data points to just one factor, you’re not really losing any information—these
factors adequately capture and represent the individual variables concerned. With your reduced set of factors, you can then carry out further analysis of the data.
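Base R provides factanal() for maximum-likelihood factor analysis. A minimal sketch with hypothetical survey data in which six observed variables are driven by two underlying factors:
R
# factor analysis sketch
# NOTE: hypothetical placeholder data generated from two latent factors
set.seed(42)
latent1 <- rnorm(100)   # e.g. "socioeconomic status"
latent2 <- rnorm(100)   # e.g. "customer satisfaction"
surveyData <- data.frame(
  income     = 2.0 * latent1 + rnorm(100, sd = 0.5),
  education  = 1.8 * latent1 + rnorm(100, sd = 0.5),
  occupation = 1.5 * latent1 + rnorm(100, sd = 0.5),
  service    = 2.0 * latent2 + rnorm(100, sd = 0.5),
  liking     = 1.7 * latent2 + rnorm(100, sd = 0.5),
  recommend  = 1.6 * latent2 + rnorm(100, sd = 0.5)
)
fit <- factanal(surveyData, factors = 2)
# variables that load heavily on the same factor correlate highly
print(fit$loadings)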
Cluster analysis
Another interdependence technique, cluster analysis is used to group similar items within a dataset into clusters. Items within one cluster must be more similar to each other than they are to items in other clusters. This is measured in terms of intracluster and intercluster distance. Intracluster distance is the distance between data points within one cluster; this should be small. Intercluster distance is the distance between data points in different clusters; this should ideally be large. Cluster analysis helps you to understand how data in your sample is distributed, and to find patterns.
For example, as a data analyst working in marketing, you might use cluster analysis to define different customer groups which could benefit from more targeted campaigns. As a healthcare analyst, you might use cluster analysis to identify geographical regions associated with higher or lower cases of certain illnesses. Because it's an interdependence technique, cluster analysis is often carried out in the early stages of data analysis.
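A common starting point for cluster analysis in R is k-means. A minimal sketch with hypothetical customer data:
R
# k-means clustering sketch
# NOTE: hypothetical placeholder data with two natural groups
set.seed(7)
customers <- data.frame(
  annualSpend = c(rnorm(30, mean = 200, sd = 30), rnorm(30, mean = 800, sd = 60)),
  visitsPerMonth = c(rnorm(30, mean = 2, sd = 0.5), rnorm(30, mean = 9, sd = 1.5))
)
# scale the variables so both contribute equally to the distance measure
clusters <- kmeans(scale(customers), centers = 2, nstart = 25)
table(clusters$cluster)   # cluster sizes
clusters$centers          # cluster centres on the scaled variables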
[Figure: example of a cluster plot. Source: Chire, CC BY-SA 3.0 via Wikimedia Commons]
More multivariate analysis techniques
This is just a handful of multivariate analysis techniques used by data analysts and data
scientists to understand complex datasets. If you're keen to explore further, check out the wider literature on multivariate statistics.
Advantages of multivariate analysis
The one major advantage of multivariate analysis is the depth of insight it provides. In
exploring multiple variables, you’re painting a much more detailed picture of what’s
occurring—and, as a result, the insights you uncover are much more applicable to the real
world.
Remember our self-esteem example back in section one? We could carry out a bivariate
analysis, looking at the relationship between self-esteem and just one other factor; and, if
we found a strong correlation between the two variables, we might be inclined to conclude that this single factor is what drives self-esteem. However, we know that self-esteem can't be attributed to one single factor. It's a complex concept; in
order to create a model that we could really trust to be accurate, we’d need to take many
more factors into account. That’s where multivariate analysis really shines; it allows us to
analyze many different factors and get closer to the reality of a given situation.
In this post, we’ve learned that multivariate analysis is used to analyze data containing
more than two variables. To recap, here are some key takeaways:
• Multivariate analysis techniques allow you to gain a deeper understanding of your data and how it relates to real-world scenarios.
• Multivariate techniques can be divided into dependence methods and interdependence methods.
• Useful multivariate analysis techniques include multiple linear regression, logistic regression, MANOVA, factor analysis, and cluster analysis, to name just a few.