
BCDS501

Introduction to Data Analytics and Visualization

UNIT 2 Data Analysis

SYLLABUS

Regression modeling, multivariate analysis, Bayesian modeling,


inference and Bayesian networks, support vector and kernel methods,
analysis of time series: linear systems analysis & nonlinear dynamics,
rule induction, neural networks: learning and generalisation,
competitive learning, principal component analysis and neural
networks, fuzzy logic: extracting fuzzy models from data, fuzzy
decision trees, stochastic search methods
What is regression?
Regression analysis is a supervised learning technique: supervised learning means analyzing or predicting data based on previously available (past) data, using both training data and test data. Regression analysis is one of the statistical methods for analyzing and predicting data, and it is used for predicting quantitative (numerical) outcomes.
In R Programming Language, regression analysis is a statistical model that gives the relationship between dependent and independent variables. Regression analysis is used in many fields such as machine learning, artificial intelligence, data science, economics, finance, real estate, healthcare, marketing, business, science, education, psychology, sports analysis, and agriculture. The main aims of regression analysis are to describe the relationship between the variables, to quantify its nature and strength, and to make predictions based on the model.
Types of regression analysis
We know that regression analysis is a statistical technique that gives the relationship between dependent and independent variables. There are many types of regression analysis; let us discuss each type in detail.
Simple Linear Regression
Simple linear regression is the most basic form of linear regression analysis. In simple linear regression there is only one dependent and one independent variable, so the model has a single predictor. It gives the linear relationship between the dependent and independent variables and is one of the most widely used forms of regression analysis, appearing in weather forecasting, financial analysis, and market analysis. It can be used to predict outcomes, to improve the efficiency of models, and to take measures that prevent model errors.

The mathematical equation for the simple linear regression model is shown below.
y=ax+b
• where y is the dependent variable
• x is the independent variable
• a, b are the regression coefficients
a is also called the slope and b the intercept, since the simple linear regression equation has the slope-intercept form of a line, y = mx + c. The slope a may be positive or negative.
Let us now look at an example of fitting the linear regression line y = b + ax to the provided data.

x    8    5    4    6    7    9    10   3    2    12
y    11   10   4    8    9    13   15   6    12   7

In order to fit the linear regression equation we need to find the values of a (the slope) and b (the intercept). We can find them using the normal equations of linear regression.
The normal equations for the regression line y = b + ax are:
∑y = n*b + a∑x
∑xy = b∑x + a∑x^2
where n is the total number of observations in the provided data; for the data above, n = 10.
Let us now calculate the values of a and b by solving the normal equations of the linear regression line.

x    y    x^2   xy
8    11   64    88
5    10   25    50
4    4    16    16
6    8    36    48
7    9    49    63
9    13   81    117
10   15   100   150
3    6    9     18
2    12   4     24
12   7    144   84

From the above table:

•n = 10, ∑x = 66, ∑y = 95, ∑xy = 658, ∑x^2 = 528
•Now the normal equations become:
•95 = 10*b + 66*a
•658 = 66*b + 528*a
•By solving the above two equations we get a ≈ 0.336 and b ≈ 7.286
•The linear regression equation is y = 7.286 + 0.336x.
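As a quick check, the two normal equations can be solved numerically in R with solve() (a minimal sketch using the sums tabulated above):
#solving the two normal equations of the worked example
M <- matrix(c(10, 66, 66, 528), nrow = 2, byrow = TRUE)
solve(M, c(95, 658)) #returns c(b, a), approximately c(7.286, 0.336)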
Let us now discuss the implementation of linear regression in R (here on a different dataset).
#code
#Storing the independent value X
independentX<-c(5,7,8,10,11,13,16)
#storing the dependent values
dependentY<-c(33,30,28,20,18,16,9)
#performing the regression analysis using the function lm
linearregression<-lm(dependentY~independentX)
#printing the summary of the result
summary(linearregression)
#plotting the regression diagnostics
plot(linearregression)
plot(independentX,dependentY)
Output:
Call:
lm(formula = dependentY ~ independentX)
Residuals:
1 2 3 4 5 6 7
-0.3690 1.1786 1.4524 -2.0000 -1.7262 0.8214 0.6429
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 44.7381 1.7665 25.32 1.79e-06 ***
independentX -2.2738 0.1669 -13.62 3.82e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.53 on 5 degrees of freedom
Multiple R-squared: 0.9738, Adjusted R-squared: 0.9685
F-statistic: 185.6 on 1 and 5 DF, p-value: 3.822e-05

Multiple Linear Regression


Multiple linear regression analysis gives the relationship between two or more independent variables and a dependent variable, and can be represented as a hyperplane in multidimensional space. It is also a linear type of regression analysis, almost identical to simple linear regression; the major difference is the number of independent variables. Multiple linear regression analysis is used in fields such as real estate, finance, business, and public healthcare.
The mathematical equation for multiple linear regression is shown below.
y=a0x1+a1x2+a2x3+....+b
• where y is the dependent variable
• x1, x2, x3, .... are the independent variables
• a0, a1, a2, ... are the coefficients of the independent variables (partial regression coefficients), and b is the intercept.
Let us now look at an example of fitting a multiple linear regression curve with two independent variables x1 and x2 (y = b + a0*x1 + a1*x2).

x1 1 2 3 4 5

x2 8 6 4 2 10

y 3 7 5 9 11

In order to fit the multiple linear regression curve we need the normal equations to calculate the coefficients and the intercept.

x1   x2   y    x1^2   x2^2   x1*x2   x1*y   x2*y
1    8    3    1      64     8       3      24
2    6    7    4      36     12      14     42
3    4    5    9      16     12      15     20
4    2    9    16     4      8       36     18
5    10   11   25     100    50      55     110

From the above table:

•n = 5, ∑x1 = 15, ∑x2 = 30, ∑y = 35, ∑x1^2 = 55, ∑x2^2 = 220, ∑x1*x2 = 90, ∑x1*y = 123, ∑x2*y = 214
•Then the normal equations become:
•35 = 5b + 15a0 + 30a1
•123 = 15b + 55a0 + 90a1
•214 = 30b + 90a0 + 220a1
•Subtracting 3 times the first equation from the second gives 10a0 = 18, and subtracting 6 times the first from the third gives 40a1 = 4; substituting back into the first equation gives b.
•a0 = 1.8, a1 = 0.1, b = 1
•The multiple linear regression curve can be fitted as y = 1 + 1.8*x1 + 0.1*x2.
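A quick numeric check of this 3x3 system in R (a minimal sketch):
#solving the three normal equations of the worked example
M <- matrix(c(5, 15, 30,
              15, 55, 90,
              30, 90, 220), nrow = 3, byrow = TRUE)
solve(M, c(35, 123, 214)) #returns c(b, a0, a1) = c(1, 1.8, 0.1)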
Let us now discuss the implementation of multiple linear regression in R (here on a different dataset, with three independent variables).
#storing the three independent variables
independentX1<-c(8,10,15,19,20,11,16,13,6,18)
independentX2<-c(22,26,24,32,38,39,29,13,15,25)
independentX3<-c(28,26,24,22,29,25,27,23,20,21)
#storing the dependent variable y
dependentY<-c(43,12,45,48,33,37,39,38,36,28)

#performing the multilinear regression analysis


multilinear<-lm(dependentY~independentX1+independentX2+independentX3)

#printing the summary of the result


summary(multilinear)
plot(multilinear)

Output:
Call:
lm(formula = dependentY ~ independentX1 + independentX2 + independentX3)
Residuals:
Min 1Q Median 3Q Max
-21.862 -2.466 2.124 6.983 10.232
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 38.76188 35.43412 1.094 0.316
independentX1 0.46033 1.00476 0.458 0.663
independentX2 -0.09301 0.63260 -0.147 0.888
independentX3 -0.27250 1.55802 -0.175 0.867
Residual standard error: 12.22 on 6 degrees of freedom
Multiple R-squared: 0.04444, Adjusted R-squared: -0.4333
F-statistic: 0.093 on 3 and 6 DF, p-value: 0.9612

Polynomial Regression
Polynomial regression analysis is a nonlinear regression analysis. It allows flexible curve fitting by fitting a polynomial equation to the data, and it extends simple linear regression by adding extra terms obtained by raising the independent variable to higher powers.
The mathematical expression for polynomial regression analysis is shown below.
y=a0+a1x+a2x^2+...........+anx^n
• where y is the dependent variable
• x is the independent variable
• a0, a1, a2, ... are the coefficients.
Let us now look at an example of fitting a polynomial regression curve to the provided data.

x    10   12   15   23   20
y    14   17   23   25   21

Let us now fit a second-degree polynomial curve to the data above. In order to fit the polynomial regression curve we need the normal equations for the second-degree polynomial. We know the second-degree polynomial can be represented as y = a0 + a1x + a2x^2.
To fit this equation we need to calculate the coefficient values a0, a1, a2 using the normal equations.
The normal equations for the second-degree polynomial are:
∑y = n*a0 + a1∑x + a2∑x^2
∑xy = a0∑x + a1∑x^2 + a2∑x^3
∑x^2y = a0∑x^2 + a1∑x^3 + a2∑x^4
where n is the total number of observations in the provided data; for the data above n = 5.
Let us now calculate the values of a0, a1 and a2.

x    y    x^2   x^3     x^4      xy    x^2y
10   14   100   1000    10000    140   1400
12   17   144   1728    20736    204   2448
15   23   225   3375    50625    345   5175
23   25   529   12167   279841   575   13225
20   21   400   8000    160000   420   8400

From the above table:

•n = 5, ∑x = 80, ∑y = 100, ∑x^2 = 1398, ∑x^3 = 26270, ∑x^4 = 521202, ∑xy = 1684, ∑x^2y = 30648
•Then the normal equations become:
•100 = 5*a0 + 80*a1 + 1398*a2
•1684 = 80*a0 + 1398*a1 + 26270*a2
•30648 = 1398*a0 + 26270*a1 + 521202*a2
•By solving the above three equations we get a0 ≈ -8.73, a1 ≈ 3.01, a2 ≈ -0.0695
•The polynomial regression curve is y = -8.73 + 3.01x - 0.0695x^2.
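This system can also be checked numerically in R with solve() (a minimal sketch using the tabulated sums):
#solving the three normal equations of the worked example
M <- matrix(c(5, 80, 1398,
              80, 1398, 26270,
              1398, 26270, 521202), nrow = 3, byrow = TRUE)
solve(M, c(100, 1684, 30648)) #returns c(a0, a1, a2), approximately c(-8.73, 3.01, -0.0695)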
Now let us see the implementation of polynomial regression in R (here on a different dataset, fitting a degree-3 polynomial).
#storing the independent value X
independentX<-c(5,7,8,10,11,13,16)
#storing the dependent values
dependentY<-c(33,30,28,20,18,16,9)
#performing the regression analysis using the function lm with degree 3
polyregression<-lm(dependentY~poly(independentX,degree=3))
#printing the summary of the result
summary(polyregression)
#plotting the regression diagnostics
plot(polyregression)

Output:
Call:
lm(formula = dependentY ~ poly(independentX, degree = 3))
Residuals:
1 2 3 4 5 6 7
-0.4872 0.6943 1.1420 -1.6521 -1.0555 1.7218 -0.3632
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 22.0000 0.6533 33.673 5.76e-05 ***
poly(independentX, degree = 3)1 -20.8398 1.7286 -12.056 0.00123 **
poly(independentX, degree = 3)2 1.1339 1.7286 0.656 0.55866
poly(independentX, degree = 3)3 1.2054 1.7286 0.697 0.53578
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.729 on 3 degrees of freedom
Multiple R-squared: 0.9799, Adjusted R-squared: 0.9598
F-statistic: 48.76 on 3 and 3 DF, p-value: 0.004808

Exponential Regression
Exponential regression is a nonlinear type of regression and can be expressed in two forms; let us discuss both forms in detail with examples. Exponential regression is used in fields such as finance, biology, and physics. The mathematical expression for the first form is:
y=ae^(bx)
• where y is the dependent variable
• x is the independent variable
• a, b are the regression coefficients.
While fitting the exponential curve, we can convert the above equation into the slope-intercept form of a straight line (simple linear regression) by applying "ln" (the logarithm with base e) to both sides of y = ae^(bx).
By applying ln on both sides we get:
•ln(y) = ln(ae^(bx)) -> ln(y) = ln(a) + ln(e^(bx))
•ln(y) = ln(a) + bx
We can compare the above equation with Y = A + BX
where Y = ln(y), A = ln(a), B = b, X = x, so a = e^A and b = B
Normal equations will be
•∑ Y = n*A + B ∑ X
•∑ X*Y = A ∑ X + B ∑ X^2
Now let us try to fit an exponential regression for the given data

x 1 5 7 9 12

y 10 15 12 15 21

From the above derived equations we know X=x , Y=ln(y)

x    y    X    Y = ln(y)   X*Y      X^2
1    10   1    2.302       2.302    1
5    15   5    2.708       13.54    25
7    12   7    2.484       17.388   49
9    15   9    2.708       24.372   81
12   21   12   3.044       36.528   144

From the above table: n = 5, ∑X = 34, ∑Y = 13.246, ∑XY = 94.13, ∑X^2 = 300

Now the normal equations become:
•13.246 = 5A + 34B
•94.13 = 34A + 300B
•By solving the above equations we get the values of A and B
•A = 2.248 and B = 0.059
•From the equations above we know b = B and a = e^A
•a = e^2.248 = 9.468
•b = B = 0.059
•The exponential regression equation is y=ae^(bx) -> y = 9.468*e^(0.059x)
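This hand calculation can be checked numerically in R (a minimal sketch):
#solving the two normal equations of the worked example
M <- matrix(c(5, 34, 34, 300), nrow = 2, byrow = TRUE)
AB <- solve(M, c(13.246, 94.13)) #returns c(A, B), approximately c(2.248, 0.059)
exp(AB[1]) #a = e^A, approximately 9.47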
Let us now try to implement the exponential regression in R.
#storing the independent variable
independentX<-c(1,5,7,9,12)
#storing the dependent variable
dependentY<-c(10,15,12,15,21)
#fitting the exponential regression
exponentialregression<-lm(log(dependentY,exp(1))~independentX)
#determining the coefficients a and b
a<-exp(coef(exponentialregression)[1])
b<-coef(exponentialregression)[2]
print(a)
print(b)
#printing the summary of exponential regression equation
summary(exponentialregression)
#plotting the regression diagnostics
plot(exponentialregression)

Output:
(Intercept)
9.475
independentX
0.059
These printed values of a and b agree with the hand calculation above (a = e^A ≈ 9.47 and b ≈ 0.059); the rest of the summary output is omitted here.

Exponential regression in the form y = ab^x
y=ab^x
• where y is the dependent variable
• x is the independent variable
• a, b are the regression coefficients
While fitting the exponential curve, we can convert the above equation into the slope-intercept form of a straight line (simple linear regression) by applying "log" (the logarithm with base 10) to both sides of y = ab^x.
By applying log10 on both sides we get:
•log10(y) = log10(ab^x) -> log10(y) = log10(a) + log10(b^x)
•log10(y) = log10(a) + x*log10(b)
We can compare the above equation with Y = A + BX
where Y = log10(y), A = log10(a), B = log10(b), X = x, so a = 10^A and b = 10^B
Normal equations will be
•∑ Y = n*A + B ∑ X
•∑ X*Y = A ∑ X + B ∑ X^2
Now let us try to fit an exponential regression for the given data

x 2 3 4 5 6

y 8.3 15.4 33.1 165.2 127.4

From the above equation we know that X=x and Y=log10(y)

x    y       X    X^2   Y = log10(y)   X*Y
2    8.3     2    4     0.91           1.82
3    15.4    3    9     1.18           3.54
4    33.1    4    16    1.51           6.04
5    165.2   5    25    2.21           11.05
6    127.4   6    36    2.1            12.6

From the above table n= 5 , ∑ X = 20 , ∑ Y = 7.91 ,∑ XY =35.05 , ∑ X^2 = 90


Now the normal equations become:
•7.91 = 5A + 20B
•35.05 = 20A + 90B
•By solving the above equations we get the values of A and B
•A = 0.218 and B = 0.341
•From the equations above we know a = 10^A and b = 10^B
•a = 10^0.218 = 1.6519
•b = 10^0.341 = 2.192
•The exponential regression equation is y=ab^x -> y = 1.6519*2.192^x
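The same check for this form in R (a minimal sketch):
#solving 7.91 = 5A + 20B and 35.05 = 20A + 90B
M <- matrix(c(5, 20, 20, 90), nrow = 2, byrow = TRUE)
AB <- solve(M, c(7.91, 35.05)) #returns c(A, B) = c(0.218, 0.341)
10^AB #a and b, approximately c(1.65, 2.19)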
Let us now try to implement this form of exponential regression in R.
#storing the independent variable
independentX<-c(2,3,4,5,6)
#storing the dependent variable
dependentY<-c(8.3,15.4,33.1,165.2,127.4)
#fitting the exponential regression
exponentialregression<-lm(log10(dependentY)~independentX)
#determining the coefficients a and b
a<-10^(coef(exponentialregression)[1])
b<-10^coef(exponentialregression)[2]
print(a)
print(b)
#printing the summary of exponential regression equation
summary(exponentialregression)
#plotting the regression diagnostics
plot(exponentialregression)

Output:
(Intercept)
1.69
independentX
2.19
With full-precision logarithms R gives a ≈ 1.69 and b ≈ 2.19, close to the hand calculation above (which used log10(y) rounded to two decimals); the rest of the summary output is omitted here.

Logarithmic Regression
Logarithmic regression models the dependent variable as a linear function of the natural logarithm of the independent variable, so it can be solved using the simple linear regression machinery. (It should not be confused with logistic regression, which predicts a binary outcome and is discussed in the multivariate analysis section below.) The mathematical equation of logarithmic regression is shown below.
y=a+b*ln(x)
where y is the dependent variable
• x is the independent variable
• a, b are the regression coefficients
#logarithmic regression
#storing the dependent and independent variables
independentX<-c(10,20,30,40,50,60,70,80,90)
dependentY<-c(1,2,3,4,5,6,7,8,9)
#fitting the logarithmic regression equation
logarithmic<-lm(dependentY~log(independentX,exp(1)))
#printing the summary of the result
summary(logarithmic)

Output:
Call:
lm(formula = dependentY ~ log(independentX, exp(1)))
Coefficients:
(Intercept) approximately -8.54
log(independentX, exp(1)) approximately 3.64
For this data the fitted equation is approximately y = -8.54 + 3.64*ln(x); the rest of the summary output is omitted here.
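Once fitted, the model can be used for prediction; because the log transform is written inside the formula, predict() applies it to new data automatically (a minimal sketch):
#predicting y for a new, unseen value of x
predict(logarithmic, newdata = data.frame(independentX = 100))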
Steps in Regression Analysis
1. First, identify the problem, objective, or research question for which regression analysis is to be applied.
2. Collect the relevant data for the regression analysis. Make sure the data is free from errors and missing values.
3. To understand the characteristics of the data, perform exploratory data analysis (EDA), which includes data visualization and a statistical summary of the data.
4. Select the independent variables.
5. After selecting the independent variables, perform data preprocessing, which includes handling missing data and outliers.
6. Build the model based on the type of regression being performed.
7. Estimate the parameters or coefficients of the regression model using suitable estimation methods.
8. After estimating the parameters, check whether the model is a good fit for the data (i.e., evaluate the performance of the model).
9. Finally, use the model on unseen data, as in the sketch below.
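A minimal end-to-end sketch of these steps in R, assuming a hypothetical data frame df with a numeric response y and two predictors x1 and x2:
#hypothetical data for illustration
set.seed(1)
df <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
df$y <- 2 + 1.5*df$x1 - 0.8*df$x2 + rnorm(50)
#EDA: summary statistics and pairwise plots
summary(df)
plot(df)
#build the model and estimate its coefficients
model <- lm(y ~ x1 + x2, data = df)
#check the fit (coefficients, R-squared, residuals)
summary(model)
#use the model on unseen data
predict(model, newdata = data.frame(x1 = 0.5, x2 = -1))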
Applications of regression analysis
Regression analysis has applications in many fields such as economics, finance, real estate, healthcare, marketing, business, science, education, psychology, sports analysis, and agriculture. Let us now discuss a few of these applications.
1. Regression analysis is used to predict stock prices from past data and to analyze the relationship between interest rates and consumer spending.
2. It can be used to analyze the impact of price changes on product demand and to predict sales based on advertising expenditure.
3. It can be used in real estate to predict the value of a property based on its location.
4. Regression is also used in weather forecasting.
5. It is used to predict crop yields based on weather conditions and on the impact of fertilizers and irrigation on the plants.
6. It can be used in the analysis of product quality, giving the relationship between manufacturing variables and product quality.
7. It can be used to predict the performance of sports players from historical data and to assess the impact of coaching strategies on team success.
Advantages of regression analysis
1. It provides insight into how one or more independent variables relate to the dependent variable.
2. It helps in analyzing the model and gives the relationships among the variables.
3. It helps in forecasting and decision-making by predicting the dependent variable from the independent variables.
4. Regression analysis can help identify the most important predictor among all the variables.
5. It gives information on the strength and direction (positive or negative) of the relationships between the variables.
6. The goodness of fit can be assessed, and potential issues of the model identified, using diagnostic tools such as residual analysis.
Disadvantages of regression analysis
1. Regression analysis is sensitive to outliers, which can strongly influence the coefficient estimates.
2. Regression analysis depends on several assumptions (such as linearity and normality of the residuals), which affects the reliability of the results when those assumptions are violated.
3. Regression analysis depends heavily on the quality of the data. The results will be inaccurate and unreliable when the data is biased.
4. Regression analysis cannot provide accurate results for extremely complex relationships.
5. In regression analysis, multicollinearity may inflate the standard errors, making it a challenge to identify the contribution of each variable.
What is multivariate analysis?

In data analytics, we look at different variables (or factors) and how they might impact

certain situations or outcomes.

For example, in marketing, you might look at how the variable “money spent on

advertising” impacts the variable “number of sales.” In the healthcare sector, you might

want to explore whether there’s a correlation between “weekly hours of exercise” and

“cholesterol level.” This helps us to understand why certain outcomes occur, which in turn

allows us to make informed predictions and decisions for the future.

There are three categories of analysis to be aware of:

•Univariate analysis, which looks at just one variable

•Bivariate analysis, which analyzes two variables

•Multivariate analysis, which looks at more than two variables

As you can see, multivariate analysis encompasses all statistical techniques that are used

to analyze more than two variables at once. The aim is to find patterns

and correlations between several variables simultaneously—allowing for a much deeper,

more complex understanding of a given scenario than you’ll get with bivariate analysis.
1. An example of multivariate analysis

Let’s imagine you’re interested in the relationship between a person’s social media habits

and their self-esteem. You could carry out a bivariate analysis, comparing the following two

variables:

1.How many hours a day a person spends on Instagram

2.Their self-esteem score (measured using a self-esteem scale)


You may or may not find a relationship between the two variables; however, you know

that, in reality, self-esteem is a complex concept. It’s likely impacted by many different

factors—not just how many hours a person spends on Instagram. You might also want to

consider factors such as age, employment status, how often a person exercises, and

relationship status (for example). In order to deduce the extent to which each of these

variables correlates with self-esteem, and with each other, you’d need to run a

multivariate analysis.

So we know that multivariate analysis is used when you want to explore more than two

variables at once. Now let’s consider some of the different techniques you might use to do

this.
2. Multivariate data analysis techniques and examples

There are many different techniques for multivariate analysis, and they can be divided into

two categories:

•Dependence techniques

•Interdependence techniques

So what’s the difference? Let’s take a look.


Multivariate analysis techniques: Dependence vs. interdependence

When we use the terms “dependence” and “interdependence,” we’re referring to different

types of relationships within the data. To give a brief explanation:


Dependence methods

Dependence methods are used when one or some of the variables are dependent on

others. Dependence looks at cause and effect; in other words, can the values of two or

more independent variables be used to explain, describe, or predict the value of another,

dependent variable? To give a simple example, the dependent variable of “weight” might

be predicted by independent variables such as “height” and “age.”

In machine learning, dependence techniques are used to build predictive models. The

analyst enters input data into the model, specifying which variables are independent and

which ones are dependent—in other words, which variables they want the model to

predict, and which variables they want the model to use to make those predictions.
Interdependence methods

Interdependence methods are used to understand the structural makeup and underlying

patterns within a dataset. In this case, no variables are dependent on others, so you’re not

looking for causal relationships. Rather, interdependence methods seek to give meaning to

a set of variables or to group them together in meaningful ways.

So: One is about the effect of certain variables on others, while the other is all about the

structure of the dataset.

With that in mind, let’s consider some useful multivariate analysis techniques. We’ll look

at:

•Multiple linear regression

•Multiple logistic regression

•Multivariate analysis of variance (MANOVA)

•Factor analysis

•Cluster analysis
Multiple linear regression

Multiple linear regression is a dependence method which looks at the relationship between

one dependent variable and two or more independent variables. A multiple regression

model will tell you the extent to which each independent variable has a linear relationship

with the dependent variable. This is useful as it helps you to understand which factors are

likely to influence a certain outcome, allowing you to estimate future outcomes.


Example of multiple regression:

As a data analyst, you could use multiple regression to predict crop growth. In this

example, crop growth is your dependent variable and you want to see how different

factors affect it. Your independent variables could be rainfall, temperature, amount of

sunlight, and amount of fertilizer added to the soil. A multiple regression model would

show you the proportion of variance in crop growth that each independent variable

accounts for.

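To make this concrete, here is how such a model could be fitted in R; the data frame and all variable names below are hypothetical:

#hypothetical crop data; values are for illustration only
crops <- data.frame(rainfall = c(120, 95, 140, 110, 130, 100),
                    temperature = c(21, 24, 19, 22, 20, 23),
                    sunlight = c(200, 180, 220, 210, 190, 185),
                    fertilizer = c(50, 40, 60, 55, 45, 42),
                    growth = c(3.1, 2.4, 3.8, 3.0, 3.3, 2.6))
#multiple linear regression: one dependent variable, several independent variables
crop_model <- lm(growth ~ rainfall + temperature + sunlight + fertilizer, data = crops)
summary(crop_model)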


Multiple logistic regression

Logistic regression analysis is used to calculate (and predict) the probability of a binary

event occurring. A binary outcome is one where there are only two possible outcomes;

either the event occurs (1) or it doesn’t (0). So, based on a set of independent variables,
logistic regression can predict how likely it is that a certain scenario will arise. It is also

used for classification. You can learn about the difference between regression and

classification here.
Example of logistic regression:

Let’s imagine you work as an analyst within the insurance sector and you need to predict

how likely it is that each potential customer will make a claim. You might enter a range of

independent variables into your model, such as age, whether or not they have a serious

health condition, their occupation, and so on. Using these variables, a logistic regression

analysis will calculate the probability of the event (making a claim) occurring. Another oft-

cited example is the filters used to classify email as “spam” or “not spam.” You’ll find a

more detailed explanation in this complete guide to logistic regression.
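In R, such a model is typically fitted with glm() and a binomial family; the data and variable names here are hypothetical:

#hypothetical insurance data: claim is 1 if the customer made a claim, 0 otherwise
claims <- data.frame(age = c(25, 40, 35, 60, 45, 30, 55, 50),
                     chronic = c(0, 1, 0, 1, 0, 0, 1, 1),
                     claim = c(0, 1, 0, 1, 1, 0, 1, 0))
#multiple logistic regression for a binary outcome
claim_model <- glm(claim ~ age + chronic, data = claims, family = binomial)
#predicted probability of a claim for a new customer
predict(claim_model, newdata = data.frame(age = 42, chronic = 1), type = "response")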

Multivariate analysis of variance (MANOVA)

Multivariate analysis of variance (MANOVA) is used to measure the effect of multiple

independent variables on two or more dependent variables. With MANOVA, it’s important

to note that the independent variables are categorical, while the dependent variables are

metric in nature. A categorical variable is a variable that belongs to a distinct category—

for example, the variable “employment status” could be categorized into certain units,

such as “employed full-time,” “employed part-time,” “unemployed,” and so on. A metric

variable is measured quantitatively and takes on a numerical value.

In MANOVA analysis, you’re looking at various combinations of the independent variables

to compare how they differ in their effects on the dependent variables.


Example of MANOVA:

Let’s imagine you work for an engineering company that is on a mission to build a super-

fast, eco-friendly rocket. You could use MANOVA to measure the effect that various design

combinations have on both the speed of the rocket and the amount of carbon dioxide it

emits. In this scenario, your categorical independent variables could be:

•Engine type, categorized as E1, E2, or E3

•Material used for the rocket exterior, categorized as M1, M2, or M3

•Type of fuel used to power the rocket, categorized as F1, F2, or F3


Your metric dependent variables are speed in kilometers per hour, and carbon dioxide

measured in parts per million. Using MANOVA, you’d test different combinations (e.g. E1,

M1, and F1 vs. E1, M2, and F1, vs. E1, M3, and F1, and so on) to calculate the effect of all

the independent variables. This should help you to find the optimal design solution for your

rocket.
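In R this design can be expressed with manova(); the factor levels and measurements below are hypothetical:

#hypothetical rocket test data; engine, material and fuel are categorical factors
set.seed(42)
tests <- data.frame(engine = factor(rep(c("E1", "E2", "E3"), each = 6)),
                    material = factor(rep(c("M1", "M2", "M3"), times = 6)),
                    fuel = factor(sample(c("F1", "F2", "F3"), 18, replace = TRUE)),
                    speed = rnorm(18, mean = 25000, sd = 500), #km/h
                    co2 = rnorm(18, mean = 400, sd = 30))      #parts per million
#MANOVA: two metric dependent variables, three categorical independent variables
fit <- manova(cbind(speed, co2) ~ engine + material + fuel, data = tests)
summary(fit)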
Factor analysis

Factor analysis is an interdependence technique which seeks to reduce the number of

variables in a dataset. If you have too many variables, it can be difficult to find patterns in

your data. At the same time, models created using datasets with too many variables are

susceptible to overfitting. Overfitting is a modeling error that occurs when a model fits too

closely and specifically to a certain dataset, making it less generalizable to future

datasets, and thus potentially less accurate in the predictions it makes.

Factor analysis works by detecting sets of variables which correlate highly with each other.

These variables may then be condensed into a single variable. Data analysts will often

carry out factor analysis to prepare the data for subsequent analyses.
Factor analysis example:

Let’s imagine you have a dataset containing data pertaining to a person’s income,

education level, and occupation. You might find a high degree of correlation among each of

these variables, and thus reduce them to the single factor “socioeconomic status.” You

might also have data on how happy they were with customer service, how much they like

a certain product, and how likely they are to recommend the product to a friend. Each of

these variables could be grouped into the single factor “customer satisfaction” (as long as

they are found to correlate strongly with one another). Even though you’ve reduced

several data points to just one factor, you’re not really losing any information—these

factors adequately capture and represent the individual variables concerned. With your

“streamlined” dataset, you’re now ready to carry out further analyses.
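As a sketch of how this reduction looks in R, factanal() performs maximum-likelihood factor analysis; the data below is simulated so that two latent factors drive six observed variables:

#simulate six observed variables driven by two latent factors
set.seed(7)
ses <- rnorm(200); csat <- rnorm(200)
survey <- data.frame(income = ses + rnorm(200, sd = 0.5),
                     education = ses + rnorm(200, sd = 0.5),
                     occupation = ses + rnorm(200, sd = 0.5),
                     service = csat + rnorm(200, sd = 0.5),
                     liking = csat + rnorm(200, sd = 0.5),
                     recommend = csat + rnorm(200, sd = 0.5))
#extract two factors; the loadings should separate the two variable groups
factanal(survey, factors = 2)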


Cluster analysis

Another interdependence technique, cluster analysis is used to group similar items within

a dataset into clusters.


When grouping data into clusters, the aim is for the variables in one cluster to be more

similar to each other than they are to variables in other clusters. This is measured in terms

of intracluster and intercluster distance. Intracluster distance looks at the distance

between data points within one cluster. This should be small. Intercluster distance looks at

the distance between data points in different clusters. This should ideally be large. Cluster

analysis helps you to understand how data in your sample is distributed, and to find

patterns.

Learn more: What is Cluster Analysis? A Complete Beginner’s Guide


Cluster analysis example:

A prime example of cluster analysis is audience segmentation. If you were working in

marketing, you might use cluster analysis to define different customer groups which could

benefit from more targeted campaigns. As a healthcare analyst, you might use cluster

analysis to explore whether certain lifestyle factors or geographical locations are

associated with higher or lower cases of certain illnesses. Because it’s an interdependence

technique, cluster analysis is often carried out in the early stages of data analysis.
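A minimal k-means sketch in R, using hypothetical customer data:

#hypothetical customer data: annual spend and store visits per month
set.seed(3)
customers <- data.frame(spend = c(rnorm(20, 200, 30), rnorm(20, 800, 60)),
                        visits = c(rnorm(20, 2, 0.5), rnorm(20, 10, 1)))
#k-means clustering into two customer segments (variables scaled first)
segments <- kmeans(scale(customers), centers = 2)
segments$cluster   #cluster assignment for each customer
segments$withinss  #within-cluster sums of squares (intracluster compactness)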
More multivariate analysis techniques

This is just a handful of multivariate analysis techniques used by data analysts and data

scientists to understand complex datasets. If you’re keen to explore further, check out

discriminant analysis, conjoint analysis, canonical correlation analysis, structural equation

modeling, and multidimensional scaling.


What are the advantages of multivariate analysis?

The one major advantage of multivariate analysis is the depth of insight it provides. In

exploring multiple variables, you’re painting a much more detailed picture of what’s
occurring—and, as a result, the insights you uncover are much more applicable to the real

world.

Remember our self-esteem example back in section one? We could carry out a bivariate

analysis, looking at the relationship between self-esteem and just one other factor; and, if

we found a strong correlation between the two variables, we might be inclined to conclude

that this particular variable is a strong determinant of self-esteem. However, in reality, we

know that self-esteem can’t be attributed to one single factor. It’s a complex concept; in

order to create a model that we could really trust to be accurate, we’d need to take many

more factors into account. That’s where multivariate analysis really shines; it allows us to

analyze many different factors and get closer to the reality of a given situation.

Key takeaways and further reading

In this post, we’ve learned that multivariate analysis is used to analyze data containing

more than two variables. To recap, here are some key takeaways:

•The aim of multivariate analysis is to find patterns and correlations between

several variables simultaneously

•Multivariate analysis is especially useful for analyzing complex datasets, allowing

you to gain a deeper understanding of your data and how it relates to real-world

scenarios

•There are two types of multivariate analysis techniques: Dependence techniques,

which look at cause-and-effect relationships between variables, and

interdependence techniques, which explore the structure of a dataset

•Key multivariate analysis techniques include multiple linear regression, multiple

logistic regression, MANOVA, factor analysis, and cluster analysis—to name just a

few
