

© Bhasker Gupta 2016

Bhasker Gupta, Interview Questions in Business Analytics, 10.1007/978-1-4842-0599-0_5

5. Correlation and Regression


Bhasker Gupta
(1) Bangalore, Karnataka, India

This chapter is concerned with measuring the relatedness between two variables. A simple measure, the correlation coefficient, is commonly used to quantify the degree of relationship between two variables. In this chapter, I will discuss different types of regression models, assumptions and questions related to them, and the estimation method used commonly with regression models.

Q: What is correlation and what does it do?


In analytics, we try to find relationships and associations among various events. In the probabilistic context, we determine the relationships between variables. Correlation is a method by which to calculate the relationship between two variables. The correlation coefficient is a numerical measure of the relationship between paired observations (Xi, Yi), i = 1,…, n. Different correlation coefficients capture different kinds of relationships between variables and are interpreted differently.

There are a number of techniques that have been developed to quantify the association between variables of different scales (nominal, ordinal, interval, and ratio), including the following:

Pearson product-moment correlation (both variables are measured on an interval or ratio scale)

Spearman rank-order correlation (both variables are measured on an ordinal scale)

Phi correlation (both variables are measured on a nominal/dichotomous scale)

Point biserial (one variable is measured on a nominal/dichotomous scale, and one is measured on an interval or ratio scale)

Q: When should correlation be used or not used?


Correlation is a good indicator of how two variables are related. It is a good
metric to look at during the early phases of research or analysis. Beyond a
certain point, however, correlation is of little use.

A dip in a country’s gross domestic product (GDP), for example, would lead
to an increase in the unemployment rate. A casual look at the correlation
between these two variables would indicate that there is a strong
relationship between them.

Yet, the extent or measure of this relationship cannot be ascertained through a correlation. A correlation of 0.75 between two variables does not mean that one variable is related to the other by a factor of 0.75.

This brings us to another issue that is often misunderstood by analysts. Correlation does not mean causation. A strong correlation between two variables does not necessarily imply that one variable causes the other to occur.

In our GDP vs. unemployment rate example, this might be true, i.e., a lower GDP might increase unemployment. But we cannot and should not infer this from a correlation. Establishing causation should be left to the sound judgment of a competent researcher.

Q: What is the Pearson product-moment correlation coefficient?

It is easy to determine whether two variables are correlated, simply by looking at a scatter plot (with one variable on each of the two axes). Essentially, for the two variables to have a strong correlation, the points should be scattered along a straight line.

However, to quantify the correlation, we use the Pearson product-moment correlation coefficient for samples, otherwise known as Pearson’s r.

The correlation can be either positive or negative, so the value of ρ can be any number between -1 and +1, as follows:

-1 ≤ ρ ≤ +1

A correlation coefficient of less than zero means that an increase in one variable is generally accompanied by a decrease in the other variable. A coefficient greater than zero means that an increase in one variable is accompanied by an increase in the other variable.

Higher absolute values mean stronger relationships (positive or negative), and values closer to zero depict weak relationships. A correlation of 1.00 means that the two variables are completely or perfectly positively correlated; -1.00 means that they are perfectly negatively correlated; and a correlation of 0.00 means that there is no linear relationship between the two variables.

Q: What is the formula for calculating the correlation coefficient?

The formula is as follows:

r = (1 / (n - 1)) × Σ [(xi - x̄) / sx] × [(yi - ȳ) / sy]

where x̄ and ȳ are the sample means, and sx and sy the sample standard deviations, of x and y.


To compute r, the following algorithm, corresponding to the preceding formula, is used:

For each (x, y) set of coordinates, subtract the mean from each
observation for x and y.

Divide by the corresponding standard deviation.

Multiply the two results together.

The result is then added to a sum.

The sum is divided by the degrees of freedom, n - 1.
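The steps above can be sketched directly in Python (the data values here are purely illustrative):

```python
import math

# Illustrative paired observations (hypothetical data)
x = [2.0, 4.0, 6.0, 8.0, 10.0]
y = [1.0, 3.0, 7.0, 9.0, 15.0]
n = len(x)

# Sample means and sample standard deviations (divisor n - 1)
mean_x = sum(x) / n
mean_y = sum(y) / n
sd_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / (n - 1))
sd_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / (n - 1))

# For each pair: subtract the mean, divide by the standard deviation,
# multiply the two standardized values, and add the result to a sum
total = sum(((xi - mean_x) / sd_x) * ((yi - mean_y) / sd_y)
            for xi, yi in zip(x, y))

# Finally, divide the sum by the degrees of freedom, n - 1
r = total / (n - 1)
```

For this sample, r works out to roughly 0.98, a strong positive correlation.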

Q: Briefly, what are the other techniques for calculating correlation?

Although the Pearson product-moment correlation is the most widely used correlation technique, other correlation techniques must be applied if the main tenet of the Pearson technique is violated, i.e., that both variables be on an interval or ratio scale.

SPEARMAN RANK-ORDER CORRELATION

This technique is used when both variables are on an ordinal scale. Each variable is ranked from 1 to n, the number of observations. The main tenet here is that, for any correlation to exist, an observation that ranks high on one variable should also rank high on the other.
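As a quick sketch (with made-up data and no tied ranks), Spearman’s ρ can be computed by ranking each variable and applying the shortcut formula ρ = 1 - 6Σd²/(n(n² - 1)), where d is the difference between the two ranks of each observation:

```python
# Hypothetical paired observations
x = [86, 97, 99, 100, 101, 103, 106, 110, 112, 113]
y = [2, 20, 28, 27, 50, 29, 7, 17, 6, 12]
n = len(x)

def ranks(values):
    # Rank 1 for the smallest value, n for the largest (assumes no ties)
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

rx, ry = ranks(x), ranks(y)

# Sum of squared rank differences, then the shortcut formula
d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
rho = 1 - 6 * d2 / (n * (n * n - 1))
```

A ρ near zero here would indicate that high ranks on one variable do not go with high ranks on the other.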

PHI CORRELATION
This is also used as a follow-up to a chi-square test. It is used when both variables are nominal.

POINT BISERIAL
This method is used when one variable is on a nominal/dichotomous scale
and one is measured on an interval or ratio scale.

Q: How would you use a graph to illustrate and interpret a correlation coefficient?

Visualization helps us to understand the relationships among variables better. In Figure 5-1, graphs have been plotted to illustrate and visualize relationships between the variables x and y that are being statistically analyzed.

Figure 5-1. Graphs plotting the statistical analysis of relationships between the
variables x and y

Now let us interpret these graphs.

Figure (a)
r = 1.0

This represents a perfect linear association. All the data points fall on the
line.

Figure (b)
r = 0

No linear relationship exists between the variables. The data points are
scattered randomly and may approximate a circle. Changing the value of
one variable has no effect on the value of the other.

Figure (c)
r = 0.70

There is some positive linear relationship, although it is not perfect. Most of the data points fall on or close to a straight line.

Figure (d)
r = -1.0

This shows a perfect linear relationship between the variables, similar to figure (a), with the difference being that the variables are inversely related, i.e., increasing the value of one variable results in a decrease in the other variable.

Figure (e)
r = 0.51

The relationship between the variables is not very strong, and the data points are a little scattered, although still close to a straight line.

Figure (f)
r = -0.70

This is similar to figure (c), with the difference being that the variables are
negatively correlated.

We can see that as the value of r approaches zero, the data points become more scattered, whereas the data points lie closer to a straight line as the value of r approaches -1.0 or +1.0.


Q: How is a correlation coefficient interpreted?


A correlation coefficient has both magnitude (unitless) and direction (a positive or negative sign) and lies between -1 and +1. A correlation coefficient with a value of zero indicates that no relationship exists between the variables. When the correlation coefficient has a non-zero value and approaches ±1, this signifies a strong linear relationship between the variables. The magnitude of r signifies the strength of the relationship, and the positive or negative sign indicates whether the variables are directly or inversely related.

For example, if r = 0.9, this means that the variables are strongly related,
and increasing the value of one results in an increase in the value of the
other. Similarly, if r = -0.9, this indicates that the variables are strongly
related, and increasing the value of one results in a decrease in the value of
the other.

Q: What are the various issues with correlation?

The various issues with correlation are

Correlation analysis does not measure the strength of a nonlinear association between variables.

Accidental or spurious relationships are not accounted for.

Research problems, such as data contamination, sample bias, etc., hinder drawing reliable conclusions.

Correlation analysis measures the relationship and does not provide an explanation or basis for it, which can result in false conclusions.

Q: How would you calculate a correlation coefficient in Excel?

A correlation coefficient between two arrays can be calculated in Excel using the CORREL function. The syntax for using the CORREL function is

CORREL(array1, array2)

For example:
CORREL(A1:A20, B1:B20)

The CORREL function returns an error in the following cases:

A #N/A error for unequal numbers of data points in the two arrays

A #DIV/0! error when the standard deviation of array1 or array2 is zero

Q: What is meant by the term “linear regression”?

Linear regression is a statistical modeling technique that attempts to model the relationship between an explanatory variable and a dependent variable by fitting a linear equation to the observed data points, e.g., modeling the body mass index (BMI) of individuals by weight.

A linear regression is used if there is a relationship or significant association between the variables. This can be checked with scatter plots. If no association appears between the variables, fitting a linear regression model to the data will not provide a useful model.

A linear regression line takes an equation of the following form:

Y = a + bX

where X = explanatory variable,
Y = dependent variable,
b = slope of the line, and
a = intercept (the value of Y when X = 0).
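The slope and intercept are typically estimated by ordinary least squares. A minimal sketch in Python, with illustrative data (the slope b is the sum of cross-deviations divided by the sum of squared x-deviations, and a then follows from the means):

```python
# Hypothetical observations: fit Y = a + bX by ordinary least squares
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 4.0, 6.2, 7.9, 10.1]
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n

# Slope: sum of cross-deviations over sum of squared x-deviations
b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
     / sum((xi - mean_x) ** 2 for xi in x))

# Intercept: the value of Y when X = 0
a = mean_y - b * mean_x
```

Here the fitted line is roughly Y = 0.09 + 1.99X.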

Q: What are the various assumptions that an analyst takes into account while running a regression analysis?

Regression analysis depends on the following assumptions:

The relationship between the variables should be linear (or approximately linear) over the range of the population being studied.

All the variables in the regression analysis should be normal, i.e., should follow the normal curve (exactly or approximately).

There should be no multicollinearity, i.e., the independent variables should not show correlation among themselves.

There should be no autocorrelation in the data, i.e., the residuals should be independent of each other.

There should be homoscedasticity, i.e., the error terms or residuals along the regression should have equal variance.

Q: How would you execute regression in Excel?

Regression in Excel can be performed by using three built-in functions to calculate the slope, intercept, and R² values, or by using the Regression tool provided under Data Analysis (after installing the Analysis ToolPak add-in). The built-in functions are SLOPE(), INTERCEPT(), and RSQ().

Q: What is the multiple coefficient of determination, or R-squared?

The multiple coefficient of determination, R², is a method by which to calculate the overall effectiveness of all the independent variables in explaining the dependent variable, expressed as the percentage of variation explained.

For example, if R² = 0.8, this means that the independent variables explain 80% of the variation in the value of the dependent variable.

Unfortunately, R² alone may not be a reliable measure of the accuracy of the multiple regression model, as R² increases every time a new variable is added to the model, even though the variable might not be statistically significant. If there is a large number of independent variables, the value of R² may be high, even though the variables do not explain the dependent variable that well. This problem is called overestimating the regression.

By adjusting the R² value for the number of independent variables, the problem of overestimating the regression can be overcome.
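The usual adjustment is adjusted R² = 1 - (1 - R²)(n - 1)/(n - k - 1), where n is the number of observations and k the number of independent variables. A small sketch with hypothetical values:

```python
# Hypothetical regression: n observations, k independent variables
n, k = 50, 3
r_squared = 0.80

# Adjusted R² penalizes every additional independent variable,
# so it only rises when a new variable genuinely improves the fit
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)
```

With these numbers, the adjusted value drops slightly, to about 0.787.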

Q: What is meant by “heteroscedasticity”?

When the variance of the residuals differs across observations in the sample, this is called heteroscedasticity. It is one of the violations that analysts have to test for before relying on a regression analysis. One of the assumptions of multiple regression is that the variance of the residuals is constant across observations.

Q: How do you differentiate between conditional and unconditional heteroscedasticity?

Unconditional heteroscedasticity occurs in cases in which the level of the independent variables does not affect heteroscedasticity, i.e., the variance does not change systematically with changes in the value of the independent variables. Although this is a violation of the equal variance assumption, it frequently causes no serious problems with the regression.

Conditional heteroscedasticity is heteroscedasticity that is related to the level of (i.e., conditional upon) the independent variables.

Q: What are the different methods of detecting heteroscedasticity?

There are two methods of detecting heteroscedasticity: examining scatter plots of the residuals, and using the Breusch-Pagan chi-square test. Plotting the residuals against one or more of the independent variables can help us spot trends among the observations (see Figure 5-2).

Figure 5-2. Plotting residuals against an independent variable

The residual plot in the figure indicates the presence of conditional heteroscedasticity. Notice how the variation in the regression residuals increases as the independent variable increases. This indicates that the variance of the dependent variable about the mean is related to the level of the independent variable.

The more common way to detect conditional heteroscedasticity is the Breusch-Pagan test, which calls for the regression of the squared residuals on the independent variables. Under conditional heteroscedasticity, the independent variables contribute significantly to explaining the squared residuals.
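For the single-regressor case, the Breusch-Pagan idea can be sketched in plain Python: regress the squared residuals on x, take the R² of that auxiliary regression, and compare the LM statistic n·R² with a chi-square critical value (3.841 at the 5% level for one degree of freedom). The residuals below are made up so that their spread grows with x:

```python
# Hypothetical residuals from an original fit, ordered by x;
# their magnitude grows with x (conditional heteroscedasticity)
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
resid = [0.1, -0.2, 0.3, -0.5, 0.7, -0.9, 1.2, -1.5]
n = len(x)

# Step 1: OLS regression of the SQUARED residuals on x
u2 = [e ** 2 for e in resid]
mean_x = sum(x) / n
mean_u2 = sum(u2) / n
b = (sum((xi - mean_x) * (ui - mean_u2) for xi, ui in zip(x, u2))
     / sum((xi - mean_x) ** 2 for xi in x))
a = mean_u2 - b * mean_x

# Step 2: R² of the auxiliary regression
fitted = [a + b * xi for xi in x]
ss_res = sum((ui - fi) ** 2 for ui, fi in zip(u2, fitted))
ss_tot = sum((ui - mean_u2) ** 2 for ui in u2)
r2_aux = 1 - ss_res / ss_tot

# Step 3: LM statistic n * R² vs. the chi-square critical value
lm = n * r2_aux
heteroscedastic = lm > 3.841  # 5% critical value, df = 1
```

A statistics package would compute the p-value from the chi-square distribution rather than rely on a hardcoded cutoff.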

Q: What are the different methods to correct heteroscedasticity?

The most common remedy is to calculate robust standard errors. The t-statistics are then recalculated using the original regression coefficients and the robust standard errors. A second method to correct for heteroscedasticity is to use generalized least squares, by modifying the original equation.

Q: What is meant by the term “serial correlation”?


Serial correlation, or autocorrelation, is the phenomenon commonly observed in time series data, in which there is a correlation between the residual terms. It is of two types: positive and negative.

When a positive regression error in one time period increases the probability of observing a positive regression error in the next time period, this is a positive serial correlation. In a negative serial correlation, a positive regression error increases the probability of observing a negative error in the next time period.

Q: What are the different methods to detect serial correlation?

There are two methods that are commonly used to detect the presence of serial correlation: residual plots and the Durbin-Watson statistic.

A scatter plot of residuals vs. time, such as those shown in Figure 5-3, can
reveal the presence of serial correlation. Figure 5-3 illustrates examples of
positive and negative serial correlation.

Figure 5-3. Scatter plot of residuals vs. time indicating positive and negative serial
correlations

The more common method is to use the Durbin-Watson statistic (DW) to detect the presence of serial correlation.
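The DW statistic is the sum of squared successive differences of the residuals divided by the sum of squared residuals; values near 2 suggest no serial correlation, values near 0 positive serial correlation, and values near 4 negative serial correlation. A sketch with made-up residuals that drift slowly over time:

```python
# Hypothetical regression residuals ordered in time; they change
# sign slowly, which is typical of positive serial correlation
resid = [0.5, 0.6, 0.4, 0.3, 0.2, -0.1, -0.3, -0.4, -0.2, 0.1]

# DW = sum of squared successive differences / sum of squared residuals
num = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
den = sum(e ** 2 for e in resid)
dw = num / den  # well below 2 here, pointing to positive serial correlation
```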

Q: What are the different methods to correct multicollinearity?

The most common method to remove multicollinearity is to omit independent variables that are highly correlated with others in the variable set. Unfortunately, it is not always an easy task to identify the variable(s) that are the source of the multicollinearity. There are statistical procedures that may help in this effort, such as stepwise regression, which systematically removes variables until multicollinearity is reduced.

A summary of what you have to know regarding violations of the assumptions of multiple regression is offered in Table 5-1.

Table 5-1. Summary of Violations of Assumptions of Multiple Regression

What is it?
Conditional heteroscedasticity: Residual variance related to the level of the independent variables
Serial correlation: Residuals are correlated
Multicollinearity: High correlation among the independent variables

Effect?
Conditional heteroscedasticity: Coefficients are consistent. Standard errors are underestimated. Too many Type I errors
Serial correlation: Coefficients are consistent. Standard errors are underestimated. Too many Type I errors (positive correlation)
Multicollinearity: Coefficients are consistent but unreliable. Standard errors are overestimated. Too many Type II errors

Detection?
Conditional heteroscedasticity: Breusch-Pagan chi-square test
Serial correlation: Durbin-Watson test
Multicollinearity: Conflicting t and F statistics; high correlations among the independent variables

Correction?
Conditional heteroscedasticity: Use White-corrected standard errors
Serial correlation: Use the Hansen method to adjust standard errors
Multicollinearity: Drop one of the correlated variables

Q: What is an odds ratio?

Odds express the relative occurrence of different outcomes as a ratio of the form a:b. For example, if the odds of an event are said to be 5:2 in favor of the first outcome, this means that the first outcome occurs five times for every two occurrences of the second outcome. Odds are related to probability and can be shown mathematically as follows:

Odds = a:b
Probability = a/(a + b)
Probability = Odds/(1 + Odds)
Odds = Probability/(1 - Probability)
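The conversions above can be checked directly in Python, using the 5:2 example:

```python
# Odds of 5:2 in favor of the first outcome
a, b = 5, 2

probability = a / (a + b)               # Probability = a/(a + b) = 5/7
odds = probability / (1 - probability)  # Odds = Probability/(1 - Probability) = 2.5
prob_again = odds / (1 + odds)          # Probability = Odds/(1 + Odds), back to 5/7
```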

Q: How is linear regression different from logistic regression?

Linear regression is applicable to numerical or continuous variables, whereas logistic regression is applicable when the dependent variable is categorical (commonly a dichotomous variable). The output of logistic regression is between 0 and 1, where 1 denotes “success” and 0 denotes “failure.” But in linear regression, the output is continuous and can assume any value in a range.

Linear regression predicts numerical outputs, such as sales or profit, whereas logistic regression predicts dichotomous outputs, such as yes and no or living and dead.
