Chapter 2
Regression
4. Normality of Errors:
The residuals (errors) of the model should be approximately normally distributed. This
assumption is important for hypothesis testing (e.g., t-tests for the regression coefficients)
and for constructing confidence intervals.
5. No Perfect Multicollinearity:
The independent variables should not be perfectly correlated with each other. If two or
more predictors are highly correlated, the model may have difficulty estimating their
individual effects, which can lead to unstable coefficient estimates (high variance). This
is known as multicollinearity; a quick way to check for it is shown in the sketch after this list.
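A common numeric check for this assumption is the variance inflation factor (VIF). The sketch below is a minimal illustration with made-up data (x1, x2, x3 are hypothetical predictors; the third is built to be nearly collinear with the first); it computes VIFs as the diagonal of the inverse of the predictors' correlation matrix:

import numpy as np

# Hypothetical predictors: x3 is built to be nearly a copy of x1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + rng.normal(scale=0.05, size=200)
X = np.column_stack([x1, x2, x3])

# VIF_j is the j-th diagonal entry of the inverse correlation matrix.
corr = np.corrcoef(X, rowvar=False)
vif = np.diag(np.linalg.inv(corr))
print(vif)  # values far above ~10 (here for x1 and x3) flag multicollinearity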
R-squared is a statistical measure that represents the goodness of fit of a regression model.
Its value lies between 0 and 1. R-squared equals 1 when the model fits the data perfectly,
so there is no difference between the predicted and actual values. R-squared equals 0 when
the model explains none of the variability in the response, i.e., it learns no relationship
between the dependent and independent variables. It is computed as R-squared = 1 - SSE/SST, where:
SSE is the sum of the squared differences between the actual dependent variable values and the
predicted values from the regression model.
SST is the total variation in the dependent variable and is calculated by summing the squared
differences between each actual dependent variable value and the mean of all dependent variable
values.
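To make these definitions concrete, here is a short Python sketch with toy numbers (purely illustrative) that computes SSE, SST, and R-squared:

import numpy as np

y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])       # actual values of the dependent variable
y_hat = np.array([2.8, 5.3, 6.9, 9.4, 10.6])   # predictions from the regression model

sse = np.sum((y - y_hat) ** 2)       # sum of squared errors
sst = np.sum((y - y.mean()) ** 2)    # total sum of squares
r_squared = 1 - sse / sst            # R-squared = 1 - SSE/SST
print(round(r_squared, 4))           # close to 1: the toy predictions fit well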
Gauss-Markov Theorem:
The Gauss-Markov theorem states that if your linear regression model satisfies the classical
assumptions, then ordinary least squares (OLS) regression produces unbiased estimates that
have the smallest variance among all linear unbiased estimators.
In regression analysis, the goal is to draw a random sample from a population and use it to
estimate the properties of that population. The coefficients in the regression equation are
estimates of the actual population parameters. For simple linear regression, the population
model is y = β0 + β1x + ε, where epsilon (ε) represents the random error that the model
doesn't explain.
Unfortunately, we'll never know these population values because it is generally impossible to
measure the entire population. Instead, we'll obtain estimates of them using our random sample.
The notation for an estimated model from a random sample is the following:
ŷ = b0 + b1x,
where b0 and b1 are the sample estimates of β0 and β1.
Imagine that we repeat the same study many times. We collect random samples of the same size,
from the same population, and fit the same OLS regression model repeatedly. Each random
sample produces different estimates for the parameters in the regression equation. After this
process, we can graph the distribution of estimates for each parameter. Statisticians refer to this
type of distribution as a sampling distribution, which is a type of probability distribution.
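This repeated-sampling thought experiment is easy to simulate. The sketch below (all parameter values are made up for illustration) fixes a true model, draws many samples, refits OLS each time, and summarizes the resulting slope estimates:

import numpy as np

# Fix a hypothetical "true" population model, then repeat the study many times:
# draw a random sample, fit OLS, and record the estimated slope.
rng = np.random.default_rng(42)
beta0_true, beta1_true = 2.0, -5.98          # assumed population parameters
slopes = []
for _ in range(5000):
    x = rng.uniform(25, 50, size=49)                                  # e.g., 49 latitudes
    y = beta0_true + beta1_true * x + rng.normal(scale=10, size=49)   # add random error
    b1, b0 = np.polyfit(x, y, deg=1)                                  # OLS line fit
    slopes.append(b1)

# The histogram of `slopes` is the sampling distribution; its mean sits on the
# true slope (unbiasedness) and its spread is the estimator's variance.
print(np.mean(slopes), np.std(slopes))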
In the graph below, beta represents the true population value. The curve on the right centers on a
value that is too high: that model tends to produce estimates that are too high, which is a positive
bias, so it is not correct on average. However, the curve on the left centers on the actual value of
beta. That model produces parameter estimates that are correct on average; the expected value
equals the actual value of the population parameter.
In the graph below, both curves center on beta. However, one curve is wider than
the other because the variances are different. Broader curves indicate that there is a
higher probability that the estimates will be further away from the correct value.
The "Best" in BLUE (Best Linear Unbiased Estimator) refers to the sampling distribution with the
minimum variance. That's the tightest possible distribution among all linear unbiased estimation methods!
This lesson presents two alternative methods for testing whether a linear association exists
between the predictor x and the response y in a simple linear regression model:
H0: β1 = 0 versus HA: β1 ≠ 0.
One is the t-test for the slope while the other is an analysis of variance (ANOVA) F-test.
1. Inference for the Population Intercept and Slope
Let's visit the example concerning the relationship between skin cancer mortality and state
latitude. The response variable y is the mortality rate (number of deaths per 10 million people) of
white males due to malignant skin melanoma from 1950-1959. The predictor variable x is the
latitude (degrees North) at the center of each of 49 states in the United States. A subset of the
data looks like this:
[Table: subset of the skin cancer mortality and latitude data]
A plot of the data with the estimated regression equation looks like:
[Figure: scatterplot of mortality versus latitude with the fitted regression line]
Is there a relationship between state latitude and skin cancer mortality? Certainly: since the
estimated slope of the line, b1, is -5.98, not 0, there is a relationship between state latitude and
skin cancer mortality in the sample of 49 data points. But we want to know whether there is a
relationship in the population of all latitudes and skin cancer mortality rates. That is,
we want to know whether the population slope β1 is unlikely to be 0.
First, we state the hypotheses: H0: β1 = 0 versus HA: β1 ≠ 0. Second, we calculate the test
statistic t* = b1 / se(b1), the estimated slope divided by its standard error. Third, we use the
resulting test statistic to calculate the P-value. The P-value is determined by referring to a
t-distribution with n-2 degrees of freedom.
If the P-value is smaller than the significance level α, we reject the null hypothesis in favor of the
alternative. We conclude "there is sufficient evidence at the α level to conclude that there is a
linear relationship in the population between the predictor x and response y."
If the P-value is larger than the significance level α, we fail to reject the null hypothesis. We
conclude "there is not enough evidence at the α level to conclude that there is a linear
relationship in the population between the predictor x and response y."
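The whole procedure fits in a few lines of Python. The sketch below uses a small made-up sample (the lesson's real data set has 49 latitude/mortality pairs) and computes the slope, its t statistic, and the two-sided P-value from a t-distribution with n - 2 degrees of freedom:

import numpy as np
from scipy import stats

# Hypothetical (latitude, mortality) sample, for illustration only.
x = np.array([30.0, 32.5, 35.0, 38.0, 40.5, 43.0, 45.5, 47.0])
y = np.array([220.0, 205.0, 190.0, 170.0, 150.0, 140.0, 125.0, 115.0])
n = len(x)

b1, b0 = np.polyfit(x, y, deg=1)   # OLS estimates of slope and intercept
y_hat = b0 + b1 * x

# Standard error of the slope, then the t statistic for H0: beta1 = 0.
mse = np.sum((y - y_hat) ** 2) / (n - 2)
se_b1 = np.sqrt(mse / np.sum((x - x.mean()) ** 2))
t_stat = b1 / se_b1

# Two-sided P-value from a t-distribution with n - 2 degrees of freedom.
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)
print(b1, t_stat, p_value)  # compare p_value with the significance level α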
Logistic Function
Logistic regression predicts the output of a categorical dependent variable, so the
outcome must be a categorical or discrete value.
It can be Yes or No, 0 or 1, True or False, etc., but instead of giving an exact value of
0 or 1, it gives probabilistic values that lie between 0 and 1.
In logistic regression, instead of fitting a straight regression line, we fit an "S"-shaped
logistic function, whose output is bounded by the two limiting values, 0 and 1.
The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
It maps any real value into another value within the range 0 to 1. The output of logistic
regression must stay between 0 and 1 and cannot go beyond this limit, so it forms a curve
like the letter "S".
The S-shaped curve is called the sigmoid function or the logistic function.
In logistic regression, we use the concept of a threshold value, which decides between the
two classes: values above the threshold tend toward 1, and values below the threshold tend
toward 0.
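A minimal sketch of the sigmoid and a 0.5 threshold (the threshold value itself is a modeling choice, not fixed by the method):

import numpy as np

def sigmoid(z):
    # Maps any real value into (0, 1), tracing the "S"-shaped curve.
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
p = sigmoid(z)
print(p)   # approximately [0.018, 0.269, 0.5, 0.731, 0.982]

# Apply the threshold (0.5 here) to turn probabilities into class labels.
labels = (p >= 0.5).astype(int)
print(labels)   # [0, 0, 1, 1, 1]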
Let's say that instead of y we model probabilities (P). But there is an issue here: a linear
model's predictions for P can exceed 1 or go below 0, and we know that the range of a
probability is (0, 1).
To overcome this issue we take the odds of P: odds = P / (1 - P). Odds are always positive,
which means their range is (0, +∞). Odds are nothing but the ratio of the probability of
success to the probability of failure.
It is still difficult to model a variable that has a restricted (one-sided) range. To fix this
we take the log of the odds, log(P / (1 - P)), which can take any real value and can therefore
be modeled as a linear function of the predictors.
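A quick numerical illustration of the chain probability → odds → log-odds, and of inverting the log-odds back to a probability (which is exactly the sigmoid):

import numpy as np

p = np.array([0.1, 0.5, 0.9])       # probabilities, restricted to (0, 1)

odds = p / (1 - p)                  # odds: range (0, +inf)
log_odds = np.log(odds)             # log-odds: range (-inf, +inf)
print(odds)      # [0.111..., 1.0, 9.0]
print(log_odds)  # [-2.197..., 0.0, 2.197...]

# Inverting the log-odds recovers the probability: this is the sigmoid again.
p_back = 1 / (1 + np.exp(-log_odds))
print(p_back)    # [0.1, 0.5, 0.9]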
The Pearson correlation coefficient (r) is the most common way of measuring a linear
correlation. It is a number between –1 and 1 that measures the strength and direction of the
relationship between two variables.
Pearson correlation coefficient (r) | Correlation type | Interpretation
Between 0 and 1 | Positive correlation | When one variable changes, the other variable changes in the same direction.
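For reference, a short Python sketch (made-up paired data) that computes r with numpy:

import numpy as np

# Hypothetical paired measurements, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is r.
r = np.corrcoef(x, y)[0, 1]
print(round(r, 4))   # close to +1: a strong positive linear correlation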
Numerical: