
Chapter 2

Regression

Assumptions in Linear regression:


In linear regression, several key assumptions are made about the data and the relationship
between the independent and dependent variables. These assumptions ensure that the model is
appropriate and that the results of the regression analysis are reliable. The main assumptions in
linear regression are:
1. Linearity:
• The relationship between the dependent variable (Y) and the independent variables (X) is assumed to be linear. This means that the change in Y is proportional to a change in X. Mathematically, this is represented as:

Y = β0 + β1X1 + β2X2 + … + βnXn + ε

2. Independence of Errors (No Autocorrelation):
• The residuals (errors) are assumed to be independent of each other. In other words, the error for one observation should not provide any information about the error for another observation. This assumption is important for ensuring that the model does not overestimate the goodness of fit due to correlated residuals.
• For example, in stock market data, sales data, or any scenario where previous values can influence future values, the error terms for successive observations can become correlated.
3. Homoscedasticity:
• The variance of the residuals (errors) should be constant across all levels of the independent variable(s). This is known as homoscedasticity. If the variance of the residuals changes as the value of X changes, it's called heteroscedasticity, which can lead to inefficiency in the model.
• Homoscedasticity means that the spread of the residuals is uniform across all predicted values.

4. Normality of Errors:
• The residuals (errors) of the model should be approximately normally distributed. This assumption is important for hypothesis testing (e.g., t-tests for the regression coefficients) and for constructing confidence intervals.
5. No Perfect Multicollinearity:
• The independent variables should not be perfectly correlated with each other. If two or more predictors are highly correlated, the model may have difficulty estimating their individual effects, which can lead to unstable coefficient estimates (high variance). This is known as multicollinearity.
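
These assumptions are usually checked with residual diagnostics. The sketch below is only a minimal illustration on synthetic data generated on the spot: it uses the Durbin-Watson statistic for independence of errors, a crude comparison of residual spread for homoscedasticity, a Shapiro-Wilk test for normality, and variance inflation factors (VIF) for multicollinearity.

import numpy as np
import statsmodels.api as sm
from scipy.stats import shapiro
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical data for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                      # two independent variables
y = 3 + 2 * X[:, 0] - 1 * X[:, 1] + rng.normal(size=100)

X_const = sm.add_constant(X)                       # add intercept column
model = sm.OLS(y, X_const).fit()
resid = model.resid

# 2. Independence of errors: Durbin-Watson near 2 suggests no autocorrelation
print("Durbin-Watson:", durbin_watson(resid))

# 3. Homoscedasticity: compare residual spread for low vs. high fitted values
low = model.fittedvalues < np.median(model.fittedvalues)
print("Residual std (low / high fitted):", resid[low].std(), resid[~low].std())

# 4. Normality of errors: Shapiro-Wilk test (large p-value is consistent with normality)
print("Shapiro-Wilk p-value:", shapiro(resid).pvalue)

# 5. No perfect multicollinearity: VIF well below ~10 is usually acceptable
for i in range(1, X_const.shape[1]):
    print(f"VIF for X{i}:", variance_inflation_factor(X_const, i))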

R squared (Coefficient of determination)

R-squared is a statistical measure that represents the goodness of fit of a regression model. Its value lies between 0 and 1. R-squared equals 1 when the model fits the data perfectly and there is no difference between the predicted values and the actual values. R-squared equals 0 when the model explains none of the variability in the dependent variable, i.e., it has learned no relationship between the dependent and independent variables. It is computed as

R² = 1 − SSE / SST

SSE is the sum of the squared differences between the actual dependent variable values and the values predicted by the regression model.

SST is the total variation in the dependent variable and is calculated by summing the squared differences between each actual dependent variable value and the mean of all dependent variable values.
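
As a small illustration, R² can be computed directly from these two quantities; the numbers below are made up:

import numpy as np

# Hypothetical actual and predicted values for illustration
y_actual = np.array([3.0, 5.0, 7.0, 9.0])
y_pred   = np.array([2.8, 5.1, 7.3, 8.9])

sse = np.sum((y_actual - y_pred) ** 2)           # sum of squared errors
sst = np.sum((y_actual - y_actual.mean()) ** 2)  # total sum of squares
r_squared = 1 - sse / sst
print(r_squared)   # close to 1, since the predictions track the data closely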
Gauss Markov Theorem:

The Gauss-Markov theorem states that if your linear regression model satisfies the classical
assumptions, then ordinary least squares (OLS) regression produces unbiased estimates that
have the smallest variance of all possible linear estimators.

The Gauss-Markov theorem famously states that OLS is BLUE: the Best Linear Unbiased Estimator.

What Does OLS Estimate?

In regression analysis, the goal is to draw a random sample from a population and use it to estimate the properties of that population. The coefficients in the regression equation are estimates of the actual population parameters.

The notation for the model of a population is the following:

Y = β0 + β1X1 + β2X2 + … + βkXk + ε

The betas (β) represent the population parameter for each term in the model.

Epsilon (ε) represents the random error that the model doesn’t explain.

Unfortunately, we’ll never know these population values because it is generally impossible to measure the entire population. Instead, we’ll obtain estimates of them using our random sample.

The notation for an estimated model from a random sample is the following:

ŷ = b0 + b1X1 + b2X2 + … + bkXk

Sampling Distributions of the Parameter Estimates

Imagine that we repeat the same study many times. We collect random samples of the same size,
from the same population, and fit the same OLS regression model repeatedly. Each random
sample produces different estimates for the parameters in the regression equation. After this
process, we can graph the distribution of estimates for each parameter. Statisticians refer to this
type of distribution as a sampling distribution, which is a type of probability distribution.

1. Unbiased Estimates: Sampling Distributions Centered on the True Population Parameter

In the graph below, beta represents the true population value. The curve on the right centers on a value that is too high. This model tends to produce estimates that are too high, which is a positive bias; it is not correct on average. However, the curve on the left centers on the actual value of beta. That model produces parameter estimates that are correct on average: the expected value is the actual value of the population parameter.

2. Minimum Variance: Sampling Distributions are Tight Around the Population Parameter

In the graph below, both curves center on beta. However, one curve is wider than the other because the variances are different. Broader curves indicate that there is a higher probability that the estimates will be further away from the correct value.

The “Best” in BLUE refers to the sampling distribution with the minimum variance. That’s the tightest possible distribution of all unbiased linear estimation methods!
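
A short simulation can make the idea of a sampling distribution concrete. The sketch below assumes a population with known parameters (β0 = 1, β1 = 2, chosen only for illustration), repeatedly draws random samples, fits OLS to each, and inspects the spread of the slope estimates:

import numpy as np

rng = np.random.default_rng(42)
true_beta0, true_beta1 = 1.0, 2.0        # assumed population parameters

estimates = []
for _ in range(5000):                    # repeat the "study" many times
    x = rng.uniform(0, 10, size=50)
    y = true_beta0 + true_beta1 * x + rng.normal(0, 1, size=50)
    b1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)   # OLS slope estimate
    estimates.append(b1)

estimates = np.array(estimates)
print("mean of estimates:", estimates.mean())   # ~2.0, i.e. unbiased
print("std of estimates :", estimates.std())    # spread of the sampling distribution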

Simple Linear Regression Evaluation

This lesson presents two alternative methods for testing whether a linear association exists
between the predictor x and the response y in a simple linear regression model:
H0: β1 = 0 versus HA: β1 ≠ 0.
One is the t-test for the slope while the other is an analysis of variance (ANOVA) F-test.
1. Inference for the Population Intercept and Slope

Let's visit the example concerning the relationship between skin cancer mortality and state
latitude. The response variable y is the mortality rate (number of deaths per 10 million people) of
white males due to malignant skin melanoma from 1950-1959. The predictor variable x is the
latitude (degrees North) at the center of each of 49 states in the United States. A subset of the
data looks like this:

#    State        Latitude   Mortality
1    Alabama      33.0       219
2    Arizona      34.5       160
3    Arkansas     35.0       170
4    California   37.5       182
5    Colorado     39.0       149
...  ...          ...        ...
49   Wyoming      43.0       134

A plot of the data with the estimated regression line shows skin cancer mortality decreasing as state latitude increases.

Is there a relationship between state latitude and skin cancer mortality? Certainly, since the
estimated slope of the line, b1, is -5.98, not 0, there is a relationship between state latitude and
skin cancer mortality in the sample of 49 data points. But, we want to know if there is a
relationship between the population of all of the latitudes and skin cancer mortality rates. That is,
we want to know if the population slope β1 is unlikely to be 0.

An α-level hypothesis test for the slope parameter β1


We follow standard hypothesis test procedures in conducting a hypothesis test for the
slope β1. First, we specify the null and alternative hypotheses:
Null hypothesis H0 : β1 = 0
Alternative hypothesis HA : β1 ≠ 0
Second, we calculate the value of the test statistic using the following formula:

t* = b1 / se(b1)

where b1 is the estimated slope and se(b1) is its standard error. Third, we use the resulting test statistic to calculate the P-value. The P-value is determined by referring to a t-distribution with n − 2 degrees of freedom.

Finally, we make a decision:

• If the P-value is smaller than the significance level α, we reject the null hypothesis in favor of the alternative. We conclude "there is sufficient evidence at the α level to conclude that there is a linear relationship in the population between the predictor x and response y."
• If the P-value is larger than the significance level α, we fail to reject the null hypothesis. We conclude "there is not enough evidence at the α level to conclude that there is a linear relationship in the population between the predictor x and response y."
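
As an illustration, the sketch below carries out this t-test on the six rows of the skin-cancer data shown earlier (the full dataset has 49 states, so the resulting numbers are illustrative only):

import numpy as np
from scipy import stats

# Six rows of the latitude/mortality data from the table above
x = np.array([33.0, 34.5, 35.0, 37.5, 39.0, 43.0])
y = np.array([219., 160., 170., 182., 149., 134.])
n = len(x)

# OLS estimates of the slope and intercept
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# Standard error of the slope
resid = y - (b0 + b1 * x)
mse = np.sum(resid ** 2) / (n - 2)
se_b1 = np.sqrt(mse / np.sum((x - x.mean()) ** 2))

# Test statistic and two-sided P-value with n - 2 degrees of freedom
t_star = b1 / se_b1
p_value = 2 * stats.t.sf(abs(t_star), df=n - 2)
print(t_star, p_value)   # reject H0 if p_value < alpha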
Logistic Regression

1. Logistic regression is a supervised machine learning algorithm used for classification tasks, where the goal is to predict the probability that an instance belongs to a given class or not.
2. It uses a sigmoid function, which takes the independent variables as input and produces a probability value between 0 and 1.

• Logistic regression predicts the output of a categorical dependent variable. Therefore, the outcome must be a categorical or discrete value.
• It can be Yes or No, 0 or 1, True or False, etc., but instead of giving the exact values 0 and 1, it gives probabilistic values which lie between 0 and 1.
• In logistic regression, instead of fitting a regression line, we fit an “S”-shaped logistic function, which predicts two maximum values (0 or 1).

Logistic Function – Sigmoid Function

• The sigmoid function is a mathematical function used to map the predicted values to probabilities.
• It maps any real value into another value within the range of 0 and 1. The output of logistic regression must lie between 0 and 1 and cannot go beyond this limit, so it forms an “S”-shaped curve.
• The S-shaped curve is called the sigmoid function or the logistic function.
• In logistic regression, we use the concept of a threshold value, which separates the predictions into 0 and 1: values above the threshold tend toward 1, and values below the threshold tend toward 0.

Terminologies involved in Logistic Regression


Here are some common terms involved in logistic regression:
• Independent variables: The input characteristics or predictor factors applied to the dependent variable’s predictions.
• Dependent variable: The target variable in a logistic regression model, which we are trying to predict.
• Logistic function: The formula used to represent how the independent and dependent variables relate to one another. The logistic function transforms the input variables into a probability value between 0 and 1, which represents the likelihood of the dependent variable being 1 or 0.
• Odds: The ratio of something occurring to something not occurring. It is different from probability, as probability is the ratio of something occurring to everything that could possibly occur.
• Log-odds: The log-odds, also known as the logit function, is the natural logarithm of the odds. In logistic regression, the log odds of the dependent variable are modeled as a linear combination of the independent variables and the intercept.
• Coefficient: The logistic regression model’s estimated parameters, which show how the independent and dependent variables relate to one another.
• Intercept: A constant term in the logistic regression model, which represents the log odds when all independent variables are equal to zero.
• Maximum likelihood estimation: The method used to estimate the coefficients of the logistic regression model, which maximizes the likelihood of observing the data given the model.
The formula of the logit function is:

logit(P) = log(P / (1 − P))

The equation of the best-fit line in linear regression is

y = β0 + β1x

Let’s say that instead of y we are taking probabilities (P). But there is an issue here: the value of (P) could exceed 1 or go below 0, and we know that the range of a probability is (0, 1). To overcome this issue we take the “odds” of P:

odds = P / (1 − P)

Odds are always positive, which means the range will always be (0, +∞). Odds are nothing but the ratio of the probability of success to the probability of failure.

It is difficult to model a variable that has a restricted range. To control this we take the log of odds, which has a range of (−∞, +∞):

log(P / (1 − P)) = β0 + β1x

Solving for P gives our logistic function, also called the sigmoid function:

P = 1 / (1 + e^−(β0 + β1x))

The sigmoid squeezes a straight line into an S-curve.
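
A minimal sketch of the sigmoid mapping and the threshold rule described above, with coefficients β0 and β1 assumed purely for illustration:

import numpy as np

def sigmoid(z):
    """Map any real value z to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Assumed coefficients for illustration
beta0, beta1 = -4.0, 1.5
x = np.array([0.5, 2.0, 3.5, 5.0])

log_odds = beta0 + beta1 * x          # linear combination on the logit scale
probs = sigmoid(log_odds)             # squeezed into (0, 1)
labels = (probs >= 0.5).astype(int)   # threshold at 0.5

print(probs)    # probabilities between 0 and 1
print(labels)   # 0/1 class predictions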

Pearson Correlation Coefficient (r):

The Pearson correlation coefficient (r) is the most common way of measuring a linear
correlation. It is a number between –1 and 1 that measures the strength and direction of the
relationship between two variables.
Pearson correlation coefficient (r)   Correlation type       Interpretation
Between 0 and 1                       Positive correlation   When one variable changes, the other variable changes in the same direction.
0                                     No correlation         There is no relationship between the variables.
Between 0 and –1                      Negative correlation   When one variable changes, the other variable changes in the opposite direction.

Visualizing the Pearson correlation coefficient


The Pearson correlation coefficient also tells you whether the slope of the line of best fit is
negative or positive. When the slope is negative, r is negative. When the slope is positive, r is
positive.
When r is 1 or –1, all the points fall exactly on the line of best fit:

Calculating the Pearson correlation coefficient

r = cov(X, Y) / (σX · σY)

or, equivalently, in terms of the raw observations,

r = Σ(xi − x̄)(yi − ȳ) / √( Σ(xi − x̄)² · Σ(yi − ȳ)² )
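
As a quick illustration on made-up data, r can be computed either directly from the formula above or with NumPy’s built-in helper:

import numpy as np

# Hypothetical paired observations
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Pearson r from the definitional formula
r_manual = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2))

r_builtin = np.corrcoef(x, y)[0, 1]
print(r_manual, r_builtin)   # both close to +1: strong positive correlation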

Multiple Linear Regression:


Multiple linear regression is used to estimate the relationship between two or more
independent variables and one dependent variable.

Assumptions of multiple linear regression:


Multiple linear regression makes all of the same assumptions as simple linear regression:

Homogeneity of variance (homoscedasticity): the size of the error in our prediction doesn’t change significantly across the values of the independent variable.

Independence of observations: the observations in the dataset were collected using statistically valid sampling methods, and there are no hidden relationships among variables. In multiple linear regression, it is possible that some of the independent variables are actually correlated with one another, so it is important to check this before developing the regression model. If two independent variables are too highly correlated (r² > ~0.6), then only one of them should be used in the regression model.

Normality: the data follows a normal distribution.

Linearity: the line of best fit through the data points is a straight line, rather than a curve or some sort of grouping factor.
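
One simple way to screen for highly correlated predictors before fitting, as suggested above, is to inspect the pairwise correlation matrix. The sketch below uses synthetic predictors, one of which is deliberately constructed to be correlated with another:

import numpy as np

# Hypothetical predictors: three independent variables, one row per variable
rng = np.random.default_rng(1)
x1 = rng.normal(size=50)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=50)   # deliberately correlated with x1
x3 = rng.normal(size=50)

corr = np.corrcoef(np.vstack([x1, x2, x3]))
print(np.round(corr, 2))
# If r**2 between two predictors exceeds ~0.6, consider keeping only one of them.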

How to perform a multiple linear regression

Multiple linear regression formula

The formula for a multiple linear regression is:

ŷ = β0 + β1X1 + β2X2 + … + βnXn + ε

• ŷ = the predicted value of the dependent variable
• β0 = the y-intercept (value of y when all other parameters are set to 0)
• β1X1 = the regression coefficient (β1) of the first independent variable (X1) (a.k.a. the effect that increasing the value of the independent variable has on the predicted y value)
• … = do the same for however many independent variables you are testing
• βnXn = the regression coefficient of the last independent variable
• ε = model error (a.k.a. how much variation there is in our estimate of ŷ)
To find the best-fit line for each independent variable, multiple linear regression calculates three things:
• The regression coefficients that lead to the smallest overall model error.
• The t statistic of the overall model.
• The associated p value (how likely it is that the t statistic would have occurred by chance if the null hypothesis of no relationship between the independent and dependent variables was true).
It then calculates the t statistic and p value for each regression coefficient in the model.
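
A sketch of how this might look in practice with statsmodels, using synthetic data with two independent variables (the numbers are assumptions for illustration only):

import numpy as np
import statsmodels.api as sm

# Hypothetical data: two independent variables and one dependent variable
rng = np.random.default_rng(7)
X = rng.normal(size=(80, 2))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.8, size=80)

X_const = sm.add_constant(X)        # adds the intercept term β0
model = sm.OLS(y, X_const).fit()

print(model.params)                 # β0, β1, β2 estimates
print(model.tvalues)                # t statistic for each coefficient
print(model.pvalues)                # p value for each coefficient
print(model.summary())              # full table, including the overall fit statistics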

Principal Component Analysis:

PCA stands for Principal Component Analysis. It is a dimensionality reduction technique


commonly used in data analysis and machine learning. The primary goal of PCA is to reduce the
dimensionality of a dataset while preserving as much of the variance or information present in
the data as possible.
PCA achieves this by transforming the original variables into a new set of variables, called
principal components. These principal components are linear combinations of the original
variables and are orthogonal to each other, meaning they are uncorrelated. The first principal
component accounts for the largest possible variance in the data, the second principal component
for the second largest variance, and so on.
In essence, PCA helps in simplifying the complexity of high-dimensional data by capturing the
most important patterns or directions of variation in the data, thereby enabling easier
visualization, exploration, and analysis of the dataset. It is widely used in various fields such as
image processing, signal processing, finance, and bioinformatics, among others.
The principal components (PCs) in PCA are derived through linear algebra techniques, primarily
involving eigenvalue decomposition or singular value decomposition (SVD) of the covariance
matrix of the original data. Here's a brief overview of the mathematics behind PCA:
1. Centering the data: First, the mean of each feature (variable) is subtracted from the dataset.
This step ensures that the data is centered around the origin.
2. Covariance matrix: The covariance matrix is calculated for the centered data. This matrix
represents the pairwise covariances between all pairs of features.
3. Eigenvalue decomposition (EVD): The covariance matrix is decomposed into its eigenvectors and eigenvalues. The eigenvectors represent the directions (principal components) of maximum variance in the data, and the corresponding eigenvalues represent the magnitude of variance along those directions. The eigenvectors are usually sorted in descending order based on their corresponding eigenvalues, so the first principal component (PC1) captures the most variance, the second principal component (PC2) captures the second most variance, and so on.
4. Selecting principal components: After obtaining the eigenvectors or singular vectors, the
desired number of principal components is selected based on the explained variance or the
application's requirements. Typically, one can select a subset of the principal components that
capture most of the variance in the data.
5. Projection: Finally, the original data is projected onto the selected principal components to
obtain the reduced-dimensional representation of the data. This is achieved by taking the dot
product of the centered data matrix with the matrix of selected principal components.
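
The five steps above can be written out directly in NumPy. The sketch below uses a small made-up data matrix and keeps the top two components:

import numpy as np

# Hypothetical data: 6 observations, 3 features
X = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 1.2],
              [2.2, 2.9, 0.4],
              [1.9, 2.2, 0.8],
              [3.1, 3.0, 0.3],
              [2.3, 2.7, 0.6]])

# 1. Center the data
X_centered = X - X.mean(axis=0)

# 2. Covariance matrix of the centered data
cov = np.cov(X_centered, rowvar=False)

# 3. Eigenvalue decomposition (eigh is used because cov is symmetric)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]          # sort by descending eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Select the top k principal components
k = 2
components = eigvecs[:, :k]
explained = eigvals[:k] / eigvals.sum()    # fraction of variance explained

# 5. Project the centered data onto the selected components
X_reduced = X_centered @ components
print(explained)
print(X_reduced)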

Numerical:

To compute PCA on a dataset, apply the steps above in order: (1) center the data, (2) compute the covariance matrix, (3) perform the eigenvalue decomposition, (4) select the leading principal components, and (5) project the centered data onto them.

Linear Discriminant Analysis:

Linear Discriminant Analysis (LDA) is a supervised dimensionality reduction and classification
technique commonly used in pattern recognition, machine learning, and statistics. It is
particularly useful when dealing with high-dimensional data and aims to maximize the
separation between multiple classes.
LDA projects high-dimensional data onto a lower-dimensional space while preserving class separability. It finds a linear combination of features that best separates two or more classes. Unlike Principal Component Analysis (PCA), which is unsupervised and focuses on variance, LDA is supervised and optimizes for class separability.
1. Mathematical Foundation of LDA
LDA works by computing discriminant axes that maximize the ratio of between-class variance
to within-class variance.
Step 1: Compute Class Means and the Overall Mean

For a dataset with c classes, compute the mean vector μi of each class and the overall mean vector μ of all samples.
Step 2: Compute Scatter Matrices

• Within-Class Scatter Matrix (Sw): measures the spread of data points within each class:

Sw = Σi Σ(x in class i) (x − μi)(x − μi)ᵀ

• Between-Class Scatter Matrix (Sb): measures the spread between the different class means:

Sb = Σi Ni (μi − μ)(μi − μ)ᵀ

where Ni is the number of samples in class i.
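
A small NumPy sketch of these two scatter matrices, using a made-up two-class dataset; the discriminant directions are then the leading eigenvectors of Sw⁻¹Sb:

import numpy as np

# Hypothetical two-class data: 4 samples per class, 2 features
X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2], [0.8, 1.9],
              [4.0, 4.5], [4.2, 4.8], [3.8, 4.4], [4.5, 5.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

overall_mean = X.mean(axis=0)
n_features = X.shape[1]
S_w = np.zeros((n_features, n_features))   # within-class scatter
S_b = np.zeros((n_features, n_features))   # between-class scatter

for c in np.unique(y):
    X_c = X[y == c]
    mean_c = X_c.mean(axis=0)
    # within-class: spread of samples around their own class mean
    S_w += (X_c - mean_c).T @ (X_c - mean_c)
    # between-class: spread of class means around the overall mean, weighted by class size
    diff = (mean_c - overall_mean).reshape(-1, 1)
    S_b += len(X_c) * (diff @ diff.T)

# Leading eigenvector of inv(Sw) @ Sb gives the main discriminant direction
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_w) @ S_b)
w = eigvecs[:, np.argsort(eigvals.real)[::-1][0]].real
print(np.round(S_w, 2))
print(np.round(S_b, 2))
print(np.round(w, 3))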
Key Properties of LDA
• Maximizes Class Separation – Unlike PCA, which maximizes variance, LDA optimizes class discrimination.
• Handles Multi-Class Problems – Can extend beyond binary classification to multiple classes.
• Feature Reduction – Projects data onto a lower-dimensional space, reducing computational complexity.

LDA vs. PCA: Key Differences

Feature                    LDA                                  PCA
Type                       Supervised                           Unsupervised
Goal                       Maximizes class separability         Maximizes variance
Uses Class Labels?         Yes                                  No
Dimensionality Reduction   Yes, while preserving class info     Yes, but may not preserve class info
Best Use Case              Classification problems              Feature extraction
