Day 2-Data Science

Linear regression is a statistical method used to model the relationship between dependent and independent variables by fitting a straight line to the data, primarily suitable for cross-sectional data. It requires certain assumptions to be met for valid results, such as linearity and independence of observations, and can be applied in various fields like economics and business. The coefficient of determination (R-squared) measures the variation explained by the model, while Adjusted R-squared accounts for the number of predictors, providing a more accurate assessment of model performance.


DAY 2

Linear Regression
Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables by fitting a straight line to the data. It assumes that changes in the independent variable(s) correspond to proportional changes in the dependent variable.
Linear Regression
A key element of this technique is that it contains a random component, which is ignored by models stated purely in theoretical terms.
For example, as a business owner, economic theory says that demand for my products is influenced by their price, consumers' incomes, the prices of similar goods, and consumers' tastes and preferences.
However, other factors could affect demand for the goods I produce that are not necessarily captured by the model.
The influence of these other factors is therefore absorbed by the random variable (error term) in the model.
Given the nature of this technique, it is mainly used for cross-sectional data (data captured at a single point in time) and is not advised for time series data (data captured continuously over a long period), as it cannot account for structural breaks and other factors that may have affected the outcome variable over time.
For example, consider a company that was established in 1950. As a market analyst, you are looking at the return on investment in marketing strategies over time. In the 1950s, marketing was harder: there was no internet, very few television stations, no social media, and so on. The situation today is very different.
A linear regression cannot account for such changes, which is why it is best suited to cross-sectional data. For time series data, more appropriate analysis techniques can be employed.
Key Notes
NB 1: The dependent variable must be a continuous variable for linear regression to be used.
NB 2: Vigilance should also be given to the nature of the independent variables. If all independent variables are quantitative, they can be used as they are. However, if any predictor variable is categorical, it must be coded; otherwise it will be treated as a continuous variable and the results will be biased. A clearer example to this effect shall be given as we proceed.
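As a sketch of NB 2, a categorical predictor can be dummy (one-hot) coded before it enters a regression. The data and column names below are invented for illustration; the coding itself uses pandas' `get_dummies`:

```python
import pandas as pd

# Hypothetical data: "region" is categorical and must be coded
# before it can be used as a predictor in a linear regression.
df = pd.DataFrame({
    "price": [10.0, 12.5, 9.0, 11.0],
    "region": ["north", "south", "east", "north"],
})

# One-hot (dummy) coding; drop_first=True drops one category to avoid
# the dummy-variable trap (perfect multicollinearity with the constant).
coded = pd.get_dummies(df, columns=["region"], drop_first=True)
print(coded.columns.tolist())
```

Without coding, a numeric label such as 1 = north, 2 = south, 3 = east would be treated as a continuous quantity, which is exactly the bias NB 2 warns about.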
Stages of linear regression
1. Specification of the model: determining the dependent and independent variables to be included in the model and the mathematical form of the model.
2. Estimation of the model: gathering data, examining problems and peculiarities within variables, and performing tests such as multicollinearity tests.
3. Evaluation of estimates: hypothesis testing to determine whether the calculated estimates are statistically reliable.
4. Determining the forecasting ability of the model: residual diagnostics such as heteroscedasticity tests to determine whether results from the model can be relied upon.
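The multicollinearity test mentioned in the estimation stage is often carried out with variance inflation factors (VIFs). Below is a minimal numpy-only sketch on synthetic data; a VIF above roughly 10 is commonly read as a sign of problematic collinearity:

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (n_samples x n_features).

    VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    column j on all the other columns (with an intercept).
    """
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])   # add intercept
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)  # nearly collinear with x1
x3 = rng.normal(size=200)                  # independent predictor
vifs = vif(np.column_stack([x1, x2, x3]))
# x1 and x2 should show large VIFs; x3 should be close to 1.
```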
Assumptions of Linear Regression
In statistics and data science, no model can be applied indiscriminately. For a model to produce statistically sound results, certain assumptions must hold. For a linear regression model, the following conditions MUST be met to obtain valid results.
• Linearity: The relationship between the independent and dependent variables must be linear.
• Independence: Observations are independent of each other.
• Homoscedasticity: Constant variance of residuals (errors) across
all levels of the independent variables.
• Normality of Residuals: Errors follow a normal distribution.
• No Multicollinearity (for multiple regression): Independent
variables should not be highly correlated.
• No autocorrelation of error terms: The error terms should be
independent of each other, even at different points in time.
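Several of these assumptions can be checked informally from the residuals of a fitted model. The sketch below uses synthetic data and crude numerical checks; real diagnostics would use formal tests such as Breusch-Pagan (heteroscedasticity) or Durbin-Watson (autocorrelation):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 300)
y = 2.0 + 3.0 * x + rng.normal(scale=1.0, size=300)  # homoscedastic errors

# Fit y = b0 + b1*x by ordinary least squares.
b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)

# Crude checks on the residuals:
mean_resid = resid.mean()                        # ~0 by construction of OLS
het = np.corrcoef(x, np.abs(resid))[0, 1]        # ~0 if homoscedastic
lag1 = np.corrcoef(resid[:-1], resid[1:])[0, 1]  # ~0 if no autocorrelation
```

If `het` or `lag1` were far from zero, the homoscedasticity or no-autocorrelation assumptions would be in doubt, and the model's standard errors could not be trusted.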
Some of the applications of Linear Regression
• Economics: predicting GDP growth based on various economic factors.
• Finance: estimating stock returns based on investments and other factors.
• Business: forecasting sales based on advertising spending and forecasting demand for products.
Simple Linear Regression
Yt = β0 + β1X1 + εt
Where:
• Yt is the outcome variable
• β0 is a constant term
• β1 is a parameter to be estimated
• X1 is a predictor variable
• εt is a random variable (error term)
Simple linear regression involves one independent variable (X) and one dependent variable (Y).
Coefficient of determination (R-squared)
This is used when there is one predictor variable in the model. It gives the percentage of the variation in the outcome variable that is explained by variation in the predictor variable.
It is interpreted in terms of percentages. For example, if the R-squared value in a simple linear regression is 0.85, it would mean that 85% of the variation in the outcome variable is explained by changes in the independent variable.
PRACTICAL EXAMPLE OF SIMPLE LINEAR
REGRESSION
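As a minimal sketch of a simple linear regression in practice, the following fits a line to synthetic data and computes R-squared. The data and the true relationship (Y = 5 + 2.5X plus noise, framed as advertising spend versus sales) are invented for illustration:

```python
import numpy as np

# Synthetic example: advertising spend (X) vs sales (Y).
rng = np.random.default_rng(1)
x = rng.uniform(1, 20, 100)
y = 5.0 + 2.5 * x + rng.normal(scale=2.0, size=100)

# Estimate beta1 (slope) and beta0 (constant) by least squares.
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

# Coefficient of determination: share of variation in Y explained by X.
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot
```

An `r_squared` of, say, 0.97 would be read as: 97% of the variation in sales is explained by variation in advertising spend.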
Multiple Linear Regression
Yt = β0 + β1X1 + β2X2 + ⋯ + βkXk + εt
Where:
• Yt is the outcome variable
• β0 is a constant term
• βi is a parameter to be estimated
• Xi is a predictor variable
• εt is a random variable (error term)
Adjusted coefficient of determination (Adjusted R-Squared)
In multiple linear regression, adding predictor variables never lowers the R-squared value, even if some of the added variables are irrelevant to the dependent variable.
For example, consider a model where demand for a business product is assumed to be influenced by price, income level, and people's perception of the product. If this model initially returns an R-squared value of 0.75, adding new predictors such as customer height and weight would still increase the R-squared value, say to 0.88, even though they have no logical connection to demand.
Ordinarily, a higher R-squared value suggests a better
model, but in this case, common sense tells us that
height and weight are irrelevant to demand. This
highlights a key limitation of R-squared: it does not
account for the significance of predictors, meaning it
can be misleading when unnecessary variables are
included.
To address this, we use the Adjusted R-squared, which adjusts for the number of predictors by penalizing the model for each additional variable. Unlike R-squared, Adjusted R-squared falls when an added predictor does not improve the model enough to justify the extra complexity, making it a more reliable measure of model performance. This ensures that the model is judged on meaningful relationships between the independent variables and the dependent variable.
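The contrast between R-squared and Adjusted R-squared can be sketched numerically. The data below are synthetic: demand depends on price and income, and ten pure-noise columns play the role of irrelevant predictors such as height and weight:

```python
import numpy as np

def r2_and_adj(X, y):
    """R-squared and Adjusted R-squared for OLS with an intercept."""
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)
    adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    return r2, adj

rng = np.random.default_rng(7)
n = 60
price = rng.normal(size=n)
income = rng.normal(size=n)
demand = 4.0 - 1.5 * price + 2.0 * income + rng.normal(size=n)

X1 = np.column_stack([price, income])
# Add ten irrelevant predictors (standing in for "height", "weight", ...):
X2 = np.column_stack([X1, rng.normal(size=(n, 10))])

r2_a, adj_a = r2_and_adj(X1, demand)
r2_b, adj_b = r2_and_adj(X2, demand)
# R-squared never decreases when predictors are added; Adjusted
# R-squared is penalized for the ten irrelevant columns.
```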
