Day 2-Data Science
Linear Regression
Linear regression is a statistical method for modeling the relationship
between one or more predictor variables and a continuous outcome
variable.
The most important element of this technique is that it contains a
random element, which is ignored by models that are stated purely
theoretically.
For example, as a business owner, I would learn from economic theory
that demand for my products is influenced by their price, the income of
consumers, the prices of similar goods, and the tastes and
preferences of consumers.
However, there are other factors that could affect demand for the
goods I produce, which are not necessarily captured by the model.
The influence of other factors is therefore taken care of by the
random variable in this model.
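The role of the random term can be illustrated with a small simulation. This is a minimal sketch under assumed numbers: the "true" intercept, slope, noise level and sample size below are invented for illustration and do not come from any real business.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Assumed "true" relationship: demand = 200 - 2 * price + u
true_intercept, true_slope = 200.0, -2.0
prices = [rng.uniform(10, 50) for _ in range(200)]

# u bundles every influence the model leaves out (tastes, weather, luck, ...).
errors = [rng.gauss(0, 5) for _ in range(200)]
demand = [true_intercept + true_slope * p + u for p, u in zip(prices, errors)]

# Ordinary least squares estimates for a single predictor.
mean_p = sum(prices) / len(prices)
mean_d = sum(demand) / len(demand)
slope = (sum((p - mean_p) * (d - mean_d) for p, d in zip(prices, demand))
         / sum((p - mean_p) ** 2 for p in prices))
intercept = mean_d - slope * mean_p

print(round(slope, 2), round(intercept, 1))
```

Even though no single observation sits exactly on the line, the estimates land close to the assumed values, because the random term absorbs the unmodeled influences.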
Given the nature of this technique, it is mainly used for cross-sectional data (data
captured at a single point in time) and is not advised for time series data (data
captured repeatedly over a long period of time), as it cannot account for
structural breaks and other factors that may have affected the outcome
variable over that period.
For example, consider a company that was established in 1950. As a market
analyst, you are looking at the return on investment in marketing strategies over
time. In the 1950s, marketing was harder, as there was no internet, very few
television stations, no social media and so on. Today, the situation is
different.
A single linear regression fitted over the whole period cannot account for such
changes, and that is why it is best suited to cross-sectional data. For time series
data, there are more appropriate analysis techniques that can be employed.
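A structural break can be sketched with two made-up eras of marketing data. The spend and sales figures below are purely hypothetical; the point is only that one pooled line describes neither era well.

```python
def ols_slope(x, y):
    """Least-squares slope of y on x (simple regression with an intercept)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx

# Hypothetical marketing spend and sales in two eras (illustrative numbers only).
spend = [1.0, 2.0, 3.0, 4.0]
sales_early = [1.0 * s for s in spend]   # early era: each unit of spend returns 1
sales_recent = [3.0 * s for s in spend]  # recent era: each unit of spend returns 3

slope_early = ols_slope(spend, sales_early)
slope_recent = ols_slope(spend, sales_recent)
slope_pooled = ols_slope(spend + spend, sales_early + sales_recent)

print(slope_early, slope_recent, slope_pooled)
```

The pooled slope falls between the two era-specific slopes, so it understates the recent return on marketing and overstates the early one.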
Key Notes
NB 1: The dependent variable must be a continuous variable for linear
regression to be used.
NB 2: Vigilance should also be given to the nature of independent variables. If
all independent variables are quantitative, then they can be used as they are.
However, in case any predictor variable is categorical, then it must be coded;
otherwise, it will be considered as a continuous variable, and the results shall
be biased. A clearer example to this effect shall be given as we proceed.
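One common way to code a categorical predictor is dummy (0/1) coding. The sketch below is an assumption about what "coded" means here, and the `regions` variable and its categories are hypothetical.

```python
# Hypothetical categorical predictor: the region each customer lives in.
regions = ["north", "south", "east", "south", "north"]

# Problematic: arbitrary integer codes imply an order and equal spacing
# between regions, so the model would treat region as a continuous variable.
naive_codes = {"north": 1, "south": 2, "east": 3}
naive = [naive_codes[r] for r in regions]

# Dummy coding: one 0/1 column per category, dropping one reference category
# so the columns are not perfectly collinear with the constant term.
categories = sorted(set(regions))   # alphabetical order of the distinct values
reference, *kept = categories       # first category becomes the reference
dummies = [[1 if r == c else 0 for c in kept] for r in regions]

for region, row in zip(regions, dummies):
    print(region, row)
```

Each coefficient on a dummy column is then read as the difference from the reference category, rather than as the effect of a one-unit step along an artificial scale.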
Stages of linear regression.
Specification of the model: This stage involves determining the dependent
and independent variables to be included in the model and the mathematical
form of the model.
Estimation of the model: This involves gathering data, examining problems
and peculiarities within variables, and performing tests such as
multicollinearity tests.
Evaluation of estimates: This involves hypothesis testing to determine
whether the calculated estimates are statistically reliable.
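The three stages can be walked through on a toy dataset. This is a minimal sketch with invented numbers for a single predictor; the t-test on the slope is one common choice of evaluation, not the only one.

```python
import math

# Hypothetical data: advertising spend (x) and sales (y); illustrative only.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

# Stage 1, specification: y = b0 + b1 * x + error (one predictor, linear form).
n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Stage 2, estimation: ordinary least squares estimates of b1 and b0.
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = mean_y - b1 * mean_x

# Stage 3, evaluation: t-statistic for the null hypothesis b1 = 0.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
sigma2 = sum(r ** 2 for r in residuals) / (n - 2)   # residual variance
se_b1 = math.sqrt(sigma2 / sxx)                     # standard error of b1
t_stat = b1 / se_b1

# With n - 2 = 6 degrees of freedom, the two-sided 5% critical value is 2.447.
print(round(b1, 3), round(b0, 3), t_stat > 2.447)
```

If the t-statistic exceeds the critical value in absolute terms, the slope estimate is treated as statistically distinguishable from zero at that level.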
Applications of Linear Regression
• Predicting growth based on various economic factors.
• Predicting returns based on investments and other factors.
Multiple Linear Regression
𝑌𝑡 = 𝛽0 + 𝛽1𝑋1 + 𝛽2𝑋2 + ... + 𝛽𝑘𝑋𝑘 + 𝜀𝑡
• Where;
• 𝑌𝑡 is the outcome variable
• 𝛽0 is a constant term
• 𝛽𝑖 is a parameter to be estimated
• 𝑋𝑖 is a predictor variable
• 𝜀𝑡 is the random (error) term
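Once the parameters have been estimated, a fitted value is the constant term plus each parameter multiplied by its predictor. A minimal sketch follows; the `predict` helper and the coefficient values for price, income and a perception score are made up for illustration.

```python
def predict(beta0, betas, xs):
    """Return beta0 + sum(beta_i * x_i) for one observation."""
    if len(betas) != len(xs):
        raise ValueError("need exactly one coefficient per predictor")
    return beta0 + sum(b * x for b, x in zip(betas, xs))

# Hypothetical estimates: constant, then price, income, perception score.
beta0 = 120.0
betas = [-2.0, 0.5, 3.0]

# One hypothetical customer segment: price 10, income 40, perception 4.
demand_hat = predict(beta0, betas, [10.0, 40.0, 4.0])
print(demand_hat)
```

The random term does not appear in the prediction because its expected value is taken to be zero; it shows up instead as the gap between fitted and observed values.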
Adjusted coefficient of determination (Adjusted R-Squared)
In multiple linear regression, increasing the number of predictor
variables never lowers the R-squared value and in practice almost
always raises it, even if some of the added variables are irrelevant
to the dependent variable.
For example, consider a model where demand for a business
product is assumed to be influenced by price, income level, and
people's perception of the product. If this model initially
returns an R-squared value of 0.75, adding new predictors such
as customer height and weight, even though they have no logical
connection to demand, would still increase the R-squared value,
say to 0.88.
Ordinarily, a higher R-squared value suggests a better
model, but in this case, common sense tells us that
height and weight are irrelevant to demand. This
highlights a key limitation of R-squared: it does not
account for the significance of predictors, meaning it
can be misleading when unnecessary variables are
included.
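This behaviour can be checked on simulated data. The sketch below is a plain-Python least-squares fit under assumed data: every coefficient, distribution and sample size is invented, and `height` and `weight` are generated independently of demand on purpose.

```python
import random

def r_squared(columns, y):
    """Fit OLS with an intercept via the normal equations; return R-squared."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    # Augmented normal equations [X'X | X'y].
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(p)]
         + [sum(X[r][i] * y[r] for r in range(n))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    beta = [A[i][p] / A[i][i] for i in range(p)]
    fitted = [sum(b * x for b, x in zip(beta, row)) for row in X]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

rng = random.Random(1)
n = 60
price = [rng.uniform(10, 50) for _ in range(n)]
income = [rng.gauss(50, 10) for _ in range(n)]
demand = [200 - 2 * p + 0.5 * inc + rng.gauss(0, 8)
          for p, inc in zip(price, income)]
height = [rng.gauss(170, 10) for _ in range(n)]  # unrelated to demand
weight = [rng.gauss(70, 12) for _ in range(n)]   # unrelated to demand

r2_base = r_squared([price, income], demand)
r2_padded = r_squared([price, income, height, weight], demand)
print(r2_base, r2_padded)
```

The padded model's R-squared is at least as large as the base model's, even though the extra columns carry no information about demand.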
To address this, we use the Adjusted R-squared, which
penalizes the R-squared value for the number of predictors
in the model relative to the number of observations.
Unlike R-squared, Adjusted R-squared decreases when an
added predictor contributes little explanatory power, as
irrelevant predictors typically do, making it a more
reliable measure for comparing models. This discourages
padding the model with variables that have no meaningful
relationship with the dependent variable.
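The usual definition is Adjusted R-squared = 1 - (1 - R-squared)(n - 1)/(n - k - 1), where n is the number of observations and k the number of predictors. A short sketch of the arithmetic follows; the sample size of 30 and the R-squared values are hypothetical.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R-squared for a model that includes a constant term."""
    if n_obs - n_predictors - 1 <= 0:
        raise ValueError("need more observations than estimated parameters")
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Hypothetical case: three relevant predictors, then two irrelevant ones added.
# R-squared creeps up slightly (0.750 -> 0.752), as it does with any addition.
before = adjusted_r_squared(0.750, n_obs=30, n_predictors=3)
after = adjusted_r_squared(0.752, n_obs=30, n_predictors=5)

print(round(before, 4), round(after, 4))
```

Here the small gain in R-squared does not compensate for the two extra parameters, so the adjusted value falls. A large enough gain would still raise it, which is why the measure compares models rather than testing individual predictors.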