
Course: MSc DS

Advanced Machine Learning

Module: 3
Learning Objectives:
1. Acquire a comprehensive understanding of the core ideas
behind linear regression and recognise the inherent
constraints associated with this statistical technique.
2. Understand the concepts and differences between Ridge
and LASSO regression techniques.
3. Acquire proficiency in the actual application and assessment
procedures of LASSO and Ridge regression models.
4. Delve into generalised linear regression models, ensuring a
thorough understanding of multicollinearity and critical
model assumptions.

Structure:
3.1 Basics of Linear Regression and its Limitations
3.2 Introduction to Ridge and LASSO Regression
3.3 Implementing and Evaluating LASSO and Ridge Models
3.4 Exploring Generalised Linear Regression Models
3.5 Addressing Multicollinearity and Model Assumptions
3.6 Summary
3.7 Keywords
3.8 Self-Assessment Questions
3.9 Case Study
3.10 References
3.1 Basics of Linear Regression and its Limitations
Linear regression is a statistical approach that has significant
importance and enjoys extensive use within the field of
machine learning. The objective is to provide a mathematical
representation of the association between a response variable
and one or many predictor variables via the use of a linear
equation. In its most basic configuration, where there exists
just one independent variable, the association is denoted by
the mathematical expression: \(y = mx + c\), where \(y\)
symbolises the dependent variable, \(x\) represents the
independent variable, \(m\) denotes the slope, and \(c\)
signifies the y-intercept.
The main objective of linear regression is to determine the
optimal straight line, often referred to as the regression line,
that effectively forecasts the output values within a certain
range. The concept of "best fit" is often characterised by the
objective of minimising the sum of squared differences, also
known as errors, between the observed values, which refer to
the actual data points, and the values projected by the model.
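Formally, with observations \((x_i, y_i)\), this "best fit" criterion is the standard least-squares objective: the fitted slope and intercept are the values that minimise the residual sum of squares,
\[
\min_{m,\,c} \; \sum_{i=1}^{n} \bigl(y_i - (m x_i + c)\bigr)^2 .
\]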
However, while linear regression is powerful, it has its limitations:
Linearity Assumption: The primary limitation of this approach
is in its assumption of a linear association between the
dependent and independent variables. In practical contexts,
this assertion does not consistently hold true. Numerous
phenomena possess intrinsic non-linearity, rendering their
exact representation unattainable by a linear model.
Independence: Linear regression assumes that the residuals, which refer to the discrepancies between the observed and projected values, are independent of one another. This assumption is commonly violated with repeated measurements or matched data, so such observations require particular care.
Homoscedasticity: The word "homoscedasticity" refers to the
requirement that the variability of the errors should be
constant across all levels of the independent variables. To
clarify, it is expected that the dispersion of residuals is
approximately uniform along the regression line.
Lack of Flexibility: Linear regression exhibits a degree of
inflexibility since it may fail to capture intricate relationships present within the dataset. In some instances, it may be necessary to use more advanced methods or to include polynomial terms.
Outliers: The linear regression model is susceptible to the
influence of outliers. The presence of a solitary outlier has the
potential to have a substantial influence on both the slope and
intercept of the regression line.
Linear regression provides a transparent and comprehensible
framework for forecasting outcomes. However, it is crucial for
practitioners to thoroughly evaluate the appropriateness of
linear regression for a specific dataset or research inquiry due
to its underlying assumptions and inherent limits.
3.2 Introduction to Ridge and LASSO Regression
Linear regression offers a basic approach for modelling the
associations between variables. Nevertheless, in scenarios
characterised by the existence of multicollinearity (where
independent variables exhibit strong correlation) or when the
number of predictors surpasses the number of observations,
ordinary linear regression may yield results that are unstable or
prone to overfitting. Ridge and LASSO regression are two
statistical approaches that have been specifically developed to
tackle the aforementioned issues.
Ridge Regression: Ridge regression, sometimes referred to as
Tikhonov regularisation, incorporates a penalty component into
the objective function of linear regression. The inclusion of a
penalty term serves to prevent the occurrence of high
coefficients, a phenomenon that has the potential to result in
overfitting, particularly when multicollinearity is present. The
primary goal of Ridge regression is to minimise the sum of
squared residuals while also including a penalty term. This
penalty term is determined by the squared magnitude of the
coefficient vector, which is further scaled by a hyperparameter
denoted as λ. As the value of λ increases, the penalty becomes
more pronounced, resulting in reduced coefficients. This
phenomenon leads to a trade-off, where a decrease in variance
is achieved at the expense of adding a certain level of bias.
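In standard notation (with coefficients \(\beta_j\), fitted values \(\hat{y}_i\), and regularisation strength \(\lambda\)), the Ridge objective described above can be written as
\[
\min_{\beta} \; \sum_{i=1}^{n} \bigl(y_i - \hat{y}_i\bigr)^2 \;+\; \lambda \sum_{j=1}^{p} \beta_j^2 .
\]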
LASSO Regression: LASSO regression, an abbreviation for Least
Absolute Shrinkage and Selection Operator, is a regularisation
approach used in statistical analysis.
Similar to the Ridge method, it introduces a penalty term into the objective function of linear regression. Nevertheless, the LASSO penalty corresponds to the absolute value of the coefficients. This distinction has a significant influence: LASSO has a tendency to force some coefficients to attain a value of exactly zero, hence picking a more streamlined model that excludes those coefficients. The LASSO technique may therefore be regarded as a method for selecting features.
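Using the same notation as above, the LASSO objective replaces the squared (L2) penalty with an absolute-value (L1) penalty:
\[
\min_{\beta} \; \sum_{i=1}^{n} \bigl(y_i - \hat{y}_i\bigr)^2 \;+\; \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert .
\]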
Both Ridge regression and LASSO regression are used to overcome the constraints of ordinary linear regression, particularly in scenarios involving multicollinearity or a significant risk of overfitting. Nevertheless, the regularisation techniques used by Ridge and LASSO vary in their approach: Ridge regularisation tends to uniformly reduce all coefficients, whereas LASSO regularisation has the ability to shrink some coefficients to zero, thereby eliminating them from the model. The selection between Ridge and LASSO regularisation techniques should be based on considerations such as the problem's contextual factors, the characteristics of the data, and the intended objective.
3.3 Implementing and Evaluating LASSO and Ridge Models
Ridge and LASSO regressions are advanced variations of linear
regression often used in statistical modelling. Their practical
applications frequently include the utilisation of software
packages and modules such as Scikit-learn, a popular Python
framework. During the implementation process, it is customary
to initialise the regression model by explicitly defining the
regularisation strength, which is often represented as α or λ. In
the context of Ridge regression, it is seen that all coefficients
undergo a uniform shrinkage towards zero, yet none of them
are entirely removed. In contrast, the LASSO method has the
capability to successfully conduct feature selection by reducing
certain coefficients to zero.
For Ridge: To begin the process, it is essential to import the
required libraries, namely the Ridge module, from the
sklearn.linear_model package. Next, instantiate a Ridge
regression model by giving the desired regularisation strength.
Subsequently, train this model using the provided training
dataset.
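The following is a minimal sketch of those steps using scikit-learn; the training data here are synthetic stand-ins (generated with make_regression) and the value of alpha is purely illustrative, not a recommendation:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic data standing in for the training set described above
X_train, y_train = make_regression(n_samples=100, n_features=5, noise=10.0,
                                   random_state=0)

# alpha plays the role of the regularisation strength (lambda) discussed earlier;
# in practice its value would be tuned, e.g. by cross-validation
ridge_model = Ridge(alpha=1.0)
ridge_model.fit(X_train, y_train)

print(ridge_model.coef_)  # coefficients are shrunk, but typically none are exactly zero
```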
For LASSO: To begin, the first step is to import the Lasso
module from the sklearn.linear_model library. Create an
instance of a Lasso regression object, once again setting the
regularisation intensity, and proceed to train it using your
training dataset.
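A parallel sketch for LASSO, again on synthetic stand-in data with an illustrative alpha value:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data standing in for the training set described above
X_train, y_train = make_regression(n_samples=100, n_features=5, noise=10.0,
                                   random_state=0)

# Larger alpha values push more coefficients exactly to zero
lasso_model = Lasso(alpha=1.0)
lasso_model.fit(X_train, y_train)

print(lasso_model.coef_)  # some entries may be exactly zero (implicit feature selection)
```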
Evaluation: The evaluation of the performance of Ridge and
LASSO regressions entails using approaches that are similar to
those used for assessing regular linear regression models.
Common metrics that are often used include:
Root Mean Squared Error (RMSE): Quantifies the typical discrepancy between predicted and actual values, with smaller values indicating better model performance.
R-squared: The proportion of variability in the dependent variable that can be accounted for by the independent variables; models with higher R-squared values explain a greater share of the variance.
Coefficient Analysis: LASSO regression has particular
significance in this context. Through the analysis of the
coefficients that LASSO has decreased to zero, one may
ascertain the characteristics that are considered less significant
in forecasting the result.
Overfitting Check: Utilise cross-validation as a means to assess
the model's efficacy in handling novel data. The observation of
a substantial decrease in performance when applying a model
to additional data is indicative of the phenomenon known as
overfitting.
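A brief, hedged sketch of these checks using scikit-learn; the data, model, and alpha value below are synthetic placeholders rather than part of the module's own materials:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data and a Ridge model used purely to illustrate the metrics
X, y = make_regression(n_samples=200, n_features=8, noise=15.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = Ridge(alpha=1.0).fit(X_train, y_train)
y_pred = model.predict(X_test)

rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # lower is better
r2 = r2_score(y_test, y_pred)                        # closer to 1 is better

# Overfitting check: cross-validated R-squared on the training data
cv_r2 = cross_val_score(model, X_train, y_train, cv=5, scoring="r2").mean()
print(rmse, r2, cv_r2)
```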
When doing an evaluation, it is crucial to consider the distinct
characteristics of Ridge and LASSO. For instance, the LASSO
method's capability to effectively decrease coefficients to zero
may provide valuable insights into the relative relevance of
features. On the other hand, the Ridge method's uniform
shrinkage approach can lead to more stable coefficient
estimates, particularly when dealing with multicollinearity.
It is important to exercise caution and conduct thorough
assessments while using Ridge and LASSO methods, despite
their easy implementation using contemporary technologies.
Regularised regression techniques, such as LASSO and Ridge,
provide a reliable alternative in situations when conventional
regression approaches may encounter difficulties. However, it
is crucial to comprehend the intricacies of these methods and
correctly interpret their results in order to effectively use them.

3.4 Exploring Generalised Linear Regression Models


Generalised linear models (GLMs) are an extension of linear
regression that enables the analysis of response variables with
error distribution models that deviate from the assumption of a
normal distribution. The conventional approach to linear
regression implies that there exists a linear connection
between the predictors and the response variable and that the
response variable adheres to a normal distribution.
Nevertheless, it is important to acknowledge that in several
real-world situations, these assumptions may not be applicable.
In the context of estimating the likelihood of an occurrence, it is
important to note that the result is constrained to a range of
values from 0 to 1. This characteristic renders the assumption
of a normal distribution unsuitable. Generalised linear models
(GLMs) are used in this context.
A Generalised Linear Model (GLM) is characterised by three
primary components:
● Random Component: This specifies the probability distribution assumed for the response variable. Frequently seen options include the normal, binomial, and Poisson distributions.
● Systematic Component: The representation of the linear
combination of predictors is similar to that of linear
regression. This component quantifies the influence of the
predictors on the response variable.
● Link Function: This function establishes a connection
between the random and systematic elements. The process
converts the linear combination of independent variables
into a numerical value that conforms to the selected
probability distribution for the dependent variable. For
example, within the context of logistic regression, which falls
under the category of Generalised Linear Models (GLMs), the
logit link function is used to transform linear combinations
into probabilities.
GLMs are incredibly versatile, catering to a range of scenarios:
● Logistic Regression: Binary outcomes are often addressed
using this method, such as in the context of classifying
emails as either spam or non-spam.
● Poisson Regression: This statistical method is well-suited for count outcomes, such as quantifying the number of website visitors within a given day.

● Multinomial Regression: This method is particularly advantageous when dealing with categorical outcomes that include more than two categories, such as the prediction of weather conditions (e.g., sunny, wet, snowy).
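To make the first two cases concrete, the following is a minimal sketch using scikit-learn estimators that correspond to the logistic and Poisson families; the datasets are synthetic placeholders, not examples from the module:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, PoissonRegressor

# Logistic regression: binary outcome (e.g., spam vs. non-spam)
X_bin, y_bin = make_classification(n_samples=200, n_features=4, random_state=0)
logit_model = LogisticRegression().fit(X_bin, y_bin)
print(logit_model.predict_proba(X_bin[:3]))  # class probabilities via the logit link

# Poisson regression: count outcome (e.g., daily visitor counts)
rng = np.random.default_rng(0)
X_cnt = rng.normal(size=(200, 3))
y_cnt = rng.poisson(lam=np.exp(1.0 + 0.3 * X_cnt[:, 0]))  # synthetic counts
poisson_model = PoissonRegressor().fit(X_cnt, y_cnt)
print(poisson_model.coef_)
```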
Generalised linear models (GLMs) provide a versatile
framework that expands the applicability of linear regression to
a wide range of response variables. Generalised linear models
(GLMs) enable analysts and researchers to construct more
appropriate and precise models for various data situations by
taking into account the individual distribution of the response
variable and using relevant link functions.

3.5 Addressing Multicollinearity and Model Assumptions


Multicollinearity is a significant difficulty within the domain of
regression analysis. Multicollinearity arises when there is a
substantial correlation between two or more predictor
variables inside a regression model, indicating that one variable
may be accurately predicted from the others using a linear
relationship. While the presence of multicollinearity does not
significantly impact the overall accuracy of the model, it does
hinder the ability to discern the individual impacts of predictors
on the response variable. For example, the inflation of the variances of parameter estimates can result in unstable and unreliable regression coefficients. Multicollinearity also inflates standard errors, and therefore p-values, which can lead to the erroneous conclusion that some variables lack statistical significance when they are in fact relevant.
The Variance Inflation Factor (VIF) is considered one of the
main techniques for identifying multicollinearity. When the
Variance Inflation Factor (VIF) exceeds roughly 5 to 10, it indicates a significant degree of collinearity, which is a cause for concern. The resolution of multicollinearity may include
the elimination of one of the variables that exhibit correlation,
especially if there is insufficient theoretical justification for its
inclusion. In an alternative approach, it is possible to merge
variables by generating a new variable that captures either the
average or the primary component of the connected entities.
Moreover, regularisation approaches, such as Ridge and LASSO
regression, may provide remedies by including a penalty term
in the regression function.
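A minimal sketch of how VIF values are commonly computed with statsmodels; the column names and data below are illustrative assumptions rather than part of the module:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Illustrative predictor matrix; x3 is deliberately almost a copy of x1
rng = np.random.default_rng(0)
X = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
X["x3"] = X["x1"] + rng.normal(scale=0.05, size=100)

X_const = sm.add_constant(X)  # VIF is normally computed with an intercept column
vif = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(X_const.shape[1])],
    index=X_const.columns,
)
print(vif)  # x1 and x3 should show very large VIF values
```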

The assumptions behind regression analysis are fundamental principles that provide a solid basis for the study. Several
assumptions need to be considered in this context. These
assumptions pertain to the linearity of the connection between
predictors and the response variable, the independence of
observations, the consistent variance of errors
(homoscedasticity), and the normal distribution of errors
(residuals). Deviation from these assumptions may result in the
production of inaccurate or deceptive outcomes.
In order to guarantee the fulfilment of these assumptions,
analysts often rely on diagnostic tools. Residual plots play a
crucial role in the identification of non-linearity and
heteroscedasticity, for example. In order to assess the
normality of residuals, many tools such as histograms, Q-Q
plots, and statistical tests like the Shapiro-Wilk test are of great
significance. Data transformations may be used as a means to
address difficulties associated with assumptions. The use of
logarithmic or square root transformations has the potential to
stabilise variations and promote the development of more
linear connections.
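A brief illustration of these diagnostics using matplotlib and SciPy; the fitted model and residuals here come from synthetic placeholder data:

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

# Synthetic data and fit, standing in for the model being diagnosed
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=1.0, size=200)
model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

# Residual plot: curvature suggests non-linearity; a funnel shape suggests heteroscedasticity
plt.scatter(model.predict(X), residuals)
plt.axhline(0, color="red")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()

# Q-Q plot and Shapiro-Wilk test for normality of the residuals
stats.probplot(residuals, dist="norm", plot=plt)
plt.show()
print(stats.shapiro(residuals))  # a small p-value suggests non-normal residuals
```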
Essentially, it is of utmost importance to comprehend and
tackle the issue of multicollinearity while also ensuring
adherence to regression assumptions. Satisfying these
requirements not only guarantees the validity of the model but
also improves its interpretability, hence facilitating the
derivation of reliable conclusions and the adoption of data-
driven decision-making.

3.6 Summary
❖ Module 3 delves extensively into the subject of regression
analysis, first with an exploration of the fundamental
principles behind linear regression and the inherent
limitations associated with it. Although linear regression is
widely used in predictive modelling, it may encounter
limitations when dealing with intricate datasets, particularly
when there is correlation among variables. In order to address this issue, Ridge and LASSO regression are introduced as regularisation methodologies, which serve the dual purpose of mitigating multicollinearity and reducing the risk of model overfitting.

❖ The inclusion of practical sessions in the module enables students to develop a high level of competency in the
implementation and evaluation of both the LASSO and Ridge
models. This emphasis on practical application underscores
the significance of these models in real-world contexts. The
module expands its scope by delving into generalised linear
models (GLMs), which enhance linear regression to include
response variables that are not normally distributed, thereby
offering more flexibility in modelling.
❖ Finally, the examination of regression assumptions is
undertaken, with particular attention given to the issue of
multicollinearity. It is crucial to ensure that these
assumptions are satisfied in order to achieve accuracy and
dependability in the model.
3.7 Keywords
● Linear Regression: A statistical technique used to provide a
mathematical representation of the association between a
response variable and one or more predictor factors.
● Ridge Regression: A regularised form of linear regression that incorporates a regularisation term in order to mitigate overfitting and address multicollinearity.
● LASSO Regression: A regression approach that combines variable selection and regularisation in order to improve the accuracy and interpretability of predictions.
● Generalised Linear Models (GLMs): A versatile extension of conventional linear regression models, accommodating response variables that exhibit non-normal distributions.
● Multicollinearity: The presence of strong intercorrelations among independent variables within a regression model, which can distort coefficient estimates and their interpretation.
● Model Assumptions: The regression model's validity is
contingent upon many preconditions that are deemed
fundamental. These preconditions include linearity,
independence, homoscedasticity, and normality of residuals.
3.8 Self-Assessment Questions
1. What is the main difference between generalised linear
models (GLMs) and linear regression?
2. How are the regularisation methods used in Ridge and
LASSO regression different from one another, and how do
the regularisation terms affect the coefficients of predictors
in each case?
3. Why is multicollinearity a problem in regression models, and
how might it influence how model coefficients are
interpreted?
4. Describe the main presumptions used in a linear regression
model in a few sentences. Which presumption may cause
heteroscedasticity problems if it is broken?
5. Describe the meaning of "L1 regularisation" as it relates
to LASSO regression. How does it affect the decision of the
model's predictor variables?

3.9 Case Study


Title: Advanced Regression Modelling to Improve Real Estate Pricing
Introduction:
The real estate market is characterised by its dynamic and intricate nature, whereby property values are subject to the impact of several variables. Regression models are often used by real estate professionals and investors to forecast property values, enabling them to make well-informed choices in their purchasing or selling activities.
Case Study:
RealtyGenius, a prominent real estate company, has compiled a
comprehensive dataset including several property transactions
throughout the course of the last ten years in the city of New
York. The dataset includes several characteristics such as
square footage, number of bedrooms, location of facilities, year
of construction, and other relevant factors. Nevertheless, the
early models have shown the presence of multicollinearity
concerns, and some predictors exhibit a lack of linear
association with the price variable.
Background:
Linear regression is a robust statistical technique that
assumes a linear association between the independent
variables and the dependent variable. In instances when these
assumptions are not upheld, the accuracy of model predictions
may be compromised. Sophisticated regression models, such as
Ridge and LASSO, include regularisation techniques to address
the issue of multicollinearity and identify the most influential
factors.
Your Task:
In the capacity of the principal data scientist at RealtyGenius,
the responsibility has been assigned to you to formulate an
enhanced regression model with the aim of accurately
forecasting property values. The objective is to attain optimal
predicted accuracy while simultaneously acknowledging and
accounting for the constraints and presumptions inherent in
conventional linear regression.
Questions to Consider:
1. How can you tell which predictors in the dataset could be
contributing to multicollinearity problems?
2. How would you choose LASSO regression vs Ridge regression
for this dataset?
3. Which model assumptions need to be verified before your
model is complete?
4. How can one effectively address non-linear correlations between certain predictors and property prices?
Recommendations:
In order to improve the dependability of the model, it is
important to use methodologies such as Ridge or LASSO
regression to address the issue of multicollinearity.
Regularisation techniques may be used to facilitate feature
selection, guaranteeing that the model is influenced only by
predictors that have substantial significance. In addition, the
examination of polynomial regression or generalised linear
models (GLMs) might be advantageous in addressing non-linear
associations.
Conclusion:
The use of advanced regression models has the potential to
significantly improve the precision and dependability of
forecasts inside intricate systems such as the real estate market,
provided they are appropriately implemented. Data scientists
may enhance the accuracy of property pricing models by
effectively resolving multicollinearity, comprehending model
assumptions, and effectively managing non-linearities.

3.10 References
1. Kurilovas, E., 2019. Advanced machine learning approaches
to personalise learning: learning analytics and decision
making. Behaviour & Information Technology, 38(4), pp.410-
423.
2. Hubinger, E., van Merwijk, C., Mikulik, V., Skalse, J. and
Garrabrant, S., 2019. Risks from learned optimisation in
advanced machine learning systems. arXiv preprint
arXiv:1906.01820.
3. Chakraborty, D. and Elzarka, H., 2019. Advanced machine
learning techniques for building performance simulation: a
comparative analysis. Journal of Building Performance
Simulation, 12(2), pp.193-207.
4. Hearty, J., 2016. Advanced machine learning with Python.
Packt Publishing Ltd.
5. Roy, K.S., Roopkanth, K., Teja, V.U., Bhavana, V. and Priyanka,
J., 2018. Student career prediction using advanced machine
learning techniques. International Journal of Engineering &
Technology, 7(3.20), pp.26-29.
