Machine Learning (CSO851) - Lecture 02

The document discusses regression models, focusing on linear regression, its assumptions, and optimization techniques such as gradient descent. It includes data on India's population growth rate and explains concepts like the cost function, R-squared, and polynomial regression for non-linear relationships. The document emphasizes the importance of model fitting strategies and considerations when using polynomial models.


Regression Models, Penalty and Optimization
Lecture - 02
Regression Models
India – Population Growth Rate
Data
Year   Population      Growth Rate      Year   Population      Growth Rate
2022   1,406,631,776   0.95%            2005   1,147,609,927   1.59%
2021   1,393,409,038   0.97%            2004   1,129,623,456   1.63%
2020   1,380,004,385   0.99%            2003   1,111,523,144   1.67%
2019   1,366,417,754   1.02%            2002   1,093,317,189   1.70%
2018   1,352,642,280   1.04%            2001   1,075,000,085   1.74%
2017   1,338,676,785   1.07%            2000   1,056,575,549   1.78%
2016   1,324,517,249   1.10%            1999   1,038,058,156   1.82%
2015   1,310,152,403   1.12%            1998   1,019,483,581   1.86%
2014   1,295,600,772   1.15%            1997   1,000,900,030   1.89%
2013   1,280,842,125   1.19%            1996   982,365,243     1.91%
2012   1,265,780,247   1.24%            1995   963,922,588     1.94%
2011   1,250,287,943   1.30%            1994   945,601,831     1.96%
2010   1,234,281,170   1.36%            1993   927,403,860     1.99%
2009   1,217,726,215   1.42%            1992   909,307,016     2.02%
2008   1,200,669,765   1.48%            1991   891,273,209     2.06%
2007   1,183,209,472   1.52%            1990   873,277,798     2.10%
2006   1,165,486,291   1.56%            1989   855,334,678     2.13%
                                        1988   837,468,930     2.17%
Regression Models
Least Squares Regression
• It is useful: the underlying model assumptions are, in many
applications, close enough to correct that this method has been
widely used in the literature of many scientific domains.
• It can provide insight into “what is going on” with this approach to
regression.
• It enables the transformation of classification problems into
regression problems.
• A thorough understanding of least-squares regression provides a solid foundation for the related regression topics that follow.
General Framework
The model assumption of linear regression is that the function to be learned is $f:\mathbb{R}^m \to \mathbb{R}$. Given a vector $x_i \in \mathbb{R}^m$, $f(x_i)$ is the vector inner product

$$f(x_i) = f_\theta(x_i) = \begin{bmatrix} 1 & x_{i1} & x_{i2} & \cdots & x_{im} \end{bmatrix} \begin{bmatrix} \theta_0 \\ \vdots \\ \theta_m \end{bmatrix} = \theta_0 + x_{i1}\theta_1 + \cdots + x_{im}\theta_m$$

for some fixed real (m+1)-vector of parameters

$$\theta = (\theta_0, \theta_1, \ldots, \theta_m)^T.$$

Observed data $(x_1, y_1), \ldots, (x_n, y_n)$ are taken to be

$$y_i = f_\theta(x_i) + e_i$$
General Framework
for each $i$, where $e_1, \ldots, e_n$ are realizations of independent, identically distributed random variables with mean zero and unknown variance $\sigma^2$.
In matrix notation with n observations, the data satisfy

$$\begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} f_\theta(x_1) + e_1 \\ \vdots \\ f_\theta(x_n) + e_n \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1m} \\ \vdots & & & & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{nm} \end{bmatrix} \begin{bmatrix} \theta_0 \\ \vdots \\ \theta_m \end{bmatrix} + \begin{bmatrix} e_1 \\ \vdots \\ e_n \end{bmatrix}$$

$$y = X\theta + e$$

where the n×(m+1) matrix X, the n-long column vectors y and e, and the (m+1)-long parameter vector θ are the corresponding vectors and matrices.
Estimating the Model
Parameters
Given an (m+1)-long (column) parameter vector $\theta$, the linear model predicts that at a point $x_i$ in feature space, the response will be $f_\theta(x_i) = \theta_0 + x_{i1}\theta_1 + \cdots + x_{im}\theta_m$.

The difference between an actual response $y_i$ from an observed datum and the predicted response is the residual $y_i - f_\theta(x_i)$.

Using training risk as an estimate of risk, searching for a minimum-risk approximation to $f$ is equivalent to searching for a parameter vector $\theta$ which minimizes the training risk,

$$\hat{R}(\theta) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - f_\theta(x_i)\right)^2.$$
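A minimal sketch of this estimation with NumPy on synthetic data (an assumption, not the lecture's dataset); the least-squares solution that minimizes the training risk is obtained with np.linalg.lstsq:

```python
import numpy as np

# Synthetic data for illustration (assumed, not taken from the lecture)
rng = np.random.default_rng(0)
n, m = 100, 3
features = rng.normal(size=(n, m))                 # n observations, m features
theta_true = np.array([2.0, -1.0, 0.5, 3.0])       # theta_0, theta_1, ..., theta_m

# Design matrix X with a leading column of ones for the intercept theta_0
X = np.column_stack([np.ones(n), features])
y = X @ theta_true + rng.normal(scale=0.1, size=n) # y = X theta + e

# Least-squares estimate: the theta minimizing the training risk (mean squared residual)
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta_hat)                                   # close to theta_true
```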
Linear Regression
• Linear Regression is a
supervised machine
learning algorithm.
• Predicted output is
continuous and has a
constant slope.
• Predict values within a
continuous range rather
than trying to classify them
into categories.
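As a usage sketch (assuming scikit-learn is available; the data below are made up), such a model can be fit and then used for continuous prediction:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: one continuous feature with a roughly linear response
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))
y = 4.0 + 1.5 * X.ravel() + rng.normal(scale=0.5, size=50)

model = LinearRegression().fit(X, y)   # supervised learning on (features, targets)
print(model.intercept_, model.coef_)   # fitted intercept and slope
print(model.predict([[7.5]]))          # continuous prediction for a new input
```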
Cost Function for Linear
Regression

In linear regression, the Mean Squared Error (MSE) cost function is generally used; it is the average of the squared errors between the predicted and observed values of the dependent variable,

$$J(B_0, B_1) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - (B_0 + B_1 x_i)\right)^2.$$

• Using the MSE function, we update the values of the intercept ($B_0$) and slope ($B_1$) so that the MSE settles at its minimum.

• These parameters can be determined using the gradient descent method so that the value of the cost function is minimized.
Gradient Descent for Linear
Regression
• Gradient Descent is an optimization algorithm that minimizes the cost function (objective function) to reach an optimal solution.

• To find the optimum solution we need to reduce the cost function (MSE) over all data points.

• This is done by updating the values of B0 and B1 iteratively until we reach an optimal solution.

• A regression model is fit with gradient descent by starting from (often randomly chosen) coefficient values and then iteratively updating them in the direction that reduces the cost function, until the minimum is reached.
Gradient Descent for Linear
Regression
To update B0 and B1, we take gradients of the cost function, i.e., its partial derivatives with respect to B0 and B1:

$$\frac{\partial J}{\partial B_0} = -\frac{2}{n}\sum_{i=1}^{n}\left(y_i - (B_0 + B_1 x_i)\right), \qquad \frac{\partial J}{\partial B_1} = -\frac{2}{n}\sum_{i=1}^{n} x_i\left(y_i - (B_0 + B_1 x_i)\right)$$
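A minimal sketch of these updates, assuming NumPy and illustrative choices of learning rate, iteration count, and data that are not specified in the slides (the helper name gradient_descent is ours):

```python
import numpy as np

def gradient_descent(x, y, lr=0.01, n_iters=1000):
    """Fit y ≈ B0 + B1*x by gradient descent on the MSE cost."""
    b0, b1 = 0.0, 0.0                       # arbitrary starting values
    n = len(x)
    for _ in range(n_iters):
        y_pred = b0 + b1 * x
        # Partial derivatives of the MSE cost with respect to B0 and B1
        grad_b0 = -2.0 / n * np.sum(y - y_pred)
        grad_b1 = -2.0 / n * np.sum(x * (y - y_pred))
        b0 -= lr * grad_b0                  # step against the gradient
        b1 -= lr * grad_b1
    return b0, b1

# Illustrative data (assumed): y ≈ 3 + 2x plus noise
rng = np.random.default_rng(1)
x = rng.uniform(0, 5, size=200)
y = 3.0 + 2.0 * x + rng.normal(scale=0.3, size=200)
print(gradient_descent(x, y))               # roughly (3, 2)
```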
Coefficient of Determination or
R-Squared (R2)
• R-squared is a number that quantifies the proportion of the variation in the response that is captured by the fitted model.
• It always ranges between 0 and 1.
• Overall, the higher the value of R-squared, the better the model fits the data.
• Mathematically it can be represented as R² = 1 − (RSS/TSS).

The Residual Sum of Squares (RSS) is defined as the sum of the squared residuals over all data points; it measures the difference between the observed outputs and the model's predictions:

$$RSS = \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

The Total Sum of Squares (TSS) is defined as the sum of the squared deviations of the data points from the mean of the response variable:

$$TSS = \sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2$$
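A small sketch of this computation with NumPy, using hypothetical observed and predicted values (the helper name r_squared is ours):

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination R^2 = 1 - RSS/TSS."""
    rss = np.sum((y_true - y_pred) ** 2)            # residual sum of squares
    tss = np.sum((y_true - np.mean(y_true)) ** 2)   # total sum of squares
    return 1.0 - rss / tss

# Hypothetical example
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8, 5.1])
print(r_squared(y_true, y_pred))   # close to 1 for a good fit
```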
Assumptions of Linear
Regression
Linearity of residuals: There needs to be a
linear relationship between the
dependent variable and independent
variable(s).

Independence of residuals: The error terms should not be dependent on one another; there should be no correlation between the residual terms. The presence of such correlation between residuals is known as autocorrelation.
Assumptions of Linear
Regression
Normal distribution of residuals: The residuals should follow a normal distribution with a mean equal to zero or close to zero. Checking this helps to verify that the selected line is actually the line of best fit.
The equal variance of residuals: The error
terms must have constant variance. This
phenomenon is known as
Homoscedasticity. The presence of non-
constant variance in the error terms is
referred to as Heteroscedasticity.
Generally, non-constant variance arises in
the presence of outliers or extreme
leverage values.
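These assumptions can be checked numerically; the sketch below (assuming NumPy, with made-up data, and a helper name of our own) illustrates rough checks for zero-mean residuals, autocorrelation, and constant variance, not a procedure prescribed by the slides:

```python
import numpy as np

def residual_checks(y_true, y_pred):
    """Rough numerical checks of the residual assumptions (illustrative, not exhaustive)."""
    resid = y_true - y_pred
    mean_resid = resid.mean()                         # should be close to zero
    # Lag-1 correlation of the residuals: values near zero suggest no autocorrelation
    lag1_corr = np.corrcoef(resid[:-1], resid[1:])[0, 1]
    # Compare residual spread over the lower and upper halves of the fitted values:
    # similar variances suggest homoscedasticity
    order = np.argsort(y_pred)
    half = len(order) // 2
    var_low = resid[order[:half]].var()
    var_high = resid[order[half:]].var()
    return mean_resid, lag1_corr, var_low, var_high

# Hypothetical fitted values and observations
y_pred = np.linspace(0, 10, 100)
y_true = y_pred + np.random.default_rng(6).normal(scale=0.4, size=100)
print(residual_checks(y_true, y_pred))
```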
Multiple Linear Regression
Multiple Linear Regression: Considerations
Multiple Linear Regression: New Relationships
Multiple Linear Regression: Many Relationships
Multiple Linear Regression Model
Estimated Multiple Regression Equation
Interpreting Coefficients
Multiple Regression: Data Preparation
Data and Variable Naming
Sketching Out Relationships
Checking Relevancy: Scatterplot
DV vs IV Scatterplot
Scatterplot Summary
IV vs IV Scatterplot
IV Scatterplot: Multicollinearity
IV Scatterplot Summary
Correlations
Correlation Summary
Polynomial Regression Model
Summary for Quadratic Function
Polynomial Regression
• A simple linear regression algorithm works only when the relationship between the variables is linear.
• If the data are non-linear, linear regression will not be able to draw a good best-fit line.
Polynomial Regression
A model is said to be linear when it is linear in its parameters. So the models

$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon$$

and

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{11} x_1^2 + \beta_{22} x_2^2 + \beta_{12} x_1 x_2 + \varepsilon$$

are also linear models; in fact, they are the second-order polynomials in one and two variables, respectively.

Polynomial models can be used in those situations where the relationship between the study and explanatory variables is curvilinear.
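A small sketch of fitting such a curvilinear model, assuming NumPy and made-up data; the polynomial terms are built explicitly, so the fit remains ordinary linear least squares in the parameters:

```python
import numpy as np

# Illustrative curvilinear data (assumed): y = 1 + 2x - 0.5x^2 plus noise
rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, size=150)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(scale=0.2, size=150)

# Second-order polynomial model in one variable: columns [1, x, x^2]
X = np.column_stack([np.ones_like(x), x, x**2])

# Linear least squares on the expanded features -> beta_0, beta_1, beta_2
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)   # roughly [1, 2, -0.5]
```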
Polynomial Models in One Variable
The k-th order polynomial model in one variable is given by

$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_k x^k + \varepsilon$$

Polynomial models can be used to approximate a complex nonlinear relationship; in such a case, the polynomial model is essentially a (truncated) Taylor series expansion of the unknown nonlinear function.
Considerations in Fitting Polynomial in One Variable
Order of the model:
• The order of the polynomial model should be kept as low as possible.
• Some transformations can be used to keep the model first order.
• If this is not satisfactory, then a second-order polynomial is used.
• Arbitrary fitting of higher-order polynomials can be a serious abuse of regression analysis.
• A model consistent with the knowledge of the data and its environment should be chosen.
• It is always possible for a polynomial of order (n − 1) to pass through n points, so a polynomial of sufficiently high degree can always be found that provides a "good" fit to the data, as the sketch after this list illustrates.
• Such models neither enhance the understanding of the unknown function nor serve as good predictors.
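A tiny demonstration of that exact-fit behaviour, assuming NumPy and made-up points; a polynomial of degree n − 1 passes through all n noisy points yet says little about the underlying trend:

```python
import numpy as np

# n noisy points from a simple underlying trend (illustrative only)
rng = np.random.default_rng(3)
n = 8
x = np.linspace(0, 1, n)
y = 2.0 * x + rng.normal(scale=0.3, size=n)

# A degree n-1 polynomial interpolates the n points exactly ...
coeffs = np.polyfit(x, y, deg=n - 1)
print(np.max(np.abs(np.polyval(coeffs, x) - y)))   # essentially zero residuals

# ... but its behaviour between and beyond the points need not reflect
# the underlying trend (here, roughly 2 * 1.05)
print(np.polyval(coeffs, 1.05))
```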
Considerations in Fitting Polynomial in One Variable
Model building strategy:
• A good strategy should be used to choose the order of the approximating polynomial.
• One approach is to successively fit models of increasing order and test the significance of the regression coefficients at each step, increasing the order until the t-test for the highest-order term is non-significant. This is called a forward selection procedure (a sketch of this procedure follows after this list).
• Another approach is to fit the highest-order model considered appropriate and then delete terms one at a time, starting with the highest order. This continues until the highest-order remaining term has a significant t statistic. This is called a backward elimination procedure.
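A minimal sketch of the forward selection idea, assuming statsmodels is available, an arbitrary 0.05 significance level, and made-up data:

```python
import numpy as np
import statsmodels.api as sm

# Illustrative data with a genuinely quadratic trend (assumed)
rng = np.random.default_rng(4)
x = rng.uniform(-2, 2, size=200)
y = 1.0 + 0.5 * x + 1.5 * x**2 + rng.normal(scale=0.3, size=200)

alpha, max_order = 0.05, 6
selected_order = 0
for k in range(1, max_order + 1):
    # Design matrix with columns 1, x, x^2, ..., x^k
    X = sm.add_constant(np.column_stack([x**p for p in range(1, k + 1)]))
    fit = sm.OLS(y, X).fit()
    p_highest = fit.pvalues[-1]        # p-value of the highest-order term
    if p_highest >= alpha:             # highest-order term non-significant: stop
        break
    selected_order = k

print(selected_order)                  # expected to settle at 2 for this data
```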
Considerations in Fitting Polynomial in One Variable
Extrapolation:
• One has to be very cautious when extrapolating with polynomial models: the curvature in the region of the data and the curvature in the region of extrapolation can be different.
• For example, the trend of the data may be increasing over the region of the original observations but decreasing in the region of extrapolation.
• In that case the predicted response would not be based on the true behaviour of the data.
Considerations in Fitting Polynomial in One Variable
Ill-Conditioning:
• As the order of the polynomial increases, the columns of the X matrix (successive powers of x) become nearly linearly dependent, so X'X becomes ill-conditioned and the least-squares estimates are computed with poor numerical accuracy. Centering or rescaling x, or using orthogonal polynomials, mitigates this.
Considerations in Fitting Polynomial in One Variable
Hierarchy:
• A polynomial model is said to be hierarchical if it contains all terms of order lower than its highest-order term.
• It is expected that all polynomial models should have this property, because only hierarchical models are invariant under a linear transformation of the regressor.
Orthogonal Polynomials
• While fitting a linear regression model to a given set of data, we begin with a simple linear regression model.
• Later we may decide to change it to a quadratic model, or wish to increase the order from quadratic to cubic, and so on.
• In each case, we have to begin the modeling from scratch, i.e., from the simple linear regression model.
• Orthogonal polynomials avoid this: when the basis terms are orthogonal over the observed data, adding a higher-order term does not change the estimates of the lower-order coefficients, as the sketch below illustrates.
• The classical cases of orthogonal polynomials of special kinds are due to Legendre, Hermite and Chebyshev.
• These are continuous orthogonal polynomials (where the orthogonality relation involves integration), whereas in our case we have discrete orthogonal polynomials (where the orthogonality relation involves summation).
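One way to see this numerically is the sketch below (assuming NumPy; the helper name is ours): a discrete orthogonal basis is built from the Vandermonde matrix via QR, so orthogonality holds with respect to summation over the observed x values, and the lower-order coefficients are unchanged when the order increases:

```python
import numpy as np

# Illustrative data (assumed)
rng = np.random.default_rng(5)
x = rng.uniform(-1, 1, size=100)
y = 0.3 + 1.2 * x - 0.8 * x**2 + rng.normal(scale=0.1, size=100)

def orth_poly_coeffs(x, y, order):
    """Regression coefficients in a discrete orthogonal polynomial basis of the given order."""
    V = np.vander(x, N=order + 1, increasing=True)   # columns 1, x, ..., x^order
    Q, _ = np.linalg.qr(V)                           # columns of Q are orthonormal under summation
    return Q.T @ y                                   # least-squares coefficients in the Q basis

c2 = orth_poly_coeffs(x, y, order=2)
c3 = orth_poly_coeffs(x, y, order=3)
print(c2)
print(c3[:3])   # the first three coefficients are unchanged when the order increases
```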
