
Hota ML Regression

The document outlines various regression models used in machine learning, including linear, logistic, and polynomial regression, along with their applications and methodologies. It discusses the process of finding relationships between dependent and independent variables, the importance of minimizing cost functions, and the use of gradient descent for optimization. Additionally, it highlights the differences between regression and classification problems, emphasizing the significance of appropriate loss functions in logistic regression.

Uploaded by

Aryan Kate

Birla Institute of Technology and Science Pilani, Hyderabad Campus

11.09.2024

BITS F464: Machine Learning (1st Sem 2024-25)


REGRESSION MODELS
Chittaranjan Hota, Sr. Professor
Dept. of Computer Sc. and Information Systems
[email protected]
What Type of Problems can you solve?

Source: www.macroaxis.com/stocks/ https://www.imdb.com/


[Figure panels: Linear (price in $1000s versus size in feet^2), Logistic, Polynomial]

Different types of Regression for different purposes. Ridge, Lasso, Bayesian, …


Regression with Scalar Input (Univariate)
[Figure: Simple Linear Regression of weight against height]


With Vector inputs (more covariates)

https://currentaffairs.adda247.com/

• Unemployment rate, education level, population count, land area, income level,
investment rate, life expectancy, … (Multiple Linear Regression: Multi-variate)
Another Example of Multi-variate Regression
Sales = b + w1 · weather + w2 · money + w3 · day
Regression: the process of finding the relationship between a dependent variable (outcome/ response/ label) and one or more independent variables (predictors/ covariates/ explanatory variables/ features).

Independent variables (X): weather (rainy, sunny, cloudy), amount in hand, day type (working, holiday). Dependent variable (Y): Sales.
How will the dependent variable (Y) react to each variable X taken independently?
Best Fitting a Line: Least Squares Method
How are X and Y related if the slope is > 0? If it is < 0? If it is == 0?

The target function: , where m adjustable parameters are held in vector

Simple Linear Regression


Best Fitting a Line: Least Squares Method

Observed response

Predicted response
residual = data – fit

Find the optimal parameter values by minimizing the sum of squared residuals.
Can you choose the best-fit line?

Hypothetically: Say, weight = 2 + 1.5 height
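As a minimal sketch of the least squares method described above, the snippet below fits a line using the closed-form slope and intercept that minimize the sum of squared residuals. The height values are made up for illustration, and the weights are generated from the hypothetical line weight = 2 + 1.5 height, so the fit should recover those two numbers.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); intercept follows from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: heights are invented, weights follow weight = 2 + 1.5 * height.
heights = [1.0, 2.0, 3.0, 4.0]
weights = [2 + 1.5 * h for h in heights]

intercept, slope = fit_line(heights, weights)
# residual = data - fit, one per observation
residuals = [w - (intercept + slope * h) for h, w in zip(heights, weights)]
print(intercept, slope, residuals)
```

With noisy real data the residuals would not vanish; the fitted line is simply the one that makes their squared sum as small as possible.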


Multiple Linear Regression Analysis

(hardly any association between the two)
(People are clustered based on age)

Img. Source: https://sphweb.bumc.bu.edu/


Continued…

BMI = 18 + 1.5 (diet score) + 1.6 (male) + 4.2 (age > 20)

Y = β0 + β1 x1 + β2 x2 + β3 x3
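To make the multiple regression equation concrete, here is a small sketch that evaluates the BMI model above for one person. The function name and the choice to encode "male" and "age > 20" as 0/1 indicator variables are assumptions made for illustration; the coefficients are the ones on the slide.

```python
def predict_bmi(diet_score, is_male, is_over_20):
    """Evaluate BMI = 18 + 1.5*(diet score) + 1.6*(male) + 4.2*(age > 20)."""
    beta0, beta1, beta2, beta3 = 18.0, 1.5, 1.6, 4.2
    x1 = diet_score
    x2 = 1 if is_male else 0        # indicator variable (assumed 0/1 coding)
    x3 = 1 if is_over_20 else 0     # indicator variable (assumed 0/1 coding)
    return beta0 + beta1 * x1 + beta2 * x2 + beta3 * x3


# Hypothetical person: diet score 2, male, older than 20.
print(predict_bmi(2, True, True))
```

Each coefficient is the change in predicted BMI when its variable increases by one unit while the others are held fixed.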

Img. Source: https://sphweb.bumc.bu.edu/


Non-linear relationships

Examples: House price based on floor area; electricity consumption based on no. of household members and appliances being used.
Analyzing Residuals

Does the model describe the data well or poorly?
Residuals are randomly scattered around zero.
Continued…
The model includes a second-degree polynomial (quadratic term).
Residuals are systematically positive for much of the data.
Good or bad fit?
Non-linear relations using Linear models?
• Feature Engineering: Engineer new features by transforming the existing ones to capture non-linear relationships, e.g., you can include polynomial features (quadratic, cubic).

• Using Basis Functions: Instead of using the original features, you can use basis functions, which are transformations of the original features, e.g., Polynomial basis functions, Gaussian radial basis functions, or Sigmoidal basis functions.

• Regularization: Ridge regression (L2 regularization) or Lasso regression (L1 regularization) to penalize large coefficients.

• Non-linear Regression Models: If the relationship is highly non-linear, use Polynomial, Logistic, Exponential, Power-law, Gaussian, Logarithmic regression etc., Decision trees, Random forests, SVMs with non-linear kernels, or Neural networks.
We will see some of these…
An Example
• lift is the dependent variable, and the independent variable is ‘hours’, i.e., the time spent in weight lifting.
• We add a quadratic term as an independent variable in the model: y = x^2 (a parabola).
Source: https://bookdown.org/
Basis Functions: Why are they needed?

Linear or non-linear?
Let us add a basis function x1x2 into the input (this term couples two terms non-linearly).
With the third input z = x1x2 the XOR becomes linearly separable.
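The XOR claim can be checked with a short sketch. It assumes the inputs are coded as 0/1 (the slide's coding is not shown), and the weights (1, 1, -2) and threshold 0.5 are one hand-picked choice, not the only one, that separates the classes once z = x1x2 is available.

```python
def xor_with_product_feature(x1, x2):
    """Classify XOR with a linear rule over the inputs (x1, x2, z = x1 * x2)."""
    z = x1 * x2                              # the added basis function
    score = 1.0 * x1 + 1.0 * x2 - 2.0 * z    # linear in (x1, x2, z); weights are illustrative
    return 1 if score > 0.5 else 0


for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_with_product_feature(x1, x2))
```

No such linear rule exists over (x1, x2) alone, which is why the extra feature is needed.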

Acknowledgement: Volker Tresp’s presentation


Continued…

Acknowledgement: Volker Tresp’s presentation


What are Basis Functions?
Simplest model of Linear Regression: y(x, w) = w0 + w1·x1 + … + wD·xD
Key Property: Linear function of parameters. Also, it is a linear function of its input variables → Imposes serious limitations on the model.
Basis functions come to the rescue (called derived features in machine learning): they are building blocks for creating more complex functions.
For example, individual powers of x: the basis functions 1, x, x^2, x^3, … can be combined together to form a polynomial function.
Basis functions extend this class of models by considering linear combinations of handpicked fixed nonlinear functions of the input variables:
y(x, w) = w0 + Σ_{j=1..M-1} wj·φj(x), or (vector form) y(x, w) = w^T φ(x)
Where, w = (w0, …, w_{M-1})^T and φ = (φ0, …, φ_{M-1})^T, with φ0(x) = 1.
This gives non-linearity in the data while keeping linearity in parameters.
Basis functions for Non-linearity
Where,
φj(x) = x^j (Polynomial basis function). Global: a small change in x affects all basis functions.
φj(x) = exp(−(x − μj)^2 / (2s^2)) (Gaussian basis function). Local: a small change in x only affects nearby basis functions.
φj(x) = σ((x − μj) / s), with σ(a) = 1 / (1 + exp(−a)) (Sigmoidal basis function). Local: a small change in x only affects nearby basis functions.
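The three families of basis functions can be sketched in a few lines. The standard textbook forms are assumed here (powers of x, a Gaussian bump centred at mu with width s, and a logistic sigmoid centred at mu with slope 1/s); the parameter names mu and s and the helper predict are illustrative, not taken from the slides.

```python
import math


def polynomial_basis(x, j):
    """j-th power of x."""
    return x ** j


def gaussian_basis(x, mu, s):
    """Gaussian bump centred at mu with width s."""
    return math.exp(-((x - mu) ** 2) / (2 * s ** 2))


def sigmoidal_basis(x, mu, s):
    """Logistic sigmoid centred at mu with slope 1/s."""
    return 1.0 / (1.0 + math.exp(-(x - mu) / s))


def predict(weights, features):
    """Model output: linear in the weights, whatever the features are."""
    return sum(w * f for w, f in zip(weights, features))


# A parabola y = x^2 written as a model that is linear in its parameters.
x = 3.0
features = [polynomial_basis(x, j) for j in range(3)]   # [1, x, x^2]
weights = [0.0, 0.0, 1.0]
y = predict(weights, features)
print(y)
```

Because predict is linear in weights, the same least squares machinery used for a straight line still applies after the inputs are transformed.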
The Learning Algorithm
Flow: start with initial random weights → compute the least square error → compute the gradient to change the weights → repeat until the error is minimized → when the loss is stable, output the model.
Loss curve:
• Error is too high. Are the weights correct?
• Reduced rapidly. Weights tend to become stable.
• No more change of the loss/ cost function. Model found best weights.
An Example of house price prediction
Training Set:
Size in sq. feet (x)    Price in 1000’s (y)
2104                    460
1416                    232
1534                    315
852                     178
…                       …
[Figure: scatter plot of Price (y, in 1000’s) against Size (x)]
Minimize Cost/ Loss (MSE): J(θ0, θ1) = (1 / 2m) · Σ_{i=1..m} (h_θ(x_i) − y_i)^2, where h_θ(x) = θ0 + θ1·x and m is the number of training examples.

The division by 2 is for convenience and doesn't fundamentally change the result; it simplifies the derivative computation when optimizing models.
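As a rough illustration of the cost being minimized, the sketch below evaluates the halved mean squared error on the four training rows listed in the table. It assumes the usual univariate hypothesis h(x) = theta0 + theta1 * x; the trial parameter values are arbitrary.

```python
# The four rows shown in the training set (size in sq. feet, price in 1000's).
training_set = [(2104, 460), (1416, 232), (1534, 315), (852, 178)]


def cost(theta0, theta1, data):
    """Halved mean squared error of h(x) = theta0 + theta1 * x on data."""
    m = len(data)
    squared_errors = ((theta0 + theta1 * x - y) ** 2 for x, y in data)
    return sum(squared_errors) / (2 * m)


# Arbitrary trial parameters: the cost changes as the line changes.
print(cost(0, 0, training_set))
print(cost(0, 0.2, training_set))
```

Dropping the factor of 2 would scale every cost value equally, so the minimizing parameters would stay the same.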
Minimizing the Cost Function

[Figure sequence over five slides: paired plots with both axes running from 0 to 3]

Acknowledgement: Andrew Ng, Stanford
MSE cost function for linear regression is always Convex.
Gradient Descent: Minimizing the MSE
• Optimization algorithm used to minimize the MSE function by iteratively adjusting parameters in the direction of the negative gradient, aiming to find the optimal set of parameters.
If we represent the gradient of the loss function as ∇L, and the parameters we are optimizing as θ, then the update rule for gradient descent is:

θ_new = θ_old - α * ∇L

where α is the learning rate. Move in the opposite direction of the gradient.
Img. Source: https://www.analyticsvidhya.com/
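The update rule can be tried on a one-parameter toy problem. The loss L(theta) = (theta - 3)^2 is an assumption chosen only because its minimum (theta = 3) is known in advance; the starting point, learning rate, and step count are likewise illustrative.

```python
def gradient(theta):
    """Derivative of the toy loss L(theta) = (theta - 3)**2."""
    return 2 * (theta - 3)


theta = 0.0      # arbitrary starting point
alpha = 0.1      # learning rate (illustrative value)
for _ in range(200):
    # theta_new = theta_old - alpha * gradient: step against the gradient
    theta = theta - alpha * gradient(theta)

print(theta)
```

Each step shrinks the distance to the minimum by a constant factor here, which is why the iterate settles quickly.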
Many local minima in gradient descent?

The MSE cost function is convex. Will you get many local minima? No, there is only one global minimum.

Reason: If you pick any two points on the curve, the line segment joining them never falls below the curve.
Visualizing Gradient Descent

(visualized by using Contours)


Acknowledgement: Andrew Ng, Stanford
A bit of Math: Derivative of a Function?
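[Figure: distance versus time for a 100 m sprint by Amlan Borgohain, finished in 10.25 s; Δy is the distance covered and Δx the time taken]

What is his Average Speed? Δy/Δx = 100 m / 10.25 s ≈ 9.76 m/s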
Instantaneous Speed Vs Average Speed

[Figure: the same distance-time curve (100 m, 10.25 s) with Δy/Δx measured over several shorter intervals]

Will Δy/Δx over a shorter interval be different from the average slope, i.e., Δy/Δx over the whole race?
What would really be the instantaneous speed?
Better approximation: the slope around the steepest point. Measure the slope with a smaller and smaller change in x that yields a smaller and smaller change in y.
Fastest instantaneous speed? This is still an approximation, as the slope is changing constantly.
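The idea of shrinking Δx can be sketched numerically. The distance function below, f(t) = t^2, is a hypothetical stand-in (not the sprinter's actual curve), chosen because its exact slope at t = 2 is known to be 4, so the estimates can be compared against it.

```python
def distance(t):
    """Hypothetical distance-time curve, used only for illustration."""
    return t ** 2


t = 2.0
estimates = []
for delta_x in (1.0, 0.1, 0.01, 0.001):
    delta_y = distance(t + delta_x) - distance(t)
    estimates.append(delta_y / delta_x)   # slope over a smaller and smaller interval

print(estimates)
```

As the interval shrinks, the ratio Δy/Δx approaches the derivative at that point, which is the instantaneous speed.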
What is Partial Derivative?
What is the partial derivative of this function at P(1,1)?

That is the slope of f at the point (x, y)

(Img. Source: Wiki)


Gradient: All partial derivatives together

Gradient

Multivariate Function:

Acknowledgement: Mohammad Hammoud, CMU (Qatar)
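A gradient simply collects the partial derivatives, each computed by varying one input while holding the others fixed. The sketch below approximates them with central differences for a hypothetical function f(x, y) = x^2 + y^2 (the function on the slide is not reproduced here), evaluated at the point P(1, 1).

```python
def f(x, y):
    """Hypothetical multivariate function, used only for illustration."""
    return x ** 2 + y ** 2


def numerical_gradient(func, x, y, h=1e-5):
    """Approximate (df/dx, df/dy) with central differences of step h."""
    df_dx = (func(x + h, y) - func(x - h, y)) / (2 * h)   # y held fixed
    df_dy = (func(x, y + h) - func(x, y - h)) / (2 * h)   # x held fixed
    return df_dx, df_dy


grad = numerical_gradient(f, 1.0, 1.0)
print(grad)
```

For this particular f the exact gradient is (2x, 2y), so the estimate at (1, 1) should be close to (2, 2).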


The Impact of Partial Derivative

[Figure sequence: cost curve plotted against 𝞱1, starting at 𝞱1 = 1]
• Positive Derivative: the new 𝞱1 lies to the left of the old 𝞱1 (𝞱1 decreases).
• Negative Derivative: the new 𝞱1 lies to the right of the old 𝞱1 (𝞱1 increases).
• Derivative = 0: 𝞱1 no longer changes.

Acknowledgement: Mohammad Hammoud, CMU (Qatar)


The Impact of Learning Rate

[Figure sequence: Learning Rate illustrated on the cost curve over 𝞱1; values shown: 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, …, 0.9, 1]

Acknowledgement: Mohammad Hammoud, CMU (Qatar)
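The effect of the learning rate can be seen on a toy cost. The sketch assumes the cost J(theta1) = theta1^2 (so the derivative is 2 * theta1) and starts from theta1 = 1; both the cost and the three learning rates are illustrative choices, not values from the slides.

```python
def run_gradient_descent(alpha, steps=20, theta1=1.0):
    """Run gradient descent on the toy cost J(theta1) = theta1**2."""
    for _ in range(steps):
        theta1 = theta1 - alpha * (2 * theta1)   # derivative of theta1**2 is 2*theta1
    return theta1


too_small = run_gradient_descent(alpha=0.01)   # creeps towards the minimum at 0
reasonable = run_gradient_descent(alpha=0.4)   # converges quickly
too_large = run_gradient_descent(alpha=1.1)    # overshoots and diverges
print(too_small, reasonable, too_large)
```

A rate that is too small wastes iterations, while one that is too large makes every step overshoot the minimum by more than the previous one.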
Gradient Descent for Linear Regression

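Repeat until convergence {
    θ0 := θ0 − α · (1/m) · Σ_{i=1..m} (h_θ(x_i) − y_i)
    θ1 := θ1 − α · (1/m) · Σ_{i=1..m} (h_θ(x_i) − y_i) · x_i
}
(Update θ0 and θ1 simultaneously; h_θ(x) = θ0 + θ1·x, m is the number of training examples, and α is the learning rate.)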
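A minimal sketch of batch gradient descent for univariate linear regression follows. It assumes the hypothesis h(x) = theta0 + theta1 * x and the halved MSE cost, so each gradient is the average of the errors (times x for theta1). The data is synthetic, generated from the hypothetical line y = 2 + 1.5x, and the learning rate and iteration count are illustrative.

```python
def batch_gradient_descent(xs, ys, alpha=0.05, iterations=5000):
    """Fit h(x) = theta0 + theta1 * x by batch gradient descent on the halved MSE."""
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(iterations):
        # Errors are computed with the current parameters over ALL examples.
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update of both parameters.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1


# Synthetic, noise-free data from the hypothetical line y = 2 + 1.5 * x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 + 1.5 * x for x in xs]
theta0, theta1 = batch_gradient_descent(xs, ys)
print(theta0, theta1)
```

On features with large magnitudes (such as house sizes in square feet) a much smaller learning rate or feature scaling would be needed for the same loop to converge.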
Batch Vs Stochastic Gradient Descent
Batch gradient descent: very smooth convergence, but it uses all the data for one update.

Stochastic gradient descent: very noisy convergence, because it uses only one data point for one update.
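For contrast with the batch version, the sketch below performs one parameter update per data point, visiting the points in a shuffled order each epoch. The data is synthetic (from the hypothetical line y = 2 + 1.5x), and the seed, learning rate, and epoch count are illustrative assumptions.

```python
import random


def stochastic_gradient_descent(xs, ys, alpha=0.05, epochs=2000, seed=0):
    """Fit h(x) = theta0 + theta1 * x using one example per update."""
    rng = random.Random(seed)            # fixed seed so runs are repeatable
    theta0, theta1 = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)             # visit the examples in a random order
        for i in indices:
            error = theta0 + theta1 * xs[i] - ys[i]   # error on ONE example
            theta0 -= alpha * error
            theta1 -= alpha * error * xs[i]
    return theta0, theta1


# Synthetic, noise-free data from the hypothetical line y = 2 + 1.5 * x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 + 1.5 * x for x in xs]
theta0, theta1 = stochastic_gradient_descent(xs, ys)
print(theta0, theta1)
```

With noisy data the stochastic iterates would keep jittering around the minimum unless the learning rate is decreased over time.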
Regression vs. Classification
Aspect | Regression | Classification
Objective | Predict continuous values or a range of values (3.4, 8.6, …) | Predict categorical labels (0 or 1; cat, dog, sheep; low, medium, high)
Example | House prices; Stock prices; Body Mass Index; Energy consumption etc. | Spam emails; Image classification; Loan approval (approved/ not approved); Customer churn etc.
Evaluation metrics | MSE, RMSE, MAE, R2 | Accuracy, Precision, Recall, F1, AUC
Algorithms | Linear regression, Ridge, Lasso, Polynomial regression, DT with numerical targets etc. | Logistic regression, DT with categorical targets, Naïve Bayes, SVMs, KNN, …
Types of problems | Continuous outcome (how much?) | Discrete outcomes (which class?)
Logistic Regression
• The linear regression model discussed in the previous class assumes that the dependent variable is quantitative (continuous).

• However, in many situations, the dependent variable is instead qualitative (categorical).

• A patient arrives at the campus medical centre (BITS) with cough, fever and runny nose.
• Which disease does the patient have? Influenza (Flu) (20-30%), Acute Bronchitis (15-25%), Common cold (10-20%).

Question: Which one is the dependent variable and which are the independent variables?


Logistic Regression
[Same slide as the previous one, with overlaid labels: Credential Theft (20-30%), Malware Distribution (15-20%)]


Logistic Regression
• Logistic regression is a linear model (a generalized linear model, despite the name "regression") that predicts the probability of an event occurring based on one or more input features. It's widely used for binary classification problems.
• How does it work?
Step1: Linear combination: Calculate a linear combination of the input features and their weights, which is represented by the equation:
z = β0 + β1·x1 + … + βn·xn, where ‘z’ is the log odds score.
Step2: Apply the logistic function (also known as the Sigmoid) to the linear combination result (z):
p = 1 / (1 + exp(-z))
Step3: Thresholding: Compare the predicted probability with a threshold value (usually set to 0.5). If p > 0.5, predict class 1; otherwise, predict class 0.
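The three steps above can be sketched as a single function. The coefficient values in the example call are hypothetical, chosen only to show the flow from the log odds score to a probability and then to a class label.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(betas, xs, threshold=0.5):
    """Return (z, p, label) for coefficients betas = [b0, b1, ..., bn]."""
    z = betas[0] + sum(b * x for b, x in zip(betas[1:], xs))   # Step 1: linear combination
    p = sigmoid(z)                                             # Step 2: logistic function
    label = 1 if p > threshold else 0                          # Step 3: thresholding
    return z, p, label


# Hypothetical coefficients and a single input feature.
z, p, label = predict([-1.0, 2.0], [1.0])
print(z, p, label)
```

Note that p > 0.5 is equivalent to z > 0, so the decision boundary itself is linear in the features.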
Example: Hiking in Seattle
[Figure: hiking data plotted by day]

Should we fit a linear regression model to this data? No.


Loss function for logistic regression
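• If you use MSE for logistic regression, what problems might it create?

• A suitable loss function in logistic regression is called the Log-Loss, or binary cross-entropy. This function is:
Log-Loss = −(1/N) · Σ_{i=1..N} [ y_i · log(p_i) + (1 − y_i) · log(1 − p_i) ]
where y_i is the true label (0 or 1) and p_i is the predicted probability for example i.

• It penalizes deviations (incorrect probability predictions), offering a continuous metric for optimization during model training.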
What is it?
Why Log-Loss?
You can see how, as the probability gets closer to the true value (p=0 when y=0 and p=1 when y=1), the Log-Loss decreases to 0. As the probability gets further from the true value, the Log-Loss approaches infinity.
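The behaviour described above can be checked with a direct implementation of binary cross-entropy. This is a minimal sketch that assumes every predicted probability lies strictly between 0 and 1 (a production version would clip the values to avoid log(0)); the sample probabilities are made up.

```python
import math


def log_loss(y_true, p_pred):
    """Average binary cross-entropy; probabilities must lie strictly in (0, 1)."""
    n = len(y_true)
    total = 0.0
    for y, p in zip(y_true, p_pred):
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / n


# True label is 1: the loss shrinks as the predicted probability approaches 1.
for p in (0.1, 0.5, 0.9, 0.99):
    print(p, log_loss([1], [p]))
```

A confident wrong prediction (for example p = 0.01 when y = 1) is penalized far more heavily than an uncertain one.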
How Gradient Descent Works for Logistic Regression?

[Figure panels: the fitted model after 1 step, 5 steps, and 10 steps of gradient descent]

Source: mlu-explain.github.io/logistic-regression/
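A minimal sketch of gradient descent for a one-feature logistic regression is given below. It relies on the standard result that the gradient of the average Log-Loss with respect to each coefficient is the mean of (p - y) times the corresponding input. The tiny dataset, learning rate, and step count are invented for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(xs, ys, alpha=0.5, steps=5000):
    """Fit p = sigmoid(b0 + b1 * x) by batch gradient descent on the Log-Loss."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        ps = [sigmoid(b0 + b1 * x) for x in xs]
        # Gradient of the average Log-Loss: mean of (p - y) and of (p - y) * x.
        grad0 = sum(p - y for p, y in zip(ps, ys)) / n
        grad1 = sum((p - y) * x for p, y, x in zip(ps, ys, xs)) / n
        b0 -= alpha * grad0
        b1 -= alpha * grad1
    return b0, b1


# Invented toy data: small x values belong to class 0, large ones to class 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0, 0, 1, 1]
b0, b1 = train_logistic(xs, ys)
predictions = [1 if sigmoid(b0 + b1 * x) > 0.5 else 0 for x in xs]
print(b0, b1, predictions)
```

Because this toy data is perfectly separable, the coefficients keep growing slowly with more steps; the predicted labels stabilize long before that.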
Chances of Admission to BITS Pilani: Ex.

Define the Logistic Regression Model:

p = 1 / (1 + exp(-(β0 + β1·Math + β2·Physics + β3·Chemistry + β4·12th Percentage)))
Assignment 3
Thank You!
