Hota ML Regression
11.09.2024
[Figure: examples of regression curves — Logistic and Polynomial fits, with Height on one axis. Source: https://currentaffairs.adda247.com/]
• Unemployment rate, education level, population count, land area, income level, investment rate, life expectancy, … (Multiple Linear Regression: multivariate)
Another Example of Multivariate Regression
Sales = b + w1·weather + w2·money + w3·day
Regression: the process of finding the relationship between a dependent variable (outcome/response/label) and one or more independent variables (predictors/covariates/explanatory variables/features).
Independent variables (X): weather (rainy, sunny, cloudy), amount in hand, day type (working, holiday). Dependent variable (Y): Sales.
How will the dependent variable (Y) react to each variable X taken independently?
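A minimal numpy sketch of fitting such a Sales model; the data and the numeric encoding of the categorical inputs below are invented purely for illustration (in practice, one-hot encoding would suit the categorical variables better):

import numpy as np

# Toy data for Sales = b + w1*weather + w2*money + w3*day (all values invented)
weather = np.array([0, 1, 2, 1, 0, 2])            # 0=rainy, 1=sunny, 2=cloudy (crude encoding)
money   = np.array([50, 80, 60, 90, 40, 70])      # amount in hand
day     = np.array([0, 0, 1, 1, 0, 1])            # 0=working, 1=holiday
sales   = np.array([120, 200, 180, 230, 100, 190])

# Design matrix with a bias column, then an ordinary least-squares fit
X = np.column_stack([np.ones(6), weather, money, day])
w, *_ = np.linalg.lstsq(X, sales, rcond=None)
print(w)                                          # [b, w1, w2, w3]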
Fitting the Best Line: Least Squares Method
How are X and Y related? What does the fitted slope tell us if it is > 0? If < 0? If == 0?
residual = observed response − predicted response (data − fit)
Find the optimal parameter values by minimizing the sum of squared residuals.
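A minimal sketch of the closed-form least-squares fit for a single predictor (the data points are made up for illustration):

import numpy as np

# Toy observations (made up for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Closed-form least squares for a line y = intercept + slope*x:
# slope = cov(x, y) / var(x), intercept = mean(y) - slope*mean(x)
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

residuals = y - (intercept + slope * x)           # residual = data - fit
print(slope, intercept, np.sum(residuals ** 2))   # minimized sum of squared residuals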
Can you choose the best-fit line?
BMI = 18 + 1.5·(diet score) + 1.6·(male) + 4.2·(age > 20)
Y = β0 + β1·x1 + β2·x2 + β3·x3
Does the model describe the data well or poorly? Residuals randomly scattered around zero indicate a good fit.
When the data contain a second-degree polynomial (quadratic) component that the model misses, the residuals are systematically positive for much of the data. Good or bad fit?
Non-linear relations using Linear models?
• Feature Engineering: engineer new features by transforming the existing ones to capture non-linear relationships; e.g., you can include polynomial features (quadratic, cubic).
• Using Basis Functions: instead of using the original features, you can use basis functions, which are transformations of the original features, e.g., polynomial basis functions, Gaussian radial basis functions, or sigmoidal basis functions.
Example: we add a quadratic term as an independent variable in the model, y = x²; the fitted curve is a parabola. (Source: https://bookdown.org/)
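A small numpy sketch of this idea: generate data with a quadratic component (parameters invented for illustration), then fit an ordinary linear model on the engineered features [1, x, x²]:

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 30)
y = 1.5 * x**2 - 2.0 * x + 0.5 + rng.normal(0, 0.3, x.size)   # quadratic + noise

# The model is still LINEAR in its parameters; only the features are non-linear
X = np.column_stack([np.ones_like(x), x, x**2])   # engineered features [1, x, x^2]
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta)                                      # roughly [0.5, -2.0, 1.5]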
Basis Functions: Why are they needed?
Linear or non-linear? Let us add a basis function x1·x2 to the input (this term couples the two inputs non-linearly). With the third input z = x1·x2, the XOR problem becomes linearly separable.
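A small sketch verifying this on the XOR truth table; the separating weights below are one hand-picked choice, not unique:

import numpy as np

# XOR truth table: not linearly separable in (x1, x2) alone
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

# Add the basis function z = x1*x2 as a third input
Z = np.column_stack([X, X[:, 0] * X[:, 1]])

# In (x1, x2, z) space the plane x1 + x2 - 2z = 0.5 separates the two classes
scores = Z @ np.array([1.0, 1.0, -2.0])
print((scores > 0.5).astype(int))                 # [0 1 1 0], matching y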
Minimize Cost/Loss (MSE):
J(θ) = (1/2m) · Σᵢ (ŷᵢ − yᵢ)²
The division by 2 is for convenience and doesn't fundamentally change the result; it simplifies the derivative computation when optimizing models.
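A minimal Python sketch of this cost function (toy data chosen for illustration):

import numpy as np

def mse_cost(theta0, theta1, x, y):
    # J(theta) = (1/2m) * sum((theta0 + theta1*x - y)^2); the 1/2
    # cancels against the exponent when the derivative is taken
    m = x.size
    return np.sum((theta0 + theta1 * x - y) ** 2) / (2 * m)

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
print(mse_cost(0.0, 1.0, x, y))                   # perfect fit -> 0.0
print(mse_cost(0.0, 0.5, x, y))                   # poorer fit -> larger cost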
Minimizing the Cost Function
The MSE cost function for linear regression is always convex.
Gradient Descent: Minimizing the MSE
• Optimization algorithm used to minimize the MSE function by iteratively
adjusting parameters in the direction of the negative gradient, aiming to
find the optimal set of parameters.
If we represent the gradient of the loss function as ∇L and the parameters we are optimizing as θ, then the update rule for gradient descent is:
θ_new = θ_old − α · ∇L
where α is the learning rate.
The MSE cost function is convex. Will you get many local minima? No, only one global minimum. Reason: if you pick any two points on the curve, the line segment joining them never crosses the curve (it never dips below it).
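A minimal sketch of batch gradient descent on a toy linear-regression problem; the data, learning rate, and iteration count are all illustrative choices:

import numpy as np

# Batch gradient descent for y = theta0 + theta1*x (true line: y = 1 + 2x)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])
theta = np.zeros(2)                               # [theta0, theta1]
alpha = 0.05                                      # learning rate (illustrative)

for _ in range(2000):
    error = theta[0] + theta[1] * x - y           # prediction - target
    grad = np.array([error.mean(), (error * x).mean()])  # gradient of (1/2m)*sum(error^2)
    theta = theta - alpha * grad                  # theta_new = theta_old - alpha * grad
print(theta)                                      # approaches [1.0, 2.0]: the single global minimum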
Visualizing Gradient Descent
[Figure: distance vs. time for a 100 m sprint finished in 10.25 s, with the slope Δy/Δx measured over intervals of different sizes]
Will a Δy/Δx measured over a smaller interval differ from the average slope, i.e., the overall Δy/Δx? What would the instantaneous speed really be?
Better approximation: measure the slope around the point of interest with a smaller and smaller change in x, which yields a smaller and smaller change in y.
[Figure: zooming in on the distance vs. time curve around the steepest point]
Gradient of a Multivariate Function: the vector of partial derivatives with respect to each parameter, ∇f(θ) = [∂f/∂θ1, ∂f/∂θ2, …, ∂f/∂θn].
For a single parameter θ1, the sign of the derivative tells gradient descent which way to step:
• Positive derivative: new θ1 < old θ1 (step left, downhill).
• Negative derivative: new θ1 > old θ1 (step right, downhill).
• Derivative = 0: θ1 no longer changes; the minimum has been reached.
The learning rate α scales the size of each step.
Batch vs. Stochastic Gradient Descent
• Batch GD: very smooth convergence; however, it uses all the data for one update.
• Stochastic GD: very noisy convergence, because it uses only one data point for one update.
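A side-by-side sketch of the two update rules on the same toy data; the hyperparameters are illustrative, not tuned:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 1.0 + 2.0 * x + rng.normal(0, 1.0, 100)       # noisy line (made-up data)

def batch_step(theta, alpha=0.01):
    e = theta[0] + theta[1] * x - y               # errors over ALL 100 points
    return theta - alpha * np.array([e.mean(), (e * x).mean()])

def sgd_step(theta, alpha=0.01):
    i = rng.integers(x.size)                      # one randomly chosen point
    e = theta[0] + theta[1] * x[i] - y[i]
    return theta - alpha * np.array([e, e * x[i]])

theta_b, theta_s = np.zeros(2), np.zeros(2)
for _ in range(5000):
    theta_b = batch_step(theta_b)                 # smooth trajectory, costly per step
    theta_s = sgd_step(theta_s)                   # noisy trajectory, cheap per step
print(theta_b, theta_s)                           # batch converges smoothly; SGD hovers noisily near [1, 2]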
Regression vs. Classification
Aspect | Regression | Classification
Objective | Predict continuous values or a range of values (3.4, 8.6, …) | Predict categorical labels (0 or 1; cat, dog, sheep; low, medium, high)
[Figure: logistic regression fit improving after 1, 5, and 10 gradient steps. Source: mlu-explain.github.io/logistic-regression/]
Example: Chances of Admission to BITS Pilani
p = 1 / (1 + e^−(β0 + β1·Math + β2·Physics + β3·Chemistry + β4·12th Percentage))
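A small sketch of this admission-probability model; the coefficients below are entirely hypothetical, chosen only so the sigmoid yields a sensible value in (0, 1):

import numpy as np

def admission_probability(math, physics, chemistry, pct12, beta):
    # p = 1 / (1 + exp(-(b0 + b1*Math + b2*Physics + b3*Chemistry + b4*Pct)))
    z = beta[0] + beta[1]*math + beta[2]*physics + beta[3]*chemistry + beta[4]*pct12
    return 1.0 / (1.0 + np.exp(-z))

# Entirely hypothetical coefficients; a real model would learn them from data
beta = np.array([-50.0, 0.15, 0.12, 0.10, 0.20])
print(admission_probability(95, 90, 88, 92, beta))    # ~0.90, a probability in (0, 1)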
Assignment 3
Thank You!