
Regression

Dr Vidhya Kamakshi
Assistant Professor
National Institute of Technology Calicut
[email protected]

Fall 2024 – B Tech S7


Data from a film maker

[Scatter plot of the film maker's data: Earnings plotted against Expenses]


Notations
• Training data $= \{(x_i, y_i)\}_{i=1}^{N}$
• Goal: learn a function $f: \mathbb{R} \to \mathbb{R}$ such that $f(x) \approx y$
• Representation – here the function is assumed to be a straight line
• So, $f(x) = w_0 + w_1 x$
• The aim is to find the parameters $w_0$ and $w_1$ that best fit the given data
  (the predicted value $f(x)$ is as close as possible to the true value $y$)
• $w = \begin{bmatrix} w_0 \\ w_1 \end{bmatrix}$ is the vector of learnable parameters


Evaluation
• $f(x)$ has to be as close as possible to the true value $y$
• An evaluation function drives this objective
• Least squares error $= (f(x) - y)^2$
• For a best fit, the least squares error has to be low for most, if not all, data points
• The objective / loss function is thus

$$J(w) = \frac{1}{2N} \sum_{i=1}^{N} \big(f(x_i; w) - y_i\big)^2$$

where $f(x_i; w)$ is to be read as $f(x_i)$ parameterized by $w$.
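A minimal NumPy sketch of this loss, assuming the univariate model $f(x; w) = w_0 + w_1 x$; the data values and the parameter choices below are purely illustrative, not taken from the slides:

```python
import numpy as np

def predict(x, w):
    """Univariate linear model f(x; w) = w0 + w1 * x."""
    w0, w1 = w
    return w0 + w1 * x

def loss(w, x, y):
    """J(w) = 1/(2N) * sum_i (f(x_i; w) - y_i)^2."""
    residuals = predict(x, w) - y
    return np.sum(residuals ** 2) / (2 * len(x))

# Made-up data roughly following y ~ 2x.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])

print(loss(np.array([0.0, 0.5]), x, y))  # poor fit -> larger loss
print(loss(np.array([0.0, 2.0]), x, y))  # better fit -> smaller loss
```

Evaluating the loss at a few candidate $(w_0, w_1)$ pairs, as on the next slide, shows directly which line fits the data better.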
[Plots of candidate straight-line fits for different parameter values, e.g. $w_0 = 0, w_1 = 0.5$ and $w_0 = 0, w_1 = 2$, ...]


Plot the error surface for the given data

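A sketch of how such an error surface can be produced with matplotlib, reusing the illustrative data and loss defined above (the grid ranges are arbitrary assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])

# Grid of candidate (w0, w1) values.
w0_grid, w1_grid = np.meshgrid(np.linspace(-2, 4, 100), np.linspace(-1, 4, 100))

# J(w) evaluated at every grid point.
J = np.zeros_like(w0_grid)
for xi, yi in zip(x, y):
    J += (w0_grid + w1_grid * xi - yi) ** 2
J /= 2 * len(x)

plt.contourf(w0_grid, w1_grid, J, levels=30)
plt.colorbar(label="J(w)")
plt.xlabel("$w_0$")
plt.ylabel("$w_1$")
plt.title("Error surface")
plt.show()
```

The bowl-shaped surface has a single minimum, which is exactly what gradient descent (next slide) exploits.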


Optimization
• To get the best fit, the overall loss must be minimized
• Done using Gradient Descent
• Basic Principle:
• Start with an initial estimate of 𝑤
• Change 𝑤 iteratively such that the loss function J(𝑤) is minimized
• Stop the iterative process if either of the following convergence conditions is met:
• Minimum value of J(𝑤) is hit
• No significant improvement observed in J(𝑤) values between two successive iterations
• Gradient – Provides the direction of increase of the function
• Perform Descent (opposing the gradient direction) in order to minimize the
loss function J(𝑤)



Parameter updates
• Update the parameters $w$ (weight $w_1$ and bias $w_0$) of the linear regressor as:

$$w^{new} = w^{old} - \alpha \, \nabla_w J(w)$$

• Repeat until convergence
• Here,
  • $\alpha$ is called the learning rate
  • $\nabla_w J(w)$ is the gradient (the partial derivative of the loss function $J(w)$ with respect to the learnable parameter $w$)


Importance of the learning rate
• A critical hyper-parameter to be decided by the machine learning engineer
• Too low an $\alpha$ may lead to slow convergence
• Too high an $\alpha$ may lead to oscillations
• Typically $\alpha = 10^{-3}$ is recommended for most cases (no theoretical proof of why this works is available, but it seems to work in practice for most practitioners)


Computing Gradients

$$\nabla_w J(w) = \begin{bmatrix} \nabla_{w_0} J(w) \\ \nabla_{w_1} J(w) \end{bmatrix}$$

$$J(w) = \frac{1}{2N} \sum_{i=1}^{N} (w_0 + w_1 x_i - y_i)^2$$

$$\nabla_w J(w) = \begin{bmatrix} \dfrac{1}{N} \sum_{i=1}^{N} (w_0 + w_1 x_i - y_i) \\[2mm] \dfrac{1}{N} \sum_{i=1}^{N} x_i \,(w_0 + w_1 x_i - y_i) \end{bmatrix}$$

$$\nabla_w J(w) = \begin{bmatrix} \dfrac{1}{N} \sum_{i=1}^{N} \big(f(x_i; w^{old}) - y_i\big) \\[2mm] \dfrac{1}{N} \sum_{i=1}^{N} x_i \,\big(f(x_i; w^{old}) - y_i\big) \end{bmatrix}$$
Parameter updates - Expanded
• Repeat until convergence:

$$w_0^{new} = w_0^{old} - \alpha \, \frac{1}{N} \sum_{i=1}^{N} \big(f(x_i; w^{old}) - y_i\big)$$

$$w_1^{new} = w_1^{old} - \alpha \, \frac{1}{N} \sum_{i=1}^{N} x_i \,\big(f(x_i; w^{old}) - y_i\big)$$

• The $\tfrac{1}{N}$ factor is mostly ignored in analysis but helps practically
• Practical implementations use batch-mode processing
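The updates above translate almost line for line into code. A sketch of batch gradient descent for the univariate model, with illustrative data, learning rate, and stopping tolerance:

```python
import numpy as np

def gradient_descent(x, y, alpha=1e-2, iters=5000, tol=1e-10):
    """Batch gradient descent for f(x; w) = w0 + w1 * x."""
    w0, w1 = 0.0, 0.0                       # initial estimate of w
    prev_loss = np.inf
    for _ in range(iters):
        err = (w0 + w1 * x) - y             # f(x_i; w_old) - y_i
        w0 -= alpha * np.mean(err)          # bias update
        w1 -= alpha * np.mean(x * err)      # weight update
        loss = np.sum(err ** 2) / (2 * len(x))
        if abs(prev_loss - loss) < tol:     # no significant improvement
            break
        prev_loss = loss
    return w0, w1

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])
print(gradient_descent(x, y))               # approaches the least-squares fit
```

Re-running with a much smaller or much larger `alpha` reproduces the slow-convergence and oscillation behaviour discussed on the learning-rate slide.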


Real-world data – Multivariate
• In the real world, many factors influence an outcome
• Extending the film maker data example

• Number of input features = $D$


Multivariate Linear Regression – Extension from Univariate Regression
• Representation: $f(x; w) = w_0 + w_1 x_1 + \dots + w_D x_D$
• Evaluation: $J(w) = \dfrac{1}{2N} \sum_{i=1}^{N} \big(f(x_i; w) - y_i\big)^2$
• Analogous to the simpler univariate case, the parameter update is:

$$w_j^{new} = w_j^{old} - \alpha \, \frac{1}{N} \sum_{i=1}^{N} x_{ij} \,\big(f(x_i; w^{old}) - y_i\big)$$


Potential issues
• Different scales of features

• Transform to a common scale

• Standardization – mean = 0, variance = 1
• Normalization – all feature values lie in either the [-1, 1] or the [0, 1] range
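A sketch of both transformations applied column-wise to an assumed feature matrix (the feature values are made up):

```python
import numpy as np

# Made-up features on very different scales, e.g. budget and star rating.
X = np.array([[1200.0, 3.0],
              [3400.0, 1.0],
              [2100.0, 5.0]])

# Standardization: zero mean, unit variance per feature.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Normalization: rescale each feature to the [0, 1] range.
X_min, X_max = X.min(axis=0), X.max(axis=0)
X_norm = (X - X_min) / (X_max - X_min)

print(X_std)
print(X_norm)
```

Whichever transformation is used on the training data, the same statistics (means, variances, minima, maxima) must be reused on unseen data.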
Multivariate Linear Regression – Matrix Vector Interpretation
• Weights $w = \begin{bmatrix} w_0 \\ w_1 \\ w_2 \\ \vdots \\ w_D \end{bmatrix} \in \mathbb{R}^{D+1}$
  • $w_0$ is the bias term
  • $w_1, w_2, \dots, w_D$ are the weights corresponding to the $D$ input features
• Data point / instance $x = \begin{bmatrix} 1 \\ x_1 \\ x_2 \\ \vdots \\ x_D \end{bmatrix} \in \mathbb{R}^{D+1}$
  • $x_0 = 1$ is included to account for the bias term
• Representation: $f(x; w) = w^T x = w_0 + w_1 x_1 + \dots + w_D x_D$
Multivariate Linear Regression – Matrix Vector Interpretation (contd.)
• Data matrix $X = \begin{bmatrix} 1 & x_{11} & x_{12} & \dots & x_{1D} \\ 1 & x_{21} & x_{22} & \dots & x_{2D} \\ & & \vdots & & \\ 1 & x_{N1} & x_{N2} & \dots & x_{ND} \end{bmatrix} \in \mathbb{R}^{N \times (D+1)}$
• Representation: $f(X; w) = Xw$
• Target vector $y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} \in \mathbb{R}^{N}$
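A small sketch of this layout: prepend a column of ones to the raw features and compute all predictions at once as $Xw$ (the numbers are arbitrary placeholders):

```python
import numpy as np

# N = 3 instances, D = 2 raw features (arbitrary values).
X_raw = np.array([[2.0, 1.0],
                  [1.0, 3.0],
                  [4.0, 2.0]])
N, D = X_raw.shape

# Prepend x0 = 1 to account for the bias term: X is N x (D + 1).
X = np.hstack([np.ones((N, 1)), X_raw])

w = np.array([0.5, 1.0, -2.0])   # [w0, w1, w2], arbitrary for illustration

f = X @ w                        # f(X; w) = Xw, a vector in R^N
print(X.shape, f)
```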


Multivariate Linear Regression – Matrix Vector Interpretation (contd.)
• Evaluation: $J(w) = \frac{1}{2} \sum_{i=1}^{N} \big(f(x_i; w) - y_i\big)^2$
• In the corresponding matrix-vector notation:

$$J(w) = \frac{1}{2} \| f(X; w) - y \|_2^2$$
$$J(w) = \frac{1}{2} \| Xw - y \|_2^2$$
$$J(w) = \frac{1}{2} (Xw - y)^T (Xw - y)$$
$$J(w) = \frac{1}{2} (w^T X^T - y^T)(Xw - y)$$
$$J(w) = \frac{1}{2} \big(w^T X^T X w - w^T X^T y - y^T X w + y^T y\big)$$


Gradient Computations in Vector Notation
• Consider two vectors $p = \begin{bmatrix} p_1 \\ p_2 \\ \vdots \\ p_n \end{bmatrix}, \; q = \begin{bmatrix} q_1 \\ q_2 \\ \vdots \\ q_n \end{bmatrix} \in \mathbb{R}^{n}$
• Dot product $f: \mathbb{R}^n \to \mathbb{R}$, $\; f = p^T q = p_1 q_1 + p_2 q_2 + \dots + p_n q_n$
• $\nabla_p f = \begin{bmatrix} \nabla_{p_1} f \\ \nabla_{p_2} f \\ \vdots \\ \nabla_{p_n} f \end{bmatrix} \in \mathbb{R}^{n}$
• $\nabla_p f = \begin{bmatrix} q_1 \\ q_2 \\ \vdots \\ q_n \end{bmatrix} \in \mathbb{R}^{n}$
• $\nabla_p (p^T q) = q$; similarly $\nabla_q (p^T q) = p$


Back to Multivariate Linear Regression – Matrix Vector Interpretation
• Find $w$ that minimizes
$$J(w) = \frac{1}{2} \big(w^T X^T X w - w^T X^T y - y^T X w + y^T y\big)$$
• Standard idea from calculus: equate the gradient to 0, i.e. $\nabla_w J(w) = 0$
$$\frac{1}{2} \Big( X^T X w + (w^T X^T X)^T - X^T y - (y^T X)^T + 0 \Big) = 0$$
  • (the product rule is applied when differentiating $\nabla_w \, w^T X^T X w$)
$$\frac{1}{2} \big( X^T X w + X^T X w - X^T y - X^T y \big) = 0$$
$$\frac{1}{2} \big( 2\, X^T X w - 2\, X^T y \big) = 0$$
$$X^T X w - X^T y = 0$$
$$X^T X w = X^T y$$
• The necessary parameters (weights) are $w = (X^T X)^{-1} X^T y$
• This is called the analytical solution, which can be obtained in one shot
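A sketch of the one-shot analytical solution in NumPy; the data are synthetic, and `np.linalg.solve` is used on the normal equations $X^T X w = X^T y$ rather than forming the inverse explicitly, which is numerically preferable but mathematically equivalent when $X^T X$ is invertible:

```python
import numpy as np

def fit_analytical(X_raw, y):
    """One-shot least-squares solution of X^T X w = X^T y."""
    N = X_raw.shape[0]
    X = np.hstack([np.ones((N, 1)), X_raw])     # add bias column
    return np.linalg.solve(X.T @ X, X.T @ y)

# Synthetic data: y ~ 1 + 2*x1 - 3*x2 plus a little noise.
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(50, 2))
y = 1 + 2 * X_raw[:, 0] - 3 * X_raw[:, 1] + 0.1 * rng.normal(size=50)

print(fit_analytical(X_raw, y))   # approximately [1, 2, -3]
```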
Polynomial Regression
• Consider a function $f = \sin(2\pi x)$
• By adding random noise, samples $(x, y)$ are generated
• Given $\{(x, y)\}$, find $f$ such that $f(x) \approx y$
• Representation: polynomial
$$f(x; w) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M$$
• Loss function (Evaluation): sum-of-squares error
$$\mathcal{L}(w) = \frac{1}{2} \sum_{i=1}^{N} \big(f(x_i; w) - y_i\big)^2$$
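A sketch of fitting such a polynomial by solving a least-squares problem on the powers of $x$; the degree, sample size, and noise level are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 20, 3                                            # sample size, polynomial degree
x = rng.uniform(0, 1, size=N)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=N)    # noisy samples of sin(2*pi*x)

# Design matrix with columns 1, x, x^2, ..., x^M.
X = np.vander(x, M + 1, increasing=True)

# Least-squares estimate of the coefficients w0 ... wM.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)
```

Refitting with, say, M = 1 and M = 9 reproduces the under-fit and over-fit behaviour shown on the next slide.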


Choices of degree M

[Plots of polynomial fits for different degrees M, labelled under fit, good fit, and over fit]


Caveat
• The aim is to generalize to unseen data, not to memorize the training data
• An over-fit model is tuned to noise


Back to Linear Regression
• Overfitting is a problem in linear regression as well
• Higher dimensions are hard to visualize, hence the detour through univariate polynomial regression
• To mitigate overfitting and enhance generalization, regularization strategies that constrain the weights from exploding in either direction may be adopted


Regularization
• Constrain the weights from exploding in either direction
• Achieved by finding the parameters (weights) $w$ that minimize
$$J(w) + \lambda \sum_{j=1}^{D} |w_j|^q$$
• This formulation constrains a generic $L_q$ norm; $\lambda$ should be chosen at a sweet spot
• In practical cases, typically for q = 1 with a high $\lambda$, the weights become sparse
• The most used choice is q = 2, called Ridge Regression, formulated as:
$$J(w) + \frac{\lambda}{2} \sum_{j=1}^{D} w_j^2$$
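Ridge regression also admits a closed-form solution. A sketch below, assuming the matrix-form objective $J(w) = \frac{1}{2}\|Xw - y\|_2^2$ from earlier and leaving the bias $w_0$ unpenalized, as in the sum over $j = 1 \dots D$; the data and $\lambda$ values are illustrative:

```python
import numpy as np

def fit_ridge(X_raw, y, lam=1.0):
    """Closed-form ridge solution (X^T X + lam * P) w = X^T y,
    where P is the identity with a zero in the bias position."""
    N, D = X_raw.shape
    X = np.hstack([np.ones((N, 1)), X_raw])
    penalty = lam * np.eye(D + 1)
    penalty[0, 0] = 0.0                          # do not regularize the bias w0
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)

rng = np.random.default_rng(2)
X_raw = rng.normal(size=(30, 5))
y = X_raw @ np.array([2.0, -1.0, 0.0, 0.5, 3.0]) + 0.1 * rng.normal(size=30)

print(fit_ridge(X_raw, y, lam=0.0))   # ordinary least squares
print(fit_ridge(X_raw, y, lam=10.0))  # weights shrunk towards zero
```

Sweeping $\lambda$ (e.g. on a validation set) is how the sweet spot mentioned above is found in practice.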


Basics of Probability
• $S$ – sample space (the set of all possible outcomes of an experiment)
• Event – a subset of $S$
• The probability of any event $A$ satisfies the following axioms:
  • $0 \le P(A) \le 1$
  • $P(\emptyset) = 0$ [null event]
  • $P(S) = 1$ [certain event]
  • $P(A') = P(S) - P(A) = 1 - P(A)$ [negated event]
Independence, Conditional Probability
• Two events $A$ and $B$ are independent if
$$P(A \cap B) = P(A)\, P(B)$$
• Conditional probability:
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$
• $P(A \mid B)$ is read as the probability of $A$ given $B$
• E.g. What is the probability of a brain tumour?
  • Usually low
• What is the probability of a brain tumour given that the patient has reported severe headache for a month?
  • The probability (sense of belief) increases due to the evidence observed (headache for a month)
2 Views of Probability
• Frequentist view
  • Repeat the experiment $N$ times
  • Let $F$ be the number of times the event $A$ occurred
  • Compute $P(A) = \dfrac{F}{N}$
  • Limitation – cannot be used for non-repeating events
    • E.g. compute the probability of rain at 10 AM on the coming Friday
• Bayesian view
  • Uses a degree of belief
  • Looks at other factors and evidence to predict the probability
  • Can introduce domain expertise, often reported in books as subjective bias
Bayes Theorem
• $P(X \mid Y) = \dfrac{P(X \cap Y)}{P(Y)}$
  $\Rightarrow P(X \cap Y) = P(X \mid Y)\, P(Y)$ ---- (1)
• $P(Y \mid X) = \dfrac{P(Y \cap X)}{P(X)}$
  $\Rightarrow P(X \cap Y) = P(Y \mid X)\, P(X)$ ---- (2)
• From (1) & (2):
$$P(Y \mid X)\, P(X) = P(X \mid Y)\, P(Y)$$
$$P(Y \mid X) = \frac{P(X \mid Y)\, P(Y)}{P(X)} = \frac{P(X \mid Y)\, P(Y)}{\sum_{Y} P(X \mid Y)\, P(Y)}$$
• Here $P(Y \mid X)$ is called the posterior probability
• $P(X \mid Y)$ is called the likelihood
• $P(Y)$ is called the prior probability
• $P(X) = \sum_{Y} P(X \mid Y)\, P(Y)$ is called the total probability
Connecting to the Previous Example

$$P(Y{=}\text{BrainTumour} \mid X{=}\text{headache}) = \frac{P(X{=}\text{headache} \mid Y{=}\text{BrainTumour})\, P(Y{=}\text{BrainTumour})}{P(X{=}\text{headache} \mid Y{=}\text{BrainTumour})\, P(Y{=}\text{BrainTumour}) + P(X{=}\text{headache} \mid Y{=}\text{Migraine})\, P(Y{=}\text{Migraine}) + \dots}$$

• Here $P(Y{=}\text{BrainTumour} \mid X{=}\text{headache})$ is called the posterior probability (predicting a brain tumour given that the patient reported headache)
• $P(X{=}\text{headache} \mid Y{=}\text{BrainTumour})$ is called the likelihood (the probability that a patient diagnosed with a brain tumour would report headache as a symptom)
• $P(Y)$ is called the prior probability (the probability that an individual may be affected by a brain tumour)
• $P(X) = \sum_{Y} P(X \mid Y)\, P(Y)$ is called the total probability (considering all possible conditions where headache is a symptom)
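A worked numeric version of this calculation; the probabilities below are entirely made up and are only meant to show the mechanics of the formula:

```python
# Hypothetical prior and likelihood values, for illustration only.
prior = {"BrainTumour": 0.001, "Migraine": 0.10, "NoCondition": 0.899}
likelihood_headache = {"BrainTumour": 0.9, "Migraine": 0.8, "NoCondition": 0.05}

# Total probability P(X = headache) = sum_Y P(X | Y) P(Y).
p_headache = sum(likelihood_headache[y] * prior[y] for y in prior)

# Posterior P(Y = BrainTumour | X = headache) via Bayes theorem.
posterior = likelihood_headache["BrainTumour"] * prior["BrainTumour"] / p_headache
print(posterior)   # ~0.007: still small, but several times larger than the 0.001 prior
```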
Machine Learning View
• $Y$ is the target attribute to predict, $X$ is the input
• $P(Y \mid X)$ = posterior probability that the learner predicts $Y$ on observing input $X$
• $P(X \mid Y)$ = likelihood that an instance of target type $Y$ exhibits input features $X$
• $P(Y)$ = prior probability of target type $Y$ in a given dataset
  • Usually obtained through counting (the frequentist view)
• To increase accuracy, maximize the posterior probability
  • The posterior is proportional to the likelihood times the prior, so this amounts to maximizing the likelihood
Revisiting the Normal (Gaussian) Distribution
• The most common distribution
  • E.g. heights of individuals, marks of students, etc.
• Parameters
  • Mean $\mu$
  • Variance $\sigma^2$
• Bell-shaped density function given by
$$p(x) = \frac{1}{(2\pi)^{1/2}\, \sigma}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$
• Central limit theorem
  • The (suitably normalized) sum of $n$ independent random variables approaches a normal distribution as $n$ approaches infinity
  • Think about its relevance to Machine Learning!
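A quick sketch checking the density formula numerically against SciPy's implementation (parameters are arbitrary; assumes SciPy is available):

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 1.5, 2.0
x = np.linspace(-5, 8, 5)

# Density from the formula above.
p = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

# The same values via scipy.stats.norm.
print(np.allclose(p, norm.pdf(x, loc=mu, scale=sigma)))   # True
```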


Probabilistic View of Linear Regression
• Observed true value:
$$y = f(x) + \epsilon$$
• $\epsilon \sim \mathcal{N}(0, \sigma^2)$ is the noise term, sampled from a Gaussian (normal) distribution – why?
  • The $3\sigma$ rule
$$P(\epsilon) = \frac{1}{(2\pi)^{1/2}\, \sigma}\, e^{-\frac{\epsilon^2}{2\sigma^2}}$$

[Plot: observed values $y$ scattered around the fitted line $f(x)$, against $x$]


Probabilistic View of Linear Regression (contd.)
• $P(y_i \mid x_i)$, which drives the prediction, can be assumed to follow any distribution family
• Each assumption yields a variant of regression
• Consistent with the discussion so far,
  • Assumption: $P(y_i \mid x_i)$ follows a Gaussian distribution $\mathcal{N}(f(x_i; w), \sigma^2)$
$$P(y_i \mid x_i) = \frac{1}{(2\pi)^{1/2}\, \sigma}\, e^{-\frac{(y_i - f(x_i; w))^2}{2\sigma^2}}$$
• Considering all data points,
$$P(y_1, y_2, \dots, y_N \mid x_1, x_2, \dots, x_N) = \prod_{i=1}^{N} P(y_i \mid x_i)$$
  • [i.i.d. assumption, i.e. independent and identically distributed]


Maximum Likelihood Estimation
• $\max \prod_{i=1}^{N} P(y_i \mid x_i)$
• Standard idea from calculus: differentiate and equate the derivative to 0
• Product terms are involved, making differentiation a bit challenging
• Can we convert the product into a sum?
  • Yes, using log
• Is using log fine?
  • Yes, because it is a monotonically increasing function
  • Maximizing the log-likelihood is equivalent to maximizing the likelihood itself
• So, for ease of differentiation, we maximize the log-likelihood, having ensured its validity
Maximum Likelihood Estimation (contd.)
$$\max \; \log \prod_{i=1}^{N} P(y_i \mid x_i)$$
$$\max \; \sum_{i=1}^{N} \log P(y_i \mid x_i)$$
$$\max \; \sum_{i=1}^{N} \log \left[ \frac{1}{(2\pi)^{1/2}\, \sigma}\, e^{-\frac{(y_i - f(x_i; w))^2}{2\sigma^2}} \right]$$
$$\max \; \sum_{i=1}^{N} \left[ -\log\!\big((2\pi)^{1/2}\, \sigma\big) - \frac{(y_i - f(x_i; w))^2}{2\sigma^2} \right]$$
• As a minus sign is present inside the max expression, this is equivalent to
$$\min \; \sum_{i=1}^{N} \left[ \log\!\big((2\pi)^{1/2}\, \sigma\big) + \frac{(y_i - f(x_i; w))^2}{2\sigma^2} \right]$$
• The first term corresponds to minimizing the noise; the second term is analogous to minimizing the objective function – the least squares error
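A quick numerical check of this equivalence: minimizing the negative log-likelihood directly (with a fixed $\sigma$) recovers the same weights as the least-squares fit. The data, $\sigma$, and optimizer choice are illustrative, and SciPy is assumed to be available:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=40)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=40)      # y = f(x) + eps
sigma = 0.5

def neg_log_likelihood(w):
    resid = y - (w[0] + w[1] * x)
    return np.sum(np.log(np.sqrt(2 * np.pi) * sigma) + resid ** 2 / (2 * sigma ** 2))

w_mle = minimize(neg_log_likelihood, x0=np.zeros(2)).x   # maximum likelihood estimate

X = np.column_stack([np.ones_like(x), x])
w_ls = np.linalg.solve(X.T @ X, X.T @ y)                 # least-squares solution

print(w_mle, w_ls)    # the two estimates agree up to optimizer tolerance
```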


What if the multivariate data is inherently non-linear?
• $\sin()$ in the univariate polynomial regression example was inherently non-linear, so a best fit with a straight line was impossible
• What if the multivariate regression problem is similarly non-linear?
  • We are unable to visualize it, but a high loss $J(w)$ is observed with the weights obtained from the analytical solution
• In this case the data is transformed using non-linear functions, called kernel (basis) functions, and the regression is formulated as a linear combination of these non-linear functions


Non-linear regression
• Representation: $f(x) = w_0 + \sum_{j=1}^{D} w_j \,\phi_j(x)$
• Transformed data matrix
$$\phi(X) = \begin{bmatrix} 1 & \phi_1(x_1) & \phi_2(x_1) & \dots & \phi_D(x_1) \\ 1 & \phi_1(x_2) & \phi_2(x_2) & \dots & \phi_D(x_2) \\ & & \vdots & & \\ 1 & \phi_1(x_N) & \phi_2(x_N) & \dots & \phi_D(x_N) \end{bmatrix}$$
• Evaluation: $J(w) = \frac{1}{2} \| \phi(X)\, w - y \|_2^2$
• It can be observed that the closed-form analytical solution is:
$$w = \big(\phi(X)^T \phi(X)\big)^{-1} \phi(X)^T y$$
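A sketch using Gaussian bumps as the $\phi_j$; the choice of basis, its centres, and its width are assumptions made only for illustration:

```python
import numpy as np

def gaussian_basis(x, centres, width=0.2):
    """Columns [1, phi_1(x), ..., phi_D(x)] with phi_j(x) = exp(-(x - c_j)^2 / (2 * width^2))."""
    phi = np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2 * width ** 2))
    return np.hstack([np.ones((len(x), 1)), phi])

rng = np.random.default_rng(4)
x = rng.uniform(0, 1, size=30)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=30)   # inherently non-linear target

centres = np.linspace(0, 1, 9)          # D = 9 basis functions
Phi = gaussian_basis(x, centres)        # transformed data matrix phi(X)

# Least-squares solution of phi(X) w ~ y (equivalent to the closed form above,
# but solved in a numerically stable way).
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(np.mean((Phi @ w - y) ** 2))      # small residual error
```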


Multivariate Linear Regression with Multiple Outputs
• Target matrix $Y = \begin{bmatrix} y_{11} & y_{12} & \dots & y_{1K} \\ y_{21} & y_{22} & \dots & y_{2K} \\ & \vdots & \\ y_{N1} & y_{N2} & \dots & y_{NK} \end{bmatrix} \in \mathbb{R}^{N \times K}$
• Weight matrix $W = \begin{bmatrix} w_{10} & w_{20} & \dots & w_{K0} \\ w_{11} & w_{21} & \dots & w_{K1} \\ & \vdots & \\ w_{1D} & w_{2D} & \dots & w_{KD} \end{bmatrix} \in \mathbb{R}^{(D+1) \times K}$
• The analytical solution would be $W = (X^T X)^{-1} X^T Y$
• Equivalent to performing $K$ single-output multivariate regressions
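A sketch confirming this equivalence numerically on synthetic data (shapes and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
N, D, K = 40, 3, 2
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, D))])   # N x (D + 1), with bias column
W_true = rng.normal(size=(D + 1, K))
Y = X @ W_true + 0.05 * rng.normal(size=(N, K))             # N x K target matrix

# Joint analytical solution W = (X^T X)^{-1} X^T Y.
W = np.linalg.solve(X.T @ X, X.T @ Y)

# The same result column by column: K single-output regressions.
W_cols = np.column_stack([np.linalg.solve(X.T @ X, X.T @ Y[:, k]) for k in range(K)])
print(np.allclose(W, W_cols))   # True
```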


References
• The contents of the slides are adapted from the following resources:
  • PRML by Bishop (Ch 1)
  • ESLII by Hastie, Tibshirani & Friedman (Ch 3)
  • Introduction to Machine Learning by Alpaydin (Ch 4, 5)
  • Machine Learning: A Probabilistic Perspective by Murphy (Ch 7)
  • Statistical Data Analysis by Cowan (Ch 2)
  • Keng's Blog
  • Mathematics for Machine Learning (Ch 5)
