Bayesian Linear Regression
Linear Regression
● The goal of learning a linear model from training data is to find the coefficients, β,
that best explain the data.
● Find the optimal β that minimizes the cost function, the residual sum of squares (RSS).
● In linear regression, the estimate β̂ is a single point estimate, as opposed to the
Bayesian approach, where the outcomes are distributions.
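The point-estimate view can be sketched in a few lines of NumPy (toy data and true coefficients are made up for illustration): minimizing the RSS gives the single closed-form estimate β̂ = (XᵀX)⁻¹Xᵀy.

```python
import numpy as np

# Toy data: y = 1 + 2x plus noise (illustrative values, not from the slides).
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.uniform(0, 10, 50)])  # intercept + feature
y = X @ np.array([1.0, 2.0]) + rng.normal(0, 0.5, 50)

# OLS point estimate: beta_hat = (X^T X)^{-1} X^T y minimizes the RSS.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # one fixed vector -- a point estimate, not a distribution
```

Note that `beta_hat` carries no uncertainty information by itself; quantifying that uncertainty is exactly what the Bayesian approach below adds.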
Bayes Theorem
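The theorem this slide presumably displayed, in its general form and as applied to the regression weights β given data (X, y):

```latex
P(A \mid B) \;=\; \frac{P(B \mid A)\,P(A)}{P(B)}
\qquad\Longrightarrow\qquad
P(\beta \mid y, X) \;=\; \frac{P(y \mid \beta, X)\,P(\beta)}{P(y \mid X)}
\;\propto\; \underbrace{P(y \mid \beta, X)}_{\text{likelihood}}
\;\underbrace{P(\beta)}_{\text{prior}}
```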
Bayesian Regression
The response, y, is not estimated as a single value, but is assumed to be drawn from a probability
distribution.
The response is generated from a probability distribution, and the model parameters are
assumed to come from a distribution as well.
The posterior probability of the model parameters is conditional upon the training inputs and
outputs:
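That posterior, P(β | y, X) ∝ P(y | β, X)·P(β), has a closed form in the conjugate Gaussian case. A minimal sketch, assuming a known noise variance `sigma2` and an isotropic normal prior N(0, tau2·I) (both values are illustrative choices, not from the slides):

```python
import numpy as np

# Made-up toy data: y = 1 + 2x plus noise.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(30), rng.uniform(0, 10, 30)])
y = X @ np.array([1.0, 2.0]) + rng.normal(0, 0.5, 30)

sigma2, tau2 = 0.25, 10.0  # assumed noise variance and prior variance

# Standard conjugate-update formulas for Gaussian likelihood + Gaussian prior:
#   S = (X^T X / sigma2 + I / tau2)^{-1}    (posterior covariance)
#   m = S X^T y / sigma2                    (posterior mean)
S = np.linalg.inv(X.T @ X / sigma2 + np.eye(2) / tau2)
m = S @ X.T @ y / sigma2

print("posterior mean:", m)                    # a full distribution over beta,
print("posterior std: ", np.sqrt(np.diag(S)))  # not a single point estimate
```

The output is a mean vector *and* a covariance, so every coefficient comes with its own uncertainty.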
Benefits of Bayesian Linear Regression
1. Priors: If we have domain knowledge, or a guess for what the model parameters should be,
we can include it in our model, unlike in the frequentist linear regression approach, which
assumes everything there is to know about the parameters comes from the data. If we don’t
have any estimates ahead of time, we can use non-informative priors for the parameters,
such as a wide normal distribution.
2. Posterior: The result of performing Bayesian Linear Regression is a distribution of possible
model parameters based on the data and the prior. This allows us to quantify our uncertainty
about the model: if we have fewer data points, the posterior distribution will be more spread
out.
3. Bayesian Regression can be very useful when we have too few data points or the data
is poorly distributed.
● As the number of data points increases, the likelihood washes out the prior, and in the case
of infinite data, the outputs for the parameters converge to the values obtained from OLS.
● The formulation of model parameters as distributions encapsulates the Bayesian worldview:
we start out with an initial estimate, our prior, and as we gather more evidence, our model
becomes less wrong.
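The convergence-to-OLS claim above can be checked numerically. A sketch assuming the conjugate Gaussian model (known noise variance `sigma2`, normal prior with variance `tau2`; all values are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2, tau2 = 0.25, 10.0  # assumed noise variance and prior variance

for n in (10, 100, 10_000):
    # Fresh toy dataset of size n: y = 1 + 2x plus noise.
    X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])
    y = X @ np.array([1.0, 2.0]) + rng.normal(0, 0.5, n)

    # OLS point estimate vs. Bayesian posterior mean.
    ols = np.linalg.solve(X.T @ X, X.T @ y)
    S = np.linalg.inv(X.T @ X / sigma2 + np.eye(2) / tau2)
    m = S @ X.T @ y / sigma2

    # The gap shrinks as n grows: the likelihood dominates the prior.
    print(n, np.abs(m - ols).max())
```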
Real World Example
With more data, the posterior distributions shrink as their variance reduces.
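This shrinking can be illustrated directly: under the conjugate Gaussian model (known noise variance `sigma2`, normal prior with variance `tau2`; data and values are made up), the posterior standard deviation of each coefficient drops as the sample size grows.

```python
import numpy as np

rng = np.random.default_rng(3)
sigma2, tau2 = 0.25, 10.0  # assumed noise variance and prior variance

stds = []
for n in (5, 50, 500):
    # Toy dataset of size n: y = 1 + 2x plus noise.
    X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])
    y = X @ np.array([1.0, 2.0]) + rng.normal(0, 0.5, n)

    # Posterior covariance; its diagonal gives per-coefficient variances.
    S = np.linalg.inv(X.T @ X / sigma2 + np.eye(2) / tau2)
    stds.append(np.sqrt(np.diag(S)).max())
    print(n, stds[-1])  # the largest posterior std keeps shrinking
```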
Thank You