
Frequentist vs. Bayesian Approaches in Machine Learning


A Comparison of Linear Regression and Bayesian Linear Regression

Siwei Xu
Aug 27 · 6 min read
There has always been a debate between Bayesian and frequentist statistical inference. Frequentists dominated statistical practice during the 20th century, and many common machine learning algorithms, such as linear regression and logistic regression, use frequentist methods to perform statistical inference. Bayesians dominated statistical practice before the 20th century, and in recent years many algorithms from the Bayesian school, such as Expectation-Maximization, Bayesian neural networks, and Markov Chain Monte Carlo, have gained popularity in machine learning.
In this article, we will talk about their differences and connections in the context of machine learning. We
will also use two algorithms for illustration: linear regression and Bayesian linear regression.
Assumptions
For simplicity, we will use θ to denote the model parameter(s) throughout this article.
Frequentist methods assume the observed data is sampled from some distribution. We call this data distribution the likelihood: P(Data|θ), where θ is treated as constant and the goal is to find the θ that maximizes the likelihood. For example, in logistic regression the data is assumed to be sampled from a Bernoulli distribution, and in linear regression the data is assumed to be sampled from a Gaussian distribution.
Bayesian methods assign probabilities to both the data and the hypotheses (the parameters specifying the distribution of the data). In the Bayesian view, θ is a random variable, and the assumptions include a prior distribution over the hypotheses, P(θ), and a likelihood of the data, P(Data|θ). The main critique of Bayesian inference is the subjectivity of the prior, as different priors may lead to different posteriors and conclusions.
Parameter Learning
Frequentists use maximum likelihood estimation (MLE) to obtain a point estimate of the parameters θ. The log-likelihood is expressed as:
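ℓ(θ) = log P(Data|θ) = Σᵢ log P(xᵢ|θ)   (in standard form, for n i.i.d. observations x₁, …, xₙ)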

The parameters θ are estimated by maximizing the log-likelihood, or equivalently minimizing the negative log-likelihood (the loss function):
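θ̂_MLE = argmax_θ ℓ(θ) = argmin_θ [−Σᵢ log P(xᵢ|θ)]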

Instead of a point estimate, Bayesians estimate a full posterior distribution of the parameters using Bayes' formula:
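P(θ|Data) = P(Data|θ) P(θ) / P(Data),   where P(Data) = ∫ P(Data|θ) P(θ) dθ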
You might have noticed that computing the denominator, P(Data), can be intractable because it involves an integral (or a summation, when θ is discrete) over all possible values of θ. You might also wonder whether we can have a point estimate of θ, just like what MLE gives us. That's where Maximum A Posteriori (MAP) estimation comes into play. MAP bypasses the cumbersome computation of the full posterior distribution and instead finds the point estimate of θ that maximizes the posterior:
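θ̂_MAP = argmax_θ P(θ|Data) = argmax_θ P(Data|θ) P(θ)   (the evidence P(Data) does not depend on θ and can be dropped)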

Since logarithmic functions are monotonic, we can rewrite the above equation in log space and decompose it into two parts: maximizing the log-likelihood and maximizing the log prior:
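θ̂_MAP = argmax_θ [log P(Data|θ) + log P(θ)]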

Doesn’t this look similar to MLE?


In fact, the connection between the two is that MAP can be treated as performing MLE on a regularized loss function, where the prior corresponds to the regularization term. For example, if we assume a zero-mean Gaussian prior, MAP is equivalent to MLE with L2 regularization; if we assume a zero-centered Laplace prior, MAP is equivalent to MLE with L1 regularization.
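To see the connection concretely, take a zero-mean Gaussian prior θ ~ N(0, σ_p² I) (an illustrative choice of notation, not taken from the original figures). Then log P(θ) = −‖θ‖² / (2σ_p²) + constant, so

θ̂_MAP = argmax_θ [log P(Data|θ) − ‖θ‖² / (2σ_p²)] = argmin_θ [−log P(Data|θ) + λ‖θ‖²],   with λ = 1 / (2σ_p²),

which is exactly the MLE objective plus an L2 penalty. A zero-centered Laplace prior contributes λ‖θ‖₁ instead, giving L1 regularization.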
There is another method to get a point estimate from the posterior distribution: Expected A Posteriori (EAP) estimation. The difference between MAP and EAP is that MAP takes the mode (maximum) of the posterior distribution, whereas EAP takes the expected value of the posterior distribution.
Uncertainty
The main difference between frequentist and Bayesian approaches is the way they measure uncertainty in
parameter estimation.
As we mentioned earlier, frequentists use MLE to get point estimates of unknown parameters, and they don't assign probabilities to possible parameter values. Therefore, to measure uncertainty, frequentists rely on null hypothesis significance testing and confidence intervals. However, it's important to point out that confidence intervals don't directly translate into probabilities of hypotheses. For example, a 95% confidence level only means that 95% of the confidence intervals constructed from repeated samples would cover the true parameter value; it is incorrect to say that a particular interval covers the true value with probability 95%.
Bayesians, on the other hand, have a full posterior distribution over the possible parameter values, and this allows them to quantify the uncertainty of an estimate by integrating over the full posterior distribution.
Computation
Bayesian methods are usually more computationally intensive than frequentist methods because of the integration over many parameters. There are approaches to reduce the computational cost, such as using conjugate priors or approximating the posterior distribution with sampling methods or variational inference.
Examples
In this section, we will see how to train and make predictions with two algorithms: linear regression and
Bayesian linear regression.
Linear Regression (frequentist)
We assume the following form of the linear regression model, where the intercept is incorporated into the parameter θ:
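y = θᵀx + ε,   where x includes a constant feature 1 (so the intercept is absorbed into θ) and ε is zero-mean Gaussian noise with variance σ²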

The data is assumed to be distributed according to a Gaussian distribution:
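P(y|x, θ) = N(y; θᵀx, σ²)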

Using MLE to maximize the log likelihood, we can get the point estimate of θ as shown below:
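θ̂_MLE = argmax_θ Σᵢ log N(yᵢ; θᵀxᵢ, σ²) = (XᵀX)⁻¹Xᵀy,   where X is the design matrix with rows xᵢᵀ and y is the vector of targets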

Once we've learned the parameters θ from the training data, we can directly use them to make predictions on new data:
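ŷ = θ̂ᵀ x_new

Below is a minimal NumPy sketch of this closed-form fit and prediction. The data, variable names, and values (X, y, x_new, the noise scale) are illustrative assumptions, not taken from the original article.

import numpy as np

# Synthetic data: 100 samples, 2 features, plus a constant column so the
# intercept is absorbed into theta.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
X = np.hstack([X, np.ones((100, 1))])
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta + rng.normal(scale=0.1, size=100)  # Gaussian noise

# MLE point estimate: theta_hat = (X^T X)^{-1} X^T y
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Point prediction for a new input (last entry is the intercept feature)
x_new = np.array([1.0, 2.0, 1.0])
y_pred = theta_hat @ x_new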

Bayesian Linear Regression (bayesian)


As mentioned earlier, the Bayesian way is to make assumptions for both the prior and likelihood:
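Prior: θ ~ N(0, τ² I);   Likelihood: P(y|x, θ) = N(y; θᵀx, σ²)   (a standard conjugate Gaussian choice, used here for illustration)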

Using these assumptions and Bayes' formula, we can get the posterior distribution:
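P(θ|X, y) = N(θ; μ_N, Σ_N),   with Σ_N = (XᵀX/σ² + I/τ²)⁻¹ and μ_N = Σ_N Xᵀy / σ²   (the closed form that follows from the Gaussian prior above)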
At prediction time, we use the posterior distribution and the likelihood to calculate the posterior predictive
distribution:
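P(y*|x*, X, y) = ∫ P(y*|x*, θ) P(θ|X, y) dθ = N(y*; μ_Nᵀ x*, x*ᵀ Σ_N x* + σ²)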

Notice that the estimates of both the parameters and the predictions are full distributions. Of course, if we only need a point estimate, we can always use MAP or EAP.
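Continuing the earlier NumPy sketch (reusing X, y, and x_new, and assuming illustrative values for the noise scale sigma and prior scale tau), the conjugate update above can be written as:

sigma, tau = 0.1, 1.0  # illustrative noise and prior scales, not from the article

# Posterior over theta: N(mu_N, Sigma_N)
Sigma_N = np.linalg.inv(X.T @ X / sigma**2 + np.eye(X.shape[1]) / tau**2)
mu_N = Sigma_N @ X.T @ y / sigma**2

# Posterior predictive for the new input: N(pred_mean, pred_var)
pred_mean = mu_N @ x_new
pred_var = x_new @ Sigma_N @ x_new + sigma**2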
Conclusions
The main goal of machine learning is to make predictions using the parameters learned from training data. Whether we should achieve that goal with a frequentist or a Bayesian approach depends on:
1. The type of predictions we want: a point estimate or a probability distribution over potential values.
2. Whether we have prior knowledge that can be incorporated into the modeling process.
On a side note, we discussed discriminative and generative models earlier. A common misconception is to
label discriminative models as frequentist and generative models as Bayesian. In fact, both frequentist and
Bayesian approaches can be used for discriminative or generative models. You can refer to this post for
more clarification.
I hope you enjoyed reading this article. :)
Let’s chat about AI. https://www.linkedin.com/in/vivienne-siwei-xu/
