Siwei Xu
Aug 27 · 6 min read
There has always been a debate between Bayesian and frequentist statistical inference. Frequentists dominated statistical practice during the 20th century, and many common machine learning algorithms, such as linear regression and logistic regression, use frequentist methods to perform statistical inference. Bayesians dominated statistical practice before the 20th century, and in recent years algorithms from the Bayesian school, such as Expectation-Maximization, Bayesian Neural Networks, and Markov Chain Monte Carlo, have gained popularity in machine learning.
In this article, we will talk about their differences and connections in the context of machine learning. We
will also use two algorithms for illustration: linear regression and Bayesian linear regression.
Assumptions
For simplicity, we will use θ to denote the model parameter(s) throughout this article.
Frequentist methods assume the observed data is sampled from some distribution. We call this data distribution the likelihood: P(Data|θ), where θ is treated as a constant and the goal is to find the θ that would maximize the likelihood. For example, in logistic regression the data is assumed to be sampled from a Bernoulli distribution, and in linear regression the data is assumed to be sampled from a Gaussian distribution.
Bayesian methods assume probability distributions for both the data and the hypotheses (the parameters specifying the distribution of the data). In the Bayesian view, θ is a random variable, and the assumptions include a prior distribution over the hypotheses, P(θ), and a likelihood of the data, P(Data|θ). The main critique of Bayesian inference is the subjectivity of the prior, as different priors may lead to different posteriors and conclusions.
Parameter Learning
Frequentists use maximum likelihood estimation (MLE) to obtain a point estimate of the parameters θ.
The log-likelihood is expressed as:
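Assuming the observed data points x₁, …, x_n are independent, the log-likelihood decomposes into a sum over the individual points:

log P(Data | θ) = log ∏ᵢ P(xᵢ | θ) = Σᵢ log P(xᵢ | θ)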
The parameters θ are estimated by maximizing the log-likelihood, or equivalently minimizing the negative log-likelihood (the loss function):
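In symbols:

θ̂_MLE = argmax_θ Σᵢ log P(xᵢ | θ) = argmin_θ ( − Σᵢ log P(xᵢ | θ) )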
Instead of a point estimate, Bayesians estimate a full posterior distribution of the parameters using Bayes’ formula:
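P(θ | Data) = P(Data | θ) · P(θ) / P(Data),   where   P(Data) = ∫ P(Data | θ) P(θ) dθ

Here P(θ) is the prior, P(Data | θ) is the likelihood, and P(θ | Data) is the posterior.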
You might have noticed that computing the denominator can be intractable, because it involves an integral (or a summation, when θ is discrete) over all possible values of θ. You might also wonder whether we can obtain a point estimate of θ, just as MLE does. That’s where Maximum A Posteriori (MAP) estimation comes into play. MAP bypasses the cumbersome computation of the full posterior distribution and instead finds the point estimate of θ that maximizes the posterior distribution.
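Because the denominator P(Data) does not depend on θ, it can be dropped from the maximization:

θ̂_MAP = argmax_θ P(θ | Data) = argmax_θ P(Data | θ) · P(θ)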
Since the logarithm is monotonic, we can rewrite the above equation in log space and decompose it into two parts: maximizing the log-likelihood and maximizing the log-prior:
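θ̂_MAP = argmax_θ [ log P(Data | θ) + log P(θ) ]

The first term is the same log-likelihood that MLE maximizes; the second term, coming from the prior, acts as a regularizer.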
In linear regression, for example, using MLE to maximize the log-likelihood, we can get the point estimate of θ shown below:
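Under the Gaussian-noise assumption mentioned earlier, maximizing the log-likelihood is equivalent to minimizing the squared error, which yields the familiar least-squares solution (here X denotes the design matrix and y the vector of targets, notation introduced just for this example):

θ̂_MLE = argmin_θ ‖y − Xθ‖² = (XᵀX)⁻¹ Xᵀ y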
Once we’ve learned the parameters θ from the training data, we can use them directly to make predictions on new data:
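For linear regression this is simply the plug-in prediction for a new input x_new:

ŷ = θ̂_MLEᵀ x_new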
For Bayesian linear regression, we additionally place a prior distribution on θ (commonly a Gaussian). Using these assumptions and Bayes’ formula, we can get the posterior distribution of the parameters:
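As before, the posterior is proportional to the likelihood times the prior. If, for example, we take a Gaussian prior θ ~ N(0, τ²I) and Gaussian noise with variance σ² (illustrative choices, not the only possible ones), the posterior is itself Gaussian with closed-form mean and covariance:

P(θ | X, y) ∝ P(y | X, θ) · P(θ)
θ | X, y ~ N(μ, Σ),   with   Σ = (XᵀX/σ² + I/τ²)⁻¹   and   μ = Σ Xᵀ y / σ²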
At prediction time, we use the posterior distribution and the likelihood to calculate the posterior predictive
distribution:
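P(y* | x*, Data) = ∫ P(y* | x*, θ) · P(θ | Data) dθ

In words, the prediction averages the likelihood over all plausible values of θ, weighted by their posterior probability.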
Notice that the estimates of both the parameters and the predictions are full distributions. Of course, if we only need a point estimate, we can always use MAP or EAP (the expected a posteriori estimate, i.e. the posterior mean).
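To make the contrast concrete, here is a minimal NumPy sketch that computes the frequentist point estimate and the Bayesian closed-form posterior on the same simulated data (the noise scale σ, the prior scale τ, and the simulated data are illustrative assumptions, not values from this article):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: y = 2.0 * x + Gaussian noise (illustrative values).
n = 50
X = rng.normal(size=(n, 1))
true_theta = np.array([2.0])
sigma = 0.5                      # assumed known noise std (illustrative)
y = X @ true_theta + sigma * rng.normal(size=n)

# Frequentist: MLE under Gaussian noise = ordinary least squares.
theta_mle = np.linalg.solve(X.T @ X, X.T @ y)

# Bayesian: Gaussian prior theta ~ N(0, tau^2 I) gives a Gaussian posterior.
tau = 1.0                        # assumed prior std (illustrative)
post_cov = np.linalg.inv(X.T @ X / sigma**2 + np.eye(1) / tau**2)
post_mean = post_cov @ (X.T @ y) / sigma**2

# Posterior predictive at a new input x*: Gaussian with mean x*ᵀμ and
# variance x*ᵀ Σ x* + σ².
x_new = np.array([1.5])
pred_mean = x_new @ post_mean
pred_var = x_new @ post_cov @ x_new + sigma**2

print("MLE point estimate:   ", theta_mle)
print("Posterior mean / std: ", post_mean, np.sqrt(np.diag(post_cov)))
print("Predictive mean / std:", pred_mean, np.sqrt(pred_var))
```

The frequentist branch returns a single vector; the Bayesian branch returns a full distribution, whose mean plays the role of a point estimate and whose covariance quantifies the remaining uncertainty.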
Conclusions
The main goal of machine learning is to make predictions using the parameters learned from training data.
Whether we should achieve the goal using a frequentist or a Bayesian approach depends on:
1. The type of predictions we want: a point estimate or a distribution over potential values.
2. Whether we have prior knowledge that can be incorporated into the modeling process.
On a side note, we discussed discriminative and generative models earlier. A common misconception is to
label discriminative models as frequentist and generative models as Bayesian. In fact, both frequentist and
Bayesian approaches can be used for discriminative or generative models. You can refer to this post for
more clarification.
I hope you enjoyed reading this article. :)