Bayesian Statistics Explained to Beginners in Simple English

Overview
● There are various methods to test the significance of the model, like the p-value, confidence interval, etc.
Introduction
Bayesian statistics continues to remain incomprehensible to many analysts. Being amazed by the incredible power of machine learning, a lot of us have become unfaithful to statistics, forgetting that machine learning is not the only way to solve real-world problems. In several situations, it does not help us solve business problems, even though there is data involved in these problems. To say the least, knowledge of statistics will allow you to work on complex analytical problems, irrespective of the size of data.
Bayes' Theorem, named after Thomas Bayes, was published posthumously in 1763, and even centuries later the importance of Bayesian statistics hasn't faded away. In fact, today the topic is taught in great depth in some of the world's leading universities.

With this idea, I've created this beginner's guide on Bayesian statistics. I've tried to explain the concepts in a simplistic manner, with examples. Prior knowledge of basic probability and statistics is desirable. You should check out this course to get a comprehensive low-down on statistics and probability.

By the end of this article, you will have a concrete understanding of Bayesian statistics and the concepts associated with it.
1. Frequentist Statistics
2. The Inherent Flaws in Frequentist Statistics
3. Bayesian Statistics
○ Conditional Probability
○ Bayes Theorem
4. Bayesian Inference
○ Bernoulli Likelihood Function
○ Prior Belief Distribution
○ Posterior Belief Distribution
5. Test for Significance – Frequentist vs Bayesian
○ p-value
○ Confidence Intervals
○ Bayes Factor
1. Frequentist Statistics
The debate between frequentists and Bayesians has haunted beginners for centuries. Therefore, it is important to understand the difference between the two.

Frequentist statistics is the most widely used inferential technique in the statistical world. In fact, it is generally the first school of thought that a person entering the statistics world comes across. It tests whether an event (hypothesis) occurs or not, and it calculates the probability of an event in the long run of the experiment (i.e. the experiment is repeated under the same conditions to obtain the outcome).
Here, sampling distributions of fixed size are taken. Then, the experiment is theoretically repeated an infinite number of times, but practically done with a stopping intention. For example, I may perform an experiment with a stopping intention in mind that I will stop the experiment when it has been repeated 1000 times, or when I see a minimum of 300 heads in a coin toss.
Now, we'll understand frequentist statistics using an example of a coin toss. The objective is to estimate the fairness of the coin. Below is a table representing the frequency of heads:
We know that the probability of getting a head on tossing a fair coin is 0.5. "No. of heads" represents the actual number of heads obtained, and "Difference" is 0.5 × (No. of tosses) − (No. of heads).

An important thing to note is that, though the difference between the actual and the expected number of heads (50% of the number of tosses) grows as the number of tosses increases, the proportion of heads to total tosses approaches 0.5 (for a fair coin).
To know more about frequentist statistical methods, you can head to this excellent course on inferential statistics.
Till now, we've seen just one flaw in frequentist statistics. Well, it's just the beginning.
2. The Inherent Flaws in Frequentist Statistics

The 20th century saw a massive upsurge in frequentist statistics being applied to numerical models: to check whether one sample is different from another, whether a parameter is important enough to be kept in a model, and other manifestations of hypothesis testing. But frequentist statistics suffers from serious flaws in its design and interpretation, which pose a concern in real-life problems. For example:
1. p-values, measured against a sample statistic of some fixed size with some stopping intention, change with changes in intention and sample size. That is, if two people work on the same data but have different stopping intentions, they may get two different p-values for the same data (a quick simulation after this list illustrates the point).
For example: Person A may choose to stop tossing a coin when the total count reaches 100, while B stops at 1000. For different sample sizes, we get different t-scores and different p-values. Similarly, the intention to stop may change from a fixed number of flips to a total duration of flipping. In this case too, we are bound to get different p-values.
2. Confidence intervals (C.I.), like p-values, depend heavily on the sample size. This makes the stopping intention absolutely absurd, since no matter how many people perform tests on the same data, the results should be consistent.
3. Confidence intervals are not probability distributions; therefore they provide neither the most probable value of a parameter nor the relative probabilities of other values.
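To make the first two points concrete, here is a minimal sketch in R (my own illustration, not from the original text). It simulates a hypothetically biased coin (the 0.55 bias and the random seed are arbitrary assumptions) and tests the fair-coin hypothesis at two different stopping points:

> set.seed(42)                                               # arbitrary seed, for reproducibility
> flips <- rbinom(1000, size = 1, prob = 0.55)               # hypothetical, slightly biased coin
> binom.test(sum(flips[1:100]), n = 100, p = 0.5)$p.value    # person A stops at 100 flips
> binom.test(sum(flips), n = 1000, p = 0.5)$p.value          # person B stops at 1000 flips
> binom.test(sum(flips), n = 1000, p = 0.5)$conf.int         # the C.I. also narrows with sample size

The two p-values generally differ, even though both people watched the same coin.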
These three reasons are enough to get you thinking about the drawbacks of the frequentist approach, and why there is a need for the Bayesian approach. Let's find out.
3. Bayesian Statistics
Bayesian statistics is a mathematical procedure that applies probabilities to statistical problems. It provides people the tools to update their beliefs in the light of new data.
Suppose, out of all the 4 championship (F1) races between Niki Lauda and James Hunt, Niki won 3 times while James managed only 1. So, if you were to bet on the winner of the next race, who would it be? I bet you would say Niki Lauda.

Here's the twist: what if you are told that it rained once when James won and once when Niki won, and it is certain that it will rain on the next race day? So, who would you bet your money on now?

By intuition, it is easy to see that the chances of James winning have increased drastically. But the question is: by how much?
Pre-Requisites:

1. Linear Algebra: To refresh your basics, you can check out Khan Academy's Algebra course.
2. Probability and Basic Statistics: To refresh your basics, you can check out another course by Khan Academy.
Conditional Probability

Suppose Set A represents one set of events and Set B represents another. We wish to calculate the probability of A given that B has already happened. Let's represent the happening of event B by shading it with red. Now, since B has happened, the part which matters for A is the part shaded in blue, which is, interestingly, A ∩ B. So, the probability of A given B turns out to be:

P(A|B) = P(A ∩ B) / P(B)
Therefore, we can write the formula for event B given A has already occurred:

P(B|A) = P(A ∩ B) / P(A)

or

P(A ∩ B) = P(B|A) × P(A)

Now, substituting the second equation into the first, the conditional probability can be rewritten as:

P(A|B) = P(B|A) × P(A) / P(B)

Therefore, coming back to our F1 example, let A be the event of rain and B be the event of James winning. Then:

1. P(A) is 1/2, since it rained on two of the four race days.
2. P(B) is 1/4, since James won only one race out of four.
3. P(A|B) is 1, since it rained every time James won.

Substituting these values into the formula:

P(B|A) = P(A|B) × P(B) / P(A) = (1 × 1/4) / (1/2) = 1/2

We get the probability to be around 50%, which is almost double the 25% obtained when rain was not taken into account. You can see how the probability changed after taking into account the new evidence, i.e. rain. You must be wondering why this formula bears close resemblance to something you might have heard a lot about. Think! Probably you guessed it right: it looks like Bayes Theorem.
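As a quick numeric check, the same arithmetic in R (the variable names are mine):

> p_rain <- 1/2              # P(A): it rained on 2 of the 4 race days
> p_james <- 1/4             # P(B): James won 1 of the 4 races
> p_rain_given_james <- 1    # P(A|B): it rained every time James won
> p_rain_given_james * p_james / p_rain   # P(B|A) = 0.5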
Bayes Theorem

Bayes Theorem is built on top of conditional probability and lies at the heart of Bayesian inference. It comes into effect when multiple events Ai form an exhaustive set with another event B. This could be understood with the help of the below diagram.

In this case, B can be written as:

P(B) = P(B ∩ A1) + P(B ∩ A2) + … + P(B ∩ An)

But P(B ∩ Ai) = P(B|Ai) × P(Ai), so:

P(B) = Σi P(B|Ai) × P(Ai)

Substituting this into the conditional probability formula gives the general form of Bayes Theorem:

P(Ai|B) = P(B|Ai) × P(Ai) / Σj P(B|Aj) × P(Aj)
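A small numeric sketch of this general form in R, with made-up probabilities for three exhaustive events A1, A2, A3:

> pA <- c(0.3, 0.5, 0.2)            # P(Ai); the Ai are exhaustive, so these sum to 1
> pB_given_A <- c(0.9, 0.4, 0.1)    # P(B|Ai), assumed values
> pB <- sum(pB_given_A * pA)        # total probability of B
> pB_given_A * pA / pB              # posterior P(Ai|B) for each i; sums to 1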
4. Bayesian Inference
There is no point in diving straight into the theoretical aspect of it. So, we'll learn how it works! Let's take an example of coin tossing to understand the idea behind Bayesian inference.

An important part of Bayesian inference is the establishment of parameters and models.
Models are the mathematical formulation of the observed events. Parameters are the factors in the models affecting the observed data. For example, in tossing a coin, the fairness of the coin may be defined as a parameter denoted by θ, and the outcome of the tosses may be denoted by D.

Answer this now. What is the probability of 4 heads out of 9 tosses (D), given the fairness of the coin (θ), i.e. P(D|θ)? Wait, that is not quite the question we care about: we actually want to know the fairness of the coin given the observed data, P(θ|D). Bayes theorem relates the two:
P(θ|D) = P(D|θ) × P(θ) / P(D)
Here, P(θ) is the prior, i.e. the strength of our belief in the fairness of the coin before the toss. It is perfectly okay to believe that the coin can have any degree of fairness between 0 and 1.
P(D|θ) is the likelihood of observing our result given our distribution for θ. If we knew that the coin was fair, this would give the probability of observing a certain number of heads in a particular number of flips.
P(D) is the evidence. This is the probability of the data as determined by summing (or integrating) across all possible values of θ, weighted by how strongly we believe in those particular values of θ. If we had multiple views of what the fairness of the coin is (but didn't know for sure), then this tells us the probability of seeing a certain sequence of flips averaged over all possible degrees of fairness.

P(θ|D) is the posterior belief about our parameters after observing the evidence, i.e. the number of heads.
From here, we'll dive deeper into the mathematical implications of this concept. Don't worry: once you understand the concepts, the mathematics is pretty easy.
To define our model correctly, we need two mathematical models beforehand: one to represent the likelihood function P(D|θ), and the other to represent the distribution of prior beliefs. The product of these two gives the posterior belief P(θ|D) distribution.
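Before formalizing those two models, here is a minimal grid sketch in R (my own illustration) of exactly this product, using the article's 4-heads-in-9-tosses data and an assumed Beta(2,2) prior:

> theta <- seq(0, 1, by = 0.01)              # grid of candidate fairness values
> prior <- dbeta(theta, 2, 2)                # assumed prior belief about θ
> lik <- dbinom(4, size = 9, prob = theta)   # likelihood of 4 heads in 9 tosses
> post <- lik * prior                        # unnormalized posterior
> post <- post / sum(post * 0.01)            # normalize so the density integrates to ~1
> plot(theta, post, type = "l")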
Since the prior and posterior are both beliefs about the distribution of the fairness of the coin, intuition tells us that both should have the same mathematical form. Keep this in mind; we will come back to it. So, there are several functions which support the existence of Bayes theorem. Knowing them is important, hence I have explained them in detail below.

Bernoulli Likelihood Function
Let's recap what we learned about the likelihood function: it is the probability of observing a particular number of heads in a particular number of flips for a given fairness of the coin. This means our probability of observing heads or tails depends upon the fairness of the coin (θ).

P(y=1|θ) = θ [if the coin is fair, θ = 0.5, and the probability of observing heads (y=1) is 0.5]
P(y=0|θ) = 1 − θ [if the coin is fair, θ = 0.5, and the probability of observing tails (y=0) is 0.5]

It is worth noticing that representing 1 as heads and 0 as tails is just a mathematical notation. We can combine the two definitions above into a single expression covering both the outcomes:

P(y|θ) = θ^y × (1−θ)^(1−y)
This is called the Bernoulli likelihood function, and the task of coin flipping is called a Bernoulli trial, with y = {0,1} and θ ∈ (0,1).
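This one-line formula can be written directly as a small R helper (my own, not from the article):

> bern_lik <- function(y, theta) theta^y * (1 - theta)^(1 - y)
> bern_lik(1, 0.5)   # P(y=1 | θ=0.5) = 0.5
> bern_lik(0, 0.7)   # P(y=0 | θ=0.7) = 0.3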
And when we want to see a series of flips, the probability of z heads in N flips is given by:

P(z,N|θ) = θ^z × (1−θ)^(N−z)

Prior Belief Distribution

This distribution is used to represent our strength of belief about the parameter, based on previous experience. But what if one has no previous experience? Don't worry. Mathematicians have devised methods to mitigate this problem too; these are known as uninformative priors (a bit of a misnomer, since every prior carries some information, even a constant one).
Well, the mathematical function used to represent the prior beliefs is known as the beta distribution. It has some very nice mathematical properties which enable us to model our beliefs about a binomial distribution. The probability density function of the beta distribution is of the form:

P(θ|α,β) = θ^(α−1) × (1−θ)^(β−1) / B(α,β)

where our focus stays on the numerator; the denominator B(α,β) is there just to ensure that the density integrates to 1. α and β are the shape parameters: α is analogous to the number of heads in the trials, and β corresponds to the number of tails. The diagrams below will help you visualize beta distributions for different values of α and β.
You too can draw the beta distribution for yourself using the following code in R:

> library(stats)
> par(mfrow=c(3,2))
> x=seq(0,1,by=0.1)
> alpha=c(0,2,10,20,50,500)
> beta=c(0,2,8,11,27,232)
> for(i in 1:length(alpha)){
+   y<-dbeta(x,shape1=alpha[i],shape2=beta[i])
+   plot(x,y,type="l")
+ }
Note: α and β are intuitive to understand, since they can be calculated by knowing the mean (μ) and standard deviation (σ) of the distribution. In fact, they are related as:

α = μ × (μ(1−μ)/σ² − 1)
β = (1−μ) × (μ(1−μ)/σ² − 1)

If the mean and standard deviation of a distribution are known, their shape parameters can be easily calculated.
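These relations are easy to wrap in a small helper function (mine, for illustration); note that it reproduces the α = 13.8 and β = 9.2 used in the worked example below:

> beta_shapes <- function(mu, sigma) {
+   k <- mu * (1 - mu) / sigma^2 - 1
+   c(alpha = mu * k, beta = (1 - mu) * k)
+ }
> beta_shapes(0.6, 0.1)   # alpha = 13.8, beta = 9.2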
1. When there was no toss, we believed that every fairness of the coin was possible, as depicted by the flat line.
2. When there were more heads than tails, the graph showed a peak shifted towards the right side, indicating a higher probability of heads and that the coin is not fair.
3. As more tosses are done, and heads continue to come in a larger proportion, the peak narrows, increasing our confidence in the estimated fairness value of the coin.
The reason we chose a beta distribution as the prior belief is that, when we multiply it with the likelihood function, the posterior distribution yields a form similar to the prior distribution, which is much easier to relate to and understand. If this much information whets your appetite, I'm sure you are ready to walk an extra mile.

Posterior Belief Distribution

Combining the Bernoulli likelihood with the beta prior via Bayes theorem gives:

P(θ|z,N) ∝ θ^z × (1−θ)^(N−z) × θ^(α−1) × (1−θ)^(β−1) = θ^(z+α−1) × (1−θ)^(N−z+β−1)

which is again a beta distribution, Beta(z+α, N−z+β).
This is interesting. Just by knowing the mean and standard deviation of our belief about the parameter θ, and by observing the number of heads in N flips, we can update our belief about the model parameter.
Suppose you think that a coin is biased. It has a mean (μ) bias of around 0.6, with a standard deviation (σ) of 0.1.
Then, by the relations above:
α = 13.8, β = 9.2
i.e. our distribution will be biased towards the right side. Suppose you observed 80 heads (z=80) in 100 flips (N=100). Let's see how our prior and posterior beliefs are going to look:

Prior = P(θ|α,β) = P(θ|13.8,9.2)
Posterior = P(θ|z+α, N−z+β) = P(θ|93.8,29.2)
> library(stats)
> par(mfrow=c(1,2))
> x=seq(0,1,by=0.1)
> alpha=c(13.8,93.8)
> beta=c(9.2,29.2)
> for(i in 1:length(alpha)){
+   y<-dbeta(x,shape1=alpha[i],shape2=beta[i])
+   plot(x,y,type="l")
+ }
As more and more flips are made and new data is observed, our beliefs get updated, and the posterior distribution becomes the new prior for the next batch of data.
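The update is mechanical enough to sketch in a few lines of R (my own illustration; the 0.8 bias and the flip count are assumed): each observed head increments α and each tail increments β, so the posterior after one flip becomes the prior for the next.

> alpha <- 13.8; beta <- 9.2                   # start from the prior above
> set.seed(1)
> flips <- rbinom(500, size = 1, prob = 0.8)   # hypothetical coin with bias 0.8
> for (y in flips) {
+   alpha <- alpha + y                         # a head (y=1) increments alpha
+   beta <- beta + (1 - y)                     # a tail (y=0) increments beta
+ }
> alpha / (alpha + beta)                       # posterior mean settles near the true bias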