Hidden Markov Model - Implemented From Scratch - by Oleg Żero - Towards Data Science
I want to expand this work into a series of µ-tutorial videos. If you’re interested, please
subscribe to my newsletter to stay in touch.
Introduction
The Internet is full of good articles that explain the theory behind the Hidden Markov
Model (HMM) well (e.g. 1, 2, 3 and 4). However, many of these works contain a fair
amount of rather advanced mathematical equations. While equations are necessary if
one wants to explain the theory, we decided to take it to the next level and create a
gentle step by step practical implementation to complement the good work of others.
In this short series of two articles, we will focus on translating all of the complicated
mathematics into code. Our starting point is the document written by Mark Stamp. We
will use this paper to define our code (this article) and then use a somewhat peculiar
example of “Morning Insanity” to demonstrate its performance in practice.
Notation
Before we begin, let’s revisit the notation we will be using. By the way, don’t worry if
some of that is unclear to you. We will hold your hand.
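T - length of the observation sequence.
N - number of latent (hidden) states.
M - number of observables.
Q = {q₀, q₁, …} - hidden states.
A - state transition matrix.
B - emission probability matrix.
π - initial state probability distribution.
O - observation sequence.
X = {x₀, x₁, …} - hidden state sequence.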
Having that set defined, we can calculate the probability of any state and observation
using the matrices:
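For reference, in the usual convention (assumed here to match Stamp's tutorial; the index symbols i, j and k are chosen for this sketch), the entries of the two matrices can be written as:

```latex
A = \{a_{ij}\}, \quad a_{ij} = p(x_{t+1} = q_j \mid x_t = q_i), \\
B = \{b_j(k)\}, \quad b_j(k) = p(O_t = k \mid x_t = q_j)
```

Under this convention each row of A and each row of B is a probability distribution, which is the property the objects defined next are meant to protect.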
Fundamental definitions
Since HMM is based on probability vectors and matrices, let’s first define objects that
will represent the fundamental concepts. To be useful, the objects must reflect
certain properties. For example, all elements of a probability vector must be numbers 0
≤ x ≤ 1 and they must sum up to 1. Therefore, let’s design the objects so that they
inherently safeguard the mathematical properties.
import numpy as np
import pandas as pd
class ProbabilityVector:
states = probabilities.keys()
probs = probabilities.values()
self.states = sorted(probabilities)
self.values = np.array(list(map(lambda x:
@classmethod
size = len(states)
rand /= rand.sum(axis=0)
@classmethod
@property
def dict(self):
@property
def df(self):
def __repr__(self):
raise NotImplementedError
return True
return False
index = self.states.index(state)
if isinstance(other, ProbabilityVector):
else:
raise NotImplementedError
return self.__mul__(other)
if isinstance(other, ProbabilityMatrix):
raise NotImplementedError
x = self.values
def argmax(self):
return self.states[index]
The most natural way to initialize this object is to use a dictionary as it associates values
with unique keys. Dictionaries, unfortunately, do not provide any assertion mechanisms
that put any constraints on the values. Consequently, we build our custom
ProbabilityVector object to ensure that our values behave correctly. Most importantly, we
enforce the following:
The number of values must equal the number of the keys (names of our states).
Although this is not a problem when initializing the object from a dictionary, we will
use other ways later.
All names of the states must be unique (the same arguments apply).
2. We use ready-made numpy arrays and the values therein, providing only the
names for the states.
For convenience and debugging, we provide two additional methods for requesting the
values. Decorated with @property, they return the content of the PV object as a dictionary or a
pandas dataframe.
The PV objects need to support the following mathematical operations (for the purpose of
constructing the HMM):
Note that when e.g. multiplying a PV with a scalar, the returned structure is a resulting
numpy array, not another PV. This is because multiplying by anything other than 1
would violate the integrity of the PV itself.
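To make these constraints concrete, here is a minimal standalone sketch. It is not the `ProbabilityVector` class itself: the helper name `make_probability_vector` is hypothetical, and it only illustrates the validation step and why scaling produces plain numbers rather than another valid probability vector.

```python
import math


def make_probability_vector(probabilities: dict) -> dict:
    """Validate a {state: probability} mapping and return it sorted by state name.

    Hypothetical helper, for illustration only.
    """
    values = list(probabilities.values())
    if not all(0.0 <= v <= 1.0 for v in values):
        raise ValueError("Probabilities must lie in the [0, 1] interval.")
    if not math.isclose(sum(values), 1.0, abs_tol=1e-12):
        raise ValueError("Probabilities must sum up to 1.")
    return {state: probabilities[state] for state in sorted(probabilities)}


pv = make_probability_vector({"sun": 0.7, "rain": 0.3})

# Scaling by anything other than 1 breaks the "sums to 1" property,
# so the result is kept as a plain list of numbers.
scaled = [2 * p for p in pv.values()]
```

Passing a dictionary such as `{"a": 0.5, "b": 0.6}` to the helper raises a `ValueError`, which is the kind of safeguard a bare dictionary cannot offer.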
Example
print(a1.df)
print(a2.df)
print("Comparison:", a1 == a2)
print("Argmax:", a1.argmax())
print("Getitem:", a1['rain'])
# OUTPUT
rain sun
Probability Matrix
Another object is a Probability Matrix, which is a core part of the HMM definition.
Formally, the A and B matrices must be row-stochastic, meaning that the values of every
row must sum up to 1. We can, therefore, define our PM by stacking several PV's, which
we have constructed in a way to guarantee this constraint.
class ProbabilityMatrix:
assert len(prob_vec_dict.keys()) ==
len(set(prob_vec_dict.keys())), \
self.states = sorted(prob_vec_dict)
self.observables = prob_vec_dict[self.states[0]].states
self.values = np.stack([prob_vec_dict[x].values \
for x in self.states]).squeeze()
@classmethod
/ (size**2) + 1 / size
rand /= rand.sum(axis=1).reshape(-1, 1)
@classmethod
np.ndarray,
states: list,
observables: list):
for x in array]
@property
def dict(self):
return self.df.to_dict()
@property
def df(self):
return pd.DataFrame(self.values,
def __repr__(self):
index = self.observables.index(observable)
Here, the way we instantiate PM’s is by supplying a dictionary of PV’s to the constructor
of the class. By doing this, we not only ensure that every row of PM is stochastic, but also
supply the names for every observable.
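As a rough illustration of the same idea outside the class, the sketch below stacks per-state rows into a numpy array, checks the row-stochastic property and picks out the column for one observable. The state and observable names follow the example output in this section; the probabilities themselves are assumed values chosen only for the demonstration.

```python
import numpy as np

# Rows are states, columns are observables. The numbers are illustrative assumptions.
states = ["0H", "1C"]
observables = ["0S", "1M", "2L"]
rows = {
    "0H": {"0S": 0.1, "1M": 0.4, "2L": 0.5},
    "1C": {"0S": 0.7, "1M": 0.2, "2L": 0.1},
}

matrix = np.array([[rows[s][o] for o in observables] for s in states])

# Row-stochastic check: every row must sum to 1.
assert np.allclose(matrix.sum(axis=1), 1.0)

# Column of coefficients for a single observable, one entry per state.
column_1M = matrix[:, observables.index("1M")]
```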
Our PM can, therefore, give an array of coefficients for any observable. Mathematically,
the PM is a matrix:
Example
print(A)
print(A.df)
print(B)
print(B.df)
>>> PM (2, 3) states: ['0H', '1C'] -> obs: ['0S', '1M', '2L'].
>>> 0S 1M 2L
P = ProbabilityMatrix.initialize(list('abcd'), list('xyz'))
print('Dot product:', a1 @ A)
print('Initialization:', P)
print(P.df)
states: ['a', 'b', 'c', 'd'] -> obs: ['x', 'y', 'z'].
>>> x y z
Later on, we will implement more methods that are applicable to this class.
Computing score
Computing the score means finding the probability of a particular chain of
observations O given our (known) model λ = (A, B, π). In other words, we are interested
in finding p(O|λ).
We can find p(O|λ) by marginalizing over all possible chains of the hidden variables X, where
X = {x₀, x₁, …}:

$$p(O|\lambda) = \sum_X p(O, X|\lambda) = \sum_X p(O|X, \lambda)\, p(X|\lambda)$$

Since p(O|X, λ) = ∏ b_{x_t}(O_t) (the product of all probabilities related to the observables) and
p(X|λ) = π_{x₀} ∏ a_{x_t, x_{t+1}} (the product of all probabilities of transitioning from x at t to x at t + 1), the
probability we are looking for (the score) is:

$$p(O|\lambda) = \sum_X \pi_{x_0} b_{x_0}(O_0)\, a_{x_0, x_1} b_{x_1}(O_1) \cdots a_{x_{T-2}, x_{T-1}} b_{x_{T-1}}(O_{T-1})$$

This is a naive way of computing the score, since we need to calculate the probability
for every possible chain X. Either way, let’s implement it in Python:
class HiddenMarkovChain:
self.pi = pi
self.states = pi.states
self.observables = E.observables
def __repr__(self):
len(self.states), len(self.observables))
@classmethod
E = ProbabilityMatrix.initialize(states, observables)
pi = ProbabilityVector.initialize(states)
score = 0
all_chains = self._create_all_chains(len(observations))
p_hidden_state[0] = self.pi[chain[0]]
return score
Example
If our implementation is correct, then all score values for all possible observation chains,
for a given model should add up to one. Namely:
all_observation_chains = list(product(*(all_possible_observations,)
* chain_length))
>>> All possible scores added: 1.0.
Indeed.
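The same sanity check can be reproduced without any of the classes above. The following is a minimal sketch using plain dictionaries: the model numbers are assumptions made up for the illustration, and `naive_score` is a hypothetical helper, not the `HiddenMarkovChain.score` method.

```python
from itertools import product

# Hypothetical two-state model; all numbers are illustrative assumptions.
states = ["1H", "2C"]
observables = ["1S", "2M", "3L"]
pi = {"1H": 0.6, "2C": 0.4}
A = {"1H": {"1H": 0.7, "2C": 0.3}, "2C": {"1H": 0.4, "2C": 0.6}}
B = {"1H": {"1S": 0.1, "2M": 0.4, "3L": 0.5},
     "2C": {"1S": 0.7, "2M": 0.2, "3L": 0.1}}


def naive_score(observations):
    """Sum p(O, X | model) over every possible hidden chain X."""
    total = 0.0
    for chain in product(states, repeat=len(observations)):
        # Start with the initial state and its emission.
        p = pi[chain[0]] * B[chain[0]][observations[0]]
        # Multiply in each transition followed by its emission.
        for t in range(1, len(observations)):
            p *= A[chain[t - 1]][chain[t]] * B[chain[t]][observations[t]]
        total += p
    return total


chain_length = 3
total_over_all_observations = sum(
    naive_score(obs) for obs in product(observables, repeat=chain_length)
)
```

Because the inner loop visits every hidden chain for every observation chain, the cost grows exponentially with the chain length, which is what motivates a cheaper approach.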
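To avoid summing over every chain, we can instead track the probability of the partial
observation sequence up to time t that ends in state q_i. For i = 0, 1, …, N-1 and
t = 0, 1, …, T-1:

$$\alpha_t(i) = p(O_0, O_1, \ldots, O_t, x_t = q_i | \lambda)$$

Consequently,

$$\alpha_0(i) = \pi_i\, b_i(O_0)$$

and

$$\alpha_t(i) = \left[ \sum_{j=0}^{N-1} \alpha_{t-1}(j)\, a_{ji} \right] b_i(O_t)$$

Then

$$p(O|\lambda) = \sum_{i=0}^{N-1} \alpha_{T-1}(i)$$

Note that α_t is a vector of length N. The sum of the product α a can, in fact, be written
as a dot product. Therefore:

$$\alpha_t = (\alpha_{t-1} A) \odot b(O_t)$$

where ⊙ denotes element-wise multiplication.
With this implementation, we reduce the number of multiplications to N²T and can take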
advantage of vectorization.
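As a standalone illustration of the forward pass, the sketch below uses bare numpy arrays instead of the PV and PM objects. The model numbers are assumed values, observations are passed as column indices into B, and the function name `forward` is hypothetical.

```python
import numpy as np

# Hypothetical model parameters (illustrative assumptions only).
pi = np.array([0.6, 0.4])                  # initial state probabilities
A = np.array([[0.7, 0.3], [0.4, 0.6]])     # A[i, j] = p(x_{t+1} = j | x_t = i)
B = np.array([[0.1, 0.4, 0.5],             # B[i, k] = p(O_t = k | x_t = i)
              [0.7, 0.2, 0.1]])


def forward(obs_idx):
    """Return the (T, N) array of alphas for a sequence of observation indices."""
    alphas = np.zeros((len(obs_idx), len(pi)))
    alphas[0] = pi * B[:, obs_idx[0]]
    for t in range(1, len(obs_idx)):
        # One vector-matrix product replaces the explicit sum over previous states.
        alphas[t] = (alphas[t - 1] @ A) * B[:, obs_idx[t]]
    return alphas


alphas = forward([2, 1, 0])
score = float(alphas[-1].sum())
```

Each step costs one N × N vector-matrix product, so the total work grows linearly with the sequence length rather than exponentially.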
class HiddenMarkovChain_FP(HiddenMarkovChain):
@ self.T.values) *
self.E[observations[t]].T
return alphas
alphas = self._alphas(observations)
return float(alphas[-1].sum())
Example
…yup.
Simulation and convergence
Let’s test one more thing. Basically, let’s take our λ = (A, B, π) and use it to generate a
sequence of random observables, starting from some initial state probability π.
If the desired length T is “large enough”, we would expect the system to converge on
a sequence that, on average, gives the same number of events as we would expect from A
and B matrices directly. In other words, the transition and the emission matrices
“decide”, with a certain probability, what the next state will be and what observation we
will get, for every step, respectively. Therefore, what may initially look like random
events, on average should reflect the coefficients of the matrices themselves. Let’s check
that as well.
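The idea can be tried out with nothing but the standard library. The sketch below is an assumption-laden toy: the model numbers are made up, `run` is a hypothetical stand-in for the method of the same name, and the seed is fixed only to make the run repeatable.

```python
import random
from collections import Counter

# Hypothetical model; the numbers are illustrative assumptions.
states = ["1H", "2C"]
observables = ["1S", "2M", "3L"]
pi = [0.6, 0.4]
A = {"1H": [0.7, 0.3], "2C": [0.4, 0.6]}
B = {"1H": [0.1, 0.4, 0.5], "2C": [0.7, 0.2, 0.1]}


def run(length, seed=0):
    """Generate (observations, hidden states) of the given length."""
    rng = random.Random(seed)
    state = rng.choices(states, weights=pi)[0]
    s_history, o_history = [], []
    for _ in range(length):
        s_history.append(state)
        # Emit an observation from the current state, then move on.
        o_history.append(rng.choices(observables, weights=B[state])[0])
        state = rng.choices(states, weights=A[state])[0]
    return o_history, s_history


obs, hidden = run(50_000)
state_freq = {s: c / len(hidden) for s, c in Counter(hidden).items()}
```

Comparing `state_freq` for a short and a long run is a quick way to see the relative frequencies settle as the sequence grows.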
class HiddenMarkovChain_Simulation(HiddenMarkovChain):
prb = self.pi.values
s_history[0] = np.random.choice(self.states,
p=prb.flatten())
o_history[0] = np.random.choice(self.observables,
p=obs.flatten())
s_history[t] = np.random.choice(self.states,
p=prb.flatten())
o_history[t] = np.random.choice(self.observables,
p=obs.flatten())
Example
stats = pd.DataFrame({
'observations': observation_hist,
Figure 1. An example of a Markov process. The states and the observable sequences are shown.
Latent states
The state matrix A is given by the following coefficients:
Consequently, the probability of “being” in the state “1H” at t+1, regardless of the
previous state, is equal to:
If we assume that the prior probabilities of being in either state are equal,
then the unnormalized values are p(1H) = 1.1 and p(2C) = 0.9, which after renormalizing give 0.55 and 0.45,
respectively.
If we count the number of occurrences of each state and divide it by the number of
elements in our sequence, we would get closer and closer to these numbers as the length
of the sequence grows.
Example
stats = {}
stats[length] = pd.DataFrame({
'observations': observation_hist,
S = np.array(list(map(lambda x:
x['states'].value_counts().to_numpy() / len(x),
stats.values())))
plt.semilogx(np.logspace(1, 5, 40).astype(int), S)
plt.ylabel('Probability')
plt.title('Converging probabilities.')
plt.legend(['1H', '2C'])
plt.show()
Let’s take our HiddenMarkovChain class to the next level and supplement it with more
methods. The methods will help us to discover the most probable sequence of hidden
variables behind the observation sequence.
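Now, let’s define the “opposite” probability, namely the probability of observing the
rest of the sequence, from T - 1 down to t + 1, given the state at time t:

$$\beta_t(i) = p(O_{t+1}, O_{t+2}, \ldots, O_{T-1} | x_t = q_i, \lambda)$$

Finally, we also define a new quantity γ, the probability (calculated forwards and
backwards) of being in the state q_i at time t given the whole observation sequence:

$$\gamma_t(i) = \frac{\alpha_t(i)\, \beta_t(i)}{p(O|\lambda)}$$

Consequently, for any step t = 0, 1, …, T-1, the state of the maximum likelihood can be
found using:

$$x_t = \arg\max_i \gamma_t(i)$$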
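A compact way to see the forward and backward passes working together is the numpy sketch below. It is an illustration under assumed model numbers, with observations given as column indices into B, and the function name `uncover` is borrowed only for readability; it is not the class method.

```python
import numpy as np

# Hypothetical model parameters (illustrative assumptions only).
states = ["1H", "2C"]
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.1, 0.4, 0.5],
              [0.7, 0.2, 0.1]])


def uncover(obs_idx):
    """Return the per-step most likely states and the gamma array."""
    T, N = len(obs_idx), len(pi)
    alphas = np.zeros((T, N))
    betas = np.ones((T, N))  # beta at the last step is 1 by definition
    alphas[0] = pi * B[:, obs_idx[0]]
    for t in range(1, T):
        alphas[t] = (alphas[t - 1] @ A) * B[:, obs_idx[t]]
    for t in range(T - 2, -1, -1):
        betas[t] = A @ (B[:, obs_idx[t + 1]] * betas[t + 1])
    # Normalizing by the score turns alpha * beta into per-step probabilities.
    gammas = alphas * betas / alphas[-1].sum()
    return [states[i] for i in gammas.argmax(axis=1)], gammas


path, gammas = uncover([2, 1, 0, 2])
```

Each row of `gammas` sums to one, so it can be read as the distribution over hidden states at that step. Note that this picks the best state at each step independently, which is not necessarily the single most likely chain as a whole.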
class HiddenMarkovChain_Uncover(HiddenMarkovChain_Simulation):
* self.E[observations[t]].T
return alphas
betas[-1, :] = 1
* betas[t + 1, :].reshape(-1,
1))).reshape(1, -1)
return betas
alphas = self._alphas(observations)
betas = self._betas(observations)
Validation
To validate, let’s generate some observable sequence O. For that, we can use our model’s
.run method. Then, we will use the .uncover method to find the most likely latent
variable sequence.
Example
np.random.seed(42)
uncovered_sequence = hmc.uncover(observed_sequence)
| | 0 | 1 | 2 | 3 | 4 | 5 |
|:------------------:|:----|:----|:----|:----|:----|:----|
| observed sequence | 3L | 2M | 1S | 3L | 3L | 3L |
| latent sequence | 1H | 2C | 1H | 1H | 2C | 1H |
| uncovered sequence | 1H | 1H | 2C | 1H | 1H | 1H |
As we can see, the most likely latent state chain (according to the algorithm) is not the
same as the one that actually caused the observations. This is to be expected. After all,
each observation sequence can only be manifested with certain probability, dependent
on the latent sequence.
The code below evaluates the likelihood of different latent sequences resulting in our
observation sequence.
all_states_chains = list(product(*(all_possible_states,) *
chain_length))
df = pd.DataFrame(all_states_chains)
dfp = pd.DataFrame()
for i in range(chain_length):
scores = dfp.sum(axis=1).sort_values(ascending=False)
df = df.iloc[scores.index]
df['score'] = scores
df.head(10).reset_index()
| index | 0 | 1 | 2 | 3 | 4 | 5 | score |
|:--------:|:----|:----|:----|:----|:----|:----|--------:|
| 8 | 1H | 1H | 2C | 1H | 1H | 1H | 3.1 |
| 24 | 1H | 2C | 2C | 1H | 1H | 1H | 2.9 |
| 40 | 2C | 1H | 2C | 1H | 1H | 1H | 2.7 |
| 12 | 1H | 1H | 2C | 2C | 1H | 1H | 2.7 |
| 10 | 1H | 1H | 2C | 1H | 2C | 1H | 2.7 |
| 9 | 1H | 1H | 2C | 1H | 1H | 2C | 2.7 |
| 25 | 1H | 2C | 2C | 1H | 1H | 2C | 2.5 |
| 0 | 1H | 1H | 1H | 1H | 1H | 1H | 2.5 |
| 26 | 1H | 2C | 2C | 1H | 2C | 1H | 2.5 |
| 28 | 1H | 2C | 2C | 2C | 1H | 1H | 2.5 |
The result above shows the sorted table of the latent sequences, given the observation
sequence. The actual latent sequence (the one that caused the observations) ranks at
the 35th position (counting the index from zero).
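For comparison, here is a different and simpler way to rank the hidden chains: by their joint probability with the observations, p(O, X|λ). This is not the scoring used in the table above, the model numbers are assumptions, and `joint_probability` is a hypothetical helper; only the observed and latent sequences are taken from the example.

```python
from itertools import product

# Hypothetical model; the numbers are illustrative assumptions.
states = ["1H", "2C"]
pi = {"1H": 0.6, "2C": 0.4}
A = {"1H": {"1H": 0.7, "2C": 0.3}, "2C": {"1H": 0.4, "2C": 0.6}}
B = {"1H": {"1S": 0.1, "2M": 0.4, "3L": 0.5},
     "2C": {"1S": 0.7, "2M": 0.2, "3L": 0.1}}

observed = ["3L", "2M", "1S", "3L", "3L", "3L"]
latent = ("1H", "2C", "1H", "1H", "2C", "1H")


def joint_probability(chain, observations):
    """p(O, X | model) for one particular hidden chain X."""
    p = pi[chain[0]] * B[chain[0]][observations[0]]
    for t in range(1, len(observations)):
        p *= A[chain[t - 1]][chain[t]] * B[chain[t]][observations[t]]
    return p


# Sort all 2**6 hidden chains from most to least likely.
ranking = sorted(
    product(states, repeat=len(observed)),
    key=lambda chain: joint_probability(chain, observed),
    reverse=True,
)
position = ranking.index(latent)  # zero-based rank of the true latent chain
```

With a different model the rank of the true chain will differ, but the point stands: the chain that actually produced the observations need not be the most probable one.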
dfc = df.copy().reset_index()
for i in range(chain_length):
dfc
| index | 0 | 1 | 2 | 3 | 4 | 5 | score |
|:-------:|:----|:----|:----|:----|:----|:----|--------:|
| 18 | 1H | 2C | 1H | 1H | 2C | 1H | 1.9 |
Knowing our latent states Q and possible observation states O, we automatically know
the sizes of the matrices A and B, hence N and M. However, we need to determine a and
b and π.
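Now, thinking in terms of implementation, we want to avoid looping over i, j and t at the
same time, as it would be very slow. Fortunately, we can vectorize the equation for the
di-gamma. For i, j = 0, 1, …, N-1 and t = 0, 1, …, T-2:

$$\gamma_t(i, j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)}{p(O|\lambda)}$$

so the whole N × N slice for a given t is an element-wise product of two arrays, which is
what the `P1 * P2 / score` line below computes.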
class HiddenMarkovLayer(HiddenMarkovChain_Uncover):
L, N = len(observations), len(self.states)
alphas = self._alphas(observations)
betas = self._betas(observations)
score = self.score(observations)
digammas[t, :, :] = P1 * P2 / score
return digammas
Having the “layer” supplemented with the ._digammas method, we should be able to
perform all the necessary calculations. However, it makes sense to delegate the
"management" of the layer to another class. In fact, the model training can be
summarized as follows:
1. Initialize A, B and π.
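To show how one full update might look end to end, here is a minimal numpy sketch of a single re-estimation step repeated a few times. It follows the standard Baum-Welch formulas rather than reproducing the classes in this article: `baum_welch_step` is a hypothetical function, the starting parameters are assumed values, and observations are given as column indices into B.

```python
import numpy as np


def baum_welch_step(pi, A, B, obs_idx):
    """One re-estimation step; returns (new_pi, new_A, new_B, score)."""
    T, N = len(obs_idx), len(pi)
    alphas, betas = np.zeros((T, N)), np.ones((T, N))
    alphas[0] = pi * B[:, obs_idx[0]]
    for t in range(1, T):
        alphas[t] = (alphas[t - 1] @ A) * B[:, obs_idx[t]]
    for t in range(T - 2, -1, -1):
        betas[t] = A @ (B[:, obs_idx[t + 1]] * betas[t + 1])
    score = alphas[-1].sum()

    # digammas[t, i, j]: probability of being in i at t and in j at t + 1.
    digammas = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        digammas[t] = (alphas[t][:, None] * A
                       * (B[:, obs_idx[t + 1]] * betas[t + 1])[None, :]) / score
    gammas = alphas * betas / score

    new_pi = gammas[0]
    new_A = digammas.sum(axis=0) / gammas[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):
        # Accumulate gamma only over the steps where observable k was seen.
        mask = np.array(obs_idx) == k
        new_B[:, k] = gammas[mask].sum(axis=0)
    new_B /= gammas.sum(axis=0)[:, None]
    return new_pi, new_A, new_B, score


# Hypothetical starting point and observation indices (assumptions).
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]])
obs_idx = [2, 1, 0, 2, 2, 2]

scores = []
for _ in range(5):
    pi, A, B, score = baum_welch_step(pi, A, B, obs_idx)
    scores.append(score)
```

The returned score is computed before the parameters are updated, so the list `scores` traces how well each successive model explains the same observation sequence. A loop like this would normally stop once the score no longer improves.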
class HiddenMarkovModel:
self.layer = hml
self._score_init = 0
self.score_history = []
@classmethod
return cls(layer)
alpha = self.layer._alphas(observations)
beta = self.layer._betas(observations)
digamma = self.layer._digammas(observations)
score = alpha[-1].sum()
L = len(alpha)
obs_idx = [self.layer.observables.index(x) \
for x in observations]
for t in range(L):
pi = gamma[0]
T = digamma.sum(axis=0) / gamma[:-1].sum(axis=0).reshape(-1,
1)
self.layer.pi = ProbabilityVector.from_numpy(pi,
self.layer.states)
self.layer.T = ProbabilityMatrix.from_numpy(T,
self.layer.states, self.layer.states)
return score
self._score_init = 0
score = self.update(observations)
print("Early stopping.")
break
self._score_init = score
self.score_history[epoch] = score
Example
np.random.seed(42)
hmm = HiddenMarkovModel(hml)
hmm.train(observations, 25)
Verification
Let’s look at the generated sequences. The “demanded” sequence is:
| | 0 | 1 | 2 | 3 | 4 | 5 |
|---:|:----|:----|:----|:----|:----|:----|
| 0 | 3L | 2M | 1S | 3L | 3L | 3L |
RUNS = 100000
T = 5
chains = RUNS * [0]  # pre-allocate one slot per simulated run
for i in range(len(chains)):
chain = hmm.layer.run(T)[0]
chains[i] = '-'.join(chain)
The table below summarizes simulated runs based on 100000 attempts (see above),
with the frequency of occurrence and number of matching observations.
The bottom line is that if we have truly trained the model, we should see a strong
tendency for it to generate sequences that resemble the one we require. Let’s see if it
happens.
df = pd.DataFrame(pd.Series(chains).value_counts(), columns=
['counts']).reset_index().rename(columns={'index': 'chain'})
s = []
df = df.drop(columns=['chain'])
df.head(30)
|    |   counts | 0   | 1   | 2   | 3   | 4   | 5   |   matched |
|---:|---------:|:----|:----|:----|:----|:----|:----|----------:|
| 0 | 8.907 | 3L | 3L | 3L | 3L | 3L | 3L | 4 |
| 1 | 4.422 | 3L | 2M | 3L | 3L | 3L | 3L | 5 |
| 2 | 4.286 | 1S | 3L | 3L | 3L | 3L | 3L | 3 |
| 3 | 4.284 | 3L | 3L | 3L | 3L | 3L | 2M | 3 |
| 4 | 4.278 | 3L | 3L | 3L | 2M | 3L | 3L | 3 |
| 5 | 4.227 | 3L | 3L | 1S | 3L | 3L | 3L | 5 |
| 6 | 4.179 | 3L | 3L | 3L | 3L | 1S | 3L | 3 |
| 7 | 2.179 | 3L | 2M | 3L | 2M | 3L | 3L | 4 |
| 8 | 2.173 | 3L | 2M | 3L | 3L | 1S | 3L | 4 |
| 9 | 2.165 | 1S | 3L | 1S | 3L | 3L | 3L | 4 |
| 10 | 2.147 | 3L | 2M | 3L | 3L | 3L | 2M | 4 |
| 11 | 2.136 | 3L | 3L | 3L | 2M | 3L | 2M | 2 |
| 12 | 2.121 | 3L | 2M | 1S | 3L | 3L | 3L | 6 |
| 13 | 2.111 | 1S | 3L | 3L | 2M | 3L | 3L | 2 |
| 14 | 2.1 | 1S | 2M | 3L | 3L | 3L | 3L | 4 |
| 15 | 2.075 | 3L | 3L | 3L | 2M | 1S | 3L | 2 |
| 16 | 2.05 | 1S | 3L | 3L | 3L | 3L | 2M | 2 |
| 17 | 2.04 | 3L | 3L | 1S | 3L | 3L | 2M | 4 |
| 18 | 2.038 | 3L | 3L | 1S | 2M | 3L | 3L | 4 |
| 19 | 2.022 | 3L | 3L | 1S | 3L | 1S | 3L | 4 |
| 20 | 2.008 | 1S | 3L | 3L | 3L | 1S | 3L | 2 |
| 21 | 1.955 | 3L | 3L | 3L | 3L | 1S | 2M | 2 |
| 22 | 1.079 | 1S | 2M | 3L | 2M | 3L | 3L | 3 |
| 23 | 1.077 | 1S | 2M | 3L | 3L | 3L | 2M | 3 |
| 24 | 1.075 | 3L | 2M | 1S | 2M | 3L | 3L | 5 |
| 25 | 1.064 | 1S | 2M | 1S | 3L | 3L | 3L | 5 |
| 26 | 1.052 | 1S | 2M | 3L | 3L | 1S | 3L | 3 |
| 27 | 1.048 | 3L | 2M | 3L | 2M | 1S | 3L | 3 |
| 28 | 1.032 | 1S | 3L | 1S | 2M | 3L | 3L | 3 |
| 29 | 1.024 | 1S | 3L | 1S | 3L | 1S | 3L | 3 |
And here are the sequences that we don’t want the model to create.
| | counts | 0 | 1 | 2 | 3 | 4 | 5 | matched |
|----:|---------:|:----|:----|:----|:----|:----|:----|----------:|
| 266 | 0.001 | 1S | 1S | 3L | 3L | 2M | 2M | 1 |
| 267 | 0.001 | 1S | 2M | 2M | 3L | 2M | 2M | 2 |
| 268 | 0.001 | 3L | 1S | 1S | 3L | 1S | 1S | 3 |
| 269 | 0.001 | 3L | 3L | 3L | 1S | 2M | 2M | 1 |
| 270 | 0.001 | 3L | 1S | 3L | 1S | 1S | 3L | 2 |
| 271 | 0.001 | 1S | 3L | 2M | 1S | 1S | 3L | 1 |
| 272 | 0.001 | 3L | 2M | 2M | 3L | 3L | 1S | 4 |
| 273 | 0.001 | 1S | 3L | 3L | 1S | 1S | 1S | 0 |
| 274 | 0.001 | 3L | 1S | 2M | 2M | 1S | 2M | 1 |
| 275 | 0.001 | 3L | 3L | 2M | 1S | 3L | 2M | 2 |
As we can see, there is a tendency for our model to generate sequences that resemble the
one we require, although the exact one (the one that matches 6/6) only appears
at index 12 of the table. On the other hand, according to the table, the top 10 sequences are
still the ones that are somewhat similar to the one we request.
To ultimately verify the quality of our model, let’s plot the outcomes together with the
frequency of occurrence and compare it against a freshly initialized model, which is
supposed to give us completely random sequences — just to compare.
hmm_rand = HiddenMarkovModel(hml_rand)
RUNS = 100000
T = 5
chains_rand = RUNS * [0]  # pre-allocate one slot per simulated run
for i in range(len(chains_rand)):
chain_rand = hmm_rand.layer.run(T)[0]
chains_rand[i] = '-'.join(chain_rand)
s = []
df2 = df2.drop(columns=['chain'])
ax.plot(df['matched'], 'g:')
ax.plot(df2['matched'], 'k:')
ax.set_xlabel('Ordered index')
ax.set_ylabel('Matching observations')
ax2 = ax.twinx()
ax.legend(['trained',
'initialized'])
ax2.legend(['trained', 'initialized'])
plt.grid()
plt.show()
Figure 4. Result after training of the model. The dotted lines represent the matched sequences. The lines
represent the frequency of occurrence for a particular sequence: trained model (red) and freshly initialized
(black). The initialized results in almost perfect uniform distribution of sequences, while the trained model
gives a strong preference towards the observable sequence.
Conclusion
In this article, we have presented a step-by-step implementation of the Hidden Markov
Model. We have created the code by adopting a first-principles approach. More
specifically, we have shown how the probabilistic concepts that are expressed through
equations can be implemented as objects and methods. Finally, we demonstrated the
usage of the model by finding the score, uncovering the latent variable chain and
applying the training procedure.
PS. I apologise for the poor rendering of the equations here. Basically, I needed to do it
all manually. However, please feel free to read this article on my home blog. There, I
took care of it ;)
If you want to be updated concerning the videos and future articles, subscribe to my
newsletter. You can also let me know of your expectations by filling out the form. See
you soon!