HMM Tutorial Part 1
http://www.techonline.com/osee/
Valery A. Petrushin
[email protected]
Center for Strategic Technology Research
Accenture
3773 Willow Rd.
Northbrook, Illinois 60062, USA.
Abstract
The objective of this tutorial is to introduce basic concepts of a Hidden Markov Model
(HMM) as a fusion of simpler models such as a Markov chain and a Gaussian mixture
model. The tutorial is intended for the practicing engineer, biologist, linguist, or programmer
who would like to learn more about these fascinating mathematical models and add them to
his or her repertoire. This lecture presents Markov chains and Gaussian mixture models,
which constitute the preliminary knowledge for understanding Hidden Markov Models.
Introduction
Why is it so important to learn about these models? First, the models have proved to be
indispensable for a wide range of applications in areas such as signal processing,
bioinformatics, image processing, and linguistics, which deal with sequences or
mixtures of components. Second, the key algorithm used for estimating the models, the so-called
Expectation Maximization (EM) algorithm, has much broader application potential and
deserves to be known by every practicing engineer or scientist. And last, but not least, the
beauty of the models and algorithms makes it worthwhile to devote some time and effort to
learning and enjoying them.
(a) Biased coin: \pi = (0.5 \;\; 0.5), \quad A = \begin{pmatrix} 0.5 & 0.5 \\ 0.3 & 0.7 \end{pmatrix}
(b) Paving model: \pi = (0.2 \;\; 0.6 \;\; 0.2), \quad A = \begin{pmatrix} 0.25 & 0.5 & 0.25 \\ 0.3 & 0.4 & 0.3 \\ 0.3 & 0.5 & 0.2 \end{pmatrix}
Figure 1.2. Markov chain models for a biased coin (a), and the paving model MN (b).
Such chains can be described by diagrams (Figure 1.2). The nodes of the diagram represent
the states (in our case, a state corresponds to the choice of a tile of a particular color) and the
edges represent transitions between the states. A transition probability is assigned to each
edge. The probabilities of all edges outgoing from a node must sum to one. Besides that, there
is an initial state probability distribution that defines the first state of the chain.
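To make this description concrete, here is a minimal Python sketch that encodes the paving model MN of Figure 1.2 as an initial distribution and a transition matrix and samples a tile sequence from it. The color labels are hypothetical, chosen only for illustration.

import numpy as np

# Paving model MN from Figure 1.2 (three tile colors); labels are illustrative.
states = ["red", "green", "blue"]
pi = np.array([0.2, 0.6, 0.2])              # initial state distribution
A = np.array([[0.25, 0.50, 0.25],           # row i: transition probabilities out of state i
              [0.30, 0.40, 0.30],
              [0.30, 0.50, 0.20]])

def sample_chain(pi, A, length, rng=np.random.default_rng(0)):
    """Draw a state sequence of the given length from a Markov chain (pi, A)."""
    q = [rng.choice(len(pi), p=pi)]                  # first state from pi
    for _ in range(length - 1):
        q.append(rng.choice(len(pi), p=A[q[-1]]))    # next state from the row of A
    return q

print([states[s] for s in sample_chain(pi, A, 10)])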
Table 1.1 presents three estimates of the parameters for increasing lengths of the training
sequence.
Table 1.1. The true transition matrix and three estimates obtained from training sequences of increasing length.
True: A = \begin{pmatrix} 0.25 & 0.5 & 0.25 \\ 0.3 & 0.4 & 0.3 \\ 0.3 & 0.5 & 0.2 \end{pmatrix}
Estimate 1: A = \begin{pmatrix} 0.20 & 0.53 & 0.27 \\ 0.26 & 0.36 & 0.38 \\ 0.27 & 0.51 & 0.22 \end{pmatrix}
Estimate 2: A = \begin{pmatrix} 0.26 & 0.49 & 0.25 \\ 0.3 & 0.4 & 0.3 \\ 0.3 & 0.5 & 0.2 \end{pmatrix}
Estimate 3: A = \begin{pmatrix} 0.24 & 0.52 & 0.24 \\ 0.3 & 0.4 & 0.3 \\ 0.3 & 0.49 & 0.21 \end{pmatrix}
Now you know what to do! You should walk along each path, collect data and use (1.3) and
(1.4) to create Markov chains for each Province.
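Formulas (1.3) and (1.4) are not reproduced above; assuming they are the usual frequency estimates (the fraction of training sequences starting in each state, and normalized transition counts), the estimation could be sketched in Python as follows. The function and variable names are illustrative.

import numpy as np

def estimate_markov_chain(sequences, n_states):
    """Count-based estimates of pi and A from a list of observed state sequences.

    Assumes (1.3) and (1.4) are the standard frequency estimates:
    pi_i = (number of sequences starting in state i) / (number of sequences),
    a_ij = (number of transitions i -> j) / (number of transitions out of i).
    """
    pi = np.zeros(n_states)
    counts = np.zeros((n_states, n_states))
    for q in sequences:
        pi[q[0]] += 1
        for s, t in zip(q[:-1], q[1:]):
            counts[s, t] += 1
    return pi / pi.sum(), counts / counts.sum(axis=1, keepdims=True)

# Hypothetical usage: one model per Province, e.g.
# models = {"North": estimate_markov_chain(north_paths, 3), ...}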
1.4 Recognition algorithm
Let us assume that you have already built the models for each Province: MN, ME, MS, and
MW. How can you use them to determine which Province’s team paved a sector Q? A natural
approach to solving this problem is to calculate the conditional probability of the model M
given the sequence of tiles Q: P(M|Q).
Using Bayes’ rule (1.5), we can see that P(Q) does not depend on the model. Thus, we
should pick the model that maximizes the numerator of formula (1.5).
P(M_k \mid Q) = \frac{P(Q \mid M_k) \cdot P(M_k)}{P(Q)}    (1.5)
where P(Mk|Q) is the posterior probability of model Mk for the given sequence Q; P(Q|Mk) is
the likelihood of Q being generated by Mk; P(Mk) is the prior probability of Mk; and P(Q) is the
probability of Q being generated by any of the models.
This criterion is called maximum a posteriori (MAP) estimation and can be
formally expressed as (1.6). However, it is often the case that the prior probabilities are not
known or are uniform over all the models. Then we can reduce (1.6) to (1.7). This criterion is
called maximum likelihood (ML) estimation.
M_k^{*} = \arg\max_{M_k} \{ P(Q \mid M_k) \cdot P(M_k) \}    (1.6)

M_k^{*} = \arg\max_{M_k} \{ P(Q \mid M_k) \}    (1.7)
True models:
North: \pi = (0.2 \;\; 0.6 \;\; 0.2), \quad A = \begin{pmatrix} 0.25 & 0.5 & 0.25 \\ 0.3 & 0.4 & 0.3 \\ 0.3 & 0.5 & 0.2 \end{pmatrix}
East: \pi = (0.5 \;\; 0.25 \;\; 0.25), \quad A = \begin{pmatrix} 0.2 & 0.3 & 0.5 \\ 0.3 & 0.3 & 0.4 \\ 0.4 & 0.3 & 0.3 \end{pmatrix}
South: \pi = (0.2 \;\; 0.2 \;\; 0.6), \quad A = \begin{pmatrix} 0.3 & 0.3 & 0.4 \\ 0.3 & 0.4 & 0.3 \\ 0.2 & 0.2 & 0.6 \end{pmatrix}
West: \pi = (0.3 \;\; 0.3 \;\; 0.4), \quad A = \begin{pmatrix} 0.4 & 0.3 & 0.3 \\ 0.5 & 0.25 & 0.25 \\ 0.4 & 0.3 & 0.3 \end{pmatrix}
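As a rough illustration of the recognition step, the following Python sketch scores a tile sequence Q under each Province's model and applies criterion (1.6), which reduces to (1.7) when no priors are supplied. The `models` dictionary and variable names are assumptions for this sketch, not code from the text; models are assumed to be (pi, A) pairs such as those estimated above.

import numpy as np

def sequence_log_likelihood(q, pi, A):
    """log P(Q | M) for a tile sequence q under a Markov chain M = (pi, A)."""
    ll = np.log(pi[q[0]])
    for s, t in zip(q[:-1], q[1:]):
        ll += np.log(A[s, t])
    return ll

def recognize(q, models, log_priors=None):
    """Pick the model maximizing (1.6); without priors this is the ML criterion (1.7)."""
    scores = {}
    for name, (pi, A) in models.items():
        scores[name] = sequence_log_likelihood(q, pi, A)
        if log_priors is not None:
            scores[name] += log_priors[name]
    return max(scores, key=scores.get)

# Hypothetical usage: models = {"North": (pi_n, A_n), "East": (pi_e, A_e), ...}
# province = recognize(observed_tiles, models)

Working with log probabilities avoids numerical underflow for long tile sequences; the argmax is unchanged because the logarithm is monotonic.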
1.5 Generalizations
Two important generalizations of the Markov chain model described above are worth
mentioning. They are high-order Markov chains and continuous-time Markov chains.
In the case of a high-order Markov chain of order n, where n > 1, we assume that the choice
of the next state depends on n previous states, including the current state (1.11).
P(q_{k+n} \mid q_1, q_2, \ldots, q_k, \ldots, q_{k+n-1}) = P(q_{k+n} \mid q_k, \ldots, q_{k+n-1})    (1.11)
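A hedged illustration of an order-2 chain: the next state is drawn from a distribution indexed by the two previous states. The transition table below is randomly generated purely for demonstration and does not come from the text.

import numpy as np

rng = np.random.default_rng(0)
K = 3
A2 = rng.dirichlet(np.ones(K), size=(K, K))    # A2[i, j] = P(next | prev=i, cur=j), illustrative
q = [0, 1]                                     # two seed states stand in for the initial distribution
for _ in range(10):
    q.append(rng.choice(K, p=A2[q[-2], q[-1]]))
print(q)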
In the case of a continuous-time Markov chain, the process waits for a random time at the
current state before it jumps to the next state. The waiting time at each state can have its own
probability density function or the common function for all states, for example, an exponential
distribution (1.12).
f_T(t) = \lambda \cdot e^{-\lambda t}, \quad t \ge 0    (1.12)
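A minimal simulation sketch of this description, assuming a common exponential waiting-time parameter lambda for all states and reusing the discrete transition matrix A as the embedded jump chain; the function name and signature are illustrative.

import numpy as np

def simulate_ctmc(pi, A, lam, t_end, rng=np.random.default_rng(1)):
    """Continuous-time sketch: the jump chain (pi, A) with exponential waiting times (1.12).

    A per-state array of rates instead of the single lam would give
    state-specific waiting-time densities.
    """
    t = 0.0
    state = rng.choice(len(pi), p=pi)
    path = [(t, state)]
    while True:
        t += rng.exponential(1.0 / lam)           # waiting time ~ Exp(lam)
        if t >= t_end:
            return path
        state = rng.choice(len(pi), p=A[state])   # then jump as in the discrete chain
        path.append((t, state))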
1.6 References and Applications
The theory of Markov chains is well established and has a vast literature. Taking full
responsibility for my biased selection, I would like to mention only two books. Norris’ book [1] is
a concise introduction to the theory of Markov chains, and Stewart’s book [2] pays attention
to numerical aspects of Markov chains and their applications. Both books give
recommendations and references for further reading.
Markov chains have a wide range of applications. One of the most important and well-developed
application areas is the analysis and design of queues and queuing networks. It covers a range
of problems from the productivity analysis of a carwash station to the design of optimal computer
and telecommunication networks. Stewart’s book [2] gives a lot of detail on this type of
modeling and describes a software package that can be used for queuing network design and
analysis (http://www.csc.ncsu.edu/faculty/WStewart/MARCA/marca.html).
Historically, the first application of Markov Chains was made by Andrey Markov himself in
the area of language modeling. Markov was fond of poetry and he applied his model to
studies of poetic style. Today you can have fun using the Shannonizer which is a Web-based
program (see http://www.nightgarden.com/shannon.htm) that can rewrite your message in the
style of Mark Twain or Lewis Carroll.
DNA sequences can be considered as texts in the alphabet of four letters that represent the
nucleotides. The difference in stochastic properties of Markov chain models for coding and
non-coding regions of DNA can be used for gene finding. GeneMark was the first system to
implement this approach (see http://genemark.biology.gatech.edu/GeneMark/).
Another group of models, known as branching processes, was designed to cover such
applications as modeling chain reactions in chemistry and nuclear physics, population genetics
(http://pespmc1.vub.ac.be/MATHMPG.html), and even game analysis for games such as
baseball (http://www.retrosheet.org/mdp/markov/theory.htm), curling, etc.
And the last application area I would like to mention is the so-called Markov chain Monte
Carlo (MCMC) algorithms. They use Markov chains to generate random numbers that belong exactly to
the desired distribution or, in other words, they create a perfectly random sampling.
See http://dimacs.rutgers.edu/~dbwilson/exact/ for additional information about these
algorithms.
Let us pretend that you are the chief scientist at Allempire Consulting, Inc., and Mr. Newrich
pays you a fortune to decode the data. How can you determine the characteristics of each
component and the proportions of the mixture? It seems that a stochastic modeling approach called
Gaussian Mixture Models can do the job.
We need to find values of νi, µi and σi that maximize the function (2.3). To solve this
problem, the Expectation Maximization (EM) Algorithm is used.
f(o; M) = \sum_{i=1}^{K} \nu_i f_i(o; \mu_i, \sigma_i), \qquad \sum_{i=1}^{K} \nu_i = 1    (2.2)

\log L(M) = \sum_{j=1}^{L} \log\left( \sum_{i=1}^{K} \nu_i f(o_j; \mu_i, \sigma_i) \right)    (2.3)
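For the one-dimensional case, formulas (2.2) and (2.3) translate almost directly into code. The NumPy sketch below mirrors the symbols ν, µ, σ; it is an illustrative sketch, not an implementation from the text.

import numpy as np

def gaussian_pdf(o, mu, sigma):
    """Univariate normal density f(o; mu, sigma)."""
    return np.exp(-0.5 * ((o - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def mixture_density(o, nu, mu, sigma):
    """Mixture density (2.2): sum_i nu_i * f(o; mu_i, sigma_i)."""
    return sum(n * gaussian_pdf(o, m, s) for n, m, s in zip(nu, mu, sigma))

def mixture_log_likelihood(data, nu, mu, sigma):
    """Log-likelihood (2.3) of the data under the mixture model M = (nu, mu, sigma)."""
    return float(sum(np.log(mixture_density(o, nu, mu, sigma)) for o in data))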
The EM algorithm is also known as the “fuzzy k-means” algorithm. The key idea of the
algorithm is to assign to each data point oi a vector wi that has as many elements as there are
components in the mixture (three in our case). Each element wij of the vector (we shall call
these elements weights) represents the confidence or probability that the i-th data point oi belongs to the
j-th mixture component. All weights for a given data point must sum to one (2.4). For
example, a two-component mixture weight vector (1, 0) means that the data point certainly
belongs to the first mixture component, while (0.3, 0.7) means that the data point belongs more
to the second mixture component than to the first one.
w_{ij} = P(N(\mu_j, \sigma_j) \mid o_i), \qquad \sum_{j=1}^{K} w_{ij} = 1    (2.4)
For each iteration, we use the current parameters of the model to calculate weights for all data
points (expectation step or E-step), and then we use these updated weights to recalculate the
parameters of the model (maximization step or M-step).
Let us see how the EM algorithm works in some detail. First, we have to provide an initial
approximation of the model M^(0) in one of the following ways:
(1) Partitioning the data points arbitrarily among the mixture components and then calculating the
parameters of the model.
(2) Setting the parameters randomly or based on our knowledge about the problem.
After this, we iterate the E- and M-steps until a stopping criterion is satisfied.
For the E-step, the expected component membership weights are calculated for each data point
based on the parameters of the current model. This is done using Bayes’ rule, which computes
the weight wij as the posterior probability of membership of the i-th data point in the j-th mixture
component (2.5). Compare formula (2.5) to Bayes’ rule (1.5).
w_{ij} = \frac{\nu_j \cdot f(o_i; \mu_j, \sigma_j)}{\sum_{h=1}^{K} \nu_h \cdot f(o_i; \mu_h, \sigma_h)}    (2.5)
For the M-step, the parameters of the model are recalculated using formulas (2.6), (2.7) and
(2.8), based on our refined knowledge about membership of data points.
\nu_j^{(t+1)} = \frac{1}{L} \sum_{i=1}^{L} w_{ij}, \qquad j = 1, \ldots, K    (2.6)

\mu_j^{(t+1)} = \frac{\sum_{i=1}^{L} w_{ij} \cdot o_i}{\sum_{i=1}^{L} w_{ij}}    (2.7)

\sigma_j^{2\,(t+1)} = \frac{\sum_{i=1}^{L} w_{ij} \cdot (o_i - \mu_j)^2}{\sum_{i=1}^{L} w_{ij}}    (2.8)
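Putting the E-step (2.5) and the M-step (2.6)-(2.8) together gives a compact EM loop. The following one-dimensional NumPy sketch stops when the log-likelihood increase falls below a threshold, as described below in (2.9); the values of eps and max_iter are illustrative choices, not values from the text.

import numpy as np

def em_gmm(data, nu, mu, sigma, eps=1e-6, max_iter=200):
    """One-dimensional EM sketch implementing (2.5)-(2.8) with stopping rule (2.9).

    nu, mu, sigma are the initial mixture weights, means, and standard deviations
    (the initial model M^(0)).
    """
    o = np.asarray(data, dtype=float)
    nu, mu, sigma = (np.asarray(x, dtype=float) for x in (nu, mu, sigma))
    prev_ll = -np.inf
    for _ in range(max_iter):
        # E-step (2.5): posterior membership weights w[i, j]
        dens = nu * np.exp(-0.5 * ((o[:, None] - mu) / sigma) ** 2) \
               / (sigma * np.sqrt(2.0 * np.pi))
        w = dens / dens.sum(axis=1, keepdims=True)
        # Log-likelihood (2.3) of the current model, used for the stopping rule (2.9)
        ll = np.log(dens.sum(axis=1)).sum()
        if ll - prev_ll < eps:
            break
        prev_ll = ll
        # M-step (2.6)-(2.8): re-estimate weights, means, and variances
        nu = w.mean(axis=0)
        mu = (w * o[:, None]).sum(axis=0) / w.sum(axis=0)
        sigma = np.sqrt((w * (o[:, None] - mu) ** 2).sum(axis=0) / w.sum(axis=0))
    return nu, mu, sigma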
A nice feature of the EM algorithm is that the likelihood function (2.3) can never decrease;
hence the algorithm eventually converges to a local maximum. In practice, the algorithm stops when the
difference between two consecutive values of the log-likelihood function is less than a threshold
(2.9).

\log L(M^{(t+1)}) - \log L(M^{(t)}) < \varepsilon    (2.9)
Applying the EM algorithm to Mr. Newrich’s data, we obtain the solution presented in Table
2.2. Due to a rather good initial model it took only 17 iterations to converge. The old monk’s
secret is revealed!
The special beauty of mixture models is that they work even in cases when the components
heavily overlap. For example, for two-dimensional data of 1000 points and a three-component
mixture with two components having equal means but different variances, we can
get a good approximation in 58 iterations (see Figure 2.4). The dark side of the EM algorithm
is that it is very slow and is considered impractical for large data sets.
Figure 2.4. Three-component mixture with two components having equal means.
The EM algorithm was invented in the late 1970s, but the 1990s marked a new wave of research
revived by the growing interest in data mining. Recent research goes in two directions:
(1) Generalization of the EM algorithm and its extension to other probability distributions, and
(2) Improving the speed of the EM algorithm using hierarchical data representations.
Mixture models have been in use for more than 30 years. They are a very powerful approach to
model-based clustering and classification and have been applied to numerous problems. Some of
the more interesting applications are described below.
Mixture models are widely used in image processing for image representation and
segmentation, object recognition, and content-based image retrieval:
• http://www.cs.toronto.edu/vis/
• http://www.maths.uwa.edu.au/~rkealley/diss/diss.html
Mixture models have also been used for text classification and clustering:
• http://www.cs.cmu.edu/~mccallum/
Gaussian mixture models proved to be very successful for solving the speaker recognition and
speaker verification problems:
• http://www.ll.mit.edu/IST/
There is growing interest in applying this technique to exploratory data analysis (data
mining) and to medical diagnosis:
• http://www.cs.helsinki.fi/research/cosco/
• http://www.cnl.salk.edu/cgi-bin/pub-search/
You can learn more about these applications by following the URLs presented above.
References
[1] J.R. Norris (1997) Markov Chains. Cambridge University Press.
[2] W.J. Stewart (1994) Introduction to the Numerical Solution of Markov Chains.
Princeton University Press.
[3] G.J. McLachlan and K.E. Basford (1988) Mixture Models: Inference and Applications to
Clustering. New York: Marcel Dekker.
[4] G.J. McLachlan and T. Krishnan (1997) The EM Algorithm and Extensions. New York:
Wiley.