Cu HMM
May 6, 2009
1 Overview
The hidden Markov model (HMM), as a sequential classifier, has important applications in speech and
language processing [Rab89] [JM08] and in biological sequence analysis [Kro98].
In this project, we analyze the parallelism in the three algorithms used for HMM training and
classification, i.e. the forward algorithm, the Viterbi algorithm, and the Baum-Welch algorithm, with graphics
processing units (GPUs) in mind. Based on this analysis, we implement a prototype program for HMM
training and classification on the NVIDIA CUDA platform. The evaluation shows that for the forward
algorithm, our CUDA implementation achieves 23.3 GFLOP/s and up to 800× speedup over a
single-core CPU implementation; for the Baum-Welch algorithm, the performance is 4.3 GFLOP/s with a
200× speedup over the CPU implementation. We also note that our implementation is not fully optimized:
several sources of parallelism identified during the analysis are not implemented due to time and complexity
constraints. We expect a more sophisticated implementation could further exploit the GPU's computing ability.
The remaining sections are organized as follows. Section 2 gives a brief introduction to the hidden
Markov model, followed by a description of the three most important HMM algorithms and our analysis
of their parallelism. Section 3 describes our implementation in detail. Section 4 evaluates the
implementation by experiments; the experimental results are presented and discussed. Section 5 concludes.
2 Parallel Design
In this section, we first give a brief review of HMMs and introduce our notation (Section 2.1). In the
next three subsections (Sections 2.2, 2.3, and 2.4), we go through the three algorithms for HMMs; the
parallel design for each algorithm follows its analysis in the respective subsection.
S = s1, s2, . . . , sN          a set of N states.
O = o1, o2, . . . , oT          a sequence of T observations, each drawn from the observation symbol set V.
Q = q1, q2, . . . , qT          a sequence of states; qt denotes the state at time t.
A = [aij]                       an N × N transition probability matrix; each aij is the probability of moving
                                from state si to state sj, i.e. aij = P(qt+1 = sj | qt = si).
B = [bij]                       an N × |V| emission probability matrix; each bij is the probability of the
                                observation vj being generated from state si, i.e. bij = P(ot = vj | qt = si).
π = [π1, π2, . . . , πN]        an initial state distribution: πi = P(q1 = si).

Figure 1: Notation for a discrete HMM.
2.1 Hidden Markov Model
A Markov chain models a sequence of states in which, by assumption, the probability of a state transition
depends only on the previous state. An HMM is a Markov chain with outputs: in an HMM, an output is
produced after a state transition, and given only the outputs, the underlying Markov chain is hidden from
the observer.
Following the influential tutorial [Rab89], we formalize our definition of the discrete HMM with the
notation in Figure 1. With further reference to Jack Ferguson, [Rab89] introduces the following three
fundamental problems of interest for HMMs.
Problem 1 (Likelihood): Given an HMM λ = (A, B) and an observation sequence O, determine the
likelihood P(O | λ).
Problem 2 (Decoding): Given an observation sequence O and an HMM λ = (A, B), discover
the best hidden state sequence Q.
Problem 3 (Learning): Given an observation sequence O and the set of states in the HMM,
learn the HMM parameters A and B.
Problem 1 is solved by the forward algorithm; Problem 2 by the Viterbi algorithm; and Problem 3 by the
Baum-Welch algorithm. The three algorithms are closely related. The Viterbi algorithm is a variant of the
forward algorithm. The Baum-Welch algorithm first passes through the observations with the forward
algorithm, then follows with a backward pass that is a reversed version of the forward algorithm, and
finally estimates the model parameters from the results of the two passes; the first part of the algorithm
is therefore sometimes called the forward-backward algorithm.
2.2 The Forward Algorithm
To compute the likelihood, the forward algorithm constructs a trellis α of size T × N for the sequence
of T observations over N underlying states. The trellis can be viewed as a matrix, where each cell
α(t, i) is the probability of being in state i after seeing the observations up to time t.
A sketch of the forward algorithm is shown in the pseudocode below. The input to the algorithm
is a sequence of observations O, and the output is the likelihood of the observations. The algorithm
assumes the sequence begins in a designated start state s and ends in a designated end state e. The
complexity of this algorithm is O(N²T).
Forward(O)
 1  initialize all cells of α to 0
 2  α(1, s) ← 1
 3  for t = 2 to T
 4      do for i = 1 to N
 5             do for j = 1 to N
 6                    do p ← aji · bi(ot)
 7                       α(t, i) ← α(t, i) + α(t − 1, j) · p
 8  likelihood ← α(T, e)
 9  return likelihood
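For concreteness, the recursion above can be written in plain C roughly as follows. This is a minimal
serial sketch: the row-major array layout, the obs array of symbol indices, and the explicit
start_state/end_state parameters are our own choices for illustration, not the exact code of our implementation.

    /* Forward recursion for one observation sequence.
     * alpha: T x N trellis (row-major), a: N x N transitions,
     * b: N x V emissions, obs: T observation symbol indices.
     * Returns alpha(T-1, end_state) as the sequence likelihood. */
    double forward(double *alpha, const double *a, const double *b,
                   const int *obs, int T, int N, int V,
                   int start_state, int end_state)
    {
        for (int t = 0; t < T; t++)
            for (int i = 0; i < N; i++)
                alpha[t * N + i] = 0.0;
        alpha[0 * N + start_state] = 1.0;       /* line 2 of the pseudocode */

        for (int t = 1; t < T; t++)             /* line 3 */
            for (int i = 0; i < N; i++)         /* line 4 */
                for (int j = 0; j < N; j++) {   /* line 5 */
                    double p = a[j * N + i] * b[i * V + obs[t]];
                    alpha[t * N + i] += alpha[(t - 1) * N + j] * p;
                }

        return alpha[(T - 1) * N + end_state];  /* line 8 */
    }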
Line 3 of the algorithm loops over the entire sequence of observations. For each time step t, we compute
the α(t, i) probabilities for all states i. The computation depends on the previous column of the trellis
(α(t − 1, ·) at line 7). This dependence among columns makes the loop hard to parallelize.
Iterations of the for loop at line 4 have no dependence on each other, so the code can be parallelized
following a task decomposition. Figure 2 shows a simplified example with only 3 states. The first task
computes the three dashed arcs; each arc adds a quantity p to the state s1 it points to. The other two
tasks compute the arcs pointing to s2 and s3. In the parallelized algorithm, all the tasks can be carried
out simultaneously, i.e. all arcs pointing to one state are computed by one thread.
Further note that each task is merely a summation over an array of quantities; the task can therefore
itself be carried out in a parallel reduction manner [Har08].
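As a rough illustration of this reduction idea (not the kernel actually used in our prototype), a thread
block can sum the N terms contributing to a single trellis cell in shared memory. The sketch below assumes
the block size is a power of two and at least N, and that the kernel is launched with
blockDim.x * sizeof(float) bytes of dynamic shared memory; all names are illustrative.

    // Hypothetical sketch: one thread block sums the N terms
    //   alpha(t-1, j) * a(j, i) * b(i, o_t)
    // for a single trellis cell alpha(t, i) via tree reduction.
    __global__ void forward_cell_reduce(const float *alpha_prev, // N previous alphas
                                        const float *a_col,      // a(j, i) for fixed i
                                        float b_io,              // b(i, o_t) for fixed i, o_t
                                        float *alpha_cell,       // output: alpha(t, i)
                                        int N)
    {
        extern __shared__ float partial[];
        int j = threadIdx.x;
        partial[j] = (j < N) ? alpha_prev[j] * a_col[j] * b_io : 0.0f;
        __syncthreads();

        // Tree reduction in shared memory, as in [Har08].
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (j < stride)
                partial[j] += partial[j + stride];
            __syncthreads();
        }
        if (j == 0)
            *alpha_cell = partial[0];
    }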
So far our discussion assumes there is only one observation sequence. Suppose instead we have a number
of observation sequences and all sequences are independent; this is not uncommon in practice. For a total
of M sequences, the time complexity becomes O(MN²T). Taking advantage of the data independence, we
can add another level of parallelism to the algorithm: at each stage of the trellis, we compute the state
probabilities for all sequences in parallel. This design follows the data-parallel pattern. Supposing all
threads in our design execute in parallel in constant time, the time complexity is reduced to O(cNT),
where c is the execution time of the parallel code.
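A simplified kernel for this data-parallel step might look as follows, with one thread per
(sequence, state) cell of the current trellis column. The array names, row-major layouts, and thread
mapping are our assumptions for illustration; the arithmetic follows the recursion in the pseudocode above.

    // Hypothetical sketch: advance the trellis one time step for all M sequences.
    // alpha_prev, alpha_cur: M x N slices (row-major, one row per sequence)
    // a: N x N transition matrix, b: N x V emission matrix
    // obs_t: the observation symbol at time t for each of the M sequences
    __global__ void forward_step_all(const float *alpha_prev, float *alpha_cur,
                                     const float *a, const float *b,
                                     const int *obs_t, int M, int N, int V)
    {
        int m = blockIdx.y * blockDim.y + threadIdx.y;  // sequence index
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // destination state
        if (m >= M || i >= N) return;

        float sum = 0.0f;
        for (int j = 0; j < N; j++)                     // sum over source states
            sum += alpha_prev[m * N + j] * a[j * N + i];
        alpha_cur[m * N + i] = sum * b[i * V + obs_t[m]];
    }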
Figure 2: Execution of one iteration of line 3 of the forward algorithm in a 3-state HMM.
2.3 The Viterbi Algorithm
The Viterbi algorithm is very similar to the forward algorithm; the only difference between the two
algorithms is at line 7 of the pseudocode. Instead of summing over all α probabilities in the previous
column, the Viterbi algorithm finds the maximum one and keeps a backpointer to trace the state that leads
to the maximum probability. The pseudocode for the Viterbi algorithm is given in Figure 3. The input to
the algorithm is a sequence of observations and the output is the sequence of most likely states that
generated the observations.
Viterbi(O)
 1  initialize all cells of α to 0
 2  α(1, s) ← 1
 3  for t = 2 to T
 4      do for i = 1 to N
 5             do for j = 1 to N
 6                    do p ← aji · bi(ot)
 7                       if α(t − 1, j) · p > α(t, i)
 8                          then α(t, i) ← α(t − 1, j) · p
 9                               backpointers(t, i) ← j
10  states ← Backtrace(backpointers)
11  return states

Figure 3: Pseudocode for the Viterbi algorithm.
Lines 4–9 of the algorithm can be parallelized in the same way as the forward algorithm. The
backtrace takes linear time and cannot be parallelized. Since the main loop (line 3) cannot be
parallelized either, the backtrace will not affect the overall running time of the Viterbi algorithm.
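In a C-style sketch (same hypothetical array layout as the earlier forward sketch, plus a backpointer
array), the update for one trellis cell becomes:

    /* Sketch of the Viterbi update for one trellis cell (t, i): the forward
     * summation is replaced by a maximization, and the best predecessor is
     * recorded for the backtrace. Cells of alpha are assumed initialized to 0. */
    void viterbi_cell(double *alpha, int *backptr, const double *a,
                      const double *b, const int *obs,
                      int t, int i, int N, int V)
    {
        for (int j = 0; j < N; j++) {
            double p = a[j * N + i] * b[i * V + obs[t]];
            double cand = alpha[(t - 1) * N + j] * p;
            if (cand > alpha[t * N + i]) {
                alpha[t * N + i] = cand;
                backptr[t * N + i] = j;   /* remember best predecessor */
            }
        }
    }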
2.4 The Baum-Welch Algorithm
The Baum-Welch algorithm learns the HMM parameters by iterating over the input data until convergence
or until a certain threshold condition is met, e.g. a maximum number of iterations or a small enough
change in the parameters.
Each iteration takes two passes over the data. In the first pass, the algorithm uses the forward
algorithm to construct the α probabilities. In addition to the α probabilities, the algorithm runs a
similar backward algorithm to construct the β probabilities. The backward probability β(t, i) is the
probability of seeing the observations from ot+1 to the end, given that we are in state i at time t.
Based on the α and β probabilities, we can compute the expected count of being in state j at a given
observation t as

    γ(j, t) = α(t, j) · β(t, j) / likelihood

where likelihood is the observation probability as computed by the forward algorithm.
We list part of the pseudocode for the forward-backward algorithm in Figure 4. The α probabilities
are filled in by the call to the Forward function at line 2. The remaining code performs the backward
pass and accumulates the ξ and γ counts.
Forward-Backward-Expectation(O)
 1  initialize all cells of α, β, γ, ξ to 0
 2  likelihood ← Forward(O)
 3  β(T, e) ← 1
 4  for t = T downto 2
 5      do for i = 1 to N
 6             do γ(t, i) ← γ(t, i) + α(t, i) · β(t, i)/likelihood
 7                ξ(i) ← ξ(i) + α(t, i) · β(t, i)/likelihood
 8                for j = 1 to N
 9                    do p ← aji · bi(ot)
10                       β(t − 1, j) ← β(t − 1, j) + β(t, i) · p
11                       ξ(j, i) ← ξ(j, i) + α(t − 1, j) · β(t, i) · p/likelihood

Figure 4: Pseudocode for the expectation step of the forward-backward algorithm.
After the expectation pass, the model parameters are re-estimated (the maximization step) as

    âij = ξ(i, j) / ξ(i)        b̂jt = γ(j, t) / ξ(j)
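The maximization step itself is a simple normalization of the accumulated counts. Below is a plain-C
sketch under our own interpretation of the indices (in particular, treating the second index of γ as the
observation symbol); all array names and layouts are illustrative only.

    /* Hypothetical sketch of the maximization step: normalize the accumulated
     * counts into new parameter estimates.
     * xi: N x N accumulated transition counts, xi_sum: N accumulated state counts,
     * gamma: V x N accumulated emission counts (per observation symbol). */
    void reestimate(const float *xi, const float *xi_sum, const float *gamma,
                    float *a_hat, float *b_hat, int N, int V)
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a_hat[i * N + j] = xi[i * N + j] / xi_sum[i];

        for (int j = 0; j < N; j++)
            for (int t = 0; t < V; t++)           /* t indexes observation symbols */
                b_hat[j * V + t] = gamma[t * N + j] / xi_sum[j];
    }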
As in the forward algorithm, the for loop at line 5 of the backward pass can be parallelized.
Figure 5: The 3D trellis as a cuboid.
Note that the accumulation of the γ and ξ terms (lines 6, 7, 10, and 11) cannot be parallelized in the
same manner. If we had all the γ and ξ terms available, we could compute their summation using the
reduction technique [Har08]. On the other hand, there are N × N ξ terms and N × M γ terms, so we can
also carry out the accumulation in parallel over the terms, each term accumulated serially. If the number
of ξ or γ terms is much larger than the number of sequences, this alternative approach has little effect
on the overall running time. The ideal way may be to implement the parallel reduction in two dimensions,
i.e. a parallel reduction for each term, with all terms processed in parallel. However, this adds much
complexity to the implementation, and in our prototype program we only take the term-level parallel approach.
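For illustration, a term-level accumulation kernel for the ξ counts at one backward step might look like
the following sketch. The array names, the per-sequence likelihood array, and the thread-to-term mapping
are our assumptions, not the exact kernel of our implementation; the γ accumulation is analogous.

    // Hypothetical sketch: term-level accumulation of the xi counts for one
    // backward step t. One thread owns one (j, i) entry and loops over the
    // M sequences serially, so no atomics are needed.
    // alpha_prev: M x N (time t-1), beta_cur: M x N (time t),
    // a: N x N, b: N x V, obs_t: observation at time t per sequence,
    // likelihood: per-sequence likelihoods, xi: N x N accumulated counts.
    __global__ void accumulate_xi(const float *alpha_prev, const float *beta_cur,
                                  const float *a, const float *b,
                                  const int *obs_t, const float *likelihood,
                                  float *xi, int M, int N, int V)
    {
        int j = blockIdx.y * blockDim.y + threadIdx.y;  // source state at t-1
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // destination state at t
        if (i >= N || j >= N) return;

        float acc = 0.0f;
        for (int m = 0; m < M; m++) {
            float p = a[j * N + i] * b[i * V + obs_t[m]];
            acc += alpha_prev[m * N + j] * p * beta_cur[m * N + i] / likelihood[m];
        }
        xi[j * N + i] += acc;
    }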
3 Implementation
This section presents our implementation details, including the data structures, kernel functions, and
current limitations. At the end, we also discuss how to relax some restrictions on the input data
introduced by the parallel design.
Suppose we have M observation sequences produced from the same HMM with N states, and further assume
each sequence has the same length L. To compute the forward probabilities in parallel for all sequences,
we form a three-dimensional trellis, shown as a cuboid in Figure 5. Each slice of the cuboid parallel to
the top (yellow) face is the trellis for one of the M sequences. Each slice parallel to the front (green)
face corresponds to one column of the trellises of all M sequences, and each element in such a slice
corresponds to a state in the trellis of one sequence.
In the serial implementation, the cuboid is computed from top to bottom: the slices parallel to the
yellow face are computed in sequence, and within each slice the computation proceeds from the front
column to the back column. In our parallel implementation, we instead compute the cuboid trellis from
front to back. At each step, we compute an entire slice parallel to the green face. For each element of
the slice, the task is the same; in our CUDA implementation, each kernel thread is responsible for one
state in the slice.
The computation of one slice of the cuboid resembles matrix multiplication, as shown in Figure 6.
The M × N matrix C is the previous slice in the cuboid; the N × N square matrix A holds the state
transition probabilities; the N × |V| matrix B is the observation emission probability matrix; and the
matrix D is the result. An element Dij, i.e. the jth element in the ith row of D, is computed as

    Dij = (BOi .* Ci) × Aj

where Oi is the observation for the ith sequence in the current slice and .* denotes element-by-element
multiplication. The multiplication with A is the same as in matrix multiplication. For a row of C, the
observation is fixed, so we only need to load the corresponding row of the emission probability matrix,
store the element-by-element product of the two rows, and use the resulting product row to multiply the
corresponding row in A.
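Transcribing the formula above literally for one row of D (one sequence) gives the following plain-C
sketch; the function name, array names, and row-major layouts are ours, and the actual kernel distributes
this computation across threads as described next.

    /* Hypothetical transcription of Dij = (B_{Oi} .* Ci) x Aj for one row of D:
     * element-by-element product of the emission entries for observation O_i
     * with row i of C, followed by a dot product against A.
     * Layouts: C is M x N, A is N x N, B is N x V (all row-major). */
    void compute_slice_row(const float *C, const float *A, const float *B,
                           int O_i, float *D_row, int i, int N, int V)
    {
        for (int j = 0; j < N; j++) {
            float sum = 0.0f;
            for (int k = 0; k < N; k++)
                sum += (B[k * V + O_i] * C[i * N + k]) * A[k * N + j];
            D_row[j] = sum;
        }
    }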
In the CUDA programming model, the parallel functions running on the device are known as kernels, and
the threads executing a kernel are organized into grids of blocks. For the partition of threads into
blocks and grids, our prototype program uses the same scheme as the matrix multiplication example in the
NVIDIA CUDA Programming Guide [NVI08]:
• Each thread block is responsible for computing one square sub-matrix Dsub of D.
• Each thread within the block is responsible for computing one element of Dsub .
The block size here is chosen to be 16, so the number of threads per block (16 × 16 = 256) is a multiple
of the warp size and remains below the maximum number of threads per block [NVI08].
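A corresponding launch configuration might look like the following host-side sketch, reusing the
hypothetical forward_step_all kernel from Section 2.2; BLOCK_SIZE and the grid shape are illustrative and
assume N and M are already multiples of 16, as required by our design.

    // Forward declaration of the hypothetical kernel sketched in Section 2.2.
    __global__ void forward_step_all(const float *alpha_prev, float *alpha_cur,
                                     const float *a, const float *b,
                                     const int *obs_t, int M, int N, int V);

    #define BLOCK_SIZE 16

    // Hypothetical launch: 16 x 16 threads per block, one thread per element
    // of the M x N result slice D, following the partition in [NVI08].
    void launch_forward_step(const float *C, float *D, const float *A,
                             const float *B, const int *obs, int M, int N, int V)
    {
        dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
        dim3 grid(N / BLOCK_SIZE, M / BLOCK_SIZE);  // N, M padded to multiples of 16
        forward_step_all<<<grid, threads>>>(C, D, A, B, obs, M, N, V);
    }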
The Viterbi algorithm can use the same design: in each kernel thread, instead of computing the summation
of the probabilities, we find the maximum one and store the backtrace pointer.
In the Baum-Welch algorithm, the computation of one step of the backward pass follows the same pattern
as the forward algorithm. As the only change is in the indexing, we use the same kernel/block/grid
organization as for the forward algorithm; however, due to the change in matrix indices, the computation
can no longer be represented as a matrix multiplication.
During each step of the backward pass, we also need to accumulate the estimated γ and ξ counts. As
analyzed before, the ideal way would be to carry out reductions for all N × N terms simultaneously;
however, this involves two different parallel patterns and complicates the programming. In our current
implementation, parallelism is carried out for each state: a thread accumulates all M counts for the
state it is responsible for. This leads to a different computation load than a thread in the forward
algorithm, which accumulates N quantities. This suggests the program will reach its computational peak
when M = N.
The previous design places the following two requirements on the input data.
1. The number of states and the number of sequences must be a multiple of the block size, i.e. 16.
2. All observation sequences must be of the same length.
To allow more general input, we can introduce slack variables into the model and pad the input to meet
the two requirements: we add extra states and an extra output symbol, and pad the input with these states
and outputs so that the input sizes are multiples of the block size and all sequences have the same length
as the longest one. After the computation, the results for the added states are simply ignored. This
method eliminates the two requirements, but the computation time is that of the padded input. As long as
the number of states and input sequences is much larger than the block size and the lengths of all
sequences are close, the extra computation time can be ignored.
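A padding helper for the observation sequences might look like the sketch below; the function name, the
slack_symbol parameter, and the layout of the padded buffer are our own illustrations (padding the number
of states with slack states would be handled analogously).

    #include <stdlib.h>

    /* Round x up to the next multiple of block. */
    static int round_up(int x, int block) { return (x + block - 1) / block * block; }

    /* Hypothetical padding helper: rounds the number of sequences up to a
     * multiple of the block size and pads every sequence to the length of the
     * longest one with a dummy slack symbol. The padded rows/entries are
     * ignored after the computation. Returns a padded_M x padded_L buffer. */
    int *pad_sequences(const int *const *seqs, const int *lens, int M,
                       int block, int slack_symbol, int *padded_M, int *padded_L)
    {
        int L = 0;
        for (int m = 0; m < M; m++)
            if (lens[m] > L) L = lens[m];

        *padded_M = round_up(M, block);
        *padded_L = L;                      /* all sequences padded to length L */

        int *out = (int *)malloc((size_t)(*padded_M) * L * sizeof(int));
        for (int m = 0; m < *padded_M; m++)
            for (int t = 0; t < L; t++)
                out[m * L + t] = (m < M && t < lens[m]) ? seqs[m][t] : slack_symbol;
        return out;
    }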
Figure 6: The computation of a slice in the trellis cuboid.
GPU             NVIDIA G92 with 512 MB RAM
CPU             Intel Quad Core Q9500 with 8 GB RAM
OS              GNU/Linux 2.6.18-53 x86_64
CUDA            CUDA 2.0
Host compiler   GCC 4.1.2

Table 1: The testing environment.
4 Evaluation
To evaluate our implementation, we first measure the performance in billions of floating point operations
per second (GFLOP/s). Though GFLOP/s is a good measure for hardware, it may not be as good a metric for
comparing our serial and parallel implementations. First, a single floating point operation in the serial
implementation may be decomposed into several separate operations in the parallel implementation, so a
speedup in GFLOP/s may not translate entirely into computation time. Second, besides floating point
throughput, other factors such as memory bandwidth may also be the bottleneck for program performance.
Last, the parallel program also uses the CPU for some computation, so the GFLOP/s measured on the GPU
does not reflect the entire program. For these reasons, we also include the running times for comparison.
The serial implementation is written in C. Note that in practice HMM programs can often be optimized for
certain known HMM structures; for example, speech recognition systems often use left-right models [Rab89].
Here, for comparison, we make no assumptions about the underlying HMM structure and make no such
optimization effort. Both the serial and parallel implementations use single precision arithmetic. The
correctness of both programs is verified against the MATLAB HMM Toolbox.
The kernels for the forward and Viterbi algorithms are almost identical, so we only list the results for
the forward algorithm here.
Table 1 lists the information of our testing environment.
Figure 7: The performance change of the forward algorithm as the number of sequences increases.
Data size (N × M)    CPU time (ms)    GPU time (ms)    Speedup
64 × 64              698.7            10.02            70×
128 × 128            5621.9           19.74            280×
192 × 192            18990.5          40.93            460×
256 × 256            45031.6          71.77            620×
320 × 320            88090.8          128.44           680×
448 × 448            152374.8         208.08           730×
512 × 512            360899.3         410.17           880×

Table 2: Comparison of running time between the parallel and serial versions of the forward algorithm.
Table 3: Comparison of running time between the parallel and serial versions of the Baum-Welch algorithm.

The serial implementation can take over 6 minutes, while the running time of the CUDA program stays
within 0.5 second.
Figure 9: The performance change of the Baum-Welch algorithm as the number of sequences increases.
Figure 10: The performance change of the Baum-Welch algorithm as the number of states increases.
5 Conclusion
Several further optimizations remain possible:
• Using texture memory. The transition matrix and the observation emission matrix are accessed
frequently throughout the program; they are not altered in the forward and Viterbi algorithms and are
only altered at the end of each iteration in the Baum-Welch algorithm. Currently we store them in
on-device global memory. We expect an increase in memory throughput if texture memory is used instead.
• More efficient matrix multiplication. Our partition scheme for matrix multiplication is not optimized;
more performance may be achieved with a better thread partition scheme or by using the CUBLAS library.
We also note that both our serial and parallel implementations use only a single CPU core. Further
speedup could be obtained by taking advantage of multi-core CPUs or multiple GPUs.
We have made the code publicly available at http://code.google.com/p/chmm/.
References
[Har08] Mark Harris. Optimizing Parallel Reduction in CUDA. NVIDIA CUDA SDK, 2008. Accessible at
http://www.nvidia.com/object/cuda_sdks.html.
[JM08] Daniel Jurafsky and James H. Martin. Speech and Language Processing. Prentice Hall,
2nd edition, May 2008.
[Kro98] Anders Krogh. An introduction to hidden Markov models for biological sequences. In
S. L. Salzberg, D. B. Searls, and S. Kasif, editors, Computational Methods in Molecular Biology,
pages 45–63. Elsevier, 1998.
[NVI08] NVIDIA Corporation. NVIDIA CUDA Programming Guide, 2008. Version 2.1.
[Rab89] Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in speech
recognition. Proceedings of the IEEE, 77(2):257–286, 1989.