
Data-Parallel Large Vocabulary Continuous Speech Recognition

on Graphics Processors
Jike Chong Youngmin Yi Arlo Faria Nadathur Satish Kurt Keutzer
Department of Electrical Engineering and Computer Science
University of California, Berkeley
Berkeley, CA 94720-1776, USA
{jike,ymyi,arlo,nrsatish,keutzer}@eecs.berkeley.edu
Abstract
Automatic speech recognition is a key technology for en-
abling rich human-computer interaction in emerging ap-
plications. Hidden Markov Model (HMM) based recog-
nition approaches are widely used for modeling the hu-
man speech process by constructing probabilistic estimates
of the underlying word sequence from an acoustic signal.
High-accuracy speech recognition, however, requires com-
plex models, large vocabulary sizes, and exploration of a
very large search space, making the computation too in-
tense for current personal and mobile platforms. In this
paper, we explore opportunities for parallelizing the HMM
based Viterbi search algorithm typically used for large-
vocabulary continuous speech recognition (LVCSR), and
present an efficient implementation on current many-core
platforms. For the case study, we use a recognition model
of 50,000 English words, with more than 500,000 word bi-
gram transitions, and one million hidden states. We exam-
ine important implementation tradeoffs for shared-memory
single-chip many-core processors by implementing LVCSR
on the NVIDIA G80 Graphics Processing Unit (GPU) in
Compute Unified Device Architecture (CUDA), leading to
significant speedups. This work is an important step for-
ward for LVCSR-based applications to leverage many-core
processors in achieving real-time performance on personal
and mobile computing platforms.
1. Introduction
Automatic speech recognition is a key technology for
enabling rich human-computer interaction in many emerg-
ing applications. The computation required for effective
human-computer speech interaction has been challenging
for today's personal and mobile platforms such as lap-
tops and PDAs. The emergence of many-core accelera-
tors for PCs opens up exciting opportunities for the aver-
age consumer platform. Programming for many-core plat-
forms, however, requires understanding a very different set
of trade-offs. Programming for GPUs, for instance, requires
a vast amount of parallelism in applications to hide memory
latency. GPUs also minimize threading overhead while re-
stricting thread-to-thread interactions. This paper analyzes
an algorithm typically used for large-vocabulary continuous
speech recognition (LVCSR), and illustrates important al-
gorithmic trade-offs for an efficient implementation on the
NVIDIA GeForce 8800 GTX, an off-the-shelf GPU.
Figure 1. Architecture of a large vocabulary
continuous speech recognition system
One effective approach for LVCSR is to use a Hidden
Markov Model (HMM) with beam search [1] approximate
inference algorithm. It is the standard approach seen in ma-
jor speech recognition projects such as SPHINX, HTK, and
Julius [2, 3, 4]. Figure 1 shows the major components of
such a system. This system uses a recognition network that
is compiled offline from a variety of knowledge sources us-
ing powerful statistical learning techniques. Spectral-based
speech features are extracted by signal-processing the au-
dio input and the inference engine then computes the most
likely word sequence based on the extracted speech features
and the recognition network.
Inference engine based LVCSR systems are modular and
flexible. They are language independent and robust to vari-
ous acoustic environments [3, 4]: by using different recog-
nition networks and signal-processing kernels, they have
been shown to be effective for Arabic, English, Japanese,
and Mandarin, in a variety of situations such as phone con-
versations, lectures, and news broadcasts.
Figure 2. Structure of the inference engine and the underlying parallelism opportunities
The inference process recognizes sequences of words
from a large vocabulary where the words can be ordered
in exponentially many permutations without knowing the
boundary segmentation between words. The inference en-
gine implements a search algorithm where the search space
consists of all possible word sequences given an array of
features extracted from an input audio stream. A 50,000
word vocabulary expands to almost one million hidden
states in the HMM, in which case a naïve Viterbi search
is computationally infeasible, as explained in section 2.
The beam search heuristic reduces the search space to a
manageable size by pruning away the less likely word se-
quences [1].
Although there is ample fine-grained concurrency inside
the inference engine, its top level architecture is quite se-
quential. As Figure 2 illustrates, the outer loop of the in-
ference engine proceeds in a sequence of iterations with
feedback between successive iterations. Even within an it-
eration, the algorithm follows a rigid sequence of search
space evaluation and pruning. The parallelism in the algo-
rithm appears within each pipeline step, where thousands to
tens-of-thousands of word sequences can be evaluated and
pruned in parallel.
There is a large body of prior work in mapping LVCSR
to parallel platforms. Ravishankar [5] in 1993 presented a
parallel implementation of the beam search algorithm on a
multi-processor cluster. Agaram et al. [6] in 2001 showed
an implementation of LVCSR on a multi-processor simula-
tor. And Dixon et al. [7] in 2007 produced a multi-threaded
LVCSR system, and attempted to use the GPU as a co-
processor accelerator for computing observation probabil-
ities, but were not able to exploit the full potential of the
GPU platform. There has also been continuous research in
the acceleration of speech recognition in hardware [8, 9].
The proposed hardware speech accelerators are often lim-
ited in the size of vocabulary supported and in the flexibility
of the algorithm. More recent implementations such as [9]
are similar in architecture to custom many-core accelerators
with associated application-specific scheduling and locking
mechanisms. A software solution on a general-purpose off-
the-shelf chip is more cost-efficient and more readily avail-
able today.
Compared to multi-core platforms, many-core platforms
take on a different architectural design point. By using
many simpler cores, more of the per-chip transistor bud-
get is devoted to computation, and less to on-chip mem-
ory hierarchies. Applications taking advantage of this ar-
chitectural trade-off should be data-parallel and have their
memory accesses coalesced. Data-parallel algorithms have
mostly independent parallel computation units with mini-
mal slow global synchronizations, and coalesced memory
accesses amortize load-store overheads for higher compu-
tation throughput.
This paper outlines a data-parallel implementation of the
LVCSR algorithm on a NVIDIA many-core GPU and il-
lustrates the key algorithmic and data layout decisions for
coalesced memory access. We briefly introduce the con-
tinuous speech recognition process in section 2, provide an
overview of many-core architecture properties in section 3,
and explain in detail our implementation in section 4.
2. Large vocabulary continuous speech recog-
nition
2.1. HMM
A speech recognition problem can be defined in terms
of the set of possible word hypotheses that can be inferred
from an acoustic observation signal.
Figure 3. An HMM state transition graph of LVCSR and a comparison of Viterbi and beam search
inference algorithms.
The simplest inference
problem is an isolated word recognition task, such as dis-
criminating between a "yes" or "no" in an interactive voice
response system; such a task can be solved by many tech-
niques, generally with modest computational effort. By
contrast, large vocabulary continuous speech recognition
(LVCSR) is a much more difficult problem: for example,
the objective might be to provide a transcription to serve as
closed captions for a television recording. LVCSR systems
must be able to recognize words from a very large vocabu-
lary arranged in exponentially many permutations, without
knowing the boundary segmentation between words.
For this task, the Hidden Markov Model (HMM) frame-
work has emerged as the most successful approach. This
framework follows the overall architecture of Figure 1, with
an HMM used as the recognition network.
An HMM represents a discrete-time statistical process in
which elements of a sequence of observed random variables
are generated by unobserved state variables in a Markov
chain. The use of HMMs for sequential pattern recognition
has led to various applications besides speech recognition,
such as optical character recognition and machine transla-
tion. Figure 3 shows an example of an HMM, with the state
transition graph to the left, and the Viterbi Search trellis at
center left. The nodes in the transition graph represent the
hidden states s of the system, with the set of possible transi-
tions between them and their associated transition probabilities
P(s_t | s_{t-1}). The emission model P(x_t | s_t) represents the
probability of seeing a particular observation x_t at each state
s_t of the HMM. The transition probabilities and emission
model are learned from training data. A particular path
through the trellis has a specic probability of occurrence
given the observation sequence. The inference problem is
then to nd the path with the maximum probability.
The naïve approach is to explore and compute the prob-
ability of all paths and select the most probable one. For a
network with N states, an observation of length T generates
O(N^T) possible paths. For one million states over hundreds
of observations, this is clearly computationally infeasible.
Fortunately, the Viterbi algorithm was developed to reduce
this exponential search space to O(N^2 T) by applying dy-
namic programming. In this algorithm, the most probable
path at each time step terminating at a given state can be
computed as a recurrence relation maximizing a term in-
volving the most probable paths to each state at the preced-
ing time step.
Let x_0, ..., x_t represent a partial observation of the first
t < T observation frames and s_0, ..., s_t be a corresponding
sequence of HMM hidden states that could have generated
those observations. We can define the quantity m[t][s_t] as
the joint probability of the partial observation sequence of
length t and the most likely path terminating at state s_t that
generated it:

    m[t][s_t] = max_{s_0, ..., s_{t-1}} P(x_0, ..., x_t, s_0, ..., s_{t-1}, s_t)
The recurrence relation exploited by the Viterbi algo-
rithm allows us to compute each m[t][s_t] by maximizing
over O(N) values at the previous time step, as in the equation
at the top of Figure 3.
Thus the Viterbi algorithm requires the previously com-
puted joint likelihood at a previous state, m[t-1][s_{t-1}],
along with the application of the state transition probabil-
ity P(s_t | s_{t-1}) and the observation probability at the current
state, P(x_t | s_t). We shall see that for LVCSR inference, the
computation bottleneck is the evaluation of the observation
probability.
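For reference, the recurrence referred to above (the equation at the top of Figure 3, not reproduced in this text) takes the standard Viterbi form; this is the conventional formulation reconstructed from the surrounding description rather than copied from the figure:

    m[t][s_t] = max_{s_{t-1}} { m[t-1][s_{t-1}] · P(s_t | s_{t-1}) } · P(x_t | s_t)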
Figure 4. A recognition network composed of acoustic, pronunciation, and language models.
Because the size of the state space |N| can be on the order
of millions for LVCSR, a common approximation to reduce
computation is to employ the beam search approach. At
each time step, 95-98% of the paths can be safely pruned
away as the lower scoring paths are unlikely to become the
best path at the end of the observation sequence. Figure 3
shows the nodes traversed in a Viterbi algorithm and a beam
search algorithm. The extent of pruning is usually decided
by means of a threshold parameter. The pruning discards
any paths with likelihood smaller than the threshold relative
to the most-likely path; the threshold is often adjusted dy-
namically to cap the maximum number of active states that
are under consideration at any time step.
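In log-likelihood terms, the pruning rule just described amounts to the criterion below, where beam denotes the threshold parameter; this is a standard formulation stated here for clarity, and the exact criterion used in the experiments follows [1]:

    keep state s_t at time t  if and only if  log m[t][s_t] >= max_{s'} log m[t][s'] - beam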
2.2. LVCSR recognition network
In this section, we show how the HMM is parameter-
ized for the LVCSR application. The recognition network
in LVCSR is a transition graph with a hierarchical combi-
nation of HMM acoustic model H, word pronunciation L,
and language model G, as depicted in Figure 4. The three
models form a hierarchy: the acoustic model defines the
phone units of speech, the pronunciation model specifies
how these phone units are combined into words, and the lan-
guage model represents how word-to-word transitions oc-
cur. We provide a brief description of these components,
and their integration in LVCSR systems.
The HMM acoustic phone model is a model for an el-
ementary unit of speech composed of a three-state, left-to-
right, self-looping transition structure. Each state contains a
multivariate Gaussian Mixture Model (GMM) from which
the observation probability density function can be com-
puted for the state. Each component in the GMM is a Gaus-
sian model with a vector of means and variances for dif-
ferent features of the input speech. Given an observation of
features extracted from an audio frame, we compare the fea-
tures with the vectors of each Gaussian to obtain a measure
of probabilistic match. We then compute the observation
probability as a weighted sum of the probabilistic matches
to each of the Gaussian mixture components. Since there
may be 16 to 128 components present in a typical GMM for
an LVCSR system, computing the observation probability
density function is often the most compute intensive step in
the inference algorithm.
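In the usual notation, with mixture weights w_k and component densities N(x_t; mu_k, Sigma_k), this weighted sum is the standard GMM likelihood (stated here for clarity, not quoted from the paper):

    P(x_t | s_t) = sum_{k=1}^{K} w_k · N(x_t; mu_k, Sigma_k)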
The pronunciation model denes the structure of the
state transitions inside each word as a concatenation of the
acoustic models for each phoneme of the word. The GMM
for different phone units and the transition probabilities
within each phone unit are shared among all of its instan-
tiations across many word pronunciations. Most LVCSR
systems have a pronunciation dictionary comprising tens
of thousands of words, some of which may have multiple
possible pronunciations. A typical word has 6-7 phonemes,
each of which consists of 3 HMM states. For a 50,000 word
language model, this leads to a total of about 7 × 3 × 50,000
≈ one million states, making the inference problem computa-
tionally challenging.
Lastly, the recognition network used for decoding a
speech utterance must incorporate a language model which
constrains the possible sequences of words. The most com-
mon language model for a task of this scope is based on
bigram probability estimates, relating the likelihood of co-
occurrence for two consecutive words in a sequence. Given
a set of word HMM models concatenated from phone units,
the final recognition network can be constructed by creat-
ing bigram-weighted transitions from the end of each word
to the beginning of every other word. Due to the sparsity
of language model training data, not all bigram transition
probabilities can be reliably estimated, so in practice the
word-to-word transition matrix is sparse: most words tran-
sition only to a small number of following words, and tran-
sitions to all other words are modeled with a backoff prob-
ability [1].
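The backoff scheme referenced here, and used again for the word-to-word transition evaluation in Section 4.2, is conventionally written as follows (the standard formulation from [1], stated for clarity):

    P(w_j | w_i) = P_bigram(w_j | w_i)                 if the bigram (w_i, w_j) is stored
    P(w_j | w_i) = backoff(w_i) · P_unigram(w_j)       otherwise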
Performing inference on this model produces the most
likely sequence of words while simultaneously determining
where the word boundaries lie. We use a beam search al-
gorithm for LVCSR inference, and we exploit parallel pro-
cessing to evaluate a large number of paths concurrently to
reduce overall runtime.
3. Many-core platform characteristics
Current and emerging many-core platforms, like recent
Graphical Processing Units (GPUs) from NVIDIA and the
Intel Larrabee processor [10], are built around an array
of processors running many threads of execution in paral-
lel. These chips employ a Single Instruction Multiple Data
(SIMD) architecture. Threads are grouped using a SIMD
structure and each group shares a multithreaded instruction
unit.
In this work, we map the LVCSR application to the G8x
series of many-core GPU architectures from NVIDIA that
is already available in the market. We intend to optimize
LVCSR for the Intel Larrabee processor when it becomes
available.
The G8x series of GPU architectures consists of an ar-
ray of Streaming Multiprocessors (SMs), each
of which contains 8 scalar processors (SPs). The G80
NVIDIA GPU, for instance, has 16 such SM multiproces-
sors and is capable of executing up to 128 concurrent hard-
ware threads. Figure 5 shows the block diagram for the
NVIDIA G80 chip. GPUs rely mainly on multi-threading
to hide memory latency. Each SM multiprocessor is multi-
threaded and is capable of having 768 different thread con-
texts active simultaneously.
The GPU is programmed using the CUDA programming
framework [11, 12]. An application is organized into a se-
quential host program that is run on a CPU, and one or
more parallel kernels that are typically run on a GPU (al-
though CUDA kernels can also be compiled to multi-core
CPUs [13]).
A kernel executes a scalar sequential program across a
set of parallel threads. The programmer organizes these
threads into thread blocks; a kernel consists of a grid of
one or more blocks. A thread block is a group of threads
that can coordinate by means of a barrier synchronization.
They also have access to a shared memory space private to that
block. The programmer must specify the number of blocks
in the kernel and the number of threads within a block when
launching the kernel. A single thread block is executed
on one SM multithreaded processor. While synchroniza-
tion within a block is possible using barriers in the shared
memory, global synchronization across all blocks is only
performed at kernel boundaries. The SIMD structure of the
hardware is exposed through thread warps. Each group of
32 threads within a thread block executes in a SIMD fash-
ion on the scalar processors within an SM, where the scalar
processors share an instruction unit. Although the SM can
handle the divergence of threads within a warp, it is impor-
tant to keep such divergence to a minimum for best perfor-
mance.
Each SM is equipped with a 16KB on-chip scratchpad
memory that provides a private per-block shared memory
space to CUDA kernels. This shared memory has very low
access latency and high bandwidth. Along with the SM's
lightweight barriers, this memory is an essential ingredi-
ent for efficient cooperation and communication amongst
threads in a block.
Threads in a warp can perform memory loads and stores
from and to any address in memory. However, when threads
within a warp access consecutive memory locations, then
the hardware can coalesce these accesses into an aggre-
gate transaction for higher memory throughput. On cur-
rent architectures a warp of 32 threads accessing consecu-
tive words only issues two memory accesses while a warp
performing a gather or scatter issues 16 memory accesses.
Code optimizations for the GPU platform using CUDA
usually involve code reorganization into kernels, each of
which has thread blocks with warps of threads in them.
The synchronization available at each step is different.
There is an implicit global barrier between different ker-
nels. Threads within a single thread block in a kernel can
synchronize using a local barrier, but this is not available to
threads across different blocks. It is a good general policy to
limit or eliminate routines that require global coordination
between all threads. Code optimizations for the GPU may
also involve changing the layout of the data structures for
getting good coalescing of memory accesses across threads
in a warp.
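As a minimal illustration of these points, the kernel below assigns one thread per array element, so the 32 threads of a warp touch 32 consecutive words and their accesses coalesce; this is a generic sketch with hypothetical names, not code from the paper's implementation.

```cuda
#include <cuda_runtime.h>

// One thread per element: thread i reads and writes data[i], so consecutive
// threads in a warp access consecutive addresses and the accesses coalesce.
__global__ void scale_kernel(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

void scale_on_gpu(float *d_data, float factor, int n)
{
    int threads_per_block = 256;  // a multiple of the 32-thread warp size
    int blocks = (n + threads_per_block - 1) / threads_per_block;
    scale_kernel<<<blocks, threads_per_block>>>(d_data, factor, n);
    cudaDeviceSynchronize();      // kernel completion acts as a global barrier
}
```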
4. Inference engine implementation
The inference engine implements the beam search al-
gorithm and iterates over three main steps of execution as
shown in figure 3 on the right. Each iteration begins with
a set of active states that represent the most likely word
sequences up to the current observation. The first
step computes the observation probabilities of all potential
next states, the second step computes the next state likeli-
hoods, and the third step selects the most likely next states
to retain as the set of new active states for the next iteration.
This section explains the implementation of a beam search
iteration.
Figure 5. Block diagram of a G80 GPU with 128 scalar SP processors, organized in 16 SM multiprocessors, interconnected with 6 DRAM memory partitions.
Table 1. Observation probability computation for a set of 23 speech segments
Algorithm # Labels Computed Execution Time % Runtime
All next state labels 398 million 3092 sec 98.13%
Only next state unique labels 9 million 72 sec 55.00%
One important challenge of effectively using the GPU
as an accelerator with a separate memory space is to avoid
copying intermediate results back and forth between the
CPU and GPU. To avoid this transfer we keep all compu-
tations and intermediate results on the GPU. However, we
must manage intermediate result data layout to maximize
coalesced memory accesses, and avoid expensive global co-
ordination in the implementation. We explain how we im-
plement each of the three steps of execution in the following
subsections and present the final implementation at the end
of this section.
4.1 Observation probability computation
The rst step of the beam search inference engine is to
compute the observation probability of all potential next
states. For LVCSR, as shown in figure 6, many states in
the recognition network share the same label. All states
with the same label have the same observation probability,
thus the naïve approach of doing one observation probabil-
ity computation per potential next state would needlessly du-
plicate computation. Experiments show that 97-
98% of the potential next states have duplicate labels. In the
sequential implementation, observation probability compu-
tation is the major bottleneck, taking 98% of the total run-
time. By doing the computation on only the unique labels,
we can achieve a 40X reduction in runtime for this step. Ta-
ble 1 illustrates the advantages of finding the unique labels.
For this reason, we add a step to the implementation,
where step 0 uses an efficient data-parallel find-unique al-
gorithm to remove duplicates among the labels of potential
next states. Step 1 then only needs to compute the observa-
tion probabilities for the unique labels.
4.1.1 Step 0: Finding unique labels
Given a set of labels, we want to find the set of unique labels
for which to compute the observation probabilities. Finding
the set of unique labels is straightforward on a CPU, usu-
ally implemented with a sort followed by compaction.
However, doing the same routines in a data-parallel fashion
is much more involved [14, 15].
By considering the application-specic characteristics
and platform capabilities, we are able to implement a
much more efficient find-unique routine for LVCSR on the
NVIDIA GPU. In our implementation of LVCSR, the set
of all possible labels is known a priori, as our recognition
network is statically compiled. This enables us to make a
lookup table of flags, one per possible label, and
set the flag for the labels that have appeared in the set of
potential next states.
Figure 6. A phone state in the recognition network with its label
To produce a list of unique labels, we
compact this lookup table to only include elements whose
flags are set. This reduces an expensive sorting problem to
a parallel hashing problem (see table 2). The lookup ta-
ble memory footprint is not significant. For our 50k-word
vocabulary, the lookup table only needs to support 78441
flags.
The flag-setting operation is parallelized by having each
thread look at one potential next state, and set the flag for
the label of that state. This causes memory write conflicts,
as different threads may try to modify a particular flag at the
same time. Although NVIDIA GPUs provide atomic mem-
ory accesses to manage write conflicts, the atomic access
implementation exposes the long memory latency to device
memory and severely impacts performance.
Table 2. Step 0: Find-unique implementation
execution time for a segment of speech
Algorithm Runtime
Sorting 2420.0 ms
Lookup table with atomic memory ops 157.1 ms
Lookup table with non-atomic memory ops 21.4 ms
Algorithmically, the flag-setting step only requires the in-
variant that flags for labels appearing in potential next states
are set. We leverage the non-atomic memory write seman-
tics of the NVIDIA GPUs, which guarantee that at least one
conflicting write to a device memory location will succeed
[12]. The success of any one thread in a write con-
flict situation can set the flag and satisfy the invariant; thus
we can use non-atomic operations to set the flags. As shown
in table 2, this dramatically improves the performance of
our find-unique implementation.
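A minimal sketch of the flag-setting kernel described above, assuming hypothetical names; the subsequent compaction of the flag table (a data-parallel scan followed by a scatter) is omitted here.

```cuda
// One thread per potential next state: mark the label of that state in the
// lookup table. Plain (non-atomic) writes suffice, because the only invariant
// needed is that every touched flag ends up set, and at least one of the
// conflicting writes to a device memory location is guaranteed to succeed.
__global__ void set_label_flags(const int *next_state_label,  // label of each potential next state
                                int num_next_states,
                                unsigned char *label_flag)    // one flag per possible label, zeroed beforehand
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < num_next_states)
        label_flag[next_state_label[i]] = 1;   // benign write conflict
}
```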
4.1.2 Step 1: Parallelizing observation probability
computation
As shown in table 1, observation probability computations
continue to dominate the beam search sequential runtime
even when we compute only the set of unique labels. For-
tunately, the computation for each unique label is indepen-
dent, and is thus trivially parallelizable. However, the num-
ber of unique labels varies drastically across beam search
iterations, from 500 to over 3000, as illustrated in figure 7.
To efficiently utilize the G80 architecture, one needs a min-
imum of around 5000 thread contexts to be active concur-
rently. We thus explore another level of parallelism in the
GMM computation, where each mixture model contains 16
to 128 mixtures that can be computed in parallel, as shown
in figure 4.
Figure 7. Graph of number of active states
over observation frames
The NVIDIA hardware granularity fits the GMM com-
putation well. Our implementation maps each GMM com-
putation to a thread block on the GPU, where each thread
in the block computes the distance between one Gaussian
mixture and the observation feature space. It is crucial that
threads in a thread block be able to synchronize, as the ob-
servation probability is computed by summing the weighted
distances from each Gaussian in the mixture. The summa-
tion is done using a tree-based reduction scheme as in [16]
using within-block synchronization.
Figure 8. Gaussian mixture model data layout
reformatting for coalesced access
Mapping the Gaussian mixtures to different threads in a
block also has implications on the data structures for stor-
ing the Gaussian mixture means and variances. For coa-
lesced memory access, we reorganized each GMM's mean
and variance arrays (one element per Gaussian in the GMM)
to be indexed by features, rather than by mixtures. Figure 8
shows the memory reorganization for each GMM. This re-
organization improves performance of this kernel by a fac-
tor of 3.
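The kernel below sketches this mapping: one thread block per unique label (GMM), one thread per mixture component, a feature-major layout of the means and variances so that the per-thread reads coalesce, and a shared-memory tree reduction for the weighted sum. The names and the linear-domain arithmetic are simplifying assumptions; the actual implementation also handles log-domain scores and the Gaussian normalization terms.

```cuda
#define MAX_MIXTURES 128   // assumed upper bound on components per GMM

// One block evaluates one GMM (one unique label); thread m handles mixture m.
__global__ void gmm_observation_prob(const float *mean,     // per GMM: [feature][mixture], feature-major
                                     const float *inv_var,  // same layout as mean
                                     const float *weight,   // per GMM: [mixture]
                                     const float *feature,  // current observation, [num_features]
                                     float *obs_prob,       // one result per GMM
                                     int num_features, int num_mixtures)
{
    __shared__ float partial[MAX_MIXTURES];
    int gmm = blockIdx.x;
    int m   = threadIdx.x;

    const float *gmean = mean    + (size_t)gmm * num_features * num_mixtures;
    const float *gvar  = inv_var + (size_t)gmm * num_features * num_mixtures;
    const float *gw    = weight  + (size_t)gmm * num_mixtures;

    float dist = 0.0f;
    if (m < num_mixtures) {
        for (int f = 0; f < num_features; ++f) {
            // Feature-major layout: threads m, m+1, ... read consecutive words.
            float d = feature[f] - gmean[f * num_mixtures + m];
            dist += d * d * gvar[f * num_mixtures + m];
        }
        partial[m] = gw[m] * expf(-0.5f * dist);
    } else {
        partial[m] = 0.0f;
    }
    __syncthreads();

    // Tree-based reduction of the weighted component likelihoods
    // (blockDim.x is assumed to be a power of two, at most MAX_MIXTURES).
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (m < stride)
            partial[m] += partial[m + stride];
        __syncthreads();
    }
    if (m == 0)
        obs_prob[gmm] = partial[0];
}
```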
4.2 Step 2: Next state likelihood compu-
tation
There are two types of next state likelihood computa-
tions: those using within-word transitions, and those using
word-to-word transitions. The within-word transitions con-
nect an active state to a small number of next states, whereas
word-to-word transitions connect an active state to the first
state of each word in the vocabulary.
Let us refer to an active state at the end of a word as an
"active-end-state", and to the first state in each word
as a "first state". An active-end-state is present in 80-99% of
the beam search iterations depending on the beam search con-
figuration. In these iterations all first states must be exam-
ined, leading to a strong incentive to arrange the underly-
ing data structure to facilitate coalesced memory accesses.
In the LVCSR language model, the transition between the
states in the transition graph is determined by the vocabu-
lary, but the layout of the states in memory can be optimized
for faster execution. In our implementation, as illustrated in
Figure 9, we pre-arrange the first states of each word in the
transition graph in consecutive memory locations. In the
beam search, when we limit the beam width to be at most
2% of the total number of states, the number of the first
states being evaluated accounts for more than half of the
total number of next states evaluated.
Figure 9. Illustration of memory layout used
for transition states on a small three word ex-
ample
Handling the word-to-word transitions is a complex pro-
cess that requires significant communication. For a 50k-
word vocabulary there are on average 10 active-end-states
in a beam search iteration, each of which can transition to
any of the 50k words in the vocabulary. For each of the 50k
× 10 = 500,000 transitions, one must check if there is
a word-to-word transition probability stored in the bigram
model. If the transition probability is not available, a de-
fault probability is computed with the unigram probability
of the first state and the backoff constant of the active-end-
state, as explained in [1].
A straightforward approach to implementing this transition
is to parallelize over the first states, where a thread performs
a lookup in the bigram model for word-to-word transitions
starting from every active-end-state to the first state associ-
ated with the thread. This kind of lookup, however, must ac-
cess the entire bigram transition table in uncoalesced fash-
ion in each beam search iteration, creating a memory bottle-
neck.
Our solution is an iterative two-step process, where we
iterate over the set of active-end-states and parallelize over
the first states within each iteration. In the first step, we
convert the sparse matrix representation of the bigram entries
corresponding to the active-end-state of an iteration into a
dense matrix representation. In the second step, we paral-
lelize over the first states by looking up the bigram word-to-
word transition probabilities in the dense matrix format and per-
forming the default-case computation if necessary. In this step,
the dense matrix format is instrumental in allowing the lookup
to be efficiently executed in a data-parallel fashion. This
approach also reduces the number of uncoalesced memory
accesses to about 1/1000 of the size of the bigram transition
table.
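The two kernels below sketch one iteration of this process for a single active-end-state; the names, the sparse-row layout, and the NO_BIGRAM sentinel are illustrative assumptions rather than the paper's actual code.

```cuda
#define NO_BIGRAM (-1.0e30f)   // sentinel meaning "no stored bigram entry"

// First step: scatter the sparse bigram row of one active-end-state into a
// dense array indexed by destination word (one thread per stored entry).
__global__ void expand_bigram_row(const int   *dest_word,    // destination-word index of each stored entry
                                  const float *bigram_logp,  // log probability of each stored entry
                                  int num_entries,
                                  float *dense_row)          // [vocab_size], pre-filled with NO_BIGRAM
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < num_entries)
        dense_row[dest_word[i]] = bigram_logp[i];
}

// Second step: one thread per first state; the dense_row lookups are coalesced.
__global__ void update_first_states(const float *dense_row,
                                    const float *unigram_logp,  // [vocab_size]
                                    float backoff_logp,         // backoff constant of the active-end-state
                                    float src_logp,             // path likelihood of the active-end-state
                                    float *first_state_logp,    // [vocab_size], best incoming score so far
                                    int vocab_size)
{
    int w = blockIdx.x * blockDim.x + threadIdx.x;
    if (w >= vocab_size)
        return;

    float trans = dense_row[w];
    if (trans == NO_BIGRAM)                    // default case: backoff times unigram
        trans = backoff_logp + unigram_logp[w];

    float score = src_logp + trans;            // log-domain accumulation
    if (score > first_state_logp[w])           // keep the most likely predecessor
        first_state_logp[w] = score;
}
```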
Figure 10. Illustration of state transition
Figure 11. Illustration of bigram computation
4.3 Next state pruning
The pruning process limits the number of active states
to the most promising states. We prune based on the beam
threshold as defined in [1] and also cap the number of active
states below a hard limit to fit in fixed-size arrays. This is
necessary to avoid memory allocation in the timing sensi-
tive loops.
The maximum likelihood in each beam search iteration
is computed with an efficient two-level reduction. We use
the beam threshold to prune based on likelihood, as thresh-
olding obviates the need for sorting. When there are more
states than the capacity of the fixed-size array, we iteratively re-
duce the beam threshold until the pruned solution fits within
the fixed array limit.
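The control policy described above can be sketched on the host side as follows; the function names and the tightening schedule are assumptions (in the implementation, the maximum is found with the two-level reduction of step 3a and the counting and compaction run on the GPU).

```cuda
#include <math.h>

// Count how many candidate states survive a given log-likelihood cutoff.
static int count_surviving(const float *log_likelihood, int n, float cutoff)
{
    int kept = 0;
    for (int i = 0; i < n; ++i)
        if (log_likelihood[i] >= cutoff)
            ++kept;
    return kept;
}

// Choose a pruning cutoff: start 'beam' below the best score, then tighten
// the threshold until the surviving set fits the fixed-size active-state arrays.
float choose_pruning_cutoff(const float *log_likelihood, int n,
                            float beam, int max_active)
{
    float best = -INFINITY;                  // on the GPU this is a two-level max reduction
    for (int i = 0; i < n; ++i)
        if (log_likelihood[i] > best)
            best = log_likelihood[i];

    float cutoff = best - beam;
    while (count_surviving(log_likelihood, n, cutoff) > max_active)
        cutoff += 0.1f * beam;               // assumed tightening step; the paper does not specify one
    return cutoff;
}
```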
4.4 Final algorithm
The final implementation is a four-step algorithm optimized
for data-parallel execution, shown in Figure 12. Steps 0 and
1 implement the observation probability computation,
step 2 implements the next state likelihood computation,
and step 3 implements the max reduction and state pruning
phase.
There exists some task-level parallelism in a beam search
iteration, but the most significant parallelization potential
lies in data-level parallelism. All steps in our implemen-
tation use algorithms and data structures parallelizable to
thousands of processors, enabling the continued scaling of
the algorithm for future generations of many-core proces-
sors.
5. Results
We performed our experiments on an NVIDIA GeForce
8800 GTX running in a PC with a 2.66GHz Intel Core2
Duo, 8 GB of main memory and using a Linux 2.6.22 ker-
nel. The main properties of this GPU are listed in table 3.
The GPU used is a widely available $500 consumer plat-
form that can be installed in any desktop PC.
The inference engine recognizes human speech by pro-
cessing sequences of 10ms speech frames one frame at a
time and producing probabilistic estimates of the under-
lying word sequence based on the language knowledge
sources encoded in the recognition network. We used the
standard Mel-frequency cepstral coefficients (MFCC) and
their derivatives [1] as the feature vector to associate with
each frame.
There are two language models used to test our infer-
ence routine. The properties of the language models are
shown in Table 4. The large-vocabulary acoustic models
were trained on about 250 hours of conversational telephone
speech data from the NIST Hub-5 English Large Vocabu-
lary Continuous Speech Recognition evaluation. The test
set was taken from the NIST 2001 Speech Transcription
Workshop. The small-vocabulary models were trained and
tested on the OGI Numbers95 database of spoken digit se-
quences.
Table 3. GPU parameters
Properties 8800 GTX
Number of multiprocessors (SMs) 16
Multiprocessor width 8
Local memory per SM 16 KB
# of processors 128
Off-chip memory 768 MB
Table 4. Test case properties
Properties Small Case Large Case
# of words 41 50710
# of states 525 996903
# of mixtures in GMM 128 16
# of unique phones 84 78438
# of bigram edges 600 552531
We validated our inference engine by comparing it
with the results of HVite [3], an inference engine in the
HTK package. Table 5 shows that both inference engines
achieved an error rate of around 16%, and the accuracy dif-
ference between the two is within 1%.
The inference engine was tested on the two recognition
models of table 4. The first of these language models is a
Figure 12. Detailed task graph of the final implementation, and step-wise data-parallelism
Table 6. Performance result for 42 utterances totaling 162.9 seconds in length
Properties CPU runtime % CPU+GPU runtime % Speedup
Step 0: Collect Unique Active Labels 12590.6 ms 5% 3924.8 ms 15% 3.21x
Step 1: Compute Observation Probabilities 114477.4 ms 49% 5984.6 ms 23% 19.13x
Step 2: Update Next State Likelihoods 82609.6 ms 35% 6989.6 ms 27% 11.82x
Step 3: Reduce and Prune States 24981.7 ms 11% 7817.0 ms 30% 3.20x
Sequential (Loop + Control) Overheads 214.5 ms 0.1% 1338.0 ms 5%
Total 234874.8 ms 26054.0 ms 9.01x
(Utterance Length / Recognition Time) 0.69x 6.25x
Table 5. Accuracy validation
Error rate (%) HVite Our Work
Total error 15.52 15.88
Substitution error 9.05 9.44
Deletion error 1.29 0.86
Insertion error 5.17 5.58
small example which is much faster than real-time even on
a standard desktop 2.66 GHz Intel Core 2 CPU. Thus it is
unnecessary to speed it up on a GPU. We only show the
performance of the GPU implementation on the larger lan-
guage model. We tested our engine on a set of 42 speech
segments from the NIST 2001 Speech Transcription Work-
shop. Table 6 shows the performance of the LVCSR infer-
ence engine running as a single thread compiled with In-
tel C++ Compiler (ICC) version 10.1.008 on the CPU, and
compares it with a parallel version running on the NVIDIA
8800 GTX GPU. The test was run with a beam threshold
of 150 (as dened in [1]) and an active state cap of 20,000
states.
The Core2 Duo CPU has 4-wide SIMD running at
2.66GHz, and the GeForce 8800 GTX GPU has 128 hard-
ware threads running at 1.35GHz. Adjusting for frequency
and hardware thread count, the GPU has (128 × 1.35)/(4 ×
2.66) = 17.5× the peak computation throughput of the CPU.
Assuming that both CPU and GPU fully utilize their re-
spective SIMD units, we should see a 17.5x boost in per-
formance for compute limited kernels.
In table 6, we see a 19x speedup for step 1, the most com-
putationally intensive step, which is higher than expected.
Although we enabled the vectorization option for SSE3 in
ICC and designed the data structures to be vectorizable, we
found that ICC is not as effective in utilizing the CPU 4-
wide SIMD unit as the CUDA compiler in utilizing the 128
GPU hardware threads.
In our data parallel implementation, we minimized the
sequential overhead to be less than 0.1% of the total se-
quential runtime. This gives an upper bound of 1000x on
the speedup over sequential code, such that the application
is not bottlenecked by Amdahl's law. In the parallel im-
plementation we see that steps 1 and 2, the two most significant
steps in the single-thread implementation, were significantly
improved. The sequential overhead in the CPU+GPU par-
allel implementation increased slightly, as small amounts of
data are copied between GPU and CPU.
Figure 13 shows the detailed breakdown of the ker-
nel execution times collected using the NVIDIA CUDA Visual
Profiler 1.0.04. Each row illustrates data collected for a
CUDA kernel, and the execution time and speedup numbers
are analyzed with respect to a variety of metrics to help ex-
plain why the speedup observed may not be the expected
17.5x. IPC is instructions per cycle, computed from the
number of dynamic instructions logged during the execu-
tion time of the kernel. The kernel overhead is the overhead
incurred when launching a kernel, and is measured by the
CUDA Profiler. The speedup without accounting for ker-
nel overhead indicates the effectiveness of our data-parallel
implementation, without the artifacts of the GeForce 8800
GTX platform. The percentage of coherent memory ac-
cesses is computed from the number of coherent/incoherent
loads/stores provided by the CUDA Profiler. The instruc-
tions per memory access column shows how memory in-
tensive the kernel implementation is. We now analyze the
kernels using these metrics.
Step 1 is the largest kernel by execution time. We ob-
tained an IPC of more than 90, despite 3% of the mem-
ory accesses being uncoalesced. This is possible because
the step has a high ratio of instructions per memory access
(IPA), such that the long memory latencies can be hidden
with computation. Step 2b2 also reached a high IPC of 95.
Although it has a relatively low ratio of IPA, the high IPC
is achieved because it has fully coalesced memory accesses,
which helps to prevent memory bandwidth from becoming
a bottleneck.
The scan steps in 0b, 3c1 and 3c2 make up 20% of to-
tal execution time. The data-parallel implementation only
achieved a 1.6x speedup for three reasons: 1) the se-
quential version is very work-efficient and touches every el-
ement to be scanned just once, whereas the data-parallel
version leverages the scan implementation from the CUDA
Performance Primitive (CUDPP) library 1.0a, which uses
an O(n) algorithm that requires a lot more than n opera-
tions for n inputs; 2) the need for global synchronization
in a scan routine requires multiple kernel invocations, thus
adding signicant kernel call overhead penalties; and 3) the
implementation of scan in CUDPP is still a subject of re-
search, and the current implementation may not be optimal.
Step 3a uses a two-level reduction approach which is a
different algorithm compared to the sequential implemen-
tation. The second level is implemented sequentially on
the CPU to avoid kernel overhead penalty, but this involves
transfer of the intermediate results back to the CPU, reduc-
ing the effective speedup to 4.5X.
The rest of the kernels fall under two groups: 1) limited
by kernel overhead, and 2) limited by uncoalesced memory
accesses. Steps 0a, 0c, 2b0, 3c3, 3c4, 3d1, and 3d2 fall
into the first group. These are simple kernels that are only a
few lines long, where we would obtain speedups of around
10X if we discount the kernel overhead. The second group
of kernels includes steps 0d1, 0d2, 1a, 2b1, 2c, 3b0 and 3b1,
where the kernels have low IPA ratio and are hence memory
bound, while at the same time the algorithm requires unco-
alesced memory accesses, aggravating the memory bottle-
neck.
We obtained an overall speedup of 9.0X for the GPU
over the CPU. Our key result is that the GPU implemen-
tation is about 6.25X better than the performance required
for real-time recognition.
Pushing the performance beyond what is required for real-
time recognition is crucial since this allows us to handle
more complex language models and increase the accuracy
of recognition by using multi-pass inference algorithms un-
der real time constraints. Further, speech recognition could
be just one of the tasks in an application like language
translation with spoken inputs, which has real-time require-
ments.
6. Discussions on further optimizations
In this paper, we establish a baseline infrastructure for
data-parallel beam search in LVCSR on a standard HMM
implementation. We found that the most signicant compu-
tation bottleneck is in the GMM computation for observa-
tion probabilities, and the most signicant communication
bottleneck is in the bigram transition computation. These
are the same performance bottlenecks seen in sequential
implementations. There is a large body of prior work on
optimization techniques to mitigate these bottlenecks on se-
quential platforms. We discuss their applicability for a data-
parallel implementation here.
To reduce the amount of GMM computation, tech-
niques proposed in literature include Bucket Box Intersec-
tion (BBI)[17], Subspace Distribution Clustering for Con-
tinuous observation density HMM (SDCHMM)[18], and
Gaussian selection[19]. From a high level point of view,
these techniques trade off computation with decision trees
or lookup tables to reduce execution time. While these
are valid trade-offs for current sequential platforms, im-
plementing these optimizations on a data-parallel platform
with SIMD execution units would introduce uncoalesced
memory accesses and divergent branches in instruction
streams, severely limiting overall instruction throughput.
The benefit of these techniques is reported to be limited to a
2-5x reduction in the GMM computation. Since we find in
our experiments that execution time on SIMD-based data-
parallel platforms can vary by an order of magnitude in the
presence of uncoalesced memory accesses and divergent
branches, it seems unlikely that these optimiza-
tions will lead to runtime improvements.
Figure 13. Detailed characterization of the data-parallel LVCSR implementation
To improve the recognition network representation for
more effective most-likely word sequence searches, the
Weighted Finite State Transducer (WFST)[20] model has
been recently proposed to allow a set of FSM optimization
techniques such as factorization and determinization to be
used to minimize the state space. This is achieved by us-
ing a Mealy machine representation for the state transition
graph, where word sequences are emitted from the state
transition edges, rather than the standard Moore machine
representation used in our work, where word sequences are
emitted from the states. Since the recognition network op-
timizations are performed at compile time, the optimization
techniques are transparent to the inference engine, and will
equally benefit sequential and data-parallel implementa-
tions. We are pursuing WFST as a promising extension of
our work.
7. Conclusions
Automatic speech recognition is a key technology for
enabling rich human-computer interaction in emerging ap-
plications. This paper explores the concurrency present in
continuous speech recognition with large vocabulary mod-
els and presents an inference engine implementation on an
NVIDIA many-core GPU using the CUDA programming
language.
We establish a baseline infrastructure for data-parallel
beam search in LVCSR on a standard HMM implementa-
tion. Our results show a 9.01X speedup over a sequential
implementation on CPU. This corresponds to going from
0.69X the real-time requirement to over 6.25X the real-time
requirement. Pushing the performance beyond real time
is crucial for speech recognition since we can then han-
dle more complex language models and can incorporate
more accurate multi-pass inference engines for more accu-
rate recognition. Further, we expect that speech recognition
will be just one step in exciting new applications with real-
time requirements such as language translation.
We are exploring an extension to handle WFST-based
recognition networks and would like to explore the poten-
tial speedups on other many-core processors as they become
available.
References
[1] X. Huang, A. Acero, and H.-W. Hon, Spoken Lan-
guage Processing: A Guide to Theory, Algorithm and
System Development. Prentice-Hall, 2001.
[2] K.-F. Lee, H.-W. Hon, and R. Reddy, An overview of
the SPHINX speech recognition system, Acoustics,
Speech, and Signal Processing, IEEE Transactions on,
vol. 38, no. 1, pp. 35-45, Jan. 1990.
[3] P. Woodland, J. Odell, V. Valtchev, and S. Young,
Large vocabulary continuous speech recognition us-
ing HTK, IEEE Intl. Conf. on Acoustics, Speech, and
Signal Processing, ICASSP 1994, vol. 2, pp. 125-128,
Apr. 1994.
[4] A. Lee, T. Kawahara, and K. Shikano, Julius: an
open source real-time large vocabulary recognition en-
gine, in Proc. European Conf. on Speech Commu-
nication and Technology (EUROSPEECH), 2001, pp.
1691-1694.
[5] M. Ravishankar, Parallel implementation of fast
beam search for speaker-independent continuous
speech recognition, 1993.
[6] K. Agaram, S. Keckler, and D. Burger, A charac-
terization of speech recognition on modern computer
systems, Workload Characterization, 2001. WWC-4.
2001 IEEE International Workshop on, pp. 45-53, 2
Dec. 2001.
[7] P. Dixon, D. Caseiro, T. Oonishi, and S. Furui, The
Titech large vocabulary WFST speech recognition sys-
tem, Automatic Speech Recognition & Understand-
ing, 2007. IEEE Workshop on, pp. 443-448, 9-13 Dec.
2007.
[8] H.-W. Hon, A survey of hardware architectures de-
signed for speech recognition, CMU, Tech. Rep.,
1991.
[9] R. Krishna, S. Mahlke, and T. Austin, Architectural
optimizations for low-power, real-time speech recog-
nition, in CASES 03: Proceedings of the 2003 in-
ternational conference on Compilers, architecture and
synthesis for embedded systems. New York, NY,
USA: ACM, 2003, pp. 220-231.
[10] Intel plans powerful multicore x86 CPU, http://www.pcworld.com/businesscenter/article/130815/
intel_plans_powerful_multicore_x86_cpu.html.
[11] J. Nickolls, I. Buck, K. Skadron, and M. Gar-
land, CUDA: Scalable parallel programming, ACM
Queue, Apr 2008.
[12] NVIDIA CUDA Programming Guide, NVIDIA Cor-
poration, Nov. 2007, version 1.1. [Online]. Available:
http://www.nvidia.com/CUDA
[13] J. A. Stratton, S. S. Stone, and W. W. Hwu, M-
CUDA: An efficient implementation of CUDA ker-
nels on multi-cores, UIUC, IMPACT Technical Re-
port IMPACT-08-01, Feb. 2008.
[14] G. E. Blelloch, Vector models for data-parallel com-
puting. Cambridge, MA, USA: MIT Press, 1990.
[15] CUDPP: CUDA data parallel primitives library,
http://www.gpgpu.org/developer/cudpp/.
[16] Optimizing Parallel Reduction with CUDA,
http://developer.download.nvidia.com/compute/cuda/
1_1/Website/projects/reduction/doc/reduction.pdf.
[17] J. Fritsch and I. Rogina, The bucket box intersection
(BBI) algorithm for fast approximative evaluation of
diagonal mixture Gaussians, in Proc. ICASSP 96,
Atlanta, GA, 1996, pp. 837-840. [Online]. Available:
citeseer.ist.psu.edu/fritsch96bucket.html
[18] E. Bocchieri and B. Mak, Subspace distribution
clustering for continuous observation density hidden
markov models, in Proc. Eurospeech 97, Rhodes,
Greece, 1997, pp. 107-110. [Online]. Available:
citeseer.ist.psu.edu/bocchieri97subspace.html
[19] K. Knill, M. Gales, and S. Young, Use of Gaussian se-
lection in large vocabulary continuous speech recogni-
tion using HMMs, Proc. Fourth Intl. Conf. on Spoken
Language, ICSLP 96, vol. 1, pp. 470-473, 3-6
Oct 1996.
[20] M. Mohri, F. Pereira, and M. Riley, Weighted fi-
nite state transducers in speech recognition, Com-
puter Speech and Language, vol. 16, pp. 69-88, 2002.
