
PHONETIC SPEAKER RECOGNITION USING MAXIMUM-LIKELIHOOD
BINARY-DECISION TREE MODELS

Jiří Navrátil (1), Qin Jin (2), Walter D. Andrews (3), Joseph P. Campbell (4)

(1) IBM T.J. Watson Research Center, Yorktown Heights, NY 10598, [email protected]
(2) Carnegie Mellon University, NSH 2602J, Pittsburgh, PA 15213, [email protected]
(3) Department of Defense, Ft. Meade, MD 20755, [email protected]
(4) MIT Lincoln Laboratory, Lexington, MA 02420, [email protected]

ABSTRACT

Recent work in phonetic speaker recognition has shown that modeling phone sequences using n-grams is a viable and effective approach to speaker recognition, primarily aiming at capturing speaker-dependent pronunciation and also word usage. This paper describes a method involving binary-tree-structured statistical models for extending the phonetic context beyond that of standard n-grams (particularly bigrams) by exploiting statistical dependencies within a longer sequence window without exponentially increasing the model complexity, as is the case with n-grams. Two ways of dealing with data sparsity are also studied; namely, model adaptation and a recursive bottom-up smoothing of symbol distributions. Results obtained under a variety of experimental conditions using the NIST 2001 Speaker Recognition Extended Data Task indicate consistent improvements in equal-error rate performance as compared to standard bigram models. The described approach confirms the relevance of long phonetic context in phonetic speaker recognition and represents an intermediate stage between short phone context and word-level modeling without the need for any lexical knowledge, which suggests its language independence.

1. INTRODUCTION

In recent years, the research area of automatic speaker recognition has seen an increased interest in utilizing sources of high-level speaker-discriminative information in order to complement the widely and successfully used frame-by-frame approaches, which exploit only short-time acoustic information from the speech signal. Motivated by the work of Doddington [3] on modeling idiolectal differences among speakers by means of word n-grams, Andrews et al. [1] investigated n-gram modeling of phonetic units in sequences automatically obtained from multiple phone recognizers for speaker verification. The results of this work indicate that such models effectively capture speaker characteristics complementary to the short-time acoustic information, relating particularly to speaker-specific pronunciation of words as well as word idiolect. Because the phone recognizers do not apply any constraints on the search space during decoding (such as grammars or pronunciation baseforms), pronunciation differences propagate through the decoder and are reflected in variations of phones and their ordering. Viewed statistically, for example in terms of n-gram probabilities, these variations can be observed as varying statistical dependencies between phone tokens in the sequence and offer themselves for speaker-dependent modeling. The performance reported in [1] suggests that these dependencies indeed carry a substantial speaker-dependent component.

Phonetic speaker modeling using n-grams, however, comes with a burden: in order to capture dependencies within a reasonably long time window, the model order needs to be chosen correspondingly high, incurring an exponential growth in the number of parameters. This leaves three options: 1) provide sufficiently large amounts of training data for each speaker, 2) decrease the model order, or 3) use smoothing techniques. Smoothing techniques have been used extensively in n-gram modeling; however, the model order in practice is still limited to 2 (i.e., trigrams) or 1 (bigrams). Another weakness of the n-gram model is its rigid structure, i.e., the way contexts (or histories) of the modeled tokens are partitioned. For example, while certain phones preceding a token may belong to a common category and hence would ideally be members of the same (history) partition, the n-gram assigns these to separate partitions solely because of their different labeling.

In this paper, we introduce a binary-tree modeling approach applied to phonetic speaker recognition that allows for exploiting dependencies from longer contexts than those of typical n-grams while keeping the number of free parameters under control. This binary-decision tree structure is optimized using a maximum-likelihood training criterion and provides flexible context clustering. Tree-structured models have previously been applied successfully in language and speech recognition [6, 2]. To deal with limited training data and robustness issues, we also introduce an adaptation step in creating the tree models as well as a recursive smoothing technique.

2. BASELINE SYSTEM

N-grams are a standard language modeling technique that approximates the probability of occurrence of a spoken utterance A, represented by a sequence of tokens (in our case decoded phones) a_1, ..., a_T, up to the (N-1)-th order. Thus, bigram models (N = 2) imply the assumption that the probability of occurrence of a token depends solely on the immediately preceding token. Because the n-gram model complexity increases exponentially with the order, we restrict our considerations to bigrams and trigrams for both the speaker and the background models.

The baseline system is a basic log-likelihood ratio detector. Five language- and gender-dependent open-loop phonetic recognizers are used to generate multiple language phone sequences that represent multiple views of the input speech signal [1]. Phonetic speaker recognition is performed in three steps. First, the five phone recognizers process the test speech utterance to produce multiple phonetic sequences. Then, the test sequence from each phone recognizer is compared to the hypothesized speaker model and to a speaker-independent Universal Background Phone Model (UBPM) corresponding to the appropriate phone recognizer [1, 4]. Finally, the scores from the hypothesized speaker models and the UBPM are combined to form log-likelihood ratio (LLR) scores, again one per phone recognizer. The five LLR scores are then fused together, producing a single weighted score. The LLR score of A, given a hypothesized speaker model M and a single phone recognizer, is calculated as

    LLR(A|M) = \log P(A|M) - \log P(A|UBPM)    (1)
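For concreteness, the scoring in (1) and the five-way fusion can be sketched as follows. This is our illustration rather than the authors' code: the bigram dictionary representation, the function names, and the probability floor are assumptions.

```python
import math

def sequence_log_prob(model, phones):
    """Log-probability of a decoded phone sequence under a bigram model.

    `model` is assumed to map (previous_phone, phone) -> probability;
    a small floor keeps unseen bigrams from zeroing out the score.
    """
    logp = 0.0
    for prev, cur in zip(phones[:-1], phones[1:]):
        logp += math.log(model.get((prev, cur), 1e-6))
    return logp

def llr_score(speaker_model, ubpm, phones):
    """Eq. (1): log P(A|M) - log P(A|UBPM) for one phone recognizer."""
    return sequence_log_prob(speaker_model, phones) - sequence_log_prob(ubpm, phones)

def fused_score(speaker_models, ubpms, phone_streams, weights):
    """Combine the per-recognizer LLR scores into a single weighted score."""
    scores = [llr_score(m, u, a) for m, u, a in zip(speaker_models, ubpms, phone_streams)]
    return sum(w * s for w, s in zip(weights, scores))
```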
[Figure 1: a binary decision tree over the predictors a_{t-4}, ..., a_{t-1} of the modeled token a_t. Nonterminal nodes ask questions such as "Is a_{t-2} in {a, ae, aI}?" and route to N/Y child nodes; each leaf stores a symbol distribution p(a_t | path).]
Figure 1. An example structure of a binary tree model

3. BINARY-DECISION TREE MODELING

Let us consider a token sequence a_1, ..., a_T representing a decoded utterance of a speaker and a particular token a_t in that sequence. The quality of a model with respect to a_t is measured by its power in predicting a_t from a certain context, represented by a set of predictor variables X. These may be chosen according to some prior knowledge and are typically selected to be the N time slots preceding a_t, i.e., a_{t-N}, ..., a_{t-1}. We now seek a model with a good overall quality in predicting individual tokens from their respective contexts. For this purpose, we apply binary-decision trees (BT), which provide a versatile and flexible structure. Figure 1 illustrates the function of such a model, consisting of nonterminal nodes associated with a binary question leading to either of two child nodes, and terminal nodes (leaves) that contain symbol distributions. Certain selected N time slots of the phone sequence are denoted predictors, X_1, ..., X_N, and are taken into account in the binary questions. The probability of a token a_t given its context can be obtained from the BT model by successively using the appropriate predictor values to answer the binary questions at each node until a leaf node with a symbol distribution is reached, as exemplified in Figure 1. Obviously, for a given sequence the predictor values determine the path through the tree structure and thus effectively determine the distribution to be used for a_t. Hereby a variable context clustering is easily achieved by including multiple predictor values into the subsets at each node.
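To make the leaf lookup concrete, the following is a minimal sketch (our illustration, with assumed field and function names, not the authors' implementation) of a BT node and the retrieval of p(a_t | context).

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set

@dataclass
class BTNode:
    # Nonterminal nodes carry a question "is predictor X_k in subset S?";
    # leaves carry a symbol distribution estimated from their data partition.
    predictor: Optional[int] = None          # k, meaning the k-th preceding token a_{t-k}
    subset: Optional[Set[str]] = None        # phone subset S for the question
    yes: Optional["BTNode"] = None
    no: Optional["BTNode"] = None
    distribution: Optional[Dict[str, float]] = None  # leaf only: P_l(s)

def find_leaf(root: BTNode, context: List[str]) -> BTNode:
    """Follow the binary questions using the predictor values until a leaf is reached."""
    node = root
    while node.distribution is None:
        value = context[-node.predictor]     # context ordered oldest -> newest
        node = node.yes if value in node.subset else node.no
    return node

def token_probability(root: BTNode, token: str, context: List[str], floor: float = 1e-6) -> float:
    """p(a_t | context), read from the leaf distribution selected by the context."""
    return find_leaf(root, context).distribution.get(token, floor)
```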
To determine the tree structure and parameters, some applications, such as acoustic context modeling in speech recognition [7], are motivated by linguistically based schemes designed by an expert. Because the BT structure can vary from speaker to speaker and no straightforward rules can be determined for phone sequences a priori for speaker modeling, a fully data-driven BT building algorithm appears necessary. We seek to create a speaker model with the objective of attaining a high average prediction power, which is expressed by means of the average prediction entropy of the BT leaf distributions. Here, low entropy (e.g., predicting unique symbols) corresponds to the desirable prediction property, and vice versa (e.g., predicting all symbols with the same probability). Defining the entropy for a distribution over a symbol set A at a leaf l as

    H_l = - \sum_{s_i \in A} P_l(s_i) \log_2 P_l(s_i)    (2)

the average BT prediction entropy is then

    H = \sum_l P_l \cdot H_l    (3)

with P_l denoting the prior probability of visiting the leaf l, and P_l(s) the probability of observing a symbol s at that leaf. The measure (3) is to be minimized in the course of building the BT model. During this process the probabilities P_l(s_i) and P_l are not known and have to be replaced by estimates, \hat{P}_l(s_i) and \hat{P}_l, obtained from a training sample a_1, ..., a_T. Assuming a BT model structure with certain parameters leads to a partitioning of the training data into L leaves, each containing a data partition \alpha_l, the sample distribution estimates are calculated from counts as follows:

    \hat{P}_l(s_i) = #(s_i|\alpha_l) / |\alpha_l|    (4)

    \hat{P}_l = |\alpha_l| / \sum_{l=1}^{L} |\alpha_l|    (5)

with #(s_i|\alpha_l) being the count of s_i at leaf l, and |\alpha_l| the total symbol count at l. On the other hand, the average training-data likelihood given a BT model can be computed as

    L = (1/T) \sum_{t=1}^{T} \log_2 P(a_t|BT)    (6)
      = \sum_{l=1}^{L} \hat{P}_l \sum_{s_i \in A} \hat{P}_l(s_i) \log_2 P_l(s_i)    (7)

Thus, by replacing P_l(s_i) with the estimate (4), the measures (3) and (7) are related by

    L = -H    (8)

Hence, building the BT model so as to minimize the overall prediction entropy identically maximizes the likelihood of the training data.
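As a small worked example under the notation above (our sketch, not taken from the paper), the plug-in estimates (4)-(5) and the average entropy (3) can be computed directly from leaf counts; with these estimates inserted, the average log-likelihood (7) is exactly -H, which is the content of (8).

```python
import math

def leaf_estimates(leaf_counts):
    """leaf_counts: {leaf_id: {symbol: count}} -> (P_hat_l, P_hat_l(s)) per eqs. (4)-(5)."""
    total = sum(sum(c.values()) for c in leaf_counts.values())
    priors, dists = {}, {}
    for l, counts in leaf_counts.items():
        n_l = sum(counts.values())
        priors[l] = n_l / total                              # eq. (5)
        dists[l] = {s: k / n_l for s, k in counts.items()}   # eq. (4)
    return priors, dists

def average_entropy(priors, dists):
    """Eq. (3): H = sum_l P_l * H_l, with H_l from eq. (2)."""
    return sum(p * -sum(q * math.log2(q) for q in dists[l].values() if q > 0)
               for l, p in priors.items())

# Toy counts for two leaves (hypothetical numbers, for illustration only):
counts = {"leaf0": {"a": 8, "ae": 2}, "leaf1": {"t": 5, "d": 5}}
priors, dists = leaf_estimates(counts)
H = average_entropy(priors, dists)
avg_loglik = sum(priors[l] * sum(dists[l][s] * math.log2(dists[l][s]) for s in dists[l])
                 for l in dists)
assert abs(avg_loglik + H) < 1e-9   # L = -H, eq. (8)
```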
The remaining problem of finding an optimum tree structure and the corresponding node questions is solved by applying a greedy search algorithm at each node, combined with a recursive procedure for creating tree nodes. To limit the otherwise extensive search space, we restrict the binary questions to be elementary expressions involving a single predictor, rather than allowing for composite expressions. Stated in principal steps, the tree-building algorithm is as follows:

1. Let c be the current node of the tree. Initially c is the root.

2. For each predictor variable X_i (i = 1, ..., N), find the subset S_c^i which minimizes the average conditional entropy of the symbol distribution Y at node c:

    H_c(Y | "X_i \in S_c^i?") = - P(X_i \in S_c^i | c) \sum_{s_j \in A} P(s_j | c, X_i \in S_c^i) \log_2 P(s_j | c, X_i \in S_c^i)
                                - P(X_i \notin S_c^i | c) \sum_{s_j \in A} P(s_j | c, X_i \notin S_c^i) \log_2 P(s_j | c, X_i \notin S_c^i)    (9)

   where S_c^i denotes a subset of phones at node c.

3. Determine which of the N questions derived in Step 2 leads to the lowest entropy. Let this be question k, i.e.,

    k = \arg\min_i H_c(Y | "X_i \in S_c^i?")

4. The reduction in entropy at node c due to question k is

    R_c(k) = H_c(Y) - H_c(Y | "X_k \in S_c^k?"),

   where

    H_c(Y) = - \sum_{s_j \in A} P(s_j | c) \log_2 P(s_j | c).

   If this reduction is "significant," store question k, create two descendant nodes, c_1 and c_2, pass them the data corresponding to the conditions X_k \in S_c^k and X_k \notin S_c^k, and repeat Steps 2-4 for each of the new nodes separately.

Simply stated, the algorithm seeks a data split at each node such that the average entropy of the two data subsets resulting from the split significantly reduces the entropy of the total data before the split. The entropy reduction is considered significant relative to some threshold, which effectively determines the size of the tree model to be grown.

In order to determine the phone subset S_c^i in Step 2, the following greedy algorithm was applied:

1. Let S be empty.

2. Insert into S the phone a \in A (A being the phonetic set) which leads to the greatest reduction in the average conditional entropy (9). If no a \in A leads to a reduction, make no insertion.

3. Delete from S any member a if doing so leads to a reduction in the average conditional entropy.

4. If any insertions or deletions were made to S, return to Step 2.
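The following sketch is our reading of the node-splitting procedure and the greedy subset search, not the released implementation. The significance threshold and the occupancy constraint (introduced in the next paragraph) are passed in as assumed parameters, and the cross-evaluation refinement is omitted for brevity.

```python
import math
from collections import Counter

def entropy(counter):
    n = sum(counter.values())
    return -sum((c / n) * math.log2(c / n) for c in counter.values() if c) if n else 0.0

def split_entropy(samples, predictor, subset):
    """Average conditional entropy of eq. (9); samples = [(context_tuple, token), ...]."""
    yes = Counter(tok for ctx, tok in samples if ctx[-predictor] in subset)
    no = Counter(tok for ctx, tok in samples if ctx[-predictor] not in subset)
    n = len(samples)
    return (sum(yes.values()) / n) * entropy(yes) + (sum(no.values()) / n) * entropy(no)

def best_subset(samples, predictor, phone_set):
    """Greedy insert/delete search for the phone subset S_c^i (the Step-2 subroutine)."""
    subset = set()
    while True:
        changed = False
        current = split_entropy(samples, predictor, subset)
        # Insert the single phone giving the greatest entropy reduction, if any.
        gains = {a: current - split_entropy(samples, predictor, subset | {a})
                 for a in phone_set - subset}
        if gains and max(gains.values()) > 0:
            subset.add(max(gains, key=gains.get))
            changed = True
        # Delete any member whose removal reduces the entropy further.
        current = split_entropy(samples, predictor, subset)
        for a in list(subset):
            if split_entropy(samples, predictor, subset - {a}) < current:
                subset.discard(a)
                current = split_entropy(samples, predictor, subset)
                changed = True
        if not changed:
            return subset

def grow(samples, phone_set, n_predictors, threshold, min_occupancy):
    """Recursive node creation (Steps 1-4), returning the tree as nested dicts."""
    node_entropy = entropy(Counter(tok for _, tok in samples))
    candidates = [(best_subset(samples, k, phone_set), k) for k in range(1, n_predictors + 1)]
    subset, k = min(candidates, key=lambda c: split_entropy(samples, c[1], c[0]))
    yes = [(c, t) for c, t in samples if c[-k] in subset]
    no = [(c, t) for c, t in samples if c[-k] not in subset]
    reduction = node_entropy - split_entropy(samples, k, subset)
    # Stop if the reduction is not "significant" or either partition is too small.
    if reduction < threshold or min(len(yes), len(no)) < min_occupancy:
        return {"distribution": Counter(t for _, t in samples)}   # leaf
    return {"predictor": k, "subset": subset,
            "yes": grow(yes, phone_set, n_predictors, threshold, min_occupancy),
            "no": grow(no, phone_set, n_predictors, threshold, min_occupancy)}
```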
In addition to the significance criterion in Step 4 of the tree building, we implement an occupancy constraint. Systematic data sparseness may occur with too low significance thresholds due to the recursive partitioning, thus leading to poor entropy estimates and consequently to overtraining. The occupancy constraint is applied in evaluating each potential split during the search, discarding split hypotheses not fitting this constraint. Furthermore, in order to prevent overtraining by modeling training-data particularities, we apply cross-evaluation in Step 4 of the tree growing, using a separate held-out set when computing the reduction R_c(k).

3.1. Data Sparseness Issues

Applying the leaf occupancy constraint causes the BT models to grow adaptively with respect to not only the intrinsic data properties, but also the data set size. The latter may become a problem with small training data amounts, for which the growing process may terminate with only a few leaf nodes, resulting in extremely coarse models.

Furthermore, even in sufficiently large training sets, sparseness of symbols in certain contexts may exist. In the following, we describe two approaches to mitigate these problems.

3.1.1. Leaf Adaptation

In case of sparse training data for an individual speaker, a speaker-independent (SI) BT model built from sufficient amounts of data can provide a robust tree structure (i.e., the nonterminal nodes) as a fixed basis for creating the speaker model by adaptation. Herein, the speaker training set is partitioned according to the fixed structure and the leaf distributions are updated using the new partitions. Let Y_0 = {\hat{P}_l(s_j)}_{s_j \in A} denote a leaf distribution estimate of the SI model, #(s_j|\alpha_l) the count of s_j tokens in the leaf partition \alpha_l, and |\alpha_l| the leaf token count of the speaker data. The updated leaf distribution Y_1 = {\hat{P}'_l(s_j)}_{s_j \in A} is then calculated as a linear interpolation

    \hat{P}'_l(s_j) = [ b_j #(s_j|\alpha_l)/|\alpha_l| + (1 - b_j) \hat{P}_l(s_j) ] / D    (10)

with

    b_j = #(s_j|\alpha_l) / ( #(s_j|\alpha_l) + r )    (11)

where D normalizes the adapted values to probabilities, and r is an empirical value controlling the strength of the update. Such a BT model retains the context resolution of the SI model, while describing the speaker-specific statistics. This training scheme is particularly effective when the SI model is used at the same time as the background model during the likelihood ratio test, in which symbols with too low an observation count in certain contexts nearly cancel out due to the identical tree structure.
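A minimal sketch of the leaf update (10)-(11), assuming the speaker data have already been partitioned by the fixed SI tree structure; the function and argument names are ours.

```python
def adapt_leaf(si_distribution, speaker_counts, r=4.0):
    """Eqs. (10)-(11): interpolate an SI leaf distribution with speaker counts.

    si_distribution: {symbol: P_l(s)} from the speaker-independent model.
    speaker_counts:  {symbol: #(s|alpha_l)} from the speaker's leaf partition.
    r: empirical constant controlling the update strength (set to 4 in the paper).
    """
    n_l = sum(speaker_counts.values())
    unnormalized = {}
    for s, p_si in si_distribution.items():
        count = speaker_counts.get(s, 0)
        b = count / (count + r)                              # eq. (11)
        rel_freq = count / n_l if n_l else 0.0
        unnormalized[s] = b * rel_freq + (1.0 - b) * p_si    # bracketed term of eq. (10)
    D = sum(unnormalized.values())                           # normalizer D
    return {s: v / D for s, v in unnormalized.items()}
```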
3.1.2. Bottom-Up Recursive Smoothing

Despite sufficient token counts in a leaf overall, individual symbols with unreliable estimates may still exist. A symbol-wise back-off or smoothing scheme with one or several reliable estimates may be beneficial. The BT framework offers a simple way of finding such estimates, namely by backing off to the parent distribution of a leaf. Each parent distribution is a pool of both child distributions and therefore is more likely to contain more observations of a given symbol. The back-off process can be repeated recursively bottom-up until either enough observation mass is collected or the root node is reached. We suggest the following recursive smoothing algorithm for calculating the probability of a symbol a_t = s_j given its context X = {a_{t-N}, ..., a_{t-1}}:

1. Find the leaf l using X. Set a node variable c = l.

2. Calculate the symbol probability \hat{P}_{smooth}(s_j) = b_j \hat{P}_c(s_j) + (1 - b_j) \hat{P}_{par(c)}(s_j), where \hat{P}_{par(c)}(s_j) is obtained by repeating Step 2 with c := par(c) recursively until c = root.

Again, a linear interpolation scheme is used, whereby par(c) denotes the parent node of c, and r is as in (11).
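The recursion above can be sketched as follows; this is our rendering, assuming each tree node stores its pooled symbol counts and a parent pointer, and the handling of the root plus the flooring constant are assumptions.

```python
def smoothed_probability(leaf, symbol, r=4.0, floor=1e-6):
    """Bottom-up recursive smoothing: interpolate each node with its parent.

    `leaf` is assumed to expose `counts` (symbol counts pooled at that node)
    and `parent` (None at the root); r is the constant from eq. (11).
    """
    def estimate(node):
        count = node.counts.get(symbol, 0)
        total = sum(node.counts.values())
        p_node = count / total if total else 0.0
        if node.parent is None:                 # recursion stops at the root
            return p_node if p_node > 0 else floor
        b = count / (count + r)                 # interpolation weight, as in eq. (11)
        return b * p_node + (1.0 - b) * estimate(node.parent)
    return estimate(leaf)
```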
4. EXPERIMENTS

4.1. Database

The Extended Data Paradigm used in the framework of the NIST 2001 Speaker Recognition Evaluation was adopted in our experimental setup. As described in [8], the Extended Data Task comprises the complete Switchboard-I telephone-speech corpus partitioned into six splits, in order to evaluate the performance in a fashion similar to cross-validation. Furthermore, five different training conditions with data amounts consisting of 1, 2, 4, 8, and 16 conversation sides (each of a nominal length of 2.5 minutes) were considered.

    # Training Conversations    16    8    4    2    1
    Avg. # of leaves            50   29    4    1    1

    Table 1. Average speaker tree size for variable training amounts (no adaptation)
4.2. Baseline Performance

The baseline bigram system was implemented using the CMU Statistical Language Modeling toolkit (CMU-SLM). A smoothing scheme by Katz [5], which combines Good-Turing discounting with back-off, was used with a discounting threshold of 7. Two UBPMs were created from splits 1-3 and 4-6, each used in the evaluation of the respective excluded partition sets. Furthermore, we include trigram baseline results obtained using a joint-probability system with pruning, as described in [1]. In this system, all trigrams with an observation count lower than 500 were excluded from the scoring. The final performance results of both systems were obtained by pooling all six splits and combining the five decoder-dependent streams with uniform weights, as described in Section 2.
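For concreteness, a generic sketch (not taken from the paper) of the pooled evaluation: per-trial LLR scores from the five decoder-dependent streams are averaged with uniform weights, and the equal-error rate is read off the pooled target and impostor scores.

```python
import numpy as np

def fuse_uniform(stream_scores):
    """stream_scores: array of shape (n_trials, n_streams) -> uniform-weight fusion."""
    return np.mean(stream_scores, axis=1)

def equal_error_rate(target_scores, impostor_scores):
    """EER: sweep the threshold until miss and false-alarm rates (approximately) cross."""
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    miss = np.array([np.mean(target_scores < t) for t in thresholds])
    fa = np.array([np.mean(impostor_scores >= t) for t in thresholds])
    idx = np.argmin(np.abs(miss - fa))
    return 0.5 * (miss[idx] + fa[idx])
```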
4.3. Binary Trees System

The speaker BT models were examined in three configurations: 1) models with no smoothing, 2) models with Bottom-Up Recursive Smoothing (BURS), and 3) models adapted from a background (BG) model, with BURS. The BG BT model was created in the same fashion as for the n-grams described in Section 4.2. The significance threshold was set such that the BG BT possessed on the order of 200-400 leaves. The same threshold, along with an occupancy constraint of 5·|A| ≈ 250, produced unadapted BT models with an average of 30 leaves for 8-conversation training. In unadapted speaker models, the occupancy constraint appeared to be active in almost all split decisions, as opposed to the redundancy reduction, which tended to be more active in building the BG model with large data amounts. Table 1 shows the average model size (leaf count) for the five training conditions with no adaptation. A lack of context resolution becomes obvious for 4 or fewer training conversations due to the occupancy constraint, compared to, for example, a bigram context resolution of 45 (for |A| = 45). The value of the adaptation constant r in (11) seemed not critical in the range (0.5, 16) and was set to 4 in all experiments, based on a small data subset.

The maximum number of predictors N considered in the training was set to four. Most of the BT models tended to use up to three preceding predictors, namely X_1, X_2, X_3, in such a way that X_1 (i.e., the immediately preceding token) tended to be chosen in splits earlier in the tree-growing procedure to split the data set, followed by X_2 and then X_3, chosen deeper for more detailed split decisions.

Figure 2 compares the performance of BT models with and without adaptation and the bigram and trigram baselines in terms of the equal-error rate (EER) across training conditions. A considerable improvement in BT performance with adaptation can be seen for training conditions 4, 2, and 1, in which the resolution of unadapted models is insufficient (see Table 1). With 8 and 16 conversations, the BT models are able to further improve upon the trigrams due to their extended context length and more flexible structure.

[Figure 2 about here.]
Figure 2. The EER performance of the 5-tokenizer system using BT models with and without adaptation and n-grams
5. CONCLUSION

Binary-tree models represent a step towards flexible context structuring and extension in phonetic speaker recognition, consistently outperforming standard smoothed bigrams as well as trigrams. Our experiments show that the problems of data sparseness in speaker model training can be addressed effectively by applying principles of adaptation and smoothing, for which the BT models offer a suitable basis. Using smoothing and adaptation, a relative reduction in EER ranging between 10-60% compared to the best n-gram system was achieved across the different training conditions.

REFERENCES

[1] W. Andrews, M. Kohler, J. Campbell, J. Godfrey, and J. Hernandez-Cordero. Gender-dependent phonetic refraction for speaker recognition. In Proc. of the ICASSP, Orlando, FL, May 2002. IEEE.

[2] L.R. Bahl, P.F. Brown, P.V. DeSouza, and R.L. Mercer. A tree-based statistical language model for natural language speech recognition. IEEE Trans. on Acoustics, Speech, and Signal Processing, 37(7):1001–8, July 1989.

[3] G. Doddington. Speaker recognition based on idiolectal differences between speakers. In Proc. of the EUROSPEECH, pages 2521–4, Aalborg, Denmark, September 2001.

[4] Q. Jin, J. Navrátil, D. Reynolds, J. Campbell, W. Andrews, and J. Abramson. Combining cross-stream and time dimensions in phonetic speaker recognition. ICASSP'03, to appear.

[5] S.M. Katz. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Trans. on Acoustics, Speech, and Signal Processing, 35(3):400–401, 1987.

[6] J. Navrátil. Spoken language recognition - a step towards multilinguality in speech processing. IEEE Trans. Audio and Speech Processing, 9(6):678–85, September 2001.

[7] L. Polymenakos, P. Olsen, D. Kanevsky, R.A. Gopinath, P.S. Gopalakrishnan, and S.S. Chen. Transcription of broadcast news - some recent improvements to IBM's LVCSR system. In Proc. of the ICASSP, volume 2, pages 901–4, Seattle, May 1998.

[8] D. Reynolds et al. The SuperSID project: Exploiting high-level information for high-accuracy speaker recognition. ICASSP'03, to appear.
