Automated Cryptanalysis of Plaintext XORs of Waveform Encoded Speech



L. A. Khan and M. S. Baig

Manuscript received 05 May, 2008. L. A. Khan is with the Information Security Department, College of Telecommunication Engineering, National University of Sciences and Technology (NUST), Rawalpindi, Pakistan. M. S. Baig is with the Center for Cyber Technology and Spectrum Management (CCT&SM), National University of Sciences and Technology (NUST), Islamabad, Pakistan. Phone: +92-51-2103420/9266480. (e-mail: [email protected], [email protected].)

Abstract— Keystream reuse, also known as the "two time pad" problem in the case of stream ciphered data, has been the focus of cryptanalysts for several decades. All heuristics presented so far assume the underlying plaintext to be uncompressed text based data encoded through conventional mechanisms such as ASCII coding. This paper presents a hidden Markov model (HMM) based automatic speech recognition (ASR) approach to the cryptanalysis of stream-ciphered waveform-encoded speech in a keystream reuse situation. We show that an adversary can automatically recover the digitized speech signals from the plaintext XOR obtained from two different speech signals stream ciphered with the same keystream. The proposed technique can be practically employed with existing HMM based probabilistic speech recognition techniques, with some modification in the selection of HMMs, their training, and the maximum likelihood decoding procedures. Simulation experiments using such modified speech recognition tools are presented.

Index Terms— cryptanalysis, keystream reuse, speech coding, stream cipher, two time pad.

I. INTRODUCTION

In a stream cipher, a message m is exclusive-ORed with a keystream k to produce the ciphertext c, i.e. m ⊕ k = c. If the keystream k is random and of the same size as the message m, the stream cipher becomes a "one time pad", which is considered a perfect cipher [1] in the cryptographic community. If two different plaintexts m1 and m2 are encrypted with the same keystream k, their results m1 ⊕ k and m2 ⊕ k can be XORed to neutralize the effect of the keystream k, thereby obtaining m1 ⊕ m2.
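The cancellation of a reused keystream can be illustrated with a minimal sketch in Python; the two byte strings and the keystream below are assumptions for illustration only:

import os

# Two equal-length plaintexts encrypted with the same (reused) keystream.
m1 = b"clothes and lodging are free"
m2 = b"all that glitters is not gold"[:len(m1)]
k = os.urandom(len(m1))                          # the reused keystream

c1 = bytes(a ^ b for a, b in zip(m1, k))         # c1 = m1 XOR k
c2 = bytes(a ^ b for a, b in zip(m2, k))         # c2 = m2 XOR k

xor_of_ciphertexts = bytes(a ^ b for a, b in zip(c1, c2))
assert xor_of_ciphertexts == bytes(a ^ b for a, b in zip(m1, m2))   # k cancels out

The attacker never sees k; the cryptanalytic problem addressed in this paper is recovering m1 and m2 from this XOR when both plaintexts are waveform-encoded speech.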
The key reuse problem in stream ciphers, and its exploitation in different scenarios for uncompressed text based data, has been studied for a long time; it is also referred to in the literature as the "two time pad" problem [2]. The keystream reuse vulnerability exists in many practical systems which are still in use, such as Microsoft Office [2, 3], 802.11 Wired Equivalent Privacy (WEP) [4], WinZip [5] and the Point-to-Point Tunneling Protocol (PPTP) [6]. In addition, the two time pad problem is predicted to persist for quite some time because of the endorsement of the counter mode of AES by NIST [2, 7] for high speed data transfer applications. Cryptographers who would otherwise have used a block cipher in cipher block chaining (CBC) mode are compelled to use AES in counter mode, thereby turning a block cipher into a stream cipher, and the chances of reusing keystreams are further increased. There is also a compelling need for a cipher mode of operation which can efficiently provide authenticated encryption at speeds of 10 gigabits/s and is free of intellectual property restrictions; the counter mode of operation of a block cipher (e.g. AES) has been considered the best method for this purpose [8, 9]. This further increases the possibility of keystream reuse in actual systems, because effective and secure key management is still an uphill task for crypto designers, as noted in [10]: "If you think you know how to do key management, but you don't have much confidence in your ability to design good ciphers, a one-time pad might make sense. We're in precisely the opposite situation, however: we have a hard time getting the key management right….. And almost any system that uses a one-time pad is insecure. It will claim to use a one-time pad, but actually uses a two-time pad (oops)."

As regards hidden Markov models (HMMs), these are very rich in mathematical structure and form the theoretical basis for a broad range of applications, particularly machine recognition of speech [11]. Most contemporary speech recognizers are based on HMMs. Speech is digitized, encrypted and sent between two parties in many situations. Encryption schemes designed particularly for speech, whether analog speech scramblers (e.g. [12]) or modern digital selective speech encryption techniques (e.g. [13]), have long been the focus of security professionals. With the advancement of speech digitization and compression techniques, the speech signal is now treated as an ordinary stream of data bits as far as encryption is concerned. But the acoustic and articulatory features of speech exploited by automatic speech recognition (ASR) equipment, especially in the distributed speech recognition (DSR) scenario [14] and in automatic transcription of conversational speech [15], have encouraged us to look at these characteristics from the cryptanalytic point of view in a keystream reuse situation. We have extended the natural language approach from automated cryptanalysis of encrypted text based data to the digital data extracted from the underlying verbal conversation. An interesting by-product of our attack is that it not only deciphers the information but also automatically transcribes it during the process of cryptanalysis through speech recognition.

The rest of the paper is organized as follows: In section 2, we present some background information on speech coding and hidden Markov model based speech recognition. In section 3, we discuss the previous and related work on the keystream reuse problem as well as the use of HMMs in cryptology. Section 4 presents the assumptions and the concept behind our cryptanalysis technique. In section 5, we present the implementation procedure which we adopted, along with experimental results. Section 6 concludes the paper and gives directions for future work.

II. BACKGROUND INFORMATION

In this section, we first discuss speech coding and then automatic speech recognition, with relevance to our work.

A. Speech Coding

In waveform encoding, the speech signal waveform is sampled, quantized (and compressed) and then digitally encoded. The A-law and µ-law algorithms [16] used in traditional Pulse Code Modulation (PCM) digital telephony can be seen as very early precursors of speech digitization based on waveform encoding. In parameter encoding, speech is treated as a source-filter model in which the parameters of the model, along with excitation information in the form of voiced/unvoiced decisions, are used for the digital representation of speech. In hybrid coding, which has become the most popular approach in modern speech coding, the excitation information is not only segregated as voiced or unvoiced, but details of the excitation such as pitch, pulse positions/signs and gains are used for its representation. The common coding technique in this domain is Code Excited Linear Prediction (CELP) [17]. Although speech coders based on CELP are well known, common in voice over IP networks and predominantly used in PC based systems, some of the leading IP phone vendors have unfortunately stopped supporting some implementations of CELP. This leaves G.711, which is based on waveform coding, as the common coder for PC-to-IP phones [18]. Moreover, most telecommunication links still use the A-law and µ-law algorithms of waveform coding for speech digitization. Keeping this fact in mind, we have developed our algorithm for keystream reuse exploitation of speech signals based on waveform encoding.

B. Automatic Speech Recognition

The purpose of speech recognition is to convert spoken words to machine readable input. Two main techniques of speech recognition presently exist, one based on dynamic time warping (DTW) and the other based on hidden Markov models (HMMs). The HMM based technique is the most common and is also the one applicable in our scenario. A Markov process is a stochastic process in which the conditional probability of the future states depends only on the present state and not on any past state, whereas a hidden Markov model is a statistical model in which the system to be modeled is assumed to be a Markov process with unknown parameters, and the challenge is to determine the hidden parameters from the observable ones. For complete details of hidden Markov models and their applications in speech recognition the reader is referred to [11]. All modern speech recognition tools use this technique because of its robustness, flexibility and efficiency. The goal of any ASR system is to find the most probable sequence of words W = (w1, w2, w3, …) given an acoustic observation O = (o1, o2, o3, …, oT). Mathematically,

Ŵ = arg max_{i∈L} P(wi | O)    (1)

where L indicates the phonetic units in the language model. Equation (1) cannot be solved directly, but using Bayes' rule it can be rewritten as

Ŵ = arg max_{i∈L} [ P(O | wi) P(wi) / P(O) ]    (2)

or, since P(O) does not depend on wi,

Ŵ = arg max_{i∈L} P(O | wi) P(wi)    (3)

Here, P(O | wi) is calculated using HMM based acoustic models, whereas P(wi) is determined from the language model.
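As a toy illustration of the decision rule in Eq. (3): the scores below are made-up placeholders for the acoustic log likelihoods log P(O|w) and the language model log priors log P(w) of three candidate words; the recognizer simply picks the word with the highest combined score.

# Toy illustration of Eq. (3): choose the unit w maximizing log P(O|w) + log P(w).
# The numbers are invented placeholders, not measured likelihoods.
acoustic = {"free": -41.2, "three": -39.8, "tree": -44.0}   # log P(O|w) from HMM acoustic models
language = {"free": -2.1, "three": -1.7, "tree": -3.0}      # log P(w) from a language model

best = max(acoustic, key=lambda w: acoustic[w] + language[w])
print(best)                                                  # "three" wins for these placeholder scores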
III. PREVIOUS AND RELATED WORK

A. Keystream Reuse Exploitation

Keystream reuse vulnerability exploitation of stream ciphers dates back to the National Security Agency's VENONA project [19, 20], which started in 1943 and did not finish until 1980. Other works worth mentioning on the topic are those of Rubin in 1978, who for the first time formalized the process of keystream reuse exploitation [21]; Dawson and Nielsen in 1996, who automated the process of cryptanalysis of plaintext XORs [22]; and the recent cryptanalysis of two time pads by Joshua Mason and coauthors in 2006 [2]. The keystream reuse exploitation discussed previously is mostly with respect to textual data and is mainly based on heuristic rules for obtaining the two plaintexts m1 and m2 from m1 ⊕ m2, except for [2], which uses statistical finite state language models and a natural language approach. Prior work also exists on automated cryptanalysis of analog speech scramblers (e.g. [23]), but no previous work exists on the use of modern automatic speech recognition (ASR) techniques based on hidden Markov models (HMMs) for cryptanalysis of the two time pad problem for digitally encoded speech signals. In [2], concepts borrowed from the natural language and speech processing communities are used for text based data, whereas we use similar concepts, with the addition of speech recognition, for speech based data.

B. Use of HMMs in Cryptology

As regards the use of hidden Markov models in cryptology, these have recently been applied to several problems in this area. The most prominent are the works of A. Narayanan and V. Shmatikov, who used hidden Markov models for improving fast dictionary attacks on human memorable passwords [24]; D. X. Song, D. Wagner and X. Tian, who used HMMs for timing attacks on Secure Shell (SSH) [25]; D. Lee, who gave the concept of substitution deciphering of compressed documents using HMMs [26]; L. Zhuang, F. Zhou and J. D. Tygar, who modeled keyboard acoustic emanations with HMMs [27]; C. Karlof and D. Wagner, who used HMMs for modeling countermeasures against side channel cryptanalysis [28]; and finally the most relevant work of Joshua Mason et al. [2], who used a Viterbi beam search for finding the most probable plaintext pairs from their XOR in the case of textual data. It is worth mentioning here that the works involving cryptanalysis with the aid of HMMs presented so far either do not relate to two time pad cryptanalysis or pertain only to text based data. Our algorithm for cryptanalysis of the plaintext XOR of digitized speech signals using hidden Markov model based speech recognition techniques is, to our knowledge, the first of its kind and has shown encouraging preliminary results.

IV. PROPOSED APPROACH

Before discussing the concept behind our approach for exploiting keystream reuse in stream ciphered digitized speech signals, we first elaborate and justify our assumptions.

A. Assumptions

We assume that the cryptanalyst knows, before launching an attack, the details of the speech digitization and encoding applied before stream ciphering. As a priori knowledge, the cryptanalyst must know whether the targeted audio data is waveform encoded or parameter encoded. This assumption is realistic because audio encodings follow standards, and by merely knowing the standard reference the bit level details can easily be accessed, as they are publicly available. For example, in waveform coding we may have ITU-T G.711 [16], the bit level details of which are publicly available. Had these details not been available, even the plaintext speech without encryption could not have been decoded. We also assume that the cryptanalyst has a priori knowledge of the language and genre of the underlying speech signal. For example, we may like to model military telephonic conversations, corporate discussions, or informal chat over mobile telephones. In the case of the VENONA project [20], the NSA knew beforehand that the said link carried Soviet military and diplomatic communication. Moreover, keeping in view Kerckhoffs' principle, re-asserted by Claude Shannon as "the enemy knows the system", the language and encoding details of the underlying speech signals can rightly be assumed to be known to the cryptanalyst beforehand.

B. The Concept

Our method of cryptanalyzing speech signals encrypted with the same keystream is based on hidden Markov model based speech recognition techniques. All modern speech recognition tools use this technique because of its robustness, flexibility and efficiency [11]. The three basic questions with respect to HMMs, and their solutions as regards speech recognition, are effectively utilized, with some modification, in our case. The three basic problems of interest that must be solved for the model to be useful for the cryptanalysis of speech signals encrypted with the same key are listed below; a small numerical sketch of the first problem follows the list.

1) Finding the Probability of an Observation given a Model
Given an observation sequence O (XORed ciphered speech waveforms in our case) and a model λ = (π, A, B), where π is the initial state probability distribution (over XORed phonemes in our case), A is the state transition probability matrix and B is the emission probability distribution of the observations, how do we compute the probability that the given sequence of observations was produced by the model λ, i.e. P(O | λ)? The solution to this problem allows us to choose the model which best matches the observation sequence, which in our case would be a sequence of XORed speech samples.

2) Finding the Internal States given the Model and an Observation
Given the observation sequence O of XORed speech waveforms and the model λ, how do we choose a corresponding sequence of states (XORed spoken words in the case of an isolated word recognizer, or a sequence of XORed phonemes in the case of a continuous speech recognizer) which is optimal and best explains the observation? This is the problem in which we try to find the hidden part of the model, i.e. the "correct" state sequence. In this case we have to impose a certain optimality criterion, for instance selecting the sequence with maximum log probability.

3) Adjusting the Model Parameters given Observations
How do we adjust the model parameters λ = (π, A, B) to maximize the probability of the observation sequences of XORed speech waveforms given the model λ? This is the training part of the models; on one hand it is the most crucial part in the sequence of events, but on the other hand it gives a lot of flexibility to the cryptanalyst, empowering him to adjust his models for all sorts of varying situations such as different languages, accents, different speech coding and compression techniques, and even different noise conditions. This allows the cryptanalyst to create the best models for real phenomena.
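A minimal numerical sketch of problem 1: the forward algorithm computes P(O | λ) for a small discrete-observation HMM. The values of π, A, B and the observation indices are illustrative placeholders, not trained parameters of our system.

import numpy as np

# Forward algorithm for a toy 2-state, 3-symbol HMM (placeholder parameters).
pi = np.array([0.6, 0.4])            # initial state probabilities
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])           # state transition probabilities
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])      # emission probabilities per discrete symbol
obs = [0, 2, 1]                      # an observation sequence of symbol indices

alpha = pi * B[:, obs[0]]            # initialisation
for o in obs[1:]:
    alpha = (alpha @ A) * B[:, o]    # induction over time
print(alpha.sum())                   # P(O | lambda)

Problems 2 and 3 are solved analogously with the Viterbi and Baum-Welch algorithms, whose full derivations are given in [11].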
All the abovementioned problems and their efficient mathematical solutions have a very rich literature with respect to speech recognition [11]. In conventional speech recognition, the hidden Markov models are trained for complete words in the case of isolated word recognizers and for phonemes in the case of continuous speech recognizers. In our case, for isolated word recognition, we would first have to list all the possible combinations of words resulting from the XOR of the two speech signals, and hence the number of HMMs required to be trained would increase from n to n². Since the list of words is generally very large, this approach of training the HMMs would be very computationally intensive and maybe impractical. A better and more efficient approach is to train the HMMs for the exclusive-ORed pairs of all the possible phonemes in the language under test. In this case, since the number of phonemes is very small compared to the number of possible words, the increase in computational complexity from n to n² does not become unachievable. For example, the English language has about 40 to 50 phonemes, and hence the number of HMMs to be trained in this case would be at most 2500 (50²), which is not high as regards the computational resources available to a normal user these days. Using this approach does not require any major modification of the conventional phoneme based speech recognition procedure. In the pre-computation phase, which comprises the selection and training of HMMs, we first list all the possible phonemes in the language and then pair each phoneme with every other phoneme, including itself, in the list. We then assign one HMM to each phoneme pair and train it with training data selected on the basis of the a priori knowledge about the language and encoding mechanism of the speech signal. Once the HMMs corresponding to the phoneme pairs are fully trained, they can be used for the identification of phoneme pairs in a given sequence of XORed speech samples, in a process similar to the decoding part of a conventional HMM based ASR system. The decoded phoneme pairs are first separated into two distinct groups, and the phonemes within a group can then be combined to form words and sentences. For both these steps, help can be taken from the semantics and syntactic rules of the language.
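A small sketch of these two bookkeeping steps, using a reduced, illustrative phoneme inventory (not the full BEEP set): enumerating the XORed phoneme pairs that each need an HMM, and splitting a hypothetical decoded pair sequence back into its two constituent phoneme streams.

from itertools import product

# Pair every phoneme with every other phoneme (including itself): n -> n^2 models.
phonemes = ["sil", "k", "l", "ow", "dh", "ae", "n", "d"]        # illustrative subset
pair_models = [f"{a}+{b}" for a, b in product(phonemes, repeat=2)]
print(len(phonemes), len(pair_models))                           # 8 phonemes -> 64 pair HMMs

# Splitting a decoded sequence of XORed pairs into the two phoneme streams.
decoded = ["sil+ao", "k+l", "l+dh", "ow+ae"]                     # hypothetical decoder output
stream1 = [p.split("+")[0] for p in decoded]                     # ["sil", "k", "l", "ow"]
stream2 = [p.split("+")[1] for p in decoded]                     # ["ao", "l", "dh", "ae"]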
V. IMPLEMENTATION OF PROPOSED APPROACH

The implementation of our attack involves a pre-computation phase, which covers the selection and training of the models, and a decoding phase, which corresponds to the recognition of the sequence of phoneme pairs giving the highest log probability. The two phases are interrelated and interdependent, and the accuracy of the attack depends greatly on how carefully these two parts of the attack are employed and joined.

A. HMM Selection and Training Phase

This phase corresponds to the pre-computation part of the speech recognition, in which the HMMs are first selected and then trained with the help of the speech samples available for each phoneme pair. This is done once for a particular language and a specific speech encoding procedure. In order to prove the concept, we present a simple example in which we take two phonetically balanced English sentences: "Clothes and lodging are free to new men" and "All that glitters is not gold at all". We bitwise XORed the digitally encoded forms of these sentences to simulate the keystream reuse scenario. Figs. 1(a) and (b) show the spectrogram and waveform of the two sentences along with their transcriptions. The transcription at the phoneme level is obtained from the British English pronunciation dictionary BEEP [29]. For simplicity of implementation the silence between words is not marked separately; only the initial silence and the end silence are marked. Fig. 1(c) corresponds to the bitwise XOR of the two signals and the associated transcription in the form of phoneme pairs; the "+" sign in the figure corresponds to XOR. For the individual sentences the number of phonemes, including silence (sil), is 27 for sentence 1 and 26 for sentence 2, and hence at most 702 (27x26) HMMs would be required for the XOR case. These are obtained by pairing every phoneme of sentence 1 with every phoneme of sentence 2. We used ten utterances each of sentence 1 and sentence 2 from ten different speakers, labeled the wave files, bitwise XORed the files, and labeled the result again with the HMM boundaries clearly defined. For the recordings, transcription and speech recognition we used HTK [30], a C based toolkit for the analysis of hidden Markov models, particularly for machine recognition of speech. The recordings and transcriptions can be obtained with the HTK tool HSLab. The above mentioned acoustical events were modeled by 167 HMMs, each HMM corresponding to one XORed pair of phonemes. Since not all possible phoneme pairs actually occur, the actual number of phoneme pairs (167) is considerably smaller than the total possible number (702). The basic design of the HMM we used for all the models is shown in Fig. 2. Since speech recognition equipment cannot process waveforms directly, these have to be converted into a more compact form. The configuration we used for the speech recognition was based on Mel Frequency Cepstral Coefficients (MFCC) [31]: the first 12 MFCC coefficients, the zeroth MFCC coefficient (which is proportional to the total energy in the frame), 13 delta coefficients estimating the first order derivatives of the MFCC coefficients, and 13 acceleration coefficients estimating the second order derivatives; altogether a 39 coefficient vector is extracted from each signal frame. The frame length is 25 milliseconds with a 10 millisecond frame period. The parameters to be estimated for each HMM during the training phase are the transition probabilities aij and the single Gaussian observation function of each emitting state, which is described by a mean vector and a variance vector (the diagonal elements of the autocorrelation matrix). In our case we have to estimate all these values for each of the 167 HMMs during the training phase. The HTK tools HInit, HCompV and HRest can be used for this purpose.
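As a rough illustration of how the 39-dimensional observation vectors are assembled (a sketch only: the static coefficients below are random placeholders standing in for a real MFCC analysis of the XORed signal, and simple finite differences stand in for the regression formulas a toolkit such as HTK actually uses for the delta and acceleration coefficients):

import numpy as np

n_frames = 300                              # e.g. about 3 s of signal at a 10 ms frame period
static = np.random.randn(n_frames, 13)      # placeholder: 12 MFCCs + energy-like C0 per 25 ms frame
delta = np.gradient(static, axis=0)         # first order (delta) coefficients
accel = np.gradient(delta, axis=0)          # second order (acceleration) coefficients
obs = np.hstack([static, delta, accel])     # one 39-dimensional vector per frame
print(obs.shape)                            # (300, 39)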

Before using our HMMs we have to define the basic architecture of our recognizer. In the actual case this depends on the language and the syntactic rules of the underlying task for which the recognizer is used. We assume that these things, such as the language of the speakers and the digital encoding procedures, are known to the cryptanalyst beforehand. HTK, like most speech recognizers, works on the concept of a recognition network, which is to be prepared in advance from a task grammar, and the performance of the recognizer depends greatly on how well the recognition network maps the actual recognition task. In addition to the recognition network, we need a task dictionary which explains how the recognizer has to respond once a particular HMM is identified. The task grammar for our recognition network is shown in Fig. 3, and the recognition network for our experiment is shown in Fig. 4. The HParse tool of HTK can be used to construct the recognition network from the task grammar, and HSGen can be used for testing and verification of the recognition network.

Fig. 1(a). Spectrogram and waveform, along with transcription, of the sentence: Clothes and lodging are free to new men.

Fig. 1(b). Spectrogram and waveform, along with transcription, of the sentence: All that glitters is not gold at all.

Fig. 1(c). Spectrogram and waveform, along with transcription, of the XOR of the two sentences.

Fig. 2. Basic topology of the HMM: a six-state left-to-right model (states S1 to S6) with self-loops (a22 to a55), forward transitions (a12 to a56), skip transitions, and output distributions b2 to b5 on the emitting states S2 to S5.

/* Task grammar */
$WORD = sil+ao | k+l | l+dh | ow+ae | dh+t | z+g | ae+g | n+l | d+l | l+ih | l+t | l+ax | oh+ax | oh+r | jh+z | ih+ih
| ng+z | ng+n | aa+n | r+oh | r+t | r+g | f+g | iy+g | iy+ow | iy+l | t+d | t+ax | uw+ax | uw+t | n+t | n+ao | y+ao |
uw+ao | m+l | m+sil | eh+sil | n+sil | k+ao | ow+dh | ow+t | dh+g | z+l | ae+l | ae+ih | n+ih | d+t | l+r | oh+z | jh+ih
| ih+z | aa+t | f+ow | r+ow | r+l | iy+d | iy+ax | iy+t | t+t | y+l | uw+sil | sil+l | f+oh | f+t | n+ax | y+t | m+ao | eh+l
| l+ao | ow+ao | dh+ao | ae+dh | ae+ae | d+g | l+g | oh+l | jh+l | ng+t | aa+ax | r+ax | r+r | f+r | f+z | f+ih | r+ih | iy+z
| t+z | uw+n | n+oh | uw+g | uw+ow | m+ow | eh+ow | n+d | sil+ax | sil+t | n+ae | d+ae | l+ae | oh+ae | oh+t | oh+g
| jh+g | ih+l | ng+l | aa+ih | f+ax | r+z | iy+ih | uw+z | n+n | y+oh | m+g | eh+g | n+ow | sil+ow | sil+d | l+l | ow+l
| dh+l | z+dh | jh+t | ih+t | ng+g | aa+g | f+l | n+z | y+z | m+z | eh+n | eh+oh | sil+oh | dh+dh | ih+g | aa+l | t+ih |
uw+ih | m+n | sil+g | k+sil | n+dh | ng+ax | aa+r | aa+z | f+n | l+sil | ow+sil | z+ao | ih+ax | ng+r | r+n | iy+oh | r+d
| ae+t | oh+ih | jh+ax | ih+r | t+g | y+d | uw+d | m+ax | eh+ax | aa+oh ;

( [ START_SIL ] { $WORD } [ END_SIL ] )

Fig. 3. Task grammar for the recognition network.

Fig. 4. Recognition network: a START_SIL node, a loop over parallel arcs (one per XORed phoneme pair, e.g. sil⊕ao, k⊕l, l⊕dh, …, m⊕ax, eh⊕ax, aa⊕oh) and an END_SIL node.

B. Decoding Phase

Once the pre-computation phase has been carefully completed, the decoding process becomes fairly simple and elegant. An input speech signal, which in our case is the bitwise XOR of two unknown speech waveforms, is first converted to a sequence of n MFCC vectors and is then fed as input to the recognizer. Every path from the start node to the end node of the recognition network which passes through precisely n emitting states is a prospective recognition hypothesis. Each of these paths has a log probability, which is computed by summing the log probability of each individual transition in the path and the log probability of each emitting state generating the corresponding XORed vector. Within a model, transitions are determined from the model parameters (aij); between two models the transitions are regarded as constant, and in large recognition networks the transitions between word ends are determined by language model likelihoods attached to the word level networks. The decoder lists the paths through the network which have the highest log probability. These paths are found using a token passing algorithm [30]. At time 0, a token is placed in every possible start node. At each time step, tokens are propagated along connected paths through the recognition network, stopping whenever they hit an emitting HMM state. When there is more than one outgoing path from a node, the token is copied so that all possible paths are explored in parallel. As a token passes across transitions, the corresponding transition and emission log probabilities are added to its log probability. The token also maintains a history of its route during the process of propagation. In a large network, which will certainly be the case for us, a beam search may be used, in which a record of the best token overall is kept while all tokens whose log probability falls more than a beam width below the best are deactivated. This beam search technique has one problem: if the pruning beam width is set too small, the actual recognition path might be pruned before its token reaches the end of the observation, i.e. it may result in a search error. Setting the beam width is thus a compromise between computational load and avoiding search errors.
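The following is a hedged sketch of this kind of Viterbi-style token passing with beam pruning; the states, transition scores and emission scorer are illustrative placeholders rather than trained pair HMMs, and the real HTK decoder additionally handles model-internal topology and word-end bookkeeping.

import math

def decode(frames, states, start, log_trans, log_emit, beam_width=50.0):
    # one token per network state: (best log probability so far, state history)
    tokens = {s: (-math.inf, []) for s in states}
    tokens[start] = (0.0, [])
    for frame in frames:
        new_tokens = {s: (-math.inf, []) for s in states}
        for s, (lp, hist) in tokens.items():
            if lp == -math.inf:
                continue                                  # dead token
            for nxt in states:
                step = log_trans.get((s, nxt), -math.inf)
                if step == -math.inf:
                    continue                              # no arc from s to nxt
                score = lp + step + log_emit(nxt, frame)  # add transition + emission scores
                if score > new_tokens[nxt][0]:
                    new_tokens[nxt] = (score, hist + [nxt])
        best = max(lp for lp, _ in new_tokens.values())
        # beam pruning: deactivate tokens more than beam_width below the best token
        tokens = {s: (lp, h) if lp >= best - beam_width else (-math.inf, [])
                  for s, (lp, h) in new_tokens.items()}
    return max(tokens.values(), key=lambda t: t[0])       # (log probability, state sequence)

Widening beam_width trades computation for a lower risk of search errors, which mirrors the compromise described above.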
Fortunately, the HTK tool HVite takes care of all these speed and computation issues [30]. The initial experiments which we performed gave us up to 80 percent correct recognition of phoneme pairs, obtained through the HResults tool of HTK, as shown in Fig. 5.

=================== HTK Results Analysis ===================
Date: Mon Jun 11 20:05:23 2007
Ref : data2/ref22.mlf
Rec : data2/rec22.mlf
------------------------ Overall Results -------------------
SENT: %Correct=10.00 [H=1, S=9, N=10]
WORD: %Corr=86.98, Acc=83.72 [H=374, D=10, S=46, I=14, N=430]
=============================================================

Fig. 5. HTK recognition performance.
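The word-level figures in Fig. 5 can be checked directly from the standard HTK definitions %Corr = H/N and Acc = (H - I)/N, where H, D, S and I are the hit, deletion, substitution and insertion counts:

# Worked check of the scores reported by HResults in Fig. 5.
H, D, S, I, N = 374, 10, 46, 14, 430
assert H == N - D - S                   # hits are what remains after deletions and substitutions
print(round(100 * H / N, 2))            # 86.98  -> %Corr
print(round(100 * (H - I) / N, 2))      # 83.72  -> Acc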
C. Detailed Experimentation

After getting encouraging results from the experiment in the controlled environment, we then tested our technique on actual speech files. For this purpose, we selected the Switchboard Corpus [32], a collection of telephone bandwidth conversational speech data collected over T1 lines. The speech files are fully transcribed. The reason for this selection was to simulate the situation of eavesdropped encrypted communication of waveform encoded speech. In order to simulate the keystream reuse scenario, we selected 256 speech files and XORed them with each other. The acoustical events were modeled with 588 HMMs, each HMM corresponding to one pair of phonemes. Since not all possible phoneme pairs actually occur, the actual number of phoneme pairs (588) is considerably smaller than the total possible number (2500). As discussed earlier, speech recognition tools cannot process speech waveforms directly, so different acoustic feature representations were used for recognition purposes. For all the representations, we used a frame length of 25 milliseconds with a 10 millisecond frame period. The parameters estimated for each HMM during the training phase were the transition probabilities aij and the single Gaussian observation function of each emitting state, described by a mean vector and a variance vector (the diagonal elements of the autocorrelation matrix). The different acoustic features extracted for recognition purposes include linear predictive coefficients, linear predictive reflection coefficients, linear predictive cepstral coefficients and Mel frequency cepstral coefficients, along with delta and acceleration coefficients. For testing purposes, we selected twenty different files from the Switchboard corpus not included in the training data and XORed them to get ten test files. As an initial test we also selected ten different files from the training data and fed these to the recognizer. The experimental results for the test files selected from the training data as well as for the arbitrary test files are depicted in Table I. The best accuracy results were given by the Mel frequency cepstral coefficients (MFCC) with delta and acceleration coefficients, for both test file categories.

TABLE I: RECOGNITION ACCURACIES OF DIFFERENT ACOUSTIC FEATURES

                                                        Recognition Accuracy (%)
S.No  Feature Extraction Mechanism                      Training Data Test Files   Arbitrary Test Files
1     Linear Predictive Coefficients                    65.93                      29.51
2     Linear Predictive Reflection Coefficients         69.06                      28.61
3     Linear Predictive Cepstral Coefficients           72.09                      34.62
4     Mel Frequency Cepstral Coefficients (MFCC)        74.72                      40.73
5     Linear Predictive Cepstral + Delta Coefficients   77.15                      37.81
6     Mel Frequency Cepstral + Delta + Acceleration     79.96                      59.09


D. Complexity Analysis

For the training phase, the number of models to be trained increases from n to at most n², if we pair each phoneme with every other phoneme and, of course, itself. For the decoding phase, the number of phonetic units to be checked at each stage of the decoding path also increases from n to n², and the number of possible paths increases from n² per stage to n⁴ per stage. Hence we may expect an exponential increase in decoding effort over conventional speech recognition. One way to reduce the number of iterations at each stage is to carry out a beam search, in which the width of the beam should be carefully selected in order to avoid pruning useful paths at an early stage of the recognition network. Fortunately, HTK supports the beam Viterbi search approach, and hence it can be employed in our case with no major modification of the decoding phase.
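To put rough numbers on this growth (n = 45 is an assumed value within the 40 to 50 phoneme range quoted above for English):

# Illustrative growth of the model inventory and of the per-stage search space.
n = 45                  # assumed phoneme count, within the 40-50 range for English
print(n ** 2)           # 2025 phoneme-pair HMMs to train
print(n ** 4)           # 4100625 possible paths per decoding stage (up from n**2 conventionally)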
VI. CONCLUSIONS AND FUTURE WORK

In this paper, we presented how the keystream reuse problem of stream ciphers can be exploited in the case of waveform encoded speech signals. Prior to this work, one could safely reuse keystreams when the underlying plaintext data was speech, as all exploitation techniques presented previously assumed the underlying plaintext to be uncompressed text-based data encoded through conventional encoding techniques such as ASCII coding. We have shown that conventional speech recognition techniques can be adapted for cryptanalysis of two time pads in the case of stream ciphered digitized speech signals. HTK and other automatic speech recognition (ASR) tools can be effectively modified for this purpose. These speech recognition tools work on the concept of context-dependent tied-state multi-mixture triphones [30], which make the performance of the recognizer more flexible and robust. The use of triphones in the keystream reuse scenario needs to be looked into in future work. Parameter encoded and compressed speech signals in the keystream reuse situation can also be investigated as future work.

REFERENCES

[1] C. E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27:379-423, July 1948.
[2] J. Mason, K. Watkins, J. Eisner and A. Stubblefield. A natural language approach to automated cryptanalysis of two-time pads. In 13th ACM Conference on Computer and Communications Security, Nov. 2006.
[3] H. Wu. The misuse of RC4 in Microsoft Word and Excel. Cryptology ePrint Archive, Report 2005/007, 2005. http://eprint.iacr.org.
[4] N. Borisov, I. Goldberg and D. Wagner. Intercepting mobile communications: The insecurity of 802.11. In MOBICOM 2001, 2001.
[5] T. Kohno. Attacking and repairing the WinZip encryption scheme. In 11th ACM Conference on Computer and Communications Security, pp. 72-81, Oct. 2004.
[6] B. Schneier, Mudge and D. Wagner. Cryptanalysis of Microsoft PPTP authentication extensions (MS-CHAPv2). CQRE '99, 1999.
[7] M. Dworkin. Recommendation for block cipher modes of operation. NIST Special Publication 800-38A, 2001.
[8] D. A. McGrew and J. Viega. The Galois/Counter Mode of operation (GCM), May 2005. Available from http://csrc.nist.gov/CryptoToolkit/modes/proposedmodes/gcm/gcm-revised-spec.pdf.
[9] R. Housley and A. Corry. GigaBeam high-speed radio link encryption. RFC 4705, Oct. 2006. Available from http://tools.ietf.org/html/rfc4705.
[10] B. Schneier. Crypto-Gram Newsletter, Oct. 2002.
[11] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257-286, Feb. 1989.
[12] E. Dawson. Design of a discrete cosine transform based speech scrambler. Electronics Letters, vol. 27, pp. 613-614, Mar. 1991.
[13] C. P. Wu and C. C. J. Kuo. Fast encryption methods for audiovisual data confidentiality. Proceedings of SPIE, vol. 4209, pp. 284-295, Nov. 2000.
[14] D. Pearce. Enabling new speech driven services for mobile devices: An overview of the ETSI standards activities for distributed speech recognition front ends. AVIOS 2000: The Speech Applications Conference, CA, USA, May 2000.
[15] M. J. F. Gales, B. Jia, X. Liu, K. C. Sim, P. C. Woodland and K. Yu. Development of the CUHTK 2004 RT04F Mandarin conversational telephone speech transcription system. In Proc. ICASSP 2005, vol. I, pp. 841-844, March 2005.
[16] ITU-T Recommendation G.711. Pulse Code Modulation (PCM) of voice frequencies, Nov. 1988.
[17] M. R. Schroeder and B. S. Atal. Code-excited linear prediction (CELP): high-quality speech at very low bit rates. In Proceedings of ICASSP, vol. 10, pp. 937-940, 1985.
[18] R. Goldberg and L. Riek. A Practical Handbook of Speech Coders. CRC Press, New York, p. 67, 2000.
[19] P. Wright. Spy Catcher. Viking, New York, NY, 1987.
[20] R. L. Benson and M. Warner. VENONA: Soviet Espionage and the American Response 1939-1957. Central Intelligence Agency, Washington D.C., 1996.
[21] R. Rubin. Computer methods for decrypting random stream ciphers. Cryptologia, 2(3):215-231, July 1978.
[22] E. Dawson and L. Nielsen. Automated cryptanalysis of XOR plaintext strings. Cryptologia, 20(2):165-181, April 1996.
[23] B. Goldburg, E. Dawson and S. Sridharan. The automated cryptanalysis of analog speech scramblers. Advances in Cryptology, EUROCRYPT '91, Springer-Verlag LNCS 457, p. 422, April 1991.
[24] A. Narayanan and V. Shmatikov. Fast dictionary attacks on human-memorable passwords using time-space trade-off. In 12th ACM Conference on Computer and Communications Security, pp. 364-372, Washington D.C., Nov. 2005.
[25] D. X. Song, D. Wagner and X. Tian. Timing analysis of keystrokes and timing attacks on SSH. In 10th USENIX Security Symposium, Aug. 2001.
[26] D. Lee. Substitution deciphering based on HMMs with applications to compressed document processing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(12):1661-1666, Dec. 2002.
[27] L. Zhuang, F. Zhou and J. D. Tygar. Keyboard acoustic emanations revisited. In 12th ACM Conference on Computer and Communications Security, pp. 373-382, Washington D.C., Nov. 2005.
[28] C. Karlof and D. Wagner. Hidden Markov model cryptanalysis. Cryptographic Hardware and Embedded Systems, CHES '03, Springer-Verlag LNCS 2779, pp. 17-34, 2003.
[29] BEEP: British English Pronunciation Dictionary (phonetic transcriptions of over 250,000 English words). http://svr-www.eng.cam.ac.uk/comp.speech/Section1/Lexical/beep.html.
[30] S. J. Young, G. Evermann, T. Hain, D. Kershaw, G. L. Moore, J. J. Odell, D. Ollason, D. Povey, V. Valtchev and P. C. Woodland. The HTK Book. Cambridge University, Cambridge, 2003. http://htk.eng.cam.ac.uk/download.shtml.
[31] V. Tyagi and C. Wellekens. On desensitizing the Mel-cepstrum to spurious spectral components for robust speech recognition. In Proc. ICASSP '05, vol. 1, pp. 529-532, 2005.
[32] J. J. Godfrey, E. C. Holliman and J. McDaniel. SWITCHBOARD: Telephone speech corpus for research and development. In Proceedings of ICASSP, San Francisco, 1992.
