Automated Cryptanalysis of Plaintext Xors of Waveform Encoded Speech
experimental results. Section 6 concludes the paper and gives directions for future work.

II. BACKGROUND INFORMATION

In this context, we first discuss speech coding and then automatic speech recognition with relevance to our work.

A. Speech Coding

In waveform encoding, the speech signal waveform is sampled, quantized (and compressed) and then digitally encoded. The A-law and µ-law algorithms [16] used in traditional Pulse Code Modulation (PCM) digital telephony can be seen as very early precursors of speech digitization based on waveform encoding. In parameter encoding, speech is treated as a source-filter model in which the parameters of the model, along with excitation information in the form of voiced/unvoiced signals, are used for the digital representation of speech. In hybrid coding, which has become the most popular approach in modern speech coding, the excitation information is not only segregated as voiced or unvoiced, but details of the excitation such as pitch, pulse positions/signs and gains are used for its representation. The common coding technique in this domain is Code Excited Linear Prediction (CELP) [17]. Although speech coders based on CELP are well known, are common in voice over IP networks and are predominantly used in PC based systems, some of the leading IP phone vendors have unfortunately stopped supporting some implementations of CELP. This leaves G.711, which is based on waveform coding, as the common coder for PC to IP phones [18]. Moreover, most telecommunication links still use the A-law and µ-law algorithms of waveform coding for speech digitization. Keeping this fact in mind, we have developed our algorithm for the keystream reuse exploitation of speech signals based on waveform encoding.

B. Automatic Speech Recognition

The purpose of speech recognition is to convert spoken words into machine-readable input. Two main techniques of speech recognition presently exist, one based on dynamic time warping (DTW) and the other based on hidden Markov models (HMMs). The more common technique is the one based on hidden Markov models, and it is also the one applicable in our scenario. A Markov process is a stochastic process in which the conditional probability of the future states depends only on the present state and not on any past state, whereas a hidden Markov model is one in which the states themselves are not directly observable and must be inferred from the emitted observations. In HMM-based speech recognition, the recognizer searches for the most likely phonetic unit Ŵ given the acoustic observation sequence O:

    Ŵ = arg max_{i ∈ L} P(w_i | O)                  (1)

where L indicates the phonetic units in a language model. Equation 1 cannot be solved directly, but using Bayes' rule it can be modified as

    Ŵ = arg max_{i ∈ L} P(O | w_i) P(w_i) / P(O)    (2)

or, since P(O) does not depend on the candidate w_i, equivalently as

    Ŵ = arg max_{i ∈ L} P(O | w_i) P(w_i)           (3)

Here, P(O | w_i) is calculated using HMM-based acoustic models, whereas P(w_i) is determined from the language model.

III. PREVIOUS AND RELATED WORK

A. Keystream Reuse Exploitation

Keystream reuse exploitation of stream ciphers dates back to the National Security Agency's VENONA project [19, 20], which started in 1943 and did not finish until 1980. Other works worth mentioning on the topic are those of Rubin in 1978, who for the first time formalized the process of keystream reuse exploitation [21]; Dawson and Nielsen in 1996, who automated the process of cryptanalysis of plaintext XORs [22]; and the recent cryptanalysis of two-time pads by Joshua Mason and coauthors in 2006 [2]. Most of the keystream reuse exploitation discussed previously concerns textual data and is mainly based on heuristic rules for obtaining the two plaintexts m1 and m2 from m1 ⊕ m2, except for [2], which uses statistical finite-state language models and a natural language approach. Prior works also exist on automated cryptanalysis of analog speech scramblers (e.g. [23]), but no previous work exists on the use of modern automatic speech recognition (ASR) techniques based on hidden Markov models (HMMs) for cryptanalysis of the two-time pad problem for digitally encoded speech signals. In [2], concepts borrowed from the natural language and speech processing communities are applied to text-based data, whereas we use similar concepts, with the addition of speech recognition, for speech-based data.

B. Use of HMMs in Cryptology

As regards the use of hidden Markov models in cryptology, these have recently been used for several problems in this area.
The most prominent are the works of A. Narayanan and V. Shmatikov, who used hidden Markov models to improve fast dictionary attacks on human-memorable passwords [24]; D. X. Song, D. Wagner and X. Tian, who used HMMs for timing attacks on Secure Shell (SSH) [25]; D. Lee, who gave the concept of substitution deciphering of compressed documents using HMMs [26]; L. Zhuang, F. Zhou and J. D. Tygar, who modeled keyboard acoustic emanations as HMMs [27]; C. Karlof and D. Wagner, who used HMMs for modeling countermeasures against side channel cryptanalysis [28]; and finally the most relevant work of Joshua Mason et al. [2], who used Viterbi beam search for finding the most probable plaintext pairs from their XOR in the case of textual data. It is worth mentioning here that the works involving cryptanalysis with the aid of HMMs presented so far either do not relate to two-time pad cryptanalysis or pertain only to text-based data. Our algorithm for cryptanalysis of the plaintext XOR of digitized speech signals using hidden Markov model based speech recognition techniques is, to our knowledge, the first of its kind and has shown encouraging preliminary results.

A. HMM Selection and Training Phase

This phase corresponds to the pre-computation part of the speech recognition, in which the HMMs are first selected and then trained with the help of the speech samples available for each phoneme pair. This is done once for a particular language and a specific speech encoding procedure. As a proof of concept, we present a simple example in which we take two phonetically balanced English sentences: Clothes and lodging are free to new men; and All that glitters is not gold at all. We bitwise XORed the digitally encoded forms of these sentences to simulate the keystream reuse scenario. Fig. 1(a), (b) show the spectrogram and waveform of the two sentences along with their transcription. The transcription at the phoneme level is obtained from the British English pronunciation dictionary BEEP [29]. For simplicity of implementation the silence between words is not marked separately; only the initial silence and the end silence are marked. Fig. 1(c) corresponds to the bitwise XOR of the two signals and the associated transcription.

We use the HTK toolkit because of its robustness, flexibility and efficiency [11]. The three basic questions with respect to HMMs, and their solutions as regards speech recognition, are effectively utilized with some modification in our case. The three basic problems of interest that are to be solved for the model to be useful for the cryptanalysis of speech signals encrypted with the same key are:

1) Finding the Probability of an Observation given a Model: Given an observation sequence O (XORed ciphered speech waveforms in our case) and a model λ = (π, A, B), where π is the initial probability distribution of the states (XORed phonemes in our case), A is the state transition probability distribution and B is the emission probability distribution of the observation sequence, how do we compute the probability that the given sequence of observations was produced by the model λ, i.e. P(O | λ)? The solution to this problem allows us to choose the model which best matches the observation sequence, which in our case would be a sequence of XORed speech samples.

Before using our HMMs we have to define the basic architecture of our recognizer. In the actual case this depends on the language and the syntactic rules of the underlying task for which the recognizer is used. We assume that details such as the language of the speakers and the digital encoding procedures are known to the cryptanalyst beforehand. HTK, like most speech recognizers, works on the concept of a recognition network, which is prepared in advance from task grammars, and the performance of the recognizer depends greatly on how well the recognition network maps the actual recognition task. In addition to the recognition network, we need a task dictionary which specifies how the recognizer has to respond once a particular HMM is identified. The task grammar for our recognition network is shown in Fig. 3. The recognition network for our experiment is shown in Fig. 4. The HParse tool of HTK can be used to construct the recognition network from the task grammar. HSGen can be used for testing and verification of the recognition network.
Fig. 1(a). Spectrogram and waveform along with transcription of the sentence:
Clothes and lodging are free to new men.
Fig. 1(b). Spectrogram and waveform along with transcription of the sentence:
All that glitters is not gold at all.
Fig. 1(c). Spectrogram and waveform along with transcription of XOR of the sentences.
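In the log domain, the decision rule of Equation (3) reduces to scoring each candidate unit by log P(O | w_i) + log P(w_i) and picking the maximum. A minimal sketch of this rule; the probability values and the three XORed phoneme-pair labels below are hypothetical placeholders, not scores from the paper's trained models:

```python
import math

def recognize(acoustic_loglik, lm_logprob):
    """Pick the unit maximizing log P(O|w) + log P(w), i.e. Equation (3)."""
    return max(acoustic_loglik, key=lambda w: acoustic_loglik[w] + lm_logprob[w])

# Hypothetical scores for three XORed phoneme-pair units.
acoustic = {"k+l": math.log(0.20), "l+dh": math.log(0.50), "sil+ao": math.log(0.10)}
lm       = {"k+l": math.log(0.30), "l+dh": math.log(0.30), "sil+ao": math.log(0.40)}

best = recognize(acoustic, lm)  # "l+dh", since 0.50*0.30 beats the other products
```

Working in log probabilities avoids the numerical underflow that multiplying many small per-frame likelihoods would otherwise cause.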
[Figure: a left-to-right HMM with states S1–S6, self-loop transition probabilities a22–a55, forward transitions a12–a56, skip transitions (a13, a24, a35, …) and output probability distributions b2–b5.]
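The left-to-right topology depicted above, in which each state may loop on itself, advance to the next state, or skip one state ahead, corresponds to a banded transition matrix A. A small sketch with assumed uniform probabilities (the 0.5/0.3/0.2 split is illustrative, not trained):

```python
import numpy as np

def left_to_right_A(n_states, p_self=0.5, p_next=0.3, p_skip=0.2):
    """Build a left-to-right transition matrix with self, next and skip arcs."""
    A = np.zeros((n_states, n_states))
    for i in range(n_states):
        A[i, i] = p_self                 # self-loop a_ii
        if i + 1 < n_states:
            A[i, i + 1] = p_next         # forward transition a_i,i+1
        if i + 2 < n_states:
            A[i, i + 2] = p_skip         # skip transition a_i,i+2
    # Renormalize rows near the final state, where next/skip arcs fall off.
    A /= A.sum(axis=1, keepdims=True)
    return A

A = left_to_right_A(6)   # states S1..S6, as in the figure
```

All entries below the diagonal are zero, which enforces the strictly forward progression through the phoneme-pair model.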
/* Task grammar*/
$WORD = sil+ao | k+l | l+dh | ow+ae | dh+t | z+g | ae+g | n+l | d+l | l+ih | l+t | l+ax | oh+ax | oh+r | jh+z |ih+ih
| ng+z | ng+n | aa+n | r+oh | r+t | r+g | f+g | iy+g | iy+ow | iy+l | t+d | t+ax | uw+ax | uw+t | n+t | n+ao | y+ao |
uw+ao | m+l | m+sil | eh+sil | n+sil | k+ao | ow+dh | ow+t | dh+g | z+l | ae+l | ae+ih | n+ih | d+t | l+r | oh+z | jh+ih
| ih+z | aa+t | f+ow | r+ow | r+l | iy+d | iy+ax | iy+t | t+t | y+l | uw+sil | sil+l | f+oh | f+t | n+ax | y+t | m+ao | eh+l
| l+ao | ow+ao | dh+ao | ae+dh | ae+ae | d+g | l+g | oh+l | jh+l | ng+t | aa+ax | r+ax | r+r | f+r | f+z | f+ih | r+ih | iy+z
| t+z | uw+n | n+oh | uw+g | uw+ow | m+ow | eh+ow | n+d | sil+ax | sil+t | n+ae | d+ae | l+ae | oh+ae | oh+t | oh+g
| jh+g | ih+l | ng+l | aa+ih | f+ax | r+z | iy+ih | uw+z | n+n | y+oh | m+g | eh+g | n+ow | sil+ow | sil+d | l+l | ow+l
| dh+l | z+dh | jh+t | ih+t | ng+g | aa+g | f+l | n+z | y+z | m+z | eh+n | eh+oh | sil+oh | dh+dh | ih+g | aa+l | t+ih |
uw+ih | m+n | sil+g | k+sil | n+dh | ng+ax | aa+r | aa+z | f+n | l+sil | ow+sil | z+ao | ih+ax | ng+r | r+n | iy+oh | r+d
| ae+t | oh+ih | jh+ax | ih+r | t+g | y+d | uw+d | m+ax | eh+ax | aa+oh ;
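The keystream reuse scenario simulated above amounts to XORing the two encoded sample streams: with a reused keystream k, (m1 ⊕ k) ⊕ (m2 ⊕ k) = m1 ⊕ m2, so the key cancels out. A minimal sketch with synthetic 8-bit samples standing in for the encoded sentences (the byte values are illustrative, not the paper's actual utterances):

```python
import numpy as np

# Two synthetic streams of 8-bit encoded samples (e.g. A-law/µ-law bytes).
m1 = np.array([0x12, 0xA5, 0x3C, 0x7F], dtype=np.uint8)
m2 = np.array([0x0F, 0xA5, 0xC3, 0x80], dtype=np.uint8)

# Encrypt both with the same keystream k, then XOR the ciphertexts.
k = np.array([0x55, 0xAA, 0x0F, 0xF0], dtype=np.uint8)
c1, c2 = m1 ^ k, m2 ^ k
xor_of_plaintexts = c1 ^ c2       # the keystream cancels: equals m1 ^ m2

assert np.array_equal(xor_of_plaintexts, m1 ^ m2)
```

This XOR stream is exactly the observation sequence the recognizer must decode into phoneme pairs.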
[Fig. 4. Recognition network: a START_SIL node branching in parallel to one path per XORed phoneme pair (sil ⊕ ao, k ⊕ l, l ⊕ dh, …, m ⊕ ax, eh ⊕ ax, aa ⊕ oh), all converging on END_SIL.]
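Problem 1 above, computing P(O | λ) for λ = (π, A, B), is solved by the forward algorithm. A sketch for a discrete-observation HMM; the 2-state model below uses toy numbers, not parameters trained on XORed speech:

```python
import numpy as np

def forward_prob(pi, A, B, obs):
    """P(O | lambda) via the forward algorithm (alpha recursion)."""
    alpha = pi * B[:, obs[0]]            # initialization: alpha_1(i)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]    # induction: sum over predecessor states
    return alpha.sum()                   # termination: sum over final states

# Toy 2-state model with 2 observation symbols.
pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3],
               [0.4, 0.6]])
B  = np.array([[0.9, 0.1],
               [0.2, 0.8]])

p = forward_prob(pi, A, B, [0, 1, 0])
```

Evaluating P(O | λ) for each candidate phoneme-pair model and keeping the best-scoring one is precisely the model-selection step described in the text.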