INTRODUCTION TO MACHINE LEARNING
with APPLICATIONS in INFORMATION SECURITY

Mark Stamp
San Jose State University, California

CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
Contents

Preface

1 Introduction
    1.1 What Is Machine Learning?
    1.2 About This Book
    1.3 Necessary Background
    1.4 A Few Too Many Notes

Index
Preface
For the past several years, I’ve been teaching a class on “Topics in Information
Security.” Each time I taught this course, I’d sneak in a few more machine
learning topics. For the past couple of years, the class has been turned on
its head, with machine learning being the focus, and information security
only making its appearance in the applications. Unable to find a suitable
textbook, I wrote a manuscript, which slowly evolved into this book.
In my machine learning class, we spend about two weeks on each of the
major topics in this book (HMM, PHMM, PCA, SVM, and clustering). For
each of these topics, about one week is devoted to the technical details in
Part I, and another lecture or two is spent on the corresponding applica-
tions in Part II. The material in Part I is not easy—by including relevant
applications, the material is reinforced, and the pace is more reasonable.
I also spend a week covering the data analysis topics in Chapter 8. Several
of the mini topics in Chapter 7 are covered as well, depending on time
constraints and student interest.1
Machine learning is an ideal subject for substantive projects. In topics
classes, I always require projects, which are usually completed by pairs of stu-
dents, although individual projects are allowed. At least one week is allocated
to student presentations of their project results.
A suggested syllabus is given in Table 1. This syllabus should leave time
for tests, project presentations, and selected special topics. Note that the
applications material in Part II is intermixed with the material in Part I.
Also note that the data analysis chapter is covered early, since it’s relevant
to all of the applications in Part II.
1. Who am I kidding? Topics are selected based on my interests, not student interest.
Mark Stamp
Los Gatos, California
April, 2017
2. In my experience, in-person lectures are infinitely more valuable than any recorded or
online format. Something happens in live classes that will never be fully duplicated in any
dead (or even semi-dead) format.
Chapter 1
Introduction
I took a speed reading course and read War and Peace in twenty minutes.
It involves Russia.
— Woody Allen
The primary goal of this book is to provide the reader with a deeper
understanding of what is actually happening inside those mysterious machine
learning black boxes.
Why should anyone care about the inner workings of machine learning al-
gorithms when a simple black box approach can—and often does—suffice? If
you are like your curious author, you hate black boxes, and you want to know
how and why things work as they do. But there are also practical reasons
for exploring the inner sanctum of machine learning. As with any technical
field, the cookbook approach to machine learning is inherently limited. When
applying machine learning to new and novel problems, it is often essential to
have an understanding of what is actually happening “under the covers.” In
addition to being the most interesting cases, such applications are also likely
to be the most lucrative.
By way of analogy, consider a medical doctor (MD) in comparison to a
nurse practitioner (NP).1 It is often claimed that an NP can do about 80%
to 90% of the work that an MD typically does. And the NP requires less
training, so when possible, it is cheaper to have NPs treat people. But, for
challenging or unusual or non-standard cases, the higher level of training of
an MD may be essential. So, the MD deals with the most challenging and
interesting cases, and earns significantly more for doing so. The aim of this
book is to enable the reader to earn the equivalent of an MD in machine
learning.
The bottom line is that the reader who masters the material in this book
will be well positioned to apply machine learning techniques to challenging
and cutting-edge applications. Most such applications would likely be beyond
the reach of anyone with a mere black box level of understanding.
We sometimes skip a few details, and on occasion, we might even be a little
bit sloppy with respect to mathematical niceties. The goal here is to present
topics at a fairly intuitive level, with (hopefully) just enough detail to
clarify the underlying concepts, but not so much detail as to become
overwhelming and bog down the presentation.3
In this book, the following machine learning topics are covered in
chapter-length detail.

    Topic                                  Where
    Hidden Markov Models (HMM)             Chapter 2
    Profile Hidden Markov Models (PHMM)    Chapter 3
    Principal Component Analysis (PCA)     Chapter 4
    Support Vector Machines (SVM)          Chapter 5
    Clustering (K-Means and EM)            Chapter 6
    Topic                                  Where
    k-Nearest Neighbors (k-NN)             Section 7.2
    Neural Networks                        Section 7.3
    Boosting and AdaBoost                  Section 7.4
    Random Forest                          Section 7.5
    Linear Discriminant Analysis (LDA)     Section 7.6
    Vector Quantization (VQ)               Section 7.7
    Naïve Bayes                            Section 7.8
    Regression Analysis                    Section 7.9
    Conditional Random Fields (CRF)        Section 7.10
http://www.cs.sjsu.edu/~stamp/ML/
where you’ll find links to PowerPoint slides, lecture videos, and other relevant
material. An updated errata list is also available. And for the reader’s benefit,
all of the figures in this book are available in electronic form, and in color.
3. Admittedly, this is a delicate balance, and your unbalanced author is sure that he didn't
always achieve an ideal compromise. But you can rest assured that it was not for lack of
trying.
Chapter 2

A Revealing Introduction to Hidden Markov Models
The bottom line is that this chapter is the linchpin for much of the remain-
der of the book. Consequently, if you learn the material in this chapter well,
it will pay large dividends in most subsequent chapters. On the other hand,
if you fail to fully grasp the details of HMMs, then much of the remaining
material will almost certainly be more difficult than is necessary.
HMMs are based on discrete probability. In particular, we’ll need some
basic facts about conditional probability, so in the remainder of this section,
we provide a quick overview of this crucial topic.
The notation “|” denotes “given” information, so that P(A | B) is read as
“the probability of A, given B.” For any two events A and B, we have

    P(A and B) = P(A) P(B | A).                                (2.1)

For example, suppose that we draw two cards without replacement from a
standard 52-card deck. Let A = {1st card is ace} and B = {2nd card is ace}.
Then

    P(A and B) = P(A) P(B | A) = 4/52 · 3/51 = 1/221.

In this example, P(B) depends on what happens in the first event A, so we
say that A and B are dependent events. On the other hand, suppose we flip
a fair coin twice. Then the probability that the second flip comes up heads
is 1/2, regardless of the outcome of the first coin flip, so these events are
independent. For dependent events, the “given” information is relevant when
determining the sample space. Consequently, in such cases we can view the
information to the right of the “given” sign as defining the space over which
probabilities will be computed.

We can rewrite equation (2.1) as

    P(B | A) = P(A and B) / P(A).
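The two-ace probability is easy to check empirically. The following sketch simulates repeated draws without replacement; the trial count and random seed are arbitrary choices.

```python
import random

# Estimate P(A and B): draw two cards without replacement from a
# standard 52-card deck and count how often both are aces.
deck = ["ace"] * 4 + ["other"] * 48

random.seed(0)
trials = 200_000
both_aces = 0
for _ in range(trials):
    first, second = random.sample(deck, 2)  # sampling without replacement
    if first == "ace" and second == "ace":
        both_aces += 1

exact = (4 / 52) * (3 / 51)  # = 1/221
print(f"simulated: {both_aces / trials:.6f}  exact: {exact:.6f}")
```

The estimate should land close to 1/221 ≈ 0.004525.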
which comes from (2.3). For this example, suppose that the initial state
distribution, denoted by π, is

    π = ( 0.6  0.4 ),                                          (2.6)

that is, the chance that we start in the H state is 0.6 and the chance that
we start in the C state is 0.4. The matrices A, B, and π are row stochastic,
which is just a fancy way of saying that each row satisfies the requirements
of a discrete probability distribution (i.e., each element is between 0 and 1,
and the elements of each row sum to 1).
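The row stochastic condition is easy to verify numerically. In the sketch below, π is the distribution of equation (2.6), while the A and B values are the matrices assumed for the temperature example in this chapter's computations.

```python
import numpy as np

# Temperature example: states H and C; observations S, M, L (coded 0, 1, 2).
pi = np.array([0.6, 0.4])              # initial distribution, eq. (2.6)
A  = np.array([[0.7, 0.3],             # state transition probabilities
               [0.4, 0.6]])
B  = np.array([[0.1, 0.4, 0.5],        # observation probabilities
               [0.7, 0.2, 0.1]])

# A matrix is row stochastic if every entry is a probability and
# every row sums to 1.
for name, m in [("pi", pi.reshape(1, -1)), ("A", A), ("B", B)]:
    ok = np.allclose(m.sum(axis=1), 1.0) and (m >= 0).all() and (m <= 1).all()
    print(f"{name} row stochastic: {ok}")
```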
Now, suppose that we consider a particular four-year period of interest
from the distant past. For this particular four-year period, we observe the
series of tree ring sizes S, M, S, L. Letting 0 represent S, 1 represent M,
and 2 represent L, this observation sequence is denoted as

    O = ( 0, 1, 0, 2 ).                                        (2.7)
We might want to determine the most likely state sequence of the Markov
process given the observations (2.7). That is, we might want to know the most
likely average annual temperatures over this four-year period of interest. This
is not quite as clear-cut as it seems, since there are different possible inter-
pretations of “most likely.” On the one hand, we could define “most likely”
as the state sequence with the highest probability from among all possible
state sequences of length four. Dynamic programming (DP) can be used to
efficiently solve this problem. On the other hand, we might reasonably define
“most likely” as the state sequence that maximizes the expected number of
correct states. An HMM can be used to find the most likely hidden state
sequence in this latter sense.
It’s important to realize that the DP and HMM solutions to this problem
are not necessarily the same. For example, the DP solution must, by defini-
tion, include valid state transitions, while this is not the case for the HMM.
And even if all state transitions are valid, the HMM solution can still differ
from the DP solution, as we’ll illustrate in an example below.
Before going into more detail, we need to deal with the most challenging
aspect of HMMs—the notation. Once we have the notation, we’ll discuss the
three fundamental problems that HMMs enable us to solve, and we’ll give
detailed algorithms for the efficient solution of each. We also consider critical
computational issues that must be addressed when writing any HMM com-
puter program. Rabiner [113] is a standard reference for further introductory
information on HMMs.
2.3 Notation
The notation used in an HMM is summarized in Table 2.1. Note that the
observations are assumed to come from the set {0, 1, . . . , M − 1}, which
simplifies the notation with no loss of generality. That is, we simply
associate each of the M distinct observations with one of the elements
0, 1, . . . , M − 1, so that O_t ∈ V = {0, 1, . . . , M − 1} for
t = 0, 1, . . . , T − 1.
Table 2.1: HMM notation

    Notation   Explanation
    T          Length of the observation sequence
    N          Number of states in the model
    M          Number of observation symbols
    Q          Distinct states of the Markov process, q_0, q_1, . . . , q_{N−1}
    V          Possible observations, assumed to be 0, 1, . . . , M − 1
    A          State transition probabilities
    B          Observation probability matrix
    π          Initial state distribution
    O          Observation sequence, O_0, O_1, . . . , O_{T−1}
[Figure: A generic hidden Markov model. The hidden state sequence
X_0, X_1, . . . , X_{T−1} evolves according to the transition matrix A, and
each hidden state X_t emits the observation O_t according to the matrix B.]
The matrix A is always row stochastic. Also, the probabilities a_ij are
independent of t, so that the A matrix does not change. The matrix
B = {b_j(k)} is of size N × M, with

    b_j(k) = P(observation k at time t | state q_j at time t).

As with the A matrix, B is row stochastic, and the probabilities b_j(k) are
independent of t. The somewhat unusual notation b_j(k) is convenient when
specifying the HMM algorithms.

An HMM is defined by A, B, and π (and, implicitly, by the dimensions N
and M). Thus, we'll denote an HMM as λ = (A, B, π).
Suppose that we are given an observation sequence of length four, which
is denoted as

    O = ( O_0, O_1, O_2, O_3 ).

The corresponding (hidden) state sequence is

    X = ( x_0, x_1, x_2, x_3 ).

We'll let π_{x_0} denote the probability of starting in state x_0, and
b_{x_0}(O_0) denotes the probability of initially observing O_0, while
a_{x_0,x_1} is the probability of transiting from state x_0 to state x_1.
Continuing, we see that the probability of a given state sequence X of
length four is

    P(X, O) = π_{x_0} b_{x_0}(O_0) a_{x_0,x_1} b_{x_1}(O_1)
              a_{x_1,x_2} b_{x_2}(O_2) a_{x_2,x_3} b_{x_3}(O_3).      (2.8)

Note that in this expression, the x_i represent indices in the A and B
matrices, not the names of the corresponding states.3

3. Your kindly author regrets this abuse of notation.
Consider again the temperature example in Section 2.2, where the
observation sequence is O = (0, 1, 0, 2). Using (2.8) we can compute, say,

    P(HHCC) = 0.6(0.1)(0.7)(0.4)(0.3)(0.7)(0.6)(0.1) = 0.000212.

Similarly, we can directly compute the probability of each possible state
sequence of length four, for the given observation sequence in (2.7). We have
listed these results in Table 2.2, where the probabilities in the last column
have been normalized so that they sum to 1.
Table 2.2: State sequence probabilities

                             Normalized
    State    Probability     probability
    HHHH      0.000412        0.042787
    HHHC      0.000035        0.003635
    HHCH      0.000706        0.073320
    HHCC      0.000212        0.022017
    HCHH      0.000050        0.005193
    HCHC      0.000004        0.000415
    HCCH      0.000302        0.031364
    HCCC      0.000091        0.009451
    CHHH      0.001098        0.114031
    CHHC      0.000094        0.009762
    CHCH      0.001882        0.195451
    CHCC      0.000564        0.058573
    CCHH      0.000470        0.048811
    CCHC      0.000040        0.004154
    CCCH      0.002822        0.293073
    CCCC      0.000847        0.087963
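Table 2.2 can be reproduced by brute force, applying equation (2.8) to each of the 16 possible state sequences. The π, A, and B values below are the ones assumed for the temperature example in this chapter.

```python
from itertools import product

# Temperature example: states H = 0, C = 1; observations S = 0, M = 1, L = 2.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]]
O = [0, 1, 0, 2]  # observation sequence (2.7)

def seq_prob(X):
    """P(X, O) via equation (2.8)."""
    p = pi[X[0]] * B[X[0]][O[0]]
    for t in range(1, len(O)):
        p *= A[X[t - 1]][X[t]] * B[X[t]][O[t]]
    return p

names = {0: "H", 1: "C"}
table = {"".join(names[s] for s in X): seq_prob(X)
         for X in product([0, 1], repeat=len(O))}
total = sum(table.values())  # this sum is P(O), about 0.00963

for seq, p in table.items():
    print(f"{seq}  {p:.6f}  {p / total:.6f}")
```

The highest-probability sequence is CCCH, matching Table 2.2.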
2.4.4 Discussion
Consider, for example, the problem of speech recognition—which happens
to be one of the earliest and best-known applications of HMMs. We can
use the solution to HMM Problem 3 to train an HMM λ to, for example,
recognize the spoken word “yes.” Then, given an unknown spoken word,
we can use the solution to HMM Problem 1 to score this word against this
model λ and determine the likelihood that the word is “yes.” In this case, we
don't need to solve HMM Problem 2, but it is possible that such a solution—
which uncovers the hidden states—might provide additional insight into the
underlying speech model.
Since

    P(X, O | λ) = P(O ∩ X ∩ λ) / P(λ)

and

    P(O | X, λ) P(X | λ) = [ P(O ∩ X ∩ λ) / P(X ∩ λ) ] · [ P(X ∩ λ) / P(λ) ]
                         = P(O ∩ X ∩ λ) / P(λ),

we have

    P(X, O | λ) = P(O | X, λ) P(X | λ).

Summing over all possible state sequences X yields

    P(O | λ) = Σ_X P(X, O | λ)
             = Σ_X P(O | X, λ) P(X | λ)                            (2.9)
             = Σ_X π_{x_0} b_{x_0}(O_0) a_{x_0,x_1} b_{x_1}(O_1)
                   · · · a_{x_{T−2},x_{T−1}} b_{x_{T−1}}(O_{T−1}).

Hence, the forward algorithm gives us an efficient way to compute a score
for a given sequence O, relative to a given model λ.
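As a sketch of this scoring computation, the α-pass below computes P(O | λ) for the temperature example in O(N²T) time, rather than enumerating all N^T state sequences; the model matrices are the example values assumed earlier in this chapter.

```python
import numpy as np

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]])
O = [0, 1, 0, 2]

def forward(pi, A, B, O):
    """Unscaled alpha-pass: alpha[t, i] = P(O_0, ..., O_t, x_t = q_i | lambda)."""
    T, N = len(O), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, O[0]]
    for t in range(1, T):
        # alpha_t(i) = sum_j alpha_{t-1}(j) a_ji, then weight by b_i(O_t)
        alpha[t] = (alpha[t - 1] @ A) * B[:, O[t]]
    return alpha

alpha = forward(pi, A, B, O)
print("P(O | lambda) =", alpha[-1].sum())  # about 0.00963
```

The result agrees with summing P(X, O | λ) over all 16 state sequences of Table 2.2.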
the highest-scoring overall path. As we have seen, these solutions are not
necessarily the same.

First, we define

    β_t(i) = P(O_{t+1}, O_{t+2}, . . . , O_{T−1} | x_t = q_i, λ)

for t = 0, 1, . . . , T − 1 and i = 0, 1, . . . , N − 1. The β_t(i) can be
computed recursively (and efficiently) using the backward algorithm, or
β-pass, which is given here in Algorithm 2.2. This is analogous to the α-pass
discussed above, except that we start at the end and work back toward the
beginning.
Once the γ_t(i, j) have been computed, the model λ = (A, B, π) is
re-estimated using Algorithm 2.3. The HMM training algorithm is known as
Baum-Welch re-estimation, and is named after Leonard E. Baum and Lloyd R.
Welch, who developed the technique in the late 1960s while working at the
Center for Communications Research (CCR),4 which is part of the Institute
for Defense Analyses (IDA), located in Princeton, New Jersey.

The numerator of the re-estimated a_ij in Algorithm 2.3 can be seen to
give the expected number of transitions from state q_i to state q_j, while
the denominator is the expected number of transitions from q_i to any
state.5 Hence, the ratio is the probability of transiting from state q_i to
state q_j, which is the desired value of a_ij.

The numerator of the re-estimated b_j(k) in Algorithm 2.3 is the expected
number of times the model is in state q_j with observation k, while the
denominator is the expected number of times the model is in state q_j.
Therefore, the ratio is the probability of observing symbol k, given that the
model is in state q_j, and this is the desired value for b_j(k).
Re-estimation is an iterative process. First, we initialize λ = (A, B, π)
with a reasonable guess, or, if no reasonable guess is available, we choose
random values such that π_i ≈ 1/N and a_ij ≈ 1/N and b_j(k) ≈ 1/M. It's
critical that A, B, and π be randomized, since exactly uniform values will
result in a local maximum from which the model cannot climb. And, as
always, A, B, and π must be row stochastic.

4. Not to be confused with Creedence Clearwater Revival [153].

5. When re-estimating the A matrix, we are dealing with expectations. However, it might
make things clearer to think in terms of frequency counts. For frequency counts, it would be
easy to compute the probability of transitioning from state i to state j. That is, we would
simply count the number of transitions from state i to state j, and divide this count by the
total number of times we could be in state i. This is the intuition behind the re-estimation
formula for the A matrix, and a similar statement holds when re-estimating the B matrix.
In other words, don't let all of the fancy notation obscure the relatively simple ideas that
are at the core of the re-estimation process.
The complete solution to HMM Problem 3 can be summarized as follows.

1. Initialize λ = (A, B, π).
2. Compute α_t(i), β_t(i), γ_t(i, j), and γ_t(i).
3. Re-estimate the model λ = (A, B, π).
4. If P(O | λ) increases, goto 2.
and hence the best (most probable) path of length two ending with H is CH,
while the best path of length two ending with C is CC. Continuing, we
construct the diagram in Figure 2.2 one level or stage at a time, where each
arrow points to the next element in the optimal path ending at a given state.
Note that at each stage, the dynamic programming algorithm only needs
to maintain the highest-scoring path ending at each state—not a list of all
possible paths. This is the key to the efficiency of the algorithm.

[Figure 2.2: The dynamic programming computation. The scores for the
paths ending in state C are .28, .0336, .014112, and .000847.]
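A minimal sketch of this dynamic programming computation, using the temperature example model assumed in this chapter: the scores for paths ending in state C reproduce the values .28, .0336, .014112, and .000847 shown in Figure 2.2, and backtracking recovers the best path CCCH.

```python
pi = [0.6, 0.4]                       # states: H = 0, C = 1
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]]
O = [0, 1, 0, 2]

N, T = len(pi), len(O)
# delta[t][i] = probability of the best path of length t + 1 ending in state i
delta = [[pi[i] * B[i][O[0]] for i in range(N)]]
back = []                             # back-pointers for path recovery
for t in range(1, T):
    row, ptr = [], []
    for i in range(N):
        scores = [delta[t - 1][j] * A[j][i] for j in range(N)]
        j_best = max(range(N), key=lambda j: scores[j])
        row.append(scores[j_best] * B[i][O[t]])
        ptr.append(j_best)
    delta.append(row)
    back.append(ptr)

# backtrack the highest-scoring overall path
i = max(range(N), key=lambda i: delta[-1][i])
path = [i]
for ptr in reversed(back):
    i = ptr[i]
    path.append(i)
path.reverse()

print("best path:", "".join("HC"[i] for i in path))
print("C-row scores:", [round(d[1], 6) for d in delta])
```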
2.7 Scaling

The three HMM solutions in Section 2.5 all require computations involving
products of probabilities. It's very easy to see, for example, that α_t(i)
tends to 0 exponentially as T increases. Therefore, any attempt to implement
the HMM algorithms as given in Section 2.5 will inevitably result in
underflow. The solution to this underflow problem is to scale the numbers.
However, care must be taken to ensure that the algorithms remain valid.

First, consider the computation of α_t(i). The basic recurrence is

    α_t(i) = Σ_{j=0}^{N−1} α_{t−1}(j) a_ji b_i(O_t).
Following this approach, we compute scaling factors c_t and the scaled
α_t(i), which we denote as α̂_t(i), as in Algorithm 2.6.

To verify Algorithm 2.6, we first note that α̂_0(i) = c_0 α_0(i). Now
suppose that for some t, we have

    α̂_t(i) = c_0 c_1 · · · c_t α_t(i).                           (2.12)

Then, writing α̃_{t+1}(i) for the value computed by the recurrence from the
scaled alphas, before it is itself scaled, we have

    α̂_{t+1}(i) = c_{t+1} α̃_{t+1}(i)
               = c_{t+1} Σ_{j=0}^{N−1} α̂_t(j) a_ji b_i(O_{t+1})
               = c_0 c_1 · · · c_t c_{t+1} Σ_{j=0}^{N−1} α_t(j) a_ji b_i(O_{t+1})
               = c_0 c_1 · · · c_{t+1} α_{t+1}(i).                (2.13)

From equation (2.13) we see that for all t and i, the desired scaled value
of α_t(i) is indeed given by α̂_t(i).

Since the scaled alphas are normalized at each step, we have

    Σ_{i=0}^{N−1} α̂_{T−1}(i) = 1,

and from (2.13) it follows that

    1 = Σ_{i=0}^{N−1} α̂_{T−1}(i)
      = c_0 c_1 · · · c_{T−1} Σ_{i=0}^{N−1} α_{T−1}(i)
      = c_0 c_1 · · · c_{T−1} P(O | λ).
It follows that we can compute the log of P(O | λ) directly from the
scaling factors c_t as

    log P(O | λ) = − Σ_{t=0}^{T−1} log c_t.                      (2.14)
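A sketch of the scaled α-pass, recovering log P(O | λ) from the scaling factors via equation (2.14); the model matrices are the temperature example values assumed earlier in this chapter.

```python
import numpy as np

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]])
O = [0, 1, 0, 2]

def scaled_forward(pi, A, B, O):
    """Scaled alpha-pass; returns scaled alphas and log P(O | lambda)."""
    T, N = len(O), len(pi)
    alpha = np.zeros((T, N))
    c = np.zeros(T)                      # scaling factors c_t
    alpha[0] = pi * B[:, O[0]]
    c[0] = 1.0 / alpha[0].sum()
    alpha[0] *= c[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, O[t]]
        c[t] = 1.0 / alpha[t].sum()
        alpha[t] *= c[t]                 # rows of scaled alpha sum to 1
    return alpha, -np.log(c).sum()       # equation (2.14)

alpha_hat, log_prob = scaled_forward(pi, A, B, O)
print("log P(O | lambda) =", log_prob)
```

For this short sequence the unscaled value is still representable, so the result can be checked directly against the plain α-pass: exp(log_prob) equals P(O | λ).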
It is fairly easy to show that the same scale factors c_t can be used in
the backward algorithm by simply computing β̂_t(i) = c_t β_t(i). We then
determine γ_t(i, j) and γ_t(i) using the same formulae as in Section 2.5,
but with α̂_t(i) and β̂_t(i) in place of α_t(i) and β_t(i), respectively. The
resulting gammas and di-gammas are then used to re-estimate π, A, and B.

By writing the original re-estimation formulae (as given in lines 3, 7,
and 12 of Algorithm 2.3) directly in terms of α_t(i) and β_t(i), it is a
straightforward exercise to show that the re-estimated π and A and B are
exact when α̂_t(i) and β̂_t(i) are used in place of α_t(i) and β_t(i).
Furthermore, P(O | λ) isn't required in the re-estimation formulae, since in
each case it cancels in the numerator and denominator. Therefore, (2.14)
determines a score for the model, which can be used, for example, to decide
whether the model is improving sufficiently to continue to the next
iteration of the training algorithm.
1. Given

2. Initialize

   (a) Select N and determine M from O. Recall that the model is
       denoted λ = (A, B, π), where A = {a_ij} is N × N, B = {b_j(k)}
       is N × M, and π = {π_i} is 1 × N.

   (b) Initialize the three matrices A, B, and π. You can use knowledge
       of the problem when generating initial values, but if no such
       knowledge is available, choose random values near uniform, as
       discussed above.
3. The α-pass

   // compute α_0(i)
   c_0 = 0
   for i = 0 to N − 1
       α_0(i) = π_i b_i(O_0)
       c_0 = c_0 + α_0(i)
   next i

   // scale the α_0(i)
   c_0 = 1/c_0
   for i = 0 to N − 1
       α_0(i) = c_0 α_0(i)
   next i

   // compute α_t(i)
   for t = 1 to T − 1
       c_t = 0
       for i = 0 to N − 1
           α_t(i) = 0
           for j = 0 to N − 1
               α_t(i) = α_t(i) + α_{t−1}(j) a_ji
           next j
           α_t(i) = α_t(i) b_i(O_t)
           c_t = c_t + α_t(i)
       next i

       // scale α_t(i)
       c_t = 1/c_t
       for i = 0 to N − 1
           α_t(i) = c_t α_t(i)
       next i
   next t
4. The β-pass

   // let β_{T−1}(i) = 1, scaled by c_{T−1}
   for i = 0 to N − 1
       β_{T−1}(i) = c_{T−1}
   next i

   // β-pass
   for t = T − 2 to 0 by −1
       for i = 0 to N − 1
           β_t(i) = 0
           for j = 0 to N − 1
               β_t(i) = β_t(i) + a_ij b_j(O_{t+1}) β_{t+1}(j)
           next j
           // scale β_t(i) with the same scale factor as α_t(i)
           β_t(i) = c_t β_t(i)
       next i
   next t

5. Compute γ_t(i, j) and γ_t(i)

   for t = 0 to T − 2
       denom = 0
       for i = 0 to N − 1
           for j = 0 to N − 1
               denom = denom + α_t(i) a_ij b_j(O_{t+1}) β_{t+1}(j)
           next j
       next i
       for i = 0 to N − 1
           γ_t(i) = 0
           for j = 0 to N − 1
               γ_t(i, j) = ( α_t(i) a_ij b_j(O_{t+1}) β_{t+1}(j) ) / denom
               γ_t(i) = γ_t(i) + γ_t(i, j)
           next j
       next i
   next t

   // special case for γ_{T−1}(i)
   denom = 0
   for i = 0 to N − 1
       denom = denom + α_{T−1}(i)
   next i
   for i = 0 to N − 1
       γ_{T−1}(i) = α_{T−1}(i)/denom
   next i
6. Re-estimate A, B, and π

   // re-estimate π
   for i = 0 to N − 1
       π_i = γ_0(i)
   next i

   // re-estimate A
   for i = 0 to N − 1
       for j = 0 to N − 1
           numer = 0
           denom = 0
           for t = 0 to T − 2
               numer = numer + γ_t(i, j)
               denom = denom + γ_t(i)
           next t
           a_ij = numer/denom
       next j
   next i

   // re-estimate B
   for i = 0 to N − 1
       for k = 0 to M − 1
           numer = 0
           denom = 0
           for t = 0 to T − 1
               if(O_t == k) then
                   numer = numer + γ_t(i)
               end if
               denom = denom + γ_t(i)
           next t
           b_i(k) = numer/denom
       next k
   next i
7. Compute log[ P(O | λ) ]

   logProb = 0
   for t = 0 to T − 1
       logProb = logProb + log(c_t)
   next t
   logProb = −logProb
8. To iterate or not to iterate, that is the question

   iters = iters + 1
   δ = |logProb − oldLogProb|
   if(iters < minIters or δ > ε) then
       oldLogProb = logProb
       goto 3.
   else
       return λ = (A, B, π)
   end if
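The pseudocode above translates nearly line for line into a vectorized implementation. The sketch below is illustrative rather than a reference implementation: the toy observation sequence, the near-uniform random initialization, and the simple stopping test (iterate while the log-probability improves, per step 8) are all assumptions made here.

```python
import numpy as np

def baum_welch(O, N, M, max_iters=100, seed=1):
    """Train an HMM lambda = (A, B, pi) on observation sequence O using the
    scaled alpha/beta passes and the re-estimation formulae."""
    rng = np.random.default_rng(seed)
    O = np.asarray(O)
    T = len(O)

    def rand_stoch(rows, cols):
        # near-uniform, row stochastic random matrix (see Section 2.5)
        m = 1.0 + 0.1 * rng.random((rows, cols))
        return m / m.sum(axis=1, keepdims=True)

    A, B, pi = rand_stoch(N, N), rand_stoch(N, M), rand_stoch(1, N)[0]

    old_log_prob = -np.inf
    for _ in range(max_iters):
        # scaled alpha-pass
        alpha = np.zeros((T, N))
        c = np.zeros(T)
        alpha[0] = pi * B[:, O[0]]
        c[0] = 1.0 / alpha[0].sum(); alpha[0] *= c[0]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, O[t]]
            c[t] = 1.0 / alpha[t].sum(); alpha[t] *= c[t]
        # scaled beta-pass, reusing the same scale factors c_t
        beta = np.zeros((T, N))
        beta[-1] = c[-1]
        for t in range(T - 2, -1, -1):
            beta[t] = c[t] * (A @ (B[:, O[t + 1]] * beta[t + 1]))
        # di-gammas and gammas
        digamma = np.zeros((T - 1, N, N))
        for t in range(T - 1):
            digamma[t] = alpha[t][:, None] * A * (B[:, O[t + 1]] * beta[t + 1])[None, :]
            digamma[t] /= digamma[t].sum()
        # gamma_t(i) for t < T-1, plus the special case gamma_{T-1}(i)
        gamma = np.vstack([digamma.sum(axis=2), alpha[-1:]])
        # re-estimate pi, A, and B
        pi = gamma[0].copy()
        A = digamma.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        B = np.zeros((N, M))
        for k in range(M):
            B[:, k] = gamma[O == k].sum(axis=0)
        B /= gamma.sum(axis=0)[:, None]
        # step 8: iterate only while log P(O | lambda) improves
        log_prob = -np.log(c).sum()          # equation (2.14)
        if log_prob <= old_log_prob:
            break
        old_log_prob = log_prob
    return A, B, pi, old_log_prob

# toy data: 0/1 observations with long same-symbol runs
O = [0] * 10 + [1] * 10 + [0] * 10 + [1] * 10
A, B, pi, log_prob = baum_welch(O, N=2, M=2)
print("final log P(O | lambda):", log_prob)
```

Since this is an EM-style hill climb, the log-probability is non-decreasing across iterations, and the re-estimated matrices remain row stochastic by construction.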
2.10 Problems
2. For this problem, use the same model λ and observation sequence O
   given in Problem 1.
a) Explain how you can solve HMM Problem 1 using the backward
algorithm instead of the forward algorithm.
10. Write an HMM program for the English text problem in Section 9.2 of
Chapter 9. Test your program on each of the following cases.
11. In this problem, you will use an HMM to break a simple substitution
ciphertext message. For each HMM, train using 200 iterations of the
Baum-Welch re-estimation algorithm.
12. Write an HMM program to solve the problem discussed in Section 9.2,
replacing English text with the following.
a) French text.
b) Russian text.
c) Chinese text.
13. Perform an HMM analysis similar to that discussed in Section 9.2, re-
placing English with “Hamptonese,” the mysterious writing system de-
veloped by James Hampton. For information on Hamptonese, see
http://www.cs.sjsu.edu/faculty/stamp/Hampton/hampton.html
14. Since HMM training is a hill climb, we are only assured of reaching a
local maximum. And, as with any hill climb, the specific local maximum
that we find will depend on our choice of initial values. Therefore, by
training a hidden Markov model multiple times with different initial
values, we would expect to obtain better results than when training
only once.
In the paper [16], the authors use an expectation maximization (EM)
approach with multiple random restarts as a means of attacking ho-
mophonic substitution ciphers. An analogous HMM-based technique is
analyzed in the report [158], where the effectiveness of multiple ran-
dom restarts on simple substitution cryptanalysis is explored in detail.
Multiple random restarts are especially helpful in the most challenging
cases, that is, when little data (i.e., ciphertext) is available. However,
the tradeoff is that the work factor can be high, since the number of
restarts required may be very large (millions of random restarts are
required in some cases).
15. The Zodiac Killer murdered at least five people in the San Francisco Bay
Area in the late 1960s and early 1970s. Although police had a prime
suspect, no arrest was ever made and the murders remain officially
unsolved. The killer sent several messages to the police and to local
newspapers, taunting police for their failure to catch him. One of these
16. In addition to the Zodiac 408 cipher, the Zodiac Killer (see Problem 15)
released a similar-looking cipher with 340 symbols. This cipher is known
as the Zodiac 340 and remains unsolved to this day.8 The ciphertext is
given below.
a) Repeat Problem 15, parts a) through d), using the Zodiac 340 in
place of the Zodiac 408. Since the plaintext is unknown, in each
case, simply print the decryption obtained from your highest scoring
model.
b) Repeat part a) of this problem, except use parts e) through h) of
Problem 15.
8. It is possible that the Zodiac 340 is not a cipher at all, but instead just a random
collection of symbols designed to frustrate would-be cryptanalysts. If that's the case, your
easily frustrated author can confirm that the “cipher” has been wildly successful.
Chapter 3

Profile Hidden Markov Models

3.1 Introduction
Here, we introduce the concept of a profile hidden Markov model (PHMM).
The material in this chapter builds directly on Chapter 2 and we’ll assume
that the reader has a good understanding of HMMs.
Recall that the key reason that HMMs are so popular and useful is that
there are efficient algorithms to solve each of the three problems that arise—
training, scoring, and uncovering the hidden states. But, there are significant
restrictions inherent in the HMM formulation, which limit the usefulness of
HMMs in some important applications.
Perhaps the most significant limitation of an HMM is the Markov as-
sumption, that is, the current state depends only on the previous state. The
time-invariant nature of an HMM is a closely related issue.1 These limita-
tions make the HMM algorithms fast and efficient, but they prevent us from
making use of positional information within observation sequences. For some
types of problems, such information is critically important.
1. According to your self-referential author's comments in Chapter 2, we can consider
higher order Markov processes, in which case the current state can depend on n consecutive
previous states. But, the machinery becomes unwieldy, even for relatively small n. And,
even if we consider higher order Markov processes, we still treat all positions in the sequence
the same, as this only changes how far back in history we look.
[Figures: PHMM state transition diagrams. The first shows the match
states M1–M4 between the begin and end states, with insert states I0–I4;
the second shows the match states with delete states D1–D4; the third
shows the complete PHMM architecture combining match, insert, and
delete states.]
    Notation      Explanation
    X             Emitted symbols, x_1, x_2, . . . , x_n, where n ≤ N + 1
    N             Number of states
    M             Match states, M_1, M_2, . . . , M_N
    I             Insert states, I_0, I_1, . . . , I_N
    D             Delete states, D_1, D_2, . . . , D_N
    π             Initial state distribution
    A             State transition probability matrix
    a_{M_i M_{i+1}}   Transition probability from M_i to M_{i+1}
    E             Emission probability matrix
    e_{M_i}(k)    Emission probability of symbol k at state M_i
    λ             The PHMM, λ = (A, E, π)
Unaligned sequences:

    CBCBJILIIJEJE
    GCBJIIIJJEG

Global alignment:

    -CBCBJILIIJEJE-
     |  ||| ||| ||
    GC--BJI-IIJ-JEG

Local alignment:

    ***CBJILII-JE**
       |||| || ||
    ***CBJI-IIJJE**
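The vertical bars in these alignments simply mark positions where the two aligned sequences agree; a small helper makes the convention explicit, using the global alignment from the example above.

```python
def match_bars(top, bottom):
    """Return a line with '|' wherever the two aligned sequences agree
    (gap '-' and padding '*' characters never count as matches)."""
    return "".join("|" if a == b and a not in "-*" else " "
                   for a, b in zip(top, bottom))

top    = "-CBCBJILIIJEJE-"
bottom = "GC--BJI-IIJ-JEG"
print(top)
print(match_bars(top, bottom))
print(bottom)
```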
Notation Explanation
E Send email
G Play games
C C programming
J Java programming