INTRODUCTION TO MACHINE LEARNING
with APPLICATIONS in INFORMATION SECURITY

Mark Stamp
San Jose State University
California
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2018 by Taylor & Francis Group, LLC


CRC Press is an imprint of Taylor & Francis Group, an Informa business

International Standard Book Number-13: 978-1-138-62678-2 (Hardback)

Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com

and the CRC Press Web site at
http://www.crcpress.com
Contents

Preface

1 Introduction
1.1 What Is Machine Learning?
1.2 About This Book
1.3 Necessary Background
1.4 A Few Too Many Notes

I Tools of the Trade

2 A Revealing Introduction to Hidden Markov Models
2.1 Introduction and Background
2.2 A Simple Example
2.3 Notation
2.4 The Three Problems
2.4.1 HMM Problem 1
2.4.2 HMM Problem 2
2.4.3 HMM Problem 3
2.4.4 Discussion
2.5 The Three Solutions
2.5.1 Solution to HMM Problem 1
2.5.2 Solution to HMM Problem 2
2.5.3 Solution to HMM Problem 3
2.6 Dynamic Programming
2.7 Scaling
2.8 All Together Now
2.9 The Bottom Line
2.10 Problems

3 A Full Frontal View of Profile Hidden Markov Models
3.1 Introduction
3.2 Overview and Notation
3.3 Pairwise Alignment
3.4 Multiple Sequence Alignment
3.5 PHMM from MSA
3.6 Scoring
3.7 The Bottom Line
3.8 Problems

4 Principal Components of Principal Component Analysis
4.1 Introduction
4.2 Background
4.2.1 A Brief Review of Linear Algebra
4.2.2 Geometric View of Eigenvectors
4.2.3 Covariance Matrix
4.3 Principal Component Analysis
4.4 SVD Basics
4.5 All Together Now
4.5.1 Training Phase
4.5.2 Scoring Phase
4.6 A Numerical Example
4.7 The Bottom Line
4.8 Problems

5 A Reassuring Introduction to Support Vector Machines
5.1 Introduction
5.2 Constrained Optimization
5.2.1 Lagrange Multipliers
5.2.2 Lagrangian Duality
5.3 A Closer Look at SVM
5.3.1 Training and Scoring
5.3.2 Scoring Revisited
5.3.3 Support Vectors
5.3.4 Training and Scoring Re-revisited
5.3.5 The Kernel Trick
5.4 All Together Now
5.5 A Note on Quadratic Programming
5.6 The Bottom Line
5.7 Problems

6 A Comprehensible Collection of Clustering Concepts
6.1 Introduction
6.2 Overview and Background
6.3 K-Means
6.4 Measuring Cluster Quality
6.4.1 Internal Validation
6.4.2 External Validation
6.4.3 Visualizing Clusters
6.5 EM Clustering
6.5.1 Maximum Likelihood Estimator
6.5.2 An Easy EM Example
6.5.3 EM Algorithm
6.5.4 Gaussian Mixture Example
6.6 The Bottom Line
6.7 Problems

7 Many Mini Topics
7.1 Introduction
7.2 K-Nearest Neighbors
7.3 Neural Networks
7.4 Boosting
7.4.1 Football Analogy
7.4.2 AdaBoost
7.5 Random Forest
7.6 Linear Discriminant Analysis
7.7 Vector Quantization
7.8 Naïve Bayes
7.9 Regression Analysis
7.10 Conditional Random Fields
7.10.1 Linear Chain CRF
7.10.2 Generative vs Discriminative Models
7.10.3 The Bottom Line on CRFs
7.11 Problems

8 Data Analysis
8.1 Introduction
8.2 Experimental Design
8.3 Accuracy
8.4 ROC Curves
8.5 Imbalance Problem
8.6 PR Curves
8.7 The Bottom Line
8.8 Problems

II Applications

9 HMM Applications
9.1 Introduction
9.2 English Text Analysis
9.3 Detecting Undetectable Malware
9.3.1 Background
9.3.2 Signature-Proof Metamorphic Generator
9.3.3 Results
9.4 Classic Cryptanalysis
9.4.1 Jakobsen's Algorithm
9.4.2 HMM with Random Restarts

10 PHMM Applications
10.1 Introduction
10.2 Masquerade Detection
10.2.1 Experiments with Schonlau Dataset
10.2.2 Simulated Data with Positional Information
10.3 Malware Detection
10.3.1 Background
10.3.2 Datasets and Results

11 PCA Applications
11.1 Introduction
11.2 Eigenfaces
11.3 Eigenviruses
11.3.1 Malware Detection Results
11.3.2 Compiler Experiments
11.4 Eigenspam
11.4.1 PCA for Image Spam Detection
11.4.2 Detection Results

12 SVM Applications
12.1 Introduction
12.2 Malware Detection
12.2.1 Background
12.2.2 Experimental Results
12.3 Image Spam Revisited
12.3.1 SVM for Image Spam Detection
12.3.2 SVM Experiments
12.3.3 Improved Dataset

13 Clustering Applications
13.1 Introduction
13.2 K-Means for Malware Classification
13.2.1 Background
13.2.2 Experiments and Results
13.2.3 Discussion
13.3 EM vs K-Means for Malware Analysis
13.3.1 Experiments and Results
13.3.2 Discussion

Annotated Bibliography

Index
Preface

“Perhaps it hasn’t one,” Alice ventured to remark.


“Tut, tut, child!” said the Duchess.
“Everything’s got a moral, if only you can find it.”
— Lewis Carroll, Alice in Wonderland

For the past several years, I’ve been teaching a class on “Topics in Information
Security.” Each time I taught this course, I’d sneak in a few more machine
learning topics. For the past couple of years, the class has been turned on
its head, with machine learning being the focus, and information security
only making its appearance in the applications. Unable to find a suitable
textbook, I wrote a manuscript, which slowly evolved into this book.
In my machine learning class, we spend about two weeks on each of the
major topics in this book (HMM, PHMM, PCA, SVM, and clustering). For
each of these topics, about one week is devoted to the technical details in
Part I, and another lecture or two is spent on the corresponding applica-
tions in Part II. The material in Part I is not easy—by including relevant
applications, the material is reinforced, and the pace is more reasonable.
I also spend a week covering the data analysis topics in Chapter 8 and
several of the mini topics in Chapter 7 are covered, based on time constraints
and student interest.1
Machine learning is an ideal subject for substantive projects. In topics
classes, I always require projects, which are usually completed by pairs of stu-
dents, although individual projects are allowed. At least one week is allocated
to student presentations of their project results.
A suggested syllabus is given in Table 1. This syllabus should leave time
for tests, project presentations, and selected special topics. Note that the
applications material in Part II is intermixed with the material in Part I.
Also note that the data analysis chapter is covered early, since it’s relevant
to all of the applications in Part II.
1 Who am I kidding? Topics are selected based on my interests, not student interest.

Table 1: Suggested syllabus

Chapter Hours Coverage


1. Introduction 1 All
2. Hidden Markov Models 3 All
9. HMM Applications 2 All
8. Data Analysis 3 All
3. Profile Hidden Markov Models 3 All
10. PHMM Applications 2 All
4. Principal Component Analysis 3 All
11. PCA Applications 2 All
5. Support Vector Machines 3 All
12. SVM Applications 3 All
6. Clustering 3 All
13. Clustering Applications 2 All
7. Mini-topics 6 LDA and selected topics
Total 36

My machine learning class is taught at the beginning graduate level. For


an undergraduate class, it might be advisable to slow the pace slightly. Re-
gardless of the level, labs would likely be helpful. However, it’s important to
treat labs as supplemental to—as opposed to a substitute for—lectures.
Learning challenging technical material requires studying it multiple times
in multiple different ways, and I’d say that the magic number is three. It’s no
accident that students who read the book, attend the lectures, and conscien-
tiously work on homework problems learn this material well. If you are trying
to learn this subject on your own, the author has posted his lecture videos
online, and these might serve as a (very poor) substitute for live lectures.2
I’m also a big believer in learning by programming—the more code that you
write, the better you will learn machine learning.

Mark Stamp
Los Gatos, California
April, 2017

2 In my experience, in-person lectures are infinitely more valuable than any recorded or online format. Something happens in live classes that will never be fully duplicated in any dead (or even semi-dead) format.
Chapter 1

Introduction

I took a speed reading course and read War and Peace in twenty minutes.
It involves Russia.
— Woody Allen

1.1 What Is Machine Learning?


For our purposes, we’ll view machine learning as a form of statistical discrim-
ination, where the “machine” does the heavy lifting. That is, the computer
“learns” important information, saving us humans from the hard work of
trying to extract useful information from seemingly inscrutable data.
For the applications considered in this book, we typically train a model,
then use the resulting model to score samples. If the score is sufficiently high,
we classify the sample as being of the same type as was used to train the
model. And thanks to the miracle of machine learning, we don’t have to
work too hard to perform such classification. Since the model parameters are
(more-or-less) automatically extracted from training data, machine learning
algorithms are sometimes said to be data driven.
Machine learning techniques can be successfully applied to a wide range
of important problems, including speech recognition, natural language pro-
cessing, bioinformatics, stock market analysis, information security, and the
homework problems in this book. Additional useful applications of machine
learning seem to be found on a daily basis—the set of potential applications
is virtually unlimited.
It's possible to treat any machine learning algorithm as a black box and, in
fact, this is a major selling point of the field. Many successful machine learn-
ers simply feed data into their favorite machine learning black box, which,
surprisingly often, spits out useful results. While such an approach can work,

the primary goal of this book is to provide the reader with a deeper un-
derstanding of what is actually happening inside those mysterious machine
learning black boxes.
Why should anyone care about the inner workings of machine learning al-
gorithms when a simple black box approach can—and often does—suffice? If
you are like your curious author, you hate black boxes, and you want to know
how and why things work as they do. But there are also practical reasons
for exploring the inner sanctum of machine learning. As with any technical
field, the cookbook approach to machine learning is inherently limited. When
applying machine learning to new and novel problems, it is often essential to
have an understanding of what is actually happening “under the covers.” In
addition to being the most interesting cases, such applications are also likely
to be the most lucrative.
By way of analogy, consider a medical doctor (MD) in comparison to a
nurse practitioner (NP).1 It is often claimed that an NP can do about 80%
to 90% of the work that an MD typically does. And the NP requires less
training, so when possible, it is cheaper to have NPs treat people. But, for
challenging or unusual or non-standard cases, the higher level of training of
an MD may be essential. So, the MD deals with the most challenging and
interesting cases, and earns significantly more for doing so. The aim of this
book is to enable the reader to earn the equivalent of an MD in machine
learning.
The bottom line is that the reader who masters the material in this book
will be well positioned to apply machine learning techniques to challenging
and cutting-edge applications. Most such applications would likely be beyond
the reach of anyone with a mere black box level of understanding.

1.2 About This Book


The focus of this book is on providing a reasonable level of detail for a reason-
ably wide variety of machine learning algorithms, while constantly reinforcing
the material with realistic applications. But, what constitutes a reasonable
level of detail? I’m glad you asked.
While the goal here is for the reader to obtain a deep understanding of
the inner workings of the algorithms, there are limits.2 This is not a math
book, so we don’t prove theorems or otherwise dwell on mathematical theory.
Although much of the underlying math is elegant and interesting, we don’t
spend any more time on the math than is absolutely necessary. And, we’ll
1 A physician assistant (PA) is another medical professional that is roughly comparable to a nurse practitioner.
2 However, these limits are definitely not of the kind that one typically finds in a calculus book.

sometimes skip a few details, and on occasion, we might even be a little bit
sloppy with respect to mathematical niceties. The goal here is to present
topics at a fairly intuitive level, with (hopefully) just enough detail to clarify
the underlying concepts, but not so much detail as to become overwhelming
and bog down the presentation.3
In this book, the following machine learning topics are covered in chapter-
length detail.

Topic Where
Hidden Markov Models (HMM) Chapter 2
Profile Hidden Markov Models (PHMM) Chapter 3
Principal Component Analysis (PCA) Chapter 4
Support Vector Machines (SVM) Chapter 5
Clustering (K-Means and EM) Chapter 6

Several additional topics are discussed in a more abbreviated (section-length)


format. These mini-topics include the following.

Topic Where
K-Nearest Neighbors (K-NN) Section 7.2
Neural Networks Section 7.3
Boosting and AdaBoost Section 7.4
Random Forest Section 7.5
Linear Discriminant Analysis (LDA) Section 7.6
Vector Quantization (VQ) Section 7.7
Naı̈ve Bayes Section 7.8
Regression Analysis Section 7.9
Conditional Random Fields (CRF) Section 7.10

Data analysis is critically important when evaluating machine learning ap-


plications, yet this topic is often relegated to an afterthought. But that’s
not the case here, as we have an entire chapter devoted to data analysis and
related issues.
To access the textbook website, point your browser to

http://www.cs.sjsu.edu/~stamp/ML/

where you’ll find links to PowerPoint slides, lecture videos, and other relevant
material. An updated errata list is also available. And for the reader’s benefit,
all of the figures in this book are available in electronic form, and in color.
3 Admittedly, this is a delicate balance, and your unbalanced author is sure that he didn't always achieve an ideal compromise. But you can rest assured that it was not for lack of trying.

In addition, extensive malware and image spam datasets can be found on


the textbook website. These or similar datasets were used in many of the
applications discussed in Part II of this book.

1.3 Necessary Background


Given the title of this weighty tome, it should be no surprise that most of
the examples are drawn from the field of information security. For a solid
introduction to information security, your humble author is partial to the
book [137]. Many of the machine learning applications in this book are
specifically focused on malware. For a thorough—and thoroughly enjoyable—
introduction to malware, Aycock’s book [12] is the clear choice. However,
enough background is provided so that no outside resources should be neces-
sary to understand the applications considered here.
Many of the exercises in this book require some programming, and basic
computing concepts are assumed in a few of the application sections. But
anyone with a modest amount of programming experience should have no
trouble with this aspect of the book.
Most machine learning techniques do ultimately rest on some fancy math.
For example, hidden Markov models (HMM) build on a foundation of dis-
crete probability, principal component analysis (PCA) is based on sophisti-
cated linear algebra, Lagrange multipliers (and calculus) are used to show
how and why a support vector machine (SVM) really works, and statistical
concepts abound. We’ll review the necessary linear algebra, and generally
cover relevant math and statistics topics as needed. However, we do assume
some knowledge of differential calculus—specifically, finding the maximum
and minimum of “nice” functions.

1.4 A Few Too Many Notes


Note that the applications presented in this book are largely drawn from your
author’s industrious students’ research projects. Note also that the applica-
tions considered here were selected because they illustrate various machine
learning techniques in relatively straightforward scenarios. In particular, it is
important to note that applications were not selected because they necessarily
represent the greatest academic research in the history of academic research.
It’s a noteworthy (and unfortunate) fact of life that the primary function of
much academic research is to impress the researcher’s (few) friends with his
or her extreme cleverness, while eschewing practicality, utility, and clarity.
In contrast, the applications presented here are supposed to help demystify
machine learning techniques.
Part I

Tools of the Trade


Chapter 2

A Revealing Introduction to
Hidden Markov Models

The cause is hidden. The effect is visible to all.


— Ovid

2.1 Introduction and Background


Not surprisingly, a hidden Markov model (HMM) includes a Markov pro-
cess that is “hidden,” in the sense that we cannot directly observe the state
of the process. But we do have access to a series of observations that are
probabilistically related to the underlying Markov model.
While the formulation of HMMs might initially seem somewhat contrived,
there exist a virtually unlimited number of problems where the technique
can be applied. Best of all, there are efficient algorithms, making HMMs
extremely practical. Another very nice property of an HMM is that structure
within the data can often be deduced from the model itself.
In this chapter, we first consider a simple example to motivate the HMM
formulation. Then we dive into a detailed discussion of the HMM algorithms.
Realistic applications—mostly from the information security domain—can be
found in Chapter 9.
This is one of the most detailed chapters in the book. A reason for going
into so much depth is that once we have a solid understanding of this partic-
ular machine learning technique, we can then compare and contrast it to the
other techniques that we’ll consider. In addition, HMMs are relatively easy
to understand—although the notation can seem intimidating, once you have
the intuition, the process is actually fairly straightforward.1
1 To be more accurate, your dictatorial author wants to start with HMMs, and that's all that really matters.

The bottom line is that this chapter is the linchpin for much of the remain-
der of the book. Consequently, if you learn the material in this chapter well,
it will pay large dividends in most subsequent chapters. On the other hand,
if you fail to fully grasp the details of HMMs, then much of the remaining
material will almost certainly be more difficult than is necessary.
HMMs are based on discrete probability. In particular, we’ll need some
basic facts about conditional probability, so in the remainder of this section,
we provide a quick overview of this crucial topic.
The notation "|" denotes "given" information, so that P(X | Y) is read as
"the probability of X, given Y." For any two events A and B, we have

      P(A and B) = P(A) P(B | A).    (2.1)

For example, suppose that we draw two cards without replacement from a
standard 52-card deck. Let A = {1st card is ace} and B = {2nd card is ace}.
Then

      P(A and B) = P(A) P(B | A) = 4/52 · 3/51 = 1/221.

In this example, P(B) depends on what happens in the first event A, so we
say that A and B are dependent events. On the other hand, suppose we flip
a fair coin twice. Then the probability that the second flip comes up heads
is 1/2, regardless of the outcome of the first coin flip, so these events are
independent. For dependent events, the "given" information is relevant when
determining the sample space. Consequently, in such cases we can view the
information to the right of the "given" sign as defining the space over which
probabilities will be computed.

We can rewrite equation (2.1) as

      P(B | A) = P(A and B) / P(A).

This expression can be viewed as the definition of conditional probability.
For an important application of conditional probability, see the discussion of
naïve Bayes in Section 7.8 of Chapter 7.

We'll often use the shorthand "A, B" for the joint probability which, in
reality, is the same as "A and B." Also, in discrete probability, "A and B" is
equivalent to the intersection of the sets A and B, and sometimes we'll want
to emphasize this set intersection. Consequently, throughout this section

      P(A and B) = P(A, B) = P(A ∩ B).
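As a quick check of the arithmetic in the card example above, here is a minimal sketch using Python's standard fractions module (the snippet and variable names are illustrative, not from the book):

from fractions import Fraction

# P(A): the 1st card drawn is an ace
p_a = Fraction(4, 52)
# P(B | A): the 2nd card is an ace, given that the 1st card was an ace
p_b_given_a = Fraction(3, 51)

# Equation (2.1): P(A and B) = P(A) P(B | A)
print(p_a * p_b_given_a)   # 1/221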

Finally, matrix notation is used frequently in this chapter. A review of


matrices and basic linear algebra can be found in Section 4.2.1 of Chapter 4,
although no linear algebra is required in this chapter.

2.2 A Simple Example


Suppose we want to determine the average annual temperature at a particular
location on earth over a series of years. To make it more interesting, suppose
the years we are focused on lie in the distant past, before thermometers were
invented. Since we can’t go back in time, we instead look for indirect evidence
of the temperature.
To simplify the problem, we only consider “hot” and “cold” for the av-
erage annual temperature. Suppose that modern evidence indicates that the
probability of a hot year followed by another hot year is 0.7 and the proba-
bility that a cold year is followed by another cold year is 0.6. We’ll assume
that these probabilities also held in the distant past. This information can
be summarized as
                H     C
          H    0.7   0.3        (2.2)
          C    0.4   0.6

where H is "hot" and C is "cold."

Next, suppose that current research indicates a correlation between the
size of tree growth rings and temperature. For simplicity, we only consider
three different tree ring sizes, small, medium, and large, denoted S, M, and L,
respectively. Furthermore, suppose that based on currently available evidence,
the probabilistic relationship between annual temperature and tree
ring sizes is given by

                S     M     L
          H    0.1   0.4   0.5        (2.3)
          C    0.7   0.2   0.1
For this system, we'll say that the state is the average annual tempera-
ture, either H or C. The transition from one state to the next is a Markov
process,2 since the next state depends only on the current state and the fixed
probabilities in (2.2). However, the actual states are “hidden” since we can’t
directly observe the temperature in the past.
Although we can’t observe the state (temperature) in the past, we can
observe the size of tree rings. From (2.3), tree rings provide us with prob-
abilistic information regarding the temperature. Since the underlying states
are hidden, this type of system is known as a hidden Markov model (HMM).
Our goal is to make effective and efficient use of the observable information,
so as to gain insight into various aspects of the Markov process.
2 A Markov process where the current state only depends on the previous state is said to be of order one. In a Markov process of order n, the current state depends on the n consecutive preceding states. In any case, the "memory" is finite—much like your absent-minded author's memory, which seems to become more and more finite all the time. Let's see, now where was I?

For this HMM example, the state transition matrix is

      A = | 0.7  0.3 |        (2.4)
          | 0.4  0.6 |

which comes from (2.2), and the observation matrix is

      B = | 0.1  0.4  0.5 |        (2.5)
          | 0.7  0.2  0.1 |

which comes from (2.3). For this example, suppose that the initial state
distribution, denoted by π, is

      π = ( 0.6  0.4 ),        (2.6)

that is, the chance that we start in the H state is 0.6 and the chance that
we start in the C state is 0.4. The matrices A, B, and π are row stochastic,
which is just a fancy way of saying that each row satisfies the requirements
of a discrete probability distribution (i.e., each element is between 0 and 1,
and the elements of each row sum to 1).

Now, suppose that we consider a particular four-year period of interest
from the distant past. For this particular four-year period, we observe the
series of tree ring sizes S, M, S, L. Letting 0 represent S, 1 represent M, and 2
represent L, this observation sequence is denoted as

      O = (0, 1, 0, 2).        (2.7)

We might want to determine the most likely state sequence of the Markov
process given the observations (2.7). That is, we might want to know the most
likely average annual temperatures over this four-year period of interest. This
is not quite as clear-cut as it seems, since there are different possible inter-
pretations of “most likely.” On the one hand, we could define “most likely”
as the state sequence with the highest probability from among all possible
state sequences of length four. Dynamic programming (DP) can be used to
efficiently solve this problem. On the other hand, we might reasonably define
“most likely” as the state sequence that maximizes the expected number of
correct states. An HMM can be used to find the most likely hidden state
sequence in this latter sense.
It’s important to realize that the DP and HMM solutions to this problem
are not necessarily the same. For example, the DP solution must, by defini-
tion, include valid state transitions, while this is not the case for the HMM.
And even if all state transitions are valid, the HMM solution can still differ
from the DP solution, as we’ll illustrate in an example below.
Before going into more detail, we need to deal with the most challenging
aspect of HMMs—the notation. Once we have the notation, we’ll discuss the

three fundamental problems that HMMs enable us to solve, and we’ll give
detailed algorithms for the efficient solution of each. We also consider critical
computational issues that must be addressed when writing any HMM com-
puter program. Rabiner [113] is a standard reference for further introductory
information on HMMs.
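Before wading into the notation, it may help to see the example model written out in code. The following minimal sketch (assuming NumPy; the variable names are mine, not the book's) defines the matrices from (2.4) through (2.6) and the observation sequence (2.7); later snippets in this chapter reuse these definitions:

import numpy as np

# state 0 = H (hot), state 1 = C (cold)
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])        # state transition matrix, equation (2.4)

# observation 0 = S (small), 1 = M (medium), 2 = L (large)
B = np.array([[0.1, 0.4, 0.5],
              [0.7, 0.2, 0.1]])   # observation matrix, equation (2.5)

pi = np.array([0.6, 0.4])         # initial state distribution, equation (2.6)
O = [0, 1, 0, 2]                  # observed tree ring sizes S, M, S, L, equation (2.7)

# sanity check: A, B, and pi are row stochastic
assert np.allclose(A.sum(axis=1), 1) and np.allclose(B.sum(axis=1), 1)
assert np.isclose(pi.sum(), 1)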

2.3 Notation
The notation used in an HMM is summarized in Table 2.1. Note that the
observations are assumed to come from the set {0, 1, . . . , M − 1}, which
simplifies the notation with no loss of generality. That is, we simply associate
each of the M distinct observations with one of the elements 0, 1, . . . , M − 1,
so that O_t ∈ V = {0, 1, . . . , M − 1} for t = 0, 1, . . . , T − 1.

Table 2.1: HMM notation

Notation   Explanation
T          Length of the observation sequence
N          Number of states in the model
M          Number of observation symbols
Q          Distinct states of the Markov process, q_0, q_1, . . . , q_{N−1}
V          Possible observations, assumed to be 0, 1, . . . , M − 1
A          State transition probabilities
B          Observation probability matrix
π          Initial state distribution
O          Observation sequence, O_0, O_1, . . . , O_{T−1}

A generic hidden Markov model is illustrated in Figure 2.1, where the X_t
represent the hidden states and all other notation is as in Table 2.1. The
state of the Markov process, which we can view as being hidden behind a
"curtain" (the dashed line in Figure 2.1), is determined by the current state
and the A matrix. We are only able to observe the observations O_t, which
are related to the (hidden) states of the Markov process by the matrix B.

For the temperature example in the previous section, the observation
sequence is given in (2.7), and we have T = 4, N = 2, M = 3, Q = {H, C},
and V = {0, 1, 2}. Note that we let 0, 1, 2 represent small, medium, and large
tree rings, respectively. For this example, the matrices A, B, and π are given
by (2.4), (2.5), and (2.6), respectively.

In general, the matrix A = {a_ij} is N × N with

      a_ij = P(state q_j at t + 1 | state q_i at t).


           A           A           A                  A
   X_0 -------> X_1 -------> X_2 -------> · · · -------> X_{T−1}
 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
     |    B       |    B       |    B                  |    B
     v            v            v                       v
    O_0          O_1          O_2         · · ·       O_{T−1}

Figure 2.1: Hidden Markov model

The matrix A is always row stochastic. Also, the probabilities a_ij are
independent of t, so that the A matrix does not change. The matrix B = {b_j(k)}
is of size N × M, with

      b_j(k) = P(observation k at t | state q_j at t).

As with the A matrix, B is row stochastic, and the probabilities b_j(k) are
independent of t. The somewhat unusual notation b_j(k) is convenient when
specifying the HMM algorithms.

An HMM is defined by A, B, and π (and, implicitly, by the dimensions N
and M). Thus, we'll denote an HMM as λ = (A, B, π).

Suppose that we are given an observation sequence of length four, which
is denoted as

      O = (O_0, O_1, O_2, O_3).

The corresponding (hidden) state sequence is

      X = (x_0, x_1, x_2, x_3).

We'll let π_{x_0} denote the probability of starting in state x_0, and b_{x_0}(O_0)
denotes the probability of initially observing O_0, while a_{x_0,x_1} is the
probability of transiting from state x_0 to state x_1. Continuing, we see that the
probability of a given state sequence X of length four is

      P(X, O) = π_{x_0} b_{x_0}(O_0) a_{x_0,x_1} b_{x_1}(O_1) a_{x_1,x_2} b_{x_2}(O_2) a_{x_2,x_3} b_{x_3}(O_3).    (2.8)

Note that in this expression, the x_i represent indices in the A and B matrices,
not the names of the corresponding states.3

3 Your kindly author regrets this abuse of notation.

Consider again the temperature example in Section 2.2, where the observation
sequence is O = (0, 1, 0, 2). Using (2.8) we can compute, say,

      P(HHCC) = 0.6(0.1)(0.7)(0.4)(0.3)(0.7)(0.6)(0.1) = 0.000212.

Similarly, we can directly compute the probability of each possible state
sequence of length four, for the given observation sequence in (2.7). We have
listed these results in Table 2.2, where the probabilities in the last column
have been normalized so that they sum to 1.

Table 2.2: State sequence probabilities

State    Probability    Normalized probability
HHHH      0.000412           0.042787
HHHC      0.000035           0.003635
HHCH      0.000706           0.073320
HHCC      0.000212           0.022017
HCHH      0.000050           0.005193
HCHC      0.000004           0.000415
HCCH      0.000302           0.031364
HCCC      0.000091           0.009451
CHHH      0.001098           0.114031
CHHC      0.000094           0.009762
CHCH      0.001882           0.195451
CHCC      0.000564           0.058573
CCHH      0.000470           0.048811
CCHC      0.000040           0.004154
CCCH      0.002822           0.293073
CCCC      0.000847           0.087963

To find the optimal state sequence in the dynamic programming (DP)
sense, we simply choose the sequence with the highest probability, which in
this example is CCCH. To find the optimal state sequence in the HMM
sense, we choose the most probable symbol at each position. To this end we
sum the probabilities in Table 2.2 that have an H in the first position. Doing
so, we find the (normalized) probability of H in the first position is 0.18817
and the probability of C in the first position is 0.81183. Therefore, the first
element of the optimal sequence (in the HMM sense) is C. Repeating this for
each element of the sequence, we obtain the probabilities in Table 2.3.

From Table 2.3, we find that the optimal sequence—in the HMM sense—
is CHCH. Note that in this example, the optimal DP sequence differs from
the optimal HMM sequence.

Table 2.3: HMM probabilities

           Position in state sequence
              0          1          2          3
P(H)      0.188182   0.519576   0.228788   0.804029
P(C)      0.811818   0.480424   0.771212   0.195971
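The numbers in Tables 2.2 and 2.3 are easy to reproduce by brute force. Here is a minimal sketch (reusing the NumPy arrays A, B, pi, and O defined in the earlier snippet) that enumerates all N^T state sequences, applies equation (2.8), and then reads off both the DP-optimal and the HMM-optimal sequences:

from itertools import product

names = "HC"                        # state 0 = H, state 1 = C
T, N = len(O), len(pi)

# P(X, O) for every state sequence X, via equation (2.8)
probs = {}
for X in product(range(N), repeat=T):
    p = pi[X[0]] * B[X[0], O[0]]
    for t in range(1, T):
        p *= A[X[t-1], X[t]] * B[X[t], O[t]]
    probs[X] = p

total = sum(probs.values())         # P(O | lambda), the sum of the Probability column

# DP sense: the single most probable state sequence (largest row of Table 2.2)
dp_best = max(probs, key=probs.get)
print("DP optimal: ", "".join(names[i] for i in dp_best))     # CCCH

# HMM sense: the most probable state at each position (Table 2.3)
hmm_best = ""
for t in range(T):
    col = [sum(p for X, p in probs.items() if X[t] == i) / total for i in range(N)]
    hmm_best += names[col.index(max(col))]
print("HMM optimal:", hmm_best)                               # CHCH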

2.4 The Three Problems


There are three fundamental problems that we can solve using HMMs. Here,
we briefly describe each of these problems, then in the next section we discuss
efficient algorithms for their solution.

2.4.1 HMM Problem 1


Given the model λ = (A, B, π) and a sequence of observations O, determine
P(O | λ). That is, we want to compute a score for the observed sequence O
with respect to the given model λ.

2.4.2 HMM Problem 2


Given λ = (A, B, π) and an observation sequence O, find an optimal state
sequence for the underlying Markov process. In other words, we want to
uncover the hidden part of the hidden Markov model. This is the problem
that was discussed in some detail above.

2.4.3 HMM Problem 3


Given an observation sequence O and the parameter N, determine a model
of the form λ = (A, B, π) that maximizes the probability of O. This can
be viewed as training a model to best fit the observed data. We'll solve
this problem using a discrete hill climb on the parameter space represented
by A, B, and π. Note that the dimension M is determined from the training
sequence O.

2.4.4 Discussion
Consider, for example, the problem of speech recognition—which happens
to be one of the earliest and best-known applications of HMMs. We can
use the solution to HMM Problem 3 to train an HMM λ to, for example,
recognize the spoken word "yes." Then, given an unknown spoken word,
we can use the solution to HMM Problem 1 to score this word against this
model λ and determine the likelihood that the word is "yes." In this case, we
don't need to solve HMM Problem 2, but it is possible that such a solution—
which uncovers the hidden states—might provide additional insight into the
underlying speech model.

2.5 The Three Solutions


2.5.1 Solution to HMM Problem 1
Let λ = (A, B, π) be a given HMM and let O = (O_0, O_1, . . . , O_{T−1}) be a
series of observations. We want to find P(O | λ).

Let X = (x_0, x_1, . . . , x_{T−1}) be a state sequence. Then by the definition
of B we have

      P(O | X, λ) = b_{x_0}(O_0) b_{x_1}(O_1) · · · b_{x_{T−1}}(O_{T−1})

and by the definition of π and A it follows that

      P(X | λ) = π_{x_0} a_{x_0,x_1} a_{x_1,x_2} · · · a_{x_{T−2},x_{T−1}}.

Since

      P(O, X | λ) = P(O ∩ X ∩ λ) / P(λ)

and

      P(O | X, λ) P(X | λ) = [P(O ∩ X ∩ λ) / P(X ∩ λ)] · [P(X ∩ λ) / P(λ)] = P(O ∩ X ∩ λ) / P(λ)

we have

      P(O, X | λ) = P(O | X, λ) P(X | λ).

Summing over all possible state sequences X yields

      P(O | λ) = Σ_X P(O, X | λ)
               = Σ_X P(O | X, λ) P(X | λ)                                                   (2.9)
               = Σ_X π_{x_0} b_{x_0}(O_0) a_{x_0,x_1} b_{x_1}(O_1) · · · a_{x_{T−2},x_{T−1}} b_{x_{T−1}}(O_{T−1}).

The direct computation in (2.9) is generally infeasible, since the number
of multiplications is about 2T N^T, where T is typically large and N ≥ 2. One
of the major strengths of HMMs is that there exists an efficient algorithm to
achieve this same result.

To determine P(O | λ) in an efficient manner, we can use the following
approach. For t = 0, 1, . . . , T − 1 and i = 0, 1, . . . , N − 1, define

      α_t(i) = P(O_0, O_1, . . . , O_t, x_t = q_i | λ).    (2.10)

Then α_t(i) is the probability of the partial observation sequence up to time t,
where the underlying Markov process is in state q_i at time t.

The crucial insight is that the α_t(i) can be computed recursively—and
efficiently. This recursive approach is known as the forward algorithm, or
α-pass, and is given in Algorithm 2.1.

Algorithm 2.1 Forward algorithm

1: Given:
      Model λ = (A, B, π)
      Observations O = (O_0, O_1, . . . , O_{T−1})
2: for i = 0, 1, . . . , N − 1 do
3:     α_0(i) = π_i b_i(O_0)
4: end for
5: for t = 1, 2, . . . , T − 1 do
6:     for i = 0, 1, . . . , N − 1 do
7:         α_t(i) = [ Σ_{j=0}^{N−1} α_{t−1}(j) a_{ji} ] b_i(O_t)
8:     end for
9: end for

The forward algorithm only requires about N²T multiplications. This
is in stark contrast to the naïve approach, which has a work factor of more
than 2T N^T. Since T is typically large and N is relatively small, the forward
algorithm is highly efficient.

It follows from the definition in (2.10) that

      P(O | λ) = Σ_{i=0}^{N−1} α_{T−1}(i).

Hence, the forward algorithm gives us an efficient way to compute a score for
a given sequence O, relative to a given model λ.
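As a concrete (unscaled) rendering of Algorithm 2.1, the sketch below computes P(O | λ) for the temperature example, again assuming the A, B, pi, and O defined earlier; for long observation sequences the scaled version of Section 2.7 should be used instead:

def forward(A, B, pi, O):
    """Forward algorithm (alpha-pass); returns alpha with shape (T, N)."""
    T, N = len(O), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, O[0]]                  # alpha_0(i) = pi_i b_i(O_0)
    for t in range(1, T):
        # alpha_t(i) = [ sum_j alpha_{t-1}(j) a_{ji} ] b_i(O_t)
        alpha[t] = (alpha[t-1] @ A) * B[:, O[t]]
    return alpha

alpha = forward(A, B, pi, O)
print(alpha[-1].sum())    # P(O | lambda): equals the sum of the Probability column in Table 2.2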

2.5.2 Solution to HMM Problem 2


Given the model λ = (A, B, π) and a sequence of observations O, our goal
here is to find the most likely state sequence. As mentioned above, there are
different possible interpretations of "most likely"—for an HMM, we maximize
the expected number of correct states. In contrast, a dynamic program finds
the highest-scoring overall path. As we have seen, these solutions are not
necessarily the same.

First, we define

      β_t(i) = P(O_{t+1}, O_{t+2}, . . . , O_{T−1} | x_t = q_i, λ)

for t = 0, 1, . . . , T − 1, and i = 0, 1, . . . , N − 1. The β_t(i) can be computed
recursively (and efficiently) using the backward algorithm, or β-pass, which is
given here in Algorithm 2.2. This is analogous to the α-pass discussed above,
except that we start at the end and work back toward the beginning.

Algorithm 2.2 Backward algorithm

1: Given:
      Model λ = (A, B, π)
      Observations O = (O_0, O_1, . . . , O_{T−1})
2: for i = 0, 1, . . . , N − 1 do
3:     β_{T−1}(i) = 1
4: end for
5: for t = T − 2, T − 3, . . . , 0 do
6:     for i = 0, 1, . . . , N − 1 do
7:         β_t(i) = Σ_{j=0}^{N−1} a_{ij} b_j(O_{t+1}) β_{t+1}(j)
8:     end for
9: end for

Now, for t = 0, 1, . . . , T − 1 and i = 0, 1, . . . , N − 1, define

      γ_t(i) = P(x_t = q_i | O, λ).

Since α_t(i) measures the relevant probability up to time t and β_t(i) measures
the relevant probability after time t, we have

      γ_t(i) = α_t(i) β_t(i) / P(O | λ).

Recall that the denominator P(O | λ) is obtained by summing α_{T−1}(i) over i.
From the definition of γ_t(i) it follows that the most likely state at time t is
the state q_i for which γ_t(i) is maximum, where the maximum is taken over
the index i. That is, the most likely state at time t is given by

      x̂_t = argmax_i γ_t(i).
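Continuing the running sketch (same assumed A, B, pi, O, and the forward function above), the backward pass and the γ's can be computed as follows; the printed matrix reproduces the rows of Table 2.3:

def backward(A, B, O, N):
    """Backward algorithm (beta-pass); returns beta with shape (T, N)."""
    T = len(O)
    beta = np.zeros((T, N))
    beta[T-1] = 1.0                             # beta_{T-1}(i) = 1
    for t in range(T-2, -1, -1):
        # beta_t(i) = sum_j a_{ij} b_j(O_{t+1}) beta_{t+1}(j)
        beta[t] = A @ (B[:, O[t+1]] * beta[t+1])
    return beta

alpha = forward(A, B, pi, O)
beta = backward(A, B, O, len(pi))
prob_O = alpha[-1].sum()                        # P(O | lambda)

gamma = alpha * beta / prob_O                   # gamma_t(i), shape (T, N)
print(gamma.T)                                  # row 0 = P(H), row 1 = P(C), as in Table 2.3
print("".join("HC"[i] for i in gamma.argmax(axis=1)))    # most likely state at each time: CHCH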

2.5.3 Solution to HMM Problem 3


Here we want to adjust the model parameters to best fit the given observations.
The sizes of the matrices (N and M) are known, while the elements
of A, B, and π are to be determined, subject to row stochastic conditions.
The fact that we can efficiently re-estimate the model itself is perhaps the
more impressive aspect of HMMs.

For t = 0, 1, . . . , T − 2 and i, j ∈ {0, 1, . . . , N − 1}, define the "di-gammas"
as

      γ_t(i, j) = P(x_t = q_i, x_{t+1} = q_j | O, λ).

Then γ_t(i, j) is the probability of being in state q_i at time t and transiting to
state q_j at time t + 1. The di-gammas can be written in terms of α, β, A,
and B as

      γ_t(i, j) = α_t(i) a_{ij} b_j(O_{t+1}) β_{t+1}(j) / P(O | λ).

For t = 0, 1, . . . , T − 2, we see that γ_t(i) and γ_t(i, j) are related by

      γ_t(i) = Σ_{j=0}^{N−1} γ_t(i, j).

Once the γ_t(i, j) have been computed, the model λ = (A, B, π) is re-estimated
using Algorithm 2.3. The HMM training algorithm is known as Baum-Welch
re-estimation, and is named after Leonard E. Baum and Lloyd R. Welch, who
developed the technique in the late 1960s while working at the Center for
Communications Research (CCR),4 which is part of the Institute for Defense
Analyses (IDA), located in Princeton, New Jersey.

The numerator of the re-estimated a_ij in Algorithm 2.3 can be seen to
give the expected number of transitions from state q_i to state q_j, while the
denominator is the expected number of transitions from q_i to any state.5
Hence, the ratio is the probability of transiting from state q_i to state q_j,
which is the desired value of a_ij.

The numerator of the re-estimated b_j(k) in Algorithm 2.3 is the expected
number of times the model is in state q_j with observation k, while the
denominator is the expected number of times the model is in state q_j. Therefore,
the ratio is the probability of observing symbol k, given that the model is in
state q_j, and this is the desired value for b_j(k).

Re-estimation is an iterative process. First, we initialize λ = (A, B, π)
with a reasonable guess, or, if no reasonable guess is available, we choose
4 Not to be confused with Creedence Clearwater Revival [153].
5 When re-estimating the A matrix, we are dealing with expectations. However, it might make things clearer to think in terms of frequency counts. For frequency counts, it would be easy to compute the probability of transitioning from state i to state j. That is, we would simply count the number of transitions from state i to state j, and divide this count by the total number of times we could be in state i. This is the intuition behind the re-estimation formula for the A matrix, and a similar statement holds when re-estimating the B matrix. In other words, don't let all of the fancy notation obscure the relatively simple ideas that are at the core of the re-estimation process.

Algorithm 2.3 Baum-Welch re-estimation

1: Given:
      γ_t(i), for t = 0, 1, . . . , T − 1 and i = 0, 1, . . . , N − 1
      γ_t(i, j), for t = 0, 1, . . . , T − 2 and i, j ∈ {0, 1, . . . , N − 1}
2: for i = 0, 1, . . . , N − 1 do
3:     π_i = γ_0(i)
4: end for
5: for i = 0, 1, . . . , N − 1 do
6:     for j = 0, 1, . . . , N − 1 do
7:         a_ij = Σ_{t=0}^{T−2} γ_t(i, j) / Σ_{t=0}^{T−2} γ_t(i)
8:     end for
9: end for
10: for j = 0, 1, . . . , N − 1 do
11:     for k = 0, 1, . . . , M − 1 do
12:         b_j(k) = Σ_{t ∈ {0,1,...,T−1}, O_t = k} γ_t(j) / Σ_{t=0}^{T−1} γ_t(j)
13:     end for
14: end for

random values such that π_i ≈ 1/N and a_ij ≈ 1/N and b_j(k) ≈ 1/M. It's
critical that A, B, and π be randomized, since exactly uniform values will
result in a local maximum from which the model cannot climb. And, as
always, A, B, and π must be row stochastic.

The complete solution to HMM Problem 3 can be summarized as follows.

1. Initialize λ = (A, B, π).

2. Compute α_t(i), β_t(i), γ_t(i, j), and γ_t(i).

3. Re-estimate the model λ = (A, B, π) using Algorithm 2.3.

4. If P(O | λ) increases, goto 2.

In practice, we would want to stop when P(O | λ) does not increase by some
predetermined threshold, say, ε. We could also (or alternatively) set a
maximum number of iterations. In any case, it's important to verify that the
model has converged, which can usually be determined by perusing the B
matrix.6
6 While it might seem obvious to stop iterating when the change in P(O | λ) is small, this requires some care in practice. Typically, the change in P(O | λ) is very small over the first several iterations. The model then goes through a period of rapid improvement—at which point the model has converged—after which the change in P(O | λ) is again small. Consequently, if we simply set a threshold, the re-estimation process might stop immediately, or it might continue indefinitely. Perhaps the optimal approach is to combine a threshold with a minimum number of iterations—the pseudo-code in Section 2.8 uses this approach.
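To make the re-estimation step concrete, here is a compact, unscaled sketch of one Baum-Welch iteration built on the forward and backward functions from the earlier snippets (the helper name is mine, and in practice the scaled quantities of Section 2.7 would be used to avoid underflow):

def baum_welch_step(A, B, pi, O):
    """One iteration of Baum-Welch re-estimation; returns updated (A, B, pi)."""
    T, N, M = len(O), A.shape[0], B.shape[1]
    alpha, beta = forward(A, B, pi, O), backward(A, B, O, N)
    prob_O = alpha[-1].sum()

    # di-gammas: digamma[t, i, j] = gamma_t(i, j); gamma[t, i] = gamma_t(i)
    digamma = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        digamma[t] = alpha[t][:, None] * A * (B[:, O[t+1]] * beta[t+1])[None, :] / prob_O
    gamma = alpha * beta / prob_O

    # re-estimate pi, A, and B as in Algorithm 2.3
    new_pi = gamma[0]
    new_A = digamma.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros((N, M))
    for k in range(M):
        match = np.array(O) == k
        new_B[:, k] = gamma[match].sum(axis=0) / gamma.sum(axis=0)
    return new_A, new_B, new_pi

A1, B1, pi1 = baum_welch_step(A, B, pi, O)   # one hill-climbing step; iterate until P(O | lambda) stops improving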

2.6 Dynamic Programming


Before completing our discussion of the elementary aspects of HMMs, we
make a brief detour to show the close relationship between dynamic
programming (DP) and HMMs. The executive summary is that a DP can be
viewed as an α-pass where "sum" is replaced by "max." More precisely, for A,
B, and π as above, the dynamic programming algorithm, which is also known
as the Viterbi algorithm, is given in Algorithm 2.4.

Algorithm 2.4 Dynamic programming

1: Given:
      Model λ = (A, B, π)
      Observations O = (O_0, O_1, . . . , O_{T−1})
2: for i = 0, 1, . . . , N − 1 do
3:     δ_0(i) = π_i b_i(O_0)
4: end for
5: for t = 1, 2, . . . , T − 1 do
6:     for i = 0, 1, . . . , N − 1 do
7:         δ_t(i) = max_{j ∈ {0,1,...,N−1}} [ δ_{t−1}(j) a_{ji} b_i(O_t) ]
8:     end for
9: end for

At each successive t, a dynamic program determines the probability of
the best path ending at each of the states i = 0, 1, . . . , N − 1. Consequently,
the probability of the best overall path is

      max_{i ∈ {0,1,...,N−1}} δ_{T−1}(i).    (2.11)

It is important to realize that (2.11) only gives the optimal probability,


not the corresponding path. By keeping track of each preceding state, the DP
procedure given here can be augmented so that we can recover the optimal
path by tracing back from the highest-scoring final state.
Consider again the example in Section 2.2. The initial probabilities are
P(H) = π_0 b_0(0) = 0.6(0.1) = 0.06 and P(C) = π_1 b_1(0) = 0.4(0.7) = 0.28.
The probabilities of the paths of length two are given by

      P(HH) = 0.06(0.7)(0.4) = 0.0168

      P(HC) = 0.06(0.3)(0.2) = 0.0036
      P(CH) = 0.28(0.4)(0.4) = 0.0448
      P(CC) = 0.28(0.6)(0.2) = 0.0336

and hence the best (most probable) path of length two ending with H is CH
while the best path of length two ending with C is CC. Continuing, we
construct the diagram in Figure 2.2 one level or stage at a time, where each
arrow points to the next element in the optimal path ending at a given state.
Note that at each stage, the dynamic programming algorithm only needs
to maintain the highest-scoring path ending at each state—not a list of all
possible paths. This is the key to the efficiency of the algorithm.

            t = 0       t = 1       t = 2        t = 3
      H:     .06        .0448      .003136      .002822
      C:     .28        .0336      .014112      .000847

Figure 2.2: Dynamic programming

In Figure 2.2, the maximum final probability is 0.002822, which occurs
at the final state H. We can use the arrows to trace back from H to find
that the optimal path is CCCH. Note that this agrees with the brute force
calculation in Table 2.2.
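The trellis of Figure 2.2 is easy to reproduce in code. Below is a minimal Viterbi sketch with backpointers (again assuming the A, B, pi, and O defined earlier; the function name is mine) that recovers the path CCCH and the probability 0.002822:

def viterbi(A, B, pi, O):
    """Dynamic programming (Viterbi); returns the optimal path and its probability."""
    T, N = len(O), len(pi)
    delta = np.zeros((T, N))              # delta[t, i] = best probability of a path ending in state i at time t
    back = np.zeros((T, N), dtype=int)    # back[t, i] = best predecessor of state i at time t
    delta[0] = pi * B[:, O[0]]
    for t in range(1, T):
        for i in range(N):
            scores = delta[t-1] * A[:, i] * B[i, O[t]]
            back[t, i] = int(np.argmax(scores))
            delta[t, i] = scores[back[t, i]]
    # trace back from the highest-scoring final state
    path = [int(np.argmax(delta[T-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    path.reverse()
    return path, delta[T-1].max()

path, p = viterbi(A, B, pi, O)
print("".join("HC"[i] for i in path), p)   # CCCH 0.0028224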
Underflow is a concern with a dynamic programming problem of this
form—since we compute products of probabilities, the result will tend to 0.
Fortunately, underflow is easily avoided by simply taking logarithms. An
underflow-resistant version of DP is given in Algorithm 2.5.

Algorithm 2.5 Dynamic programming without underflow

1: Given:
      Model λ = (A, B, π)
      Observations O = (O_0, O_1, . . . , O_{T−1})
2: for i = 0, 1, . . . , N − 1 do
3:     δ̂_0(i) = log( π_i b_i(O_0) )
4: end for
5: for t = 1, 2, . . . , T − 1 do
6:     for i = 0, 1, . . . , N − 1 do
7:         δ̂_t(i) = max_{j ∈ {0,1,...,N−1}} [ δ̂_{t−1}(j) + log(a_{ji}) + log(b_i(O_t)) ]
8:     end for
9: end for

Not surprisingly, for the underflow-resistant version in Algorithm 2.5, the
optimal score is given by

      max_{i ∈ {0,1,...,N−1}} δ̂_{T−1}(i).

Again, additional bookkeeping is required to determine the optimal path.
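In code, the underflow-resistant version of Algorithm 2.5 amounts to replacing products with sums of logarithms in the Viterbi sketch above. A minimal variant (same assumed variables) is:

# log-space Viterbi scores: delta now holds log probabilities
with np.errstate(divide="ignore"):         # tolerate log(0) = -inf for impossible transitions
    log_A, log_B, log_pi = np.log(A), np.log(B), np.log(pi)

delta = np.zeros((len(O), len(pi)))
delta[0] = log_pi + log_B[:, O[0]]
for t in range(1, len(O)):
    for i in range(len(pi)):
        delta[t, i] = np.max(delta[t-1] + log_A[:, i] + log_B[i, O[t]])
print(np.exp(delta[-1].max()))             # same optimal probability as before, 0.0028224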

2.7 Scaling
The three HMM solutions in Section 2.5 all require computations involving
products of probabilities. It's very easy to see, for example, that α_t(i) tends
to 0 exponentially as T increases. Therefore, any attempt to implement the
HMM algorithms as given in Section 2.5 will inevitably result in underflow.
The solution to this underflow problem is to scale the numbers. However,
care must be taken to ensure that the algorithms remain valid.

First, consider the computation of α_t(i). The basic recurrence is

      α_t(i) = Σ_{j=0}^{N−1} α_{t−1}(j) a_{ji} b_i(O_t).

It seems sensible to normalize each α_t(i) by dividing by

      Σ_{j=0}^{N−1} α_t(j).

Following this approach, we compute scaling factors c_t and the scaled α_t(i),
which we denote as α̂_t(i), as in Algorithm 2.6.
To verify Algorithm 2.6 we first note that α̂_0(i) = c_0 α_0(i). Here α̃_t(i)
denotes the intermediate (unscaled) value computed on line 11 of Algorithm 2.6,
before multiplication by the scaling factor c_t. Now suppose that for some t,
we have

      α̂_t(i) = c_0 c_1 · · · c_t α_t(i).    (2.12)

Then

      α̂_{t+1}(i) = c_{t+1} α̃_{t+1}(i)
                 = c_{t+1} Σ_{j=0}^{N−1} α̂_t(j) a_{ji} b_i(O_{t+1})
                 = c_0 c_1 · · · c_t c_{t+1} Σ_{j=0}^{N−1} α_t(j) a_{ji} b_i(O_{t+1})
                 = c_0 c_1 · · · c_{t+1} α_{t+1}(i)

and hence (2.12) holds, by induction, for all t.


2.7 SCALING 23

Algorithm 2.6 Scaling factors

 1: Given:
        α_t(i), for t = 0, 1, ..., T − 1 and i = 0, 1, ..., N − 1
 2: for i = 0, 1, ..., N − 1 do
 3:     α̃_0(i) = α_0(i)
 4: end for
 5: c_0 = 1 / Σ_{j=0}^{N−1} α̃_0(j)
 6: for i = 0, 1, ..., N − 1 do
 7:     α̂_0(i) = c_0 α̃_0(i)
 8: end for
 9: for t = 1, 2, ..., T − 1 do
10:     for i = 0, 1, ..., N − 1 do
11:         α̃_t(i) = Σ_{j=0}^{N−1} α̂_{t−1}(j) a_{ji} b_i(O_t)
12:     end for
13:     c_t = 1 / Σ_{j=0}^{N−1} α̃_t(j)
14:     for i = 0, 1, ..., N − 1 do
15:         α̂_t(i) = c_t α̃_t(i)
16:     end for
17: end for

From (2.12) and the definitions of α̂ and α̃ it follows that

    α̂_t(i) = α_t(i) / Σ_{j=0}^{N−1} α_t(j).                     (2.13)

From equation (2.13) we see that for all t and i, the desired scaled value
of α_t(i) is indeed given by α̂_t(i).
From (2.13) it follows that

    Σ_{i=0}^{N−1} α̂_{T−1}(i) = 1.

Also, from (2.12) we have

    Σ_{i=0}^{N−1} α̂_{T−1}(i) = c_0 c_1 ··· c_{T−1} Σ_{i=0}^{N−1} α_{T−1}(i)
                             = c_0 c_1 ··· c_{T−1} P(O | λ).

Combining these results gives us

    P(O | λ) = 1 / Π_{j=0}^{T−1} c_j.

It follows that we can compute the log of P(O | λ) directly from the scaling
factors c_j as

    log P(O | λ) = − Σ_{j=0}^{T−1} log c_j.                      (2.14)

It is fairly easy to show that the same scale factors c_t can be used in
the backward algorithm by simply computing β̂_t(i) = c_t β_t(i). We then deter-
mine γ_t(i, j) and γ_t(i) using the same formulae as in Section 2.5, but with α̂_t(i)
and β̂_t(i) in place of α_t(i) and β_t(i), respectively. The resulting gammas and
di-gammas are then used to re-estimate π, A, and B.
By writing the original re-estimation formulae (as given in lines 3, 7,
and 12 of Algorithm 2.3) directly in terms of α_t(i) and β_t(i), it is a straight-
forward exercise to show that the re-estimated π, A, and B are exact
when α̂_t(i) and β̂_t(i) are used in place of α_t(i) and β_t(i). Furthermore,
P(O | λ) isn't required in the re-estimation formulae, since in each case it
cancels in the numerator and denominator. Therefore, (2.14) determines a
score for the model, which can be used, for example, to decide whether the
model is improving sufficiently to continue to the next iteration of the training
algorithm.
In practice, P(O | λ) might change very little for the first several iterations.
The model then goes through a period of rapid improvement—at which point
the model has converged—after which the change in P(O | λ) is again small.
Consequently, if we simply set a threshold, the re-estimation process might
stop immediately, or it might continue indefinitely. Perhaps the optimal
approach is to combine a threshold with a minimum number of iterations—the
pseudo-code in Section 2.8 uses this approach.
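
To illustrate the scaled forward computation and equation (2.14), here is a
minimal Python sketch that computes log P(O | λ) using the scaling factors c_t
of Algorithm 2.6. The function name and the NumPy representation are
illustrative assumptions; the example values are the model of Section 2.2 and
the observation sequence O = (0, 1, 0, 2) used in Problem 3.

import numpy as np

def scaled_forward_log_prob(pi, A, B, O):
    """Scaled alpha-pass; returns log P(O | lambda) via equation (2.14)."""
    T = len(O)
    c = np.zeros(T)                       # scaling factors c_t

    alpha = pi * B[:, O[0]]               # alpha_0(i)
    c[0] = 1.0 / alpha.sum()
    alpha = c[0] * alpha                  # scaled alpha-hat_0(i)

    for t in range(1, T):
        alpha = (alpha @ A) * B[:, O[t]]  # alpha-tilde_t(i) from the scaled alpha-hat_{t-1}(j)
        c[t] = 1.0 / alpha.sum()
        alpha = c[t] * alpha              # scaled alpha-hat_t(i)

    return -np.sum(np.log(c))             # log P(O | lambda) = -sum_j log c_j

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]])
print(np.exp(scaled_forward_log_prob(pi, A, B, [0, 1, 0, 2])))   # approximately 0.009629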

2.8 All Together Now


Here, we give complete pseudo-code for solving HMM Problem 3, including
scaling. This pseudo-code also provides virtually everything needed to solve
HMM Problems 1 and 2.

1. Given

   Observation sequence O = (O_0, O_1, ..., O_{T−1}).

2. Initialize

   (a) Select N and determine M from O. Recall that the model is de-
       noted λ = (A, B, π), where A = {a_{ij}} is N × N, B = {b_j(k)}
       is N × M, and π = {π_i} is 1 × N.

   (b) Initialize the three matrices A, B, and π. You can use knowl-
       edge of the problem when generating initial values, but if no such
       information is available (as is often the case), let π_i ≈ 1/N and
       let a_{ij} ≈ 1/N and b_j(k) ≈ 1/M. Always be sure that your initial
       values satisfy the row stochastic conditions (i.e., the elements of
       each row sum to 1, and each element is between 0 and 1). Also,
       make sure that the elements of each row are not exactly uniform.

   (c) Initialize each of the following.

       minIters = minimum number of re-estimation iterations
       ε = threshold representing negligible improvement in model
       iters = 0
       oldLogProb = −∞

3. Forward algorithm or α-pass

   // compute α_0(i)
   c_0 = 0
   for i = 0 to N − 1
       α_0(i) = π_i b_i(O_0)
       c_0 = c_0 + α_0(i)
   next i
   // scale the α_0(i)
   c_0 = 1/c_0
   for i = 0 to N − 1
       α_0(i) = c_0 α_0(i)
   next i
   // compute α_t(i)
   for t = 1 to T − 1
       c_t = 0
       for i = 0 to N − 1
           α_t(i) = 0
           for j = 0 to N − 1
               α_t(i) = α_t(i) + α_{t−1}(j) a_{ji}
           next j
           α_t(i) = α_t(i) b_i(O_t)
           c_t = c_t + α_t(i)
       next i
       // scale α_t(i)
       c_t = 1/c_t
       for i = 0 to N − 1
           α_t(i) = c_t α_t(i)
       next i
   next t

4. Backward algorithm or β-pass

   // Let β_{T−1}(i) = 1, scaled by c_{T−1}
   for i = 0 to N − 1
       β_{T−1}(i) = c_{T−1}
   next i
   // β-pass
   for t = T − 2 to 0 by −1
       for i = 0 to N − 1
           β_t(i) = 0
           for j = 0 to N − 1
               β_t(i) = β_t(i) + a_{ij} b_j(O_{t+1}) β_{t+1}(j)
           next j
           // scale β_t(i) with same scale factor as α_t(i)
           β_t(i) = c_t β_t(i)
       next i
   next t
5. Compute the gammas and di-gammas

   for t = 0 to T − 2
       denom = 0
       for i = 0 to N − 1
           for j = 0 to N − 1
               denom = denom + α_t(i) a_{ij} b_j(O_{t+1}) β_{t+1}(j)
           next j
       next i
       for i = 0 to N − 1
           γ_t(i) = 0
           for j = 0 to N − 1
               γ_t(i, j) = α_t(i) a_{ij} b_j(O_{t+1}) β_{t+1}(j) / denom
               γ_t(i) = γ_t(i) + γ_t(i, j)
           next j
       next i
   next t
   // Special case for γ_{T−1}(i)
   denom = 0
   for i = 0 to N − 1
       denom = denom + α_{T−1}(i)
   next i
   for i = 0 to N − 1
       γ_{T−1}(i) = α_{T−1}(i)/denom
   next i

6. Re-estimate the model λ = (A, B, π)

   // re-estimate π
   for i = 0 to N − 1
       π_i = γ_0(i)
   next i
   // re-estimate A
   for i = 0 to N − 1
       for j = 0 to N − 1
           numer = 0
           denom = 0
           for t = 0 to T − 2
               numer = numer + γ_t(i, j)
               denom = denom + γ_t(i)
           next t
           a_{ij} = numer/denom
       next j
   next i
   // re-estimate B
   for i = 0 to N − 1
       for j = 0 to M − 1
           numer = 0
           denom = 0
           for t = 0 to T − 1
               if (O_t == j) then
                   numer = numer + γ_t(i)
               end if
               denom = denom + γ_t(i)
           next t
           b_i(j) = numer/denom
       next j
   next i
7. Compute log P(O | λ)

   logProb = 0
   for j = 0 to T − 1
       logProb = logProb + log(c_j)
   next j
   logProb = −logProb
8. To iterate or not to iterate, that is the question.

   iters = iters + 1
   δ = |logProb − oldLogProb|
   if (iters < minIters or δ > ε) then
       oldLogProb = logProb
       goto 3.
   else
       return λ = (A, B, π)
   end if
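
The pseudo-code above translates almost line for line into a short program.
The NumPy sketch below is one possible rendering, with hypothetical function
and parameter names; it follows the scaled α-pass, scaled β-pass, gamma and
di-gamma computation, re-estimation, and stopping test of steps 3 through 8,
and initializes A, B, and π to roughly uniform row-stochastic values as
described in step 2.

import numpy as np

def baum_welch(O, N, M, min_iters=100, eps=1e-4, seed=0):
    """Train an HMM lambda = (A, B, pi) on an observation sequence O of
    integers in 0..M-1, following the scaled pseudo-code of Section 2.8."""
    O = np.asarray(O)
    T = len(O)
    rng = np.random.default_rng(seed)

    # Step 2: initialize A, B, pi to roughly uniform, row-stochastic values
    def row_stochastic(rows, cols):
        m = 1.0 + 0.01 * rng.random((rows, cols))   # near-uniform, not exactly uniform
        return m / m.sum(axis=1, keepdims=True)

    A = row_stochastic(N, N)
    B = row_stochastic(N, M)
    pi = row_stochastic(1, N)[0]

    old_log_prob = -np.inf
    log_prob = -np.inf
    for iters in range(1, 100000):                  # safety cap on the number of iterations
        # Step 3: forward algorithm (alpha-pass) with scaling factors c_t
        alpha = np.zeros((T, N))
        c = np.zeros(T)
        alpha[0] = pi * B[:, O[0]]
        c[0] = 1.0 / alpha[0].sum()
        alpha[0] *= c[0]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, O[t]]
            c[t] = 1.0 / alpha[t].sum()
            alpha[t] *= c[t]

        # Step 4: backward algorithm (beta-pass), scaled with the same c_t
        beta = np.zeros((T, N))
        beta[T - 1] = c[T - 1]
        for t in range(T - 2, -1, -1):
            beta[t] = c[t] * (A @ (B[:, O[t + 1]] * beta[t + 1]))

        # Step 5: gammas and di-gammas
        gamma = np.zeros((T, N))
        digamma = np.zeros((T - 1, N, N))
        for t in range(T - 1):
            digamma[t] = alpha[t][:, None] * A * (B[:, O[t + 1]] * beta[t + 1])[None, :]
            digamma[t] /= digamma[t].sum()
            gamma[t] = digamma[t].sum(axis=1)
        gamma[T - 1] = alpha[T - 1] / alpha[T - 1].sum()

        # Step 6: re-estimate pi, A, and B
        pi = gamma[0]
        A = digamma.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        for k in range(M):
            B[:, k] = gamma[O == k].sum(axis=0) / gamma.sum(axis=0)

        # Step 7: log P(O | lambda) from the scaling factors
        log_prob = -np.sum(np.log(c))

        # Step 8: iterate until the improvement is negligible
        if iters >= min_iters and abs(log_prob - old_log_prob) <= eps:
            break
        old_log_prob = log_prob

    return A, B, pi, log_prob

# Hypothetical usage on a toy symbol sequence with N = 2 states and M = 3 symbols:
#   A, B, pi, score = baum_welch([0, 1, 0, 2, 1, 0, 2, 2, 1, 0], N=2, M=3, min_iters=50)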

2.9 The Bottom Line


Hidden Markov models are powerful, efficient, and extremely useful in prac-
tice. Virtually no assumptions need to be made, yet the HMM process can
extract significant statistical information from data. Thanks to efficient train-
ing and scoring algorithms, HMMs are practical, and they have proven useful
in a wide range of applications. Even in cases where the underlying assump-
tion of a (hidden) Markov process is questionable, HMMs are often applied
with success. In Chapter 9 we consider selected applications of HMMs. Most
of these applications are in the field of information security.
In subsequent chapters, we often compare and contrast other machine
learning techniques to HMMs. Consequently, a clear understanding of the
material in this chapter is crucial before proceeding with the remainder of
the book. The homework problems should help the dedicated reader to clarify
any remaining issues. And the applications in Chapter 9 are highly recom-
mended, with the English text example in Section 9.2 being especially highly
recommended.

2.10 Problems

When faced with a problem you do not understand,


do any part of it you do understand, then look at it again.
— Robert Heinlein

1. Suppose that we train an HMM and obtain the model λ = (A, B, π)
   where

       A = | 0.7  0.3 |     B = | 0.1  0.4  0.5 |     π = | 0.0  1.0 |.
           | 0.4  0.6 |         | 0.7  0.2  0.1 |

   Furthermore, suppose the hidden states correspond to H and C, re-
   spectively, while the observations are S, M, and L, which are mapped
   to 0, 1, and 2, respectively. In this problem, we consider the observation
   sequence O = (O_0, O_1, O_2) = (M, S, L) = (1, 0, 2).

   a) Directly compute P(O | λ). That is, compute

          P(O | λ) = Σ_X P(O, X | λ)

      using the probabilities in λ = (A, B, π) for each of the following
      cases, based on the given observation sequence O.

          P(O, X = HHH) = ___ · ___ · ___ · ___ · ___ · ___ =
          P(O, X = HHC) = ___ · ___ · ___ · ___ · ___ · ___ =
          P(O, X = HCH) = ___ · ___ · ___ · ___ · ___ · ___ =
          P(O, X = HCC) = ___ · ___ · ___ · ___ · ___ · ___ =
          P(O, X = CHH) = ___ · ___ · ___ · ___ · ___ · ___ =
          P(O, X = CHC) = ___ · ___ · ___ · ___ · ___ · ___ =
          P(O, X = CCH) = 1.0 · 0.2 · 0.6 · 0.7 · 0.4 · 0.5 =
          P(O, X = CCC) = ___ · ___ · ___ · ___ · ___ · ___ =

      The desired probability is the sum of these eight probabilities.
   b) Compute P(O | λ) using the α pass. That is, compute

          α_0(0) = ___ · ___ =
          α_0(1) = 1.0 · 0.2 =
          α_1(0) = (___ · ___ + ___ · ___) · ___ =
          α_1(1) = (___ · ___ + ___ · ___) · ___ =
          α_2(0) = (___ · ___ + ___ · ___) · ___ =
          α_2(1) = (___ · ___ + ___ · ___) · ___ =

      where we initialize

          α_0(i) = π_i b_i(O_0), for i = 0, 1, ..., N − 1

      and the recurrence is

          α_t(i) = ( Σ_{j=0}^{N−1} α_{t−1}(j) a_{ji} ) b_i(O_t)

      for t = 1, 2, ..., T − 1 and i = 0, 1, ..., N − 1. The desired probability
      is given by

          P(O | λ) = Σ_{i=0}^{N−1} α_{T−1}(i).

   c) In terms of N and T, and counting only multiplications, what is the
      work factor for the method in part a)? What is the work factor for
      the method in part b)?

2. For this problem, use the same model λ and observation sequence O
   given in Problem 1.

   a) Determine the best hidden state sequence (x_0, x_1, x_2) in the dy-
      namic programming sense.
   b) Determine the best hidden state sequence (x_0, x_1, x_2) in the HMM
      sense.

3. Summing the numbers in the “probability” column of Table 2.2, we
   find P(O | λ) = 0.009629 for O = (0, 1, 0, 2).

   a) By a similar direct calculation, compute P(O | λ) for each observa-
      tion sequence of the form O = (O_0, O_1, O_2, O_3), where O_i ∈ {0, 1, 2}.
      Verify that Σ P(O | λ) = 1, where the sum is over the observation
      sequences of length four. Note that you will need to use the proba-
      bilities for A, B, and π given in equations (2.4), (2.5), and (2.6) in
      Section 2.2, respectively.
   b) Use the forward algorithm to compute P(O | λ) for the same obser-
      vation sequences and model as in part a). Verify that you obtain
      the same results as in part a).

4. From equation (2.9) and the definition of α_t(i) in equation (2.10), it
   follows that

       α_t(i) = Σ_X π_{x_0} b_{x_0}(O_0) a_{x_0,x_1} b_{x_1}(O_1) ··· a_{x_{t−2},x_{t−1}} b_{x_{t−1}}(O_{t−1}) a_{x_{t−1},i} b_i(O_t)

   where the sum is over the state sequences X = (x_0, x_1, ..., x_{t−1}). Use
   this expression for α_t(i) to directly verify the forward algorithm recurrence

       α_t(i) = ( Σ_{j=0}^{N−1} α_{t−1}(j) a_{ji} ) b_i(O_t).

5. As discussed in this chapter, the forward algorithm is used to solve HMM
   Problem 1, while the forward algorithm and backward algorithm to-
   gether are used to compute the gammas, which are then used to solve
   HMM Problem 2.

   a) Explain how you can solve HMM Problem 1 using the backward
      algorithm instead of the forward algorithm.
   b) Using the model λ = (A, B, π) and the observation sequence O in
      Problem 1, compute P(O | λ) using the backward algorithm, and
      verify that you obtain the same result as when using the forward
      algorithm.
6. This problem deals with the Baum-Welch re-estimation algorithm.

   a) Write the re-estimation formulae, as given in lines 3, 7, and 12 of
      Algorithm 2.3, directly in terms of the α_t(i) and β_t(i).
   b) Using the re-estimation formulae obtained in part a), substitute the
      scaled values α̂_t(i) and β̂_t(i) for α_t(i) and β_t(i), respectively, and
      show that the resulting re-estimation formulae are exact.
7. Instead of using c_t to scale the β_t(i), we can scale each β_t(i) by

       d_t = 1 / Σ_{j=0}^{N−1} β̃_t(j)

   where the definition of β̃_t(i) is analogous to that of α̃_t(i) as given in
   Algorithm 2.6.

   a) Using the scaling factors c_t and d_t, show that the Baum-Welch re-
      estimation formulae in Algorithm 2.3 are exact with α̂ and β̂ in place
      of α and β.
   b) Write log P(O | λ) in terms of c_t and d_t.
8. When training, the elements of λ can be initialized to approximately
   uniform. That is, we let π_i ≈ 1/N and a_{ij} ≈ 1/N and b_j(k) ≈ 1/M,
   subject to the row stochastic conditions. In Section 2.5.3, it is stated
   that it is a bad idea to initialize the values to exactly uniform, since
   the HMM would be stuck at a local maximum and hence it could not
   climb to an improved solution. Suppose that π_i = 1/N and a_{ij} = 1/N
   and b_j(k) = 1/M. Verify that the re-estimation process leaves all of
   these values unchanged.
9. In this problem, we consider generalizations of the HMM formulation
   discussed in this chapter.

   a) Consider an HMM where the state transition matrix is time depen-
      dent. Then for each t, there is an N × N row-stochastic A_t = {a_{ij}^t}
      that is used in place of A in the HMM computations. For such an
      HMM, provide pseudo-code to solve HMM Problem 1.
   b) Consider an HMM of order two, that is, an HMM where the un-
      derlying Markov process is of order two. Then the state at time t
      depends on the states at times t − 1 and t − 2. For such an HMM,
      provide pseudo-code to solve HMM Problem 1.

10. Write an HMM program for the English text problem in Section 9.2 of
    Chapter 9. Test your program on each of the following cases.

    a) There are N = 2 hidden states. Explain your results.
    b) There are N = 3 hidden states. Explain your results.
    c) There are N = 4 hidden states. Explain your results.
    d) There are N = 26 hidden states. Explain your results.

11. In this problem, you will use an HMM to break a simple substitution
    ciphertext message. For each HMM, train using 200 iterations of the
    Baum-Welch re-estimation algorithm.

    a) Obtain an English plaintext message of 50,000 plaintext characters,
       where the characters consist only of lower case a through z (i.e., re-
       move all punctuation, special characters, and spaces, and convert all
       upper case to lower case). Encrypt this plaintext using a randomly
       generated shift of the alphabet. Remember the key.
    b) Train an HMM with N = 2 and M = 26 on your ciphertext from
       part a). From the final B matrix, determine the ciphertext letters
       that correspond to consonants and vowels.
    c) Generate a digraph frequency matrix E for English text, where e_{ij} is
       the count of the number of times that letter i is followed by letter j.
       Here, we assume that a is letter 0, b is letter 1, c is letter 2, and so on.
       This matrix must be based on 1,000,000 characters where, as above,
       only the 26 letters of the alphabet are used. Next, add five to each
       element in your 26 × 26 matrix E. Finally, normalize your matrix E
       by dividing each element by its row sum. The resulting matrix E
       will be row stochastic, and it will not contain any 0 probabilities.
    d) Train an HMM with N = M = 26, using the first 1000 characters of
       ciphertext you generated in part a), where the A matrix is initialized
       with your E matrix from part c). Also, in your HMM, do not re-
       estimate A. Use the final B matrix to determine a putative key
       and give the fraction of putative key elements that match the actual
       key (as a decimal, to four places). For example, if 22 of the 26 key
       positions are correct, then your answer would be 22/26 = 0.8462.

12. Write an HMM program to solve the problem discussed in Section 9.2,
replacing English text with the following.

a) French text.
b) Russian text.
c) Chinese text.

13. Perform an HMM analysis similar to that discussed in Section 9.2, re-
placing English with “Hamptonese,” the mysterious writing system de-
veloped by James Hampton. For information on Hamptonese, see

http://www.cs.sjsu.edu/faculty/stamp/Hampton/hampton.html

14. Since HMM training is a hill climb, we are only assured of reaching a
local maximum. And, as with any hill climb, the specific local maximum
that we find will depend on our choice of initial values. Therefore, by
training a hidden Markov model multiple times with different initial
values, we would expect to obtain better results than when training
only once.
In the paper [16], the authors use an expectation maximization (EM)
approach with multiple random restarts as a means of attacking ho-
mophonic substitution ciphers. An analogous HMM-based technique is
analyzed in the report [158], where the effectiveness of multiple ran-
dom restarts on simple substitution cryptanalysis is explored in detail.
Multiple random restarts are especially helpful in the most challenging
cases, that is, when little data (i.e., ciphertext) is available. However,
the tradeoff is that the work factor can be high, since the number of
restarts required may be very large (millions of random restarts are
required in some cases).

    a) Obtain an English plaintext message consisting of 1000 plaintext
       characters, consisting only of lower case a through z (i.e., remove all
       punctuation, special characters, and spaces, and convert all upper
       case letters to lower case). Encrypt this plaintext using a randomly
       selected shift of the alphabet. Remember the key. Also generate a
       digraph frequency matrix E, as discussed in part c) of Problem 11.
    b) Train n HMMs, for each of n = 1, n = 10, n = 100, and n = 1000,
       following the same process as in Problem 11, part d), but using
       the T = 1000 observations generated in part a) of this problem.
       For a given n, select the best result based on the model scores and
       give the fraction of the putative key that is correct, calculated as in
       Problem 11, part d).
    c) Repeat part b), but only use the first T = 400 observations.
    d) Repeat part c), but only use the first T = 300 observations.

15. The Zodiac Killer murdered at least five people in the San Francisco Bay
Area in the late 1960s and early 1970s. Although police had a prime
suspect, no arrest was ever made and the murders remain officially
unsolved. The killer sent several messages to the police and to local
newspapers, taunting police for their failure to catch him. One of these

messages contained a homophonic substitution consisting of 408 strange
symbols.7 Not surprisingly, this cipher is known as the Zodiac 408.
Within days of its release, the Zodiac 408 was broken by Donald and
Bettye Harden, who were schoolteachers from Salinas, California. The
Zodiac 408 plaintext appears below; the ciphertext itself consists of strange
symbols that are not reproduced here.
I L I K E K I L L I N G P E O P L
E B E C A U S E I T I S S O M U C
H F U N I T I S M O R E F U N T H
A N K I L L I N G W I L D G A M E
I N T H E F O R R E S T B E C A U
S E M A N I S T H E M O S T D A N
G E R O U E A N A M A L O F A L L
T O K I L L S O M E T H I N G G I
V E S M E T H E M O S T T H R I L
L I N G E X P E R E N C E I T I S
E V E N B E T T E R T H A N G E T
T I N G Y O U R R O C K S O F F W
I T H A G I R L T H E B E S T P A
R T O F I T I S T H A E W H E N I
D I E I W I L L B E R E B O R N I
N P A R A D I C E A N D A L L T H
E I H A V E K I L L E D W I L L B
E C O M E M Y S L A V E S I W I L
L N O T G I V E Y O U M Y N A M E
B E C A U S E Y O U W I L L T R Y
T O S L O I D O W N O R A T O P M
Y C O L L E C T I O G O F S L A V
E S F O R M Y A F T E R L I F E E
B E O R I E T E M E T H H P I T I
Note the (apparently intentional) misspellings in the plaintext, includ-
ing “FORREST”, “ANAMAL”, and so on. Also, the final 18 characters
(underlined in the plaintext above) appear to be random filler.
a) Solve the Zodiac 408 cipher using the HMM approach discussed in
   Section 9.4. Initialize the A matrix as in part c) of Problem 11,
   and do not re-estimate A. Use 1000 random restarts of the HMM,
   and 200 iterations of Baum-Welch re-estimation in each case. Give
   your answer as the percentage of characters of the actual plaintext
   that are recovered correctly.
b) Repeat part a), but use 10,000 random restarts.
c) Repeat part b), but use 100,000 random restarts.
d) Repeat part c), but use 1,000,000 random restarts.
Footnote 7: The Zodiac 408 ciphertext was actually sent in three parts to local newspapers.
Here, we give the complete message, where the three parts have been combined into one. Also,
a homophonic substitution is like a simple substitution, except that the mapping is many-
to-one, that is, multiple ciphertext symbols can map to one plaintext symbol.
e) Repeat part a), except also re-estimate the A matrix.
f) Repeat part b), except also re-estimate the A matrix.
g) Repeat part c), except also re-estimate the A matrix.
h) Repeat part d), except also re-estimate the A matrix.

16. In addition to the Zodiac 408 cipher, the Zodiac Killer (see Problem 15)
    released a similar-looking cipher with 340 symbols. This cipher is known
    as the Zodiac 340 and remains unsolved to this day.8

    a) Repeat Problem 15, parts a) through d), using the Zodiac 340 in
       place of the Zodiac 408. Since the plaintext is unknown, in each
       case, simply print the decryption obtained from your highest scoring
       model.
    b) Repeat part a) of this problem, except use parts e) through h) of
       Problem 15.

Footnote 8: It is possible that the Zodiac 340 is not a cipher at all, but instead just a random
collection of symbols designed to frustrate would-be cryptanalysts. If that’s the case, your
easily frustrated author can confirm that the “cipher” has been wildly successful.
Chapter 3

A Full Frontal View of Profile


Hidden Markov Models

The sciences do not try to explain,


they hardly even try to interpret,
they mainly make models.
— John von Neumann

3.1 Introduction
Here, we introduce the concept of a profile hidden Markov model (PHMM).
The material in this chapter builds directly on Chapter 2 and we’ll assume
that the reader has a good understanding of HMMs.
Recall that the key reason that HMMs are so popular and useful is that
there are efficient algorithms to solve each of the three problems that arise—
training, scoring, and uncovering the hidden states. But, there are significant
restrictions inherent in the HMM formulation, which limit the usefulness of
HMMs in some important applications.
Perhaps the most significant limitation of an HMM is the Markov as-
sumption, that is, the current state depends only on the previous state. The
time-invariant nature of an HMM is a closely related issue.1 These limita-
tions make the HMM algorithms fast and efficient, but they prevent us from
making use of positional information within observation sequences. For some
types of problems, such information is critically important.
Footnote 1: According to your self-referential author's comments in Chapter 2, we can consider
higher order Markov processes, in which case the current state can depend on n consecutive
previous states. But, the machinery becomes unwieldy, even for relatively small n. And,
even if we consider higher order Markov processes, we still treat all positions in the sequence
the same, as this only changes how far back in history we look.

A PHMM can be viewed as a series of HMMs where, in effect, we define
a new B matrix at each offset in the training data. Recall from Chapter 2
that for an HMM, the B matrix contains probability distributions that relate
the observations to the hidden states. Furthermore, in an HMM, the B ma-
trix represents the average behavior over the training sequence. By having
multiple B matrices, in a PHMM we can make explicit use of positional infor-
mation contained in training sequences. But, before we can determine such B
matrices, we must first align multiple training sequences. In contrast, for an
HMM there is no need to align training sequences, since the position within a
sequence—relative to other training sequences—is irrelevant. Consequently,
for an HMM we can (and do) simply append the training data into one long
observation super-sequence. But, when training a PHMM, alignment of mul-
tiple observation sequences is at the very heart of the technique.
Another potential issue with HMMs is that we do not explicitly account
for insertions or deletions that might occur relative to the training sequence.
Generally, this is not a problem with HMMs, since the technique is statisti-
cal in nature, and the average behavior will not be significantly affected by
a few insertions or deletions. However, if we do want to account for posi-
tional information—as in a PHMM—then we’ll need to deal explicitly with
the possibility of insertions and deletions, since a single insertion or dele-
tion could cause two otherwise similar sequences to align badly. Again, with
a standard hidden Markov model, extraneous elements (i.e., insertions) or
“missing” elements (i.e., deletions) within a sequence are ignored when train-
ing and scoring, since a small number of such elements will have a negligible
impact on the resulting model.
Aligning the observation sequences is the most challenging part of train-
ing a profile hidden Markov model. Although there are many ways to align
multiple sequences, doing so efficiently and in a reasonably stable manner
is certainly challenging. But, once we have aligned the sequences, comput-
ing the matrices that define the actual PHMM is trivial. In this sense, the
PHMM training process is virtually the opposite of that used in an HMM.
When training an HMM, we simply append the observation sequences, which
can be viewed as a trivial “alignment,” but when determining the matri-
ces that define the HMM, clever algorithms (forward algorithm, backward
algorithm, and Baum-Welch re-estimation) are used. In contrast, when train-
ing a PHMM, we use clever algorithms to align the training sequences, but
once this has been done, constructing the PHMM matrices is easy.
In this chapter, we first provide an overview of PHMMs. Then we consider
a few simple examples to illustrate important aspects of the technique. In
particular, we focus on sequence alignment, since that is the most challenging
aspect of training a PHMM. In Chapter 10, we consider realistic applications
of PHMMs to problems in information security, including malware detection
and masquerade detection.

3.2 Overview and Notation


To train a PHMM, we must first construct a multiple sequence alignment
(MSA) from a set of training sequences. As the name suggests, an MSA
consists of an alignment of several different (training) sequences. Finding an
optimal simultaneous alignment of multiple sequences is computationally in-
feasible, so instead we’ll first determine pairwise alignments and then combine
the pairwise alignments into an MSA.
When aligning sequences in a PHMM, we allow gaps to be introduced,
which enables us to align more symbols. However, the more gaps that are
present, the easier it is for a sequence to match during the scoring phase,
and hence the less specific—and the less informative—is the resulting model.
Therefore, we want to penalize gaps. In addition, certain matches might be
better than other types of matches, and a near miss should be penalized less
than a bad miss. To account for such cases, we employ a substitution matrix,
which is discussed in more detail below. But, before we turn our attention to
the detail of pairwise alignment and MSA construction, we discuss PHMM
notation, which differs significantly from the notation used for HMMs.
For the remainder of this section, we'll only consider PHMM state tran-
sitions, which correspond to the elements of the A matrix in an HMM. For
now, we do not deal with emission2 probabilities, which correspond to the B
matrix in an HMM. Before we can sensibly discuss emissions, we need to
fully develop the ideas behind pairwise alignments and multiple sequence
alignments—topics that we'll cover in Sections 3.3 and 3.4, respectively.
In a PHMM, we’ll distinguish between three types of states, namely,
match, insert, and delete states. A match state is essentially equivalent to
a state in a standard HMM, while insert and delete states arise from allow-
ing insertions and deletions when aligning sequences. This will all be made
more precise below, but first let’s consider the simplest case, where every
state in the MSA is a match state. Then the state transitions are entirely
straightforward—in Figure 3.1, we illustrate a PHMM that has � = 4 match
states, and no insert or delete states. Again, the diagram in Figure 3.1
only deals with state transitions, and does not include any information about
emissions. Also, for notational convenience, we’ll sometimes refer to the begin
state as �0 and the end state as �� +1 .

begin �1 �2 �3 �4 end

Figure 3.1: PHMM without gaps


Footnote 2: PHMM emissions are the same as HMM observations.

In most real-world applications, we need to insert gaps into an MSA when


aligning sequences. If we have too many gaps, then the emission probabilities
are unreliable. We refer to such an unreliable position as an insert state. Note
that an insert state can follow a match state (or the begin state) and that
multiple insertions can occur before we transition to the next match state.
When including both match and insert states, we can model PHMM state
transitions as illustrated in Figure 3.2.

    [Figure 3.2: PHMM with insertions. Insert states I_0, I_1, I_2, I_3, I_4 sit
     above the match chain begin → M_1 → M_2 → M_3 → M_4 → end; a match
     (or begin) state may transition to the insert state that follows it, an
     insert state may return to itself, and an insert state may transition on
     to the next match (or end) state.]

Note that in the pure-match PHMM in Figure 3.1, we always transition
from match state M_i to match state M_{i+1}. However, when insert states are
also included, as in Figure 3.2, we now have the possibility of multiple transi-
tions from each state, and hence the transitions themselves are probabilistic.
This is a major modification, as compared to an HMM.
A delete state is a state where no emission occurs. We model each deletion
as skipping a match state, and hence consecutive deletions correspond to
skipping consecutive match states.3 In addition, after a deletion (or series of
deletions), we must transition to an emission state and, consequently, only a
match (or insert) state can follow a delete state.
In Figure 3.3, we illustrate a PHMM that includes both match and delete
states. As with a PHMM that includes insertions, this simplified PHMM
allows for different types of state transitions.
Generically, a PHMM includes match, insert, and delete states, as illus-
trated in Figure 3.4. This illustration is essentially the PHMM equivalent of
the hidden states (and transitions) in an HMM. The rather complicated il-
lustration in Figure 3.4 only accounts for the state transitions in the PHMM,
and does not include any information about emissions. In comparison to the
generic HMM illustrated in Figure 2.1 of Chapter 2, the PHMM illustration
in Figure 3.4 only deals with the hidden part of the model. That is, Figure 3.4
only deals with the structure of the � matrix in a PHMM.
Footnote 3: Actually, a deletion can skip a match or insert state, or some combination thereof.
However, we’ll keep things simple at this point, and only consider the interaction between
match and delete states.
    [Figure 3.3: PHMM with deletions. Delete states D_1, D_2, D_3, D_4 sit
     above the match chain begin → M_1 → M_2 → M_3 → M_4 → end, allowing
     one or more match states to be skipped.]

    [Figure 3.4: Profile hidden Markov model. The full model combines the
     match chain begin → M_1 → M_2 → M_3 → M_4 → end with the insert
     states I_0, ..., I_4 and the delete states D_1, ..., D_4.]

The standard notation for a PHMM [47] is summarized in Table 3.1.
Note that the A matrix includes all of the transitions that are illustrated
in Figure 3.4. That is, in addition to transitions of the type a_{M_i M_{i+1}}, we
have match-to-delete transitions a_{M_i D_{i+1}}, insert-to-match transitions,
delete-to-insert transitions, and so on.
It may be instructive to compare the PHMM notation in Table 3.1 to the
standard HMM notation, which is given in Table 2.1 in Chapter 2. In an
HMM, we refer to observed symbols, while a PHMM has emitted symbols.
In a PHMM (as with an HMM), we associate the emitted (observed) symbols
with integers. There are many other similarities and a few notable differences
that should become clear in the discussion below.
Next, we turn our attention to generating pairwise alignments. Then we
consider the process of constructing a multiple sequence alignment (MSA)
from a collection of pairwise alignments, and we show how to generate the
PHMM matrices from an MSA. Finally, we consider PHMM scoring, which
is slightly more complex than scoring with an HMM, due primarily to the
greater complexity in the state transitions.

Table 3.1: PHMM notation

    Notation          Explanation
    X                 Emitted symbols, x_1, x_2, ..., x_T, where T ≤ N + 1
    N                 Number of states
    M                 Match states, M_1, M_2, ..., M_N
    I                 Insert states, I_0, I_1, ..., I_N
    D                 Delete states, D_1, D_2, ..., D_N
    π                 Initial state distribution
    A                 State transition probability matrix
    a_{M_i M_{i+1}}   Transition probability from M_i to M_{i+1}
    E                 Emission probability matrix
    e_{M_i}(k)        Emission probability of symbol k at state M_i
    λ                 The PHMM, λ = (A, E, π)

3.3 Pairwise Alignment


To train a PHMM, we need an MSA, which we’ll construct from a collection
of pairwise alignments. Consequently, we’ll first consider a method to align
a pair of sequences from a given training set.
Ideally, we would like to globally align a pair of sequences, that is, we want
an alignment that accounts for as many elements as possible. However, we
also want to minimize the number of gaps that are inserted, since gaps tend to
weaken the resulting alignment by making it more generic. By using a local
alignment strategy instead of a global approach, we can often significantly
reduce the number of gaps. The tradeoff is that a local alignment may not
utilize all of the information available in the sequences.
To simplify the local alignment problem, we’ll illustrate such an alignment
where only the initial and ending parts of the sequences can remain unaligned.
For example, suppose that we want to align the sequences

CBCBJILIIJEJE and GCBJIIIJJEG.

In Table 3.2 we give a global alignment of these sequences, and we illustrate


a local alignment where the initial and final parts of the sequences are not
aligned. Note that we use “-” to represent an inserted gap, while “*” is an
omitted symbol (i.e., omitted from consideration in the local alignment), and
“|” indicates that the corresponding elements are aligned.
For the global alignment in Table 3.2, we are able to align nine out of
fifteen of the positions (i.e., 60%), while for the local alignment, eight of the
ten positions under consideration are correctly aligned (80%). Consequently,
the model resulting from this local alignment is likely to be more faithful to

Table 3.2: Global vs local alignment

    Unaligned sequences:
        CBCBJILIIJEJE
        GCBJIIIJJEG

    Global alignment:
        -CBCBJILIIJEJE-
         |  ||| ||| ||
        GC--BJI-IIJ-JEG

    Local alignment:
        ***CBJILII-JE**
           |||| || ||
        ***CBJI-IIJJE**

the training data—and, in a sense, stronger—as compared to the model we


obtain from the global alignment. Therefore, in practice we’ll most likely
want to consider local alignments. However, to simplify the presentation, in
the remainder of this chapter, we only consider global alignments.
To construct a pairwise alignment, it's standard practice to use dynamic
programming. For dynamic programming (see Section 2.6 of Chapter 2), we
must have meaningful scores when comparing elements. In the context of
sequence alignment, we'll specify an n × n substitution matrix S, where n is
the number of distinct symbols.
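
Before turning to an example, here is a minimal Python sketch of global
pairwise alignment by dynamic programming. It uses a simple match/mismatch
score and a linear gap penalty in place of a full substitution matrix; the
particular scores, the gap penalty, and the function name are illustrative
assumptions, so the alignment produced need not match Table 3.2 exactly.

def global_align(x, y, match=2, mismatch=-1, gap=-2):
    """Global (Needleman-Wunsch-style) alignment of sequences x and y.

    Returns the optimal score and one optimal alignment, using "-" for gaps.
    A full substitution matrix could be used in place of the simple
    match/mismatch score below.
    """
    m, n = len(x), len(y)
    # score[i][j] = best score for aligning x[:i] with y[:j]
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        score[i][0] = i * gap
    for j in range(1, n + 1):
        score[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # align x[i-1] with y[j-1]
                              score[i - 1][j] + gap,       # gap inserted in y
                              score[i][j - 1] + gap)       # gap inserted in x
    # Trace back to recover one optimal alignment
    ax, ay = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if x[i - 1] == y[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                ax.append(x[i - 1]); ay.append(y[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-")
            i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1])
            j -= 1
    return score[m][n], "".join(reversed(ax)), "".join(reversed(ay))

# The two training sequences used in Table 3.2
print(global_align("CBCBJILIIJEJE", "GCBJIIIJJEG"))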
For example, consider the problem of masquerade detection [61], where
we want to detect an attacker who has gained access to a legitimate user’s
account. Such an attacker might try to evade detection by masquerading as
the legitimate user. That is, the attacker might attempt to behave in almost
the same manner as the legitimate user, so as to avoid triggering any warning
based on unusual activity. Suppose that in a simplified masquerade detection
system, we monitor the four operations in Table 3.3.

Table 3.3: Masquerade detection example

Notation Explanation
E Send email
G Play games
C C programming
J Java programming

In a masquerade detection system, we’ll collect information from an active


user and compare the user’s behavior to the expected behavior of the currently
logged-in user. If the behavior differs significantly, then we flag the user as a
Exploring the Variety of Random
Documents with Different Content
Thus, in spite of considerable diversity as to incidental conditions,
city and State were closely bound up with each other in the
development of political society. We find no city apart from a State,
and it is doubtful whether there was a State without a city as the
seat and centre of its political power. But this correlation obtained
only during the period of the genesis of States and of the attendant
rise of the original city. Once States have come into existence, many
other conditions may lead to the establishment of a community
which, as regards extent and relative political independence, is of
the nature of a city. Such phenomena may be referred to as the
secondary foundation of cities; they are possible only on the basis of
a previously existing political society. An approximation to original
conditions occurs when a victorious State either establishes cities in
the conquered provinces, centralizing in them the power over the
respective territories, or transforms cities that already exist into
political centres. Occurrences of this sort were frequent during the
extension of Alexander's world-dominion and at the time of the
Roman Empire. The same fact may be observed at a later period, in
connection with the occupation of the Italian cities by the Goths and
Lombards. The German cities founded during the Middle Ages differ
still more widely from the original type. These cities first arose as
market centres, and then gradually acquired political privileges.
Thus, the process of the original foundation of cities was, as it were,
reversed. In the latter case, the castle came first and the market
followed; the mediæval city began as a market and reached its
completion with the building of a castle. In mediæval times,
however, leadership was not originally vested in the city but in rulers
who occupied isolated estates scattered here and there throughout
the country. Yet these secondary phenomena and their further
development do not belong to our present problem of the origin of
political society.

8. THE BEGINNINGS OF THE LEGAL SYSTEM.


The social regulations which we have thus far considered find their
consummation in the legal system. This possesses no content
independent of the various social institutions, but merely provides
certain norms of action with a social sanction. As a result, these
norms are protected against violation or are designated as
regulations which, whenever necessary, are defended against
violators by the use of external force. Thus, the legal system does
not involve the outright creation of a social order. It consists
primarily in the singling out, as definite prescriptions, of certain
regulations that have already arisen in the course of social life, and
that are for the most part already maintained by custom. The
enforcement of these regulations is expressly guaranteed by society,
and means are established whereby this pledge is to be redeemed.
Thus, the most important social institutions—the family, the classes,
the vocations, village settlements and cities, and also the relations of
property, intercourse, and contract, which these involve—were
already in existence before becoming constituent parts of a legal
system. Moreover, the advance beyond custom and the settlement of
difficulties case by case was not made suddenly or, much less, at the
same time in all regions, but came only very gradually. The
formulation of laws did not, as a rule, begin in connection with the
political community and then pass down to the more restricted
groups, ending with the single individual. On the contrary, law began
by regulating the intercourse of individuals; later, it acquired
authority over family relations, which had remained under the
shelter of custom for a relatively long period; last of all, it asserted
itself also over the political order. That is to say, the State, which is
the social organization from which the legal system took its rise, was
the very last institution in connection with which objective legal
forms were developed. We may account for this by reference to a
factor which played an important rôle from the very outset. After the
legal system had once grown up out of custom and had subjected
many of the important fields of the latter to its authority, it was able
of itself to create regulations, which were thus from the very
beginning legal prescriptions. Such primarily legal regulations arose
in connection with conditions in which, frequently, the fact that there
be some law was of more importance than the precise character of
the law. But even in these cases the regulations were always
connected with the larger body of law that was rooted in custom.
This larger body of law was but supplemented by ordinances that
were called into being by temporal and cultural conditions.
The transition from custom to law reflects the joint influence of two
factors, which, particularly at the outset, were themselves closely
connected. The first of these factors consists in the rise of firmly
established forms of rulership, which are indicative also of the
transition leading to States; the other is the religious sanction which
was attached to those regulations that were singled out by the law
from the broader field of custom. Both factors indicate that the
heroic age properly marks the origin of the legal system, even
though it be true that all such changes are gradual and that
occasional beginnings of the legal system, therefore, may be found
at an earlier period, in connection with the very ancient institution of
chieftainship. As regards the external social organization and the
religious life of the heroic age, these are characterized, respectively,
by the development of strict forms of rulership and by the origin of a
deity cult. Each of these social phenomena reinforces the other. The
kingdom of the gods was but the terrestrial State projected into an
ideal sphere. No less was the development of the legal system
dependent upon the union of the two factors. Neither the external
force of the political authority governing the individual nor the inner
constraint of religious duty sufficed in itself to establish the
tremendous power characteristic of the legal system from early
times on. It is true that, at a later period, the feeling that law
represents a religious duty gave way to the moral law of conscience.
The latter, however, itself owes its origin to the increasing influence
of the political authority which is at the basis of the legal system;
moreover, as an inner motive reinforcing the external compulsion of
the law, it continued to preserve a similarity to the religious source
from which it sprang. True, a significant change occurred. During the
early stages of legal development, the weight of emphasis fell on the
religious aspect of law, whereas it later more and more shifted to the
political side. At first, the entire body of law was regarded as having
been given directly by the deity, as was the case, for example, with
the Ten Commandments of Moses and with the Israelitic Priests'
Code, which clothes even the most external modes of life in the garb
of religious commands. Sometimes a twofold credit is given for the
introduction of the legal system, in that the one who wields the
power is regarded as administering justice both in his own name and
as commissioned by the gods. An illustration of this is the Babylonian
code of Hammurabi. It is, naturally, when the priests wield the
authority that the laws are most apt to be ascribed exclusively to the
gods. The tendency, on the other hand, to give the ruler a certain
amount of credit for legislative enactments, is greatest whenever the
ruler occupies also the position of chief priest. The direct impetus to
such a union of priesthood and political authority is to be found in
the rise of the legal system itself, for this resulted from a fusion of
religious and political motives. The idea that the earthly ruler is the
terrestrial representative of a world-governing deity, or, as occurs in
extreme cases, that he is the world-governing deity himself, is,
therefore, a conception that is closely bound up with the rise of
political society and that receives pregnant expression in the earliest
forms of the legal system. No trace of such a conception was
associated with the chiefs of the totemic period. Their position was
entirely distinct from that of the magicians, the shamans, and the
medicine-men, who were the original representatives of the priestly
class that later arose in the age of deity cults. But it is for this very
reason that the mandates of the totemic chief cannot be said as yet
to have constituted a legal system; they were commands which were
given as occasion demanded, and which were determined partly by
the will of the chief and partly by transmitted customs. Secular and
religious motives are to be found in similar combination elsewhere,
even among tribes that are usually regarded as peoples of nature,
as, for example, particularly those of Polynesia. In cases such as
these, however, there are present also the beginnings of a legal
system, as well as its correlates, the fundamentals of a political
organization and of a deity cult. Whether these are the remnants of
a culture brought by these migratory peoples from their original
Asiatic home, or whether they represent an independently achieved
culture that has fallen into decay, we need not here inquire.
That the development of the legal system is dependent upon the
first of these phenomena—that is, upon political organization—is
directly apparent from the fact that the administration of justice in
general presupposes two sources of authority. Here again the
beginnings are to be found in the totemic age. During this period,
the administration of justice was vested, in the first place, in a
relatively restricted group of the older and experienced men, such as
exercised authority over the older members of the horde even in
pretotemic times. Judicial powers were assumed, in the second
place, by individual leaders in the chase or in war. The authority of
the latter, it is true, was temporary, frequently shifting with changing
circumstances; it was all the more effective, however, for the very
reason that it was centred in single individuals. Now, the initial step
in the formation of a legal system—which, as already remarked, was
at first concerned merely with what we would call civil justice—was
taken when the quarrels of individuals came to be settled in the
same way as were matters of common concern to the clan or tribe—
namely, by the decisions of the two long-established authorities, the
'council of elders,' as they later continued to be called among many
civilized peoples, and the individual leader or chieftain. Even in
relatively primitive times, fellow-tribesmen or clansmen who
disagreed as to the ownership of an object or perhaps as to whether
or not some mutual agreement had been kept, and who preferred a
peaceful decision to settlement by combat, were accustomed to seek
the decision of the elders or of a man of commanding respect. Thus,
these initial stages of legal procedure indicate that the earliest judge
was an arbitrator; he was freely selected by the disputants, though
he constantly became more firmly established in his position as a
result both of his authority in the general affairs of the tribe and of
tradition. We next find the appointed judge, who owes his office to
political authority, and who decides particular controversies, not
because he has been asked to do so by the parties themselves but
'of right' and as commissioned by the State; supported as he is by
the political power, his decision has compelling force. As soon as the
State assumes the function of deciding the controversies of
individuals, the judge becomes an official. Indeed, he is one of the
first representatives of officialdom. For, in the early stages of political
organization, all matters other than the quarrels of individuals are
regulated by ancient customs, except in so far as war and the
preparation for war involve conditions that necessarily place
authority of an entirely different sort in the hands of particular
individuals. Thus, together with the offices of those who, though
only gradually, come to have charge of the maintenance of the
military organization even in times of peace, the office of the
judiciary represents one of the earliest of political creations. In it, we
find a parallel to the division of power between the ruler and a
separate council of experienced men, an arrangement that
represents a legacy from the period of tribal organization, but that
only now becomes firmly established. The individual judge and the
college of judges both occur so early that it is scarcely possible to
say whether either antedated the other. Affecting the development
just described are two other conditions, capable of bringing about a
division of judicial authority at an early time. One of these conditions
is the connection of the state with deity cult, as a result of which the
secular power is limited by the authority of the priesthood, whose
chief prerogative comes to be penal justice. The second factor in the
differentiation of judicial functions consists in the institution of
chieftainship, one of the two characteristic features of political
society. Chieftainship involves a tendency towards a delegation of
the supreme judicial authority to the ruler. This is particularly the
case during the first stages of political organization, which still reflect
the fact that the external political power of the chieftain grew up out
of the conditions attendant upon war. Even though the secular
judiciary, which originated in the council of elders, or, in certain
cases, the judicial office of the priest, also continues to be
maintained, the ruler nevertheless reserves for himself the authority
over the most important issues. Particularly in doubtful cases, in
which the ordinary judge has no traditional norms to guide his
decision, the 'king's court' intervenes in order, if necessary, to secure
a recognition of the claim of reasonableness. This is especially apt to
occur in connection with capital crimes. Hence it is that, even after
penal law has once become a matter of general governmental
control—which, as a rule, occurs only at a later stage of legal
development—the final decision in criminal cases usually rests with
the ruler. Generally, moreover, it is the ruler alone who has sufficient
power to put an end to the blood-revenge demanded by kinship
groups. Owing to the fact that, in his capacity of military leader, the
ruler possesses power over life and death during war with hostile
tribes, he comes to exercise the same authority in connection also
with the feuds of his fellow-tribesmen. Modern States have retained
a last remnant of this power in the monarch's right to pardon, an
erratic phenomenon of a culture that has long since disappeared.
Thus, the State, as such, possesses an external power which finds
its most direct expression—just as does the unity of the State—in
the exercise of judicial authority on the part of the ruler. In the
beginnings of legal development, however, law always possesses
also a religious sanction. True, the above-mentioned unification of
the offices of priest and judge or of the authority of priest and ruler
—the latter of which sometimes occurs in connection with the former
—may be the result of particular cultural conditions. This, however,
but indicates all the more forcibly how permanent has been the
religious sanction of law. Such a sanction is evidenced by the words
and symbolisms that accompany legal procedure even in the case of
secular judges and of the relations of individuals themselves. Not
without significance, for example, is the solemnity manifested in the
tones of those who are party to a barter, a contract, or an
assignment of property. Indeed, their words are usually
accompanied by express confirmations resembling the formulas of
prayer and imprecation; the gods are invoked as witnesses of the
transaction or as avengers of broken pledges. Because of the
solemnity of the spoken word, speech was displaced but slowly by
writing. Long after the latter art had been acquired, its use
continued to be avoided, not only in the case of legal formulas, such
as the above, but occasionally even in connection with more general
legal declarations. In the Brahman schools of India, for example, the
rules of legal procedure, as well as the hymns and prayers, were for
centuries transmitted purely through memory; we are told,
moreover, that in ancient Sparta it was forbidden to put the laws in
writing. To an age, however, which is incapable of conceiving even a
legal transaction except as a perceptual act, the spoken word by
itself is inadequate to give the impression of reality. As an indication
that he has acquired a piece of land, the purchaser lifts a bit of soil
from the earth, or the vendor tosses a stalk of grain to him—a
ceremony which is imitated in the case of other objects of exchange
and which has led to the word 'stipulation' (from the Latin stipulatio,
throwing of a stalk). Another symbol of acquisition is the laying on of
the hand. Similar to it is the clasp of right hands as a sign of mutual
agreement. By this act the contracting parties pledge their freedom
in case they break the promise which they are giving. When the fact
that the two parties lived at some distance from each other rendered
the hand clasp impossible, the Germans were accustomed to
exchange gloves. One who challenged another to a duel likewise did
so by the use of a glove, even though his opponent was present. By
throwing his glove before his opponent the challenger gave
expression to the distance which separated him in feeling from his
enemy. In this case, the symbol has changed from a sign of
agreement to the opposite. All the symbols of which we have been
speaking agree in having originally been regarded, not as symbols,
but as real acts possessing certain magical potencies. When an
individual, who is acquiring a piece of land, picks up a bit of soil
while speaking the appropriate words, he intends to produce a
magical effect upon the land, such that disaster will come to any one
who may seek to deprive him of it. He who offers his hand in sealing
a compact signifies that he is prepared to lose his freedom in case
he fails to keep his word. For this reason the shaking of hands is
sometimes supplemented by the extension of a staff—a special use
of the magical wand which occurs particularly when the pledge is
administered by a judge. In a second stage of development, the act
loses the status of reality, but it remains associated with religious
feelings. At a third stage, it becomes a mere matter of form, though
the solemnity with which it envelops the transaction adds to the
impressiveness of the latter and fixes it more firmly in memory.
Combined with the word, thus, is a gesture that faithfully reflects its
meaning. Moreover, other individuals are summoned to witness the
legal transaction. This is done, not so much that these persons may
later be able to give definite testimony, as that they, too, shall hear
the word and see the gesture, and so, in a sense, enhance the
reality of that which is transpiring. Besides this oldest form of
witness, who is not to testify regarding that which he has
experienced, as occurs in later times, but who is merely present on
the occasion of the legal transaction, there is the compurgator, who
substantiates the oath of the man involved. The latter fortifies his
statements by invoking the gods as witnesses. Now, the oath of the
compurgator does not relate to the testimony of his companion, but
merely to the companion himself; it is a pledge to share the
punishment of the latter in case he swears falsely. As in battle, so
also in calling upon the terrible powers whose vengeance is to fall
upon the perjurer, companion stands protectingly by the side of
companion. Thus, the oath itself is a ceremony both of cult and of
magic. As a cult activity, the oath was originally given at the place
where the cult was administered—that is, in the immediate presence
of the gods; the method of procedure was to raise the fingers and to
point them directly to the gods, who were regarded as witnesses of
the act. The magical nature of the oath appears in the fact that the
latter involved the conjuration of an object, which was to bring
disaster upon him who took the oath in case he swore falsely. Thus,
the Germans swore by their battle-steeds or their weapons, and, in
so doing, they laid their hands upon these objects; or, instead of the
latter, they used an oath-staff—one of the numerous metamorphoses
of the magical wand—which was extended toward him who received
the oath, whether the opposing party or the judge. This oath
signified that the object by which the individual swore would bring
ruin upon him in case he committed perjury. The oath, therefore,
came to be a fixed and definitely prescribed means of judicial
procedure, though this occurred only after deity cult effected a union
of the two factors, cult and magic. Nevertheless, the beginnings of
this development are to be found as early as the totemic age, and
they approximate to the cult-oath particularly in those regions that
practise ancestor worship. The Bantu, for example, swears by the
head of his father or the cap of his mother, as well as by the colour
of his ox. In all these cases, the intention is that the perjurer shall
suffer the vengeance which the demon of the deceased or of the
animal visits upon him who swears falsely.
Closely related in its motives to the oath is another legal institution,
the ordeal. In the earliest form of the ordeal, the strife of individuals
was settled by a duel. Such an ordeal was very similar to the sword-
oath, at least among Indo-Germanic peoples. Just as the man who
swore by his weapons invoked death by their agency in the
indefinite future, so each of the participants in the duel sought to
bring these magical powers into immediate effect in the case of his
opponent. Not to him whose arm is the stronger, but to him who has
the stronger cause, will the gods grant victory through the magic of
his weapon. Like the oath, therefore, the ordeal was originally a
method of legal procedure in civil cases. Like the oath, furthermore,
it was, in its beginnings, a means whereby individuals settled their
controversies independently of a judge. It is at this point that the
punitive action of individuals gives way to public legal procedure.
Originally, crimes against life and property were dealt with by
individuals; the endeavour to secure the judgment of the gods by
means of the duel was doubtless one of the earliest steps by which
the penal process became a public procedure, and the punishment
itself, therefore, became raised above the plane of mere revenge.
Blood revenge involved an unexpected attack in the open or from
ambush. To renounce this custom in favour of the duel, therefore,
was in harmony with the character of the heroic age. For this was
the period in which the ideal of manly honour was rapidly gaining
strength, and in which, therefore, it was regarded as unworthy
under any circumstances to take the life of a defenceless man. The
principle accepted as self-evident in war, namely, that the person
attacked have an opportunity to defend himself, became, in a warlike
age, a maxim applying also to times of peace. Moreover, even
though it be true of the ordeal as of the oath that, at the outset, cult
was secondary to magical conjuration, nevertheless, the dominance
of the latter varied with the degree in which the State freed penal
justice from the passion for revenge on the part of individuals. The
ordeal thus came to be more than merely a combat between the
accuser and the accused. The judge in charge of the combat
acquired the duty of determining guilt or innocence, and, as a result,
the ordeal assumed other forms. Only the one who was accused was
now involved. The ordeal changed from a magic combat into a
magic test, which came to be regarded as a direct revelation of the
decision of the deity. This led to the adoption of means of proof
other than combat. It was obviously cult that caused penal justice as
such to be taken out of the hands of private individuals. For this
reason it was particularly sacrilege that demanded a magical
judgment independent of the combat of individuals. In cases of
sacrilege, the deity himself tested the assertions of the one who
endeavoured to free himself from the charges of religious crime. The
means for determining guilt or innocence were fire and water—the
same agencies that had long been employed by religious cult for
purposes of lustration. That the tests by water and by fire used in
connection with the witchcraft cases of mediæval times still
possessed a magical significance is unmistakable. If the witch sank
in the water—that is, if she was received by the purifying element—
she was guiltless. If the accused was not injured by holding a
glowing iron in his hand or by walking barefooted over coals, this
also was regarded as indicative of innocence. Apparently the
underlying conception was that the deity who gave to water and fire
the power of purifying a sinner from his guilt also communicated to
them the power of freeing the innocent from an accusation and of
withholding assistance from the guilty. Hence it is that while these
modes of divine judgment were not, indeed, as common as was
purification by means of water and fire, they nevertheless appeared
again and again, so far as their fundamental characteristics are
concerned. They were resorted to by the Germanic peoples, and
were prevalent also in Græco-Roman antiquity, and in India; trial by
water was likewise a custom in Babylonia, where it was prescribed
by Hammurabi as a means by which a suspected person might free
himself. We have noticed how, in the case of the ordeal and
particularly of its earliest form, judicial combat, the legal
controversies of individuals concerning rights relating to property,
buying and selling and other agreements, came to be considered
from the standpoint of punishment. This process is characteristic of
the development of penal law in general.

9. THE DEVELOPMENT OF PENAL LAW.

As an institution protected by the State, the administration of penal
law everywhere grew up out of civil law. The judge who was
appointed by the State to arbitrate personal controversies developed
into a criminal judge. Still later these two judicial offices became
distinct. This separation began in connection with the most serious
offences, such as seemed to demand a separate tribunal. The
determining feature, in this instance, was, at the outset, not any
qualitative characteristic of the offence but its gravity. Now, at the
time when deity cults were at their zenith, the most serious crimes
were held to be those connected with religion, namely, temple
sacrilege and blasphemy. Only at a relatively late period were crimes
against life and limb classed along with those affecting religion; to
these were added, shortly afterwards, violations of property rights.
That murder, though the most frequent crime of early culture, should
not be penalized by political authority until so late a period, is
directly due to the fact that it has its origin in the strife of
individuals. In such a strife, each man personally assumes all
consequences, even though these consist in the loss of his life. Even
to slay a man from ambush is regarded as justifiable by primitive
society if an individual is avenging a crime from which he has
suffered. As family and kinship ties become stronger, the family or
kin participates as a group in the quarrels of its individual members,
just as it does in war against hostile tribes. A murder, whether or not
it be an act of vengeance, is avenged by a kinsman of the
victim, either upon the murderer or upon some one of his kin,
inasmuch as in this case also the group is regarded as taking the
part of the individual. This is the practice of blood-revenge, a
practice which antedates the heroic age but which nevertheless
continues to exercise a powerful influence upon it. Blood-revenge is
so closely bound up with totemic tribal organization that it was
probably never lacking wherever any such system arose. Its status,
however, was purely that of a custom, not that of a legal
requirement. It was custom alone, and not political authority, that
compelled one kinsman to avenge the death of another. It was
custom also that sought to do away with the disastrous results of a
continuous blood-feud by means of an arrangement that came to
take the place of blood-revenge. This substitute was the 'wergild,'
which was paid as an indemnity by the malefactor to the family of
the one who had been murdered, and which thus maintained
precisely the same relation to blood-revenge as did marriage by
purchase to marriage by capture. In the former case, however, the
substitution of a peaceful agreement for an act of violence gave the
political authority its first occasion to exercise its regulative power.
This first manifestation of power consisted in the fact that the
political authority determined the amount which must be paid in lieu
of the blood-guilt. With the institution of wergild the entire matter
becomes one of civil law. Only one further step is necessary, and the
law of contract will indirectly have established the penal authority of
the State. This step is taken when the State compels the parties to
enter into an agreement on the basis of the wergild. The advance,
however, was not made at a single bound, but came only through
the influence of a number of intermediate factors. That which first
demanded a legal determination of the amount of expiation money
was the necessity of estimating the personal value of the one who
had been murdered, according as the individual was free-born or
dependent, of a high or of a low class, an able-bodied man or a
woman. Such a gradation in terms of general social status suggested
the propriety of allowing temporary and less serious injuries to life
and limb to be compensated for on the basis of their magnitude. But
the estimation of damages in such cases again made civil jurisdiction
absolutely necessary.
Closely interconnected with this complex of social factors, and
imposing a check upon the impulse for vengeance that flames up in
blood-revenge, was a religious influence—the fear of contaminating
by a deed of violence a spot that was sanctified by the presence of
invisible gods. No violence of any kind was allowed within sacred
precincts, particularly in places set apart for sacrifice or for other cult
ceremonies; least of all was violence tolerated in the temple, for the
temple was regarded as the dwelling of a deity. Such places,
therefore, afforded protection to all who fled to them from
impending blood-revenge or other sources of danger. The sacred
place also stood under the protection of the community; any
violation of it brought down upon the offender the vengeance of the
entire group, for the latter regarded such sacrilege as a source of
common danger. Thus, the protection of the sanctuary came to be a
legal right even at a time when retribution for the crime itself was
left to the vengeance of individuals. The right of protection afforded
by the temple, however, was sometimes held to exist also in the
case of the dwellings of persons of distinguished power and esteem,
particularly the dwellings of the chief and of the priest. Indeed, prior
to the existence of public temples, the latter were doubtless the only
places of refuge. In this form, the beginnings of a right of refuge
date back even into the totemic age. At that early time, however, the
protection was apparently due, not so much to directly religious
factors, as to the personal power of the individual who afforded the
refuge, or also, particularly in Polynesia, to the 'taboo' with which
the upper classes were privileged to guard their property. But, since
the taboo was probably itself of religious origin, and since the
medicine-man, and occasionally also the chief, could utilize
demoniacal agencies as well as his own external power, even the
very earliest forms of refuge were of the general nature of religious
protection. In some cases, the right of refuge eventually became
extended so as to be connected not only with the property set apart
for the chief or the priest but also with the homes of inferior men.
This, however, was a relatively late phenomenon. Its origin is
traceable to the cult of household deities, first of the ancestral spirits
who guard domestic peace, and then of the specific protective
deities of the hearth by whom the ancestral spirits were supplanted.
As a rule, it was not the criminal but the visiting stranger who
sought the protection of the house. The right to hospitality thus
became also a religiously sanctioned right to protection. The guest
was no less secure against the host himself than against all others.
The right of protection afforded by the house, therefore, should
probably be interpreted as a transference of the right of refuge
inherent in sacred precincts. The protective right of the chief was
doubtless the beginning of what in its complete development came
to be household right in general.
The divine protection afforded by the sanctuary obviously offers but
a temporary refuge from the avenger. The fugitive again encounters
the dangers of blood-revenge as soon as he leaves the sacred
precincts. Nevertheless, the time that is thus made to elapse
between the act and its reprisal tempers the passion of the avenger,
and affords an opportunity for negotiations in which the hostile
families or clans may arrange that a ransom be paid in satisfaction of
the crime that was committed. Moreover, the chief or the temple
priest under whose protection the fugitive places himself, is given a
direct opportunity for mediating in the capacity of an arbitrating
judge, and later, as the political power gradually acquires greater
strength, for taking the measures of retribution into his own hands.
Revenge, thus, is changed into punishment, and custom is displaced
by the norm of law, which grows up out of repeated decisions in the
adjudication of similar cases.
Sojourn in a place of refuge resembles imprisonment in that it limits
personal freedom. One might, therefore, be inclined to suppose that,
through a further development other than that described above, the
sanctuary led to a gradual moderation of punishment by introducing
the practice of imprisonment. Such a supposition, however, is not
borne out by the facts. At the time when the transition from the
place of refuge into the prison might have taken place, the idea of
reducing the death penalty to the deprivation of freedom was still
remote. The value which the heroic age placed on the life of the
individual was not sufficiently high to induce such a change, and the
enforcement of prison penalties would, under the existing
conditions, have appeared difficult and uncertain. Hence
imprisonment was as yet entirely unknown as a form of punishment.
Though the State had suppressed blood-revenge, it showed no less
an inclination than did ancient custom to requite not only murder but
even milder crimes with death. Indeed, inasmuch as the peaceful
mode of settlement by ransom gradually disappeared, it might be
truer to say that the relentlessness of the State was even greater
than that of blood-revenge. The oldest penal codes were very
strongly inclined to impose death penalties. That the famous
Draconian laws of Athens became proverbial in this respect was due
merely to the fact that other ancient legal codes, though not
infrequently more severe, were still unknown. The law of King
Hammurabi punished by death any one who stole property
belonging to the court or the temple, or even to one of the king's
captains; the innkeeper who charged her guests extortionate prices
was thrown into the water, and the temple maiden who opened a
wine-shop was burned to death. Whoever acquired possession of
stolen goods, or sheltered a runaway slave, was put to death, etc.
For every crime that was judged to be in any way serious, and for
whose expiation a money ransom was not adequate, the law knew
only the one penalty, death. The earliest law made no use of custody
except in connection with civil justice. The debtor was confined in
the house of the creditor. This simply enforced the pledge involved in
the shaking of hands at the time when the debt was contracted—an
act by which the debtor vowed to be responsible for his debt with
his own person.
The confinement of the debtor was at first a matter that was left to
individuals, and its original sanction was custom; later, however, it
came under the supervision of the legal system of the State. This
suggested the adoption of confinement in connection with other
crimes, in which the death penalty appeared too severe a
punishment and the exaction of money one that was too light, as
well as, primarily, one too dependent upon the wealth of the guilty
individual. Contributory to this change was a practice which,
similarly to confinement, was also originally an arrangement
between individuals, and was rooted in custom. I refer to the
holding of individuals as pledges, to the hostage, who gave security
with his own person for the promise of another. The hostage is of
the nature of a forfeit, guaranteeing in advance the fulfilment of the
obligation. For this reason the holding of hostages came to be
practised not merely in the case of property contracts but in
connection with every possible obligation of a private or a public
nature. This development was furthered by the fact that hostages
came to be held in times of war, and, as a result, were given also
upon the assumption of public duties. In both cases, custody
changed from a private arrangement into a public concern. This
change made it possible for a judge to impose the penalty of
imprisonment whenever the transgression did not appear to warrant
death. Imprisonment is a penalty that admits of no fewer degrees
than does a fine, and has the advantage of being independent of the
irrelevant circumstance of the wealth of the one who is condemned.
Moreover, the restriction of arbitrary deprivations of freedom in
favour of custody on the part of the political power, makes it possible
to hold a suspect whose case requires examination before a judicial
verdict can be given. Thus arises the practice of confinement during
investigation, an incidental form of legal procedure which is
influenced by, and in turn reacts upon, the penalty of imprisonment.
Such confinement makes it possible to execute the penalty of
imprisonment in the case of those whom investigation shows to be
guilty. But this is not its only important result. It also leads to those
barbarous methods which, particularly during the early stages of this
development, are connected with the infliction of the punishment
itself as well as with the preceding inquisitorial activities. The public
administration of justice is still affected by the passion for vengeance
which comes down from the earlier period of blood-revenge. To this
coarser sense of justice a merely quantitative gradation of
punishment is not satisfactory; the punishment must rather be made
to correspond qualitatively with the crime that has been committed.
Hence the many different modes of prison punishment—more
numerous even than the modes of inflicting the death penalty—and
of the means of torture, which are often conceived with devilish
cunning. These means of torture come to be used also in the
inquisitional procedure; the endeavour to force a confession causes
them to become more severe, and this in turn reacts upon the
punishment itself. On the whole, the ultimate tendency of
imprisonment was greatly to restrict the death penalty and thus to
contribute to more humane methods of punishment. Nevertheless, it
is impossible not to recognize that this result was preceded by an
increasing cruelty. The fact that the prisoner was under the control
of the punitive authority for a longer period of time led to a
multiplication of the means of punishment. How simple, and, one
might say, how relatively humane, was blood-revenge, satisfied as it
was to demand life for life, in comparison with the penal law of the
Middle Ages, with its methods of forcing confession by means of the
rack and of various forms of physical suffering and of death
penalties!
The same is true of a further change inaugurated by the passing of
blood-revenge into punishment. This change likewise led to a
decided restriction of the death penalty, yet it also, no less than the
forcing of confession, brought upon penal justice the stigma of
systematic cruelty. The assumption of penal power on the part of the
public judiciary, in conjunction with the possession of unlimited
control over the person and life of the malefactor, led to the
adoption of a principle which long continued to dominate penal
justice. This principle was drastically expressed in the Priests' Code
of the Israelites, "Eye for eye, tooth for tooth." True, this jus talionis
was already foreshadowed in the custom of blood-revenge, and yet
the simple form which it here possessed, 'a life for a life,' made it a
principle of just retribution, and not a demand sharpened by hate
and cruelty. In the case of blood-revenge, moreover, the emotions of
revenge were moderated by virtue of the fact that considerations of
property played a rôle. Requital was sought for the loss which the
clan sustained through the death of one of its members. Hence the
clan might be satisfied with a money compensation, or, occasionally,
with the adoption either of a fellow-tribesman of the murderer or,
indeed, even of the murderer himself. In contrast with this, even the
most severe physical injuries, so long as they did not result in death,
were originally always left to the retaliation of the individual. This
retaliation was sought either in direct combat, or, in the heroic age
proper, in a duel conducted in accordance with regulations of
custom. All this is changed as soon as the State abolishes blood-
revenge and assumes jurisdiction over cases of murder. In the event
of personal injuries, the judge determines the sentence, particularly
if the individual is unable for any reason to secure retaliation—
having been rendered helpless, for example, through his injury, or
being prevented by the fact of class differences. Under such
circumstances it is but natural that the principle, 'a life for a life,'
which has been borrowed from the institution of blood-revenge and
has been applied to the punishment for murder, should be developed
into a scale of physical punishment representing the more general
principle 'like for like.' He who has destroyed the eye of another,
must lose his own eye; whoever has disabled another's arm, must
have his arm cut off, etc. Other injuries then came to be similarly
punished, even those of a moral character to which the principle
"eye for eye, tooth for tooth" is not directly applicable. The hand
which has been implicated in an act of sacrilege, such as the
commission of perjury, is to be cut off; the tongue which has
slandered, must be torn out. Originally, the death penalty was
employed all too freely. Hence this substitution of a physical
punishment which spared the life of the offender was doubtless in
the direction of moderation. But, since this substitution gave rise to
cruelties that resulted in the infliction of various sorts of death
penalties, preceded and accompanied by tortures, its original effect
became reversed, just as in the case of imprisonment. Moreover, the
two forms of punishment—imprisonment and death—and the degree
to which these were carried to excess differed according to
civilization and race. The jus talionis was the older principle of
punishment. It is more closely bound up with man's natural impulse
for retaliation, and therefore recurs even within humane civilizations,
sometimes merely in suggestions but sometimes in occasional
relapses which are of a more serious sort and are due to the passion
for revenge. In fundamental contrast with the Mosaic law,
Christianity repudiated the requital of like with like. Perhaps it was
the fear of violating its own principle that led it, in its later
development, to seek in the cruelties of severe prison penalties a
substitute for the repressed impulse to revenge which comes to
expression in coarser conceptions of justice. Nevertheless, this
substitution was superior to the inflexible severity of the jus talionis
in that it more effectively enabled milder customs to influence the
judicial conscience.
But there is still another respect in which the receding of the
principle of retaliation gradually led to an advance beyond the legal
conceptions characteristic of the heroic age. The command for strict
retribution takes into consideration merely the objective injury in
which a deed results; to it, it is immaterial whether a person
destroys another's eye accidentally or intentionally. The same injury
that he has caused must befall him. Whoever kills a man must,
according to the law of Hammurabi, himself suffer death; if he kills a
woman, he is to be punished by the death of his daughter. If a
house collapses, the builder who constructed it must suffer death.
For a successful operation, the physician receives a compensation; if
the operation fails, the hand that has performed it is cut off. The
same law determines both reward and punishment. Moreover, it
includes within its scope even intellectual and moral transgressions.
The judge who commits an error is to be dismissed from office in
disgrace; the owner who neglects his field is to be deprived of it.

10. THE DIFFERENTIATION OF LEGAL FUNCTIONS.

The direct impetus to overcoming the defects that were inherent in
penal justice as a result of its having originated in the conflicts of
individuals, did not come from a clear recognition of differences in
the character of the crimes themselves, but primarily from the fact
of a gradual division of judicial functions. This is shown particularly
by the development of Græco-Roman as well as of Germanic law. It
is in the criminal court, which supersedes blood-revenge, that public
authority is most directly conscious of its power over the individual.
Hence the criminal court appears to be the highest of the courts,
and the one that most deeply affects the natural rights of man. Its
authority is vested solely in the ruler, or in a particularly sacred
tribunal. This is due, not so much to the specific character of the
crimes over which it has jurisdiction, as to the respect which it
receives because it assumes both the ancient duty of blood-revenge
and the function of exacting a requital for religious guilt. Similarly,
other offences also gradually pass from the sphere of personally
executed revenge or from that of the strife of individuals, and
become subject to the penal authority of the State. The division of
judicial authority, to which these tendencies lead, is promoted by the
differentiation of public power, as a result of which the
administration of justice is apportioned to various officials and
magistrates, just as are the other tasks of the State. It is for this
reason that, if we consider their civilization as a whole, the
constitutional States of the Occidental world were led to differentiate
judicial functions much earlier than were the great despotic
monarchies of the Orient. These monarchies, as the code of
Hammurabi shows, possessed a highly developed husbandry and a
correspondingly advanced commercial and monetary system,
and yet they centralized all judicial functions in the ruler.
Thus, the State gains a twofold power, manifested, in the first place,
in the very establishment of a judicial order, and, secondly, in the
differentiation of the spheres of justice in which the authority of the
State over the individual is exercised. This finally prepares the way
for the last stage of development. The State itself becomes subject
to an established legal order which determines its various functions
and the duties of its members. There thus originates an officialdom,
organized on fixed principles and possessing carefully defined public
privileges. The people of the State, on the other hand, are divided
into definite classes on the basis of the duties demanded of them as
well as of the rights connected with these duties. These articulations
of political society, which determine the organization of the army, the
mode of taxation, and the right of participation in the government of
the State, develop, as we have already seen, out of totemic tribal
organization, as a result of the external conditions attendant upon
the migrations and wars connected with the rise of States. But they
also exhibit throughout the traces of statutes expressing the will and
recording the decisions of individual rulers, though even here, of
course, universal human motives are decisive. After the political
powers of the State have been divided and have been delegated to
particular officials and official colleges, and after political rights have
been apportioned to the various classes of society, the next step
consists in rendering the organization of the State secure by means
of a Constitution regulating the entire political system. In the
shaping of the Constitution, it cannot be denied that individual
legislators or legislative assemblies played a significant rôle.
Nevertheless, it must be remembered that it is solely as respects the
form of State organization that the final and most comprehensive
legal creation appears to be predominantly the result of the acts of will
of individuals. The content of the Constitution is in every respect a
product of history; it is determined by conditions which, in the last
analysis, depend upon the general culture of a nation and upon its
relations with other peoples. These conditions, however, are so
complex that, though every form of Constitution and all its
modifications may be regarded as absolutely involved in the causal
nexus of historical life, the endless diversity of particular conditions
precludes Constitutions from being classifiable according to any
universal principle. Constitutions can at most be classified on the
basis of certain analogies. The most influential attempt at a genetic
classification of the various historical forms of government was that
of Aristotle. But his classification, based on the number of rulers
(one, a few, many, all) and on the moral predicates of good and evil
(monarchy and tyranny, aristocracy and oligarchy, etc.), offers a
purely logical schema which corresponds but partially with facts.
True, it not infrequently happens that the rule of all—that is,
democracy—gives way to the evil form of individual rulership—
namely, tyranny. An aristocracy, however, or even a monarchy, may
likewise develop into a tyranny. What the change is to be, depends
upon historical conditions. Nor are monarchy, aristocracy, or the rule
of the middle class forms of government that are ever actually to be
found in the purity which logical schematization demands. Even in
the Homeric State there was a council of elders and an assembly of
freemen—an agora—in addition to the king. Indeed, if we go back
still farther and inquire concerning those more primitive peoples of
nature who are merely on the point of passing from tribal
organization to a political Constitution, it might perhaps be nearer
the truth to assert that democracy, and not monarchy, was the form
of the early State. The fact is that the organization characteristic of
the State as a whole is the product of historical factors of an
exceedingly variable nature, and that it never adequately fits into
any logical system that is based on merely a few political features.
Even less may a logical schema of this sort be regarded as
representing a universal law of development.
Thus, the State is indeed the ultimate source of all the various
branches of the legal system. So far as the fundamental elements of
its own Constitution are concerned, however, it is really itself a
product of custom, if we take this term in its broadest sense, as
signifying an historically developed order of social life which has not
yet come under the control of political authority. The course of
development is the very opposite of that which rationalistic theories
have taught, ever since the time of the Sophists, concerning the
origin of the State. These theories maintain that the legal system
originated in connection with the State, and that it then acquired an
application to the separate departments of life. The reverse is true.
It is with the determination of the rights of individuals and with the
settlement of the controversies arising from these rights that the
legal power of the State takes its rise. It is strengthened and
extended when the custom of personal retribution comes to be
superseded by penal law. Last of all comes the systematic
formulation of the political Constitution itself. The latter, however, is
never more than a development; it is not a creation in the proper
sense of the word. Even such States as the United States of North
America and the new German Empire were not created by lawgivers,
but were only organized by them in respect to details. The State as
such is always a product of history, and so it must ever remain.
Every legal system presupposes the power of a State. Hence the
latter can never itself originate in an act of legislation, but can only
transform itself into a legal order after it has once arisen.

11. THE ORIGIN OF GODS.

At first glance it may seem presumptuous even to raise the question
as to how gods originated. Have they not always existed? one is
inclined to ask. As a matter of fact, this is the opinion of most
historians, particularly of historians of religion. They hold that the
belief in gods is underived. Degenerate forms may arise, the belief
may at times even disappear altogether or be displaced by a crude
belief in magic and demons, but it itself can in no wise have been
developed from anything else, for it was possessed by mankind from
the very beginning. Were it true that the belief in gods represents an
original possession of mankind, our question concerning the origin of
gods would be invalidated. The assumption, however, is disproved by
the facts of ethnology. There are peoples without gods. True, there
are no peoples without some sort of supersensuous beings.
Nevertheless, to call all such beings 'gods'—beings, for example,
such as sickness-demons or the demons which leave the corpse and
threaten the living—would appear to be a wholly unwarranted
extension of the conception of deity. Unbiased observation goes to
show that there are no peoples without certain conceptions that may
be regarded as precursors of the later god-ideas. Nevertheless, there
can be no doubt that there are some peoples without gods. The
Veddahs of Ceylon, the so-called nature-Semangs and Senoi of
Malacca, the natives of Australia, and many other peoples of nature
as well, possess no gods, in our sense of the word. Because all of
these primitive peoples interpret certain natural phenomena—such
as clouds, winds, and stars—in an anthropomorphic fashion, it has
been attempted time and again to establish the presence of the god-
idea of higher religions. Such attempts, however, may be straightway
characterized as a play with superficial analogies in which no
thought whatsoever is taken of the real content of the god-
conception.
Accepting the lead of ethnological facts, then, let us grant that there
are stages in the development of the myth in which real gods are
lacking. Even so, two opposing views are possible concerning the
relation of such 'prereligious' conditions to the origin of the god-
ideas essential to religion. Indeed, these views still actively compete
with each other in the science of religion. On the one hand, it is
maintained that the god-idea is original, and that belief in demons,
totemism, fetishism, and ancestor worship are secondary and
degenerate derivatives. On the other hand, the gods are regarded as
products of a mythological development, and, in so far, as analogous
to the State, which grew up in the course of political development
out of the primitive forms of tribal organization. Those who defend
the first of these views subscribe to a degeneration theory. If the
ancestors reverenced in cult are degenerated deities, and if the
same is true of demons and even of fetishes, then the main course
of religious development has obviously been downward and not
upward. The representatives of the second view, on the contrary,
assume an upward or progressive tendency. If demons, fetishes, and
the animal or human ancestors worshipped in cult antedate gods,
the latter must have developed from the former. Thus, the views
concerning the origin of gods may be classified as theories of
degeneration and theories of development.
But the theories of degeneration themselves fall into two classes.
The one upholds an original monotheism, the basis of which is
claimed to be either an innate idea of God or a revelation made to
all mankind. Obviously this assumption is itself more nearly a belief
than a scientific hypothesis. As a belief, it may be accounted for in
terms of a certain religious need. This explains how it happens that,
in spite of the multiplication of contradictory facts, the theory has
been repeatedly urged in comparatively recent times. Only a short
time ago, even a distinguished ethnologist, Wilhelm Schmidt,
attempted to prove that such an original monotheism was without
doubt a dominant belief among the so-called Pygmies, who must, in
general, be classed with primitive peoples. The argument adduced in
support of this view, however, unquestionably lacks the critical
caution otherwise characteristic of this investigator. One cannot
escape the conviction that, in this case, personal religious needs
influenced the ethnological views, even though one may well doubt
whether the degeneration theory is a theory that is suited to satisfy
such needs.[1] The second class of theories adopts the view that the
basis of all religious development was not monotheism but primitive
polytheism. This polytheism is supposed to have originated, at a
very early age, in the impression made by the starry heavens,
particularly by the great heavenly bodies, the sun and the moon.
Here for the first time, it is maintained, man was confronted by a
world far transcending his own realm of sense perception; because
of the multiplicity of the motives that were operative, it was not the
idea of one deity but the belief in many deities that was evoked. In
essential contrast with the preceding view, this class of theories
regards all further development as upward. Monotheism is held to be
a refined religious product of earlier polytheistic conceptions. In so
far, the hypothesis represents a transition to developmental theories
proper. It cannot be counted among the latter, however, for it holds
to the originality of the god-idea, believing that this conception,
which is essential to all religion, was not itself the product of
development, but formed an original element of man's natural
endowment. Moreover, the theory attaches a disproportionate
significance to the transition from many gods to a single god. It is
doubtful, to say the least, whether the intrinsic value of the god-idea
may be measured merely in terms of this numerical standard.
Furthermore, the fact is undeniable that philosophy alone really
exhibits an absolute monotheism. A pure monotheistic belief
probably never existed in the religion of any people, not even in that
of the Israelites, whose national deity, Jahve, was not at all the sole
god in the sense of a strict monotheism. When the Decalogue says,
"Thou shalt have no other gods before me," this does not deny the
existence of gods other than Jahve, but merely prohibits the
Israelites from worshipping any other deity. These other gods,
however, are the national gods of other peoples. Not only do these
other tribal gods exist alongside of Jahve, but the patriarchal sagas
centre about individuals that resemble now demonic and now divine
beings. The most remarkable of these figures is Jacob. In the
account of his personality there seem to be mingled legends of
differing origin, dating from a time probably far earlier than the
developed Jahve cult. The scene with his father-in-law, Laban,
represents him as a sort of crafty märchen-hero. He cheats Laban
through his knowledge of magic, gaining for himself the choicest of
the young lambs by constructing the watering troughs of half-peeled
rods of wood—a striking example of so-called imitative magic. On
the other hand, Jacob is portrayed as the hero who rolls from the
well's mouth the stone which all the servants of Laban could not
move. And finally, when he wrestles with Jahve by night on the bank
of the stream and is not overcome until the break of day, we are
reminded either of a mighty Titan of divine lineage, or possibly of
the river demon who, according to ancient folk belief, threatens to
engulf every one who crosses the stream, be it even a god. But what
is true of the figures of the patriarchal sagas applies also, in part, to
Jahve himself. In the remarkable scene in which Jahve visits
Abraham near the terebinths of Mamre, he associates with the
patriarch as a primus inter pares. He allows Sarah to bake him a
cake and to wash his feet, and he then promises Abraham a
numerous posterity. He appears as a man among men, though, of
course, as one who is superior and who possesses magical power.
Only gradually does the god acquire the remoteness of the
superhuman. Abraham is later represented as falling down before
him, and as scarcely daring to approach him. Here also, however,
the god still appears on earth. Finally, when he speaks to Moses
from the burning bush, only his voice is perceptible. Thus, his
sensuous form vanishes more and more, until we come to the Jahve
who uses the prophets as his mouthpiece and is present to them
only as a spiritual being. The purified Jahve cult, therefore, was not
an original folk-religion. It was the product of priests and prophets,
created by them out of a polytheism which contained a rich
profusion of demon conceptions, and which was never entirely
suppressed.
If an original monotheism is nowhere to be found, one might be
tempted to believe conversely, that polytheism represents the
starting-point of all mythology. In fact, until very recently this was
doubtless the consensus of opinion among mythologists and
historians of religion, and the idea is still widely prevalent. For, if we
hold in any way to the view that the god-idea is underived, there is
but one recourse, once we abandon the idea of an original
monotheism. The polytheistic theory is, as a rule, connected with the
further contention that god-ideas are directly due to celestial
phenomena. In substantiation of this view, it is pointed out that,
with the exception of the gods of the underworld, the gods are
usually supposed to dwell in the heavens. Accordingly, it is
particularly the great heavenly bodies, the sun and the moon, or also
the clouds and storms, to which—now to the one and now to the
other, according to their particular tendency—these theories trace
the origin of the gods. Celestial phenomena were present to man
from the beginning, and it is supposed that they aroused his
reflection from earliest times on. Those mythologists who champion
the celestial theory of the origin of religion, therefore, regard god-
ideas as in great measure the products of intellectual activity; these
ideas are supposed to represent a sort of primitive explanation of
nature, though an explanation, of course, which, in contrast to later
science, is fantastical, arbitrary, and under the control of emotion.
During the past century, moreover, this class of hypotheses has
gradually placed less emphasis on emotional as compared with
rational factors. In the first instance, it was the phenomena of
storms, clouds, thunder, and lightning that were thought to be the
basis of deity belief; later, the sun came to be regarded as the
embodiment of the chief god; the present tendency is to emphasize
particularly the moon, whose changing phases may easily give rise
to various mythological ideas. Does not the proverbial 'man in the
moon' survive even to-day as a well-known fragment of mythological
conceptions of this sort? Similarly, the crescent moon suggests a
sword, a club, a boat, and many other things which, though not
conceived as gods, may at any rate be regarded as their weapons or
implements. The gods, we are told, then gradually became
distinguished from celestial objects and became independent
personal beings. The heroes of the hero saga are said to be
degenerated gods, as it were. When the myth attributes a divine
parentage to the hero, or allows him to enter the realm of the gods
upon his death, this is interpreted as indicative of a vague memory
that the hero was once himself a god. The lowest place in the scale
of heroes is given to the märchen-hero, though he also is supposed
in the last analysis to have originated as a celestial deity. The
märchen itself is thus regarded as the last stage in the decline of the
myth, whose development is held to have been initiated in the
distant past by the celestial myth. Accordingly, the most prevalent
present-day tendency of nature mythology is to assume an orderly
development of a twofold sort. On the one hand, the moon is
regarded as having been the earliest object of cult, followed by the
sun and the stars. Later, it is supposed, a distinction was made
between gods and celestial objects, though the former were still
given many celestial attributes. On the other hand, it is held that the
gods were more and more anthropomorphized; their celestial origin
becoming gradually obscured, they were reduced to heroes of
various ranks, ranging from the heroic figures of the saga to the
heroes of children's märchen. These theories of an original
polytheism are rendered one-sided by the very fact that they are not
based upon any investigations whatsoever concerning the gods and
myths actually prevalent in folk-belief. They merely give an
interpretation of hypothetical conceptions which are supposed to be