Montanez Dissertation
George D. Montañez
May 2017
CMU-ML-17-100
Thesis Committee:
Cosma R. Shalizi, Chair
Roni Rosenfeld
Geoff Gordon
Milos Hauskrecht
This research was sponsored by the National Science Foundation under grant numbers DMS1207759, DMS1418124
and DGE1252522; a gift from the Microsoft Corporation; and a graduate fellowship from the Ford Foundation.
Travel grants and one-time support provided by the IEEE Computational Intelligence Society, Center for Evolution-
ary Informatics, and Harvey Mudd College.
This thesis contains content from the following publications:
2017 Montañez G, Shalizi C, “Why Machine Learning Works, In One Equation” (In Prep.)
2017 Montañez G, Finley T, “Kernel Density Optimization” (In Prep.)
2016 Montañez G, “The Famine of Forte: Few Search Problems Greatly Favor Your Algorithm.”
(In Prep.)
2015 Montañez G, Amizadeh S, Laptev N, “Inertial Hidden Markov Models: Modeling Change in
Multivariate Time Series.” AAAI Conference on Artificial Intelligence (AAAI-15), 2015.
2013 Montañez G, “Bounding the Number of Favorable Functions in Stochastic Search.” In
Evolutionary Computation (CEC), 2013 IEEE Congress on, pages 3019–3026, 2013.
Keywords: machine learning, algorithmic search, famine of forte, no free lunch, dependence
Dedicated to heroes everywhere, who stood their ground in the face of opposition and did what
was right. To Jesus Christ, the first and last Hero, the iron pillar and bronze wall.
“In this world you will have trouble. But take heart! I have overcome the world.”
Abstract
To better understand why machine learning works, we cast learning problems as
searches and characterize what makes searches successful. We prove that any search
algorithm can only perform well on a narrow subset of problems, and show the ef-
fects of dependence on raising the probability of success for searches. We examine
two popular ways of understanding what makes machine learning work, empirical
risk minimization and compression, and show how they fit within our search frame-
work. Leveraging the “dependence-first” view of learning, we apply this knowledge
to areas of unsupervised time-series segmentation and automated hyperparameter
optimization, developing new algorithms with strong empirical performance on real-
world problem classes.
Contents
2 Related Work
2.1 Existing Frameworks
2.2 Generalization as Search
2.3 Constraints on Search
2.3.1 No Free Lunch Theorems
2.3.2 Conservation of Generalization Performance
2.3.3 An Algorithmic View of No Free Lunch
2.3.4 Necessary and Sufficient Conditions for No Free Lunch
2.3.5 Few Sets of Functions are CUP
2.3.6 Focused No Free Lunch Theorems for non-CUP Sets
2.3.7 Beyond No Free Lunch
2.3.8 Continuous Lunches Are Free!
2.3.9 Continuous Lunches Are Not Free
2.4 Bias in Learning
2.4.1 Futility of Bias-Free Learning
2.4.2 Overfitting Avoidance as Bias
2.4.3 Evaluating "Overfitting Avoidance as Bias"
2.4.4 The Need for Specific Biases
2.5 Improvements on Prior Work
II Theory
3 An Algorithmic Search Framework
3.1 The Search Problem
3.2 The Search Algorithm
3.3 Measuring Performance
3.4 Notation
4 Machine Learning as Search
4.1 Regression as Search
4.2 Classification as Search
4.3 Clustering as Search
4.4 Parameter Estimation as Search
4.5 Hyperparameter Optimization as Search
4.6 General Learning Problems as Search
5 Theoretical Results
5.1 Lemmata
5.2 The Fraction of Favorable Targets
5.3 The Famine of Forte
5.3.1 Corollary
5.3.2 Additional Corollary
5.4 Bound on Expected Efficiency
5.5 The Famine of Favorable Strategies
5.6 Learning Under Dependence
5.7 Why Machine Learning Works in One Equation
5.8 The Need for Biased Classifiers
5.8.1 Special Case: Concept and Test Set Fixed
5.8.2 General Case: Concept and Test Set Drawn from Distributions
5.8.3 Discussion
8.3 Binary Classification
8.4 One Equation Examples
8.4.1 Example: Strong Target Structure
8.4.2 Example: Minimal Target and Positive Dependence
8.4.3 Example: Target Avoidance Algorithm
IV Conclusion
11 Conclusion
V Appendices
A The Need for Biased Classifiers: Noisy Labels
Bibliography
List of Figures
5.1 Graphical model of objects involved in unbiased classification and their relationships.
10.1 Mass concentration around promising regions of the hyperparameter search space. Queries 1-50, 200-250, and 450-500, respectively.
10.2 Results for real-world learning tasks. Dashed lines represent where random initialization ends for SMAC and KDO methods (i.e., 50th query).
10.3 Results for synthetic Rastrigin task.
List of Tables
3.1 Notation
10.1 Datasets, learners, and hyperparameters for each experiment task. (log+ denotes log scaling.)
Part I
Chapter 1
Introduction
Why does machine learning work? While wildly successful in business and science, machine
learning has remained something of a mysterious black-box to all but a trained
few. Even the initiated often have trouble articulating what exactly makes machine
learning work in general. Although it is easy enough to point someone to a textbook on statis-
tical learning theory, giving a concise (and correct) explanation becomes a little more difficult.
Part of this thesis aims to help remedy the situation by offering a succinct yet clear answer to
the question, based on two simple ideas that underlie most, and perhaps all, of machine learning:
search and dependence. We reformulate several types of machine learning (e.g., classification,
regression, density estimation, hyperparameter optimization) as search problems within a unified
framework, formalizing a view of learning suggested by Mitchell [56]. Viewing machine learn-
ing as a search abstracts away many of the specific complexities of individual learning processes,
allowing us to gain insight from a simplified view. Under this view, our original question can be
transformed into two simpler ones:
Having reduced machine learning to a type of search, if we can discover what makes generic
searches successful then we also discover what makes machine learning successful.
Towards that end, we prove theorems regarding the probability of success for searches and
explore what factors enable their success. We are introduced to No Free Lunch [71, 94, 95] and
other Conservation of Information [22, 55, 69] theorems, which demonstrate the need for
correct exploitation of dependence to gain better-than-chance learning. As part of the process, we
rederive and strengthen some existing historical results, and explore how our search framework
compares and contrasts to other popular learning paradigms, such as empirical risk minimization
and learning via compression (minimum description length). Lastly, we apply these insights
to two concrete application areas, showing how dependence within those problems leads to the
possibility of greater-than-chance success.
1.1 Contributions
A central contribution of this thesis is to develop a formal framework for characterizing search
problems, into which we reduce several broad classes of machine learning problems (Chapters 3
and 4). Having established the link between machine learning and search, we demonstrate that
favorable search problems must necessarily be rare (Chapter 5, Theorem 2). Our work departs
from No Free Lunch results (namely, that the mean performance across sets of problems is fixed
for all algorithms) to show that the proportion of favorable problems is strictly bounded in re-
lation to the inherent problem difficulty and the degree of improvement sought (i.e., not just
the mean performance is bounded). Our results continue to hold for sets of objective functions
that are not closed-under-permutation, extending traditional No Free Lunch theorems. Further-
more, the bounds presented here do not depend on any distributional assumptions on the space
of possible problems, such that the proportion of favorable problems is small regardless of which
distribution holds over them in the real world. This directly answers critiques aimed at No Free
Lunch results arguing against a uniform distribution on problems in the real world (cf. [64]),
since given any distribution over possible problems, there are still relatively few favorable prob-
lems within the set one is taking the distribution over.
As a corollary, we prove that the information cost of finding any favorable search problem is
bounded below by the number of bits of "advantage" gained by the algorithm on such problems.
We do this by using an active information transform to measure performance improvement in
bits [17], proving a conservation of information result [17, 22, 57] that shows the amount of
information required to locate a search problem giving b bits of expected information is at least b
bits. Thus, to get an expected net gain of information, the true distribution over search problems
must be biased towards favorable problems for a given algorithm. This places a floor on the
minimal information costs for finding favorable problems.
In the same chapter, we establish the equivalence between the expected per-query probability
of success for an algorithm and the probability of a successful single-query search under some
induced conditional distribution, which we call a strategy. Each algorithm maps to a strategy,
and we prove an upper-bound on the proportion of favorable strategies for a fixed problem (The-
orem 3). If finding a good search problem for a fixed algorithm is hard, then so is finding a good
search strategy for a fixed problem. Thus, the matching of problems to algorithm strategies is
provably difficult, regardless of which is fixed and which varies.
Given that any fixed algorithm can only perform well on a limited proportion of possible
problems, we then turn our attention to the problems for which they do perform well, and ask
what allows the algorithm to succeed in those circumstances. To that end, we characterize and
bound the single-query probability of success for an algorithm based on information resources
(Chapter 5, Theorems 4 and 5). Namely, we relate the degree of dependence (measured in
mutual information) between target sets and external information resources, such as objective
functions, inaccurate measurements or sets of training data, to the maximum improvement in
search performance. We prove that for a fixed target-sparseness and given an algorithm A, the
single-query probability of success for the algorithm is bounded as
\[
q \le \frac{I(T; F) + D(P_T \,\|\, U_T) + 1}{I_\Omega},
\]
where I(T ; F ) is the mutual information between target set T (as a random variable) and external
information resource F , D(PT kUT ) is the Kullback-Leibler divergence between the marginal
distribution on T and the uniform distribution on target sets, and IΩ is the baseline information
cost for the search problem due to sparseness. This simple equation takes into account degree
of dependence, target sparseness, target function uncertainty, and the contribution of random
luck. It is surprising that such well-known quantities appear in the course of simply trying to
upper-bound the probability of success. We strengthen this in Theorem 5 to give a closed form
solution to the single-query probability of success for an algorithm, again consisting of easily
interpretable components.
Given our theoretical results and search framework, we are able to rederive an existing result
from Culberson [15] as a corollary (Corollary 3) and formally prove (and extend) an early result
from Mitchell [55] showing the need for inductive bias in classification (Theorems 6 and 7).
Chapter 6 explores the relationship of this work to established research in statistical learning
theory, particularly in regards to empirical risk minimization. In Chapter 7 we discuss com-
pression and its relation to generalization performance, including a discussion of the Minimum
Description Length framework. We find that while a preference for simpler models can be a good
strategy in some common situations, a preference for simplicity does not always translate into
successful machine learning, disqualifying Occam’s razor and related notions as a complete (i.e.,
necessary and sufficient) explanation for why machine learning works. Within the same chap-
ter we prove a simple bound for the proportion of learnable binary concepts for deterministic
classification algorithms with restricted data (Theorem 9) and give examples.
Moving from the theoretical world into the applied, Chapters 9 and 10 explore how mak-
ing assumptions that enforce dependence (namely, temporal and spatial regularity) can lead to
strong algorithms in the areas of multivariate time-series segmentation and hyperparameter opti-
mization, respectively. Our “dependence-first” view of learning helps us to identify and exploit
potential areas of dependence, leveraging them to develop novel applied machine learning algo-
rithms with good empirical performance on real-world problems.
1. Dependence.
2. What is seen tells us something about what is unseen.
3. We are trying to recover an object based on observations; dependence within the problem
controls how informative each observation can be, while biases (shaped by underlying
assumptions) control how informative each observation actually is. Information theory
then governs how fast we recover our object given the dependence, assumptions and biases.
Each of these answers will be justified during the course of the thesis.
1.3 Computational and Informational Complexity
It should be noted that this thesis views machine learning from a search and information theoretic
perspective, focusing on the information constraints on learning problems. Additionally, there
are computational constraints that must be dealt with in machine learning, centered on scalability
and engineering challenges. Here we focus solely on the informational challenges, while still
acknowledging the importance of the computational aspects of machine learning; they are simply
beyond the scope of this thesis.
Chapter 2
Related Work
In Mitchell’s framework, each hypothesis in the hypothesis space can be viewed as general or
specific, depending on how many instances in the instance space are matched by the hypothesis.
The more-specific-than relation between hypotheses, where {x : g1 (x) = 1} ⊆ {x : g2 (x) = 1}
indicates that g1 is more specific than g2 , provides a partial ordering over the hypothesis space,
allowing a structured search over that space. A few different search strategies are outlined,
including a version space method, and their time and space complexities are provided.
Mitchell explores cases where there are not enough training examples to reduce the version
space to a single hypothesis. In such cases, there exists a need for voting or some other method
of choosing among the remaining hypotheses consistent with the training data. Methods of pre-
ferring one hypothesis to another with regard to anything other than strict consistency with the
training data are examples of inductive bias in learning algorithms, as are choices made in the
hypothesis space construction. Biases are encoded in the choice of “generalization language”
(meaning the possible concepts that can be represented in the hypothesis space), and could be
used to incorporate prior domain knowledge into the search problem. The problem of selecting
“good” generalization languages (choosing a correct inductive bias) is highlighted as an impor-
tant open problem in learning that was poorly understood at the time.
A connection is made to active learning (though not by that name), since the most informative
instances to label are those on which the current version space is split, with nearly equal numbers
of hypotheses predicting each label. In other words, the most informative instances are those the
current version space has the most trouble classifying reliably.
2.3.2 Conservation of Generalization Performance
Schaffer [69] gave one of the earliest formal proofs of a conservation theorem for supervised
learning. Schaffer shows that inductive inference is a zero-sum enterprise, with better-than-
chance performance on any set of problems being offset exactly by worse-than-chance perfor-
mance on the remaining set of problems. To characterize how good an algorithm is, we test it on
a set of problems. The larger the set, the more confident our conclusions concerning the behav-
ior of the algorithm. Schaffer shows that taking this process to the extreme, where all discrete
problems of any fixed length are considered and our judgments concerning behavior should be
most confident, we find that all learning algorithms have identical generalization performance.
In the problem setting considered by Schaffer, training and test data are allowed to be drawn
from the same distribution, but instances that have already been seen during training are ignored
during testing. This gives the off-training-set generalization error problem setting investigated
by Wolpert [94] and others. Using his result, he shows that tweaking an algorithm to perform
well on a larger set of problems means that you are simultaneously decreasing performance on
all remaining problems. Thus, algorithm design is simply adjusting inductive bias to fit the
problem(s) at hand.
Schaffer comments on a common attitude concerning No Free Lunch results, namely that
“the conservation law is theoretically sound, but practically irrelevant.” In his opinion, this view
is common because each induction algorithm is only applied to a small subset of the space of all
possible problems. Therefore, if bad generalization performance can be sequestered to the region
outside this subset, then any constraints imposed by the generalization law can be safely ignored
as having no practical consequence. Since we design learning algorithms to be used in the real
world, many hope that this is indeed the case. As an argument against this view, he highlights
many instances in the literature where worse-than-chance learner behavior had been observed and
reported on real-world problems, such as cases where avoidance of overfitting led to degraded
generalization performance and where accuracy on an important real-world problem severely
and continuously decreased from .85 to below .50 as a standard overfitting avoidance method
was increasingly applied. Given the self-selection bias against including examples of negative
(worse-than-chance) performance on tasks, it is remarkable that the examples were continually
popping up. Most researchers view such results as anomalous, but according to Schaffer “the
conservation law suggests instead that they ought to be expected, and that, the more we think to
look for them, the more often examples will be identified in contexts of practical importance.”
Thus, even for real-world problem sets, we encounter cases where our algorithms work worse
than random chance.
restrictive than the NFL theorems.
In the paper, evolutionary algorithms (EAs) are claimed to be “no prior knowledge” algo-
rithms, which means EAs only gain knowledge of the problem through the external information
object that is a fitness function. In proposing configurations to be evaluated, the environment is
said to act as a black box and return an evaluation of that particular configuration. Culberson
states that Wolpert and Macready’s NFL theorem for optimization prove that it is unreasonable
to expect optimization algorithms to do well under these circumstances, since they show that
all algorithms (including dumb random sampling) have identical averaged behavior when faced
with such a black box environment.
He proves a version of the NFL theorem using an informal adversary argument, and gives
a second NFL theorem in the spirit of Schaffer’s Conservation Law for Generalization Perfor-
mance, namely, that “any algorithm that attempts to exploit some property of functions in a
subclass of F will be subject to deception for some other class, in the sense that it will perform
worse than random search.”
Most relevant to the results in this thesis, Culberson gives a theorem resembling the Famine
of Forte for a much more restricted setting, showing that if we restrict ourselves to bijective
functions from S^n (the space of all configurations, which we call Ω) to R (the finite range of
values the function can take on when evaluated on strings in S^n), and if we use a non-repeating
algorithm and consider the problem of selecting the optimum string within no more than an
expected ξ queries, then that proportion is bounded above by ξ/2^n, for any fixed algorithm A,
where |S^n| = 2^n. This says that an algorithm can only select a target in polynomial expected
time on an exponentially small fraction of the set of bijective functions, i.e., no more than n^k/2^n.
Since he is dealing with the proportion of functions for which an algorithm can perform well
in expectation, the similarity in spirit to the Famine of Forte result is clear, yet because his
problem setting is also greatly restricted (i.e., considering only the closed-under-permutation set
of bijective functions for non-repeating algorithms with a pre-specified single-item target set),
his informal proof is much simpler than the proof given here for our more general result.
Culberson reasons that applying an algorithm to all possible functions (or any closed-under-
permutation set of functions) gives the adversary too much power, in that it allows an adversary to
return values that are completely independent of the algorithm in use. This is an early recognition
of the importance of dependence for successful search. He evaluates issues in evolutionary algo-
rithms, such as the implications of NFL for Holland’s once-celebrated Schema theorem [1, 36],
and explores why the NFL results came as such a shock to the evolutionary computing commu-
nity. Assuming that something like an evolutionary algorithm was responsible for the emergence
and diversification of life, Culberson explores ways that evolution might escape some of the
more damning implications of the NFL. For example, given that the universe is not as random
as it could be, and that there is regularity and structure in physical laws, perhaps this restriction
on possible problems is sufficient to escape application of NFL-type theorems. Yet, he points
out that many NP-complete problems remain NP-complete even when their inputs are restricted,
such as determining whether a graph is 3-colorable even when restricting to planar
graphs of degree at most 4, and thus does not seem to fully endorse this particular solution to
the paradox. As he states in another part of the paper: “Although most researchers would readily
accept that there must be some exploitable structure in a function if an algorithm is going to
make progress, Wolpert and Macready make the further point that unless the operators of the al-
gorithm are correlated to those features, there is no reason to believe the algorithm will do better
than random search.”
Ultimately, he argues that removing the assumption that life is optimizing may be crucial to
resolving the issue and offers preliminary evidence in favor of that conclusion.
The set Y^X of |Y|^{|X|} possible functions has 2^{|Y|^{|X|}} − 1 non-empty subsets, and Theorem 2 gives the
number of subsets that are closed under permutation:
Theorem 2: The number of non-empty subsets of Y^X that are CUP is given by
\[
2^{\binom{|X| + |Y| - 1}{|X|}} - 1,
\]
where X is the input space, Y is the output space, and Y^X is the set of all possible functions over
X and Y. Even for |X| = 8 and |Y| = 2, the fraction of sets that is CUP is less than 10^{−74}.
Second, the paper demonstrates that for subsets of functions for which non-trivial1 neighbor-
hood relations hold, the NFL does not hold. The reason for this is given in Theorem 3,
Theorem 3: A non-trivial neighborhood on X is not invariant under permutations
of X .
They then prove corollaries of this theorem, such as showing that if we restrict the steepness
of neighborhoods for a subset of functions (steepness being a measure of the range of values in a
neighborhood) to be less than the maximum for that same subset (considering pairs of points not
in a neighborhood, as well), then the subset is not CUP. They also show this is true if we restrict
the number of local minima of a function to be less than the maximum possible.
As the conclusion states, the authors
. . . have shown that the statement “I’m only interested in a subset F of all possi-
ble functions, so the NFL theorems do not apply” is true with probability close to
one. . . [and] the statements “In my application domain, functions with the maximum
number of local minima are not realistic” and “For some components, the objective
functions under consideration will not have the maximal possible steepness” lead to
situations where NFL does not hold.
Next, the authors show that the Sharpened No Free Lunch theorem is a special case of focused
sets, and follows as a corollary from a proved lemma. They also provide a heuristic algorithm
for discovering focused sets for a set of algorithms, but state that it may not always be possible
to find a focused set any smaller than the full permutation closure.
Thus, when comparing the performance of some finite number of search algorithms over a
set of functions, equal performance for members in that set can occur within a focused set much
smaller than the full permutation closure of the functions, and an NFL-like result may hold
even when the set of functions being evaluated is not closed under permutation.
fitness functions are non-trivial random fields in the continuous case) necessarily
have correlations.
The fact that their results rely on special ways of defining random fitnesses and free lunches
suggests that perhaps their results do not hold for all continuous spaces in general. Lockett and
Miikkulainen argue that this is the case [51].
of a generalization system follows directly from the strength of its biases. This line of thinking
foreshadows Schaffer and Wolpert’s later statements that induction is only an expression of bias.
The paper also discusses some possibly justifiable forms of inductive bias, such as using domain
knowledge to reduce uncertainty within the hypothesis space. The claim is made that progress in
machine learning depends on understanding, and justifying, various forms of bias. An appeal is
made for researchers to make examination of the effect of bias in controlling learning as explicit
(and rigorous) as research into how training instances affect learning has been in the past.
In the B attribute language, the parity function is as simple as the C = A1 function discussed
earlier, and the sophisticated strategy has superior performance. The opposite holds when the
parity function is represented using the A attribute representation. Thus, representation can have
a large effect on the simplicity of a function and on the advantage of overfitting avoidance when
learning that function. As Schaffer notes, "An overfitting avoidance scheme that serves as a bias
towards models that are considered simple under one representation is at the same time a bias
towards models that would be considered complex under another representation.”
Furthermore, sparse functions (those with few zeros or few ones) were more suited to use of
overfitting avoidance, regardless of the representation language. This is contrasted with initially
simple functions such as C = A1 , where half of the instances have labels of 1 and the other
half are labeled 0. Under most representations, this function is complex, while under the original
representation language it is simple.
A point is made that although researchers discuss using overfitting avoidance to separate
noise from structure in the data, it is not possible to make such distinctions using the training
data. If many hypotheses of varying complexity are equally consistent with the training data,
then nothing in the observed data can distinguish between them. In those cases, we must use
domain knowledge to guide our choice. This follows as a consequence of Bayes’ rule, where
P (h|d) ∝ P (d|h)P (h), so that when the data are explained equally well by all models (i.e.,
P (d|h) is the same for all h), the only possible way to increase the posterior probability of one
hypothesis over another is to have differences in P (h), which are independent of the data.
The paper discusses some issues with specific applications of Occam’s Razor and minimum
description length, showing that using either of these is simply applying a bias which may or
may not be applicable in certain problem domains. Biases are neither inherently good nor bad,
only more or less appropriate. Schaffer comments that while many data sets used in machine
learning research do apparently have simple underlying relationships, this only tells us what
types of problems researchers tend to work on. By itself, this is not evidence that problems in
the real world tend to have simple underlying relationships.
generalization performance compared to any other, even when considering the conventional gen-
eralization error function, as within the PAC learning framework. He suggests caution when
interpreting PAC learning results that may seem to indicate otherwise.
Lastly, Wolpert argues that while it is true that for some algorithms there exists a P (f ) under
which they beat all other algorithms, there are some algorithms that are not optimal for any P (f ).
Therefore, we can prefer algorithms that are optimal under some distribution to those that are not
under any.
Third, the conservation of information theorems proven in this thesis make progress towards
answering the open question of whether selecting a meta-bias is any easier than selecting a good
bias, showing that within the search framework, good biasing distributions (strategies) are just as
rare as good elements, which can be viewed as preliminary evidence that the problem does not
get easier as one moves up the bias hierarchy.
Part II
Theory
Chapter 3
An Algorithmic Search Framework
F (∅) represents the method used to extract initial information for the search algorithm (absent of
any query), and F (ω) represents the evaluation of point ω under resource F .
A search problem is defined as a 3-tuple, (Ω, T, F ), consisting of a search space, a target
subset of the space and an external information resource F , respectively. Since the true target
locations are hidden, any information gained by the search concerning the target locations is
mediated through the external information resource F alone. Thus, the space of possible search
problems includes many deceptive search problems, where the external resource provides mis-
leading information about target locations, and many noisy problems. In the fully general case,
there can be any relationship between T and F . Because we consider any and all degrees of
dependence between external information resources and target locations, this effectively creates
independence when considering the set as a whole, allowing some of our results to follow as a
consequence.
However, in many natural settings, target locations and external information resources are
tightly coupled. For example, we typically threshold objective function values to designate the
target elements as those that meet or exceed some minimum. Doing so enforces dependence
between target locations and objective functions, where the former is fully determined by the
latter, once the threshold is known. This dependence causes direct correlation between the ob-
jective function (which is observable) and the target locations (which are not directly observable).
We will demonstrate such correlation is exploitable, affecting the upper bound on the expected
probability of success.
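As a minimal illustration of this coupling (a sketch of ours, not a construction from the text), thresholding an observable objective function induces the hidden target set; the objective and threshold below are hypothetical.

def threshold_target(omega, f, theta):
    """Target set induced by thresholding the objective: T = {w in omega : f(w) >= theta}.
    Once theta is fixed, T is fully determined by the observable resource f."""
    return {w for w in omega if f(w) >= theta}

# Hypothetical example: search space 0..9 with an objective peaked at w = 7.
T = threshold_target(range(10), lambda w: -(w - 7) ** 2, theta=-1)
print(sorted(T))   # [6, 7, 8]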
Figure 3.1: Black-box search algorithm. At time i the algorithm computes a probability dis-
tribution Pi over the search space Ω, using information from the history, and a new point is
drawn according to Pi . The point is evaluated using external information resource F . The tuple
(ω, F (ω)) is then added to the history at position i.
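A minimal Python sketch of this loop may help fix ideas (ours, not from the thesis); the make_distribution argument stands in for whatever algorithm-specific rule maps a history to the next distribution P_i, and is purely hypothetical.

import random

def black_box_search(omega, F, make_distribution, num_queries, seed=0):
    """Generic black-box search: at step i, form a distribution P_i over the search
    space from the history, draw a point from P_i, evaluate it with F, and record it."""
    rng = random.Random(seed)
    omega = list(omega)
    history = []                                   # tuples (omega_i, F(omega_i))
    for _ in range(num_queries):
        P_i = make_distribution(history, omega)    # algorithm-specific rule (hypothetical)
        point = rng.choices(omega, weights=P_i, k=1)[0]
        history.append((point, F(point)))
    return history

def uniform_rule(history, omega):
    """Baseline strategy: uniform random sampling with replacement."""
    return [1.0 / len(omega)] * len(omega)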
3.3 Measuring Performance
Since general search algorithms may vary the total number of sampling steps performed, we
measure performance using the expected per-query probability of success,
\[
q(T, F) = \mathbb{E}_{\tilde{P}, H}\!\left[\frac{1}{|\tilde{P}|}\sum_{i=1}^{|\tilde{P}|} P_i(\omega \in T)\,\Bigg|\, F\right], \qquad (3.1)
\]
where P̃ is the sequence of probability distributions for sampling elements (with one distribution
Pi for each time step i), H is the search history, and the T and F make the dependence on the
search problem explicit. |P̃ | denotes the length of the sequence P̃ , which equals the number of
queries taken. The expectation is taken over all sources of randomness, which includes random-
ness over possible search histories and any randomness in constructing the various Pi from h0:i−1
(if such a construction is not entirely deterministic). Taking the expectation over all sources of
randomness is equivalent to measuring performance for samples drawn from an appropriately
averaged distribution P (· | F ) (see Lemma 3). Because we are sampling from a fixed (after
conditioning and expectation) probability distribution, the expected per-query probability of suc-
cess for the algorithm is equivalent to the induced amount of probability mass allocated to target
elements. Thus, each P (· | F ) demarks an equivalence class of search algorithms mapping to
the same averaged distribution; we refer to these equivalence classes as search strategies.
We use uniform random sampling with replacement as our baseline search algorithm, which
is a simple, always-available strategy, and we define p(T, F ) as the per-query probability of
success for that method. In a natural way, p(T, F ) is a measure of the intrinsic difficulty of
a search problem [17], absent any side-information. The ratio p(T, F )/q(T, F ) quantifies
the improvement achieved by a search algorithm over simple random sampling. Like an error
quantity, when this ratio is small (i.e., less than one) the performance of the algorithm is better
than uniform random sampling, but when it is larger than one, performance is worse. We will
often write p(T, F ) simply as p, when the target set and information resource are clear from the
context.
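The following small Python sketch (ours; toy numbers) computes these quantities for a fixed, realized sequence of sampling distributions: the per-query success q, the baseline p, and the resulting advantage in bits, −log₂(p/q).

import math

def per_query_success(distributions, target_indices):
    """Average probability mass placed on the target across P_1, ..., P_n."""
    masses = [sum(P[i] for i in target_indices) for P in distributions]
    return sum(masses) / len(masses)

def advantage_bits(p, q):
    """Bits of advantage over uniform sampling: -log2(p / q)."""
    return -math.log2(p / q)

# Toy example: |Omega| = 8, T = {3}, so p = 1/8.
P1 = [1 / 8] * 8                        # uniform first query
P2 = [0.05] * 3 + [0.65] + [0.05] * 4   # second query concentrates mass near the target
q = per_query_success([P1, P2], target_indices={3})
print(q, advantage_bits(1 / 8, q))      # approximately 0.39 and about 1.6 bits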
3.4 Notation
Table 3.1 gives the symbols used in this manuscript with their meanings.
Table 3.1: Notation
Symbol Definition
Ω Search space
ω Element of search space Ω
A Search algorithm
T Target set, T ⊆ Ω
F External information resource
F (∅) Initialization information
F (ω) Query feedback information
τ Set of target sets
Bm Set of external information resources
P̃ Sequence of probability distributions on Ω
H Search history
Pi Probability distribution on Ω, Pi ∈ P̃
p Baseline per-query probability of success under uniform random sampling: p = |T |/|Ω|
q          E_{T,F}[q(T, F )]
q(T, F )   Expected per-query probability of success, i.e., E_{P̃,H}[ (1/|P̃|) Σ_{i=1}^{|P̃|} P_i(ω ∈ T) | F ]
P (· | F ) Averaged conditional distribution on Ω for an algorithm, also called a search strategy
I_{q(T,F)} −log₂(p / q(T, F ))
L Loss function
Ξ Error functional
D Training dataset
V Test dataset
µ(A) measure on object A
U(A) uniform measure / distribution on object A
Chapter 4
Machine Learning as Search
In this chapter we demonstrate how to reduce several types of learning problems to search
problems within our formal framework, including the abstract learning problem considered
by Vapnik [81] in his work on empirical risk minimization. Extensions to other types of
learning problems not considered here should be equally straightforward.
functions contains a good function), Ξ can be made relative to the within-class optimum, as a
regret. Also note that T is typically random, being a function of random set V .
The external information resource F consists of loss function L together with D, encoded in
binary format. The ability to represent F as a finite bit string follows directly from the finiteness
of X , Y and the assumption that the loss function can be encoded as a finite binary procedure
(such as a C++ method). F is also random whenever D is itself a random set. Dependence be-
tween D and V (via shared distributional assumptions, for example) creates dependence between
F and T .
Given the elements (Ω, T, F ) (which require L, D and V ), we can view each regression
algorithm as a search that outputs a ĝ ∈ Ω given external information resource F . Thus, regression
problems can be represented naturally within our search framework.
Figure 4.1: Points to be clustered in a vector space.
2. A fixed set of usually unknown clusters (more precisely, distributions), from which the
points are drawn; and
3. A ground truth assignment of points to clusters (namely, the actual generation history of
points).
The set of points are represented in a vector space, typically Rd , and we want to recover the
latent cluster assignments for all points. For soft-clustering algorithms, each point may have
membership in more than one cluster, whereas hard-clustering algorithms assign each point to
at most one cluster. We will make a simplifying assumption that the number of clusters does not
exceed the number of points and that the number of points, N , is finite. Given these assumptions,
one can represent any clustering of the points among K classes as a real-valued N × K matrix,
W , where entry Wij represents the membership of point i in cluster j. For soft-clustering algo-
rithms, these membership weights can take on real values, whereas hard-clustering algorithms
produce binary values at all entries. Without loss of generality, assume all membership weights
are normalized so that 0 ≤ Wij ≤ 1. As a further simplification, since K ≤ N , we can represent
W as simply an N ×N matrix, where columns with all zeros correspond to non-existent clusters.
In other words, if our algorithm produces only three clusters, then all but three columns of W
will be equal to zero vectors.
The points themselves, being embedded in some vector space, have a representation within
that space. This could be a vector of values in R^d , or some other representation. Let us make
a distinction between the points themselves and their representation within a particular vector
space; thus a point can have many representations, given different vector spaces.
To cast any problem within our framework, we need to define our search space Ω, an external
information resource F , and a target set T , and we assume the clustering algorithm A is fixed
and given and the set of N points is also fixed and given. The search space Ω is the space of all
possible assignment matrices W , and we discretize the search space by requiring finite precision
representation of the assignment weights. The true assignment is represented by matrix W ∗ ,
and the target set T consists of all W such that L(W, W ∗ ) ≤ ε for loss function L and threshold
ε, namely T = {W : L(W, W ∗ ) ≤ ε}.
Lastly, the external information resource F is a binary-coded vector space representation of all
N points, where the values are again finite precision. We can assume that F (∅) returns the entire
set of points, and that the algorithm produces at most one W matrix (acting as a single-query
algorithm). Thus, we can fully represent general clustering problems within our framework.
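As an illustrative sketch (ours, with a specific hard-clustering loss chosen only for concreteness), membership in the target set can be tested by comparing an assignment matrix W against W* under a label-permutation-invariant loss.

import numpy as np
from itertools import permutations

def clustering_loss(W, W_star):
    """Disagreement between two hard assignment matrices, minimized over column
    (label) permutations; feasible only for small numbers of clusters."""
    K = W.shape[1]
    return min(np.abs(W[:, list(perm)] - W_star).sum() for perm in permutations(range(K)))

def in_target(W, W_star, eps):
    """Membership test for T = {W : L(W, W*) <= eps}."""
    return clustering_loss(W, W_star) <= eps

# Toy example: 4 points, 2 clusters; W matches W* up to a label swap.
W_star = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
W = W_star[:, ::-1]
print(in_target(W, W_star, eps=0))   # True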
To gain intuition as to how specific clustering tasks would look within our framework, let us
return to the dataset presented in Figure 4.1. Although the points visually look like they belong
to a single cluster, they were generated by sampling two independent sets of points i.i.d. from
two multivariate Gaussian distributions. Figure 4.2 shows the ground-truth cluster assignments
for this example. Given an unlabeled set of points like Figure 4.1 (and perhaps the true number
of clusters), our algorithm must produce a W matrix that reliably separates the points into their
distinctive groups. (We assume that our loss function takes into account label permutations,
which are equivalent to transpositions of the columns of the W matrix.)
What becomes immediately apparent is the importance of the vector space representation of
the points, as well as the set of points themselves. Had the points been projected to a different
space that superimposed the two clusters, we would have little hope of recovering the true assign-
ments. Thus, representations can be more (or less) informative of the true cluster assignments,
and thus, provide more (or less) information for recovering the true W ∗ . This is similar to a
case where you want to sort a group of people into Republican and Democrat clusters, but
you have a choice in what attributes and features to use. If you represent the people using a bad
feature representation (such as using gender-normalized shoe size, shirt color, number of eyes,
number of vowels in first name), you have little chance of reliably clustering your population into
their respective groups. However, if you use features highly correlated with the latent classes,
such as voter registration party and zip code, you have a much higher probability of producing
a clustering close to the truth. A good representation will increase the distance between classes,
as in Figure 4.3, while a poor one will decrease it. Thus, we see the importance of dependence
between the external information resource F , which is the chosen representation of a sampled
set of points, and T , the target set of acceptable clusterings.
In addition to this, it should be clear that the way the points themselves are chosen, in any
vector space representation, will also play a major role in the recovery of the true clusters. Choos-
ing points in a noisy or adversarial manner instead of by sampling i.i.d. from the true underlying
distributions will also decrease the information the set of points provides concerning the true
cluster assignments.
Figure 4.3: Samples drawn from two multivariate Gaussian distributions, with better separation
among classes.
θ −→ Density / Sampling Process −→ D
This includes the true parameter vector θ, as well as any acceptably close alternatives. Restric-
tions detailing how D is produced in regards to θ (e.g., by sampling i.i.d. from a density param-
eterized by θ) control how F tends to vary for different θ, and consequently, how informative F
is concerning θ.
defined on space Z, and consider the parameterized set of functions Q_α(z), α ∈ Λ. The goal
is to minimize R(α) = ∫ Q_α(z) dP(z) for α ∈ Λ, when P(z) is unknown but an i.i.d. sample
z_1 , . . . , z_ℓ is given. Let R_emp(α) = (1/ℓ) Σ_{i=1}^{ℓ} Q_α(z_i) be the empirical risk.
To reduce this general problem to a search problem within our framework, assume Λ is finite,
choose ε ∈ R≥0 , and let
• Ω = Λ;
• T = {α : α ∈ Λ, R(α) − min_{α′∈Λ} R(α′) < ε};
• F = {z_1 , . . . , z_ℓ };
• F (∅) = {z_1 , . . . , z_ℓ }; and
• F (α) = R_emp(α).
Thus, any finite problem representable in Vapnik’s learning framework is also directly rep-
resentable within our search framework. The results presented here apply to all such problems
considered in statistical learning theory.
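A compact Python sketch of this reduction (ours; the helper names and the use of the true risk to define T are for illustration only, since T is hidden from the searcher):

import numpy as np

def erm_as_search(Lambda, Q, sample, true_risk, eps):
    """Vapnik's setting as a search problem:
    Omega = Lambda, T = {alpha : R(alpha) - min R < eps}, F = the sample,
    with query feedback F(alpha) = empirical risk R_emp(alpha)."""
    omega = list(Lambda)
    R = {a: true_risk(a) for a in omega}    # unknown in practice; used here only to define T
    best = min(R.values())
    T = {a for a in omega if R[a] - best < eps}
    def F(alpha=None):
        if alpha is None:
            return list(sample)             # F(empty set): the raw sample z_1, ..., z_l
        return float(np.mean([Q(alpha, z) for z in sample]))   # R_emp(alpha)
    return omega, T, F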
Chapter 5
Theoretical Results1
5.1 Lemmata
We begin by proving several lemmata which will help us in establishing our main theorems.
Lemma 1 (Sauer-Shelah Inequality). For d ≤ n,
\[
\sum_{j=0}^{d} \binom{n}{j} \le \left(\frac{en}{d}\right)^{d}.
\]
Proof. We reproduce a simple proof of the Sauer-Shelah inequality [68] for completeness.
\begin{align*}
\sum_{j=0}^{d} \binom{n}{j} &\le \left(\frac{n}{d}\right)^{d} \sum_{j=0}^{d} \binom{n}{j} \left(\frac{d}{n}\right)^{j} && (5.1)\\
&\le \left(\frac{n}{d}\right)^{d} \sum_{j=0}^{n} \binom{n}{j} \left(\frac{d}{n}\right)^{j} && (5.2)\\
&= \left(\frac{n}{d}\right)^{d} \left(1 + \frac{d}{n}\right)^{n} && (5.3)\\
&\le \left(\frac{n}{d}\right)^{d} \lim_{n\to\infty} \left(1 + \frac{d}{n}\right)^{n} && (5.4)\\
&= \left(\frac{en}{d}\right)^{d}. && (5.5)
\end{align*}
Lemma 2. \(\sum_{j=0}^{\lfloor n/2^b \rfloor} \binom{n}{j} \le 2^{n-b}\) for b ≥ 3 and n ≥ 2^b.
Proof. For b ≥ 3,
\[
2b + \log_2 e \le 2^b, \qquad (5.6)
\]
¹ This chapter reproduces content from Montañez, "The Famine of Forte: Few Search Problems Greatly Favor Your Algorithm" (arXiv 2016).
which implies \(2^{-b}(2b + \log_2 e) \le 1\). Therefore,
\begin{align*}
1 &\ge 2^{-b}(2b + \log_2 e) && (5.7)\\
&= \frac{b}{2^b} + \frac{b + \log_2 e}{2^b} && (5.8)\\
&\ge \frac{b}{n} + \frac{b + \log_2 e}{2^b}, && (5.9)
\end{align*}
using the condition n ≥ 2^b, which implies
\[
n \ge b + \frac{n}{2^b}\left(b + \log_2 e\right). \qquad (5.10)
\]
Thus,
\begin{align*}
2^n &\ge 2^{\,b + \frac{n}{2^b}(b + \log_2 e)}\\
&= 2^b \, 2^{\frac{n}{2^b}(b + \log_2 e)}\\
&= 2^b \, 2^{\frac{nb}{2^b}} \, 2^{\log_2 e \cdot \frac{n}{2^b}}\\
&= 2^b \, 2^{\frac{nb}{2^b}} \, e^{\frac{n}{2^b}}\\
&= 2^b \left(\frac{en}{\,n/2^b\,}\right)^{\frac{n}{2^b}}\\
&\ge 2^b \sum_{j=0}^{n/2^b} \binom{n}{j}\\
&\ge 2^b \sum_{j=0}^{\lfloor n/2^b \rfloor} \binom{n}{j},
\end{align*}
where the penultimate inequality follows from the Sauer-Shelah inequality [68]. Dividing through
by 2^b gives the desired result.
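The bound in Lemma 2 can also be spot-checked numerically for small parameter values; the sketch below is ours and is not part of the proof.

from math import comb

def lemma2_holds(n, b):
    """Check that sum_{j=0}^{floor(n / 2**b)} C(n, j) <= 2**(n - b)."""
    lhs = sum(comb(n, j) for j in range(n // 2 ** b + 1))
    return lhs <= 2 ** (n - b)

# The lemma assumes b >= 3 and n >= 2**b.
print(all(lemma2_holds(n, b) for b in range(3, 7) for n in range(2 ** b, 2 ** b + 50)))  # True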
Lemma 3 (Expected Per-Query Performance From Expected Distribution). Let t be a target set,
q(t, f ) the expected per-query probability of success for an algorithm, and ν be the conditional
joint measure induced by that algorithm over finite sequences of probability distributions and
search histories, conditioned on external information resource f . Denote a probability distribution
sequence by P̃ and a search history by h. Let U(P̃ ) denote a uniform distribution on elements of
P̃ and define
\[
P(x \mid f) = \int \mathbb{E}_{P \sim U(\tilde{P})}[P(x)] \, d\nu(\tilde{P}, h \mid f).
\]
Then,
\[
q(t, f) = P(X \in t \mid f),
\]
where P (X | f ) is a probability distribution on the search space.
Proof. Begin by expanding the definition of E_{P∼U(P̃)}[P(x)], being the average probability mass
on element x under sequence P̃ :
\[
\mathbb{E}_{P \sim U(\tilde{P})}[P(x)] = \frac{1}{|\tilde{P}|} \sum_{i=1}^{|\tilde{P}|} P_i(x).
\]
Next, we confirm that P (x | f ) is a proper probability distribution:
1. P (x | f ) ≥ 0, being the integral of a nonnegative function;
2. P (x | f ) ≤ 1, since
\[
P(x \mid f) \le \int \frac{1}{|\tilde{P}|} \sum_{i=1}^{|\tilde{P}|} 1 \; d\nu(\tilde{P}, h \mid f) = 1;
\]
3. the distribution sums to one, since
\begin{align*}
\sum_x P(x \mid f) &= \int \left[\frac{1}{|\tilde{P}|} \sum_{i=1}^{|\tilde{P}|} \sum_x P_i(x)\right] d\nu(\tilde{P}, h \mid f)\\
&= \int \frac{1}{|\tilde{P}|} \sum_{i=1}^{|\tilde{P}|} 1 \; d\nu(\tilde{P}, h \mid f)\\
&= \int \frac{|\tilde{P}|}{|\tilde{P}|} \, d\nu(\tilde{P}, h \mid f)\\
&= 1.
\end{align*}
Finally,
\begin{align*}
P(X \in t \mid f) &= \sum_x \mathbb{1}_{x \in t} \, P(x \mid f)\\
&= \sum_x \mathbb{1}_{x \in t} \int \mathbb{E}_{P \sim U(\tilde{P})}[P(x)] \, d\nu(\tilde{P}, h \mid f)\\
&= \int \left[\sum_x \mathbb{1}_{x \in t} \, \frac{1}{|\tilde{P}|} \sum_{i=1}^{|\tilde{P}|} P_i(x)\right] d\nu(\tilde{P}, h \mid f)\\
&= \int \left[\frac{1}{|\tilde{P}|} \sum_{i=1}^{|\tilde{P}|} \sum_x \mathbb{1}_{x \in t} \, P_i(x)\right] d\nu(\tilde{P}, h \mid f)\\
&= \mathbb{E}_{\tilde{P}, H}\!\left[\frac{1}{|\tilde{P}|} \sum_{i=1}^{|\tilde{P}|} \mathbb{E}_{X \sim P_i}[\mathbb{1}_{X \in t}] \,\Bigg|\, f\right]\\
&= \mathbb{E}_{\tilde{P}, H}\!\left[\frac{1}{|\tilde{P}|} \sum_{i=1}^{|\tilde{P}|} P_i(X \in t) \,\Bigg|\, f\right]\\
&= q(t, f).
\end{align*}
Lemma 4. If X ⊥ T |F , then
Proof. Pr(X ∈ T ; A) denotes the probability that random variable X will be in target T (marginal-
ized over all values of F ) when T is random and X is drawn from P (X|F ). Then,
where the third equality makes use of the law of iterated expectation, the fourth follows from the
conditional independence assumption, and the final equality follows from Lemma 3.
Lemma 5 (Maximum Number of Satisfying Vectors). Given an integer 1 ≤ k ≤ n, a set
S = {s : s ∈ {0, 1}^n , ‖s‖ = √k} of all n-length k-hot binary vectors, a set \mathcal{P} = {P : P ∈ R^n , Σ_j P_j = 1}
of discrete n-dimensional simplex vectors, and a fixed scalar threshold ε ∈ [0, 1], then for any
fixed P ∈ \mathcal{P},
\[
\sum_{s \in S} \mathbb{1}_{s^\top P \ge \varepsilon} \;\le\; \frac{1}{\varepsilon} \binom{n-1}{k-1}.
\]
Proof. For ε = 0, the bound holds trivially. For ε > 0, let S be a random quantity that takes
values s uniformly in the set S. Then, for any fixed P ∈ \mathcal{P},
\begin{align*}
\sum_{s \in S} \mathbb{1}_{s^\top P \ge \varepsilon} &= \binom{n}{k}\, \mathbb{E}\!\left[\mathbb{1}_{S^\top P \ge \varepsilon}\right]\\
&= \binom{n}{k}\, \Pr\!\left(S^\top P \ge \varepsilon\right).
\end{align*}
Let 1 denote the all-ones vector. Under a uniform distribution on random quantity S, and because
P does not change with respect to s, we have
\begin{align*}
\mathbb{E}\!\left[S^\top P\right] &= \binom{n}{k}^{-1} \sum_{s \in S} s^\top P\\
&= \binom{n}{k}^{-1} P^\top \sum_{s \in S} s\\
&= \binom{n}{k}^{-1} P^\top \binom{n-1}{k-1} \mathbf{1}\\
&= \frac{\binom{n-1}{k-1}}{\binom{n}{k}}\, P^\top \mathbf{1}\\
&= \frac{k}{n}\, P^\top \mathbf{1}\\
&= \frac{k}{n},
\end{align*}
since P must sum to 1.
Noting that S^⊤P ≥ 0, we use Markov's inequality to get
\begin{align*}
\sum_{s \in S} \mathbb{1}_{s^\top P \ge \varepsilon} &= \binom{n}{k}\, \Pr\!\left(S^\top P \ge \varepsilon\right)\\
&\le \binom{n}{k}\, \frac{1}{\varepsilon}\, \mathbb{E}\!\left[S^\top P\right]\\
&= \binom{n}{k}\, \frac{1}{\varepsilon} \cdot \frac{k}{n}\\
&= \frac{1}{\varepsilon} \binom{n-1}{k-1}.
\end{align*}
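For intuition, the count in Lemma 5 can be checked directly for a small n by enumerating all k-hot vectors (an illustrative check of ours, not part of the development).

from itertools import combinations
from math import comb
import random

def count_satisfying(P, k, eps):
    """Number of k-hot binary vectors s with s . P >= eps."""
    n = len(P)
    return sum(1 for idx in combinations(range(n), k)
               if sum(P[i] for i in idx) >= eps)

rng = random.Random(1)
n, k, eps = 10, 3, 0.5
weights = [rng.random() for _ in range(n)]
P = [w / sum(weights) for w in weights]       # an arbitrary point on the simplex
print(count_satisfying(P, k, eps) <= comb(n - 1, k - 1) / eps)   # True, as the lemma guarantees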
Proof. Similar results have been proved by others with regard to No Free Lunch theorems [17,
18, 21, 71, 95]. Our result concerns the maximum proportion of sufficiently good strategies (not
the mean performance of strategies over all problems, as in the NFL case) and is a simplification
over previous search-for-a-search results.
For ε = 0, the bound holds trivially. For ε > 0, we first notice that the µ(\mathcal{P})^{-1} term can be
viewed as a uniform density over the region of the simplex \mathcal{P}, so that the integral becomes an
expectation with respect to this distribution, where P is drawn uniformly from \mathcal{P}. Thus, for any
s ∈ S,
\begin{align*}
\frac{\mu(G_{s,\varepsilon})}{\mu(\mathcal{P})} &= \int_{\mathcal{P}} \mathbb{1}_{s^\top P \ge \varepsilon} \, \frac{1}{\mu(\mathcal{P})} \, d\mu(P)\\
&= \mathbb{E}_{P \sim U(\mathcal{P})}\!\left[\mathbb{1}_{s^\top P \ge \varepsilon}\right]\\
&= \Pr\!\left(s^\top P \ge \varepsilon\right)\\
&\le \frac{1}{\varepsilon}\, \mathbb{E}_{P \sim U(\mathcal{P})}\!\left[s^\top P\right],
\end{align*}
where the final line follows from Markov's inequality. Since the symmetric Dirichlet distribution
in n dimensions with parameter α = 1 gives the uniform distribution over the simplex, we get
\begin{align*}
\mathbb{E}_{P \sim U(\mathcal{P})}[P] &= \mathbb{E}_{P \sim \mathrm{Dir}(\alpha=1)}[P]\\
&= \frac{\alpha}{\sum_{i=1}^{n} \alpha}\, \mathbf{1}\\
&= \frac{1}{n}\, \mathbf{1},
\end{align*}
the same problem2 . For algorithm A, positive Iq(T,F ) is equivalent to sampling a smaller search
space for a target set of the same size in expectation, thus having a natural advantage over uni-
form search on the original search space. We can quantify favorability in terms of the number
of bits advantage A has over uniform sampling. We have the following theorem concerning the
proportion of b-bit favorable problems:
Theorem 1. Let τ = {T | T ⊆ Ω} and τ_b = {T | ∅ ≠ T ⊆ Ω, I_{q(T,F)} ≥ b}. Then for b ≥ 3,
\[
\frac{|\tau_b|}{|\tau|} \le 2^{-b}.
\]
Proof. Let τ′_b = {T | T ⊆ Ω, |T | ≤ |Ω|2^{−b}}; since I_{q(T,F)} ≥ b and q(T, F ) ≤ 1 together imply
|T |/|Ω| ≤ 2^{−b}, we have τ_b ⊆ τ′_b. For |Ω| < 2^b and I_{q(T,F)} ≥ b, we have |T | < 1 for all elements
of τ′_b (making the set empty) and the theorem follows immediately. Thus, |Ω| ≥ 2^b for the remainder.
By Lemma 2, we have
\begin{align*}
\frac{|\tau_b|}{|\tau|} &\le \frac{|\tau'_b|}{|\tau|} && (5.12)\\
&= 2^{-|\Omega|} \sum_{k=0}^{\lfloor |\Omega|/2^b \rfloor} \binom{|\Omega|}{k} && (5.13)\\
&\le 2^{-|\Omega|} \, 2^{|\Omega| - b} && (5.14)\\
&= 2^{-b}. && (5.15)
\end{align*}
Thus, for a fixed information resource F , few target sets can be greatly favorable for any
single algorithm.
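A brute-force toy enumeration (ours, not from the thesis) makes the bound tangible: for a small search space and a fixed strategy, one can count directly how few target sets are 3-bit favorable.

from itertools import combinations
from math import log2

def fraction_favorable(P, b):
    """Fraction of non-empty target sets T with advantage -log2(p/q) >= b,
    where p = |T|/|Omega| and q is the strategy mass P places on T."""
    n = len(P)
    favorable = total = 0
    for k in range(1, n + 1):
        for T in combinations(range(n), k):
            total += 1
            q = sum(P[i] for i in T)
            if q > 0 and -log2((k / n) / q) >= b:
                favorable += 1
    return favorable / total

# A strategy that piles most of its mass on one element of a 16-element space.
P = [0.85] + [0.01] * 15
print(fraction_favorable(P, b=3), 2 ** -3)   # the observed fraction sits well below the 2^-3 bound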
Since this result holds for all target sets on Ω, a question immediately arises: would this
scarcity of favorable problems exist if we limited ourselves to only sparse target sets, namely
those having few target elements? Since the problems would be more difficult for uniform
sampling, perhaps the fraction of favorable problems can grow accordingly. We examine this
situation next.
² This pointwise KL-divergence is almost identical to the active information (I+ ) transform defined in [17], but
differs in that for active information q is the probability of success for algorithm A under some query constraint.
Here we keep the pointwise KL-divergence form of I+ , but take the ratio of the expected per-query probabilities of
success.
5.3 The Famine of Forte
If we restrict our consideration to k-sparse target sets, we have the following result which shows
that a similarly restrictive bound continues to hold in the k-sparse case.
Theorem 2 (Famine of Forte). Define
\[
\tau_k = \{T \mid T \subseteq \Omega, |T| = k \in \mathbb{N}\}
\]
and let B_m denote any set of binary strings, such that the strings are of length m or less. Let
R = {(T, F ) | T ∈ τ_k , F ∈ B_m }, and
\[
R_{q_{\min}} = \{(T, F) \mid T \in \tau_k, F \in B_m, q(T, F) \ge q_{\min}\},
\]
where q(T, F ) is the expected per-query probability of success for algorithm A on problem
⟨Ω, T, F⟩. Then for any m ∈ N,
\[
\frac{|R_{q_{\min}}|}{|R|} \le \frac{p}{q_{\min}}
\qquad \text{and} \qquad
\lim_{m \to \infty} \frac{|R_{q_{\min}}|}{|R|} \le \frac{p}{q_{\min}},
\]
where p = k/|Ω|.
Thus, the famine of favorable problems exists for target-sparse problems as well. This result
improves on Theorem 1 by removing the restrictions on the minimum search space size and
amount of favorability, and by allowing for consideration of not just a fixed information resource
F but any finite set of external information resources. We will now prove the result.
Proof. We begin by defining a set S of all |Ω|-length target functions with exactly k ones, namely,
S = {s : s ∈ {0, 1}^{|Ω|} , ‖s‖ = √k}. For each of these, we have |B_m | external information
resources. The total number of search problems is therefore
\[
\binom{|\Omega|}{k} |B_m|. \qquad (5.16)
\]
We seek to bound the proportion of possible search problems for which q(s, f ) ≥ q_min for any
threshold q_min ∈ (0, 1]. Thus,
\begin{align*}
\frac{|R_{q_{\min}}|}{|R|} &\le \frac{|B_m| \sup_f \sum_{s \in S} \mathbb{1}_{q(s,f) \ge q_{\min}}}{|B_m| \binom{|\Omega|}{k}} && (5.17)\\
&= \binom{|\Omega|}{k}^{-1} \sum_{s \in S} \mathbb{1}_{q(s,f^*) \ge q_{\min}}, && (5.18)
\end{align*}
where f ∗ ∈ B_m denotes the arg sup of the expression. Therefore,
\begin{align*}
\frac{|R_{q_{\min}}|}{|R|} &\le \binom{|\Omega|}{k}^{-1} \sum_{s \in S} \mathbb{1}_{q(s,f^*) \ge q_{\min}}\\
&= \binom{|\Omega|}{k}^{-1} \sum_{s \in S} \mathbb{1}_{P(\omega \in s \mid f^*) \ge q_{\min}}\\
&= \binom{|\Omega|}{k}^{-1} \sum_{s \in S} \mathbb{1}_{s^\top P_{f^*} \ge q_{\min}},
\end{align*}
where the first equality follows from Lemma 3, ω ∈ s means the target function s evaluated at
ω is one, and P_{f*} represents the |Ω|-length probability vector defined by P (· | f ∗ ). By Lemma 5,
we have
\begin{align*}
\binom{|\Omega|}{k}^{-1} \sum_{s \in S} \mathbb{1}_{s^\top P_{f^*} \ge q_{\min}} &\le \binom{|\Omega|}{k}^{-1} \frac{1}{q_{\min}} \binom{|\Omega|-1}{k-1}\\
&= \frac{k}{|\Omega|} \cdot \frac{1}{q_{\min}}\\
&= p/q_{\min}. && (5.19)
\end{align*}
Next, we use the monotone convergence theorem to show the limit exists. First,
\[
\lim_{m \to \infty} a_m = \lim_{m \to \infty} \frac{\sup_{f \in A_m} \sum_{s} \mathbb{1}_{q(s,f) \ge q_{\min}}}{\binom{|\Omega|}{k}}. \qquad (5.22)
\]
By construction, the successive Am are nested with increasing m, so the sequence of suprema
(and numerator) are increasing, though not necessarily strictly increasing. The denominator is
not dependent on m, so {am } is an increasing sequence. Because it is also bounded above by
p/q_min , the limit exists by monotone convergence. Thus, \(\lim_{m\to\infty} a_m \le p/q_{\min}\).
Lastly,
\begin{align*}
\lim_{m \to \infty} b_m &= \lim_{m \to \infty} \frac{|B_m| \sup_{f \in B_m} \sum_{s \in S} \mathbb{1}_{q(s,f) \ge q_{\min}}}{|B_m| \binom{|\Omega|}{k}}\\
&= \lim_{m \to \infty} \frac{\sup_{f \in B_m} \sum_{s \in S} \mathbb{1}_{q(s,f) \ge q_{\min}}}{\binom{|\Omega|}{k}}\\
&\le \lim_{m \to \infty} \frac{\sup_{f \in A_m} \sum_{s \in S} \mathbb{1}_{q(s,f) \ge q_{\min}}}{\binom{|\Omega|}{k}}\\
&= \lim_{m \to \infty} a_m\\
&\le p/q_{\min}.
\end{align*}
We see that for small p (problems with sparse target sets) favorable search problems are rare
if we desire a strong probability of success. The larger qmin , the smaller the proportion. In many
real-world settings, we are given a difficult search problem (with minuscule p) and we hope that
our algorithm has a reasonable chance of achieving success within a limited number of queries.
According to this result, the proportion of problems fulfilling such criteria is also minuscule.
Only if we greatly relax the minimum performance demanded, so that qmin approaches the scale
of p, do such accommodating search problems become plentiful.
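For a sense of scale, consider an illustrative plug-in of Theorem 2 (the numbers are ours, chosen only for exposition):
\[
|\Omega| = 2^{30}, \quad k = 1, \quad q_{\min} = \tfrac{1}{2}
\;\;\Longrightarrow\;\;
p = 2^{-30}, \qquad
\frac{|R_{q_{\min}}|}{|R|} \le \frac{p}{q_{\min}} = 2^{-29} \approx 1.9 \times 10^{-9}.
\]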
5.3.1 Corollary
Using the active information of expectations transform, we can restate the result in Theorem 2
to compare it directly with the bound from Theorem 1. Doing so shows the exact same bound
continues to hold:
Corollary 1. Let R = {(T, F ) | T ∈ τ_k , F ∈ B_m } and R_b = {(T, F ) | T ∈ τ_k , F ∈ B_m , I_{q(T,F)} ≥ b}.
Then for any m ∈ N,
\[
\frac{|R_b|}{|R|} \le 2^{-b}.
\]
Proof. The proof follows from the definition of active information of expectations and Theorem 2.
Note,
\[
b \le -\log_2 \frac{p}{q(T, F)} \qquad (5.23)
\]
implies
\[
q(T, F) \ge p \, 2^{b}. \qquad (5.24)
\]
Since I_{q(T,F)} ≥ b implies q(T, F ) ≥ p2^b , the set of problems for which I_{q(T,F)} ≥ b can be no
bigger than the set for which q(T, F ) ≥ p2^b . By Theorem 2, the proportion of problems for
which q(T, F ) is at least p2^b is no greater than p/(p2^b ). Thus,
\[
\frac{|R_b|}{|R|} \le \frac{1}{2^b}. \qquad (5.25)
\]
Thus, restricting ourselves to target-sparse search problems does nothing to increase the pro-
portion of b-bit favorable problems, contrary to initial expectations. Furthermore, we see that
finding a search problem for which an algorithm effectively reduces the search space by b bits
requires at least b bits, so information is conserved in this context. Assuming you have no domain
knowledge to guide the process of finding a search problem for which your search algorithm ex-
cels, you are unlikely to stumble upon one under uniform chance; indeed, they are exponentially
rare in the amount of improvement sought.
where p is the per-query probability of success for uniform random sampling and q̃ is the per-query probability of success for an alternative search algorithm. Define
\[
\tau_k = \{T \mid T \subseteq \Omega, |T| = k \in \mathbb{N}\}
\]
and let B_m denote any set of binary strings, such that the strings are of length m or less. Let R = {(T, F) | T ∈ τ_k, F ∈ B_m}, and R_b = {(T, F) | T ∈ τ_k, F ∈ B_m, E[I_{q̃}] ≥ b}. Then,
\[
\frac{|R_b|}{|R|} \le 2^{-b}.
\]
Proof. By Jensen's inequality and the concavity of log₂(q̃/p) in q̃, we have
\[
b \le \mathbb{E}\left[-\log_2\frac{p}{\tilde q}\right] = \mathbb{E}\left[\log_2\frac{\tilde q}{p}\right] \le \log_2\frac{\mathbb{E}[\tilde q]}{p} = -\log_2\frac{p}{q(T,F)} = I_{q(T,F)}.
\]
\[
\tau_1 = \{T \mid T \subseteq \Omega, |T| = 1\}
\]
and let B_m denote any set of binary strings, such that the strings are of length m or less, for some m ∈ ℕ. Let R be the set of possible search problems on these sets and let R_A(ξ) be the subset of search problems for algorithm A such that the target element is found in expectation within ξ evaluations, namely,
\[
R = \{(T, F) \mid T \in \tau_1, F \in B_m\}, \quad\text{and}
\]
\[
R_A(\xi) = \left\{(T, F) \;\middle|\; T \in \tau_1,\, F \in B_m,\, \mathbb{E}\!\left[\sum_{i=1}^{\xi}\mathbb{1}_{\omega_i\in T}\,\middle|\, F\right] \ge 1\right\}.
\]
Then,
\[
\frac{|R_A(\xi)|}{|R|} \le \frac{\xi}{|\Omega|}.
\]
Proof. We consider only algorithms that run for exactly ξ queries and terminate. For any al-
gorithm terminating in fewer queries, we consider a related algorithm that has identical initial
behavior but repeats the final query until ξ queries are achieved. Similarly, for algorithms that
might produce more than ξ queries, we replace it with an algorithm that has identical initial
behavior, but terminates after the ξth query.
Using the law of iterated expectation, we have
\[
\begin{aligned}
1 &\le \mathbb{E}\left[\sum_{i=1}^{\xi}\mathbb{1}_{\omega_i\in T}\,\middle|\, F\right] \quad (5.26)\\
&= \mathbb{E}_{\tilde P, H}\!\left[\mathbb{E}\!\left[\sum_{i=1}^{\xi}\mathbb{1}_{\omega_i\in T}\,\middle|\, \tilde P, H\right]\,\middle|\, F\right] \quad (5.27)\\
&= \mathbb{E}_{\tilde P, H}\!\left[\sum_{i=1}^{\xi}\mathbb{E}_{\omega_i}\!\left[\mathbb{1}_{\omega_i\in T}\,\middle|\, \tilde P, H\right]\,\middle|\, F\right] \quad (5.28)\\
&= \mathbb{E}_{\tilde P, H}\!\left[\sum_{i=1}^{\xi} P_i'(\omega\in T)\,\middle|\, F\right], \quad (5.29)
\end{aligned}
\]
where G_{t,q_min} = {P : P ∈ 𝒫, t^⊤P ≥ q_min} and µ is Lebesgue measure. Furthermore, the proportion of possible search strategies giving at least b bits of active information of expectations is no greater than 2^{-b}.
Proof. Applying Lemma 6, with s = t, q_min as the threshold, k = |t|, n = |Ω|, and p = |t|/|Ω|, yields the first result, while following the same steps as Corollary 1 gives the second (noting that by Lemma 3 each strategy is equivalent to a corresponding q(t, f)).
Then,
\[
q \le \frac{I(T;F) + D(P_T\|U_T) + 1}{I_\Omega}
\]
where I_Ω = −log₂(k/|Ω|), D(P_T‖U_T) is the Kullback-Leibler divergence between the marginal distribution on T and the uniform distribution on T, and I(T; F) is the mutual information. Alternatively, we can write
\[
q \le \frac{H(U_T) - H(T\mid F) + 1}{I_\Omega}.
\]
The mutual information I(T ; F ) is the amount of exploitable information the external resource
contains regarding T ; lowering the mutual information lowers the maximum expected probabil-
ity of success for the algorithm. Lastly, the 1 in the numerator upper bounds the contribution of
pure randomness. This expression constrains the relative contributions of predictability, problem
difficulty, side-information, and randomness for a successful search.
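As a quick illustration of how this bound scales, the snippet below evaluates it for assumed toy values of the mutual information, the divergence term, and the target sparsity; the numbers are illustrative and not drawn from the text.

from math import log2

def success_bound(mutual_info_bits, kl_bits, k, omega_size):
    # q <= (I(T;F) + D(P_T || U_T) + 1) / I_Omega, with I_Omega = -log2(k / |Omega|)
    i_omega = -log2(k / omega_size)
    return (mutual_info_bits + kl_bits + 1) / i_omega

# a single-element target in a space of 1024 elements, with 2 bits of mutual information
print(success_bound(mutual_info_bits=2.0, kl_bits=0.0, k=1, omega_size=1024))  # 0.3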
Proof. This proof loosely follows that of Fano’s Inequality [24], being a reversed generalization
of it, so in keeping with the traditional notation we let X := ω for the remainder of this proof.
Let Z = 𝟙(X ∈ T). Using the chain rule for entropy to expand H(Z, T | X) in two different ways, we get
\[
H(Z\mid X) + H(T\mid Z, X) = H(Z, T\mid X) = H(T\mid X) + H(Z\mid T, X).
\]
By definition, H(Z | T, X) = 0, and by the data processing inequality H(T | F) ≤ H(T | X). Thus,
\[
H(T\mid F) \le H(T\mid X) = H(Z\mid X) + H(T\mid Z, X).
\]
Examining H(Z|X), we see it captures how much entropy of Z is due to the randomness of
T . To see this, imagine Ω is a roulette wheel and we place our bet on X. Target elements are
“chosen” as balls land on random slots, according to the distribution on T . When a ball lands on
X as often as not (roughly half the time), this quantity is maximized. Thus, this entropy captures
the contribution of dumb luck, being averaged over all X. (When balls move towards always
landing on X, something other than luck is at work.) We upper bound this by its maximum value
of 1 and obtain
\[
\Pr(X \in T; A) \le \frac{I(T;F) + D(P_T\|U_T) + 1}{I_\Omega}, \tag{5.42}
\]
and substitute q for Pr(X ∈ T ; A) to obtain the first result, noting that q = ET,F P (ω ∈ T |F )
specifies a proper probability distribution by the linearity and boundedness of the expectation.
To obtain the second form, use the definitions I(T ; F ) = H(T ) − H(T |F ) and D(PT kUT ) =
H(UT ) − H(T ).
Combining this result with Equation 5.46, we get
\[
P_g = \frac{H(T\mid Z=0, X) + H(Z\mid X) - H(T\mid X)}{H(T\mid Z=0, X) - H(T\mid Z=1, X)}.
\]
By the definition of mutual information, H(T | X) = H(T) − I(T; X), so we can substitute and rearrange to obtain
\[
P_g = \frac{I(T;X) - H(T) + H(T\mid Z=0, X) + H(Z\mid X)}{H(T\mid Z=0, X) - H(T\mid Z=1, X)}. \tag{5.50}
\]
Let U_T denote the uniform distribution over T ∈ τ_k. Using the facts that
\[
\begin{aligned}
H(T) &= \log_2\binom{|\Omega|}{k} - D(P_T\|U_T), \quad (5.51)\\
H(T\mid Z=0, X) &= \sum_{x\in\Omega}\Pr(X=x)\,H(T\mid Z=0, X=x) \quad (5.52)\\
&= \sum_{x\in\Omega}\Pr(X=x)\left[\log_2\binom{|\Omega|-1}{k} - D(P_{T\mid Z=0,X=x}\|U_{T\mid Z=0,X=x})\right] \quad (5.53)\\
&= \log_2\binom{|\Omega|-1}{k} - \mathbb{E}_X\!\left[D(P_{T\mid Z=0,X}\|U_{T\mid Z=0,X})\right], \quad (5.54)
\end{aligned}
\]
we get
\[
P_g = \frac{I(T;X) - \log_2\binom{|\Omega|}{k} + D(P_T\|U_T) + \log_2\binom{|\Omega|-1}{k} - \mathbb{E}_X\!\left[D(P_{T\mid Z=0,X}\|U_{T\mid Z=0,X})\right] + H(Z\mid X)}{H(T\mid Z=0,X) - H(T\mid Z=1,X)} \tag{5.55}
\]
\[
= \frac{I(T;X) + \log_2\!\left(1 - \tfrac{k}{|\Omega|}\right) + D(P_T\|U_T) - \mathbb{E}_X\!\left[D(P_{T\mid Z=0,X}\|U_{T\mid Z=0,X})\right] + H(Z\mid X)}{H(T\mid Z=0,X) - H(T\mid Z=1,X)}. \tag{5.56}
\]
Applying similar substitutions to the denominator, we obtain
\[
P_g = \frac{I(T;X) + \log_2\!\left(1 - \tfrac{k}{|\Omega|}\right) + D(P_T\|U_T) - \mathbb{E}_X\!\left[D(P_{T\mid Z=0,X}\|U_{T\mid Z=0,X})\right] + H(Z\mid X)}{\log_2\binom{|\Omega|-1}{k} - \log_2\binom{|\Omega|-1}{k-1} + \mathbb{E}_X\!\left[D(P_{T\mid Z=1,X}\|U_{T\mid Z=1,X})\right] - \mathbb{E}_X\!\left[D(P_{T\mid Z=0,X}\|U_{T\mid Z=0,X})\right]} \tag{5.57}
\]
\[
= \frac{I(T;X) + \log_2\!\left(1 - \tfrac{k}{|\Omega|}\right) + D(P_T\|U_T) - \mathbb{E}_X\!\left[D(P_{T\mid Z=0,X}\|U_{T\mid Z=0,X})\right] + H(Z\mid X)}{\log_2\!\left(\tfrac{|\Omega|}{k} - 1\right) + \mathbb{E}_X\!\left[D(P_{T\mid Z=1,X}\|U_{T\mid Z=1,X})\right] - \mathbb{E}_X\!\left[D(P_{T\mid Z=0,X}\|U_{T\mid Z=0,X})\right]} \tag{5.58}
\]
\[
= \frac{I(T;X) + \log_2\!\left(1 - \tfrac{k}{|\Omega|}\right) + D(P_T\|U_T) - \mathbb{E}_X\!\left[D(P_{T\mid Z=0,X}\|U_{T\mid Z=0,X})\right] + H(Z\mid X)}{I_\Omega + \log_2\!\left(1 - \tfrac{k}{|\Omega|}\right) + \mathbb{E}_X\!\left[D(P_{T\mid Z=1,X}\|U_{T\mid Z=1,X})\right] - \mathbb{E}_X\!\left[D(P_{T\mid Z=0,X}\|U_{T\mid Z=0,X})\right]}. \tag{5.59}
\]
Let $C_r = \log_2\!\left(1 - \tfrac{k}{|\Omega|}\right) - \mathbb{E}_X\!\left[D(P_{T\mid Z=0,X}\|U_{T\mid Z=0,X})\right]$, as a correction term, and define the information leakage I_L = I(T; F) − I(T; X). Using the definition of P_g and adding and subtracting I(T; F) from the numerator, we conclude
\[
P_g = \frac{I(T;F) - I_L + D(P_T\|U_T) + H(Z\mid X) + C_r}{I_\Omega + \mathbb{E}_X\!\left[D(P_{T\mid Z=1,X}\|U_{T\mid Z=1,X})\right] + C_r}.
\]
on the instance space, then those hypotheses that remain will have an equal number of positive
labels and negative labels for any instance not yet seen. If strict consistency with training data is
our only criterion for choosing hypotheses, we cannot decide if an unseen instance should receive
a positive or negative label based on majority voting, since the remaining consistent hypotheses
are equally split among both outcomes. Thus, we must essentially guess, having an equal chance
of being correct or wrong in each instance. While Mitchell gave an informal argument similar
to the preceding to prove this point, we formally prove the need for biased classifiers, extending
the result to hold for all finite multi-class learning problems.
We will give two theorems, one for when the true concept h∗ is fixed, and the second for the
more general case, when the true concept is chosen according to some distribution. The first is a
special case of the second (assuming a point-mass, degenerate distribution), but the proof for the
special case (with credit to Cosma Shalizi) is much simpler, so we begin with it.
\[
R = \{h : h \in \Omega_D\}, \quad\text{and} \tag{5.62}
\]
\[
R_w = \{h : h \in \Omega_D,\; w_{h,h^*}(v_x) = w\}. \tag{5.63}
\]
Given these definitions, $|R| = |\mathcal{Y}|^{|\mathcal{X}|-|D|}$ and $|R_w| = \binom{|v_x|}{w}(|\mathcal{Y}|-1)^w\,|\mathcal{Y}|^{|\mathcal{X}|-|D|-|v_x|}$. Because unbiased classifier A has a uniform distribution on the set of all consistent hypotheses, the ratio |R_w|/|R| is equivalent to the probability of choosing a hypothesis making exactly w errors on v_x, namely
\[
\begin{aligned}
\frac{|R_w|}{|R|} &= \frac{\binom{|v_x|}{w}(|\mathcal{Y}|-1)^w\,|\mathcal{Y}|^{|\mathcal{X}|-|D|-|v_x|}}{|\mathcal{Y}|^{|\mathcal{X}|-|D|}} \quad (5.64)\\
&= \binom{|v_x|}{w}\left[|\mathcal{Y}|\left(1-\frac{1}{|\mathcal{Y}|}\right)\right]^w\frac{1}{|\mathcal{Y}|^{|v_x|}} \quad (5.65)\\
&= \binom{|v_x|}{w}\left(1-\frac{1}{|\mathcal{Y}|}\right)^w\frac{|\mathcal{Y}|^w}{|\mathcal{Y}|^{|v_x|}} \quad (5.66)\\
&= \binom{|v_x|}{w}\left(1-\frac{1}{|\mathcal{Y}|}\right)^w\left(\frac{1}{|\mathcal{Y}|}\right)^{|v_x|-w} \quad (5.67)\\
&= \binom{M}{w}\left(1-\frac{1}{|\mathcal{Y}|}\right)^w\left(\frac{1}{|\mathcal{Y}|}\right)^{M-w}. \quad (5.68)
\end{aligned}
\]
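A small simulation makes the result tangible: drawing a hypothesis uniformly from those consistent with the training data and counting its errors on M disjoint test instances should reproduce a Binomial(M, 1 − 1/|Y|) error distribution. The instance-space size, label count, and sample sizes below are assumed toy values.

import numpy as np

rng = np.random.default_rng(1)
X_size, Y_size, n_train, M = 12, 3, 4, 5            # |X|, |Y|, |D|, |v_x| (toy values)

true_concept = rng.integers(Y_size, size=X_size)
train_idx = rng.choice(X_size, size=n_train, replace=False)
test_idx = np.array([i for i in range(X_size) if i not in train_idx])[:M]

def sample_consistent_hypothesis():
    # uniform over hypotheses that agree with the true concept on the training instances
    h = rng.integers(Y_size, size=X_size)
    h[train_idx] = true_concept[train_idx]
    return h

errors = [int(np.sum(sample_consistent_hypothesis()[test_idx] != true_concept[test_idx]))
          for _ in range(20000)]
print("simulated E[w]:", np.mean(errors))
print("binomial  E[w]:", M * (1 - 1 / Y_size))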
5.8.2 General Case: Concept and Test Set Drawn from Distributions
We again consider algorithms that make no assumptions beyond strict consistency with training
data. This requires that the training data be noiseless, with accurate class labels. Although one
can extend the proof to handle the case of noisy training labels (see Appendix A), doing so
increases the complexity of the proof and does not add anything substantial to the discussion,
since the same result continues to hold. Thus, we focus on the case of noiseless training data,
assuming strict consistency with it.
Theorem 7. Define as follows:
• X - finite instance space,
• Y - finite label space,
• Ω - Y^X, the space of possible concepts on X,
• h - a hypothesis, h ∈ Ω,
• D = {(x1 , y1 ), . . . , (xN , yN )} - any training dataset where xi ∈ X , yi ∈ Y,
• Dx = {x : (x, ·) ∈ D} - the set of x instances in D,
• Vx = {x1 , . . . , xM } - any test dataset disjoint from Dx (i.e., Dx ∩ Vx = ∅) containing
exactly M elements xi ∈ X with M > 0, hidden from the algorithm during learning,
• ΩD - subset of hypotheses strictly consistent with D, and
• unbiased classifier A - any classifier such that P (h|D, M ) = 1(h ∈ ΩD )/|ΩD | (i.e., makes
no assumptions beyond strict consistency with training data).
Then the distribution of 0-1 generalization error counts for any unbiased classifier is given by
\[
P(w\mid D, M; A) = \binom{M}{w}\left(1-\frac{1}{|\mathcal{Y}|}\right)^w\left(\frac{1}{|\mathcal{Y}|}\right)^{M-w}
\]
where w is the number of wrong predictions on disjoint test sets of size M . Thus, every unbiased
classifier has generalization performance equivalent to random guessing (e.g., flipping a coin)
of class labels for unseen instances.
Figure 5.1: Graphical model of objects involved in unbiased classification and their relationships.
Proof. Figure 5.1 gives the graphical model for our problem setting. In agreement with this
model, we assume the training data D are generated from the true concept h∗ by some process,
and that h is chosen by A using D alone, without access to h∗ . Thus, h∗ → D → h, implying
P (h∗ |D, h, M ) = P (h∗ |D, M ) by d-separation. By similar reasoning, since {Dx , M } → Vx ,
with D as an ancestor of both Vx and h and no other active path between them, we have
P (Vx |f, D, h, M ) = P (Vx |f, D, M ). This can be verified intuitively by the fact that Vx is gen-
erated prior to the algorithm choosing h, and the algorithm has no access to Vx when choosing h
(it only has access to D, which we’re already conditioning on).
Let K = 𝟙(h ∈ Ω_D)/|Ω_D| and let L be the number of free instances in X, namely L = |X| − |D_x| − M, the instances appearing in neither D_x nor V_x.
Then
\[
\begin{aligned}
P(w\mid D, M; A) &= \frac{P(w, D\mid M)}{P(D\mid M)} \quad (5.71)\\
&= \frac{\sum_{h\in\Omega_D} P(w, h, D\mid M)}{\sum_{h\in\Omega_D} P(h, D\mid M)} \quad (5.72)\\
&= \frac{\sum_{h\in\Omega_D} P(w\mid h, D, M)P(h\mid D, M)P(D\mid M)}{\sum_{h\in\Omega_D} P(h\mid D, M)P(D\mid M)} \quad (5.73)\\
&= \frac{\sum_{h\in\Omega_D} P(w\mid h, D, M)\,K\,P(D\mid M)}{\sum_{h\in\Omega_D} K\,P(D\mid M)} \quad (5.74)\\
&= \frac{K\,P(D\mid M)\sum_{h\in\Omega_D} P(w\mid h, D, M)}{K\,P(D\mid M)\,|\Omega_D|} \quad (5.75)\\
&= \frac{1}{|\Omega_D|}\sum_{h\in\Omega_D} P(w\mid h, D, M) \quad (5.76)\\
&= \frac{1}{|\mathcal{Y}|^{M+L}}\sum_{h\in\Omega_D} P(w\mid h, D, M), \quad (5.78)
\end{aligned}
\]
where the last equality uses $|\Omega_D| = |\mathcal{Y}|^{M+L}$, since a hypothesis consistent with D is free on the M + L instances outside of D_x.
Marginalizing over possible true concepts f for term P (w|h, D, M ) and letting Z = {f, h, D, M },
we have
\[
\begin{aligned}
P(w\mid h, D, M) &= \sum_{f\in\Omega} P(w, f\mid h, D, M) \quad (5.79)\\
&= \sum_{f\in\Omega} P(w\mid Z)P(f\mid h, D, M) \quad (5.80)\\
&= \sum_{f\in\Omega} P(w\mid Z)P(f\mid D, M) \quad\text{(by d-separation)} \quad (5.81)\\
&= \sum_{f\in\Omega} P(f\mid D, M)\sum_{v_x} P(w, v_x\mid Z) \quad (5.82)\\
&= \sum_{f\in\Omega} P(f\mid D, M)\sum_{v_x} P(v_x\mid Z)P(w\mid Z, v_x) \quad (5.83)\\
&= \sum_{f\in\Omega} P(f\mid D, M)\sum_{v_x} P(v_x\mid Z)\,\mathbb{1}(w = w_{h,f}(v_x)), \quad (5.84)
\end{aligned}
\]
where $w_{h,f}(v_x) = \sum_{x\in v_x}\mathbb{1}(h(x)\ne f(x))$ and the final equality follows since
\[
P(w\mid Z, v_x) = P(w\mid f, h, D, M, v_x) = \begin{cases} 1 & w = w_{h,f}(v_x),\\ 0 & w \ne w_{h,f}(v_x).\end{cases} \tag{5.85}
\]
Combining (5.78) and (5.84), we obtain
\[
\begin{aligned}
P(w\mid D, M; A) &= \frac{1}{|\mathcal{Y}|^{M+L}}\sum_{h\in\Omega_D}\left[\sum_{f\in\Omega} P(f\mid D, M)\sum_{v_x} P(v_x\mid f, h, D, M)\,\mathbb{1}(w = w_{h,f}(v_x))\right] \quad (5.86)\\
&= \frac{1}{|\mathcal{Y}|^{M+L}}\sum_{h\in\Omega_D}\left[\sum_{f\in\Omega} P(f\mid D, M)\sum_{v_x} P(v_x\mid f, D, M)\,\mathbb{1}(w = w_{h,f}(v_x))\right] \quad (5.87)\\
&= \frac{1}{|\mathcal{Y}|^{M+L}}\sum_{f\in\Omega} P(f\mid D, M)\sum_{v_x} P(v_x\mid Z')\sum_{h\in\Omega_D}\mathbb{1}(w = w_{h,f}(v_x)), \quad (5.88)
\end{aligned}
\]
where we have defined Z' := {f, D, M} and the second equality follows from d-separation between V_x and h, conditioned on D.
Note that $\sum_{h\in\Omega_D}\mathbb{1}(w = w_{h,f}(v_x))$ is the number of hypotheses strictly consistent with D that disagree with concept f exactly w times on v_x. There are $\binom{M}{w}$ ways to choose w disagreements with f on v_x, and for each of the w disagreements we can choose |Y| − 1 possible values for h at that instance, giving a multiplicative factor of (|Y| − 1)^w. For the remaining L instances that are in neither D nor v_x, we have |Y| possible values, giving the additional multiplicative factor of |Y|^L. Thus,
\[
\begin{aligned}
P(w\mid D, M; A) &= \frac{1}{|\mathcal{Y}|^{M+L}}\sum_{f\in\Omega} P(f\mid D, M)\sum_{v_x} P(v_x\mid Z')\binom{M}{w}(|\mathcal{Y}|-1)^w|\mathcal{Y}|^L \quad (5.89)\\
&= \frac{1}{|\mathcal{Y}|^{M+L}}\binom{M}{w}(|\mathcal{Y}|-1)^w|\mathcal{Y}|^L\sum_{f\in\Omega} P(f\mid D, M)\sum_{v_x} P(v_x\mid Z') \quad (5.90)\\
&= \frac{1}{|\mathcal{Y}|^{M}|\mathcal{Y}|^{L}}\binom{M}{w}(|\mathcal{Y}|-1)^w|\mathcal{Y}|^L\sum_{f\in\Omega} P(f\mid D, M) \quad (5.91)\\
&= \binom{M}{w}(|\mathcal{Y}|-1)^w\frac{1}{|\mathcal{Y}|^{M}} \quad (5.92)\\
&= \binom{M}{w}\left[|\mathcal{Y}|\left(1-\frac{1}{|\mathcal{Y}|}\right)\right]^w\frac{1}{|\mathcal{Y}|^{M}} \quad (5.93)\\
&= \binom{M}{w}\left(1-\frac{1}{|\mathcal{Y}|}\right)^w\frac{|\mathcal{Y}|^w}{|\mathcal{Y}|^{M}} \quad (5.94)\\
&= \binom{M}{w}\left(1-\frac{1}{|\mathcal{Y}|}\right)^w|\mathcal{Y}|^{w-M} \quad (5.95)\\
&= \binom{M}{w}\left(1-\frac{1}{|\mathcal{Y}|}\right)^w\left(\frac{1}{|\mathcal{Y}|}\right)^{M-w}. \quad (5.96)
\end{aligned}
\]
5.8.3 Discussion
As a historical note, Wolpert explored many related issues in [91], proving a related result using
his Extended Bayesian Formalism. Namely, he showed that in the noiseless case for error E =
w/(|X| − |D|),
\[
P(E\mid h, D) = \binom{|\mathcal{X}|-|D|}{(|\mathcal{X}|-|D|)(1-E)}(|\mathcal{Y}|-1)^{E(|\mathcal{X}|-|D|)}\big/|\mathcal{Y}|^{(|\mathcal{X}|-|D|)}, \tag{5.97}
\]
whenever P(f | D) is uniform (i.e., in a maximum entropy universe), and argues by symmetry that a similar result would hold for P(h | D) being uniform. By letting M = |X| − |D| and taking into account that E = w/(|X| − |D|), one can recover a result matching (5.96) from (5.97), namely
\[
P(E\mid h, D) = \binom{M}{w}(|\mathcal{Y}|-1)^{w}\big/|\mathcal{Y}|^{M} = \binom{M}{w}\left(1-\frac{1}{|\mathcal{Y}|}\right)^w\left(\frac{1}{|\mathcal{Y}|}\right)^{M-w}.
\]
While we consider different quantities here (i.e., P (w|D, M ; A)) and prove our results using
uniformity in the algorithm rather than in the universe (i.e., P (h|D, M ) being uniform rather
than P (f |D)), the similar proof techniques and closely related results demonstrate the continued
relevance of that early work.
These results show that if the learning algorithm makes no assumptions beyond strict consis-
tency with the training data (i.e., equally weighting every consistent hypothesis) then the general-
ization ability of the algorithm on previously unseen instances cannot surpass that of random
uniform guessing. Therefore, assumptions, encoded in the form of inductive bias, are a necessary
component for generalization to unseen instances in classification on finite spaces.
When we consider the model of generalization used in statistical learning theory, we find
that there “generalization” is simply defined to be performance over a test set drawn from the
same distribution as the training set. Thus, the generalization error is often measured over in-
stances that were seen in training in addition to the new instances. The possibility of shared
instances allows for improved performance simply as a consequence of seeing more instances,
which improves the chances that test instances will have been seen in training (for finite spaces,
like those considered here). Thus, to consider the typical generalization error is to entangle two
forms of ability: memorization ability, for instances seen during training, and true generalization
ability, which applies only to previously unseen instances. If our primary concern is minimizing
risk under the instance distribution, then the blended form of generalization error is completely
appropriate. However, discovering what allows machine learners to generalize beyond examples
seen in training data requires that we separate the two abilities, and consider only performance
on instances not seen during training. Thus, although the test set is assumed to be initially drawn
from the same distribution as the training set, we exclude from our test set those instances con-
tained in the training data. Doing so gives us a clearer statement of what can be gained from the
processing of training data alone.
In addition to demonstrating the necessity of assumptions in classification learning, our re-
sults help us understand why this must be the case. Consider the space of possible hypotheses on
the test data that are consistent with the training instances. Equally weighting every possible la-
bel for a given test instance means that every label is equally likely, and thus the proposed labels
become independent of the instance features. No matter what relationships are found to hold
between labels and features in the training set, the unbiased algorithm ignores these and does
not require that the same relationships continue to hold within the test set. This allows for the
possibility of unrepresentative training data sets, since relationships found there may not carry
over to test data sets, at the cost of generalization ability. If, instead, we require that training
data sets be representative, as is the case when the training and test data sets are drawn (i.i.d.)
from the same instance distribution, then non-trivial relationships between labels and features of
the training data set will continue to hold for test instances as well, allowing generalization to
become possible. Thus, it is the dependence between features and labels that allows machine
learning to work, and a lack of assumptions on the data generating process leads to indepen-
dence, causing performance equivalent to random guessing. This underscores the importance of
dependence between what is seen and what is unseen for successful machine learning.
Chapter 6
Vapnik's statistical learning theory [81] has become something of a de facto paradigm for understanding modern machine learning, and has grown into a diverse field in its own right. We provide a brief overview of the concepts and goals of Vapnik's theory,
and begin to explore the ways in which questions from statistical learning theory can be re-cast
as search questions within our framework. The central concern of Vapnik’s theory, empirical
risk minimization, reduces to a search for models that minimize the empirical risk, making the
necessary assumptions to ensure that the observed information resource feedback (empirical risk
under a training dataset) is informative of the target (the true risk minimizer).
Vapnik’s statistical learning theory reduces areas of machine learning to the problem of trying
to minimize risk under some loss function, given a sample from an unknown (but fixed) distribu-
tion. The model for learning from examples can be described using three core components:
• generator of random vectors P (x);
• supervisor, which returns an output vector y for each x, P (y|x); and
• a learner that implements a set of functions fα (x) = f (x, α), α ∈ Λ.
The learning problem becomes choosing α so that fα (x) predicts the response of the super-
visor with the minimal amount of loss. Overall loss is measured using the risk functional on loss
function L,
\[
R(\alpha) = \int L(y, f_\alpha(x))\,dP(x, y).
\]
Thus, the goal becomes finding an α0 ∈ Λ that minimizes R(α) when P (x, y) is unknown.
Vapnik shows how this general setting can be applied to problems in classification, regression
and density estimation, using appropriate loss functions, such as zero-one loss for classification
and surprisal for density estimation.
The separate learning problems can be further abstracted to a common form as follows. Let P(z) be defined on a space Z and assume a set of functions Q_α(z) = L(y, f_α(x)), α ∈ Λ. The learning problem becomes minimizing $R(\alpha) = \int Q_\alpha(z)\,dP(z)$ for α ∈ Λ when P(z) is unknown but an i.i.d. sample z₁, . . . , z_ℓ is given.
6.1 Empirical Risk Minimization
Define the empirical risk as
\[
R_{\mathrm{emp}}(\alpha) = \frac{1}{\ell}\sum_{i=1}^{\ell} Q_\alpha(z_i).
\]
The empirical risk minimization (ERM) principle seeks to approximate $Q_{\alpha_0}(z_i)$, where $\alpha_0$ minimizes the true risk, with $Q_{\alpha_\ell}(z_i)$, which minimizes the empirical risk. One then demon-
strates conditions under which the true minimal risk is approached as a consequence of minimiz-
ing the empirical risk.
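As a schematic sketch (not Vapnik's formalism itself), the code below carries out empirical risk minimization over a small finite class of threshold classifiers; the data-generating process and the parameter grid Λ are assumptions made only for the example.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y = (x > 0.25).astype(float)                        # the unknown "supervisor"

thresholds = np.linspace(-1, 1, 101)                # Lambda: a finite set of candidate alphas

def empirical_risk(alpha):
    # zero-one loss version of R_emp(alpha)
    return np.mean((x > alpha).astype(float) != y)

risks = np.array([empirical_risk(a) for a in thresholds])
alpha_hat = thresholds[np.argmin(risks)]            # the ERM "search" output
print(f"alpha_hat = {alpha_hat:.2f}, R_emp = {risks.min():.3f}")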
In the context of empirical risk minimization, two questions arise:
1. What are the necessary and sufficient conditions for the statistical consistency of the ERM
principle?
2. At what rate does the sequence of minimized empirical risk values converge to the minimal
actual risk?
The first question is answered by the theory of consistency for learning processes, which, given appropriate statistical conditions, can show that $R(\alpha_\ell) \xrightarrow{p} R(\alpha_0)$, where $R(\alpha_\ell)$ is the risk and the $\alpha_\ell$ are such that they minimize the empirical risk $R_{\mathrm{emp}}(\alpha_\ell)$ for each ℓ = 1, 2, 3, . . ., and that $R_{\mathrm{emp}}(\alpha_\ell) \xrightarrow{p} R(\alpha_0)$ as well. The second question is answered by the nonasymptotic theory
of the rate of convergence, which uses concentration of measure to show at what rate the gap
between the estimate and the optimal risk diminishes. The precise conditions for convergence
are provided in [83] and [82], and demonstrate that restricting the complexity of a family of
learners effectively constrains the ability of the empirical estimate of risk to differ considerably
from the true risk. Here we will examine how the problem of empirical risk minimization can be
viewed as a search problem within our framework.
practical machine learning settings a successful search for the minimal empirical risk α is by no
means assured. Performing this search forms the basis for much of applied machine learning and
optimization research.
If we, like Vapnik, assume searches for empirical risk minimizers are always successful given
an information resource, there is still the higher-level question of what statistical assumptions
guarantee strong dependence between F and T . These questions are answered by statistical
learning theory. In passing the baton to statistical learning theory our framework provides a nat-
ural scaffolding to anchor existing results in machine learning, providing a high-level overview
that makes sense of (and ties together) low-level details. Our search framework naturally divides
and highlights the primary concerns of applied and theoretical machine learning, showing how
each contributes to overall success in learning.
It should also be noted that this form of learning theory is only concerned with reducing
estimation error, namely, finding the best α within our class that reduces true risk. It says nothing
about what the value of that minimal risk will be, whether small or large. By enlarging the class
of functions considered one can possibly reduce the value of the minimal risk in the class, but
must incur a cost of weakening the guarantees which rely on restricted classes of functions.
Thus a trade-off exists between reducing the approximation error (i.e., the difference between
the optimal risk for all functions and the best function in our class) and reducing the estimation
error (reducing the error between our estimate and the best model possible within our class).
Chapter 7
As shown in Figure 7.1, machine learning problems can be represented as channel communication problems. Let each z_i = (x_i, y_i) and assume we have some function g(z) from
which the data are generated according to a process. Given the output of that process
(i.e., the training dataset), the job of a machine learning algorithm is to infer back the correct
ĝ approximation based on the training data. As should be obvious, unless we can place some
constraints on how the data are related to the original function (namely, about characteristics of
the communication channel), then reliable inference is impossible. These constraints may take
the form of i.i.d. assumptions on the data generation, or parametric assumptions on how noise
is added to labels, for example. Given some initial uncertainty regarding the true function g,
it is hoped that Z = {z1 , . . . , zn } can reduce that uncertainty and lead to the recovery of g (or
something close to it).
Given an output Z measurable in bits and a function g also representable using a finite num-
ber of bits, one can simultaneously look at the problem as a compression problem whenever
bits(Z) ≫ bits(g). We will explore this in the context of binary classification, examining the
link between generalization and compression and between compression and simplicity. It will
be shown that while compression leads to good generalization guarantees, it does not do so for
reasons that depend on simplicity in any intrinsic way, thus ruling out Occam’s razor (quoted
above) as a general explanation for why machine learning works.
Figure 7.1: Machine learning as a channel communication problem.
Proof. It will suffice for our purposes to give a simple proof sketch that is faithful to the proof
strategy followed by Littlestone and Warmuth. We have also slightly changed the wording of the
theorem (removing references to a sample ‘kernel’ and replacing it with the explanation of what
a kernel is in this context). The full proof is given in [49].
For any sample of size m, select k indices and extract the subsample residing at those indices,
labeling that set S. This leaves a disjoint set V of all other m − k samples. Using S as a
parameterization set, assume the learning algorithm outputs a hypothesis that is consistent with
all instances in V . Assuming the set V consists of independent samples which were not used
in the construction of the hypothesis, the probability that all m − k samples are consistent for a
hypothesis with true generalization error greater than ε is no greater than (1 − ε)^{m−k}. Considering the union of all possible ways we could have selected a subset of k samples, namely $\binom{m}{k}$, and taking a union bound produces the final result.
independent test set if it also has good expected loss under the true distribution (from which the
test set was drawn). Using a union bound to control for multiple hypothesis testing yields the
final result. By only using a small portion of the data to train the model and keeping the test
set V independent of the training set S, this ensures that empirical generalization performance
correlates tightly with true generalization performance. Requiring the compression scheme used
to have perfect empirical generalization performance ensures the agreement is with a small true
generalization error value.
What we see is that “compression” here refers to how much of the data is available for testing versus
training. It has nothing to do with complexity of the hypothesis class, once we condition on the
size of the class. By compressing the sample down to a small subsample of training data, we
are increasing the amount of testing data, which is what statistically powers the proof when
taken together with the small number of possible hypotheses and assumption that our model
class already contains at least one member capable of perfectly classifying all remaining training
examples. Thus, complexity and compression are only indirectly involved in the bound; what
is essential is having lots of independent testing data and assuming our class already contains a
good model, while controlling for multiple-hypothesis testing.
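The point can be seen numerically: the Littlestone-Warmuth-style failure probability C(m, k)(1 − ε)^{m−k} shrinks because the untouched test portion m − k grows, not because the hypothesis is simple. The sample sizes, compression-set size, and ε below are illustrative assumptions.

from math import comb

def compression_bound(m, k, eps):
    # probability that some choice of k compression indices yields a hypothesis
    # with true error > eps yet consistent with the other m - k samples
    return comb(m, k) * (1 - eps) ** (m - k)

for m in (100, 1000, 10000):
    print(m, compression_bound(m, k=10, eps=0.05))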
concepts in Ω and the proportion of concepts learnable to within ε zero-one loss is upper bounded by $(e/\varepsilon)^{\varepsilon m}\,2^{-b+1}$.
Proof. Let ε ∈ [0, 1] represent the fraction of errors made on classifying the instances of space X (as an enumerated list, not drawn from a distribution). Deterministic algorithms always map the same input to the same output, and since there are only $2^{m-b+1} - 1$ possible datasets of m − b or fewer bits, such an algorithm can map to at most $2^{m-b+1} - 1$ output elements (the elements being concepts in Ω). Each output element has a Hamming sphere of at most $\sum_{i=0}^{\varepsilon m}\binom{m}{i}$ nearby concepts with no more than ε proportion zero-one loss. Thus, given a collection of datasets each of fewer than m − b bits, algorithm A can achieve less than ε zero-one loss for no more than $(2^{m-b+1}-1)\sum_{i=0}^{\varepsilon m}\binom{m}{i}$ possible concepts in Ω, the first result. Using the Sauer-Shelah inequality, we can upper bound the binomial sum by $(e/\varepsilon)^{\varepsilon m}$. Dividing by the total number of concepts in Ω yields the second.
Although the first bound can be loose due to the possibility of overlapping Hamming spheres,
in some cases it is tight. Thus, without making additional assumptions on the algorithm it cannot
be generally improved upon. We examine one such special case next.
Example: Let m = 1000, b = 999 and ε = 0.001. Since m − b = 1, we consider only single bit datasets (and the empty dataset). Each dataset can map to at most one concept in Ω, which for our example algorithm A will map 0 to the all zeros concept and 1 to the all ones concept, mapping the empty dataset to the concept that alternates between zero and one (i.e., 01010 . . .). Given that 1000 × 0.001 = 1, the Hamming spheres for the concepts are all those that differ from the true concept by at most one bit, so that A can learn no more than 3,003 concepts in Ω (a proportion of roughly 2.8 × 10⁻²⁹⁸) to within ε accuracy using only single-bit datasets. Our second, looser bound puts the number at 10,873 total concepts, with a proportion of 1.01 × 10⁻²⁹⁷.
Example: Let m = 100,000, b = 18,080 and ε = 0.01. Thus m − b = 81,920 bits (10 kB). A deterministic classification algorithm A can learn no more than 4.7 × 10⁻³⁰⁰⁹ of all concepts in Ω to within ε zero-one loss if it is limited to datasets of size 10 kB.
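The counting in these examples can be re-checked with a few lines of exact arithmetic; this sketch simply re-evaluates the two bounds of Theorem 9 at the first example's values (m = 1000, b = 999, ε = 0.001), working in log10 to avoid the huge exponents.

from math import comb, e, log10

m, b, eps = 1000, 999, 0.001
d = int(eps * m)                                     # Hamming-sphere radius (here 1)

count_first = (2 ** (m - b + 1) - 1) * sum(comb(m, i) for i in range(d + 1))
print("first bound, learnable concepts:", count_first)                   # 3003
print("first bound, log10 proportion:  ", log10(count_first) - m * log10(2))

count_second = 2 ** (m - b + 1) * (e / eps) ** d
print("second bound, concepts:         ", round(count_second))           # ~10873
print("second bound, log10 proportion: ", d * log10(e / eps) - (b - 1) * log10(2))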
Neither of the bounds in Theorem 9 is very useful as one moves towards larger instance
spaces and larger dataset sizes, due to the inability of most calculators to compute large expo-
nents.
where Remp (f ) is the empirical risk of function f on dataset z1 , . . . , zn , R(f ) is the true risk, and
c(f ) is the code-length of the function under the prefix-free code.
Proof. Applying Hoeffding's inequality to each f with individual confidence level δ_f (chosen so that the δ_f sum to at most δ), we have
\[
\begin{aligned}
\Pr\left(\forall f\in\mathcal{F},\, |R_{\mathrm{emp}}(f) - R(f)| \le E_{n,f}\right) &= \Pr\left(\bigcap_{f\in\mathcal{F}}\{|R_{\mathrm{emp}}(f) - R(f)| \le E_{n,f}\}\right) \quad (7.1)\\
&= 1 - \Pr\left(\bigcup_{f\in\mathcal{F}}\{|R_{\mathrm{emp}}(f) - R(f)| > E_{n,f}\}\right) \quad (7.2)\\
&\ge 1 - \sum_{f\in\mathcal{F}}\Pr\left(|R_{\mathrm{emp}}(f) - R(f)| > E_{n,f}\right) \quad (7.3)\\
&\ge 1 - \sum_{f\in\mathcal{F}}\delta_f \quad (7.4)\\
&\ge 1 - \delta. \quad (7.5)
\end{aligned}
\]
The result in Theorem 10 shows that one can get good bounds on how far the empirical risk
deviates from the actual risk when using a family of functions drawn from a distribution and
encoded using a prefix-free code. When a family of functions is countably infinite, it requires
that most complex objects are given low probability under the distribution. We cannot assign
large probabilities (short code-lengths) to every complex structure, given the infinite number of
them. However, we can choose, if we wish, to assign large probabilities to every simple model,
since there are relatively few of them. While this superficially ties function complexity to good
generalization bounds, the next theorem will show that this is coincidental. “Complexity” is
a free-rider, and what really matters is how the items are assigned codewords. Because com-
plex objects outnumber simple ones, the encoding creates a correlation between complexity and
deviation tightness. However, restricting ourselves to an immensely large but finite family of
functions shows that the deviation tightness does not depend on complexity in any essential way.
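For concreteness, the deviation bound appearing in Theorems 10 and 11, sqrt(log(2^{c(f)+1}/δ)/(2n)), can be tabulated for a few code lengths; the sample size, δ, and code lengths below are assumed values, and the logarithm is taken to be natural.

from math import log, sqrt

def deviation_bound(code_len_bits, n, delta):
    # epsilon_{n,f} = sqrt( log(2^(c(f)+1) / delta) / (2n) )
    return sqrt(log(2 ** (code_len_bits + 1) / delta) / (2 * n))

n, delta = 10_000, 0.05
for c in (2, 10, 30):
    print(f"c(f) = {c:2d} bits -> deviation bound {deviation_bound(c, n, delta):.4f}")

Shorter codewords give tighter bounds regardless of what the codewords are attached to, which is exactly the point the next two theorems make.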
Theorem 11. For a finite class of functions f ∈ F under a prefix-free code where objects are assigned shorter codewords if they are more complex, and δ ∈ (0, 1],
\[
\Pr\left(\forall f\in\mathcal{F},\; |R_{\mathrm{emp}}(f) - R(f)| \le \sqrt{\frac{\log\!\left(2^{c(f)+1}/\delta\right)}{2n}}\right) \ge 1 - \delta,
\]
where $R_{\mathrm{emp}}(f)$ is the empirical risk of function f on dataset z₁, . . . , z_n, R(f) is the true risk, and c(f) is the code-length of the function under the prefix-free code.
Proof. The proof is identical to that of the previous theorem, since all steps apply equally to
finite classes of functions under a prefix-free code, regardless of the code.
In Theorem 11 we assume the prefix-free code is ranked in reverse order, such that the most
complex functions have the shortest codewords, and the simplest objects are assigned the longest
codewords. It follows that the complex functions achieve the best generalization bounds, demon-
strating that the “complexity” argument was superficial. While sufficient under a specific assign-
ment of codewords based on model simplicity, simplicity is not necessary for good generalization
behavior, even in the case of countably infinite sets of functions. We show this next.
Theorem 12. For any countably infinite class of functions f ∈ F, δ ∈ (0, 1], any definition of “short code,” and any ordering based on complexity (however defined), there exists a prefix-free code such that all short codes are assigned to complex functions,
\[
\Pr\left(\forall f\in\mathcal{F},\; |R_{\mathrm{emp}}(f) - R(f)| \le \sqrt{\frac{\log\!\left(2^{c(f)+1}/\delta\right)}{2n}}\right) \ge 1 - \delta,
\]
and the tightest bounds are not given by the simplest functions.
Proof. We will prove the existence of such a code by construction. For any countably infinite set
F, begin with a prefix-free code that assigns codeword lengths proportional to object complexity
under the user-provided complexity ordering. Order the functions of F in ascending order by
their code-lengths, breaking ties arbitrarily, and pick any N such that objects with order position
greater than N have codes that are not “short” (under the given definition of short code) and are
also complex (under the notion of complexity used in the user-defined ordering). Take the first
2N functions and reverse their order, such that the 2N th function has the shortest codeword, and
the simplest object is assigned the codeword formerly belonging to the 2N th object. Since all
short codewords fall within the first N positions and all objects that were formerly between N
and 2N are complex, in the new code all short codewords belong to complex objects.
Theorem 12 presents a case where tight ERM deviation bounds arise in spite of complexity of
the functions in their final prefix-free ordering, not because of it. At best, complexity constraints
can give rise to sufficient conditions for attaining tight generalization bounds, but as the previous
two results have shown, complexity notions are not necessary for such bounds. The notion
of simplicity and the notion of compression are somewhat orthogonal; this again suggests that
Occam-style ordering of models is not a general explanation for why machine learning works.
The next section provides additional evidence.
or equal to n. By the prime number theorem, there are approximately n/ ln(n) such
prime numbers. Suppose one of these rules correctly classifies all the data. The like-
lihood of it doing so if its actual error rate were q is (1 − q)n . By a union bound, the
probability that the actual error rate is q or more is n(1 − q)n / ln(n). Notice that no
matter how low an error rate I demand, I can achieve it with arbitrarily high confi-
dence by taking n sufficiently large. But, notice, most of the rules involved here have
enormous serial numbers, and so must be very long and complicated programs. That
doesn’t matter, because there still just aren’t many of them. In fact, this argument
would still work if I replaced the primes less than or equal to n with a random set,
provided the density of the random set was 1/ ln(n).
Mitchell also discusses the use of a form of Occam’s Razor in machine learning [54]. He
states:
One argument is that because there are fewer short hypotheses than long ones
(based on straightforward combinatorial arguments), it is less likely that one will
find a short hypothesis that coincidentally fits the training data. In contrast there
are often many very complex hypotheses that fit the current training data but fail
to generalize correctly to subsequent data. Consider decision tree hypotheses, for
example. There are many more 500-node decision trees than 5-node decision trees.
Given a small set of 20 training examples, we might expect to be able to find many
500-node decision trees consistent with these, whereas we would be more surprised
if a 5-node decision tree could perfectly fit this data. We might therefore believe the
5-node tree is less likely to be a statistical coincidence and prefer this hypothesis
over the 500-node hypothesis.
Upon closer examination, it turns out there is a major difficulty with the above
argument. By the same reasoning we could have argued that one should prefer de-
cision trees containing exactly 17 leaf nodes with 11 nonleaf nodes, that use the
decision attribute A1 at the root, and test attributes A2 through A11 , in numerical or-
der. There are relatively few such trees, and we might argue (by the same reasoning
as above) that our a priori chance of finding one consistent with an arbitrary set of
data is therefore small. The difficulty here is that there are very many small sets of
hypotheses that one can define, most of them rather arcane. Why should we believe
that the small set of hypotheses consisting of decision trees with short descriptions
should be any more relevant than the multitude of other small sets of hypotheses that
we might define?
Thus, his arguments anticipate Shalizi’s and are consistent with the counterexamples pre-
sented here.
Domingos’ critique of Occam’s razor arguments in machine learning [19] is equally devas-
tating. Discussing the role of PAC generalization guarantees in the context of i.i.d. datasets, he
notes:
For the present purposes, the results can be summarized thus. Suppose that the
generalization error of a hypothesis is greater than ε. Then the probability that the hypothesis is correct on m independent examples is smaller than (1 − ε)^m. If there
are |H| hypotheses in the hypothesis class considered by a learner, the probability
that at least one is correct on all m training examples is smaller than |H|(1 − )m ,
since the probability of a disjunction is smaller than the sum of the probabilities
of the disjuncts. Thus, if a model with zero training-set error is found within a
sufficiently small set of models, it is likely to have low generalization error. This
model, however, could be arbitrarily complex. The only connection of this result
to Occam’s razor is provided by the information-theoretic notion that, if a set of
models is small, its members can be distinguished by short codes. But this in no
way endorses, say, decision trees with fewer nodes over trees with many. By this
result, a decision tree with one million nodes extracted from a set of ten such trees
is preferable to one with ten nodes extracted from a set of a million, given the same
training-set error.
Put another way, the results in Blumer et al. [11] only say that if we select a
sufficiently small set of models prior to looking at the data, and by good fortune one
of those models closely agrees with the data, we can be confident that it will also do
well on future data. The theoretical results give no guidance as to how to select that
set of models.
This agrees with the observation that it is the small set sizes and assumption that the sets contain
at least one good hypothesis that does the heavy-lifting in such proofs; it is not the inherent
complexity of the members of the sets themselves.
Domingos defines two forms of Occam’s razor, one favoring simplicity as an end goal in
itself (the first razor), and the second favoring simpler models because of the belief it will lead
to lower generalization error. He argues that use of the first razor is justified, but the second is
problematic. After reviewing several empirical studies showing how complex models, including
ensemble methods, often outperform simpler models in practice, Domingos concludes, “All of
this evidence supports the conclusion that not only is the second razor not true in general; it is
also typically false in the types of domains [that knowledge discovery / data-mining] has been
applied to.”
Figure 7.2: Polynomial curve fitting. A simple, a complex and 3rd degree polynomial. Image
reproduced from [32].
than either of the other polynomials shown. For Grünwald, such an “intuition is confirmed by
numerous experiments on real-world data from a broad variety of sources [66, 67, 84].” Although
such examples and intuitions sound plausible in certain contexts (e.g., when a simple function is
corrupted by noise, leading to data that are more complex than the true underlying function), they
do not seem to be universally valid [19, 87]. In fact, our earlier theoretical results suggest that no
bias can be universally valid unless the universe has graciously constrained itself to producing a
select subset of possible problems.
Returning to the polynomial-fitting example, one can argue that the primary problem of the
high-degree curve is not that it is ‘complex,’ but that it places the curve in regions where no data
demands it. In other words, while it explains what is there (the data points), it also explains
what isn’t there (the regions where long stretches of line are not supported by any data point).
In contrast, the straightforward technique of connecting each neighboring data point by a line
segment leads to a jagged line perfectly fitting all points. Such a ‘curve’ loses smoothness prop-
erties and would be highly complex (requiring on the order of n parameters to define the curve
fitting n data points), but remains close to the true curve as long as the data points themselves
are representative of that curve. Thus, we have a highly-complex, flexible model (growing on
order n, the size of the data), that also intuitively will give good generalization performance. The
difference with our jagged model is that while it adequately explains what is there, it minimizes
the amount of curve unsupported by observations, thus not explaining what isn’t there. This is
closer in spirit to E. T. Jaynes' maximum entropy principle [43], which seeks to avoid (as much
as possible) making assumptions beyond what is supported by the data. While we must always
make assumptions beyond what is seen in the data in order to generalize (see Theorem 7, for
example), we can be careful to not make more assumptions than are actually necessary.
where θ̂M (D) identifies the model within class M maximizing the likelihood of the data, and
COMP|D| (M) measures the ‘richness’ and ‘flexibility’ (i.e., complexity) of the model class M.
MDL seeks to manage the trade-off between fit and complexity via minimizing the quantity
Q(D; M). Part of the historical challenge of MDL theory has been how to define a code that
measures the model complexity COMP|D| (M) in a non-arbitrary way, and Grünwald [32] gives
examples of codes formulated to do so in a principled and sensible way, such as by using univer-
sal codes that minimize minimax regret. It has been shown that in certain contexts MDL methods
are statistically consistent [97], although not being formulated with that goal explicitly in mind.
Third, smoothness and regularity imply both compressibility and dependence, where depen-
dence exists between nearby points in space or time. When a function is smooth it means that
if we know something about the value of a point at x we also know something about the likely
values of points at x + ε that are nearby. Thus, smoothness induces exploitable dependence for
learning. Simple functions are relatively easy to learn, and since we must begin with some bias,
a bias towards simplicity (and learnability) is not a bad bias to start with. It may be demonstrably
wrong, but so will every other bias in certain situations.
Fourth, in situations where noise is present, the noise will typically cause the data to become
more complex than the generating signal, which justifies regularized methods and biasing them
towards models that are simpler than the data themselves appear. When our assumptions and
biases align to what holds in the actual world, our methods perform well in practice. It follows
that in noisy situations a bias towards simplicity is reasonable and can lead to good empirical
performance (though not always – see [70] and accompanying discussion in Section 2.4.2).
Lastly, in the small data case where there are few data samples, it may be better to prefer sim-
ple models (with fewer parameters) over more complex ones, since there will typically be more
data points per parameter when fitting the simpler model. Because of limitations on statistical
power, a bias towards simpler models becomes justified in such situations.
Part III
Setting out to answer the question “Why does machine learning work?,” we learned that we could reduce machine learning to a form of search and investigate the dual question of what makes searches successful. Our theoretical results showed us that for any fixed
algorithm, relatively few possible problems are greatly favorable for the algorithm. The subset of
problems which are favorable require a degree of dependence between what is observed and what
is not, between labels and features, datasets and concepts. Without strong dependence between
the target set and the external information resource, the probability of a successful search remains
negligible. Successful learning thus becomes a search for exploitable dependence. We can use
this view of machine learning to uncover exploitable structure in problem domains and inspire
new learning algorithms.
In the chapters that follow, we will apply our dependence-first view of learning to two real-
world problem areas, showing how dependence can lead to greater-than-chance learning perfor-
mance. This confirms the practical utility of our framework for machine learning and applied
research. It should be noted, however, that the algorithms developed are not derived explic-
itly from the formal theorems proven, but are instead guided by the “dependence-first” view of
learning that naturally results from considering machine learning within our framework.
We begin by exploring a set of simple examples showing direct application of our formal
results and their consequences for a few problem domains, then explore two application areas
in depth. These problem areas are unsupervised time-series segmentation and hyperparameter
learning. For the first task, we leverage temporal dependence in our time series data to achieve
improved segmentation results, and for the second task, we leverage spatial dependence. Both
types of dependence flow from smoothness and regularity within the problem domain, a common
source of exploitable structure within machine learning.
Chapter 8
Examples¹
Imagine a group planning an attack on a specified landmark within a city. Due to the complexity of the attack and methods of infil-
tration, the group is forced to construct a plan relying on the coordinated actions of several
interdependent agents, of which the failure of any one would cause the collapse of the entire plot.
As a member of the city’s security team, you must allocate finite resources to protect the many
important locations within the city. Although you know the attack is imminent, your sources
have not indicated which location, of the hundreds possible, will be hit; your lack of manpower
forces you to make assumptions about target likelihoods. You know you can foil the plot if you
stop even one enemy agent. Because of this, you seek to maximize the odds of capturing an
agent by placing vigilant security forces at the strategic locations throughout the city. Allocating
more security to a given location increases surveillance there, raising the probability a conspira-
tor will be found if operating nearby. Unsure of your decisions, you allocate based on your best
information, but continue to second-guess yourself.
With this picture in mind, we can analyze the scenario through the lens of algorithmic search.
Our external information resource is the pertinent intelligence data, mined through surveillance.
We begin with background knowledge (represented in F (∅)), used to make the primary security
force placements. Team members on the ground communicate updates back to central head-
quarters, such as suspicious activity, which are the F (ω) evaluations used to update the internal
information state. Each resource allocated is a query, and manpower constraints limit the num-
ber of queries available. Doing more with fewer officers is better, so the hope is to maximize the
per-officer probability of stopping the attack.
Our results tell us a few things. First, a fixed strategy can only work well in a limited number
of situations. There is little or no hope of a problem being a good match for your strategy if the
problem arises independently of it (Theorem 2). So reliable intelligence becomes key. The better
correlated the intelligence reports are with the actual plot, the better a strategy can perform (The-
orem 4). However, even for a fixed search problem with reliable external information resource
¹This chapter reproduces content from Montañez, “The Famine of Forte: Few Search Problems Greatly Favor Your Algorithm” (arXiv 2016).
there is no guarantee of success, if the strategy is chosen poorly; the proportion of good strate-
gies for a fixed problem is no better than the proportion of good problems for a fixed algorithm
(Theorem 3). Thus, domain knowledge is crucial in choosing either. Without side-information to
guide the match between search strategy and search problem, the expected probability of success
is dismal in target-sparse situations.
\[
|\Omega|\,\frac{p}{q_{\min}} = |\Omega|\,\frac{|T|/|\Omega|}{q_{\min}} = |\Omega|\,\frac{1/|\Omega|}{q_{\min}} = \frac{1}{q_{\min}}.
\]
Because this expression is independent of the size of the search space, the number of elements
for which a fitness function can strongly raise the probability of success remains fixed even as the
size of the search space increases. Thus, for very large search spaces the proportion of favored
locations effectively vanishes. There can exist no single fitness function that is strongly favorable
for many elements simultaneously, and thus no “one-size-fits-all” fitness function.
Figure 8.1: Possible axis-aligned target set examples.
The space of possible binary concepts is Ω, with the true concept being an element in that
space. In our example, let $|\Omega| = 2^{100}$. The target set consists of the set of all concepts in
that space that (1) are consistent with the training data (which we will assume all are), and
(2) differ from the truth in at most 10% of positions on the generalization held-out dataset. Thus, $|T| = \sum_{i=0}^{10}\binom{100}{i}$. Let us assume the marginal distribution on T is uniform, which isn't
necessary but simplifies the calculation. The external information resource F is the set of training
examples. The algorithm uses the training examples (given by F (∅)) to produce a distribution
over the space of concepts; for deterministic algorithms, this is a degenerate distribution on
exactly one element. A single query is then taken (i.e., a concept is output), and we assess the
probability of success for the single query. The chance of outputting a concept with at least 90%
generalization accuracy is thus no greater than $\frac{I(T;F)+1}{I_\Omega} \approx \frac{I(T;F)}{I_\Omega} \le \frac{I(T;F)}{59}$. The denominator
is the information cost of specifying at least one element of the target set and the numerator
represents the information resources available for doing so. When the mutual information meets
(or exceeds) that cost, success can be ensured for any algorithm perfectly mining the available
mutual information. When noise reduces the mutual information below the information cost, the
probability of success becomes strictly bounded in proportion to that ratio.
guide it), which should give a probability of success equal to 8/64 = 0.125. We will see that
Equation 5.43 gives precisely this answer.
Under our assumptions, we compute the remaining components:
• $I_\Omega = -\log_2(8/64) = 3$;
• $H(T) = 4$ (since it takes one bit to choose between rows or columns, and three bits to specify the row or column);
• $D(P_T\|U_T) = \log_2\binom{64}{8} - H(T) = 32.04 - 4 = 28.04$;
• $H(Z|X) = -\frac{1}{8}\log_2\frac{1}{8} - \frac{7}{8}\log_2\frac{7}{8} = 0.5436$;
• $\mathbb{E}_X\!\left[D(P_{T|Z=0,X}\|U_{T|Z=0,X})\right] = \log_2\binom{63}{8} - \log_2 14 = 28.04$;
• $\mathbb{E}_X\!\left[D(P_{T|Z=1,X}\|U_{T|Z=1,X})\right] = \log_2\binom{63}{7} - 1 = 28.04$ (since $H(T|Z=1,X) = 1$, because we have two options if we know an element of T: either it is on the row or the column of X); and
• $C_r = \log_2(7/8) - 28.04 = -28.233$.
Thus, plugging these values into Equation 5.43, we get a success probability of exactly 0.125, precisely as anticipated.
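The quantities listed above can be re-derived mechanically. The sketch below recomputes them under the stated assumptions (an 8 × 8 grid, a target consisting of one full row or column, and a uniform marginal over the 16 possible targets).

from math import comb, log2

n, k, n_targets = 64, 8, 16
I_Omega = -log2(k / n)                                    # 3.0
H_T = log2(n_targets)                                     # 4.0
D_T = log2(comb(n, k)) - H_T                              # ~28.04
H_Z_given_X = -(1/8) * log2(1/8) - (7/8) * log2(7/8)      # ~0.5436
D_Z0 = log2(comb(n - 1, k)) - log2(14)                    # ~28.04
D_Z1 = log2(comb(n - 1, k - 1)) - 1                       # ~28.04
C_r = log2(1 - k / n) - D_Z0                              # ~-28.23

for name, val in [("I_Omega", I_Omega), ("H(T)", H_T), ("D(P_T||U_T)", D_T),
                  ("H(Z|X)", H_Z_given_X), ("E_X D(.|Z=0)", D_Z0),
                  ("E_X D(.|Z=1)", D_Z1), ("C_r", C_r)]:
    print(f"{name:13s} {val:8.4f}")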
If we want to upper bound this quantity, we can use the fact that H(Z|X) ≤ 1 and note that
the ratio is monotonically increasing in H(T |Z = 0, X), for which we have H(T |Z = 0, X) ≤
H(T ) = 6. It follows that,
\[
\Pr(X \in T; A) \le \frac{I(T;F) + 1}{6}. \tag{8.7}
\]
When I(T ; F ) = 2, then the probability of success can be no greater than 0.5; when I(T ; F ) =
0.1, the maximum drops to 0.183. This upper bound coincides with that given in Theorem 4,
which is much easier to use. An exact form simpler to use than Equation 5.43 is suggested by
Equation 5.50,
• $H(T|Z=1,X) = 0$; and
• $H(T|Z=0,X) = \frac{1}{4} - \frac{\varepsilon}{8}$.
The computation for H(T |Z = 0, X) can be seen as follows. First, we compute P (X) by
marginalizing over the Z values and computing P (X|Z)P (Z). The conditional probability will
change depending on which row X is on, given our search strategy. Taking this into account, we
find that P(X) equals 1/4 − ε/8 if on the second row, ε/8 if on the last row, and 1/8 otherwise.
Next, we notice that for all rows other than the second, knowing X and that Z = 0 allows you to
uniquely pinpoint T , reducing its conditional entropy to zero. For X elements chosen from the
second row, there are two equally weighted possibilities for T , either above or below, thus giving
one bit of uncertainty. Multiply this by the marginal probability of choosing an element in the
second row, 1/4 − ε/8, and we obtain that value.
We need to also compute $I_L$ in this case. Using the chain rule for entropy two ways on H(T, Z|X), and remembering that H(Z|T, X) = 0 by definition, that $Z := \mathbb{1}_{X\in T}$, and that P(Z = 1) = ε, we obtain the needed value of $I_L$. Thus,
\[
\begin{aligned}
\Pr(X\in T; A) &= \frac{I(T;F) - I_L + D(P_T\|U_T) + H(Z|X) + C_r}{I_\Omega + \mathbb{E}_X\!\left[D(P_{T|Z=1,X}\|U_{T|Z=1,X})\right] + C_r} \quad (8.15)\\
&= \frac{6 - (1-\varepsilon)H(T|Z=0,X) + 0 + H(T|Z=0,X) - 6}{6 + 0 + H(T|Z=0,X) - 6} \quad (8.16)\\
&= \frac{\varepsilon\, H(T|Z=0,X)}{H(T|Z=0,X)} \quad (8.17)\\
&= \varepsilon. \quad (8.18)
\end{aligned}
\]
Chapter 9
Time series have become ubiquitous in many areas of science and technology. Sophisticated methods are often required to model their dynamical behavior. Furthermore, their
dynamical behavior can change over time in systematic ways. For example, in the case
of multivariate accelerometer data taken from human subjects, one set of dynamical behaviors
may hold while the subject is walking, only to change to a new regime once the subject begins
running. Gaining knowledge of these change points can help in the modeling task, since given
a segmentation of the time series, one can learn a more precise model for each regime. Change
point detection algorithms [45, 50, 65, 96] have been proposed to determine the location of sys-
tematic changes in the time series, while Hidden Markov Models (HMM) can both determine
where regime (state) changes occur, and also model the behavior of each state.
One crucial observation in many real-world systems, natural and man-made, is that behavior
changes are typically infrequent; that is, the system takes some (typically unknown) time before
it changes its behavior to that of a new state. In our earlier example, it would be unlikely for a
person to rapidly switch between walking and running, making the durations of different activ-
ities over time relatively long and highly variable. We refer to this as the inertial property, in
reference to a physical property of matter that ensures it continues along a fixed course unless
acted upon by an external force. Unfortunately, classical HMMs trained to maximize the likeli-
hood of data can result in abnormally high rates of state transitioning, as this often increases the
likelihood of the data under the model, leading to false positives when detecting change points,
as is seen in Figure 9.1.
The inertial property of real-world time series represents a potentially exploitable source of
dependence: temporal consistency. Since states are not likely to rapidly fluctuate, this increases
dependence between neighboring points, since knowledge of current state reveals the likely next
state, thus reducing uncertainty for the next point. In this thesis, we seek to operationalize this
insight to produce a system capable of exploiting temporal consistency. We do so by introducing
temporal regularization for HMMs, forming Inertial Hidden Markov Models [58]. These models
are able to successfully segment multivariate time series from real-world accelerometer data
and synthetic datasets, in an unsupervised manner, improving on state-of-the-art Bayesian nonparametric sticky hierarchical Dirichlet process hidden Markov models (HDP-HMMs) [27, 89] as well as classical HMMs. This application demonstrates the practical utility of the dependence-first view of learning inspired by our search framework.

¹This chapter reproduces content from Montañez et al., “Inertial Hidden Markov Models: Modeling Change in Multivariate Time Series” (AAAI-2015).

[Figure 9.1: an example segmentation comparing predicted (“Pred.”) and true (“True”) hidden state sequences over three hidden states.]
9.2 Maximum A Posteriori (MAP) Regularized HMM
Following [30], we alter the standard HMM to include a Dirichlet prior on the transition prob-
ability matrix, such that transitions out-of-state are penalized by some regularization factor. A
Dirichlet prior on the transition matrix A, for the jth row, has the form
p(A_j; η) ∝ \prod_{k=1}^{K} A_{jk}^{η_{jk} − 1}
where the ηjk are free parameters and Ajk is the transition probability from state j to state k. The
posterior joint density over X and Z becomes
"K K #
η −1
YY
P (X, Z; θ, η) ∝ Ajkjk P (X, Z | A; θ)
j=1 k=1
K X
X K
`(X, Z; θ, η) ∝ (ηjk − 1) log Ajk + log P (z1 ; θ)
j=1 k=1
T
X T
X
+ log P (xt |zt ; θ) + log P (zt |zt−1 ; θ).
t=1 t=2
MAP estimation is then used in the M-step of the expectation maximization (EM) algorithm
to update the transition probability matrix. Maximizing, with appropriate Lagrange multiplier
constraints, we obtain the update equation for the transition matrix,
A_{jk} = \frac{(η_{jk} − 1) + \sum_{t=2}^{T} ξ(z_{(t−1)j}, z_{tk})}{\sum_{i=1}^{K} (η_{ji} − 1) + \sum_{i=1}^{K} \sum_{t=2}^{T} ξ(z_{(t−1)j}, z_{ti})}.   (9.1)
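As a concrete illustration of Equation (9.1) (a minimal sketch, not the implementation used in our experiments; the function and variable names are purely illustrative), the update adds Dirichlet pseudo-counts to the expected transition counts from the E-step and renormalizes each row:

```python
import numpy as np

def map_transition_update(xi, eta):
    """MAP M-step update for the transition matrix, following Equation (9.1).

    xi  : (K, K) array; xi[j, k] is the summed expected count of j -> k
          transitions, sum_t xi(z_{(t-1)j}, z_{tk}), from the E-step.
    eta : (K, K) array of Dirichlet prior parameters; eta[j, j] > 1
          rewards self-transitions (the inertial prior).
    """
    num = (eta - 1.0) + xi                       # per-entry numerators of (9.1)
    return num / num.sum(axis=1, keepdims=True)  # denominator of (9.1) is the row sum

# Hypothetical example: 3 states, strong self-transition prior.
K = 3
xi = np.random.rand(K, K) * 100                  # stand-in expected counts
eta = np.ones((K, K)) + 500.0 * np.eye(K)        # eta_jj = 501, eta_jk = 1 otherwise
A = map_transition_update(xi, eta)
assert np.allclose(A.sum(axis=1), 1.0)           # each row is a valid distribution
```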
9.3 Inertial Regularization via Pseudo-observations
Alternatively, we can alter the HMM likelihood function to include a latent binary random vari-
able, V , indicating that a self-transition was chosen at random from among all transitions, ac-
cording to some distribution. Thus, we view the transitions as being partitioned into two sets,
self-transitions and non-self-transitions, and we draw a member of the self-transition set accord-
ing to a Bernoulli distribution governed by parameter p. Given a latent state sequence Z, with
transitions chosen according to transition matrix A, we define p as a function P of both Z and A.
We would like p to have two properties: 1) it should increase with increasing \sum_k A_{kk} (probability
of self-transitions) and 2) it should increase as the number of self-transitions in Z increases. This
will allow us to encourage self-transitions as a simple consequence of maximizing the likelihood
of our observations.
We begin with a version of p based on a penalization constant 0 < ε < 1 that scales appropriately with the number of self-transitions. If we raise ε to a large positive power, the resulting p will decrease. Thus, we define p as ε raised to the number of non-self-transitions, M, in the state transition sequence, so that the probability of selecting a self-transition increases as M decreases. Using the fact that M = (T − 1) − \sum_{t=2}^{T} \sum_{k=1}^{K} z_{(t−1)k} z_{tk}, we obtain

p = ε^{M} = ε^{\sum_{t=2}^{T} 1 − \sum_{t=2}^{T} \sum_{k=1}^{K} z_{(t−1)k} z_{tk}}
  = ε^{\sum_{t=2}^{T} \sum_{k=1}^{K} z_{(t−1)k} − \sum_{t=2}^{T} \sum_{k=1}^{K} z_{(t−1)k} z_{tk}}
  = \prod_{t=2}^{T} \prod_{k=1}^{K} ε^{z_{(t−1)k} − z_{(t−1)k} z_{tk}}.   (9.3)

Since ε is arbitrary, we choose ε = A_{kk}, to allow p to scale appropriately with increasing probability of self-transition. We therefore arrive at

p = \prod_{t=2}^{T} \prod_{k=1}^{K} A_{kk}^{z_{(t−1)k} − z_{(t−1)k} z_{tk}}.
We then treat V as observed λ times with value one; that is, we condition on the pseudo-observation sequence V = 1, where 1 denotes the all-ones sequence of length λ, which multiplies the joint likelihood by a factor of p^λ.
Noting that V is conditionally independent of X given the latent state sequence Z, we max-
imize (with respect to Ajk ) the expected (with respect to Z) joint log-density over X, V, and Z
parameterized by θ = {π, A, φ}, which are the start-state probabilities, state transition matrix
and emission parameters, respectively. Using appropriate Lagrange multipliers, we obtain the
regularized maximum likelihood estimate for A_{jk}.
The forward-backward algorithm can then be used for efficient computation of the γ and ξ values,
as in standard HMMs [10].
Ignoring normalization, we see that
A_{jk} ∝ \begin{cases} B_{j,k,T} + C_{j,j,T} & \text{if } j = k, \\ B_{j,k,T} & \text{otherwise.} \end{cases}
Examining the Cj,j,T term (i.e., Equation (9.5)), we see that λ is a multiplier of additional mass
contributions for self-transitions, where the contributions are the difference between γ(z(t−1)j )
and ξ(z(t−1)j , ztj ). These two quantities represent, respectively, the expectation of being in a state
j at time t − 1 and the expectation of remaining there in the next time step. The larger λ or the
larger the difference between arriving at a state and remaining there, the greater the additional
mass given to self-transition.
Figure 9.2: Human activities accelerometer data, short sequence. Vertical partitions correspond
to changes of state.
Figure 9.3: The long-sequence human activities accelerometer data, segmented using the regularization parameter selected from the short sequence.
We desire models where the regularization strength is scale-free, having roughly the same
strength regardless of how the time series grows. To achieve this, we define the λ parameter to
scale with the number of transitions, namely λ = (T − 1)ζ , and our scale-free update equation
becomes
A_{jk} = \frac{((T − 1)ζ − 1) 1(j = k) + \sum_{t=2}^{T} ξ(z_{(t−1)j}, z_{tk})}{((T − 1)ζ − 1) + \sum_{i=1}^{K} \sum_{t=2}^{T} ξ(z_{(t−1)j}, z_{ti})}.   (9.6)
This preserves the effect of regularization as T increases, and ζ becomes our new regularization
parameter, controlling the strength of the regularization. For consistency, we also re-parameterize
Equation (9.5) using λ = (T − 1)ζ .
The Gini ratio [31] is a measure of statistical dispersion often used to quantify income inequality. For a collection of observed segment lengths
L = {l1 , . . . , lm }, given in ascending order, the Gini ratio is estimated by
G(L) = 1 − \frac{2}{m − 1}\left(m − \frac{\sum_{i=1}^{m} i\, l_i}{\sum_{i=1}^{m} l_i}\right).
We assume that the true segmentation has a Gini ratio less than one-half, which corresponds to
having more equality among segment lengths than not. One can perform a binary search on the
search interval to find the smallest ζ parameter for which the Gini ratio is at least one-half. This
increases the time complexity by a factor of O(\log_2(R/ε)), where R is the range of the parameter space and ε is the stopping precision for the binary search.
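A small sketch of this selection heuristic (illustrative only; the helper names and the bisection wrapper are ours, and we assume the Gini-based criterion behaves monotonically in ζ over the searched interval):

```python
import numpy as np

def gini_ratio(lengths):
    """Sample Gini ratio of segment lengths (ascending order, m >= 2), per the
    estimator above: 1 - 2/(m-1) * (m - sum(i * l_i) / sum(l_i))."""
    l = np.sort(np.asarray(lengths, dtype=float))
    m = len(l)
    i = np.arange(1, m + 1)
    return 1.0 - (2.0 / (m - 1)) * (m - (i * l).sum() / l.sum())

def bisect_zeta(meets_criterion, lo, hi, eps=0.01):
    """Binary search for the smallest zeta in [lo, hi] whose segmentation
    satisfies the (assumed monotone) Gini-based criterion; this adds
    O(log2((hi - lo) / eps)) extra segmentation runs."""
    while hi - lo > eps:
        mid = 0.5 * (lo + hi)
        if meets_criterion(mid):
            hi = mid
        else:
            lo = mid
    return hi

assert abs(gini_ratio([5000, 5000])) < 1e-12        # perfectly equal segments
assert abs(gini_ratio([1, 9999]) - 1.0) < 1e-3      # nearly all mass in one segment
```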
9.4 Experiments
We perform two segmentation tasks on synthetic and real multivariate time series data, using our
scale- and parameter-free regularized inertial HMMs. For comparison, we present the results of
applying a standard K-state hidden Markov model as well as the sticky HDP-HMM of [27]. We
performed all tasks in an unsupervised manner, with state labels being used only for evaluation.
9.4.1 Datasets
The first (synthetic) multivariate dataset was generated using a two-state HMM with 3D Gaussian
emissions, with transition matrix
A = \begin{pmatrix} 0.9995 & 0.0005 \\ 0.0005 & 0.9995 \end{pmatrix},
equal start probabilities and emission parameters µ1 = (−1, −1, −1)> , µ2 = (1, 1, 1)> , Σ1 =
Σ2 = diag(3). Using this model, we generated one hundred time series consisting of ten-
thousand time points each. Figure 9.4 shows an example time series from this synthetic dataset.
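A sketch of this generator follows (assuming diag(3) denotes a 3 × 3 diagonal covariance with 3 on the diagonal; the function name and seed are ours):

```python
import numpy as np

def sample_synthetic_hmm(T=10_000, seed=0):
    """Draw one sequence from the two-state HMM described above: strong
    self-transitions (0.9995), 3D Gaussian emissions with means -(1,1,1)
    and +(1,1,1), shared diagonal covariance, equal start probabilities."""
    rng = np.random.default_rng(seed)
    A = np.array([[0.9995, 0.0005],
                  [0.0005, 0.9995]])
    mu = np.array([[-1.0, -1.0, -1.0],
                   [ 1.0,  1.0,  1.0]])
    cov = np.diag([3.0, 3.0, 3.0])
    states = np.empty(T, dtype=int)
    states[0] = rng.integers(2)                       # equal start probabilities
    for t in range(1, T):
        states[t] = rng.choice(2, p=A[states[t - 1]])
    obs = np.array([rng.multivariate_normal(mu[s], cov) for s in states])
    return states, obs
```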
The second dataset was generated from real-world forty-five dimensional human accelerom-
eter data, recorded for users performing five different activities, namely, playing basketball, row-
ing, jumping, ascending stairs and walking in a parking lot [2]. The data were recorded from
a single subject using five Xsens MTxTM units attached to the torso, arms and legs. Each unit
had nine sensors, which recorded accelerometer (X, Y, Z) data, gyroscope (X, Y, Z) data and
magnetometer (X, Y, Z) data, for a total of forty-five signals at each time point.
We generated one hundred multivariate time series from the underlying dataset, with varying
activities (latent states) and varying number of segments. To generate these sets, we first uni-
formly chose the number of segments, between two and twenty. Then, for each segment, we
chose an activity uniformly at random from among the five possible, and selected a uniformly
random segment length proportion. The selected number of corresponding time points were ex-
tracted from the activity, rescaled to zero mean and unit variance, and appended to the output
sequence. The final output sequence was truncated to ten thousand time points, or discarded
if the sequence contained fewer than ten thousand points or fewer than two distinct activities.
Additionally, prospective time series were rejected if they caused numerical instability issues for
the algorithms tested. The process was repeated to generate one hundred such multivariate time
series of ten thousand time ticks each, with varying number of segments, activities and segment
lengths. An example data sequence is shown in Figure 9.5 and the distribution of the time series
according to number of activities and segments is shown in Figure 9.6.
Figure 9.4: Synthetic data example. Generated from two-state HMM with 3D Gaussian emis-
sions and strong self-transitions.
[Figure 9.5: an example generated human-activity accelerometer sequence, shown as its signal dimensions over time with the corresponding state sequence.]
[Figure 9.6: distribution of the generated time series by number of segments (horizontal axis) and number of activities/classes (vertical axis).]
where St is the true number of segments in the sequence and Sp is the predicted number of
segments, and quantifies how much a segmentation method diverges from the ground truth in
terms of relative factor of segments. Lastly, we tracked the number of segments difference (SND)
between the predicted segmentation and the true segmentation, and how many segmentations were done
perfectly (Per.), giving the correct states at all correct positions.
Parameter selection for the inertial HMM methods was done using the automated parameter
selection procedure described in the Parameter Modifications section. For faster evaluation, we
ran the automated parameter selection process on ten randomly drawn examples, averaged the
final ζ parameter value, and used the fixed value for all trials. The final ζ parameters are shown
in Tables 9.1 and 9.2.
To evaluate the sticky HDP-HMM, we used the publicly available HDP-HMM toolbox for
MATLAB, with default settings for the priors [26]. The Gaussian emission model with normal
inverse Wishart (NIW) prior was used, and the truncation level L for each example was set to
the true number of states, in fairness for comparing with the HMM methods developed here,
which are also given the true number of states. The “stickiness” κ parameter was chosen in a
data-driven manner by testing values of κ = 0.001, 0.01, 0.1, 1, 5, 10, 50, 100, 250, 500, 750 and
1000 for best performance over ten randomly selected examples each. The mean performance of
the 500th Gibbs sample of ten trials was then taken for each parameter setting, and the best κ was
empirically chosen. For the synthetic dataset, a final value of κ = 10 was chosen by this method.
For the real human accelerometer data, a value of κ = 100 provided the best accuracy and
relatively strong variation of information performance. These values were used for evaluation on
each entire dataset, respectively.
To evaluate the HDP-HMM, we performed five trials on each example in the test dataset,
measuring performance of the 1000th Gibbs sample for each trial. The mean performance was
then computed for the trials, and the average of all one hundred test examples was recorded.
Figure 9.7: Example segmentation of human activities accelerometer data using inertial (MAP)
HMM. Only first dimension shown.
Table 9.1: Results from quantitative evaluation on 3D synthetic data. Statistical significance is
computed with respect to MAP results.
Method Acc. SNR ASNR SND VOI Per.
HDP-HMM (κ = 10) 0.85* 0.59* 3.50* 2.79* 0.56* 0/100
Standard HMM 0.87* 172.20* 172.20* 765.91* 0.62* 0/100
MAP HMM (ζ = 2.3) 0.99 0.96 1.13 0.51 0.07 2/100
PsO HMM (ζ = 8.2) 0.99 0.87‡ 1.43‡ 1.15* 0.14† 1/100
Acc. = Average Accuracy (value of 1.0 is best)
SNR = Average Segment Number Ratio (value of 1.0 is best)
ASNR = Average Absolute Segment Number Ratio (value of 1.0 is best)
SND = Average Segment Number Difference (value of 0.0 is best)
VOI = Average Normalized Variation of Information (value of 0.0 is best)
Per. = Total number of perfect/correct segmentations
paired t-test: † < α = .05, ‡ < α = .01, * < α = .001
On the synthetic dataset, the standard HMM produced strong over-segmentation of the data (as reflected in the high SNR, ASNR, and SND scores), while the
sticky HDP-HMM tended to under-segment the data. All methods were able to achieve fairly
high accuracy.
Results from the human accelerometer dataset are shown in Table 9.2. Both the MAP HMM
and inertial pseudo-observation HMM achieved large gains in performance over the standard
HMM model, with average accuracy of 94%. Furthermore, the number of segments was close to
correct on average, with a value near one in both the absolute (ASNR) and simple (SNR) ratio
case. The average normalized variation of information (VOI) was low for both the MAP and
pseudo-observation methods. Figure 9.7 shows an example segmentation for the MAP HMM,
displaying a single dimension of the multivariate time series for clarity.
In comparison, the standard hidden Markov model performed poorly, strongly over-segmenting
the sequences in many cases. Even more striking was the improvement over the sticky HDP-
HMM, which had an average normalized variation of information near 1 (i.e., no correlation
between the predicted and the true segment labels). The method tended to under-segment the
data, often collapsing to a single uniform output state, reflected in the SNR having a value below
one, and may struggle with moderate dimensional data, as related by Fox and Sudderth through
private correspondence. Moreover, the poor performance on this dataset likely results from a
strong dependence on Bayesian tuning parameters. The sticky HDP-HMM suffers from slow
mixing rates as the dimensionality increases, and computation time explodes, being roughly cu-
bic in the dimension. As a result, the one hundred test examples took several days of computation
time to complete, whereas the inertial HMM methods took a few hours.
9.5 Discussion
Our results demonstrate the effectiveness of inertial regularization on HMMs for behavior change
modeling in multivariate time series. Although derived in two independent ways, the MAP
regularized and pseudo-observation inertial regularized HMM converge on a similar maximum
likelihood update equation, and thus, had similar performance.
The human activity task highlighted an issue with using standard HMMs for segmentation of
time series with infrequent state changes, namely, over-segmentation. Incorporating regulariza-
tion for state transitions provides a simple solution to this problem. Since our methods rely on
changing a single update equation for a standard HMM learning method, they can be easily in-
corporated into HMM learning libraries with minimal effort. This ease-of-implementation gives
a strong advantage over existing persistent-state HMM methods, such as the sticky HDP-HMM
framework.
While the sticky HDP-HMM performed moderately well on the low-dimensional synthetic
dataset, the default parameters produced poor performance on the real-world accelerometer data.
It remains possible that different settings of hyperparameters may improve performance, but the
cost of a combinatorial search through hyperparameter space combined with lengthy computa-
tion time prohibits an exhaustive exploration. The results, at minimum, show a strong depen-
dence on hyperparameter settings for acceptable performance. In contrast, the inertial HMM
methods make use of a simple heuristic for automatically selecting the strength parameter ζ,
which resulted in excellent performance on both datasets without the need for hand-tuning sev-
eral hyperparameters. Although the sticky HDP-HMM has poor performance on the two seg-
mentation tasks, there exist tasks for which it may be a better choice (e.g., when the correct
number of states is unknown).
Chapter 10
The rapid expansion of machine learning methods in recent decades has created a common
question for non-expert end users: “What hyperparameter settings should I use for my
algorithm?” Even the simplest algorithms often require the tuning of one or more hyperparameters, which can have significant effects on performance [7, 14]. However, knowing which
settings to use for which dataset and algorithm has remained something of an “art,” relying on the
implicit knowledge of practitioners in the field. Because employing machine learning methods
should not require a PhD in data science, researchers have recently begun to investigate auto-
mated hyperparameter tuning methods, with marked success ([7, 8, 9, 25, 39, 76, 80, 86]). These
methods have been successfully applied to a wide range of problems, including neural networks
and deep belief network hyperparameter tuning [8, 9], thus showing promise for automatically
tuning the large numbers of hyperparameters required by deep learning architectures. We extend
this body of research by improving on the state-of-the-art in automated hyperparameter tuning,
guided by our search view of machine learning. Because dependence makes searches success-
ful, we begin by first investigating which dependencies hold and then appropriately biasing our
search procedure to exploit the dependencies found, giving improved performance in our original
learning problem.
We introduce a new sequential model-based, gradient-free optimization algorithm, Kernel
Density Optimization (KDO), which biases the search in two ways. First, it assumes strong
spatial consistency of the search space, such that nearby points in the space have similar function
values, and second, it assumes that sets of randomly chosen points from the space will have
function evaluations that follow a roughly unimodal, approximately Gaussian distribution. These
assumptions hold for several real-world hyperparameter optimization problems on UCI datasets
using three different learning methods (gradient boosted trees, regularized logistic regression,
and averaged perceptrons), allowing our method to significantly improve on the state-of-the-art
SMAC method. Thus, we gain increased empirical performance by appropriately biasing our
algorithm based on our knowledge of dependencies.
¹This chapter reproduces content from Montañez and Finley, “Kernel Density Optimization” (In Prep.).
10.1 Theoretical Considerations
Given that sequential hyperparameter optimization is a literal search through a space of hyper-
parameter configurations, our results are directly applicable. The search space Ω consists of all
the possible hyperparameter configurations (appropriately discretized in the case of numerical
hyperparameters). The target set T is determined by the particular learning algorithm the con-
figurations are applied to, the performance metric used, and the level of performance desired.
Let S denote a set of points sampled from the space, and let the information gained from the
sample become the external information resource f . Given that resource, we have the following
theorem:
Theorem 13. Given a search algorithm A, a finite discrete hyperparameter configuration space
Ω, a set S of points sampled from that search space, and information resource f that is a function
of S, let Ω0 := Ω \ S, τk = {T | T ⊆ Ω0 , |T | = k ∈ N}, and τk,qmin = {T | T ∈ τk , q(T, f ) ≥
qmin }, where q(T, f ) is the expected per-query probability of success for algorithm A under T
and f . Then,
\frac{|τ_{k,q_{min}}|}{|τ_k|} ≤ \frac{p_0}{q_{min}},

where p_0 = k/|Ω0|.
The proof follows directly from Theorem 2.
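To make the bound concrete, consider plugging in some purely hypothetical numbers:

```python
# Illustrative arithmetic only; the values below are invented for this example.
omega_remaining = 10**6   # |Omega0|: configurations left after removing the sampled set S
k = 10                    # size of the target set
q_min = 0.5               # desired expected per-query probability of success
p_0 = k / omega_remaining
print(p_0 / q_min)        # 2e-05: at most this fraction of size-k target sets can qualify
```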
The proportion of possible hyperparameter target sets giving an expected probability of suc-
cess qmin or more is minuscule when k ≪ |Ω0|. If we have no additional information beyond that
gained from the points S, we have no justifiable basis for expecting a successful search. Thus, we
must make some assumptions concerning the relationship of the points sampled to the remaining
points in Ω0 . We can do so by either assuming structure on the search space, such that spatial
coordinates become informative, or by making an assumption on the process by which S was
sampled, so that the sample is representative of the space in quantifiable ways. These assump-
tions allow f to become informative of the target set T , leading to exploitable dependence. Thus
we see the need for inductive bias in hyperparameter optimization [55], which hints at a strategy
for creating more effective hyperparameter optimization algorithms (i.e., through exploitation of
spatial structure). We adopt this strategy for KDO.
One natural strategy is to probabilistically model the space and sample from the resulting distribution, which would tend to sample in proportion to the “goodness” of a region, while efficiently exploring the space.
Because each sample evaluation is costly, one desires to locate regions of high performance
with as few queries as possible. Sampling directly from the true distribution would include oc-
casional wasteful samples from low-performance regions, so what we actually desire is a method that not only builds a probabilistic model of the space, but builds a skewed one, which disproportionately concentrates mass in promising regions. Sampling from that distribution will be
biased towards returning points from high-performance regions. Just as importantly, we desire a
method that we can efficiently sample from.
Kernel Density Optimization meets these challenges by building a truncated kernel density
estimate model over the hyperparameter cube, where density correlates with empirical perfor-
mance. The model uses performance-(inversely)proportional bandwidth selection for each ob-
servation, tending to concentrate mass in high-performance regions (via small bandwidths) and
pushing it away from low-performance regions (via large bandwidths). Furthermore, by truncat-
ing the KDE model to the top k performing points, we further bias the model towards concentra-
tion of mass to high-performance regions. The end result is a model that iteratively concentrates
mass in good regions, is easy to sample from, and naturally balances exploration and exploita-
tion. Figure 10.1 shows the progressive concentration of mass in high-performance regions,
reflected in the density of samples returned by the model over a series of five-hundred queries,
via snapshots taken during the first fifty queries, the middle fifty, and the final fifty.
Figure 10.1: Mass concentration around promising regions of the hyperparameter search space.
Queries 1-50, 200-250, and 450-500, respectively.
10.2.1 KDO: Mathematical Form
Given an objective function f to be maximized, configurations ci drawn from the hyperparam-
eter search space, a user-defined truncation length parameter L, a user-defined weight rescaling
parameter r, and a user-defined minimum mutation spread parameter m, the truncated weighted
KDE that forms the model for KDO sampling of numerical hyperparameters has the following
form, assuming configurations c(i) are ranked in descending order of performance:
\sum_{i=1}^{L} \frac{w_{(i)}}{\sum_{j=1}^{L} w_{(j)}} K\!\left(\frac{‖c − c_{(i)}‖}{h_{(i)}}\right),   (10.1)

where w_{(i)} and h_{(i)} are the weight and bandwidth attached to the ith-ranked configuration and K is the kernel. For a categorical hyperparameter y taking values in a set V, each value v is sampled with probability

\Pr(v) = \frac{α + \sum_{i ≤ N : c_i[y] = v} f(c_i)}{α|V| + \sum_{v' ∈ V} \sum_{i ≤ N : c_i[y] = v'} f(c_i)},   (10.6)
where α is the pseudocount weight, ci [y] is the value of hyperparameter y for the ith configuration
sampled, and N is the total number of configurations sampled.
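A minimal sketch of drawing one numerical configuration from this truncated, weighted KDE (illustrative only; we assume a Gaussian kernel, approximate truncation to the unit hypercube by clipping, and the names are ours):

```python
import numpy as np

def sample_numeric_config(configs, weights, bandwidths, rng):
    """Sample from the mixture in Equation (10.1): choose one of the kept
    top-L configurations with probability proportional to its weight, then
    perturb it with an isotropic Gaussian of that configuration's bandwidth.

    configs    : (L, d) array of retained configurations, rescaled to [0, 1].
    weights    : (L,) nonnegative performance-derived weights w_(i).
    bandwidths : (L,) per-configuration bandwidths h_(i).
    """
    p = weights / weights.sum()
    i = rng.choice(len(configs), p=p)
    sample = rng.normal(loc=configs[i], scale=bandwidths[i])
    return np.clip(sample, 0.0, 1.0)   # crude stand-in for cube truncation

rng = np.random.default_rng(0)
demo = sample_numeric_config(rng.random((20, 5)), np.linspace(1, 2, 20),
                             np.full(20, 0.05), rng)
```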
Pseudocode for KDO is provided in Algorithm 2.
Algorithm 2 KDO method for hyperparameter optimization.
1: procedure KDO
2: Initialize:
3: Rescale numeric hyperparameters to [0,1] range.
4: Randomly sample Ninit configurations, save as pop.
5: history ← ∪_{c∈pop} {⟨c, f(c)⟩}, for metric f.
6: F̂ ← CDF(mean(pop), var(pop)).
7: Main Loop:
8: for total number of queries do
9: kbest ← GetBest(history, L).
10: for number of samples in batch do
11: Sample uniformly random, with prob q.
12: Sample from kbest, with prob (1 − q).
13: end for
14: for each uniformly sampled configuration c do
15: Update F̂ with f(c).
16: history ← history ∪ {⟨c, f(c)⟩}.
17: end for
18: for each configuration p sampled from kbest do
19: c ← mutate(p).
20: Update F̂ with f(c).
21: history ← history ∪ {⟨c, f(c)⟩}.
22: end for
23: end for
24: Return:
25: Return cbest = GetBest(history, 1).
26: end procedure
10.3.2 Fitness Models
KDO employs two primary models, a simple one-dimensional Gaussian model of fitness values
for hyperparameter configurations (based on an estimate constructed from uniformly sampled
points), and a weighted kernel density estimate that models the spatial distribution of fitness
values within the hyperparameter cube. We will discuss each model in turn.
For the one-dimensional Gaussian model, the initial random configurations and an additional
set of interleaved uniform random points (which are taken with probability q, as a user defined
parameter, q = 0.05 being the default) are used to estimate the mean and variance of this distri-
bution. Evaluating performance values using the CDF of the Gaussian gives a rough estimate of
the empirical “goodness” of a configuration, which is then used to control the mutation rate in an
inversely fitness-proportional manner. Thus, when a configuration exhibits good performance,
the mutation rate is lowered to sample configurations near this point, which fits the assumption
that nearby points have similar fitness values. (For minimizing performance metrics such as loss
functions, we take advantage of the symmetry of the Gaussian to reflect the point around the
mean to obtain the corresponding upper-tail CDF value.)
For the second model, an approximate weighted kernel density estimate over the space of
numerical hyperparameters is used, with weights being proportional to the (normalized) em-
pirical performance values, and with fitness-dependent individual covariance matrices at each
sampled point. We simplify the model by using a diagonal bandwidth matrix, and set the di-
agonal entries using Silverman’s rule-of-thumb with identical standard deviations set to s = max(1 − F̂(f(c)), m), where F̂ is the CDF for the empirical one-dimensional Gaussian fitness model, f(c) is the observed performance of configuration c, and m is the minimum mutation
spread, a user-defined parameter in the range [0, 1]. Setting the m value lower allows for better
exploitation and more precise exploration, at the expense of possible premature convergence.
Formally, the bandwidth matrix is defined as a d-dimensional diagonal matrix with diagonal
entries equal to (4/(d + 2))^{1/(d+4)} s n^{−1/(d+4)}. This is used as the covariance matrix for each
multivariate Gaussian centered at the observed configurations. Thus, the model concentrates
mass more closely around strongly performing configurations in the space, and spreads mass
away from weakly performing points. Because of the diagonal bandwidth assumption, sampling
from the model is also greatly simplified.
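A sketch of this bandwidth computation (names ours; the formula follows the description above):

```python
import numpy as np

def kdo_bandwidth_matrix(perf, fitness_cdf, m, d, n):
    """Diagonal bandwidth (used as a covariance) for one observed configuration:
    s = max(1 - F_hat(perf), m), entries (4/(d+2))**(1/(d+4)) * s * n**(-1/(d+4)).

    perf        : observed performance of the configuration.
    fitness_cdf : F_hat, CDF of the one-dimensional Gaussian fitness model.
    m           : user-defined minimum mutation spread in [0, 1].
    d           : number of numerical hyperparameters.
    n           : number of observations currently in the model.
    """
    s = max(1.0 - fitness_cdf(perf), m)
    entry = (4.0 / (d + 2)) ** (1.0 / (d + 4)) * s * n ** (-1.0 / (d + 4))
    return entry * np.eye(d)
```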
An additional set of models are used for categorical hyperparameters. For each categorical
hyperparameter, KDO uses a categorical distribution over the categories, where mass is propor-
tional to the normalized sum of historical performance values for each category. (When using
minimizing metrics, such as loss, the final normalized weights are inverted by subtracting each
from 1 and renormalizing.) Pseudocounts are used for unseen parameter values, with the default
being set to 0.1 mass added to each category. Alternative schemes, such as adding 1/n, may also
be used.
Because of the diagonal simplification of the bandwidth matrix, sampling from the multivariate Gaussian becomes equivalent
to sampling from independent Gaussians for each numerical hyperparameter.
To select a configuration from the history, we first truncate the history to some predetermined
number of empirically strongest samples (controlled by a history length parameter, with default
value of L = 20), then form a normalized categorical distribution over these configurations,
with mass being proportional to the rescaled sum of empirical performance values. Because
we want to concentrate mass on strongly performing configurations, we rescale the categorical
distribution by raising each entry to a positive power τ = r · F̂(f(c_(1))), where r is a user-defined weight rescaling parameter (defaulted to r = 30), F̂ is again the one-dimensional Gaussian fitness CDF, and f(c_(1)) is the observed performance of the empirically best configuration in the
history (using the reflected quantile when working with minimizing metrics). This rescaling has
the effect of concentrating mass more tightly around the best performing configurations, trading
exploitation for weaker exploration. It also intensifies the concentration as the quality of the
best configuration found so far increases. The weights are then renormalized to create a proper
categorical distribution, from which we sample one or more configurations.
For each configuration chosen, we then sample from the multivariate Gaussian centered on
that configuration using the simplified bandwidth matrix and sample from the independent cate-
gorical distributions for the remaining categorical hyperparameters.
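As a sketch of this parent-selection step (illustrative only; the names are ours, and we assume a maximizing metric with nonnegative performance values):

```python
import numpy as np

def select_parents(history_perf, fitness_cdf, L=20, r=30, n_samples=1, rng=None):
    """Truncate the history to the top-L performers, rescale their normalized
    performance weights by tau = r * F_hat(best performance), renormalize, and
    sample parent indices from the resulting categorical distribution."""
    rng = rng or np.random.default_rng()
    perf = np.asarray(history_perf, dtype=float)
    order = np.argsort(perf)[::-1][:L]          # indices of top-L configurations
    w = perf[order]
    w = w / w.sum()                             # normalized weights
    tau = r * fitness_cdf(perf[order[0]])       # concentration exponent
    w = w ** tau
    w = w / w.sum()
    return order[rng.choice(len(order), size=n_samples, p=w)]
```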
10.4 Extensions
10.4.1 Hierarchical Hyperparameter Spaces
Tree-Structured Parzen Estimators (TPE) [9] and SMAC are both well-suited for optimization of
hierarchically structured hyperparameter spaces, allowing for the arbitrary nesting of hyperpa-
rameters. Since KDO’s model allows for fast sampling of high-dimensional spaces (for example,
on a 1000 dimensional Rastrigin function space KDO takes approximately 0.03 seconds to sam-
ple and evaluate each configuration), an obvious solution for handling hierarchically structures
spaces is simply to optimize all sets of available conditional hyperparameters in parallel, and al-
low the evaluation algorithm to select out the subset of relevant parameters using the conditional
hierarchy. Since each relevant hyperparameter will have a value (as all possible hyperparam-
eters are assigned values), this is guaranteed to return a valid configuration at each iteration.
Furthermore, truly irrelevant hyperparameters that are never chosen will not harm predictive per-
formance, and will only negligibly affect computational runtime. Thus, KDO’s fast sampling
structure allows for natural adaptation to conditional hyperparameter spaces.
10.4.2 Acquisition Functions
Sequential model-based optimization methods often use a secondary acquisition function to es-
timate the expected improvement of configurations over incumbent configurations (e.g., SMAC
and TPE). For SMAC, this involves taking the empirical mean and variance from trees within its
random forest to parameterize a Gaussian model for computation of the EI (expected improve-
ment) [39]. TPE takes the results from two configuration models (one for good configurations,
one for poor), using a transformed likelihood-ratio of them to estimate the EI [9]. In both cases,
proposed configurations are measured according to the acquisition function, effectively filtering
which configurations to accept and which to reject. Although the goal for both methods is to
maximize the acquisition function, neither does so analytically, instead relying on an ad hoc
search procedure to locate configurations with large EI.
Instead of using the indirect process of approximately maximizing an acquisition function,
KDO builds a probabilistic model directly on the hyperparameter configuration space to directly
sample promising regions. If a secondary acquisition function is desired, one can be accom-
modated within KDO by using distance-based measures (Euclidean distances for numerical hyperparameters, Hamming distances for categorical hyperparameters) with empirically estimated
Lipschitz smoothness constraints. First, a maximum empirical Lipschitz smoothness constant is
estimated using a random subsample of observations. Next, the upper Lipschitz bounds for a potential configuration are computed using its k nearest points. Taking the mean and variance of these
upper bounds, one can then use a Gaussian model in the manner of [39] to compute an EI score.
These scores can then be used to filter out unpromising configurations and select locally maximal
ones. Whether using a secondary acquisition function would further improve the performance of
KDO is an open question, and more research is needed to determine the appropriateness of such
extensions.
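One way such an acquisition step might look (a rough sketch under the stated assumptions; the function name, defaults, and the use of SciPy's normal CDF/PDF are ours):

```python
import numpy as np
from scipy.stats import norm

def lipschitz_ei(candidate, X, y, best, k=5, subsample=200, rng=None):
    """Distance-based expected-improvement score: estimate an empirical
    Lipschitz constant from a random subsample, form Lipschitz upper bounds
    at the candidate from its k nearest observed points, then score a
    Gaussian EI over those bounds against the incumbent value `best`."""
    rng = rng or np.random.default_rng()
    idx = rng.choice(len(X), size=min(subsample, len(X)), replace=False)
    C = 0.0
    for a in idx:                                     # empirical Lipschitz constant
        d = np.linalg.norm(X[idx] - X[a], axis=1)
        d[d == 0] = np.inf
        C = max(C, float(np.max(np.abs(y[idx] - y[a]) / d)))
    dists = np.linalg.norm(X - candidate, axis=1)
    nearest = np.argsort(dists)[:k]
    upper = y[nearest] + C * dists[nearest]           # Lipschitz upper bounds
    mu, sigma = upper.mean(), upper.std() + 1e-12
    z = (mu - best) / sigma
    return (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)
```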
10.4.3 Parallelization
One advantage of uniform random sampling over sequential model-based optimization methods
is the unquestionably parallel nature of random sampling. Contrast that with the serially pro-
cessed sequential model updating of SMBO methods. Although SMBO methods are sequential, parallelization can be introduced at the expense of sampling from less accurate models. (In-
deed, we adopt one such approach for our experiments.) For example, rather than sampling a
single configuration from a model at each time step, one can sample several in parallel, as in
batch methods. This parallelization comes at a cost: if the samples in a batch were processed
serially then latter samples would benefit from information returned from earlier samples. Thus,
the trade-off is one of accuracy for speed, with less accurate models being traded for faster eval-
uations.
A second option for parallelization is to run several independent serial KDO processes, and
take the best configuration found from any run. To the degree that one process would not benefit
from the information gained by another process, the trade-off between speed and information
would still continue to hold. A thorough study of parallelization within KDO remains an open
future research area.
10.5 Experiments
10.5.1 Experimental Setup
We evaluated KDO and a number of other hyperparameter optimization methods on a test suite
of seven algorithm / dataset pairs. We restricted ourselves to binary classification tasks, and
selected three types of learning algorithms to train and test (MART gradient boosted trees [29],
logistic regression [3], and averaged perceptrons [28]). We performed 100 independent trials of
each task for each hyperparameter optimization method, with each method sequentially sampling
500 points from the hyperparameter search space during every trial.
Datasets
For data, we used publicly available UCI datasets [48] (adult-tiny2 , adult [46], breast cancer [13],
ionosphere [74], seismic bumps [75]), a two-class version of the CIFAR-10 dataset [47], and a
synthetic fifty-dimensional Rastrigin function [60]. The six real-world datasets were assigned
to the learning methods in the following ways: (adult-tiny, averaged perceptron), (breast can-
cer, averaged perceptron), (adult, logistic regression), (ionosphere, logistic regression), (seismic
bumps, gradient boosted trees), and (CIFAR-10 two-class, gradient boosted trees). No learn-
ing methods were used for the Rastrigin synthetic function, but the hyperparameter optimization
methods attempted to optimize the function directly. Existing train/test splits were used when-
ever available.
Table 10.1: Datasets, learners, and hyperparameters for each experiment task. (log+ denotes log
scaling.)
The random forest implementation used for SMAC did not include an nmin parameter (which controls how many items must
be in a node to split), but had a “min docs in leaves” parameter, which controlled the minimum
number of items that could be contained in a leaf (which was set to 2).
For KDO, the settings used were L = 20, m = 0.0001, Ninit = 50, q = 0.05, r = 30.
Neither set of parameters was tuned to improve performance on individual tasks, instead being
held constant across all tasks. Further investigation is needed to determine the sensitivity of the
two methods to changes in their parameter settings.
We used batch processing to speed up the experimental runs, using 50 batches of 10 proposed
configurations each iteration (where the 10 configurations are simultaneously drawn from the
sample model, evaluated, then are used to update the model in a single large update). This trades
off efficiency (drawing multiple points at once) for less frequent updating of the models (which cannot benefit from the immediate evaluation feedback provided by the first points in a batch). Because
we needed to sample 250,000 configurations for each experiment, with each configuration being
used to train and test an independent machine learning model, we chose batch processing as an
acceptable compromise to keep overall runtime manageable.
Evaluation Criteria
The area-under-ROC-curve (AUC) was used as the evaluation criterion for all tests. We computed
the cumulative maximum AUC over 500 queries for each trial, then found the mean and 95%
confidence interval over all trials, plotting the curves visible in Figures 10.2a-10.3.
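A sketch of this summarization (illustrative; we assume a normal-approximation interval across trials):

```python
import numpy as np

def summarize_trials(auc):
    """auc: (n_trials, n_queries) array of per-query AUC values.
    Returns the mean cumulative-maximum curve with a 95% band across trials."""
    best_so_far = np.maximum.accumulate(auc, axis=1)   # cumulative max per trial
    mean = best_so_far.mean(axis=0)
    half = 1.96 * best_so_far.std(axis=0, ddof=1) / np.sqrt(auc.shape[0])
    return mean, mean - half, mean + half
```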
10.5.2 Results
Figure 10.2 plots the outcomes for all six real-world experiments. Some general trends emerge,
such as SMAC showing statistically significant improvement over random search and low-discrepancy
grid sequences (LDS), as expected, and KDO further improving over SMAC in all real-world
tasks. The early improvement of LDS seen in most trials suggests it as a potentially useful
method when the total number of iterations is severely restricted, such as a budget
of 50 or fewer iterations. The synthetic Rastrigin optimization task was an outlier (Figure 10.3),
with low-discrepancy sequences able to achieve the optimum on the first query (being directly
in the center of the hypercube, where such sequences coincidentally start), but to make the plots
clearer the LDS line is not shown. We also see Nelder-Mead excel on that same function, with
SMAC eventually outperforming KDO. In all other experiments, Nelder-Mead performs rela-
tively poorly.
KDO performs significantly better on all real-world experiments, with confidence intervals
clearly separated from the nearest competitors (SMAC, LDS, uniform random sampling). Just
as SMAC significantly improves over uniform random sampling, KDO further increases that
improvement by 64%, on average over all trials and queries.
Figure 10.2: Results for real-world learning tasks. Dashed lines represent where random initial-
ization ends for SMAC and KDO methods (i.e., 50th query).
[Figure 10.3: results for the synthetic Rastrigin optimization task over 500 queries, comparing KDO, NM, SMAC, and RANDOM.]
10.6 Discussion
Our theoretical results show the importance of dependence between targets and information re-
sources for successful learning. As discussed in Section 7.6.3, smoothness can be viewed as
an exploitable source of dependence, either temporally or spatially. In Chapter 9 we exploited
temporal smoothness through the use of temporally regularized methods, allowing our algo-
rithm to perform well in cases where such inductive bias aligned to what held in the real world.
Here in Chapter 10, we demonstrate the benefits of exploiting spatial smoothness for improved
performance. In both cases the source of improved performance was correctly identified depen-
dence within the respective domains; our view of learning as a search for exploitable dependence
works to guide the discovery of novel algorithms by suggesting sources of exploitable informa-
tion, such as smoothness. While the algorithms presented in these two chapters are not derived
directly from the search framework, their discovery was guided by the insight it provides and the
structures it suggests.
Sequential model-based optimization methods propose candidate configurations and estimate their expected improvement over incumbent configurations.
Bergstra, Yamins and Cox [7] emphasized the importance of hyperparameter choice in algorithm
comparisons, arguing that hand-tuned parameters cause poor reproducibility and lead to large
variance in performance estimation.
Thornton et al. [80] developed Auto-WEKA as an automated sequential model-based Bayesian
optimization algorithm for simultaneously selecting algorithms and hyperparameters in WEKA [37].
Their approach used both popular Sequential Model-based Algorithm Configuration (SMAC)
and tree-structured Parzen Estimators (TPE), and found they were able to select hyperparameter
settings that improved performance over random and grid-based search approaches. Eggensperger
et al. [20] introduced a library of benchmarks for evaluating hyperparameter optimization meth-
ods and compared SMAC, TPE, and Spearmint, three prominent methods for Bayesian hyper-
parameter optimization. Feurer et al. [25] addressed the problem of “cold-starts” in Bayesian
hyperparameter optimization, suggesting a form of transfer learning to initialize the SMAC pro-
cedure, showing improved rates of convergence towards sets of optimal parameters.
Snoek et al. [76] developed practical Bayesian methods for hyperparameter optimization
based on a Gaussian process model with a modified acquisition function that optimized ex-
pected improvement per second (of wall-clock time). Wang et al. [86] proposed using random
embeddings for Bayesian hyperparameter optimization in high dimensions, demonstrating the
effectiveness of this technique (better than random search and on par with SMAC) on synthetic
and real-world datasets. Hutter, Hoos and Leyton-Brown [40] introduced an efficient approach
for assessing the importance of individual (as well as subsets of) hyperparameters using a form
of functional ANOVA on estimated marginals computed from random forest models.
Part IV
Conclusion
Chapter 11
Conclusion
This thesis presents a unified search framework for learning in finite spaces, capable of
addressing what makes machine learning work while also answering questions related to
search and optimization. This is an improvement over previous work as it allows us to
address both areas within the same framework instead of having to deal with them separately.
We reformulate many areas of machine learning as searches within our framework, includ-
ing regression, classification, parameter estimation, clustering, and empirical risk minimization.
Exploring two dominant paradigms for understanding machine learning (statistical learning the-
ory and minimum description length) we see that the former is naturally viewed as a search to
find models with low risk, while latter can be seen as a specific form of inductive bias. These
paradigms thus fit naturally within our higher-level framework, anchoring them to a more general
structure. Examining the link between compressibility and generalization, we see that Occam’s
razor (i.e., biasing models towards simplicity) is not the general explanation for why machine
learning works, but at best can be a sufficient condition for learning in certain restricted contexts.
Bounds proving good generalization performance for compressed structures are shown to not de-
pend on any Occam-like notion of simplicity, but derive from small set sizes, multiple-hypotheses
testing controls, and i.i.d. data assumptions.
Several formal results are proven within this framework. We formally bound the proportion
of favorable search problems for any fixed search algorithm, showing that algorithms can only
outperform uniform sampling on a limited proportion of problems. Furthermore, identifying a problem that gives b bits of competitive advantage over uniform sampling requires at least b bits of information; thus, information is conserved. We prove within our framework an
early result from Culberson [15] bounding the proportion of problems for which an algorithm
is expected to find a target within a fixed number of queries. Defining a search strategy as the
probability distribution induced by an algorithm over the search space when considering the ex-
pected per-query probability of success, we find that for any fixed search problem the proportion
of favorable search strategies is also bounded. It follows that whether one fixes the algorithm
and varies the search problem or fixes the search problem and varies the algorithm, finding a
favorable match is difficult. In the same vein, it is shown in Chapter 7 that any deterministic
binary classification algorithm with restricted data set size can only learn a limited number of
concepts to low error.
Of central importance, two results are proven that quantify the effects of dependence for
learning. In the first, a simple upper bound is given on the expected probability of success
for algorithms defined in terms of dependence, target set size, and target predictability. It shows
dependence to be a necessary condition for the successful learning of uncertain targets. A closed-
form exact expression is also derived for the same quantity, demonstrating that the probability
of success is determined by the dependence between information resources (like datasets) and
targets, algorithm information loss, the size of targets, target uncertainty, randomness, and target
structure. This equation proves the importance of dependence between what is observed and
what is latent for better-than-chance learning, and gives an exact formula for computing the
expected probability of success for any search algorithm.
Lastly, to demonstrate the practical utility of our paradigm, we apply our insights to two
learning areas, time-series segmentation and hyperparameter optimization. Chapters 9 and 10
show how a “dependence-first” view of learning leads naturally to discovering new exploitable
sources of dependence, and we operationalize our understanding by developing new algorithms
for these application areas with strong empirical performance.
Part V
Appendices
Appendix A
Thus, a ψ-expansion adds additional tuples to the dataset, which represent possible alternative y
values for each x, defined by the set-valued function ψ applied to elements of X .
Definition A.0.2. (ψ-consistency with D) Given spaces X and Y, a set D and its ψ-expansion
ψ(D), a function g : X → Y is said to be ψ-consistent with D iff (x, g(x)) ∈ ψ(D) for every
x ∈ Dx .
Definition A.0.3. (ψ-consistent set Ωψ(D) ) A ψ-consistent set Ωψ(D) is defined as the set of all
functions g : X → Y that are ψ-consistent with D.
The mapping ψ acts as an expansion, turning a single element y into a set containing y (and
possibly other elements of Y). This allows us to consider consistency with training data when the
true y value may differ from the observed y value, as when label noise is present. Of course, the
noiseless case is also covered in this definition, as a particular special case: the set of hypothesis
functions consistent with training set D, denoted ΩD , is equivalent to the ψ-consistent set Ωψ(D)
when ψ(x) = {y} for every (x, y) pair in D (or equivalently, when ψ(D) = D).
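As a small hypothetical example of these definitions (instances, labels, and ψ values invented purely for illustration):

```python
# psi maps each instance to the set of labels considered consistent with it.
D = [("x1", 1), ("x2", 0)]                  # observed training pairs
psi = {"x1": {0, 1},                        # label noise tolerated on x1
       "x2": {0}}                           # x2's label taken at face value

psi_D = {(x, y) for (x, _) in D for y in psi[x]}
# psi_D == {("x1", 0), ("x1", 1), ("x2", 0)}   (the psi-expansion of D)

def is_psi_consistent(g):
    """g maps instances to labels; g is psi-consistent with D iff (x, g(x))
    lies in psi(D) for every x appearing in D."""
    return all((x, g[x]) in psi_D for (x, _) in D)

assert is_psi_consistent({"x1": 0, "x2": 0})
assert not is_psi_consistent({"x1": 1, "x2": 1})
```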
Theorem 14. Define as follows:
• X - finite instance space,
• Y - finite label space,
• Ω - Y X , the space of possible concepts on X ,
• h - a hypothesis, h ∈ Ω,
• D = {(x1 , y1 ), . . . , (xN , yN )} - any training dataset where xi ∈ X , yi ∈ Y,
• Dx = {x : (x, ·) ∈ D} - the set of x instances in D,
• Vx = {x1 , . . . , xM } - any test dataset disjoint from Dx (i.e., Dx ∩ Vx = ∅) containing
exactly M elements xi ∈ X with M > 0, hidden from the algorithm during learning,
• Ωψ(D) - subset of hypotheses ψ-consistent with D for some ψ(·) function, and
• unbiased classifier A - any classifier such that P (h|D, M ) = 1(h ∈ Ωψ(D) )/|Ωψ(D) | (i.e.,
makes no assumptions beyond strict ψ-consistency with training data).
Then the distribution of 0-1 generalization error counts for any unbiased classifier is given by
P(w | D, M; A) = \binom{M}{w} \left(1 − \frac{1}{|Y|}\right)^{w} \left(\frac{1}{|Y|}\right)^{M − w},
where w is the number of wrong predictions on disjoint test sets of size M . Thus, every unbiased
classifier has generalization performance equivalent to random guessing (e.g., flipping a coin)
of class labels for unseen instances.
Proof. In agreement with the problem model given in Figure 5.1, we assume the training data
D are generated from the true concept h∗ by some process, and that h is chosen by A using D
alone, without access to h∗ . Thus, h∗ → D → h, implying P (h∗ |D, h, M ) = P (h∗ |D, M ) by
d-separation. By similar reasoning, since {Dx , M } → Vx , with D as an ancestor of both Vx and
h and no other active path between them, we have P (Vx |f, D, h, M ) = P (Vx |f, D, M ). This
can be verified intuitively by the fact that Vx is generated prior to the algorithm choosing h, and
the algorithm has no access to Vx when choosing h (it only has access to D, which we’re already
conditioning on).
Let K = 1(h ∈ Ωψ(D) )/|Ωψ(D) | and let L be the number of free instances in X , namely
L = |X | − (|Dx | + |Vx |) (A.1)
= |X | − (N + M ). (A.2)
Then

P(w | D, M; A) = \frac{P(w, D | M)}{P(D | M)}   (A.3)
= \frac{\sum_{h ∈ Ω_{ψ(D)}} P(w, h, D | M)}{\sum_{h ∈ Ω_{ψ(D)}} P(h, D | M)}   (A.4)
= \frac{\sum_{h ∈ Ω_{ψ(D)}} P(w | h, D, M) P(h | D, M) P(D | M)}{\sum_{h ∈ Ω_{ψ(D)}} P(h | D, M) P(D | M)}   (A.5)
= \frac{K P(D | M) \sum_{h ∈ Ω_{ψ(D)}} P(w | h, D, M)}{K P(D | M) |Ω_{ψ(D)}|}   (A.6)
= \frac{1}{|Ω_{ψ(D)}|} \sum_{h ∈ Ω_{ψ(D)}} P(w | h, D, M)   (A.7)
= \frac{1}{|Y|^{M+L} \prod_{i=1}^{N} |ψ(x_i)|} \sum_{h ∈ Ω_{ψ(D)}} P(w | h, D, M)   (A.8)
= \frac{1}{C_1 |Y|^{M+L}} \sum_{h ∈ Ω_{ψ(D)}} P(w | h, D, M),   (A.9)
where we have defined C_1 := \prod_{i=1}^{N} |ψ(x_i)|. Marginalizing over possible true concepts f for
term P (w|h, D, M ) and letting Z = {f, h, D, M }, we have
P(w | h, D, M) = \sum_{f ∈ Ω} P(w, f | h, D, M)   (A.10)
= \sum_{f ∈ Ω} P(w | Z) P(f | h, D, M)   (A.11)
= \sum_{f ∈ Ω} P(w | Z) P(f | D, M)   (by d-separation)   (A.12)
= \sum_{f ∈ Ω} P(f | D, M) \sum_{v_x} P(w, v_x | Z)   (A.13)
= \sum_{f ∈ Ω} P(f | D, M) \sum_{v_x} P(v_x | Z) P(w | Z, v_x)   (A.14)
= \sum_{f ∈ Ω} P(f | D, M) \sum_{v_x} P(v_x | Z) 1(w = w_{h,f}(v_x)),   (A.15)

where w_{h,f}(v_x) = \sum_{x ∈ v_x} 1(h(x) ≠ f(x)) and the final equality follows since

P(w | Z, v_x) = P(w | f, h, D, M, v_x) = \begin{cases} 1 & w = w_{h,f}(v_x), \\ 0 & w ≠ w_{h,f}(v_x). \end{cases}   (A.16)
Substituting this back into (A.9) gives

P(w | D, M; A) = \frac{1}{C_1 |Y|^{M+L}} \sum_{h ∈ Ω_{ψ(D)}} \left[ \sum_{f ∈ Ω} P(f | D, M) \sum_{v_x} P(v_x | f, D, M) 1(w = w_{h,f}(v_x)) \right]   (A.18)
= \frac{1}{C_1 |Y|^{M+L}} \sum_{f ∈ Ω} P(f | D, M) \sum_{v_x} P(v_x | Z') \sum_{h ∈ Ω_{ψ(D)}} 1(w = w_{h,f}(v_x)),   (A.19)
where we have defined Z' := {f, D, M} and the second equality follows from d-separation between V_x and h, conditioned on D.
Note that \sum_{h ∈ Ω_{ψ(D)}} 1(w = w_{h,f}(v_x)) is the number of hypotheses ψ-consistent with D that disagree with concept f exactly w times on v_x. There are \binom{M}{w} ways to choose w disagreements with f on v_x, and for each of the w disagreements we can choose |Y| − 1 possible values for h at that instance, giving a multiplicative factor of (|Y| − 1)^w. For the training set D, there are exactly |ψ(x_i)| alternative values for each training set instance x_i, giving a multiplicative factor of C_1 = \prod_{i=1}^{N} |ψ(x_i)|. For the remaining L instances that are in neither D nor v_x, we have |Y| possible values, giving the additional multiplicative factor of |Y|^L. Thus,
P(w | D, M; A) = \frac{1}{C_1 |Y|^{M+L}} \sum_{f ∈ Ω} P(f | D, M) \sum_{v_x} P(v_x | Z') \binom{M}{w} C_1 (|Y| − 1)^w |Y|^L   (A.20)
= \frac{1}{C_1 |Y|^{M+L}} \binom{M}{w} C_1 (|Y| − 1)^w |Y|^L \sum_{f ∈ Ω} P(f | D, M) \sum_{v_x} P(v_x | Z')   (A.21)
= \frac{1}{C_1 |Y|^M |Y|^L} \binom{M}{w} C_1 (|Y| − 1)^w |Y|^L \sum_{f ∈ Ω} P(f | D, M)   (A.22)
= \binom{M}{w} (|Y| − 1)^w \frac{1}{|Y|^M}   (A.23)
= \binom{M}{w} |Y|^w \left(1 − \frac{1}{|Y|}\right)^w \frac{1}{|Y|^M}   (A.24)
= \binom{M}{w} \left(1 − \frac{1}{|Y|}\right)^w \frac{|Y|^w}{|Y|^M}   (A.25)
= \binom{M}{w} \left(1 − \frac{1}{|Y|}\right)^w |Y|^{w − M}   (A.26)
= \binom{M}{w} \left(1 − \frac{1}{|Y|}\right)^w \left(\frac{1}{|Y|}\right)^{M − w}.   (A.27)
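As a quick sanity check of this conclusion (a Monte Carlo illustration, not part of the proof; the simulation directly realizes the random-guessing behavior the theorem establishes):

```python
import numpy as np

def simulate_unbiased_errors(num_labels=3, M=20, trials=100_000, seed=0):
    """Error counts of uniform random guessing on M unseen instances; per the
    theorem, they follow Binomial(M, 1 - 1/|Y|)."""
    rng = np.random.default_rng(seed)
    truth = rng.integers(num_labels, size=(trials, M))   # hidden true labels
    guess = rng.integers(num_labels, size=(trials, M))   # unbiased predictions
    wrong = (truth != guess).sum(axis=1)
    return wrong.mean(), M * (1 - 1 / num_labels)        # empirical vs. theoretical mean

print(simulate_unbiased_errors())   # roughly (13.33, 13.33)
```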
Bibliography
[1] Lee Altenberg. The schema theorem and Price’s theorem. Foundations of genetic algo-
rithms, 3:23–49, 1995. 2.3.3
[2] Kerem Altun, Billur Barshan, and Orkun Tunçel. Comparative study on classifying human
activities with miniature inertial and magnetic sensors. Pattern Recogn., 43(10):3605–
3620, October 2010. ISSN 0031-3203. doi: 10.1016/j.patcog.2010.04.019. URL http://dx.doi.org/10.1016/j.patcog.2010.04.019. 9.4.1
[3] Galen Andrew and Jianfeng Gao. Scalable training of l 1-regularized log-linear models. In
Proceedings of the 24th international conference on Machine learning, pages 33–40. ACM,
2007. 10.5.1
[4] David Arthur and Sergei Vassilvitskii. k-means++: The advantages of careful seeding. In
Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pages
1027–1035. Society for Industrial and Applied Mathematics, 2007. 4.3
[5] Anne Auger and Olivier Teytaud. Continuous lunches are free! In Proceedings of the 9th
annual conference on Genetic and evolutionary computation, pages 916–922. ACM, 2007.
2.3.8, 2.3.9
[6] Toby Berger. Rate-distortion theory. Encyclopedia of Telecommunications, 1971. 7.6.3
[7] J Bergstra, D Yamins, and DD Cox. Making a science of model search: Hyperparameter
optimization in hundreds of dimensions for vision architectures. In Proc. 30th International
Conference on Machine Learning (ICML-13), 2013. 10, 10.7
[8] James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. The
Journal of Machine Learning Research, 13(1):281–305, 2012. 10, 10.2, 10.5.1, 10.7
[9] James S Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for hyper-
parameter optimization. In Advances in Neural Information Processing Systems, pages
2546–2554, 2011. 10, 10.4.1, 10.4.2, 10.7
[10] Christopher M Bishop. Pattern Recognition and Machine Learning. Springer, 2007. 616–
625. 9.3
[11] Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K Warmuth. Occam’s
razor. Information processing letters, 24(6):377–380, 1987. 7.5
[12] Antoine Bordes, Seyda Ertekin, Jason Weston, and Léon Bottou. Fast kernel classifiers
with online and active learning. Journal of Machine Learning Research, 6(Sep):1579–
1619, 2005. 7.2
[13] Peter Clark and Tim Niblett. Induction in noisy domains. In Progress in Machine Learning
(from the Proceedings of the 2nd European Working Session on Learning), volume 96,
pages 11–30. Sigma Press, 1987. 10.5.1
[14] D. Cox and N. Pinto. Beyond simple features: A large-scale feature search approach to
unconstrained face recognition. In Automatic Face Gesture Recognition and Workshops
(FG 2011), 2011 IEEE International Conference on, pages 8–15, March 2011. doi: 10.
1109/FG.2011.5771385. 10
[15] J.C. Culberson. On the futility of blind search: An algorithmic view of ‘no free lunch’.
Evolutionary Computation, 6(2):109–127, 1998. 1.1, 2.3.3, 5.4, 11
[16] Rachel Cummings, Katrina Ligett, Kobbi Nissim, Aaron Roth, and Zhiwei Steven Wu.
Adaptive learning with robust generalization guarantees. In Proceedings of the 29th Con-
ference on Learning Theory, COLT, pages 23–26, 2016. 7.2
[17] W.A. Dembski and R.J. Marks II. Conservation of information in search: Measuring the
cost of success. Systems, Man and Cybernetics, Part A: Systems and Humans, IEEE Trans-
actions on, 39(5):1051 –1061, sept. 2009. ISSN 1083-4427. doi: 10.1109/TSMCA.2009.
2025027. 1.1, 3.3, 5.1, 5.2, 2
[18] William A. Dembski, Winston Ewert, and Robert J. Marks II. A general theory of informa-
tion cost incurred by successful search. In Biological Information, chapter 3, pages 26–63.
World Scientific, 2013. doi: 10.1142/9789814508728 0002. 5.1
[19] Pedro Domingos. The role of occam’s razor in knowledge discovery. Data mining and
knowledge discovery, 3(4):409–425, 1999. 7.2, 7.5, 7.6
[20] K. Eggensperger, M. Feurer, F. Hutter, J. Bergstra, J. Snoek, H. Hoos, and K. Leyton-
Brown. Towards an empirical foundation for assessing bayesian optimization of hyper-
parameters. In NIPS workshop on Bayesian Optimization in Theory and Practice, 2013.
10.7
[21] T.M. English. Evaluation of evolutionary and genetic optimizers: No free lunch. In Evo-
lutionary Programming V: Proceedings of the Fifth Annual Conference on Evolutionary
Programming, pages 163–169, 1996. 5.1
[22] T.M. English. No more lunch: Analysis of sequential search. In Evolutionary Computation,
2004. CEC2004. Congress on, volume 1, pages 227–234. IEEE, 2004. 1, 1.1
[23] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A density-based algorithm
for discovering clusters in large spatial databases with noise. In KDD, volume 96, pages
226–231, 1996. 4.3
[24] Robert M Fano and David Hawkins. Transmission of information: A statistical theory of
communications. American Journal of Physics, 29(11):793–794, 1961. 5.6
[25] M. Feurer, T. Springenberg, and F. Hutter. Initializing bayesian hyperparameter optimiza-
tion via meta-learning. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial
Intelligence, January 2015. 10, 10.7
[26] Emily B. Fox and Erik B. Sudderth. HDP-HMM Toolbox. https://www.stat.washington.edu/~ebfox/software.html, 2009. [Online; accessed 20-July-2014]. 9.4.2
[27] Emily B Fox, Erik B Sudderth, Michael I Jordan, Alan S Willsky, et al. A sticky HDP-
HMM with application to speaker diarization. The Annals of Applied Statistics, 5(2A):
1020–1056, 2011. 9, 9.4, 9.6
[28] Yoav Freund and Robert E Schapire. Large margin classification using the perceptron
algorithm. Machine learning, 37(3):277–296, 1999. 10.5.1
[29] Jerome H Friedman. Greedy function approximation: a gradient boosting machine. Annals
of statistics, pages 1189–1232, 2001. 10.5.1
[30] Jean-luc Gauvain and Chin-hui Lee. Maximum A Posteriori Estimation for Multivari-
ate Gaussian Mixture Observations of Markov Chains. IEEE Transactions on Speech
and Audio Processing, 2:291–298, 1994. URL http://citeseerx.ist.psu.edu/
viewdoc/summary?doi=10.1.1.18.6428. 9.2, 9.6
[31] Corrado Gini. On the measure of concentration with special reference to income and statis-
tics. In Colorado College Publication, number 208 in General Series, pages 73–79, 1936.
9.3.1
[32] Peter Grünwald. A tutorial introduction to the minimum description length principle. arXiv
preprint math/0406077, 2004. (document), 7.6, 7.2, 7.6.1, 7.6.2
[33] Peter Grünwald and John Langford. Suboptimal behavior of Bayes and MDL in classification
under misspecification. Machine Learning, 66(2-3):119–149, 2007. 7.6.2
[34] Peter D Grünwald. The minimum description length principle. MIT Press, 2007. 2.1, 7.6
[35] I. Guyon, K. Bennett, G. Cawley, H. J. Escalante, S. Escalera, Tin Kam Ho, N. Macià,
B. Ray, M. Saeed, A. Statnikov, and E. Viegas. Design of the 2015 ChaLearn AutoML chal-
lenge. In 2015 International Joint Conference on Neural Networks (IJCNN), pages 1–8,
July 2015. doi: 10.1109/IJCNN.2015.7280767. 5.5
[36] John H Holland. Adaptation in natural and artificial systems: An introductory analysis with
application to biology, control, and artificial intelligence. Ann Arbor, MI: University of
Michigan Press, 1975. 2.3.3
[37] Geoffrey Holmes, Andrew Donkin, and Ian H Witten. Weka: A machine learning work-
bench. In Intelligent Information Systems, 1994. Proceedings of the 1994 Second Australian
and New Zealand Conference on, pages 357–361. IEEE, 1994. 10.7
[38] David Hume. A Treatise of Human Nature. Reprinted from the original edition in three
volumes and edited, with an analytical index, 1896. 2.1
[39] Frank Hutter, Holger H Hoos, and Kevin Leyton-Brown. Sequential model-based optimiza-
tion for general algorithm configuration (extended version). Technical Report TR-2010-10,
University of British Columbia, Computer Science, 2010. 10, 10.4.2, 10.5.1, 10.7
[40] Frank Hutter, Holger Hoos, and Kevin Leyton-Brown. An efficient approach for assess-
ing hyperparameter importance. In Proceedings of the 31st International Conference on
Machine Learning (ICML-14), pages 754–762, 2014. 10.7
[41] Marcus Hutter. Optimality of universal Bayesian sequence prediction for general loss and
alphabet. Journal of Machine Learning Research, 4(Nov):971–1000, 2003. 2.1
[42] Christian Igel and Marc Toussaint. On classes of functions for which no free lunch results
hold. arXiv preprint cs/0108011, 2001. 2.3.5
[43] Edwin T Jaynes. Probability theory: The logic of science. Cambridge University Press,
2003. 7.6
[44] Mark Johnson. Why doesn't EM find good HMM POS-taggers? In EMNLP, pages
296–305, 2007. 9.6
[45] Yoshinobu Kawahara, Takehisa Yairi, and Kazuo Machida. Change-point detection in time-
series data based on subspace identification. In Data Mining, 2007. ICDM 2007. Seventh
IEEE International Conference on, pages 559–564. IEEE, 2007. 9
[46] Ron Kohavi. Scaling up the accuracy of naive-Bayes classifiers: A decision-tree hybrid.
In Proceedings of the Second International Conference on Knowledge Discovery and Data
Mining, volume 96, pages 202–207. Citeseer, 1996. 10.5.1, 2
[47] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny im-
ages. Technical report, University of Toronto, 2009. 10.5.1
[48] M. Lichman. UCI machine learning repository, 2013. URL http://archive.ics.
uci.edu/ml. 10.5.1
[49] Nick Littlestone and Manfred Warmuth. Relating data compression and learnability. Tech-
nical report, University of California, Santa Cruz, 1986. 7.2, 8, 7.2
[50] Song Liu, Makoto Yamada, Nigel Collier, and Masashi Sugiyama. Change-point detection
in time-series data by relative density-ratio estimation. Neural Networks, 43:72–83, 2013.
9
[51] Alan J Lockett and Risto Miikkulainen. A probabilistic re-formulation of no free lunch:
Continuous lunches are not free. Evolutionary Computation, 2016. 2.3.8, 2.3.9
[52] James AR Marshall and Thomas G Hinton. Beyond no free lunch: realistic algorithms for
arbitrary problem classes. In IEEE Congress on Evolutionary Computation, pages 1–6.
IEEE, 2010. 2.3.7
[53] Marina Meilă. Comparing clusterings by the variation of information. In Bernhard
Schölkopf and Manfred K. Warmuth, editors, Learning Theory and Kernel Machines, vol-
ume 2777 of Lecture Notes in Computer Science, pages 173–187. Springer Berlin Hei-
delberg, 2003. ISBN 978-3-540-40720-1. doi: 10.1007/978-3-540-45167-9_14. URL
http://dx.doi.org/10.1007/978-3-540-45167-9_14. 9.4.2
[54] T.M. Mitchell. Machine Learning. McGraw-Hill International Editions. McGraw-Hill,
1997. ISBN 9780071154673. URL https://books.google.com/books?id=
EoYBngEACAAJ. 7.5
[55] Tom M. Mitchell. The need for biases in learning generalizations. Technical report, Rutgers
University, 1980. 1, 1.1, 2.4.1, 5.7, 5.8, 10.1
[56] Tom M. Mitchell. Generalization as search. Artificial Intelligence, 18(2):203–226, 1982.
1, 2.2
[57] George D. Montañez. Bounding the number of favorable functions in stochastic search. In
Evolutionary Computation (CEC), 2013 IEEE Congress on, pages 3019–3026, June 2013.
doi: 10.1109/CEC.2013.6557937. 1.1, 5.2
[58] George D Montañez, Saeed Amizadeh, and Nikolay Laptev. Inertial hidden Markov models:
Modeling change in multivariate time series. In Twenty-Ninth AAAI Conference on Artificial
Intelligence (AAAI 2015), pages 1819–1825, 2015. 9
[59] Shay Moran and Amir Yehudayoff. Sample compression schemes for VC classes. Journal
of the ACM (JACM), 63(3):21, 2016. 7.2, 7.2
[60] Heinz Mühlenbein, M Schomisch, and Joachim Born. The parallel genetic algorithm as
function optimizer. Parallel computing, 17(6-7):619–632, 1991. 10.5.1
[61] John A Nelder and Roger Mead. A simplex method for function minimization. The Com-
puter Journal, 7(4):308–313, 1965. 10.5.1
[62] Christoph Neukirchen and Gerhard Rigoll. Controlling the complexity of HMM systems
by regularization. Advances in Neural Information Processing Systems, pages 737–743,
1999. 9.6
[63] Lawrence Rabiner. A tutorial on hidden Markov models and selected applications in speech
recognition. Proceedings of the IEEE, 77(2):257–286, 1989. 9.1, 9.6
[64] R Bharat Rao, Diana Gordon, and William Spears. For every generalization action, is there
really an equal and opposite reaction? Analysis of the conservation law for generalization
performance. In Proceedings of the Twelfth International Conference on Machine Learning
(ICML), 1995. 1.1
[65] Bonnie K Ray and Ruey S Tsay. Bayesian methods for change-point detection in long-range
dependent processes. Journal of Time Series Analysis, 23(6):687–705, 2002. 9
[66] BD Ripley. Pattern recognition and neural networks. Cambridge University Press, 1996. 7.6
[67] Jorma Rissanen. Stochastic complexity in statistical inquiry, volume 15. World Scientific,
1998. 7.6
[68] N. Sauer. On the density of families of sets. Journal of Combinatorial Theory, Series A, 13
(1):145–147, 1972. 5.1, 5.1
[69] C. Schaffer. A conservation law for generalization performance. In W. W. Cohen and
H. Hirsh, editors, Proceedings of the Eleventh International Machine Learning Confer-
ence, pages 259–265. Rutgers University, New Brunswick, NJ, 1994. 1, 2.3.2
[70] Cullen Schaffer. Overfitting avoidance as bias. Machine learning, 10(2):153–178, 1993.
2.4.2, 2.4.3, 7.2, 7.6.3
[71] C. Schumacher, MD Vose, and LD Whitley. The no free lunch and problem description
length. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-
2001), pages 565–570, 2001. 1, 2.3.4, 2.3.6, 5.1
[72] Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P Adams, and Nando de Freitas. Taking
the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE,
104(1):148–175, 2016. 10.7
[73] Cosma Rohilla Shalizi. "Occam"-style bounds for long programs, 2016. URL http://bactra.org/notebooks/occam-bounds-for-long-programs.html. [Online; accessed 17-March-2017]. 7.5
[74] Vincent G Sigillito, Simon P Wing, Larrie V Hutton, and Kile B Baker. Classification of
radar returns from the ionosphere using neural networks. Johns Hopkins APL Technical
Digest, 10(3):262–266, 1989. 10.5.1
[75] Marek Sikora and Łukasz Wróbel. Application of rule induction algorithms for analysis
of data collected by seismic hazard monitoring systems in coal mines. Archives of Mining
Sciences, 55(1):91–114, 2010. 10.5.1
[76] Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical Bayesian optimization of
machine learning algorithms. In Advances in Neural Information Processing Systems, pages
2951–2959, 2012. 10, 10.7
[77] Il’ya Meerovich Sobol’. On the distribution of points in a cube and the approximate evalu-
ation of integrals. Zhurnal Vychislitel’noi Matematiki i Matematicheskoi Fiziki, 7(4):784–
802, 1967. 10.5.1
[78] Ray Solomonoff. Does algorithmic probability solve the problem of induction? Information,
Statistics and Induction in Science, pages 7–8, 1996. 2.1
[79] Ray J. Solomonoff. Three kinds of probabilistic induction: Universal distributions and
convergence theorems. The Computer Journal, 51(5):566–570, 2008. doi: 10.1093/comjnl/
bxm120. URL http://comjnl.oxfordjournals.org/content/51/5/566.
abstract. 2.1
[80] Chris Thornton, Frank Hutter, Holger H Hoos, and Kevin Leyton-Brown. Auto-WEKA:
Combined selection and hyperparameter optimization of classification algorithms. In Pro-
ceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and
Data Mining, pages 847–855. ACM, 2013. 5.5, 10, 10.7
[81] Vladimir N Vapnik. An overview of statistical learning theory. IEEE Transactions on Neural
Networks, 10(5):988–999, 1999. 2.1, 4, 4.6, 6, 6.1.1
[82] Vladimir N Vapnik and A Ya Chervonenkis. Necessary and sufficient conditions for the uni-
form convergence of means to their expectations. Theory of Probability & Its Applications,
26(3):532–553, 1982. 6.1
[83] Vladimir N Vapnik and A Ya Chervonenkis. On the uniform convergence of relative fre-
quencies of events to their probabilities. In Measures of Complexity, pages 11–30. Springer,
2015. 6.1
[84] Vladimir N Vapnik. Statistical learning theory, volume 1. Wiley, New York, 1998. 7.6
[85] John Vickers. The problem of induction. In Edward N. Zalta, editor, The Stanford Ency-
clopedia of Philosophy. Spring 2016 edition, 2016. 2.1
[86] Ziyu Wang, Masrour Zoghi, Frank Hutter, David Matheson, and Nando de Freitas.
Bayesian optimization in a billion dimensions via random embeddings. arXiv preprint
arXiv:1301.1942, 2013. 10, 10.7
[87] Geoffrey I Webb. Further experimental evidence against the utility of Occam's razor. Jour-
nal of Artificial Intelligence Research, 4:397–417, 1996. 7.2, 7.6
[88] Wikipedia. Gini coefficient — Wikipedia, the free encyclopedia, 2014. URL http://
en.wikipedia.org/wiki/Gini_coefficient. [Online; accessed 8-June-2014].
9.3.1
[89] Alan S Willsky, Erik B Sudderth, Michael I Jordan, and Emily B Fox. Nonparametric
Bayesian learning of switching linear dynamical systems. In Advances in Neural Informa-
tion Processing Systems, pages 457–464, 2009. 9
[90] D Randall Wilson and Tony R Martinez. Bias and the probability of generalization. In
Intelligent Information Systems, 1997. IIS’97. Proceedings, pages 108–114. IEEE, 1997.
2.4.4
[91] David H Wolpert. On the connection between in-sample testing and generalization error.
Complex Systems, 6(1):47, 1992. 2.3.1, 5.8.3
[92] David H Wolpert. What the no free lunch theorems really mean; how to improve search
algorithms. In Santa Fe Institute Working Paper, page 12, 2012. 2.3.1
[93] David H Wolpert et al. On overfitting avoidance as bias. Technical Report SFI TR 92-03-5001,
The Santa Fe Institute, Santa Fe, NM, 1993. 2.4.3, 7.2
[94] D.H. Wolpert. The supervised learning no-free-lunch theorems. In Proceedings of the 6th
Online World Conference on Soft Computing in Industrial Applications, volume 6, pages
1–20, 2001. 1, 2.3.1, 2.3.2
[95] D.H. Wolpert and W.G. Macready. No free lunch theorems for optimization. IEEE Trans-
actions on Evolutionary Computation, 1(1):67–82, April 1997. doi: 10.1109/4235.585893.
1, 2.3.1, 5.1
[96] Yao Xie, Jiaji Huang, and Rebecca Willett. Change-point detection for high-dimensional
time series with missing data. Selected Topics in Signal Processing, IEEE Journal of, 7(1):
12–27, 2013. 9
[97] Tong Zhang. On the convergence of MDL density estimation. In COLT, volume 3120, pages
315–330. Springer, 2004. 7.6.1