Hyperdimensional Computing - Introduction

DOI 10.1007/s12559-009-9009-8

Abstract  The 1990s saw the emergence of cognitive models that depend on very high dimensionality and randomness. They include Holographic Reduced Representations, Spatter Code, Semantic Vectors, Latent Semantic Analysis, Context-Dependent Thinning, and Vector-Symbolic Architecture. They represent things in high-dimensional vectors that are manipulated by operations that produce new high-dimensional vectors in the style of traditional computing, in what is called here hyperdimensional computing on account of the very high dimensionality. The paper presents the main ideas behind these models, written as a tutorial essay in hopes of making the ideas accessible and even provocative. A sketch of how we have arrived at these models, with references and pointers to further reading, is given at the end. The thesis of the paper is that hyperdimensional representation has much to offer to students of cognitive science, theoretical neuroscience, computer science and engineering, and mathematics.

Keywords  Holographic reduced representation · Holistic record · Holistic mapping · Random indexing · Cognitive code · von Neumann architecture

P. Kanerva, Center for the Study of Language and Information, Stanford University, Stanford, CA 94305, USA. e-mail: [email protected]

Introduction: The Brain as a Computer

In this tutorial essay we address the possibility of understanding brainlike computing in terms familiar to us from conventional computing. To think of brains as computers responsible for human and animal behavior represents a major challenge. No two brains are identical yet they can produce the same behavior—they can be functionally equivalent. For example, we learn to make sense of the world, we learn language, and we can have a meaningful conversation about the world. Even animals without a full-fledged language can learn by observing each other, and they can communicate and function in groups and assume roles as the situation demands.

This means that brains with different "hardware" and internal code accomplish the same computing. Furthermore, the details of the code are established over time through interaction with the world. This is very different from how computers work, where the operations and code are prescribed in detail from the outside by computer-design engineers and programmers.

The disparity in architecture between brains and computers is matched by disparity in performance. Notably, computers excel in routine tasks that we—our brains—accomplish with effort, such as calculation, whereas they are yet to be programmed for universal human traits such as flexible learning, language use, and understanding.

Although the disparity in performance need not be due to architecture, brainlike performance very likely requires brainlike architecture. The opposite is not necessarily true, however: brainlike architecture does not guarantee brainlike "intelligent" behavior, as evidenced by many kinds of mental illness. Thus, we can look at the brain's architecture for clues on how to organize computing. However, to build computers that work at all like brains, we must do more than copy the architecture. We must understand the principles of computing that the architecture serves.
… algorithm for computing the sum is embodied in the design of the circuit.

The details of circuit design—its "logic"—depend crucially on how numbers are represented. Computers usually represent numbers in the binary system, that is, in strings of bits, with each bit being either a 0 or a 1. The logical design of the adder circuit then specifies how each bit of the sum is formed from the bits of the two numbers being added together. When computer engineers speak of logic, it is in this restricted sense of how patterns are transformed by circuits.

The materials that a circuit is made of are incidental—we could say, immaterial. Computer-logic design is therefore abstract and mathematical, and finding suitable materials for implementing a design is a separate field in its own right. This holds a lesson for us who want to understand neural circuits: the logical design can be separated from neural realization. We need all the insight into the brain's circuits and representations that neuroscience can provide, but we must then abstract away from neurotransmitters, ion channels, membrane potentials, and spike trains and face the challenge as a mathematical puzzle driven by the behavior to be reproduced. Such abstraction is essential to the understanding of the underlying principles and to the building of computers based on them.

An Engineering View of Representation

Representation is crucial to traditional computing as illustrated by the following example, and apparently it is equally important to the brain's computing. Computers use binary representation almost exclusively, which means that an individual circuit component has two possible states, usually denoted by 0 and 1. The reason for restricting to only two states has to do with physics: electronic components are most reliable when they are bistable. Richer and more meaningful representations are then gotten by using a set of such binary components. Thus, a representation is the pattern of 0s and 1s on a set of components and it can be thought of as a string of bits or as a binary vector. In terms of computing theory, the binary-based system is fully general.

Representation must satisfy at least one condition: it must discriminate. The bit patterns for different things must differ from one another. Beyond that, how the patterns relate to each other determines their possible use for computing. For example, the representation of numbers must be suited for arithmetic—that is, for computing in the traditional sense of the word. This is accomplished with positional representation: by treating a string of bits as a number in the binary number system. The rules for addition, subtraction, multiplication, and division of binary numbers are relatively simple and are readily expressed as computer circuits.

The choice of representation is often a compromise. The following example illustrates a bias in favor of some operations at the expense of others. Any number (a positive integer) can be represented as the product of its prime factors. That makes multiplication easy—less work than multiplication in base-2 arithmetic—but other operations such as addition become unduly complicated. For overall efficiency in arithmetic, the base-2 system is an excellent compromise. The brain's representations must be subject to similar tradeoffs and compromises.

The brain's representations of number and magnitude are subject to all sorts of context effects, as seen in the kinds of errors we make, and obviously are not optimized for fast and reliable arithmetic. Rather than being a design flaw, however, the context effects very likely reflect a compromise in favor of more vital functions that brains must perform.

The brain's representations are carried on components that by and large are nonbinary. Yet many brainlike context effects can be demonstrated with binary patterns and operations, and there is a good reason to do so in our modeling, namely, the important properties of the representation follow from high dimensionality rather than from the precise nature of the dimensions. When binary is sufficient for demonstrating these properties, we should use it because it is the simplest possible and is an excellent way to show that the system works by general principles rather than by specialized tailoring of individual components.

Since the dimensionality of representation is a major concern in this paper, we need to touch upon dimensionality reduction, which is a standard practice in the processing of high-dimensional data. However, it is also possible that very high dimensionality actually facilitates processing: instead of being a curse, high dimensionality can be a blessing. For example, numbers (i.e., scalars), by definition, are one-dimensional, but in a computer they are represented by strings of bits, that is, by high-dimensional vectors: a 32-bit integer is a 32-dimensional binary vector. The high-dimensional representation makes simple algorithms and circuits for high-precision arithmetic possible. We can contrast this with one-dimensional representation of numbers. The slide rule represents them one-dimensionally and makes calculating awkward and imprecise.

Thus, the dimensionality of an entity (a number) and the dimensionality of its representation for computing purposes (a bit vector) are separate issues. One has to do with existence in the world, the other with the suitability for manipulation by algorithms—that is, suitability for computing. The algorithms discussed in this paper work by virtue of their high (hyper)dimensionality.
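The contrast between an entity's own dimensionality and that of its representation can be made concrete with a small sketch. The following Python fragment is illustrative only (the function names are invented here, not taken from the paper); it simply treats an ordinary integer as a 32-dimensional binary vector and recovers it again.

    # Illustrative only: a one-dimensional entity (an integer) and its
    # 32-dimensional binary-vector representation.
    def int_to_bits(n, width=32):
        """Return the integer n as a list of 0s and 1s, most significant bit first."""
        return [(n >> (width - 1 - i)) & 1 for i in range(width)]

    def bits_to_int(bits):
        """Recover the integer from its bit-vector representation."""
        value = 0
        for b in bits:
            value = (value << 1) | b
        return value

    x = 2009                    # the entity: a number, one-dimensional by definition
    v = int_to_bits(x)          # its representation: a 32-dimensional binary vector
    assert bits_to_int(v) == x  # the representation discriminates and is invertible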
Properties of Neural Representation

Hyperdimensionality

The brain's circuits are massive in terms of numbers of neurons and synapses, suggesting that large circuits are fundamental to the brain's computing. To explore this idea, we look at computing with ultrawide words—that is, with very high-dimensional vectors. How would we compute with 10,000-bit words? How is it like, and unlike, computing with 8-to-64-bit words? What is special about 10,000-bit words compared to 8-to-64-bit words?

Computing with 10,000-bit words takes us into the realm of very high-dimensional spaces and vectors; we will call them hyperdimensional when the dimensionality is in the thousands, and we will use hyperspace as shorthand for hyperdimensional space, and similarly hypervector. In mathematics, "hyperspace" usually means a space with more than three dimensions; in this paper it means a lot more.

The theme of this paper is that hyperspaces have subtle properties on which to base a new kind of computing. This "new" computing could in reality be the older kind that made the human mind possible, which in turn invented computers and computing that now serve as our standard!

High-dimensional modeling of neural circuits goes back several decades under the rubric of artificial neural networks, parallel distributed processing (PDP), and connectionism. The models derive their power from the properties of high-dimensional spaces and they have been successful in tasks such as classification and discrimination of patterns. However, much more can be accomplished by further exploiting the properties of hyperspaces. Here we draw attention to some of those properties.

Robustness

The neural architecture is amazingly tolerant of component failure. The robustness comes from redundant representation, in which many patterns are considered equivalent: they mean the same thing. It is very unlike the standard binary representation of, say, numbers in a computer where a single-bit difference means that the numbers are different—where every bit "counts."

Error-correcting codes of data communications are robust in the sense that they tolerate some number of errors. A remarkable property of hyperdimensional representation is that the number of places at which equivalent patterns may differ can become quite large: the proportion of allowable "errors" increases with dimensionality.

Replication is a simple way to achieve redundancy. Each of the bits in a nonredundant representation, such as a binary number, can be replaced by three bits, all with the same value, letting the majority rule when the three disagree. However, there are much better ways to achieve redundancy and robustness.

Independence from Position: Holistic Representation

Electrical recording from neurons shows that even seemingly simple mental events involve the simultaneous activity of widely dispersed neurons. Finding out directly how the activity is organized is extremely difficult but we can try to picture it by appealing to general principles. For maximum robustness—that is, for the most efficient use of redundancy—the information encoded into a representation should be distributed "equally" over all the components, that is, over the entire 10,000-bit vector. When bits fail, the information degrades in relation to the number of failing bits irrespective of their position. This kind of representation is referred to as holographic or holistic. It is very different from the encoding of data in computers and databases where the bits are grouped into fields for different pieces of information, or from binary numbers where the position of a bit determines its arithmetic value.

Of course, some information in the nervous system is tied to physical location and hence to position within the representation. The closer we are to the periphery—to the sense organs and to muscles and glands—the more clearly the position of an individual component—a neuron—corresponds to a specific part of a sense organ, muscle, or gland. Thus, the position-independence applies to representations at higher, more abstract levels of cognition where information from different senses has been integrated and where some of the more general computing mechanisms come into play.

Randomness

We know from neuroanatomy that brains are highly structured but many details are determined by learning or are left to chance. In other words, the wiring does not follow a minute plan, and so no two brains are identical. They are incompatible at the level of hardware and internal patterns—a mind cannot be "downloaded" from one brain to another.

To deal with the incompatibility of "hardware" and the seeming arbitrariness of the neural code, our models use randomness. The system builds its model of the world from random patterns—that is, by starting with vectors drawn randomly from the hyperspace.

The rationale for this is as follows. If random origins can lead to compatible systems, the incompatibility of hardware ceases to be an issue. The compatibility of systems—and the equivalence of brains—is sought not in the actual patterns of the internal code but in the relation of the patterns to one another within each system.
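The replication scheme just mentioned is easy to make concrete. The sketch below is illustrative only (the helper names are invented here, and Python's standard random module is assumed); it stores each bit three times and lets the majority rule when a copy fails.

    # Illustrative only: redundancy by replication, with majority voting.
    import random

    def triplicate(bits):
        """Store each bit three times."""
        return [b for b in bits for _ in range(3)]

    def majority_decode(coded):
        """Recover each bit by majority rule over its three copies."""
        return [1 if sum(coded[i:i + 3]) >= 2 else 0
                for i in range(0, len(coded), 3)]

    random.seed(0)
    word = [random.randint(0, 1) for _ in range(16)]
    coded = triplicate(word)
    coded[5] ^= 1                            # one of the three copies of bit 1 fails
    assert majority_decode(coded) == word    # the majority votes the failure out

Hyperdimensional representation achieves a similar tolerance without grouping bits into triples: the redundancy is spread "equally" over the entire vector, as in the holistic representation described above.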
Language is a prime example of a system like that at a higher level: we can say the same thing in different languages in spite of their different grammars and vocabularies. Likewise at the level of the internal code, the patterns for girl and boy, for example, should be more similar than the patterns for girl and locomotive in the same system, whereas the patterns for girl in different systems need not bear any similarity to each other. Examples of such model building will be given below.

Randomness has been a part of artificial neural systems from the start. Self-organizing feature maps and the Boltzmann machine are good examples. We can think of randomness as the path of least assumptions. A system that works in spite of randomness is easy to design and does not necessarily require randomness. The randomness assumption is also used as a means to simplify the analysis of a system's performance.

Hyperdimensional Computer

Notation: Mathematics will be displayed as follows: lowercase for scalars, variables, relations, and functions (a, x, f), Latin uppercase for vectors (A, X), and Greek uppercase for (permutation) matrices (P, C). Letters are chosen to be mnemonic when possible (A for address, G for grandmother). The order of operations when not shown by parentheses is the following: multiplication by matrix first (PA), then multiplication by vector (XOR, *), and finally addition (+).

Hyperdimensional Representation

The units with which a computer computes make up its space of representations. In ordinary computers, the space is that of relatively low-dimensional binary vectors. The memory is commonly addressed in units of eight-bit bytes, and the arithmetic operations are commonly done in units of 32-bit words. A computer with a 32-bit ALU and up to 4 GB of memory can be thought of as having 32-bit binary vectors as its representational space, denoted mathematically by {0, 1}^32. These are the building blocks from which further representations are made.

Hyperdimensional representational spaces can be of many kinds: the vector components can be binary, ternary, real, or complex. They can be further specified as to sparseness, range of values, and probability distribution. For example, the space of n-dimensional vectors with i.i.d. components drawn from the normal distribution with mean 0 and variance 1/n was originally used. A cognitive system can include several representational spaces. One kind may be appropriate for modeling a sensory system and another for modeling language.

Important properties of hyperdimensional representation are demonstrated beautifully with 10,000-bit patterns, that is, with 10,000-dimensional binary vectors. The representational space then consists of all 2^10,000 such patterns—also called points of the space. That is truly an enormous number of possible patterns; any conceivable system would ever need but an infinitesimal fraction of them as representations of meaningful entities.

Our experience with three-dimensional space does not prepare us to intuit the shape of this hyperspace and so we must tease it out with analysis, example, and analogy. Like the corner points of an ordinary cube, the space looks identical from any of its points. That is to say, if we start with any point and measure the distances to all the other points, we always get the same distribution of distances. In fact, the space is nothing other than the corners of a 10,000-dimensional unit (hyper)cube.

We can measure distances between points in Euclidean or Hamming metric. For binary spaces the Hamming distance is the simplest: it is the number of places at which two binary vectors differ, and it is also the length of the shortest path from one corner point to the other along the edges of the hypercube. In fact, there are k! such shortest paths between two points that are k bits apart. Naturally, the maximum Hamming distance is 10,000 bits, from any point to its opposite point. The distance is often expressed relative to the number of dimensions, so that here 10,000 bits equals 1.

Although the points are not concentrated or clustered anywhere in the space—because every point is just like every other point—the distances are highly concentrated half-way into the space, or around the distance of 5,000 bits, or 0.5. It is easy to see that half the space is closer to a point than 0.5 and the other half is further away, but it is somewhat surprising that less than a millionth of the space is closer than 0.476 and less than a thousand-millionth is closer than 0.47; similarly, less than a millionth is further than 0.524 away and less than a thousand-millionth is further than 0.53. These figures are based on the binomial distribution with mean 5,000 and standard deviation (STD) 50, and on its approximation with the normal distribution—the distance from any point of the space to a randomly drawn point follows the binomial distribution. These distance ranges give the impression that a 600-bit wide "bulge" around the mean distance of 5,000 bits contains nearly all of the space! In other words, if we take two vectors at random and use them to represent meaningful entities, they differ in approximately 5,000 bits, and if we then take a third vector at random, it differs from each of the first two in approximately 5,000 bits. We can go on taking vectors at random without needing to worry about running out of vectors—we run out of time before we run out of vectors. We say that such vectors are unrelated.
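These figures are easy to check numerically. The following sketch is not from the paper; it assumes NumPy and samples only a couple of thousand points, but it shows the concentration of distances around 5,000 bits with a standard deviation of about 50.

    # Illustrative only: Hamming distances between random 10,000-bit vectors
    # concentrate tightly around 5,000 bits (relative distance 0.5).
    import numpy as np

    rng = np.random.default_rng(1)
    n, trials = 10_000, 2_000
    point = rng.integers(0, 2, size=n)              # one fixed random point of the space
    others = rng.integers(0, 2, size=(trials, n))   # randomly drawn points
    dists = (others != point).sum(axis=1)           # Hamming distances to the fixed point

    print(dists.mean())                             # close to 5,000
    print(dists.std())                              # close to 50 = sqrt(10,000 * 0.5 * 0.5)
    print((abs(dists - 5_000) > 300).mean())        # essentially nothing outside the 600-bit "bulge"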
Measured in standard deviations, the bulk of the space, and the unrelated vectors, are 100 STDs away from any given vector.

This peculiar distribution of the space makes hyperdimensional representation robust. When meaningful entities are represented by 10,000-bit vectors, many of the bits can be changed—more than a third—by natural variation in stimulus and by random errors and noise, and the resulting vector can still be identified with the correct one, in that it is closer to the original "error-free" vector than to any unrelated vector chosen so far, with near certainty.

The robustness is illustrated further by the following example. Let us assume that two meaningful vectors A and B are only 2,500 bits apart—when only 1/4 of their bits differ. The probability of this happening by chance is about zero, but a system can create such vectors when their meanings are related; more on such relations will be said later. So let us assume that 1/3 of the bits of A are changed at random; will the resulting "noisy" A vector be closer to B than to A—would it be falsely identified with B? It is possible but most unlikely because the noisy vector would be 4,166 bits away from B, on the average, and only 3,333 bits from A; the difference is 17 STDs. The (relative) distance from the noisy A vector to B is given by d + e − 2de with d = 1/4 and e = 1/3. Thus, adding e amount of noise to the first vector increases the distance to the second vector by (1 − 2d)e on the average. Intuitively, most directions that are away from A in hyperspace are also away from B.

The similarity of patterns is the flip-side of distance. We say that two patterns, vectors, points are similar to each other when the distance between them is considerably smaller than 0.5. We can now describe points of the space and their neighborhoods as follows. Each point has a large "private" neighborhood in terms of distance: the volume of space within, say, 1/3 or 3,333 bits is insignificant compared to the total space. The rest of the space—all the unrelated "stuff"—becomes significant only when the distance approaches 0.5. In a certain probabilistic sense, then, two points even as far as 0.45 apart are very close to each other. Furthermore, the "private" neighborhoods of any two unrelated points have points in common—there are patterns that are closely related to any two unrelated patterns. For example, a point C half-way between unrelated points A and B is very closely related to both, and another half-way point D can be unrelated to the first, C. This can be shown with as few as four dimensions: A = 0000, B = 0011, C = 0001, and D = 0010. However, the "unusual" probabilities implied by these relative distances require high dimensionality. This is significant when representing objects and concepts with points of the hyperspace, and significantly different from what we are accustomed to in ordinary three-dimensional space.

In addition to being related by similarity, patterns can relate to each other by transformation—that is, by how one is transformed into another or how several patterns are combined to form a new pattern, in a kind of pattern arithmetic. This is analogous to what ordinary computers do: new patterns are created from existing ones by arithmetic operations that are built into the computer's circuits. This way of interpreting the neural code is mostly unexplored. We have much to say about it below.

Hyperdimensional Memory

Memory is a vital part of an ordinary computer, and we would expect that something like it would also be a part of any computer for emulating cognition. An ordinary computer memory is an array of addressable registers, also called memory locations. Each location holds a string of bits of a fixed length; the length is called the word size. The contents of a location are made available for processing by probing the memory with the location's address, which likewise is a string of bits. An n-bit address can access a memory with 2^n locations, with memories of 2^30 or a thousand million eight-bit wide locations becoming more and more common.

It is possible to build a memory for storing 10,000-bit vectors that is also addressed with 10,000-bit vectors, although 2^10,000 locations is far too many ever to be built or needed. In artificial neural-net research they are called associative memories. An associative memory can work somewhat like an ordinary computer memory in that when the pattern X is stored using the pattern A as the address, X can later be retrieved by addressing the memory with A. Furthermore, X can be retrieved by addressing the memory with a pattern A′ that is similar to A.

This mode of storage is called heteroassociative, to be contrasted with autoassociative. Both are based on the same mechanism, the difference being that autoassociative storage is achieved by storing each pattern X using X itself as the address. This may appear silly but in fact is useful because it allows the original stored X to be recovered from an approximate or noisy version of it, X′, thus making the memory robust. Such recovery typically takes several iterations (fewer than ten) where the address X′ is used to retrieve X″, which is used to retrieve X‴, … as the process converges to X. However, if the amount of noise is too great—if X′ is too far from X—the original X will not be recovered. The pattern X is called a point attractor, the region of space surrounding it is called the basin of attraction, and the memory is referred to as content addressable.

The same kind of iteration to a noise-free X is not possible in heteroassociative storage. If the memory is probed with a noisy address A′, the retrieved pattern X′ will usually have some noise relative to X. If the memory is then addressed with X′, there is no guarantee that anything useful will be retrieved.
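An autoassociative clean-up of this kind can be imitated, crudely, by a brute-force nearest-neighbor search. The sketch below is not the paper's memory model; it assumes NumPy and simply scans all stored vectors, but it shows that a 10,000-bit pattern with a third of its bits flipped is still identified with the original.

    # Illustrative only: a brute-force stand-in for an item (clean-up) memory.
    import numpy as np

    rng = np.random.default_rng(2)
    n = 10_000
    stored = rng.integers(0, 2, size=(100, n))       # 100 meaningful patterns

    def clean_up(probe, memory):
        """Return the stored vector nearest to the probe in Hamming distance."""
        dists = (memory != probe).sum(axis=1)
        return memory[int(np.argmin(dists))]

    x = stored[7]
    flip = rng.random(n) < 1 / 3                     # flip about a third of the bits
    noisy_x = np.where(flip, 1 - x, x)
    assert np.array_equal(clean_up(noisy_x, stored), x)   # still closest to the original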
We therefore envisage a cognitive computing architecture that relies primarily on autoassociative memory. It will serve as an item memory or clean-up memory, which is discussed below.

Hyperdimensional Arithmetic

The ALU is an essential part of a computer. It has the circuits for the computer's built-in operations—its inherent capabilities. For example, it has the adder circuit that produces the sum—a binary string for the sum—of two numbers given to it as arguments. The ALU is a transformer of bit patterns.

The idea that brains, too, compute with a set of built-in operations is sound, although trying to locate the equivalent of an ALU seems foolish, and so we will merely look for operations on hyperdimensional patterns that could be used for computing. We will view the patterns as vectors because we can then tap into the vast body of knowledge about vectors, matrices, linear algebra, and beyond. This indeed has been the tradition in artificial neural-net research, yet rich areas of high-dimensional representation remain to be explored. By being thoroughly mathematical, such exploration may seem peripheral to neuroscience, but the shared goal of understanding the brain's computing can actually make it quite central. Time will tell.

We will start with some operations on real vectors (vectors with real-number components), which are commonly used in artificial neural-net research.

Weighting with a constant is a very basic operation that is often combined with other, more complex operations, such as addition. The math is most simple: each component of the vector is multiplied with the same number, and the result is a vector.

The comparison of two vectors (e.g., with the cosine) is another basic operation, and the resulting measure of similarity, a number, is often used as a weighting factor in further computations.

A set of vectors can be combined by componentwise addition, resulting in a vector of the same dimensionality. To conform to the distributional assumptions about the representation, the arithmetic-sum-vector is normalized, yielding a mean vector. It is this mean-vector that is usually meant when we speak of the sum of a set of vectors. The simplest kind of normalization is achieved with weighting. Other kinds are achieved with other transformations of the vector components, for example by applying a threshold to get a binary vector.

The sum (and the mean) of random vectors has the following important property: it is similar to each of the vectors being added together. The similarity is very pronounced when only a few vectors are added and it plays a major role in artificial neural-net models. The sum-vector is a possible representation for the set that makes up the sum.

Subtracting one vector from another is accomplished by adding the vector's complement. The complement of a real vector is gotten by multiplying each component with −1, and of a binary vector by flipping its bits (turning 0s into 1s and 1s into 0s).

Multiplication comes in several forms, the simplest being weighting, when a vector is multiplied with a number as described above. Two vectors can be multiplied to form a number, called the inner product, that can be used as a measure of similarity between the vectors. The cosine of two vectors is a special case of their inner product. Another way of multiplying two vectors yields a matrix called the outer product. It is used extensively for adjusting the weights of a network and thus plays an important role in many learning algorithms. Multiplication of a vector with a matrix, resulting in a vector, is yet another kind, ubiquitous in artificial neural nets. Usually the result from a matrix multiplication needs to be normalized; normalizing was mentioned above. Permutation is the shuffling of the vector components and it can be represented mathematically by multiplication with a special kind of matrix, called the permutation matrix, that is filled with 0s except for exactly one 1 in every row and every column.

The above-mentioned examples of multiplication differ from addition in one important respect: they are heterogeneous, in that besides vectors they involve numbers and matrices. In contrast, addition is homogeneous, as all participants are vectors of the same kind: we start with vectors and end up with a vector of the same dimensionality.

A much more powerful representational system becomes possible when the operations also include multiplication that is homogeneous—in mathematical terms, when the system is closed under both addition and multiplication. Further desiderata include that the

– multiplication is invertible, i.e., no information is lost,
– multiplication distributes over addition,
– multiplication preserves distance and, as a rule,
– product is dissimilar to the vectors being multiplied.

The product's being dissimilar is in contrast with the sum that is similar to the vectors that are added together. These desired properties of multiplication make it possible to encode compositional structure into a hypervector and to analyze the contents of composed hypervectors, as will be seen below. We now merely state that multiplication operations of that kind exist for binary, real, and complex vectors, and will discuss them later.

The above-mentioned examples of vector arithmetic suggest that computing in hyperdimensional representation—with large random patterns—can be much like conventional computing with numbers.
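The contrast between the sum and such a product can be seen in a few lines of code. The sketch is illustrative only; it assumes NumPy, uses majority-rule normalization, and borrows the XOR multiplication that is introduced later in the paper.

    # Illustrative only: the normalized sum of a few random 10,000-bit vectors is
    # similar to each of them; a product is dissimilar to the vectors multiplied.
    import numpy as np

    rng = np.random.default_rng(3)
    n = 10_000
    A, B, C = rng.integers(0, 2, size=(3, n))

    mean_vec = ((A + B + C) >= 2).astype(int)     # normalized sum: majority rule
    product = A ^ B                               # XOR multiplication (introduced below)

    rel = lambda u, v: (u != v).mean()            # relative Hamming distance
    print(rel(mean_vec, A), rel(mean_vec, B), rel(mean_vec, C))   # about 0.25: similar
    print(rel(product, A), rel(product, B))                       # about 0.5: dissimilar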
We will next look at how the various operations can be used to build a system of internal representations—what can be called a cognitive code. One example has already been mentioned, namely, that a sum-vector can represent a set. The cognitive equivalence of brains should then be sought in part in how representations are computed from one another rather than what the specific activity patterns, the exact vectors, are. Thus we can think of hyperdimensional random vectors as the medium that makes certain kinds of computing possible.

Constructing a Cognitive Code

Conventional computing uses a uniform system for representation that allows different kinds of entities to be represented in the same way. This is accomplished with pointers, which are addresses into memory; they are also numbers that can take part in arithmetic calculations. Pointers are the basis of symbolic computing.

Corresponding to traditional pointers we have hypervectors, corresponding to traditional memory we have content-addressable memory for hypervectors, and corresponding to the ALU operations we have hyperdimensional arithmetic. How might we use them for building a representational system for entities of various kinds?

Representing Basic Entities with Random Vectors

Classical formal systems start with a set of primitives, that is, with "individuals" or "atoms" and predicates, and build up a universe of discourse by using functions, relations, first-order logic, quantification, and other such means. We will borrow from this tradition and assume a world with basic atomic entities. This assumption, however, is for convenience—it is to get our representation story underway—rather than a commitment to a world with basic atomic entities for cognitive systems to discover and deal with.

The smallest meaningful unit of the cognitive code is a large pattern, a hypervector, a point in hyperspace. The atomic entities or individuals are then represented by random points of the space. In fact, when we need to represent anything new that is not composed of things already represented in the system, we simply draw a vector at random from the space. When a vector is chosen to represent an entity in the system, it is stored in the item memory for later reference.

Because of hyperdimensionality, the new random vector will be unrelated to all the vectors that already have meaning; its distance from all of them is very close to 5,000 bits. In mathematical terms, it is approximately orthogonal to the vectors that are already in use. A 10,000-dimensional space has 10,000 orthogonal vectors but it has a huge number of nearly orthogonal vectors. The ease of making nearly orthogonal vectors is a major reason for using hyperdimensional representation.

Item Memory
… element that is found will then be subtracted off the sum-vector and the difference-vector is used to probe the item memory, to recover another of the set's elements. The process is repeated to recover more and more of the set's elements. However, only small sets can be analyzed into their elements in this way, and slightly larger sets can be, by accumulating a (partial) sum from the vectors recovered so far and subtracting it from the original (total) sum before probing for the next element. However, if the unmapped sum has been stored in the item memory, this method fails because probing the (autoassociative) memory with the sum will always retrieve the sum rather than any of its elements.

It is also possible to find previously stored sets (i.e., sums) that contain a specific element by probing the memory with that element (with its vector). Before probing, the element must be mapped into the same part of space—with the same mapping—as sums are before they are stored. As mentioned above, after one vector has been recovered, it can be subtracted off the probe and the memory can be reprobed for another set that would contain that particular element.

Besides being unordered, the strict notion of a set implies that no element is duplicated, and thus a set is an enumeration of the kinds of elements that went into it. A slightly more general notion is multiset, also called a bag. It, too, is unordered, but any specific kind of element can occur multiple times. We might then say that a set is a collection of types whereas a multiset is a collection of tokens.

A multiset can be represented in the same way as a set, by the sum of the multiset's elements, and elements can be extracted from the sum also in the same way. In this case, the frequent elements would be the first ones to be recovered, but reconstructing the entire multiset from this representation would be difficult because there is no reliable way to recover the frequencies of occurrence. For example, the normalized sum is not affected by doubling the counts of all the elements in the multiset.

Two Kinds of Multiplication, Two Ways to Map

Existing patterns can give rise to new patterns by mappings of various kinds, also called functions. One example of a function has already been discussed at length: the (componentwise) addition of two or more vectors that produces a sum-vector or a mean-vector. The following discussion about multiplication is in terms of binary vectors, although the ideas apply much more generally.

Multiplication by Vector

A very basic and simple multiplication of binary vectors is by componentwise Exclusive-Or (XOR). The XOR of two vectors has 0s where the two agree and it has 1s where they disagree. For example, 0011…10 XOR 0101…00 = 0110…10. Mathematically, the XOR is the arithmetic sum modulo 2. The (1, −1)-binary system, also called bipolar, is equivalent to the (0, 1)-binary system when the XOR is replaced by ordinary multiplication. We will use the notation A * B for the multiplication of the vectors A and B—for their product-vector. Here * is the XOR unless otherwise noted.

The XOR commutes, A * B = B * A, and is its own inverse so that A * A = O, where O is the vector of all 0s (in algebra terms O is the unit vector because A * O = A). Since the XOR-vector has 1s where the two vectors disagree, the number of 1s in it is the Hamming distance between the two vectors. By denoting the number of 1s in a binary vector X with |X| we can write the Hamming distance d between A and B as d(A, B) = |A * B|.

Multiplication can be thought of as a mapping of points in the space. Multiplying the vector X with A maps it to the vector X_A = A * X, which is as far from X as there are 1s in A (i.e., d(X_A, X) = |X_A * X| = |(A * X) * X| = |A * X * X| = |A|). If A is a typical (random) vector of the space, about half of its bits are 1s, and so X_A is in the part of the space that is unrelated to X in terms of the distance criterion. Thus we can say that multiplication randomizes.

Mapping with multiplication preserves distance. This is seen readily by considering X_A = A * X and Y_A = A * Y, taking their XOR, and noting that the two As cancel out thus:

X_A * Y_A = (A * X) * (A * Y) = A * X * A * Y = X * Y

Since the XOR-vector is the same, the Hamming distance is the same: |X_A * Y_A| = |X * Y|. Consequently, when a set of points is mapped by multiplying with the same vector, the distances are maintained—it is like moving a constellation of points bodily into a different (and indifferent) part of the space while maintaining the relations (distances) between them. Such mappings could play a role in high-level cognitive functions such as analogy and the grammatical use of language where the relations between objects are more important than the objects themselves.

In the above-mentioned example, we think of the vector A as a mapping applied to vectors X and Y. The same math applies if we take two mappings A and B and look at their effect on the same vector X: X will be mapped onto two vectors that are exactly as far from each other as mapping A is from mapping B. Thus, when vectors represent mappings, we can say that the mappings are similar when the vectors are similar; similar mappings map any vector to two similar vectors. Notice that any of the 2^10,000 vectors of the representational space is potentially a mapping, so that what was said above about the similarity of vectors in the space holds equally for the similarity of mappings.
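These properties of the XOR multiplication (randomizing, invertible, distance preserving) can be checked directly. The sketch assumes NumPy and is illustrative only:

    # Illustrative only: multiplication by XOR randomizes, inverts itself,
    # and preserves Hamming distance.
    import numpy as np

    rng = np.random.default_rng(4)
    n = 10_000
    A, X, Y = rng.integers(0, 2, size=(3, n))

    XA, YA = A ^ X, A ^ Y                  # map X and Y with the same vector A
    rel = lambda u, v: (u != v).mean()     # relative Hamming distance

    print(rel(XA, X))                      # about 0.5: the product is unrelated to X
    assert np.array_equal(A ^ XA, X)       # invertible: A * (A * X) = X
    assert rel(XA, YA) == rel(X, Y)        # distance preserved exactly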
Because multiplication preserves distance it also preserves noise: if a vector contains a certain amount of noise, the result of mapping it contains exactly the same noise. If each of the multiplied vectors contains independent random noise, the amount of noise in the product—its distance to the noise-free product-vector—is given by e = f + g − 2fg, where f and g are the relative amounts of noise in the two vectors being multiplied.

A very useful property of multiplication is that it distributes over addition. That means, for example, that

A * [X + Y + Z] = [A * X + A * Y + A * Z]

The brackets […] stand for normalization. Distributivity is invaluable in analyzing these representations and in understanding how they work and fail.

Distributivity for binary vectors is most easily shown when they are bipolar. The vector components then are 1s and −1s, the vectors are added together into an ordinary arithmetic-sum-vector, and the (normalized) bipolar-sum-vector is gotten by considering the sign of each component (the signum function). The XOR becomes now ordinary multiplication (with 1s and −1s), and since it distributes over ordinary addition, it does so also in this bipolar case. If the number of vectors added together is even, we end up with a ternary system unless we break the ties, for example, by adding a random vector.

Permutations reorder the vector components and thus are very simple; they are also very useful in constructing a cognitive code. We will denote the permutation of a vector with a multiplication by a matrix (the permutation matrix P), thus X_P = PX. We can also describe the permutation of n elements as the list of the integers 1, 2, 3, …, n in the permuted order. A random permutation is then one where the order of the list is random—it is a permutation chosen randomly from the n! possible permutations.

As a mapping operation, permutation resembles vector multiplication: (1) it is invertible, (2) it distributes over addition—in fact, it distributes over any componentwise operation including multiplication with the XOR—and as a rule (3) the result is dissimilar to the vector being permuted. Because permutation merely reorders the coordinates, (4) the distances between points are maintained just as they are in multiplication with a vector, thus PX * PY = P(X * Y) and d(PX, PY) = |PX * PY| = |P(X * Y)| = |X * Y| = d(X, Y).

Although permutations are not elements of the space of representations (they are not n-dimensional hypervectors), they have their own rules of composition—permutations are a rich mathematical topic in themselves—and they can be assessed for similarity by how they map vectors. As mentioned above, we can map the same vector with two different permutations and ask how similar the resulting vectors are: by permuting X with P and C, what is the distance between PX and CX—that is, what can we say of the vector Z = PX * CX? Unlike above with multiplication by a vector, this depends on the vector X (e.g., the 0-vector is unaffected by permutation), so we will consider the effect on a typical X of random 0s and 1s, half of each. Wherever the two permutations (represented as lists of integers) agree, they move a component of X to the same place, making that bit of Z a 0; let us denote the number of such places with a. In the n − a remaining places where the two permutations disagree, the bits of PX and CX come from different places in X and thus their XOR is a 1 with probability 1/2. We then have that the expected number of 1s in Z equals (n − a)/2. If the permutations P and C are chosen at random, they agree in only one position (a = 1) on the average, and so the distance between PX and CX is approximately 0.5; random permutations map a given point to (in)different parts of the space. In fact, pairs of permutations (of 10,000 elements) that agree in an appreciable number of places are extremely rare among all possible pairs of permutations. Thus we can say that, by being dissimilar from one another, random permutations randomize, just as does multiplying with random vectors as seen above.

Sequences are all-important for representing things that occur in time. We can even think of the life of a system as one long sequence—the system's individual history—where many subsequences repeat approximately. For a cognitive system to learn from experience it must be able to store and recall sequences.

One possible representation of sequences is with pointer chains or linked lists in an associative memory. The sequence of patterns ABCDE… is stored by storing the pattern B using A as the address, by storing C using B as the address, by storing D using C as the address, and so forth; this is a special case of heteroassociative storage. Probing the memory with A will then recall B, probing it with B will recall C, and so forth. Furthermore, the recall can start from anywhere in the sequence, proceeding from there on, and the sequence can be retrieved even if the initial probe is noisy, as subsequent retrievals will converge to the noise-free stored sequence in a manner resembling convergence to a fixed point in an autoassociative memory.

Although straightforward and simple, this way of representing sequences has its problems. If two sequences contain the same pattern, progressing past it is left to chance. For example, if ABCDE… and XYCDZ… have been stored in memory and we start the recall with A, we would recall BCD reliably but could thereafter branch off to Z because D would point somewhere between E and Z.
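A random permutation's behavior as a mapping can likewise be verified numerically. The following sketch assumes NumPy and represents the permutation as an index array rather than a matrix; it is illustrative only.

    # Illustrative only: a random permutation is invertible, distributes over XOR,
    # preserves distance, and maps a vector to an unrelated part of the space.
    import numpy as np

    rng = np.random.default_rng(5)
    n = 10_000
    perm = rng.permutation(n)              # a random permutation of the coordinates
    inv = np.argsort(perm)                 # its inverse
    X, Y = rng.integers(0, 2, size=(2, n))

    PX, PY = X[perm], Y[perm]
    rel = lambda u, v: (u != v).mean()     # relative Hamming distance

    assert np.array_equal(PX[inv], X)                  # (1) invertible
    assert np.array_equal(PX ^ PY, (X ^ Y)[perm])      # (2) distributes over XOR
    assert rel(PX, PY) == rel(X, Y)                    # (4) distance preserved
    print(rel(PX, X))                                  # (3) about 0.5: result dissimilar to X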
Clearly, more of the history is needed for deciding where to go from D. Longer histories can be included by storing links that skip over elements of the sequence (e.g., by storing E using B as the address) and by delaying their retrieval according to the number of elements skipped. The element evoked by the more distant past would then bias the retrieval toward the original sequence.

Representing Sequences by Permuting Sums

As with sets, several elements of a sequence can be represented in a single hypervector. This is called flattening or leveling the sequence. However, sequences cannot be flattened with the sum alone because the order of the elements would be lost. Before computing the vector sum, the elements must be "labeled" according to their position in the sequence so that X one time step ago appears different from the present X, and so that the vectors for AAB and ABA will be different. Such labeling can be done with permutations.

Let us first look at one step of a sequence, for example, that D is followed by E. This corresponds to one step of heteroassociative storage, which was discussed above. The order of the elements can be captured by permuting one of them before computing their sum. We will permute the first and represent the pair with the sum

S = PD + E

and we will store S in the item memory. The entire sequence can then be stored by storing each of its elements and each two-element sum such as S above in the item memory. If we later encounter D we can predict the next element by probing the memory not with D itself but with a permuted version of it, PD. It will retrieve S by being similar to it. We can then retrieve E by subtracting PD from S and by probing the memory with the resulting vector.

Here we have encoded the sequence step DE so that the previous element, D, can be used to retrieve the next, E. However, we can also encode the sequence so that the two previous elements C and D are used for retrieving E. In storing the sequence we merely substitute the encoding of CD for D, that is to say, we replace D with PC + D. After the substitution, the S of the preceding paragraph becomes S = P(PC + D) + E = PPC + PD + E, which is stored in memory. When CD is subsequently encountered, it allows us to make the probe PPC + PD which will retrieve S as above, which in turn is used to retrieve E as above.

We can go on like this, including more and more elements of the sequence in each stored pattern and thereby including more and more of the history in them and in their retrieval. Thus, with one more element included in the history, the vector that is stored in the item memory encodes the sequence BCDE with S = PPPB + PPC + PD + E, and later, when we encounter BCD, we would start the retrieval of E by probing the item memory with PPPB + PPC + PD. By now the stored vectors contain enough information to discriminate between ABCDE and XYCDZ so that E will be retrieved rather than Z.

Even if it is possible to encode ever longer histories into a single vector, the prediction of the next element does not necessarily keep on improving. For example, if the sequence is kth order Markov, encoding more than k + 1 elements into a single vector weakens the prediction. Furthermore, the capacity of a single binary vector sets a limit on the length of history that it can represent reliably. How best to encode the history for the purposes of prediction depends of course on the statistical nature of the sequence.

A simple recurrent network can be used to produce flattened histories of this kind if the history at one moment is permuted and then fed back and added to the vector for the next moment. By normalizing the vector after each addition we actually get a flattened history that most strongly reflects the most recent past and is unaffected by the distant past. If we indicate normalization with brackets, the sequence ABCDE will give rise to the sum

S = P[P[P[PA + B] + C] + D] + E
  = [[[PPPPA + PPPB] + PPC] + PD] + E

The last element E has equal weight to the history up to it, irrespective of the length of the history—the distant past simply fades away. Some kind of weighting may be needed to keep it from fading too fast, the proper rate depending on the nature of the sequence. As mentioned before, the various permutations keep track of how far back in the sequence each specific element occurs without affecting the relative contribution of that element.

Several remarks about permutations are in order. An iterated permutation, such as PPP above, is just another permutation, and if P is chosen randomly, iterated versions of it appear random to each other with high probability. However, all permutations are made of loops in which bits return to their original places after some number of iterations (every bit returns at least once in n iterations), and so some care is needed to guarantee permutations with good loops. Pseudorandom-number generators are one-dimensional analogs. The simpler ones get the next number by multiplying the previous number with a constant and truncating the product to fit the computer's word—they lop off the most significant bits. Such generators necessarily run in loops, however long. Incidentally, the random permutations of our computer simulations are made with random-number generators.
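A small simulation can make the history encoding concrete. The sketch below is not the paper's simulation: it assumes NumPy, the helper names (P, maj, encode) are invented here, and instead of subtracting the probe from the retrieved sum it simply takes the nearest stored pattern and then the nearest letter to it. With those shortcuts it still discriminates ABCDE from XYCDZ.

    # Illustrative only: store 4-grams of two sequences as permuted sums and
    # predict what follows "BCD".
    import numpy as np

    rng = np.random.default_rng(6)
    n = 10_000
    letters = {ch: rng.integers(0, 2, size=n) for ch in "ABCDEXYZ"}
    tie = rng.integers(0, 2, size=n)                 # random tie-breaker for even sums
    perm = rng.permutation(n)

    def P(v, times=1):                               # apply the permutation repeatedly
        for _ in range(times):
            v = v[perm]
        return v

    def maj(vectors):                                # normalized (majority-rule) sum
        vs = list(vectors) + ([tie] if len(vectors) % 2 == 0 else [])
        return (np.sum(vs, axis=0) * 2 > len(vs)).astype(int)

    def encode(w1, w2, w3, w4):                      # S = PPPw1 + PPw2 + Pw3 + w4
        return maj([P(letters[w1], 3), P(letters[w2], 2), P(letters[w3]), letters[w4]])

    memory = {s[i:i+4]: encode(*s[i:i+4]) for s in ("ABCDE", "XYCDZ") for i in range(2)}

    probe = maj([P(letters["B"], 3), P(letters["C"], 2), P(letters["D"])])
    best = min(memory, key=lambda k: (memory[k] != probe).sum())
    next_letter = min(letters, key=lambda ch: (letters[ch] != memory[best]).sum())
    print(best, next_letter)                         # expect "BCDE" and "E"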
A feedback circuit for a permutation is particularly simple: one wire goes out of each component of the vector and one wire comes back in, the pairing is random, and the outgoing signal is fed back after one time-step delay. The inverse permutation has the same connections taken in the opposite direction.

Representing Pairs with Vector Multiplication

A pair is a basic unit of association, when two elements A and B correspond to each other. Pairs can be represented with multiplication: in C = A * B the vector C represents the pair. If we know the product C and one of its elements, say A, we can find the other by multiplying C with the inverse of A.

The XOR as the multiplication operation can "overperform" because it both commutes (A XOR B = B XOR A) and is its own inverse (A XOR A = O). For example, any pair of two identical vectors will be represented by the 0-vector. This can be avoided with a slightly different multiplication that neither commutes nor is a self-inverse. As with sequences, we can encode the order of the operands by permuting one of them before combining them. By permuting the first we get

C = A * B = PA XOR B

This kind of multiplication has all the desired properties: (1) it is invertible although the right- and the left-inverse operations are different, (2) it distributes over addition, (3) it preserves distance, and (4) the product is dissimilar to both A and B. We can extract the first element from C by canceling out the second and permuting back (right-inverse of *),

P⁻¹(C XOR B) = P⁻¹((PA XOR B) XOR B) = P⁻¹PA = A

where P⁻¹ is the inverse permutation of P, and we can extract the second element by canceling out the permuted version of the first (left-inverse of *),

PA XOR C = PA XOR (PA XOR B) = B

Because of the permutation, however, this multiplication is not associative: (A * B) * C ≠ A * (B * C). For simplicity in the examples that follow, the multiplication operator * will be the XOR.

Representing Bindings with Pairs

In traditional computing, memory locations—their addresses—represent variables and their contents represent values. The values are set by assignment, and we say that it binds a variable to a value. A number stored in one location can mean age and the same number—the same bit pattern—stored in another location can mean distance. Thus the meaningful entity is the address–value pair. The value can be many other things besides a number. In particular, it can be a memory address. Data structures are built from such pairs and they are the basis of symbolic representation and processing.

In holistic representation, the variable, the value, and the bound pair are all hypervectors of the same dimensionality. If X is the vector for the variable x and A is the vector for the value a, then the bound pair x = a can be represented by the product-vector X * A. It is dissimilar to both X and A but either one can be recovered from it given the other. Unbinding means that we take the vector for the bound pair and find one of its elements, say A, by multiplying with the other, as seen above. In cognitive modeling, variables are often called roles and values are called fillers.

Representing Data Records with Sets of Bound Pairs

Complex objects of ordinary computing are represented by data records composed of fields, and by pointers to such data records. Each field in the record represents a variable (a role). The roles are implicit—they are implied by the location of the field in the record. Holistic representation makes the roles explicit by representing them with vectors. Vectors for unrelated roles, such as name and age, can be chosen at random. The role x with the filler a, i.e., x = a, will then be represented by X * A as shown above.

A data record combines several role–filler pairs into a single entity. For example, a record for a person might include name, sex, and the year of birth, and the record for Mary could contain the values 'Mary Myrtle', female, and 1966. Its vector representation combines vectors for the variables and their values—name (X), sex (Y), year of birth (Z), 'Mary Myrtle' (A), female (B), and 1966 (C)—by binding each variable to its value and by adding the three resulting vectors into the holistic sum-vector H:

H = X * A + Y * B + Z * C

The vector H is self-contained in that it is made of the bit patterns for the variables and their values, with nothing left implicit. Being a sum, H is similar to each of the three pairs, but the pairs, by being products, "hide" the identity of their elements so that H is dissimilar to each of A, B, C, X, Y, Z. However, the information about them is contained in H and can be recovered by unbinding. For example, to find the value of x in H we multiply H with (the inverse of) X and probe the item memory with the result, retrieving A. The math works out as follows:
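What follows is not the paper's worked-out derivation but a minimal Python sketch of the same computation, assuming NumPy, 10,000-bit vectors, XOR binding, and a brute-force item memory.

    # Illustrative only: build the record H = X*A + Y*B + Z*C and recover the filler of x.
    import numpy as np

    rng = np.random.default_rng(7)
    n = 10_000
    X, Y, Z = rng.integers(0, 2, size=(3, n))            # roles: name, sex, year of birth
    A, B, C = rng.integers(0, 2, size=(3, n))            # fillers: 'Mary Myrtle', female, 1966

    H = ((X ^ A) + (Y ^ B) + (Z ^ C) >= 2).astype(int)   # bind with XOR, add, normalize (majority)

    probe = X ^ H                                        # unbind with X (XOR is its own inverse)
    item_memory = {"A": A, "B": B, "C": C, "X": X, "Y": Y, "Z": Z}
    found = min(item_memory, key=lambda k: (item_memory[k] != probe).sum())
    print(found)                                         # expect "A": X * H is a noisy version of A

The probe X * H differs from A in only about a quarter of its bits while being unrelated to everything else in the item memory, which is why the clean-up retrieves A.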
and the match between our subjective judgments of similarity of concepts and the distribution of distances in hyperspace. Here we see that the modeling is further justified by hyperdimensional arithmetic—by its producing effects that suggest cognitive functions.

We are still some way from a fully worked-out architecture for cognitive computing. The examples below are meant to serve not as a recipe but as a source of ideas for future modelers. Worth pointing out is the likeness of hyperdimensional computing to conventional computing: things are represented with vectors, and new representations are computed from existing ones with (arithmetic) operations on the vectors. This idea is central and should be taken to future models.

Context Vectors as Examples of Sets; Random Indexing

Context vectors are a statistical means for studying relations between words of a language. They are high-dimensional representations of words based on their contexts. They provide us with an excellent example of random initial vectors giving rise to compatible systems. The idea is that words with similar or related meanings appear in the same and similar contexts and therefore should give rise to similar vectors. For example, the vectors for synonyms such as 'happy' and 'glad' should be similar, as should be the vectors for related words such as 'sugar' and 'salt', whereas the vectors for unrelated words such as 'glad' and 'salt' should be dissimilar. This indeed is achieved with all context vectors described below, including the ones that are built from random vectors.

The context vector for a word is computed from the contexts in which the word occurs in a large body of text. For any given instance of the word, its context is the surrounding text, which is usually considered in one of two ways: (1) as all the other words within a short distance from where the word occurs, referred to as a context window, or (2) as a lump, referred to as a document. A context window is usually narrow, limited to half a dozen or so nearby words. A document is usually several hundred words of text on a single topic, a news article being a good example. Each occurrence of a word in a text corpus thus adds to the word's context so that massive amounts of text, such as available on the Internet, can provide a large amount of context information for a large number of words. When a word's context information is represented as a vector, it is called that word's context vector. One way to characterize the two kinds of context vectors is that one represents the multiset of words (a bag of words) in all the context windows for a given word, and the other kind represents the multiset of documents in which a given word appears.

The context information is typically collected into a large matrix of frequencies where each word in the vocabulary has its own row in the matrix. The columns refer either to words of the vocabulary (one column per word) or to documents (one column per document). The rows are perfectly valid context vectors as such, but they are usually transformed into better context vectors, in the sense that the distances between vectors correspond more closely to similarity of meanings. The transformations include logarithms, inverses, and frequency cut-offs, as well as principal components of the (transformed) frequency matrix. Perhaps the best known method is latent semantic analysis (LSA), which uses singular-value decomposition and reduces the dimensionality of the data by discarding a large number of the least significant principal components.

Random-vector methods are singularly suited for making context vectors, and they even overcome some drawbacks of the more ‘‘exact’’ methods. The idea will be demonstrated when documents are used as the contexts in which words occur. The standard practice of LSA is to collect the word frequencies into a matrix that has a row for each word of the vocabulary (for each ‘‘term’’) and a column for each document of the corpus. Thus for each document there is a column that shows the number of times that the different words occur in that document. The resulting matrix is very sparse because most words do not occur in most documents. For example, if the vocabulary consists of 100,000 words, then a 500-word document—a page of text—will have a column with at most 500 non-0s (out of 100,000). A fairly large corpus could have 200,000 documents. The resulting matrix of frequencies would then have 100,000 rows and 200,000 columns, and the ‘‘raw’’ context vectors for words would be 200,000-dimensional. LSA reconstructs the frequency matrix from several hundred of the most significant principal components arrived at by singular-value decomposition of the 100,000-by-200,000 matrix. One drawback is the computational cost of extracting principal components of a matrix of that size. Another, more serious, is encountered when data are added, when the documents grow into the millions. Updating the context vectors—computing the singular-value decomposition of an ever larger matrix—becomes impractical.
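As a point of comparison for the random-vector methods described next, the reduction step at the heart of LSA can be sketched at toy scale (the matrix size, the choice of k, and the use of numpy's SVD are illustrative assumptions, not the scale discussed above):

    import numpy as np

    rng = np.random.default_rng(2)

    # Toy term-document frequency matrix: 1,000 "words" by 200 "documents",
    # a stand-in for the 100,000-by-200,000 matrix discussed in the text.
    F = rng.poisson(0.05, size=(1000, 200)).astype(float)

    # Keep only the k most significant principal components (LSA's reduction).
    k = 50
    U, s, Vt = np.linalg.svd(F, full_matrices=False)
    word_vectors = U[:, :k] * s[:k]      # k-dimensional context vectors for words

    print(word_vectors.shape)            # (1000, 50)

Adding a document to F would require redoing the decomposition, which is the drawback that motivates the random-vector methods below.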
Random-vector methods can prevent the growth of the matrix as documents are added. In a method called Random Indexing, instead of collecting the data into a 100,000-by-200,000 matrix, we collect it into a 100,000-by-10,000 matrix. Each word in the vocabulary still has its own row in the matrix, but each document no longer has its own column. Instead, each document is assigned a small number of columns at random, say, 20 columns out of 10,000, and we say that the document activates those columns. A 10,000-dimensional vector mostly of 0s except for the twenty 1s
where the activated columns are located is called that document's random index vector.

When the frequencies are collected into a matrix in standard LSA, each word in a document adds a 1 in the column for that document, whereas in random indexing each word adds a 1 in all 20 columns that the document activates. Another way of saying it is that each time a word occurs in the document, the document's random index vector is added to the row corresponding to that word. So this method is very much like the standard method of accumulating the frequency matrix, and it produces a matrix whose rows are valid context vectors for words, akin to the ‘‘raw’’ context vectors described above.
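The accumulation just described is short enough to write out. The sketch below is an illustration under assumed toy conditions: the 10,000-column index space and 20 active columns per document follow the text, while the vocabulary and corpus are reduced to a handful of items:

    import numpy as np

    rng = np.random.default_rng(3)

    n_cols = 10_000         # width of the index space (columns)
    k_active = 20           # columns activated by each document
    vocab = ['sugar', 'salt', 'happy', 'glad']       # toy vocabulary
    row = {w: i for i, w in enumerate(vocab)}

    # Each document gets a random index vector: k_active 1s among n_cols 0s.
    def index_vector():
        v = np.zeros(n_cols, dtype=np.int32)
        v[rng.choice(n_cols, size=k_active, replace=False)] = 1
        return v

    # Context vectors accumulate in a vocabulary-by-10,000 matrix.
    context = np.zeros((len(vocab), n_cols), dtype=np.int32)

    def add_document(words):
        doc_index = index_vector()           # assigned when the document arrives
        for w in words:                      # each occurrence adds the document's
            context[row[w]] += doc_index     # index vector to the word's row

    # Toy corpus: 'sugar' and 'salt' co-occur, 'glad' appears elsewhere.
    add_document(['sugar', 'salt', 'sugar'])
    add_document(['salt', 'sugar'])
    add_document(['happy', 'glad'])

    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

    print(cos(context[row['sugar']], context[row['salt']]))  # high: shared documents
    print(cos(context[row['sugar']], context[row['glad']]))  # near 0: none shared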
The context vectors—the rows—of this matrix can be transformed by extracting dominant principal components, as in LSA, but such further computing may not be necessary. Context vectors nearly as good as the ones from LSA have been obtained with a variant of random indexing that assigns each document a small number (e.g., 10) of ‘‘positive’’ columns and the same number of ‘‘negative’’ columns, at random. In the positive columns, 1s are added as above, whereas in the negative columns 1s are subtracted. The random index vectors for documents are now ternary with a small number of 1s and -1s placed randomly among a large number of 0s, and the resulting context vectors have a mean of 0—their components add up to 0.
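A ternary index vector of this kind can be generated, for example, as follows (10 positive and 10 negative columns are assumed, matching the numbers given in the text):

    import numpy as np

    rng = np.random.default_rng(4)
    n_cols = 10_000

    def ternary_index_vector(k=10):
        v = np.zeros(n_cols, dtype=np.int32)
        cols = rng.choice(n_cols, size=2 * k, replace=False)
        v[cols[:k]] = 1                   # ''positive'' columns
        v[cols[k:]] = -1                  # ''negative'' columns
        return v

    v = ternary_index_vector()
    assert v.sum() == 0                   # components add up to 0

Accumulation then proceeds exactly as in the previous sketch, adding the ternary vector to a word's row once for each occurrence.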
Several things about random indexing are worth noting. (1) Information about documents is distributed randomly among the columns. In LSA, information starts out localized and is distributed according to the dominant principal components. (2) Adding documents—i.e., including new data—is very simple: all we need to do is to select a new set of columns at random. This can go on into millions of documents without needing to increase the number of columns in the matrix. In LSA, columns need to be added for new documents, and singular-value decomposition needs to be updated. (3) Random indexing can be applied equally to the vocabulary so that the matrix will have fewer rows than there are words in the vocabulary, and that new words will not require adding rows into the matrix. In that case, individual rows no longer serve as context vectors for words, but the context vectors are readily computed by adding together the rows that the word activates. (4) Semantic vectors for documents can be computed by adding together the columns that the documents activate. (5) Random indexing can be used also when words in a sliding context window are used as the context. (6) And, of course, all the context vectors discussed in this section capture meaning, in that words with similar meaning have similar context vectors and unrelated words have dissimilar context vectors.

Two further comments of a technical nature are in order, one mathematical, the other linguistic. We have seen above that the sum-vector of high-dimensional random vectors is similar to the vectors that make up the sum and it is therefore a good representation of a set. When the context of a word is defined as a set of documents, as above, it is naturally represented by the sum of the vectors for those documents. That is exactly what random indexing does: a context vector is the sum of the random index vectors for the documents in which the word occurs. Thus two words that share contexts share many documents, and so their context vectors share many index vectors in their respective sums, making the sums—i.e., the context vectors—similar.

The other comment concerns the linguistic adequacy of context vectors. The contexts of words contain much richer linguistic information than is captured by the context vectors in the examples above. In fact, these context vectors are linguistically impoverished and crude—with language we can tell a story, with a bag of words we might be able to tell what the story is about. The technical reason is that only one operation is used for making the context vectors, namely, vector addition, and so only sets can be represented adequately. However, other operations on vectors besides addition have already been mentioned, and they can be used for encoding relational information about words. The making of linguistically richer context vectors is possible but mostly unexplored.

To sum up, high-dimensional random vectors—that is, large random patterns—can serve as the basis of a cognitive code that captures regularities in data. The simplicity and flexibility of random-vector methods can surpass those of more exact methods, and the principles apply to a wide range of tasks—beyond the computing of context vectors. They are particularly apt for situations where data keep on accumulating. Thus random-vector-based methods are good candidates for use in incremental on-line learning and in building a cognitive code.

Learning to Infer by Holistic Mapping; Learning from Example

Logic deals with inference. It lets us write down general statements—call them rules—which, when applied to specific cases, yield specific statements that are true. Here we look at such rules in terms of hyperdimensional arithmetic.

Let us look at the rule 'If x is the mother of y and y is the father of z then x is the grandmother of z.' If we substitute the names of a specific mother, son, and baby for x, y, and z, we get a true statement about a specific grandmother. How might the rule be encoded in distributed representation, and how might it be learned from specific examples of it?

Here we have three relations, 'mother of', 'father of', and 'grandmother of'; let us denote them with the letters M,
F, and G. Each relation has two constituents or arguments; we will label them with subscripts 1 and 2. That x is the mother of y can then be represented by Mxy = M1 * X + M2 * Y. Binding X and Y to two different vectors M1 and M2 keeps track of which variable, x or y, goes with which of the two arguments, and the sum combines the two bound pairs into a vector representing the relation 'mother of'. Similarly, Fyz = F1 * Y + F2 * Z for 'father of' and Gxz = G1 * X + G2 * Z for 'grandmother of'.

Next, how to represent the implication? The left side—the antecedent—has two parts combined with an 'and'; we can represent it with addition: Mxy + Fyz. The right side—the consequent Gxz—is implied by the left; we need an expression that maps the antecedent to the consequent. With XOR as the multiplication operator, the mapping is effected by the product-vector

Rxyz = Gxz * (Mxy + Fyz)

So the mapping Rxyz represents our rule and it can be applied to specific cases of mother, son, baby.

Now let us apply the rule. We will encode 'Anna is the mother of Bill' with Mab = M1 * A + M2 * B and 'Bill is the father of Cid' with Fbc = F1 * B + F2 * C; combine them into the antecedent Mab + Fbc, and map it with the rule Rxyz:

Rxyz * (Mab + Fbc) = Gxz * (Mxy + Fyz) * (Mab + Fbc)

The resulting vector, we will call it G′ac, is more similar to Gac (i.e., more similar to G1 * A + G2 * C) than to any other vector representing a relation of these same elements, thus letting us infer that Anna is the grandmother of Cid.
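A small numerical experiment makes the claim concrete. The sketch below follows the construction just given, with 10,000-bit vectors, XOR as the multiplication, and a majority rule with random tie-breaking to binarize the sums; the thresholding and the particular foils used for comparison are assumptions of the sketch, not prescriptions of the text:

    import numpy as np

    rng = np.random.default_rng(5)
    n = 10_000

    def rand_vec():
        return rng.integers(0, 2, n, dtype=np.uint8)

    def bundle(*vs):
        # Binarize a sum of binary vectors by majority; ties broken at random.
        s = np.sum([v.astype(int) for v in vs], axis=0)
        tie = rng.integers(0, 2, n)
        lo = np.where(2 * s < len(vs), 0, tie)
        return np.where(2 * s > len(vs), 1, lo).astype(np.uint8)

    def sim(u, v):
        # 1 for identical vectors, about 0 for unrelated random vectors.
        return 1 - 2 * np.count_nonzero(u ^ v) / n

    # Fields of the three relations and the individuals.
    M1, M2, F1, F2, G1, G2 = (rand_vec() for _ in range(6))
    X, Y, Z, A, B, C = (rand_vec() for _ in range(6))

    # The rule, encoded from the variables x, y, z.
    Mxy = bundle(M1 ^ X, M2 ^ Y)
    Fyz = bundle(F1 ^ Y, F2 ^ Z)
    Gxz = bundle(G1 ^ X, G2 ^ Z)
    Rxyz = Gxz ^ bundle(Mxy, Fyz)

    # A specific antecedent: Anna (A) is the mother of Bill (B),
    # and Bill is the father of Cid (C).
    Mab = bundle(M1 ^ A, M2 ^ B)
    Fbc = bundle(F1 ^ B, F2 ^ C)
    G_ac = Rxyz ^ bundle(Mab, Fbc)          # apply the rule: G'ac

    # Compare with the encoded conclusion and with two foils.
    Gac = bundle(G1 ^ A, G2 ^ C)            # 'Anna is the grandmother of Cid'
    Gca = bundle(G1 ^ C, G2 ^ A)            # arguments swapped
    Mac = bundle(M1 ^ A, M2 ^ C)            # wrong relation
    print(sim(G_ac, Gac), sim(G_ac, Gca), sim(G_ac, Mac))
    # The first value is small but reliably positive at this dimensionality
    # (a few hundredths); the two foils stay near 0.

The modest margin for a single example is consistent with the text's observation, below, that adding further examples improves the rule.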
The above example of inference can also be interpreted as learning from example. It uses a traditional formal framework with variables and values to represent relations, merely encoding them in distributed representation. The traditional framework relies on two-place relations and on the variables x, y, and z to identify individuals across the relations that make up the rule. However, because variables in distributed representation are represented explicitly by vectors, just as individuals are, the encoding of the rule 'mother–son–baby implies grandmother', and of an instance of it involving Anna, Bill, and Cid, are identical in form. We can therefore regard the rule itself as a specific instance of it(self); we can regard it as an example. Thus we can interpret the above description as computing from one example or instance of mother–son–baby implying grandmother another instance of grandmother. It is remarkable that learning from a single example would lead to the correct inference.

We can go further and learn from several examples. If one example gives us the mapping (rule) Rxyz and we have another example involving u, v, and w—think of them as a second set of specific individuals—we can recompute the same ‘‘rule’’ to get Ruvw = Guw * (Muv + Fvw). If we combine these two rules simply by adding them together, we get an improved rule based on two examples: R = Rxyz + Ruvw. The new rule is better in the sense that if applied to—multiplied by—the antecedent involving Anna, Bill, and Cid, as above, we get a vector G″ac that improves upon G′ac by being closer to Gac. We can go on adding examples, further improving the result somewhat. This can also be thought of as learning by analogy. The thing to note is that everything is done with simple arithmetic on random hypervectors.

So as not to give the impression that all kinds of inference will work out as simply as this, we need to point out when they don't. Things work out here because the relations in the antecedent and the consequent are different. However, some of them could be the same. Examples of such include (1) 'if x is the mother of y and y is a brother of z (and not half-brother) then x is the mother of z,' and the transitive relation (2) 'if x is a brother of y and y is a brother of z (different from x) then x is a brother of z.' When these are encoded in the same way as the mother–son–baby example above and the resulting rule applied to a, b, and c, the computed inference correlates positively with the correct inference, but a relation that is a part of the antecedent—a tautology—correlates more highly; in both cases 'b is a brother of c' wins over the intended conclusion about a's relation to c. An analysis shows the reason for the failure. It shows that the mapping rule includes the identity vector, which then takes the antecedent into the computed inference. The analysis is not complicated but it is lengthy and is not presented here.

A major advantage of distributed representation of this kind is that it lends itself to analysis. We can find out why something works or fails, and what could be done to work around a failure.

What is the Dollar of Mexico?

Much of language use, rather than being literal, is indirect or figurative. For example, we might refer to the peso as the Mexican dollar because the two have the same role in their respective countries. For the figurative expression to work, we must be able to infer the literal meaning from it. That implies the need to compute the literal meaning from the figurative.

The following example suggests that the inference can be achieved with holistic mapping. We will encode the country (x) and its monetary unit (y) with a two-field ‘‘record.’’ The holistic record for the United States then is A = X * U + Y * D and for Mexico it is B = X * M + Y * P, where U, M, D, P are random 10,000-bit vectors representing United States, Mexico, dollar, and peso, respectively.
From the record for United States A we can find its monetary unit by unbinding (multiplying it) with the variable Y. We can also find what role dollar plays in A by multiplying it with the dollar D: D * A = D * X * U + D * Y * D = D * X * U + Y ≈ Y. If we take the literal approach and ask what role dollar plays in the record for Mexico B we get nonsense: D * B = D * X * M + D * Y * P is unrecognizable. But we have already found out above the role that dollar plays in another context, namely the role Y, and so we can use it to unbind B and get P′ that is similar to P for peso. The interesting thing is that we can find the Mexican dollar without ever explicitly recovering the variable Y; we simply ask what in Mexico corresponds to the dollar in the United States? This question is encoded with (D * A) * B, and the result approximately equals P. The math is an exercise in distributivity, with vectors occasionally canceling out each other, and is given here in detail:

(D * A) * B = (D * (X * U + Y * D)) * (X * M + Y * P)
            = (D * (X * U) + D * (Y * D)) * (X * M + Y * P)
            = (D * X * U + D * Y * D) * (X * M + Y * P)
            = (D * X * U + Y) * (X * M + Y * P)
            = (D * X * U + Y) * (X * M) + (D * X * U + Y) * (Y * P)
            = ((D * X * U) * (X * M) + Y * (X * M)) + ((D * X * U) * (Y * P) + Y * (Y * P))
            = (D * X * U * X * M + Y * X * M) + (D * X * U * Y * P + Y * Y * P)
            = (D * U * M + Y * X * M) + (D * X * U * Y * P + P)

The only meaningful term in the result is P. The other three terms act as random noise.
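The derivation can be replayed numerically. In the sketch below, 10,000-bit vectors, XOR as the multiplication, and a majority rule for the two-term sums are assumed, as elsewhere in these illustrations, and the answer vector is compared with peso and with several unrelated vectors:

    import numpy as np

    rng = np.random.default_rng(6)
    n = 10_000

    def rand_vec():
        return rng.integers(0, 2, n, dtype=np.uint8)

    def bundle(u, v):
        # Majority rule for a two-vector sum; ties are broken at random.
        tie = rng.integers(0, 2, n, dtype=np.uint8)
        return np.where(u == v, u, tie).astype(np.uint8)

    def sim(u, v):
        return 1 - 2 * np.count_nonzero(u ^ v) / n   # 1 identical, about 0 unrelated

    X, Y = rand_vec(), rand_vec()                    # country, monetary unit
    U, M, D, P = (rand_vec() for _ in range(4))      # USA, Mexico, dollar, peso

    A = bundle(X ^ U, Y ^ D)                         # record for the United States
    B = bundle(X ^ M, Y ^ P)                         # record for Mexico

    answer = (D ^ A) ^ B                             # ''what is the dollar of Mexico?''
    print(sim(answer, P), sim(answer, M), sim(answer, U), sim(answer, D))
    # The similarity to P is about 0.25; the others stay near 0.

The noise terms in the derivation make the result only approximately equal to P, which is why the similarity is about 0.25 rather than 1; it is still far above the similarity to any unrelated vector.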
Cognitive Structure Based on Prototypes

The last two examples let us question the primacy of variables in cognitive representation. We have learned to think in abstract terms such as country and monetary unit and to represent more concrete objects in terms of them, as above, but we can also think in terms of prototypes and base computing on them, accepting expressions such as 'the dollar of Mexico' and 'the dollar of France' as perfectly normal. In fact, this is more like how children start out talking. Mom and Dad are specific persons to them, and somebody else's mother and father become understood in terms of my relation to Mom and Dad. The instances encountered early in life become the prototypes, and later instances are understood in terms of them. This kind of prototyping is very apparent to us when as adults we are learning a second language. To make sense of what we hear or read, we translate into our native tongue. Even after becoming fluent in the new language, idioms of the mother tongue can creep into our use of the other tongue.

To reflect this view, we can leave out X and Y from the representations above and encode United States as a prototype, namely, A′ = U + D. The holistic record for Mexico is then encoded in terms of it, giving B′ = U * M + D * P. The dollar of Mexico now becomes simply D * B′ = D * (U * M + D * P) = D * U * M + D * D * P = D * U * M + P = P′ ≈ P, with U and D taking the place of the variables X and Y. Using U and D as variables, we can in turn interpret 'the peso of France' exactly as 'the dollar of Mexico' is interpreted in the original example.
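The prototype version can be checked the same way (again an illustrative sketch; only the record for Mexico is needed to answer the query):

    import numpy as np

    rng = np.random.default_rng(7)
    n = 10_000

    def rand_vec():
        return rng.integers(0, 2, n, dtype=np.uint8)

    def bundle(u, v):                                # majority of two, random ties
        tie = rng.integers(0, 2, n, dtype=np.uint8)
        return np.where(u == v, u, tie).astype(np.uint8)

    def sim(u, v):
        return 1 - 2 * np.count_nonzero(u ^ v) / n

    U, M, D, P = (rand_vec() for _ in range(4))      # USA, Mexico, dollar, peso
    B_proto = bundle(U ^ M, D ^ P)                   # B' = U * M + D * P
    print(sim(D ^ B_proto, P))                       # about 0.5: clearly the peso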
Looking Back

Artificial neural-net associative memories were the first cognitive models to embrace truly high dimensionality and to see it as a possible asset. The early models were the linear correlation-matrix memories of Anderson [1] and Kohonen [2] that equate stored patterns with the eigenvectors of the memory (weight) matrix. Later models were made nonlinear with the application of a squashing function to the memory output vector, making stored patterns into point attractors. The best-known of these models is the Hopfield net [3]. They have one matrix of weights, which limits the memory storage capacity—the number of patterns that can be stored—to a fraction of the dimensionality of the stored vectors. By adding a fixed layer (a matrix) of random weights, the Sparse Distributed Memory [4] allows the building of associative memories of arbitrarily large capacity. The computationally most efficient implementation of it, by Karlsson [5], is equivalent to the RAM-based WISARD of Aleksander et al. [6]. Representative early work on associative memories appears in a 1981 book edited by Hinton and Anderson [7], more recent work in a collection edited by Hassoun [8], and more detailed analyses of these memories have been given by Kohonen [9] and Palm [10].

The next major development is marked by the 1990 special issue of Artificial Intelligence (vol. 46) on connectionist symbol processing edited by Geoffrey Hinton. In it Hinton [11] argues for the necessity of a reduced representation if structured information such as hierarchies were to be handled by neural nets. Smolensky [12] introduced tensor-product variable binding, which allows the (neural-net-like) distributed representation of traditional symbolic structures. However, the tensor product carries all low-level information to each higher level at the expense of increasing the size of the representation—it fails to reduce. This problem was solved by Plate in the holographic reduced representation (HRR) [13]. The solution
compresses the n × n outer product of two real vectors of dimensionality n into a single n-dimensional vector with circular convolution, it being the multiplication operator. The method requires a clean-up memory to recover information that is lost when the representation is reduced. The problem of clean-up had already been solved, in theory at least, by autoassociative memory. We now have a system of n-dimensional distributed representation with operators for addition and multiplication, that is closed under these operations and sufficient for encoding and decoding of compositional structure, as discussed above.
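For comparison with the XOR examples used above, here is a minimal sketch of convolution-based binding; it is an illustration of the general HRR scheme rather than code from Plate, and the normally distributed components and decoding by circular correlation are the customary HRR choices assumed here:

    import numpy as np

    rng = np.random.default_rng(8)
    n = 1_000

    def rand_vec():
        # Random real vector with elements drawn from N(0, 1/n), as is usual for HRR.
        return rng.normal(0.0, 1.0 / np.sqrt(n), n)

    def bind(a, b):
        # Circular convolution, computed via the FFT.
        return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

    def unbind(c, a):
        # Circular correlation with a: the approximate inverse of binding.
        return np.real(np.fft.ifft(np.conj(np.fft.fft(a)) * np.fft.fft(c)))

    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

    A, B, C = rand_vec(), rand_vec(), rand_vec()
    pair = bind(A, B)

    B_noisy = unbind(pair, A)            # approximately B, plus noise
    # Clean-up memory: pick the stored item closest to the noisy result.
    items = {'A': A, 'B': B, 'C': C}
    print(max(items, key=lambda k: cos(items[k], B_noisy)))   # 'B'

The final lookup plays the role of the clean-up memory mentioned in the text; without it the unbound vector is recognizably close to B but noisy.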
Plate also discusses HRR with complex vectors [14]. The addition operator for them is componentwise addition, as above, and the multiplication operator is componentwise multiplication. HRR for binary vectors is called the Spatter Code [15] for which componentwise XOR is an appropriate multiplication operator; for the equivalent bipolar spatter code it is componentwise multiplication, making the spatter code equivalent to the complex HRR when the ‘‘complex’’ vector components are restricted to the values 1 and -1.

The circular convolution includes all n × n elements of the outer-product matrix. However, Plate points out that the multiplication can also be accomplished with a subset of the elements. The simplest such has been used successfully by Gayler [16] by taking only the n diagonal elements of the outer-product matrix, making that system a generalization of the bipolar spatter code.

Permutation is a very versatile multiplication operator for hyperdimensional vectors, as discussed above. Rachkovskij and Kussul use it to label the variables of a relation [17], and Kussul and Baidyk [18] mark positions of a sequence with permutations. Gayler [16] uses permutations for ‘‘hiding’’ information in holographic representation. Rachkovskij and Kussul [17] use them for Context-Dependent Thinning, which is a method of normalizing binary vectors—that is, of achieving a desired sparseness in vectors that are produced by operations such as addition.

When a variable that is represented with a permutation is bound to a value that is represented with a hypervector, the inverse permutation will recover the value vector. Similarly, when a holistic record of several variables is constructed as a sum of permuted values—each variable having its own random permutation—the inverse permutations will recover approximate value vectors. However, there is no practical way to compute the permutations—to find the variables—from the holistic record and to determine what variable is associated with a given value. In that sense binding with vector multiplication and with permutation are very different.
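The sum-of-permuted-values construction is easy to try out. In the sketch below (an illustration with assumed 10,000-bit value vectors and a majority-rule sum), each of three variables is a random permutation, and inverse-permuting the record recovers a recognizably close, though noisy, copy of the value:

    import numpy as np

    rng = np.random.default_rng(9)
    n = 10_000

    def rand_vec():
        return rng.integers(0, 2, n, dtype=np.uint8)

    A, B, C = rand_vec(), rand_vec(), rand_vec()    # values
    perms = [rng.permutation(n) for _ in range(3)]  # one random permutation per variable
    invs = [np.argsort(p) for p in perms]

    # Record = majority-rule sum of the permuted values.
    S = A[perms[0]].astype(int) + B[perms[1]] + C[perms[2]]
    H = (S >= 2).astype(np.uint8)

    noisy_A = H[invs[0]]                            # inverse-permute to recover A
    print(np.count_nonzero(noisy_A ^ A) / n)        # about 0.25, well below chance (0.5)

Recovering which permutation, and hence which variable, goes with a given value would require trying them all, which is the asymmetry noted in the text.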
Another thread in the development of these models leads to LSA, which is described in detail by Landauer and Dumais [19]. LSA takes a large sparse matrix of word frequencies in documents and represents it with several hundred dominant principal components of the (transformed) frequency matrix. The desire to avoid the computational task of extracting principal components inspired Random Indexing by Kanerva et al. [20], the idea of which is discussed above. Random indexing is a special case of Random Projections by Papadimitriou et al. [21] and Random Mappings by Kaski [22]. All are examples of low-distortion geometric embedding, which has been reviewed by Indyk [23].

Language is a prime motivator and a rich source of ideas and challenges for hyperdimensional models. The original word-space model of Schütze [24] and the hyperspace analogue to language (HAL) model of Lund et al. [25], as well as LSA, are here called ‘‘exact’’ because they do not distribute the frequency information with random vectors. Sahlgren's [26] results at capturing word meaning with random indexing are comparable. However, context vectors that are based solely on the co-occurrence of words ignore a major source of linguistic information, namely, grammar. First attempts at including grammar have been made by encoding word order into the context vectors. Jones and Mewhort [27] do it with circular convolution applied to real-valued HRR-vectors; Sahlgren et al. [28] do it with permutations applied to ternary random-index vectors. Notice that both use multiplication—both circular convolution and permutation are multiplication operators. Widdows [29] covers numerous studies that represent word meaning with points of a high-dimensional space.

We can conclude from all of the above that we are dealing with very general properties of high-dimensional spaces. There is a whole family of mathematical systems that can be used as the basis of computing, referred to here as hyperdimensional computing and broadly covered under HRR, the definitive work on which is Plate's book [14] based on his 1994 PhD thesis.

Looking Forth; Discussion

In trying to understand brains, the most fundamental questions are philosophical: How does the human mind arise from the matter we are made of? What makes us so special, at least in our own eyes? Can we build robots with the intelligence of, say, a crow or a bear? Can we build robots that will listen, understand, and learn to talk?

According to one view, such questions will be answered in the positive once we understand how brains compute. The seeming paradox of the brain's understanding its own understanding is avoided by modeling. If our theories allow us to build a system whose behavior is indistinguishable from the behavior of the intended ‘‘target’’ system, we have
understood that system—the theory embodies our understanding of it. This view places the burden on modeling.

This paper describes a set of ideas for cognitive modeling, the key ones being very high dimensionality and randomness. They are a mathematical abstraction of certain apparent properties of real neural systems, and they are amenable to building into cognitive models. It is equally important that cognition, and behavior in general, is described well at the phenomenal level with all their subtleties, for example, how we actually think—or fail to—how we remember, forget, and confuse, how we learn, how we use language, what are the concepts we use, their relation to perception. With all of it being somehow produced by our brains, the modeler's task is to find a plausible explanation in underlying mechanisms. That calls for a deep understanding of both the phenomenon and the proposed mechanisms.

Experimental psychologists have a host of ways of testing and measuring behavior. Examples include reaction time, memory recognition and recall rates, confusions and errors introduced by priming and distractions, thresholds of perception, judgments of quantity, eye-tracking, and now also imaging brain activity. We can foresee the testing of hyperdimensional cognitive codes in a multitude of psychological experiments.

If you have never doubted your perceptions, visit a psychophysicist—or a magician. It is amazing how our senses are fooled. All the effects are produced by our nervous system and so tell of its workings. They seriously challenge our cognitive modeling, and serve as a useful guide. Hyperdimensional representation may explain at least some illusions, and possibly our bistable perception of the Necker cube.

Language has been cited above as a test-bed for ideas on representation, for which it is particularly suited on several accounts. The information has already been filtered by our brains and encoded into letters, words, sentences, passages, and stories. It is therefore strongly influenced by the brain's mechanisms, thus reflecting them. Linguists can tell us about language structure, tolerance of apparent ambiguity, stages of learning, literal and figurative uses, slips of the tongue, and much more, presenting us with a host of issues to challenge our modeling. Data are available in ever-increasing amounts on the Internet, in many languages, easily manipulated by computers. If we were to limit the development and testing of ideas about the brain's representations and processing to a single area of study, language would be an excellent choice. Our present models barely scratch the surface.

Neuroscience can benefit from mathematical ideas about representation and processing. Work at the level of individual neurons cannot tell us much about higher mental functions, but theoretical—i.e., mathematical—considerations can suggest how an individual component or a circuit needs to work to achieve a certain function. The mathematical modeler, in turn, can follow some leads and dismiss others by looking at the neural data.

It has been pointed out above that no two brains are identical yet they can be equivalent. The flip side is individual differences, which can be explained by randomness. An individual's internal code can be especially suited or unsuited for some functions simply by chance. This is particularly evident in the savant's feats of mental arithmetic, which to a computer engineer is clearly a matter of the internal code. The blending of sensory modalities in synesthesia is another sign of random variation in the internal code. The specifics of encoding that would result in these and other anomalies of behavior and perception are yet to be discovered—as are the specifics that lead to normal behavior! The thesis of this paper is that discovering the code is a deeply mathematical problem.

The mathematics of hyperdimensional representation as discussed above is basic to mathematicians, and the models based on it will surely fall short of explaining the brain's computing. Yet, they show promise and could pave the way to more comprehensive models based on deeper mathematics. The problem is in identifying mathematical systems that mirror ever more closely the behavior of cognitive systems we want to understand. We can hope that some mathematicians become immersed in the problem and will show us the way.

Of the ideas discussed in this paper, random indexing is ready for practical application. The example here is of language, but the method can be used in any task that involves a large and ever increasing sparse matrix of frequencies. The analysis of dynamic networks of many sorts—social networks, communications networks—comes readily to mind, but there are many others. The benefit is in being able to accommodate unpredictable growth in data within broad limits, in a fixed amount of computer memory by distributing the data randomly and by reconstructing it statistically when needed.

The ideas have been presented here in terms familiar to us from computers. They suggest a new breed of computers that, contrasted to present-day computers, work more like brains and, by implication, can produce behavior more like that produced by brains. This kind of neural-net computing emphasizes computer-like operations on vectors—directly computing representations for composite entities from those of the components—and deemphasizes iterative searching of high-dimensional ‘‘energy landscapes,’’ which is at the core of many present-day neural-net algorithms. The forming of an efficient energy landscape in a neural net would still have a role in making efficient item memories.

Very large word size—i.e., hyperdimensionality—
means that the new computers will be very large in terms of numbers of components. In light of the phenomenal progress in electronics technology, the required size will be achieved in less than a lifetime. In fact, computer engineers will soon be looking for appropriate architectures for the massive circuits they are able to manufacture. The computing discussed here can use circuits that are not produced in identical duplicates, and so the manufacturing of circuits for the new computers could resemble the growing of neural circuits in the brain. It falls upon those of us who work on the theory of computing to work out the architecture. In that spirit, we are encouraged to explore the possibilities hidden in very high dimensionality and randomness.

A major challenge for cognitive modeling is to identify mathematical systems of representation with operations that mirror cognitive phenomena of interest. This alone would satisfy the engineering objective of building computers with new capabilities. The mathematical systems should ultimately be realizable in neural substratum. Computing with hyperdimensional vectors is meant to take us in that direction.

Acknowledgements Real World Computing Project funding by Japan's Ministry of International Trade and Industry to the Swedish Institute of Computer Science in 1994–2001 made it possible for us to develop the ideas for high-dimensional binary representation. The support of Dr. Nobuyuki Otsu throughout the project was most valuable. Dr. Dmitri Rachkovskij provided information on early use of permutations to encode sequences by researchers in Ukraine. Dikran Karagueuzian of CSLI Publications accepted for publication Plate's book on Holographic Reduced Representation after a publishing agreement elsewhere fell through. Discussions with Tony Plate and Ross Gayler have helped shape the ideas and their presentation here. Sincere thanks to you all, as well as to my coauthors on papers on representation and to three anonymous reviewers of the manuscript.

References

1. Anderson JA. A simple neural network generating an interactive memory. Math Biosci. 1972;14:197–220.
2. Kohonen T. Correlation matrix memories. IEEE Trans Comput. 1972;C-21(4):353–9.
3. Hopfield JJ. Neural networks and physical systems with emergent collective computational abilities. Proc Natl Acad Sci USA. 1982;79(8):2554–8.
4. Kanerva P. Sparse distributed memory. Cambridge, MA: MIT Press; 1988.
5. Karlsson R. A fast activation mechanism for the Kanerva SDM memory. In: Uesaka Y, Kanerva P, Asoh H, editors. Foundations of real-world computing. Stanford: CSLI; 2001. p. 289–93.
6. Aleksander I, Stonham TJ, Wilkie BA. Computer vision systems for industry: WISARD and the like. Digit Syst Ind Autom. 1982;1:305–23.
7. Hinton GE, Anderson JA, editors. Parallel models of associative memory. Hillsdale, NJ: Erlbaum; 1981.
8. Hassoun MH, editor. Associative neural memories: theory and implementation. New York, Oxford: Oxford University Press; 1993.
9. Kohonen T. Self-organization and associative memory. 3rd ed. Berlin: Springer; 1989.
10. Palm G. Neural assemblies: an alternative approach to artificial intelligence. Heidelberg: Springer; 1982.
11. Hinton GE. Mapping part–whole hierarchies into connectionist networks. Artif Intell. 1990;46(1–2):47–75.
12. Smolensky P. Tensor product variable binding and the representation of symbolic structures in connectionist networks. Artif Intell. 1990;46(1–2):159–216.
13. Plate T. Holographic Reduced Representations: convolution algebra for compositional distributed representations. In: Mylopoulos J, Reiter R, editors. Proc. 12th int'l joint conference on artificial intelligence (IJCAI). San Mateo, CA: Kaufmann; 1991. p. 30–35.
14. Plate TA. Holographic reduced representation: distributed representation of cognitive structure. Stanford: CSLI; 2003.
15. Kanerva P. Binary spatter-coding of ordered K-tuples. In: von der Malsburg C, von Seelen W, Vorbruggen JC, Sendhoff B, editors. Artificial neural networks – ICANN 96 proceedings (Lecture notes in computer science, vol. 1112). Berlin: Springer; 1996. p. 869–73.
16. Gayler RW. Multiplicative binding, representation operators, and analogy. Poster abstract. In: Holyoak K, Gentner D, Kokinov B, editors. Advances in analogy research. Sofia: New Bulgarian University; 1998. p. 405. Full poster http://cogprints.org/502/. Accessed 15 Nov 2008.
17. Rachkovskij DA, Kussul EM. Binding and normalization of binary sparse distributed representations by context-dependent thinning. Neural Comput. 2001;13(2):411–52.
18. Kussul EM, Baidyk TN. On information encoding in associative–projective neural networks. Report 93-3. Kiev, Ukraine: V.M. Glushkov Inst. of Cybernetics; 1993 (in Russian).
19. Landauer T, Dumais S. A solution to Plato's problem: the Latent Semantic Analysis theory of acquisition, induction and representation of knowledge. Psychol Rev. 1997;104(2):211–40.
20. Kanerva P, Kristoferson J, Holst A. Random Indexing of text samples for latent semantic analysis. Poster abstract. In: Gleitman LR, Joshi AK, editors. Proc. 22nd annual conference of the Cognitive Science Society. Mahwah, NJ: Erlbaum; 2000. p. 1036. Full poster http://www.rni.org/kanerva/cogsci2k-poster.txt. Accessed 23 Nov 2008.
21. Papadimitriou C, Raghavan P, Tamaki H, Vempala S. Latent semantic indexing: a probabilistic analysis. Proc. 17th ACM symposium on the principles of database systems. New York: ACM Press; 1998. p. 159–68.
22. Kaski S. Dimensionality reduction by random mapping: fast similarity computation for clustering. Proc. int'l joint conference on neural networks, IJCNN'98. Piscataway, NJ: IEEE Service Center; 1999. p. 413–8.
23. Indyk P. Algorithmic aspects of low-distortion geometric embeddings. Annual symposium on foundations of computer science (FOCS) 2001 tutorial. http://people.csail.mit.edu/indyk/tut.ps. Accessed 15 Nov 2008.
24. Schütze H. Word space. In: Hanson SJ, Cowan JD, Giles CL, editors. Advances in neural information processing systems 5. San Mateo, CA: Kaufmann; 1993. p. 895–902.
25. Lund K, Burgess C, Atchley R. Semantic and associative priming in high-dimensional semantic space. Proc. 17th annual conference of the Cognitive Science Society. Mahwah, NJ: Erlbaum; 1995. p. 660–5.
26. Sahlgren M. The word-space model. Doctoral dissertation. Department of Linguistics, Stockholm University; 2006. http://www.sics.se/~mange/TheWordSpaceModel.pdf. Accessed 23 Nov 2008.
27. Jones MN, Mewhort DJK. Representing word meaning and order information in a composite holographic lexicon. Psychol Rev. 2007;114(1):1–37.
28. Sahlgren M, Holst A, Kanerva P. Permutations as a means to encode order in word space. Proc. 30th annual conference of the Cognitive Science Society. Austin, TX: Cognitive Science Society. p. 1300–5.
29. Widdows D. Geometry and meaning. Stanford: CSLI; 2004.