
Under review as a conference paper at ICLR 2017

HIERARCHICAL MEMORY NETWORKS


Sarath Chandar∗1, Sungjin Ahn1, Hugo Larochelle2,4, Pascal Vincent1,4, Gerald Tesauro3, Yoshua Bengio1,4
1 Université de Montréal, Canada. 2 Twitter, USA. 3 IBM Watson Research Center, USA. 4 CIFAR, Canada.

ABSTRACT

Memory networks are neural networks with an explicit memory component that
can be both read and written to by the network. The memory is often addressed in
a soft way using a softmax function, making end-to-end training with backprop-
agation possible. However, this is not computationally scalable for applications
which require the network to read from extremely large memories. On the other
hand, it is well known that hard attention mechanisms based on reinforcement
learning are challenging to train successfully. In this paper, we explore a form of
hierarchical memory network, which can be considered as a hybrid between hard
and soft attention memory networks. The memory is organized in a hierarchical
structure such that reading from it is done with less computation than soft attention
over a flat memory, while also being easier to train than hard attention over a flat
memory. Specifically, we propose to incorporate Maximum Inner Product Search
(MIPS) in the training and inference procedures for our hierarchical memory net-
work. We explore the use of various state-of-the art approximate MIPS techniques
and report results on SimpleQuestions, a challenging large scale factoid question
answering task.

1 INTRODUCTION

Until recently, traditional machine learning approaches for challenging tasks such as image captioning, object detection, or machine translation have consisted of complex pipelines of algorithms, each being separately tuned for better performance. With the recent success of neural networks and deep
learning research, it has now become possible to train a single model end-to-end, using backprop-
agation. Such end-to-end systems often outperform traditional approaches, since the entire model
is directly optimized with respect to the final task at hand. However, simple encode-decode style
neural networks often underperform on knowledge-based reasoning tasks like question-answering
or dialog systems. Indeed, in such cases it is nearly impossible for regular neural networks to store
all the necessary knowledge in their parameters.
Neural networks with memory (Graves et al., 2014; Weston et al., 2015b) can deal with knowledge
bases by having an external memory component which can be used to explicitly store knowledge.
The memory is accessed by reader and writer functions, which are both made differentiable so
that the entire architecture (neural network, reader, writer and memory components) can be trained
end-to-end using backpropagation. Memory-based architectures can also be considered as general-
izations of RNNs and LSTMs, where the memory is analogous to recurrent hidden states. However, they are much richer in structure and can handle very long-term dependencies because once a vector (i.e., a memory) is stored, it is copied from time step to time step and can thus stay there for a very long time (and gradients correspondingly flow back through time unhampered).
There exist several variants of neural networks with a memory component: Memory Networks (We-
ston et al., 2015b), Neural Turing Machines (NTM) (Graves et al., 2014), Dynamic Memory Net-

∗Corresponding author: [email protected]


works (DMN) (Kumar et al., 2015). They all share five major components: memory, input module,
reader, writer, and output module.
Memory: The memory is an array of cells, each capable of storing a vector. The memory is often initialized with external data (e.g., a database of facts) by filling its cells with pre-trained vector representations of that data.
Input module: The input module computes a representation of the input that can be used by the other modules.
Writer: The writer takes the input representation and updates the memory based on it. The writer
can be as simple as filling the slots in the memory with input vectors in a sequential way (as often
done in memory networks). If the memory is bounded, instead of sequential writing, the writer has
to decide where to write and when to rewrite cells (as often done in NTMs).
Reader: Given an input and the current state of the memory, the reader retrieves content from the
memory, which will then be used by an output module. This often requires comparing the input’s
representation or a function of the recurrent state with memory cells using some scoring function
such as a dot product.
Output module: Given the content retrieved by the reader, the output module generates a prediction,
which often takes the form of a conditional distribution over multiple labels for the output.
For the rest of the paper, we will use the name memory network to describe any model which has some form of these five components. We would like to highlight that all the components except the
memory are learnable. Depending on the application, any of these components can also be fixed. In
this paper, we will focus on the situation where a network does not write and only reads from the
memory.
In this paper, we focus on the application of memory networks to large-scale tasks. Specifically, we
focus on large scale factoid question answering. For this problem, given a large set of facts and a nat-
ural language question, the goal of the system is to answer the question by retrieving the supporting
fact for that question, from which the answer can be derived. Application of memory networks to
this task has been studied by Bordes et al. (2015). However, Bordes et al. (2015) depended on keyword-based heuristics to filter the facts to a smaller set that is manageable for training. Such heuristics are invariably dataset-dependent, and we are interested in a more general solution which can be used when the facts are of any structure. One can design soft attention retrieval mechanisms, where a convex combination of all the cells is retrieved, or hard attention retrieval mechanisms, where one or a few cells from the memory are retrieved. Soft attention is achieved by using
softmax over the memory which makes the reader differentiable and hence learning can be done
using gradient descent. Hard attention is achieved by using methods like REINFORCE (Williams,
1992), which provides a noisy gradient estimate when discrete stochastic decisions are made by a
model.
Both soft attention and hard attention have limitations. As the size of the memory grows, soft
attention using softmax weighting is not scalable. It is computationally very expensive, since its
complexity is linear in the size of the memory. Also, at initialization, gradients are dispersed so much that they can reduce the effectiveness of gradient descent. These problems can be alleviated by
a hard attention mechanism, for which the training method of choice is REINFORCE. However,
REINFORCE can be brittle due to its high variance and existing variance reduction techniques are
complex. Thus, it is rarely used in memory networks (even in cases of a small memory).
In this paper, we propose a new memory selection mechanism based on Maximum Inner Product
Search (MIPS) which is both scalable and easy to train. This can be considered as a hybrid of soft
and hard attention mechanisms. The key idea is to structure the memory in a hierarchical way such
that it is easy to perform MIPS, hence the name Hierarchical Memory Network (HMN). HMNs are
scalable at both training and inference time. The main contributions of the paper are as follows:

• We explore hierarchical memory networks, where the memory is organized in a hierarchical fashion, which allows the reader to efficiently access only a subset of the memory.

• While there are several ways to decide which subset to access, we propose to pose memory
access as a maximum inner product search (MIPS) problem.


• We empirically show that exact MIPS-based algorithms not only enjoy convergence similar to that of soft attention models, but can even improve the performance of the memory network.
• Since exact MIPS is as computationally expensive as a full soft attention model, we propose
to train the memory networks using approximate MIPS techniques for scalable memory
access.
• We empirically show that, unlike exact MIPS, approximate MIPS algorithms provide a speedup and scalable training, though at the cost of some performance.

2 HIERARCHICAL MEMORY NETWORKS


In this section, we describe the proposed Hierarchical Memory Network (HMN). In this paper, HMNs differ from regular memory networks in only two of their components: the memory and the reader.
Memory: Instead of a flat array of cells for the memory structure, HMNs leverage a hierarchical
memory structure. Memory cells are organized into groups and the groups can further be organized
into higher level groups. The choice for the memory structure is tightly coupled with the choice
of reader, which is essential for fast memory access. We consider three classes of approaches for
the memory’s structure: hashing-based approaches, tree-based approaches, and clustering-based
approaches. This is explained in detail in the next section.
Reader: The reader in the HMN is different from the readers in flat memory networks. Flat memory-
based readers use either soft attention over the entire memory or hard attention that retrieves a single
cell. While these mechanisms might work with small memories, with HMNs we are more interested
in achieving scalability towards very large memories. So instead, HMN readers use soft attention
only over a selected subset of the memory. Selecting memory subsets is guided by a maximum inner
product search algorithm, which can exploit the hierarchical structure of the organized memory to
retrieve the most relevant facts in sub-linear time. The MIPS-based reader is explained in more
detail in the next section.
In HMNs, the reader is thus trained to create MIPS queries such that it can retrieve a sufficient set
of facts. While most of the standard applications of MIPS (Ram & Gray, 2012; Bachrach et al.,
2014; Shrivastava & Li, 2014) so far have focused on settings where both query vector and database
(memory) vectors are precomputed and fixed, memory readers in HMNs are learning to do MIPS by
updating the input representation such that the result of MIPS retrieval contains the correct fact(s).

3 MEMORY READER WITH K-MIPS ATTENTION


In this section, we describe how the HMN memory reader uses Maximum Inner Product Search
(MIPS) during learning and inference.
We begin with a formal definition of K-MIPS. Given a set of points $X = \{x_1, \ldots, x_n\}$ and a query vector $q$, our goal is to find

$$\operatorname{argmax}^{(K)}_{i \in X} \; q^\top x_i \qquad (1)$$

where $\operatorname{argmax}^{(K)}$ returns the indices of the top-$K$ maximum values. In the case of HMNs, $X$ corresponds to the memory and $q$ corresponds to the vector computed by the input module.
A simple but inefficient solution for K-MIPS involves a linear search over the cells in memory by
performing the dot product of q with all the memory cells. While this will return the exact result
for K-MIPS, it is too costly to perform when we deal with a large-scale memory. However, in
many practical applications, it is often sufficient to have an approximate result for K-MIPS, trading some accuracy for speed. There exist several approximate K-MIPS solutions in the
literature (Shrivastava & Li, 2014; 2015; Bachrach et al., 2014; Neyshabur & Srebro, 2015).
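For concreteness, here is a minimal NumPy sketch (ours, not code from the paper) of exact K-MIPS over a flat memory: a dense matrix-vector product followed by top-K selection, which makes explicit the O(N * d) cost per query that motivates the approximate methods discussed next.

```python
import numpy as np

def exact_k_mips(q, M, K):
    """Exact K-MIPS (Eq. 1): indices of the K memory cells with the
    largest inner product with the query q.
    q: (d,) query vector; M: (N, d) memory matrix.
    Cost is O(N * d): a full linear scan over the memory."""
    scores = M @ q                               # (N,) inner products
    top_k = np.argpartition(-scores, K)[:K]      # unordered top-K indices
    return top_k[np.argsort(-scores[top_k])]     # sorted by score
```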
All the approximate K-MIPS solutions add a form of hierarchical structure to the memory and visit
only a subset of the memory cells to find the maximum inner product for a given query. Hashing-
based approaches (Shrivastava & Li, 2014; 2015; Neyshabur & Srebro, 2015) hash cells into multiple
bins, and given a query they search for K-MIPS cell vectors only in bins that are close to the bin


associated with the query. Tree-based approaches (Ram & Gray, 2012; Bachrach et al., 2014) create
search trees with cells in the leaves of the tree. Given a query, a path in the tree is followed and
MIPS is performed only on the leaf of the chosen path. Clustering-based approaches (Auvolat
et al., 2015) cluster cells into multiple clusters (or a hierarchy of clusters) and given a query, they
perform MIPS on the centroids of the top few clusters. We refer the reader to (Auvolat et al., 2015)
for an extensive comparison of various state-of-the-art approaches for approximate K-MIPS.
Our proposal is to exploit this rich approximate K-MIPS literature to achieve scalable training and
inference in HMNs. Instead of filtering the memory with heuristics, we propose to organize the
memory based on approximate K-MIPS algorithms and then train the reader to learn to perform
MIPS. Specifically, consider the following softmax over the memory which the reader has to perform
for every reading step to retrieve a set of relevant candidates:
$$R_{\text{out}} = \operatorname{softmax}(h(q) M^\top) \qquad (2)$$

where $h(q) \in \mathbb{R}^d$ is the representation of the query and $M \in \mathbb{R}^{N \times d}$ is the memory, with $N$ being the total number of cells in the memory. We propose to replace this softmax with $\operatorname{softmax}^{(K)}$, which is defined as follows:

$$C = \operatorname{argmax}^{(K)} h(q) M^\top \qquad (3)$$

$$R_{\text{out}} = \operatorname{softmax}^{(K)}(h(q) M^\top) = \operatorname{softmax}(h(q) M[C]^\top) \qquad (4)$$

where $C$ is the set of indices of the top-$K$ MIP candidate cells and $M[C]$ is the sub-matrix of $M$ whose rows are indexed by $C$.
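A minimal NumPy sketch of Equations 3-4 (our own code, reusing the exact_k_mips helper above): the softmax is computed only over the K retrieved rows, so gradients flow only into those cells.

```python
def k_softmax_read(q_repr, M, K):
    """softmax^(K) reader (Eqs. 3-4): attend only over the top-K cells.
    q_repr: (d,) query representation h(q); M: (N, d) memory.
    Returns the candidate indices C and attention weights over M[C]."""
    C = exact_k_mips(q_repr, M, K)       # Eq. 3: top-K candidate indices
    scores = M[C] @ q_repr               # inner products with candidates only
    scores -= scores.max()               # stabilize the exponentials
    weights = np.exp(scores)
    return C, weights / weights.sum()    # Eq. 4: softmax over the K cells
```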
One advantage of using softmax^(K) is that it naturally focuses on the cells that would receive the strongest gradients during learning. In a full softmax, by contrast, the gradients are dispersed across the large number of cells, many of which contribute only a small gradient. As our experiments will show, this dispersion results in slower training.
One problematic situation when learning with softmax^(K) arises in the initial stages of training, when the K-MIPS reader may not include the correct fact candidate. To avoid this issue, we always include the correct candidate in the top-K candidates retrieved by the K-MIPS algorithm, effectively performing a fully supervised form of learning.
During training, the reader is updated by backpropagation from the output module, through the
subset of memory cells. Additionally, the log-likelihood of the correct fact computed using K-
softmax is also maximized. This second supervision helps the reader learn to modify the query
such that the maximum inner product of the query with respect to the memory will yield the correct
supporting fact in the top K candidate set.
Until now, we described the exact K-MIPS-based learning framework, which still requires a linear
look-up over all memory cells and would be prohibitive for large-scale memories. In such scenarios,
we can replace the exact K-MIPS in the training procedure with the approximate K-MIPS. This is
achieved by deploying a suitable memory hierarchical structure. The same approximate K-MIPS-
based reader can be used during the inference stage as well. Of course, approximate K-MIPS algorithms might not return the exact MIPS candidates, which will likely hurt performance, but with the benefit of achieving scalability.
While the memory representation is fixed in this paper, updating the memory along with the query
representation should improve the likelihood of choosing the correct fact. However, updating the
memory will reduce the precision of the approximate K-MIPS algorithms, since all of them assume
that the vectors in the memory are static. Designing efficient dynamic K-MIPS should improve the
performance of HMNs even further, a challenge that we hope to address in future work.

3.1 READER WITH CLUSTERING-BASED APPROXIMATE K-MIPS

Clustering-based approximate K-MIPS was proposed in (Auvolat et al., 2015) and has been shown to outperform various other state-of-the-art data-dependent and data-independent approximate K-MIPS approaches for inference tasks. As we will show in the experiments section, clustering-based MIPS also performs better when used for training HMNs. Hence, we focus our presentation on the
clustering-based approach and propose changes that were found to be helpful for learning HMNs.


Following most other approximate K-MIPS algorithms, Auvolat et al. (2015) convert the MIPS problem to a Maximum Cosine Similarity Search (MCSS) problem:

$$\operatorname{argmax}^{(K)}_{i \in X} \frac{q^\top x_i}{\|q\| \, \|x_i\|} = \operatorname{argmax}^{(K)}_{i \in X} \frac{q^\top x_i}{\|x_i\|} \qquad (5)$$
When all the data vectors xi have the same norm, then MCSS is equivalent to MIPS. However, it is
often restrictive to have this additional constraint. Instead, Auvolat et al. (2015) append additional
dimensions to both query and data vectors to convert MIPS to MCSS. In HMN terminology, this
would correspond to adding a few more dimensions to the memory cells and input representations.
The algorithm introduces two hyper-parameters, $U < 1$ and $m \in \mathbb{N}^*$. The first step is to scale all the vectors in the memory by the same factor, such that $\max_i \|x_i\|_2 = U$. We then apply two mappings, $P$ and $Q$, on the memory cells and on the input vector, respectively. These two mappings simply concatenate $m$ new components to the vectors and make the norms of the data points all roughly the same (Shrivastava & Li, 2015). The mappings are defined as follows:

$$P(x) = [x,\; 1/2 - \|x\|_2^2,\; 1/2 - \|x\|_2^4,\; \ldots,\; 1/2 - \|x\|_2^{2^m}] \qquad (6)$$

$$Q(x) = [x,\; 0,\; 0,\; \ldots,\; 0] \qquad (7)$$

We thus have the following approximation of MIPS by MCSS for any query vector $q$:

$$\operatorname{argmax}^{(K)}_{i}\; q^\top x_i \;\simeq\; \operatorname{argmax}^{(K)}_{i}\; \frac{Q(q)^\top P(x_i)}{\|Q(q)\|_2 \cdot \|P(x_i)\|_2} \qquad (8)$$
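The following NumPy sketch (our own code, under the preprocessing just described) implements the P and Q augmentations of Equations 6-7; U and m are the two hyper-parameters introduced above.

```python
def augment_memory(X, U=0.9, m=3):
    """Apply P (Eq. 6) after scaling the memory so max_i ||x_i||_2 = U.
    X: (N, d) memory matrix; returns an (N, d + m) matrix."""
    X = X * (U / np.linalg.norm(X, axis=1).max())
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    extras = [0.5 - norms ** (2 ** (j + 1)) for j in range(m)]  # ||x||^2, ||x||^4, ...
    return np.hstack([X] + extras)

def augment_query(q, m=3):
    """Apply Q (Eq. 7): pad the query with m zeros."""
    return np.concatenate([q, np.zeros(m)])
```

After this augmentation, ranking memory cells by cosine similarity between Q(q) and P(x_i) approximately ranks them by their inner product with q (Eq. 8), so cosine-based search structures apply directly.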

Once we convert MIPS to MCSS, we can use spherical K-means (Zhong, 2005) or its hierarchical version to approximate and speed up the cosine similarity search. Once the memory is clustered, every read operation requires only as many dot products as there are cluster centroids. Since this is an approximation, it is error-prone, and as we use it during learning, it introduces some bias in the gradients, which can affect the overall performance of the HMN.
To alleviate this bias, we propose three simple strategies.

• Instead of using only the top-K candidates for a single read query, we also add the top-K candidates retrieved for every other read query in the mini-batch. This serves two purposes. First, we can do efficient matrix multiplications by leveraging GPUs, since all the K-softmaxes in a mini-batch are over the same set of elements. Second, this also helps to decrease the bias introduced by the approximation error.
• For every read access, instead of using only the top few clusters that have the maximum product with the read query, we also sample some clusters from the rest, based on a probability distribution log-proportional to the dot product with the cluster centroids. This also decreases the bias.
• We can also sample random blocks of memory and add them to the top-K candidates.

We empirically investigate the effect of these variations in Section 5.5.
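To make the three strategies concrete, here is a rough NumPy sketch of how a shared candidate set could be assembled for a mini-batch (our own illustration; the helper names and the precomputed spherical K-means clustering, given as centroids and cluster_members, are hypothetical):

```python
def batch_candidates(Q, centroids, cluster_members, memory_blocks,
                     top_c=5, sample_c=3, rng=np.random):
    """Assemble one shared candidate set for a mini-batch of queries.
    Q: (B, d) augmented queries; centroids: (C, d) centroid matrix;
    cluster_members / memory_blocks: lists of memory-index arrays."""
    candidates = set()
    for q in Q:                                  # union over the mini-batch
        scores = centroids @ q
        top = np.argsort(-scores)[:top_c]        # Top-K clusters
        rest = np.setdiff1d(np.arange(len(centroids)), top)
        p = np.exp(scores[rest] - scores[rest].max())
        sampled = rng.choice(rest, size=sample_c,          # Sample-K:
                             replace=False, p=p / p.sum()) # log-proportional
        for c in np.concatenate([top, sampled]):
            candidates.update(cluster_members[c].tolist())
    block = memory_blocks[rng.randint(len(memory_blocks))]  # Rand-block
    candidates.update(block.tolist())
    return np.fromiter(candidates, dtype=np.int64)
```

Because every query in the mini-batch attends over the same candidate set, the K-softmax for the whole batch reduces to a single dense matrix multiplication on the GPU.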

4 RELATED WORK
Memory networks were introduced in (Weston et al., 2015b) and have so far been applied to comprehension-based question answering (Weston et al., 2015a; Sukhbaatar et al., 2015), large scale question answering (Bordes et al., 2015) and dialogue systems (Dodge et al., 2015). While (Weston et al., 2015b) considered supervised memory networks in which the correct supporting fact is given during the training stage, (Sukhbaatar et al., 2015) introduced semi-supervised memory networks that can learn the supporting fact by themselves. (Kumar et al., 2015; Xiong et al., 2016) introduced
Dynamic Memory Networks (DMNs) which can be considered as a memory network with two
types of memory: a regular large memory and an episodic memory. Another related class of model
is the Neural Turing Machine (Graves et al., 2014), which uses softmax-based soft attention. Later
(Zaremba & Sutskever, 2015) extended NTM to hard attention using reinforcement learning. (Dodge
et al., 2015; Bordes et al., 2015) alleviate the problem of the scalability of soft attention by having


an initial keyword-based filtering stage, which reduces the number of facts being considered. Our work generalizes this filtering by using MIPS instead. This is desirable because MIPS can be applied to any modality of data, even when there is no overlap between the words in a question and the words in the facts.
The softmax arises in various situations, and most relevant to this work are scaling methods for large-vocabulary neural language modeling. In neural language modeling, the final layer is a softmax distribution over the next word, and there exist several approaches to achieve scalability. (Morin & Bengio, 2005) proposes a hierarchical softmax based on a prior clustering of the words into a binary, or more generally n-ary, tree that serves as a fixed structure for the learning process of the
model. The complexity of training is reduced from O(n) to O(log n). Due to its clustering and tree
structure, it resembles the clustering-based MIPS techniques we explore in this paper. However, the
approaches differ at a fundamental level. Hierarchical softmax defines the probability of a leaf node
as the product of all the probabilities computed by all the intermediate softmaxes on the way to that
leaf node. By contrast, an approximate MIPS search imposes no such constraining structure on the
probabilistic model, and is better thought of as efficiently searching for the top winners of what amounts to a large ordinary flat softmax. Other methods such as Noise Contrastive Estimation (Mnih
& Gregor, 2014) and Negative Sampling (Mikolov et al., 2013) avoid an expensive normalization
constant by sampling negative samples from some marginal distribution. By contrast, our approach
approximates the softmax by explicitly including in its negative samples candidates that likely would
have a large softmax value. Jean et al. (2015) introduce an importance sampling approach that considers all the words in a mini-batch as the candidate set. In general, this might also not include the MIPS candidates with the highest softmax values.
(Spring & Shrivastava, 2016) is the only work that we know of that proposes to use MIPS during learning. It proposes hashing-based MIPS to sort the hidden layer activations and reduce the computation in every layer. However, only a small-scale application was considered, and data-independent methods
like hashing will likely suffer as dimensionality increases. Rae et al. (2016) have also independently
proposed a model called SAM to use approximate search methods for memory access in NTM-like
architectures. However, our motivation is different. While Rae et al. (2016) focus on architectures
where the memory is written by the controller itself, we focus on handling memory access to large
external knowledge bases. While both models fix the memory access mechanism (HMN uses MIPS and SAM uses NNS), our controller works in a much more constrained setting. Moreover,
our experiments suggest that the performance of SAM could be improved using a clustering-based
approach as in our work, instead of tree/hash-based approaches for memory search used by SAM.

5 EXPERIMENTS

In this section, we report experiments on factoid question answering using hierarchical memory
networks. Specifically, we use the SimpleQuestions dataset (Bordes et al., 2015). The aim of these
experiments is not to achieve state-of-the-art results on this dataset. Rather, we aim to propose
and analyze various approaches to make memory networks more scalable and explore the achieved
tradeoffs between speed and accuracy.

5.1 DATASET

We use SimpleQuestions (Bordes et al., 2015), a large scale factoid question answering dataset. SimpleQuestions consists of 108,442 natural language questions, each paired with a corresponding fact from Freebase. Each fact is a triple (subject, relation, object) and the answer to the
question is always the object. The dataset is divided into training (75910), validation (10845), and
test (21687) sets. Unlike Bordes et al. (2015), who additionally considered FB2M (10M facts) or
FB5M (12M facts) with keyword-based heuristics for filtering most of the facts for each question,
we only use SimpleQuestions, with no keyword-based heuristics. This allows us to do a direct com-
parison with the full softmax approach in a reasonable amount of time. Moreover, we would like to
highlight that for this dataset, keyword-based filtering is a very efficient heuristic since all questions
have an appropriate source entity with a matching word. Nevertheless, our goal is to design a general
purpose architecture without such strong assumptions on the nature of the data.


5.2 MODEL

Let $V_q$ be the vocabulary of all words in the natural language questions. Let $W_q$ be a $|V_q| \times m$ matrix where each row is an $m$-dimensional embedding for a word in the question vocabulary. This matrix is initialized with random values and learned during training. Given any question, we represent it with a bag-of-words representation by summing the vector representation of each word in the question. Let $q = \{w_i\}_{i=1}^{p}$; then

$$h(q) = \sum_{i=1}^{p} W_q[w_i]$$

Then, to find the relevant fact from the memory $M$, we call the K-MIPS-based reader module with $h(q)$ as the query. This uses Equations 3 and 4 to compute the output of the reader $R_{\text{out}}$. The reader
is trained by minimizing the Negative Log Likelihood (NLL) of the correct fact.

$$J_\theta = \sum_{i=1}^{N} -\log(R_{\text{out}}[f_i])$$

where $f_i$ is the index of the correct fact in the memory $M$. We fix the memory embeddings to the TransE (Bordes et al., 2013) embeddings and learn only the question embeddings.
This model is simpler than the one reported in (Bordes et al., 2015), so that it is easy to analyze the effect of various memory reading strategies.
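Putting the pieces together, here is a minimal sketch of this model in NumPy (our own code, reusing the k_softmax_read helper sketched earlier; Wq and the fact-embedding matrix M are assumed given):

```python
def question_representation(word_ids, Wq):
    """Bag-of-words encoding h(q): sum the embeddings of the question words."""
    return Wq[word_ids].sum(axis=0)

def fact_nll(word_ids, correct_fact, Wq, M, K=10):
    """NLL of the correct fact under the K-softmax reader (Section 5.2).
    As described in Section 3, the correct fact is always injected into
    the candidate set during training."""
    h_q = question_representation(word_ids, Wq)
    C, weights = k_softmax_read(h_q, M, K)
    if correct_fact not in C:                # force inclusion of the label
        C = np.append(C[:-1], correct_fact)
        scores = M[C] @ h_q
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
    return -np.log(weights[np.where(C == correct_fact)[0][0]])
```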

5.3 TRAINING DETAILS

We trained the model with the Adam optimizer (Kingma & Ba, 2014), with a fixed learning rate
of 0.001. We used mini-batches of size 128. We used 200 dimensional embeddings for the TransE
entities, yielding 600 dimensional embeddings for facts by concatenating the embeddings of the
subject, relation and object. We also experimented with summing the entities in the triple instead
of concatenating, but we found that it was difficult for the model to differentiate facts this way.
The only parameters learned by the HMN model are the question word embeddings. The entity
distribution in SimpleQuestions is extremely sparse and hence, following Bordes et al. (2015), we
also add artificial questions for all the facts for which we do not have natural language questions.
Unlike Bordes et al. (2015), we do not add any other additional tasks like paraphrase detection to
the model, mainly to study the effect of the reader. We stopped training for all the models when the
validation accuracy consistently decreased for 3 epochs.
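Continuing the NumPy sketches above, the 600-dimensional fact embeddings described here could be built as follows (our own hypothetical helper; transe is assumed to map each Freebase entity or relation to its fixed, pre-trained 200-dimensional TransE vector):

```python
def fact_embedding(subject, relation, obj, transe):
    """Concatenate the three fixed TransE vectors (200-d each) into a
    single 600-d fact embedding, as in Section 5.3."""
    return np.concatenate([transe[subject], transe[relation], transe[obj]])
```

Concatenation, rather than summation, keeps the three slots of the triple distinguishable, which the text above reports was necessary for the model to differentiate facts.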

5.4 EXACT K-MIPS IMPROVES ACCURACY

In this section, we compare the performance of the full soft attention reader and exact K-MIPS
attention readers. Our goal is to verify that K-MIPS attention is in fact a valid and useful attention
mechanism and see how it fares when compared to full soft attention. For K-MIPS attention, we
tried $K \in \{10, 50, 100, 1000\}$. We would like to emphasize that, at training time, along with the K candidates for a particular question, we also add the K candidates for every other question in the mini-batch. So the exact size of the softmax layer is higher than K during training. In Table 1,
we report the test performance of memory networks using the soft attention reader and K-MIPS
attention reader. We also report the average softmax size during training. From the table, it is
clear that the K-MIPS attention readers improve the performance of the network compared to the soft attention reader. In fact, the smaller the value of K, the better the performance. This result suggests that it is better to use a K-MIPS layer instead of a softmax layer whenever possible. It is interesting to see
that the convergence of the model is not slowed down due to this change in softmax computation (as
shown in Figure 1).
This experiment confirms the usefulness of K-MIPS attention. However, exact K-MIPS has the
same complexity as a full softmax. Hence, to scale up training, we need more efficient forms of K-MIPS attention, which is the focus of the next experiment.


Model           Test Acc.   Avg. Softmax Size
Full-softmax    59.5        108442
10-MIPS         62.2        1290
50-MIPS         61.2        6180
100-MIPS        60.6        11928
1000-MIPS       59.6        70941
Clustering      51.5        20006
PCA-Tree        32.4        21108
WTA-Hash        40.2        20008

Table 1: Accuracy on the SQ test set and average size of the memory used. 10-MIPS attains high performance while using only a small amount of memory.

[Figure 1: Validation error vs. training epochs for the full softmax and the 10-, 50-, 100-, and 1000-softmax models. Convergence is not slowed down by K-softmax.]

5.5 APPROXIMATE K-MIPS-BASED LEARNING

As mentioned previously, designing faster algorithms for K-MIPS is an active area of research. Auvolat et al. (2015) compared several state-of-the-art data-dependent and data-independent methods for faster approximate K-MIPS and found that clustering-based MIPS performs significantly better than the other approaches. However, the focus of that comparison was on performance during the inference stage. In HMNs, K-MIPS must be used at both the training and inference stages. To verify whether the same trend holds during the learning stage as well, we compared three different approaches:
Clustering: This approach was explained in detail in Section 3.1.
WTA-Hash: Winner Takes All hashing (Vijayanarasimhan et al., 2014) is a hashing-based K-MIPS algorithm which also converts MIPS to MCSS by augmenting additional dimensions to the vectors. This method uses n hash functions, each of which applies p different random permutations to the vector. The prefix constituted by the first k elements of each permuted vector is then used to construct the hash for the vector (a sketch follows the method descriptions below).
PCA-Tree: PCA-Tree (Bachrach et al., 2014) is the state-of-the-art tree-based method, which con-
verts MIPS to NNS by vector augmentation. It uses the principal components of the data to construct
a balanced binary tree with data residing in the leaves.
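As a rough illustration of the WTA-Hash construction described above (our own code, not from the paper or from Vijayanarasimhan et al.; we follow the common formulation in which each permutation contributes the index of the maximum within its k-element prefix):

```python
def make_wta_hashes(n, p, d, rng=np.random.RandomState(0)):
    """n WTA hash functions, each defined by p random permutations of
    a d-dimensional vector (after MIPS-to-MCSS augmentation)."""
    return [[rng.permutation(d) for _ in range(p)] for _ in range(n)]

def wta_hash(x, hash_fns, k):
    """For each hash function, record the argmax within the k-element
    prefix of every permuted copy of x; each p-tuple is one hash code."""
    return [tuple(int(np.argmax(x[perm[:k]])) for perm in perms)
            for perms in hash_fns]
```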
For a fair comparison, we varied the hyper-parameters of each algorithm such that the average speedup was approximately the same. Table 1 shows the performance of all three methods, compared to the full softmax. From the table, it is clear that the clustering-based method performs significantly better than the other two methods. However, all three fall short of the performance of the full softmax.
As a next experiment, we analyze the various strategies proposed in Section 3.1 for reducing the approximation bias of clustering-based K-MIPS:
Top-K: This strategy picks the vectors in the top K clusters as candidates.
Sample-K: This strategy samples K clusters, without replacement, from a probability distribution based on the dot product of the query with the cluster centroids. When combined with the Top-K strategy, clusters already selected by Top-K are excluded from the sampling.
Rand-block: This strategy divides the memory into several blocks and uniformly samples a random
block as candidate.
We experimented with 1000 clusters and 2000 clusters. While comparing the various training strategies, we made sure that the effective speedup was approximately the same. Each model accesses approximately 20,000 facts per query, yielding a 5x speedup.
Results are given in Table 2. We observe that the best approach is to combine the Top-K and Sample-
K strategies, with Rand-block not being beneficial. Interestingly, the worst performances correspond
to cases where the Sample-K strategy is ignored.


                                     1000 clusters        2000 clusters
Top-K    Sample-K    Rand-block    Test Acc.   Epochs   Test Acc.   Epochs
Yes      No          No            50.2        16       51.5        22
No       Yes         No            52.5        68       52.8        63
Yes      Yes         No            52.8        31       53.1        26
Yes      No          Yes           51.8        32       52.3        26
Yes      Yes         Yes           52.5        38       52.7        19

Table 2: Accuracy on the SQ test set and number of epochs to convergence.

6 CONCLUSION
In this paper, we proposed a hierarchical memory network that exploits K-MIPS for its attention-based reader. Unlike soft attention readers, the K-MIPS attention reader scales easily to larger memories. This is achieved by organizing the memory in a hierarchical way. Experiments on the
SimpleQuestions dataset demonstrate that exact K-MIPS attention is better than soft attention. How-
ever, existing state-of-the-art approximate K-MIPS techniques provide a speedup at the cost of some
accuracy. Future research will investigate designing efficient dynamic K-MIPS algorithms, where
the memory can be dynamically updated during training. This should reduce the approximation bias
and hence improve the overall performance.

REFERENCES
Alex Auvolat, Sarath Chandar, Pascal Vincent, Hugo Larochelle, and Yoshua Bengio. Clustering
is efficient for approximate maximum inner product search. arXiv preprint arXiv:1507.05910,
2015.
Yoram Bachrach et al. Speeding up the Xbox recommender system using a Euclidean transformation for inner-product spaces. In RecSys '14, pp. 257–264, 2014.
Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko.
Translating embeddings for modeling multi-relational data. In Advances in NIPS, pp. 2787–2795.
2013.
Antoine Bordes, Nicolas Usunier, Sumit Chopra, and Jason Weston. Large-scale simple question
answering with memory networks. arXiv preprint arXiv:1506.02075, 2015.
Jesse Dodge, Andreea Gane, Xiang Zhang, Antoine Bordes, Sumit Chopra, Alexander Miller, Arthur
Szlam, and Jason Weston. Evaluating prerequisite qualities for learning end-to-end dialog sys-
tems. CoRR, abs/1511.06931, 2015.
Alex Graves, Greg Wayne, and Ivo Danihelka. Neural Turing machines. arXiv preprint arXiv:1410.5401, 2014.
Sébastien Jean, KyungHyun Cho, Roland Memisevic, and Yoshua Bengio. On using very large target vocabulary for neural machine translation. In Proceedings of ACL 2015, pp. 1–10, 2015.
Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR,
abs/1412.6980, 2014.
Ankit Kumar et al. Ask me anything: Dynamic memory networks for natural language processing.
CoRR, abs/1506.07285, 2015.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word repre-
sentations in vector space. In International Conference on Learning Representations, Workshop
Track, 2013.
Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. arXiv
preprint arXiv:1402.0030, 2014.
Frederic Morin and Yoshua Bengio. Hierarchical probabilistic neural network language model. In
Robert G. Cowell and Zoubin Ghahramani (eds.), Proceedings of AISTATS, pp. 246–252, 2005.


Behnam Neyshabur and Nathan Srebro. On symmetric and asymmetric LSHs for inner product search.
In Proceedings of the 31st International Conference on Machine Learning, 2015.
Jack W Rae, Jonathan J Hunt, Tim Harley, Ivo Danihelka, Andrew Senior, Greg Wayne, Alex
Graves, and Timothy P Lillicrap. Scaling memory-augmented neural networks with sparse reads
and writes. In Advances in NIPS. 2016.
Parikshit Ram and Alexander G. Gray. Maximum inner-product search using cone trees. KDD ’12,
pp. 931–939, 2012.
Anshumali Shrivastava and Ping Li. Asymmetric LSH (ALSH) for sublinear time maximum inner
product search (MIPS). In Advances in Neural Information Processing Systems 27, pp. 2321–
2329, 2014.
Anshumali Shrivastava and Ping Li. Improved asymmetric locality sensitive hashing (ALSH) for maximum inner product search (MIPS). In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), 2015.
Ryan Spring and Anshumali Shrivastava. Scalable and sustainable deep learning via randomized
hashing. CoRR, abs/1602.08194, 2016.
Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. End-to-end memory networks.
arXiv preprint arXiv:1503.08895, 2015.
Sudheendra Vijayanarasimhan, Jon Shlens, Rajat Monga, and Jay Yagnik. Deep networks with large
output spaces. arXiv preprint arXiv:1412.7479, 2014.
Jason Weston, Antoine Bordes, Sumit Chopra, and Tomas Mikolov. Towards AI-complete question answering: a set of prerequisite toy tasks. arXiv preprint arXiv:1502.05698, 2015a.
Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. In Proceedings Of The
International Conference on Representation Learning (ICLR 2015), 2015b. In Press.
Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement
learning. Machine Learning, 8:229–256, 1992.
Caiming Xiong, Stephen Merity, and Richard Socher. Dynamic memory networks for visual and
textual question answering. CoRR, abs/1603.01417, 2016.
Wojciech Zaremba and Ilya Sutskever. Reinforcement learning neural turing machines. CoRR,
abs/1505.00521, 2015.
Shi Zhong. Efficient online spherical k-means clustering. In Proceedings of the 2005 IEEE International Joint Conference on Neural Networks (IJCNN'05), volume 5, pp. 3180–3185. IEEE, 2005.
