
The Annotated Transformer

Alexander M. Rush
[email protected]
Harvard University

Abstract

A major aim of open-source NLP is to quickly and accurately reproduce the results of new work, in a manner that the community can easily use and modify. While most papers publish enough detail for replication, it may still be difficult to achieve good results in practice. This paper is an experiment. In it, I consider a worked exercise with the goal of implementing the results of the recent paper. The replication exercise aims at a simple code structure that follows closely with the original work, while achieving an efficient, usable system. An implicit premise of this exercise is to encourage researchers to consider this method for new results.

1 Introduction

Replication of published results remains a challenging issue in open-source NLP. When a new paper is published with major improvements, it is common for many members of the community to independently reproduce the numbers experimentally, which is often a struggle. Practically this makes it difficult to improve scores, but, also importantly, it is a pedagogical issue if students cannot reproduce results from scientific publications.

The recent turn towards deep learning has exacerbated this issue. New models require extensive hyperparameter tuning and long training times. Small mistakes can cause major issues. Fortunately though, new toolsets have made it possible to write simpler, more mathematically declarative code.

In this experimental paper, I propose an exercise in open-source NLP. The goal is to transcribe a recent paper into a simple and understandable form. The document itself is presented as an annotated paper. That is, the main document (in a different font) is an excerpt of the recent paper "Attention is All You Need" (Vaswani et al., 2017). I add annotation in the form of italicized comments and include code in PyTorch directly in the paper itself.

Note that this document itself is presented as a blog post¹ and is completely executable as a notebook. In the spirit of reproducibility, this work is distilled from the same source with images inline.

¹ Presented at http://nlp.seas.harvard.edu/2018/04/03/attention.html with source code at https://github.com/harvardnlp/annotated-transformer
2 Background

The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU, ByteNet and ConvS2S, all of which use convolutional neural networks as basic building block, computing hidden representations in parallel for all input and output positions. In these models, the number of operations required to relate signals from two arbitrary input or output positions grows in the distance between positions, linearly for ConvS2S and logarithmically for ByteNet. This makes it more difficult to learn dependencies between distant positions. In the Transformer this is reduced to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention.

Self-attention, sometimes called intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. Self-attention has been used successfully in a variety of tasks including reading comprehension, abstractive summarization, textual entailment and learning task-independent sentence representations. End-to-end memory networks are based on a recurrent attention mechanism instead of sequence-aligned recurrence and have been shown to perform well on simple-language question answering and language modeling tasks.

To the best of our knowledge, however, the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution.

3 Model Architecture

Most competitive neural sequence transduction models have an encoder-decoder structure (Bahdanau et al., 2014). Here, the encoder maps an input sequence of symbol representations (x_1, ..., x_n) to a sequence of continuous representations z = (z_1, ..., z_n). Given z, the decoder then generates an output sequence (y_1, ..., y_m) of symbols one element at a time. At each step the model is auto-regressive (Graves, 2013), consuming the previously generated symbols as additional input when generating the next.

# Imports used by the code throughout this paper.
import copy
import math
import time

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable
import matplotlib.pyplot as plt
import seaborn

class EncoderDecoder(nn.Module):
    """
    A standard Encoder-Decoder architecture.
    Base for this and many other models.
    """
    def __init__(self, encoder, decoder, src_embed,
                 tgt_embed, generator):
        super(EncoderDecoder, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.src_embed = src_embed
        self.tgt_embed = tgt_embed
        self.generator = generator

    def forward(self, src, tgt, src_mask, tgt_mask):
        "Take in and process masked src and target sequences."
        return self.decode(self.encode(src, src_mask),
                           src_mask, tgt, tgt_mask)

    def encode(self, src, src_mask):
        return self.encoder(self.src_embed(src), src_mask)

    def decode(self, memory, src_mask, tgt, tgt_mask):
        return self.decoder(self.tgt_embed(tgt), memory,
                            src_mask, tgt_mask)

class Generator(nn.Module):
    "Define standard linear + softmax generation step."
    def __init__(self, d_model, vocab):
        super(Generator, self).__init__()
        self.proj = nn.Linear(d_model, vocab)

    def forward(self, x):
        return F.log_softmax(self.proj(x), dim=-1)

The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1, respectively.
3.1 Encoder and Decoder Stacks

3.1.1 Encoder

The encoder is composed of a stack of N = 6 identical layers.

def clones(module, N):
    "Produce N identical layers."
    return nn.ModuleList([copy.deepcopy(module)
                          for _ in range(N)])

class Encoder(nn.Module):
    "Core encoder is a stack of N layers"
    def __init__(self, layer, N):
        super(Encoder, self).__init__()
        self.layers = clones(layer, N)
        self.norm = LayerNorm(layer.size)

    def forward(self, x, mask):
        "Pass the input/mask through each layer in turn."
        for layer in self.layers:
            x = layer(x, mask)
        return self.norm(x)

We employ a residual connection (He et al., 2016) around each of the two sub-layers, followed by layer normalization (Ba et al., 2016).

class LayerNorm(nn.Module):
    "Construct a layernorm module (See citation for details)."
    def __init__(self, features, eps=1e-6):
        super(LayerNorm, self).__init__()
        self.a_2 = nn.Parameter(torch.ones(features))
        self.b_2 = nn.Parameter(torch.zeros(features))
        self.eps = eps

    def forward(self, x):
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        return (self.a_2 * (x - mean) /
                (std + self.eps) + self.b_2)

That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. We apply dropout (Srivastava et al., 2014) to the output of each sub-layer, before it is added to the sub-layer input and normalized.

To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension d_model = 512.

class SublayerConnection(nn.Module):
    """
    A layer norm followed by a residual connection.
    Note norm is not applied to residual x.
    """
    def __init__(self, size, dropout):
        super(SublayerConnection, self).__init__()
        self.norm = LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        "Apply residual connection to sublayer fn."
        return x + self.dropout(sublayer(self.norm(x)))

Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network.

class EncoderLayer(nn.Module):
    "Encoder calls self-attn and feed forward."
    def __init__(self, size, self_attn,
                 feed_forward, dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = self_attn
        self.feed_forward = feed_forward
        sublayer = SublayerConnection(size, dropout)
        self.sublayer = clones(sublayer, 2)
        self.size = size

    def forward(self, x, mask):
        "Follow Figure 1 (left) for connections."
        x = self.sublayer[0](x, lambda x:
                             self.self_attn(x, x, x, mask))
        return self.sublayer[1](x, self.feed_forward)

3.1.2 Decoder

The decoder is also composed of a stack of N = 6 identical layers.

class Decoder(nn.Module):
    "Generic N layer decoder with masking."
    def __init__(self, layer, N):
        super(Decoder, self).__init__()
        self.layers = clones(layer, N)
        self.norm = LayerNorm(layer.size)

    def forward(self, x, memory, src_mask, tgt_mask):
        for layer in self.layers:
            x = layer(x, memory, src_mask, tgt_mask)
        return self.norm(x)

In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization.

class DecoderLayer(nn.Module):
    "Decoder calls self-attn, src-attn, and feed forward."
    def __init__(self, size, self_attn,
                 src_attn, feed_forward, dropout):
        super(DecoderLayer, self).__init__()
        self.self_attn = self_attn
        self.src_attn = src_attn
        self.feed_forward = feed_forward
        sublayer = SublayerConnection(size, dropout)
        self.sublayer = clones(sublayer, 3)
        self.size = size

    def forward(self, x, memory, s_mask, t_mask):
        "Follow Figure 1 (right) for connections."
        m = memory
        x = self.sublayer[0](x, lambda x:
                             self.self_attn(x, x, x, t_mask))
        x = self.sublayer[1](x, lambda x:
                             self.src_attn(x, m, m, s_mask))
        return self.sublayer[2](x, self.feed_forward)

We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.

def subsequent_mask(size):
    "Mask out subsequent positions."
    attn_shape = (1, size, size)
    subsequent_mask = np.triu(np.ones(attn_shape), k=1)
    return torch.from_numpy(
        subsequent_mask.astype('uint8')) == 0
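As a quick illustrative check (my addition, not part of the paper), the mask for a short sequence is lower triangular, so each position may attend only to itself and to earlier positions:

# Illustrative check of the mask pattern (not from the paper).
m = subsequent_mask(4)[0]   # shape (4, 4), lower-triangular pattern of ones
assert bool(m[2, 2]) and not bool(m[2, 3])   # position 2 sees position 2, not position 3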
3.1.3 Attention

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

We call our particular attention "Scaled Dot-Product Attention". The input consists of queries and keys of dimension d_k, and values of dimension d_v. We compute the dot products of the query with all keys, divide each by √d_k, and apply a softmax function to obtain the weights on the values.

In practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix Q. The keys and values are also packed together into matrices K and V. We compute the matrix of outputs as:

    Attention(Q, K, V) = softmax(QK^T / √d_k) V

def attention(query, key, value, mask=None, dropout=None):
    "Compute 'Scaled Dot Product Attention'"
    d_k = query.size(-1)
    key_t = key.transpose(-2, -1)
    scores = torch.matmul(query, key_t) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    p_attn = F.softmax(scores, dim=-1)
    if dropout is not None:
        p_attn = dropout(p_attn)
    return torch.matmul(p_attn, value), p_attn

The two most commonly used attention functions are additive attention (Bahdanau et al., 2014), and dot-product (multiplicative) attention. Dot-product attention is identical to our algorithm, except for the scaling factor of 1/√d_k. Additive attention computes the compatibility function using a feed-forward network with a single hidden layer. While the two are similar in theoretical complexity, dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code.

While for small values of d_k the two mechanisms perform similarly, additive attention outperforms dot product attention without scaling for larger values of d_k (Britz et al., 2017). We suspect that for large values of d_k, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients. (To illustrate why the dot products get large, assume that the components of q and k are independent random variables with mean 0 and variance 1. Then their dot product, q · k = Σ_{i=1}^{d_k} q_i k_i, has mean 0 and variance d_k.) To counteract this effect, we scale the dot products by 1/√d_k.
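As a small empirical sketch of this argument (my addition, not part of the paper), sampling random queries and keys shows the variance of the raw dot products growing to roughly d_k, while the scaled version stays near 1:

# Illustrative check of the scaling argument (not from the paper).
d_k = 64
q = torch.randn(100000, d_k)
k = torch.randn(100000, d_k)
dots = (q * k).sum(-1)
print(dots.var())                     # roughly d_k = 64
print((dots / math.sqrt(d_k)).var())  # roughly 1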
Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.

    MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O
    where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

where the projections are parameter matrices W_i^Q ∈ R^(d_model × d_k), W_i^K ∈ R^(d_model × d_k), W_i^V ∈ R^(d_model × d_v) and W^O ∈ R^(h·d_v × d_model). In this work we employ h = 8 parallel attention layers, or heads. For each of these we use d_k = d_v = d_model / h = 64. Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality.
class MultiHeadedAttention(nn.Module):
    def __init__(self, h, d_model, dropout=0.1):
        "Take in model size and number of heads."
        super(MultiHeadedAttention, self).__init__()
        assert d_model % h == 0
        # We assume d_v always equals d_k
        self.d_k = d_model // h
        self.h = h
        self.linears = clones(nn.Linear(d_model, d_model), 4)
        self.attn = None
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, query, key, value, mask=None):
        "Implements Figure 2"
        if mask is not None:
            # Same mask applied to all h heads.
            mask = mask.unsqueeze(1)
        nb = query.size(0)

        # 1) Do all the linear projections in batch from d_model => h x d_k
        query, key, value = [
            l(x).view(nb, -1, self.h, self.d_k).transpose(1, 2)
            for l, x in zip(self.linears, (query, key, value))]

        # 2) Apply attention on all the projected vectors in batch.
        x, self.attn = attention(query, key, value, mask=mask,
                                 dropout=self.dropout)

        # 3) "Concat" using a view and apply a final linear.
        x = x.transpose(1, 2).contiguous().view(
            nb, -1, self.h * self.d_k)
        return self.linears[-1](x)
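As an illustrative usage (my addition, not from the paper), a quick shape check confirms that multi-head attention maps a (batch, positions, d_model) input back to the same shape:

# Illustrative shape check for the module above.
mha = MultiHeadedAttention(h=8, d_model=512)
x = Variable(torch.zeros(2, 10, 512))   # (batch, positions, d_model)
print(mha(x, x, x).size())              # (2, 10, 512)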
3.2 Position-wise Feed-Forward Networks

In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically. This consists of two linear transformations with a ReLU activation in between.

    FFN(x) = max(0, x W_1 + b_1) W_2 + b_2

While the linear transformations are the same across different positions, they use different parameters from layer to layer. Another way of describing this is as two convolutions with kernel size 1. The dimensionality of input and output is d_model = 512, and the inner layer has dimensionality d_ff = 2048.

class PositionwiseFeedForward(nn.Module):
    "Implements FFN equation."
    def __init__(self, d_model, d_ff, dropout=0.1):
        super(PositionwiseFeedForward, self).__init__()
        self.w_1 = nn.Linear(d_model, d_ff)
        self.w_2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.w_2(self.dropout(F.relu(self.w_1(x))))

3.3 Embeddings and Softmax

Similarly to other sequence transduction models, we use learned embeddings to convert the input tokens and output tokens to vectors of dimension d_model. We also use the usual learned linear transformation and softmax function to convert the decoder output to predicted next-token probabilities. In our model, we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation, similar to (Press and Wolf, 2016). In the embedding layers, we multiply those weights by √d_model.

class Embeddings(nn.Module):
    def __init__(self, d_model, vocab):
        super(Embeddings, self).__init__()
        self.lut = nn.Embedding(vocab, d_model)
        self.d_model = d_model

    def forward(self, x):
        return self.lut(x) * math.sqrt(self.d_model)
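The code in this excerpt does not show the weight sharing explicitly. Below is a minimal sketch of one way to tie the matrices, using the Embeddings and Generator classes defined here; the helper itself is an illustrative addition, not the paper's code:

def tie_weights(embedding, generator):
    "Illustrative sketch: back the generator's projection with the embedding matrix."
    # nn.Embedding and nn.Linear both store their weights with shape (vocab, d_model),
    # so a single Parameter can serve both modules.
    generator.proj.weight = embedding.lut.weight

With a shared source-target vocabulary, the same trick can also tie the source and target embedding tables to one another.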
3.4 Positional Encoding

Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence. To this end, we add "positional encodings" to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension d_model as the embeddings, so that the two can be summed. There are many choices of positional encodings, learned and fixed (Gehring et al., 2017).

In this work, we use sine and cosine functions of different frequencies:

    PE_(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE_(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

where pos is the position and i is the dimension. That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from 2π to 10000 · 2π. We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset k, PE_(pos+k) can be represented as a linear function of PE_pos.

In addition, we apply dropout to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks. For the base model, we use a rate of P_drop = 0.1.
class PositionalEncoding(nn.Module):
    "Implement the PE function."
    def __init__(self, d_model, dropout, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)

        # Compute the positional encodings once in log space.
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) *
                             -(math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)

    def forward(self, x):
        x = x + Variable(self.pe[:, :x.size(1)],
                         requires_grad=False)
        return self.dropout(x)

plt.figure(figsize=(15, 5))
pe = PositionalEncoding(20, 0)
y = pe.forward(Variable(torch.zeros(1, 100, 20)))
plt.plot(np.arange(100), y[0, :, 4:8].data.numpy())
plt.legend(["dim %d" % p for p in [4, 5, 6, 7]])
None
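As a small sanity check (my addition, not the paper's), the buffer computed in log space above matches the closed-form sin/cos definition directly:

# Compare one entry of the precomputed table with the formula in Section 3.4.
penc = PositionalEncoding(20, 0)        # d_model = 20, no dropout
pos, i = 3, 2
direct = math.sin(pos / (10000 ** (2.0 * i / 20)))
assert abs(float(penc.pe[0, pos, 2 * i]) - direct) < 1e-5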
We also experimented with using learned positional embeddings (Gehring et al., 2017) instead, and found that the two versions produced nearly identical results. We chose the sinusoidal version because it may allow the model to extrapolate to sequence lengths longer than the ones encountered during training.

def make_model(src_vocab, tgt_vocab, N=6,
               d_model=512, d_ff=2048, h=8, dropout=0.1):
    "Helper: Construct a model from hyperparameters."
    c = copy.deepcopy
    attn = MultiHeadedAttention(h, d_model)
    ff = PositionwiseFeedForward(d_model, d_ff, dropout)
    position = PositionalEncoding(d_model, dropout)
    d = d_model
    model = EncoderDecoder(
        Encoder(EncoderLayer(d, c(attn), c(ff), dropout), N),
        Decoder(DecoderLayer(d, c(attn), c(attn),
                             c(ff), dropout), N),
        nn.Sequential(Embeddings(d, src_vocab), c(position)),
        nn.Sequential(Embeddings(d, tgt_vocab), c(position)),
        Generator(d_model, tgt_vocab))

    # This was important from their code.
    # Initialize parameters with Glorot / fan_avg.
    for p in model.parameters():
        if p.dim() > 1:
            nn.init.xavier_uniform(p)
    return model
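As a quick illustrative usage (an addition, not part of this excerpt), the helper can build a small model with toy vocabularies and fewer layers; the paper's base configuration corresponds to the default arguments:

# Small example model: 10-symbol vocabularies and N = 2 layers.
tmp_model = make_model(10, 10, N=2)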
4 Training

This section describes the training regime for our models.

4.1 Batches and Masking

class Batch:
    "Batch of data with mask for training."
    def __init__(self, src, trg=None, pad=0):
        self.src = src
        self.src_mask = (src != pad).unsqueeze(-2)
        if trg is not None:
            self.trg = trg[:, :-1]
            self.trg_y = trg[:, 1:]
            self.trg_mask = self.make_std_mask(self.trg, pad)
            self.ntokens = (self.trg_y != pad).data.sum()

    @staticmethod
    def make_std_mask(tgt, pad):
        "Create a mask to hide padding and future words."
        tgt_mask = (tgt != pad).unsqueeze(-2)
        tgt_mask = tgt_mask & Variable(
            subsequent_mask(tgt.size(-1))
            .type_as(tgt_mask.data))
        return tgt_mask

4.2 Training Loop

def run_epoch(data_iter, model, loss_compute):
    "Standard Training and Logging Function"
    start = time.time()
    total_tokens = 0
    total_loss = 0
    tokens = 0
    for i, batch in enumerate(data_iter):
        out = model.forward(batch.src, batch.trg,
                            batch.src_mask, batch.trg_mask)
        loss = loss_compute(out, batch.trg_y, batch.ntokens)
        total_loss += loss
        total_tokens += batch.ntokens
        tokens += batch.ntokens
        if i % 50 == 1:
            elapsed = time.time() - start
            print("Epoch Step: %d Loss: %f Tokens / Sec: %f" %
                  (i, loss / batch.ntokens, tokens / elapsed))
            start = time.time()
            tokens = 0
    return total_loss / total_tokens

4.3 Training Data and Batching

We trained on the standard WMT 2014 English-German dataset consisting of about 4.5 million sentence pairs. Sentences were encoded using byte-pair encoding, which has a shared source-target vocabulary of about 37000 tokens. For English-French, we used the significantly larger WMT 2014 English-French dataset consisting of 36M sentences and split tokens into a 32000 word-piece vocabulary.

Sentence pairs were batched together by approximate sequence length. Each training batch contained a set of sentence pairs containing approximately 25000 source tokens and 25000 target tokens.

global max_src_in_batch, max_tgt_in_batch
def batch_size_fn(new, count, sofar):
    "Calculate total number of tokens + padding."
    global max_src_in_batch, max_tgt_in_batch
    if count == 1:
        max_src_in_batch = 0
        max_tgt_in_batch = 0
    max_src_in_batch = max(max_src_in_batch,
                           len(new.src))
    max_tgt_in_batch = max(max_tgt_in_batch,
                           len(new.trg) + 2)
    src_elements = count * max_src_in_batch
    tgt_elements = count * max_tgt_in_batch
    return max(src_elements, tgt_elements)

4.4 Hardware and Schedule

We trained our models on one machine with 8 NVIDIA P100 GPUs. For our base models using the hyperparameters described throughout the paper, each training step took about 0.4 seconds. We trained the base models for a total of 100,000 steps or 12 hours. For our big models, step time was 1.0 seconds. The big models were trained for 300,000 steps (3.5 days).

4.5 Optimizer

We used the Adam optimizer (Kingma and Ba, 2014) with β_1 = 0.9, β_2 = 0.98 and ε = 10^(-9). We varied the learning rate over the course of training, according to the formula:

    lrate = d_model^(-0.5) · min(step_num^(-0.5), step_num · warmup_steps^(-1.5))

This corresponds to increasing the learning rate linearly for the first warmup_steps training steps, and decreasing it thereafter proportionally to the inverse square root of the step number. We used warmup_steps = 4000.

class NoamOpt:
    "Optim wrapper that implements rate."
    def __init__(self, model_size, factor,
                 warmup, optimizer):
        self.optimizer = optimizer
        self._step = 0
        self.warmup = warmup
        self.factor = factor
        self.model_size = model_size
        self._rate = 0

    def step(self):
        "Update parameters and rate"
        self._step += 1
        rate = self.rate()
        for p in self.optimizer.param_groups:
            p['lr'] = rate
        self._rate = rate
        self.optimizer.step()

    def rate(self, step=None):
        "Implement `lrate` above"
        if step is None:
            step = self._step
        return self.factor * (
            self.model_size ** (-0.5) *
            min(step ** (-0.5), step * self.warmup ** (-1.5)))

def get_std_opt(model):
    return NoamOpt(model.src_embed[0].d_model, 2, 4000,
                   torch.optim.Adam(model.parameters(),
                                    lr=0,
                                    betas=(0.9, 0.98),
                                    eps=1e-9))

# Three settings of the lrate hyperparameters.
opts = [NoamOpt(512, 1, 4000, None),
        NoamOpt(512, 1, 8000, None),
        NoamOpt(256, 1, 4000, None)]
plt.plot(np.arange(1, 20000), [[opt.rate(i) for opt in opts]
                               for i in range(1, 20000)])
plt.legend(["512:4000", "512:8000", "256:4000"])
None

4.6 Regularization

4.6.1 Label Smoothing

During training, we employed label smoothing of value ε_ls = 0.1 (Szegedy et al., 2015). This hurts perplexity, as the model learns to be more unsure, but improves accuracy and BLEU score.

class LabelSmoothing(nn.Module):
    "Implement label smoothing."
    def __init__(self, size, padding_idx, smoothing=0.0):
        super(LabelSmoothing, self).__init__()
        self.criterion = nn.KLDivLoss(size_average=False)
        self.padding_idx = padding_idx
        self.confidence = 1.0 - smoothing
        self.smoothing = smoothing
        self.size = size
        self.true_dist = None

    def forward(self, x, target):
        assert x.size(1) == self.size
        true_dist = x.data.clone()
        true_dist.fill_(self.smoothing / (self.size - 2))
        true_dist.scatter_(1, target.data.unsqueeze(1),
                           self.confidence)
        true_dist[:, self.padding_idx] = 0
        mask = torch.nonzero(target.data == self.padding_idx)
        if mask.dim() > 0:
            true_dist.index_fill_(0, mask.squeeze(), 0.0)
        self.true_dist = true_dist
        return self.criterion(x,
                              Variable(true_dist,
                                       requires_grad=False))

# Example of label smoothing.
crit = LabelSmoothing(5, 0, 0.4)
predict = torch.FloatTensor(
    [[0, 0.2, 0.7, 0.1, 0],
     [0, 0.2, 0.7, 0.1, 0],
     [0, 0.2, 0.7, 0.1, 0]])
v = crit(Variable(predict.log()),
         Variable(torch.LongTensor([2, 1, 0])))

# Show the target distributions expected by the system.
plt.imshow(crit.true_dist)
None
# Label smoothing starts to penalize the model if it becomes
# very confident about a given choice.
crit = LabelSmoothing(5, 0, 0.1)
def loss(x):
    d = x + 3 * 1
    predict = torch.FloatTensor([[0, x / d, 1 / d,
                                  1 / d, 1 / d]])
    return crit(Variable(predict.log()),
                Variable(torch.LongTensor([1]))).data[0]
plt.plot(np.arange(1, 100),
         [loss(x) for x in range(1, 100)])
None

4.7 Loss Computation

class SimpleLossCompute:
    "A simple loss compute and train function."
    def __init__(self, generator,
                 criterion, opt=None):
        self.generator = generator
        self.criterion = criterion
        self.opt = opt

    def __call__(self, x, y, norm):
        x = self.generator(x)
        loss = self.criterion(
            x.contiguous().view(-1, x.size(-1)),
            y.contiguous().view(-1)) / norm
        loss.backward()
        if self.opt is not None:
            self.opt.step()
            self.opt.optimizer.zero_grad()
        return loss.data[0] * norm
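To show how these pieces fit together, the following is a minimal end-to-end sketch that trains a small model on a synthetic copy task, following the PyTorch 0.3-era API used throughout this paper. The data generator, vocabulary size, and settings are illustrative assumptions, not part of this excerpt:

def data_gen(V, batch, nbatches):
    "Generate random data for a toy src-tgt copy task."
    for _ in range(nbatches):
        data = torch.from_numpy(
            np.random.randint(1, V, size=(batch, 10)))
        data[:, 0] = 1                      # fixed start symbol
        src = Variable(data, requires_grad=False)
        tgt = Variable(data, requires_grad=False)
        yield Batch(src, tgt, 0)

# Wire the model, criterion, optimizer, and training loop together.
V = 11
criterion = LabelSmoothing(size=V, padding_idx=0, smoothing=0.0)
model = make_model(V, V, N=2)
model_opt = get_std_opt(model)
model.train()
run_epoch(data_gen(V, 30, 5), model,
          SimpleLossCompute(model.generator, criterion, model_opt))

This serves as a smoke test that the batching, masking, loss, and optimizer wiring run end to end before launching the full WMT training described in Section 4.3.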

5 Decoding and Visualization

5.1 Greedy Decoding

def greedy_decode(model, src, src_mask,
                  max_len, start_symbol):
    memory = model.encode(src, src_mask)
    ys = torch.ones(1, 1).fill_(start_symbol).type_as(src.data)
    for i in range(max_len - 1):
        out = model.decode(memory, src_mask,
                           Variable(ys),
                           Variable(
                               subsequent_mask(ys.size(1))
                               .type_as(src.data)))
        prob = model.generator(out[:, -1])
        _, next_word = torch.max(prob, dim=1)
        next_word = next_word.data[0]
        ys = torch.cat([ys,
                        torch.ones(1, 1)
                        .type_as(src.data)
                        .fill_(next_word)],
                       dim=1)
    return ys

model.eval()
sent = """@@@The @@@log @@@file @@@can @@@be @@@sent @@@secret ly
@@@with @@@email @@@or @@@FTP @@@to @@@a @@@specified
@@@receiver""".split()
src = torch.LongTensor([[SRC.stoi[w] for w in sent]])
src = Variable(src)
src_mask = (src != SRC.stoi["<blank>"]).unsqueeze(-2)
out = greedy_decode(model, src, src_mask,
                    max_len=60,
                    start_symbol=TGT.stoi["<s>"])
print("Translation:", end="\t")
trans = "<s> "
for i in range(1, out.size(1)):
    sym = TGT.itos[out[0, i]]
    if sym == "</s>":
        break
    trans += sym + " "
print(trans)

5.2 Attention Visualization

tgt_sent = trans.split()
def draw(data, x, y, ax):
    seaborn.heatmap(data,
                    xticklabels=x, square=True,
                    yticklabels=y, vmin=0.0, vmax=1.0,
                    cbar=False, ax=ax)

for layer_num in range(1, 6, 2):
    fig, axs = plt.subplots(1, 4, figsize=(20, 10))
    print("Encoder Layer", layer_num + 1)
    layer = model.encoder.layers[layer_num]
    for h in range(4):
        draw(layer.self_attn.attn[0, h].data,
             sent, sent if h == 0 else [], ax=axs[h])
    plt.show()

for layer_num in range(1, 6, 2):
    fig, axs = plt.subplots(1, 4, figsize=(20, 10))
    print("Decoder Self Layer", layer_num + 1)
    layer = model.decoder.layers[layer_num]
    for h in range(4):
        draw(layer.self_attn.attn[0, h]
             .data[:len(tgt_sent), :len(tgt_sent)],
             tgt_sent, tgt_sent if h == 0 else [], ax=axs[h])
    plt.show()
    print("Decoder Src Layer", layer_num + 1)
    fig, axs = plt.subplots(1, 4, figsize=(20, 10))
    for h in range(4):
        draw(layer.src_attn.attn[0, h].data[
             :len(tgt_sent), :len(sent)],
             sent, tgt_sent if h == 0 else [], ax=axs[h])
    plt.show()

6 Conclusion

This paper presents a replication exercise of the transformer network. Consult the full online version for features such as multi-GPU training, real experiments on full translation problems, and pointers to other extensions such as beam search, sub-word models, and model averaging. The goal is to explore a literate programming experiment of interleaving model replication with formal writing. While not always possible, this modality can be useful for transmitting ideas and encouraging faster open-source uptake. Additionally, this method can be an easy way to learn about a model alongside its implementation.
References

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.

Denny Britz, Anna Goldie, Minh-Thang Luong, and Quoc V. Le. 2017. Massive exploration of neural machine translation architectures. CoRR, abs/1703.03906.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional sequence to sequence learning. CoRR, abs/1705.03122.

Alex Graves. 2013. Generating sequences with recurrent neural networks. CoRR, abs/1308.0850.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.

Ofir Press and Lior Wolf. 2016. Using the output embedding to improve language models. CoRR, abs/1608.05859.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. 2015. Rethinking the inception architecture for computer vision. CoRR, abs/1512.00567.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. CoRR, abs/1706.03762.
