[Figure 3: box plots per variant (V, CIFG, FGR, NP, NOG, NIAF, NIG, NFG, NOAF) for TIMIT, IAM Online (character error rate), and JSB Chorales (negative log-likelihood), shown for all trials (top) and the best 10% (bottom), with the number of parameters (×10^5) as a background histogram.]
Figure 3. Test set performance for all 200 trials (top) and for the best 10% of trials according to the validation set (bottom), for each dataset and variant. Boxes show the range between the 25th and 75th percentiles of the data, while the whiskers indicate the full range. The red dot marks the mean and the red line the median of the data. Boxes of variants that differ significantly from the vanilla LSTM are drawn in blue with thick lines. The grey histogram in the background shows the average number of parameters for the top 10% performers of every variant.
specific to our choice of search ranges. We have tried to choose reasonable ranges for the hyperparameters that include the best settings for each variant while still being small enough to allow for an effective search. The means and variances tend to be rather similar for the different variants and datasets, but even here some significant differences can be found.
In order to draw some more interesting conclusions, we restrict our further analysis to the top 10% performing trials for each combination of dataset and variant (see the bottom half of Figure 3). This way, our findings will be less dependent on the chosen search space and will be representative of the case of “reasonable hyperparameter tuning efforts.”9
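As a concrete illustration of this selection step, the following is a minimal sketch in Python, assuming the trials are collected in a hypothetical pandas table with columns dataset, variant, val_score, and test_score (lower is better); the column names and data layout are illustrative and not taken from any released code.

import pandas as pd

def top_decile(results):
    # Keep, for every (dataset, variant) pair, the 10% of trials with the best
    # validation score; their *test* scores are what Figure 3 (bottom) reports.
    def best_fraction(group, fraction=0.1):
        k = max(1, int(len(group) * fraction))
        return group.nsmallest(k, "val_score")   # lower score = better
    return results.groupby(["dataset", "variant"], group_keys=False).apply(best_fraction)

# Example summary of the retained trials:
# top_decile(results).groupby(["dataset", "variant"])["test_score"].describe()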
The first important observation based on Figure 3 is that removing the output activation function (NOAF) or the forget gate (NFG) significantly hurt performance on all three datasets. Apart from the CEC, the ability to forget old information and the squashing of the cell state appear to be critical for the LSTM architecture. Indeed, without the output activation function, the block output can in principle grow unbounded. Coupling the input and the forget gate avoids this problem and might render the use of an output non-linearity less important, which could explain why GRU performs well without it.
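To spell this argument out, here is a short sketch assuming the standard vanilla LSTM formulation with tanh block input and output activations, block input $\mathbf{z}^t$, gates $\mathbf{i}^t, \mathbf{f}^t, \mathbf{o}^t$, and cell state $\mathbf{c}^t$:
\[
\mathbf{c}^t = \mathbf{z}^t \odot \mathbf{i}^t + \mathbf{c}^{t-1} \odot \mathbf{f}^t, \qquad \mathbf{y}^t = h(\mathbf{c}^t) \odot \mathbf{o}^t .
\]
Without the output activation function, $\mathbf{y}^t = \mathbf{c}^t \odot \mathbf{o}^t$, and since the gates lie in $(0,1)$ and $\mathbf{z}^t \in (-1,1)$, the cell state is only guaranteed to satisfy $|\mathbf{c}^t| \le |\mathbf{c}^{t-1}| + 1$ elementwise, so the block output can grow linearly with the sequence length. Under CIFG, setting $\mathbf{f}^t = 1 - \mathbf{i}^t$ turns the update into an elementwise convex combination,
\[
\mathbf{c}^t = \mathbf{z}^t \odot \mathbf{i}^t + \mathbf{c}^{t-1} \odot (1 - \mathbf{i}^t),
\]
so $|\mathbf{c}^t| \le \max(|\mathbf{c}^{t-1}|, 1)$ and the unsquashed block output remains bounded.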
Input and forget gate coupling (CIFG) did not significantly change mean performance on any of the datasets, although the best performance improved slightly on music modeling. Similarly, removing peephole connections (NP) also did not lead to significant changes, but the best performance improved slightly for handwriting recognition. Both of these variants simplify LSTMs and reduce the computational complexity, so it might be worthwhile to incorporate these changes into the architecture.
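To make concrete how small these two modifications are in practice, the following is a minimal NumPy sketch of a single LSTM step with optional CIFG and NP switches; the stacked weight layout, argument names, and gate ordering are illustrative choices, not the paper's reference implementation.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, y_prev, c_prev, W, R, b, p, cifg=False, peepholes=True):
    # One forward step of an LSTM block.
    #   x: input (M,), y_prev: previous block output (N,), c_prev: previous cell state (N,)
    #   W: input weights (4N x M), R: recurrent weights (4N x N), b: biases (4N,)
    #   p: peephole weights (3N,), stacked as [input gate, forget gate, output gate]
    #   Row blocks of W, R, b are ordered: block input z, input gate i, forget gate f, output gate o.
    N = c_prev.shape[0]
    s = W @ x + R @ y_prev + b
    z = np.tanh(s[0:N])                                           # block input
    p_i, p_f, p_o = (p[0:N], p[N:2*N], p[2*N:3*N]) if peepholes else (0.0, 0.0, 0.0)
    i = sigmoid(s[N:2*N] + p_i * c_prev)                          # input gate
    f = 1.0 - i if cifg else sigmoid(s[2*N:3*N] + p_f * c_prev)   # CIFG couples f to i
    c = z * i + c_prev * f                                        # new cell state
    o = sigmoid(s[3*N:4*N] + p_o * c)                             # output gate peeks at the new cell state
    y = np.tanh(c) * o                                            # block output
    return y, c

With cifg=True the forget-gate rows of W, R, and b go unused, and with peepholes=False the peephole vector p is dropped entirely; this is where the (modest) savings in parameters and computation come from.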
Adding full gate recurrence (FGR) did not significantly change performance on TIMIT or IAM Online, but led to worse results on the JSB Chorales dataset. Given that this variant greatly increases the number of parameters, we generally advise against using it. Note that this feature was present in the original proposal of LSTM [14, 15], but has been absent in all following studies.
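The parameter overhead of FGR is easy to quantify: the variant adds recurrent connections from all gates to all gates, i.e. nine additional N × N recurrent weight matrices on top of the vanilla ones. A rough count, as a sketch (M inputs, N blocks; counting input weights, recurrent weights, biases, and peepholes):

def lstm_param_count(M, N, fgr=False):
    # Vanilla LSTM: 4 input weight matrices (N x M), 4 recurrent matrices (N x N),
    # 4 bias vectors, and 3 peephole vectors.
    count = 4 * N * M + 4 * N * N + 4 * N + 3 * N
    if fgr:
        # FGR: recurrent connections from every gate to every gate add 9 more N x N matrices.
        count += 9 * N * N
    return count

# For M = N = 100: 80,700 parameters for vanilla vs. 170,700 with FGR,
# i.e. FGR roughly doubles the model size in this setting.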
Removing the input gate (NIG), the output gate (NOG), and the input activation function (NIAF) led to a significant reduction in performance on speech and handwriting recognition. However, there was no significant effect on music modeling performance. A small (but statistically insignificant) average performance improvement was observed for the NIG and NIAF architectures on music modeling. We hypothesize that these behaviors will generalize to similar problems such as language modeling. For supervised learning on continuous real-valued

9 How much effort is “reasonable” will still depend on the search space. If the ranges are chosen much larger, the search will take much longer to find good hyperparameters.