
Learning to Generate Textual Data

Guillaume Bouchard†‡∗ and Pontus Stenetorp†∗ and Sebastian Riedel†


{g.bouchard,p.stenetorp,s.riedel}@cs.ucl.ac.uk

†Department of Computer Science, University College London
‡Bloomsbury AI
∗Contributed equally to this work.

Abstract

To learn text understanding models with millions of parameters, one needs a massive amount of data. We argue that generating data can compensate for this need for large datasets. While defining generic data generators is tricky, we propose to allow these generators to be “weakly” specified, letting the undetermined coefficients be learned from data. We derive an efficient algorithm called GENERE that jointly estimates the parameters of the model and the undetermined sampling coefficients, removing the need for costly cross-validation. We illustrate its benefits by learning to solve math exam questions using a highly parametrized sequence-to-sequence neural network.

1 Introduction

Many tasks require a large amount of background knowledge to be solved efficiently, but acquiring a huge amount of data is costly, both in terms of time and money. In several situations, a human trainer can specify domain knowledge by providing a generator of virtual data, such as a negative data sampler for implicit feedback in recommendation systems, physical 3D rendering engines as simulators of data in a computer vision system, simulators of physical processes to solve science exam questions, and math problem generators for the automatic solving of math problems. Domain-specific data simulators can generate an arbitrary amount of data that can be treated exactly the same way as standard observations, but since they are virtual, they can also be seen as regularizers dedicated to the task we want to solve (Scholkopf and Smola, 2001). While simple, the idea of data simulation is powerful and can lead to significantly better estimations of a predictive model because it prevents overfitting. At the same time, it is subject to a strong model bias, because such data generators often generate data that is different from the observed data.

Creating virtual samples is strongly linked to transfer learning when the task to transfer is correlated with the objective (Pan and Yang, 2010). The computer vision literature adopted this idea very early through the notion of virtual samples, which have a natural interpretation: by creating artificial perturbations of an image, its semantics is likely to be unchanged, i.e. training samples can be rotated, blurred, or slightly cropped without changing the category of the objects contained in the image (Niyogi et al., 1998).

However, for natural language applications, this idea of creating invariant transformations is difficult to apply directly, as simple meaning-preserving transformations, such as the replacement of words by their synonyms or active-passive verb transformations, are quite limited. More advanced meaning-preserving transformations would require an already good model that understands natural language. Another option, more targeted to textual applications, is to build top-down generators, such as probabilistic grammars, with a much wider coverage of linguistic phenomena. Being able to leverage many years of research in computational linguistics to create good data generators would be a natural and useful reuse of scientific knowledge, and better than blindly believing in the current trend of “data takes all”.

While the idea of generating data is straightforward, one could argue that it may be difficult to come up with good generators. What we mean by a good generator is the ability to help predict test data when the model is trained on the generated data. In this paper, we will show several types of generators, some contributing more than others in their ability to generalize to unseen data. In fact, finding good generators is more difficult than one might initially expect: should we generate data by modifying existing training samples, or “go wild” and derive a full probabilistic context-free grammar that could possibly generate unnatural examples and add noise to the estimator? While we do not arrive at a specific framework to build programs that generate virtual data, we assume in this work that a domain expert can easily write a program in her own programming language, leaving some generation parameters unspecified. In our approach these unspecified parameters are automatically learned from the data, by selecting the ones most compatible with the model. More precisely, we assume that the correct parameters are the ones for which the data generator gives on average the smallest penalty to the likelihood. Learning generative models discriminatively (Petrov and Klein, 2007) would be an alternative to our approach that could be more practical, as it decouples the definition of the prediction model from the structural biases that we want to impose through generative regularization.

In the next section, we introduce GENERE, a generic algorithm that extends any gradient-based learning approach with a data generator that can be tuned while learning the model on the training data using stochastic optimization. In Section 2.2, we show how GENERE can be adapted to handle a (possibly non-differentiable) black-box sampler without requiring modifications to it. We also illustrate how this framework can be implemented in practice for a specific use case: the automatic solving of math exam problems. Further discussion is given in the concluding section.

2 Regularization Based on a Generative Model

As with any machine learning approach, we assume that given the realisation of a variable x ∈ X representing the input, we want to predict the distribution of a variable y ∈ Y representing the output. The goal is to find this predictive distribution by learning it from examples D := {(x_i, y_i)}_{i=1}^n.

Building on the current success in the application of deep learning to NLP, we assume that there exists a good model family {f_θ, θ ∈ Θ} to predict y given x, where θ is an element of the parameter space Θ. For example, the stacked LSTM encoder-decoder is a general purpose model that has helped to improve results on relatively complex tasks, such as machine translation (Sutskever et al., 2014), syntactic parsing (Vinyals et al., 2014), semantic parsing (Dong and Lapata, 2016) and textual entailment (Rocktäschel et al., 2016).

One common issue with flexible probabilistic models is that they often require many training samples. For example, a nested stacked LSTM requires a very large number of examples to predict long and meaningful output sequences.

For many applications, the amount of training data is too small or too costly to acquire. We hence look for alternative ways to regularize the model so that we can achieve good performance using few data points. One way is to use surrogate tasks in a transfer learning framework, but finding these surrogate tasks is sometimes difficult. Another way is to use domain knowledge that can partially explain the data.

Let p_θ(y|x) be the target prediction model. Given the training dataset D, the penalized maximum likelihood estimator is obtained by min_{θ∈Θ} L(θ), where:

    L(θ) := − Σ_{i=1}^{n} log p_θ(y_i | x_i) + λ Ω(θ) .    (1)

Here, Ω(θ) is a regularizer that prevents over-fitting, and λ ∈ R is the regularization parameter that can be set by cross-validation. Instead of using a standard regularizer, such as the squared norm or the Lasso penalty, which are domain-agnostic, we propose in this paper to use a generative model to regularize the estimator.
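To make Eq. (1) concrete, the following toy sketch (ours, not the authors' code) evaluates the penalized negative log-likelihood for a linear softmax model; the model choice, the ℓ2 regularizer and all helper names are illustrative assumptions.

    # Toy sketch of Eq. (1): penalized negative log-likelihood for a linear
    # softmax model p_theta(y|x). Names and model choice are assumptions.
    import numpy as np

    def log_softmax(z):
        z = z - z.max(axis=-1, keepdims=True)
        return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

    def neg_log_likelihood(theta, X, y):
        """- sum_i log p_theta(y_i | x_i) for a linear softmax model."""
        logp = log_softmax(X @ theta)                 # (n, num_classes)
        return -logp[np.arange(len(y)), y].sum()

    def l2_regularizer(theta):
        return 0.5 * np.sum(theta ** 2)

    def penalized_loss(theta, X, y, lam, omega=l2_regularizer):
        # L(theta) = -sum_i log p_theta(y_i|x_i) + lambda * Omega(theta)
        return neg_log_likelihood(theta, X, y) + lam * omega(theta)

    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(8, 5)), rng.integers(0, 3, size=8)
    theta = rng.normal(size=(5, 3))
    print(penalized_loss(theta, X, y, lam=0.1))

The rest of this section replaces the domain-agnostic Ω(θ) above with a regularizer derived from a generative model.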
Domain knowledge. A natural way to inject background information is to define a generative model that simulates the way the data is generated. In text understanding applications, such generative models are common and include probabilistic context-free grammars (PCFGs) and natural language generation frameworks (e.g. SimpleNLG (Gatt and Reiter, 2009)). Let p_γ(x, y) be such a generative model parametrized by a continuous parameter vector γ ∈ Γ, such as the concatenation of all the parameters of the production rules in a PCFG. One important difference between the discriminative and the generative probability distributions is that the inference problem of y given x might be intractable¹ for the generative model, even if the joint model can be computed efficiently.

In this work, we use the following regularizer:

    Ω(θ) := min_{γ∈Γ} E_{p_γ(x,y)} [ log ( p_γ(y|x) / p_θ(y|x) ) ] .    (2)

This regularizer makes intuitive sense as it corresponds to the Kullback-Leibler divergence between the generative and discriminative models. We can see that if the generator p_γ is close to the distribution that generates the test data, the method can potentially yield good performance. However, in practice, γ is unknown and difficult to set. In this work, we focus on several techniques that can be used to estimate the generative parameter vector γ on the training data, making the regularizer data-dependent.

The objective L(θ, γ) can be viewed as a Generative-Discriminative Tradeoff estimator (GDT (Bouchard and Triggs, 2004)) that smoothly interpolates between a purely un-regularized discriminative model when λ = 0 and a generative model when λ tends to infinity. Note that when there is no regularization, the sampling parameters γ do not need to be estimated, as the objective function does not depend on them.

¹ Even if tractable, the inference can be very costly: for example, PCFG decoding can be done using dynamic programming and has a cubic complexity in the length of the decoded sentence, which is still too high for some applications with long sentences.

2.1 The GENERE Algorithm

The objective function can be minimized using stochastic gradient descent by sampling a mixture distribution with two components: the empirical data distribution and the sampled data points generated by the data simulator, with respective proportions 1/(1+λ) and λ/(1+λ). We refer to this algorithm as GENERE, for Generative Regularization, and provide the pseudocode in Algorithm 1. It can be viewed as a variant of the REINFORCE algorithm, which is commonly used in reinforcement learning (Williams, 1988), using the policy gradient. It is straightforward to verify that at each iteration, GENERE computes a noisy estimate of the exact gradient of the objective function L(θ, γ) with respect to both parameter vectors θ and γ.

Algorithm 1 The GENERE Algorithm
Require: P̂: real data sampler
Require: P_γ: parametric data generator
Require: λ: generative regularization strength
Require: Π_C: proximal regularization operator
Require: η: learning rate
Require: α: baseline smoothing coefficient
 1: Initialize parameters θ, sampling coefficients γ and baseline µ
 2: for t = 1, 2, · · · do
 3:   x, y ∼ 1/(1+λ) P̂ + λ/(1+λ) P_γ
 4:   g_θ ← ∇_θ log p_θ(y|x)
 5:   g_γ ← (log p_θ(y|x) − µ) ∇_γ log p_γ(x, y)
 6:   (θ, γ) ← Π_C((θ, γ) − η(g_θ, g_γ))
 7:   µ ← αµ + (1 − α) log p_θ(y|x)
 8: end for

Generative models: interpretable sampling, intractable inference. Generative modeling is natural because we can consider latent variables that add interpretable meaning to the different components of the model. For example, in NLP we can define the latent variable as being the relations that are mentioned in the sentence.

We could consider two main types of approaches to choose a good latent variable:

• Discrete data structure: we can use efficient algorithms, such as dynamic programming, to perform sampling and can propagate the gradient.

• Continuous distribution: having a continuous latent variable enables easy handling of correlations across different parts of the model.
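To make the update explicit, the following is a near-literal transcription of Algorithm 1 into Python. The model, gen, real_sampler and proximal interfaces are our assumptions (the paper does not prescribe an implementation); they are expected to return samples, log-probabilities and gradients with respect to θ and γ.

    # Sketch of Algorithm 1, mirroring the printed pseudocode line by line.
    # `model` and `gen` expose log_prob(x, y) and grad_log_prob(x, y) w.r.t.
    # their own parameters; `proximal` stands for the operator Pi_C.
    import random

    def genere(model, gen, real_sampler, proximal, lam, eta, alpha, num_steps):
        mu = 0.0                                             # baseline (line 1)
        for _ in range(num_steps):                           # line 2
            # line 3: sample from the mixture of real and generated data
            if random.random() < 1.0 / (1.0 + lam):
                x, y = real_sampler()
            else:
                x, y = gen.sample()
            logp = model.log_prob(x, y)
            g_theta = model.grad_log_prob(x, y)              # line 4
            g_gamma = (logp - mu) * gen.grad_log_prob(x, y)  # line 5 (REINFORCE-style)
            # line 6: joint proximal step on (theta, gamma)
            model.theta, gen.gamma = proximal(model.theta - eta * g_theta,
                                              gen.gamma - eta * g_gamma)
            mu = alpha * mu + (1.0 - alpha) * logp           # line 7: smoothed baseline
        return model, gen

Note that this sketch needs the generator to expose the gradient of its own log-density, which motivates the black-box variant discussed next.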
It is often laborious to design data generators which can return the probability of the samples they generate, as well as the gradient of this probability with respect to the input parameters γ. In the next section, we show how to alleviate this constraint by allowing any data-generating code to be used with nearly no modification.

2.2 GENERE with a Black Box Sampler

Let us assume the data generator is a black box that takes a K-dimensional seed vector as input and outputs an input-output sample x, y. To enable GENERE to be applied without having to modify the code of data generators, we used the following data generation process:

1. Sample a Gaussian seed vector ∆ ∼ N(0, I).

2. Use the data generator G_z with seed value z := ∆ + γ to generate an input-output sample (x, y).

This two-step generation procedure enables the gradient information to be computed using the density of a Gaussian distribution. Formally, this is equivalent to Algorithm 1 with the following generative model:

    p_γ(x, y) = E_{∆∼N(0,I)} [ g_{γ+∆}(x, y) ]    (3)

where g_z is the density of the black-box data generator G_z for the seed value z ∈ R^K. Ideally, the second data generator should be close to a deterministic function in order to allocate more uncertainty in the trainable part of the model, which corresponds to the Gaussian distribution.²

² What we mean by deterministic is that the black-box sampler has the form δ{f(∆ + γ) = (x, y)}, where δ is the indicator function.

Learning. The pseudo-code for the Black Box GENERE variant is shown in Algorithm 2. It is basically the same as Algorithm 1, but the sampling phase is decomposed into two steps: a random Gaussian variable sampling followed by the black-box sampling of generators.

Algorithm 2 Black Box GENERE
Require: P̂: real data sampler
Require: G(γ): black box data generator
Require: λ: generative regularization strength
Require: η_γ, η_θ: learning rates
 1: Initialize parameters θ, sampling coefficients γ and baseline µ
 2: for t = 1, 2, · · · do
 3:   if 1/(1+λ) < U([0, 1]) then
 4:     x, y ∼ P̂
 5:   else
 6:     ∆ ∼ N(0, I)
 7:     x, y ∼ G_{γ+∆}
 8:     γ ← γ − η_γ (log p_θ(y|x) − µ) ∆
 9:   end if
10:   θ ← θ − η_θ ∇_θ log p_θ(y|x)
11:   µ ← (98/100) µ + (2/100) log p_θ(y|x)
12: end for

3 Application to Encoder-Decoder

In this section, we show that the GENERE algorithm is well suited to tune data generators for problems that are compatible with the encoder-decoder architecture commonly used in NLP.

3.1 Mixture-based Generators

In the experiments below, we consider mixture-based generators with known components but unknown mixture proportions. Formally, we parametrize the proportions using a softmax link σ_k(t) := exp(t_k) / Σ_{k'=1}^{K} exp(t_{k'}). In other words, the data generator distribution is:

    p_γ(x, y) = Σ_{k=1}^{K} σ_k(γ + ∆) p_k(x, y),

where the p_k(x, y) are data distributions, called base generators, that are provided by domain experts, and ∆ is a K-dimensional centered Gaussian with an identity covariance matrix. This class of generators makes sense in practice, as we typically build multiple base generators p_k(x, y), k = 1, · · · , K, without knowing ahead of time which one is the most relevant. Then, the training data is used by the GENERE algorithm to automatically learn the optimal parameter γ that controls the contribution
{π_k}_{k=1}^{K} of each of the base generators, equal to π_k := E_{∆∼N(0,I)} [σ_k(γ + ∆)].
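A minimal sketch of this mixture-based black-box generator is given below; the helper names and the toy base generators are our own assumptions, not part of the paper.

    # Sketch of the Section 3.1 generator: a Gaussian-perturbed seed gamma + Delta
    # is pushed through a softmax to pick one of K base generators, each being
    # any function that returns an (x, y) pair.
    import numpy as np

    def softmax(t):
        t = t - t.max()
        e = np.exp(t)
        return e / e.sum()

    def sample_from_mixture(gamma, base_generators, rng):
        """Draw one (x, y) pair as in Section 3.1 / step 2 of Section 2.2."""
        delta = rng.normal(size=gamma.shape)          # Delta ~ N(0, I)
        probs = softmax(gamma + delta)                # sigma_k(gamma + Delta)
        k = rng.choice(len(base_generators), p=probs)
        return base_generators[k]()                   # x, y from the k-th base generator

    # Toy usage with two hypothetical base generators for the text-to-equation task
    rng = np.random.default_rng(0)
    gens = [lambda: ("compute one plus three", "X = 1 + 3"),
            lambda: ("compute four minus two", "X = 4 - 2")]
    print(sample_from_mixture(np.zeros(2), gens, rng))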
3.2 Synthetic Experiment

In this section, we illustrate how GENERE can learn to identify the correct generator when the data generating family is a mixture of multiple data generators and only one of these distributions, say p_1, has been used to generate the data. The other distributions (p_2, · · · , p_K) generate input-output data samples (x, y) with different distributions.

We verified that the algorithm correctly identifies the correct data distribution, and hence leads to better generalization performance than what the model learns without the generator.

In this illustrative experiment, a simple text-to-equation translation problem is created, where inputs are sentences describing an equation, such as “compute one plus three minus two”, and outputs are symbolic equations, such as “X = 1 + 3 - 2”. Numbers varied between -20 and 20, and equations could have 2 or 3 numbers with 2 or 3 operations.

As our model, we used a 20-dimensional sequence-to-sequence model with LSTM recurrent units. The model was initialized using 200 iterations of standard gradient descent on the log-probability of the output. GENERE was run for 500 iterations, varying the fraction of real and generated samples from 0% to 100%. An ℓ2 regularization of magnitude 0.1 was applied to the model. The baseline smoothing coefficient was set to 0.98 and the shrinkage parameter was set to 0.99. All the experiments were repeated 10 times and a constant learning rate of 0.1 was used.

Results are shown in Figure 1, where the average loss computed on the test data is plotted against the fraction of real data used during learning. We can see that the best generalization performance is obtained when there is a balanced mix of real and artificial data, but the proportion depends on the amount of training data: on the left hand side, the best performance is obtained with generated data only, meaning that the number of training samples is so small that GENERE only used the training data to select the best base generator (the component p_1), and the best performance is attained using only generated data. The plot on the right hand side is interesting because it contains more training data, and the best performance is not obtained using only the generator, but with 40% of the real data, illustrating the fact that it is beneficial to jointly use real and simulated data during training.

3.3 Math word problems

To illustrate the benefit of using generative regularization, we considered a class of real world problems for which obtaining data is costly: learning to answer math exam problems. Prior work on this problem focuses on standard math problems given to students aged between 8 and 10, such as the following:³

    For Halloween Sarah received 66 pieces of candy from neighbors and 15 pieces from her older sister. If she only ate 9 pieces a day, how long would the candy last her?

The answer is given by the following equation:

    X = (66 + 15)/9 .

Note that, similarly to real world school exams, giving the final answer (9 in this case) is not considered to be enough for the response to be correct.

The only publicly available word problem datasets we are aware of contain between 400 and 600 problems (see Table 2), which is not enough to properly train sufficiently rich models that capture the link between the words and the quantities involved in the problem.

Sequence-to-sequence learning is the task of predicting an output sequence of symbols based on a sequence of input symbols. It is tempting to cast the problem of answering math exams as a sequence-to-sequence problem: given the sequence of words from the problem description, we can predict the sequence of symbols for the equation as output. The most successful models for sequence prediction are Recurrent Neural Networks (RNNs) with non-linear transitions between states.

Treated as a translation problem, math word problem solving should be simpler than developing a machine translation model between two human languages, as the output vocabulary (the math symbols) is significantly smaller than any human vocabulary.

³ From the Common Core dataset (Roy and Roth, 2015).
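As an illustration of this casting (our own sketch, not the authors' preprocessing), a word problem and its equation can be mapped to the input and output token sequences consumed by a sequence-to-sequence model as follows; the tokenization choices are assumptions.

    # Sketch: turn a math word problem into a seq2seq training pair
    # (input = problem words, output = equation symbols).
    import re

    def to_seq2seq_pair(problem_text, equation):
        # Lowercase the problem and split it into word/number tokens for the encoder.
        src_tokens = re.findall(r"[a-z']+|\d+|[?.,]", problem_text.lower())
        # Split the equation into individual math symbols for the decoder.
        tgt_tokens = re.findall(r"\d+|[()+\-*/=X]", equation)
        return src_tokens, tgt_tokens

    problem = ("For Halloween Sarah received 66 pieces of candy from neighbors "
               "and 15 pieces from her older sister. If she only ate 9 pieces a day, "
               "how long would the candy last her?")
    src, tgt = to_seq2seq_pair(problem, "X = (66 + 15) / 9")
    print(src[:8])   # ['for', 'halloween', 'sarah', 'received', '66', 'pieces', 'of', 'candy']
    print(tgt)       # ['X', '=', '(', '66', '+', '15', ')', '/', '9']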
Figure 1: Test loss vs. fraction of real data used in GENERE on the text-to-equation experiment.

However, machine translation can be learned on millions of pairs of already translated sentences, and such massive training datasets dwarf all previously introduced math exam datasets. Today, standard repositories are restricted to a few hundred problems with their solutions (Hosseini et al., 2014; Roy et al., 2015; Roy and Roth, 2015).

We used standard benchmark data from the literature. The first one, AI2, was introduced by Hosseini et al. (2014) and covers addition and subtraction of one or two variables or two additions, scraped from two web pages. The second (IL), introduced by Roy et al. (2015), contains single-operator questions but covers addition, subtraction, multiplication, and division, and was also obtained from two web pages, although different from those of AI2. The last dataset (CC) was introduced by Roy and Roth (2015) to cover combinations of different operators and was obtained from a fifth web page.

An overview of the equation patterns in the data is shown in Table 1. It should be noted that there are sometimes numbers mentioned in the problem description that are not used in the equation.

Table 1: Patterns of the equations seen in the datasets for one permutation of the placeholders.
    AI2:  X+Y, X+Y+Z, X−Y
    IL:   X+Y, X−Y, X∗Y, X/Y
    CC:   X+Y−Z, X∗(Y+Z), X∗(Y−Z), (X+Y)/Z, (X−Y)/Z

As there are no available train/dev/test splits in the literature, we introduced such splits for all three datasets. For AI2 and CC, we simply split the data randomly, and for IL we opted to maintain the clusters described in Roy and Roth (2015). We then used the implementation of Roy and Roth (2015) provided by the authors, which is the current state-of-the-art for all three datasets, to obtain results to compare our model against. The resulting data sizes are shown in Table 2. We verified that there are no duplicate problems, and our splits and a fork of the baseline implementation are available online.⁴

⁴ https://github.com/ninjin/roy_and_roth_2015

Table 2: Math word problems dataset sizes.
            AI2   IL   CC
    Train   198  214  300
    Dev      66  108  100
    Test    131  240  200
    Total   395  562  600

3.4 Development of the Generator

Generators were organized as a set of 8 base generators p_k, summarized in Table 4. Each base generator has several functions associated with it. The functions were written by a human over 3 days of full-time development. The first group of base generators is based only on the type of symbol the equation has; the second group is the pair (#1, #2) to represent equations with one or two symbols. Finally, the last two generators are more experimental, as they correspond to simple modifications applied to the available training data. The Noise ’N’ generator picks one or two random words from a training sample to create a new (but very similar) problem.
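Read this way (consistent with Table 4, which describes ’N’ as a training sample with words removed), a minimal sketch of the ’N’ generator could look as follows; the interface and function names are assumptions.

    # Sketch of the Noise 'N' base generator: perturb an existing training
    # problem by dropping one or two random words, keeping the equation.
    import random

    def noise_generator(training_set, rng=random):
        """Return a new (problem_text, equation) pair derived from a real sample."""
        text, equation = rng.choice(training_set)
        words = text.split()
        for _ in range(rng.randint(1, 2)):           # drop one or two words
            if len(words) > 1:
                words.pop(rng.randrange(len(words)))
        return " ".join(words), equation

    # Toy usage on a single training problem
    train = [("Sandra has 7 erasers . She grasps 7 more .", "X = 7 + 7")]
    print(noise_generator(train))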
Finally, the ’P’ generator is based on computing the statistics of the words for the same question pattern (as one can see in Table 1), and generates data using simple biased word samples, where words are distributed according to their average positions in the training data (positions are computed relative to the quantities appearing in the text, i.e. “before the first number”, “between the 1st and the 2nd number”, etc.).

Table 4: The base generators to create math exam problems.
    The problem...
    +   contains at least one addition
    -   contains at least one subtraction
    *   contains at least one multiplication
    /   contains at least one division
    1   has a single mathematical operation
    2   has a couple of mathematical operations
    N   is a training sample with words removed
    P   is based on word position frequencies

Table 3: Examples of generated sentences (first 3 rows). The last row is the template used to generate the 3rd example, where brackets indicate modifiers, symbols starting with ’S’ or ’O’ indicate a noun phrase for a subject or object, symbols with ’V’ indicate a verb phrase, and symbols with ’Q’ indicate a quantity. They are identified with a number to match multiple instances of the same token.
    “John sprints to William’s apartment. The distance is 32 yards from John’s apartment to William’s apartment. It takes John 2 hours to at the end get there. How fast did John go?”  →  32 / 2
    “Sandra has 7 erasers. She grasps 7 more. The following day she grasps 18 whistles at the local supermarket. How many erasers does Sandra have in all?”  →  7 + 7
    “A pet store had 81 puppies. In one day they sold 41 of them and put the rest into cages with 8 in each cage. How many cages did they use?”  →  ( 81 - 41 ) / 8
    “S1 V1 Q1 O1 C1. S1(pronoun) V2 Q2 of O1(pronoun) and V2 the rest into O3(plural) with Q3 in each O3. How many O3(plural) V3?”  →  ( Q1 - Q2 ) / Q3

3.5 Implementation Details

We use a standard stacked RNN encoder-decoder (Sutskever et al., 2014), where we varied the recurrent unit between LSTM and GRU (Cho et al., 2014), the stack depth from 1 to 3, the size of the hidden states from 4 to 512, and the vocabulary threshold size. As input to the encoder, we downloaded pre-trained 300-dimensional embeddings trained on Google News data using the word2vec software (Mikolov et al., 2013). The development data was used to tune these parameters before performing the evaluation on the test set. We obtained the best performances with a single stack, GRU units, and a hidden state size of 256.

The optimization algorithm was based on stochastic gradient descent using Adam as an adaptive step size scheme (Kingma and Ba, 2014), with mini-batches of size 32. A total of 256 epochs over the data was used in all the experiments.

To evaluate the benefit of learning the data generator, we used a hybrid method as a baseline where a fraction of the data is real and another fraction is generated using the default parameters of the generators (i.e. a uniform distribution over all the base generators). The optimal value for this fraction obtained on the development set was 15% real data and 85% generated data. For GENERE, we used a fixed learning rate of 0.1, the smoothing coefficient was selected to be 0.5, and the shrinkage coefficient to be 0.99.

We also compared our approach to the publicly available math exam solver RR2015 (Roy and Roth, 2015). This method is based on a combination of template-based features and categorizers. The accuracy was measured by counting the number of times the equation generated the correct result, so that 10 + 7 and 7 + 10 would both be considered to be correct. Results are shown in Table 5.

Table 5: Test accuracies of the math-exam methods on the available datasets, averaged over 10 random runs.
                         AI2   IL    CC    Avg.
    RR2015               82.4  75.4  55.5  71.1
    100% Data            72.5  53.7  95.0  73.7
    100% Gen             60.3  51.2  92.0  67.8
    85% Gen + 15% Data   74.0  55.4  97.5  75.6
    GENERE               77.9  56.7  98.5  77.7
We can see that there is a large difference in performance between RR2015 and the RNN-based encoder-decoder approach. While their method seems to be very good on some datasets, it fails on CC, which is the dataset in which one needs equations involving parentheses. On average, the trend is the following: using real data only does not succeed in giving good results, and we can see that with generated data alone we already perform better. This could be explained by the fact that the generators' vocabulary has a good overlap with the vocabulary of the real data. However, mixing real and generated data improves performance significantly. When GENERE is used, the sampling is tuned to the problem at hand and gives better generalization performance.

To understand whether GENERE learned a meaningful data generator, we inspected the coefficients γ_1, · · · , γ_8 that are used to select the 8 data generators described earlier. This is shown in Figure 2. The results are quite surprising at first sight: the AI2 dataset only involves additions and subtractions, but GENERE selects the generator generating divisions as the most important. Investigating, we noted that problems generated by the division generator were reusing some lexical items that were present in AI2, making the vocabulary very close to the problems in AI2, even if AI2 does not cover division. We can also note that the differences in proportions are quite small among the 4 symbols +, −, ∗ and / across all the datasets. We can also clearly see that the noisy generators ’N’ and ’P’ are not very relevant in general. We explain this by the fact that the noise induced by these generators is too artificial to generate relevant data for training: their likelihood on the model trained on real data remains small.

Figure 2: Base generator proportions learned by GENERE.

4 Conclusion

In this work, we argued that many problems can be solved by high-capacity discriminative probabilistic models, such as deep neural nets, at the expense of a large amount of required training data. Unlike the current trend, which is to reduce the size of the model or to define features well targeted for the task, we showed that we can completely decouple the choice of the model and the design of a data generator. We proposed to allow data generators to be “weakly” specified, leaving the undetermined coefficients to be learned from data. We derived an efficient algorithm called GENERE that jointly estimates the parameters of the model and the undetermined sampling coefficients, removing the need for costly cross-validation. While this procedure could be viewed as a generic way of building informative priors, it does not rely on a complex integration procedure such as Bayesian optimization, but corresponds to a simple modification of standard stochastic optimization algorithms, where the sampling alternates between the use of real and generated data. While the general framework assumes that the sampling distribution is differentiable with respect to its learnable parameters, we proposed a Gaussian integration trick that does not require the data generator to be differentiable, enabling practitioners to use any data sampling code, as long as the data resembles the real data.

We also showed in the experiments that a simple way to parametrize a data generator is to use a mixture of base generators that might have been derived independently. The GENERE algorithm learns automatically the relative weights of these base generators while optimizing the original model. While the experiments only focused on sequence-to-sequence decoding, our preliminary experiments with other high-capacity deep neural nets seem promising.

Another future work direction is to derive efficient
mechanisms to guide the humans who are creating the data generation programs. Indeed, there is a lack of a generic methodology to understand where to start and which training data to use as inspiration to create generators that generalize well to unseen data.

Acknowledgments

We would like to thank Subhro Roy for helping us run his model on our new data splits. Lastly, we would like to thank the three anonymous reviewers for their helpful comments and feedback. This work was supported by a Marie Curie Career Integration Award and an Allen Distinguished Investigator Award.

References

Guillaume Bouchard and Bill Triggs. 2004. The trade-off between generative and discriminative classifiers. In 16th IASC International Symposium on Computational Statistics (COMPSTAT'04), pages 721–728.

Kyunghyun Cho, Bart van Merriënboer, Çağlar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar, October. Association for Computational Linguistics.

Li Dong and Mirella Lapata. 2016. Language to logical form with neural attention. arXiv preprint arXiv:1601.01280.

Albert Gatt and Ehud Reiter. 2009. SimpleNLG: A realisation engine for practical applications. In Proceedings of the 12th European Workshop on Natural Language Generation, pages 90–93. Association for Computational Linguistics.

Mohammad Javad Hosseini, Hannaneh Hajishirzi, Oren Etzioni, and Nate Kushman. 2014. Learning to solve arithmetic word problems with verb categorization. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 523–533, Doha, Qatar, October. Association for Computational Linguistics.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

P. Niyogi, F. Girosi, and T. Poggio. 1998. Incorporating prior information in machine learning by creating virtual examples. Proceedings of the IEEE, 86(11):2196–2209, November.

Sinno Jialin Pan and Qiang Yang. 2010. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359.

Slav Petrov and Dan Klein. 2007. Discriminative log-linear grammars with latent variables. In Advances in Neural Information Processing Systems, pages 1153–1160.

Tim Rocktäschel, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský, and Phil Blunsom. 2016. Reasoning about entailment with neural attention. In International Conference on Learning Representations.

Subhro Roy and Dan Roth. 2015. Solving general arithmetic word problems. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1743–1752. Association for Computational Linguistics.

Subhro Roy, Tim Vieira, and Dan Roth. 2015. Reasoning about quantities in natural language. Transactions of the Association for Computational Linguistics, 3:1–13.

Bernhard Scholkopf and Alexander J. Smola. 2001. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Oriol Vinyals, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey E. Hinton. 2014. Grammar as a foreign language. CoRR, abs/1412.7449.

Ronald J. Williams. 1988. On the use of backpropagation in associative reinforcement learning. In Neural Networks, 1988, IEEE International Conference on, pages 263–270. IEEE.
