
Transfer Learning for NLP with Application to Detecting Duplicate Question Pairs in Quora

P Bharat Kumar, D Ravi Shankar, J Ramtej

December 4, 2017

Deep Learning for NLP Course Project Report

Abstract
Transfer learning is a research problem in machine learning where the knowledge gained by solving a problem in one task or domain is applied to a different but related problem. Transfer learning techniques have been used effectively in fields such as image processing and have achieved good results there, but in NLP they have been applied only loosely and the conclusions are not consistent. Neural networks have been shown to obtain state-of-the-art performance on text-pair classification tasks such as SNLI. In this project we explore different neural network based transfer learning schemes on a variety of datasets.

1 Introduction

Transfer learning plays an important role in many natural language processing (NLP) applications, especially when we do not have large enough datasets for the task of interest. The real world is messy and contains an infinite number of novel scenarios, many of which our model has not encountered during training and for which it is in turn ill-prepared to make predictions. Transfer learning helps in such situations to generalize to the new conditions. Transfer learning has been widely used in image processing, but in NLP the conclusions are not consistent.

The rest of the paper is organized as follows. Section 2 discusses related work. Section 3 describes the three transfer learning methods that we used for the experiments. The datasets used, our contributions and the experiments performed are discussed in sections 4, 5 and 6. Finally, conclusions are reported in section 7.

2 Related Work

Transfer learning (of neural network features specifically) has been successfully applied to a variety of problems in computer vision across domains and applications, but researchers [3] have shown that neural network based transfer learning in NLP only helps when the source and target tasks are semantically related. They used the INIT and MULT methods while transferring the parameters. Researchers [1] have also shown that neural network based transfer learning improves the performance of models for sequence tagging, and that even problems across domains and applications can benefit from transfer learning, though the gains may not be significant.

3 Transfer Methods

We experimented with three methods of neural network based transfer learning:

1. Parameter Initialization (INIT): The INIT method first trains the network on the source dataset and then directly uses the tuned parameters to initialize the network for the target dataset. After transfer, we may fix the tuned parameters in the target domain when we have no labelled target data; if labelled target data is available, we can fine-tune the parameters further. A short sketch follows.
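As an illustration, here is a minimal sketch of INIT in PyTorch (the report does not name a framework, so the framework and the two model objects are our assumptions):

    import torch.nn as nn

    def init_transfer(source_model: nn.Module, target_model: nn.Module,
                      freeze: bool = False) -> nn.Module:
        # Copy every tuned source parameter whose name and shape match;
        # mismatched layers (e.g. a 2-way vs. 6-way output layer) keep
        # their fresh target initialization.
        src = source_model.state_dict()
        tgt = target_model.state_dict()
        for name, tensor in src.items():
            if name in tgt and tgt[name].shape == tensor.shape:
                tgt[name] = tensor.clone()
        target_model.load_state_dict(tgt)
        if freeze:
            # No labelled target data: fix the transferred parameters.
            for p in target_model.parameters():
                p.requires_grad = False
        return target_model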
2. Multi-task learning (MULT): In this method we use samples from both domains simultaneously to train the network. The overall cost function is given by

    J = α J_S + (1 − α) J_T

where J_S and J_T are the individual cost functions for the source and target domains respectively, and α ∈ (0, 1) is a hyperparameter balancing the two domains. In practice we switch to the source domain with probability α and to the target domain with probability 1 − α, and train the network on a batch from the chosen domain, as sketched below.
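A sketch of one stochastic MULT update under the same PyTorch assumption; the batch iterators and the shared model are placeholders (for brevity the classifier is shared here, whereas our experiments keep separate output layers per domain):

    import random
    import torch.nn.functional as F

    def mult_step(model, src_batch, tgt_batch, optimizer, alpha=0.5):
        # With probability alpha train on a source batch, otherwise on
        # a target batch, so the expected objective is
        # J = alpha * J_S + (1 - alpha) * J_T.
        x, y = src_batch if random.random() < alpha else tgt_batch
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x), y)
        loss.backward()
        optimizer.step()
        return loss.item()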
3. Sequence-to-sequence (seq2seq) models (Sutskever et al., 2014; Cho et al., 2014) have enjoyed great success in a variety of tasks such as machine translation, speech recognition and text summarization. An NMT system first reads the source sentence with an encoder to build a "context" vector, a sequence of numbers that represents the sentence meaning; a decoder then processes the context vector to emit a translation. The context vector thus contains sufficient lexical and semantic information to fully reconstruct a sentence in another language. Therefore, we expected that comparing the context vectors of two questions could provide a useful measure of their similarity.
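For instance, assuming an encoder that returns a fixed-size context vector per sentence (the encode function here is a hypothetical placeholder), the comparison could be as simple as a cosine similarity:

    import torch.nn.functional as F

    def context_similarity(encode, question_a: str, question_b: str) -> float:
        # encode(sentence) -> 1-D context vector (e.g. the encoder's
        # last hidden state); similar questions should score near 1.
        va, vb = encode(question_a), encode(question_b)
        return F.cosine_similarity(va.unsqueeze(0), vb.unsqueeze(0)).item()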

4 Datasets:
IMDB: A large dataset for binary sentiment classification (positive vs. negative) - 25k sentences.
MR: A small dataset for binary sentiment classification - 10,662 sentences.
QC: A small dataset for 6-way question classification (e.g., location, time, and number) - 5,000 questions.
SNLI: A large dataset for sentence entailment recognition; the classes are entailment, contradiction and neutral - 500k pairs.
SICK: A small dataset with exactly the same classification objective as SNLI - 10k pairs.
MSRP: A small dataset for paraphrase detection; the objective is binary classification, judging whether two sentences have the same meaning - 5,000 pairs.
Quora dataset: Contains question pairs with labels indicating whether the two questions request the same information - 400k question pairs.

[Figure 1: (Ref [3]) Architecture used by Lili Mou et al.; 'a' is used for experiment 1 and 'b' for experiment 2]

5 Our Contributions

We tried to replicate the results from [3] using the INIT method. For experiment 1, we trained an LSTM model on IMDB (architecture 'a' in figure 1) and transferred the parameters to the MR and QC datasets; the results are shown in table 1. For experiment 2, we trained a CNN model on the SNLI dataset and transferred the parameters to the SICK and MSRP datasets. We then experimented with both INIT and MULT on the SNLI (source) and Quora (target) datasets. We also trained an NMT model and extracted the "context" vectors of the question pairs in the Quora dataset, which were later used to train a classifier; the results are reported in section 6.3.

6 Experiments

6.1 Experiment 1:

For this experiment we used an LSTM architecture. We trained the model on IMDB and then transferred all three layers (embeddings, hidden LSTM layer and output layer) to the MR dataset, and the embeddings and hidden LSTM layer to QC; the results are shown in table 1. When we transferred the parameters from IMDB to MR, the accuracy improved by 1.57%, while from IMDB to QC there was little change in accuracy. The reason is that IMDB and MR are semantically similar datasets, whereas IMDB and QC are semantically different. Figure 2 shows the accuracy on the MR dataset with and without transfer of parameters from IMDB; figure 3 shows the same for the QC dataset. A sketch of this layer-wise transfer is given after the figures below.
Dataset   Paper (without transfer)   Paper (with transfer)   Ours (without transfer)   Ours (with transfer)
IMDB      87.0                       -                       84.50                     -
MR        75.1                       81.4                    76.89                     78.46
QC        90.8                       93.2                    95.3                      94.73

Table 1: Transfer of parameters from IMDB to MR and QC (accuracy in %)
[Figure 2: IMDB to MR]

[Figure 3: IMDB to QC]
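The layer-wise transfer of experiment 1, sketched under the same PyTorch assumption (lstm_imdb stands in for the model already trained on IMDB, and a single shared vocabulary is our simplification):

    import torch.nn as nn

    class LSTMClassifier(nn.Module):
        # Matches the section 8 LSTM spec: 300-d embeddings (GloVe in
        # the report), 128 hidden units, and a linear output layer.
        def __init__(self, vocab_size, num_classes, embed_dim=300, hidden=128):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, embed_dim)
            self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True)
            self.output = nn.Linear(hidden, num_classes)

        def forward(self, tokens):
            _, (h, _) = self.lstm(self.embedding(tokens))
            return self.output(h[-1])

    # Stand-in for the model already trained on IMDB.
    lstm_imdb = LSTMClassifier(vocab_size=20000, num_classes=2)

    # IMDB -> MR: both tasks are 2-way, so all three layers transfer.
    mr_model = LSTMClassifier(vocab_size=20000, num_classes=2)
    mr_model.load_state_dict(lstm_imdb.state_dict())

    # IMDB -> QC: QC is 6-way, so only embeddings and LSTM transfer.
    qc_model = LSTMClassifier(vocab_size=20000, num_classes=6)
    qc_model.embedding.load_state_dict(lstm_imdb.embedding.state_dict())
    qc_model.lstm.load_state_dict(lstm_imdb.lstm.state_dict())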


6.2 Experiment 2:

For this experiment we used a siamese CNN architecture with two additional hidden layers and an output layer. We trained the model on SNLI and then analyzed the transferability of each layer (embeddings, filter weights, the two hidden layers and the output layer) to the SICK and MSRP datasets. The results are shown in table 2. Figure 4 shows transfer accuracies from SNLI to SICK with different layers transferred, and figure 6 shows the corresponding accuracies from SNLI to MSRP. Transfer from SNLI to SICK appears to be successful, with a 5% increase in accuracy, while there is no significant improvement from SNLI to MSRP. SNLI and SICK are semantically similar datasets, whereas SNLI and MSRP are not, which again suggests that transfer learning is sensitive to semantic similarity. Similarly, we trained the model on the Quora dataset and transferred the parameters to the SICK and MSRP datasets; these results are also shown in table 2. A sketch of the per-layer transfer sweep follows.
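One way to run the per-layer sweep, sketched under the same PyTorch assumption (the module-name prefixes below are hypothetical names for a siamese CNN's blocks):

    # Hypothetical module-name prefixes, ordered from input to output.
    LAYER_PREFIXES = ("embedding", "convs", "hidden1", "hidden2", "output")

    def transfer_first_k(source_model, target_model, k):
        # Copy the first k blocks of tuned source parameters into the
        # target network; deeper blocks keep their fresh initialization.
        keep = LAYER_PREFIXES[:k]
        tgt_state = target_model.state_dict()
        for name, tensor in source_model.state_dict().items():
            if name.startswith(keep) and tgt_state[name].shape == tensor.shape:
                tgt_state[name] = tensor.clone()
        target_model.load_state_dict(tgt_state)
        return target_model

    # Sweep: fine-tune and evaluate on SICK (or MSRP) for each k.
    # for k in range(1, len(LAYER_PREFIXES) + 1):
    #     model = transfer_first_k(snli_model, fresh_model(), k)
    #     ...train on the target dataset and record accuracy...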
Transfer pair    Paper (without transfer)   Paper (with transfer)   Ours (without transfer)   Ours (with transfer)
SNLI to SICK     70.9                       77.6                    66.12                     72.10
Quora to SICK    70.9                       77.6                    66.12                     70.50
SNLI to MSRP     69                         69.9                    64.34                     65.53
Quora to MSRP    69                         69.9                    64.34                     64.08

Table 2: Transfer of parameters from SNLI to SICK and MSRP and from Quora to SICK and MSRP (accuracy in %)

[Figure 4: SNLI to SICK]

[Figure 5: Quora to SICK]

[Figure 6: SNLI to MSRP]

[Figure 7: Quora to MSRP]

6.3 Experiments on Quora Dataset:

INIT and MULT:
All these experiments, except the NMT experiment, use the siamese CNN architecture (figure 1, 'b'). The Adam optimizer was used to train all the networks. A set of experiments was run to determine the hyperparameters for the transfer learning schemes. As can be seen from figure 8, in MULT the learning rate for the source dataset should be smaller than that for the target dataset; henceforth the learning rate for the source (SNLI) is fixed at 0.0001 and the learning rate for the target (Quora) at 0.001. Another experiment was performed to determine the effect of sharing different parts of the network in MULT; the results are plotted in figure 9. We did not observe a considerable difference in accuracy between sharing only the filters and sharing both the filters and the hidden layers (the latter seems slightly better), so henceforth all experiments share all parameters except those of the output layer. We also performed an experiment to determine how over-fitting on the source dataset affects transfer learning in INIT: the model was trained on SNLI for different numbers of epochs before the parameters were transferred and trained on Quora. The accuracies on both SNLI (source) and Quora (target) are plotted for comparison in figure 10. As the plot shows, the accuracy on Quora peaked along with the accuracy on SNLI, and when the model was overfit on SNLI the accuracy on Quora dipped. This indicates that the parameters of the best model on the source dataset should be used for transfer in INIT. Finally, a plot comparing INIT and MULT is shown in figure 11: the accuracies of both schemes are plotted when different percentages of the target data (Quora) are available for training. We can conclude from this plot that transfer learning is effective when the amount of target data is small, too small to train a good model on its own. INIT also appears to be slightly better than MULT. The asymmetric learning rates are sketched below.
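The asymmetric learning rates can be realized in PyTorch with, for example, two optimizers over the shared parameters (the toy modules below are placeholders; the exact mechanism in our code is an implementation detail not spelled out here):

    import torch
    import torch.nn as nn

    # Toy stand-ins: a shared trunk with separate per-domain output
    # heads (all parameters except the output layer are shared).
    trunk = nn.Linear(100, 64)
    src_head = nn.Linear(64, 3)   # SNLI head: 3-way entailment
    tgt_head = nn.Linear(64, 2)   # Quora head: duplicate / not

    # Source updates step the shared trunk with a smaller learning
    # rate (0.0001) than target updates do (0.001), as found above.
    opt_src = torch.optim.Adam(
        list(trunk.parameters()) + list(src_head.parameters()), lr=1e-4)
    opt_tgt = torch.optim.Adam(
        list(trunk.parameters()) + list(tgt_head.parameters()), lr=1e-3)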
NMT and context vector:
A neural machine translator [4][5] with 2-layer LSTMs of 512 units, a bidirectional encoder, an embedding dimension of 512 and Luong attention was used, with a dropout keep probability of 0.8. It was trained with SGD at learning rate 1.0 for 20K steps (about 20 epochs) on the IWSLT English-Vietnamese corpus (133k examples), reaching a BLEU score of 24.4 on the test data. The trained model was then used to extract two 512-dimensional context vectors for each question pair in Quora, which were given as input to the feed-forward neural network whose architecture is described in section 8. The context vector was taken as the last hidden state of the unrolled LSTM. The accuracies of all three transfer learning methods are tabulated below. The accuracy with NMT is not very high; further experiments exploring different approaches to extracting the context vector are required. A sketch of the classifier follows table 3.

Transfer Learning Scheme   Accuracy (in %)
INIT                       82.41
MULT                       82.12
NMT context vector         77.11
No Transfer Learning       82.01

Table 3: Accuracies for different transfer learning schemes with source SNLI and target Quora
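A sketch of the context-vector classifier under the same PyTorch assumption. How the two 512-dimensional vectors are reduced to the 512-dimensional classifier input of section 8 is not spelled out in this report; the element-wise absolute difference used here is our assumption:

    import torch
    import torch.nn as nn

    class ContextPairClassifier(nn.Module):
        # Matches the section 8 spec: input dim 512, two hidden layers
        # of 1024 ReLU units, binary duplicate/non-duplicate output.
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(512, 1024), nn.ReLU(),
                nn.Linear(1024, 1024), nn.ReLU(),
                nn.Linear(1024, 2))

        def forward(self, ctx_a, ctx_b):
            # ctx_a, ctx_b: 512-d encoder context vectors of the two
            # questions. Combining them as |a - b| keeps the input
            # 512-d; this combination is assumed, not stated above.
            return self.net(torch.abs(ctx_a - ctx_b))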

[Figure 8: Effect of varying the learning rate of the source dataset on transfer learning in MULT; graphs show accuracies on the Quora dataset for different learning rates on SNLI data]

[Figure 9: Effect of sharing different parts of the network parameters in MULT; graphs show accuracies on the Quora dataset for two schemes of parameter sharing]

[Figure 10: Effect of over-fitting the source dataset in INIT; graphs show how the accuracy on the Quora (target) dataset varies with the accuracy on the SNLI (source) dataset]

[Figure 11: Comparison of INIT and MULT; graphs show accuracy on the Quora (target) dataset for different transfer learning schemes]
7 Conclusions:

1. Transfer learning is successful when we are dealing with semantically similar tasks.
2. It is helpful when the target dataset is small. The INIT method performs slightly better than MULT.
3. Transfer learning also depends on which layers are transferred (evident from experiment 2, SNLI to SICK).
4. While training a model with the MULT method, the learning rate for the source dataset should be smaller than that for the target dataset.
5. Do we lose general information if the model is trained on the source data to its best accuracy? The answer seems to be no, as evident from figure 10: the accuracy on the Quora dataset peaked along with that on the SNLI dataset.
6. Does the context vector of the NMT encoder capture information useful for our problem? The context vector definitely captures some useful information, but further experiments exploring different approaches to extracting the context vector are required.
8 Network Architecture:

We used GloVe embeddings with a 300-dimensional representation for each word in all the models.

LSTM (used for experiment 1):
Max sentence length = 80
No. of hidden units = 128
Learning rate = 0.01

Siamese CNN (used for experiment 2 and on the Quora dataset):
Max sentence length = 40
Filter sizes = 4, 5, 6, 7
No. of filters = 100
Max pooling
Learning rate = 0.001
No. of hidden layers = 2
Each hidden layer has 1024 units with ReLU

Classifier for NMT context vector:
Input dim = 512
No. of hidden layers = 2
Each hidden layer has 1024 units with ReLU
Learning rate = 0.01
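A sketch of the siamese CNN following the hyperparameters above, under the same PyTorch assumption (how the two sentence vectors are combined before the hidden layers is not stated here; concatenation is our assumption):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SiameseCNN(nn.Module):
        # Filter sizes 4-7 with 100 filters each, max pooling, two
        # 1024-unit ReLU hidden layers, and an output layer.
        def __init__(self, vocab_size, embed_dim=300, n_filters=100,
                     filter_sizes=(4, 5, 6, 7), num_classes=2):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, embed_dim)
            self.convs = nn.ModuleList(
                nn.Conv1d(embed_dim, n_filters, k) for k in filter_sizes)
            feat = n_filters * len(filter_sizes)      # 400 per question
            self.hidden1 = nn.Linear(2 * feat, 1024)  # concatenated pair
            self.hidden2 = nn.Linear(1024, 1024)
            self.output = nn.Linear(1024, num_classes)

        def encode(self, tokens):
            # (batch, len=40) -> (batch, embed, len) for 1-D convolution.
            x = self.embedding(tokens).transpose(1, 2)
            # Global max pooling over time for each filter size.
            pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
            return torch.cat(pooled, dim=1)

        def forward(self, q1, q2):
            # The same shared-weight encoder is applied to both questions.
            v1, v2 = self.encode(q1), self.encode(q2)
            h = F.relu(self.hidden1(torch.cat([v1, v2], dim=1)))
            h = F.relu(self.hidden2(h))
            return self.output(h)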
9 Acknowledgements:

We thank Prof. Shirish K Shevade and Dr. S Sundararajan for facilitating such a wonderful course. We also thank all the course students for their suggestions and discussions during the presentations.

10 References

1. Transfer Learning for Sequence Tagging with Hierarchical Recurrent Networks - Zhilin Yang, Ruslan Salakhutdinov, William W. Cohen.
2. Efficient Transfer Learning Schemes for Personalized Language Modeling using Recurrent Neural Network - Seunghyun Yoon, Hyeongu Yun, Yuna Kim, Gyu-tae Park, Kyomin Jung.
3. How Transferable are Neural Networks in NLP Applications? - Lili Mou, Zhao Meng, Rui Yan, Ge Li, Yan Xu, Lu Zhang, Zhi Jin.
4. Sequence to Sequence Learning with Neural Networks - Ilya Sutskever, Oriol Vinyals, Quoc V. Le.
5. https://github.com/tensorflow/nmt
6. https://engineering.quora.com/Semantic-Question-Matching-with-Deep-Learning
7. http://ruder.io/transfer-learning/ - blog by Sebastian Ruder.
