context vector c_t is learned and concatenated with the RNN hidden state h_t, i.e., c_t ⊕ h_t, to learn a slot attention for predicting the slot tag y_t. All hidden states of the slot filling attention layer are then used to predict the intent label. The objective function of the Att-BiRNN model is as follows:

    P(y|x) = \max_{\theta_r, \theta_s, \theta_I} \prod_{t=1}^{T} P(y_t \mid y_1, \ldots, y_{t-1}, x;\ \theta_r, \theta_s, \theta_I)    (1)

where θ_r, θ_s, θ_I are the trainable parameters of the different components (utterance BiRNN, slot filling attention layer, and intent classifier) of the Att-BiRNN model.
3. Problem Definition

We propose the User Info Augmented Semantic Frame Parsing problem for the same two tasks, intent detection and slot filling, by considering the following additional inputs.
User Info Dictionary: This defines the categorical relations between user info types and slots. In other words, each key in the dictionary is a type of user info, and its corresponding value is the set of slots belonging to this type. The generation of this dictionary is not the focus of our paper, since in practice it can simply be created by a software developer when defining the slots of a new domain. Each type of user info is associated with an external or pre-trained model that extracts its semantically meaningful prior knowledge. For example, the semantics of a location is represented by its longitude and latitude, so that the distance between two locations reflects their actual geographical distance.
User Info for Each Utterance: Each input sequence x is associated with its corresponding user info U. U is represented as a set of tuples ⟨Info Type, Info Content⟩. For the example utterance in Table 1, the first gray row shows our generated user info with type "User Location" and content "Brooklyn, NY". Learning user info has been well studied, e.g., user contextual information (time, location, activity, etc.) via smartphones [15] and the Internet of Things [16], and user interests (e.g., favorite food) via recommendation models [17].

Table 1: ATIS corpus sample with intent and slot annotations, with additional user info and its corresponding user info sequence (in gray)

utterance (x)       round trip flights between ny and miami
slots (y)           B-round_trip I-round_trip O O B-fromloc O B-toloc
intent (I)          atis_flight
user info (U)       {"User Location": "Brooklyn, NY"}
user info seq (z)   O O O O B-loc O B-loc
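To make these two inputs concrete, the following sketch (ours, not from the paper; the slot and type names follow the Table 1 example, everything else is an assumption) shows one possible in-memory representation of the user info dictionary and the per-utterance user info tuples, and how the user info sequence z can be derived from the slot annotation:

# Illustrative sketch only: one possible representation of the two additional inputs.

# User info dictionary: key = user info type, value = slots belonging to that type
# (normally written by the domain developer).
USER_INFO_DICT = {
    "User Location": {"fromloc", "toloc"},                    # location-related slots
    "Preferred Time Period": {"arrive_time.period_of_day"},   # time-related slots
}
# Short IOB label used for each type in the user info sequence z (assumed naming).
TYPE_LABEL = {"User Location": "loc", "Preferred Time Period": "period"}

# User info U for one utterance: a set of <Info Type, Info Content> tuples (Table 1).
utterance = "round trip flights between ny and miami".split()
user_info = {("User Location", "Brooklyn, NY")}

# Slot annotation y for the same utterance (Table 1).
slots = ["B-round_trip", "I-round_trip", "O", "O", "B-fromloc", "O", "B-toloc"]

def derive_user_info_sequence(slot_tags, info_dict, type_label):
    """Map each slot tag to the IOB label of its user info type, or 'O'."""
    z = []
    for tag in slot_tags:
        prefix, _, slot = tag.partition("-")        # "B-fromloc" -> ("B", "fromloc")
        info_type = next((t for t, s in info_dict.items() if slot in s), None)
        z.append(f"{prefix}-{type_label[info_type]}" if info_type else "O")
    return z

print(derive_user_info_sequence(slots, USER_INFO_DICT, TYPE_LABEL))
# ['O', 'O', 'O', 'O', 'B-loc', 'O', 'B-loc']  -> matches the last row of Table 1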
Remarks: One may argue that this is a simple extension of the semantic frame parsing problem, in which the user info can simply be encoded into an existing model as a new input or a new state. However, such naive approaches ignore the different semantic meanings between user info and the language context of an utterance, as well as between different types of user info. Thus, as we show in the experiments (Section 5), these baseline approaches do not show any advantage over existing approaches without user info.
4. Proposed Approach

In this section, we describe the main idea and details of our proposed Prog-BiRNN model as well as its training procedure.
proposed Prog-BiRNN model as well as its training procedure. Ptu = softmax(Wu sut ); z̃t = arg max Ptu (2)
θu
4.1. Progressive Attention-based RNN Model Slot Filling Layer: This is the key layer for distilling user info
into the model to help reduce the need of annotated training
As the name indicates, our main idea is to train the semantic data. It shares the same hidden state ht and language context ct
frame parsing model progressively with an intermediate task with the user info tagging layer. For each word in the utterance,
before achieving the final goal of intent detection and slot we use external knowledge to derive the prior distance vectors
filling. This is motivated by the recent success of progressive dt = {dt (1), . . . , dt (|U |)} for each time stamp t (green in
neural networks [13]. Specifically, for each utterance x, Figure 1) where |U | is the number of user info types in IOB
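As an illustration of the attention weighting and tagging step above, here is a minimal numpy sketch (toy dimensions; a single linear layer stands in for the scorer g, and the real model learns all weights jointly with the BiLSTM):

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

T, H = 4, 8                          # utterance length and BiLSTM output size (toy values)
rng = np.random.default_rng(0)
h = rng.normal(size=(T, H))          # h_t = fh_t ⊕ bh_t from the utterance BiRNN
W_g = rng.normal(size=(3 * H, 1))    # toy stand-in for the feed-forward scorer g
W_u = rng.normal(size=(2 * H, 5))    # tagging weights W_u; 5 user-info tags in IOB format

h_prev, c_prev = np.zeros(H), np.zeros(H)
for t in range(T):
    s_u_prev = np.concatenate([h_prev, c_prev])            # s^u_{t-1} = h_{t-1} ⊕ c_{t-1}
    # e_{t,k} = g(s^u_{t-1}, h_k); here g is a single linear layer for illustration.
    e_t = np.array([(W_g.T @ np.concatenate([s_u_prev, h[k]]))[0] for k in range(T)])
    alpha_t = softmax(e_t)                                  # attention weights alpha_{t,k}
    c_t = alpha_t @ h                                       # c_t = sum_k alpha_{t,k} h_k
    s_u_t = np.concatenate([h[t], c_t])                     # s^u_t = h_t ⊕ c_t
    P_u_t = softmax(W_u.T @ s_u_t)                          # Eq. (2): tag distribution
    z_tilde = int(P_u_t.argmax())                           # predicted user info tag index
    h_prev, c_prev = h[t], c_t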
Slot Filling Layer: This is the key layer for distilling user info into the model to help reduce the need for annotated training data. It shares the hidden state h_t and language context c_t with the user info tagging layer. For each word in the utterance, we use external knowledge to derive the prior distance vector d_t = {d_t(1), . . . , d_t(|U|)} at each time step t (green in Figure 1), where |U| is the number of user info types in IOB format. Each element d_t(j) is defined as follows:

    d_t(j) = sigmoid(\beta^{(j)} ⊙ \delta_t(j))    (3)

where ⊙ stands for element-wise multiplication, \beta^{(j)} is a |U|-dimensional trainable vector, and \delta_t(j) is the distance between the t-th word and the user info w.r.t. the prior knowledge of type j.

Next, we define the calculation of the distance \delta_t(j) for each info type j at time step t through the example in Figure 1. Let \delta_t(loc) be the distance w.r.t. the location type of user info; it is a one-dimensional scalar in this case. Taking the second word "NY" as an example, since it is tagged as the "Location" type of user info, its location distance is

    \delta_2(loc) = dist("NY", "Brooklyn, NY") ≈ 4.8 (miles)

obtained by using an external location-based service, i.e., the Google Maps Distance Matrix API [18]. If the word and the user info are of different types, we set the distance \delta_t(j) to -1 so that its corresponding d_t(j) will be close to 0 via the sigmoid function.
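For illustration, a hedged sketch of how such a prior distance could be queried from the Google Maps Distance Matrix API [18] over HTTP (the endpoint and response fields follow the public API documentation; the helper name, unit conversion, and minimal error handling are our assumptions):

import requests

DISTANCE_MATRIX_URL = "https://maps.googleapis.com/maps/api/distancematrix/json"

def location_distance_miles(word, user_location, api_key):
    """Prior distance delta_t(loc) between a tagged location word and the user's
    location, in miles. Returns -1 if the pair cannot be resolved (sketch only:
    no retries, caching, or quota handling)."""
    params = {
        "origins": word,                 # e.g. "NY"
        "destinations": user_location,   # e.g. "Brooklyn, NY"
        "units": "imperial",
        "key": api_key,
    }
    resp = requests.get(DISTANCE_MATRIX_URL, params=params, timeout=10).json()
    element = resp["rows"][0]["elements"][0]
    if element.get("status") != "OK":
        return -1.0
    return element["distance"]["value"] / 1609.34   # API returns meters; convert to miles

# For the running example, dist("NY", "Brooklyn, NY") ≈ 4.8 miles.
# When the word's user info type differs from loc, the model instead uses
# delta_t(loc) = -1 so that d_t(loc) = sigmoid(beta(loc) ⊙ -1) stays close to 0.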
To feed the prior distance vectors d_t into the slot filling layer, we weight each element d_t(j) and the language context c_t by the softmax probability distribution P^u_t from the user info tagging layer. Intuitively, this determines how important each type of user info, or the language context of the utterance, is for predicting the slot tag of each word. Thus, the input Φ_t of the LSTM cell at each time step t in the slot filling layer is:

    Φ_t = P^u_t(1) d_t(1) ⊕ · · · ⊕ P^u_t(|U|) d_t(|U|) ⊕ P^O_t c_t    (4)

where P^u_t(j) and P^O_t stand for the probabilities that the t-th word is predicted as user info type j and as "O" (i.e., none of the types), respectively. Note that we discuss how to deal with the IOB format in Section 4.2.2. Finally, the state s^s_t at time step t is computed as h_t ⊕ Φ_t and the slot tag is predicted as follows:

    P^s_t = W_s s^s_t;    \tilde{y}_t = \arg\max_{\theta_s} P^s_t    (5)

Intent Detection Layer: We add an additional intent detection layer as in [2] to generate the probability distribution P^I over intent class labels, using the concatenation of the hidden states from the slot filling layer, i.e., s^I = s^s_1 ⊕ . . . ⊕ s^s_T:

    P^I = softmax(W_I s^I);    \tilde{I} = \arg\max_{\theta_I} P^I
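Putting Eqs. (3)-(5) together, the following numpy sketch shows how the prior distances and the language context are weighted by the tagging distribution P^u_t before entering the slot filling layer (toy dimensions; a plain linear map stands in for the LSTM cell, and the IOB split of Section 4.2.2 is ignored here):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

U, H, S = 2, 8, 10                  # |U| user info types, hidden size, |S| slot tags (toy values)
rng = np.random.default_rng(1)

# Per-time-step inputs produced by the shared layers above.
h_t = rng.normal(size=H)            # shared BiLSTM hidden state h_t
c_t = rng.normal(size=H)            # shared language context vector c_t
delta_t = np.array([4.8, -1.0])     # delta_t(j): prior distance per type; -1 when types differ
beta = rng.normal(size=U)           # trainable scaling beta^(j) (one scalar per type here)
P_u_t = np.array([0.7, 0.1, 0.2])   # tagging output: [P(loc), P(period), P(O)]

d_t = sigmoid(beta * delta_t)       # Eq. (3): prior distance features d_t(j)

# Eq. (4): weight each d_t(j) by P^u_t(j), weight c_t by P^O_t, then concatenate.
phi_t = np.concatenate([P_u_t[:U] * d_t, P_u_t[U] * c_t])

# Eq. (5): s^s_t = h_t ⊕ Φ_t, followed by a linear slot classifier.
s_s_t = np.concatenate([h_t, phi_t])
W_s = rng.normal(size=(S, s_s_t.size))
P_s_t = W_s @ s_s_t
y_tilde = int(P_s_t.argmax())       # predicted slot tag index

# Intent detection uses s^I = s^s_1 ⊕ ... ⊕ s^s_T over the whole utterance,
# followed by softmax(W_I s^I); omitted here for brevity.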
Remarks: The sharing of the hidden state h_t and language context c_t between the user info tagging and slot filling layers is crucial to reducing the required annotated training data. For the user info tagging layer, h_t and c_t are mainly used to tag the words that belong to a type of user info. The semantic slots of these words can then be easily tagged in the slot filling layer by utilizing the distilled prior knowledge instead of using h_t and c_t again. The slot filling layer then depends on h_t and c_t only to tag the remaining words, which do not belong to any type of user info.
4.2. Progressive Training with IOB Format Support

4.2.1. Training Algorithm

The training procedure is conducted progressively, step by step. The first step is to train the user info tagging component with the loss function L_u as follows:

    L_u(\theta_r, \theta_u) \triangleq -\frac{1}{n} \sum_{i=1}^{|U|} \sum_{t=1}^{n} z_t(i) \log P^u_t(i)    (6)

where |U| is the number of user info types in IOB format. Then, we train the slot filling layer with loss function L_s and the intent classifier with loss function L_I simultaneously. Meanwhile, we also allow fine-tuning of the parameters θ_r and θ_u of the utterance BiRNN and user info tagging layers.

    L_s(\theta_r, \theta_I, \theta_s, \theta_u) \triangleq -\frac{1}{n} \sum_{i=1}^{|S|} \sum_{t=1}^{n} y_t(i) \log P^s_t(i)    (7)

    L_I(\theta_r, \theta_I, \theta_s, \theta_u) \triangleq -\sum_{i=1}^{|I|} I(i) \log P^I(i)    (8)

where |S| is the number of slots in IOB format and |I| is the number of intents. P(i) stands for the probability P(X = x_i). Moreover, θ_r, θ_u, θ_s, θ_I are the parameters of the utterance BiRNN, user info tagging, slot filling, and intent detection components of our proposed Prog-BiRNN model.
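The two training phases can be sketched as follows in PyTorch-style code (toy stand-in modules and hyperparameters; the real components are those of Section 4.1, and the attention and prior-distance paths are omitted):

import torch
import torch.nn as nn

# Toy stand-ins for the four components of Section 4.1.
encoder   = nn.LSTM(input_size=50, hidden_size=64, bidirectional=True, batch_first=True)  # theta_r
tag_head  = nn.Linear(128, 5)    # user info tagging head (theta_u); 5 user-info tags in IOB format
slot_head = nn.Linear(128, 127)  # slot filling head (theta_s); 127 ATIS slot labels
int_head  = nn.Linear(128, 22)   # intent detection head (theta_I); 22 ATIS intents
xent = nn.CrossEntropyLoss()

def forward(x):
    h, _ = encoder(x)                                   # (batch, T, 128)
    return tag_head(h), slot_head(h), int_head(h.mean(dim=1))

x = torch.randn(2, 7, 50)                # toy batch: 2 utterances of 7 word embeddings
z = torch.randint(0, 5, (2, 7))          # user info sequence labels
y = torch.randint(0, 127, (2, 7))        # slot labels
intent = torch.randint(0, 22, (2,))      # intent labels

# Phase 1: train the utterance BiRNN and user info tagging with L_u, Eq. (6).
opt1 = torch.optim.Adam(list(encoder.parameters()) + list(tag_head.parameters()), lr=1e-3)
for _ in range(3):
    tag_logits, _, _ = forward(x)
    loss_u = xent(tag_logits.reshape(-1, 5), z.reshape(-1))
    opt1.zero_grad(); loss_u.backward(); opt1.step()

# Phase 2: train slot filling and intent detection with L_s + L_I, Eqs. (7)-(8), while
# fine-tuning theta_r. (In the full model theta_u also receives gradients through the
# P^u_t weighting of Eq. (4); this toy forward pass omits that path.)
params = (list(encoder.parameters()) + list(tag_head.parameters())
          + list(slot_head.parameters()) + list(int_head.parameters()))
opt2 = torch.optim.Adam(params, lr=1e-3)
for _ in range(3):
    _, slot_logits, intent_logits = forward(x)
    loss = xent(slot_logits.reshape(-1, 127), y.reshape(-1)) + xent(intent_logits, intent)
    opt2.zero_grad(); loss.backward(); opt2.step()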
4.2.2. Details of IOB Format Support

Thanks to the progressive training procedure, the IOB format is naturally supported by our model. As shown in Figure 2, in the case of "New York" with "B-loc I-loc" user info tags, we take the two words together to extract the prior geographical distance dist("New York", "Brooklyn, NY"). Moreover, since B-loc and I-loc are considered as different tags in the output P^u_t of the user info tagging component, they can be used directly to infer B-fromloc and I-fromloc in the slot filling component.

In the case that the type of user info of the t-th word is incorrectly tagged, the hidden state h_t and language context c_t will be used to infer the slot tags, since the user info tagging output P^u_t puts more weight on h_t and c_t in this case. In addition, the second phase of the training procedure, which jointly trains all components, also learns to use more language context to correct the incorrectly tagged type of user info.

[Figure 2: Support of IOB Format (omitted other model details); for "New York" tagged B-loc I-loc, \delta_1(loc) = \delta_2(loc) = dist("New York", "Brooklyn, NY")]
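A small helper sketch for this IOB handling: consecutive B-/I- tagged words of the same user info type are grouped into one phrase before the external prior is queried (function and variable names are ours, not the paper's):

def group_iob_spans(words, tags):
    """Group consecutive B-x/I-x user info tags into (type, phrase) spans, so that
    e.g. ["New", "York"] with ["B-loc", "I-loc"] queries dist("New York", ...) once."""
    spans, current = [], None
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            if current:
                spans.append(current)
            current = (tag[2:], [word])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(word)
        else:                       # "O" or an inconsistent I- tag closes the span
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(t, " ".join(ws)) for t, ws in spans]

print(group_iob_spans(["flights", "to", "New", "York"], ["O", "O", "B-loc", "I-loc"]))
# [('loc', 'New York')] -> one distance query for the whole phrase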
Remarks: The capability of prior knowledge distillation in our approach leverages user information to largely improve performance and reduce the amount of annotated training data required. Moreover, the overall training time is also largely shortened, since our approach divides SLU into simpler subproblems, each of which is much easier to train.

5. Experimental Evaluation

5.1. Dataset

We evaluate our approach on the ATIS (Airline Travel Information Systems) dataset [19], a widely used dataset in SLU research. The training set contains 4,978 utterances from the ATIS-2 and ATIS-3 corpora, and the test set contains 893 utterances from the ATIS-3 data sets. There are 127 distinct slot labels and 22 different intent classes.

Due to the lack of benchmark datasets with user info, we design the following two mechanisms to synthesize two types of user info, user contextual location and user preferred time period, for the ATIS dataset. We first construct the user info dictionary by including all slots with the "loc" keyword under contextual location and all slots with the "time" keyword under user preferred time period.

The prior distances δ for contextual location are computed using the Google Maps Distance Matrix API [18]. For time periods, we calculate δ as the difference between the tagged time stamp in an utterance and the middle time stamp of the user's preferred time period.

Contextual Location: W.l.o.g., we synthesize user contextual … is usually close to the flight departure city. We first extract all …

Table 2: Examples of synthesized user info in ATIS dataset

Type            Content
arrive period   {"arrive_time.period_of_day": "morning"}
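The two synthesis rules just described can be sketched as follows (the slot names, time format, and period midpoints are illustrative assumptions):

from datetime import datetime

# Rule 1: build the user info dictionary by keyword matching over ATIS slot names.
atis_slots = ["fromloc.city_name", "toloc.city_name", "depart_time.time",
              "arrive_time.period_of_day", "round_trip"]
user_info_dict = {
    "contextual location":   {s for s in atis_slots if "loc" in s},
    "preferred time period": {s for s in atis_slots if "time" in s},
}

# Rule 2: delta for a time-typed word = difference between the tagged time stamp in
# the utterance and the middle of the user's preferred time period (in hours here).
PERIOD_MIDPOINT = {"morning": "09:00", "afternoon": "15:00", "evening": "20:00"}

def time_period_delta(tagged_time, preferred_period):
    fmt = "%H:%M"
    mid = datetime.strptime(PERIOD_MIDPOINT[preferred_period], fmt)
    t = datetime.strptime(tagged_time, fmt)
    return abs((t - mid).total_seconds()) / 3600.0

print(sorted(user_info_dict["contextual location"]))   # ['fromloc.city_name', 'toloc.city_name']
print(time_period_delta("08:00", "morning"))            # 1.0 hour from the middle of 'morning'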
We use the same hyperparameters as the original paper of the base Att-BiRNN model [2], since our model does not have additional hyperparameters.

… ATIS training set and randomly sampled 3 different sizes (2,000, 3,000 and 4,000 utterances) out of the total 4,978 utterances. Figure 3 reports the average performance results over 10 differently sampled training sets of each size.

[Figure 3: two panels of F1 Score vs. Size of Training Set, for Prog-BiRNN and the baselines]

Since location-related slots are the majority of all slots in the ATIS dataset, we first consider using only contextual location as user info. As shown in Figure 3a, the F1 score of slot filling outperforms both baseline approaches with around 0.2% absolute gain at each size. The accuracy improvement for intent detection is around 0.1%, and up to 0.2% for the full-size training set; this slightly smaller improvement margin is due to the small number of intent classes. When using both contextual location and preferred time period as user info, we observe more significant improvements, with a 0.25% gain for intent detection and a 0.31% gain for slot filling. Note that …

[Figure 4: Training time results on full size training set using both contextual location & preferred time periods as user info; compared: Prog-BiRNN, Att-BiRNN, and the baseline without user info]

6. Conclusion

We present a novel progressive neural network model that trains a semantic frame parsing model by incorporating user information. Using simple user information, we show that our approach not only significantly improves performance but also largely reduces the need for annotated training data. In addition, our approach shortens the training time needed to reach competitive performance. Thus, we enable the quick development of semantic frame parsing models with less annotated training data in new domains.
7. References

[1] P. Haffner, G. Tur, and J. H. Wright, "Optimizing svms for complex call classification," in Acoustics, Speech, and Signal Processing, 2003. Proceedings (ICASSP'03). 2003 IEEE International Conference on, vol. 1. IEEE, 2003, pp. I–I.

[2] B. Liu and I. Lane, "Attention-based recurrent neural network models for joint intent detection and slot filling," arXiv preprint arXiv:1609.01454, 2016.
[3] A. McCallum, D. Freitag, and F. C. N. Pereira, “Maximum
entropy markov models for information extraction and
segmentation,” in Proceedings of the Seventeenth International
Conference on Machine Learning, ser. ICML ’00. San
Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2000,
pp. 591–598.
[4] K. Yao, B. Peng, Y. Zhang, D. Yu, G. Zweig, and Y. Shi, “Spoken
language understanding using long short-term memory neural
networks,” in 2014 IEEE Spoken Language Technology Workshop
(SLT), Dec 2014, pp. 189–194.
[5] G. Mesnil, Y. Dauphin, K. Yao, Y. Bengio, L. Deng, D. Hakkani-
Tur, X. He, L. Heck, G. Tur, D. Yu et al., “Using recurrent neural
networks for slot filling in spoken language understanding,”
IEEE/ACM Transactions on Audio, Speech and Language
Processing (TASLP), vol. 23, no. 3, pp. 530–539, 2015.
[6] B. Peng and K. Yao, “Recurrent neural networks with
external memory for language understanding,” arXiv preprint
arXiv:1506.00195, 2015.
[7] B. Liu and I. Lane, “Recurrent neural network structured
output prediction for spoken language understanding,” in Proc.
NIPS Workshop on Machine Learning for Spoken Language
Understanding and Interactions, 2015.
[8] G. Kurata, B. Xiang, B. Zhou, and M. Yu, “Leveraging
sentence-level information with encoder lstm for natural language
understanding,” arXiv preprint, 2016.
[9] D. Guo, G. Tur, W.-t. Yih, and G. Zweig, “Joint semantic utterance
classification and slot filling with recursive neural networks,”
in Spoken Language Technology Workshop (SLT), 2014 IEEE.
IEEE, 2014, pp. 554–559.
[10] P. Xu and R. Sarikaya, “Convolutional neural network based
triangular crf for joint intent detection and slot filling,” in
2013 IEEE Workshop on Automatic Speech Recognition and
Understanding, Dec 2013, pp. 78–83.
[11] D. Hakkani-Tür, G. Tür, A. Celikyilmaz, Y.-N. Chen, J. Gao,
L. Deng, and Y.-Y. Wang, “Multi-domain joint semantic frame
parsing using bi-directional rnn-lstm.” in INTERSPEECH, 2016,
pp. 715–719.
[12] http://www.businessinsider.com/amazon-alexa-how-many-skills-
chart-2017-7.
[13] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer,
J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell,
“Progressive neural networks,” arXiv preprint arXiv:1606.04671,
2016.
[14] C. Raymond and G. Riccardi, “Generative and discriminative
algorithms for spoken language understanding,” in
INTERSPEECH, 2007.
[15] Ö. Yürür, C. H. Liu, Z. Sheng, V. C. M. Leung, W. Moreno, and
K. K. Leung, “Context-awareness for mobile sensing: A survey
and future directions,” IEEE Communications Surveys Tutorials,
vol. 18, no. 1, pp. 68–93, 2016.
[16] C. Perera, A. Zaslavsky, P. Christen, and D. Georgakopoulos,
“Context aware computing for the internet of things: A survey,”
IEEE Communications Surveys Tutorials, vol. 16, no. 1, pp. 414–
454, 2014.
[17] X. Su and T. M. Khoshgoftaar, “A survey of
collaborative filtering techniques,” Adv. in Artif. Intell.,
vol. 2009, pp. 4:2–4:2, Jan. 2009. [Online]. Available:
http://dx.doi.org/10.1155/2009/421425
[18] https://developers.google.com/maps/documentation/distance-matrix.

[19] C. T. Hemphill, J. J. Godfrey, and G. R. Doddington, "The atis spoken language systems pilot corpus," in Proceedings of the Workshop on Speech and Natural Language, ser. HLT '90. Stroudsburg, PA, USA: Association for Computational Linguistics, 1990, pp. 96–101.

[20] https://developers.google.com/places/web-service.