discusses related work. Section 3 introduces our proposed method. Section 4 presents experiments on intention detection on four different domains. Finally, Section 5 concludes the paper and discusses future work.

2. RELATED WORK
A considerable amount of work has been done on intention detection and domain adaptation, which we review in this section.

Intention detection. Because of its importance in advertisement and targeted marketing, the task of identifying intention from social media such as tweets and forum posts has attracted substantial research interest in recent years. The most straightforward approach is to formulate the task as a text categorization problem, which can be solved with supervised learning methods. Hollerit et al. [14] classify tweets into containing and not containing intents by training support vector machines (SVMs) on word and part-of-speech n-grams extracted from tweets. Luong et al. [20] use maximum entropy with n-gram features to classify forum and Facebook posts, written in Vietnamese, into intention-containing and normal ones. While simple and straightforward, supervised learning approaches require laborious annotation of documents to create training data, which limits their applicability across application domains. To alleviate this problem, techniques that make use of unlabeled data from the same domain or labeled data from other domains have been applied to identify and classify intents. Wang et al. [25] use both labeled and unlabeled data to detect intent-containing tweets and classify them into six categories with a graph-based semi-supervised learning method. Informally, their method classifies a tweet based on the proximities of its content to words that are specific to intent categories. Chen et al. [8] leverage labeled data from other domains to train a classifier for the target domain by using domain adaptation techniques. They propose a co-training-like algorithm that alternates between two classifiers trained on the source and target domains to boost the final performance. Ding et al. [10] also report on the successful application of domain adaptation by using convolutional neural networks with shared middle layers, which were trained on labeled data from other domains. A different approach has been proposed by Li et al. [16], in which they use Wikipedia as an external source of knowledge and map microblog posts to Wikipedia concepts to classify them, thus reducing the need for large training sets.

Domain adaptation and transfer learning. Domain adaptation and the closely related transfer learning have long been studied in settings where we have labeled data from one or more domains and want to use the data to train classifiers for another domain, but the distributions of features and/or labels are different from domain to domain [2, 4, 23]. Blitzer et al. [4] were among the first to apply domain adaptation to sentiment classification, a document categorization problem by nature. They train a classifier to predict the presence of domain-dependent features based on domain-independent features, thus reducing the mismatch of features between different domains. Other approaches use different types of transformations to project source and/or target features into a new feature space so that they have similar distributions in the new space (an overview is given in [23]). Pan et al. [22] propose a so-called transfer component analysis to optimize an objective function based on the Maximum Mean Discrepancy principle. They reported good transfer effects over several applications. Bach et al. [2]
propose a method that finds a new feature space by combining canonical correlation analysis with word embeddings. They report improved performance on the cross-domain sentiment classification task. Liu and colleagues use the name "lifelong learning" for a group of methods that utilize data from other domains to support supervised or unsupervised learning in the target domain [7]. Among the methods from that group is one that uses topics learnt from other domains to improve topic modeling in the target domain [6]. Chen et al. [9] also propose a lifelong learning method that uses stochastic gradient descent to adjust the importance of features from multiple source and target domains, so as to gain the maximum positive effect of feature transfer while reducing the negative effect (if any). They report improved performance for sentiment classification using labeled data from multiple source domains.
For document categorization in a new domain, we may have some labeled data and a lot of unlabeled data. This is a typical semi-supervised learning setting. A popular semi-supervised learning method is co-training [5], which is used by Chen et al. [8] in their intention detection method to combine labeled and unlabeled data from different domains. Graph-based learning is another popular semi-supervised learning approach, which has also been applied to intent detection and classification, as reported by Wang et al. [25]. Besides the semi-supervised setting, the setting in which training data contain noisy labels is known as weakly supervised learning; an example is reported by Bach et al. [3], who use noisy training data derived from ratings to augment labeled data for predicting the sentiment polarity of new review posts.
Our method here is based on domain adaptation of external data sources for improved intention detection in forum posts.
3. PROPOSED METHOD
In this section, we present our method for cross-domain intention detection in discussion forums. We consider the case where we have labeled data in both the source domains and the target domain. The goal is to leverage labeled data from multiple source domains to improve the performance of intention detection in the target domain.
3.1 Method Overview
As illustrated in Figure 2, our method consists of three modules: data aggregation, optimization, and classification.

• Data aggregation: This module extracts knowledge from multi-source domains and stores it in a knowledge base.

• Optimization: This module utilizes knowledge from the knowledge base to optimize key parameters, which will be used in the classification model.

• Classification: This module uses the optimized parameters to build the Naive Bayesian classification model.

Figure 2: A method for cross-domain intention detection.

In the following, we describe these modules in detail. Classification with Naive Bayes will be presented before the optimization section for clarity and readability purposes.
3.2 Aggregation
For each source domain $\hat{s}$, we count the number of times a word $w$ appears in the positive or negative class, $N_{+,w}^{\hat{s}}$ and $N_{-,w}^{\hat{s}}$. We then compute the number of occurrences of $w$ in the documents of the positive (and negative) class in all the source domains and store them in a knowledge base:

$$N_{+,w}^{KB} = \sum_{\hat{s}} N_{+,w}^{\hat{s}}$$

$$N_{-,w}^{KB} = \sum_{\hat{s}} N_{-,w}^{\hat{s}}$$
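To make the aggregation step concrete, the following is a minimal Python sketch (our illustration, not the authors' code; names such as build_knowledge_base are ours):

```python
from collections import Counter

def count_words(documents):
    """Count word occurrences over a list of tokenized documents."""
    counts = Counter()
    for doc in documents:
        counts.update(doc)
    return counts

def build_knowledge_base(source_domains):
    """Aggregate per-class word counts over all source domains.

    source_domains: list of (positive_docs, negative_docs) pairs, one
    pair per source domain, where each element is a list of tokenized
    posts. Returns (N_pos_KB, N_neg_KB), the knowledge-base counts.
    """
    n_pos_kb, n_neg_kb = Counter(), Counter()
    for pos_docs, neg_docs in source_domains:
        n_pos_kb.update(count_words(pos_docs))  # sum over domains of N^s_{+,w}
        n_neg_kb.update(count_words(neg_docs))  # sum over domains of N^s_{-,w}
    return n_pos_kb, n_neg_kb
```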
3.3 Classification with Naive Bayes
In the classification step, the goal is to find a class label $c_j$ given a sample $d$. Here, $d$ is a document, i.e. a post in a discussion forum, and $c_j$ is a class label indicating whether the post expresses an intention ($+$) or not ($-$). Naive Bayesian classification finds the class label that maximizes the conditional probability $P(c_j|d)$.

By using Bayes' theorem and the independence assumption, we have:

$$P(c_j|d) = \frac{P(d|c_j)P(c_j)}{P(d)} \approx \frac{\prod_w P(w|c_j) \cdot P(c_j)}{P(d)} \quad (1)$$

where the product is computed over all words $w$ in $d$. Because the denominator of Equation (1) is independent of the class label, it can be ignored in computation. Furthermore, $P(c_j)$ can be estimated from the frequency of label $c_j$ in the training data, so we focus on the key parameters $P(w|c_j)$, which are computed as follows:

$$P(w|c_j) = \frac{\lambda + N_{c_j,w}}{\lambda|V| + \sum_{v \in V} N_{c_j,v}}$$

where $N_{c_j,w}$ is the frequency of word $w$ in documents of class $c_j$, $|V|$ is the size of the vocabulary $V$, and $\lambda$ ($0 \leq \lambda \leq 1$) is used for smoothing.
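As an illustration, the smoothed estimates and the decision rule of Equation (1) can be implemented as follows. This is a generic multinomial Naive Bayes sketch in log space (to avoid floating-point underflow), not the authors' implementation:

```python
import math
from collections import Counter

def train_nb(docs_by_class, lam=1.0):
    """Estimate log P(c) and log P(w|c) with lambda smoothing.

    docs_by_class: dict mapping a class label ('+' or '-') to the list
    of tokenized documents of that class.
    """
    vocab = {w for docs in docs_by_class.values() for d in docs for w in d}
    total_docs = sum(len(docs) for docs in docs_by_class.values())
    log_prior, log_like = {}, {}
    for c, docs in docs_by_class.items():
        counts = Counter(w for d in docs for w in d)     # N_{c,w}
        denom = lam * len(vocab) + sum(counts.values())  # lambda*|V| + sum_v N_{c,v}
        log_prior[c] = math.log(len(docs) / total_docs)  # estimate of P(c)
        log_like[c] = {w: math.log((lam + counts[w]) / denom) for w in vocab}
    return log_prior, log_like, vocab

def classify(doc, log_prior, log_like, vocab):
    """Return the class maximizing log P(c) + sum_w log P(w|c), as in Eq. (1)."""
    return max(
        log_prior,
        key=lambda c: log_prior[c]
        + sum(log_like[c][w] for w in doc if w in vocab),
    )
```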
Recall that we consider the task in a cross-domain setting, so $N_{c_j,w}$ will be counted over the whole dataset, consisting of labeled data in both the source and target domains. A simple method to compute $N_{c_j,w}$ is to sum up the counts in the multi-source domains, i.e. in the knowledge base, and the empirical counts in the target domain as follows:

$$N_{+,w} = N_{+,w}^{KB} + N_{+,w}^{t}$$

$$N_{-,w} = N_{-,w}^{KB} + N_{-,w}^{t}$$

where $t$ denotes the target domain and $KB$ denotes the source domains in the knowledge base. This method, however, has two weaknesses.

• Past domains contain much more data than the target domain. Merged results, therefore, may be dominated by the counts from the source domains.

• The method does not consider domain-dependent words. A word may be an indicator of intention ($+$) in the target domain but not ($-$) in the source domains.

To deal with these problems, we introduce a method that revises the counts by optimizing two variables, $X_{+,w}$ and $X_{-,w}$, the numbers of times that a word $w$ appears in the positive and negative class, respectively. In classification, we will use those virtual counts instead of the empirical counts $N_{+,w}$ and $N_{-,w}$.
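A minimal sketch of the naive count merging described above (our illustration; in the full method, the optimized virtual counts $X_{+,w}$ and $X_{-,w}$ would replace these merged counts, but that optimization step is not shown in this excerpt):

```python
from collections import Counter

def merge_counts(n_kb, n_target):
    """Naive merge: N_{c,w} = N^KB_{c,w} + N^t_{c,w} for one class c."""
    merged = n_kb.copy()
    merged.update(n_target)  # Counter.update adds counts element-wise
    return merged

# Toy example: the knowledge-base counts dominate the much smaller
# target-domain counts, illustrating the first weakness noted above.
n_pos = merge_counts(Counter({"buy": 500}), Counter({"buy": 3}))
print(n_pos["buy"])  # 503
```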
based on both classifiers using a bootstrapping technique. An important point is that both classifiers use the same feature set selected from the target data.
• Combined: This model used labeled data in the three source domains and 9/10 of the data in the target domain to train a Naive Bayesian classification model. The purpose of this experiment is to investigate the performance of the detection system when we combine the source and target domains without domain adaptation.
• Our method: Our method is similar to the Combined model but uses an optimization technique for domain adaptation. In effect, the method combines source and target domain labeled data based on their contribution to the final classification accuracy.
We summarize the experimented methods in Table 2. Note that the first three methods, i.e. Baseline1, Baseline2, and Co-Class, were run with the same settings as described by Chen et al. [8]. For Combined and Our method, we selected 2500 features on the target domain and 1500 features on each source domain. We conducted 10-fold cross-validation for the models that used part of the target domain data in the training process, i.e. Baseline1, Combined, and Our method.
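For concreteness, the per-domain evaluation protocol could be approximated as in the sketch below, using scikit-learn. The 2500-feature budget follows the setting above, but the feature-selection criterion (chi-square here) and the other details are our assumptions, as the text does not specify them:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import StratifiedKFold, cross_val_score

def evaluate(posts, labels, k_features=2500):
    """10-fold cross-validated F1 score for a Naive Bayes pipeline.

    posts: list of raw post strings; labels: 1 for intent, 0 otherwise.
    """
    model = Pipeline([
        ("vectorize", CountVectorizer()),             # bag-of-words counts
        ("select", SelectKBest(chi2, k=k_features)),  # keep the top-k features
        ("classify", MultinomialNB()),                # smoothed Naive Bayes
    ])
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    return cross_val_score(model, posts, labels, cv=cv, scoring="f1").mean()
```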
4.3 Results
Tables 3, 4, 5, 6 summarize the experimental results on the four domains. In each table, we show Precision, Recall, and F1 scores, averaged over 10 folds for one target domain. Note that the F1 score is the most important metric as it balances Precision and Recall. As can be seen, there is no clear winner between Baseline1 and Baseline2. Each method achieved higher F1 scores in two domains and lower scores in the others. These results suggest that using lots of training data from other domains may be more or less useful than using less training data from the same domain, depending on the specific case.

Co-Class is comparable to or slightly better than either of the two baselines. Specifically, Co-Class achieved F1 scores that are comparable with those of Baseline1 in Cellphone and TV, and higher scores in Electronics and Camera, beating both Baseline1 and Baseline2. A somewhat surprising observation is that a simple combination of labeled data from the source and target domains as training data (the Combined method) achieved better results than the more sophisticated Co-Class in all four cases. A possible explanation for the superiority of Combined over Co-Class is that the former uses some labeled data of the target domain, and that data is important for classification. The Combined method is also consistently better than both Baseline1 and Baseline2. These results render simple Combined a very competitive method.

Of the four domains, Camera is the least sensitive to the methods used. For this domain, the last three methods achieved nearly the same F1 scores, while the worst performing method is behind by only 2%.

In all cases except the insensitive Camera domain, our proposed method consistently outperformed the other experimented methods, achieving the most accurate results in terms of F1 scores. The differences in F1 score between our method and the second best method, namely Combined, are almost 2% for Cellphone and TV and 0.8% for Electronics. All differences are statistically significant according to a t-test with a threshold of 0.05. Since both Combined and our method use similar training and test data, we believe the improvement of the latter comes from the optimization we have performed to calculate the posterior probability of the Naive Bayes classifier, which is our main contribution in this paper.
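The significance check can be reproduced with a t-test over the per-fold F1 scores. Below is a sketch with placeholder numbers; we assume a paired test over the 10 shared folds, which the text does not spell out:

```python
from scipy import stats

# Placeholder per-fold F1 scores, NOT the paper's actual numbers.
f1_ours     = [0.81, 0.79, 0.83, 0.80, 0.82, 0.78, 0.81, 0.80, 0.82, 0.79]
f1_combined = [0.79, 0.78, 0.81, 0.78, 0.80, 0.77, 0.79, 0.78, 0.80, 0.78]

# Paired t-test: both methods are evaluated on the same folds,
# so the scores form paired samples.
t_stat, p_value = stats.ttest_rel(f1_ours, f1_combined)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # significant if p < 0.05
```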
Error Analysis. We now analyze some cases in which our system made a mistake. We divide errors into two main types: false positives (a non-intent post was predicted as intent) and false negatives (an intent post was predicted as non-intent). For each type, we list several typical examples.

1. False positives. These posts usually contain intention-describing words, but their meaning does not include a purchase intention. Another case is when the post shares an experience: the author had already bought the product when he posted. Here are some examples.

• Who is looking to buy a camera as a semi proffessional camera mix of SLR and normal digital camera and easy to learn I advice to buy Canon SX1 IS as HD viedo, 2.8 LCD Rotate, 20X zoom and
Table 2: Methods to compare

Model      | Training data          | Test data   | Exp. method      | Learning algorithm
-----------|------------------------|-------------|------------------|-------------------
Baseline1  | 9/10 target            | 1/10 target | cross-validation | NB
Baseline2  | 3 sources              | target      | one time         | NB
Co-Class   | 3 sources              | target      | one time         | NB, bootstrapping
Combined   | 3 sources, 9/10 target | 1/10 target | cross-validation | NB
Our method | 3 sources, 9/10 target | 1/10 target | cross-validation | NB, optimization