
Deep Search Query Intent Understanding

Xiaowei Liu, Weiwei Guo, Huiji Gao, Bo Long


LinkedIn, Mountain View, California
{xwli,wguo,hgao,blong}@linkedin.com

arXiv:2008.06759v2 [cs.CL] 18 Aug 2020

ABSTRACT

Understanding a user's query intent behind a search is critical for modern search engine success. Accurate query intent prediction allows the search engine to better serve the user's need by rendering results from more relevant categories. This paper aims to provide a comprehensive learning framework for modeling query intent under different stages of a search. We focus on the design for 1) predicting users' intents as they type in queries on-the-fly in typeahead search using character-level models; and 2) accurate word-level intent prediction models for complete queries. Various deep learning components for query text understanding are experimented. Offline evaluation and online A/B test experiments show that the proposed methods are effective in understanding query intent and efficient to scale for online search systems.

KEYWORDS

Query Intent, Query Classification, Natural Language Processing, Deep Learning

1 INTRODUCTION

Modern search engines provide search services specialized across various domains (e.g., news, books, and travel). Users come to a search engine to look for information with different possible intents: choosing favorite restaurants, checking opening hours, or restaurant addresses on Yelp; searching for people, finding job opportunities, looking for company information on LinkedIn, etc.

Understanding the intent of a searcher is crucial to the success of search systems. Queries contain rich textual information provided explicitly by the searcher, and hence are a strong indicator of the searcher's intent. Understanding the underlying searcher intent from a query is referred to as the task of query intent modeling.

Query intent is an important component in the search engine ecosystem [12, 16, 20]. As shown in Figure 1, when the user starts typing a query, the intent is predicted based on the incomplete character sequence; when the user finishes typing the whole query, a more accurate intent is predicted based on the completed query. Understanding the user intent accurately allows the search engine to trigger the corresponding vertical searches, as well as to better rank the retrieved documents based on the intent [2], so that users do not have to refine their searches by explicitly navigating through the different facets in the search engine.

Traditional methods rely on bag-of-words representations and rule based features to perform intent classification [2, 3]. Recently, deep learning based models [10] have shown significant improvement, as they handle similar words and word sense disambiguation well. However, developing deep learning based query intent models for production requires addressing several challenges. Firstly, production models have very strict latency requirements, and the whole process needs to be finished within tens of milliseconds. Secondly, queries, usually with two or three words in a complete query or several characters in an incomplete one, have limited contexts.

This paper proposes a practical deep learning framework to tackle the two challenges, with the goal of improving LinkedIn's commercial search engine. Two search result blending components were identified where query intent is useful: incomplete query intent for typeahead blending, and complete query intent for SERP (search engine result page) blending.

The common part of both systems is to use query intent to assist the ranking of retrieved documents of different types. Meanwhile, the two products have their unique challenges. Typeahead blending has strong latency requirements; the input is an incomplete query (a sequence of characters); and it is acceptable to return a fuzzy prediction, since users will continue to type the whole query if they do not find the results relevant. On the other hand, SERP blending has less of a latency constraint compared to typeahead, but a higher accuracy requirement, as it directly affects the search result page.

Based on the characteristics of the production applications, we propose different solutions. For typeahead blending, a character-level query representation is used, as the resulting models are compact in terms of the number of parameters. Meanwhile, it can handle multilinguality well due to the small vocabulary size. For SERP blending, the complete query intent model is word level. Since the accuracy standard is high, BERT is explored to extract query representations, which leads to a more robust model.

This paper is motivated by tackling the challenges in query intent prediction, while satisfying production needs in order to achieve real-world impact in search systems. The major contributions are:

• Developed a practical deep query intent understanding framework that can adapt to various product scenarios in industry. It allows for fast and compact models suitable for online systems, with the ability to incorporate traditional features that enable more accurate predictions.
• Developed and deployed character-level deep models to production that are scalable for incomplete query intent prediction. In addition, we propose a multilingual model that incorporates language features, which is accurate and easier to maintain than traditional per-language models.
• Developed and deployed a BERT based model to production for complete query intent prediction. To our best knowledge, this is the first reported BERT model for query intent understanding in real-world search engines.
• Conducted comprehensive offline experiments and online A/B tests on various neural network structures for deep query intent understanding, as well as an in-depth investigation of token granularity (word-level, character-level), DNN components (CNN, LSTM, BERT), and multilingual models, with practical lessons summarized.

Figure 1: Query intent in search engines for incomplete queries (typeahead result blending, shown for the prefixes "l", "li", and "link") and complete queries (SERP result blending, shown for the query "linkedin").

2 QUERY INTENT UNDERSTANDING AT LINKEDIN

LinkedIn search hosts many different types of documents, e.g., user profiles, job posts, user feeds, etc. When a user issues a query without specifying the document type they are interested in, identifying the intent is crucial to retrieve relevant documents and provide a high-quality user experience. At LinkedIn, we define the query intent as the document type.

Query intent is important for result blending [12]: (1) When an intent is not presented in the query, the corresponding vertical search may not be triggered. (2) For the documents retrieved from the triggered vertical searches, a result blending algorithm will rank the documents based on the detected intents and other features. In this section, we present two products where query intent is an important feature for the blending algorithm, followed by how query intent is used in the blending algorithm.

2.1 Query Intent in Typeahead Blending

When a user starts typing, the query intent is detected and used in typeahead blending. At LinkedIn, the typeahead product directly displays document snapshots from multiple vertical searches, which is different from traditional query auto completion that only generates query strings. The left three snapshots in Figure 1 show an example of typeahead blending results at LinkedIn. The example assumes the user is searching for the company "LinkedIn". Blended results are rendered as soon as the user has typed one letter "l". Next, given the query prefix "li", the intent prediction has a tendency towards a people result type, and many people profiles are ranked higher than company or group results. After the user types "link" in the third picture, the company result LinkedIn is ranked first.

Query intent for typeahead blending is challenging given that the queries are often incomplete and contain only several letters. In addition, for every keystroke, the system needs to retrieve the documents from different vertical searches and blend the results. This means the query intent models will be called frequently, and each run should be finished within a short time.

2.2 Query Intent in SERP Blending

SERP blending is a more common component in search engines than typeahead blending. When a user finishes typing and hits "search", a complete query is issued; the query intent is identified and used for retrieved document blending. The rightmost block in Figure 1 shows SERP blending results for the complete query "linkedin", including company pages, people profiles, job posts, etc.

Compared to typeahead blending, query intent in SERP blending has a larger latency buffer. Queries contain complete words; however, they still suffer from limited context: only several words in a query for intent prediction.

2.3 Retrieved Document Blending Systems

Both the typeahead blending and SERP blending systems follow a similar design. Multiple features are generated for blending/ranking the retrieved documents: (1) the probability of each intent, based on query texts and user behaviors (the query intent model output); (2) matching features between the query and the retrieved documents; (3) personalized and contextualized features.

In the rest of this paper, we focus on how to generate the query intent probability for typeahead blending and SERP blending.
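
To make the blending input concrete, below is a minimal sketch of how a per-document feature vector for the blending ranker could be assembled; the feature names and the intent set are hypothetical, for illustration only.

def assemble_blending_features(intent_probs, match_score, personal_feats):
    # intent_probs: query intent model output, e.g. {"people": 0.7, "job": 0.2}
    # match_score: a query-document matching feature
    # personal_feats: personalized / contextualized features
    intents = ["company", "content", "job", "people"]  # hypothetical intent set
    return [intent_probs.get(k, 0.0) for k in intents] + [match_score] + list(personal_feats)

# e.g. assemble_blending_features({"people": 0.7, "job": 0.2}, 0.83, [0.1, 0.4])
# -> [0.0, 0.0, 0.2, 0.7, 0.83, 0.1, 0.4]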

3 PROBLEM DEFINITION

Both incomplete query intent and complete query intent are essentially classification tasks. Without loss of generality, given a user id u ∈ U and a query string q ∈ Q, the goal is to learn a function γ that predicts the intent i among a finite set of predefined intent classes I = {i_1, i_2, ..., i_n}:

γ : ⟨Q, U⟩ → I

For the two tasks, incomplete/complete query intent, the intent class sets are slightly different, due to the design of the products. As shown in the next section, deep learning models are applied to the query strings; the user ids are used to generate personalized features.

4 DEEP QUERY INTENT UNDERSTANDING

In this section, we introduce the proposed intent modeling framework, as well as the detailed design of the two applications: the incomplete query intent model and the complete query intent model.

4.1 Product Requirements

As shown in Figure 1, there are two result blending products that rely on query intent: typeahead blending and SERP blending. These two products pose several requirements on the query intent models. Typeahead blending has a strict latency standard: for every keystroke, the model needs to return results. Meanwhile, it does not require very high accuracy, since the prediction becomes more precise as it receives more characters. On the other hand, SERP blending has more latency buffer, but it requires high accuracy of query intent, otherwise users might abandon the search.

4.2 Intent Modeling Framework

Driven by the product requirements, we design a framework for query intent understanding. The overall architecture is shown in Figure 2.

Figure 2: Deep query intent understanding framework. Query text is mapped to an input representation, encoded by a deep module (CNN/LSTM/BERT) into query embeddings, concatenated with traditional features, and passed through a dense layer to the output layer.

4.2.1 Input Representation. The input to the model is represented as a sequence of embeddings. Two granularity choices are provided: character- and word-level embeddings, to support incomplete and complete queries, respectively.

4.2.2 Deep Modules. In this framework, several popular text encoding methods are provided to generate query embeddings. This enables good flexibility to adapt the framework to various product scenarios under different latency / accuracy constraints.

CNN is powerful at extracting local n-gram features in a sequence [15, 17]. The input to the CNN is a sequence of token embeddings, i.e., the embedding matrix. The 1-dimensional convolution layers could involve multiple filters of different heights. The width of the CNN filters is always the same size as the embedding dimension, while the height of the filters could vary: it represents the word or character n-grams covered by the filters. Max-pooling over time is done after each convolution layer.

Compared with CNN, long distance dependencies can be better captured by LSTM [11], especially in character sequences. A bi-directional LSTM [29] is used to model the sequence information in both the forward and backward directions. The last hidden states of both directions are concatenated together to form the output layer.

BERT [9] uses self attention [26] to explicitly integrate contextual word meaning into the target word, hence it is better at word sense disambiguation. Meanwhile, the pretraining enables using a large amount of unsupervised data. Given a query, BERT takes a sequence of tokens as input and outputs the contextualized representation of the sequence. A special token [CLS] at the beginning of each sequence models the representation of the entire sequence and is used for classification tasks.
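
To make the encoder choices concrete, below is a minimal sketch of the CNN and BiLSTM query encoders in tf.keras, using the character setting from Section 5.2.1 (vocabulary 500, embedding size 128, 128 filters of size 3, 128 hidden units); the maximum sequence length MAX_LEN is our own assumption, not a number from the paper.

import tensorflow as tf

VOCAB_SIZE, EMB_DIM, MAX_LEN = 500, 128, 40  # MAX_LEN is an assumed padding length

def cnn_encoder():
    # 1-D convolution over the embedding matrix, then max-pooling over time.
    inp = tf.keras.Input(shape=(MAX_LEN,), dtype="int32")
    x = tf.keras.layers.Embedding(VOCAB_SIZE, EMB_DIM)(inp)
    x = tf.keras.layers.Conv1D(filters=128, kernel_size=3, activation="relu")(x)
    x = tf.keras.layers.GlobalMaxPooling1D()(x)
    return tf.keras.Model(inp, x)

def bilstm_encoder():
    # Bi-directional LSTM; the last hidden states of the forward and
    # backward directions are concatenated by the Bidirectional wrapper.
    inp = tf.keras.Input(shape=(MAX_LEN,), dtype="int32")
    x = tf.keras.layers.Embedding(VOCAB_SIZE, EMB_DIM)(inp)
    x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128))(x)
    return tf.keras.Model(inp, x)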

4.2.3 Traditional Features. Traditional features are hand-crafted features, which are powerful for capturing contextual information that is complementary to the deep textual features. There are various types of traditional features that can be considered in production, such as language features and user profile / behavioral features. These features are especially important for enhancing the limited context of short queries. In a wide-and-deep fashion [7], the traditional features are concatenated with the query embeddings, and then fed to a dense layer to get non-linear interactions among the features.
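
Continuing the sketch above, the wide-and-deep combination can be expressed as a concatenation followed by a dense layer; the dense layer size of 200 is taken from Section 5.3.1, while the function signature is our own illustration.

def intent_model(encoder, num_intents, num_trad_features):
    # Concatenate the deep query embedding with traditional features,
    # apply a dense layer for non-linear interactions, then softmax.
    query = tf.keras.Input(shape=(MAX_LEN,), dtype="int32")
    trad = tf.keras.Input(shape=(num_trad_features,), dtype="float32")
    x = tf.keras.layers.Concatenate()([encoder(query), trad])
    x = tf.keras.layers.Dense(200, activation="relu")(x)
    out = tf.keras.layers.Dense(num_intents, activation="softmax")(x)
    return tf.keras.Model([query, trad], out)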

4.3 Incomplete Query Intent Modeling

In typeahead search, users usually type a query prefix, and select the results from the drop-down bar. In this case, a large number of incomplete words are generated, which are recognized as out-of-vocabulary words. This motivates us to design the incomplete query intent classification with character-level representations. The character-level models have additional benefits: (1) they are more robust to spelling errors, compared to word-level models where misspelled words become out-of-vocabulary; (2) the resulting model is compact (1.4 Megabytes with 500 characters), as the character vocabulary size is small.

Due to its ability to capture long range dependency information, LSTM is best suited for this problem, as the sequence could be over 10 characters. In contrast, CNN captures n-gram letter patterns (e.g., tri-letters); however, it does not keep track of the contexts of the tri-letters, nor their positions. BERT is not applied in this task, because the typeahead product has a strict latency constraint: the latency would not meet the production constraint, as BERT is too computationally heavy.

4.3.1 Multilingual Support. Search systems often experience a large amount of traffic in many countries with different languages. In LinkedIn typeahead search, more than 30% of search traffic is from languages other than English. Handling the performance gap between English and other languages is crucial to enable a good user experience in typeahead.

The typical way to handle multilinguality is to train per-language models, which is not easy to maintain and scale. The character-level approach can be easily extended to support multiple languages with a unified model, since the character vocabulary of all languages combined is still much smaller than a word vocabulary.

To train a unified multilingual model, two approaches are designed to model the language/locale information:

Language Feature Embedding: The first approach is to concatenate the interface locale embedding to the input query embedding, and then feed it through a BiLSTM to generate locale "embedded" query features, as shown in Figure 3a. The embeddings of the interface locales are randomly initialized and trained together with the network.

Language Feature Concatenation: The second approach is to use the BiLSTM layer for extracting query features from the query embeddings, and then concatenate the interface locale embeddings to the last output state of the BiLSTM, as in Figure 3b. Compared to the first approach, the BiLSTM is unaware of the language.

Figure 3: The language feature embedding (a) and concatenation (b) schemes for the multilingual character-level query intent model.
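
A minimal sketch of the two schemes, reusing the encoder setting above; the locale embedding dimension of 8 is an assumption for illustration.

def multilingual_encoder(num_locales, mode="embed"):
    # mode="embed": the locale embedding is attached to every character
    # embedding before the BiLSTM, so the LSTM is language-aware (Fig. 3a).
    # mode="concat": the locale embedding is concatenated to the BiLSTM
    # output, so the LSTM itself is language-unaware (Fig. 3b).
    chars = tf.keras.Input(shape=(MAX_LEN,), dtype="int32")
    locale = tf.keras.Input(shape=(), dtype="int32")
    c = tf.keras.layers.Embedding(VOCAB_SIZE, EMB_DIM)(chars)
    l = tf.keras.layers.Embedding(num_locales, 8)(locale)  # dim 8 is assumed
    if mode == "embed":
        l_seq = tf.keras.layers.RepeatVector(MAX_LEN)(l)
        h = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128))(
            tf.keras.layers.Concatenate()([c, l_seq]))
    else:
        h = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128))(c)
        h = tf.keras.layers.Concatenate()([h, l])
    return tf.keras.Model([chars, locale], h)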

4.4 Complete Query Intent Modeling

We design word-level CNN, LSTM, and BERT based models for complete query intent classification. CNN is good at capturing word n-gram patterns. LSTM is powerful for modeling long sequence context. BERT further enriches the query representation by modeling contextual meaning. Specifically, a BERT model pre-trained with LinkedIn data (LiBERT) is used.

There are two major motivations for pre-training a BERT model with in-domain LinkedIn data (LiBERT) instead of using the off-the-shelf BERT models released by Google [9]:

Latency Constraint. The probabilities of different intents given a query need to be computed on-the-fly instead of pre-computed, as unique queries from different users are searched each day. Given that GPU/TPU serving is not yet available at LinkedIn, the latency of computing the query intent outputs using the BERT-Base model from Google won't satisfy the critical latency requirement. The LiBERT model has a lighter architecture (discussed in Section 5.3.2) than the smallest model released with the original paper [9].

Better Relevance Performance. In addition to the benefit of low latency, pre-training the BERT model from scratch with LinkedIn data leverages the in-domain knowledge in the corpus, and therefore can provide better relevance performance for downstream fine-tuning tasks. An example is that LinkedIn and Pinterest are both out-of-vocabulary words in Google's BERT-Base model. These company names are generally important for LinkedIn search.

Note that we have not explored a multilingual approach for the word-level complete query intent models. This is because the word-level vocabulary could be fairly large if the major languages are included, therefore adding more complexity to the models.

5 EXPERIMENTS

Experiments on both incomplete and complete query intent prediction models are discussed in this section. Performance comparisons in both offline and online metrics are shown as relative differences instead of absolute values due to company policy. We also share our analysis of the scalability of these deep learning based models, as well as the online A/B testing performance of these models compared with production baselines. TensorFlow Serving with CPU is used for online inference of the deep models. Traditional features such as the user behavioral features are pushed to an online store after daily offline computation and can be accessed during model inference. All the reported online metric lifts are statistically significant (p < 0.05) and are collected based on 4 weeks of 50% A/B testing search traffic. The LSTM model and the BERT model are fully ramped to production after the A/B testing for incomplete query intent and complete query intent, respectively.

5.1 Training Data Collection

Search click-through log is used for training data collection. For example, given that a user searched for linkedin sales solutions and clicked on a job posting instead of a people or content page, a job intent label is assigned to this query. Similarly, when a user typed an incomplete query face and chose the facebook company page, a company intent is inferred.

For both incomplete and complete intent models, we collected training data sampled from the click-through log for a month. For the train/dev/test datasets, we sampled 42M/21K/21K queries for the English incomplete models, 63M/36K/36K for the multilingual incomplete query models, and 24M/50K/50K for the complete query intent models.
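
A small sketch of this label derivation, with a hypothetical record schema and document-type mapping for illustration:

def label_from_click(record):
    # Map the clicked document type to an intent label; both the field
    # names and the mapping are illustrative, not the production schema.
    doc_type_to_intent = {"job_posting": "job",
                          "member_profile": "people",
                          "company_page": "company"}
    return record["query"], doc_type_to_intent.get(record["clicked_doc_type"])

# label_from_click({"query": "linkedin sales solutions",
#                   "clicked_doc_type": "job_posting"})
# -> ("linkedin sales solutions", "job")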

5.2 Incomplete Query Intent Prediction

5.2.1 Experiment Setting. We evaluate the performance of the character-based CNN and LSTM models in the offline experiments. The baseline model is learned using letter tri-gram bag-of-words features with a logistic regression classifier.

In the CNN/LSTM based models, character embeddings of size 128 with a vocabulary size of 500 are used. The characters in the vocabulary are extracted from the most frequent characters in the training data. The embeddings are randomly initialized at the beginning of training. We used 128 convolutional filters of size 3 for training the CNN model. In the bidirectional LSTM model, the same character vocabulary is used as in the CNN model, and the number of hidden units is 128.

5.2.2 Offline Experiments. Both English and multilingual incomplete query models are investigated in offline experiments. Throughout the experiment result comparison, we use the -en and -i18n notation to denote English models and multilingual models, respectively.

English Model. The first set of experiments was conducted on English incomplete queries. Experimental results show that compared with the tri-letter model, the CNN-based model CNN-en is able to achieve a higher accuracy, mostly because CNN can extract more abstract word pattern features than traditional bag-of-words features. As shown in Table 1, the bidirectional LSTM model LSTM-en can further improve the performance (+7.11% vs. +4.14%), since LSTM can better model long range character sequences.

Table 1: Offline performance comparison for different character-level models vs production baseline.

Model       Accuracy   F1 (people)   F1 (company)
Tri-letter  -          -             -
CNN-en      +4.14%     +2.69%        +11.39%
LSTM-en     +7.11%     +4.46%        +20.36%

We also compare the impact of the character vocabulary size for the deep models (Table 2). It is interesting to see that even with a small vocabulary (only 70 characters, consisting of lower-case English characters and special characters), the accuracy is significantly higher compared with the baseline model.

Table 2: The size of the character vocabulary for incomplete query intent models.

Model       Vocab   Accuracy   F1 (people)   F1 (company)
Tri-letter  -       -          -             -
CNN-en      70      +4.09%     +2.69%        +11.38%
CNN-en      500     +4.14%     +2.73%        +11.75%
LSTM-en     70      +7.04%     +4.46%        +20.36%
LSTM-en     500     +7.11%     +4.49%        +20.49%

Multilingual Model. In the multilingual experiments, we first collect training data using a similar procedure on the top 30 international locales, such as French, Portuguese, German, etc. The international locales are used as the language feature in our multilingual model. The most frequent 10k characters are extracted as the vocabulary.

Table 3 summarizes the performance of the two different strategies discussed in Section 4.3.1. The results suggest that both methods outperform the baseline without the language embedding injected. The language feature embedding method (LSTM-i18n-embed) slightly outperforms the concatenation method (LSTM-i18n-concat). This might be because in LSTM-i18n-embed, the LSTM is aware of the language, hence generating a more meaningful representation for the character input sequence.

Table 3: Multilingual models for character-level incomplete query intent.

Model              Accuracy   F1 (people)   F1 (company)
LSTM-i18n          -          -             -
LSTM-i18n-embed    +0.67%     +0.44%        +1.34%
LSTM-i18n-concat   +0.55%     +0.37%        +1.12%

We further compare the effectiveness of the multilingual models to the traditional per-language models. Table 4 shows the overall accuracy and the per-language accuracy. LSTM-i18n-per-lang is trained and evaluated on each individual language portion. LSTM-i18n is the multilingual model without language features added. LSTM-i18n-embed has the language feature added to the character embeddings. The overall accuracy of LSTM-i18n-embed is comparable to the per-language models. We also sampled 4 languages to show the individual performances. For French (fr) and Chinese (zh), both multilingual models improved performance compared with the single-language LSTM-i18n-per-lang model trained on data from a single language only. This shows that for certain languages, multilingual training helps the model learn from other languages and generalize better on the specific language. On the other 2 languages, both LSTM-i18n and LSTM-i18n-embed did not outperform the per-language models, which is often expected. But with the language feature embedded (LSTM-i18n-embed), we observed an increased accuracy compared to the LSTM-i18n model.

Table 4: Multilingual model vs per-language models.

Model                Overall   fr       de       zh       da
LSTM-i18n-per-lang   -         -        -        -        -
LSTM-i18n            -0.65%    +0.28%   -0.84%   +0.54%   -1.24%
LSTM-i18n-embed      +0.02%    +0.56%   +1.01%   -1.11%   -0.66%

5.2.3 Online Experiments. As shown in Table 5, the English model (LSTM-en) and the multilingual model (LSTM-i18n-embed) are tested in typeahead search production with English and non-English traffic, respectively. The online business metrics measure the success criteria within search sessions. Both models show significant performance gains in search session success. An interesting observation is that the average time to success on mobile is reduced for both models. An intuitive reason is that, given the higher cost of typing on mobile compared with desktop, mobile users are more likely to click on the high quality results rendered in typeahead search.

Table 5: Online performance comparison for incomplete query intent models vs baseline linear model.

Model             Traffic       Metrics                    Lift
LSTM-en           English       Search session success     +0.43%
                                Time to success (mobile)   -0.15%
LSTM-i18n-embed   Non-English   Search session success     +0.86%
                                Time to success (mobile)   -0.19%

5.3 Complete Query Intent Prediction

5.3.1 Experiment Setting. The input is word-level embeddings. The embeddings are initialized using GloVe [22] word vector representations pre-trained on LinkedIn text data, covering a vocabulary of 100K with dimension 64.

The production baseline model is a logistic regression model with bag-of-words features, user profile/behavioral features, etc. In the offline experiments, we first compare the performance of the deep learning models with the production baseline. For the CNN-based model, we used 128 filters of height 3 (tri-gram) for the 1-D convolution. The hidden state size is 128 for the LSTM-based model. For both CNN and LSTM, after the query embedding is generated, a layer of size 200 is used to capture the non-linear interactions between the query representation and the traditional features.

5.3.2 LiBERT Pre-training. The LiBERT-based model is fine-tuned from a BERT model pre-trained on LinkedIn data. The pre-training data include a wide variety of data sources across LinkedIn: search queries, user profiles, job posts, etc. The collected data for pre-training include around 1 billion words. A light-weight structure is used compared to BERT-Base (the smaller model published in BERT [9]). A comparison between BERT-Base and LiBERT is given in Table 6.

Table 6: Model architecture comparison between LiBERT and Google's BERT-Base.

BERT HParams      BERT-Base   LiBERT
Layers            12          3
Hidden size       768         256
Attention heads   12          8
Total #params     110M        10M
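
For reference, a LiBERT-scale configuration can be written down with the open-source transformers library; the layer, hidden-size, and attention-head values come from Table 6, while the intermediate size, the number of intent labels, and the use of BertConfig itself are our own assumptions for illustration.

from transformers import BertConfig, BertForSequenceClassification

config = BertConfig(
    num_hidden_layers=3,       # Table 6: Layers
    hidden_size=256,           # Table 6: Hidden size
    num_attention_heads=8,     # Table 6: Attention heads
    intermediate_size=1024,    # assumption: 4 x hidden size
    num_labels=8,              # assumption: number of intent classes
)
# The [CLS] representation feeds the classification head.
model = BertForSequenceClassification(config)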

5.3.3 Offline Experiments. Table 7 summarizes the performance of the different models. CNN/LSTM outperform the baseline model by +6.40%/+6.69% in accuracy, respectively.

Table 7: Offline performance comparison for the word-level models vs production baseline.

Model       Accuracy   F1 (people)   F1 (job)
LR-BOW      -          -             -
CNN-word    +6.40%     +1.36%        +17.35%
CNN-char    +3.49%     -0.50%        +10.29%
LSTM-word   +6.69%     +1.47%        +17.99%
LSTM-char   +4.63%     +0.23%        +13.46%
BERT-Base   +8.20%     +2.33%        +19.33%
LiBERT      +8.35%     +2.46%        +20.60%

We further compared the performance of the complete query models across token granularities and deep model types. First, the word-level models outperform the character-level models consistently. This implies that for complete queries, word-level representations capture more meaningful features than character sequences. Next, we compare the performance of LSTM and CNN models under different token granularity. On word-level models, LSTM and CNN have similar performance. This implies that the useful sequential information in the word sequences is limited, due to the short length of the queries and possibly the effectiveness of word-level information such as word meaning. This is further supported by the character-level models, where LSTM outperforms CNN by a larger margin, meaning that longer range information is more useful in character sequence modeling. A similar observation can be found in Table 1 on the incomplete query intent experiments, where we saw an even more significant improvement of LSTM over CNN, implying that LSTM is more suitable for modeling the incomplete character sequences in typeahead than the complete queries in SERP.

BERT models (BERT-Base and LiBERT) can further improve the accuracy over the CNN/LSTM models. This can be attributed to the contextual embeddings, which are better able to disambiguate words under different contexts.

In addition, we analyze the effect of traditional features on the deep models. As shown in Table 8, without traditional features, the deep models (CNN, LSTM, and LiBERT) bring much less performance gain compared with those that incorporate these features in a wide-and-deep fashion.

Table 8: Performance comparison of whether to incorporate traditional features.

Model    Trad Features   Accuracy   F1 (people)   F1 (job)
LR-BOW   -               -          -             -
CNN      False           +1.27%     -1.56%        +5.27%
CNN      True            +6.40%     +1.36%        +17.35%
LSTM     False           +1.26%     -1.54%        +5.15%
LSTM     True            +6.69%     +1.47%        +17.99%
LiBERT   False           +1.90%     -1.14%        -6.16%
LiBERT   True            +8.35%     +2.46%        +20.60%

5.3.4 Online Experiments. It is worth noting that even though the LSTM model performs slightly better than the CNN model, we proceeded to implement the CNN model in online production. This is due to the fact that LSTM does not bring much relevance gain while introducing almost 2 times the inference latency (Table 10).

Table 9 shows the online metrics gain of the CNN model over logistic regression. It improves two job search related metrics: the overall click-through rate of job postings at the top 1 to 5 positions in the search result page (CTR@5 for job results), and the total number of job search result viewers (Job Viewers). This is consistent with the significantly improved F1 score of the job intent (offline results in Table 7).

Later, we conducted online experiments of LiBERT over CNN (Table 9). Since LiBERT can further improve the accuracy of all intents, more online metrics are significant, including SAT clicks (the satisfactory clicks on documents with a long dwell time) and CTR@5 over all documents.

Table 9: Online performance comparison for word-level complete query intent model vs production baseline in SERP blending.

Model                   Metrics                   % lift
CNN-word                CTR@5                     neutral
                        CTR@5 for job results     +0.43%
                        SAT click                 neutral
LiBERT (vs. CNN-word)   CTR@5                     +0.17%
                        CTR@5 for job results     +0.51%
                        SAT click                 +1.36%

5.4 Scalability

In the online system, latency at the 99th percentile (P99) is the major challenge for productionization. For query intent, offline pre-computation of the frequent queries does not help, since the P99 queries are usually rare, and there are a large number of unseen queries each day at LinkedIn search. The latency comparison of the character-level models (CNN-char and LSTM-char) for incomplete queries is shown in Table 10. The P99 latency numbers are computed on an Intel(R) Xeon(R) 8-core CPU E5-2620 v4 @ 2.10GHz machine with 64 GB memory.

In terms of complete queries, the CNN/LSTM models have even lower latency, since the number of words is generally smaller than the number of characters. For BERT models, the latency increases significantly. It is worth noting that by pre-training a smaller BERT model on in-domain data, we are able to reduce the latency from 53ms (BERT-Base) to 15ms (LiBERT) without hurting the relevance performance, which meets the search blending latency requirement.

Table 10: Latency and model size comparison on different query intent models at 99 percentile.

                   Model             #Params   P99 Latency
Incomplete         Tri-letter        -         -
Query Intent       CNN-en            123k      +0.38ms
                   LSTM-en           379k      +2.49ms
                   LSTM-i18n-embed   1M        +2.72ms
Complete           LR-BOW            -         -
Query Intent       CNN-word          6.5M      +0.45ms
                   LSTM-word         6.5M      +0.96ms
                   BERT-Base [9]     110M      +53.2ms
                   LiBERT            10M       +15.0ms

6 LESSONS LEARNED

We would like to share our experiences and lessons learned through our journey of leveraging state-of-the-art technologies and deploying the deep models for query intent prediction to online production systems. We hope they can benefit others in the industry who are facing similar tasks.

Online Effectiveness/Efficiency Trade-off. Our experiments with BERT-based models show that pre-training with in-domain data significantly improved fine-tuning performance compared to the off-the-shelf models. Aside from the relevance gains that BERT brings, latency is a big issue for productionizing these large models. We found that reducing the pre-trained model sizes significantly reduces the inference latency while still providing accurate predictions.

For the incomplete query intent, we cannot deploy complicated models due to the latency constraint in typeahead search. In order to maximize the relevance performance within a strict latency constraint, we investigated the characteristics of LSTM and CNN models w.r.t. effectiveness and efficiency. LSTMs give superior relevance performance due to their ability to capture long range information in sequences, especially when the sequence is incomplete, whereas CNNs are much faster in inference speed by design. We were able to deploy compact LSTM models to the typeahead product, achieving optimal prediction performance with inference latency tolerable in production.

Token Granularity. In the incomplete query intent models for the typeahead search product, the choice of character-level models is effective in modeling the character sequences in the incomplete queries. It is worth mentioning that these models require a smaller vocabulary size compared to word or subword level models, which results in much more compact models suitable for online serving. The character-level granularity also allows for combining vocabularies from different languages, since many languages share similar characters. The design of multilingual models could further benefit online systems with robust predictions across multiple markets yet easier maintenance than per-language models.

Combining Traditional Features. Traditional features are informative for promoting deep query intent understanding. Our experiments show that simply discarding the handcrafted features and user personalized features and replacing them with deep learning models hurts the relevance performance. We find that incorporating the traditional features in a wide-and-deep fashion is crucial for successful intent prediction.

7 RELATED WORK

Query intent prediction has been an important topic in modern search engines [12, 16, 20, 25]. We first introduce traditional methods, then show how deep learning models are applied to this problem. Finally, we discuss the existing works regarding incomplete query intent modeling.

7.1 Traditional Methods

Early query intent works use rule based methods [13, 24], as a high precision, low recall strategy. However, such rules are hard to maintain as they become more complicated, and the recall is low.

More recently, statistical models have shown significant improvements in unsupervised (query clustering) [1, 8, 23] or supervised (query classification) approaches. In supervised methods, common features include unigrams, language model scores, lexicon matching features, etc. [2, 3, 5].

7.2 Deep Learning for Query Intent Classification

Deep learning approaches have shown significant improvement in text classification tasks, where multiple CNN and LSTM based methods [15, 17, 19, 21, 28] have been proposed. The closest related work is a CNN-based approach [10] for classifying types of web search queries, with a similar network architecture as in [17]. It is worth noting that although the most recent technology [9] has shown promising improvement for text understanding, including intent classification [6], the performance on short text such as search queries is not clear.

To our best knowledge, we have yet to see BERT models applied to real-world search systems for query intent prediction.

7.3 Incomplete Query Intent

There have been character-level models for classic NLP tasks such as text classification and language modeling [14, 18, 27]. However, we have yet to find previous work on character-level models for incomplete query intent prediction in production search systems. We further extended our approach to a multilingual model where one model can serve many languages in international markets.

One related topic to typeahead search result blending is query auto completion [4]. However, the two topics are fundamentally different in terms of system architecture. The former has complicated indexing systems to retrieve documents, and a ranking system is built to blend the documents, where query intent is a feature. The latter's objective is only to generate a complete query, without any document side information.

8 CONCLUSION

This paper proposes a comprehensive framework for modeling the query intent in search systems for different product components. The proposed deep learning based models are proven to be effective and efficient for online search applications. Discussions about the challenges of deploying these models to production, as well as our insights in making these decisions, are provided. We hope the framework as well as the experiences during our journey can be useful for readers designing real-world query understanding and text classification systems.

REFERENCES

[1] Luca Maria Aiello, Debora Donato, Umut Ozertem, and Filippo Menczer. 2011. Behavior-driven clustering of queries into topics. In CIKM. 1373–1382.
[2] Jaime Arguello, Fernando Diaz, Jamie Callan, and Jean-Francois Crespo. 2009. Sources of evidence for vertical selection. In SIGIR. ACM, 315–322.
[3] Azin Ashkan, Charles LA Clarke, Eugene Agichtein, and Qi Guo. 2009. Classifying and characterizing query intent. In ECIR. Springer, 578–586.
[4] Fei Cai and Maarten De Rijke. 2016. A survey of query auto completion in information retrieval. Foundations and Trends in Information Retrieval (2016).
[5] Huanhuan Cao, Derek Hao Hu, Dou Shen, Daxin Jiang, Jian-Tao Sun, Enhong Chen, and Qiang Yang. 2009. Context-aware query classification. In SIGIR. ACM, 3–10.
[6] Qian Chen, Zhu Zhuo, and Wen Wang. 2019. BERT for joint intent classification and slot filling. arXiv preprint arXiv:1902.10909 (2019).
[7] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. 2016. Wide & deep learning for recommender systems. In DLRS. 7–10.
[8] Jackie Chi Kit Cheung and Xiao Li. 2012. Sequence clustering and labeling for unsupervised query intent discovery. In WSDM. 383–392.
[9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[10] Homa B Hashemi, Amir Asiaee, and Reiner Kraft. 2016. Query intent detection using convolutional neural networks. In WSDM.
[11] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[12] Jian Hu, Gang Wang, Fred Lochovsky, Jian-tao Sun, and Zheng Chen. 2009. Understanding user's query intent with Wikipedia. In WWW. ACM, 471–480.
[13] Bernard J Jansen, Danielle L Booth, and Amanda Spink. 2008. Determining the informational, navigational, and transactional intent of Web queries. Information Processing & Management 44, 3 (2008), 1251–1266.
[14] Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759 (2016).
[15] Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. In ACL. 655–665.
[16] In-Ho Kang and GilChang Kim. 2003. Query type classification for web document retrieval. In SIGIR. ACM, 64–71.
[17] Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882 (2014).
[18] Yoon Kim, Yacine Jernite, David Sontag, and Alexander M Rush. 2016. Character-aware neural language models. In AAAI.
[19] Ji Young Lee and Franck Dernoncourt. 2016. Sequential short-text classification with recurrent and convolutional neural networks. In NAACL-HLT. ACL, San Diego, California, 515–520. https://doi.org/10.18653/v1/N16-1062
[20] Xiao Li, Ye-Yi Wang, and Alex Acero. 2008. Learning query intent from regularized click graphs. In SIGIR. ACM, 339–346.
[21] Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. 2016. Recurrent neural network for text classification with multi-task learning. In IJCAI.
[22] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP. 1532–1543. http://www.aclweb.org/anthology/D14-1162
[23] Xiang Ren, Yujing Wang, Xiao Yu, Jun Yan, Zheng Chen, and Jiawei Han. 2014. Heterogeneous graph-based intent learning with queries, web pages and Wikipedia concepts. In WSDM. 23–32.
[24] Daniel E Rose and Danny Levinson. 2004. Understanding user goals in web search. In WWW. 13–19.
[25] Rodrygo LT Santos, Craig Macdonald, and Iadh Ounis. 2011. Intent-aware search result diversification. In SIGIR. ACM, 595–604.
[26] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS. 5998–6008.
[27] Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In NIPS. 649–657.
[28] Chunting Zhou, Chonglin Sun, Zhiyuan Liu, and Francis Lau. 2015. A C-LSTM neural network for text classification. arXiv preprint arXiv:1511.08630 (2015).
[29] Peng Zhou, Zhenyu Qi, Suncong Zheng, Jiaming Xu, Hongyun Bao, and Bo Xu. 2016. Text classification improved by integrating bidirectional LSTM with two-dimensional max pooling. arXiv preprint arXiv:1611.06639 (2016).
