2020 - Deep Search Query Intent Understanding
provide a comprehensive learning framework for modeling query intent under different stages of a search. We focus on the design for 1) predicting users' intents as they type queries on-the-fly in typeahead search, using character-level models; and 2) accurate word-level intent prediction models for complete queries. Various deep learning components for query text understanding are experimented with. Offline evaluation and online A/B test experiments show that the proposed methods are effective in understanding query intent and efficient to scale for online search systems.

KEYWORDS
Query Intent, Query Classification, Natural Language Processing, Deep Learning

1 INTRODUCTION
Modern search engines provide search services specialized across various domains (e.g., news, books, and travel). Users come to a search engine to look for information with different possible intents: choosing favorite restaurants, checking opening hours, or finding restaurant addresses on Yelp; searching for people, finding job opportunities, or looking for company information on LinkedIn, etc. Understanding the intent of a searcher is crucial to the success of search systems. Queries contain rich textual information provided explicitly by the searcher, and are hence a strong indicator of the searcher's intent. Understanding the underlying searcher intent from a query is referred to as the task of query intent modeling.

Query intent is an important component in the search engine ecosystem [12, 16, 20]. As shown in Figure 1, when the user starts typing a query, the intent is predicted based on the incomplete character sequence; when the user finishes typing the whole query, a more accurate intent is predicted based on the completed query. Understanding the user intent accurately allows the search engine to trigger the corresponding vertical searches, as well as to better rank the retrieved documents based on the intent [2], so that users do not have to refine their searches by explicitly navigating through the different facets in the search engine.

Traditional methods rely on bag-of-words representations and rule-based features to perform intent classification [2, 3]. Recently, deep learning based models [10] have shown significant improvement, as they can handle similar words and word sense disambiguation well. However, developing deep learning based query intent models for production requires addressing several challenges. Firstly, production models have very strict latency requirements, and the whole process needs to be finished within tens of milliseconds. Secondly, ...

... were identified where query intent is useful: incomplete query intent for typeahead blending, and complete query intent for SERP (search engine result page) blending.

The common part of both systems is using query intent to assist the ranking of retrieved documents of different types. Meanwhile, the two products have their own unique challenges. Typeahead blending has strong latency requirements; the input is an incomplete query (a sequence of characters); and it is acceptable to return a fuzzy prediction, since users will continue to type the whole query if they do not find the results relevant. On the other hand, SERP blending has a looser latency constraint than typeahead, but a higher accuracy requirement, as it directly affects the search result page.

Based on the characteristics of the production applications, we propose different solutions. For typeahead blending, a character-level query representation is used, as the resulting models are compact in terms of the number of parameters. Meanwhile, it handles multilinguality well due to the small vocabulary size. For SERP blending, the complete query intent model is word level. Since the accuracy bar is high, BERT is explored to extract query representations, which leads to a more robust model.

This paper is motivated by tackling the challenges in query intent prediction while satisfying production needs, in order to achieve real-world impact in search systems. The major contributions are:

• We developed a practical deep query intent understanding framework that can adapt to various product scenarios in industry. It allows for fast and compact models suitable for online systems, with the ability to incorporate traditional features that enable more accurate predictions.
• We developed and deployed character-level deep models to production that are scalable for incomplete query intent prediction. In addition, we propose a multilingual model that incorporates language features, which is accurate and easier to maintain than traditional per-language models.
• We developed and deployed a BERT-based model to production for complete query intent prediction. To the best of our knowledge, this is the first reported BERT model for query intent understanding in real-world search engines.
• We conducted comprehensive offline experiments and online A/B tests on various neural network structures for deep query intent understanding, as well as an in-depth investigation of token granularity (word-level, character-level), DNN components (CNN, LSTM, BERT), and multilingual models, with practical lessons summarized.
CIKM, 2020, Galway, Ireland Xiaowei Liu, Weiwei Guo, Huiji Gao, Bo Long
Figure 1: Query intent in search engines for incomplete queries (typeahead blending) and complete queries (SERP blending).
2 QUERY INTENT UNDERSTANDING AT LINKEDIN
LinkedIn search hosts many different types of documents, e.g., user profiles, job posts, user feeds, etc. When a user issues a query without specifying the document type they are interested in, identifying the intent is crucial to retrieving relevant documents and providing a high-quality user experience. At LinkedIn, we define the query intent as the document type.

Query intent is important for result blending [12]: (1) when an intent is not present in the query, the corresponding vertical search may not be triggered; (2) for the documents retrieved from the triggered vertical searches, a result blending algorithm ranks the documents based on the detected intents and other features. In this section, we present two production applications where query intent is an important feature for the blending algorithm, followed by how query intent is used in the blending algorithm.

2.1 Query Intent in Typeahead Blending
When a user starts typing, the query intent is detected and used in typeahead blending. At LinkedIn, the typeahead product directly displays document snapshots from multiple vertical searches, which is different from traditional query auto-completion that only generates query strings. The left three snapshots in Figure 1 show an example of typeahead blending results at LinkedIn. The example assumes the user is searching for the company "LinkedIn". Blended results are rendered as soon as the user types one letter, "l". Next, given the query prefix "li", the intent prediction tends towards a people result type, and many people profiles are ranked higher than company or group results. After the user types "link" in the third picture, the company result LinkedIn is ranked first.

Query intent for typeahead blending is challenging given that the queries are often incomplete and contain only a few letters. In addition, for every keystroke, the system needs to retrieve documents from different vertical searches and blend the results. This means the query intent models will be called frequently, and each run should finish within a short time.

2.2 Query Intent in SERP Blending
SERP blending is a more common component in search engines than typeahead blending. When a user finishes typing and hits "search", a complete query is issued; the query intent is identified and used for retrieved document blending. The rightmost block in Figure 1 shows SERP blending results for the complete query "linkedin", including company pages, people profiles, job posts, etc.

Compared to typeahead blending, query intent in SERP blending has a larger latency buffer. Queries contain complete words; however, the task still suffers from limited context: only a few words in a query are available for intent prediction.

2.3 Retrieved Document Blending Systems
Both typeahead blending and SERP blending systems follow a similar design. Multiple features are generated for blending/ranking the retrieved documents: (1) the probability of each intent, based on query text and user behaviors (the query intent model output); (2) matching features between the query and retrieved documents; (3) personalized and contextualized features.

In the rest of this paper, we focus on how to generate the query intent probability for typeahead blending and SERP blending.
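The three feature groups enumerated above end up as one feature vector handed to the blending/ranking model. A minimal sketch of that assembly, with hypothetical names and toy dimensions (not from the paper):

```python
import numpy as np

def blending_features(intent_probs, matching_feats, personal_feats):
    # Feature vector consumed by the blending/ranking model:
    # (1) intent probabilities (query intent model output),
    # (2) query-document matching features,
    # (3) personalized/contextualized features.
    return np.concatenate([intent_probs, matching_feats, personal_feats])

# toy example: 3 intents (people, job, company), 2 matching, 2 personal features
feats = blending_features(
    np.array([0.7, 0.2, 0.1]),  # P(intent | query, user)
    np.array([0.9, 0.3]),       # e.g. text-match scores
    np.array([1.0, 0.0]),       # e.g. searcher-profile signals
)
print(feats.shape)  # (7,)
```

The blending model then scores each retrieved document with such a vector; the paper does not specify the ranker itself.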
3 PROBLEM DEFINITION
Both incomplete query intent and complete query intent are essentially classification tasks. Without loss of generality, given a user id u ∈ U and a query string q ∈ Q, the goal is to learn a function γ to predict the predefined intent i in the finite set of intent classes I = {i_1, i_2, ..., i_n}:

    γ : U × Q → I

For the two tasks, incomplete/complete query intent, the intent class sets are slightly different due to the design of the products. As shown in the next section, deep learning models are applied to the query strings; the user ids are used to generate personalized features.

[Figure (architecture diagram): deep modules (CNN/LSTM/BERT) over the query text, concatenated with traditional features, followed by a dense layer and an output layer.]

... are the contexts of the tri-letters, as well as the position of the tri-letters. BERT is not applied in this task, because the typeahead product has a strict latency constraint; the latency would not meet the production constraint, as BERT is too computationally heavy.

Note that we have not explored a multilingual approach for the word-level complete query intent models. This is because the word-level vocabulary could be fairly large if the major languages were included, adding more complexity to the models.
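The classification formulation γ can be sketched minimally as a softmax classifier over a fixed-size query representation. The representation, weights, and intent names below are illustrative placeholders (in the paper the representation comes from a deep module and the user id contributes personalized features):

```python
import numpy as np

INTENTS = ["people", "job", "company"]  # illustrative intent classes I

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def gamma(query_repr, W, b):
    # gamma maps a query representation to a distribution over intent
    # classes and returns the argmax class; in production the
    # representation would come from a CNN/LSTM/BERT encoder.
    probs = softmax(W @ query_repr + b)
    return INTENTS[int(np.argmax(probs))], probs

rng = np.random.default_rng(0)
W, b = rng.normal(size=(3, 8)), np.zeros(3)
intent, probs = gamma(rng.normal(size=8), W, b)
print(intent)  # one of INTENTS; probs sums to 1
```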
[Figure 3 (diagram): character embeddings, each concatenated with a language embedding (e.g., "L [En] i [En] n [En]" for the prefix "Lin"), are fed to forward and backward LSTMs, followed by a softmax output.]
Figure 3: The language feature embedding and concatenation scheme for multilingual character-level query intent model.
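The concatenation scheme of Figure 3 (the LSTM-i18n-embed variant) can be sketched as follows; the embedding sizes and values are toy assumptions, and the bi-LSTM itself is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
CHAR_DIM, LANG_DIM = 4, 2  # toy embedding sizes
char_emb = {c: rng.normal(size=CHAR_DIM) for c in "lin"}
lang_emb = {"En": rng.normal(size=LANG_DIM)}  # one embedding per language

def embed_prefix(prefix, lang):
    # Concatenate the language embedding to every character embedding;
    # the resulting sequence would feed the forward/backward LSTMs.
    return np.stack([np.concatenate([char_emb[c], lang_emb[lang]])
                     for c in prefix])

seq = embed_prefix("lin", "En")
print(seq.shape)  # (3, 6): 3 characters, each CHAR_DIM + LANG_DIM wide
```

Because the language signal rides along with every character, a single model can serve queries from multiple markets instead of maintaining one model per language.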
Table 1: Offline performance comparison for different character-level models vs production baseline.

Table 3: Multilingual models for character-level incomplete query intent.

Table 4: Multilingual model vs per-language models.

Model              | Overall | fr     | de     | zh     | da
LSTM-i18n-per-lang | -       | -      | -      | -      | -
LSTM-i18n          | −0.65%  | +0.28% | −0.84% | +0.54% | −1.24%
LSTM-i18n-embed    | +0.02%  | +0.56% | +1.01% | −1.11% | −0.66%

Table 5: Online performance comparison for incomplete query intent models vs baseline linear model.

Model | Traffic | Metrics | Lift

Table 7: Offline performance comparison for the word-level models vs production baseline.

Model     | Accuracy | F1 (people) | F1 (job)
LR-BOW    | -        | -           | -
CNN-word  | +6.40%   | +1.36%      | +17.35%
CNN-char  | +3.49%   | −0.50%      | +10.29%
LSTM-word | +6.69%   | +1.47%      | +17.99%
LSTM-char | +4.63%   | +0.23%      | +13.46%
BERT-Base | +8.20%   | +2.33%      | +19.33%
LiBERT    | +8.35%   | +2.46%      | +20.60%

Table 8: Performance comparison of whether to incorporate traditional features.

Model  | Trad Features | Accuracy | F1 (people) | F1 (job)
LR-BOW | -             | -        | -           | -
CNN    | False         | +1.27%   | −1.56%      | +5.27%
CNN    | True          | +6.40%   | +1.36%      | +17.35%
LSTM   | False         | +1.26%   | −1.54%      | +5.15%
LSTM   | True          | +6.69%   | +1.47%      | +17.99%
LiBERT | False         | +1.90%   | −1.14%      | −6.16%
LiBERT | True          | +8.35%   | +2.46%      | +20.60%

Table 10: Latency and model size comparison of different query intent models at the 99th percentile.

Task                    | Model           | #Params | P99 Latency
Incomplete Query Intent | Tri-letter      | -       | -
Incomplete Query Intent | CNN-en          | 123k    | +0.38ms
Incomplete Query Intent | LSTM-en         | 379k    | +2.49ms
Incomplete Query Intent | LSTM-i18n-embed | 1M      | +2.72ms
Complete Query Intent   | LR-BOW          | -       | -
Complete Query Intent   | CNN-word        | 6.5M    | +0.45ms
Complete Query Intent   | LSTM-word       | 6.5M    | +0.96ms
Complete Query Intent   | BERT-Base [9]   | 110M    | +53.2ms
Complete Query Intent   | LiBERT          | 10M     | +15.0ms
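As a side note on the P99 metric used in Table 10: a latency percentile is computed over recorded per-query inference timings. A minimal sketch with simulated latencies (not the paper's measurements):

```python
import numpy as np

# simulated per-query inference latencies in milliseconds
latencies_ms = np.random.default_rng(0).lognormal(mean=0.0, sigma=0.8,
                                                  size=10_000)

p50 = np.percentile(latencies_ms, 50)
p99 = np.percentile(latencies_ms, 99)
print(f"P50={p50:.2f}ms  P99={p99:.2f}ms")  # tail latency >> median
```

The long-tailed distribution is the point: median latency can look healthy while the 99th percentile, which drives the production constraint, is several times higher.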
Table 9: Online performance comparison for word-level complete query intent models vs production baseline in SERP blending.

Model                 | Metrics               | % lift
CNN-word              | CTR@5                 | neutral
CNN-word              | CTR@5 for job results | +0.43%
CNN-word              | SAT click             | neutral
LiBERT (vs. CNN-word) | CTR@5                 | +0.17%
LiBERT (vs. CNN-word) | CTR@5 for job results | +0.51%
LiBERT (vs. CNN-word) | SAT click             | +1.36%

Table 9 shows the online metrics gains of the CNN model over logistic regression. It improves two job search related metrics: the overall click-through rate of job postings at the top 1 to 5 positions of the search result page (CTR@5 for job results), and the total number of job search result viewers (Job Viewers). This is consistent with the significantly improved F1 score on job intent (offline results in Table 7).

Later, we conducted online experiments of LiBERT over CNN (Table 9). Since LiBERT further improves the accuracy of all intents, more online metrics are significant, including SAT clicks (satisfactory clicks on documents with a long dwell time) and CTR@5 over all documents.

5.4 Scalability
In the online system, latency at the 99th percentile (P99) is the major challenge for productionization. For query intent, offline pre-computation of the frequent queries does not help, since the P99 queries are usually rare, and there is a large number of unseen queries each day at LinkedIn search. The latency comparison of the character-level models (CNN-char and LSTM-char) for incomplete queries is shown in Table 10. The P99 latency numbers are computed on an Intel(R) Xeon(R) 8-core E5-2620 v4 CPU @ 2.10GHz with 64 GB of memory.

For complete queries, the CNN/LSTM models have even lower latency, since the number of words is generally smaller than the number of characters. For BERT models, the latency increases significantly. It is worth noting that by pre-training a smaller BERT model on in-domain data, we are able to reduce the latency from 53ms (BERT-Base) to 15ms (LiBERT) without hurting the relevance performance, which meets the search blending latency requirement.

6 LESSONS LEARNED
We would like to share our experiences and lessons learned through our journey of leveraging state-of-the-art technologies and deploying deep models for query intent prediction to online production systems. We hope they can benefit others in the industry who are facing similar tasks.

Online Effectiveness/Efficiency Trade-off. Our experiments with BERT-based models show that pre-training with in-domain data significantly improved fine-tuning performance compared to off-the-shelf models. Aside from the relevance gains that BERT brings, latency is a big issue for productionizing these large models. We found that reducing the pre-trained model sizes significantly reduces the inference latency while still providing accurate predictions.

For the incomplete query intent, we cannot deploy complicated models due to the latency constraint in typeahead search. In order to maximize relevance performance within a strict latency constraint, we investigated the characteristics of LSTM and CNN models w.r.t. effectiveness and efficiency. LSTMs give superior relevance performance due to their ability to capture long-range information in sequences, especially when the sequence is incomplete, whereas CNNs are much faster at inference by design. We were able to deploy compact LSTM models to the typeahead product while achieving optimal prediction performance with inference latency tolerable in production.

Token Granularity. In the incomplete query intent models for the typeahead search product, the choice of character-level models is effective in modeling the character sequences of incomplete queries. It is worth mentioning that these models require a smaller vocabulary size compared to word- or subword-level models, which results in much more compact models suitable for online serving. The character-level granularity also allows for combining vocabularies from different languages, since many languages share similar characters. The design of multilingual models can further benefit online systems with robust predictions across multiple markets and easier maintenance than per-language models.

Combining Traditional Features. Traditional features are informative for promoting deep query intent understanding. Our experiments show that simply discarding the handcrafted features and user personalized features and replacing them with deep learning models hurts relevance performance. We find that incorporating the traditional features in a wide-and-deep fashion is crucial to successful intent prediction.
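A minimal sketch of the wide-and-deep combination described above: the "wide" handcrafted/personalized features are concatenated with the deep query representation before the final dense + softmax layers. Shapes, names, and random weights are illustrative assumptions, not the production model:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def wide_and_deep_intent(deep_repr, trad_feats, W, b):
    # Concatenate the deep encoding with traditional ("wide") features,
    # then apply a single dense + softmax layer over intent classes.
    x = np.concatenate([deep_repr, trad_feats])
    return softmax(W @ x + b)

rng = np.random.default_rng(0)
deep_repr = rng.normal(size=16)    # e.g. a CNN/LSTM/BERT query encoding
trad = np.array([0.3, 1.0, 0.0])   # e.g. rule-based / personalized features
W, b = rng.normal(size=(3, 19)), np.zeros(3)
probs = wide_and_deep_intent(deep_repr, trad, W, b)
print(probs.shape)  # (3,): a distribution over three intent classes
```

The design keeps the traditional features in the model rather than replacing them, matching the observation above that dropping them hurts relevance.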