
An End-to-End Solution for Named Entity Recognition in eCommerce Search

Xiang Cheng, Mitchell Bowden, Bhushan Ramesh Bhange, Priyanka Goyal, Thomas Packer, Faizan Javed
The Home Depot
2455 Paces Ferry Rd NW, Atlanta, GA 30339
{xiang_cheng, mitchell_s_bowden, bhushan_ramesh_bhange, faizan_javed}@homedepot.com

arXiv:2012.07553v1 [cs.CL] 11 Dec 2020

Abstract

Named entity recognition (NER) is a critical step in modern search query understanding. In the domain of eCommerce, identifying the key entities, such as brand and product type, can help a search engine retrieve relevant products and therefore offer an engaging shopping experience. Recent research shows promising results on shared benchmark NER tasks using deep learning methods, but there are still unique challenges in the industry regarding domain knowledge, training data, and model production. This paper demonstrates an end-to-end solution to address these challenges. The core of our solution is a novel model training framework "TripleLearn" which iteratively learns from three separate training datasets, instead of one training set as is traditionally done. Using this approach, the best model lifts the F1 score from 69.5 to 93.3 on the holdout test data. In our offline experiments, TripleLearn improved the model performance compared to traditional training approaches which use a single set of training data. Moreover, in the online A/B test, we see significant improvements in user engagement and revenue conversion. The model has been live on homedepot.com for more than 9 months, boosting search conversions and revenue. Beyond our application, this TripleLearn framework, as well as the end-to-end process, is model-independent and problem-independent, so it can be generalized to more industrial applications, especially to the eCommerce industry, which has similar data foundations and problems.

1 Introduction

The search engine at homedepot.com processes billions of search queries, serves tens of millions of customers, and generates tens of billions of dollars in revenue every year for The Home Depot (THD). One of the most fundamental challenges in our search engine is to understand a search query and extract entities, which is critical to retrieving the most relevant products. This task can be framed as named entity recognition (NER), a common information retrieval task to locate, segment, and categorize a pre-defined set of entities from unstructured text. Since its introduction in the early 1990s, NER has been studied extensively and is evolving rapidly (Nadeau and Sekine 2007; Yadav and Bethard 2019), especially after the adoption of deep learning and related techniques in recent years (Yadav and Bethard 2019).

However, there is a gap between academic research and industrial applications of NER. Recent research works often use the latest neural architectures (Yadav and Bethard 2019) and language models (Pennington, Socher, and Manning 2014; Devlin et al. 2018; Peters et al. 2018) to improve performance on popular NER shared tasks and datasets (Sang and De Meulder 2003). The focused outcome is often a marginal improvement of F1 scores, while the feasibility in real-world applications is not often considered.

1.1 Recent Research in NER

In recent years, deep-learning-based NER together with embeddings has become increasingly popular in the research community. Collobert et al. proposed the first neural network architecture for NER (Collobert and Weston 2008; Li et al. 2020) and later experimented with pre-trained word embeddings (Collobert et al. 2011) as model features. Since then, various neural architectures and word representations have been studied (Yadav and Bethard 2019). The most popular approach is to use recurrent neural networks (RNN) over word, sub-word, and/or character embeddings (Yadav and Bethard 2019; Lee 2017). Long Short-Term Memory (LSTM) and its variants (e.g. the Gated Recurrent Unit, GRU) are the most common neural architectures for NER as well as for sequence tagging in general (Huang, Xu, and Yu 2015; Ma and Hovy 2016; Lample et al. 2016; Lee 2017). Recently, transformer-based language models (Vaswani et al. 2017; Devlin et al. 2018) have been tested on benchmark NER tasks and claim state-of-the-art performance (Liu et al. 2019; Li et al. 2020).

It is exciting to see the booming research progress based on modern deep learning methods and the latest language models. However, industrial applications still seem to be left behind due to the unique challenges described below.

1.2 Industry Applications and Challenges

Compared to the large amount of research on deep-learning-based NER, there are relatively few publications that explore its applications in industrial settings, and fewer still that address the task of performing NER as a step in eCommerce search query understanding.
Query                      Legacy NER Entities                        True Entities
fridge no ice maker        [fridge no]:O [ice maker]:PRD              [fridge]:PRD [no ice maker]:O
weed eater light weight    [weed eater]:BRD [light]:PRD [weight]:O    [weed eater]:PRD [light weight]:O
cosco table and chair set  [cosco]:BRD [table]:PRD [and chair set]:O  [cosco]:BRD [table and chair set]:PRD

Table 1: Examples where the legacy NER mislabeled entities. The first example has two product types (i.e. "ice maker" and "fridge"). "weed eater" is ambiguous because it could be either a brand or a product type. For the third one, "table and chair set" is not in the existing product taxonomy.

Guo et al. (Guo et al. 2009) apply a probabilistic topic model, Weakly Supervised Latent Dirichlet Allocation, to identify four types of entities from commercial web search queries containing single named entities. It is not clear if this approach was fully evaluated in their online setting.

Putthividhya and Hu (Putthividhya and Hu 2011) use statistical sequence models to recognize entities (product brands and designers). They use character n-gram similarity scores to resolve entities to canonical forms. The evaluation appears to be offline only, using around 2K clothing product listings.

Cowan et al. (Cowan et al. 2015) use a conditional random field (CRF) to recognize three types of entities in travel search queries. 3.5K manually labeled queries are used in the evaluation. The NER system was used in production, though the evaluation appears to be offline only.

More (More 2016) uses CRFs and a voted-perceptron-based Markov model, both forms of supervised learning. They utilize a large set of unlabeled data by using regular expressions to label training data for distant supervision. They also evaluate the value of brand name recognition in production.

Wu et al. (Wu et al. 2017) identify product attributes intended by user queries by treating this as a multi-label text categorization problem. They train character- and word-level bidirectional LSTMs (BiLSTM) jointly with a product-attribute auto-encoder, using implicit user feedback and no hand-labeled training data.

Majumder et al. (Majumder et al. 2018) evaluate multiple deep recurrent networks by extracting brand names from product titles. They mention other named entity types they may have also evaluated in production, but it is not clear whether these results were cherry-picked from a larger experiment.

Wen et al. (Wen et al. 2019) describe a production NER system in eCommerce for extracting entities from search queries, using a process that improves the accuracy of extracted information over its distant-supervision training data, which comes from a legacy NER system based on logistic regression. They do not specify which entities they evaluated nor what their performance was in absolute terms. They mention that their approach is efficient in terms of human labeling cost. However, it is dependent on distant supervision from a legacy NER system that "is built upon a lot of domain expertise-aided feature engineering". Therefore, it is impossible to determine how much labeling or engineering cost the current system relies on, as their description is not truly end-to-end.

Among these papers, three (Wu et al. 2017; Majumder et al. 2018; Wen et al. 2019) explore deep learning for NER in eCommerce search. Two (Wu et al. 2017; Wen et al. 2019) perform an evaluation on queries in production. Only four (Putthividhya and Hu 2011; More 2016; Wu et al. 2017; Wen et al. 2019) leverage the large number of unlabeled queries available to commercial search engines.

Despite the above progress, we believe the following challenges remain.

1. Industrial applications require custom, domain-specific knowledge and training data, covering the full extent of entities of different types, including representative examples of noisy queries (spelling errors, abbreviations, etc.) and noisy intent signals from conversion (purchase) events.

2. Machine learning models, especially deep learning models, usually require a large amount of high quality training data, which is often time-consuming and sometimes impossible to acquire.

3. Productionization is challenging as deep learning models are computationally expensive to train and execute in a real-time application such as eCommerce search.

To address these challenges, this paper demonstrates an end-to-end solution (see Sec. 2), including preparing training data in a scalable manner, iteratively training the model using the TripleLearn framework, and deploying the model in production for real-time usage in our eCommerce search engine.
1.3 NER at The Home Depot

THD is a leading home improvement retailer, and this domain is rich in entity types and relationships. For example, THD has more than 14,000 brands, 11,000 product types, and 3 million items. In our search engine, NER is a vital but challenging step to extract the key entities.

The legacy NER system in production used pre-defined taxonomies of brands and product types, recognized using a sequential greedy exact match. Beginning with brands and then product types, the longest matching token sequence was labeled as the corresponding entity. However, this approach was far from satisfactory, and common challenges included queries containing multiple product types but only one shopping product type, ambiguity between product types and brands, and new product types not in the pre-defined taxonomy. Table 1 has examples to demonstrate these challenges. These mislabels were often closely related to the context in the query and to specific entity values (brands or product types), so they were hard to fix at scale in the legacy NER system. Deep learning models could be good candidates to handle this kind of complexity (Wu et al. 2017; Majumder et al. 2018; Wen et al. 2019), and this is the type of model we will use.

The rest of this paper is organized as follows. Sec. 2 defines the problem and describes the whole process, including the TripleLearn framework, in detail. Sec. 3 covers offline experiments and results, while Sec. 4 covers online deployment and the measured impact to business. In Sec. 5, we discuss the TripleLearn training framework and future work.

2 Technical Design Overview

In this section, we first define the problem, then elaborate the end-to-end process including the TripleLearn framework, and describe the selected model architecture in the last subsection.

2.1 Problem Definition

Given a user search query as a sequence of word tokens, the primary task of NER is to identify the important entities. This task is formulated as a sequence tagging problem using the beginning-inside-outside (BIO) tagging format. In this paper, we focus on the two most important entities for eCommerce: brand (BRD) and product type (PRD). Therefore, the two entities are translated into five labels, as shown in Table 2. The general process to label the entities in a search query is shown in Table 3; a sketch of the decoding step follows the tables. The key step is to build a machine learning model for sequence tagging in Step 3. The model evaluation metric is the exact-match F1 score (Sang and De Meulder 2003), where only a correct prediction of the whole entity is considered a true positive. The baseline is the legacy NER system in production as described in Sec. 1.3.

Label   Description
B-BRD   beginning of a brand
I-BRD   inside of a brand
B-PRD   beginning of a product type
I-PRD   inside of a product type
O       outside

Table 2: The five labels in this NER problem.

Step                     Example
1. search query          LG washer mini
2. query preprocessing   ["lg", "washer", "mini"]
3. NER                   ["B-BRD", "B-PRD", "O"]
4. output                {"brand": "lg", "product": "washer"}

Table 3: General steps to label entities in a search query.
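To make Steps 3 and 4 of Table 3 concrete, here is a minimal decoding sketch (our own illustration in Python; the function name is not from the paper) that groups BIO tags into the final entity output:

```python
def decode_bio(tokens, tags):
    """Group BIO tags into entity spans, e.g. Table 3's Step 3 -> Step 4."""
    entities, span, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):            # a new entity begins
            if span:
                entities.append((label, " ".join(span)))
            span, label = [token], tag[2:]
        elif tag.startswith("I-") and label == tag[2:]:
            span.append(token)              # the current entity continues
        else:                               # "O" or an invalid transition
            if span:
                entities.append((label, " ".join(span)))
            span, label = [], None
    if span:
        entities.append((label, " ".join(span)))
    return entities

# Example from Table 3:
print(decode_bio(["lg", "washer", "mini"], ["B-BRD", "B-PRD", "O"]))
# [('BRD', 'lg'), ('PRD', 'washer')]
```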
2.2 End-to-End Process with TripleLearn

Our end-to-end process from training data creation to model deployment is shown in Fig. 1. The core is the TripleLearn framework in Phase II. Each phase is described as follows.

Phase I: Training Data Preparation. An ideal set of training data for deep learning models should have three characteristics:

1. large volume, to train a large number of model parameters;

2. high quality labels, to provide correct supervision;

3. high coverage of label values (i.e. all brands and product types here), to potentially recognize all of them.

However, it is often too time-consuming to prepare one set of such training data. We find that it is more realistic to prepare three separate datasets to meet these requirements. These three sets of training data are prepared as described below.

The starting point is two foundational datasets in an eCommerce business: the product catalog (node I.1 in Fig. 1) and customer behavior data (node I.2). Product catalog data is the ground truth for the key entities of all products sold on homedepot.com, which have more than 14K unique brands and 11K unique product types. Customer behavior data stores customers' shopping journeys, including searches, impressions, clicks, and purchases.

Firstly, a large amount of training data (I.4) is automatically generated using rule-based algorithms (I.3) by matching the tokens in the product catalog (I.1) and customer behavior data (I.2), as sketched below. This dataset is large (2.7M queries) but noisy, due to noisy customer behavior data and imperfect matching algorithms; however, it can still capture the variety of patterns in real customers' search queries.
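As an illustration of the rule-based generation (I.3), the sketch below greedily matches catalog phrases against query tokens to emit noisy BIO labels. This is a simplification under our own assumptions: the production rules also exploit customer behavior data and handle noise that this sketch ignores.

```python
def auto_label(tokens, brands, product_types):
    """Greedy longest-match of catalog entries -> noisy BIO labels (simplified I.3)."""
    tags = ["O"] * len(tokens)
    # Brands are matched before product types, longest phrase first.
    for prefix, phrases in (("BRD", brands), ("PRD", product_types)):
        i = 0
        while i < len(tokens):
            if tags[i] != "O":
                i += 1
                continue
            for j in range(len(tokens), i, -1):  # try the longest untagged span at i
                if all(t == "O" for t in tags[i:j]) and " ".join(tokens[i:j]) in phrases:
                    tags[i] = "B-" + prefix
                    for k in range(i + 1, j):
                        tags[k] = "I-" + prefix
                    i = j - 1
                    break
            i += 1
    return tags

print(auto_label(["lg", "washer", "mini"], {"lg"}, {"washer", "ice maker"}))
# ['B-BRD', 'B-PRD', 'O']
```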
Secondly, to prepare a set of high quality "golden" data (I.9), we sample a small amount (16K) from I.4 for manual annotation. To avoid bias or over-fitting the model to a small number of patterns, we stratified-sample the data by entity sequence pattern. Here, a pattern is defined as a series of consecutive entities; see Table 4 for the top four patterns and examples, and the sketch below for how patterns can be derived. The reason is that we find our model is sensitive to the order of entities in the training data, which makes sense for a sequence tagging model. This set of golden data is vital to start the model training in TripleLearn (Phase II) and to measure the real model performance.
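A minimal sketch of this pattern-based stratification, under our own reading of the paper (helper names are illustrative): derive each query's entity-sequence pattern from its BIO tags, then cap the number of queries sampled per pattern.

```python
import random
from collections import defaultdict

def pattern_of(tags):
    """Collapse BIO tags into an entity-sequence pattern, e.g. Table 4's 'BRD+O+PRD'."""
    parts = []
    for tag in tags:
        part = "O" if tag == "O" else tag[2:]
        if tag.startswith("I-") and parts and parts[-1] == part:
            continue  # the same entity continues
        if parts and parts[-1] == "O" and part == "O":
            continue  # merge consecutive O tokens
        parts.append(part)
    return "+".join(parts)

def stratified_sample(dataset, per_pattern, seed=0):
    """dataset: list of (tokens, tags); sample up to per_pattern queries per pattern."""
    buckets = defaultdict(list)
    for tokens, tags in dataset:
        buckets[pattern_of(tags)].append((tokens, tags))
    rng = random.Random(seed)
    sample = []
    for examples in buckets.values():
        rng.shuffle(examples)
        sample.extend(examples[:per_pattern])
    return sample

print(pattern_of(["B-BRD", "O", "B-PRD", "I-PRD"]))  # BRD+O+PRD
```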
Thirdly, a rule-based algorithm (I.5) creates a set of synthetic data (I.6) using all the brands and products in our product catalog (I.1). The synthetic patterns are simple, such as brand-only queries (e.g. "samsung" as a query) and product-type-only queries (e.g. "washer" as a query). All the distinct brands and products are included so that the model can potentially recognize them all.

The outputs of Phase I are three datasets that together meet the three requirements of an ideal training dataset. Table 5 shows corpus statistics and entity distributions. These three datasets are the training data for TripleLearn, as described next.

Phase II: TripleLearn. TripleLearn is the core of this end-to-end solution. In this phase, we iteratively train the model, cumulatively add more training data (i.e. more brands, more product types, and more patterns), and incrementally improve the model performance.
[Figure 1 is a three-phase flowchart; only its caption and node labels are reproduced here. Phase I: I.1 Product Catalog Data, I.2 Customer Behavior Data, I.3 Auto-Generate Raw Training Data, I.4 Large Noisy Training Data, I.5 Auto-Generate Synthetic Data, I.6 Synthetic Training Data, I.7 Stratified Sampling, I.8 Human Annotation, I.9 Golden Training Data. Phase II: II.1 Formulate New Training Data, II.2 Training Data Preprocessing, II.3 Model Training, II.4 Model Evaluation, II.5 Stratified Sampling, II.6 Model Prediction, II.7 Stratified Sampling, II.8 Formulate Additional Training Data. Phase III: III.1 Final Model, III.2 Package of checkpoint & data, III.3 Deployment on Cloud VM, III.4 Real-time web service, III.5 User Input (e.g. "Husky toolbox"), III.6 Model Service Output.]

Figure 1: The end-to-end solution with the TripleLearn framework highlighted in bold and orange. Phase I prepares three sets of training data for TripleLearn, and Phase II is the iterative training process of TripleLearn. The last phase is model production.

Top Sequential Pattern   Example
BRD+O+PRD                [milwaukee]:BRD [cheap]:O [drill]:PRD
BRD+O+PRD+O              [ge]:BRD [7.4 cu ft]:O [dryer]:PRD [gas]:O
BRD+PRD+O                [behr]:BRD [paint]:PRD [discount]:O
O+PRD+O                  [bronze]:O [faucet]:PRD [pull down]:O

Table 4: Examples of the top four sequential patterns in the large volume of training data (I.4).

         Large (I.4)   Synthetic (I.6)   Golden (I.9)
Query    2,737,399     23,710            16,915
Token    16,914,607    48,568            89,692
BRD      2,509,132     14,058            14,915
PRD      2,694,413     9,652             14,774

Table 5: Corpus statistics: the numbers of queries, tokens, and entities in the three training datasets for TripleLearn.
The first iteration starts with the golden data (I.9). 15% of the golden data is randomly held out as test data; the rest is randomly split into training (90%) and validation ("dev") data (10%). Before the training, the training data is preprocessed to balance the labels (II.2), which requires domain knowledge. For example, we identify 50 tokens that can be either a brand or a product type, such as "instant pot", "cutter", and "anchor". Such queries are balanced by oversampling entities with fewer queries, so that no bias is generated due to a skewed data distribution.

Then the model is trained (II.3) until the F1 score on the validation set stops improving. The next step is to evaluate the model (II.4) on the test data to see whether this iteration has produced a better model. If the current iteration's model is worse than the previous one, we stop the iterative training and select the best model from all the previous iterations. If it is better, we continue the iterations by stratified sampling (II.5) from I.4, as done in I.7, for model inference (II.6). If the prediction on a query matches the noisy labels in I.4, we pass the query to the next step as additional training data. The idea is to reduce the noise in training data by looking for consensus between the noisy labels produced by I.3 and the noisy labels predicted by II.6. Similarly, we sample (II.7) synthetic data (I.6) to increase the coverage of brands and product types. The goal is to cover all possible entity values after a few iterations. In the end, we formulate new additional training data from II.6 and II.8 for the next iteration. The whole loop is sketched below.
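Putting the pieces together, the sketch below gives one reading of the Phase II loop; train_fn, evaluate_fn, and sample_fn are injected stand-ins for nodes II.2-II.8, not functions from the paper.

```python
def triple_learn(splits, large_noisy, synthetic, train_fn, evaluate_fn,
                 sample_fn, max_iters=10):
    """One reading of the TripleLearn loop (Phase II). Model-specific work is injected:
    train_fn(train_set) -> model (with .predict(tokens) -> tags);
    evaluate_fn(model, test) -> exact-match F1;
    sample_fn(dataset) -> stratified sample by entity-sequence pattern.
    splits = (train_set, test) after the 15% golden holdout."""
    train_set, test = splits
    best_model, best_f1 = None, float("-inf")
    for _ in range(max_iters):
        model = train_fn(train_set)                      # II.2 + II.3
        f1 = evaluate_fn(model, test)                    # II.4
        if f1 <= best_f1:                                # worse than before: stop, keep best
            break
        best_model, best_f1 = model, f1
        # II.5 + II.6: keep noisy queries where the model agrees with the rule labels.
        agreed = [(toks, tags) for toks, tags in sample_fn(large_noisy)
                  if model.predict(toks) == tags]
        # II.7 + II.8: add synthetic queries to widen brand/product-type coverage.
        train_set = train_set + agreed + sample_fn(synthetic)
    return best_model
```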
A question we had during early development was whether this iterative self-training process would accumulate errors, thereby biasing itself toward erroneous patterns of predictions by II.6. We reduce the chance of the model drifting by stratified sampling by label sequence patterns in steps I.7, II.5, and II.7. Beyond that, the final judge of this iterative process is the F1 score on the held-out test set. In our experiments (Sec. 3), we see that this iterative process can indeed improve the model performance iteration by iteration. More results are shown in Sec. 3.
Phase III: Model Production. Model production is an indispensable step to use the model in the THD search engine. This phase starts with the best model selected from TripleLearn (Phase II). In node III.2, the neural graph, weights, and required data (vocabulary, word embeddings, etc.) are packaged for the next step. Then the model package is deployed on a cloud virtual machine (III.3). To use it in a real-time search engine, in node III.4, the other necessary components are added to parse a raw user search query and to convert the NER model prediction into machine-readable outputs, which exposes the deep learning model as a real-time web service. Sample input and output of this model service are shown in III.5 and III.6 in Fig. 1. More details on deployment and testing are described in Sec. 4 (Productionization).

2.3 Selected Model Architecture

After experimenting with multiple neural architectures (Sec. 3), our selected model is based on the bidirectional RNN-CRF, a popular approach to sequence tagging (Yadav and Bethard 2019; Lee 2017). Recent works helped us to finalize the architecture. Huang et al. first demonstrated that BiLSTM-CRF can effectively use both left and right context as well as the statistical sequential dependency among token labels (Huang, Xu, and Yu 2015). Lample et al. (Lample et al. 2016) further showed that BiLSTM-CRF with pre-trained word embeddings and character embeddings performed the best on CoNLL-2003 (Sang and De Meulder 2003). The GRU, a simplified variant of the LSTM, has also shown state-of-the-art performance (Lee 2017). After numerous experiments (see details in Sec. 3.2), our final neural architecture is a BiGRU-CRF with a BiLSTM subgraph for character-level embedding, as shown in Figures 2 and 3. This neural architecture is implemented using Tensorflow.

[Figures 2 and 3 are architecture diagrams; only their captions are reproduced here.]

Figure 2: The character-to-word subgraph in our model. Both forward and backward word representations are learned from character embeddings using a BiLSTM layer.

Figure 3: The word-to-label-sequence subgraph. Forward and backward character-based embeddings are concatenated to word embeddings as input to the bidirectional GRU (BiGRU), which in turn provides features for the CRF.
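The following Keras sketch outlines the two subgraphs. Layer sizes and input lengths are illustrative assumptions, since the paper does not publish hyperparameters, and the CRF on top is only indicated in a comment.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_bigru_crf_features(n_words, n_chars, n_labels,
                             max_len=10, max_word_len=20,
                             word_dim=100, char_dim=25, char_units=25, gru_units=100):
    """Feature subgraphs of Figures 2 and 3; all sizes here are illustrative."""
    word_ids = layers.Input(shape=(max_len,), dtype="int32", name="word_ids")
    char_ids = layers.Input(shape=(max_len, max_word_len), dtype="int32", name="char_ids")

    # Figure 2: character-to-word representations from a BiLSTM over characters.
    char_emb = layers.TimeDistributed(layers.Embedding(n_chars, char_dim))(char_ids)
    char_vec = layers.TimeDistributed(
        layers.Bidirectional(layers.LSTM(char_units)))(char_emb)

    # Figure 3: concatenate word embeddings with character-based vectors, run a BiGRU.
    word_emb = layers.Embedding(n_words, word_dim)(word_ids)
    feats = layers.Concatenate()([word_emb, char_vec])
    feats = layers.Bidirectional(layers.GRU(gru_units, return_sequences=True))(feats)

    # Per-token unary scores; a CRF sits on top of these (e.g. via
    # tfa.text.crf_log_likelihood for training and tfa.text.crf_decode at inference).
    logits = layers.Dense(n_labels, name="unary_scores")(feats)
    return tf.keras.Model(inputs=[word_ids, char_ids], outputs=logits)
```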
3 Model Experiments

We systematically run experiments on the TripleLearn framework, model architectures, and embeddings. The evaluation metric is the micro-averaged F1 score, which is commonly used in NER tasks. Because the numbers of the two entities are roughly balanced, the macro-averaged F1 is also consistent with the micro-F1. The evaluation dataset is the holdout test data, which is the 15% random sample of the golden data from Phase I of Fig. 1.
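For clarity, a minimal sketch of this exact-match, micro-averaged F1 (our own illustration):

```python
def exact_match_f1(gold_entities, pred_entities):
    """Micro-averaged exact-match F1: an entity counts only if its full span
    and type are both correct. Inputs are per-query sets of (type, text) pairs."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_entities, pred_entities):
        gold, pred = set(gold), set(pred)
        tp += len(gold & pred)
        fp += len(pred - gold)
        fn += len(gold - pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

gold = [{("BRD", "lg"), ("PRD", "washer")}]
pred = [{("BRD", "lg"), ("PRD", "washer mini")}]  # a partial span match counts as wrong
print(round(exact_match_f1(gold, pred), 3))        # 0.5
```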
3.1 TripleLearn Framework

The experimental results show that TripleLearn (highlighted in bold in Fig. 1) iteratively improves the model performance on the test data, as shown in Fig. 4 and Table 6. As we iteratively add more training data, the coverage of brands and product types increases and reaches 100% at iteration 7, which is important for the model to be able to potentially recognize all brands and product types. The F1 scores also saturate after iteration 7.

The advantage of TripleLearn is justified in three aspects. Firstly, the F1 score improves from 87.1 at iteration 1 to 93.3 at iteration 7, which demonstrates that it can iteratively improve the model. Secondly, compared to simply training the model once using all the data (the blue horizontal line in Fig. 4), this iterative process performs better in every iteration. Lastly, compared to the most important baseline, the legacy NER system in production, our best model significantly increases the F1 score (from 69.5 to 93.3), which shows the superiority of the model and justifies an A/B test. The model at iteration 7 has the best F1 score and also covers all the brands and product types, so it is selected for production.

[Figure 4 is a line plot: x-axis Iteration 1-9, y-axis F1 Score (roughly 70-93); curves: TripleLearn, Baseline: Without TripleLearn, Baseline: Legacy NER System.]

Figure 4: F1 score on the test data of TripleLearn at each iteration (green curve on the top) and two baselines. The red line on the bottom is the legacy NER system. The blue line in the middle is one-pass training using all the data (I.4 + I.6 + I.9 in Fig. 1).

iter.  training   unq. BRD  unq. PRD  dev F1  test F1
1      14,378     3,400     3,239     89.32   88.82
2      19,911     3,936     3,768     95.81   91.36
3      31,510     4,402     4,216     96.52   91.47
4      57,379     5,374     5,131     97.70   91.73
5      115,564    7,544     7,004     98.68   92.49
6      241,629    12,082    10,305    99.19   92.83
7      484,390    14,102    11,111    99.49   93.30
8      992,823    14,102    11,111    99.66   93.12
9      2,089,573  14,102    11,111    99.81   92.79

Table 6: The F1 scores, the numbers of training queries, unique brands, and unique product types by iteration.

A disadvantage of this process is that it takes longer because we have to train multiple iterations. On average, the whole process with nine iterations takes about 20 hours on a GPU machine (NVIDIA Tesla K80) using our training data. However, the training is an offline process, and we value the quality of the model more than the offline training cost.

Considering the long training time, most of the experiments below are tested on golden data only (iteration 1), which is much faster. However, from the finished iterative model trainings, we find that the model performance at iteration 1 is already a good indicator of the final model performance, which justifies the experimental findings using the golden data in the next subsections.
3.2 Tested Models

We test six neural architectures, as shown in Table 7. Each neural layer is tested and discussed below.

models       char emb.  dev F1  test F1
BiLSTM       No         85.77   85.05
BiLSTM       Yes        86.99   86.23
BiLSTM-CRF   No         87.69   86.72
BiLSTM-CRF   Yes        88.57   88.44
BiGRU        No         85.42   85.57
BiGRU        Yes        86.53   87.09
BiGRU-CRF    No         87.12   87.04
BiGRU-CRF    Yes        88.71   88.82
BERT         N/A        83.22   82.51
BERT-CRF     N/A        83.93   83.10

Table 7: Six neural architectures tested on golden data, with and without character-based embedding layers where applicable.

The character-based embedding is produced by a BiLSTM layer (Lample et al. 2016), as shown in Fig. 2. It essentially extracts features for each word using a character-level model. With character-based embedding, the F1 score is consistently better, as shown in Table 7 using golden data for training and testing. We believe the reason is that it can help bridge the gap between common query words and uncommon query words that contain common sub-word character sequences.

We also compare BiGRU and BiLSTM, with BiGRU showing comparable or even better performance. We select BiGRU because it has 25% fewer parameters and is thus faster to train and execute compared to BiLSTM.

In addition to the RNN-based neural models, we also tested transformer-based BERT (uncased large) (Devlin et al. 2018), but the performance is less satisfactory, as shown in Table 7, with either a CRF layer or a simple softmax layer. This may be because our search queries are domain-specific and have very different patterns compared to the BERT training corpus (i.e. Wikipedia and books). Further improvements may require fine-tuning on a domain-specific corpus and/or additional neural layers (e.g. BERT-LSTM-CRF), which would require significantly more effort in terms of both training and deployment.

The CRF layer is helpful to predict the most likely entity sequence and to forbid invalid sequence transitions, such as B-BRD→I-PRD and O→I-BRD. Our results in Table 7 also show that the CRF layer consistently improves the F1 score.
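One way to encode such constraints (an illustrative sketch, not necessarily how our CRF layer does it; in practice the transition scores can also simply be learned) is a mask that assigns -inf to invalid label-to-label transitions before decoding:

```python
import numpy as np

LABELS = ["B-BRD", "I-BRD", "B-PRD", "I-PRD", "O"]

def allowed_transitions(labels):
    """Boolean matrix: may label j follow label i? I-X may only follow B-X or I-X."""
    n = len(labels)
    allowed = np.ones((n, n), dtype=bool)
    for i, prev in enumerate(labels):
        for j, curr in enumerate(labels):
            if curr.startswith("I-"):
                entity = curr[2:]
                allowed[i, j] = prev in ("B-" + entity, "I-" + entity)
    return allowed

# Invalid moves such as B-BRD -> I-PRD or O -> I-BRD get -inf transition scores,
# so a constrained decoder can never emit them.
mask = np.where(allowed_transitions(LABELS), 0.0, -np.inf)
print(mask[LABELS.index("O"), LABELS.index("I-BRD")])  # -inf
```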
3.3 Word Embeddings

Word embedding is an important component and simplifies feature engineering (Yadav and Bethard 2019). We test pre-trained Word2vec embeddings (Mikolov et al. 2013), pre-trained GloVe (Pennington, Socher, and Manning 2014), custom Word2vec, and custom GloVe embeddings. The custom embeddings are trained using the top 50 million search queries and 3 million product titles on homedepot.com; a training sketch follows Table 8.

We select the custom Word2vec embedding for our final model for three reasons. Firstly, a domain-specific word embedding is a better semantic representation in terms of similar words. Table 8 shows the top three similar words for "milwaukee", which is a city but also a popular brand in the home improvement domain. Secondly, the vocabulary coverage is much higher for the custom-trained embeddings. Lastly, as shown in Table 8, the model performance is also better than with the pre-trained embeddings.

embedding             similar words                   vocab. coverage  F1
pre-trained GloVe     chicago, detroit, minneapolis   39.8%            87.56
pre-trained Word2vec  springfield, harvey, wisconsin  15.5%            83.99
custom GloVe          m18, dewalt, drill              99.9%            87.58
custom Word2vec       dewalt, makita, ridgid          99.9%            88.82

Table 8: Top three similar words for "milwaukee", vocabulary coverage of all unique words in the training data, and F1 score on the test data for the four word embeddings.
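Training such custom embeddings could look like the following gensim sketch; the file names and hyperparameters are our assumptions, as the paper only specifies the corpus (top 50M queries plus 3M product titles):

```python
from gensim.models import Word2Vec

# One whitespace-tokenized query or product title per line.
class LineCorpus:
    def __init__(self, *paths):
        self.paths = paths
    def __iter__(self):
        for path in self.paths:
            with open(path, encoding="utf-8") as f:
                for line in f:
                    yield line.lower().split()

# Hypothetical file names; vector_size/window/min_count are illustrative defaults.
corpus = LineCorpus("top_queries.txt", "product_titles.txt")
model = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=5, workers=4)
model.wv.save("custom_word2vec.kv")
print(model.wv.most_similar("milwaukee", topn=3))  # e.g. dewalt, makita, ridgid
```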
4 Productionization

Productionization is the key to delivering real-world impact, and there are two correlated challenges: speed and cost. Firstly, the model execution has to be fast enough to serve thousands of queries per second and help retrieve search results within a time limit. The speed also has a direct impact on user satisfaction and conversion rate. Secondly, the cost of serving the model has to be reasonably low to justify the return on investment. More details are explained below.

4.1 Deployment

The first challenge is speed, and a practical solution is to leverage an existing platform with customized optimization. Specifically, the model is deployed using Tensorflow Serving, a flexible and high-performance model serving system by Google, on the Google Cloud Platform (GCP). This provides a stable environment, but not yet an optimized one. Thus, we customize the optimization by reducing the servable model size, which has a direct impact on model inference time. This involves converting the variables in a Tensorflow checkpoint into constants stored directly in the model graph, stripping out unreachable parts of the graph, folding constants, folding batch normalizations, removing training and debug operations, etc., as sketched below. In our experience, the optimized model has a significantly smaller size, faster loading, and faster inference.
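These optimizations map closely onto the TensorFlow 1.x graph tooling of that era; the sketch below is our assumption of how they can be applied (node names are hypothetical, and the exact transform set used in production is not disclosed):

```python
import tensorflow as tf  # TensorFlow 1.x assumed
from tensorflow.python.framework import graph_util
from tensorflow.tools.graph_transforms import TransformGraph

# 1) Convert checkpoint variables into constants baked into the graph.
with tf.Session() as sess:
    saver = tf.train.import_meta_graph("model.ckpt.meta")
    saver.restore(sess, "model.ckpt")
    frozen = graph_util.convert_variables_to_constants(
        sess, sess.graph_def, output_node_names=["ner/logits"])

# 2) Strip and fold: drop unreachable/debug ops, fold constants and batch norms.
optimized = TransformGraph(
    frozen,
    inputs=["word_ids", "char_ids"],   # hypothetical input node names
    outputs=["ner/logits"],            # hypothetical output node name
    transforms=[
        "strip_unused_nodes",
        "remove_nodes(op=Identity, op=CheckNumerics)",
        "fold_constants(ignore_errors=true)",
        "fold_batch_norms",
    ])

with tf.gfile.GFile("optimized_model.pb", "wb") as f:
    f.write(optimized.SerializeToString())
```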
The other challenge is cost. Serving the deep learning model on a GPU machine would be fast but also much more expensive. We managed to optimize the model for CPU to meet our performance requirement and deploy it on GCP virtual machine instances with a custom CPU machine type (4 vCPUs, 3.75 GB memory). This deployment automatically scales with real-time traffic, leading to a very cost-effective solution.

The model has been live in production for more than 9 months, and we see that the model service can serve 99% of the search queries within 26 milliseconds. This is highly satisfactory for our search engine. This concludes a successful engineering deployment of the model.
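For reference, a hypothetical client of such a TensorFlow Serving endpoint might look like the sketch below; the model name, serving signature, and tensor layout are assumptions, and decode_bio is the helper sketched in Sec. 2.1:

```python
import requests

KEY = {"BRD": "brand", "PRD": "product"}

def extract_entities(query, host="http://localhost:8501", model="ner"):
    """Call a TensorFlow Serving REST endpoint and decode the predicted tags."""
    tokens = query.lower().split()               # stands in for the real preprocessing
    resp = requests.post(
        f"{host}/v1/models/{model}:predict",     # TF Serving's REST URL scheme
        json={"instances": [{"tokens": tokens}]},  # assumed serving signature
        timeout=0.05)                              # fail fast in a real-time engine
    resp.raise_for_status()
    tags = resp.json()["predictions"][0]
    # decode_bio: the BIO-grouping helper sketched in Sec. 2.1.
    return {KEY[label]: text for label, text in decode_bio(tokens, tags)}

# e.g. extract_entities("Husky toolbox") -> {"brand": "husky", "product": "toolbox"}
```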
4.2 Usage, A/B Test, and Business Impact

The deep learning model service is used in our search engine, serving a high volume of search queries in real time. The extracted entities (brand and product type) are used to retrieve items from the search inverted index, as well as serving as an additional input for ranking.

In the A/B test, we tested the NER model service against the legacy NER system with an equal random traffic split on homedepot.com. With millions of search visits from real online shoppers, we saw a significant improvement in both click-through rate and revenue conversion per search. The estimated annual business impact is $60M in incremental revenue. The cost is minimal compared to this benefit, so the project is yielding a high return on investment. Detailed metrics measured from the A/B test are confidential and not disclosed here.

After the A/B test, this model replaced the legacy NER system and has been live on homedepot.com for more than 9 months, serving millions of real customers and boosting search conversions.

4.3 Maintenance

No model works well forever. This is especially true in the eCommerce scenario, where new products with new brands and/or new product types are added regularly. Therefore, we also need to refresh the model regularly to reflect the changes.

The TripleLearn framework makes it easy to refresh the model because we only need to incrementally update a small fraction of the three datasets to retrain the model. For example, we recently refreshed the model to improve the performance on short queries, which have more average traffic than long queries. In this case, we only added 5% additional short-query training data to the golden dataset (i.e. I.9 in Fig. 1) to retrain the model. In both the offline evaluation and an A/B test, the refreshed model showed significant improvements on short queries and consistent performance on other queries. Another planned improvement is to retrain the model to cover new brands and new product types recently added to our product catalog. We plan to first add these new brands and product types to the synthetic data (i.e. I.6 in Fig. 1), which takes minimal effort. If the improvement is not satisfactory, we may incrementally add a small amount of such data to the golden dataset, which is still minimal additional effort.

5 Discussion and Future Work

We discuss our TripleLearn framework by addressing three questions below.

1) What challenges can TripleLearn help with? TripleLearn is a novel training framework that works with less-than-ideal training data. In most industrial settings, it is often too expensive to prepare a large amount of high quality training data. However, it is more realistic to prepare three sets of training data as required by TripleLearn. As shown in the end-to-end process (Fig. 1), the synthetic data and the large volume of noisy data can be generated automatically, while the golden data is a small set that can be annotated manually at a reasonable cost.

Meanwhile, separate datasets are also easier to maintain in practice because they are independent, and we may only need to update one set of training data. For example, an eCommerce business adds or removes products regularly. Using TripleLearn, only the synthetic data has to be updated to reflect the latest product catalog.

2) Why does TripleLearn work? TripleLearn leverages three separate datasets that provide collaborative supervision. Golden data is the ground truth to start the model training and to measure the real model performance; the large noisy data provides rich variations of real-world patterns; and the synthetic data feeds the model all possible entity values, which may or may not show up in the golden data or the large noisy data.

TripleLearn may resemble a natural way of learning. An analogy is the process of learning a new language. Usually, we may learn from three types of materials: 1) a textbook with high quality content, exercises, and solutions; 2) a dictionary with all the words and common phrases; 3) a variety of real-world materials such as TV shows and movies. These materials map to the three datasets for TripleLearn. The high quality golden data corresponds to the textbook; the synthetic data covers all the brands and product types, which is similar to the dictionary; and the large amount of noisy data plays the role of the variety of real-world materials, which may be noisy too. With these three types of materials, learning a new language is also an iterative process. Specifically, we learn from each type of material in each "iteration", then validate our learning using high quality textbook content and exercises. We keep going through this process iteratively so that we can learn more vocabulary and more complex patterns, which is similar to how TripleLearn works.

3) How can TripleLearn be generalized? TripleLearn can be generalized to different models and different problems. In terms of models, TripleLearn is model-independent and can use any type of supervised model. In terms of problems, it can be applied to any problem with similar training data issues. For example, the NER problem in eCommerce is a perfect fit because of the same data foundation and requirements, such as extracting entities from search queries or from product titles and descriptions. Another candidate is machine translation (Koehn and Knowles 2017), because it usually has well-defined dictionaries but a large amount of noisy training data.

In the future, we plan to work further in three directions. Firstly, more entities, such as color and size, will be included in the model. Secondly, the use cases can be extended to offline applications, such as extracting entities from product descriptions to enrich a knowledge graph. Lastly, we would like to test this framework on publicly available datasets for easier reproducibility.

6 Conclusion

Our work demonstrates an end-to-end solution that applies a state-of-the-art deep learning model to a domain-specific industrial problem, i.e. named entity recognition on eCommerce search queries. The core is a novel model training framework, TripleLearn, which iteratively learns from three separate sets of training data. We demonstrate that this iterative process is effective at improving model performance, compared to traditional training using one set of data (Sec. 3.1). The resulting model has been deployed in production as a real-time web service for the search engine at homedepot.com. The model A/B test and day-to-day use for more than 9 months show stable model performance, higher user engagement, and increased revenue, which proves its significant value in the domain of eCommerce. Moreover, the TripleLearn framework makes it practically easy to maintain and refresh the model by allowing incremental updates to one or more of the three datasets (Sec. 4.3).

More importantly, as discussed in Sec. 5, our proposed TripleLearn framework can be generalized to different models and problems. We hope that it will inspire more industrial innovations using data science and machine learning.
References

Collobert, R.; and Weston, J. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, 160–167.

Collobert, R.; Weston, J.; Bottou, L.; Karlen, M.; Kavukcuoglu, K.; and Kuksa, P. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research 12(Aug): 2493–2537.

Cowan, B.; Zethelius, S.; Luk, B.; Baras, T.; Ukarde, P.; and Zhang, D. 2015. Named Entity Recognition in Travel-Related Search Queries. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI'15, 3935–3941. AAAI Press. ISBN 0-262-51129-0. URL http://dl.acm.org/citation.cfm?id=2888116.2888261.

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Guo, J.; Xu, G.; Cheng, X.; and Li, H. 2009. Named entity recognition in query. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 267–274. ACM.

Huang, Z.; Xu, W.; and Yu, K. 2015. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991.

Koehn, P.; and Knowles, R. 2017. Six challenges for neural machine translation. arXiv preprint arXiv:1706.03872.

Lample, G.; Ballesteros, M.; Subramanian, S.; Kawakami, K.; and Dyer, C. 2016. Neural Architectures for Named Entity Recognition. CoRR abs/1603.01360. URL http://arxiv.org/abs/1603.01360.

Lee, C. 2017. LSTM-CRF models for named entity recognition. IEICE Transactions on Information and Systems 100(4): 882–887.

Li, J.; Sun, A.; Han, J.; and Li, C. 2020. A survey on deep learning for named entity recognition. IEEE Transactions on Knowledge and Data Engineering.

Liu, Y.; Meng, F.; Zhang, J.; Xu, J.; Chen, Y.; and Zhou, J. 2019. GCDT: A global context enhanced deep transition architecture for sequence labeling. arXiv preprint arXiv:1906.02437.

Ma, X.; and Hovy, E. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. arXiv preprint arXiv:1603.01354.

Majumder, B. P.; Subramanian, A.; Krishnan, A.; Gandhi, S.; and More, A. 2018. Deep Recurrent Neural Networks for Product Attribute Extraction in eCommerce. arXiv preprint arXiv:1803.11284.

Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

More, A. 2016. Attribute Extraction from Product Titles in eCommerce. CoRR abs/1608.04670. URL http://arxiv.org/abs/1608.04670.

Nadeau, D.; and Sekine, S. 2007. A survey of named entity recognition and classification. Lingvisticæ Investigationes 30(1): 3–26. ISSN 0378-4169. doi:10.1075/li.30.1.03nad. URL https://www.jbe-platform.com/content/journals/10.1075/li.30.1.03nad.

Pennington, J.; Socher, R.; and Manning, C. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543.

Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.

Putthividhya, D. P.; and Hu, J. 2011. Bootstrapped Named Entity Recognition for Product Attribute Extraction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '11, 1557–1567. Stroudsburg, PA, USA: Association for Computational Linguistics. ISBN 978-1-937284-11-4. URL http://dl.acm.org/citation.cfm?id=2145432.2145598.

Sang, E. F.; and De Meulder, F. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. arXiv preprint cs/0306050.

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008.

Wen, M.; Vasthimal, D. K.; Lu, A.; Wang, T.; and Guo, A. 2019. Building Large-Scale Deep Learning System for Entity Recognition in E-Commerce Search. In Proceedings of the 6th IEEE/ACM International Conference on Big Data Computing, Applications and Technologies, 149–154.

Wu, C.-y.; Ahmed, A.; Kumar, G. R.; and Datta, R. 2017. Predicting Latent Structured Intents from Shopping Queries. In WWW 2017.

Yadav, V.; and Bethard, S. 2019. A survey on recent advances in named entity recognition from deep learning models. arXiv preprint arXiv:1910.11470.
