Figure 1: (a) The open-world extraction model, where each sentence piece is classified as B/I/O/E; (b) the open-world link prediction model, which predicts if two mentions refer to the same entity; (c) the closed-world linking model, which predicts the probability that a mention corresponds to each entity candidate (entity embeddings are generated offline and fetched from the storage at inference time).
wiki entities selected by 2 out of 5 raters to be our community ground truth. This method, compared against the oracle labels provided by in-house experts, achieves an 80% chance of having all extracted entities correct and a 70% chance of having all correct entities extracted. Both numbers further increase by 14% if we tolerate a single error. As a reference, the F1 score of an average individual rater on this task is 0.68.
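To make the two document-level rates concrete, the sketch below aggregates rater votes into community labels and compares them against the oracle labels. The function and variable names are hypothetical illustrations, not our evaluation code.

```python
from typing import Dict, List, Set


def community_labels(votes: Dict[str, int], min_votes: int = 2) -> Set[str]:
    """Keep the wiki entities selected by at least `min_votes` of the 5 raters."""
    return {entity for entity, count in votes.items() if count >= min_votes}


def document_level_rates(community: List[Set[str]], oracle: List[Set[str]]):
    """Fraction of documents where all community entities are correct (a subset of
    the oracle set) and where all oracle entities are recovered (a superset of it)."""
    all_correct = sum(pred <= gold for pred, gold in zip(community, oracle))
    all_extracted = sum(gold <= pred for pred, gold in zip(community, oracle))
    n = len(community)
    return all_correct / n, all_extracted / n
```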
4.2 Modeling

Similar to the open-world model, we break the task into an extraction stage and a linking stage.

4.2.1 Extraction Stage

Instead of finding possible entity links dynamically after the mentions are extracted, we rely on a static dictionary that maps various mention aliases to entities and extract all possible links in advance using fuzzy string matching. This simplifies the labeling effort and also reduces the computation time for both training and inference. The performance then depends heavily on the quality of the dictionary. To build it, we recursively trace Wikipedia's Redirection pages, which define a mapping from a mention to an entity, and Disambiguation pages, which map a mention onto multiple possible entities. Various rule-based clean-ups are also performed on the mentions, the entities, and the mapping.
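As an illustration of this dictionary-based candidate extraction, here is a minimal sketch, assuming the Redirection and Disambiguation mappings have already been parsed into plain Python dictionaries. All names are illustrative, and the fuzzy matcher is simplified to an exact n-gram lookup.

```python
from collections import defaultdict


def build_alias_dict(redirects, disambiguations):
    """redirects: alias -> entity; disambiguations: alias -> list of entities.
    Returns a lower-cased alias -> set-of-candidate-entities mapping,
    following redirect chains recursively."""

    def resolve(entity, seen=()):
        # Follow redirect chains (alias -> redirect -> canonical entity).
        if entity in redirects and entity not in seen:
            return resolve(redirects[entity], seen + (entity,))
        return entity

    alias_to_entities = defaultdict(set)
    for alias, entity in redirects.items():
        alias_to_entities[alias.lower()].add(resolve(entity))
    for alias, entities in disambiguations.items():
        alias_to_entities[alias.lower()].update(resolve(e) for e in entities)
    return alias_to_entities


def candidate_links(text, alias_to_entities, max_len=3):
    """Match document n-grams against dictionary aliases to pre-compute candidate links.
    (The real pipeline uses fuzzy string matching over an indexed dictionary;
    exact lookup keeps this sketch short.)"""
    tokens = text.lower().split()
    candidates = []
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            span = " ".join(tokens[i:i + n])
            if span in alias_to_entities:
                candidates.append((i, i + n, sorted(alias_to_entities[span])))
    return candidates
```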
4.2.2 Linking Stage

The linking model then computes the similarity between a mention and its candidate entities. The mention tower is similar to the open-world model: we run the input document through a language model and pool the outputs to get embeddings for the mentions. On the entity side, each entity's Wikipedia text is summarized offline into an embedding. For each mention-entity pair, the mention embedding is broadcast to dot with its candidate entity embeddings after a linear projection, outputting a relevance score, as shown in Figure 1c. We also experimented with first predicting a mention score as in the open-world case, but found little difference in the final entity metric. Additional supervision on salience is also added for entities, based on the number of votes received from the raters. We concatenate these scores with count-based features, such as the prior of the mention-entity link, to obtain the final linking score after a feed-forward layer.
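The closed-world scorer of Figure 1c can be sketched as follows in PyTorch. The dimensions, feature count, and head sizes are illustrative assumptions, not the production configuration.

```python
import torch
import torch.nn as nn


class ClosedWorldLinker(nn.Module):
    def __init__(self, mention_dim=768, entity_dim=256, num_extra_features=4):
        super().__init__()
        self.project = nn.Linear(mention_dim, entity_dim)  # linear projection before the dot product
        self.ffn = nn.Sequential(                          # combines scores with count-based features
            nn.Linear(1 + num_extra_features, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, mention_emb, entity_embs, extra_features):
        """mention_emb:    [B, mention_dim]            pooled from the encoder
        entity_embs:       [B, C, entity_dim]           pre-computed offline, C candidates per mention
        extra_features:    [B, C, num_extra_features]   e.g. the mention-entity link prior"""
        projected = self.project(mention_emb).unsqueeze(1)           # [B, 1, entity_dim]
        relevance = (projected * entity_embs).sum(-1, keepdim=True)  # broadcast dot product -> [B, C, 1]
        logits = self.ffn(torch.cat([relevance, extra_features], dim=-1))
        return logits.squeeze(-1)                                    # linking score per candidate
```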
5 Scaling Challenges

To have good coverage over various documents, our system needs to scale across languages, entity types, and document types. Naively, we could develop a model for each (language, entity type, document type) triple and run a combination of models on each piece of document. However, this would bring significant overhead in model development and model serving. Therefore, our system tackles these scaling challenges with the following techniques and trains a single model instead.

5.1 Cross-Language Model and Fine-Tuning

Transformer-based (Vaswani et al., 2017) pretrained language models have led to strong improvements on various natural language processing tasks (Wang et al., 2018). With cross-lingual pretraining, XLM (Lample and Conneau, 2019) can achieve state-of-the-art results across languages. In our work, we employ XLM and further improve the predictions by fine-tuning on multilingual data. We compare the performance of zero-shot and fine-tuned product extraction models on ads in Table 2. While the zero-shot model predicts reasonably well for Romance languages, e.g. French (fr) and Portuguese (pt), it performs poorly for Arabic (ar) and Vietnamese (vi). This is expected, since the latter have very different characteristics from English. By fine-tuning on all-language data, we see a substantial boost in model performance for all languages.

                  Zero-Shot                     Fine-Tuned
Language    Precision  Recall   F1        Precision  Recall   F1
ar          0.2556     0.0676   0.1069    0.3170     0.5331   0.3976
da          0.2437     0.4037   0.3040    0.4093     0.5444   0.4673
de          0.2966     0.3670   0.3281    0.3349     0.5921   0.4279
en          0.4301     0.6750   0.5254    0.4251     0.7036   0.5300
es          0.2739     0.3500   0.3073    0.3439     0.5955   0.4360
fr          0.3499     0.3584   0.3541    0.4067     0.5988   0.4844
it          0.3157     0.3626   0.3375    0.4152     0.6146   0.4956
nl          0.2466     0.4673   0.3228    0.3316     0.5299   0.4079
pt          0.3075     0.4395   0.3618    0.4122     0.6555   0.5061
ru          0.3144     0.4467   0.3691    0.4300     0.7021   0.5334
vi          0.1886     0.0283   0.0492    0.3653     0.6888   0.4774
Overall     0.3315     0.3834   0.3556    0.3861     0.6331   0.4797

Table 2: Multilingual fine-tuning of the product name extraction model. The zero-shot model is trained on English only; the fine-tuned model is trained on all languages (one-tenth of the English sample size for each new language).
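Per the caption of Table 2, the fine-tuned model sees the full English training set plus one-tenth of that size for each new language. A rough sketch of assembling such a multilingual fine-tuning set is shown below; the helper name and data layout are hypothetical.

```python
import random


def build_finetuning_set(samples_by_language, english_key="en", ratio=0.1, seed=0):
    """Combine the full English training set with a sub-sample of each other language,
    sized at `ratio` * len(English), then shuffle for fine-tuning."""
    rng = random.Random(seed)
    english = samples_by_language[english_key]
    target_size = int(ratio * len(english))
    combined = list(english)
    for lang, samples in samples_by_language.items():
        if lang == english_key:
            continue
        combined.extend(rng.sample(list(samples), min(target_size, len(samples))))
    rng.shuffle(combined)
    return combined
```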
5.2 Multi-Task Learning For Extraction, Clustering, and Linking

Multi-task learning (Caruana, 1997) is a subfield of machine learning in which multiple tasks are learned simultaneously by a shared model. Such approaches offer advantages like improved data efficiency, reduced overfitting through shared representations, and fast learning by leveraging auxiliary information. It has proven effective in various applications such as computer vision (Zhang et al., 2014) and natural language processing (Vaswani et al., 2017).

In the previous subsections, we train models separately and predict in parallel for different entity types. This is advantageous in that we can train a model for a new entity type, or update the model for an existing entity type, without affecting the other entity models. However, it causes ever-increasing inference costs as new entity types are considered. Currently we have 5 entity types and 7 Transformer-based models, which means running 7 XLM encoders for every ad, web page, etc. This heavy inference cost is a major blocker for our service. To resolve the issue, we developed a unified model structure and training framework that co-trains all entity extraction and linking models with a shared XLM encoder. Since the encoding part accounts for the majority of the computation, the inference time is reduced to 1/7 of before, which unblocks the service. Table 3 displays the performance of the shared-encoder models trained with this framework. It can be seen that they perform comparably with the separately trained models. While the closed-world linking model has slightly better accuracy with co-training, the product name extraction model performs slightly worse. This is probably because a single XLM of moderate size may not encode all the information required by the different entity extraction heads. We expect that increasing the capacity of the encoder will reduce these conflicts. To sum up, the unified model permits new entity types with little additional inference cost and only a slight performance drop.

Task            Metric     Separate Models   Shared-Encoder Models
Extraction      Precision  0.4301            0.4171
                Recall     0.6750            0.6671
                F1         0.5254            0.5133
Closed-World    Accuracy   0.6729            0.6815

Table 3: Co-training the product name extraction and closed-world linking models with a shared XLM encoder.
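The unified structure can be sketched as a single shared encoder feeding one light-weight head per task. The heads below are plain linear layers for brevity; this is an assumption for illustration rather than the exact production architecture.

```python
import torch.nn as nn


class SharedEncoderMultiTaskModel(nn.Module):
    """One shared XLM-style encoder, run once per document, feeding several task heads."""

    def __init__(self, encoder, hidden_dim, task_num_labels: dict):
        super().__init__()
        self.encoder = encoder  # shared across all entity extraction and linking tasks
        self.heads = nn.ModuleDict({
            task: nn.Linear(hidden_dim, n_labels)  # light-weight head per task
            for task, n_labels in task_num_labels.items()
        })

    def forward(self, input_ids, attention_mask):
        # The heavy encoder forward pass happens once; each head reuses its output.
        # (Assumes `encoder` is a callable returning token embeddings of size hidden_dim.)
        token_embeddings = self.encoder(input_ids, attention_mask)
        return {task: head(token_embeddings) for task, head in self.heads.items()}


# Co-training: the total loss is a (weighted) sum of the per-task losses, so gradients
# from every head update the shared encoder jointly.
```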
5.3 Cross-Document Transfer Learning

Transfer learning aims at improving the performance of target models on target domains by transferring the knowledge contained in different but related source domains (Zhuang et al., 2021). Different transfer learning approaches have been developed, ranging from zero-shot transfer learning (Xian et al., 2017) to few-shot transfer learning (Vinyals et al., 2016). We incorporate the transfer learning framework into our system to address the cross-document-type challenge. We run experiments on zero-shot and few-shot transfer learning, as shown in Table 4. As we can see, transfer learning boosts the performance of the model on both document types.

                           Ads Model              Ads+Web Pages Model
Task            Metric     Ads      Web Pages     Ads      Web Pages
Extraction      Precision  0.4301   0.4519        0.4315   0.5148
                Recall     0.6750   0.5167        0.6951   0.6906
                F1         0.5254   0.4821        0.5325   0.5899
Closed-World    Accuracy   0.6729   0.6106        0.6811   0.6852

Table 4: Transfer learning for the product name extraction and closed-world linking models between ads and web page data. The first model is trained on ads only; the second is trained on both ads and web pages. The sample sizes of ads and web pages are the same.
6 Conclusion And Future Work

In this paper, we present a platform for entity extraction at the scale of a giant internet company and discuss the practical learnings from our work. In the future, we would like to improve the efficiency of the Transformer-based language model, as discussed in (Tay et al., 2020).
References

Xavier Amatriain and Justin Basilico. 2012. Netflix recommendations: Beyond the 5 stars. The Netflix Tech Blog.

Vincent D Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. 2008. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008.

Ermei Cao, Difeng Wang, Jiacheng Huang, and Wei Hu. 2020. Open knowledge enrichment for long-tail entities. In Proceedings of The Web Conference 2020, pages 384–394.

Rich Caruana. 1997. Multitask learning. Machine Learning, 28:41–75.

Hanxiong Chen, Xu Chen, Shaoyun Shi, and Yongfeng Zhang. 2019. Generate natural language explanations for recommendation. In Proceedings of the 29th ACM International Conference on Information and Knowledge Management, pages 755–764.

James Davidson, Benjamin Liebald, Junning Liu, Palash Nandy, Taylor Van Vleet, Ullas Gargi, Sujoy Gupta, Yu He, Mike Lambert, Blake Livingston, and Dasarathi Sampath. 2010. The YouTube video recommendation system. In Proceedings of the Fourth ACM Conference on Recommender Systems, pages 293–296.

John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, pages 282–289.

Xuan Nhat Lam, Thuc Vu, Trong Duc Le, and Anh Duc Duong. 2008. Addressing cold-start problem in recommendation systems. In Proceedings of the 2nd International Conference on Ubiquitous Information Management and Communication, pages 208–211.

Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. In Advances in Neural Information Processing Systems, volume 32.

Thomas Lin, Oren Etzioni, et al. 2012. No noun phrase left behind: Detecting and typing unlinkable entities. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 893–903.

Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. 2020. Efficient transformers: A survey. arXiv.

Aditya Srinivas Timmaraju, Angli Liu, and Pushkar Tripathi. 2020. Addressing challenges in building web-scale content classification systems. In 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 8134–8138.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30.

Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, and Daan Wierstra. 2016. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, volume 29.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355.

Yongqin Xian, Bernt Schiele, and Zeynep Akata. 2017. Zero-shot learning — the good, the bad and the ugly. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, pages 3077–3086.

Zhanpeng Zhang, Ping Luo, Chen Change Loy, and Xiaoou Tang. 2014. Facial landmark detection by deep multi-task learning. In Computer Vision – ECCV 2014, pages 94–108.

Fuzhen Zhuang, Zhiyuan Qi, Keyu Duan, Dongbo Xi, Yongchun Zhu, Hengshu Zhu, Hui Xiong, and Qing He. 2021. A comprehensive survey on transfer learning. Proceedings of the IEEE, 109:43–76.