
A Web Scale Entity Extraction System

Xuanting Cai, Quanbin Ma, Pan Li, Jianyu Liu, Qi Zeng, Zhengkan Yang, Pushkar Tripathi
Facebook Inc

Findings of the Association for Computational Linguistics: EMNLP 2021, pages 69–73. November 7–11, 2021. ©2021 Association for Computational Linguistics

Abstract

Understanding the semantic meaning of content on the web through the lens of entities and concepts has many practical advantages. However, when building large-scale entity extraction systems, practitioners face unique challenges in finding the best ways to leverage the scale and variety of data available on internet platforms. We present learnings from our efforts in building an entity extraction system for multiple document types at large scale using Transformers. We empirically demonstrate the effectiveness of multilingual, multi-task and cross-document-type learning. We also discuss the label collection schemes that help to minimize the amount of noise in the collected data.

1 Introduction

Content understanding finds myriad applications in large-scale recommendation systems. One example is ranking content with sparse data (Davidson et al., 2010; Amatriain and Basilic, 2012); in such scenarios, content signals can offer better generalization to overcome cold-start problems (Lam et al., 2008; Timmaraju et al., 2020). Another example is explaining the working theory of the recommendation system to users and regulators (Chen et al., 2019); here, content signals can offer human-understandable features.

This paper presents an overview of the entity extraction platform we built for our recommendation system. Along the way, we overcame several unique challenges: Multiple Languages – since our business operates worldwide and supports languages from various countries, it is imperative to build a multilingual system; Multiple Entity Types – we want to extract multiple types of entities, including named entities like people and places as well as commercial entities like products and brands; Multiple Document Types – our system should work across multiple structured document types such as web pages, ads and user-generated content; Scale – owing to our scale, we need a system that is responsive and resource-efficient enough to process billions of documents per day.

In the subsequent sections, we review the methodology used to collect data and the ideas behind the models. We then discuss techniques to deploy these models efficiently.

2 Notation and Setup

An entity is a human-interpretable concept that is grounded in a real-world notion. A mention is a word or phrase in the text that refers to an entity. For example, both "Joe Biden" and "Biden" can be mentions for the same entity, the 46th president of the United States. Entity extraction is the task of extracting mentions from a given text and linking them to entities. Each instance of this problem consists of a structured document with text attributes like title and description, as well as categorical features and metadata, from which we wish to extract multiple entities. We categorize entity extraction tasks into closed-world and open-world tasks. The former applies when we have a fixed, predefined universe of entities, say, topics from Wikipedia; the latter is needed when such a list is not available, e.g. for products.

3 Open-World Entity Extraction

In this section, we discuss the data labeling and the model architecture for open-world entity extraction.

3.1 Data Labeling

Collecting data for open-world entity extraction presents unique challenges, since it entails collecting free-form inputs from raters. We design a widget that lets raters highlight spans of text, generating a set of positive mentions per example.
Each example is rated by multiple raters, and there are different ways to combine mentions from all raters:

And – select tokens highlighted by all raters;
Or – select tokens highlighted by any rater;
Majority – select tokens highlighted by the majority of raters.

We evaluate these methods by comparing them to in-house experts. Based on the evaluation in Table 1, we choose the Majority method, which provides the best label quality, for our label generation.

Method     Exact Match F1
And        0.775
Or         0.706
Majority   0.794

Table 1: Exact Match F1 is the F1 score of the aggregated rater labels under exact match, compared with the expert labels.

In order to audit and enhance the quality of the labeled data, we prepare detailed instructions on navigating the user interface, the task-specific reasoning process, sample tasks elucidating the rules, and explanations for handling corner cases. Additionally, we routinely inject known examples to calibrate external raters against our experts, and we periodically remove and retrain raters whose outputs digress significantly from the experts. Furthermore, we track raters' consistency with consensus labels to detect outliers. Finally, we perform rule-based sanitization to rectify common errors. For example, we find that raters often fail to select all occurrences of the same piece of text, so we broadcast selected mentions back to the entire input to capture all occurrences.
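For illustration, the three aggregation schemes reduce to set operations over per-rater token highlights. The following is a minimal sketch; the function name and data layout are ours, not the production widget's:

```python
from collections import Counter

def aggregate_highlights(rater_spans, scheme="majority"):
    """Combine per-rater token highlights into consensus labels.

    rater_spans: list of sets of token indices highlighted by each rater.
    Returns the set of token indices selected under the given scheme.
    """
    votes = Counter(tok for spans in rater_spans for tok in spans)
    n = len(rater_spans)
    if scheme == "and":       # keep a token only if every rater highlighted it
        return {t for t, c in votes.items() if c == n}
    if scheme == "or":        # keep a token if any rater highlighted it
        return set(votes)
    if scheme == "majority":  # keep a token if more than half the raters agree
        return {t for t, c in votes.items() if c > n / 2}
    raise ValueError(scheme)

# Three raters highlight tokens of a six-token example.
raters = [{3, 4}, {2, 3, 4}, {3, 4, 5}]
aggregate_highlights(raters, "and")       # {3, 4}
aggregate_highlights(raters, "or")        # {2, 3, 4, 5}
aggregate_highlights(raters, "majority")  # {3, 4}
```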
3.2 Modeling

We divide the open-world entity extraction task into an extraction stage and a clustering stage. There is existing research on similar problems, e.g. Lin et al. (2012); Cao et al. (2020). However, our method is novel in that it completely avoids a predefined entity list.

3.2.1 Extraction Stage

In the extraction stage, we try to find all mentions in a text using a sequence-to-sequence model. As depicted in Figure 1a, our extraction model is based on a pre-trained cross-lingual language model (Lample and Conneau, 2019). For computational efficiency, we choose a multilayer perceptron on top of XLM instead of a conditional random field layer (Lafferty et al., 2001). We find that the simple multilayer perceptron, with take-continuous-positive-blocks decoding over the sequence, works well enough to provide high-quality mentions.

3.2.2 Semi-supervised Clustering Stage

In the clustering stage, we try to collapse all mentions referring to the same concept into a canonical entity. Intuitively, one could run the k-means algorithm on the embeddings coming from the extraction stage. However, we found the performance of this approach unacceptable for two reasons: k-means is based on a uniform distribution assumption that the embeddings do not follow, and the embeddings taken from the extraction model fail to align with the human interpretation of two mentions being the same concept.

We solve this problem with a semi-supervised, graph-based approach, in which we build a dedicated model, illustrated in Figure 1b, to predict links between mentions that represent the same underlying entity. This model is trained on a dataset specialized for mention concept similarity that we collect separately. We adopt a Siamese neural network architecture in order to scale to processing all pairs across hundreds of millions of documents during graph construction. We then run the Louvain community detection algorithm (Blondel et al., 2008) on the resulting graph to collapse close mentions into an entity. We find that this significantly improves the quality of the clusters.
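A toy, stdlib-only sketch of this stage: cosine similarity stands in for the learned Siamese link predictor, and union-find connected components stand in for Louvain community detection (which additionally splits weakly linked groups). Names, embeddings and the threshold are illustrative:

```python
import itertools

def link_score(emb_a, emb_b):
    # Stand-in for the Siamese link predictor; here, plain cosine similarity.
    dot = sum(x * y for x, y in zip(emb_a, emb_b))
    na = sum(x * x for x in emb_a) ** 0.5
    nb = sum(x * x for x in emb_b) ** 0.5
    return dot / (na * nb)

def cluster_mentions(embs, threshold=0.9):
    """Build a mention graph from pairwise link scores, then collapse
    connected mentions into entities (a stand-in for Louvain)."""
    parent = list(range(len(embs)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i, j in itertools.combinations(range(len(embs)), 2):
        if link_score(embs[i], embs[j]) >= threshold:
            parent[find(i)] = find(j)   # link i and j in the graph
    clusters = {}
    for i in range(len(embs)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

# Mentions 0 and 1 are near-duplicate concepts; mention 2 is unrelated.
embs = [[1.0, 0.0], [0.98, 0.2], [0.0, 1.0]]
cluster_mentions(embs)  # [[0, 1], [2]]
```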
Figure 1: (a) The open-world extraction model, where each sentence piece is classified as B/I/O/E; (b) the open-world link prediction model, which predicts whether two mentions refer to the same entity; (c) the closed-world linking model, which predicts the probability that a mention corresponds to each entity candidate (entity embeddings are generated offline and fetched from storage at inference time). [Model diagrams omitted.]
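The take-continuous-positive-blocks decoding over the B/I/O/E tags of Figure 1a can be sketched as follows. This is our own minimal reading of the rule, under which any maximal run of non-O tags becomes one mention span:

```python
def decode_spans(tags):
    """Decode per-token B/I/O/E tags into mention spans by taking
    contiguous non-O blocks ("take continuous positive blocks")."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "O":
            if start is not None:
                spans.append((start, i))   # end index is exclusive
                start = None
        elif start is None:
            start = i
    if start is not None:                  # block running to end of sequence
        spans.append((start, len(tags)))
    return spans

tokens = ["the", "new", "acme", "rocket", "skates", "ship", "today"]
tags   = ["O",   "O",   "B",    "I",      "E",      "O",    "O"]
decode_spans(tags)                                       # [(2, 5)]
[" ".join(tokens[s:e]) for s, e in decode_spans(tags)]   # ['acme rocket skates']
```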

4 Closed-World Entity Extraction

In this section, we discuss the data labeling and the model architecture for closed-world entity extraction.

4.1 Data Labeling

In an ideal world, we would want our raters to select mentions freely from the input text and attach the corresponding Wikipedia entity to each. But that makes it hard for raters to reach any consensus, and impossible for us to perform quality control. Instead, we make the task multiple-choice: with the help of a pre-defined dictionary, we extract beforehand a list of possible mentions, along with their potential Wikipedia link candidates. The rater then only needs to choose the positive mentions, and their corresponding Wikipedia entities, from a given list.

Similar to the open-world case, we perform quality analysis on different consensus methods. Here we treat wiki entities selected by 2 out of 5 raters as our community ground truth. This method, compared against the oracle labels provided by in-house experts, achieves an 80% chance of all extracted entities being correct, and a 70% chance of all correct entities being extracted. Both numbers further increase by 14% if we tolerate a single error. For reference, the F1 score of an average individual rater on this task is 0.68.

4.2 Modeling

Similar to the open-world model, we break the task up into an extraction stage and a linking stage.

4.2.1 Extraction Stage

Instead of finding possible entity links dynamically after the mentions are extracted, we rely on a static dictionary, mapping various mention aliases to entities, to extract all possible links in advance using fuzzy string matching. This simplifies the labeling effort, while also reducing the computation time for both training and inference. The performance then depends heavily on the quality of the dictionary. To build it, we recursively trace Wikipedia's Redirection pages, each of which defines a mapping from a mention to an entity, and Disambiguation pages, each of which maps a mention onto multiple possible entities. Various rule-based clean-ups are also performed on the mentions, the entities and the mapping.
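A sketch of how such an alias dictionary might be assembled and applied. The exact clean-up rules and the fuzzy matcher are not spelled out in the paper; this toy version lowercases aliases, uses exact phrase matching, and runs on hypothetical data:

```python
def build_alias_dict(redirects, disambiguations):
    """Build a mention-alias -> candidate-entities map from Wikipedia-style
    redirect pairs (alias -> entity) and disambiguation pages
    (alias -> [entities]). Lowercasing is a crude stand-in for clean-up."""
    alias = {}
    for mention, entity in redirects:
        alias.setdefault(mention.lower(), set()).add(entity)
    for mention, entities in disambiguations:
        alias.setdefault(mention.lower(), set()).update(entities)
    return alias

def candidate_links(text, alias):
    """Scan the text for dictionary aliases and return candidate links."""
    tokens = text.lower().split()
    found = []
    for n in (2, 1):  # longest phrases first
        for i in range(len(tokens) - n + 1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in alias:
                found.append((phrase, sorted(alias[phrase])))
    return found

alias = build_alias_dict(
    redirects=[("Big Apple", "New_York_City")],
    disambiguations=[("apple", ["Apple_Inc", "Apple_(fruit)"])],
)
candidate_links("visit the big apple", alias)
# [('big apple', ['New_York_City']), ('apple', ['Apple_(fruit)', 'Apple_Inc'])]
```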
(Wang et al., 2018). With cross-lingual pretraining,
4.2.2 Linking Stage

The linking model then computes the similarity between the mention and its candidate entities. The mention tower is similar to the open-world model: we run the input document through a language model and pool the outputs to get embeddings for the mentions. On the entity side, each entity's Wikipedia text is summarized offline into an embedding. For each mention-entity pair, the mention embedding, after a linear projection, is broadcast to take a dot product with its candidate entity embeddings, producing a relevance score, as shown in Figure 1c.

We also experimented with first predicting a mention score, as in the open-world case, but found little difference in the final entity metric. Additional supervision on salience is also added for entities, based on the number of votes received from the raters. We concatenate these scores with counter-based features, such as the prior of the mention-entity link, and feed them through a feed-forward layer to get the final linking score.
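Numerically, the broadcast dot product reduces to the following pure-Python sketch. In the real system the projection is learned and the embeddings come from XLM pooling; the tiny vectors and identity projection here are illustrative only:

```python
def linking_scores(mention_emb, entity_embs, proj):
    """Score a mention against its candidate entities: project the mention
    embedding, then dot it with each (offline) entity embedding."""
    projected = [sum(w * x for w, x in zip(row, mention_emb)) for row in proj]
    return [sum(p * e for p, e in zip(projected, ent)) for ent in entity_embs]

mention = [0.5, 1.0]
proj = [[1.0, 0.0], [0.0, 1.0]]          # identity projection for the sketch
candidates = [[0.6, 0.9], [-1.0, 0.2]]   # two candidate entity embeddings
linking_scores(mention, candidates, proj)  # ≈ [1.2, -0.3]
```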
5 Scaling Challenges

To have good coverage over various documents, our system needs to scale across languages, entity types and document types. Naively, we could develop a model for each (language, entity type, document type) triple and run a combination of models on each document. However, this would bring significant overhead in model development and model serving. Instead, our system tackles these scaling challenges with the following techniques and trains a single model.

5.1 Cross Language Model and Fine-Tuning

Transformer-based (Vaswani et al., 2017) pre-trained language models have led to strong improvements on various natural language processing tasks (Wang et al., 2018). With cross-lingual pretraining, XLM (Lample and Conneau, 2019) achieves state-of-the-art results across languages. In our work, we employ XLM and further improve its predictions by fine-tuning on multilingual data. We compare the performance of zero-shot and fine-tuned product extraction models on ads in Table 2. While the zero-shot model predicts reasonably well for Romance languages, e.g. French (fr) and Portuguese (pt), it performs poorly for Arabic (ar) and Vietnamese (vi). This is expected, since the latter have very different characteristics from English. By fine-tuning on all-language data, we see a substantial boost in model performance for all languages.

            Zero-Shot                    Fine-Tuned
Language    Precision  Recall   F1       Precision  Recall   F1
ar          0.2556     0.0676   0.1069   0.3170     0.5331   0.3976
da          0.2437     0.4037   0.3040   0.4093     0.5444   0.4673
de          0.2966     0.3670   0.3281   0.3349     0.5921   0.4279
en          0.4301     0.6750   0.5254   0.4251     0.7036   0.5300
es          0.2739     0.3500   0.3073   0.3439     0.5955   0.4360
fr          0.3499     0.3584   0.3541   0.4067     0.5988   0.4844
it          0.3157     0.3626   0.3375   0.4152     0.6146   0.4956
nl          0.2466     0.4673   0.3228   0.3316     0.5299   0.4079
pt          0.3075     0.4395   0.3618   0.4122     0.6555   0.5061
ru          0.3144     0.4467   0.3691   0.4300     0.7021   0.5334
vi          0.1886     0.0283   0.0492   0.3653     0.6888   0.4774
Overall     0.3315     0.3834   0.3556   0.3861     0.6331   0.4797

Table 2: Multilingual fine-tuning of the product name extraction model. The zero-shot model is trained on English only; the fine-tuned model is trained on all languages (one-tenth of the English sample size for each new language).

5.2 Multi-Task Learning for Extraction, Clustering, and Linking

Multi-task learning (Caruana, 1997) is a subfield of machine learning in which multiple tasks are learned simultaneously by a shared model. Such approaches offer advantages like improved data efficiency, reduced overfitting through shared representations, and faster learning by leveraging auxiliary information. They have proved effective in various applications, such as computer vision (Zhang et al., 2014) and natural language processing (Vaswani et al., 2017). In the previous subsections, we trained models separately and predicted in parallel for different entity types. This is advantageous in that we can train a model for a new entity type, or update the model for an existing entity type, without affecting the other entity models. However, it causes ever-increasing inference costs as new entity types are added: we currently have 5 entity types and 7 Transformer-based models, which means running 7 XLM encoders for every ad, web page, etc. This heavy inference cost was a major blocker for our service.

To resolve this issue, we developed a unified model structure and training framework that co-trains all entity extraction and linking models with a shared XLM encoder. Since the encoding accounts for the majority of the computation, inference time is reduced to 1/7 of what it was, unblocking the service. Table 3 displays the performance of the shared-encoder models trained with this framework; their performance is comparable to that of separately trained models. While the closed-world linking model has slightly better accuracy with co-training, the product name extraction model performs slightly worse. This is probably because a single XLM of moderate size cannot encode all the information required by the different entity extraction heads; we expect that increasing the encoder capacity will reduce these conflicts. To sum up, the unified model permits new entity types at little inference cost and with only a slight performance drop.

Task          Metric      Separate Models   Shared-Encoder Models
Extraction    Precision   0.4301            0.4171
              Recall      0.6750            0.6671
              F1          0.5254            0.5133
Closed-World  Accuracy    0.6729            0.6815

Table 3: Co-training the product name extraction and closed-world linking models with a shared XLM encoder.
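Structurally, the unified model amounts to running the expensive encoder once per document and fanning out to cheap task heads. A toy sketch (class, head names and the counter are ours, with trivial functions standing in for XLM and the heads):

```python
class SharedEncoderModel:
    """One (expensive) encoder pass per document, re-used by several
    lightweight task heads, instead of one full encoder per entity type."""
    def __init__(self, encoder, heads):
        self.encoder = encoder      # e.g. an XLM-style encoder
        self.heads = heads          # task name -> cheap head function

    def predict(self, document):
        encoded = self.encoder(document)                    # runs once
        return {task: head(encoded) for task, head in self.heads.items()}

calls = {"n": 0}
def encoder(doc):
    calls["n"] += 1
    return doc.lower()              # toy stand-in for XLM encoding

model = SharedEncoderModel(encoder, {
    "product_extraction": lambda enc: "acme" in enc,
    "closed_world_linking": lambda enc: enc.count("acme"),
})
out = model.predict("Acme rocket skates by ACME")
calls["n"]   # 1: the encoder ran once, serving both heads
```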
5.3 Cross Document Transfer Learning

Transfer learning aims to improve the performance of target models on target domains by transferring the knowledge contained in different but related source domains (Zhuang et al., 2021). Different transfer learning approaches have been developed, from zero-shot transfer learning (Xian et al., 2017) to few-shot transfer learning (Vinyals et al., 2016). We incorporate the transfer learning framework into our system to solve the cross-document-type challenge. We run experiments on zero-shot and few-shot transfer learning, shown in Table 4. As we can see, transfer learning boosts the performance of the model on both document types.

                          Ads Model              Ads+Web Pages Model
Task          Metric      Ads       Web Pages    Ads       Web Pages
Extraction    Precision   0.4301    0.4519       0.4315    0.5148
              Recall      0.6750    0.5167       0.6951    0.6906
              F1          0.5254    0.4821       0.5325    0.5899
Closed-World  Accuracy    0.6729    0.6106       0.6811    0.6852

Table 4: Transfer learning for the product name extraction and closed-world linking models between ads and web pages. The first model is trained on ads only; the second is trained on both ads and web pages. The sample sizes of ads and web pages are the same.

6 Conclusion and Future Work

In this paper, we presented our platform for entity extraction at the scale of a giant internet company and discussed the practical learnings from our work. In the future, we would like to improve the efficiency of the Transformer-based language models, as discussed by Tay et al. (2020).
References

Xavier Amatriain and Justin Basilic. 2012. Netflix recommendations: Beyond the 5 stars. The Netflix Tech Blog.

Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. 2008. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008.

Ermei Cao, Difeng Wang, Jiacheng Huang, and Wei Hu. 2020. Open knowledge enrichment for long-tail entities. In Proceedings of The Web Conference 2020, pages 384–394.

Rich Caruana. 1997. Multitask learning. Machine Learning, 28:41–75.

Hanxiong Chen, Xu Chen, Shaoyun Shi, and Yongfeng Zhang. 2019. Generate natural language explanations for recommendation. In Proceedings of the 29th ACM International Conference on Information and Knowledge Management, pages 755–764.

James Davidson, Benjamin Liebald, Junning Liu, Palash Nandy, Taylor Van Vleet, Ullas Gargi, Sujoy Gupta, Yu He, Mike Lambert, Blake Livingston, and Dasarathi Sampath. 2010. The YouTube video recommendation system. In Proceedings of the Fourth ACM Conference on Recommender Systems, pages 293–296.

John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, pages 282–289.

Xuan Nhat Lam, Thuc Vu, Trong Duc Le, and Anh Duc Duong. 2008. Addressing cold-start problem in recommendation systems. In Proceedings of the 2nd International Conference on Ubiquitous Information Management and Communication, pages 208–211.

Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. In Advances in Neural Information Processing Systems, volume 32.

Thomas Lin, Oren Etzioni, et al. 2012. No noun phrase left behind: Detecting and typing unlinkable entities. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 893–903.

Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. 2020. Efficient transformers: A survey. arXiv.

Aditya Srinivas Timmaraju, Angli Liu, and Pushkar Tripathi. 2020. Addressing challenges in building web-scale content classification systems. In 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 8134–8138.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30.

Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, and Daan Wierstra. 2016. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, volume 29.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355.

Yongqin Xian, Bernt Schiele, and Zeynep Akata. 2017. Zero-shot learning — the good, the bad and the ugly. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, pages 3077–3086.

Zhanpeng Zhang, Ping Luo, Chen Change Loy, and Xiaoou Tang. 2014. Facial landmark detection by deep multi-task learning. In Computer Vision – ECCV 2014, pages 94–108.

Fuzhen Zhuang, Zhiyuan Qi, Keyu Duan, Dongbo Xi, Yongchun Zhu, Hengshu Zhu, Hui Xiong, and Qing He. 2021. A comprehensive survey on transfer learning. Proceedings of the IEEE, 109:43–76.
