Mining Structures of Factual Knowledge From Text - 9781681733937 - WEB PDF
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in
any form or by any means—electronic, mechanical, photocopy, recording, or any other except for brief quotations
in printed reviews, without the prior permission of the publisher.
DOI 10.2200/S00860ED1V01Y201806DMK015
Lecture #15
Series Editors: Jiawei Han, University of Illinois at Urbana-Champaign
Lise Getoor, University of California, Santa Cruz
Wei Wang, University of California, Los Angeles
Johannes Gehrke, Cornell University
Robert Grossman, University of Chicago
Series ISSN
Print 2151-0067 Electronic 2151-0075
Mining Structures of
Factual Knowledge from Text
An Effort-Light Approach
Xiang Ren
University of Southern California
Jiawei Han
University of Illinois at Urbana-Champaign
Morgan & Claypool Publishers
ABSTRACT
Real-world data, though massive, is largely unstructured, existing in the form of natural-language
text. It is challenging but highly desirable to mine structures from such massive text data
without extensive human annotation and labeling. In this book, we investigate the principles and
methodologies of mining structures of factual knowledge (e.g., entities and their relationships)
from massive, unstructured text corpora.
Departing from many existing structure extraction methods that have heavy reliance on
human annotated data for model training, our effort-light approach leverages human-curated
facts stored in external knowledge bases as distant supervision and exploits rich data redun-
dancy in large text corpora for context understanding. This effort-light mining approach leads
to a series of new principles and powerful methodologies for structuring text corpora, includ-
ing: (1) entity recognition, typing, and synonym discovery; (2) entity relation extraction; and
(3) open-domain attribute-value mining and information extraction. This book introduces this
new research frontier and points out some promising research directions.
KEYWORDS
mining factual structures, information extraction, knowledge bases, entity recog-
nition and typing, relation extraction, entity synonym mining, distant supervision,
effort-light approach, classification, clustering, real-world applications, scalable al-
gorithms
To my wife Dora, son Lawrence, and grandson Emmett for their love.
– Jiawei Han
Contents
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Overview of the Book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.1 Part I: Identifying Typed Entities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1.2 Part II: Extracting Typed Entity Relationships . . . . . . . . . . . . . . . . . . . 10
1.1.3 Part III: Toward Automated Factual Structure Mining . . . . . . . . . . . . . 14
2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.1 Entity Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2 Relation Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3 Distant Supervision from Knowledge Bases . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.4 Mining Entity and Relation Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.5 Common Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3 Literature Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.1 Hand-Crafted Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2 Traditional Supervised Learning Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2.1 Sequence Labeling Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2.2 Supervised Relation Extraction Methods . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3 Weakly Supervised Extraction Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3.1 Semi-Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3.2 Pattern-Based Bootstrapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.4 Distantly Supervised Learning Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.5 Learning with Noisy Labeled Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.6 Open-Domain Information Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
13 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
13.1 Structuring Life Science Papers: The Life-iNet System . . . . . . . . . . . . . . . . . 153
13.2 Extracting Document Facets from Technical Corpora . . . . . . . . . . . . . . . . . . 156
13.3 Comparative Document Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
14 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
14.1 Effort-Light StructMine: Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
14.2 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
Acknowledgments
The authors would like to acknowledge Wenqi He, Liyuan Liu, Meng Qu, Ellen Wu, Qi Zhu,
Jingbo Shang, and Meng Jiang for their tremendous research collaborations.
Han’s work was supported in part by U.S. Army Research Lab. under Cooperative Agree-
ment No. W911NF-09-2-0053 (NSCTA), DARPA under Agreement No. W911NF-17-C-
0099, National Science Foundation IIS 16-18481, IIS 17-04532, and IIS-17-41317, DTRA
HDTRA11810026, and grant 1U54GM114838 awarded by NIGMS through funds provided
by the trans-NIH Big Data to Knowledge (BD2K) initiative (www.bd2k.nih.gov).
Ren’s work was sponsored by Google PhD Fellowship, ACM SIGKDD Scholarship, and
Richard T. Cheng Endowed Fellowship.
Any opinions, findings, and conclusions or recommendations expressed in this document
are those of the author(s) and should not be interpreted as the views of any funding agencies.
CHAPTER 1
Introduction
The success of data mining technology is largely attributed to the efficient and effective analysis
of structured data. However, the majority of existing data generated in our computerized society
is unstructured or loosely structured, and is typically “text-heavy.” People are immersed in vast
amounts of natural-language text data, ranging from news articles, social media posts, online
advertisements, and scientific publications, to a wide range of textual information from vari-
ous domains (e.g., medical notes and corporate reports). Big data leads to big opportunities to
uncover structures of real-world entities (e.g., person, company, product) and relations (e.g.,
employee_of, manufacture) from massive text corpora. Can machines automatically identify
person, organization, and location entities in a news corpus and use them to summarize recent
news events (Fig. 1.1)? Can we mine different relations between proteins, drugs, and diseases
from massive and rapidly emerging life science literature? How would one represent factual
structures hidden in a collection of medical reports to support answering precise queries or run-
ning data mining tasks?
Figure 1.1: An illustration of entity and relation structures extracted from some text data. The
nodes correspond to entities and the links represent their relationships.
While accessing documents in a gigantic collection is no longer difficult with the
help of data management and information retrieval systems, people, especially those who are
not domain experts, struggle to gain insights from such a large volume of text data: document
understanding calls for in-depth content analysis, content analysis itself may require domain-
specific knowledge, and over a large corpus, a complete read and analysis by domain experts will
invariably be subjective, time-consuming, and relatively costly. Moreover, text data is highly di-
verse: Corpora from different domains, genres or languages typically require effective processing
of a wide range of language resources (e.g., grammars, vocabularies, gazetteers). The “massive”
and “messy” nature of text data poses significant challenges to creating tools for automated pro-
cessing and algorithmic analysis of contents that scale with text volume.
This book introduces principled and scalable methods for the mining of typed entity and
relation structures from unstructured text corpora, with a focus on overcoming the barriers in
dealing with text corpora of various domains, genres, and languages. State-of-the-art informa-
tion extraction (IE) approaches have relied on large amounts of task-specific labeled data (e.g.,
annotating terrorist attack-related entities in web forum posts written in Arabic), to construct
machine-learning models (e.g., deep neural networks). However, even though domain experts
can manually create high-quality training data for specific tasks as needed, both the scale and
efficiency of such a manual process are limited. The research discussed in this book harnesses the
power of “big text data” and focuses on creating generic solutions for efficient construction of cus-
tomized machine-learning models for factual structure extraction, relying on only limited amounts
of (or even no) task-specific training data.
The main coverage of this book includes: (1) entity recognition and typing, which automatically
identifies token spans of real-world entities of interest in text and classifies them into a
set of coarse-grained entity types; (2) relation extraction, which determines what kind of relation
is expressed between two entities based on the sentences where they co-occur; and (3) auto-
mated factual structure mining, which aims to extract open-domain factual structures such as
entity attributes and open-domain relation tuples. We provide scalable algorithmic approaches
that leverage external knowledge bases (KBs) as sources of supervision and exploit data redun-
dancy in massive text corpora, and we show how to use them in large-scale, real-world appli-
cations, including structured exploration and analysis of life sciences literature, extracting doc-
ument facets from technical documents, document summarization, entity attribute discovery,
and open-domain information extraction.
Figure 1.2: Overview of related work. Our method, effort-light StructMine, requires the least
human labeling and feature-engineering effort compared with prior art.
Figure 1.3: Illustration of the proposed framework. Effort-light StructMine leverages existing
structures stored in external KBs to automatically generate large amounts of corpus-specific, po-
tentially noisy training data, and builds corpus-specific models for extracting entity and relation
structures.
automatically identify synonyms of entities from massive text corpora or other kinds of data
sources? Extracting a complete list of synonym strings for entities of interest can benefit many
downstream applications. For example, when doing web search or information retrieval, one
can leverage entity synonyms to enhance the process of query expansion. In topic modeling, by
forcing synonyms of an entity to be assigned in the same latent topic, one can constrain the
topic modeling process to yield high-quality topic representations. In this part of the book, we
introduce methods that automate the process of identifying entity synonyms from text corpora
with distant supervision from external KBs.
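As a concrete illustration of the query-expansion use case, the sketch below substitutes known synonyms of an entity into a search query. The synonym dictionary here is a hypothetical, hand-filled stand-in for the output of an automatic synonym-discovery method.

```python
# Sketch of synonym-based query expansion. ENTITY_SYNONYMS is an
# illustrative, hand-filled dictionary, not the output of any real system.
ENTITY_SYNONYMS = {
    "ford motor company": ["ford", "ford motor co."],
    "united states": ["usa", "u.s.", "america"],
}

def expand_query(query):
    """Return the query plus variants with known entity synonyms substituted."""
    expanded = [query]
    lowered = query.lower()
    for canonical, synonyms in ENTITY_SYNONYMS.items():
        if canonical in lowered:
            for syn in synonyms:
                expanded.append(lowered.replace(canonical, syn))
    return expanded

print(expand_query("cars made by Ford Motor Company"))
# ['cars made by Ford Motor Company', 'cars made by ford',
#  'cars made by ford motor co.']
```

A retrieval system would issue all expanded variants (or OR them into one query), improving recall for entities with many surface names.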
Entity Recognition and Typing
How can we identify token spans of real-world entities and their categories from text?
One of the most important factual structures in text is the entity. Recognizing entities in
text and labeling their types (e.g., person, location) enables effective structured analysis of
unstructured text corpora (Chapter 4). Traditional named entity recognition (NER) systems are
usually designed for several major types and general domains, and so require additional steps
for adaptation to a new domain and new types. Our method, ClusType [Ren et al., 2015],
aims at identifying typed entities of interest from text without task-specific human supervision.
While most existing NER methods treat the problem as a sequence tagging task and require
significant amounts of manually labeled sentences (with typed entities), ClusType makes use
of entity information stored in freely-available KBs to create large amounts of (yet potentially
noisy) labeled data and infers types of other entities mentioned in text in a robust and efficient
way (see Fig. 1.4).
We formalize the entity recognition and typing task as a distantly supervised learning
problem. The solution workflow is: (1) detect entity mentions from a corpus; (2) map candidate
entity mentions to KB entities of target types; and (3) use those confidently mapped {mention,
type} pairs as labeled data to infer the types of remaining candidate mentions. ClusType runs
data-driven phrase mining to generate entity mention candidates and relation phrases (thus
having no reliance on a pre-trained name recognizer), and enforces the principle that relation
phrases should be softly clustered when propagating type information between their argument
entities. We formulate a joint optimization to integrate type propagation via relation phrases
and clustering of relation phrases.
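The propagation principle above can be illustrated with a toy label-propagation sketch: mentions that co-occur with the same relation phrase exchange type evidence, while KB-mapped seed mentions keep their labels fixed. All names and data below are illustrative, and this is not ClusType's actual joint optimization (which additionally clusters the relation phrases softly).

```python
# Toy type propagation over a mention-relation-phrase graph (illustrative).
from collections import defaultdict

TYPES = ["person", "organization", "location"]

# mention -> relation phrases it occurs with (hypothetical corpus)
mention_phrases = {
    "Barack Obama": ["was born in"],
    "Donald Trump": ["was born in"],
    "Hawaii": ["is located in"],
}
# seed labels obtained by confidently mapping mentions to a KB
seed = {"Barack Obama": "person", "Hawaii": "location"}

def propagate(iterations=10):
    # one type-score vector per mention; seeds start as one-hot vectors
    scores = {m: {t: 0.0 for t in TYPES} for m in mention_phrases}
    for m, t in seed.items():
        scores[m][t] = 1.0
    # invert the index: relation phrase -> mentions it connects
    phrase_mentions = defaultdict(list)
    for m, phrases in mention_phrases.items():
        for p in phrases:
            phrase_mentions[p].append(m)
    for _ in range(iterations):
        for m in mention_phrases:
            if m in seed:
                continue  # seed labels stay clamped
            # average the scores of neighbors that share a relation phrase
            neighbors = [n for p in mention_phrases[m]
                         for n in phrase_mentions[p] if n != m]
            if neighbors:
                scores[m] = {t: sum(scores[n][t] for n in neighbors) / len(neighbors)
                             for t in TYPES}
    return scores

scores = propagate()
print(max(scores["Donald Trump"], key=scores["Donald Trump"].get))  # person
```

Because "Donald Trump" shares the relation phrase "was born in" with the seed mention "Barack Obama," the person score flows to it; the joint optimization in ClusType additionally learns which relation phrases belong to the same cluster so that propagation is robust to sparse phrases.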
Highlights:
• Problem: We study the problem of distantly supervised entity recognition and typing in a
domain-specific corpus, where only a corpus and a reference KB are given as input.
• Methodology: We introduce an efficient, domain-independent phrase segmentation algo-
rithm for extracting entity mentions and relation phrases. Entity types can be estimated
for entity mentions by solving the clustering-integrated type propagation.
• Effectiveness on real-world corpora: Our experiments on three datasets of different genres—
news, reviews, and tweets—demonstrate that ClusType achieves significant improvements
over the state-of-the-art.
Highlights:
• Problem: The first systematic study of noisy labels in distant supervision for the entity typing
problem.
• Methodology 1: An embedding-based framework, PLE [Ren et al., 2016a], is proposed
that models and measures semantic similarity between text features and type labels and is
robust to noisy labels.
• Methodology 2: A novel rank-based optimization problem is formulated to model noisy
type labels and type correlations [Ren et al., 2016b].
• Effectiveness on real data: The proposed methods achieve significant improvement over the
state of the art on multiple fine-grained typing datasets, and demonstrate their effectiveness
in recovering true labels from the noisy label set.
Figure 1.6: An illustration of the joint entity and relationship extraction problem.
To overcome these challenges, we study the problem of joint extraction of typed entities
and relationships with KBs. Given a domain-specific corpus and an external KB, we aim to detect
relation mentions together with their entity arguments from text, and categorize each in context
by relation types of interests. Our method, CoType [Ren et al., 2017a], approaches the joint
extraction task as follows: (1) it designs a domain-agnostic text segmentation algorithm to detect
candidate entity mentions with distant supervision (i.e., minimal linguistic assumption); (2) it
models the mutual constraints between the types of relation mentions and the types of their entity
arguments to enable feedback between the two subtasks; and (3) it models the true type labels in
a candidate type set as latent variables and requires only the most confident type to be relevant
to the mention. CoType achieves the state-of-the-art relation extraction performance under
distant supervision, and demonstrates robust domain-independence across various datasets.
Highlights:
• Methodology: A novel distant supervision framework, CoType, is proposed, which extracts
typed entities and relationships in domain-specific corpora with minimal linguistic as-
sumption.
• Effectiveness on real data: Experiments with three public datasets demonstrate that Co-
Type improves the performance of state-of-the-art systems of entity typing and relation
extraction significantly, demonstrating robust domain-independence.
Summary. The core of the book focuses on developing effective, human effort-light, and scal-
able methods for extracting factual structures from massive, domain-specific text corpora. Our
contributions are in the area of text mining and information extraction, within which we focus
on domain-independent and noise-robust approaches using distant supervision (in conjunction
with publicly available KBs). The work has broad impact on a variety of applications: KB con-
struction, question-answering systems, structured search and exploration of text data, recom-
mender systems, network analysis, and many other text mining tasks. Next, we present research
background on mining structured factual information and introduce related, useful notions and
definitions.
CHAPTER 2
Background
In this chapter we introduce the key definitions and notions on information extraction and
knowledge graph construction that are useful for understanding the methods and algorithms
described in the book. At the end of this chapter we give a table with the common notations
and their descriptions.
Example 2.1 Noun Phrase In the sentence “The quick, brown fox jumped over the lazy dog,”
there are two noun phrases: “the quick, brown fox” and “the lazy dog.” The former is the
subject of the sentence and the latter is the object.
Proper Name: A proper name is a noun phrase that in its primary application refers to a unique
entity (e.g., University of Southern California, Computer Science, United States), as distinguished
from a common noun which usually refers to a class of entities (e.g., city, person, company), or
non-unique instances of a specific class (e.g., this city, other people, our company). When a noun
refers to a unique entity, it is also called a proper noun.
Entity: In information extraction and text mining, an entity (or named entity) is a real-world
object, such as person, location, organization, product, and scientific concept, that can be denoted
with a proper name. It can be abstract or have a physical existence. Examples of named entities
include Barack Obama, Chicago, and University of Illinois. An entity is denoted as e in this book.
Remark: Ambiguous Proper Names for Named Entities. In the expression “named entity,”
the word named restricts the scope to those entities for which one or more strings stand consistently
for some referent. In practice, one named entity may be referred to by multiple proper names,
and one proper name may refer to multiple named entities. For example, the entity, the automotive
company created by Henry Ford in 1903, can be referred to by the proper names “Ford” or “Ford Motor
Company,” although “Ford” can refer to many other entities as well.
Entity Mention: An entity mention, denoted by m, is a token span (i.e., a sequence of words)
in text that refers to a named entity. It consists of the proper name and the token index in the
sentence.
Example 2.2 Entity Mention. In the sentences “I had the pulled pork sandwich with
coleslaw and baked beans for lunch. The pulled pork sandwich is the best I’ve tasted in
Phoenix!,” the entity mentions are boldfaced. The proper name “pulled pork sandwich” appears
twice, corresponding to the same named entity but to different entity mentions (and thus
to different entity mention IDs).
Entity Type: An entity type (or entity class, entity category) is a conceptual label for a collection
of entities that share the same characteristics and attributes (e.g., person, artist, singer,
location). Entities with the same entity types are similar to one another. Entity type instances
refer to entities that are assigned with a specific entity type. In many applications, a set of entity
types of interest is pre-specified by domain experts, who provide example entity type
instances. There are also cases in which entity types are related to each other (rather than mutually
exclusive), forming a complex, DAG-structured type hierarchy.
Example 2.3 Entity Types in ACE Shared Task The Automatic Content Extraction (ACE)
Program [Doddington et al., 2004] aimed to develop information extraction technology to support
automatic processing of natural language data. The Entity Detection and Tracking
(EDT) task of ACE focuses on seven entity types: Person, Organization, Location,
Facility, Weapon, Vehicle, and Geo-Political Entity. Each type was further divided into
subtypes (for instance, Organization subtypes include Government, Commercial, Education,
Non-profit, Other).
2.2 RELATION STRUCTURES
This section introduces the basic concepts on relations. We start with the definition of relation,
followed by definitions of relation instance and mention.
Relation: A relation (or relation type, relation class), denoted as r, is a (pre-defined) predication
about two or more entities. For example, from the sentence fragment “Facebook co-founder Mark
Zuckerberg,” one can extract the FounderOf relation between the entities Mark Zuckerberg and Face-
book. In this book, we focus on binary relations, that is, relations between two entities.
Example 2.4 Relations in ACE Shared Task Much of the prior work on extracting relations
from text is based on the task definition from ACE program [Doddington et al., 2004]. A
set of major relation types and their subtypes are defined by ACE. Examples of ACE major
relation types include physical (an entity is physically near another entity), personal/social
(a person is a family member of another person), and employment/affiliation (a person is
employed by an organization).
Relation Instance: A relation instance denotes a relationship over two or more entities in a
specific relation. When only considering binary relations, a relation instance can be represented
as a triple consisting of a pair of entities ei and ej and their relation type r, i.e., (ei, r, ej).
Entity Argument: The two entities involved in a relation instance are referred to as entity
arguments; the former is the head entity and the latter the tail entity.
Relation Mention: A relation mention, z, denotes a specific occurrence of some relation instance
in text. It records the two entity mentions for the pair of entity arguments, the relation
type between these two entities, and the sentence s where the relation mention is found, i.e.,
z = (mi, r, mj; s).
Example 2.5 Relation Mention Suppose we are given two sentences: “Obama was born in
Hawaii, USA” (s1) and “Barack Obama, the president of United States” (s2). There are two relation
mentions between the entities Barack Obama and United States: z1 = (Obama, BirthPlace,
USA; s1) and z2 = (Barack Obama, PresidentOf, United States; s2). Although the entity
arguments are the same, the two relation mentions have different relation types based on the
sentence context.
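The definitions above map naturally onto small record types. A possible encoding is sketched below (the field names are our own choices for illustration, not notation from the book):

```python
# Record types for relation instances and relation mentions (illustrative).
from dataclasses import dataclass

@dataclass(frozen=True)
class RelationInstance:
    head: str      # head entity e_i
    relation: str  # relation type r
    tail: str      # tail entity e_j

@dataclass(frozen=True)
class RelationMention:
    head_mention: str
    relation: str
    tail_mention: str
    sentence: str  # the sentence s in which the mention occurs

# The two relation mentions from Example 2.5:
z1 = RelationMention("Obama", "BirthPlace", "USA",
                     "Obama was born in Hawaii, USA")
z2 = RelationMention("Barack Obama", "PresidentOf", "United States",
                     "Barack Obama, the president of United States")
# same entity pair, different relation types in different sentence contexts
assert z1.relation != z2.relation
```

Keeping the sentence inside the mention record is what distinguishes a relation mention from a relation instance: the instance abstracts away the textual context, while the mention pins the relation to one occurrence.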
Fact: A fact in a KB can refer to either a binary relation triple in the form of (ei, r, ej) or an is-A
relation between an entity and a concept, such as Facebook is a company.
Formally, a KB, Ψ, consists of a set of entities EΨ and curated facts on both relation instances
IΨ and entity types TΨ (i.e., is-A relations between entities and their entity types). The set of
relation types in the KB is denoted as RΨ.
Example 2.6 Freebase Curating a universal knowledge graph is infeasible for most individuals
and organizations. Freebase [Bollacker et al., 2008] therefore distributed the effort across as many
shoulders as possible through crowdsourcing: it is a public, editable knowledge graph with schema
templates for most kinds of possible entities (e.g., persons,
itable knowledge graph with schema templates for most kinds of possible entities (e.g., persons,
cities, and movies). After MetaWeb, the company running Freebase, was acquired by Google,
Freebase was shut down on March 31, 2015. The last version of Freebase contains roughly 50
million entities and 3 billion facts. Freebase’s schema comprises roughly 27,000 entity types and
38,000 relation types.1
Example 2.7 DBpedia DBpedia is a knowledge graph extracted from structured data in
Wikipedia [Auer et al., 2007]. The main source for this extraction is the key-value pairs in
the Wikipedia infoboxes. In a crowd-sourced process, types of infoboxes are mapped to the
DBpedia ontology, and keys used in those infoboxes are mapped to properties in that ontology.
Based on those mappings, a knowledge graph can be extracted. The most recent version of the
main DBpedia (i.e., DBpedia 2016-10, extracted from the English Wikipedia based on dumps
from October 2016) contains 6.6 million entities and 13 billion facts about those entities. The
ontology comprises 735 classes and 2,800 relations.
1 https://developers.google.com/freebase/
2.4 MINING ENTITY AND RELATION STRUCTURES
Here we describe the basic tasks in mining factual structures from text corpora, followed by a
short introduction on related information extraction tasks.
Entity Recognition and Typing: Entity recognition and typing (or named entity recognition)
addresses the problem of identification (detection) and classification of pre-defined types of entities,
such as organization (e.g., “United Nations”), person (e.g., “Barack Obama”), and location
(e.g., “Los Angeles”). The detection part aims to find the token spans of entities mentioned
in text (i.e., entity mentions), and the classification part aims to assign the suitable type to each
entity mention based on its sentence context.
Fine-grained Entity Typing: The goal of fine-grained entity typing is to classify each entity
mention m (based on its sentence context s) into a pre-defined set of types, where the types are
correlated and organized into a tree-structured type hierarchy Y. Each entity mention is
assigned an entity type path—a path in the given type hierarchy that may not end at a leaf
node. For example, in Fig. 2.1, the entity mention “Donald Trump” is assigned the type
path “person-artist-actor” based on the given sentence.
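A tree-structured type hierarchy and its type paths can be sketched with a simple parent map, following the "person-artist-actor" example. The hierarchy entries here are illustrative stand-ins for a real fine-grained type ontology.

```python
# Minimal tree-structured type hierarchy: each type points to its parent;
# root types (e.g., "person") have no entry. Data is illustrative.
PARENT = {
    "artist": "person",
    "actor": "artist",
    "singer": "artist",
    "city": "location",
}

def type_path(entity_type):
    """Walk from the given type up to its root; return the root-to-leaf path."""
    path = [entity_type]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return list(reversed(path))

print("-".join(type_path("actor")))  # person-artist-actor
```

Note that a valid type path may stop above the leaves: `type_path("artist")` yields "person-artist", a path that does not end at a leaf node, exactly as the definition allows.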
Relation Extraction: Relation extraction aims to detect and classify pre-defined relationships
between entities recognized in text. In the corpus-level relation extraction setting, all sentences {s}
where a pair of entities (ei, ej) (proper names) co-occurs are collected as evidence for determining
the appropriate relation type r. In mention-level relation extraction, the correct label for a
relation mention z is determined based on the sentence in which it occurs (i.e., s). In particular, a label
(class) called “None” is included in the label set so as to classify a false-positive candidate as
expressing “no relation.”
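The two settings can be sketched as follows. The keyword matcher is a stand-in for a learned classifier, and the labels and sentences are illustrative.

```python
# Toy sketch of mention-level vs. corpus-level relation extraction.
from collections import Counter

def classify_mention(sentence):
    """Mention-level: label one relation mention from its sentence context."""
    s = sentence.lower()
    if "born in" in s:
        return "BirthPlace"
    if "president of" in s:
        return "PresidentOf"
    return "None"  # false-positive candidate: no pre-defined relation

def corpus_level_relation(sentences):
    """Corpus-level: aggregate evidence over all sentences where the pair co-occurs."""
    votes = Counter(classify_mention(s) for s in sentences)
    return votes.most_common(1)[0][0]

sents = ["Obama was born in Hawaii, USA",
         "Obama returned to Hawaii, where he was born in 1961",
         "Obama visited Hawaii last year"]
print(corpus_level_relation(sents))  # BirthPlace
```

The third sentence is correctly absorbed as a "None" vote rather than forced into a pre-defined relation, which is exactly why the extra label is included.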
2.5 COMMON NOTATIONS
We provide the most common notations and their brief definitions in Table 2.1. More specific
notations used to explain proposed methods are introduced in the corresponding chapters.
Table 2.1: Common notations and definitions used throughout the book
Notation      Definition
s             Sentence
d, D          Document, corpus
e             Entity
t, T          Entity type, entity type set
m             Entity mention
r, R          Relation type, relation type set
(ei, r, ej)   Relation instance of type r between entities ei and ej
z             Relation mention
E             Finite set of entities in a corpus
Z             Finite set of relationships between entities in a corpus
G = (E, Z)    Directed, labeled graph that represents a StructNet
Ψ             Knowledge base (e.g., Freebase, DBpedia)
EΨ            Set of entities in the KB
TΨ            Entity types in the KB
IΨ            Set of relation instances in the KB
RΨ            Relation types in the KB
Y             Tree-structured entity type hierarchy
CHAPTER 3
Literature Review
This chapter provides an overview of prior art and related studies on mining typed entities and
relationships from text. Methods are categorized and organized based on the amounts of human
labeled data required in the model training process, which also demonstrates the trajectory of
research on reducing human supervision in entity and relation structure mining. We also review
techniques developed for learning with noisy labeled data as well as open-domain information
extraction, followed by a summary of our contributions.
The existing methods on mining entity and relation structures can be roughly categorized
along two dimensions: (1) the amount of human supervision required; and (2) the extraction
task (problem formulation) it addresses. Table 3.1 gives a few examples for each category.
A method can be fully hand-crafted, supervised, weakly supervised, or distantly supervised. Along
the second dimension, the problem formulation of the task can be sequence labeling
(e.g., CRFs), transductive classification (e.g., pattern bootstrapping and label propagation), or
inductive classification (e.g., SVMs). More in-depth discussion of the literature related to
concrete tasks and the proposed approaches can be found in each chapter.
Tuning a hand-crafted system to achieve good performance on the corpus of a specific domain,
genre, or language does require some degree of skill.
It also generally requires an annotated corpus which can be used to evaluate the rule set after
each revision; without such a corpus there is a tendency—after a certain point—for added rules
to actually worsen overall performance. Hand-crafted extraction requires human experts to define
rules, regular expressions, or program snippets for performing the extraction. That person needs
to be a domain expert and a programmer, and to possess decent linguistic understanding to be
able to develop robust extraction rules.
Figure 3.1: An illustration of hand-crafted extraction methods for extracting entity structures.
Figure 3.2: An illustration of supervised learning methods for extracting entity structures.
corpus. Such methods, which train a fully supervised extraction model, are considered in this
section (see Fig. 3.2).
CHAPTER 4
Entity Recognition and Typing with Knowledge Bases
Entity recognition is an important task in text analysis. Identifying token spans as entity men-
tions in documents and labeling their types (e.g., people, product, or food) enables effective
structured analysis of unstructured text corpora. The extracted entity information can be used in
a variety of ways (e.g., to serve as primitives for information extraction [Schmitz et al., 2012]
and KB population [Dong et al., 2014]). Traditional named entity recognition systems [Nadeau
and Sekine, 2007, Ratinov and Roth, 2009] are usually designed for several major types (e.g.,
person, organization, location) and general domains (e.g., news), and so require additional steps
for adaptation to a new domain and new types.
Entity-linking techniques [Shen et al., 2014] map entity mentions detected
in text to entities in KBs like Freebase [Bollacker et al., 2008], where type information can
be collected. But most of such information is manually curated, and thus the set of entities
so obtained is of limited coverage and freshness (e.g., over 50% of entities mentioned in Web
documents are unlinkable [Lin et al., 2012]). The rapid emergence of large, domain-specific
text corpora (e.g., product reviews) poses significant challenges to traditional entity recognition
and entity-linking techniques and calls for methods of recognizing entity mentions of target
types with minimal or no human supervision, and with no requirement that entities can be
found in a KB.
There are broadly two kinds of efforts toward that goal: weak supervision and distant su-
pervision. Weak supervision relies on manually specified seed entity names in applying pattern-
based bootstrapping methods [Gupta and Manning, 2014, Huang and Riloff, 2010] or label
propagation methods [Talukdar and Pereira, 2010] to identify more entities of each type. Both
methods assume the seed entities are unambiguous and sufficiently frequent in the corpus, which requires careful seed selection by humans [Kozareva and Hovy, 2010]. Distant supervision
is a more recent trend, aiming to reduce expensive human labor by utilizing entity information
in KBs [Lin et al., 2012, Nakashole et al., 2013] (see Fig. 4.1). The typical workflow is: (i) detect
entity mentions from a corpus; (ii) map candidate mentions to KB entities of target types; and
(iii) use those confidently mapped {mention, type} pairs as labeled data to infer the types of
remaining candidate mentions.
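As an illustration, the three-step workflow can be sketched as follows; the toy knowledge base, the capitalized-span mention detector, and the example sentence are hypothetical stand-ins, not the components used in this book.

```python
import re

# Toy knowledge base: entity name -> type (hypothetical entries).
KB = {"Barack Obama": "person", "Chicago": "location"}

def detect_mentions(sentence):
    """Step (i): detect candidate mentions (toy: maximal capitalized spans)."""
    return [m.strip() for m in re.findall(r"(?:[A-Z][a-z]+ ?)+", sentence)]

def distant_label(corpus):
    """Steps (ii)-(iii): map candidates to KB entities of target types;
    confidently mapped pairs become labeled data, the rest stay unlinkable."""
    labeled, unlinkable = [], []
    for sentence in corpus:
        for mention in detect_mentions(sentence):
            if mention in KB:
                labeled.append((mention, KB[mention]))  # {mention, type} pair
            else:
                unlinkable.append(mention)  # type must be inferred later
    return labeled, unlinkable
```

Running it on a sentence such as "Barack Obama met Jerry Fuchs in Chicago." would label the two KB entities and leave "Jerry Fuchs" as an unlinkable candidate whose type must be inferred in step (iii).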
Let M = {m₁, …, m_M} denote the set of M candidate entity mentions extracted from D. Suppose a subset of entity mentions M_L ⊆ M can be confidently mapped to entities in Ψ. The type of a linked candidate m ∈ M_L can be obtained based on its mapped entity e(m) (see Section 4.4.1). This work focuses on predicting the types of the unlinkable candidate mentions M_U = M \ M_L, where M_U may consist of (1) mentions of emerging entities which are not in Ψ; (2) new names of existing entities in Ψ; and (3) invalid entity mentions. Formally, we define the problem of distantly supervised entity recognition as follows.
Problem 4.1 Entity Recognition and Typing. Given a document collection D, a target type set T, and a KB Ψ, our task aims to: (1) extract candidate entity mentions M from D; (2) generate seed mentions M_L with Ψ; and (3) for each unlinkable candidate mention m ∈ M_U, estimate its type indicator vector y_m to predict its type. In our study, we assume each mention within a sentence is associated with a single type t ∈ T. We also assume the target type set T is given (generating T is outside the scope of this study). Finally, while our work is independent of entity-linking techniques [Shen et al., 2014], the output of our ER framework may be useful to entity linking.
3. Estimate the type indicator y_m for each unlinkable candidate mention m ∈ M_U with the proposed type propagation integrated with relation phrase clustering on G (Section 4.4).
4.3 RELATION PHRASE-BASED GRAPH CONSTRUCTION
We first introduce candidate generation in Section 4.3.1, which leads to three kinds of objects,
namely candidate entity mentions M, their surface names C , and surrounding relation phrases
P . We then build a heterogeneous graph G , which consists of multiple types of objects and
multiple types of links, to model their relationships. The basic idea for constructing the graph is that the more likely two objects are to share the same label (i.e., t ∈ T or NOI), the larger the weight associated with their connecting edge.
Specifically, the constructed graph G unifies three types of links: mention-name link which
represents the mapping between entity mentions and their surface names, entity name-relation
phrase link which captures corpus-level co-occurrences between entity surface names and relation phrases, and mention-mention link which models distributional similarity between entity
mentions. This leads to three subgraphs G_{M,C}, G_{C,P}, and G_M, respectively. We introduce their construction in Sections 4.3.2–4.3.4.
ρ(S₁, S₂) = ( σ(S₁ ⊕ S₂) − σ̄(S₁) σ̄(S₂) ) / √σ(S₁ ⊕ S₂).  (4.1)
At each iteration, the greedy agglomerative algorithm performs the merging with the highest significance score, and terminates when the next highest-scoring merge does not meet a pre-defined significance threshold. Relation phrases without matched POS patterns are discarded, and their valid sub-phrases are recovered. Because the significance score can be considered analogous to hypothesis testing, one can use standard rule-of-thumb values for the threshold (e.g., a Z-score of 2) [El-Kishky et al., 2015]. Overall, our empirical studies are not sensitive to the threshold setting. Since all merged phrases are frequent, we have fast access to their aggregate counts, so computing the score of a potential merge is efficient.
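A minimal sketch of this greedy merging; the frequency counts, corpus size, and threshold below are toy values, and `counts` is assumed to contain every phrase that gets looked up:

```python
import math

def significance(count_merged, count_a, count_b, total):
    """Score a candidate merge of adjacent phrases a and b:
    (observed - expected) / sqrt(observed), as in phrase-mining work."""
    expected = count_a * count_b / total
    return (count_merged - expected) / math.sqrt(count_merged)

def greedy_merge(tokens, counts, total, threshold=2.0):
    """Repeatedly merge the adjacent pair with the highest significance
    score until no pair exceeds the threshold."""
    phrases = list(tokens)
    while len(phrases) > 1:
        best_score, best_i = None, None
        for i in range(len(phrases) - 1):
            merged = phrases[i] + " " + phrases[i + 1]
            if merged not in counts:
                continue  # never seen together; cannot merge
            s = significance(counts[merged], counts[phrases[i]],
                             counts[phrases[i + 1]], total)
            if best_score is None or s > best_score:
                best_score, best_i = s, i
        if best_score is None or best_score < threshold:
            break  # next-best merge is not significant; stop
        phrases[best_i:best_i + 2] = [phrases[best_i] + " " + phrases[best_i + 1]]
    return phrases
```

With toy counts, merging ["cheese", "steak", "sandwich"] first joins the most significant adjacent pair and then joins the resulting bigram with the remaining token.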
Figure 4.3 provides an example output of the candidate generation on The New York Times (NYT) corpus. We further compare our method with a popular noun phrase chunker in terms of entity detection performance, using the extracted entity mentions. Table 4.1 summarizes the comparison results on three datasets from different domains (see Section 4.5 for details). Recall is most critical for this step: false positives can still be filtered out in later stages of our framework, but there is no later chance to recover the misses, i.e., the false negatives.
Pattern     Example
V           disperse; hit; struck; knock
P           in; at; of; from; to
V P         locate in; come from; talk to
V W* (P)    caused major damage on; come lately
(V: verb; P: preposition; W: {adv | adj | noun | det | pron}; W* denotes multiple W; (P) denotes an optional P.)
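These POS patterns can be checked with a regular expression over a simplified tag alphabet; the tag-to-class map below is a rough stand-in for the Penn Treebank tags a real tagger would produce:

```python
import re

# Map (a few) fine-grained POS tags to the pattern alphabet above.
TAG_CLASS = {"VB": "V", "VBD": "V", "VBZ": "V", "IN": "P",
             "RB": "W", "JJ": "W", "NN": "W", "DT": "W", "PRP": "W"}

# Covers the table's patterns: P alone, V, V P, and V W* (P).
RELATION_RE = re.compile(r"^(P|VW*P?)$")

def is_relation_phrase(tagged):
    """tagged: list of (token, POS) pairs for one candidate phrase."""
    classes = "".join(TAG_CLASS.get(tag, "?") for _, tag in tagged)
    return bool(RELATION_RE.match(classes))
```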
If surface name c often appears as the left (right) argument of relation phrase p , then c ’s
type indicator tends to be similar to the corresponding type indicator in p ’s type signature.
In Fig. 4.4, for example, if we know “pizza” refers to food and find it frequently co-occurs
with the relation phrase “serves up” in its right argument position, then another surface name
that appears in the right argument position of “serves up” is likely also food. This reinforces the type propagation, inferring that “cheese steak sandwich” is also food.
Formally, suppose there are l different relation phrases P = {p₁, …, p_l} extracted from the corpus. We use two biadjacency matrices Π_L, Π_R ∈ {0,1}^{M×l} to represent the co-occurrences between relation phrases and their left and right entity arguments, respectively. We define Π_{L,ij} = 1 (Π_{R,ij} = 1) if m_i occurs as the closest entity mention on the left (right) of p_j in a sentence, and 0 otherwise. Each column of Π_L and Π_R is normalized by its ℓ₂-norm to reduce the impact of popular relation phrases.

Figure 4.4: Example entity name-relation phrase links from Yelp reviews.

Two bipartite subgraphs G_{C,P} can further be constructed to capture the aggregated co-occurrences between relation phrases P and entity names C across the corpus. We use two biadjacency matrices W_L, W_R ∈ R^{n×l} to represent the edge weights for the two types of links, and normalize them:
S_L = D_L^{(C)−1/2} W_L D_L^{(P)−1/2} and S_R = D_R^{(C)−1/2} W_R D_R^{(P)−1/2},

where S_L and S_R are the normalized biadjacency matrices. For left-argument relationships, we define the diagonal surface-name degree matrix D_L^{(C)} ∈ R^{n×n} with D^{(C)}_{L,ii} = Σ_{j=1}^{l} W_{L,ij}, and the relation-phrase degree matrix D_L^{(P)} ∈ R^{l×l} with D^{(P)}_{L,jj} = Σ_{i=1}^{n} W_{L,ij}. Likewise, we define D_R^{(C)} ∈ R^{n×n} and D_R^{(P)} ∈ R^{l×l}.
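A small numpy sketch of this bipartite normalization, with toy co-occurrence counts in place of the corpus-level statistics:

```python
import numpy as np

# Toy aggregated edge weights between n=3 surface names (rows)
# and l=2 relation phrases (columns), i.e., a raw W_L.
W_L = np.array([[4.0, 0.0],
                [1.0, 3.0],
                [0.0, 2.0]])

# Diagonal degree matrices: surface-name degrees (rows) and
# relation-phrase degrees (columns).
D_C = np.diag(W_L.sum(axis=1))  # D_L^(C), n x n
D_P = np.diag(W_L.sum(axis=0))  # D_L^(P), l x l

# Symmetric normalization: S_L = D_C^{-1/2} W_L D_P^{-1/2}.
d_c_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(D_C)))
d_p_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(D_P)))
S_L = d_c_inv_sqrt @ W_L @ d_p_inv_sqrt
```

This down-weights edges touching high-degree (popular) names and phrases, which is the stated purpose of the normalization.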
Figure 4.5: Example mention-mention links for entity surface name “White House” from Tweets.
If there exists a strong correlation (i.e., within sentence, common neighbor mentions)
between two candidate mentions that share the same name, then their type indicators
tend to be similar.
Specifically, for each candidate entity mention m_i ∈ M, we extract the set of entity surface names which co-occur with m_i in the same sentence. An n-dimensional TF-IDF vector f^(i) ∈ R^n is used to represent the importance of these co-occurring names for m_i, where f^(i)_j = s(c_j) · log(|D| / D(c_j)), with term frequency s(c_j) in the sentence and document frequency D(c_j) in D. We use an affinity subgraph to represent the mention-mention links based on k-nearest-neighbor (KNN) graph construction [He and Niyogi, 2004], denoted by an adjacency matrix W_M ∈ R^{M×M}. Each mention candidate is linked to its k most similar mention candidates which share the same surface name, in terms of the vectors f:

W_{M,ij} = sim(f^(i), f^(j)), if f^(i) ∈ N_k(f^(j)) or f^(j) ∈ N_k(f^(i)), and c(m_i) = c(m_j); and W_{M,ij} = 0 otherwise,

where we use the heat kernel function to measure similarity, i.e., sim(f^(i), f^(j)) = exp(−‖f^(i) − f^(j)‖² / t) with t = 5 [He and Niyogi, 2004]. We use N_k(f) to denote the k nearest neighbors of f and c(m) to denote the surface name of mention m. Similarly, we normalize W_M into S_M = D_M^{−1/2} W_M D_M^{−1/2}, where the degree matrix D_M ∈ R^{M×M} is defined by D_{M,ii} = Σ_{j=1}^{M} W_{M,ij}.
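Under these definitions, the mention-mention subgraph can be sketched as follows; the TF-IDF vectors, shared surface name, and k are toy values:

```python
import numpy as np

def mention_graph(F, names, k=2, t=5.0):
    """Build the mention-mention affinity matrix W_M from TF-IDF rows F.
    Mentions are linked only if they share a surface name and one is
    among the other's k nearest neighbors; heat-kernel similarity."""
    M = F.shape[0]
    sim = np.exp(-np.square(F[:, None, :] - F[None, :, :]).sum(-1) / t)
    np.fill_diagonal(sim, -np.inf)  # exclude self from neighbor lists
    knn = np.argsort(-sim, axis=1)[:, :k]  # k nearest neighbors per row
    W = np.zeros((M, M))
    for i in range(M):
        for j in range(M):
            if i == j or names[i] != names[j]:
                continue  # only link mentions sharing a surface name
            if j in knn[i] or i in knn[j]:
                W[i, j] = np.exp(-np.square(F[i] - F[j]).sum() / t)
    return W
```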
If two relation phrases have similar cluster memberships, the type indicators of their left
and right arguments (type signature) tend to be similar, respectively.
There have been some studies [Galárraga et al., 2014, Min et al., 2012] on clustering synonymous relation phrases based on different kinds of signals and clustering methods. We propose
a general relation phrase clustering method to incorporate different features for clustering, which
can be integrated with the graph-based type propagation in a mutually enhancing framework,
based on the following hypothesis.
Two relation phrases tend to have similar cluster memberships, if they have similar
(1) strings, (2) context words, and (3) left and right argument type indicators.
In particular, type signatures of relation phrases have proven very useful in clustering of
relation phrases which have infrequent or ambiguous strings and contexts [Galárraga et al.,
2014]. In contrast to previous approaches, our method leverages the type information derived
by the type propagation and thus does not rely strictly on external sources to determine the type
information for all the entity arguments.
Formally, suppose there are n_s (n_c) unique words {w₁, …, w_{n_s}} ({w′₁, …, w′_{n_c}}) in all the relation phrase strings (contexts). We represent the strings and contexts of the extracted relation phrases P by two feature matrices F_s ∈ R^{l×n_s} and F_c ∈ R^{l×n_c}, respectively. We set F_{s,ij} = 1 if p_i contains the word w_j, and 0 otherwise. We use a text window of 10 words to extract the context for a relation phrase from each sentence it appears in, and construct the context features F_c based on TF-IDF weighting. Let P_L, P_R ∈ R^{l×T} denote the type signatures of P. Our solution uses the derived features (i.e., {F_s, F_c, P_L, P_R}) for multi-view clustering of relation phrases based on joint non-negative matrix factorization, which is elaborated in the next section.
The second term L_α in Eq. (4.2) follows Hypotheses 4.3 and 4.4 to model the multi-view relation phrase clustering by joint non-negative matrix factorization. In this study, we consider each derived feature as one view in the clustering, i.e., {F^(0), F^(1), F^(2), F^(3)} = {P_L, P_R, F_s, F_c}, and derive a four-view clustering objective as follows:

L_α(P_L, P_R, {U^(v), V^(v)}, U*) = Σ_{v=0}^{d} ( β^(v) ‖F^(v) − U^(v) V^(v)T‖²_F + α ‖U^(v) Q^(v) − U*‖²_F ).  (4.4)

The first part of Eq. (4.4) performs matrix factorization on each feature matrix. Suppose there exist K relation phrase clusters. For each view v, we factorize the feature matrix F^(v) into a cluster membership matrix U^(v) ∈ R^{l×K}_{≥0} for all relation phrases P and a type indicator matrix V^(v) ∈ R^{T×K}_{≥0} for the K derived clusters. The second part of Eq. (4.4) enforces consistency between the four derived cluster membership matrices through a consensus matrix U* ∈ R^{l×K}_{≥0}, which applies Hypothesis 4.4 to incorporate multiple similarity measures for clustering relation phrases. As in Liu et al. [2013], we normalize {U^(v)} to the same scale (i.e., ‖U^(v) Q^(v)‖_F = 1) with the diagonal matrices {Q^(v)}, where Q^(v)_{kk} = Σ_{i=1}^{T} V^(v)_{ik} / ‖F^(v)‖_F, so that they are comparable under the same consensus matrix. A tuning parameter α ∈ [0, 1] controls the degree of consistency between the cluster membership of each view and the consensus matrix. The weights {β^(v)} balance the information among the different views and are estimated automatically. As the first part of Eq. (4.4) imposes P_L ≈ U^(0) V^(0)T and P_R ≈ U^(1) V^(1)T, and the second part enforces {U^(0), U^(1)} ≈ U*, it can be checked that U*_i ≈ U*_j implies both P_{L,i} ≈ P_{L,j} and P_{R,i} ≈ P_{R,j} for any two relation phrases, which captures Hypothesis 4.3.
The last term L_{γ,λ} in Eq. (4.2) models the type indicator for each entity mention candidate, the mention-mention links, and the supervision from the seed mentions:

L_{γ,λ}(Y, C, P_L, P_R) = ‖Y − f(Π_C C, Π_L P_L, Π_R P_R)‖²_F + (γ/2) Σ_{i,j=1}^{M} W_{M,ij} ‖ Y_i / √(D^{(M)}_{ii}) − Y_j / √(D^{(M)}_{jj}) ‖²₂ + λ ‖Y − Y₀‖²_F.  (4.5)
In the first part of Eq. (4.5), the type of each entity mention candidate is modeled by a function f(·) based on the type indicator of its surface name as well as the type signatures of its associated relation phrases. Different functions can be used to combine the information from surface names and relation phrases. In this study, we use an equal-weight linear combination, i.e., f(X₁, X₂, X₃) = X₁ + X₂ + X₃. The second part follows Hypothesis 4.2 to model the mention-mention correlation by graph regularization, which ensures consistency between the type indicators of two candidate mentions if they are highly correlated. The third part enforces the estimated Y to be similar to the initial labels from the seed mentions, denoted by a matrix Y₀ ∈ R^{M×T} (see Section 4.4.1). Two tuning parameters γ, λ ∈ [0, 1] are used to control the degree of guidance from the mention correlation in G_M and the degree of supervision from Y₀, respectively.
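For illustration, the three parts of Eq. (4.5) can be evaluated directly; in this numpy sketch, `F_comb` stands in for the combined evidence f(Π_C C, Π_L P_L, Π_R P_R), and all matrices are toy values:

```python
import numpy as np

def type_loss(Y, Y0, F_comb, W_M, gamma=0.5, lam=0.5):
    """Evaluate the three parts of the mention-typing objective:
    fit to combined evidence, graph smoothness over W_M, and
    agreement with the seed labels Y0."""
    fit = np.sum((Y - F_comb) ** 2)                       # first part
    deg = W_M.sum(axis=1)
    Yn = Y / np.sqrt(np.maximum(deg, 1e-12))[:, None]     # Y_i / sqrt(D_ii)
    smooth = 0.0
    M = Y.shape[0]
    for i in range(M):
        for j in range(M):
            smooth += W_M[i, j] * np.sum((Yn[i] - Yn[j]) ** 2)
    seed = np.sum((Y - Y0) ** 2)                          # third part
    return fit + 0.5 * gamma * smooth + lam * seed
```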
To derive the exact type of each candidate entity mention, we impose the 0-1 integer constraint Y ∈ {0,1}^{M×T} with Y·1 = 1. To model clustering, we further require the cluster membership matrices {U^(v)}, the type indicator matrices {V^(v)} of the derived clusters, and the consensus matrix U* to be non-negative. With the definition of O, we define the joint optimization problem as follows:

min_{Y, C, P_L, P_R, {U^(v), V^(v)}, U*} O,  s.t.  Y ∈ {0,1}^{M×T}, Y·1 = 1;  U^(v), V^(v), U* ≥ 0,  (4.6)
where we define X₀ = [(1 + β^(0)) I_l + Π_L^T Π_L]^{−1} and X₁ = [(1 + β^(1)) I_l + Π_R^T Π_R]^{−1}, respectively. Note that the matrix inversions in Eq. (4.8) can be computed efficiently, in linear time, since both Π_L^T Π_L and Π_R^T Π_R are diagonal matrices.
Finally, to perform multi-view clustering, we first optimize Eq. (4.2) with respect to {U^(v), V^(v)} while fixing the other variables, and then update U* and {β^(v)} while fixing {U^(v), V^(v)} and the other variables, following the procedure in Liu et al. [2013].
We first take the derivative of O with respect to V^(v) and apply the Karush-Kuhn-Tucker complementary condition to impose the non-negativity constraint on it, leading to the following multiplicative update rules:
V^(v)_{jk} ← V^(v)_{jk} · ( [F^(v)+T U^(v)]_{jk} + α Σ_{i=1}^{l} U^(v)_{ik} U*_{ik} ) / ( Λ^(v)_{jk} + α (Σ_{i=1}^{l} U^{(v)2}_{ik}) (Σ_{i=1}^{T} V^(v)_{ik}) ),  (4.9)

where we define the matrix Λ^(v) = V^(v) U^(v)T U^(v) + F^(v)−T U^(v). It is easy to check that {V^(v)} remains non-negative after each update based on Eq. (4.9).
We then normalize the column vectors of V^(v) and U^(v) by V^(v) ← V^(v) Q^(v)−1 and U^(v) ← U^(v) Q^(v). Following a similar procedure as for updating V^(v), the update rule for U^(v) can be derived as:

U^(v)_{ik} ← U^(v)_{ik} · [F^(v)+ V^(v) + α U*]_{ik} / [U^(v) V^(v)T V^(v) + F^(v)− V^(v) + α U^(v)]_{ik}.  (4.10)

In particular, we make the decomposition F^(v) = F^(v)+ − F^(v)−, where A⁺_{ij} = (|A_{ij}| + A_{ij})/2 and A⁻_{ij} = (|A_{ij}| − A_{ij})/2, in order to preserve the non-negativity of {U^(v)}.
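A minimal numpy sketch of a non-negativity-preserving multiplicative update in the style of Eq. (4.10), for a single view with the consensus matrix held fixed; the shapes and values are illustrative only:

```python
import numpy as np

def update_U(F, U, V, U_star, alpha=0.1, eps=1e-12):
    """One multiplicative update of the cluster membership matrix U,
    using the F = F+ - F- split so U stays non-negative even when
    the feature matrix has negative entries."""
    F_pos = (np.abs(F) + F) / 2.0   # A+ = (|A| + A) / 2
    F_neg = (np.abs(F) - F) / 2.0   # A- = (|A| - A) / 2
    numer = F_pos @ V + alpha * U_star
    denom = U @ V.T @ V + F_neg @ V + alpha * U + eps  # eps avoids 0-division
    return U * (numer / denom)
```

Because numerator and denominator are both non-negative, the elementwise ratio keeps every entry of U non-negative, which is exactly why the ± decomposition is needed.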
The proposed algorithm optimizes {U^(v), V^(v)} for each view v by iterating between Eqs. (4.9) and (4.10) until the following reconstruction error converges:

δ^(v) = ‖F^(v) − U^(v) V^(v)T‖²_F + α ‖U^(v) Q^(v) − U*‖²_F.  (4.11)
With the optimized {U^(v), V^(v)}, we update U* and {β^(v)} by taking the derivative of O with respect to each of them while fixing all the other variables. This leads to the following closed-form update rules:

U* = ( Σ_{v=0}^{d} β^(v) U^(v) Q^(v) ) / ( Σ_{v=0}^{d} β^(v) );   β^(v) = −log( δ^(v) / Σ_{i=0}^{d} δ^(i) ).  (4.12)
Algorithm 4.1 summarizes our algorithm. For convergence analysis, ClusType applies block coordinate descent on the real-valued relaxation of Eq. (4.6). The proof procedure in Tseng [2001] (not included for lack of space) can be adopted to prove the convergence of ClusType to a local minimum.
1: Initialize all variables with non-negative values.
2: repeat
3: Update the candidate mention type indicators Y by Eq. (4.7)
4: Update the entity name type indicators C and relation phrase type signatures {P_L, P_R} by Eq. (4.8)
5: for v = 0 to 3 do
6:   repeat
7:     Update V^(v) with Eq. (4.9)
8:     Normalize U^(v) ← U^(v) Q^(v), V^(v) ← V^(v) Q^(v)−1
9:     Update U^(v) by Eq. (4.10)
10:  until Eq. (4.11) converges
11: end for
12: Update the consensus matrix U* and the relative feature weights {β^(v)} using Eq. (4.12)
13: until the objective O in Eq. (4.6) converges
14: Predict the type of each m_i ∈ M_U by type(m_i) = argmax_t Y_{it}.
4.5 EXPERIMENTS
4.5.1 DATA PREPARATION
Our experiments use three real-world datasets: (1) NYT: constructed by crawling 2013 news
articles from The New York Times. The dataset contains 118,664 articles (57M tokens and 480K
unique words) covering various topics such as Politics, Business, and Sports; (2) Yelp: We col-
lected 230,610 reviews (25M tokens and 418K unique words) from the 2014 Yelp dataset chal-
lenge; and (3) Tweet: We randomly selected 10,000 users in Twitter and crawled at most 100
tweets for each user in May 2011. This yields a collection of 302,875 tweets (4.2M tokens and
157K unique words).
1. Heterogeneous Graphs. We first performed lemmatization on the tokens using the NLTK WordNet Lemmatizer to reduce variant forms of words (e.g., eat, ate, eating) to their lemma (e.g., eat), and then applied the Stanford POS tagger [Toutanova et al., 2003] to the corpus. In candidate generation (see Section 4.3.1), we set the maximal pattern length to 5, the minimum support to 30, and the significance threshold to 2 to extract candidate entity mentions and relation phrases from the corpus. We then followed the procedure in Section 4.3 to construct the heterogeneous graph for each dataset. We used 5-nearest-neighbor graphs when constructing the mention correlation subgraph. Table 4.4 summarizes the statistics of the constructed
heterogeneous graphs for all three datasets.
2. Comparing ClusType with its variants. Compared with ClusType-NoClus and ClusType-TwoStep, ClusType gains performance from integrating relation phrase clustering with type propagation in a mutually enhancing way. It always outperforms ClusType-NoWm on Precision and F1 on all three datasets. The enhancement mainly comes from modeling the mention
correlation links, which helps disambiguate entity mentions sharing the same surface names.
3. Comparing on different entity types. Figure 4.6 shows the performance on different types
on Yelp and Tweet. ClusType outperforms all the others on each type. It obtains larger gain
on organization and person, which have more entities with ambiguous surface names. This indi-
cates that modeling types on entity mention level is critical for name disambiguation. Superior
performance on product and food mainly comes from the domain independence of our method
because both NNPLB and SemTagger require sophisticated linguistic feature generation which
is hard to adapt to new types.
4. Comparing with trained NER. Table 4.7 compares our method with a traditional NER system, Stanford NER, trained on classic corpora such as the ACE corpus, on three major types—person,
location, and organization. ClusType and its variants outperform Stanford NER on the corpora
which are dynamic (e.g., NYT) or domain-specific (e.g., Yelp). On the Tweet dataset, ClusType
has lower Precision but achieves a 63.59% improvement in Recall and a 7.62% improvement in F1 score. The superior Recall of ClusType mainly comes from the domain-independent candidate generation.

Figure 4.6: F1 scores of ClusType, ClusType-NoClus, SemTagger, and NNPLB on each entity type (food, location, organization, person, product) on the Yelp and Tweet datasets.
5. Testing sensitivity to the number of relation phrase clusters, K. As shown in Fig. 4.7a, ClusType is less sensitive to K than its variants. We found that on the Tweet dataset, ClusType
achieved the best performance when K D 300 while its variants peaked at K D 500, which indi-
cates that better performance can be achieved with fewer clusters if type propagation is integrated
with clustering in a mutually enhancing way. On the NYT and the Yelp datasets (not shown
here), ClusType peaked at K D 4000 and K D 1500, respectively.
6. Testing on the size of seed mention set. Seed mentions are used as labels (distant supervi-
sion) for typing other mentions. By randomly selecting a subset of seed mentions as labeled data
(sampling ratio from 0.1–1.0), Fig. 4.7b shows ClusType and its variants are not very sensitive
to the size of seed mention set. Interestingly, using all the seed mentions does not lead to the
best performance, likely caused by the type ambiguity among the mentions.
7. Testing the effect of corpus size. Using the same parameters for candidate generation and graph construction, Fig. 4.7c shows the performance trend when varying the sampling ratio (the fraction of documents randomly sampled to form the input corpus). ClusType and its variants are not very sensitive to changes in corpus size, but NNPLB suffered an over 17% drop in F1 score when the sampling ratio changed from 1.0 to 0.1 (versus only 5.5% for ClusType). In particular, they always outperform FIGER, which uses a trained classifier and thus does not depend on corpus size.
Figure 4.7: Performance changes in F1 score with #clusters, #seeds, and corpus size on Tweets.
4.6 DISCUSSION
1. Example output on two Yelp reviews. Table 4.8 shows the output of ClusType, SemTagger,
and NNPLB on two Yelp reviews: ClusType extracts more entity mention candidates (e.g.,
“BBQ,” “ihop”) and predicts their types with better accuracy (e.g., “baked beans,” “pulled pork
sandwich”).
Table 4.8: Example output of ClusType and the compared methods on the Yelp dataset
Figure 4.8: Case studies on (a) context sparsity and (b) surface name popularity on the Tweet dataset, comparing the F1 scores of ClusType, ClusType-TwoStep, ClusType-NoClus, and FIGER on mention groups A and B.
3. Testing surface name popularity. We generated Group A from mentions with high-frequency surface names (2.7K occurrences) and Group B from those with infrequent surface names (1.5). Figure 4.8b shows degraded performance for all methods in both cases—likely due to ambiguity in popular mentions and sparsity in infrequent mentions. ClusType outperforms its variants in Group B, showing that it handles mentions with insufficient corpus statistics well.
4. Example relation phrase clusters. Table 4.9 shows relation phrases along with their corpus
frequency from three example relation phrase clusters for the NYT dataset (K = 4000). We found that not only synonymous relation phrases but also sparse and frequent relation phrases can be clustered together effectively (e.g., “want hire by” and “recruited by”). This shows that ClusType can boost sparse relation phrases with type information from frequent relation phrases with similar cluster memberships.
Table 4.9: Example relation phrase clusters and their corpus frequency from the NYT dataset
ID Relation Phrase
1 Recruited by (5.1K); employed by (3.4K); want hire by (264)
2 Go against (2.4K); struggling so much against (54); run for re-election against (112);
campaigned against (1.3K)
3 Looking at ways around (105); pitched around (1.9K); echo around (844); present at (5.5K)
4.7 SUMMARY
Entity recognition is an important but challenging research problem. In reality, many text collections are from specific, dynamic, or emerging domains, which poses significant new challenges for entity recognition, with increased name ambiguity and context sparsity, and requires entity detection without domain restriction. In this work, we investigate entity recognition (ER) with distant supervision and propose a novel relation phrase-based ER framework, called ClusType,
that runs data-driven phrase mining to generate entity mention candidates and relation phrases,
and enforces the principle that relation phrases should be softly clustered when propagating type
information between their argument entities. Then we predict the type of each entity mention
based on the type signatures of its co-occurring relation phrases and the type indicators of its
surface name, as computed over the corpus. Specifically, we formulate a joint optimization prob-
lem for two tasks, type propagation with relation phrases and multi-view relation phrase clustering.
Our experiments on multiple genres—news, Yelp reviews, and tweets—demonstrate the effec-
tiveness and robustness of ClusType, with an average of 37% improvement in F1 score over the
best compared method.
59
CHAPTER 5
FINE-GRAINED ENTITY TYPING WITH KNOWLEDGE BASES
Figure 5.1: Current systems may detect Arnold Schwarzenegger in sentences S1–S3 and assign
the same types to all (listed within braces), when only some types are correct for context (blue
labels within braces).
labels (e.g., see Fig. 5.1). Many previous studies ignore the label noise that appears in a majority of training mentions (see Table 5.1, row (1)), and assume all types obtained by distant supervision are “correct” [Ling and Weld, 2012, Yogatama et al., 2015]. The noisy labels may mislead the trained models and have a negative effect. A few systems try to denoise the training
corpora using simple pruning heuristics such as deleting mentions with conflicting types [Gillick
et al., 2014]. However, such strategies significantly reduce the size of the training set (Table 5.1, rows (2a–c)) and lead to performance degradation (as shown later in our experiments). The larger
the target type set, the more severe the loss.
Type Correlation. Most existing methods [Ling and Weld, 2012, Yogatama et al., 2015] treat
every type label in a training mention’s candidate type set equally and independently when learn-
ing the classifiers but ignore the fact that types in the given hierarchy are semantically correlated
(e.g., actor is more relevant to singer than to politician). As a consequence, the learned classi-
fiers may bias toward popular types but perform poorly on infrequent types since training data on
infrequent types is scarce. Intuitively, one should pose smaller penalty on types which are seman-
tically more relevant to the true types. For example, in Fig. 5.1 singer should receive a smaller
penalty than politician does, by knowing that actor is a true type for “Arnold Schwarzenegger”
in S2. This provides classifiers with additional information to distinguish between two types,
especially those infrequent ones.
Table 5.1: A study of label noise. (1): %mentions with multiple sibling types (e.g., actor,
singer); (2a)–(2c): %mentions deleted by the three pruning heuristics [Gillick et al., 2014] (see
Section 5.4), for three experiment datasets and NYT annotation corpus [Dunietz and Gillick,
2014].
In this chapter, we approach the problem of automatic fine-grained entity typing as follows.
(1) Use different objectives to model training mentions with correct type labels and mentions
with noisy labels, respectively. (2) Design a novel partial-label loss to model true types within
the noisy candidate type set which requires only the “best” candidate type to be relevant to the
training mention, and progressively estimate the best type by leveraging various text features
extracted for the mention. (3) Derive type correlation based on two signals: (i) the given type hierarchy; and (ii) the shared entities between two types in the KB; and incorporate the induced correlation by enforcing adaptive margins between different types for mentions in the training set.
To integrate these ideas, we develop a novel embedding-based framework called AFET. First, it uses distant supervision to obtain candidate types for each mention, and extracts a variety of text features from the mentions themselves and their local contexts. Mentions are partitioned into
a “clean” set and a “noisy” set based on the given type hierarchy. Second, we embed mentions and types jointly into a low-dimensional space, in which objects (i.e., features and types) that are semantically close to each other have similar representations. In the proposed
objective, an adaptive margin-based rank loss is proposed to model the set of clean mentions to
capture type correlation, and a partial-label rank loss is formulated to model the “best” candidate
type for each noisy mention. Finally, with the learned embeddings (i.e., mapping matrices), one
can predict the type-path for each mention in the test set in a top-down manner, using its text
features. The major contributions of this chapter are as follows.
1. We propose an automatic fine-grained entity typing framework, which reduces label noise introduced by distant supervision and incorporates type correlation in a principled way.
2. A novel optimization problem is formulated to jointly embed entity mentions and types into the same space. It models the noisy type sets with a partial-label rank loss and type correlation with an adaptive-margin rank loss.
3. We develop an iterative algorithm for solving the joint optimization problem efficiently.
4. Experiments with three public datasets demonstrate that AFET achieves significant im-
provement over the state of the art.
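To illustrate the partial-label idea only (this is not AFET's actual embedding objective), a hinge-style loss can require just the best-scoring candidate type to outrank every non-candidate type by a margin:

```python
def partial_label_loss(scores, candidate_idx, margin=1.0):
    """scores: per-type model scores for one mention.
    candidate_idx: indices of the (noisy) candidate type set.
    Only the best-scoring candidate must beat the negatives."""
    candidates = set(candidate_idx)
    best_pos = max(scores[i] for i in candidates)
    loss = 0.0
    for j in range(len(scores)):
        if j in candidates:
            continue  # candidate types are never penalized here
        # hinge: penalize negatives within `margin` of the best candidate
        loss += max(0.0, margin - (best_pos - scores[j]))
    return loss
```

The key contrast with treating every candidate type as correct: a noisy candidate with a low score contributes nothing, so the "best" candidate is progressively favored.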
5.2 PRELIMINARIES
Our task is to automatically uncover the type information for entity mentions (i.e., token spans
representing entities) in natural language sentences. The task takes a document collection D (automatically labeled using a KB Ψ in conjunction with a target type hierarchy Y) as input and predicts a type-path in Y for each mention in the test set D_t.
Type Hierarchy and Knowledge Base. Two key factors in distant supervision are the target type hierarchy and the KB. A type hierarchy, Y, is a tree where nodes represent types of interest from Ψ. Previous studies manually create several clean type hierarchies using types from Freebase [Ling and Weld, 2012] or WordNet [Yosef et al., 2012]. In this study, we adopt the existing hierarchies constructed using Freebase types.¹ To obtain types for the entities E_Ψ in Ψ, we use the human-curated entity-type facts in Freebase, denoted as F_Ψ = {(e, y)} ⊆ E_Ψ × Y.
Automatically Labeled Training Corpora. There exist publicly available labeled corpora such as Wikilinks [Singh et al., 2012] and ClueWeb [Gabrilovich et al., 2013]. In these corpora, entity mentions are identified and mapped to KB entities using anchor links. In specific domains (e.g., product reviews) where such public corpora are unavailable, one can utilize distant supervision to automatically label the corpus [Ling and Weld, 2012]. Specifically, an entity linker detects mentions m_i and maps them to one or more entities e_i in E_Ψ. The types of e_i in the KB are then associated with m_i to form its type set Y_i, i.e., Y_i = {y | (e_i, y) ∈ F_Ψ, y ∈ Y}. Formally, a training corpus D consists of a set of extracted entity mentions M = {m_i}_{i=1}^N, the context (e.g., sentence, paragraph) c_i of each mention, and the candidate type set Y_i of each mention. We represent D as a set of triples D = {(m_i, c_i, Y_i)}_{i=1}^N.
Problem Description. For each test mention, we aim to predict the correct type-path in Y based on the mention's context. More specifically, the test set D_t is defined as a set of mention-context pairs (m, c), where the mentions in D_t (denoted as M_t) are extracted from their sentences using existing extractors such as a named entity recognizer [Finkel et al., 2005]. We denote the gold type-path for a test mention m as Y*. This work focuses on learning a typing model from the noisy training corpus D, and estimating Y* from Y for each test mention m (in the set M_t), based on the mention m, its context c, and the learned model.
1 We use the Freebase dump as of June 30, 2015.
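The distant-supervision labeling just described can be sketched in a few lines of Python; the function and container names below are ours for illustration, not part of any released system.

```python
# Sketch of distant-supervision labeling: given mentions already linked to KB
# entities and the entity-type facts F_psi, attach candidate type sets Y_i.
# (Illustrative only; names are ours, not AFET's API.)

def label_corpus(mentions, kb_type_facts, target_types):
    """mentions: list of (mention_id, linked_entity_id, context).
    kb_type_facts: set of (entity, type) pairs, i.e., F_psi.
    Returns the training corpus D as (m_i, c_i, Y_i) triples."""
    corpus = []
    for mention_id, entity_id, context in mentions:
        candidate_types = {y for (e, y) in kb_type_facts
                           if e == entity_id and y in target_types}
        if candidate_types:  # keep only mentions that received at least one label
            corpus.append((mention_id, context, candidate_types))
    return corpus
```

Note how the candidate set keeps every KB type of the linked entity, which is exactly how context-agnostic label noise enters the corpus.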
Framework Overview. At a high level, the AFET framework (see also Fig. 5.2) learns low-
dimensional representations for entity types and text features, and infers type-paths for test
mentions using the learned embeddings. It consists of the following steps.
1. Extract text features for entity mentions in training set M and test set M t using their
surface names as well as the contexts. (Section 5.3.1).
2. Partition training mentions M into a clean set (denoted as Mc ) and a noisy set (denoted
as Mn ) based on their candidate type sets (Section 5.3.2).
3. Perform joint embedding of entity mentions M and the type hierarchy Y into the same
low-dimensional space, in which objects that are close to each other also share similar types
(Sections 5.3.3–5.3.6).
4. For each test mention m, estimate its type-path Y* (over the hierarchy Y) in a top-down
manner using the learned embeddings (Section 5.3.6).
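Step 2's clean/noisy partition can be illustrated as follows, under the reading (made explicit in Section 5.3.2) that a mention is "clean" when its candidate types lie on a single root-to-node path of the hierarchy; the code is a sketch with names of our choosing.

```python
# Sketch of the clean/noisy partition: a mention is "clean" if its candidate
# type set forms one path in the type hierarchy (our reading of Section 5.3.2).

def is_clean(candidate_types, parent):
    """parent: dict mapping each type to its parent type (None at the root)."""
    for y in candidate_types:
        # walk from y up to the root, collecting y and its ancestors
        chain, node = {y}, y
        while parent.get(node) is not None:
            node = parent[node]
            chain.add(node)
        if candidate_types <= chain:  # every candidate is y or an ancestor of y
            return True
    return False
```

For example, {person, athlete} is a single path and hence clean, while {athlete, coach} mixes siblings and is routed to the noisy set Mn.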
Table 5.2: Text features used in this chapter. “Turing Machine” is used as an example mention
from “The band’s former drummer Jerry Fuchs—who was also a member of Maserati, Turing Machine
and The Juan MacLean—died after falling down an elevator shaft.”
where U ∈ R^{d×M} and V ∈ R^{d×K} are the projection matrices for mentions and type labels,
respectively.
Although a shortest path is efficient to compute, its accuracy is limited—it is not always true
that a type (e.g., athlete) is more related to its parent type (i.e., person) than to its sibling
types (e.g., coach), or that all sibling types are equally related to each other (e.g., actor is more
related to director than to author). We later compare these two methods in our experiments.
With the type correlation computed, we propose to apply adaptive penalties to different
negative type labels (for a training mention), instead of treating all the labels equally as in most
existing work [Weston et al., 2011]. The hypothesis is intuitive: given the positive type labels
for a mention, negative type labels that are related to the positive labels should receive a smaller
penalty. For example, in the right column of Fig. 5.3, the negative label businessman
receives a smaller penalty (i.e., margin) than athlete does, since businessman is more related
to politician.
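As a quick numeric check of this idea, using the margin form γ_{k,k̄} = 1 + 1/(w_{k,k̄} + α) introduced later in this section (a sketch, not AFET's code):

```python
# Adaptive margin gamma = 1 + 1/(w + alpha): the more a negative type is
# correlated with a positive type (larger w), the smaller the margin it gets.
# (Illustrative helper; the correlation weights w come from the KB.)

def adaptive_margin(correlation, alpha=0.1):
    return 1.0 + 1.0 / (correlation + alpha)
```

With α = 0.1, a highly correlated negative type (w = 0.9) gets margin 2.0, while a weakly correlated one (w = 0.1) gets margin 6.0, so the related negative label is penalized less, as the hypothesis requires.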
Figure 5.3: An illustration of KB-based type correlation computation, and the proposed adaptive
margin.
For a mention, if a negative type is correlated to a positive type, the margin between them
should be smaller.
We propose an adaptive-margin rank loss to model the set of "clean" mentions (i.e., Mc),
based on the above hypothesis. The intuition is simple: for each mention, rank all positive
types ahead of all negative types, where the ranking score is measured by the similarity between
mention and type. We denote by f_k(m_i) the similarity between (m_i, y_k), defined as the inner
product of Φ_M(m_i) and Φ_Y(y_k):
ℓ_c(m_i, Y_i, Ȳ_i) = Σ_{y_k ∈ Y_i} Σ_{y_k̄ ∈ Ȳ_i} L( rank_{y_k}(f(m_i)) ) Θ_{i,k,k̄} ;
Θ_{i,k,k̄} = max{ 0, γ_{k,k̄} − f_k(m_i) + f_k̄(m_i) } ;
rank_{y_k}(f(m_i)) = Σ_{y_k̄ ∈ Ȳ_i} 1( γ_{k,k̄} + f_k̄(m_i) > f_k(m_i) ) .
Here, γ_{k,k̄} is the adaptive margin between positive type k and negative type k̄, defined as
γ_{k,k̄} = 1 + 1/(w_{k,k̄} + α) with a smoothing parameter α. L(x) = Σ_{i=1}^{x} 1/i transforms a rank
into a weight, which is then multiplied by the max-margin loss Θ_{i,k,k̄} to optimize precision at
x [Weston et al., 2011].
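A minimal numeric sketch of this clean-mention loss, with the scores f_k and margins γ supplied as plain dictionaries (the function names are ours; the real model computes f_k from learned embeddings):

```python
# Sketch of the adaptive-margin rank loss l_c for one "clean" mention.

def weight(rank):
    """L(x) = sum_{i=1}^{x} 1/i, transforming a rank into a weight."""
    return sum(1.0 / i for i in range(1, rank + 1))

def clean_loss(scores, pos, neg, gamma):
    """scores: dict type -> f_k(m); gamma: dict (pos_type, neg_type) -> margin."""
    loss = 0.0
    for k in pos:
        # rank of positive type k: number of margin-violating negative types
        rank = sum(1 for kb in neg if gamma[(k, kb)] + scores[kb] > scores[k])
        for kb in neg:
            loss += weight(rank) * max(0.0, gamma[(k, kb)] - scores[k] + scores[kb])
    return loss
```

When the positive type outscores every negative type by at least the margin, the rank is 0 and the loss vanishes; otherwise each violating pair contributes a weighted hinge term.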
For a noisy mention, the maximum score associated with its candidate types should be
greater than the scores associated with any other non-candidate types.
We extend the partial-label loss of Nguyen and Caruana [2008] (originally used to learn linear clas-
sifiers) to enforce Hypothesis 5.2, and integrate it with the adaptive margin to define the loss for
m_i (in set Mn):
ℓ_n(m_i, Y_i, Ȳ_i) = Σ_{y_k̄ ∈ Ȳ_i} L( rank_{y_k*}(f(m_i)) ) Θ_{i,k*,k̄} ;
Θ_{i,k*,k̄} = max{ 0, γ_{k*,k̄} − f_{k*}(m_i) + f_k̄(m_i) } ;
rank_{y_k*}(f(m_i)) = Σ_{y_k̄ ∈ Ȳ_i} 1( γ_{k*,k̄} + f_k̄(m_i) > f_{k*}(m_i) ) ,
where y_{k*} = argmax_{y_k ∈ Y_i} f_k(m_i) is the best candidate type for m_i (i.e., Hypothesis 5.2).
This contrasts sharply with multi-label learning [Yosef et al., 2012], where a large margin is
enforced between all candidate types and non-candidate types without considering noisy types.
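The noisy-mention loss can be sketched in the same style; this illustrative Python (names ours, not AFET's implementation) keeps only the best-scoring candidate type, per Hypothesis 5.2:

```python
# Sketch of the partial-label loss l_n for one "noisy" mention: only the
# best-scoring candidate type k* must outrank all non-candidate types.

def noisy_loss(scores, candidates, non_candidates, gamma):
    """scores: dict type -> f_k(m); gamma: dict (type, neg_type) -> margin."""
    k_star = max(candidates, key=lambda k: scores[k])  # best candidate type
    rank = sum(1 for kb in non_candidates
               if gamma[(k_star, kb)] + scores[kb] > scores[k_star])
    w = sum(1.0 / i for i in range(1, rank + 1))       # L(rank)
    return sum(w * max(0.0, gamma[(k_star, kb)] - scores[k_star] + scores[kb])
               for kb in non_candidates)
```

Because only k* is pushed above the non-candidates, a false candidate type with a low score contributes nothing, which is exactly how the loss tolerates label noise.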
5.4 EXPERIMENTS
5.4.1 DATA PREPARATION
Datasets. Our experiments use three public datasets. (1) Wiki [Ling and Weld, 2012]: consists
of 1.5M sentences sampled from Wikipedia articles; (2) OntoNotes [Weischedel et al., 2011]:
consists of 13,109 news documents where 77 test documents are manually annotated [Gillick et
al., 2014]; and (3) BBN [Weischedel and Brunstein, 2005]: consists of 2,311 Wall Street Journal
articles which are manually annotated using 93 types. Statistics of the datasets are shown in
Table 5.3.
Training Data. We followed the process in Ling and Weld [2012] to generate training data for
the Wiki dataset. For the BBN and OntoNotes datasets, we used DBpedia Spotlight2 for entity
linking. We discarded types which cannot be mapped to Freebase types in the BBN dataset (47
of 93).
Table 5.2 lists the set of features used in our experiments, which are similar to those used
in Ling and Weld [2012] and Yogatama et al. [2015] except for topics and ReVerb patterns. We
used a six-word window to extract context unigrams and bigrams for each mention (three words
on the left and the right). We applied the Stanford CoreNLP tool [Manning et al., 2014] to get
POS tags and dependency structures. The word clusters were derived for each corpus using the
Brown clustering algorithm.3 The features for a mention are represented as a binary indicator vector
2 http://spotlight.dbpedia.org/
3 https://github.com/percyliang/brown-cluster
whose dimensionality is the number of features derived from the corpus. We discarded features
which occur only once in the corpus. The number of features generated for each dataset
is shown in Table 5.3.
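The feature-vector construction described above can be sketched as follows (the feature-name strings are hypothetical; the single-occurrence filter matches the text):

```python
# Sketch of the binary feature representation: each mention becomes an
# indicator vector over the corpus-wide vocabulary, after dropping features
# that occur only once.

from collections import Counter

def build_vocab(all_feature_lists, min_count=2):
    counts = Counter(f for feats in all_feature_lists for f in feats)
    kept = sorted(f for f, c in counts.items() if c >= min_count)
    return {f: i for i, f in enumerate(kept)}

def to_indicator(features, vocab):
    vec = [0] * len(vocab)
    for f in features:
        if f in vocab:
            vec[vocab[f]] = 1
    return vec
```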
Baselines. We compared the proposed method (AFET) and its variant with state-of-the-art
typing methods, embedding methods, and partial-label learning methods: (1) FIGER [Ling
and Weld, 2012]; (2) HYENA [Yosef et al., 2012]; (3) FIGER/HYENA-Min [Gillick et al.,
2014]: removes types appearing only once in the document; (4) ClusType [Ren et al., 2015]:
predicts types based on co-occurring relation phrases; (5) HNM [Dong et al., 2015]: proposes
a hybrid neural model without hand-crafted features; (6) DeepWalk [Perozzi et al., 2014]:
applies DeepWalk to a feature-mention-type graph, treating all nodes as the same type;
(7) LINE [Tang et al., 2015]: uses a second-order LINE model on feature-type bipartite graph;
(8) PTE [Tang et al., 2015]: applies the PTE joint training algorithm on feature-mention and
type-mention bipartite graphs; (9) WSABIE [Yogatama et al., 2015]: adopts WARP loss to
learn embeddings of features and types; (10) PL-SVM [Nguyen and Caruana, 2008]: uses a
margin-based loss to handle label noise; and (11) CLPL [Cour et al., 2011]: uses a linear model
to encourage large average scores for candidate types.
We also compare AFET with its variants: (1) AFET: the complete model with KB-induced type
correlation; (2) AFET-CoH: with hierarchy-induced correlation (i.e., shortest-path distance);
(3) AFET-NoCo: without type correlation (i.e., all margins are "1") in the objective O; and
(4) AFET-NoPa: without the partial-label loss in the objective O.
Comparison with partial-label learning methods. Compared with PL-SVM and CLPL,
AFET obtains superior performance. PL-SVM assumes that only one candidate type is correct
and does not consider type correlation. CLPL simply averages the model output for all candi-
date types, and thus may generate results biased to frequent false types. Superior performance
of AFET mainly comes from modeling type correlation derived from KB.
Comparison with its variants. AFET always outperforms its variants on all three datasets. It
gains performance by capturing type correlation, as well as by handling type noise in the em-
bedding process.
Testing the effect of training set size and dimension. Experimenting with the same settings
for model learning, Fig. 5.5a shows the performance trend on the Wiki dataset when varying
the sampling ratio (subset of mentions randomly sampled from the training set D). Figure 5.5b
analyzes the performance sensitivity of AFET with respect to d, the embedding dimension, on
the BBN dataset. The accuracy of AFET improves as d grows, but the gain diminishes once d
is sufficiently large.
Testing sensitivity of the tuning parameter. Figure 5.6b analyzes the sensitivity of AFET with
respect to α on the BBN dataset. Performance increases as α grows, and becomes stable once α
is larger than 0.5.
Testing at different type levels. Figure 5.6a reports the Ma-F1 of AFET, FIGER, PTE, and
WSABIE at different levels of the target type hierarchy (e.g., person and location on level 1,
politician and artist on level 2, author and actor on level 3). The results show that it is more
difficult to distinguish among more fine-grained types. AFET always outperforms the other
methods, and achieves a 22.36% improvement in Ma-F1 over FIGER on level-3
types. The gain mainly comes from explicitly modeling the noisy candidate types.
Table 5.4: Study of typing performance on the three datasets
5.6 SUMMARY
In this chapter, we study automatic fine-grained entity typing and propose a hierarchical partial-
label embedding method, AFET, that models "clean" and "noisy" mentions separately and incor-
porates a given type hierarchy to induce loss functions. AFET builds on a joint optimization
framework, learns embeddings for mentions and type-paths, and iteratively refines the model.
Experiments on three public datasets show that AFET is effective, robust, and outperforms
competing methods.
Table 5.5: Example output of AFET and the compared methods on two news sentences from the
OntoNotes dataset

Text (S1): "... going to be an imminent easing of monetary policy," said Robert Dederick, chief economist at Northern Trust Co. in Chicago.
Text (S2): "It's terrific for advertisers to know the reader will be paying more," said Michael Drexler, national media director at Bozell Inc. ad agency.

Ground Truth:  (S1) organization, company            (S2) person, person_title
FIGER:         (S1) organization                     (S2) organization
WSABIE:        (S1) organization, company, broadcast (S2) organization, company, news_company
PTE:           (S1) organization                     (S2) person
AFET:          (S1) organization, company            (S2) person, person_title
Figure 5.5: Performance change with respect to (a) sampling ratio of training mentions on the
Wiki dataset; and (b) embedding dimension d on the BBN dataset.
Figure 5.6: Performance change (a) at different levels of the type hierarchy on the OntoNotes
dataset; and (b) with respect to the smoothing parameter α on the BBN dataset.
CHAPTER 6
Synonym Discovery from Large Corpus
People often have a variety of ways to refer to the same real-world entity, forming different syn-
onyms for the entity (e.g., the entity United States can be referred to as "America" or "U.S."). Auto-
matic synonym discovery is an important task in text analysis and understanding, as the extracted
synonyms (i.e., the alternative ways to refer to the same entity) can benefit many downstream
applications [Angheluta, 2002, Wu and Weld, 2010, Xie et al., 2015, Zeng et al., 2012]. For
example, by forcing synonyms of an entity to be assigned in the same topic category, one can
constrain the topic modeling process and yield topic representations with higher quality [Xie et
al., 2015]. Another example is in document retrieval [Voorhees, 1994], where we can leverage
entity synonyms to enhance the process of query expansion, and thus improve retrieval perfor-
mance.
6.1.1 CHALLENGES
Although distant supervision helps collect training seeds automatically, it also poses a challenge
due to the string ambiguity problem, that is, the same entity surface strings can be mapped to
different entities in KBs. For example, consider the string “Washington” in Fig. 6.1. The “Wash-
ington” in the first sentence represents a state of the United States, while in the second sentence
it refers to a person. As some strings like “Washington” have ambiguous meanings, directly infer-
ring synonyms for such strings may lead to a set of synonyms for multiple entities. For example,
the synonyms of entity Washington returned by current systems may contain both the state names
and person names, which is not desirable. To address this challenge, instead of using ambiguous
strings as queries, a better approach is to use specific concepts as queries for disambiguation,
such as entities in KBs.
Figure 6.1: Distant supervision for synonym discovery. We link entity mentions in text corpus
to knowledge base entities, and collect training seeds from KBs.
This motivates us to define a new task: automatic synonym discovery for entities with KBs.
Given a domain-specific corpus, we aim to collect existing name strings of entities from KBs
as seeds. For each query entity, its existing name strings can disambiguate each other's meaning,
and we let them vote to decide whether a given candidate string is a
synonym of the query entity. Based on that, the key task for this problem is to predict whether a
pair of strings are synonymous. For this task, the collected seeds can serve as supervision to help
determine the important features. However, as the synonym seeds from KBs are usually quite
limited, how to use them effectively becomes a major challenge. There are broadly two kinds of
efforts toward exploiting a limited number of seed examples.
The distributional-based approaches [Mikolov et al., 2013, Pennington et al., 2014, Roller
et al., 2014, Wang et al., 2015, Weeds et al., 2014] consider the corpus-level statistics, and they
assume that strings which often appear in similar contexts are likely to be synonyms. For example,
the strings "U.S." and "United States" are usually mentioned in similar contexts, and they are
synonyms of the country U.S. Based on this assumption, the distributional-based approaches
usually represent strings with their distributional features, and treat the synonym seeds as labels
to train a classifier, which predicts whether a given pair of strings is synonymous. Since
most synonymous strings appear in similar contexts, such approaches usually have high
recall. However, such a strategy also introduces noise, since some non-synonymous strings may
also share similar contexts, such as "U.S." and "Canada," which could be labeled as synonyms
incorrectly (Fig. 6.2).
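As a toy illustration of the distributional signal (not DPE's actual model), context count vectors and cosine similarity capture the "similar contexts" intuition, and also expose its failure mode, since non-synonyms like "U.S." and "Canada" can score high too:

```python
# Toy distributional features: bag-of-context-words vectors compared by cosine.
# (Illustrative only; real systems use learned embeddings and a trained classifier.)

import math
from collections import Counter

def context_vector(string, sentences, window=2):
    vec = Counter()
    for sent in sentences:
        toks = sent.split()
        for i, t in enumerate(toks):
            if t == string:
                for j in range(max(0, i - window), min(len(toks), i + window + 1)):
                    if j != i:
                        vec[toks[j]] += 1
    return vec

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Synonym seeds from the KB would then supervise a threshold or classifier over such similarity features.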
Alternatively, the pattern-based approaches [Hearst, 1992, Qian et al., 2009, Snow et al.,
2004, Sun and Grishman, 2010] consider the local contexts, and they infer the relation of two
strings by analyzing sentences mentioning both of them. For example, from the sentence “The
United States of America is commonly referred to as America.”, we can infer that “United States of
America” and “America” have the synonym relation, while the sentence “The U.S. is adjacent to
Canada" may imply that "U.S." and "Canada" are not synonymous. To leverage this observation,
the pattern-based approaches extract textual patterns from sentences in which two
synonymous strings co-occur, and discover more synonyms with the learned patterns. Different
from the distributional-based approaches, the pattern-based approaches can treat the patterns
as concrete evidence supporting the discovered synonyms, which is more convincing and in-
terpretable. However, as many synonymous strings are never co-mentioned in any sentence,
such approaches usually suffer from low recall.
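As a toy illustration of the pattern idea, with surface regular expressions standing in for the dependency-path patterns DPE actually uses:

```python
# Illustrative pattern-based extraction: hard-coded surface patterns (a stand-in
# for learned <token, POS tag, dependency label> path patterns).

import re

SYNONYM_PATTERNS = [
    re.compile(r"(.+?) is commonly referred to as (.+?)\."),
    re.compile(r"(.+?), also known as (.+?),"),
]

def extract_synonym_pairs(sentence):
    pairs = []
    for pat in SYNONYM_PATTERNS:
        m = pat.search(sentence)
        if m:
            pairs.append((m.group(1).strip(), m.group(2).strip()))
    return pairs
```

A matched pattern serves as concrete, inspectable evidence for the extracted pair, while sentences expressing other relations ("The U.S. is adjacent to Canada") simply match nothing.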
O = O_D + O_P,    (6.1)
where O_D is the objective of the distributional module and O_P is the objective of the pattern
module. Next, we introduce the details of each module.
Figure 6.4: Examples of patterns and their features. For a pair of target strings (red ones) in
each sentence, we define the pattern as the <token, POS tag, dependency label> triples in the
shortest dependency path. We collect both lexical features and syntactic features for pattern
classification.
6.3 EXPERIMENT
1. Comparing DPE with baseline approaches. Table 6.1, Table 6.2, and Figure 6.5
present the results under the warm-start and cold-start settings. In both settings, we see that the
pattern-based approach Patty does not perform well, and our proposed approach DPE signifi-
cantly outperforms Patty. This is because most synonymous strings will never co-appear in any
sentences, leading to the low recall of Patty. Also, many patterns discovered by Patty are not so
reliable, which may harm the precision of the discovered synonyms. DPE addresses this problem
by incorporating the distributional information, which can effectively complement and regulate
the pattern information, leading to higher recall and precision.
Compared with the distributional-based approaches (word2vec, GloVe, PTE,
RKPM), DPE still performs significantly better. The performance gains mainly come from:
(1) exploiting the co-occurrence observation during training, which enables us to better capture
the semantic meanings of different strings; and (2) incorporating the pattern information to
improve performance.
2. Comparing DPE with its variants. To better understand why DPE achieves better results,
we also compare DPE with several variants. From Table 6.1 and Table 6.2, we see that in most
cases, the distributional module of our approach (DPE-NoP) can already outperform the best
Figure 6.5: Precision and recall at different positions on the Wiki dataset.
baseline approach RKPM. This is because we utilize the co-occurrence observation in our distri-
butional module, which helps us capture the semantic meanings of strings more effectively. By
separately training the pattern module after the distributional module, and using both modules
for synonym discovery (DPE-TwoStep), we see that the results are further improved, which
demonstrates that the two modules can indeed mutually complement each other for synonym
discovery. If we jointly train both modules (DPE), we obtain even better results, which shows
that our proposed joint optimization framework can benefit the training process and therefore
helps achieve better results.
6.4 SUMMARY
Recognizing entity synonyms from text has become a crucial task in many entity-leveraging ap-
plications. However, discovering entity synonyms from domain-specific text corpora (e.g., news
articles, scientific papers) is rather challenging. Current systems take an entity name string as
input to find other names that are synonymous, ignoring the fact that oftentimes a name
string can refer to multiple entities (e.g., "apple" could refer to both Apple Inc. and the fruit apple).
Moreover, most existing methods require training data manually created by domain experts to
construct supervised-learning systems. In this chapter, we study the problem of automatic syn-
onym discovery with KBs, that is, identifying synonyms for knowledge base entities in a given
domain-specific corpus. The manually curated synonyms for each entity stored in a KB not
only form a set of name strings to disambiguate the meaning for each other, but also can serve
as “distant” supervision to help determine important features for the task. We propose a novel
framework, called DPE, to integrate two kinds of mutually-complementing signals for synonym
discovery, i.e., distributional features based on corpus-level statistics and textual patterns based
on local contexts. In particular, DPE jointly optimizes the two kinds of signals in conjunction
with distant supervision, so that they can mutually enhance each other in the training stage.
At the inference stage, both signals are utilized to discover synonyms for the given entities.
Experimental results demonstrate the effectiveness of the proposed framework.
PART II
CHAPTER 7
Joint Extraction of Typed Entities and Relationships
Figure 7.1: Current systems find relations (Barack Obama, United States) mentioned in sentences
S1-S3 and assign the same relation types (entity types) to all relation mentions (entity mentions),
when only some types are correct for context (highlighted in blue font).
In this chapter, we study the problem of joint extraction of typed entities and relations with
distant supervision. Given a domain-specific corpus and a set of target entity and relation types
from a KB, we aim to detect relation mentions (together with their entity arguments) from
text, and categorize each in context by target types or Not-Target-Type (None), with distant
supervision. Current distant supervision methods focus on solving the subtasks separately (e.g.,
extracting typed entities or relations), and encounter the following limitations when handling
the joint extraction task.
Domain Restriction: They rely on pre-trained named entity recognizers (or noun phrase chunkers)
to detect entity mentions. These tools are usually designed for a few general types (e.g., person,
location, organization) and require additional human labor to adapt to specific domains (e.g.,
scientific publications).
Error Propagation: In current extraction pipelines, incorrect entity types generated in the entity
recognition and typing step serve as features in the relation extraction step (i.e., errors propagate
from upstream components to downstream ones). Cross-task dependencies are ignored
in most existing methods.
Label Noise: In distant supervision, the context-agnostic mapping from relation (entity) men-
tions to KB relations (entities) may introduce false positive type labels (i.e., label noise) into the
automatically labeled training corpora and result in inaccurate models.
In Fig. 7.1, for example, all KB relations between entities Barack Obama and United States
(e.g., born_in, president_of) are assigned to the relation mention in sentence S1 (while only
born_in is correct within the context). Similarly, all KB types for Barack Obama (e.g., politician,
artist) are assigned to the mention "Obama" in S1 (while only person is true). Label noise
becomes an impediment to learning effective type classifiers. The larger the target type set, the
more severe the degree of label noise (see Table 7.1).
Table 7.1: A study of type label noise. (1) %entity mentions with multiple sibling entity types (e.g.,
actor, singer) in the given entity type hierarchy; and (2) %relation mentions with multiple
relation types, for the three experiment datasets.
We approach the joint extraction task as follows. (1) Design a domain-agnostic text seg-
mentation algorithm to detect candidate entity mentions with distant supervision and minimal
linguistic assumption (i.e., assuming part-of-speech (POS) tagged corpus is given [Hovy et al.,
2015]). (2) Model the mutual constraints between the types of the relation mentions and the
types of their entity arguments, to enable feedback between the two subtasks. (3) Model the
true type labels in a candidate type set as latent variables and require only the “best” type (pro-
gressively estimated as we learn the model) to be relevant to the mention—this is a less limiting
requirement compared with existing multi-label classifiers that assume “every” candidate type is
relevant to the mention.
To integrate these elements of our approach, a novel framework, CoType, is proposed. It
first runs POS-constrained text segmentation using positive examples from KB to mine quality
entity mentions, and forms candidate relation mentions (Section 7.3.1). Then CoType performs
entity linking to map candidate relation (entity) mentions to KB relations (entities) and obtain
the KB types. We formulate a global objective to jointly model (1) corpus-level co-occurrences
between linkable relation (entity) mentions and text features extracted from their local contexts;
(2) associations between mentions and their KB-mapped type labels; and (3) interactions be-
tween relation mentions and their entity arguments. In particular, we design a novel partial-label
loss to model the noisy mention-label associations in a robust way, and adopt a translation-based
objective to capture the entity-relation interactions. Minimizing the objective yields two low-
dimensional spaces (for entity and relation mentions, respectively), where, in each space, objects
whose types are semantically close also have similar representation (see Section 7.3.2). With the
learned embeddings, we can efficiently estimate the types for the remaining unlinkable relation
mentions and their entity arguments (see Section 7.3.3).
The major contributions of this chapter are as follows.
1. A novel distant-supervision framework, CoType, is proposed to extract typed entities and
relations in domain-specific corpora with minimal linguistic assumption. (Fig. 7.2)
2. A domain-agnostic text segmentation algorithm is developed to detect entity mentions
using distant supervision. (Section 7.3.1)
3. A joint embedding objective is formulated that models mention-type association,
mention-feature co-occurrence, entity-relation cross-constraints in a noise-robust way.
(Section 7.3.2)
4. Experiments with three public datasets demonstrate that CoType significantly improves
the performance of state-of-the-art entity typing and relation extraction systems, and
exhibits robust domain independence. (Section 7.4)
7.2 PRELIMINARIES
The input to our proposed CoType framework is a POS-tagged text corpus D, a KB ‰ (e.g.,
Freebase [Bollacker et al., 2008]), a target entity type hierarchy Y , and a target relation type
7.2. PRELIMINARIES 91
set R. The target type set Y (set R) covers a subset of entity (relation) types that the users are
interested in from ‰ , i.e., Y Y‰ and R R‰ .
Entity and Relation Mention. An entity mention (denoted by m) is a token span in text which
represents an entity e. A relation instance r(e1, e2, …, en) denotes some type of relation r ∈ R
between multiple entities. In this work, we focus on binary relations, i.e., r(e1, e2). We define a
relation mention (denoted by z) for a relation instance r(e1, e2) as an (ordered) pair of entity
mentions of e1 and e2 in a sentence s, and represent a relation mention with entity mentions
m1 and m2 in sentence s as z = (m1, m2, s).
Knowledge Bases and Target Types. A KB with a set of entities E_Ψ contains human-curated
facts on both relation instances, I_Ψ = {r(e1, e2)} ⊆ R_Ψ × E_Ψ × E_Ψ, and entity-type facts,
T_Ψ = {(e, y)} ⊆ E_Ψ × Y_Ψ. The target entity type hierarchy is a tree whose nodes represent entity
types of interest from the set Y_Ψ. An entity mention may have multiple types, which together
constitute one type-path (not required to end at a leaf) in the given type hierarchy. In existing
studies, several entity type hierarchies are manually constructed using Freebase [Gillick et al.,
2014, Lee et al., 2007] or WordNet [Yosef et al., 2012]. The target relation type set is a set of
relation types of interest from the set R_Ψ.
Automatically Labeled Training Data. Let M = {m_i}_{i=1}^N denote the set of entity mentions
extracted from the corpus D. Distant supervision maps M to KB entities E_Ψ with an entity
disambiguation system [Hoffart et al., 2011, Mendes et al., 2011] and heuristically assigns type
labels to the mapped mentions. In practice, only a small number of entity mentions in set M
can be mapped to entities in E_Ψ (i.e., linkable entity mentions, denoted by M_L). As reported
in Lin et al. [2012] and Ren et al. [2015], the ratio of M_L to M is usually lower than 50% in
domain-specific corpora.
Between any two linkable entity mentions m1 and m2 in a sentence, a relation mention
z_i is formed if there exist one or more KB relations between their KB-mapped entities e1
and e2. The relations between e1 and e2 in the KB are then associated with z_i to form its
candidate relation type set R_i, i.e., R_i = {r | r(e1, e2) ∈ I_Ψ}. In a similar way, the types of e1
and e2 in the KB are associated with m1 and m2, respectively, to form their candidate entity
type sets Y_{i,1} and Y_{i,2}, where Y_{i,x} = {y | (e_x, y) ∈ T_Ψ}. Let Z_L = {z_i}_{i=1}^{N_L}
denote the set of extracted relation mentions that can be mapped to the KB. Formally, we
represent the automatically labeled training corpus for the joint extraction task, denoted as D_L,
using a set of tuples D_L = {(z_i, R_i, Y_{i,1}, Y_{i,2})}_{i=1}^{N_L}.
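The construction of D_L can be sketched as follows; the container names are ours for illustration, and the entity linking itself is assumed already done:

```python
# Sketch of building the automatically labeled corpus D_L for joint extraction:
# attach candidate relation types R_i and candidate entity type sets Y_{i,1},
# Y_{i,2} using KB facts. (Illustrative; not CoType's released code.)

def label_relation_mentions(linked_pairs, kb_relations, kb_entity_types):
    """linked_pairs: list of (m1, e1, m2, e2, sentence), with m's linked to e's.
    kb_relations: set of (r, e_head, e_tail) triples, i.e., I_psi.
    kb_entity_types: set of (entity, type) pairs, i.e., T_psi."""
    corpus = []
    for m1, e1, m2, e2, s in linked_pairs:
        R_i = {r for (r, ea, eb) in kb_relations if (ea, eb) == (e1, e2)}
        if not R_i:
            continue  # no KB relation between e1 and e2: not a linkable mention
        Y1 = {y for (e, y) in kb_entity_types if e == e1}
        Y2 = {y for (e, y) in kb_entity_types if e == e2}
        corpus.append(((m1, m2, s), R_i, Y1, Y2))
    return corpus
```

Note that every KB relation and every KB type of the linked entities is kept, which is precisely the context-agnostic label noise that CoType's partial-label loss is later designed to absorb.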
Problem Description. By pairing up entity mentions (from set M) within each sentence in
D, we generate a set of candidate relation mentions, denoted as Z. The set Z consists of (1) linkable
relation mentions Z_L; (2) unlinkable (true) relation mentions; and (3) false relation mentions
(i.e., pairs between which no target relation is expressed).
Let Z_U denote the set of unlabeled relation mentions in (2) and (3) (i.e., Z_U = Z \ Z_L).
Our main task is to determine the relation type label (from the set R ∪ {None}) for each relation
mention in set Z_U, and the entity type labels (either a single type-path in Y or None) for each
entity mention argument in z ∈ Z_U, using the automatically labeled corpus D_L. Formally, we
define the joint extraction of typed entities and relations task as follows.
Problem 7.1 Joint Entity and Relation Extraction. Given a POS-tagged corpus D, a KB Ψ,
a target entity type hierarchy Y ⊆ Y_Ψ, and a target relation type set R ⊆ R_Ψ, the joint extraction
task aims to (1) detect entity mentions M from D; (2) generate training data D_L with the KB Ψ;
and (3) estimate a relation type r ∈ R ∪ {None} for each test relation mention z ∈ Z_U, and a single
type-path in Y (or None) for each entity mention in z, using D_L and its context s.
Non-goals. This work relies on an entity linking system [Mendes et al., 2011] to provide the dis-
ambiguation function, but we do not address its limitations here (e.g., label noise introduced by
wrongly mapped KB entities). We also assume human-curated target type hierarchies are given
(it is out of the scope of this study to generate the type hierarchy).
Relation Mention Generation. We follow the procedure introduced in Section 7.2 to generate
the set of candidate relation mentions Z from the detected candidate entity mentions M: for
each pair of entity mentions .ma ; mb / found in sentence s , we form two candidate relation men-
tions z1 D .ma ; mb ; s/ and z2 D .mb ; ma ; s/. Distant supervision is then applied on Z to generate
the set of KB-mapped relation mentions ZL . Similar to Hoffmann et al. [2011] and Mintz et
al. [2009], we sample 30% unlinkable relation mentions between two KB-mapped entity men-
tions (from set ML ) in a sentence as examples for modeling None relation label, and sample
30% unlinkable entity mentions (from set M n ML ) to model None entity label. These negative
examples, together with type labels for mentions in ZL , form the automatically labeled data DL
for the task.
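The pair-enumeration step above can be sketched as follows (a toy illustration; the 30% sampling of None examples is omitted):

```python
# Sketch of candidate relation mention generation: every ordered pair of entity
# mentions in a sentence yields one candidate z = (m_a, m_b, s).

from itertools import permutations

def candidate_relation_mentions(sentence_mentions, sentence):
    return [(ma, mb, sentence)
            for ma, mb in permutations(sentence_mentions, 2)]
```

Generating both orderings matters because a relation type such as born_in is directional: (m_a, m_b, s) and (m_b, m_a, s) are distinct candidates.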
Text Feature Extraction. To capture the shallow syntax and distributional semantics of a relation (or entity) mention, we extract various lexical features from both the mention itself (e.g., head token) and its context $s$ (e.g., bigrams) in the POS-tagged corpus. Table 7.3 lists the set of text features for relation mentions, which is similar to those used in Chan and Roth [2010] and Mintz et al. [2009] (excluding the dependency parse-based features and entity type features). We use the
same set of features for entity mentions as those used in Ling and Weld [2012] and Ren et al. [2016c]. We denote the set of $M_z$ ($M_m$) unique features extracted from relation mentions $\mathcal{Z}_L$ (entity mentions in $\mathcal{Z}_L$) as $\mathcal{F}_z = \{f_j\}_{j=1}^{M_z}$ and $\mathcal{F}_m = \{f_j\}_{j=1}^{M_m}$.
Table 7.3: Text features for relation mentions used in this work [GuoDong et al., 2005, Riedel
et al., 2010] (excluding dependency parse-based features and entity type features). (“Barack
Obama,” “United States”) is used as an example relation mention from the sentence “Honolulu
native Barack Obama was elected President of the United States on March 20 in 2008.”
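The flavor of the lexical features in Table 7.3 can be illustrated with a small sketch. The feature-name prefixes (`HEAD1_`, `BETWEEN_`, `CONTEXT_`, etc.) and the span-based interface below are made up for illustration and cover only a subset of the actual feature set:

```python
def relation_mention_features(tokens, head_span, tail_span, window=3):
    """Illustrative lexical features for a relation mention: entity head
    tokens, unigrams/bigrams between the two arguments, and context words
    within a fixed window around the pair."""
    h0, h1 = head_span   # token index range [start, end) of the first argument
    t0, t1 = tail_span   # token index range [start, end) of the second argument
    feats = [f"HEAD1_{tokens[h1 - 1]}", f"HEAD2_{tokens[t1 - 1]}"]
    between = tokens[h1:t0]
    feats += [f"BETWEEN_{w}" for w in between]
    feats += [f"BETWEEN_BIGRAM_{a}_{b}" for a, b in zip(between, between[1:])]
    left = tokens[max(0, h0 - window):h0]
    right = tokens[t1:t1 + window]
    feats += [f"CONTEXT_{w}" for w in left + right]
    return feats
```

On the example sentence from Table 7.3, the pair ("Barack Obama", "United States") would yield features such as `HEAD1_Obama`, `BETWEEN_President`, and `CONTEXT_Honolulu`.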
Two relation mentions tend to share similar types (close to each other in the embedding space) if they share many text features in the corpus, and vice versa. For example, in column 2 of Fig. 7.2, ("Barack Obama," "US," S1) and ("Barack Obama," "United States," S3) share multiple features, including the context word "president" and the first entity mention argument "Barack Obama," and thus they are likely of the same relation type (i.e., president_of).
Formally, let vectors $\mathbf{z}_i, \mathbf{c}_j \in \mathbb{R}^d$ represent relation mention $z_i \in \mathcal{Z}_L$ and text feature $f_j \in \mathcal{F}_z$ in the $d$-dimensional relation embedding space. Similar to the distributional hypothesis [Mikolov et al., 2013] in text corpora, we apply second-order proximity [Tang et al., 2015] to model the idea that objects with similar distributions over neighbors are similar to each other, as follows:

$$L_{ZF} = -\sum_{z_i \in \mathcal{Z}_L} \sum_{f_j \in \mathcal{F}_z} w_{ij} \log p(f_j \mid z_i), \qquad (7.2)$$
where
$$p(f_j \mid z_i) = \frac{\exp(\mathbf{z}_i^T \mathbf{c}_j)}{\sum_{f_{j'} \in \mathcal{F}_z} \exp(\mathbf{z}_i^T \mathbf{c}_{j'})}$$
denotes the probability of $f_j$ being generated by $z_i$, and $w_{ij}$ is the co-occurrence frequency of $(z_i, f_j)$ in corpus $\mathcal{D}$. The function $L_{ZF}$ in Eq. (7.2) enforces the conditional distribution specified by the embeddings, i.e., $p(\cdot \mid z_i)$, to be close to the empirical distribution.
To perform efficient optimization by avoiding summation over all features, we adopt the negative sampling strategy [Mikolov et al., 2013] to sample multiple false features for each $(z_i, f_j)$, according to the noise distribution $P_n(f) \propto D_f^{3/4}$ [Mikolov et al., 2013] (with $D_f$ denoting the number of relation mentions co-occurring with $f$). The term $\log p(f_j \mid z_i)$ in Eq. (7.2) is replaced with the following term:

$$\log \sigma(\mathbf{z}_i^T \mathbf{c}_j) + \sum_{v=1}^{V} \mathbb{E}_{f_{j'} \sim P_n(f)} \big[ \log \sigma(-\mathbf{z}_i^T \mathbf{c}_{j'}) \big], \qquad (7.3)$$

where $\sigma(x) = 1/\big(1 + \exp(-x)\big)$ is the sigmoid function. The first term in Eq. (7.3) models the observed co-occurrence, and the second term models the $V$ negative feature samples.
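The negative-sampling surrogate of Eq. (7.3) can be computed directly. The following sketch assumes the embeddings are plain NumPy vectors and that the $V$ false features have already been drawn from $P_n(f)$; the function name is illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(t_z, c_pos, c_negs):
    """Negative-sampling surrogate for -log p(f_j | z_i): reward the observed
    (mention, feature) pair and penalize V sampled false features.
    t_z, c_pos: d-dim vectors; c_negs: V x d matrix of sampled features."""
    obs = np.log(sigmoid(t_z @ c_pos))            # observed co-occurrence term
    noise = np.log(sigmoid(-c_negs @ t_z)).sum()  # V negative feature samples
    return -(obs + noise)                          # loss to be minimized
```

With all-zero dot products every sigmoid evaluates to 0.5, so the loss reduces to $(V+1)\log 2$, which is a convenient sanity check.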
In $\mathcal{D}_L$, each relation mention $z_i$ is heuristically associated with a set of candidate types $\mathcal{R}_i$. Existing embedding methods rely on either the local consistency assumption [He and Niyogi, 2004] (i.e., objects that are strongly connected tend to be similar) or the distributional assumption [Mikolov et al., 2013] (i.e., objects sharing similar neighbors tend to be similar) to model object associations. However, some associations between $z_i$ and $r \in \mathcal{R}_i$ are "false" associations, and adopting the above assumptions may incorrectly yield mentions of different types having similar vector representations. For example, in Fig. 7.1, mentions ("Obama," "U.S.," S1) and ("Obama," "U.S.," S2) have several candidate types in common (and thus high distributional similarity), but their true types are different (i.e., born_in vs. travel_to).
We specify the likelihood of whether the association between a relation mention and a candidate relation type is true as the relevance between these two kinds of objects (measured by the similarity between their currently estimated embedding vectors). To impose this idea, we model the associations between each linkable relation mention $z_i$ (in set $\mathcal{Z}_L$) and its noisy candidate relation type set $\mathcal{R}_i$ based on the following hypothesis.

A relation mention's embedding vector should be more similar (closer in the low-dimensional space) to its "most relevant" candidate type than to any non-candidate type.
Formally, for each relation mention $z_i$ we define the partial-label loss

$$\ell_i = \max\Big\{0,\ 1 - \max_{r \in \mathcal{R}_i} \phi(z_i, r) + \max_{r' \in \bar{\mathcal{R}}_i} \phi(z_i, r')\Big\}, \qquad (7.4)$$

where $\phi(z_i, r) = \mathbf{z}_i^T \mathbf{r}$ measures the relevance between $z_i$ and type $r$, and $\bar{\mathcal{R}}_i = \mathcal{R} \setminus \mathcal{R}_i$ denotes the non-candidate types. The intuition behind Eq. (7.4) is that, for relation mention $z_i$, the maximum similarity score over its candidate type set $\mathcal{R}_i$ should be greater than the maximum similarity score over the non-candidate types $\bar{\mathcal{R}}_i$. Minimizing $\ell_i$ forces $z_i$ to be embedded closer to its most "relevant" type in $\mathcal{R}_i$ than to any non-candidate type in $\bar{\mathcal{R}}_i$. This contrasts sharply with multi-label learning [Ling and Weld, 2012], where $z_i$ would be embedded closer to every candidate type than to any non-candidate type.
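The hinge structure of the partial-label loss in Eq. (7.4) is easy to state in code. In this sketch the relevance $\phi$ is taken as a dot product, and type embeddings are passed as rows of a matrix; names and the interface are illustrative:

```python
import numpy as np

def partial_label_loss(t_z, type_vecs, cand_idx):
    """Partial-label hinge loss: the best-matching candidate type should
    score higher, by a margin of 1, than the best non-candidate type.
    t_z: d-dim mention vector; type_vecs: K x d matrix of type embeddings;
    cand_idx: indices of the noisy candidate types for this mention."""
    scores = type_vecs @ t_z                      # phi(z_i, r) for all types
    cand = set(cand_idx)
    best_cand = max(scores[i] for i in cand)
    best_other = max(scores[i] for i in range(len(scores)) if i not in cand)
    return max(0.0, 1.0 - best_cand + best_other)
```

Note that only the single best candidate must win the margin, which is what lets the model tolerate wrong labels inside the candidate set.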
To faithfully model the types of relation mentions, we integrate the modeling of mention-feature co-occurrences and mention-type associations by the following objective:

$$O_Z = L_{ZF} + \sum_{i=1}^{N_L} \ell_i + \frac{\lambda}{2} \sum_{i=1}^{N_L} \|\mathbf{z}_i\|_2^2 + \frac{\lambda}{2} \sum_{k=1}^{K_r} \|\mathbf{r}_k\|_2^2, \qquad (7.5)$$

where the tuning parameter $\lambda > 0$ on the regularization terms controls the scale of the embedding vectors.
By doing so, text features, as complements to a mention's candidate types, also participate in modeling the relation mention embeddings and help identify a mention's most relevant type; the mention-type relevance is progressively estimated during model learning. For example, in the left column of Fig. 7.4, the context word "president" helps infer that the relation type president_of is more relevant (i.e., has higher similarity between the embedding vectors) to the relation mention ("Mr. Obama," "U.S.," S2) than the type born_in does.
Modeling Types of Entity Mentions. In a way similar to the modeling of types for relation
mentions, we follow Hypotheses 7.1 and 7.2 to model the types of entity mentions. In Fig. 7.2 (col. 2), for example, entity mentions "S1_Barack Obama" and "S3_Barack Obama" share multiple text features in the corpus, including the head token "Obama" and the context word "president," and thus tend to share the same entity types, such as politician and person (i.e., Hypothesis 7.1). Meanwhile, entity mentions "S1_Barack Obama" and "S2_Obama" have the same candidate entity types but share very few text features. This implies that their true type labels are likely different. Relevance between entity mentions and their true type labels should be progressively estimated based on the text features extracted from their local contexts (i.e., Hypothesis 7.2).
Formally, let vectors $\mathbf{m}_i, \mathbf{c}'_j, \mathbf{y}_k \in \mathbb{R}^d$ represent entity mention $m_i \in \mathcal{M}_L$, text feature (for entity mentions) $f_j \in \mathcal{F}_m$, and entity type $y_k \in \mathcal{Y}$ in a $d$-dimensional entity embedding space, respectively. We model the corpus-level co-occurrences between entity mentions and text features by second-order proximity as follows:

$$L_{MF} = -\sum_{m_i \in \mathcal{M}_L} \sum_{f_j \in \mathcal{F}_m} w_{ij} \log p(f_j \mid m_i), \qquad (7.6)$$
Figure 7.4: Illustrations of the partial-label associations, Hypothesis 7.2 (the left col.), and the
entity-relation interactions, Hypothesis 7.3 (the right col.).
where the conditional probability term $\log p(f_j \mid m_i)$ is defined, via negative sampling, as

$$\log \sigma(\mathbf{m}_i^T \mathbf{c}'_j) + \sum_{v=1}^{V} \mathbb{E}_{f_{j'} \sim P_n(f)} \big[ \log \sigma(-\mathbf{m}_i^T \mathbf{c}'_{j'}) \big].$$

By integrating the term $L_{MF}$ with the partial-label loss $\ell'_i = \max\big\{0,\ 1 - \max_{y \in \mathcal{Y}_i} \phi(m_i, y) + \max_{y' \in \bar{\mathcal{Y}}_i} \phi(m_i, y')\big\}$ for the $N'_L$ unique linkable entity mentions (in set $\mathcal{M}_L$), we define the objective function for modeling the types of entity mentions as follows:

$$O_M = L_{MF} + \sum_{i=1}^{N'_L} \ell'_i + \frac{\lambda}{2} \sum_{i=1}^{N'_L} \|\mathbf{m}_i\|_2^2 + \frac{\lambda}{2} \sum_{k=1}^{K_y} \|\mathbf{y}_k\|_2^2. \qquad (7.7)$$
Minimizing the objective $O_M$ yields an entity embedding space in which objects (e.g., entity mentions, text features) that are close to each other have similar types.
Modeling Entity-Relation Interactions. In reality, there exist different kinds of interactions between a relation mention $z = (m_1, m_2, s)$ and its entity mention arguments $m_1$ and $m_2$. One major kind of interaction is the correlation between the relation and entity types of these objects: the entity types of the two entity mentions provide good hints for determining the relation type of the relation mention, and vice versa. For example, in Fig. 7.4 (right column), knowing that entity mention "S4_US" is of type location (instead of organization) helps determine that relation mention ("Obama," "U.S.," S4) is more likely of relation type travel_to rather than relation types like president_of or citizen_of.
Intuitively, the entity types of the entity mention arguments constrain the search space for the relation types of the relation mention (e.g., it is unlikely to find an author_of relation
between an organization entity and a location entity). The proposed Hypotheses 7.1 and 7.2 model the types of relation mentions and entity mentions by learning a relation embedding space and an entity embedding space, respectively. The correlations between entity and relation types (and their embedding spaces) motivate us to model entity-relation interactions based on the following hypothesis.
For each relation mention $z = (m_1, m_2, s)$, the embedding vectors should satisfy the translation constraint $\mathbf{m}_1 + \mathbf{z} \approx \mathbf{m}_2$ (Hypothesis 7.3). Given the embedding vectors of any two members of $\{z, m_1, m_2\}$, say $\mathbf{z}$ and $\mathbf{m}_1$, Hypothesis 7.3 enforces $\mathbf{m}_1 + \mathbf{z} \approx \mathbf{m}_2$. This helps regularize the learning of vector $\mathbf{m}_2$ (which represents the type semantics of entity mention $m_2$) in addition to the information encoded by objective $O_M$ in Eq. (7.7). Such a "translating operation" between embedding vectors in a low-dimensional space has proven effective for embedding entities and relations in a structured knowledge base [Bordes et al., 2013]. We extend this idea to model the type correlations (and mutual constraints) between the embedding vectors of entity mentions and those of relation mentions, which live in two different low-dimensional spaces.
Specifically, we define the error function for the triple of a relation mention and its two entity mention arguments $(z, m_1, m_2)$ using the $\ell_2$ norm: $\tau(z) = \|\mathbf{m}_1 + \mathbf{z} - \mathbf{m}_2\|_2^2$. A small value of $\tau(z)$ indicates that the embedding vectors of $(z, m_1, m_2)$ do capture the type constraints. To enforce small errors between linkable relation mentions (in set $\mathcal{Z}_L$) and their entity mention arguments, we use a margin-based loss [Bordes et al., 2013] to formulate an objective function as follows:

$$O_{ZM} = \sum_{z_i \in \mathcal{Z}_L} \sum_{v=1}^{V} \max\big\{0,\ 1 + \tau(z_i) - \tau(z_v)\big\}, \qquad (7.8)$$
where $\{z_v\}_{v=1}^{V}$ are negative samples for $z_i$, i.e., each $z_v$ is randomly sampled from the negative sample set $\{(z', m_1, m_2)\} \cup \{(z, m'_1, m_2)\} \cup \{(z, m_1, m'_2)\}$ with $z' \in \mathcal{Z}_L$ and $m' \in \mathcal{M}_L$ [Bordes et al., 2013]. The intuition behind Eq. (7.8) is simple (see also the right column of Fig. 7.4): the embedding vectors of a relation mention and its entity mentions are modeled such that the translating error between them is smaller than the translating error of any negative sample.
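The translation error and the margin-based loss in Eq. (7.8) can be sketched directly; the symbol $\tau$ and the function names below follow the reconstruction above and are otherwise illustrative:

```python
import numpy as np

def translation_error(t_m1, t_z, t_m2):
    """tau(z) = || m_1 + z - m_2 ||_2^2, small when the entity and relation
    embeddings respect the translation constraint m_1 + z ~ m_2."""
    d = t_m1 + t_z - t_m2
    return float(d @ d)

def margin_loss(pos_triple, neg_triples):
    """One summand of O_ZM: the translation error of the true triple should
    be lower, by a margin of 1, than that of each corrupted triple."""
    tau_pos = translation_error(*pos_triple)
    return sum(max(0.0, 1.0 + tau_pos - translation_error(*neg))
               for neg in neg_triples)
```

Corrupted triples are produced by replacing one member of the triple with a random mention, exactly as the negative sample set above describes.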
A Joint Optimization Problem. Our goal is to embed all the available information on relation and entity mentions, relation and entity type labels, and text features into a $d$-dimensional entity space and a $d$-dimensional relation space, following the three proposed hypotheses. An intuitive
Table 7.4: Notations
solution is to collectively minimize the three objectives $O_Z$, $O_M$, and $O_{ZM}$, as the embedding vectors of entity and relation mentions are shared across them. To achieve this goal, we formulate a joint optimization problem as follows:

$$\min\ O = O_Z + O_M + O_{ZM}. \qquad (7.9)$$
Optimizing the global objective $O$ in Eq. (7.9) enables the learning of entity and relation embeddings to be mutually influenced, such that errors in each component can be constrained and corrected by the others. The joint embedding learning also helps the algorithm find the true types for each mention, beyond using text features alone.
In Eq. (7.9), one can also minimize a weighted combination of the three objectives $\{O_Z, O_M, O_{ZM}\}$ to model the importance of different signals, where the weights could be manually determined or automatically learned from data. We leave this as future work.
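The joint minimization can be pictured as interleaved stochastic updates of the shared parameters, one component objective at a time. This is a schematic only; the book's actual optimizer, sampling scheme, and update schedule may differ, and the callable-based interface is an assumption of the sketch:

```python
def joint_train(O_Z_step, O_M_step, O_ZM_step, params, epochs=10):
    """Schematic of minimizing O = O_Z + O_M + O_ZM: each *_step callable
    performs one stochastic update of the shared parameters for its
    component objective. Interleaving the updates lets errors in one
    component be constrained and corrected by the others."""
    for _ in range(epochs):
        for step in (O_Z_step, O_M_step, O_ZM_step):
            params = step(params)
    return params
```

Because the mention embeddings appear in all three objectives, each step sees the progress made by the other two, which is the mechanism behind the mutual correction described above.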
7.4 EXPERIMENTS
7.4.1 DATA PREPARATION AND EXPERIMENT SETTING
Our experiments use three public datasets1 from different domains. (1) NYT [Riedel et al.,
2010]: The training corpus consists of 1.18M sentences sampled from 294K 1987–2007 New
York Times news articles. 395 sentences are manually annotated by Hoffmann et al. [2011] to
form the test data; (2) Wiki-KBP [Ling and Weld, 2012]: It uses 1.5M sentences sampled
from 780K Wikipedia articles [Ling and Weld, 2012] as training corpus and 14K manually
1 The code and datasets used in this chapter can be downloaded at: https://github.com/shanzhenren/CoType.
annotated sentences from 2013 KBP slot filling assessment results [Ellis et al., 2014] as test data;
and (3) BioInfer [Pyysalo et al., 2007]: It consists of 1,530 manually annotated biomedical paper
abstracts as test data and 100K sampled PubMed paper abstracts as training corpus. Statistics
of the datasets are shown in Table 7.5.
Table 7.5: Statistics of the datasets in our experiments
Automatically Labeled Training Corpora. The NYT training corpus has been heuristically la-
beled using distant supervision following the procedure in Riedel et al. [2010]. For Wiki-KBP
and BioInfer training corpora, we utilized DBpedia Spotlight,2 a state-of-the-art entity disam-
biguation tool, to map the detected entity mentions M to Freebase entities. We then followed
the procedure introduced in Sections 7.2 and 7.3.1 to obtain candidate entity and relation types,
and constructed the training data $\mathcal{D}_L$. For target types, we discard from the test data the relation/entity types which cannot be mapped to Freebase, while keeping in the training data the Freebase entity/relation types not found in the test data (see Table 7.5 for the type statistics).
Feature Generation. Table 7.3 lists the set of text features of relation mentions used in our
experiments. We followed Ling and Weld [2012] to generate text features for entity mentions.
Dependency parse-based features were excluded as only POS-tagged corpus is given as input.
We used a six-word window to extract context features for each mention (three words on the left
and the right). We applied the Stanford CoreNLP tool [Manning et al., 2014] to get POS tags.
Brown clusters were derived for each corpus using public implementation.3 The same kinds of
features were used in all the compared methods in our experiments.
Evaluation Sets. For all three datasets, we used the provided training/test set partitions of the
corpora. In each dataset, relation mentions in sentences are manually annotated with their re-
lation types and the entity mention arguments are labeled with entity type-paths (see Table 7.5
2 http://spotlight.dbpedia.org/
3 https://github.com/percyliang/brown-cluster
for the statistics of test data). We further created a validation set by randomly sampling 10%
mentions from each test set and used the remaining part to form the evaluation set.
Compared Methods. We compared CoType with its variants which model parts of the pro-
posed hypotheses. Several state-of-the-art relation extraction methods (e.g., supervised, em-
bedding, neural network) were also implemented (or tested using their published codes):
(1) DS+Perceptron [Ling and Weld, 2012]: adopts multi-label learning on automatically labeled training data $\mathcal{D}_L$; (2) DS+Kernel [Mooney and Bunescu, 2005]: applies a bag-of-feature kernel [Mooney and Bunescu, 2005] to train an SVM classifier using $\mathcal{D}_L$; (3) DS+Logistic [Mintz et al., 2009]: trains a multi-class logistic classifier4 on $\mathcal{D}_L$; (4) DeepWalk [Perozzi et al., 2014]:
embeds mention-feature co-occurrences and mention-type associations as a homogeneous net-
work (with binary edges); (5) LINE [Tang et al., 2015]: uses second-order proximity model with
edge sampling on a feature-type bipartite graph (where edge weight wjk is the number of relation
mentions having feature $f_j$ and type $r_k$); (6) MultiR [Hoffmann et al., 2011]: a state-of-the-art distant supervision method which models noisy labels in $\mathcal{D}_L$ by multi-instance multi-label learning; (7) FCM [Gormley et al., 2015]: adopts a neural language model to perform compositional embedding; and (8) DS-Joint [Li and Ji, 2014]: jointly extracts entity and relation mentions using a structured perceptron on human-annotated sentences; we used $\mathcal{D}_L$ to train the model.
For CoType, besides the proposed full model, CoType, we compare: (1) CoType-RM: this variant only optimizes objective $O_Z$ to learn feature and type embeddings for relation mentions; and (2) CoType-TwoStep: it first optimizes $O_M$, then uses the learned entity mention embeddings $\{\mathbf{m}_i\}$ to initialize the minimization of $O_Z + O_{ZM}$; it represents a "pipeline" extraction paradigm.
To test the performance on entity recognition and typing, we also compare with several
entity recognition systems, including a supervised method HYENA [Yosef et al., 2012], distant
supervision methods (FIGER [Ling and Weld, 2012], Google [Gillick et al., 2014], WSABIE [Yo-
gatama et al., 2015]), and a noise-robust approach PLE [Ren et al., 2016c].
Parameter Settings. In our testing of CoType and its variants, we set $\alpha = 0.025$ and two further tuning parameters to 0.35 and $10^{-4}$, based on analysis on the validation sets. As the convergence criterion, we stopped the loop in the algorithm when the relative change of $O$ in Eq. (7.9) was smaller than $10^{-4}$. For fair comparison, the dimensionality of the embeddings $d$ was set to 50 and the number of negative samples $V$ was set to 5 for all embedding methods, as used in Tang et al. [2015]. For the other tuning parameters in the compared methods, we tuned them on the validation sets and picked the values which led to the best performance.
Evaluation Metrics. For entity recognition and typing, we use strict, micro, and macro F1 scores, as used in Ling and Weld [2012], for evaluating both the detected entity mention boundaries and the predicted entity types. We consider two settings in the evaluation of relation extraction. For relation classification, ground-truth relation mentions are given and the None label is excluded; we focus on testing type classification accuracy. For relation extraction, we adopt the standard Precision
4 We use liblinear package from https://github.com/cjlin1/liblinear
(P), Recall (R), and F1 score [Bach and Badaskar, 2007, Mooney and Bunescu, 2005]. Note that
all our evaluations are sentence-level (i.e., context-dependent), as discussed in Hoffmann et al.
[2011].
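As a concrete reference for these metrics, precision, recall, and F1 over sets of extracted relation mentions can be computed as below. This is a generic helper, not code from the book; the tuple encoding of a mention is an assumption:

```python
def prf1(predicted, gold):
    """Standard precision/recall/F1 over sets of extracted relation
    mentions, each represented e.g. as (sentence_id, head, tail, type).
    Sentence-level evaluation follows from keeping the sentence id in
    the tuple, so the same fact in two sentences counts twice."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```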
Table 7.6: Performance comparison of entity recognition and typing (using strict, micro and
macro metrics [Ling and Weld, 2012]) on the three datasets
Table 7.7: Performance comparison on relation classification accuracy over ground-truth relation
mentions on the three datasets
Figure 7.5: Precision-recall curves of relation extraction on the NYT and BioInfer datasets. A similar trend is also observed on the Wiki-KBP dataset.
4. Scalability. In addition to the runtimes shown in Table 7.8, Fig. 7.6a tests the scalability of CoType against the other methods by running them on BioInfer corpora sampled at different ratios. CoType demonstrates a linear runtime trend (which validates our time complexity analysis in Section 7.3.3) and is the only method capable of processing the full-size dataset without significant time cost.
7.5 DISCUSSION
1. Example output on news articles. Table 7.9 shows the output of CoType, MultiR, and Logistic on two news sentences from the Wiki-KBP dataset. CoType extracts more relation mentions (e.g., children) and predicts entity/relation types with better accuracy. Also, CoType can jointly extract typed entity and relation mentions, while the other methods cannot (or need to do so incrementally).
2. Testing the effect of training corpus size. Figure 7.6b shows the performance trend on the BioInfer dataset when varying the sampling ratio (the subset of relation mentions randomly sampled from the training set). The F1 scores of all three methods improve as the sampling ratio increases. CoType performs best in all cases, which demonstrates its robust performance across corpora of various sizes.
3. Studying the effect of entity type errors on relation classification. To investigate the "error propagation" issue of an incremental pipeline, we test the changes in relation classification performance
Table 7.8: Performance comparison on end-to-end relation extraction (at the highest F1 point) on the three datasets: NYT [Riedel et al., 2010, Hoffmann et al., 2011], Wiki-KBP [Ling and Weld, 2012, Ellis et al., 2014], and BioInfer [Pyysalo et al., 2007]

Method                               NYT (Prec/Rec/F1/Time)     Wiki-KBP (Prec/Rec/F1/Time)  BioInfer (Prec/Rec/F1/Time)
DS+Perceptron [Ling and Weld, 2012]  0.068/0.641/0.123/15 min   0.233/0.457/0.308/7.7 min    0.357/0.279/0.313/3.3 min
DS+Kernel [Mooney and Bunescu, 2005] 0.095/0.490/0.158/56 hr    0.108/0.239/0.149/9.8 hr     0.333/0.011/0.021/4.2 hr
DS+Logistic [Mintz et al., 2009]     0.258/0.393/0.311/25 min   0.296/0.387/0.335/14 min     0.572/0.255/0.353/7.4 min
DeepWalk [Perozzi et al., 2014]      0.176/0.224/0.197/1.1 hr   0.101/0.296/0.150/27 min     0.370/0.058/0.101/8.4 min
LINE [Tang et al., 2015]             0.335/0.329/0.332/2.3 min  0.360/0.257/0.299/1.5 min    0.360/0.275/0.312/35 sec
MultiR [Hoffmann et al., 2011]       0.338/0.327/0.333/5.8 min  0.325/0.278/0.301/4.1 min    0.459/0.221/0.298/2.4 min
FCM [Gormley et al., 2015]           0.553/0.154/0.240/1.3 hr   0.151/0.500/0.301/25 min     0.535/0.168/0.255/9.7 min
DS-Joint [Li and Ji, 2014]           0.574/0.256/0.354/22 hr    0.444/0.043/0.078/54 hr      0.102/0.001/0.002/3.4 hr
CoType-RM                            0.467/0.380/0.419/2.6 min  0.342/0.339/0.340/1.5 min    0.482/0.406/0.440/42 sec
CoType-TwoStep                       0.368/0.446/0.404/9.6 min  0.347/0.351/0.349/6.1 min    0.502/0.405/0.448/3.1 min
CoType                               0.423/0.511/0.463/4.1 min  0.348/0.406/0.369/2.5 min    0.536/0.424/0.474/78 sec
Figure 7.6: (a) Scalability study on CoType and the compared methods; and (b) performance
changes of relation extraction with respect to sampling ratio of relation mentions on the Bioinfer
dataset.
by (1) training models without entity types as features; (2) using entity types predicted by FIGER [Ling and Weld, 2012] as features; and (3) using ground-truth ("perfect") entity types as features. Figure 7.7 summarizes the accuracy of CoType, its variants, and the compared methods. We observe only marginal improvement when using FIGER-predicted types but significant improvement when using ground-truth entity types, which validates the error propagation issue. Moreover, we find that CoType achieves an accuracy close to that of the next best method (i.e., DS+Logistic+Gold entity type). This demonstrates the effectiveness of our proposed joint entity and relation embedding.
Figure 7.7: Study of entity type error propagation on the BioInfer dataset.
Table 7.9: Example output of CoType and the compared methods on two news sentences from
the Wiki-KBP dataset
Sentence 1: "Blake Edwards, a prolific filmmaker who kept alive the tradition of slapstick comedy, died Wednesday of pneumonia at a hospital in Santa Monica."
  MultiR [Hoffmann et al., 2011]: r*: person:country_of_birth; Y1*: {N/A}; Y2*: {N/A}
  Logistic [Mintz et al., 2009]:  r*: per:country_of_birth; Y1*: {person}; Y2*: {country}
  CoType:                         r*: per:place_of_death; Y1*: {person, artist, director}; Y2*: {location, city}

Sentence 2: "Anderson is survived by his wife Carol, sons Lee and Albert, daughter Shirley Englebrecht and nine grandchildren."
  MultiR [Hoffmann et al., 2011]: r*: None; Y1*: {N/A}; Y2*: {N/A}
  Logistic [Mintz et al., 2009]:  r*: None; Y1*: {person}; Y2*: {person, politician}
  CoType:                         r*: person:children; Y1*: {person}; Y2*: {person}
7.6 SUMMARY
This work studies domain-independent, joint extraction of typed entities and relations from text with distant supervision. The proposed CoType framework runs a domain-agnostic segmentation algorithm to mine entity mentions and formulates the joint entity and relation mention typing problem as a global embedding problem. We design a noise-robust objective to faithfully model the noisy type labels from distant supervision, and capture the mutual dependencies between entities and relations based on the translation embedding assumption. Experimental results demonstrate the effectiveness and robustness of CoType on text corpora of different domains.
Interesting future work includes incorporating the pseudo-feedback idea [Xu et al., 2013] to reduce false negative type labels in the training data, modeling type correlation in the given type hierarchy [Ren et al., 2016c], and performing type inference for test entity mentions and relation mentions jointly. CoType relies on minimal linguistic assumptions (i.e., only a POS-tagged corpus is required) and thus can be extended to different languages for which pre-trained POS taggers are available.
CHAPTER 8
Pattern-Enhanced Embedding
Learning for Relation
Extraction
Meng Qu, Department of Computer Science, University of Illinois at
Urbana-Champaign
Relation extraction is an important task in data mining and natural language processing. Given a text corpus, relation extraction aims at extracting a set of relation instances (i.e., pairs of entities and their relations) based on some given examples. Many efforts [Culotta and Sorensen, 2004, Mooney and Bunescu, 2006, Ren et al., 2017a] have been devoted to sentence-level relation extraction, where the goal is to predict the relation for a pair of entities mentioned in a sentence (e.g., predicting the relation between "Beijing" and "China" in sentence 1 of Fig. 8.1).
Figure 8.1: Illustration of weakly supervised relation extraction. Given a text corpus and a few
relation instances as seeds, the goal is to extract more instances from the corpus.
8.1 OVERVIEW AND MOTIVATION
Despite their wide applications, these studies usually require a large number of human-annotated sentences as training data, which are expensive to obtain. In many cases (e.g., KB completion [Xu
et al., 2014]), it is also desirable to extract a set of relation instances by consolidating evidences
from multiple sentences in corpora, which cannot be directly achieved by these studies. Instead
of looking at individual sentences, corpus-level relation extraction [Bing et al., 2015, Hoffmann
et al., 2011, Mintz et al., 2009, Riedel et al., 2013, Zeng et al., 2015] identifies relation in-
stances from text corpora using evidences from multiple sentences. This also makes it possible
to apply weakly supervised methods based on corpus-level statistics [Agichtein and Gravano,
2000, Curran et al., 2007]. Such weakly supervised approaches usually take a few relation in-
stances as seeds, and extract more instances by consolidating redundant information collected
from large corpora. The extracted instances can serve as extra knowledge in various downstream
applications, including KB completion [Riedel et al., 2013, Toutanova et al., 2015], corpus-
level relation extraction [Lin et al., 2016, Zeng et al., 2015], hypernym discovery [Shwartz et
al., 2016, Snow et al., 2005], and synonym discovery [Qu et al., 2017, Wang et al., 2016].
8.1.1 CHALLENGES
In this work, we focus on corpus-level relation extraction in the weakly supervised setting. There
are broadly two types of weakly supervised approaches for corpus-level relation extraction.
Among them, pattern-based approaches predict the relation of an entity pair from multiple
sentences mentioning both entities. To do that, traditional approaches [Nakashole et al., 2012,
Schmitz et al., 2012, Yahya et al., 2014] extract textual patterns (e.g., tokens between a pair of
entities) and new relation instances in a bootstrapping manner. However, many relations could
be expressed in a variety of ways. Due to such diversity, these approaches often have difficulty
matching the learned patterns to unseen contexts, leading to the problem of semantic drift [Cur-
ran et al., 2007] and inferior performance. For example, with the given instance “(Beijing, Capital
of, China)” in Fig. 8.1, “[Head], the capital of [Tail]” will be extracted as a textual pattern from
sentence 1. But we have difficulty in matching the pattern to sentence 2 even though both sen-
tences refer to the same relation “Capital of.” Recent approaches [Liu et al., 2015, Xu et al.,
2015] try to overcome the sparsity issue of textual patterns by encoding textual patterns with
neural networks, so that pattern matching can be replaced by similarity measurement between
vector representations. However, these approaches typically rely on a large number of labeled instances to train effective models [Shwartz et al., 2016], making it hard to deal with the weakly supervised setting.
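A toy version of the "tokens between the entity pair" textual pattern described above can be sketched as follows; real systems use far richer patterns (POS constraints, wildcards, syntactic paths), and the placeholder convention is illustrative:

```python
def extract_pattern(tokens, head, tail):
    """Replace the two entity arguments in a tokenized sentence with
    placeholders and keep the tokens between them, producing patterns
    like '[Head] , the capital of [Tail]'."""
    i, j = tokens.index(head), tokens.index(tail)
    if i > j:
        i, j = j, i
        h_tag, t_tag = "[Tail]", "[Head]"
    else:
        h_tag, t_tag = "[Head]", "[Tail]"
    return " ".join([h_tag] + tokens[i + 1:j] + [t_tag])
```

The brittleness discussed above is visible here: the pattern extracted from one sentence matches a new sentence only if the intervening tokens are identical, which is exactly why unseen paraphrases cause semantic drift.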
Alternatively, distributional approaches resort to the corpus-level co-occurrence statistics
of entities. The basic idea is to learn low-dimensional representations of entities to preserve
such statistics, so that entities with similar semantic meanings tend to have similar representa-
tions. With entity representations, a relation classifier can be learned using the labeled relation
instances, which takes entity representations as features and predicts the relation of a pair of en-
tities. To learn entity representations, some approaches [Mikolov et al., 2013, Pennington et al.,
2014, Tang et al., 2015] only consider the given text corpus. Despite the unsupervised property,
their performance is usually limited due to the lack of supervision [Xu et al., 2014]. To learn
more effective representations for relation extraction, some other approaches [Wang et al., 2014,
Xu et al., 2014] jointly learn entity representations and relation classification using the labeled
instances. However, similar to pattern-based approaches, distributional approaches also require a considerable number of relation instances to achieve good performance [Xu et al., 2014], and such instances are usually hard to obtain in the weakly supervised setting.
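The distributional idea of a relation classifier over entity representations can be sketched as follows. The chapter does not fix a parameterization here, so the bilinear score below (one common choice) and all names are assumptions of this sketch:

```python
import numpy as np

def score_relation(e_head, e_tail, W_r, b_r):
    """Score an entity pair under relation r with a bilinear form
    s(h, r, t) = h^T W_r t + b_r over the learned entity representations."""
    return float(e_head @ W_r @ e_tail + b_r)

def predict_relation(e_head, e_tail, relation_params):
    """Pick the highest-scoring relation for an entity pair; relation_params
    maps each relation name to its (W_r, b_r) parameters."""
    return max(relation_params,
               key=lambda r: score_relation(e_head, e_tail, *relation_params[r]))
```

Because the score depends only on the entity representations, the classifier can rate pairs that never co-occur in a sentence, which is what the KBC experiments below exploit.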
Figure 8.2: Illustration of the modules. The pattern module aims to learn reliable textual patterns
for each relation. The distributional module tries to learn entity representations and a score
function to estimate the quality of each instance.
Figure 8.3: Comparison with existing integration frameworks. Existing frameworks totally rely
on the seed instances to provide supervision. Our framework encourages both modules to pro-
vide extra supervision for each other.
In the objective function, P represents the parameters of the pattern module, that is, a given
number of reliable patterns for each target relation. D denotes the parameters of the distri-
butional module, that is, entity representations and a score function that serves as a relation
classifier. The objective function consists of three terms. Among them, Op is the objective of the
pattern module, in which we leverage the given seed instances for pattern selection. Od is the
objective of the distributional module, which learns relevant parameters under the guidance of
seed instances. Finally, Oi models the interactions of both modules.
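The decomposition just described can be written schematically as below. This is only the structural shape of the objective as the text states it: the callable interface is an assumption, and any weights on the three terms are omitted:

```python
def repel_objective(O_p, O_d, O_i, P, D):
    """Overall objective as described in the text: a pattern-module term
    O_p(P), a distributional-module term O_d(D), and an interaction term
    O_i(P, D) through which each module supervises the other."""
    return O_p(P) + O_d(D) + O_i(P, D)
```

The interaction term is the key design choice: it couples the two parameter sets, so optimizing the sum forces the modules to agree rather than fit the seed instances independently.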
8.3 EXPERIMENT
In this section, we evaluate our approach on two downstream applications: KB completion with
text corpora (KBC) and corpus-level relation extraction (RE). In KB completion with text cor-
pora, the key task is to predict the missing relationships between each pair of entities in KBs.
Since some pairs of entities may not co-occur in any sentences of the given corpus, the learned
pattern module cannot provide information for predicting their relations. Therefore, for KBC we
only use the entity representations and score function learned by the distributional module for
extraction, and we expect to show that the pattern module can provide extra seeds during train-
ing, yielding a more effective distributional module. For corpus-level RE, it aims at predicting
the relation of a pair of entities from several sentences mentioning both of them. In this case,
the reliable patterns learned by the pattern module can capture the local context information
from the sentences. Therefore, we focus on utilizing the learned pattern module for prediction
in RE, and we expect to show that the distributional module can enhance the pattern module
by providing extra supervision to select reliable patterns.

1. Knowledge Base Completion with Text Corpora (KBC). We present the quantitative re-
sults in Table 8.1. For the approach only considering the given seed instances (TransE), we see
the performance is very limited due to the scarcity of seeds. Along the other line, the approach
considering text corpora (word2vec) achieves relatively better results, but is still far from sat-
isfactory, since it ignores the supervision from the seed instances. If we consider both the text
corpus and seed instances for entity representation learning (RK), we obtain much better re-
sults. Moreover, by further jointly training a pattern model (DPE, CONV), the hits ratio can
be further significantly improved.
For our proposed approach, with only the distributional module (REPEL-D), it al-
ready outperforms all the baseline approaches. Compared with DPE, the performance gain of
REPEL-D mainly comes from its score function, which utilizes the seed instances and can thus
better model different relations. Compared with CONV, REPEL-D achieves better
results, as the distributional information in text corpora can be better captured.

Table 8.1: Quantitative results on the KBC task

Moreover, by encouraging the collaboration of both modules (REPEL), the results are further significantly
improved. This observation demonstrates that the pattern module can indeed help improve the
distributional module by providing some highly confident instances.
Overall, our approach achieves quite impressive results on the KB completion task com-
pared with several strong baseline approaches. Also, the pattern module can indeed enhance the
distributional module with our co-training framework.
2. Corpus-level Relation Extraction (RE). Next, we show the results on the corpus-level re-
lation extraction task. We present the quantitative results in Table 8.2. For the approaches us-
ing textual patterns (PATTY, Snowball), we see the results are quite limited especially on the
NYT dataset. This is because they discover informative patterns in a bootstrapping way, which
can lead to the semantic drift problem [Curran et al., 2007] and thus harm the performance.
For other neural network-based pattern approaches (PathCNN, CNN-ATT, PCNN-ATT),
although they have proven very effective when the given instances are abundant, their per-
formance in the weakly supervised setting is not satisfactory. The reason is that they typically
deploy complicated convolutional layers or recurrent layers in their model, which rely on mas-
sive relation instances to tune. However, in our setting, the instances are very limited, leading to
their poor performance. For the integration approach (LexNET), although it incorporates the
distributional information, the performance is still quite limited especially on the NYT dataset.
This is because the joint training framework of LexNET also requires considerable training
instances.
Overall, in the weakly supervised setting, our approach achieves results comparable to
the neural methods. Besides, the distributional module can indeed improve
the pattern module with our co-training framework.
Table 8.2: Quantitative results on the RE task
CHAPTER 9
Heterogeneous Supervision for
Relation Extraction
One of the most important tasks toward text understanding is to detect and categorize semantic
relations between two entities in a given context. Typically, existing methods follow the su-
pervised learning paradigm, and require extensive annotations from domain experts, which are
costly and time-consuming. To alleviate this drawback, attempts have been made to build rela-
tion extractors with a small set of seed instances or human-crafted patterns [Carlson et al., 2010,
Nakashole et al., 2011], based on which more patterns and instances will be iteratively generated
by bootstrap learning. However, these methods often suffer from semantic drift [Mintz et al.,
2009]. Besides, KBs like Freebase have been leveraged to automatically generate training data
and provide distant supervision [Mintz et al., 2009]. Nevertheless, for many domain-specific
applications, distant supervision is either non-existent or insufficient (usually less than 25% of
relation mentions are covered [Ling and Weld, 2012, Ren et al., 2015]).
Our solution is to identify each labeling function's proficient subsets based on context information,
trust labeling functions only on these subsets, and avoid assuming global source consistency.
Meanwhile, embedding methods have demonstrated great potential in capturing seman-
tic meanings, while also reducing the dimensionality of overwhelming text features. Here, we present
REHession, a novel framework capturing context’s semantic meaning through representation
learning, and conduct both relation extraction and true label discovery in a context-aware man-
ner. Specifically, as depicted in Fig. 9.1, we embed relation mentions in a low-dimensional vec-
tor space, where similar relation mentions tend to have similar relation types and annotations.
“True” labels are further inferred based on the reliability of labeling functions, which are calcu-
lated with their proficient subsets’ representations. Then, these inferred true labels would serve
as supervision for all components, including context representation, true label discovery, and re-
lation extraction. Besides, the context representation bridges relation extraction with true label
discovery, and allows them to enhance each other.
9.2 PRELIMINARIES
In this section, we formally define relation extraction and heterogeneous supervision,
including the format of labeling functions.
1. After being extracted from context, text features are embedded in a low-dimensional space
by representation learning (see Fig. 9.2).
2. Text feature embeddings are utilized to calculate relation mention embeddings (see
Fig. 9.2).
3. With relation mention embeddings, true labels are inferred by calculating labeling func-
tions’ reliabilities in a context-aware manner (see Fig. 9.1).
4. Inferred true labels would “supervise” all components to learn model parameters (see
Fig. 9.1).
We now proceed by introducing these components of the model in further detail.
We represent each text feature $f_i$ as a vector $\mathbf{v}_i \in \mathbb{R}^{n_v}$, and we aim to maximize the following log likelihood: $\sum_{c \in C_l} \sum_{f_i, f_j \in c} \log p(f_i \mid f_j)$, where

$$p(f_i \mid f_j) = \frac{\exp(\mathbf{v}_i^{T} \mathbf{v}_j)}{\sum_{f_k \in \mathcal{F}} \exp(\mathbf{v}_i^{T} \mathbf{v}_k)}.$$

However, optimizing this likelihood directly is impractical because computing $\nabla p(f_i \mid f_j)$ requires a summation over all text features, whose number exceeds $10^7$ in our case. In order to perform efficient optimization, we adopt the negative sampling technique [Mikolov et al., 2013] to avoid this summation. Accordingly, we replace the log likelihood with Eq. (9.1):

$$J_E = \sum_{c \in C_l} \sum_{f_i, f_j \in c} \Big( \log \sigma(\mathbf{v}_i^{T} \mathbf{v}_j) + \sum_{k=1}^{V} \mathbb{E}_{f_{k'} \sim \hat{P}} \big[ \log \sigma(-\mathbf{v}_i^{T} \mathbf{v}_{k'}) \big] \Big), \qquad (9.1)$$

where $\hat{P}$ is the noise distribution used in Mikolov et al. [2013], $\sigma$ is the sigmoid function, and $V$ is the number of negative samples.
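A minimal numpy sketch of this negative-sampling objective follows. The feature vocabulary, co-occurrence pairs, noise distribution, and sample count are all toy values chosen for illustration, not the chapter's actual data:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_objective(V_emb, cooc_pairs, noise_p, V, rng):
    """J_E: for each co-occurring feature pair (i, j), reward v_i . v_j and
    penalize v_i . v_k' for V negative features k' drawn from noise P-hat."""
    total = 0.0
    for i, j in cooc_pairs:
        total += np.log(sigmoid(V_emb[i] @ V_emb[j]))
        negs = rng.choice(len(V_emb), size=V, p=noise_p)
        total += np.sum(np.log(sigmoid(-(V_emb[negs] @ V_emb[i]))))
    return total

rng = np.random.default_rng(0)
V_emb = rng.normal(size=(6, 4))   # 6 toy text features, 4-dim embeddings
noise_p = np.full(6, 1 / 6)       # uniform noise distribution P-hat
J_E = neg_sampling_objective(V_emb, [(0, 1), (2, 3)], noise_p, V=3, rng=rng)
```

Gradient ascent on this objective touches only the sampled negatives, which is what makes training feasible at a 10^7-feature scale.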
Relation Mention Representation. With the text feature embeddings learned by Eq. (9.1), a naive
way to represent a relation mention is to concatenate or average its text feature embeddings.
However, the text feature embeddings may lie in a different semantic space from relation types.
Thus, instead of simple heuristic rules like concatenation or averaging, we directly learn a mapping
$g$ from text feature representations to relation mention representations [Gysel et al., 2016a,b]
(see Fig. 9.2):

$$\mathbf{z}_c = g(c) = \tanh\Big( W \cdot \frac{1}{|c|} \sum_{f_i \in c} \mathbf{v}_i \Big). \qquad (9.2)$$
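Eq. (9.2) is a one-line computation; the sketch below uses random toy embeddings and an arbitrary mapping matrix W purely to show the shapes involved:

```python
import numpy as np

def relation_mention_repr(feature_vecs, W):
    # z_c = tanh(W . mean of the mention's text-feature embeddings), Eq. (9.2)
    return np.tanh(W @ np.mean(feature_vecs, axis=0))

rng = np.random.default_rng(1)
feat = rng.normal(size=(5, 8))   # 5 text features of mention c, each 8-dim
W = rng.normal(size=(4, 8))      # learned map from feature space to 4-dim mention space
z_c = relation_mention_repr(feat, W)
```

The tanh keeps every coordinate of z_c in (-1, 1), which stabilizes the dot products used later against labeling-function and relation-type vectors.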
Then the correctness of annotation $o_{c,i}$, $t_{c,i} = \delta(o_{c,i} = o_c^*)$, would be generated. Furthermore, we assume $p(t_{c,i} = 1 \mid s_{c,i} = 1) = \phi_1$ and $p(t_{c,i} = 1 \mid s_{c,i} = 0) = \phi_0$ to be constant for all relation mentions and labeling functions.

Because $s_{c,i}$ is not used in other components of our framework, we integrate out $s_{c,i}$ and write the log likelihood as

$$J_T = \sum_{o_{c,i} \in O} \log \Big( \sigma(\mathbf{z}_c^{T} \mathbf{l}_i)\, \phi_1^{\delta(o_{c,i} = o_c^*)} (1 - \phi_1)^{\delta(o_{c,i} \neq o_c^*)} + \big(1 - \sigma(\mathbf{z}_c^{T} \mathbf{l}_i)\big)\, \phi_0^{\delta(o_{c,i} = o_c^*)} (1 - \phi_0)^{\delta(o_{c,i} \neq o_c^*)} \Big). \qquad (9.4)$$

Note that $o_c^*$ is a hidden variable rather than a model parameter, and $J_T$ is the likelihood of $t_{c,i} = \delta(o_{c,i} = o_c^*)$. Thus, we first infer $o_c^* = \arg\max_{o_c^*} J_T$, and then train the true label discovery model by maximizing $J_T$.
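The inference step can be sketched concretely. The reliability values below stand in for sigma(z_c . l_i) and the phi constants are toy settings; the point is only that the candidate label maximizing the marginal likelihood wins:

```python
import math

def infer_true_label(annotations, reliabilities, phi1=0.9, phi0=0.1):
    """Pick the candidate o_c* maximizing the marginal likelihood of Eq. (9.4).
    annotations: labels o_{c,i} emitted by each labeling function;
    reliabilities: sigma(z_c . l_i) per labeling function (toy values here)."""
    candidates = set(annotations)
    def log_lik(o_star):
        total = 0.0
        for o, s in zip(annotations, reliabilities):
            match = (o == o_star)
            p = (s * (phi1 if match else 1 - phi1)
                 + (1 - s) * (phi0 if match else 1 - phi0))
            total += math.log(p)
        return total
    return max(candidates, key=log_lik)

# A reliable labeling function says "born_in"; an unreliable one says "president_of".
label = infer_true_label(["born_in", "president_of"], [0.9, 0.2])
```

With these numbers the "born_in" candidate has likelihood 0.82 * 0.74 against 0.18 * 0.26 for "president_of", so the reliable function's vote dominates.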
where $\mathbf{t}_i \in \mathbb{R}^{n_z}$ is the representation of relation type $r_i$. Moreover, with the inferred true label $o_c^*$, the relation extraction model can be trained as a multi-class classifier. Specifically, we use Eq. (9.5) to approximate the distribution

$$p(r_i \mid o_c^*) = \begin{cases} 1 & r_i = o_c^* \\ 0 & r_i \neq o_c^*, \end{cases} \qquad (9.6)$$

where $\mathrm{KL}\big(p(\cdot \mid \mathbf{z}_c) \,\|\, p(\cdot \mid o_c^*)\big)$ is the KL-divergence from $p(r_i \mid o_c^*)$ to $p(r_i \mid \mathbf{z}_c)$, and $p(r_i \mid \mathbf{z}_c)$ and $p(r_i \mid o_c^*)$ have the forms of Eqs. (9.5) and (9.6).
Collectively optimizing Eq. (9.8) allows the heterogeneous supervision to guide all three components, while these components refine the context representation and enhance each other.

In order to solve the joint optimization problem in Eq. (9.8) efficiently, we adopt stochastic gradient descent to update $\{W, \mathbf{v}, \mathbf{v}', \mathbf{l}, \mathbf{t}\}$ iteratively; $o_c^*$ is estimated by maximizing $J_T$ after calculating $\mathbf{z}_c$. Additionally, we apply the widely used dropout tech-
niques [Srivastava et al., 2014] to prevent overfitting and improve generalization performance.
The learning process of REHession is summarized as follows. In each iteration, we sample
a relation mention $c$ from $C_l$, then sample $c$'s text features and conduct the text features'
representation learning. After calculating the representation of $c$, we infer its true label
$o_c^*$ based on our true label discovery model, and finally update the model parameters based on $o_c^*$.
9.4 EXPERIMENTS
In this section, we empirically validate our method by comparing to the state-of-the-art relation
extraction methods on news and Wikipedia articles.
Given the experimental setup described above, the evaluation scores of relation classification
and relation extraction on the two datasets, averaged over ten runs, are summarized in Table 9.2.
The comparison shows that the NL strategy yields better performance than the TD strat-
egy, since the true labels inferred by investment are actually wrong for many instances. On the
other hand, our method introduces context-awareness to true label discovery, while the inferred
true label guides the relation extractor achieving the best performance. This observation justifies
the motivation of avoiding the source consistency assumption and the effectiveness of proposed
true label discovery model.
One can also observe that the difference between REHession and the compared methods is
more significant on the NYT dataset than on the Wiki-KBP dataset. This accords with the fact
that the NYT dataset contains more conflicts than the Wiki-KBP dataset, and matches the intuition
that our method has greater advantages when labels conflict more.
Among the four tasks, the relation classification task on the Wiki-KBP dataset has the highest
label quality (i.e., the lowest conflicting-label ratio) but the fewest training instances, and
CoType-RM and DSL reach relatively better performance among all compared methods. CoType-RM per-
forms much better than DSL on the Wiki-KBP relation classification task, while DSL achieves better or
similar performance to CoType-RM on the other tasks. This may be because the representation
learning method generalizes better, and thus performs better when the training set is
small; however, it is more vulnerable to noisy labels than DSL. Our method also em-
ploys embedding techniques, and further integrates context-aware true label discovery to de-noise
labels, making the embedding method robust and yielding the best performance on all
tasks.
Table 9.2: Performance comparison of relation extraction (Prec/Rec/F1, on NYT and Wiki-KBP) and relation classification (Accuracy, on NYT and Wiki-KBP)

| Method | NYT Prec | NYT Rec | NYT F1 | Wiki-KBP Prec | Wiki-KBP Rec | Wiki-KBP F1 | NYT Acc | Wiki-KBP Acc |
|---|---|---|---|---|---|---|---|---|
| NL+FIGER | 0.2364 | 0.2914 | 0.2606 | 0.2048 | 0.4489 | 0.2810 | 0.6598 | 0.6226 |
| NL+BFK | 0.1520 | 0.0508 | 0.0749 | 0.1504 | 0.3543 | 0.2101 | 0.6905 | 0.5000 |
| NL+DSL | 0.4150 | 0.5414 | 0.4690 | 0.3301 | 0.5446 | 0.4067 | 0.7954 | 0.6355 |
| NL+MultiR | 0.5196 | 0.2755 | 0.3594 | 0.3012 | 0.5296 | 0.3804 | 0.7059 | 0.6484 |
| NL+FCM | 0.4170 | 0.2890 | 0.3414 | 0.2523 | 0.5258 | 0.3410 | 0.7033 | 0.5419 |
| NL+CoType-RM | 0.3967 | 0.4049 | 0.3977 | 0.3701 | 0.4767 | 0.4122 | 0.6485 | 0.6935 |
| TD+FIGER | 0.3664 | 0.3350 | 0.3495 | 0.2650 | 0.5666 | 0.3582 | 0.7059 | 0.6355 |
| TD+BFK | 0.1011 | 0.0504 | 0.0670 | 0.1432 | 0.1935 | 0.1646 | 0.6292 | 0.5032 |
| TD+DSL | 0.3704 | 0.5025 | 0.4257 | 0.2950 | 0.5757 | 0.3849 | 0.7570 | 0.6452 |
| TD+MultiR | 0.5232 | 0.2736 | 0.3586 | 0.3045 | 0.5277 | 0.3810 | 0.6061 | 0.6613 |
| TD+FCM | 0.3394 | 0.3325 | 0.3360 | 0.1964 | 0.5645 | 0.2914 | 0.6803 | 0.5645 |
| TD+CoType-RM | 0.4516 | 0.3499 | 0.3923 | 0.3107 | 0.5368 | 0.3879 | 0.6409 | 0.6890 |
| REHession | 0.4122 | 0.5726 | 0.4792 | 0.3677 | 0.4933 | 0.4208 | 0.8381 | 0.7277 |
9.5 SUMMARY
In this chapter, we proposed REHession, an embedding framework for extracting relations under het-
erogeneous supervision. When dealing with heterogeneous supervision, one unique challenge
is how to resolve the conflicts generated by different labeling functions. Accordingly, we go beyond
the "source consistency assumption" of prior works and leverage context-aware embeddings to
induce proficient subsets. The resulting framework bridges true label discovery and relation ex-
traction through context representation, allowing them to mutually enhance each other. Experi-
mental evaluation on two real-world datasets justifies the necessity of context-awareness, the
quality of the inferred true labels, and the effectiveness of the proposed framework.
CHAPTER 10
Indirect Supervision:
Leveraging Knowledge from
Auxiliary Tasks
Zeqiu Wu, Department of Computer Science, University of Illinois at
Urbana-Champaign
Typically, relation extraction (RE) systems rely on training data, primarily acquired via human
annotation, to achieve satisfactory performance. However, such a manual labeling process can be
costly and does not scale when adapting to other domains (e.g., the biomedical domain). In addition,
when the number of types of interest becomes large, handcrafted training data generation
becomes error-prone. To alleviate such an exhaustive process, the recent trend has shifted toward
the adoption of distant supervision (DS).
DS replaces the manual training data generation with a pipeline that automatically links texts to
a KB. The pipeline has the following steps: (1) detect entity mentions in text; (2) map detected
entity mentions to entities in KB; and (3) assign, to the candidate type set of each entity mention
pair, all KB relation types between their KB-mapped entities. However, the noise introduced
to the automatically generated training data is not negligible. There are two major causes of
error: incomplete KB and context-agnostic labeling process. If we treat unlinkable entity pairs
as the pool of negative examples, false negatives can be commonly encountered as a result of
the insufficiency of facts in KBs, where many true entity or relation mentions fail to be linked
to KBs (see example in Fig. 10.1). In this way, models counting on extensive negative instances
may suffer from such misleading training data. On the other hand, context-agnostic labeling
can engender false positive examples, due to the inaccuracy of the DS assumption that if a
sentence contains any two entities holding a relation in the KB, the sentence must be expressing
such relation between them. For example, entities “Donald Trump” and “United States” in the
sentence “Donald Trump flew back to United States” can be labeled as “president_of” as well as
“born_in,” although only an out-of-interest relation type “travel_to” is expressed explicitly
(as shown in Fig. 10.1).
Figure 10.1: Distant supervision generates training data by linking relation mentions in sen-
tences S1–S4 to KB and assigning the linkable relation types to all relation mentions. Those
unlinkable entity mention pairs are treated as negative examples. This automatic labeling pro-
cess may cause errors of false positives (highlighted in red) and false negatives (highlighted in
purple). QA pairs provide indirect supervision for correcting such errors.
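The three-step DS pipeline and the resulting label noise can be sketched as follows. The KB facts and entity pairs here are toy values taken from the running example; the point is that every KB relation between a mapped pair is assigned regardless of context:

```python
# Toy sketch of the DS labeling pipeline: an entity pair detected in a
# sentence receives every KB relation between its KB-mapped entities;
# unlinkable pairs fall into the negative pool.
kb = {("Donald Trump", "United States"): {"president_of", "born_in"}}

def distant_label(entity_pair, kb):
    if entity_pair in kb:
        return kb[entity_pair]   # may include context-inappropriate types
    return {"None"}              # unlinkable pair -> treated as negative

# "Donald Trump flew back to United States" only expresses travel_to,
# yet DS assigns both KB relations, producing false positives.
labels = distant_label(("Donald Trump", "United States"), kb)
```

The symmetric failure is visible in the second branch: a true relation mention whose fact is missing from the KB is silently labeled "None", a false negative.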
10.1.1 CHALLENGES
Toward the goal of diminishing the negative effects by noisy DS training data, distantly su-
pervised RE models that deal with training noise, as well as methods that directly improve the
automatic training data generation process have been proposed. These methods mostly involve
designing distinct assumptions to remove redundant training information [Hoffmann et al.,
2011, Lin et al., 2016, Mintz et al., 2009, Riedel et al., 2010]. For example, the method ap-
plied in Hoffmann et al. [2011] and Riedel et al. [2010] assumes that, for each relation triple
in the KB, at least one (rather than every) sentence mentioning the entity pair expresses the relation. Moreover,
these noise reduction systems usually only address one type of error, either false positives or false
negatives. Hence, current methods handling DS noises still have the following challenges.
2. Incomplete noise handling: Although both false negative and false positive errors are ob-
served to be significant, most existing works only address one of them.
10.3 EXPERIMENTS
Datasets. Our experiments involve two different types of datasets: relation extraction datasets
and an answer sentence selection dataset for indirect supervision. Two public datasets are
used for relation extraction: NYT [Hoffmann et al., 2011, Riedel et al., 2010] and KBP [Ellis
et al., 2014, Ling and Weld, 2012]. The test data are manually annotated with relation types by
their respective authors.
Compared Methods. We compare ReQuest with its variants which model parts of the pro-
posed hypotheses. Several state-of-the-art relation extraction methods (e.g., supervised, em-
bedding, neural network) are also implemented (or tested using their published codes). Besides
the proposed joint optimization model, ReQuest-Joint, we conduct experiments on two other
variations to compare the performance: (1) ReQuest-QA_RE: This variation optimizes objec-
tive OQA first and then uses the learned feature embeddings as the initial state to optimize OZ ;
and (2) ReQuest-RE_QA: It first optimizes OZ, then optimizes OQA to fine-tune the learned
feature embeddings.
Performance Comparison with Baselines. To test the effectiveness of our proposed framework
ReQuest, we compare with other methods on the relation extraction task. The precision, recall,
F1 scores as well as the model learning time measured on two datasets are reported in Table 10.1.
Table 10.1: Performance comparison on end-to-end relation extraction (at the highest F1 point) on the two datasets

As shown in the table, ReQuest achieves a superior F1 score on both datasets compared with
other models. Among all these baselines, MultiR and CoType-RM handle noisy training data
while the remaining ones assume the training corpus is perfectly labeled. Being cautious toward
the noisy training data, both MultiR and CoType-RM reach relatively high results compared
with models that blindly exploit all heuristically obtained training examples. However, as
external reliable information sources are absent and these models only tackle the noise from
multi-label relation mentions (where none or only one assigned label is correct), MultiR and
CoType-RM underperform ReQuest. In particular, the comparison with CoType-RM, which
is also an embedding-based relation extraction model incorporating a partial-label loss, shows
that the extra semantic signals provided by the QA corpus do help boost the performance of
relation extraction.
Performance Comparison with Ablations. We experiment with two variations of ReQuest,
ReQuest-QA_RE, and ReQuest-RE_QA, in order to validate the idea of joint optimization.
As presented in Table 10.1, both ReQuest-QA_RE and ReQuest-RE_QA outperform most
of the baselines, with the indirect supervision from QA corpus. However, their results still fall
behind ReQuest’s. Thus, separately training the two components may not capture as much
information as jointly optimizing the combined objective. Constraining each component in the
joint optimization process proves effective for learning embeddings that represent the
semantic meanings of objects (e.g., features, types, and mentions).
10.4 SUMMARY
We present a novel study on indirect supervision (from question-answering datasets) for the task
of relation extraction. We propose a framework, ReQuest, that embeds information from both
training data automatically generated by linking to KBs and QA datasets, and captures richer
semantic knowledge from both sources via shared text features so that better feature embeddings
can be learned to infer relation types for test relation mentions despite the noisy training data. Our
experiment results on two datasets demonstrate the effectiveness and robustness of ReQuest.
Interesting future work includes identifying the most relevant QA pairs for target relation types,
generating the most effective questions to collect feedback (or answers) via crowd-sourcing, and
exploring approaches other than distant supervision [Artzi and Zettlemoyer, 2013, Riedel et
al., 2013].
PART III
CHAPTER 11
Mining Entity Attribute Values
with Meta Patterns
Discovering textual patterns from text data is an active research theme [Banko et al., 2007, Carl-
son et al., 2010, Fader et al., 2011, Gupta et al., 2014, Nakashole et al., 2012], with broad
applications such as attribute extraction [Ghani et al., 2006, Pasca, 2008, Probst et al., 2007,
Ravi and Paşca, 2008], aspect mining [Chen et al., 2014, Hu and Liu, 2014, Kannan et al.,
2011], and slot filling [Yahya et al., 2014, Yu and Ji, 2016]. Moreover, a data-driven explo-
ration of efficient textual pattern mining may also have strong implications on the development
of efficient methods for NLP tasks on massive text corpora.
Definition 11.1 Meta Pattern. A meta pattern refers to a frequent, informative, and precise
subsequence pattern of entity types (e.g., $Person, $Politician, $Country) or data types (e.g.,
$Digit, $Month, $Year), words (e.g., “politician,” “age”), or phrases (e.g., “prime_minister”),
and possibly punctuation marks (e.g., “,”, “(”), which serves as an integral semantic unit in a certain
context.
We study the problem of mining meta patterns and grouping synonymous meta patterns.
Why mine meta patterns and group them into synonymous meta pattern groups? Because mining
and grouping meta patterns may facilitate information extraction and turn unstructured
data into structures. For example, given a sentence from a news
corpus, “President Blaise Compaoré’s government of Burkina Faso was founded ...”, if we have
discovered the meta pattern “president $Politician’s government of $Country,” we can rec-
ognize and type new entities (i.e., type “Blaise Compaoré” as a $Politician and “Burkina Faso”
as a $Country), which previously requires human expertise on language rules or heavy anno-
tations for learning [Nadeau and Sekine, 2007]. If we have grouped the pattern with synony-
mous patterns like “$Country president $Politician”, we can merge the fact tuple ⟨Burkina
Faso, president, Blaise Compaoré⟩ into the large collection of facts of the attribute type coun-
try:president.
To systematically address the challenges of mining meta patterns and grouping synony-
mous patterns, we develop a novel framework called MetaPAD (Meta PAttern Discovery). Instead
of working on every individual sentence, our MetaPAD leverages massive sentences in which re-
dundant patterns are used to express attributes or relations of massive instances. First, MetaPAD
generates meta pattern candidates using efficient sequential pattern mining, learns a quality as-
sessment function of the pattern candidates with a rich set of domain-independent contextual
features reflecting intuitive ideas (e.g., frequency, informativeness), and then mines the quality meta
patterns by assessment-led context-aware segmentation (see Section 11.2.1). Second, MetaPAD
formulates the grouping process of synonymous meta patterns as a learning task, and solves it by
integrating features from multiple facets including entity types, data types, pattern context, and
extracted instances (see Section 11.2.2). Third, MetaPAD examines the type distributions of en-
tities in the extractions from every meta pattern group, and looks for the most appropriate type
level that the patterns fit. This includes both top-down and bottom-up schemes that traverse
the type ontology for the patterns’ preciseness (see Section 11.2.3).
Problem 11.2 Meta Pattern Discovery. Given a fine-grained, typed corpus of massive sentences $\mathcal{C} = [\ldots, S, \ldots]$, where each sentence is denoted as $S = t_1 t_2 \ldots t_n$ with $t_k \in \mathcal{T} \cup \mathcal{P} \cup \mathcal{M}$ the $k$-th token ($\mathcal{T}$ is the set of entity types and data types, $\mathcal{P}$ is the set of phrases and words, and $\mathcal{M}$ is the set of punctuation marks), the task is to find synonymous groups of quality meta patterns. A meta pattern $mp$ is a subsequential pattern of the tokens from the set $\mathcal{T} \cup \mathcal{P} \cup \mathcal{M}$. A synonymous meta pattern group is denoted by $MPG = [\ldots, mp_i, \ldots, mp_j, \ldots]$, in which each pair of meta patterns $mp_i$ and $mp_j$ is synonymous.
What is a quality meta pattern? Here we take the sentences as sequences of tokens. Previous
sequential pattern mining algorithms mine frequent subsequences satisfying a single metric,
the minimum support threshold (min_sup), in a transactional sequence database [Agrawal and
Srikant, 1995]. However, for text sequence data, the quality of our proposed textual pattern, the
meta pattern, should be evaluated, similarly to phrase mining [Liu et al., 2015], on four criteria, as
illustrated below.
Example 11.3 The quality of a pattern is evaluated with the following criteria (the former
pattern has higher quality than the latter).
1. Frequency: “$DigitRank president of $Country” vs. “young president of $Country.”
2. Completeness: “$Country president $Politician” vs. “$Country president,” “$Person’s
wife, $Person” vs. “$Person’s wife.”
3. Informativeness: “$Person’s wife, $Person” vs. “$Person and $Person.”
4. Preciseness: “$Country president $Politician” vs. “$Location president $Person,”
“$Person’s wife, $Person” vs. “$Politician’s wife, $Person,” “population of
$Location” vs. “population of $Country.”
What are synonymous meta patterns? The full set of frequent sequential patterns from a
transaction dataset is huge [Agrawal and Srikant, 1995]; and the number of meta patterns from
a massive corpus is also large. Since there are multiple ways to express the same or similar mean-
ings in a natural language, many meta patterns may share the same or nearly the same meaning.
Grouping synonymous meta patterns can help aggregate a large number of extractions of differ-
ent patterns from different sentences. And the type distribution of the aggregated extractions
can help us adjust the meta patterns in the group for preciseness.
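One facet of such grouping can be sketched concretely. The sketch below groups patterns by the Jaccard overlap of their extracted instance sets only; the actual MetaPAD grouping also integrates entity types, data types, and pattern context, and the patterns, instances, and threshold here are illustrative:

```python
def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def group_synonymous(extractions, threshold=0.5):
    """Greedy single-facet grouping: patterns whose extracted instance sets
    overlap enough (by Jaccard similarity) join the same group."""
    groups = []
    for pattern in extractions:
        placed = False
        for g in groups:
            if any(jaccard(extractions[pattern], extractions[p]) >= threshold
                   for p in g):
                g.append(pattern)
                placed = True
                break
        if not placed:
            groups.append([pattern])
    return groups

extractions = {
    "president $Politician's government of $Country":
        {("Blaise Compaore", "Burkina Faso"), ("Emmanuel Macron", "France")},
    "$Country president $Politician":
        {("Blaise Compaore", "Burkina Faso"), ("Emmanuel Macron", "France"),
         ("Donald Trump", "United States")},
    "$Person's wife, $Person": {("Michelle Obama", "Barack Obama")},
}
groups = group_synonymous(extractions)
```

The two country:president patterns share most of their extractions and merge, while the unrelated wife pattern stays in its own group, exactly the aggregation effect described above.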
features of the patterns according to the quality criteria (see Section 11.1.3), and train a classifier
to estimate the quality function $Q(mp) \in [0, 1]$, where $mp$ is a meta pattern candidate.
We train a classifier based on random forests [Breiman, 2001] for learning the meta-pattern
quality function $Q(mp)$ with the above rich set of contextual features. Our experiments
(not reported here for the sake of space) show that using only 100 pattern labels can achieve
similar precision and recall as using 300 labels. Note that the learning results can be trans-
ferred to other domains: the features of low-quality patterns “$Politician and $Country” and
“$Bacteria and $Antibiotics” are similar; the features of high-quality patterns “$Politician
is president of $Country” and “$Bacteria is resistant to $Antibiotics” are similar.
Context-aware segmentation using $Q(\cdot)$ with feedback. With the pattern quality function
$Q(\cdot)$ learned from the rich set of contextual features, we develop a bottom-up segmentation
algorithm to construct the best partition into segments of high quality scores. As shown in
Fig. 11.2, we use $Q(\cdot)$ to determine the boundaries of the segments: we take “$Country presi-
dent $Politician” for its high quality score; we do not take the candidate “and prime_minister
$Politician of $Country” because of its low quality score.
Figure 11.2: Generating meta patterns by context-aware segmentation with the pattern quality function $Q(\cdot)$.

Since $Q(mp)$ was learned with features including the raw frequency $c(mp)$, the quality
score may be overestimated or underestimated: the principle is that every token occurrence
should be assigned to only one pattern, but the raw frequency may count the same tokens multiple
times. Fortunately, after the segmentation, we can rectify the frequency as $c_r(mp)$; for example, in
Fig. 11.2, the segmentation avoids counting “$Politician and prime_minister $Politician”, which has
overestimated frequency/quality. Once the frequency feature is rectified, we re-learn the quality
function $Q(\cdot)$ using $c_r(mp)$ as feedback and re-segment the corpus with it. This can be an iterative
process, but we found that the result converges after only one iteration.
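The boundary selection can be illustrated with a small dynamic-programming sketch. The book describes a bottom-up algorithm; this sketch only captures the core idea of picking segment boundaries that maximize the total quality score, with toy Q values and an arbitrary default score for single tokens:

```python
def segment(tokens, quality, single_default=0.1, max_len=6):
    """Partition tokens into segments maximizing the summed quality score.
    quality maps candidate token tuples to Q scores; unseen multi-token
    candidates score 0, and single tokens receive a small default score."""
    n = len(tokens)
    best = [0.0] + [float("-inf")] * n
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            cand = tuple(tokens[j:i])
            q = quality.get(cand, single_default if i - j == 1 else 0.0)
            if best[j] + q > best[i]:
                best[i], back[i] = best[j] + q, j
    segments, i = [], n
    while i > 0:
        segments.append(tuple(tokens[back[i]:i]))
        i = back[i]
    return segments[::-1]

tokens = ["$Country", "president", "$Politician", "and",
          "prime_minister", "$Politician", "of", "$Country"]
quality = {("$Country", "president", "$Politician"): 0.9,
           ("prime_minister", "$Politician", "of", "$Country"): 0.8}
segs = segment(tokens, quality)
```

On this toy input the two high-quality patterns are kept as segments while "and" is left on its own, mirroring the boundary decisions shown in Fig. 11.2.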
Figure 11.3: Adjusting entity type levels for appropriate granularity with entity type distribu-
tions.
11.3 SUMMARY
In this chapter, we proposed a novel typed textual pattern structure, called the meta pattern, which
extends the SOL pattern to a frequent, complete, informative, and precise subsequence pattern in a
certain context. We developed an efficient framework, MetaPAD, to discover the
meta patterns from massive corpora with three techniques, including: (1) a context-aware seg-
mentation method to carefully determine the boundaries of the patterns with a learned pattern
quality assessment function, which avoids costly dependency parsing and generates high-quality
patterns; (2) a clustering method to group synonymous meta patterns with integrated informa-
tion of types, context, and instances; and (3) top-down and bottom-up schemes to adjust the
levels of entity types in the meta patterns by examining the type distributions of entities in the
instances. Experiments demonstrated that MetaPAD efficiently discovered a large collection of
high-quality typed textual patterns to facilitate challenging NLP tasks like tuple information
extraction.
CHAPTER 12
Open Information Extraction
with Global Structure
Massive text corpora are emerging worldwide in different domains and languages. The sheer
size of such unstructured data and the rapid growth of new data pose grand challenges for mak-
ing sense of these massive corpora. Information extraction (IE) [Sarawagi, 2008]—extraction of
relation tuples in the form of (head entity, relation, tail entity)—is a key step toward automating
knowledge acquisition from text. While traditional IE systems require people to pre-specify
the set of relations of interest, recent studies on open-domain information extraction (Open
IE) [Banko et al., 2007, Carlson et al., 2010, Schmitz et al., 2012] rely on relation phrases ex-
tracted from text to represent the entity relationship, making it possible to adapt to various
domains (i.e., open-domain) and different languages (i.e., language-independent).
Prior work on Open IE can be summarized as sharing two common characteristics: (1) con-
ducting extraction based on local context information; and (2) adopting an incremental system
pipeline. Current Open IE systems focus on analyzing the local context within individual sentences to extract entities and their relationships, while ignoring the redundant information that can be collectively referenced across different sentences and documents in the corpus. For example, in Fig. 12.1, seeing that the entity phrases “London” and “Paris” frequently co-occur with similar relation phrases and tail entities in the corpus, one gets to know that they have close semantics (the same holds for “Great Britain” and “France”). On one hand, this helps confirm that (Paris, is in, France) is a quality tuple if one knows that (London, is in, Great Britain) is a good tuple. On the other hand, this helps rule out the tuple (Paris, build, new satellites), as “Louvre-Lens” is semantically distant
from “Paris.” Therefore, the rich information redundancy in the massive corpus motivates us
to design an effective way of measuring whether a candidate relation tuple is consistently used
across various context in the corpus (i.e., global cohesiveness).
Figure 12.1: Example of incorporating the global cohesiveness view for error pruning. “London” and “Paris” are similar because they are head entities of the same relation “is in.” When it comes to the relation “build,” since “London” and “build” do not co-occur in any tuple in the corpus, it is unlikely for tuples with “Paris” and “build” to be correct.
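The global-cohesiveness intuition above can be sketched with corpus-level co-occurrence counts alone. The tuples below are toy data, and the count-based similarity is a crude proxy for the embedding-based similarity that is actually learned:

```python
from collections import defaultdict

# Toy corpus-level tuples, echoing the figure; illustrative only.
corpus_tuples = [
    ("London", "is in", "Great Britain"),
    ("Paris", "is in", "France"),
    ("London", "has population", "8.9M"),
    ("Paris", "has population", "2.1M"),
]

relations_of = defaultdict(set)        # head entity -> relations it heads
for h, r, t in corpus_tuples:
    relations_of[h].add(r)

def similar_entities(e):
    """Entities sharing at least one head relation with e (crude proxy)."""
    base = relations_of.get(e, set())
    return {h for h, rs in relations_of.items() if h != e and rs & base}

def cohesiveness(head, rel):
    """Fraction of entities similar to `head` that also head relation `rel`."""
    sims = similar_entities(head)
    if not sims:
        return 0.0
    return sum(rel in relations_of[s] for s in sims) / len(sims)

cohesiveness("Paris", "is in")   # 1.0 -- "London" also heads "is in"
cohesiveness("Paris", "build")   # 0.0 -- no similar entity heads "build"
```

A candidate tuple whose relation is never used by any entity similar to its head gets a low cohesiveness score, which is exactly the pruning signal the figure describes.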
Furthermore, most existing Open IE systems assume that they have access to entity detection tools (e.g., a named entity recognizer (NER) or noun phrase (NP) chunker) to extract entity phrases from sentences, which are then used to form entity pairs for relation tuple extraction [Banko et al., 2007, Carlson et al., 2010, Schmitz et al., 2012]. Some systems further rely on dependency parsers to generate syntactic parse trees for guiding the relation tuple extraction [Angeli et al., 2015, Corro and Gemulla, 2013, Schmitz et al., 2012]. However, these systems suffer from error propagation: errors in earlier stages of the pipeline accumulate as they cascade down the pipeline, yielding more significant errors. In addition, NERs and NP chunkers are often pre-trained for the general domain and may not work well on a domain-specific corpus (e.g., scientific papers, social media posts).
12.1.1 PROPOSED SOLUTION
In this chapter, we propose a novel framework, called ReMine, which unifies two important yet complementary signals for the Open IE problem, i.e., the local context information and the global cohesiveness (see also Fig. 12.2). While most existing Open IE systems focus on analyzing local context and linguistic structures for tuple extraction, ReMine further makes use of all the candidate tuples extracted from the entire corpus to collectively measure whether these candidate tuples reflect cohesive semantics. This is done by mapping both entity and relation phrases into the same low-dimensional embedding space, where two entity phrases are similar if they share similar relation phrases and entity arguments. The entity and relation embeddings so learned can be used to measure the cohesiveness score of a candidate relation tuple. To overcome the error propagation issue, ReMine jointly optimizes the extraction of entity and relation phrases and the global cohesiveness across the corpus, each formalized as an objective function that quantifies the corresponding quality score.
Specifically, ReMine first identifies entity and relation phrases from local context. In Fig. 12.2, suppose we have the sentence “Your dry cleaner set out from eastern Queens on foot Tuesday morning and now somewhere near Maspeth.” We first extract three entity phrases, eastern Queens, Tuesday morning, and Maspeth, as well as two background phrases, Your dry cleaner and foot. Then, ReMine jointly mines relation tuples and measures the extractions with a global translating objective. Locally consistent text segmentation may generate noisy tuples, such as <your dry cleaner, set out from, eastern Queens> and <eastern Queens, on, foot>. However, from the global cohesiveness view, we may infer that the second tuple is a false positive: entity phrases like “eastern Queens” are seldom linked by the relation phrase “on” in extracted tuples. Overall, ReMine iteratively refines the extracted tuples and learns entity and relation representations at the corpus level. By combining the advantages of linguistic patterns [Fader et al., 2011, Hearst, 1992] and representation learning [Bordes et al., 2013], this approach benefits from both sides. Compared to previous open IE systems, ReMine prunes extracted tuples via global cohesiveness, and its accuracy is not sensitive to the target domain.
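The translating objective mentioned above follows the TransE idea [Bordes et al., 2013]: a tuple (h, r, t) is cohesive when h + r ≈ t in the embedding space. A minimal sketch, with hypothetical embeddings rather than trained ones:

```python
import numpy as np

def transe_score(h, r, t):
    """Negative translation distance: higher means more cohesive (h + r ≈ t)."""
    return -np.linalg.norm(h + r - t)

rng = np.random.default_rng(0)
dim = 16
emb = {name: rng.normal(size=dim)
       for name in ["Paris", "France", "London", "Great Britain", "is_in"]}

# Force one tuple to fit perfectly, just to show the scoring direction.
emb["France"] = emb["Paris"] + emb["is_in"]

good = transe_score(emb["Paris"], emb["is_in"], emb["France"])   # ≈ 0 (perfect fit)
bad = transe_score(emb["London"], emb["is_in"], emb["France"])   # strongly negative
```

In ReMine, such scores are computed over candidate tuples from the whole corpus, so that tuples inconsistent with corpus-level regularities are pruned.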
12.2 THE REMINE FRAMEWORK
ReMine aims to jointly address two problems: extracting entity and relation phrases from sentences and generating quality relation tuples. There are three challenges. First, distant supervision may contain false seed examples of entity and relation phrases, which calls for effective measurement of the quality scores of phrase candidates. Second, multiple entity phrases may exist in one sentence; therefore, selecting entities to form relation tuples may suffer from ambiguity in the local context. Third, ranking extracted tuples without referring to the entire corpus may favor tuples with good local structures.
Framework Overview. We propose a framework, called ReMine, that integrates both local context and global structure cohesiveness (see also Fig. 12.2) to address the above challenges. There are three major modules in ReMine: (1) a phrase extraction module; (2) a relation tuple generation module; and (3) a global cohesiveness module. To overcome sparse and noisy labels, the phrase extraction module trains a robust phrase classifier and adjusts quality from a generative perspective. The relation tuple generation module generates tuples from sentence structure, adopting widely used local structural patterns [Corro and Gemulla, 2013, Nakashole et al., 2012, Schmitz et al., 2012], including syntactic and lexical patterns over POS tags and dependency parse trees. However, different from previous studies, the module tries to benefit from information redundancy and mine distinctive extractions with accurate relation phrases. Meanwhile, the global cohesiveness module learns entity and relation phrase representations with a score function to rank tuples. The relation tuple generation module and the global cohesiveness module collaborate with each other. In particular, the relation tuple generation module produces coarse positive tuple seeds and feeds them into the global cohesiveness module. By distinguishing positive tuples from constructed negative samples, the global cohesiveness module provides a cohesiveness measure for tuple generation. The tuple generator further incorporates global cohesiveness into local generation and outputs more precise extractions. ReMine integrates tuple generation and global cohesiveness learning into a joint optimization framework; the two iteratively refine each other's input and eventually yield clean extractions. Once the training process converges, the tuples are expected to be distinctive and accurate. Overall, ReMine extracts relation tuples as follows.
1. The phrase extraction module conducts context-dependent phrasal segmentation on the target corpus (using distant supervision), to generate entity phrases E, relation phrases R, and word sequence probabilities A.
2. The relation tuple generation module generates positive entity pairs and identifies the predicate p between each entity argument pair via a tuple generative process.
3. The global cohesiveness module learns entity and relation embeddings V via a translating objective to capture the global structure cohesiveness W.
4. ReMine updates the sentence-level extractions T based on both local context information and global structure cohesiveness.
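The four steps above alternate between local extraction and corpus-level learning. The toy loop below mimics that alternation with a crude co-occurrence-based global score; it is illustrative only, not ReMine's actual objectives:

```python
from collections import defaultdict

def refine_tuples(candidates, local_score, n_iters=3, threshold=0.5):
    """Toy alternation between a corpus-level ("global") pass and a local
    pruning pass, mimicking steps 2-4 above. Illustrative sketch only."""
    kept = list(candidates)
    for _ in range(n_iters):
        # Global pass: which head entities use each relation, corpus-wide?
        heads_of_rel = defaultdict(set)
        for h, r, t in kept:
            heads_of_rel[r].add(h)

        def global_score(h, r):
            # Cohesive if some other head entity also uses this relation.
            return 1.0 if heads_of_rel[r] - {h} else 0.0

        # Local pass: combine both signals and prune low-scoring tuples.
        kept = [(h, r, t) for h, r, t in kept
                if 0.5 * local_score[(h, r, t)] + 0.5 * global_score(h, r) >= threshold]
    return kept

cands = [
    ("London", "is in", "Great Britain"),
    ("Paris", "is in", "France"),
    ("eastern Queens", "on", "foot"),
]
scores = {c: 0.9 for c in cands}   # pretend local scores
refine_tuples(cands, scores)       # the spurious "on" tuple is pruned
```

Even with identical local scores, the tuple whose relation has no corpus-level support is filtered out after one iteration, which is the essence of the joint refinement.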
12.2.1 THE JOINT OPTIMIZATION PROBLEM
We now show how the three modules introduced above can be organically integrated. The Phrase Extraction Module provides entity and relation seeds for the Tuple Generation Module and the Global Cohesiveness Module. The Relation Tuple Generation Module provides positive tuples for semantic representation learning; in return, the global cohesiveness representations serve as a good semantic measure during the generation process.
Objective for Local Context. The following objective aims at finding semantically consistent tuples in each sentence s:

$$\max \sum_{(h,t) \in E_p^+} \sum_i \|w_i\| \, A_i(h, t), \tag{12.1}$$

where $\|w_i\| = S(h, l, t)$ and $A_i(h, t) = P(w_{[b_i, b_{i+1})} \mid h, t)$. $A$ and $W$ are calculated by the Phrase Extraction Module and the Global Cohesiveness Module, respectively. In each sentence, finding the most consistent tuples for the given entity pairs and scores is a discrete problem; therefore, dynamic programming is deployed to find the optimal solution of the Relation Tuple Generation Module.
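The discrete selection problem can be illustrated as weighted-interval scheduling over token spans: each candidate tuple occupies a span with a score, and we pick a non-overlapping subset of maximum total score. The spans and scores below are made up, and ReMine's actual dynamic program operates on its own segmentation structures:

```python
def best_tuples(candidates):
    """Weighted-interval-scheduling DP: choose non-overlapping candidate
    tuples with maximum total score. A sketch of the per-sentence discrete
    problem behind Eq. (12.1), not ReMine's exact algorithm.

    candidates: list of (start, end, score, tuple), end exclusive.
    """
    cands = sorted(candidates, key=lambda c: c[1])   # sort by span end
    best = [(0.0, [])]                               # best[i]: using first i cands
    for i, (start, end, score, tup) in enumerate(cands):
        # Index of the last candidate ending at or before this one's start.
        j = 0
        for k in range(i, 0, -1):
            if cands[k - 1][1] <= start:
                j = k
                break
        take = (best[j][0] + score, best[j][1] + [tup])
        skip = best[i]
        best.append(max(take, skip, key=lambda x: x[0]))
    return best[-1]

cands = [
    (0, 6, 0.9, ("your dry cleaner", "set out from", "eastern Queens")),
    (4, 8, 0.3, ("eastern Queens", "on", "foot")),
    (6, 10, 0.7, ("your dry cleaner", "near", "Maspeth")),
]
best_tuples(cands)   # total 1.6: keeps the first and third, drops the overlap
```

The overlapping low-score tuple is skipped in favor of the two compatible high-score ones, mirroring how the DP picks the most consistent tuples per sentence.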
Objective for Global Cohesiveness. With a global measure over relation tuples, we have the following global objective to associate the extracted relation tuples in the corpus D:

$$\max \sum_{w_i \in E^+_{h,l,t}} \sum_{\tilde{w}_i \in E_{h',l,t'}} \big( \|w_i\| - \|\tilde{w}_i\| - \gamma \big), \tag{12.2}$$

where $E^+_{h,l,t}$ denotes $(h, t) \in E_p^+$ with the predicate $l$ standing for the averaged extracted predicate $l = (r_1, r_2, \ldots, r_n)$ between them, $\gamma$ is the hyper margin, and $E_{h',l,t'}$ is composed of training tuples with either $h$ or $t$ replaced. The global objective maximizes the margin between the similarities of positive extractions and those of negative ones; it starts with the current positive extractions and iteratively propagates to more unknown tuples during local optimization.
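Margin objectives of this kind are commonly realized as a hinge loss over corrupted tuples. The sketch below uses that standard formulation; the function names and exact loss shape are assumptions, not ReMine's code:

```python
import numpy as np

def margin_loss(pos_scores, neg_scores, gamma=1.0):
    """Hinge loss encouraging each positive tuple's similarity to exceed
    its corrupted counterpart's by the hyper margin gamma. A standard
    realization of a margin objective; ReMine's exact form may differ."""
    return np.sum(np.maximum(0.0, gamma + neg_scores - pos_scores))

def corrupt(tup, entities, rng):
    """Negative sampling: replace the head or the tail with a random entity,
    mirroring how E_{h',l,t'} is built from tuples with h or t replaced."""
    h, r, t = tup
    if rng.random() < 0.5:
        return (rng.choice(entities), r, t)
    return (h, r, rng.choice(entities))

rng = np.random.default_rng(0)
neg = corrupt(("Paris", "is in", "France"), ["London", "Great Britain"], rng)
loss_separated = margin_loss(np.array([2.0]), np.array([0.5]))   # 0.0
loss_violated = margin_loss(np.array([0.5]), np.array([0.4]))    # 0.9
```

Pairs already separated by more than the margin contribute nothing, so optimization focuses on tuples that are still confusable with their corrupted versions.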
12.3 SUMMARY
This chapter studies the task of open information extraction and proposes a principled framework, ReMine, to unify local contextual information and global structural cohesiveness for effective extraction of relation tuples. ReMine leverages distant supervision in conjunction with existing KBs to provide automatically labeled sentences and guide the entity and relation segmentation. The local objective is further learned together with a translating-based objective to enforce structural cohesiveness, such that corpus-level statistics are incorporated for boosting high-quality tuples extracted from individual sentences. We develop a joint optimization algorithm that efficiently solves the proposed unified objective function and outputs quality extractions by taking both local and global information into account. Experiments on two real-world corpora of different domains demonstrate that the ReMine system achieves superior precision when outputting the same number of extractions, compared with several state-of-the-art open IE systems.
As a byproduct, ReMine also demonstrates competitive performance on detecting mentions of
entities from text when compared to several named entity recognition algorithms.
CHAPTER 13
Applications
The impact of the effort-light StructMine approach is best shown in multiple downstream applications. In this chapter, we start with a discussion of how to build on top of distant supervision to incorporate human supervision (e.g., curated rules from domain experts) in the effort-light StructMine framework, followed by an application in the life sciences domain that makes use of the StructNet constructed by our methods, and a few potential new applications.
Example 13.1 Biomedical Entity Relationships. “These murine models demonstrate that amikacin has in vivo activity against Nocardia and may be potentially useful in the treatment of human disease.”
The above sentence presents the fact that “amikacin” is a chemical entity, and claims the finding that “amikacin” can potentially treat “Nocardia,” which is a disease. Without tools for mining entity and relation structures such as effort-light StructMine, human experts have to read through the whole sentence to identify the chemical and disease entities in the sentence, and then infer their relationship as a treatment relationship from the sentence. However, text mining tools, such as CoType [Ren et al., 2017a], are able to take the large document collection and some existing biomedical databases as input, automatically recognize “amikacin” as a chemical and “Nocardia” as a disease, and further infer that there is a “treatment” relation between them. This example shows that automatic techniques for mining entity and relation structures can greatly save time, human effort, and costs for biomedical information extraction from the literature, which serves as a primary step for many downstream applications such as new drug discovery, adverse event detection for drug combinations, and biomedical KB construction.
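The distant-supervision step such tools build on can be sketched minimally: match KB entities in a sentence and label matched pairs with the KB relation. The toy KB contents and function name below are hypothetical:

```python
# Minimal distant-supervision labeling sketch (toy KB; not CoType itself).
kb_types = {"amikacin": "Chemical", "Nocardia": "Disease"}
kb_facts = {("amikacin", "Nocardia"): "treatment"}

def distant_label(sentence):
    """Heuristically label entity mentions and their relation via the KB."""
    mentions = [(e, t) for e, t in kb_types.items() if e in sentence]
    labels = []
    for e1, t1 in mentions:
        for e2, t2 in mentions:
            rel = kb_facts.get((e1, e2))
            if rel:
                labels.append((e1, t1, rel, e2, t2))
    return labels

s = ("These murine models demonstrate that amikacin has in vivo activity "
     "against Nocardia and may be potentially useful in the treatment of "
     "human disease.")
distant_label(s)
# [('amikacin', 'Chemical', 'treatment', 'Nocardia', 'Disease')]
```

Labels produced this way are noisy (string matching ignores context), which is exactly why the book's methods model and denoise distantly generated training data rather than trusting it directly.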
13.1 STRUCTURING LIFE SCIENCE PAPERS: THE LIFE-INET SYSTEM
As a follow-up effort, we develop a novel system, called Life-iNet [Ren et al., 2017c], on
top of our entity recognition and relation extraction methods, which automatically turns an un-
structured background corpus into a structured network of factual knowledge (see Fig. 13.1), and
supports multiple exploratory and analytic functions over the constructed network for knowledge
discovery. To extract factual structures, Life-iNet automatically detects token spans of entities
mentioned from text (i.e., ClusType [Ren et al., 2015]), labels entity mentions with semantic
categories (i.e., PLE [Ren et al., 2016a]), and identifies relationships of various relation types
between the detected entities (i.e., CoType [Ren et al., 2017a]). These inter-related pieces of
information are integrated to form a unified, structured network, where nodes represent dif-
ferent types of entities and edges denote relationships of different relation types between the
entities. To address the issue of limited diversity and coverage, Life-iNet relies on the external
KBs to provide seed examples (i.e., distant supervision), and identifies additional entities and
relationships from the given corpus (e.g., using multiple textual resources such as scientific lit-
erature and encyclopedia articles) to construct a structured network. By doing so, we integrate
the factual information in the existing KBs with those extracted from the given corpus. To support analytic functionality, the Life-iNet system implements link prediction functions over the constructed network and integrates a distinctive summarization function to provide insightful analysis (e.g., answering questions such as “which genes are distinctively related to the given disease type under the GeneDiseaseAssociation relation?”).
To systematically incorporate these ideas, Life-iNet leverages the novel entity and relation
structure mining techniques [Ren et al., 2015, 2016a, 2017a] developed in effort-light Struct-
Mine to implement an effort-light network construction framework. Specifically, it relies on distant
supervision in conjunction with external KBs to (1) detect quality entity mentions [Ren et al.,
2015], (2) label entity mentions with fine-grained entity types in a given type hierarchy [Ren et
al., 2016a], and (3) identify relationships of different types between entities [Ren et al., 2017a].
In particular, we design specialized loss functions to faithfully model “appropriate” labels and
remove “false positive” labels for the training instances (heuristically generated by distant super-
vision), regarding the specific context where an instance is mentioned [Ren et al., 2016a, 2017a].
By doing so, we can construct corpus-specific information extraction models by using distant su-
pervision in a noise-robust way (see Fig. 13.1). The proposed network construction framework
is domain-independent: it can be quickly ported to other disciplines and sciences without additional human labeling effort.
Figure 13.1: An illustrative example of the constructed Life-iNet, and its statistics.
Figure 13.2: A screenshot of the graph exploration interface of the Life-iNet system. By specifying the types of two entity arguments and the relation type between them, the Life-iNet system returns a graph that visualizes the typed entities and relationships, and allows users to explore the graph to find relevant research papers.
With the constructed network, Life-iNet further applies link
prediction algorithms [Bordes et al., 2013, Tang et al., 2015] to infer new entity relationships,
and a distinctive summarization algorithm [Tao et al., 2016] to find other entities that are distinctively related to the query entity (or the given entity types).
Impact of Life-iNet:
• A biomedical knowledge graph constructed by our Life-iNet system is used by researchers at Stanford Medical School to facilitate drug re-purposing. It yields significant improvements in performance on new drugs and rare diseases.
• The Life-iNet system is adopted by veterinarians at Veterinary Information Network Inc.
(VIN) to construct the first veterinary knowledge graph from multiple sources of infor-
mation including research articles, books, guidelines, drug handbooks, and message board
posts.
• Technologies developed in the Life-iNet system have been transferred to Mayo Clinic, UCLA
Medical School, and NIH Big Data to Knowledge Center to facilitate construction of
domain KBs from massive scientific literature.
[Figure: example facets extracted from research papers: Technique (conditional random field, unsupervised learning, support vector machine, hidden Markov model), Application (document summarization, sequence labeling, statistical classification), Evaluation Metric (F1, Rouge-2), and Dataset (DUC).]
Language Processing or the Database Systems community?” and “how does the facet of entity
recognition vary across different communities?”, by aggregating the facet statistics across the database. Such results enable the discovery of ideas and the dynamics of a research topic or
community in an effective and efficient way.
Our ClusType method [Ren et al., 2015] leverages relation phrases as the bridge to propagate type information. The proposed relation-based framework is general and can be applied to different kinds of classification tasks. Therefore, we propose to extract document facets by performing type propagation on corpus-induced graphs. The major challenge in facet extraction arises from multiple sources: concept extraction, concept-to-facet matching, and facet disambiguation. To tackle these challenges, we extend the ClusType approach and develop FacetGist, a framework for facet extraction. Facet extraction involves constructing a graph-based heterogeneous network to capture information available across multiple local sentence-level features, as well as global context features. We then formulate a joint optimization problem and propose an efficient algorithm for graph-based label propagation to estimate the facet of each concept mention. Experimental results on technical corpora from two domains demonstrate that FacetGist can lead to an improvement of over 25% in both precision and recall over competing schemes [Siddiqui et al., 2016].
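The graph-based label propagation step can be sketched with the standard iterative update Y ← αSY + (1 − α)Y0 over a row-normalized graph. This is a generic illustration of the propagation idea, not FacetGist's joint optimization:

```python
import numpy as np

def propagate_labels(W, Y0, alpha=0.5, n_iters=50):
    """Iterative graph label propagation: Y <- alpha * S @ Y + (1 - alpha) * Y0,
    where S is the row-normalized adjacency matrix W."""
    S = W / np.maximum(W.sum(axis=1, keepdims=True), 1e-12)
    Y = Y0.copy()
    for _ in range(n_iters):
        Y = alpha * S @ Y + (1 - alpha) * Y0
    return Y

# Three concept-mention nodes, two facet labels; node 2 is unlabeled
# but linked to node 0, so it should inherit node 0's facet.
W = np.array([[0., 1., 1.],
              [1., 0., 0.],
              [1., 0., 0.]])
Y0 = np.array([[1., 0.],   # node 0: labeled with facet 0 (e.g., "Technique")
               [0., 1.],   # node 1: labeled with facet 1 (e.g., "Application")
               [0., 0.]])  # node 2: unlabeled
Y = propagate_labels(W, Y0)
# Y[2] leans toward facet 0, inherited from node 2's neighbor.
```

Because the update is a contraction for alpha < 1, the scores converge, and each unlabeled mention ends up with a facet distribution dominated by its graph neighborhood.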
Figure 13.3: Example output of comparative document analysis (CDA) for the papers [Jeh and Widom, 2003] and [Haveliwala, 2003]. CDA combines two proper names which frequently co-occur in the documents into a name pair using the symbol “⊕.”
Our case study on comparing news articles published at different dates shows the power of the
proposed method on comparing sets of documents.
CHAPTER 14
Conclusions
Entities and relationships are important structures that can be extracted from a text corpus to represent the factual knowledge inside the corpus. Effective and efficient mining of entity and relation structures from text helps gain insights from large volumes of text data (which are infeasible for humans to read through and digest), and enables many downstream applications for understanding, exploring, and analyzing the text content. Data analysts and government agents may want to identify person, organization, and location entities in everyday news articles and generate concise and timely summaries of news events. Biomedical researchers who cannot digest the large amounts of newly published research papers in relevant areas need an effective way to extract different relationships between proteins, drugs, and diseases so as to follow the new claims and facts presented in the research community. However, text data is highly variable: corpora covering topics from different domains, written in different genres or languages, have typically required a wide range of language resources, such as grammars, vocabularies, and gazetteers, for effective processing. The massive and messy nature of text data poses significant challenges to creating tools for automated structuring of unstructured content that scale with text volume.
2. We study different structure extraction tasks for mining typed entity and relation structures
from corpora, which include entity recognition and typing (Chapter 4), fine-grained entity typ-
ing (Chapter 5), and entity relationship extraction (Chapter 7). In particular, we investigate human effort-light solutions for these tasks using distant supervision in conjunction with external KBs. This yields different problem settings as compared to the fully supervised learning problem setup in most existing studies on information extraction. A key challenge in dealing with distant supervision is designing effective typing models that are robust to the noisy labels in the automatically generated training data.
3. We have proposed models and algorithms to solve the above tasks.
• We studied distantly-supervised entity recognition and typing, and proposed a novel re-
lation phrase-based entity recognition framework, ClusType (Chapter 4). A domain-
agnostic phrase mining algorithm is developed for generating candidate entity mentions
and relation phrases. By integrating relation phrase clustering with type propagation, the
proposed method is effective in minimizing name ambiguity and context problems, and thus predicts each mention's type based on the type distribution of its string name and the type signatures of its surrounding relation phrases. We formulate a joint optimization problem
to learn object type indicators/signatures and cluster memberships simultaneously.
• For fine-grained entity typing, we propose hierarchical partial-label embedding methods, AFET and PLE, which model “clean” and “noisy” mentions separately and incorporate a given type hierarchy to induce loss functions (Chapter 5). Both models build on a joint optimization framework, learn embeddings for mentions and type-paths, and iteratively refine the model.
• Our work on extracting typed relationships studies domain-independent, joint extraction
of typed entities and relationships with distant supervision (Chapter 7). The proposed Co-
Type framework runs a domain-agnostic segmentation algorithm to mine entity mentions, and formulates the joint entity and relation mention typing problem as a global embedding problem. We design a noise-robust objective to faithfully model the noisy type labels from distant supervision, and capture the mutual dependencies between entities and relations based on the translation embedding assumption.
14.2 CONCLUSION
The contributions of this book lie in the area of text mining and information extraction,
within which we focus on domain-independent and noise-robust approaches using distant su-
pervision (in conjunction with publicly-available KBs). The work has broad impact on a variety
of applications: KB construction, question-answering systems, structured search and exploration
of text data, recommender systems, network analysis, and many other text mining tasks. Finally,
our work has been used in the following settings.
• Introduced in classes and conference tutorials: Our methods on entity recognition and
typing (ClusType), fine-grained entity typing (PLE [Ren et al., 2016a], AFET [Ren et al.,
2016b]), and relation extraction (CoType [Ren et al., 2017a]) are being taught in graduate
courses, e.g., University of Illinois at Urbana-Champaign (CS 512), and are introduced as
major parts of the conference tutorial in top data mining and database conferences such
as SIGKDD, WWW, CIKM, and SIGMOD.
• Real-world, cross-disciplinary use cases:
– Our entity recognition and typing technique (ClusType [Ren et al., 2015]) has been transferred to the U.S. Army Research Lab, Microsoft Bing Ads, and the NIH Big Data to
Knowledge Center to identify typed entities of different kinds from low-resource,
domain-specific text corpora. ClusType is also used by Stanford sociologists to identify scientific concepts from 37 million scientific publications in the Web of Science database to study the innovation and translation of scientific ideas.
– A biomedical knowledge graph (i.e., Life-iNet [Ren et al., 2017c]) constructed au-
tomatically from millions of PubMed publications using our effort-light Struct-
Mine pipeline is used by researchers at Stanford Medical school to facilitate drug
re-purposing. It yields significant improvement of performance on new drugs and
rare diseases.
– Our effort-light StructMine techniques (ClusType, PLE, CoType) are adopted by
veterinarians at Veterinary Information Network Inc. (VIN) to construct the first
veterinary knowledge graph from multiple sources of information including research
articles, books, guidelines, drug handbooks, and message board posts.
• Awards: The work on effort-light StructMine has been awarded a Google Ph.D. Fellowship in 2016 (sole winner in the category of Structured Data and Data Management worldwide), a Yahoo!-DAIS Research Excellence Award, and a C. W. Gear Outstanding Graduate Student Award from the University of Illinois.
Bibliography
E. Agichtein and L. Gravano (2000). Snowball: Extracting relations from large plain-text collections, in Proc. of the 5th ACM Conference on Digital Libraries. DOI: 10.1145/336597.336644.
22, 29, 112
R. Agrawal and R. Srikant (1995). Mining sequential patterns, in ICDE, pp. 3–14. DOI:
10.1109/icde.1995.380415. 141, 142
R. Angheluta, R. De Busser, and M.-F. Moens (2002). The use of topic segmentation for auto-
matic summarization, in Proc. of the ACL Workshop on Automatic Summarization, pp. 11–12.
75
D. E. Appelt, J. R. Hobbs, J. Bear, D. Israel, and M. Tyson (1992). Fastus: A finite-state pro-
cessor for information extraction from real-world text, in IJCAI.
Y. Artzi and L. S. Zettlemoyer (2013). Weakly supervised learning of semantic parsers for map-
ping instructions to actions, TACL, vol. 1, pp. 49–62. 136
S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives (2007). Dbpedia: A nu-
cleus for a web of open data, in The Semantic Web, pp. 722–735, Springer. DOI: 10.1007/978-
3-540-76298-0_52. 22, 139
N. Bach and S. Badaskar (2007). A review of relation extraction, Literature Review for Language
and Statistics II. 28, 87, 105
J. Bao, N. Duan, M. Zhou, and T. Zhao (2014). Knowledge-based question answering as machine translation, in ACL. DOI: 10.3115/v1/p14-1091. 120
J. Bian, Y. Liu, E. Agichtein, and H. Zha (2008). Finding the right facts in the crowd: Factoid
question answering over social media, in WWW. DOI: 10.1145/1367497.1367561. 87
L. Bing, S. Chaudhari, R. Wang, and W. Cohen (2015). Improving distant supervision for
information extraction using label propagation through lists, in Proc. of the Conference on Em-
pirical Methods in Natural Language Processing, pp. 524–529. DOI: 10.18653/v1/d15-1060.
112
A. Blum and T. Mitchell (1998). Combining labeled and unlabeled data with co-training, in
COLT Workshop on Computational Learning Theory. DOI: 10.1145/279943.279962. 29, 113
K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor (2008). Freebase: A collab-
oratively created graph database for structuring human knowledge, in SIGMOD. DOI:
10.1145/1376616.1376746. 4, 22, 35, 38, 90, 139
A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko (2013). Translating
embeddings for modeling multi-relational data, in NIPS. 30, 92, 96, 100, 149, 155
L. Breiman (2001). Random forests, Machine Learning, vol. 45, no. 1, pp. 5–32. DOI:
10.1023/A:1010933404324. 143
S. Brin (1998). Extracting patterns and relations from the world wide web, in International
Workshop on the World Wide Web and Databases. DOI: 10.1007/10704656_11. 29
R. C. Bunescu and R. J. Mooney (2005). A shortest path dependency kernel for relation extrac-
tion, in HLT-EMNLP. DOI: 10.3115/1220575.1220666.
R. C. Bunescu and R. Mooney (2007). Learning to extract relations from the web using minimal
supervision, in ACL. 87
A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. Hruschka Jr., and T. M. Mitchell (2010).
Toward an architecture for never-ending language learning, in AAAI. 2, 3, 22, 139
A. Carlson, J. Betteridge, R. C. Wang, E. R. Hruschka Jr., and T. M. Mitchell
(2010). Coupled semi-supervised learning for information extraction, in WSDM. DOI:
10.1145/1718487.1718501. 119, 147, 148
Y. S. Chan and D. Roth (2010). Exploiting background knowledge for relation extraction, in
COLING. 94, 133
Z. Chen, A. Mukherjee, and B. Liu (2014). Aspect extraction with automated prior knowledge
learning, in ACL. DOI: 10.3115/v1/p14-1033. 139
L. Del Corro and R. Gemulla (2013). Clausie: Clause-based open information extraction, in
WWW. DOI: 10.1145/2488388.2488420. 148, 150
T. Cour, B. Sapp, and B. Taskar (2011). Learning from partial labels, JMLR, vol. 12, pp. 1501–
1536. 30, 70
A. Culotta and J. Sorensen (2004). Dependency tree kernels for relation extraction, in ACL.
DOI: 10.3115/1218955.1219009. 28, 87, 111
J. R. Curran, T. Murphy, and B. Scholz (2007). Minimising semantic drift with mutual exclusion
bootstrapping, in PACLING, pp. 172–180. 112, 116
G. R. Doddington, A. Mitchell, M. A. Przybocki, L. A. Ramshaw, S. Strassel, and R. M.
Weischedel (2004). The automatic content extraction (ace) program-tasks, data, and evalua-
tion, in LREC, vol. 2, p. 1. 20, 21
X. L. Dong, T. Strohmann, S. Sun, and W. Zhang (2014). Knowledge vault: A web-scale ap-
proach to probabilistic knowledge fusion, in SIGKDD. DOI: 10.1145/2623330.2623623. 2,
35, 59, 87
L. Dong, F. Wei, H. Sun, M. Zhou, and K. Xu (2015). A hybrid neural model for type classi-
fication of entity mentions, in IJCAI. 70
H. Drucker, C. J. Burges, L. Kaufman, A. Smola, V. Vapnik et al. (1997). Support vector re-
gression machines, NIPS, vol. 9, pp. 155–161. 145
J. Dunietz and D. Gillick (2014). A new entity salience task with millions of training examples,
EACL. DOI: 10.3115/v1/e14-4040. 61
A. El-Kishky, Y. Song, C. Wang, C. R. Voss, and J. Han (2014). Scalable topical phrase mining
from text corpora, VLDB, vol. 8, no. 3, pp. 305–316. DOI: 10.14778/2735508.2735519. 93
A. El-Kishky, Y. Song, C. Wang, C. R. Voss, and J. Han (2015). Scalable topical phrase mining
from text corpora, VLDB. DOI: 10.14778/2735508.2735519. 40, 41
J. Ellis, J. Getman, J. Mott, X. Li, K. Griffitt, S. M. Strassel, and J. Wright (2014). Linguistic
resources for 2013 knowledge base population evaluations, Text Analysis Conference (TAC).
103, 134
O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D.
S. Weld, and A. Yates (2004). Web-scale information extraction in knowitall (preliminary
results), in WWW. DOI: 10.1145/988672.988687. 2, 3, 87
O. Etzioni, M. Cafarella, D. Downey, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld,
and A. Yates (2005). Unsupervised named-entity extraction from the Web: An experimental
study, Artificial Intelligence, vol. 165, no. 1, pp. 91–134. DOI: 10.1016/j.artint.2005.03.001.
A. Fader, S. Soderland, and O. Etzioni (2011). Identifying relations for open information ex-
traction, in EMNLP. 31, 38, 40, 53, 139, 149
J. R. Finkel, T. Grenager, and C. Manning (2005). Incorporating non-local information into in-
formation extraction systems by Gibbs sampling, in ACL. DOI: 10.3115/1219840.1219885.
52, 62, 93
R. Ghani, K. Probst, Y. Liu, M. Krema, and A. Fano (2006). Text mining for product attribute
extraction, SIGKDD Explorations, vol. 8, no. 1, pp. 41–48. DOI: 10.1145/1147234.1147241.
139
M. R. Gormley, M. Yu, and M. Dredze (2015). Improved relation extraction with feature-rich
compositional embedding models, EMNLP. DOI: 10.18653/v1/d15-1205. 30, 92, 104
Z. GuoDong, S. Jian, Z. Jie, and Z. Min (2005). Exploring various knowledge in relation ex-
traction, in ACL. DOI: 10.3115/1219840.1219893. 28, 87, 92, 95
S. Gupta and C. D. Manning (2014). Improved pattern learning for bootstrapped entity extrac-
tion, in CONLL. DOI: 10.3115/v1/w14-1611. 29, 35, 52
C. Van Gysel, M. de Rijke, and E. Kanoulas (2016a). Learning latent vector spaces for product
search, in Proc. of the 25th ACM International Conference on Information and Knowledge
Management, pp. 165–174, ACM. DOI: 10.1145/2983323.2983702. 124
C. Van Gysel, M. de Rijke, and M. Worring (2016b). Unsupervised, efficient and seman-
tic expertise retrieval, in Proc. of the 25th International Conference on World Wide Web,
pp. 1069–1079, International World Wide Web Conferences Steering Committee. DOI:
10.1145/2872427.2882974. 124
F. Harary and I. C. Ross (1957). A procedure for clique detection using the group matrix, So-
ciometry, vol. 20, no. 3, pp. 205–215. DOI: 10.2307/2785673. 144
Z. S. Harris (1954). Distributional structure, Word, vol. 10, pp. 146–162. DOI:
10.1080/00437956.1954.11659520. 122
T. H. Haveliwala (2003). Topic-sensitive PageRank: A context-sensitive ranking algorithm for
web search, TKDE, vol. 15, no. 4, pp. 784–796. DOI: 10.1109/tkde.2003.1208999. 158, 159
X. He and P. Niyogi (2004). Locality preserving projections, in NIPS. 44, 47, 96, 97
Y. He and D. Xin (2011). SEISA: Set expansion by iterative similarity aggregation, in WWW.
DOI: 10.1145/1963405.1963467.
M. A. Hearst (1992). Automatic acquisition of hyponyms from large text corpora, in Proc. of
the 14th Conference on Computational Linguistics (ACL). DOI: 10.3115/992133.992154. 77,
139, 149
J. Hoffart, M. A. Yosef, I. Bordino, H. Fürstenau, M. Pinkal, M. Spaniol, B. Taneva, S. Thater,
and G. Weikum (2011). Robust disambiguation of named entities in text, in EMNLP. 91
R. Hoffmann, C. Zhang, X. Ling, L. Zettlemoyer, and D. S. Weld (2011). Knowledge-based
weak supervision for information extraction of overlapping relations, in ACL, pp. 541–550.
87, 94, 102, 104, 105, 112, 130, 134
D. Hovy, B. Plank, H. M. Alonso, and A. Søgaard (2015). Mining for unambiguous instances
to adapt part-of-speech taggers to new domains, in NAACL. DOI: 10.3115/v1/n15-1135. 89
M. Hu and B. Liu (2014). Mining and summarizing customer reviews, in KDD. DOI:
10.1145/1014052.1014073. 139
R. Huang and E. Riloff (2010). Inducing domain-specific semantic class taggers from (almost)
nothing, in ACL. 35, 36, 52
X. Huang, X. Wan, and J. Xiao (2014). Comparative news summarization using concept-
based optimization, Knowledge and Information Systems, vol. 38, no. 3, pp. 691–716. DOI:
10.1007/s10115-012-0604-8. 158
L. Huang, J. May, X. Pan, H. Ji, X. Ren, J. Han, L. Zhao, and J. A. Hendler (2017). Liberal
entity extraction: Rapid construction of fine-grained entity typing systems, Big Data, vol. 5,
no. 1, pp. 19–31. DOI: 10.1089/big.2017.0012.
G. Jeh and J. Widom (2003). Scaling personalized web search, in WWW. DOI:
10.1145/775189.775191. 158, 159
H. Ji and R. Grishman (2008). Refining event extraction through cross-document inference, in
ACL. 59
J.-Y. Jiang, C.-Y. Lin, and P.-J. Cheng (2015). Entity-driven type hierarchy construction for
Freebase, in WWW. DOI: 10.1145/2740908.2742737. 65
Z. Kozareva and E. Hovy (2010). Not all seeds are equal: Measuring the quality of text mining
seeds, in NAACL. 35
C. Lee, Y.-G. Hwang, and M.-G. Jang (2007). Fine-grained named entity recognition and
relation extraction for question answering, in SIGIR. DOI: 10.1145/1277741.1277915. 91
Q. Li and H. Ji (2014). Incremental joint extraction of entity mentions and relations, in ACL.
DOI: 10.3115/v1/p14-1038. 28, 87, 104
Y. Li, J. Gao, C. Meng, Q. Li, L. Su, B. Zhao, W. Fan, and J. Han (2016). A survey on truth
discovery, SIGKDD Explorations Newsletter, vol. 17, no. 2, pp. 1–16. DOI:
10.1145/2897350.2897352. 119
T. Lin, Mausam, and O. Etzioni (2012). No noun phrase left behind: Detecting and typing unlinkable entities, in
EMNLP. 30, 31, 35, 36, 38, 53, 59, 91
Y. Lin, S. Shen, Z. Liu, H. Luan, and M. Sun (2016). Neural relation extraction with selective
attention over instances, in ACL, pp. 2124–2133. DOI: 10.18653/v1/p16-1200. 112, 130
X. Ling and D. S. Weld (2012). Fine-grained entity recognition, in AAAI. 29, 30, 53, 59, 60,
62, 64, 69, 70, 94, 95, 98, 102, 103, 104, 105, 109, 119, 134, 139
J. Liu, C. Wang, J. Gao, and J. Han (2013). Multi-view clustering via joint nonnegative matrix
factorization, in Proc. of the SIAM International Conference on Data Mining (SDM). DOI:
10.1137/1.9781611972832.28. 47, 49
J. Liu, J. Shang, C. Wang, X. Ren, and J. Han (2015). Mining quality phrases from massive
text corpora, in Proc. of the ACM SIGMOD International Conference on Management of Data
(SIGMOD). DOI: 10.1145/2723372.2751523. 93, 94, 141
Y. Liu, F. Wei, S. Li, H. Ji, M. Zhou, and H. Wang (2015). A dependency-based neural network
for relation classification, in ACL, pp. 285–290. DOI: 10.3115/v1/p15-2047. 112
L. Liu, X. Ren, Q. Zhu, S. Zhi, H. Gui, H. Ji, and J. Han (2017). Heterogeneous supervision for
relation extraction: A representation learning approach, in Proc. of the Conference on Empirical
Methods in Natural Language Processing (EMNLP). DOI: 10.18653/v1/d17-1005. 13
I. Mani and E. Bloedorn (1997). Multi-document summarization by graph search and match-
ing, AAAI. 158
C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, and D. McClosky (2014). The
Stanford CoreNLP natural language processing toolkit, in ACL. DOI: 10.3115/v1/p14-5010. 69,
103, 132
M.-C. de Marneffe, B. MacCartney, and C. D. Manning (2006). Generating typed depen-
dency parses from phrase structure parses, in Proc. of LREC, vol. 6, pp. 449–454, Genoa.
139
R. McDonald, K. Crammer, and F. Pereira (2005). Online large-margin training of dependency
parsers, in ACL, pp. 91–98. DOI: 10.3115/1219840.1219852. 140
P. N. Mendes, M. Jakob, A. García-Silva, and C. Bizer (2011). DBpedia Spotlight: Shedding
light on the web of documents, in I-Semantics. DOI: 10.1145/2063518.2063519. 91, 92
T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013). Distributed representa-
tions of words and phrases and their compositionality, in NIPS. 30, 77, 96, 97, 123, 145
T. Mikolov, K. Chen, G. Corrado, and J. Dean (2013). Efficient estimation of word represen-
tations in vector space, ArXiv Preprint ArXiv:1301.3781. 113
B. Min, S. Shi, R. Grishman, and C.-Y. Lin (2012). Ensemble semantics for large-scale unsu-
pervised relation extraction, in EMNLP. 46
M. Mintz, S. Bills, R. Snow, and D. Jurafsky (2009). Distant supervision for relation extraction
without labeled data, in ACL/IJCNLP. DOI: 10.3115/1690219.1690287. 30, 59, 76, 87, 94,
104, 112, 119, 122, 130, 133
M. Miwa and Y. Sasaki (2014). Modeling joint entity and relation extraction with table repre-
sentation, in EMNLP. DOI: 10.3115/v1/d14-1200. 28, 87
M. Miwa and M. Bansal (2016). End-to-end relation extraction using LSTMs on sequences
and tree structures, ArXiv Preprint ArXiv:1601.00770. DOI: 10.18653/v1/p16-1105.
R. J. Mooney and R. C. Bunescu (2005). Subsequence kernels for relation extraction, in NIPS.
104, 105
R. J. Mooney and R. C. Bunescu (2006). Subsequence kernels for relation extraction, in NIPS,
pp. 171–178, MIT Press. 111
D. Nadeau and S. Sekine (2007). A survey of named entity recognition and classification,
Lingvisticae Investigationes, vol. 30, no. 1, pp. 3–26. DOI: 10.1075/bct.19.03nad. 28, 35,
59, 93, 141
N. Nakashole, M. Theobald, and G. Weikum (2011). Scalable knowledge harvesting with high
precision and high recall, in Proc. of the 4th ACM International Conference on Web Search and
Data Mining, pp. 227–236, ACM. DOI: 10.1145/1935826.1935869. 119
N. Nakashole, G. Weikum, and F. Suchanek (2012). PATTY: A taxonomy of relational patterns
with semantic types, in EMNLP. 75, 112, 139, 144, 150
N. Nakashole, T. Tylenda, and G. Weikum (2013). Fine-grained semantic typing of emerging
entities, in ACL. 30, 35, 36, 38, 87, 139
V. Nastase, M. Strube, B. Börschinger, C. Zirn, and A. Elghafari (2010). WikiNet: A very large
scale multi-lingual concept network, in LREC. 139
N. Nguyen and R. Caruana (2008). Classification with partial labels, in KDD. DOI:
10.1145/1401890.1401958. 30, 67, 70, 92, 96, 98, 133
K. Nigam and R. Ghani (2000). Analyzing the effectiveness and applicability of co-training, in
CIKM. DOI: 10.1145/354756.354805. 29
N. F. Noy, N. H. Shah, P. L. Whetzel, B. Dai, M. Dorf, N. Griffith, C. Jonquet, D. L. Rubin,
M.-A. Storey, C. G. Chute, and M. A. Musen (2009). BioPortal: Ontologies and integrated
data resources at the click of a mouse, Nucleic Acids Research, vol. 37, pp. W170–3. DOI:
10.1093/nar/gkp440. 4
M. Paşca and B. Van Durme (2008). Weakly-supervised acquisition of open-domain classes and
class attributes from web documents and query logs, in ACL, pp. 19–27. 139
J. Pei, J. Han, B. Mortazavi-Asl, J. Wang, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu (2004).
Mining sequential patterns by pattern-growth: The PrefixSpan approach, TKDE, vol. 16, no.
11, pp. 1424–1440. DOI: 10.1109/tkde.2004.77. 142
J. Pennington, R. Socher, and C. D. Manning (2014). GloVe: Global vectors for word repre-
sentation, in EMNLP, vol. 14, pp. 1532–1543. DOI: 10.3115/v1/d14-1162. 77, 113
B. Perozzi, R. Al-Rfou, and S. Skiena (2014). DeepWalk: Online learning of social representa-
tions, in KDD. DOI: 10.1145/2623330.2623732. 30, 70, 104
M. Purver and S. Battersby (2012). Experimenting with distant supervision for emotion classifi-
cation, in Proc. of the 13th Conference of the European Chapter of the Association for Computational
Linguistics, pp. 482–491. 76
L. Qian, G. Zhou, F. Kong, and Q. Zhu (2009). Semi-supervised learning for semantic relation
classification using stratified sampling strategy, in Proc. of the Conference on Empirical Meth-
ods in Natural Language Processing: Volume 3, pp. 1437–1445, Association for Computational
Linguistics. DOI: 10.3115/1699648.1699690. 77
M. Qu, X. Ren, and J. Han (2017). Automatic synonym discovery with knowledge bases, in
Proc. of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining (KDD), pp. 997–1005. DOI: 10.1145/3097983.3098185. 10, 112, 113
J. Rao, H. He, and J. J. Lin (2016). Noise-contrastive estimation for answer selection with deep
neural networks, in CIKM. DOI: 10.1145/2983323.2983872. 133
L. Ratinov and D. Roth (2009). Design challenges and misconceptions in named entity recog-
nition, in ACL. DOI: 10.3115/1596374.1596399. 28, 35, 59
S. Ravi and M. Paşca (2008). Using structured text for large-scale attribute extraction, in CIKM,
pp. 1183–1192. DOI: 10.1145/1458082.1458238. 139
X. Ren, J. Liu, X. Yu, U. Khandelwal, Q. Gu, L. Wang, and J. Han (2014). ClusCite: Effective
citation recommendation by information network-based clustering, in Proc. of the 20th ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). DOI:
10.1145/2623330.2623630. 163
X. Ren, A. El-Kishky, C. Wang, F. Tao, C. R. Voss, and J. Han (2015). ClusType: Effective
entity recognition and typing by relation phrase-based clustering, in Proc. of the 21st ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). DOI:
10.1145/2783258.2783362. 7, 70, 76, 91, 96, 119, 154, 157
X. Ren, W. He, M. Qu, H. Ji, C. R. Voss, and J. Han (2016a). Label noise reduc-
tion in entity typing by heterogeneous partial-label embedding, in Proc. of the 22nd ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). DOI:
10.1145/2939672.2939822. 8, 9, 154, 163
X. Ren, W. He, M. Qu, H. Ji, and J. Han (2016b). AFET: Automatic fine-grained entity typ-
ing by hierarchical partial-label embedding, in Proc. of the Conference on Empirical Methods in
Natural Language Processing (EMNLP). DOI: 10.18653/v1/d16-1144. 8, 9, 163
X. Ren, W. He, M. Qu, C. R. Voss, H. Ji, and J. Han (2016c). Label noise reduction in entity
typing by heterogeneous partial-label embedding, in KDD. DOI: 10.1145/2939672.2939822.
30, 92, 95, 104, 105, 110
X. Ren, A. El-Kishky, C. Wang, and J. Han (2016d). Automatic entity recognition and typing
in massive text corpora, in WWW. DOI: 10.1145/2872518.2891065. 59
X. Ren, Z. Wu, W. He, M. Qu, C. R. Voss, H. Ji, T. F. Abdelzaher, and J. Han (2016e).
CoType: Joint extraction of typed entities and relations with knowledge bases, ArXiv Preprint
ArXiv:1610.08763. DOI: 10.1145/3038912.3052708. 122, 126
X. Ren, Z. Wu, W. He, M. Qu, C. R. Voss, H. Ji, T. F. Abdelzaher, and J. Han (2017a).
CoType: Joint extraction of typed entities and relations with knowledge bases, in Proc. of the
26th International Conference on World Wide Web (WWW). DOI: 10.1145/3038912.3052708.
11, 111, 154, 163
X. Ren, Z. Wu, W. He, M. Qu, C. R. Voss, H. Ji, T. F. Abdelzaher, and J. Han (2017b).
CoType: Joint extraction of typed entities and relations with knowledge bases, in WWW. DOI:
10.1145/3038912.3052708.
X. Ren, J. Shen, M. Qu, X. Wang, Z. Wu, Q. Zhu, M. Jiang, F. Tao, S. Sinha, D. Liem et al.
(2017c). Life-iNet: A structured network-based knowledge exploration and analytics system
for life sciences, Proc. of ACL System Demonstrations. DOI: 10.18653/v1/p17-4010. 154, 164
S. Riedel, L. Yao, and A. McCallum (2010). Modeling relations and their mentions without
labeled text, in ECML/PKDD. DOI: 10.1007/978-3-642-15939-8_10. 30, 88, 95, 102, 103,
130, 134
S. Riedel, L. Yao, A. McCallum, and B. M. Marlin (2013). Relation extraction with matrix
factorization and universal schemas, in NAACL, pp. 74–84. 112, 136
S. Roller, K. Erk, and G. Boleda (2014). Inclusive yet selective: Supervised distributional hy-
pernymy detection, in COLING, pp. 1025–1036. 75, 77
D. Roth and W.-t. Yih (2007). Global inference for entity and relation identification via a linear
programming formulation, Introduction to statistical relational learning, pp. 553–580. 28, 87
B. Salehi, P. Cook, and T. Baldwin (2015). A word embedding approach to predicting the com-
positionality of multiword expressions, in NAACL-HLT. DOI: 10.3115/v1/n15-1099. 30
S. Sarawagi and W. W. Cohen (2004). Semi-Markov conditional random fields for information
extraction, in NIPS. 28
S. Sarawagi (2008). Information extraction, Foundations and Trends in Databases, vol. 1, no. 3,
pp. 261–377. DOI: 10.1561/1900000003. 147
M. Schmitz, R. Bart, S. Soderland, O. Etzioni et al. (2012). Open language learning for infor-
mation extraction, in EMNLP. 31, 35, 112, 147, 148, 150
W. Shen, J. Wang, P. Luo, and M. Wang (2012). A graph-based approach for ontology popu-
lation with named entities, in CIKM. DOI: 10.1145/2396761.2396807. 36, 53
W. Shen, J. Wang, and J. Han (2014). Entity linking with a knowledge base: Issues, techniques,
and solutions, TKDE, no. 99, pp. 1–20. DOI: 10.1109/tkde.2014.2327028. 35, 39
S. Shi, H. Zhang, X. Yuan, and J.-R. Wen (2010). Corpus-based semantic class mining: Dis-
tributional vs. pattern-based approaches, in COLING. 29
J. Shin, S. Wu, F. Wang, C. De Sa, C. Zhang, and C. Ré (2015). Incremental knowledge base
construction using DeepDive, Proc. of the VLDB Endowment, vol. 8, no. 11, pp. 1310–1321.
DOI: 10.14778/2809974.2809991. 2
K. Toutanova, D. Chen, P. Pantel, P. Choudhury, and M. Gamon (2015). Representing text for
joint embedding of text and knowledge bases, in EMNLP. DOI: 10.18653/v1/d15-1174. 30,
112, 113, 145
P. Tseng (2001). Convergence of a block coordinate descent method for nondifferentiable min-
imization, Journal of Optimization Theory and Applications, vol. 109, no. 3, pp. 475–494. DOI:
10.1023/a:1017501703105. 50, 68
P. Varma, B. He, D. Iter, P. Xu, R. Yu, C. De Sa, and C. Ré (2016). Socratic learn-
ing: Correcting misspecified generative models using discriminative models, ArXiv Preprint
ArXiv:1610.08123. 119
E. M. Voorhees (1994). Query expansion using lexical-semantic relations, in Proc. of the 17th
Annual International ACM SIGIR Conference on Research and Development in Information Re-
trieval, pp. 61–69, Springer-Verlag, Inc., NY. DOI: 10.1007/978-1-4471-2099-5_7. 75
D. Wang, S. Zhu, T. Li, and Y. Gong (2013). Comparative document summarization via dis-
criminative sentence selection, TKDD, vol. 6, no. 3, p. 12. DOI: 10.1145/2362383.2362386.
158
Z. Wang, J. Zhang, J. Feng, and Z. Chen (2014). Knowledge graph and text jointly embedding,
in EMNLP. DOI: 10.3115/v1/d14-1167. 113
H. Wang, F. Tian, B. Gao, J. Bian, and T.-Y. Liu (2015). Solving verbal comprehension ques-
tions in IQ test by knowledge-powered word embedding, ArXiv Preprint ArXiv:1505.07909.
75, 77
H. Wang, F. Tian, B. Gao, C. Zhu, J. Bian, and T.-Y. Liu (2016). Solving verbal ques-
tions in IQ test by knowledge-powered word embedding, in EMNLP, pp. 541–550. DOI:
10.18653/v1/d16-1052. 112
J. Weeds, D. Clarke, J. Reffin, D. J. Weir, and B. Keller (2014). Learning to distinguish hyper-
nyms and co-hyponyms, in COLING, pp. 2249–2259. 75, 77
R. Weischedel and A. Brunstein (2005). BBN pronoun coreference and entity type corpus,
Linguistic Data Consortium, vol. 112. 69
R. West, E. Gabrilovich, K. Murphy, S. Sun, R. Gupta, and D. Lin (2014). Knowledge base
completion via search-based question answering, in WWW. DOI: 10.1145/2566486.2568032.
87
J. Weston, S. Bengio, and N. Usunier (2011). WSABIE: Scaling up to large vocabulary image
annotation, in IJCAI. DOI: 10.5591/978-1-57735-516-8/IJCAI11-460. 65, 67
F. Wu and D. S. Weld (2010). Open information extraction using Wikipedia, in Proc. of the 48th
Annual Meeting of the Association for Computational Linguistics, pp. 118–127. 75
P. Xie, D. Yang, and E. P. Xing (2015). Incorporating word correlation knowledge into topic
modeling, in HLT-NAACL. DOI: 10.3115/v1/n15-1074. 75
W. Xu, R. Hoffmann, L. Zhao, and R. Grishman (2013). Filling knowledge base gaps for distant
supervision of relation extraction, in ACL. 110
C. Xu, Y. Bai, J. Bian, B. Gao, G. Wang, X. Liu, and T.-Y. Liu (2014). RC-NET: A general
framework for incorporating knowledge into word representations, in Proc. of the 23rd ACM
International Conference on Conference on Information and Knowledge Management, pp. 1219–
1228, ACM. DOI: 10.1145/2661829.2662038. 112, 113
Y. Xu, L. Mou, G. Li, Y. Chen, H. Peng, and Z. Jin (2015). Classifying relations via long short
term memory networks along shortest dependency paths, in EMNLP, pp. 1785–1794. DOI:
10.18653/v1/d15-1206. 112
M. Yahya, S. Whang, R. Gupta, and A. Y. Halevy (2014). ReNoun: Fact extraction for nominal
attributes, in EMNLP. DOI: 10.3115/v1/d14-1038. 112, 139
D. Yogatama, D. Gillick, and N. Lazic (2015). Embedding methods for fine grained entity type
classification, in ACL. DOI: 10.3115/v1/p15-2048. 30, 60, 64, 69, 70, 104
M. A. Yosef, S. Bauer, J. Hoffart, M. Spaniol, and G. Weikum (2012). HYENA: Hierarchical type
classification for entity names, in COLING. 59, 62, 68, 70, 91, 104
X. Yu, X. Ren, Y. Sun, Q. Gu, B. Sturt, U. Khandelwal, B. Norick, and J. Han (2014). Per-
sonalized entity recommendation: A heterogeneous information network approach, in Proc.
of the 7th ACM International Conference on Web Search and Data Mining (WSDM). DOI:
10.1145/2556195.2556259. 59
D. Yu and H. Ji (2016). Unsupervised person slot filling based on graph mining, in ACL. DOI:
10.18653/v1/p16-1005. 139
Q. T. Zeng, D. Redd, T. C. Rindflesch, and J. R. Nebeker (2012). Synonym, topic model and
predicate-based query expansion for retrieving clinical documents, in AMIA. 75
D. Zeng, K. Liu, Y. Chen, and J. Zhao (2015). Distant supervision for relation extrac-
tion via piecewise convolutional neural networks, in EMNLP, pp. 1753–1762. DOI:
10.18653/v1/d15-1203. 112
W. Zeng, Y. Lin, Z. Liu, and M. Sun (2017). Incorporating relation paths in neural relation
extraction, in EMNLP. DOI: 10.18653/v1/d17-1186.
C. Zhai, A. Velivelli, and B. Yu (2004). A cross-collection mixture model for comparative text
mining, in SIGKDD. DOI: 10.1145/1014052.1014150. 158
L. Zhang, L. Li, C. Shen, and T. Li (2015). PatentCom: A comparative view of patent document
retrieval, SDM. DOI: 10.1137/1.9781611974010.19. 158
Authors’ Biographies
XIANG REN
Xiang Ren is an Assistant Professor in the Department of Computer Science at USC, affili-
ated faculty at USC ISI, and a part-time data science advisor at Snap Inc. At USC, Xiang is
part of the Machine Learning Center, the NLP community, and the Center on Knowledge Graphs.
Prior to that, he was a visiting researcher at Stanford University, and he received his Ph.D. in
Computer Science from the University of Illinois at Urbana-Champaign. His research develops
computational methods and systems that extract machine-actionable knowledge from massive
unstructured data (e.g., text), with a particular focus on modeling sequence and graph data
under weak supervision (learning with partial/noisy labels, and semi-supervised learning) and
indirect supervision (multi-task learning, transfer learning, and reinforcement learning). Xiang's
research has been recognized
with several prestigious awards including a Yahoo!-DAIS Research Excellence Award, a Yelp
Dataset Challenge Award, a C. W. Gear Outstanding Graduate Student Award, and a David
J. Kuck Outstanding M.S. Thesis Award. Technologies he developed have been transferred to
the U.S. Army Research Lab, the National Institutes of Health, Microsoft, Yelp, and TripAdvisor.
JIAWEI HAN
Jiawei Han is the Abel Bliss Professor in the Department of Computer Science, University of
Illinois at Urbana-Champaign. He has been conducting research in data mining, information
network analysis, database systems, and data warehousing, with over 900 journal and conference
publications. He has chaired or served on the program committees of most major international
data mining and database conferences. He also served as the founding Editor-in-Chief of ACM
Transactions on Knowledge Discovery from Data and as the Director of the Information Network
Academic Research Center supported by the U.S. Army Research Lab (2009–2016), and has been
the co-Director of KnowEnG, an NIH-funded Center of Excellence in Big Data Computing,
since 2014. He is a Fellow of the ACM and of the IEEE, and received the 2004 ACM SIGKDD
Innovations Award, the 2005 IEEE Computer Society Technical Achievement Award, and the
2009 M. Wallace McDowell Award from the IEEE Computer Society. His co-authored book Data Mining:
Concepts and Techniques has been adopted as a popular textbook worldwide.