Mining Structures From Massive Text Data: A Data-Driven Approach
Jiawei Han
Abel Bliss Professor, Department of Computer Science
University of Illinois at Urbana-Champaign
Urbana, IL 61801, USA
[email protected]
hierarchies in which each node is a topic represented by a ranked list of concepts (e.g., {'social network analysis', 'mining information networks', ...} is a child node of a more general topic node {'knowledge discovery', 'data mining', ...}). Such a hierarchical organization of concepts allows exploration of a corpus at varied granularity and supports applications such as visualization, search, and summarization.

The NLP community has conducted extensive studies on the automatic extraction of quality phrases, but most approaches rely on linguistic processing (e.g., chunking, dependency parsing), domain-dependent language rules, and large amounts of labeled data (e.g., treebanks).

In our recent research, we have developed several automated phrase mining methods. The general philosophy is that, instead of relying on explicit training, we exploit statistical redundancy in document collections through frequent-pattern mining and semi-supervised learning. Such data-driven approaches leverage statistical or heuristic measures derived from the corpus and achieve impressive results. Our phrase mining work comprises three methods: (1) an unsupervised approach (requiring neither expert-labeled training data nor a knowledge base), represented by ToPMine (Ahmed El-Kishky, et al., 2014); (2) a weakly supervised approach (requiring a small set of human judgments on phrase quality), represented by SegPhrase (Jialu Liu, et al., 2015); and (3) a distantly supervised approach (requiring only distantly labeled knowledge bases, such as Wikipedia), represented by AutoPhrase (Jialu Liu, et al., 2017; Jingbo Shang, et al., 2017).

Our experiments on large text corpora show that ToPMine and SegPhrase, with minor adaptation, generate quality phrases in large corpora of multiple languages (e.g., English, Arabic, Chinese, and Spanish), since both methods rely mainly on statistical analysis rather than language parsing and linguistic features. AutoPhrase demonstrates additional power over SegPhrase in four aspects: (i) minimized human effort, using a robust positive-only distant training method that estimates phrase quality by leveraging existing general knowledge bases; (ii) support for multiple languages, including English, Spanish, and Chinese, with the language of the input detected automatically; (iii) higher accuracy, using a POS-guided phrasal segmentation model that incorporates POS tags when a POS tagger is available and, moreover, can extract single-word quality phrases; and (iv) high efficiency, due to a better indexing method and an almost lock-free parallelization, which lead to both running-time speedup and memory savings.
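As a concrete (if highly simplified) illustration of this data-driven philosophy, the sketch below scores candidate two-word phrases purely from corpus counts, combining a frequency threshold with a PMI-style concordance measure and using no linguistic parsing. It is not the ToPMine, SegPhrase, or AutoPhrase algorithm; the function names, the toy corpus, and the thresholds are all illustrative.

import math
from collections import Counter

def score_bigram_phrases(docs, min_count=2):
    """Rank candidate two-word phrases by a PMI-style concordance score.

    docs: list of token lists (already lower-cased).  Purely statistical:
    frequent-pattern counting plus pointwise mutual information, with no
    chunking, parsing, or labeled data.  Thresholds are illustrative.
    """
    unigram, bigram = Counter(), Counter()
    for tokens in docs:
        unigram.update(tokens)
        bigram.update(zip(tokens, tokens[1:]))      # contiguous word pairs
    total = sum(unigram.values())
    scores = {}
    for (w1, w2), c in bigram.items():
        if c < min_count:
            continue                                # keep only redundant candidates
        # PMI-style score: do w1 and w2 co-occur far more often than chance?
        scores[w1 + " " + w2] = math.log(c * total / (unigram[w1] * unigram[w2]))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# toy usage
corpus = [["data", "mining", "on", "massive", "text"],
          ["data", "mining", "and", "text", "mining"],
          ["mining", "massive", "text", "with", "data", "mining"]]
print(score_bigram_phrases(corpus))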
3 Distantly Supervised Entity/Relation Recognition and Typing

Extracting entities and relations of the types of interest from text is important for understanding massive text corpora. Traditionally, entity and relation extraction systems have relied on human-annotated corpora for training and adopted an incremental pipeline. Such systems require additional human expertise to be ported to a new domain and are vulnerable to errors cascading down the pipeline.

Recently, we have investigated distantly supervised approaches to the extraction and typing of entities and relations and developed several methods that reduce human effort and enhance performance. These include (1) ClusType (Xiang Ren, et al., 2015), which explores an integrated entity typing and relation-phrase clustering approach; (2) PLE (Xiang Ren, et al., 2016) for refined entity typing; and (3) CoType (Xiang Ren, et al., 2017) for jointly embedding and typing entities and relations in a mutually enhancing framework.

ClusType (Xiang Ren, et al., 2015) applies data-driven phrase mining to generate entity mention candidates and relation phrases, and enforces the principle that relation phrases should be softly clustered when propagating type information between their argument entities. The method then predicts the type of each entity mention based on the type signatures of its co-occurring relation phrases and the type indicators of its surface name, as computed over the corpus. The two tasks, type propagation with relation phrases and multi-view relation phrase clustering, are placed in a joint optimization framework and achieve high performance.
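The propagation principle can be sketched with a tiny label-propagation loop over a bipartite mention/relation-phrase graph: each relation phrase accumulates a type signature from its argument mentions, and each mention blends its distant-supervision seed label with the signatures of its co-occurring relation phrases. This is only an analogy to the idea; ClusType itself solves a joint optimization that also clusters relation phrases, and every name and parameter below is illustrative.

import numpy as np

def propagate_types(seed_labels, edges, n_mentions, n_phrases, n_types,
                    alpha=0.5, n_iter=20):
    """Toy type propagation between entity mentions and relation phrases.

    seed_labels: {mention_id: type_id} obtained by distant supervision.
    edges: (mention_id, phrase_id) pairs, meaning the mention appears as an
    argument of that relation phrase in some sentence.
    """
    Y = np.zeros((n_mentions, n_types))              # mention type distributions
    for m, t in seed_labels.items():
        Y[m, t] = 1.0
    A = np.zeros((n_mentions, n_phrases))            # mention/phrase incidence
    for m, p in edges:
        A[m, p] = 1.0
    for _ in range(n_iter):
        # relation-phrase type signature = mean of its argument mentions' types
        S = (A.T @ Y) / np.maximum(A.sum(axis=0), 1.0)[:, None]
        # mention types = propagated phrase signatures blended with seed labels
        Y = alpha * (A @ S) / np.maximum(A.sum(axis=1), 1.0)[:, None]
        for m, t in seed_labels.items():
            Y[m, t] += 1.0 - alpha                   # clamp the seeds
    return Y / np.maximum(Y.sum(axis=1, keepdims=True), 1e-12)

# toy usage: 3 mentions, 2 relation phrases, 2 types; only mention 0 is seeded
print(propagate_types({0: 0}, [(0, 0), (1, 0), (1, 1), (2, 1)], 3, 2, 2).round(2))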
For extraction and typing of fine-grained entity types in conjunction with existing knowledge bases, a major difficulty is that the type labels obtained from knowledge bases are often noisy (i.e., incorrect for an entity mention's local context). We proposed a framework, called PLE (Xiang Ren, et al., 2016), which conducts Label Noise Reduction in Entity Typing (LNR) to automatically identify the correct type labels (type-paths) for training examples, given the set of candidate type labels obtained by distant supervision with a given type hierarchy. PLE jointly embeds entity mentions, text features, and entity types into the same low-dimensional space, in which objects whose types are semantically close have similar representations. It then estimates the type-path for each training example in a top-down manner using the learned embeddings. We formulate a global objective for learning the embeddings from text corpora and knowledge bases, which adopts a novel margin-based loss that is robust to noisy labels and faithfully models type correlations derived from knowledge bases.
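The flavor of such a margin-based objective can be shown in a few lines: for one mention, the best-scoring candidate (noisy) type should beat the best-scoring non-candidate type by a margin, so an incorrect candidate label never has to be fit exactly. This is a paraphrase of the idea rather than PLE's published objective; the dot-product scoring and all names below are assumptions made for illustration.

import numpy as np

def partial_label_margin_loss(mention_vec, type_embs, candidate_types, margin=1.0):
    """Hinge loss on the gap between best candidate and best non-candidate type."""
    scores = type_embs @ mention_vec                 # one score per type
    cand = np.zeros(len(type_embs), dtype=bool)
    cand[list(candidate_types)] = True
    best_cand = scores[cand].max()                   # most plausible noisy label
    best_other = scores[~cand].max()                 # strongest competing wrong type
    return max(0.0, margin - best_cand + best_other)

# toy usage: 4 types, embedding dimension 3; distant supervision proposed {0, 2}
rng = np.random.default_rng(0)
print(partial_label_margin_loss(rng.normal(size=3), rng.normal(size=(4, 3)), {0, 2}))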
To further enhance overall performance on entity and relation extraction and typing, we propose a novel domain-independent framework, called CoType (Xiang Ren, et al., 2017). CoType runs a data-driven text segmentation algorithm to extract entity mentions, and jointly embeds entity mentions, relation mentions, text features, and type labels into two low-dimensional spaces (for entity and relation mentions, respectively), where, in each space, objects whose types are close have similar representations. CoType then uses these learned embeddings to estimate the types of test (unlinkable) mentions. We formulate a joint optimization problem to learn the embeddings from text corpora and knowledge bases, adopting a novel partial-label loss function for noisily labeled data and introducing an object "translation" function to capture the cross-constraints of entities and relations on each other; the framework achieves high performance over existing embedding-based methods.
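The "translation" intuition can likewise be sketched in isolation: a relation mention's embedding should act as a translation from its first argument's embedding to its second's (head + relation ≈ tail), and should fit an observed argument pair better than a corrupted one. CoType couples such cross-constraints with partial-label losses in a single joint objective; the TransE-style formulation below is only an analogy, and every symbol is illustrative.

import numpy as np

def translation_margin_loss(head, rel, tail, corrupt_tail, margin=1.0):
    """Prefer head + rel to land near the observed tail rather than a corrupted one."""
    pos = np.linalg.norm(head + rel - tail)          # error on the observed pair
    neg = np.linalg.norm(head + rel - corrupt_tail)  # error on a corrupted pair
    return max(0.0, margin + pos - neg)

# toy usage with random 5-dimensional embeddings
rng = np.random.default_rng(1)
h, r, t, t_bad = (rng.normal(size=5) for _ in range(4))
print(translation_margin_loss(h, r, t, t_bad))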
4 Meta-Pattern Guided Information Extraction

Mining textual patterns in news, tweets, papers, and many other kinds of text corpora may facilitate effective information extraction from massive text corpora. Previous studies adopt a dependency-parsing-based pattern discovery approach; however, the parsing results lose the rich context around entities in the patterns, and the process is costly for a large-scale corpus. Recently, we have proposed a typed textual pattern structure, called a meta pattern, to represent a general form of frequent, informative, and precise subsequence patterns in context. We propose an efficient framework, called MetaPAD (Meng Jiang, et al., 2017), which discovers meta patterns from massive corpora with three techniques: (1) it develops a context-aware segmentation method that carefully determines pattern boundaries with a learned pattern quality assessment function, which avoids costly dependency parsing and generates high-quality patterns; (2) it identifies and groups synonymous meta patterns along multiple facets, namely their types, contexts, and extractions; and (3) it examines the type distributions of entities in the instances extracted by each group of patterns and looks for appropriate type levels to make the discovered patterns precise.
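The typed-pattern representation itself is easy to illustrate: replace entity mentions with type placeholders and count frequent typed windows. The sketch below does only that, a frequency filter over typed n-grams; MetaPAD instead segments patterns with a learned quality function and groups synonymous patterns, so the code illustrates the representation, not the method, and the entity typer and placeholders are assumed inputs.

from collections import Counter

def typed_pattern_candidates(sentences, entity_types, max_len=6, min_count=2):
    """Count typed n-grams that contain at least two type placeholders.

    sentences: token lists; entity_types: {token: "$TYPE"} produced by some
    entity recognizer (assumed to exist).  Thresholds are illustrative.
    """
    counts = Counter()
    for tokens in sentences:
        typed = [entity_types.get(t, t) for t in tokens]   # e.g. obama -> $PERSON
        for n in range(2, max_len + 1):
            for i in range(len(typed) - n + 1):
                window = tuple(typed[i:i + n])
                if sum(w.startswith("$") for w in window) >= 2:
                    counts[window] += 1
    return [(p, c) for p, c in counts.most_common() if c >= min_count]

# toy usage: country-president style patterns
types = {"obama": "$PERSON", "hollande": "$PERSON",
         "us": "$COUNTRY", "france": "$COUNTRY"}
sents = [["us", "president", "obama", "said"],
         ["france", "president", "hollande", "said"]]
print(typed_pattern_candidates(sents, types))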
Our extensive experiments demonstrate that the proposed framework efficiently discovers high-quality typed textual patterns from corpora of different genres and facilitates information extraction. For example, from an Associated Press and Reuters dataset (APR 2015), one can discover meta patterns for country and president and extract country-president pairs even for rarely mentioned pairs, such as Burkina Faso-Blaise Compaoré, and find which bacteria are resistant to which antibiotics from PubMed abstracts.

5 Conclusions and Future Work

Mining structures from massive text corpora is an important task for turning big text data into big structured knowledge. Traditional approaches, which rely on extensive human labeling or annotation of a nontrivial sample of documents in a specific application domain, are not scalable. A new direction is to develop effective weakly or distantly supervised methods that exploit existing domain-agnostic labels and massive text corpora to achieve high performance on phrase mining, entity and relation extraction and typing, and information extraction.

Our recent development of phrase mining methods such as ToPMine, SegPhrase, and AutoPhrase, entity/relation recognition and typing methods such as ClusType, PLE, and CoType, and pattern-based discovery from massive text corpora, such as MetaPAD, contributes to this direction.

There are many future research problems along this direction. Besides further consolidating these distantly supervised methods, an important direction is automated multi-faceted taxonomy construction from massive text, to turn extracted concepts (e.g., phrases) into organized structures, as well as identifying trusted claims, producing comparative and succinct summaries, and building structured, multi-dimensional text cubes and information networks from massive data. We have been working along these lines and developing new methods, such as SetExpan (Jiaming Shen, et al., 2017), REHession (Liyuan Liu, et al., 2017), and indirect supervision for relation extraction using question-answer pairs (Zeqiu Wu, et al., 2018). Still, this is a huge and promising area, with a vast territory waiting to be explored.
Acknowledgments

Research was sponsored in part by the U.S. Army Research Lab. under Cooperative Agreement No. W911NF-09-2-0053 (NSCTA), National Science Foundation IIS 16-18481, IIS 17-04532, and IIS-17-41317, and grant 1U54GM114838 awarded by NIGMS through funds provided by the trans-NIH Big Data to Knowledge (BD2K) initiative (www.bd2k.nih.gov). The views and conclusions contained in this document are those of the author(s) and should not be interpreted as representing the official policies of the U.S. Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.
References

Ahmed El-Kishky, Yanglei Song, Chi Wang, Clare R. Voss, and Jiawei Han. Scalable topical phrase mining from text corpora. PVLDB, 8(3):305–316, 2014.

Meng Jiang, Jingbo Shang, Taylor Cassidy, Xiang Ren, Lance Kaplan, Timothy Hanratty, and Jiawei Han. MetaPAD: Meta pattern discovery from massive text corpora. In Proc. 2017 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'17), Halifax, Nova Scotia, Canada, Aug. 2017.

Jialu Liu, Jingbo Shang, and Jiawei Han. Phrase Mining from Massive Text and Its Applications. Morgan & Claypool Publishers, 2017.

Jialu Liu, Jingbo Shang, Chi Wang, Xiang Ren, and Jiawei Han. Mining quality phrases from massive text corpora. In Proc. 2015 ACM SIGMOD Int. Conf. on Management of Data (SIGMOD'15), Melbourne, Australia, May 2015.

Liyuan Liu, Xiang Ren, Qi Zhu, Shi Zhi, Huan Gui, Heng Ji, and Jiawei Han. Heterogeneous supervision for relation extraction: A representation learning approach. In Proc. 2017 Conf. on Empirical Methods in Natural Language Processing (EMNLP'17), pages 46–56, Copenhagen, Denmark, Sept. 2017.

Xiang Ren, Ahmed El-Kishky, Chi Wang, Fangbo Tao, Clare R. Voss, Heng Ji, and Jiawei Han. ClusType: Effective entity recognition and typing by relation phrase-based clustering. In Proc. 2015 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'15), Sydney, Australia, Aug. 2015.

Xiang Ren, Wenqi He, Meng Qu, Clare R. Voss, Heng Ji, and Jiawei Han. Label noise reduction in entity typing by heterogeneous partial-label embedding. In Proc. 2016 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016, pages 1825–1834, 2016.

Xiang Ren, Zeqiu Wu, Wenqi He, Meng Qu, Clare Voss, Heng Ji, Tarek Abdelzaher, and Jiawei Han. CoType: Joint extraction of typed entities and relations with knowledge bases. In Proc. 2017 World-Wide Web Conf. (WWW'17), Perth, Australia, Apr. 2017.

Jingbo Shang, Jialu Liu, Meng Jiang, Xiang Ren, Clare R. Voss, and Jiawei Han. Automated phrase mining from massive text corpora. CoRR, abs/1702.04457, 2017.

Jiaming Shen, Zeqiu Wu, Dongming Lei, Jingbo Shang, Xiang Ren, and Jiawei Han. SetExpan: Corpus-based set expansion via context feature selection and rank ensemble. In Proc. 2017 European Conf. on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECMLPKDD'17), Skopje, Macedonia, Sept. 2017.

Zeqiu Wu, Xiang Ren, Frank F. Xu, Ji Li, and Jiawei Han. Indirect supervision for relation extraction using question-answer pairs. In Proc. 2018 ACM Int. Conf. on Web Search and Data Mining (WSDM'18), Los Angeles, CA, Feb. 2018.