Text Data Management and Analysis
ACM Books
Editor in Chief
M. Tamer Özsu, University of Waterloo
ACM Books is a new series of high-quality books for the computer science community,
published by ACM in collaboration with Morgan & Claypool Publishers. ACM Books
publications are widely distributed in both print and digital formats through booksellers
and to libraries (and library consortia) and individual ACM members via the ACM Digital
Library platform.
ChengXiang Zhai
University of Illinois at Urbana–Champaign
Sean Massung
University of Illinois at Urbana–Champaign
All rights reserved. No part of this publication may be reproduced, stored in a retrieval
system, or transmitted in any form or by any means—electronic, mechanical, photocopy,
recording, or any other except for brief quotations in printed reviews—without the prior
permission of the publisher.
Designations used by companies to distinguish their products are often claimed as
trademarks or registered trademarks. In all instances in which Morgan & Claypool is aware
of a claim, the product names appear in initial capital or all capital letters. Readers, however,
should contact the appropriate companies for more complete information regarding
trademarks and registration.
First Edition
10 9 8 7 6 5 4 3 2 1
To Mei and Alex
To Kai
Contents
Preface xv
Acknowledgments xviii
Chapter 1 Introduction 3
1.1 Functions of Text Information Systems 7
1.2 Conceptual Framework for Text Information Systems 10
1.3 Organization of the Book 13
1.4 How to Use this Book 15
Bibliographic Notes and Further Reading 18
Chapter 2 Background 21
2.1 Basics of Probability and Statistics 21
2.2 Information Theory 31
2.3 Machine Learning 34
Bibliographic Notes and Further Reading 36
Exercises 37
Chapter 4 META: A Unified Toolkit for Text Data Management and Analysis 57
4.1 Design Philosophy 58
4.2 Setting up META 59
4.3 Architecture 60
4.4 Tokenization with META 61
4.5 Related Toolkits 64
Exercises 65
Chapter 20 Toward A Unified System for Text Management and Analysis 445
20.1 Text Analysis Operators 448
20.2 System Architecture 452
20.3 META as a Unified System 453
References 477
Index 489
Authors’ Biographies 509
Preface
directly expressed in text data. For example, it is now the norm for people to tap into
opinionated text data such as product reviews, forum discussions, and social media
text to obtain opinions. Once again, due to the overwhelming amount of informa-
tion, people need intelligent software tools to help discover relevant knowledge for
optimizing decisions or helping them complete their tasks more efficiently. While
the technology for supporting text mining is not yet as mature as search engines
for supporting text access, significant progress has been made in this area in re-
cent years, and specialized text mining tools have now been widely used in many
application domains. The subtitle of this book suggests that we cover two major
topics, information retrieval and text mining. These two topics roughly correspond
to the techniques needed to build the two types of application systems discussed
above (i.e., search engines and text analytics systems), although the separation of
the two is mostly artificial and only meant to help provide a high-level structure for
the book, and a sophisticated application system likely would use many techniques
from both topic areas.
In contrast to structured data, which conform to well-defined schemas and are
thus relatively easy for computers to handle, text has less explicit structure so
the development of intelligent software tools discussed above requires computer
processing to understand the content encoded in text. The current technology of natural language processing has not yet reached the point of enabling a computer to precisely understand natural language text (a main reason why humans often need to be involved in the loop), but a wide range of statistical and heuristic approaches
to management and analysis of text data have been developed over the past few
decades. They are usually very robust and can be applied to analyze and manage
text data in any natural language, and about any topic. This book intends to provide
a systematic introduction to many of these approaches, with an emphasis on cov-
ering the most useful knowledge and skills required to build a variety of practically
useful text information systems.
This book is primarily based on the materials that the authors have used for
teaching a course on the topic of text data management and analysis (i.e., CS410
Text Information Systems) at the University of Illinois at Urbana–Champaign, as
well as the two Massive Open Online Courses (MOOCs) on “Text Retrieval and
Search Engines” and “Text Mining and Analytics” taught by the first author on
Coursera in 2015. Most of the materials in the book directly match those of the two MOOCs and follow a similar structure of topics. As such, the book can be used as a main reference book for either of these two MOOCs.
Information Retrieval (IR) is a relatively mature field and there is no shortage of good textbooks on IR; for example, the most recent ones include Modern
Information Retrieval: The Concepts and Technology behind Search by Baeza-Yates
are expected to have basic knowledge about computer science, particularly data
structures and programming languages and be comfortable with some basic con-
cepts in probability and statistics such as conditional probability and parameter
estimation. Readers who do not have this background may still be able to follow
the basic ideas of most of the algorithms discussed in the book; they can also ac-
quire the needed background by carefully studying Chapter 2 of the book and, if
necessary, reading some of the references mentioned in the Bibliographic Notes
section of that chapter to have a solid understanding of all the major concepts men-
tioned therein. META can be used by anyone to easily experiment with algorithms
and build applications, but modifying it or extending it would require at least some
basic knowledge of C++ programming.
The book can be used as a textbook for an upper-level undergraduate course on
information retrieval and text mining or a reference book for a graduate course to
cover practical aspects of information retrieval and text mining. It should also be
useful to practitioners in industry to help them acquire a wide range of practical
techniques for managing and analyzing text data that they can use immediately to
build various interesting real-world applications.
Acknowledgments
This book is the result of many people’s help. First and foremost, we want to express
our sincere thanks to Edward A. Fox for his invitation to write this book for the
ACM Book Series in the area of Information Retrieval and Digital Libraries, of
which he is the Area Editor. We are also grateful to Tamer Özsu, Editor-in-Chief of ACM Books, for his support and useful comments on the book proposal. Without their encouragement and support this book would not have been possible. Next,
we are deeply indebted to Edward A. Fox, Donna Harman, Bing Liu, and Jimmy
Lin for thoroughly reviewing the initial draft of the book and providing very useful
feedback and constructive suggestions. While we were not able to fully implement
all their suggestions, all their reviews were extremely helpful and led to significant
improvement of the quality of the book in many ways; naturally, any remaining
errors in the book are solely the responsibility of the authors.
Throughout the process of writing the book, we received strong support and
great help from Diane Cerra, Executive Editor at Morgan & Claypool Publishers,
whose regular reminders and always timely support were key factors that kept us from taking “forever” to finish the book; for this, we are truly
grateful to her. In addition, we would like to thank Sara Kreisman for copyediting
and Paul C. Anagnostopoulos and his production team at Windfall Software (Ted
Laux, Laurel Muller, MaryEllen Oliver, and Jacqui Scarlott) for their great help with
indexing, illustrations, art proofreading, and composition, which ensured a fast
and smooth production of the book.
The content of the book and our understanding of the topics covered in the book
have benefited from many discussions and interactions with a large number of
people in both the research community and industry. Due to space limitations,
we can only mention some of them here (and have to apologize to many whose
names are not mentioned): James Allan, Charu Aggarwal, Ricardo Baeza-Yates,
Nicholas J. Belkin, Andrei Broder, Jamie Callan, Jaime Carbonell, Kevin C. Chang,
Yi Chang, Charlie Clarke, Fabio Crestani, W. Bruce Croft, Maarten de Rijke, Arjen
de Vries, Daniel Diermeier, AnHai Doan, Susan Dumais, David A. Evans, Edward A.
Fox, Ophir Frieder, Norbert Fuhr, Evgeniy Gabrilovich, C. Lee Giles, David Gross-
man, Jiawei Han, Donna Harman, Marti Hearst, Jimmy Huang, Rong Jin, Thorsten
Joachims, Paul Kantor, David Karger, Diane Kelly, Ravi Kumar, Oren Kurland, John
Lafferty, Victor Lavrenko, Lillian Lee, David Lewis, Jimmy Lin, Bing Liu, Wei-Ying
Ma, Christopher Manning, Gary Marchionini, Andrew McCallum, Alistair Moffat,
Jian-Yun Nie, Douglas Oard, Dragomir R. Radev, Prabhakar Raghavan, Stephen
Robertson, Roni Rosenfeld, Dan Roth, Mark Sanderson, Bruce Schatz, Fabrizio Se-
bastiani, Amit Singhal, Keith van Rijsbergen, Luo Si, Noah Smith, Padhraic Smyth,
Andrew Tomkins, Ellen Voorhees, Yiming Yang, Yi Zhang, and Justin Zobel. We
want to thank all of them for their indirect contributions to this book. Some ma-
terials in the book, especially those in Chapter 19, are based on the research work
done by many Ph.D. graduates of the Text Information Management and Analysis
(TIMAN) group at the University of Illinois at Urbana–Champaign, under the supervision of the first author. We are grateful to all of them, including Tao Tao, Hui Fang,
Xuehua Shen, Azadeh Shakery, Jing Jiang, Qiaozhu Mei, Xuanhui Wang, Bin Tan,
Xu Ling, Younhee Ko, Alexander Kotov, Yue Lu, Maryam Karimzadehgan, Yuanhua
Lv, Duo Zhang, V.G.Vinod Vydiswaran, Hyun Duk Kim, Kavita Ganesan, Parikshit
Sondhi, Huizhong Duan, Yanen Li, Hongning Wang, Mingjie Qian, and Dae Hoon
Park. The authors’ own work included in the book has been supported by multiple
funding sources, including NSF, NIH, NASA, IARPA, Air Force, ONR, DHS, Alfred P.
Sloan Foundation, and many companies including Microsoft, Google, IBM, Yahoo!,
LinkedIn, Intel, HP, and TCL. We are thankful to all of them.
The two Massive Open Online Courses (MOOCs) offered by the first author for
the University of Illinois at Urbana–Champaign (UIUC) in 2015 on Coursera (i.e.,
Text Retrieval and Search Engines and Text Mining and Analytics) provided a direct
basis for this book in the sense that many parts of the book are based primarily
on the transcribed notes of the lectures in these two MOOCs. We thus would like
to thank all the people who have helped with these two MOOCs, especially TAs
Hussein Hazimeh and Alex Morales, and UIUC instruction support staff Jason
Mock, Shannon Bicknell, Katie Woodruff, and Edward Noel Dignan, and the Head of the Computer Science Department, Rob Rutenbar, whose encouragement, support, and help were all essential for making these two MOOCs happen. The first author also
wants to thank UIUC for allowing him to use the sabbatical leave in Fall 2015 to
work on this book. Special thanks are due to Chase Geigle, co-founder of META. In
addition to all the above, the second author would like to thank Chase Geigle, Jason
Cho, and Urvashi Khandelwal (among many others) for insightful discussion and
encouragement.
Finally, we would like to thank all our family members, particularly our wives,
Mei and Kai, for their love and support. The first author wants to further thank
his brother Chengxing for the constant intellectual stimulation in their regular
research discussions and his parents for cultivating his passion for learning and
sharing knowledge with others.
ChengXiang Zhai
Sean Massung
June 2016
PART I
OVERVIEW AND BACKGROUND
Chapter 1 Introduction
In the last two decades, we have experienced an explosive growth of online information. According to a study done at the University of California, Berkeley back in 2003: “. . . the world produces between 1 and 2 exabytes (10^18 bytes) of unique information per year, which is roughly 250 megabytes for every man, woman, and child on earth. Printed documents of all kinds comprise only .03% of the total.” [Lyman et al. 2003]
A large amount of online information is textual information (i.e., in natural lan-
guage text). For example, according to the Berkeley study cited above: “Newspapers
represent 25 terabytes annually, magazines represent 10 terabytes . . . office docu-
ments represent 195 terabytes. It is estimated that 610 billion emails are sent each
year representing 11,000 terabytes.” Of course, there are also blog articles, forum
posts, tweets, scientific literature, government documents, etc. Roe [2012] updates
the email count from 610 billion emails in 2003 to 107 trillion emails sent in 2010.
According to a recent IDC report [Gantz & Reinsel 2012], from 2005 to 2020,
the digital universe will grow by a factor of 300, from 130 exabytes to 40,000 ex-
abytes, or 40 trillion gigabytes.
While, in general, all kinds of online information are useful, textual information
plays an especially important role and is arguably the most useful kind of informa-
tion for the following reasons.
Text (natural language) is the most natural way of encoding human knowledge.
As a result, most human knowledge is encoded in the form of text data. For
example, scientific knowledge almost exclusively exists in scientific literature,
while technical manuals contain detailed explanations of how to operate
devices.
Text is by far the most common type of information encountered by people.
Indeed, most of the information a person produces and consumes daily is in
text form.
Text is the most expressive form of information in the sense that it can be
used to describe other media such as video or images. Indeed, image search
engines such as those supported by Google and Bing often rely on matching
companion text of images to retrieve “matching” images to a user’s keyword
query.
The explosive growth of online text information has created a strong demand
for intelligent software tools to provide the following two related services to help
people manage and exploit big text data.
Text Retrieval. The growth of text data makes it impossible for people to con-
sume the data in a timely manner. Since text data encode much of our accu-
mulated knowledge, they generally cannot be discarded, leading to, e.g., the
accumulation of a large amount of literature data which is now beyond any
individual’s capacity to even skim over. The rapid growth of online text infor-
mation also means that no one can possibly digest all the new information
created on a daily basis. Thus, there is an urgent need for developing intel-
ligent text retrieval systems to help people get access to the needed relevant
information quickly and accurately, leading to the recent growth of the web
search industry. Indeed, web search engines like Google and Bing are now an
essential part of our daily life, serving millions of queries daily. In general,
search engines are useful anywhere there is a relatively large amount of text
data (e.g., desktop search, enterprise search or literature search in a specific
domain such as PubMed).
Text Mining. Due to the fact that text data are produced by humans for commu-
nication purposes, they are generally rich in semantic content and often con-
tain valuable knowledge, information, opinions, and preferences of people.
As such, they offer great opportunity for discovering various kinds of knowl-
edge useful for many applications, especially knowledge about human opin-
ions and preferences, which is often directly expressed in text data. For exam-
ple, it is now the norm for people to tap into opinionated text data such as
product reviews, forum discussions, and social media text to obtain opinions
about topics interesting to them and optimize various decision-making tasks
such as purchasing a product or choosing a service. Once again, due to the
overwhelming amount of information, people need intelligent software tools
to help discover relevant knowledge to optimize decisions or help them com-
plete their tasks more efficiently. While the technology for supporting text
mining is not yet as mature as search engines for supporting text access, sig-
nificant progress has been made in this area in recent years, and specialized
text mining tools have now been widely used in many application domains.
Figure 1.1 Text retrieval and text mining are two main techniques for analyzing big text data.
Once we obtain a small set of most relevant text data, we would need to further
analyze the text data to help users digest the content and knowledge in the text
data. This is the text mining step where the goal is to further discover knowledge
and patterns from text data so as to support a user’s task. Furthermore, because any discovered knowledge must be assessed for trustworthiness, users generally need to go back to the original raw text data to obtain appropriate context for interpreting the discovered knowledge and to verify its trustworthiness; hence a search engine system, which is primarily useful for text access, also has to be available in any text-based decision-support system to support knowledge provenance. The two steps are thus conceptually interleaved, and a
full-fledged intelligent text information system must integrate both in a unified
framework.
It is worth pointing out that put in the context of “big data,” text data is very dif-
ferent from other kinds of data because it is generally produced directly by humans
and often also meant to be consumed by humans as well. In contrast, other data
tend to be machine-generated data (e.g., data collected by using all kinds of physi-
cal sensors). Since humans can understand text data far better than computers can,
involvement of humans in the process of mining and analyzing text data is absolutely crucial (much more necessary than in other big data applications). How to optimally divide the work between humans and machines, so as to maximize their “combined intelligence” with minimum human effort, is a general challenge in all applications of text data management and analysis. The two steps discussed above can be regarded
as two different ways for a text information system to assist humans: information
retrieval systems assist users in finding from a large collection of text data the most
relevant text data that are actually needed for solving a specific application prob-
lem, thus effectively turning big raw text data into much smaller relevant text data
that can be more easily processed by humans, while text mining application sys-
tems can assist users in analyzing patterns in text data to extract and discover useful
actionable knowledge directly useful for task completion or decision making, thus
providing more direct task support for users.
With this view, we partition the techniques covered in the book into two parts to
match the two steps shown in Figure 1.1, which are then followed by one chapter to
discuss how all the techniques may be integrated in a unified text information sys-
tem. The book attempts to provide a complete coverage of all the major concepts,
techniques, and ideas in information retrieval and text data mining from a prac-
tical viewpoint. It includes many hands-on exercises designed with a companion
software toolkit META to help readers learn how to apply techniques of information
retrieval and text mining to real-world text data and learn how to experiment with
and improve some of the algorithms for interesting application tasks. This book
can be used as a textbook for computer science undergraduates and graduates or library and information scientists, or as a reference book for practitioners working
on relevant application problems in analyzing and managing text data.
Information Access. This capability gives a user access to useful information when the user needs it. With this capability, a TIS can connect the right
information with the right user at the right time. For example, a search en-
gine enables a user to access text information through querying, whereas a
recommender system can push relevant information to a user as new informa-
tion items become available. Since the main purpose of Information Access
is to connect a user with relevant information, a TIS offering this capability
[Figure: Access (select information); Mining (create knowledge); Organization (add structure/annotations).]
Figure 1.2 Information access, knowledge acquisition, and text organization are three major
capabilities of a text information system with text organization playing a supporting
role for information access and knowledge acquisition. Knowledge acquisition is also
often referred to as text mining.
generally only does minimum analysis of text data sufficient for matching
relevant information with a user’s information need, and the original infor-
mation items (e.g., web pages) are often delivered to the user in their original
form, though summaries of the delivered items are often provided. From the
perspective of text analysis, a user would generally need to read the informa-
tion items to further digest and exploit the delivered information.
Information access can be further classified into two modes: pull and push. In
the pull mode, the user takes initiative to “pull” the useful information out from
the system; in this case, the system plays a passive role and waits for a user to
make a request, to which the system would then respond with relevant information.
This mode of information access is often very useful when a user has an ad hoc
information need, i.e., a temporary information need (e.g., an immediate need for
opinions about a product). For example, a search engine like Google generally
serves a user in pull mode. In the push mode, the system takes initiative to “push”
(recommend) to the user an information item that the system believes is useful to
the user. The push mode often works well when the user has a relatively stable
information need (e.g., a hobby of a person); in such a case, a system can know “in advance” a user’s preferences and interests, making it feasible to recommend information to a user without requiring the user to take the initiative. We cover both
modes of information access in this book.
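To make the contrast concrete, here is a minimal, self-contained C++ sketch (our own illustration, not code from META or from any real search engine): pull_search ranks a fixed document collection against an ad hoc query, while push_filter matches a newly arriving document against stored user-interest profiles and decides which users to notify. The word-overlap scoring and the notification threshold are deliberate simplifications; later chapters develop principled retrieval and filtering models.

#include <algorithm>
#include <cctype>
#include <cstddef>
#include <iostream>
#include <set>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Lowercased word tokens of a piece of text (illustration only).
std::set<std::string> tokens(const std::string& text) {
    std::set<std::string> result;
    std::istringstream in(text);
    std::string word;
    while (in >> word) {
        std::string w;
        for (char c : word)
            if (std::isalnum(static_cast<unsigned char>(c)))
                w += static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        if (!w.empty())
            result.insert(w);
    }
    return result;
}

// Number of words two token sets share; a stand-in for a real retrieval model.
std::size_t overlap(const std::set<std::string>& a, const std::set<std::string>& b) {
    std::size_t n = 0;
    for (const auto& t : a)
        n += b.count(t);
    return n;
}

// Pull mode: the user issues an ad hoc query; the system ranks the collection.
std::vector<std::string> pull_search(const std::string& query,
                                     const std::vector<std::string>& docs) {
    auto q = tokens(query);
    std::vector<std::pair<std::size_t, std::string>> scored;
    for (const auto& d : docs)
        scored.push_back({overlap(q, tokens(d)), d});
    std::sort(scored.begin(), scored.end(),
              [](const auto& x, const auto& y) { return x.first > y.first; });
    std::vector<std::string> ranked;
    for (const auto& p : scored)
        if (p.first > 0)
            ranked.push_back(p.second);
    return ranked;
}

// Push mode: a new document arrives; the system decides which stored
// user-interest profiles it should be recommended to.
std::vector<std::string> push_filter(
        const std::string& new_doc,
        const std::vector<std::pair<std::string, std::string>>& profiles,
        std::size_t threshold = 2) {
    auto d = tokens(new_doc);
    std::vector<std::string> notify;
    for (const auto& [user, interests] : profiles)
        if (overlap(d, tokens(interests)) >= threshold)
            notify.push_back(user);
    return notify;
}

int main() {
    std::vector<std::string> docs = {"cheap hotel deals in Chicago",
                                     "review of a new laptop",
                                     "Chicago restaurant reviews"};
    for (const auto& d : pull_search("Chicago reviews", docs))
        std::cout << "pull hit: " << d << "\n";

    std::vector<std::pair<std::string, std::string>> profiles = {
        {"alice", "laptop hardware review"}, {"bob", "travel hotel deals"}};
    for (const auto& u : push_filter("review of a new laptop", profiles))
        std::cout << "push to: " << u << "\n";
}

Running this prints the two Chicago documents for the pull query and recommends the new laptop review to the user whose stored interests mention laptops; real systems differ mainly in using far better ranking functions and data structures, which are the subject of Chapters 6–8 and 11.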
The pull mode further consists of two complementary ways for a user to obtain
relevant information: querying and browsing. In the case of querying, the user
specifies the information need with a (keyword) query, and the system would take
the query as input and return documents that are estimated to be relevant to the
query. In the case of browsing, the user simply navigates along structures that
link information items together and progressively reaches relevant information.
Since querying can also be regarded as a way to navigate, in one step, into a set
of relevant documents, it’s clear that browsing and querying can be interleaved
naturally. Indeed, a user of a web search engine often interleaves querying and
browsing.
Knowledge acquisition from text data is often achieved through the process of
text mining, which can be defined as mining text data to discover useful knowl-
edge. Both the data mining community and the natural language processing
(NLP) community have developed methods for text mining, although the two communities tend to adopt slightly different perspectives on the problem. From a data mining perspective, we may view text mining as mining a special kind of data, i.e., text. Following the general goals of data mining, the goal of text mining would naturally be regarded as discovering and extracting interesting patterns in text data, which can include latent topics, topical trends, or outliers. From an NLP perspective, text mining can be regarded as partially understanding natural language text, converting text into some form of knowledge representation, and making limited inferences based on the extracted knowledge. Thus a key task is to perform information extraction, which often aims to identify and extract mentions of various entities (e.g., people, organizations, and locations) and their relations (e.g., who met with whom). In practice, of course, any text mining application would likely involve both pattern discovery (i.e., the data mining view) and information extraction (i.e., the NLP view), with information extraction serving to enrich the semantic representation of text, which enables pattern
[Figure: conceptual framework of a text information system, connecting text through information access, information organization, and knowledge acquisition (with functions such as search, filtering, clustering, summarization, extraction, and visualization) to retrieval applications and mining applications.]
discovery over this enriched representation. Deeper analysis is also possible, which may be based on recognized entities and relations or other techniques for more in-depth understanding of text.
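As a toy illustration of the pattern-discovery view (and not of any specific algorithm from later chapters), the following self-contained C++ sketch counts how often pairs of words co-occur in the same sentence of a small, made-up set of review sentences. Frequently co-occurring pairs are a primitive kind of pattern, loosely related to the syntagmatic relations discussed in Chapter 13; all names and data here are our own.

#include <iostream>
#include <iterator>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Count, over a collection of sentences, how often each pair of distinct
// words appears together in the same sentence.
std::map<std::pair<std::string, std::string>, int>
cooccurrence_counts(const std::vector<std::string>& sentences) {
    std::map<std::pair<std::string, std::string>, int> counts;
    for (const auto& sentence : sentences) {
        std::istringstream in(sentence);
        std::set<std::string> words{std::istream_iterator<std::string>(in),
                                    std::istream_iterator<std::string>()};
        // The set keeps words sorted, so each unordered pair is counted once
        // in a canonical order.
        for (auto i = words.begin(); i != words.end(); ++i)
            for (auto j = std::next(i); j != words.end(); ++j)
                ++counts[{*i, *j}];
    }
    return counts;
}

int main() {
    std::vector<std::string> reviews = {"the battery life is great",
                                        "great battery but poor screen",
                                        "battery life could be better"};
    for (const auto& [word_pair, count] : cooccurrence_counts(reviews))
        if (count >= 2)
            std::cout << word_pair.first << " + " << word_pair.second
                      << " co-occur in " << count << " sentences\n";
}

On these three sentences it reports that “battery” co-occurs with “great” and with “life” in two sentences each, a (tiny) discovered pattern; Chapter 13 replaces raw counts with more principled measures, and an information-extraction component would instead try to recognize entities such as product names in the same text.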
With content analysis as the basis, there are multiple components in a TIS that
are useful for users in different ways. The following are some commonly seen
functions for managing and analyzing text information.
Search. Take a user’s query and return relevant documents. The search com-
ponent in a TIS is generally called a search engine. Web search engines are
among the most useful search engines that enable users to effectively and
efficiently deal with a huge amount of text data.
Categorization. Classify a text object into one or several of the predefined cat-
egories where the categories can vary depending on applications. The cat-
egorization component in a TIS can annotate text objects with all kinds of
meaningful categories, thus enriching the representation of text data, which
further enables more effective and deeper text analysis. The categories can
also be used for organizing text data and facilitating text access. Subject cate-
gorizers that classify a text article into one or multiple subject categories and sentiment taggers that classify a sentence into positive, negative, or neutral in sentiment polarity are both specific examples of a text categorization system; a toy sentiment tagger is sketched right after this list.
Topic Analysis. Take a set of documents and extract and analyze topics in them.
Topics directly facilitate digestion of text data by users and support browsing
of text data. When combined with the companion non-textual data such as
time, location, authors, and other meta data, topic analysis can generate
many interesting patterns such as temporal trends of topics, spatiotemporal
distributions of topics, and topic profiles of authors.
Clustering. Discover groups of similar text objects (e.g., terms, sentences, doc-
uments, . . . ). The clustering component of a TIS plays an important role in
helping users explore an information space. It uses empirical data to create
meaningful structures that can be useful for browsing text objects and ob-
taining a quick understanding of a large text data set. It is also useful for
discovering outliers by identifying the items that do not form natural clusters
with other items.
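As promised above, here is a minimal, self-contained C++ sketch of a toy sentiment tagger, our own illustration rather than META code or a method from a later chapter: it labels a sentence as positive, negative, or neutral by counting matches against small hand-made word lists. The word lists and the counting rule are hypothetical stand-ins; the categorization and sentiment analysis techniques of Chapters 15 and 18 would learn such decisions from labeled data instead.

#include <iostream>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Toy rule-based sentiment tagger: +1 for each word from a small positive
// list, -1 for each word from a small negative list, then take the sign.
std::string tag_sentiment(const std::string& sentence) {
    static const std::set<std::string> positive = {"good", "great", "excellent", "love"};
    static const std::set<std::string> negative = {"bad", "poor", "terrible", "hate"};
    std::istringstream in(sentence);
    std::string word;
    int score = 0;
    while (in >> word) {
        if (positive.count(word)) ++score;
        if (negative.count(word)) --score;
    }
    if (score > 0) return "positive";
    if (score < 0) return "negative";
    return "neutral";
}

int main() {
    std::vector<std::string> sentences = {
        "the screen is great and i love the keyboard",
        "terrible battery and poor support",
        "the package arrived on tuesday"};
    for (const auto& s : sentences)
        std::cout << tag_sentiment(s) << ": " << s << "\n";
}

Even this crude tagger illustrates the interface of a categorizer (text in, category label out); it also fails in obvious ways, for example on negation such as “not great,” which is one reason the learning-based methods covered later are preferred.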
This list also serves as an outline of the major topics to be covered later in
this book. Specifically, search and filtering are covered first in Part II about text
data access, whereas categorization, clustering, topic analysis, and summarization
are covered later in Part III about text data analysis. Information extraction is not
covered in this book since we want to focus on general approaches that can be
readily applied to text data in any natural language, but information extraction
often requires language-specific techniques. Visualization is also not covered due
to the intended focus on algorithms in this book. However, it must be stressed that
both information extraction and visualization are very important topics relevant
to text data analysis and management. Readers interested in these techniques can
find some useful references in the Bibliographic Notes at the end of this chapter.
Part I. Overview and Background. This part consists of the first four chapters
and provides an overview of the book and background knowledge, including
basic concepts needed for understanding the content of the book that some
readers may not be familiar with, and an introduction to the META toolkit
used for exercises in the book. This part also gives a brief overview of natu-
ral language processing techniques needed for understanding text data and
obtaining informative representation of text needed in all text data analysis
applications.
Part II. Text Data Access. This part consists of Chapters 5–11, covering the ma-
jor techniques for supporting text data access. This part provides a systematic
discussion of the basic information retrieval techniques, including the for-
mulation of retrieval tasks as a problem of ranking documents for a query
(Chapter 5), retrieval models that form the foundation of the design of rank-
ing functions in a search engine (Chapter 6), feedback techniques (Chapter 7),
implementation of retrieval systems (Chapter 8), and evaluation of retrieval
systems (Chapter 9). It then covers web search engines, the most important
application of information retrieval so far (Chapter 10), where techniques for
analyzing links in text data for improving ranking of text objects are intro-
duced and application of supervised machine learning to combine multiple
features for ranking is briefly discussed. The last chapter in this part (Chap-
ter 11) covers recommender systems which provide a “push” mode of informa-
tion access, as opposed to the “pull” mode of information access supported
by a typical search engine (i.e., querying by users).
Part III. Text Data Analysis. This part consists of Chapters 12–19, covering a
variety of techniques for analyzing text data to facilitate user digestion of text
data and discover useful topical or other semantic patterns in text data. Chap-
ter 12 gives an overview of text analysis from the perspective of data mining,
where we may view text data as data generated by humans as “subjective sen-
sors” of the world; this view allows us to look at the text analysis problem in the broader context of general data analysis and mining, and facilitates the discussion of joint analysis of text and non-text data. This is followed by multiple chapters covering a number of the most useful general techniques for analyzing text data with no or only minimal human effort. Specif-
ically, Chapter 13 discusses techniques for discovering two fundamental se-
mantic relations between lexical units in text data, i.e., paradigmatic relations
and syntagmatic relations, which can be regarded as an example of discov-
ering knowledge about the natural language used to generate the text data
(i.e., linguistic knowledge). Chapter 14 and Chapter 15 cover, respectively, two
closely related techniques to generate and associate meaningful structures
or annotations with otherwise unorganized text data, i.e., text clustering and
text categorization. Chapter 16 discusses text summarization useful for facil-
itating human digestion of text information. Chapter 17 provides a detailed
discussion of an important family of probabilistic approaches to discovery
and analysis of topical patterns in text data (i.e., topic models). Chapter 18
discusses techniques for analyzing sentiment and opinions expressed in text
data, which are key to discovery of knowledge about preferences, opinions,
and behavior of people based on analyzing the text data produced by them.
Finally, Chapter 19 discusses joint analysis of text and non-text data, which is
often needed in many applications since it is in general beneficial to use as
much data as possible for gaining knowledge and intelligence through (big)
data analysis.
Part IV. Unified Text Management and Analysis System. This last part consists
of Chapter 20 where we attempt to discuss how all the techniques discussed
in this book can be conceptually integrated in an operator-based unified
framework, and thus potentially implemented in a general unified system
for text management and analysis that can be useful for supporting a wide
range of different applications. This part also serves as a roadmap for further
extension of META to provide effective and general high-level support for
various applications and provides guidance on how META may be integrated
with many other related existing toolkits, including particularly search engine
systems, database systems, natural language processing toolkits, machine
learning toolkits, and data mining toolkits.
Due to our attempt to treat all the topics from a practical perspective, most
of the discussions of the concepts and techniques in the book are informal and
intuitive. To satisfy the needs of readers who might be interested in a deeper understanding of some topics, the book also includes an appendix with notes to
provide a more detailed and rigorous explanation of a few important topics.
such a tradeoff, we have chosen to emphasize the coverage of the basic concepts
and practical techniques of text data mining at the cost of not being able to cover
many advanced techniques in detail, and provide some references at the end of
many chapters to help readers learn more about those advanced techniques if
they wish to. Our hope is that with the foundation received from reading this
book, you will be able to learn about more advanced techniques by yourself or via
another resource. We have also chosen to cover more general techniques for text
management and analysis and favor techniques that can be applicable to any text in
any natural language. Most techniques we discuss can be implemented without any
human effort or only requiring minimal human effort; this is in contrast to some
more detailed analysis of text data, particularly using natural language processing
techniques. Such “deep analysis” techniques are obviously very important and are
indeed necessary for some applications where we would like to go in-depth to
understand text in detail. However, at this point, these techniques are often not
scalable and they tend to require a large amount of human effort. In practice, it
would be beneficial to combine both kinds of techniques.
We envision three main (and potentially overlapping) categories of readers.
the entire book as an Introduction to Text Data Mining, while skipping some
chapters in Part 2 that are more specific to search engine implementation and
applications specific to the Web. Another choice would be using all parts as a
supplemental graduate textbook, where there is still some emphasis on prac-
tical programming knowledge that can be combined with reading referenced
papers in each chapter. Exercises for graduate students could be to implement in META some of the methods they read about in the references.
The exercises at the end of each chapter give students experience working
with a powerful—yet easily understandable—text retrieval and mining toolkit
in addition to written questions. In a programming-focused class, using the
META exercises is strongly encouraged. Programming assignments can be created by selecting a subset of exercises in each chapter. Due to the modular
nature of the toolkit, additional programming experiments may be created by
extending the existing system or implementing other well-known algorithms
that do not come with META by default. Finally, students may use compo-
nents of META they learned through the exercises to complete a larger final
programming project. Using different corpora with the toolkit can yield dif-
ferent project challenges, e.g., review summary vs. sentiment analysis.
Practitioners. Most readers in industry would most likely use this book as a
reference, although we also hope that it may serve as some inspiration in
your own work. As with the student user suggestion, we think you would get the most out of this book by first reading the initial three chapters. Then, you may
choose a chapter relevant to your current interests and delve deeper or refresh
your knowledge.
Since many applications in META can be used simply via config files, we anticipate that it offers a quick way to get a handle on your dataset and obtain some baseline results without any programming required.
The exercises at the end of each chapter can be thought of as default
implementations for a particular task at hand. You may choose to include
META in your work since it uses a permissive free software license. In fact, it is
dual-licensed under MIT and University of Illinois/NCSA licenses. Of course,
we still encourage and invite you to share any non-proprietary modifications, extensions, and improvements to META for the benefit of all readers.
No matter what your goal is, we hope that you find this book useful and educa-
tional. We also appreciate your comments and suggestions for improvement of the
book. Thanks for reading!
of some key techniques important for text mining, notably the information extrac-
tion (IE) techniques which are essential for text mining. We decided not to cover IE
because the IE techniques tend to be language-specific and require non-trivial man-
ual work by humans. Another reason is that many IE techniques rely on supervised
machine learning approaches, which are well covered in many existing machine
learning books (see, e.g., Bishop 2006, Mitchell 1997). Readers who are interested
in knowing more about IE can start with the survey book [Sarawagi 2008] and review
articles [Jiang 2012].
From an application perspective, another important topic missing in this book
is information visualization, which is due to our focus on the coverage of models
and algorithms. However, since every application system must have a user-friendly
interface to allow users to optimally interact with a system, those readers who are
interested in developing text data application systems will surely find it useful to
learn more about user interface design. An excellent reference to start with is Hearst
[2009], which also has a detailed coverage of information visualization.
Finally, due to our emphasis on breadth, the book does not cover any compo-
nent algorithm in depth. To know more about some of the topics, readers can
further read books in natural language processing (e.g., Jurafsky and Martin 2009,
Manning and Schütze 1999), advanced books on IR (e.g., Baeza-Yates and Ribeiro-
Neto [2011]), and books on machine learning (e.g., Bishop [2006]). You may find
more specific recommendations of readings relevant to a particular topic in the
Bibliographic Notes at the end of each chapter that covers the corresponding topic.