Data-Intensive Text Processing with MapReduce
Synthesis Lectures on Human
Language Technologies
Editor
Graeme Hirst, University of Toronto
Synthesis Lectures on Human Language Technologies is edited by Graeme Hirst of the University
of Toronto. The series consists of 50- to 150-page monographs on topics relating to natural
language processing, computational linguistics, information retrieval, and spoken language
understanding. Emphasis is on important new techniques, on new applications, and on topics that
combine two or more HLT subfields.
Dependency Parsing
Sandra Kübler, Ryan McDonald, and Joakim Nivre
2009
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in
any form or by any means (electronic, mechanical, photocopy, recording, or any other) except for brief quotations in
printed reviews, without the prior permission of the publisher.
DOI 10.2200/S00274ED1V01Y201006HLT007
Lecture #7
Series Editor: Graeme Hirst, University of Toronto
Series ISSN
Synthesis Lectures on Human Language Technologies
Print 1947-4040 Electronic 1947-4059
Data-Intensive Text Processing
with MapReduce
Morgan & Claypool Publishers
ABSTRACT
Our world is being revolutionized by data-driven methods: access to large amounts of data has gen-
erated new insights and opened exciting new opportunities in commerce, science, and computing
applications. Processing the enormous quantities of data necessary for these advances requires large
clusters, making distributed computing paradigms more crucial than ever. MapReduce is a program-
ming model for expressing distributed computations on massive datasets and an execution framework
for large-scale data processing on clusters of commodity servers. The programming model provides
an easy-to-understand abstraction for designing scalable algorithms, while the execution framework
transparently handles many system-level details, ranging from scheduling to synchronization to fault
tolerance. This book focuses on MapReduce algorithm design, with an emphasis on text processing
algorithms common in natural language processing, information retrieval, and machine learning.
We introduce the notion of MapReduce design patterns, which represent general reusable solutions
to commonly occurring problems across a variety of problem domains. This book not only intends
to help the reader "think in MapReduce," but also discusses the limitations of the programming
model itself.
KEYWORDS
Hadoop, parallel and distributed programming, algorithm design, text processing, nat-
ural language processing, information retrieval, machine learning
Contents
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Computing in the Clouds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2 Big Ideas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3 Why Is This Different? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.4 What This Book Is Not . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2 MapReduce Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1 Functional Programming Roots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2 Mappers and Reducers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3 The Execution Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.4 Partitioners and Combiners . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .26
2.5 The Distributed File System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.6 Hadoop Cluster Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
CHAPTER 1
Introduction
MapReduce [45] is a programming model for expressing distributed computations on massive
amounts of data and an execution framework for large-scale data processing on clusters of com-
modity servers. It was originally developed by Google and built on well-known principles in parallel
and distributed processing dating back several decades. MapReduce has since enjoyed widespread
adoption via an open-source implementation called Hadoop, whose development was led by Yahoo
(now an Apache project). Today, a vibrant software ecosystem has sprung up around Hadoop, with
significant activity in both industry and academia.
This book is about scalable approaches to processing large amounts of text with MapReduce.
Given this focus, it makes sense to start with the most basic question: Why? There are many answers
to this question, but we focus on two. First, big data is a fact of the world, and therefore an issue that
real-world systems must grapple with. Second, across a wide range of text processing applications,
more data translates into more effective algorithms, and thus it makes sense to take advantage of the
plentiful amounts of data that surround us.
Modern information societies are defined by vast repositories of data, both public and private.
Therefore, any practical application must be able to scale up to datasets of interest. For many, this
means scaling up to the web, or at least a non-trivial fraction thereof. Any organization built around
gathering, analyzing, monitoring, filtering, searching, or organizing web content must tackle large-
data problems: web-scale processing is practically synonymous with data-intensive processing. This
observation applies not only to well-established internet companies, but also to countless startups and
niche players. Just think: how many companies do you know that start their pitch with "we're
going to harvest information on the web and…"?
Another strong area of growth is the analysis of user behavior data. Any operator of a moder-
ately successful website can record user activity and in a matter of weeks (or sooner) be drowning in
a torrent of log data. In fact, logging user behavior generates so much data that many organizations
simply can't cope with the volume, and either turn the functionality off or throw away data after
some time. This represents lost opportunities, as there is a broadly held belief that great value lies in
insights derived from mining such data. Knowing what users look at, what they click on, how much
time they spend on a web page, etc., leads to better business decisions and competitive advantages.
Broadly, this is known as business intelligence, which encompasses a wide range of technologies
including data warehousing, data mining, and analytics.
How much data are we talking about? A few examples: Google grew from processing 100 tera-
bytes of data a day with MapReduce in 2004 [45] to processing 20 petabytes a day with MapReduce
in 2008 [46]. In April 2009, a blog post1 was written about eBay's two enormous data warehouses:
one with 2 petabytes of user data, and the other with 6.5 petabytes of user data spanning 170 trillion
records and growing by 150 billion new records per day. Shortly thereafter, Facebook revealed2
similarly impressive numbers, boasting of 2.5 petabytes of user data, growing at about 15 terabytes
per day. Petabyte datasets are rapidly becoming the norm, and the trends are clear: our ability to
store data is fast overwhelming our ability to process what we store. More distressing, increases in
capacity are outpacing improvements in bandwidth such that our ability to even read back what we
store is deteriorating [91]. Disk capacities have grown from tens of megabytes in the mid-1980s
to about a couple of terabytes today (several orders of magnitude). On the other hand, latency and
bandwidth have improved relatively little: in the case of latency, perhaps a 2× improvement during the
last quarter century, and in the case of bandwidth, perhaps 50×. Given the tendency for individuals
and organizations to continuously fill up whatever capacity is available, large-data problems are
growing increasingly severe.
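To make the capacity-versus-bandwidth gap concrete, here is a back-of-the-envelope sketch in Python; the capacity and transfer-rate figures below are assumed round numbers, not measurements of any particular drive:

# How long does it take to read back an entire modern disk sequentially?
capacity_bytes = 2 * 10**12        # ~2 TB drive (assumed)
throughput_bps = 100 * 10**6       # ~100 MB/s sustained sequential read (assumed)
print(capacity_bytes / throughput_bps / 3600, "hours")   # roughly 5-6 hours

# A mid-1980s disk of ~40 MB at ~1 MB/s could be read back in well under a minute,
# which is why capacity growth outpacing bandwidth growth is so distressing.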
Moving beyond the commercial sphere, many have recognized the importance of data man-
agement in many scientific disciplines, where petabyte-scale datasets are also becoming increasingly
common [21]. For example:
The high-energy physics community was already describing experiences with petabyte-scale
databases back in 2005 [20]. Today, the Large Hadron Collider (LHC) near Geneva is the
world's largest particle accelerator, designed to probe the mysteries of the universe, including
the fundamental nature of matter, by recreating conditions shortly following the Big Bang.
When it becomes fully operational, the LHC will produce roughly 15 petabytes of data a year.3
Astronomers have long recognized the importance of a digital observatory that would support
the data needs of researchers across the globe; the Sloan Digital Sky Survey [145] is perhaps
the most well known of these projects. Looking into the future, the Large Synoptic Survey
Telescope (LSST) is a wide-field instrument that is capable of observing the entire sky every
few days. When the telescope comes online around 2015 in Chile, its 3.2 gigapixel primary
camera will produce approximately half a petabyte of archive images every month [19].
The advent of next-generation DNA sequencing technology has created a deluge of sequence
data that needs to be stored, organized, and delivered to scientists for further study. Given
the fundamental tenet in modern genetics that genotypes explain phenotypes, the impact
of this technology is nothing less than transformative [103]. The European Bioinformatics
Institute (EBI), which hosts a central repository of sequence data called EMBL-bank, has
increased storage capacity from 2.5 petabytes in 2008 to 5 petabytes in 2009 [142]. Scientists
are predicting that, in the not-so-distant future, sequencing an individual's genome will be no
more complex than getting a blood test today, ushering in a new era of personalized medicine,
where interventions can be specifically targeted for an individual.
1 http://www.dbms2.com/2009/04/30/ebays-two-enormous-data-warehouses/
2 http://www.dbms2.com/2009/05/11/facebook-hadoop-and-hive/
3 http://public.web.cern.ch/public/en/LHC/Computing-en.html
Increasingly, scientific breakthroughs will be powered by advanced computing capabilities that help
researchers manipulate, explore, and mine massive datasets [72]. This has been hailed as the emerging
"fourth paradigm" of science [73] (complementing theory, experiments, and simulations). In
other areas of academia, particularly computer science, systems and algorithms incapable of scaling
to massive real-world datasets run the danger of being dismissed as "toy systems" with limited utility.
Large data is a fact of today's world and data-intensive processing is fast becoming a necessity, not
merely a luxury or curiosity.
Although large data comes in a variety of forms, this book is primarily concerned with pro-
cessing large amounts of text, but touches on other types of data as well (e.g., relational and graph
data). The problems and solutions we discuss mostly fall into the disciplinary boundaries of natural
language processing (NLP) and information retrieval (IR). Recent work in these fields is dominated
by a data-driven, empirical approach, typically involving algorithms that attempt to capture statistical
regularities in data for the purposes of some task or application. There are three components to this
approach: data, representations of the data, and some method for capturing regularities in the data.
Data are called corpora (singular, corpus) by NLP researchers and collections by those from the IR
community. Aspects of the representations of the data are called features, which may be superficial
and easy to extract, such as the words and sequences of words themselves, or deep and more difficult
to extract, such as the grammatical relationship between words. Finally, algorithms or models are
applied to capture regularities in the data in terms of the extracted features for some application. One
common application, classification, is to sort text into categories. Examples include: Is this email
spam or not spam? Is this word part of an address or a location? The first task is easy to understand,
while the second task is an instance of what NLP researchers call named-entity detection [138],
which is useful for local search and pinpointing locations on maps. Another common application is to
rank texts according to some criteria; search is a good example, which involves ranking documents
by relevance to the user's query. Another example is to automatically situate texts along a scale of
happiness, a task known as sentiment analysis or opinion mining [118], which has been applied to
everything from understanding political discourse in the blogosphere to predicting the movement
of stock prices.
There is a growing body of evidence, at least in text processing, that of the three components
discussed above (data, features, algorithms), data probably matters the most. Superficial word-level
features coupled with simple models in most cases trump sophisticated models with deeper features
and less data. But why can't we have our cake and eat it too? Why not both sophisticated models and
deep features applied to lots of data? Because inference over sophisticated models and extraction of
deep features are often computationally intensive, they don't scale well.
Consider a simple task such as determining the correct usage of easily confusable words such
as "than" and "then" in English. One can view this as a supervised machine learning problem: we
can train a classifier to disambiguate between the options, and then apply the classifier to new
instances of the problem (say, as part of a grammar checker). Training data is fairly easy to come
by: we can just gather a large corpus of texts and assume that most writers make correct choices
(the training data may be noisy, since people make mistakes, but no matter). In 2001, Banko and
Brill [14] published what has become a classic paper in natural language processing exploring the
effects of training data size on classification accuracy, using this task as the specific example. They
explored several classification algorithms (the exact ones aren't important, as we shall see), and not
surprisingly, found that more data led to better accuracy. Across many different algorithms, the
increase in accuracy was approximately linear in the log of the size of the training data. Furthermore,
with increasing amounts of training data, the accuracy of different algorithms converged, such that
pronounced differences in effectiveness observed on smaller datasets basically disappeared at scale.
This led to a somewhat controversial conclusion (at least at the time): machine learning algorithms
really don't matter; all that matters is the amount of data you have. This resulted in an even more
controversial recommendation, delivered somewhat tongue-in-cheek: we should just give up working
on algorithms and simply spend our time gathering data (while waiting for computers to become
faster so we can process the data).
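As a rough illustration of the supervised formulation (not the setup Banko and Brill actually used), the sketch below extracts the words surrounding each occurrence of the confusable pair as features and trains a simple classifier; the tiny corpus, window size, and model choice are illustrative assumptions:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def contexts(sentence, window=2):
    # Yield (context, label) pairs for each occurrence of a confusable word.
    tokens = sentence.lower().split()
    for i, tok in enumerate(tokens):
        if tok in ("than", "then"):
            ctx = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            yield " ".join(ctx), tok

corpus = [
    "she is taller than her brother",
    "we ate dinner and then watched a movie",
    "this is better than that",
    "first boil the water then add the pasta",
]
X, y = zip(*(pair for s in corpus for pair in contexts(s)))

clf = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(X, y)
# Classify a new occurrence by the words around the blank position:
print(clf.predict(["your homework and go outside"]))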
As another example, consider the problem of answering short, fact-based questions such as
Who shot Abraham Lincoln? Instead of returning a list of documents that the user would then have
to sort through, a question answering (QA) system would directly return the answer: John Wilkes
Booth. This problem gained interest in the late 1990s, when natural language processing researchers
approached the challenge with sophisticated linguistic processing techniques such as syntactic and
semantic analysis. Around 2001, researchers discovered a far simpler approach to answering such
questions based on pattern matching [27; 53; 92]. Suppose you wanted the answer to the above
question. As it turns out, you can simply search for the phrase "shot Abraham Lincoln" on the web
and look for what appears to its left. Or better yet, look through multiple instances of this phrase
and tally up the words that appear to the left. This simple strategy works surprisingly well, and has
become known as the redundancy-based approach to question answering. It capitalizes on the insight
that in a very large text collection (i.e., the web), answers to commonly asked questions will be stated
in obvious ways, such that pattern-matching techniques suffice to extract answers accurately.
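A minimal sketch of this tallying idea in Python; the snippets below are made-up stand-ins for search-engine results, not real data:

from collections import Counter

pattern = "shot abraham lincoln"
snippets = [
    "john wilkes booth shot abraham lincoln at ford's theatre",
    "the actor john wilkes booth shot abraham lincoln in 1865",
    "booth shot abraham lincoln during a performance",
]

votes = Counter()
for s in snippets:
    left, sep, _ = s.lower().partition(pattern)
    if sep:                                  # the pattern occurs in this snippet
        words = left.split()
        votes[" ".join(words[-2:])] += 1     # tally up to two words to the left

print(votes.most_common())
# A real system would aggregate over many more snippets and normalize answer variants.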
Yet another example concerns smoothing in web-scale language models [25]. A language
model is a probability distribution that characterizes the likelihood of observing a particular sequence
of words, estimated from a large corpus of texts. They are useful in a variety of applications, such as
speech recognition (to determine what the speaker is more likely to have said) and machine translation
(to determine which of the possible translations is the most fluent, as we will discuss in Section 6.4). Since
there are infinitely many possible strings, and probabilities must be assigned to all of them, language
modeling is a more challenging task than simply keeping track of which strings were seen how many
times: some number of likely strings will never be encountered, even with lots and lots of training
data! Most modern language models make the Markov assumption: in an n-gram language model,
the conditional probability of a word is given by the n − 1 previous words. Thus, by the chain rule,
the probability of a sequence of words can be decomposed into the product of n-gram probabilities.
Nevertheless, an enormous number of parameters must still be estimated from a training corpus:
potentially V^n parameters, where V is the number of words in the vocabulary. Even if we treat
every word on the web as the training corpus from which to estimate the n-gram probabilities, most
n-grams, in any language, even English, will never have been seen. To cope with this sparseness,
researchers have developed a number of smoothing techniques [35; 79; 102], which all share the basic
idea of moving probability mass from observed to unseen events in a principled manner. Smoothing
approaches vary in effectiveness, both in terms of intrinsic and application-specific metrics. In 2007,
Brants et al. [25] described language models trained on up to two trillion words.4 Their experiments
compared a state-of-the-art approach known as Kneser-Ney smoothing [35] with another technique
the authors affectionately referred to as "stupid backoff."5 Not surprisingly, stupid backoff didn't work
as well as Kneser-Ney smoothing on smaller corpora. However, it was simpler and could be trained
on more data, which ultimately yielded better language models. That is, a simpler technique on more
data beat a more sophisticated technique on less data.
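The following sketch shows the flavor of stupid backoff for a bigram model over a toy corpus; the backoff factor 0.4 is the value reported by Brants et al., while the corpus and the bigram-only setup are illustrative assumptions (note that the resulting scores are relative frequencies, not normalized probabilities):

from collections import Counter

tokens = "the cat sat on the mat the cat ate".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
total = sum(unigrams.values())

def stupid_backoff(prev, word, alpha=0.4):
    # Use the bigram relative frequency if the bigram was seen;
    # otherwise back off to the scaled unigram relative frequency.
    if bigrams[(prev, word)] > 0:
        return bigrams[(prev, word)] / unigrams[prev]
    return alpha * unigrams[word] / total

print(stupid_backoff("the", "cat"))   # observed bigram: plain relative frequency
print(stupid_backoff("mat", "cat"))   # unseen bigram: falls back to the unigram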
Recently, three Google researchers summarized this data-driven philosophy in an essay titled
"The Unreasonable Effectiveness of Data" [65].6 Why is this so? It boils down to the fact that language
in the wild, just like human behavior in general, is messy. Unlike, say, the interaction of subatomic
particles, human use of language is not constrained by succinct, universal laws of grammar. There
are of course rules that govern the formation of words and sentences (for example, that verbs appear
before objects in English, and that subjects and verbs must agree in number in many languages), but
real-world language is affected by a multitude of other factors as well: people invent new words and
phrases all the time, authors occasionally make mistakes, groups of individuals write within a shared
context, etc. The Argentine writer Jorge Luis Borges wrote a famous allegorical one-paragraph story
about a fictional society in which the art of cartography had gotten so advanced that their maps were
as big as the lands they were describing.7 The world, he would say, is the best description of itself. In
the same way, the more observations we gather about language use, the more accurate a description
we have of language itself. This, in turn, translates into more effective algorithms and systems.
So, in summary, why large data? In some ways, the first answer is similar to the reason people
climb mountains: because they're there. But the second answer is even more compelling. Data
represent the rising tide that lifts all boats: more data lead to better algorithms and systems for
solving real-world problems. Now that we've addressed the why, let's tackle the how. Let's start with
the obvious observation: data-intensive processing is beyond the capability of any individual machine
and requires clusters, which means that large-data problems are fundamentally about organizing
computations on dozens, hundreds, or even thousands of machines. This is exactly what MapReduce
does, and the rest of this book is about the how.
4 As an aside, it is interesting to observe the evolving definition of "large" over the years. Banko and Brill's paper in 2001 was titled
"Scaling to Very Very Large Corpora for Natural Language Disambiguation," and dealt with a corpus containing a billion words.
5 As in, so stupid it couldn't possibly work.
6 This title was inspired by a classic article titled "The Unreasonable Effectiveness of Mathematics in the Natural Sciences" [155]. This
is somewhat ironic in that the original article lauded the beauty and elegance of mathematical models in capturing natural
phenomena, which is the exact opposite of the data-driven approach.
7 On Exactitude in Science [23]. A similar exchange appears in Chapter XI of Sylvie and Bruno Concluded by Lewis Carroll (1893).
8 What is the difference between cloud computing and grid computing? Although both tackle the fundamental problem of how best
to bring computational resources to bear on large and difficult problems, they start with different assumptions. Whereas clouds
are assumed to be relatively homogeneous servers that reside in a datacenter or are distributed across a relatively small number
of datacenters controlled by a single organization, grids are assumed to be a less tightly-coupled federation of heterogeneous
resources under the control of distinct but cooperative organizations. As a result, grid computing tends to deal with tasks that
are coarser-grained, and must deal with the practicalities of a federated environment, e.g., verifying credentials across multiple
administrative domains. Grid computing has adopted a middleware-based approach for tackling many of these challenges.
9The first example is Facebook, a well-known user of Hadoop, in exactly the manner as described [68]. The second is, of course,
Google, which uses MapReduce to continuously improve existing algorithms and to devise new algorithms for ad selection and
placement.
resources and enforce isolation between multiple users who may be sharing the same hardware.
Once one or more virtual machine instances have been provisioned, the user has full control over the
resources and can use them for arbitrary computation. Virtual machines that are no longer needed
are destroyed, thereby freeing up physical resources that can be redirected to other users. Resource
consumption is measured in some equivalent of machine-hours and users are charged in increments
thereof.
Both users and providers benefit in the utility computing model. Users are freed from upfront
capital investments necessary to build datacenters and substantial recurring costs in maintaining
them. They also gain the important property of elasticity: as demand for computing resources grows,
for example, from an unpredicted spike in customers, more resources can be seamlessly allocated
from the cloud without an interruption in service. As demand falls, provisioned resources can be
released. Prior to the advent of utility computing, coping with unexpected spikes in demand was
fraught with challenges: under-provision and run the risk of service interruptions, or over-provision
and tie up precious capital in idle machines that are depreciating.
From the utility provider point of view, this business also makes sense because large datacenters
benefit from economies of scale and can be run more efficiently than smaller datacenters. In the same
way that insurance works by aggregating risk and redistributing it, utility providers aggregate the
computing demands for a large number of users. Although demand may fluctuate significantly for
each user, overall trends in aggregate demand should be smooth and predictable, which allows the
cloud provider to adjust capacity over time with less risk of either offering too much (resulting in
inefficient use of capital) or too little (resulting in unsatisfied customers). In the world of utility
computing, Amazon Web Services currently leads the way and remains the dominant player, but a
number of other cloud providers populate a market that is becoming increasingly crowded. Most
systems are based on proprietary infrastructure, but there is at least one, Eucalyptus [111], that
is available open source. Increased competition will benefit cloud users, but what direct relevance
does this have for MapReduce? The connection is quite simple: processing large amounts of data
with MapReduce requires access to clusters with sufficient capacity. However, not everyone with
large-data problems can afford to purchase and maintain clusters. This is where utility computing
comes in: clusters of sufficient size can be provisioned only when the need arises, and users pay only
as much as is required to solve their problems. This lowers the barrier to entry for data-intensive
processing and makes MapReduce much more accessible.
A generalization of the utility computing concept is "everything as a service," which is itself a
new take on the age-old idea of outsourcing. A cloud provider offering customers access to virtual
machine instances is said to be offering infrastructure as a service, or IaaS for short. However, this
may be too low level for many users. Enter platform as a service (PaaS), which is a rebranding of what
used to be called "hosted services" in the pre-cloud era. "Platform" is used generically to refer to any
set of well-defined services on top of which users can build applications, deploy content, etc. This
class of services is best exemplified by Google App Engine, which provides the backend datastore
and API for anyone to build highly scalable web applications. Google maintains the infrastructure,
freeing the user from having to back up, upgrade, patch, or otherwise maintain basic services such
as the storage layer or the programming environment. At an even higher level, cloud providers can
offer software as a service (SaaS), as exemplified by Salesforce, a leader in customer relationship
management (CRM) software. Other examples include outsourcing an entire organization's email
to a third party, which is commonplace today.
What does this proliferation of services have to do with MapReduce? No doubt that "everything
as a service" is driven by desires for greater business efficiencies, but scale and elasticity
play important roles as well. The cloud allows seamless expansion of operations without the need
for careful planning and supports scales that may otherwise be difficult or cost-prohibitive for an
organization to achieve. Cloud services, just like MapReduce, represent the search for an appro-
priate level of abstraction and beneficial divisions of labor. IaaS is an abstraction over raw physical
hardware: an organization might lack the capital, expertise, or interest in running datacenters, and
therefore pays a cloud provider to do so on its behalf. The argument applies similarly to PaaS and
SaaS. In the same vein, the MapReduce programming model is a powerful abstraction that separates
the what from the how of data-intensive processing.
Scale out, not up. For data-intensive workloads, a large number of commodity low-end servers
(i.e., the scaling out approach) is preferred over a small number of high-end servers (i.e., the scaling
up approach). The latter approach of purchasing symmetric multi-processing (SMP) machines with
a large number of processor sockets (dozens, even hundreds) and a large amount of shared memory
(hundreds or even thousands of gigabytes) is not cost effective, since the costs of such machines do
not scale linearly (i.e., a machine with twice as many processors is often significantly more than twice
as expensive). On the other hand, the low-end server market overlaps with the high-volume desktop
computing market, which has the effect of keeping prices low due to competition, interchangeable
components, and economies of scale.
Barroso and Hölzle's recent treatise on what they dubbed "warehouse-scale computers" [18]
contains a thoughtful analysis of the two approaches. The Transaction Processing Performance Council (TPC)
is a neutral, non-profit organization whose mission is to establish objective database benchmarks.
Benchmark data submitted to that organization are probably the closest one can get to a fair apples-
to-apples comparison of cost and performance for specific, well-defined relational processing appli-
cations. Based on TPC-C benchmark results from late 2007, a low-end server platform is about four
times more cost efficient than a high-end shared memory platform from the same vendor. Excluding
storage costs, the price/performance advantage of the low-end server increases to about a factor of
twelve.
What if we take into account the fact that communication between nodes in a high-end
SMP machine is orders of magnitude faster than communication between nodes in a commodity
network-based cluster? Since workloads today are beyond the capability of any single machine (no
matter how powerful), the comparison is more accurately between a smaller cluster of high-end
machines and a larger cluster of low-end machines (network communication is unavoidable in both
cases). Barroso and Hölzle model these two approaches under workloads that demand more or
less communication, and conclude that a cluster of low-end servers approaches the performance of
the equivalent cluster of high-end servers; the small performance gap is insufficient to justify the
price premium of the high-end servers. For data-intensive applications, the conclusion appears to
be clear: scaling out is superior to scaling up, and therefore most existing implementations of the
MapReduce programming model are designed around clusters of low-end commodity servers.
Capital costs in acquiring servers are, of course, only one component of the total cost of de-
livering computing capacity. Operational costs are dominated by the cost of electricity to power
the servers as well as other aspects of datacenter operations that are functionally related to power:
power distribution, cooling, etc. [18; 67]. As a result, energy efficiency has become a key issue in
building warehouse-scale computers for large-data processing. Therefore, it is important to factor in
operational costs when deploying a scale-out solution based on large numbers of commodity servers.
Datacenter efficiency is typically factored into three separate components that can be inde-
pendently measured and optimized [18]. The first component measures how much of a buildings
incoming power is actually delivered to computing equipment, and correspondingly, how much is
lost to the buildings mechanical systems (e.g., cooling, air handling) and electrical infrastructure
(e.g., power distribution inefficiencies). The second component measures how much of a servers
incoming power is lost to the power supply, cooling fans, etc. The third component captures how
much of the power delivered to computing components (processor, RAM, disk, etc.) is actually used
to perform useful computations.
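Because these three factors compound multiplicatively, overall end-to-end efficiency can be surprisingly low. A quick illustration with assumed, purely hypothetical numbers:

facility_to_it = 0.60     # fraction of building power that reaches computing equipment (assumed)
server_overhead = 0.70    # fraction surviving server power supplies, fans, etc. (assumed)
useful_compute = 0.50     # fraction of component power spent on useful computation (assumed)

print(facility_to_it * server_overhead * useful_compute)   # 0.21: only ~21% does useful work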
Of the three components of datacenter efficiency, the first two are relatively straightforward
to objectively quantify. Adoption of industry best-practices can help datacenter operators achieve
state-of-the-art efficiency. The third component, however, is much more difficult to measure. One
important issue that has been identified is the non-linearity between load and power draw. That is,
a server at 10% utilization may draw slightly more than half as much power as a server at 100%
utilization (which means that a lightly loaded server is much less efficient than a heavily loaded
server). A survey of five thousand Google servers over a six-month period shows that servers operate
most of the time at between 10% and 50% utilization [17], which is an energy-inefficient operating
region. As a result, Barroso and Hölzle have advocated for research and development in energy-
proportional machines, where energy consumption would be proportional to load, such that an
idle processor would (ideally) consume no power, but yet retain the ability to power up (nearly)
instantaneously in response to demand.
Although we have provided a brief overview here, datacenter efficiency is a topic that is beyond
the scope of this book. For more details, consult Barroso and Hölzle [18] and Hamilton [67], who
provide detailed cost models for typical modern datacenters. However, even factoring in operational
costs, evidence suggests that scaling out remains more attractive than scaling up.
Assume failures are common. At warehouse scale, failures are not only inevitable, but commonplace.
A simple calculation suffices to demonstrate: let us suppose that a cluster is built from reliable
machines with a mean-time between failures (MTBF) of 1000 days (about three years). Even with
these reliable servers, a 10,000-server cluster would still experience roughly 10 failures a day. For
the sake of argument, let us suppose that a MTBF of 10,000 days (about 30 years) were achievable
at realistic costs (which is unlikely). Even then, a 10,000-server cluster would still experience one
failure daily. This means that any large-scale service that is distributed across a large cluster (either a
user-facing application or a computing platform like MapReduce) must cope with hardware failures
as an intrinsic aspect of its operation [66]. That is, a server may fail at any time, without notice. For
example, in large clusters disk failures are common [123] and RAM experiences more errors than
one might expect [135]. Datacenters suffer from both planned outages (e.g., system maintenance
and hardware upgrades) and unexpected outages (e.g., power failure, connectivity loss, etc.).
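The failure arithmetic above is worth spelling out:

servers = 10_000
mtbf_days = 1_000                  # per-server mean time between failures, about three years
print(servers / mtbf_days)         # ~10 failures per day across the cluster

mtbf_days = 10_000                 # even with hypothetical 30-year-MTBF servers
print(servers / mtbf_days)         # still ~1 failure per day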
A well-designed, fault-tolerant service must cope with failures up to a point without impacting
the quality of service: failures should not result in inconsistencies or indeterminism from the user
perspective. As servers go down, other cluster nodes should seamlessly step in to handle the load,
and overall performance should gracefully degrade as server failures pile up. Just as important, a
broken server that has been repaired should be able to seamlessly rejoin the service without manual
reconfiguration by the administrator. Mature implementations of the MapReduce programming
model are able to robustly cope with failures through a number of mechanisms such as automatic
task restarts on different cluster nodes.
Move processing to the data. In traditional high-performance computing (HPC) applications (e.g.,
for climate or nuclear simulations), it is commonplace for a supercomputer to have processing nodes
and storage nodes linked together by a high-capacity interconnect. Many data-intensive workloads
are not very processor-demanding, which means that the separation of compute and storage creates
a bottleneck in the network. As an alternative to moving data around, it is more efficient to move
the processing around. That is, MapReduce assumes an architecture where processors and storage
(disk) are co-located. In such a setup, we can take advantage of data locality by running code on the
processor directly attached to the block of data we need. The distributed file system is responsible
for managing the data over which MapReduce operates.
Process data sequentially and avoid random access. Data-intensive processing by definition means
that the relevant datasets are too large to fit in memory and must be held on disk. Seek times for
random disk access are fundamentally limited by the mechanical nature of the devices: read heads
can only move so fast and platters can only spin so rapidly. As a result, it is desirable to avoid
random data access, and instead organize computations so that data are processed sequentially. A
simple scenario10 poignantly illustrates the large performance gap between sequential operations
and random seeks: assume a 1 terabyte database containing 10^10 100-byte records. Given reasonable
assumptions about disk latency and throughput, a back-of-the-envelope calculation will show that
updating 1% of the records (by accessing and then mutating each record) will take about a month
on a single machine. On the other hand, if one simply reads the entire database and rewrites all
the records (mutating those that need updating), the process would finish in under a work day on
a single machine. Sequential data access is, literally, orders of magnitude faster than random data
access.11
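Here is one version of that back-of-the-envelope calculation, using conventional (assumed) figures of roughly 10 ms per random disk access and 100 MB/s of sequential throughput:

records = 10**10
record_size = 100                      # bytes, so the database is about 1 TB
seek_s = 0.01                          # per random access (assumed)
throughput_bps = 100 * 10**6           # sequential read/write rate (assumed)

# Update 1% of the records by seeking to each one, reading it, and writing it back.
random_ops = 2 * 0.01 * records        # one seek to read plus one to write, per record
print(random_ops * seek_s / 86_400, "days")       # roughly 23 days, i.e., about a month

# Read the entire database sequentially and rewrite all the records.
sequential_s = 2 * records * record_size / throughput_bps
print(sequential_s / 3600, "hours")               # roughly 5-6 hours, under a work day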
The development of solid-state drives is unlikely to change this balance for at least two
reasons. First, the cost differential between traditional magnetic disks and solid-state disks remains
substantial: large data will for the most part remain on mechanical drives, at least in the near
future. Second, although solid-state disks have substantially faster seek times, order-of-magnitude
differences in performance between sequential and random access still remain.
MapReduce is primarily designed for batch processing over large datasets. To the extent
possible, all computations are organized into long streaming operations that take advantage of the
aggregate bandwidth of many disks in a cluster. Many aspects of MapReduce's design explicitly trade
latency for throughput.
Hide system-level details from the application developer. According to many guides on the practice
of software engineering written by experienced industry professionals, one of the key reasons why
writing code is difficult is because the programmer must simultaneously keep track of many details
in short-term memory, ranging from the mundane (e.g., variable names) to the sophisticated (e.g.,
a corner case of an algorithm that requires special treatment). This imposes a high cognitive load
and requires intense concentration, which leads to a number of recommendations about a program-
mer's environment (e.g., quiet office, comfortable furniture, large monitors, etc.). The challenges in
writing distributed software are greatly compounded: the programmer must manage details across
several threads, processes, or machines. Of course, the biggest headache in distributed programming
is that code runs concurrently in unpredictable orders, accessing data in unpredictable patterns.
This gives rise to race conditions, deadlocks, and other well-known problems. Programmers are
taught to use low-level devices such as mutexes and to apply high-level design patterns such as
producer-consumer queues to tackle these challenges, but the truth remains: concurrent programs
are notoriously difficult to reason about and even harder to debug.
MapReduce addresses the challenges of distributed programming by providing an abstraction
that isolates the developer from system-level details (e.g., locking of data structures, data starvation
issues in the processing pipeline, etc.). The programming model specifies simple and well-defined
interfaces between a small number of components, and therefore is easy for the programmer to reason
10 Adapted from a post by Ted Dunning on the Hadoop mailing list.
11 For more detail, Jacobs [76] provides real-world benchmarks in his discussion of large-data problems.
about. MapReduce maintains a separation of what computations are to be performed and how those
computations are actually carried out on a cluster of machines. The first is under the control of
the programmer, while the second is exclusively the responsibility of the execution framework or
runtime. The advantage is that the execution framework only needs to be designed once and verified
for correctness; thereafter, as long as the developer expresses computations in the programming
model, code is guaranteed to behave as expected. The upshot is that the developer is freed from
having to worry about system-level details (e.g., no more debugging race conditions and addressing
lock contention) and can instead focus on algorithm or application design.
Seamless scalability. For data-intensive processing, it goes without saying that scalable algorithms
are highly desirable. As an aspiration, let us sketch the behavior of an ideal algorithm. We can define
scalability along at least two dimensions.12 First, in terms of data: given twice the amount of data,
the same algorithm should take at most twice as long to run, all else being equal. Second, in terms of
resources: given a cluster twice the size, the same algorithm should take no more than half as long to
run. Furthermore, an ideal algorithm would maintain these desirable scaling characteristics across
a wide range of settings: on data ranging from gigabytes to petabytes, on clusters consisting of a
few to a few thousand machines. Finally, the ideal algorithm would exhibit these desired behaviors
without requiring any modifications whatsoever, not even tuning of parameters.
Other than for embarrassingly parallel problems, algorithms with the characteristics sketched
above are, of course, unobtainable. One of the fundamental assertions in Fred Brooks' classic The
Mythical Man-Month [28] is that adding programmers to a project behind schedule will only make it
fall further behind. This is because complex tasks cannot be chopped into smaller pieces and allocated
in a linear fashion, and is often illustrated with a cute quote: "nine women cannot have a baby in
one month." Although Brooks' observations are primarily about software engineers and the software
development process, the same is also true of algorithms: increasing the degree of parallelization
also increases communication costs. The algorithm designer is faced with diminishing returns, and
beyond a certain point, greater efficiencies gained by parallelization are entirely offset by increased
communication requirements.
Nevertheless, these fundamental limitations shouldn't prevent us from at least striving for the
unobtainable. The truth is that most current algorithms are far from the ideal. In the domain of text
processing, for example, most algorithms today assume that data fit in memory on a single machine.
For the most part, this is a fair assumption. But what happens when the amount of data doubles in the
near future, and then doubles again shortly thereafter? Simply buying more memory is not a viable
solution, as the amount of data is growing faster than the price of memory is falling. Furthermore,
the price of a machine does not scale linearly with the amount of available memory beyond a certain
point (once again, the scaling up vs. scaling out argument). Quite simply, algorithms that require
holding intermediate data in memory on a single machine will break on sufficiently large datasets:
moving from a single machine to a cluster architecture requires fundamentally different algorithms
(and reimplementations).
12 See also DeWitt and Gray [50] for slightly different definitions in terms of speedup and scaleup.
Perhaps the most exciting aspect of MapReduce is that it represents a small step toward
algorithms that behave in the ideal manner discussed above. Recall that the programming model
maintains a clear separation between what computations need to occur and how those computations
are actually orchestrated on a cluster. As a result, a MapReduce algorithm remains fixed, and it is
the responsibility of the execution framework to execute the algorithm. Amazingly, the MapReduce
programming model is simple enough that it is actually possible, in many circumstances, to approach
the ideal scaling characteristics discussed above. We introduce the idea of the "tradeable machine
hour," as a play on Brooks' classic title. If running an algorithm on a particular dataset takes 100
machine hours, then we should be able to finish in an hour on a cluster of 100 machines, or use a
cluster of 10 machines to complete the same task in ten hours.13 With MapReduce, this isn't so far
from the truth, at least for some applications.
For several decades, computer scientists have predicted that the dawn of the age of parallel
computing was right around the corner and that sequential processing would soon fade into obso-
lescence (consider, for example, the above quote). Yet, until very recently, they have been wrong. The
relentless progress of Moore's Law for several decades has ensured that most of the world's prob-
lems could be solved by single-processor machines, save the needs of a few (scientists simulating
molecular interactions or nuclear reactions, for example). Couple that with the inherent challenges
of concurrency, and the result has been that parallel processing and distributed systems have largely
been confined to a small segment of the market and esoteric upper-level electives in the computer
science curriculum.
However, all of that changed around the middle of the first decade of this century. The
manner in which the semiconductor industry had been exploiting Moore's Law simply ran out
of opportunities for improvement: faster clocks, deeper pipelines, superscalar architectures, and
other tricks of the trade reached a point of diminishing returns that did not justify continued
investment. This marked the beginning of an entirely new strategy and the dawn of the multi-core
era [115]. Unfortunately, this radical shift in hardware architecture was not matched at that time by
corresponding advances in how software could be easily designed for these new processors (but not
for lack of trying [104]). Nevertheless, parallel processing became an important issue at the forefront
of everyone's mind; it represented the only way forward.
13 Note that this idea meshes well with utility computing, where a 100-machine cluster running for one hour would cost the same
as a 10-machine cluster running for ten hours.
14 Guess when this was written? You may be surprised.
At around the same time, we witnessed the growth of large-data problems. In the late 1990s
and even during the beginning of the first decade of this century, relatively few organizations had data-
intensive processing needs that required large clusters: a handful of internet companies and perhaps
a few dozen large corporations. But then, everything changed. Through a combination of many
different factors (falling prices of disks, rise of user-generated web content, etc.), large-data problems
began popping up everywhere. Data-intensive processing needs became widespread, which drove
innovations in distributed computing such as MapReduce: first by Google, and then by Yahoo and
the open source community. This in turn created more demand: when organizations learned about
the availability of effective data analysis tools for large datasets, they began instrumenting various
business processes to gather even more data, driven by the belief that more data lead to deeper
insights and greater competitive advantages. Today, not only are large-data problems ubiquitous,
but technological solutions for addressing them are widely accessible. Anyone can download the
open-source Hadoop implementation of MapReduce, pay a modest fee to rent a cluster from a
utility cloud provider, and be happily processing terabytes upon terabytes of data within the week.
Finally, the computer scientists are right: the age of parallel computing has begun, both in terms of
multiple cores in a chip and multiple machines in a cluster (each of which often has multiple cores).
Why is MapReduce important? In practical terms, it provides a very effective tool for tackling
large-data problems. But beyond that, MapReduce is important in how it has changed the way
we organize computations at a massive scale. MapReduce represents the first widely adopted step
away from the von Neumann model that has served as the foundation of computer science for
more than half a century. Valiant called this a "bridging model" [148], a conceptual bridge between
the physical implementation of a machine and the software that is to be executed on that machine.
Until recently, the von Neumann model has served us well: Hardware designers focused on efficient
implementations of the von Neumann model and didn't have to think much about the actual soft-
ware that would run on the machines. Similarly, the software industry developed software targeted
at the model without worrying about the hardware details. The result was extraordinary growth:
chip designers churned out successive generations of increasingly powerful processors, and software
engineers were able to develop applications in high-level languages that exploited those processors.
Today, however, the von Neumann model isn't sufficient anymore: we can't treat a multi-core
processor or a large cluster as an agglomeration of many von Neumann machine instances commu-
nicating over some interconnect. Such a view places too much burden on the software developer
to effectively take advantage of available computational resources; it simply is the wrong level of
abstraction. MapReduce can be viewed as the first breakthrough in the quest for new abstractions
that allow us to organize computations, not over individual machines, but over entire clusters. As
Barroso puts it, "the datacenter is the computer" [18; 119].
To be fair, MapReduce is certainly not the first model of parallel computation that has been
proposed. The most prevalent model in theoretical computer science, which dates back several
decades, is the PRAM [60; 77].15 In the model, an arbitrary number of processors, sharing an
15 More than a theoretical model, the PRAM has been recently prototyped in hardware [153].
unboundedly large memory, operate synchronously on a shared input to produce some output.
Other models include LogP [43] and BSP [148]. For reasons that are beyond the scope of this book,
none of these previous models have enjoyed the success that MapReduce has in terms of adoption
and in terms of impact on the daily lives of millions of users.16
MapReduce is the most successful abstraction over large-scale computational resources we
have seen to date. However, as anyone who has taken an introductory computer science course
knows, abstractions manage complexity by hiding details and presenting well-defined behaviors to
users of those abstractions. They, inevitably, are imperfect, making certain tasks easier but others
more difficult, and sometimes impossible (in the case where the detail suppressed by the abstraction
is exactly what the user cares about). This critique applies to MapReduce: it makes certain large-data
problems easier, but suffers from limitations as well. This means that MapReduce is not the final
word, but rather the first in a new class of programming models that will allow us to more effectively
organize computations at a massive scale.
So if MapReduce is only the beginning, what's next beyond MapReduce? We're getting ahead
of ourselves, as we can't meaningfully answer this question before thoroughly understanding what
MapReduce can and cannot do well. This is exactly the purpose of this book: let us now begin our
exploration.
16 Nevertheless, it is important to understand the relationship between MapReduce and existing models so that we can bring to
bear accumulated knowledge about parallel algorithms; for example, Karloff et al. [82] demonstrated that a large class of PRAM
algorithms can be efficiently simulated via MapReduce.
CHAPTER 2
MapReduce Basics
The only feasible approach to tackling large-data problems today is to divide and conquer, a funda-
mental concept in computer science that is introduced very early in typical undergraduate curricula.
The basic idea is to partition a large problem into smaller sub-problems. To the extent that the
sub-problems are independent [5], they can be tackled in parallel by different workers: threads in a
processor core, cores in a multi-core processor, multiple processors in a machine, or many machines
in a cluster. Intermediate results from each individual worker are then combined to yield the final
output.1
The general principles behind divide-and-conquer algorithms are broadly applicable to a wide
range of problems in many different application domains. However, the details of their implemen-
tations are varied and complex. For example, the following are just some of the issues that need to
be addressed:
How do we break up a large problem into smaller tasks? More specifically, how do we decom-
pose the problem so that the smaller tasks can be executed in parallel?
How do we assign tasks to workers distributed across a potentially large number of machines
(while keeping in mind that some workers are better suited to running some tasks than others,
e.g., due to available resources, locality constraints, etc.)?
How do we ensure that the workers get the data they need?
How do we share partial results from one worker that is needed by another?
How do we accomplish all of the above in the face of software errors and hardware faults?
Message Passing Interface (MPI) for cluster-level parallelism,3 provide logical abstractions that hide
details of operating system synchronization and communications primitives. However, even with
these extensions, developers are still burdened to keep track of how resources are made available to
workers. Additionally, these frameworks are mostly designed to tackle processor-intensive problems
and have only rudimentary support for dealing with very large amounts of input data. When using
existing parallel computing approaches for large-data computation, the programmer must devote a
significant amount of attention to low-level system details, which detracts from higher-level problem
solving.
One of the most significant advantages of MapReduce is that it provides an abstraction that
hides many system-level details from the programmer. Therefore, a developer can focus on what
computations need to be performed, as opposed to how those computations are actually carried out
or how to get the data to the processes that depend on them. Like OpenMP and MPI, MapReduce
provides a means to distribute computation without burdening the programmer with the details of
distributed computing (but at a different level of granularity). However, organizing and coordinating
large amounts of computation is only part of the challenge. Large-data processing by definition
requires bringing data and code together for computation to occur, no small feat for datasets that
are terabytes and perhaps petabytes in size! MapReduce addresses this challenge by providing a
simple abstraction for the developer, transparently handling most of the details behind the scenes
in a scalable, robust, and efficient manner. As we mentioned in Chapter 1, instead of moving large
amounts of data around, it is far more efficient, if possible, to move the code to the data. This is
operationally realized by spreading data across the local disks of nodes in a cluster and running
processes on nodes that hold the data. The complex task of managing storage in such a processing
environment is typically handled by a distributed file system that sits underneath MapReduce.
This chapter introduces the MapReduce programming model and the underlying distributed
file system. We start in Section 2.1 with an overview of functional programming, from which
MapReduce draws its inspiration. Section 2.2 introduces the basic programming model, focusing
on mappers and reducers. Section 2.3 discusses the role of the execution framework in actually
running MapReduce programs (called jobs). Section 2.4 fills in additional details by introducing
partitioners and combiners, which provide greater control over data flow. MapReduce would not be
practical without a tightly-integrated distributed file system that manages the data being processed;
Section 2.5 covers this in detail. Tying everything together, a complete cluster architecture is described
in Section 2.6 before the chapter ends with a summary.
Figure 2.1: Illustration of map and fold, two higher-order functions commonly used together in functional
programming: map takes a function f and applies it to every element in a list, while fold iteratively applies
a function g to aggregate results.
A key feature of functional programming languages is the concept of higher-order functions, or functions
that can accept other functions as arguments. Two common built-in higher-order functions are map
and fold, illustrated in Figure 2.1. Given a list, map takes as an argument a function f (that takes a
single argument) and applies it to all elements in a list (the top part of the diagram). Given a list,
fold takes as arguments a function g (that takes two arguments) and an initial value: g is first applied
to the initial value and the first item in the list, the result of which is stored in an intermediate
variable. This intermediate variable and the next item in the list serve as the arguments to a second
application of g, the results of which are stored in the intermediate variable. This process repeats
until all items in the list have been consumed; fold then returns the final value of the intermediate
variable.Typically, map and fold are used in combination. For example, to compute the sum of squares
of a list of integers, one could map a function that squares its argument (i.e., x.x 2 ) over the input
list, and then fold the resulting list with the addition function (more precisely, xy.x + y) using
an initial value of zero.
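To make the composition of map and fold concrete, the following minimal sketch expresses the sum-of-squares example using the streams library of later Java versions; the class name and sample list are ours, and streams serve here only as a stand-in for the higher-order functions discussed above.

import java.util.Arrays;
import java.util.List;

public class SumOfSquares {
    public static void main(String[] args) {
        List<Integer> xs = Arrays.asList(1, 2, 3, 4, 5);

        // map applies f = λx.x² to every element; reduce then folds the results
        // with g = λxλy.x + y, starting from the initial value 0.
        int sumOfSquares = xs.stream()
                             .map(x -> x * x)
                             .reduce(0, (a, b) -> a + b);

        System.out.println(sumOfSquares);  // prints 55
    }
}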
We can view map as a concise way to represent the transformation of a dataset (as defined
by the function f ). In the same vein, we can view fold as an aggregation operation, as defined by
the function g. One immediate observation is that the application of f to each item in a list (or
more generally, to elements in a large dataset) can be parallelized in a straightforward manner, since
each functional application happens in isolation. In a cluster, these operations can be distributed
across many different machines. The fold operation, on the other hand, has more restrictions on
data locality: elements in the list must be brought together before the function g can be applied.
However, many real-world applications do not require g to be applied to all elements of the list. To
the extent that elements in the list can be divided into groups, the fold aggregations can also proceed
in parallel. Furthermore, for operations that are commutative and associative, significant efficiencies
can be gained in the fold operation through local aggregation and appropriate reordering.
In a nutshell, we have described MapReduce. The map phase in MapReduce roughly corre-
sponds to the map operation in functional programming, whereas the reduce phase in MapReduce
roughly corresponds to the fold operation in functional programming. As we will discuss in detail
shortly, the MapReduce execution framework coordinates the map and reduce phases of processing
over large amounts of data on large clusters of commodity machines.
Viewed from a slightly different angle, MapReduce codifies a generic recipe for processing
large datasets that consists of two stages. In the first stage, a user-specified computation is applied
over all input records in a dataset. These operations occur in parallel and yield intermediate output
that is then aggregated by another user-specified computation. The programmer defines these two
types of computations, and the execution framework coordinates the actual processing (very loosely,
MapReduce provides a functional abstraction). Although such a two-stage processing structure may
appear to be very restrictive, many interesting algorithms can be expressed quite concisely, especially
if one decomposes complex algorithms into a sequence of MapReduce jobs. Subsequent chapters in
this book focus on how a number of algorithms can be implemented in MapReduce.
To be precise, MapReduce can refer to three distinct but related concepts. First, MapReduce
is a programming model, which is the sense discussed above. Second, MapReduce can refer to the
execution framework (i.e., the runtime) that coordinates the execution of programs written in this
particular style. Finally, MapReduce can refer to the software implementation of the programming
model and the execution framework: for example, Google's proprietary implementation vs. the
open-source Hadoop implementation in Java. And, in fact, there are many implementations of
MapReduce, e.g., targeted specifically for multi-core processors [127], for GPGPUs [71], for the
CELL architecture [126], etc. There are some differences between the MapReduce programming
model implemented in Hadoop and Google's proprietary implementation, which we will explicitly
discuss throughout the book. However, we take a rather Hadoop-centric view of MapReduce, since
Hadoop remains the most mature and accessible implementation to date, and therefore the one most
developers are likely to use.
Figure 2.2: Simplified view of MapReduce. Mappers are applied to all input key-value pairs, which
generate an arbitrary number of intermediate key-value pairs. Reducers are applied to all values associated
with the same key. Between the map and reduce phases lies a barrier that involves a large distributed sort
and group by.
1: class Mapper
2:   method Map(docid a, doc d)
3:     for all term t ∈ doc d do
4:       Emit(term t, count 1)

1: class Reducer
2:   method Reduce(term t, counts [c1, c2, . . .])
3:     sum ← 0
4:     for all count c ∈ counts [c1, c2, . . .] do
5:       sum ← sum + c
6:     Emit(term t, count sum)
Figure 2.3: Pseudo-code for the word count algorithm in MapReduce. The mapper emits an intermediate
key-value pair for each word in a document. The reducer sums up all counts for each word.
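For readers who want to connect the pseudo-code to actual Hadoop code, the following is one way Figure 2.3 might look using Hadoop's newer org.apache.hadoop.mapreduce API. This is a sketch, not a canonical listing: the class names, whitespace tokenization, and the placement of both classes in one listing are our simplifications.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits an intermediate (term, 1) pair for every token in the document text.
class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reducer: sums all counts associated with the same term.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}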
In both implementations, the reducer is presented with a key and an iterator over all values
associated with the particular key. The values are arbitrarily ordered. Google's implementation allows
the programmer to specify a secondary sort key for ordering the values (if desired), in which case
values associated with each key would be presented to the developer's reduce code in sorted order.
Later in Section 3.4 we discuss how to overcome this limitation in Hadoop to perform secondary
sorting. Another difference: in Google's implementation the programmer is not allowed to change
the key in the reducer. That is, the reducer output key must be exactly the same as the reducer input
key. In Hadoop, there is no such restriction, and the reducer can emit an arbitrary number of output
key-value pairs (with different keys).
To provide a bit more implementation detail: pseudo-code provided in this book roughly
mirrors how MapReduce programs are written in Hadoop. Mappers and reducers are objects that
implement the Map and Reduce methods, respectively. In Hadoop, a mapper object is initialized
for each map task (associated with a particular sequence of key-value pairs called an input split)
and the Map method is called on each key-value pair by the execution framework. In configuring
a MapReduce job, the programmer provides a hint on the number of map tasks to run, but the
execution framework (see next section) makes the final determination based on the physical layout
of the data (more details in Section 2.5 and Section 2.6). The situation is similar for the reduce
phase: a reducer object is initialized for each reduce task, and the Reduce method is called once per
intermediate key. In contrast with the number of map tasks, the programmer can precisely specify the
number of reduce tasks. We will return to discuss the details of Hadoop job execution in Section 2.6,
which is dependent on an understanding of the distributed file system (covered in Section 2.5). To
reiterate: although the presentation of algorithms in this book closely mirrors the way they would
be implemented in Hadoop, our focus is on algorithm design and conceptual understanding, not
actual Hadoop programming. For that, we recommend Tom White's book [154].
What are the restrictions on mappers and reducers? Mappers and reducers can express arbi-
trary computations over their inputs. However, one must generally be careful about use of external
resources since multiple mappers or reducers may be contending for those resources. For example,
it may be unwise for a mapper to query an external SQL database, since that would introduce a
scalability bottleneck on the number of map tasks that could be run in parallel (since they might
all be simultaneously querying the database).10 In general, mappers can emit an arbitrary number
of intermediate key-value pairs, and they need not be of the same type as the input key-value pairs.
Similarly, reducers can emit an arbitrary number of final key-value pairs, and they can differ in type
from the intermediate key-value pairs. Although not permitted in functional programming, mappers
and reducers can have side effects. This is a powerful and useful feature: for example, preserving state
across multiple inputs is central to the design of many MapReduce algorithms (see Chapter 3). Such
algorithms can be understood as having side effects that only change state that is internal to the
mapper or reducer. While the correctness of such algorithms may be more difficult to guarantee
(since the function's behavior depends not only on the current input but on previous inputs), most
potential synchronization problems are avoided since internal state is private only to individual
mappers and reducers.
Scheduling. Each MapReduce job is divided into smaller units called tasks (see Section 2.6 for
more details). For example, a map task may be responsible for processing a certain block of input
key-value pairs (called an input split in Hadoop); similarly, a reduce task may handle a portion of the
intermediate key space. It is not uncommon for MapReduce jobs to have thousands of individual
tasks that need to be assigned to nodes in the cluster. In large jobs, the total number of tasks may
exceed the number of tasks that can be run on the cluster concurrently, making it necessary for
the scheduler to maintain some sort of a task queue and to track the progress of running tasks so
that waiting tasks can be assigned to nodes as they become available. Another aspect of scheduling
involves coordination among tasks belonging to different jobs (e.g., from different users). How can a
large, shared resource support several users simultaneously in a predictable, transparent, policy-driven
fashion? There has been some recent work along these lines in the context of Hadoop [131; 160].
Speculative execution is an optimization that is implemented by both Hadoop and Google's
MapReduce implementation (called backup tasks [45]). Due to the barrier between the map and
reduce tasks, the map phase of a job is only as fast as the slowest map task. Similarly, the completion
time of a job is bounded by the running time of the slowest reduce task. As a result, the speed of
a MapReduce job is sensitive to what are known as stragglers, or tasks that take an unusually long
time to complete. One cause of stragglers is flaky hardware: for example, a machine that is suffering
from recoverable errors may become significantly slower. With speculative execution, an identical
copy of the same task is executed on a different machine, and the framework simply uses the result
of the first task attempt to finish. Zaharia et al. [161] presented different execution strategies in
a recent paper, and Google reported that speculative execution can improve job running times by
44% [45]. Although in Hadoop both map and reduce tasks can be speculatively executed, the common
wisdom is that the technique is more helpful for map tasks than reduce tasks, since each copy of the
reduce task needs to pull data over the network. Note, however, that speculative execution cannot
adequately address another common cause of stragglers: skew in the distribution of values associated
with intermediate keys (leading to reduce stragglers). In text processing we often observe Zipfian
distributions, which means that the task or tasks responsible for processing the most frequent few
elements will run much longer than the typical task. Better local aggregation, discussed in the next
chapter, is one possible solution to this problem.
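In Hadoop, speculative execution is a per-job configuration setting. As a small illustrative sketch (the class and method names are ours; the property names are those used in the 0.20.x releases discussed in this book and were renamed in later releases), one might keep speculation for map tasks but disable it for reduce tasks whose re-execution is expensive:

import org.apache.hadoop.conf.Configuration;

public class SpeculationSettings {
    // Keep speculative execution for map tasks, disable it for reduce tasks.
    public static Configuration reduceSpeculationOff() {
        Configuration conf = new Configuration();
        conf.setBoolean("mapred.map.tasks.speculative.execution", true);
        conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
        return conf;
    }
}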
Data/code co-location. The phrase data distribution is misleading, since one of the key ideas behind
MapReduce is to move the code, not the data. However, the more general point remains: in order
for computation to occur, we need to somehow feed data to the code. In MapReduce, this issue is
inextricably intertwined with scheduling and relies heavily on the design of the underlying distributed
file system.11 To achieve data locality, the scheduler starts tasks on the node that holds a particular
block of data (i.e., on its local drive) needed by the task. This has the effect of moving code to the
data. If this is not possible (e.g., a node is already running too many tasks), new tasks will be started
elsewhere, and the necessary data will be streamed over the network. An important optimization
here is to prefer nodes that are on the same rack in the datacenter as the node holding the relevant
data block, since inter-rack bandwidth is significantly less than intra-rack bandwidth.
11 In the canonical case, that is. Recall that MapReduce may receive its input from other sources.
Synchronization. In general, synchronization refers to the mechanisms by which multiple concur-
rently running processes join up, for example, to share intermediate results or otherwise exchange
state information. In MapReduce, synchronization is accomplished by a barrier between the map
and reduce phases of processing. Intermediate key-value pairs must be grouped by key, which is
accomplished by a large distributed sort involving all the nodes that executed map tasks and all the
nodes that will execute reduce tasks. This necessarily involves copying intermediate data over the
network, and therefore the process is commonly known as "shuffle and sort". A MapReduce job
with m mappers and r reducers involves up to m × r distinct copy operations, since each mapper
may have intermediate output going to every reducer.
Note that the reduce computation cannot start until all the mappers have finished emitting
key-value pairs and all intermediate key-value pairs have been shuffled and sorted, since the ex-
ecution framework cannot otherwise guarantee that all values associated with the same key have
been gathered. This is an important departure from functional programming: in a fold operation,
the aggregation function g is a function of the intermediate value and the next item in the list,
which means that values can be lazily generated and aggregation can begin as soon as values are
available. In contrast, the reducer in MapReduce receives all values associated with the same key at
once. However, it is possible to start copying intermediate key-value pairs over the network to the
nodes running the reducers as soon as each mapper finishes; this is a common optimization and is
implemented in Hadoop.
Error and fault handling. The MapReduce execution framework must accomplish all the tasks
above in an environment where errors and faults are the norm, not the exception. Since MapReduce
was explicitly designed around low-end commodity servers, the runtime must be especially resilient.
In large clusters, disk failures are common [123] and RAM experiences more errors than one might
expect [135]. Datacenters suffer from both planned outages (e.g., system maintenance and hardware
upgrades) and unexpected outages (e.g., power failure, connectivity loss, etc.).
And that's just hardware. No software is bug free: exceptions must be appropriately trapped,
logged, and recovered from. Large-data problems have a penchant for uncovering obscure corner
cases in code that is otherwise thought to be bug-free. Furthermore, any sufficiently large dataset will
contain corrupted data or records that are mangled beyond a programmer's imagination, resulting
in errors that one would never think to check for or trap. The MapReduce execution framework
must thrive in this hostile environment.
12 A note on the implementation of combiners in Hadoop: by default, the execution framework reserves the right to use combiners at
its discretion. In reality, this means that a combiner may be invoked zero, one, or multiple times. In addition, combiners in Hadoop
may actually be invoked in the reduce phase, i.e., after key-value pairs have been copied over to the reducer, but before the user
reducer code runs. As a result, combiners must be carefully written so that they can be executed in these different environments.
Section 3.1.2 discusses this in more detail.
Figure 2.4: Complete view of MapReduce, illustrating combiners and partitioners in addition to mappers
and reducers. Combiners can be viewed as mini-reducers in the map phase. Partitioners determine which
reducer is responsible for a particular key.
key, and the execution framework uses this information to copy the data to the right location
during the shuffle and sort phase.13 Therefore, a complete MapReduce job consists of code for the
mapper, reducer, combiner, and partitioner, along with job configuration parameters. The execution
framework handles everything else.
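In Hadoop, these pieces are wired together by a small driver program. The sketch below (class names carried over from our earlier word count sketch; again a simplification rather than a canonical listing) registers the mapper, a combiner, and the reducer, and submits the job; the partitioner is left at Hadoop's default hash partitioner.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);     // mapper from the earlier sketch
        job.setCombinerClass(WordCountReducer.class);  // the summing reducer doubles as a combiner
        job.setReducerClass(WordCountReducer.class);
        // job.setPartitionerClass(...) would override the default hash partitioner.

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}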
14 However, there is evidence that existing POSIX-based distributed cluster file systems (e.g., GPFS or PVFS) can serve as a replace-
ment for HDFS, when properly tuned or modified for MapReduce workloads [6; 146]. This, however, remains an experimental
use case.
15 To be precise, namenode and datanode may refer to physical machines in a cluster, or they may refer to daemons running on those
machines providing the relevant services.
Figure 2.5: The architecture of HDFS. The namenode (master) is responsible for maintaining the file
namespace and directing clients to datanodes (slaves) that actually hold data blocks containing user data.
All data transfer occurs directly between clients and datanodes; communication with the namenode only involves
transfer of metadata.
By default, HDFS stores three separate copies of each data block to ensure reliability, availabil-
ity, and performance. In large clusters, the three replicas are spread across different physical racks, so
HDFS is resilient towards two common failure scenarios: individual datanode crashes and failures in
networking equipment that bring an entire rack offline. Replicating blocks across physical machines
also increases opportunities to co-locate data and processing in the scheduling of MapReduce jobs,
since multiple copies yield more opportunities to exploit locality. The namenode is in periodic com-
munication with the datanodes to ensure proper replication of all the blocks: if there aren't enough
replicas (e.g., due to disk or machine failures or to connectivity losses due to networking equipment
failures), the namenode directs the creation of additional copies;16 if there are too many replicas
(e.g., a repaired node rejoins the cluster), extra copies are discarded.
To create a new file and write data to HDFS, the application client first contacts the namenode,
which updates the file namespace after checking permissions and making sure the file doesn't already
exist. The namenode allocates a new block on a suitable datanode, and the application is directed to
stream data directly to it. From the initial datanode, data is further propagated to additional replicas.
In the most recent release of Hadoop as of this writing (release 0.20.2), files are immutable: they
cannot be modified after creation. There are current plans to officially support file appends in the
near future, which is a feature already present in GFS.
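From the application's point of view, these interactions are hidden behind a simple file API. A minimal sketch of writing and then reading a file through Hadoop's FileSystem abstraction follows (the path and contents are hypothetical; the configuration is assumed to point at an HDFS namenode):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsFileExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up the cluster's HDFS settings
        FileSystem fs = FileSystem.get(conf);

        Path p = new Path("/user/example/foo.txt"); // hypothetical path
        FSDataOutputStream out = fs.create(p);      // namenode allocates blocks; data streams to datanodes
        out.writeUTF("hello HDFS");
        out.close();

        FSDataInputStream in = fs.open(p);          // block data is read directly from datanodes
        System.out.println(in.readUTF());
        in.close();
    }
}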
In summary, the HDFS namenode has the following responsibilities:
16 Note that the namenode coordinates the replication process, but data transfer occurs directly from datanode to datanode.
Namespace management. The namenode is responsible for maintaining the file namespace,
which includes metadata, directory structure, file to block mapping, location of blocks, and
access permissions. These data are held in memory for fast access and all mutations are per-
sistently logged.
Coordinating file operations. The namenode directs application clients to datanodes for read
operations, and allocates blocks on suitable datanodes for write operations. All data trans-
fers occur directly between clients and datanodes. When a file is deleted, HDFS does not
immediately reclaim the available physical storage; rather, blocks are lazily garbage collected.
Maintaining overall health of the file system. The namenode is in periodic contact with the
datanodes via heartbeat messages to ensure the integrity of the system. If the namenode
observes that a data block is under-replicated (fewer copies are stored on datanodes than the
desired replication factor), it will direct the creation of new replicas. Finally, the namenode
is also responsible for rebalancing the file system.17 During the course of normal operations,
certain datanodes may end up holding more blocks than others; rebalancing involves moving
blocks from datanodes with more blocks to datanodes with fewer blocks. This leads to better
load balancing and more even disk utilization.
Since GFS and HDFS were specifically designed to support Google's proprietary and the open-
source implementation of MapReduce, respectively, they were designed with a number of assump-
tions about the operational environment, which in turn influenced the design of the systems. Un-
derstanding these choices is critical to designing effective MapReduce algorithms:
The file system stores a relatively modest number of large files. The definition of modest
varies by the size of the deployment, but in HDFS multi-gigabyte files are common (and
even encouraged). There are several reasons why lots of small files are to be avoided. Since the
namenode must hold all file metadata in memory, this presents an upper bound on both the
number of files and blocks that can be supported.18 Large multi-block files represent a more
efficient use of namenode memory than many single-block files (each of which consumes less
space than a single block size). In addition, mappers in a MapReduce job use individual files as
a basic unit for splitting input data. At present, there is no default mechanism in Hadoop that
allows a mapper to process multiple files. As a result, mapping over many small files will yield
as many map tasks as there are files. This results in two potential problems: first, the startup
costs of mappers may become significant compared to the time spent actually processing input
key-value pairs; second, this may result in an excessive number of across-the-network copy
operations during the shuffle and sort phase (recall that a MapReduce job with m mappers
and r reducers involves up to m × r distinct copy operations).
17 In Hadoop, this is a manually invoked process.
18 According to Dhruba Borthakur in a post to the Hadoop mailing list on 6/8/2008, each block in HDFS occupies about 150 bytes
of memory on the namenode.
Workloads are batch oriented, dominated by long streaming reads and large sequential writes.
As a result, high sustained bandwidth is more important than low latency. This exactly describes
the nature of MapReduce jobs, which are batch operations on large amounts of data. Due to the
common-case workload, neither HDFS nor GFS implements any form of data caching.19
Applications are aware of the characteristics of the distributed file system. Neither HDFS nor
GFS present a general POSIX-compliant API, but rather support only a subset of possible
file operations. This simplifies the design of the distributed file system, and in essence pushes
part of the data management onto the end application. One rationale for this decision is that
each application knows best how to handle data specific to that application, for example, in
terms of resolving inconsistent states and optimizing the layout of data structures.
The file system is deployed in an environment of cooperative users. There is no discussion of
security in the original GFS paper, but HDFS explicitly assumes a datacenter environment
where only authorized users have access. File permissions in HDFS are only meant to prevent
unintended operations and can be easily circumvented.20
The system is built from unreliable but inexpensive commodity components. As a result, failures
are the norm rather than the exception. HDFS is designed around a number of self-monitoring
and self-healing mechanisms to robustly cope with common failure modes.
Finally, some discussion is necessary to understand the single-master design of HDFS and GFS. It
has been demonstrated that in large-scale distributed systems, simultaneously providing consistency,
availability, and partition tolerance is impossible; this is Brewer's so-called CAP Theorem [58].
Since partitioning is unavoidable in large-data systems, the real tradeoff is between consistency and
availability. A single-master design trades availability for consistency and significantly simplifies
implementation. If the master (HDFS namenode or GFS master) goes down, the entire file system
becomes unavailable, which trivially guarantees that the file system will never be in an inconsistent
state. An alternative design might involve multiple masters that jointly manage the file namespace;
such an architecture would increase availability (if one goes down, another can step in) at the cost
of consistency, not to mention requiring a more complex implementation (cf. [4; 105]).
The single-master design of GFS and HDFS is a well-known weakness, since if the master
goes offline, the entire file system and all MapReduce jobs running on top of it will grind to a halt.
This weakness is mitigated in part by the lightweight nature of file system operations. Recall that no
data is ever moved through the namenode and that all communication between clients and the namenode
involves only metadata. Because of this, the namenode is rarely the bottleneck, and for the most part
avoids load-induced crashes. In practice, this single point of failure is not as severe a limitation as
it may appear: with diligent monitoring of the namenode, mean time between failures measured in
months is not uncommon for production deployments. Furthermore, the Hadoop community is
19 However, since the distributed file system is built on top of a standard operating system such as Linux, there is still OS-level
caching.
20 However, there are existing plans to integrate Kerberos into Hadoop/HDFS.
Figure 2.6: Architecture of a complete Hadoop cluster, which consists of three separate components:
the HDFS master (called the namenode), the job submission node (called the jobtracker), and many slave
nodes (three shown here). Each of the slave nodes runs a tasktracker for executing map and reduce tasks
and a datanode daemon for serving HDFS data.
well aware of this problem and has developed several reasonable workarounds, for example, a warm
standby namenode that can be quickly switched over when the primary namenode fails. The open
source environment and the fact that many organizations already depend on Hadoop for production
systems virtually guarantees that more effective solutions will be developed over time.
2.7 SUMMARY
This chapter provides a basic overview of the MapReduce programming model, starting with its roots
in functional programming and continuing with a description of mappers, reducers, partitioners, and
combiners. Significant attention is also given to the underlying distributed file system, which is a
tightly-integrated component of the MapReduce environment. Given this basic understanding, we
now turn our attention to the design of MapReduce algorithms.
CHAPTER 3
MapReduce Algorithm Design
Nevertheless, the programmer does have a number of techniques for controlling execution and
managing the flow of data in MapReduce. In summary, they are:
1. The ability to construct complex data structures as keys and values to store and communicate
partial results.
2. The ability to execute user-specified initialization code at the beginning of a map or reduce
task, and the ability to execute user-specified termination code at the end of a map or reduce
task.
3. The ability to preserve state in both mappers and reducers across multiple input or intermediate
keys.
4. The ability to control the sort order of intermediate keys, and therefore the order in which a
reducer will encounter particular keys.
5. The ability to control the partitioning of the key space, and therefore the set of keys that will
be encountered by a particular reducer.
It is important to realize that many algorithms cannot be easily expressed as a single MapReduce
job. One must often decompose complex algorithms into a sequence of jobs, which requires or-
chestrating data so that the output of one job becomes the input to the next. Many algorithms are
iterative in nature, requiring repeated execution until some convergence criterion is met: graph algorithms
in Chapter 5 and expectation-maximization algorithms in Chapter 6 behave in exactly this way.
Often, the convergence check itself cannot be easily expressed in MapReduce. The standard solu-
tion is an external (non-MapReduce) program that serves as a driver to coordinate MapReduce
iterations.
This chapter explains how various techniques to control code execution and data flow can
be applied to design algorithms in MapReduce. The focus is both on scalability (ensuring that
there are no inherent bottlenecks as algorithms are applied to increasingly larger datasets) and
efficiency (ensuring that algorithms do not needlessly consume resources, thereby reducing the
cost of parallelization). The gold standard, of course, is linear scalability: an algorithm running on
twice the amount of data should take only twice as long. Similarly, an algorithm running on twice
the number of nodes should only take half as long.
The chapter is organized as follows:
Section 3.1 introduces the important concept of local aggregation in MapReduce and strategies
for designing efficient algorithms that minimize the amount of partial results that need to be
copied across the network. The proper use of combiners is discussed in detail, as well as the
in-mapper combining design pattern.
Section 3.2 uses the example of building word co-occurrence matrices on large text corpora
to illustrate two common design patterns, which we dub "pairs" and "stripes". These two
approaches are useful in a large class of problems that require keeping track of joint events
across a large number of observations.
Section 3.3 shows how co-occurrence counts can be converted into relative frequencies using
a pattern known as "order inversion". The sequencing of computations in the reducer can be
recast as a sorting problem, where pieces of intermediate data are sorted into exactly the order
that is required to carry out a series of computations. Often, a reducer needs to compute an
aggregate statistic on a set of elements before individual elements can be processed. Normally,
this would require two passes over the data, but with the order inversion design pattern,
the aggregate statistic can be computed in the reducer before the individual elements are
encountered. This may seem counter-intuitive: how can we compute an aggregate statistic on
a set of elements before encountering elements of that set? As it turns out, clever sorting of
special key-value pairs enables exactly this.
Section 3.4 provides a general solution to secondary sorting, which is the problem of sort-
ing values associated with a key in the reduce phase. We call this technique "value-to-key
conversion".
Section 3.5 covers the topic of performing joins on relational datasets and presents three
different approaches: reduce-side, map-side, and memory-backed joins.
Figure 3.1: Pseudo-code for the basic word count algorithm in MapReduce (repeated from Figure 2.3).
Combiners aggregate term counts across the documents processed by each map task. This results in a reduction in the number of intermediate
key-value pairs that need to be shuffled across the networkfrom the order of total number of terms
in the collection to the order of the number of unique terms in the collection.1
An improvement on the basic algorithm is shown in Figure 3.2 (the mapper is modified but
the reducer remains the same as in Figure 3.1 and therefore is not repeated). An associative array
(i.e., Map in Java) is introduced inside the mapper to tally up term counts within a single document:
instead of emitting a key-value pair for each term in the document, this version emits a key-value pair
for each unique term in the document. Given that some words appear frequently within a document
(for example, a document about dogs is likely to have many occurrences of the word dog), this can
yield substantial savings in the number of intermediate key-value pairs emitted, especially for long
documents.
This basic idea can be taken one step further, as illustrated in the variant of the word count
algorithm in Figure 3.3 (once again, only the mapper is modified). The workings of this algorithm
critically depend on the details of how map and reduce tasks in Hadoop are executed, discussed
in Section 2.6. Recall that a (Java) mapper object is created for each map task, which is responsible
for processing a block of input key-value pairs. Prior to processing any input key-value pairs, the
mapper's Initialize method is called, which is an API hook for user-specified code. In this case,
we initialize an associative array for holding term counts. Since it is possible to preserve state across
multiple calls of the Map method (for each input key-value pair), we can continue to accumulate
partial term counts in the associative array across multiple documents, and emit key-value pairs only
1 More precisely, if the combiners take advantage of all opportunities for local aggregation, the algorithm would generate at most
m × V intermediate key-value pairs, where m is the number of mappers and V is the vocabulary size (number of unique terms in
the collection), since every term could have been observed in every mapper. However, there are two additional factors to consider.
Due to the Zipfian nature of term distributions, most terms will not be observed by most mappers (for example, terms that
occur only once will by definition only be observed by one mapper). On the other hand, combiners in Hadoop are treated as
optional optimizations, so there is no guarantee that the execution framework will take advantage of all opportunities for partial
aggregation.
1: class Mapper
2:   method Map(docid a, doc d)
3:     H ← new AssociativeArray
4:     for all term t ∈ doc d do
5:       H{t} ← H{t} + 1                          ▷ Tally counts for entire document
6:     for all term t ∈ H do
7:       Emit(term t, count H{t})
Figure 3.2: Pseudo-code for the improved MapReduce word count algorithm that uses an associative
array to aggregate term counts on a per-document basis. Reducer is the same as in Figure 3.1.
1: class Mapper
2:   method Initialize
3:     H ← new AssociativeArray
4:   method Map(docid a, doc d)
5:     for all term t ∈ doc d do
6:       H{t} ← H{t} + 1                          ▷ Tally counts across documents
7:   method Close
8:     for all term t ∈ H do
9:       Emit(term t, count H{t})
Figure 3.3: Pseudo-code for the improved MapReduce word count algorithm that demonstrates the
in-mapper combining design pattern. Reducer is the same as in Figure 3.1.
when the mapper has processed all documents. That is, emission of intermediate data is deferred
until the Close method in the pseudo-code. Recall that this API hook provides an opportunity to
execute user-specified code after the Map method has been applied to all input key-value pairs of
the input data split to which the map task was assigned.
With this technique, we are in essence incorporating combiner functionality directly inside
the mapper. There is no need to run a separate combiner, since all opportunities for local aggregation
are already exploited.2 This is a sufficiently common design pattern in MapReduce that it's worth
giving it a name, "in-mapper combining", so that we can refer to the pattern more conveniently
throughout the book. We'll see later on how this pattern can be applied to a variety of problems.
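In Hadoop's Java API, the Initialize and Close hooks of the pseudo-code correspond to the setup and cleanup methods of the Mapper class. A sketch of Figure 3.3 along these lines is shown below (the class name and whitespace tokenization are our simplifications):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// In-mapper combining: counts accumulate in an in-memory map across all documents
// in the input split and are emitted only once, in cleanup() (the analogue of Close).
public class InMapperCombiningWordCount extends Mapper<Object, Text, Text, IntWritable> {
    private Map<String, Integer> counts;

    @Override
    protected void setup(Context context) {   // analogue of Initialize in Figure 3.3
        counts = new HashMap<String, Integer>();
    }

    @Override
    protected void map(Object key, Text value, Context context) {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            String term = itr.nextToken();
            Integer c = counts.get(term);
            counts.put(term, c == null ? 1 : c + 1);
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
        }
    }
}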
There are two main advantages to using this design pattern:
First, it provides control over when local aggregation occurs and how it exactly takes place.
In contrast, the semantics of the combiner is underspecified in MapReduce. For example, Hadoop
makes no guarantees on how many times the combiner is applied, or that it is even applied at all. The
2 Leaving aside the minor complication that in Hadoop, combiners can be run in the reduce phase also (when merging intermediate
key-value pairs from different map tasks). However, in practice, it makes almost no difference either way.
combiner is provided as a semantics-preserving optimization to the execution framework, which has
the option of using it, perhaps multiple times, or not at all (or even in the reduce phase). In some
cases (although not in this particular example), such indeterminism is unacceptable, which is exactly
why programmers often choose to perform their own local aggregation in the mappers.
Second, in-mapper combining will typically be more efficient than using actual combiners.
One reason for this is the additional overhead associated with actually materializing the key-value
pairs. Combiners reduce the amount of intermediate data that is shuffled across the network, but
don't actually reduce the number of key-value pairs that are emitted by the mappers in the first place.
With the algorithm in Figure 3.2, intermediate key-value pairs are still generated on a per-document
basis, only to be compacted by the combiners. This process involves unnecessary object creation and
destruction (garbage collection takes time), and furthermore, object serialization and deserialization
(when intermediate key-value pairs fill the in-memory buffer holding map outputs and need to be
temporarily spilled to disk). In contrast, with in-mapper combining, the mappers will generate only
those key-value pairs that need to be shuffled across the network to the reducers.
There are, however, drawbacks to the in-mapper combining pattern. First, it breaks the func-
tional programming underpinnings of MapReduce, since state is being preserved across multiple
input key-value pairs. Ultimately, this isn't a big deal, since pragmatic concerns for efficiency of-
ten trump theoretical purity, but there are practical consequences as well. Preserving state across
multiple input instances means that algorithmic behavior may depend on the order in which input
key-value pairs are encountered. This creates the potential for ordering-dependent bugs, which are
difficult to debug on large datasets in the general case (although the correctness of in-mapper com-
bining for word count is easy to demonstrate). Second, there is a fundamental scalability bottleneck
associated with the in-mapper combining pattern. It critically depends on having sufficient mem-
ory to store intermediate results until the mapper has completely processed all key-value pairs in
an input split. In the word count example, the memory footprint is bound by the vocabulary size,
since it is theoretically possible that a mapper encounters every term in the collection. Heaps' Law,
a well-known result in information retrieval, accurately models the growth of vocabulary size as a
function of the collection size: the somewhat surprising fact is that the vocabulary size never stops
growing.3 Therefore, the algorithm in Figure 3.3 will scale only up to a point, beyond which the
associative array holding the partial term counts will no longer fit in memory.4
One common solution to limiting memory usage when using the in-mapper combining tech-
nique is to "block" input key-value pairs and "flush" in-memory data structures periodically. The idea
is simple: instead of emitting intermediate data only after every key-value pair has been processed,
emit partial results after processing every n key-value pairs. This is straightforwardly implemented
with a counter variable that keeps track of the number of input key-value pairs that have been
3 In more detail, Heaps' Law relates the vocabulary size V to the collection size as follows: V = kT^b, where T is the number of
tokens in the collection. Typical values of the parameters are 30 ≤ k ≤ 100 and b ≈ 0.5 ([101], p. 81).
4 A few more details: note that what matters is that the partial term counts encountered within a particular input split fit into memory.
However, as collection sizes increase, one will often want to increase the input split size to limit the growth of the number of map
tasks (in order to reduce the number of distinct copy operations necessary to shuffle intermediate data over the network).
processed. As an alternative, the mapper could keep track of its own memory footprint and flush
intermediate key-value pairs once memory usage has crossed a certain threshold. In both approaches,
either the block size or the memory usage threshold needs to be determined empirically: with too
large a value, the mapper may run out of memory, but with too small a value, opportunities for
local aggregation may be lost. Furthermore, in Hadoop physical memory is split between multiple
tasks that may be running on a node concurrently; these tasks are all competing for finite resources,
but since the tasks are not aware of each other, it is difficult to coordinate resource consumption
effectively. In practice, however, one often encounters diminishing returns in performance gains
with increasing buffer sizes, such that it is not worth the effort to search for an optimal buffer size
(personal communication, Jeff Dean).
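The blocking variant might be sketched as follows (the flush threshold, like the class name, is hypothetical and would be determined empirically, as discussed above):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Bounded-memory in-mapper combining: partial counts are flushed after every
// FLUSH_EVERY input key-value pairs, and once more at the end of the input split.
public class BlockedInMapperCombiner extends Mapper<Object, Text, Text, IntWritable> {
    private static final int FLUSH_EVERY = 100000;   // hypothetical value, tuned empirically
    private final Map<String, Integer> counts = new HashMap<String, Integer>();
    private int seen = 0;

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            String term = itr.nextToken();
            Integer c = counts.get(term);
            counts.put(term, c == null ? 1 : c + 1);
        }
        if (++seen % FLUSH_EVERY == 0) {
            flush(context);   // emit and clear to bound the memory footprint
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        flush(context);       // emit whatever remains at the end of the split
    }

    private void flush(Context context) throws IOException, InterruptedException {
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
        }
        counts.clear();
    }
}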
In MapReduce algorithms, the extent to which efficiency can be increased through local
aggregation depends on the size of the intermediate key space, the distribution of keys themselves,
and the number of key-value pairs that are emitted by each individual map task. Opportunities
for aggregation, after all, come from having multiple values associated with the same key (whether
one uses combiners or employs the in-mapper combining pattern). In the word count example, local
aggregation is effective because many words are encountered multiple times within a map task. Local
aggregation is also an effective technique for dealing with reduce stragglers (see Section 2.3) that
result from a highly skewed (e.g., Zipfian) distribution of values associated with intermediate keys.
In our word count example, we do not filter frequently occurring words: therefore, without local
aggregation, the reducer that's responsible for computing the count of "the" will have a lot more
work to do than the typical reducer, and therefore will likely be a straggler. With local aggregation
(either combiners or in-mapper combining), we substantially reduce the number of values associated
with frequently occurring terms, which alleviates the reduce straggler problem.
Figure 3.4: Pseudo-code for the basic MapReduce algorithm that computes the mean of values associated
with the same key.
For example, given a large log where keys are user ids and values are measures of user activity, we might
wish to compute mean activity on a per-user basis, which would be useful for understanding user demographics. Figure 3.4 shows the
pseudo-code of a simple algorithm for accomplishing this task that does not involve combiners. We
use an identity mapper, which simply passes all input key-value pairs to the reducers (appropriately
grouped and sorted). The reducer keeps track of the running sum and the number of integers
encountered. This information is used to compute the mean once all values are processed. The mean
is then emitted as the output value in the reducer (with the input string as the key).
This algorithm will indeed work, but suffers from the same drawbacks as the basic word count
algorithm in Figure 3.1: it requires shuffling all key-value pairs from mappers to reducers across the
network, which is highly inefficient. Unlike in the word count example, the reducer cannot be used
as a combiner in this case. Consider what would happen if we did: the combiner would compute the
mean of an arbitrary subset of values associated with the same key, and the reducer would compute
the mean of those values. As a concrete example, we know that:

Mean(1, 2, 3, 4, 5) ≠ Mean(Mean(1, 2), Mean(3, 4, 5))

In general, the mean of means of arbitrary subsets of a set of numbers is not the same as the mean
of the entire set, so this approach will not produce the correct result.
Figure 3.5: Pseudo-code for an incorrect first attempt at introducing combiners to compute the mean
of values associated with each key. The mismatch between combiner input and output key-value types
violates the MapReduce programming model.
In the attempt shown in Figure 3.5, the combiner computes the partial sum and the number of integers encountered (i.e., the count). The sum and count are packaged into a pair,
and emitted as the output of the combiner, with the same string as the key. In the reducer, pairs of
partial sums and counts can be aggregated to arrive at the mean. Up until now, all keys and values
in our algorithms have been primitives (string, integers, etc.). However, there are no prohibitions in
MapReduce for more complex types,6 and, in fact, this represents a key technique in MapReduce
algorithm design that we introduced at the beginning of this chapter. We will frequently encounter
complex keys and values throughout the rest of this book.
Unfortunately, this algorithm will not work. Recall that combiners must have the same input
and output key-value type, which also must be the same as the mapper output type and the reducer
input type. This is clearly not the case. To understand why this restriction is necessary in the pro-
gramming model, remember that combiners are optimizations that cannot change the correctness
of the algorithm. So let us remove the combiner and see what happens: the output value type of the
6 In Hadoop, either custom types or types defined using a library such as Protocol Buffers, Thrift, or Avro.
1: class Mapper
2:   method Map(string t, integer r)
3:     Emit(string t, pair (r, 1))

1: class Combiner
2:   method Combine(string t, pairs [(s1, c1), (s2, c2), . . .])
3:     sum ← 0
4:     cnt ← 0
5:     for all pair (s, c) ∈ pairs [(s1, c1), (s2, c2), . . .] do
6:       sum ← sum + s
7:       cnt ← cnt + c
8:     Emit(string t, pair (sum, cnt))

1: class Reducer
2:   method Reduce(string t, pairs [(s1, c1), (s2, c2), . . .])
3:     sum ← 0
4:     cnt ← 0
5:     for all pair (s, c) ∈ pairs [(s1, c1), (s2, c2), . . .] do
6:       sum ← sum + s
7:       cnt ← cnt + c
8:     r_avg ← sum/cnt
9:     Emit(string t, integer r_avg)
Figure 3.6: Pseudo-code for a MapReduce algorithm that computes the mean of values associated with
each key. This algorithm correctly takes advantage of combiners.
mapper is integer, so the reducer expects to receive a list of integers as values. But the reducer actually
expects a list of pairs! The correctness of the algorithm is contingent on the combiner running on
the output of the mappers, and more specifically, on the combiner being run exactly once. Recall from
our previous discussion that Hadoop makes no guarantees on how many times combiners are called;
it could be zero, one, or multiple times. This violates the MapReduce programming model.
Another stab at the algorithm is shown in Figure 3.6, and this time, the algorithm is correct.
In the mapper we emit as the value a pair consisting of the integer and one; this corresponds to a
partial count over one instance. The combiner separately aggregates the partial sums and the partial
counts (as before), and emits pairs with updated sums and counts. The reducer is similar to the
combiner, except that the mean is computed at the end. In essence, this algorithm transforms a
non-associative operation (mean of numbers) into an associative operation (element-wise sum of a
pair of numbers, with an additional division at the very end).
Let us verify the correctness of this algorithm by repeating the previous exercise: What would
happen if no combiners were run? With no combiners, the mappers would send pairs (as values)
1: class Mapper
2:   method Initialize
3:     S ← new AssociativeArray
4:     C ← new AssociativeArray
5:   method Map(string t, integer r)
6:     S{t} ← S{t} + r
7:     C{t} ← C{t} + 1
8:   method Close
9:     for all term t ∈ S do
10:      Emit(term t, pair (S{t}, C{t}))
Figure 3.7: Pseudo-code for a MapReduce algorithm that computes the mean of values associated with
each key, illustrating the in-mapper combining design pattern. Only the mapper is shown here; the reducer
is the same as in Figure 3.6.
directly to the reducers. There would be as many intermediate pairs as there were input key-value
pairs, and each of those would consist of an integer and one. The reducer would still arrive at
the correct sum and count, and hence the mean would be correct. Now add in the combiners: the
algorithm would remain correct, no matter how many times they run, since the combiners merely
aggregate partial sums and counts to pass along to the reducers. Note that although the output
key-value type of the combiner must be the same as the input key-value type of the reducer, the
reducer can emit final key-value pairs of a different type.
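For concreteness, here is one way the algorithm of Figure 3.6 might be sketched in Hadoop. To avoid defining a custom Writable, the (sum, count) pair is encoded as a Text string of the form "sum,count"; that encoding, the class names, and the assumption of an input format delivering (string, integer) records are all ours, not part of the original algorithm.

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits a partial (sum, count) pair over a single value.
class MeanMapper extends Mapper<Text, IntWritable, Text, Text> {
    @Override
    protected void map(Text key, IntWritable value, Context context)
            throws IOException, InterruptedException {
        context.write(key, new Text(value.get() + ",1"));
    }
}

// Combiner: aggregates partial sums and counts; input and output types are identical.
class MeanCombiner extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> pairs, Context context)
            throws IOException, InterruptedException {
        long sum = 0, cnt = 0;
        for (Text p : pairs) {
            String[] parts = p.toString().split(",");
            sum += Long.parseLong(parts[0]);
            cnt += Long.parseLong(parts[1]);
        }
        context.write(key, new Text(sum + "," + cnt));
    }
}

// Reducer: same aggregation as the combiner, with the division performed at the very end.
class MeanReducer extends Reducer<Text, Text, Text, DoubleWritable> {
    @Override
    protected void reduce(Text key, Iterable<Text> pairs, Context context)
            throws IOException, InterruptedException {
        long sum = 0, cnt = 0;
        for (Text p : pairs) {
            String[] parts = p.toString().split(",");
            sum += Long.parseLong(parts[0]);
            cnt += Long.parseLong(parts[1]);
        }
        context.write(key, new DoubleWritable((double) sum / cnt));
    }
}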
Finally, in Figure 3.7, we present an even more efficient algorithm that exploits the in-mapper
combining pattern. Inside the mapper, the partial sums and counts associated with each string are
held in memory across input key-value pairs. Intermediate key-value pairs are emitted only after
the entire input split has been processed; similar to before, the value is a pair consisting of the sum
and count. The reducer is exactly the same as in Figure 3.6. Moving partial aggregation from the
combiner directly into the mapper is subject to all the tradeoffs and caveats discussed earlier in this
section, but in this case the memory footprint of the data structures for holding intermediate data
is likely to be modest, making this variant algorithm an attractive option.
7 The size of the vocabulary depends on the definition of a word and techniques (if any) for corpus pre-processing. One common
strategy is to replace all rare words (below a certain frequency) with a special token such as <UNK> (which stands for unknown)
to model out-of-vocabulary words. Another technique involves replacing numeric digits with #, such that 1.32 and 1.19 both
map to the same token (#.##).
Pseudo-code for the first algorithm, dubbed the "pairs" approach, is shown in Figure 3.8. As
usual, document ids and the corresponding contents make up the input key-value pairs. The mapper
processes each input document and emits intermediate key-value pairs with each co-occurring word
pair as the key and the integer one (i.e., the count) as the value. This is straightforwardly accomplished
by two nested loops: the outer loop iterates over all words (the left element in the pair), and the
inner loop iterates over all neighbors of the first word (the right element in the pair). The neighbors
of a word can either be defined in terms of a sliding window or some other contextual unit such as a
sentence. The MapReduce execution framework guarantees that all values associated with the same
key are brought together in the reducer. Thus, in this case the reducer simply sums up all the values
associated with the same co-occurring word pair to arrive at the absolute count of the joint event in
the corpus, which is then emitted as the final key-value pair. Each pair corresponds to a cell in the
word co-occurrence matrix. This algorithm illustrates the use of complex keys in order to coordinate
distributed computations.
An alternative approach, dubbed the "stripes" approach, is presented in Figure 3.9. Like the
pairs approach, co-occurring word pairs are generated by two nested loops. However, the major
difference is that instead of emitting intermediate key-value pairs for each co-occurring word pair,
co-occurrence information is first stored in an associative array, denoted H . The mapper emits key-
value pairs with words as keys and corresponding associative arrays as values, where each associative
array encodes the co-occurrence counts of the neighbors of a particular word (i.e., its context). The
MapReduce execution framework guarantees that all associative arrays with the same key will be
brought together in the reduce phase of processing. The reducer performs an element-wise sum of
all associative arrays with the same key, accumulating counts that correspond to the same cell in
the co-occurrence matrix. The final associative array is emitted with the same word as the key. In
contrast to the pairs approach, each final key-value pair encodes a row in the co-occurrence matrix.
It is immediately obvious that the pairs algorithm generates an immense number of key-
value pairs compared to the stripes approach. The stripes representation is much more compact,
since with pairs the left element is repeated for every co-occurring word pair. The stripes approach
also generates fewer and shorter intermediate keys, and therefore the execution framework has less
sorting to perform. However, values in the stripes approach are more complex, and come with more
serialization and deserialization overhead than with the pairs approach.
Both algorithms can benefit from the use of combiners, since the respective operations in
their reducers (addition and element-wise sum of associative arrays) are both commutative and
associative. However, combiners with the stripes approach have more opportunities to perform local
aggregation because the key space is the vocabulary: associative arrays can be merged whenever a
word is encountered multiple times by a mapper. In contrast, the key space in the pairs approach
is the cross of the vocabulary with itself, which is far larger: counts can be aggregated only when
the same co-occurring word pair is observed multiple times by an individual mapper (which is less
likely than observing multiple occurrences of a word, as in the stripes case).
1: class Mapper
2:   method Map(docid a, doc d)
3:     for all term w ∈ doc d do
4:       for all term u ∈ Neighbors(w) do
5:         Emit(pair (w, u), count 1)            ▷ Emit count for each co-occurrence

1: class Reducer
2:   method Reduce(pair p, counts [c1, c2, . . .])
3:     s ← 0
4:     for all count c ∈ counts [c1, c2, . . .] do
5:       s ← s + c                               ▷ Sum co-occurrence counts
6:     Emit(pair p, count s)
Figure 3.8: Pseudo-code for the pairs approach for computing word co-occurrence matrices from large
corpora.
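A Hadoop sketch of the pairs mapper is shown below. To keep the sketch short, the co-occurring pair (w, u) is encoded as a single tab-separated Text key rather than a custom WritableComparable pair type, and Neighbors(w) is approximated by all other tokens in the same input line; both are our simplifications. The reducer is the same summing reducer as in word count.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Pairs mapper: emits ((w, u), 1) for every co-occurring pair, with the pair
// encoded as the Text key "w<TAB>u".
public class PairsMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text pair = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] terms = value.toString().split("\\s+");
        for (String w : terms) {
            for (String u : terms) {
                if (w.equals(u)) continue;   // a term is not its own neighbor
                pair.set(w + "\t" + u);
                context.write(pair, ONE);
            }
        }
    }
}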
1: class Mapper
2:   method Map(docid a, doc d)
3:     for all term w ∈ doc d do
4:       H ← new AssociativeArray
5:       for all term u ∈ Neighbors(w) do
6:         H{u} ← H{u} + 1                       ▷ Tally words co-occurring with w
7:       Emit(Term w, Stripe H)

1: class Reducer
2:   method Reduce(term w, stripes [H1, H2, H3, . . .])
3:     Hf ← new AssociativeArray
4:     for all stripe H ∈ stripes [H1, H2, H3, . . .] do
5:       Sum(Hf, H)                              ▷ Element-wise sum
6:     Emit(term w, stripe Hf)
Figure 3.9: Pseudo-code for the stripes approach for computing word co-occurrence matrices from
large corpora.
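A corresponding sketch of the stripes approach in Hadoop can use the library class MapWritable as the associative array; as with the pairs sketch above, the class names and the use of other tokens on the same line as Neighbors(w) are our simplifications.

import java.io.IOException;
import java.util.Map;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Stripes mapper: for each term w, emit an associative array of neighbor counts.
class StripesMapper extends Mapper<Object, Text, Text, MapWritable> {
    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] terms = value.toString().split("\\s+");
        for (String w : terms) {
            MapWritable stripe = new MapWritable();
            for (String u : terms) {
                if (u.equals(w)) continue;
                Text neighbor = new Text(u);
                IntWritable c = (IntWritable) stripe.get(neighbor);
                stripe.put(neighbor, new IntWritable(c == null ? 1 : c.get() + 1));
            }
            context.write(new Text(w), stripe);
        }
    }
}

// Stripes reducer: element-wise sum of all stripes associated with the same term.
class StripesReducer extends Reducer<Text, MapWritable, Text, MapWritable> {
    @Override
    protected void reduce(Text key, Iterable<MapWritable> stripes, Context context)
            throws IOException, InterruptedException {
        MapWritable sum = new MapWritable();
        for (MapWritable stripe : stripes) {
            for (Map.Entry<Writable, Writable> e : stripe.entrySet()) {
                IntWritable current = (IntWritable) sum.get(e.getKey());
                int add = ((IntWritable) e.getValue()).get();
                sum.put(e.getKey(), new IntWritable(current == null ? add : current.get() + add));
            }
        }
        context.write(key, sum);
    }
}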
For both algorithms, the in-mapper combining optimization discussed in the previous section
can also be applied; the modification is sufficiently straightforward that we leave the implementation
as an exercise for the reader. However, the above caveats remain: there will be far fewer opportunities
for partial aggregation in the pairs approach due to the sparsity of the intermediate key space. The
sparsity of the key space also limits the effectiveness of in-memory combining, since the mapper
may run out of memory to store partial counts before all documents are processed, necessitating
some mechanism to periodically emit key-value pairs (which further limits opportunities to perform
partial aggregation). Similarly, for the stripes approach, memory management will also be more
complex than in the simple word count example. For common terms, the associative array may grow
to be quite large, necessitating some mechanism to periodically flush in-memory structures.
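A sketch of how such periodic flushing might look for the stripes mapper with in-mapper combining is shown below. The class structure, the emit callback, and the MAX_STRIPES threshold are illustrative assumptions; a Hadoop implementation would hook the same logic into the mapper's setup, map, and cleanup methods.

from collections import defaultdict

class StripesMapperWithCombining:
    MAX_STRIPES = 100_000          # arbitrary memory bound, tuned in practice

    def __init__(self, emit):
        self.emit = emit           # stands in for the framework's output call
        self.buffer = defaultdict(lambda: defaultdict(int))

    def map(self, docid, terms, window=2):
        for i, w in enumerate(terms):
            stripe = self.buffer[w]
            lo, hi = max(0, i - window), min(len(terms), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    stripe[terms[j]] += 1      # aggregate across input documents
        if len(self.buffer) > self.MAX_STRIPES:
            self.flush()                       # avoid exhausting mapper memory

    def flush(self):
        for w, stripe in self.buffer.items():
            self.emit(w, dict(stripe))
        self.buffer.clear()

    def close(self):
        self.flush()                           # emit whatever is still buffered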
It is important to consider potential scalability bottlenecks of either algorithm. The stripes
approach makes the assumption that, at any point in time, each associative array is small enough to
fit into memory; otherwise, memory paging will significantly impact performance. The size of the
associative array is bounded by the vocabulary size, which is itself unbounded with respect to corpus
size (recall the previous discussion of Heaps' Law). Therefore, as the sizes of corpora increase, this
will become an increasingly pressing issue: perhaps not for gigabyte-sized corpora, but certainly for
terabyte-sized and petabyte-sized corpora that will be commonplace tomorrow. The pairs approach,
on the other hand, does not suffer from this limitation, since it does not need to hold intermediate
data in memory.
Given this discussion, which approach is faster? Here, we present previously published re-
sults [94] that empirically answered this question. We implemented both algorithms in Hadoop and
applied them to a corpus of 2.27 million documents from the Associated Press Worldstream (APW)
totaling 5.7 GB.8 Prior to working with Hadoop, the corpus was first preprocessed as follows: All
XML markup was removed, followed by tokenization and stopword removal using standard tools
from the Lucene search engine. All tokens were then replaced with unique integers for a more effi-
cient encoding. Figure 3.10 compares the running time of the pairs and stripes approach on different
fractions of the corpus, with a co-occurrence window size of two. These experiments were performed
on a Hadoop cluster with 19 slave nodes, each with 2 single-core processors and 2 disks.
Results demonstrate that the stripes approach is much faster than the pairs approach: 666
seconds (11 minutes) compared to 3758 seconds (62 minutes) for the entire corpus (improvement
by a factor of 5.7). The mappers in the pairs approach generated 2.6 billion intermediate key-value
pairs totaling 31.2 GB. After the combiners, this was reduced to 1.1 billion key-value pairs, which
quantifies the amount of intermediate data transferred across the network. In the end, the reducers
emitted a total of 142 million final key-value pairs (the number of non-zero cells in the co-occurrence
matrix). On the other hand, the mappers in the stripes approach generated 653 million intermediate
key-value pairs totaling 48.1 GB. After the combiners, only 28.8 million key-value pairs remained.
The reducers emitted a total of 1.69 million final key-value pairs (the number of rows in the co-
8This was a subset of the English Gigaword corpus (version 3) distributed by the Linguistic Data Consortium (LDC catalog
number LDC2007T07).
occurrence matrix). As expected, the stripes approach provided more opportunities for combiners to
aggregate intermediate results, thus greatly reducing network traffic in the shuffle and sort phase.
Figure 3.10 also shows that both algorithms exhibit highly desirable scaling characteristics: linear
in the amount of input data. This is confirmed by a linear regression applied to the running time
data, which yields an R² value close to one.
An additional series of experiments explored the scalability of the stripes approach along
another dimension: the size of the cluster. These experiments were made possible by Amazon's EC2
service, which allows users to rapidly provision clusters of varying sizes for limited durations (for
more information, refer back to our discussion of utility computing in Section 1.1). Virtualized
computational units in EC2 are called instances, and the user is charged only for the instance-hours
consumed. Figure 3.11 (left) shows the running time of the stripes algorithm (on the same corpus,
with the same setup as before), on varying cluster sizes, from 20 slave small instances all the way up to
80 slave small instances (along the x-axis). Running times are shown with solid squares. Figure 3.11
(right) recasts the same results to illustrate scaling characteristics. The circles plot the relative size
and speedup of the EC2 experiments, with respect to the 20-instance cluster. These results show
highly desirable linear scaling characteristics (i.e., doubling the cluster size makes the job twice as
fast). This is confirmed by a linear regression with an R² value close to one.
Viewed abstractly, the pairs and stripes algorithms represent two different approaches to
counting co-occurring events from a large number of observations. This general description captures
the gist of many algorithms in fields as diverse as text processing, data mining, and bioinformatics.
For this reason, these two design patterns are broadly useful and frequently observed in a variety of
applications.
To conclude, it is worth noting that the pairs and stripes approaches represent endpoints
along a continuum of possibilities. The pairs approach individually records each co-occurring event,
while the stripes approach records all co-occurring events with respect to a conditioning event. A
middle ground might be to record a subset of the co-occurring events with respect to a conditioning
event. We might divide up the entire vocabulary into b buckets (e.g., via hashing), so that words
co-occurring with wi would be divided into b smaller sub-stripes, associated with b separate keys:
(wi, 1), (wi, 2), . . . , (wi, b). This would be a reasonable solution to the memory limitations of the
stripes approach, since each of the sub-stripes would be smaller. In the case of b = |V |, where |V | is
the vocabulary size, this is equivalent to the pairs approach. In the case of b = 1, this is equivalent
to the standard stripes approach.
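A sketch of this bucketed middle ground is shown below, under the assumption that buckets are assigned by hashing the co-occurring word; the choice of b, the hash function, and the window size are arbitrary here.

from collections import defaultdict

def substripes_map(docid, terms, b=16, window=2):
    # For each word w, split its co-occurring neighbors into b sub-stripes,
    # each keyed by (w, bucket), so no single associative array grows too large.
    for i, w in enumerate(terms):
        buckets = defaultdict(lambda: defaultdict(int))
        lo, hi = max(0, i - window), min(len(terms), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                u = terms[j]
                buckets[hash(u) % b][u] += 1
        for k, sub in buckets.items():
            yield (w, k), dict(sub)          # b = 1 recovers stripes, b = |V| recovers pairs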
[Plot: running time in seconds versus the percentage of the APW corpus processed, for the "stripes" and "pairs" approaches; linear fits to the two running-time curves yield R² = 0.999 and R² = 0.992.]
Figure 3.10: Running time of the pairs and stripes algorithms for computing word co-occurrence
matrices on different fractions of the APW corpus. These experiments were performed on a Hadoop
cluster with 19 slaves, each with 2 single-core processors and 2 disks.
[Plot: left, running time in seconds of the stripes algorithm versus EC2 cluster size (20 to 80 slave instances), with a linear fit yielding R² = 0.997; right, relative speedup (1x to 4x) versus relative cluster size.]
Figure 3.11: Running time of the stripes algorithm on the APW corpus with Hadoop clusters of different
sizes from EC2 (left). Scaling characteristics (relative speedup) in terms of increasing Hadoop cluster
size (right).
3.3 COMPUTING RELATIVE FREQUENCIES
wi may co-occur frequently with wj simply because one of the words is very common. A simple
remedy is to convert absolute counts into relative frequencies, f (wj |wi ). That is, what proportion of
the time does wj appear in the context of wi ? This can be computed using the following equation:
f(wj | wi) = N(wi, wj) / Σw' N(wi, w')     (3.1)
Here, N(·, ·) indicates the number of times a particular co-occurring word pair is observed in the
corpus. We need the count of the joint event (word co-occurrence), divided by what is known as the
marginal (the sum of the counts of the conditioning variable co-occurring with anything else).
Computing relative frequencies with the stripes approach is straightforward. In the reducer,
counts of all words that co-occur with the conditioning variable (wi in the above example) are
available in the associative array. Therefore, it suffices to sum all those counts to arrive at the marginal
(i.e., Σw' N(wi, w')), and then divide all the joint counts by the marginal to arrive at the relative
frequency for all words. This implementation requires minimal modification to the original stripes
algorithm in Figure 3.9, and illustrates the use of complex data structures to coordinate distributed
computations in MapReduce. Through appropriate structuring of keys and values, one can use the
MapReduce execution framework to bring together all the pieces of data required to perform a
computation. Note that, as before, this algorithm also assumes that each associative array fits into
memory.
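The modification amounts to one extra pass over the completed stripe, sketched below; the reducer interface is simplified to a plain Python generator for illustration.

from collections import defaultdict

def stripes_relfreq_reduce(word, stripes):
    final = defaultdict(int)
    for stripe in stripes:
        for u, c in stripe.items():
            final[u] += c                      # element-wise sum, as before
    marginal = sum(final.values())             # sum over all co-occurring words
    yield word, {u: c / marginal for u, c in final.items()}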
How might one compute relative frequencies with the pairs approach? In the pairs approach,
the reducer receives (wi , wj ) as the key and the count as the value. From this alone it is not possible
to compute f (wj |wi ) since we do not have the marginal. Fortunately, as in the mapper, the reducer
can preserve state across multiple keys. Inside the reducer, we can buffer in memory all the words that
co-occur with wi and their counts, in essence building the associative array in the stripes approach.
To make this work, we must define the sort order of the pair so that keys are first sorted by the left
word, and then by the right word. Given this ordering, we can easily detect if all pairs associated
with the word we are conditioning on (wi ) have been encountered. At that point we can go back
through the in-memory buffer, compute the relative frequencies, and then emit those results in the
final key-value pairs.
There is one more modification necessary to make this algorithm work. We must ensure that
all pairs with the same left word are sent to the same reducer. This, unfortunately, does not happen
automatically: recall that the default partitioner is based on the hash value of the intermediate key,
modulo the number of reducers. For a complex key, the raw byte representation is used to compute
the hash value. As a result, there is no guarantee that, for example, (dog, aardvark) and (dog, zebra) are
assigned to the same reducer. To produce the desired behavior, we must define a custom partitioner
that only pays attention to the left word. That is, the partitioner should partition based on the hash
of the left word only.
This algorithm will indeed work, but it suffers from the same drawback as the stripes approach:
as the size of the corpus grows, so does the vocabulary size, and at some point there will not be
sufficient memory to store all co-occurring words and their counts for the word we are conditioning on.
key                values
(dog, ∗)           [6327, 8514, . . .]    compute marginal: Σw' N(dog, w') = 42908
(dog, aardvark)    [2, 1]                 f(aardvark|dog) = 3/42908
(dog, aardwolf)    [1]                    f(aardwolf|dog) = 1/42908
. . .
(dog, zebra)       [2, 1, 1, 1]           f(zebra|dog) = 5/42908
(doge, ∗)          [682, . . .]           compute marginal: Σw' N(doge, w') = 1267
Figure 3.12: Example of the sequence of key-value pairs presented to the reducer in the pairs algorithm
for computing relative frequencies. This illustrates the application of the order inversion design pattern.
For computing the co-occurrence matrix, the advantage of the pairs approach is that it doesn't
suffer from any memory bottlenecks. Is there a way to modify the basic pairs approach so that this
advantage is retained?
As it turns out, such an algorithm is indeed possible, although it requires the coordination of
several mechanisms in MapReduce. The insight lies in properly sequencing data presented to the
reducer. If it were possible to somehow compute (or otherwise obtain access to) the marginal in the
reducer before processing the joint counts, the reducer could simply divide the joint counts by the
marginal to compute the relative frequencies. The notion of "before" and "after" can be captured in
the ordering of key-value pairs, which can be explicitly controlled by the programmer. That is, the
programmer can define the sort order of keys so that data needed earlier is presented to the reducer
before data that is needed later. However, we still need to compute the marginal counts. Recall that
in the basic pairs algorithm, each mapper emits a key-value pair with the co-occurring word pair
as the key. To compute relative frequencies, we modify the mapper so that it additionally emits a
special key of the form (wi, ∗), with a value of one, that represents the contribution of the word
pair to the marginal. Through use of combiners, these partial marginal counts will be aggregated
before being sent to the reducers. Alternatively, the in-mapper combining pattern can be used to
even more efficiently aggregate marginal counts.
In the reducer, we must make sure that the special key-value pairs representing the partial
marginal contributions are processed before the normal key-value pairs representing the joint counts.
This is accomplished by defining the sort order of the keys so that pairs with the special symbol of
the form (wi, ∗) are ordered before any other key-value pairs where the left word is wi. In addition,
as before we must also properly define the partitioner to pay attention to only the left word in each
pair. With the data properly sequenced, the reducer can directly compute the relative frequencies.
A concrete example is shown in Figure 3.12, which lists the sequence of key-value pairs that a
reducer might encounter. First, the reducer is presented with the special key (dog, ∗) and a number
of values, each of which represents a partial marginal contribution from the map phase (assume here
either combiners or in-mapper combining, so the values represent partially aggregated counts). The
reducer accumulates these counts to arrive at the marginal, Σw' N(dog, w'). The reducer holds on
to this value as it processes subsequent keys. After (dog, ∗), the reducer will encounter a series of
keys representing joint counts; let's say the first of these is the key (dog, aardvark). Associated with
this key will be a list of values representing partial joint counts from the map phase (two separate
values in this case). Summing these counts will yield the final joint count, i.e., the number of times
dog and aardvark co-occur in the entire collection. At this point, since the reducer already knows the
marginal, simple arithmetic suffices to compute the relative frequency. All subsequent joint counts
are processed in exactly the same manner. When the reducer encounters the next special key-value
pair (doge, ∗), the reducer resets its internal state and starts to accumulate the marginal all over
again. Observe that the memory requirement for this algorithm is minimal, since only the marginal
(an integer) needs to be stored. No buffering of individual co-occurring word counts is necessary,
and therefore we have eliminated the scalability bottleneck of the previous algorithm.
This design pattern, which we call order inversion, occurs surprisingly often and across
applications in many domains. It is so named because through proper coordination, we can access
the result of a computation in the reducer (for example, an aggregate statistic) before processing the
data needed for that computation. The key insight is to convert the sequencing of computations
into a sorting problem. In most cases, an algorithm requires data in some fixed order: by controlling
how keys are sorted and how the key space is partitioned, we can present data to the reducer in
the order necessary to perform the proper computations. This greatly cuts down on the amount of
partial results that the reducer needs to hold in memory.
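The reducer-side logic, the custom sort order, and the custom partitioner can be sketched as follows. The ∗ marker, the helper names, and the emit callback are assumptions for illustration; in Hadoop the sort order and partitioning would be supplied as a key comparator and a Partitioner rather than as plain functions.

MARGINAL = "*"     # special right element, assumed to collide with no real word

def pair_sort_key(pair):
    # Sort by the left word first; within a left word, (w, "*") comes first.
    left, right = pair
    return (left, right != MARGINAL, right)

def pair_partition(pair, num_reducers):
    # Partition on the left word only, so a word's marginal key and all of its
    # joint-count keys are shuffled to the same reducer.
    return hash(pair[0]) % num_reducers

class RelativeFrequencyReducer:
    def __init__(self, emit):
        self.emit = emit
        self.current = None
        self.marginal = 0

    def reduce(self, pair, counts):
        left, right = pair
        total = sum(counts)
        if right == MARGINAL:
            self.current, self.marginal = left, total   # new word: reset state
        else:
            assert left == self.current                 # guaranteed by the sort order
            self.emit(pair, total / self.marginal)      # f(right | left)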
To summarize, the specific application of the order inversion design pattern for computing
relative frequencies requires the following:
Emitting a special key-value pair for each co-occurring word pair in the mapper to capture its
contribution to the marginal.
Controlling the sort order of the intermediate key so that the key-value pairs representing the
marginal contributions are processed by the reducer before any of the pairs representing the
joint word co-occurrence counts.
Defining a custom partitioner to ensure that all pairs with the same left word are shuffled to
the same reducer.
Preserving state across multiple keys in the reducer to first compute the marginal based on
the special key-value pairs and then divide the joint counts by the marginal to arrive at the
relative frequencies.
As we will see in Chapter 4, this design pattern is also used in inverted index construction to properly
set compression parameters for postings lists.
3.4 SECONDARY SORTING
The sensor reading itself now occupies the value. We must define the intermediate key sort order to
first sort by the sensor id (the left element in the pair) and then by the timestamp (the right element
in the pair). We must also implement a custom partitioner so that all pairs associated with the same
sensor are shuffled to the same reducer.
Properly orchestrated, the key-value pairs will be presented to the reducer in the correct sorted
order:
(m1, t1) → [(r80521)]
(m1, t2) → [(r21823)]
(m1, t3) → [(r146925)]
...
However, note that sensor readings are now split across multiple keys. The reducer will need to
preserve state and keep track of when readings associated with the current sensor end and the next
sensor begin.9
The basic tradeoff between the two approaches discussed above (buffer and in-memory sort
vs. value-to-key conversion) is where sorting is performed. One can explicitly implement secondary
sorting in the reducer, which is likely to be faster but suffers from a scalability bottleneck.10 With
value-to-key conversion, sorting is offloaded to the MapReduce execution framework. Note that
this approach can be arbitrarily extended to tertiary, quaternary, etc. sorting. This pattern results in
many more keys for the framework to sort, but distributed sorting is a task that the MapReduce
runtime excels at since it lies at the heart of the programming model.
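A compact simulation of value-to-key conversion for the sensor example is given below; the number of reducers, the hash-based partitioning, and the in-memory "shuffle" are stand-ins for what the execution framework actually does.

from itertools import groupby

def secondary_sort(pairs, num_reducers=4):
    # pairs: ((sensor_id, timestamp), reading). Partition by sensor_id only,
    # sort by the full composite key, then group readings per sensor.
    shards = [[] for _ in range(num_reducers)]
    for (sensor, ts), reading in pairs:
        shards[hash(sensor) % num_reducers].append(((sensor, ts), reading))
    for shard in shards:
        shard.sort(key=lambda kv: kv[0])           # the framework's sort phase
        for sensor, group in groupby(shard, key=lambda kv: kv[0][0]):
            yield sensor, [(ts, r) for (_, ts), r in group]

readings = [(("m1", 3), "r146925"), (("m2", 1), "r12345"),
            (("m1", 1), "r80521"), (("m1", 2), "r21823")]
for sensor, series in secondary_sort(readings):
    print(sensor, series)      # each sensor's readings arrive in timestamp order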
3.5 RELATIONAL JOINS
Suppose relation S looks something like this:
(k1, s1, S1)
(k2, s2, S2)
(k3, s3, S3)
...
where k is the key we would like to join on, sn is a unique id for the tuple, and the Sn after sn denotes
other attributes in the tuple (unimportant for the purposes of the join). Similarly, suppose relation
T looks something like this:
(k1 , t1 , T1 )
(k3 , t2 , T2 )
11 http://databasecolumn.vertica.com/database-innovation/mapreduce-a-major-step-backwards/
(k8 , t3 , T3 )
...
where k is the join key, tn is a unique id for the tuple, and the Tn after tn denotes other attributes in
the tuple.
To make this task more concrete, we present one realistic scenario: S might represent a col-
lection of user profiles, in which case k could be interpreted as the primary key (i.e., user id). The
tuples might contain demographic information such as age, gender, income, etc. The other dataset,
T , might represent logs of online activity. Each tuple might correspond to a page view of a particular
URL and may contain additional information such as time spent on the page, ad revenue generated,
etc. The k in these tuples could be interpreted as the foreign key that associates each individual page
view with a user. Joining these two datasets would allow an analyst, for example, to break down
online activity in terms of demographics.
Consider first the simplest case of a one-to-one join, where the mapper emits the join key as the
intermediate key and the tuple itself as the value. Since we've emitted the join key as the intermediate
key, we can remove it from the value to save a bit of space.12 If there are two values associated with a key, then we know that one must be from S
and the other must be from T . However, recall that in the basic MapReduce programming model,
no guarantees are made about value ordering, so the first value might be from S or from T . We
can proceed to join the two tuples and perform additional computations (e.g., filter by some other
attribute, compute aggregates, etc.). If there is only one value associated with a key, this means that
no tuple in the other dataset shares the join key, so the reducer does nothing.
12 Not very important if the intermediate data is compressed.
Let us now consider the one-to-many join. Assume that tuples in S have unique join keys (i.e.,
k is the primary key in S), so that S is the "one" and T is the "many". The above algorithm will still
work, but when processing each key in the reducer, we have no idea when the value corresponding
to the tuple from S will be encountered, since values are arbitrarily ordered. The easiest solution is to
buffer all values in memory, pick out the tuple from S, and then cross it with every tuple from T to
perform the join. However, as we have seen several times already, this creates a scalability bottleneck
since we may not have sufficient memory to hold all the tuples with the same join key.
This is a problem that requires a secondary sort, and the solution lies in the value-to-key
conversion design pattern we just presented. In the mapper, instead of simply emitting the join key
as the intermediate key, we instead create a composite key consisting of the join key and the tuple
id (from either S or T ). Two additional changes are required: First, we must define the sort order of
the keys to first sort by the join key, and then sort all tuple ids from S before all tuple ids from T .
Second, we must define the partitioner to pay attention to only the join key, so that all composite
keys with the same join key arrive at the same reducer.
After applying the value-to-key conversion design pattern, the reducer will be presented with
keys and values along the lines of the following:
Since both the join key and the tuple id are present in the intermediate key, we can remove them
from the value to save a bit of space.13 Whenever the reducer encounters a new join key, it is
guaranteed that the associated value will be the relevant tuple from S. The reducer can hold this
tuple in memory and then proceed to cross it with tuples from T in subsequent steps (until a new
join key is encountered). Since the MapReduce execution framework performs the sorting, there is
no need to buffer tuples (other than the single one from S). Thus, we have eliminated the scalability
bottleneck.
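The following sketch simulates the one-to-many reduce-side join with value-to-key conversion; the relation tags "S" and "T", the sample tuples, and the in-memory sort standing in for the shuffle are all assumptions made for illustration.

from itertools import groupby

def join_map(s_tuples, t_tuples):
    # Composite key: (join key, relation tag, tuple id). Because "S" < "T",
    # a plain sort places the single tuple from S before all tuples from T.
    for k, sid, attrs in s_tuples:
        yield (k, "S", sid), attrs
    for k, tid, attrs in t_tuples:
        yield (k, "T", tid), attrs

def join_reduce(records):
    # Hold only the S tuple in memory and stream the T tuples past it.
    for join_key, group in groupby(records, key=lambda kv: kv[0][0]):
        s_attrs = None
        for (k, tag, tid), attrs in group:
            if tag == "S":
                s_attrs = attrs                        # the "one" side, seen first
            elif s_attrs is not None:
                yield join_key, (s_attrs, attrs)       # cross S with each T tuple

S = [("k1", "s1", {"age": 30}), ("k2", "s2", {"age": 41})]
T = [("k1", "t1", {"url": "/a"}), ("k1", "t2", {"url": "/b"}), ("k3", "t3", {"url": "/c"})]
records = sorted(join_map(S, T), key=lambda kv: kv[0])  # simulated shuffle and sort
for joined in join_reduce(records):
    print(joined)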
Finally, let us consider the many-to-many join case. Assuming that S is the smaller dataset,
the above algorithm works as well. Consider what happens at the reducer:
processing (that is, input and output keys need not be the same, nor even the same type).15 However,
if the output key of a reducer is different from the input key, then the output dataset from the reducer
will not necessarily be partitioned in a manner consistent with the specified partitioner (because the
partitioner applies to the input keys rather than the output keys). Since map-side joins depend on
consistent partitioning and sorting of keys, the reducers used to generate data that will participate
in a later map-side join must not emit any key but the one they are currently processing.
3.6 SUMMARY
This chapter provides a guide on the design of MapReduce algorithms. In particular, we present a
number of design patterns that capture effective solutions to common problems. In summary, they
are:
In-mapper combining, where the functionality of the combiner is moved into the mapper.
Instead of emitting intermediate output for every input key-value pair, the mapper aggregates
15 In contrast, recall from Section 2.2 that in Google's implementation, a reducer's output keys must be exactly the same as its input
keys.
16 In order to achieve good performance in accessing distributed key-value stores, it is often necessary to batch queries before making
synchronous requests (to amortize latency over many requests) or to rely on asynchronous requests.
partial results across multiple input records and only emits intermediate key-value pairs after
some amount of local aggregation is performed.
The related patterns pairs and stripes for keeping track of joint events from a large number
of observations. In the pairs approach, we keep track of each joint event separately, whereas in
the stripes approach we keep track of all events that co-occur with the same event. Although
the stripes approach is significantly more efficient, it requires memory on the order of the size
of the event space, which presents a scalability bottleneck.
Order inversion, where the main idea is to convert the sequencing of computations into
a sorting problem. Through careful orchestration, we can send the reducer the result of a
computation (e.g., an aggregate statistic) before it encounters the data necessary to produce
that computation.
Value-to-key conversion, which provides a scalable solution for secondary sorting. By moving
part of the value into the key, we can exploit the MapReduce execution framework itself for
sorting.
Ultimately, controlling synchronization in the MapReduce programming model boils down to ef-
fective use of the following techniques:
1. Constructing complex keys and values that bring together data necessary for a computation.
This is used in all of the above design patterns.
2. Executing user-specified initialization and termination code in either the mapper or reducer.
For example, in-mapper combining depends on emission of intermediate key-value pairs in
the map task termination code.
3. Preserving state across multiple inputs in the mapper and reducer. This is used in in-mapper
combining, order inversion, and value-to-key conversion.
4. Controlling the sort order of intermediate keys. This is used in order inversion and value-to-key
conversion.
5. Controlling the partitioning of the intermediate key space. This is used in order inversion and
value-to-key conversion.
This concludes our overview of MapReduce algorithm design. It should be clear by now that although
the programming model forces one to express algorithms in terms of a small set of rigidly defined
components, there are many tools at one's disposal to shape the flow of computation. In the next
few chapters, we will focus on specific classes of MapReduce algorithms: for inverted indexing in
Chapter 4, for graph processing in Chapter 5, and for expectation-maximization in Chapter 6.
CHAPTER 4
Inverted Indexing for Text Retrieval
2 http://boston.lti.cs.cmu.edu/Data/clueweb09/
3 As an interesting side note, in the 1990s, research collections were distributed via postal mail on CD-ROMs, and later, on DVDs.
Electronic distribution became common earlier this decade for collections below a certain size. However, many collections today
are so large that the only practical method of distribution is shipping hard drives via postal mail.
4.1 WEB CRAWLING
A web crawler must practice good etiquette and not overload web servers. For example, it is
common practice to wait a fixed amount of time before repeated requests to the same server. In
order to respect these constraints while maintaining good throughput, a crawler typically keeps
many execution threads running in parallel and maintains many TCP connections (perhaps
hundreds) open at the same time.
Since a crawler has finite bandwidth and resources, it must prioritize the order in which
unvisited pages are downloaded. Such decisions must be made online and in an adversarial
environment, in the sense that spammers actively create link farms and spider traps full of
spam pages to trick a crawler into overrepresenting content from a particular site.
Most real-world web crawlers are distributed systems that run on clusters of machines, often
geographically distributed. To avoid downloading a page multiple times and to ensure data
consistency, the crawler as a whole needs mechanisms for coordination and load-balancing. It
also needs to be robust with respect to machine failures, network outages, and errors of various
types.
Web content changes, but with different frequency depending on both the site and the nature
of the content. A web crawler needs to learn these update patterns to ensure that content is
reasonably current. Getting the right recrawl frequency is tricky: too frequent means wasted
resources, but not frequent enough leads to stale content.
The web is full of duplicate content. Examples include multiple copies of a popular conference
paper, mirrors of frequently accessed sites such as Wikipedia, and newswire content that is
often duplicated. The problem is compounded by the fact that most repetitious pages are not
exact duplicates but near duplicates (that is, basically the same page but with different ads,
navigation bars, etc.). It is desirable during the crawling process to identify near duplicates and
select the best exemplar to index.
The web is multilingual. There is no guarantee that pages in one language only link to pages
in the same language. For example, a professor in Asia may maintain her website in the local
language, but link to her publications in English. Furthermore, many pages contain a mix of text
in different languages. Since document processing techniques (e.g., tokenization, stemming)
differ by language, it is important to identify the (dominant) language on a page.
The above discussion is not meant to be an exhaustive enumeration of issues, but rather to give the
reader an appreciation of the complexities involved in this intuitively simple task. For more infor-
mation, see a recent survey on web crawling [113]. Section 4.7 provides suggestions for additional
readings.
term1 → ⟨d1, p⟩ ⟨d5, p⟩ ⟨d6, p⟩ ⟨d11, p⟩
. . .
Figure 4.1: Simple illustration of an inverted index. Each term is associated with a list of postings. Each
posting is comprised of a document id and a payload, denoted by p in this case. An inverted index provides
quick access to document ids that contain a term.
Figure 4.2: Pseudo-code of the baseline inverted indexing algorithm in MapReduce. Mappers emit
postings keyed by terms, the execution framework groups postings by term, and the reducers write
postings lists to disk.
Once the document has been analyzed, term frequencies are computed by iterating over all the terms
and keeping track of counts. Lines 4 and 5 in the pseudo-code reflect the process of computing term
frequencies, but hide the details of document processing. After this histogram has been built, the
mapper then iterates over all terms. For each term, a pair consisting of the document id and the term
frequency is created. Each pair, denoted by ⟨n, H{t}⟩ in the pseudo-code, represents an individual
posting. The mapper then emits an intermediate key-value pair with the term as the key and the
posting as the value, in line 7 of the mapper pseudo-code. Although as presented here only the
term frequency is stored in the posting, this algorithm can be easily augmented to store additional
information (e.g., term positions) in the payload.
In the shuffle and sort phase, the MapReduce runtime essentially performs a large, distributed
group by of the postings by term. Without any additional effort by the programmer, the execution
framework brings together all the postings that belong in the same postings list. This tremendously
simplifies the task of the reducer, which simply needs to gather together all the postings and write
them to disk. The reducer begins by initializing an empty list and then appends all postings associated
with the same key (term) to the list. The postings are then sorted by document id, and the entire
postings list is emitted as a value, with the term as the key. Typically, the postings list is first
compressed, but we leave this aside for now (see Section 4.4 for more details). The final key-value
pairs are written to disk and comprise the inverted index. Since each reducer writes its output in
a separate file in the distributed file system, our final index will be split across r files, where r is
reducer 1: fish → ⟨d1, 2⟩ ⟨d2, 2⟩; one → ⟨d1, 1⟩ ⟨d3, 1⟩; two → ⟨d1, 1⟩
reducer 2: bird → ⟨d3, 1⟩; blue → ⟨d2, 1⟩; red → ⟨d2, 1⟩ ⟨d3, 1⟩
Figure 4.3: Simple illustration of the baseline inverted indexing algorithm in MapReduce with three
mappers and two reducers. Postings are shown as pairs of boxes (docid, tf).
the number of reducers. There is no need to further consolidate these files. Separately, we must also
build an index to the postings lists themselves for the retrieval engine: this is typically in the form
of mappings from term to (file, byte offset) pairs, so that given a term, the retrieval engine can fetch
its postings list by opening the appropriate file and seeking to the correct byte offset position in that
file.
Execution of the complete algorithm is illustrated in Figure 4.3 with a toy example consisting of
three documents, three mappers, and two reducers. Intermediate key-value pairs (from the mappers)
and the final key-value pairs comprising the inverted index (from the reducers) are shown in the
boxes with dotted lines. Postings are shown as pairs of boxes, with the document id on the left and
the term frequency on the right.
The MapReduce programming model provides a very concise expression of the inverted
indexing algorithm. Its implementation is similarly concise: the basic algorithm can be implemented
in as few as a couple dozen lines of code in Hadoop (with minimal document processing). Such an
implementation can be completed as a week-long programming assignment in a course for advanced
undergraduates or first-year graduate students [83; 93]. In a non-MapReduce indexer, a significant
fraction of the code is devoted to grouping postings by term, given constraints imposed by memory
and disk (e.g., memory capacity is limited, disk seeks are slow, etc.). In MapReduce, the programmer
does not need to worry about any of these issues; most of the heavy lifting is performed by the
execution framework.
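A single-process Python sketch of the baseline indexer conveys the same structure; the toy documents mirror Figure 4.3, while the build_index() driver merely simulates the grouping that the shuffle and sort phase performs.

from collections import Counter, defaultdict

def index_map(docid, terms):
    # Histogram of term frequencies, then one posting (docid, tf) per distinct term.
    for term, tf in Counter(terms).items():
        yield term, (docid, tf)

def index_reduce(term, postings):
    # Gather the postings for a term and sort them by document id.
    yield term, sorted(postings)

def build_index(docs):
    grouped = defaultdict(list)                  # stands in for the shuffle and sort
    for docid, terms in docs:
        for term, posting in index_map(docid, terms):
            grouped[term].append(posting)
    return dict(kv for term, postings in grouped.items()
                for kv in index_reduce(term, postings))

docs = [("d1", ["one", "fish", "two", "fish"]),
        ("d2", ["red", "fish", "blue", "fish"]),
        ("d3", ["one", "red", "bird"])]
print(build_index(docs))     # e.g., 'fish' -> [('d1', 2), ('d2', 2)]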
Figure 4.4: Pseudo-code of a scalable inverted indexing algorithm in MapReduce. By applying the
value-to-key conversion design pattern, the execution framework is exploited to sort postings so that they
arrive sorted by document id in the reducer.
algorithm, payloads can be easily changed: by simply replacing the intermediate value f (term
frequency) with whatever else is desired (e.g., term positional information).
There is one more detail we must address when building inverted indexes. Since almost all
retrieval models take into account document length when computing query-document scores, this
information must also be extracted. Although it is straightforward to express this computation as
another MapReduce job, this task can actually be folded into the inverted indexing process. When
processing the terms in each document, the document length is known, and can be written out as
side data directly to HDFS. We can take advantage of the ability for a mapper to hold state across
the processing of multiple documents in the following manner: an in-memory associative array
is created to store document lengths, which is populated as each document is processed.8 When
the mapper finishes processing input records, document lengths are written out to HDFS (i.e., in
the Close method). This approach is essentially a variant of the in-mapper combining pattern.
Document length data end up in m different files, where m is the number of mappers; these files are
8 In general, there is no worry about insufficient memory to hold these data.
then consolidated into a more compact representation. Alternatively, document length information
can be emitted in special key-value pairs by the mapper. One must then write a custom partitioner
so that these special key-value pairs are shuffled to a single reducer, which will be responsible for
writing out the length data separate from the postings lists.
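A sketch of this mapper-side bookkeeping is shown below; the JSON side file and the close() hook are illustrative stand-ins for writing the data to HDFS in the task's cleanup code.

import json
from collections import Counter

class IndexMapperWithDocLengths:
    def __init__(self, emit, side_data_path="doclengths.json"):
        self.emit = emit
        self.side_data_path = side_data_path       # hypothetical output location
        self.doclengths = {}                       # state preserved across documents

    def map(self, docid, terms):
        self.doclengths[docid] = len(terms)
        for term, tf in Counter(terms).items():
            self.emit(term, (docid, tf))           # normal postings, as before

    def close(self):
        # Write the accumulated lengths as side data; one such file per mapper.
        with open(self.side_data_path, "w") as f:
            json.dump(self.doclengths, f)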
x     unary        γ          Golomb (b = 5)   Golomb (b = 10)
1     0            0          0:00             0:000
2     10           10:0       0:01             0:001
3     110          10:1       0:10             0:010
4     1110         110:00     0:110            0:011
5     11110        110:01     0:111            0:100
6     111110       110:10     10:00            0:101
7     1111110      110:11     10:01            0:1100
8     11111110     1110:000   10:10            0:1101
9     111111110    1110:001   10:110           0:1110
10    1111111110   1110:010   10:111           0:1111
Figure 4.5: The first ten positive integers in unary, γ, and Golomb (b = 5, 10) codes.
is fine since d-gaps and term frequencies should never be zero.11 As an example, 4 in unary code
is 1110. With unary code we can code x in x bits, which although economical for small values,
becomes inefficient for even moderately large values. Unary codes are rarely used by themselves, but
form a component of other coding schemes. Unary codes of the first ten positive integers are shown
in Figure 4.5.
Elias γ code is an efficient coding scheme that is widely used in practice. An integer x > 0
is broken into two components, 1 + ⌊log2 x⌋ (= n, the length), which is coded in unary code,
and x − 2^⌊log2 x⌋ (= r, the remainder), which is coded in binary.12 The unary component n specifies the
number of bits required to code x, and the binary component codes the remainder r in n − 1 bits.
As an example, consider x = 10: 1 + ⌊log2 10⌋ = 4, which is 1110 in unary. The binary component codes
x − 2^3 = 2 in 4 − 1 = 3 bits, which is 010. Putting both together, we arrive at 1110:010. The extra
colon is inserted only for readability; it's not part of the final code, of course.
Working in reverse, it is easy to unambiguously decode a bit stream of γ codes: First, we read
a unary code cu, which is a prefix code. This tells us that the binary portion is written in cu − 1 bits,
which we then read as cb. We can then reconstruct x as 2^(cu−1) + cb. For x < 16, γ codes occupy
less than a full byte, which makes them more compact than variable-length integer codes. Since
term frequencies for the most part are relatively small, γ codes make sense for them and can yield
substantial space savings. For reference, the γ codes of the first ten positive integers are shown in
Figure 4.5. A variation on γ code is δ code, where the n portion of the γ code is coded in γ code
itself (as opposed to unary code). For smaller values γ codes are more compact, but for larger values,
δ codes take less space.
11 As a note, some sources describe slightly different formulations of the same coding scheme. Here, we adopt the conventions used
in the classic IR text Managing Gigabytes [156].
12 Note that ⌊x⌋ is the floor function, which maps x to the largest integer not greater than x, so, e.g., ⌊3.8⌋ = 3. This is the default
behavior in many programming languages when casting from a floating-point type to an integer type.
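Since γ coding is easy to get subtly wrong, here is a small Python sketch of the encoder and decoder described above; it operates on bit strings for readability rather than on packed bytes.

from math import floor, log2

def unary(x):
    # x > 0 as (x - 1) ones followed by a terminating zero, e.g., 4 -> 1110.
    return "1" * (x - 1) + "0"

def gamma_encode(x):
    n = 1 + floor(log2(x))                  # length component
    if n == 1:
        return "0"                          # gamma code of 1
    r = x - 2 ** (n - 1)                    # remainder component
    return unary(n) + format(r, "b").zfill(n - 1)

def gamma_decode(bits, pos=0):
    # Decode one gamma code starting at pos; return (value, next position).
    n = 1
    while bits[pos] == "1":                 # read the unary length
        n += 1
        pos += 1
    pos += 1                                # skip the terminating zero
    if n == 1:
        return 1, pos
    r = int(bits[pos:pos + n - 1], 2)
    return 2 ** (n - 1) + r, pos + n - 1

stream = "".join(gamma_encode(x) for x in [10, 1, 2])
pos, decoded = 0, []
while pos < len(stream):
    x, pos = gamma_decode(stream, pos)
    decoded.append(x)
print(stream, decoded)                      # 11100100100 [10, 1, 2]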
Unary and γ codes are parameterless, but even better compression can be achieved with
parameterized codes. A good example of this is Golomb code. For some parameter b, an integer
x > 0 is coded in two parts: first, we compute q = ⌊(x − 1)/b⌋ and code q + 1 in unary; then,
we code the remainder r = x − qb − 1 in truncated binary. This is accomplished as follows: if b
is a power of two, then truncated binary is exactly the same as normal binary, requiring log2 b
bits. Otherwise, we code the first 2^(⌊log2 b⌋+1) − b values of r in ⌊log2 b⌋ bits and code the rest of
the values of r by coding r + 2^(⌊log2 b⌋+1) − b in ordinary binary representation using ⌊log2 b⌋ + 1
bits. In this case, r is coded in either ⌊log2 b⌋ or ⌊log2 b⌋ + 1 bits, and unlike ordinary binary
coding, truncated binary codes are prefix codes. As an example, if b = 5, then r can take the values
{0, 1, 2, 3, 4}, which would be coded with the following code words: {00, 01, 10, 110, 111}. For
reference, Golomb codes of the first ten positive integers are shown in Figure 4.5 for b = 5 and
b = 10. A special case of Golomb code is worth noting: if b is a power of two, then coding and
decoding can be handled more efficiently (needing only bit shifts and bit masks, as opposed to
multiplication and division). These are known as Rice codes.
Researchers have shown that Golomb compression works well for d-gaps, and is optimal with
the following parameter setting:
b ≈ 0.69 × (N/df)     (4.1)
where df is the document frequency of the term, and N is the number of documents in the collec-
tion.13
Putting everything together, one popular approach for postings compression is to represent
d-gaps with Golomb codes and term frequencies with γ codes [156; 162]. If positional information
is desired, we can use the same trick to code differences between term positions using γ codes.
13 For details as to why this is the case, we refer the reader elsewhere [156], but here's the intuition: under reasonable assumptions,
the appearance of postings can be modeled as a sequence of independent Bernoulli trials, which implies a certain distribution of
d-gaps. From this we can derive an optimal setting of b.
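The encoder below sketches Golomb coding with truncated binary remainders; its output for b = 5 can be checked against Figure 4.5. The small helper for choosing b from Equation 4.1 is an assumption about how the parameter would be set per term.

from math import floor, log2

def truncated_binary(r, b):
    # Code r in {0, ..., b - 1}: the first 2^(k+1) - b values take k bits,
    # the remaining values take k + 1 bits, where k = floor(log2 b).
    k = floor(log2(b))
    u = 2 ** (k + 1) - b
    if r < u:
        return "" if k == 0 else format(r, "b").zfill(k)
    return format(r + u, "b").zfill(k + 1)

def golomb_encode(x, b):
    q = (x - 1) // b                        # quotient, coded as q + 1 in unary
    r = x - q * b - 1                       # remainder, coded in truncated binary
    return "1" * q + "0" + truncated_binary(r, b)

def golomb_parameter(N, df):
    # Equation 4.1: b is roughly 0.69 * N / df for d-gap compression.
    return max(1, round(0.69 * N / df))

print([golomb_encode(x, 5) for x in range(1, 11)])
# ['000', '001', '010', '0110', '0111', '1000', '1001', '1010', '10110', '10111']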
(tuple ⟨t, docid⟩, tf f)
(tuple ⟨t, ∗⟩, df e)
to keep track of document frequencies associated with each term. In practice, we can accomplish
this by applying the in-mapper combining design pattern (see Section 3.1). The mapper holds an
in-memory associative array that keeps track of how many documents a term has been observed in
(i.e., the local document frequency of the term for the subset of documents processed by the mapper).
Once the mapper has processed all input records, special keys of the form ⟨t, ∗⟩ are emitted with the
partial df as the value.
To ensure that these special keys arrive first, we define the sort order of the tuple so that the
special symbol precedes all documents (part of the order inversion design pattern). Thus, for each
term, the reducer will first encounter the ⟨t, ∗⟩ key, associated with a list of values representing partial
df values originating from each mapper. Summing all these partial contributions will yield the term's
14 However, this study found less speed difference between group varInt and basic varInt than Dean's analysis, presumably due to
the different distribution of d-gaps in the collections they were examining.
df, which can then be used to set the Golomb compression parameter b. This allows the postings
to be incrementally compressed as they are encountered in the reducer; memory bottlenecks are
eliminated since we do not need to buffer postings in memory.
Once again, the order inversion design pattern comes to the rescue. Recall that the pattern is
useful when a reducer needs to access the result of a computation (e.g., an aggregate statistic) before
it encounters the data necessary to produce that computation. For computing relative frequencies,
that bit of information was the marginal. In this case, it's the document frequency.
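In terms of mechanics, the same two ingredients as in Section 3.3 are needed: a key sort order that places ⟨t, ∗⟩ before every ⟨t, docid⟩ key, and a partitioner on the term alone. A minimal sketch follows; the ∗ marker and the plain-function interfaces are assumptions made for illustration.

DF_MARKER = "*"   # special "docid"; assumed not to collide with real document ids

def posting_key_sort(key):
    term, docid = key
    return (term, docid != DF_MARKER, docid)   # the df key sorts first per term

def posting_key_partition(key, num_reducers):
    return hash(key[0]) % num_reducers         # partition on the term only

keys = [("fish", "d2"), ("fish", DF_MARKER), ("bird", "d3"), ("fish", "d1")]
print(sorted(keys, key=posting_key_sort))
# [('bird', 'd3'), ('fish', '*'), ('fish', 'd1'), ('fish', 'd2')]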
4.6 WHAT ABOUT RETRIEVAL?
Thus far, we have briefly discussed web crawling and focused mostly on MapReduce algorithms for
inverted indexing. What about retrieval? It should be fairly obvious that MapReduce, which was
designed for large batch operations, is a poor solution for retrieval. Since users demand sub-second
response times, every aspect of retrieval must be optimized for low latency, which is exactly the
opposite tradeoff made in MapReduce. Recall the basic retrieval problem: we must look up postings
lists corresponding to query terms, systematically traverse those postings lists to compute query-document
scores, and then return the top k results to the user. Looking up postings implies random
disk seeks, since for the most part postings are too large to fit into memory (leaving aside caching
and other special cases for now). Unfortunately, random access is not a forte of the distributed file
system underlying MapReduce: such operations require multiple round-trip network exchanges
(and associated latencies). In HDFS, a client must first obtain the location of the desired data block
from the namenode before the appropriate datanode can be contacted for the actual data. Of course,
access will typically require a random disk seek on the datanode itself.
It should be fairly obvious that serving the search needs of a large number of users, each of
whom demand sub-second response times, is beyond the capabilities of any single machine. The
only solution is to distribute retrieval across a large number of machines, which necessitates breaking
up the index in some manner. There are two main partitioning strategies for distributed retrieval:
document partitioning and term partitioning. Under document partitioning, the entire collection is
broken up into multiple smaller sub-collections, each of which is assigned to a server. In other
words, each server holds the complete index for a subset of the entire collection. This corresponds
to partitioning vertically in Figure 4.6. With term partitioning, on the other hand, each server is
responsible for a subset of the terms for the entire collection. That is, a server holds the postings for
all documents in the collection for a subset of terms. This corresponds to partitioning horizontally
in Figure 4.6.
Document and term partitioning require different retrieval strategies and represent different
tradeoffs. Retrieval under document partitioning involves a query broker, which forwards the user's
query to all partition servers, merges partial results from each, and then returns the final results to the
user. With this architecture, searching the entire collection requires that the query be processed by
every partition server. However, since each partition operates independently and traverses postings
[Term-document matrix: terms t1–t9 (rows) by documents d1–d9 (columns), with non-zero cells holding term frequencies; rows are grouped into term partitions a (t1–t3), b (t4–t6), and c (t7–t9), and columns into document partitions 1, 2, and 3.]
Figure 4.6: Term-document matrix for a toy collection (nine documents, nine terms) illustrating different
partitioning strategies: partitioning vertically (1, 2, 3) corresponds to document partitioning, whereas
partitioning horizontally (a, b, c) corresponds to term partitioning.
in parallel, document partitioning typically yields shorter query latencies (compared to a single
monolithic index with much longer postings lists).
Retrieval under term partitioning, on the other hand, requires a very different strategy. Suppose
the user's query Q contains three terms, q1, q2, and q3. Under the pipelined query evaluation strategy,
the broker begins by forwarding the query to the server that holds the postings for q1 (usually the
least frequent term). The server traverses the appropriate postings list and computes partial query-document
scores, stored in the accumulators. The accumulators are then passed to the server that
holds the postings associated with q2 for additional processing, and then to the server for q3 , before
final results are passed back to the broker and returned to the user. Although this query evaluation
strategy may not substantially reduce the latency of any particular query, it can theoretically increase
a system's throughput due to the far smaller number of total disk seeks required for each user
query (compared to document partitioning). However, load-balancing is tricky in a pipelined term-
partitioned architecture due to skew in the distribution of query terms, which can create hot spots
on servers that hold the postings for frequently occurring query terms.
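As a small illustration of the broker's role under document partitioning, the sketch below merges per-partition top-k lists into a global ranking; the partition server objects and their search() method are hypothetical, standing in for whatever interface a real engine exposes.

import heapq

def broker_search(query, partition_servers, k=10):
    # Forward the query to every partition server, collect partial results as
    # (score, docid) pairs, and keep the k highest-scoring hits overall.
    hits = []
    for server in partition_servers:
        hits.extend(server.search(query, k))
    return heapq.nlargest(k, hits)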
In general, studies have shown that document partitioning is a better strategy overall [109],
and this is the strategy adopted by Google [16]. Furthermore, it is known that Google maintains its
indexes in memory (although this is certainly not the common case for search engines in general).
One key advantage of document partitioning is that result quality degrades gracefully with machine
failures. Partition servers that are offline will simply fail to deliver results for their subsets of the
collection. With sufficient partitions, users might not even be aware that documents are missing.
For most queries, the web contains more relevant documents than any user has time to digest:
users of course care about getting relevant documents (sometimes, they are happy with a single
relevant document), but they are generally less discriminating when it comes to which relevant
documents appear in their results (out of the set of all relevant documents). Note that partitions may
be unavailable due to reasons other than machine failure: cycling through different partitions is a
very simple and non-disruptive strategy for index updates.
Working in a document-partitioned architecture, there are a variety of approaches to dividing
up the web into smaller pieces. Proper partitioning of the collection can address one major weakness
of this architecture, which is that every partition server is involved in every user query. Along one
dimension, it is desirable to partition by document quality using one or more classifiers; see [124] for
a recent survey on web page classification. Partitioning by document quality supports a multi-phase
search strategy: the system examines partitions containing high quality documents first, and only
backs off to partitions containing lower quality documents if necessary. This reduces the number of
servers that need to be contacted for a user query. Along an orthogonal dimension, it is desirable
to partition documents by content (perhaps also guided by the distribution of user queries from
logs), so that each partition is well separated from the others in terms of topical coverage. This
also reduces the number of machines that need to be involved in serving a user's query: the broker
can direct queries only to the partitions that are likely to contain relevant documents, as opposed to
forwarding the user query to all the partitions.
On a large scale, reliability of service is provided by replication, both in terms of multiple
machines serving the same partition within a single datacenter, but also replication across geograph-
ically distributed datacenters. This creates at least two query routing problems: since it makes sense
to serve clients from the closest datacenter, a service must route queries to the appropriate location.
Within a single datacenter, the system needs to properly balance load across replicas.
There are two final components of real-world search engines that are worth discussing. First,
recall that postings only store document ids. Therefore, raw retrieval results consist of a ranked list
of semantically meaningless document ids. It is typically the responsibility of document servers,
functionally distinct from the partition servers holding the indexes, to generate meaningful output
for user presentation. Abstractly, a document server takes as input a query and a document id, and
computes an appropriate result entry, typically comprising the title and URL of the page, a snippet
of the source document showing the user's query terms in context, and additional metadata about
the document. Second, query evaluation can benefit immensely from caching, of individual postings
(assuming that the index is not already in memory) and even results of entire queries [13]. This is
made possible by the Zipfian distribution of queries, with very frequent queries at the head of the
distribution dominating the total number of queries. Search engines take advantage of this with
cache servers, which are functionally distinct from all of the components discussed above.
4.7 SUMMARY AND ADDITIONAL READINGS
Additional Readings. Our brief discussion of web search glosses over many complexities and does
a huge injustice to the tremendous amount of research in information retrieval. Here, however, we
provide a few entry points into the literature. A survey article by Zobel and Moffat [162] is an excel-
lent starting point on indexing and retrieval algorithms. Another by Baeza-Yates et al. [11] overviews
many important issues in distributed retrieval. A keynote talk at the WSDM 2009 conference by Jeff
Dean revealed a lot of information about the evolution of the Google search architecture.15 Finally,
a number of general information retrieval textbooks have been recently published [30; 42; 101]. Of
these three, the one by Büttcher et al. [30] is noteworthy in having detailed experimental evaluations
that compare the performance (both effectiveness and efficiency) of a wide range of algorithms and
techniques. While outdated in many other respects, the textbook Managing Gigabytes [156] remains
an excellent source for index compression techniques. Finally, ACM SIGIR is an annual conference
and the most prestigious venue for academic information retrieval research; proceedings from those
events are perhaps the best starting point for those wishing to keep abreast of publicly documented
developments in the field.
15 http://research.google.com/people/jeff/WSDM09-keynote.pdf
CHAPTER 5
Graph Algorithms
Graphs are ubiquitous in modern society: examples encountered by almost everyone on a daily
basis include the hyperlink structure of the web (simply known as the web graph), social networks
(manifest in the flow of email, phone call patterns, connections on social networking sites, etc.), and
transportation networks (roads, bus routes, flights, etc.). Our very own existence is dependent on an
intricate metabolic and regulatory network, which can be characterized as a large, complex graph
involving interactions between genes, proteins, and other cellular products. This chapter focuses
on graph algorithms in MapReduce. Although most of the content has nothing to do with text
processing per se, documents frequently exist in the context of some underlying network, making
graph analysis an important component of many text processing applications. Perhaps the best known
example is PageRank, a measure of web page quality based on the structure of hyperlinks, which is
used in ranking results for web search. As one of the first applications of MapReduce, PageRank
exemplifies a large class of graph algorithms that can be concisely captured in the programming
model. We will discuss PageRank in detail later this chapter.
In general, graphs can be characterized by nodes (or vertices) and links (or edges) that connect
pairs of nodes.1 These connections can be directed or undirected. In some graphs, there may be an
edge from a node to itself, resulting in a self loop; in others, such edges are disallowed. We assume that
both nodes and links may be annotated with additional metadata: as a simple example, in a social
network where nodes represent individuals, there might be demographic information (e.g., age,
gender, location) attached to the nodes and type information attached to the links (e.g., indicating
type of relationship such as friend or spouse).
Mathematicians have always been fascinated with graphs, dating back to Euler's paper on the
Seven Bridges of Königsberg in 1736. Over the past few centuries, graphs have been extensively studied,
and today much is known about their properties. Far more than theoretical curiosities, theorems and
algorithms on graphs can be applied to solve many real-world problems:
Graph search and path planning. Search algorithms on graphs are invoked millions of times a
day, whenever anyone searches for directions on the web. Similar algorithms are also involved
in friend recommendations and expert-finding in social networks. Path planning problems
involving everything from network packets to delivery trucks represent another large class of
graph search problems.
Graph clustering. Can a large graph be divided into components that are relatively disjoint (for
example, as measured by inter-component links [59])? Among other applications, this task is
1Throughout this chapter, we use node interchangeably with vertex and similarly with link and edge.
useful for identifying communities in social networks (of interest to sociologists who wish to
understand how human relationships form and evolve) and for partitioning large graphs (of
interest to computer scientists who seek to better parallelize graph processing). See [158] for
a survey.
Minimum spanning trees. A minimum spanning tree for a graph G with weighted edges is
a tree that contains all vertices of the graph and a subset of edges connecting all the vertices
together that minimizes the sum of edge weights. A real-world example of this problem is
a telecommunications company that wishes to lay optical fiber to span a number of destina-
tions at the lowest possible cost (where weights denote costs). This approach has also been
applied to a wide variety of problems, including social networks and the migration of Polynesian
islanders [64].
Bipartite graph matching. A bipartite graph is one whose vertices can be divided into two
disjoint sets. Matching problems on such graphs can be used to model job seekers looking for
employment or singles looking for dates.
Maximum flow. In a weighted directed graph with two special nodes called the source and
the sink, the max flow problem involves computing the amount of traffic that can be sent
from source to sink given various flow capacities defined by edge weights. Transportation
companies (airlines, shipping, etc.) and network operators grapple with complex versions of
these problems on a daily basis.
Identifying special nodes. There are many ways to define what special means, including
metrics based on node in-degree, average distance to other nodes, and relationship to cluster
structure. These special nodes are important to investigators attempting to break up terrorist
cells, epidemiologists modeling the spread of diseases, advertisers trying to promote products,
and many others.
A common feature of these problems is the scale of the datasets on which the algorithms must
operate: for example, the hyperlink structure of the web, which contains billions of pages, or social
networks that contain hundreds of millions of individuals. Clearly, algorithms that run on a single
machine and depend on the entire graph residing in memory are not scalable. We'd like to put
MapReduce to work on these challenges.2
This chapter is organized as follows: we begin in Section 5.1 with an introduction to graph
representations, and then explore two classic graph algorithms in MapReduce: parallel breadth-first
search (Section 5.2) and PageRank (Section 5.3). Before concluding with a summary and pointing
out additional readings, Section 5.4 discusses a number of general issues that affect graph processing
with MapReduce.
2 As a side note, Google recently published a short description of a system called Pregel [98], based on Valiant's Bulk Synchronous
Parallel model [148], for large-scale graph algorithms; a longer description is anticipated in a forthcoming paper [99].
5.1 GRAPH REPRESENTATIONS
      n1  n2  n3  n4  n5
n1     0   1   0   1   0        n1 [n2, n4]
n2     0   0   1   0   1        n2 [n3, n5]
n3     0   0   0   1   0        n3 [n4]
n4     0   0   0   0   1        n4 [n5]
n5     1   1   1   0   0        n5 [n1, n2, n3]
Figure 5.1: A simple directed graph (left) represented as an adjacency matrix (middle) and with adjacency
lists (right).
4This technique is used in anchor text inversion, where one gathers the anchor text of hyperlinks pointing to a particular page. It is
common practice to enrich a web page's standard textual representation with all of the anchor texts associated with its incoming
hyperlinks (e.g., [107]).
5.2 PARALLEL BREADTH-FIRST SEARCH
1: Dijkstra(G, w, s)
2:   d[s] ← 0
3:   for all vertex v ∈ V do
4:     d[v] ← ∞
5:   Q ← {V}
6:   while Q ≠ ∅ do
7:     u ← ExtractMin(Q)
8:     for all vertex v ∈ u.AdjacencyList do
9:       if d[v] > d[u] + w(u, v) then
10:        d[v] ← d[u] + w(u, v)
Figure 5.2: Pseudo-code for Dijkstra's algorithm, which is based on maintaining a global priority queue
of nodes with priorities equal to their distances from the source node. At each iteration, the algorithm
expands the node with the shortest distance and updates distances to all reachable nodes.
The algorithm in Figure 5.2 only computes the shortest distances. The actual paths can be recovered by storing
backpointers for every node indicating a fragment of the shortest path.
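For readers who want a runnable single-machine baseline to compare against, here is a minimal Python sketch of the algorithm in Figure 5.2 (not from the book's own code; the graph is assumed to be a dictionary mapping each node id to a list of (neighbor, edge weight) pairs):

    import heapq

    def dijkstra(graph, source):
        # graph: dict mapping each node to a list of (neighbor, weight) pairs
        dist = {node: float('inf') for node in graph}
        dist[source] = 0
        queue = [(0, source)]                    # priority queue keyed by distance
        while queue:
            d, u = heapq.heappop(queue)          # ExtractMin(Q)
            if d > dist[u]:
                continue                         # stale entry; u was already finalized
            for v, w in graph[u]:
                if d + w < dist[v]:              # relax edge (u, v)
                    dist[v] = d + w
                    heapq.heappush(queue, (dist[v], v))
        return dist

Instead of removing entries from the queue when a distance improves, this version simply pushes a new entry and skips stale ones when they are popped, which is the usual idiom with a binary heap.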
A sample trace of the algorithm running on a simple graph is shown in Figure 5.3 (example
also adapted from CLR). We start out in (a) with n1 having a distance of zero (since it's the source)
and all other nodes having a distance of ∞. In the first iteration (a), n1 is selected as the node to
expand (indicated by the thicker border). After the expansion, we see in (b) that n2 and n3 can be
reached at a distance of 10 and 5, respectively. Also, we see in (b) that n3 is the next node selected
for expansion. Nodes we have already considered for expansion are shown in black. Expanding n3,
we see in (c) that the distance to n2 has decreased because we've found a shorter path. The nodes
that will be expanded next, in order, are n5, n2, and n4. The algorithm terminates with the end state
shown in (f), where we've discovered the shortest distance to all nodes.
The key to Dijkstra's algorithm is the priority queue that maintains a globally sorted list of
nodes by current distance. This is not possible in MapReduce, as the programming model does not
provide a mechanism for exchanging global data. Instead, we adopt a brute force approach known as
parallel breadth-first search. First, as a simplification let us assume that all edges have unit distance
(modeling, for example, hyperlinks on the web). This makes the algorithm easier to understand, but
we'll relax this restriction later.
The intuition behind the algorithm is this: the distance of all nodes connected directly to the
source node is one; the distance of all nodes directly connected to those is two; and so on. Imagine
water rippling away from a rock dropped into a pond: that's a good image of how parallel breadth-
first search works. However, what if there are multiple paths to the same node? Suppose we wish to
compute the shortest distance to node n. The shortest path must go through one of the nodes in M
that contain an outgoing edge to n: we need to examine all m ∈ M to find ms, the node with the
shortest distance. The shortest distance to n is the distance to ms plus one.
Figure 5.3: Example of Dijkstra's algorithm applied to a simple graph with five nodes, with n1 as the
source and edge distances as indicated. Parts (a)–(e) show the running of the algorithm at each iteration,
with the current distance inside the node. Nodes with thicker borders are those being expanded; nodes
that have already been expanded are shown in black.
Pseudo-code for the implementation of the parallel breadth-first search algorithm is provided
in Figure 5.4. As with Dijkstra's algorithm, we assume a connected, directed graph represented as
adjacency lists. Distance to each node is directly stored alongside the adjacency list of that node, and
initialized to ∞ for all nodes except for the source node. In the pseudo-code, we use n to denote
the node id (an integer) and N to denote the node's corresponding data structure (adjacency list and
current distance). The algorithm works by mapping over all nodes and emitting a key-value pair for
each neighbor on the node's adjacency list. The key contains the node id of the neighbor, and the
value is the current distance to the node plus one. This says: if we can reach node n with a distance
d, then we must be able to reach all the nodes that are connected to n with distance d + 1. After
shuffle and sort, reducers will receive keys corresponding to the destination node ids and distances
corresponding to all paths leading to that node. The reducer will select the shortest of these distances
and then update the distance in the node data structure.
It is apparent that parallel breadth-first search is an iterative algorithm, where each iteration
corresponds to a MapReduce job. The first time we run the algorithm, we discover all nodes that
are connected to the source. The second iteration, we discover all nodes connected to those, and
1: class Mapper
2:   method Map(nid n, node N)
3:     d ← N.Distance
4:     Emit(nid n, N)                          ▷ Pass along graph structure
5:     for all nodeid m ∈ N.AdjacencyList do
6:       Emit(nid m, d + 1)                    ▷ Emit distances to reachable nodes
1: class Reducer
2:   method Reduce(nid m, [d1, d2, . . .])
3:     dmin ← ∞
4:     M ← ∅
5:     for all d ∈ [d1, d2, . . .] do
6:       if IsNode(d) then
7:         M ← d                               ▷ Recover graph structure
8:       else if d < dmin then                 ▷ Look for shorter distance
9:         dmin ← d
10:    M.Distance ← dmin                       ▷ Update shortest distance
11:    Emit(nid m, node M)
Figure 5.4: Pseudo-code for parallel breadth-first search in MapReduce: the mappers emit distances to
reachable nodes, while the reducers select the minimum of those distances for each destination node.
Each iteration (one MapReduce job) of the algorithm expands the search frontier by one hop.
so on. Each iteration of the algorithm expands the search frontier by one hop, and, eventually, all
nodes will be discovered with their shortest distances (assuming a connected graph). Before
we discuss termination of the algorithm, there is one more detail required to make the parallel
breadth-first search algorithm work. We need to pass along the graph structure from one iteration
to the next. This is accomplished by emitting the node data structure itself, with the node id as a key
(Figure 5.4, line 4 in the mapper). In the reducer, we must distinguish the node data structure from
distance values (Figure 5.4, lines 5–6 in the reducer), and update the minimum distance in the node
data structure before emitting it as the final value. The final output is now ready to serve as input to
the next iteration.5
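As a rough illustration of how Figure 5.4 might look in executable form, here is a Python sketch written as plain generator functions rather than against any particular MapReduce API; the Node class and the INFINITY sentinel are illustrative assumptions, and the shuffle-and-sort stage between the two functions is left to the framework:

    INFINITY = float('inf')

    class Node(object):
        # Minimal node data structure: current distance plus adjacency list.
        def __init__(self, distance=INFINITY, adjacency=None):
            self.distance = distance
            self.adjacency = adjacency or []        # list of neighbor node ids

    def bfs_map(nid, node):
        yield nid, node                             # pass along the graph structure
        for m in node.adjacency:
            yield m, node.distance + 1              # emit distance to each neighbor
            # with arbitrary edge distances, emit node.distance + w instead

    def bfs_reduce(nid, values):
        d_min, node = INFINITY, None
        for value in values:
            if isinstance(value, Node):
                node = value                        # recover the graph structure
            elif value < d_min:
                d_min = value                       # look for a shorter distance
        node.distance = min(node.distance, d_min)   # update shortest distance
        yield nid, node

In Hadoop, the two value types would typically be wrapped in a single object with an indicator field, as the footnote below suggests.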
So how many iterations are necessary to compute the shortest distance to all nodes? The
answer is the diameter of the graph, or the greatest distance between any pair of nodes. This number
is surprisingly small for many real-world problems: the saying six degrees of separation suggests
that everyone on the planet is connected to everyone else by at most six steps (the people a person
knows are one step away, people that they know are two steps away, etc.). If this is indeed true,
5 Note that in this algorithm we are overloading the value type, which can either be a distance (integer) or a complex data structure
representing a node. The best way to achieve this in Hadoop is to create a wrapper object with an indicator variable specifying
what the content is.
then parallel breadth-first search on the global social network would take at most six MapReduce
iterations. For more serious academic studies of small world phenomena in networks, we refer the
reader to a number of publications [2; 61; 62; 152]. In practical terms, we iterate the algorithm until
there are no more node distances that are ∞. Since the graph is connected, all nodes are reachable,
and since all edge distances are one, all discovered nodes are guaranteed to have the shortest distances
(i.e., there is not a shorter path that goes through a node that hasn't been discovered).
The actual checking of the termination condition must occur outside of MapReduce. Typically,
execution of an iterative MapReduce algorithm requires a non-MapReduce driver program, which
submits a MapReduce job to iterate the algorithm, checks to see if a termination condition has been
met, and if not, repeats. Hadoop provides a lightweight API for constructs called counters, which,
as the name suggests, can be used for counting events that occur during execution, e.g., number
of corrupt records, number of times a certain condition is met, or anything that the programmer
desires. Counters can be defined to count the number of nodes that have distances of ∞: at the end
of the job, the driver program can access the final counter value and check to see if another iteration
is necessary.
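The exact calls for submitting a job and reading its counters differ across frameworks and Hadoop versions, so the sketch below only shows the shape of the driver loop; run_iteration is a caller-supplied function (a hypothetical stand-in, not a real Hadoop API) that runs one MapReduce job and returns the counter values it collected:

    def iterate_until_converged(run_iteration, input_path,
                                counter='nodes-at-infinity', max_iterations=50):
        # Drives an iterative MapReduce algorithm from outside the framework:
        # submit a job, inspect its counters, and decide whether to go again.
        current, output = input_path, input_path
        for i in range(max_iterations):
            output = '%s-iter-%04d' % (input_path, i)
            counters = run_iteration(current, output)   # one MapReduce job
            if counters.get(counter, 0) == 0:           # e.g., no node still at distance infinity
                break                                   # termination condition met
            current = output                            # this output feeds the next iteration
        return output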
Finally, as with Dijkstra's algorithm in the form presented earlier, the parallel breadth-first
search algorithm only finds the shortest distances, not the actual shortest paths. However, the path
can be straightforwardly recovered. Storing backpointers at each node, as with Dijkstra's algorithm,
will work, but may not be efficient since the graph needs to be traversed again to reconstruct the
path segments. A simpler approach is to emit paths along with distances in the mapper, so that each
node will have its shortest path easily accessible at all times. The additional space requirements for
shuffling these data from mappers to reducers are relatively modest, since for the most part paths
(i.e., sequence of node ids) are relatively short.
Up until now, we have been assuming that all edges are unit distance. Let us relax that
restriction and see what changes are required in the parallel breadth-first search algorithm. The
adjacency lists, which were previously lists of node ids, must now encode the edge distances as well.
In line 6 of the mapper code in Figure 5.4, instead of emitting d + 1 as the value, we must now
emit d + w where w is the edge distance. No other changes to the algorithm are required, but
the termination behavior is very different. To illustrate, consider the graph fragment in Figure 5.5,
where s is the source node, and in this iteration, we just discovered node r for the very first time.
Assume for the sake of argument that we've already discovered the shortest distance to node p,
and that the shortest distance to r so far goes through p. This, however, does not guarantee that
we've discovered the shortest distance to r, since there may exist a path going through q that we
haven't encountered yet (because it lies outside the search frontier).6 However, as the search frontier
expands, we'll eventually cover q and all other nodes along the path from p to q to r, which means
that with sufficient iterations, we will discover the shortest distance to r. But how do we know that
we've found the shortest distance to p? Well, if the shortest path to p lies within the search frontier,
6 Note that the same argument does not apply to the unit edge distance case: the shortest path cannot lie outside the search frontier
since any such path would necessarily be longer.
Figure 5.5: In the single-source shortest path problem with arbitrary edge distances, the shortest path
from source s to node r may go outside the current search frontier, in which case we will not find the
shortest distance to r until the search frontier expands to cover q.
Figure 5.6: A sample graph that elicits worst-case behavior for parallel breadth-first search. Eight iter-
ations are required to discover shortest distances to all nodes from n1 .
we would have already discovered it. And if it doesn't, the above argument applies. Similarly, we
can repeat the same argument for all nodes on the path from s to p. The conclusion is that, with
sufficient iterations, we'll eventually discover all the shortest distances.
So exactly how many iterations does 'eventually' mean? In the worst case, we might need as
many iterations as there are nodes in the graph minus one. In fact, it is not difficult to construct
graphs that will elicit this worst-case behavior: Figure 5.6 provides an example, with n1 as the source.
The parallel breadth-first search algorithm would not discover that the shortest path from n1 to n6
goes through n3 , n4 , and n5 until the fifth iteration. Three more iterations are necessary to cover
the rest of the graph. Fortunately, for most real-world graphs, such extreme cases are rare, and the
number of iterations necessary to discover all shortest distances is quite close to the diameter of the
graph, as in the unit edge distance case.
In practical terms, how do we know when to stop iterating in the case of arbitrary edge
distances? The algorithm can terminate when shortest distances at every node no longer change.
Once again, we can use counters to keep track of such events. Every time we encounter a shorter
distance in the reducer, we increment a counter. At the end of each MapReduce iteration, the driver
program reads the counter value and determines if another iteration is necessary.
Compared to Dijkstra's algorithm on a single processor, parallel breadth-first search in Map-
Reduce can be characterized as a brute force approach that wastes a lot of time performing compu-
tations whose results are discarded. At each iteration, the algorithm attempts to recompute distances
to all nodes, but in reality only useful work is done along the search frontier: inside the search frontier,
the algorithm is simply repeating previous computations.7 Outside the search frontier, the algorithm
hasn't discovered any paths to nodes there yet, so no meaningful work is done. Dijkstra's algorithm, on
the other hand, is far more efficient. Every time a node is explored, we're guaranteed to have already
found the shortest path to it. However, this is made possible by maintaining a global data structure
(a priority queue) that holds nodes sorted by distance; this is not possible in MapReduce because
the programming model does not provide support for global data that is mutable and accessible by
the mappers and reducers. These inefficiencies represent the cost of parallelization.
The parallel breadth-first search algorithm is instructive in that it represents the prototypical
structure of a large class of graph algorithms in MapReduce. They share the following character-
istics:
The graph structure is represented with adjacency lists, which is part of some larger node
data structure that may contain additional information (variables to store intermediate output,
features of the nodes). In many cases, features are attached to edges as well (e.g., edge weights).
The MapReduce algorithm maps over the node data structures and performs a computation
that is a function of features of the node, intermediate state attached to each node, and features
of the adjacency list (outgoing edges and their features). In other words, computations can only
involve a node's internal state and its local graph structure. The results of these computations are
emitted as values, keyed with the node ids of the neighbors (i.e., those nodes on the adjacency
lists). Conceptually, we can think of this as passing the results of the computation along
outgoing edges. In the reducer, the algorithm receives all partial results that have the same
destination node, and performs another computation (usually, some form of aggregation).
In addition to computations, the graph itself is also passed from the mapper to the reducer.
In the reducer, the data structure corresponding to each node is updated and written back to
disk.
Graph algorithms in MapReduce are generally iterative, where the output of the previous
iteration serves as input to the next iteration. The process is controlled by a non-MapReduce
driver program that checks for termination.
7 Unless the algorithm discovers an instance of the situation described in Figure 5.5, in which case, updated distances will propagate
inside the search frontier.
For parallel breadth-first search, the mapper computation is the current distance plus edge distance
(emitting distances to neighbors), while the reducer computation is the Min function (selecting the
shortest path). As we will see in the next section, the MapReduce algorithm for PageRank works in
much the same way.
5.3 PAGERANK
PageRank [117] is a measure of web page quality based on the structure of the hyperlink graph.
Although it is only one of thousands of features that are taken into account in Google's search
algorithm, it is perhaps one of the best known and most studied.
A vivid way to illustrate PageRank is to imagine a random web surfer: the surfer visits a
page, randomly clicks a link on that page, and repeats ad infinitum. PageRank is a measure of how
frequently a page would be encountered by our tireless web surfer. More precisely, PageRank is a
probability distribution over nodes in the graph representing the likelihood that a random walk over
the link structure will arrive at a particular node. Nodes that have high in-degrees tend to have high
PageRank values, as well as nodes that are linked to by other nodes with high PageRank values.
This behavior makes intuitive sense: if PageRank is a measure of page quality, we would expect
high-quality pages to contain endorsements from many other pages in the form of hyperlinks.
Similarly, if a high-quality page links to another page, then the second page is likely to be high
quality also. PageRank represents one particular approach to inferring the quality of a web page
based on hyperlink structure; two other popular algorithms, not covered here, are SALSA [88] and
HITS [84] (also known as hubs and authorities).
The complete formulation of PageRank includes an additional component. As it turns out,
our web surfer doesn't just randomly click links. Before the surfer decides where to go next, a biased
coin is flipped: heads, the surfer clicks on a random link on the page as usual; tails, however, the
surfer ignores the links on the page and randomly jumps or teleports to a completely different
page.
But enough about random web surfing. Formally, the PageRank P of a page n is defined as
follows:
P(n) = α (1/|G|) + (1 − α) Σ_{m∈L(n)} P(m)/C(m)        (5.1)

where |G| is the total number of nodes (pages) in the graph, α is the random jump factor, L(n) is
the set of pages that link to n, and C(m) is the out-degree of node m (the number of links on page
m). The random jump factor α is sometimes called the teleportation factor; alternatively, (1 − α)
is referred to as the damping factor.
Let us break down each component of the formula in detail. First, note that PageRank is
defined recursively; this gives rise to an iterative algorithm we will detail in a bit. A web page n
receives PageRank contributions from all pages that link to it, L(n). Let us consider a page m from
the set of pages L(n): a random surfer at m will arrive at n with probability 1/C(m) since a link is
selected at random from all outgoing links. Since the PageRank value of m is the probability that the
random surfer will be at m, the probability of arriving at n from m is P (m)/C(m). To compute the
PageRank of n, we need to sum contributions from all pages that link to n. This is the summation
in the second half of the equation. However, we also need to take into account the random jump:
there is a 1/|G| chance of landing at any particular page, where |G| is the number of nodes in the
graph. Of course, the two contributions need to be combined: with probability α the random surfer
executes a random jump, and with probability 1 − α the random surfer follows a hyperlink.
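Before turning to the MapReduce formulation, Equation 5.1 is easy to check on a single machine; the following Python sketch (a reference implementation under the simplifying assumption that no node is dangling) iterates the definition directly:

    def pagerank(graph, alpha, iterations):
        # graph: dict mapping each node to the list of nodes it links to.
        # Assumes every node has at least one outgoing link (no dangling nodes).
        size = len(graph)
        pr = {node: 1.0 / size for node in graph}       # start from a uniform distribution
        for _ in range(iterations):
            incoming = {node: 0.0 for node in graph}
            for m, links in graph.items():
                share = pr[m] / len(links)              # P(m) / C(m)
                for n in links:
                    incoming[n] += share
            # Equation 5.1: random jump plus mass arriving over incoming links.
            pr = {node: alpha / size + (1 - alpha) * incoming[node]
                  for node in graph}
        return pr

With alpha set to zero and the five-node graph of Figure 5.1 (encoded as its adjacency lists) as input, two iterations of this loop reproduce the toy example of Figure 5.7: after the first iteration, for instance, n1 drops to about 0.067 while n4 and n5 rise to 0.3.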
Note that PageRank assumes a community of honest users who are not trying to game the
measure. This is, of course, not true in the real world, where an adversarial relationship exists between
search engine companies and a host of other organizations and individuals (marketers, spammers,
activists, etc.) who are trying to manipulate search results: to promote a cause, product, or service, or
in some cases, to trap and intentionally deceive users (see, for example, [12; 63]). A simple example
is a so-called spider trap, an infinite chain of pages (e.g., generated by CGI) that all link to a single
page (thereby artificially inflating its PageRank). For this reason, PageRank is only one of thousands
of features used in ranking web pages.
The fact that PageRank is recursively defined translates into an iterative algorithm which is
quite similar in basic structure to parallel breadth-first search. We start by presenting an informal
sketch. At the beginning of each iteration, a node passes its PageRank contributions to other nodes
that it is connected to. Since PageRank is a probability distribution, we can think of this as spreading
probability mass to neighbors via outgoing links. To conclude the iteration, each node sums up all
PageRank contributions that have been passed to it and computes an updated PageRank score. We
can think of this as gathering probability mass passed to a node via its incoming links. This algorithm
iterates until the PageRank values no longer change.
Figure 5.7 shows a toy example that illustrates two iterations of the algorithm. As a simplifi-
cation, we ignore the random jump factor for now (i.e., α = 0) and further assume that there are no
dangling nodes (i.e., nodes with no outgoing edges). The algorithm begins by initializing a uniform
distribution of PageRank values across nodes. In the beginning of the first iteration (top, left), partial
PageRank contributions are sent from each node to its neighbors connected via outgoing links. For
example, n1 sends 0.1 PageRank mass to n2 and 0.1 PageRank mass to n4 . This makes sense in
terms of the random surfer model: if the surfer is at n1 with a probability of 0.2, then the surfer could
end up either in n2 or n4 with a probability of 0.1 each. The same occurs for all the other nodes in
the graph: note that n5 must split its PageRank mass three ways, since it has three neighbors, and
n4 receives all the mass belonging to n3 because n3 isn't connected to any other node. The end of
the first iteration is shown in the top right: each node sums up PageRank contributions from its
neighbors. Note that since n1 has only one incoming link, from n5, its updated PageRank value is
smaller than before, i.e., it passed along more PageRank mass than it received. The exact same
process repeats, and the second iteration in our toy example is illustrated by the bottom two graphs.
At the beginning of each iteration, the PageRank values of all nodes sum to one. PageRank mass is
Figure 5.7: PageRank toy example showing two iterations, top and bottom. Left graphs show PageRank
values at the beginning of each iteration and how much PageRank mass is passed to each neighbor. Right
graphs show updated PageRank values at the end of each iteration.
preserved by the algorithm, guaranteeing that we continue to have a valid probability distribution
at the end of each iteration.
Pseudo-code of the MapReduce PageRank algorithm is shown in Figure 5.8; it is simplified
in that we continue to ignore the random jump factor and assume no dangling nodes (complications
that we will return to later). An illustration of the running algorithm is shown in Figure 5.9 for the
first iteration of the toy graph in Figure 5.7. The algorithm maps over the nodes, and for each node
computes how much PageRank mass needs to be distributed to its neighbors (i.e., nodes on the
adjacency list). Each piece of the PageRank mass is emitted as the value, keyed by the node ids of
the neighbors. Conceptually, we can think of this as passing PageRank mass along outgoing edges.
In the shuffle and sort phase, the MapReduce execution framework groups values (pieces of
PageRank mass) passed along the graph edges by destination node (i.e., all edges that point to the
same node). In the reducer, PageRank mass contributions from all incoming edges are summed
to arrive at the updated PageRank value for each node. As with the parallel breadth-first search
algorithm, the graph structure itself must be passed from iteration to iteration. Each node data
structure is emitted in the mapper and written back out to disk in the reducer. All PageRank mass
1: class Mapper
2:   method Map(nid n, node N)
3:     p ← N.PageRank / |N.AdjacencyList|
4:     Emit(nid n, N)                          ▷ Pass along graph structure
5:     for all nodeid m ∈ N.AdjacencyList do
6:       Emit(nid m, p)                        ▷ Pass PageRank mass to neighbors
1: class Reducer
2:   method Reduce(nid m, [p1, p2, . . .])
3:     M ← ∅; s ← 0
4:     for all p ∈ [p1, p2, . . .] do
5:       if IsNode(p) then
6:         M ← p                               ▷ Recover graph structure
7:       else
8:         s ← s + p                           ▷ Sum incoming PageRank contributions
9:     M.PageRank ← s
10:    Emit(nid m, node M)
Figure 5.8: Pseudo-code for PageRank in MapReduce (leaving aside dangling nodes and the random
jump factor). In the map phase we evenly divide up each node's PageRank mass and pass each piece
along outgoing edges to neighbors. In the reduce phase PageRank contributions are summed up at each
destination node. Each MapReduce job corresponds to one iteration of the algorithm.
emitted by the mappers is accounted for in the reducer: since we begin with the sum of PageRank
values across all nodes equal to one, the sum of all the updated PageRank values should remain a
valid probability distribution.
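A Python sketch in the same style as the earlier breadth-first search functions may help make Figure 5.8 concrete; as before, the node wrapper is an illustrative assumption and the grouping of values by key is left to the execution framework:

    class PageRankNode(object):
        def __init__(self, pagerank, adjacency):
            self.pagerank = pagerank
            self.adjacency = adjacency                  # list of neighbor node ids

    def pagerank_map(nid, node):
        yield nid, node                                 # pass along the graph structure
        p = node.pagerank / len(node.adjacency)         # divide mass evenly (no dangling nodes)
        for m in node.adjacency:
            yield m, p                                  # pass PageRank mass to neighbors

    def pagerank_reduce(nid, values):
        node, s = None, 0.0
        for value in values:
            if isinstance(value, PageRankNode):
                node = value                            # recover the graph structure
            else:
                s += value                              # sum incoming PageRank contributions
        node.pagerank = s
        yield nid, node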
Having discussed the simplified PageRank algorithm in MapReduce, let us now take into
account the random jump factor and dangling nodes: as it turns out both are treated similarly.
Dangling nodes are nodes in the graph that have no outgoing edges, i.e., their adjacency lists are
empty. In the hyperlink graph of the web, these might correspond to pages in a crawl that have not
been downloaded yet. If we simply run the algorithm in Figure 5.8 on graphs with dangling nodes,
the total PageRank mass will not be conserved, since no key-value pairs will be emitted when a
dangling node is encountered in the mappers.
The proper treatment of PageRank mass lost at the dangling nodes is to redistribute it across
all nodes in the graph evenly (cf. [22]). There are many ways to determine the missing PageRank
mass. One simple approach is by instrumenting the algorithm in Figure 5.8 with counters: whenever
the mapper processes a node with an empty adjacency list, it keeps track of the node's PageRank
value in the counter. At the end of the iteration, we can access the counter to find out how much
Figure 5.9: Illustration of the MapReduce PageRank algorithm corresponding to the first iteration in
Figure 5.7. The size of each box is proportional to its PageRank value. During the map phase, PageRank
mass is distributed evenly to nodes on each nodes adjacency list (shown at the very top). Intermediate
values are keyed by node (shown inside the boxes). In the reduce phase, all partial PageRank contributions
are summed together to arrive at updated values.
PageRank mass was lost at the dangling nodes.8 Another approach is to reserve a special key for
storing PageRank mass from dangling nodes. When the mapper encounters a dangling node, its
PageRank mass is emitted with the special key; the reducer must be modified to contain special logic
for handling the missing PageRank mass. Yet another approach is to write out the missing PageRank
mass as side data for each map task (using the in-mapper combining technique for aggregation); a
final pass in the driver program is needed to sum the mass across all map tasks. Either way, we arrive
at the amount of PageRank mass lost at the dangling nodes; this then must be redistributed evenly
across all nodes.
This redistribution process can be accomplished by mapping over all nodes again. At the same
time, we can take into account the random jump factor. For each node, its current PageRank value
p is updated to the final PageRank value p′ according to the following formula:

p′ = α (1/|G|) + (1 − α) (m/|G| + p)        (5.2)
where m is the missing PageRank mass, and |G| is the number of nodes in the entire graph. We add
the PageRank mass from link traversal (p, computed from before) to the share of the lost PageRank
mass that is distributed to each node (m/|G|). Finally, we take into account the random jump factor:
8 In Hadoop, counters are 8-byte integers: a simple workaround is to multiply PageRank values by a large constant, and then cast
as an integer.
with probability α the random surfer arrives via jumping, and with probability 1 − α the random
surfer arrives via incoming links. Note that this MapReduce job requires no reducers.
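In code, the second pass is a map-only job over the node data structures; in the sketch below (continuing the hypothetical PageRankNode from above), the lost mass and the graph size would be supplied to every mapper as job configuration or side data:

    def redistribute_map(nid, node, missing_mass, num_nodes, alpha):
        # Map-only job implementing Equation 5.2: fold in the share of PageRank
        # mass lost at dangling nodes and the random jump factor.
        p = node.pagerank
        node.pagerank = (alpha * (1.0 / num_nodes)
                         + (1 - alpha) * (missing_mass / num_nodes + p))
        yield nid, node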
Putting everything together, one iteration of PageRank requires two MapReduce jobs: the
first to distribute PageRank mass along graph edges, and the second to take care of dangling nodes
and the random jump factor. At the end of each iteration, we end up with exactly the same data
structure as the beginning, which is a requirement for the iterative algorithm to work. Also, the
PageRank values of all nodes sum up to one, which ensures a valid probability distribution.
Typically, PageRank is iterated until convergence, i.e., when the PageRank values of nodes
no longer change (within some tolerance, to take into account, for example, floating point precision
errors). Therefore, at the end of each iteration, the PageRank driver program must check to see
if convergence has been reached. Alternative stopping criteria include running a fixed number of
iterations (useful if one wishes to bound algorithm running time) or stopping when the ranks of
PageRank values no longer change. The latter is useful for some applications that only care about
comparing the PageRank of two arbitrary pages and do not need the actual PageRank values. Rank
stability is obtained faster than the actual convergence of values.
In absolute terms, how many iterations are necessary for PageRank to converge? This is a
difficult question to precisely answer since it depends on many factors, but generally, fewer than one
might expect. In the original PageRank paper [117], convergence on a graph with 322 million edges
was reached in 52 iterations (see also Bianchini et al. [22] for additional discussion). On today's
web, the answer is not very meaningful due to the adversarial nature of web search as previously
discussed: the web is full of spam and populated with sites that are actively trying to game
PageRank and related hyperlink-based metrics. As a result, running PageRank in its unmodified
form presented here would yield unexpected and undesirable results. Of course, strategies developed
by web search companies to combat link spam are proprietary (and closely guarded secrets, for
obvious reasons), but undoubtedly these algorithmic modifications impact convergence behavior.
A full discussion of the escalating arms race between search engine companies and those that seek
to promote their sites is beyond the scope of this book.9
9 For the interested reader, the proceedings of a workshop series on Adversarial Information Retrieval (AIRWeb) provide great
starting points into the literature.
5.4 ISSUES WITH GRAPH PROCESSING
nodes it links to, or to a node from nodes linked to it; in other words, passing information is only
possible within the local graph structure.10
This restriction gives rise to the structure of many graph algorithms in MapReduce: local
computation is performed on each node, the results of which are passed to its neighbors. With
multiple iterations, convergence on the global graph is possible. The passing of partial results along
a graph edge is accomplished by the shuffling and sorting provided by the MapReduce execution
framework. The amount of intermediate data generated is on the order of the number of edges,
which explains why all the algorithms we have discussed assume sparse graphs. For dense graphs,
MapReduce running time would be dominated by copying intermediate data across the network,
which in the worst case is O(n²) in the number of nodes in the graph. Since MapReduce clusters
are designed around commodity networks (e.g., gigabit Ethernet), MapReduce algorithms are often
impractical on large, dense graphs.
Combiners and the in-mapper combining pattern described in Section 3.1 can be used to
decrease the running time of graph iterations. It is straightforward to use combiners for both parallel
breadth-first search and PageRank since Min and sum, used in the two algorithms, respectively, are
both associative and commutative. However, combiners are only effective to the extent that there
are opportunities for partial aggregation: unless there are nodes pointed to by multiple nodes being
processed by an individual map task, combiners are not very useful. This implies that it would be
desirable to partition large graphs into smaller components where there are many intra-component
links and fewer inter-component links. This way, we can arrange the data such that nodes in the
same component are handled by the same map task, thus maximizing opportunities for combiners
to perform local aggregation.
Unfortunately, this sometimes creates a chicken-and-egg problem. It would be desirable to
partition a large graph to facilitate efficient processing by MapReduce. But the graph may be so
large that we can't partition it except with MapReduce algorithms! Fortunately, in many cases there
are simple solutions around this problem in the form of cheap partitioning heuristics based on
reordering the data [106]. For example, in a social network, we might sort nodes representing users
by zip code, as opposed to by last name, based on the observation that friends tend to live close
to each other. Sorting by an even more cohesive property such as school would be even better (if
available): the probability of any two random students from the same school knowing each other is
much higher than two random students from different schools. Another good example is to partition
the web graph by the language of the page (since pages in one language tend to link mostly to other
pages in that language) or by domain name (since intra-domain links are typically much denser than
inter-domain links). Resorting records using MapReduce is both easy to do and a relatively cheap
operation; however, whether the efficiencies gained by this crude form of partitioning are worth
the extra time taken in performing the resort is an empirical question that will depend on the actual
graph structure and algorithm.
10 Of course, it is perfectly reasonable to compute derived graph structures in a pre-processing step. For example, if one wishes to
propagate information from a node to all nodes that are within two links, one could process graph G to derive graph G′, where
there would exist a link from node ni to nj if nj was reachable within two link traversals of ni in the original graph G.
Finally, there is a practical consideration to keep in mind when implementing graph algorithms
that estimate probability distributions over nodes (such as PageRank). For large graphs, the probabil-
ity of any particular node is often so small that it underflows standard floating point representations.
A very common solution to this problem is to represent probabilities using their logarithms. When
probabilities are stored as logs, the product of two values is simply their sum. However, addition of
probabilities is also necessary, for example, when summing PageRank contributions for a node. This
can be implemented with reasonable precision as follows:
a ⊕ b = b + log(1 + e^(a−b))   if a < b
        a + log(1 + e^(b−a))   if a ≥ b
Furthermore, many math libraries include a log1p function which computes log(1 + x) with higher
precision than the naïve implementation would have when x is very small (as is often the case when
working with probabilities). Its use may further improve the accuracy of implementations that use
log probabilities.
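A direct Python rendering of this log-space addition, using log1p from the standard library as just described:

    import math

    def log_add(a, b):
        # Given log-probabilities a and b, return log(exp(a) + exp(b)) without
        # leaving log space; log1p keeps precision when exp(b - a) is tiny.
        if a < b:
            a, b = b, a                 # ensure a >= b so the exponent is non-positive
        return a + math.log1p(math.exp(b - a))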
5.5 SUMMARY AND ADDITIONAL READINGS
Algorithms map over nodes and pass partial results to nodes on their adjacency lists. Partial
results are aggregated for each node in the reducer.
The graph structure itself is passed from the mapper to the reducer, such that the output is in
the same form as the input.
Algorithms are iterative and under the control of a non-MapReduce driver program, which
checks for termination at the end of each iteration.
The MapReduce programming model does not provide a mechanism to maintain global data struc-
tures accessible and mutable by all the mappers and reducers.11 One implication of this is that
communication between pairs of arbitrary nodes is difficult to accomplish. Instead, information
typically propagates along graph edges, which gives rise to the structure of algorithms discussed
above.
Additional Readings. The ubiquity of large graphs translates into substantial interest in scalable
graph algorithms using MapReduce in industry, academia, and beyond. There is, of course, much
beyond what has been covered in this chapter. For additional material, we refer readers to the
11 However, maintaining globally-synchronized state may be possible with the assistance of other tools (e.g., a distributed database).
following: Kang et al. [80] presented an approach to estimating the diameter of large graphs using
MapReduce and a library for graph mining [81]; Cohen [39] discussed a number of algorithms
for processing undirected graphs, with social network analysis in mind; Rao and Yarowsky [128]
described an implementation of label propagation, a standard algorithm for semi-supervised machine
learning, on graphs derived from textual data; Schatz [132] tackled the problem of DNA sequence
alignment and assembly with graph algorithms in MapReduce. Finally, it is easy to forget that
parallel graph algorithms have been studied by computer scientists for several decades, particularly in
the PRAM model [60; 77]. It is not clear, however, to what extent well-known PRAM algorithms
translate naturally into the MapReduce framework.
CHAPTER 6
EM Algorithms for Text Processing
In a conditional (or direct) model, this is a straightforward search for the best y under the model. In a
joint model, the search is also straightforward, on account of the definition of conditional probability:
y∗ = arg max_{y∈Y} Pr(y|x) = arg max_{y∈Y} Pr(x, y) / Σ_{y′} Pr(x, y′) = arg max_{y∈Y} Pr(x, y)
1 In this chapter, we will consider discrete models only. They tend to be sufficient for text processing, and their presentation is
simpler than models with continuous densities. It should be kept in mind that the sets X and Y may still be countably infinite.
2 We restrict our discussion in this chapter to models with finite numbers of parameters and where the learning process refers to
setting those parameters. Inference in and learning of so-called nonparametric models, which have an infinite number of parameters
and have become important statistical models for text processing in recent years, is beyond the scope of this chapter.
The specific form that the search takes will depend on how the model is represented. Our focus in
this chapter will primarily be on the second problem: learning parameters for models, but we will
touch on the third problem as well.
Machine learning is often categorized as either supervised or unsupervised. Supervised learning
of statistical models simply means that the model parameters are estimated from training data
consisting of pairs of inputs and annotations, that is, Z = ⟨⟨x1, y1⟩, ⟨x2, y2⟩, . . .⟩ where ⟨xi, yi⟩ ∈
X × Y and yi is the gold standard (i.e., correct) annotation of xi. While supervised models often
attain quite good performance, they are often uneconomical to use, since the training data requires
each object that is to be classified (to pick a specific task), xi to be paired with its correct label, yi . In
many cases, these gold standard training labels must be generated by a process of expert annotation,
meaning that each xi must be manually labeled by a trained individual. Even when the annotation
task is quite simple for people to carry out (e.g., in the case of spam detection), the number of potential
examples that could be classified (representing a subset of X , which may of course be infinite in
size) will far exceed the amount of data that can be annotated. As the annotation task becomes more
complicated (e.g., when predicting more complex structures such as sequences of labels or when the
annotation task requires specialized expertise), annotation becomes far more challenging.
Unsupervised learning, on the other hand, requires only that the training data consist of a
representative collection of objects that should be annotated, that is, Z = ⟨x1, x2, . . .⟩ where xi ∈ X,
but without any example annotations. While it may at first seem counterintuitive that meaningful
annotations can be learned without any examples of the desired annotations being given, the learning
criteria and model structure (which crucially define the space of possible annotations Y and the
process by which annotations relate to observable inputs) make it possible to induce annotations
by relying on regularities in the unclassified training instances. While a thorough discussion of
unsupervised learning is beyond the scope of this book, we focus on a particular class of algorithms,
expectation maximization (EM) algorithms, that can be used to learn the parameters of a joint
model Pr(x, y) from incomplete data (i.e., data where some of the variables in the model cannot be
observed; in the case of unsupervised learning, the yi's are unobserved). Expectation maximization
algorithms fit naturally into the MapReduce paradigm, and are used to solve a number of problems
of interest in text processing. Furthermore, these algorithms can be quite computationally expensive,
since they generally require repeated evaluations of the training data. MapReduce therefore provides
an opportunity not only to scale to larger amounts of data, but also to improve efficiency bottlenecks
at scales where non-parallel solutions could be utilized.
This chapter is organized as follows. In Section 6.1, we describe maximum likelihood esti-
mation for statistical models, show how this is generalized to models where not all variables are
observable, and then introduce expectation maximization (EM). We describe hidden Markov mod-
els (HMMs) in Section 6.2, a very versatile class of models that uses EM for parameter estimation.
Section 6.3 discusses how EM algorithms can be expressed in MapReduce, and then in Section 6.4
we look at a case study of word alignment for statistical machine translation. Section 6.5 examines
similar algorithms that are appropriate for supervised learning tasks. This chapter concludes with a
summary and pointers to additional readings.
To illustrate, consider the simple marble game shown in Figure 6.1. In this game, a marble is
released at the position indicated by the black dot, and it bounces down into one of the cups at the
bottom of the board, being diverted to the left or right by the peg (indicated by a triangle) in the
center. Our task is to construct a model that predicts which cup the ball will drop into. A rule-based
approach might be to take exact measurements of the board and construct a physical model that we
can use to predict the behavior of the ball. Given sophisticated enough measurements, this could
certainly lead to a very accurate model. However, the construction of this model would be quite time
consuming and difficult.
A statistical approach, on the other hand, might be to assume that the behavior of the marble
in this game can be modeled using a Bernoulli random variable Y with parameter p. That is, the
value of the random variable indicates whether path 0 or 1 is taken. We also define a random variable
X whose value is the label of the cup that the marble ends up in; note that X is deterministically
related to Y , so an observation of X is equivalent to an observation of Y .
To estimate the parameter p of the statistical model of our game, we need some training data,
so we drop 10 marbles into the game which end up in cups x = ⟨b, b, b, a, b, b, b, b, b, a⟩.
Figure 6.1: A simple marble game where a released marble takes one of two possible paths. This game
can be modeled using a Bernoulli random variable with parameter p, which indicates the probability that
the marble will go to the right when it hits the peg.
What is the maximum likelihood estimate of p given this data? By assuming that our samples
are independent and identically distributed (i.i.d.), we can write the likelihood of our data as follows:3
Pr(x; p) = ∏_{j=1}^{10} p^δ(xj,a) · (1 − p)^δ(xj,b) = p^2 · (1 − p)^8
Since log is a monotonically increasing function, maximizing log Pr(x; p) will give us the desired
result. We can do this by differentiating with respect to p and finding where the resulting expression
equals 0:
d log Pr(x; p) / dp = 0
d[2 log p + 8 log(1 − p)] / dp = 0
2/p − 8/(1 − p) = 0
Solving for p yields 0.2, which is the intuitive result. Furthermore, it is straightforward to show that
in N trials where N0 marbles followed path 0 to cup a, and N1 marbles followed path 1 to cup b,
the maximum likelihood estimate of p is N1 /(N0 + N1 ).
While this model only makes use of an approximation of the true physical process at work
when the marble interacts with the game board, it is an empirical question whether the model works
well enough in practice to be useful. Additionally, while a Bernoulli trial is an extreme approxi-
mation of the physical process, if insufficient resources were invested in building a physical model,
3 In this equation, δ is the Kronecker delta function which evaluates to 1 where its arguments are equal and 0 otherwise.
Figure 6.2: A more complicated marble game where the released marble takes one of four possible paths.
We assume that we can only observe which cup the marble ends up in, not the specific path taken.
the approximation may perform better than the more complicated rule-based model. This sort
of dynamic is found often in text processing problems: given enough data, astonishingly simple
models can outperform complex knowledge-intensive models that attempt to simulate complicated
processes.
p0 = (N2 + N3)/N        p1 = N1/(N0 + N1)        p2 = N3/(N2 + N3)
However, we wish to consider the case where the paths taken are unobservable (imagine an opaque
sheet covering the center of the game board), but where we can see what cup a marble ends in. In
other words, we want to consider the case where we have partial data. This is exactly the problem
encountered in unsupervised learning: there is a statistical model describing the relationship between
two sets of variables (Xs and Y s), and there is data available from just one of them. Furthermore, such
algorithms are quite useful in text processing, where latent variables may describe latent linguistic
structures of the observed variables, such as parse trees or part-of-speech tags, or alignment structures
relating sets of observed variables (see Section 6.4).
Pr(X = x) = Σ_{y∈Y} Pr(X = x, Y = y; θ)

Pr(x; θ) = ∏_{j=1}^{|x|} Σ_{y∈Y} Pr(X = xj, Y = y; θ)
Thus, the maximum (marginal) likelihood estimate of the model parameters given a vector of
i.i.d. observations x becomes:
θ∗ = arg max_θ ∏_{j=1}^{|x|} Σ_{y∈Y} Pr(X = xj, Y = y; θ)
Unfortunately, in many cases, this maximum cannot be computed analytically, but the iterative
hill-climbing approach of expectation maximization can be used instead.
4 For this description, we assume that the variables in our model take on discrete values. Not only does this simplify exposition, but
discrete models are widely used in text processing.
6.1.4 EXPECTATION MAXIMIZATION
Expectation maximization (EM) is an iterative algorithm that finds a successive series of parameter
estimates θ^(0), θ^(1), . . . that improve the marginal likelihood of the training data. That is, EM
guarantees:
∏_{j=1}^{|x|} Σ_{y∈Y} Pr(X = xj, Y = y; θ^(i+1))  ≥  ∏_{j=1}^{|x|} Σ_{y∈Y} Pr(X = xj, Y = y; θ^(i))
The algorithm starts with some initial set of parameters θ^(0) and then updates them using two
steps: expectation (E-step), which computes the posterior distribution over the latent variables given
the observable data x and a set of parameters θ^(i),5 and maximization (M-step), which computes
new parameters θ^(i+1) maximizing the expected log likelihood of the joint distribution with respect
to the distribution computed in the E-step. The process then repeats with these new parameters.
The algorithm terminates when the likelihood remains unchanged.6 In more detail, the steps are as
follows:
E-step. Compute the posterior probability of each possible hidden variable assignment y ∈ Y
for each x ∈ X and the current parameter settings, weighted by the relative frequency with which x
occurs in x. Call this q(X = x, Y = y; θ^(i)) and note that it defines a joint probability distribution
over X × Y in that Σ_{(x,y)∈X×Y} q(x, y) = 1.

q(x, y; θ^(i)) = f(x|x) · Pr(Y = y | X = x; θ^(i)) = f(x|x) · Pr(x, y; θ^(i)) / Σ_{y′} Pr(x, y′; θ^(i))
M-step. Compute new parameter settings that maximize the expected log of the probability of
the joint distribution under the q-distribution that was computed in the E-step:
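Written out, the standard form of the M-step update (consistent with the description above) is:

    θ^(i+1) = arg max_θ Σ_{(x,y)∈X×Y} q(X = x, Y = y; θ^(i)) · log Pr(X = x, Y = y; θ)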
We omit the proof that the model with parameters θ^(i+1) will have equal or greater marginal
likelihood on the training data than the model with parameters θ^(i), but this is provably true [78].
Before continuing, we note that the effective application of expectation maximization re-
quires that both the E-step and the M-step consist of tractable computations. Specifically, summing
5 The term expectation is used since the values computed in terms of the posterior distribution Pr(y|x; θ^(i)) that are required to
solve the M-step have the form of an expectation (with respect to this distribution).
6The final solution is only guaranteed to be a local maximum, but if the model is fully convex, it will also be the global maximum.
over the space of hidden variable assignments must not be intractable. Depending on the indepen-
dence assumptions made in the model, this may be achieved through techniques such as dynamic
programming. However, some models may require intractable computations.
6.1.5 AN EM EXAMPLE
Let's look at how to estimate the parameters from our latent variable marble game from Section 6.1.2
using EM. We assume training data x consisting of N = |x| observations of X with Na , Nb , and
Nc indicating the number of marbles ending in cups a, b, and c. We start with some parameters
θ^(0) = ⟨p0^(0), p1^(0), p2^(0)⟩ that have been randomly initialized to values between 0 and 1.
E-step. We need to compute the distribution q(X = x, Y = y; θ^(i)), as defined above. We first
note that the relative frequency f(x|x) is:

f(x|x) = Nx / N
Next, we observe that Pr(Y = 0|X = a) = 1 and Pr(Y = 3|X = c) = 1 since cups a and c fully
determine the value of the path variable Y. The posterior probabilities of paths 1 and 2 are only
non-zero when X is b:
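Spelling these out as a sketch, using the joint path probabilities listed in the M-step table below:

    Pr(1|b; θ^(i)) = (1 − p0^(i)) p1^(i) / [ (1 − p0^(i)) p1^(i) + p0^(i) (1 − p2^(i)) ]
    Pr(2|b; θ^(i)) = 1 − Pr(1|b; θ^(i))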
M-step. We now need to maximize the expectation of log Pr(X, Y; θ) (which will be a function
in terms of the three parameter variables) under the q-distribution we computed in the E step. The
non-zero terms in the expectation are as follows:
x    y    q(X = x, Y = y; θ^(i))            log Pr(X = x, Y = y; θ)
a    0    Na/N                              log(1 − p0) + log(1 − p1)
b    1    Nb/N · Pr(1|b; θ^(i))             log(1 − p0) + log p1
b    2    Nb/N · Pr(2|b; θ^(i))             log p0 + log(1 − p2)
c    3    Nc/N                              log p0 + log p2
Multiplying across each row and adding from top to bottom yields the expectation we wish to
maximize. Each parameter can be optimized independently using differentiation. The resulting
optimal values are expressed in terms of the counts in x and θ^(i):
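Carrying the differentiation through (each parameter appears in exactly two rows of the table above; this is a worked sketch of the result):

    p0 = (Nb · Pr(2|b; θ^(i)) + Nc) / N
    p1 = Nb · Pr(1|b; θ^(i)) / (Na + Nb · Pr(1|b; θ^(i)))
    p2 = Nc / (Nb · Pr(2|b; θ^(i)) + Nc)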
7This is only one possible definition of an HMM, but it is one that is useful for many text processing problems. In alternative
definitions, initial and final states may be handled differently, observations may be emitted during the transition between states,
or continuous-valued observations may be emitted (for example, from a Gaussian distribution).
are useful. We further stipulate that Aq(r) ≥ 0, Bq(o) ≥ 0, and πq ≥ 0 for all q, r, and o, as well as
that:

Σ_{r∈S} Aq(r) = 1  ∀q        Σ_{o∈O} Bq(o) = 1  ∀q        Σ_{q∈S} πq = 1
Since all events generated by this process are conditionally independent, the joint probability of this
sequence of observations and the state sequence used to generate it is the product of the individual
event probabilities.
Figure 6.3 shows a simple example of a hidden Markov model for part-of-speech tagging,
which is the task of assigning to each word in an input sentence its grammatical category (one of
the first steps in analyzing textual content). States S = {det, adj, nn, v } correspond to the parts
of speech (determiner, adjective, noun, and verb), and observations O = {the, a, green, . . .} are a
subset of English words. This example illustrates a key intuition behind many applications of HMMs:
states correspond to equivalence classes or clustering of observations, and a single observation type
may be associated with several clusters (in this example, the word wash can be generated by an nn or
v, since wash can either be a noun or a verb).
2. Given a model M = ⟨S, O, θ⟩ and an observation sequence x, what is the most likely sequence
of states that generated the data?
8 The organization of this section is based in part on ideas from Lawrence Rabiner's HMM tutorial [125].
Initial probabilities:
        DET    ADJ    NN     V
        0.5    0.1    0.3    0.1

Transition probabilities:
        DET    ADJ    NN     V
DET     0      0      0      0.5
ADJ     0.3    0.2    0.1    0.2
NN      0.7    0.7    0.4    0.2
V       0      0.1    0.5    0.1

Emission probabilities:
        DET    ADJ    NN     V
        (per-state word probabilities; not reproduced here)

Example outputs:
        John   might  wash
        NN     V      V

Figure 6.3: An example HMM that relates part-of-speech tags to vocabulary items in an English-like
language. Possible (probability > 0) transitions for the Markov process are shown graphically. In the
example outputs, the state sequences corresponding to the emissions are written beneath the emitted
symbols.
3. Given a set of states S, an observation vocabulary O, and a series of i.i.d. observation
sequences ⟨x1, x2, . . . , xℓ⟩, what are the parameters θ = ⟨A, B, π⟩ that maximize the likelihood
of the training data?

θ∗ = arg max_θ ∏_{i=1}^{ℓ} Σ_{y∈Y} Pr(xi, y; θ)
Using our definition of an HMM, the answers to the first two questions are in principle quite
trivial to compute: by iterating over all state sequences in Y, the probability that each generated x can
be computed by looking up and multiplying the relevant probabilities in A, B, and π, and then
summing the result or taking the maximum. And, as we hinted at in the previous section, the third
question can be answered using EM. Unfortunately, even with all the distributed computing power
MapReduce makes available, we will quickly run into trouble if we try to use this naïve strategy, since
there are |S|^|x| distinct state sequences of length |x|, making exhaustive enumeration computationally
intractable. Fortunately, because the underlying model behaves exactly the same whenever it is in
some state, regardless of how it got to that state, we can use dynamic programming algorithms to
answer all of the above questions without summing over exponentially many sequences.
α_1(q) = π_q · B_q(x_1)

From this, it's not hard to see that the values of α_2(r) for every r can be computed in terms of the
|S| values in α_1(·) and the observation x_2:
α_2(r) = B_r(x_2) · Σ_{q∈S} α_1(q) · A_q(r)
This works because there are |S | different ways to get to state r at time t = 2: starting from state
1, 2, . . . , |S | and transitioning to state r. Furthermore, because the behavior of a Markov process is
determined only by the state it is in at some time (not by how it got to that state), α_t(r) can always
be computed in terms of the |S| values in α_{t−1}(·) and the observation x_t:
α_t(r) = B_r(x_t) · Σ_{q∈S} α_{t−1}(q) · A_q(r)
We have now shown how to compute the probability of being in any state q at any time t, having
generated ⟨x_1, x_2, . . . , x_t⟩, with the forward algorithm. The probability of the full sequence is the
probability of being at time |x| in any state, so the answer to Question 1 can be computed simply
by summing over the α values at time |x| for all states:

Pr(x; θ) = Σ_{q∈S} α_{|x|}(q)
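To make the recurrence concrete, here is a minimal Python sketch of the forward algorithm as just described. The dictionaries pi, A, and B (initial, transition, and emission probabilities keyed by state) are hypothetical stand-ins for the model parameters θ = ⟨A, B, π⟩; the function names are ours.

```python
def forward(x, states, pi, A, B):
    """Forward algorithm: alpha[t][q] = probability of generating x[0..t]
    (0-based indexing) and being in state q at that time.

    pi[q]   : initial probability of state q
    A[q][r] : probability of transitioning from state q to state r
    B[q][o] : probability of emitting symbol o from state q
    """
    alpha = [{q: pi[q] * B[q].get(x[0], 0.0) for q in states}]       # base case
    for t in range(1, len(x)):
        alpha.append({
            r: B[r].get(x[t], 0.0) * sum(alpha[t - 1][q] * A[q][r] for q in states)
            for r in states
        })
    return alpha


def sequence_probability(x, states, pi, A, B):
    """Answer to Question 1: Pr(x; theta), the sum over the final alpha column."""
    return sum(forward(x, states, pi, A, B)[-1].values())
```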
In summary, there are two ways of computing the probability that a sequence of observations x
was generated by M: exhaustive enumeration with summing and the forward algorithm. Figure 6.4
illustrates the two possibilities. The upper panel shows the naïve exhaustive approach, enumerating
all 4^3 possible labels y of x and computing their joint probability Pr(x, y). Summing over all y,
the marginal probability of x is found to be 0.00018. The lower panel shows the forward trellis,
consisting of 4 × 3 cells. Summing over the final column also yields 0.00018, the same result.
Upper panel (exhaustive enumeration over all 64 labelings of ⟨John, might, wash⟩): only NN ADJ NN
(0.000021), NN ADJ V (0.000009), NN V NN (0.00006), and NN V V (0.00009) have non-zero
probability; every other label sequence has probability 0.

Pr(x) = Σ_{y∈Y} Pr(x, y; θ) = 0.00018

Lower panel (forward trellis over time steps 1, 2, 3):

Pr(x) = Σ_{q∈S} α_3(q) = 0.00018
Figure 6.4: Computing the probability of the sequence ⟨John, might, wash⟩ under the HMM given in
Figure 6.3 by explicitly summing over all possible sequence labels (upper panel) and using the forward
algorithm (lower panel).
of states, we define bp_t(q), the backpointer, to be the state used in this sequence at time t − 1. The
base case for the recursion is as follows (the state index of −1 is used as a placeholder since there is
no previous best state at time t = 1):

γ_1(q) = π_q · B_q(x_1)
bp_1(q) = −1

The recursion is similar to that of the forward algorithm, except rather than summing over previous
states, the maximum value over all possible trajectories into state r at time t is computed. Note that
the backpointer simply records the index of the originating state; a separate computation is not
necessary.

γ_t(r) = B_r(x_t) · max_{q∈S} γ_{t−1}(q) · A_q(r)
bp_t(r) = arg max_{q∈S} γ_{t−1}(q) · A_q(r)
To compute the best sequence of states, y*, the state with the highest probability path at time |x| is
selected, and then the backpointers are followed, recursively, to construct the rest of the sequence:

y*_{|x|} = arg max_{q∈S} γ_{|x|}(q)
y*_{t−1} = bp_t(y*_t)
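The following is a minimal Python sketch of the Viterbi recursion and backtrace described above, using the same hypothetical dictionary-based model representation as the forward-algorithm sketch earlier; it is an illustration, not the book's reference implementation.

```python
def viterbi(x, states, pi, A, B):
    """Return the most likely state sequence that generated observations x."""
    # Base case: gamma_1(q) = pi_q * B_q(x_1); no previous state to point back to.
    gamma = [{q: pi[q] * B[q].get(x[0], 0.0) for q in states}]
    bp = [{q: None for q in states}]
    for t in range(1, len(x)):
        gamma.append({})
        bp.append({})
        for r in states:
            # Maximize (rather than sum) over all trajectories into state r at time t.
            best_q = max(states, key=lambda q: gamma[t - 1][q] * A[q][r])
            gamma[t][r] = B[r].get(x[t], 0.0) * gamma[t - 1][best_q] * A[best_q][r]
            bp[t][r] = best_q                 # backpointer records the originating state
    # Backtrace: best final state, then follow backpointers to the first column.
    y = [max(states, key=lambda q: gamma[-1][q])]
    for t in range(len(x) - 1, 0, -1):
        y.append(bp[t][y[-1]])
    return list(reversed(y))
```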
Figure 6.5 illustrates a Viterbi trellis, including backpointers that have been used to compute the
most likely state sequence.
Figure 6.5: Computing the most likely state sequence that generated ⟨John, might, wash⟩ under the
HMM given in Figure 6.3 using the Viterbi algorithm. The most likely state sequence is highlighted in
bold and could be recovered programmatically by following backpointers from the maximal probability
cell in the last column to the first column (thicker arrows).
Figure 6.6: A fully observable HMM training instance. The output sequence is at the top of the figure,
and the corresponding states and transitions are shown in the trellis below.
case, such as when we have a corpus of sentences in which all words have already been tagged with
their parts of speech, the maximum likelihood estimate for the parameters can be computed in terms
of the counts of the number of times the process transitions from state q to state r in all training
instances, T(q → r); the number of times that state q emits symbol o, O(q → o); and the number
of times the process starts in state q, I(q). In this example, the process starts in state nn; there is
one nn → v transition and one v → v transition. The nn state emits John in the first time step,
and the v state emits might and wash in the second and third time steps, respectively. We also define
N (q) to be the number of times the process enters state q. The maximum likelihood estimates of
the parameters in the fully observable case are:
π_q = I(q) / Σ_{r∈S} I(r)
A_q(r) = T(q → r) / N(q) = T(q → r) / Σ_{r′∈S} T(q → r′)
B_q(o) = O(q → o) / N(q) = O(q → o) / Σ_{o′∈O} O(q → o′)        (6.2)
For example, to compute the emission parameters from state nn, we simply need to keep track of
the number of times the process is in state nn and what symbol it generated at each of these times.
Transition probabilities are computed similarly: to compute, for example, the distribution A_det(·),
that is, the probabilities of transitioning away from state det, we count the number of times the
process is in state det, and keep track of what state the process transitioned into at the next time
step. This counting and normalizing can be accomplished using the exact same counting and relative
frequency algorithms that we described in Section 3.3. Thus, in the fully observable case, parameter
estimation is not a new algorithm at all, but one we have seen before.
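As a small illustration, here is a sketch of this fully supervised estimation in Python, mirroring Equation 6.2 with plain count dictionaries; the function and variable names are our own.

```python
from collections import defaultdict

def train_supervised(tagged_sequences):
    """MLE for an HMM from fully observed training instances.

    tagged_sequences: iterable of sequences [(x_1, y_1), (x_2, y_2), ...],
    where x_t is an observation and y_t the state that emitted it.
    Returns (pi, A, B) as nested dictionaries of relative frequencies.
    """
    I = defaultdict(float)                        # initial state counts I(q)
    T = defaultdict(lambda: defaultdict(float))   # transition counts T(q -> r)
    O = defaultdict(lambda: defaultdict(float))   # emission counts O(q -> o)
    for seq in tagged_sequences:
        I[seq[0][1]] += 1
        for x, y in seq:
            O[y][x] += 1
        for (_, q), (_, r) in zip(seq, seq[1:]):
            T[q][r] += 1
    normalize = lambda c: {k: v / sum(c.values()) for k, v in c.items()}
    return normalize(I), {q: normalize(T[q]) for q in T}, {q: normalize(O[q]) for q in O}
```

For the single training instance in Figure 6.6, train_supervised([[("John", "nn"), ("might", "v"), ("wash", "v")]]) recovers π_nn = 1, A_nn(v) = 1, A_v(v) = 1, and B_v(might) = B_v(wash) = 0.5.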
How should the model parameters be estimated when the state sequence is not provided? It
turns out that the update equations have the satisfying form where the optimal parameter values
for iteration i + 1 are expressed in terms of the expectations of the counts referenced in the fully
observed case, according to the posterior distribution over the latent variables given the observations
x and the parameters θ^(i):

E[I(q)] = Pr(y_1 = q | x; θ)                                        (6.3)
E[O(q → o)] = Σ_{i=1}^{|x|} Pr(y_i = q | x; θ) · δ(x_i, o)          (6.4)
E[T(q → r)] = Σ_{i=1}^{|x|−1} Pr(y_i = q, y_{i+1} = r | x; θ)       (6.5)

where δ(x_i, o) is 1 if x_i = o and 0 otherwise.
Posterior probabilities. The expectations necessary for computing the M-step in HMM training
are sums of probabilities that a particular transition is taken, given an observation sequence, and
that some state emits some observation symbol, given an observation sequence. These are referred to
as posterior probabilities, indicating that they are the probability of some event whose distribution
we have a prior belief about, after additional evidence has been taken into consideration (here, the
model parameters characterize our prior beliefs, and the observation sequence is the evidence).
Both posterior probabilities can be computed by combining the forward probabilities, α_t(·), which
give the probability of reaching some state at time t, by any path, and generating the observations
⟨x_1, x_2, . . . , x_t⟩, with backward probabilities, β_t(·), which give the probability of starting in some
state at time t and generating the rest of the sequence ⟨x_{t+1}, x_{t+2}, . . . , x_{|x|}⟩, using any sequence of
states to do so. The algorithm for computing the backward probabilities is given a bit later. Once the
forward and backward probabilities have been computed, the state transition posterior probabilities
and the emission posterior probabilities can be written as follows:

Pr(y_i = q | x; θ) = α_i(q) · β_i(q) / Pr(x; θ)                                    (6.6)
Pr(y_i = q, y_{i+1} = r | x; θ) = α_i(q) · A_q(r) · B_r(x_{i+1}) · β_{i+1}(r) / Pr(x; θ)   (6.7)
Equation 6.6 is the probability of being in state q at time i, given x, and the correctness of the
expression should be clear from the definitions of forward and backward probabilities. The intuition
for Equation 6.7, the probability of taking a particular transition at a particular time, is also not
complicated. It is the product of four conditionally independent probabilities: the probability of
Figure 6.7: Using forward and backward probabilities to compute the posterior probability of the dashed
transition, given the observation sequence a b b c b. The shaded area on the left corresponds to the
forward probability α_2(s_2), and the shaded area on the right corresponds to the backward probability
β_3(s_2).
getting to state q at time i (having generated the first part of the sequence), the probability of taking
transition q → r (which is specified in the parameters, θ), the probability of generating observation
x_{i+1} from state r (also specified in θ), and the probability of generating the rest of the sequence,
along any path. A visualization of the quantities used in computing this probability is shown in
Figure 6.7. In this illustration, we assume an HMM with S = {s_1, s_2, s_3} and O = {a, b, c}.
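As an illustration of how these posteriors are assembled once the forward and backward values are available, the sketch below takes alpha and beta in the same list-of-dictionaries form as the other sketches in this section (the backward sketch appears at the end of Section 6.2.4); indices are 0-based, and the helper names are ours.

```python
def state_posterior(t, q, alpha, beta):
    """Pr(y_t = q | x; theta): posterior probability of being in state q at time t."""
    total = sum(alpha[-1].values())               # Pr(x; theta)
    return alpha[t][q] * beta[t][q] / total


def transition_posterior(t, q, r, x, alpha, beta, A, B):
    """Pr(y_t = q, y_{t+1} = r | x; theta): the product of the four factors described
    in the text, normalized by the probability of the whole sequence."""
    total = sum(alpha[-1].values())
    return alpha[t][q] * A[q][r] * B[r].get(x[t + 1], 0.0) * beta[t + 1][r] / total
```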
The backward algorithm. Like the forward and Viterbi algorithms introduced above to answer
Questions 1 and 2, the backward algorithm uses dynamic programming to incrementally compute
β_t(·). Its base case starts at time |x|, and is defined as follows:
β_{|x|}(q) = 1
To understand the intuition of the base case, keep in mind that since the backward probabilities
β_t(·) are the probability of generating the remainder of the sequence after time t (as well as being
in some state), and since there is nothing left to generate after time |x|, the probability must be 1.
The recursion is defined as follows:
β_t(q) = Σ_{r∈S} β_{t+1}(r) · A_q(r) · B_r(x_{t+1})
Unlike the forward and Viterbi algorithms, the backward algorithm is computed from right to left
and makes no reference to the start probabilities, π.
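A minimal Python sketch of the backward recursion, in the same hypothetical representation as the earlier forward and Viterbi sketches:

```python
def backward(x, states, A, B):
    """Backward algorithm: beta[t][q] = probability of generating x[t+1..] given
    that the process is in state q at time t (0-based indexing)."""
    n = len(x)
    beta = [dict() for _ in range(n)]
    beta[n - 1] = {q: 1.0 for q in states}        # base case: nothing left to generate
    for t in range(n - 2, -1, -1):
        beta[t] = {
            q: sum(beta[t + 1][r] * A[q][r] * B[r].get(x[t + 1], 0.0) for r in states)
            for q in states
        }
    return beta
```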
6.3 EM IN MAPREDUCE
Expectation maximization algorithms fit quite naturally into the MapReduce programming model.
Although the model being optimized determines the details of the required computations, Map-
Reduce implementations of EM algorithms share a number of characteristics:
• Each iteration of EM is one MapReduce job.
• A controlling process (i.e., driver program) spawns the MapReduce jobs and keeps track of
  the number of iterations and convergence criteria.
• Model parameters θ^(i), which are static for the duration of the MapReduce job, are loaded by
  each mapper from HDFS or other data provider (e.g., a distributed key-value store).
• Mappers map over independent training instances, computing partial latent variable posteriors
  (or summary statistics, such as expected counts).
• Reducers sum together the required training statistics and solve one or more of the M-step
  optimization problems.
• Combiners, which sum together the training statistics, are often quite effective at reducing the
  amount of data that must be shuffled across the network.
The degree of parallelization that can be attained depends on the statistical independence assumed
in the model and in the derived quantities required to solve the optimization problems in the M-step.
Since parameters are estimated from a collection of samples that are assumed to be i.i.d., the E-step
can generally be parallelized effectively since every training instance can be processed independently
of the others. In the limit, in fact, each independent training instance could be processed by a separate
mapper!10
Reducers, however, must aggregate the statistics necessary to solve the optimization problems
as required by the model. The degree to which these may be solved independently depends on the
structure of the model, and this constrains the number of reducers that may be used. Fortunately,
many common models (such as HMMs) require solving several independent optimization problems
in the M-step. In this situation, a number of reducers may be run in parallel. Still, it is possible that in
the worst case, the M-step optimization problem will not decompose into independent subproblems,
making it necessary to use a single reducer.
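As a rough illustration of this control flow, a driver program might look like the following sketch. The callable run_iteration is a hypothetical placeholder for whatever actually submits the MapReduce job and reads back the reducers' output; it is not part of Hadoop or of the pseudo-code in Figures 6.8 and 6.9.

```python
def em_driver(initial_params, run_iteration, max_iterations=20, tolerance=1e-4):
    """Hypothetical controlling process: one MapReduce job per EM iteration.

    run_iteration(params, i) is assumed to launch the i-th job (mappers compute
    expected counts, reducers solve the M-step) and return (new_params, log_likelihood).
    """
    params, prev_ll = initial_params, float("-inf")
    for i in range(max_iterations):
        params, ll = run_iteration(params, i)
        if abs(ll - prev_ll) < tolerance:         # convergence criterion
            break
        prev_ll = ll
    return params
```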
HMM training mapper. The pseudo-code for the HMM training mapper is given in Figure 6.8.
The input consists of key-value pairs with a unique id as the key and a training instance (e.g., a
sentence) as the value. For each training instance, 2n + 1 stripes are emitted with unique keys,
and every training instance emits the same set of keys. Each unique key corresponds to one of the
independent optimization problems that will be solved in the M-step. The outputs are:
1. the probabilities that the unobserved Markov process begins in each state q, with a unique key
designating that the values are initial state counts;
2. the expected number of times that state q generated each emission symbol o (the set of emission
symbols included will be just those found in each training instance x), with a key indicating
that the associated value is a set of emission counts from state q; and
3. the expected number of times state q transitions to each state r, with a key indicating that the
associated value is a set of transition counts from state q.
HMM training reducer. The reducer for one iteration of HMM training, shown together with
an optional combiner in Figure 6.9, aggregates the count collections associated with each key by
summing them. When the values for each key have been completely aggregated, the associative array
contains all of the statistics necessary to compute a subset of the parameters for the next EM iteration.
The optimal parameter settings for the following iteration are computed simply by computing the
relative frequency of each event with respect to its expected count at the current iteration. The new
computed parameters are emitted from the reducer and written to HDFS. Note that they will be
spread across 2|S| + 1 keys, representing initial state probabilities π, transition probabilities A_q
for each state q, and emission probabilities B_q for each state q.
1: class Mapper
2:   method Initialize(integer iteration)
3:     ⟨S, O⟩ ← ReadModel
4:     ⟨A, B, π⟩ ← ReadModelParams(iteration)
5:   method Map(sample id, sequence x)
6:     α ← Forward(x, θ)                              ▷ cf. Section 6.2.2
7:     β ← Backward(x, θ)                             ▷ cf. Section 6.2.4
8:     I ← new AssociativeArray                       ▷ Initial state expectations
9:     for all q ∈ S do                               ▷ Loop over states
10:      I{q} ← α_1(q) · β_1(q)
11:    O ← new AssociativeArray of AssociativeArray   ▷ Emissions
12:    for t = 1 to |x| do                            ▷ Loop over observations
13:      for all q ∈ S do                             ▷ Loop over states
14:        O{q}{x_t} ← O{q}{x_t} + α_t(q) · β_t(q)
15:      t ← t + 1
16:    T ← new AssociativeArray of AssociativeArray   ▷ Transitions
17:    for t = 1 to |x| − 1 do                        ▷ Loop over observations
18:      for all q ∈ S do                             ▷ Loop over states
19:        for all r ∈ S do                           ▷ Loop over states
20:          T{q}{r} ← T{q}{r} + α_t(q) · A_q(r) · B_r(x_{t+1}) · β_{t+1}(r)
21:      t ← t + 1
22:    Emit(string 'initial', stripe I)
23:    for all q ∈ S do                               ▷ Loop over states
24:      Emit(string 'emit from ' + q, stripe O{q})
25:      Emit(string 'transit from ' + q, stripe T{q})
Figure 6.8: Mapper pseudo-code for training hidden Markov models using EM. The mappers map over
training instances (i.e., sequences of observations xi ) and generate the expected counts of initial states,
emissions, and transitions taken to generate the sequence.
1: class Combiner
2:   method Combine(string t, stripes [C_1, C_2, . . .])
3:     C_f ← new AssociativeArray
4:     for all stripe C ∈ stripes [C_1, C_2, . . .] do
5:       Sum(C_f, C)
6:     Emit(string t, stripe C_f)

1: class Reducer
2:   method Reduce(string t, stripes [C_1, C_2, . . .])
3:     C_f ← new AssociativeArray
4:     for all stripe C ∈ stripes [C_1, C_2, . . .] do
5:       Sum(C_f, C)
6:     z ← 0
7:     for all ⟨k, v⟩ ∈ C_f do
8:       z ← z + v
9:     P_f ← new AssociativeArray                     ▷ Final parameters vector
10:    for all ⟨k, v⟩ ∈ C_f do
11:      P_f{k} ← v/z
12:    Emit(string t, stripe P_f)
Figure 6.9: Combiner and reducer pseudo-code for training hidden Markov models using EM. The
HMMs considered in this book are fully parameterized by multinomial distributions, so reducers do not
require special logic to handle different types of model parameters (since they are all of the same type).
With the statistical approach, translation systems can be developed cheaply and quickly for
any language pair, as long as there is sufficient training data available. Furthermore, improvements
in learning algorithms and statistical modeling can yield benefits in many translation pairs at once,
rather than being specific to individual language pairs. Thus, SMT, like many other topics we are
considering in this book, is an attempt to leverage the vast quantities of textual data that is available
to solve problems that would otherwise require considerable manual effort to encode specialized
knowledge. Since the advent of statistical approaches to translation, the field has grown tremendously
and numerous statistical models of translation have been developed, with many incorporating quite
specialized knowledge about the behavior of natural language as biases in their learning algorithms.
6.4.1 STATISTICAL PHRASE-BASED TRANSLATION
One approach to statistical translation that is simple yet powerful is called phrase-based translation
[86]. We provide a rough outline of the process since it is representative of most state-of-the-
art statistical translation systems, such as the one used inside Google Translate.11 Phrase-based
translation works by learning how strings of words, called phrases, translate between languages.12
Example phrase pairs for Spanish-English translation might include ⟨los estudiantes, the students⟩,
⟨los estudiantes, some students⟩, and ⟨soy, i am⟩. From a few hundred thousand sentences of example
translations, many millions of such phrase pairs may be automatically learned.
The starting point is typically a parallel corpus (also called bitext), which contains pairs of
sentences in two languages that are translations of each other. Parallel corpora are frequently gener-
ated as the by-product of an organization's effort to disseminate information in multiple languages,
for example, proceedings of the Canadian Parliament in French and English, and text generated by
the United Nations in many different languages. The parallel corpus is then annotated with word
alignments, which indicate which words in one language correspond to words in the other. By using
these word alignments as a skeleton, phrases can be extracted from the sentence that are likely to
preserve the meaning relationships represented by the word alignment. While an explanation of the
process is not necessary here, we mention it as a motivation for learning word alignments, which we
show below how to compute with EM. After phrase extraction, each phrase pair is associated with
a number of scores which, taken together, are used to compute the phrase translation probability, a
conditional probability that reflects how likely the source phrase translates into the target phrase.
We briefly note that although EM could be utilized to learn the phrase translation probabilities, this
is not typically done in practice since the maximum likelihood solution turns out to be quite bad for
this problem. The collection of phrase pairs and their scores are referred to as the translation model.
In addition to the translation model, phrase-based translation depends on a language model, which
gives the probability of a string in the target language. The translation model attempts to preserve
the meaning of the source language during the translation process, while the language model en-
sures that the output is fluent and grammatical in the target language. The phrase-based translation
process is summarized in Figure 6.10.
A language model gives the probability that a string of words w = ⟨w_1, w_2, . . . , w_n⟩, written
as w_1^n for short, is a string in the target language. By the chain rule of probability, we get:

Pr(w_1^n) = Pr(w_1) Pr(w_2|w_1) Pr(w_3|w_1^2) · · · Pr(w_n|w_1^{n−1}) = ∏_{k=1}^{n} Pr(w_k|w_1^{k−1})    (6.8)
Due to the extremely large number of parameters involved in estimating such a model directly, it is
customary to make the Markov assumption, that the sequence histories only depend on prior local
11 http://translate.google.com
12 Phrases are simply sequences of words; they are not required to correspond to the definition of a phrase in any linguistic theory.
Example from the figure: the word-aligned sentence pair vi la mesa pequeña / i saw the small table yields
phrase pairs such as (vi, i saw) and (la mesa pequeña, the small table); the decoder translates the foreign
input sentence maria no daba una bofetada a la bruja verde into the English output sentence mary did
not slap the green witch.
Figure 6.10: The standard phrase-based machine translation architecture. The translation model is
constructed with phrases extracted from a word-aligned parallel corpus. The language model is estimated
from a monolingual corpus. Both serve as input to the decoder, which performs the actual translation.
Figure 6.11: Translation coverage of the sentence Maria no dio una bofetada a la bruja verde by a phrase-
based model. The best possible translation path is indicated with a dashed line.
context. That is, an n-gram language model is equivalent to an (n − 1)th-order Markov model. Thus,
we can approximate Pr(w_k|w_1^{k−1}) as follows:

Pr(w_k|w_1^{k−1}) ≈ Pr(w_k|w_{k−n+1}^{k−1})
The probabilities used in computing Pr(w1n ) based on an n-gram language model are generally
estimated from a monolingual corpus of target language text. Since only target language text is
necessary (without any additional annotation), language modeling has been well served by large-
data approaches that take advantage of the vast quantities of text available on the web.
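As a small illustration of Equation 6.8 combined with the Markov assumption, here is a sketch of a bigram (n = 2) model estimated with raw relative frequencies; real language models would add smoothing for unseen events, which this sketch papers over with a tiny floor value.

```python
from collections import defaultdict
import math

def train_bigram_lm(corpus):
    """Relative-frequency bigram model from a list of tokenized sentences."""
    counts = defaultdict(lambda: defaultdict(float))
    for sentence in corpus:
        tokens = ["<s>"] + sentence + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            counts[prev][word] += 1
    return {prev: {w: c / sum(nxt.values()) for w, c in nxt.items()}
            for prev, nxt in counts.items()}


def log_probability(sentence, lm):
    """log Pr(w_1^n) under the bigram approximation of Equation 6.8."""
    tokens = ["<s>"] + sentence + ["</s>"]
    return sum(math.log(lm.get(prev, {}).get(word, 1e-12))   # crude floor for unseen bigrams
               for prev, word in zip(tokens, tokens[1:]))
```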
To translate an input sentence f, the phrase-based decoder creates a matrix of all translation
possibilities of all substrings in the input string, as the example in Figure 6.11 illustrates. A sequence
of phrase pairs is selected such that each word in f is translated exactly once.13 The decoder seeks
to find the translation that maximizes the product of the translation probabilities of the phrases
used and the language model probability of the resulting string in the target language. Because the
phrase translation probabilities are independent of each other, and because of the Markov assumption
made in the language model, this may be done efficiently using dynamic programming. For a detailed
introduction to phrase-based decoding, we refer the reader to a recent textbook by Koehn [85].
Pr(e, a | f) = Pr(a | f, e) · ∏_{i=1}^{|e|} Pr(e_i | f_{a_i})
               (alignment probability)   (lexical probability)
Since we have parallel corpora consisting of only ⟨f, e⟩ pairs, we can learn the parameters for this
model using EM and treating a as a latent variable. However, to combat data sparsity in the alignment
14 In the original presentation of statistical lexical translation models, a special null word is added to the source sentences, which
permits words to be inserted out of nowhere. Since this does not change any of the important details of training, we omit it from
our presentation for simplicity.
probability, we must make some further simplifying assumptions. By letting the probability of an
alignment depend only on the position of the previous aligned word we capture a valuable insight
(namely, words that are nearby in the source language will tend to be nearby in the target language),
and our model acquires the structure of an HMM [150]:
Pr(e, a | f) = ∏_{i=1}^{|e|} Pr(a_i | a_{i−1}) · ∏_{i=1}^{|e|} Pr(e_i | f_{a_i})
               (transition probability)      (emission probability)
This model can be trained using the forward-backward algorithm described in the previous section,
summing over all settings of a, and the best alignment for a sentence pair can be found using the
Viterbi algorithm.
To properly initialize this HMM, it is conventional to further simplify the alignment proba-
bility model, and use this simpler model to learn initial lexical translation (emission) parameters for
the HMM. The favored simplification is to assert that all alignments are uniformly probable:
Pr(e, a | f) = 1/|f|^{|e|} · ∏_{i=1}^{|e|} Pr(e_i | f_{a_i})
This model is known as IBM Model 1. It is attractive for initialization because it is convex everywhere,
and therefore EM will learn the same solution regardless of initialization. Finally, while the forward-
backward algorithm could be used to compute the expected counts necessary for training this model
by setting A_q(r) to be a constant value for all q and r, the uniformity assumption means that the
expected emission counts can be estimated in time O(|e| · |f|), rather than the time O(|e| · |f|^2) required
by the forward-backward algorithm.
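To make the simplification concrete, here is a sketch of one Model 1 EM iteration over a small in-memory bitext; because the alignment prior is uniform, the posterior factorizes per target word and the expected counts can be accumulated in O(|e| · |f|) per sentence pair. The table t[e][f], holding the current lexical translation probabilities, is our own notation and is assumed to be initialized uniformly so that no denominator is zero.

```python
from collections import defaultdict

def model1_em_iteration(bitext, t):
    """One EM iteration of IBM Model 1.

    bitext: list of (f_tokens, e_tokens) sentence pairs.
    t:      t[e][f] = current estimate of Pr(e | f).
    Returns the re-estimated translation table in the same format.
    """
    count = defaultdict(lambda: defaultdict(float))   # expected counts c(e, f)
    for f_sent, e_sent in bitext:
        for e in e_sent:
            z = sum(t[e][f] for f in f_sent)          # per-word normalizer
            for f in f_sent:
                count[e][f] += t[e][f] / z            # fractional (expected) count
    totals = defaultdict(float)                       # normalizers, one per source word f
    for e in count:
        for f, c in count[e].items():
            totals[f] += c
    return {e: {f: c / totals[f] for f, c in count[e].items()} for e in count}
```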
6.4.4 EXPERIMENTS
How well does a MapReduce word aligner for statistical machine translation perform? We describe
previously published results [54] that compared a Java-based Hadoop implementation against a
highly optimized word aligner called Giza++ [112], which was written in C++ and designed to run
efficiently on a single core. We compared the training time of Giza++ and our aligner on a Hadoop
cluster with 19 slave nodes, each with two single-core processors and two disks (38 cores total).
Figure 6.12 shows the performance of Giza++ in terms of the running time of a single EM
iteration for both Model 1 and the HMM alignment model as a function of the number of training
pairs. Both axes in the figure are on a log scale, but the ticks on the y-axis are aligned with meaningful
time intervals rather than exact orders of magnitude. There are three things to note. First, the running
time scales linearly with the size of the training data. Second, the HMM is a constant factor slower
than Model 1. Third, the alignment process is quite slow as the size of the training data grows:
at one million sentences, a single iteration takes over three hours to complete! Five iterations are
generally necessary to train the models, which means that full training takes the better part of a day.
In Figure 6.13 we plot the running time of our MapReduce implementation running on the
38-core cluster described above. For reference, we plot points indicating what 1/38 of the running
time of the Giza++ iterations would be at each data size, which gives a rough indication of what an
ideal parallelization could achieve, assuming that there was no overhead associated with distributing
computation across these machines. Three things may be observed in the results. First, as the amount
of data increases, the relative cost of the overhead associated with distributing data, marshaling and
aggregating counts, decreases. At one million sentence pairs of training data, the HMM alignment
iterations begin to approach optimal runtime efficiency. Second, Model 1, which we observe is
light on computation, does not approach the theoretical performance of an ideal parallelization,
and in fact, has almost the same running time as the HMM alignment algorithm. We conclude
that the overhead associated with distributing and aggregating data is significant compared to the
Model 1 computations, although a comparison with Figure 6.12 indicates that the MapReduce
implementation is still substantially faster than the single core implementation, at least once a
certain training data size is reached. Finally, we note that, in comparison to the running times of
the single-core implementation, at large data sizes, there is a significant advantage to using the
distributed implementation, even of Model 1.
Although these results do confound several variables (Java vs. C++ performance, memory usage
patterns), it is reasonable to expect that the confounds would tend to make the single-core system's
performance appear relatively better than the MapReduce system (which is, of course, the opposite
pattern from what we actually observe). Furthermore, these results show that when computation is
distributed over a cluster of many machines, even an unsophisticated implementation of the HMM
aligner could compete favorably with a highly optimized single-core system whose performance is
well-known to many people in the MT research community.
Why are these results important? Perhaps the most significant reason is that the quantity of
parallel data that is available to train statistical machine translation models is ever increasing, and as is
the case with so many problems we have encountered, more data lead to improvements in translation
quality [54]. Recently, a corpus of one billion words of French-English data was mined automatically
from the web and released publicly [33].15 Single-core solutions to model construction simply
cannot keep pace with the amount of translated data that is constantly being produced. Fortunately,
several independent researchers have shown that existing modeling algorithms can be expressed
naturally and effectively using MapReduce, which means that we can take advantage of this data.
Furthermore, the results presented here show that even at data sizes that may be tractable on single
machines, significant performance improvements are attainable using MapReduce implementations.
This improvement reduces experimental turnaround times, which allows researchers to more quickly
explore the solution space, which will, we hope, lead to rapid new developments in statistical
machine translation.
15 http://www.statmt.org/wmt10/translation-task.html
Figure 6.12: Running times of Giza++ (baseline single-core system) for Model 1 and HMM training
iterations at various corpus sizes.
Figure 6.13: Running times of our MapReduce implementation of Model 1 and HMM training itera-
tions at various corpus sizes. For reference, 1/38 running times of the Giza++ models are shown.
For the reader interested in statistical machine translation, there is an open source Hadoop-
based MapReduce implementation of a training pipeline for phrase-based translation that includes
word alignment, phrase extraction, and phrase scoring [56].
The gradient has two crucial properties that are exploited in gradient-based optimization. First, the
gradient ∇F is a vector field that points in the direction of the greatest increase of F and whose
magnitude indicates the rate of increase. Second, if θ* is a (local) minimum of F, then the following
is true:

∇F(θ*) = 0
An extremely simple gradient-based minimization algorithm produces a series of parameter
estimates θ^(1), θ^(2), . . . by starting with some initial parameter settings θ^(1) and updating parameters
through successive iterations according to the following rule:

θ^(i+1) = θ^(i) − η^(i) · ∇F(θ^(i))    (6.12)

The parameter η^(i) > 0 is a learning rate which indicates how quickly the algorithm moves along
the gradient during iteration i. Provided this value is small enough that F decreases, this strategy
will find a local minimum of F. However, while simple, this update strategy may converge slowly,
and proper selection of η is non-trivial. More sophisticated algorithms perform updates that are
informed by approximations of the second derivative, which are estimated by successive evaluations
of ∇F(θ), and can converge much more rapidly [96].
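For concreteness, a sketch of the simple update rule of Equation 6.12 on a single machine (the MapReduce division of labor is discussed next); the objective-and-gradient function is supplied by the caller, and the toy example at the bottom is ours.

```python
def gradient_descent(theta, objective_and_gradient, learning_rate=0.1, iterations=100):
    """Minimize F by iterating theta <- theta - eta * grad F(theta).

    objective_and_gradient(theta) returns (F(theta), [dF/dtheta_1, dF/dtheta_2, ...]).
    """
    for _ in range(iterations):
        _, grad = objective_and_gradient(theta)
        theta = [t - learning_rate * g for t, g in zip(theta, grad)]
    return theta

# Toy example: F(theta) = (theta_0 - 3)^2 + theta_1^2 has its minimum at (3, 0).
f_and_grad = lambda th: ((th[0] - 3) ** 2 + th[1] ** 2, [2 * (th[0] - 3), 2 * th[1]])
print(gradient_descent([0.0, 0.0], f_and_grad))       # converges toward [3.0, 0.0]
```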
• The objective should decompose linearly across training instances. This implies that the gra-
  dient also decomposes linearly, and therefore mappers can process input data in parallel. The
  values they emit are pairs ⟨F(θ), ∇F(θ)⟩, which are linear components of the objective and
  gradient.
• Evaluations of the function and its gradient are often computationally expensive because they
  require processing lots of data. This makes parallelization with MapReduce worthwhile.
• Whether more than one reducer can run in parallel depends on the specific optimization
  algorithm being used. Some, like the trivial algorithm of Equation 6.12, treat the dimensions
  of θ independently, whereas many are sensitive to global properties of ∇F(θ). In the latter
  case, parallelization across multiple reducers is non-trivial.
• Reducer(s) sum the component objective/gradient pairs, compute the total objective and gra-
  dient, run the optimization algorithm, and emit θ^(i+1).
• Many optimization algorithms are stateful and must persist their state between optimization
  iterations. This may either be emitted together with θ^(i+1) or written to the distributed file
  system as a side effect of the reducer. Such external side effects must be handled carefully; refer
  to Section 2.2 for a discussion.
Parameter learning for log-linear models. Gradient-based optimization techniques can be quite
effectively used to learn the parameters of probabilistic models with a log-linear parameterization
[100]. While a comprehensive introduction to these models is beyond the scope of this book, such
models are used extensively in text processing applications, and their training using gradient-based
optimization, which may otherwise be computationally expensive, can be implemented effectively
using MapReduce. We therefore include a brief summary.
Log-linear models are particularly useful for supervised learning (unlike the unsupervised
models learned with EM), where an annotation y ∈ Y is available for every x ∈ X in the training
data. In this case, it is possible to directly model the conditional distribution of label given input:

Pr(y | x; θ) = exp(Σ_i θ_i H_i(x, y)) / Σ_{y′∈Y} exp(Σ_i θ_i H_i(x, y′))
In this expression, the H_i are real-valued functions sensitive to features of the input and labeling. The
parameters θ of the model are selected so as to minimize the negative conditional log likelihood of a
set of training instances ⟨x, y⟩_1, ⟨x, y⟩_2, . . ., which we assume to be i.i.d.:

F(θ) = − Σ_{⟨x,y⟩} log Pr(y | x; θ)    (6.13)
θ* = arg min_θ F(θ)    (6.14)
As Equation 6.13 makes clear, the objective decomposes linearly across training instances, meaning
it can be optimized quite well in MapReduce. The derivative of F with respect to θ_i can
be shown to have the following form [141]:16

∂F/∂θ_i (θ) = − Σ_{⟨x,y⟩} ( H_i(x, y) − E_{Pr(y′|x;θ)}[H_i(x, y′)] )
The expectation in the second part of the gradient's expression can be computed using a variety
of techniques. However, as we saw with EM, when very large event spaces are being modeled, as
is the case with sequence labeling, enumerating all possible values y′ can become computationally
intractable. And, as was the case with HMMs, independence assumptions can be used to enable
efficient computation using dynamic programming. In fact, the forward-backward algorithm intro-
duced in Section 6.2.4 can, with only minimal modification, be used to compute the expectation
E_{Pr(y′|x;θ)}[H_i(x, y′)] needed in CRF sequence models, as long as the feature functions respect the
same Markov assumption that is made in HMMs. For more information about inference in CRFs
using the forward-backward algorithm, we refer the reader to Sha et al. [140].
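To make the gradient expression concrete, the sketch below computes the conditional distribution and one training pair's gradient contribution by explicit enumeration over a small label set; this is exactly the computation that forward-backward replaces when the label space is a set of sequences. Feature functions are passed in as a list H of callables, and the names are ours.

```python
import math

def conditional_distribution(x, labels, theta, H):
    """Pr(y | x; theta) for a log-linear model with feature functions H[i](x, y)."""
    scores = {y: math.exp(sum(th * h(x, y) for th, h in zip(theta, H))) for y in labels}
    z = sum(scores.values())
    return {y: s / z for y, s in scores.items()}


def gradient_contribution(x, y, labels, theta, H):
    """One pair's contribution to dF/dtheta_i: E[H_i(x, y')] - H_i(x, y)."""
    p = conditional_distribution(x, labels, theta, H)
    expected = [sum(p[yp] * h(x, yp) for yp in labels) for h in H]
    return [e - h(x, y) for e, h in zip(expected, H)]
```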
As we saw in the previous section, MapReduce offers significant speedups when training
iterations require running the forward-backward algorithm. The same pattern of results holds when
training linear CRFs.
16This assumes that when ⟨x, y⟩ is present the model is fully observed (i.e., there are no additional latent variables).
6.6 SUMMARY AND ADDITIONAL READINGS
Additional Readings. Because of its ability to leverage large amounts of training data, machine
learning is an attractive problem for MapReduce and an area of active research. Chu et al. [37]
presented general formulations of a variety of machine learning problems, focusing on a normal
form for expressing a variety of machine learning algorithms in MapReduce. The Apache Mahout
project is an open-source implementation of these and other learning algorithms,17 and it is also
the subject of a forthcoming book [116]. Issues associated with a MapReduce implementation of
latent Dirichlet allocation (LDA), which is another important unsupervised learning technique, with
certain similarities to EM, have been explored by Wang et al. [151].
17 http://lucene.apache.org/mahout/
CHAPTER 7
Closing Remarks
The need to process enormous quantities of data has never been greater. Not only are terabyte-
and petabyte-scale datasets rapidly becoming commonplace, but there is consensus that great value
lies buried in them, waiting to be unlocked by the right computational tools. In the commercial
sphere, business intelligence, driven by the ability to gather data from a dizzying array of sources,
promises to help organizations better understand their customers and the marketplace, hopefully
leading to better business decisions and competitive advantages. For engineers building information
processing tools and applications, larger datasets lead to more effective algorithms for a wide range
of tasks, from machine translation to spam detection. In the natural and physical sciences, the ability
to analyze massive amounts of data may provide the key to unlocking the secrets of the cosmos or
the mysteries of life.
In the preceding chapters, we have shown how MapReduce can be exploited to solve a variety
of problems related to text processing at scales that would have been unthinkable a few years ago.
However, no tool, no matter how powerful or flexible, can be perfectly adapted to every task, so it is
only fair to discuss the limitations of the MapReduce programming model and survey alternatives.
Section 7.1 covers online learning algorithms and Monte Carlo simulations, which are examples of
algorithms that require maintaining global state. As we have seen, this is difficult to accomplish
in MapReduce. Section 7.2 discusses alternative programming models, and the book concludes in
Section 7.3.
The capabilities necessary to tackle large-data problems are already within the reach of many and will
continue to become more accessible over time. By scaling out with commodity servers, we have
been able to economically bring large clusters of machines to bear on problems of interest. But
this has only been possible with corresponding innovations in software and how computations are
organized on a massive scale. Important ideas include moving processing to the data, as opposed to
the other way around, and emphasizing throughput over latency for batch tasks by processing data
with sequential scans rather than random seeks. Most important of all, however, is the development of new
abstractions that hide system-level details from the application developer. These abstractions are
at the level of entire datacenters, and provide a model with which programmers can reason about
computations at a massive scale without being distracted by fine-grained concurrency management,
fault tolerance, error recovery, and a host of other issues in distributed computing. This, in turn,
paves the way for innovations in scalable algorithms that can run on petabyte-scale datasets.
None of these points are new or particularly earth shattering; computer scientists have known
about these principles for decades. However, MapReduce is unique in that, for the first time, all these
ideas came together and were demonstrated on practical problems at scales unseen before, both in
terms of computational resources and the impact on the daily lives of millions. The engineers at
Google deserve a tremendous amount of credit for that, and also for sharing their insights with
the rest of the world. Furthermore, the engineers and executives at Yahoo deserve a lot of credit
for starting the open-source Hadoop project, which has made MapReduce accessible to everyone
and created the vibrant software ecosystem that flourishes today. Add to that the advent of utility
computing, which eliminates capital investments associated with cluster infrastructure, and large-data
processing capabilities are now available to the masses with a relatively low barrier to entry.
The golden age of massively distributed computing is finally upon us.
Bibliography
[1] Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi, Avi Silberschatz, and Alexander
Rasin. HadoopDB: An architectural hybrid of MapReduce and DBMS technologies for
analytical workloads. In Proceedings of the 35th International Conference on Very Large Data
Base (VLDB 2009), pages 922933, Lyon, France, 2009. 59
[2] Réka Albert and Albert-László Barabási. Statistical mechanics of complex networks. Reviews
of Modern Physics, 74:47-97, 2002. DOI: 10.1103/RevModPhys.74.47 92
[3] Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the
frequency moments. In Proceedings of the 28th Annual ACM Symposium on Theory of Computing
(STOC 96), pages 2029, Philadelphia, Pennsylvania, 1996. DOI: 10.1145/237814.237823
145
[4] Peter Alvaro, Tyson Condie, Neil Conway, Khaled Elmeleegy, Joseph M. Hellerstein, and
Russell C. Sears. BOOM: Data-centric programming in the datacenter. Technical Report
UCB/EECS-2009-98, Electrical Engineering and Computer Sciences, University of Cali-
fornia at Berkeley, 2009. 32
[5] Gene Amdahl. Validity of the single processor approach to achieving large-scale computing
capabilities. In Proceedings of the AFIPS Spring Joint Computer Conference, pages 483485,
1967. DOI: 10.1145/1465482.1465560 17
[6] Rajagopal Ananthanarayanan, Karan Gupta, Prashant Pandey, Himabindu Pucha, Prasenjit
Sarkar, Mansi Shah, and Renu Tewari. Cloud analytics: Do we really need to reinvent the
storage stack? In Proceedings of the 2009 Workshop on Hot Topics in Cloud Computing (HotCloud
09), San Diego, California, 2009. 29
[7] Thomas Anderson, Michael Dahlin, Jeanna Neefe, David Patterson, Drew Roselli, and Ran-
dolph Wang. Serverless network file systems. In Proceedings of the 15th ACM Symposium on
Operating Systems Principles (SOSP 1995), pages 109126, Copper Mountain Resort, Col-
orado, 1995. DOI: 10.1145/224056.224066 29
[8] Vo Ngoc Anh and Alistair Moffat. Inverted index compression using word-aligned binary
codes. Information Retrieval, 8(1):151166, 2005.
DOI: 10.1023/B:INRT.0000048490.99518.5c 76
[9] Michael Armbrust, Armando Fox, Rean Griffith, Anthony D. Joseph, Randy H. Katz, Andrew
Konwinski, Gunho Lee, David A. Patterson, Ariel Rabkin, Ion Stoica, and Matei Zaharia.
Above the clouds: A Berkeley view of cloud computing. Technical Report UCB/EECS-2009-
28, Electrical Engineering and Computer Sciences, University of California at Berkeley, 2009.
6
[10] Arthur Asuncion, Padhraic Smyth, and Max Welling. Asynchronous distributed learning of
topic models. In Advances in Neural Information Processing Systems 21 (NIPS 2008), pages
8188, Vancouver, British Columbia, Canada, 2008. 144
[11] Ricardo Baeza-Yates, Carlos Castillo, Flavio Junqueira, Vassilis Plachouras, and Fabrizio
Silvestri. Challenges on distributed web retrieval. In Proceedings of the IEEE 23rd Inter-
national Conference on Data Engineering (ICDE 2007), pages 620, Istanbul, Turkey, 2007.
DOI: 10.1109/ICDE.2007.367846 83
[12] Ricardo Baeza-Yates, Carlos Castillo, and Vicente López. PageRank increase under dif-
ferent collusion topologies. In Proceedings of the First International Workshop on Adversarial
Information Retrieval on the Web (AIRWeb 2005), pages 17-24, Chiba, Japan, 2005. 96
[13] Ricardo Baeza-Yates, Aristides Gionis, Flavio Junqueira, Vanessa Murdock, Vassilis Pla-
chouras, and Fabrizio Silvestri. The impact of caching on search engines. In Proceedings
of the 30th Annual International ACM SIGIR Conference on Research and Development in
Information Retrieval (SIGIR 2007), pages 183190, Amsterdam, The Netherlands, 2007.
DOI: 10.1145/1277741.1277775 82
[14] Michele Banko and Eric Brill. Scaling to very very large corpora for natural language disam-
biguation. In Proceedings of the 39th Annual Meeting of the Association for Computational Lin-
guistics (ACL 2001), pages 2633, Toulouse, France, 2001. DOI: 10.3115/1073012.1073017
4
[15] Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Rolf Neuge-
bauer, Ian Pratt, and Andrew Warfield. Xen and the art of virtualization. In Proceedings of the
19th ACM Symposium on Operating Systems Principles (SOSP 2003), pages 164177, Bolton
Landing, New York, 2003. DOI: 10.1145/945445.945462 6
[16] Luiz André Barroso, Jeffrey Dean, and Urs Hölzle. Web search for a planet: The Google
cluster architecture. IEEE Micro, 23(2):22-28, 2003. DOI: 10.1109/MM.2003.1196112 82
[17] Luiz André Barroso and Urs Hölzle. The case for energy-proportional computing. Computer,
40(12):33-37, 2007. DOI: 10.1109/MC.2007.443 9
[18] Luiz André Barroso and Urs Hölzle. The Datacenter as a Computer: An Introduc-
tion to the Design of Warehouse-Scale Machines. Morgan & Claypool Publishers, 2009.
DOI: 10.2200/S00193ED1V01Y200905CAC006 8, 9, 10, 14
[19] Jacek Becla, Andrew Hanushevsky, Sergei Nikolaev, Ghaleb Abdulla, Alex Szalay, Maria
Nieto-Santisteban, Ani Thakar, and Jim Gray. Designing a multi-petabyte database for
LSST. SLAC Publications SLAC-PUB-12292, Stanford Linear Accelerator Center, May
2006. 2
[20] Jacek Becla and Daniel L. Wang. Lessons learned from managing a petabyte. In Proceedings
of the Second Biennial Conference on Innovative Data Systems Research (CIDR 2005), Asilomar,
California, 2005. 2
[21] Gordon Bell, Tony Hey, and Alex Szalay. Beyond the data deluge. Science, 323(5919):1297
1298, 2009. 2
[22] Monica Bianchini, Marco Gori, and Franco Scarselli. Inside PageRank. ACM Transactions
on Internet Technology, 5(1):92128, 2005. DOI: 10.1145/1052934.1052938 98, 100
[23] Jorge Luis Borges. Collected Fictions (translated by Andrew Hurley). Penguin, 1999. 5
[24] Léon Bottou. Stochastic learning. In Olivier Bousquet and Ulrike von Luxburg, editors,
Advanced Lectures on Machine Learning, Lecture Notes in Artificial Intelligence, LNAI 3176,
pages 146-168. Springer Verlag, Berlin, 2004. DOI: 10.1007/b100712 144
[25] Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, and Jeffrey Dean. Large language
models in machine translation. In Proceedings of the 2007 Joint Conference on Empirical Methods
in Natural Language Processing and Computational Natural Language Learning, pages 858867,
Prague, Czech Republic, 2007. 4, 5, 133
[26] Thorsten Brants and Peng Xu. Distributed Language Models. Morgan & Claypool Publishers,
2010. 134
[27] Eric Brill, Jimmy Lin, Michele Banko, Susan Dumais, and Andrew Ng. Data-intensive
question answering. In Proceedings of the Tenth Text REtrieval Conference (TREC 2001),
pages 393400, Gaithersburg, Maryland, 2001. 4
[28] Frederick P. Brooks. The Mythical Man-Month: Essays on Software Engineering, Anniversary
Edition. Addison-Wesley, Reading, Massachusetts, 1995. 12
[29] Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer.
The mathematics of statistical machine translation: Parameter estimation. Computational
Linguistics, 19(2):263311, 1993. 130
[30] Stefan Büttcher, Charles L. A. Clarke, and Gordon V. Cormack. Information Retrieval:
Implementing and Evaluating Search Engines. MIT Press, Cambridge, Massachusetts, 2010.
78, 83
[31] Rajkumar Buyya, Chee Shin Yeo, Srikumar Venugopal, James Broberg, and Ivona Brandic.
Cloud computing and emerging IT platforms: Vision, hype, and reality for delivering
computing as the 5th utility. Future Generation Computer Systems, 25(6):599616, 2009.
DOI: 10.1016/j.future.2008.12.001 6
[32] Luis-Felipe Cabrera and Darrell D. E. Long. Swift: Using distributed disk striping to provide
high I/O data rates. Computer Systems, 4(4):405436, 1991. 29
[33] Chris Callison-Burch, Philipp Koehn, Christof Monz, and Josh Schroeder. Findings of the
2009 workshop on statistical machine translation. In Proceedings of the Fourth Workshop on
Statistical Machine Translation (StatMT 09), pages 128, Athens, Greece, 2009. 136
[34] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Michael
Burrows, Tushar Chandra, Andrew Fikes, and Robert Gruber. Bigtable: A distributed storage
system for structured data. In Proceedings of the 7th Symposium on Operating System Design
and Implementation (OSDI 2006), pages 205218, Seattle, Washington, 2006. 24, 145
[35] Stanley F. Chen and Joshua Goodman. An empirical study of smoothing techniques
for language modeling. In Proceedings of the 34th Annual Meeting of the Association
for Computational Linguistics (ACL 1996), pages 310318, Santa Cruz, California, 1996.
DOI: 10.3115/981863.981904 5, 133
[36] Hung chih Yang, Ali Dasdan, Ruey-Lung Hsiao, and D. Stott Parker. Map-Reduce-Merge:
Simplified relational data processing on large clusters. In Proceedings of the 2007 ACM
SIGMOD International Conference on Management of Data, pages 10291040, Beijing, China,
2007. DOI: 10.1145/1247480.1247602 146
[37] Cheng-Tao Chu, Sang Kyun Kim, Yi-An Lin, YuanYuan Yu, Gary Bradski, Andrew Ng, and
Kunle Olukotun. Map-Reduce for machine learning on multicore. In Advances in Neural
Information Processing Systems 19 (NIPS 2006), pages 281288, Vancouver, British Columbia,
Canada, 2006. 141
[38] Kenneth W. Church and Patrick Hanks. Word association norms, mutual information, and
lexicography. Computational Linguistics, 16(1):2229, 1990. 48
[39] Jonathan Cohen. Graph twiddling in a MapReduce world. Computing in Science and Engi-
neering, 11(4):2941, 2009. DOI: 10.1109/MCSE.2009.120 103
[40] Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears.
Benchmarking cloud serving systems with YCSB. In Proceedings of the First ACM Symposium
on Cloud Computing (ACM SOCC 2010), Indianapolis, Indiana, 2010. 145
[41] Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest. Introduction to Algorithms.
MIT Press, Cambridge, Massachusetts, 1990. 88
[42] W. Bruce Croft, Donald Meztler, and Trevor Strohman. Search Engines: Information Retrieval
in Practice. Addison-Wesley, Reading, Massachusetts, 2009. 83
[43] David Culler, Richard Karp, David Patterson, Abhijit Sahay, Klaus Erik Schauser, Eunice San-
tos, Ramesh Subramonian, and Thorsten von Eicken. LogP: Towards a realistic model of par-
allel computation. ACM SIGPLAN Notices, 28(7):112, 1993. DOI: 10.1145/173284.155333
15
[44] Doug Cutting, Julian Kupiec, Jan Pedersen, and Penelope Sibun. A practical part-of-speech
tagger. In Proceedings of the Third Conference on Applied Natural Language Processing, pages
133140, Trento, Italy, 1992. DOI: 10.3115/974499.974523 114
[45] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters.
In Proceedings of the 6th Symposium on Operating System Design and Implementation (OSDI
2004), pages 137150, San Francisco, California, 2004. 1, 24, 25
[46] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters.
Communications of the ACM, 51(1):107113, 2008. DOI: 10.1145/1327452.1327492 2
[47] Jeffrey Dean and Sanjay Ghemawat. MapReduce: A flexible data processing tool. Commu-
nications of the ACM, 53(1):7277, 2010. DOI: 10.1145/1629175.1629198 59
[48] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash
Lakshman, Alex Pilchin, Swami Sivasubramanian, Peter Vosshall, and Werner Vogels. Dy-
namo: Amazon's highly available key-value store. In Proceedings of the 21st ACM Symposium
on Operating Systems Principles (SOSP 2007), pages 205-220, Stevenson, Washington, 2007.
145
[49] Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. Maximum likelihood from incom-
plete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological),
39(1):138, 1977. 108
[50] David J. DeWitt and Jim Gray. Parallel database systems: The future of high
performance database systems. Communications of the ACM, 35(6):8598, 1992.
DOI: 10.1145/129888.129894 12
[51] David J. DeWitt, Randy H. Katz, Frank Olken, Leonard D. Shapiro, Michael R. Stonebraker,
and David Wood. Implementation techniques for main memory database systems. ACM
SIGMOD Record, 14(2):18, 1984. DOI: 10.1145/971697.602261 63
[52] Mark Dredze, Alex Kulesza, and Koby Crammer. Multi-domain learning by
confidence-weighted parameter combination. Machine Learning, 79:123149, 2010.
DOI: 10.1007/s10994-009-5148-0 144
[53] Susan Dumais, Michele Banko, Eric Brill, Jimmy Lin, and Andrew Ng. Web question
answering: Is more always better? In Proceedings of the 25th Annual International ACM SIGIR
Conference on Research and Development in Information Retrieval (SIGIR 2002), pages 291
298, Tampere, Finland, 2002. DOI: 10.1145/564376.564428 4
[54] Chris Dyer, Aaron Cordova, Alex Mont, and Jimmy Lin. Fast, easy, and cheap: Construction
of statistical machine translation models with MapReduce. In Proceedings of the Third Workshop
on Statistical Machine Translation at ACL 2008, pages 199207, Columbus, Ohio, 2008. 47,
135, 136
[55] John R. Firth. A synopsis of linguistic theory 193055. In Studies in Linguistic Analysis,
Special Volume of the Philological Society, pages 132. Blackwell, Oxford, 1957. 48
[56] Qin Gao and Stephan Vogel. Training phrase-based machine translation models on the
cloud: Open source machine translation toolkit Chaski. The Prague Bulletin of Mathematical
Linguistics, 93:3746, 2010. DOI: 10.2478/v10108-010-0004-8 138
[57] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google File System. In
Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP 2003), pages
2943, Bolton Landing, New York, 2003. DOI: 10.1145/945445.945450 29
[58] Seth Gilbert and Nancy Lynch. Brewer's Conjecture and the feasibility of consis-
tent, available, partition-tolerant web services. ACM SIGACT News, 33(2):51-59, 2002.
DOI: 10.1145/564585.564601 32
[59] Michelle Girvan and Mark E. J. Newman. Community structure in social and biolog-
ical networks. Proceedings of the National Academy of Science, 99(12):78217826, 2002.
DOI: 10.1073/pnas.122653799 85
[60] Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar. Introduction to Parallel
Computing. Addison-Wesley, Reading, Massachusetts, 2003. 14, 103
[61] Mark S. Granovetter. The strength of weak ties. The American Journal of Sociology, 78(6):1360
1380, 1973. DOI: 10.1086/225469 92
[62] Mark S. Granovetter. The strength of weak ties: A network theory revisited. Sociological
Theory, 1:201233, 1983. DOI: 10.2307/202051 92
[63] Zoltán Gyöngyi and Hector Garcia-Molina. Web spam taxonomy. In Proceedings of the First
International Workshop on Adversarial Information Retrieval on the Web (AIRWeb 2005), pages
39-47, Chiba, Japan, 2005. 96
[64] Per Hage and Frank Harary. Island Networks: Communication, Kinship, and Classification
Structures in Oceania. Cambridge University Press, Cambridge, England, 1996. 86
[65] Alon Halevy, Peter Norvig, and Fernando Pereira. The unreasonable effectiveness of data.
Communications of the ACM, 24(2):812, 2009. DOI: 10.1109/MIS.2009.36 5
[66] James Hamilton. On designing and deploying Internet-scale services. In Proceedings of the
21st Large Installation System Administration Conference (LISA 07), pages 233244, Dallas,
Texas, 2007. 10
[67] James Hamilton. Cooperative Expendable Micro-Slice Servers (CEMS): Low cost, low
power servers for Internet-scale services. In Proceedings of the Fourth Biennial Conference on
Innovative Data Systems Research (CIDR 2009), Asilomar, California, 2009. 9, 10
[68] Jeff Hammerbacher. Information platforms and the rise of the data scientist. In Toby Segaran
and Jeff Hammerbacher, editors, Beautiful Data, pages 73-84. O'Reilly, Sebastopol, California,
2009. 6, 59, 146
[69] Zelig S. Harris. Mathematical Structures of Language. Wiley, New York, 1968. 48
[70] Md. Rafiul Hassan and Baikunth Nath. Stock market forecasting using hidden Markov
models: A new approach. In Proceedings of the 5th International Conference on Intelli-
gent Systems Design and Applications (ISDA 05), pages 192196, Wroclaw, Poland, 2005.
DOI: 10.1109/ISDA.2005.85 114
[71] Bingsheng He, Wenbin Fang, Qiong Luo, Naga K. Govindaraju, and Tuyong Wang. Mars:
A MapReduce framework on graphics processors. In Proceedings of the 17th International
Conference on Parallel Architectures and Compilation Techniques (PACT 2008), pages 260–269,
Toronto, Ontario, Canada, 2008. DOI: 10.1145/1454115.1454152 20
[72] Tony Hey, Stewart Tansley, and Kristin Tolle. The Fourth Paradigm: Data-Intensive Scientific
Discovery. Microsoft Research, Redmond, Washington, 2009. 3
[73] Tony Hey, Stewart Tansley, and Kristin Tolle. Jim Gray on eScience: A transformed scientific
method. In Tony Hey, Stewart Tansley, and Kristin Tolle, editors, The Fourth Paradigm:
Data-Intensive Scientific Discovery. Microsoft Research, Redmond, Washington, 2009. 3
[74] John Howard, Michael Kazar, Sherri Menees, David Nichols, Mahadev Satyanarayanan,
Robert Sidebotham, and Michael West. Scale and performance in a distributed file system.
ACM Transactions on Computer Systems, 6(1):51–81, 1988. DOI: 10.1145/35037.35059 29
[75] Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. Dryad: Dis-
tributed data-parallel programs from sequential building blocks. In Proceedings of the ACM
SIGOPS/EuroSys European Conference on Computer Systems 2007 (EuroSys 2007), pages 59–72,
Lisbon, Portugal, 2007. DOI: 10.1145/1272998.1273005 145
[76] Adam Jacobs. The pathologies of big data. ACM Queue, 7(6), 2009.
DOI: 10.1145/1563821.1563874 11
[77] Joseph JaJa. An Introduction to Parallel Algorithms. Addison-Wesley, Reading, Massachusetts,
1992. 14, 103
[78] Frederick Jelinek. Statistical Methods for Speech Recognition. MIT Press, Cambridge, Mas-
sachusetts, 1997. 112, 114, 120
[79] Daniel Jurafsky and James H. Martin. Speech and Language Processing. Pearson, Upper Saddle
River, New Jersey, 2009. 5
[80] U Kang, Charalampos Tsourakakis, Ana Paula Appel, Christos Faloutsos, and Jure Leskovec.
HADI: Fast diameter estimation and mining in massive graphs with Hadoop. Technical
Report CMU-ML-08-117, School of Computer Science, Carnegie Mellon University, 2008.
103
[82] Howard Karloff, Siddharth Suri, and Sergei Vassilvitskii. A model of computation for Map-
Reduce. In Proceedings of the 21st Annual ACM-SIAM Symposium on Discrete Algorithms
(SODA 2010), Austin, Texas, 2010. 15
[83] Aaron Kimball, Sierra Michels-Slettvet, and Christophe Bisciglia. Cluster computing
for Web-scale data processing. In Proceedings of the 39th ACM Technical Symposium
on Computer Science Education (SIGCSE 2008), pages 116–120, Portland, Oregon, 2008.
DOI: 10.1145/1352135.1352177 71
[84] Jon M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM,
46(5):604–632, 1999. DOI: 10.1145/324133.324140 65, 95
[85] Philipp Koehn. Statistical Machine Translation. Cambridge University Press, Cambridge,
England, 2010. 130, 133
[86] Philipp Koehn, Franz J. Och, and Daniel Marcu. Statistical phrase-based translation. In
Proceedings of the 2003 Human Language Technology Conference of the North American Chapter
of the Association for Computational Linguistics (HLT/NAACL 2003), pages 48–54, Edmonton,
Alberta, Canada, 2003. DOI: 10.3115/1073445.1073462 131
[87] John D. Lafferty, Andrew McCallum, and Fernando Pereira. Conditional random fields: Prob-
abilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth
International Conference on Machine Learning (ICML '01), pages 282–289, San Francisco,
California, 2001. 138
[88] Ronny Lempel and Shlomo Moran. SALSA: The Stochastic Approach for Link-
Structure Analysis. ACM Transactions on Information Systems, 19(2):131–160, 2001.
DOI: 10.1145/382979.383041 65, 95
[89] Abby Levenberg, Chris Callison-Burch, and Miles Osborne. Stream-based translation mod-
els for statistical machine translation. In Proceedings of the 11th Annual Conference of the North
American Chapter of the Association for Computational Linguistics (NAACL HLT 2010), Los
Angeles, California, 2010. 145
[90] Abby Levenberg and Miles Osborne. Stream-based randomised language models for SMT.
In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing,
pages 756–764, Singapore, 2009. 145
[91] Adam Leventhal. Triple-parity RAID and beyond. ACM Queue, 7(11), 2009.
DOI: 10.1145/1661785.1670144 2
[93] Jimmy Lin. Exploring large-data issues in the curriculum: A case study with MapReduce. In
Proceedings of the Third Workshop on Issues in Teaching Computational Linguistics (TeachCL-08)
at ACL 2008, pages 54–61, Columbus, Ohio, 2008. 71
[94] Jimmy Lin. Scalable language processing algorithms for the masses: A case study in computing
word co-occurrence matrices with MapReduce. In Proceedings of the 2008 Conference on
Empirical Methods in Natural Language Processing (EMNLP 2008), pages 419–428, Honolulu,
Hawaii, 2008. 47, 51
[95] Jimmy Lin, Anand Bahety, Shravya Konda, and Samantha Mahindrakar. Low-latency, high-
throughput access to static global resources within the Hadoop framework. Technical Report
HCIL-2009-01, University of Maryland, College Park, Maryland, January 2009. 63
[96] Dong C. Liu and Jorge Nocedal. On the limited memory BFGS method for large scale
optimization. Mathematical Programming B, 45(3):503–528, 1989.
DOI: 10.1007/BF01589116 139
[97] Adam Lopez. Statistical machine translation. ACM Computing Surveys, 40(3):1–49, 2008.
DOI: 10.1145/1380584.1380586 130
[98] Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn,
Naty Leiser, and Grzegorz Czajkowski. Pregel: A system for large-scale graph processing. In
Proceedings of the 28th ACM Symposium on Principles of Distributed Computing (PODC 2009),
page 6, Calgary, Alberta, Canada, 2009. DOI: 10.1145/1583991.1584010 86, 145
[99] Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn,
Naty Leiser, and Grzegorz Czajkowski. Pregel: A system for large-scale graph processing.
In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data,
Indianapolis, Indiana, 2010. DOI: 10.1145/1582716.1582723 86, 145
[100] Robert Malouf. A comparison of algorithms for maximum entropy parameter estimation.
In Proceedings of the Sixth Conference on Natural Language Learning (CoNLL-2002), pages
49–55, Taipei, Taiwan, 2002. DOI: 10.3115/1118853.1118871 140
[102] Christopher D. Manning and Hinrich Schütze. Foundations of Statistical Natural Language
Processing. MIT Press, Cambridge, Massachusetts, 1999. 5, 133
[103] Elaine R. Mardis. The impact of next-generation sequencing technology on genetics. Trends
in Genetics, 24(3):133–141, 2008. DOI: 10.1016/j.tig.2007.12.007 2
[104] Michael D. McCool. Scalable programming models for massively multicore processors.
Proceedings of the IEEE, 96(5):816–831, 2008. DOI: 10.1109/JPROC.2008.917731 13, 145
[105] Marshall K. McKusick and Sean Quinlan. GFS: Evolution on fast-forward. ACM Queue,
7(7), 2009. DOI: 10.1145/1594204.1594206 32
[106] John Mellor-Crummey, David Whalley, and Ken Kennedy. Improving memory hierarchy
performance for irregular applications using data and computation reorderings. International
Journal of Parallel Programming, 29(3):217–247, 2001. DOI: 10.1023/A:1011119519789 101
[107] Donald Metzler, Jasmine Novak, Hang Cui, and Srihari Reddy. Building enriched document
representations using aggregated anchor text. In Proceedings of the 32nd Annual International
ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2009),
pages 219–226, 2009. DOI: 10.1145/1571941.1571981 68, 88
[108] David R. H. Miller, Tim Leek, and Richard M. Schwartz. A hidden Markov model informa-
tion retrieval system. In Proceedings of the 22nd Annual International ACM SIGIR Conference
on Research and Development in Information Retrieval (SIGIR 1999), pages 214–221, Berkeley,
California, 1999. DOI: 10.1145/312624.312680 114
[109] Alistair Moffat, William Webber, and Justin Zobel. Load balancing for term-distributed
parallel retrieval. In Proceedings of the 29th Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval (SIGIR 2006), pages 348–355, Seattle,
Washington, 2006. DOI: 10.1145/1148170.1148232 82
[110] Kamal Nigam, John Lafferty, and Andrew McCallum. Using maximum entropy for text
classification. In Proceedings of the IJCAI-99 Workshop on Machine Learning for Information
Filtering, pages 61–67, Stockholm, Sweden, 1999. 138
[111] Daniel Nurmi, Rich Wolski, Chris Grzegorczyk, Graziano Obertelli, Sunil Soman, Lamia
Youseff, and Dmitrii Zagorodnov. The Eucalyptus open-source cloud-computing system. In
Proceedings of the 9th IEEE/ACM International Symposium on Cluster Computing and the Grid,
pages 124–131, Washington, D.C., 2009. DOI: 10.1109/CCGRID.2009.93 7
[112] Franz J. Och and Hermann Ney. A systematic comparison of various statistical alignment
models. Computational Linguistics, 29(1):19–51, 2003. DOI: 10.1162/089120103321337421
135
[113] Christopher Olston and Marc Najork. Web crawling. Foundations and Trends in Information
Retrieval, 4(3):175–246, 2010. DOI: 10.1561/1500000017 67
[114] Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins.
Pig Latin: A not-so-foreign language for data processing. In Proceedings of the 2008 ACM
SIGMOD International Conference on Management of Data, pages 1099–1110, Vancouver,
British Columbia, Canada, 2008. DOI: 10.1145/1376616.1376726 59, 146
[115] Kunle Olukotun and Lance Hammond. The future of microprocessors. ACM Queue, 3(7):27–
34, 2005. DOI: 10.1145/1095408.1095418 13
[116] Sean Owen and Robin Anil. Mahout in Action. Manning Publications Co., Greenwich,
Connecticut, 2010. 141
[117] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation
ranking: Bringing order to the Web. Stanford Digital Library Working Paper SIDL-WP-
1999-0120, Stanford University, 1999. 65, 95, 100
[118] Bo Pang and Lillian Lee. Opinion mining and sentiment analysis. Foundations and Trends
in Information Retrieval, 2(1–2):1–135, 2008. DOI: 10.1561/1500000011 3
[119] David A. Patterson. The data center is the computer. Communications of the ACM, 52(1):105,
2008. 14
[120] Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel
Madden, and Michael Stonebraker. A comparison of approaches to large-scale data analysis.
In Proceedings of the 35th ACM SIGMOD International Conference on Management of Data,
pages 165–178, Providence, Rhode Island, 2009. DOI: 10.1145/1559845.1559865 59
[121] Sasa Petrovic, Miles Osborne, and Victor Lavrenko. Streaming first story detection with
application to Twitter. In Proceedings of the 11th Annual Conference of the North American
Chapter of the Association for Computational Linguistics (NAACL HLT 2010), Los Angeles,
California, 2010. 145
[122] Rob Pike, Sean Dorward, Robert Griesemer, and Sean Quinlan. Interpreting the data: Parallel
analysis with Sawzall. Scientific Programming Journal, 13(4):277–298, 2005. 146
[123] Eduardo Pinheiro, Wolf-Dietrich Weber, and Luiz André Barroso. Failure trends in a large
disk drive population. In Proceedings of the 5th USENIX Conference on File and Storage
Technologies (FAST 2007), San Jose, California, 2008. 10, 26
[124] Xiaoguang Qi and Brian D. Davison. Web page classification: Features and algorithms. ACM
Computing Surveys, 41(2), 2009. DOI: 10.1145/1459352.1459357 82
[125] Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in
speech recognition. In Readings in Speech Recognition, pages 267–296. Morgan Kaufmann
Publishers, San Francisco, California, 1990. DOI: 10.1109/5.18626 115, 120
[126] M. Mustafa Rafique, Benjamin Rose, Ali R. Butt, and Dimitrios S. Nikolopoulos. Supporting
MapReduce on large-scale asymmetric multi-core clusters. ACM Operating Systems Review,
43(2):25–34, 2009. DOI: 10.1145/1531793.1531800 20
[127] Colby Ranger, Ramanan Raghuraman, Arun Penmetsa, Gary Bradski, and Christos
Kozyrakis. Evaluating MapReduce for multi-core and multiprocessor systems. In Proceedings
of the 13th International Symposium on High-Performance Computer Architecture (HPCA 2007),
pages 205–218, Phoenix, Arizona, 2007. DOI: 10.1109/HPCA.2007.346181 20
[128] Delip Rao and David Yarowsky. Ranking and semi-supervised classification on large scale
graphs using Map-Reduce. In Proceedings of the ACL/IJCNLP 2009 Workshop on Graph-Based
Methods for Natural Language Processing (TextGraphs-4), Singapore, 2009. 103
[129] Michael A. Rappa. The utility business model and the future of computing services. IBM
Systems Journal, 43(1):32–42, 2004. DOI: 10.1147/sj.431.0032 6
[130] Sheldon M. Ross. Stochastic Processes. Wiley, New York, 1996. 114
[131] Thomas Sandholm and Kevin Lai. MapReduce optimization using regulated dynamic pri-
oritization. In Proceedings of the Eleventh International Joint Conference on Measurement and
Modeling of Computer Systems (SIGMETRICS '09), pages 299–310, Seattle, Washington, 2009.
DOI: 10.1145/1555349.1555384 25
[132] Michael Schatz. High Performance Computing for DNA Sequence Alignment and Assembly. PhD
thesis, University of Maryland, College Park, 2010. 103
[133] Frank Schmuck and Roger Haskin. GPFS: A shared-disk file system for large computing
clusters. In Proceedings of the First USENIX Conference on File and Storage Technologies, pages
231–244, Monterey, California, 2002. 29
[134] Donovan A. Schneider and David J. DeWitt. A performance evaluation of four parallel join
algorithms in a shared-nothing multiprocessor environment. In Proceedings of the 1989 ACM
SIGMOD International Conference on Management of Data, pages 110–121, Portland, Oregon,
1989. DOI: 10.1145/67544.66937 60
[135] Bianca Schroeder, Eduardo Pinheiro, and Wolf-Dietrich Weber. DRAM errors in the wild:
A large-scale field study. In Proceedings of the Eleventh International Joint Conference on
Measurement and Modeling of Computer Systems (SIGMETRICS '09), pages 193–204, Seattle,
Washington, 2009. DOI: 10.1145/1555349.1555372 10, 26
[136] Hinrich Schütze. Automatic word sense discrimination. Computational Linguistics, 24(1):97–
123, 1998. 48
[137] Hinrich Schütze and Jan O. Pedersen. A cooccurrence-based thesaurus and two applica-
tions to information retrieval. Information Processing and Management, 33(3):307–318, 1998.
DOI: 10.1016/S0306-4573(96)00068-4 48
[138] Satoshi Sekine and Elisabete Ranchhod. Named Entities: Recognition, Classification and Use.
John Benjamins, Amsterdam, The Netherlands, 2009. 3
[139] Kristie Seymore, Andrew McCallum, and Ronald Rosenfeld. Learning hidden Markov model
structure for information extraction. In Proceedings of the AAAI-99 Workshop on Machine
Learning for Information Extraction, pages 37–42, Orlando, Florida, 1999. 114
[140] Fei Sha and Fernando Pereira. Shallow parsing with conditional random fields. In Proceed-
ings of the 2003 Human Language Technology Conference of the North American Chapter of the
Association for Computational Linguistics (HLT/NAACL 2003), pages 134–141, Edmonton,
Alberta, Canada, 2003. DOI: 10.3115/1073445.1073473 140
[142] Christopher Southan and Graham Cameron. Beyond the tsunami: Developing the infras-
tructure to deal with life sciences data. In Tony Hey, Stewart Tansley, and Kristin Tolle,
editors, The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research, Red-
mond, Washington, 2009. 2
[143] Mario Stanke and Stephan Waack. Gene prediction with a hidden Markov model
and a new intron submodel. Bioinformatics, 19 Suppl 2:ii215–225, October 2003.
DOI: 10.1093/bioinformatics/btg1080 114
[144] Michael Stonebraker, Daniel Abadi, David J. DeWitt, Sam Madden, Erik Paulson, Andrew
Pavlo, and Alexander Rasin. MapReduce and parallel DBMSs: Friends or foes? Communi-
cations of the ACM, 53(1):64–71, 2010. DOI: 10.1145/1629175.1629197 59
[145] Alexander S. Szalay, Peter Z. Kunszt, Ani Thakar, Jim Gray, Don Slutz, and Robert J. Brunner.
Designing and mining multi-terabyte astronomy archives: The Sloan Digital Sky Survey.
SIGMOD Record, 29(2):451–462, 2000. DOI: 10.1145/335191.335439 2
[146] Wittawat Tantisiriroj, Swapnil Patil, and Garth Gibson. Data-intensive file systems for
Internet services: A rose by any other name. Technical Report CMU-PDL-08-114, Parallel
Data Laboratory, Carnegie Mellon University, 2008. 29
[147] Chandramohan A. Thekkath, Timothy Mann, and Edward K. Lee. Frangipani: A scalable
distributed file system. In Proceedings of the 16th ACM Symposium on Operating Systems Princi-
ples (SOSP 1997), pages 224–237, Saint-Malo, France, 1997. DOI: 10.1145/268998.266694
29
[148] Leslie G. Valiant. A bridging model for parallel computation. Communications of the ACM,
33(8):103–111, 1990. DOI: 10.1145/79173.79181 13, 14, 15, 86, 145
[149] Luis M. Vaquero, Luis Rodero-Merino, Juan Caceres, and Maik Lindner. A break in the
clouds: Towards a cloud definition. ACM SIGCOMM Computer Communication Review,
39(1):50–55, 2009. DOI: 10.1145/1496091.1496100 6
[150] Stephan Vogel, Hermann Ney, and Christoph Tillmann. HMM-based word alignment
in statistical translation. In Proceedings of the 16th International Conference on Com-
putational Linguistics (COLING 1996), pages 836–841, Copenhagen, Denmark, 1996.
DOI: 10.3115/993268.993313 114, 135
[151] Yi Wang, Hongjie Bai, Matt Stanton, Wen-Yen Chen, and Edward Y. Chang. PLDA: Parallel
latent Dirichlet allocation for large-scale applications. In Proceedings of the Fifth International
Conference on Algorithmic Aspects in Information and Management (AAIM 2009), pages 301–
314, San Francisco, California, 2009. DOI: 10.1007/978-3-642-02158-9_26 141
[152] Duncan J. Watts and Steven H. Strogatz. Collective dynamics of small-world networks.
Nature, 393:440–442, 1998. DOI: 10.1038/30918 92
[153] Xingzhi Wen and Uzi Vishkin. FPGA-based prototype of a PRAM-On-Chip processor.
In Proceedings of the 5th Conference on Computing Frontiers, pages 55–66, Ischia, Italy, 2008.
DOI: 10.1145/1366230.1366240 14
[154] Tom White. Hadoop: The Definitive Guide. O'Reilly, Sebastopol, California, 2009. 15, 23
[157] Jinxi Xu and W. Bruce Croft. Corpus-based stemming using cooccurrence of word variants.
ACM Transactions on Information Systems, 16(1):61–81, 1998. DOI: 10.1145/267954.267957
48
[158] Rui Xu and Donald Wunsch II. Survey of clustering algorithms. IEEE Transactions on Neural
Networks, 16(3):645–678, 2005. DOI: 10.1109/TNN.2005.845141 86
[159] Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep Kumar
Gunda, and Jon Currey. DryadLINQ: A system for general-purpose distributed data-parallel
computing using a high-level language. In Proceedings of the 8th Symposium on Operating
System Design and Implementation (OSDI 2008), pages 1–14, San Diego, California, 2008.
145
[160] Matei Zaharia, Dhruba Borthakur, Joydeep Sen Sarma, Khaled Elmeleegy, Scott Shenker,
and Ion Stoica. Job scheduling for multi-user MapReduce clusters. Technical Report
UCB/EECS-2009-55, Electrical Engineering and Computer Sciences, University of Cal-
ifornia at Berkeley, 2009. 25
[161] Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz, and Ion Stoica. Improving
MapReduce performance in heterogeneous environments. In Proceedings of the 8th Sympo-
sium on Operating System Design and Implementation (OSDI 2008), pages 29–42, San Diego,
California, 2008. 25
[162] Justin Zobel and Alistair Moffat. Inverted files for text search engines. ACM Computing
Surveys, 38(6):1–56, 2006. 78, 83
Authors' Biographies
JIMMY LIN
Jimmy Lin is an Associate Professor in the iSchool (College of Information Studies) at the Uni-
versity of Maryland, College Park. He directs the recently formed Cloud Computing Center, an
interdisciplinary group that explores the many aspects of cloud computing as it impacts technol-
ogy, people, and society. Lin's research lies at the intersection of natural language processing and
information retrieval, with a recent emphasis on scalable algorithms and large-data processing. He
received his Ph.D. from MIT in Electrical Engineering and Computer Science in 2004.
CHRIS DYER
Chris Dyer is graduating with a Ph.D. in Linguistics from the University of Maryland, College
Park in June 2010 and will be joining the Language Technologies Institute at Carnegie Mellon
University as a postdoctoral researcher. His research interests include statistical machine translation
and machine learning, and he has served as a reviewer for numerous conferences and journals in the
areas of natural language processing and computational linguistics. He first became acquainted with
MapReduce in 2007 using Hadoop, version 0.13.0, and gained further experience with MapReduce
during an internship with Google Research in 2008.