Hadoop
eMag Issue 13 - May 2014
INTRODUCTION P. 3
BUILDING APPLICATIONS WITH HADOOP P. 5
WHAT IS APACHE TEZ? P. 9
MODERN HEALTHCARE ARCHITECTURES BUILT WITH HADOOP P. 14
Introduction
by Roopesh Shenoy and Boris Lublinsky
We are living in the era of “big data”. With increases in computing power, the spread of electronic devices, and ever-easier access to the Internet, more data than ever is being transmitted and collected. Organizations are producing data at an astounding rate: Facebook alone collects 250 terabytes a day. According to Thomson Reuters News Analytics, digital data production has more than doubled since 2009, when it stood at almost one zettabyte (a zettabyte is equal to one million petabytes), and is expected to reach 7.9 zettabytes in 2015 and 35 zettabytes in 2020.
As organizations have begun collecting and producing massive amounts of data, they have started to recognize the advantages of data analysis, but they are also struggling to manage the massive amounts of information they have. According to Alistair Croll, “Companies that have massive amounts of data without massive amounts of clue are going to be displaced by startups that have less data but more clue.…”

Unless your business understands the data it has, it will not be able to compete with businesses that do. Businesses realize that there are tremendous benefits to be gained in analyzing big data related to business competition, situational awareness, productivity, science, and innovation – and most see Hadoop as a main tool for analyzing their massive amounts of information and mastering the big-data challenges.

Apache Hadoop is an open-source framework that runs applications on large clusters of hardware (servers). It is designed to scale from a single server to thousands of machines, with a very high degree of fault tolerance. Rather than relying on high-end hardware, the reliability of these clusters comes from the software’s ability to detect and handle failures on its own.

According to a Hortonworks survey, many large, mainstream organizations (50% of survey respondents were from organizations with over $500M in revenues) currently deploy Hadoop across many industries, including high-tech, healthcare, retail, financial services, government, and manufacturing.

In the majority of cases, Hadoop does not replace existing data-processing systems but rather complements them. It is typically used to supplement existing systems to tap into additional business data and a more powerful analytics system in order to gain a competitive advantage through better insights into business information. Some 54% of respondents
When building applications using Hadoop, it is common to have input data from various sources coming in various formats. In his presentation, “New Tools for Building Applications on Apache Hadoop”, Eli Collins, tech lead for Cloudera’s Platform Team, gives an overview of how to build better products with Hadoop and of the tools that can help, such as Apache Avro, Apache Crunch, Cloudera ML, and the Cloudera Development Kit.
Avro
Avro is a project for data serialization and data formats. It is similar to Thrift or Protocol Buffers. It’s expressive: you can deal in terms of records, arrays, unions, and enums. It’s efficient, with a compact binary representation. One of the benefits of logging in Avro is that you get much smaller data files. All the traditional aspects of Hadoop data formats, like compressible or splittable data, are true of Avro.

One of the reasons Doug Cutting (founder of the Hadoop project) created the Avro project was that a lot of the formats in Hadoop were Java-only. It’s important for Avro to be interoperable – with a lot of different languages like Java, C, C++, C#, Python, Ruby, etc. – and to be usable by a lot of tools.

One of the goals for Avro is a set of formats and serialization that’s usable throughout the data platform that you’re using, not just in a subset of the components. So MapReduce, Pig, Hive, Crunch, Flume, Sqoop, etc. all support Avro.

Avro is dynamic, and one of its neat features is that you can read and write data without generating any code. It will use reflection and look at the schema that you’ve given it to create classes on the fly. That’s called the Avro-generic format. You can also specify formats for which Avro will generate optimal code.

Avro was designed with the expectation that you would change your schema over time. That’s an important attribute in a big-data system, because you generate lots of data and you don’t want to constantly reprocess it. You’re going to generate data at one time and have tools process that data maybe two, three, or four years down the line. Avro has the ability to negotiate differences between schemata so that new tools can read old data and vice versa.
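To make the “no code generation” point concrete, here is a small sketch (not from the original presentation) of writing and reading a record with Avro’s generic API in Java; the User schema and its field names are made up for illustration.

// Minimal sketch: writing and reading a record with Avro's generic API,
// so no generated classes are needed. The schema and fields are illustrative.
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

import java.io.File;

public class AvroGenericExample {
    public static void main(String[] args) throws Exception {
        // A schema defined at runtime, parsed from its JSON representation.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":[" +
            "{\"name\":\"name\",\"type\":\"string\"}," +
            "{\"name\":\"clicks\",\"type\":\"long\"}]}");

        File file = new File("users.avro");

        // Write a record without any generated classes.
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "alice");
        user.put("clicks", 42L);
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, file);
            writer.append(user);
        }

        // Read it back; the writer's schema travels with the data file.
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
            for (GenericRecord rec : reader) {
                System.out.println(rec.get("name") + " -> " + rec.get("clicks"));
            }
        }
    }
}

Because the writer’s schema is stored in the data file, a reader can also supply its own, newer schema and let Avro resolve the differences – which is the schema negotiation described above.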
Avro forms an important basis for the following projects.

Crunch
You’re probably familiar with Pig and Hive, how to process data with them, and how to integrate valuable tools. However, not all data formats that you use will fit Pig and Hive.

Pig and Hive are great for a lot of logged data or relational data, but other data types don’t fit as well. You can still process poorly fitting data with Pig and Hive, which don’t force you into a relational model or a log structure, but you have to do a lot of work around it. You might find yourself writing unwieldy user-defined functions or doing things that are not natural in the language. People sometimes just give up and start writing raw Java MapReduce programs because that’s easier.

Crunch was created to fill this gap. It’s a higher-level API than MapReduce. It’s in Java. It’s lower level than, say, Pig, Hive, Cascading, or other frameworks you might be used to. It’s based on a paper that Google published called FlumeJava, and it’s a very similar API. Crunch has you combine a small number of primitives with a small number of types to effectively let the user create really lightweight UDFs, which are just Java methods and classes, and build complex data pipelines out of them.

Crunch has a number of advantages:

• It’s just Java. You have access to a full programming language.

• You don’t have to learn Pig.

• The type system is well-integrated. You can use Java POJOs, but there’s also native support for Hadoop Writables and Avro. There’s no impedance mismatch between the Java code you’re writing and the data that you’re analyzing.

• It’s built as a modular library for reuse. You can capture your pipelines in Crunch code in Java and then combine them with an arbitrary machine-learning program later, so that someone else can reuse that algorithm.

The fundamental structure is the parallel collection: a distributed, unordered collection of elements. This collection has a parallel-do operator, which you can imagine turns into a MapReduce job. So if you have a bunch of data that you want to operate on in parallel, you can use a parallel collection.

There’s also something called the parallel table, which is a subinterface of the collection, and it’s a distributed, sorted map. It has a group-by operator you can use to aggregate all the values for a given key. We’ll go through an example that shows how that works.

Finally, there’s a pipeline class; pipelines are for coordinating the execution of the MapReduce jobs that will actually do the back-end processing for the Crunch program.

Let’s take an example for which you’ve probably seen all the Java code before, word count, and see what it looks like in Crunch:
pipeline.writeTextFile(counts, args[1]);
pipeline.run();
}
}
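For reference, a complete Crunch word count ending with those same closing lines might look like the following sketch (based on the public Crunch API; the class and variable names are illustrative, not the verbatim listing from the presentation).

// A sketch of the full Crunch word count: read lines, split into words with a
// lightweight DoFn, count them, and write the results out.
import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.types.writable.Writables;

public class WordCount {
  public static void main(String[] args) throws Exception {
    // Create the pipeline and read the input file into a parallel collection of lines.
    Pipeline pipeline = new MRPipeline(WordCount.class);
    PCollection<String> lines = pipeline.readTextFile(args[0]);

    // Split each line into words with an anonymous user-defined function.
    PCollection<String> words = lines.parallelDo(new DoFn<String, String>() {
      @Override
      public void process(String line, Emitter<String> emitter) {
        for (String word : line.split("\\s+")) {
          emitter.emit(word);
        }
      }
    }, Writables.strings());

    // Aggregate the count for each word and write the results out.
    PTable<String, Long> counts = words.count();
    pipeline.writeTextFile(counts, args[1]);
    pipeline.run();
  }
}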
It’s a lot smaller and simpler. The first line creates The listing at the bottom of this page is the same
a pipeline. We create a parallel collection of all the program written in Scala.We have the pipeline and
lines from a given file by using the pipeline class. And we can use Scala’s built-in functions that map really
then we get a collection of words by running the nicely to Crunch – so word count becomes a one-line
parallel do operator on these lines. program. It’s pretty cool and very powerful if you’re
writing Java code already and want to do complex
We’ve got a defined anonymous function here that pipelines.
basically processes the input and word count splits
on the word and emits that word for each map task. Cloudera ML
Cloudera ML (machine learning) is an open-source
Finally, we want to aggregate the counts for each library and tools to help data scientists perform the
word and write them out. There’s a line at the day-to-day tasks, primarily of data preparation to
bottom, pipeline run. Crunch’s planner does lazy model evaluation.
evaluation. We’re going to create and run the
MapReduce jobs until we’ve gotten a full pipeline With built-in commands for summarizing, sampling,
together. normalizing, and pivoting data, Cloudera ML has
recently added a built-in clustering algorithm
If you’re used to programming Java and you’ve seen for k-means, based on an algorithm that was just
the Hadoop examples for writing word count in Java, developed a year or two back. There are a couple of
you can tell that this is a more natural way to express other implementations as well. It’s a home for tools
that. This is among the simplest pipelines you can you can use so you can focus on data analysis and
create, and you can imagine you can do many more modeling instead of on building or wrangling the
complicated things. tools.
If you want to go even one step easier than this, It’s built using Crunch. It leverages a lot of existing
there’s a wrapper for Scala. This is very similar idea to projects. For example, the vector formats: a lot of
Cascade, which was built on Google FlumeJava. Since ML involves transforming raw data that’s in a record
Scala runs on the JVM, it’s an obvious natural fit. format to vector formats for machine-learning
Scala’s type inference actually ends up being really algorithms. It leverages Mahout’s vector interface
powerful in the context of Crunch. and classes for that purpose. The record format is
just a thin wrapper in Avro, and HCatalog is record
class WordCountExample {
val pipeline = new Pipeline[WordCountExample]
Cloudera ML
Cloudera ML (machine learning) is an open-source library and set of tools to help data scientists perform their day-to-day tasks, primarily from data preparation to model evaluation.

With built-in commands for summarizing, sampling, normalizing, and pivoting data, Cloudera ML has recently added a built-in clustering algorithm for k-means, based on an algorithm that was developed just a year or two back. There are a couple of other implementations as well. It’s a home for tools you can use so you can focus on data analysis and modeling instead of on building or wrangling the tools.

It’s built using Crunch and leverages a lot of existing projects. For example, the vector formats: a lot of ML involves transforming raw data that’s in a record format into vector formats for machine-learning algorithms. Cloudera ML leverages Mahout’s vector interfaces and classes for that purpose. The record format is just a thin wrapper around Avro and HCatalog record and schema formats, so you can easily integrate with existing data sources.

For more information on Cloudera ML, visit the project’s GitHub page; there’s a bunch of examples with datasets that can get you started.

Cloudera Development Kit
Like Cloudera ML, the Cloudera Development Kit (CDK) is a set of open-source libraries and tools that make writing applications on Hadoop easier. Unlike ML, though, it’s not focused on machine learning for data scientists; it’s directed at developers trying to build applications on Hadoop. It’s really the plumbing of a lot of different frameworks and pipelines and the integration of a lot of different components.

The purpose of the CDK is to provide higher-level APIs on top of the existing Hadoop components in the CDH stack that codify a lot of patterns in common use cases. It lets you focus on working with a dataset on HDFS instead of all the implementation details. It also has plugin providers for existing systems.

Imagine you’re already using Hive and HCatalog as a metadata repository, and you’ve already got a schema for what these files look like. CDK integrates with that. It doesn’t require you to define all of your metadata for your entire data repository from scratch; it integrates with existing systems.

You can learn more about the various CDK modules and how to use them in the documentation.

In summary, working with data from various sources, preparing and cleansing the data, and processing it via Hadoop involves a lot of work. Tools such as Crunch, Cloudera ML, and the CDK make it easier to do this and to leverage Hadoop more effectively.
You might have heard of Apache Tez, a new distributed execution framework that
is targeted towards data-processing applications on Hadoop. But what exactly is
it? How does it work? Who should use it and why? In their presentation, “Apache
Tez: Accelerating Hadoop Query Processing”, Bikas Saha and Arun Murthy discuss
Tez’s design, highlight some of its features and share some of the initial results
obtained by making Hive use Tez instead of MapReduce.
Tez generalizes the MapReduce paradigm to a more powerful framework based on expressing computations as a dataflow graph. Tez is not meant directly for end users – in fact, it enables developers to build end-user applications with much better performance and flexibility. Hadoop has traditionally been a batch-processing platform for large amounts of data. However, there are a lot of use cases for near-real-time performance of query processing. There are also several workloads, such as machine learning, which do not fit well into the MapReduce paradigm. Tez helps Hadoop address these use cases.

The Tez project aims to be highly customizable so that it can meet a broad spectrum of use cases without forcing people to go out of their way to make things work; projects such as Hive and Pig are seeing significant improvements in response times when they use Tez instead of MapReduce as the backbone for data processing. Tez is built on top of YARN, the new resource-management framework for Hadoop.
The first diagram demonstrates a process that has multiple MR jobs, each storing intermediate results to HDFS – the reducers of the previous step feeding the mappers of the next step. The second diagram shows how, with Tez, the same processing can be done in just one job, with no need to access HDFS in between.

Tez’s flexibility means that it requires a bit more effort than MapReduce to start consuming; there’s a bit more API and a bit more processing logic that you need to implement. This is fine, since it is not an end-user application like MapReduce; it is designed to let developers build end-user applications on top of it.

Given that overview of Tez and its broad goals, let’s try to understand the actual APIs.

Tez API
The Tez API has the following components:

• DAG (directed acyclic graph) – defines the overall job. One DAG object corresponds to one job.

• Vertex – defines the user logic along with the resources and the environment needed to execute it. One Vertex corresponds to one step in the job.

• Edge – defines the connection between producer and consumer vertices.

Edges need to be assigned properties; these properties are essential for Tez to be able to expand that logical graph at runtime into the physical set of tasks that can run in parallel on the cluster. There are several such properties:

• The data-movement property defines how data moves from a producer to a consumer.

• The scheduling property (sequential or concurrent) defines when the producer and consumer tasks can be scheduled relative to each other.

• The data-source property (persisted, reliable, or ephemeral) defines the lifetime or durability of the output produced by a task, so that we can determine when it can be terminated.
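As a concrete illustration of how these pieces fit together, here is a sketch that assembles a two-vertex DAG connected by a shuffle-style edge. It follows the factory-method style of the newer Tez releases (0.5 and later), and the processor class names are hypothetical placeholders rather than real processors.

// A sketch of assembling a two-stage Tez DAG. The processor classes named here
// are hypothetical; a real application would supply its own processor logic.
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.tez.dag.api.DAG;
import org.apache.tez.dag.api.Edge;
import org.apache.tez.dag.api.ProcessorDescriptor;
import org.apache.tez.dag.api.Vertex;
import org.apache.tez.runtime.library.conf.OrderedPartitionedKVEdgeConfig;
import org.apache.tez.runtime.library.partitioner.HashPartitioner;

public class TwoStageDagExample {
  public static DAG buildDag() {
    // One Vertex per step in the job: the user logic plus how many tasks run it.
    Vertex tokenizer = Vertex.create("Tokenizer",
        ProcessorDescriptor.create("com.example.TokenProcessor"));   // hypothetical processor
    Vertex summation = Vertex.create("Summation",
        ProcessorDescriptor.create("com.example.SumProcessor"), 2);  // hypothetical processor

    // The edge configuration captures the edge properties: scatter-gather data
    // movement (a shuffle), sequential scheduling, and persisted output.
    OrderedPartitionedKVEdgeConfig shuffle = OrderedPartitionedKVEdgeConfig
        .newBuilder(Text.class.getName(), IntWritable.class.getName(),
            HashPartitioner.class.getName())
        .build();

    // One DAG object corresponds to one job; vertices are connected by edges.
    return DAG.create("two-stage-example")
        .addVertex(tokenizer)
        .addVertex(summation)
        .addEdge(Edge.create(tokenizer, summation, shuffle.createDefaultEdgeProperty()));
  }
}

Engines such as Hive and Pig build graphs like this programmatically from their query plans rather than by hand.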
You can view this Hortonworks article to see an example of the API in action, more detail about these properties, and how the logical graph expands at runtime.

The runtime API is based on an input-processor-output model which allows all inputs and outputs to be pluggable. To facilitate this, Tez uses an event-based model to communicate between tasks and the system, and between various components. Events are used to pass information such as task failures to the required components, the flow of data from an Output to an Input (such as the location of the data it generates), run-time changes to the DAG execution plan, etc.

Tez also comes with various Input and Output implementations out of the box.

The expressive API allows writers of higher-level languages (such as Hive) to elegantly transform their queries into Tez jobs.

Tez Scheduler
The Tez scheduler considers a lot of things when deciding on task assignments: task-locality requirements, compatibility of containers, total available resources on the cluster, priority of pending task requests, automatic parallelization, freeing up resources that the application cannot use anymore (because the data is not local to it), etc. It also maintains a connection pool of pre-warmed JVMs with shared registry objects. The application can choose to store different kinds of pre-computed information in those shared registry objects so that it can be reused without having to recompute it later on, and this shared set of connections and container-pool resources can run those tasks very fast.

You can read more about the reuse of containers in Apache Tez.

Flexibility
Overall, Tez provides a great deal of flexibility for developers to deal with complex processing logic. This can be illustrated with one example of how Hive is able to leverage Tez.

Let’s take a typical TPC-DS query pattern in which you are joining multiple dimension tables with a fact table. Most optimizers and query systems can do what is shown in the top-right corner: if the dimension tables are
small, then they can broadcast-join all of them with the large fact table, and you can do that same thing on Tez.

But what if these broadcast joins involve user-defined functions that are expensive to compute? You may not be able to do all of it that way. You may have to break your work up into different stages, and that’s what the left-side topology shows. The first dimension table is broadcast-joined with the fact table. The result is then broadcast-joined with the second dimension table.

Here, the third dimension table is not broadcastable because it is too large. You can choose to do a shuffle join, and Tez can efficiently navigate the topology without falling over just because you can’t use the top-right plan.

The two benefits of this kind of Hive query with Tez are:

• It gives you full DAG support and does a lot automatically on the cluster so that it can fully utilize the parallelism that is available; as already discussed above, this means there is no need for reading and writing from HDFS between multiple MR jobs – all the computation can be done in a single Tez job.

• It provides sessions and reusable containers so that you have low latency and can avoid recombination as much as possible.

This particular Hive query is seeing a performance improvement of more than 100% with the new Tez engine.

Roadmap
• Richer DAG support. For example, can Samza use Tez as a substrate on which to build its application? Tez needs some support in order to handle Samza’s core scheduling and streaming requirements. The Tez team wants to explore how to enable those kinds of connection patterns in Tez DAGs. They also want more fault-tolerance support, more efficient data transfer for further performance optimization, and improved session performance.

• Given that these DAGs can get arbitrarily complex, a lot of automatic tooling is needed to help users understand their performance bottlenecks.
Summary
Tez is a distributed execution framework that works on computations represented as dataflow graphs. It maps naturally to higher-level declarative languages like Hive, Pig, Cascading, etc. It’s designed to have a highly customizable execution architecture so that dynamic performance optimizations can be made at runtime based on real information about the data and the resources. The framework itself automatically determines a lot of the hard stuff, allowing it to work right out of the box.

You get good performance and efficiency out of the box. Tez aims to address the broad spectrum of use cases in the data-processing domain in Hadoop, ranging from latency to complexity of the execution. It is an open-source project. Tez works, Saha and Murthy suggest, and is already being used by Hive and Pig.

WATCH THE FULL PRESENTATION ON InfoQ

ABOUT THE SPEAKERS

Arun Murthy is the lead of the MapReduce project in Apache Hadoop, where he has been a full-time contributor since its inception in 2006. He is a long-time committer and member of the Apache Hadoop PMC and jointly holds the current world sorting record using Apache Hadoop. Prior to co-founding Hortonworks, Arun was responsible for all MapReduce code and configuration deployed across the 42,000+ servers at Yahoo!

Bikas Saha has been working on Apache Hadoop for over a year and is a committer on the project. He has been a key contributor in making Hadoop run natively on Windows and has focused on YARN and the Hadoop compute stack. Prior to Hadoop, he worked extensively on the Dryad distributed data-processing framework that runs on some of the world’s largest clusters as part of Microsoft Bing infrastructure.
We have heard plenty in the news lately about healthcare challenges and the
difficult choices faced by hospital administrators, technology and pharmaceutical
providers, researchers, and clinicians. At the same time, consumers are
experiencing increased costs without a corresponding increase in health security or
in the reliability of clinical outcomes.
One key obstacle in the healthcare market is data liquidity (for patients, practitioners, and payers), and some are using Apache Hadoop to overcome this challenge as part of a modern data architecture. This post describes some healthcare use cases, a healthcare reference architecture, and how Hadoop can ease the pain caused by poor data liquidity.

New Value Pathways for Healthcare
In January 2013, McKinsey & Company published a report named “The ‘Big Data’ Revolution in Healthcare”. The report points out how big data is creating value in five “new value pathways”, allowing data to flow more freely. Below we present a summary of these five new value pathways and an example of how Hadoop can be used to address each. Thanks to the Clinical Informatics Group at UC Irvine Health for many of the use cases, described in their UCIH case study.

Source: The ‘Big Data’ Revolution in Healthcare. McKinsey & Company, January 2013.
At Hortonworks, we see our healthcare customers ingest and analyze data from many sources. The following
reference architecture is an amalgam of Hadoop data patterns that we’ve seen with our customers’ use of
Hortonworks Data Platform (HDP). Components shaded green are part of HDP.
Sources of Healthcare Data
Source data comes from:
• Legacy Electronic Medical Records (EMRs)
• Transcriptions
• PACS
• Medication Administration
• Financial
• Laboratory (e.g. SunQuest, Cerner)
• RTLS (for locating medical equipment & patient throughput)
• Bio Repository
• Device Integration (e.g. iSirona)
• Home Devices (e.g. scales and heart monitors)
• Clinical Trials
• Genomics (e.g. 23andMe, Cancer Genomics Hub)
• Radiology (e.g. RadNet)
• Quantified Self Sensors (e.g. Fitbit, SmartSleep)
• Social Media Streams (e.g. FourSquare, Twitter)
That’s where Samza fits in. This is where we can Because LinkedIn has Kafka and because they’ve
process stuff asynchronously, but we’re also not integrated with it for the past few years, a lot of data
waiting for hours. It typically operates in the order of at LinkedIn, almost all of it, is available in a stream
milliseconds to minutes. The idea is to process stuff format as opposed to a data format or on Hadoop.
relatively quickly and get the data back to wherever
it needs to be, whether that’s a downstream system Motivation for Building Samza
or some real-time service. Chris mentions that when they began doing stream
processing, with Kafka and all this data in their
Chris mentions that right now, this stream processing system, they started with something like a web
is the worst-supported in terms of tooling and service that would start up, read messages from
environment. Kafka and do some processing, and then write the
messages back out.
LinkedIn sees a lot of use cases for this type of
processing – As they did this, they realized that there were a lot of
problems that needed to be solved in order to make
• Newsfeed displays when people move to another it really useful and scalable. Things like partitioning:
company, when they like an article, when they how do you partition your stream? How do you
join a group, et cetera. partition your processor? How do you manage state,
where state is defined essentially as something that
News is latency-sensitive and if you use Hadoop to you maintain in your processor between messages,
batch-compute it, you might be getting responses or things like count if you’re incrementing a counter
hours or maybe even a day later. It is important to get every time a message comes in. How do you re-
trending articles in News pretty quickly. process?
• Advertising – getting relevant advertisements, as With failure semantics, you get at least once, at most
well as tracking and monitoring ad display, clicks once, exactly once messaging. There is also non-
and other metrics determinism. If your stream processor is interacting
with another system, whether it’s a database or it’s
• Sophisticated monitoring that allows performing depending on time or the ordering of messages,
of complex querys like “the top five slowest pages how you deal with stuff that actually determines the
for the last minute.” output that you will end up sending?
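As an illustration of that starting point, the “read from Kafka, process, write back out” loop looks something like this (a sketch using the current Kafka Java client rather than the 2014-era API; the topic names and the trivial “enrichment” step are made up).

// A sketch of the naive approach described above: a standalone service that
// consumes from one Kafka topic, does some processing, and produces to another.
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class NaiveStreamProcessor {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("group.id", "naive-processor");
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
         KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      consumer.subscribe(Collections.singletonList("page-views"));
      while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<String, String> record : records) {
          // "Some processing": uppercasing stands in for real enrichment logic.
          String enriched = record.value().toUpperCase();
          producer.send(new ProducerRecord<>("page-views-enriched", record.key(), enriched));
        }
      }
    }
  }
}

Partitioning, state, failure recovery, and reprocessing all have to be bolted onto a loop like this by hand, which is exactly the gap Samza sets out to fill.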
end – say, Message 6 of Partition 0 – you would just try to read the next message when it becomes available. It’s not the case that you’re finished.

This definition maps very well to Kafka, which LinkedIn uses as the streaming infrastructure for Samza.

There are many concepts to understand within Samza. In a gist, they are:

• Streams – Samza processes streams. A stream is composed of immutable messages of a similar type or category. The actual implementation can be provided by a messaging system such as Kafka (where each topic becomes a Samza stream), a database (a table), or even Hadoop (a directory of files in HDFS). Things like message ordering and batching are handled via streams.

• Jobs – A Samza job is code that performs a logical transformation on a set of input streams to append messages to a set of output streams.

• State management – Data that needs to be passed between the processing of different messages can be called state. This can be something as simple as keeping a count or something a lot more complex. Samza allows tasks to maintain persistent, mutable, queryable state that is physically co-located with each task. The state is highly available: in the event of a task failure, it will be restored when the task fails over to another machine. The datastore is pluggable, but Samza comes with a key-value store out of the box.

• YARN (Yet Another Resource Negotiator) is Hadoop v2’s biggest improvement over v1: it separates the MapReduce job tracker from resource management and enables MapReduce alternatives to use the same resource manager. Samza utilizes YARN for cluster management, tracking failures, etc. Samza provides a YARN ApplicationMaster and a YARN job runner out of the box.
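To give a feel for what a job’s user logic looks like, here is a minimal sketch of a Samza StreamTask using the classic low-level API; the “kafka” system name and “words” output stream are illustrative, not taken from the presentation.

// A minimal Samza StreamTask: consumes messages from its configured input
// stream(s), splits each message into words, and emits them to an output stream.
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

public class WordSplitterTask implements StreamTask {
  private static final SystemStream OUTPUT = new SystemStream("kafka", "words");

  @Override
  public void process(IncomingMessageEnvelope envelope,
                      MessageCollector collector,
                      TaskCoordinator coordinator) {
    // Input streams are named in the job's configuration, not in code.
    String message = (String) envelope.getMessage();
    for (String word : message.split("\\s+")) {
      collector.send(new OutgoingMessageEnvelope(OUTPUT, word));
    }
  }
}

The job is then wired up with a small properties file that names the task class, its input streams, and the YARN job factory, and Samza’s YARN support handles deploying and supervising the containers.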
Conclusion
Samza is a relatively young project incubating at Apache, so there’s a lot of room to get involved. A good way to get started is with the hello-samza project, which is a little thing that will get you up and running in about five minutes. It lets you play with a real-time change log from the Wikipedia servers, so you can figure out what’s going on and have a stream of stuff to play with.

The other stream-processing project built on top of Hadoop is Storm. You can see a comparison between Samza and Storm.

ABOUT THE SPEAKER

Chris Riccomini is a staff software engineer at LinkedIn, where he is currently working as a committer and PMC member for Apache Samza. He’s been involved in a wide range of projects at LinkedIn, including “People You May Know”, REST.li, Hadoop, engineering tooling, and OLAP systems. Prior to LinkedIn, he worked on data visualization and fraud modeling at PayPal.