
FACILITATING THE SPREAD OF KNOWLEDGE AND INNOVATION IN PROFESSIONAL SOFTWARE DEVELOPMENT

Hadoop
eMag Issue 13 - May 2014

How LinkedIn Uses Apache Samza


Apache Samza is a stream processor LinkedIn recently open-sourced. In his presentation, Samza: Real-
time Stream Processing at LinkedIn, Chris Riccomini discusses Samza’s feature set, how Samza integrates
with YARN and Kafka, how it’s used at LinkedIn, and what’s next on the roadmap. PAGE 17

INTRODUCTION P. 3
BUILDING APPLICATIONS WITH HADOOP P. 5
WHAT IS APACHE TEZ? P. 9
MODERN HEALTHCARE ARCHITECTURES BUILT WITH HADOOP P. 14
Contents

Introduction Page 3
Apache Hadoop is an open-source framework that runs applications on large clustered
hardware (servers). It is designed to scale from a single server to thousands of machines, with
a very high degree of fault tolerance.

Building Applications With Hadoop Page 5


When building applications using Hadoop, it is common to have input data from various
sources coming in various formats. In his presentation, “New Tools for Building Applications
on Apache Hadoop”, Eli Collins overviews how to build better products with Hadoop and
various tools that can help, such as Apache Avro, Apache Crunch, Cloudera ML and the
Cloudera Development Kit.

What is Apache Tez? Page 9


Apache Tez is a new distributed execution framework that is targeted towards data-
processing applications on Hadoop. But what exactly is it? How does it work? In the
presentation, “Apache Tez: Accelerating Hadoop Query Processing”, Bikas Saha and Arun
Murthy discuss Tez’s design, highlight some of its features and share initial results obtained
by making Hive use Tez instead of MapReduce.

Modern Healthcare Architectures Built with Hadoop Page 14


This article explores some specific use cases where Hadoop can play a major role in the health
care industry, as well as a possible reference architecture.

How LinkedIn Uses Apache Samza Page 17


Apache Samza is a stream processor LinkedIn recently open-sourced. In his presentation,
Samza: Real-time Stream Processing at LinkedIn, Chris Riccomini discusses Samza’s feature
set, how Samza integrates with YARN and Kafka, how it’s used at LinkedIn, and what’s next
on the roadmap.

Introduction
by Roopesh Shenoy and Boris Lublinsky

content extracted from this InfoQ news post

We are living in the era of “big data”. With today’s technology powering increases
in computing power, electronic devices, and accessibility to the Internet, more data
than ever is being transmitted and collected. Organizations are producing data
at an astounding rate. Facebook alone collects 250 terabytes a day. According to
Thomson Reuters News Analytics, digital data production has more than doubled
from almost one zettabyte (a zettabyte is equal to 1 million petabytes) in 2009 and
is expected to reach 7.9 zettabytes in 2015, and 35 zettabytes in 2020.
As organizations have begun collecting and producing massive amounts of data, they have started to recognize the advantages of data analysis, but they are also struggling to manage the massive amounts of information they have. According to Alistair Croll, "Companies that have massive amounts of data without massive amounts of clue are going to be displaced by startups that have less data but more clue.…"

Unless your business understands the data it has, it will not be able to compete with businesses that do. Businesses realize that there are tremendous benefits to be gained in analyzing big data related to business competition, situational awareness, productivity, science, and innovation – and most see Hadoop as a main tool for analyzing their massive amounts of information and mastering the big-data challenges.

Apache Hadoop is an open-source framework that runs applications on large clustered hardware (servers). It is designed to scale from a single server to thousands of machines, with a very high degree of fault tolerance. Rather than relying on high-end hardware, the reliability of these clusters comes from the software's ability to detect and handle failures of its own.

According to a Hortonworks survey, many large, mainstream organizations (50% of survey respondents were from organizations with over $500M in revenues) currently deploy Hadoop across many industries including high-tech, healthcare, retail, financial services, government, and manufacturing.

In the majority of cases, Hadoop does not replace existing data-processing systems but rather complements them. It is typically used to supplement existing systems to tap into additional business data and a more powerful analytics system in order to get a competitive advantage through better insights in business information. Some 54% of respondents are utilizing Hadoop to capture new types of data, while 48% are planning to do the same.
The main new data types include the following:

• Server-log data enabling IT departments to better manage their infrastructure (64% of respondents are already doing it, while 28% plan to).

• Clickstream data enabling better understanding of how customers are using applications (52.3% of respondents are already doing it, while 37.4% plan to).

• Social-media data enabling understanding of the public's perception of the company (36.5% of respondents are already doing it, while 32.5% plan to).

• Geolocation data enabling analysis of travel patterns (30.8% of respondents are already doing it, while 26.8% plan to).

• Machine data enabling analysis of machine usage (29.3% of respondents are already doing it, while 33.3% plan to).

According to the survey, traditional data grows at an average rate of about 8% a year, but new data types are growing at a rate exceeding 85% and, as a result, it is virtually impossible to collect and process them without Hadoop.

While Version 1 of Hadoop came with the MapReduce processing model, the recently released Hadoop Version 2 comes with YARN, which separates cluster management from the MapReduce job manager, making the Hadoop architecture more modular. One of the side effects is that MapReduce is now just one of the ways to leverage a Hadoop cluster; a number of new projects such as Tez and Stinger can also process data in the Hadoop Distributed File System (HDFS) without the constraint of the batch-job execution model of MapReduce.


Building Applications With Hadoop


Presentation transcript edited by Roopesh Shenoy

When building applications using Hadoop, it is common to have input data from
various sources coming in various formats. In his presentation, “New Tools for
Building Applications on Apache Hadoop”, Eli Collins, tech lead for Cloudera’s
Platform Team, overviews how to build better products with Hadoop and various
tools that can help, such as Apache Avro, Apache Crunch, Cloudera ML and the
Cloudera Development Kit.
Avro
Avro is a project for data serialization formats. It is similar to Thrift or Protocol Buffers. It's expressive: you can deal in terms of records, arrays, unions, and enums. It's efficient: it has a compact binary representation, so one of the benefits of logging in Avro is that you get much smaller data files. All the traditional aspects of Hadoop data formats, like compressible or splittable data, are true of Avro.

One of the reasons Doug Cutting (founder of the Hadoop project) created the Avro project was that a lot of the formats in Hadoop were Java only. It's important for Avro to be interoperable – across a lot of different languages like Java, C, C++, C#, Python, Ruby, etc. – and to be usable by a lot of tools.

One of the goals for Avro is a set of formats and serialization that's usable throughout the data platform that you're using, not just in a subset of the components. So MapReduce, Pig, Hive, Crunch, Flume, Sqoop, etc. all support Avro.

Avro is dynamic, and one of its neat features is that you can read and write data without generating any code. It will use reflection and look at the schema that you've given it to create classes on the fly. That's called the Avro-generic format. You can also specify formats for which Avro will generate optimal code.

Avro was designed with the expectation that you would change your schema over time. That's an important attribute in a big-data system because you generate lots of data, and you don't want to constantly reprocess it. You're going to generate data at one time and have tools process that data maybe two, three, or four years down the line. Avro has the ability to negotiate differences between schemata so that new tools can read old data and vice versa.

Avro forms an important basis for the following projects.
Crunch
You're probably familiar with Pig and Hive and how to process data with them and integrate valuable tools.


However, not all data formats that you use will fit Pig and Hive. Pig and Hive are great for a lot of logged data or relational data, but other data types don't fit as well. You can still process poorly fitting data with Pig and Hive, which don't force you to a relational model or a log structure, but you have to do a lot of work around it. You might find yourself writing unwieldy user-defined functions or doing things that are not natural in the language. People sometimes just give up and start writing raw Java MapReduce programs because that's easier.

Crunch was created to fill this gap. It's a higher-level API than MapReduce. It's in Java. It's lower level than, say, Pig, Hive, Cascading, or other frameworks you might be used to. It's based on a paper that Google published called FlumeJava, and it's a very similar API. Crunch has you combine a small number of primitives with a small number of types, and effectively allows the user to create really lightweight UDFs, which are just Java methods and classes, to create complex data pipelines.

Crunch has a number of advantages:

• It's just Java. You have access to a full programming language.

• You don't have to learn Pig.

• The type system is well-integrated. You can use Java POJOs, but there's also native support for Hadoop Writables and Avro. There's no impedance mismatch between the Java code you're writing and the data that you're analyzing.

• It's built as a modular library for reuse. You can capture your pipelines in Crunch code in Java and then combine it with an arbitrary machine-learning program later, so that someone else can reuse that algorithm.

The fundamental structure is a parallel collection, so it's a distributed, unordered collection of elements. This collection has a parallel-do operator which you can imagine turns into a MapReduce job. So if you have a bunch of data that you want to operate on in parallel, you can use a parallel collection.

And there's something called the parallel table, which is a subinterface of the collection, and it's a distributed sorted map. It also has a group-by operator you can use to aggregate all the values for a given key. We'll go through an example that shows how that works.

Finally, there's a pipeline class, and pipelines are really for coordinating the execution of the MapReduce jobs that will actually do the back-end processing for this Crunch program.

Let's take an example for which you've probably seen all the Java code before, word count, and see what it looks like in Crunch.

Crunch – word count

import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.lib.Aggregate;
import org.apache.crunch.types.writable.Writables;

public class WordCount {

  public static void main(String[] args) throws Exception {
    // The pipeline coordinates the MapReduce jobs that do the back-end processing.
    Pipeline pipeline = new MRPipeline(WordCount.class);
    PCollection<String> lines = pipeline.readTextFile(args[0]);

    // Split each line into words with a lightweight, anonymous DoFn.
    PCollection<String> words = lines.parallelDo("my splitter", new DoFn<String, String>() {
      public void process(String line, Emitter<String> emitter) {
        for (String word : line.split("\\s+")) {
          emitter.emit(word);
        }
      }
    }, Writables.strings());

    // Aggregate the count for each word.
    PTable<String, Long> counts = Aggregate.count(words);

    pipeline.writeTextFile(counts, args[1]);
    pipeline.run();
  }
}


It’s a lot smaller and simpler. The first line creates The listing at the bottom of this page is the same
a pipeline. We create a parallel collection of all the program written in Scala.We have the pipeline and
lines from a given file by using the pipeline class. And we can use Scala’s built-in functions that map really
then we get a collection of words by running the nicely to Crunch – so word count becomes a one-line
parallel do operator on these lines. program. It’s pretty cool and very powerful if you’re
writing Java code already and want to do complex
We’ve got a defined anonymous function here that pipelines.
basically processes the input and word count splits
on the word and emits that word for each map task. Cloudera ML
Cloudera ML (machine learning) is an open-source
Finally, we want to aggregate the counts for each library and tools to help data scientists perform the
word and write them out. There’s a line at the day-to-day tasks, primarily of data preparation to
bottom, pipeline run. Crunch’s planner does lazy model evaluation.
evaluation. We’re going to create and run the
MapReduce jobs until we’ve gotten a full pipeline With built-in commands for summarizing, sampling,
together. normalizing, and pivoting data, Cloudera ML has
recently added a built-in clustering algorithm
If you’re used to programming Java and you’ve seen for k-means, based on an algorithm that was just
the Hadoop examples for writing word count in Java, developed a year or two back. There are a couple of
you can tell that this is a more natural way to express other implementations as well. It’s a home for tools
that. This is among the simplest pipelines you can you can use so you can focus on data analysis and
create, and you can imagine you can do many more modeling instead of on building or wrangling the
complicated things. tools.

If you want to go even one step easier than this, It’s built using Crunch. It leverages a lot of existing
there’s a wrapper for Scala. This is very similar idea to projects. For example, the vector formats: a lot of
Cascade, which was built on Google FlumeJava. Since ML involves transforming raw data that’s in a record
Scala runs on the JVM, it’s an obvious natural fit. format to vector formats for machine-learning
Scala’s type inference actually ends up being really algorithms. It leverages Mahout’s vector interface
powerful in the context of Crunch. and classes for that purpose. The record format is
just a thin wrapper in Avro, and HCatalog is record

Scrunch – Scala wrapper

Based on Google’s Cascade project

class WordCountExample {
  val pipeline = new Pipeline[WordCountExample]

  def wordCount(fileName: String) = {
    pipeline.read(from.textFile(fileName))
            .flatMap(_.toLowerCase.split("\\W+"))
            .filter(!_.isEmpty())
            .count
  }
}


Cloudera ML
Cloudera ML (machine learning) is an open-source library and set of tools that help data scientists perform day-to-day tasks, primarily from data preparation to model evaluation.

With built-in commands for summarizing, sampling, normalizing, and pivoting data, Cloudera ML has recently added a built-in clustering algorithm for k-means, based on an algorithm that was developed just a year or two back. There are a couple of other implementations as well. It's a home for tools you can use so you can focus on data analysis and modeling instead of on building or wrangling the tools.

It's built using Crunch. It leverages a lot of existing projects. For example, the vector formats: a lot of ML involves transforming raw data that's in a record format to vector formats for machine-learning algorithms. It leverages Mahout's vector interface and classes for that purpose. The record format is just a thin wrapper in Avro, and HCatalog is used for the record and schema formats so you can easily integrate with existing data sources.

For more information on Cloudera ML, visit the project's GitHub page; there's a bunch of examples with datasets that can get you started.

Cloudera Development Kit
Like Cloudera ML, the Cloudera Development Kit (CDK) is a set of open-source libraries and tools that make writing applications on Hadoop easier. Unlike ML, though, it's not focused on machine learning for data scientists. It's directed at developers trying to build applications on Hadoop. It's really the plumbing of a lot of different frameworks and pipelines and the integration of a lot of different components.

The purpose of the CDK is to provide higher-level APIs on top of the existing Hadoop components in the CDH stack that codify a lot of patterns in common use cases.

CDK is prescriptive, has an opinion on the way to do things, and tries to make it easy for you to do the right thing by default, but its architecture is a system of loosely coupled modules. You can use modules independently of each other. It's not an uber-framework that you have to adopt whole; you can adopt it piecemeal. It doesn't force you into any particular programming paradigms. It doesn't force you to adopt a ton of dependencies. You can adopt only the dependencies of the particular modules you want.

Let's look at an example. The first module in CDK is the data module, and the goal of the data module is to make it easier for you to work with datasets on Hadoop file systems. There are a lot of gory details to get right to make this work in practice: you have to worry about serialization, deserialization, compression, partitioning, directory layout, and communicating that directory layout and partitioning to other people who want to consume the data, etc.

The CDK data module handles all this for you. It automatically serializes and deserializes data from Java POJOs, if that's what you have, or Avro records if you use them. It has built-in compression, and built-in policies around file and directory layouts so that you don't have to repeat a lot of these decisions and you get smart policies out of the box. It will automatically partition data within those layouts. It lets you focus on working on a dataset on HDFS instead of all the implementation details. It also has plugin providers for existing systems.

Imagine you're already using Hive and HCatalog as a metadata repository, and you've already got a schema for what these files look like. CDK integrates with that. It doesn't require you to define all of your metadata for your entire data repository from scratch. It integrates with existing systems.

You can learn more about the various CDK modules and how to use them in the documentation.

In summary, working with data from various sources, preparing and cleansing the data, and processing it via Hadoop involves a lot of work. Tools such as Crunch, Cloudera ML, and CDK make it easier to do this and leverage Hadoop more effectively.

ABOUT THE SPEAKER

Eli Collins is the tech lead for Cloudera's Platform team, an active contributor to Apache Hadoop and member of its project management committee (PMC) at the Apache Software Foundation. Eli holds Bachelor's and Master's degrees in Computer Science from New York University and the University of Wisconsin-Madison, respectively.

WATCH THE FULL PRESENTATION ON InfoQ


What is Apache Tez?


Presentation transcript edited by Roopesh Shenoy

You might have heard of Apache Tez, a new distributed execution framework that
is targeted towards data-processing applications on Hadoop. But what exactly is
it? How does it work? Who should use it and why? In their presentation, “Apache
Tez: Accelerating Hadoop Query Processing”, Bikas Saha and Arun Murthy discuss
Tez’s design, highlight some of its features and share some of the initial results
obtained by making Hive use Tez instead of MapReduce.
Tez generalizes the MapReduce paradigm to a more powerful framework based on expressing computations as a dataflow graph. Tez is not meant directly for end users – in fact it enables developers to build end-user applications with much better performance and flexibility. Hadoop has traditionally been a batch-processing platform for large amounts of data. However, there are a lot of use cases for near-real-time performance of query processing. There are also several workloads, such as machine learning, which do not fit well into the MapReduce paradigm. Tez helps Hadoop address these use cases.

The Tez project aims to be highly customizable so that it can meet a broad spectrum of use cases without forcing people to go out of their way to make things work; projects such as Hive and Pig are seeing significant improvements in response times when they use Tez instead of MapReduce as the backbone for data processing. Tez is built on top of YARN, which is the new resource-management framework for Hadoop.


Design Philosophy
The main reason for Tez to exist is to get around limitations imposed by MapReduce. Other than being limited to writing mappers and reducers, there are other inefficiencies in force-fitting all kinds of computations into this paradigm – for example, HDFS is used to store temporary data between multiple MR jobs, which is an overhead. (In Hive, this is common when queries require multiple shuffles on keys without correlation, such as with join - group by - window function - order by.)

The key elements forming the design philosophy behind Tez are:

• Empowering developers (and hence end users) to do what they want in the most efficient manner

• Better execution performance

Some of the things that help Tez achieve these goals are:

• Expressive Dataflow APIs – the Tez team wants to have an expressive-dataflow-definition API so that you can describe the Directed Acyclic Graph (DAG) of computation that you want to run. For this, Tez has a structural kind of API in which you add all processors and edges and visualize what you are actually constructing.

• Flexible Input-Processor-Output runtime model – you can construct runtime executors dynamically by connecting different inputs, processors, and outputs.

• Data-type agnostic – Tez is only concerned with the movement of data, not with the data format (key-value pairs, tuple-oriented formats, etc.).

• Dynamic Graph Reconfiguration

• Simple Deployment – Tez is completely a client-side application; it leverages YARN local resources and the distributed cache. There's no need to deploy anything on your cluster as far as using Tez is concerned. You just upload the relevant Tez libraries to HDFS, then use your Tez client to submit with those libraries.

You can even have two copies of the libraries on your cluster. One would be a production copy, which is the stable version and which all your production jobs use. Your users can experiment with a second copy, the latest version of Tez. And they will not interfere with each other.

Tez can run any MR job without any modification. This allows for stage-wise migration of tools that currently depend on MR.

Exploring the Expressive Dataflow APIs in detail – what can you do with this? For example, instead of using multiple MapReduce jobs, you can use the MRR pattern, such that a single map has multiple reduce stages; this can allow streaming of data from one processor to another to another, without writing anything to HDFS (it will be written to disk only for check-pointing), leading to much better performance. The diagrams below demonstrate this.

The first diagram demonstrates a process that has multiple MR jobs, each storing intermediate results to HDFS – the reducers of the previous step feeding the mappers of the next step. The second diagram shows how, with Tez, the same processing can be done in just one job, with no need to access HDFS in between.

Tez's flexibility means that it requires a bit more effort than MapReduce to start consuming; there's a bit more API and a bit more processing logic that you need to implement. This is fine since it is not an end-user application like MapReduce; it is designed to let developers build end-user applications on top of it.

Given that overview of Tez and its broad goals, let's try to understand the actual APIs.

Tez API
The Tez API has the following components:

• DAG (Directed Acyclic Graph) – defines the overall job. One DAG object corresponds to one job.

• Vertex – defines the user logic along with the resources and the environment needed to execute the user logic. One Vertex corresponds to one step in the job.

• Edge – defines the connection between producer and consumer vertices.

Edges need to be assigned properties; these properties are essential for Tez to be able to expand that logical graph at runtime into the physical set of tasks that can be done in parallel on the cluster. There are several such properties:

• The data-movement property defines how data moves from a producer to a consumer.

• Scheduling properties (sequential or concurrent) help define when the producer and consumer tasks can be scheduled relative to each other.

• The data-source property (persisted, reliable, or ephemeral) defines the lifetime or durability of the output produced by our task so that we can determine when we can terminate it.
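To give a feel for this structural API, here is a hedged sketch of assembling a two-vertex DAG connected by a scatter-gather (shuffle-style) edge, roughly following the DAG, Vertex, and Edge components described above. The factory-method style shown matches more recent Tez releases, the processor and input/output class names are placeholders rather than real Tez classes, and exact signatures vary between versions.

Tez – assembling a DAG (illustrative sketch)

import org.apache.tez.dag.api.DAG;
import org.apache.tez.dag.api.Edge;
import org.apache.tez.dag.api.EdgeProperty;
import org.apache.tez.dag.api.EdgeProperty.DataMovementType;
import org.apache.tez.dag.api.EdgeProperty.DataSourceType;
import org.apache.tez.dag.api.EdgeProperty.SchedulingType;
import org.apache.tez.dag.api.InputDescriptor;
import org.apache.tez.dag.api.OutputDescriptor;
import org.apache.tez.dag.api.ProcessorDescriptor;
import org.apache.tez.dag.api.Vertex;

public class DagSketch {

  public static DAG buildDag() {
    // Two vertices: each wraps a processor (the user logic for that step).
    // The processor class names are placeholders, not real Tez classes.
    Vertex tokenizer = Vertex.create("tokenizer",
        ProcessorDescriptor.create("com.example.TokenizerProcessor"), 4);
    Vertex summer = Vertex.create("summer",
        ProcessorDescriptor.create("com.example.SumProcessor"), 2);

    // Edge properties tell Tez how data moves (scatter-gather = shuffle-style),
    // how durable the intermediate output is, and how the consumer is scheduled.
    EdgeProperty shuffle = EdgeProperty.create(
        DataMovementType.SCATTER_GATHER,
        DataSourceType.PERSISTED,
        SchedulingType.SEQUENTIAL,
        OutputDescriptor.create("com.example.PartitionedOutput"),   // placeholder
        InputDescriptor.create("com.example.ShuffleInput"));        // placeholder

    // Assemble the logical graph: add the vertices, then the edge between them.
    return DAG.create("word-count-dag")
        .addVertex(tokenizer)
        .addVertex(summer)
        .addEdge(Edge.create(tokenizer, summer, shuffle));
  }
}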
You can view this Hortonworks article to see an example of the API in action, more detail about these properties, and how the logical graph expands at runtime.

The runtime API is based on an input-processor-output model which allows all inputs and outputs to be pluggable. To facilitate this, Tez uses an event-based model to communicate between tasks and the system, and between various components. Events are used to pass information such as task failures to the required components, the flow of data from an Output to an Input (such as the location of the data it generates), run-time changes to the DAG execution plan, etc.

Tez also comes with various Input and Output processors out-of-the-box.

The expressive API allows higher-level-language (such as Hive) writers to elegantly transform their queries into Tez jobs.

Tez Scheduler
The Tez scheduler considers a lot of things when deciding on task assignments – task-locality requirements, compatibility of containers, total available resources on the cluster, priority of pending task requests, automatic parallelization, freeing up resources that the application cannot use anymore (because the data is not local to it), etc. It also maintains a connection pool of pre-warmed JVMs with shared registry objects. The application can choose to store different kinds of pre-computed information in those shared registry objects so that they can be reused without having to recompute them later on, and this shared set of connections and container-pool resources can run those tasks very fast.

You can read more about reusing of containers in Apache Tez.

Flexibility
Overall, Tez provides a great deal of flexibility for developers to deal with complex processing logic. This can be illustrated with one example of how Hive is able to leverage Tez.

Let's take this typical TPC-DS query pattern in which you are joining multiple tables with a fact table. Most optimizers and query systems can do what is there in the top-right corner: if the dimension tables are
small, then they can broadcast-join all of them with the large fact table, and you can do that same thing on Tez.

But what if these broadcasts have user-defined functions that are expensive to compute? You may not be able to do all of that this way. You may have to break up your tasks into different stages, and that's what the left-side topology shows you. The first dimension table is broadcast-joined with the fact table. The result is then broadcast-joined with the second dimension table.

Here, the third dimension table is not broadcastable because it is too large. You can choose to do a shuffle join, and Tez can efficiently navigate the topology without falling over just because you can't do the top-right one.

The two benefits for this kind of Hive query with Tez are:

• It gives you full DAG support and does a lot automatically on the cluster so that it can fully utilize the parallelism that is available in the cluster; as already discussed above, this means there is no need for reading/writing from HDFS between multiple MR jobs – all the computation can be done in a single Tez job.

• It provides sessions and reusable containers so that you have low latency and can avoid recombination as much as possible.

This particular Hive query is seeing a performance improvement of more than 100% with the new Tez engine.

Roadmap
• Richer DAG support. For example, can Samza use Tez as a substrate on which to build the application? It needs some support in order for Tez to handle Samza's core scheduling and streaming requirements. The Tez team wants to explore how they would enable those kinds of connection patterns in their DAGs. They also want more fault-tolerance support, more efficient data transfer for further performance optimization, and improved session performance.

• Given that these DAGs can get arbitrarily complex, a lot of automatic tooling is needed to help users understand their performance bottlenecks.


Summary
Tez is a distributed execution framework that works on computations represented as dataflow graphs. It maps naturally to higher-level declarative languages like Hive, Pig, Cascading, etc. It's designed to have a highly customizable execution architecture so that we can make dynamic performance optimizations at runtime based on real information about the data and the resources. The framework itself automatically determines a lot of the hard stuff, allowing it to work right out-of-the-box.

You get good performance and efficiency out-of-the-box. Tez aims to address the broad spectrum of use cases in the data-processing domain in Hadoop, ranging from latency to complexity of the execution. It is an open-source project. Tez works, Saha and Murthy suggest, and is already being used by Hive and Pig.

ABOUT THE SPEAKERS

Arun Murthy is the lead of the MapReduce project in Apache Hadoop, where he has been a full-time contributor to Apache Hadoop since its inception in 2006. He is a long-time committer and member of the Apache Hadoop PMC and jointly holds the current world sorting record using Apache Hadoop. Prior to co-founding Hortonworks, Arun was responsible for all MapReduce code and configuration deployed across the 42,000+ servers at Yahoo!

Bikas Saha has been working on Apache Hadoop for over a year and is a committer on the project. He has been a key contributor in making Hadoop run natively on Windows and has focused on YARN and the Hadoop compute stack. Prior to Hadoop, he worked extensively on the Dryad distributed data-processing framework that runs on some of the world's largest clusters as part of the Microsoft Bing infrastructure.

WATCH THE FULL PRESENTATION ON InfoQ


Modern Healthcare Architectures Built with Hadoop
by Justin Sears

We have heard plenty in the news lately about healthcare challenges and the
difficult choices faced by hospital administrators, technology and pharmaceutical
providers, researchers, and clinicians. At the same time, consumers are
experiencing increased costs without a corresponding increase in health security or
in the reliability of clinical outcomes.
One key obstacle in the healthcare market is data liquidity (for patients, practitioners, and payers), and some are using Apache Hadoop to overcome this challenge, as part of a modern data architecture. This post describes some healthcare use cases, a healthcare reference architecture, and how Hadoop can ease the pain caused by poor data liquidity.

New Value Pathways for Healthcare
In January 2013, McKinsey & Company published a report named "The 'Big Data' Revolution in Healthcare". The report points out how big data is creating value in five "new value pathways", allowing data to flow more freely. Below we present a summary of these five new value pathways and an example of how Hadoop can be used to address each. Thanks to the Clinical Informatics Group at UC Irvine Health for many of the use cases, described in their UCIH case study.

Pathway: Right Living
Benefit: Patients can build value by taking an active role in their own treatment, including disease prevention.
Hadoop Use Case: Predictive Analytics – Heart patients weigh themselves at home with scales that transmit data wirelessly to their health center. Algorithms analyze the data and flag patterns that indicate a high risk of readmission, alerting a physician.

Pathway: Right Care
Benefit: Patients get the most timely, appropriate treatment available.
Hadoop Use Case: Real-time Monitoring – Patient vital statistics are transmitted from wireless sensors every minute. If vital signs cross certain risk thresholds, staff can attend to the patient immediately.

Pathway: Right Provider
Benefit: Provider skill sets matched to the complexity of the assignment – for instance, nurses or physicians' assistants performing tasks that do not require a doctor. Also the specific selection of the provider with the best outcomes.
Hadoop Use Case: Historical EMR Analysis – Hadoop reduces the cost to store data on clinical operations, allowing longer retention of data on staffing decisions and clinical outcomes. Analysis of this data allows administrators to promote individuals and practices that achieve the best results.

Pathway: Right Value
Benefit: Ensure cost-effectiveness of care, such as tying provider reimbursement to patient outcomes, or eliminating fraud, waste, or abuse in the system.
Hadoop Use Case: Medical Device Management – For biomedical device maintenance, use geolocation and sensor data to manage medical equipment. The biomedical team can know where all the equipment is, so they don't waste time searching for an item. Over time, determine the usage of different devices, and use this information to make rational decisions about when to repair or replace equipment.

Pathway: Right Innovation
Benefit: The identification of new therapies and approaches to delivering care, across all aspects of the system. Also improving the innovation engines themselves.
Hadoop Use Case: Research Cohort Selection – Researchers at teaching hospitals can access patient data in Hadoop for cohort discovery, then present the anonymous sample cohort to their Internal Review Board for approval, without ever having seen uniquely identifiable information.

Source: The 'Big Data' Revolution in Healthcare. McKinsey & Company, January 2013.

At Hortonworks, we see our healthcare customers ingest and analyze data from many sources. The following
reference architecture is an amalgam of Hadoop data patterns that we’ve seen with our customers’ use of
Hortonworks Data Platform (HDP). Components shaded green are part of HDP.

Sources of Healthcare Data
Source data comes from:

• Legacy Electronic Medical Records (EMRs)
• Transcriptions
• PACS
• Medication Administration
• Financial
• Laboratory (e.g. SunQuest, Cerner)
• RTLS (for locating medical equipment & patient throughput)
• Bio Repository
• Device Integration (e.g. iSirona)
• Home Devices (e.g. scales and heart monitors)
• Clinical Trials
• Genomics (e.g. 23andMe, Cancer Genomics Hub)
• Radiology (e.g. RadNet)
• Quantified Self Sensors (e.g. Fitbit, SmartSleep)
• Social Media Streams (e.g. FourSquare, Twitter)


Loading Healthcare Data
Apache Sqoop is included in Hortonworks Data Platform as a tool to transfer data between external structured data stores (such as Teradata, Netezza, MySQL, or Oracle) and HDFS or related systems like Hive and HBase. We also see our customers using other tools or standards for loading healthcare data into Hadoop. Some of these are:

• Health Level 7 (HL7) International Standards
• Apache UIMA
• Java ETL rules

Processing Healthcare Data
Depending on the use case, healthcare organizations process data in batch (using Apache Hadoop MapReduce and Apache Pig); interactively (with Apache Hive); online (with Apache HBase); or streaming (with Apache Storm).

Analyzing Healthcare Data
Once data is stored and processed in Hadoop, it can either be analyzed in the cluster or exported to relational data stores for analysis there. These data stores might include:

• Enterprise data warehouse
• Quality data mart
• Surgical data mart
• Clinical info data mart
• Diagnosis data mart
• Neo4j graph database

Many data analysis and visualization applications can also work with the data directly in Hadoop. Hortonworks healthcare customers typically use the following business intelligence and visualization tools to inform their decisions:

• Microsoft Excel
• Tableau
• RESTful Web Services
• EMR Real-time analytics
• Metric Insights
• Patient Scorecards
• Research Portals
• Operational Dashboards
• Quality Dashboards

The following diagram shows how healthcare organizations can integrate Hadoop into their existing data architecture to create a modern data architecture that is interoperable and familiar, so that the same team of analysts and practitioners can use their existing skills in new ways.

As more and more healthcare organizations adopt Hadoop to disseminate data to their teams and partners, they empower caregivers to combine their training, intuition, and professional experience with big data to make data-driven decisions that cure patients and reduce costs.

Watch our blog in the coming weeks as we share reference architectures for other industry verticals.

ABOUT THE AUTHOR

Justin Sears is an experienced marketing manager with sixteen years leading teams to create and position enterprise software, risk-controlled consumer banking products, desktop and mobile web properties, and services for Latino customers in the US and Latin America. Expert in enterprise big data use cases for Apache Hadoop.

READ THIS ARTICLE ON Hortonworks.com

How LinkedIn Uses Apache Samza


Presentation transcript edited by Roopesh Shenoy

Apache Samza is a stream processor LinkedIn recently open-sourced. In his presentation, Samza: Real-time Stream Processing at LinkedIn, Chris Riccomini discusses Samza's feature set, how Samza integrates with YARN and Kafka, how it's used at LinkedIn, and what's next on the roadmap.

The bulk of the processing that happens at LinkedIn is RPC-style data processing, where one expects a very fast response. On the other end of their response-latency spectrum, they have batch processing, for which they use Hadoop quite a bit. Hadoop processing and batch processing typically happen after the fact, often hours later.

There's this gap between synchronous RPC processing, where the user is actively waiting for a response, and this Hadoop-style processing which, despite efforts to shrink it, still takes a long time to run through.


That’s where Samza fits in. This is where we can Because LinkedIn has Kafka and because they’ve
process stuff asynchronously, but we’re also not integrated with it for the past few years, a lot of data
waiting for hours. It typically operates in the order of at LinkedIn, almost all of it, is available in a stream
milliseconds to minutes. The idea is to process stuff format as opposed to a data format or on Hadoop.
relatively quickly and get the data back to wherever
it needs to be, whether that’s a downstream system Motivation for Building Samza
or some real-time service. Chris mentions that when they began doing stream
processing, with Kafka and all this data in their
Chris mentions that right now, this stream processing system, they started with something like a web
is the worst-supported in terms of tooling and service that would start up, read messages from
environment. Kafka and do some processing, and then write the
messages back out.
LinkedIn sees a lot of use cases for this type of
processing – As they did this, they realized that there were a lot of
problems that needed to be solved in order to make
• Newsfeed displays when people move to another it really useful and scalable. Things like partitioning:
company, when they like an article, when they how do you partition your stream? How do you
join a group, et cetera. partition your processor? How do you manage state,
where state is defined essentially as something that
News is latency-sensitive and if you use Hadoop to you maintain in your processor between messages,
batch-compute it, you might be getting responses or things like count if you’re incrementing a counter
hours or maybe even a day later. It is important to get every time a message comes in. How do you re-
trending articles in News pretty quickly. process?

• Advertising – getting relevant advertisements, as With failure semantics, you get at least once, at most
well as tracking and monitoring ad display, clicks once, exactly once messaging. There is also non-
and other metrics determinism. If your stream processor is interacting
with another system, whether it’s a database or it’s
• Sophisticated monitoring that allows performing depending on time or the ordering of messages,
of complex querys like “the top five slowest pages how you deal with stuff that actually determines the
for the last minute.” output that you will end up sending?

Existing Ecosystem at LinkedIn Samza tries to address some of these problems.


The existing ecosystem at LinkedIn has had a huge
influence in the motivation behind Samza as well Samza Architecture
as it’s architecture. Hence it is important to have at The most basic element of Samza is a stream. The
least a glimpse of what this looks like before diving stream definition for Samza is much more rigid and
into Samza. heavyweight than you would expect from other
stream processing systems. Other processing
Kafka is an open-source project that LinkedIn systems, such as Storm, tend to have very lightweight
released a few years ago. It is a messaging system stream definitions to reduce latency, everything
that fulfills two needs – message-queuing and log from, say, UDP to a straight-up TCP connection.
aggregation. All of LinkedIn’s user activity, all the
metrics and monitoring data, and even database Samza goes the other direction. It wants its streams
changes go into this. to be, for starters, partitions. It wants them to be
ordered. If you read Message 3 and then Message
LinkedIn also has a specialized system 4, you are never going to get those inverted within
called Databus, which models all of their databases a single partition. It also wants them to replayable,
as a stream. It is like a database with the latest data which means you should be able to go back to reread
for each key-value pair. But as this database mutates, a message at a later date. It wants them to be fault-
you can actually model that set of mutations as a tolerant. If a host from Partition 1 disappears, it
stream. Each individual change is a message in that should still be readable on some other hosts. Also,
stream. the streams are usually infinite. Once you get to the


This definition maps very well to Kafka, which LinkedIn uses as the streaming infrastructure for Samza.

There are many concepts to understand within Samza. In a gist, they are:

• Streams – Samza processes streams. A stream is composed of immutable messages of a similar type or category. The actual implementation can be provided via a messaging system such as Kafka (where each topic becomes a Samza stream), a database (table), or even Hadoop (a directory of files in HDFS). Things like message ordering and batching are handled via streams.

• Jobs – a Samza job is code that performs a logical transformation on a set of input streams to append messages to a set of output streams.

• Partitions – for scalability, each stream is broken into one or more partitions. Each partition is a totally ordered sequence of messages.

• Tasks – again for scalability, a job is distributed by breaking it into multiple tasks. A task consumes data from one partition for each of the job's input streams.

• Containers – whereas partitions and tasks are logical units of parallelism, containers are the unit of physical parallelism. Each container is a Unix process (or Linux cgroup) and runs one or more tasks.

• TaskRunner – the TaskRunner is Samza's stream-processing container. It manages the startup, execution, and shutdown of one or more StreamTask instances.

• Checkpointing – checkpointing is generally done to enable failure recovery. If a TaskRunner goes down for some reason (a hardware failure, for example), when it comes back up it should start consuming messages where it left off; this is achieved via checkpointing.

• State management – data that needs to be passed between the processing of different messages can be called state; this can be something as simple as keeping a count or something a lot more complex. Samza allows tasks to maintain persistent, mutable, queryable state that is physically co-located with each task. The state is highly available: in the event of a task failure it will be restored when the task fails over to another machine. This datastore is pluggable, but Samza comes with a key-value store out-of-the-box.

• YARN (Yet Another Resource Negotiator) is Hadoop v2's biggest improvement over v1 – it separates the MapReduce job tracker from resource management and enables MapReduce alternatives to use the same resource manager. Samza utilizes YARN to do cluster management, track failures, etc. Samza provides a YARN ApplicationMaster and a YARN job runner out of the box.
You can understand how the various components (YARN, Kafka, and the Samza API) interact by looking at the detailed architecture. Also read the overall documentation to understand each component in detail.

Possible Improvements
One of the advantages of using something like YARN with Samza is that it enables you to potentially run Samza on the same grid that you already run your draft tasks, test tasks, and MapReduce tasks. You could use the same infrastructure for all of that. However, LinkedIn currently does not run Samza in a multi-framework environment because the existing setup itself is quite experimental.

In order to get into a more multi-framework environment, Chris says that the process isolation would have to get a little better.


Conclusion
Samza is a relatively young project incubating at Apache, so there's a lot of room to get involved. A good way to get started is with the hello-samza project, which is a little thing that will get you up and running in about five minutes. It will let you play with a real-time change log from the Wikipedia servers, to let you figure out what's going on and give you a stream of stuff to play with.

The other stream-processing project built on top of Hadoop is Storm. You can see a comparison between Samza and Storm.

ABOUT THE SPEAKER

Chris Riccomini is a Staff Software Engineer at LinkedIn, where he is currently working as a committer and PMC member for Apache Samza. He's been involved in a wide range of projects at LinkedIn, including "People You May Know", REST.li, Hadoop, engineering tooling, and OLAP systems. Prior to LinkedIn, he worked on data visualization and fraud modeling at PayPal.

WATCH THE FULL PRESENTATION ON InfoQ
