Intro Hadoop Ecosystem 24sep2020

The document provides an introduction to Hadoop and the Hadoop ecosystem. It begins with definitions of big data and discusses the challenges of storing and processing large volumes of data. It then describes Hadoop as a framework for distributed storage and computation using HDFS for redundant data storage and MapReduce as a programming model for distributed processing of large datasets across clusters of computers. Finally, it outlines some of the common components in the Hadoop ecosystem that provide additional functionality beyond core Hadoop.

Uploaded by

pankaj boricha

uweseiler

Introduction to the
Hadoop Ecosystem
07.11.13
About me

• Big Data Nerd
• Hadoop Trainer
• MongoDB Author
• Photography Enthusiast
• Travelpirate


About us

is a bunch of…

• Big Data Nerds
• Agile Ninjas
• Continuous Delivery Gurus
• Enterprise Java Specialists
• Performance Geeks

Join us!


Agenda

• What is Big Data & Hadoop?
• Core Hadoop
• The Hadoop Ecosystem
• Use Cases
• What’s next? Hadoop 2.0!
Big Data
Big Data is like teenage sex:
everybody talks about it,
nobody really knows how to
do it, everyone thinks
everyone else is doing it, so
everyone claims they are
doing it…
Slides from APCON: Big Data in Action (http://de.slideshare.net/cnkelly/big-data-in-action)
My favorite definition
The classic definition

The 3 V’s of Big Data
• Volume
• Velocity
• Variety

«Big Data» != Hadoop



NoSQL
Classification of NoSQL

• Key-Value Stores
• Column Stores
• Graph Databases
• Document Stores

Horizontal Scaling
Vertical Scaling

RAM
CPU
Storage
Horizontal Scaling

[Figure: a growing number of nodes side by side, each with its own RAM, CPU and Storage]
Why Hadoop?

Traditional data stores are expensive to scale and, by design, difficult to distribute.

Scale out is the way to go!


How to scale data?

“Data” → worker, worker, worker → “Result”
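
The idea on this slide, data split across several workers whose partial results are combined into one result, can be sketched in plain Java. This is only an illustration under stated assumptions: the class name `ScaleOutSketch`, the summing task and the use of a local thread pool instead of separate machines are all chosen for the example and are not part of the talk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical illustration: threads stand in for the workers of a cluster.
final class ScaleOutSketch {

    // The work of one worker: sum one slice of the data.
    static long sumSlice(int[] data, int from, int to) {
        long sum = 0;
        for (int i = from; i < to; i++) {
            sum += data[i];
        }
        return sum;
    }

    // Splits the data into one slice per worker, processes the slices in
    // parallel and combines the partial results into the final result.
    static long parallelSum(int[] data, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            int sliceSize = (data.length + workers - 1) / workers; // ceiling division
            List<CompletableFuture<Long>> partials = new ArrayList<>();
            for (int w = 0; w < workers; w++) {
                final int from = Math.min(data.length, w * sliceSize);
                final int to = Math.min(data.length, from + sliceSize);
                CompletableFuture<Long> partial =
                        CompletableFuture.supplyAsync(() -> sumSlice(data, from, to), pool);
                partials.add(partial);
            }
            long result = 0;
            for (CompletableFuture<Long> partial : partials) {
                result += partial.join(); // wait for the worker, then combine
            }
            return result;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        System.out.println(parallelSum(data, 3)); // prints 55
    }
}
```

Even in this small sketch the splitting, waiting and combining logic is longer than the actual work, which is the point of the next two slides.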
But…

Parallel processing is complicated!

But…

Data storage is not trivial!
What is Hadoop?

Distributed Storage and Computation Framework

Hadoop != Database

“Swiss army knife of the 21st century”

http://www.guardian.co.uk/technology/2011/mar/25/media-guardian-innovation-awards-apache-hadoop
The Hadoop App Store

HDFS MapRed HCat Pig Hive HBase Ambari Avro Cassandra

Chukwa Flume Hana HyperT Impala Mahout Nutch Oozie Sqoop

Scribe Tez Vertica Whirr ZooKee Horton Cloudera MapR EMC

Intel IBM Talend TeraData Pivotal Informat Microsoft. Pentaho Jasper

Sync Kognitio Tableau Splunk Platfora Rack Karma Actuate MicStrat


The Hadoop App Store

Apache Hadoop
• HDFS
• MapReduce
• Hadoop Ecosystem
• Hadoop YARN

+ Hadoop Distributions
• Test & Packaging
• Installation
• Monitoring
• Business Support

+ Big Data Suites
• Integrated Environment
• Visualization
• (Near-)Realtime analysis
• Modeling
• ETL & Connectors

Functionality: less → more


Data Storage

OK, first things first!
I want to store all of my «Big Data»
Hadoop Distributed File System

• Distributed file system for redundant storage
• Designed to reliably store data on commodity hardware
• Built to expect hardware failures

Intended for
• large files
• batch inserts
HDFS Architecture

• Client: File, split into blocks #1 and #2
• Master: NameNode with Block Map and Journal Log
• Helper: Secondary NameNode (periodical merges)
• Slaves: DataNodes spread over Rack 1 and Rack 2, three of them holding a copy of block #1
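
To make the block map more concrete, here is a minimal sketch of what the NameNode keeps track of: which DataNodes hold a replica of which block. It is a simplification under stated assumptions: the class `BlockMapSketch`, the node names and the round-robin placement are made up for this example (real HDFS placement is rack-aware), and the block size and replication factor are plain parameters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of a block map, not the NameNode's real data structure.
final class BlockMapSketch {

    // Returns block number -> DataNodes holding a replica of that block.
    // Replicas are placed round-robin, which is an assumption of this sketch.
    static Map<Integer, List<String>> placeBlocks(long fileSize, long blockSize,
                                                  List<String> dataNodes, int replication) {
        if (replication > dataNodes.size()) {
            throw new IllegalArgumentException("not enough DataNodes for the replication factor");
        }
        int blockCount = (int) ((fileSize + blockSize - 1) / blockSize); // ceiling division
        Map<Integer, List<String>> blockMap = new LinkedHashMap<>();
        for (int block = 0; block < blockCount; block++) {
            List<String> replicas = new ArrayList<>();
            for (int r = 0; r < replication; r++) {
                replicas.add(dataNodes.get((block + r) % dataNodes.size()));
            }
            blockMap.put(block + 1, replicas); // blocks are numbered #1, #2, ... as on the slide
        }
        return blockMap;
    }

    public static void main(String[] args) {
        List<String> nodes = Arrays.asList("datanode-1", "datanode-2", "datanode-3", "datanode-4");
        // A 150 MB file with 64 MB blocks needs three blocks, each stored three times.
        System.out.println(placeBlocks(150L << 20, 64L << 20, nodes, 3));
    }
}
```

Losing any single DataNode in this sketch still leaves two replicas of every block, which is what "built to expect hardware failures" relies on.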
HDFS

Let’s have a look…


Data Processing

Data stored, check!
Now I want to create insights from my data!
MapReduce

• Programming model for distributed computations at a massive scale
• Execution framework for organizing and performing such computations
• Data locality is king


Typical large-data problem

• Iterate over a large number of records
• Map: Extract something of interest from each
• Shuffle and sort intermediate results
• Reduce: Aggregate intermediate results
• Generate final output


MapReduce Flow

Map:      (a,1) (b,2) | (c,3) (c,6) | (a,3) (c,2) | (b,7) (c,8)
Combine:  (a,1) (b,2) | (c,9)       | (a,3) (c,2) | (b,7) (c,8)
Partition, Shuffle and Sort:  a → 1, 3   b → 2, 7   c → 2, 8, 9
Reduce:   (a,4) (b,9) (c,19)
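
The flow on this slide can be replayed in plain Java without a cluster. The sketch below is an illustration only: the class `MapReduceFlowSketch` and its method names are invented for the example, it runs in a single JVM, and it leaves out partitioning because there is only one "reducer". The input pairs are the map outputs shown on the slide.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process replay of the slide, not Hadoop code.
final class MapReduceFlowSketch {

    static Map.Entry<String, Integer> pair(String key, int value) {
        return new AbstractMap.SimpleEntry<>(key, value);
    }

    // Combine: pre-aggregate the output of a single mapper by key.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, Integer> combined = new TreeMap<>();
        for (Map.Entry<String, Integer> p : mapOutput) {
            combined.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return combined;
    }

    // Shuffle and sort: group the values of all mappers by key, keys in sorted order.
    static Map<String, List<Integer>> shuffle(List<Map<String, Integer>> combinedOutputs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map<String, Integer> output : combinedOutputs) {
            output.forEach((key, value) ->
                    grouped.computeIfAbsent(key, k -> new ArrayList<>()).add(value));
        }
        return grouped;
    }

    // Reduce: sum the grouped values of each key.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> result = new TreeMap<>();
        grouped.forEach((key, values) ->
                result.put(key, values.stream().mapToInt(Integer::intValue).sum()));
        return result;
    }

    static Map<String, Integer> run() {
        List<List<Map.Entry<String, Integer>>> mapOutputs = Arrays.asList(
                Arrays.asList(pair("a", 1), pair("b", 2)),
                Arrays.asList(pair("c", 3), pair("c", 6)),
                Arrays.asList(pair("a", 3), pair("c", 2)),
                Arrays.asList(pair("b", 7), pair("c", 8)));
        List<Map<String, Integer>> combined = new ArrayList<>();
        for (List<Map.Entry<String, Integer>> mapOutput : mapOutputs) {
            combined.add(combine(mapOutput));
        }
        return reduce(shuffle(combined));
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints {a=4, b=9, c=19}
    }
}
```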
Combined Hadoop Architecture

• Client: Job, File
• Master: JobTracker, NameNode
• Helper: Secondary NameNode
• Slaves: each runs a TaskTracker (executing Tasks) and a DataNode (storing Blocks)
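Word Count Mapper in Java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable>
{
    // Reused for every emitted pair to avoid allocating new objects per word.
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // Called once per input line: emits (word, 1) for every token in the line.
    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException
    {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens())
        {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}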
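Word Count Reducer in Java
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable>
{
    // Called once per word: sums all counts emitted for that word.
    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException
    {
        int sum = 0;
        while (values.hasNext())
        {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}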
Map/Reduce

Let’s have a look…


Scripting for Hadoop

Java for MapReduce? I dunno, dude…
I’m more of a scripting guy…
Apache Pig

• High-level data flow language
• Made of two components:
– Data processing language Pig Latin
– Compiler to translate Pig Latin to MapReduce
Pig in the Hadoop ecosystem

• Pig: Scripting
• HCatalog: Metadata Management
• MapReduce: Distributed Programming Framework
• HDFS: Hadoop Distributed File System
Pig Latin
users = LOAD 'users.txt' USING PigStorage(',') AS (name:chararray, age:int);
pages = LOAD 'pages.txt' USING PigStorage(',') AS (user:chararray, url:chararray);

filteredUsers = FILTER users BY age >= 18 AND age <= 50;
joinResult = JOIN filteredUsers BY name, pages BY user;
grouped = GROUP joinResult BY url;
summed = FOREACH grouped GENERATE group, COUNT(joinResult) AS clicks;
sorted = ORDER summed BY clicks DESC;
top10 = LIMIT sorted 10;

STORE top10 INTO 'top10sites';
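
For readers who do not know Pig Latin, the same data flow (filter users by age, join with page views, count per URL, sort, take the top entries) can be sketched with Java streams on in-memory data. This is a rough equivalent under assumptions, not what Pig generates: the class `TopSitesSketch` is invented for the example, users are held in a map (so user names are assumed to be unique), and ties are broken by URL to make the output deterministic.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical in-memory equivalent of the Pig Latin script above.
final class TopSitesSketch {

    // userAges: name -> age; pageViews: arrays of {user, url}.
    static List<Map.Entry<String, Long>> topSites(Map<String, Integer> userAges,
                                                  List<String[]> pageViews, int limit) {
        // FILTER + JOIN + GROUP + COUNT: keep views of users aged 18 to 50, count per URL.
        Map<String, Long> clicks = pageViews.stream()
                .filter(view -> {
                    Integer age = userAges.get(view[0]);
                    return age != null && age >= 18 && age <= 50;
                })
                .collect(Collectors.groupingBy(view -> view[1], Collectors.counting()));

        // ORDER ... DESC + LIMIT: most clicked URLs first.
        Comparator<Map.Entry<String, Long>> byClicksDesc =
                Map.Entry.<String, Long>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<String, Long>comparingByKey());
        return clicks.entrySet().stream()
                .sorted(byClicksDesc)
                .limit(limit)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Integer> ages = new HashMap<>();
        ages.put("alice", 25);
        ages.put("bob", 17);
        ages.put("carol", 40);
        List<String[]> views = Arrays.asList(
                new String[] {"alice", "a.com"},
                new String[] {"alice", "b.com"},
                new String[] {"bob", "a.com"},
                new String[] {"carol", "a.com"});
        System.out.println(topSites(ages, views, 10)); // prints [a.com=2, b.com=1]
    }
}
```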


Pig Execution Plan

Try that with Java…
Pig

Let’s have a look…


SQL for Hadoop

OK, Pig seems quite useful…
But I’m more of a SQL person…
Apache Hive

• Data Warehousing Layer on top of Hadoop
• Allows analysis and queries using a SQL-like language
Hive in the Hadoop ecosystem

• Pig: Scripting
• Hive: Query
• HCatalog: Metadata Management
• MapReduce: Distributed Programming Framework
• HDFS: Hadoop Distributed File System
Hive Architecture

• Hive: Hive Shell, Hive Server, Metastore, Hive Engine
• Clients: Thrift Applications (Hive Thrift Driver), JDBC Applications (Hive JDBC Driver), ODBC Applications (Hive ODBC Driver)
• Underneath: MapReduce and HDFS
Hive Example
CREATE TABLE users(name STRING, age INT);
CREATE TABLE pages(user STRING, url STRING);

LOAD DATA INPATH '/user/sandbox/users.txt' INTO TABLE users;
LOAD DATA INPATH '/user/sandbox/pages.txt' INTO TABLE pages;

SELECT pages.url, count(*) AS clicks
FROM users JOIN pages ON (users.name = pages.user)
WHERE users.age >= 18 AND users.age <= 50
GROUP BY pages.url
ORDER BY clicks DESC
LIMIT 10;
Hive

Let’s have a look…


But wait, there’s still more!

More components of the Hadoop Ecosystem

• Mahout: Machine Learning
• Pig: Scripting
• Hive: SQL-like queries
• HBase: NoSQL Database
• HCatalog: Metadata Management
• MapReduce: Data processing
• HDFS: Data storage
• ZooKeeper: Cluster Coordination
• Ambari: Cluster installation & management
• Oozie: Workflow automation
• Sqoop: Import & Export of relational data
• Flume: Import & Export of data flows
Classical enterprise platform

• Applications: Business Intelligence, Business Applications, Custom Applications
• Data Systems: Traditional Systems (RDBMS, EDW, MPP, …)
• Data Sources: Traditional Sources (RDBMS, OLTP, OLAP, …)
• Dev Tools: Build & Test
• Operation: Manage & Monitor

Big Data Platform

• Applications: Business Intelligence, Business Applications, Custom Applications
• Data Systems: Traditional Systems (RDBMS, EDW, MPP, …) and the Enterprise Hadoop Platform
• Data Sources: Traditional Sources (RDBMS, OLTP, OLAP, …) and New Sources (Logs, Mails, Sensor, Social Media, …)
• Dev Tools: Build & Test
• Operation: Manage & Monitor
Pattern #1: Refine data

1. Capture all data
2. Process the data
3. Exchange using traditional systems
4. Process & Visualize with traditional applications

[Figure: the Big Data Platform diagram with steps 1 to 4 marked on the data flow]
Pattern #2: Explore data

1. Capture all data
2. Process the data
3. Explore the data using applications with support for Hadoop

[Figure: the Big Data Platform diagram with steps 1 to 3 marked on the data flow]
Pattern #3: Enrich data

1. Capture all data
2. Process the data
3. Directly ingest the data

[Figure: the Big Data Platform diagram with Business Applications and Custom Applications on top and steps 1 to 3 marked on the data flow]
Bringing it all together…

One example…
Digital Advertising

• 6 billion ad deliveries per day
• Reports (and bills) for the advertising companies needed
• Own C++ solution did not scale
• Adding functions was a nightmare

AdServing Architecture

[Figure with the labels: AdServer (FFM, AMS), Binary Log Format, TCP Interface, Custom Flume Source, Local files, Flume HDFS Sink, Hadoop Cluster, Campaign Data, Pig, Hive, Temporary data, Aggregated data, Job Scheduler (Start), Config UI, Job Config XML, Synchronisation, Campaign Database, Report Engine, NAS, Direct Download]
What’s next?

Hadoop 2.0
aka YARN
Hadoop 1.0

Built for web-scale batch apps

[Figure: several separate stacks, each a single batch app on top of HDFS]

MapReduce is good for…

• Embarrassingly parallel algorithms
• Summing, grouping, filtering, joining
• Off-line batch jobs on massive data sets
• Analyzing an entire large dataset
MapReduce is OK for…

• Iterative jobs (e.g., graph algorithms)
– Each iteration must read/write data to disk
– I/O and latency cost of an iteration is high
MapReduce is not good for…

• Jobs that need shared state/coordination
– Tasks are shared-nothing
– Shared-state requires scalable state store
• Low-latency jobs
• Jobs on small datasets
• Finding individual records


MapReduce limitations

• Scalability
– Maximum cluster size ~ 4,500 nodes
– Maximum concurrent tasks – 40,000
– Coarse synchronization in JobTracker
• Availability
– Failure kills all queued and running jobs
• Hard partition of resources into map & reduce slots
– Low resource utilization
• Lacks support for alternate paradigms and services
– Iterative applications implemented using MapReduce are 10x slower
Hadoop 2.0: Next-gen platform

Hadoop 1.0: Single use system (Batch Apps)
• MapReduce: Cluster resource mgmt. + data processing
• HDFS: Redundant, reliable storage

Hadoop 2.0: Multi-purpose platform (Batch, Interactive, Streaming, …)
• MapReduce and Others: Data processing
• YARN: Cluster resource management
• HDFS 2.0: Redundant, reliable storage
Taking Hadoop beyond batch

Store all data in one place
Interact with data in multiple ways

Applications run natively in Hadoop
• Batch: MapReduce
• Interactive: Tez
• Online: HOYA
• Streaming: Storm, …
• Graph: Giraph
• In-Memory: Spark
• Other: Search, …

YARN: Cluster resource management
HDFS 2.0: Redundant, reliable storage
A brief history of Hadoop 2.0

• Originally conceived & architected by the team at Yahoo!
– Arun Murthy created the original JIRA in 2008 and is now the YARN release manager
• The team at Hortonworks has been working on YARN for 4 years:
– 90% of code from Hortonworks & Yahoo!
• Hadoop 2.0 based architecture running at scale at Yahoo!
– Deployed on 35,000 nodes for 6+ months
Hadoop 2.0 Projects

• YARN

• HDFS Federation aka HDFS 2.0

• Stinger & Tez aka Hive 2.0


YARN: Architecture

Split up the two major functions of the JobTracker: cluster resource management & application life-cycle management

[Figure: a ResourceManager with its Scheduler above eight NodeManagers, which host the Application Masters AM 1 and AM 2 and their containers 1.1, 1.2, 2.1, 2.2 and 2.3]

• Resource Manager
– Global resource scheduler
– Hierarchical queues

• Node Manager
– Per-machine agent
– Manages the life-cycle of containers
– Container resource monitoring

• Application Master
– Per-application
– Manages application scheduling and task execution
– e.g. MapReduce Application Master
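
The division of labour between a global scheduler and per-machine agents can be illustrated with a deliberately naive sketch: a scheduler that hands a container request to the first NodeManager with enough free memory. Everything here is an assumption made for the illustration: the class `ContainerAllocationSketch`, the node names and the first-fit rule are invented, memory is the only resource, and YARN's actual schedulers with hierarchical queues are considerably more involved.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical first-fit allocation, not YARN's scheduler.
final class ContainerAllocationSketch {

    // freeMemoryMb: NodeManager name -> free memory in MB.
    // Reserves the requested memory on the first node that can fit it and
    // returns that node's name, or an empty Optional if no node has room.
    static Optional<String> allocate(Map<String, Integer> freeMemoryMb, int requestMb) {
        for (Map.Entry<String, Integer> node : freeMemoryMb.entrySet()) {
            if (node.getValue() >= requestMb) {
                node.setValue(node.getValue() - requestMb);
                return Optional.of(node.getKey());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Map<String, Integer> free = new LinkedHashMap<>();
        free.put("nodemanager-1", 1024);
        free.put("nodemanager-2", 4096);
        System.out.println(allocate(free, 2048)); // prints Optional[nodemanager-2]
        System.out.println(allocate(free, 4096)); // prints Optional.empty
    }
}
```

In this picture the Application Master would be the party calling `allocate` on behalf of one application and then starting its tasks in the containers it was granted.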
[Figure: the ResourceManager and its Scheduler above twelve NodeManagers running containers of different applications side by side: MapReduce 1 and MapReduce 2 with their map and reduce tasks, HOYA with an HBase Master and Region servers 1 to 3, Storm with nimbus 1 and 2, and Tez with vertices 1 to 4]
HDFS Federation

• Removes tight coupling of Block Storage and Namespace
• Scalability & Isolation
• High Availability
• Increased performance

Details: https://issues.apache.org/jira/browse/HDFS-1052
HDFS Federation: Architecture

• NameNodes do not talk to each other
• Each NameNode manages only a slice of the namespace (NameNode 1: Namespace 1, NameNode 2: Namespace 2; example directories: logs, finance, insights, reports)
• Each NameNode has its own Block Management
• DataNodes can store blocks managed by any NameNode
HDFS: Quorum based storage

• The state is shared on a quorum of journal nodes
• Only the active NameNode writes edits
• The Standby NameNode simultaneously reads and applies the edits
• Each NameNode keeps its own Block Map and Edits File
• DataNodes report to both NameNodes but listen only to the orders from the active one
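
What "a quorum of journal nodes" means in numbers can be shown with two small helper functions: an edit counts as written once a strict majority of the journal nodes has acknowledged it, so a group of N journal nodes keeps working with up to (N - 1) / 2 of them down. The class `QuorumSketch` and its method names are made up for this illustration and are not HDFS code.

```java
// Hypothetical helper functions illustrating majority quorums.
final class QuorumSketch {

    // An edit is durable once a strict majority of journal nodes acknowledged it.
    static boolean isCommitted(int acks, int journalNodes) {
        return acks > journalNodes / 2;
    }

    // Number of journal nodes that may fail while a majority can still be reached.
    static int toleratedFailures(int journalNodes) {
        return (journalNodes - 1) / 2;
    }

    public static void main(String[] args) {
        System.out.println(isCommitted(2, 3));    // prints true
        System.out.println(isCommitted(1, 3));    // prints false
        System.out.println(toleratedFailures(3)); // prints 1
        System.out.println(toleratedFailures(5)); // prints 2
    }
}
```

This is also why the slide shows three journal nodes: it is the smallest group that survives the loss of one node.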
Hive: Current Focus Area

Real-Time (0-5s)
• Online systems
• R-T analytics
• CEP

Interactive (5s – 1m)
• Parameterized Reports
• Drilldown
• Visualization
• Exploration

Non-Interactive (1m – 1h)
• Data preparation
• Incremental batch processing
• Dashboards / Scorecards

Batch (1h+)
• Operational batch processing
• Enterprise Reports
• Data Mining

[Figure labels: Current Hive Sweet Spot; axis: Data Size]
Stinger: Extending the sweet spot

[Figure: the same four latency categories (Real-Time, Interactive, Non-Interactive, Batch) over Data Size, marked “Future Hive Expansion”]

Improve Latency & Throughput
• Query engine improvements
• New “Optimized RCFile” column store
• Next-gen runtime (elim’s M/R latency)

Extend Deep Analytical Ability
• Analytics functions
• Improved SQL coverage
• Continued focus on core Hive use cases
Stinger Initiative at a glance
Tez: The Execution Engine

• Low level data-processing execution engine
• Use it as the base for MapReduce, Hive, Pig, etc.
• Enables pipelining of jobs
• Removes task and job launch times
• Hive and Pig jobs no longer need to move to the end of the queue between steps in the pipeline
• Does not write intermediate output to HDFS
– Much lighter disk and network usage
• Built on YARN
Pig/Hive MR vs. Pig/Hive Tez
SELECT a.state, COUNT(*), AVG(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state

Pig/Hive - MR: Job 1, Job 2 and Job 3, separated by I/O synchronization barriers
Pig/Hive - Tez: Single Job


Tez Service

• MapReduce Query Startup is expensive:
– Job launch & task-launch latencies are fatal for short queries (in order of 5s to 30s)

• Solution:
– Tez Service (= Preallocated Application Master)
• Removes job-launch overhead (Application Master)
• Removes task-launch overhead (Pre-warmed Containers)
– Hive/Pig
• Submit query-plan to Tez Service
– Native Hadoop service, not ad-hoc
Tez: Low latency
SELECT a.state, COUNT(*), AVG(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state

                     Existing Hive   Hive/Tez   Tez & Tez Service
Parse Query          0.5s            0.5s       0.5s
Create Plan          0.5s            0.5s       0.5s
Launch MapReduce     20s             20s        0.5s (Submit to Tez Service)
Process MapReduce    10s             2s         2s
Total                31s             23s        3.5s

* No exact numbers, for illustration only


Stinger: Summary

* Real numbers, but handle with care!


Hadoop 2.0 Applications

• MapReduce 2.0
• HOYA - HBase on YARN
• Storm, Spark, Apache S4
• Hamster (MPI on Hadoop)
• Apache Giraph
• Apache Hama
• Distributed Shell
• Tez
MapReduce 2.0

• Basically a port to the YARN architecture
• MapReduce becomes a user-land library
• No need to rewrite MapReduce jobs
• Increased scalability & availability
• Better cluster utilization
HOYA: HBase on YARN

• Create on-demand HBase clusters
• Configure different HBase instances differently
• Better isolation
• Create (transient) HBase clusters from MapReduce jobs
• Elasticity of clusters for analytic / batch workload processing
• Better cluster resources utilization
Twitter Storm

• Stream-processing
• Real-time processing
• Developed as standalone application
• https://github.com/nathanmarz/storm

• Ported to YARN
• https://github.com/yahoo/storm-yarn
Storm: Conceptual view

• Spout: Source of streams
• Bolt: Consumer of streams, processing of tuples, possibly emits new tuples
• Stream: Unbounded sequence of tuples
• Tuple: List of name-value pairs
• Topology: Network of Spouts & Bolts as the nodes and streams as the edges
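
These terms can be tied together with a toy topology in plain Java: one spout emitting sentences, one bolt splitting them into words and one bolt counting the words. The sketch does not use Storm's API; the class `TopologySketch`, the `Bolt` interface and the tuple field names are invented for the illustration, everything runs in a single thread, and a finite iterator stands in for the unbounded stream.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

// Hypothetical single-threaded toy, not the Storm API.
final class TopologySketch {

    // A bolt consumes one tuple and may emit zero or more new tuples.
    interface Bolt extends Function<Map<String, Object>, List<Map<String, Object>>> {}

    // A tuple is a list of name-value pairs; here it has a single field.
    static Map<String, Object> tuple(String name, Object value) {
        Map<String, Object> t = new LinkedHashMap<>();
        t.put(name, value);
        return t;
    }

    // Wires spout -> split bolt -> count bolt and returns the word counts.
    static Map<String, Integer> run(Iterator<Map<String, Object>> spout) {
        Bolt split = t -> {
            List<Map<String, Object>> words = new ArrayList<>();
            for (String word : ((String) t.get("sentence")).split(" ")) {
                words.add(tuple("word", word));
            }
            return words;
        };
        Map<String, Integer> counts = new TreeMap<>();
        Bolt count = t -> {
            counts.merge((String) t.get("word"), 1, Integer::sum);
            return Collections.emptyList(); // final bolt, emits nothing
        };
        while (spout.hasNext()) {
            for (Map<String, Object> word : split.apply(spout.next())) {
                count.apply(word);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Iterator<Map<String, Object>> spout = Arrays.asList(
                tuple("sentence", "big data is big"),
                tuple("sentence", "data is data")).iterator();
        System.out.println(run(spout)); // prints {big=2, data=3, is=2}
    }
}
```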
Spark

• High-speed in-memory analytics over Hadoop and Hive
• Separate MapReduce-like engine
– Speedup of up to 100x
– On-disk queries 5-10x faster
• Compatible with Hadoop’s Storage API
• Available as standalone application
– https://github.com/mesos/spark
• Experimental support for YARN since 0.6
– http://spark.incubator.apache.org/docs/0.6.0/running-on-yarn.html
Data Sharing in Spark
Apache Giraph

• Giraph is a framework for processing semi-structured graph data on a massive scale.
• Giraph is loosely based upon Google's Pregel
• Giraph performs iterative calculations on top of an existing Hadoop cluster.
• Available on GitHub
– https://github.com/apache/giraph
Hadoop 2.0 Summary

1. Scale
2. New programming models & Services
3. Improved cluster utilization
4. Agility
5. Beyond Java
Getting started…

One more thing…


Hortonworks Sandbox

http://hortonworks.com/products/hortonworks-sandbox
Books about Hadoop

1. Hadoop: The Definitive Guide, Tom White, 3rd ed., O’Reilly, 2012
2. Hadoop in Action, Chuck Lam, Manning, 2011
3. Programming Pig, Alan Gates, O’Reilly, 2011
4. Hadoop Operations, Eric Sammer, O’Reilly, 2012
The end…or the beginning?
