The document provides an introduction to Hadoop and the Hadoop ecosystem. It begins with definitions of big data and discusses the challenges of storing and processing large volumes of data. It then describes Hadoop as a framework for distributed storage and computation using HDFS for redundant data storage and MapReduce as a programming model for distributed processing of large datasets across clusters of computers. Finally, it outlines some of the common components in the Hadoop ecosystem that provide additional functionality beyond core Hadoop.


uweseiler

Introduction to the
Hadoop Ecosystem
07.11.13
About me

Big Data Nerd Hadoop Trainer MongoDB Author

Photography Enthusiast Travelpirate

About us

is a bunch of…

Big Data Nerds Agile Ninjas Continuous Delivery Gurus

Join us!

Enterprise Java Specialists Performance Geeks

Agenda

• What is Big Data & Hadoop?

• Core Hadoop
• The Hadoop Ecosystem
• Use Cases
• What’s next? Hadoop 2.0!
Big Data
Big Data is like teenage sex:
everybody talks about it,
nobody really knows how to
do it, everyone thinks
everyone else is doing it, so
everyone claims they are
doing it…
Slides from APCON: Big Data in Action (http://de.slideshare.net/cnkelly/big-data-in-action)
My favorite definition
The classic definition

The 3 V’s of Big Data
• Volume
• Velocity
• Variety

«Big Data» != Hadoop

NoSQL
Classification of NoSQL

• Key-Value Stores
• Column Stores
• Graph Databases
• Document Stores
Horizontal
Scaling
Vertical Scaling

[Figure: a single machine that grows in RAM, CPU and Storage]

Horizontal Scaling

[Figure: a growing number of machines side by side, each with its own RAM, CPU and Storage]
Why Hadoop?

Traditional data stores are expensive to scale

and by design difficult to distribute

Scale out is the way to go!

How to scale data?

[Figure: “Data” → w, w, w → worker, worker, worker → r, r, r → “Result”]
But…

Parallel processing is
complicated!
But…

Data storage is not

trivial!
What is Hadoop?

Distributed Storage and

Computation Framework
What is Hadoop?

Hadoop != Database
What is Hadoop?

“Swiss army knife

of the 21st century”

http://www.guardian.co.uk/technology/2011/mar/25/media-guardian-innovation-awards-apache-hadoop
The Hadoop App Store

HDFS MapRed HCat Pig Hive HBase Ambari Avro Cassandra

Chukwa Flume Hana HyperT Impala Mahout Nutch Oozie Sqoop

Scribe Tez Vertica Whirr ZooKee Horton Cloudera MapR EMC

Intel IBM Talend TeraData Pivotal Informat Microsoft Pentaho Jasper

Sync Kognitio Tableau Splunk Platfora Rack Karma Actuate MicStrat

The Hadoop App Store

Apache Hadoop
• HDFS
• MapReduce
• Hadoop Ecosystem
• Hadoop YARN

+ Hadoop Distributions
• Test & Packaging
• Installation
• Monitoring
• Business Support

+ Big Data Suites
• Integrated Environment
• Visualization
• (Near-)Realtime analysis
• Modeling
• ETL & Connectors

less ← Functionality → more

Data Storage

OK, first things

first!
I want to store all of
my <<Big Data>>
Hadoop Distributed File System

• Distributed file system for redundant storage
• Designed to reliably store data on commodity hardware
• Built to expect hardware failures
Hadoop Distributed File System

Intended for
• large files
• batch inserts
HDFS Architecture

• Client: File, consisting of blocks #1 and #2
• Master: NameNode with Block Map and Journal Log
• Helper: Secondary NameNode (periodical merges)
• Slaves: DataNodes spread over Rack 1 and Rack 2, each holding a copy of block #1
HDFS

Let’s have a look…

Data Processing

Data stored, check!

Now I want to
create insights
from my data!
MapReduce

• Programming model for distributed computations at a massive scale
• Execution framework for organizing and performing such computations
• Data locality is king

Typical large-data problem

• Iterate over a large number of records
• Extract something of interest from each (Map)
• Shuffle and sort intermediate results
• Aggregate intermediate results (Reduce)
• Generate final output

MapReduce Flow

Map:              (a,1) (b,2) | (c,3) (c,6) | (a,3) (c,2) | (b,7) (c,8)
Combine:          (a,1) (b,2) | (c,9)       | (a,3) (c,2) | (b,7) (c,8)
Partition:        one partitioner per map task
Shuffle and Sort: a → [1, 3]    b → [2, 7]    c → [2, 8, 9]
Reduce:           a → 4         b → 9         c → 19
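The flow on this slide can be traced in plain Java without a cluster. The following is a minimal sketch, not Hadoop code: the class and method names are made up for illustration, everything runs in memory, and the partition function (hash of the key modulo the number of reduce tasks) is an assumption of the sketch. The input reuses the slide's values; the two values missing from the extracted figure, (a,1) and (b,2), are inferred from the shuffle line and the totals.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, in-memory illustration of combine -> partition -> shuffle -> reduce.
class MapReduceFlowSketch {

    /** Combine: pre-aggregate the output of one map task by key. */
    public static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, Integer> combined = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapOutput) {
            combined.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return combined;
    }

    /** Partition: choose the reduce task for a key (assumed hash-based scheme). */
    public static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    /** Shuffle and sort: per reduce task, collect the values of each key in key order. */
    public static List<Map<String, List<Integer>>> shuffle(
            List<Map<String, Integer>> combinedOutputs, int numReducers) {
        List<Map<String, List<Integer>>> reducerInputs = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            reducerInputs.add(new TreeMap<>());
        }
        for (Map<String, Integer> output : combinedOutputs) {
            for (Map.Entry<String, Integer> pair : output.entrySet()) {
                reducerInputs.get(partition(pair.getKey(), numReducers))
                        .computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                        .add(pair.getValue());
            }
        }
        return reducerInputs;
    }

    /** Reduce: sum the values of each key and merge the output of all reduce tasks. */
    public static Map<String, Integer> reduce(List<Map<String, List<Integer>>> reducerInputs) {
        Map<String, Integer> result = new TreeMap<>();
        for (Map<String, List<Integer>> input : reducerInputs) {
            for (Map.Entry<String, List<Integer>> group : input.entrySet()) {
                int sum = 0;
                for (int value : group.getValue()) {
                    sum += value;
                }
                result.put(group.getKey(), sum);
            }
        }
        return result;
    }

    /** Runs the whole flow over the outputs of several map tasks. */
    public static Map<String, Integer> run(
            List<List<Map.Entry<String, Integer>>> mapOutputs, int numReducers) {
        List<Map<String, Integer>> combined = new ArrayList<>();
        for (List<Map.Entry<String, Integer>> mapOutput : mapOutputs) {
            combined.add(combine(mapOutput));
        }
        return reduce(shuffle(combined, numReducers));
    }

    /** The outputs of the four map tasks shown on the slide. */
    public static List<List<Map.Entry<String, Integer>>> slideExample() {
        return List.of(
                List.of(Map.entry("a", 1), Map.entry("b", 2)),
                List.of(Map.entry("c", 3), Map.entry("c", 6)),
                List.of(Map.entry("a", 3), Map.entry("c", 2)),
                List.of(Map.entry("b", 7), Map.entry("c", 8)));
    }

    public static void main(String[] args) {
        // Three reduce tasks, as on the slide.
        System.out.println(run(slideExample(), 3)); // prints {a=4, b=9, c=19}
    }
}
```

The combiner only shrinks the data that crosses the network (the two c values of the second map task become one); the final sums are the same with or without it.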
Combined Hadoop Architecture

• Client: Job and File
• Master: JobTracker and NameNode
• Helper: Secondary NameNode
• Slaves: each runs a TaskTracker (with Tasks) and a DataNode (with Blocks)
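Word Count Mapper in Java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Emits (word, 1) for every whitespace-separated token of an input line.
public class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable>
{
    private static final IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
    {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens())
        {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}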
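Word Count Reducer in Java
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Sums all counts emitted for one word and writes (word, total).
public class WordCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable>
{
    public void reduce(Text key, Iterator<IntWritable> values,
        OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
    {
        int sum = 0;
        while (values.hasNext())
        {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}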
Map/Reduce

Let’s have a look…

Scripting for Hadoop

Java for MapReduce?

I dunno, dude…
I’m more of a
scripting guy…
Apache Pig

• High-level data flow language
• Made of two components:
  • Data processing language Pig Latin
  • Compiler to translate Pig Latin to MapReduce
Pig in the Hadoop ecosystem

• Pig: Scripting
• HCatalog: Metadata Management
• MapReduce: Distributed Programming Framework
• HDFS: Hadoop Distributed File System
Pig Latin
-- Load the comma-separated input files
users = LOAD 'users.txt' USING PigStorage(',') AS (name, age);
pages = LOAD 'pages.txt' USING PigStorage(',') AS (user, url);

-- Keep only users aged 18 to 50
filteredUsers = FILTER users BY age >= 18 AND age <= 50;

-- Join users with their page views and count the clicks per URL
joinResult = JOIN filteredUsers BY name, pages BY user;
grouped = GROUP joinResult BY url;
summed = FOREACH grouped GENERATE group, COUNT(joinResult) AS clicks;
sorted = ORDER summed BY clicks DESC;
top10 = LIMIT sorted 10;

STORE top10 INTO 'top10sites';
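To make the data flow of the Pig Latin script easier to follow, here is the same filter, join, group, count, order and limit pipeline as a minimal Java sketch over in-memory lists. It is not a MapReduce job and not what Pig generates; the class `TopSitesSketch`, its nested types and the sample data are hypothetical, and the join assumes that user names are unique.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical in-memory counterpart of the Pig Latin script.
class TopSitesSketch {

    /** One row of users.txt. */
    public static final class User {
        final String name;
        final int age;
        public User(String name, int age) { this.name = name; this.age = age; }
    }

    /** One row of pages.txt. */
    public static final class PageView {
        final String user;
        final String url;
        public PageView(String user, String url) { this.user = user; this.url = url; }
    }

    public static List<Map.Entry<String, Long>> topSites(
            List<User> users, List<PageView> pages, int limit) {
        // FILTER users BY age >= 18 AND age <= 50
        Set<String> filteredUsers = users.stream()
                .filter(u -> u.age >= 18 && u.age <= 50)
                .map(u -> u.name)
                .collect(Collectors.toSet());

        // JOIN on the user name, GROUP BY url, COUNT the clicks
        Map<String, Long> clicks = pages.stream()
                .filter(p -> filteredUsers.contains(p.user))
                .collect(Collectors.groupingBy(p -> p.url, Collectors.counting()));

        // ORDER BY clicks DESC (ties broken by URL), then LIMIT
        return clicks.entrySet().stream()
                .sorted((a, b) -> {
                    int byClicks = Long.compare(b.getValue(), a.getValue());
                    return byClicks != 0 ? byClicks : a.getKey().compareTo(b.getKey());
                })
                .limit(limit)
                .collect(Collectors.toList());
    }

    /** Made-up sample data. */
    public static List<User> sampleUsers() {
        return List.of(new User("alice", 25), new User("bob", 17),
                new User("carol", 40), new User("dave", 60));
    }

    public static List<PageView> samplePages() {
        return List.of(new PageView("alice", "a.com"), new PageView("alice", "b.com"),
                new PageView("carol", "a.com"), new PageView("bob", "a.com"),
                new PageView("dave", "b.com"), new PageView("carol", "c.com"));
    }

    public static void main(String[] args) {
        // Only alice and carol pass the age filter.
        System.out.println(topSites(sampleUsers(), samplePages(), 2)); // prints [a.com=2, b.com=1]
    }
}
```

This only works while both inputs fit into the memory of one machine; distributing the join and the aggregation across a cluster is the part that Pig compiles into MapReduce jobs.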

Pig Execution Plan
Try that with Java…
Pig

Let’s have a look…

SQL for Hadoop

OK, Pig seems quite

useful…
But I’m more of a
SQL person…
Apache Hive

• Data Warehousing Layer on top of Hadoop
• Allows analysis and queries using a SQL-like language
Hive in the Hadoop ecosystem

• Pig: Scripting
• Hive: Query
• HCatalog: Metadata Management
• MapReduce: Distributed Programming Framework
• HDFS: Hadoop Distributed File System
Hive Architecture

• Clients: Hive Shell, Thrift Applications (Hive Thrift Driver), JDBC Applications (Hive JDBC Driver), ODBC Applications (Hive ODBC Driver)
• Hive Server
• Hive Engine with Metastore
• MapReduce
• HDFS
Hive Example
CREATE TABLE users(name STRING, age INT);
CREATE TABLE pages(user STRING, url STRING);

LOAD DATA INPATH '/user/sandbox/users.txt' INTO TABLE users;
LOAD DATA INPATH '/user/sandbox/pages.txt' INTO TABLE pages;

SELECT pages.url, count(*) AS clicks
FROM users JOIN pages ON (users.name = pages.user)
WHERE users.age >= 18 AND users.age <= 50
GROUP BY pages.url
ORDER BY clicks DESC
LIMIT 10;
Hive

Let’s have a look…

But wait, there’s still more!

More components of the

Hadoop Ecosystem
• Mahout: Machine Learning
• Pig: Scripting
• Hive: SQL-like queries
• HBase: NoSQL Database
• HCatalog: Metadata Management
• ZooKeeper: Cluster Coordination
• Ambari: Cluster installation & management
• Oozie: Workflow automation
• MapReduce: Data processing
• HDFS: Data storage
• Sqoop: Import & Export of relational data
• Flume: Import & Export of data flows
Classical enterprise platform

• Applications: Business Intelligence, Business Applications, Custom Applications
• Data Systems: Traditional Systems (RDBMS, EDW, MPP, …)
• Data Sources: Traditional Sources (RDBMS, OLTP, OLAP, …)
• Dev Tools: Build & Test
• Operation: Manage & Monitor

Big Data Platform

• Applications: Business Intelligence, Business Applications, Custom Applications
• Data Systems: Traditional Systems (RDBMS, EDW, MPP, …) and the Enterprise Hadoop Platform
• Data Sources: Traditional Sources (RDBMS, OLTP, OLAP, …) and New Sources (Logs, Mails, Sensor, Social Media, …)
• Dev Tools: Build & Test
• Operation: Manage & Monitor
Pattern #1: Refine data

1. Capture all data
2. Process the data
3. Exchange using traditional systems
4. Process & visualize with traditional applications

[Figure: the Big Data Platform diagram with steps 1 to 4 marked on the data flow]
Pattern #2: Explore data

1. Capture all data
2. Process the data
3. Explore the data using applications with support for Hadoop

[Figure: the Big Data Platform diagram with steps 1 to 3 marked on the data flow]
Pattern #3: Enrich data

1. Capture all data
2. Process the data
3. Directly ingest the data

[Figure: the Big Data Platform diagram with steps 1 to 3 marked on the data flow; the applications shown are Business Applications and Custom Applications]
Bringing it all together…

One example…
Digital Advertising

• 6 billion ad deliveries per day
• Reports (and bills) for the advertising companies needed
• Own C++ solution did not scale
• Adding functions was a nightmare

AdServing Architecture

[Figure: AdServing architecture. Components shown: AdServers (FFM, AMS) with Campaign Database and Campaign Data synchronisation; Binary Log Format; TCP Interface; Custom Flume Source; Local files; Flume HDFS Sink; Hadoop Cluster with Temporary data, Pig, Hive and Aggregated data; Job Scheduler (Job Start); Config UI (Job Config XML); Report Engine; NAS; Direct Download]
What’s next?

Hadoop 2.0
aka YARN
Hadoop 1.0

Built for web-scale batch apps

[Figure: several stacks, each labelled “Single App”, “Batch” and “HDFS”]

MapReduce is good for…

• Embarrassingly parallel algorithms
• Summing, grouping, filtering, joining
• Off-line batch jobs on massive data sets
• Analyzing an entire large dataset
MapReduce is OK for…

• Iterative jobs (e.g., graph algorithms)
– Each iteration must read/write data to disk
– I/O and latency cost of an iteration is high
MapReduce is not good for…

• Jobs that need shared state/coordination

– Tasks are shared-nothing
– Shared-state requires scalable state store

• Low-latency jobs

• Jobs on small datasets

• Finding individual records

MapReduce limitations

• Scalability
– Maximum cluster size ~ 4,500 nodes
– Maximum concurrent tasks ~ 40,000
– Coarse synchronization in JobTracker
• Availability
– Failure kills all queued and running jobs
• Hard partition of resources into map & reduce slots
– Low resource utilization
• Lacks support for alternate paradigms and services
– Iterative applications implemented using MapReduce are 10x slower
Hadoop 2.0: Next-gen platform

Hadoop 1.0: Single use system (Batch Apps)
• MapReduce: Cluster resource mgmt. + data processing
• HDFS: Redundant, reliable storage

Hadoop 2.0: Multi-purpose platform (Batch, Interactive, Streaming, …)
• MapReduce: Data processing
• Others: Data processing
• YARN: Cluster resource management
• HDFS 2.0: Redundant, reliable storage
Taking Hadoop beyond batch
Store all data in one place
Interact with data in multiple ways

Applications run natively in Hadoop:
• Batch: MapReduce
• Interactive: Tez
• Online: HOYA
• Streaming: Storm, …
• Graph: Giraph
• In-Memory: Spark
• Other: Search, …

YARN: Cluster resource management
HDFS 2.0: Redundant, reliable storage
A brief history of Hadoop 2.0

• Originally conceived & architected by the team at Yahoo!
– Arun Murthy created the original JIRA in 2008 and is now the YARN release manager
• The team at Hortonworks has been working on YARN for 4 years:
– 90% of code from Hortonworks & Yahoo!
• Hadoop 2.0 based architecture running at scale at Yahoo!
– Deployed on 35,000 nodes for 6+ months
Hadoop 2.0 Projects

• YARN

• HDFS Federation aka HDFS 2.0

• Stinger & Tez aka Hive 2.0

YARN: Architecture
Split up the two major functions of the JobTracker:
cluster resource management & application life-cycle management

[Figure: a ResourceManager with its Scheduler above eight NodeManagers, which host the Application Masters AM 1 and AM 2 and the containers 1.1, 1.2, 2.1, 2.2 and 2.3]

YARN: Architecture

• Resource Manager
– Global resource scheduler
– Hierarchical queues

• Node Manager
– Per-machine agent
– Manages the life-cycle of containers
– Container resource monitoring

• Application Master
– Per-application
– Manages application scheduling and task execution
– e.g. MapReduce Application Master
YARN: Architecture

[Figure: a ResourceManager with its Scheduler above twelve NodeManagers that run mixed workloads side by side: MapReduce 1 and MapReduce 2 with their map and reduce tasks, HOYA with an HBase Master and Region servers 1 to 3, Storm with nimbus 1 and nimbus 2, and Tez with vertices 1 to 4]
HDFS Federation

• Removes tight coupling of Block Storage and Namespace
• Scalability & Isolation
• High Availability
• Increased performance

Details: https://issues.apache.org/jira/browse/HDFS-1052
HDFS Federation: Architecture

• NameNodes do not talk to each other
• Each NameNode manages only a slice of the namespace
– NameNode 1: Namespace 1, Block Management 1
– NameNode 2: Namespace 2, Block Management 2
– Directories shown in the namespaces: logs, finance, insights, reports
• DataNodes can store blocks managed by any NameNode

[Figure: blocks 1 to 8 spread over DataNodes 1 to 4]
HDFS: Quorum based storage

• The state is shared on a quorum of journal nodes
• Only the active NameNode writes edits
• The standby NameNode simultaneously reads and applies the edits
• DataNodes report to both NameNodes but listen only to the orders from the active one

[Figure: three Journal Nodes above an Active NameNode and a Standby NameNode, each with a Block Map and an Edits File, above five DataNodes]
Hive: Current Focus Area

• Real-Time (0-5s): Online systems, R-T analytics, CEP
• Interactive (5s – 1m): Parameterized Reports, Drilldown, Visualization, Exploration
• Non-Interactive (1m – 1h): Data preparation, Incremental batch processing, Dashboards / Scorecards
• Batch (1h+): Operational batch processing, Enterprise Reports, Data Mining

[Figure: the “Current Hive Sweet Spot” is marked on this latency scale; the second axis is Data Size]
Stinger: Extending the sweet spot

[Figure: latency scale 0-5s, 5s – 1m, 1m – 1h, 1h+ plotted against Data Size]

Improve Latency & Throughput
• Query engine improvements
• New “Optimized RCFile” column store
• Next-gen runtime (eliminates M/R latency)

Extend Deep Analytical Ability
• Analytics functions
• Improved SQL coverage
• Continued focus on core Hive use cases
Stinger Initiative at a glance
Tez: The Execution Engine

• Low level data-processing execution engine
• Can serve as the base for MapReduce, Hive, Pig, etc.
• Enables pipelining of jobs
• Removes task and job launch times
• Hive and Pig jobs no longer need to move to the end of the queue between steps in the pipeline
• Does not write intermediate output to HDFS
– Much lighter disk and network usage
• Built on YARN
Pig/Hive MR vs. Pig/Hive Tez
SELECT a.state, COUNT(*), AVG(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state

• Pig/Hive - MR: Job 1, Job 2 and Job 3, separated by I/O synchronization barriers
• Pig/Hive - Tez: a single job

Tez Service

• MapReduce Query Startup is expensive:
– Job launch & task-launch latencies are fatal for short queries (in order of 5s to 30s)
• Solution:
– Tez Service (= Preallocated Application Master)
  • Removes job-launch overhead (Application Master)
  • Removes task-launch overhead (Pre-warmed Containers)
– Hive/Pig
  • Submit query-plan to Tez Service
– Native Hadoop service, not ad-hoc
Tez: Low latency
SELECT a.state, COUNT(*), AVG(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state

Step                  Existing Hive   Hive/Tez   Tez & Tez Service
Parse Query           0.5s            0.5s       0.5s
Create Plan           0.5s            0.5s       0.5s
Launch Map-Reduce     20s             20s        0.5s (Submit to Tez Service)
Process Map-Reduce    10s             2s         2s
Total                 31s             23s        3.5s

* No exact numbers, for illustration only

Stinger: Summary

* Real numbers, but handle with care!

Hadoop 2.0 Applications

• MapReduce 2.0
• HOYA - HBase on YARN
• Storm, Spark, Apache S4
• Hamster (MPI on Hadoop)
• Apache Giraph
• Apache Hama
• Distributed Shell
• Tez
MapReduce 2.0

• Basically a port to the YARN architecture
• MapReduce becomes a user-land library
• No need to rewrite MapReduce jobs
• Increased scalability & availability
• Better cluster utilization
HOYA: HBase on YARN

• Create on-demand HBase clusters
• Configure different HBase instances differently
• Better isolation
• Create (transient) HBase clusters from MapReduce jobs
• Elasticity of clusters for analytic / batch workload processing
• Better utilization of cluster resources
Twitter Storm

• Stream-processing
• Real-time processing
• Developed as standalone application
• https://github.com/nathanmarz/storm
• Ported to YARN
• https://github.com/yahoo/storm-yarn
Storm: Conceptual view

• Spout: Source of streams
• Bolt: Consumer of streams, processing of tuples, possibly emits new tuples
• Stream: Unbounded sequence of tuples
• Tuple: List of name-value pairs
• Topology: Network of spouts & bolts as the nodes and streams as the edges
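The roles of spout, bolt, tuple and topology can be made concrete with a small Java sketch. This is not Storm's API: the `Spout` and `Bolt` interfaces below are hypothetical stand-ins, a tuple is modelled as a plain map of name-value pairs, the topology is a simple linear chain, and the stream is finite so that the example terminates.

```java
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Consumer;

// Hypothetical single-threaded illustration of the concepts; not Storm's API.
class TopologySketch {

    /** Spout: source of a stream of tuples. */
    public interface Spout {
        /** Returns the next tuple, or null once this (finite, for the demo) stream ends. */
        Map<String, Object> nextTuple();
    }

    /** Bolt: consumes a tuple and may emit new tuples downstream. */
    public interface Bolt {
        void execute(Map<String, Object> tuple, Consumer<Map<String, Object>> emit);
    }

    /** Splits a "sentence" tuple into one "word" tuple per token. */
    public static final class SplitBolt implements Bolt {
        public void execute(Map<String, Object> tuple, Consumer<Map<String, Object>> emit) {
            for (String word : ((String) tuple.get("sentence")).split("\\s+")) {
                emit.accept(Map.<String, Object>of("word", word));
            }
        }
    }

    /** Keeps a running count per word and emits nothing. */
    public static final class CountBolt implements Bolt {
        final Map<String, Integer> counts = new TreeMap<>();

        public void execute(Map<String, Object> tuple, Consumer<Map<String, Object>> emit) {
            counts.merge((String) tuple.get("word"), 1, Integer::sum);
        }
    }

    /** Topology: pushes every tuple of the spout through the chain of bolts. */
    public static void run(Spout spout, List<Bolt> bolts) {
        Map<String, Object> tuple;
        while ((tuple = spout.nextTuple()) != null) {
            process(tuple, bolts, 0);
        }
    }

    private static void process(Map<String, Object> tuple, List<Bolt> bolts, int index) {
        if (index == bolts.size()) {
            return; // nothing downstream of the last bolt
        }
        bolts.get(index).execute(tuple, emitted -> process(emitted, bolts, index + 1));
    }

    /** Wires a sentence spout to a split bolt and a count bolt. */
    public static Map<String, Integer> demo() {
        Iterator<String> sentences = List.of("to be or not", "to be").iterator();
        Spout spout = () -> sentences.hasNext()
                ? Map.<String, Object>of("sentence", sentences.next())
                : null;
        CountBolt counter = new CountBolt();
        run(spout, List.of(new SplitBolt(), counter));
        return counter.counts;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints {be=2, not=1, or=1, to=2}
    }
}
```

In a real stream processor the spout never runs dry and the bolts run in parallel on many machines; the sketch only shows how tuples travel along the edges of the topology.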
Spark

• High-speed in-memory analytics over Hadoop and Hive
• Separate MapReduce-like engine
– Speedup of up to 100x
– On-disk queries 5-10x faster
• Compatible with Hadoop’s Storage API
• Available as standalone application
– https://github.com/mesos/spark
• Experimental support for YARN since 0.6
– http://spark.incubator.apache.org/docs/0.6.0/running-on-yarn.html
Data Sharing in Spark
Apache Giraph

• Giraph is a framework for processing semi-structured graph data on a massive scale.
• Giraph is loosely based upon Google's Pregel.
• Giraph performs iterative calculations on top of an existing Hadoop cluster.
• Available on GitHub
– https://github.com/apache/giraph
Hadoop 2.0 Summary

1. Scale
2. New programming models & services
3. Improved cluster utilization
4. Agility
5. Beyond Java
Getting started…

One more thing…

Hortonworks Sandbox

http://hortonworks.com/products/hortonworks-sandbox
Books about Hadoop

1. Hadoop: The Definitive Guide, Tom White, 3rd ed., O’Reilly, 2012
2. Hadoop in Action, Chuck Lam, Manning, 2011
3. Programming Pig, Alan Gates, O’Reilly, 2011
4. Hadoop Operations, Eric Sammer, O’Reilly, 2012
The end…or the beginning?
