Intro Hadoop Ecosystem 24sep2020
uweseiler
Introduction to the Hadoop Ecosystem
07.11.13
About me
is a bunch of…
Join us!
The 3 V’s of Big Data
• Volume
• Velocity
• Variety
NoSQL
Classification of NoSQL
Horizontal Scaling
Vertical Scaling
RAM
CPU
Storage
Horizontal Scaling
RAM
CPU
Storage
Horizontal Scaling
“Data“
w w w
r r r
“Result“
But…
Parallel processing is complicated!
But…
Hadoop != Database
What is Hadoop?
http://www.guardian.co.uk/technology/2011/mar/25/media-guardian-innovation-awards-apache-hadoop
The Hadoop App Store
Apache Hadoop
• HDFS
• MapReduce
• Hadoop Ecosystem
• Hadoop YARN
Hadoop Distributions (Apache Hadoop +)
• Test & Packaging
• Installation
• Monitoring
• Business Support
Big Data Suites (Hadoop Distributions +)
• Integrated Environment
• Visualization
• (Near-)Realtime analysis
• Modeling
• ETL & Connectors
Intended for
• large files
• batch inserts
HDFS Architecture
[Diagram: slave nodes spread across Rack 1 and Rack 2]
Map
• Extract something of interest from each
Reduce
• Aggregate intermediate results
a b 2 c 9 a 3 c 2 b 7 c 8
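The two phases above can be sketched in plain Python. This is only an illustration of the idea, not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`) and the input records are hypothetical, and everything runs in one process instead of on a cluster.

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every record and collect the emitted (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs

def shuffle(pairs):
    """Group intermediate values by key (done by the framework between map and reduce)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Aggregate the grouped values of each key into one result per key."""
    return {key: reducer(key, values) for key, values in groups.items()}

# Hypothetical input: one "key value" record per line.
records = ["a 1", "b 2", "a 3", "c 4", "b 5"]

def parse_record(record):
    # Map: extract the part of interest from each record.
    key, value = record.split()
    return [(key, int(value))]

def sum_values(key, values):
    # Reduce: aggregate the intermediate results for one key.
    return sum(values)

result = reduce_phase(shuffle(map_phase(records, parse_record)), sum_values)
print(result)  # {'a': 4, 'b': 7, 'c': 4}
```

In a real cluster the map calls run in parallel on the nodes that hold the data, and the grouping step moves data across the network; the sketch keeps only the data flow.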
[Diagram: jobs go to the JobTracker, files to the NameNode; a Secondary NameNode is also shown]
• Pig: Scripting
• HCatalog: Metadata Management
• MapReduce: Distributed Programming Framework
• HDFS: Hadoop Distributed File System
Pig Latin
users = LOAD 'users.txt' USING PigStorage(',') AS (name, age);
pages = LOAD 'pages.txt' USING PigStorage(',') AS (user, url);
• Pig: Scripting
• Hive: Query
• HCatalog: Metadata Management
• MapReduce: Distributed Programming Framework
• HDFS: Hadoop Distributed File System
Hive Architecture
• Hive Shell
• Hive Server
• Hive Engine
• Metastore
• Hive Thrift Driver: Thrift Applications
• Hive JDBC Driver: JDBC Applications
• Hive ODBC Driver: ODBC Applications
• MapReduce
• HDFS
Hive Example
CREATE TABLE users(name STRING, age INT);
CREATE TABLE pages(user STRING, url STRING);
• Pig: Scripting
• Hive: Query
• Oozie: Workflow automation
• HBase: NoSQL Database
• ZooKeeper: Cluster Coordination
• HCatalog: Metadata Management
• Ambari
• MapReduce: Data processing
• HDFS: Data storage
• Sqoop: Import & Export of relational data
• Flume: Import & Export of data flows
Agenda
Applications
Data Systems
• Traditional Systems: RDBMS, EDW, MPP, …
• Enterprise Hadoop Platform
Data Sources
• Traditional Sources: RDBMS, OLTP, OLAP, …
• New Sources: Logs, Mails, Sensor, Social Media, …
Operation
• Build & Test
• Manage & Monitor
Pattern #1: Refine data
1. Capture all data
2. Process the data
3. Exchange using traditional systems
4. Process & visualize with traditional applications
[Diagram: Applications (Business Intelligence, Business Applications, Custom Applications); Data Systems (Traditional Systems: RDBMS, EDW, MPP, …; Enterprise Hadoop Platform); Data Sources (Traditional Sources: RDBMS, OLTP, OLAP, …; New Sources: Logs, Mails, Sensor, Social Media, …)]
Pattern #2: Explore data
1. Capture all data
3. … for Hadoop
[Diagram: Applications (Business Intelligence, Business Applications, Custom Applications); Data Systems; Data Sources (RDBMS, OLTP, OLAP, …, Logs, Mails, Sensor, Social Media, …)]
Pattern #3: Enrich data
1. Capture all data
2. Process the data
3. Directly ingest the data
[Diagram: Applications (Business Applications, Custom Applications); Data Systems (Traditional Systems: RDBMS, EDW, MPP, …; Enterprise Hadoop Platform); Data Sources (Traditional Sources: RDBMS, OLTP, OLAP, …; New Sources: Logs, Mails, Sensor, Social Media, …)]
Bringing it all together…
One example…
Digital Advertising
[Diagram: AdServer (FFM, AMS) with Campaign Database and campaign data; binary log format over TCP interfaces; local files and NAS; temporary and aggregated data; Flume HDFS Sink and direct download; Scheduler and job start; Config UI and Job Config XML]
What’s next?
Hadoop 2.0
aka YARN
Hadoop 1.0
• Low-latency jobs
• Scalability
– Maximum cluster size ~ 4,500 nodes
– Maximum concurrent tasks ~ 40,000
– Coarse synchronization in JobTracker
• Availability
– Failure kills all queued and running jobs
Hadoop 1.0
• MapReduce: Cluster resource mgmt. + data processing
• HDFS: Redundant, reliable storage
Hadoop 2.0
• MapReduce: Data processing
• Others: Data processing
• YARN: Cluster resource management
• HDFS 2.0: Redundant, reliable storage
Taking Hadoop beyond batch
Store all data in one place
Interact with data in multiple ways
YARN
Cluster resource management
HDFS 2.0
Redundant, reliable storage
A brief history of Hadoop 2.0
• Resource Manager
– Global resource scheduler
– Hierarchical queues
• Node Manager
– Per-machine agent
– Manages the life-cycle of containers
– Container resource monitoring
• Application Master
– Per-application
– Manages application scheduling and task execution
– e.g. MapReduce Application Master
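How the three roles above divide the work can be shown with a toy model. This is a sketch under simplifying assumptions, not YARN's actual API: the class and method names are illustrative, memory is the only resource, hierarchical queues are left out, and the "most free memory first" policy is invented for the example.

```python
from dataclasses import dataclass, field

@dataclass
class NodeManager:
    """Per-machine agent: tracks free capacity and the containers it runs."""
    name: str
    free_memory_mb: int
    containers: list = field(default_factory=list)

    def launch(self, app_id, memory_mb):
        # Reserve memory and record the new container on this machine.
        self.free_memory_mb -= memory_mb
        container = (app_id, self.name, memory_mb)
        self.containers.append(container)
        return container

class ResourceManager:
    """Global scheduler: decides which node hosts each requested container."""
    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, app_id, memory_mb):
        # Toy policy (an assumption): pick the node with the most free memory.
        node = max(self.nodes, key=lambda n: n.free_memory_mb)
        if node.free_memory_mb < memory_mb:
            return None  # no node can host the container right now
        return node.launch(app_id, memory_mb)

class ApplicationMaster:
    """Per-application: requests containers and tracks its own tasks."""
    def __init__(self, app_id, resource_manager):
        self.app_id = app_id
        self.rm = resource_manager
        self.running = []
        self.pending = []

    def run_tasks(self, tasks, memory_mb):
        for task in tasks:
            container = self.rm.allocate(self.app_id, memory_mb)
            if container is None:
                self.pending.append(task)
            else:
                self.running.append((task, container))

nodes = [NodeManager("node-1", 2048), NodeManager("node-2", 1024)]
rm = ResourceManager(nodes)
am = ApplicationMaster("app-1", rm)
am.run_tasks(["map-0", "map-1", "map-2", "reduce-0"], memory_mb=1024)
print(len(am.running), am.pending)  # 3 ['reduce-0']
```

The point of the split is that the ResourceManager only hands out containers, while each application brings its own ApplicationMaster to decide what runs inside them.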
YARN: Architecture
ResourceManager
Scheduler
Details: https://issues.apache.org/jira/browse/HDFS-1052
HDFS Federation: Architecture
• NameNodes do not talk to each other
• Each NameNode manages only a slice of the namespace
[Diagram: NameNode 1 (Namespace 1) and NameNode 2 (Namespace 2) over the directories logs, finance, insights, reports]
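A client-side view of a federated namespace can be sketched as a lookup table from top-level directory to NameNode. The sketch below is illustrative only: `MOUNT_TABLE` and `resolve_namenode` are hypothetical names, and the assignment of logs and finance to NameNode 1 and of insights and reports to NameNode 2 is an assumption based on the order in the diagram.

```python
# Hypothetical mount table: which NameNode owns which top-level directory.
MOUNT_TABLE = {
    "/logs": "NameNode 1",
    "/finance": "NameNode 1",
    "/insights": "NameNode 2",
    "/reports": "NameNode 2",
}

def resolve_namenode(path, mount_table=MOUNT_TABLE):
    """Return the NameNode responsible for `path`, chosen by its top-level directory."""
    top_level = "/" + path.strip("/").split("/")[0]
    try:
        return mount_table[top_level]
    except KeyError:
        raise LookupError(f"no NameNode owns {path!r}") from None

print(resolve_namenode("/logs/2013/11/07/access.log"))  # NameNode 1
print(resolve_namenode("/reports/q3.csv"))              # NameNode 2
```

Because each lookup touches exactly one NameNode, the NameNodes never need to coordinate with each other.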
Stinger: Extending the sweet spot
• Real-Time (0-5s): Online systems, R-T analytics, CEP
• Interactive (5s – 1m): Parameterized Reports, Drilldown, Visualization, Exploration
• Non-Interactive (1m – 1h): Data preparation, Incremental batch processing, Dashboards / Scorecards
• Batch (1h+): Operational batch processing, Enterprise Reports, Data Mining
[Diagram: Current Hive Sweet Spot and Future Hive Expansion across these latency bands, plotted against Data Size]
[Diagram: multiple jobs (Job 2, Job 3) separated by I/O Synchronization Barriers, compared with a Single Job]
• Solution:
– Tez Service (= Preallocated Application Master)
• Removes job-launch overhead (Application Master)
• Removes task-launch overhead (Pre-warmed Containers)
– Hive/Pig
• Submit query-plan to Tez Service
– Native Hadoop service, not ad-hoc
Tez: Low latency
SELECT a.state, COUNT(*),
AVG(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state
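The query above joins three tables and then aggregates, so its plan has several stages. As a rough illustration, the Python sketch below evaluates the same logic in three steps. The rows are hypothetical (the slide does not show the contents of a, b, or c), and `inner_join` and `group_by_state` are made-up helpers, not Hive or Tez APIs.

```python
from collections import defaultdict

# Hypothetical rows for the three tables.
a = [
    {"id": 1, "itemId": 10, "state": "CA"},
    {"id": 2, "itemId": 11, "state": "CA"},
    {"id": 3, "itemId": 10, "state": "NY"},
]
b = [{"id": 1}, {"id": 2}, {"id": 3}]
c = [{"itemId": 10, "price": 4.0}, {"itemId": 11, "price": 6.0}]

def inner_join(left, right, key):
    """Hash join: index `right` by `key`, then probe it with every row of `left`."""
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    return [{**match, **row} for row in left for match in index.get(row[key], [])]

def group_by_state(rows):
    """Return {state: (row count, average price)}."""
    prices = defaultdict(list)
    for row in rows:
        prices[row["state"]].append(row["price"])
    return {state: (len(p), sum(p) / len(p)) for state, p in prices.items()}

stage1 = inner_join(a, b, "id")           # JOIN b ON (a.id = b.id)
stage2 = inner_join(stage1, c, "itemId")  # JOIN c ON (a.itemId = c.itemId)
result = group_by_state(stage2)           # GROUP BY a.state
print(result)  # {'CA': (2, 5.0), 'NY': (1, 4.0)}
```

Each step is the kind of stage that could become a separate job, with intermediate results written out in between, when a plan runs as chained MapReduce jobs; running all stages inside one job avoids those intermediate hand-offs.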
Hadoop 2.0 Applications
• MapReduce 2.0
• HOYA - HBase on YARN
• Storm, Spark, Apache S4
• Hamster (MPI on Hadoop)
• Apache Giraph
• Apache Hama
• Distributed Shell
• Tez
MapReduce 2.0
HOYA: HBase on YARN
Twitter Storm
• Stream-processing
• Real-time processing
• Developed as standalone application
• https://github.com/nathanmarz/storm
• Ported to YARN
• https://github.com/yahoo/storm-yarn
Storm: Conceptual view
Spout: Source of streams
Bolt: Consumer of streams, processing of tuples, possibly emits new tuples
Stream: Unbounded sequence of tuples
Tuple: List of name-value pairs
Topology: Network of Spouts & Bolts as the nodes and streams as the edges
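These concepts map naturally onto Python generators. The sketch below is a single-process illustration, not Storm's API: the function names are hypothetical, tuples are modeled as dicts of name-value pairs, and a short finite list stands in for an unbounded stream.

```python
from collections import Counter

def sentence_spout():
    """Spout: source of a stream. A finite list stands in for an unbounded feed."""
    sentences = [
        "hadoop stores data",
        "storm processes streams",
        "storm emits tuples",
    ]
    for sentence in sentences:
        yield {"sentence": sentence}

def split_bolt(stream):
    """Bolt: consumes sentence tuples and emits one new tuple per word."""
    for tup in stream:
        for word in tup["sentence"].split():
            yield {"word": word}

def count_bolt(stream):
    """Bolt: consumes word tuples and emits the running count of each word."""
    counts = Counter()
    for tup in stream:
        counts[tup["word"]] += 1
        yield {"word": tup["word"], "count": counts[tup["word"]]}

# Topology: spout -> split_bolt -> count_bolt, with streams as the edges.
topology = count_bolt(split_bolt(sentence_spout()))

final_counts = {}
for tup in topology:
    final_counts[tup["word"]] = tup["count"]
print(final_counts["storm"])  # 2
```

Each tuple flows through the whole chain as soon as the spout emits it, which is the difference from a batch job that waits for the complete input.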
Spark
Apache Giraph
1. Scale
2. New programming models & Services
3. Improved cluster utilization
4. Agility
5. Beyond Java
Getting started…
http://hortonworks.com/products/hortonworsk-sandbox
Books about Hadoop