2 Hadoop Ecosystem
Hadoop ecosystem
We need a system that scales
• Traditional tools are overwhelmed
• Slow disks, unreliable machines; parallelism is not easy
• Three challenges
• Reliable storage
• Powerful data processing
• Efficient visualization
What is Apache Hadoop?
• Scalable and economical data storage and
processing
• The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale out from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thus delivering a highly available service on top of a cluster of computers, each of which may be prone to failures (commodity hardware).
• Heavily inspired by Google's data architecture (the Google File System and MapReduce)
Hadoop main components
• Storage: Hadoop distributed file system
(HDFS)
• Processing: MapReduce framework
• System utilities:
• Hadoop Common: The common utilities that
support the other Hadoop modules.
• Hadoop YARN: A framework for job scheduling and
cluster resource management.
Scalability
• Distributed by design
• Hadoop runs on a cluster
• Individual servers within a cluster are called nodes
• Each node may both store and process data
• Scale out by adding more nodes
• Clusters can grow to several thousand nodes
Fault tolerance
• Cluster of commodity servers
• Hardware failure is the norm rather than the exception
• Built with redundancy
• Files loaded into HDFS are replicated across nodes in the cluster
• If a node fails, its data is re-replicated from one of the remaining copies
• Data processing jobs are broken into individual tasks
• Each task takes a small amount of data as input
• Tasks execute in parallel
• Failed tasks are rescheduled on other nodes
• Routine failures are handled automatically without any loss of data
Hadoop distributed file system
• Provides inexpensive and reliable storage for massive
amounts of data
• Optimized for big files (from 100 MB to several TB per file)
• Hierarchical UNIX-style file system
• e.g., /hust/soict/hello.txt
• UNIX-style file ownership and permissions
• There are also some major deviations from UNIX
• Append-only writes
• Write once, read many times
HDFS Architecture
• Master/slave architecture
• HDFS master: namenode
• Manages the namespace and metadata
• Monitors datanodes
• HDFS slave: datanode
• Handles reads and writes of the actual data (see the sketch below)
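To make the namenode/datanode split concrete, here is a minimal sketch of writing and reading a file through the HDFS Java API (org.apache.hadoop.fs.FileSystem). The namenode address is a placeholder; only metadata operations go through the namenode, while the data blocks themselves are streamed to and from datanodes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // placeholder namenode address

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/hust/soict/hello.txt");

        // Write: the namenode records metadata, datanodes store the replicated blocks
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello HDFS");
        }

        // Read the file back
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}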
HDFS main design principles
• I/O pattern
• Append only → reduces synchronization
• Data distribution
• Files are split into big chunks (64 MB)
→ reduces metadata size
→ reduces network communication
• Data replication
• Each chunk is usually replicated on 3 different nodes
• Fault tolerance
• Datanode failure: re-replication
• Namenode failure
• Secondary namenode
• Query datanodes instead of a complex checkpointing scheme
Data processing: MapReduce
• The MapReduce framework is Hadoop's default data processing engine
• MapReduce is a programming model for data processing
• It is not a language, but a style of processing data created by Google
• The beauty of MapReduce
• Simplicity
• Flexibility
• Scalability
A MapReduce job = {isolated tasks}ⁿ
• MapReduce divides the workload into multiple independent tasks and schedules them across cluster nodes
• The work performed by each task is done in isolation from the other tasks, for scalability reasons
• The communication overhead required to keep the data on the nodes synchronized at all times would prevent the model from performing reliably and efficiently at large scale
Data Distribution
• In a MapReduce cluster, data is usually managed by a distributed file system (e.g., HDFS)
• Move code to data and not data to code
Keys and Values
• In MapReduce, the programmer has to specify two functions, the map function and the reduce function, which implement the Mapper and the Reducer of a MapReduce program
• In MapReduce, data elements are always structured as key-value (i.e., (K, V)) pairs
• The map and reduce functions receive and emit (K, V) pairs
(Figure: input splits → intermediate outputs → final outputs)
Map phase
• Hadoop splits a job into many individual map tasks
• The number of map tasks is determined by the amount of input data
• Each map task receives a portion of the overall job input to process
• Mappers process one input record at a time
• For each input record, they emit zero or more records as output
• In this example, the map task simply parses the input record
• and then emits the name and price fields of each record as output (see the sketch below)
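A minimal sketch of such a map task in the Hadoop Java API, assuming each input line is a comma-separated record of the form "name,price" (the exact record layout used in the slides' figure is not shown here, so this format is an assumption):

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class PriceMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    private final Text name = new Text();
    private final DoubleWritable price = new DoubleWritable();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split(",");
        if (fields.length >= 2) {                 // emit zero records for malformed input
            name.set(fields[0].trim());
            price.set(Double.parseDouble(fields[1].trim()));
            context.write(name, price);           // emit (name, price)
        }
    }
}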
Shuffle & sort
• Hadoop automatically sorts and merges output from all
map tasks
• This intermediate process is known as the shuffle and sort
• The result is supplied to reduce tasks
Reduce phase
• Reducer input comes from the shuffle and sort process
• As with map, the reduce function receives one record at a time
• A given reducer receives all records for a given key
• For each input record, reduce can emit zero or more output records
• Our reduce function simply sums the total per person
• and emits the employee name (key) and total (value) as output (see the sketch below)
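A matching sketch of the reduce task: for each employee name it receives all of that employee's prices from the shuffle and sort, sums them, and emits (name, total). The class and field names are illustrative.

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class TotalPerPersonReducer
        extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {

    private final DoubleWritable total = new DoubleWritable();

    @Override
    protected void reduce(Text name, Iterable<DoubleWritable> prices, Context context)
            throws IOException, InterruptedException {
        double sum = 0.0;
        for (DoubleWritable p : prices) {   // all values for this key arrive together
            sum += p.get();
        }
        total.set(sum);
        context.write(name, total);          // emit (employee name, total)
    }
}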
Data flow for the entire MapReduce job
Word Count Dataflow
MapReduce - Dataflow
MapReduce life cycle
Example: Word Count (1)
Example: Word Count (2)
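The word count slides are shown as figures; for reference, a minimal, self-contained version of the classic word count job in the org.apache.hadoop.mapreduce API looks like this (input and output paths are passed on the command line):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);       // emit (word, 1)
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);         // emit (word, count)
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input dir in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output dir in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}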
Hadoop ecosystem
• Many related tools integrate with Hadoop
• Data analysis
• Database integration
• Workflow management
• These are not considered ‘core Hadoop’
• Rather, they are part of the ‘Hadoop ecosystem’
• Many are also open source Apache projects
Apache Pig
• Apache Pig builds on Hadoop to offer high level data processing
• Pig is especially good at joining and transforming data
• The Pig interpreter runs on the client machine
• Turns PigLatin scripts into MapReduce jobs
• Submits those jobs to the cluster
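A minimal sketch of driving Pig from Java through PigServer; the input path, field layout, and output path are hypothetical, and the Pig Latin statements mirror the name/price example used earlier:

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigExample {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.MAPREDUCE);

        // Pig Latin script: load, transform, and store; the interpreter turns
        // these statements into one or more MapReduce jobs.
        pig.registerQuery("sales = LOAD '/data/sales' USING PigStorage(',') "
                + "AS (name:chararray, price:double);");
        pig.registerQuery("totals = FOREACH (GROUP sales BY name) "
                + "GENERATE group AS name, SUM(sales.price) AS total;");
        pig.store("totals", "/data/sales_totals");

        pig.shutdown();
    }
}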
Apache Hive
• Another abstraction on top of MapReduce
• Reduce development time
• HiveQL: SQL-like language
• The Hive interpreter runs on the client machine
• Turns HiveQL scripts into MapReduce jobs
• Submits those jobs to the cluster
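A minimal sketch of submitting HiveQL from a Java client through the HiveServer2 JDBC driver; the server address, credentials, and the sales table are hypothetical:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // HiveServer2 typically listens on port 10000.
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver:10000/default", "user", "");
             Statement stmt = conn.createStatement()) {

            // HiveQL looks like SQL; the Hive engine compiles it into MapReduce jobs.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT name, SUM(price) AS total FROM sales GROUP BY name")) {
                while (rs.next()) {
                    System.out.println(rs.getString("name") + "\t" + rs.getDouble("total"));
                }
            }
        }
    }
}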
Apache HBase
• HBase is a distributed column-oriented data store built on top of HDFS
• It is often considered the Hadoop database
• Data is logically organized into tables, rows, and columns
• Tables can hold terabytes, even petabytes, of data
• Tables can have many thousands of columns
• Scales to provide very high write throughput
• Hundreds of thousands of inserts per second
• Fairly primitive when compared to an RDBMS
• NoSQL: there is no high-level query language
• Use the API to scan / get / put values based on keys (see the sketch below)
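A minimal sketch of the HBase client API described in the last bullet (put and get by row key); the table name, column family, and ZooKeeper quorum are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3");  // hypothetical quorum

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("employees"))) {

            // Put a value: row key -> column family:qualifier -> value
            Put put = new Put(Bytes.toBytes("alice"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("total"), Bytes.toBytes("123.45"));
            table.put(put);

            // Get it back by row key
            Result result = table.get(new Get(Bytes.toBytes("alice")));
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("total"));
            System.out.println(Bytes.toString(value));
        }
    }
}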
Apache Sqoop
• Sqoop is a tool designed for efficiently
transferring bulk data between Apache
Hadoop and structured datastores such
as relational databases.
• It can import all tables, a single table, or
a portion of a table into HDFS
• Via a map-only MapReduce job
• The result is a directory in HDFS containing comma-delimited text files
• Sqoop can also export data from HDFS
back to the database
Apache Kafka
• Kafka decouples data streams
• Producers don't know about consumers
• Flexible message consumption
• The Kafka broker delegates the log partition offset (location) to consumers (clients)
(Figure: producers publish to a Kafka broker cluster coordinated by ZooKeeper; consumers read from it and track their own offsets)
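A minimal sketch of a producer and a consumer using the kafka-clients Java library; the broker address, topic name, and consumer group id are placeholders:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaExample {
    public static void main(String[] args) {
        // Producer: publishes to a topic without knowing who will consume it.
        Properties p = new Properties();
        p.put("bootstrap.servers", "broker1:9092");
        p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
            producer.send(new ProducerRecord<>("sales", "alice", "123.45"));
        }

        // Consumer: tracks its own offset in each log partition.
        Properties c = new Properties();
        c.put("bootstrap.servers", "broker1:9092");
        c.put("group.id", "sales-readers");
        c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c)) {
            consumer.subscribe(Collections.singletonList("sales"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                System.out.println(record.key() + " -> " + record.value()
                        + " @ offset " + record.offset());
            }
        }
    }
}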
Apache Oozie
• Oozie is a workflow scheduler system to manage
Apache Hadoop jobs.
• Oozie workflow jobs are Directed Acyclic Graphs (DAGs) of actions.
• Oozie supports many workflow actions, including
• Executing MapReduce jobs
• Running Pig or Hive scripts
• Executing standard Java or shell programs
• Manipulating data via HDFS commands
• Running remote commands with SSH
• Sending e-mail messages
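A minimal sketch of submitting a workflow with the Oozie Java client (org.apache.oozie.client.OozieClient); the Oozie server URL, the HDFS path of the workflow application, and the job properties are hypothetical:

import java.util.Properties;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class OozieExample {
    public static void main(String[] args) throws Exception {
        OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

        // Job properties: where the workflow.xml (the DAG of actions) lives in HDFS.
        Properties conf = oozie.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:9000/user/hust/my-wf");
        conf.setProperty("nameNode", "hdfs://namenode:9000");
        conf.setProperty("resourceManager", "resourcemanager:8032");

        String jobId = oozie.run(conf);   // submit and start the workflow
        System.out.println("Workflow job submitted: " + jobId);

        WorkflowJob job = oozie.getJobInfo(jobId);
        System.out.println("Status: " + job.getStatus());
    }
}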
Apache Zookeeper
• Apache ZooKeeper is a highly reliable
distributed coordination service
• Group membership
• Leader election
• Dynamic Configuration
• Status monitoring
• All of these kinds of services are used in some
form or another by distributed applications
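A minimal sketch of group membership with the ZooKeeper Java client: each process registers an ephemeral znode under a group path, and any client can list the current members. The connect string and paths are placeholders:

import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperExample {
    public static void main(String[] args) throws Exception {
        // Connect to the ZooKeeper ensemble (watcher omitted for brevity).
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 3000, event -> { });

        // Ensure the group node exists (persistent).
        if (zk.exists("/workers", false) == null) {
            zk.create("/workers", new byte[0],
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Register this process as a member: an ephemeral node disappears
        // automatically if the session dies, which is what makes failure
        // detection and leader election possible.
        zk.create("/workers/worker-1", new byte[0],
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // List current group members.
        List<String> members = zk.getChildren("/workers", false);
        System.out.println("Current members: " + members);

        zk.close();
    }
}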
PAXOS algorithm
https://www.youtube.com/watch?v=d7nAGI_NZPk
YARN – Yet Another Resource Negotiator
• Nodes have "resources" – memory and CPU cores – which are allocated to applications when requested
• Moving beyond MapReduce
• MR and non-MR applications running on the same cluster
• Most jobtracker functions moved to per-application masters
(Figure: in Hadoop 1.0, MapReduce handles both cluster resource management and data processing on top of HDFS; in Hadoop 2.0, YARN takes over cluster resource management, with HDFS providing redundant, reliable storage underneath)
YARN execution
Big data platform: Hadoop ecosystem
Hortonworks Data Platform Sandbox
Demo
Big data management
References
• White, Tom. Hadoop: The Definitive Guide. O'Reilly Media, Inc., 2012.
• Borthakur, Dhruba. "HDFS Architecture Guide." Hadoop Apache Project 53.1-13 (2008): 2.
• Ghemawat, Sanjay, Howard Gobioff, and Shun-Tak Leung. "The Google File System." Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles. 2003.
• Hunt, Patrick, et al. "ZooKeeper: Wait-free Coordination for Internet-scale Systems." USENIX Annual Technical Conference. Vol. 8. No. 9. 2010.
• Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: Simplified Data Processing on Large Clusters." Communications of the ACM 51.1 (2008): 107-113.
Thank you for your attention!