2 Hadoop Ecosystem
Hadoop ecosystem
We need a system that scales
• Traditional tools are overwhelmed
• Slow disks, unreliable machines; parallelism is not easy
• Three challenges
• Reliable storage
• Powerful data processing
• Efficient visualization
What is Apache Hadoop?
• Scalable and economical data storage and
processing
• The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale out from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thus delivering a highly available service on top of a cluster of computers, each of which may be prone to failures (commodity hardware).
• Heavily inspired by Google's data architecture (the Google File System and MapReduce)
Hadoop main components
• Storage: Hadoop distributed file system
(HDFS)
• Processing: MapReduce framework
• System utilities:
• Hadoop Common: The common utilities that
support the other Hadoop modules.
• Hadoop YARN: A framework for job scheduling and
cluster resource management.
Scalability
• Distributed by design
• Hadoop runs on a cluster
• Individual servers within a cluster are called nodes
• Each node may both store and process data
• Scale out by adding more nodes
• Clusters can grow to several thousand nodes
Fault tolerance
• Cluster of commodity servers
• Hardware failure is the norm rather than the exception
• Built with redundancy
• Files loaded into HDFS are replicated across nodes in the cluster
• If a node fails, its data is re-replicated from one of the remaining copies
• Data processing jobs are broken into individual tasks
• Each task takes a small amount of data as input
• Tasks execute in parallel
• Failed tasks are rescheduled on other nodes
• Routine failures are handled automatically without any loss of data
Hadoop distributed file system
• Provides inexpensive and reliable storage for massive
amounts of data
• Optimized for big files (from 100 MB to several TB per file)
• Hierarchical UNIX-style file system
• e.g., /hust/soict/hello.txt
• UNIX-style file ownership and permissions
• There are also some major deviations from UNIX
• Append-only writes
• Write once, read many times
HDFS Architecture
• Master/slave architecture
• HDFS master: namenode
• Manages the namespace and metadata
• Monitors datanodes
• HDFS slave: datanode
• Handles reads and writes of the actual data (see the sketch below)
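To make the namenode/datanode split concrete, here is a minimal sketch of writing and reading a file through the HDFS Java API (org.apache.hadoop.fs.FileSystem). The namenode address is a placeholder; only metadata operations go through the namenode, while the data blocks themselves are streamed to and from datanodes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // placeholder namenode address

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/hust/soict/hello.txt");

        // Write: the namenode records metadata, datanodes store the replicated blocks
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello HDFS");
        }

        // Read the file back
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}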
HDFS main design principles
• I/O pattern
• Append only → reduces synchronization
• Data distribution
• Files are split into big chunks (64 MB)
→ reduces metadata size
→ reduces network communication
• Data replication
• Each chunk is usually replicated on 3 different nodes
• Fault tolerance
• Datanode failure: re-replication
• Namenode failure
• Secondary namenode
• Query datanodes instead of a complex checkpointing scheme
Data processing: MapReduce
• The MapReduce framework is Hadoop's default data processing engine
• MapReduce is a programming model for data processing
• It is not a language, but a style of processing data created by Google
• The beauty of MapReduce
• Simplicity
• Flexibility
• Scalability
A MapReduce job = {isolated tasks}ⁿ
• MapReduce divides the workload into multiple independent tasks and schedules them across cluster nodes
• The work performed by each task is done in isolation from the other tasks, for scalability reasons
• The communication overhead required to keep the data on the nodes synchronized at all times would prevent the model from performing reliably and efficiently at large scale
Data Distribution
• In a MapReduce cluster, data is usually managed by a distributed file system (e.g., HDFS)
• Move code to data and not data to code
Keys and Values
• In MapReduce, the programmer has to specify two functions, the map function and the reduce function, which implement the Mapper and the Reducer of a MapReduce program
• In MapReduce, data elements are always structured as key-value (i.e., (K, V)) pairs
• The map and reduce functions receive and emit (K, V) pairs
(Figure: input splits → intermediate outputs → final outputs)
Map phase
• Hadoop splits a job into many individual map tasks
• The number of map tasks is determined by the amount of input data
• Each map task receives a portion of the overall job input to process
• Mappers process one input record at a time
• For each input record, they emit zero or more records as output
• In this example, the map task simply parses the input record
• and then emits the name and price fields of each record as output (see the sketch below)
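A minimal sketch of such a map task in the Hadoop Java API, assuming each input line is a comma-separated record of the form "name,price" (the exact record layout used in the slides' figure is not shown here, so this format is an assumption):

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class PriceMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    private final Text name = new Text();
    private final DoubleWritable price = new DoubleWritable();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split(",");
        if (fields.length >= 2) {                 // emit zero records for malformed input
            name.set(fields[0].trim());
            price.set(Double.parseDouble(fields[1].trim()));
            context.write(name, price);           // emit (name, price)
        }
    }
}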
Shuffle & sort
• Hadoop automatically sorts and merges output from all
map tasks
• This intermediate process is known as the shuffle and sort
• The result is supplied to reduce tasks
Reduce phase
• Reducer input comes from the shuffle and sort process
• As with map, the reduce function receives one record at a time
• A given reducer receives all records for a given key
• For each input record, reduce can emit zero or more output records
• Our reduce function simply sums the total per person
• and emits the employee name (key) and total (value) as output (see the sketch below)
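A matching sketch of the reduce task: for each employee name it receives all of that employee's prices from the shuffle and sort, sums them, and emits (name, total). The class and field names are illustrative.

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class TotalPerPersonReducer
        extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {

    private final DoubleWritable total = new DoubleWritable();

    @Override
    protected void reduce(Text name, Iterable<DoubleWritable> prices, Context context)
            throws IOException, InterruptedException {
        double sum = 0.0;
        for (DoubleWritable p : prices) {   // all values for this key arrive together
            sum += p.get();
        }
        total.set(sum);
        context.write(name, total);          // emit (employee name, total)
    }
}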
Data flow for the entire MapReduce job
Word Count Dataflow
MapReduce - Dataflow
MapReduce life cycle
Example: Word Count (1)
Example: Word Count (2)
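The word count slides are shown as figures; for reference, a minimal, self-contained version of the classic word count job in the org.apache.hadoop.mapreduce API looks like this (input and output paths are passed on the command line):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);       // emit (word, 1)
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);         // emit (word, count)
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input dir in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output dir in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}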
Hadoop ecosystem
• Many related tools integrate with Hadoop
• Data analysis
• Database integration
• Workflow management
• These are not considered ‘core Hadoop’
• Rather, they are part of the ‘Hadoop ecosystem’
• Many are also open source Apache projects
Apache Pig
• Apache Pig builds on Hadoop to offer high level data processing
• Pig is especially good at joining and transforming data
• The Pig interpreter runs on the client machine
• Turns PigLatin scripts into MapReduce jobs
• Submits those jobs to the cluster
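A minimal sketch of driving Pig from Java through PigServer; the input path, field layout, and output path are hypothetical, and the Pig Latin statements mirror the name/price example used earlier:

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigExample {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.MAPREDUCE);

        // Pig Latin script: load, transform, and store; the interpreter turns
        // these statements into one or more MapReduce jobs.
        pig.registerQuery("sales = LOAD '/data/sales' USING PigStorage(',') "
                + "AS (name:chararray, price:double);");
        pig.registerQuery("totals = FOREACH (GROUP sales BY name) "
                + "GENERATE group AS name, SUM(sales.price) AS total;");
        pig.store("totals", "/data/sales_totals");

        pig.shutdown();
    }
}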
Apache Hive
• Another abstraction on top of MapReduce
• Reduce development time
• HiveQL: SQL-like language
• The Hive interpreter runs on the client machine
• Turns HiveQL scripts into MapReduce jobs
• Submits those jobs to the cluster
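A minimal sketch of submitting HiveQL from a Java client through the HiveServer2 JDBC driver; the server address, credentials, and the sales table are hypothetical:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // HiveServer2 typically listens on port 10000.
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver:10000/default", "user", "");
             Statement stmt = conn.createStatement()) {

            // HiveQL looks like SQL; the Hive engine compiles it into MapReduce jobs.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT name, SUM(price) AS total FROM sales GROUP BY name")) {
                while (rs.next()) {
                    System.out.println(rs.getString("name") + "\t" + rs.getDouble("total"));
                }
            }
        }
    }
}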
Apache HBase
• HBase is a distributed column-oriented data store built on top of HDFS
• It is often considered the Hadoop database
• Data is logically organized into tables, rows, and columns
• Tables can hold terabytes, even petabytes, of data
• Tables can have many thousands of columns
• Scales to provide very high write throughput
• Hundreds of thousands of inserts per second
• Fairly primitive when compared to an RDBMS
• NoSQL: there is no high-level query language
• Use the API to scan / get / put values based on keys (see the sketch below)
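A minimal sketch of the HBase client API described in the last bullet (put and get by row key); the table name, column family, and ZooKeeper quorum are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3");  // hypothetical quorum

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("employees"))) {

            // Put a value: row key -> column family:qualifier -> value
            Put put = new Put(Bytes.toBytes("alice"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("total"), Bytes.toBytes("123.45"));
            table.put(put);

            // Get it back by row key
            Result result = table.get(new Get(Bytes.toBytes("alice")));
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("total"));
            System.out.println(Bytes.toString(value));
        }
    }
}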
Apache Sqoop
• Sqoop is a tool designed for efficiently
transferring bulk data between Apache
Hadoop and structured datastores such
as relational databases.
• It can import all tables, a single table, or
a portion of a table into HDFS
• Via a map-only MapReduce job
• The result is a directory in HDFS containing comma-delimited text files
• Sqoop can also export data from HDFS
back to the database
Apache Kafka
• Kafka decouples data streams
• Producers don't know about consumers
• Flexible message consumption
• The Kafka broker delegates the log partition offset (location) to consumers (clients)
(Figure: producers publish to a Kafka broker cluster coordinated by ZooKeeper; consumers read from it and track their own offsets)
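A minimal sketch of a producer and a consumer using the kafka-clients Java library; the broker address, topic name, and consumer group id are placeholders:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaExample {
    public static void main(String[] args) {
        // Producer: publishes to a topic without knowing who will consume it.
        Properties p = new Properties();
        p.put("bootstrap.servers", "broker1:9092");
        p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
            producer.send(new ProducerRecord<>("sales", "alice", "123.45"));
        }

        // Consumer: tracks its own offset in each log partition.
        Properties c = new Properties();
        c.put("bootstrap.servers", "broker1:9092");
        c.put("group.id", "sales-readers");
        c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c)) {
            consumer.subscribe(Collections.singletonList("sales"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                System.out.println(record.key() + " -> " + record.value()
                        + " @ offset " + record.offset());
            }
        }
    }
}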
Apache Oozie
• Oozie is a workflow scheduler system to manage
Apache Hadoop jobs.
• Oozie workflow jobs are Directed Acyclic Graphs (DAGs) of actions.
• Oozie supports many workflow actions, including
• Executing MapReduce jobs
• Running Pig or Hive scripts
• Executing standard Java or shell programs
• Manipulating data via HDFS commands
• Running remote commands with SSH
• Sending e-mail messages
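A minimal sketch of submitting a workflow with the Oozie Java client (org.apache.oozie.client.OozieClient); the Oozie server URL, the HDFS path of the workflow application, and the job properties are hypothetical:

import java.util.Properties;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class OozieExample {
    public static void main(String[] args) throws Exception {
        OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

        // Job properties: where the workflow.xml (the DAG of actions) lives in HDFS.
        Properties conf = oozie.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:9000/user/hust/my-wf");
        conf.setProperty("nameNode", "hdfs://namenode:9000");
        conf.setProperty("resourceManager", "resourcemanager:8032");

        String jobId = oozie.run(conf);   // submit and start the workflow
        System.out.println("Workflow job submitted: " + jobId);

        WorkflowJob job = oozie.getJobInfo(jobId);
        System.out.println("Status: " + job.getStatus());
    }
}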
Apache Zookeeper
• Apache ZooKeeper is a highly reliable
distributed coordination service
• Group membership
• Leader election
• Dynamic Configuration
• Status monitoring
• All of these kinds of services are used in some
form or another by distributed applications
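A minimal sketch of group membership with the ZooKeeper Java client: each process registers an ephemeral znode under a group path, and any client can list the current members. The connect string and paths are placeholders:

import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperExample {
    public static void main(String[] args) throws Exception {
        // Connect to the ZooKeeper ensemble (watcher omitted for brevity).
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 3000, event -> { });

        // Ensure the group node exists (persistent).
        if (zk.exists("/workers", false) == null) {
            zk.create("/workers", new byte[0],
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Register this process as a member: an ephemeral node disappears
        // automatically if the session dies, which is what makes failure
        // detection and leader election possible.
        zk.create("/workers/worker-1", new byte[0],
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // List current group members.
        List<String> members = zk.getChildren("/workers", false);
        System.out.println("Current members: " + members);

        zk.close();
    }
}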
PAXOS algorithm
https://www.youtube.com/watch?v=d7nAGI_NZPk
YARN – Yet Another Resource Negotiator
• Nodes have "resources" – memory and CPU cores – which are allocated to applications when requested
• Moving beyond MapReduce
• MR and non-MR applications running on the same cluster
• Most jobtracker functions moved to per-application masters
(Figure: in Hadoop 1.0, MapReduce handles both cluster resource management and data processing on top of HDFS; in Hadoop 2.0, YARN takes over cluster resource management, with HDFS providing redundant, reliable storage underneath)
YARN execution
Big data platform: Hadoop ecosystem
Hortonworks Data Platform Sandbox
Demo
Big data management
References
• White, Tom. Hadoop: The Definitive Guide. O'Reilly Media, Inc., 2012.
• Borthakur, Dhruba. "HDFS Architecture Guide." Hadoop Apache Project 53.1-13 (2008): 2.
• Ghemawat, Sanjay, Howard Gobioff, and Shun-Tak Leung. "The Google File System." Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles. 2003.
• Hunt, Patrick, et al. "ZooKeeper: Wait-free Coordination for Internet-scale Systems." USENIX Annual Technical Conference. Vol. 8. No. 9. 2010.
• Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: Simplified Data Processing on Large Clusters." Communications of the ACM 51.1 (2008): 107-113.
Thank you for your attention!