
CS5412 / Lecture 22: Apache Architecture

Ken Birman, Kishore Pusukuri

CS5412, Fall 2022 1


Apache: A Big Data Architecture
[Layered diagram] Applications layer: Batch Processing | Analytical SQL | Stream Processing | Machine Learning | Other Applications
Resource Manager (Workload Manager, Task Scheduler, etc.)
Data Storage (File Systems, Database, etc.)
Data Ingestion Systems feed the storage layer from the side.

Popular Big Data systems: Apache Hadoop, Apache Spark

CS5412, Fall 2022 2


Actual Apache Tool Names

[Layered diagram] Applications layer: Hadoop MapReduce | Hive | Pig | Other Applications
Resource manager: Yet Another Resource Negotiator (YARN)
Storage: Hadoop NoSQL Database (HBase) over the Hadoop Distributed File System (HDFS)
Data ingest systems (e.g., Apache Kafka, Flume, etc.) feed the cluster.
CS5412, Fall 2022 3
Apache Zookeeper
ZooKeeper manages small files holding configuration information for your application.
It automatically tracks IP addresses of application components, and their health status.
It can also hold small shared values, such as the step count for an iterative calculation.
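
For example, a component might store and read such a value as a znode using the standard ZooKeeper Java client. A minimal sketch, with the ensemble address, znode path, and payload all invented:

import org.apache.zookeeper.*;

public class ZkConfigSketch {
    public static void main(String[] args) throws Exception {
        // Connect to the ZooKeeper ensemble (address is invented).
        ZooKeeper zk = new ZooKeeper("zk1:2181", 3000, event -> {});

        String path = "/app-step-count";   // invented znode name (a "small file")
        byte[] value = "42".getBytes();

        // Create the znode if absent, otherwise overwrite its contents.
        if (zk.exists(path, false) == null) {
            zk.create(path, value, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        } else {
            zk.setData(path, value, -1);   // -1 skips the version check
        }

        // Any other component can read the shared value back.
        System.out.println(new String(zk.getData(path, false, null)));
        zk.close();
    }
}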

CS5412, Fall 2022 4


The ZooKeeper Service
Each µ-service has a leader that talks to ZooKeeper. Its other nodes have connections too, but just passive ones (for fault detection). ZooKeeper is itself an interesting distributed system:

The ZooKeeper service is replicated over a set of machines, usually 5 or 7.
All machines store a copy of the data in memory (!). It is checkpointed to disk if you wish, but at a limited pace (once per 5 seconds).
A leader is elected on service startup.
Clients connect to a single ZooKeeper server and maintain a TCP connection.
A client can read from any ZooKeeper server.
Writes go through the leader and employ virtual synchrony atomic multicast with majority consensus on membership.

https://cwiki.apache.org/confluence/display/ZOOKEEPER/ProjectDescription CS5412, Fall 2022 5


Hadoop Distributed File System (HDFS)

HDFS is similar to Ceph, so we won’t do a deep dive on it.


Supports append-only updates, whole-file delete/create/replace.

Offers a form of “checkpoint”


 It records file versions and lengths.
 Rollback is done by truncating files to earlier lengths.
 HDFS doesn’t retain old versions, so it can’t undo a file delete/replace.
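
A sketch of these operations through Hadoop's FileSystem Java API (the path, contents, and checkpointed length are illustrative; truncate exists in Hadoop 2.7+, and append must be enabled on the cluster):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsAppendTruncateSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path p = new Path("/logs/events.log");        // invented path

        // Whole-file create, replacing any previous version.
        try (FSDataOutputStream out = fs.create(p, true)) {
            out.writeBytes("record 1\n");
        }
        // Append-only update: existing bytes never change, we only add more.
        try (FSDataOutputStream out = fs.append(p)) {
            out.writeBytes("record 2\n");
        }
        // "Rollback" to a checkpoint = truncate back to a length recorded earlier.
        long checkpointedLength = 9;                  // length of "record 1\n"
        fs.truncate(p, checkpointedLength);
    }
}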
CS5412, Fall 2022 6
Hadoop Database (HBase) is a thin layer directly over HDFS
HBase is used like a NoSQL database.

It maps directly to HDFS.

It holds relations (tables).

It supports large amounts of data and high throughput.


CS5412, Fall 2022 7
HBase: Data Model (1)

CS5412, Fall 2022 8


HBase: Data Model (2)

• Sorted rows: supports billions of rows
• Columns: supports millions of columns
• Cell: intersection of row and column
 Can have multiple values (which are time-stamped)
 Can be empty; empty cells incur no storage or processing overhead

CS5412, Fall 2022 9
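
A hedged sketch of the data model just described, through the HBase Java client -- one timestamped cell written and read back; the table, column family, and qualifier names are invented:

import org.apache.hadoop.hbase.*;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseCellSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // Write one cell: (row key, column family, qualifier) -> timestamped value.
            Put put = new Put(Bytes.toBytes("row-001"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"),
                          Bytes.toBytes("roy@example.com"));
            table.put(put);

            // Read it back; empty cells simply don't appear in the Result.
            Result r = table.get(new Get(Bytes.toBytes("row-001")));
            byte[] email = r.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"));
            System.out.println(Bytes.toString(email));
        }
    }
}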


HBase: Table

CS5412, Fall 2022 10


HBase: Horizontal Splits (Regions)

CS5412, Fall 2022 11


HBase Architecture

CS5412, Fall 2022 12


HBase Architecture: Column Family (1)

CS5412, Fall 2022 13


HBase Architecture: Column Family (2)

CS5412, Fall 2022 14


HBase Architecture (1)
HBase is composed of three types of servers in a leader/worker type of architecture: Region Server, HBase Master, ZooKeeper.
Region Server:
• Clients communicate with Region Servers (the workers) directly for accessing data
• Serves data for reads and writes
• Region servers are assigned to the HDFS data nodes to preserve data locality
[Diagram: leader servers coordinating worker servers]
CS5412, Fall 2022 15
HBase Architecture (2)

HBase Leader (HMaster): coordinates region servers and handles DDL operations (create table, delete table, etc.).
ZooKeeper: HBase uses ZooKeeper as a distributed coordination service to maintain server state in the cluster.

CS5412, Fall 2022 16


HBase uses ZooKeeper as its coordinator
Maintains region server state in the cluster
Provides server failure notification
Uses consensus to guarantee common shared state

CS5412, Fall 2022 17


HBase vs HDFS
HBase is a way of “talking to” HDFS. We use it for massive tables that wouldn’t fit into a single HDFS file.

HBase:
• Stores data as key-value objects in column families. Records in HBase are stored according to the rowkey, and sequential search is common.
• Provides low-latency access to small amounts of data from within a large data set.
• Provides a flexible data model.

HDFS:
• Stores data as flat files.
• Optimized for streaming access of large files -- doesn’t support random read/write.
• Follows a write-once read-many model.
• Supports log-style files (append-only).

CS5412, Fall 2022 18


Hadoop Resource Management

Yet Another Resource Negotiator (YARN)


➢ YARN is a core component of Hadoop; it manages all the resources of a Hadoop cluster (CPUs, memory, GPUs, network connections, etc.).
➢ Using selectable criteria such as fairness, it effectively allocates the resources of the Hadoop cluster to multiple data processing jobs:
○ Batch jobs (e.g., MapReduce, Spark)
○ Streaming jobs (e.g., Spark streaming)
○ Analytics jobs (e.g., Impala, Spark)

CS5412, Fall 2022 19


Hadoop Ecosystem (Resource Manager)

[Layered diagram] Applications layer: Hadoop MapReduce | Hive | Pig | Spark Stream | Other Applications
Resource manager: Yet Another Resource Negotiator (YARN) -- YARN decides where the steps in your job should run
Storage: Hadoop Distributed File System (HDFS) and Hadoop NoSQL Database (HBase)
CS5412, Fall 2022 20


YARN Concepts (1)
➢ YARN focuses on a generalized concept based on a virtual machine container. In YARN, a container is an abstraction for managing resources -- a unit of computation on a resource node, i.e., a certain amount of CPU, memory, disk, and even ASIC resources. It is tied to the Mesos container model.
➢ A single job may run in one or more containers -- a set of containers would be used to encapsulate highly parallel Hadoop jobs.
➢ The main goal of YARN is effectively allocating containers to multiple data processing jobs.
➢ YARN competes with Kubernetes (the Docker container manager) but covers cases that don’t involve executable VMs. Kubernetes is focused on Docker.

CS5412, Fall 2022 21


YARN Concepts (2)
Three main components of YARN: Application Master, Node Manager, and Resource Manager (a.k.a. the YARN daemon processes).
➢ Application Master:
○ Single instance per job.
○ Spawned within a container when a new job is submitted by a client.
○ Requests additional containers for handling any sub-tasks.
➢ Node Manager: Single instance per worker node. Responsible for monitoring and reporting on local container status (all containers on the worker node).

CS5412, Fall 2022 22


YARN Concepts (3)
Three main components of YARN: Application Master, Node Manager, and Resource Manager (a.k.a. the YARN daemon processes).
➢ Resource Manager: arbitrates system resources between competing jobs. It has two main components:
○ Scheduler (global scheduler): responsible for allocating resources to the jobs subject to familiar constraints of capacities, queues, etc.
○ Application Manager: responsible for accepting job submissions and for restarting the ApplicationMaster container on failure.
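
For a flavor of the client side of this flow, here is a hedged sketch using the YarnClient API; the application name, queue, and resource sizes are invented, and a real submission would also have to set the ApplicationMaster's launch context (command, jars, environment):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;

public class YarnSubmitSketch {
    public static void main(String[] args) throws Exception {
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new Configuration());
        yarn.start();

        // Ask the Resource Manager for a new application id.
        YarnClientApplication app = yarn.createApplication();
        ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
        ctx.setApplicationName("demo-job");             // invented name
        ctx.setQueue("default");                        // scheduler queue
        ctx.setResource(Resource.newInstance(1024, 1)); // AM container: 1 GB, 1 vcore
        // A real client would also set the AM's ContainerLaunchContext here.

        yarn.submitApplication(ctx);   // the Application Manager accepts the job
        yarn.stop();
    }
}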

CS5412, Fall 2022 23


YARN Concepts (4)

How do the
components of YARN
work together?

Image source: http://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-site/YARN.html


CS5412, Fall 2022 24
Hadoop Ecosystem (Processing Layer)

[Layered diagram] Processing layer: MapReduce | Hive | Pig | Spark Stream | Other Applications
Resource manager: Yet Another Resource Negotiator (YARN)
Storage: Hadoop Distributed File System (HDFS) and Hadoop NoSQL Database (HBase)
Totally unrelated tools run on the same machines. YARN needs to decide where to schedule each task.
CS5412, Fall 2022 25


Recall: MapReduce generates lots of tasks as it runs!

[Diagram: intermediate data fanned out across parallel Reduce tasks]
Intermediate data → Reducer output → Result:
aardvark 1 → aardvark 1
cat 1 → cat 1
mat 1 → mat 1
on 1,1 → on 2
sat 1,1 → sat 2
sofa 1 → sofa 1
the 1,1,1,1 → the 4
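
Each Reduce task above sums the per-word counts it receives. In the standard Hadoop MapReduce Java API, the reducer for this word-count example looks roughly like this (job wiring omitted):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {          // e.g., "the" -> [1,1,1,1]
            sum += c.get();
        }
        ctx.write(word, new IntWritable(sum));  // e.g., ("the", 4)
    }
}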
CS5412, Fall 2022 26
Example YARN challenge

YARN has to decide which machines to schedule each map and reduce step on. Some have RDDs cached for reuse.

Those same machines are in demand by other Apache tools as well, such as Pig and Hive and even the HDFS meta-data service.

Some tasks might have special needs, like “a node with 8 GPUs” or “at least 20 GB of RAM.”
CS5412, Fall 2022 27
How does YARN do this?

YARN loops, collecting a batch of tasks that need to be scheduled.

For each batch it runs a form of constrained optimization. First, it identifies tasks with special needs (for example, a task that needs two GPUs can only run on a node with two GPUs available).

Next it runs a form of min-cost max-flow algorithm to assign tasks to available nodes in a best-fit manner, like bin packing.
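
To make the best-fit step concrete, here is a toy allocator in the same spirit -- an illustration of the bin-packing idea, not YARN's actual code; all task and node names are invented:

import java.util.*;

public class BestFitSketch {
    // Assign each task to the feasible node with the least leftover capacity.
    static Map<String, String> assign(Map<String, Integer> taskNeeds,
                                      Map<String, Integer> nodeFree) {
        Map<String, String> placement = new HashMap<>();
        for (var task : taskNeeds.entrySet()) {
            String best = null;
            for (var node : nodeFree.entrySet()) {
                if (node.getValue() >= task.getValue()                  // feasible
                        && (best == null || node.getValue() < nodeFree.get(best))) {
                    best = node.getKey();                               // tighter fit
                }
            }
            if (best != null) {
                placement.put(task.getKey(), best);
                nodeFree.merge(best, -task.getValue(), Integer::sum);   // consume capacity
            }
        }
        return placement;
    }

    public static void main(String[] args) {
        Map<String, Integer> tasks = new LinkedHashMap<>();
        tasks.put("map1", 4);
        tasks.put("reduce1", 8);
        Map<String, Integer> nodes = new HashMap<>();
        nodes.put("nodeA", 10);
        nodes.put("nodeB", 8);
        System.out.println(assign(tasks, nodes));  // {map1=nodeB, reduce1=nodeA}
    }
}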
CS5412, Fall 2022 28
Slight topic shift…
Away from YARN and focusing now on what the other tools do

CS5412, Fall 2022 29


Apache Hive: SQL on MapReduce
Hive is an abstraction layer on top of Hadoop (MapReduce/Spark)

Use Cases:

 Data Preparation
 Extraction-Transformation-Loading Jobs (Data Warehousing)
 Data Mining

CS5412, Fall 2022 30


Apache Hive: SQL on MapReduce
Hive is an abstraction layer on top of Hadoop (MapReduce/Spark)
➢ Hive uses a SQL-like language called HiveQL
➢ Facilitates reading, writing, and managing large datasets residing in distributed storage using SQL-like queries
➢ Hive executes queries using MapReduce (and also using Spark)
○ HiveQL queries → Hive → MapReduce Jobs
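
One common way to issue HiveQL from an application is JDBC against HiveServer2. A minimal sketch, assuming the hive-jdbc driver is on the classpath (host, table, and column names are invented):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
    public static void main(String[] args) throws Exception {
        // HiveServer2's default JDBC endpoint; the host is invented.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hiveserver:10000/default");
             Statement stmt = conn.createStatement();
             // Hive compiles this query into MapReduce (or Spark) jobs.
             ResultSet rs = stmt.executeQuery(
                 "SELECT page, COUNT(*) AS hits FROM clicks GROUP BY page")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}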

CS5412, Fall 2022 31


Apache Hive
➢ Structure is applied to data at read time → no need to worry about formatting the data when it is stored in the Hadoop cluster
➢ Data can be read using any of a variety of formats:
○ Unstructured flat files with comma or space-separated text
○ Semi-structured JSON files (a web standard for event-oriented data such
as news feeds, stock quotes, weather warnings, etc)
○ Structured HBase tables
➢ Hive is not designed for online transaction processing. Hive should be
used for “data warehousing” tasks, not arbitrary transactions.

CS5412, Fall 2022 32


Apache Pig: Scripting on MapReduce
Pig is an abstraction layer on top of Hadoop (MapReduce/Spark)

➢ Use Cases:
○ Data Preparation
○ ETL Jobs (Data Warehousing)
○ Data Mining

CS5412, Fall 2022 33


Apache Pig: Scripting on MapReduce
Pig is an abstraction layer on top of Hadoop (MapReduce/Spark)
➢ Code is written in Pig Latin “script” language (a data flow language)
➢ Facilitates reading, writing, and managing large datasets residing in
distributed storage
➢ Pig executes queries using MapReduce (and also using Spark)
○ Pig Latin scripts → Pig → MapReduce Jobs
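
Pig can also be driven from Java through its embedded PigServer API. A minimal sketch (file and field names invented; "local" mode runs on one machine, "mapreduce" would target the cluster):

import org.apache.pig.PigServer;

public class PigEmbedSketch {
    public static void main(String[] args) throws Exception {
        // "local" runs on this machine; "mapreduce" would target the cluster.
        PigServer pig = new PigServer("local");
        pig.registerQuery("clicks = LOAD 'clicks' AS (user, url, value:int);");
        pig.registerQuery("good = FILTER clicks BY value > 0;");
        pig.store("good", "valuable-clicks");  // storing triggers job execution
    }
}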

CS5412, Fall 2022 34


Apache Hive & Apache Pig
➢ Instead of writing Java code to implement MapReduce, one can opt between Pig Latin and Hive SQL to construct MapReduce programs
➢ Far fewer lines of code compared to MapReduce, which reduces the overall development and testing time

CS5412, Fall 2022 35


Apache Hive vs Apache Pig
Hive:
➢ Declarative SQL-like language (HiveQL)
➢ Operates on the server side of any cluster
➢ Better for structured data
➢ Easy to use, specifically for generating reports
➢ Data warehousing tasks
➢ Used at Facebook

Pig:
➢ Procedural data flow language (Pig Latin)
➢ Runs on the client side of any cluster
➢ Best for semi-structured data
➢ Better for creating data pipelines
○ allows developers to decide where to checkpoint data in the pipeline
➢ Incremental changes to large data sets, and also better for streaming
➢ Used at Yahoo

CS5412, Fall 2022 36


Apache Hive vs Apache Pig: example
Job: data from sources users and clicks is to be joined and filtered, then joined to data from a third source geoinfo, aggregated, and finally stored into a table ValuableClicksPerDMA.

Hive (HiveQL):
insert into ValuableClicksPerDMA
select dma, count(*)
from geoinfo join (
  select name, ipaddr
  from users join clicks on (users.name = clicks.user)
  where value > 0
) using ipaddr
group by dma;

Pig (Pig Latin):
Users = load 'users' as (name, age, ipaddr);
Clicks = load 'clicks' as (user, url, value);
ValuableClicks = filter Clicks by value > 0;
UserClicks = join Users by name, ValuableClicks by user;
Geoinfo = load 'geoinfo' as (ipaddr, dma);
UserGeo = join UserClicks by ipaddr, Geoinfo by ipaddr;
ByDMA = group UserGeo by dma;
ValuableClicksPerDMA = foreach ByDMA generate group, COUNT(UserGeo);
store ValuableClicksPerDMA into 'ValuableClicksPerDMA';

CS5412, Fall 2022 37


Data Ingestion Systems/Tools
➢ Apache Sqoop
○ High speed import to HDFS from Relational Database (and vice versa)
○ Supports many database systems,
e.g. Mongo, MySQL, Teradata, Oracle

➢ Apache Flume
○ Distributed service for ingesting streaming data
○ Ideally suited for event data from multiple systems, for example, log files

CS5412, Fall 2022 38


Apache Kafka
➢ Functions like a distributed publish-subscribe messaging system (or a distributed streaming platform)
○ A high-throughput, scalable messaging system
○ Distributed, reliable publish-subscribe system
○ Designed as a message queue and implemented as a distributed log service

➢ Originally developed by LinkedIn, now widely popular
➢ Features: durability, scalability, high availability, high throughput
➢ Check out the awesome Kafka “intro” video here.

CS5412, Fall 2022 39


What is Apache Kafka used for? (1)
➢ The original use case (@LinkedIn):
○ To track user behavior on websites.
○ Site activity (page views, searches, or other actions users might take) is
published to central topics, with one topic per activity type.

➢ Effective for two broad classes of applications:


○ Building real-time streaming data pipelines that reliably get data between
systems or applications
○ Building real-time streaming applications that transform or react to the
streams of data
CS5412, Fall 2022 40
What is Apache Kafka used for? (2)
➢ Lets you publish and subscribe to streams of records, similar to a
message queue or enterprise messaging system
➢ Lets you store streams of records in a fault-tolerant way
➢ Lets you process streams of records as they occur
➢ Lets you have both offline and online message consumption

CS5412, Fall 2022 41


Apache Kafka: Fundamentals
➢ Kafka is run as a cluster on one or more servers
➢ The Kafka cluster stores streams of records in categories called topics
➢ Each record (or message) consists of a key, a value, and a timestamp

➢ Point-to-Point: Messages persisted in a queue, a particular message is


consumed by a maximum of one consumer only
➢ Publish-Subscribe: Messages are persisted in a topic, consumers can
subscribe to one or more topics and consume all the messages in that topic

CS5412, Fall 2022 42


Apache Kafka: Components
Logical components:
➢ Topic: the named destination to which records are published
➢ Partition: one topic can have multiple partitions; a partition is the unit of parallelism
➢ Record or Message: key/value pair (+ timestamp)

Physical components:
➢ Producer: sends messages to brokers
➢ Consumer: receives messages from brokers
➢ Broker: one node of the Kafka cluster
➢ ZooKeeper: coordinates the Kafka cluster and consumer groups
CS5412, Fall 2022 43
Apache Kafka: Topics & Partitions (1)
➢ A stream of messages belonging to a particular category is called a
topic (or a feed name to which records are published)
➢ Data is stored in topics.
➢ Topics in Kafka are always multi-subscriber -- a topic can have zero, one, or many consumers that subscribe to the data written to it
➢ Topics are split into partitions. A topic may have many partitions, so it can handle an arbitrary amount of data

CS5412, Fall 2022 44


Apache Kafka: Topics & Partitions (2)
➢ For each topic, the Kafka cluster maintains a partitioned log.
➢ Each partition is an ordered, immutable sequence of records that is continually appended to -- a structured commit log.
➢ Partition offset: the records in the partitions are each assigned a sequential id number called the offset that uniquely identifies each record within the partition.

CS5412, Fall 2022 45


Apache Kafka: Topics & Partitions (3)
➢ The only metadata retained on a per-
consumer basis is the offset or
position of that consumer in the log.
➢ This offset is controlled by the
consumer -- normally a consumer will
advance its offset linearly as it reads
records (but it can also consume
records in any order it likes)

CS5412, Fall 2022 46


Apache Kafka: Topics & Partitions (4)
The partitions in the log serve several purposes:
➢ They allow the log to scale beyond a size that will fit on a single server.
➢ They let a topic handle an arbitrary amount of data -- a topic may have many partitions.
➢ They act as the unit of parallelism.

CS5412, Fall 2022 47


Apache Kafka: Distribution of Partitions (2)

Here, a topic is configured into three partitions.
Partition 1 has two offsets, 0 and 1. Partition 2 has four offsets: 0, 1, 2, and 3. Partition 3 has one offset, 0.
The id of the replica is the same as the id of the server that hosts it.
CS5412, Fall 2022 48
Apache Kafka: Producers
➢ Producers publish data to the topics of their choice.
➢ The producer is responsible for choosing which record to assign to which partition within the topic.
➢ Records can be assigned to partitions in a round-robin fashion simply to balance load, or according to some semantic partition function.
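
A minimal producer sketch with the Kafka Java client; the broker address and topic are invented. Giving a record a key routes it through the default hash partitioner instead of round-robin:

import java.util.Properties;
import org.apache.kafka.clients.producer.*;

public class ProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");  // invented broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // Keyed record: the default partitioner hashes the key to pick a partition.
            producer.send(new ProducerRecord<>("page-views", "user42", "/index.html"));
            // Unkeyed record: the partitioner balances load across partitions.
            producer.send(new ProducerRecord<>("page-views", "/about.html"));
        }
    }
}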

CS5412, Fall 2022 50


Apache Kafka: Consumers
➢ Consumer group: balances consumers across partitions
➢ Consumers label themselves with a consumer group name
➢ Each record published to a topic is delivered to one consumer instance within each subscribing consumer group
➢ If all the consumer instances have the same consumer group, then the records will effectively be load-balanced over the consumer instances
➢ If all the consumer instances have different consumer groups, then each record will be broadcast to all the consumer processes
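
A minimal consumer sketch; every consumer that shares the group.id below divides the topic's partitions among itself and its peers (broker, group, and topic names are invented):

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.*;

public class ConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");   // invented broker
        props.put("group.id", "analytics");               // consumer group name
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("page-views"));
            while (true) {
                // Each record goes to exactly one consumer in the "analytics" group.
                for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofSeconds(1))) {
                    System.out.printf("p%d@%d: %s=%s%n",
                            rec.partition(), rec.offset(), rec.key(), rec.value());
                }
            }
        }
    }
}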
CS5412, Fall 2022 51
Apache Kafka: Producers & Consumers

Example:
A two server Kafka cluster hosting four
partitions (P0 to P3) with two consumer
groups (A & B). Consumer group A has
two consumer instances (C1 & C2) and
group B has four (C3 to C6).

CS5412, Fall 2022 52


Apache Kafka: Design Guarantees (1)

➢ Records (or messages) sent by a producer to a particular topic partition will be appended in the order they are sent.
➢ A consumer instance sees records in the order they are stored in the log.
➢ For a topic with replication factor N, we will tolerate up to N-1 server failures without losing any records committed to the log.

CS5412, Fall 2022 53


Apache Kafka: Design Guarantees (2)
Message Delivery Semantics:
➢ At most once: Messages may be lost but are never redelivered.
➢ At least once: Messages are never lost but may be redelivered.
➢ Exactly once: Each message is delivered once and only once
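
Which of these a consumer gets depends largely on when it commits its offsets. A sketch, reusing the consumer from the earlier sketch with enable.auto.commit=false:

import java.time.Duration;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class DeliverySemanticsSketch {
    // At-least-once: handle first, commit after. A crash between the two
    // replays the batch, so records may be redelivered but are never lost.
    static void atLeastOnceLoop(KafkaConsumer<String, String> consumer) {
        // assumes enable.auto.commit=false in the consumer's Properties
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> rec : records) {
                handle(rec);               // hypothetical application logic
            }
            consumer.commitSync();         // commit only after handling succeeded
        }
    }

    // At-most-once would call commitSync() before handling, so a crash can
    // drop the batch but never redelivers it. Exactly-once needs more
    // machinery (Kafka transactions / idempotent producers).
    static void handle(ConsumerRecord<String, String> rec) { /* ... */ }
}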

CS5412, Fall 2022 54


Apache Kafka: Four Core APIs (1)
Producer API: Allows an application to publish a
stream of records to one or more Kafka topics
Consumer API: Allows an application to
subscribe to one or more topics and process the
stream of records produced to them
Streams API: Allows an application to act as a
stream processor -- consuming an input stream
from one or more topics and producing an output
stream to one or more output topics
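
A tiny Streams topology sketch -- consume one topic, transform each record, produce to another; the application id and topic names are invented:

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class StreamsSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-app");   // invented id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // invented broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("raw-events")      // consume input topic
               .mapValues(v -> v.toUpperCase())           // transform each record
               .to("clean-events");                       // produce to output topic
        new KafkaStreams(builder.build(), props).start();
    }
}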

CS5412, Fall 2022 55


Apache Kafka: Four Core APIs (2)
Connector API:
Allows building and running producers or
consumers that connect Kafka topics to existing
applications or data systems.
For example, a connector to a relational
database might capture every change to a table.

CS5412, Fall 2022 56


Summary

Apache ecosystem: A comprehensive big-data framework

All open source, very standard architecture at all levels

Widely popular, but not always blindingly fast. Understanding the intended styles of use is important for good performance.

CS5412, Fall 2022 57
