HADOOP
[Figure: Hadoop ecosystem diagram. Labels: Data Storage (File Systems, Database, etc.), Data Ingestion Systems, Hadoop, MapReduce, Hive, Pig, Other Applications accessing data]
• Region servers (worker servers) serve data for reads and writes.
• These region servers are co-located with the HDFS data nodes to preserve data locality.
CS5412, Fall 2022 15
HBase Architecture (2)
HBase
• Stores data as key-value objects in column families. Records in HBase are stored according to the rowkey, and sequential search is common
• Provides low-latency access to small amounts of data from within a large data set
• Provides a flexible data model
HDFS
• Stores data as flat files
• Optimized for streaming access of large files; doesn’t support random read/write
• Follows the write-once, read-many model
• Supports log-style files (append-only)
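To make the rowkey ordering concrete, here is a minimal in-memory sketch of a table whose rows are kept sorted by rowkey, so that a point lookup and a sequential range scan are both cheap. This is only an illustration of the idea, not HBase's storage engine or client API; the class `SortedRowStore`, its methods, and the sample rowkeys are all hypothetical.

```python
import bisect


class SortedRowStore:
    """Toy table that keeps rows ordered by rowkey (hypothetical, not the HBase API)."""

    def __init__(self):
        self._keys = []   # rowkeys, always kept in sorted order
        self._rows = {}   # rowkey -> {"family:qualifier": value}

    def put(self, rowkey, column, value):
        # Insert the rowkey at its sorted position the first time it is seen.
        if rowkey not in self._rows:
            bisect.insort(self._keys, rowkey)
            self._rows[rowkey] = {}
        self._rows[rowkey][column] = value

    def get(self, rowkey):
        # Point lookup of a single row by its key.
        return self._rows.get(rowkey)

    def scan(self, start, stop):
        # Sequential scan over start <= rowkey < stop, in rowkey order.
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, self._rows[key]


store = SortedRowStore()
store.put("user#002", "info:name", "Bob")
store.put("user#001", "info:name", "Alice")
store.put("video#001", "meta:title", "Intro")

# All rows whose key starts with "user#" ('$' is the character right after '#').
user_rows = list(store.scan("user#", "user$"))
```

Because related rows share a rowkey prefix, they sit next to each other and can be read with one sequential scan, which is why rowkey design matters so much in this model.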
How do the components of YARN work together?
Example YARN challenge
Some tasks might have special needs, like “a node with 8 GPUs” or “at least 20 GB of RAM.”
How does YARN do this?
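As a rough illustration of the matching problem, the sketch below places each request on the first node with enough free memory and GPUs. It is a toy first-fit model under assumed names (`Node`, `Request`, `place`, the worker names and sizes), not YARN's actual scheduler or API.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_memory_gb: int
    free_gpus: int


@dataclass
class Request:
    task: str
    memory_gb: int = 0
    gpus: int = 0


def place(request, nodes):
    """First-fit placement: reserve resources on the first node that fits, else return None."""
    for node in nodes:
        if node.free_memory_gb >= request.memory_gb and node.free_gpus >= request.gpus:
            node.free_memory_gb -= request.memory_gb
            node.free_gpus -= request.gpus
            return node
    return None


# Hypothetical cluster: one small node and one large GPU node.
nodes = [Node("worker-1", free_memory_gb=16, free_gpus=0),
         Node("worker-2", free_memory_gb=64, free_gpus=8)]

gpu_job = place(Request("train-model", gpus=8), nodes)        # needs 8 GPUs
ram_job = place(Request("big-join", memory_gb=20), nodes)     # needs 20 GB of RAM
cannot_fit = place(Request("second-gpu-job", gpus=8), nodes)  # no GPUs left anywhere
```

A real scheduler also has to deal with queues, fairness, and requests that must wait for resources to free up, all of which this sketch leaves out.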
➢ Use Cases:
○ Data Preparation
○ Extraction-Transformation-Loading (ETL) Jobs (Data Warehousing)
○ Data Mining
➢ Apache Flume
○ Distributed service for ingesting streaming data
○ Ideally suited for event data from multiple systems, for example, log files
Physical Components:
➢ Producer: sends messages to a broker
➢ Consumer: receives messages from a broker
➢ Broker: one node of the Kafka cluster
➢ ZooKeeper: coordinator of the Kafka cluster and consumer groups
Apache Kafka: Topics & Partitions (1)
➢ A stream of messages belonging to a particular category is called a
topic (or a feed name to which records are published)
➢ Data is stored in topics.
➢ Topics in Kafka are always multi-subscriber -- a topic can have
zero, one, or many consumers that subscribe to the data written to it
➢ Topics are split into partitions. A topic may have many partitions, so it can handle an arbitrary amount of data
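The topic and partition ideas above can be pictured as a set of append-only lists, one per partition, that any number of readers can read independently. This is a minimal in-memory sketch, not the Kafka client API; `ToyTopic`, its methods, and the topic name are hypothetical, and the integer position used by readers stands in for Kafka's per-partition offset.

```python
class ToyTopic:
    """Toy topic: a named collection of append-only partition logs (hypothetical)."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, partition, record):
        # Records are only ever added at the end of a partition's log.
        log = self.partitions[partition]
        log.append(record)
        return len(log) - 1  # position (offset) of the new record

    def read(self, partition, offset):
        # Reading does not remove anything, so many subscribers can read the same data.
        return self.partitions[partition][offset:]


topic = ToyTopic("page-views", num_partitions=2)
first = topic.append(0, "view:/home")
topic.append(1, "view:/cart")
second = topic.append(0, "view:/about")

# Two independent subscribers reading partition 0 from different positions.
reader_one = topic.read(0, 0)
reader_two = topic.read(0, 1)
```

Since each reader only tracks its own position, adding a subscriber costs the topic nothing, and adding partitions spreads the data over more logs.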
Apache Kafka: Producers
➢ Producers publish data to the topics of their choice.
➢ The producer is responsible for choosing which partition within the topic each record is assigned to.
➢ Record to Partition: this can be done in a round-robin fashion simply to balance load, or according to some semantic partition function (for example, based on a key in the record)
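The two assignment strategies above can be sketched as follows. This is an illustrative toy, not Kafka's built-in partitioner: `ToyPartitioner` is a hypothetical class, and CRC32 is used only as a convenient stable hash standing in for whatever hash a real client uses.

```python
import itertools
import zlib


class ToyPartitioner:
    """Chooses a partition for each record (hypothetical, not the Kafka API)."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self._round_robin = itertools.cycle(range(num_partitions))

    def partition(self, key=None):
        if key is None:
            # No key: rotate through the partitions to balance load.
            return next(self._round_robin)
        # Keyed record: a stable hash sends equal keys to the same partition.
        return zlib.crc32(key.encode("utf-8")) % self.num_partitions


partitioner = ToyPartitioner(num_partitions=4)
unkeyed = [partitioner.partition() for _ in range(6)]
keyed = [partitioner.partition("user-42") for _ in range(3)]
```

Round-robin spreads load evenly but gives no ordering guarantee per key, while the keyed variant keeps all records for one key in a single partition.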
Example:
A two-server Kafka cluster hosting four partitions (P0 to P3) with two consumer groups (A & B). Consumer group A has two consumer instances (C1 & C2) and group B has four (C3 to C6).
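The example above can be reproduced with a small sketch in which every consumer group receives all four partitions, and within a group each partition goes to exactly one consumer. The function `assign_partitions` is hypothetical and uses a simple round-robin split; the exact partition-to-consumer mapping in a real cluster depends on the configured assignment strategy and may differ from this one.

```python
def assign_partitions(partitions, consumers):
    """Spread one topic's partitions over the consumers of a single group."""
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(partitions):
        # Each partition is handed to exactly one consumer in the group.
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


partitions = ["P0", "P1", "P2", "P3"]
group_a = assign_partitions(partitions, ["C1", "C2"])
group_b = assign_partitions(partitions, ["C3", "C4", "C5", "C6"])
```

With two consumers, group A's members each handle two partitions; with four consumers, group B's members each handle one.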