Hadoop Platform & Services

The document outlines the objectives and system overview of a data engineering platform designed to solve data-related problems for businesses using an optimal technology stack. It details the architecture and components of the Hadoop ecosystem, including HDFS, YARN, and various data storage and processing technologies. Additionally, it discusses the design principles of distributed file systems, resource management, and the application lifecycle in Hadoop.

Data Engineering

Data & Compute Platform and Services ...


Objective
➔ Build a platform to address data-related problems for businesses

➔ Use an optimal technology stack

➔ Develop the right platforms based on requirements

➔ Optimize overall cost: usage-based cost optimization


System Overview

➔ Compute: ~50,000 applications running per day on average

➔ Storage: (~1.4 PB) Tiered Storage


● SSD for latency-critical processes (HBase)
● Magnetic disk for historical storage & scan-oriented processes
● Live data kept in hot storage with the aim of providing high throughput
● Older data is periodically moved to cold storage (decreased replication, lower throughput)
Technology Stack
➔ HADOOP (Distributed Data Warehouse)
◆ HDFS (Storage Layer)

◆ HBase (Distributed key-value Store)

◆ YARN (Compute Engine)

◆ Hive/Spark (Querying Engines)

◆ Oozie (Scheduling Engine)

➔ Kafka (Messaging Queue: Primary Source for Data Ingestion)

➔ Elasticsearch (ES), Druid (other data stores used for analytics and reporting)

➔ Hue/Zeppelin/Jupyter (Data Access Platforms)


Hadoop Distributed
File System (HDFS)
Distributed File System

▪ How to store data?

‣ Transaction-based systems vs. event-based systems

‣ Storage mechanism and scale:

■ Single node
■ Multi node (data partitioning, consistency, availability)
Distributed File System
▪ Unit of data that can be read or written?
‣ Folders, files, blocks ... ?

▪ What should be the optimal size of a block?
‣ Unit of access

▪ How to ensure data availability?
‣ Replicate blocks
‣ Separate data from metadata

                    Single Node Storage          Distributed Storage (e.g. HDFS)
Storage Unit        File system block            HDFS block
Block Size          4 KB                         128 MB
Data Availability   Using RAID                   Blocks replicated across nodes
Storage Policy      Reserves at least 1 block    Need not reserve a complete block
                    on disk (i.e. 4 KB)          for smaller data sizes
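As a quick illustration of the table above, a minimal sketch of the block arithmetic; the file sizes are invented for the example:

// Illustrative block arithmetic; file sizes here are invented for the example.
public class BlockMath {
    static final long HDFS_BLOCK = 128L * 1024 * 1024; // 128 MB

    public static void main(String[] args) {
        long oneGB = 1024L * 1024 * 1024;
        long blocks = (oneGB + HDFS_BLOCK - 1) / HDFS_BLOCK; // ceiling division
        System.out.println("1 GB file -> " + blocks + " HDFS blocks"); // 8
        // A 1 KB file still gets one block entry in metadata, but occupies
        // only ~1 KB on disk: HDFS does not reserve the full 128 MB for it.
        System.out.println("1 KB file -> 1 block, ~1 KB on disk");
    }
}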
Hadoop Distributed File System (HDFS)
▪ Based on Google File System (GFS)
▪ Optimized for huge files
▪ Write once, read many
‣ Create new data. Never update-in-place, only append.
‣ No write locks (only 1 writer!).

▪ Optimized for sequential reads


‣ Typically, start at a point and read to completion.

▪ Throughput favoured over low latency


‣ Minimize total time to read all data, rather than latency per small file.

▪ Survive high disk/node failures


HDFS Design
▪ Master-slave architecture
‣ Master manages namespace, directory/file names/tree structure, metadata, block ids,
permissions

‣ Slave manages blocks containing data


Master: Name Node
▪ Persists names, trees, metadata, permissions
‣ Namespace image (fsimage), cached in-memory
‣ Edit log of deltas (rename, permission, create)
• Transaction persisted on disk, then applied to in-memory fsimage
‣ fsimage and edit log merged on disk when HDFS restarted
‣ Mapping from files to list of blocks

▪ Block location not persistent, kept in-memory


‣ Mapping from blocks to locations is dynamic
• Why? Nodes fail and blocks get re-replicated/rebalanced, so persisted locations would go stale
‣ Reconstructs location of blocks from data nodes
‣ ~150 bytes of in-memory metadata per block/file/dir
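A rough sizing sketch from the ~150 bytes/object figure above; the 1.4 PB total comes from the System Overview, while the replication factor and the one-file-object-plus-one-block-object simplification are assumptions:

// Back-of-the-envelope NameNode heap estimate. The replication factor and the
// "one file object + one block object per block" simplification are assumptions.
public class NameNodeHeap {
    public static void main(String[] args) {
        double rawPB = 1.4;                               // from the System Overview
        long blockSize = 128L * 1024 * 1024;              // 128 MB blocks
        int replication = 3;                              // assumed HDFS default
        long logicalBytes = (long) (rawPB * Math.pow(1024, 5)) / replication;
        long blocks = logicalBytes / blockSize;           // unique (pre-replication) blocks
        long heapBytes = blocks * 2 * 150;                // ~150 bytes per object
        System.out.printf("~%d blocks, ~%.1f GB NameNode heap%n",
                blocks, heapBytes / Math.pow(1024, 3));
    }
}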
Master: Name Node
▪ Detects health of FS
‣ Is data node alive?
‣ Is data block under-replicated?
‣ Rebalances block allocation across data nodes for improved disk utilization

▪ Coordinates file operations


‣ Directs application clients to datanodes for reads
‣ Allocates blocks on datanodes for writes

▪ Security is not a priority


‣ Basic file and dir permissions (rwx)
‣ Default enforcement relies on client machine ‘username’
Master: Name Node
▪ File system does no work if NameNode not accessible!

▪ Single Point of failure! (Hadoop 1.x)


‣ Cold start → ~10 min to load the FS image, ~1 hr to rebuild the block list for every file
‣ Host recovery → copy the FS image, reconfigure data nodes

▪ Sync atomic writes to multiple disk file systems


‣ Local disk + NFS

▪ Secondary NameNode
‣ Merge FS image with edit log periodically … avoids downtime
when merging
‣ Serves as stale copy of FS image … data loss possible

http://blog.cloudera.com/blog/2012/03/high-availability-for-the-hadoop-distributed-file-system-hdfs/
[Figure: Secondary NameNode — Hadoop: The Definitive Guide, Tom White, 4th Edition, 2015]


Master: Name Node
▪ NameNode High Availability (2.x)
‣ Reliable shared NFS for edit log
‣ Hot standby loads FS image in-memory
‣ Constantly reads edit logs from disk
‣ DataNodes send heartbeat, block list to both
• But ops received only from active

‣ On NameNode failover, standby can takeover immediately

http://blog.cloudera.com/blog/2012/03/high-availability-for-the-hadoop-distributed-file-system-hdfs/
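A minimal client-side sketch of addressing an HA pair; the nameservice ID "mycluster", the NameNode IDs, and the hostnames are placeholders (real clusters set these in hdfs-site.xml):

import org.apache.hadoop.conf.Configuration;

public class HaClientConfig {
    public static void main(String[] args) {
        // Placeholder nameservice and hosts; production values live in hdfs-site.xml.
        Configuration conf = new Configuration();
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "nn-host1:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "nn-host2:8020");
        // Lets clients retry against the standby when the active fails over.
        conf.set("dfs.client.failover.proxy.provider.mycluster",
                "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
        conf.set("fs.defaultFS", "hdfs://mycluster");
    }
}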
Slave/Worker: Data Node
▪ Store & retrieve blocks
▪ Respond to client and master requests for block operations
▪ Sends heartbeat every 3 secs for liveliness
▪ Periodically sends list of block IDs and location on that node
‣ Piggyback on heartbeat message
‣ e.g., send block list every hour
▪ Caches blocks in-memory using cache directives per file, on a single data node
‣ E.g. index, lookup table, etc.
‣ Can be used by schedulers
Network Topology
▪ Same Node, Same Rack, Same Data Center, Different Data Centers
▪ Distance function between two logical nodes provided in config
‣ /dc/rack/node … default is “flat”, i.e. same distance

Hadoop: The Definitive Guide, Tom White, 4th Edition, 2015
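A minimal sketch of the /dc/rack/node distance convention above; the path strings are invented for the example:

// HDFS-style network distance over /dc/rack/node paths.
// Distance = hops from each node up to the closest common ancestor
// (same node = 0, same rack = 2, same data center = 4, different DCs = 6).
public class NetworkDistance {
    static int distance(String a, String b) {
        String[] pa = a.split("/"), pb = b.split("/");
        int common = 0;
        while (common < pa.length && common < pb.length
                && pa[common].equals(pb[common])) {
            common++;
        }
        return (pa.length - common) + (pb.length - common);
    }

    public static void main(String[] args) {
        System.out.println(distance("/d1/r1/n1", "/d1/r1/n1")); // 0: same node
        System.out.println(distance("/d1/r1/n1", "/d1/r1/n2")); // 2: same rack
        System.out.println(distance("/d1/r1/n1", "/d1/r2/n3")); // 4: same data center
        System.out.println(distance("/d1/r1/n1", "/d2/r3/n4")); // 6: different DCs
    }
}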


File Reads
▪ Client–DataNode direct transfer … not through the NameNode

▪ Client gets the data node list for each block from the NameNode
‣ First few blocks returned initially, sorted by distance

▪ Blocks read in order (see the read sketch below)
‣ Connection opened and closed to the nearest DataNode for each block
‣ Tries alternate data nodes on network failure or checksum failure
‣ Remembers & reports failures/corrupt blocks to the NameNode

▪ Allows scaling to many concurrent clients

Hadoop: The Definitive Guide, Tom White, 4th Edition, 2015
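A minimal read sketch using the HDFS Java API; the cluster URI and file path are placeholders:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// The client asks the NameNode for block locations, then streams each
// block directly from a DataNode; path and URI are placeholders.
public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        try (FileSystem fs = FileSystem.get(conf);
             BufferedReader in = new BufferedReader(new InputStreamReader(
                     fs.open(new Path("/data/events/part-00000"))))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}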


File Writes
▪ Write once … append, truncate … strictly one writer at a time, per file
▪ Clients get a list of data nodes on which to store a block’s replicas
‣ First copy on the same data node as the client, or a random node.
‣ Second is off-rack. Third is on the same rack as the second.
▪ Blocks written in order. Forwarded in a pipeline. Acks from all replicas expected before the next block is written. (See the write sketch below.)

Hadoop: The Definitive Guide, 4th Edition, 2015
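A minimal write sketch; create() obtains a replica pipeline from the NameNode and bytes are forwarded DataNode-to-DataNode down that pipeline. Path, URI, and the tuning values are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // placeholder URI
        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out = fs.create(
                     new Path("/data/out/events.log"),
                     true,                  // overwrite
                     4096,                  // client buffer size
                     (short) 3,             // replicas: 1 local/random, 1 off-rack, 1 on that rack
                     128L * 1024 * 1024)) { // block size
            out.writeBytes("event-1\nevent-2\n");
        }
    }
}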


Hadoop YARN
Yet Another Resource Negotiator
[Figure: Fig-1 and Fig-2 — master/slave cluster topology; MRv1 vs MRv2 application lifecycle]

Apache Hadoop YARN, Arun C. Murthy, et al, HortonWorks, Addison Wesley, 2014
MapReduce v1 → MapReduce v2 (YARN)

Apache Hadoop YARN, Arun C. Murthy, et al, HortonWorks, Addison Wesley, 2014
YARN
▪ Designed for scalability
‣ 10k nodes, 400k tasks
▪ Designed for availability
‣ Separate application management from resource management

▪ Improve utilization
‣ Flexible slot allocation. Slots not bound to Map or Reduce types.

▪ Go beyond MapReduce
YARN
▪ ResourceManager for cluster
‣ Keeps track of nodes, capacities, allocations
‣ Failure and recovery (heartbeats)
▪ Coordinates scheduling of jobs on the cluster
‣ Decides which node to allocate to a job
‣ Ensures load balancing

▪ Used by programming frameworks to schedule distributed applications


‣ MapReduce, Spark, etc.
▪ NodeManager
‣ Offers slots with given capacity on a host to schedule tasks
‣ Container maps to one or more slots…Container can be a Unix process or cgroup
Application Master
▪ Coordinates
‣ resource acquisition,
‣ scheduling,
‣ monitoring progress,
‣ and termination
‣ for a specific application type
▪ E.g. MapReduce, MPI, Spark, etc.
▪ The AppMaster runs in its own container
‣ May launch additional containers for its compute tasks
‣ Or may run the job locally in its JVM for “small” applications
YARN Application Lifecycle

Apache Hadoop YARN, Arun C. Murthy, et al, HortonWorks, Addison Wesley, 2014
[Figure: containers heartbeat status to the AppMaster]

Apache Hadoop YARN, Arun C. Murthy, et al, HortonWorks, Addison Wesley, 2014
Hadoop: The Definitive Guide, 4th Edition, 2015
MapReduce AppMaster
▪ First requests Map containers
‣ As many as the number of splits

▪ Reduce containers requested after 5% of Map tasks complete

‣ Number of reducers is user specified; 1 by default!
▪ Map containers try for data locality with their “split”
‣ Same node, same rack

▪ Containers have CPU and memory resource requirements (see the config sketch below)

‣ Config per job, or default for the cluster

▪ The AppMaster asks the NodeManager to start a container

‣ The container task fetches the jar and config locally, executes, and commits
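A hedged sketch of the per-job knobs behind these bullets; the values are illustrative, not recommendations:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MrResourceConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Container memory requests per task (values illustrative):
        conf.setInt("mapreduce.map.memory.mb", 2048);
        conf.setInt("mapreduce.reduce.memory.mb", 4096);
        // Start reducers after 5% of maps finish (the default mentioned above):
        conf.setFloat("mapreduce.job.reduce.slowstart.completedmaps", 0.05f);
        Job job = Job.getInstance(conf, "example");
        job.setNumReduceTasks(1); // 1 reducer by default
    }
}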
Scheduling in YARN
▪ FIFO

▪ Capacity
‣ Uses different queues, with a minimum capacity per queue
‣ Allocates excess resources to the more loaded queues

▪ Fair
‣ Gives a job all available resources
‣ Redistributes as new jobs arrive
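With the Capacity scheduler, each job is submitted to a named queue. A minimal sketch, assuming a hypothetical "etl" queue (queues themselves are defined by the cluster admin, e.g. in capacity-scheduler.xml):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class QueueSubmit {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // "etl" is a placeholder queue name.
        conf.set("mapreduce.job.queuename", "etl");
        Job job = Job.getInstance(conf, "nightly-aggregation");
    }
}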
Hadoop MapReduce
Mapping tasks to blocks
▪ FileInputFormat converts blocks to “splits”
‣ Typically, 1 split per block … balances task-creation overhead against overwhelming a single task
‣ Can specify splits smaller/larger than a block size (see the sketch below)
‣ Affects locality if spanning blocks
‣ Affects performance with many small files (combine!)

▪ Each split is handled by a single Mapper task

‣ Records read from each split form the key-value pair input to the Map function
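A sketch of forcing splits larger than one block; the input path and sizes are illustrative, and as noted above, splits spanning blocks can lose data locality:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizing {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split-sizing");
        FileInputFormat.addInputPath(job, new Path("/data/events")); // placeholder path
        // Min split of 256 MB -> each split covers two 128 MB blocks:
        FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024);
        FileInputFormat.setMaxInputSplitSize(job, 512L * 1024 * 1024);
        // For many small files, CombineFileInputFormat packs them into fewer splits.
    }
}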
Mapping tasks to blocks

Hadoop: The Definitive Guide, 4th Edition, 2015


Resource Mapping
▪ Resource acquisition either at the beginning (Map tasks) or during (Reduce tasks) the application lifetime
‣ Higher priority for Map container requests

▪ The AppMaster can specify locality constraints to YARN

‣ Compute tasks are moved to the data block location
‣ Location of one of the three replicas of the block
‣ Prefer same node, followed by rack, then cluster
Local Disk

▪ A background thread “spills” to disk when the circular memory buffer (100 MB) reaches a threshold (80%)
‣ Asynchronous; avoids blocking unless the thread writes slower than the Map task produces
▪ Divides the data into in-memory partitions, one for each reducer
‣ Performs a sort by key
‣ Runs the combiner on the sorted outputs
‣ Writes to a local directory, accessible by reducers over HTTP (not HDFS!)
Local Disk

▪ Output files are merged, partitioned, and sorted into a single file on disk
‣ If there are multiple spill files (≥ 3) once the Map task is done, the combiner runs again.
‣ Optionally compressed
▪ Map task output is always written to disk … recovery! (Tuning knobs sketched below.)
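A sketch of the job properties behind the spill/merge behaviour above; the values shown are the usual defaults, but treat the names and defaults as assumptions to verify against your Hadoop version:

import org.apache.hadoop.conf.Configuration;

public class SpillTuning {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.task.io.sort.mb", 100);            // circular buffer size (MB)
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f); // spill threshold
        conf.setInt("mapreduce.task.io.sort.factor", 10);         // files merged per round
        conf.setInt("mapreduce.map.combine.minspills", 3);        // rerun combiner if >= 3 spills
    }
}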
Local Disk

▪ The Reducer copies files as soon as they are available from any Map task

‣ Copied into reducer memory if small,
‣ On threshold: merged, combined, then spilled to disk
▪ Incremental merge sort takes place in a background thread
Local Disk

▪ When output from all Map tasks is available, a final merge-sort runs over all spilled files before the reduce method is called
‣ Multiple rounds, 10 files merged per round
‣ Input to the reducer comes from the sorted files and the trailing in-memory sorted key-value pairs
Liveliness
▪ A Hadoop job or task is alive as long as it is making progress
‣ Reading/writing an input record
‣ Setting status or incrementing a counter (see the sketch below)
▪ Progress reported to the AppMaster by tasks every ~3 secs
▪ Client polls the AppMaster
‣ ~1 sec
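A sketch of a Mapper that stays “alive” while doing long per-record work; the counter group/name are invented for the example:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Setting status or incrementing a counter counts as progress,
// so the framework does not consider the task hung.
public class SlowMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        context.setStatus("processing offset " + key.get());
        // ... expensive per-record work would go here ...
        context.getCounter("app", "records").increment(1); // placeholder counter
        context.write(value, new LongWritable(1));
    }
}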
Reading
▪ Hadoop: The Definitive Guide, 4th Edition, 2015
‣ Chapters 3, 4, 7

Additional Resources
▪ Apache Hadoop YARN: Moving Beyond MapReduce and Batch Processing with
Apache Hadoop, 2015
‣ Chapters 1, 3, 4, 7
