Hadoop Intro


Overloading Terminology

• Hadoop has become synonymous with Big Data management and processing
• The name Hadoop is also now a proxy for both
Hadoop and the large, growing ecosystem around it
• Basically, a very large “system” using Hadoop
Distributed File System (HDFS) for storage, and direct
or indirect use of the MapReduce programming
model and software framework for processing

Hadoop: The High Level
• Apache top-level project
– “…develops open-source software for reliable, scalable, distributed computing.”
• Software library
– “…framework that allows for distributed processing of large data sets across
clusters of computers using a simple programming model…designed to scale up
from single servers to thousands of machines, each offering local computation and
storage…designed to detect and handle failures at the application layer…delivering
a highly-available service on top of a cluster of computers, each of which may be
prone to failures.”
• Hadoop Distributed File System (HDFS)
– “…primary storage system used by Hadoop applications. HDFS creates multiple
replicas of data blocks and distributes them on compute nodes throughout a
cluster to enable reliable, extremely rapid computations.”
• MapReduce
– “…a programming model and software framework for writing applications that
rapidly process vast amounts of data in parallel on large clusters of compute nodes.”
What’s Hadoop Used For?
Major Use Cases per Cloudera (2012)
• Data processing
– Building search indexes
– Log processing
– “Click Sessionization”
– Data processing pipelines
– Video & Image analysis
• Analytics
– Recommendation systems (Machine Learning)
– Batch reporting
• Real time applications (“home of the brave”)
• Data Warehousing
HDFS

• Google File System (GFS) cited as the original concept


– Certain common attributes between GFS & HDFS
• Pattern emphasis: Write Once, Read Many
– Sound familiar?
• Remember WORM drives? Optical jukebox library? Before CDs (CD-R)
– Back when dirt was new… almost…
• WORM optical storage was still about greater price:performance at
larger storage volumes vs. the prevailing [disk] technology – though
still better p:p than The Other White Meat of the day, Tape!
• Very Large Files: One file could span the entire HDFS
• Commodity Hardware
– 2 socket; 64-bit; local disks; no RAID
• Co-locate data and compute resources
HDFS: What It’s Not Good For
• Many, many small files
– Scalability issue for the namenode (more in a moment)
• Low Latency Access
– It’s all about Throughput (N=XR)
– Not about minimizing service time (or latency to first data read)
• Multiple writers; updates at offsets within a file
– One writer
– Create/append/rename/move/delete - that’s it! No updates!
• Not a substitute for a relational database
– Data stored in files, not indexed
– To find something, must read all the data (ultimately, by a MapReduce job/tasks)
• Selling SAN or NAS – NetApp & EMC need not apply
• Programming in COBOL
• Selling mainframes & FICON
HDFS Concepts & Architecture

• Core architectural goal: Fault-tolerance in the face of massive parallelism (many devices, high probability of HW failure)
• Monitoring, detection, fast automated recovery
• Focus is batch throughput
– Hence support for very large files; large block size; single writer,
create or append (no update) and a high read ratio
• Master/Slave: Separate filesystem metadata & app data
– One NameNode manages the filesystem and client access
• All metadata - namespace tree & map of file blocks to DataNodes - stored
in memory and persisted on disk
• Namespace is a traditional hierarchical file organization (directory tree)
– DataNodes manage locally attached storage (JBOD, no RAID)
• Not layered on or dependent on any other filesystem
http://hadoop.apache.org/common/docs/current/hdfs_design.html
HDFS Concepts & Architecture
• Block-oriented
– Not your father’s block size: 64 MB by default
• Block size selectable on a file-by-file basis
– What if an HDFS block was 512 bytes or 4KB instead of 64 MB?
» Metadata would be 128K (131,072) times or 16K (16,384) times larger, impacting
memory on NameNode and potentially limiting the total HDFS storage size in the
cluster
» Total storage also sensitive to total number of files in HDFS

– All blocks in a file are the same size, except the last block
– Large block size minimizes seek time
• Approaches disk spiral transfer rate ~100 MB/sec
• Disk read-ahead may further minimize track to track seek time
• High aggregate file read bandwidth: # disks/node X ~100 MB/sec
– Divides incoming files into blocks, storing them redundantly
across DataNodes in the cluster: Replication for fault-tolerance
• Replication Factor per-file
• Location of block replicas can change over time
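A minimal sketch of choosing those per-file settings through the HDFS Java FileSystem API; the path, 128 MB block size, and replication factor of 2 are made-up illustration values, not recommendations.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PerFileSettings {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);       // HDFS when the default FS points at it

    Path file = new Path("/data/example/big-file.dat");   // hypothetical path
    long blockSize = 128L * 1024 * 1024;   // 128 MB instead of the 64 MB default
    short replication = 2;                 // instead of the default factor of 3
    int bufferSize = 4096;

    // Block size and replication factor are chosen per file at create time.
    FSDataOutputStream out =
        fs.create(file, true, bufferSize, replication, blockSize);
    out.writeBytes("example payload\n");
    out.close();

    // Replication can be changed later; block size cannot.
    fs.setReplication(file, (short) 3);
  }
}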
NameNode

• Is a SPOF, single point of failure


• Metadata persisted in the file FsImage
• Changes since last checkpoint logged separately in a transaction log
called the EditLog
• Both files stored in the NameNode's local filesystem
– Availability: Redundant copies to NAS or other servers
• Counter-intuitive:
– Checkpoint done only at NameNode startup
• EditLog read to update FsImage; FsImage mapped into memory; EditLog truncated; ready to
rock!
• The NameNode does not directly call DataNodes but piggy-backs
instructions in its replies to DataNodes’ heartbeat
– Replicate blocks to other nodes (pipeline approach)
– Remove block replicas (load balancing)
– Restart (re-register, re-join cluster) or shutdown
– Send a block report NOW
DataNodes

• A block replica is made up of 2 files on a DataNode's local file system
– The data itself
– Block metadata
• Block checksum and generation timestamp
• At startup the DataNode connects to the NameNode &
performs a handshake to verify ID and version
– Shutdown if either does not match: Prevent corruption
– Nodes cannot “register” with the NameNode (i.e., join the cluster) unless they have a matching ID
• Blocks consume only the space needed
– Hadoop optimizes for full blocks until the last block of the file is written
DataNodes (cont’d)

• Sends a block report to the NameNode about all block replicas it contains, at DataNode “startup” and hourly
• Sends heartbeat to NameNode @ 3 sec intervals (default)
– 10 minute timeout, DataNode removed by NameNode and new
replicas of its blocks scheduled for creation on other DataNodes
– Heartbeat also contains info used by NameNode’s storage
allocation and load balancing algorithms

Back to the NameNode

• Filesystem metadata is persisted in the file “FsImage”


• Changes since last checkpoint logged separately in a
transaction log called the “EditLog”
• Both files stored in the NameNode's local filesystem
– Availability: Redundant copies to NAS or other servers
• Counter-intuitive:
– Checkpoint done only at NameNode startup
• Replay EditLog to update FsImage; FsImage mapped into memory;
EditLog truncated; ready to rock!

Client Read & Write

• Client read request first contacts the NameNode for locations of file data blocks; reads blocks from closest DataNode(s)
– Uses block checksum to detect corruption, a common problem in clusters
of 100s-1,000s of nodes and disks
• Client write operation sends path to NameNode; requests
NameNode to select Replication Factor number of DataNodes to
host block replicas; writes data to DataNodes in serial pipeline
manner (more in a moment)
• File create:

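A minimal read-side sketch with the HDFS Java API (hypothetical path); the NameNode lookup of block locations and the checksum verification described above happen inside the client library, so the application just opens and reads.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsRead {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // open() asks the NameNode for the file's block locations; bytes are then
    // streamed from the closest DataNode for each block, checksums verified.
    FSDataInputStream in = fs.open(new Path("/data/example/big-file.dat"));
    try {
      IOUtils.copyBytes(in, System.out, 4096, false);  // copy file contents to stdout
    } finally {
      IOUtils.closeStream(in);
    }
  }
}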
Client Write (cont’d)

• Write Pipeline
– Data bytes pushed to the pipeline as a sequence of packets
• Client buffers bytes until a packet buffer fills (64KB default) or until file is
closed, then packet is pushed into the pipeline
– Client calculates & sends checksum with block
– Recall that data and metadata (incl. checksum) are stored separately on DataNode
– Asynchronous I/O
• Client does not wait for ACK to packet by DataNode, continues pushing
packets
• Limit to the # of outstanding packets = “packet window size” of client
– i.e., the queue size (N=XR)
– Data visible to new reader at file close or using hflush call

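A rough write-side sketch; the path and record format are invented, and hflush() is used as the slide describes to make buffered data visible to readers before close.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteWithFlush {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FSDataOutputStream out = fs.create(new Path("/logs/example/events.log"));

    for (int i = 0; i < 1000; i++) {
      out.writeBytes("event-" + i + "\n");  // buffered client-side into ~64 KB packets
      if (i % 100 == 99) {
        out.hflush();   // push buffered packets down the DataNode pipeline so a
      }                 // concurrent reader can see everything written so far
    }
    out.close();        // close flushes the rest and makes the file fully visible
  }
}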
HDFS Key Points

• Locality and High Performance


– Unlike conventional, non-distributed file systems, HDFS exposes the location of a file's blocks through its API, with major implications
– Frameworks like MapReduce can thus schedule a task to execute
where the data is located
• Send computation to the data, not the other way around
– Reduces network infrastructure costs, reduces elapsed time
• Highest read performance when a task executes with data on local disks
– Point-to-point SAS/SATA disks, no shared bus, a key technology enabler
– Massive price : performance benefit over SAN storage
– Supports application per-file replication factor (default 3)
– Improves fault tolerance
– Increases read bandwidth for heavily accessed files
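The exposed block-location API can be seen directly; a small sketch (hypothetical path) that asks where each block of a file lives:

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus status = fs.getFileStatus(new Path("/data/example/big-file.dat"));

    // One BlockLocation per block, each listing the DataNodes that hold a
    // replica; MapReduce uses exactly this to run Map tasks near the data.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.println("offset " + block.getOffset() + " -> "
          + Arrays.toString(block.getHosts()));
    }
  }
}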
Typical Hadoop Node

• Dual socket 1U/2U rack mount server


– Internal disk requirement negates use of blades
• Quad core CPUs >2 GHz
• 16-24 GB memory
• 4 to 6 data disks @ 1TB
– Performance & economics target for heavily queried data, busy processors
– Use cases focusing on long-term storage might deploy a DL370 5U form factor
with 14 internal 3.5” disks
• 1 GbE NIC
• Linux

Hadoop MapReduce

• Software framework for “easily” writing applications that process vast (multi-TB) amounts of data in parallel on very large clusters of commodity hardware with reliability and fault-tolerance
• A MapReduce job normally divides the input data into separate
chunks which are then processed in parallel by Map tasks. The
framework guarantees Map output is sorted for input to Reduce
• The sorted outputs are inputs to the Reduce tasks
• Input and output are stored in HDFS
• Framework performs task scheduling; task monitoring & re-
execution of failed tasks
• Massive aggregate cluster bandwidth
– # nodes X # disks X ~100 MB/sec
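To make the Map/sort/Reduce flow concrete, here is a minimal word-count Mapper and Reducer in the classic org.apache.hadoop.mapred API (the API the deck's JobClient/Reporter slides assume); a sketch, not a tuned implementation.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Map: one input line in, one (word, 1) pair out per token.
public class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, IntWritable> out, Reporter reporter)
      throws IOException {
    for (String token : line.toString().split("\\s+")) {
      if (token.isEmpty()) continue;
      word.set(token);
      out.collect(word, ONE);   // the framework sorts these by key for the Reducer
    }
  }
}

// Reduce: receives each word with all of its counts, already sorted and grouped.
class WordCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text word, Iterator<IntWritable> counts,
                     OutputCollector<Text, IntWritable> out, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (counts.hasNext()) {
      sum += counts.next().get();
    }
    out.collect(word, new IntWritable(sum));
  }
}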
Language Support

• Java (native)
• Streaming
– Any language supporting the standard Unix streams interface (stdin/stdout)
• Read from stdin, write to stdout
• Redirection as usual
– Highly suited to text processing
• Line by line input
• Key-value pair output of Map program written as tab-delimited line to stdout
• MapReduce Framework sorts the Map task output
• Reduce function reads sorted lines from stdin
• Reduce finally writes to stdout
– Python; Perl; Ruby; etc.
• Pipes
– The C++ interface for MapReduce
• Sockets interface (not streams, not java native interface JNI)
Hadoop MapReduce

• Master/Slave architecture
– One master Jobtracker in the cluster
– One slave Tasktracker per DataNode
• Jobtracker
– Schedules a job’s constituent tasks on slaves
– Monitors tasks, re-executes upon task failure
• Tasktracker executes tasks by direction of the Jobtracker

Hadoop MapReduce Job Execution

• Client MapReduce program submits job through JobClient.runJob API
– Job Client (in client JVM on client node) sends request to JobTracker (on
JobTracker node), which will coordinate job execution in the cluster
• JobTracker returns Job ID
– JobClient calculates splits of input data
– JobClient “fans the job out” across the cluster
• Copies program components (Java jar, config file) and the calculated input data splits to the JobTracker's filesystem (HDFS). The jar is written with a high replication factor (default 10) so copies are close to many TaskTrackers (recall a TaskTracker runs on each DataNode)
– JobTracker inserts job into its queue for its scheduler to dequeue and init
• Init: Create job object to represent tasks it executes, and track task progress/status
– JobClient informs JobTracker that job is ready to execute
– JobTracker creates list of tasks to run
• Retrieves (from HDFS) the input data splits computed by JobClient
• Creates one Map task per data split
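The submission steps above correspond roughly to a driver like the following (old mapred API, as named on the slide); the input/output paths are placeholders and the classes refer to the word-count sketch shown earlier.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCountDriver.class);
    conf.setJobName("wordcount");

    conf.setMapperClass(WordCountMapper.class);
    conf.setCombinerClass(WordCountReducer.class);  // optional map-side aggregation
    conf.setReducerClass(WordCountReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    FileInputFormat.setInputPaths(conf, new Path("/data/example/in"));
    FileOutputFormat.setOutputPath(conf, new Path("/data/example/out"));

    // runJob() computes input splits, copies jar/config/splits to HDFS, submits
    // to the JobTracker (which creates one Map task per split), and waits.
    JobClient.runJob(conf);
  }
}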
TaskTracker

• TaskTracker communication with the JobTracker


– Default loop, sends periodic heartbeat message to JobTracker
• Tasktracker is alive
• Available (or not) to run a task
• JobTracker sends task to run in heartbeat response message if TaskTracker available
• TaskTrackers have limited queue size for Map and for Reduce tasks
– Queue size mapped to # cores & physical memory on TaskTracker node
• TaskTracker scheduler will schedule Map tasks before Reduce
tasks
• Copies jar file from HDFS to TaskTracker filesystem
– Unwinds jar file into a local working directory created for the task
• Creates TaskRunner to run the task
– TaskRunner starts a JVM for each task
– Separate JVM per task isolates TaskTracker from user M/R program bugs
TaskTracker

• TaskTracker per-task JVMs (children) communicate with the parent


– Status, progress of each task, every few seconds via “umbilical” interface
– Streaming interface communicates with process via stdin/stdout
• MapReduce jobs are long-running
– Minutes to hours
– Status and progress important to communicate
• State: Running, successful completion, failed
• Progress of map & reduce tasks
• Job counters
• User-defined status messages
• Task Progress: How is this defined?
– Input record read (Map or Reduce)
– Output record written
– Set status description on Reporter, increment Reporter counter, call to
Reporter progress() method
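Inside a map() method, the progress/status mechanisms listed above look roughly like this with the old-API Reporter; the counter enum, path-free logic, and messages are invented purely for illustration.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class ReportingMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  // Hypothetical counter group for this example.
  enum Records { SKIPPED, PROCESSED }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, LongWritable> out, Reporter reporter)
      throws IOException {
    if (value.getLength() == 0) {
      reporter.incrCounter(Records.SKIPPED, 1);     // job counter, visible in job status
      return;
    }
    reporter.setStatus("processing offset " + key); // user-defined status message
    reporter.progress();                            // "still alive": resets the task timeout
    reporter.incrCounter(Records.PROCESSED, 1);
    out.collect(value, new LongWritable(key.get()));
  }
}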
Job Completion

• JobTracker receives last task’s completion notification


– Changes job status to “successful”
– Can send HTTP job notification message to JobClient
• JobClient learns of success at its next polling interval to JobTracker
– Sends message to user
• JobTracker performs cleanup, commands TaskTrackers to cleanup

Example Job & Filesystem Interaction
Multiple Map tasks, one Reduce task
Map reads from HDFS, writes to local FS
Reduce reads from local FS, writes to HDFS

Example #2: Job & Filesystem Interaction
Multiple Map tasks, Two Reduce tasks
Each Reduce writes one partition per Reduce task to HDFS

Sorting Map Output
• Default 100 MB memory buffer in which to sort a Map task's output
• Sort thread divides output data into partitions by the Reducer the
output will be sent to
• Within a partition, a thread sorts in-memory by key. Combiner function
uses this sorted output
• @ default 80% buffer full, starts to flush (“spill”) to the local file system
• In parallel, Map outputs continue to be written into the sort buffer while the spill writes sorted output to the local FS. The Map task blocks writing to the sort buffer only if the buffer fills completely before a spill finishes
• New spill file created every time buffer hits 80% full
• Multiple spill files are merged into one partitioned, sorted output
• Output is typically compressed for write to disk (good tradeoff given
CPU speed vs. disk speed)
• Map output file is now on local disk of the TaskTracker that ran the Map task
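In the Hadoop 1.x era these buffer, spill, and compression behaviors were tunable per job; a sketch using the 1.x property names, with example values only (it would normally be part of a full driver such as the word-count one above).

import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.JobConf;

public class MapSideTuning {
  public static void main(String[] args) {
    JobConf conf = new JobConf(MapSideTuning.class);

    // Size of the in-memory sort buffer for Map output (default 100 MB).
    conf.setInt("io.sort.mb", 200);
    // Fraction of the buffer at which the background spill starts (default 0.80).
    conf.setFloat("io.sort.spill.percent", 0.80f);

    // Compress the sorted, partitioned Map output written to local disk
    // (CPU is usually cheaper than disk and network here).
    conf.setCompressMapOutput(true);
    conf.setMapOutputCompressorClass(GzipCodec.class);
  }
}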
MapReduce Paradigm

• MapReduce is not always the best algorithm


– One simple functional programming operation applied in
parallel to Big Data
– Not amenable to maintaining state (remembering output from
a previous MapReduce job)
• Can pipeline MR jobs in serial
• HBase

• Apache Hive!
• Data warehouse system that projects structure onto data stored in HDFS
• SQL-like query language, HiveQL
• Also supports traditional map/reduce programs
• Popular with data researchers
– Bitly; LinkedIn to name two
• Originated & developed at Facebook

• Apache Pig!
• High-level language for data analysis
• Program structure “…amenable to substantial
parallelization”
• “Complex tasks comprised of multiple interrelated
data transformations are explicitly encoded as data
flow sequences, making them easy to write,
understand, and maintain.”
• Popular with data researchers who are also skilled in
Python as their analysis toolset and “glue” language
• Originated & developed at Yahoo!
• Apache HBase!
– Realtime, Random R/W access to Big Data
– Column-oriented
• Modeled after Google’s Bigtable for structured data
– Massive distributed database layered onto clusters of
commodity hardware
• Goal: Very large tables, billions of rows X millions of columns
– HDFS underlies HBase
• Originally developed at Powerset; acquired by Microsoft

• Apache Zookeeper!
• Goal: Provide highly reliable distributed coordination
services via a set of primitives for distributed apps to build
on
– “Because Coordinating Distributed Systems is a Zoo” (!)
• Common distributed services notoriously difficult to develop & maintain
on an application by application basis
• Centralized distributed synchronization; group services;
naming; configuration information
• High performance, highly available, ordered access
– In memory. Replicated. Synchronization at client by API
• Originated & developed at Yahoo!
• Apache Mahout!
• Machine Learning Library
• Based on Hadoop HDFS & Map/Reduce
– Though not restricted to Hadoop implementations
• “Scalable to ‘reasonably large’ data sets”
• Four primary ML use cases, currently
– Clustering; Classification; Frequent Itemset Mining;
Recommendation Mining

• Apache Chukwa! Originated & developed @ Yahoo!
• Log analysis framework built on HDFS & Map/Reduce and
HBase
• Data collection system to monitor large distributed systems
• Toolkit for displaying & analyzing collected data
• A competitor to Splunk?

• Apache Avro! Originated @ Yahoo!, co-developer
Cloudera
• Data serialization system
– Serialization: “…the process of converting a data structure or object state
into a format that can be stored” in a file or transmitted over a network link.
– Serializing an object is also referred to as “marshaling” an object
• Supports JSON as the data interchange format
– JavaScript Object Notation, a self-describing data format
– Experimental support for Avro IDL, an alternate Interface Description Lang.
• Includes Remote Procedure Call (RPC) in the framework
– Communication between Hadoop nodes and from clients to services
• Hadoop (Doug Cutting) emphasizes moving away from text APIs to using Avro APIs for Map/Reduce, etc.
• Apache Whirr!
• Libraries for running cloud services
– Cloud-neutral, avoids provider idiosyncrasies
• Command line tool for deploying large clusters
• Common service API
• Smart defaults for services, to get a properly configured system running quickly while still letting you specify the desired attributes

• Apache Sqoop!
• “Sqoop is a tool designed for efficiently transferring
bulk data between Apache Hadoop and structured
datastores such as relational databases”
• Project status:

• Apache Bigtop!
• Bigtop is to Hadoop as Fedora is to Red Hat Enterprise Linux
– Packaging & interoperability testing focuses on the system as a
whole vs. individual projects

HCatalog; MRUnit; Oozie

• HCatalog
– Table and storage management service for data created using
Apache Hadoop
• Table abstraction with regard to where or how the data is stored
• To provide interoperability across data processing tools including Pig; Map/Reduce;
Streaming; Hive

• MRUnit
– Library to support unit testing of Hadoop Map/Reduce jobs
• Oozie
– Workflow system to manage Hadoop jobs by time or data arrival

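For the MRUnit bullet, a rough in-memory unit test of the word-count Mapper sketched earlier; it follows MRUnit's MapDriver pattern, though exact package and factory names vary by MRUnit version.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.MapDriver;
import org.junit.Test;

public class WordCountMapperTest {
  @Test
  public void emitsOneCountPerToken() throws Exception {
    // Drives the old-API WordCountMapper entirely in memory:
    // no cluster, no HDFS, no JobTracker involved.
    MapDriver<LongWritable, Text, Text, IntWritable> driver =
        MapDriver.newMapDriver(new WordCountMapper());

    driver.withInput(new LongWritable(0), new Text("hadoop hdfs hadoop"))
          .withOutput(new Text("hadoop"), new IntWritable(1))
          .withOutput(new Text("hdfs"), new IntWritable(1))
          .withOutput(new Text("hadoop"), new IntWritable(1))
          .runTest();
  }
}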
Small sample of awareness being raised

• Big Data
– Science, one of the two top peer-reviewed science journals in the world

• 11 Feb 2011 Special Issue: Dealing with Data

– The Economist
• Hadoop-specific
– Yahoo! videos on YouTube
– Technology pundits in many mainline publications & blogs
– Serious blogs by researchers & Ph.D. candidates across data science; computer science;
statistics & other fields

