Hadoop Intro


Overloading Terminology

• Hadoop has become synonymous with Big Data management and processing
• The name Hadoop is also now a proxy for both
Hadoop and the large, growing ecosystem around it
• Basically, a very large “system” using Hadoop
Distributed File System (HDFS) for storage, and direct
or indirect use of the MapReduce programming
model and software framework for processing

Hadoop: The High Level
• Apache top-level project
– “…develops open-source software for reliable, scalable, distributed computing.”
• Software library
– “…framework that allows for distributed processing of large data sets across
clusters of computers using a simple programming model…designed to scale up
from single servers to thousands of machines, each offering local computation and
storage…designed to detect and handle failures at the application layer…delivering
a highly-available service on top of a cluster of computers, each of which may be
prone to failures.”
• Hadoop Distributed File System (HDFS)
– “…primary storage system used by Hadoop applications. HDFS creates multiple
replicas of data blocks and distributes them on compute nodes throughout a
cluster to enable reliable, extremely rapid computations.”
• MapReduce
– “…a programming model and software framework for writing applications that
rapidly process vast amounts of data in parallel on large clusters of compute nodes.”
What’s Hadoop Used For?
Major Use Cases per Cloudera (2012)
• Data processing
– Building search indexes
– Log processing
– “Click Sessionization”
– Data processing pipelines
– Video & Image analysis
• Analytics
– Recommendation systems (Machine Learning)
– Batch reporting
• Real time applications (“home of the brave”)
• Data Warehousing
HDFS

• Google File System (GFS) cited as the original concept


– Certain common attributes between GFS & HDFS
• Pattern emphasis: Write Once, Read Many
– Sound familiar?
• Remember WORM drives? Optical jukebox library? Before CDs (CD-R)
– Back when dirt was new… almost…
• WORM optical storage was still about greater price:performance at
larger storage volumes vs. the prevailing [disk] technology – though
still better p:p than The Other White Meat of the day, Tape!
• Very Large Files: One file could span the entire HDFS
• Commodity Hardware
– 2 socket; 64-bit; local disks; no RAID
• Co-locate data and compute resources
HDFS: What It’s Not Good For
• Many, many small files
– Scalability issue for the namenode (more in a moment)
• Low Latency Access
– It’s all about Throughput (N=XR)
– Not about minimizing service time (or latency to first data read)
• Multiple writers; updates at offsets within a file
– One writer
– Create/append/rename/move/delete - that’s it! No updates!
• Not a substitute for a relational database
– Data stored in files, not indexed
– To find something, must read all the data (ultimately, by a MapReduce job/tasks)
• Selling SAN or NAS – NetApp & EMC need not apply
• Programming in COBOL
• Selling mainframes & FICON
HDFS Concepts & Architecture

• Core architectural goal: Fault-tolerance in the face of massive parallelism (many devices, high probability of HW failure)
• Monitoring, detection, fast automated recovery
• Focus is batch throughput
– Hence support for very large files; large block size; single writer,
create or append (no update) and a high read ratio
• Master/Slave: Separate filesystem metadata & app data
– One NameNode manages the filesystem and client access
• All metadata - namespace tree & map of file blocks to DataNodes - stored
in memory and persisted on disk
• Namespace is a traditional hierarchical file organization (directory tree)
– DataNodes manage locally attached storage (JBOD, no RAID)
• Not layered on or dependent on any other filesystem
http://hadoop.apache.org/common/docs/current/hdfs_design.html
HDFS Concepts & Architecture
• Block-oriented
– Not your father’s block size: 64 MB by default
• Block size selectable on a file-by-file basis
– What if an HDFS block was 512 bytes or 4KB instead of 64 MB?
» Metadata would be 128K (131,072) times or 16K (16,384) times larger, impacting
memory on NameNode and potentially limiting the total HDFS storage size in the
cluster
» Total storage also sensitive to total number of files in HDFS

– All blocks in a file are the same size, except the last block
– Large block size minimizes seek time
• Approaches disk spiral transfer rate ~100 MB/sec
• Disk read-ahead may further minimize track to track seek time
• High aggregate file read bandwidth: # disks/node X ~100 MB/sec
– Divides incoming files into blocks, storing them redundantly
across DataNodes in the cluster: Replication for fault-tolerance
• Replication Factor per-file
• Location of block replicas can change over time
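A minimal sketch of choosing those per-file settings through the HDFS Java FileSystem API; the path, 128 MB block size, and replication factor of 2 are made-up illustration values, not recommendations.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PerFileSettings {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);       // HDFS when the default FS points at it

    Path file = new Path("/data/example/big-file.dat");   // hypothetical path
    long blockSize = 128L * 1024 * 1024;   // 128 MB instead of the 64 MB default
    short replication = 2;                 // instead of the default factor of 3
    int bufferSize = 4096;

    // Block size and replication factor are chosen per file at create time.
    FSDataOutputStream out =
        fs.create(file, true, bufferSize, replication, blockSize);
    out.writeBytes("example payload\n");
    out.close();

    // Replication can be changed later; block size cannot.
    fs.setReplication(file, (short) 3);
  }
}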
NameNode

• Is a SPOF, single point of failure


• Metadata persisted in the file FsImage
• Changes since last checkpoint logged separately in a transaction log
called the EditLog
• Both files stored in the NameNode's local filesystem
– Availability: Redundant copies to NAS or other servers
• Counter-intuitive:
– Checkpoint done only at NameNode startup
• EditLog read to update FsImage; FsImage mapped into memory; EditLog truncated; ready to
rock!
• The NameNode does not directly call DataNodes but piggy-backs
instructions in its replies to DataNodes’ heartbeat
– Replicate blocks to other nodes (pipeline approach)
– Remove block replicas (load balancing)
– Restart (re-register, re-join cluster) or shutdown
– Send a block report NOW
DataNodes

• A block replica is made up of 2 files on a DataNode's local file system
– The data itself
– Block metadata
• Block checksum and generation timestamp
• At startup the DataNode connects to the NameNode &
performs a handshake to verify ID and version
– Shutdown if either does not match: Prevent corruption
– Nodes cannot “register” with the NameNode (i.e., join the cluster) unless they have a matching ID
• Blocks consume only the space needed
– Hadoop optimizes for full blocks until the last block of the file is written
DataNodes (cont’d)

• Sends a block report to the NameNode about all block replicas it contains, at DataNode “startup” and hourly
• Sends heartbeat to NameNode @ 3 sec intervals (default)
– 10 minute timeout, DataNode removed by NameNode and new
replicas of its blocks scheduled for creation on other DataNodes
– Heartbeat also contains info used by NameNode’s storage
allocation and load balancing algorithms

Back to the NameNode

• Filesystem metadata is persisted in the file “FsImage”


• Changes since last checkpoint logged separately in a
transaction log called the “EditLog”
• Both files stored in the NameNode's local filesystem
– Availability: Redundant copies to NAS or other servers
• Counter-intuitive:
– Checkpoint done only at NameNode startup
• Replay EditLog to update FsImage; FsImage mapped into memory;
EditLog truncated; ready to rock!

Client Read & Write

• Client read request first contacts the NameNode for locations of file data blocks; reads blocks from closest DataNode(s)
– Uses block checksum to detect corruption, a common problem in clusters
of 100s-1,000s of nodes and disks
• Client write operation sends path to NameNode; requests
NameNode to select Replication Factor number of DataNodes to
host block replicas; writes data to DataNodes in serial pipeline
manner (more in a moment)
• File create:

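A minimal read-side sketch with the HDFS Java API (hypothetical path); the NameNode lookup of block locations and the checksum verification described above happen inside the client library, so the application just opens and reads.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsRead {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // open() asks the NameNode for the file's block locations; bytes are then
    // streamed from the closest DataNode for each block, checksums verified.
    FSDataInputStream in = fs.open(new Path("/data/example/big-file.dat"));
    try {
      IOUtils.copyBytes(in, System.out, 4096, false);  // copy file contents to stdout
    } finally {
      IOUtils.closeStream(in);
    }
  }
}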
Client Write (cont’d)

• Write Pipeline
– Data bytes pushed to the pipeline as a sequence of packets
• Client buffers bytes until a packet buffer fills (64KB default) or until file is
closed, then packet is pushed into the pipeline
– Client calculates & sends checksum with block
– Recall that data and metadata (incl. checksum) are stored separately on DataNode
– Asynchronous I/O
• Client does not wait for ACK to packet by DataNode, continues pushing
packets
• Limit to the # of outstanding packets = “packet window size” of client
– i.e., the queue size (N=XR)
– Data visible to new reader at file close or using hflush call

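A rough write-side sketch; the path and record format are invented, and hflush() is used as the slide describes to make buffered data visible to readers before close.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteWithFlush {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FSDataOutputStream out = fs.create(new Path("/logs/example/events.log"));

    for (int i = 0; i < 1000; i++) {
      out.writeBytes("event-" + i + "\n");  // buffered client-side into ~64 KB packets
      if (i % 100 == 99) {
        out.hflush();   // push buffered packets down the DataNode pipeline so a
      }                 // concurrent reader can see everything written so far
    }
    out.close();        // close flushes the rest and makes the file fully visible
  }
}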
HDFS Key Points

• Locality and High Performance


– Unlike conventional, non-distributed file systems, HDFS exposes the location of a file's blocks through its API, with major implications
– Frameworks like MapReduce can thus schedule a task to execute
where the data is located
• Send computation to the data, not the other way around
– Reduces network infrastructure costs, reduces elapsed time
• Highest read performance when a task executes with data on local disks
– Point-to-point SAS/SATA disks, no shared bus, a key technology enabler
– Massive price : performance benefit over SAN storage
– Supports application per-file replication factor (default 3)
– Improves fault tolerance
– Increases read bandwidth for heavily accessed files
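The exposed block-location API can be seen directly; a small sketch (hypothetical path) that asks where each block of a file lives:

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus status = fs.getFileStatus(new Path("/data/example/big-file.dat"));

    // One BlockLocation per block, each listing the DataNodes that hold a
    // replica; MapReduce uses exactly this to run Map tasks near the data.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.println("offset " + block.getOffset() + " -> "
          + Arrays.toString(block.getHosts()));
    }
  }
}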
Typical Hadoop Node

• Dual socket 1U/2U rack mount server


– Internal disk requirement negates use of blades
• Quad core CPUs >2 GHz
• 16-24 GB memory
• 4 to 6 data disks @ 1TB
– Performance & economics target for heavily queried data, busy processors
– Use cases focusing on long-term storage might deploy a DL370 5U form factor
with 14 internal 3.5” disks
• 1 GbE NIC
• Linux

Hadoop MapReduce

• Software framework for “easily” writing applications that process vast (multi-TB) amounts of data in parallel on very large clusters of commodity hardware with reliability and fault-tolerance
• A MapReduce job normally divides the input data into separate
chunks which are then processed in parallel by Map tasks. The
framework guarantees Map output is sorted for input to Reduce
• The sorted outputs are inputs to the Reduce tasks
• Input and output are stored in HDFS
• Framework performs task scheduling; task monitoring & re-
execution of failed tasks
• Massive aggregate cluster bandwidth
– # nodes X # disks X ~100 MB/sec
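To make the Map/sort/Reduce flow concrete, here is a minimal word-count Mapper and Reducer in the classic org.apache.hadoop.mapred API (the API the deck's JobClient/Reporter slides assume); a sketch, not a tuned implementation.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Map: one input line in, one (word, 1) pair out per token.
public class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, IntWritable> out, Reporter reporter)
      throws IOException {
    for (String token : line.toString().split("\\s+")) {
      if (token.isEmpty()) continue;
      word.set(token);
      out.collect(word, ONE);   // the framework sorts these by key for the Reducer
    }
  }
}

// Reduce: receives each word with all of its counts, already sorted and grouped.
class WordCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text word, Iterator<IntWritable> counts,
                     OutputCollector<Text, IntWritable> out, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (counts.hasNext()) {
      sum += counts.next().get();
    }
    out.collect(word, new IntWritable(sum));
  }
}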
Language Support

• Java (native)
• Streaming
– Any language supporting the standard Unix streams interface (stdin/stdout)
• Read from stdin, write to stdout
• Redirection as usual
– Highly suited to text processing
• Line by line input
• Key-value pair output of Map program written as tab-delimited line to stdout
• MapReduce Framework sorts the Map task output
• Reduce function reads sorted lines from stdin
• Reduce finally writes to stdout
– Python; Perl; Ruby; etc.
• Pipes
– The C++ interface for MapReduce
• Sockets interface (not streams, not java native interface JNI)
Hadoop MapReduce

• Master/Slave architecture
– One master Jobtracker in the cluster
– One slave Tasktracker per DataNode
• Jobtracker
– Schedules a job’s constituent tasks on slaves
– Monitors tasks, re-executes upon task failure
• Tasktracker executes tasks by direction of the Jobtracker

Hadoop MapReduce Job Execution

• Client MapReduce program submits job through JobClient.runJob API
– Job Client (in client JVM on client node) sends request to JobTracker (on
JobTracker node), which will coordinate job execution in the cluster
• JobTracker returns Job ID
– JobClient calculates splits of input data
– JobClient “fans the job out” across the cluster
• Copies program components (Java jar, config file) and the calculated input data splits to the JobTracker's filesystem (HDFS). The jar is written with a high replication factor (default 10) so copies are close to many TaskTrackers (recall a TaskTracker runs on each DataNode)
– JobTracker inserts job into its queue for its scheduler to dequeue and init
• Init: Create job object to represent tasks it executes, and track task progress/status
– JobClient informs JobTracker that job is ready to execute
– JobTracker creates list of tasks to run
• Retrieves (from HDFS) the input data splits computed by JobClient
• Creates one Map task per data split
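The submission steps above correspond roughly to a driver like the following (old mapred API, as named on the slide); the input/output paths are placeholders and the classes refer to the word-count sketch shown earlier.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCountDriver.class);
    conf.setJobName("wordcount");

    conf.setMapperClass(WordCountMapper.class);
    conf.setCombinerClass(WordCountReducer.class);  // optional map-side aggregation
    conf.setReducerClass(WordCountReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    FileInputFormat.setInputPaths(conf, new Path("/data/example/in"));
    FileOutputFormat.setOutputPath(conf, new Path("/data/example/out"));

    // runJob() computes input splits, copies jar/config/splits to HDFS, submits
    // to the JobTracker (which creates one Map task per split), and waits.
    JobClient.runJob(conf);
  }
}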
TaskTracker

• TaskTracker communication with the JobTracker


– Default loop, sends periodic heartbeat message to JobTracker
• Tasktracker is alive
• Available (or not) to run a task
• JobTracker sends task to run in heartbeat response message if TaskTracker available
• TaskTrackers have limited queue size for Map and for Reduce tasks
– Queue size mapped to # cores & physical memory on TaskTracker node
• TaskTracker scheduler will schedule Map tasks before Reduce
tasks
• Copies jar file from HDFS to TaskTracker filesystem
– Unwinds jar file into a local working directory created for the task
• Creates TaskRunner to run the task
– TaskRunner starts a JVM for each task
– Separate JVM per task isolates TaskTracker from user M/R program bugs
TaskTracker

• TaskTracker per-task JVMs (children) communicate with the parent


– Status, progress of each task, every few seconds via “umbilical” interface
– Streaming interface communicates with process via stdin/stdout
• MapReduce jobs are long-running
– Minutes to hours
– Status and progress important to communicate
• State: Running, successful completion, failed
• Progress of map & reduce tasks
• Job counters
• User-defined status messages
• Task Progress: How is this defined?
– Input record read (Map or Reduce)
– Output record written
– Set status description on Reporter, increment Reporter counter, call to
Reporter progress() method
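Inside a map() method, the progress/status mechanisms listed above look roughly like this with the old-API Reporter; the counter enum, path-free logic, and messages are invented purely for illustration.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class ReportingMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  // Hypothetical counter group for this example.
  enum Records { SKIPPED, PROCESSED }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, LongWritable> out, Reporter reporter)
      throws IOException {
    if (value.getLength() == 0) {
      reporter.incrCounter(Records.SKIPPED, 1);     // job counter, visible in job status
      return;
    }
    reporter.setStatus("processing offset " + key); // user-defined status message
    reporter.progress();                            // "still alive": resets the task timeout
    reporter.incrCounter(Records.PROCESSED, 1);
    out.collect(value, new LongWritable(key.get()));
  }
}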
Job Completion

• JobTracker receives last task’s completion notification


– Changes job status to “successful”
– Can send HTTP job notification message to JobClient
• JobClient learns of success at its next polling interval to JobTracker
– Sends message to user
• JobTracker performs cleanup, commands TaskTrackers to cleanup

Example Job & Filesystem Interaction
Multiple Map tasks, one Reduce task
Map reads from HDFS, writes to local FS
Reduce reads from local FS, writes to HDFS

Example #2: Job & Filesystem Interaction
Multiple Map tasks, Two Reduce tasks
Each Reduce writes one partition per Reduce task to HDFS

Sorting Map Output
• Default 100 MB memory buffer in which to sort a Map task's output
• Sort thread divides output data into partitions by the Reducer the
output will be sent to
• Within a partition, a thread sorts in-memory by key. Combiner function
uses this sorted output
• @ default 80% buffer full, starts to flush (“spill”) to the local file system
• In parallel, Map outputs continue to be written into the sort buffer while the spill writes sorted output to the local FS. The Map task blocks writing to the sort buffer only if the buffer fills completely before a spill finishes
• New spill file created every time buffer hits 80% full
• Multiple spill files are merged into one partitioned, sorted output
• Output is typically compressed for write to disk (good tradeoff given
CPU speed vs. disk speed)
• Map output file is now on local disk of the TaskTracker that ran the Map task
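In the Hadoop 1.x era these buffer, spill, and compression behaviors were tunable per job; a sketch using the 1.x property names, with example values only (it would normally be part of a full driver such as the word-count one above).

import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.JobConf;

public class MapSideTuning {
  public static void main(String[] args) {
    JobConf conf = new JobConf(MapSideTuning.class);

    // Size of the in-memory sort buffer for Map output (default 100 MB).
    conf.setInt("io.sort.mb", 200);
    // Fraction of the buffer at which the background spill starts (default 0.80).
    conf.setFloat("io.sort.spill.percent", 0.80f);

    // Compress the sorted, partitioned Map output written to local disk
    // (CPU is usually cheaper than disk and network here).
    conf.setCompressMapOutput(true);
    conf.setMapOutputCompressorClass(GzipCodec.class);
  }
}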
MapReduce Paradigm

• MapReduce is not always the best algorithm


– One simple functional programming operation applied in
parallel to Big Data
– Not amenable to maintaining state (remembering output from
a previous MapReduce job)
• Can pipeline MR jobs in serial
• HBase

• Apache Hive!
• Data warehouse system that projects structure onto data stored in HDFS
• SQL-like query language, HiveQL
• Also supports traditional map/reduce programs
• Popular with data researchers
– Bitly; LinkedIn to name two
• Originated & developed at Facebook

• Apache Pig!
• High-level language for data analysis
• Program structure “…amenable to substantial
parallelization”
• “Complex tasks comprised of multiple interrelated
data transformations are explicitly encoded as data
flow sequences, making them easy to write,
understand, and maintain.”
• Popular with data researchers who are also skilled in
Python as their analysis toolset and “glue” language
• Originated & developed at Yahoo!
• Apache HBase!
– Realtime, Random R/W access to Big Data
– Column-oriented
• Modeled after Google’s Bigtable for structured data
– Massive distributed database layered onto clusters of
commodity hardware
• Goal: Very large tables, billions of rows X millions of columns
– HDFS underlies HBase
• Originally developed at Powerset; acquired by Microsoft

• Apache Zookeeper!
• Goal: Provide highly reliable distributed coordination
services via a set of primitives for distributed apps to build
on
– “Because Coordinating Distributed Systems is a Zoo” (!)
• Common distributed services notoriously difficult to develop & maintain
on an application by application basis
• Centralized distributed synchronization; group services;
naming; configuration information
• High performance, highly available, ordered access
– In memory. Replicated. Synchronization at client by API
• Originated & developed at Yahoo!
• Apache Mahout!
• Machine Learning Library
• Based on Hadoop HDFS & Map/Reduce
– Though not restricted to Hadoop implementations
• “Scalable to ‘reasonably large’ data sets”
• Four primary ML use cases, currently
– Clustering; Classification; Frequent Itemset Mining;
Recommendation Mining

• Apache Chukwa! Originated & developed @ Yahoo!
• Log analysis framework built on HDFS & Map/Reduce and
HBase
• Data collection system to monitor large distributed systems
• Toolkit for displaying & analyzing collected data
• A competitor to Splunk?

• Apache Avro! Originated @ Yahoo!, co-developer
Cloudera
• Data serialization system
– Serialization: “…the process of converting a data structure or object state
into a format that can be stored” in a file or transmitted over a network link.
– Serializing an object is also referred to as “marshaling” an object
• Supports JSON as the data interchange format
– JavaScript Object Notation, a self-describing data format
– Experimental support for Avro IDL, an alternate Interface Description Lang.
• Includes Remote Procedure Call (RPC) in the framework
– Communication between Hadoop nodes and from clients to services
• Hadoop (Doug Cutting) emphasizes moving away from text APIs to using Avro APIs for Map/Reduce, etc.
• Apache Whirr!
• Libraries for running cloud services
– Cloud-neutral, avoids provider idiosyncrasies
• Command line tool for deploying large clusters
• Common service API
• Smart defaults for services, to get a properly configured system running quickly while still letting you specify the desired attributes

• Apache Sqoop!
• “Sqoop is a tool designed for efficiently transferring
bulk data between Apache Hadoop and structured
datastores such as relational databases”
• Project status:

• Apache Bigtop!
• Bigtop is to Hadoop as Fedora is to Red Hat Enterprise Linux
– Packaging & interoperability testing focuses on the system as a
whole vs. individual projects

HCatalog; MRUnit; Oozie

• HCatalog
– Table and storage management service for data created using
Apache Hadoop
• Table abstraction with regard to where or how the data is stored
• To provide interoperability across data processing tools including Pig; Map/Reduce;
Streaming; Hive

• MRUnit
– Library to support unit testing of Hadoop Map/Reduce jobs
• Oozie
– Workflow system to manage Hadoop jobs by time or data arrival

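For the MRUnit bullet, a rough in-memory unit test of the word-count Mapper sketched earlier; it follows MRUnit's MapDriver pattern, though exact package and factory names vary by MRUnit version.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.MapDriver;
import org.junit.Test;

public class WordCountMapperTest {
  @Test
  public void emitsOneCountPerToken() throws Exception {
    // Drives the old-API WordCountMapper entirely in memory:
    // no cluster, no HDFS, no JobTracker involved.
    MapDriver<LongWritable, Text, Text, IntWritable> driver =
        MapDriver.newMapDriver(new WordCountMapper());

    driver.withInput(new LongWritable(0), new Text("hadoop hdfs hadoop"))
          .withOutput(new Text("hadoop"), new IntWritable(1))
          .withOutput(new Text("hdfs"), new IntWritable(1))
          .withOutput(new Text("hadoop"), new IntWritable(1))
          .runTest();
  }
}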
Small sample of awareness being raised

• Big Data
– Science, one of the two top peer-reviewed science journals in the world

• 11 Feb 2011 Special Issue: Dealing with Data

– The Economist
• Hadoop-specific
– Yahoo! videos on YouTube
– Technology pundits in many mainline publications & blogs
– Serious blogs by researchers & Ph.D. candidates across data science; computer science;
statistics & other fields

