Hadoop Intro
Hadoop: The High Level
• Apache top-level project
– “…develops open-source software for reliable, scalable, distributed computing.”
• Software library
– “…framework that allows for distributed processing of large data sets across
clusters of computers using a simple programming model…designed to scale up
from single servers to thousands of machines, each offering local computation and
storage…designed to detect and handle failures at the application layer…delivering
a highly-available service on top of a cluster of computers, each of which may be
prone to failures.”
• Hadoop Distributed File System (HDFS)
– “…primary storage system used by Hadoop applications. HDFS creates multiple
replicas of data blocks and distributes them on compute nodes throughout a
cluster to enable reliable, extremely rapid computations.”
• MapReduce
– “…a programming model and software framework for writing applications that
rapidly process vast amounts of data in parallel on large clusters of compute nodes.”
What’s Hadoop Used For?
Major Use Cases per Cloudera (2012)
• Data processing
– Building search indexes
– Log processing
– “Click Sessionization”
– Data processing pipelines
– Video & Image analysis
• Analytics
– Recommendation systems (Machine Learning)
– Batch reporting
• Real time applications (“home of the brave”)
• Data Warehousing
HDFS
– All blocks in a file are the same size, except the last block
– Large block size minimizes seek time
• Approaches disk spiral transfer rate ~100 MB/sec
• Disk read-ahead may further minimize track to track seek time
• High aggregate file read bandwidth: # disks/node × ~100 MB/sec
– Divides incoming files into blocks, storing them redundantly
across DataNodes in the cluster: Replication for fault-tolerance
• Replication Factor per-file
• Location of block replicas can change over time
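To make the block arithmetic concrete, here is a small sketch of how block count and raw storage follow from block size and replication factor; the 128 MB block size and replication factor of 3 are illustrative values, not figures quoted on these slides.

# Sketch: how a file is divided into fixed-size blocks and replicated.
# Block size (128 MB) and replication factor (3) are illustrative.
import math

BLOCK_SIZE = 128 * 1024 * 1024   # bytes per HDFS block (illustrative)
REPLICATION = 3                  # replicas per block (illustrative)

def block_layout(file_size_bytes):
    """Return (num_blocks, last_block_size, raw_bytes_stored)."""
    num_blocks = math.ceil(file_size_bytes / BLOCK_SIZE)
    last_block = file_size_bytes - (num_blocks - 1) * BLOCK_SIZE
    raw = file_size_bytes * REPLICATION
    return num_blocks, last_block, raw

# Example: a 1 GB file -> 8 blocks (all full-size here); with 3x replication
# the cluster stores ~3 GB of raw data for it.
print(block_layout(1024 * 1024 * 1024))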
NameNode
Back to the NameNode
Client Read & Write
Client Write (cont’d)
• Write Pipeline
– Data bytes pushed to the pipeline as a sequence of packets
• Client buffers bytes until a packet buffer fills (64KB default) or until file is
closed, then packet is pushed into the pipeline
– Client calculates & sends checksum with block
– Recall that data and metadata (incl. checksum) are stored separately on DataNode
– Asynchronous I/O
• Client does not wait for ACK to packet by DataNode, continues pushing
packets
• Limit to the # of outstanding packets = “packet window size” of client
– i.e., the queue size (N=XR)
– Data visible to new reader at file close or using hflush call
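A minimal sketch of the packet-oriented write path described above: the 64 KB packet size comes from the slide, while the CRC32 checksum, the window size of 8, and the send / wait_for_ack callables are illustrative assumptions, not the real HDFS client internals.

# Sketch only (not the real HDFS client): buffer a byte stream into 64 KB
# packets, attach a checksum to each packet, and cap the number of
# unacknowledged packets at a fixed "packet window size".
import zlib
from collections import deque

PACKET_SIZE = 64 * 1024   # 64 KB default packet buffer (from the slide)
WINDOW_SIZE = 8           # illustrative limit on outstanding packets

def packets(data: bytes):
    """Yield (payload, checksum) packets for a byte stream."""
    for off in range(0, len(data), PACKET_SIZE):
        payload = data[off:off + PACKET_SIZE]
        yield payload, zlib.crc32(payload)   # checksum travels with the data

def write(data: bytes, send, wait_for_ack):
    """Push packets without waiting for each ACK; block only when the window is full."""
    outstanding = deque()
    for pkt in packets(data):
        if len(outstanding) >= WINDOW_SIZE:      # window full: wait for the oldest ACK
            wait_for_ack(outstanding.popleft())
        send(pkt)
        outstanding.append(pkt)
    for pkt in outstanding:                      # drain remaining ACKs at file close
        wait_for_ack(pkt)

# e.g. 200 KB of data becomes 4 packets:
print(sum(1 for _ in packets(b"x" * 200 * 1024)))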
HDFS Key Points
Hadoop MapReduce
• Java (native)
• Streaming
– Any language supporting standard Unix I/O streams
• Read from stdin, write to stdout
• Redirection as usual
– Highly suited to text processing
• Line by line input
• Key-value pair output of Map program written as tab-delimited line to stdout
• MapReduce Framework sorts the Map task output
• Reduce function reads sorted lines from stdin
• Reduce finally writes to stdout
– Python; Perl; Ruby; etc. (see the word-count sketch after this list)
• Pipes
– The C++ interface for MapReduce
• Sockets interface (not streams, not the Java Native Interface, JNI)
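The classic word-count pair below sketches the Streaming contract: the mapper reads raw lines from stdin and writes tab-delimited key-value lines to stdout, the framework sorts them by key, and the reducer reads the sorted lines from stdin. The file names and the word-count task itself are only illustrative.

# mapper.py -- read raw text lines from stdin, emit "word \t 1" to stdout
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

# reducer.py -- read key-sorted "word \t count" lines from stdin, sum per word
import sys

current, total = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word != current:
        if current is not None:
            print(f"{current}\t{total}")
        current, total = word, 0
    total += int(count)
if current is not None:
    print(f"{current}\t{total}")

Outside Hadoop, the same contract can be simulated locally with a shell pipeline such as: cat input.txt | python3 mapper.py | sort | python3 reducer.py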
Hadoop MapReduce
• Master/Slave architecture
– One master Jobtracker in the cluster
– One slave Tasktracker per DataNode
• Jobtracker
– Schedules a job’s constituent tasks on slaves
– Monitors tasks, re-executes upon task failure
• Tasktracker executes tasks by direction of the Jobtracker
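A toy model of the master’s schedule / monitor / re-execute loop; this is not JobTracker code, and the attempt limit of 4 is only illustrative (the real per-task attempt limit is configurable per job).

# Toy model of the master's schedule-monitor-retry loop (not actual JobTracker code).
MAX_ATTEMPTS = 4   # illustrative; Hadoop's per-task attempt limit is configurable

def run_job(tasks, run_on_slave):
    """Run each task on a slave, re-executing it on failure up to MAX_ATTEMPTS times."""
    for task in tasks:
        for attempt in range(1, MAX_ATTEMPTS + 1):
            if run_on_slave(task):          # True = task reported success
                break
        else:
            raise RuntimeError(f"task {task} failed {MAX_ATTEMPTS} times; job fails")

# trivial usage: every task "succeeds" on the first attempt
run_job(["map-0", "map-1", "reduce-0"], run_on_slave=lambda task: True)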
Hadoop MapReduce Job Execution
Example Job & Filesystem Interaction
Multiple Map tasks, one Reduce task
Map reads from HDFS, writes to local FS
Reduce reads from local FS, writes to HDFS
Example Job & Filesystem Interaction
Multiple Map tasks, two Reduce tasks
Each Reduce task writes its own output partition to HDFS
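A minimal sketch of how Map output is routed to the two Reduce tasks. Hadoop’s default HashPartitioner hashes the key in Java; Python’s built-in hash() stands in for it here, so the function below is illustrative rather than the exact implementation.

# Sketch of hash partitioning: every Map task produces one partition per
# Reduce task, and all records with the same key land in the same partition.
def partition(key: str, num_reduces: int) -> int:
    # Hadoop's default HashPartitioner does (key.hashCode() & Integer.MAX_VALUE) % numReduces;
    # Python's hash() stands in for hashCode() in this sketch.
    return hash(key) % num_reduces

records = [("apple", 1), ("banana", 1), ("apple", 1)]
for key, value in records:
    print(key, "-> reduce task", partition(key, 2))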
Sorting Map Output
• Default 100 MB memory buffer in which to sort a Map task’s output
• Sort thread divides output data into partitions by the Reducer the
output will be sent to
• Within a partition, a thread sorts in-memory by key. Combiner function
uses this sorted output
• At the default threshold of 80% full, the buffer starts to flush (“spill”) to the
local file system
• In parallel, Map outputs continue to be written into the sort buffer while the
spill writes sorted output to the local FS; the Map blocks only if the buffer
fills before the spill completes
• A new spill file is created every time the buffer reaches the 80% threshold
• Multiple spill files are merged into one partitioned, sorted output
• Output is typically compressed for write to disk (good tradeoff given
CPU speed vs. disk speed)
• Map output file is now on local disk of the Tasktracker that ran the Map task
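A toy model of the buffer / spill / merge behaviour described above: a 10-record buffer stands in for the 100 MB sort buffer, the 80% threshold comes from the slide, and the in-memory merge stands in for Hadoop’s on-disk merge of spill files.

# Toy model of the Map-side sort buffer: records accumulate in memory,
# get spilled as sorted runs once the buffer passes the threshold, and
# the spill runs are merged at the end into one sorted, partitioned output.
import heapq

BUFFER_LIMIT = 10        # toy stand-in for the 100 MB sort buffer
SPILL_THRESHOLD = 0.8    # spill at 80% full (default per the slide)

def map_side_sort(records, num_reduces):
    buffer, spills = [], []
    for key, value in records:
        # tag each record with its destination partition (reducer)
        buffer.append((hash(key) % num_reduces, key, value))
        if len(buffer) >= BUFFER_LIMIT * SPILL_THRESHOLD:
            spills.append(sorted(buffer))   # one sorted "spill file"
            buffer = []
    if buffer:
        spills.append(sorted(buffer))
    # merge all spill runs into a single output sorted by (partition, key)
    return list(heapq.merge(*spills))

print(map_side_sort([("b", 1), ("a", 1), ("c", 1)] * 5, num_reduces=2))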
MapReduce Paradigm
• Apache Hive!
• Data warehouse system providing structure onto HDFS
• SQL-like query language, HiveQL
• Also supports traditional map/reduce programs
• Popular with data researchers
– Bitly; LinkedIn to name two
• Originated & developed at Facebook
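A taste of HiveQL, run through the Hive command-line client’s -e option (which executes a query string directly); the page_views table and its columns are hypothetical, and driving the CLI from Python is just one convenient way to issue a query.

# Run a HiveQL query through the Hive CLI ("hive -e ...").
# The page_views table and its columns are hypothetical examples.
import subprocess

query = """
SELECT url, COUNT(*) AS hits
FROM page_views
WHERE dt = '2012-06-01'
GROUP BY url
ORDER BY hits DESC
LIMIT 10
"""

subprocess.run(["hive", "-e", query], check=True)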
• Apache Pig!
• High-level language for data analysis
• Program structure “…amenable to substantial
parallelization”
• “Complex tasks comprised of multiple interrelated
data transformations are explicitly encoded as data
flow sequences, making them easy to write,
understand, and maintain.”
• Popular with data researchers who also use Python as their
analysis toolset and “glue” language
• Originated & developed at Yahoo!
• Apache HBase!
– Realtime, Random R/W access to Big Data
– Column-oriented
• Modeled after Google’s Bigtable for structured data
– Massive distributed database layered onto clusters of
commodity hardware
• Goal: Very large tables, billions of rows × millions of columns
– HDFS underlies HBase
• Originally developed at Powerset; acquired by Microsoft
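One way to exercise HBase’s random read/write model from Python is the third-party happybase library, which talks to HBase through its Thrift gateway; the host, table, and column names below are hypothetical.

# Random read/write against HBase via the (third-party) happybase library,
# which goes through HBase's Thrift gateway. Table/column names are examples.
import happybase

conn = happybase.Connection("localhost")          # Thrift server host
table = conn.table("webtable")                    # hypothetical table
table.put(b"com.example/index.html", {b"contents:html": b"<html>...</html>"})
row = table.row(b"com.example/index.html")        # random read by row key
print(row[b"contents:html"])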
• Apache Zookeeper!
• Goal: Provide highly reliable distributed coordination
services via a set of primitives for distributed apps to build
on
– “Because Coordinating Distributed Systems is a Zoo” (!)
• Common distributed services notoriously difficult to develop & maintain
on an application by application basis
• Centralized distributed synchronization; group services;
naming; configuration information
• High performance, highly available, ordered access
– In memory. Replicated. Synchronization at client by API
• Originated & developed at Yahoo!
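A sketch of the primitive style using the third-party kazoo Python client; the host, znode paths, and data below are illustrative.

# Coordination primitives via the (third-party) kazoo ZooKeeper client.
# Host, paths, and data are illustrative.
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Ephemeral node: disappears automatically if this client dies -> group membership
zk.create("/workers/worker-1", b"alive", ephemeral=True, makepath=True)

# Shared configuration: any client can read it; watches give change notification
if not zk.exists("/config/app"):
    zk.create("/config/app", b"replicas=3", makepath=True)
data, stat = zk.get("/config/app")
print(data)

zk.stop()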
• Apache Mahout!
• Machine Learning Library
• Based on Hadoop HDFS & Map/Reduce
– Though not restricted to Hadoop implementations
• “Scalable to ‘reasonably large’ data sets”
• Four primary ML use cases, currently
– Clustering; Classification; Itemset Frequency Mining;
Recommendation Mining
• Apache Chukwa! Originated & developed @ Yahoo!
• Log analysis framework built on HDFS & Map/Reduce and
HBase
• Data collection system to monitor large distributed systems
• Toolkit for displaying & analyzing collected data
• A competitor to Splunk?
• Apache Avro! Originated @ Yahoo!, co-developer
Cloudera
• Data serialization system
– Serialization: “…the process of converting a data structure or object state
into a format that can be stored” in a file or transmitted over a network link.
– Serializing an object is also referred to as “marshaling” an object
• Supports JSON as the data interchange format
– JavaScript Object Notation, a self-describing data format
– Experimental support for Avro IDL, an alternate Interface Description Lang.
• Includes Remote Procedure Call (RPC) in the framework
– Communication between Hadoop nodes and from clients to services
• Hadoop (Doug Cutting) emphasizes moving away from text APIs to using Avro APIs for Map/Reduce, etc.
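A short sketch using the Apache Avro Python package (avro on PyPI); the User schema, field names, and file name are examples, not anything defined on these slides.

# Serialize and read back records with the Apache Avro Python package.
import json
import avro.schema
from avro.datafile import DataFileWriter, DataFileReader
from avro.io import DatumWriter, DatumReader

# Example schema: a "User" record with two fields (illustrative).
schema = avro.schema.parse(json.dumps({
    "type": "record", "name": "User",
    "fields": [{"name": "name", "type": "string"},
               {"name": "clicks", "type": "int"}]
}))

writer = DataFileWriter(open("users.avro", "wb"), DatumWriter(), schema)
writer.append({"name": "alice", "clicks": 42})
writer.close()

reader = DataFileReader(open("users.avro", "rb"), DatumReader())
for user in reader:
    print(user)
reader.close()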
• Apache Whirr!
• Libraries for running cloud services
– Cloud-neutral, avoids provider idiosyncrasies
• Command line tool for deploying large clusters
• Common service API
• Smart defaults for services: quickly get a properly configured
system running, while still allowing the desired attributes to be
specified
• Apache Sqoop!
• “Sqoop is a tool designed for efficiently transferring
bulk data between Apache Hadoop and structured
datastores such as relational databases”
• Project status:
• Apache Bigtop!
• Bigtop is to Hadoop as Fedora is to Red Hat Enterprise Linux
– Packaging & interoperability testing focuses on the system as a
whole vs. individual projects
HCatalog; MRUnit; Oozie
• HCatalog
– Table and storage management service for data created using
Apache Hadoop
• Table abstraction that is independent of where or how the data is stored
• To provide interoperability across data processing tools including Pig; Map/Reduce;
Streaming; Hive
• MRUnit
– Library to support unit testing of Hadoop Map/Reduce jobs
• Oozie
– Workflow system to manage Hadoop jobs by time or data arrival
Small sample of awareness being raised
• Big Data
– Science, one of the two top peer-reviewed science journals in the world
– The Economist
• Hadoop-specific
– Yahoo! videos on YouTube
– Technology pundits in many mainline publications & blogs
– Serious blogs by researchers & Ph.D. candidates across data science; computer science;
statistics & other fields