Hadoop and Pig Overview - Hands-On
October 2011

Outline of Tutorial
Hadoop Ecosystem
Tools on top of Hadoop
Programming in Hadoop
Building blocks, Streaming, C-HDFS API
Timeline
Nutch open source search project (2002-2004)
MapReduce & DFS implementation; Hadoop splits out of Nutch (2004-2006)
Google MapReduce (OSDI 2004)
computation expressed as maps and reduces over key-value pairs
Google File System (GFS) design
single master with multiple chunkservers per master
files represented as fixed-size chunks, 3-way mirrored across chunkservers
Hadoop
Open source, reliable, scalable distributed computing platform
implementation of MapReduce
Hadoop Distributed File System (HDFS)
runs on commodity hardware
Fault Tolerance
restarting failed tasks
data replication
Speculative execution
handles stragglers by launching redundant copies of slow tasks
HDFS Architecture
Hadoop Stack
Higher-level tools: Pig, Chukwa, Hive
MapReduce Core
HDFS
Avro (serialization, used across the stack)
Constantly evolving!
Google vs. Hadoop
Google         Hadoop
MapReduce      Hadoop MapReduce
GFS            HDFS
Sawzall        Pig, Hive
BigTable       HBase
Chubby         ZooKeeper
Pregel         Hama, Giraph
Pig
Platform for analyzing large data sets
Data-flow oriented language, Pig Latin
data transformation functions
datatypes include sets, associative arrays, tuples
high-level language for marshalling data
Developed at Yahoo!
Hive
Supports SELECT, JOIN, GROUP BY, etc.
Analyzing very large data sets
log processing, text mining, document indexing
Developed at Facebook
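For instance, a Hive query (the table and column names here are hypothetical, not from the tutorial) reads like ordinary SQL:

    SELECT model, COUNT(*) FROM hits GROUP BY model;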
HBase
Holds extremely large datasets (multi-TB)
High-speed lookup of individual (row, column) values
Magellan and Hadoop
DOE-funded project to determine the appropriate role of cloud computing for DOE/SC midrange workloads
Co-located at the Argonne Leadership Computing Facility (ALCF) and the National Energy Research Scientific Computing Center (NERSC)
Hadoop/Magellan research questions
Are the new cloud programming models useful for scientific computing?
Data Intensive Science
Evaluating hardware and software choices for supporting next-generation data problems
Evaluation of Hadoop
using a mix of synthetic benchmarks and scientific applications
understanding application characteristics that can leverage the model
data operations: filter, merge, reorganization
compute-to-data ratio
(collaboration with Shane Canon, Nick Wright, Zacharia Fadika)
Hadoop Streaming
allows users to plug in any binary as the mapper and reducer
input arrives on standard input
BioPig
Analytics toolkit for next-generation sequence data
User-defined functions (UDFs) for common bioinformatics programs
BLAST, Velvet
readers and writers for FASTA and FASTQ
pack/unpack for space conservation with DNA sequences
Bring-your-application Hadoop workshop
When: TBD
Send us email if you are interested:
[email protected] [email protected]
[Figure: Teragen (1 TB) benchmark - time in minutes for HDFS vs. GPFS, with linear and exponential trend lines]
[Figure: Load balancing - speedup of Hadoop, Twister, and LEMO-MR with node1 under stress vs. node2 and node3]
Programming Hadoop
Programming with Hadoop
Map and reduce as Java programs using the Hadoop API (see the sketch below)
Pipes and Streaming can help with existing applications in other languages
C HDFS API (libhdfs)
Higher-level languages such as Pig might help with some applications
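As a minimal sketch of the Java option (the class names are illustrative, not from the tutorial), a word-count job's map and reduce written against the Hadoop Java API might look like this:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {
      // Mapper: input key = byte offset of the line, input value = the line text.
      public static class TokenMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          for (String token : value.toString().split("\\s+")) {
            if (token.isEmpty()) continue;
            word.set(token);
            context.write(word, ONE);  // emit (word, 1) for each token
          }
        }
      }

      // Reducer: receives (word, [1, 1, ...]) and emits (word, total count).
      public static class SumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
            sum += v.get();
          }
          context.write(key, new IntWritable(sum));
        }
      }
    }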
Data Flow
Mechanics [1/2]
Input files
large: 10s of GB or more, typically in HDFS
line-based, binary, multi-line, etc.
InputFormat
defines how input files are split up and read
TextInputFormat (default), KeyValueInputFormat, SequenceFileInputFormat
InputSplits
unit of work that comprises a single map task
FileInputFormat divides files into 64 MB chunks by default
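To make the InputFormat choice concrete, here is a hedged driver sketch (paths come from the command line; the mapper and reducer classes reuse the hypothetical WordCount sketch above) that wires TextInputFormat into a job:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "wordcount");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenMapper.class);
        job.setReducerClass(WordCount.SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // TextInputFormat is the default; set explicitly here for illustration.
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input dir in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not yet exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }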
Combiner
performs a local reduce over map output on a single machine before it is shuffled to the reducers, cutting network traffic
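Continuing the hypothetical driver sketch above: because summing counts is associative and commutative, the same reducer class can serve as the combiner with one extra line in the job setup:

    // Runs SumReducer locally on each map node's output before the shuffle.
    job.setCombinerClass(WordCount.SumReducer.class);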
Pipes
Allows C++ code to be used for the Mapper and Reducer
Both key and value inputs to Pipes programs are provided as std::string
$ hadoop pipes
C-HDFS API
Limited C API to read from and write to HDFS
    #include <fcntl.h>   /* O_WRONLY, O_CREAT */
    #include <string.h>  /* strlen */
    #include "hdfs.h"

    int main(int argc, char **argv) {
        const char *writePath = "/tmp/testfile.txt";  /* example path */
        const char *buffer = "Hello, HDFS!";          /* example data */
        hdfsFS fs = hdfsConnect("default", 0);        /* connect to the default HDFS */
        hdfsFile writeFile = hdfsOpenFile(fs, writePath, O_WRONLY|O_CREAT, 0, 0, 0);
        /* bytes written, including the trailing NUL */
        tSize num_written_bytes = hdfsWrite(fs, writeFile, (void*)buffer, strlen(buffer)+1);
        hdfsCloseFile(fs, writeFile);
        hdfsDisconnect(fs);
        return 0;
    }
Hadoop Streaming
Generic API that allows programs in any language to be used as Hadoop Mapper and Reducer implementations
Inputs are written to stdin as strings, with a tab character separating key and value
Output to stdout as key \t value \n
$ hadoop jar contrib/streaming/hadoop-[version]-streaming.jar
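As an illustration (the input and output paths are hypothetical), the classic streaming invocation plugs ordinary Unix tools in as mapper and reducer:

    $ hadoop jar contrib/streaming/hadoop-[version]-streaming.jar \
        -input /user/me/input \
        -output /user/me/output \
        -mapper /bin/cat \
        -reducer /usr/bin/wc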
Debugging
Test core functionality separately
Use the JobTracker web interface
Run in Hadoop local mode
Run the job on a small data set on a single node
Hadoop can save files from failed tasks
Pig Basic Operations
LOAD - loads data into a relational form
FOREACH..GENERATE - adds or removes fields (columns)
GROUP - groups data on a field
JOIN - joins two relations
DUMP/STORE - dumps a query to the terminal or stores it in a file
There are others, but these will be used for the exercises today
Pig Example
Find the number of gene hits for each model in an hmmsearch (>100 GB of output, 3 billion lines)
bash# cat * | cut -f 2 | sort | uniq -c
> hits = LOAD '/data/bio/*' USING PigStorage() AS (id:chararray, model:chararray, value:float);
> amodels = FOREACH hits GENERATE model;
> models = GROUP amodels BY model;
> counts = FOREACH models GENERATE group, COUNT(amodels) AS count;
> STORE counts INTO 'tcounts' USING PigStorage();
Pig - LOAD
Example:
hits = LOAD 'load4/*' USING PigStorage() AS (id:chararray, model:chararray, value:float);
Pig has several built-in data types (chararray, float, integer)
PigStorage can parse standard line-oriented text files
Pig can be extended with custom load types written in Java
Pig doesn't read any data until triggered by a DUMP or STORE
Pig - Important Points
Nothing really happens until a DUMP or STORE is performed
Use FILTER and FOREACH early to remove unneeded columns or rows and reduce temporary output
Use the PARALLEL keyword on GROUP operations to run more reduce tasks (see the example below)
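For instance, the GROUP from the earlier example could be spread across more reducers (the count of 20 is an arbitrary illustration):

    > models = GROUP amodels BY model PARALLEL 20;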
Lavanya Ramakrishnan
[email protected]