Hadoop and Pig Overview - Hands-On
October 2011

Outline of Tutorial
Hadoop Ecosystem
Tools on top of Hadoop
Programming in Hadoop
Building blocks, Streaming, C-HDFS API
Timeline
Nutch open source search project (2002-2004)
MapReduce & DFS implementation; Hadoop splits out of Nutch (2004-2006)
Google MapReduce (OSDI 2004)
computation expressed as maps and reduces over key-value pairs
Google File System (GFS) design
single master with multiple chunkservers per master
files represented as fixed-size chunks, 3-way mirrored across chunkservers
Hadoop
Open source, reliable, scalable distributed computing platform
implementation of MapReduce
Hadoop Distributed File System (HDFS)
runs on commodity hardware
Fault Tolerance
restarting failed tasks
data replication
Speculative execution
handles stragglers by launching redundant copies of slow tasks
HDFS Architecture
Hadoop Stack
Higher-level tools: Pig, Chukwa, Hive
MapReduce Core
HDFS
Avro (serialization, used across the stack)
Constantly evolving!
Google vs. Hadoop
Google         Hadoop
MapReduce      Hadoop MapReduce
GFS            HDFS
Sawzall        Pig, Hive
BigTable       HBase
Chubby         ZooKeeper
Pregel         Hama, Giraph
Pig
Platform for analyzing large data sets
Data-flow oriented language, Pig Latin
data transformation functions
datatypes include sets, associative arrays, tuples
high-level language for marshalling data
Developed at Yahoo!
Hive
Supports SELECT, JOIN, GROUP BY, etc.
Analyzing very large data sets
log processing, text mining, document indexing
Developed at Facebook
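For instance, a Hive query (the table and column names here are hypothetical, not from the tutorial) reads like ordinary SQL:

    SELECT model, COUNT(*) FROM hits GROUP BY model;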
HBase
Holds extremely large datasets (multi-TB)
High-speed lookup of individual (row, column) values
Magellan and Hadoop
DOE-funded project to determine the appropriate role of cloud computing for DOE/SC midrange workloads
Co-located at the Argonne Leadership Computing Facility (ALCF) and the National Energy Research Scientific Computing Center (NERSC)
Hadoop/Magellan research questions
Are the new cloud programming models useful for scientific computing?
Data Intensive Science
Evaluating hardware and software choices for supporting next-generation data problems
Evaluation of Hadoop
using a mix of synthetic benchmarks and scientific applications
understanding application characteristics that can leverage the model
data operations: filter, merge, reorganization
compute-to-data ratio
(collaboration with Shane Canon, Nick Wright, Zacharia Fadika)
Hadoop Streaming
allows users to plug in any binary as the mapper and reducer
input arrives on standard input
BioPig
Analytics toolkit for next-generation sequence data
User-defined functions (UDFs) for common bioinformatics programs
BLAST, Velvet
readers and writers for FASTA and FASTQ
pack/unpack for space conservation with DNA sequences
Bring-your-application Hadoop workshop
When: TBD
Send us email if you are interested:
[email protected] [email protected]
[Figure: Teragen (1 TB) benchmark - time in minutes for HDFS vs. GPFS, with linear and exponential trend lines]
[Figure: Load balancing - speedup of Hadoop, Twister, and LEMO-MR with node1 under stress vs. node2 and node3]
Programming Hadoop
Programming with Hadoop
Map and reduce as Java programs using the Hadoop API (see the sketch below)
Pipes and Streaming can help with existing applications in other languages
C HDFS API (libhdfs)
Higher-level languages such as Pig might help with some applications
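As a minimal sketch of the Java option (the class names are illustrative, not from the tutorial), a word-count job's map and reduce written against the Hadoop Java API might look like this:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {
      // Mapper: input key = byte offset of the line, input value = the line text.
      public static class TokenMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          for (String token : value.toString().split("\\s+")) {
            if (token.isEmpty()) continue;
            word.set(token);
            context.write(word, ONE);  // emit (word, 1) for each token
          }
        }
      }

      // Reducer: receives (word, [1, 1, ...]) and emits (word, total count).
      public static class SumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
            sum += v.get();
          }
          context.write(key, new IntWritable(sum));
        }
      }
    }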
Data Flow
Mechanics [1/2]
Input files
large: 10s of GB or more, typically in HDFS
line-based, binary, multi-line, etc.
InputFormat
defines how input files are split up and read
TextInputFormat (default), KeyValueInputFormat, SequenceFileInputFormat
InputSplits
unit of work that comprises a single map task
FileInputFormat divides files into 64 MB chunks by default
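To make the InputFormat choice concrete, here is a hedged driver sketch (paths come from the command line; the mapper and reducer classes reuse the hypothetical WordCount sketch above) that wires TextInputFormat into a job:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "wordcount");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenMapper.class);
        job.setReducerClass(WordCount.SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // TextInputFormat is the default; set explicitly here for illustration.
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input dir in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not yet exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }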
Combiner
performs a local reduce over map output on a single machine before it is shuffled to the reducers, cutting network traffic
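Continuing the hypothetical driver sketch above: because summing counts is associative and commutative, the same reducer class can serve as the combiner with one extra line in the job setup:

    // Runs SumReducer locally on each map node's output before the shuffle.
    job.setCombinerClass(WordCount.SumReducer.class);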
Pipes
Allows C++ code to be used for the Mapper and Reducer
Both key and value inputs to Pipes programs are provided as std::string
$ hadoop pipes
C-HDFS API
Limited C API to read from and write to HDFS
    #include <fcntl.h>   /* O_WRONLY, O_CREAT */
    #include <string.h>  /* strlen */
    #include "hdfs.h"

    int main(int argc, char **argv) {
        const char *writePath = "/tmp/testfile.txt";  /* example path */
        const char *buffer = "Hello, HDFS!";          /* example data */
        hdfsFS fs = hdfsConnect("default", 0);        /* connect to the default HDFS */
        hdfsFile writeFile = hdfsOpenFile(fs, writePath, O_WRONLY|O_CREAT, 0, 0, 0);
        /* bytes written, including the trailing NUL */
        tSize num_written_bytes = hdfsWrite(fs, writeFile, (void*)buffer, strlen(buffer)+1);
        hdfsCloseFile(fs, writeFile);
        hdfsDisconnect(fs);
        return 0;
    }
Hadoop Streaming
Generic API that allows programs in any language to be used as Hadoop Mapper and Reducer implementations
Inputs are written to stdin as strings, with a tab character separating key and value
Output to stdout as key \t value \n
$ hadoop jar contrib/streaming/hadoop-[version]-streaming.jar
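As an illustration (the input and output paths are hypothetical), the classic streaming invocation plugs ordinary Unix tools in as mapper and reducer:

    $ hadoop jar contrib/streaming/hadoop-[version]-streaming.jar \
        -input /user/me/input \
        -output /user/me/output \
        -mapper /bin/cat \
        -reducer /usr/bin/wc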
Debugging
Test core functionality separately
Use the JobTracker web interface
Run in Hadoop local mode
Run the job on a small data set on a single node
Hadoop can save files from failed tasks
Pig Basic Operations
LOAD - loads data into a relational form
FOREACH..GENERATE - adds or removes fields (columns)
GROUP - groups data on a field
JOIN - joins two relations
DUMP/STORE - dumps a query to the terminal or stores it in a file
There are others, but these will be used for the exercises today
Pig Example
Find the number of gene hits for each model in an hmmsearch (>100 GB of output, 3 billion lines)
bash# cat * | cut -f 2 | sort | uniq -c
> hits = LOAD '/data/bio/*' USING PigStorage() AS (id:chararray, model:chararray, value:float);
> amodels = FOREACH hits GENERATE model;
> models = GROUP amodels BY model;
> counts = FOREACH models GENERATE group, COUNT(amodels) AS count;
> STORE counts INTO 'tcounts' USING PigStorage();
Pig - LOAD
Example:
hits = LOAD 'load4/*' USING PigStorage() AS (id:chararray, model:chararray, value:float);
Pig has several built-in data types (chararray, float, integer)
PigStorage can parse standard line-oriented text files
Pig can be extended with custom load types written in Java
Pig doesn't read any data until triggered by a DUMP or STORE
Pig - Important Points
Nothing really happens until a DUMP or STORE is performed
Use FILTER and FOREACH early to remove unneeded columns or rows and reduce temporary output
Use the PARALLEL keyword on GROUP operations to run more reduce tasks (see the example below)
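For instance, the GROUP from the earlier example could be spread across more reducers (the count of 20 is an arbitrary illustration):

    > models = GROUP amodels BY model PARALLEL 20;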
Lavanya Ramakrishnan
[email protected]