MapReduce - Notes

Map-Reduce

Map-Reduce is an execution model in the Hadoop framework. It divides processing into two separate phases:
1. Mapper
2. Reducer

Mapper
The Mapper takes its input from the raw input file in HDFS; the output of the Mapper is the intermediate result, which is shuffled. This output is sent to the Reducer, which aggregates the data and writes the result into
HDFS.

Shuffling and sorting the data


Rows with the same key are merged into a single row (duplicate keys are eliminated by grouping their values
together), and the rows are arranged in sorted order of the keys.

Example

In high-level terminology,

Mapper input is a collection of <key, value> pairs, and its output is also a collection of <key, value>
pairs. When an input file is submitted, the framework converts each row of the file into a <key, value>
pair.
Note: The byte offset of each line becomes the key and the entire line becomes the value (see the example below).
• Developer logic has to extract the required key and value from the input value.
• Developer logic in the Mapper separates out the <key, value> pair.
• Developer logic in the Reducer generates aggregates (sum(), avg(), min(), max(), count(), etc.).
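For instance (an illustration, assuming the default TextInputFormat and one-byte line terminators), a two-line input file

hadoop is a framework
hadoop is fast

is handed to the Mapper as the pairs

<0, "hadoop is a framework">
<22, "hadoop is fast">

where 0 and 22 are the byte offsets at which the two lines begin.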

Low Level Flow

Example
Reducer

NOTE: The Map-Reduce API is available for the following languages:


1. Java

2. Python
3. C++
4. Ruby
The following Map-Reduce programs are implemented in Java.
Structure of Map-Reduce program in Java:

Note: The custom Mapper class and Reducer class should be static.
In the Mapper class, developer logic has to be implemented in the map() function.
In the Reducer class, developer logic has to be implemented in the reduce() function.
Whenever the program is compiled we get 3 separate classes:
1. Main class (or) Driver class
2. Mapper class
3. Reducer class
The following are the common Map-Reduce classes used in every program:

1. Mapper
2. Reducer
3. Job
4. Text
5. IntWritable
6. LongWritable
7. Path
8. Configuration
9. FileInputFormat
10. FileOutputFormat
11. GenericOptionsParser

Note:
The first 10 classes above are available in hadoop-core.jar. To access these classes you must
configure hadoop-core.jar; the path for this jar file is /usr/lib/hadoop-0.20. The 11th class is
available in commons-cli-1.2.jar, which must also be configured; the path for this jar file is
/usr/lib/hadoop-0.20/lib.
Mapper: To define a custom Mapper class, the required package is org.apache.hadoop.mapreduce.Mapper. Here
Mapper is the class and the rest is the package. Whenever a Java class extends the Mapper class, the Java
class gets the Mapper functionality.

Example:
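A minimal sketch of such a custom Mapper class (the class and type names here are illustrative, not taken from the original notes):

public static class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable>
{
    public void map(LongWritable key, Text value, Context con)
        throws IOException, InterruptedException
    {
        // developer logic: extract the required key and value from 'value'
        // and write them out, e.g. con.write(new Text(...), new IntWritable(...));
    }
}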

Context
This class is part of the Map-Reduce API; it is defined as an inner class of both the Mapper and the Reducer classes.

Context of Mapper
Output written through the Mapper's Context is shuffled: duplicate keys are not kept separately in the key place; their values are grouped together under a single key.
Example
context.write(k1, v1)
context.write(k1, v2)
context.write(k2, v1)
context.write(k1, v1)

After shuffling, the Reducer receives:
<k1, <v1, v2, v1>>
<k2, <v1>>

Context of Reducer
It writes <key, value> pairs into HDFS. It does not check whether a key is duplicated or not.

Reducer Class:
To define a custom Reducer class, the required package is org.apache.hadoop.mapreduce.Reducer.
When a Java class extends the Reducer class, the class gets the Reducer functionality.

Example:
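A minimal sketch of such a custom Reducer class (illustrative names, matching the Mapper sketch above):

public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable>
{
    public void reduce(Text key, Iterable<IntWritable> values, Context con)
        throws IOException, InterruptedException
    {
        // developer logic: aggregate 'values' (sum, avg, min, max, count, ...)
        // and write the result, e.g. con.write(key, new IntWritable(total));
    }
}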

Note: Input key type of Reducer is equal to output key type of Mapper. Input value type of Reducer is equal
to output value type of Mapper.

Map-Reduce Data Types


1. Text

2. IntWritable
3. LongWritable
Java data types are native to the operating system/JVM, while Map-Reduce types are compatible with
HDFS. When writing results into HDFS through the Context, we should use Map-Reduce types.
The typical flow for processing an input HDFS file is:

1. Map-Reduce types are converted into Java types.
2. Java logic is applied (to process the data).
3. The results (Java types) are converted back into Map-Reduce types.
4. The results are written through the Context.

Text :
It is equivalent to Java “String” type. Package is org.apache.hadoop.io.Text

IntWritable
It is equivalent to Java “int”. Package is org.apache.hadoop.io.IntWritable
LongWritable
It is equivalent to Java “long”. Package is org.apache.hadoop.io.LongWritable

Converting LongWritable to long:
    LongWritable x = new LongWritable(7500000000L);
    long y = x.get();

Converting long to LongWritable:
    long a = 750000000;
    LongWritable l = new LongWritable();
    l.set(a);
    (or)
    long x = 7800000000L;
    LongWritable l = new LongWritable(x);
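The other two types convert the same way (a brief sketch; these snippets are not in the original notes):

Converting Text to String and back:
    Text t = new Text("hadoop");
    String s = t.toString();
    Text t2 = new Text(s);

Converting IntWritable to int and back:
    IntWritable w = new IntWritable(45);
    int n = w.get();
    IntWritable w2 = new IntWritable(n);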

Example
public static class MyMap extends Mapper<LongWritable, Text, Text, IntWritable>
{
    public void map(LongWritable k, Text v, Context con) throws IOException, InterruptedException
    {
        String line = v.toString();
        String y = line.substring(5, 9);                        // year
        int t = Integer.parseInt(line.substring(12, 14));       // temperature
        con.write(new Text(y), new IntWritable(t));
    }
}

public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable>
{
    public void reduce(Text y, Iterable<IntWritable> vals, Context con)
        throws IOException, InterruptedException
    {
        int m = 0;
        for (IntWritable v : vals)
            m = Math.max(m, v.get());
        con.write(y, new IntWritable(m));    // written once per key, after the loop
    }
}
Configuration
Loads the default configuration parameters of HDFS and Map-Reduce.
Package org.apache.hadoop.conf.Configuration
Example Configuration con=new Configuration ();
GenericOptionsParser

To parse the command-line arguments passed to the hadoop jar command.


Example: $ hadoop jar Desktop/abc.jar abc.xyz file1 dirx
    Desktop/abc.jar → jar file path
    abc.xyz         → package & class name
    file1           → input file name
    dirx            → output directory name

Example: $ hadoop jar Desktop/abc.jar abc.xyz /user/Myself/file1.txt /user/urself
    /user/Myself/file1.txt → args[0]
    /user/urself           → args[1]
public static void main(String[] args)
{
    Configuration con = new Configuration();

    String[] files = new GenericOptionsParser(con, args).getRemainingArgs();
}

Explanation
GenericOptionsParser gop = new GenericOptionsParser(con, args);
String[] files = gop.getRemainingArgs();
files[0] = /user/Myself/file1.txt
files[1] = /user/urself

In the above explanation, files[0] and files[1] are plain path strings in operating-system form. The
Map-Reduce framework cannot use these strings directly; they must be converted into HDFS Path objects.

Path class
To convert an operating system file path into a path compatible with HDFS.

Package org.apache.hadoop.fs.Path
Path p1=new Path (files [0]);
Path p2=new Path (files [1]);
FileInputFormat class
To specify the input file path.
Package org.apache.hadoop.mapreduce.lib.input.FileInputFormat
Job j = new Job(con, "Myjob");
FileInputFormat.addInputPath(j, p1);

FileOutputFormat class
To specify the output file path.
Package org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
Job j = new Job(con, "Myjob");
FileOutputFormat.setOutputPath(j, p2);
Note: The FileInputFormat and FileOutputFormat classes contain static methods such as

1. addInputPath()
2. setOutputPath()

Note: Because these methods are static, they are called without creating an
instance, directly on the class name.

Job: To define a job.


Package org.apache.hadoop.mapreduce.Job
i.  Job j = new Job(con);
    j.setJobName("MyTest");
ii. Job j = new Job(con, "MyTest");

j.setJarByClass(MaxTemperature.class);        //Main class
j.setMapperClass(MyMap.class);                //Mapper
j.setReducerClass(MyReducer.class);           //Reducer
j.setCombinerClass(MyReducer.class);          //Combiner (the Reducer is reused)
j.setOutputKeyClass(Text.class);              //output key of Mapper

j.setOutputValueClass(IntWritable.class);     //output value of Mapper


Example:
AIM: Word count program using Map-Reduce on Mr/file.txt
Input:
    Mr       → HDFS directory
    file.txt → HDFS file

Hadoop is a framework

for big data storage

and processing

Hadoop is good for big data Analytics

package my.map.reduce;

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.LongWritable;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount


{
public static class MapForWordCount
    extends Mapper<LongWritable, Text, Text, IntWritable>
{
    public void map(LongWritable k, Text v, Context con)
        throws IOException, InterruptedException
    {
        String line = v.toString();
        StringTokenizer t = new StringTokenizer(line);
        while (t.hasMoreTokens())
        {
            String word = t.nextToken();
            con.write(new Text(word), new IntWritable(1));
        } //end of loop
    } //end of map()
} //end of Mapper
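// The Reducer class is not shown in these notes. The following is a minimal
// sketch (an assumption, not part of the original) that matches the driver
// below, which registers ReduceForWordCount as both combiner and reducer.
public static class ReduceForWordCount
    extends Reducer<Text, IntWritable, Text, IntWritable>
{
    public void reduce(Text word, Iterable<IntWritable> counts, Context con)
        throws IOException, InterruptedException
    {
        int sum = 0;
        for (IntWritable c : counts)
            sum += c.get();                 // add up the 1s emitted by the Mapper
        con.write(word, new IntWritable(sum));
    } //end of reduce()
} //end of Reducer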

public static void main( String [] args ) throws Exception


{
Configuration c=new Configuration ();
String [] files=new GenericOptionsParser(c, args).getRemainingArgs();
Path p1=new Path(files[0]);
Path p2=new Path(files[1]);

Job j = new Job(c, "wordcount");


j.setJarByClass(WordCount.class);
j.setMapperClass(MapForWordCount.class);
j.setCombinerClass(ReduceForWordCount.class);
j.setReducerClass(ReduceForWordCount.class);
j.setOutputKeyClass(Text.class);
j.setOutputValueClass(IntWritable.class);

FileInputFormat.addInputPath(j,p1);
FileOutputFormat.setOutputPath(j,p2);
System.exit(j.waitForCompletion(true)? 0 : 1);
} //end of main function
} //end of main class.
Output: 17 words in total (12 distinct words, each written out with its count).
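For the input above, the output file in the job's output directory (typically part-r-00000) should look roughly as follows, with keys in Text sort order (uppercase before lowercase):

Analytics   1
Hadoop      2
a           1
and         1
big         2
data        2
for         2
framework   1
good        1
is          2
processing  1
storage     1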
Working with delimited files using Map-Reduce
Aim:: Single Grouping with Single Aggregation

Input::

code:
package my.mr.analytics;
import java.io.IOException;

import java.util.StringTokenizer;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.GenericOptionsParser;
public class Emp

{
public static class MapForEmp
    extends Mapper<LongWritable, Text, Text, IntWritable>
{
    int sal;
    String sex;
public void map (LongWritable k, Text v, Context con)
throws IOException,InterruptedException
{

String line=v.toString();
StringTokenizer t = new StringTokenizer(line, ",");
int i=1;
while(t.hasMoreTokens())
{
String word=t.nextToken();

if(i==3)
sal=Integer.parseInt(word);
if(i==4)
sex=word;
i++;
}

if (sex.matches("f"))
    sex = "female";
else
    sex = "male";
con.write(new Text(sex), new IntWritable(sal));
}
}

// Reducer
public static class ReducerForEmp
    extends Reducer<Text, IntWritable, Text, IntWritable>
{
    public void reduce(Text sex, Iterable<IntWritable> salaries, Context con)
        throws IOException, InterruptedException
    {
        int tot = 0;
        for (IntWritable sal : salaries)
            tot += sal.get();
        con.write(sex, new IntWritable(tot));

} //end of reduce function


} //end of reduce class.
public static void main(String[] args) throws Exception
{
    -----------
    -----------   //job definition (same structure as in the WordCount driver above)
} //end of main
} //end of class

Output::

Output of Mapper (as grouped for the Reducer)        Output of Reducer

<male,   <2000, 4000, 6000, 7000>>                   <male, 19000>

<female, <3000, 5000, 9000, 7000>>                   <female, 24000>

How To Execute Map-Reduce Program


Step 1: Create a Java project
    File → New → Java Project
    (First time: File → New → Others → Java Project)
    Project Name → MyTest

Step 2: Create a package under src
    src → New → Package → Package name (my.mapreduce)

Step 3: Create a Java class
    Package → New → Class → Class name (staff.java)

Step 4: Supply the Java code and save.

Step 5: Create the JAR file
    Project name → Export → Java → JAR file

Note: Give the JAR file the same name as the project.

Step 6: Submit the Map-Reduce JAR

Execution step:
    $ hadoop jar <jar file path> <package & class name> <input file path> <output directory path>
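For instance, to run the WordCount example above (the jar name and output directory here are illustrative; the output directory must not already exist):

    $ hadoop jar Desktop/MyTest.jar my.map.reduce.WordCount Mr/file.txt Mr/out1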

Note: How to configure external JAR files


src (project) → Build Path → Configure Build Path → Libraries → Add External JARs →
1. commons-cli-1.2.jar (for GenericOptionsParser)
2. hadoop-core.jar (for the remaining packages)
Open Eclipse → Choose Project → write the class.
