MapReduce - Notes
MapReduce is an execution model in the Hadoop framework. It divides processing into two separate phases:
1. Mapper
2. Reducer
Mapper
The Mapper takes its input from the raw input file in HDFS, and the output of the Mapper is called shuffled data
(the intermediate result). This output is sent to the Reducer, which aggregates the data and writes the result into
HDFS.
The Mapper's input is a collection of <key, value> pairs, and its output is also a collection of
<key, value> pairs. When an input file is submitted, the framework converts each row of the file into a <key, value>
pair.
Note: The line's offset (its byte position in the file) becomes the key and the entire line becomes the value.
Developer logic has to extract the required key and value from the value part of this input pair.
Developer logic in the Mapper separates out the <key, value> pair.
Developer logic in the Reducer generates an aggregate (sum(), avg(), min(), max(), count(), etc.).
Example
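To make this flow concrete, here is a small hand-worked illustration; the two-line input file and the word-counting logic are hypothetical, chosen only to show how keys and values move through the phases.

Input file in HDFS:
hello world
hello hadoop

Framework gives the Mapper:    <0, "hello world">   <12, "hello hadoop">   (key = byte offset, value = line)
Mapper output:                 <hello, 1>  <world, 1>  <hello, 1>  <hadoop, 1>
Shuffled data (intermediate):  <hadoop, <1>>  <hello, <1, 1>>  <world, <1>>
Reducer output written to HDFS: <hadoop, 1>  <hello, 2>  <world, 1>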
Reducer
The Reducer receives the Mapper's shuffled output, aggregates it, and writes the final result into HDFS.
Map-Reduce programs can be written in several languages:
1. Java
2. Python
3. C++
4. Ruby
The following map-reduce programs are implemented in JAVA
Structure of Map-Reduce program in Java:
Note: The custom Mapper class and Reducer class should be static.
In the Mapper class, developer logic is implemented in the map() method.
In the Reducer class, developer logic is implemented in the reduce() method.
When the program is compiled we get 3 separate classes:
1. Main class (or) Driver class
2. Mapper class
3. Reducer class
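A bare skeleton of this structure is sketched below; the class names, type parameters, and package are illustrative, and the method bodies are left empty.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
public class MyJob                                    // 1. Main class (or) Driver class
{
    public static class MyMap                         // 2. Mapper class (static)
            extends Mapper<LongWritable, Text, Text, IntWritable>
    {
        public void map(LongWritable k, Text v, Context con)
                throws IOException, InterruptedException
        {
            // developer logic goes here
        }
    }
    public static class MyReducer                     // 3. Reducer class (static)
            extends Reducer<Text, IntWritable, Text, IntWritable>
    {
        public void reduce(Text k, Iterable<IntWritable> vals, Context con)
                throws IOException, InterruptedException
        {
            // developer logic goes here
        }
    }
    public static void main(String[] args) throws Exception
    {
        // job configuration and submission go here
    }
}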
The following are the common Map-Reduce classes used in every program:
1. Mapper
2. Reducer
3. Job
4. Text
5. IntWritable
6. LongWritable
7. Path
8. Configuration
9. FileInputFormat
10. FileOutputFormat
11. GenericOptionsParser
Note:
The first 10 classes above are available in hadoop-core.jar; to access them you must configure hadoop-core.jar
on the classpath. The path for this jar file is /usr/lib/hadoop-0.20. The 11th class is available in
commons-cli-1.2.jar, which must also be configured; the path for this jar file is /usr/lib/hadoop-0.20/lib.
Mapper: To define a custom Mapper class we need org.apache.hadoop.mapreduce.Mapper. Here Mapper is the class and
the remaining part is the package. Whenever a Java class extends the Mapper class, that class gets the Mapper
functionality.
Example:
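A minimal sketch of such a declaration, as it would appear inside the driver class (the class name and the four type parameters are illustrative):

// Type parameters: input key, input value, output key, output value
public static class MyMap extends Mapper<LongWritable, Text, Text, IntWritable>
{
    // map() is overridden here
}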
Context
This class is part of the Map-Reduce API; it is an inner class of both Mapper and Reducer.
Context of Mapper
It shuffles the data, meaning duplicate keys are not allowed in the key place; the values of a repeated key are
grouped together.
Example
context(k1, v1) ---------> <k1, v1>
context(k1, v2) ---------> <k1, <v1, v2>>   (the duplicate key k1 is grouped with both values)
context(k2, v1) ---------> <k2, v1>
Context of Reducer
It writes <key, value> pairs into HDFS. It does not check whether a key is duplicated or not.
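For instance, inside reduce() a write such as the following (the key and value here are illustrative)

con.write(new Text("male"), new IntWritable(50000));

appears in the job's HDFS output file (part-r-00000, with the default TextOutputFormat) as the tab-separated line:

male	50000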
Reducer Class:
To define a custom Reducer class we need org.apache.hadoop.mapreduce.Reducer. When a Java class extends the
Reducer class, that class gets the Reducer functionality.
Example:
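A minimal sketch of such a declaration, as it would appear inside the driver class (names and type parameters are illustrative; note that the input types match the Mapper's output types):

// Type parameters: input key, input value, output key, output value
public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable>
{
    // reduce() is overridden here
}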
Note: The input key type of the Reducer is equal to the output key type of the Mapper, and the input value type of
the Reducer is equal to the output value type of the Mapper.
The commonly used Map-Reduce data types are:
1. Text
2. IntWritable
3. LongWritable
Java data types are compatible with the operating system, whereas Map-Reduce types are compatible with
HDFS. While writing results into HDFS through the Context, we should use the Map-Reduce types.
Text :
It is equivalent to Java “String” type. Package is org.apache.hadoop.io.Text
IntWritable
It is equivalent to Java “int”. Package is org.apache.hadoop.io.IntWritable
LongWritable
It is equivalent to Java “long”. Package is org.apache.hadoop.io.LongWritable
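A small sketch of moving between Java types and these Map-Reduce types, as one might do inside map() or reduce() (the values are illustrative):

// Java -> Map-Reduce types (wrap the Java value)
Text t = new Text("hello");
IntWritable iw = new IntWritable(25);
LongWritable lw = new LongWritable(100L);
// Map-Reduce -> Java types (unwrap with toString()/get())
String s = t.toString();   // "hello"
int i = iw.get();          // 25
long l = lw.get();         // 100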
Example
public static class MyMap extends Mapper<LongWritable, Text, Text, IntWritable>
{
    public void map(LongWritable k, Text v, Context con) throws IOException, InterruptedException
    {
        String line = v.toString();
        String y = line.substring(5, 9);                    // key field extracted from the line
        int t = Integer.parseInt(line.substring(12, 14));   // value field extracted from the line
        con.write(new Text(y), new IntWritable(t));
    }
}
public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable>
{
    public void reduce(Text y, Iterable<IntWritable> vals, Context con)
            throws IOException, InterruptedException
    {
        int m = 0;
        for (IntWritable v : vals)
            m = Math.max(m, v.get());         // find the maximum value for this key
        con.write(y, new IntWritable(m));     // write once per key, after the loop
    }
}
Configuration
Used to load the default parameters of HDFS and Map-Reduce.
Package org.apache.hadoop.conf.Configuration
Example: Configuration con = new Configuration();
GenericOptionsParser
Explanation
GenericOptionsParser separates the generic Hadoop options from the remaining application arguments (here, the
input and output file paths).
GenericOptionsParser gop = new GenericOptionsParser(con, args);
String[] files = gop.getRemainingArgs();
files[0] = /user/Myself/file1.txt
files[1] = /user/urself
In the above explanation, files[0] and files[1] are plain operating-system style path strings; the framework
cannot use them directly, so we must convert them into HDFS Path objects.
Path class
Converts an operating-system file path into a Path compatible with HDFS.
Package org.apache.hadoop.fs.Path
Path p1 = new Path(files[0]);
Path p2 = new Path(files[1]);
FileInputFormat class
Used to specify the input file path for the job.
Package org.apache.hadoop.mapreduce.lib.input.FileInputFormat
Job j = new Job(con, "Myjob");
FileInputFormat.addInputPath(j, p1);
FileOutputFormat class
Used to specify the output file path for the job.
Package org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
FileOutputFormat.setOutputPath(j, p2);   // uses the same Job j created above
Note: The FileInputFormat and FileOutputFormat classes contain static methods such as
1. addInputPath()
2. setOutputPath()
These are static methods, so they are called directly on the class name without creating an instance.
j.setJarByClass(MaxTemperature.class);     //Main class
j.setMapperClass(MyMap.class);             //Mapper
j.setReducerClass(MyReducer.class);        //Reducer
j.setCombinerClass(MyReducer.class);       //Combiner (the Reducer class is reused)
j.setOutputKeyClass(Text.class);           //output key of Mapper
j.setOutputValueClass(IntWritable.class);  //output value of Mapper
Word Count example
Input file (sample):
Hadoop is a framework
and processing
package my.map.reduce;
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount   // class name not given in the notes
{
    public static class MyMap extends Mapper<LongWritable, Text, Text, IntWritable>
    {
        public void map(LongWritable k, Text v, Context con) throws IOException, InterruptedException
        {
            StringTokenizer t = new StringTokenizer(v.toString());
            while (t.hasMoreTokens())
            {
                String word = t.nextToken();
                con.write(new Text(word), new IntWritable(1));   // emit <word, 1> for every word
            } //end of loop
        } //end of map()
    } //end of Mapper
    public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable>
    {
        public void reduce(Text word, Iterable<IntWritable> vals, Context con) throws IOException, InterruptedException
        {
            int count = 0;
            for (IntWritable v : vals)
                count += v.get();                    // sum the 1s emitted for this word
            con.write(word, new IntWritable(count));
        } //end of reduce()
    } //end of Reducer
    public static void main(String[] args) throws Exception
    {
        Configuration con = new Configuration();
        GenericOptionsParser gop = new GenericOptionsParser(con, args);
        String[] files = gop.getRemainingArgs();
        Path p1 = new Path(files[0]);
        Path p2 = new Path(files[1]);
        Job j = new Job(con, "wordcount");
        j.setJarByClass(WordCount.class);
        j.setMapperClass(MyMap.class);
        j.setReducerClass(MyReducer.class);
        j.setOutputKeyClass(Text.class);
        j.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(j, p1);
        FileOutputFormat.setOutputPath(j, p2);
        System.exit(j.waitForCompletion(true) ? 0 : 1);
    } //end of main function
} //end of main class.
Output: 17 words
Working with delimited files using Map-Reduce
Aim: Single grouping with single aggregation
Input:
Code:
package my.mr.analytics;
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.GenericOptionsParser;
public class Emp
{
    // Mapper: extracts salary (field 3) and sex (field 4) from each comma-delimited row
    public static class MapForEmp
            extends Mapper<LongWritable, Text, Text, IntWritable>
    {
        int sal;
        String sex;
        public void map(LongWritable k, Text v, Context con)
                throws IOException, InterruptedException
        {
            String line = v.toString();
            StringTokenizer t = new StringTokenizer(line, ",");
            int i = 1;
            while (t.hasMoreTokens())
            {
                String word = t.nextToken();
                if (i == 3)
                    sal = Integer.parseInt(word);
                if (i == 4)
                    sex = word;
                i++;
            }
            if (sex.matches("f"))
                sex = "female";
            else
                sex = "male";
            con.write(new Text(sex), new IntWritable(sal));
        }
    }
    // Reducer: sums the salaries for each group (female / male)
    public static class ReducerForEmp
            extends Reducer<Text, IntWritable, Text, IntWritable>
    {
        public void reduce(Text sex, Iterable<IntWritable> salaries, Context con)
                throws IOException, InterruptedException
        {
            int tot = 0;
            for (IntWritable sal : salaries)
                tot += sal.get();
            con.write(sex, new IntWritable(tot));
        }
    }
    // Driver: configures and submits the job (same pattern as the WordCount driver above)
    public static void main(String[] args) throws Exception
    {
        Configuration con = new Configuration();
        GenericOptionsParser gop = new GenericOptionsParser(con, args);
        String[] files = gop.getRemainingArgs();
        Job j = new Job(con, "emp");
        j.setJarByClass(Emp.class);
        j.setMapperClass(MapForEmp.class);
        j.setReducerClass(ReducerForEmp.class);
        j.setOutputKeyClass(Text.class);
        j.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(j, new Path(files[0]));
        FileOutputFormat.setOutputPath(j, new Path(files[1]));
        System.exit(j.waitForCompletion(true) ? 0 : 1);
    }
}
Output:
First time
Step1
Create Java Project
File -> New -> Other -> Java Project
Project Name -> MyTestscr
Step2
Create Package under src
src -> New -> Package -> Package name (my.mapreduce)
Step3
Create Java class
Package -> New -> Class -> Class Name (staff.java)
Step4
Supply Java code and Save.
Step5
Create JAR file
Project name -> Export -> Java JAR file
Step6
Submit the Map-Reduce JAR
Execution step
$ hadoop jar <jar file path> <fully qualified main class name> <input file path> <output file path>
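For example (the jar path and the output directory below are hypothetical; the package, class, and input path follow the names used earlier in these notes):

$ hadoop jar /home/training/MyTestscr.jar my.mapreduce.staff /user/Myself/file1.txt /user/Myself/output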