Palak
OF TECHNOLOGY
There are three modes in which you can get experience with Hadoop.
Standalone mode:
In this mode you need an IDE such as Eclipse and the Hadoop library files (which you can
download from the Apache website). You can create your MapReduce program and run it
on your local machine. With some sample data you can check the logic of the code and
catch any syntax errors, but you will not get the full experience of Hadoop.
Pseudo-distributed mode:
In this mode all the Hadoop daemons run on a single machine. You can get a VM from
Cloudera or Hortonworks that is plug and play: it has all the necessary tools installed
and configured. In this mode you can scale up your data to check how your code performs
and optimize it to get the job done in the required time.
Fully-distributed mode:
In this mode the daemons run on different machines. This mode is mostly used in the
production stage of a project: once you have verified your code, you deploy it here.
You asked for an online service where you can practice your Hadoop code. To get started,
install Eclipse on your PC, download the libraries, and start coding.
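Basic HDFS commands:
hadoop fs -mkdir:
Creates a directory at the given path in HDFS.
Usage:
hadoop fs -mkdir <paths>
hadoop fs -put:
Copies a single src file, or multiple src files, from the local file system to the Hadoop
Distributed File System (HDFS).
Usage:
hadoop fs -put <localsrc> ... <HDFS_dest_Path>
hadoop fs -du:
Displays the size of the files and directories under the given path.
Example:
hadoop fs -du /user/saurzcode/dir1/abc.txt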
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount
{
    // Mapper: input (line offset, line), output (word, 1)
    public static class Map extends Mapper<LongWritable,Text,Text,IntWritable>
    {
        @Override
        public void map(LongWritable key, Text value, Context context) throws
        IOException, InterruptedException
        {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens())
            {
                // Emit each token with a count of 1
                value.set(tokenizer.nextToken());
                context.write(value, new IntWritable(1));
            }
        }
    }

    // Reducer: input (word, list of counts), output (word, total count)
    public static class Reduce extends Reducer<Text,IntWritable,Text,IntWritable>
    {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws
        IOException, InterruptedException
        {
            int sum = 0;
            for (IntWritable x : values)
            {
                sum += x.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception
    {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "My Word Count Program");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        Path outputPath = new Path(args[1]);
        // Configuring the input/output path from the filesystem into the job
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, outputPath);
        // Deleting the output path (recursively) from HDFS so that we don't have to delete it explicitly
        outputPath.getFileSystem(conf).delete(outputPath, true);
        // Exit with status 0 if the job completes successfully, 1 otherwise
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
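The entire MapReduce program can be fundamentally divided into three parts:
• Mapper Phase Code
• Reducer Phase Code
• Driver Code
Mapper Code:
public static class Map extends Mapper<LongWritable,Text,Text,IntWritable>
{
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException
    {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens())
        {
            value.set(tokenizer.nextToken());
            context.write(value, new IntWritable(1));
        }
    }
}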
• We have created a class Map that extends the class Mapper, which is already defined in the
MapReduce framework.
• We define the data types of the input and output key/value pairs after the class declaration
using angle brackets.
• Both the input and the output of the Mapper are key/value pairs.
• Input:
◦ The key is the offset of each line in the text file: LongWritable
◦ The value is each individual line (as shown in the figure at the right): Text
• Output:
◦ The key is the tokenized word: Text
◦ The value is hardcoded to 1 in our case: IntWritable
◦ Example – Dear 1, Bear 1, etc.
• We have written Java code that tokenizes each line into words and assigns each word a
hardcoded value equal to 1.
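The mapper behaviour described in the bullets above can be tried without a Hadoop installation. The following is a minimal plain-Java sketch, not the Hadoop API: the class name MapSketch and the method mapLine are hypothetical, and java.util's Map.Entry stands in for the (Text, IntWritable) pairs that the real mapper writes to the context.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Hypothetical stand-in for the Mapper: plain Java, no Hadoop classes involved.
class MapSketch {
    // Emits one (word, 1) pair per whitespace-separated token, in input order.
    // The offset parameter plays the role of the LongWritable key and is not used,
    // just as the word count mapper ignores its input key.
    static List<Map.Entry<String, Integer>> mapLine(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            out.add(new SimpleEntry<>(tokenizer.nextToken(), 1));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(mapLine(0L, "Dear Bear River")); // prints [Dear=1, Bear=1, River=1]
    }
}
```

Note that duplicates are not merged at this stage: a line containing the same word twice yields two separate pairs, and the summing is left to the reducer.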
Reducer Code:
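public static class Reduce extends Reducer<Text,IntWritable,Text,IntWritable>
{
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException
    {
        int sum = 0;
        for (IntWritable x : values)
        {
            sum += x.get();
        }
        context.write(key, new IntWritable(sum));
    }
}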
• We have created a class Reduce that extends the class Reducer, as we did for the Mapper.
• We define the data types of the input and output key/value pairs after the class declaration
using angle brackets, as done for the Mapper.
• Both the input and the output of the Reducer are key/value pairs.
• Input:
◦ The key is one of the unique words generated after the sorting and shuffling phase: Text
◦ The value is the list of counts collected for that key: IntWritable
• Output:
◦ The key is each of the unique words present in the input text file: Text
◦ The value is the number of occurrences of each of the unique words: IntWritable
◦ Example – Bear, 2; Car, 3, etc.
• We have aggregated the values in the list corresponding to each key and produced the
final answer.
• In general, the reduce method is called once for each of the unique words, but you can
specify the number of reducer tasks in mapred-site.xml.
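To make the reducer's input and output concrete, here is a minimal plain-Java sketch of the aggregation step. It is an illustration under stated assumptions, not Hadoop code: the class ReduceSketch and its methods are hypothetical, a TreeMap stands in for the sorted output of the shuffle phase, and partition only shows the hashing idea behind Hadoop's default HashPartitioner (real reducer indices depend on Text's hash code, not String's).

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the shuffle output and the Reducer: plain Java, no Hadoop classes.
class ReduceSketch {
    // Mirrors the body of reduce(): sums the list of counts collected for one key.
    static int reduce(String key, Iterable<Integer> values) {
        int sum = 0;
        for (int x : values) {
            sum += x;
        }
        return sum;
    }

    // Calls reduce() once per unique key; the TreeMap keeps keys in sorted order.
    static Map<String, Integer> reduceAll(Map<String, List<Integer>> grouped) {
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    // Chooses which of numReduceTasks reducers receives a key, by hashing the key.
    static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        grouped.put("Bear", Arrays.asList(1, 1));
        grouped.put("Car", Arrays.asList(1, 1, 1));
        System.out.println(reduceAll(grouped)); // prints {Bear=2, Car=3}
    }
}
```

With a single reducer task every key lands in partition 0; with more tasks, all pairs sharing a key still go to the same task, which is what allows each task to produce complete counts for its words.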
Driver Code:
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "My Word Count Program");
job.setJarByClass(WordCount.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
Path outputPath = new Path(args[1]);
//Configuring the input/output path from the filesystem into the job
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, outputPath);
• In the driver class, we set the configuration of our MapReduce job to run in Hadoop.
• We specify the name of the job and the data types of the input/output of the mapper and reducer.
• We also specify the names of the mapper and reducer classes.
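The driver's role of wiring the phases together can be sketched in plain Java for a quick logic check on sample data, for example inside Eclipse before moving to a cluster. This is a hypothetical in-memory stand-in, not the Hadoop Job API: the class LocalWordCountDriver and the method run are assumed names, and a list of strings replaces the input path.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical in-memory "driver": chains a map step, a grouping step, and a reduce step.
class LocalWordCountDriver {
    static Map<String, Integer> run(List<String> inputLines) {
        // Map phase plus shuffle: collect a list of 1s under each word, sorted by key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : inputLines) {
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                grouped.computeIfAbsent(tokenizer.nextToken(), k -> new ArrayList<>()).add(1);
            }
        }
        // Reduce phase: sum the list for each unique word.
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int x : e.getValue()) {
                sum += x;
            }
            counts.put(e.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("Dear Bear River", "Car Car River", "Deer Car Bear");
        System.out.println(run(lines)); // prints {Bear=2, Car=3, Dear=1, Deer=1, River=2}
    }
}
```

The sketch runs everything in one process and one thread, so it checks only the counting logic; it says nothing about how the job behaves or performs when distributed.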