B1 Instructions
Write a program in Java for a simple Word Count application that counts
the number of occurrences of each word in a given input set using the
Hadoop Map-Reduce framework on a local standalone setup.
Execute the following commands in the terminal:
java -version
To check the version of Java installed on your system, you can use the command
java -version.
su - hadoop
The command su - hadoop is used to switch the current user to the user named
"hadoop" in a Unix-based operating system. This is typically used in environments
where multiple users have access to a system, and "hadoop" might be a user
account associated with the Hadoop distributed computing framework.
When you run this command, you will be prompted to enter the password for the
"hadoop" user account. If the password is entered correctly, you will then be logged
in as the "hadoop" user, inheriting its environment settings and permissions.
cd hadoop
This changes the current directory to the Hadoop installation directory (here assumed to be ~/hadoop).
hadoop version
To check the version of Hadoop installed on your system, you can use the hadoop
version command. This command will display the installed version of Hadoop along
with some additional information.
nano data1.txt
This opens the nano editor to create a plain-text input file named data1.txt containing the words to be counted.
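The exact contents of data1.txt are up to you; any plain text will do. As one possible example, it might contain a few lines with repeated words:
hello world
hello hadoop
world of hadoop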
start-all.sh
The start-all.sh script is typically used in Hadoop environments to start all the
Hadoop daemons simultaneously.
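Once start-all.sh completes, a common way to verify that the daemons are running (assuming the JDK's jps tool is on your PATH) is:
jps
It should list Hadoop processes such as NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager.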
hdfs dfs -mkdir /test_wc
The command hdfs dfs -mkdir /test_wc is used to create a directory named "test_wc" in HDFS.
● hdfs: This command is used to interact with the Hadoop Distributed File System
(HDFS).
● dfs: This sub-command specifies that the operation is related to the Hadoop
Distributed File System.
● -mkdir: This option specifies that you want to create a directory.
● /test_wc: This is the path of the directory you want to create. The leading forward
slash (/) indicates that the directory should be created in the root directory of HDFS.
So, when you run hdfs dfs -mkdir /test_wc, it creates a directory named "test_wc" in the
root directory of HDFS.
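The MapReduce job run later reads its input from /test_wc/data1.txt in HDFS, so the local data1.txt created earlier still has to be copied into this directory. Assuming data1.txt is in the current working directory, one way to do this is:
hdfs dfs -put data1.txt /test_wc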
ifconfig
Copy the IP address shown by this command.
Open the following URL in Firefox: the IP address followed by :9870 (the NameNode web UI).
Go to Utilities.
Browse the file system.
Type /test_wc to verify that the directory was created.
nano WC_Mapper.java
package com.javatpoint;
/* This line specifies the package name where this Java class belongs. Packages are
used for organizing classes into namespaces.*/
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
/* These lines import necessary Java and Hadoop libraries that are required for the
functionalities used in this class. For instance, java.io.IOException is imported for
handling input/output exceptions, java.util.StringTokenizer is used to tokenize strings,
and the org.apache.hadoop.io packages contain classes for various data types used in
Hadoop MapReduce jobs. */
public class WC_Mapper extends MapReduceBase implements Mapper<LongWritable,Text,Text,IntWritable>{
/* This line declares a Java class named WC_Mapper. It extends MapReduceBase, which
is a Hadoop class used as a base class for MapReduce mapper and reducer classes.
It also implements the Mapper interface, specifying the input and output key-value
types for the mapper. Here, LongWritable represents the input key type (offset of a
line in the input file), Text represents the input value type (a line of text), Text
represents the output key type (a word), and IntWritable represents the output value
type (count of occurrences of the word). */
private final static IntWritable one = new IntWritable(1);
/* This line declares a constant variable named one of type IntWritable, initialized
with the value 1. It is used to represent the count contributed by each occurrence of a word. */
private Text word = new Text();
/* This line declares a variable named word of type Text. It is used to store each word
extracted from the input text during the mapping process. */
public void map(LongWritable key, Text value, OutputCollector<Text,IntWritable> output,
Reporter reporter) throws IOException{
/* This line defines the map method required by the Mapper interface. It takes four
parameters: key (representing the offset of a line in the input file), value
(representing a line of text), output (used to collect output key-value pairs), and
reporter (used for reporting progress and status). It throws IOException to handle
input/output exceptions. */
String line = value.toString();
/* This line converts the Text object value (representing a line of text) into a Java
string named line. */
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()){
word.set(tokenizer.nextToken());
output.collect(word, one);
}
/* These lines split the line into individual words using StringTokenizer and, for each
word, emit the pair (word, 1) through the output collector. */
}
}
nano WC_Reducer.java
package com.javatpoint;
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
public class WC_Reducer extends MapReduceBase implements
Reducer<Text,IntWritable,Text,IntWritable> {
public void reduce(Text key, Iterator<IntWritable> values,
OutputCollector<Text,IntWritable> output, Reporter reporter) throws IOException {
/* This line defines the reduce method, which is a part of the reducer class in a
Hadoop MapReduce program.
● Text key: This parameter represents the key for this particular invocation of
the reduce method. In a word count example like this, it represents a unique
word.
● Iterator<IntWritable> values: This parameter represents an iterator over
the list of values associated with the key. In this case, it iterates over the
counts of occurrences of the word represented by the key.
● OutputCollector<Text, IntWritable> output: This parameter is used to
collect the output key-value pairs produced by the reducer. The reducer
aggregates the values for each key (word) and emits the final key-value pairs.
● Reporter reporter: This parameter is used for reporting progress and status
of the reducer job to the Hadoop framework.
● throws IOException: This method may throw an IOException in case of
input/output errors. */
int sum=0;
while (values.hasNext()) {
sum+=values.next().get();
}
/* This loop iterates through all the values associated with the given key. For each
value, it adds the integer value retrieved by get() method of IntWritable to the sum
variable. This effectively calculates the total count of occurrences of the word
represented by the key. */
output.collect(key,new IntWritable(sum));
/* After summing up all the counts for the current word, this line emits the final
key-value pair. The key remains the same (representing the word), while the value is
the total count of occurrences (sum). This key-value pair is collected using the output
object, which is an instance of OutputCollector. This output will be passed on to the
Hadoop framework to be written to the output file. */
}
}
nano WC_Runner.java
package com.javatpoint;
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
public class WC_Runner {
/* This line declares a Java class named WC_Runner, which serves as the entry point
for running the MapReduce job. */
public static void main(String[] args) throws IOException{
/* This line defines the main method, which is the starting point of execution for the
Java program. It takes an array of strings args as input arguments and may throw an
IOException. */
JobConf conf = new JobConf(WC_Runner.class);
/* This line creates a new instance of JobConf, which is a configuration class for a
MapReduce job. The constructor takes the class of the job as an argument
(WC_Runner.class in this case). */
conf.setJobName("WordCount");
/* This line sets the name of the MapReduce job to "WordCount" using the setJobName
method of the JobConf object conf. */
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
/* These lines set the output key and value classes for the MapReduce job. In this
case, the output key is of type Text (representing words) and the output value is of
type IntWritable (representing counts). */
conf.setMapperClass(WC_Mapper.class);
conf.setCombinerClass(WC_Reducer.class);
conf.setReducerClass(WC_Reducer.class);
/* These lines specify the mapper, combiner, and reducer classes for the MapReduce
job. WC_Mapper.class is set as the mapper class, and WC_Reducer.class is set as both the
combiner and the reducer class. This indicates that the same reducer logic will be
applied as the combiner logic for intermediate aggregation. */
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
/* These lines set the input and output formats for the job. TextInputFormat reads the
input file line by line, producing (offset, line) pairs for the mapper, and
TextOutputFormat writes each output key-value pair as a line of plain text. */
FileInputFormat.setInputPaths(conf,new Path(args[0]));
FileOutputFormat.setOutputPath(conf,new Path(args[1]));
/* These lines specify the input and output paths for the MapReduce job. The input
path is taken from the first argument (args[0]) provided in the command-line
arguments, and the output path is taken from the second argument (args[1]). Both
paths are converted to Path objects using the Path class constructor. */
JobClient.runJob(conf);
/* Finally, this line runs the MapReduce job with the configuration specified in the
conf object using the runJob method of JobClient. This initiates the execution of the
MapReduce job. */
}
}
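Before the job can be run, the three source files have to be compiled against the Hadoop libraries and packaged into the JAR referenced below (/home/hadoop/hadoop/wordcount.jar). One possible sequence, assuming the hadoop command is on the PATH and the .java files are in the current directory, is:
mkdir classes
javac -classpath "$(hadoop classpath)" -d classes WC_Mapper.java WC_Reducer.java WC_Runner.java
jar -cvf wordcount.jar -C classes .
The resulting wordcount.jar should then be placed at (or the command below adjusted to point to) the path used in the next step.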
hadoop jar /home/hadoop/hadoop/wordcount.jar com.javatpoint.WC_Runner /test_wc/data1.txt /r_output
/* This command executes the Hadoop MapReduce job defined in the JAR file
/home/hadoop/hadoop/wordcount.jar, using the com.javatpoint.WC_Runner class as
the main entry point. It takes /test_wc/data1.txt as input and writes the output to
/r_output. */
hdfs dfs -cat /r_output/part-00000
/* When you run hdfs dfs -cat /r_output/part-00000, you'll see the contents of the
/r_output/part-00000 file displayed in the terminal. This file contains the
output of your MapReduce job, which, for a word count program, consists of
word-count pairs. */
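As an illustration only (assuming data1.txt held the sample lines suggested earlier: hello world / hello hadoop / world of hadoop), the output would contain one tab-separated word-count pair per line, sorted by word:
hadoop	2
hello	2
of	1
world	2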