
Unit – 3 HADOOP

1. What is Hadoop?

Hadoop is an open-source framework from Apache used to store, process and analyze data that is very huge in volume. Hadoop is written in Java and is not an OLAP (online analytical processing) system; it is used for batch/offline processing. It is used by Facebook, Yahoo, Google, Twitter, LinkedIn and many more. Moreover, it can be scaled up just by adding nodes to the cluster.

Modules of Hadoop
1. HDFS: Hadoop Distributed File System. Google published its GFS paper, and HDFS was developed on the basis of it. It states that files will be broken into blocks and stored on nodes across the distributed architecture.
2. YARN: Yet Another Resource Negotiator, used for job scheduling and managing the cluster.
3. MapReduce: A framework that helps Java programs perform parallel computation on data using key-value pairs. The Map task takes input data and converts it into a data set that can be computed as key-value pairs. The output of the Map task is consumed by the Reduce task, and the output of the reducer gives the desired result.
4. Hadoop Common: Java libraries that are used to start Hadoop and are used by the other Hadoop modules.

2. Hadoop Architecture

The Hadoop architecture is a package of the file system, the MapReduce engine and HDFS (Hadoop Distributed File System). The MapReduce engine can be MapReduce/MR1 or YARN/MR2.

A Hadoop cluster consists of a single master and multiple slave nodes. The master node includes the JobTracker, TaskTracker, NameNode, and DataNode, whereas the slave nodes include the DataNode and TaskTracker.
3. Hadoop Distributed File System

The Hadoop Distributed File System (HDFS) is a distributed file system for Hadoop. It has a master/slave architecture, consisting of a single NameNode that performs the role of master and multiple DataNodes that perform the role of slaves.

Both the NameNode and DataNode are capable of running on commodity machines. HDFS is developed in the Java language, so any machine that supports Java can easily run the NameNode and DataNode software.

NameNode

o It is the single master server in the HDFS cluster.
o As it is a single node, it may become the reason for a single point of failure.
o It manages the file system namespace by executing operations such as opening, renaming and closing files.
o It simplifies the architecture of the system.

DataNode

o The HDFS cluster contains multiple DataNodes.
o Each DataNode contains multiple data blocks.
o These data blocks are used to store data.
o It is the responsibility of a DataNode to serve read and write requests from the file system's clients.
o It performs block creation, deletion, and replication upon instruction from the NameNode (a short client-side sketch follows this list).
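To make the NameNode/DataNode split concrete, here is a minimal client-side sketch (not part of the original notes) that reads a file through the HDFS Java API. The NameNode address matches the core-site.xml used later in this unit, and the path /user/test/data.txt is only a hypothetical example; the client asks the NameNode for the block locations, while the bytes themselves stream from the DataNodes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadFromHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address (see core-site.xml in the installation section)
        conf.set("fs.default.name", "hdfs://hadoop-master:9000");
        FileSystem fs = FileSystem.get(conf);
        // The NameNode resolves this hypothetical path to block locations;
        // the block contents are then read from the DataNodes that hold them.
        FSDataInputStream in = fs.open(new Path("/user/test/data.txt"));
        try {
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
            fs.close();
        }
    }
}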

Job Tracker

o The role of the JobTracker is to accept MapReduce jobs from clients and process the data by using the NameNode.
o In response, the NameNode provides metadata to the JobTracker.

Task Tracker

o It works as a slave node for the JobTracker.
o It receives the task and code from the JobTracker and applies that code to the file. This process can also be called a Mapper.

MapReduce Layer

MapReduce comes into play when a client application submits a MapReduce job to the JobTracker. In response, the JobTracker sends the request to the appropriate TaskTrackers. Sometimes a TaskTracker fails or times out; in such a case, that part of the job is rescheduled.
4. Advantages of Hadoop
o Fast: In HDFS the data is distributed over the cluster and mapped, which helps in faster retrieval. Even the tools to process the data are often on the same servers, thus reducing the processing time. It is able to process terabytes of data in minutes and petabytes in hours.
o Scalable: A Hadoop cluster can be extended by just adding nodes to the cluster.
o Cost Effective: Hadoop is open source and uses commodity hardware to store data, so it is really cost effective compared to a traditional relational database management system.
o Resilient to failure: HDFS has the property that it can replicate data over the network, so if one node is down or some other network failure happens, then Hadoop takes the other copy of the data and uses it. Normally, data is replicated thrice, but the replication factor is configurable (a small sketch follows this list).
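As a small, hedged illustration of the configurable replication factor, the sketch below (not from the original notes) asks HDFS to keep three copies of a hypothetical file through the FileSystem API; the cluster-wide default is normally set with the dfs.replication property in hdfs-site.xml, as shown later in the installation section.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Ask the NameNode to keep three replicas of this hypothetical file;
        // the extra copies are created in the background.
        boolean accepted = fs.setReplication(new Path("/user/test/data.txt"), (short) 3);
        System.out.println("Replication change accepted: " + accepted);
        fs.close();
    }
}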

5. History of Hadoop

Hadoop was started by Doug Cutting and Mike Cafarella in 2002. Its origin was the Google File System paper, published by Google.

Let's focus on the history of Hadoop in the following steps:

o In 2002, Doug Cutting and Mike Cafarella started to work on a project, Apache Nutch. It is an open-source web crawler software project.
o While working on Apache Nutch, they were dealing with big data. Storing that data was very costly, which became a problem for the project. This problem became one of the important reasons for the emergence of Hadoop.
o In 2003, Google introduced a file system known as GFS (Google File System). It is a proprietary distributed file system developed to provide efficient access to data.
o In 2004, Google released a white paper on MapReduce. This technique simplifies data processing on large clusters.
o In 2005, Doug Cutting and Mike Cafarella introduced a new file system known as NDFS (Nutch Distributed File System). This file system also included MapReduce.
o In 2006, Doug Cutting joined Yahoo. On the basis of the Nutch project, he introduced a new project, Hadoop, with a file system known as HDFS (Hadoop Distributed File System). Hadoop's first version, 0.1.0, was released in this year.
o Doug Cutting named his project Hadoop after his son's toy elephant.
o In 2007, Yahoo was running two clusters of 1000 machines.
o In 2008, Hadoop became the fastest system to sort 1 terabyte of data, doing so on a 900-node cluster within 209 seconds.

6. Hadoop Installation

Environment required for Hadoop: The production environment of Hadoop is UNIX, but it can also be used on Windows using Cygwin. Java 1.6 or above is needed to run MapReduce programs. For a Hadoop installation from a tarball on a UNIX environment you need:

1. Java Installation
2. SSH installation
3. Hadoop Installation and File Configuration

1) Java Installation

Step 1. Type "java -version" at the prompt to find whether Java is installed or not. If not, download Java from http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html . The tar file jdk-7u71-linux-x64.tar.gz will be downloaded to your system.
Step 2. Extract the file using the below command

# tar zxf jdk-7u71-linux-x64.tar.gz

Step 3. To make Java available for all the users of UNIX, move the file to /usr/lib and set the path. At the prompt, switch to the root user and then type the command below to move the JDK to /usr/lib.

# mv jdk1.7.0_71 /usr/lib/

Now add the following commands to the ~/.bashrc file to set up the path:

export JAVA_HOME=/usr/lib/jdk1.7.0_71
export PATH=$PATH:$JAVA_HOME/bin

Now, you can check the installation by typing "java -version" in the prompt.

2) SSH Installation

SSH is used to interact with the master and slave computers without any prompt for a password. First of all, create a Hadoop user on the master and slave systems:

# useradd hadoop
# passwd hadoop

To map the nodes, open the hosts file present in the /etc/ folder on all the machines and put the IP addresses along with their host names.

# vi /etc/hosts

Enter the lines below:

190.12.1.114 hadoop-master
190.12.1.121 hadoop-slave-one
190.12.1.143 hadoop-slave-two
Set up an SSH key on every node so that they can communicate among themselves without a password. Commands for the same are:

# su hadoop
$ ssh-keygen -t rsa
$ ssh-copy-id -i ~/.ssh/id_rsa.pub tutorialspoint@hadoop-master
$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop_tp1@hadoop-slave-one
$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop_tp2@hadoop-slave-two
$ chmod 0600 ~/.ssh/authorized_keys
$ exit

3) Hadoop Installation

Hadoop can be downloaded from http://developer.yahoo.com/hadoop/tutorial/module3.html

Now extract Hadoop and copy it to a location:

$ mkdir /usr/hadoop
$ sudo tar vxzf hadoop-2.2.0.tar.gz -C /usr/hadoop

Change the ownership of the Hadoop folder:

$ sudo chown -R hadoop /usr/hadoop

Change the Hadoop configuration files:

All the files are present in /usr/hadoop/etc/hadoop.

1) In the hadoop-env.sh file add:

export JAVA_HOME=/usr/lib/jdk1.7.0_71

2) In core-site.xml add the following between the configuration tags:

<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://hadoop-master:9000</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
</configuration>

3) In hdfs-site.xml add the following between the configuration tags:

<configuration>
<property>
<name>dfs.data.dir</name>
<value>/usr/hadoop/dfs/name/data</value>
<final>true</final>
</property>
<property>
<name>dfs.name.dir</name>
<value>/usr/hadoop/dfs/name</value>
<final>true</final>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>

4) Open mapred-site.xml and make the change as shown below:

<configuration>
<property>
<name>mapred.job.tracker</name>
<value>hadoop-master:9001</value>
</property>
</configuration>

5) Finally, update your $HOME/.bashrc

cd $HOME
vi .bashrc

Append the following lines at the end, then save and exit:

#Hadoop variables
export JAVA_HOME=/usr/lib/jdk1.7.0_71
export HADOOP_INSTALL=/usr/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL

On the slave machines install Hadoop by copying it from the master using the commands below:

# su hadoop
$ cd /usr/hadoop
$ scp -r /usr/hadoop hadoop-slave-one:/usr/
$ scp -r /usr/hadoop hadoop-slave-two:/usr/

Configure the master node and slave nodes:

$ vi etc/hadoop/masters
hadoop-master

$ vi etc/hadoop/slaves
hadoop-slave-one
hadoop-slave-two

After this, format the name node and start all the daemons:

# su hadoop
$ cd /usr/hadoop
$ bin/hadoop namenode -format

$ cd $HADOOP_HOME/sbin
$ start-all.sh

The easiest option is to use Cloudera, as it comes with everything pre-installed; it can be downloaded from http://content.udacity-data.com/courses/ud617/Cloudera-Udacity-Training-VM-4.1.1.c.zip

7. What is HDFS

Hadoop comes with a distributed file system called HDFS. In HDFS, data is distributed over several machines and replicated to ensure its durability against failure and high availability to parallel applications.

It is cost effective as it uses commodity hardware. It involves the concepts of blocks, data nodes and the name node.

Where to use HDFS

o Very Large Files: Files should be of hundreds of megabytes, gigabytes or more.
o Streaming Data Access: The time to read the whole data set is more important than the latency in reading the first record. HDFS is built on a write-once, read-many-times pattern.
o Commodity Hardware: It works on low-cost hardware.
Where not to use HDFS
o Low Latency data access: Applications that require very low latency to access the first record should not use HDFS, as it gives importance to the whole data set rather than the time to fetch the first record.
o Lots Of Small Files: The name node holds the metadata of files in memory, and if the files are small in size a huge number of them takes up a lot of the name node's memory, which is not feasible (see the rough estimate after this list).
o Multiple Writes: It should not be used when we have to write multiple times.
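To see why lots of small files are a problem, a rough, commonly cited rule of thumb is that each file, directory and block costs on the order of 150 bytes of name node memory. Under that assumption, 10 million small files, each occupying one block, amount to roughly 20 million objects x 150 bytes, i.e. about 3 GB of name node heap spent on metadata alone, before any data is read; the same volume of data stored in a few large files would need only a tiny fraction of that.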

HDFS Concepts
1. Blocks: A block is the minimum amount of data that HDFS can read or write. HDFS blocks are 128 MB by default, and this is configurable. Files in HDFS are broken into block-sized chunks, which are stored as independent units. Unlike in an ordinary file system, if a file in HDFS is smaller than the block size, it does not occupy the full block's size; i.e. a 5 MB file stored in HDFS with a block size of 128 MB takes 5 MB of space only. The HDFS block size is large just to minimize the cost of seeks. (A sketch that prints a file's block size and locations follows this list.)
2. Name Node: HDFS works in a master-worker pattern where the name node acts as master. The name node is the controller and manager of HDFS, as it knows the status and the metadata of all the files in HDFS, the metadata information being file permissions, names and the location of each block. The metadata is small, so it is stored in the memory of the name node, allowing faster access to data. Moreover, the HDFS cluster is accessed by multiple clients concurrently, so all this information is handled by a single machine. The file system operations like opening, closing, renaming etc. are executed by it.
3. Data Node: Data nodes store and retrieve blocks when they are told to, by the client or the name node. They report back to the name node periodically with the list of blocks that they are storing. The data node, being commodity hardware, also does the work of block creation, deletion and replication as instructed by the name node.
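The following sketch (an illustration added to these notes, reusing the same hypothetical /user/test/data.txt) uses the FileSystem API to print a file's block size and the data nodes holding each block. The query is answered from the name node's in-memory metadata; no block data is transferred.

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/test/data.txt");   // hypothetical file
        FileStatus status = fs.getFileStatus(file);
        System.out.println("File length: " + status.getLen() + " bytes");
        System.out.println("Block size : " + status.getBlockSize() + " bytes");
        // One BlockLocation per block, listing the DataNodes that store it
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.println("Block at offset " + b.getOffset() + " (" + b.getLength()
                    + " bytes) on " + Arrays.toString(b.getHosts()));
        }
        fs.close();
    }
}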

Figures (not reproduced here): HDFS DataNode and NameNode architecture, the HDFS read path, and the HDFS write path.


Since all the metadata is stored in the name node, it is very important. If it fails, the file system cannot be used, as there would be no way of knowing how to reconstruct the files from the blocks present in the data nodes. To overcome this, the concept of the secondary name node arises.

Secondary Name Node: It is a separate physical machine which acts as a helper to the name node. It performs periodic checkpoints. It communicates with the name node and takes snapshots of the metadata, which helps minimize downtime and loss of data.

Starting HDFS

HDFS should be formatted initially and then started in distributed mode. Commands are given below.

To format: $ hadoop namenode -format

To start: $ start-dfs.sh

HDFS Basic File Operations


1. Putting data into HDFS from the local file system
o First create a folder in HDFS where data can be put from the local file system.
$ hadoop fs -mkdir /user/test
o Copy the file "data.txt" from the local folder /usr/home/Desktop to the HDFS folder /user/test
$ hadoop fs -copyFromLocal /usr/home/Desktop/data.txt /user/test
o Display the contents of the HDFS folder
$ hadoop fs -ls /user/test

2. Copying data from HDFS to the local file system
o $ hadoop fs -copyToLocal /user/test/data.txt /usr/bin/data_copy.txt
3. Compare the files and see that both are the same (a programmatic equivalent of these commands follows below)
o $ md5 /usr/bin/data_copy.txt /usr/home/Desktop/data.txt
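The same round trip can be done programmatically. The sketch below (added for illustration, reusing the paths from the commands above) is a minimal Java equivalent of -copyFromLocal and -copyToLocal.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyRoundTrip {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Equivalent of: hadoop fs -copyFromLocal /usr/home/Desktop/data.txt /user/test
        fs.copyFromLocalFile(new Path("/usr/home/Desktop/data.txt"), new Path("/user/test"));
        // Equivalent of: hadoop fs -copyToLocal /user/test/data.txt /usr/bin/data_copy.txt
        fs.copyToLocalFile(new Path("/user/test/data.txt"), new Path("/usr/bin/data_copy.txt"));
        fs.close();
    }
}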

Recursive deleting

o hadoop fs -rmr <arg>

Example:

o hadoop fs -rmr /user/sonoo/

HDFS Other commands

The following notation is used in the commands:

"<path>" means any file or directory name.
"<path>..." means one or more file or directory names.
"<file>" means any filename.
"<src>" and "<dest>" are path names in a directed operation.
"<localSrc>" and "<localDest>" are paths as above, but on the local file system.

o put <localSrc> <dest>

Copies the file or directory from the local file system identified by localSrc to dest within the DFS.

o copyFromLocal <localSrc> <dest>

Identical to -put


o moveFromLocal <localSrc> <dest>

Copies the file or directory from the local file system identified by localSrc to dest within
HDFS, and then deletes the local copy on success.

o get [-crc] <src> <localDest>

Copies the file or directory in HDFS identified by src to the local file system path
identified by localDest.

o cat <filename>

Displays the contents of filename on stdout.

o moveToLocal <src> <localDest>

Works like -get, but deletes the HDFS copy on success.

o setrep [-R] [-w] rep <path>

Sets the target replication factor for files identified by path to rep. (The actual replication factor will move toward the target over time.)

o touchz <path>

Creates a file at path containing the current time as a timestamp. Fails if a file already
exists at path, unless the file is already size 0.

o test -[ezd] <path>

Returns 1 if path exists, has zero length, or is a directory; 0 otherwise.

o stat [format] <path>

Prints information about path. Format is a string which accepts file size in blocks (%b),
filename (%n), block size (%o), replication (%r), and modification date (%y, %Y).

8. HDFS Features and Goals

The Hadoop Distributed File System (HDFS) is a distributed file system. It is a core part of Hadoop and is used for data storage. It is designed to run on commodity hardware.

Unlike other distributed file systems, HDFS is highly fault-tolerant and can be deployed on low-cost hardware. It can easily handle applications that contain large data sets.

Let's see some of the important features and goals of HDFS.

Features of HDFS
o Highly Scalable - HDFS is highly scalable as it can scale to hundreds of nodes in a single cluster.
o Replication - Due to some unfavorable conditions, the node containing the data may be lost. So, to overcome such problems, HDFS always maintains a copy of the data on a different machine.
o Fault tolerance - In HDFS, fault tolerance signifies the robustness of the system in the event of failure. HDFS is highly fault-tolerant in that if any machine fails, another machine containing a copy of that data automatically becomes active.
o Distributed data storage - This is one of the most important features of HDFS that makes Hadoop very powerful. Here, data is divided into multiple blocks and stored on nodes.
o Portable - HDFS is designed in such a way that it can easily be ported from one platform to another.

Goals of HDFS
o Handling hardware failure - HDFS contains multiple server machines. If any machine fails, the HDFS goal is to recover from it quickly.
o Streaming data access - Applications that run on HDFS require streaming access to their data sets; they are not the general-purpose applications that typically run on general-purpose file systems.
o Coherence model - Applications that run on HDFS are required to follow the write-once-read-many approach. So a file, once created, need not be changed. However, it can be appended and truncated.

9. What is YARN

Yet Another Resource Negotiator (YARN) takes Hadoop beyond Java-only MapReduce and makes it possible for other applications such as HBase and Spark to work on the cluster. Different YARN applications can co-exist on the same cluster, so MapReduce, HBase and Spark can all run at the same time, bringing great benefits for manageability and cluster utilization.

Components Of YARN
o Client: For submitting MapReduce jobs.
o Resource Manager: To manage the use of resources across the cluster.
o Node Manager: For launching and monitoring the compute containers on machines in the cluster.
o MapReduce Application Master: Coordinates the tasks running in the MapReduce job. The application master and the MapReduce tasks run in containers that are scheduled by the resource manager and managed by the node managers.

The JobTracker and TaskTracker were used in previous versions of Hadoop and were responsible for handling resources and tracking job progress. Hadoop 2.0 has the ResourceManager and NodeManager to overcome the shortfalls of the JobTracker and TaskTracker.

Benefits of YARN
o Scalability: MapReduce 1 hits a scalability bottleneck at 4,000 nodes and 40,000 tasks, but YARN is designed for 10,000 nodes and 100,000 tasks.
o Utilization: The Node Manager manages a pool of resources, rather than a fixed number of designated slots, thus increasing utilization.
o Multitenancy: Different versions of MapReduce can run on YARN, which makes the process of upgrading MapReduce more manageable.

10. MapReduce Tutorial

This MapReduce tutorial provides basic and advanced concepts of MapReduce and is designed for beginners and professionals alike.

It includes topics such as data flow in MapReduce, the MapReduce API, the word count example and the character count example.

What is MapReduce?

MapReduce is a data processing tool used to process data in parallel in a distributed form. It was developed in 2004, on the basis of the paper titled "MapReduce: Simplified Data Processing on Large Clusters," published by Google.

MapReduce is a paradigm which has two phases: the mapper phase and the reducer phase. In the Mapper, the input is given in the form of key-value pairs. The output of the Mapper is fed to the Reducer as input. The Reducer runs only after the Mapper is over. The Reducer too takes input in key-value format, and the output of the Reducer is the final output.

Steps in MapReduce

o The map takes data in the form of pairs and returns a list of <key, value> pairs. The keys will not be unique in this case.
o Using the output of the Map, sort and shuffle are applied by the Hadoop architecture. This sort and shuffle acts on the list of <key, value> pairs and sends out unique keys together with the list of values associated with each unique key, <key, list(values)>.
o The output of sort and shuffle is sent to the reducer phase. The reducer performs a defined function on the list of values for each unique key, and the final output <key, value> is stored/displayed.
Sort and Shuffle

Sort and shuffle occur on the output of the Mapper and before the Reducer. When the Mapper task is complete, the results are sorted by key, partitioned if there are multiple reducers, and then written to disk. Using the input from each Mapper <k2, v2>, we collect all the values for each unique key k2. This output of the shuffle phase, in the form of <k2, list(v2)>, is sent as input to the reducer phase.
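As a small worked illustration (added to these notes), suppose two mappers emit <the, 1>, <cat, 1>, <the, 1> and <the, 1>, <dog, 1>. After sort and shuffle the reducer receives <cat, [1]>, <dog, [1]> and <the, [1, 1, 1]>, and a reduce function that sums each list produces <cat, 1>, <dog, 1> and <the, 3>. The word count program in section 13 implements exactly this.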

Usage of MapReduce
o It can be used in various applications like document clustering, distributed sorting, and web link-graph reversal.
o It can be used for distributed pattern-based searching.
o We can also use MapReduce in machine learning.
o It was used by Google to regenerate Google's index of the World Wide Web.
o It can be used in multiple computing environments such as multi-cluster, multi-core, and mobile environments.

11. Data Flow In MapReduce

MapReduce is used to compute huge amounts of data. To handle the incoming data in a parallel and distributed form, the data has to flow through various phases.

Phases of MapReduce data flow

Input reader

The input reader reads the incoming data and splits it into data blocks of the appropriate size (64 MB to 128 MB). Each data block is associated with a Map function.

Once the input is read, the reader generates the corresponding key-value pairs. The input files reside in HDFS.

Note - The input data can be in any form.

Map function

The map function processes the incoming key-value pairs and generates the corresponding output key-value pairs. The map input and output types may be different from each other.
Partition function

The partition function assigns the output of each Map function to the appropriate reducer. It is given the key and value, together with the number of reducers, and returns the index of the reducer for that record.
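Hadoop's default partitioner, HashPartitioner, partitions by key hash. The sketch below (an added illustration written against the org.apache.hadoop.mapreduce API) shows the idea: every occurrence of the same key maps to the same reducer index.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class KeyHashPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Same key, same hash, same reducer; the mask keeps the index non-negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}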

Shuffling and Sorting

The data is shuffled between/within nodes so that it moves out of the map phase and gets ready to be processed by the reduce function. Sometimes, the shuffling of data can take much computation time.

The sorting operation is performed on the input data for the Reduce function. Here, the data is compared using a comparison function and arranged in sorted form.

Reduce function

The Reduce function is assigned to each unique key. These keys are already arranged in sorted order. The Reduce function iterates over the values associated with each key and generates the corresponding output.

Output writer

Once the data has flowed through all the above phases, the output writer executes. The role of the output writer is to write the Reduce output to stable storage.

12. MapReduce API

In this section, we focus on the MapReduce APIs and learn about the classes and methods used in MapReduce programming.

MapReduce Mapper Class

In MapReduce, the role of the Mapper class is to map the input key-value pairs to a set of intermediate key-value pairs. It transforms the input records into intermediate records. These intermediate records are associated with a given output key and are passed to the Reducer for the final output.

Methods of Mapper Class

void cleanup(Context context) - This method is called only once at the end of the task.

void map(KEYIN key, VALUEIN value, Context context) - This method is called once for each key-value pair in the input split.

void run(Context context) - This method can be overridden to control the execution of the Mapper.

void setup(Context context) - This method is called only once at the beginning of the task.
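A minimal Mapper using these methods might look like the sketch below. It is an added illustration written against the newer org.apache.hadoop.mapreduce API described in this section; the word count example in section 13 uses the older org.apache.hadoop.mapred API instead.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void setup(Context context) {
        // Called once at the beginning of the task; a place to read configuration.
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Called once per input record (here, one line of text).
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);   // emit an intermediate <word, 1> pair
        }
    }

    @Override
    protected void cleanup(Context context) {
        // Called once at the end of the task; a place to release resources.
    }
}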

MapReduce Reducer Class

In MapReduce, the role of the Reducer class is to reduce the set of intermediate values. Its
implementations can access the Configuration for the job via the JobContext.getConfiguration()
method.

Methods of Reducer Class

void cleanup(Context context) - This method is called only once at the end of the task.

void reduce(KEYIN key, Iterable<VALUEIN> values, Context context) - This method is called once for each key.

void run(Context context) - This method can be used to control the tasks of the Reducer.

void setup(Context context) - This method is called only once at the beginning of the task.
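A matching Reducer sketch (again an added illustration using the newer API) sums the values seen for each key:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Called once per unique key, with all of the values for that key.
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}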

MapReduce Job Class

The Job class is used to configure the job and submit it. It also controls the execution and queries the state. Once the job is submitted, the set methods throw IllegalStateException.

Methods of Job Class

Counters getCounters() - This method is used to get the counters for the job.

long getFinishTime() - This method is used to get the finish time for the job.

Job getInstance() - This method is used to generate a new Job without any cluster.

Job getInstance(Configuration conf) - This method is used to generate a new Job without any cluster, with the provided configuration.

Job getInstance(Configuration conf, String jobName) - This method is used to generate a new Job without any cluster, with the provided configuration and job name.

String getJobFile() - This method is used to get the path of the submitted job configuration.

String getJobName() - This method is used to get the user-specified job name.

JobPriority getPriority() - This method is used to get the scheduling priority of the job.

void setJarByClass(Class<?> c) - This method is used to set the job's jar by finding the jar that contains the given class.

void setJobName(String name) - This method is used to set the user-specified job name.

void setMapOutputKeyClass(Class<?> class) - This method is used to set the key class for the map output data.

void setMapOutputValueClass(Class<?> class) - This method is used to set the value class for the map output data.

void setMapperClass(Class<? extends Mapper> class) - This method is used to set the Mapper for the job.

void setNumReduceTasks(int tasks) - This method is used to set the number of reduce tasks for the job.

void setReducerClass(Class<? extends Reducer> class) - This method is used to set the Reducer for the job.
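Putting these methods together, a driver for the hypothetical TokenMapper/SumReducer sketches above could be configured as shown below. This is an added illustration using the newer Job API; the word count example in section 13 configures the same kind of job with the older JobConf/JobClient API.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");      // new Job with a job name
        job.setJarByClass(WordCountDriver.class);           // locate the jar from this class
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setNumReduceTasks(1);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input path argument
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output path argument
        // Submit the job and wait; the set methods above must be called before this point.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}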

13. MapReduce Word Count Example

In the MapReduce word count example, we find out the frequency of each word. Here, the role of the Mapper is to map the keys to the existing values, and the role of the Reducer is to aggregate the values of common keys. So, everything is represented in the form of key-value pairs.

Pre-requisites
o Java Installation - Check whether Java is installed or not using the following command.
java -version
o Hadoop Installation - Check whether Hadoop is installed or not using the following command.
hadoop version

If either of them is not installed on your system, follow the link below to install it.

www.javatpoint.com/hadoop-installation

Steps to execute the MapReduce word count example

o Create a text file on your local machine and write some text into it.
$ nano data.txt
o Check the text written in the data.txt file.
$ cat data.txt

In this example, we find out the frequency of each word that exists in this text file.

o Create a directory in HDFS where the text file will be kept.
$ hdfs dfs -mkdir /test
o Upload the data.txt file to HDFS in the specific directory.
$ hdfs dfs -put /home/codegyani/data.txt /test

o Write the MapReduce program using Eclipse.

File: WC_Mapper.java

package com.javatpoint;

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WC_Mapper extends MapReduceBase implements Mapper<LongWritable,Text,Text,IntWritable>{
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(LongWritable key, Text value, OutputCollector<Text,IntWritable> output,
            Reporter reporter) throws IOException{
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()){
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}

File: WC_Reducer.java

package com.javatpoint;

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WC_Reducer extends MapReduceBase implements Reducer<Text,IntWritable,Text,IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text,IntWritable> output,
            Reporter reporter) throws IOException {
        int sum=0;
        while (values.hasNext()) {
            sum+=values.next().get();
        }
        output.collect(key,new IntWritable(sum));
    }
}

File: WC_Runner.java

package com.javatpoint;

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WC_Runner {
    public static void main(String[] args) throws IOException{
        JobConf conf = new JobConf(WC_Runner.class);
        conf.setJobName("WordCount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(WC_Mapper.class);
        conf.setCombinerClass(WC_Reducer.class);
        conf.setReducerClass(WC_Reducer.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf,new Path(args[0]));
        FileOutputFormat.setOutputPath(conf,new Path(args[1]));
        JobClient.runJob(conf);
    }
}


o Create the jar file of this program and name it wordcountdemo.jar.
o Run the jar file:
hadoop jar /home/codegyani/wordcountdemo.jar com.javatpoint.WC_Runner /test/data.txt /r_output
o The output is stored in /r_output/part-00000.
o Now execute the command to see the output.
hdfs dfs -cat /r_output/part-00000
