Unit III
1. What is Hadoop?
Hadoop is an open source framework from Apache used to store, process and analyze data which is very huge in volume. Hadoop is written in Java and is not OLAP (online analytical processing); it is used for batch/offline processing. It is being used by Facebook, Yahoo, Google, Twitter, LinkedIn and many more. Moreover, it can be scaled up just by adding nodes to the cluster.
Modules of Hadoop
1. HDFS: Hadoop Distributed File System. Google published its GFS paper, and HDFS was developed on the basis of it. It states that files will be broken into blocks and stored in nodes over the distributed architecture.
2. YARN: Yet Another Resource Negotiator is used for job scheduling and managing the cluster.
3. MapReduce: This is a framework which helps Java programs to do parallel computation on data using key-value pairs. The Map task takes input data and converts it into a data set which can be computed as key-value pairs. The output of the Map task is consumed by the Reduce task, and the output of the reducer gives the desired result.
4. Hadoop Common: These Java libraries are used to start Hadoop and are used by other
Hadoop modules.
2. Hadoop Architecture
The Hadoop architecture is a package of the file system, MapReduce engine and the HDFS
(Hadoop Distributed File System). The MapReduce engine can be MapReduce/MR1 or
YARN/MR2.
A Hadoop cluster consists of a single master and multiple slave nodes. The master node includes
Job Tracker, Task Tracker, NameNode, and DataNode whereas the slave node includes
DataNode and TaskTracker.
3. Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) is a distributed file system for Hadoop. It follows a master/slave architecture, consisting of a single NameNode that performs the role of master and multiple DataNodes that perform the role of slaves.
Both the NameNode and DataNode are capable of running on commodity machines. HDFS is developed in the Java language, so any machine that supports Java can easily run the NameNode and DataNode software.
NameNode
o It is the single master server in the HDFS cluster. It maintains the file system namespace and the metadata of all files and blocks, and tells clients which DataNodes hold the data.
DataNode
o The HDFS cluster contains multiple DataNodes. Each DataNode stores the actual data blocks and serves read and write requests from clients.
Job Tracker
o The role of the Job Tracker is to accept MapReduce jobs from clients and process the data by using the NameNode.
o In response, the NameNode provides metadata to the Job Tracker.
Task Tracker
o It works as a slave node for the Job Tracker. It receives tasks and code from the Job Tracker and applies that code to the data; this process can also be called a Mapper.
MapReduce Layer
MapReduce comes into play when the client application submits a MapReduce job to the Job Tracker. In response, the Job Tracker sends the request to the appropriate Task Trackers.
Sometimes a TaskTracker fails or times out. In such a case, that part of the job is rescheduled.
4. Advantages of Hadoop
o Fast: In HDFS the data is distributed over the cluster and is mapped, which helps in faster retrieval. Even the tools to process the data are often on the same servers, thus reducing the processing time. Hadoop is able to process terabytes of data in minutes and petabytes in hours.
o Scalable: A Hadoop cluster can be extended just by adding nodes to the cluster.
o Cost Effective: Hadoop is open source and uses commodity hardware to store data, so it is really cost effective compared to a traditional relational database management system.
o Resilient to failure: HDFS has the property that it can replicate data over the network, so if one node is down or some other network failure happens, then Hadoop takes the other copy of the data and uses it. Normally, data is replicated three times, but the replication factor is configurable.
5. History of Hadoop
Hadoop was started by Doug Cutting and Mike Cafarella in 2002. Its origin was the Google File System paper, published by Google.
Let's focus on the history of Hadoop in the following steps:
o In 2002, Doug Cutting and Mike Cafarella started to work on a project, Apache Nutch. It is an open source web crawler software project.
o While working on Apache Nutch, they were dealing with big data. Storing that data was becoming very costly for the project, and this problem became one of the important reasons for the emergence of Hadoop.
o In 2003, Google introduced a file system known as GFS (Google file system). It is a
proprietary distributed file system developed to provide efficient access to data.
o In 2004, Google released a white paper on Map Reduce. This technique simplifies the
data processing on large clusters.
o In 2005, Doug Cutting and Mike Cafarella introduced a new file system known as NDFS (Nutch Distributed File System). This file system also included MapReduce.
o In 2006, Doug Cutting joined Yahoo. On the basis of the Nutch project, Doug Cutting introduced a new project, Hadoop, with a file system known as HDFS (Hadoop Distributed File System). Hadoop's first version, 0.1.0, was released in this year.
o Doug Cutting named his project Hadoop after his son's toy elephant.
o In 2007, Yahoo runs two clusters of 1000 machines.
o In 2008, Hadoop became the fastest system to sort 1 terabyte of data on a 900 node
cluster within 209 seconds.
6. Hadoop Installation
Environment required for Hadoop: The production environment of Hadoop is UNIX, but it can also be used on Windows using Cygwin. Java 1.6 or above is needed to run MapReduce programs. For a Hadoop installation from the tar ball on a UNIX environment, you need:
1. Java Installation
2. SSH installation
3. Hadoop Installation and File Configuration
1) Java Installation
Step 1. Type "java -version" in prompt to find if the java is installed or not. If not then download
java from http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-
1880260.html . The tar filejdk-7u71-linux-x64.tar.gz will be downloaded to your system.
Step 2. Extract the file using the below command
Step 3. To make java available for all the users of UNIX move the file to /usr/local and set the
path. In the prompt switch to root user and then type the command below to move the jdk to
/usr/lib.
1. # mv jdk1.7.0_71 /usr/lib/
Now in ~/.bashrc file add the following commands to set up the path.
1. # export JAVA_HOME=/usr/lib/jdk1.7.0_71
2. # export PATH=PATH:$JAVA_HOME/bin
Now, you can check the installation by typing "java -version" in the prompt.
2) SSH Installation
SSH is used to interact with the master and slave computers without any prompt for a password.
First of all, create a Hadoop user on the master and slave systems:
# useradd hadoop
# passwd hadoop
To map the nodes, open the hosts file present in the /etc/ folder on all the machines and put the IP addresses along with their host names:
# vi /etc/hosts
190.12.1.114 hadoop-master
190.12.1.121 hadoop-slave-one
190.12.1.143 hadoop-slave-two
Set up an SSH key on every node so that they can communicate among themselves without a password. Commands for the same are:
# su hadoop
$ ssh-keygen -t rsa
$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-master
$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-slave-one
$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-slave-two
$ chmod 0600 ~/.ssh/authorized_keys
$ exit
3) Hadoop Installation
First, extract the Hadoop tar ball to /usr/hadoop:
$ sudo mkdir /usr/hadoop
$ sudo tar xvzf hadoop-2.2.0.tar.gz -C /usr/hadoop
Then set JAVA_HOME in hadoop-env.sh (in the Hadoop configuration directory):
export JAVA_HOME=/usr/lib/jdk1.7.0_71
Add the following properties to core-site.xml:
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://hadoop-master:9000</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
</configuration>
Add the following properties to hdfs-site.xml:
<configuration>
<property>
<name>dfs.data.dir</name>
<value>/usr/hadoop/dfs/name/data</value>
<final>true</final>
</property>
<property>
<name>dfs.name.dir</name>
<value>/usr/hadoop/dfs/name</value>
<final>true</final>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
Add the following property to mapred-site.xml:
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>hadoop-master:9001</value>
</property>
</configuration>
Next, update the environment variables:
$ cd $HOME
$ vi .bashrc
Append the following lines at the end, then save and exit:
#Hadoop variables
export JAVA_HOME=/usr/lib/jdk1.7.0_71
export HADOOP_INSTALL=/usr/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
Copy the configured Hadoop installation to the slave machines:
# su hadoop
$ cd /usr/hadoop
$ scp -r hadoop hadoop-slave-one:/usr/hadoop
$ scp -r hadoop hadoop-slave-two:/usr/hadoop
On the master node, list the master and slave hosts in the configuration files:
$ vi etc/hadoop/masters
hadoop-master

$ vi etc/hadoop/slaves
hadoop-slave-one
hadoop-slave-two
After this, format the NameNode and start all the daemons:
# su hadoop
$ cd /usr/hadoop
$ bin/hadoop namenode -format

$ cd $HADOOP_HOME/sbin
$ start-all.sh
The easiest option is to use the Cloudera VM, as it comes with everything pre-installed; it can be downloaded from http://content.udacity-data.com/courses/ud617/Cloudera-Udacity-Training-VM-4.1.1.c.zip
7. What is HDFS
Hadoop comes with a distributed file system called HDFS. In HDFS, data is distributed over several machines and replicated to ensure durability in the face of failures and high availability to parallel applications.
It is cost effective as it uses commodity hardware. It involves the concepts of blocks, data nodes and name node.
HDFS Concepts
1. Blocks: A block is the minimum amount of data that HDFS can read or write. HDFS blocks are 128 MB by default, and this is configurable. Files in HDFS are broken into block-sized chunks, which are stored as independent units. Unlike an ordinary file system, if a file in HDFS is smaller than the block size, it does not occupy the full block's size; i.e., a 5 MB file stored in HDFS with a block size of 128 MB takes only 5 MB of space. The HDFS block size is large simply to minimize the cost of seeks.
2. Name Node: HDFS works in a master-worker pattern where the name node acts as the master. The name node is the controller and manager of HDFS, as it knows the status and the metadata of all the files in HDFS; the metadata information being file permissions, names and the location of each block. The metadata is small, so it is stored in the memory of the name node, allowing faster access to data. Moreover, the HDFS cluster is accessed by multiple clients concurrently, so all this information is handled by a single machine. The file system operations like opening, closing, renaming etc. are executed by it.
3. Data Node: Data nodes store and retrieve blocks when they are told to, by the client or the name node. They report back to the name node periodically with the list of blocks that they are storing. The data node, being commodity hardware, also does the work of block creation, deletion and replication as instructed by the name node.
Secondary Name Node: It is a separate physical machine which acts as a helper of the name node. It performs periodic checkpoints. It communicates with the name node and takes snapshots of the metadata, which helps minimize downtime and loss of data.
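To make the block and replication concepts above concrete, the short Java sketch below uses Hadoop's FileSystem API to ask the name node for a file's block size, replication factor and block locations. It is only an illustrative sketch: the class name BlockInfo and the idea of passing the file path as a command-line argument are assumptions, not something taken from the text above.

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Prints the block size, replication factor and block locations of one HDFS file.
public class BlockInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // reads the cluster settings on the classpath
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path(args[0]);                 // e.g. /user/test/data.txt (illustrative)
        FileStatus status = fs.getFileStatus(file);

        // Metadata kept by the name node for this file.
        System.out.println("Block size  : " + status.getBlockSize());
        System.out.println("Replication : " + status.getReplication());

        // The name node also knows which data nodes hold each block.
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("Block at offset " + loc.getOffset()
                    + " stored on " + Arrays.toString(loc.getHosts()));
        }
        fs.close();
    }
}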
Starting HDFS
The HDFS should be formatted initially and then started in the distributed mode. Commands are
given below.
To format the file system: $ hadoop namenode -format
To start HDFS: $ start-dfs.sh
o Copy the file "data.txt" from the local folder /usr/home/Desktop to the HDFS folder /user/test:
$ hadoop fs -put /usr/home/Desktop/data.txt /user/test
o Recursive deleting:
$ hadoop fs -rm -r <path>
Example:
$ hadoop fs -rm -r /user/test
Commonly used HDFS shell commands:
o put <localSrc> <dest>
Copies the file or directory from the local file system identified by localSrc to dest within the DFS.
o copyFromLocal <localSrc> <dest>
Identical to -put.
o moveFromLocal <localSrc> <dest>
Copies the file or directory from the local file system identified by localSrc to dest within HDFS, and then deletes the local copy on success.
o get <src> <localDest>
Copies the file or directory in HDFS identified by src to the local file system path identified by localDest.
o moveToLocal <src> <localDest>
Works like -get, but deletes the HDFS copy on success.
o cat <filename>
Displays the contents of filename on stdout.
o touchz <path>
Creates a file at path containing the current time as a timestamp. Fails if a file already exists at path, unless the file is already size 0.
o stat [format] <path>
Prints information about path. Format is a string which accepts file size in blocks (%b), filename (%n), block size (%o), replication (%r), and modification date (%y, %Y).
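The same file operations can also be performed from a Java program through the FileSystem API. The sketch below is only illustrative (the class name CopyToHdfs and the example paths are assumptions); it mirrors the put, get and recursive-delete commands listed above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Programmatic equivalents of a few HDFS shell commands.
public class CopyToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Equivalent of: hadoop fs -put /usr/home/Desktop/data.txt /user/test
        fs.copyFromLocalFile(new Path("/usr/home/Desktop/data.txt"), new Path("/user/test"));

        // Equivalent of: hadoop fs -get /user/test/data.txt /tmp
        fs.copyToLocalFile(new Path("/user/test/data.txt"), new Path("/tmp"));

        // Equivalent of: hadoop fs -rm -r /user/test (recursive delete)
        fs.delete(new Path("/user/test"), true);

        fs.close();
    }
}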
8. HDFS Features and Goals
The Hadoop Distributed File System (HDFS) is a distributed file system. It is a core part of Hadoop which is used for data storage. It is designed to run on commodity hardware.
Unlike other distributed file systems, HDFS is highly fault-tolerant and can be deployed on low-cost hardware. It can easily handle applications that contain large data sets.
Features of HDFS
o Highly Scalable - HDFS is highly scalable as it can scale to hundreds of nodes in a single cluster.
o Replication - Due to some unfavorable conditions, the node containing the data may be lost. So, to overcome such problems, HDFS always maintains a copy of the data on a different machine.
o Fault tolerance - In HDFS, fault tolerance signifies the robustness of the system in the event of failure. HDFS is so fault-tolerant that if any machine fails, another machine containing a copy of that data automatically becomes active.
o Distributed data storage - This is one of the most important features of HDFS that makes Hadoop very powerful. Here, data is divided into multiple blocks and stored across nodes.
o Portable - HDFS is designed in such a way that it can easily be ported from one platform to another.
Goals of HDFS
o Handling hardware failure - An HDFS cluster contains multiple server machines. If any machine fails, the goal of HDFS is to recover from it quickly.
o Streaming data access - HDFS applications require streaming access to their data sets; HDFS is designed for high-throughput batch processing rather than general-purpose interactive use.
o Coherence model - Applications that run on HDFS are expected to follow the write-once-read-many approach. So, a file once created need not be changed; however, it can be appended to and truncated.
9. What is YARN
YARN (Yet Another Resource Negotiator) takes Hadoop beyond pure MapReduce programming and lets other applications such as HBase and Spark work on the same data. Different YARN applications can co-exist on the same cluster, so MapReduce, HBase and Spark can all run at the same time, bringing great benefits for manageability and cluster utilization.
Components Of YARN
o Client: Submits MapReduce jobs.
o Resource Manager: Manages the use of resources across the cluster.
o Node Manager: Launches and monitors the compute containers on machines in the cluster.
o MapReduce Application Master: Coordinates and checks the tasks running in the MapReduce job. The application master and the MapReduce tasks run in containers that are scheduled by the resource manager and managed by the node managers.
JobTracker and TaskTracker were used in previous versions of Hadoop, where they were responsible for handling resources and checking progress. Hadoop 2.0 has the ResourceManager and NodeManager instead, to overcome the shortcomings of the JobTracker and TaskTracker.
Benefits of YARN
o Scalability: MapReduce 1 hits a scalability bottleneck at 4,000 nodes and 40,000 tasks, but YARN is designed for 10,000 nodes and 100,000 tasks.
o Utilization: The Node Manager manages a pool of resources rather than a fixed number of designated slots, thus increasing utilization.
o Multitenancy: Different versions of MapReduce can run on YARN, which makes the process of upgrading MapReduce more manageable.
10. MapReduce
The following sections cover the basics of MapReduce: the data flow in MapReduce, the MapReduce API, and a word count example.
What is MapReduce?
MapReduce is a data processing tool which is used to process data in parallel in a distributed form. It was developed in 2004, on the basis of the paper titled "MapReduce: Simplified Data Processing on Large Clusters," published by Google.
The MapReduce is a paradigm which has two phases, the mapper phase, and the reducer phase.
In the Mapper, the input is given in the form of a key-value pair. The output of the Mapper is fed
to the reducer as input. The reducer runs only after the Mapper is over. The reducer too takes
input in key-value format, and the output of reducer is the final output.
The sort and shuffle occur on the output of the Mapper, before the reducer. When the Mapper task is complete, the results are sorted by key, partitioned if there are multiple reducers, and then written to disk. Using the input from each Mapper <k2, v2>, we collect all the values for each unique key k2. This output of the shuffle phase, in the form of <k2, list(v2)>, is sent as input to the reducer phase. For example, in word count the mappers emit pairs such as <"hadoop", 1>, and the shuffle groups them into <"hadoop", (1, 1, 1)> before the reducer sums the list.
Usage of MapReduce
o It can be used in various application like document clustering, distributed sorting, and
web link-graph reversal.
o It can be used for distributed pattern-based searching.
o We can also use MapReduce in machine learning.
o It was used by Google to regenerate Google's index of the World Wide Web.
o It can be used in multiple computing environments such as multi-cluster, multi-core, and mobile environments.
Data Flow in MapReduce
MapReduce is used to compute huge amounts of data. To handle the incoming data in a parallel and distributed form, the data has to flow through several phases.
Input reader
The input reader divides the input into appropriately sized splits and assigns one split to each Map function.
Map function
The Map function processes each key-value pair from its split and generates zero or more intermediate key-value pairs.
Partition function
The partition function assigns the output of each Map function to the appropriate reducer. It is given the key and value (and the number of reducers) and returns the index of the desired reducer.
Shuffling and Sorting
The data are shuffled between nodes so that they move out of the map side and get ready to be processed by the Reduce function. Sometimes, the shuffling of data can take much computation time.
The sorting operation is performed on the input data for the Reduce function. Here, the data are compared using a comparison function and arranged in sorted form.
Reduce function
The Reduce function is called once for each unique key. The keys are already arranged in sorted order. The Reduce function iterates over the values associated with each key and generates the corresponding output.
Output writer
Once the data has flowed through all the above phases, the Output writer executes. The role of the Output writer is to write the Reduce output to stable storage.
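As an illustration of the partition step described above, the sketch below shows a custom partitioner written against the classic org.apache.hadoop.mapred API (the same API family used in the word count example later in this unit). The class name FirstLetterPartitioner and the first-letter routing rule are assumptions chosen for the example, not something fixed by Hadoop.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Sends every word that starts with the same letter to the same reducer.
public class FirstLetterPartitioner implements Partitioner<Text, IntWritable> {

    @Override
    public void configure(JobConf job) {
        // No job-specific configuration is needed for this example.
    }

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (key.getLength() == 0) {
            return 0;
        }
        // charAt returns the Unicode code point at the given position.
        int first = Character.toLowerCase(key.charAt(0));
        return first % numPartitions;
    }
}

Such a partitioner would be registered on the job with conf.setPartitionerClass(FirstLetterPartitioner.class); if none is set, Hadoop falls back to its default hash-based partitioner.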
MapReduce API
In this section, we focus on MapReduce APIs. Here, we learn about the classes and methods
used in MapReduce programming.
Mapper class
In MapReduce, the role of the Mapper class is to map the input key-value pairs to a set of intermediate key-value pairs. It transforms the input records into intermediate records. The intermediate records associated with a given output key are passed to the Reducer for the final output.
Methods of the Mapper class:
o void cleanup(Context context) - This method is called only once, at the end of the task.
o void map(KEYIN key, VALUEIN value, Context context) - This method is called once for each key-value pair in the input split.
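For reference, here is a minimal Mapper written against the newer org.apache.hadoop.mapreduce API that the method signatures above come from. It is a sketch only: the class name TokenMapper and the whitespace tokenization are assumptions, and the word count example later in this unit uses the older org.apache.hadoop.mapred API instead.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (word, 1) for every token of every input line.
public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // map() is invoked once for every key-value pair in the input split.
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // cleanup() runs once, after the last call to map() for this task.
    }
}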
Reducer class
In MapReduce, the role of the Reducer class is to reduce the set of intermediate values. Its implementations can access the Configuration for the job via the JobContext.getConfiguration() method.
Methods of the Reducer class:
o void reduce(KEYIN key, Iterable<VALUEIN> values, Context context) - This method is called once for each key.
o void run(Context context) - This method can be used to control the tasks of the Reducer.
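A matching Reducer for the new API could look like the sketch below; the class name SumReducer is again an assumption made for the example.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the counts collected for each word; reduce() is called once per unique key.
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}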
Job class
The Job class is used to configure the job and submit it. It also controls the execution and queries the state. Once the job is submitted, the set methods throw IllegalStateException.
Methods of the Job class:
o void setMapperClass(Class<? extends Mapper> class) - This method is used to set the Mapper for the job.
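Putting these pieces together, a driver built on the Job class might look like the sketch below. It assumes the TokenMapper and SumReducer sketched above sit in the same package, and the class name WordCountDriver is likewise an assumption; the word count example that follows uses the older JobConf/JobClient API instead.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Configures and submits a word count job using the new MapReduce API.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenMapper.class);   // Mapper sketched above
        job.setReducerClass(SumReducer.class);   // Reducer sketched above
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // After submission, calling any set method on the job throws IllegalStateException.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}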
MapReduce Word Count Example
In the MapReduce word count example, we find out the frequency of each word. Here, the role of the Mapper is to map the keys to the existing values, and the role of the Reducer is to aggregate the keys of common values. So, everything is represented in the form of a key-value pair.
Pre-requisite
o Java Installation - Check whether Java is installed using the following command:
java -version
o Hadoop Installation - Check whether Hadoop is installed using the following command:
hadoop version
If either of them is not installed on your system, follow the link below to install it:
www.javatpoint.com/hadoop-installation
In this example, we find out the frequency of each word that exists in the input text file.
File: WC_Mapper.java
package com.javatpoint;

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Splits each input line into words and emits (word, 1) for every word.
public class WC_Mapper extends MapReduceBase implements Mapper<LongWritable,Text,Text,IntWritable>{
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(LongWritable key, Text value, OutputCollector<Text,IntWritable> output,
                    Reporter reporter) throws IOException{
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()){
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}
File: WC_Reducer.java
package com.javatpoint;

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Sums the counts emitted by the mapper for each word and emits (word, total).
public class WC_Reducer extends MapReduceBase implements Reducer<Text,IntWritable,Text,IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text,IntWritable> output,
                       Reporter reporter) throws IOException {
        int sum=0;
        while (values.hasNext()) {
            sum+=values.next().get();
        }
        output.collect(key,new IntWritable(sum));
    }
}
File: WC_Runner.java
package com.javatpoint;

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

// Configures and submits the word count job; args[0] is the input path, args[1] the output path.
public class WC_Runner {
    public static void main(String[] args) throws IOException{
        JobConf conf = new JobConf(WC_Runner.class);
        conf.setJobName("WordCount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(WC_Mapper.class);
        conf.setCombinerClass(WC_Reducer.class);
        conf.setReducerClass(WC_Reducer.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}
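To run the example, compile the three classes, package them into a jar, and submit the jar with the hadoop jar command, passing the HDFS input and output paths as the two arguments (for example, hadoop jar wordcount.jar com.javatpoint.WC_Runner /input /output, where the jar name and the paths are only illustrative). The output directory must not already exist; the word counts appear in the part-00000 file inside it.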