A Report On Distributed Computing
ENGINEERING
CLOUD COMPUTING
UCS641
ASSIGNMENT 2
Hadoop and Map-Reduce
COE-14
PATIALA-147001, PUNJAB
Table of Contents
1. Setting up and Installing Hadoop
1.1. Stand-alone Mode
1.2. Pseudo-Distributed Mode
1.3. Fully Distributed Mode
2. Implementation of HDFS Commands
3. Upload and Download File from HDFS
4. Copy file from Source to Destination in Hadoop
5. Copy file from/to local file system to HDFS
6. Remove a File or Directory from HDFS
7. Display last few lines of a file in HDFS
8. Display aggregate length of a file
9. Implementation of Word Count Map Reduce
10. Matrix multiplication with Hadoop Map Reduce
11. Small binary file to One sequential file
1. Setting up and Installing Hadoop
First, we need to download Hadoop. We will use Hadoop 3.2.1. On the Apache Hadoop releases page, click the link under the "Binary download" column; this takes us to the download page for that release.
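The archive can also be fetched directly from the terminal, for example with wget (the mirror URL shown here is one possibility; the link on the download page may differ):
$ wget https://archive.apache.org/dist/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz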
Now we have the Hadoop archive. Before setting it up, we need Java 8 or above pre-installed on our computer (Hadoop 3.x requires at least Java 8).
To check for installed java versions, we can run in terminal:
$ java -version
If Java is not installed, we can install it on Ubuntu/Debian using the following commands:
$ sudo apt update
$ sudo apt install default-jdk
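i. Extract the downloaded Hadoop archive (the file name below corresponds to the 3.2.1 release downloaded earlier):
$ tar -xzf hadoop-3.2.1.tar.gz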
This will extract the files into a new folder in the same directory.
ii. Move the extracted folder to a suitable install location:
$ sudo mv hadoop-3.2.1 /usr/local/hadoop
iii. Add the Hadoop binaries to the PATH (for example by adding these lines to ~/.bashrc):
$ export HADOOP_PATH=/usr/local/hadoop/bin/
$ export PATH=$PATH:$HADOOP_PATH
1.1. Stand-alone Mode
In this mode Hadoop runs as a single Java process on the local file system, without any daemons. To execute a jar file using Hadoop, first move to the Hadoop install directory and run:
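For example, to run the word-count example that ships with the binary distribution (input is a local directory of text files, output must not exist yet, and the jar name below assumes version 3.2.1):
$ cd /usr/local/hadoop
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar wordcount input output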
If you are not in the correct directory, you will get a "file not found" error; otherwise, you will see the output as below:
1.2. Pseudo-Distributed Mode
First, check that you can SSH into the local machine:
$ ssh localhost
The master node communicates with the slave nodes very frequently over the SSH protocol. In pseudo-distributed mode only one node exists (your machine), and the master-slave interaction is simulated by the JVM. Since this communication is so frequent, SSH must be passwordless, with authentication done using a public key.
We can achieve this by creating an RSA key pair. We use the command:
$ ssh-keygen
You'll be prompted to choose the location to store the keys. The default location is good
unless you already have a key. Press Enter to choose the default location.
Enter file in which to save the key (/Users/yourname/.ssh/id_rsa):
Next, you'll be asked to choose a passphrase. A passphrase would have to be entered every time the private key is used; since Hadoop needs passwordless SSH, simply press Enter to leave it empty. Your public and private keys will then be generated as two different files: the one named id_rsa is your private key, and the one named id_rsa.pub is your public key.
You'll also be shown a fingerprint and "visual fingerprint" of your key. You do not need to
save these.
Now, to give access using the public key, create the file .ssh/authorized_keys and paste the content of id_rsa.pub into it.
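This can be done directly from the terminal:
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 600 ~/.ssh/authorized_keys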
Now that SSH is configured, we need to edit the Hadoop configuration files.
Navigate to the Hadoop installation directory, and within it to etc/hadoop/.
Figure 9: Content of etc/hadoop (Configuration Files)
First, we need to tell Hadoop where Java is located on our computer. We can find the path to Java by using the whereis command and following the symbolic links to its original path.
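For example (the resulting path differs from system to system; JAVA_HOME should point to the directory above bin/java):
$ whereis java
$ readlink -f /usr/bin/java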
We will now modify the hadoop-env.sh file to export this path. Within the hadoop
installation directory, change your directory to etc/hadoop.
$ nano hadoop-env.sh
Inside the file, add (or uncomment) the line that exports the Java path found above:
export JAVA_HOME=path/from/above/command
Figure 11: JAVA_HOME path inside hadoop-env.sh
Now save and exit by pressing Ctrl+X, then Y to save the buffer, then Enter to keep the existing file name.
Second, we will configure core-site.xml in the same directory. The following property goes inside the <configuration> element and sets the default file system to HDFS running on localhost:
$ nano core-site.xml
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
Third, we configure hdfs-site.xml in the same directory.
$ nano hdfs-site.xml
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
This sets the replication factor. A value of 1 means each HDFS block is stored only once, which is appropriate for a single-node (pseudo-distributed) setup.
Fourth, we configure mapred-site.xml. Create this file if it is not already present in the directory.
$ nano mapred-site.xml
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
Fifth, we configure yarn-site.xml to enable the MapReduce shuffle service on the NodeManager.
$ nano yarn-site.xml
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
Before starting the daemons for the first time, format the NameNode:
$ hdfs namenode -format
Running jps at this point shows that no Hadoop daemons are running yet:
$ jps
Now start HDFS and YARN, and run jps again to verify that the NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager daemons are up:
$ start-dfs.sh
$ start-yarn.sh
$ jps
We can verify that our NameNode is running using our web browser. Visit the URL localhost:9870; in Hadoop 3.x the NameNode web UI listens on port 9870 by default.
1.3. Fully Distributed Mode
For the fully distributed setup we use one master and two slave machines. First, add the IP addresses and hostnames of all the nodes to /etc/hosts on every machine:
192.168.2.14 HadoopMaster
192.168.2.15 HadoopSlave1
192.168.2.16 HadoopSlave2
Then we create a group 'hdgroup' and add a user 'hduser' to the group.
$ sudo addgroup hdgroup
$ sudo adduser --ingroup hdgroup hduser
$ sudo visudo
In the file, grant sudo privileges to 'hduser', then save and exit the file.
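A typical entry, added below the root line, grants 'hduser' full sudo rights:
hduser ALL=(ALL:ALL) ALL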
All the steps till now should be done on all the machines, i.e. on the master and both slave nodes.
Now we need to re-configure our Master node for fully distributed mode.
We will now edit core-site.xml.
<property>
<name>fs.default.name</name>
<value>hdfs://HadoopMaster:9000</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.
</description>
</property>
Next, we edit hdfs-site.xml.
<property>
<name>dfs.data.dir</name>
<value>/app/hadoop/tmp/datanode</value>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<name>dfs.datanode.use.datanode.hostname</name>
<value>false</value>
</property>
<property>
<name>dfs.namenode.http-address</name>
<value>HadoopMaster:50070</value>
<description>Your NameNode hostname for http
access.</description>
</property>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>HadoopMaster:50090</value>
<description>Your Secondary NameNode hostname for http
access.</description>
</property>
Then we edit yarn-site.xml.
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
<description>Long running service which executes on Node
Manager(s) and provides MapReduce Sort and Shuffle
functionality.</description>
</property>
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
<description>Enable log aggregation so application logs are
moved onto hdfs and are viewable via web ui after the
application completed. The default location on hdfs is
'/log' and can be changed via yarn.nodemanager.remote-app-
log-dir property</description>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>HadoopMaster:8030</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>HadoopMaster:8031</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>HadoopMaster:8032</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>HadoopMaster:8033</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>HadoopMaster:8088</value>
</property>
Finally, we edit mapred-site.xml.
<property>
<name>mapred.job.tracker</name>
<value>HadoopMaster:9001</value>
</property>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
On the master node, list the IP addresses of the slave nodes in the workers file inside etc/hadoop (the file is named slaves in Hadoop 2.x):
192.168.2.15
192.168.2.16
Then secure copy the configured Hadoop folder from the master to the other slave hosts.
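For example, using scp into the hduser home directory on each slave (the install path assumed here is the /usr/local/hadoop location used earlier):
$ scp -r /usr/local/hadoop hduser@HadoopSlave1:~/
$ scp -r /usr/local/hadoop hduser@HadoopSlave2:~/
On each slave, move the copied folder into the same install location, e.g. with sudo mv ~/hadoop /usr/local/hadoop.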
Finally, start the cluster from the master node:
$ start-dfs.sh
$ start-yarn.sh
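To confirm that both slave DataNodes have joined the cluster, we can run the following on the master; it lists the live DataNodes and their capacities:
$ hdfs dfsadmin -report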
2. Implementation of HDFS Commands
HDFS is used from the command line with hadoop fs (or hdfs dfs). For example, to list the contents of the HDFS root directory:
$ hadoop fs -ls /
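The HDFS operations covered in sections 3-8 of this report take the following general forms (file and directory names such as sample.txt and /user/hduser are only illustrative):
Upload a file to HDFS and download it back:
$ hadoop fs -put sample.txt /user/hduser/
$ hadoop fs -get /user/hduser/sample.txt .
Copy a file from a source to a destination within HDFS:
$ hadoop fs -cp /user/hduser/sample.txt /user/hduser/backup/
Copy a file from/to the local file system:
$ hadoop fs -copyFromLocal sample.txt /user/hduser/
$ hadoop fs -copyToLocal /user/hduser/sample.txt .
Remove a file or directory from HDFS:
$ hadoop fs -rm /user/hduser/sample.txt
$ hadoop fs -rm -r /user/hduser/olddir
Display the last few lines (the last kilobyte) of a file:
$ hadoop fs -tail /user/hduser/sample.txt
Display the aggregate length of a file:
$ hadoop fs -du -s /user/hduser/sample.txt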
9. Implementation of Word Count Map Reduce
The Word Count program counts how many times each word occurs in the input. The mapper (MyMap) tokenizes each line and emits (word, 1) pairs, and the reducer (MyReduce) sums these counts per word. The program below uses the classic org.apache.hadoop.mapred API:
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
public class WordCount {
  // Mapper: tokenizes each input line and emits (word, 1) pairs
  public static class MyMap extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();
    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      StringTokenizer tokenizer = new StringTokenizer(value.toString());
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }
  // Reducer: sums the counts emitted for each word
  public static class MyReduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }
  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("CountWordFreq");
    conf.setMapperClass(MyMap.class);
    conf.setReducerClass(MyReduce.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    // Input and output HDFS paths are taken from the command line
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}
The Hadoop library jar files (for example hadoop-common and hadoop-mapreduce-client-core) should be added as dependencies while writing the program.
Compiling the Java code:
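A typical compile-and-run sequence (the class name WordCount and the jar, input and output names below are illustrative) is:
$ mkdir wordcount_classes
$ javac -classpath "$(hadoop classpath)" -d wordcount_classes WordCount.java
$ jar -cvf wordcount.jar -C wordcount_classes/ .
$ hadoop jar wordcount.jar WordCount /input /output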