
Hadoop Training
Hands On Exercise

1. Getting started:
Step 1: Download and install VMware Player
- Download VMware-player-5.0.1-894247.zip and unzip it on your
Windows machine
- Click the .exe and install VMware Player

Step 2: Download and install the VMware image
- Download Hadoop Training - Distribution.zip and unzip it on your
Windows machine
- Click on centos-6.3-x86_64-server.vmx to start the virtual machine

Step 3: Login and a quick check
- Once the VM starts, use the following credentials:
Username: training
Password: training
- Quickly check that Eclipse and MySQL Workbench are installed



2. Installing Hadoop in pseudo-distributed mode:
Step 1: Run the following command to install Hadoop from the yum
repository in pseudo-distributed mode (already done for you,
please don't run this command)
sudo yum install hadoop-0.20-conf-pseudo

Step 2: Verify if the packages are installed properly


rpm -ql hadoop-0.20-conf-pseudo

Step 3: Format the namenode

sudo -u hdfs hdfs namenode -format


Step 4: Stop existing services (As Hadoop was already installed for
you, there might be some services running)
$ for service in /etc/init.d/hadoop*
> do
> sudo $service stop
> done

Step 5: Start HDFS
$ for service in /etc/init.d/hadoop-hdfs-*
> do
> sudo $service start
> done



Step 6: Verify if HDFS has started properly (In the browser)
http://localhost:50070
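
Optionally, you can also confirm this from the command line. For example (assuming the JDK's jps utility is on your PATH), list the running Java processes and check that NameNode, SecondaryNameNode and DataNode appear:

$ sudo jps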

Step 7: Create the /tmp directory


$ sudo -u hdfs hadoop fs -mkdir /tmp
$ sudo -u hdfs hadoop fs -chmod -R 1777 /tmp


Step 8: Create MapReduce-specific directories

sudo -u hdfs hadoop fs -mkdir /var
sudo -u hdfs hadoop fs -mkdir /var/lib
sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs
sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs/cache
sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs/cache/mapred
sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs/cache/mapred/mapred
sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
sudo -u hdfs hadoop fs -chmod 1777 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
sudo -u hdfs hadoop fs -chown -R mapred /var/lib/hadoop-hdfs/cache/mapred
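
As a side note, newer versions of the fs shell accept a -p flag on -mkdir that creates parent directories in one go; if your version supports it, the chain of mkdir commands above can be shortened (the chmod and chown steps are still required), for example:

sudo -u hdfs hadoop fs -mkdir -p /var/lib/hadoop-hdfs/cache/mapred/mapred/staging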

Step 9: Verify the directory structure


$ sudo -u hdfs hadoop fs -ls -R /

Output should be

drwxrwxrwt   - hdfs   supergroup   0 2012-04-19 15:14 /tmp
drwxr-xr-x   - hdfs   supergroup   0 2012-04-19 15:16 /var
drwxr-xr-x   - hdfs   supergroup   0 2012-04-19 15:16 /var/lib
drwxr-xr-x   - hdfs   supergroup   0 2012-04-19 15:16 /var/lib/hadoop-hdfs
drwxr-xr-x   - hdfs   supergroup   0 2012-04-19 15:16 /var/lib/hadoop-hdfs/cache
drwxr-xr-x   - mapred supergroup   0 2012-04-19 15:19 /var/lib/hadoop-hdfs/cache/mapred
drwxr-xr-x   - mapred supergroup   0 2012-04-19 15:29 /var/lib/hadoop-hdfs/cache/mapred/mapred
drwxrwxrwt   - mapred supergroup   0 2012-04-19 15:33 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging


Step 10: Start MapReduce
$ for service in /etc/init.d/hadoop-0.20-mapreduce-*
> do
> sudo $service start
> done

Step 11: Verify if MapReduce has started properly (In Browser)
http://localhost:50030
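
You can also check the daemons directly; most init scripts accept a status argument, so a loop in the same style as the start/stop steps works as a quick check:

$ for service in /etc/init.d/hadoop-0.20-mapreduce-*
> do
> sudo $service status
> done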


Step 12: Verify that the installation went well by running a sample program

Step 12.1: Create a home directory on HDFS for the user



sudo -u hdfs hadoop fs -mkdir /user/training
sudo -u hdfs hadoop fs -chown training /user/training

Step 12.2: Make a directory in HDFS called input and copy some XML files
into it by running the following commands

$ hadoop fs -mkdir input


$ hadoop fs -put /etc/hadoop/conf/*.xml input
$ hadoop fs -ls input
Found 3 items:
-rw-r--r-- 1 joe supergroup 1348 2012-02-13 12:21 input/core-site.xml
-rw-r--r-- 1 joe supergroup 1913 2012-02-13 12:21 input/hdfs-site.xml
-rw-r--r-- 1 joe supergroup 1001 2012-02-13 12:21 input/mapred-site.xml

Step 12.3: Run an example Hadoop job to grep with a regular expression in
your input data.

$ /usr/bin/hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar \
  grep input output 'dfs[a-z.]+'
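
This job greps the XML files in input for strings matching the regular expression dfs[a-z.]+ and writes the match counts to output. While it runs, you can optionally watch it from another terminal with the MRv1 job client, in addition to the JobTracker web UI:

$ hadoop job -list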

Step 12.4: After the job completes, you can find the output in the HDFS
directory named output because you specified that output directory to
Hadoop.

$ hadoop fs -ls
Found 2 items
drwxr-xr-x   - joe supergroup   0 2009-08-18 18:36 /user/joe/input
drwxr-xr-x   - joe supergroup   0 2009-08-18 18:38 /user/joe/output





Step 12.5: List the output files



$ hadoop fs -ls output

Found 3 items

drwxr-xr-x   - joe supergroup      0 2009-02-25 10:33 /user/joe/output/_logs
-rw-r--r--   1 joe supergroup   1068 2009-02-25 10:33 /user/joe/output/part-00000
-rw-r--r--   1 joe supergroup      0 2009-02-25 10:33 /user/joe/output/_SUCCESS




Step 12.6: Read the output


$ hadoop fs -cat output/part-00000 | head

1    dfs.datanode.data.dir
1    dfs.namenode.checkpoint.dir
1    dfs.namenode.name.dir
1    dfs.replication
1    dfs.safemode.extension
1    dfs.safemode.min.datanodes








3. Accessing HDFS from the command line:


This exercise is just to get you familiar with HDFS. Run the following commands:

Command 1: List the files in the /user/training directory
$> hadoop fs -ls

Command 2: List the files in the root directory


$> hadoop fs -ls /

Command 3: Push a file to HDFS

$> hadoop fs -put test.txt /user/training/test.txt
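
Command 3 assumes a local file named test.txt in your current directory; if you don't have one, create a small sample file first, for example:

$> echo "this is a test file for HDFS" > test.txt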






Command 4: View the contents of the file
$> hadoop fs -cat /user/training/test.txt

Command 5: Delete a file


$> hadoop fs -rm /user/training/test.txt


4. Running the WordCount MapReduce job


Step 1: Put the data into HDFS
hadoop fs -mkdir /user/training/wordcountinput
hadoop fs -put wordcount.txt /user/training/wordcountinput



Step 2: Create a new project in Eclipse called wordcount

1. cp -r /home/training/exercises/wordcount
/home/training/workspace/wordcount
2. Open Eclipse -> New Project -> wordcount -> location
/home/training/workspace
3. Right-click the wordcount project -> Properties -> Java
Build Path -> Libraries -> Add External Jars -> select all jars
from /usr/lib/hadoop and /usr/lib/hadoop-0.20-mapreduce -> OK
4. Make sure that there are no compilation errors






Step 3: Create a jar file

1. Right-click the project -> Export -> Java -> JAR -> select the location as
/home/training -> make sure wordcount is checked -> Finish


Step 4: Run the jar file
hadoop jar wordcount.jar WordCount wordcountinput wordcountoutput
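
Once the job finishes, you can inspect the result in HDFS. The exact part-file name depends on which MapReduce API the WordCount class uses (e.g. part-00000 or part-r-00000), so a glob is the safest way to read it:

hadoop fs -ls wordcountoutput
hadoop fs -cat wordcountoutput/part* | head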

5. Mini Project: Importing MySQL Data Using Sqoop and Querying It Using Hive

5.1 Setting up Sqoop
Step 1: Install Sqoop (already done for you, please don't run
this command)

$> sudo yum install sqoop



Step 2: View list of databases

$> sqoop list-databases \
--connect jdbc:mysql://localhost/training_db \
--username root --password root

Step 3: View list of tables




$> sqoop list-tables \

--connect jdbc:mysql://localhost/training_db \

--username root --password root





Step 4: Import data to HDFS



$> sqoop import \

--connect jdbc:mysql://localhost/training_db \
--table user_log --fields-terminated-by '\t' \
-m 1 --username root --password root
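
To confirm the import worked, list the target directory and peek at the data. With -m 1 Sqoop writes a single file; by default the import lands under your HDFS home directory, i.e. /user/training/user_log, and that single file is the part-m-00000 file the Hive LOAD step below refers to:

$> hadoop fs -ls /user/training/user_log
$> hadoop fs -cat /user/training/user_log/part-m-00000 | head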

5.2 Setting up Hive


Step 1: Install Hive


$> sudo yum install hive (already done for you, don't
run this command)
$> sudo -u hdfs hadoop fs -mkdir /user/hive/warehouse
$> hadoop fs -chmod g+w /tmp
$> sudo -u hdfs hadoop fs -chmod g+w /user/hive/warehouse
$> sudo -u hdfs hadoop fs -chown -R training /user/hive/warehouse
$> sudo chmod 777 /var/lib/hive/metastore
$> hive
hive> show tables;



Step 2: Create table

hive> create table user_log (country
STRING,ip_address STRING) ROW FORMAT DELIMITED FIELDS
TERMINATED BY '\t' STORED AS TEXTFILE;

Step 3: Load Data




hive> LOAD DATA INPATH "/user/training/user_log/part-m-00000"
INTO TABLE user_log;
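
Note that LOAD DATA INPATH moves the file from /user/training/user_log into Hive's warehouse directory rather than copying it. Before running the aggregate query, a quick sanity check that the rows landed in the table:

hive> select * from user_log limit 10;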



Step 4: Run the query



hive> select country, count(1) from user_log group by country;

6. Setting up Flume
Step 1: Install Flume
$> sudo yum install flume-ng (already done for you, please
don't run this command)
$> sudo -u hdfs hadoop fs -chmod 1777 /user/training

Step 2: Copy the configuration file


$> sudo cp /home/training/exercises/flume-config/flume.conf /usr/lib/flume-ng/conf


Step 3: Start the flume agent
$> flume-ng agent --conf-file /usr/lib/flume-ng/conf/flume.conf \
--name agent -Dflume.root.logger=INFO,console


Step 4: Push the file in a different terminal
$> sudo cp /home/training/exercises/log.txt /home/training


Step 5: View the output
$> hadoop fs -ls logs


7. Setting up a multi-node cluster
Step 1: To convert from pseudo-distributed mode to fully distributed
mode, the first step is to stop the existing services (To be done on all
nodes)
$> for service in /etc/init.d/hadoop*
> do
> sudo $service stop
> done

Step 2: Create a new set of blank configuration files. The conf.empty
directory contains blank files, so we will copy those to a new
directory (To be done on all nodes)

$> sudo cp -r /etc/hadoop/conf.empty \
> /etc/hadoop/conf.class

Step 3: Point the Hadoop configuration to the new configuration (To be
done on all nodes)

$> sudo /usr/sbin/alternatives --install \
> /etc/hadoop/conf hadoop-conf \
> /etc/hadoop/conf.class 99


Step 4: Verify Alternatives (To be done on all nodes)

$> /usr/sbin/update-alternatives \
> --display hadoop-conf

Step 5: Setting up the hosts (To be done on all nodes)


Step 5.1: Find the IP address of your machine



$> /sbin/ifconfig

Step 5.2: List all the IP addresses in your cluster setup, i.e.
the ones that will belong to your cluster, and decide a name for
each one. In our example, let's say we are setting up a 3-node
cluster, so we fetch the IP address of each node and name it
namenode or datanode<n>.
Update the /etc/hosts file with the IP addresses as shown, so the
/etc/hosts file on each node should look something like this:





192.168.1.12 namenode
192.168.1.21 datanode1
192.168.1.22 datanode2


Step 5.3: Update the /etc/sysconfig/network file with the hostname

Open /etc/sysconfig/network on your local box and make
sure that your hostname is namenode or datanode<n>.
For example, assuming your node is datanode1, i.e.
192.168.1.21, the entry should be
HOSTNAME=datanode1
In general: HOSTNAME=<your node, i.e. namenode or datanode<n>>


Step 5.4: Restart your machine and try pinging the other machines

$> ping namenode



Step 6: Changing configuration files (To be done on all nodes)
The format for adding a configuration parameter is
<property>
<name>property_name</name>
<value>property_value</value>
</property>

Add the following configurations in the following files


Filename: /etc/hadoop/conf.class/core-site.xml
  Name: fs.default.name
  Value: hdfs://<namenode>:8020

Filename: /etc/hadoop/conf.class/hdfs-site.xml
  Name: dfs.name.dir
  Value: /home/disk1/dfs/nn,/home/disk2/dfs/nn
  Name: dfs.data.dir
  Value: /home/disk1/dfs/dn,/home/disk2/dfs/dn
  Name: dfs.http.address
  Value: namenode:50070

Filename: /etc/hadoop/conf.class/mapred-site.xml
  Name: mapred.local.dir
  Value: /home/disk1/mapred/local,/home/disk2/mapred/local
  Name: mapred.job.tracker
  Value: namenode:8021
  Name: mapred.jobtracker.staging.root.dir
  Value: /user
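
For example, using the property format shown above and the hostname namenode from the /etc/hosts example in Step 5.2, the entry in /etc/hadoop/conf.class/core-site.xml would be:

<property>
<name>fs.default.name</name>
<value>hdfs://namenode:8020</value>
</property>

The other properties in the table are added to their respective files in the same way.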


Step 7: Create necessary directories (To be done on all nodes)
$> sudo mkdir -p /home/disk1/dfs/nn
$> sudo mkdir -p /home/disk2/dfs/nn
$> sudo mkdir -p /home/disk1/dfs/dn
$> sudo mkdir -p /home/disk2/dfs/dn
$> sudo mkdir -p /home/disk1/mapred/local
$> sudo mkdir -p /home/disk2/mapred/local



Step 8: Manage permissions (To be done on all nodes)

$> sudo chown -R hdfs:hadoop /home/disk1/dfs/nn
$> sudo chown -R hdfs:hadoop /home/disk2/dfs/nn
$> sudo chown -R hdfs:hadoop /home/disk1/dfs/dn
$> sudo chown -R hdfs:hadoop /home/disk2/dfs/dn
$> sudo chown -R mapred:hadoop /home/disk1/mapred/local
$> sudo chown -R mapred:hadoop /home/disk2/mapred/local



Step 9: Reduce Hadoop Heapsize (To be done on all nodes)


$> export HADOOP_HEAPSIZE=200
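
Note that export only affects the current shell session. To make the setting persistent (assuming your configuration directory contains a hadoop-env.sh, which is where Hadoop picks up HADOOP_HEAPSIZE), append it there instead:

$> echo "export HADOOP_HEAPSIZE=200" | sudo tee -a /etc/hadoop/conf.class/hadoop-env.sh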




Step 10: Format the namenode (Only on Namenode)
$> sudo -u hdfs hadoop namenode -format

Step 11: Start the HDFS processes

On Namenode
$> sudo /etc/init.d/hadoop-hdfs-namenode start
$> sudo /etc/init.d/hadoop-hdfs-secondarynamenode start

On Datanode
$> sudo /etc/init.d/hadoop-hdfs-datanode start

Step 12: Create directories in HDFS (Only one member should do this)
$> sudo -u hdfs hadoop fs -mkdir /user/training
$> sudo -u hdfs hadoop fs -chown training /user/training

Step 13: Create directories for MapReduce (Only one member should do this)
$> sudo -u hdfs hadoop fs -mkdir /mapred/system
$> sudo -u hdfs hadoop fs -chown mapred:hadoop \
> /mapred/system

Step 14: Start the MapReduce processes


On Namenode
$> sudo /etc/init.d/hadoop-0.20-mapreduce-jobtracker start

On Slave nodes
$> sudo /etc/init.d/hadoop-0.20-mapreduce-tasktracker start



Step 15: Verify the cluster
Visit http://namenode:50070 and check the number of live nodes
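
The same check can be done from the command line; the HDFS admin report lists every datanode and its status (run it as the hdfs user on the namenode):

$> sudo -u hdfs hadoop dfsadmin -report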
