Unix Commands Part 2
Documentation
Release 1.0.1
Oshin Prem
Contents
1 HADOOP INSTALLATION
1.1 SINGLE-NODE INSTALLATION
1.2 MULTI-NODE INSTALLATION
2 HIVE INSTALLATION
2.1 INTRODUCTION
2.2 Hive Installation
3 SQOOP INSTALLATION
3.1 INTRODUCTION
3.2 Stable release and Download
3.3 Prerequisites
3.4 Installation
HADOOP INSTALLATION
This section refers to the installation settings of Hadoop on a standalone system as well as on a system existing as a
node in a cluster.
SINGLE-NODE INSTALLATION
The report here will describe the required steps for setting up a single-node Hadoop cluster backed by the Hadoop
Distributed File System, running on Ubuntu Linux. Hadoop is a framework written in Java for running applications on
large clusters of commodity hardware and incorporates features similar to those of the Google File System (GFS) and
of the MapReduce computing paradigm. Hadoop’s HDFS is a highly fault-tolerant distributed file system and, like
Hadoop in general, designed to be deployed on low-cost hardware. It provides high throughput access to application
data and is suitable for applications that have large data sets.
Before we start, let us understand the meaning of the following terms:
DataNode:
A DataNode stores data in the Hadoop File System. A functional file system has more than one DataNode, with the
data replicated across them.
NameNode:
The NameNode is the centrepiece of an HDFS file system. It keeps the directory of all files in the file system, and
tracks where across the cluster the file data is kept. It does not store the data of these files itself.
JobTracker:
The JobTracker is the service within Hadoop that farms out MapReduce tasks to specific nodes in the cluster, ideally the
nodes that have the data, or at least nodes that are in the same rack.
TaskTracker:
A TaskTracker is a node in the cluster that accepts tasks (Map, Reduce and Shuffle operations) from a JobTracker.
Secondary Namenode:
The whole purpose of the Secondary NameNode is to have a checkpoint in HDFS. It is just a helper node for the NameNode.
Prerequisites
Java 6 JDK
Install the OpenJDK 6 JDK, or alternatively install the Sun Java 6 JDK.
Note:
If you already have a Java JDK installed on your system, then you need not run the installation command.
To install the OpenJDK package, run the following command.
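A hedged sketch of the installation on Ubuntu, assuming the OpenJDK 6 package (adjust if you install the Sun/Oracle JDK instead):
user@ubuntu:~$ sudo apt-get update
user@ubuntu:~$ sudo apt-get install openjdk-6-jdk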
The full JDK will be placed in /usr/lib/jvm/java-6-openjdk-amd64. After installation, check whether the Java JDK
is correctly installed or not, with the following command.
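A sketch of the check (the version string printed depends on the installed JDK build):
user@ubuntu:~$ java -version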
Next, add a dedicated Hadoop system user. The commands sketched below add the user hduser1 and the group hadoop_group to the local machine, and then add hduser1 to the sudo group.
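A minimal sketch of these commands, using the standard Ubuntu addgroup/adduser tools:
user@ubuntu:~$ sudo addgroup hadoop_group
user@ubuntu:~$ sudo adduser --ingroup hadoop_group hduser1
user@ubuntu:~$ sudo adduser hduser1 sudo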
Configuring SSH
The Hadoop control scripts rely on SSH to perform cluster-wide operations. For example, there is a script for stopping
and starting all the daemons in the cluster. To work seamlessly, SSH needs to be set up to allow password-less login
for the Hadoop user from machines in the cluster. The simplest way to achieve this is to generate a public/private key
pair that will be shared across the cluster.
Hadoop requires SSH access to manage its nodes, i.e. remote machines plus your local machine. For our single-node
setup of Hadoop, we therefore need to configure SSH access to localhost for the hduser1 user we created earlier.
We have to generate an SSH key for the hduser1 user.
user@ubuntu:~$ su - hduser1
hduser1@ubuntu:~$ ssh-keygen -t rsa -P ""
The second line will create an RSA key pair with an empty password.
Note:
The final step is to test the SSH setup by connecting to the local machine with the hduser1 user. The step is also needed
to save your local machine's host key fingerprint to the hduser1 user's known_hosts file.
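A sketch of enabling and testing the password-less login, assuming the key pair generated above sits in the default location:
hduser1@ubuntu:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
hduser1@ubuntu:~$ ssh localhost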
INSTALLATION
Main Installation
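A rough sketch of a typical sequence for this step, assuming a downloaded Hadoop tarball (the archive name is an assumption) and the /usr/local/hadoop install path used in the rest of this chapter:
user@ubuntu:~$ sudo tar -xzf hadoop-2.2.0.tar.gz -C /usr/local
user@ubuntu:~$ cd /usr/local
user@ubuntu:/usr/local$ sudo mv hadoop-2.2.0 hadoop
user@ubuntu:/usr/local$ sudo chown -R hduser1:hadoop_group hadoop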
Configuration
hadoop-env.sh
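The change required here is typically setting JAVA_HOME; a sketch, assuming the OpenJDK path mentioned in the prerequisites (add this line to conf/hadoop-env.sh):
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64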
conf/*-site.xml
Now we create the directory and set the required ownerships and permissions
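A sketch of the typical commands, reusing the hduser1 user and hadoop_group group created earlier (the exact permission bits are an assumption):
user@ubuntu:~$ sudo mkdir -p /app/hadoop/tmp
user@ubuntu:~$ sudo chown hduser1:hadoop_group /app/hadoop/tmp
user@ubuntu:~$ sudo chmod 750 /app/hadoop/tmp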
The last line gives reading and writing permissions to the /app/hadoop/tmp directory
• Error: If you forget to set the required ownerships and permissions, you will see a java.io.IOException when
you try to format the NameNode.
Paste the following between the <configuration> and </configuration> tags.
• In file conf/core-site.xml
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
• In file conf/mapred-site.xml
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
• In file conf/hdfs-site.xml
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
To format the filesystem (which simply initializes the directory specified by the dfs.name.dir variable), run the
following command.
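A sketch of the format command, assuming the /usr/local/hadoop install path used throughout this chapter:
hduser@ubuntu:~$ /usr/local/hadoop/bin/hadoop namenode -format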
Before starting the cluster, we need to give the required permissions to the directory with the following command
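A hedged guess at the intended command, making hduser1 the owner of the Hadoop installation directory:
hduser@ubuntu:~$ sudo chown -R hduser1:hadoop_group /usr/local/hadoop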
hduser@ubuntu:~$ /usr/local/hadoop/bin/start-all.sh
This will start up a NameNode, DataNode, JobTracker and a TaskTracker on your machine.
hduser@ubuntu:/usr/local/hadoop$ jps
Errors:
1. If by chance your DataNode is not starting, then you have to erase the contents of the folder /app/hadoop/tmp. A
command that can be used is sketched below.
2. You can also check with netstat whether Hadoop is listening on the configured ports; a command that can be
used is also sketched below.
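Hedged sketches of these commands (the tmp path follows the hadoop.tmp.dir value configured above):
hduser@ubuntu:~$ sudo rm -rf /app/hadoop/tmp/*
hduser@ubuntu:~$ sudo netstat -plten | grep java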
Run the command to stop all the daemons running on your machine.
hduser@ubuntu:~$ /usr/local/hadoop/bin/stop-all.sh
ERROR POINTS:
If the DataNode is not starting, then clear the tmp folder before formatting the NameNode, using the rm command shown under Errors above.
MULTI-NODE INSTALLATION
We will build a multi-node cluster by merging two or more single-node clusters into one multi-node cluster, in which one
Ubuntu box will become the designated master (but also act as a slave), and the other box will become only a slave.
Prerequisites
Configure single-node clusters first; here we have used two single-node clusters. Shut down each single-node cluster
with the following command
user@ubuntu:~$ bin/stop-all.sh
Networking
• The easiest way is to put both machines in the same network with regard to hardware and software configuration.
• Update /etc/hosts on both machines. Put aliases for the IP addresses of all the machines. Here we are creating a
cluster of 2 machines; one is the master and the other is slave1.
hduser@master:~$ sudo gedit /etc/hosts
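A sketch of the resulting /etc/hosts entries (the IP addresses are placeholders; use the actual addresses of your two machines):
192.168.0.1    master
192.168.0.2    slave1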
SSH access
The hduser user on the master (aka hduser@master) must be able to connect:
1. to its own user account on the master - i.e. ssh master in this context.
2. to the hduser user account on the slave (i.e. hduser@slave1) via a password-less SSH login.
• Add the hduser@master public SSH key to the hduser account on the slave, using the command sketched after this list.
• Connect with user hduser from the master to the user account hduser on the slave.
1. From master to master
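Hedged sketches of these steps (ssh-copy-id is one common way to distribute the key; hostnames follow the /etc/hosts entries above):
hduser@master:~$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@slave1
hduser@master:~$ ssh master
hduser@master:~$ ssh slave1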
Hadoop
Cluster Overview
This will describe how to configure one Ubuntu box as a master node and the other Ubuntu box as a slave node.
Configuration
conf/masters
The machine on which bin/start-dfs.sh is running will become the primary NameNode. This file should be updated on
all the nodes. Open the masters file in the conf directory
master
conf/slaves
This file should be updated on all the nodes as master is also a slave. Open the slaves file in the conf directory
master
slave1
Change the fs.default.name parameter (in conf/core-site.xml), which specifies the NameNode (the HDFS master) host
and port.
conf/core-site.xml (ALL machines, i.e. master as well as slave)
<property>
<name>fs.default.name</name>
<value>hdfs://master:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
conf/mapred-site.xml
hduser@master:~$ cd /usr/local/hadoop/conf
hduser@master:~$ sudo gedit mapred-site.xml
Change the mapred.job.tracker parameter (in conf/mapred-site.xml), which specifies the JobTracker (MapReduce master) host and port.
conf/mapred-site.xml (ALL machines)
<property>
<name>mapred.job.tracker</name>
<value>master:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
conf/hdfs-site.xml
hduser@master:~$ cd /usr/local/hadoop/conf
hduser@master:~$ sudo gedit hdfs-site.xml
Change the dfs.replication parameter (in conf/hdfs-site.xml) which specifies the default block replication. We have
two nodes available, so we set dfs.replication to 2.
Changes to be made
<property>
<name>dfs.replication</name>
<value>2</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
hduser@master:~$ cd /usr/local/hadoop
hduser@master:~$ bin/start-all.sh
By this command:
• The NameNode daemon is started on master, and DataNode daemons are started on all slaves (here: master and
slave).
• The JobTracker is started on master, and TaskTracker daemons are started on all slaves (here: master and slave)
To check the daemons that are running, run the following commands
hduser@master:~$ jps
hduser@slave:/usr/local/hadoop$ jps
hduser@master:~$ cd /usr/local/hadoop
hduser@master:~/usr/local/hadoop$ bin/stop-all.sh
ERROR POINTS:
1. The number of slaves must equal the number of replications (dfs.replication) in hdfs-site.xml; here, the number of
slaves = all slaves + the master (if the master is also considered to be a slave).
2. When you start the cluster, clear the tmp directory on all the nodes (master + slaves) using the command sketched after this list.
3. The configuration of the /etc/hosts, masters and slaves files should be the same on both the master and the slave
nodes.
4. If the NameNode is not getting started, run the commands sketched after this list:
• one to give all permissions of the hadoop folder to hduser, and
• one to delete the junk files which get stored in the tmp folder of hadoop.
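Hedged sketches of the commands referred to above (the ownership command is a guess at the original; adjust user, group and paths to your setup). The first grants hduser full rights on the hadoop folder, the second deletes the junk files in hadoop's tmp folder:
hduser@master:~$ sudo chown -R hduser1:hadoop_group /usr/local/hadoop
hduser@master:~$ sudo rm -rf /app/hadoop/tmp/*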
HIVE INSTALLATION
This section refers to the installation settings of Hive on a standalone system as well as on a system existing as a node
in a cluster.
INTRODUCTION
Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization,
query, and analysis. Apache Hive supports analysis of large datasets stored in Hadoop's HDFS and compatible
file systems such as the Amazon S3 filesystem. It provides an SQL-like language called HiveQL (Hive
Query Language) while maintaining full support for map/reduce.
Hive Installation
Installing HIVE:
user@ubuntu:~$ cd /usr/lib/
user@ubuntu:~$ sudo mkdir hive
user@ubuntu:~$ cd ~/Downloads
user@ubuntu:~$ sudo mv apache-hive-0.13.0-bin /usr/lib/hive
Commands
user@ubuntu:~$ cd
user@ubuntu:~$ sudo gedit ~/.bashrc
# Set HIVE_HOME
export HIVE_HOME="/usr/lib/hive/apache-hive-0.13.0-bin"
PATH=$PATH:$HIVE_HOME/bin
export PATH
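After saving ~/.bashrc, reload it so that the new variables take effect in the current shell; a small sketch:
user@ubuntu:~$ source ~/.bashrc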
Commands
user@ubuntu:~$ cd /usr/lib/hive/apache-hive-0.13.0-bin/bin
user@ubuntu:~$ sudo gedit hive-config.sh
Command
Command
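A hedged sketch of the usual edit to hive-config.sh, assuming the Hadoop install path used in the previous chapter (append this line to the file):
export HADOOP_HOME=/usr/local/hadoop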
HIVE launch
Command
user@ubuntu:~$ hive
OUTPUT
hive>
Creating a database
Command
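A minimal sketch of such a statement, with a hypothetical database name:
hive> create database demo;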
OUTPUT
OK
Time taken: 0.369 seconds
hive>
Configuring hive-site.xml:
<property>
<name>hive.metastore.local</name>
<value>TRUE</value>
<description>controls whether to connect to remote metastore server or open a new metastore server in Hive Client JVM</description>
</property>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://usr/lib/hive/apache-hive-0.13.0-bin/metastore_db?createDatabaseIfNotExist=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/usr/hive/warehouse</value>
<description>location of default database for the warehouse</description>
</property>
Writing a Script
describe product;
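A minimal sketch of running such a statement from a script file (the file name sample.hql is hypothetical); put the HiveQL statements, for example describe product;, into the file and execute it with hive -f:
user@ubuntu:~$ sudo gedit sample.hql
user@ubuntu:~$ hive -f sample.hql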
SQOOP INSTALLATION
INTRODUCTION
• Sqoop is a tool designed to transfer data between Hadoop and relational databases.
• You can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL
or Oracle into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and
then export the data back into an RDBMS. Sqoop automates most of this process, relying on the database to
describe the schema for the data to be imported. Sqoop uses MapReduce to import and export the data, which
provides parallel operation as well as fault tolerance. This document describes how to get started using Sqoop
to move data between databases and Hadoop and provides reference information for the operation of the Sqoop
command-line tool suite.
Sqoop is an open source software product of the Apache Software Foundation. Sqoop source code is held in the
Apache Git repository.
Prerequisites
Before we can use Sqoop, a release of Hadoop must be installed and configured. Sqoop currently supports 4 major
Hadoop releases - 0.20, 0.23, 1.0 and 2.0. We have installed Hadoop 2.2.0 and it is compatible with Sqoop 1.4.4. We
are using a Linux environment (Ubuntu 12.04) to install and run Sqoop. Basic familiarity with the purpose and
operation of Hadoop is required to use this product.
Installation
export SQOOP_HOME=/usr/lib/sqoop
export PATH=$PATH:$SQOOP_HOME/bin
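A rough sketch of the typical earlier steps, assuming a downloaded Sqoop 1.4.4 tarball (the archive name is an assumption) and the /usr/lib/sqoop path used in SQOOP_HOME above; the two export lines are then added to ~/.bashrc and the file is reloaded:
user@ubuntu:~/Downloads$ tar -xzf sqoop-1.4.4.bin__hadoop-2.0.4-alpha.tar.gz
user@ubuntu:~/Downloads$ sudo mv sqoop-1.4.4.bin__hadoop-2.0.4-alpha /usr/lib/sqoop
user@ubuntu:~$ sudo gedit ~/.bashrc
user@ubuntu:~$ source ~/.bashrc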
9. To check if Sqoop has been installed successfully, type the command
sqoop version
• Run the command: sudo apt-get install mysql-server, and give an appropriate username and password.
The command sketched below imports the employees table from the sqoop database of MySQL to HDFS.
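A sketch of a typical Sqoop import invocation matching that description (the MySQL username and password are placeholders):
sqoop import --connect jdbc:mysql://localhost/sqoop --username <mysql-user> --password <mysql-password> --table employees -m 1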
Error points