Support of Hadoop Cluster Installation and Administration

The document provides a step-by-step guide for installing Hadoop on a single node and in a multi-node setup, as well as Apache Hive, Flume, Spark, Sqoop and HBase. It covers updating repositories, installing the required software, configuring Hadoop and the other tools, and setting up network connections for the multi-node installation, together with troubleshooting tips and commands for managing Hadoop services and data processing tasks.


I/Hadoop Single Node Installation


1) Update repository files with sudo apt-get update
- apt-get is a command-line tool for installing and removing packages from an APT repository, i.e. a source (or server) of software. These tools centralize and simplify software management, and they let distributors (those who maintain the repositories) deliver updates through a single channel.
- apt-get update: the "update" option refreshes the list of packages available in the APT repositories listed in the /etc/apt/sources.list configuration file. Running it regularly is good practice to keep the list of available packages up to date.
- apt-get install: installs a package.
2) Install java with sudo apt-get install default-jdk.
apt-get install default-jdk: installs the distribution's default (most recent) Java JDK package.
3) Check the installation of java with: java -version
4) Install ssh with sudo apt-get install ssh
OpenSSH is a free implementation of the SSH protocol suite, the network connectivity tools that a growing number of people on the Internet rely on.
Many users of Telnet, Rlogin, FTP and similar programs do not realize that their data, and in particular their passwords, are transmitted over the network in the clear, which is an obvious security flaw.
OpenSSH encrypts all traffic, including passwords, through a combination of symmetric and asymmetric encryption, and it also provides authentication methods other than the traditional password.
As the name suggests, OpenSSH is developed as part of the OpenBSD project.
5) Install rsync with sudo apt-get install rsync
- rsync (for remote synchronization) is a file synchronization tool. It is frequently used to set up remote backup systems.
- rsync works unidirectionally: it synchronizes, copies or updates data from a source (local or remote) to a destination (local or remote), transferring only the parts of files that have changed. For example, see the command below.
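For illustration only (the remote host and paths below are hypothetical and not part of the Hadoop setup), a typical rsync backup command looks like this:
rsync -avz /home/<name>/datainput/ backupuser@backuphost:/backups/datainput/
Here -a preserves permissions and timestamps, -v lists the transferred files and -z compresses the data during transfer.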
6) Generate an encryption key with ssh-keygen -t rsa -P ""
- RSA encryption (named by the initials of its three inventors) is an asymmetric cryptography
algorithm, widely used in electronic commerce, and more generally to exchange confidential
data on the Internet.
- -P "": i.e. with an empty passphrase.
7) Copy this key to "authorized_keys" with cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
8) Check the installation with ssh localhost.
9) Download a version 2.X of Hadoop.
10) Go to the Download directory and extract the zipped folder with sudo tar -zxvf hadoop-
…tar.gz or just right click and extract here.
11) Create the hadoop folder in /usr/local with sudo mkdir /usr/local/hadoop
12) Move the source folder of hadoop with sudo mv hadoop-… /usr/local/hadoop.
13) Note the name and the path of the Java JDK, /usr/lib/jvm/java-8-….
14) Add the environment variables below to the .bashrc file
Note: .bashrc is a hidden file in your home folder that is read by every new terminal; it is commonly used to define aliases (shorthand for long or repetitive commands) and environment variables. Open it
with sudo gedit (or nano) .bashrc (hidden file located in /home/<name>/.bashrc) and write these lines at the bottom of the file:
#HADOOP VARIABLES START
export JAVA_HOME=/….
export HADOOP_INSTALL=/….
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
#HADOOP VARIABLES END
15) For these variables to take effect, either restart the terminal or run: source ~/.bashrc
16) Modify the hadoop startup file "hadoop-env.sh" by adding the path of JAVA_HOME:
export JAVA_HOME=/…
with the command sudo nano $HADOOP_INSTALL/etc/hadoop/hadoop-env.sh
or sudo nano /usr/local/hadoop/hadoop<version>/etc/hadoop/hadoop-env.sh
or sudo gedit /usr/local/hadoop/hadoop<version>/etc/hadoop/hadoop-env.sh
Note: The hadoop-env.sh file sets the environment used by the Hadoop daemons, which, in programming terms, are processes that run in the background. Hadoop 2.x has five of them:
- the NameNode,
- the Secondary NameNode,
- the DataNode,
- the ResourceManager,
- the NodeManager.
(The ResourceManager and NodeManager replace the JobTracker and TaskTracker of Hadoop 1.x.)
17) Modify the core-site.xml file by adding:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
with the command sudo nano $HADOOP_INSTALL/etc/hadoop/core-site.xml
Note: The core-site.xml file informs hadoop daemons that a namenode is running on the cluster
by mentioning its address.
18) Modify the hdfs-site.xml file by adding:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>/home/…/name</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/…/data</value>
</property>
</configuration>
Note: the hdfs-site.xml file sets the HDFS replication factor (1 here, since there is a single node) and the local directories where the NameNode metadata and the DataNode blocks are stored.
19) Make a copy of the mapred-site.xml.template file under the name mapred-site.xml with the command: sudo cp /usr/local/hadoop/hadoop-<version>/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/hadoop-<version>/etc/hadoop/mapred-site.xml
Note: the mapred-site.xml file informs the MapReduce package that it will run as a yarn
application (separation between resource management and job management).
20) Modify the mapred-site.xml file by adding:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
21) Modify the yarn-site.xml file by adding:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
Note: the yarn-site.xml file tells the NodeManager to run the mapreduce_shuffle auxiliary service, which handles the shuffle phase of MapReduce jobs.
22) Format the hdfs system with: hdfs namenode -format.
23) Check the active services (before starting hadoop) with jps.
24) Start the hadoop system (also hdfs) with start-all.sh.
25) Check active services after startup.
26) With your web browser, you can access the NameNode web interface via http://localhost:50070/
27) Create a directory in hdfs to put the input and result files with hdfs dfs -mkdir /user/
28) Create a datainput directory in your home directory with mkdir ~/datainput. This folder will
serve as a pool where we will put the files to be analyzed.
29) Create a texte.txt file (containing several words) in this datainput folder with cat > ~/datainput/texte.txt, type a few words, then press Ctrl+D to finish.
30) Transfer the files to process with hdfs dfs -put /home/<name>/datainput /user/input
31) Launch an example jar (wordcount) with hadoop jar /usr/local/hadoop/hadoop-
…/share/hadoop/mapreduce/hadoop-mapreduce-examples-….jar wordcount /user/input/
/user/output
32) Get the result with hdfs dfs -cat /user/output/*
33) Stop hadoop with stop-all.sh
34) Clone this machine to have a machine ready in case the original one has problems due to
the installation of new software. To perform this action, switch to the virtual box interface,
right-click on the configured machine and choose "clone".
Notes:
- In case of access problems (for example due to the firewall or IPv6), open the NameNode web port with: sudo iptables -A INPUT -p tcp --dport 50070 -j ACCEPT
- If HDFS remains stuck in safe mode, you can leave it with: hdfs dfsadmin -safemode leave
- Machine Password : hadoop

II/Hadoop Multi-Nodes Installation


1) Clone the virtual machine of single node installation (another time).
2) Open the Settings of machine1, go to Network, check “Enable Network Adapter”, set “Attached to” to Internal Network, enter “Hadoop Multinode Network” as the network name, click Advanced, set “Promiscuous Mode” to “Allow All” and finally check “Cable Connected”.
3) Repeat the same process for machine2.
4) Start machine1.
5) Launch the terminal and add the two lines below in the /etc/hosts file:
192.168.0.1 master
192.168.0.2 slave1
With the sudo nano /etc/hosts command
6) Modify the hostname file by putting master in place of <machine name> with the command
sudo nano /etc/hostname
7) Start machine2 and repeat step 5 for it.
8) Modify the hostname file by putting slave1 in place of <machine name> with the command
sudo nano /etc/hostname
9) Go back to the master machine and click on the "network" symbol and choose "modify
network connections"
10) Click on wired connection 1 then on the "modify" button
11) Enter a connection name (Master connection for example).
12) Go to the IPV4 tab and choose "manual" method
13) Add the address 192.168.0.1 with 24 as a mask.
14) Go back to the "general" tab and check "All users can connect to this network" and save.
15) Restart machine1 (which is now called master).
16) Go to machine2 and do the same steps (from 9 to 14) as the master machine by adding the
ip address 192.168.0.2 with 24 as a mask and save.
17) Restart machine2 (which is now called slave1).
18) Generate new rsa key for both machines, copy each key to
/home/<name>/.ssh/authorized_keys (to skip the local password request) and make copies of
each key in the other machine (to skip the password request during network connection) with
the commands:
On the master machine:
ssh-keygen -t rsa -P ""
cat /home/<name>/.ssh/id_rsa.pub >> /home/<name>/.ssh/authorized_keys
ssh-copy-id -i /home/<name>/.ssh/id_rsa.pub slave1
On the slave1 machine:
ssh-keygen -t rsa -P ""
cat /home/<name>/.ssh/id_rsa.pub >> /home/<name>/.ssh/authorized_keys
ssh-copy-id -i /home/<name>/.ssh/id_rsa.pub master
19) Check the connectivity of the two machines with mutual ssh; ssh master and ssh slave1
followed by exit each time.
20) Go to the master machine, update the slaves file by adding the following two lines:
master
slave1
with the command sudo gedit /usr/local/hadoop/hadoop-…/etc/hadoop/slaves
21) Again on the master machine, update the two files (as sketched below):
- core-site.xml: replace localhost with master
- hdfs-site.xml: change the replication factor from 1 to 2
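A minimal sketch of the modified properties, assuming the same files as in the single-node installation:
In core-site.xml:
<property>
<name>fs.defaultFS</name>
<value>hdfs://master:9000</value>
</property>
In hdfs-site.xml:
<property>
<name>dfs.replication</name>
<value>2</value>
</property>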
22) Repeat step 20 on slave1.
23) Repeat step 21 on slave1 (remove the namenode property from the hdfs-site.xml file).
24) Format the namenode of the master machine with hdfs namenode -format.
25) Start the hadoop system (including HDFS) with start-all.sh on the master.
26) Check the active services after startup for both machines.

III/Apache Hive Installation


1) Download Apache Hive 2.3.3 from the link: http://www-eu.apache.org/dist/hive/hive-2.3.3/.
2) Go to Downloads and extract the hive installation folder with: tar zxvf apache-hive-2.3.3-
bin.tar.gz.
3) Create a hive folder in the /usr/local location with: mkdir /usr/local/hive (in case of problems, change the access rights of the destination folder with: sudo chmod -R 777 /usr/local).
4) Move the hive installation folder to the newly created folder with: mv apache-hive-2.3.3-bin /usr/local/hive.
5) Add the lines below to the bashrc with: sudo gedit .bashrc
export HIVE_HOME=/usr/local/hive/apache-hive-2.3.3-bin
export HIVE_CONF_DIR=/usr/local/hive/apache-hive-2.3.3-bin/conf
export PATH=$HIVE_HOME/bin:$PATH
export CLASSPATH=$CLASSPATH:/usr/local/hadoop/hadoop-…/lib/*:.
export CLASSPATH=$CLASSPATH:/usr/local/hive/apache-hive-2.3.3-bin/lib/*:.
6) Enter the modifications of the bashrc in force with: source .bashrc.
7) Access the hive configuration folder with: cd $HIVE_HOME/conf.
8) Copy the starter template hive-env.sh.template file by calling the copy hive-env.sh with: cp
hive-env.sh.template hive-env.sh.
9) Edit the hive-env.sh startup file with: gedit hive-env.sh and add the following line: export HADOOP_HOME=/usr/local/hadoop/hadoop-…
10) Download the derby-10 database server (used to manage the Hive metastore).
11) Go to Downloads and extract the derby installation folder with: tar zxvf db-derby-
10.14.2.0-bin.tar.gz.
12) Create a derby folder in the /usr/local location with: mkdir /usr/local/derby.
13) Move the derby installation folder to the new folder with: mv db-derby-10.14.2.0-bin /usr/local/derby.
14) Add the lines below to the bashrc with: sudo gedit .bashrc:
export DERBY_HOME=/usr/local/derby/db-derby-10.14.2.0-bin/
export PATH=$PATH:$DERBY_HOME/bin
export CLASSPATH=$CLASSPATH:$DERBY_HOME/lib/derby.jar:$DERBY_HOME/lib/derbytools.jar
15) Enter the modifications of the bashrc in force with: source .bashrc.
16) Create a folder called data (for metastore storage) in the derby server folder with: mkdir
$DERBY_HOME/data.
17) Access the hive configuration folder with: cd $HIVE_HOME/conf.
18) Copy the template configuration file hive-default.xml.template naming the copy hive-
site.xml with: cp hive-default.xml.template hive-site.xml.
19) Replace the content of the configuration tag with the following with: gedit hive-site.xml:
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby:;databaseName=/usr/local/hive/apache-hive-2.3.3-
bin/metastore_db;create=true</value>
<description>
JDBC connect string for a JDBC metastore.
</description>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
<description>location of default database for the warehouse</description>
</property>
<property>
<name>hive.metastore.uris</name>
<value/>
<description>Thrift URI for the remote metastore. Used by metastore client to connect
to remote metastore.
</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>org.apache.derby.jdbc.EmbeddedDriver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.PersistenceManagerFactoryClass</name>
<value>org.datanucleus.api.jdo.JDOPersistenceManagerFactory</value>
<description>class implementing the jdo persistence</description>
</property>
</configuration>
20) Format the namenode with: hdfs namenode -format.
21) Start hadoop with: start-all.sh.
22) Deactivate safe mode with: hdfs dfsadmin -safemode leave.
23) Create a Hive warehouse in HDFS. With the commands:
hdfs dfs -mkdir /tmp
hdfs dfs -mkdir /user/hive/
hdfs dfs -mkdir /user/hive/warehouse
hdfs dfs -chmod g+w /tmp
hdfs dfs -chmod g+w /user/hive/warehouse
24) Access the hive bin folder with: cd $HIVE_HOME/bin and initialize the derby metastore
with: schematool -dbType derby -initSchema.
Start the hive shell with: hive.
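As a quick sanity check (the query below is only an illustration), a HiveQL statement can also be run directly from the terminal once the metastore is initialized:
hive -e "SHOW DATABASES;"
If the installation and the metastore are working, this should list at least the default database.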

IV/Apache Flume Installation


1) Download Apache Flume from the following link: https://flume.apache.org/download.html.
2) Create the flume installation folder with: sudo mkdir /usr/local/flume
3) Give all permissions to the flume installation folder with:
sudo chmod -R 777 /usr/local/flume
4) Copy the apache flume archive from Downloads to the installation folder created previously with: cp ~/Downloads/apache-flume-1.8.0-bin.tar.gz /usr/local/flume
5) Go to /usr/local/flume and extract the apache flume directory with: tar -xzf apache-flume-
1.8.0-bin.tar.gz.
6) Add the lines below to the bashrc with: gedit ~/.bashrc
#FLUME VAR
export FLUME_HOME="/usr/local/flume/apache-flume-1.8.0-bin"
export FLUME_CONF_DIR="$FLUME_HOME/conf"
export FLUME_CLASSPATH="$FLUME_CONF_DIR"
export PATH="$FLUME_HOME/bin:$PATH"
#FLUME VAR END
7) Update changes at bashrc level with source .bashrc
8) Rename the flume-env.sh.template configuration file (in the same configuration folder) to flume-env.sh with:
mv $FLUME_CONF_DIR/flume-env.sh.template $FLUME_CONF_DIR/flume-env.sh
9) Open the flume config file with: gedit $FLUME_CONF_DIR/flume-env.sh and add the path
of the jdk: JAVA_HOME="/usr/lib/jvm/java-…"
10) Check the installation with: flume-ng --help.
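As an optional quick test (a minimal sketch: the file name netcat-example.conf, the agent name a1 and the port 44444 are hypothetical), you can create a simple agent configuration in $FLUME_CONF_DIR with a netcat source, a memory channel and a logger sink:
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.channels.c1.type = memory
a1.sinks.k1.type = logger
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
and start the agent with:
flume-ng agent --conf $FLUME_CONF_DIR --conf-file $FLUME_CONF_DIR/netcat-example.conf --name a1 -Dflume.root.logger=INFO,console
Text typed into a telnet localhost 44444 session should then appear in the agent's console log.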

V/Apache Spark Installation


1) Download Apache Spark from the following link:
https://spark.apache.org/downloads.html
2) Install (if it is not installed yet) java with sudo apt-get install default-jdk
3) Create a spark folder in /usr/local with: mkdir /usr/local/spark (in case of permission problem
you can use: sudo chmod -R 777 /usr/local).
4) Go to Downloads and extract the spark installation folder with: tar -zxvf spark-2.3.1-bin-
hadoop2.7.tgz
5) Move the extracted folder to /usr/local/spark with: mv spark-2.3.1-bin-hadoop2.7 /usr/local/spark
6) Open the bashrc with: sudo gedit .bashrc and add the following lines:
#spark
export SPARK_HOME=/usr/local/spark/spark-2.3.1-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
#end spark
7) Start spark in python mode with: pyspark
--- Installation with Yarn ---
1) Add the following two lines to the bashrc:
export HADOOP_CONF_DIR=$HADOOP_INSTALL/etc/hadoop
export YARN_CONF_DIR=$HADOOP_INSTALL/etc/hadoop
2) Update the bashrc with source .bashrc
3) Access the conf folder of spark with cd $SPARK_HOME/conf
4) Copy the spark-defaults.conf.template file to have another file with the name spark-
defaults.conf (with cp spark-defaults.conf.template spark-defaults.conf).
5) Modify the spark-defaults.conf file (with gedit spark-defaults.conf) by setting the master to yarn:
spark.master yarn
6) In the conf folder of spark, copy the spark-env.sh.template file to have another file with the
name spark-env.sh (with cp spark-env.sh.template spark-env.sh).
7) Modify the spark-env.sh file by adding:
export HADOOP_CONF_DIR=$HADOOP_INSTALL/etc/hadoop
export YARN_CONF_DIR=$HADOOP_INSTALL/etc/hadoop
8) Format the namenode with hdfs namenode -format
9) Start Hadoop services with start-all.sh
10) Exit safemode mode with hdfs dfsadmin -safemode leave
Note: the default HDFS working folder is hdfs://localhost:9000/user/<user name>.
To read a local file instead, use the prefix file:///home/....
11) In yarn mode, applications are submitted from the terminal with the spark-submit command, as in the sketch below.
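A minimal sketch of a submission (assuming the pi.py example shipped with the Spark distribution is present):
spark-submit --master yarn --deploy-mode client $SPARK_HOME/examples/src/main/python/pi.py 10
With spark.master set to yarn in spark-defaults.conf, the --master option may be omitted.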

VI/Apache SQOOP Installation


1) Update the download repositories with: sudo apt-get update.
2) Start the installation of mysql with: sudo apt-get install mysql-server.
3) Start the mysql_secure_installation utility (password: hadoop).
4) Start the mysql server with: sudo systemctl start mysql.
5) To have the server start automatically after a reboot, run: sudo systemctl enable mysql.
6) Start the mysql shell with: mysql -u root -p
7) Download sqoop from https://sqoop.apache.org/ (version: sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz).
8) Start the hadoop cluster with: hdfs namenode -format (answer N if asked to re-format the existing filesystem) followed by start-all.sh.
9) Disable safe mode with hdfs dfsadmin -safemode leave.
10) Go to Downloads and unzip the sqoop file with: tar -zxvf sqoop-1.4.7.bin__hadoop-
2.6.0.tar.gz.
11) Create the folder that will contain sqoop with: sudo mkdir /usr/local/sqoop.
12) Move the unzipped sqoop folder from Downloads to /usr/local/sqoop with: sudo mv sqoop-1.4.7.bin__hadoop-2.6.0 /usr/local/sqoop.
13) Edit the bashrc with "gedit .bashrc" to add the following lines:
#Sqoop
export SQOOP_HOME=/usr/local/sqoop/sqoop-1.4.7.bin__hadoop-2.6.0
export PATH=$PATH:$SQOOP_HOME/bin
14) Source the bashrc with: source .bashrc
15) Access the sqoop conf folder with: cd $SQOOP_HOME/conf
16) Rename sqoop-env-template.sh to sqoop-env.sh with: mv sqoop-env-template.sh sqoop-env.sh
17) Edit this file with: gedit sqoop-env.sh and add the following two lines:
export HADOOP_COMMON_HOME=/usr/local/hadoop/hadoop-2.7.7
export HADOOP_MAPRED_HOME=/usr/local/hadoop/hadoop-2.7.7
18) Download the mysql connector from https://dev.mysql.com/downloads/connector/j/5.1.html.
19) Unzip this connector with: tar -zxvf mysql-connector-java-5.1.47.tar.gz.
20) Access the unzipped folder with cd mysql-connector-java-5.1.47 and move the jar file with:
mv mysql-connector-java-5.1.47-bin.jar /usr/local/sqoop/sqoop-1.4.7.bin__hadoop-2.6.0/lib
21) Check the installation with sqoop-version.
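To check that the MySQL connector is visible to sqoop (a sketch; adjust the user and password to your MySQL setup), you can list the databases of the local server:
sqoop list-databases --connect jdbc:mysql://localhost:3306/ --username root -P
Sqoop prompts for the password and should print the list of MySQL databases.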

VII/Apache Hbase Installation


1) Download Apache HBase (for example hbase-1.1.0-bin.tar.gz via the archive) from this link: https://hbase.apache.org/downloads.html.
2) Go to Downloads with: cd Downloads and extract the contents of the Hbase installation
folder with: tar -zxvf hbase-1.1.0-bin.tar.gz.
3) Move the folder resulting from the extraction to /usr/local, naming it hbase, with:
mv hbase-1.1.0 /usr/local/hbase
4) Modify the hbase startup file "hbase-env.sh" by adding the following line:
export JAVA_HOME=/usr/lib/jvm/java-…
with the command: gedit /usr/local/hbase/conf/hbase-env.sh
5) Modify the hbase-site.xml file by adding:
<configuration>
<property>
<name>hbase.rootdir</name>
<value>hdfs://localhost:9000/hbase</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/home/<user name>/zookeeper</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
</configuration>
with: gedit /usr/local/hbase/conf/hbase-site.xml.
6) Add the environment variables below to the .bashrc file
# - HBASE ENVIRONMENT VARIABLES START - #
export HBASE_HOME=/usr/local/hbase
export PATH=$PATH:$HBASE_HOME/bin
# - HBASE ENVIRONMENT VARIABLES END - #
with: gedit .bashrc.
7) Update the bashrc with the command: source ~/.bashrc
8) Format the namenode with: hdfs namenode -format.
9) Start the hadoop system with: start-all.sh and check the services with: jps.
10) Start hbase with: start-hbase.sh.
11) Check the services with: jps.
12) Start the hbase shell with: hbase shell.
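As a quick check (the table name testtable is hypothetical), you can create, fill and scan a small table by piping commands into the HBase shell:
echo "create 'testtable', 'cf'" | hbase shell
echo "put 'testtable', 'row1', 'cf:col1', 'value1'" | hbase shell
echo "scan 'testtable'" | hbase shell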
