Support of Hadoop Cluster Installation and Administration
Note: Aliases are shorthand substitutions for repetitive and/or time-consuming commands typed in the console. You can define your aliases (and environment variables such as the ones below) in a hidden file called .bashrc, located in your home folder.
Open it with sudo gedit (or nano) .bashrc (hidden file located at /home/<name>/.bashrc) and write these lines at the bottom of the file:
#HADOOP VARIABLES START
export JAVA_HOME=/….
export HADOOP_INSTALL=/….
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
#HADOOP VARIABLES END
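Example: as an indication only, assuming OpenJDK 8 (installed via default-jdk) and Hadoop 2.7.7 unpacked under /usr/local/hadoop (the version used later in this guide), the first two lines could look as follows; adjust both paths to your own installation:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_INSTALL=/usr/local/hadoop/hadoop-2.7.7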
15) In order for these changes to take effect, either restart the terminal or run source .bashrc.
16) Modify the hadoop startup file "hadoop-env.sh" by adding the path of JAVA_HOME:
export JAVA_HOME=/…
with the command sudo nano $HADOOP_INSTALL/etc/hadoop/hadoop-env.sh
or sudo nano /usr/local/hadoop/hadoop-<version>/etc/hadoop/hadoop-env.sh
or sudo gedit /usr/local/hadoop/hadoop-<version>/etc/hadoop/hadoop-env.sh
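Note: if you are not sure of the JAVA_HOME path, you can retrieve it with readlink -f $(which java) and drop the trailing /bin/java; with the default-jdk package on Ubuntu it is typically something like /usr/lib/jvm/java-8-openjdk-amd64 (to be checked on your system).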
Note: The hadoop-env.sh file configures the environment of the Hadoop daemons, which, in programming terms, are processes that run in the background. Hadoop has five, namely:
- the NameNode,
- the Secondary NameNode,
- the DataNode,
- the JobTracker,
- the TaskTracker.
17) Modify the core-site.xml file by adding:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
with the command sudo nano $HADOOP_INSTALL/etc/hadoop/core-site.xml
Note: The core-site.xml file informs hadoop daemons that a namenode is running on the cluster
by mentioning its address.
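Example: once the environment variables from step 14 are loaded, you can check that the setting is taken into account with:
hdfs getconf -confKey fs.defaultFS
which should print hdfs://localhost:9000.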
18) Modify the hdfs-site.xml file by adding:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>/home/…/name</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/…/data</value>
</property>
</configuration>
Note: the hdfs-site.xml file tells hadoop and its hdfs system how many copies of each data block to keep (the replication factor) and where the namenode and datanodes store their files.
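Example: the dfs.name.dir and dfs.data.dir folders must exist and be writable by your user. Assuming, for illustration, that you point them at /home/<name>/hadoopdata/name and /home/<name>/hadoopdata/data (hypothetical locations, substitute your own), you could create them with:
mkdir -p /home/<name>/hadoopdata/name /home/<name>/hadoopdata/data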
19) Make a copy of the mapred-site.xml.template file under the name mapred-site.xml with the command:
sudo cp /usr/local/hadoop/hadoop-<version>/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/hadoop-<version>/etc/hadoop/mapred-site.xml
Note: the mapred-site.xml file informs the MapReduce framework that it will run as a YARN application (separation between resource management and job management).
20) Modify the mapred-site.xml file by adding:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
21) Modify the yarn-site.xml file by adding:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
Note: the yarn-site.xml file tells the nodemanager that it will have an auxiliary service telling
MapReduce how to shuffle.
22) Format the hdfs system with: hdfs namenode -format.
23) Check the active services (before starting hadoop) with jps.
24) Start the hadoop system (also hdfs) with start-all.sh.
25) Check active services after startup.
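Example: in this single-machine (pseudo-distributed) setup, jps should typically list something like NameNode, SecondaryNameNode, DataNode, ResourceManager, NodeManager and Jps itself; with YARN, the ResourceManager and NodeManager play the roles of the JobTracker and TaskTrackers mentioned earlier. If one of them is missing, check the log files under $HADOOP_INSTALL/logs.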
26) With your web browser, you can access the NameNode web interface via
http://localhost:50070/
27) Create a directory in hdfs to put the input and result files with hdfs dfs -mkdir /user/
28) Create a datainput directory in your home directory with mkdir ~/datainput. This folder will
serve as a pool where we will put the files to be analyzed.
29) Create a text.txt file (to be filled with several words) in this datainput folder with cat > ~/datainput/text.txt, type a few words, then press Ctrl + D to finish.
30) Transfer the files to process with hdfs dfs -put /home/<name>/datainput /user/input
31) Launch an example jar (wordcount) with hadoop jar /usr/local/hadoop/hadoop-
…/share/hadoop/mapreduce/hadoop-mapreduce-examples-….jar wordcount /user/input/
/user/output
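Example: assuming Hadoop 2.7.7 (the version used later in this guide), the full command would look like:
hadoop jar /usr/local/hadoop/hadoop-2.7.7/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.7.jar wordcount /user/input /user/output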
32) Get the result with hdfs dfs -cat /user/output/*
33) Stop hadoop with stop-all.sh
34) Clone this machine to have a machine ready in case the original one runs into problems due to the installation of new software. To perform this action, switch to the VirtualBox interface, right-click on the configured machine and choose "Clone".
Notes:
- In case of connection problems on the web interface, open port 50070 in the firewall with iptables -A INPUT -p tcp --dport 50070 -j ACCEPT
- If HDFS stays stuck in safe mode, you can run: hdfs dfsadmin -safemode leave
- Machine password: hadoop
18) Generate an SSH key pair on each machine and copy each key to the other machine (to skip the password request during network connections) with the commands:
On the master machine:
ssh-keygen -t rsa -P ""
cat /home/<name>/.ssh/id_rsa.pub >> /home/<name>/.ssh/authorized_keys
ssh-copy-id -i /home/<name>/.ssh/id_rsa.pub slave1
On the slave1 machine:
ssh-keygen -t rsa -P ""
cat /home/<name>/.ssh/id_rsa.pub >> /home/<name>/.ssh/authorized_keys
ssh-copy-id -i /home/<name>/.ssh/id_rsa.pub master
19) Check the connectivity of the two machines with mutual ssh; ssh master and ssh slave1
followed by exit each time.
20) Go to the master machine and update the slaves file by adding the following two lines:
master
slave1
with the command sudo gedit /usr/local/hadoop/hadoop-…/etc/hadoop/slaves
21) Again on the master machine, update the two files:
- core-site.xml, replacing localhost with master
- hdfs-site.xml, changing the replication factor (dfs.replication) from 1 to 2
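Example: after these changes, the relevant properties on the master should read, for instance:
<name>fs.defaultFS</name> <value>hdfs://master:9000</value> (in core-site.xml)
<name>dfs.replication</name> <value>2</value> (in hdfs-site.xml)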
22) Repeat step 20 on slave1.
23) Repeat step 21 on slave1 (remove the namenode storage property, dfs.name.dir, from the hdfs-site.xml file).
24) Format the namenode of the master machine with hdfs namenode -format.
25) Start the hadoop system (hdfs too) with start-all.sh on the master
26) Check the active services after startup for both machines.
11) Go to Downloads and extract the derby installation folder with: tar zxvf db-derby-
10.14.2.0-bin.tar.gz.
12) Create a derby folder in the /usr/local location with: mkdir /usr/local/derby.
13) Move the derby installation folder to /usr/local/derby with: mv db-derby-10.14.2.0-bin /usr/local/derby.
14) Add the lines below to the bashrc with: sudo gedit .bashrc:
export DERBY_HOME=/usr/local/derby/db-derby-10.14.2.0-bin/
export PATH=$PATH:$DERBY_HOME/bin
export CLASSPATH=$CLASSPATH:$DERBY_HOME/lib/derby.jar:$DERBY_HOME/lib/derbytools.jar
15) Apply the bashrc modifications with: source .bashrc.
16) Create a folder called data (for metastore storage) in the derby server folder with: mkdir
$DERBY_HOME/data.
17) Access the hive configuration folder with: cd $HIVE_HOME/conf.
18) Copy the template configuration file hive-default.xml.template naming the copy hive-
site.xml with: cp hive-default.xml.template hive-site.xml.
19) Replace the content of the configuration tag with the following with: gedit hive-site.xml:
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby:;databaseName=/usr/local/hive/apache-hive-2.3.3-
bin/metastore_db;create=true</value>
<description>
JDBC connect string for a JDBC metastore.
</description>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
<description>location of default database for the warehouse</description>
</property>
<property>
<name>hive.metastore.uris</name>
<value/>
<description>Thrift URI for the remote metastore. Used by metastore client to connect
to remote metastore.
</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>org.apache.derby.jdbc.EmbeddedDriver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.PersistenceManagerFactoryClass</name>
<value>org.datanucleus.api.jdo.JDOPersistenceManagerFactory</value>
<description>class implementing the jdo persistence</description>
</property>
</configuration>
20) Format the namenode with: hdfs namenode -format.
21) Start hadoop with: start-all.sh.
22) Deactivate safe mode with: hdfs dfsadmin -safemode leave.
23) Create a Hive warehouse in HDFS. With the commands:
hdfs dfs -mkdir /tmp
hdfs dfs -mkdir /user/hive/
hdfs dfs -mkdir /user/hive/warehouse
hdfs dfs -chmod g+w /tmp
hdfs dfs -chmod g+w /user/hive/warehouse
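Note: if a parent folder such as /user does not exist yet, you can create the whole path in one go with the -p option, for example:
hdfs dfs -mkdir -p /user/hive/warehouse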
24) Access the hive bin folder with: cd $HIVE_HOME/bin and initialize the derby metastore
with: schematool -dbType derby -initSchema.
Start the hive shell with: hive.
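Example: as a quick sanity check (assuming the metastore initialized without errors), you can run a one-off query directly from the terminal with:
hive -e "SHOW DATABASES;"
which should list at least the default database.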
2) Install (if it is not installed yet) java with sudo apt-get install default-jdk
3) Create a spark folder in /usr/local with: mkdir /usr/local/spark (in case of permission problem
you can use: sudo chmod -R 777 /usr/local).
4) Go to Downloads and extract the spark installation folder with: tar -zxvf spark-2.3.1-bin-
hadoop2.7.tgz
5) Move the extracted folder from Downloads to /usr/local/spark with: mv spark-2.3.1-bin-hadoop2.7 /usr/local/spark
6) Open the bashrc with: sudo gedit .bashrc and add the following lines:
#spark
export SPARK_HOME=/usr/local/spark/spark-2.3.1-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
#end spark
7) Start spark in python mode with: pyspark
--- Installation with Yarn ---
1) Add the following two lines to the bashrc:
export HADOOP_CONF_DIR=$HADOOP_INSTALL/etc/hadoop
export YARN_CONF_DIR=$HADOOP_INSTALL/etc/hadoop
2) Update the bashrc with source .bashrc
3) Access the conf folder of spark with cd $SPARK_HOME/conf
4) Copy the spark-defaults.conf.template file to have another file with the name spark-
defaults.conf (with cp spark-defaults.conf.template spark-defaults.conf).
5) Modify the spark-defaults.conf file by activating spark.master with yarn:
spark.master yarn (command: gedit spark-defaults.conf).
6) In the conf folder of spark, copy the spark-env.sh.template file to have another file with the
name spark-env.sh (with cp spark-env.sh.template spark-env.sh).
7) Modify the spark-env.sh file by adding:
export HADOOP_CONF_DIR=$HADOOP_INSTALL/etc/hadoop
export YARN_CONF_DIR=$HADOOP_INSTALL/etc/hadoop
8) Format the namenode with hdfs namenode -format
9) Start Hadoop services with start-all.sh
10) Exit safemode mode with hdfs dfsadmin -safemode leave
Note: the default HDFS folder is: hdfs://localhost:9000/user/<user name>
To read from a local file, use: file:///home/....
11) In yarn mode, you can only run scripts from the terminal, with the spark-submit command.
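Example: with a hypothetical script /home/<name>/myscript.py, the submission would look like:
spark-submit --master yarn /home/<name>/myscript.py
(the --master yarn option can be omitted here, since spark.master is already set to yarn in spark-defaults.conf).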
11) Create the folder that will contain sqoop with: sudo mkdir /usr/local/sqoop.
12) Move the unzipped sqoop folder from Downloads to /usr/local/sqoop with: sudo mv sqoop-1.4.7.bin__hadoop-2.6.0 /usr/local/sqoop.
13) Edit the bashrc with "gedit .bashrc" to add the following lines:
#Sqoop
export SQOOP_HOME=/usr/local/sqoop/sqoop-1.4.7.bin__hadoop-2.6.0
export PATH=$PATH:$SQOOP_HOME/bin
14) Source the bashrc with: source .bashrc
15) Access the sqoop conf folder with: cd $SQOOP_HOME/conf
16) Rename sqoop-env-template.sh to sqoop-env.sh with: mv sqoop-env-template.sh sqoop-env.sh
17) Start the modification of this file with: gedit sqoop-env.sh and add the following two lines:
export HADOOP_COMMON_HOME=/usr/local/hadoop/hadoop-2.7.7
export HADOOP_MAPRED_HOME=/usr/local/hadoop/hadoop-2.7.7
18) Download the mysql connector from
https://dev.mysql.com/downloads/connector/j/5.1.html.
19) Unzip this connector with: tar -zxvf mysql-connector-java-5.1.47.tar.gz.
20) Access the unzipped folder with cd mysql-connector-java-5.1.47 and move the jar file with:
mv mysql-connector-java-5.1.47-bin.jar /usr/local/sqoop/sqoop-1.4.7.bin__hadoop-2.6.0/lib
21) Check the installation with sqoop-version.
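Example: assuming a local MySQL server with a root account (MySQL itself is not covered by this guide), you can test the connector with:
sqoop list-databases --connect jdbc:mysql://localhost:3306/ --username root -P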
export HBASE_HOME=/usr/local/hbase
export PATH=$PATH:$HBASE_HOME/bin
# - HBASE ENVIRONMENT VARIABLES END - #
with: gedit .bashrc.
7) Update the bashrc with the command: source ~/.bashrc
8) Format the namenode with: hdfs namenode -format.
9) Start the hadoop system with: start-all.sh and check the services with: jps.
10) Start hbase with: start-hbase.sh.
11) Check the services with: jps.
12) Start the hbase shell with: hbase shell.
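Example: inside the shell, the status and list commands give a quick health check; as a non-interactive sketch, you can also pipe a command in from the terminal:
echo "status" | hbase shell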