Course: Big Data Mining
Installation Guide
1. Hadoop
a. Install Java:
i. Update the package index: sudo apt-get update
ii. Install the default JDK from the Ubuntu repositories:
sudo apt-get install default-jdk
iii. Check version: java -version
b. Adding a dedicated Hadoop user
i. Add group: sudo addgroup hadoop
ii. Add user: sudo adduser --ingroup hadoop hduser => enter Y
c. Installing SSH:
i. Install: sudo apt-get install ssh
ii. Check: which ssh
which sshd
d. Create and Set Up SSH Certificates
i. Switch to hduser: su hduser (then generate the keys as sketched below)
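The original steps stop at switching users; a minimal sketch of the usual passwordless-SSH setup that Hadoop needs (assuming the default key path ~/.ssh/id_rsa) is:
ssh-keygen -t rsa -P ""    # generate a key pair with an empty passphrase
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys    # authorize the key for localhost logins
ssh localhost    # verify the login now works without a password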
e. Install Hadoop 3.3.0
i. Download Hadoop (the prebuilt binary tarball, not the -src source tarball, since the steps below rely on the ready-made bin/, sbin/, and etc/hadoop/ directories):
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz
ii. Extract Hadoop: tar xvzf hadoop-3.3.0.tar.gz
iii. Create the installation directory /usr/local/hadoop:
sudo mkdir -p /usr/local/hadoop => enter password
iv. Check whether hduser has sudo rights: sudo -v
v. Add hduser to the sudo group: sudo adduser hduser sudo
vi. Move the extracted files into /usr/local/hadoop (run from inside the extracted hadoop-3.3.0 directory):
cd hadoop-3.3.0
sudo mv * /usr/local/hadoop
sudo chown -R hduser:hadoop /usr/local/hadoop
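A quick optional check that the move succeeded (not in the original steps):
ls /usr/local/hadoop    # should list bin, sbin, etc, share, among others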
vii. Set up the configuration files
1. ~/.bashrc
Open ~/.bashrc: nano ~/.bashrc
#HADOOP VARIABLES START
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
#HADOOP VARIABLES END
Save the file, then reload it: source ~/.bashrc
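A quick way to confirm the new variables are active (an optional check, not in the original steps):
echo $HADOOP_INSTALL    # should print /usr/local/hadoop
hadoop version    # should report Hadoop 3.3.0 once $HADOOP_INSTALL/bin is on PATH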
2. /usr/local/hadoop/etc/hadoop/hadoop-env.sh
Set JAVA_HOME in hadoop-env.sh: nano /usr/local/hadoop/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
Add the above line to the hadoop-env.sh file, then reload it:
source /usr/local/hadoop/etc/hadoop/hadoop-env.sh
3. /usr/local/hadoop/etc/hadoop/core-site.xml:
This file overrides the default settings that Hadoop starts with. First create the temporary directory it will use:
sudo mkdir -p /app/hadoop/tmp
sudo chown hduser:hadoop /app/hadoop/tmp
Open the file: nano /usr/local/hadoop/etc/hadoop/core-site.xml
and enter the following in between the <configuration></configuration> tag:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
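Note: the /app/hadoop/tmp directory created above is normally wired in through the hadoop.tmp.dir property, which the snippet above does not set. A sketch of the extra property (using the standard Hadoop property name and the directory created earlier) is:
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>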
4. /usr/local/hadoop/etc/hadoop/mapred-site.xml
Check the /usr/local/hadoop/etc/hadoop/ folder. On Hadoop 3.x this file usually already exists as mapred-site.xml; if only mapred-site.xml.template is present, copy it:
cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml
Then configure it:
Open the file: nano /usr/local/hadoop/etc/hadoop/mapred-site.xml
then add the configuration:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
5. /usr/local/hadoop/etc/hadoop/hdfs-site.xml
The /usr/local/hadoop/etc/hadoop/hdfs-site.xml file needs to be configured for each host in the cluster that is being used. It specifies the directories used for namenode and datanode storage on that host. Before editing this file, we need to create two directories that will hold the namenode and datanode data for this Hadoop installation. This can be done with the following commands:
sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode
sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode
sudo chown -R hduser:hadoop /usr/local/hadoop_store
Open the file: nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml
and add the configuration:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified at create time.
</description>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/datanode</value>
</property>
</configuration>
viii. Format the New Hadoop Filesystem
hdfs namenode -format
ix. Start Hadoop
Move to the sbin directory: cd /usr/local/hadoop/sbin
Start HDFS: start-dfs.sh
Start YARN: start-yarn.sh
Check the running daemons: jps
Stop: stop-dfs.sh
stop-yarn.sh
Web UIs: http://localhost:9870 (HDFS NameNode)
http://localhost:8088/cluster (YARN ResourceManager)
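As a quick smoke test (not part of the original steps), create a home directory in HDFS and copy a file into it:
hdfs dfs -mkdir -p /user/hduser
hdfs dfs -put /etc/hosts /user/hduser/
hdfs dfs -ls /user/hduser    # the copied file should be listed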
2. Spark
a. Install Java JDK
i. sudo apt update
ii. sudo apt install default-jdk
iii. Check version: java -version
b. Install Scala
sudo apt install scala
Check version: scala -version
c. Install Apache Spark
wget https://archive.apache.org/dist/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz
Extract it: tar xvzf spark-3.1.2-bin-hadoop3.2.tgz
Move the extracted directory to /opt/spark: sudo mv spark-3.1.2-bin-hadoop3.2 /opt/spark
Create the environment variables:
nano ~/.bashrc
add configuration:
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
Save the file, then reload it: source ~/.bashrc
Start Apache Spark: start-master.sh
Start spark-shell in terminal: spark-shell
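To confirm Spark runs end to end without writing code (an optional check; run-example ships in the binary distribution's bin/ directory, which is already on PATH), execute the bundled SparkPi example and look for a line like "Pi is roughly 3.14..." in the output:
run-example SparkPi 10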
3. IDE
a. Eclipse: Ubuntu Software => search Eclipse => install
b. Eclipse for Scala: Eclipse => Help => Eclipse Marketplace => search Scala => Go =>
Scala IDE 4.7.x => Install
c. IntelliJ IDEA: Ubuntu Software => search IntelliJ => install
4. Creating a project in Eclipse
MSc. Hồ Ngọc Trung Kiên