Aim: - Installation of VMware to set up the Hadoop environment
and its ecosystems.
Prerequisites:
1. Virtual Environment: For virtual environments, VMware is used.
2. Operating System: Hadoop can be installed on Linux-based operating systems. Ubuntu and CentOS are two of the most popular among them. We'll be using Ubuntu for this tutorial.
Procedure:
Step 1:
Installing VMware Workstation Player. First, download VMware Workstation Player from the official VMware download page and follow the linked installation instructions. After the download completes, open the .exe file.
Step 3: Using VMware to install Ubuntu. Initially, when you open VMware, the window mentioned below appears.
sudo apt update: Updates the package lists from the configured repositories.
sudo apt upgrade: Upgrades all installed packages to their newest versions.
sudo apt install <package>: Installs the specified package.
sudo apt remove <package>: Removes the specified package.
sudo apt search <keyword>: Searches for packages matching the given keyword.
sudo apt show <package>: Displays information about the specified package.
Using sudo apt provides a convenient and centralized way to manage software on a Debian-based Linux system, making it easier to keep your system up-to-date and install the applications you need.
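For example, a typical update-and-install session looks like this (the package name here is just an illustration; openssh-server is useful for the SSH steps later):
>>sudo apt update
>>sudo apt upgrade
>>sudo apt install openssh-server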
Step 4: Java JDK installation. Install the JDK on Ubuntu using the terminal:
>>sudo apt install openjdk-8-jdk openjdk-8-jre
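To confirm the installation, a typical check is:
>>java -version
>>javac -version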
Step 6: To generate SSH authentication keys (an RSA key pair) with OpenSSH on the Ubuntu machine, use the command below.
>>ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
Step 7: To store the generated public key in the authorized keys file, use the command below.
>>cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
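It is common to then restrict the permissions of the authorized keys file and test the password-less login (a standard follow-up, sketched here):
>>chmod 0600 ~/.ssh/authorized_keys
>>ssh localhost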
Step 10: To download Hadoop using the command-line interface, use the command below.
>>wget https://downloads.apache.org/hadoop/common/current/hadoop-3.4.1.tar.gz
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
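After saving, reload the shell so the variables take effect (assuming the exports above were added to ~/.bashrc):
>>source ~/.bashrc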
Step 18: Add the lines below in this file (between <configuration> and </configuration>).
>>sudo nano $HADOOP_HOME/etc/hadoop/core-site.xml
<property>
<name>hadoop.tmp.dir</name>
<value>/home/msimasood/Impdata</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
<description>The name of the default file system</description>
</property>
Step 22: Add the lines below in this file (between <configuration> and </configuration>).
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
>>start-dfs.sh
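To verify that the HDFS daemons started, the jps command can be used; on a single-node setup it should list NameNode, DataNode and SecondaryNameNode:
>>jps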
About HADOOP
The Apache Hadoop software library is a framework that allows for the
distributed processing of large data sets across clusters of computers using
simple programming models. It is designed to scale up from single servers to
thousands of machines, each offering local computation and storage. Rather
than rely on hardware to deliver high availability, the library itself is designed to
detect and handle failures at the application layer, so delivering a highly
available service on top of a cluster of computers, each of which may be prone
to failures.
Setting up HADOOP
Pre-requisites: 1. Java 2. SSH. Before any other steps, we need to set up the Java environment (JAVA_HOME), then download and extract Hadoop:
I. wget http://apache.claz.org/hadoop/common/hadoop-3.4.1/hadoop-3.4.1.tar.gz
II. tar xzf hadoop-3.4.1.tar.gz
III. mv hadoop-3.4.1/* hadoop/
export JAVA_HOME=/usr/local/jdk1.8.0_71
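The JAVA_HOME setting above is typically placed both in ~/.bashrc and in Hadoop's own environment file; a sketch (the JDK path is the one assumed above):
nano hadoop/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/local/jdk1.8.0_71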
Next, edit the following configuration files under hadoop/etc/hadoop/:
core-site.xml
hdfs-site.xml
mapred-site.xml
a. For each node, edit the /etc/hosts file and add the IP addresses of the servers, e.g.:
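(hypothetical addresses and hostnames)
192.168.1.10 hadoop-master
192.168.1.11 hadoop-worker1
192.168.1.12 hadoop-worker2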
Standalone
Step 1 — Installing Java
To get started, you'll update your package list and install OpenJDK, the default Java Development Kit on Ubuntu 20.04:
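A sketch of the usual commands for this (default-jdk installs OpenJDK on Ubuntu 20.04):
>>sudo apt update
>>sudo apt install default-jdk
>>java -version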
Navigate to the binary for the release you'd like to install. In this guide, you'll install Hadoop 3.4.1.
>>wget https://dlcdn.apache.org/hadoop/common/hadoop-3.4.1/hadoop-3.4.1.tar.gz
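After the download, the archive is typically extracted and moved into place (a sketch; the /usr/local/hadoop path matches the command used below):
>>tar -xzvf hadoop-3.4.1.tar.gz
>>sudo mv hadoop-3.4.1 /usr/local/hadoop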
>>/usr/local/hadoop/bin/hadoop
Prerequisites
Basic understanding of what a command line interface is
1. mkdir
Purpose: to create a directory
As we covered in the last article, folders in Linux are called “directories”. They
serve the same purpose as folders in Windows.
usage: mkdir [directory name]
You can even make a directory within a directory, even if the base one doesn't exist, with the -p option.
Here I will create a new directory called test2 within a test directory, with the -p
option:
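For example (using the directory names from the text above):
mkdir -p test/test2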
2. rmdir
Purpose: to remove a directory
With the rmdir command, you can remove a directory quickly and easily.
Now this works great if the directory is empty. But what about the directory I
created that has another directory in it?
Here’s what happens when I try to use rmdir on that directory:
rmdir cannot remove directories that have files or directories in them. For that, you must use the rm command (which we'll cover again in command #5). In this case we need to type in:
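(a sketch, using the test directory created above; -r removes the directory and everything inside it)
rm -r test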
3. cp
Purpose: to make a copy of a file
Here's one you'll use all the time.
usage: cp [file name] [new file name]
You can also copy the file to another directory and keep the same file name:
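For example (nginx.conf is just a sample file name, as used later in this section):
cp nginx.conf nginx_backup.conf
cp nginx.conf /tmp/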
4. mv
Purpose: to move a file to another location or rename it
This one is pretty straightforward. You use it to move a file from one place to the
other.
usage: mv [file name] [destination]
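For example (sample file names):
mv nginx.conf /tmp/
mv oldname.txt newname.txt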
usage (to rename): mv [old name] [new name]
5. rm
Purpose: to delete files (and, with -r, directories)
We used rm earlier to remove a directory. It's also used to delete individual files.
usage:
rm [file name]
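For example (sample file name):
rm nginx.conf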
6. touch
Purpose: create an empty file
You may have noticed my “nginx.conf” was zero bytes. This is a nifty command
for creating empty files. This is handy for creating a new file or testing things.
usage: touch [file name]
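For example:
touch newfile.txt
ls -l newfile.txt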
Local to HDFS:
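For example (the local file name and HDFS path are placeholders):
hdfs dfs -put input.txt /user/hdoop/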
HDFS to Local:
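For example:
hdfs dfs -get /user/hdoop/input.txt ~/input_copy.txt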
Remove a file:
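For example:
hdfs dfs -rm /user/hdoop/input.txt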
mkdir WordCountProject
cd WordCountProject
nano WordCount.java
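Assuming WordCount.java contains the standard Hadoop word-count example, it can be compiled and run roughly as follows (the input.txt file and the HDFS /input and /output paths are assumptions):
javac -classpath "$(hadoop classpath)" -d . WordCount.java
jar cf wc.jar WordCount*.class
hdfs dfs -mkdir -p /input
hdfs dfs -put input.txt /input
hadoop jar wc.jar WordCount /input /output
hdfs dfs -cat /output/part-r-00000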
Note: /output directory must not exist before you run the job. If it does:
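it can be removed first (a standard HDFS command):
hdfs dfs -rm -r /output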
Expected output of the word count job:
hadoop 1
hello 2
world 1
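This section multiplies two matrices with Hadoop Streaming. The first input file, matrix_a.txt, is assumed here to hold the 2x3 matrix [[1,2,3],[4,5,6]] in the same "matrix row col value" format as matrix_b.txt below; this assumption is consistent with the 2x2 product shown at the end.
nano matrix_a.txt
A 0 0 1
A 0 1 2
A 0 2 3
A 1 0 4
A 1 1 5
A 1 2 6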
nano matrix_b.txt
B 0 0 7
B 0 1 8
B 1 0 9
B 1 1 10
B 2 0 11
B 2 1 12
Save and exit (same steps as above).
nano mapper.py
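A sketch of what mapper.py might look like for this job (the matrix dimensions, A is 2x3 and B is 3x2, and the "matrix row col value" input format are assumptions based on the files above):
#!/usr/bin/env python3
# mapper.py -- sketch for matrix multiplication with Hadoop Streaming.
# Assumes input lines "MATRIX ROW COL VALUE" (e.g. "A 0 1 2"),
# with A assumed 2x3 and B assumed 3x2.
import sys

ROWS_A = 2   # rows of A = rows of the product
COLS_B = 2   # columns of B = columns of the product

for line in sys.stdin:
    parts = line.split()
    if len(parts) != 4:
        continue
    m, r, c, v = parts[0], int(parts[1]), int(parts[2]), parts[3]
    if m == "A":
        # A[r][c] is needed for every product cell (r, k)
        for k in range(COLS_B):
            print(f"{r},{k}\tA,{c},{v}")
    else:
        # B[r][c] is needed for every product cell (i, c)
        for i in range(ROWS_A):
            print(f"{i},{c}\tB,{r},{v}")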
nano reducer.py
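And a matching sketch for reducer.py (for each product cell it multiplies the paired A and B contributions and sums them):
#!/usr/bin/env python3
# reducer.py -- sketch: sum A[j]*B[j] over the inner index j for each cell.
import sys

def emit(key, a_vals, b_vals):
    total = sum(a_vals[j] * b_vals[j] for j in a_vals if j in b_vals)
    print(f"{key}\t{total}")

current_key = None
a_vals, b_vals = {}, {}

for line in sys.stdin:
    key, val = line.strip().split("\t")
    tag, j, v = val.split(",")
    if key != current_key:
        if current_key is not None:
            emit(current_key, a_vals, b_vals)
        current_key, a_vals, b_vals = key, {}, {}
    if tag == "A":
        a_vals[int(j)] = int(v)
    else:
        b_vals[int(j)] = int(v)

if current_key is not None:
    emit(current_key, a_vals, b_vals)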
Make them executable:
chmod +x mapper.py reducer.py
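The job can then be run with the Hadoop Streaming jar (the HDFS paths /matrices and /matrix-output are assumptions; the jar location matches a standard Hadoop 3.x layout):
hdfs dfs -mkdir -p /matrices
hdfs dfs -put matrix_a.txt matrix_b.txt /matrices
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -files mapper.py,reducer.py \
  -mapper mapper.py \
  -reducer reducer.py \
  -input /matrices \
  -output /matrix-output
hdfs dfs -cat /matrix-output/part-00000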
Expected output (the cells of the 2x2 product matrix):
0,0 58
0,1 64
1,0 139
1,1 154
About Pig:
Pig represents Big Data as data flows. Pig is a high-level platform, or tool, used to process large datasets. It provides a high level of abstraction over MapReduce and a high-level scripting language, known as Pig Latin, which is used to develop data analysis code. To process data stored in HDFS, programmers write scripts in the Pig Latin language.
Need of Pig: One limitation of MapReduce is that the development cycle is very long. Writing the mapper and reducer, compiling and packaging the code, submitting the job and retrieving the output is a time-consuming task. Apache Pig reduces the development time by using the multi-query approach.
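Before setting the environment variables, download and extract Pig (a sketch; the 0.17.0 version and the archive mirror are assumptions, while /opt/pig matches the PIG_HOME used below):
wget https://archive.apache.org/dist/pig/pig-0.17.0/pig-0.17.0.tar.gz
tar -xzf pig-0.17.0.tar.gz
sudo mv pig-0.17.0 /opt/pig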
Add to .bashrc:
nano ~/.bashrc
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export PATH=$PATH:$JAVA_HOME/bin
Then apply changes:
source ~/.bashrc
Next, open ~/.bashrc again to add the Pig environment variables:
nano ~/.bashrc
Add:
export PIG_HOME=/opt/pig
export PATH=$PATH:$PIG_HOME/bin
Apply:
source ~/.bashrc
6. Verify Installation
pig -version
id,name,age,department
1,John,25,IT
2,Alice,30,HR
3,Bob,22,IT
4,Eve,35,Finance
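A sketch of loading and inspecting this data in Pig (the students.csv file name and running in local mode with pig -x local are assumptions; the FILTER drops the header row, whose id does not cast to int):
raw = LOAD 'students.csv' USING PigStorage(',') AS (id:int, name:chararray, age:int, department:chararray);
students = FILTER raw BY id IS NOT NULL;
DUMP students;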
id,salary
1,50000
2,60000
3,55000
4,70000
Load salaries:
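A sketch (the salaries.csv file name is an assumption; the students relation is the one loaded above):
raw_sal = LOAD 'salaries.csv' USING PigStorage(',') AS (id:int, salary:int);
salaries = FILTER raw_sal BY id IS NOT NULL;
joined = JOIN students BY id, salaries BY id;
DUMP joined;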
3. Exit Pig
quit;
nano text.txt
hello world
hello pig
hello hadoop
pig loves hadoop
Save and exit (Ctrl + O, Enter, Ctrl + X)
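A minimal Pig Latin script for the word count (run e.g. with pig -x local; the relation name wordcount matches the DUMP below):
lines = LOAD 'text.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
wordcount = FOREACH grouped GENERATE group, COUNT(words);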
-- Dump result
DUMP wordcount;
(hadoop,2)
(hello,3)
(pig,2)
(world,1)
(loves,1)
2010,30
2010,35
2010,29
2011,38
2011,36
2011,32
Save and exit.
-- Load the data (file name assumed to match the file created above)
weather = LOAD 'weather.txt' USING PigStorage(',') AS (year:int, temp:int);
-- Group by year
grouped = GROUP weather BY year;
-- Find max temp per year
max_temp = FOREACH grouped GENERATE group, MAX(weather.temp);
-- Dump results
DUMP max_temp;
(2010,35)
(2011,38)
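1. Download Hive (a sketch; the 3.1.3 version matches the archive extracted below, and the Apache archive mirror is an assumption):
wget https://archive.apache.org/dist/hive/hive-3.1.3/apache-hive-3.1.3-bin.tar.gz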
2. Extract Hive
tar -xvzf apache-hive-3.1.3-bin.tar.gz
mv apache-hive-3.1.3-bin hive
Add the following to ~/.bashrc (nano ~/.bashrc):
# Hive Environment
export HIVE_HOME=/home/hdoop/hive
export PATH=$PATH:$HIVE_HOME/bin
source ~/.bashrc
cd $HIVE_HOME/conf
cp hive-default.xml.template hive-site.xml
Edit hive-site.xml and set the following basic properties:
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby:;databaseName=metastore_db;create=true</value>
<description>Derby embedded metastore URL</description>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
<description>Hive warehouse location</description>
</property>
</configuration>
start-dfs.sh
start-yarn.sh
Then create Hive warehouse directory in HDFS:
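A typical set of commands for this step (paths match the hive-site.xml above); for Hive 3.x with the embedded Derby metastore, the schema may also need to be initialised before the first start:
hdfs dfs -mkdir -p /tmp
hdfs dfs -mkdir -p /user/hive/warehouse
hdfs dfs -chmod g+w /tmp
hdfs dfs -chmod g+w /user/hive/warehouse
schematool -dbType derby -initSchema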
$ hive
2. Use Hive to create, alter, and drop databases, tables, views, functions,
and indexes.
1. Create a Database
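For example:
CREATE DATABASE IF NOT EXISTS mydb;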
USE mydb;
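Before inserting the rows below, the students table must exist; a minimal definition consistent with the INSERT statements (a sketch):
CREATE TABLE students (id INT, name STRING, grade INT);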
INSERT INTO students (id, name, grade) VALUES (1, 'John Doe', 95);
INSERT INTO students (id, name, grade) VALUES (2, 'Jane Doe', 85);
INSERT INTO students (id, name, grade) VALUES (3, 'Mike Ross', 92);
You can alter the table by adding a new column (e.g., email):
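A typical statement for this:
ALTER TABLE students ADD COLUMNS (email STRING);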
8. Create a View
Now, let's create a view high_achievers to show students with grades greater
than 90:
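A sketch of the view definition:
CREATE VIEW high_achievers AS SELECT * FROM students WHERE grade > 90;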
nano ~/hbase/conf/hbase-env.sh   (set JAVA_HOME in this file)
Then edit ~/hbase/conf/hbase-site.xml and add the following between <configuration> and </configuration>:
<property>
<name>hbase.rootdir</name>
<value>hdfs://localhost:9000/hbase</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/home/hdoop/hbase/zookeeper</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
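With the configuration in place and HDFS running, HBase can then be started and tested (standard HBase scripts, assuming $HBASE_HOME/bin is on the PATH):
start-hbase.sh
jps
hbase shell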
nano input.txt
hello world
hello hadoop
hello spark
nano wordcount.py
from pyspark import SparkContext
# Create SparkContext
sc = SparkContext("local", "WordCount")
# Read the local input file
text_file = sc.textFile("file:///home/hdoop/input.txt")
# Split lines into words, map to (word, 1), reduce by key
word_counts = text_file.flatMap(lambda line: line.split()) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)
# Save the result locally (read back below with cat)
word_counts.saveAsTextFile("file:///home/hdoop/output")
In terminal:
spark-submit wordcount.py
Run:
cat output/part-00000
Example output:
('hello', 3)
('world', 1)
('hadoop', 1)
('spark', 1)
Open MySQL:
mysql -u root -p
Then:
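A hypothetical sketch of the SQL for this step (the testdb database and employees table used in the Spark script below are assumptions):
CREATE DATABASE IF NOT EXISTS testdb;
USE testdb;
CREATE TABLE employees (id INT PRIMARY KEY, name VARCHAR(50), salary INT);
INSERT INTO employees VALUES (1, 'John', 50000), (2, 'Alice', 60000);
EXIT;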
wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/8.0.33/mysql-connector-java-8.0.33.jar
Move it to your home directory if needed:
mv mysql-connector-java-8.0.33.jar ~/mysql-connector-java.jar
nano step2_read_mysql.py
(The script sets up a SparkSession, configures the JDBC connection to MySQL, reads a table into a DataFrame, and ends with spark.stop(); a sketch is given below.)
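A minimal sketch of what step2_read_mysql.py might contain, assuming the testdb/employees table from the SQL sketch above and the connector jar moved to the home directory (the password is a placeholder):
# step2_read_mysql.py -- sketch; database, table and password are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("ReadMySQL") \
    .config("spark.jars", "/home/hdoop/mysql-connector-java.jar") \
    .getOrCreate()

# JDBC config (hypothetical database "testdb" and table "employees")
df = spark.read.format("jdbc") \
    .option("url", "jdbc:mysql://localhost:3306/testdb") \
    .option("driver", "com.mysql.cj.jdbc.Driver") \
    .option("dbtable", "employees") \
    .option("user", "root") \
    .option("password", "your_password") \
    .load()

df.show()
spark.stop()
Run it with: spark-submit --jars ~/mysql-connector-java.jar step2_read_mysql.py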