The document outlines the installation and setup of VMware for creating a Hadoop environment, detailing prerequisites such as the operating system, JDK, and Hadoop version. It provides a step-by-step procedure for installing Ubuntu, configuring Java, and setting up Hadoop in various modes (standalone, pseudo-distributed, fully distributed). Additionally, it covers basic Linux commands for file and directory management.


Experiment No. 1
Aim: - Installation of VMware to set up the Hadoop environment and its ecosystems.

Prerequisites:
1. Virtual Environment: VMware is used to provide the virtual environment.
2. Operating System: Hadoop is installed on Linux-based operating systems. Ubuntu and CentOS are two of the most popular choices; we'll be using Ubuntu for this tutorial.
3. JDK: The Java 8 package must be installed on your system.
4. Hadoop version: Here we are using the Hadoop 3.4.1 package.

Procedure:
Step 1: Installing VMware Workstation Player
First, download VMware Workstation Player from the official VMware download page and follow its installation guide. After the download completes, open the .exe file.



Step 2: Installing Ubuntu. Download the Ubuntu ISO image from the official Ubuntu download page. After downloading, save the file in the desired location (preferably D:).

Step 3: Using VMware to install Ubuntu. When you open VMware, its home window appears; create a new virtual machine and point it at the downloaded Ubuntu ISO.



After the installation completes, the Ubuntu desktop interface appears.



Commands in the Linux environment:
sudo apt is a command-line tool on Debian- and Ubuntu-based operating systems that allows users to install, remove, and manage software packages. The "sudo" part of the command stands for "superuser do", which allows the user to execute commands with administrative (root-level) privileges. The "apt" part refers to the Advanced Package Tool, the package management system used by Debian-based Linux distributions.
Some common sudo apt commands include:

sudo apt update: Updates the package lists from the configured repositories.
sudo apt upgrade: Upgrades all installed packages to their newest versions.
sudo apt install <package>: Installs the specified package.
sudo apt remove <package>: Removes the specified package.
sudo apt search <keyword>: Searches for packages matching the given keyword.
sudo apt show <package>: Displays information about the specified package.

Using sudo apt provides a convenient and centralized way to manage software on a Debian-based Linux system, making it easier to keep your system up to date and install the applications you need.

Step 4: Java JDK installation. Install the JDK on Ubuntu using the terminal:
>>sudo apt install openjdk-8-jdk openjdk-8-jre



Step 5: To install the OpenSSH server and client on Ubuntu, use the command below.
>>sudo apt install openssh-server openssh-client -y


Step 6: To generate SSH authentication keys (an RSA pair) with OpenSSH on the Ubuntu machine, use the command below.
>>ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa

Step 7: To append the generated public key to the list of authorized keys, use the command below.
>>cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys



Step 8: The command below can be used to set permissions for the file generated in the previous step.
>>chmod 0600 ~/.ssh/authorized_keys

Step 9: The command below can be used to connect to localhost using the SSH protocol.
>>ssh localhost


Step 10: Use the command below to download Hadoop from the command-line interface.
>>wget https://downloads.apache.org/hadoop/common/current/hadoop-3.4.1.tar.gz


Step 11: Extract the downloaded file by executing the command below.
>>tar xzf hadoop-3.4.1.tar.gz
Step 12: Next, the .bashrc file needs to be edited using the nano text editor. Use the command below for this purpose.
>>nano ~/.bashrc

Step 13: Add the lines below to this file.

#Hadoop Related Options
export HADOOP_HOME=/home/hdoop/hadoop-3.4.1
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"


Step 14: Open the 2nd file that needs to be edited.
>>sudo nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh

Step 15: Add the line below at the end of this file.

>>export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
Step 16: Edit the 3rd file, namely $HADOOP_HOME/etc/hadoop/core-site.xml.
>>sudo nano $HADOOP_HOME/etc/hadoop/core-site.xml

Add the lines below to this file (between <configuration> and </configuration>):

<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/hdoop/tmpdata</value>
  <description>A base for other temporary directories.</description>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
  <description>The name of the default file system.</description>
</property>



Step 17: Open the 4th file which requires editing, namely $HADOOP_HOME/etc/hadoop/hdfs-site.xml.

>>sudo nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml

Step 18: Add the required properties to this file (between <configuration> and </configuration>). For a single-node setup these define the NameNode and DataNode storage directories and the replication factor; a sketch is given below.
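A minimal hdfs-site.xml sketch for a pseudo-distributed (single-node) setup; the storage paths under /home/hdoop are assumptions and can be replaced with any writable local directories:

<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Single node, so keep only one copy of each block.</description>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>/home/hdoop/dfsdata/namenode</value>
  <description>Assumed local directory for NameNode metadata.</description>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/home/hdoop/dfsdata/datanode</value>
  <description>Assumed local directory for DataNode blocks.</description>
</property>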

Step 19: Open the 5th file, namely $HADOOP_HOME/etc/hadoop/mapred-site.xml, to edit.
>>sudo nano $HADOOP_HOME/etc/hadoop/mapred-site.xml

Step 20: Add the lines below to this file (between <configuration> and </configuration>):

<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>

Step 21: Open the 6th file, namely $HADOOP_HOME/etc/hadoop/yarn-site.xml, to enable editing.
>>sudo nano $HADOOP_HOME/etc/hadoop/yarn-site.xml

Step 22: Add the lines below to this file (between <configuration> and </configuration>):

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>127.0.0.1</value>
</property>
<property>
  <name>yarn.acl.enable</name>
  <value>0</value>
</property>
<property>
  <name>yarn.nodemanager.env-whitelist</name>
  <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>



Step 23: Use the command below to format the HDFS NameNode before the first launch.
>>hdfs namenode -format

Step 24: Get started with Hadoop by starting the HDFS daemons.

>>start-dfs.sh


Result: - Thus, setting up and installing Hadoop was completed successfully.



Experiment No.: - 2
Aim: -
I. Perform setting up and installing Hadoop in its three operating modes: a. Standalone, b. Pseudo-distributed, c. Fully distributed.
II. Use web-based tools to monitor your Hadoop setup.

About HADOOP
The Apache Hadoop software library is a framework that allows for the
distributed processing of large data sets across clusters of computers using
simple programming models. It is designed to scale up from single servers to
thousands of machines, each offering local computation and storage. Rather
than rely on hardware to deliver high availability, the library itself is designed to
detect and handle failures at the application layer, so delivering a highly
available service on top of a cluster of computers, each of which may be prone
to failures.

Setting up HADOOP
Pre-requisites: 1. Java 2. SSH

Before any other steps, we need to set the Java environment variable. This can be done on Windows from the system variables window, or on Linux by adding the following to the variables file:

export JAVA_HOME=/usr/java/latest

Download and extract the HADOOP binaries:

I. wget http://apache.claz.org/hadoop/common/hadoop-3.4.1/hadoop-3.4.1.tar.gz
II. tar xzf hadoop-3.4.1.tar.gz
III. mv hadoop-3.4.1/* hadoop/



Pseudo-distributed mode
1. Add the Hadoop environment variables to the system variable file (the same exports as in Experiment 1, Step 13).

2. Configure the HADOOP files:

a. Change to the Hadoop configuration directory, hadoop/etc/hadoop.

b. Add the following to the hadoop-env.sh file:

export JAVA_HOME=/usr/local/jdk1.8.0_71

c. Edit the following config files (with the same properties as in Experiment 1, Steps 16-22):

core-site.xml

hdfs-site.xml

yarn-site.xml

mapred-site.xml

d. Verifying the installation (the commands used are sketched after this list):

i. Formatting the name node

ii. Verifying the HDFS file system

iii. Starting YARN

iv. Accessing the HADOOP web UI in a browser and verifying everything.
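A minimal sketch of the verification commands, assuming the environment variables from Experiment 1 are already in place; the ports are the Hadoop 3.x defaults:

hdfs namenode -format     # i. format the name node (first run only)
start-dfs.sh              # start the NameNode, DataNode and SecondaryNameNode daemons
jps                       # ii. verify that the HDFS daemons are running
hdfs dfs -ls /            # ii. list the root of the HDFS file system
start-yarn.sh             # iii. start the ResourceManager and NodeManager
# iv. open http://localhost:9870 (NameNode UI) and http://localhost:8088 (YARN UI) in a browser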

Fully distributed mode


1 Configure system and create host files on each node

a. For each node, edit the /etc/hosts file and add the IP addresses and hostnames of the servers, e.g. as in the sketch below.
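A hypothetical /etc/hosts layout for a three-node cluster (the IP addresses and hostnames are placeholders; substitute your own):

192.168.1.10   node-master
192.168.1.11   node1
192.168.1.12   node2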

2 Distribute the authentication key-pairs to the users

a. Log in to the node-master and generate ssh keys.

b. Copy the keys to the other nodes (see the sketch below).
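A minimal sketch, assuming the Hadoop user is called hadoop on every node and the hostnames from the /etc/hosts example above:

ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa   # a. generate the key pair on node-master
ssh-copy-id hadoop@node1                   # b. copy the public key to each worker node
ssh-copy-id hadoop@node2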



3 Download and extract the HADOOP binaries

4 Set the environment variables (same as pseudo-distributed)

5 Edit the core-site.xml file to set the NameNode location (see the sketch below).
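A sketch of the core-site.xml change for the cluster, assuming the NameNode runs on the node-master host defined in the /etc/hosts example; only the default file system URL differs from the pseudo-distributed setup:

<property>
  <name>fs.default.name</name>
  <value>hdfs://node-master:9000</value>
</property>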

6 Set the HDFS Paths in hdfs-site.xml

7 Set the Job scheduler (same as pseudo-distributed)

8 Configure YARN in yarn-site.xml



9 Duplicate the config files to each node.

10 Format the HDFS (same as pseudo-distributed).

11 Start the HDFS (same as pseudo-distributed).

12 Run YARN (same as pseudo-distributed).

Standalone
Step 1 — Installing Java
To get started, you'll update the package list and install OpenJDK, the default Java Development Kit on Ubuntu 20.04:

>>sudo apt update

>>sudo apt install default-jdk


Step 2 — Installing Hadoop


With Java in place, you’ll visit the Apache Hadoop Releases page to find the
most recent stable release.

Navigate to the binary for the release you'd like to install. In this guide you'll install Hadoop 3.4.1:
>>wget https://dlcdn.apache.org/hadoop/common/hadoop-3.4.1/hadoop-3.4.1.tar.gz


Step 3 — Configuring Hadoop’s Java Home
Hadoop requires that you set the path to Java, either as an environment
variable or in the Hadoop configuration file.

The path to Java, /usr/bin/java, is a symlink to /etc/alternatives/java, which is in turn a symlink to the default Java binary. You will use readlink with the -f flag to follow every symlink in every part of the path, recursively. Then, you'll use sed to trim bin/java from the output to give the correct value for JAVA_HOME.
To find the default Java path:

>>readlink -f /usr/bin/java | sed "s:bin/java::"
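The output depends on which JDK is installed; with the default-jdk package from Step 1 on Ubuntu 20.04 it will look something like:

/usr/lib/jvm/java-11-openjdk-amd64/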

Step 4 — Running Hadoop

Now you should be able to run Hadoop (assuming it was extracted to /usr/local/hadoop):

>>/usr/local/hadoop/bin/hadoop

Result: - We have installed HADOOP in standalone, pseudo-distributed, and fully distributed modes.



Experiment No. 3
Aim: - Implementing the basic commands of LINUX Operating
System – File/Directory creation, deletion, update operations.

Prerequisites
Basic understanding of what a command line interface is

Familiarity with the concept of files and directories (folders)

1. mkdir
Purpose: to create a directory

Folders in Linux are called "directories". They serve the same purpose as folders in Windows.
usage: mkdir [directory name]

Here’s mkdir in action:


You can even make a directory within a directory, even if the base one doesn't exist, with the -p option.
Here I will create a new directory called test2 within a test directory, using the -p option (see the commands below):
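The commands used here would be (directory names as in the text):

mkdir test            # create a single directory
mkdir -p test/test2   # create test2 inside test, creating test first if it does not exist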


2. rmdir
Purpose: to remove a directory

With the rmdir command, you can remove a directory quickly and easily.

usage: rmdir [directory name]

Here’s rmdir in action:



Now this works great if the directory is empty. But what about the directory I
created that has another directory in it?
Here’s what happens when I try to use rmdir on that directory:


rmdir cannot remove directories that have files or directories in them. For that, you must use the rm command (which we'll cover again in command #5), typed as:

rm -rf [directory name]



3. cp
Purpose: to make a copy of a file

Here's one you'll use all the time, especially if you're making a config file backup. Let's use that as an example. I want to make a backup of this file. If I mess something up, I can go back to the old version.

usage: cp [file name] [new file name]


You can also copy the file to another directory and keep the same file name:

cp [file name] [new location]


4. mv
Purpose: to move a file to another location or rename it

This one is pretty straightforward. You use it to move a file from one place to the
other.
usage:

mv [file name] [new location]



It’s used the same way as cp, though it moves the file instead of making a copy.


This is also how you rename a file.

usage:

mv [file name] [new file name]

So if I want to rename my nginx configuration file, I can do this:





5. rm
Purpose: to delete a file

We used rm earlier to remove a directory. It's also how you delete individual files.

usage:

rm [file name]


6. touch
Purpose: create an empty file

You may have noticed my “nginx.conf” was zero bytes. This is a nifty command
for creating empty files. This is handy for creating a new file or testing things.
usage:

touch [file name]

This creates a file with nothing in it:




7. find
Purpose: to find files in the file system

This is a powerful command for locating files anywhere in the file system.

usage:

find [path to search] -name filename
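A couple of hypothetical invocations (the file names are just examples):

find . -name nginx.conf         # search the current directory tree for nginx.conf
find /home -name "*.conf"       # search /home for any file ending in .conf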



Experiment No. 4
Aim: - Perform various File Management tasks in Hadoop.
i. Upload and download a file in HDFS.
ii. See contents of a file.
iii. Copy a file from source to destination.
iv. Copy a file from/To Local file system to HDFS.
v. Move file from source to destination.
vi. Remove a file or directory in HDFS.
vii. Display last few lines of a file.
viii. Display the aggregate length of a file.

i. Upload and Download a File in HDFS

Upload from local to HDFS:

hdfs dfs -put /path/to/local/file.txt /user/hadoop/

Download from HDFS to local:

hdfs dfs -get /user/hadoop/file.txt /path/to/local/

ii. See Contents of a File


hdfs dfs -cat /user/hadoop/file.txt



iii. Copy a File from Source to Destination (within HDFS)
hdfs dfs -cp /user/hadoop/file.txt /user/hadoop/copy_file.txt


iv. Copy a File from/To Local File System to HDFS

Local to HDFS:

hdfs dfs -copyFromLocal /path/to/local/file.txt /user/hadoop/

HDFS to Local:

hdfs dfs -copyToLocal /user/hadoop/file.txt /path/to/local/


v. Move File from Source to Destination (within HDFS)


hdfs dfs -mv /user/hadoop/file.txt /user/hadoop/new_folder/

vi. Remove a File or Directory in HDFS

Remove a file:

hdfs dfs -rm /user/hadoop/file.txt



Remove a directory and its contents:

hdfs dfs -rm -r /user/hadoop/directory_name


vii. Display Last Few Lines of a File


hdfs dfs -tail /user/hadoop/file.txt

By default, it displays the last 1KB of the file.

viii. Display the Aggregate Length of a File


hdfs dfs -du /user/hadoop/file.txt

For total space used by a directory recursively:

hdfs dfs -du -s -h /user/hadoop/



Experiment No. 5
Aim: - Implement Word Count Map Reduce program to understand
Map Reduce Program.

Step 1: Setup Requirements


Make sure you have:

Hadoop installed and configured (HADOOP_HOME set, HDFS working).
Java installed (java -version should show JDK 8 or above).
$HADOOP_HOME/bin added to your system's PATH.

Step 2: Create Java Program

1. Create a folder for your project:

mkdir WordCountProject
cd WordCountProject

2. Create a file named WordCount.java:

nano WordCount.java



3. Paste this code inside:
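The code itself appeared only as a screenshot; the standard Apache Hadoop WordCount example below matches the class name and the compile/run commands used in the following steps and can be used in its place.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emit (word, 1) for every token in the input line
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sum up the counts for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // /input
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // /output
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}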



Step 3: Compile the Program
mkdir wordcount_classes
javac -classpath `hadoop classpath` -d wordcount_classes WordCount.java

Step 4: Create a JAR File


jar -cvf wordcount.jar -C wordcount_classes/ .


Step 5: Prepare Input File in HDFS


1. Create a sample text file:

echo "Hello Hadoop Hello World" > input.txt

2. Create an input directory in HDFS:

hdfs dfs -mkdir /input

3. Upload the file to HDFS:

hdfs dfs -put input.txt /input/



Step 6: Run the MapReduce Job
hadoop jar wordcount.jar WordCount /input /output

Note: /output directory must not exist before you run the job. If it does:

hdfs dfs -rm -r /output


Step 7: View the Output

hdfs dfs -cat /output/part-r-00000

You should see something like:

Hadoop 1
Hello 2
World 1



Experiment No. 6
Aim: - Implement matrix multiplication with Hadoop Map Reduce.

Step 1: Setup Your Input Matrices


We’ll create two files: matrix_a.txt and matrix_b.txt

nano matrix_a.txt

Paste this content:

A 0 0 1
A 0 1 2
A 0 2 3
A 1 0 4
A 1 1 5
A 1 2 6

Press Ctrl+O, then Enter, then Ctrl+X to save and exit.

nano matrix_b.txt

Paste this content:

B 0 0 7
B 0 1 8
B 1 0 9
B 1 1 10
B 2 0 11
B 2 1 12
Save and exit (same steps as above).



Step 2: Combine Inputs
mkdir -p matrix_input
cat matrix_a.txt matrix_b.txt > matrix_input/input.txt


Step 3: Upload to HDFS


hadoop fs -mkdir -p /matrix/input
hadoop fs -put -f matrix_input/input.txt /matrix/input

Step 4: Create mapper.py

nano mapper.py
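The mapper code itself appeared only as a screenshot. Below is a minimal sketch written for this specific example, assuming python3 is available, input lines of the form "matrix row col value", and hard-coded dimensions matching the sample data (A has 2 rows, B has 2 columns):

#!/usr/bin/env python3
# Streaming mapper for A (2x3) x B (3x2); input lines look like "A 0 0 1" or "B 1 1 10".
import sys

ROWS_A = 2   # rows of A (assumed, matches matrix_a.txt)
COLS_B = 2   # columns of B (assumed, matches matrix_b.txt)

for line in sys.stdin:
    parts = line.split()
    if len(parts) != 4:
        continue                       # skip blank or malformed lines
    matrix, row, col, value = parts[0], int(parts[1]), int(parts[2]), parts[3]
    if matrix == "A":
        # A[i][j] is needed for every output cell (i, k)
        for k in range(COLS_B):
            print(f"{row},{k}\tA,{col},{value}")
    elif matrix == "B":
        # B[j][k] is needed for every output cell (i, k)
        for i in range(ROWS_A):
            print(f"{i},{col}\tB,{row},{value}")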

Save and exit, then make it executable:



chmod +x mapper.py

Step 5: Create reducer.py

nano reducer.py
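Likewise, a minimal reducer sketch matching the mapper above: it relies on Hadoop Streaming delivering the keys in sorted order and, for each output cell, multiplies the A and B entries that share the same inner index j:

#!/usr/bin/env python3
# Streaming reducer: for each cell "i,k", sum A[i][j] * B[j][k] over the shared index j.
import sys

def emit(cell, a_vals, b_vals):
    total = sum(a_vals[j] * b_vals[j] for j in a_vals if j in b_vals)
    print(f"{cell}\t{total}")

current = None
a_vals, b_vals = {}, {}

for line in sys.stdin:
    cell, payload = line.strip().split("\t")
    if current is not None and cell != current:
        emit(current, a_vals, b_vals)      # keys arrive sorted, so the previous cell is complete
        a_vals, b_vals = {}, {}
    current = cell
    matrix, j, value = payload.split(",")
    if matrix == "A":
        a_vals[int(j)] = int(value)
    else:
        b_vals[int(j)] = int(value)

if current is not None:
    emit(current, a_vals, b_vals)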

Make it executable:

chmod +x reducer.py

Step 6: Run Hadoop Streaming Job

hadoop jar /home/hdoop/hadoop-3.4.1/share/hadoop/tools/lib/hadoop-streaming-3.4.1.jar \
  -input /matrix/input \
  -output /matrix/output \
  -mapper mapper.py \
  -reducer reducer.py \
  -file mapper.py \
  -file reducer.py



Wait for the job to complete.


Step 7: View Output

hadoop fs -cat /matrix/output/part-00000

You should see:

0,0 58
0,1 64
1,0 139
1,1 154



Experiment No.: - 7
Aim: -
i. Installation of PIG.
ii. Write Pig Latin scripts to sort, group, join, project, and filter your data.

About Pig: -
Pig represents big data as data flows. Pig is a high-level platform, or tool, which is used to process large datasets. It provides a high level of abstraction over MapReduce and a high-level scripting language, known as Pig Latin, which is used to develop data analysis code. To process data stored in HDFS, programmers write scripts using the Pig Latin language.

Need for Pig: One limitation of MapReduce is that the development cycle is very long. Writing the mapper and reducer, compiling and packaging the code, submitting the job, and retrieving the output is a time-consuming task. Apache Pig reduces the development time by using a multi-query approach.

I. Installing Apache Pig -

1. Install Java (Pig requires Java)


sudo apt update
sudo apt install openjdk-11-jdk -y
2. Set JAVA_HOME

Add to .bashrc:

nano ~/.bashrc

Add at the end:

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export PATH=$PATH:$JAVA_HOME/bin
Then apply changes:



source ~/.bashrc

3. Download Apache Pig


wget https://downloads.apache.org/pig/pig-0.17.0/pig-0.17.0.tar.gz
4. Extract and Move Pig
tar -xvzf pig-0.17.0.tar.gz
sudo mv pig-0.17.0 /opt/pig

5. Set PIG_HOME and PATH

Again edit .bashrc:

nano ~/.bashrc

Add:

export PIG_HOME=/opt/pig
export PATH=$PATH:$PIG_HOME/bin
Apply:

source ~/.bashrc

6. Verify Installation
pig -version


II. Writing Pig Latin Scripts

Let’s assume you have a CSV file called data.csv:

id,name,age,department
1,John,25,IT
2,Alice,30,HR
3,Bob,22,IT
4,Eve,35,Finance


1. Start Pig in Local Mode


pig -x local
2. Load Data
data = LOAD 'data.csv' USING PigStorage(',') AS (id:int, name:chararray,
age:int, dept:chararray);

A. Project (select specific columns)


proj = FOREACH data GENERATE name, age;
DUMP proj;



B. Filter (e.g., age > 25)
filtered = FILTER data BY age > 25;
DUMP filtered;


C. Sort (by age ascending)


sorted = ORDER data BY age ASC;
DUMP sorted;

D. Group (by department)


grouped = GROUP data BY dept;
DUMP grouped;




E. Join (joining two datasets)

Assume another file salaries.csv:

id,salary
1,50000
2,60000
3,55000
4,70000

Load salaries:

salaries = LOAD 'salaries.csv' USING PigStorage(',') AS (id:int, salary:int);


joined = JOIN data BY id, salaries BY id;
DUMP joined;

3. Exit Pig
quit;



Experiment No.: - 8
Aim: -
i. Run the Pig Latin Scripts to find Word Count.
ii. Run the Pig Latin Scripts to find a max temp for every year.

i. Run Pig Script to Find Word Count

Step 1: Create a Text File

Let’s create a file text.txt:

nano text.txt

Paste this sample content:

hello world
hello pig
hello hadoop
pig loves hadoop
Save and exit (Ctrl + O, Enter, Ctrl + X)

Step 2: Start Grunt Shell


pig -x local



Step 3: Run Word Count in Grunt
-- Load file
lines = LOAD 'text.txt' AS (line:chararray);

-- Break into words
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- Group words
grouped = GROUP words BY word;

-- Count each word
wordcount = FOREACH grouped GENERATE group AS word, COUNT(words) AS count;

-- Dump result
DUMP wordcount;

You’ll see output like:

(hadoop,2)
(hello,3)
(pig,2)
(world,1)
(loves,1)



ii. Run Pig Script to Find Max Temperature for Every Year

Step 1: Create the Weather File
nano weather.csv

Paste this:

2010,30
2010,35
2010,29
2011,38
2011,36
2011,32
Save and exit.

Step 2: Start Grunt Shell (if not already)


pig -x local



Step 3: Run Max Temp Script in Grunt
-- Load weather data
weather = LOAD 'weather.csv' USING PigStorage(',') AS (year:int, temp:int);

-- Group by year
grouped = GROUP weather BY year;
-- Find max temp per year

max_temp = FOREACH grouped GENERATE group AS year,


MAX(weather.temp) AS max_temp;

-- Dump results
DUMP max_temp;

You’ll see output like:

(2010,35)
(2011,38)



Experiment No.: - 9
Aim: -
i. Installation of HIVE.
ii. Use Hive to create, alter, and drop databases, tables, views,
functions, and indexes

1. Apache Hive 3.1.3 Installation (Step-by-Step)


Pre-requisites

Java JDK 8 or 11 installed


Hadoop installed and configured (version 3.4.1)
Environment variables set for Hadoop
Linux-based system (Ubuntu or similar)

1. Download Hive 3.1.3


cd /home/hdoop/
wget https://downloads.apache.org/hive/hive-3.1.3/apache-hive-3.1.3-bin.tar.gz

2. Extract Hive
tar -xvzf apache-hive-3.1.3-bin.tar.gz
mv apache-hive-3.1.3-bin hive

Now Hive is located at: /home/hdoop/hive

3. Set Hive Environment Variables


Add the following lines to your ~/.bashrc file:

# Hive Environment
export HIVE_HOME=/home/hdoop/hive
export PATH=$PATH:$HIVE_HOME/bin

Apply the changes:

source ~/.bashrc



4. Configure Hive (hive-site.xml)
Create the Hive configuration directory and copy default files:

cd $HIVE_HOME/conf
cp hive-default.xml.template hive-site.xml
Edit hive-site.xml and set the following basic properties:

<configuration>

<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby:;databaseName=metastore_db;create=true</value>
<description>Derby embedded metastore URL</description>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
<description>Hive warehouse location</description>
</property>
</configuration>

5. Initialize the Hive Metastore


schematool -dbType derby -initSchema

This sets up the metastore database using embedded Derby.

6. Create Warehouse Directory in HDFS

Start Hadoop services first:

start-dfs.sh
start-yarn.sh
Then create Hive warehouse directory in HDFS:

hdfs dfs -mkdir -p /user/hive/warehouse


hdfs dfs -chmod g+w /user/hive/warehouse



7. Start Hive
Launch the Hive shell:

$ hive


2. Use Hive to create, alter, and drop databases, tables, views, functions,
and indexes.

1. Create a Database

To create a new database in Hive:

CREATE DATABASE mydb;

2. Switch to the New Database


Once the database is created, switch to it:

USE mydb;



3. Create a Table

Now, let’s create a table students with some columns:

CREATE TABLE students (


id INT,
name STRING,
grade FLOAT
);
4. Insert Data into the Table
Insert some sample data into the students table:

INSERT INTO students (id, name, grade) VALUES (1, 'John Doe', 95);
INSERT INTO students (id, name, grade) VALUES (2, 'Jane Doe', 85);
INSERT INTO students (id, name, grade) VALUES (3, 'Mike Ross', 92);


5. Show Data in the Table


Now, check if the data has been inserted correctly:

SELECT * FROM students;



6. Alter the Table (Add a Column)

You can alter the table by adding a new column (e.g., email):

ALTER TABLE students ADD COLUMNS (email STRING);

7. Alter the Table (Rename a Column)

Rename the name column to full_name (keeping the STRING type):

ALTER TABLE students CHANGE name full_name STRING;

8. Create a View
Now, let's create a view high_achievers to show students with grades greater
than 90:

CREATE VIEW high_achievers AS


SELECT full_name, grade FROM students WHERE grade > 90;

9. Query the View

Check the data in the high_achievers view:

SELECT * FROM high_achievers;



10. Create an Index
Note: index support was removed in Hive 3.0, so on Hive 3.1.3 the statements below will be rejected; they apply to Hive 2.x and earlier (in Hive 3, materialized views and columnar storage are the suggested alternatives).

CREATE INDEX idx_grade ON TABLE students (grade)
AS 'COMPACT' WITH DEFERRED REBUILD;

11. Drop the Index

To drop the index created on the students table:

DROP INDEX idx_grade ON students;

12. Drop the View


Drop the high_achievers view if it's no longer needed:

DROP VIEW high_achievers;

13. Drop the Table


To drop the students table:

DROP TABLE students;

14. Drop the Database


If you want to drop the database mydb (with its tables, views, and other
objects):

DROP DATABASE mydb CASCADE;



Experiment No.: - 10

Aim: - Install HBase and perform CRUD operations using the HBase shell.

Step 1: Download and Install HBase


1.1 Download HBase
cd ~
wget https://downloads.apache.org/hbase/2.4.17/hbase-2.4.17-bin.tar.gz

1.2 Extract the Archive


tar -xvzf hbase-2.4.17-bin.tar.gz
mv hbase-2.4.17 hbase

Step 2: Configure HBase


2.1 Edit hbase-env.sh

nano ~/hbase/conf/hbase-env.sh

Uncomment and set Java home:

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64   # Change path as needed

Also set whether HBase manages its own ZooKeeper instance (false means an external ZooKeeper is used):

export HBASE_MANAGES_ZK=false

2.2 Edit hbase-site.xml


nano ~/hbase/conf/hbase-site.xml

Add this configuration inside <configuration>:

<property>
<name>hbase.rootdir</name>
<value>hdfs://localhost:9000/hbase</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/home/hdoop/hbase/zookeeper</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>

Step 3: Start Hadoop and HBase


3.1 Start Hadoop (if not already running)
start-dfs.sh

3.2 Format HDFS (first time only)


hdfs namenode -format

3.3 Start HBase


~/hbase/bin/start-hbase.sh

Step 4: Access HBase Shell


~/hbase/bin/hbase shell

You should see the HBase shell prompt like this:



hbase(main):001:0>

Step 5: Perform CRUD Operations in HBase Shell


5.1 Create a Table
create 'students', 'info'

5.2 Insert (Put) Data


put 'students', '1', 'info:name', 'Alice'
put 'students', '1', 'info:age', '22'
put 'students', '2', 'info:name', 'Bob'
put 'students', '2', 'info:age', '24'

5.3 Read (Get and Scan)


get 'students', '1'
scan 'students'

5.4 Update Data (same as put)


put 'students', '1', 'info:age', '23'
get 'students', '1'



5.5 Delete Data
Delete a column:
delete 'students', '1', 'info:age'
Delete entire row:
deleteall 'students', '1'

Step 6: Stop HBase and Hadoop


6.1 Stop HBase
~/hbase/bin/stop-hbase.sh

6.2 Stop Hadoop


stop-dfs.sh



Experiment No.: -11

Aim: - Implement Spark Core processing with RDDs to run a Word Count program.

Steps: WordCount in PySpark (Using Local File)

Step 1: Create Your Input File

Open terminal and run:

nano input.txt

Add this sample text:

hello world
hello hadoop
hello spark

Save and exit (Ctrl + O, Enter, Ctrl + X).

Step 2: Create Your PySpark Script

Now create a new Python file:

nano wordcount.py

Paste this code:

from pyspark import SparkContext

# Create SparkContext
sc = SparkContext("local", "WordCount")

# Read the local input file
text_file = sc.textFile("file:///home/hdoop/input.txt")

# Split lines into words, map to (word, 1), reduce by key
word_counts = text_file.flatMap(lambda line: line.split()) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)

# Save the result in the output folder
word_counts.saveAsTextFile("file:///home/hdoop/output")

# Stop the SparkContext
sc.stop()

Save and exit (Ctrl + O, Enter, Ctrl + X).

Step 3: Run Your Script

In terminal:

spark-submit wordcount.py


Step 4: See the Result

Run:

cat output/part-00000

Example output:

('hello', 3)
('world', 1)
('hadoop', 1)
('spark', 1)



Experiment No.: - 12

Aim: - Implement Spark Core processing with RDDs to read a table stored in a database and calculate the number of people for every age.

Step 1: Create the MySQL Table

Open MySQL:

mysql -u root -p

Then:

CREATE DATABASE sparkdb;
USE sparkdb;

CREATE TABLE people (
    id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(50),
    age INT
);

INSERT INTO people (name, age) VALUES
('Alice', 25), ('Bob', 30), ('Charlie', 25), ('David', 40), ('Eve', 30);

EXIT;



Step 2: Download JDBC Driver

Download MySQL JDBC driver:

wget https://repo1.maven.org/maven2/mysql/mysql-connector-
java/8.0.33/mysql-connector-java-8.0.33.jar
Move it to your home directory if needed:

mv mysql-connector-java-8.0.33.jar ~/mysql-connector-java.jar

Step 3: Create the PySpark Script

Create a new Python file:

nano age_count.py

Paste this code:

from pyspark.sql import SparkSession

# Start Spark session
spark = SparkSession.builder \
    .appName("AgeCount") \
    .config("spark.driver.extraClassPath", "/home/hdoop/mysql-connector-java.jar") \
    .getOrCreate()

# JDBC config
url = "jdbc:mysql://localhost:3306/sparkdb"
properties = {
    "user": "root",
    "password": "your_mysql_password",
    "driver": "com.mysql.cj.jdbc.Driver"
}

# Load data from MySQL
df = spark.read.jdbc(url=url, table="people", properties=properties)

# Convert to RDD and count people by age
rdd = df.rdd
age_counts = rdd.map(lambda row: (row.age, 1)).reduceByKey(lambda a, b: a + b)

# Print result
for age, count in age_counts.collect():
    print(f"Age: {age}, Count: {count}")

spark.stop()

Replace your_mysql_password with your actual MySQL root password.


Step 4: Run the Script

Run it using Spark:

spark-submit --jars ~/mysql-connector-java.jar age_count.py

Example Output (for the rows inserted in Step 1):

Age: 25, Count: 2
Age: 30, Count: 2
Age: 40, Count: 1
