The document outlines the installation and setup of VMware for creating a Hadoop environment, detailing prerequisites such as the operating system, JDK, and Hadoop version. It provides a step-by-step procedure for installing Ubuntu, configuring Java, and setting up Hadoop in various modes (standalone, pseudo-distributed, fully distributed). Additionally, it covers basic Linux commands for file and directory management.


Experiment No. 1
Aim: - Installation of VMware to set up the Hadoop environment and its ecosystems.

Prerequisites:
1. Virtual Environment: VMware is used to provide the virtual environment.
2. Operating System: Hadoop is installed on Linux-based operating systems. Ubuntu and CentOS are two of the most popular choices; we'll be using Ubuntu for this tutorial.
3. JDK: The Java 8 package must be installed on your system.
4. Hadoop version: Here we are using the Hadoop 3.4.1 package.

Procedure:
Step 1: Installing VMware Workstation Player
First, download VMware Workstation Player from the official VMware download page and follow its installation guide. After the download completes, open the .exe file.



Step 2: Installing Ubuntu. Download the Ubuntu ISO image from the official Ubuntu download page. After downloading, save the file in the desired location (preferably D:).

Step 3: Using VMware to install Ubuntu. When you open VMware, its home window appears; create a new virtual machine and point it at the downloaded Ubuntu ISO.



After the installation completes, the Ubuntu desktop interface appears.



Commands in the Linux environment:
sudo apt is a command-line tool on Debian- and Ubuntu-based operating systems that allows users to install, remove, and manage software packages. The "sudo" part of the command stands for "superuser do", which allows the user to execute commands with administrative (root-level) privileges. The "apt" part refers to the Advanced Package Tool, the package management system used by Debian-based Linux distributions.
Some common sudo apt commands include:

sudo apt update: Updates the package lists from the configured repositories.
sudo apt upgrade: Upgrades all installed packages to their newest versions.
sudo apt install <package>: Installs the specified package.
sudo apt remove <package>: Removes the specified package.
sudo apt search <keyword>: Searches for packages matching the given keyword.
sudo apt show <package>: Displays information about the specified package.

Using sudo apt provides a convenient and centralized way to manage software on a Debian-based Linux system, making it easier to keep your system up to date and install the applications you need.

Step 4: Java JDK installation. Install the JDK on Ubuntu using the terminal:
>>sudo apt install openjdk-8-jdk openjdk-8-jre



Step 5: To install the OpenSSH server and client on Ubuntu, use the command below.
>>sudo apt install openssh-server openssh-client -y


Step 6: To generate SSH authentication keys (an RSA pair) with OpenSSH on the Ubuntu machine, use the command below.
>>ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa

Step 7: To append the generated public key to the list of authorized keys, use the command below.
>>cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys



Step 8: The command below can be used to set permissions for the file generated in the previous step.
>>chmod 0600 ~/.ssh/authorized_keys

Step 9: The command below can be used to connect to localhost using the SSH protocol.
>>ssh localhost


Step 10: Use the command below to download Hadoop from the command-line interface.
>>wget https://downloads.apache.org/hadoop/common/current/hadoop-3.4.1.tar.gz


Step 11: Extract the downloaded file by executing the command below.
>>tar xzf hadoop-3.4.1.tar.gz
Step 12: Next, the .bashrc file needs to be edited using the nano text editor. Use the command below for this purpose.
>>nano ~/.bashrc

Step 13: Add the lines below to this file.

#Hadoop Related Options
export HADOOP_HOME=/home/hdoop/hadoop-3.4.1
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"


Step 14: Open the 2nd file that needs to be edited.
>>sudo nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh

Step 15: Add the line below at the end of this file.

>>export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
Step 16: Edit the 3rd file, namely $HADOOP_HOME/etc/hadoop/core-site.xml.
>>sudo nano $HADOOP_HOME/etc/hadoop/core-site.xml

Add the lines below to this file (between <configuration> and </configuration>):

<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/hdoop/tmpdata</value>
  <description>A base for other temporary directories.</description>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
  <description>The name of the default file system.</description>
</property>



Step 17: Open the 4th file which requires editing, namely $HADOOP_HOME/etc/hadoop/hdfs-site.xml.

>>sudo nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml

Step 18: Add the required properties to this file (between <configuration> and </configuration>). For a single-node setup these define the NameNode and DataNode storage directories and the replication factor; a sketch is given below.
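A minimal hdfs-site.xml sketch for a pseudo-distributed (single-node) setup; the storage paths under /home/hdoop are assumptions and can be replaced with any writable local directories:

<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Single node, so keep only one copy of each block.</description>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>/home/hdoop/dfsdata/namenode</value>
  <description>Assumed local directory for NameNode metadata.</description>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/home/hdoop/dfsdata/datanode</value>
  <description>Assumed local directory for DataNode blocks.</description>
</property>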

Step 19: Open the 5th file, namely $HADOOP_HOME/etc/hadoop/mapred-site.xml, to edit.
>>sudo nano $HADOOP_HOME/etc/hadoop/mapred-site.xml

Step 20: Add the lines below to this file (between <configuration> and </configuration>):

<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>

Step 21: Open the 6th file, namely $HADOOP_HOME/etc/hadoop/yarn-site.xml, to enable editing.
>>sudo nano $HADOOP_HOME/etc/hadoop/yarn-site.xml

Step 22: Add the lines below to this file (between <configuration> and </configuration>):

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>127.0.0.1</value>
</property>
<property>
  <name>yarn.acl.enable</name>
  <value>0</value>
</property>
<property>
  <name>yarn.nodemanager.env-whitelist</name>
  <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>



Step 23: Use the command below to format the HDFS NameNode before the first launch.
>>hdfs namenode -format

Step 24: Get started with Hadoop by starting the HDFS daemons.

>>start-dfs.sh


Result: - Thus, setting up and installing Hadoop was completed successfully.



Experiment No.: - 2
Aim: -
I. Perform setting up and installing Hadoop in its three operating modes: a. Standalone, b. Pseudo-distributed, c. Fully distributed.
II. Use web-based tools to monitor your Hadoop setup.

About HADOOP
The Apache Hadoop software library is a framework that allows for the
distributed processing of large data sets across clusters of computers using
simple programming models. It is designed to scale up from single servers to
thousands of machines, each offering local computation and storage. Rather
than rely on hardware to deliver high availability, the library itself is designed to
detect and handle failures at the application layer, so delivering a highly
available service on top of a cluster of computers, each of which may be prone
to failures.

Setting up HADOOP
Pre-requisites: 1. Java 2. SSH

Before any other steps, we need to set the Java environment variable. This can be done on Windows from the system variables window, or on Linux by adding the following to the variables file:

export JAVA_HOME=/usr/java/latest

Download and extract the HADOOP binaries:

I. wget http://apache.claz.org/hadoop/common/hadoop-3.4.1/hadoop-3.4.1.tar.gz
II. tar xzf hadoop-3.4.1.tar.gz
III. mv hadoop-3.4.1/* hadoop/



Pseudo-distributed mode
1. Add the Hadoop environment variables to the system variable file (the same exports as in Experiment 1, Step 13).

2. Configure the HADOOP files:

a. Change to the Hadoop configuration directory, hadoop/etc/hadoop.

b. Add the following to the hadoop-env.sh file:

export JAVA_HOME=/usr/local/jdk1.8.0_71

c. Edit the following config files (with the same properties as in Experiment 1, Steps 16-22):

core-site.xml

hdfs-site.xml

yarn-site.xml

mapred-site.xml

d. Verifying the installation (the commands used are sketched after this list):

i. Formatting the name node

ii. Verifying the HDFS file system

iii. Starting YARN

iv. Accessing the HADOOP web UI in a browser and verifying everything.
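A minimal sketch of the verification commands, assuming the environment variables from Experiment 1 are already in place; the ports are the Hadoop 3.x defaults:

hdfs namenode -format     # i. format the name node (first run only)
start-dfs.sh              # start the NameNode, DataNode and SecondaryNameNode daemons
jps                       # ii. verify that the HDFS daemons are running
hdfs dfs -ls /            # ii. list the root of the HDFS file system
start-yarn.sh             # iii. start the ResourceManager and NodeManager
# iv. open http://localhost:9870 (NameNode UI) and http://localhost:8088 (YARN UI) in a browser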

Fully distributed mode


1 Configure system and create host files on each node

a. For each node, edit the /etc/hosts file and add the IP addresses and hostnames of the servers, e.g. as in the sketch below.
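A hypothetical /etc/hosts layout for a three-node cluster (the IP addresses and hostnames are placeholders; substitute your own):

192.168.1.10   node-master
192.168.1.11   node1
192.168.1.12   node2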

2 Distribute the authentication key-pairs to the users

a. Log in to the node-master and generate ssh keys.

b. Copy the keys to the other nodes (see the sketch below).
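A minimal sketch, assuming the Hadoop user is called hadoop on every node and the hostnames from the /etc/hosts example above:

ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa   # a. generate the key pair on node-master
ssh-copy-id hadoop@node1                   # b. copy the public key to each worker node
ssh-copy-id hadoop@node2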



3 Download and extract the HADOOP binaries

4 Set the environment variables (same as pseudo-distributed)

5 Edit the core-site.xml file to set the NameNode location (see the sketch below).
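A sketch of the core-site.xml change for the cluster, assuming the NameNode runs on the node-master host defined in the /etc/hosts example; only the default file system URL differs from the pseudo-distributed setup:

<property>
  <name>fs.default.name</name>
  <value>hdfs://node-master:9000</value>
</property>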

6 Set the HDFS Paths in hdfs-site.xml

7 Set the Job scheduler (same as pseudo-distributed)

8 Configure YARN in yarn-site.xml



9 Duplicate the config files to each node.

10 Format the HDFS (same as pseudo-distributed).

11 Start the HDFS (same as pseudo-distributed).

12 Run YARN (same as pseudo-distributed).

Standalone
Step 1 — Installing Java
To get started, you'll update the package list and install OpenJDK, the default Java Development Kit on Ubuntu 20.04:

>>sudo apt update

>>sudo apt install default-jdk


Step 2 — Installing Hadoop


With Java in place, you’ll visit the Apache Hadoop Releases page to find the
most recent stable release.

Navigate to the binary for the release you'd like to install. In this guide you'll install Hadoop 3.4.1:
>>wget https://dlcdn.apache.org/hadoop/common/hadoop-3.4.1/hadoop-3.4.1.tar.gz


Step 3 — Configuring Hadoop’s Java Home
Hadoop requires that you set the path to Java, either as an environment
variable or in the Hadoop configuration file.

The path to Java, /usr/bin/java, is a symlink to /etc/alternatives/java, which is in turn a symlink to the default Java binary. You will use readlink with the -f flag to follow every symlink in every part of the path, recursively. Then, you'll use sed to trim bin/java from the output to give the correct value for JAVA_HOME.
To find the default Java path:

>>readlink -f /usr/bin/java | sed "s:bin/java::"
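The output depends on which JDK is installed; with the default-jdk package from Step 1 on Ubuntu 20.04 it will look something like:

/usr/lib/jvm/java-11-openjdk-amd64/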

Step 4 — Running Hadoop

Now you should be able to run Hadoop (assuming it was extracted to /usr/local/hadoop):

>>/usr/local/hadoop/bin/hadoop

Result: - We have installed HADOOP in standalone, pseudo-distributed, and fully distributed modes.



Experiment No. 3
Aim: - Implementing the basic commands of LINUX Operating
System – File/Directory creation, deletion, update operations.

Prerequisites
Basic understanding of what a command line interface is

Familiarity with the concept of files and directories (folders)

1. mkdir
Purpose: to create a directory

Folders in Linux are called "directories". They serve the same purpose as folders in Windows.
usage: mkdir [directory name]

Here’s mkdir in action:


You can even make a directory within a directory, even if the base one doesn't exist, with the -p option.
Here I will create a new directory called test2 within a test directory, using the -p option (see the commands below):
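The commands used here would be (directory names as in the text):

mkdir test            # create a single directory
mkdir -p test/test2   # create test2 inside test, creating test first if it does not exist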


2. rmdir
Purpose: to remove a directory

With the rmdir command, you can remove a directory quickly and easily.

usage: rmdir [directory name]

Here’s rmdir in action:



Now this works great if the directory is empty. But what about the directory I
created that has another directory in it?
Here’s what happens when I try to use rmdir on that directory:


rmdir cannot remove directories that have files or directories in them. For that, you must use the rm command (which we'll cover again in command #5), typed as:

rm -rf [directory name]



3. cp
Purpose: to make a copy of a file

Here's one you'll use all the time, especially if you're making a config file backup. Let's use that as an example. I want to make a backup of this file. If I mess something up, I can go back to the old version.

usage: cp [file name] [new file name]


You can also copy the file to another directory and keep the same file name:

cp [file name] [new location]


4. mv
Purpose: to move a file to another location or rename it

This one is pretty straightforward. You use it to move a file from one place to the
other.
usage:

mv [file name] [new location]



It’s used the same way as cp, though it moves the file instead of making a copy.


This is also how you rename a file.

usage:

mv [file name] [new file name]

So if I want to rename my nginx configuration file, I can do this:





5. rm
Purpose: to delete a file

We used rm earlier to remove a directory. It's also how you delete individual files.

usage:

rm [file name]


6. touch
Purpose: create an empty file

You may have noticed my “nginx.conf” was zero bytes. This is a nifty command
for creating empty files. This is handy for creating a new file or testing things.
usage:

touch [file name]

This creates a file with nothing in it:




7. find
Purpose: to find files in the file system

This is a powerful command for locating files anywhere in the file system.

usage:

find [path to search] -name filename
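A couple of hypothetical invocations (the file names are just examples):

find . -name nginx.conf         # search the current directory tree for nginx.conf
find /home -name "*.conf"       # search /home for any file ending in .conf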



Experiment No. 4
Aim: - Perform various File Management tasks in Hadoop.
i. Upload and download a file in HDFS.
ii. See contents of a file.
iii. Copy a file from source to destination.
iv. Copy a file from/To Local file system to HDFS.
v. Move file from source to destination.
vi. Remove a file or directory in HDFS.
vii. Display last few lines of a file.
viii. Display the aggregate length of a file.

i. Upload and Download a File in HDFS

Upload from local to HDFS:

hdfs dfs -put /path/to/local/file.txt /user/hadoop/

Download from HDFS to local:

hdfs dfs -get /user/hadoop/file.txt /path/to/local/

ii. See Contents of a File


hdfs dfs -cat /user/hadoop/file.txt



iii. Copy a File from Source to Destination (within HDFS)
hdfs dfs -cp /user/hadoop/file.txt /user/hadoop/copy_file.txt


iv. Copy a File from/To Local File System to HDFS

Local to HDFS:

hdfs dfs -copyFromLocal /path/to/local/file.txt /user/hadoop/

HDFS to Local:

hdfs dfs -copyToLocal /user/hadoop/file.txt /path/to/local/


v. Move File from Source to Destination (within HDFS)


hdfs dfs -mv /user/hadoop/file.txt /user/hadoop/new_folder/

vi. Remove a File or Directory in HDFS

Remove a file:

hdfs dfs -rm /user/hadoop/file.txt



Remove a directory and its contents:

hdfs dfs -rm -r /user/hadoop/directory_name


vii. Display Last Few Lines of a File


hdfs dfs -tail /user/hadoop/file.txt

By default, it displays the last 1KB of the file.

viii. Display the Aggregate Length of a File


hdfs dfs -du /user/hadoop/file.txt

For total space used by a directory recursively:

hdfs dfs -du -s -h /user/hadoop/



Experiment No. 5
Aim: - Implement Word Count Map Reduce program to understand
Map Reduce Program.

Step 1: Setup Requirements


Make sure you have:

Hadoop installed and configured (HADOOP_HOME set, HDFS working).
Java installed (java -version should show JDK 8 or above).
$HADOOP_HOME/bin added to your system's PATH.

Step 2: Create Java Program

1. Create a folder for your project:

mkdir WordCountProject
cd WordCountProject

2. Create a file named WordCount.java:

nano WordCount.java



3. Paste this code inside:
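The code itself appeared only as a screenshot; the standard Apache Hadoop WordCount example below matches the class name and the compile/run commands used in the following steps and can be used in its place.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emit (word, 1) for every token in the input line
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sum up the counts for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // /input
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // /output
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}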



Step 3: Compile the Program
mkdir wordcount_classes
javac -classpath `hadoop classpath` -d wordcount_classes WordCount.java

Step 4: Create a JAR File


jar -cvf wordcount.jar -C wordcount_classes/ .


Step 5: Prepare Input File in HDFS


1. Create a sample text file:

echo "Hello Hadoop Hello World" > input.txt

2. Create an input directory in HDFS:

hdfs dfs -mkdir /input

3. Upload the file to HDFS:

hdfs dfs -put input.txt /input/



Step 6: Run the MapReduce Job
hadoop jar wordcount.jar WordCount /input /output

Note: /output directory must not exist before you run the job. If it does:

hdfs dfs -rm -r /output


Step 7: View the Output

hdfs dfs -cat /output/part-r-00000

You should see something like:

Hadoop 1
Hello 2
World 1



Experiment No. 6
Aim: - Implement matrix multiplication with Hadoop Map Reduce.

Step 1: Setup Your Input Matrices


We’ll create two files: matrix_a.txt and matrix_b.txt

nano matrix_a.txt

Paste this content:

A 0 0 1
A 0 1 2
A 0 2 3
A 1 0 4
A 1 1 5
A 1 2 6

Press Ctrl+O, then Enter, then Ctrl+X to save and exit.

nano matrix_b.txt

Paste this content:

B 0 0 7
B 0 1 8
B 1 0 9
B 1 1 10
B 2 0 11
B 2 1 12
Save and exit (same steps as above).



Step 2: Combine Inputs
mkdir -p matrix_input
cat matrix_a.txt matrix_b.txt > matrix_input/input.txt


Step 3: Upload to HDFS


hadoop fs -mkdir -p /matrix/input
hadoop fs -put -f matrix_input/input.txt /matrix/input

Step 4: Create mapper.py

nano mapper.py
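The mapper code itself appeared only as a screenshot. Below is a minimal sketch written for this specific example, assuming python3 is available, input lines of the form "matrix row col value", and hard-coded dimensions matching the sample data (A has 2 rows, B has 2 columns):

#!/usr/bin/env python3
# Streaming mapper for A (2x3) x B (3x2); input lines look like "A 0 0 1" or "B 1 1 10".
import sys

ROWS_A = 2   # rows of A (assumed, matches matrix_a.txt)
COLS_B = 2   # columns of B (assumed, matches matrix_b.txt)

for line in sys.stdin:
    parts = line.split()
    if len(parts) != 4:
        continue                       # skip blank or malformed lines
    matrix, row, col, value = parts[0], int(parts[1]), int(parts[2]), parts[3]
    if matrix == "A":
        # A[i][j] is needed for every output cell (i, k)
        for k in range(COLS_B):
            print(f"{row},{k}\tA,{col},{value}")
    elif matrix == "B":
        # B[j][k] is needed for every output cell (i, k)
        for i in range(ROWS_A):
            print(f"{i},{col}\tB,{row},{value}")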

Save and exit, then make it executable:



chmod +x mapper.py

Step 5: Create reducer.py

nano reducer.py
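Likewise, a minimal reducer sketch matching the mapper above: it relies on Hadoop Streaming delivering the keys in sorted order and, for each output cell, multiplies the A and B entries that share the same inner index j:

#!/usr/bin/env python3
# Streaming reducer: for each cell "i,k", sum A[i][j] * B[j][k] over the shared index j.
import sys

def emit(cell, a_vals, b_vals):
    total = sum(a_vals[j] * b_vals[j] for j in a_vals if j in b_vals)
    print(f"{cell}\t{total}")

current = None
a_vals, b_vals = {}, {}

for line in sys.stdin:
    cell, payload = line.strip().split("\t")
    if current is not None and cell != current:
        emit(current, a_vals, b_vals)      # keys arrive sorted, so the previous cell is complete
        a_vals, b_vals = {}, {}
    current = cell
    matrix, j, value = payload.split(",")
    if matrix == "A":
        a_vals[int(j)] = int(value)
    else:
        b_vals[int(j)] = int(value)

if current is not None:
    emit(current, a_vals, b_vals)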

Make it executable:

chmod +x reducer.py

Step 6: Run Hadoop Streaming Job

hadoop jar /home/hdoop/hadoop-3.4.1/share/hadoop/tools/lib/hadoop-streaming-3.4.1.jar \
  -input /matrix/input \
  -output /matrix/output \
  -mapper mapper.py \
  -reducer reducer.py \
  -file mapper.py \
  -file reducer.py



Wait for the job to complete.


Step 7: View Output

hadoop fs -cat /matrix/output/part-00000

You should see:

0,0 58
0,1 64
1,0 139
1,1 154



Experiment No.: - 7
Aim: -
i. Installation of PIG.
ii. Write Pig Latin scripts to sort, group, join, project, and filter your data.

About Pig: -
Pig represents big data as data flows. Pig is a high-level platform, or tool, which is used to process large datasets. It provides a high level of abstraction over MapReduce and a high-level scripting language, known as Pig Latin, which is used to develop data analysis code. To process data stored in HDFS, programmers write scripts using the Pig Latin language.

Need for Pig: One limitation of MapReduce is that the development cycle is very long. Writing the mapper and reducer, compiling and packaging the code, submitting the job, and retrieving the output is a time-consuming task. Apache Pig reduces the development time by using a multi-query approach.

I. Installing Apache Pig -

1. Install Java (Pig requires Java)


sudo apt update
sudo apt install openjdk-11-jdk -y
2. Set JAVA_HOME

Add to .bashrc:

nano ~/.bashrc

Add at the end:

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export PATH=$PATH:$JAVA_HOME/bin
Then apply changes:



source ~/.bashrc

3. Download Apache Pig


wget https://downloads.apache.org/pig/pig-0.17.0/pig-0.17.0.tar.gz
4. Extract and Move Pig
tar -xvzf pig-0.17.0.tar.gz
sudo mv pig-0.17.0 /opt/pig

5. Set PIG_HOME and PATH

Again edit .bashrc:

nano ~/.bashrc

Add:

export PIG_HOME=/opt/pig
export PATH=$PATH:$PIG_HOME/bin
Apply:

source ~/.bashrc

6. Verify Installation
pig -version


II. Writing Pig Latin Scripts

Let’s assume you have a CSV file called data.csv:

id,name,age,department
1,John,25,IT
2,Alice,30,HR
3,Bob,22,IT
4,Eve,35,Finance


1. Start Pig in Local Mode


pig -x local
2. Load Data
data = LOAD 'data.csv' USING PigStorage(',') AS (id:int, name:chararray,
age:int, dept:chararray);

A. Project (select specific columns)


proj = FOREACH data GENERATE name, age;
DUMP proj;



B. Filter (e.g., age > 25)
filtered = FILTER data BY age > 25;
DUMP filtered;


C. Sort (by age ascending)


sorted = ORDER data BY age ASC;
DUMP sorted;

D. Group (by department)


grouped = GROUP data BY dept;
DUMP grouped;




E. Join (joining two datasets)

Assume another file salaries.csv:

id,salary
1,50000
2,60000
3,55000
4,70000

Load salaries:

salaries = LOAD 'salaries.csv' USING PigStorage(',') AS (id:int, salary:int);


joined = JOIN data BY id, salaries BY id;
DUMP joined;

3. Exit Pig
quit;



Experiment No.: - 8
Aim: -
i. Run the Pig Latin Scripts to find Word Count.
ii. Run the Pig Latin Scripts to find a max temp for every year.

i. Run Pig Script to Find Word Count

Step 1: Create a Text File

Let’s create a file text.txt:

nano text.txt

Paste this sample content:

hello world
hello pig
hello hadoop
pig loves hadoop
Save and exit (Ctrl + O, Enter, Ctrl + X)

Step 2: Start Grunt Shell


pig -x local



Step 3: Run Word Count in Grunt
-- Load file
lines = LOAD 'text.txt' AS (line:chararray);

-- Break into words
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- Group words
grouped = GROUP words BY word;

-- Count each word
wordcount = FOREACH grouped GENERATE group AS word, COUNT(words) AS count;

-- Dump result
DUMP wordcount;

You’ll see output like:

(hadoop,2)
(hello,3)
(pig,2)
(world,1)
(loves,1)



ii. Run Pig Script to Find Max Temperature for Every Year

Step 1: Create the Weather File
nano weather.csv

Paste this:

2010,30
2010,35
2010,29
2011,38
2011,36
2011,32
Save and exit.

Step 2: Start Grunt Shell (if not already)


pig -x local



Step 3: Run Max Temp Script in Grunt
-- Load weather data
weather = LOAD 'weather.csv' USING PigStorage(',') AS (year:int, temp:int);

-- Group by year
grouped = GROUP weather BY year;
-- Find max temp per year

max_temp = FOREACH grouped GENERATE group AS year,


MAX(weather.temp) AS max_temp;

-- Dump results
DUMP max_temp;

You’ll see output like:

(2010,35)
(2011,38)



Experiment No.: - 9
Aim: -
i. Installation of HIVE.
ii. Use Hive to create, alter, and drop databases, tables, views,
functions, and indexes

1. Apache Hive 3.1.3 Installation (Step-by-Step)


Pre-requisites

Java JDK 8 or 11 installed


Hadoop installed and configured (version 3.4.1)
Environment variables set for Hadoop
Linux-based system (Ubuntu or similar)

1. Download Hive 3.1.3


cd /home/hdoop/
wget https://downloads.apache.org/hive/hive-3.1.3/apache-hive-3.1.3-bin.tar.gz

2. Extract Hive
tar -xvzf apache-hive-3.1.3-bin.tar.gz
mv apache-hive-3.1.3-bin hive

Now Hive is located at: /home/hdoop/hive

3. Set Hive Environment Variables


Add the following lines to your ~/.bashrc file:

# Hive Environment
export HIVE_HOME=/home/hdoop/hive
export PATH=$PATH:$HIVE_HOME/bin

Apply the changes:

source ~/.bashrc



4. Configure Hive (hive-site.xml)
Create the Hive configuration directory and copy default files:

cd $HIVE_HOME/conf
cp hive-default.xml.template hive-site.xml
Edit hive-site.xml and set the following basic properties:

<configuration>

<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby:;databaseName=metastore_db;create=true</value>
<description>Derby embedded metastore URL</description>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
<description>Hive warehouse location</description>
</property>
</configuration>

5. Initialize the Hive Metastore


schematool -dbType derby -initSchema

This sets up the metastore database using embedded Derby.

6. Create Warehouse Directory in HDFS

Start Hadoop services first:

start-dfs.sh
start-yarn.sh
Then create Hive warehouse directory in HDFS:

hdfs dfs -mkdir -p /user/hive/warehouse


hdfs dfs -chmod g+w /user/hive/warehouse



7. Start Hive
Launch the Hive shell:

$ hive


2. Use Hive to create, alter, and drop databases, tables, views, functions,
and indexes.

1. Create a Database

To create a new database in Hive:

CREATE DATABASE mydb;

2. Switch to the New Database


Once the database is created, switch to it:

USE mydb;



3. Create a Table

Now, let’s create a table students with some columns:

CREATE TABLE students (


id INT,
name STRING,
grade FLOAT
);
4. Insert Data into the Table
Insert some sample data into the students table:

INSERT INTO students (id, name, grade) VALUES (1, 'John Doe', 95);
INSERT INTO students (id, name, grade) VALUES (2, 'Jane Doe', 85);
INSERT INTO students (id, name, grade) VALUES (3, 'Mike Ross', 92);


5. Show Data in the Table


Now, check if the data has been inserted correctly:

SELECT * FROM students;



6. Alter the Table (Add a Column)

You can alter the table by adding a new column (e.g., email):

ALTER TABLE students ADD COLUMNS (email STRING);

7. Alter the Table (Rename a Column)

Rename the name column to full_name (keeping the STRING type):

ALTER TABLE students CHANGE name full_name STRING;

8. Create a View
Now, let's create a view high_achievers to show students with grades greater
than 90:

CREATE VIEW high_achievers AS


SELECT full_name, grade FROM students WHERE grade > 90;

9. Query the View

Check the data in the high_achievers view:

SELECT * FROM high_achievers;



10. Create an Index
Note: index support was removed in Hive 3.0, so on Hive 3.1.3 the statements below will be rejected; they apply to Hive 2.x and earlier (in Hive 3, materialized views and columnar storage are the suggested alternatives).

CREATE INDEX idx_grade ON TABLE students (grade)
AS 'COMPACT' WITH DEFERRED REBUILD;

11. Drop the Index

To drop the index created on the students table:

DROP INDEX idx_grade ON students;

12. Drop the View


Drop the high_achievers view if it's no longer needed:

DROP VIEW high_achievers;

13. Drop the Table


To drop the students table:

DROP TABLE students;

14. Drop the Database


If you want to drop the database mydb (with its tables, views, and other
objects):

DROP DATABASE mydb CASCADE;



Experiment No.: - 10

Aim: - Install HBase and perform CRUD operations using the HBase shell.

Step 1: Download and Install HBase


1.1 Download HBase
cd ~
wget https://downloads.apache.org/hbase/2.4.17/hbase-2.4.17-bin.tar.gz

1.2 Extract the Archive


tar -xvzf hbase-2.4.17-bin.tar.gz
mv hbase-2.4.17 hbase

Step 2: Configure HBase


2.1 Edit hbase-env.sh

nano ~/hbase/conf/hbase-env.sh

Uncomment and set Java home:

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64   # Change path as needed

Also set whether HBase manages its own ZooKeeper instance (false means an external ZooKeeper is used):

export HBASE_MANAGES_ZK=false

2.2 Edit hbase-site.xml


nano ~/hbase/conf/hbase-site.xml

Add this configuration inside <configuration>:

<property>
<name>hbase.rootdir</name>
<value>hdfs://localhost:9000/hbase</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/home/hdoop/hbase/zookeeper</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>

Step 3: Start Hadoop and HBase


3.1 Start Hadoop (if not already running)
start-dfs.sh

3.2 Format HDFS (first time only)


hdfs namenode -format

3.3 Start HBase


~/hbase/bin/start-hbase.sh

Step 4: Access HBase Shell


~/hbase/bin/hbase shell

You should see the HBase shell prompt like this:



hbase(main):001:0>

Step 5: Perform CRUD Operations in HBase Shell


5.1 Create a Table
create 'students', 'info'

5.2 Insert (Put) Data


put 'students', '1', 'info:name', 'Alice'
put 'students', '1', 'info:age', '22'
put 'students', '2', 'info:name', 'Bob'
put 'students', '2', 'info:age', '24'

5.3 Read (Get and Scan)


get 'students', '1'
scan 'students'

5.4 Update Data (same as put)


put 'students', '1', 'info:age', '23'
get 'students', '1'



5.5 Delete Data
Delete a column:
delete 'students', '1', 'info:age'
Delete entire row:
deleteall 'students', '1'

Step 6: Stop HBase and Hadoop


6.1 Stop HBase
~/hbase/bin/stop-hbase.sh

6.2 Stop Hadoop


stop-dfs.sh



Experiment No.: -11

Aim: - Implement Spark Core processing with RDDs to run a Word Count program.

Steps: WordCount in PySpark (Using Local File)

Step 1: Create Your Input File

Open terminal and run:

nano input.txt

Add this sample text:

hello world
hello hadoop
hello spark

Save and exit (Ctrl + O, Enter, Ctrl + X).

Step 2: Create Your PySpark Script

Now create a new Python file:

nano wordcount.py

Paste this code:

from pyspark import SparkContext

# Create SparkContext
sc = SparkContext("local", "WordCount")

# Read the local input file
text_file = sc.textFile("file:///home/hdoop/input.txt")

# Split lines into words, map to (word, 1), reduce by key
word_counts = text_file.flatMap(lambda line: line.split()) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)

# Save the result in the output folder
word_counts.saveAsTextFile("file:///home/hdoop/output")

# Stop the SparkContext
sc.stop()

Save and exit (Ctrl + O, Enter, Ctrl + X).

Step 3: Run Your Script

In terminal:

spark-submit wordcount.py


Step 4: See the Result

Run:

cat output/part-00000

Example output:

('hello', 3)
('world', 1)
('hadoop', 1)
('spark', 1)



Experiment No.: - 12

Aim: - Implement Spark Core processing with RDDs to read a table stored in a database and calculate the number of people for every age.

Step 1: Create the MySQL Table

Open MySQL:

mysql -u root -p

Then:

CREATE DATABASE sparkdb;
USE sparkdb;

CREATE TABLE people (
    id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(50),
    age INT
);

INSERT INTO people (name, age) VALUES
('Alice', 25), ('Bob', 30), ('Charlie', 25), ('David', 40), ('Eve', 30);

EXIT;



Step 2: Download JDBC Driver

Download MySQL JDBC driver:

wget https://repo1.maven.org/maven2/mysql/mysql-connector-
java/8.0.33/mysql-connector-java-8.0.33.jar
Move it to your home directory if needed:

mv mysql-connector-java-8.0.33.jar ~/mysql-connector-java.jar

Step 3: Create the PySpark Script

Create a new Python file:

nano age_count.py

Paste this code:

from pyspark.sql import SparkSession

# Start Spark session
spark = SparkSession.builder \
    .appName("AgeCount") \
    .config("spark.driver.extraClassPath", "/home/hdoop/mysql-connector-java.jar") \
    .getOrCreate()

# JDBC config
url = "jdbc:mysql://localhost:3306/sparkdb"
properties = {
    "user": "root",
    "password": "your_mysql_password",
    "driver": "com.mysql.cj.jdbc.Driver"
}

# Load data from MySQL
df = spark.read.jdbc(url=url, table="people", properties=properties)

# Convert to RDD and count people by age
rdd = df.rdd
age_counts = rdd.map(lambda row: (row.age, 1)).reduceByKey(lambda a, b: a + b)

# Print result
for age, count in age_counts.collect():
    print(f"Age: {age}, Count: {count}")

spark.stop()

Replace your_mysql_password with your actual MySQL root password.


Step 4: Run the Script

Run it using Spark:

spark-submit --jars ~/mysql-connector-java.jar age_count.py

Example Output (for the rows inserted in Step 1):

Age: 25, Count: 2
Age: 30, Count: 2
Age: 40, Count: 1
