
BIG DATA ANALYTICS LABORATORY

LAB MANUAL

Regulation : 2021

Course Code : CCS334

Semester : VI

Branch : CSE

Prepared By Approved By
R. JAAZIELIAH, AP/CSE HOD / CSE
Institution Vision

To impart quality technical education emphasizing innovations and research with social and ethical
values.

Institution Mission

1. Establishing state-of-the-art infrastructure, effective procedures for recruitment of competent faculty, and innovative teaching practices.
2. Creating a conducive environment for nurturing innovative ideas and encouraging research skills.
3. Inculcating social and ethical values through co-curricular and extra-curricular activities.

Department Vision

To provide technical education in the field of computer science and engineering by imparting
employability, research capabilities, entrepreneurship with human values.

Department Mission

1. Adopting innovative teaching-learning practices through state-of-the-art infrastructure.


2. Establishing a conducive environment for innovations and research activities.
3. Inculcating leadership skills, moral and social values through extension activities.
Program Educational Objectives

PEO 1: The graduates will excel in their chosen technical domains in Computer Science and Engineering.
PEO 2: The graduates will solve real world problems in Computer Science and Engineering by applying
logical and coding skills considering ethical values.
PEO 3: The graduates will work enthusiastically as a team and engage in lifelong learning.

Program Outcomes

PO 1 Engineering knowledge: Apply the knowledge of mathematics, science, engineering fundamentals, and an engineering specialization to the solution of complex engineering problems.
PO 2 Problem analysis: Identify, formulate, review research literature, and analyze complex engineering problems reaching substantiated conclusions using first principles of mathematics, natural sciences, and engineering sciences.
PO 3 Design/development of solutions: Design solutions for complex engineering problems and design system components or processes that meet the specified needs with appropriate consideration for the public health and safety, and the cultural, societal, and environmental considerations.
PO 4 Conduct investigations of complex problems: Use research-based knowledge and research methods including design of experiments, analysis and interpretation of data, and synthesis of the information to provide valid conclusions.
PO 5 Modern tool usage: Create, select, and apply appropriate techniques, resources, and modern engineering and IT tools including prediction and modeling to complex engineering activities with an understanding of the limitations.
PO 6 The engineer and society: Apply reasoning informed by the contextual knowledge to assess societal, health, safety, legal and cultural issues and the consequent responsibilities relevant to the professional engineering practice.
PO 7 Environment and sustainability: Understand the impact of the professional engineering solutions in societal and environmental contexts, and demonstrate the knowledge of, and need for, sustainable development.
PO 8 Ethics: Apply ethical principles and commit to professional ethics and responsibilities and norms of the engineering practice.
PO 9 Individual and team work: Function effectively as an individual, and as a member or leader in diverse teams, and in multidisciplinary settings.
PO 10 Communication: Communicate effectively on complex engineering activities with the engineering community and with society at large, such as being able to comprehend and write effective reports and design documentation, make effective presentations, and give and receive clear instructions.
PO 11 Project management and finance: Demonstrate knowledge and understanding of the engineering and management principles and apply these to one's own work, as a member and leader in a team, to manage projects and in multidisciplinary environments.
PO 12 Life-long learning: Recognize the need for, and have the preparation and ability to engage in, independent and life-long learning in the broadest context of technological change.
Program Specific Outcomes

PSO1: Analyze, design and develop computing solutions by applying foundational concepts of Computer
Science and Engineering.
PSO2: Apply software engineering principles and practices for developing quality software for scientific
and business applications.
PSO3: Adapt to emerging Information and Communication Technologies (ICT) to innovate ideas and
solutions to existing/novel problems.
CCS334 BIG DATA ANALYTICS    L T P C
                             2 0 2 3
COURSE OBJECTIVES:
• To understand big data.
• To learn and use NoSQL big data management.
• To learn MapReduce analytics using Hadoop and related tools.
• To work with MapReduce applications.
• To understand the usage of Hadoop-related tools for Big Data Analytics.

LIST OF EXPERIMENTS: 30 PERIODS


1. Downloading and installing Hadoop; understanding different Hadoop modes, startup scripts, and configuration files.
2. Hadoop Implementation of file management tasks, such as Adding files and directories, retrieving files
and Deleting files
3. Implementation of Matrix Multiplication with Hadoop MapReduce.
4. Run a basic Word Count Map Reduce program to understand Map Reduce Paradigm.
5. Installation of Hive along with practice examples.
6. Installation of HBase, Installing thrift along with Practice examples.
7. Practice importing and exporting data from various databases.
8. Installation of Apache PIG.

Software Requirements:
Cassandra, Hadoop, Java, Pig, Hive and HBase.

COURSE OUTCOMES:
After the completion of this course, students will be able to:
CO1: Summarize big data and use cases from selected business domains.
CO2: Explain NoSQL big data management.
CO3: Experiment with Hadoop and HDFS installation and configuration.
CO4: Examine map-reduce analytics using Hadoop.
CO5: Test Hadoop-related tools such as HBase, Cassandra, Pig, and Hive for big data analytics.

OBJECTIVES:

CO1: Illustrate the big data and use cases from selected business domains
CO2: Explain NoSQL big data management
CO3: Solve map-reduce analytics using Hadoop
CO4: Demonstrate the knowledge of big data analytics and implement different file management
tasks in Hadoop.
CO5: Make use of Hadoop-related tools such as HBase, Cassandra, Pig, and Hive for big data
analytics.
CO - PO MATRICES OF COURSE
Mapping of Course Outcomes with Program Outcomes & Program Specific Outcomes:

CO No PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12 PSO1 PSO2 PSO3
1 3 3 3 3 3 - - - 2 2 3 1 1 3 3
2 3 3 2 3 2 - - - 2 2 3 3 2 3 2
3 3 3 3 2 3 - - - 2 2 1 2 2 3 3
4 2 3 3 3 3 - - - 2 2 3 2 3 3 2
5 3 3 3 3 3 - - - 3 1 3 2 3 2 3
Average 2.8 3 2.8 2.8 2.8 - - - 2.2 1.8 2.6 2 2.2 2.8 2.6

Experiment wise CO Mapping:

CO/ Exp. No.   1   2   3   4   5   6   7   8

C310.1
C310.2   ✓
C310.3   ✓ ✓
C310.4   ✓
C310.5   ✓ ✓ ✓
RUBRICS FOR INTERNAL LABORATORY ASSESSMENT

S.NO   COMPONENT (Marks)   ASSESSMENT METRICS (Marks)

1   Preparation (10)   Understanding the problem (2), Step-by-step algorithm written (2), Appropriate code (2), Knowledge on I/O (2), Preparation within time bound (2)
2   Observation (10)   Clean code - indentation, naming of identifiers, comments (3), Debugging (3), Error-free compilation (2), Execution with expected output (2)
3   Report (5)   Output presentation (3), Compilation time (2)
4   Record (15)   Error free (6), Neat presentation (6), Time of submission (3)
5   Viva (10)   Five questions, two marks each (10)
List of Experiment

S. NO   NAME OF THE EXPERIMENTS   PAGE NO

1 Installation and configuration of hadoop 6

2 Implementation of Hadoop file management tasks 11

3 Matrix Multiplication with Hadoop Map Reduce 14

4 Word count MapReduce program 18

5 Installation of Hive 25

6 Installation of HBase and Thrift 30

7 Import and export data from various databases 39

8 Additional Experiments: Installation of Pig 45

EXP.NO.1 Installation and Configuration of Hadoop

AIM:

To Download and install Hadoop and understand its different modes.

PROCEDURE:

Prerequisites to Install Hadoop on Ubuntu


Hardware requirement- The machine must have 4GB RAM and minimum 60 GB hard disk for
better performance.
Check Java version - It is recommended to install Oracle Java 8. The user can check the version of Java with the command below.
$ java -version

Step 1: Download Hadoop and Java


tar -zxvf hadoop-2.9.1.tar.gz (Extract the tar file)
tar -zxvf jdk-8u45-linux-x64.tar (Extract the tar file)

sudo apt-get install vim (Install a user-friendly editor)

vi .bashrc (Set the Java path in your home path)

export JAVA_HOME=/home/username/jdk1.8.0_45
export PATH=$HOME/bin:$JAVA_HOME/bin:$PATH

source .bashrc (Execute the .bashrc file)

echo $JAVA_HOME (Check the Java path)

Step 2. Modify Hadoop Configuration Files


NAMENODE ----> core-site.xml
RESOURCEMANAGER ----> mapred-site.xml
SECONDARYNAMENODE ---->
DATANODE ----> slaves
NODEMANAGER ----> slaves & yarn-site.xml

vi etc/hadoop/core-site.xml

<property>
<name>fs.default.name</name>
<value>hdfs://localhost:50000</value>
</property>

vi etc/hadoop/yarn-site.xml

<property>
<name>yarn.nodemanager.aux-services</name> <value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<description>The hostname of the RM.</description>
<name>yarn.resourcemanager.hostname</name>
<value>localhost</value>
</property>
<property>
<description>The address of the applications manager interface in the RM.</description>
<name>yarn.resourcemanager.address</name>
<value>{yarn.resourcemanager.hostname}:8032</value>
</property>

vi etc/hadoop/hdfs-site.xml

<property>
<name>dfs.namenode.name.dir</name>
<value>/home/username/hadoop2-dir/namenode-dir</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/home/username/hadoop2-dir/datanode-dir</value>
</property>

vi etc/hadoop/mapred-site.xml

<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>

vi etc/hadoop/hadoop-env.sh
export JAVA_HOME=/home/username/jdk1.8.0_45

vi etc/hadoop/mapred-env.sh
export JAVA_HOME=/home/username/jdk1.8.0_45

vi etc/hadoop/yarn-env.sh
export JAVA_HOME=/home/username/jdk1.8.0_45

vi etc/hadoop/slaves
localhost

Step 3: Install the ssh key


(Generates, Manages and Converts Authentication keys)

sudo apt-get install openssh-server


ssh-keygen -t rsa
(Setup passwordless ssh to localhost and to slaves )
cd .ssh

ls
cat id_rsa.pub >> authorized_keys (copy the .pub)
(Copy the id_rsa.pub from NameNode to authorized_keys in all machines)

ssh localhost
(It should no longer ask for a password)

Step 4. Format NameNode


cd hadoop-2.9.1
bin/hadoop namenode -format (Your Hadoop File System Ready)

Step 5. Start All Hadoop Related Services
sbin/start-all.sh
(Starting daemons for DFS & YARN)
NameNode
DataNode
SecondaryNameNode
ResourceManager
NodeManager

(check the Browser Web GUI )


NameNode - http://localhost:50070/
Resource Manager - http://localhost:8088/
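To confirm that all five daemons came up, the jps command (shipped with the JDK) can also be run from the terminal; the process IDs shown below are only illustrative.

$ jps
2345 NameNode
2456 DataNode
2587 SecondaryNameNode
2690 ResourceManager
2801 NodeManager
2950 Jps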

Step 6. Stop All Hadoop and YARN Related Services


sbin/stop-all.sh

Result:
Thus installation of Hadoop is completed successfully.

EXPT.NO.2 Implementation of Hadoop file management tasks

AIM:
To implement the following file management tasks in Hadoop:
1. Adding files and directories
2. Retrieving files
3. Deleting Files

DESCRIPTION:- HDFS is a scalable distributed filesystem designed to scale to petabytes of


data while running on top of the underlying filesystem of the operating system. HDFS keeps
track of where the data resides in a network by associating the name of its rack (or network switch)
with the dataset. This allows Hadoop to efficiently schedule tasks to those nodes that contain
data, or which are nearest to it, optimizing bandwidth utilization. Hadoop provides a set of
command line utilities that work similarly to the Linux file commands, and serve as your
primary interface with HDFS.
We're going to have a look into HDFS by interacting with it from the command line.
We will take a look at the most common file management tasks in Hadoop, which include:
1. Adding files and directories to HDFS
2. Retrieving files from HDFS to local filesystem
3. Deleting files from HDFS

SYNTAX AND COMMANDS TO ADD, RETRIEVE AND DELETE DATA FROM HDFS

Step 1:Starting HDFS


Initially you have to format the configured HDFS file system, open namenode (HDFS server), and
execute the following command.
$ hadoop namenode -format

After formatting the HDFS, start the distributed file system. The following command will start the
namenode as well as the data nodes as cluster.
$ start-dfs.sh

Listing Files in HDFS

After loading the information in the server, we can find the list of files in a directory and the status of a file using ls. Given below is the syntax of ls; you can pass a directory or a filename as an argument.

$ $HADOOP_HOME/bin/hadoop fs -ls <args>

Inserting Data into HDFS


Assume we have data in a file called file.txt in the local system which ought to be saved in the HDFS file system. Follow the steps given below to insert the required file into the Hadoop file system.

Step-2: Adding Files and Directories to HDFS

$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/input

Transfer and store a data file from local systems to the Hadoop file system using the put command.

$ $HADOOP_HOME/bin/hadoop fs -put /home/file.txt /user/input

Step 3 :You can verify the file using ls command.

$ $HADOOP_HOME/bin/hadoop fs -ls /user/input

Step 4 Retrieving Data from HDFS


Assume we have a file in HDFS called outfile. Given below is a simple demonstration for retrieving
the required file from the Hadoop file system.

Initially, view the data from HDFS using cat command.

$ $HADOOP_HOME/bin/hadoop fs -cat /user/output/outfile
Get the file from HDFS to the local file system using get command.

$ $HADOOP_HOME/bin/hadoop fs -get /user/output/ /home/hadoop_tp/


Step-5: Deleting Files from HDFS
$ hadoop fs -rm file.txt
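To remove an entire HDFS directory (for example the /user/input directory created earlier), the -rm command accepts a recursive flag:

$ hadoop fs -rm -r /user/input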

Step 6:Shutting Down the HDFS


You can shut down the HDFS by using the following command.

$ stop-dfs.sh

Result:
Thus the file management tasks in Hadoop (adding, retrieving, and deleting files) have been completed successfully.

EXPT.NO.3 Matrix Multiplication with Hadoop Map Reduce

AIM: To Develop a MapReduce program to implement Matrix Multiplication.


Description:

In mathematics, matrix multiplication or the matrix product is a binary operation that produces a matrix from two matrices. The definition is motivated by linear equations and linear transformations on vectors, which have numerous applications in applied mathematics, physics, and engineering. In more detail, if A is an n × m matrix and B is an m × p matrix, their matrix product AB is an n × p matrix, in which the m entries across a row of A are multiplied with the m entries down a column of B and summed to produce an entry of AB. When two linear transformations are represented by matrices, the matrix product represents the composition of the two transformations.

Algorithm for Map Function.

a. for each element m_ij of M, produce the (key, value) pairs ((i,k), (M, j, m_ij)) for k = 1, 2, 3, ... up to the number of columns of N.

b. for each element n_jk of N, produce the (key, value) pairs ((i,k), (N, j, n_jk)) for i = 1, 2, 3, ... up to the number of rows of M.

c. return the set of (key, value) pairs, so that each key (i,k) has a list with the values (M, j, m_ij) and (N, j, n_jk) for all possible values of j.

Algorithm for Reduce Function.

d. for each key (i,k):

e. sort the values beginning with M by j into list_M and the values beginning with N by j into list_N; multiply m_ij and n_jk for the j-th value of each list.

f. sum up the products and return ((i,k), Σ_j m_ij × n_jk).

Step 1. Create a directory for the matrix files.

Then open matrix1.txt and matrix2.txt and put the matrix values into those text files (see the sample layout below).
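The mapper below expects every input line in the form matrix_name,row,column,value and reads the shared dimensions from a file cache.txt containing "rows_of_A,columns_of_B". The file contents below are only an illustrative 2 × 2 sample, not values prescribed by the experiment:

matrix1.txt:
A,0,0,1
A,0,1,2
A,1,0,3
A,1,1,4

matrix2.txt:
B,0,0,5
B,0,1,6
B,1,0,7
B,1,1,8

cache.txt:
2,2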

Step 2. Creating Mapper file for Matrix Multiplication.


#!/usr/bin/env python
import sys

# cache.txt holds the shared dimensions as "rows_of_A,columns_of_B"
cache_info = open("cache.txt").readlines()[0].split(",")
row_a, col_b = map(int, cache_info)

# each input line has the form: matrix_name,row,col,value
for line in sys.stdin:
    matrix_index, row, col, value = line.rstrip().split(",")
    if matrix_index == "A":
        # an element of A contributes to every column of its result row
        for i in xrange(0, col_b):
            key = row + "," + str(i)
            print "%s\t%s\t%s" % (key, col, value)
    else:
        # an element of B contributes to every row of its result column
        for j in xrange(0, row_a):
            key = str(j) + "," + col
            print "%s\t%s\t%s" % (key, row, value)
Step 3. Creating reducer file for Matrix Multiplication.

#!/usr/bin/env python
import sys
from operator import itemgetter

prev_index = None
value_list = []

for line in sys.stdin:
    curr_index, index, value = line.rstrip().split("\t")
    index, value = map(int, [index, value])
    if curr_index == prev_index:
        value_list.append((index, value))
    else:
        if prev_index:
            # sort by j so matching elements of A and B sit next to each other
            value_list = sorted(value_list, key=itemgetter(0))
            i = 0
            result = 0
            while i < len(value_list) - 1:
                if value_list[i][0] == value_list[i + 1][0]:
                    result += value_list[i][1] * value_list[i + 1][1]
                    i += 2
                else:
                    i += 1
            print "%s,%s" % (prev_index, str(result))
        prev_index = curr_index
        value_list = [(index, value)]

# emit the result for the last key
if curr_index == prev_index:
    value_list = sorted(value_list, key=itemgetter(0))
    i = 0
    result = 0
    while i < len(value_list) - 1:
        if value_list[i][0] == value_list[i + 1][0]:
            result += value_list[i][1] * value_list[i + 1][1]
            i += 2
        else:
            i += 1
    print "%s,%s" % (prev_index, str(result))
Step 4. Test the mapper locally using the cat command:
$ cat *.txt | python mapper.py
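Because Hadoop Streaming sorts the mapper output by key before it reaches the reducer, the whole job can also be simulated locally by inserting a sort between the two scripts (a quick sanity check, assuming mapper.py, reducer.py and the input/cache files are in the current directory):

$ cat *.txt | python mapper.py | sort | python reducer.py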

$ chmod +x ~/Desktop/mr/matrix/Mapper.py
$ chmod +x ~/Desktop/mr/matrix/Reducer.py
$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
> -input /user/cse/matrices/ \
> -output /user/cse/mat_output \
> -mapper ~/Desktop/mr/matrix/Mapper.py \
> -reducer ~/Desktop/mr/matrix/Reducer.py
Step 5: View the full output from the /user/cse/mat_output directory.

Result:
Thus the MapReduce program to implement Matrix Multiplication was executed successfully

EXPT.NO.4 Word count MapReduce program

AIM: To Develop a MapReduce program to calculate the frequency of a given word in a


given file.
Map Function – It takes a set of data and converts it into another set of data, where
individual elements are broken down into tuples (Key-Value pair).
Example – (Map function in Word Count)

Input
Set of data
Bus, Car, bus, car, train, car, bus, car, train, bus, TRAIN,BUS, buS, caR, CAR, car, BUS, TRAIN

Output
Convert into another set
of data(Key,Value)

(Bus,1), (Car,1), (bus,1), (car,1), (train,1), (car,1), (bus,1), (car,1), (train,1), (bus,1),
(TRAIN,1),(BUS,1), (buS,1), (caR,1), (CAR,1), (car,1), (BUS,1), (TRAIN,1)
Reduce Function – Takes the output from Map as an input and combines those data tuples into a smaller set of tuples.
Example – (Reduce function in Word Count)
Input Set of
Tuples(output of
Map function)
(Bus,1), (Car,1), (bus,1), (car,1), (train,1), (car,1), (bus,1), (car,1), (train,1),
(bus,1),(TRAIN,1),(BUS,1),
(buS,1),(caR,1),(CAR,1), (car,1), (BUS,1), (TRAIN,1)

Output Converts into smaller set of tuples


(BUS,7), (CAR,7), (TRAIN,4)

Workflow of MapReduce consists of 5 steps:
1. Splitting – The splitting parameter can be anything, e.g. splitting by space, comma, semicolon, or even by a new line ('\n').
2. Mapping – as explained above.
3. Intermediate splitting – the entire process runs in parallel on different clusters. In order to group them in the "Reduce Phase", data with the same KEY should be on the same cluster.
4. Reduce – it is essentially a group-by phase.
5. Combining – The last phase, where all the data (the individual result sets from each cluster) are combined together to form a result.
Now Let’s See the Word Count Program in Java

Step 1: Make sure Hadoop and Java are installed properly.

hadoop version
javac -version

Step 2. Create a directory on the Desktop named Lab and


inside it create two folders;
one called “Input” and the other called “tutorial_classes”. [You can
do this step using GUI normally or through terminal commands]
cd Desktop
mkdir Lab
mkdir Lab/Input
mkdir Lab/tutorial_classes
Step 3. Add the file attached with this document
“WordCount.java” in the directory Lab
Step 4. Add the file attached with this document “input.txt” in the
directory Lab/Input.
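The attached input.txt can be any comma-separated text. If the attachment is not available, a small sample mirroring the example at the start of this experiment works just as well (illustrative content only):

Bus, Car, bus, car, train, car, bus, car, train, bus, TRAIN, BUS, buS, caR, CAR, car, BUS, TRAIN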

Step 5. Type the following command to export the hadoop
classpath into bash.
export HADOOP_CLASSPATH=$(hadoop classpath)
Make sure it is now exported.
echo $HADOOP_CLASSPATH
Step 6. It is time to create these directories on HDFS rather than locally.
Type the following commands.
hadoop fs -mkdir /WordCountTutorial hadoop fs
-mkdir /WordCountTutorial/Input
hadoop fs -put Lab/Input/input.txt /WordCountTutorial/Input

Step 7. Go to localhost:9870 in the browser, open "Utilities → Browse File System" and you should see the directories and files we placed in the file system.

Step 8. Then, back to local machine where we will compile the WordCount.java file.
Assuming we are currently in the Desktop directory.
cd Lab
javac -classpath $HADOOP_CLASSPATH -d tutorial_classes WordCount.java

Put the output files in one jar file (There is a dot at the end)
jar -cvf WordCount.jar -C tutorial_classes .

Step 9. Now, we run the jar file on Hadoop.


hadoop jar WordCount.jar WordCount /WordCountTutorial/Input
/WordCountTutorial/Output

Step 10. Output the result:

hadoop dfs -cat /WordCountTutorial/Output/*

Program (WordCount.java):

package PackageDemo;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    public static void main(String[] args) throws Exception {
        Configuration c = new Configuration();
        String[] files = new GenericOptionsParser(c, args).getRemainingArgs();
        Path input = new Path(files[0]);
        Path output = new Path(files[1]);
        Job j = new Job(c, "wordcount");
        j.setJarByClass(WordCount.class);
        j.setMapperClass(MapForWordCount.class);
        j.setReducerClass(ReduceForWordCount.class);
        j.setOutputKeyClass(Text.class);
        j.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(j, input);
        FileOutputFormat.setOutputPath(j, output);
        System.exit(j.waitForCompletion(true) ? 0 : 1);
    }

    // Mapper: splits each line on commas and emits (WORD, 1) for every word
    public static class MapForWordCount extends Mapper<LongWritable, Text, Text, IntWritable> {
        public void map(LongWritable key, Text value, Context con) throws IOException, InterruptedException {
            String line = value.toString();
            String[] words = line.split(",");
            for (String word : words) {
                Text outputKey = new Text(word.toUpperCase().trim());
                IntWritable outputValue = new IntWritable(1);
                con.write(outputKey, outputValue);
            }
        }
    }

    // Reducer: sums the counts for each word
    public static class ReduceForWordCount extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text word, Iterable<IntWritable> values, Context con) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            con.write(word, new IntWritable(sum));
        }
    }
}
The output is stored in the HDFS output directory (here /WordCountTutorial/Output/part-r-00000).

OUTPUT:

Result:
Thus the Word Count Map Reduce program to understand Map Reduce Paradigm was successfully
executed.

EXPT.NO.5 Installation of Hive

AIM: To install Hive along with practice examples.

Step 1:

Download and unzip Hive


=============================
wget https://downloads.apache.org/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz
tar xzf apache-hive-3.1.2-bin.tar.gz
step 2:
Edit .bashrc file
========================
sudo nano .bashrc
export HIVE_HOME=/home/hadoop/apache-hive-3.1.2-bin
export PATH=$PATH:$HIVE_HOME/bin

step 3:
source ~/.bashrc
step 4:
Edit hive-config.sh file
====================================

sudo nano $HIVE_HOME/bin/hive-config.sh
export HADOOP_HOME=/home/cse/hadoop-3.3.6

step 5:
Create Hive directories in HDFS
===================================
hdfs dfs -mkdir /tmp
hdfs dfs -chmod g+w /tmp
hdfs dfs -mkdir -p /user/hive/warehouse
hdfs dfs -chmod g+w /user/hive/warehouse

step 6:

Fixing guava problem – Additional step


=================

rm $HIVE_HOME/lib/guava-19.0.jar
cp $HADOOP_HOME/share/hadoop/hdfs/lib/guava-27.0-jre.jar $HIVE_HOME/lib/

step 7: Configure the hive-site.xml File (Optional)

Use the following command to locate the correct file:
cd $HIVE_HOME/conf
List the files contained in the folder using the ls command.

Use the hive-default.xml.template to create the hive-site.xml file:

cp hive-default.xml.template hive-site.xml
Access the hive-site.xml file using the nano text editor:
sudo nano hive-site.xml

Step 8: Initiate Derby Database


============================
$HIVE_HOME/bin/schematool -dbType derby -initSchema

Installing Hive with an example

Create Database from Hive Beeline shell


1. Create database database_name; Ex:
> create database Emp;
> use Emp;
> create table emp.employee(sno int, user String, city String) row format delimited fields terminated by '\n' stored as textfile;
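With the table in place, a couple of rows can be inserted and read back as a minimal smoke test of Hive and the Derby metastore; the sample values are illustrative only:

> insert into emp.employee values (1, 'arun', 'chennai');
> insert into emp.employee values (2, 'divya', 'madurai');
> select * from emp.employee;
> select count(*) from emp.employee;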

Show databases:
> show databases;

Result:
Thus installation of Hive is completed successfully.

EXPT.NO.6 Installation of HBase and Thrift

AIM:
To Install HBase and thrift on Ubuntu 18.04 HBase in Standalone Mode

Procedure:
Pre-requisite:

Ubuntu 16.04 or higher installed on a virtual machine.


step-1: Make sure that Java is installed on your machine. To verify, run: java -version

If any error occurs while executing this command, then Java is not installed on your system. To install Java:

sudo apt install openjdk-8-jdk -y


Step-2:Download Hbase
wget https://dlcdn.apache.org/hbase/2.5.5/hbase-2.5.5-bin.tar.gz

Step-3: Extract the hbase-2.5.5-bin.tar.gz file by using the command tar xvf hbase-2.5.5-bin.tar.gz

step-4: Go to the hbase-2.5.5/conf folder and open the hbase-env.sh file

export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64

step-5 : Edit .bashrc file

and then open .bashrc file and mention HBASE_HOME path as shown in below

export HBASE_HOME=/home/prasanna/hbase-2.5.5

here you can change name according to your local machine name

eg : export HBASE_HOME=/home/<your_machine_name>/hbase-2.5.5

export PATH=$PATH:$HBASE_HOME/bin

Note: make sure that the hbase-2.5.5 folder is in the home directory before setting the HBASE_HOME path; if not, move the hbase-2.5.5 folder to the home directory.

step-6 : Add properties in the hbase-site.xml

put the below property between the <configuration></configuration> tags


<property>
<name>hbase.rootdir</name>
<value>file:///home/prasanna/HBASE/hbase</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/home/prasanna/HBASE/zookeeper</value>
</property>
step-7: Edit the hosts file in the /etc/ folder (for example: sudo nano /etc/hosts).

By default, line 2 maps the hostname to 127.0.1.1; change it to 127.0.0.1 (in the second line only).

step-8: Starting HBase. Go to the hbase-2.5.5/bin folder and start HBase (./start-hbase.sh).

After this, run the jps command to ensure that HBase (HMaster) is running.

Open http://localhost:16010 to see the HBase web UI.
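The experiment also calls for Thrift. HBase ships with a built-in Thrift server, so once HBase is running it can be started from the same bin folder (it listens on port 9090 by default); this is a minimal way to bring it up rather than a full Thrift client setup:

./hbase thrift start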

step-9: accessing hbase shell by running ./hbase shell command

EXAMPLE
1) To create a table
syntax:
create 'Table_Name','col_fam_1','col_fam_2',...,'col_fam_n'
code :
create 'aamec','dept','year'

2) List All Tables


code :
list

3) Insert data

syntax:
put 'table_name','row_key','column_family:attribute','value'

here row_key is a unique key used to retrieve the data

code :
this data will enter data into the dept column family
put 'aamec','cse','dept:studentname','prasanna'
put 'aamec','cse','dept:year','third'
put 'aamec','cse','dept:section','A'

This data will enter data into the year column family

put 'aamec','cse','year:joinedyear','2021'
put 'aamec','cse','year:finishingyear','2025'

4) Scan Table

Lists the rows of a table, similar to a SELECT on the whole table in an RDBMS.

syntax:
scan 'table_name'
code:
scan 'aamec'

5) To get specific data

syntax:
get 'table_name','row_key'[,'column_family:attribute']
code :
get 'aamec','cse'

6) Update table value

The same put command is used to update a table value. If the row key is already present in the database, the data is updated with the new value; if it is not present, a new row is created with the given row key.

Previously the value for dept:section in row cse was A; after running the command below, the value is changed to B.
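For example (reusing the aamec table created above):

put 'aamec','cse','dept:section','B'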

7)To Delete Data

syntax:
delete 'table_name','row_key','column_family:attribute'
code :
delete 'aamec','cse','year:joinedyear'

8) Delete Table

first we need to disable the table before dropping it


To Disable:
syntax:
disable 'table_name'
code:
disable 'aamec'
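After disabling, the table can be dropped:

To Drop:
syntax:
drop 'table_name'
code:
drop 'aamec'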

Result:

Thus HBase was successfully installed on Ubuntu.

EXPT.NO.7 Import and export data from various databases

Aim:
To import and export data between MySQL and Hive using Sqoop.

Pre-requisite
Hadoop and Java
MySQL
Hive
SQOOP

Step 1:To start hdfs

Step 2: MySQL Installation

sudo apt install mysql-server ( use this command to install MySQL server)

COMMANDS:
~$ sudo su

After this, enter your Linux user password; the root shell opens, and from there no separate authentication is needed for MySQL.

~root$ mysql

Creating user profiles and grant them permissions:

Mysql> CREATE USER 'bigdata'@'localhost' IDENTIFIED BY 'bigdata';

Mysql> grant all privileges on *.* to bigdata@localhost;

Note: This step is not required if you just use the root user to make CRUD operations in
the MySQL
Mysql> CREATE USER 'bigdata'@'127.0.0.1' IDENTIFIED BY 'bigdata';
Mysql> grant all privileges on *.* to 'bigdata'@'127.0.0.1';

Note: Here, *.* means that the user we create has all the privileges on all the tables of all the
databases.
Now, we have created user profiles which will be used to make CRUD operations in the mysql
Step 3: Create a database and table and insert data.
Example:
create database Employe;
create table Employe.Emp(author_name varchar(65), total_no_of_articles int, phone_no int, address varchar(65));
insert into Employe.Emp values('Rohan', 10, 123456789, 'Lucknow');
Step 4: Create a database and table in Hive where the data should be imported.
create table geeks_hive_table(name string, total_articles int, phone_no int, address string) row format delimited fields terminated by ',';
Step 5: SQOOP INSTALLATION:

After downloading the sqoop , go to the directory where we downloaded the sqoop and then extract
it using the following command :

$ tar -xvf sqoop-1.4.4.bin__hadoop-2.0.4-alpha.tar.gz

Then enter into the super user : $ su

Next to move that to the usr/lib which requires a super user privilege

$ mv sqoop-1.4.4.bin__hadoop-2.0.4-alpha /usr/lib/sqoop

Then exit : $ exit

Go to .bashrc: $ sudo nano .bashrc, and then add the following:

export SQOOP_HOME=/usr/lib/sqoop
export PATH=$PATH:$SQOOP_HOME/bin

$ source ~/.bashrc
Then configure the sqoop, goto the directory of the config folder of sqoop_home and then move the
contents of template file to the environment file.
$ cd $SQOOP_HOME/conf
$ mv sqoop-env-template.sh sqoop-env.sh
Then open the sqoop-environment file and then add the following,
export HADOOP_COMMON_HOME=/usr/local/Hadoop
export HADOOP_MAPRED_HOME=/usr/local/hadoop
Note: Here we add the path of the Hadoop libraries and files, and it may differ from the path mentioned here. So, add the Hadoop path based on your installation.

Step 6: Download and Configure mysql-connector-java:


We can download mysql-connector-java-5.1.30.tar.gz file from the following link.

Next, to extract the file and place it to the lib folder of sqoop
$ tar -zxf mysql-connector-java-5.1.30.tar.gz
$ su
$ cd mysql-connector-java-5.1.30
$ mv mysql-connector-java-5.1.30-bin.jar /usr/lib/sqoop/lib
Note: This library file is very important; don't skip this step because it contains the libraries needed to connect to MySQL databases through JDBC.
Verify sqoop: sqoop-version
Step 7: Hive database creation
hive> create database sqoop_example;
hive> use sqoop_example;
hive> create table sqoop(usr_name string, no_ops int, ops_names string);
Hive commands are much like MySQL commands. Here, we just create the structure to store the data which we want to import into Hive.

Step 8: Importing data from MySQL to Hive:
sqoop import --connect \
jdbc:mysql://127.0.0.1:3306/database_name_in_mysql \
--username root --password cloudera \
--table table_name_in_mysql \
--hive-import --hive-table database_name_in_hive.table_name_in_hive \
--m 1
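The reverse direction works with sqoop export, which reads files from an HDFS directory and writes them into an existing MySQL table. The directory, table name and credentials below are placeholders that must match your own setup (Hive warehouse paths and field delimiters vary):

sqoop export --connect \
jdbc:mysql://127.0.0.1:3306/database_name_in_mysql \
--username root --password cloudera \
--table table_name_in_mysql \
--export-dir /user/hive/warehouse/database_name_in_hive.db/table_name_in_hive \
--input-fields-terminated-by ',' \
--m 1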

OUTPUT:

Result:
Thus data was imported and exported between MySQL and Hive successfully.

EXPT.NO.8 Installation of Apache PIG

Aim:
To install Apache Pig on Ubuntu.

Prerequisites

It is essential that you have Hadoop and Java installed on your system before you go for Apache Pig.
Therefore, prior to installing Apache Pig, install Hadoop and Java by following the steps given in the following
link −

Download Apache Pig

Step 1:

First of all, download the latest version of Apache Pig from the following website − https://pig.apache.org/

Open the homepage of Apache Pig website. Under the section News, click on the link release page as shown
in the following snapshot.

Step 2

On clicking the specified link, you will be redirected to the Apache Pig Releases page. On this page, under the
Download section, you will have two links, namely, Pig 0.8 and later and Pig 0.7 and before. Click on the
link Pig 0.8 and later, then you will be redirected to the page having a set of mirrors.
Step 3

Choose and click any one of these mirrors as shown below.

Step 4

These mirrors will take you to the Pig Releases page. This page contains various versions of Apache Pig. Click
the latest version among them.

Step 5

Within these folders, you will have the source and binary files of Apache Pig in various distributions. Download the tar files of the source and binary files of Apache Pig 0.15: pig-0.15.0-src.tar.gz and pig-0.15.0.tar.gz.

Install Apache Pig

After downloading the Apache Pig software, install it in your Linux environment by following the steps given
below.

Step 1

Create a directory with the name Pig in the same directory where the installation directories of Hadoop, Java,
and other software were installed. (In our tutorial, we have created the Pig directory in the user named
Hadoop).

$ mkdir Pig
Step 2

Extract the downloaded tar files as shown below.

$ cd Downloads/
$ tar zxvf pig-0.15.0-src.tar.gz
$ tar zxvf pig-0.15.0.tar.gz
Step 3

Move the content of the extracted pig-0.15.0-src directory to the Pig directory created earlier as shown below.

$ mv pig-0.15.0-src/* /home/Hadoop/Pig/
Configure Apache Pig

After installing Apache Pig, we have to configure it. To configure, we need to edit two files − bashrc and
pig.properties.

.bashrc file

In the .bashrc file, set the following variables −

• PIG_HOME folder to the Apache Pig's installation folder,
• PATH environment variable to the bin folder, and
• PIG_CLASSPATH environment variable to the etc (configuration) folder of your Hadoop installations (the directory that contains the core-site.xml, hdfs-site.xml and mapred-site.xml files).

export PIG_HOME=/home/Hadoop/Pig
export PATH=$PATH:/home/Hadoop/Pig/bin
export PIG_CLASSPATH=$HADOOP_HOME/conf
pig.properties file

In the conf folder of Pig, we have a file named pig.properties. In the pig.properties file, you can set various
parameters as given below.

pig -h properties

The following properties are supported −

Logging:
verbose = true|false; default is false. This property is the same as the -v switch.
brief = true|false; default is false. This property is the same as the -b switch.
debug = OFF|ERROR|WARN|INFO|DEBUG; default is INFO. This property is the same as the -d switch.
aggregate.warning = true|false; default is true. If true, prints the count of warnings of each type rather than logging each warning.

Performance tuning: pig.cachedbag.memusage=<mem fraction>; default is 0.2 (20% of all memory).


Note that this memory is shared across all large bags used by the application.
pig.skewedjoin.reduce.memusage=<mem fraction>; default is 0.3 (30% of all memory).
Specifies the fraction of heap available for the reducer to perform the join.
pig.exec.nocombiner = true|false; default is false.
Only disable combiner as a temporary workaround for problems.
opt.multiquery = true|false; multiquery is on by default.
Only disable multiquery as a temporary workaround for problems.
opt.fetch=true|false; fetch is on by default.
Scripts containing Filter, Foreach, Limit, Stream, and Union can be dumped without MR jobs.
pig.tmpfilecompression = true|false; compression is off by default.
Determines whether output of intermediate jobs is compressed.
pig.tmpfilecompression.codec = lzo|gzip; default is gzip.
Used in conjunction with pig.tmpfilecompression. Defines compression type.
pig.noSplitCombination = true|false. Split combination is on by default.
Determines if multiple small files are combined into a single map.

pig.exec.mapPartAgg = true|false. Default is false.


Determines if partial aggregation is done within map phase, before records are sent to combiner.
pig.exec.mapPartAgg.minReduction=<min aggregation factor>. Default is 10.
If the in-map partial aggregation does not reduce the output num records by this factor, it gets disabled.

Miscellaneous:
exectype = mapreduce|tez|local; default is mapreduce. This property is the same as the -x switch.
pig.additional.jars.uris=<comma separated list of jars>. Used in place of the register command.
udf.import.list=<comma separated list of imports>. Used to avoid package names in UDFs.
stop.on.failure = true|false; default is false. Set to true to terminate on the first error.
pig.datetime.default.tz=<UTC time offset>, e.g. +08:00. Default is the default timezone of the host. Determines the timezone used to handle the datetime datatype and UDFs.
Additionally, any Hadoop property can be specified.
Verifying the Installation

Verify the installation of Apache Pig by typing the version command. If the installation is successful, you will
get the version of Apache Pig as shown below.

$ pig -version

Apache Pig version 0.15.0 (r1682971)


compiled Jun 01 2015, 11:44:35
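As a further smoke test, Pig can be started in local mode (pig -x local) and a short Pig Latin script run from the grunt shell. The file name and schema below are only an assumed example, not part of the original manual:

grunt> A = LOAD 'student.txt' USING PigStorage(',') AS (id:int, name:chararray);
grunt> DUMP A;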

Result
Thus the installation of Apache Pig was completed successfully.
