BIG DATA ANALYTICS (CCS334)
LAB MANUAL
Regulation : 2021
Semester : VI
Branch : CSE
Prepared By: R. JAAZIELIAH, AP/CSE
Approved By: HOD / CSE
Institution Vision
To impart quality technical education emphasizing innovations and research with social and ethical
values.
Institution Mission
Department Vision
To provide technical education in the field of computer science and engineering by imparting
employability, research capabilities, entrepreneurship with human values.
Department Mission
Program Educational Objectives (PEOs)
PEO 1: The graduates will excel in their chosen technical domains in Computer Science and Engineering.
PEO 2: The graduates will solve real world problems in Computer Science and Engineering by applying
logical and coding skills considering ethical values.
PEO 3: The graduates will work enthusiastically as a team and engage in lifelong learning.
Program Specific Outcomes (PSOs)
PSO1: Analyze, design and develop computing solutions by applying foundational concepts of Computer
Science and Engineering.
PSO2: Apply software engineering principles and practices for developing quality software for scientific
and business applications.
PSO3: Adapt to emerging Information and Communication Technologies (ICT) to innovate ideas and
solutions to existing/novel problems.
CCS334 BIG DATA ANALYTICS                                          L T P C
                                                                   2 0 2 3
COURSE OBJECTIVES:
To understand big data.
To learn and use NoSQL big data management.
To learn mapreduce analytics using Hadoop and related tools.
To work with map reduce applications
To understand the usage of Hadoop related tools for Big Data Analytics
Software Requirements:
Cassandra, Hadoop, Java, Pig, Hive and HBase.
COURSE OUTCOMES:
After the completion of this course, students will be able to:
CO1: Summarize big data and use cases from selected business domains.
CO2: Explain NoSQL big data management.
CO3: Experiment with Hadoop and HDFS installation and configuration
CO4: Examine map-reduce analytics using Hadoop.
CO5: Test Hadoop-related tools such as HBase, Cassandra, Pig, and Hive for big data analytics.
OBJECTIVES:
CO1: Illustrate big data and use cases from selected business domains
CO2: Explain NoSQL big data management
CO3: Solve map-reduce analytics using Hadoop
CO4: Demonstrate the knowledge of big data analytics and implement different file management
tasks in Hadoop.
CO5: Make use of Hadoop-related tools such as HBase, Cassandra, Pig, and Hive for big data
analytics.
CO - PO MATRICES OF COURSE
Mapping of Course Outcomes with Program Outcomes & Program Specific Outcomes:
CO No PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12 PSO1 PSO2 PSO3
1 3 3 3 3 3 - - - 2 2 3 1 1 3 3
2 3 3 2 3 2 - - - 2 2 3 3 2 3 2
3 3 3 3 2 3 - - - 2 2 1 2 2 3 3
4 2 3 3 3 3 - - - 2 2 3 2 3 3 2
5 3 3 3 3 3 - - - 3 1 3 2 3 2 3
Average 2.8 3 2.8 2.8 2.8 - - - 2.2 1.8 2.6 2 2.2 2.8 2.6
RUBRICS FOR INTERNAL LABORATORY ASSESSMENT
List of Experiments
S. NO    NAME OF THE EXPERIMENT
1        Installation and Configuration of Hadoop
2        Implementation of Hadoop file management tasks
3        Matrix Multiplication with Hadoop MapReduce
4        Word count MapReduce program
5        Installation of Hive
6        Installation of HBase and Thrift
7        Import and export data from various databases
8        Installation of Apache Pig
EXP.NO.1 Installation and Configuration of Hadoop
AIM:
To install and configure Hadoop on a single node (pseudo-distributed mode).
PROCEDURE:
export JAVA_HOME=/home/username/jdk1.8.0_45
export PATH=$HOME/bin:$JAVA_HOME/bin:$PATH
vi etc/hadoop/core-site.xml
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:50000</value>
</property>
vi etc/hadoop/yarn-site.xml
<property>
<name>yarn.nodemanager.aux-services</name> <value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<description>The hostname of the RM.</description>
<name>yarn.resourcemanager.hostname</name>
<value>localhost</value>
</property>
<property>
<description>The address of the applications manager interface in the RM.</description>
<name>yarn.resourcemanager.address</name>
<value>${yarn.resourcemanager.hostname}:8032</value>
</property>
vi etc/hadoop/hdfs-site.xml
<property>
<name>dfs.namenode.name.dir</name>
<value>/home/username/hadoop2-dir/namenode-dir</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/home/username/hadoop2-dir/datanode-dir</value>
</property>
vi etc/hadoop/mapred-site.xml
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
vi etc/hadoop/hadoop-env.sh
export JAVA_HOME=/home/username/jdk1.8.0_45
vi etc/hadoop/mapred-env.sh
export JAVA_HOME=/home/username/jdk1.8.0_45
vi etc/hadoop/yarn-env.sh
export JAVA_HOME=/home/username/jdk1.8.0_45
vi etc/hadoop/slaves
localhost
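Set up passwordless SSH to localhost. If the RSA key pair does not exist yet, generate it first (a minimal sketch using the default OpenSSH key names):
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
cd ~/.ssh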
ls
cat id_rsa.pub >> authorized_keys (copy the .pub)
(Copy the id_rsa.pub from NameNode to authorized_keys in all machines)
ssh localhost
(It should log in without asking for a password.)
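Before the daemons are started for the first time, the NameNode is formatted once (a standard one-time step; run it from the Hadoop installation directory):
bin/hdfs namenode -format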
Step 5. Start All Hadoop Related Services
sbin/start-all.sh
(Starting the daemons for DFS & YARN)
NameNode
DataNode
SecondaryNameNode
ResourceManager
NodeManager
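The running daemons listed above can be verified with the JDK's jps command:
jps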
Result:
Thus installation of Hadoop is completed successfully.
EXPT.NO.2 Implementation of Hadoop file management tasks
AIM:
To implement the following file management tasks in Hadoop:
1. Adding files and directories
2. Retrieving files
3. Deleting Files
SYNTAX AND COMMANDS TO ADD, RETRIEVE AND DELETE DATA FROM HDFS
After formatting the HDFS, start the distributed file system. The following command will start the
namenode as well as the data nodes as a cluster.
$ start-dfs.sh
After loading the information in the server, we can find the list of files in a directory and the status of a
file using ls. Given below is the syntax of ls; a directory or a file name can be passed as an argument.
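For example (a sketch; any HDFS path can be given):
$ $HADOOP_HOME/bin/hadoop fs -ls <path>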
Transfer and store a data file from the local system to the Hadoop file system using the put command, for example as sketched below.
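A sketch of the put usage (the local file name and target directory are illustrative):
$ $HADOOP_HOME/bin/hadoop fs -mkdir -p /user/input
$ $HADOOP_HOME/bin/hadoop fs -put /home/username/file.txt /user/input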
View the contents of a file in HDFS using the cat command.
$ $HADOOP_HOME/bin/hadoop fs -cat /user/output/outfile
Get the file from HDFS to the local file system using the get command; a file can also be deleted, as sketched below.
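A sketch of the get and rm usage (the rm command covers task 3 of the aim; the local target path is illustrative):
$ $HADOOP_HOME/bin/hadoop fs -get /user/output/outfile /home/username/
$ $HADOOP_HOME/bin/hadoop fs -rm /user/output/outfile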
$ stop-dfs.sh
Result:
Thus the file management tasks in Hadoop (adding, retrieving and deleting files) were completed successfully.
EXPT.NO.3 Matrix Multiplication with Hadoop MapReduce
AIM:
To write a MapReduce (Hadoop streaming) program that multiplies two matrices.
produce (key, value) pairs ((i, k), (N, j, njk)) for i = 1, 2, 3, ... up to the number of rows of M.
c. Return the set of (key, value) pairs in which each key (i, k) has a list of values
(M, j, mij) and (N, j, njk) for all possible values of j.
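The mapper is not listed in the steps below, so a minimal Mapper.py sketch consistent with the description above and with the reducer that follows is given here. It assumes each input line has the form matrix_name,row,column,value and that the hard-coded dimensions match the actual input; adjust them to your data. It uses the same Python 2 streaming style as the reducer.
#!/usr/bin/env python
# Illustrative sketch: reads lines of the form  matrix,row,col,value
# (e.g. M,0,1,5) and emits "i,k<TAB>j<TAB>value" records for the reducer.
import sys

M_ROWS = 2   # number of rows of M (assumed; set to your input)
N_COLS = 2   # number of columns of N (assumed; set to your input)

for line in sys.stdin:
    matrix, row, col, value = line.rstrip().split(",")
    if matrix == "M":
        # m[i][j] is needed by every output cell (i, k)
        for k in range(N_COLS):
            print "%s,%s\t%s\t%s" % (row, k, col, value)
    else:
        # n[j][k] is needed by every output cell (i, k)
        for i in range(M_ROWS):
            print "%s,%s\t%s\t%s" % (i, col, row, value)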
Reducer.py:
#!/usr/bin/env python
import sys
from operator import itemgetter

prev_index = None
value_list = []

for line in sys.stdin:
    curr_index, index, value = line.rstrip().split("\t")
    index, value = map(int, [index, value])
    if curr_index == prev_index:
        value_list.append((index, value))
    else:
        if prev_index:
            # multiply matching pairs (same j from M and N) and sum them
            value_list = sorted(value_list, key=itemgetter(0))
            i = 0
            result = 0
            while i < len(value_list) - 1:
                if value_list[i][0] == value_list[i + 1][0]:
                    result += value_list[i][1] * value_list[i + 1][1]
                    i += 2
                else:
                    i += 1
            print "%s,%s" % (prev_index, str(result))
        prev_index = curr_index
        value_list = [(index, value)]

# flush the last key after the input is exhausted
if curr_index == prev_index:
    value_list = sorted(value_list, key=itemgetter(0))
    i = 0
    result = 0
    while i < len(value_list) - 1:
        if value_list[i][0] == value_list[i + 1][0]:
            result += value_list[i][1] * value_list[i + 1][1]
            i += 2
        else:
            i += 1
    print "%s,%s" % (prev_index, str(result))
Step 4. To test the mapper locally using the cat command:
$cat *.txt |python mapper.py
$ chmod +x ~/Desktop/mr/matrix/Mapper.py
$ chmod +x ~/Desktop/mr/matrix/Reducer.py
$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
> -input /user/cse/matrices/ \
> -output /user/cse/mat_output \
> -mapper ~/Desktop/mr/matrix/Mapper.py \
> -reducer ~/Desktop/mr/matrix/Reducer.py
Step 5: To view this full output
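For example (a sketch; the exact part-file name may vary):
$ hadoop fs -cat /user/cse/mat_output/part-*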
Result:
Thus the MapReduce program to implement Matrix Multiplication was executed successfully
EXPT.NO.4 Word count MapReduce program
AIM:
To develop a MapReduce program in Java that counts the number of occurrences of each word in a given text input.
Map Function – takes a set of data and converts it into another set of data (Key, Value) pairs.
Example – (Map function in Word Count)
Input (set of data):
Bus, Car, bus, car, train, car, bus, car, train, bus, TRAIN, BUS, buS, caR, CAR, car, BUS, TRAIN
Output (set of (Key, Value) pairs):
(Bus,1), (Car,1), (bus,1), (car,1), (train,1), (car,1), (bus,1), (car,1), (train,1), (bus,1),
(TRAIN,1), (BUS,1), (buS,1), (caR,1), (CAR,1), (car,1), (BUS,1), (TRAIN,1)
Reduce Function – takes the output from Map as an input and combines those
data tuples into a smaller set of tuples.
Example – (Reduce function in Word Count)
Input (set of tuples, i.e. the output of the Map function):
(Bus,1), (Car,1), (bus,1), (car,1), (train,1), (car,1), (bus,1), (car,1), (train,1), (bus,1),
(TRAIN,1), (BUS,1), (buS,1), (caR,1), (CAR,1), (car,1), (BUS,1), (TRAIN,1)
Output (combined tuples; the program below converts every word to upper case before counting):
(BUS,7), (CAR,7), (TRAIN,4)
Workflow of MapReduce consists of 5 steps:
1. Splitting – the splitting parameter can be anything, e.g. splitting by space, comma, semicolon, or even by a new line ('\n').
2. Mapping – as explained above.
3. Intermediate splitting – the entire process runs in parallel on different clusters. In order to group them in the Reduce phase, records with the same KEY should be on the same cluster.
4. Reduce – it is essentially a group-by phase.
5. Combining – the last phase, where all the data (the individual result sets from each cluster) are combined together to form the final result.
Now let's see the Word Count program in Java (the full listing is given after the steps below).
Step 5. Type the following command to export the hadoop
classpath into bash.
export HADOOP_CLASSPATH=$(hadoop classpath)
Make sure it is now exported.
echo $HADOOP_CLASSPATH
Step 6. It is time to create these directories on HDFS rather than locally.
Type the following commands.
hadoop fs -mkdir /WordCountTutorial
hadoop fs -mkdir /WordCountTutorial/Input
hadoop fs -put Lab/Input/input.txt /WordCountTutorial/Input
Step 7. Go to localhost:9870 in the browser, open "Utilities → Browse the file system", and
you should see the directories and files we placed in the file system.
Step 8. Then, back on the local machine, compile the WordCount.java file.
Assuming we are currently in the Desktop directory (create the tutorial_classes output directory first if it does not already exist: mkdir tutorial_classes):
cd Lab
javac -classpath $HADOOP_CLASSPATH -d tutorial_classes WordCount.java
Put the output files in one jar file (There is a dot at the end)
jar -cvf WordCount.jar -C tutorial_classes .
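Step 9. Run the job (a hedged sketch; the output directory /r_output matches the path referenced after the program listing and must not already exist):
hadoop jar WordCount.jar WordCount /WordCountTutorial/Input /r_output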
Step 10. Output the result:
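For example (a sketch; the part-file name may differ):
hadoop fs -cat /r_output/part-*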
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    public static void main(String[] args) throws Exception {
        Configuration c = new Configuration();
        String[] files = new GenericOptionsParser(c, args).getRemainingArgs();
        Path input = new Path(files[0]);
        Path output = new Path(files[1]);
        Job j = new Job(c, "wordcount");
        j.setJarByClass(WordCount.class);
        j.setMapperClass(MapForWordCount.class);
        j.setReducerClass(ReduceForWordCount.class);
        j.setOutputKeyClass(Text.class);
        j.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(j, input);
        FileOutputFormat.setOutputPath(j, output);
        System.exit(j.waitForCompletion(true) ? 0 : 1);
    }

    // Mapper: splits each line on commas and emits (WORD, 1) for every word
    public static class MapForWordCount extends Mapper<LongWritable, Text, Text, IntWritable> {
        public void map(LongWritable key, Text value, Context con)
                throws IOException, InterruptedException {
            String line = value.toString();
            String[] words = line.split(",");
            for (String word : words) {
                Text outputKey = new Text(word.toUpperCase().trim());
                IntWritable outputValue = new IntWritable(1);
                con.write(outputKey, outputValue);
            }
        }
    }

    // Reducer: sums the counts for each word
    public static class ReduceForWordCount extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text word, Iterable<IntWritable> values, Context con)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            con.write(word, new IntWritable(sum));
        }
    }
}
The output is stored in /r_output/part-00000
OUTPUT:
Result:
Thus the Word Count MapReduce program, illustrating the MapReduce paradigm, was executed successfully.
EXPT.NO.5 Installation of Hive
AIM:
To install Apache Hive and configure it to run on top of Hadoop.
PROCEDURE:
Step 1:
step 3:
source ~/.bashrc
step 4:
Edit hive-config.sh file
====================================
sudo nano $HIVE_HOME/bin/hive-config.sh
export HADOOP_HOME=/home/cse/hadoop-3.3.6
step 5:
Create Hive directories in HDFS
===================================
hdfs dfs -mkdir /tmp
hdfs dfs -chmod g+w /tmp
hdfs dfs -mkdir -p /user/hive/warehouse
hdfs dfs -chmod g+w /user/hive/warehouse
step 6: Fix the Guava library version mismatch
rm $HIVE_HOME/lib/guava-19.0.jar
cp $HADOOP_HOME/share/hadoop/hdfs/lib/guava-27.0-jre.jar $HIVE_HOME/lib/
Use the hive-default.xml.template in the $HIVE_HOME/conf directory to create the hive-site.xml file:
cd $HIVE_HOME/conf
cp hive-default.xml.template hive-site.xml
Access the hive-site.xml file using the nano text editor:
sudo nano hive-site.xml
Running Hive with an example
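Before the first launch, the metastore schema is typically initialized and the Hive shell started. A minimal sketch, assuming the default embedded Derby metastore:
$HIVE_HOME/bin/schematool -dbType derby -initSchema
$HIVE_HOME/bin/hive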
Show Databases
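A quick check from the Hive shell; the default database should appear in the list:
hive> show databases;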
Result:
Thus installation of Hive is completed successfully.
EXPT.NO.6 Installation of HBase and Thrift
AIM:
To install HBase and Thrift on Ubuntu 18.04, with HBase in standalone mode.
Procedure:
Pre-requisite:
Java must be installed. Check with the command java -version; if any error occurs while executing this command, then Java is not installed on your system.
Step-3: Extract the hbase-2.5.5-bin.tar.gz file using the command: tar xvf hbase-2.5.5-bin.tar.gz
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64
and then open the .bashrc file and add the HBASE_HOME path as shown below:
export HBASE_HOME=/home/prasanna/hbase-2.5.5
Here you can change the path according to your local user name,
e.g. export HBASE_HOME=/home/<your_user_name>/hbase-2.5.5
step-6: Add properties to the hbase-site.xml file, for example as sketched below.
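The exact values depend on the machine; a typical standalone configuration sets the HBase root directory and the ZooKeeper data directory (a hedged sketch, local paths illustrative):
<property>
<name>hbase.rootdir</name>
<value>file:///home/prasanna/hbase-2.5.5/hbase-data</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/home/prasanna/hbase-2.5.5/zookeeper</value>
</property>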
step-7: Edit the hosts file (/etc/hosts): by default, line 2 contains the IP 127.0.1.1; change it to 127.0.0.1 (in the second line only).
step-8: Starting HBase
Go to the hbase-2.5.5/bin folder and start HBase as sketched below.
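The standalone daemon is started with the start-hbase.sh script shipped in the bin folder:
cd hbase-2.5.5/bin
./start-hbase.sh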
step-9: accessing hbase shell by running ./hbase shell command
EXAMPLE
1) To create a table
syntax:
create 'Table_Name','col_fam_1','col_fam_2', ... ,'col_fam_n'
code :
create 'aamec','dept','year'
3) To insert data
syntax:
put 'table_name','row_key','column_family:attribute','value'
code :
The following commands enter data into the dept column family:
put 'aamec','cse','dept:studentname','prasanna'
put 'aamec','cse','dept:year','third'
put 'aamec','cse','dept:section','A'
The following command enters data into the year column family, as sketched below.
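(A sketch; the attribute year:joinedyear matches the delete example further below, and the value is illustrative.)
put 'aamec','cse','year:joinedyear','2021'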
4) Scan Table
syntax:
scan 'table_name'
code:
scan 'aamec'
syntax:
get 'table_name','row_key'[,'column_family:attribute']   (the column family:attribute argument is optional)
code :
get 'aamec','cse'
The same put command is used to update a table value: if the row key is already present in the
database, the command updates the stored value; if it is not present, a new row is created with the
given row key.
Previously the value for dept:section in row cse was A; after running the command sketched below, the value is
changed to B.
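(The update command implied above, reusing the table, row key and column from the earlier example:)
put 'aamec','cse','dept:section','B'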
Delete
syntax:
delete 'table_name','row_key','column_family:attribute'
code :
delete 'aamec','cse','year:joinedyear'
8. Delete Table
disable 'aamec'
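The disabled table is then removed with the drop command (standard HBase shell):
drop 'aamec'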
Result:
Thus the installation of HBase in standalone mode was completed and basic HBase shell operations were executed successfully.
EXPT.NO.7 Import and export data from various databases
Aim:
To import and export data between MySQL and Hive using Sqoop.
Pre-requisite
Hadoop and Java
MySQL
Hive
SQOOP
sudo apt install mysql-server ( use this command to install MySQL server)
COMMANDS:
~$ sudo su
After this, enter your Linux user password; a root shell will open. From here we do not need any
separate authentication for MySQL.
~root$ mysql
mysql> grant all privileges on *.* to bigdata@localhost;
Note: This step is not required if you just use the root user to make CRUD operations in MySQL.
mysql> CREATE USER 'bigdata'@'127.0.0.1' IDENTIFIED BY 'bigdata';
mysql> grant all privileges on *.* to 'bigdata'@'127.0.0.1';
Note: Here, *.* means that the user we create has all the privileges on all the tables of all the databases.
Now we have created the user profile that will be used to make CRUD operations in MySQL.
Step 3: Create a database and table and insert data.
Example:
create database Employe;
create table Employe.Emp(author_name varchar(65), total_no_of_articles int, phone_no int, address varchar(65));
insert into Employe.Emp values('Rohan', 10, 123456789, 'Lucknow');
Step 4: Create a database and table in Hive into which the data should be imported.
create table geeks_hive_table(name string, total_articles int, phone_no int, address string) row format
delimited fields terminated by ',';
Step 5: SQOOP INSTALLATION:
After downloading Sqoop, go to the directory where it was downloaded and extract it using the tar
command. Next, move it to /usr/lib, which requires super-user privileges. Both steps are sketched below.
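A hedged sketch of these two steps; the archive name corresponds to Sqoop 1.4.7 and is an assumption, so adjust it to the version actually downloaded:
$ cd ~/Downloads
$ tar -xvf sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz
$ su
$ mv sqoop-1.4.7.bin__hadoop-2.6.0 /usr/lib/sqoop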
Then exit : $ exit
Go to .bashrc: $ sudo nano .bashrc, and then add the following:
export SQOOP_HOME=/usr/lib/sqoop
export PATH=$PATH:$SQOOP_HOME/bin
$ source ~/.bashrc
Then configure the sqoop, goto the directory of the config folder of sqoop_home and then move the
contents of template file to the environment file.
$ cd $SQOOP_HOME/conf
$ mv sqoop-env-template.sh sqoop-env.sh
Then open the sqoop-env.sh file and add the following:
export HADOOP_COMMON_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=/usr/local/hadoop
Note: Here we add the path of the Hadoop libraries and files; it may differ from the path mentioned
here, so add the Hadoop path based on your installation.
Next, extract the MySQL connector archive and place the jar in the lib folder of Sqoop:
$ tar -zxf mysql-connector-java-5.1.30.tar.gz
$ su
$ cd mysql-connector-java-5.1.30
$ mv mysql-connector-java-5.1.30-bin.jar /usr/lib/sqoop/lib
Note: This library file is very important; do not skip this step, because it contains the JDBC driver used to
connect to MySQL databases.
Verify sqoop: sqoop-version
Step 6: Hive database creation
hive> create database sqoop_example;
hive>use sqoop_example;
hive>create table sqoop(usr_name string,no_ops int,ops_names string);
Hive commands are much like MySQL commands. Here we just create the structure to store the data
which we want to import into Hive.
Step 7: Importing data from MySQL to Hive:
sqoop import --connect \
jdbc:mysql://127.0.0.1:3306/database_name_in_mysql \
--username root --password cloudera \
--table table_name_in_mysql \
--hive-import --hive-table database_name_in_hive.table_name_in_hive \
--m 1
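For the MySQL table and Hive table created earlier, the command would look roughly like this (a sketch; the password matches the bigdata user created above and must agree with your own setup, and the Hive table is the one from Step 4):
sqoop import --connect \
jdbc:mysql://127.0.0.1:3306/Employe \
--username bigdata --password bigdata \
--table Emp \
--hive-import --hive-table geeks_hive_table \
--m 1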
OUTPUT:
Result:
Thus data from MySQL was imported into Hive successfully using Sqoop.
EXPT.NO.8 Installation of Apache PIG
Aim:
To install apache pig on ubuntu
Prerequisites
It is essential that you have Hadoop and Java installed on your system before you go for Apache Pig.
Therefore, prior to installing Apache Pig, install Hadoop and Java (see the earlier experiments in this
manual).
Step 1:
First of all, download the latest version of Apache Pig from the following website: https://pig.apache.org/
Open the homepage of Apache Pig website. Under the section News, click on the link release page as shown
in the following snapshot.
Step 2
On clicking the specified link, you will be redirected to the Apache Pig Releases page. On this page, under the
Download section, you will have two links, namely, Pig 0.8 and later and Pig 0.7 and before. Click on the
link Pig 0.8 and later, then you will be redirected to the page having a set of mirrors.
Step 3
Step 4
These mirrors will take you to the Pig Releases page. This page contains various versions of Apache Pig. Click
the latest version among them.
Step 5
Within these folders, you will have the source and binary files of Apache Pig in various distributions.
Download the tar files of the source and binary files of Apache Pig 0.15, pig-0.15.0-src.tar.gz and
pig-0.15.0.tar.gz.
Install Apache Pig
After downloading the Apache Pig software, install it in your Linux environment by following the steps given
below.
Step 1
Create a directory with the name Pig in the same directory where the installation directories of Hadoop, Java,
and other software were installed. (In our tutorial, we have created the Pig directory in the user named
Hadoop).
$ mkdir Pig
Step 2
$ cd Downloads/
$ tar zxvf pig-0.15.0-src.tar.gz
$ tar zxvf pig-0.15.0.tar.gz
Step 3
Move the contents of the extracted pig-0.15.0-src directory to the Pig directory created earlier as shown below.
$ mv pig-0.15.0-src/* /home/Hadoop/Pig/
Configure Apache Pig
After installing Apache Pig, we have to configure it. To configure, we need to edit two files − bashrc and
pig.properties.
.bashrc file
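The usual entries are the Pig home directory, its bin directory on the PATH, and the Hadoop configuration directory (a hedged sketch; the paths assume the layout used above):
export PIG_HOME=/home/Hadoop/Pig
export PATH=$PATH:$PIG_HOME/bin
export PIG_CLASSPATH=$HADOOP_HOME/conf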
pig.properties file
In the conf folder of Pig, we have a file named pig.properties. In the pig.properties file, you can set various
parameters as given below.
pig -h properties
Miscellaneous: exectype = mapreduce|tez|local; default is mapreduce. This property is the same as -x switch
pig.additional.jars.uris=<comma seperated list of jars>. Used in place of register command.
udf.import.list=<comma seperated list of imports>. Used to avoid package names in UDF.
stop.on.failure = true|false; default is false. Set to true to terminate on the first error.
pig.datetime.default.tz=<UTC time offset>. e.g. +08:00. Default is the default timezone of the host.
Determines the timezone used to handle datetime datatype and UDFs.
Additionally, any Hadoop property can be specified.
Verifying the Installation
Verify the installation of Apache Pig by typing the version command. If the installation is successful, you will
get the version of Apache Pig as shown below.
$ pig -version
Result:
Thus installation of Apache Pig is completed successfully.