BDA Lab Manual
LIST OF EXPERIMENTS
S.NO   DATE   NAME OF THE EXPERIMENT   PAGE NO   MARK   SIGN
EXP NO: 1
Study of Big Data Analytics and Hadoop Architecture
Date:
➢ Apache Hadoop offers a scalable, flexible and reliable distributed computing big
data framework for a cluster of systems with storage capacity and local computing
power by leveraging commodity hardware.
➢ Hadoop follows a Master-Slave architecture for the transformation and analysis of
large datasets using the Hadoop MapReduce paradigm. The following Hadoop
components play a vital role in the Hadoop architecture. Hadoop Common – the
libraries and utilities used by other Hadoop modules.
➢ Hadoop Distributed File System (HDFS) – the Java-based scalable system that stores
data across multiple machines without prior organization.
➢ YARN – (Yet Another Resource Negotiator) provides resource management for the
processes running on Hadoop.
➢ MapReduce – a parallel processing software framework comprising two steps. In the
Map step, a master node takes the input, partitions it into smaller sub-problems, and
distributes them to worker nodes. After the Map step has taken place, the master node
collects the answers to all of the sub-problems and combines them in the Reduce step
to produce the output. A plain-Java illustration of this Map/Reduce flow is sketched below.
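The following is a minimal plain-Java sketch (no Hadoop involved) of the flow described above for word counting: the map phase emits (word, 1) pairs and the reduce phase sums the values per key. The class name MapReduceSketch and the sample input are illustrative only.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MapReduceSketch {
    public static void main(String[] args) {
        String[] input = {"big data", "big analytics", "data"};

        // Map step: emit a (word, 1) pair for every word in every input record
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String record : input) {
            for (String word : record.split(" ")) {
                mapped.add(Map.entry(word, 1));
            }
        }

        // Shuffle + Reduce step: group the pairs by key and sum the values
        Map<String, Integer> reduced = new HashMap<>();
        for (Map.Entry<String, Integer> pair : mapped) {
            reduced.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }

        System.out.println(reduced);   // e.g. {analytics=1, big=2, data=2} (order may vary)
    }
}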
PERFORMANCE 50
RECORD 15
VIVA 10
TOTAL 75
Result: Thus the study of Big Data Analytics and Hadoop architecture was completed successfully.
Download Hadoop from www.hadoop.apache.org
EXP NO: 2
Downloading and installing Hadoop; understanding different Hadoop modes, startup scripts, and configuration files
Date:
Aim:
To Install Apache Hadoop.
Hadoop software can be installed in three modes: standalone (local) mode, pseudo-distributed mode, and fully distributed mode.
Hadoop is a Java-based programming framework that supports the processing and storage of
extremely large datasets on a cluster of inexpensive machines. It was the first major open source
project in the big data playing field and is sponsored by the Apache Software Foundation.
Hadoop Common is the collection of utilities and libraries that support other Hadoop
modules.
HDFS, which stands for Hadoop Distributed File System, is responsible for persisting
data to disk.
YARN, short for Yet Another Resource Negotiator, is the "operating system" for HDFS.
MapReduce is the original processing model for Hadoop clusters. It distributes work within
the cluster or map, then organizes and reduces the results from the nodes into a response to
a query. Many other processing models are available for the 2.x version of Hadoop.
Hadoop clusters are relatively complex to set up, so the project includes a stand-alone mode
which is suitable for learning about Hadoop, performing simple operations, and debugging.
Procedure:
We'll install Hadoop in stand-alone mode and run one of the example MapReduce
programs it includes to verify the installation.
Prerequisites:
If Apache Hadoop 2.2.0 is not already installed, then follow the post "Build, Install,
Configure and Run Apache Hadoop 2.2.0 in Microsoft Windows OS".
Run the following commands in the Command Prompt:
C:\Users\abhijitg>cd c:\hadoop
c:\hadoop>sbin\start-dfs
c:\hadoop>sbin\start-yarn
starting yarn daemons
PERFORMANCE 50
RECORD 15
VIVA 10
TOTAL 75
EXP NO: 3
Hadoop implementation of file management tasks, such as adding files and directories, retrieving files, and deleting files
Date:
Aim:
Implement the following file management tasks in Hadoop:
Adding files and directories
Retrieving files
Deleting Files
Procedure:
Algorithm:
Syntax and Commands to Add, Retrieve and Delete Data from HDFS
Before you can run Hadoop programs on data stored in HDFS, you'll need to put the
data into HDFS first. Let's create a directory and put a file in it. HDFS has a default
working directory of /user/$USER, where $USER is your login user name. This
directory isn't automatically created for you, though, so let's create it with the mkdir
command. For the purpose of illustration, we use chuck. You should substitute your
user name in the example commands.
hadoop fs -mkdir /user/chuck
hadoop fs -put example.txt
hadoop fs -put example.txt /user/chuck
The Hadoop get command copies files from HDFS back to the local filesystem, while cat
displays a file's contents. To view example.txt, we can run the following command.
hadoop fs -cat example.txt
The command to copy a file from the local filesystem into HDFS is "hdfs dfs -copyFromLocal /home/lendi/Desktop/shakes/glossary /lendicse/".
View the file by using the command "hdfs dfs -cat /lendi_english/glossary".
The command for listing items in Hadoop is "hdfs dfs -ls hdfs://localhost:9000/".
The command for deleting files is "hdfs dfs -rm -r /kartheek".
The same tasks can also be performed programmatically through the HDFS Java API, as sketched below.
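The sketch below assumes the Hadoop client libraries are on the classpath and that fs.defaultFS in core-site.xml points at the running cluster; the paths and the class name HdfsFileOps are illustrative only.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsFileOps {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        // Adding a directory and a file
        fs.mkdirs(new Path("/user/chuck"));
        fs.copyFromLocalFile(new Path("example.txt"), new Path("/user/chuck/example.txt"));

        // Retrieving a file back to the local filesystem
        fs.copyToLocalFile(new Path("/user/chuck/example.txt"), new Path("example_copy.txt"));

        // Listing directory contents
        for (FileStatus status : fs.listStatus(new Path("/user/chuck"))) {
            System.out.println(status.getPath());
        }

        // Deleting a file (second argument = recursive)
        fs.delete(new Path("/user/chuck/example.txt"), false);

        fs.close();
    }
}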
PERFORMANCE 50
RECORD 15
VIVA 10
TOTAL 75
Result: Thus the file management tasks in Hadoop were implemented successfully.
EXP NO: 4
Implementation of Matrix Multiplication with Hadoop MapReduce
Date:
Algorithm for Map Function:
a. for each element mij of M do
   produce (key, value) pairs as ((i,k), (M, j, mij)), for k = 1, 2, 3, ... up to the number of
   columns of N
b. for each element njk of N do
   produce (key, value) pairs as ((i,k), (N, j, njk)), for i = 1, 2, 3, ... up to the number
   of rows of M
c. return the set of (key, value) pairs in which each key (i,k) has a list of values
   (M, j, mij) and (N, j, njk) for all possible values of j.
Algorithm for Reduce Function:
d. for each key (i,k) do
e. sort the values beginning with M by j into listM and the values beginning with N by j into listN;
   multiply mij and njk for the j-th value of each list
f. sum up mij x njk and return ((i,k), Σj mij x njk)
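For example, take M = [[1, 2], [3, 4]] and N = [[5, 6], [7, 8]]. For the key (1,1), the map phase emits (M, 1, 1) and (M, 2, 2) from the first row of M, and (N, 1, 5) and (N, 2, 7) from the first column of N. The reducer for key (1,1) sorts these lists by j, pairs them up, and computes 1 x 5 + 2 x 7 = 19, which is element (1,1) of M x N. The Java MapReduce implementation used in this experiment follows.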
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;
import org.apache.hadoop.util.ReflectionUtils;
// Composite key (i, j); the class header below is reconstructed from its usage
// (it implements WritableComparable so it can serve as a MapReduce key).
class Pair implements WritableComparable<Pair> {
    int i;
    int j;

    Pair() {
        i = 0;
        j = 0;
    }

    Pair(int i, int j) {
        this.i = i;
        this.j = j;
    }

    @Override
    public void readFields(DataInput input) throws IOException {
        i = input.readInt();
        j = input.readInt();
    }

    @Override
    public void write(DataOutput output) throws IOException {
        output.writeInt(i);
        output.writeInt(j);
    }

    @Override
    public int compareTo(Pair compare) {
        if (i > compare.i) {
            return 1;
        } else if (i < compare.i) {
            return -1;
        } else {
            if (j > compare.j) {
                return 1;
            } else if (j < compare.j) {
                return -1;
            }
        }
        return 0;
    }

    public String toString() {
        return i + " " + j + " ";
    }
}
public class Multiply {
    public static class MatriceMapperM extends Mapper<Object, Text, IntWritable, Element> {
        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String readLine = value.toString();
            String[] stringTokens = readLine.split(",");
                if (tempElement.tag == 0) {
                    M.add(tempElement);
                } else if (tempElement.tag == 1) {
                    N.add(tempElement);
                }
            }
            for (int i = 0; i < M.size(); i++) {
                for (int j = 0; j < N.size(); j++) {
                sum += value.get();
            }
            context.write(key, new DoubleWritable(sum));
        }
    }
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setJobName("MapIntermediate");
        job.setJarByClass(Multiply.class);
        MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class,
                MatriceMapperM.class);
        MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class,
                MatriceMapperN.class);
        job.setReducerClass(ReducerMxN.class);
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(Element.class);
        job.setOutputKeyClass(Pair.class);
        job.setOutputValueClass(DoubleWritable.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        job.waitForCompletion(true);

        Job job2 = Job.getInstance();
        job2.setJobName("MapFinalOutput");
        job2.setJarByClass(Multiply.class);
        job2.setMapperClass(MapMxN.class);
        job2.setReducerClass(ReduceMxN.class);
        job2.setMapOutputKeyClass(Pair.class);
        job2.setMapOutputValueClass(DoubleWritable.class);
        job2.setOutputKeyClass(Pair.class);
        job2.setOutputValueClass(DoubleWritable.class);
        job2.setInputFormatClass(TextInputFormat.class);
        job2.setOutputFormatClass(TextOutputFormat.class);
        job2.waitForCompletion(true);
    }
}
#!/bin/bash
# Compile the MapReduce job against the Hadoop classpath and package it into a jar
mkdir -p classes
javac -d classes -cp classes:`$HADOOP_HOME/bin/hadoop classpath` Multiply.java
jar cf multiply.jar -C classes .
echo "end"
# Stop the cluster daemons once the job has finished
stop-yarn.sh
stop-dfs.sh
myhadoop-cleanup.sh
PERFORMANCE 50
RECORD 15
VIVA 10
TOTAL 75
Result: Thus Matrix Multiplication with Hadoop MapReduce was implemented successfully.
Prerequisite: Install Hadoop. Create a sample text file C:\file1.txt to use as input.
EXP NO: 5
Run a basic Word Count MapReduce program to understand the MapReduce paradigm
Date:
Procedure:
Create a text file with some content. We'll pass this file as input to the
wordcount MapReduce job for counting words.
Create a directory (say 'input') in HDFS to keep all the text files (say 'file1.txt') to be used for counting words.
C:\Users\abhijitg>cd c:\hadoop
C:\hadoop>bin\hdfs dfs -mkdir input
Copy the text file (say 'file1.txt') from the local disk to the newly created 'input' directory in HDFS. The word count job can then be run from the Hadoop examples jar; a standalone version of the program is sketched below for reference.
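The lab runs the wordcount example that ships with Hadoop; for reference, a minimal equivalent WordCount program is sketched below. The class name WordCount and the input/output argument order are assumptions (they follow the standard Hadoop example).

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every token in the input line
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the counts for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. input
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}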
PERFORMANCE 50
RECORD 15
VIVA 10
TOTAL 75
Result: Thus the basic Word Count MapReduce program was run successfully to understand the MapReduce paradigm.
EXP NO: 6
Installation of Hive along with practice examples.
Date:
Prerequisites:
With Java in place, we'll visit the Apache Hadoop Releases page to find the most recent stable release.
Please note: while creating folders, DO NOT ADD SPACES IN THE FOLDER NAME
(it can cause issues later).
I have placed my Hive in the D: drive; you can use C: or any other drive also.
To edit environment variables, go to Control Panel > System > click on the “Advanced system settings”
link
Alternatively, we can right-click on the This PC icon, click on Properties, and click on the "Advanced
system settings" link.
Or, the easiest way is to search for Environment Variables in the search bar.
Now, as shown, add HIVE_HOME as the Variable Name and the path of Hive as the Variable Value.
Click OK and we are half done with setting HIVE_HOME.
Click OK and OK, and we are done with setting the environment variables.
3.4 Verify the Paths
Now we need to verify that what we have done is correct and is reflected.
4. Editing Hive
Once we have configured the environment variables, the next step is to configure Hive. It has 7 parts:
The first step in configuring Hive is to download and replace the bin folder.
* Extract the zip and replace all the files present under the bin folder in %HIVE_HOME%\bin.
Note: If you are using a different version of Hive then please search for its respective bin folder and
download it.
Now open the newly created hive-site.xml; we need to edit the following properties:
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://<Your IP Address>:9083</value>
</property>
<property>
  <name>hive.downloaded.resources.dir</name>
  <value><Your drive Folder>/${hive.session.id}_resources</value>
</property>
<property>
  <name>hive.exec.scratchdir</name>
  <value>/tmp/mydir</value>
</property>
Replace the value for <Your IP Address> with the IP address of your system and replace <Your drive
Folder> with the Hive folder path.
This is a short step: we need to remove all the stray special characters present in the hive-site.xml file.
The next important step in configuring Hive is to create users for MySQL.
These users are used for connecting Hive to the MySQL database for reading and writing data from it.
Note: You can skip this step if you created the hive user during the SQOOP installation.
Firstly, we need to open MySQL Workbench and open the workspace (default or any specific one, if you
want). We will be using the default workspace for now.
Now open the Administration option in the workspace and select the Users and Privileges option
under Management.
Now select the Add Account option and create a new user with Login Name as hive, Limit
to Host Mapping as localhost, and a Password of your choice.
Now we have to define the roles for this user under Administrative
Roles and select the DBManager, DBDesigner and BackupAdmin roles.
To connect to Hive from Java, add the hive-jdbc driver dependency (org.apache.hive:hive-jdbc, matching your Hive version) to your project's pom.xml.
Start HiveServer2
To connect to Hive from Java, you need to start hiveserver2 from $HIVE_HOME/bin
prabha@namenode:~/hive/bin$ ./hiveserver2
2020-10-03 23:17:08: Starting HiveServer2
Below is a Java example of how to create a Hive database and table.
Create a Hive Table from Java Example
package com.sparkbyexamples.hive;
import java.sql.Connection;
import java.sql.Statement;
import java.sql.DriverManager;
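// The rest of the example is a minimal sketch: it assumes HiveServer2 is running on the
// default port 10000 and the hive-jdbc driver is on the classpath; the class name
// HiveCreateDatabase, the database name emp, and the table columns are illustrative only.
public class HiveCreateDatabase {
    public static void main(String[] args) {
        // JDBC URL of the running HiveServer2 instance; adjust host and port if needed
        String connectionUrl = "jdbc:hive2://localhost:10000/default";
        try (Connection con = DriverManager.getConnection(connectionUrl, "", "");
             Statement stmt = con.createStatement()) {
            // Create a database, then a simple table inside it
            stmt.execute("CREATE DATABASE IF NOT EXISTS emp");
            stmt.execute("CREATE TABLE IF NOT EXISTS emp.employee (id int, name string) "
                    + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','");
            System.out.println("Database and table created successfully");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}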
Now we need to grant schema privileges for the user by using the Add Entry option and selecting
the schemas we need access to.
I am using the schema matching pattern %_bigdata% for all my big data related schemas. You can use the
other two options also.
After clicking OK we need to select All the privileges for this schema.
Click Apply and we are done with creating the Hive user.
4.5 Granting permission to Users
Once we have created the user hive, the next step is to grant all privileges to this user for all the tables in
the previously selected schema.
Upon opening, it will ask for your root user password (created while setting up MySQL).
grant all privileges on test_bigdata.* to 'hive'@'localhost';
where test_bigdata will be your schema name and hive@localhost will be the user name @ host name.
Finally, we need to open our hive-site.xml file once again and make some changes there; these are related to the Hive metastore.
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
  <description>Username to use against metastore database</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost:3306/<Your Database>?createDatabaseIfNotExist=true</value>
  <description>
    JDBC connect string for a JDBC metastore.
    To use SSL to encrypt/authenticate the connection, provide database-specific SSL flag in the connection URL.
    For example, jdbc:postgresql://myhost/db?ssl=true for postgres database.
  </description>
</property>
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>hdfs://localhost:9000/user/hive/warehouse</value>
  <description>location of default database for the warehouse</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value><Hive Password></value>
</property>
These properties are related to the Hive metastore; that is why they were not added at the start, so as to distinguish between the different sets of properties.
Open the MySQL cmd window. We can open it by using the Windows search bar.
<property>
<name>datanucleus.schema.autoCreateSchema</name>
<value>true</value>
</property>
<property>
<name>datanucleus.schema.autoCreateTables</name>
<value>true</value>
</property>
<property>
<name>datanucleus.schema.validateTables</name>
<value>true</value>
<description>Validates existing schema against code. Turn this on if you want to verify the existing
schema</description>
</property>
Replace the value for <Hive Password> with the hive user password that we created during MySQL user creation,
and <Your Database> with the database that we used for the metastore in MySQL.
5. Starting Hive
Now we need to start a new Command Prompt (remember to run it as administrator to avoid permission issues) and
execute the below command:
start-all.cmd
Now open a new cmd window and run the below command to start Hive:
hive
Hive – Create Database from Java Example
Hive Java Dependency
PERFORMANCE 50
RECORD 15
VIVA 10
TOTAL 75
EXP NO: 7
Installation of HBase, installing Thrift along with practice examples
Date:
Prerequisites:
Installing Hadoop
Procedure:
Step 3: Now we need to change 2 files: a config file and a cmd file. In order to do that, go to the unzipped location.
Change 1: Edit hbase-config.cmd, located in the bin folder under the unzipped location, and add the line below to it:
set JAVA_HOME=C:\software\Java\jdk1.8.0_201
Change 2: Edit hbase-site.xml, located in the conf folder under the unzipped location, and add the section below.
The hbase.rootdir HDFS value should match your core-site.xml fs.defaultFS value.
<property>
<name>hbase.rootdir</name>
<value>file:/home/hadoop/HBase/HFiles</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/home/hadoop/zookeeper</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>false</value>
</property>
<property>
<name>hbase.rootdir</name>
<value>hdfs://localhost:9000/hbase</value>
</property>
Step 5: Now we are all set to run HBase. To start HBase, execute the command below from the bin folder.
Open Command Prompt and cd to HBase's bin directory.
Run start-hbase.cmd
Look for any errors.
Step 6: Test the installation using the HBase shell:
create 'emp','p'
A Java-client equivalent of this shell command is sketched after this step.
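As an optional practice example, the same table can be created and exercised from Java using the HBase client API. This is a sketch only: it assumes HBase 2.x with the hbase-client dependency on the classpath, and the class name HBaseEmpExample, row key, and column values are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseEmpExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {

            // Equivalent of the shell command: create 'emp','p'
            TableName tableName = TableName.valueOf("emp");
            if (!admin.tableExists(tableName)) {
                admin.createTable(TableDescriptorBuilder.newBuilder(tableName)
                        .setColumnFamily(ColumnFamilyDescriptorBuilder.of("p"))
                        .build());
            }

            // Put one row and read it back
            try (Table table = connection.getTable(tableName)) {
                Put put = new Put(Bytes.toBytes("1"));
                put.addColumn(Bytes.toBytes("p"), Bytes.toBytes("name"), Bytes.toBytes("asha"));
                table.put(put);

                Result result = table.get(new Get(Bytes.toBytes("1")));
                System.out.println(Bytes.toString(
                        result.getValue(Bytes.toBytes("p"), Bytes.toBytes("name"))));
            }
        }
    }
}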
PERFORMANCE 50
RECORD 15
VIVA 10
TOTAL 75
EXP NO: 8
Practice importing and exporting data from various databases.
Date:
Procedure:
Step 2: Choose Add Data > New Data to import data into a new dataset; or, in the Datasets panel, click More next
to the dataset name and choose Edit Dataset to add data to the dataset. The Preview dialog opens. Click
Add a new table.
The Data Sources dialog opens.
Step 3: To import data from a specific database, select the corresponding logo (Amazon Redshift, Apache
Cassandra, Cloudera Hive, Google BigQuery, Hadoop, etc.). If you select Pig or Web Services, the Import
from Tables dialog opens, bypassing the Select Import Options dialog, allowing you to type a query to
import a table. If you select SAP HANA, you must build or type a query instead of selecting tables.
Step 4: Select Select Tables and click Next. The Import from Tables dialog opens. If you selected a specific
database, only the data source connections that correspond to the selected database appear. If you did not
select a database, all available data source connections appear.
If necessary, you can create a new connection to a data source while importing your data.
The terminology on the Import from Tables dialog varies based on the source of the data.
Step 5: In the Data Sources/Projects pane, click on the data source/project that contains the data to import.
Step 6: If your data source/project supports namespaces, select a namespace from the Namespace drop-down list
in the Available Tables/Datasets pane to display only the tables/datasets within a selected namespace. To
search for a namespace, type its name in Namespace. The choices in the drop-down list are filtered as you
type.
Step 7: Expand a table/dataset to view the columns within it. Each column appears with its corresponding data
type in brackets. To search for a table/dataset, type its name in Table. The tables/datasets are filtered as
you type.
Step 8: MicroStrategy creates a cache of the database’s tables and columns when a data source/project is first
used. Hover over the Information icon at the top of the Available Tables/Datasets pane to view a tooltip
displaying the number of tables and the last time the cache was updated.
Step 9: Click Update namespaces in the Available Tables/Datasets pane to refresh the namespaces.
Step 10: Click Update in the Available Tables/Datasets pane to refresh the tables/datasets.
Step 11: Double-click tables/datasets in the Available Tables/Datasets pane to add them to the list of tables to
import. The tables/datasets appear in the Query Builder pane along with their corresponding columns.
Step 12: Click Prepare Data if you are adding a new dataset and want to preview, modify, and specify import
options; or
Step 13: Click Finish if you are adding a new dataset, and go to the next step; or
click Update Dataset if you are editing an existing dataset, and skip the next step; or
click Connect Live to connect to a live database when retrieving data. Connecting live is useful if you
are working with a large amount of data, when importing into the dossier may not be feasible. Go to the
last step; or
click Import as an In-memory Dataset to import the data directly into your dossier. Importing the data
leads to faster interaction with the data, but uses more RAM. Go to the last step.
If you are editing a connect-live dataset, the existing dataset is refreshed and updated; or, if you are
editing an in-memory dataset, you are prompted to refresh the existing dataset first.
Step 16: View the new or updated datasets on the Datasets panel.
PERFORMANCE 50
RECORD 15
VIVA 10
TOTAL 75
Result: Thus importing and exporting data from various databases was completed successfully.