BDA Lab Manual


LIST OF EXPERIMENTS

S.NO   DATE   NAME OF THE EXPERIMENT                                                             PAGE NO   MARK   SIGN

1. Study of Big Data Analytics and Hadoop Architecture
2. Downloading and installing Hadoop; understanding different Hadoop modes, startup scripts, configuration files
3. Hadoop implementation of file management tasks, such as adding files and directories, retrieving files and deleting files
4. Implementation of Matrix Multiplication with Hadoop MapReduce
5. Run a basic Word Count MapReduce program to understand the MapReduce paradigm
6. Installation of Hive along with practice examples
7. Installation of HBase, installing Thrift along with practice examples
8. Practice importing and exporting data from various databases

EXP NO: 1
Study of Big Data Analytics and Hadoop Architecture
Date:

Aim: To study Big Data Analytics and the Hadoop architecture.

Introduction to big data architecture:


➢ A big data architecture is designed to handle the ingestion, processing, and
analysis of data that is too large or complex for traditional database systems.

Components of a Big Data Architecture


➢ Data sources. All big data solutions start with one or more data sources.
➢ Data storage. Data for batch processing operations is typically stored in a distributed
file store that can hold high volumes of large files in various formats.
➢ Batch processing. Because the data sets are so large, a big data solution often must
process data files using long-running batch jobs to filter, aggregate, and otherwise
prepare the data for analysis. Usually these jobs involve reading source files,
processing them, and writing the output to new files.
➢ Real-time message ingestion. If the solution includes real-time sources, the
architecture must include a way to capture and store real-time messages for stream
processing.
➢ Stream processing. After capturing real-time messages, the solution must process
them by filtering, aggregating, and otherwise preparing the data for analysis.
➢ Analytical data store. Many big data solutions prepare data for analysis and then
serve the processed data in a structured format that can be queried using analytical
tools.
➢ Analysis and reporting. The goal of most big data solutions is to provide insights
into the data through analysis and reporting.
➢ Orchestration. Most big data solutions consist of repeated data processing
operations, encapsulated in workflows, that transform source data, move data
between multiple sources and sinks, load the processed data into an analytical data
store, or push the results straight to a report or dashboard.

Introduction of Hadoop Architecture:

➢ Apache Hadoop offers a scalable, flexible and reliable distributed computing big
data framework for a cluster of systems with storage capacity and local computing
power by leveraging commodity hardware.

➢ Hadoop follows a Master-Slave architecture for the transformation and analysis of
large datasets using the Hadoop MapReduce paradigm. The important Hadoop
components that play a vital role in the Hadoop architecture are:

➢ Hadoop Common – the libraries and utilities used by other Hadoop modules.

➢ Hadoop Distributed File System (HDFS) – the Java-based scalable system that stores
data across multiple machines without prior organization.

➢ YARN – (Yet Another Resource Negotiator) provides resource management for the
processes running on Hadoop.
➢ MapReduce – a parallel processing software framework. It comprises two steps. In the
Map step, a master node takes the input, partitions it into smaller sub-problems, and
distributes them to worker nodes. After the map step has taken place, the master node
takes the answers to all of the sub-problems and combines them to produce the output.
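
As a simple illustration (word counting, which is revisited in Experiment 5): given the two input lines "cat dog" and "dog dog", the map step emits the pairs (cat,1), (dog,1), (dog,1) and (dog,1); after shuffling, the reduce step sums the values for each key and produces (cat,1) and (dog,3).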

PERFORMANCE 50

RECORD 15

VIVA 10

TOTAL 75

Result: Thus the study of Big Data Analytics and the Hadoop architecture was completed successfully.

Download Hadoop from www.hadoop.apache.org
EXP NO: 2
Downloading and installing Hadoop; Understanding different
Date: Hadoop modes, startup scripts, configuration files.

Aim:
To install Apache Hadoop.

Hadoop software can be installed in any of three modes: standalone (local) mode, pseudo-distributed mode, and fully distributed mode.
Hadoop is a Java-based programming framework that supports the processing and storage of
extremely large datasets on a cluster of inexpensive machines. It was the first major open source
project in the big data playing field and is sponsored by the Apache Software Foundation.

Hadoop 2.7.3 comprises four main layers:

 Hadoop Common is the collection of utilities and libraries that support other Hadoop
modules.
 HDFS, which stands for Hadoop Distributed File System, is responsible for persisting
data to disk.
 YARN, short for Yet Another Resource Negotiator, is the "operating system" for HDFS.
 MapReduce is the original processing model for Hadoop clusters. It distributes work within
the cluster or map, then organizes and reduces the results from the nodes into a response to
a query. Many other processing models are available for the 2.x version of Hadoop.
Hadoop clusters are relatively complex to set up, so the project includes a stand-alone mode
which is suitable for learning about Hadoop, performing simple operations, and debugging.

Procedure:

We'll install Hadoop in stand-alone mode and run one of the example MapReduce
programs it includes to verify the installation.

Prerequisites:

Step1: Installing Java 8.

Running java -version should show output similar to:

openjdk version "1.8.0_91"
OpenJDK Runtime Environment (build 1.8.0_91-8u91-b14-3ubuntu1~16.04.1-b14)
OpenJDK 64-Bit Server VM (build 25.91-b14, mixed mode)

This output verifies that OpenJDK has been successfully installed.
Note: Set the JAVA_HOME environment variable to the JDK installation path.

Step2: Installing Hadoop


With Java in place, we'll visit the Apache Hadoop Releases page to find the most
recent stable release and follow the link to the binary for the current release.


Resource Manager & Node Manager:


Procedure to Run Hadoop

1. Install Apache Hadoop 2.2.0 in Microsoft Windows OS

If Apache Hadoop 2.2.0 is not already installed, then follow the post
"Build, Install, Configure and Run Apache Hadoop 2.2.0 in Microsoft Windows OS".

2. Start HDFS (Namenode and Datanode) and YARN (Resource Manager and Node Manager)

Run the following commands at the Command Prompt:

C:\Users\abhijitg>cd c:\hadoop
c:\hadoop>sbin\start-dfs
c:\hadoop>sbin\start-yarn
starting yarn daemons

Namenode, Datanode, Resource Manager and Node Manager will be started in a few minutes
and will be ready to execute Hadoop MapReduce jobs in the single-node
(pseudo-distributed mode) cluster.
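
You can verify that the daemons are running with the jps command, which should list the NameNode, DataNode, ResourceManager and NodeManager processes.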

PERFORMANCE 50

RECORD 15

VIVA 10

TOTAL 75

Result: Thus Hadoop was installed successfully.


EXP NO: 3
Hadoop Implementation of file management tasks, such as Adding
Date: files and directories, retrieving files and deleting files

Aim:
Implement the following file management tasks in Hadoop:
Adding files and directories
Retrieving files
Deleting Files

Procedure:

HDFS is a scalable distributed filesystem designed to scale to petabytes of data


while running on top of the underlying filesystem of the operating system. HDFS
keeps track of where the data resides in a network by associating the name of its
rack (or network switch) with the dataset. This allows Hadoop to efficiently
schedule tasks to those nodes that contain data, or which are nearest to it, optimizing
bandwidth utilization. Hadoop provides a set of command line utilities that work
similarly to the Linux file commands, and serve as your primary interface with
HDFS. We're going to have a look into HDFS by interacting with it from the
command line.
We will take a look at the most common file management tasks in Hadoop, which
include:
Adding files and directories to HDFS
Retrieving files from HDFS to local filesystem
Deleting files from HDFS

Algorithm:

Syntax And Commands To Add, Retrieve And Delete Data From Hdfs

Step-1: Adding Files and Directories to HDFS

Before you can run Hadoop programs on data stored in HDFS, you'll need to put the
data into HDFS first. Let's create a directory and put a file in it. HDFS has a default
working directory of /user/$USER, where $USER is your login user name. This
directory isn't automatically created for you, though, so let's create it with the mkdir
command. For the purpose of illustration, we use chuck. You should substitute your
user name in the example commands.
hadoop fs -mkdir /user/chuck
hadoop fs -put example.txt
hadoop fs -put example.txt /user/chuck


Step-2 : Retrieving Files from HDFS

The Hadoop command get copies files from HDFS back to the local filesystem, while cat prints
a file's contents to the console. To retrieve example.txt and view it, we can run the following commands:
hadoop fs -get example.txt .
hadoop fs -cat example.txt

Step-3: Deleting Files from HDFS

hadoop fs -rm example.txt

The command for creating a directory in HDFS is "hdfs dfs -mkdir /lendicse".

Adding a directory is done through the command "hdfs dfs -put lendi_english /".

Step-4: Copying Data from NFS to HDFS

The command for copying a directory is "hdfs dfs -copyFromLocal /home/lendi/Desktop/shakes/glossary /lendicse/".

View the file by using the command "hdfs dfs -cat /lendi_english/glossary".
The command for listing items in Hadoop is "hdfs dfs -ls hdfs://localhost:9000/".
The command for deleting files is "hdfs dfs -rm -r /kartheek".
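
The same tasks can also be performed programmatically through the HDFS FileSystem API. The following is a minimal illustrative Java sketch (the fs.defaultFS value and the paths are assumptions and should be adjusted to match your cluster and files):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsFileManagement {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; must match core-site.xml
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);

        // Adding a directory and a file
        fs.mkdirs(new Path("/user/chuck"));
        fs.copyFromLocalFile(new Path("example.txt"), new Path("/user/chuck/example.txt"));

        // Listing the directory contents
        for (FileStatus status : fs.listStatus(new Path("/user/chuck"))) {
            System.out.println(status.getPath());
        }

        // Retrieving the file back to the local filesystem
        fs.copyToLocalFile(new Path("/user/chuck/example.txt"), new Path("example_copy.txt"));

        // Deleting the file (second argument: recursive)
        fs.delete(new Path("/user/chuck/example.txt"), false);

        fs.close();
    }
}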

PERFORMANCE 50

RECORD 15

VIVA 10

TOTAL 75

Result: Thus the file management tasks in Hadoop were implemented successfully.


EXP NO: 4
Implementation of Matrix Multiplication with Hadoop MapReduce
Date:

AIM: To Develop a MapReduce program to implement Matrix Multiplication.

In mathematics, matrix multiplication or the matrix product is a binary
operation that produces a matrix from two matrices. The definition is motivated
by linear equations and linear transformations on vectors, which have numerous
applications in applied mathematics, physics, and engineering. In more detail, if
A is an n × m matrix and B is an m × p matrix, their matrix product AB is an n
× p matrix, in which the m entries across a row of A are multiplied with the m
entries down a column of B and summed to produce an entry of AB. When two
linear transformations are represented by matrices, then the matrix product
represents the composition of the two transformations.
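
In symbols, if A is n × m and B is m × p, the entry in row i and column k of the product is
AB(i,k) = Σ (j = 1..m) A(i,j) × B(j,k). The MapReduce algorithm below computes exactly these sums, one per (i,k) key.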

Algorithm for the Map Function:

a. For each element mij of M, produce the (key, value) pairs ((i,k), (M, j, mij)) for k = 1, 2, 3, ... up to the number of columns of N.
b. For each element njk of N, produce the (key, value) pairs ((i,k), (N, j, njk)) for i = 1, 2, 3, ... up to the number of rows of M.
c. Return the set of (key, value) pairs in which each key (i,k) has a list with the values (M, j, mij) and (N, j, njk) for all possible values of j.

Algorithm for the Reduce Function:

d. For each key (i,k):
e. Sort the values beginning with M by j into listM and the values beginning with N by j into listN, and multiply mij and njk for the jth value of each list.
f. Sum up the products mij × njk and return ((i,k), Σj mij × njk).

Step 1. Download the hadoop jar files with these links.


Download Hadoop Common Jar files: https://goo.gl/G4MyHp
$ wget https://goo.gl/G4MyHp -O hadoop-common-2.2.0.jar
Download Hadoop Mapreduce Jar File: https://goo.gl/KT8yfB
$ wget https://goo.gl/KT8yfB -O hadoop-mapreduce-client-core-2.7.1.jar

Step 2. Creating Mapper file for Matrix Multiplication.


import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;


import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;
import org.apache.hadoop.util.ReflectionUtils;

class Element implements Writable


{int tag;
int index;
double value;
Element() {
tag = 0;
index = 0;
value = 0.0;
}
Element(int tag, int index, double value)
{this.tag = tag;
this.index = index;
this.value = value;
}
@Override
public void readFields(DataInput input) throws IOException
{tag = input.readInt();
index = input.readInt();
value = input.readDouble();
}
@Override
public void write(DataOutput output) throws IOException
{output.writeInt(tag);
output.writeInt(index);
output.writeDouble(value);
}
}
class Pair implements WritableComparable<Pair>
{int i;
int j;

Pair() {
i = 0;


j = 0;
}
Pair(int i, int j) {
this.i = i;
this.j = j;
}
@Override
public void readFields(DataInput input) throws IOException
{i = input.readInt();
j = input.readInt();
}
@Override
public void write(DataOutput output) throws IOException
{output.writeInt(i);
output.writeInt(j);
}
@Override
public int compareTo(Pair compare)
{if (i > compare.i) {
return 1;
} else if ( i < compare.i)
{return -1;
} else {
if(j > compare.j) {
return 1;
} else if (j < compare.j)
{return -1;
}
}
return 0;
}
public String toString()
{ return i + " " + j + " "; }
}
public class Multiply {
public static class MatriceMapperM extends Mapper<Object,Text,IntWritable,Element>
{


@Override
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
String readLine = value.toString(); String[]
stringTokens = readLine.split(",");

int index = Integer.parseInt(stringTokens[0]);


double elementValue = Double.parseDouble(stringTokens[2]);
Element e = new Element(0, index, elementValue);
IntWritable keyValue = new
IntWritable(Integer.parseInt(stringTokens[1]));
context.write(keyValue, e);
}
}
public static class MatriceMapperN extends Mapper<Object,Text,IntWritable,Element>
{@Override
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
String readLine = value.toString(); String[]
stringTokens = readLine.split(",");
int index = Integer.parseInt(stringTokens[1]);
double elementValue = Double.parseDouble(stringTokens[2]);
Element e = new Element(1,index, elementValue);
IntWritable keyValue = new
IntWritable(Integer.parseInt(stringTokens[0]));
context.write(keyValue, e);
}
}
public static class ReducerMxN extends Reducer<IntWritable,Element, Pair,
DoubleWritable> {
@Override
public void reduce(IntWritable key, Iterable<Element> values, Context context) throws
IOException, InterruptedException {
ArrayList<Element> M = new ArrayList<Element>();
ArrayList<Element> N = new ArrayList<Element>();
Configuration conf = context.getConfiguration();
for(Element element : values) {
Element tempElement = ReflectionUtils.newInstance(Element.class,
conf);


ReflectionUtils.copy(conf, element, tempElement);

if (tempElement.tag == 0) {
M.add(tempElement);
} else if (tempElement.tag == 1) {
N.add(tempElement);
}
}
for(int i=0;i<M.size();i++) {
for(int j=0;j<N.size();j++) {

Pair p = new Pair(M.get(i).index,N.get(j).index);


double multiplyOutput = M.get(i).value * N.get(j).value;

context.write(p, new DoubleWritable(multiplyOutput));


}
}
}
}
public static class MapMxN extends Mapper<Object, Text, Pair, DoubleWritable>
{@Override
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
String readLine = value.toString();
String[] pairValue = readLine.split(" ");
Pair p = new
Pair(Integer.parseInt(pairValue[0]),Integer.parseInt(pairValue[1]));
DoubleWritable val = new
DoubleWritable(Double.parseDouble(pairValue[2]));
context.write(p, val);
}
}
public static class ReduceMxN extends Reducer<Pair, DoubleWritable, Pair,
DoubleWritable> {
@Override
public void reduce(Pair key, Iterable<DoubleWritable> values, Context context)
throws IOException, InterruptedException {
double sum = 0.0;
for(DoubleWritable value : values) {


sum += value.get();
}
context.write(key, new DoubleWritable(sum));
}
}
public static void main(String[] args) throws Exception
{Job job = Job.getInstance();
job.setJobName("MapIntermediate");
job.setJarByClass(Multiply.class);
MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class,
MatriceMapperM.class);
MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class,
MatriceMapperN.class);
job.setReducerClass(ReducerMxN.class);
job.setMapOutputKeyClass(IntWritable.class);
job.setMapOutputValueClass(Element.class);
job.setOutputKeyClass(Pair.class);
job.setOutputValueClass(DoubleWritable.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileOutputFormat.setOutputPath(job, new Path(args[2]));
job.waitForCompletion(true);
Job job2 = Job.getInstance();
job2.setJobName("MapFinalOutput");
job2.setJarByClass(Multiply.class);

job2.setMapperClass(MapMxN.class);
job2.setReducerClass(ReduceMxN.class);

job2.setMapOutputKeyClass(Pair.class);
job2.setMapOutputValueClass(DoubleWritable.class);

job2.setOutputKeyClass(Pair.class);
job2.setOutputValueClass(DoubleWritable.class);

job2.setInputFormatClass(TextInputFormat.class);
job2.setOutputFormatClass(TextOutputFormat.class);

FileInputFormat.setInputPaths(job2, new Path(args[2]));


FileOutputFormat.setOutputPath(job2, new Path(args[3]));


job2.waitForCompletion(true);
}
}

Step 5. Compiling the program in a folder named operation

#!/bin/bash

rm -rf multiply.jar classes

module load hadoop/2.6.0

mkdir -p classes
javac -d classes -cp classes:`$HADOOP_HOME/bin/hadoop classpath` Multiply.java
jar cf multiply.jar -C classes .

echo "end"

Step 6. Running the program from the folder named operation


export HADOOP_CONF_DIR=/home/$USER/cometcluster
module load hadoop/2.6.0
myhadoop-configure.sh
start-dfs.sh
start-yarn.sh

hdfs dfs -mkdir -p /user/$USER


hdfs dfs -put M-matrix-large.txt /user/$USER/M-matrix-large.txt
hdfs dfs -put N-matrix-large.txt /user/$USER/N-matrix-large.txt
hadoop jar multiply.jar edu.uta.cse6331.Multiply /user/$USER/M-matrix-large.txt
/user/$USER/N-matrix-large.txt /user/$USER/intermediate /user/$USER/output
rm -rf output-distr
mkdir output-distr
hdfs dfs -get /user/$USER/output/part* output-distr

stop-yarn.sh
stop-dfs.sh
myhadoop-cleanup.sh
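
For reference, the two input files are plain text in row,column,value format, which is what the mappers' split(",") expects. A tiny hypothetical example (values chosen only for illustration):

M-matrix-large.txt (a 2 x 2 matrix M):
0,0,1.0
0,1,2.0
1,0,3.0
1,1,4.0

N-matrix-large.txt (a 2 x 2 matrix N):
0,0,5.0
0,1,6.0
1,0,7.0
1,1,8.0

The final output directory should then contain the product entries 19.0, 22.0, 43.0 and 50.0, keyed by their (row, column) pairs; the exact spacing depends on Pair.toString() and the TextOutputFormat separator.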


PERFORMANCE 50

RECORD 15

VIVA 10

TOTAL 75

Result: Thus the implementation of Matrix Multiplication with Hadoop MapReduce was completed
successfully.


Sample input file C:\file1.txt:

Install Hadoop
Run Hadoop Wordcount Mapreduce Example


EXP NO: 5
Run a basic Word Count Map Reduce program to understand
Date: Map Reduce Paradigm.

Aim: To run a basic Word Count Map Reduce program.

Procedure:

Create a text file with some content. We'll pass this file as input to the
wordcount MapReduce job for counting words.

Create a directory (say 'input') in HDFS to keep all the text files (say 'file1.txt') to be used for counting words.
C:\Users\abhijitg>cd c:\hadoop
C:\hadoop>bin\hdfs dfs -mkdir input

Copy the text file(say 'file1.txt') from local disk to the newly created 'input' directory in HDFS.

C:\hadoop>bin\hdfs dfs -copyFromLocal c:/file1.txt input

Check content of the copied file.

C:\hadoop>hdfs dfs -ls input


Found 1 items
-rw-r--r-- 1 ABHIJITG supergroup 55 2014-02-03 13:19 input/file1.txt

C:\hadoop>bin\hdfs dfs -cat input/file1.txt


Install Hadoop
Run Hadoop Wordcount Mapreduce Example

Run the wordcount MapReduce job provided in %HADOOP_HOME%\share\hadoop\mapreduce\hadoop-mapreduce-examples-2.2.0.jar.
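
The examples jar already contains the WordCount classes. For reference, they are conceptually equivalent to the following minimal sketch, adapted from the standard Hadoop WordCount tutorial (treat it as illustrative rather than the exact packaged source):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every token in the input line
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the counts for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The same logic, already packaged in the examples jar, is run below.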

C:\hadoop>bin\yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar wordcount input output
14/02/03 13:22:02 INFO client.RMProxy: Connecting to ResourceManager at
/0.0.0.0:8032
14/02/03 13:22:03 INFO input.FileInputFormat: Total input paths to process : 1
14/02/03 13:22:03 INFO mapreduce.JobSubmitter: number of splits:1
:
:
14/02/03 13:22:04 INFO mapreduce.JobSubmitter: Submitting tokens for job:
job_1391412385921_0002


14/02/03 13:22:04 INFO impl.YarnClientImpl: Submitted application


application_1391412385921_0002 to ResourceManager at /0.0.0.0:8032
14/02/03 13:22:04 INFO mapreduce.Job: The url to track the job:
http://ABHIJITG:8088/proxy/application_1391412385921_0002/
14/02/03 13:22:04 INFO mapreduce.Job: Running job: job_1391412385921_0002
14/02/03 13:22:14 INFO mapreduce.Job: Job job_1391412385921_0002 running in
uber mode : false
14/02/03 13:22:14 INFO mapreduce.Job: map 0% reduce 0%
14/02/03 13:22:22 INFO mapreduce.Job: map 100% reduce 0%
14/02/03 13:22:30 INFO mapreduce.Job: map 100% reduce 100%
14/02/03 13:22:30 INFO mapreduce.Job: Job job_1391412385921_0002 completed
successfully
14/02/03 13:22:31 INFO mapreduce.Job: Counters: 43
File System Counters
FILE: Number of bytes read=89
FILE: Number of bytes written=160142
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=171
HDFS: Number of bytes written=59
HDFS: Number of read operations=6
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=5657
Total time spent by all reduces in occupied slots (ms)=6128
Map-Reduce Framework
Map input records=2
Map output records=7
Map output bytes=82
Map output materialized bytes=89
Input split bytes=116
Combine input records=7
Combine output records=6
Reduce input groups=6
Reduce shuffle bytes=89
Reduce input records=6
Reduce output records=6
Spilled Records=12
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=145


CPU time spent (ms)=1418


Physical memory (bytes) snapshot=368246784
Virtual memory (bytes) snapshot=513716224
Total committed heap usage (bytes)=307757056
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=55
File Output Format Counters
Bytes Written=59

PERFORMANCE 50

RECORD 15

VIVA 10

TOTAL 75

Result: Thus the basic Word Count MapReduce program was run successfully, demonstrating the
MapReduce paradigm.


EXP NO: 6
Installation of Hive along with practice examples.
Date:

Aim: To install Hive and create a database.

Prerequisites:

Step1: Installing Java 8.

Running java -version should show output similar to:

openjdk version "1.8.0_91"
OpenJDK Runtime Environment (build 1.8.0_91-8u91-b14-3ubuntu1~16.04.1-b14)
OpenJDK 64-Bit Server VM (build 25.91-b14, mixed mode)

This output verifies that OpenJDK has been successfully installed.
Note: Set the JAVA_HOME environment variable to the JDK installation path.
Step2: Installing Hadoop

With Java in place, we'll visit the Apache Hadoop Releases page to find the most recent stable release.

Procedure to Run Hive:

1. Download Hive zip

Can also use any other STABLE version for Hive.

2. Unzip and Install Hive

After Downloading the Hive, we need to Unzip the apache-hive-3.1.2-bin.tar.gz file.


Once extracted, we would get a new file apache-hive-3.1.2-bin.tar
Now, once again we need to extract this tar file.
 Now we can organize our Hive installation: create a folder and move the final extracted files into it.
For e.g.:

 Please note, while creating folders, DO NOT ADD SPACES IN BETWEEN THE FOLDER NAME
(it can cause issues later).

 I have placed my Hive in D: drive you can use C: or any other drive also.

3. Setting Up Environment Variables


Another important step in setting up a work environment is to set your Systems environment variable.

To edit environment variables, go to Control Panel > System > click on the “Advanced system settings”
link


Alternatively, We can Right click on This PC icon and click on Properties and click on the “Advanced
system settings” link
Or, easiest way is to search for Environment Variable in search bar and there you go

3.1 Setting HIVE_HOME

 Open environment Variable and click on “New” in “User Variable”

 On clicking “New”, we get below screen.

 Now as shown, add HIVE_HOME in variable name and path of Hive in Variable Value.
 Click OK and we are half done with setting HIVE_HOME.

3.2 Setting Path Variable

 Last step in setting Environment variable is setting Path in System Variable.

 Select Path variable in the system variables and click on “Edit”.

 Now we need to add these paths to Path Variable :-


* %HIVE_HOME%\bin

 Click OK and OK. & we are done with Setting Environment Variables.
3.4 Verify the Paths

 Now we need to verify that what we have done is correct and reflecting.

 Open a NEW Command Window

Run following commands


echo %HIVE_HOME%


4. Editing Hive

Once we have configured the environment variables next step is to configure Hive. It has 7 parts:-

4.1 Replacing bins

The first step in configuring Hive is to download and replace the bin folder.

* Go to this GitHub Repo and download the bin folder as a zip.

* Extract the zip and replace all the files present under bin folder to %HIVE_HOME%\bin

Note:- If you are using different version of HIVE then please search for its respective bin folder and
download it.

4.2 Creating File Hive-site.xml

Now we need to create the Hive-site.xml file in hive for configuring it :-


(We can find these files in Hive -> conf -> hive-default.xml.template)
We need to copy the hive-default.xml.template file and paste it in the same location and rename it to hive-
site.xml. This will act as our main Config file for Hive.

4.3 Editing Configuration Files

4.3.1 Editing the Properties

Now open the newly created hive-site.xml and edit the following properties:

<property>
<name>hive.metastore.uris</name>
<value>thrift://<Your IP Address>:9083</value>
</property>

<property>
<name>hive.downloaded.resources.dir</name>
<value><Your drive Folder>/${hive.session.id}_resources</value>
</property>

<property>
<name>hive.exec.scratchdir</name>
<value>/tmp/mydir</value>
</property>

Replace the value for <Your IP Address> with the IP address of your system and replace <Your drive
Folder> with the Hive folder path.

4.3.2 Removing Special Characters

This is a short step: we need to remove all the &#8; special characters present in the hive-site.xml file.

4.3.3 Adding few More Properties

Now we need to add the following properties as it is in the hive-site.xml File.


<property>
<name>hive.querylog.location</name>
<value>$HIVE_HOME/iotmp</value>
<description>Location of Hive run time structured log file</description>
</property><property>
<name>hive.exec.local.scratchdir</name>
<value>$HIVE_HOME/iotmp</value>
<description>Local scratch space for Hive jobs</description>
</property><property>
<name>hive.downloaded.resources.dir</name>
<value>$HIVE_HOME/iotmp</value>
<description>Temporary local directory for added resources in the remote file system.</description>
</property>
We are almost done with the Hive part. To configure a MySQL database as the metastore for Hive,
we need to follow the steps below:

4.4 Creating Hive User in MySQL

The next important step in configuring Hive is to create users for MySQL.
These Users are used for connecting Hive to MySQL Database for reading and writing data from it.
Note:- You can skip this step if you have created the hive user while SQOOP installation.

 Firstly, we need to open MySQL Workbench and open the workspace (default or any specific one, if you
want). We will be using the default workspace for now.

 Now open the Administration option in the workspace and select the Users and Privileges option
under Management.
 Now select the Add Account option and create a new user with Login Name as hive, Limit
to Host Mapping as localhost, and a Password of your choice.

 Now we have to define the roles for this user under Administrative
Roles and select the DBManager, DBDesigner and BackupAdmin roles.


Hive – Create Database from Java Example

Hive Java Dependency

To connect to Hive from Java, the Hive JDBC driver must be on the classpath. The Maven coordinates below are the usual ones; the version is assumed to match the installed Hive (3.1.2 here):

<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-jdbc</artifactId>
    <version>3.1.2</version>
</dependency>

Start HiveServer2

To connect to Hive from Java, you need to start hiveserver2 from $HIVE_HOME/bin

prabha@namenode:~/hive/bin$ ./hiveserver2
2020-10-03 23:17:08: Starting HiveServer2
Below is a complete Java example of how to create a Hive database.

Create a Hive Database from Java Example

package com.sparkbyexamples.hive;

import java.sql.Connection;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveCreateDatabase {


public static void main(String[] args) {
Connection con = null;
try {
String conStr = "jdbc:hive2://192.168.1.148:10000/default";
Class.forName("org.apache.hive.jdbc.HiveDriver");
con = DriverManager.getConnection(conStr, "", "");
Statement stmt = con.createStatement();
stmt.execute("CREATE DATABASE emp");
System.out.println("Database emp created successfully.");
} catch (Exception ex) {
ex.printStackTrace();
} finally {
try {
if (con != null)
con.close();
} catch (Exception ex) {
}
}
}
}
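
As a further practice example, a table can be created and queried in the new database over the same JDBC connection. The sketch below is illustrative only; the connection URL reuses the address from the example above, while the table name and columns are assumptions:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveCreateTable {
    public static void main(String[] args) throws Exception {
        // Connect to the emp database created in the previous example
        String conStr = "jdbc:hive2://192.168.1.148:10000/emp";
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(conStr, "", "");
             Statement stmt = con.createStatement()) {
            // Create a simple table and insert one row
            stmt.execute("CREATE TABLE IF NOT EXISTS employee (id INT, name STRING) "
                    + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','");
            stmt.execute("INSERT INTO employee VALUES (1, 'John')");
            // Query the table back
            ResultSet rs = stmt.executeQuery("SELECT id, name FROM employee");
            while (rs.next()) {
                System.out.println(rs.getInt(1) + "\t" + rs.getString(2));
            }
        }
    }
}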


 Now we need to grant schema privileges for the user by using Add Entry option and selecting
the schemas we need access to.

I am using schema matching pattern as %_bigdata% for all my bigdata related schemas. You can use other
2 options also.

 After clicking OK we need to select All the privileges for this schema.
 Click Apply and we are done with the creating Hive user.
4.5 Granting permission to Users
Once we have created the user hive the next step is to Grant All privileges to this user for all the Tables in
the previously selected Schema.
 Open the MySQL cmd window (we can open it using the Windows search bar). Upon opening, it will ask for your root user password (created while setting up MySQL).

Now we need to run the below command in the cmd window:

grant all privileges on test_bigdata.* to 'hive'@'localhost';

where test_bigdata is your schema name and hive@localhost is the user name @ host name.


4.6 Creating Metastore

Now we need to create our own metastore for Hive in MySQL..


Firstly, we need to create a database for the metastore in MySQL, or we can use the one created in the
previous step (test_bigdata in my case).
Now navigate to the path hive -> scripts -> metastore -> upgrade -> mysql and execute the file
hive-schema-3.1.0.mysql in MySQL in your database.
Note:- If you are using a different database, select the corresponding folder in the upgrade folder and
execute the hive-schema file.
4.7 Adding Few More Properties(Metastore related Properties)

Finally, we need to open our hive-site.xml file once again and make some changes there. These properties
are related to the Hive metastore, which is why we did not add them earlier, so as to distinguish between
the different sets of properties.

<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
<description>Username to use against metastore database</description>
</property>

<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/<Your Database>?createDatabaseIfNotExist=true</value>
<description>
JDBC connect string for a JDBC metastore.
To use SSL to encrypt/authenticate the connection, provide a database-specific SSL flag in the connection URL.
For example, jdbc:postgresql://myhost/db?ssl=true for postgres database.
</description>
</property>


<property>
<name>hive.metastore.warehouse.dir</name>
<value>hdfs://localhost:9000/user/hive/warehouse</value>
<description>location of default database for the warehouse</description>
</property>

<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value><Hive Password></value>
<description>password to use against metastore database</description>
</property>

<property>
<name>datanucleus.schema.autoCreateSchema</name>
<value>true</value>
</property>
<property>
<name>datanucleus.schema.autoCreateTables</name>
<value>true</value>
</property>

<property>
<name>datanucleus.schema.validateTables</name>
<value>true</value>
<description>validates existing schema against code. turn this on if you want to verify existing
schema</description>
</property>

Replace the value for <Hive Password> with the hive user password that we created in the MySQL user
creation step, and <Your Database> with the database that we used for the metastore in MySQL.

5. Starting Hive

5.1 Starting Hadoop

Now we need to start a new Command Prompt (remember to run it as administrator to avoid permission issues) and
execute the command below:

start-all.cmd

5.2 Starting the Hive Metastore

In another cmd window, start the Hive metastore service:

hive --service metastore

5.3 Starting Hive

Now open a new cmd window and run the below command to start Hive:

hive

The Hive JDBC dependency and the Java examples for creating a database and a table were given earlier, after section 4.4.


PERFORMANCE 50

RECORD 15

VIVA 10

TOTAL 75

Result: Thus Hive was installed successfully and a database was created.


EXP NO: 7
Installation of HBase, Installing thrift along with Practice
Date: examples

Aim: To install HBase and create a table.

Prerequisites:

Installing Hadoop

Procedure:

Step 1: Download HBase from Apache HBase site.

 Download Link -> https://hbase.apache.org/downloads.html ( I used version hbase-1.4.9-bin.tar.gz)

Step 2: Unzip it to a folder — I used c:\software\hbase.

Step 3: Now we need to change 2 files: a config file and a cmd file. In order to do that, go to the unzipped location.

Change 1: Edit hbase-config.cmd, located in the bin folder under the unzipped location, and add the line below to

set JAVA_HOME [add it below the comments section at the top]:

set JAVA_HOME=C:\software\Java\jdk1.8.0_201

Change 2: Edit hbase-site.xml, located in the conf folder under the unzipped location, and add the section below

to hbase-site.xml [inside the <configuration> tag].

Note: hbase.rootdir's value, e.g. hdfs://localhost:9000/hbase, must point at the same filesystem as the Hadoop

core-site.xml's fs.defaultFS value.

<property>

<name>hbase.rootdir</name>

<value>hdfs://localhost:9000/hbase</value>

</property>

<property>

<name>hbase.zookeeper.property.dataDir</name>

<value>/home/hadoop/zookeeper</value>

</property>

<property>

<name>hbase.cluster.distributed</name>

<value>false</value>

</property>

Step 5: Now we are all set to run HBase. To start HBase, execute the command below from the bin folder.
 Open Command Prompt and cd to HBase's bin directory
 Run start-hbase.cmd
 Look for any errors
Step 6: Test the installation using the HBase shell

 Open Command Prompt and cd to HBase's bin directory

 Run hbase shell [should connect to the HBase server]

 Try creating a table:

 create 'emp','p'

 list [the table name should get printed]

 put 'emp','emp01','p:fn','First Name'

 scan 'emp' [the row content should get printed]
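
The experiment title also mentions Thrift: HBase's Thrift gateway (typically started with the hbase thrift start command from the bin folder) exposes the same operations to non-Java clients, while Java programs normally use the native client API. Below is a minimal illustrative Java sketch of the native client that mirrors the shell commands above (the table and values follow the shell example; adjust the configuration to your setup):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientExample {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath for connection details
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("emp"))) {
            // Insert a row into column family 'p', qualifier 'fn'
            Put put = new Put(Bytes.toBytes("emp02"));
            put.addColumn(Bytes.toBytes("p"), Bytes.toBytes("fn"), Bytes.toBytes("Second Name"));
            table.put(put);
            // Read the row back and print the stored value
            Result r = table.get(new Get(Bytes.toBytes("emp02")));
            System.out.println(Bytes.toString(r.getValue(Bytes.toBytes("p"), Bytes.toBytes("fn"))));
        }
    }
}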


PERFORMANCE 50

RECORD 15

VIVA 10

TOTAL 75

Result: Thus HBase was installed successfully and a table was created.


EXP NO: 8
Practice importing and exporting data from various databases.
Date:

Aim: To practice importing and exporting data from various databases.

Procedure:

Step 1: Create a blank dossier or open an existing one.

Step 2: Choose Add Data -> New Data to import data into a new dataset, or, in the Datasets panel, click More
next to the dataset name and choose Edit Dataset to add data to the dataset; the Preview dialog opens, and
you can click Add a new table.
The Data Sources dialog opens.

Step 3: To import data from a specific database, select the corresponding logo (Amazon Redshift, Apache
Cassandra, Cloudera Hive, Google BigQuery, Hadoop, etc.). If you select Pig or Web Services, the Import
from Tables dialog opens, bypassing the Select Import Options dialog, allowing you to type a query to
import a table. If you select SAP HANA, you must build or type a query instead of selecting tables. Or,

to import data without specifying a database type, click Databases.

The Select Import Options dialog opens.

Step 4: Select Select Tables and click Next. The Import from Tables dialog opens. If you selected a specific
database, only the data source connections that correspond to the selected database appear. If you did not
select a database, all available data source connections appear.

If necessary, you can create a new connection to a data source while importing your data.

The terminology on the Import from Tables dialog varies based on the source of the data.

Step 5: In the Data Sources/Projects pane, click on the data source/project that contains the data to import.

Step 6: If your data source/project supports namespaces, select a namespace from the Namespace drop-down list
in the Available Tables/Datasets pane to display only the tables/datasets within a selected namespace. To
search for a namespace, type its name in Namespace. The choices in the drop-down list are filtered as you
type.

Step 7: Expand a table/dataset to view the columns within it. Each column appears with its corresponding data
type in brackets. To search for a table/dataset, type its name in Table. The tables/datasets are filtered as
you type.

Step 8: MicroStrategy creates a cache of the database’s tables and columns when a data source/project is first
used. Hover over the Information icon at the top of the Available Tables/Datasets pane to view a tooltip
displaying the number of tables and the last time the cache was updated.

Step 9: Click Update namespaces in the Available Tables/Datasets pane to refresh the namespaces.

Step 10: Click Update in the Available Tables/Datasets pane to refresh the tables/datasets.


Step 11: Double-click tables/datasets in the Available Tables/Datasets pane to add them to the list of tables to
import. The tables/datasets appear in the Query Builder pane along with their corresponding columns.

Step 12: Click Prepare Data if you are adding a new dataset and want to preview, modify, and specify import
options, or click Add if you are editing an existing dataset.

Step 13: Click Finish if you are adding a new dataset and go to the next step, or click Update Dataset if you are
editing an existing dataset and skip the next step.

Step 14: The Data Access Mode dialog opens.

Click Connect Live to connect to a live database when retrieving data. Connecting live is useful if you
are working with a large amount of data, when importing into the dossier may not be feasible. Go to the
last step. Or click Import as an In-memory Dataset to import the data directly into your dossier. Importing
the data leads to faster interaction with the data, but uses more RAM. Go to the last step.

Step 15: The Publishing Status dialog opens.

If you are editing a connect-live dataset, the existing dataset is refreshed and updated. If you are editing
an in-memory dataset, you are prompted to refresh the existing dataset first.

Step 16: View the new or updated datasets on the Datasets panel.

PERFORMANCE 50

RECORD 15

VIVA 10

TOTAL 75

Result: Thus importing and exporting data from various databases was completed successfully.

