BDA Lab Manual - Organized
Name :
Semester :
Branch :
Register Number :
Certified that this is the bonafide record of work done by the above student in
the
Laboratory during the year 20 - 20 .
AIM:
Procedure:
We'll install Hadoop in stand-alone mode and run one of the example MapReduce programs it includes to verify the installation.
Prerequisites:
Procedure to Run Hadoop
1. If Apache Hadoop 2.2.0 is not already installed, follow the post "Build, Install, Configure and Run Apache Hadoop 2.2.0 in Microsoft Windows OS".
2. Start HDFS (Namenode and Datanode) and YARN (Resource Manager and Node
Manager)
Namenode, Datanode, Resource Manager and Node Manager will be started in a few minutes and will be ready to execute a Hadoop MapReduce job on the single-node (pseudo-distributed mode) cluster.
3. Run the wordcount MapReduce job
Create a text file with some content. We'll pass this file as input to the wordcount MapReduce job for counting words.
C:\file1.txt
Create a directory (say 'input') in HDFS to keep all the text files (say 'file1.txt') to be used for counting words.
C:\Users\abhijitg>cd c:\hadoop
C:\hadoop>bin\hdfs dfs -mkdir input
Copy the text file (say 'file1.txt') from the local disk to the newly created 'input' directory in HDFS.
Result:
EXP NO: 2 Hadoop implementation of file management tasks, such as adding files and directories, retrieving files and deleting files
Date:
AIM:-
DESCRIPTION:-
ALGORITHM:-
SYNTAX AND COMMANDS TO ADD, RETRIEVE AND DELETE DATA FROM HDFS
Step-1
• Copying from a local directory is done with the command "hdfs dfs -copyFromLocal /home/lendi/Desktop/shakes/glossary /lendicse/"
• View the file by using the command "hdfs dfs -cat /lendi_english/glossary"
• The command for listing items in Hadoop is "hdfs dfs -ls hdfs://localhost:9000/"
• The command for deleting files is "hdfs dfs -rm -r /kartheek"
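The same add, retrieve and delete operations can also be performed programmatically with the Hadoop Java FileSystem API. The sketch below is only an illustration; the class name is made up, and the paths and the fs.defaultFS address are assumptions carried over from the commands above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsFileOps {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumes the NameNode runs at localhost:9000, as in the -ls command above.
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);

        // Add: copy a local file into HDFS (paths follow the example commands).
        fs.copyFromLocalFile(new Path("/home/lendi/Desktop/shakes/glossary"),
                             new Path("/lendicse/"));

        // Retrieve: list the contents of the HDFS root directory.
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath());
        }

        // Delete: remove a directory recursively (equivalent to hdfs dfs -rm -r).
        fs.delete(new Path("/kartheek"), true);

        fs.close();
    }
}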
SAMPLE INPUT:
EXPECTED OUTPUT:
Result:
EXP NO: 3
MapReduce program to implement Matrix Multiplication
Date:
AIM:
Algorithm for Reduce Function:
a. For each key (i,k): sort the values that begin with M by j into listM, and the values that begin with N by j into listN; multiply m_ij and n_jk for the j-th entry of each list.
b. Sum up m_ij x n_jk and return (i,k), Σ_j m_ij x n_jk.
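A minimal Java sketch of this reduce step is given below. It assumes the map phase emits, for each output key (i,k), Text values of the form "M,j,m_ij" or "N,j,n_jk"; the value encoding and the class name are illustrative assumptions, not the exact classes of the program listing that follows.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reduce step for one-pass matrix multiplication: for a given output cell (i,k),
// join the M and N entries on j and accumulate the products m_ij * n_jk.
public class MatrixReduce extends Reducer<Text, Text, Text, DoubleWritable> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        Map<Integer, Double> listM = new HashMap<>();
        Map<Integer, Double> listN = new HashMap<>();
        for (Text value : values) {
            String[] parts = value.toString().split(",");   // e.g. "M,j,value"
            int j = Integer.parseInt(parts[1]);
            double v = Double.parseDouble(parts[2]);
            if (parts[0].equals("M")) {
                listM.put(j, v);
            } else {
                listN.put(j, v);
            }
        }
        // Sum over j of m_ij * n_jk.
        double sum = 0.0;
        for (Map.Entry<Integer, Double> entry : listM.entrySet()) {
            Double n = listN.get(entry.getKey());
            if (n != null) {
                sum += entry.getValue() * n;
            }
        }
        context.write(key, new DoubleWritable(sum));
    }
}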
Step 1. Download the Hadoop jar files with these links.
Download Hadoop Common jar file: https://goo.gl/G4MyHp
$ wget https://goo.gl/G4MyHp -O hadoop-common-2.2.0.jar
Download Hadoop MapReduce jar file: https://goo.gl/KT8yfB
$ wget https://goo.gl/KT8yfB -O hadoop-mapreduce-client-core-2.7.1.jar
// java.io imports added; they are required by the Pair key class below.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;
import org.apache.hadoop.util.ReflectionUtils;
// Pair of row/column indices used as an intermediate key
// (class header and fields reconstructed from the methods below).
class Pair implements WritableComparable<Pair> {
    int i;
    int j;

    Pair() {
        i = 0;
        j = 0;
    }

    Pair(int i, int j) {
        this.i = i;
        this.j = j;
    }
    @Override
    public void readFields(DataInput input) throws IOException {
        i = input.readInt();
        j = input.readInt();
    }

    @Override
    public void write(DataOutput output) throws IOException {
        output.writeInt(i);
        output.writeInt(j);
    }

    @Override
    public int compareTo(Pair compare) {
        if (i > compare.i) {
            return 1;
        } else if (i < compare.i) {
            return -1;
        } else {
            if (j > compare.j) {
                return 1;
            } else if (j < compare.j) {
                return -1;
            }
        }
        return 0;
    }

    public String toString() {
        return i + " " + j + " ";
    }
}
public class Multiply {
    public static class MatriceMapperM extends Mapper<Object, Text, IntWritable, Element> {
        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String readLine = value.toString();
            String[] stringTokens = readLine.split(",");
            // ... (the rest of map() and the Element value class are not included in this listing)
        }
    }
    // ... (the driver code configuring the first job is not included; the second job is configured below)
        job2.setMapperClass(MapMxN.class);
        job2.setReducerClass(ReduceMxN.class);
        job2.setMapOutputKeyClass(Pair.class);
        job2.setMapOutputValueClass(DoubleWritable.class);
        job2.setOutputKeyClass(Pair.class);
        job2.setOutputValueClass(DoubleWritable.class);
        job2.setInputFormatClass(TextInputFormat.class);
        job2.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.setInputPaths(job2, new Path(args[2]));
        FileOutputFormat.setOutputPath(job2, new Path(args[3]));
        job2.waitForCompletion(true);
    }
}
#!/bin/bash
module load hadoop/2.6.0
mkdir -p classes
javac -d classes -cp classes:`$HADOOP_HOME/bin/hadoop classpath` Multiply.java
jar cf multiply.jar -C classes .
echo "end"

export HADOOP_CONF_DIR=/home/$USER/cometcluster
module load hadoop/2.6.0
myhadoop-configure.sh
start-dfs.sh
start-yarn.sh

stop-yarn.sh
stop-dfs.sh
myhadoop-cleanup.sh
Expected Output:
module load hadoop/2.6.0
rm -rf output intermediate
UBUNTU
DESCRIPTION:
We can represent a matrix as a relation (table) in an RDBMS, where each cell in the matrix is stored as a record (i, j, value). As an example, consider the following matrix and its representation. It is important to understand that this relation is very inefficient if the matrix is dense. Say the matrix has 5 rows and 6 columns; then we need to store only 30 values, but in the relation above we store 30 row ids, 30 column ids and 30 values, in other words we triple the data. So a natural question arises: why store the matrix in this format? In practice, most matrices are sparse. In a sparse matrix most cells hold no value, so we do not have to store those cells in the database, which makes this format very efficient for storing such matrices.
MapReduce Logic
The idea is to send the computation of each output cell of the result matrix to its own reducer. In matrix multiplication, the first output cell (0,0) is the sum of products of the elements from row 0 of matrix A and the elements from column 0 of matrix B. To compute the value of output cell (0,0) in a separate reducer, we need to use (0,0) as the output key of the map phase, and the value should carry the values from row 0 of matrix A and column 0 of matrix B. So the output of the map phase is a <key, value> pair where the key identifies the output cell location ((0,0), (0,1), etc.) and the value is the list of all values the reducer needs for the computation. For example, to calculate the value at output cell (0,0), we collect the values from row 0 of matrix A and column 0 of matrix B in the map phase and pass (0,0) as the key, so that a single reducer can do the calculation.
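In symbols, the reducer for output key (i,j) computes the standard inner product c(i,j) = Σ_k a(i,k) x b(k,j); for the cell discussed here, c(0,0) = Σ_k a(0,k) x b(k,0).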
ALGORITHM
We assume that the input files for A and B are streams of (key,value) pairs in sparse
matrix format, where each key is a pair of indices (i,j) and each value is the corresponding
matrix element value. The output files for matrix C=A*B are in the same format.
In the pseudo-code for the individual strategies below, we have intentionally avoided
factoring common code for the purposes of clarity.
Note that in all the strategies the memory footprint of both the mappers and the reducers is
flat at scale.
Note that the strategies all work reasonably well with both dense and sparse matrices. For
sparse matrices we do not emit zero elements. That said, the simple pseudo-code for
multiplying the individual blocks shown here is certainly not optimal for sparse matrices. As
a learning exercise, our focus here is on mastering the MapReduce complexities, not on
optimizing the sequential matrix multiplication algorithm for the individual blocks.
Steps
1. setup ()
2. var NIB = (I-1)/IB+1
3. var NKB = (K-1)/KB+1
4. var NJB = (J-1)/JB+1
5. map (key, value)
6. if from matrix A with key=(i,k) and value=a(i,k)
7. for 0 <= jb < NJB
8. emit (i/IB, k/KB, jb, 0), (i mod IB, k mod KB, a(i,k))
9. if from matrix B with key=(k,j) and value=b(k,j)
10. for 0 <= ib < NIB
11. emit (ib, k/KB, j/JB, 1), (k mod KB, j mod JB, b(k,j))
Intermediate keys (ib, kb, jb, m) sort in increasing order first by ib, then by kb, then by
jb, then by m. Note that m = 0 for A data and m = 1 for B data.
The partitioner maps intermediate key (ib, kb, jb, m) to a reducer r as follows:
12. These definitions for the sorting order and partitioner guarantee that each reducer
R[ib,kb,jb] receives the data it needs for blocks A[ib,kb] and B[kb,jb], with the data
for the A block immediately preceding the data for the B block.
13. var A = new matrix of dimension IBxKB
14. var B = new matrix of dimension KBxJB
15. var sib = -1
16. var skb = -1
Reduce (key, valueList)
INPUT:-
Set of Data sets over different Clusters are taken as Rows and Columns
OUTPUT:-
Result:
EXP NO: 4 Run a basic Word Count MapReduce program to understand the MapReduce paradigm
Date:
AIM:-
DESCRIPTION:-
ALGORITHM
MAPREDUCE PROGRAM
WordCount is a simple program that counts the number of occurrences of each word in a given text input data set. WordCount fits very well with the MapReduce programming model, making it a great example for understanding the Hadoop MapReduce programming style. Our implementation consists of three main parts: a Mapper, a Reducer and a Driver.
Step-1. Write a Mapper
A Mapper reads the input a line at a time, splits the line into words, and emits a <word, 1> pair for each word x.
Pseudo-code
void Map (key, value){
    for each word x in value:
        output.collect(x, 1);
}
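A minimal Java Mapper corresponding to this pseudo-code, written against the org.apache.hadoop.mapreduce API; the class and variable names are illustrative.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits <word, 1> for every word in each input line.
public class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}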
Step-2. Write a Reducer
A Reducer collects the intermediate <key, value> output from multiple map tasks and assembles a single result. Here, the WordCount program will sum up the occurrences of each word into pairs of the form <word, occurrence>.
Pseudo-code
void Reduce (keyword, <list of value>){
    for each x in <list of value>:
        sum += x;
    final_output.collect(keyword, sum);
}
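The matching Java Reducer, again as a sketch with illustrative names.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the counts for each word and emits <word, occurrence>.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text keyword, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable x : values) {
            sum += x.get();
        }
        result.set(sum);
        context.write(keyword, result);
    }
}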
Step-3. Write Driver
The Driver program configures and runs the MapReduce job. We use the main program to perform basic configuration such as:
• Job Name: name of this job
• Executable (Jar) Class: the main executable class; here, WordCount.
• Mapper Class: the class which overrides the "map" function; here, Map.
• Reducer Class: the class which overrides the "reduce" function; here, Reduce.
• Output Key: the type of the output key; here, Text.
• Output Value: the type of the output value; here, IntWritable.
• File Input Path
• File Output Path
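A minimal Driver putting these configurations together. This is a sketch that assumes the Mapper and Reducer classes sketched above; the input and output paths are taken from the command-line arguments.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");          // Job Name
        job.setJarByClass(WordCount.class);                     // Executable (Jar) Class
        job.setMapperClass(WordCountMapper.class);              // Mapper Class
        job.setCombinerClass(WordCountReducer.class);           // optional local aggregation
        job.setReducerClass(WordCountReducer.class);            // Reducer Class
        job.setOutputKeyClass(Text.class);                      // Output Key type
        job.setOutputValueClass(IntWritable.class);             // Output Value type
        FileInputFormat.addInputPath(job, new Path(args[0]));   // File Input Path
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // File Output Path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}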
INPUT:-
OUTPUT:-
Result:
EXP NO: 5
Installation of Hive along with practice examples.
Date:
AIM:-
DESCRIPTION
ALGORITHM:
1) Install MySQL server
sudo apt-get install mysql-server
2) Configure the MySQL username and password
3) Create a user and grant all privileges
mysql -uroot -proot
4) Extract and Configure Apache Hive
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hadoop</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hadoop</value>
</property>
DATABASE Creation
Drop Database Statement
DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];
Syntax:
LOAD DATA LOCAL INPATH '<path>/u.data' OVERWRITE INTO TABLE u_data;
Alter Table in HIVE
Syntax
Dropping View
Syntax:
Functions in HIVE
INDEXES
SET hive.index.compact.file=/home/administrator/Desktop/big/metastore_db/tmp/index_ipaddress_result;
SET hive.input.format=org.apache.hadoop.hive.ql.index.compact.HiveCompactIndexInputFormat;
Dropping Index
INPUT
EXP NO: 6 Installation of HBase along with practice examples
Date:
AIM:-
PROCEDURE:
Step 3) Click on hbase-1.1.2-bin.tar.gz. It will download the tar file. Copy the tar file into an installation location.
export HBASE_HOME=/home/hdus
export PATH= $PATH:$HBASE_HOME/bin
Step 5) Add properties in the file
Open hbase-site.xml and place the following properties inside the file
<property>
<name>hbase.rootdir</name>
<value>file:///home/hduser/HBASE/hbase</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/home/hduser/HBASE/zookeeper</value>
</property>
Step 2) Unzip it by executing the command $tar -xvf hbase-1.1.2-bin.tar.gz. It will unzip the contents and create hbase-1.1.2 in the location /home/hduser.
Step 3) Open hbase-env.sh and mention the JAVA_HOME path and the region servers' path in the file, and export them.
Step 4) In this step, open the ~/.bashrc file and add the HBASE_HOME path.
Step 5) Open hbase-site.xml and mention the below properties in the file (code as below).
<property>
<name>hbase.rootdir</name>
<value>hdfs://localhost:9000/hbase</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>localhost</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>hbase.zookeeper.property.clientPort</name>
<value>2181</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/home/hduser/hbase/zookeeper</value>
</property>
1. Setting up Hbase root directory in this property
2. For distributed set up we have to set this property
3. ZooKeeper quorum property should be set up here
4. Replication is set up in this property. By default we set replication to 1. In fully distributed mode multiple data nodes are present, so we can increase replication by placing a value greater than 1 in the dfs.replication property
5. Client port should be mentioned in this property
6. ZooKeeper data directory can be mentioned in this property
Step 6) Start the Hadoop daemons first and after that start the HBase daemons.
First start the Hadoop daemons by using the "./start-all.sh" command.
• This set up will work in Hadoop cluster mode, where multiple nodes are spawned across the cluster and are running.
• The installation is the same as in pseudo-distributed mode; the only difference is that it will spawn across multiple nodes.
• The configurations mentioned in hbase-site.xml and hbase-env.sh are the same as in pseudo-distributed mode.
General commands
In HBase, general commands are categorized into the following commands
• Status
• Version
• Table_help ( scan, drop, get, put, disable, etc.)
• Whoami
To enter the HBase shell, first of all we have to execute the command mentioned below
hbase shell
Once we enter the HBase shell, we can execute all the shell commands mentioned below. With the help of these commands, we can perform all types of table operations in the HBase shell mode.
Let us look into all of these commands and their usage one by one with an example.
Status
Syntax:status
This command will give details about the system status, like the number of servers present in the cluster, the active server count, and the average load value. You can also pass particular parameters depending on how detailed a status you want to know about the system. The parameters can be 'summary', 'simple', or 'detailed'; the default parameter provided is 'summary'.
Below we have shown how you can pass different parameters to the status command.
hbase(main):001:0>status
hbase(main):002:0>status 'simple'
hbase(main):003:0>status 'summary'
hbase(main):004:0> status 'detailed'
When we execute the status command, it gives information about the number of servers present, dead servers and the average load of the servers; for example, it may show 1 live server, 1 dead server, and a 7.0000 average load.
Version
Syntax: version
• This command will display the currently used HBase version in command mode
• If you run the version command, it will print the HBase version in use
Table help
Syntax:table_help
whoami
Syntax: whoami
This command “whoami” is used to return the current HBase user information from the HBase
cluster.
The TTL (time to live) encoded in HBase for the row is specified in UTC. This attribute is used with table management commands.
Important differences between TTL handling and Column family TTLs are below
• Create
• List
• Describe
• Disable
• Disable_all
• Enable
• Enable_all
• Drop
• Drop_all
• Show_filters
• Alter
• Alter_status
Example:-
In order to check whether the table ‘education’ is created or not, we have to use the “list” command
as mentioned below.
List
Syntax:list
• The "list" command will display all the tables that are present or created in HBase
• The output lists the existing tables in HBase; in this example there are a total of 8 tables present
• We can filter the output values from tables by passing optional regular expression parameters
Describe
Syntax:describe <table name>
hbase(main):010:0>describe 'education'
This command describes the named table.
• It will give more information about column families present in the mentioned table
• In our case, it gives the description about table “education.”
• It will give information about table name with column families, associated filters, versions
and some more details.
disable
Syntax: disable <tablename>
hbase(main):011:0>disable 'education'
disable_all
Syntax: disable_all <"matching regex">
• This command will disable all the tables matching the given regex.
• The implementation is the same as for the delete command (except for adding the regex for matching)
• Once a table is disabled, the user can delete the table from HBase
• Before deleting or dropping a table, it must be disabled first
Enable
Syntax: enable <tablename>
hbase(main):012:0>enable 'education'
show_filters
This command displays all the filters present in HBase, like ColumnPrefixFilter, TimestampsFilter, PageFilter, FamilyFilter, etc.
drop
Syntax:drop <table name>
hbase(main):017:0>drop 'education'
We have to observe the points below for the drop command
drop_all
Syntax: drop_all<"regex">
• This command will drop all the tables matching the given regex
• Tables have to be disabled first, using disable_all, before executing this command
• Tables matching the regex expression will be dropped from HBase
is_enabled
Syntax: is_enabled 'education'
This command will verify whether the named table is enabled or not. Usually there is a little confusion between the "enable" and "is_enabled" command actions, which we clarify here
• Suppose a table is disabled; to use that table we have to enable it by using the enable command
• The is_enabled command will check whether the table is enabled or not
alter
Syntax: alter <tablename>, NAME=><column familyname>, VERSIONS=>5
This command alters the column family schema. To understand what exactly it does, we have
explained it here with an example.
Examples:
In these examples, we are going to perform alter command operations on tables and on its columns.
We will perform operations like
1. To change or add the 'guru99_1' column family in table 'education', keeping a maximum of 5 cell VERSIONS,
2. You can also operate the alter command on several column families as well. For example, we will define two new column families for our existing table "education".
hbase> alter 'edu', 'guru99_1', {NAME => 'guru99_2', IN_MEMORY => true}, {NAME => 'guru99_3',
VERSIONS => 5}
• We can change more than one column family schema at a time using this command
• guru99_2 and guru99_3, as in the command above, are the two new column family names that we have defined for the table education
3. In this step, we will see how to delete a column family from the table, for example the 'f1' column family in table 'education'.
• In this command, we are trying to delete the column family name guru99_1 that we previously created in the first step
4. The next two steps show how to change a table scope attribute and how to remove the table scope attribute.
Usage:
NOTE: Table scope is determined by some attributes present in HBase; the MAX_FILESIZE attribute also comes under the table scope attributes.
Step 2) You can also remove a table scope attribute using the table_att_unset method.
• The method table_att_unset is used to unset attributes present in the table
• In the second instance we are unsetting the attribute MAX_FILESIZE
• After execution of the command, it will simply unset the MAX_FILESIZE attribute from the "education" table.
alter_status
Syntax: alter_status 'education'
• Through this command, you can get the status of the alter command, which indicates the number of regions of the table that have received the updated schema; pass the table name as the argument
• For example, the output may show 1/1 regions updated, which means that one region has been updated; if it is successful it will then display "done"
• Count
• Put
• Get
• Delete
• Delete all
• Truncate
• Scan
Count
Syntax: count <'tablename'>, CACHE =>1000
• The command will retrieve the count of the number of rows in a table. The value returned is the number of rows.
• The current count is shown per every 1000 rows by default.
• The count interval may be optionally specified.
• The default cache size is 10 rows.
• The count command will work faster when it is configured with the right cache size.
Example:
We can set the cache to a lower value if the table consists of more rows.
We can also run the count command on a table reference, like below
hbase>g.count INTERVAL=>100000
hbase>g.count INTERVAL=>10, CACHE=>1000
Put
Syntax: put <'tablename'>,<'rowname'>,<'columnvalue'>,<'value'>
This command is used for the following things
• It will put a cell 'value' at the specified table/row/column.
• It will optionally take a timestamp coordinate.
Example:
• Here we are placing values into table "guru99" under row r1 and column c1
• We have placed three values, 10, 15 and 30, in table "guru99"
• Suppose the table "guru99" has a table reference, say g; we can also run the command on the table reference
• After placing the values into "guru99", to check whether the input values were inserted correctly into the table, we use the "scan" command
Get
Syntax: get <'tablename'>, <'rowname'>, {< Additional parameters>}
Here <Additional Parameters> include TIMERANGE, TIMESTAMP, VERSIONS and FILTERS.
By using this command, you will get a row or cell contents present in the table. In addition to that
you can also add additional parameters to it like TIMESTAMP, TIMERANGE,VERSIONS, FILTERS,
etc. to get a particular row or cell content.
Examples:-
Delete
Syntax:delete <'tablename'>,<'row name'>,<'column name'>
• This command will delete the cell value at the defined table/row/column.
• The delete must match the deleted cell's coordinates exactly.
• When scanning, a delete cell suppresses older versions of values.
Example:
• The above execution will delete row r1 from column family c1 in table "guru99".
• Suppose the table "guru99" has a table reference, say g.
• We can run the command on the table reference as well: hbase> g.delete 'guru99', 'r1', 'c1'
deleteall
Syntax: deleteall <'tablename'>, <'rowname'>
Example:-
Truncate
Syntax: truncate <tablename>
After truncating an hbase table, the schema will still be present but not the records. This command performs three functions: it disables, drops and recreates the table.
Scan
• We can pass several optional specifications to the scan command to get more information about the tables present in the system.
• Scanner specifications may include one or more of the following attributes: TIMERANGE, FILTER, TIMESTAMP, LIMIT, MAXLENGTH, COLUMNS, CACHE, STARTROW and STOPROW.
scan 'guru99'
Examples:-
Command and usage:
• scan '.META.', {COLUMNS => 'info:regioninfo'} : displays all the metadata information related to columns that are present in the tables in HBase
• scan 'guru99', {COLUMNS => ['c1', 'c2'], LIMIT => 10, STARTROW => 'xyz'} : displays the contents of table guru99 with column families c1 and c2, limiting the values to 10
• scan 'guru99', {COLUMNS => 'c1', TIMERANGE => [1303668804, 1303668904]} : displays the contents of guru99 with its column c1 for values present in the mentioned time range
• scan 'guru99', {RAW => true, VERSIONS => 10} : here RAW => true provides an advanced feature to display all the cell values present in the table guru99
Code Example:
First create a table and place values into the table.
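The original manual shows this example as a screenshot of HBase shell commands, which is not reproduced here. As a substitute, below is a minimal sketch of the same create/put/get/scan flow using the HBase 1.1.2 Java client API; the class name, column family and values are illustrative assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class Guru99Example {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {

            // Create table 'guru99' with one column family 'edu' (family name is illustrative).
            TableName name = TableName.valueOf("guru99");
            if (!admin.tableExists(name)) {
                HTableDescriptor desc = new HTableDescriptor(name);
                desc.addFamily(new HColumnDescriptor("edu"));
                admin.createTable(desc);
            }

            try (Table table = connection.getTable(name)) {
                // Put: place a value into row 'r1', column 'edu:c1'.
                Put put = new Put(Bytes.toBytes("r1"));
                put.addColumn(Bytes.toBytes("edu"), Bytes.toBytes("c1"), Bytes.toBytes("10"));
                table.put(put);

                // Get: read the cell back.
                Result result = table.get(new Get(Bytes.toBytes("r1")));
                System.out.println(Bytes.toString(
                        result.getValue(Bytes.toBytes("edu"), Bytes.toBytes("c1"))));

                // Scan: list all rows currently in the table.
                try (ResultScanner scanner = table.getScanner(new Scan())) {
                    for (Result row : scanner) {
                        System.out.println(row);
                    }
                }
            }
        }
    }
}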
The following replication-related commands are also available.
• add_peer : Adds peers to the cluster to replicate to. Example: hbase> add_peer '3', zk1,zk2,zk3:2182:/hbase-prod
• remove_peer : Stops the defined replication stream and deletes all the metadata information about the peer. Example: hbase> remove_peer '1'
• start_replication : Restarts all the replication features. Example: hbase> start_replication
• stop_replication : Stops all the replication features. Example: hbase> stop_replication
Result:
ADDITIONAL EXPERIMENTS
EXP NO: 1
PIG LATIN MODES, PROGRAMS
Date:
OBJECTIVE:
PROGRAM LOGIC:
Run the Pig Latin scripts to find the maximum temperature for each year
OUTPUT:
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
(1949,78,1)
RESULT: Thus, the Pig Latin scripts to find the word count and the maximum temperature for each year are successfully implemented.