BDA Lab Manual - Organized
Name :
Semester :
Branch :
Register Number :
Certified that this is the bonafide record of work done by the above student in
the
Laboratory during the year 20 - 20 .
AIM:
Procedure:
We'll install Hadoop in stand-alone mode and run one of the example MapReduce programs it includes to verify the installation.
Prerequisites:
Procedure to Run Hadoop
1. If Apache Hadoop 2.2.0 is not already installed, follow the post "Build, Install, Configure and Run Apache Hadoop 2.2.0 in Microsoft Windows OS".
2. Start HDFS (Namenode and Datanode) and YARN (Resource Manager and Node
Manager)
Namenode, Datanode, Resource Manager and Node Manager will be started in a few minutes and will be ready to execute a Hadoop MapReduce job on the single-node (pseudo-distributed mode) cluster.
3. Run the wordcount MapReduce job
Create a text file with some content. We'll pass this file as input to the wordcount MapReduce job for counting words.
C:\file1.txt
Create a directory (say 'input') in HDFS to keep all the text files (say 'file1.txt') to be used for counting words.
C:\Users\abhijitg>cd c:\hadoop
C:\hadoop>bin\hdfs dfs -mkdir input
Copy the text file (say 'file1.txt') from the local disk to the newly created 'input' directory in HDFS.
Result:
EXP NO: 2 Hadoop implementation of file management tasks, such as adding files and directories, retrieving files and deleting files
Date:
AIM:-
DESCRIPTION:-
ALGORITHM:-
SYNTAX AND COMMANDS TO ADD, RETRIEVE AND DELETE DATA FROM HDFS
Step-1
• Copying from a local directory is done with the command "hdfs dfs -copyFromLocal /home/lendi/Desktop/shakes/glossary /lendicse/"
• View the file by using the command "hdfs dfs -cat /lendi_english/glossary"
• The command for listing items in Hadoop is "hdfs dfs -ls hdfs://localhost:9000/"
• The command for deleting files is "hdfs dfs -rm -r /kartheek"
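The same add, retrieve and delete operations can also be performed programmatically with the Hadoop Java FileSystem API. The sketch below is only an illustration; the class name is made up, and the paths and the fs.defaultFS address are assumptions carried over from the commands above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsFileOps {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumes the NameNode runs at localhost:9000, as in the -ls command above.
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);

        // Add: copy a local file into HDFS (paths follow the example commands).
        fs.copyFromLocalFile(new Path("/home/lendi/Desktop/shakes/glossary"),
                             new Path("/lendicse/"));

        // Retrieve: list the contents of the HDFS root directory.
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath());
        }

        // Delete: remove a directory recursively (equivalent to hdfs dfs -rm -r).
        fs.delete(new Path("/kartheek"), true);

        fs.close();
    }
}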
SAMPLE INPUT:
EXPECTED OUTPUT:
Result:
EXP NO: 3
MapReduce program to implement Matrix Multiplication
Date:
AIM:
Algorithm for Reduce Function:
a. For each key (i,k): sort the values that begin with M by j into listM, and the values that begin with N by j into listN; multiply m_ij and n_jk for the j-th entry of each list.
b. Sum up m_ij x n_jk and return (i,k), Σ_j m_ij x n_jk.
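A minimal Java sketch of this reduce step is given below. It assumes the map phase emits, for each output key (i,k), Text values of the form "M,j,m_ij" or "N,j,n_jk"; the value encoding and the class name are illustrative assumptions, not the exact classes of the program listing that follows.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reduce step for one-pass matrix multiplication: for a given output cell (i,k),
// join the M and N entries on j and accumulate the products m_ij * n_jk.
public class MatrixReduce extends Reducer<Text, Text, Text, DoubleWritable> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        Map<Integer, Double> listM = new HashMap<>();
        Map<Integer, Double> listN = new HashMap<>();
        for (Text value : values) {
            String[] parts = value.toString().split(",");   // e.g. "M,j,value"
            int j = Integer.parseInt(parts[1]);
            double v = Double.parseDouble(parts[2]);
            if (parts[0].equals("M")) {
                listM.put(j, v);
            } else {
                listN.put(j, v);
            }
        }
        // Sum over j of m_ij * n_jk.
        double sum = 0.0;
        for (Map.Entry<Integer, Double> entry : listM.entrySet()) {
            Double n = listN.get(entry.getKey());
            if (n != null) {
                sum += entry.getValue() * n;
            }
        }
        context.write(key, new DoubleWritable(sum));
    }
}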
Step 1. Download the Hadoop jar files with these links.
Download Hadoop Common jar file: https://goo.gl/G4MyHp
$ wget https://goo.gl/G4MyHp -O hadoop-common-2.2.0.jar
Download Hadoop MapReduce jar file: https://goo.gl/KT8yfB
$ wget https://goo.gl/KT8yfB -O hadoop-mapreduce-client-core-2.7.1.jar
// java.io imports added; they are required by the Pair key class below.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;
import org.apache.hadoop.util.ReflectionUtils;
// Pair of row/column indices used as an intermediate key
// (class header and fields reconstructed from the methods below).
class Pair implements WritableComparable<Pair> {
    int i;
    int j;

    Pair() {
        i = 0;
        j = 0;
    }

    Pair(int i, int j) {
        this.i = i;
        this.j = j;
    }
    @Override
    public void readFields(DataInput input) throws IOException {
        i = input.readInt();
        j = input.readInt();
    }

    @Override
    public void write(DataOutput output) throws IOException {
        output.writeInt(i);
        output.writeInt(j);
    }

    @Override
    public int compareTo(Pair compare) {
        if (i > compare.i) {
            return 1;
        } else if (i < compare.i) {
            return -1;
        } else {
            if (j > compare.j) {
                return 1;
            } else if (j < compare.j) {
                return -1;
            }
        }
        return 0;
    }

    public String toString() {
        return i + " " + j + " ";
    }
}
public class Multiply {
    public static class MatriceMapperM extends Mapper<Object, Text, IntWritable, Element> {
        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String readLine = value.toString();
            String[] stringTokens = readLine.split(",");
            // ... (the rest of map() and the Element value class are not included in this listing)
        }
    }
    // ... (the driver code configuring the first job is not included; the second job is configured below)
        job2.setMapperClass(MapMxN.class);
        job2.setReducerClass(ReduceMxN.class);
        job2.setMapOutputKeyClass(Pair.class);
        job2.setMapOutputValueClass(DoubleWritable.class);
        job2.setOutputKeyClass(Pair.class);
        job2.setOutputValueClass(DoubleWritable.class);
        job2.setInputFormatClass(TextInputFormat.class);
        job2.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.setInputPaths(job2, new Path(args[2]));
        FileOutputFormat.setOutputPath(job2, new Path(args[3]));
        job2.waitForCompletion(true);
    }
}
#!/bin/bash
module load hadoop/2.6.0
mkdir -p classes
javac -d classes -cp classes:`$HADOOP_HOME/bin/hadoop classpath` Multiply.java
jar cf multiply.jar -C classes .
echo "end"

export HADOOP_CONF_DIR=/home/$USER/cometcluster
module load hadoop/2.6.0
myhadoop-configure.sh
start-dfs.sh
start-yarn.sh

stop-yarn.sh
stop-dfs.sh
myhadoop-cleanup.sh
Expected Output:
module load hadoop/2.6.0
rm -rf output intermediate
UBUNTU
DESCRIPTION:
We can represent a matrix as a relation (table) in an RDBMS, where each cell in the matrix is stored as a record (i, j, value). As an example, consider the following matrix and its representation. It is important to understand that this relation is very inefficient if the matrix is dense. Say the matrix has 5 rows and 6 columns; then we need to store only 30 values, but in the relation above we store 30 row ids, 30 column ids and 30 values, in other words we triple the data. So a natural question arises: why store the matrix in this format? In practice, most matrices are sparse. In a sparse matrix most cells hold no value, so we do not have to store those cells in the database, which makes this format very efficient for storing such matrices.
MapReduce Logic
The idea is to send the computation of each output cell of the result matrix to its own reducer. In matrix multiplication, the first output cell (0,0) is the sum of products of the elements from row 0 of matrix A and the elements from column 0 of matrix B. To compute the value of output cell (0,0) in a separate reducer, we need to use (0,0) as the output key of the map phase, and the value should carry the values from row 0 of matrix A and column 0 of matrix B. So the output of the map phase is a <key, value> pair where the key identifies the output cell location ((0,0), (0,1), etc.) and the value is the list of all values the reducer needs for the computation. For example, to calculate the value at output cell (0,0), we collect the values from row 0 of matrix A and column 0 of matrix B in the map phase and pass (0,0) as the key, so that a single reducer can do the calculation.
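In symbols, the reducer for output key (i,j) computes the standard inner product c(i,j) = Σ_k a(i,k) x b(k,j); for the cell discussed here, c(0,0) = Σ_k a(0,k) x b(k,0).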
ALGORITHM
We assume that the input files for A and B are streams of (key,value) pairs in sparse
matrix format, where each key is a pair of indices (i,j) and each value is the corresponding
matrix element value. The output files for matrix C=A*B are in the same format.
In the pseudo-code for the individual strategies below, we have intentionally avoided
factoring common code for the purposes of clarity.
Note that in all the strategies the memory footprint of both the mappers and the reducers is
flat at scale.
Note that the strategies all work reasonably well with both dense and sparse matrices. For
sparse matrices we do not emit zero elements. That said, the simple pseudo-code for
multiplying the individual blocks shown here is certainly not optimal for sparse matrices. As
a learning exercise, our focus here is on mastering the MapReduce complexities, not on
optimizing the sequential matrix multiplication algorithm for the individual blocks.
Steps
1. setup ()
2. var NIB = (I-1)/IB+1
3. var NKB = (K-1)/KB+1
4. var NJB = (J-1)/JB+1
5. map (key, value)
6. if from matrix A with key=(i,k) and value=a(i,k)
7. for 0 <= jb < NJB
8. emit (i/IB, k/KB, jb, 0), (i mod IB, k mod KB, a(i,k))
9. if from matrix B with key=(k,j) and value=b(k,j)
10. for 0 <= ib < NIB
11. emit (ib, k/KB, j/JB, 1), (k mod KB, j mod JB, b(k,j))
Intermediate keys (ib, kb, jb, m) sort in increasing order first by ib, then by kb, then by
jb, then by m. Note that m = 0 for A data and m = 1 for B data.
The partitioner maps intermediate key (ib, kb, jb, m) to a reducer r as follows:
12. These definitions for the sorting order and partitioner guarantee that each reducer
R[ib,kb,jb] receives the data it needs for blocks A[ib,kb] and B[kb,jb], with the data
for the A block immediately preceding the data for the B block.
13. var A = new matrix of dimension IBxKB
14. var B = new matrix of dimension KBxJB
15. var sib = -1
16. var skb = -1
Reduce (key, valueList)
INPUT:-
Set of Data sets over different Clusters are taken as Rows and Columns
OUTPUT:-
Result:
EXP NO: 4 Run a basic Word Count MapReduce program to understand the MapReduce paradigm
Date:
AIM:-
DESCRIPTION:-
ALGORITHM
MAPREDUCE PROGRAM
WordCount is a simple program that counts the number of occurrences of each word in a given text input data set. WordCount fits very well with the MapReduce programming model, making it a great example for understanding the Hadoop MapReduce programming style. Our implementation consists of three main parts: a Mapper, a Reducer and a Driver.
Step-1. Write a Mapper
A Mapper reads the input a line at a time, splits the line into words, and emits a <word, 1> pair for each word x.
Pseudo-code
void Map (key, value){
    for each word x in value:
        output.collect(x, 1);
}
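A minimal Java Mapper corresponding to this pseudo-code, written against the org.apache.hadoop.mapreduce API; the class and variable names are illustrative.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits <word, 1> for every word in each input line.
public class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}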
Step-2. Write a Reducer
A Reducer collects the intermediate <key, value> output from multiple map tasks and assembles a single result. Here, the WordCount program will sum up the occurrences of each word into pairs of the form <word, occurrence>.
Pseudo-code
void Reduce (keyword, <list of value>){
    for each x in <list of value>:
        sum += x;
    final_output.collect(keyword, sum);
}
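The matching Java Reducer, again as a sketch with illustrative names.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the counts for each word and emits <word, occurrence>.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text keyword, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable x : values) {
            sum += x.get();
        }
        result.set(sum);
        context.write(keyword, result);
    }
}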
Step-3. Write Driver
The Driver program configures and runs the MapReduce job. We use the main program to perform basic configuration such as:
• Job Name: name of this job
• Executable (Jar) Class: the main executable class; here, WordCount.
• Mapper Class: the class which overrides the "map" function; here, Map.
• Reducer Class: the class which overrides the "reduce" function; here, Reduce.
• Output Key: the type of the output key; here, Text.
• Output Value: the type of the output value; here, IntWritable.
• File Input Path
• File Output Path
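A minimal Driver putting these configurations together. This is a sketch that assumes the Mapper and Reducer classes sketched above; the input and output paths are taken from the command-line arguments.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");          // Job Name
        job.setJarByClass(WordCount.class);                     // Executable (Jar) Class
        job.setMapperClass(WordCountMapper.class);              // Mapper Class
        job.setCombinerClass(WordCountReducer.class);           // optional local aggregation
        job.setReducerClass(WordCountReducer.class);            // Reducer Class
        job.setOutputKeyClass(Text.class);                      // Output Key type
        job.setOutputValueClass(IntWritable.class);             // Output Value type
        FileInputFormat.addInputPath(job, new Path(args[0]));   // File Input Path
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // File Output Path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}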
INPUT:-
OUTPUT:-
Result:
EXP NO: 5
Installation of Hive along with practice examples.
Date:
AIM:-
DESCRIPTION
ALGORITHM:
1) Install MySQL server
sudo apt-get install mysql-server
2) Configure the MySQL username and password
3) Create a user and grant all privileges
mysql -uroot -proot
4) Extract and Configure Apache Hive
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hadoop</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hadoop</value>
</property>
DATABASE Creation
Drop Database Statement
DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];
Syntax:
LOAD DATA LOCAL INPATH '<path>/u.data' OVERWRITE INTO TABLE u_data;
Alter Table in HIVE
Syntax
Dropping View
Syntax:
Functions in HIVE
INDEXES
SET hive.index.compact.file=/home/administrator/Desktop/big/metastore_db/tmp/index_ipaddress_result;
SET hive.input.format=org.apache.hadoop.hive.ql.index.compact.HiveCompactIndexInputFormat;
Dropping Index
INPUT
EXP NO: 6 Installation of HBase along with practice examples
Date:
AIM:-
PROCEDURE:
Step 3) Click on hbase-1.1.2-bin.tar.gz. It will download the tar file. Copy the tar file into an installation location.
export HBASE_HOME=/home/hdus
export PATH= $PATH:$HBASE_HOME/bin
Step 5) Add properties in the file
Open hbase-site.xml and place the following properties inside the file
<property>
<name>hbase.rootdir</name>
<value>file:///home/hduser/HBASE/hbase</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/home/hduser/HBASE/zookeeper</value>
</property>
Step 2) Unzip it by executing the command $tar -xvf hbase-1.1.2-bin.tar.gz. It will unzip the contents and create hbase-1.1.2 in the location /home/hduser.
Step 3) Open hbase-env.sh and mention the JAVA_HOME path and the region servers' path in the file, and export them.
Step 4) In this step, open the ~/.bashrc file and add the HBASE_HOME path.
Step 5) Open hbase-site.xml and mention the below properties in the file (code as below).
<property>
<name>hbase.rootdir</name>
<value>hdfs://localhost:9000/hbase</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>localhost</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>hbase.zookeeper.property.clientPort</name>
<value>2181</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/home/hduser/hbase/zookeeper</value>
</property>
1. Setting up Hbase root directory in this property
2. For distributed set up we have to set this property
3. ZooKeeper quorum property should be set up here
4. Replication is set up in this property. By default we set replication to 1. In fully distributed mode multiple data nodes are present, so we can increase replication by placing a value greater than 1 in the dfs.replication property
5. Client port should be mentioned in this property
6. ZooKeeper data directory can be mentioned in this property
Step 6) Start the Hadoop daemons first and after that start the HBase daemons.
First start the Hadoop daemons by using the "./start-all.sh" command.
• This set up will work in Hadoop cluster mode, where multiple nodes are spawned across the cluster and are running.
• The installation is the same as in pseudo-distributed mode; the only difference is that it will spawn across multiple nodes.
• The configurations mentioned in hbase-site.xml and hbase-env.sh are the same as in pseudo-distributed mode.
General commands
In HBase, general commands are categorized into the following commands
• Status
• Version
• Table_help ( scan, drop, get, put, disable, etc.)
• Whoami
To enter the HBase shell, first of all we have to execute the command mentioned below
hbase shell
Once we enter the HBase shell, we can execute all the shell commands mentioned below. With the help of these commands, we can perform all types of table operations in the HBase shell mode.
Let us look into all of these commands and their usage one by one with an example.
Status
Syntax:status
This command will give details about the system status, like the number of servers present in the cluster, the active server count, and the average load value. You can also pass particular parameters depending on how detailed a status you want to know about the system. The parameters can be 'summary', 'simple', or 'detailed'; the default parameter provided is 'summary'.
Below we have shown how you can pass different parameters to the status command.
hbase(main):001:0>status
hbase(main):002:0>status 'simple'
hbase(main):003:0>status 'summary'
hbase(main):004:0> status 'detailed'
When we execute the status command, it gives information about the number of servers present, dead servers and the average load of the servers; for example, it may show 1 live server, 1 dead server, and a 7.0000 average load.
Version
Syntax: version
• This command will display the currently used HBase version in command mode
• If you run the version command, it will print the HBase version in use
Table help
Syntax:table_help
whoami
Syntax: whoami
This command “whoami” is used to return the current HBase user information from the HBase
cluster.
The TTL (time to live) encoded in HBase for the row is specified in UTC. This attribute is used with table management commands.
Important differences between TTL handling and Column family TTLs are below
• Create
• List
• Describe
• Disable
• Disable_all
• Enable
• Enable_all
• Drop
• Drop_all
• Show_filters
• Alter
• Alter_status
Example:-
In order to check whether the table ‘education’ is created or not, we have to use the “list” command
as mentioned below.
List
Syntax:list
• The "list" command will display all the tables that are present or created in HBase
• The output lists the existing tables in HBase; in this example there are a total of 8 tables present
• We can filter the output values from tables by passing optional regular expression parameters
Describe
Syntax:describe <table name>
hbase(main):010:0>describe 'education'
This command describes the named table.
• It will give more information about column families present in the mentioned table
• In our case, it gives the description about table “education.”
• It will give information about table name with column families, associated filters, versions
and some more details.
disable
Syntax: disable <tablename>
hbase(main):011:0>disable 'education'
disable_all
Syntax: disable_all <"matching regex">
• This command will disable all the tables matching the given regex.
• The implementation is the same as for the delete command (except for adding the regex for matching)
• Once a table is disabled, the user can delete the table from HBase
• Before deleting or dropping a table, it must be disabled first
Enable
Syntax: enable <tablename>
hbase(main):012:0>enable 'education'
show_filters
This command displays all the filters present in HBase, like ColumnPrefixFilter, TimestampsFilter, PageFilter, FamilyFilter, etc.
drop
Syntax:drop <table name>
hbase(main):017:0>drop 'education'
We have to observe the points below for the drop command
drop_all
Syntax: drop_all<"regex">
• This command will drop all the tables matching the given regex
• Tables have to be disabled first, using disable_all, before executing this command
• Tables matching the regex expression will be dropped from HBase
is_enabled
Syntax: is_enabled 'education'
This command will verify whether the named table is enabled or not. Usually there is a little confusion between the "enable" and "is_enabled" command actions, which we clarify here
• Suppose a table is disabled; to use that table we have to enable it by using the enable command
• The is_enabled command will check whether the table is enabled or not
alter
Syntax: alter <tablename>, NAME=><column familyname>, VERSIONS=>5
This command alters the column family schema. To understand what exactly it does, we have
explained it here with an example.
Examples:
In these examples, we are going to perform alter command operations on tables and on its columns.
We will perform operations like
1. To change or add the 'guru99_1' column family in table 'education', keeping a maximum of 5 cell VERSIONS,
2. You can also operate the alter command on several column families as well. For example, we will define two new column families for our existing table "education".
hbase> alter 'edu', 'guru99_1', {NAME => 'guru99_2', IN_MEMORY => true}, {NAME => 'guru99_3',
VERSIONS => 5}
• We can change more than one column family schema at a time using this command
• guru99_2 and guru99_3, as in the command above, are the two new column family names that we have defined for the table education
3. In this step, we will see how to delete a column family from the table, for example the 'f1' column family in table 'education'.
• In this command, we are trying to delete the column family name guru99_1 that we previously created in the first step
4. The next two steps show how to change a table scope attribute and how to remove the table scope attribute.
Usage:
NOTE: Table scope is determined by some attributes present in HBase; the MAX_FILESIZE attribute also comes under the table scope attributes.
Step 2) You can also remove a table scope attribute using the table_att_unset method.
• The method table_att_unset is used to unset attributes present in the table
• In the second instance we are unsetting the attribute MAX_FILESIZE
• After execution of the command, it will simply unset the MAX_FILESIZE attribute from the "education" table.
alter_status
Syntax: alter_status 'education'
• Through this command, you can get the status of the alter command, which indicates the number of regions of the table that have received the updated schema; pass the table name as the argument
• For example, the output may show 1/1 regions updated, which means that one region has been updated; if it is successful it will then display "done"
• Count
• Put
• Get
• Delete
• Delete all
• Truncate
• Scan
Count
Syntax: count <'tablename'>, CACHE =>1000
• The command will retrieve the count of the number of rows in a table. The value returned is the number of rows.
• The current count is shown per every 1000 rows by default.
• The count interval may be optionally specified.
• The default cache size is 10 rows.
• The count command will work faster when it is configured with the right cache size.
Example:
We can set the cache to a lower value if the table consists of more rows.
We can also run the count command on a table reference, like below
hbase>g.count INTERVAL=>100000
hbase>g.count INTERVAL=>10, CACHE=>1000
Put
Syntax: put <'tablename'>,<'rowname'>,<'columnvalue'>,<'value'>
This command is used for the following things
• It will put a cell 'value' at the specified table/row/column.
• It will optionally take a timestamp coordinate.
Example:
• Here we are placing values into table "guru99" under row r1 and column c1
• We have placed three values, 10, 15 and 30, in table "guru99"
• Suppose the table "guru99" has a table reference, say g; we can also run the command on the table reference
• After placing the values into "guru99", to check whether the input values were inserted correctly into the table, we use the "scan" command
Get
Syntax: get <'tablename'>, <'rowname'>, {< Additional parameters>}
Here <Additional Parameters> include TIMERANGE, TIMESTAMP, VERSIONS and FILTERS.
By using this command, you will get a row or cell contents present in the table. In addition to that
you can also add additional parameters to it like TIMESTAMP, TIMERANGE,VERSIONS, FILTERS,
etc. to get a particular row or cell content.
Examples:-
Delete
Syntax:delete <'tablename'>,<'row name'>,<'column name'>
• This command will delete the cell value at the defined table/row/column.
• The delete must match the deleted cell's coordinates exactly.
• When scanning, a delete cell suppresses older versions of values.
Example:
• The above execution will delete row r1 from column family c1 in table "guru99".
• Suppose the table "guru99" has a table reference, say g.
• We can run the command on the table reference as well: hbase> g.delete 'guru99', 'r1', 'c1'
deleteall
Syntax: deleteall <'tablename'>, <'rowname'>
Example:-
Truncate
Syntax: truncate <tablename>
After truncating an hbase table, the schema will still be present but not the records. This command performs three functions: it disables, drops and recreates the table.
Scan
• We can pass several optional specifications to the scan command to get more information about the tables present in the system.
• Scanner specifications may include one or more of the following attributes: TIMERANGE, FILTER, TIMESTAMP, LIMIT, MAXLENGTH, COLUMNS, CACHE, STARTROW and STOPROW.
scan 'guru99'
Examples:-
Command and usage:
• scan '.META.', {COLUMNS => 'info:regioninfo'} : displays all the metadata information related to columns that are present in the tables in HBase
• scan 'guru99', {COLUMNS => ['c1', 'c2'], LIMIT => 10, STARTROW => 'xyz'} : displays the contents of table guru99 with column families c1 and c2, limiting the values to 10
• scan 'guru99', {COLUMNS => 'c1', TIMERANGE => [1303668804, 1303668904]} : displays the contents of guru99 with its column c1 for values present in the mentioned time range
• scan 'guru99', {RAW => true, VERSIONS => 10} : here RAW => true provides an advanced feature to display all the cell values present in the table guru99
Code Example:
First create a table and place values into the table.
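The original manual shows this example as a screenshot of HBase shell commands, which is not reproduced here. As a substitute, below is a minimal sketch of the same create/put/get/scan flow using the HBase 1.1.2 Java client API; the class name, column family and values are illustrative assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class Guru99Example {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {

            // Create table 'guru99' with one column family 'edu' (family name is illustrative).
            TableName name = TableName.valueOf("guru99");
            if (!admin.tableExists(name)) {
                HTableDescriptor desc = new HTableDescriptor(name);
                desc.addFamily(new HColumnDescriptor("edu"));
                admin.createTable(desc);
            }

            try (Table table = connection.getTable(name)) {
                // Put: place a value into row 'r1', column 'edu:c1'.
                Put put = new Put(Bytes.toBytes("r1"));
                put.addColumn(Bytes.toBytes("edu"), Bytes.toBytes("c1"), Bytes.toBytes("10"));
                table.put(put);

                // Get: read the cell back.
                Result result = table.get(new Get(Bytes.toBytes("r1")));
                System.out.println(Bytes.toString(
                        result.getValue(Bytes.toBytes("edu"), Bytes.toBytes("c1"))));

                // Scan: list all rows currently in the table.
                try (ResultScanner scanner = table.getScanner(new Scan())) {
                    for (Result row : scanner) {
                        System.out.println(row);
                    }
                }
            }
        }
    }
}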
The following replication-related commands are also available.
• add_peer : Adds peers to the cluster to replicate to. Example: hbase> add_peer '3', zk1,zk2,zk3:2182:/hbase-prod
• remove_peer : Stops the defined replication stream and deletes all the metadata information about the peer. Example: hbase> remove_peer '1'
• start_replication : Restarts all the replication features. Example: hbase> start_replication
• stop_replication : Stops all the replication features. Example: hbase> stop_replication
Result:
ADDITIONAL EXPERIMENTS
EXP NO: 1
PIG LATIN MODES, PROGRAMS
Date:
OBJECTIVE:
PROGRAM LOGIC:
Run the Pig Latin scripts to find the maximum temperature for each year
OUTPUT:
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
(1949,78,1)
RESULT: Thus, the Pig Latin scripts to find the word count and the maximum temperature for each year are successfully implemented.