BDA Lab Manual 2024
Program No.1
Implement file management tasks in Hadoop: adding files and directories to HDFS and retrieving files from HDFS.
DESCRIPTION:
HDFS is a scalable distributed filesystem designed to scale to petabytes of data while running on
top of the underlying filesystem of the operating system. HDFS keeps track of where the data
resides in a network by associating the name of its rack (or network switch) with the dataset. This
allows Hadoop to efficiently schedule tasks to those nodes that contain data, or which are nearest
to it, optimizing bandwidth utilization. Hadoop provides a set of command line utilities that work
similarly to the Linux file commands and serve as your primary interface with HDFS. We're going to have a look into HDFS by interacting with it from the command line. We will take a look at the most common file management tasks in Hadoop, which include adding files and directories to HDFS and retrieving files from HDFS.
Step-1
Before you can run Hadoop programs on data stored in HDFS, you'll need to put the data into HDFS first. Let's create a directory and put a file in it. HDFS has a default working directory of /user/$USER, where $USER is your login user name. This directory isn't automatically created for you, though, so let's create it with the mkdir command. For the purpose of illustration, we use chuck. You should substitute your user name in the example commands.
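A representative command sequence for this step (assuming a local file named example.txt in the current directory, and the user name chuck) is:
hadoop fs -mkdir /user/chuck
hadoop fs -put example.txt /user/chuck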
Step-2
The Hadoop command get copies files from HDFS back to the local filesystem. To retrieve
example.txt, we can run the following command:
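For example, to copy it into the current local directory:
hadoop fs -get example.txt .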
Step-3
Step-4
Program-2:
Run a basic Word Count Map Reduce Program to understand Map Reduce Paradigm
DESCRIPTION:
MapReduce is the heart of Hadoop. It is this programming paradigm that allows for massive
scalability across hundreds or thousands of servers in a Hadoop cluster. The MapReduce concept
is fairly simple to understand for those who are familiar with clustered scale-out data processing
solutions. The term MapReduce actually refers to two separate and distinct tasks that Hadoop
programs perform. The first is the map job, which takes a set of data and converts it into another
set of data, where individual elements are broken down into tuples (key/value pairs). The reduce
job takes the output from a map as input and combines those data tuples into a smaller set of tuples.
As the sequence of the name MapReduce implies, the reduce job is always performed after the
map job.
ALGORITHM
MAPREDUCE PROGRAM
WordCount is a simple program which counts the number of occurrences of each word in a given
text input data set. WordCount fits very well with the MapReduce programming model making it
a great example to understand the Hadoop Map/Reduce programming style.
1. Mapper
2. Reducer
3. Driver
The input value of the WordCount map task will be a line of text from the input data file, and the key will be the line number. The map task outputs a <word, 1> pair for each word in the line of text.
Pseudo-code
for each word x in the input line:
    output.collect(x, 1);
A Reducer collects the intermediate output from multiple map tasks and assembles a single result. Here, the WordCount program sums up the occurrences of each word into <word, count> pairs.
Pseudo-code
sum = 0;
for each x in <list of counts for keyword>:
    sum += x;
final_output.collect(keyword, sum);
The Driver program configures and runs the MapReduce job. We use the main program to perform basic configurations such as:
Executable (Jar) Class: the main executable class; here, WordCount.
Mapper Class: the class which overrides the "map" function; here, Map.
Reducer Class: the class which overrides the "reduce" function; here, Reduce.
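A complete WordCount program along these lines might look like the following sketch (the class names WordCount, Map, and Reduce follow the description above; this is one possible implementation, not the only one):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits a <word, 1> pair for every word in the input line
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the counts for each word and emits <word, total>
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    // Driver: configures and runs the job
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}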
INPUT
Output
Program No.3
Write a MapReduce program that mines weather data. Hint: Weather sensors collecting data every hour at many locations across the globe gather a large volume of log data, which is a good candidate for analysis with MapReduce, since it is semi-structured and record-oriented.
Software:
Hardware:
Step 1
Step 2
Step3
Step 4
Step 5
Step6
Fig 6: Once all the jar files are added, click on Finish.
Step 7
Step 8
Step 9
Fig 9: Add the code to the corresponding Java class; the snapshot above is for MaxTempMapper.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Mapper;
public class MaxTempMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        // the year occupies characters 15-18 of each record
        String year = line.substring(15, 19);
        int airtemp;
        if (line.charAt(87) == '+') {
            // parseInt does not accept a leading plus sign
            airtemp = Integer.parseInt(line.substring(88, 92));
        } else {
            airtemp = Integer.parseInt(line.substring(87, 92));
        }
        // quality code: keep only valid, non-missing readings
        String q = line.substring(92, 93);
        if (airtemp != 9999 && q.matches("[01459]")) {
            context.write(new Text(year), new IntWritable(airtemp));
        }
    }
}
Step 10
Fig 10: Add the code to the corresponding Java class; the snapshot above is for MaxTempReducer.
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;
public class MaxTempReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // track the highest temperature seen for this year
        int maxvalue = Integer.MIN_VALUE;
        for (IntWritable value : values) {
            maxvalue = Math.max(maxvalue, value.get());
        }
        context.write(key, new IntWritable(maxvalue));
    }
}
Step11
Fig 11: Add the code to the corresponding Java class; the snapshot above is for MaxTemp.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class MaxTemp {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "MaxTemp");
        job.setJarByClass(MaxTemp.class);
        job.setMapperClass(MaxTempMapper.class);
        job.setReducerClass(MaxTempReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // args[0] = HDFS input directory, args[1] = HDFS output directory
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        if (!job.waitForCompletion(true))
            return;
    }
}
Step12
Fig 12: Once all the code is done, we have to create a JAR file: right-click on the Java project (MaxTemp) and click on Export.
Step 13
Fig 13: Once you click on Export, select Java and then JAR file.
Step14
Step15
Fig 15: Name it as MaxTemp.jar, select Desktop as the destination (it will be created on the Desktop), and then click on OK.
Step16
Step17
Step18
Fig 18: Now open the terminal; first we have to create the input directory in HDFS.
Command
hadoop dfs -mkdir /maxinput
Step19
Fig 19: Now go to the Desktop to identify the actual dataset file (circled in the snapshot; the dataset can be downloaded from Kaggle or elsewhere on the internet).
Command: cd Desktop/
Step 20
Fig 20: Copy the input file into the input directory; it will be copied successfully.
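A representative command for this step (the dataset file name weatherdata.txt is an assumption) is:
hadoop dfs -put weatherdata.txt /maxinput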
Step21
Step22
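The remaining steps run the job and inspect its result; representative commands (the JAR name and output directory follow the earlier steps, with MaxTemp as the driver class) are:
hadoop jar MaxTemp.jar MaxTemp /maxinput /maxoutput
hadoop dfs -cat /maxoutput/part-r-00000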
OUTPUT
Step23
We can also check this in the browser using localhost:50070: browse to the maxoutput directory and it will display the maximum temperature.
Program-4: Write a MapReduce program to multiply two matrices.
Software:
Hardware:
Step 1:
1a. Create two files M1, M2 and put the matrix values in them (separate columns with spaces and rows with a line break).
Step 2:
Mapper.py
2b. Read each line (i.e., a row) from stdin and split it into separate elements. Map int onto each element, since the elements are read from stdin as strings.
2c. The mapper will first read the first matrix and then the second. To differentiate them, we can keep a count i of the line number we are reading; the first m_r lines belong to the first matrix.
2d. Now comes the crucial part: printing the key and value. We need to think of a key which will group the elements that need to be multiplied, the elements that need to be summed, and the elements that belong to the same row.
{0} {1} {2} are the parts of the key and {3} is the value.
{0} {1} {2} represent the position that an element of A or B maps to in A*B:
{2} is the position of the element in the addition (for example, 1 and 6 are at position 0 in the addition, and 2 and 5 are at position 1).
For an element of the first matrix: the element is duplicated and distributed to each column, so its column position in A*B equals the duplication order of the element, i.e. {1}=k; as you can see in the picture, its position in the addition is the same as its column number, so {2}=j (and {0}=i, its own row).
For an element of the second matrix: the element is duplicated and distributed to each row, so its row position in A*B equals the duplication order of the element, i.e. {0}=k; its position in the addition is the same as its row position, so {2}=i-m_r (and {1}=j, its own column).
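A minimal mapper.py sketch following this scheme (the 2x2 matrix dimensions are assumptions and must match the M1 and M2 files from Step 1; it also assumes both matrices pass through a single mapper, e.g. as one concatenated input file):

#!/usr/bin/env python
# mapper.py -- emits "row col pos value" keys as described above
import sys

m_r = 2   # rows of the first matrix (assumed)
n_c = 2   # columns of the second matrix (assumed)

for i, line in enumerate(sys.stdin):
    # split the row and map int to each element, since stdin gives us strings
    row = list(map(int, line.split()))
    if i < m_r:
        # element of the first matrix at row i, column j:
        # duplicated once per column k of the result
        for j, element in enumerate(row):
            for k in range(n_c):
                print("{0} {1} {2} {3}".format(i, k, j, element))
    else:
        # element of the second matrix at row i - m_r, column j:
        # duplicated once per row k of the result
        for j, element in enumerate(row):
            for k in range(m_r):
                print("{0} {1} {2} {3}".format(k, j, i - m_r, element))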
Step 3: After the mapper produces its output, Hadoop will sort it by key and provide it to reducer.py.
3a. Reducer.py
Our reducer program will get the sorted mapper result, which will look like this.
If you look closely at the output and the image of matrix multiplication, you will realize: the two values that share the same key are the factors that must be multiplied together, and the products that share the same row and column position must be summed to give one element of the result.
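A minimal reducer.py sketch matching the mapper above (it assumes single-digit indices, so the default text sort keeps the two factors of each "row col pos" key adjacent):

#!/usr/bin/env python
# reducer.py -- multiplies factor pairs and sums them per result cell
import sys

prev_cell = None   # (row, col) of the result element being accumulated
prev_pos = None    # position-in-sum of the factor seen last
factor = None      # first factor seen for the current (row, col, pos)
total = 0

for line in sys.stdin:
    r, c, pos, val = line.split()
    val = int(val)
    cell = (r, c)
    if cell != prev_cell:
        # a new result cell begins: emit the finished one
        if prev_cell is not None:
            print("{0} {1} {2}".format(prev_cell[0], prev_cell[1], total))
        prev_cell, prev_pos, total = cell, None, 0
    if pos == prev_pos:
        # second factor for this position: multiply and add to the sum
        total += factor * val
        prev_pos = None
    else:
        factor = val
        prev_pos = pos

# emit the last result cell
if prev_cell is not None:
    print("{0} {1} {2}".format(prev_cell[0], prev_cell[1], total))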
You can run the MapReduce job and view the result with the following commands (considering you have already put the input files in HDFS).
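A representative invocation of the Hadoop streaming JAR (the JAR path and HDFS paths are assumptions; M1 and M2 are assumed to have been combined into one input file so that a single mapper reads both matrices) is:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -file mapper.py -mapper "python mapper.py" \
    -file reducer.py -reducer "python reducer.py" \
    -input /matrix/input.txt -output /matrix/output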
This will take some time as Hadoop does its mapping and reducing work. After the successful completion of the above process, view the output by:
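For example (output directory name as in the command above):

hadoop fs -cat /matrix/output/part-00000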
Output:
Program-5: Run a Pig Latin script to count the number of occurrences of each word in a text file.
Software:
Hardware:
Copy the following content to ‘hadooptext.txt’ by clicking the file and edit option
Step 3: Now load the file stored in HDFS (space-separated file).
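The statements that define the aliases dumped below are not reproduced here; a sketch consistent with those DUMP and DESCRIBE statements (the HDFS path of hadooptext.txt is an assumption) is:

input1 = LOAD '/hadooptext.txt' AS (line:chararray);
wordsInEachLine = FOREACH input1 GENERATE FLATTEN(TOKENIZE(line)) AS word;
groupedWords = GROUP wordsInEachLine BY word;
countedWords = FOREACH groupedWords GENERATE group AS word, COUNT(wordsInEachLine) AS wordCount;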
DUMP input1;
DUMP wordsInEachLine;
dump groupedWords;
describe groupedWords;
dump countedWords;
Output:
Program 6
Run a Pig Latin script to find the maximum temperature for each year.
Step 1:
The Pig Latin MAX() function is used to calculate the highest value for a column (numeric values or chararrays) in a single-column bag. While calculating the maximum value, the MAX() function ignores NULL values.
Note
To get the global maximum value, we need to perform a Group All operation, and
calculate the maximum value using the MAX() function.
To get the maximum value of a group, we need to group it using the Group By operator and then apply the MAX() function.
Syntax
grunt> MAX(expression)
Copy the following content to ‘hadooptext.txt’ by clicking the file and edit option
Pune,2007,31.5
Pune,2007,30.5
Pune,2008,34.5
Blre,2009,13.0
Blre,2009,10.5
Commands
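A sketch of the commands (the HDFS path of hadooptext.txt is an assumption; the GROUP ALL form produces the single value shown under Output):

temp_data = LOAD '/hadooptext.txt' USING PigStorage(',') AS (city:chararray, year:int, temp:double);
-- maximum temperature for each year, as the program statement asks
year_group = GROUP temp_data BY year;
max_by_year = FOREACH year_group GENERATE group AS year, MAX(temp_data.temp);
DUMP max_by_year;
-- global maximum over all records
all_group = GROUP temp_data ALL;
global_max = FOREACH all_group GENERATE MAX(temp_data.temp);
DUMP global_max;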
Output
34.5
Program No.7
Use Hive to create, alter, and drop database, tables, views, functions, and indexes.
Software:
Hardware:
Step 1
Step 2
Query->Editor->Hive
default->show databases;
Step3
Step 4
Step 5
hive> show databases;
OK
default
hive> create database office;
OK
hive> show databases;
OK
default
office
hive>
Step 6
Create the file Employee.csv in /home/cloudera/Documents with the following content:
1,Rose,IT,2012,26000
2,Sam,Sales,2012,22000
3,Quke,HR,2013,30000
4,Nick,SC,2013,20000
[cloudera@quickstart Documents]$ ls
Employee.csv
[cloudera@quickstart Documents]$ cat Employee.csv
1,Rose,IT,2012,26000
2,Sam,Sales,2012,22000
3,Quke,HR,2013,30000
4,Nick,SC,2013,20000
Step 8
hive> show databases;
OK
default
office
hive> use office;
OK
hive> create table employee
    > (Id INT, Name STRING, Dept STRING, Yoj INT, salary INT)
    > row format delimited fields terminated by ',';
OK
hive> show tables;
OK
employee
hive> load data local inpath
    > '/home/cloudera/Documents/Employee.csv'
    > INTO TABLE employee;
OK
Query
Functions compared between MySQL and HiveQL: retrieving information, all values, some values, multiple criteria, sorting, sorting backward, counting rows, and maximum value.
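Representative HiveQL statements for these functions, written against the employee table loaded above (the specific predicates are illustrative assumptions):

hive> SELECT * FROM employee;                                   -- all values
hive> SELECT Name, salary FROM employee;                        -- some values
hive> SELECT * FROM employee WHERE Dept = 'IT' AND Yoj = 2012;  -- multiple criteria
hive> SELECT * FROM employee ORDER BY salary;                   -- sorting
hive> SELECT * FROM employee ORDER BY salary DESC;              -- sorting backward
hive> SELECT COUNT(*) FROM employee;                            -- counting rows
hive> SELECT MAX(salary) FROM employee;                         -- maximum value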
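The program statement also asks for altering and dropping databases, tables, views, functions, and indexes; a few representative statements against the employee table above (the view and index names are assumptions, and CREATE/DROP INDEX applies only to Hive versions that still support indexes) are:

hive> SHOW FUNCTIONS;
hive> ALTER TABLE employee ADD COLUMNS (location STRING);
hive> CREATE VIEW emp_salary_view AS SELECT Name, salary FROM employee WHERE salary > 25000;
hive> CREATE INDEX emp_dept_index ON TABLE employee (Dept) AS 'COMPACT' WITH DEFERRED REBUILD;
hive> DROP INDEX emp_dept_index ON employee;
hive> DROP VIEW emp_salary_view;
hive> DROP TABLE employee;
hive> DROP DATABASE office;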