PDC All Labs
LAB - 1
PRELAB -
INLAB -
1. Write the steps involved in installing Hadoop and execute the basic shell commands.
ANS -
Steps for Hadoop Installation:
a. Download Virtual Box
b. Install Virtual Box
c. Download Hadoop Training Ubuntu Image
d. Open Oracle VM VirtualBox, then click on File -> Import Appliance -> Browse -> Hadoop Training v1.0 -> Import, in order to complete the setup
shell commands:
1) Version Check
$ hadoop version
2) ls Command
List all the files/directories for the given HDFS destination path.
$ hdfs dfs -ls /path
3) df Command
Displays the free space in the file system.
$ hdfs dfs -df -h /
4) count Command
Count the number of directories, files and bytes under the paths that match the specified file pattern.
$ hdfs dfs -count /path
5) fsck Command
Checks the health of the Hadoop file system.
$ hdfs fsck /
6) balancer Command
Runs the cluster balancing utility, which spreads data evenly across the DataNodes.
$ hdfs balancer
7) mkdir Command
Creates a directory in HDFS.
$ hdfs dfs -mkdir /dirname
8) put Command
File
Copy a single file, or multiple files, from the local file system to the destination file system.
$ cd Desktop
$ hdfs dfs -put filename /destination
Directory
HDFS command to copy a directory, or multiple directories, from the local file system to the destination file system.
$ mkdir dirname
$ hdfs dfs -put dirname /destination
9) du Command
Displays the size of files and directories contained in the given directory, or the size of a file if it is just a file.
$ hdfs dfs -du -h /path
11) rm -r Command
HDFS command to remove a directory and all of its content from HDFS.
$ hdfs dfs -rm -r /dirname
get Command
HDFS command to copy files from HDFS to the local file system.
$ hdfs dfs -get /source localdestination
$ ls -l filename
text Command
HDFS command that takes a source file and outputs it in text format.
$ hdfs dfs -text /filename
16) mv Command
HDFS command to move files from source to destination. This command allows multiple sources as well, in which case the destination needs to be a directory.
$ hdfs dfs -mv /source /destination
17) cp Command
HDFS command to copy files from source to destination. This command allows multiple sources as well, in which case the destination must be a directory.
$ hdfs dfs -cp /source /destination
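A typical session tying these commands together (the directory and file names are only examples):
$ hdfs dfs -mkdir /demo
$ hdfs dfs -put sample.txt /demo
$ hdfs dfs -ls /demo
$ hdfs dfs -du -h /demo
$ hdfs dfs -get /demo/sample.txt ~/Desktop
$ hdfs dfs -rm -r /demo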
POSTLAB -
1. Write a command which creates an empty file and also give the command, which prints last
modified time of directory or path? And explain the commands in detail with syntax?
ANS -
Touchz -
Create a file of zero length. An error is returned if the file exists with non-zero length.
Example:
hadoop fs -touchz pathname
Exit Code: Returns 0 on success and -1 on error.
Stat -
The hdfs dfs -stat command prints statistics about the given file or directory path, such as its last modified time, in the requested format.
Example:
hdfs dfs -stat pathname
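For the "last modified time" part of the question, -stat accepts a format string, and %y prints the modification time (the path is only an example):
$ hdfs dfs -stat "%y" /Sample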
2. The Hadoop copyFromLocal command is used to copy a file from your local file system to HDFS (the Hadoop Distributed File System). copyFromLocal has an optional switch -f which is used to replace an already existing file in HDFS, i.e. it can be used to update that file. Give the syntax.
ANS -
Step 1: Make a directory in HDFS where you want to copy this file with the below command.
$ hdfs dfs -mkdir /Sample
Step 2: Use copyFromLocal command as shown below to copy it to HDFS /Sample.
$ cd Desktop
$ hdfs dfs -copyFromLocal Sample /Sample
Step 3: Check whether the file is copied successfully or not by moving to its directory location
with below command.
$ hdfs dfs -ls /Sample
Note: To update the content of the file or to Overwrite it, you should use -f switch as shown
below.
$ hdfs dfs -copyFromLocal -f Sample /Sample
Output:
PARALLEL AND DISTRIBUTED COMPUTING
LAB - 2
PRELAB -
b)equals(Object o)
public boolean equals(Object o)
DESC -
Returns true iff o is an IntWritable with the same value.
Overrides:
equals in class Object
c)get()
public int get()
DESC -
Return the value of this IntWritable.
d)hashCode()
public int hashCode()
DESC -
Overrides:
hashCode in class Object
e)readFields(DataInput in)
public void readFields(DataInput in)
throws IOException
DESC -
Description copied from interface: Writable
Deserialize the fields of this object from in.
For efficiency, implementations should attempt to re-use storage in the existing object where
possible.
Specified by:
readFields in interface Writable
Parameters:
in - DataInput to deserialize this object from.
Throws:
IOException
f)set(int value)
public void set(int value)
DESC -
Set the value of this IntWritable.
g)toString()
public String toString()
DESC -
Overrides:
toString in class Object
h)write(DataOutput out)
public void write(DataOutput out)
throws IOException
DESC -
Description copied from interface: Writable
Serialize the fields of this object to out.
Specified by:
write in interface Writable
Parameters:
out - DataOutput to serialize this object into.
Throws:
IOException
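A small usage sketch of these IntWritable methods (the byte-array round trip is only for illustration):
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import org.apache.hadoop.io.IntWritable;
public class IntWritableDemo {
public static void main(String[] args) throws Exception {
IntWritable w = new IntWritable();
w.set(42); // set(int value)
System.out.println(w.get()); // get() -> 42
System.out.println(w.toString()); // toString() -> "42"
System.out.println(w.equals(new IntWritable(42))); // equals -> true
System.out.println(w.hashCode()); // hashCode of the wrapped int
// write(DataOutput) / readFields(DataInput): serialize and deserialize the value
ByteArrayOutputStream bytes = new ByteArrayOutputStream();
w.write(new DataOutputStream(bytes));
IntWritable copy = new IntWritable();
copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
System.out.println(copy.get()); // 42 again
}
}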
INLAB -
ANS -
Steps -
1. Create 3 Java files under src, in the default package or any other package of your choice:
- Runner
- Mapper
- Reducer
Right click on the project -> Export -> select the export destination as JAR file -> Next -> Finish
5. Check whether the jar file is created or not.
$ startCDH.sh
9. $ cd Desktop
POSTLAB -
1.Write the algorithm for map reduce and apply map reduce word count on the given data
Dear, Bear, River, Car, Car, River, Deer, Car and Bear
ANS -
The Algorithm
• Generally MapReduce paradigm is based on sending the computer to where the data
resides!
• MapReduce program executes in three stages, namely map stage, shuffle stage, and reduce
stage.
o Map stage − The map or mapper’s job is to process the input data. Generally the
input data is in the form of file or directory and is stored in the Hadoop file system
(HDFS). The input file is passed to the mapper function line by line. The mapper
processes the data and creates several small chunks of data.
o Reduce stage − This stage is the combination of the Shuffle stage and
the Reduce stage. The Reducer’s job is to process the data that comes from the
mapper. After processing, it produces a new set of output, which will be stored in
the HDFS.
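Applying this to the given data (treating each comma-separated word exactly as written, so "Dear" and "Deer" stay distinct):
Map output: (Dear,1), (Bear,1), (River,1), (Car,1), (Car,1), (River,1), (Deer,1), (Car,1), (Bear,1)
Shuffle and sort: Bear -> [1,1], Car -> [1,1,1], Dear -> [1], Deer -> [1], River -> [1,1]
Reduce output: (Bear,2), (Car,3), (Dear,1), (Deer,1), (River,2)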
PARALLEL AND DISTRIBUTED COMPUTING
LAB - 3
PRELAB -
2. Consider the checkout counter at a large supermarket chain. For each item
sold, it generates a record of the form [ProductId, Supplier, Price]. Here, ProductId is
the unique identifier of a product, Supplier is the supplier name of the product and Price
is the sales price for the item. Assume that the supermarket chain has accumulated many
terabytes of data over a period of several months.
The CEO wants a list of suppliers, listing for each supplier the average sales price of items
provided by the supplier. How would you organize the computation using the Map-Reduce
computation model?
ANS -
SELECT AVG(Price) FROM DATA GROUP BY Supplier
Pseudo code :
map(key, record):
output [record(SUPPLIER),record(PRICE)]
reduce(SUPPLIER, list of PRICE):
emit [SUPPLIER, AVG(PRICE)]
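A minimal Java sketch of this map/reduce pair (the comma-separated field positions of the [ProductId, Supplier, Price] records are assumptions):
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
// Mapper: emit (Supplier, Price) for every sales record "ProductId,Supplier,Price"
class SupplierPriceMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String[] fields = value.toString().split(",");
context.write(new Text(fields[1].trim()), new DoubleWritable(Double.parseDouble(fields[2].trim())));
}
}
// Reducer: average all prices seen for one supplier
class SupplierAvgReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
public void reduce(Text supplier, Iterable<DoubleWritable> prices, Context context) throws IOException, InterruptedException {
double sum = 0;
long count = 0;
for (DoubleWritable price : prices) {
sum += price.get();
count++;
}
context.write(supplier, new DoubleWritable(sum / count));
}
}
Note that this reducer cannot simply be reused as a combiner, because an average of partial averages is not the overall average; a combiner would have to emit (sum, count) pairs instead.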
3. Mention the main configuration parameters that a user needs to specify to run a MapReduce job.
ANS -
The user of the MapReduce framework needs to specify:
• the job's input location(s) in the distributed file system
• the job's output location in the distributed file system
• the input format and the output format
• the class containing the map function
• the class containing the reduce function
• the JAR file containing the mapper, reducer and driver classes
INLAB -
1. Siri, a third-year BTech student, wants to read and write data in HDFS using the Java API. Help her with this task.
ANS -
READ CLASS - read from HDFS
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
public class Read{
public static void main(String[] args) throws IOException
{
Configuration conf = new Configuration();
// Configuration() creates a configuration object preloaded with the default Hadoop settings
String localpath="/home/hadoop/Desktop/samplecopy";
String uri= "hdfs://localhost:8020";
String Hdfspath="hdfs://localhost:8020/lab/sample";
FileSystem fs =FileSystem.get(URI.create(Hdfspath),conf);
fs.copyToLocalFile(new Path(Hdfspath),new Path(localpath));
}
}
OUTPUT -
WRITE CLASS - write to HDFS
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
public class Write
{
public static void main(String[] args) throws IOException
{
Configuration conf = new Configuration();
String localpath="/home/hadoop/Desktop/sample2";
String uri= "hdfs://localhost:8020";
String Hdfspath="hdfs://localhost:8020/lab";
FileSystem fs=FileSystem.get(URI.create(uri),conf);
fs.copyFromLocalFile(new Path(localpath),new Path(Hdfspath));
}
}
COMMAND -
OUTPUT -
POSTLAB -
1. Your friend asked you to help him by explaining, with the help of a diagram, the steps of how HDFS stores data. Now help your friend understand the concept.
ANS -
HDFS splits every file into fixed-size blocks (128 MB by default in Hadoop 2.x). The client asks the NameNode where to store each block; the NameNode records the file-to-block mapping in its metadata and returns a list of DataNodes. The client then streams the block to the first DataNode, which forwards it along a pipeline to the other DataNodes until the default replication factor of 3 is satisfied, and each DataNode acknowledges back. Only the block data lives on the DataNodes; the NameNode keeps the namespace and the block locations.
2. Write the applications of Hadoop MapReduce that can be seen in our daily life.
ANS -
E-commerce:
E-commerce companies such as Walmart, E-Bay, and Amazon use MapReduce to analyze buying
behavior. MapReduce provides meaningful information that is used as the basis for developing
product recommendations. Some of the information used include site records, e-commerce
catalogs, purchase history, and interaction logs.
Social networks
The MapReduce programming tool can evaluate certain information on social media platforms
such as Facebook, Twitter, and LinkedIn. It can evaluate important information such as who liked
your status and who viewed your profile.
Entertainment
Netflix uses MapReduce to analyze the clicks and logs of online customers. This information helps
the company suggest movies based on customers’ interests and behavior.
Conclusion
MapReduce is a crucial processing component of the Hadoop framework. It’s a quick, scalable, and
cost-effective program that can help data analysts and developers process huge data.
PARALLEL AND DISTRIBUTED COMPUTING
LAB - 4
PRELAB -
I. FileInputFormat
II. TextInputFormat
III. SequenceFileInputFormat
IV. SequenceFileAsTextInputFormat
V. SequenceFileAsBinaryInputFormat
VI. DBInputFormat
VII. NLineInputFormat
VIII. KeyValueTextInputFormat
2. Why is the output of map tasks stored on the local disk and not in HDFS?
ANS - The outputs of map tasks are the intermediate key-value pairs, which are then processed by the reducer to produce the final aggregated result. Once a MapReduce job is completed, there is no need for the intermediate output produced by the map tasks. Therefore, storing this intermediate output in HDFS and replicating it would create unnecessary overhead.
MAPPER CLASS -
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
public class SumMapper extends MapReduceBase implements
Mapper<LongWritable,Text,Text,IntWritable>
{
public void map(LongWritable key, Text value,OutputCollector<Text,IntWritable> output,
Reporter reporter) throws IOException
{
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens())
{
String s=tokenizer.nextToken("+");
int p=Integer.parseInt(s);
output.collect(value,new IntWritable(p));
}
}
}
REDUCER CLASS -
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
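// (The class body was not captured; a minimal sketch continuing from the imports above. The class name SumReducer is an assumption.)
public class SumReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
int sum = 0;
while (values.hasNext()) {
sum += values.next().get(); // add up the operands emitted by SumMapper for this expression
}
output.collect(key, new IntWritable(sum));
}
}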
DRIVER CLASS -
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat ;
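// (The class body was not captured; a minimal sketch continuing from the imports above. The class name SumDriver and the job name are assumptions.)
public class SumDriver {
public static void main(String[] args) throws IOException {
JobConf conf = new JobConf(SumDriver.class);
conf.setJobName("Sum");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(SumMapper.class);
conf.setReducerClass(SumReducer.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
JobClient.runJob(conf);
}
}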
OUTPUT FILE -
POSTLAB -
1. Create a file which includes a minimum of 3 lines of words/characters/info, And then write a
map reduce program in order to find the character count i.e. you have to find out the number of
times a character has appeared in the file you created.
ANS -
Driver Class:
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
public class CharCountDriver {
public static void main(String[] args)
throws IOException
{
JobConf conf = new JobConf(CharCountDriver.class);
conf.setJobName("CharCount");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(CharCountMapper.class);
conf.setCombinerClass(CharCountReducer.class);
conf.setReducerClass(CharCountReducer.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf,
new Path(args[0]));
FileOutputFormat.setOutputPath(conf,
new Path(args[1]));
JobClient.runJob(conf);
}
}
Mapper Class:
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
public class CharCountMapper extends MapReduceBase implements
Mapper<LongWritable,Text,Text,IntWritable>{
public void map(LongWritable key, Text value,OutputCollector<Text,IntWritable> output,
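// (The rest of the listing was not captured; a minimal sketch that emits (character, 1) for every character of the line.)
Reporter reporter) throws IOException {
String line = value.toString();
for (int i = 0; i < line.length(); i++) {
output.collect(new Text(String.valueOf(line.charAt(i))), new IntWritable(1));
}
}
}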
Reducer Class:
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
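// (The class body was not captured; a minimal sketch continuing from the imports above.)
public class CharCountReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
int sum = 0;
while (values.hasNext()) {
sum += values.next().get();
}
output.collect(key, new IntWritable(sum));
}
}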
PayLoad − Applications implement the Map and the Reduce functions, and form the core of the
job.
Mapper − Mapper maps the input key/value pairs to a set of intermediate key/value pair.
NamedNode − Node that manages the Hadoop Distributed File System (HDFS).
DataNode − Node where data is presented in advance before any processing takes place.
MasterNode − Node where JobTracker runs and which accepts job requests from clients.
JobTracker − Schedules jobs and tracks the assigned jobs to the TaskTracker.
INLAB -
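WordCountMapper.java:
// (This listing was not captured in the record apart from its final brace; a minimal word-count mapper sketch.)
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
public class WordCountMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
StringTokenizer tokenizer = new StringTokenizer(value.toString());
while (tokenizer.hasMoreTokens()) {
output.collect(new Text(tokenizer.nextToken()), new IntWritable(1));
}
}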
}
WordCountReducer.java:
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
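// (The class declaration and the start of reduce() were not captured; this sketch completes the surviving lines below.)
public class WordCountReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
int sum = 0;
while (values.hasNext()) {
sum += values.next().get();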
}
output.collect(key,new IntWritable(sum));
}
}
WordCount.java:
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
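// (The driver body was not captured; a minimal sketch continuing from the imports above.)
public class WordCount {
public static void main(String[] args) throws IOException {
JobConf conf = new JobConf(WordCount.class);
conf.setJobName("WordCount");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(WordCountMapper.class);
conf.setReducerClass(WordCountReducer.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
JobClient.runJob(conf);
}
}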
3. Add required jar files: right click on src -> Build Path -> Configure Build Path -> Libraries -> Add External JARs, then add the jar files and click OK.
1. Write the syntax of the final command to execute the word count problem in MapReduce, explain each keyword in it, and finally give an example of that command.
ANS -
Syntax:
hadoop jar <jar file name> <class name> <path of input file in HDFS> <path of output file in HDFS>
Explanation:
<jar file name>
The jar file has to be exported to the current directory, and its name is written here.
<class name>
The name of the driver class in the Java code that was exported into the jar file.
<path of input file in HDFS>
You have to create an input file that contains the input data and push that file into the Hadoop file system with this command:
hdfs dfs -put <file name> <path in hadoop>
<path of output file in HDFS>
After execution completes, the output is stored in a newly created folder at the path you gave in the command.
In it you can see the answer and the execution message.
Example:
hadoop jar wordcountdemo.jar WordCount /test/data.txt /text/r_output
INLAB -
3. Add required jar files: right click on src -> Build Path -> Configure Build Path -> Libraries -> Add External JARs, then add the jar files and click OK.
words.txt
abcdefghijklmnopqrstuvwxymnbvcxasdfghjklpoiuytrewqadsfgfhj
klmnbvcxasdfhjjkkooiytresqcxbcncjdhgdfrdwcvdbbcdncjdjchydtg
cfgcbdncdsbcsdbahoubvarueobvyuebvyuebrubdfuhbvodufbvuobf
bbbbaeffffgy
ANS -
Java Codes:
These are the mapper class, the reducer class and the driver class.
You have to export the driver class into a jar file and place the jar file on the Desktop.
Execution: First place your text file into a directory in Hadoop, then execute the program as shown in the image above.
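The commands behind those screenshots would look roughly like this (the jar, class and path names are assumptions):
$ hdfs dfs -put words.txt /lab/words.txt
$ hadoop jar charcount.jar CharCountDriver /lab/words.txt /lab/charcount_out
$ hdfs dfs -cat /lab/charcount_out/part-00000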
Files are created in the Hadoop directory, and the output files are automatically created in the output folder.
Final Output of the problem:
PARALLEL AND DISTRIBUTED COMPUTING
LAB - 7
PRELAB-
1. Explain the interfaces between the Hadoop cluster and any external network.
Ans:
The interfaces between the Hadoop cluster and any external network are called edge nodes. They are also referred to as gateway nodes, as they provide access to and from the Hadoop cluster for other applications. Cluster administration tools and client-side applications are generally the primary utility of these nodes.
Data - a,a,b,b,c,c,d,d,e,e,a,a,f,f,b,b,c,c,d,d,e,e,f,f
Draw the two flow diagrams of the MapReduce program that implement word count logic with a combiner and without a combiner between the mapper and reducer.
INLAB:
1. Implement Maximum temperature program using Map Reduce with and without combiner
logic
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
@Override
int airTemperature;
} else {
-------
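The mapper listing above is fragmentary; a minimal sketch of the class those lines point to, following the standard NCDC maximum-temperature example (it relies on the imports above plus java.io.IOException):
public class MaxTemperatureMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
private static final int MISSING = 9999;
@Override
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
String year = line.substring(15, 19); // year field of the fixed-width NCDC record
int airTemperature;
if (line.charAt(87) == '+') { // parseInt does not accept a leading plus sign
airTemperature = Integer.parseInt(line.substring(88, 92));
} else {
airTemperature = Integer.parseInt(line.substring(87, 92));
}
String quality = line.substring(92, 93);
if (airTemperature != MISSING && quality.matches("[01459]")) {
context.write(new Text(year), new IntWritable(airTemperature));
}
}
}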
Reducer for the maximum temperature
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class MaxTemperatureReducer
@Override
-------
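The reducer listing above is also fragmentary; a minimal sketch of the complete class:
public class MaxTemperatureReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
@Override
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
int maxValue = Integer.MIN_VALUE;
for (IntWritable value : values) {
maxValue = Math.max(maxValue, value.get()); // keep the highest temperature seen for this year
}
context.write(key, new IntWritable(maxValue));
}
}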
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
System.exit(-1);
job.setJarByClass(MaxTemperature.class);
job.setJobName("Max temperature");
job.setMapperClass(MaxTemperatureMapper.class);
job.setReducerClass(MaxTemperatureReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
2)Implement Maximum temperature program with combiner logic
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
@Override
int airTemperature;
} else {
-------
Reducer for the maximum temperature
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
@Override
----------------------
public class MaxTemperatureWithCombiner {
"<output path>");
System.exit(-1);
job.setJarByClass(MaxTemperatureWithCombiner.class);
job.setJobName("Max temperature");
job.setMapperClass(MaxTemperatureMapper.class);
job.setCombinerClass(MaxTemperatureReducer.class);
job.setReducerClass(MaxTemperatureReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
System.exit(job.waitForCompletion(true) ? 0 : 1);
OUTPUT
(1950, 0)
(1950, 20)
(1950, 10)
(1950, 25)
(1950, 15)
FINAL OUTPUT
(1950, 25)
POSTLAB-
1. Write a MapReduce program using a partitioner to process the input dataset and find the highest salaried employee by gender in different age groups, for example below 20, between 21 and 30, and above 30.
Input Dataset:
ANS -
//Reducer class
//Partitioner class
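// (The mapper, reducer and partitioner bodies were not captured; a minimal sketch of the age-based partitioner, assuming the map output value is the tab-separated employee record with the age in its third field.)
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;
public class CaderPartitioner extends Partitioner<Text, Text> {
@Override
public int getPartition(Text key, Text value, int numReduceTasks) {
int age = Integer.parseInt(value.toString().split("\t")[2]); // assumed: third field holds the age
if (numReduceTasks == 0) {
return 0;
}
if (age <= 20) {
return 0; // below 20
} else if (age <= 30) {
return 1 % numReduceTasks; // between 21 and 30
} else {
return 2 % numReduceTasks; // above 30
}
}
}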
job.setMapperClass(MapClass.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setPartitionerClass(CaderPartitioner.class);
job.setReducerClass(ReduceClass.class);
job.setNumReduceTasks(3);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
System.exit(job.waitForCompletion(true)? 0 : 1);
return 0;
}
3. Add required jar files: right click on src -> Build Path -> Configure Build Path -> Libraries -> Add External JARs, then add the jar files and click OK.
8. Now execute the MapReduce program:
hadoop jar <created jar name> <main class name> <HDFS input file> <output directory name>
hadoop jar output2.jar WordCountDriver dir_name/input output2
9. Run the program
PARALLEL AND DISTRIBUTED COMPUTING
LAB - 8
PRELAB-
2. Why is the output of map tasks stored (spilled) on the local disk and not in HDFS?
ANS - The outputs of map tasks are the intermediate key-value pairs, which are then processed by the reducer to produce the final aggregated result. Once a MapReduce job is completed, there is no need for the intermediate output produced by the map tasks. Therefore, storing this intermediate output in HDFS and replicating it would create unnecessary overhead.
3. What are the advantages of using map-side join in MapReduce?
ANS - The advantages of using map-side join in MapReduce are as follows:
Map-side join helps in minimizing the cost that is incurred for sorting and merging in the shuffle and reduce stages.
Map-side join also helps in improving the performance of the task by decreasing the time taken to finish the task.
INLAB:
1. Find the number of products sold in each country using Map Reduce
ANS -
SalesMapper.java
package SalesCountry;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
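// (The class body was not captured; a minimal sketch following the common SalesJan2009 example, where the country is the 8th comma-separated field.)
public class SalesMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
private static final IntWritable one = new IntWritable(1);
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
String[] singleCountryData = value.toString().split(",");
output.collect(new Text(singleCountryData[7]), one); // emit (country, 1) for every sale
}
}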
SalesCountryReducer.java
package SalesCountry;
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
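// (The class declaration and the start of reduce() were not captured; this sketch completes the surviving lines below.)
public class SalesCountryReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
int frequencyForCountry = 0;
while (values.hasNext()) {
frequencyForCountry += values.next().get();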
}
output.collect(key, new IntWritable(frequencyForCountry));
}
}
SalesCountryDriver.java
package SalesCountry;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
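// (The start of the driver was not captured; a minimal sketch that completes the surviving lines below. The job name and class wiring follow the common SalesCountry example.)
public class SalesCountryDriver {
public static void main(String[] args) {
JobClient my_client = new JobClient();
JobConf job_conf = new JobConf(SalesCountryDriver.class);
job_conf.setJobName("SalePerCountry");
job_conf.setOutputKeyClass(Text.class);
job_conf.setOutputValueClass(IntWritable.class);
job_conf.setMapperClass(SalesMapper.class);
job_conf.setReducerClass(SalesCountryReducer.class);
job_conf.setInputFormat(TextInputFormat.class);
job_conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(job_conf, new Path(args[0]));
FileOutputFormat.setOutputPath(job_conf, new Path(args[1]));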
my_client.setConf(job_conf);
try {
JobClient.runJob(job_conf);
} catch (Exception e) {
e.printStackTrace();
}
}
}
POSTLAB-
1. Pinky wants to find the status of a file (healthy or not); she also wants to find the total size, total blocks and everything else related to the files present in a particular directory.
Your task is to help Pinky find the relevant information related to a particular directory.
Note: You can find the status of any particular directory present in your system. Explain in detail the command you intend to use.
Solution:
HDFS fsck is used to check the health of the file system, and to find missing files and over-replicated, under-replicated and corrupted blocks.
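For example, to check one directory (the path is only an example):
$ hdfs fsck /lab -files -blocks
The report ends with the total size, total directories, total files, total blocks and the overall status (HEALTHY or CORRUPT).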
2. Explain what JobTracker is in Hadoop. What are the actions followed by Hadoop?
Answer: In Hadoop, the JobTracker is used for submitting and tracking MapReduce jobs. The JobTracker runs in its own JVM process.
3. What is the function of the MapReduce partitioner?
Answer: The function of the MapReduce partitioner is to make sure that all the values of a single key go to the same reducer, which eventually helps achieve an even distribution of the map output over the reducers.
INLAB:
6. Execution:
Runner Class
7.
Mapper Class
8.
Reducer Class
EXECUTION
INPUT
OUTPUT
9.
POSTLAB-
1. Siri is playing a solo online game. There are 4 rounds in the game. In each round she is given 5 chances to score points, and in every chance she can score from 1 to 5 points. You are given an input file containing the points Siri scored in the 4 rounds; help her find the frequency of the points she scored.
ANS -
Mapper code -
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
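// (The mapper body was not captured; a minimal sketch that emits (score, 1) for every point value found in a line. The class name is an assumption.)
public class GameMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
StringTokenizer tokenizer = new StringTokenizer(value.toString());
while (tokenizer.hasMoreTokens()) {
output.collect(new Text(tokenizer.nextToken()), new IntWritable(1));
}
}
}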
Reducer code -
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
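// (The reducer body was not captured; a minimal sketch that sums the occurrences of each score. The class name is an assumption.)
public class GameReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
int sum = 0;
while (values.hasNext()) {
sum += values.next().get();
}
output.collect(key, new IntWritable(sum));
}
}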
Driver code -
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
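// (The driver body was not captured; a minimal sketch continuing from the imports above. The main class name Game matches the command below; the mapper/reducer class names are assumptions.)
public class Game {
public static void main(String[] args) throws IOException {
JobConf conf = new JobConf(Game.class);
conf.setJobName("ScoreFrequency");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(GameMapper.class);
conf.setReducerClass(GameReducer.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
JobClient.runJob(conf);
}
}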
Commands Used -
startCDH.sh
cd Desktop
hdfs dfs -put gameinput /lab/gameinput
hadoop jar game.jar Game /lab/gameinput /lab/gameoutput
OUTPUT SCREENSHOTS -
PARALLEL AND DISTRIBUTED COMPUTING
LAB - 10
PRELAB-
1. If a file contains 100 billion URLs, then how will you find the first unique URL?
This problem involves a very large data set (100 billion URLs), so it has to be divided into chunks that fit into memory; the chunks are processed and the results are combined to get the final answer. The file is divided into smaller files using a uniform hash function, producing N/M chunks, each of size M (the size of main memory). Each URL is read from the input file, and the hash function is applied to it to find the chunk file it should be written to, appending the original line number to the record. Then each chunk file is read into memory and a hash table of URLs is built, which counts the occurrences of each URL and stores the line number of its first appearance. Once the hash table is built, it is scanned for the entry with a count of 1 and the lowest line number, which is the first unique URL in that chunk file. This step is repeated for all the chunk files, and the line numbers of the candidate URLs are compared after processing. Hence, after processing all the chunk files, the first unique URL of the whole input is found.
2. Varun wants to execute a Hadoop program. He is new to Hadoop and wants to know about the main configuration files of Hadoop. Help him understand what the configuration files in Hadoop are.
• hadoop-env.sh
• core-site.xml
• hdfs-site.xml
• yarn-site.xml
• mapred-site.xml
• masters
• slaves
INLAB:
1.Implement stock market program using Map Reduce with and without combiner logic
Part – 1: Implement stock market program using Map Reduce with combiner logic.
RunnerClass
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
public class RunnerClass {
public static void main(String[] args) throws IOException{
JobConf conf = new JobConf(RunnerClass.class);
conf.setJobName("stocksmax");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(FloatWritable.class);
conf.setMapperClass(MapperClass.class);
conf.setCombinerClass(ReducerClass.class);
conf.setReducerClass(ReducerClass.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf,new Path(args[0]));
FileOutputFormat.setOutputPath(conf,new Path(args[1]));
JobClient.runJob(conf);
}
}
MapperClass
import java.io.IOException;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
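// (The class body was not captured; a minimal sketch, assuming CSV stock records with the symbol in the second field and the closing price in the seventh field. The two closing braces below end the map method and the class.)
public class MapperClass extends MapReduceBase implements Mapper<LongWritable, Text, Text, FloatWritable> {
public void map(LongWritable key, Text value, OutputCollector<Text, FloatWritable> output, Reporter reporter) throws IOException {
String[] fields = value.toString().split(",");
// emit (stock symbol, closing price) for every record
output.collect(new Text(fields[1]), new FloatWritable(Float.parseFloat(fields[6])));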
}
}
ReducerClass
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
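// (The class declaration and the start of reduce() were not captured; this sketch completes the surviving loop below.)
public class ReducerClass extends MapReduceBase implements Reducer<Text, FloatWritable, Text, FloatWritable> {
public void reduce(Text key, Iterator<FloatWritable> values, OutputCollector<Text, FloatWritable> output, Reporter reporter) throws IOException {
float maxClosePrice = Float.MIN_VALUE;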
while (values.hasNext()) {
maxClosePrice = Math.max(maxClosePrice, values.next().get());
}
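// (The end of the listing was not captured; emit the maximum closing price and close the method and the class.)
output.collect(key, new FloatWritable(maxClosePrice));
}
}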
EXECUTION:
EXECUTION:
INPUT
OUTPUT:
OUTPUT:
Part – 2: Implement stock market program using Map Reduce withOUT combiner logic.
RunnerClass
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
public class RunnerClass {
public static void main(String[] args) throws IOException{
JobConf conf = new JobConf(RunnerClass.class);
conf.setJobName("stocksmax");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(FloatWritable.class);
conf.setMapperClass(MapperClass.class);
conf.setReducerClass(ReducerClass.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf,new Path(args[0]));
FileOutputFormat.setOutputPath(conf,new Path(args[1]));
JobClient.runJob(conf);
}
}
MapperClass
import java.io.IOException;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
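// (As in Part 1, the map body was not captured; a minimal sketch with the same assumed CSV layout: symbol in the second field, closing price in the seventh. The two closing braces below end the map method and the class.)
public class MapperClass extends MapReduceBase implements Mapper<LongWritable, Text, Text, FloatWritable> {
public void map(LongWritable key, Text value, OutputCollector<Text, FloatWritable> output, Reporter reporter) throws IOException {
String[] fields = value.toString().split(",");
output.collect(new Text(fields[1]), new FloatWritable(Float.parseFloat(fields[6])));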
}
}
ReducerClass
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
public class ReducerClass extends MapReduceBase implements
Reducer<Text,FloatWritable,Text,FloatWritable> {
public void reduce(Text key, Iterator<FloatWritable> values,OutputCollector<Text,FloatWritable>
output,
Reporter reporter) throws IOException {
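// (One missing line: the running maximum has to be initialised before the loop below.)
float maxClosePrice = Float.MIN_VALUE;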
while (values.hasNext()) {
maxClosePrice = Math.max(maxClosePrice, values.next().get());
}
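// (The end of the listing was not captured; emit the maximum closing price and close the method and the class.)
output.collect(key, new FloatWritable(maxClosePrice));
}
}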
EXECUTION
Input
OUTPUT
POSTLAB-
1. You have a directory ProjectPro that has the following files – HadoopTraining.txt,
_SparkTraining.txt, #DataScienceTraining.txt, .SalesforceTraining.txt. If you pass the ProjectPro
directory to the Hadoop MapReduce jobs, how many files are likely to be processed?
ANS -
Only two files are likely to be processed: HadoopTraining.txt and #DataScienceTraining.txt. FileInputFormat uses a default hiddenFileFilter that ignores every file whose name starts with "_" or ".", so _SparkTraining.txt and .SalesforceTraining.txt are skipped. The default filter is essentially:
private static final PathFilter hiddenFileFilter = new PathFilter() {
public boolean accept(Path p) {
String name = p.getName();
return !name.startsWith("_") && !name.startsWith(".");
}
};
However, we can set our own custom filter with FileInputFormat.setInputPathFilter to add further criteria, but remember that hiddenFileFilter is always active.
2.KL University has conducted some weekly tests to the students who have opted Parallel and
Distributed Computing course. Now you are given with the marks scored by those students as
the input file. Find the least mark scored by each student.
CODE
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
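The class bodies were not captured in the record; a minimal consolidated sketch using the imports above, assuming each input line holds a student name followed by that student's marks (class names are illustrative):
public class MinMark {
public static class MinMarkMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
@Override
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String[] parts = value.toString().split("\\s+"); // name mark1 mark2 ...
for (int i = 1; i < parts.length; i++) {
context.write(new Text(parts[0]), new IntWritable(Integer.parseInt(parts[i])));
}
}
}
public static class MinMarkReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
@Override
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
int min = Integer.MAX_VALUE;
for (IntWritable v : values) {
min = Math.min(min, v.get()); // keep the lowest mark seen for this student
}
context.write(key, new IntWritable(min));
}
}
public static void main(String[] args) throws Exception {
Job job = Job.getInstance();
job.setJarByClass(MinMark.class);
job.setJobName("Least mark per student");
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(MinMarkMapper.class);
job.setReducerClass(MinMarkReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}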
OUTPUT
PARALLEL AND DISTRIBUTED COMPUTING
LAB - 11
PRELAB-
1. Imagine that you are uploading a file of 500 MB into HDFS. 100 MB of data has been successfully uploaded into HDFS and another client wants to read the uploaded data while the upload is still in progress. What will happen in such a scenario? Will the 100 MB of data that has been uploaded be displayed?
Although the default block size is 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x, in this scenario let us consider the block size to be 100 MB, which means that we are going to have 5 blocks replicated 3 times (the default replication factor). Let's consider an example of how a block is written to HDFS:
We have 5 blocks (A/B/C/D/E) for a file, a client, a namenode and a datanode. First the client takes Block A and approaches the namenode for the datanode locations to store this block and its replicated copies. Once the client knows the datanode information, it reaches out to the datanode directly and starts copying Block A, which is simultaneously replicated to the other 2 datanodes. Once the block is copied and replicated to the datanodes, the client gets the confirmation about Block A's storage and then initiates the same process for the next block, "Block B".
So, during this process, if the first block of 100 MB has been written to HDFS and the next block has been started by the client, the first block will be visible to readers. Only the block currently being written will not be visible to the readers.
2. Why does the Namenode enter safe mode during startup, and what happens in it?
The Namenode is responsible for managing the metadata storage of the cluster, and if something is missing from the cluster then the Namenode holds the cluster in safe mode. This makes the Namenode check all the necessary information during safe mode before making the cluster writable to users. There are a couple of reasons for the Namenode to enter safe mode during startup:
i) The Namenode loads the filesystem state from the fsimage and edits log files, and then waits for the datanodes to report their blocks, so that it does not start replicating blocks which already exist in the cluster.
ii) It waits for heartbeats from all the datanodes and checks whether any corrupt blocks exist in the cluster. Once the Namenode has verified all this information, it leaves safe mode and makes the cluster accessible. Sometimes we need to manually enter or leave safe mode for the Namenode, which can be done from the command line with "hdfs dfsadmin -safemode enter/leave".
INLAB:
1. Rio took two files, M and N, containing matrices of different orders. With the help of these two files, calculate the matrix multiplication using a combiner, with the help of the MapReduce concept.
Step 1: start Hadoop by giving the command startCDH.sh in terminal.
Step 2: Open Eclipse in the Desktop.
Step 3: Create new java project
Step 4: Click on new -> Class
In general, for a map reduce program we create 3 classes, such as Mapper Class, Reducer Class, Runner Class
Step 5: Add external libraries
Step 6: Right click on "src"
Click on Build Path -> Click on Configure Build Path
Step 7:
Create a directory named LAB11 and put the input files, i.e. M and N, in the directory
Step 8:
hadoop@hadoop-laptop:~/Desktop$ hadoop jar matrix.jar MatrixMultiply /matrix/* /LAB11/190030324out
Step 9: Open the Mozilla Firefox browser and go to the matrix directory to see the output
Output:
POSTLAB-
1. If 8 TB is the available disk space per node (i.e., 10 disks of 1 TB each, with 2 disks excluded for the operating system etc.), then how will you estimate the number of data nodes (n)? (Assume the initial size of the data is 600 TB.)
In the Hadoop environment, the estimation of hardware requirements is challenging because the data in an organization can grow at any time. Thus, one must size the cluster properly for the current scenario, which depends on the following factors:
Steps to find the number of the data-nodes which are required to store 600TB data:
Intermediate data: 1
• Case: it has been predicted that the data increases by 20% per quarter, and what we need to predict is how many new machines have to be added in a particular year
Additional data:
Additional data-nodes:
The default replication factor is 3 and the default block-size is 128MB in Hadoop 2.x.
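A rough worked estimate under common assumptions (replication factor 3, about 25% of extra capacity kept free for intermediate and temporary data, 8 TB usable per node):
Raw storage needed = 600 TB x 3 = 1800 TB
With ~25% headroom = 1800 TB x 1.25 = 2250 TB
Number of data nodes n = 2250 / 8 ≈ 282, i.e. roughly 280 nodes to start with, with about 20% more capacity planned every quarter as the data grows.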
a b c d e
Shuffle Phase-Once the first map tasks are completed, the nodes continue to perform
several other map tasks and also exchange the intermediate outputs with the reducers as
required. This process of moving the intermediate outputs of map tasks to the reducer is
referred to as Shuffling.
Sort Phase- Hadoop MapReduce automatically sorts the set of intermediate keys on a
single node before they are given as input to the reducer.
Partitioning Phase - The process that determines which intermediate keys and values will be received by each reducer instance is referred to as partitioning. The destination partition is the same for any given key, irrespective of the mapper instance that generated it.
• A new class must be created that extends the pre-defined Partitioner Class.
• getPartition method of the Partitioner class must be overridden.
• The custom partitioner can be added to the job as a config file in the wrapper that runs Hadoop MapReduce, or it can be added to the job programmatically by using the job's setPartitionerClass method.
INLAB:
1. Map Reduce logic for online music store using combiner logic.
UniqueListenersReducer
UniqueListenersMapper
UniqueListeners
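The class bodies were not captured; a minimal sketch, assuming each input line has the form userId|trackId|shared|radio|skip (a common layout for this exercise) and that we count the number of distinct listeners per track. The combiner forwards each distinct userId once instead of pre-aggregating counts, so the reducer still sees userIds:
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class UniqueListeners {
public static class UniqueListenersMapper extends Mapper<Object, Text, IntWritable, IntWritable> {
@Override
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
String[] parts = value.toString().split("[|]"); // userId|trackId|shared|radio|skip (assumed layout)
int userId = Integer.parseInt(parts[0].trim());
int trackId = Integer.parseInt(parts[1].trim());
context.write(new IntWritable(trackId), new IntWritable(userId));
}
}
public static class UniqueListenersCombiner extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
@Override
public void reduce(IntWritable trackId, Iterable<IntWritable> userIds, Context context) throws IOException, InterruptedException {
Set<Integer> seen = new HashSet<Integer>();
for (IntWritable userId : userIds) {
if (seen.add(userId.get())) {
context.write(trackId, userId); // forward each distinct userId only once
}
}
}
}
public static class UniqueListenersReducer extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
@Override
public void reduce(IntWritable trackId, Iterable<IntWritable> userIds, Context context) throws IOException, InterruptedException {
Set<Integer> listeners = new HashSet<Integer>(); // distinct user ids for this track
for (IntWritable userId : userIds) {
listeners.add(userId.get());
}
context.write(trackId, new IntWritable(listeners.size()));
}
}
public static void main(String[] args) throws Exception {
Job job = Job.getInstance(new Configuration(), "Unique listeners per track");
job.setJarByClass(UniqueListeners.class);
job.setMapperClass(UniqueListenersMapper.class);
job.setCombinerClass(UniqueListenersCombiner.class);
job.setReducerClass(UniqueListenersReducer.class);
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}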
Input
Execution
Output
POSTLAB-
1. Considering the dataset below, find the dates on which more than one transaction occurred, by using MapReduce.
• https://www.kaggle.com/jensroderus/salesjan2009csv
Codes:
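The mapper/reducer/driver listings were not captured; a minimal sketch, assuming the first comma-separated field of SalesJan2009.csv is the transaction date (e.g. "1/2/09 6:17") and skipping the header row:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class RepeatTransactionDates {
public static class DateMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
private static final IntWritable ONE = new IntWritable(1);
@Override
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
if (line.startsWith("Transaction_date")) {
return; // skip the CSV header
}
String date = line.split(",")[0].split(" ")[0]; // keep only the date part of "1/2/09 6:17"
context.write(new Text(date), ONE);
}
}
public static class DateReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
@Override
public void reduce(Text date, Iterable<IntWritable> counts, Context context) throws IOException, InterruptedException {
int total = 0;
for (IntWritable c : counts) {
total += c.get();
}
if (total > 1) { // only dates with more than one transaction
context.write(date, new IntWritable(total));
}
}
}
public static void main(String[] args) throws Exception {
Job job = Job.getInstance(new Configuration(), "Dates with multiple transactions");
job.setJarByClass(RepeatTransactionDates.class);
job.setMapperClass(DateMapper.class);
job.setReducerClass(DateReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}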
➔ Upload the hadoop-core-1.2.1 jar file via Build Path → Configure Build Path → Add External Jars, then upload this jar file there.