PDC All Labs

PARALLEL AND DISTRIBUTED COMPUTING

LAB - 1
PRELAB -

1. What is Hadoop Map Reduce?How Hadoop MapReduce works?


ANS -
The Hadoop MapReduce framework is used for processing large data sets in parallel across a Hadoop
cluster. Data analysis uses a two-step map and reduce process.
In the classic word-count example, the map phase counts the words in each document, while the
reduce phase aggregates that data per word across the entire collection. During the map phase, the
input data is divided into splits that are analysed by map tasks running in parallel across the
Hadoop framework.

2. Name one major drawback of Hadoop?


ANS -
One major drawback of Hadoop is its limited security. Written in Java and developed in the open,
it is heavily vulnerable to attacks. This is a serious problem since critical data is stored and
processed here.

3. How do you fix NameNode when it is down?


ANS -
The following steps can be followed to fix a failed NameNode:
Use FsImage, the file system metadata replica, to start a new NameNode.
Configure the DataNodes to acknowledge the new NameNode.
The NameNode will begin serving requests and the cluster will return to normal once it has
completely loaded the last FsImage checkpoint.
In some cases, NameNode recovery can take a long time.

INLAB -

1. Write the steps involved in installing Hadoop and execute the basic shell commands.
ANS -
Steps for Hadoop Installation:
a. Download VirtualBox
b. Install VirtualBox
c. Download the Hadoop Training Ubuntu image
d. Open Oracle VM VirtualBox, then click File -> Import Appliance -> Browse -> Hadoop Training v1.0 ->
Import, in order to complete the setup

shell commands:
1) Version Check

To check the version of Hadoop.

$ hadoop version
2) list Command

List all the files/directories for the given hdfs destination path.

$ hdfs dfs -ls /

3) df Command

Displays free space at given hdfs destination

$ hdfs dfs -df hdfs:/

4) count Command

Count the number of directories, files and bytes under the paths that match the specified file
pattern.

$ hdfs dfs -count hdfs:/

5) fsck Command

HDFS Command to check the health of the Hadoop file system.

$ hdfs fsck /
6) balancer Command

Run a cluster balancing utility.

$ hdfs balancer

7) mkdir Command

HDFS Command to create the directory in HDFS.

$ hdfs dfs -mkdir /hadoop

$ hdfs dfs -ls /

8) put Command

File
Copy file from single src, or multiple srcs from local file system to the destination file system.

$ cd Desktop

$ hdfs dfs -put filename /hadoop

$ hdfs dfs -ls /hadoop

Directory

HDFS Command to copy directory from single source, or multiple sources from local file system to
the destination file system.

$ mkdir dirname

$ hdfs dfs -put dirname /hadoop/

$ hdfs dfs -ls /hadoop

9) du Command

Displays the size of files and directories contained in the given directory, or the size of a file
if it's just a file.

$ hdfs dfs -du /


10) rm Command

HDFS Command to remove the file from HDFS.

$ hdfs dfs -rm /hadoop/filename

11) rm -r Command

HDFS Command to remove the entire directory and all of its content from HDFS.

$ hdfs dfs -rm -r /hadoop/hello

12) chmod Command

Change the permissions of files.

$ hdfs dfs -chmod 777 /hadoop

$ hdfs dfs -ls /

13) get Command

HDFS Command to copy files from hdfs to the local file system.

$ hdfs dfs -get /hadoop/filename copy_filename

$ ls -l filename

14) cat Command


HDFS Command that copies source paths to stdout.

$ hdfs dfs -cat /hadoop/filename

15) text Command

HDFS Command that takes a source file and outputs the file in text format.

$ hdfs dfs -text /hadoop/filename

16) mv Command

HDFS Command to move files from source to destination. This command allows multiple sources
as well, in which case the destination needs to be a directory.

$ hdfs dfs -mv /hadoop/filename /tmp

$ hdfs dfs -ls /tmp

17) cp Command

HDFS Command to copy files from source to destination. This command allows multiple sources as
well, in which case the destination must be a directory.

$ hdfs dfs -cp /tmp/filename /exercise

$ hdfs dfs -ls /exercise


18) tail Command

Displays the last kilobyte of the specified file to stdout

$ hdfs dfs -tail /hadoop/filename

19) chown Command

HDFS command to change the owner of files.

$ hdfs dfs -chown root:root /tmp

$ hdfs dfs -ls /

POSTLAB -

1. Write a command that creates an empty file, and give the command that prints the last modified
time of a directory or path. Explain the commands in detail with syntax.

ANS -
Touchz -
Create a file of zero length. An error is returned if the file exists with non-zero length.
Example:
hadoop fs -touchz pathname
Exit Code: Returns 0 on success and -1 on error.

Stat -
stat prints statistics about the given file or directory in a specified format; by default (or with
the %y format) it prints the last modified time.
Example:
hdfs dfs -stat pathname

2. The Hadoop copyFromLocal command is used to copy a file from your local file system to
HDFS (Hadoop Distributed File System). copyFromLocal has an optional switch -f which replaces a
file that already exists in the system, meaning it can be used to update that file. Give the syntax.
ANS -
Step 1: Make a directory in HDFS where you want to copy this file with the below command.
$ hdfs dfs -mkdir /Sample
Step 2: Use copyFromLocal command as shown below to copy it to HDFS /Sample.
$cd Desktop
$ hdfs dfs -copyFromLocal Sample /Sample
Step 3: Check whether the file is copied successfully or not by moving to its directory location
with below command.
$ hdfs dfs -ls /Sample

Note: To update the content of the file or to Overwrite it, you should use -f switch as shown
below.
$ hdfs dfs -copyFromLocal -f Sample /Sample
Output:
PARALLEL AND DISTRIBUTED COMPUTING
LAB - 2
PRELAB -

1. Explain the following methods with syntax :


a)compareTo(IntWritable o)
public int compareTo(IntWritable o)
DESC -
Compares two IntWritables.
Specified by:
compareTo in interface Comparable<IntWritable>

b)equals(Object o)
public boolean equals(Object o)
DESC -
Returns true iff o is an IntWritable with the same value.
Overrides:
equals in class Object

c)get()
public int get()
DESC -
Return the value of this IntWritable.

d)hashCode()
public int hashCode()
DESC -
Overrides:
hashCode in class Object

e)readFields(DataInput in)
public void readFields(DataInput in)
throws IOException
DESC -
Description copied from interface: Writable
Deserialize the fields of this object from in.
For efficiency, implementations should attempt to re-use storage in the existing object where
possible.

Specified by:
readFields in interface Writable
Parameters:
in - DataInput to deserialize this object from.
Throws:
IOException

f)set(int value)
public void set(int value)
DESC -
Set the value of this IntWritable.

g)toString()
public String toString()
DESC -
Overrides:
toString in class Object

h)write(DataOutput out)
public void write(DataOutput out)
throws IOException
DESC -
Description copied from interface: Writable
Serialize the fields of this object to out.
Specified by:
write in interface Writable
Parameters:
out - DataOutput to serialize this object into.
Throws:
IOException

2. Answer the following questions :


a)What is IntWritable in MapReduce Hadoop?
ANS - IntWritable is the Wrapper Class/Box Class in Hadoop similar to Integer Class in Java.
IntWritable is the Hadoop flavour of Integer, which is optimized to provide serialization in Hadoop.
b)What is the need of IntWritable?
ANS - Java serialization is too heavy for Hadoop, hence the box classes in Hadoop implement
serialization through the interface called Writable.
c)Why does Hadoop need IntWritable instead of int?
ANS - Writable can serialize the object in a very light way.
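
A quick illustration of the methods listed above, as a minimal plain-Java sketch (outside any MapReduce job; the class name IntWritableDemo is illustrative). The byte-array streams are used only to show write() and readFields() round-tripping a value.

import java.io.*;
import org.apache.hadoop.io.IntWritable;

public class IntWritableDemo {
    public static void main(String[] args) throws IOException {
        IntWritable a = new IntWritable();
        a.set(42);                                            // set(int value)
        IntWritable b = new IntWritable(42);
        System.out.println(a.get());                          // get() -> 42
        System.out.println(a.equals(b));                      // true, same value
        System.out.println(a.compareTo(new IntWritable(7)));  // positive, since 42 > 7

        // write() serializes the value; readFields() deserializes it back
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        a.write(new DataOutputStream(bytes));
        IntWritable c = new IntWritable();
        c.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        System.out.println(c);                                // toString() -> "42"
    }
}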

INLAB -

1. Mention the steps to be followed to execute a Map-Reduce program in Hadoop.

ANS -

Steps -

1. Create a new Java project.

2. Create 3 Java files under src, in the default package or any other package of your choice:
-Runner
-Mapper
-Reducer

3. Add the external jar files
Right click on Project -> Build Path -> Configure Build Path -> Add External JARs -> select the jar files -> OK

4. Make a jar file
Right click on Project -> Export -> select export destination as JAR file -> Next -> Finish

5. Check whether the jar file is created or not.

6. For convenience, paste that jar file on the Desktop.

EXECUTING THE PROGRAM - (In Terminal)

7. Open the terminal and start Hadoop.

$ startCDH.sh

8. $ cd Desktop

9. Place the input file into a particular directory

$ hdfs dfs -put filename /dir_name/filename_in_HDFS

10. Run the code

$ hadoop jar jarfile.jar driverclass /dir_name/filename /dir_name/OUTPUT_filename

POSTLAB -

1.Write the algorithm for map reduce and apply map reduce word count on the given data

Dear, Bear, River, Car, Car, River, Deer, Car and Bear
ANS -
The Algorithm
• Generally, the MapReduce paradigm is based on sending the computation to where the data
resides.
• MapReduce program executes in three stages, namely map stage, shuffle stage, and reduce
stage.
o Map stage − The map or mapper’s job is to process the input data. Generally the
input data is in the form of file or directory and is stored in the Hadoop file system
(HDFS). The input file is passed to the mapper function line by line. The mapper
processes the data and creates several small chunks of data.
o Reduce stage − This stage is the combination of the Shuffle stage and
the Reduce stage. The Reducer’s job is to process the data that comes from the
mapper. After processing, it produces a new set of output, which will be stored in
the HDFS.
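
Applying word count to the given line can be traced without a cluster. The following is a minimal plain-Java sketch (not the Hadoop API; the class name WordCountTrace is illustrative) that mimics the three stages on that single line: the map stage emits (word, 1) pairs, the shuffle stage groups them by key, and the reduce stage sums each group.

import java.util.*;

public class WordCountTrace {
    public static void main(String[] args) {
        String line = "Dear, Bear, River, Car, Car, River, Deer, Car and Bear";

        // Map stage: emit a (word, 1) pair for every token in the line
        List<Map.Entry<String, Integer>> mapped = new ArrayList<Map.Entry<String, Integer>>();
        for (String word : line.split("\\W+")) {
            mapped.add(new AbstractMap.SimpleEntry<String, Integer>(word, 1));
        }

        // Shuffle + Reduce stage: group the pairs by key and sum the values per key
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        for (Map.Entry<String, Integer> pair : mapped) {
            Integer soFar = counts.get(pair.getKey());
            counts.put(pair.getKey(), (soFar == null ? 0 : soFar) + pair.getValue());
        }

        // Expected result: {Bear=2, Car=3, Dear=1, Deer=1, River=2, and=1}
        System.out.println(counts);
    }
}

Running it prints {Bear=2, Car=3, Dear=1, Deer=1, River=2, and=1}, which is the aggregated result the reduce phase would produce for this line.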
PARALLEL AND DISTRIBUTED COMPUTING
LAB - 3
PRELAB -

1. What is the purpose of MapReduce?


ANS -
MapReduce serves two essential functions: it filters and parcels out work to various nodes within
the cluster or map, a function sometimes referred to as the mapper, and it organizes and reduces
the results from each node into a cohesive answer to a query, referred to as the reducer.

2. Consider the checkout counter at a large supermarket chain. For each item
sold, it generates a record of the form [ProductId, Supplier, Price]. Here, ProductId is
the unique identifier of a product, Supplier is the supplier name of the product and Price
is the sales price for the item. Assume that the supermarket chain has accumulated many
terabytes of data over a period of several months.
The CEO wants a list of suppliers, listing for each supplier the average sales price of items
provided by the supplier. How would you organize the computation using the Map-Reduce
computation model?

ANS -
SELECT AVG(Price) FROM DATA GROUP BY Supplier
Pseudo code :
map(key, record):
output [record(SUPPLIER),record(PRICE)]
reduce(SUPPLIER, list of PRICE):
emit [SUPPLIER, AVG(PRICE)]
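
A hedged Java sketch of this job using the new-API Mapper and Reducer classes that appear in later labs; the class names and the assumption that each record is a comma-separated line laid out as ProductId,Supplier,Price are illustrative.

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emit (Supplier, Price) for each [ProductId, Supplier, Price] record
class AvgPriceMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");   // assumed CSV layout
        context.write(new Text(fields[1]), new DoubleWritable(Double.parseDouble(fields[2])));
    }
}

// Reducer: average all prices seen for one supplier
class AvgPriceReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    public void reduce(Text supplier, Iterable<DoubleWritable> prices, Context context)
            throws IOException, InterruptedException {
        double sum = 0;
        long count = 0;
        for (DoubleWritable price : prices) {
            sum += price.get();
            count++;
        }
        context.write(supplier, new DoubleWritable(sum / count));
    }
}

Note that, unlike word count, this reducer cannot be reused directly as a combiner, because an average of per-split averages is not, in general, the overall average.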

3.Mention what are the main configuration parameters that user need to specify to run
MapReduce Job?

ANS -
The user of the MapReduce framework needs to specify

Job’s input locations in the distributed file system


Job’s output location in the distributed file system
Input format
Output format
Class containing the map function
Class containing the reduce function
JAR file containing the mapper, reducer and driver classes
INLAB -

1. Siri, a third-year B.Tech student, wants to read and write data in HDFS using the Java APIs.
Help her with this task.
ANS -

READ CLASS - Read to Local space

import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
public class Read{
public static void main(String[] args) throws IOException
{
Configuration conf = new Configuration();
// Configuration() is a constructor whenever constructor is invoked the obj will be assigned with
memory
String localpath="/home/hadoop/Desktop/samplecopy";
String uri= "hdfs://localhost:8020";
String Hdfspath="hdfs://localhost:8020/lab/sample";
FileSystem fs =FileSystem.get(URI.create(Hdfspath),conf);
fs.copyToLocalFile(new Path(Hdfspath),new Path(localpath));
}
}

OUTPUT -
WRITE CLASS - write to HDFS

import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
public class Write
{
public static void main(String[] args) throws IOException
{
Configuration conf = new Configuration();
String localpath="/home/hadoop/Desktop/sample2";
String uri= "hdfs://localhost:8020";
String Hdfspath="hdfs://localhost:8020/lab";
FileSystem fs=FileSystem.get(URI.create(uri),conf);
fs.copyFromLocalFile(new Path(localpath),new Path(Hdfspath));
}
}

COMMAND -

OUTPUT -
POSTLAB -

1. Your friend asked you to help him by explaining, with the help of a diagram, the steps of how
HDFS stores data. Help your friend understand the concept.

Note: If the data in the FS Data Output Stream is large,

• it is divided into blocks of a few MB each,
• the NameNode allocates DataNodes to each of the divided blocks, and
• the FS Data Output Stream sends the data to those DataNodes.
Steps HDFS follows to store a file:
Step 1: The Hadoop client sends a request to the NameNode to create a file.
- The client sends the request along with the path.
Step 2: The NameNode checks whether the directory exists and whether the client has permission.
- The NameNode maintains an image of the entire HDFS namespace in memory, called the FsImage.
Step 3: If all checks pass, the NameNode creates a new file and returns a message to the client.
- The new file is created but contains no data yet; the client will now write into the file.
Step 4: The client creates an FS Data Output Stream to write the data (a small Java sketch of this
stream-based write follows these steps).
- It buffers the data up to a certain size, called a block.
Step 5: The FS Data Output Stream then requests the NameNode to allocate the block.
Step 6: The NameNode does not store the data; it gives the client the DataNodes on which the block
can be placed.
Step 7: Now the streamer knows where to send the data.
- If the data is large, it is divided into blocks of a few MB each, which are allocated to different nodes.
Step 8: After completion, the file is closed and the changes are committed on the DataNodes.
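
A minimal Java sketch of the client side of this flow, assuming the same hdfs://localhost:8020 setup used in the INLAB programs (the path and class name are illustrative): FileSystem.create() contacts the NameNode and returns an FSDataOutputStream, and the data written to that stream is what gets packaged into blocks and shipped to the DataNodes.

import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StreamWrite {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        String hdfsPath = "hdfs://localhost:8020/lab/streamsample";  // illustrative path
        FileSystem fs = FileSystem.get(URI.create(hdfsPath), conf);

        // Steps 1-3: create() goes to the NameNode, which checks the directory
        // and permissions and returns a handle to an empty file
        FSDataOutputStream out = fs.create(new Path(hdfsPath));

        // Steps 4-7: data written here is buffered into blocks and streamed
        // to the DataNodes chosen by the NameNode
        out.writeUTF("hello hdfs");

        // Step 8: closing the stream completes the blocks and commits the file
        out.close();
        fs.close();
    }
}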

2. Write the applications of Hadoop MapReduce that can be seen in our daily life.

ANS -
E-commerce:
E-commerce companies such as Walmart, E-Bay, and Amazon use MapReduce to analyze buying
behavior. MapReduce provides meaningful information that is used as the basis for developing
product recommendations. Some of the information used include site records, e-commerce
catalogs, purchase history, and interaction logs.
Social networks

The MapReduce programming tool can evaluate certain information on social media platforms
such as Facebook, Twitter, and LinkedIn. It can evaluate important information such as who liked
your status and who viewed your profile.

Entertainment

Netflix uses MapReduce to analyze the clicks and logs of online customers. This information helps
the company suggest movies based on customers’ interests and behavior.

Conclusion

MapReduce is a crucial processing component of the Hadoop framework. It’s a quick, scalable, and
cost-effective program that can help data analysts and developers process huge data.
PARALLEL AND DISTRIBUTED COMPUTING
LAB - 4
PRELAB -

1. What are the types of InputFormat in MapReduce?


ANS - The eight different types of InputFormat in MapReduce are:

I. FileInputFormat
II. TextInputFormat
III. SequenceFileInputFormat
IV. SequenceFileAsTextInputFormat
V. SequenceFileAsBinaryInputFormat
VI. DBInputFormat
VII. NLineInputFormat
VIII. KeyValueTextInputFormat

2. Why is the output of map tasks stored on the local disk and not in HDFS?

ANS - The output of a map task is the intermediate key-value pairs, which are then processed by the
reducer to produce the final aggregated result. Once a MapReduce job is completed, there is no
need for the intermediate output produced by the map tasks. Therefore, storing this intermediate
output in HDFS and replicating it would create unnecessary overhead.

3. How would you split data into Hadoop?


ANS - Splits are created with the help of the InputFormat. Once the splits are created, the number
of mappers is decided based on the total number of splits. The splits are created according to the
programming logic defined within the getSplits() method of InputFormat, and it is not bound to the
HDFS block size.

The split size is calculated according to the following formula.

Split size = input file size/ number of map tasks

4. What is distributed Cache in MapReduce Framework? Explain.


ANS - The distributed cache is an important part of the MapReduce framework. It is used to cache
files needed by jobs during execution and ensures that tasks are performed faster. The framework
uses the distributed cache to store important files that are frequently required to execute tasks
at a particular node.
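
As a hedged illustration of registering and reading a cached file with the newer mapreduce API used in later labs (the file path and class names are illustrative, and a real job would also configure its input format and logic as needed):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CacheDemoDriver {
    // Mapper that lists the cached files in setup(); the map body just passes
    // lines through, since the distributed-cache calls are the point of the example
    public static class CacheMapper extends Mapper<Object, Text, Text, IntWritable> {
        protected void setup(Context context) throws java.io.IOException, InterruptedException {
            URI[] cached = context.getCacheFiles();   // URIs registered by the driver
            if (cached != null) {
                for (URI u : cached) {
                    System.out.println("cached file available to this task: " + u);
                }
            }
        }
        public void map(Object key, Text value, Context context)
                throws java.io.IOException, InterruptedException {
            context.write(value, new IntWritable(1));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "cache demo");
        job.setJarByClass(CacheDemoDriver.class);
        job.setMapperClass(CacheMapper.class);
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Register a small lookup file with the distributed cache (path is illustrative)
        job.addCacheFile(new URI("/lookup/stopwords.txt"));
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The cached file is shipped to each task node once, rather than being read repeatedly from HDFS by every task.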
INLAB -

1. Compute the sum of numbers using Map Reduce.

MAPPER CLASS -

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
public class SumMapper extends MapReduceBase implements
Mapper<LongWritable,Text,Text,IntWritable>
{
public void map(LongWritable key, Text value,OutputCollector<Text,IntWritable> output,
Reporter reporter) throws IOException
{
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens())
{
String s=tokenizer.nextToken("+");
int p=Integer.parseInt(s);
output.collect(value,new IntWritable(p));
}
}
}

REDUCER CLASS -

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SumReducer extends MapReduceBase implements


Reducer<Text,IntWritable,Text,IntWritable>
{
public void reduce(Text key, Iterator<IntWritable> values,OutputCollector<Text,IntWritable>
output,Reporter reporter) throws IOException
{
int sum=0;
while (values.hasNext())
{
sum+=values.next().get();
}
output.collect(key,new IntWritable(sum));
}
}

DRIVER CLASS -

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat ;

public class Sum


{
public static void main(String[] args) throws IOException
{
JobConf conf = new JobConf(Sum.class);
conf.setJobName("Sumofthedigits");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(SumMapper.class);
conf.setReducerClass(SumReducer.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf,new Path(args[0]));
FileOutputFormat.setOutputPath(conf,new Path(args[1]));
JobClient.runJob(conf);
}
}
INPUT FILE -

OUTPUT FILE -
POSTLAB -

1. Create a file which includes a minimum of 3 lines of words/characters/info, And then write a
map reduce program in order to find the character count i.e. you have to find out the number of
times a character has appeared in the file you created.
ANS -
Driver Class:
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
public class CharCountDriver {
public static void main(String[] args)
throws IOException
{
JobConf conf = new JobConf(CharCountDriver.class);
conf.setJobName("CharCount");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(CharCountMapper.class);
conf.setCombinerClass(CharCountReducer.class);
conf.setReducerClass(CharCountReducer.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf,
new Path(args[0]));
FileOutputFormat.setOutputPath(conf,
new Path(args[1]));
JobClient.runJob(conf);
}
}

Mapper Class:
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
public class CharCountMapper extends MapReduceBase implements
Mapper<LongWritable,Text,Text,IntWritable>{
public void map(LongWritable key, Text value,OutputCollector<Text,IntWritable> output,

Reporter reporter) throws IOException{


String line = value.toString();
String tokenizer[] = line.split("");
for(String SingleChar : tokenizer)
{
Text charKey = new Text(SingleChar);
IntWritable One = new IntWritable(1);
output.collect(charKey, One);
}
}
}

Reducer Class:
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class CharCountReducer extends MapReduceBase


implements Reducer<Text, IntWritable, Text,
IntWritable> {
public void
reduce(Text key, Iterator<IntWritable> values,
OutputCollector<Text, IntWritable> output,
Reporter reporter) throws IOException
{
int sum = 0;
while (values.hasNext()) {
sum += values.next().get();
}
output.collect(key, new IntWritable(sum));
}
}
Output:
PARALLEL AND DISTRIBUTED COMPUTING
LAB - 5
PRELAB -

1. Explain these terminologies ?


1) PayLoad
2) Mapper
3) NameNode
4) DataNode
5) MasterNode
6) SlaveNode
7) JobTracker
8) TaskTracker
9) Job
10)Task
11)Task Attempt

PayLoad − Applications implement the Map and the Reduce functions, and form the core of the
job.

Mapper − Mapper maps the input key/value pairs to a set of intermediate key/value pair.

NameNode − Node that manages the Hadoop Distributed File System (HDFS).

DataNode − Node where data is presented in advance before any processing takes place.

MasterNode − Node where JobTracker runs and which accepts job requests from clients.

SlaveNode − Node where Map and Reduce program runs.

JobTracker − Schedules jobs and tracks the assigned jobs to the TaskTracker.

Task Tracker − Tracks the task and reports status to JobTracker.

Job − A program is an execution of a Mapper and Reducer across a dataset.

Task − An execution of a Mapper or a Reducer on a slice of data.

Task Attempt − A particular instance of an attempt to execute a task on a SlaveNode.

2. What is Hadoop Combiner?


ANS - When we run a MapReduce job on a large dataset, the Mapper generates large chunks of
intermediate data, and this intermediate data is passed on to the Reducer for further processing,
which leads to enormous network congestion. The MapReduce framework provides a function known as
the Hadoop Combiner that plays a key role in reducing that network congestion.
We have already seen what the mapper and the reducer are in Hadoop MapReduce; the next step is the
Hadoop MapReduce Combiner.
The combiner in MapReduce is also known as a ‘mini-reducer’. The primary job of the Combiner is to
process the output data from the Mapper before passing it to the Reducer. It runs after the Mapper
and before the Reducer, and its use is optional.

3.What are the advantages of a Map reduce Combiner?


ANS - Having discussed the Hadoop MapReduce Combiner in detail, here are some advantages of the
MapReduce Combiner.
• Hadoop Combiner reduces the time taken for data transfer between mapper and reducer.
• It decreases the amount of data that needed to be processed by the reducer.
• The Combiner improves the overall performance of the reducer.

INLAB -

1. Implement word count map reduce program without combiner


ANS -
1. Open Eclipse and create a project “wordcountwithoutcombiner”
2. Under the src folder create 3 Java files:
i) WordCountMapper.java
ii) WordCountReducer.java
iii) WordCount.java
WordCountMapper.java:
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
public class WordCountMapper extends MapReduceBase implements
Mapper<LongWritable,Text,Text,IntWritable>{
public void map(LongWritable key, Text value,OutputCollector<Text,IntWritable> output,
Reporter reporter) throws IOException{
String line = value.toString(); // used to convert text class to string
StringTokenizer tokenizer = new StringTokenizer(line); // class used to divide into tokens
while (tokenizer.hasMoreTokens()){
String p=(tokenizer.nextToken());//string will be divided into tokens and stored in the
object p
output.collect(new Text(p), new IntWritable(1)); // used to collect the data in the form of
Text and IntWritable(ex:hi,1)
}
}

}
WordCountReducer.java:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCountReducer extends MapReduceBase implements


Reducer<Text,IntWritable,Text,IntWritable> {
public void reduce(Text key, Iterator<IntWritable> values,OutputCollector<Text,IntWritable>
output,
Reporter reporter) throws IOException {
int sum=0;
while (values.hasNext()) { // (hi,[1,1,1,1]) is the input to the iterator;
// values.hasNext() checks whether more data exists (it returns a boolean)
sum+=values.next().get(); // values.next() takes the value 1 from the iterator,
// but it is IntWritable data, so we call .get() to convert it to int, because sum is of type int

}
output.collect(key,new IntWritable(sum));
}
}

WordCount.java:

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WordCount{


public static void main(String[] args) throws IOException{
JobConf conf =new JobConf(WordCount.class);
// defining job configuration
conf.setJobName("WordCount1"); // used to give name to the job
conf.setInputFormat(TextInputFormat.class); // take input from the file in text format
conf.setOutputFormat(TextOutputFormat.class); // take output from the file in textformat
conf.setOutputKeyClass(Text.class); // output key format
conf.setOutputValueClass(IntWritable.class);
// output value format
conf.setMapperClass(WordCountMapper.class);conf.setReducerClass(WordCountReducer.class);
FileInputFormat.setInputPaths(conf,new Path(args[0])); // a.txt file (input file)
FileOutputFormat.setOutputPath(conf,new Path(args[1])); //output directory out
JobClient.runJob(conf); // used to start the job
}
}

3. Add required jar files, Right click on src-> build path -> configured build path ->
libraries -> External jarfiles
now add the jarfiles and click on OK

4. Convert the project as jar file


Folder-> right click-> go to export->go to java->select
Jar->give some name and click on ok

5. Create a input file “myinput” on Desktop

6. Open terminal and start hadoop “startCDH.sh”


7. Put the input file on HDFS
i) Create a directory hdfs dfs -mkdir “dir_name”
ii) Now insert the input file in new created directory hdfs dfs -put myinput
dir_name/myinput
8. Now execute the mapreduce program
hadoop jar <created jar name> <MainClassName> <HDFS input filename> <output directory name>
hadoop jar output.jar WordCount dir_name/myinput output2
9. Run the program
POSTLAB -

1. Implement word count map reduce program with combiner


ANS -

1. Open Eclipse and create a project “wordcountwithcombiner”


2. Under the src folder create 3 Java files:
i) WordcountMapper.java
ii) WordcountReducer.java
iii) WordCountWithCombiner.java
WordcountMapper.java:
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class WordcountMapper extends Mapper <LongWritable, Text, Text, IntWritable>
{
private final static IntWritable one = new IntWritable(1);
//output key k1 format
private Text word = new Text();
public void map(LongWritable key, Text value, Context context) throws IOException,
InterruptedException
{
// convert each line into a string and split the string into words seperated by " " (spaces).
String[] words = value.toString().split(" ") ;
// set word count to 1 for each word encountered in the line.
// Combining similar words into grouped key-value pairs will be handled by Hadoop
framework
// resulting in (k2, v2) pairs where v2 is a collection of items.
for (String str: words)
{
word.set(str);
context.write(word, one);
}
}}
WordcountReducer.java :
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class WordcountReducer extends Reducer <Text, IntWritable, Text, IntWritable>
{
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException,
InterruptedException
{
//counter for total count of each word
int sum = 0;
for (IntWritable val : values)
{
sum += val.get() ; //Summing actual values associated with each key instead of counting the
values
}
result.set(sum);
//finally write (word, total) pairs into Reducer's context.
context.write(key, result);
}
}
WordCountWithCombiner.java:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCountWithCombiner
{
public static void main (String [] args) throws Exception
{
Path inputPath = new Path(args[0]);
Path outputPath = new Path(args[1]);
Configuration conf = new Configuration();
// Job Object configuration
Job job = new Job();
job.setJobName("WordCount4");
job.setJarByClass(WordCountWithCombiner.class);
job.setMapperClass(WordcountMapper.class);
//Set Combiner class as WordcounReducer class.
job.setCombinerClass(WordcountReducer.class);
job.setReducerClass(WordcountReducer.class);
//Set Output Key and value data types
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
//Set Input and Output formats and paths
FileInputFormat.addInputPath(job, inputPath);
FileOutputFormat.setOutputPath(job, outputPath);
System.exit(job.waitForCompletion(true) ? 0:1);
}
}

3. Add required jar files, Right click on src-> build path -> configured build path ->
libraries -> External jarfiles
now add the jarfiles and click on OK

4. Convert the project as jar file


Folder-> right click-> go to export->go to java->select
Jar->give some name and click on ok
5. Create a input file “myinput” on Desktop

6. Open terminal and start hadoop “startCDH.sh”


7. Put the input file on HDFS
iii) Create a directory hdfs dfs -mkdir “dir_name”
iv) Now insert the input file in new created directory hdfs dfs -put myinput
dir_name/myinput
8. Now execute the mapreduce program
hadoop jar <created jar name> <MainClassName> <HDFS input filename> <output directory name>
hadoop jar output1.jar WordCountWithCombiner dir_name/myinput output1
9. Run the program
PARALLEL AND DISTRIBUTED COMPUTING
LAB - 6
PRELAB -

1. Write the syntax of final command to execute word count problem in map reduce by
explaining each key word in it and finally give an example of that command.
ANS -
Syntax:
Hadoop jar <jar file name> <class name> <path of input file in hadoop> <path of output file in
hadoop>
Explanation:
<jar file name>
Exported jar file has to be exported in the current directory and that jar file name will be written
here.
<class name>
Class Name of the java code which is exported as the jar file, name will be written here.
<path of input file in hadoop>
Have to create an input file which contains the input data and we have to push that file into the
Hadoop file system by executing this command
hdfs dfs -put <file name> <path in hadoop>
<path of output file in hadoop>
After completion of executing, the output will be stored in the create the folder in the specific path
that you have given the command.
In it you can able to see the answer and the executed message.

Example:
hadoop jar wordcountdemo.jar WordCount /test/data.txt /text/r_output

2. What is partitioner in Hadoop explain it in detail ?


ANS -
A partitioner works like a condition in processing an input dataset. The partition phase takes place
after the Map phase and before the Reduce phase.
The number of partitioners is equal to the number of reducers. That means a partitioner will divide
the data according to the number of reducers. Therefore, the data passed from a single partitioner
is processed by a single Reducer.
A partitioner partitions the key-value pairs of intermediate Map-outputs. It partitions the data
using a user-defined condition, which works like a hash function. The total number of partitions is
same as the number of Reducer tasks for the job. Let us take an example to understand how the
partitioner works.
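
For instance, the default HashPartitioner derives the partition number from the key's hashCode, so every occurrence of the same key lands on the same reducer. A minimal sketch of that formula, written as a standalone partitioner class (the class name is illustrative):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Same formula the default HashPartitioner uses: mask off the sign bit,
// then take the remainder modulo the number of reduce tasks
public class HashLikePartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

With 3 reducers, a key whose hashCode is 17 goes to partition 17 % 3 = 2, and it always goes there, which is exactly the "same key goes to the same reducer" behaviour described above.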

3. Explain about the Poor Partitioning in Hadoop MapReduce ?


ANS -
Sometimes, in the input data, one key appears far more often than any other key. In such a case, we
use two mechanisms to send data to partitions:

• The key appearing most often will be sent to one partition.

• All the other keys will be sent to partitions according to their hashCode().

But if the hashCode() method does not uniformly distribute the other keys' data over the partition
range, the data will not be sent evenly to the reducers. Poor partitioning of data means that some
reducers will have more input data than others, i.e. they will have more work to do than other
reducers. So the entire job will wait for one reducer to finish its extra-large share of the load.

4. How to overcome poor partitioning in MapReduce?


ANS -
To overcome poor partitioner in Hadoop MapReduce, we can create Custom partitioner, which
allows sharing workload uniformly across different reducers.

INLAB -

1. Implement word count map reduce program using partitioner


ANS -
1. Open Eclipse and create a project “wordcountpartitioner”
2. Under src folder create 4 java files with
i) MyCustomPartitioner.java
ii) WordcountMapper.java
iii) WordcountReducer.java
iv) WordCountDriver.java
MyCustomPartitioner.java:
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;
public class MyCustomPartitioner extends Partitioner<Text, IntWritable> {
public int getPartition(Text key, IntWritable value, int numOfReducers) {
String str = key.toString();
if(str.charAt(0) == 's'){
return 0;
}
if(str.charAt(0) == 'k'){
return 1%numOfReducers;
}
if(str.charAt(0) == 'c'){
return 2%numOfReducers;
}
else{
return 3%numOfReducers;
}
}}
WordCountMapper.java:
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Mapper.Context;
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString();
for(String word:line.split("\\W+")){
if(word.length()>0){
context.write(new Text(word), new IntWritable(1));
}
}
}
}
WordCountReducer.java:
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.Reducer.Context;
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable>{
@Override
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException,
InterruptedException {
int count = 0;
for(IntWritable value: values){
count +=value.get();
}
context.write(key, new IntWritable(count));
}}
WordCountDriver.java:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class WordCountDriver extends Configured implements Tool{
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
int exitCode = ToolRunner.run(conf, new WordCountDriver(), args);
System.exit(exitCode);
}
public int run(String[] args) throws Exception {
if (args.length != 2) {
System.exit(-1);
}
/*
* Instantiate a Job object for your job's configuration.
*/
Job job = Job.getInstance(getConf());
job.setJarByClass(WordCountDriver.class);
job.setJobName("Word Count Job");
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(WordCountMapper.class);
job.setReducerClass(WordCountReducer.class);
job.setPartitionerClass(MyCustomPartitioner.class);
job.setNumReduceTasks(5);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
/*
* Specify the job's output key and value classes.
*/
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
/*
* Start the MapReduce job and wait for it to finish. If it finishes successfully, return 0.
If not, return 1.
*/
boolean success = job.waitForCompletion(true);
return (success ? 0 : 1);
}
}

3. Add required jar files, Right click on src-> build path -> configured build path ->
libraries -> External jarfiles now add the jarfiles and click on OK

4. Convert the project as jar file


Folder-> right click-> go to export->go to java->select
Jar->give some name and click on ok

5. Create a input file “myinput” on Desktop


6. Open terminal and start hadoop “startCDH.sh”
7. Put the input file on HDFS
i) Create a directory hdfs dfs -mkdir “dir_name”
ii) Now insert the input file in new created directory hdfs dfs -put myinput
dir_name/myinput
8. Now execute the mapreduce program
hadoop jar <created jar name> <MainClassName> <HDFS input filename> <output directory name>
hadoop jar output3.jar WordCountDriver dir_name/myinput output3
9. Run the program
POSTLAB -

1) Map Reduce with lucky 7


You have given a text document contains only alphabets separated by spaces and you have to
apply the map reduce program to that file to count the repetitions of each alphabet and the
output file contains alphabets whose occurrence is a multiples of 7.

words.txt
abcdefghijklmnopqrstuvwxymnbvcxasdfghjklpoiuytrewqadsfgfhj
klmnbvcxasdfhjjkkooiytresqcxbcncjdhgdfrdwcvdbbcdncjdjchydtg
cfgcbdncdsbcsdbahoubvarueobvyuebvyuebrubdfuhbvodufbvuobf
bbbbaeffffgy

ANS -
Java Codes:
These codes are the mapper class and reducer class and the driver class.
You have to export the driver class into the jar file and have to place the jar file in the desktop.
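
A hedged sketch of what the mapper, reducer and driver could look like for this task, modelled on the character-count program from Lab 4 but using the newer mapreduce API; the class names are illustrative. The only change from a plain letter count is that the reducer writes a letter only when its total is a multiple of 7.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Lucky7Driver {

    // Mapper: emit (letter, 1) for every alphabetic character in the line
    public static class Lucky7Mapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (char c : value.toString().toCharArray()) {
                if (Character.isLetter(c)) {
                    context.write(new Text(String.valueOf(c)), one);
                }
            }
        }
    }

    // Reducer: sum the counts, keep only letters whose total is a multiple of 7
    public static class Lucky7Reducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            if (sum % 7 == 0) {
                context.write(key, new IntWritable(sum));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "lucky7");
        job.setJarByClass(Lucky7Driver.class);
        job.setMapperClass(Lucky7Mapper.class);
        job.setReducerClass(Lucky7Reducer.class);   // no combiner: the %7 filter must only be applied to full totals
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}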

Execution: Firstly you have you place your text file into a directory of Hadoop then you have to
execute the programs as show in the above image.

Files that are created on the directory and the automatic output files created on the folder.
Final Output of the problem:
PARALLEL AND DISTRIBUTED COMPUTING
LAB - 7
PRELAB-

1. Explain the interfaces between the Hadoop cluster and any external network.
Ans:
The interfaces between the Hadoop cluster and any external network are called edge nodes. They are
also called gateway nodes, as they provide access to and from the Hadoop cluster for other
applications. Cluster administration tools and client-side applications are generally the primary
utility of these nodes.

2. Answer the following questions


A. Explain briefly about Hadoop combiner and also mention the need of using Hadoop combiner.
Ans - A Combiner, also known as a semi-reducer, is an optional class that operates by accepting the inputs
from the Map class and thereafter passing the output key-value pairs to the Reducer class.The main
function of a Combiner is to summarize the map output records with the same key. The output (key-value
collection) of the combiner will be sent over the network to the actual Reducer task as input.
MapReduce Combiner plays a key role in reducing network congestion. It decreases the amount of data that
needed to be processed by the reducer. MapReduce combiner improves the overall performance of the
reducer by summarizing the output of Mapper.

B. Consider the given data

Data - a,a,b,b,c,c,d,d,e,e,a,a,f,f,b,b,c,c,d,d,e,e,f,f

Draw the 2 flow diagrams of the MapReduce program to implement word count logic with Combiner and
without combiner in between Mapper and Reducer
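
ANS -
A textual sketch of the two flows (assuming, for illustration, that the input line is split into two halves handled by two map tasks; each letter occurs 4 times in total):

Without combiner:
Mapper 1 output: (a,1) x4, (b,1) x2, (c,1) x2, (d,1) x2, (e,1) x2 -> all 12 pairs are shuffled over the network
Mapper 2 output: (b,1) x2, (c,1) x2, (d,1) x2, (e,1) x2, (f,1) x4 -> all 12 pairs are shuffled over the network
The Reducer sums each group: (a,4) (b,4) (c,4) (d,4) (e,4) (f,4)

With combiner:
Mapper 1 output is combined locally into: (a,4) (b,2) (c,2) (d,2) (e,2) -> only 5 pairs are shuffled
Mapper 2 output is combined locally into: (b,2) (c,2) (d,2) (e,2) (f,4) -> only 5 pairs are shuffled
The Reducer sums the partial counts and produces the same final result: (a,4) (b,4) (c,4) (d,4) (e,4) (f,4)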
INLAB:

1. Implement Maximum temperature program using Map Reduce with and without combiner
logic

1)Implement Maximum temperature program without combiner logic

-----Mapper for the maximum temperature


import java.io.IOException;

import org.apache.hadoop.io.IntWritable;

import org.apache.hadoop.io.LongWritable;

import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper

extends Mapper<LongWritable, Text, Text, IntWritable> {


private static final int MISSING = 9999;

@Override

public void map(LongWritable key, Text value, Context context)

throws IOException, InterruptedException {

String line = value.toString();

String year = line.substring(15, 19);

int airTemperature;

if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs

airTemperature = Integer.parseInt(line.substring(88, 92));

} else {

airTemperature = Integer.parseInt(line.substring(87, 92));
}
String quality = line.substring(92, 93);
if (airTemperature != MISSING && quality.matches("[01459]")) {
context.write(new Text(year), new IntWritable(airTemperature));
}
}
}

-------
Reducer for the maximum temperature

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;

import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapreduce.Reducer;
public class MaxTemperatureReducer

extends Reducer<Text, IntWritable, Text, IntWritable> {

@Override

public void reduce(Text key, Iterable<IntWritable> values, Context context)

throws IOException, InterruptedException {

int maxValue = Integer.MIN_VALUE;

for (IntWritable value : values) {

maxValue = Math.max(maxValue, value.get());
}
context.write(key, new IntWritable(maxValue));
}
}

-------

Application to find the maximum temperature in the weather dataset

import org.apache.hadoop.fs.Path;

import org.apache.hadoop.io.IntWritable;

import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapreduce.Job;

import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {

public static void main(String[] args) throws Exception {


if (args.length != 2) {

System.err.println("Usage: MaxTemperature <input path> <output path>");

System.exit(-1);
}
Job job = new Job();

job.setJarByClass(MaxTemperature.class);

job.setJobName("Max temperature");

FileInputFormat.addInputPath(job, new Path(args[0]));

FileOutputFormat.setOutputPath(job, new Path(args[1]));

job.setMapperClass(MaxTemperatureMapper.class);

job.setReducerClass(MaxTemperatureReducer.class);

job.setOutputKeyClass(Text.class);

job.setOutputValueClass(IntWritable.class);

System.exit(job.waitForCompletion(true) ? 0 : 1);

}
}
2)Implement Maximum temperature program with combiner logic

-----Mapper for the maximum temperature

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;

import org.apache.hadoop.io.LongWritable;

import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper

extends Mapper<LongWritable, Text, Text, IntWritable> {

private static final int MISSING = 9999;

@Override

public void map(LongWritable key, Text value, Context context)

throws IOException, InterruptedException {

String line = value.toString();

String year = line.substring(15, 19);

int airTemperature;

if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs

airTemperature = Integer.parseInt(line.substring(88, 92));

} else {

airTemperature = Integer.parseInt(line.substring(87, 92));
}
String quality = line.substring(92, 93);
if (airTemperature != MISSING && quality.matches("[01459]")) {
context.write(new Text(year), new IntWritable(airTemperature));
}
}
}

-------
Reducer for the maximum temperature

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;

import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureReducer

extends Reducer<Text, IntWritable, Text, IntWritable> {

@Override

public void reduce(Text key, Iterable<IntWritable> values, Context context)

throws IOException, InterruptedException {

int maxValue = Integer.MIN_VALUE;

for (IntWritable value : values) {

maxValue = Math.max(maxValue, value.get());
}
context.write(key, new IntWritable(maxValue));
}
}

----------------------
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperatureWithCombiner {

public static void main(String[] args) throws Exception {


if (args.length != 2) {

System.err.println("Usage: MaxTemperatureWithCombiner <input path> " +

"<output path>");

System.exit(-1);
}
Job job = new Job();

job.setJarByClass(MaxTemperatureWithCombiner.class);

job.setJobName("Max temperature");

FileInputFormat.addInputPath(job, new Path(args[0]));

FileOutputFormat.setOutputPath(job, new Path(args[1]));

job.setMapperClass(MaxTemperatureMapper.class);

job.setCombinerClass(MaxTemperatureReducer.class);

job.setReducerClass(MaxTemperatureReducer.class);

job.setOutputKeyClass(Text.class);

job.setOutputValueClass(IntWritable.class);

System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}

OUTPUT

The first map produced the output:

(1950, 0)
(1950, 20)

(1950, 10)

and the second produced:

(1950, 25)

(1950, 15)

FINAL OUTPUT

(1950, 25)

POSTLAB-

1. Write a MapReduce program using a partitioner to process the input dataset and find the highest-salaried
employee by gender in different age groups, for example, below 20, between 21 and 30, and above 30.
Input Dataset:

ANS -

1. Open Eclipse and create a project “wordcountpartitioner”


2. Under src folder create java files with
i) PartitionerExample.java
PartitionerExample.java
import java.io.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;
import org.apache.hadoop.util.*;
public class PartitionerExample extends Configured implements Tool
{
//Map class
public static class MapClass extends Mapper<LongWritable,Text,Text,Text>
{
public void map(LongWritable key, Text value, Context context)
{
try{
String[] str = value.toString().split("\t", -3);
String gender=str[3];
context.write(new Text(gender), new Text(value));
}
catch(Exception e)
{
System.out.println(e.getMessage());
}
}
}

//Reducer class

public static class ReduceClass extends Reducer<Text,Text,Text,IntWritable>


{
public int max = -1;
public void reduce(Text key, Iterable <Text> values, Context context) throws IOException,
InterruptedException
{
max = -1;

for (Text val : values)


{
String [] str = val.toString().split("\t", -3);
if(Integer.parseInt(str[4])>max)
max=Integer.parseInt(str[4]);
}

context.write(new Text(key), new IntWritable(max));


}
}

//Partitioner class

public static class CaderPartitioner extends


Partitioner < Text, Text >
{
@Override
public int getPartition(Text key, Text value, int numReduceTasks)
{
String[] str = value.toString().split("\t");
int age = Integer.parseInt(str[2]);
if(numReduceTasks == 0)
{
return 0;
}
if(age<=20)
{
return 0;
}
else if(age>20 && age<=30)
{
return 1 % numReduceTasks;
}
else
{
return 2 % numReduceTasks;
}
}
}
@Override
public int run(String[] arg) throws Exception
{
Configuration conf = getConf();

Job job = new Job(conf, "topsal");


job.setJarByClass(PartitionerExample.class);

FileInputFormat.setInputPaths(job, new Path(arg[0]));


FileOutputFormat.setOutputPath(job,new Path(arg[1]));

job.setMapperClass(MapClass.class);

job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);

//set partitioner statement

job.setPartitionerClass(CaderPartitioner.class);
job.setReducerClass(ReduceClass.class);
job.setNumReduceTasks(3);
job.setInputFormatClass(TextInputFormat.class);

job.setOutputFormatClass(TextOutputFormat.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);

System.exit(job.waitForCompletion(true)? 0 : 1);
return 0;
}

public static void main(String ar[]) throws Exception


{
int res = ToolRunner.run(new Configuration(), new PartitionerExample(),ar);
System.exit(0);
}
}

3. Add required jar files, Right click on src-> build path -> configured build path ->
libraries -> External jarfiles now add the jarfiles and click on OK

4. Convert the project as jar file


Folder-> right click-> go to export->go to java->select
Jar->give some name and click on ok

5. Create a input file “input” on Desktop


6. Open terminal and start hadoop “startCDH.sh”
7. Put the input file on HDFS
i) Create a directory hdfs dfs -mkdir “dir_name”
ii) Now insert the input file in new created directory hdfs dfs -put myinput dir_name/input

8. Now execute the mapreduce program
hadoop jar <created jar name> <MainClassName> <HDFS input filename> <output directory name>
hadoop jar output2.jar PartitionerExample dir_name/input output2
9. Run the program
PARALLEL AND DISTRIBUTED COMPUTING
LAB - 8
PRELAB-

1. Define shuffling in MapReduce.


ANS - Shuffling is the process of transferring data from Mapper to Reducer. It is part of the first phase of
the framework.

2. Why is the output of map tasks stored (spilled) on the local disk and not in HDFS?

ANS - The output of a map task is the intermediate key-value pairs, which are then processed by the
reducer to produce the final aggregated result. Once a MapReduce job is completed, there is no need
for the intermediate output produced by the map tasks. Therefore, storing this intermediate output
in HDFS and replicating it would create unnecessary overhead.

3.What are the advantages of using map side join in MapReduce?

ANS - The advantages of using map side join in MapReduce are as follows:

• Map-side join helps in minimizing the cost that is incurred for sorting and merging in the
shuffle and reduce stages.
• Map-side join also helps in improving the performance of the task by decreasing the time to
finish the task.

INLAB:

1. Find the number of products sold in each country using Map Reduce
ANS -

SalesMapper.java
package SalesCountry;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class SalesMapper extends MapReduceBase implements Mapper


<LongWritable, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);

public void map(LongWritable key, Text value, OutputCollector <Text,


IntWritable> output, Reporter reporter) throws IOException {
String valueString = value.toString();
String[] SingleCountryData = valueString.split(",");
output.collect(new Text(SingleCountryData[7]), one);
}
}

SalesCountryReducer.java
package SalesCountry;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class SalesCountryReducer extends MapReduceBase implements


Reducer<Text, IntWritable, Text, IntWritable> {

public void reduce(Text t_key, Iterator<IntWritable> values,


OutputCollector<Text,IntWritable> output, Reporter reporter) throws
IOException {
Text key = t_key;
int frequencyForCountry = 0;
while (values.hasNext()) {
// replace type of value with the actual type of our value
IntWritable value = (IntWritable) values.next();
frequencyForCountry += value.get();

}
output.collect(key, new IntWritable(frequencyForCountry));
}
}
SalesCountryDriver.java
package SalesCountry;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class SalesCountryDriver {


public static void main(String[] args) {
JobClient my_client = new JobClient();
JobConf job_conf = new JobConf(SalesCountryDriver.class);
job_conf.setJobName("SalePerCountry");
job_conf.setOutputKeyClass(Text.class);
job_conf.setOutputValueClass(IntWritable.class);
job_conf.setMapperClass(SalesCountry.SalesMapper.class);
job_conf.setReducerClass(SalesCountry.SalesCountryReducer.class);
job_conf.setInputFormat(TextInputFormat.class);
job_conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(job_conf, new Path(args[0]));
FileOutputFormat.setOutputPath(job_conf, new Path(args[1]));

my_client.setConf(job_conf);
try {

JobClient.runJob(job_conf);
} catch (Exception e) {
e.printStackTrace();
}
}
}
POSTLAB-

1. Pinky wants to find the status of a file (healthy or not); she also wants to find the total size,
total blocks and everything related to the files present in a particular directory.
Your task is to help Pinky find the relevant information related to a particular directory.
Note: You can find the status of any particular directory present in your system. Explain in detail
the command you intend to use.

Solution:
HDFS fsck is used to check the health of the file system, to find missing files, over replicated, under
replicated and corrupted blocks.

Command: hdfs fsck <path> -files -blocks
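For example, to check everything under a hypothetical directory /lab/data:

hdfs fsck /lab/data -files -blocks

The report lists each file with its size and block count, followed by totals (total size, total files, total blocks, replication details, corrupt or missing blocks) and an overall verdict such as "The filesystem under path '/lab/data' is HEALTHY".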


2. John wants to collect specific information about the files within HDFS, but he doesn't know how to query
for the details. Explain the stat command to him so that he can collect the information about files.
A. The hdfs "stat" command is useful when you need to write a quick script that will collect specific
information about the files within HDFS. Use case: when you run hdfs dfs -ls /filename it always returns
the full path of the file, but you may only need to pull out the basename.
The Hadoop fs shell command stat prints the statistics about the file or directory in the specified format.
Formats:
%b – file size in bytes
%g – group name of owner
%n – file name
%o – block size
%r – replication
%u – user name of owner
%y – modification date
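For example, for a hypothetical file /lab/sales.csv:

hadoop fs -stat "%n %b %r %o %y" /lab/sales.csv

This prints the file name, its size in bytes, its replication factor, its block size and its modification date on a single line.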
PARALLEL AND DISTRIBUTED COMPUTING
LAB - 9
PRELAB-

1. What happens when a data node fails?


ANS -

When a data node fails

 Jobtracker and namenode detect the failure


 On the failed node all tasks are re-scheduled
 Namenode replicates the user's data to another node

2.Explain what is JobTracker in Hadoop? What are the actions followed by Hadoop?

Answer: In Hadoop for submitting and tracking MapReduce jobs, JobTracker is used. Job tracker run on its
own JVM process

Job Tracker performs following actions in Hadoop

 Client application submit jobs to the job tracker


 JobTracker communicates to the Name mode to determine data location
 Near the data or with available slots JobTracker locates TaskTracker nodes
 On chosen TaskTracker Nodes, it submits the work
 When a task fails, Job tracker notifies and decides what to do then.
 The TaskTracker nodes are monitored by JobTracker

3.Explain what is the function of MapReduce partitioner?

Answer: The function of MapReduce partitioner is to make sure that all the value of a single key goes to the
same reducer, eventually which helps even distribution of the map output over the reducers

INLAB:

1. Computing the average salary of ABC company using Map Reduce


RunnerClass
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.io.FloatWritable;
public class RunnerClass {
    public static void main(String args[]) throws Exception
    {
        JobConf conf = new JobConf(RunnerClass.class);
        conf.setJobName("total and avg");
        conf.setMapperClass(MapperClass.class);
        //conf.setCombinerClass(ProcessUnitsReducer.class);
        conf.setReducerClass(ReducerClass.class);
        // the mapper emits (Text, FloatWritable) while the reducer emits (Text, Text),
        // so the map output value class is declared separately from the final output value class
        conf.setMapOutputKeyClass(Text.class);
        conf.setMapOutputValueClass(FloatWritable.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}
MapperClass
import java.io.IOException;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
public class MapperClass extends MapReduceBase implements
Mapper<LongWritable ,Text,Text,FloatWritable> {
public void map(LongWritable key, Text empRecord,OutputCollector<Text,
FloatWritable> con, Reporter arg3) throws IOException {
String word[]=empRecord.toString().split(" ");
String sex=word[2];
Float salary=Float.parseFloat(word[3]);
con.collect(new Text(sex),new FloatWritable(salary));
}
}
ReducerClass
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
public class ReducerClass extends MapReduceBase implements
Reducer<Text,FloatWritable,Text,Text>{
public void reduce(Text key, Iterator<FloatWritable> valueList, OutputCollector<Text,
Text> con, Reporter arg3) throws IOException {
Float total=(float)0;
int count=0;
while(valueList.hasNext())
{
total+=valueList.next().get();
count++;
}
Float avg=(float)total/count;
String out="Total = "+total+"::"+"Average = "+avg;
con.collect(key, new Text(out));
}
}
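The job can be packaged and run in the same way as the other labs; the jar name salary.jar, the input file name employees.txt and the HDFS paths below are assumptions for illustration. The mapper expects space-separated records whose third field is the gender and whose fourth field is the salary.

javac -classpath $(hadoop classpath) -d build MapperClass.java ReducerClass.java RunnerClass.java
jar cf salary.jar -C build .
hdfs dfs -put employees.txt /lab/salaryinput
hadoop jar salary.jar RunnerClass /lab/salaryinput /lab/salaryoutput
hdfs dfs -cat /lab/salaryoutput/part-00000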

Execution:
Runner Class

Mapper Class
Reducer Class

EXECUTION
INPUT
OUTPUT

POSTLAB-

1. Siri is playing a solo online game. There are 4 rounds in the game. In each round she is given 5 chances
to score points, and in every chance she can score points ranging from 1-5. You are now given an input file
containing the points Siri scored in the 4 rounds; help her find the frequency of each of the points she scored.

ANS -

Mapper code -
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class GameMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable>
{
public void map(LongWritable key, Text value,OutputCollector<Text,IntWritable>output, Reporter reporter)
throws IOException
{
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens())
{
String p=(tokenizer.nextToken());
output.collect(new Text(p), new IntWritable(1));
}
}
}

Reducer code -
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class GameReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable>
{
public void reduce(Text key, Iterator<IntWritable> values,OutputCollector<Text,IntWritable> output,
Reporter reporter) throws IOException
{
int sum=0;
while (values.hasNext())
{
sum+=values.next().get();
}
output.collect(key,new IntWritable(sum));
}
}

Driver code -
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class Game
{
public static void main(String[] args) throws IOException
{
JobConf conf = new JobConf(Game.class);
conf.setJobName("Game");
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(GameMapper.class);
conf.setCombinerClass(GameReducer.class);
conf.setReducerClass(GameReducer.class);
FileInputFormat.setInputPaths(conf,new Path(args[0]));
FileOutputFormat.setOutputPath(conf,new Path(args[1]));
JobClient.runJob(conf);
}
}

Commands Used -
startCDH.sh
cd Desktop
hdfs dfs -put gameinput /lab/gameinput
hadoop jar game.jar Game /lab/gameinput /lab/gameoutput
OUTPUT SCREENSHOTS -
PARALLEL AND DISTRIBUTED COMPUTING
LAB - 10
PRELAB-

1. If a file contains 100 billion URLs, how will you find the first unique URL?

This problem involves a very large data set (100 billion URLs), so the file has to be divided into
chunks that fit into memory; each chunk is processed separately and the results are combined to
obtain the final answer. The file is partitioned with a uniform hash function into N/M chunk files,
where M is roughly the size of main memory. Each URL is read from the input file, the hash
function determines which chunk file it is written to, and the original line number is appended
alongside it. Each chunk file is then read into memory and a hash table is built that counts the
occurrences of every URL in that chunk and stores the line number of its first occurrence. Once the
hash table is complete, the entry with a count of 1 and the lowest line number is the first unique
URL within that chunk. This step is repeated for all chunk files, and the line numbers of the
per-chunk candidates are compared; the candidate with the smallest line number is the first unique
URL of the whole input.
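As a concrete, single-machine sketch of this two-pass idea (the input file name urls.txt, the chunk-file names and the chunk count M are assumptions chosen only for illustration):

import java.io.*;
import java.util.*;

public class FirstUniqueUrl {
    public static void main(String[] args) throws IOException {
        int M = 1024;  // assumed number of chunk files, chosen so each chunk fits in memory
        // Pass 1: hash-partition every URL (together with its line number) into chunk files.
        PrintWriter[] chunks = new PrintWriter[M];
        for (int i = 0; i < M; i++) chunks[i] = new PrintWriter(new FileWriter("chunk-" + i));
        BufferedReader in = new BufferedReader(new FileReader("urls.txt"));
        String url;
        long line = 0;
        while ((url = in.readLine()) != null) {
            int c = (url.hashCode() & Integer.MAX_VALUE) % M;
            chunks[c].println(line + "\t" + url);
            line++;
        }
        in.close();
        for (PrintWriter w : chunks) w.close();
        // Pass 2: per chunk, count occurrences and remember the first line number of every URL.
        long bestLine = Long.MAX_VALUE;
        String bestUrl = null;
        for (int i = 0; i < M; i++) {
            Map<String, long[]> table = new HashMap<String, long[]>();  // url -> {count, first line number}
            BufferedReader r = new BufferedReader(new FileReader("chunk-" + i));
            String rec;
            while ((rec = r.readLine()) != null) {
                String[] parts = rec.split("\t", 2);
                long ln = Long.parseLong(parts[0]);
                long[] e = table.get(parts[1]);
                if (e == null) table.put(parts[1], new long[]{1, ln});
                else e[0]++;
            }
            r.close();
            // the unique URL with the smallest original line number wins
            for (Map.Entry<String, long[]> e : table.entrySet())
                if (e.getValue()[0] == 1 && e.getValue()[1] < bestLine) {
                    bestLine = e.getValue()[1];
                    bestUrl = e.getKey();
                }
        }
        System.out.println("First unique URL: " + bestUrl);
    }
}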

2. Varun wants to execute a Hadoop program, but he is new to Hadoop, so he wants to know about the
main configuration files of Hadoop. Help him understand what the configuration files in Hadoop are.

Below are the main configuration files of Hadoop:

• hadoop-env.sh
• core-site.xml
• hdfs-site.xml
• yarn-site.xml
• mapred-site.xml
• masters
• slaves
INLAB:
1.Implement stock market program using Map Reduce with and without combiner logic

Part – 1: Implement stock market program using Map Reduce with combiner logic.
RunnerClass
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
public class RunnerClass {
public static void main(String[] args) throws IOException{
JobConf conf = new JobConf(RunnerClass.class);
conf.setJobName("stocksmax");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(FloatWritable.class);
conf.setMapperClass(MapperClass.class);
conf.setCombinerClass(ReducerClass.class);
conf.setReducerClass(ReducerClass.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf,new Path(args[0]));
FileOutputFormat.setOutputPath(conf,new Path(args[1]));
JobClient.runJob(conf);
}
}

MapperClass
import java.io.IOException;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

class MapperClass extends MapReduceBase implements Mapper<LongWritable, Text, Text, FloatWritable> {

    public void map(LongWritable key, Text value, OutputCollector<Text, FloatWritable> output, Reporter reporter) throws IOException {
        String line = value.toString();
        String[] items = line.split(" ");
        String stock = items[1];
        Float closePrice = Float.parseFloat(items[6]);
        output.collect(new Text(stock), new FloatWritable(closePrice));
    }
}

ReducerClass
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class ReducerClass extends MapReduceBase implements Reducer<Text, FloatWritable, Text, FloatWritable> {

    public void reduce(Text key, Iterator<FloatWritable> values, OutputCollector<Text, FloatWritable> output, Reporter reporter) throws IOException {
        float maxClosePrice = Float.MIN_VALUE;
        while (values.hasNext()) {
            maxClosePrice = Math.max(maxClosePrice, values.next().get());
        }
        output.collect(key, new FloatWritable(maxClosePrice));
    }
}

EXECUTION:
INPUT
OUTPUT:
Part – 2: Implement stock market program using Map Reduce withOUT combiner logic.
RunnerClass
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
public class RunnerClass {
public static void main(String[] args) throws IOException{
JobConf conf = new JobConf(RunnerClass.class);
conf.setJobName("stocksmax");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(FloatWritable.class);
conf.setMapperClass(MapperClass.class);

conf.setReducerClass(ReducerClass.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf,new Path(args[0]));
FileOutputFormat.setOutputPath(conf,new Path(args[1]));
JobClient.runJob(conf);
}
}

MapperClass
import java.io.IOException;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

class MapperClass extends MapReduceBase implements Mapper<LongWritable, Text, Text, FloatWritable> {

    public void map(LongWritable key, Text value, OutputCollector<Text, FloatWritable> output, Reporter reporter) throws IOException {
        String line = value.toString();
        String[] items = line.split(" ");
        String stock = items[1];
        Float closePrice = Float.parseFloat(items[6]);
        output.collect(new Text(stock), new FloatWritable(closePrice));
    }
}

ReducerClass
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
public class ReducerClass extends MapReduceBase implements Reducer<Text, FloatWritable, Text, FloatWritable> {

    public void reduce(Text key, Iterator<FloatWritable> values, OutputCollector<Text, FloatWritable> output, Reporter reporter) throws IOException {
        float maxClosePrice = Float.MIN_VALUE;
        while (values.hasNext()) {
            maxClosePrice = Math.max(maxClosePrice, values.next().get());
        }
        output.collect(key, new FloatWritable(maxClosePrice));
    }
}

EXECUTION
Input
OUTPUT
POSTLAB-

1. You have a directory ProjectPro that has the following files – HadoopTraining.txt,
_SparkTraining.txt, #DataScienceTraining.txt, .SalesforceTraining.txt. If you pass the ProjectPro
directory to the Hadoop MapReduce jobs, how many files are likely to be processed?

Only HadoopTraining.txt and #DataScienceTraining.txt will be processed by the MapReduce job,
because when we process a file (either in a directory or individually) in Hadoop using any
FileInputFormat, such as TextInputFormat, KeyValueInputFormat or SequenceFileInputFormat, we
must make sure that none of the files has a hidden-file prefix such as “_” or “.”, because by default
the MapReduce FileInputFormat uses the hiddenFileFilter class to ignore all files with these prefixes
in their names.

private static final PathFilter hiddenFileFilter = new PathFilter() {
    public boolean accept(Path p) {
        String name = p.getName();
        return !name.startsWith("_") && !name.startsWith(".");
    }
};

However, we can set our own custom filter through FileInputFormat.setInputPathFilter to add further
criteria, but remember that hiddenFileFilter is always active; a minimal sketch of such a filter is given below.
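For illustration, such a custom filter might look like the following (the class name NoTmpFilter and the ".tmp" rule are hypothetical):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

public class NoTmpFilter implements PathFilter {
    public boolean accept(Path p) {
        // skip temporary files, in addition to whatever the built-in hiddenFileFilter already hides
        return !p.getName().endsWith(".tmp");
    }
}

It would then be registered on the job with FileInputFormat.setInputPathFilter(job_conf, NoTmpFilter.class).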

2. KL University has conducted some weekly tests for the students who have opted for the Parallel and
Distributed Computing course. You are given the marks scored by those students as
the input file. Find the least mark scored by each student.

CODE

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Minimum
{
public static void main(String[] args) throws Exception
{
// Create a new job
Job job = new Job();
// Set job name to locate it in the distributed environment
job.setJarByClass(Minimum.class);
job.setJobName("Minimum");
// Set input and output Path, note that we use the default input format
// which is TextInputFormat (each record is a line of input)
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
// Set Mapper and Reducer class
job.setMapperClass(MinMapper.class);
job.setCombinerClass(MinReducer.class);
job.setReducerClass(MinReducer.class);
// Set Output key and value
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MinMapper extends Mapper<LongWritable, Text, Text, IntWritable>
{
    private IntWritable mark = new IntWritable();
    private Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException
    {
        // each input line is assumed to hold "studentName,mark" (comma- or tab-separated)
        StringTokenizer line = new StringTokenizer(value.toString(), ",\t");
        word.set(line.nextToken());
        mark.set(Integer.parseInt(line.nextToken()));
        context.write(word, mark);
    }
}

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MinReducer extends Reducer<Text, IntWritable, Text, IntWritable>
{
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException
    {
        // track the smallest mark seen for this student
        Iterator<IntWritable> itr = values.iterator();
        int min = itr.next().get();
        while (itr.hasNext())
        {
            int temp = itr.next().get();
            if (temp < min)
            {
                min = temp;
            }
        }
        context.write(key, new IntWritable(min));
    }
}

OUTPUT
PARALLEL AND DISTRIBUTED COMPUTING
LAB - 11
PRELAB-

1. Imagine that you are uploading a file of 500 MB into HDFS. 100 MB of data has been successfully uploaded
and another client wants to read the uploaded data while the upload is still in progress. What will
happen in such a scenario; will the 100 MB of data that has been uploaded be visible?

Although the default block size is 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x, in this
scenario let us consider the block size to be 100 MB, which means we are going to have 5 blocks replicated
3 times (the default replication factor). Let us consider an example of how a block is written to HDFS:

We have 5 blocks (A/B/C/D/E) for a file, a client, a namenode and a datanode. First, the client takes
Block A and approaches the namenode for the datanode locations to store this block and its replicated copies.
Once the client knows the datanode information, it directly reaches out to the datanode and starts
copying Block A, which is simultaneously replicated to the other two datanodes. Once the block is copied and
replicated to the datanodes, the client gets confirmation that Block A is stored and then
initiates the same process for the next block, "Block B".

So, during this process, if the 1st block of 100 MB has been written to HDFS and the client has started
storing the next block, the 1st block will be visible to readers. Only the block currently being written will not be
visible to the readers.

2. When does a NameNode enter safe mode?

The Namenode is responsible for managing the metadata of the cluster, and if something is missing from
the cluster the Namenode holds back. This makes the Namenode check all the necessary information
during safe mode before making the cluster writable to the users. There are a couple of reasons for the
Namenode to enter safe mode during startup:

i) The Namenode loads the filesystem state from the fsimage and edits log files, and then waits for the datanodes to
report their blocks, so that it does not start replicating blocks which already exist in the cluster.

ii) It waits for heartbeats from all the datanodes and checks whether any corrupt blocks exist in the cluster. Once the Namenode
has verified all this information, it leaves safe mode and makes the cluster accessible. Sometimes we need to
manually enter or leave safe mode for the Namenode, which can be done from the command line with “hdfs dfsadmin
-safemode enter/leave”.
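For example (an illustrative command-line session):

hdfs dfsadmin -safemode get     # report whether safe mode is currently ON or OFF
hdfs dfsadmin -safemode enter   # put the Namenode into safe mode manually
hdfs dfsadmin -safemode leave   # return the cluster to normal, writable operation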

INLAB:
1. Rio took two files, M and N, each containing one of the matrices to be multiplied. With the help
of these two files, calculate the matrix multiplication using the MapReduce concept (with a
combiner where applicable).
Step 1: start Hadoop by giving the command startCDH.sh in terminal.
Step 2: Open Eclipse in the Desktop.
Step 3: Create new java project
Step 4: Click on new -> Class
In general, for a map reduce program we create 3 classes, such as Mapper Class, Reducer Class, Runner Class
Step 5: : Add External libraries
Step 6: Right click on “src”
Click on Build Path -> Click on Configure Build Path
Step 7:
Create a directory named LAB11 and put the input files, i.e. M and N, in the directory

Step 8:
hadoop@hadoop-laptop:~/Desktop$ hadoop jar matrix.jar MatrixMultiply /matrix/* /LAB11/190030324out

Step 9: Open the Mozilla Firefox browser and go to the output directory to see the output
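Since the INLAB shows the classes only as screenshots, here is a minimal sketch of the usual one-step matrix-multiplication job behind this exercise; the record format "M,i,j,value" / "N,j,k,value" and the matrix dimensions placed in the Configuration are assumptions for illustration, not taken from the lab files:

import java.io.IOException;
import java.util.HashMap;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MatrixMultiply {
    public static class MatMapper extends Mapper<Object, Text, Text, Text> {
        public void map(Object key, Text value, Context ctx) throws IOException, InterruptedException {
            Configuration conf = ctx.getConfiguration();
            int m = Integer.parseInt(conf.get("m"));   // rows of M
            int p = Integer.parseInt(conf.get("p"));   // columns of N
            String[] t = value.toString().split(",");
            if (t[0].equals("M")) {                    // element M[i][j] contributes to every (i,k)
                for (int k = 0; k < p; k++)
                    ctx.write(new Text(t[1] + "," + k), new Text("M," + t[2] + "," + t[3]));
            } else {                                   // element N[j][k] contributes to every (i,k)
                for (int i = 0; i < m; i++)
                    ctx.write(new Text(i + "," + t[2]), new Text("N," + t[1] + "," + t[3]));
            }
        }
    }

    public static class MatReducer extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterable<Text> values, Context ctx) throws IOException, InterruptedException {
            HashMap<Integer, Float> mRow = new HashMap<Integer, Float>();
            HashMap<Integer, Float> nCol = new HashMap<Integer, Float>();
            for (Text v : values) {
                String[] t = v.toString().split(",");
                if (t[0].equals("M")) mRow.put(Integer.parseInt(t[1]), Float.parseFloat(t[2]));
                else nCol.put(Integer.parseInt(t[1]), Float.parseFloat(t[2]));
            }
            float sum = 0;                             // dot product of row i of M and column k of N
            for (Integer j : mRow.keySet())
                if (nCol.containsKey(j)) sum += mRow.get(j) * nCol.get(j);
            ctx.write(key, new Text(Float.toString(sum)));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("m", "2");   // rows of M    (assumed sizes, passed in for the sketch)
        conf.set("p", "2");   // columns of N
        Job job = Job.getInstance(conf, "MatrixMultiply");
        job.setJarByClass(MatrixMultiply.class);
        job.setMapperClass(MatMapper.class);
        job.setReducerClass(MatReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Note that with this element-pairing formulation a combiner is usually omitted, since a reducer key needs both M and N entries before any multiplication can happen.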
Output:
POSTLAB-

1. Suppose 8 TB is the available disk space per node (i.e., 10 disks of 1 TB each, with 2 disks excluded
for the operating system, logs, etc.). How will you estimate the number of data
nodes (n), assuming the initial size of the data is 600 TB?

In a Hadoop environment, estimating the hardware requirements is challenging because the data in an
organization can grow at any time. Thus, one must plan the cluster for the current scenario, which
depends on the following factors:

1. The actual data size to be stored is around 600 TB.

2. The rate at which the data will grow in future (daily/weekly/monthly/quarterly/yearly),
which depends on an analysis of the data trend and the justified
requirements of the business.
3. Hadoop has a default replication factor of 3x.
4. Two disks are set aside for the overhead of the machine (such as logs, the operating
system, etc.).
5. The intermediate output of the mappers and reducers takes 1x on the hard disk.
6. Space utilization is between 60-70%.

Steps to find the number of data nodes required to store 600 TB of data:

• Given: Replication factor: 3

Data size: 600 TB

Intermediate data: 1x

Total storage requirement: (3 + 1) × 600 = 2400 TB

Available disk size per node: 8 TB

Total data nodes required: 2400 / 8 = 300 machines.

• Actual calculation = rough calculation adjusted for disk-space utilization and compression ratio

Disk-space utilization: 65%

Compression ratio: 2.3

Total storage requirement: 2400 / 2.3 ≈ 1043.5 TB

Available disk size per node: 8 × 0.65 = 5.2 TB

Total data nodes required: 1043.5 / 5.2 ≈ 201 machines.

Actual usable size of cluster (100%): 201 × 8 × 2.3 / 4 ≈ 925 TB

• Case: It has been predicted that the data will increase by 20% every quarter, and we need to
predict the number of new machines to be added in a particular year.

Increase of data: 20% quarterly

Additional data:

1st quarter: 1043.5 × 0.2 = 208.7 TB

2nd quarter: 1043.5 × 1.2 × 0.2 = 250.44 TB

3rd quarter: 1043.5 × 1.2 × 1.2 × 0.2 = 300.5 TB

4th quarter: 1043.5 × 1.2 × 1.2 × 1.2 × 0.2 = 360.6 TB

Additional data nodes:

1st quarter: 208.7 / 5.2 ≈ 41 machines

2nd quarter: 250.44 / 5.2 ≈ 49 machines

3rd quarter: 300.5 / 5.2 ≈ 58 machines

4th quarter: 360.6 / 5.2 ≈ 70 machines


2. Suppose a file of size 514 MB is stored in Hadoop (Hadoop 2.x) using the default block-size
configuration and the default replication factor. How many blocks will be created in total,
and what will be the size of each block?

The default replication factor is 3 and the default block size is 128 MB in Hadoop 2.x.

Thus, the 514 MB file is split into:

Block a: 128 MB, Block b: 128 MB, Block c: 128 MB, Block d: 128 MB, Block e: 2 MB

• The default block size is 128 MB

• Number of blocks: 514 MB / 128 MB ≈ 4.02, rounded up to 5 blocks
• Replication factor: 3
• Total blocks: 5 × 3 = 15
• Total size: 514 × 3 = 1542 MB

Hence, there are 15 blocks with a total size of 1542 MB.


PARALLEL AND DISTRIBUTED COMPUTING
LAB - 12
PRELAB-

1. Explain about the partitioning, shuffle and sort phase

Shuffle Phase-Once the first map tasks are completed, the nodes continue to perform
several other map tasks and also exchange the intermediate outputs with the reducers as
required. This process of moving the intermediate outputs of map tasks to the reducer is
referred to as Shuffling.

Sort Phase- Hadoop MapReduce automatically sorts the set of intermediate keys on a
single node before they are given as input to the reducer.

Partitioning Phase-The process that determines which intermediate keys and value will be
received by each reducer instance is referred to as partitioning. The destination partition is
same for any key irrespective of the mapper instance that generated it.

2. How to write a custom partitioner for a Hadoop MapReduce job?

Steps to write a Custom Partitioner for a Hadoop MapReduce Job-

• A new class must be created that extends the pre-defined Partitioner Class.
• getPartition method of the Partitioner class must be overridden.
• The custom partitioner can be added to the job as a config file in the wrapper which
runs Hadoop MapReduce, or it can be registered on the job with the setPartitionerClass
method; a minimal sketch is given below.
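A minimal sketch of such a partitioner, using the org.apache.hadoop.mapreduce API (the class name CountryPartitioner and the Text/IntWritable key-value types are hypothetical, chosen to match the earlier word-count-style jobs):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class CountryPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // spread keys over the reducers by the hash of the key, so every value of
        // a single key always lands on the same reducer
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

It is then registered in the driver with job.setPartitionerClass(CountryPartitioner.class), together with an appropriate job.setNumReduceTasks(...) value.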
INLAB:
1. Map Reduce logic for online music store using combiner logic.
UniqueListenersReducer

UniqueListenersMapper

UniqueListeners
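Since the three classes appear above only as screenshots, a minimal sketch of what the mapper and reducer bodies typically contain is given below; the pipe-separated record format user-id|track-id|shared|radio|skip is an assumption based on the common version of this exercise, not read from the lab files:

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

class UniqueListenersMapper extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String[] parts = value.toString().split("[|]");
        int userId = Integer.parseInt(parts[0].trim());
        int trackId = Integer.parseInt(parts[1].trim());
        // emit (track, listener) so that the reducer can count distinct listeners per track
        context.write(new IntWritable(trackId), new IntWritable(userId));
    }
}

class UniqueListenersReducer extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
    public void reduce(IntWritable trackId, Iterable<IntWritable> userIds, Context context)
            throws IOException, InterruptedException {
        Set<Integer> listeners = new HashSet<Integer>();
        for (IntWritable userId : userIds) {
            listeners.add(userId.get());   // repeated listens by the same user collapse in the set
        }
        context.write(trackId, new IntWritable(listeners.size()));
    }
}

A driver class (UniqueListeners) then wires these up with setMapperClass/setReducerClass exactly as in the earlier labs. Note that the reducer cannot be reused directly as a combiner, because it emits counts rather than user ids; a combiner for this job would instead emit the de-duplicated user ids.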
Input
Execution

Output
POSTLAB-
1. Considering the dataset below, find the dates on which more than one transaction
occurred, using MapReduce.
• https://www.kaggle.com/jensroderus/salesjan2009csv

Codes:
➔ Upload the hadoop-core-1.2.1 jar file in Build Path → Configure Build Path → Add External Jars, then upload this jar file
there.
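The mapper/reducer pair behind this exercise might look like the following sketch; it assumes that the transaction date is the first comma-separated field of each record (e.g. "1/2/09 6:17" in the SalesJan2009 data) and that the header line starts with "Transaction":

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

class TransactionDateMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        if (fields.length == 0 || fields[0].startsWith("Transaction")) return;   // skip the header line
        String date = fields[0].trim().split(" ")[0];   // keep only the date part, drop the time
        context.write(new Text(date), ONE);
    }
}

class TransactionDateReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text date, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable c : counts) total += c.get();
        if (total > 1) context.write(date, new IntWritable(total));   // only dates with more than one transaction
    }
}

A driver wires these up as in the earlier labs; if a combiner is used, it must not apply the total > 1 filter, since a reducer may still receive further counts for the same date.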

Exporting to Jar File:


Execution and outputs:
