MapReduce Exam 2019 - Solved Paper
Vinod Patne
Table of Contents
Hadoop Configuration
6. Explain Hadoop's important configuration parameters
Hadoop Configuration
7. Role of YARN, Hue, Application Manager, Node Manager in MapReduce
1. Write a MapReduce program to read a log file and count the number of times the word Exception occurs in the file.
ExceptionCountMapper.java
public class ExceptionCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text wordException = new Text("Exception");
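    // The paper shows only the mapper's fields; a minimal map() sketch, assuming one
    // count is emitted for every input line that contains the word "Exception":
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (value.toString().contains("Exception")) {
            context.write(wordException, one);
        }
    }
}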
ExceptionCountReducer.java
public class ExceptionCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        // sum occurrences
        for (IntWritable val : values) {
            sum = sum + val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
ExceptionCountDriver.java
public class ExceptionCountDriver extends Configured implements Tool {
    jobConf.setBoolean("mapreduce.output.fileoutputformat.compress", true);
    // Set log levels - OFF, FATAL, ERROR, WARN, INFO, DEBUG, TRACE and ALL.
    jobConf.set("mapreduce.map.log.level", "DEBUG");
    jobConf.set("mapreduce.reduce.log.level", "TRACE");
    // 3. Set the key & value classes for the job's final output data
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // 4. Set the Input & Output Format classes for the job
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    // 6. Configure the input/output paths from the filesystem into the job
    FileInputFormat.setInputPaths(job, new Path(inputHDFSPath));
    FileOutputFormat.setOutputPath(job, new Path(outputHDFSPath));
    // OR
    // jobConf.set("mapred.input.dir", inputHDFSPath);
    // jobConf.set("mapred.output.dir", outputHDFSPath);
A partitioner works like a routing condition applied to the intermediate dataset. The partition phase takes place after the Map phase and before the Reduce phase.
The number of partitions is equal to the number of reducers, i.e. the partitioner divides the data according to the number of reducers. It partitions the data using a user-defined condition, which works like a hash function, so all records in a single partition are processed by a single reducer.
The default partitioner (HashPartitioner) computes a hash value for the key and assigns the partition based on that result. If the key's hashCode() method does not distribute the keys uniformly over the partition range, the data will not be sent evenly to the reducers. To overcome such poor partitioning in Hadoop MapReduce, we can create a custom partitioner, which allows the workload to be shared uniformly across the reducers.
Driver.java
job.setPartitionerClass(MyPartitioner.class);
// Default
// job.setPartitionerClass(HashPartitioner.class);
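// Because MyPartitioner (below) defines three partitions (ages <= 20, 21-30, > 30),
// the job should also request three reducers:
job.setNumReduceTasks(3);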
MyPartitioner.java
public class MyPartitioner extends Partitioner <Text, Text>
{
@Override
public int getPartition(Text key, Text value, int numReduceTasks) {
if(numReduceTasks == 0) {
return 0;
} else {
String[] str = value.toString().split("\t");
int age = Integer.parseInt(str[2]);
if(age <= 20) {
return 0;
} else if(age > 20 && age <= 30) {
return 1 % numReduceTasks;
} else {
return 2 % numReduceTasks;
}
}
}
}
3. Write a custom InputFormat to extract SQL queries from the input file.
How input files are split up and read in Hadoop is defined by the InputFormat. It splits the input file into InputSplits and assigns each split to an individual Mapper.
- The InputFormat selects the files or other objects that should be used as input.
- The InputFormat defines the data splits, which determine both the size of an individual map task and its potential execution server.
- The InputFormat defines the RecordReader, which is responsible for reading actual records from the input files.
Driver.java
job.setInputFormatClass(SqlInputFormat.class);
// Default
// job.setInputFormatClass(TextInputFormat.class);
SqlRecordReader.java
public class SqlRecordReader extends RecordReader<LongWritable, Text> {
    LineRecordReader lrr;
    LongWritable key;
    Text value;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
        // Delegate line-by-line reading to the standard LineRecordReader
        lrr = new LineRecordReader();
        lrr.initialize(split, context);
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        StringBuilder query = new StringBuilder();
        boolean QStarted = false;
        while (lrr.nextKeyValue()) {
            String line = lrr.getCurrentValue().toString();
            if (QStarted) {
                int index = line.indexOf(";");
                if (index != -1) {
                    query.append(line.substring(0, index + 1));
                    break;
                } else {
                    query.append(line);
                }
            } else {
                int index = line.toUpperCase().indexOf("SELECT");
                if (index != -1) {
                    QStarted = true;
                    int endIndex = line.indexOf(";");
                    if (endIndex != -1) {
                        query.append(line.substring(index, endIndex + 1));
                        break;
                    } else {
                        query.append(line.substring(index));
                    }
                }
            }
        }
        if (QStarted) {
            key = new LongWritable(1);
            value = new Text(query.toString());
        }
        return QStarted;
    }

    @Override
    public LongWritable getCurrentKey() { return key; }
    @Override
    public Text getCurrentValue() { return value; }
    @Override
    public float getProgress() throws IOException { return lrr.getProgress(); }
    @Override
    public void close() throws IOException { lrr.close(); }
}
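The driver above references SqlInputFormat, but that class is not shown in the paper. A minimal sketch that plugs in the record reader (marking files as not splittable is an assumption, so a multi-line query is never cut across splits):
SqlInputFormat.java
public class SqlInputFormat extends FileInputFormat<LongWritable, Text> {
    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
        return new SqlRecordReader();
    }

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // Keep each file in one split so a query spanning lines is read by a single reader
        return false;
    }
}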
CityTemperature.java
// each value type must implement Writable
public class CityTemperature implements Writable {
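    // The rest of the value class is not shown in the paper; a minimal sketch, assuming each
    // record carries a city name and a temperature reading (field names are illustrative):
    private Text city = new Text();
    private IntWritable temperature = new IntWritable();

    public IntWritable getTemperature() {
        return temperature;
    }

    @Override
    public void write(DataOutput dataOutput) throws IOException {
        city.write(dataOutput);
        temperature.write(dataOutput);
    }

    @Override
    public void readFields(DataInput dataInput) throws IOException {
        city.readFields(dataInput);
        temperature.readFields(dataInput);
    }
}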
CustomDate.java
public class CustomDate implements WritableComparable<CustomDate>{
IntWritable year;
IntWritable month;
IntWritable day;
public CustomDate() {
year = new IntWritable();
month = new IntWritable();
day = new IntWritable();
}
@Override
public void readFields(DataInput dataInput) throws IOException {
day.readFields(dataInput);
month.readFields(dataInput);
year.readFields(dataInput);
}
@Override
public void write(DataOutput dataOutput) throws IOException {
day.write(dataOutput);
month.write(dataOutput);
year.write(dataOutput);
}
    @Override
    public boolean equals(Object o) {
        if (o instanceof CustomDate) {
            CustomDate dateWritable = (CustomDate) o;
            return this.getYear().equals(dateWritable.getYear()) &&
                   this.getMonth().equals(dateWritable.getMonth()) &&
                   this.getDay().equals(dateWritable.getDay());
        }
        return false;
    }
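    // CustomDate is the map output key, so WritableComparable also requires compareTo(); a
    // hashCode() consistent with equals() lets the default HashPartitioner group equal dates.
    // A possible sketch:
    @Override
    public int compareTo(CustomDate other) {
        // Order chronologically: year, then month, then day
        int cmp = Integer.compare(year.get(), other.year.get());
        if (cmp == 0) cmp = Integer.compare(month.get(), other.month.get());
        if (cmp == 0) cmp = Integer.compare(day.get(), other.day.get());
        return cmp;
    }

    @Override
    public int hashCode() {
        return year.get() * 10000 + month.get() * 100 + day.get();
    }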
TemperatureMapper.java
public class TemperatureMapper extends Mapper<LongWritable, Text, CustomDate, CityTemperature> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
TemperatureReducer.java
public class TemperatureReducer extends Reducer<CustomDate, CityTemperature, CustomDate, IntWritable> {
    @Override
    protected void reduce(CustomDate key, Iterable<CityTemperature> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        int nbr = 0;
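        // The excerpt stops here; the sum/nbr counters suggest an average per date. A possible
        // completion, assuming the getTemperature() accessor sketched for CityTemperature above:
        for (CityTemperature val : values) {
            sum += val.getTemperature().get();
            nbr++;
        }
        if (nbr > 0) {
            // Emit the average temperature recorded for this date
            context.write(key, new IntWritable(sum / nbr));
        }
    }
}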
5. What is the difference between the Partitioner, Combiner, and Shuffle and Sort phases in MapReduce? What is the order of execution?
Partitioner - The partitioner comes into play when we are working with more than one reducer. It takes the output of the combiner (an optional, locally run "mini-reducer" that pre-aggregates map output records sharing the same key) and performs partitioning. Each combiner output record is partitioned on the basis of its key, so records having the same key go into the same partition, and each partition is then sent to one reducer. Partitioning allows the map output to be distributed evenly over the reducers.
Shuffling and Sorting - After partitioning, shuffling is the process of transferring data from the mappers to the reducers; the map output becomes the reducer's input, and each reducer receives one or more keys and their associated values. Shuffling is the physical movement of the data over the network. As the mappers finish and their output is shuffled onto the reducer nodes, the framework merges this intermediate output and sorts it by key; the result is then provided as input to the reduce phase. Shuffling can start even before the map phase has finished, which saves time and completes the job sooner. Sorting helps the reducer easily distinguish when a new reduce group (a new key) starts. Shuffling and sorting occur simultaneously while the mapper intermediate output is collected.
Skip shuffling and sorting - Shuffling and sorting will not take place at all if you specify zero reducers (setNumReduceTasks(0)). If the number of reducers is zero, the MapReduce job stops at the map phase, and the map phase does not include any kind of sorting (so the map phase is faster).
Order of execution: map -> combine (optional) -> partition -> shuffle and sort -> reduce.
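In the driver these phases are configured per job. A small sketch reusing the classes from question 1 (the reducer can double as a combiner here because summing counts is associative and commutative; the reducer count is illustrative):
job.setCombinerClass(ExceptionCountReducer.class); // optional local pre-aggregation of map output
job.setPartitionerClass(HashPartitioner.class);    // default: a hash of the key picks the target reducer
job.setNumReduceTasks(2);                          // number of partitions/reducers
// Map-only job: skips partitioning, shuffling and sorting entirely
// job.setNumReduceTasks(0);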
Hadoop Configuration
6. Explain Hadoop's important configuration parameters
hdfs-site.xml configuration
Note: dfs.data.dir and dfs.name.dir are deprecated; use dfs.namenode.name.dir and dfs.datanode.data.dir instead.
<!-- Reduce the interval for JobClient status reports on single node systems
-->
<property>
<name>mapreduce.client.progressmonitor.pollinterval</name>
<value>10</value> <!-- default value is 1000 milliseconds -->
</property>
<property>
<name>mapreduce.job.reduce.slowstart.completedmaps</name>
<value>0</value>
</property>
<!-- Enable Snappy for MapReduce intermediate compression for the whole
cluster -->
<property>
<name>mapreduce.map.output.compress</name>
<value>true</value>
</property>
<property>
<name>mapreduce.map.output.compress.codec</name>
<value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
4. Enabling WebHDFS
Set the following property in hdfs-site.xml:
hdfs-site.xml
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>
org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
6. Enabling Trash
fs.trash.interval (core-site.xml)
Description: The number of minutes after which a trash checkpoint directory is deleted. This option can be configured on both the server and the client.
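For example, trash can be enabled in core-site.xml roughly as follows (the one-day retention value is illustrative):
<property>
  <name>fs.trash.interval</name>
  <value>1440</value> <!-- minutes to keep trashed files before they are permanently deleted -->
</property>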
yarn-site.xml configuration

yarn.resourcemanager.admin.address, yarn.resourcemanager.scheduler.address, yarn.resourcemanager.resource-tracker.address, yarn.resourcemanager.webapp.address
Description: The host:port addresses on which the ResourceManager's admin, scheduler, resource-tracker and web UI interfaces listen.

yarn.application.classpath
Recommended value: $HADOOP_CONF_DIR, $HADOOP_COMMON_HOME/*, $HADOOP_COMMON_HOME/lib/*, $HADOOP_HDFS_HOME/*, $HADOOP_HDFS_HOME/lib/*, $HADOOP_MAPRED_HOME/*, $HADOOP_MAPRED_HOME/lib/*, $HADOOP_YARN_HOME/*, $HADOOP_YARN_HOME/lib/*
Description: Classpath for typical applications.

yarn.log-aggregation-enable
Recommended value: true
Description: Enables aggregation of container logs.

yarn.nodemanager.local-dirs
Recommended value: file:///data/1/yarn/local, file:///data/2/yarn/local, file:///data/3/yarn/local
Description: Specifies the URIs of the directories where the NodeManager stores its localized files. All of the files required for running a particular YARN application are put here for the duration of the application run. Cloudera recommends that this property specify a directory on each of the JBOD mount points, for example file:///data/1/yarn/local through file:///data/N/yarn/local.

yarn.nodemanager.log-dirs
Recommended value: file:///data/1/yarn/logs, file:///data/2/yarn/logs, file:///data/3/yarn/logs
Description: Specifies the URIs of the directories where the NodeManager stores container log files. Cloudera recommends that this property specify a directory on each of the JBOD mount points, for example file:///data/1/yarn/logs through file:///data/N/yarn/logs.

yarn.nodemanager.remote-app-log-dir
Recommended value: hdfs://<namenode-host.company.com>:8020/var/log/hadoop-yarn/apps
Description: Specifies the URI of the directory where logs are aggregated. Set the value to either hdfs://namenode-host.company.com:8020/var/log/hadoop-yarn/apps, using the fully qualified domain name of your NameNode host, or hdfs:/var/log/hadoop-yarn/apps.
mapreduce.jobhistory.address
Recommended value: historyserver.company.com:10020
Description: The address (host:port) of the JobHistory Server.

mapreduce.jobhistory.webapp.address
Recommended value: historyserver.company.com:19888
Description: The address (host:port) of the JobHistory Server web application.

hadoop.proxyuser.mapred.groups
Recommended value: *
Description: Allows the mapred user to move files belonging to users in these groups.

hadoop.proxyuser.mapred.hosts
Recommended value: *
Description: Allows the mapred user to move files belonging to users on these hosts.
yarn.app.mapreduce.am.staging-dir
Recommended value: /user
Description: YARN requires a staging directory for temporary files created by running jobs. By default it creates /tmp/hadoop-yarn/staging with restrictive permissions that may prevent your users from running jobs. To forestall this, you should configure and create the staging directory yourself.

mapreduce.jobhistory.intermediate-done-dir
Recommended value: /user/tmp
Description: Set permissions on the intermediate-done-dir to 777.

mapreduce.jobhistory.done-dir
Recommended value: /user/done
Description: Set permissions on the done-dir to 750.
Default Configuration
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/core-default.xml
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
Hadoop Configuration
7. Role of YARN, Hue, Application Manager, Node Manager in MapReduce
A. YARN
YARN is the resource management and job scheduling technology in the open source Hadoop
distributed processing framework. YARN is responsible for allocating system resources to the
various applications running in a Hadoop cluster and scheduling tasks to be executed on
different cluster nodes.
B. HUE
Hue is a web-based interactive query editor in the Hadoop stack that lets you visualize and
share data. Hue brings the power of business intelligence (BI) and analytics to SQL developers.
It’s built to bridge the gap between IT and the business for trusted self-service analytics.
It allows us to:
- BI - Query Hive, Impala or HBase database tables
- View HDFS/Amazon S3 file system
- View Job Status
- Add/View Oozie Workflows
- Configure Hive security (database and table privileges)
The Resource Manager is a single point of failure in YARN. By delegating per-application work to Application Masters, YARN spreads the metadata related to running applications over the cluster. This reduces the load on the Resource Manager and makes it quickly recoverable.
The Application Manager is responsible for maintaining a list of submitted applications. After an application is submitted by the client, the Application Manager first validates whether the resource requirement for that application's Application Master can be satisfied. If enough resources are available, it forwards the application to the scheduler; otherwise the application is rejected. It also makes sure that no other application is submitted with the same application ID.
Note: The Application Manager keeps a cache of completed applications, so that if a user requests application data via the web UI or command line at a later point in time, it can fulfil the request.
The Application Master is responsible for the execution of a single application. It asks for containers from the Resource Manager's scheduler and executes specific programs (e.g., the main method of a Java class) on the obtained containers. The Application Master knows the application logic and is thus framework-specific; the MapReduce framework provides its own implementation of an Application Master.
The Hadoop YARN Node Manager is the per-node framework agent responsible for containers: it monitors their resource usage (memory, CPU), tracks node health, manages logs and auxiliary services, and reports all of this to the Resource Manager.