Cloud Computing - Unit 3
HADOOP
Hadoop is an open source distributed processing framework that manages data processing and
storage for big data applications in scalable clusters of computer servers.
It's at the center of big data technologies that are primarily used to support advanced analytics
initiatives, including predictive analytics, data mining and machine learning.
Unstructured information: does not have any particular internal structure; for
example, plain text or image data, such as internet clickstream records, web
server and mobile application logs, social media posts, customer emails and
sensor data from the internet of things (IoT).
Raid (redundant array of independent disks) is a way of storing the same data in different places
(thus, redundantly) on multiple hard disks.
HDFS clusters do not benefit from using RAID (Redundant Array of Independent Disks) for DataNode
storage (although RAID is recommended for the NameNode's disks, to protect against corruption of
its metadata). The redundancy that RAID provides is not needed, since HDFS handles it by replication
between nodes.
The Hadoop Distributed File System (HDFS) is the primary storage system used
by Hadoop applications.
RAID is therefore not used for DataNode storage, although it is still useful for some master-node
processes, such as the Hive Metastore.
Processing Techniques:
Batch Processing: Processing of previously collected jobs in a single batch.
The component to provide online access was HBase, a key-value store that uses HDFS for its
underlying storage. HBase provides both online read/write access of individual rows and batch operations
for reading and writing data in bulk, making it a
good solution for building applications on.
Goals of HDFS
• Very Large Distributed File System
– 10K nodes, 100 million files, 10PB
• Assumes Commodity Hardware
– Files are replicated to handle hardware failure
HDFS Architecture
Functions of a NameNode
• Manages File System Namespace
– Maps a file name to a set of blocks
– Maps a block to the DataNodes where it resides
• Cluster Configuration Management
• Replication Engine for Blocks
NameNode Metadata
• Metadata in Memory
– The entire metadata is in main memory
– No demand paging of metadata
• Types of metadata
– List of files
– List of Blocks for each file
– List of DataNodes for each block
– File attributes, e.g. creation time, replication factor
• A Transaction Log
DataNode
• A Block Server
– Stores data in the local file system (e.g. ext3)
– Stores metadata of a block (e.g. CRC)
– Serves data and metadata to Clients
• Block Report
– Periodically sends a report of all existing blocks to the NameNode
• Facilitates Pipelining of Data
– Forwards data to other specified DataNodes
Block Placement
• Current Strategy
– One replica on local node
– Second replica on a remote rack
– Third replica on same remote rack
– Additional replicas are randomly placed
• Clients read from nearest replicas
• Would like to make this policy pluggable
Heartbeats
• DataNodes send heartbeat to the NameNode: Once every 3 seconds
• NameNode uses heartbeats to detect DataNode failure
Replication Engine
• NameNode detects DataNode failures
– Chooses new DataNodes for new replicas
– Balances disk usage
– Balances communication traffic to DataNodes
Data Correctness
• Use Checksums to validate data: Use CRC32
• File Creation
– Client computes checksum per 512 bytes
– DataNode stores the checksum
• File access
– Client retrieves the data and checksum from DataNode
– If Validation fails, Client tries other replicas
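As an illustration of the per-chunk checksumming described above (this is not HDFS's internal implementation; the 512-byte chunk size follows the notes and the class name is made up), plain Java's CRC32 can be applied to every 512-byte chunk of a file:

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.CRC32;

// Illustration only: compute a CRC32 checksum for every 512-byte chunk of a file,
// the same granularity the notes above describe for HDFS clients.
public class ChunkChecksum {
    public static void main(String[] args) throws IOException {
        try (InputStream in = new FileInputStream(args[0])) {
            byte[] chunk = new byte[512];
            int n, chunkNo = 0;
            while ((n = in.read(chunk)) > 0) {
                CRC32 crc = new CRC32();
                crc.update(chunk, 0, n);          // checksum only the bytes actually read
                System.out.printf("chunk %d: crc32=%08x%n", chunkNo++, crc.getValue());
            }
        }
    }
}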
NameNode Failure
• A single point of failure
• Transaction Log stored in multiple directories
– A directory on the local file system
– A directory on a remote file system (NFS/CIFS)
• Need to develop a real HA solution
Data Pipelining
• Client retrieves a list of DataNodes on which to place replicas of a block
• Client writes block to the first DataNode
• The first DataNode forwards the data to the next node in the Pipeline
• When all replicas are written, the Client moves on to write the next block in file
Secondary NameNode
• Copies FsImage and Transaction Log from Namenode to a temporary directory
• Merges FSImage and Transaction Log into a new FSImage in temporary directory
• Uploads new FSImage to the NameNode: Transaction Log on NameNode is purged
Java Interface, Dataflow of File read & File write,
Java Interface
The Hadoop FileSystem class is the API for interacting with one of Hadoop's filesystems.
Although we focus mainly on the HDFS implementation, DistributedFileSystem, in general you should
strive to write your code against the FileSystem abstract class, to retain portability across filesystems. This
is very useful when testing your program, for example, because you can rapidly run tests using data stored
on the local filesystem. One of the simplest ways to read data from a Hadoop filesystem is by using a
java.net.URL object to open a stream to read the data from.
Table 3.1. Hadoop filesystems
Create a directory
To see how it is displayed in the listing:
% hadoop fs -mkdir books
% hadoop fs -ls .
Found 2 items
drwxr-xr-x - tom supergroup 0 2014-10-04 13:22 books
-rw-r----- 1 tom supergroup 119 2014-10-04 13:21 quangle.txt
Example: rw- r-- --- (owner, group and others permission modes)
Copying a file from the local filesystem to HDFS:
% hadoop fs -copyFromLocal input/docs/quangle.txt \
hdfs://localhost/user/tom/quangle.txt
% hadoop fs -ls .
quangle.txt
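The URLCat program invoked below reads a file by opening a stream from a Hadoop URL. A minimal sketch, assuming the standard FsUrlStreamHandlerFactory shipped with Hadoop is used to register the hdfs:// scheme:

import java.io.InputStream;
import java.net.URL;
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

public class URLCat {
    static {
        // Teach the JVM how to resolve hdfs:// URLs; this may be called at most once per JVM.
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    }

    public static void main(String[] args) throws Exception {
        InputStream in = null;
        try {
            in = new URL(args[0]).openStream();      // e.g. hdfs://localhost/user/tom/quangle.txt
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}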
% export HADOOP_CLASSPATH=hadoop-examples.jar
% hadoop URLCat hdfs://localhost/user/tom/quangle.txt
On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.
Displaying files from a Hadoop filesystem on standard output by using the FileSystem directly
import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileSystemCat {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        InputStream in = null;
        try {
            in = fs.open(new Path(uri));                     // open the file in the Hadoop filesystem
            IOUtils.copyBytes(in, System.out, 4096, false);  // copy to stdout in 4 KB buffers
        } finally {
            IOUtils.closeStream(in);
        }
    }
}
The program runs as follows:
% hadoop FileSystemCat hdfs://localhost/user/tom/quangle.txt
On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.
Displaying files from a Hadoop filesystem on standard output twice, by using seek()
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileSystemDoubleCat {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        FSDataInputStream in = null;
        try {
            in = fs.open(new Path(uri));
            IOUtils.copyBytes(in, System.out, 4096, false);
            in.seek(0); // go back to the start of the file
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}
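The usage below refers to a FileCopyWithProgress program that is not listed in these notes. A minimal sketch, assuming the standard FileSystem.create(Path, Progressable) call, could look like this:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.util.Progressable;

public class FileCopyWithProgress {
    public static void main(String[] args) throws Exception {
        String localSrc = args[0];                 // local file to copy
        String dst = args[1];                      // destination, e.g. hdfs://localhost/user/tom/1400-8.txt
        InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(dst), conf);
        // The Progressable callback is called periodically by Hadoop as data is written,
        // so the program prints a dot to show progress.
        OutputStream out = fs.create(new Path(dst), new Progressable() {
            public void progress() {
                System.out.print(".");
            }
        });
        IOUtils.copyBytes(in, out, 4096, true);    // close both streams when the copy finishes
    }
}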
Typical usage:
% hadoop FileCopyWithProgress input/docs/1400-8.txt
hdfs://localhost/user/tom/1400-8.txt
MapReduce
MapReduce—and the other processing models in Hadoop—scales linearly with the size of
the data. Data is partitioned, and the functional primitives (like map and reduce) can work in
parallel on separate partitions. This means that if you double the size of the input data, a job will
run twice as slowly. But if you also double the size of the cluster, a job will run as fast as the
original one. Hadoop can run MapReduce programs written in various languages. MapReduce
programs are inherently parallel, thus putting very large-scale data analysis into the hands of
anyone with enough machines at their disposal.
MapReduce - What?
• MapReduce is a programming model for efficient distributed computing
• It works like a Unix pipeline
– cat input | grep | sort | uniq -c | cat > output
– Input | Map | Shuffle & Sort | Reduce | Output
• Efficiency from
– Streaming through data, reducing seeks
– Pipelining
• A good fit for a lot of applications
– Log processing
– Web index building
MapReduce - Features
• Fine grained Map and Reduce tasks
– Improved load balancing
– Faster recovery from failed tasks
• Automatic re-execution on failure
– In a large cluster, some nodes are always slow or flaky
– Framework re-executes failed tasks
• Locality optimizations
– With large data, bandwidth to data is a problem
– Map-Reduce + HDFS is a very effective solution
– Map-Reduce queries HDFS for locations of input data
– Map tasks are scheduled close to the inputs when possible
Input splitting, map and reduce functions, Specifying input and output parameters,
MapReduce:
MapReduce is a programming framework that allows us to perform distributed and parallel processing
on large data sets in a distributed environment.
Let us understand more about MapReduce and its components. MapReduce majorly has the following
three Classes. They are,
Mapper Class
The first stage in data processing using MapReduce is the Mapper class. Here, the RecordReader
processes each input record and generates the corresponding key-value pair. Hadoop stores this
intermediate mapper output on the local disk.
Input Split: It is the logical representation of data. It represents a block of work that contains a
single map task in the MapReduce Program.
RecordReader: It interacts with the Input Split and converts the obtained data into key-value pairs.
Reducer Class
The Intermediate output generated from the mapper is fed to the reducer which processes it and
generates the final output which is then saved in the HDFS.
Driver Class
The major component in a MapReduce job is a Driver Class. It is responsible for setting up a
MapReduce job to run in Hadoop. We specify the names of the Mapper and Reducer classes along with the
data types and their respective job names.
Advantages of MapReduce
The two biggest advantages of MapReduce are:
1. Parallel Processing:
In MapReduce, we are dividing the job among multiple nodes and each node works with a part of
the job simultaneously. So, MapReduce is based on Divide and Conquer paradigm which helps us to
process the data using different machines. As the data is processed by multiple machines in parallel
instead of by a single machine, the time taken to process the data is reduced by a tremendous amount.
2. Data Locality:
Instead of moving data to the processing unit, we are moving the processing unit to the data in the
MapReduce Framework. In the traditional system, we used to bring data to the processing unit and
process it. But, as the data grew and became very huge, bringing this huge amount of data to the
processing unit posed the following issues:
Moving huge data to processing is costly and deteriorates the network performance.
Processing takes time as the data is processed by a single unit which becomes the bottleneck.
The master node can get over-burdened and may fail.
Now, MapReduce allows us to overcome the above issues by bringing the processing unit to the data.
The data is distributed among multiple nodes, and each node processes the part of the data residing on it.
This allows us to have the following advantages:
It is very cost-effective to move processing unit to the data.
The processing time is reduced as all the nodes are working with their part of the data in parallel.
Every node gets a part of the data to process and therefore, there is no chance of a node getting
overburdened.
• Mapper
– Input: value: lines of text of input
– Output: key: word, value: 1
• Reducer
– Input: key: word, value: set of counts
– Output: key: word, value: sum
• Launching program
– Defines this job
– Submits job to cluster
Let us understand, how a MapReduce works by taking an example where I have a text file called
example.txt whose contents are as follows:
Dear, Bear, River, Car, Car, River, Deer, Car and Bear
Now, suppose we have to perform a word count on example.txt using MapReduce. So, we will
be finding the unique words and the number of occurrences of those unique words.
First, we divide the input into three splits. This will distribute the work among all the map nodes.
Then, we tokenize the words in each of the mappers and give a hardcoded value (1) to each of the
tokens or words. The rationale behind giving a hardcoded value equal to 1 is that every word, in
itself, will occur once.
Now, a list of key-value pair will be created where the key is nothing but the individual words and
value is one. So, for the first line (Dear Bear River) we have 3 key-value pairs – Dear, 1; Bear, 1;
River, 1. The mapping process remains the same on all the nodes.
After the mapper phase, a partition process takes place where sorting and shuffling happen so that
all the tuples with the same key are sent to the corresponding reducer.
So, after the sorting and shuffling phase, each reducer will have a unique key and a list of values
corresponding to that very key. For example, Bear, [1,1]; Car, [1,1,1].., etc.
Now, each Reducer counts the values which are present in its list of values. For example, the
reducer for the key Bear gets the list [1,1]; it counts the number of ones in the list and gives
the final output as – Bear, 2.
Finally, all the output key/value pairs are then collected and written in the output file.
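The walk-through above maps directly onto code. A complete word-count job written against the standard org.apache.hadoop.mapreduce API might look like the following sketch (input and output paths are passed as arguments; the class names are illustrative):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: input is a line of text; output is (word, 1) for every token in the line.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: input is (word, [1,1,...]); output is (word, sum).
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Driver: wires the Mapper, Combiner, Reducer and the input/output paths together.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // safe here: summing is commutative and associative
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. example.txt in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Run against example.txt, the reducers emit pairs such as Bear 2 and Car 3, matching the walk-through above.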
Job is typically used to specify the Mapper, combiner (if any), Partitioner, Reducer, InputFormat,
OutputFormat implementations. FileInputFormat indicates the set of input files
(FileInputFormat.setInputPaths(Job, Path…)/FileInputFormat.addInputPath(Job, Path)) and
(FileInputFormat.setInputPaths(Job, String…)/FileInputFormat.addInputPaths(Job, String)) and
where the output files should be written (FileOutputFormat.setOutputPath(Path)).
Optionally, Job is used to specify other advanced facets of the job, such as the Comparator to be
used, files to be put in the DistributedCache, whether intermediate and/or job outputs are to be compressed
(and how), and whether job tasks can be executed in a speculative manner.
Configured Parameters: The following properties are localized in the job configuration for each task's execution:
Name Type Description
mapreduce.job.id String The job id
mapreduce.job.jar String job.jar location in job directory
mapreduce.job.local.dir String The job specific shared scratch space
mapreduce.task.id String The task id
mapreduce.task.attempt.id String The task attempt id
mapreduce.task.is.map boolean Is this a map task
mapreduce.task.partition int The id of the task within the job
mapreduce.map.input.file String The filename that the map is reading from
mapreduce.map.input.start long The offset of the start of the map input split
mapreduce.map.input.length long The number of bytes in the map input split
mapreduce.task.output.dir String The task’s temporary output directory
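These properties can be read inside a task from its Configuration. A small sketch (the property names come from the table above; whether every one of them is populated can depend on the InputFormat and Hadoop version):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Logs a few of the localized job/task properties when each map task starts.
public class LoggingMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        System.out.println("job id       : " + conf.get("mapreduce.job.id"));
        System.out.println("task attempt : " + conf.get("mapreduce.task.attempt.id"));
        System.out.println("is map task? : " + conf.getBoolean("mapreduce.task.is.map", true));
        System.out.println("input file   : " + conf.get("mapreduce.map.input.file"));
    }
}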
JobConfs :
• Jobs are controlled by configuring JobConf
• JobConfs are maps from attribute names to string values
• The framework defines attributes to control how the job is Executed
– conf.set("mapred.job.name", "MyApp");
• Applications can add arbitrary values to the JobConf
– conf.set("my.string", "foo");
– conf.setInt("my.integer", 12);
• JobConf is available to all tasks
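A minimal driver sketch using the older org.apache.hadoop.mapred API that JobConf belongs to (the class name MyApp and the paths are placeholders):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MyApp {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(MyApp.class);
        conf.setJobName("MyApp");                 // same effect as conf.set("mapred.job.name", "MyApp")
        conf.set("my.string", "foo");             // arbitrary application-defined attribute
        conf.setInt("my.integer", 12);            // typed setters exist for non-string values
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        // Mapper/Reducer classes would be set here with conf.setMapperClass(...) etc.
        JobClient.runJob(conf);                   // submits the job and waits for completion
    }
}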
“Parallel computing is the simultaneous use of more than one processor to solve a problem” .
“Distributed computing is the simultaneous use of more than one computer to solve a problem”
Hadoop is designed to handle batch processing efficiently whereas Spark is designed to handle
real-time data efficiently. Hadoop is a high-latency computing framework, which does not have an
interactive mode, whereas Spark is a low-latency computing framework that can process data interactively.
1. HDFS – Hadoop Distributed File System. This is the file system that manages the storage of
large sets of data across a Hadoop cluster. HDFS can handle both structured and unstructured data.
The storage hardware can range from any consumer-grade HDDs to enterprise drives.
2. MapReduce. The processing component of the Hadoop ecosystem. It assigns the data fragments
from the HDFS to separate map tasks in the cluster. MapReduce processes the chunks in parallel
to combine the pieces into the desired result.
3. YARN. Yet Another Resource Negotiator. Responsible for managing computing resources and job
scheduling.
4. Hadoop Common. The set of common libraries and utilities that other modules depend on.
Another name for this module is Hadoop core, as it provides support for all other Hadoop
components.
What is Hadoop?
Apache Hadoop is a platform that handles large datasets in a distributed fashion. The framework
uses MapReduce to split the data into blocks and assign the chunks to nodes across a cluster. MapReduce
then processes the data in parallel on each node to produce a unique output.
Every machine in a cluster both stores and processes data. Hadoop stores the data to disks
using HDFS. The software offers seamless scalability options. You can start with as low as one machine
and then expand to thousands, adding any type of enterprise or commodity hardware.
The Hadoop ecosystem is highly fault-tolerant. Hadoop does not depend on hardware to achieve
high availability. At its core, Hadoop is built to look for failures at the application layer. By replicating
data across a cluster, when a piece of hardware fails, the framework can build the missing parts from
another location. The nature of Hadoop makes it accessible to everyone who needs it. The open-source
community is large and paved the path to accessible big data processing.
What is Spark?
Apache Spark is an open-source tool. This framework can run in a standalone mode or on a cloud
or cluster manager such as Apache Mesos, and other platforms. It is designed for fast performance and
uses RAM for caching and processing data.
Spark performs different types of big data workloads. This includes MapReduce-like batch
processing, as well as real-time stream processing, machine learning, graph computation, and interactive
queries. With easy to use high-level APIs, Spark can integrate with many different libraries,
including PyTorch and TensorFlow.
The Spark engine was created to improve the efficiency of MapReduce and keep its benefits. Even
though Spark does not have its file system, it can access data on many different storage solutions. The data
structure that Spark uses is called Resilient Distributed Dataset, or RDD.
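As a sketch of what working with an RDD looks like, here is a word count using Spark's Java API (the application name and paths are placeholders):

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SparkWordCount");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // An RDD is built lazily from the input file and transformed in memory.
        JavaRDD<String> lines = sc.textFile(args[0]);
        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum);

        counts.saveAsTextFile(args[1]);
        sc.stop();
    }
}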
There are five main components of Apache Spark:
1. Apache Spark Core. The basis of the whole project. Spark Core is responsible for necessary
functions such as scheduling, task dispatching, input and output operations, fault recovery, etc.
Other functionalities are built on top of it.
2. Spark Streaming. This component enables the processing of live data streams. Data can originate
from many different sources, including Kafka, Kinesis, Flume, etc.
3. Spark SQL. Spark uses this component to gather information about the structured data and how
the data is processed.
4. Machine Learning Library (MLlib). This library consists of many machine learning algorithms.
MLlib’s goal is scalability and making machine learning more accessible.
5. GraphX. A set of APIs used for facilitating graph analytics tasks.
Hive Storage Format: Items to be considered while choosing a file format for storage include:
Support for columnar storage
Splitability
Compression
Schema evolution
Indexing capabilities
• Partitioning breaks table into separate files for each (dt, country) pair
Ex: /hive/page_view/dt=2008-06-08,country=USA
/hive/page_view/dt=2008-06-08,country=CA
A Simple Query
• Find all page views coming from xyz.com on March 31st:
SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01'
AND page_views.date <= '2008-03-31'
AND page_views.referrer_url like '%xyz.com';
• Hive only reads the partitions in the range dt=2008-03-01 through dt=2008-03-31 instead of scanning the entire table
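The same query can also be issued from Java over JDBC, assuming a HiveServer2 instance is reachable; the connection URL below is an assumption for a local, unsecured setup:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; the hive-jdbc jar must be on the classpath.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT page_views.* FROM page_views " +
                     "WHERE page_views.date >= '2008-03-01' " +
                     "AND page_views.date <= '2008-03-31' " +
                     "AND page_views.referrer_url LIKE '%xyz.com'")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));   // print the first column of each row
            }
        }
    }
}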
Pig Limitations:
Pig does not support random reads or queries in the order of tens of milliseconds.
Pig does not support random writes to update small portions of data, all writes are bulk,
streaming writes, just like MapReduce.
Low latency queries are not supported in Pig, thus it is not suitable for OLAP and OLTP.
MapReduce vs. Pig:
• MapReduce: the development cycle is very long; writing mappers and reducers, compiling and packaging the code, submitting jobs, and retrieving the results is a time-consuming process. Pig: no need to compile or package code; Pig operators are converted into map or reduce tasks internally.
• MapReduce: preferable for extracting a small portion of data from large datasets. Pig: not suitable for extracting small portions of data in a large dataset, since it is set up to scan the whole dataset, or at least large portions of it.
• MapReduce: not easily extendable; we need to write functions starting from scratch. Pig: UDFs tend to be more reusable than the libraries developed for writing MapReduce programs.
• MapReduce: needed when we want very deep and fine-grained control over the way our data is processed. Pig: sometimes it is not very convenient to express what we need exactly in terms of Pig and Hive queries.
• MapReduce: performing data set joins is very difficult. Pig: joins are simple to achieve.
Example Problem
Suppose you have user data in a file,
website data in another, and you need to find the top
5 most visited pages by users aged 18-25.
In Pig Latin
Users    = load 'users' as (name, age);
Filtered = filter Users by age >= 18 and age <= 25;
Pages    = load 'pages' as (user, url);
Joined   = join Filtered by name, Pages by user;
Grouped  = group Joined by url;
Summed   = foreach Grouped generate group, COUNT(Joined) as clicks;
Sorted   = order Summed by clicks desc;
Top5     = limit Sorted 5;
store Top5 into 'top5sites';
HBase - What?
• Modeled on Google’s Bigtable
• Row/column store
• Billions of rows/millions of columns
• Column-oriented - nulls are free
• Untyped - stores byte[]
Important topics in HBase architecture include:
HBase Data Model
HBase Architecture and its Components
HBase Write Mechanism
HBase Read Mechanism
HBase Performance Optimization Mechanisms
To better understand this, consider an example table. If the table is stored in a row-oriented
database, it will store the records row by row; a column-oriented store such as HBase instead groups
the values of each column family together.
HDFS: HDFS works underneath the HBase components and stores large amounts of
data in a distributed manner.
HBase read and write data operations from the client into an HFile proceed as follows.
Step 1) The client wants to write data and first communicates with the Region server and then with the regions.
Step 2) The regions contact the MemStore for storing the data associated with the column family.
Step 3) Data is first stored in the MemStore, where it is sorted, and after that it flushes into an HFile. The main reason for using the MemStore is to store data in a distributed file system based on row key. The MemStore is placed in the Region server's main memory, while HFiles are written into HDFS.
Step 4) The client wants to read data from the regions.
Step 5) The client can have direct access to the MemStore and can request the data.
Step 6) The client approaches the HFiles to get the data. The data is fetched and retrieved by the client.
Store: there is one Store per column family for each region of the table.
MemStore: there is one MemStore per Store for each region of the table; it sorts data before flushing into HFiles, and write and read performance increases because of this sorting.
StoreFile: there are StoreFiles for each Store for each region of the table.
HBase vs. HDFS:
• HBase is accessed through shell commands, a client API in Java, REST, Avro or Thrift; HDFS is primarily accessed through MR (MapReduce) jobs.
• HBase supports both storage and processing; HDFS is only for storage.
Some typical IT industry applications use HBase operations along with Hadoop. Applications
include stock exchange data and online banking data operations, for which HBase is a well-suited
processing solution.
HBase - Querying
• Retrieve a cell
Cell = table.getRow("enclosure1").getColumn("animal:type").getValue();
• Retrieve a row
RowResult = table.getRow("enclosure1");
• Scan through a range of rows
Scanner s = table.getScanner(new String[] { "animal:type" });
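The calls above use a very old HTable interface. With the current HBase Java client, the equivalent reads would look roughly like the sketch below (the table name "zoo" is an assumption; the row key and column follow the example above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseQueryExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("zoo"))) {   // table name is illustrative

            // Retrieve a single cell: row "enclosure1", column family "animal", qualifier "type".
            Get get = new Get(Bytes.toBytes("enclosure1"));
            get.addColumn(Bytes.toBytes("animal"), Bytes.toBytes("type"));
            Result row = table.get(get);
            byte[] cell = row.getValue(Bytes.toBytes("animal"), Bytes.toBytes("type"));
            System.out.println("animal:type = " + Bytes.toString(cell));

            // Scan a range of rows, reading only the animal:type column.
            Scan scan = new Scan();
            scan.addColumn(Bytes.toBytes("animal"), Bytes.toBytes("type"));
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result r : scanner) {
                    System.out.println(Bytes.toString(r.getRow()));
                }
            }
        }
    }
}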
Consider your own student data set. The data comes in different formats (Facebook, Twitter,
Oracle database, notepad, Excel, etc.). The data set consists of different sets of marks/grades; students
from different locations (city/village), categories, family backgrounds, and levels of family education;
different modes of education (native language or a specific language); and students in different streams of
study (school/polytechnic/college/university, whether +2, Diploma, Engineering, Arts, U.G., P.G., or
Ph.D.), along with present work and earnings, certificate courses, and placement records. From this data
set, analyse which type of students perform well in the outside world/social sector. The analysis is based
on the age group 25-30.
Assume the data set is located in various places in various forms, i.e., the records are stored in a
distributed fashion. For example, school student records are stored in different school database servers,
while college students' academic records are available in the university records.
Work:
1. Setup own Hadoop platform, use different types of tools (Spark, Hive, Hbase, Pig, based your
requirements), Minimum two tools should use in your work.
2. Consider different types of data format adopted by different sectors
3. Create your own database.
4. Design your own architecture for analysis purposes.
Output:
1. Screenshots of the setup
2. Write the steps involved while setting up
3. Configuration
4. Analysis report/ Your prediction
Work:
1. Reference Contents and installation from various web sites, &
https://www.tutorialspoint.com
2. Hands-on, try yourself: configuring and running a job; Hadoop vs. Spark.
13) How can we change the split size if our commodity hardware has less storage space?
If our commodity hardware has less storage space, we can change the split size by writing the
‘custom splitter’. There is a feature of customization in Hadoop which can be called from the main
method.
17) Why we cannot do aggregation (addition) in a mapper? Why we require reducer for that?
We cannot do aggregation (addition) in a mapper because, sorting is not done in a mapper. Sorting
happens only on the reducer side. Mapper method initialization depends upon each input split. While
doing aggregation, we will lose the value of the previous instance. For each row, a new mapper will
get initialized. For each row, inputsplit again gets divided into mapper, thus we do not have a track of
the previous row value.
20) What is the difference between an HDFS Block and Input Split?
HDFS Block is the physical division of the data and Input Split is the logical division of the data.
25) What is a JobTracker in Hadoop? How many instances of JobTracker run on a Hadoop
Cluster?
JobTracker is the daemon service for submitting and tracking MapReduce jobs in Hadoop. There
is only One Job Tracker process run on any hadoop cluster. Job Tracker runs on its own JVM process.
In a typical production cluster its run on a separate machine. Each slave node is configured with job
tracker node location. The JobTracker is single point of failure for the Hadoop MapReduce service. If
it goes down, all running jobs are halted. JobTracker in Hadoop performs the following actions (from
the Hadoop Wiki):
Client applications submit jobs to the Job tracker.
The JobTracker talks to the NameNode to determine the location of the data
The JobTracker locates TaskTracker nodes with available slots at or near the data
The JobTracker submits the work to the chosen TaskTracker nodes.
The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they
are deemed to have failed and the work is scheduled on a different TaskTracker.
A TaskTracker will notify the JobTracker when a task fails. The JobTracker decides what to do
then: it may resubmit the job elsewhere, it may mark that specific record as something to avoid, and it
may even blacklist the TaskTracker as unreliable.
When the work is completed, the JobTracker updates its status.
Client applications can poll the JobTracker for information.
27) What is a Task Tracker in Hadoop? How many instances of TaskTracker run on a Hadoop
Cluster
A TaskTracker is a slave node daemon in the cluster that accepts tasks (Map, Reduce and
Shuffle operations) from a JobTracker. There is only One Task Tracker process run on any hadoop
slave node. Task Tracker runs on its own JVM process. Every TaskTracker is configured with a set of
slots, these indicate the number of tasks that it can accept. The TaskTracker starts a separate JVM
processes to do the actual work (called as Task Instance) this is to ensure that process failure does not
take down the task tracker. The TaskTracker monitors these task instances, capturing the output and
exit codes. When the Task instances finish, successfully or not, the task tracker notifies the
JobTracker. The TaskTrackers also send out heartbeat messages to the JobTracker, usually every few
seconds, to reassure the JobTracker that they are still alive. These messages also inform the JobTracker of
the number of available slots, so the JobTracker can stay up to date with where in the cluster work can
be delegated.
30) What is configuration of a typical slave node on Hadoop cluster? How many JVMs run on a
slave node?
Single instance of a Task Tracker is run on each Slave node. Task tracker is run as a separate
JVM process. Single instance of a DataNode daemon is run on each Slave node. DataNode daemon is
run as a separate JVM process. One or Multiple instances of Task Instance is run on each slave node.
Each task instance is run as a separate JVM process. The number of Task instances can be controlled
by configuration. Typically a high end machine is configured to run more task instances.
35) What are combiners? When should I use a combiner in my MapReduce Job?
Combiners are used to increase the efficiency of a MapReduce program. They are used to
aggregate intermediate map output locally on individual mapper outputs. Combiners can help you
reduce the amount of data that needs to be transferred across to the reducers. You can use your reducer
code as a combiner if the operation performed is commutative and associative. The execution of the
combiner is not guaranteed; Hadoop may or may not execute a combiner, and if required it may
execute it more than once. Therefore, your MapReduce jobs should not depend on the combiner's
execution.
37) What is the Hadoop MapReduce API contract for a key and value Class?
The Key must implement the org.apache.hadoop.io.WritableComparable interface.
The value must implement the org.apache.hadoop.io.Writable interface.
42) What is HDFS Block size? How is it different from traditional file system block size?
In HDFS data is split into blocks and distributed across multiple nodes in the cluster. Each block is
typically 64 MB or 128 MB in size. Each block is replicated multiple times; the default is to replicate each
block three times. Replicas are stored on different nodes. HDFS utilizes the local file system to store
each HDFS block as a separate file. HDFS Block size can not be compared with the traditional file
system block size.
43) What is a NameNode? How many instances of NameNode run on a Hadoop Cluster?
The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all
files in the file system, and tracks where across the cluster the file data is kept. It does not store the
data of these files itself. There is only One NameNode process run on any hadoop cluster. NameNode
runs on its own JVM process. In a typical production cluster its run on a separate machine. The
NameNode is a Single Point of Failure for the HDFS Cluster. When the NameNode goes down, the
file system goes offline. Client applications talk to the NameNode whenever they wish to locate a file,
or when they want to add/copy/move/delete a file. The NameNode responds to successful requests by
returning a list of relevant DataNode servers where the data lives.
44) What is a DataNode? How many instances of DataNode run on a Hadoop Cluster?
A DataNode stores data in the Hadoop File System HDFS. There is only One DataNode
process run on any hadoop slave node. DataNode runs on its own JVM process. On startup, a