Improving the Performance of Heterogeneous Hadoop Clusters
Abstract— With the development of information technology, we have seen exponential growth in the amount of data that is generated and used over the last decade. The need to store this data and process it in order to extract value from skewed data has paved the way for parallel, distributed processing frameworks such as Hadoop. The current Hadoop implementation assumes that the compute capacity of all the nodes in a cluster is homogeneous, but looking at cloud infrastructure we can see that systems with different hardware configurations are commonly used together, which is only logical. Hence, it is necessary to study a data placement policy that distributes the data according to the processing power of each node. In this paper, we propose a dynamic block placement strategy for Hadoop that distributes the input data blocks to the nodes based on the computing capacity of each node. The proposed algorithm adapts and balances data dynamically, reorganizing the input data in HDFS according to the compute capability of each node in a heterogeneous Hadoop environment. The proposed method reduces data transfer time and thereby improves performance. We ran benchmark applications against the proposed block placement strategy and the original HDFS block placement strategy; the experimental results show that the dynamic data placement strategy can decrease execution time and improve the performance of a heterogeneous Hadoop cluster. Hadoop is also designed to handle big files with infrequent updates, but nowadays many applications need to handle small files, which has become a pertinent cause of performance degradation on the Hadoop platform. Our proposed methods show a good performance improvement in these cases: we observe speedups of 25.7x and 20x compared to the original Hadoop when running the WordCount and Grep benchmark applications.

Keywords: Hadoop, HDFS, MapReduce, Block Placement Strategy, Small Files

I. INTRODUCTION

Hadoop is gaining popularity in the high performance computing arena, as it is used for many data- and compute-intensive applications, and many techniques have been proposed for achieving better performance on Hadoop clusters. The ability of such applications to process petabytes of data generated by websites like Facebook, Amazon and Yahoo and by search engines like Google and Bing has led to a data revolution in which processing every minute piece of information relating to customers and users can add value and thereby improve core competency. Facebook processes roughly 15 terabytes of data each day using the Hadoop framework, and even scientific applications such as seismic simulation and natural language processing are run on Hadoop.

Apache Hadoop is an open source implementation of the Google MapReduce programming model. The Hadoop software framework has two main parts: the Hadoop Distributed File System (HDFS) and the MapReduce programming model. HDFS acts as the storage for MapReduce application data, while a MapReduce program contains the logic to process the application data stored on HDFS. Hadoop has two predominant versions available today, namely Hadoop 1.x and 2.x. The 2.x line improves the batch processing capability by introducing YARN (Yet Another Resource Negotiator). YARN comprises a Resource Manager and Node Managers: the Resource Manager is responsible for managing and deploying resources, and each Node Manager is responsible for managing its node and reporting the node's status to the Resource Manager.

MapReduce helps in analyzing large amounts of data and in gaining valuable insights that benefit many organizations. Applications of MapReduce include machine learning, graph mining, indexing and search, which are difficult to implement using traditional SQL. MapReduce lets programmers focus only on the transformations applied to the data, leaving the details of parallelization, network communication and fault tolerance to be handled by the system [1].

HDFS is a distributed file system which stores files redundantly across cluster nodes for security and availability. To store a file, HDFS splits it into blocks and replicates them according to a replication factor. The default HDFS block placement policy distributes the blocks reasonably well across the cluster nodes. However, there are cases in which this algorithm creates server hotspots that put unnecessary load onto cluster nodes and lower the overall performance of the cluster [2].

In this study, we developed a block placement mechanism for Hadoop that distributes a large number of input data blocks to all the nodes based on the computing capacity of each node, and we propose an algorithm which reorganizes the input data in HDFS accordingly. We also address the small files problem by combining small files into a large file. Since this alone does not overcome the NameNode memory management issue, we additionally tried using the sequence file format described below.
In the sequence file approach, the file name is stored as the key in the sequence file and the file contents are stored as the value. This works very well when the input file name needs to be preserved.

The rest of the paper is organized as follows. Section II briefly describes the Hadoop architecture, the Hadoop Distributed File System (HDFS), MapReduce and our motivation. Section III describes the dynamic block placement strategy and provides experimental results comparing our new compute-ratio balancer with the default Hadoop. Section IV explains our two methods for handling the small files problem, combining the input small files and using the sequence file, and provides experimental results comparing the proposed methods with the existing Hadoop. Section V concludes with a summary and directions for further improvement of this study.

II. BACKGROUND AND MOTIVATION

A. Hadoop and Hadoop Distributed File System

Hadoop [1] is an open source framework which is used to process large amounts of data (terabytes of data). Hadoop is written in Java by the Apache foundation and can run in parallel on large clusters, which can have thousands of systems. The Hadoop framework is attractive for many reasons: it processes data reliably by replicating the input data, it can be scaled to hundreds of nodes, it can handle terabytes or even a few petabytes of input data, and it is designed to run on commodity clusters. The idea behind commodity computing is that it is preferable to have more low-cost, low-performance hardware working in parallel than fewer pieces of high-cost, high-performance hardware. Although a relatively recent project, Hadoop is used for non-trivial applications.

The Hadoop Distributed File System [1] is an Apache Hadoop project; HDFS is a distributed, highly fault-tolerant file system built on low-cost commodity hardware. It provides high throughput and is suitable for large data sets, and it is designed so that it can be deployed on low-cost hardware. Each cluster consists of a single NameNode that maintains the metadata of the file system and multiple DataNodes which store the application data.

B. MapReduce Overview

MapReduce is a model for processing large amounts of data using parallel algorithms on a cluster. A MapReduce job divides the input data into independent chunks that can be computed in parallel; the results of the map tasks are fed as input into the reduce tasks, which produce the final result. Hadoop owes much of its reputation as a largely scalable framework to its MapReduce model.

Any job that is given to the MapReduce model goes through two stages: the first is the map stage and the second is the reduce stage. A MapReduce computation takes a set of input key-value pairs and generates a set of output key-value pairs. The user-provided map function takes key-value pairs as input and generates intermediate key-value pairs. These intermediate key-value pairs are collected together and sent to the reduce function. The reduce function, also written by the user, merges the intermediate key-value pairs and performs the user-specified reduce operation; a reducer may generate a single output key-value pair or a set of output pairs.
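To make this key-value flow concrete, the following is a minimal sketch of a word-count job written against the Hadoop 2.x Java MapReduce API (the class names here are illustrative and are not taken from the paper): the map function emits a (word, 1) pair for every token, and the reduce function sums the counts collected for each word.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountExample {

  // Map stage: for every input line, emit (word, 1) intermediate pairs.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce stage: sum the counts collected for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountExample.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```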
C. Motivation

The current Hadoop implementation assumes that the computing nodes in a cluster are homogeneous in nature. This strategy has potential benefits in a homogeneous environment, but it might not be suitable in a heterogeneous one. Furthermore, looking at real cluster infrastructures, we can see that systems with different hardware configurations are used in the same cluster, which is only logical. However, the default HDFS block placement strategy does not consider these different hardware configurations, which results in under-utilization of resources. We therefore want a block placement mechanism in Hadoop that distributes a large number of input data blocks to all the nodes based on the computing capacity of each node; to be specific, we want to design an algorithm which reorganizes the input data in HDFS.

Hadoop also works well with a small number of large files, whereas it does not work well with a large number of small files.
For small files, one file generates one InputSplit, and one InputSplit is used by one map container, which means that the number of InputSplits is equal to the number of map containers. So the first issue is that too many map containers are created by the MapReduce job for small files. Too many containers mean too many processes, and as a large number of processes are created and closed, the overhead becomes a big issue. A similar overhead can be seen for the reduce containers. We want to overcome the small files problem by combining the small files into a large file.
III. DYNAMIC BLOCK PLACEMENT STRATEGY

In a homogeneous cluster, the data gets distributed among the nodes based on the availability of space in the cluster. Hadoop has a balancing utility called the Balancer, which balances the data before applications run in case a large amount of data has accumulated on only a few nodes. The Balancer also takes care of the replication factor, which plays an important role in data movement.

In a heterogeneous environment, it is ideal to transfer data from one node to another when the faster node can account for both the data transfer overhead and the task processing. As a result, our data placement policy explores the consequence of migrating data blocks to faster nodes in the heterogeneous cluster. The implementation details of the data placement algorithm give a better overview of this goal.

A. Design

Initially we perform profiling to compute the computation ratios of the nodes in the cluster. To accomplish this, we run a set of benchmark applications, namely WordCount and Grep, and the computation ratio of each node is calculated based on its processing power.
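The paper does not give the exact formula used to turn these benchmark measurements into computation ratios, so the following is only a sketch of one plausible scheme: each node's ratio is taken as its measured speed (the reciprocal of its benchmark runtime), normalized so that the ratios of all nodes sum to one. The node names and runtimes are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Sketch of the profiling step: derive a computation ratio per node from the
 * measured running time of a probe benchmark (e.g. WordCount or Grep) on the
 * same input. The formula is an assumption, not the paper's: a node's ratio
 * is its speed (1 / runtime), normalized so that all ratios sum to 1.
 */
public class ComputeRatioProfiler {

  /** runtimeSec: benchmark runtime in seconds, keyed by node host name. */
  public static Map<String, Double> normalizedRatios(Map<String, Double> runtimeSec) {
    double totalSpeed = 0.0;
    for (double t : runtimeSec.values()) {
      totalSpeed += 1.0 / t;            // faster node => larger speed
    }
    Map<String, Double> ratios = new LinkedHashMap<>();
    for (Map.Entry<String, Double> e : runtimeSec.entrySet()) {
      ratios.put(e.getKey(), (1.0 / e.getValue()) / totalSpeed);
    }
    return ratios;
  }

  public static void main(String[] args) {
    Map<String, Double> runtime = new LinkedHashMap<>();
    runtime.put("node1", 100.0);        // hypothetical WordCount runtimes
    runtime.put("node2", 100.0);
    runtime.put("node3", 150.0);        // slower node gets a smaller ratio
    System.out.println(normalizedRatios(runtime));
  }
}
```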
The proposed design is implemented within the Hadoop Distributed File System (HDFS) and consists of two parts. In the first part, input file blocks are distributed to all the heterogeneous nodes that are part of the cluster: once the input file fragments are available, they are distributed to the corresponding compute nodes. The next step is to reorganize the input file blocks, because the data placement might get disrupted over time, and this is done by the second algorithm.

1) Initial Data Placement: The algorithm starts by evenly dividing the large input file into a number of blocks. Then, based on the performance capability of the nodes, the input file fragments are assigned to the nodes in the cluster. Nodes with high compute capability are expected to process and store more file fragments than nodes with low compute capability. The initial block placement mechanism thus distributes the data blocks to all the nodes in the heterogeneous cluster based on the performance of each node.
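As an illustration of this proportional assignment, the sketch below turns the normalized computation ratios from the profiling step into a per-node block count for a file. The rounding scheme used here (handing leftover blocks to the last node) is a simplification and an assumption; the paper only states that faster nodes receive more fragments.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Sketch of the initial data placement decision: given the normalized
 * computation ratios from profiling, decide how many of the input file's
 * blocks each node should receive.
 */
public class InitialPlacementPlanner {

  public static Map<String, Integer> blocksPerNode(Map<String, Double> ratio, int totalBlocks) {
    Map<String, Integer> plan = new LinkedHashMap<>();
    int assigned = 0;
    String last = null;
    for (Map.Entry<String, Double> e : ratio.entrySet()) {
      int n = (int) Math.floor(e.getValue() * totalBlocks);   // proportional share
      plan.put(e.getKey(), n);
      assigned += n;
      last = e.getKey();
    }
    // Hand any leftover blocks (from rounding down) to the last node; a real
    // implementation would spread them, e.g. by largest fractional share.
    if (last != null) {
      plan.put(last, plan.get(last) + (totalBlocks - assigned));
    }
    return plan;
  }
}
```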
2) Data Redistribution: After the initial input file fragments have been distributed to the different nodes, the placement may drift from the initial distribution for the following reasons: (1) new data is appended to an existing file; (2) data blocks are deleted from existing files; (3) new compute nodes are added to the existing cluster. To handle this dynamic block placement problem, we came up with an algorithm that redistributes and organizes the data based on the compute ratios.

B. Implementation Details

The CRBalancer is responsible for migrating data from one node to another, and it proceeds as follows. It first takes the network topology into consideration and then calculates the under-utilized and over-utilized DataNodes. Under-utilized and over-utilized DataNodes are then matched, and blocks are moved concurrently among the nodes. The CRBalancer also takes the replication factor into consideration by not migrating a block to a node that already holds a replica of that block for the corresponding file. After balancing takes place, the benchmarks are run and the results are examined.
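The following is a simplified sketch of the balancing decision just described, not the actual CRBalancer implementation: it compares each DataNode's current block count with the target implied by its computation ratio and pairs over-utilized nodes with under-utilized ones. Topology awareness, replica checks and the actual block transfers are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/**
 * Sketch of the compute-ratio balancing decision: emit block-move proposals
 * from over-utilized to under-utilized DataNodes until every node is at its
 * target block count.
 */
public class CrBalancerSketch {

  public static final class Move {
    final String source, target;
    final int blocks;
    Move(String s, String t, int b) { source = s; target = t; blocks = b; }
    @Override public String toString() { return blocks + " block(s): " + source + " -> " + target; }
  }

  /** current: blocks stored per node; target: blocks each node should hold. */
  public static List<Move> plan(Map<String, Integer> current, Map<String, Integer> target) {
    List<String> over = new ArrayList<>();
    List<String> under = new ArrayList<>();
    for (String node : current.keySet()) {
      int diff = current.get(node) - target.get(node);
      if (diff > 0) over.add(node);
      else if (diff < 0) under.add(node);
    }
    List<Move> moves = new ArrayList<>();
    for (String src : over) {
      int surplus = current.get(src) - target.get(src);
      for (String dst : under) {
        if (surplus == 0) break;
        int deficit = target.get(dst) - current.get(dst);
        if (deficit <= 0) continue;
        int n = Math.min(surplus, deficit);
        moves.add(new Move(src, dst, n));
        current.put(src, current.get(src) - n);   // bookkeeping only; no real transfer
        current.put(dst, current.get(dst) + n);
        surplus -= n;
      }
    }
    return moves;
  }
}
```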
C. Performance Evaluation

We ran our experiments on 6 systems. Four of them have a 3.4 GHz quad-core Intel Core i5 CPU, 16 GB of memory and a 1 TB hard disk; the 5th and 6th nodes have a 3.1 GHz quad-core Intel Core i5 CPU, 8 GB of memory and a 1 TB hard disk. The software used in this experiment is Hadoop 2.3.0. One node is assigned the master role and the other nodes are assigned the slave role. The Compute Ratio Balancer reads block information from the NameNode and runs the distribution algorithm in order to distribute the data based on the computing capacity of the nodes.
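The interface through which the Compute Ratio Balancer obtains block information is not spelled out in the paper; as a hedged illustration, the sketch below shows how per-block host information can be read from the NameNode through the public FileSystem API (the input path is hypothetical).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Illustration only: print the offset, length and hosting DataNodes of every
 * block of one HDFS file, using FileSystem#getFileBlockLocations.
 */
public class BlockLocationDump {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path(args[0]);              // e.g. a TeraGen output part file
    FileStatus status = fs.getFileStatus(file);
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation b : blocks) {
      System.out.printf("offset=%d length=%d hosts=%s%n",
          b.getOffset(), b.getLength(), String.join(",", b.getHosts()));
    }
    fs.close();
  }
}
```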
1) Experimental Results (TeraSort): The input data generated by TeraGen is taken as input to the TeraSort program, and the output, which is the sorted data, is written into the terasort-output folder. Figure 3 shows that the execution time of the new balancer for the TeraSort benchmark is lower than the execution time of the Hadoop standard Balancer and of the default Hadoop block placement strategy. The size of the input data that we sorted is 1 GB.

Fig. 3. TeraSort Benchmark

The TeraSort benchmark does not show a very large improvement even with the new balancer. To understand this, we first look at how TeraGen works. The data is written into the cluster using a MapReduce program. As already discussed for the Hadoop architecture, when a new block is created, the first replica of the block is placed in the first location allotted for the block, and the second and third replicas are placed on different nodes [4]. Any further replicas are placed randomly on different nodes, with the condition that a maximum of two replicas can be placed in the same rack. TeraGen will

Fig. 4. TestDFSIO Benchmark

From the results it is clear that even the standard HDFS Balancer does not perform well for the TestDFSIO benchmark: the execution time is almost the same, around 7 minutes, for both the default block placement strategy and the standard HDFS Balancer. However, our new CRBalancer provides better performance by reducing the execution time of the TestDFSIO benchmark. The new balancer took 3.57 minutes with a weight of 1.5; on modifying the weight to 1.2 we were able to improve the execution time further, to 2.46 minutes. With our proposed CRBalancer we improve the time taken by the program by a factor of about 3.
IV. HANDLING THE SMALL FILES PROBLEM

The small file problem does not just affect small files. If a large number of files in a Hadoop cluster are marginally larger than a multiple of the block size, the same challenges arise as with small files. For example, if the block size is 128 MB but all of the files loaded into Hadoop are 136 MB, there will be a significant number of small 8 MB blocks. Solving the small file problem is therefore significantly more complex.

A. The Issues in Traditional MapReduce for Processing Small Files

In the Hadoop MapReduce process [5], InputSplits are generated by the Application Master.
For small files, one file generates one InputSplit, and one InputSplit is used by one map container, which means that the number of InputSplits is equal to the number of map containers. So the first issue is that too many map containers are created by the MapReduce job for small files. Too many containers mean too many processes. Creating and closing a single process costs very little; however, when a large number of processes need to be created and closed, the overhead cannot be ignored. A similar overhead is generated for the reduce containers. Another issue is the imbalance between computation ability and parallelism: besides the overhead of using processes, traditional MapReduce processing also costs a lot in disk I/O and network communication.

TABLE I
PROCESSING TIME FOR WORDCOUNT APPLICATION ON ORIGINAL HADOOP AND COMBINE INPUT FILE

No. of Files    Original System (sec)    Proposed System (sec)
1000            1349                     45
1500            2090                     52
2000            2688                     57
2500            3358                     60
3000            4029                     71
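Table I refers to the proposed combine-input-file method. The paper does not list its implementation, so the sketch below only illustrates the underlying idea under stated assumptions: every small file in an input directory is concatenated into one large HDFS file, so that a subsequent job sees a few large splits rather than thousands of tiny ones. The paths and the simple byte-level concatenation are assumptions made for illustration.

```java
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

/** Minimal sketch: concatenate all small files in a directory into one HDFS file. */
public class CombineSmallFiles {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path inputDir = new Path(args[0]);     // directory holding the small files
    Path combined = new Path(args[1]);     // single large output file

    try (FSDataOutputStream out = fs.create(combined, true)) {
      for (FileStatus status : fs.listStatus(inputDir)) {
        if (!status.isFile()) continue;
        try (InputStream in = fs.open(status.getPath())) {
          IOUtils.copyBytes(in, out, conf, false);  // append this file's bytes
        }
        out.write('\n');                            // keep line-oriented records separated
      }
    }
    fs.close();
  }
}
```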
To handle small files with Hadoop's sequence file format, in which the file name is stored as the key and the file contents are stored as the value, we implemented three classes. The first is the FullFileInputFormat class, which is an extension of InputFormat. The second is FullFileRecordReader, which is an extension of the RecordReader class. Finally, SmallFilesToSequenceFile is the main class. Each file is processed as one record, and the first and second classes are needed to process an entire file as a single record.
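The paper does not list the code of these classes. The sketch below shows the usual whole-file pattern they describe, with the class names suffixed with "Sketch" to make clear that this is an illustration rather than the authors' implementation: the input format refuses to split files, and the record reader returns the entire file as a single BytesWritable record.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

/** Sketch of a whole-file input format: one small file becomes one record. */
public class FullFileInputFormatSketch
    extends FileInputFormat<NullWritable, BytesWritable> {

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;                       // never split a small file
  }

  @Override
  public RecordReader<NullWritable, BytesWritable> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    return new FullFileRecordReaderSketch();
  }

  /** Reads the entire file backing a split as a single BytesWritable value. */
  public static class FullFileRecordReaderSketch
      extends RecordReader<NullWritable, BytesWritable> {

    private FileSplit split;
    private Configuration conf;
    private final BytesWritable value = new BytesWritable();
    private boolean processed = false;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) {
      this.split = (FileSplit) split;
      this.conf = context.getConfiguration();
    }

    @Override
    public boolean nextKeyValue() throws IOException {
      if (processed) {
        return false;
      }
      byte[] contents = new byte[(int) split.getLength()];
      Path file = split.getPath();
      FileSystem fs = file.getFileSystem(conf);
      FSDataInputStream in = null;
      try {
        in = fs.open(file);
        IOUtils.readFully(in, contents, 0, contents.length);   // whole file at once
        value.set(contents, 0, contents.length);
      } finally {
        IOUtils.closeStream(in);
      }
      processed = true;
      return true;
    }

    @Override public NullWritable getCurrentKey() { return NullWritable.get(); }
    @Override public BytesWritable getCurrentValue() { return value; }
    @Override public float getProgress() { return processed ? 1.0f : 0.0f; }
    @Override public void close() { }
  }
}
```

In a driver along the lines of SmallFilesToSequenceFile, a mapper can then emit the file name taken from the split as a Text key and this BytesWritable as the value, with SequenceFileOutputFormat writing the packed sequence file.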
2) Experimental Results (WordCount): As discussed earlier, WordCount is a CPU-intensive benchmark which reads the dataset and counts the number of words in it, splitting each line on a space delimiter. Table II shows the processing time for executing the WordCount benchmark on the original Hadoop system and on the proposed sequence file method. After running our proposed approach on WordCount we see a speedup of 25.7x compared to the original Hadoop. The experiments were run on 1000, 1500, 2000, 2500 and 3000 files, and the time taken by the original Hadoop system and by the proposed sequence file method is reported in seconds.

TABLE II
TIME TAKEN TO PROCESS THE WORDCOUNT BENCHMARK ON ORIGINAL HADOOP AND SEQUENCE FILE

No. of Files    Original System (sec)    Proposed System (sec)
1000            1349                     52
1500            2090                     64
2000            2688                     72
2500            3358                     90
3000            4029                     111

GREP: Grep is a proven CPU-intensive benchmark which reads the dataset and counts the number of words in it that match a search pattern given as input. The Grep benchmark runs as a map job followed in sequence by a reduce job: the map job counts the number of times each matching string occurred, and the reduce job sorts the matching strings by the number of times they occurred (their frequency); the output is stored in a single file.
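As a hedged sketch of the map side just described (Hadoop ships its own Grep example; this is not the benchmark's actual source), the mapper below emits every substring of a line that matches a user-supplied regular expression together with a count of one. The "grep.pattern" configuration key is an assumption.

```java
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/** Sketch of a grep-style mapper: emit (matching string, 1) for every hit. */
public class GrepMapperSketch extends Mapper<LongWritable, Text, Text, LongWritable> {

  private static final LongWritable ONE = new LongWritable(1);
  private Pattern pattern;

  @Override
  protected void setup(Context context) {
    // The pattern is passed in through the job configuration (assumed key).
    pattern = Pattern.compile(context.getConfiguration().get("grep.pattern"));
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    Matcher matcher = pattern.matcher(value.toString());
    while (matcher.find()) {
      context.write(new Text(matcher.group()), ONE);   // one hit for this match
    }
  }
}
```

A summing reducer like the one in the earlier WordCount sketch then produces the per-string counts, and a small second job can sort those counts to obtain the frequency-ordered output described above.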
Fig. 8. Grep Benchmark Application

Fig. 8 shows the processing time for executing the Grep benchmark application on the original Hadoop system and on the proposed sequence file method. After running our proposed approach on the Grep application we see a speedup of 20x compared to the original Hadoop.

Even though the sequence file format handles both issues of the small files problem in Hadoop, it cannot provide quick random access to specific files, because the SequenceFile does not store the metadata of the individual files, and for such access it does not show better performance than files stored directly in HDFS.

V. CONCLUSION AND FUTURE WORK

In conclusion, our proposed dynamic block placement strategy is an improvement over the default HDFS block placement policy, and our CRBalancer yields greater performance than the HDFS standard Balancer. Future work has to come up with an efficient data placement strategy in which the data blocks are stored using a different metric that works even better, and which also deals with the network overhead and bandwidth.

We also came up with the sequence file format approach, which handles both issues of the small files problem. The sequence file method was run on the WordCount and Grep benchmark applications, and we see speedups of 25.7x and 20x respectively compared to the original Hadoop. Future work also has to come up with an optimal scheduler that manages multiple jobs with different workloads that are executed simultaneously.

ACKNOWLEDGMENT

This work is dedicated to the founder chancellor of our university, Sri Sathya Sai Baba.

REFERENCES

[1] Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107-113, 2008.
[2] Andrew Wang, "Better sorting in NetworkTopology#pseudoSortByDistance when no local node is found," https://issues.apache.org/jira/browse/HDFS-6268, 2014. [Accessed 28-April-2014].
[3] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler, "The Hadoop Distributed File System," in Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on. IEEE, 2010, pp. 1-10.
[4] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler, "The Hadoop Distributed File System," in Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on. IEEE, 2010.
[5] Fang Zhou, "Assessment of Multiple MapReduce Strategies for Fast Analytics of Small Files," Ph.D. thesis, Auburn University, 2015.
[6] Fang Zhou, Hai Pham, Jianhui Yue, Hao Zou, and Weikuan Yu, "SFMapReduce: An optimized MapReduce framework for small files," in Networking, Architecture and Storage (NAS), 2015 IEEE International Conference on. IEEE, 2015.