Algorithms and Techniques to Optimize the Performance of Hadoop Clusters, Reduce Data Processing Time and Minimize Resource Wastage
Arunakumari B N
Dept. of CSE
BMS Institute of Technology and Management, Bengaluru
Karnataka - 560064
arunakumaribn@bmsit.in
Abstract— This study investigates several methods and approaches for enhancing Hadoop cluster performance. Within these distributed computing environments, the main goals are to cut down on resource waste and data processing time. This paper explores ways to make Hadoop clusters more efficient, mainly by optimizing processes and raising overall system performance. The goal of the project is to address major issues in big data processing by developing a more resource- and time-efficient Hadoop cluster through the application of sophisticated algorithms and optimization approaches.
I. INTRODUCTION
"Big data" is a collection of extraordinarily large data sets that exceed the capabilities of conventional database administration systems. Big data has become quite popular in the world of information technology, since large amounts of new, complicated data are created every day in this industry. Data is provided by social media platforms, mobile devices, sensors, online transactions, and other sources. As data volume, velocity, and variety increase, processing leads to hurdles and complexity. Large amounts of data thus become challenging to manage, process, match, and relate. A new platform is required for the processing, storing, and transfer of the data due to its huge size and lack of structure. It is possible to process and evaluate large volumes of data on such a platform.
Ullas B K
Dept. of CSE
BMS Institute of Technology and Management, Bengaluru
Karnataka - 560064
1by20cs209@bmsit.in
Sujith S R
Dept. of CSE
BMS Institute of Technology and Management, Bengaluru
Karnataka - 560064
1by20cs197@bmsit.in
The MapReduce programming paradigm is popular across all frameworks. By using the Map and Reduce features of MapReduce, users can design parallel processes without worrying about the finer details of parallelism, including data dissemination, load balancing, and fault tolerance. MapReduce enables continuous processing of large amounts of data. Map and Reduce are the two parts of MapReduce. The initial phase in this system's parallel computing process is to assign map jobs to various nodes and have them process input data. Subsequently, the final outputs are produced by applying the reduce function to the combined map results.
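The Map and Reduce phases described above can be illustrated outside Hadoop with a minimal word-count sketch in plain Python; the helper names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative assumptions, not Hadoop API calls:

```python
from collections import defaultdict

def map_phase(record):
    # Emit (word, 1) pairs for each word in one input line.
    return [(word, 1) for word in record.split()]

def shuffle(mapped_pairs):
    # Group intermediate values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Sum the counts emitted for one word.
    return key, sum(values)

def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())
```

In a real cluster the shuffle step is performed by the framework across nodes; here it is simulated in memory only to show the data flow between the two phases.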
Big Data Processing
An organization cannot benefit from raw data. The process of gathering unprocessed data and turning it into usable information is known as data processing. In an enterprise, a team of data scientists and engineers often completes it step by step. After being gathered, filtered, sorted, processed, examined, and stored, the raw data is finally shown in an understandable manner. Organizations need data processing to improve their business strategy and gain a competitive advantage. All personnel in the organization will be able to comprehend and utilize the data if it is transformed into usable formats such as documents, graphs, and charts.
Data processing cycle
Minimizing Resource Wastage in Hadoop
In large data processing, resource allocation plays a critical role in enhancing system performance. In the context of cloud computing, resource demands for different applications can differ greatly. As a result, when certain resource capacities on a single server are depleted and others remain available, there is a resource gap. When there is greater variation in the computing resources, this issue becomes more noticeable. Prior resource-allocation algorithms gave this scenario little thought. On a server with a variety of resources, allocating resources in this way could result in a large waste of the available but underutilized resources. In order to reduce resource waste, this work suggests the minimizing resource gap (MRG) method for heterogeneous resources as a resource-allocation algorithm. MRG considers the variations in resource utilization across all servers. The results of the experiments demonstrated the effectiveness of the suggested MRG technique for increasing system utilization and lowering the total completion time for heterogeneous servers in cloud computing by as much as 24.7%.
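The MRG algorithm itself is not spelled out here, so the following is only a plausible sketch of a greedy, gap-minimizing placement rule under an assumed two-resource server model (the names, the server representation, and the scoring rule are all assumptions):

```python
def resource_gap(used, capacity):
    # Spread between the most- and least-utilized resource dimensions on
    # one server; a large spread means stranded, unusable capacity.
    utilization = [u / c for u, c in zip(used, capacity)]
    return max(utilization) - min(utilization)

def place_task(servers, demand):
    """Pick the server whose resource gap is smallest after placement.

    servers: list of dicts with 'used' and 'capacity' vectors
    demand:  per-resource requirements of the task, e.g. (cpu, mem)
    Returns the chosen server (updated in place), or None if nothing fits.
    """
    best, best_gap = None, None
    for server in servers:
        new_used = [u + d for u, d in zip(server["used"], demand)]
        if any(u > c for u, c in zip(new_used, server["capacity"])):
            continue  # task does not fit on this server
        gap = resource_gap(new_used, server["capacity"])
        if best_gap is None or gap < best_gap:
            best, best_gap = server, gap
    if best is not None:
        best["used"] = [u + d for u, d in zip(best["used"], demand)]
    return best
```

Placing each task where the post-placement utilization spread is smallest keeps CPU and memory consumption balanced, which is the intuition behind reducing the resource gap.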
II. MOTIVATION
The motivation for streamlining Hadoop cluster performance, reducing data processing time, and limiting resource wastage comes from the imperative to improve operational efficiency. Users aim to speed up data analysis, lower processing costs, and use resources effectively, guaranteeing timely insights and improved overall performance when handling large datasets.
III. LITERATURE SURVEY
A literature survey on algorithms and techniques to optimize the performance of Hadoop clusters, reduce data processing time, and minimize resource wastage involves reviewing relevant research papers, articles, and publications in the field of big data and distributed computing. The key topics and areas of focus for such a survey are outlined below:
Hadoop Architecture and Basics:
Understand the foundational concepts of Hadoop, including the Hadoop Distributed File System (HDFS) and MapReduce. Explore key components of Hadoop, such as NameNode, DataNode, and TaskTracker.
Challenges in Hadoop Performance:
Identify common challenges faced in Hadoop clusters,
including issues related to data locality task scheduling, and
resource utilization.
Task Scheduling Algorithms:
Investigate various task scheduling algorithms used in Hadoop to efficiently allocate resources and reduce processing time. Examples include the Fair Scheduler, Capacity Scheduler, and Delay Scheduling.
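As one concrete example, the delay-scheduling idea (pass up a bounded number of scheduling opportunities in the hope that a data-local slot frees up) can be sketched as follows; the job model, return values, and threshold are assumptions for illustration, not Hadoop's actual scheduler interface:

```python
def schedule(job, free_slot_node, max_skips):
    """Delay scheduling: prefer nodes that hold the job's input data.

    job: dict with 'local_nodes' (set of nodes storing its input blocks)
         and 'skipped' (opportunities already passed up)
    Returns "launch-local", "launch-remote", or "skip".
    """
    if free_slot_node in job["local_nodes"]:
        job["skipped"] = 0
        return "launch-local"       # data-local slot: take it immediately
    if job["skipped"] < max_skips:
        job["skipped"] += 1
        return "skip"               # wait: a local slot may appear soon
    job["skipped"] = 0
    return "launch-remote"          # give up locality to avoid starvation
```

The `max_skips` bound is what keeps fairness: a job waits for locality only briefly before running remotely anyway.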
Data Locality Optimization:
Examine techniques for optimizing data locality to minimize data transfer across the network. Explore solutions like speculative execution and data replication strategies.
Parallel Processing and Computation Models:
Review research on parallel processing models and algorithms to enhance parallelism in Hadoop clusters. Investigate alternative computation models beyond MapReduce, such as Spark.
Resource Management and Optimization:
Explore resource management techniques and algorithms to
efficiently allocate and utilize cluster resources,
Investigate solutions like YARN (Yet Another Resource
Negotiator) for resource management.
Dynamic Resource Allocation:
Study dynamic resource allocation strategies that adjust
resources based on workload fluctuations.
Explore auto-scaling mechanisms and dynamic resource
provisioning.
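A minimal threshold-based auto-scaling rule of the kind surveyed here might look like this; the node counts and utilization thresholds are made-up illustrative defaults, not values from any cited system:

```python
def autoscale(current_nodes, cpu_utilization, min_nodes=2, max_nodes=20,
              scale_up_at=0.80, scale_down_at=0.30):
    # Simple threshold rule: add a node under sustained high CPU load,
    # remove one when the cluster is mostly idle, otherwise hold steady.
    if cpu_utilization > scale_up_at and current_nodes < max_nodes:
        return current_nodes + 1
    if cpu_utilization < scale_down_at and current_nodes > min_nodes:
        return current_nodes - 1
    return current_nodes
```

Production auto-scalers add hysteresis and cool-down periods so the cluster does not oscillate around a threshold; this sketch omits both for brevity.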
Distributed Caching:
Investigate the use of distributed caching to minimize
redundant computations and improve data access times.
Compression Techniques:
Explore compression algorithms and techniques to reduce storage requirements and speed up data processing.
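The storage/CPU trade-off behind such techniques can be seen even with Python's standard gzip module on a synthetic, log-like payload (the sample data is made up; a Hadoop cluster would configure a codec such as Snappy or a splittable block format instead):

```python
import gzip

# Repetitive text, typical of machine-generated logs, compresses very well.
payload = b"status=OK user=42 latency_ms=7\n" * 1000

compressed = gzip.compress(payload)

# The compressed form is far smaller, and decompression restores the
# exact original bytes, so no information is lost in storage.
assert len(compressed) < len(payload)
assert gzip.decompress(compressed) == payload
```

Smaller on-disk blocks mean less data read per task, which is where the processing speed-up comes from, at the cost of CPU cycles spent decompressing.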
Machine Learning for Optimization:
Investigate the application of machine learning algorithms for predicting resource usage and optimizing cluster performance.
Fault Tolerance and Reliability:
Review methods for enhancing fault tolerance and reliability in Hadoop clusters, such as checkpointing and recovery mechanisms.
Benchmarking and Performance Evaluation:
Examine research on benchmarking methodologies and performance evaluation metrics for assessing the effectiveness of optimization techniques.
A study on existing techniques in big data analytics with multiple clusters for heterogeneous data management:
[1] Since there is no established distance between the two classification data centers, data stored in digital format is taken into account while calculating distance measurements. [2] Takes into account the series of data-intensive applications running on the Hadoop cluster; the data residency scheme automatically maintains a balance between the total amount of data stored in every node to achieve improved data processing performance. [3] Takes into account an approach called enhanced dynamic slot allocation for Hadoop, which keeps the features of the space-based model. [4] Offered a variety of solutions to address the different issues raised by the Distributed File System on Hadoop's Mapper- and Reducer-based framework. [5] Discovered a useful method for obtaining and storing unstructured data, and created a sizable data application that retrieves publicly accessible data from Twitter, stores it in a cluster, and then uses the state-transfer method to access data obtained from HBase for analysis. The cluster centers of the categorization feature are represented using a value function approach in reference [6], which takes the mode values into consideration. As a result, the cluster core consists of just one feature, and inaccurate clustering stems from erroneous value allocation. By modifying the cost function, reference [7] presented the global cost function of the grouping dataset. In [8], the covariance probability approach takes into account the cluster setup properties when dividing the data points. [9] Has talked about the strategy of identifying the top data item and using a new distance calculation to name a cluster core; we employ the entropy derived from the alternate Manhattan distance. The purpose of reference [10] is to compare WordCount's execution times under various scenarios: a solitary node is configured to carry out the MapReduce word count, and the study examines the impact of adjusting the number of reduce jobs and the file size on the execution time. Grid-based environments can aid with some advanced cluster handling [11], and the cluster can be improved by implementing effective fault tolerance techniques [12] and integrating an adaptive scheduling strategy.
IV. PROPOSED METHODOLOGY
The analysis of current methods and the review of the literature paint a picture of the majority of the work being done to increase big data processing speed or capacity. How well a processing environment can be used and monitored is one of the main topics omitted. The goal of the proposed research project is to use Hadoop clusters to model a high-performance environment. High-performance environments generally demand supercomputers or highly engineered systems to support them. The goal here is to set up a robust computing environment using a cluster of minimally configured machines. Since all of the machines have nominal processing capability, making effective use of them and keeping an eye on them is crucial.
The research environment is shaped by a distributed platform consisting of several clusters. Hadoop clusters, each having a DataNode, NameNode, JobTracker, and secondary NameNode, form the foundation of the distributed system. This raises a requirement: how can performance be optimized in a distributed setting with several data nodes? In a distributed system, there is typically a high probability of performance degradation. To stabilize the environment, a node monitor is required. The second goal of the proposed research is accomplished with the aid of ZooKeeper, which serves as a watchdog over the znodes that are part of the Hadoop clusters. As a result, the Hadoop cluster is constantly checked for performance problems.
Fig 1: Architecture of Proposed Platform
Performance cannot be optimized or improved by the mere presence of a monitor; instead, a robust scheduling procedure that cooperates with the resource negotiator and job tracker is needed. YARN enters the scene at this point, and an algorithm is run on top of the YARN scheduler. The algorithm receives inputs such as the number of jobs inside a cluster, the time it takes for a job to execute, the number of jobs that are not completed, the number of Maps and Reduces, the energy of a cluster, and so on. The algorithm creates an optimization factor for each parameter and sets up the distributed cluster system with the monitor's assistance. The system's overall performance can be measured in terms of shorter execution and waiting periods as well as higher accuracy, performance, and storage.
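The formula behind the optimization factor is not given, so the following is only a plausible sketch in which each input metric is normalized against a target value and the per-parameter factors are combined into one score (all names, targets, and weights are assumptions):

```python
def optimization_factor(metrics, targets, weights):
    """Combine normalized cluster metrics into a single score.

    metrics: observed values, e.g. {"jobs": 12, "avg_runtime_s": 300}
    targets: desired values for the same keys
    weights: relative importance of each metric
    Lower is better: 0 means every metric is at or under its target.
    """
    total_weight = sum(weights.values())
    score = 0.0
    for key, observed in metrics.items():
        # Per-parameter factor: how far the metric overshoots its target.
        factor = max(0.0, observed / targets[key] - 1.0)
        score += weights[key] * factor
    return score / total_weight
```

A scheduler could recompute this score on each monitoring interval and prioritize reconfiguration of whichever cluster scores worst.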
Thus, going forward, the suggested platform may be referred to as optimized. The fundamental component of the proposed system's structural design, shown in Fig. 1, is a distributed computing environment, which essentially handles data processing. Through an interface designed for the cluster, the user submits jobs and data and provides inputs. Programming for the interface makes use of Java and Hadoop jar packages. The distributed cluster, which is the central component of the architecture, is configured with the fewest number of machines clustered together.
In the system, the Hadoop architecture is imposed. As a result, the cluster is made up of several nodes with a master-slave architecture, with a NameNode configured on the master. The slaves can then also receive new data nodes added to them. Once the Hadoop service is up and running, jobs can be carried out within the Hadoop cluster and managed by the Hadoop master. The crucial duties of a performance optimizer and monitor emerge after the Hadoop cluster is configured. As shown in Fig 1, ZooKeeper is responsible for monitoring, and the YARN manager is in charge of scheduling and performance optimization.
V. RESULT AND ANALYSIS WITH GRAPHS
An optimization-algorithm-based research idea is put into practice on a Hadoop cluster platform. The Hadoop platform allows researchers to run jobs at different loads with flexibility. As stated in the job queue description, jobs are assigned by formula. Here, it becomes necessary to assess the effectiveness of various tasks by contrasting the schedulers included in YARN (Capacity, Fair, and FIFO) in the appropriate way. Next, cluster load is computed using the running time of the work. Every time, the average cluster performance is noted and variations are recorded. The foundations for the performance metrics are cluster load, memory usage, and CPU utilization.
To carry out the necessary performance assessment and oversight, the research uses Ganglia; Ganglia monitors the cluster activities.
In order for the YARN algorithm to optimize the cluster, it is necessary to monitor the cluster performance. Every variation in CPU cycles, memory, and cluster load is tracked, and all of this information is used as input by the algorithm to determine the performance factor. The assigned tasks are further divided into progressive and non-progressive categories, and considerable resource optimization is carried out.
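The criterion for the progressive/non-progressive split is not specified; one plausible rule, based on whether a task's reported progress advances between two monitoring samples, could be sketched as follows (all names and the threshold are assumptions):

```python
def classify_tasks(progress_samples, min_delta=0.01):
    """Split tasks into progressive and non-progressive groups.

    progress_samples: {task_id: (previous_progress, current_progress)},
    with progress in [0, 1]. A task counts as progressive when its
    progress advanced by at least min_delta since the last sample.
    """
    progressive, stalled = [], []
    for task_id, (before, now) in progress_samples.items():
        if now - before >= min_delta:
            progressive.append(task_id)
        else:
            stalled.append(task_id)  # candidate for speculative re-execution
    return progressive, stalled
```

Stalled tasks are the natural targets for resource reclamation or speculative re-execution, which is where the optimization gain would come from.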
Fig 2: Cluster live dashboard
Following the introduction of the algorithm and performance monitor, there was a significant change in the system load. The cluster jobs are operating at a faster pace because of job profiling and performance factor calculations, which also reduce server load.
VI. CONCLUSIONS
Performance on the Hadoop cluster has increased significantly with the proposed method. The integrated use of Hadoop, ZooKeeper, and Ganglia allows the system to conduct multiple types of jobs simultaneously and efficiently. The resource manager's observations of load and resource use are the foundation for the recommended method and approach. A comparison of YARN schedulers for various work types has been conducted. The job profile and load optimization have resulted in a significant boost in the efficiency of the cluster. As a result, both CPU and memory consumption are declining for the recommended cluster while overall performance improves. Future experiments on the scalability and performance of the proposed system on a variety of tasks might be carried out in a heterogeneous distributed environment.

REFERENCES
[1] Dean J & Ghemawat S, "MapReduce: Simplified Data Processing on Large Clusters", Commun. ACM, Vol.51, No.1, (2008).
[2] Arasanal RM & Rumani DU, "Improving MapReduce performance through complexity and performance based data placement in heterogeneous Hadoop clusters", Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), (2013), pp.115-125.
[3] Deeptay, Ashok M, Subramanian K & Prabhu A, "Dynamic Slot Allocation for MapReduce Clusters", International Journal of Control Theory and Applications, (2017).
[4] Nandimath J, Banerjee E, Patil A, Kakade P, Vaidya S & Chaturvedi D, "Big data analysis using Apache Hadoop", IEEE 14th International Conference on Information Reuse & Integration (IRI), (2013), pp.700-703.
[5] Das TK & Kumar PM, "BIG Data Analytics: A Framework for Unstructured Data Analysis", Int. J. Eng. Sci. Technol., Vol.5, No.1, (2013), pp.152-156.
[6] Jain AK, Murty MN & Flynn PJ, "Data clustering: a review", ACM Comput. Surv., Vol.31, No.3, (1999), pp.264-323.
[7] Ahmad A & Dey L, "A k-mean clustering algorithm for mixed numeric and categorical data", Data Knowl. Eng., Vol.63, No.2, (2007), pp.503-527.
[8] Velmurugan T, "Evaluation of k-Medoid and Fuzzy C-Means clustering algorithms for clustering telecommunication data", Int. Conf. Emerg. Trends Sci. Eng. Technol., (2012), pp.115-120.
[9] Kim M & Ramakrishna RS, "Projected clustering for categorical datasets", Pattern Recognit. Lett., Vol.27, (2006), pp.1405-1417.
[10] Subramanian K & Prabhu A, "Simplified Data Analysis of Big Data in Map Reduce", (2017).
[11] Gokuldev S, Rao A & Karthik, "An EMT scheduling approach with optimum load balancing in computational grids", Int. J. Appl. Eng. Res., (2016), pp.5753-5757.
[12] Gokuldev S & Radhakrishnan R, "An adaptive job scheduling with efficient fault tolerance strategy in computational grid", Int. J. Technol., Vol.6, No.4, (2014), pp.1783-1786.