
BIG DATA MINING AND ANALYTICS
ISSN 2096-0654 06/10 pp. 465–477
Volume 6, Number 4, December 2023
DOI: 10.26599/BDMA.2022.9020026

Replication-Based Query Management for Resource Allocation Using Hadoop and MapReduce over Big Data

Ankit Kumar, Neeraj Varshney, Surbhi Bhatiya, and Kamred Udham Singh

Abstract: We live in an age where data are being created all around us, and data generation rates have become daunting, creating pressure to implement data storage and recovery processes that are both inexpensive and straightforward. The MapReduce model is used to create parallel, distributed algorithms that process large datasets across a cluster. Building on Hadoop's MapReduce strategy, originally developed by the non-commercial community, this work offers a new algorithm for resolving such problems in commercial applications, motivated by the disproportionate (skewed) results that a heterogeneous Hadoop cluster can produce. The expected results are obtained in this work, and the experiments conducted for it cover task scheduling, matching the data placement matrix, clustering before execution, and keeping mappings and internal dependencies close together to reduce running and execution times. The mapper output and its supporting components have been implemented, the map output has been fed into the reduce function, and the input key/value pair and output key/value pair of each execution have been set. This paper focuses on evaluating this technique for the efficient retrieval of large volumes of data. The technique provides the capabilities needed to query a massive database of information, from storage and indexing techniques to the distribution of queries, scalability, and performance in heterogeneous environments. The results show that the proposed work reduces the data processing time by 30%.

Key words: big data; hadoop; mapreduce; resource allocation; query management

Ankit Kumar and Neeraj Varshney are with the Department of Computer Engineering and Application, GLA University, Mathura 281406, India. E-mail: [email protected]; [email protected].
Surbhi Bhatiya is with the College of Computer Sciences and Information Technology, King Faisal University, Hofuf 31982, Saudi Arabia. E-mail: [email protected].
Kamred Udham Singh is with the School of Computing, Graphic Era Hill University, Dehradun 248002, India. E-mail: [email protected].
* To whom correspondence should be addressed.
Manuscript received: 2022-04-26; revised: 2022-07-07; accepted: 2022-07-14.
© The author(s) 2023. The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).

1 Introduction

The global population exceeds 7.2 billion, and 2 billion people have Internet connectivity. According to McKinsey, 5 billion people use different mobile devices. With the revolution of this technology, millions of people are generating huge amounts of information through the increased use of such devices. In particular, remote sensors consistently provide a huge volume of information that is neither structured nor uniform. Such information is known as big data[1].

Three aspects characterize big data:
(1) The data cannot be categorized into regular relational databases.
(2) The data are numerous.
(3) The data are generated, captured, and processed very quickly.

Big data is a growing sector of the IT industry that focuses on business applications, and it has sparked widespread interest in a number of fields. Applications can be found in medical treatment, financial transactions, social media, and satellite imagery[2]. Traditionally, the largest amount of data is possibly used to store structural information.
However, unstructured and semi-structured data are now driving data quantities. Because relational systems are used to manage databases for analyses, the translation of data stored in an unorganized and undesirable format may have an impact on intermediate processing.

Examples of the rapid growth of data rates include:

YouTube:
(1) Users upload 100 h of fresh videos each minute.
(2) Monthly, over 1 billion distinct users access YouTube.
(3) Six million hours of video are watched each calendar month, equating to nearly an hour.

Facebook:
(1) Every minute, 34 722 likes are registered.
(2) One hundred terabytes (TB) of information are uploaded every day.
(3) The website has 1.4 billion users. Moreover, the site has been translated into 70 languages.

Twitter:
(1) The site has more than 645 million users.
(2) The website creates 175 million tweets daily.

Google:
The website gets more than 2 million search queries per minute, and 25 petabytes (PB) of data are processed daily.

Four-square:
(1) This website is used by 4–5 million individuals globally.
(2) This website gets more than 5 billion check-ins daily.
(3) Every minute, 571 new websites are created.

Apple:
Approximately 47 000 applications are downloaded each second.

Tumblr:
Blog owners publish 27 000 new posts per second.

Brands:
More than 34 000 likes are registered per second.

Flickr:
Users upload 3125 brand-new photographs per minute.

Instagram:
Users share 40 million photographs daily.

WordPress:
Close to 350 sites are published each second.

LinkedIn:
2.1 million groups have been created.

In 2020, 73.5 zettabytes of data were created, compared with 2.52 zettabytes in 2010, and worldwide data output is estimated to have increased by 4300% by 2022. Data are exploding over the Internet today, and developers and Internet users can easily discuss the importance of data for their projects and applications. The Internet has become an enormous place where one can easily find everything from a needle to human genome reports. The Internet is a complex mesh of tools, frameworks, applications, algorithms, and hardware commodities for storing, processing, and managing data worldwide. A single data server or application is adequate for handling data generated by a single user. Choosing which applications to use or how to manage data is an ongoing research problem. Today, the Internet is inextricably linked to data, but the more general term for this type of massive data would be big data. In this scenario, another question is how such data are generated[3, 4]. Big data frameworks, applications, algorithms, and hardware commodities allow people all over the world to store, process, and manage data.

1.1 Hadoop ecosystem

Hadoop was built to scale processing from a single node to thousands of nodes, where each node or machine can work as a local computation or storage drive. The framework of Hadoop has four significant modules, which divide the functionality of Hadoop into distinct units. The four most important modules in Hadoop[5] are as follows:
• Hadoop Common (standard utilities)
• Hadoop MapReduce
• Hadoop Distributed File System (HDFS)
• Hadoop YARN

Other common software, such as Hive, Pig, Scala, Sqoop, and CDH, facilitates data input/output (I/O) or querying into Hadoop. MapReduce is used to process data on Hadoop or HDFS, but most of the functioning is related to coding in Java, Ruby, Perl, or Python. Most developers prefer querying languages like SQL to talk to a system. Hive and Pig, both open-source implementations, are used to facilitate the interaction between Hadoop and developers who know querying languages, as shown in Fig. 1[6].

Fig. 1 Hadoop framework architecture.

The Hive interpreter converts a query into MapReduce code and interacts with the HDFS. The Pig interpreter turns a simple scripting language into MapReduce jobs that then communicate with the HDFS. Pig and Hive work very well, but the overall gist of this system is that it takes a lot of time to convert queries into MapReduce form and then interact with Hadoop. This could incur substantial cost losses when performed for significant amounts of data. Another creation was Impala, designed to interact directly with HDFS by converting SQL queries into HDFS format. Impala is optimized for low-latency queries, a crucial requirement in Hadoop and big data scenarios. Typically, an Impala query runs much faster than a Hive or Pig query because of its low-latency structure and direct interaction with the HDFS[7].
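As a concrete illustration of the coding effort that Hive and Pig hide, the sketch below shows the classic word-count job written as a Hadoop Streaming mapper and reducer in Python, one of the scripting languages mentioned above. This is a generic, hedged example rather than code from this paper; the file names mapper.py and reducer.py are placeholders, and such a pair would normally be submitted through the Hadoop Streaming jar.

```python
#!/usr/bin/env python3
# mapper.py (illustrative sketch, not code from this paper):
# read raw text from standard input and emit one tab-separated
# (key, value) pair per word, the form Hadoop Streaming expects.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

The matching reducer relies on the fact that Hadoop Streaming delivers the mapper output sorted by key, so equal keys arrive adjacently and can be summed in a single pass:

```python
#!/usr/bin/env python3
# reducer.py: sum the counts of each word from the sorted mapper output.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

A HiveQL statement such as a SELECT ... GROUP BY aggregation compiles down to essentially this map/shuffle/reduce pattern, which is why Hive and Pig queries pay the translation overhead discussed above.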
1.2 Homogeneous data clusters in Hadoop

The clustering in Hadoop systems is essential for planned research works. Similar clusters in a group make a collection of the same type of machine for order processing and warehouse data collection in the HDFS. The hardware similarity in homogeneous clusters makes the job processing and data storage mechanisms easy to implement and deploy. Hadoop and big data[8] work well in a similar cluster environment.

1.3 Working in Hadoop

Hadoop has been winning users' and developers' praise. Its framework allows users to rapidly write and test a distributed system. Hadoop's MapReduce framework is capable of running applications in a reliable and fault-tolerant mode and can handle a significant amount of data on large parallel clusters of commodity hardware. Hadoop MapReduce is a program that performs two basic tasks: a map task and a reduce task.

1.4 MapReduce

Serially processing a job takes longer than necessary, which can cause delays in other system functions, such as I/O, querying, or data retrieval. As a result, MapReduce is designed to process data in parallel. The MapReduce framework is in charge of task scheduling, monitoring, and re-running failed tasks[9]. It has a master-slave hierarchy: the master node is labeled the "JobTracker" and the slave nodes the "TaskTrackers". Each cluster has a single JobTracker and many TaskTrackers, with the JobTracker being primarily responsible for resource management, job and task planning on slaves, and resource availability/consumption tracking. The JobTracker also tracks job components and tasks, monitors them, and re-executes failed tasks. TaskTrackers perform the tasks assigned to them and regularly report task status information, following the TaskTracker design. Mappers and Reducers work on the processing, as explained below. A mapper node receives a large chunk of data, and its output is assigned to a reducer.
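The re-execution behavior described above can be made concrete with a small simulation. The sketch below is not Hadoop code; it only mimics, under assumed names (run_job, a random failure_rate), how a JobTracker-style master hands map tasks to TaskTracker-style nodes and simply reschedules any task whose node fails.

```python
# Minimal simulation of master/slave task scheduling with re-execution of
# failed tasks (illustrative only; names and failure model are assumptions).
import random

def run_job(tasks, nodes, failure_rate=0.2):
    pending = list(tasks)        # map tasks that still need a successful run
    completed = {}
    while pending:
        task = pending.pop(0)
        node = random.choice(nodes)          # naive slave selection
        if random.random() < failure_rate:   # simulated TaskTracker failure
            print(f"{task} failed on {node}, rescheduling from scratch")
            pending.append(task)             # naive recovery: re-execute fully
        else:
            completed[task] = node
            print(f"{task} finished on {node}")
    return completed

if __name__ == "__main__":
    random.seed(7)
    run_job([f"map-{i}" for i in range(5)], ["node-A", "node-B", "node-C"])
```

Re-executing a failed task from scratch is exactly the cost that the checkpoint-based rescheduling of Section 3.2 tries to avoid.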
2 Literature Review

In this section, we review the state-of-the-art big data analysis methods.

2.1 Big data processing

Malik et al.[10] revealed the crucial importance of big data in today's world. As far as applications are concerned, big data have a lot of usages but pose challenges. They have used the concept of 4 Vs to describe big data, including veracity. They have addressed some fundamental problems that need to be catered to so that problems can be converted into opportunities. They have reported advances, research on big data, and six significant challenges encountered during the processing of unstructured data. They have also proposed the research problems still open in the big data community.

In addition to the scientific issues discussed in detail, the authors have mentioned open problems that arise in the big data scenario. The high dimensionality of big data, subsampling of data in Hadoop using MapReduce, computational complexity of a dataset, distribution issues in real computing, the problem of processing non-structured data, and visualization of datasets to improve efficiency are some of the open challenges that the scientific and development community is grappling with.
Cloud computing applications used today involve processing a significant amount of data. Such data could be in different formats, which require considerable processing. As mentioned by Yao et al.[11], the data processing rate is growing exponentially, but the same is not true for computing power. New strategies are much awaited, either an improvement of reliable storage methods that remain inexpensive or the development of new tools for analyzing structured and unstructured data. The focus is that different file systems have issues in data migration. On a heterogeneous cluster in Hadoop, the load-sharing task is executed to improve processing. The jobs are processed with a fast execution time on a high-speed end node, but the data are processed much more slowly on a low-speed end node. The load-sharing mechanism enables the high-speed end node to relieve the low-speed end node, but when the volume of data to be processed is high, the load-sharing mechanism is bound to fail. The overhead of the data movements of unprocessed data between the slow-speed end node and high-speed end node affects the performance of Hadoop in different clusters.

Alshammari et al.[12] presented another critical issue of heterogeneous Hadoop clusters, which is the namespace limitations offered by the systems. The namespace storage is located in RAM, and as the data are stored in file system metadata formats, the namespace assumes names similar to the data nodes. The problem arises with sharing name nodes and maintaining metadata over the name nodes in a Hadoop environment for a vast number of files. An individual namespace server has its limitations in handling and storing data. The surveys conducted reveal that only 200 bytes are needed on a name node server for a single object file. When this is multiplied with the other data blocks residing on the same server, we obtain a vast amount of metadata to store. This is one of the critical factors affecting the performance of name node servers in Hadoop clusters. The name node needs RAM to process requested queries and fetch I/O files, among others. When the space is entirely consumed by the name node's metadata only, one can easily understand the overheads of processing the namespace in name nodes. The whole cluster becomes unresponsive when the name node is busy processing internal job requests. Data-intensive job applications must compromise and compete for resource distribution on name nodes. The data transfer requested by client nodes will have to suffer because each request is first processed on the name node, where internal metadata are managed and asked by the client node. These issues are brought to light by the authors, who convey a great insight into how a simple name node and cluster configuration can affect the proceedings of Hadoop in cloud computing.

2.2 Hadoop and MapReduce

Ullah et al.[13] raised concerns about the challenges that are posed by supercomputing communities. They brought forward issues that arise when data are encountered in petabytes or zettabytes. The vast amount of data created by numerical simulations as the result of supercomputing operations is the key area to focus on. The parallelism in big datasets could be viewed from three different aspects, namely, data applications that involve a high rate of inbound and outbound movements, server applications that involve a network with high bandwidth to process all queries, and scientific computing applications that need high processing speed and memory performance to decode complex phenomena. Data parallelism, if appropriately achieved, could benefit any organization at a significant level. The current trends in the IT industry have proven its high profitability because of the data industry.

In the research by Parmar et al.[14], virtualization is an excellent way to harness the capabilities of Hadoop and MapReduce on data clusters[8]. Data-intensive jobs on nodes could be significantly improved by incorporating virtualization as a green IT factor. Resource consumption could be reduced, saving a lot of energy. The work describes the challenges met in the virtualization environment for Hadoop and MapReduce tasks. It also describes schemes to overcome and manage them, so virtualization is proven as a practical solution for the big data crisis. The authors also proposed a comprehensive rating scheme that will check the performance of clusters in their respective environments. This scheme aims to extend YARN for configuring Hadoop and MapReduce. The experiments conducted under this niche reveal that performance improves through an integrated performance model. The research has a wide scope, which includes automating adjustments needed in Hadoop configurations. The experimental setup consists of multiple nodes, and different datasets were deployed on them to test the analysis. The nodes were divided into different cluster sizes, and TestDFSIO was used. The tasks depict a high rate of performance and parallelism. They have mapped the tasks onto different processes to test the reduction in similarity.

Kumar and Singh[15] mentioned that the distinguished, profiled critical execution of queries has equitability issues. For the existing MapReduce-based information warehousing framework, prediction-based systems are suggested, and a query planning framework was built, which bridges the semantic gap that exists between query front ends and MapReduce runtimes. Furthermore, the query compiler enables productive query planning for quick and reasonable large-scale information analytics. The author performed experiments to overcome the drawbacks of design and improve the performance of large-scale clusters. A cross-layer scheduling framework was used so that Hive queries can be percolated and semantically extracted, and a multivariate execution time prediction and two-level query scheduling were performed[1].
2.3 Query performance

Huang et al.[16] mentioned that good query performance prediction is nearly impossible with old query performance models. For effective query optimization, resource management, and user experience management, the execution latency used by common analytical cost models is a poor predictor for comparing candidate plans' costs. Query performance prediction (QPP) seems a practical approach for the high success of predictive tasks that can generate good results within new performance models. The work supports the idea that prediction can be studied from different perspectives, such as analyzing the granularity at operator levels. The proposed predictive modeling technique manifests the differences in the query execution behavior at varying granularity levels to understand the need and usage of coarse-grained or fine-grained operations.

The research performed by Soualhia et al.[17] specifies the use of similar I/O techniques to manage scientific applications[10]. They incorporated a data sieving and two-phase I/O technique to optimize the I/O throughput. The proposed work emphasizes offering parallel access to shared datasets for heavy workloads. They also used cross-layer optimizations to improve the I/O throughput for scientific applications, such as GEOS-5. Although the proposed work is implemented for scientific applications, it presents a broad idea of how datasets of dynamic nature behave. It also points out the significance of real-time changes in queries, which is a significant issue in Hadoop, MapReduce, and big data scenarios. The I/O performance of computing can be easily categorized into different levels based on file systems and various needs.

Analyzing data that rapidly grow and change has driven many new technologies and frameworks to involve predictions and new learning methods. The backbone of these technologies is algorithms that help categorize, assess, and analyze essential data from meaningless information. For example, in an online store's log, the time spent on a product is useful information, whereas surfing random products is not that useful. To understand such vital details from petabytes of data, algorithms need to be smart and efficient. Here, clustering algorithms have become a powerful resource to manifest essential data from clutter. Tao et al.[18] proposed comparative studies of existing algorithms used for big data classification of datasets. The survey conducted in this work classifies the functionality of candidate algorithms primarily used in big data regarding theoretical and empirical analyses. The parameters taken for considering the gross efficiency of an algorithm involve many internal and external factors. They are performance metrics, runtime, scalability tests, and stability. They also described the best possible clustering algorithm used in big data.

Tao et al.[18] supported the fact that manifesting digitized data can be a key to solving many problems posed by the Internet. Network traffic patterns and social media data, among others, could be beneficial for corporations and people. However, the processing of such large-sized data also comes with deficiencies. It requires a significant amount of hardware and processing power, and going through all these data can be time-consuming and challenging to analyze. Clustering techniques are used to manage all these problems and provide solutions to efficiently store and analyze information. Clustering algorithms for big data have to consider the volume, velocity, and variety for a well-thought analysis. Therefore, the 3 Vs are the core of big data characteristics and the key to designing a clustering algorithm. The problem that arises in algorithm clustering is deciding what type of algorithm to use. The choice of algorithm for different datasets differs as every dataset has its ups and downs.
Qin et al.[19] outrightly discussed their ideas on parallel data processing in heterogeneous clusters. They recommended that Hadoop is in dire need of mechanisms to support scalability and improvement in performance. They evaluated methods to consider the view of storage and techniques for indexing and other processes related to queries, such as distribution, performance, parallelization, and scalability, in a heterogeneous environment. The authors proposed designing Hadoop's architecture so that various data nodes can be scaled up to optimize Hadoop clusters. The test results reveal that the replication factor of data nodes can experience node failures in adverse scenarios. They claim that although small data jobs can be handled by other methods, MapReduce is superior when dealing with large datasets. To address the challenges posed by data-intensive jobs, the cost of hardware computing and mixing software was set. The authors clearly stated that the data size has increased in recent years from a few gigabytes to petabytes. This increase in data generation is due to the increased use of content management systems, social media, logs, and other similar tools. This volume of data necessitates the use of more than a couple of machines to manage it for all Internet users. As a result, handling these generated data across heterogeneous clusters is critical for harnessing the power of Hadoop and the big data platform. The HDFS is an excellent environment for storing and maintaining data for a large number of users. The main issue is saving such data and keeping them secure. They must also be processed in acceptable time frames with efficiency and speed. The authors proposed a new design model to meet future system requirements and help organizations find the right partner for large data storage and processing.

2.4 Distributed processing and MapReduce

Wang et al.[20] stated that MapReduce is essential for the distributed processing of large-scale applications. Hadoop is used in data-intensive jobs, such as web indexing and data mining, because it completes tasks with slow response times. It makes them very widely available, but a problem with data placement is often encountered. Data placement on data nodes becomes a disarrayed job in heterogeneous Hadoop clusters. Hence, there must be a good way to assign data to nodes so that high-end nodes get jobs that require fast processing and slow-end nodes get appropriate time and jobs to process based on their capabilities. The authors devised methods for implementing data placement in heterogeneous clusters so that nodes are viable enough based on their capacity. Virtualized data centers face challenges because data locality is not taken into account for speculative map tasks. Because most maps are assumed to be stored locally, they are insufficient for virtualized centers. By specifying a design constraint on data nodes for data placement in Hadoop's heterogeneous clusters, the performance is greatly improved. The performance of data-intensive jobs could be greatly improved using various data placement techniques. Rebalancing data nodes as needed to perform the functions of a data-intensive job can improve performance, reduce latency, and provide a stable system. Heterogeneous Hadoop clusters can be used for data-intensive jobs by managing data placement locally and smartly adapting to the environment so that MapReduce jobs can be executed quickly and efficiently.

MapReduce algorithms have a wide range of applications, as mentioned repeatedly. They have emerged and continued to do so for the open-source implementations of Hadoop for millions of users handled at once[21]. The performance of Hadoop clusters is closely related to the short response times as processed by MapReduce[11]. In homogeneous clusters of Hadoop, tasks take a linear progression model, but this case is not the same for heterogeneous clusters. The authors proposed a new algorithm, i.e., Longest Approximate Time to End, to improve the performance degradation of heterogeneous clusters due to impromptu task scheduling. The algorithm in the work has improved the performance twice on a large cluster. The response times of heterogeneous clusters can be enhanced in such a manner by proposing a scheduler algorithm to identify tasks and manage them accordingly on data nodes. The benefits of Hadoop and MapReduce are that programmers need not worry about faults occurring at the backend. MapReduce works all right if a node crashes but does not let the task run in disarray. The proposed scheduler works to maximize performance and improve robustness.

2.5 Classification approaches

In the research by Li et al.[22], the classification of electronic documents was used to analyze applications. The type of classification used here is naïve Bayes because it is simple, efficient, and effective for significant data documents. Compared to some statistical methods, it has certain limitations, but it can help categorize long documents into more straightforward sets. The authors proposed a parallelized semi-lazy naïve Bayes approach to speed up the processing to a very high rate. Automatic Document Classification (ADC) is the key component used by the authors to experimentally conduct real case scenarios that help understand the nature and demeanor of datasets.
The ADC methods incorporate artificial intelligence in their initial stages, where the set of documents is used to learn and train. Once a sufficient data learning size is reached, they are used to design techniques to categorize related neighbors. They presented a good insight into how data should be classified by first learning from datasets and then applying the same to those. They somewhat resemble nearest-neighbor procedures and, in a sense, Bayesian models. The method decomposes the clustering problem into a hierarchical model where the hierarchy is divided into parent and child. However, there is a super parent and favorite child who are dependent on each other. A validation set is used to decide the heuristics and find the correct prediction for measuring the details. The metric selection depends on the validation set, which enables the building of the network and deciding on the super parent and favorite child set. This method, commonly called "SP-TAN", was popular in small datasets. The datasets used in ADC are not relevant when used in combination with SP-TAN. The dependencies in a large dataset are cumbersome to manage, and the identification process becomes tedious. When calculated, the computational costs determine that the performance cannot be overloaded and contented in the case of TAN or SP-TAN functioning, leading to a significant increase in the computation time and thus losing performance in the category of large datasets.

Wang and Cao[23] also supported the use of naïve Bayes in their clinical experiments to understand the neuroimaging framework of humans. The research involves different stimulus phases, but the data collection and classification presented a highly intelligent insight into how data classification can be managed if performed correctly. Every response gathered as a stimulus provides a relatable experimental platform to determine the right heuristic technique for the data classification of straightforward tasks with complex environments.
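To make the classification setting of this subsection concrete, the sketch below trains a plain multinomial naïve Bayes text classifier with scikit-learn. It is only a single-machine illustration over an assumed toy corpus; the cited works use their own parallelized, semi-lazy variants rather than this off-the-shelf model.

```python
# Minimal naive Bayes document classification sketch (illustrative only).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_docs = ["hadoop cluster scheduling", "mapreduce job on hdfs",
              "patient blood pressure record", "clinical trial outcome"]
train_labels = ["bigdata", "bigdata", "medical", "medical"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_docs)      # bag-of-words term counts
model = MultinomialNB().fit(X, train_labels)  # learn per-class word likelihoods

test = vectorizer.transform(["yarn resource scheduling on a hadoop cluster"])
print(model.predict(test))                    # -> ['bigdata']
```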
2.6 Text clustering

Qureshi et al.[24] followed the model framework of Hadoop's MapReduce and calculations based on the Term Frequency-Inverse Document Frequency (TF-IDF) and its cosine similarity. The text clustering on which the method is applied is usable in many applications that are data-intensive. The texts are first collected and TF-IDF is applied. Then, a cosine similarity matrix is used to cluster them into available components. The method delivers efficient and faster results compared to other Hadoop MapReduce framework algorithms.

Clustering done pairwise has many benefits, as pointed out by the authors. It can be used to cluster documents/text quickly and arrange subsequent processing into easy takeover turns. The phases in the Cosine Similarity with MapReduce algorithm include building a vector model of the data, then calculating the TF-IDF, and producing the cosine angle for similarity in matrices of pairwise texts.
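For readers unfamiliar with the pipeline just summarized, the sketch below computes TF-IDF vectors and a pairwise cosine-similarity matrix for a handful of invented documents on a single machine. It is an illustrative sketch of the same computation that the cited work distributes with MapReduce, not their implementation.

```python
# TF-IDF vectors and pairwise cosine similarity (single-machine sketch).
import math
from collections import Counter

docs = ["big data storage on hadoop",
        "hadoop mapreduce query scheduling",
        "cosine similarity for text clustering"]

tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
df = Counter(w for doc in tokenized for w in set(doc))   # document frequency
N = len(docs)

def tfidf(doc):
    tf = Counter(doc)
    return [tf[w] / len(doc) * math.log(N / df[w]) for w in vocab]

vectors = [tfidf(doc) for doc in tokenized]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Pairwise similarity matrix used to group similar documents.
for i in range(N):
    for j in range(i + 1, N):
        print(i, j, round(cosine(vectors[i], vectors[j]), 3))
```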
3 Proposed Algorithm

This section contains the proposed work and model and discusses the parameters used to compare the results.

3.1 MapReduce programming model

To reduce the programming model during mapping, the accounting method is divided into two main stages, namely, the duration of the mapping process and the level of rearrangement. After the size expires, the level of rearrangement begins. Map efficiency is reduced using a reducer action and mapper activities. However, a mapper task will be split into an input signal, and the map pointers will be executed in parallel with pointers and carry the map's properties. Measurement activities will create several effects. Before the reducer can perform work, the machine automatically populates the sequence of the population and intercepts it with the concealed reducers[25]. Then, a reducer task will generate output files that create an output signal, perform a decreasing function, and be merged for results by reducing the event. Programmers need to compile the purpose of a mapper and reducer:

Map: (input text) → {(key_i, value_i) | i = 1, ..., n};
Reduce: (key, [value_1, ..., value_n]) → (key, final value).

Task and node failures are likely to occur within a MapReduce job implementation procedure. All mapper activities will proceed on available nodes whenever a node fails. This type of system, i.e., rescheduling, is easy but introduces a good deal of time price for customers with responsiveness prerequisites, and its effect is not acceptable.
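The contract spelled out above can be written in a few lines of ordinary Python. The sketch below is not Hadoop itself; it assumes a fixed number of reducers R and an in-memory shuffle, but it shows how map emits (key_i, value_i) pairs, how a partitioner routes each key to a reducer, and how reduce folds (key, [value_1, ..., value_n]) into (key, final value).

```python
# Tiny in-memory sketch of the map/partition/reduce contract (illustrative).
from collections import defaultdict

R = 2  # number of reducers assumed for this sketch

def map_fn(text):
    return [(word, 1) for word in text.split()]

def partition(key):
    return hash(key) % R              # which reducer receives this key

def reduce_fn(key, values):
    return key, sum(values)

def mini_mapreduce(chunks):
    shuffled = [defaultdict(list) for _ in range(R)]
    for chunk in chunks:                          # one mapper per input chunk
        for key, value in map_fn(chunk):
            shuffled[partition(key)][key].append(value)
    out = {}
    for reducer_input in shuffled:                # reducers run independently
        out.update(dict(reduce_fn(k, v) for k, v in reducer_input.items()))
    return out

print(mini_mapreduce(["big data big cluster", "data cluster data"]))
# {'big': 2, 'data': 3, 'cluster': 2}
```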
3.2 Improved rescheduling algorithm

In order to make changes to the framework, a different rescheduling technique is used, which helps to re-establish the system so that it has a high level of performance. Activity and node failures are taken into account, and delays are reduced. Two types of documents, i.e., checkpoint files and index files, are introduced, which are global documents created prior to implementing a single-sized task[26]. The checkpoint document is in charge of preserving the execution progress of the current work. Whether or not the job failed, it can be continued through the service node using the information in the checkpoint file. In addition to checkpoint documents, index documents record the intermediate results that have been produced; in the event of a failure, the index file can be used to recover them. The algorithm is divided into two parts: the functions of the controller node and those of the worker nodes[27]. Furthermore, because the controller node is critical, it is necessary to maintain a completely consistent "hot copy" of the two elements of the algorithm.

As soon as an activity failure does occur, the node reads the checkpoint file stored on the disk, re-establishes the task status from the checkpoint, and then reloads the results generated until the collapse. Therefore, re-execution is avoided.

After a failure occurs, the scheduler must schedule the job with the master node and redirect mapper activities to replicate nodes, as discussed in Algorithm 1. It may recreate the consequences of activities using the global indicator document to reduce the period. When nodes and tasks collapse after the checkpoint, the advancement will continue from the checkpoint. Of course, if saving the checkpoint is neglected, the advancement will soon begin from the previous checkpoint. This simple failover plan is cost-effective in a distributed computing environment. The frequency of saving the checkpoint must be chosen: a high frequency will most likely offer great cost savings in failover for running jobs, as discussed in Algorithm 2. In our experiment, the frequency value is assigned to a single checkpoint for every 105 pairs after fixing[29].

The actions on the node are rescheduled to replicate habitats in case a collapse occurs at the reducer point. If mapper activities are finished and the consequences have been duplicated to the replicate node, there is not any requirement to replicate the mapper activities over the node, upon which the completion time of the MapReduce task is significantly diminished.

Algorithm 1 Controller node to improve the performance of the Hadoop framework
Input: Number of Job 1, 2, ..., n
Output: Master node job schedule
Step 1: The controller node predicts that the measurement is performed with reducer actions in different worker documents.
Step 2: Choose the replicas for each job node.
Step 3: Check out the results of all staff documents.
When all the results are available, combine the results and indicate that the project is completed.
Otherwise, stay in Step 3 and wait.
Step 4: Slowly, all activation nodes send search packs.
1. If most staff respond NO, explore from Step 4 to Step 1 and then stay still.
2. If a node does not react in a given time interval, subsequently inform the node as indicated.
3. Employee ID and node will get almost incomplete tasks.
4. Establish all the perfect map operations on the global line and reschedule their renewal nodes.
5. You can determine the neglected activities of the failed node. After rectifying it, the group asks for re-established nodes that have intermediate facilities without renewing the monitor's functions.
6. Once all nodes are completed, insert additional unexpected activities into the node.

Algorithm 2 Slave node to process the data
Input: Different sizes of Job 1, 2, ..., n
Output: Job execution time
Step 1: Assess the kind of specified failed task.
(1) If it is indeed a sizable work, it assesses a new attempt or all the failed work to evaluate what is evaluated.
(a) If it is a new initiative, launch it, and implement it.
(b) If the area fails to rehabilitate, resume and implement its progress from the nearby checkpoint document.
(c) Suppose that it can actually be redistributed from the dirty failures and nicks other than the failed works. In this case, the global index document will examine the failure of the work and immediately rearrange the intermediate impact using the information inside the global index document.
(2) If it is indeed a reducer task, it is a new commitment, i.e., whether this neglect removes the failure of the work or determines the incredible work from various nodes.
(a) If it is a new initiative, it is initially implemented.
(b) If this only reinstates the neglected works from different nodes, browse the intermediate information from the local disk and then implement it.
(c) If it is performed unexpectedly from different nodes in a brand-new number, it will browse through the information provided in the document and implement it.
Step 2: Produce a localized checkpoint file and worldwide index document for any mapper task.
Step 3: Begin the mapper task.
(1) After the node's memory processing of the map, the intermediate information among the locals is empty. After the sink ends, place the input and map's location (position, then Map 1, where Map 1 is the key pair value) in the checkpoint file[28].
(2) Add a key-value pair corresponding to the positioning supply, which produces a key-quality pair, and two different strategies are distributed among the global index documents.
(a) Enter the key-quality pair output and list it permanently as (T1, offset) in the global index document. Hence, only the offsets will need to be processed during re-establishment. T1 usually means that it is an index record type.
(b) To insert a pair that indicates a result, list the opportunity as (Offset 1, Offset 2), so that input signals associated with Offset 1 and Offset 2 do not have an output signal and will be run again. T2 means that the fact is among the two sorted facts, where T2 represents Task 2.
Step 4: After a mapper action ends, repeat and deliver the intermediate results to the reducer nodes. Copy the information needed from the reducer's completed job to replicate nodes, inform the conclusion of the mapper endeavor to the perfect node, and then delete the indicator document and checkpoint file.
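The checkpoint idea that runs through Algorithms 1 and 2 can be illustrated with a small single-process sketch. Everything here is an assumption made for illustration: the file name, the checkpoint interval, and the process() stub stand in for the paper's per-task checkpoint and global index documents. The point is only that a re-launched attempt resumes from the recorded offset instead of re-executing the whole split.

```python
# Hedged sketch of checkpoint-based task recovery (illustrative only).
import json
import os

CHECKPOINT = "map_task_0.ckpt"   # assumed local checkpoint file for one task
INTERVAL = 1000                  # records processed between checkpoint writes

def process(record):
    pass                         # stand-in for the real map logic

def run_map_task(records):
    start = 0
    if os.path.exists(CHECKPOINT):          # re-launched after a failure
        with open(CHECKPOINT) as f:
            start = json.load(f)["offset"]  # resume where the last attempt stopped
    for offset in range(start, len(records)):
        process(records[offset])
        if (offset + 1) % INTERVAL == 0:    # periodically persist progress
            with open(CHECKPOINT, "w") as f:
                json.dump({"offset": offset + 1}, f)
    if os.path.exists(CHECKPOINT):          # task finished: discard the checkpoint,
        os.remove(CHECKPOINT)               # as Algorithm 2 does on completion

run_map_task([f"record-{i}" for i in range(2500)])
```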
3.3 Algorithm for query optimization

This algorithm stores the query in a query buffer and breaks down the query issued by the job scheduler in the Hadoop process. First, we sort all queries in the query buffer based on keywords. The user enters the keyword-based query condition in the keywords field. In addition, the user has the following options:
(1) Choice of "AND" or "OR" Boolean stringing of keywords;
(2) Specification of the number of resulting objects to be shown on each page;
(3) Selection of partial matching or exact matching of keywords;
(4) Choice of how much information is desired to be presented in the result.

The access path for a simple query is generated by utilizing information brokers who support keyword-based searching.

Query processing is currently composed of a five-step process:
Step 1: Source selection;
Step 2: For each source, the translation of the simple query form data into a valid URL address for the source;
Step 3: Parallel execution of each subquery against an individual source, translation of each subquery result into the consumer's preferred result representation, and merging of subquery results from different sources into the final result format;
Step 4: The query management results are produced by combining the search results from all the available data sources selected to answer the query issued by the job scheduler process.

Algorithm 3 Query optimization
Input: Keyword query Q = {k_1, k_2, ..., k_m}; list of related keywords k_i^1, k_i^2, ..., k_i^n, 1 ≤ i ≤ m
Output: Q_1, Q_2, ..., Q_k, where Q_i = (q_1^i, q_2^i, ..., q_m^i)
Step 1: Delete redundant related keywords
  {k'_1, k'_2, ..., k'_g} ← {k_1^1, k_1^2, ..., k_1^n} ∪ {k_2^1, k_2^2, ..., k_2^n} ∪ ... ∪ {k_m^1, k_m^2, ..., k_m^n} (g ≤ mn).
Step 2: Build a Viterbi model λ = (A, B, π)
  A_{g×g} = {a_ij | 1 ≤ i, j ≤ g}, a_ij = simi(k'_i, k'_j),
  B_{g×m} = {b_ij | 1 ≤ i ≤ g, 1 ≤ j ≤ m}, b_ij = simi(k'_i, k_j),
  π_i = π(k'_i) = simi(k'_i, k_1).
Step 3: Initialize
  // Variable δ_t(i) is the maximum probability over all paths whose state is i at time t.
  // Variable ψ_t(i) is the (t−1)-th node of the path with the maximum probability in state i at time t.
  δ_1(i) = π_i b_i(k_1), i = 1, 2, ..., g
  ψ_1(i) = 0, i = 1, 2, ..., g
  Loop
    For t = 2, 3, ..., m
      δ_t(i) = max_{1 ≤ j ≤ g} [δ_{t−1}(j) a_ji] b_i(k_t), i = 1, 2, ..., g
      ψ_t(i) = arg max_{1 ≤ j ≤ g} [δ_{t−1}(j) a_ji], i = 1, 2, ..., g
    Return δ_t(i) ∩ ψ_t(i)
  End of loop
End of query buffer
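Step 3 of Algorithm 3 is a standard Viterbi recursion over the δ/ψ variables defined above. The sketch below implements that recursion generically in Python; the small matrices A, B, and pi are invented placeholders, whereas in the algorithm they would be filled with the keyword similarities simi(·,·).

```python
# Generic Viterbi pass matching the delta/psi recursion of Algorithm 3.
def viterbi(A, B, pi, T):
    g = len(pi)                                   # number of related keywords
    delta = [pi[i] * B[i][0] for i in range(g)]   # delta_1(i)
    psi = [[0] * g]                               # psi_1(i) = 0
    for t in range(1, T):
        new_delta, new_psi = [], []
        for i in range(g):
            best_j = max(range(g), key=lambda j: delta[j] * A[j][i])
            new_psi.append(best_j)                              # psi_t(i)
            new_delta.append(delta[best_j] * A[best_j][i] * B[i][t])
        delta, psi = new_delta, psi + [new_psi]
    return delta, psi

A  = [[0.6, 0.4], [0.3, 0.7]]              # a_ij = simi(k'_i, k'_j)
B  = [[0.8, 0.2, 0.5], [0.1, 0.9, 0.4]]    # b_ij = simi(k'_i, k_j)
pi = [0.7, 0.3]                            # pi_i = simi(k'_i, k_1)
print(viterbi(A, B, pi, T=3))
```

In a standard Viterbi decoder the ψ table is then used to backtrack the highest-scoring keyword sequence.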

4 Result and Discussion

The following results have been established by applying the proposed methodology in the system. Heterogeneous Hadoop clusters were used to test the processing of big datasets, measuring the I/O throughput, the average rate of execution, standard deviations, and test execution time[30].

4.1 Comparison of productivity among Hadoop schedulers

A comparison among the most efficient schedulers used in Hadoop, i.e., FIFO, HFS, and HCS, was performed. The following graphs establish how one scheduler can overtake the other schedulers used in the HDFS and the internal mechanisms that share a file or job over several default systems to execute and store them reliably. Figure 2 shows a trade-off among the various schedulers, which may lead to the optimized performance of systems in the HDFS for clustering big data, as shown in Table 1, and the complete process is discussed in Algorithm 3. The response time was affected by the data loads, which were the logs of an e-commerce website, and by the time taken by the schedulers in mapping and reducing them to store them on the DFS and manually extracting and managing them. It also establishes that the amount of data is general in big datasets. Hadoop can optimize the processing instead of frequent data loading and unloading to obtain better insights about data, as shown in Fig. 3. This process also proves that once data are accepted by a node, the average time to consume and process them may vary depending on the type of scheduler being used[31].

Fig. 2 Comparison of different Hadoop schedulers.

Table 1 Comparative study of existing vs. proposed work.
Scheduling algorithm | Data size | Previous methodology execution time (ms) | Proposed methodology execution time (ms)
FIFO | 500 MB | 4000   | 3950
FIFO | 1 GB   | 7800   | 7333
FIFO | 2 GB   | 9900   | 9165
FIFO | 5 GB   | 14 500 | 11 130
FIFO | 10 GB  | –      | 12 780
HFS  | 500 MB | 3950   | 3925
HFS  | 1 GB   | 7200   | 7065
HFS  | 2 GB   | 9200   | 9085
HFS  | 5 GB   | 13 500 | 10 980
HFS  | 10 GB  | –      | 12 400
HCS  | 500 MB | 3900   | 3875
HCS  | 1 GB   | 7000   | 6998
HCS  | 2 GB   | 9000   | 8912
HCS  | 5 GB   | 12 000 | 10 260
HCS  | 10 GB  | –      | 12 610

Fig. 3 Comparative performance analysis of previous methodology vs. proposed methodology.

4.2 Comparative analysis of the Hadoop framework with different schedulers

In this section, we have compared the performance of FIFO, HCS, and HFC in Tables 2, 3, and 4, respectively. Figure 4 represents the comparative study of response time computation vs. data size using the FIFO algorithm. Figure 5 represents the comparative study of response time computation vs. different data sizes using the HCS algorithm for query processing. Figure 6 depicts the query processing time analysis of different data sizes using the HFC job scheduling algorithm.

Table 2 Comparative study of FIFO job scheduler in Hadoop without query improvisation.
Data size (MB) | Completion time (ms)
100  | 190
200  | 410
300  | 380
400  | 410
500  | 610
1500 | 2000
Note: Average performance ratio = 0.75.

Table 3 Comparative study of HCS in Hadoop without query.
Data size (MB) | Completion time (ms)
100  | 93
200  | 123
300  | 210
400  | 368
500  | 387
1500 | 1181
Note: Average performance ratio = 0.78.

Table 4 Comparative study of HCF in Hadoop without query.
Data size (MB) | Completion time (ms)
100  | 88
200  | 188
300  | 218
400  | 320
500  | 410
1500 | 1301
Note: Average performance ratio = 0.86.

Fig. 4 Comparative study of FIFO job scheduler in Hadoop without queries.
Fig. 5 Comparative study of HCS job scheduler in Hadoop without queries.
Fig. 6 Comparative study of HCF job scheduler in Hadoop without queries.

4.3 Performance evaluation
Only performance test results were evaluated with clustered nodes, and they are shown in Table 5. The default Hadoop configuration and the Hadoop contexts performed by the system were always implemented through the collection of information. The sizes of the data collections are 12 MB, 36 MB, 300 MB, 500 MB, 1 GB, 1.5 GB, 2 GB, and 10 GB[32, 33]. The default Hadoop defines the system's default settings, whereas the proposed strategy defines the processes we have created. Table 5 reveals the response timing of the proposed method and of processing the data in the default Hadoop. The proposed strategy[33, 34] implementation is far better than Hadoop's default settings. Table 5 shows the contrast between the implementation times of default Hadoop and the proposed system. When the data magnitude remains small, there is not any difference between the implementation times. However, the execution times of the proposed procedure are reduced.

Table 5 Comparative analysis of Hadoop framework for different files' sizes with their execution time.
File size (MB) | Execution time of default Hadoop | Execution time of proposed method
12.2   | 00:00:41 | 00:00:40
36.5   | 00:00:53 | 00:00:53
146.0  | 00:01:51 | 00:01:52
300.0  | 00:02:09 | 00:01:53
500.0  | 00:02:30 | 00:02:07
938.0  | 00:04:42 | 00:04:07
1500.0 | 00:06:07 | 00:05:03
2000.0 | 00:08:51 | 00:07:17

5 Conclusion

The work done under this research has produced notable results, including the ability to schedule tasks, handle the data placement inequality matrix, cluster before timing scrutiny, and reduce the frequency of repeated mapping and internal dependencies in order to prevent stalled runs. This experience demonstrates that by defining the process to deal with different usage cases, one can reduce the overall cost of computing and benefit from the use of distributed systems for fast splitting. When archiving or processing large amounts of data, developing solutions to compress data and search for relevant files in a compressed form could be beneficial. The methodology employed here is based on the practical process of sorting, iterating, managing queries, and finally producing the output. Assume that the data have been clustered by some powerful clustering algorithms. In this case, significant time and effort could be saved by improving the performance of Hadoop's heterogeneous clusters for MapReduce job processing. The major challenges encountered during this project were the incompatibility of systems, different nodes, and the security features provided by Linux and Hadoop. Working on dedicated servers and clusters could help achieve smoother experiences rather than dealing with compatibility issues at critical times. The underlying architecture and usage could be improved by working on higher versions of Hadoop to incorporate the best offerings and avoid being bogged down by minor issues, such as compatibility.

References

[1] M. S. Mahmud, J. Z. Huang, S. Salloum, T. Z. Emara, and K. Sadatdiynov, A survey of data partitioning and sampling methods to support big data analysis, Big Data Mining and Analytics, vol. 3, no. 2, pp. 85–101, 2020.
[2] M. D. Li, H. Z. Wang, and J. Z. Li, Mining conditional functional dependency rules on big data, Big Data Mining and Analytics, vol. 3, no. 1, pp. 68–84, 2020.
[3] S. Salloum, J. Z. Huang, and Y. L. He, Random sample partition: A distributed data model for big data analysis, IEEE Trans. Industr. Inform., vol. 15, no. 11, pp. 5846–5854, 2019.
[4] R. H. Lin, Z. Z. Ye, H. Wang, and B. D. Wu, Chronic diseases and health monitoring big data: A survey, IEEE Rev. Biomed. Eng., vol. 11, pp. 275–288, 2018.
[5] Y. N. Tang, H. X. Guo, T. T. Yuan, Q. Wu, X. Li, C. Wang, X. Gao, and J. Wu, OEHadoop: Accelerate Hadoop applications by co-designing Hadoop with data center network, IEEE Access, vol. 6, pp. 25849–25860, 2018.
[6] X. C. Hua, M. C. Huang, and P. Liu, Hadoop configuration tuning with ensemble modeling and metaheuristic optimization, IEEE Access, vol. 6, pp. 44161–44174, 2018.
[7] D. Z. Cheng, X. B. Zhou, P. Lama, M. K. Ji, and C. J. Jiang, Energy efficiency aware task assignment with DVFS in heterogeneous Hadoop clusters, IEEE Trans. Parallel Distrib. Syst., vol. 29, no. 1, pp. 70–82, 2018.
[8] A. Kumar, A. Kumar, A. K. Bashir, M. Rashid, V. D. A. Kumar, and R. Kharel, Distance based pattern driven mining for outlier detection in high dimensional big dataset, ACM Trans. Manag. Inf. Syst., vol. 13, no. 1, pp. 1–17, 2022.
[9] A. Khaleel and H. Al-Raweshidy, Optimization of computing and networking resources of a Hadoop cluster based on software defined network, IEEE Access, vol. 6, pp. 61351–61365, 2018.
[10] M. Malik, K. Neshatpour, S. Rafatirad, and H. Homayoun, Hadoop workloads characterization for performance and energy efficiency optimizations on microservers, IEEE Trans. Multi-Scale Comput. Syst., vol. 4, no. 3, pp. 355–368, 2018.
[11] Y. Yao, J. Y. Wang, B. Sheng, C. C. Tan, and N. F. Mi, Self-adjusting slot configurations for homogeneous and heterogeneous Hadoop clusters, IEEE Trans. Cloud Comput., vol. 5, no. 2, pp. 344–357, 2017.
[12] H. Alshammari, J. Lee, and H. Bajwa, H2Hadoop: Improving Hadoop performance using the metadata of related jobs, IEEE Trans. Cloud Comput., vol. 6, no. 4, pp. 1031–1040, 2018.
[13] I. Ullah, M. S. Khan, M. Amir, J. Kim, and S. M. Kim, LSTPD: Least slack time-based preemptive deadline constraint scheduler for Hadoop clusters, IEEE Access, vol. 8, pp. 111751–111762, 2020.
[14] R. R. Parmar, S. Roy, D. Bhattacharyya, S. K. Bandyopadhyay, and T. H. Kim, Large-scale encryption in the Hadoop environment: Challenges and solutions, IEEE Access, vol. 5, pp. 7156–7163, 2017.
[15] S. Kumar and M. Singh, A novel clustering technique for efficient clustering of big data in Hadoop ecosystem, Big Data Mining and Analytics, vol. 2, no. 4, pp. 240–247, 2019.
[16] W. Huang, L. K. Meng, D. Y. Zhang, and W. Zhang, In-memory parallel processing of massive remotely sensed data using an Apache Spark on Hadoop YARN model, IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens., vol. 10, no. 1, pp. 3–19, 2017.
[17] M. Soualhia, F. Khomh, and S. Tahar, A dynamic and failure-aware task scheduling framework for Hadoop, IEEE Trans. Cloud Comput., vol. 8, no. 2, pp. 553–569, 2020.
[18] D. Tao, Z. W. Lin, and B. X. Wang, Load feedback-based resource scheduling and dynamic migration-based data locality for virtual Hadoop clusters in OpenStack-based clouds, Tsinghua Science and Technology, vol. 22, no. 2, pp. 149–159, 2017.
[19] P. Qin, B. Dai, B. X. Huang, and G. Xu, Bandwidth-aware scheduling with SDN in Hadoop: A new trend for big data, IEEE Syst. J., vol. 11, no. 4, pp. 2337–2344, 2017.
[20] X. Y. Wang, M. Veeraraghavan, and H. Y. Shen, Evaluation study of a proposed Hadoop for data center networks incorporating optical circuit switches, J. Opt. Commun. Netw., vol. 10, no. 8, pp. C50–C63, 2018.
[21] Y. Q. Chen, Y. Zhou, S. Taneja, X. Qin, and J. Z. Huang, aHDFS: An erasure-coded data archival system for Hadoop clusters, IEEE Trans. Parallel Distrib. Syst., vol. 28, no. 11, pp. 3060–3073, 2017.
[22] Z. Z. Li, H. Y. Shen, W. Ligon, and J. Denton, An exploration of designing a hybrid scale-up/out Hadoop architecture based on performance measurements, IEEE Trans. Parallel Distrib. Syst., vol. 28, no. 2, pp. 386–400, 2017.
[23] H. F. Wang and Y. P. Cao, An energy efficiency optimization and control model for Hadoop clusters, IEEE Access, vol. 7, pp. 40534–40549, 2019.
[24] N. M. F. Qureshi, D. R. Shin, I. F. Siddiqui, and B. S. Chowdhry, Storage-tag-aware scheduler for Hadoop cluster, IEEE Access, vol. 5, pp. 13742–13755, 2017.
[25] Z. Z. Li and H. Y. Shen, Measuring scale-up and scale-out Hadoop with remote and local file systems and selecting the best platform, IEEE Trans. Parallel Distrib. Syst., vol. 28, no. 11, pp. 3201–3214, 2017.
[26] Y. P. Zheng and G. Y. Chen, Energy analysis and application of data mining algorithms for Internet of Things based on Hadoop cloud platform, IEEE Access, vol. 7, pp. 183195–183206, 2019.
[27] C. T. Chen, L. J. Hung, S. Y. Hsieh, R. Buyya, and A. Y. Zomaya, Heterogeneous job allocation scheduler for Hadoop MapReduce using dynamic grouping integrated neighboring search, IEEE Trans. Cloud Comput., vol. 8, no. 1, pp. 193–206, 2020.
[28] P. Q. Jin, X. J. Hao, X. L. Wang, and L. H. Yue, Energy-efficient task scheduling for CPU-intensive streaming jobs on Hadoop, IEEE Trans. Parallel Distrib. Syst., vol. 30, no. 6, pp. 1298–1311, 2019.
[29] K. Sridharan, G. Komarasamy, and S. Daniel Madan Raja, Hadoop framework for efficient sentiment classification using trees, IET Netw., vol. 9, no. 5, pp. 223–228, 2020.
[30] Z. C. Dou, I. Khalil, A. Khreishah, and A. Al-Fuqaha, Robust insider attacks countermeasure for Hadoop: Design and implementation, IEEE Syst. J., vol. 12, no. 2, pp. 1874–1885, 2018.
[31] R. Agarwal, A. S. Jalal, and K. V. Arya, Local binary hexagonal extrema pattern (LBH XEP): A new feature descriptor for fake iris detection, Vis. Comput., vol. 37, no. 6, pp. 1357–1368, 2021.
[32] R. Agarwal, A. S. Jalal, and K. V. Arya, Enhanced binary hexagonal extrema pattern (EBH XEP) descriptor for iris liveness detection, Wirel. Pers. Commun., vol. 115, no. 3, pp. 2627–2643, 2020.
[33] R. Agarwal, A. S. Jalal, and K. V. Arya, A multimodal liveness detection using statistical texture features and spatial analysis, Multimed. Tools Appl., vol. 79, no. 19, pp. 13621–13645, 2020.
[34] R. Agrawal, A. S. Jalal, and K. V. Arya, Fake fingerprint liveness detection based on micro and macro features, Int. J. Biom., vol. 11, no. 2, pp. 177–206, 2019.
Ankit Kumar received the BEng degree in computer science & engineering from West Bengal Technical University, Kolkata, India in 2010, the MEng degree in computer science engineering from Indian Institute of Information Technology, Allahabad (IIIT Allahabad), India in 2012, and the PhD degree in computer science from Sri Satya Sai University of Technology & Medical Sciences, Sehore, India in 2022. Currently he is working as an assistant professor at the Department of Computer Engineering and Application, GLA University, Mathura. He has developed optimization algorithms in machine learning and data science. His current research interests include machine learning, deep learning, big data, evolutionary computation and its application in the real world, and optimization problems, especially in medical applications. He has published more than 65 research papers in reputed journals and conferences in high indexing databases and has 9 patents granted from Australia and India. He has authored 2 edited books published by Apple Publisher and CRC Press. He has completed 1 funded research project from the TEQIP-3. He is an associate editor for reputed journals and publishers.

Neeraj Varshney is presently working as an assistant professor at the Department of Computer Engineering and Application, GLA University, Mathura. He has done MEng and MCA degrees and is pursuing the PhD degree. He has vast experience in the teaching and research of computer science. He has published more than 20 research papers in reputed international and Indian journals. His research areas include vehicular networks, digital image processing, machine learning, and time series data analysis.

Surbhi Bhatiya has ten years of rich teaching and academic experience. She has earned professional management certification from PMI, USA. She is currently working as an assistant professor at the Department of Information Systems, College of Computer Sciences and Information Technology, King Faisal University, Saudi Arabia. She is also an adjunct professor at Shoolini University, Himachal Pradesh, India. She is an associate editor for reputed journals and publishers. She has published more than 70 research papers in reputed journals and conferences in high indexing databases, and has 9 patents granted from USA, Australia, and India. She has authored 3 solo books and 9 edited books published by Springer, Wiley, and Elsevier. She has completed 5 funded research projects from the Deanship of Scientific Research, King Faisal University, and the Ministry of Education, Saudi Arabia. Her research interests include information systems, and data analytics and sentiment analysis.

Kamred Udham Singh received the PhD degree from Banaras Hindu University, India in 2019. From 2015 to 2016, he was a junior research fellow, and from 2017 to 2019, he was a senior research fellow with UGC (University Grant Commission), India. In 2019, he became an assistant professor at the School of Computing, Graphic Era Hill University, India. His research interests include image security and authentication, deep learning, medical image watermarking, and information security. He has published several research papers in international peer-reviewed journals. He contributes his expertise as a reviewer and editor in many reputed journals.
