


Advances in Applied Sciences
2021; 6(3): 43-48
http://www.sciencepublishinggroup.com/j/aas
doi: 10.11648/j.aas.20210603.11
ISSN: 2575-2065 (Print); ISSN: 2575-1514 (Online)

Simplified Data Processing for Large Cluster:
A MapReduce and Hadoop Based Study

Abdiaziz Omar Hassan*, Abdulkadir Abdulahi Hasan

College of Mathematics and Big Data, Anhui University of Science and Technology, Huainan, China

Email address:
*Corresponding author

To cite this article:


Abdiaziz Omar Hassan, Abdulkadir Abdulahi Hasan. Simplified Data Processing for Large Cluster: A MapReduce and Hadoop Based Study.
Advances in Applied Sciences. Vol. 6, No. 3, 2021, pp. 43-48. doi: 10.11648/j.aas.20210603.11

Received: May 29, 2021; Accepted: June 21, 2021; Published: July 9, 2021

Abstract: With the drastic development of computing technologies, there is an ever-increasing trend in the growth of data. Data scientists are overwhelmed with such a large and ever-increasing amount of data, as it now requires more processing channels. The big concern arising here for large-scale data is to provide support for the decision-making process. In this study, the MapReduce programming model is applied, an associated implementation introduced by Google. This programming model involves the computation of two functions: Map and Reduce. The MapReduce libraries automatically parallelize the computation and handle complex tasks including data distribution, load balancing, and fault tolerance. This MapReduce implementation, together with the open-source Hadoop framework, has the objective of handling computation over large clusters of commodity machines. Our treatment of the MapReduce and Hadoop frameworks discusses terabytes and petabytes of storage spread across thousands of machines that work in parallel and process at the same time. This way, large-scale processing and manipulation of big data are maintained with effective result orientation. This study presents the basics of MapReduce programming and the application of the open-source Hadoop framework. The Hadoop system can speed up the handling of big data and respond very fast.

Keywords: Google MapReduce Processes, Hadoop, Parallel Data Processing, HDFS, Cloud Computing,
Large Cluster Data Processing

1. Introduction

With the introduction and advancement of technology and computerized innovation, the growth of data has become almost unimaginable. Data scientists and handlers are overwhelmed by such a large and ever-increasing amount of data, whose processing requirements become more demanding every day. With so large and ever-growing a body of data come problems concerning its handling, processing, and management. These problems are faced by the many fields that make use of large-scale data, draw meaning out of it, and use it for decision making.

Data mining, data classification, handling, and processing are among the technologies that can draw new insights out of these large data sets. For many years, data mining techniques and their prerequisites have been studied in all applicable scenarios, leading to the development of data mining methods and, further, their application to make them workable. Various processing hurdles are faced by large-scale internet companies, including Google, Yahoo, Facebook, and LinkedIn, as well as other big internet-solution providers that must process huge chunks of data not only in a minimal timeframe but also with a cost-effective solution.

Google developed MapReduce and the Google File System, which are studied and investigated in this research. Google has also built a database management system (DBMS) known as BigTable. This system can search millions of pages and return the results in milliseconds by employing algorithms that work through the MapReduce system and the Google File System [1].

In the recent past, MapReduce has established itself as a computing paradigm for the analysis of large amounts of data [2]. MapReduce gained fame when it was made part of the Google database management system and the Google File System. MapReduce is scalable and is a fault-tolerant data processing tool that can handle and process huge data even on modest computing nodes [3].

As to how MapReduce works: a distributed file system (DFS) first partitions the data into multiple chunks, and the data is then presented as pairs containing a key and a value. The MapReduce framework runs its functions on each machine, where data may be preprocessed before the map function runs or post-processed after the reduce function has run [4]. Hadoop, a famous open-source implementation of MapReduce for handling large datasets, is applied here. It employs a user-level filesystem to handle storage across the cluster [5]. This approach yields reasonably fast output while handling larger datasets, scales across a large number of computing nodes, and can reduce application time by 30% compared with ordinary data mining techniques [6].

1.1. Programming Model and Application of the MapReduce Function

The programming model takes a defined set of input key/value pairs and produces a set of output key/value pairs. The MapReduce computation has two parts: one is Map and the other, Reduce. The map function takes an input pair and produces intermediate key/value pairs. These intermediate outputs are grouped by key by the MapReduce library and then passed on to the Reduce function. The Reduce function accepts an intermediate key and merges its values to form a smaller set of values. Consider the example of counting the occurrences of each word in a large collection of documents using the map and reduce functions. The code to do this counting of occurrences will be similar to the following pseudocode:

map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));

In full detail, the following program counts the occurrences of each word within the input files specified on the command line:

#include "mapreduce/mapreduce.h"

// User's map function
class WordCounter : public Mapper {
 public:
  virtual void Map(const MapInput& input) {
    const string& text = input.value();
    const int n = text.size();
    for (int i = 0; i < n; ) {
      // Skip past leading whitespace
      while ((i < n) && isspace(text[i])) i++;
      // Find word end
      int start = i;
      while ((i < n) && !isspace(text[i])) i++;
      if (start < i)
        Emit(text.substr(start, i - start), "1");
    }
  }
};
REGISTER_MAPPER(WordCounter);

// User's reduce function
class Adder : public Reducer {
  virtual void Reduce(ReduceInput* input) {
    // Iterate over all entries with the same key and add the values
    int64 value = 0;
    while (!input->done()) {
      value += StringToInt(input->value());
      input->NextValue();
    }
    // Emit sum for input->key()
    Emit(IntToString(value));
  }
};
REGISTER_REDUCER(Adder);

1.2. MapReduce Specification

The main program then builds a specification object describing the input files and the mapper class:

int main(int argc, char** argv) {
  ParseCommandLineFlags(argc, argv);

  MapReduceSpecification spec;

  // Store list of input files into "spec"
  for (int i = 1; i < argc; i++) {
    MapReduceInput* input = spec.add_input();
    input->set_format("text");
    input->set_filepattern(argv[i]);
    input->set_mapper_class("WordCounter");
  }

1.3. MapReduce Output

The output files, the reducer, an optional combiner, and tuning parameters are specified next, and the job is run:

  // Specify the output files:
  //   /gfs/test/freq-00000-of-00100
  //   /gfs/test/freq-00001-of-00100
  //   ...
  MapReduceOutput* out = spec.output();
  out->set_filebase("/gfs/test/freq");
  out->set_num_tasks(100);
  out->set_format("text");
  out->set_reducer_class("Adder");

  // Optional: do partial sums within map tasks to save network bandwidth
  out->set_combiner_class("Adder");

  // Tuning parameters: use at most 2000 machines
  // and 100 MB of memory per task
  spec.set_machines(2000);
  spec.set_map_megabytes(100);
  spec.set_reduce_megabytes(100);

  // Now run it
  MapReduceResult result;
  if (!MapReduce(spec, &result)) abort();
  return 0;
}

The above code is written in terms of string inputs and outputs, but conceptually the user-supplied map and reduce functions have associated types:

map (k1, v1) → list(k2, v2)
reduce (k2, list(v2)) → list(v2)

In the word-count example, k1 is a document name, v1 is the document contents, k2 is a word, and each v2 is a count. The optional combiner declared above performs partial sums within each map task, so a map task that encounters the word "the" five times ships a single ("the", 5) pair rather than five ("the", 1) pairs, saving network bandwidth. In an application such as distributed sorting, by contrast, the map function simply digs the key out of each record and forwards it with the matching record as a pair, while the reduce function emits all pairs unchanged.

Figure 1. Execution plan of the programming model.
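To make the grouping step of this execution plan concrete, the following is a minimal, single-process C++ sketch that simulates the same word-count dataflow in memory. It only illustrates the Map, group-by-key, and Reduce pipeline; it is not the Google MapReduce library, and the helper names (MapDocument, ReduceCounts) are invented for this sketch.

#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// "Map": emit an intermediate (word, 1) pair for every word in one document.
static void MapDocument(const std::string& contents,
                        std::vector<std::pair<std::string, int>>* intermediate) {
  std::istringstream in(contents);
  std::string word;
  while (in >> word) intermediate->push_back(std::make_pair(word, 1));
}

// "Reduce": sum the list of counts collected for one word.
static int ReduceCounts(const std::vector<int>& counts) {
  int result = 0;
  for (int v : counts) result += v;
  return result;
}

int main() {
  const std::vector<std::string> documents = {"the quick fox", "the lazy dog"};

  // Map phase: run the map function over every input document.
  std::vector<std::pair<std::string, int>> intermediate;
  for (const auto& doc : documents) MapDocument(doc, &intermediate);

  // Shuffle phase: group intermediate values by key, as the library would.
  std::map<std::string, std::vector<int>> grouped;
  for (const auto& kv : intermediate) grouped[kv.first].push_back(kv.second);

  // Reduce phase: one reduce call per distinct intermediate key.
  for (const auto& kv : grouped)
    std::cout << kv.first << " " << ReduceCounts(kv.second) << "\n";
  // Prints: dog 1, fox 1, lazy 1, quick 1, the 2
  return 0;
}

In the real library the shuffle phase happens across machines, with intermediate pairs partitioned among the reduce tasks, but the grouping semantics are exactly those simulated here.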


2. Related Works

Seema Maitrey and her fellow researchers studied big data handling under the title "MapReduce: Simplified Data Analysis of Big Data". Their study focuses on the MapReduce technique as used on cloud-based technologies. A famous application of cloud technology is Google, which works aligned with this technology and handles data and processing with care. They also discussed Hadoop, which is used by companies other than Google, including Facebook and Yahoo. The analytical processing of data using Hadoop and the application of MapReduce are verified and assessed in their research-based study [7].

Another researcher, Jeffrey Dean, with his fellow researchers, studied the MapReduce framework, which is getting a lot of attention for its application to big data. They classified it as a programming model, with an implementation aimed at processing and handling large datasets, that is responsive to a wide variety of real-world tasks [8].

Figure 2. Anatomy of MapReduce function.

Richard M. Yoo and his fellows studied scalable MapReduce on a large-scale shared-memory system and discussed dynamic runtimes that simplify parallel programming and automatically detect execution scenarios. They showed how a multi-layered approach that works on optimizations at the algorithm, implementation, and OS-interaction levels can deliver significant speedup improvements with 256 threads. They also identified the hurdles and roadblocks that limit the scalability of such runtimes on shared-memory systems [9].

Kyong-Ha Lee and his fellows discussed Google's MapReduce technique, which handles and processes big data more simply and smoothly, together with the benefit of minimized cost. The main characteristic of the MapReduce model is that it is able to process large data sets distributed among multiple nodes and multiple channels [10].

B. Panda and his fellows highlighted the MapReduce system and its applications to big data at an international conference. They described the MapReduce mechanism as a proprietary system of Google and discussed how distributed computing is greatly simplified by the Map and Reduce functions, providing the basics and insights for achieving the desired performance [11].

Jeffrey Dean and his fellows discussed simplified data processing on large clusters with the MapReduce framework. They described the subsidiary infrastructure of Google's MapReduce, which allocates work over a distributed file system and enables the algorithms to locate data and make it available. They termed it easy to use in the opinion of programmers, as more than ten thousand distinct MapReduce programs were implemented internally at Google within a four-year span [12].

Panda and his fellows also discussed massively parallel learning of tree ensembles with the application of the MapReduce framework. They highlighted combining the MapReduce programming technique with the distributed file system as a way to achieve distributed computing objectives, with data processing over thousands of computing nodes [11].

Jaliya Ekanayake and fellow researchers discussed MapReduce for data-intensive scientific analyses. They examined the MapReduce technique for its application to large parallel data analyses, using efficient parallel/concurrent algorithms that meet the scalability and performance requirements of handling and processing scientific data [13].

Anam Alam and fellow researchers discussed the Hadoop architecture and its issues at an international conference. Hadoop is categorized as a distributed framework used to handle a large amount of data and is usually used for data-intensive applications. With its extensive application, every major social media site has made use of it [14].

R. Vijayakumari and fellow researchers presented a comparative analysis of the Google File System and the Hadoop Distributed File System. They compared the two file systems in the context of distributed computing, parallel computing, and grid computing, on parameters including design goals, processes, file management, scalability, protection, security, cache management, and replication [15].

3. Methodology

The methods used may not look familiar to a common audience. The first one is MapReduce, which is in fact oriented to programmers rather than business users. It has gained popularity due to its easy application, efficiency, and ability to handle "Big Data" in a timely manner. The MapReduce framework, with its application and programming model, is discussed above, where an example of counting word occurrences is employed with the MapReduce framework.

3.1. Hadoop

The other process employed and utilized is Hadoop, which is implemented in Java and used from Java applications. It can be used in two different ways: one is the streaming API, and the other involves building Hadoop applications with C++ (a minimal sketch of the latter appears after Figure 3). The Hadoop Distributed File System is the target file system for MapReduce programs, and it is best applicable to a small number of very large files. With the use of replication, data availability is made possible within the Hadoop Distributed File System (HDFS).

To process all of the files created by the mapping mechanism, the Reduce program gets access to internode data. When map and reduce are executed, both programs write to the local file system to avoid placing a burden on the HDFS system. HDFS supports a multiple-reader, one-writer (MROW) approach. An indexing mechanism does not apply to HDFS, so it best suits read-only applications that only scan and read the contents of files.

3.1.1. Hadoop Architecture

The Hadoop Distributed File System stores data within its computing nodes, providing customized and high aggregate bandwidth across the entire cluster. A file system installation has one single name node, called the master node, and various data nodes, called slave nodes. The name node is responsible for the management of the file system namespace and controls the access to files by clients. The data nodes, or slave nodes, are distributed so that one data node is assigned per machine in the cluster, managing the data attached to the machines where they run. The name node executes file system namespace operations and assigns data blocks to data nodes. The data nodes handle read and write requests from clients and perform operations according to the instructions provided [16].

Figure 3. Hadoop Architecture.
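As noted in Section 3.1, Hadoop applications can also be written in C++ through the Hadoop Pipes interface. The sketch below follows the word-count example conventionally shipped with Hadoop Pipes; it assumes the Pipes headers and libraries from a Hadoop distribution are available, so the exact header paths and helper utilities should be treated as assumptions rather than a definitive build recipe.

#include <string>
#include <vector>

#include "hadoop/Pipes.hh"          // assumed Pipes headers from a Hadoop distribution
#include "hadoop/StringUtils.hh"
#include "hadoop/TemplateFactory.hh"

// Map task: emit (word, "1") for every whitespace-separated token.
class WordCountMapper : public HadoopPipes::Mapper {
 public:
  explicit WordCountMapper(HadoopPipes::TaskContext& context) {}
  void map(HadoopPipes::MapContext& context) {
    std::vector<std::string> words =
        HadoopUtils::splitString(context.getInputValue(), " ");
    for (std::size_t i = 0; i < words.size(); ++i)
      context.emit(words[i], "1");
  }
};

// Reduce task: sum the counts collected for one word.
class WordCountReducer : public HadoopPipes::Reducer {
 public:
  explicit WordCountReducer(HadoopPipes::TaskContext& context) {}
  void reduce(HadoopPipes::ReduceContext& context) {
    int sum = 0;
    while (context.nextValue())
      sum += HadoopUtils::toInt(context.getInputValue());
    context.emit(context.getInputKey(), HadoopUtils::toString(sum));
  }
};

int main() {
  // Hand control to the Pipes runtime, which communicates with the Java framework.
  return HadoopPipes::runTask(
      HadoopPipes::TemplateFactory<WordCountMapper, WordCountReducer>());
}

A binary built from this source would typically be uploaded to HDFS and launched with the hadoop pipes command, pointing its input, output, and program options at HDFS paths; the exact invocation depends on the Hadoop version deployed.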

The Hadoop Distributed File System manipulates and handles data in chunks and replicates these data chunks across the servers for performance, load-balancing, and resiliency. The processing application will specify the number of replicas of a file right when it is created, and this count can be changed at any time after that. The name node makes the decisions concerning block replication.

3.1.2. Deploying Hadoop

Hadoop can be deployed in three different ways. The first is standalone mode, the default mode of Hadoop, running as a single Java process. The second is pseudo-distributed mode, which involves configuring Hadoop to run on a single machine, with the different Hadoop processes running as separate Java processes. The third is fully distributed or cluster mode, which involves one machine as the name node and another as the job tracker. There can also be a secondary name node, which performs periodic handshaking with the name node for fault tolerance.

3.1.3. Replication Management

HDFS provides a reliable way to store huge data in a distributed environment as data blocks. The blocks are also replicated to provide fault tolerance. The default replication factor is 3, which is again configurable. So, as you can see in the figure below, each block is replicated three times and stored on different DataNodes (considering the default replication factor):

Figure 4. Block replication.
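As a worked illustration of the replication shown in Figure 4, assuming the common HDFS default block size of 128 MB (older releases defaulted to 64 MB): a 500 MB file is split into four blocks, and with the default replication factor of 3 the cluster stores twelve block replicas in total, so the failure of any single DataNode still leaves at least two copies of every block available.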

3.2. Hadoop Based Oozie Structure and Implementation

Apache Oozie manages all the tasks and keeps them organized, so it can be described as a scheduler for Hadoop. This mechanism provides workflows of dependent jobs and helps to build Directed Acyclic Graphs of workflows that allow jobs or tasks to run in parallel and sequentially in Hadoop.

This type of Oozie workflow works with both action nodes and control-flow nodes. An action node represents a workflow task, such as moving files into HDFS, running a MapReduce job, running a shell script, or running a program written in Java, while a control-flow node controls the execution path of the workflow between the action nodes.

Figure 5. Oozie workflow chart.
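As a hypothetical illustration of such a workflow chart: an fs action could first stage raw input files into HDFS, a map-reduce action could then run the word-count job described earlier, and a shell action could publish the results, with control-flow nodes such as start, end, kill, and a fork/join pair determining which actions run and whether they run in parallel.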
4. Results and Discussions

Big data and its requisite technologies can bring about significant changes and benefits to a business. But with their increased and widespread use, it can become a difficult task for an organization to manage, control, and tackle a heterogeneous collection of data and still obtain the desired outcomes.

To handle the growth of individual companies, certain practices should be followed so that timely results can be attained from Big Data, since it is through the effective use of Big Data that modernization and effectiveness for entire divisions and economies are to be attained. Therefore, an organization should know how to ensure the effective usage, management, and
re-use of data sources, including public data, to construct applications. There is a need to evaluate the best approach for filtering and analyzing the data. For optimized processing, Hadoop with MapReduce can be employed, as we have done in this paper with the basics of MapReduce programming and the open-source Hadoop framework. The Hadoop framework can speed up the processing of big data and respond very fast. The extensibility and simplicity of these frameworks are the critical factors that make them a compelling tool for big data handling, processing, and management.

5. Conclusion

In this study, the MapReduce programming model was applied, an associated implementation introduced by Google. This programming model involves the computation of two functions: Map and Reduce.

Hadoop performance rests on an ecosystem of tools and technologies that require careful analysis and expertise to determine the suitable mapping of technologies and enable a smooth migration.

Hadoop is a highly scalable platform, largely because of its ability to store and distribute large data sets across many servers. The servers used are quite inexpensive and can operate in parallel, and the processing power of the system can be improved by adding more servers.

The Hadoop MapReduce programming model offers business organizations the flexibility to process structured or unstructured data of different types. Thus, they can derive business value out of that meaningful and beneficial data for analysis.

References

[1] J. R. Swedlow et al., "Channeling the data deluge," Nature Methods, vol. 8, pp. 463-465, 2011.

[2] S. Maitrey and C. K. Jha, "An Integrated Approach for CURE Clustering using Map-Reduce Techniques," in Proceedings of Elsevier, vol. 2, 2013.

[3] D. DeWitt, "MapReduce: A major step backwards," The Database Column, 2011.

[4] Y. Kim and K. Shim, "Parallel Top-K Similarity Join Algorithms Using MapReduce," Arlington, VA, USA, 2012.

[5] J. Shafer, S. Rixner and A. L. Cox, "The Hadoop distributed filesystem: Balancing portability and performance," White Plains, NY, USA, 2010.

[6] C. A. Moturi et al., "Use of MapReduce for Data Mining and Data Optimization on a Web Portal," International Journal of Computer Applications, vol. 56, no. 7, 2012.

[7] S. Maitrey and C. K. Jha, "MapReduce: Simplified Data Analysis of Big Data," Procedia Computer Science, vol. 57, pp. 563-571, 2015.

[8] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," in Proc. USENIX OSDI, vol. 4, pp. 137-149, 2004.

[9] R. M. Yoo, A. Romano and C. Kozyrakis, "Phoenix rebirth: Scalable MapReduce on a large-scale shared-memory system," Austin, TX, USA, 2009.

[10] K.-H. Lee, Y.-J. Lee, H. Choi, Y. D. Chung and B. Moon, "Parallel data processing with MapReduce: a survey," ACM SIGMOD Record, vol. 40, no. 4, 2012.

[11] B. Panda, J. S. Herbach, S. Basu and R. J. Bayardo, "PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce," PVLDB, vol. 2, no. 2, pp. 1426-1437, 2009.

[12] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, 2008.

[13] J. Ekanayake, S. Pallickara and G. Fox, "MapReduce for Data Intensive Scientific Analyses," Indianapolis, IN, USA, 2008.

[14] A. Alam and J. Ahmed, "Hadoop Architecture and Its Issues," Las Vegas, NV, USA, 2014.

[15] R. Vijayakumari, R. Kirankumar and K. G. Rao, "Comparative analysis of Google File System and Hadoop Distributed File System," International Journal of Advanced Trends in Computer Science and Engineering, vol. 3, no. 1, pp. 553-558, 2014.

[16] F. Wang, J. Qiu, J. Yang, B. Dong, X. Li and Y. Li, "Hadoop high availability through metadata replication," in Proc. First International Workshop on Cloud Data Management, pp. 37-44, 2009.

[17] H. Yang, A. Dasdan, R.-L. Hsiao and D. S. Parker, "Map-reduce-merge: simplified relational data processing on large clusters," 2007.
