Simplified Data Processing for Large Cluster: A MapReduce and Hadoop Based Study

Abdiaziz Omar Hassan and Abdulkadir Abdulahi Hasan
Received: May 29, 2021; Accepted: June 21, 2021; Published: July 9, 2021
Abstract: With the rapid development of computing technologies, there is an ever-increasing trend in the growth of data. Data scientists are overwhelmed with such a large and ever-increasing amount of data, as it now requires more processing channels. The big concern arising here for large-scale data is to provide support for the decision-making process. In this study, the MapReduce programming model is applied, an associated implementation introduced by Google. The programming model involves the computation of two functions, Map and Reduce. The MapReduce libraries automatically parallelize the computation and handle complex tasks, including big data distribution, load balancing, and fault tolerance. This MapReduce implementation, originating at Google, together with its open-source counterpart Hadoop, has the objective of handling computation over large clusters of commodity machines. Our use of the MapReduce and Hadoop frameworks is aimed at handling terabytes and petabytes of storage across thousands of machines working in parallel and processing at the same time. In this way, large-scale processing and manipulation of big data are achieved with effective results. This study presents the basics of MapReduce programming and the open-source Hadoop framework. The Hadoop framework can speed up the handling of big data and respond very fast.
Keywords: Google MapReduce Processes, Hadoop, Parallel Data Processing, HDFS, Cloud Computing, Large Cluster Data Processing
was made part of the Google database management system and the Google file system. MapReduce can be employed for scalability and is a fault-tolerant data processing tool that can handle and process huge data with a lower bound on computing nodes [3].

Discussing how MapReduce works, a distributed file system (DFS) first categorizes data into multiple categories, and the data is then presented as pairs containing keys and values. The MapReduce framework performs its applications and functions on single machines, where the data may be preprocessed before the map functions or post-processed after the MapReduce function has run [4]. Hadoop, a well-known open-source implementation of MapReduce for handling large datasets, employs an already provided user-level filesystem to handle storage across the cluster [5]. This approach provides reasonable speed while handling larger datasets across a large number of computing nodes, and it reduces application time by about 30% compared with ordinary data mining techniques [6].

1.1. Programming Model and Application of MapReduce Function

The programming model takes a defined set of input key/value pairs and produces a set of output key/value pairs. MapReduce comprises two functions: one is Map and the other is Reduce. The Map function takes an input pair and produces intermediate key/value pairs. These intermediate outputs are grouped by key by the MapReduce library and then passed further to the Reduce function. The Reduce function accepts an intermediate key and merges its values to form a smaller set of values. Let us take the example of counting the occurrences of each word in a large dataset and express it with the map and reduce functions. The code to do this counting of occurrences will be similar to the following:

map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));

With this example, the program counts the occurrences of each word within the input files specified on the command line.

#include "mapreduce/mapreduce.h"

// User's map function
class WordCounter : public Mapper {
 public:
  virtual void Map(const MapInput& input) {
    const string& text = input.value();
    const int n = text.size();
    for (int i = 0; i < n; ) {
      // Skip past leading whitespace
      while ((i < n) && isspace(text[i])) i++;
      // Find word end
      int start = i;
      while ((i < n) && !isspace(text[i])) i++;
      if (start < i)
        Emit(text.substr(start, i - start), "1");
    }
  }
};
REGISTER_MAPPER(WordCounter);

// User's reduce function
class Adder : public Reducer {
  virtual void Reduce(ReduceInput* input) {
    // Iterate over all entries with the same key and add the values
    int64 value = 0;
    while (!input->done()) {
      value += StringToInt(input->value());
      input->NextValue();
    }
    // Emit sum for input->key()
    Emit(IntToString(value));
  }
};
REGISTER_REDUCER(Adder);

int main(int argc, char** argv) {
  ParseCommandLineFlags(argc, argv);

1.2. MapReduce Specification

The remainder of the driver builds the MapReduce specification for the job:

  MapReduceSpecification spec;

  // Store list of input files into "spec"
  for (int i = 1; i < argc; i++) {
    MapReduceInput* input = spec.add_input();
    input->set_format("text");
    input->set_filepattern(argv[i]);
    input->set_mapper_class("WordCounter");
  }

  // Specify the output files:
  //   /gfs/test/freq-00000-of-00100
  //   /gfs/test/freq-00001-of-00100
  //   ...
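To make the word-count logic above easy to check without the MapReduce runtime, the following is a minimal, self-contained sketch that emulates the map and reduce phases sequentially in a single process. The function names and structure here are illustrative only and are not part of the Google MapReduce library.

#include <iostream>
#include <iterator>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Emulated "map" phase: emit an intermediate (word, 1) pair for every
// whitespace-delimited word, mirroring WordCounter::Map above.
static void MapPhase(const std::string& text,
                     std::vector<std::pair<std::string, int>>* intermediate) {
  std::istringstream in(text);
  std::string word;
  while (in >> word) intermediate->push_back({word, 1});
}

// Emulated shuffle plus "reduce" phase: group intermediate pairs by key and
// sum their values, mirroring the Adder reducer above.
static std::map<std::string, int> ReducePhase(
    const std::vector<std::pair<std::string, int>>& intermediate) {
  std::map<std::string, int> counts;
  for (const auto& kv : intermediate) counts[kv.first] += kv.second;
  return counts;
}

int main() {
  // Read one "document" from standard input and count its words.
  std::string document((std::istreambuf_iterator<char>(std::cin)),
                       std::istreambuf_iterator<char>());
  std::vector<std::pair<std::string, int>> intermediate;
  MapPhase(document, &intermediate);
  for (const auto& kv : ReducePhase(intermediate))
    std::cout << kv.first << " " << kv.second << "\n";
  return 0;
}

Running such a program over a small text file and comparing its totals against the distributed job's output is a simple way to validate the Map and Reduce logic before scaling out.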
Richard M. Yoo and his fellows have studied scalable MapReduce on a large-scale shared-memory system and discussed how dynamic runtimes simplify parallel programming while automatically handling such scenarios. They showed how a multi-layered approach, working on optimizations at the algorithm, implementation, and OS-interaction levels, delivers significant speedup improvements with 256 threads. They also identified the hurdles and roadblocks that limit the scalability of runtimes on shared-memory systems [9].

Kyong-Ha Lee and his fellows discussed Google's MapReduce technique, which handles and processes big data more simply and smoothly, together with the benefit of minimized cost. The main characteristic of this MapReduce model is that it is able to process large data sets distributed among multiple nodes and multiple channels [10].

B. Panda and his fellows highlighted the MapReduce system and its applications to big data at an international conference. They described the MapReduce mechanism as a proprietary system of Google, and discussed how distributed computing can be greatly simplified with the Map and Reduce functions, providing the basics and insights for achieving the desired performance [11].

Jeffrey Dean and his fellows discussed simplified data processing on large clusters with the MapReduce framework. They described the subsidiary infrastructure of Google's MapReduce, which allocates data to a distributed file system and enables the algorithms to locate data and make it available. They termed it easy to use in the opinion of programmers, as more than ten thousand distinct MapReduce programs had been implemented internally at Google within a four-year span [12].

Bayardo, Panda, and their fellows discussed massively parallel learning with the application of the MapReduce framework. They highlighted combining the MapReduce programming technique with a distributed file system as a way to achieve distributed computing objectives, with data processing over thousands of computing nodes [11].
Jaliya Ekanayake and her fellows discussed MapReduce for data-intensive scientific analyses. They examined the MapReduce technique for its applicability to large parallel data analyses, with efficient parallel/concurrent algorithms meeting the scalability and performance requirements for handling and processing scientific data [13].

Anam Alam and her fellows discussed the Hadoop architecture and its issues, together with their implications, at an international conference. Hadoop is categorized as a distributed program or framework used to handle a large amount of data, and it is usually used for data-intensive applications. With its extensive application, every social media site has made use of it [14].

R. Vijayakumari and her fellows presented a comparative analysis of the Google File System and the Hadoop Distributed File System. They discussed distributed computing, parallel computing, grid computing, and other parameters, including design goals, processes, file management, scalability, protection, security, cache management, and replication, to compare both methods and their application of the file system [15].
3. Methodology

The methods used may not look familiar to a general audience. The first one is MapReduce, which is in fact oriented to programmers rather than business users. It has gained popularity due to its easy application, its efficiency, and its ability to handle "Big Data" in a timely manner. The MapReduce framework, with its application and programming model, is discussed above, where a word-count example is employed with the MapReduce framework.

3.1. Hadoop

Another process employed and utilized is Hadoop, which is connected with a Java implementation and Java applications. It can be used in two further ways: through the streaming API, or by building Hadoop applications with C++. The Hadoop Distributed File System is the target file system for use with MapReduce programs and is best suited to a small number of very large files. With the use of replication, data availability can be ensured within the Hadoop Distributed File System (HDFS). To process all of the files created by the mapping mechanism, the Reduce program gets access to internode data. When map and reduce are executed, both programs write to the local file system to avoid putting a burden on HDFS. HDFS supports a multiple-readers, one-writer (MROW) approach. An indexing mechanism does not apply to HDFS, so it is best suited to read-only applications that only scan and read the contents of files.
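As an illustration of the streaming route mentioned above, the two small programs below sketch a word-count mapper and reducer that read from standard input and write tab-separated key/value lines to standard output, which is the contract Hadoop Streaming expects. The file names and structure are illustrative assumptions, not material taken from the paper's experiments.

// wc_mapper.cc -- streaming mapper: emit "<word><TAB>1" for every word read.
#include <iostream>
#include <string>

int main() {
  std::string word;
  while (std::cin >> word) std::cout << word << "\t1\n";
  return 0;
}

// wc_reducer.cc -- streaming reducer: input lines arrive sorted by key, so
// counts for the same word are adjacent and can be summed in a single pass.
#include <iostream>
#include <sstream>
#include <string>

int main() {
  std::string line, current;
  long long sum = 0;
  while (std::getline(std::cin, line)) {
    std::istringstream in(line);
    std::string word;
    long long count = 0;
    in >> word >> count;                       // parse "<word><TAB><count>"
    if (word != current && !current.empty()) { // key changed: emit previous total
      std::cout << current << "\t" << sum << "\n";
      sum = 0;
    }
    current = word;
    sum += count;
  }
  if (!current.empty()) std::cout << current << "\t" << sum << "\n";
  return 0;
}

Once compiled, such binaries would typically be handed to the Hadoop Streaming jar via its -input, -output, -mapper, and -reducer options; building Hadoop applications directly in C++ instead goes through the Pipes interface.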
3.1.1. Hadoop Architecture

The Hadoop Distributed File System stores data within its computing nodes, providing customized and high aggregate bandwidth across the entire cluster. A file system installation has several nodes: one single name node, called the master node, and various data nodes, called slave nodes. The name node is responsible for managing the file system namespace and controls the access to files by clients. The data nodes, or slave nodes, are distributed so that one data node is assigned per machine in the cluster, managing the data attached to the machine on which it runs. The name node executes operations on the file system namespace and assigns data blocks to the data nodes. The data nodes handle read and write requests from clients and perform operations as instructed [16]. HDFS manipulates and handles data chunks and replicates these chunks across the servers for performance, load balancing, and resiliency. An application will specify the number of replicas of a file right when it is created, and this count can be changed at any time after that. The name node has the ability to make decisions concerning block replication.

3.1.2. Deploying Hadoop

Hadoop can be deployed in three different ways. The first is standalone mode, the default mode of Hadoop, running as a single Java process. The second is pseudo-distributed mode, which involves configuring Hadoop to run on a single machine, with the different Hadoop processes running as separate Java processes. The third is fully distributed or cluster mode, involving one machine as the name node and another as the job tracker. There can also be a secondary name node that performs periodic handshaking with the name node for fault tolerance.

3.1.3. Replication Management

HDFS provides a reliable way to store huge data in a distributed environment as data blocks. The blocks are also replicated to provide fault tolerance. The default replication factor is 3, which is again configurable, so that each block is replicated three times and stored on different DataNodes.
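To make the effect of the default replication factor concrete, the short calculation below estimates how many block replicas and how much raw disk space a single file consumes. The 128 MB block size and the factor of 3 are assumptions standing in for the configurable dfs.blocksize and dfs.replication settings, not values taken from the paper.

// Illustrative arithmetic only: storage implied by HDFS block replication.
#include <cstdint>
#include <iostream>

int main() {
  const std::uint64_t kBlockSize   = 128ULL * 1024 * 1024;      // assumed block size (dfs.blocksize)
  const std::uint64_t kReplication = 3;                         // assumed replication factor (dfs.replication)
  const std::uint64_t fileSize     = 1ULL * 1024 * 1024 * 1024; // example: one 1 GiB file

  const std::uint64_t blocks = (fileSize + kBlockSize - 1) / kBlockSize;   // ceiling division
  std::cout << "blocks: " << blocks << "\n"                                // 8
            << "block replicas stored: " << blocks * kReplication << "\n"  // 24
            << "raw bytes on disk: " << fileSize * kReplication << "\n";   // 3 GiB
  return 0;
}

Under the default factor, a file therefore occupies roughly three times its logical size across the cluster, which is the price paid for tolerating DataNode failures.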
… re-use of data sources, including public data, to construct applications. There is a need to evaluate the best approach to use for filtering and analyzing the data. For optimized processing, Hadoop with MapReduce can be employed, as we have done in this paper with the basics of MapReduce programming and the open-source Hadoop framework. The Hadoop framework can speed up the processing of big data and respond very fast. The extensibility and simplicity of these frameworks are the critical factors that make them a replenishing tool for big data handling, processing, and management.

5. Conclusion

In this study, the MapReduce programming model was applied, an associated implementation introduced by Google. This programming model involves the computation of two functions, Map and Reduce.

Hadoop comprises an ecosystem of tools and technologies that requires careful analysis and expertise to determine the suitable mapping of technologies and enable a smooth migration.

Hadoop is a highly scalable platform, largely because of its ability to store and allocate large data sets across many servers. The servers used are quite inexpensive and can operate in parallel, and the processing power of the system can be improved by adding more servers.

The Hadoop MapReduce programming model offers the flexibility to process structured or unstructured data in various business organizations, which can use and operate on different types of data. Thus, they can achieve business value out of meaningful and beneficial data for analysis.

References

[1] J. R. Swedlow, G. Zanetti and C. Best, "Channeling the data deluge," Nature Methods, vol. 8, pp. 463-465, 2011.

[2] S. Maitrey and C. K. Jha, "An Integrated Approach for CURE Clustering using Map-Reduce Techniques," in Proceedings of Elsevier, vol. 2, 2013.

[3] D. DeWitt, "MapReduce: A major step backwards," The Database Column, 2011.

[4] Y. Kim and K. Shim, "Parallel Top-K Similarity Join Algorithms Using MapReduce," Arlington, VA, USA, 2012.

[5] J. Shafer, S. Rixner and A. L. Cox, "The Hadoop distributed filesystem: Balancing portability and performance," White Plains, NY, USA, 2010.

[6] C. A. Moturi et al., "Use of MapReduce for Data Mining and Data Optimization on a Web Portal," International Journal of Computer, vol. 56, no. 7, 2012.

[7] S. Maitrey and C. K. Jha, "MapReduce: Simplified Data Analysis of Big Data," Procedia Computer Science, vol. 57, pp. 563-571, 2015.

[8] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," in Proc. USENIX OSDI, vol. 4, pp. 137-149, 2004.

[9] R. M. Yoo, A. Romano and C. Kozyrakis, "Phoenix rebirth: Scalable MapReduce on a large-scale shared-memory system," Austin, TX, USA, 2009.

[10] K.-H. Lee, Y.-J. Lee, H. Choi, Y. D. Chung and B. Moon, "Parallel data processing with MapReduce: a survey," ACM SIGMOD Record, vol. 40, no. 4, 2012.

[11] B. Panda, J. S. Herbach, S. Basu and R. J. Bayardo, "PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce," PVLDB, vol. 2, no. 2, pp. 1426-1437, 2009.

[12] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, 2008.

[13] J. Ekanayake, S. Pallickara and G. Fox, "MapReduce for Data Intensive Scientific Analyses," Indianapolis, IN, USA, 2008.

[14] A. Alam and J. Ahmed, "Hadoop Architecture and Its Issues," Las Vegas, NV, USA, 2014.

[15] R. Vijayakumari et al., "Comparative analysis of Google File System and Hadoop Distributed File System," International Journal of Advanced Trends in Computer Science and Engineering, vol. 3, no. 1, pp. 553-558, 2014.

[16] F. Wang, J. Qiu, J. Yang, B. Dong, X. Li and Y. Li, "Hadoop high availability through metadata replication," in Proc. of the First International Workshop on Cloud Data Management, pp. 37-44, 2009.

[17] H.-C. Yang, A. Dasdan, R.-L. Hsiao and D. S. Parker, "Map-reduce-merge: simplified relational data processing on large clusters," 2007.