Abstract: In the Big Data community, Apache Hadoop and Spark are gaining prominence in handling Big Data and analytics. Similarly, MapReduce has been seen as one of the key enabling approaches for large-scale query processing. These middleware are traditionally written with sockets and do not deliver the best performance on datacenters with modern high-performance networks. In this paper we investigate the characteristics of two file systems that support in-memory and heterogeneous storage, and discuss the impact of these two architectures on the performance and fault tolerance of Hadoop MapReduce and Spark applications. We present a complete methodology for evaluating MapReduce and Spark workloads on top of in-memory file systems and provide insights into the interactions of different system components while running these workloads.
Hadoop uses a brute-force access strategy that exposes a massively parallel processing infrastructure, while RDBMS solutions rely on optimized access routines such as indexes, together with read-ahead and write-behind techniques. Hence, Hadoop really only excels in situations where the data is unstructured to the point that no RDBMS optimization techniques can be applied to support the execution of the queries.

Hadoop is essentially designed to efficiently handle large data volumes by connecting many commodity systems together so that they work as a single parallel unit. Enterprises are using Hadoop widely to analyze their data sets. The reason is that the Hadoop framework is based on a simple programming model (MapReduce) and it enables a computing solution that is scalable, flexible, fault-tolerant and cost-effective. Here, the main concern is to maintain speed in processing large datasets, in terms of the waiting time between queries and the waiting time to run a program. Spark was introduced by the Apache Software Foundation to speed up the Hadoop computational processing. Contrary to a common belief, Spark is not a modified version of Hadoop and is not, in general, dependent on Hadoop, because it has its own cluster management. Hadoop is only one of the ways to implement Spark. Spark uses Hadoop in two ways: one is storage and the second is processing. Since Spark has its own cluster management computation, it uses Hadoop for storage purposes only.

II. DATA PROCESSING AND MANAGEMENT ON MODERN CLUSTERS

Big Data processing techniques analyze big data sets at terabyte or even petabyte scale, tackling arbitrary BI use cases, while real-time stream processing is performed on the most current slice of data for data profiling to pick outliers, fraud transaction detection, security monitoring, etc. The toughest task, however, is to do fast (low latency) or real-time ad-hoc analytics on a complete big data set. It practically means you need to scan terabytes (or even more) of data within seconds. This is only possible when data is processed with high parallelism. In this section we present how data is actually processed and managed on various modern clusters, which has a substantial impact on designing and utilizing modern data management and processing systems. These systems are organized in multiple tiers: front-end data accessing and serving (online), e.g. MySQL, HBase, and back-end data analytics (offline), e.g. HDFS, MapReduce, Spark.
An in-memory data-processing framework performs iterative machine learning jobs and interactive data analytics; it is scalable and communication- and I/O-intensive, with wide dependencies between Resilient Distributed Datasets (RDDs), MapReduce-like shuffle operations to repartition RDDs, and sockets-based communication.

2.2. Cluster computing frameworks

MapReduce is one of the earliest and best known commodity cluster frameworks. MapReduce follows the functional programming model [8] and performs explicit synchronization across computational stages. MapReduce exposes a simple programming API in terms of map() and reduce() functions. Apache Hadoop [1] is a widely used open source implementation of MapReduce. The simplicity of MapReduce is attractive for users, but the framework has several limitations. Applications such as machine learning and graph analytics iteratively process the data, which means multiple rounds of computation are performed on the same data. In MapReduce, every job reads its input data, processes it, and then writes it back to HDFS. For the next job to consume the output of a previously run job, it has to repeat the read, process, and write cycle. For iterative algorithms, which want to read once and iterate over the data many times, the MapReduce model poses a significant overhead. To overcome the above limitations, in-memory execution systems such as Spark have been introduced.

Spark is another execution system. Like MapReduce, it works with the file system to distribute your data over the cluster and process that data in parallel. Apache Spark is a fast and general engine for large-scale data processing. It is based on Hadoop MapReduce and it extends the MapReduce model to efficiently use it for more types of computations, which include interactive queries and stream processing. The main feature of Spark is its in-memory cluster computing, which increases the processing speed of an application. Spark is designed to cover a wide range of workloads such as batch applications, iterative algorithms, interactive queries and streaming. Apart from supporting all these workloads in a single system, it reduces the management burden of maintaining separate tools.

3.1. Features of Apache Spark

3.1.1. Speed

Spark runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory computing.
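To make the map()/reduce() model of Section 2.2 and the in-memory caching behind these speedups concrete, the following is a minimal Spark sketch in Scala; the input path and the local master setting are placeholders for illustration, not taken from the paper.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    // Local master and input path are placeholders for illustration only.
    val sc = new SparkContext(new SparkConf().setAppName("WordCount").setMaster("local[*]"))

    // map(): break each line into (word, 1) tuples; reduceByKey(): combine tuples per key.
    val counts = sc.textFile("hdfs:///data/input.txt")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // cache() keeps the result in memory, so the two actions below reuse it
    // instead of repeating MapReduce's read/process/write-to-HDFS cycle.
    counts.cache()
    println("distinct words: " + counts.count())
    println("total words: " + counts.map(_._2.toLong).reduce(_ + _))

    sc.stop()
  }
}
```

An iterative algorithm would re-scan the cached RDD in a loop in the same way, which is exactly the case where chained MapReduce jobs pay the repeated HDFS read/write overhead described above.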
3.2.1. Standalone: Spark Standalone deployment means Spark occupies the place on top of HDFS (Hadoop Distributed File System) and space is allocated for HDFS explicitly. Here, Spark and MapReduce run side by side to cover all Spark jobs on the cluster.

3.2.2. Hadoop Yarn: Hadoop Yarn deployment means, simply, that Spark runs on Yarn without any pre-installation or root access required. It helps to integrate Spark into the Hadoop ecosystem or Hadoop stack.

3.3.4. MLlib (Machine Learning Library): MLlib is a distributed machine learning framework on top of Spark, owing to the distributed memory-based Spark architecture. According to benchmarks done by the MLlib developers against Alternating Least Squares (ALS) implementations, Spark MLlib is nine times as fast as the Hadoop disk-based version of Apache Mahout (before Mahout gained a Spark interface).
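As an illustration of the MLlib usage mentioned above, here is a minimal sketch of training an ALS recommender with the RDD-based MLlib API; the toy ratings and the local master are assumptions for illustration, not from the paper.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.{ALS, Rating}

object ALSSketch {
  def main(args: Array[String]): Unit = {
    // "local[*]" is a placeholder; on a cluster this would be the standalone
    // or YARN master described in sections 3.2.1 and 3.2.2.
    val sc = new SparkContext(new SparkConf().setAppName("ALSSketch").setMaster("local[*]"))

    // Toy (user, product, rating) triples; a real job would load these from HDFS.
    val ratings = sc.parallelize(Seq(
      Rating(1, 10, 4.0), Rating(1, 20, 3.0),
      Rating(2, 10, 5.0), Rating(2, 30, 1.0)
    ))

    // Train a matrix factorization model with ALS (rank, iterations, regularization).
    val model = ALS.train(ratings, 5, 10, 0.01)

    // Predict how user 1 would rate product 30.
    println(model.predict(1, 30))

    sc.stop()
  }
}
```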
3.3.5. GraphX: GraphX is a distributed graph-processing framework on top of Spark. It provides an API for expressing graph computations that can model user-defined graphs by using the Pregel abstraction API. It also provides an optimized runtime for this abstraction.

3.4. Importance of Resilient Distributed Datasets (RDD) in Apache Spark

Resilient Distributed Datasets (RDD) is the fundamental data structure of Spark. It is an immutable distributed collection of objects. Each dataset in an RDD is divided into logical partitions, which may be computed on different nodes of the cluster. RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes. Formally, an RDD is a read-only, partitioned collection of records. RDDs can be created through deterministic operations on either data on stable storage or other RDDs. An RDD is a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared file system, HDFS, HBase, or any data source offering a Hadoop InputFormat. Spark makes use of the concept of RDD to achieve faster and more efficient MapReduce operations. Let us first discuss how MapReduce operations take place and why they are not so efficient.
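The two ways of creating RDDs described above can be sketched as follows; the HDFS path and the local master are placeholders, not from the paper.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RDDCreationSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("RDDCreation").setMaster("local[*]"))

    // 1) Parallelize an existing collection in the driver program.
    val fromCollection = sc.parallelize(Seq(1, 2, 3, 4, 5))

    // 2) Reference a dataset in an external storage system (placeholder HDFS path).
    val fromStorage = sc.textFile("hdfs:///data/records.txt")

    // RDDs are immutable: transformations return new RDDs, evaluated per partition.
    println(fromCollection.map(_ * 2).collect().mkString(","))
    println("lines in external dataset: " + fromStorage.count())

    sc.stop()
  }
}
```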
3.5. HDFS Architecture

Given below is the architecture of a Hadoop File System. HDFS follows the master-slave architecture and it has the following elements.

Namenode: The namenode is the commodity hardware that contains the GNU/Linux operating system and the namenode software. It is software that can be run on commodity hardware. The system having the namenode acts as the master server and it does the following tasks: 1. Manages the file system namespace. 2. Regulates clients' access to files. It also executes file system operations such as renaming, closing, and opening files and directories.

Datanode: The datanode is commodity hardware having the GNU/Linux operating system and datanode software. For every node (commodity hardware/system) in a cluster, there will be a datanode. These nodes manage the data storage of their system. Datanodes perform read-write operations on the file systems, as per client request. They also perform operations such as block creation, deletion, and replication according to the instructions of the namenode.

Block: Generally the user data is stored in the files of HDFS. The file in a file system is divided into one or more segments and/or stored in individual data nodes. These file segments are called blocks. In other words, the minimum amount of data that HDFS can read or write is called a block. The default block size is 64 MB, but it can be increased as needed by changing the HDFS configuration.

3.6. Goals of HDFS

Fault detection and recovery: Since HDFS includes a large number of commodity hardware components, failure of components is frequent. Therefore HDFS should have mechanisms for quick and automatic fault detection and recovery.
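To relate the namenode and block concepts of Section 3.5 to code, here is a minimal sketch using the Hadoop FileSystem API to inspect a file's block size, replication factor, and block locations; the file path is a placeholder, not from the paper.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsBlockInfoSketch {
  def main(args: Array[String]): Unit = {
    // Uses the HDFS settings found on the classpath (core-site.xml / hdfs-site.xml).
    val fs = FileSystem.get(new Configuration())

    // Placeholder path; any existing HDFS file would do.
    val status = fs.getFileStatus(new Path("/data/input.txt"))
    println(s"block size:  ${status.getBlockSize} bytes")
    println(s"replication: ${status.getReplication}")

    // The namenode reports which datanodes hold each block of the file.
    for (loc <- fs.getFileBlockLocations(status, 0, status.getLen)) {
      println(s"offset ${loc.getOffset}, hosts: ${loc.getHosts.mkString(", ")}")
    }

    fs.close()
  }
}
```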
The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). Reduce, on the other hand, takes the output from a map as an input and combines the data tuples into a smaller set of tuples. In MapReduce, the data is distributed over the cluster and processed.
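As a small worked illustration of this key/value flow, the sketch below uses plain Scala collections for clarity only; a real job would distribute these steps across the cluster.

```scala
object MapReduceFlowSketch {
  def main(args: Array[String]): Unit = {
    val lines = Seq("to be or not", "to be")

    // Map: break each element into (key, value) tuples.
    val tuples = lines.flatMap(_.split(" ")).map(word => (word, 1))
    // tuples: (to,1), (be,1), (or,1), (not,1), (to,1), (be,1)

    // Reduce: combine the tuples for each key into a smaller set of tuples.
    val reduced = tuples.groupBy(_._1).map { case (word, ones) => (word, ones.map(_._2).sum) }
    // reduced: (to,2), (be,2), (or,1), (not,1)

    println(reduced.mkString(", "))
  }
}
```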
Spark claims to process data 100x faster than MapReduce, and 10x faster when working from disk. Here are results from a survey taken on Spark by Typesafe to better understand the trends and the growing demand for Spark.
REFERENCES: