Abstract: In the Big Data community, Apache Hadoop and Spark are gaining prominence in handling Big Data and analytics. Similarly, MapReduce has been seen as one of the key enabling approaches for large-scale query processing. These middleware are traditionally written with sockets and do not deliver the best performance on datacenters with modern high-performance networks. In this paper we investigate the characteristics of two file systems that support in-memory and heterogeneous storage, and discuss the impact of these two architectures on the performance and fault tolerance of Hadoop MapReduce and Spark applications. We present a complete methodology for evaluating MapReduce and Spark workloads on top of in-memory file systems and provide insights into the interactions of different system components while running these workloads.
Hadoop uses a brute-force access strategy that exposes a massively parallel processing infrastructure, while RDBMS solutions rely on optimized access routines such as indexes, together with read-ahead and write-behind techniques. Hence, Hadoop really only excels in situations where the data is unstructured to the point that no RDBMS optimization techniques can be applied to support the execution of the queries.

Hadoop is essentially designed to efficiently handle large data volumes by connecting many commodity systems together so that they work as a single parallel unit. Enterprises are using Hadoop widely to analyze their data sets. The reason is that the Hadoop framework is based on a simple programming model (MapReduce) and it enables a computing solution that is scalable, flexible, fault-tolerant and cost-effective. Here, the main concern is to maintain speed in processing large datasets, in terms of the waiting time between queries and the waiting time to run a program. Spark was introduced by the Apache Software Foundation to speed up the Hadoop computational processing. Contrary to a common belief, Spark is not a modified version of Hadoop and is not, in general, dependent on Hadoop, because it has its own cluster management. Hadoop is only one of the ways to implement Spark. Spark uses Hadoop in two ways: one is storage and the second is processing. Since Spark has its own cluster management computation, it uses Hadoop for storage purposes only.

II. DATA PROCESSING AND MANAGEMENT ON MODERN CLUSTERS

Big Data processing techniques analyze big data sets at terabyte or even petabyte scale, tackling arbitrary BI use cases, while real-time stream processing is performed on the most current slice of data for data profiling to pick outliers, fraud transaction detection, security monitoring, etc. The toughest task, however, is to do fast (low latency) or real-time ad-hoc analytics on a complete big data set. It practically means you need to scan terabytes (or even more) of data within seconds. This is only possible when data is processed with high parallelism. In this section we present how data is actually processed and managed on various modern clusters, which has a substantial impact on designing and utilizing modern data management and processing systems. These systems are organized in multiple tiers: front-end data accessing and serving (online), e.g. MySQL, HBase, and back-end data analytics (offline), e.g. HDFS, MapReduce, Spark.
An in-memory data-processing framework performs iterative machine learning jobs and interactive data analytics; it is scalable and communication- and I/O-intensive, with wide dependencies between Resilient Distributed Datasets (RDDs), MapReduce-like shuffle operations to repartition RDDs, and sockets-based communication.

2.2. Cluster computing frameworks

MapReduce is one of the earliest and best known commodity cluster frameworks. MapReduce follows the functional programming model [8] and performs explicit synchronization across computational stages. MapReduce exposes a simple programming API in terms of map() and reduce() functions. Apache Hadoop [1] is a widely used open source implementation of MapReduce. The simplicity of MapReduce is attractive for users, but the framework has several limitations. Applications such as machine learning and graph analytics iteratively process the data, which means multiple rounds of computation are performed on the same data. In MapReduce, every job reads its input data, processes it, and then writes it back to HDFS. For the next job to consume the output of a previously run job, it has to repeat the read, process, and write cycle. For iterative algorithms, which want to read once and iterate over the data many times, the MapReduce model poses a significant overhead. To overcome the above limitations, in-memory execution systems such as Spark have been introduced.

Spark is another execution system. Like MapReduce, it works with the file system to distribute your data over the cluster and process that data in parallel. Apache Spark is a fast and general engine for large-scale data processing. It is based on Hadoop MapReduce and it extends the MapReduce model to efficiently use it for more types of computations, which include interactive queries and stream processing. The main feature of Spark is its in-memory cluster computing, which increases the processing speed of an application. Spark is designed to cover a wide range of workloads such as batch applications, iterative algorithms, interactive queries and streaming. Apart from supporting all these workloads in a single system, it reduces the management burden of maintaining separate tools.

3.1. Features of Apache Spark

3.1.1. Speed

Spark runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory computing.
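To make the map()/reduce() model of Section 2.2 and the in-memory caching behind these speedups concrete, the following is a minimal Spark sketch in Scala; the input path and the local master setting are placeholders for illustration, not taken from the paper.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    // Local master and input path are placeholders for illustration only.
    val sc = new SparkContext(new SparkConf().setAppName("WordCount").setMaster("local[*]"))

    // map(): break each line into (word, 1) tuples; reduceByKey(): combine tuples per key.
    val counts = sc.textFile("hdfs:///data/input.txt")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // cache() keeps the result in memory, so the two actions below reuse it
    // instead of repeating MapReduce's read/process/write-to-HDFS cycle.
    counts.cache()
    println("distinct words: " + counts.count())
    println("total words: " + counts.map(_._2.toLong).reduce(_ + _))

    sc.stop()
  }
}
```

An iterative algorithm would re-scan the cached RDD in a loop in the same way, which is exactly the case where chained MapReduce jobs pay the repeated HDFS read/write overhead described above.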
3.2.1. Standalone: Spark Standalone deployment means Spark occupies the place on top of HDFS (Hadoop Distributed File System) and space is allocated for HDFS explicitly. Here, Spark and MapReduce run side by side to cover all Spark jobs on the cluster.

3.2.2. Hadoop Yarn: Hadoop Yarn deployment means, simply, that Spark runs on Yarn without any pre-installation or root access required. It helps to integrate Spark into the Hadoop ecosystem or Hadoop stack.

3.3.4. MLlib (Machine Learning Library): MLlib is a distributed machine learning framework on top of Spark, owing to the distributed memory-based Spark architecture. According to benchmarks done by the MLlib developers against Alternating Least Squares (ALS) implementations, Spark MLlib is nine times as fast as the Hadoop disk-based version of Apache Mahout (before Mahout gained a Spark interface).
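As an illustration of the MLlib usage mentioned above, here is a minimal sketch of training an ALS recommender with the RDD-based MLlib API; the toy ratings and the local master are assumptions for illustration, not from the paper.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.{ALS, Rating}

object ALSSketch {
  def main(args: Array[String]): Unit = {
    // "local[*]" is a placeholder; on a cluster this would be the standalone
    // or YARN master described in sections 3.2.1 and 3.2.2.
    val sc = new SparkContext(new SparkConf().setAppName("ALSSketch").setMaster("local[*]"))

    // Toy (user, product, rating) triples; a real job would load these from HDFS.
    val ratings = sc.parallelize(Seq(
      Rating(1, 10, 4.0), Rating(1, 20, 3.0),
      Rating(2, 10, 5.0), Rating(2, 30, 1.0)
    ))

    // Train a matrix factorization model with ALS (rank, iterations, regularization).
    val model = ALS.train(ratings, 5, 10, 0.01)

    // Predict how user 1 would rate product 30.
    println(model.predict(1, 30))

    sc.stop()
  }
}
```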
3.3.5. GraphX: GraphX is a distributed graph-processing framework on top of Spark. It provides an API for expressing graph computations that can model user-defined graphs by using the Pregel abstraction API. It also provides an optimized runtime for this abstraction.

3.4. Importance of Resilient Distributed Datasets (RDD) in Apache Spark

Resilient Distributed Datasets (RDD) is the fundamental data structure of Spark. It is an immutable distributed collection of objects. Each dataset in an RDD is divided into logical partitions, which may be computed on different nodes of the cluster. RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes. Formally, an RDD is a read-only, partitioned collection of records. RDDs can be created through deterministic operations on either data on stable storage or other RDDs. An RDD is a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared file system, HDFS, HBase, or any data source offering a Hadoop InputFormat. Spark makes use of the concept of RDD to achieve faster and more efficient MapReduce operations. Let us first discuss how MapReduce operations take place and why they are not so efficient.
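The two ways of creating RDDs described above can be sketched as follows; the HDFS path and the local master are placeholders, not from the paper.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RDDCreationSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("RDDCreation").setMaster("local[*]"))

    // 1) Parallelize an existing collection in the driver program.
    val fromCollection = sc.parallelize(Seq(1, 2, 3, 4, 5))

    // 2) Reference a dataset in an external storage system (placeholder HDFS path).
    val fromStorage = sc.textFile("hdfs:///data/records.txt")

    // RDDs are immutable: transformations return new RDDs, evaluated per partition.
    println(fromCollection.map(_ * 2).collect().mkString(","))
    println("lines in external dataset: " + fromStorage.count())

    sc.stop()
  }
}
```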
3.5. HDFS Architecture

Given below is the architecture of a Hadoop File System. HDFS follows the master-slave architecture and it has the following elements.

Namenode: The namenode is the commodity hardware that contains the GNU/Linux operating system and the namenode software. It is software that can be run on commodity hardware. The system having the namenode acts as the master server and it does the following tasks: 1. Manages the file system namespace. 2. Regulates clients' access to files. It also executes file system operations such as renaming, closing, and opening files and directories.

Datanode: The datanode is commodity hardware having the GNU/Linux operating system and datanode software. For every node (commodity hardware/system) in a cluster, there will be a datanode. These nodes manage the data storage of their system. Datanodes perform read-write operations on the file systems, as per client request. They also perform operations such as block creation, deletion, and replication according to the instructions of the namenode.

Block: Generally the user data is stored in the files of HDFS. The file in a file system is divided into one or more segments and/or stored in individual data nodes. These file segments are called blocks. In other words, the minimum amount of data that HDFS can read or write is called a block. The default block size is 64 MB, but it can be increased as needed by changing the HDFS configuration.

3.6. Goals of HDFS

Fault detection and recovery: Since HDFS includes a large number of commodity hardware components, failure of components is frequent. Therefore HDFS should have mechanisms for quick and automatic fault detection and recovery.
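To relate the namenode and block concepts of Section 3.5 to code, here is a minimal sketch using the Hadoop FileSystem API to inspect a file's block size, replication factor, and block locations; the file path is a placeholder, not from the paper.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsBlockInfoSketch {
  def main(args: Array[String]): Unit = {
    // Uses the HDFS settings found on the classpath (core-site.xml / hdfs-site.xml).
    val fs = FileSystem.get(new Configuration())

    // Placeholder path; any existing HDFS file would do.
    val status = fs.getFileStatus(new Path("/data/input.txt"))
    println(s"block size:  ${status.getBlockSize} bytes")
    println(s"replication: ${status.getReplication}")

    // The namenode reports which datanodes hold each block of the file.
    for (loc <- fs.getFileBlockLocations(status, 0, status.getLen)) {
      println(s"offset ${loc.getOffset}, hosts: ${loc.getHosts.mkString(", ")}")
    }

    fs.close()
  }
}
```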
The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). Reduce, on the other hand, takes the output from a map as an input and combines the data tuples into a smaller set of tuples. In MapReduce, the data is distributed over the cluster and processed.
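As a small worked illustration of this key/value flow, the sketch below uses plain Scala collections for clarity only; a real job would distribute these steps across the cluster.

```scala
object MapReduceFlowSketch {
  def main(args: Array[String]): Unit = {
    val lines = Seq("to be or not", "to be")

    // Map: break each element into (key, value) tuples.
    val tuples = lines.flatMap(_.split(" ")).map(word => (word, 1))
    // tuples: (to,1), (be,1), (or,1), (not,1), (to,1), (be,1)

    // Reduce: combine the tuples for each key into a smaller set of tuples.
    val reduced = tuples.groupBy(_._1).map { case (word, ones) => (word, ones.map(_._2).sum) }
    // reduced: (to,2), (be,2), (or,1), (not,1)

    println(reduced.mkString(", "))
  }
}
```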
Spark claims to process data 100x faster than MapReduce, and 10x faster when working from disk. Here are results from a survey taken on Spark by Typesafe to better understand the trends and the growing demand for Spark.
REFERENCES: