
Presentation on Apache Spark

By:
B V S Mridula (1039060)
Milind Baluni (1003192)
G. Ganesh ()
Dilip Payra ()
Sameer Nayak ()

Introduction to Apache Spark

It is a framework for performing general data analytics on a distributed
computing cluster like Hadoop.
It provides in-memory computation for increased speed of data processing
over MapReduce.
It runs on top of an existing Hadoop cluster and can access the Hadoop
data store (HDFS).
It can also process structured data in Hive and streaming data from HDFS,
Flume, Kafka, and Twitter.

High-Productivity Language Support

Native support for multiple languages with identical APIs
Use of closures, iterations, and other common language constructs to
minimize code
Unified API for batch and streaming

Python:

lines = sc.textFile(...)
lines.filter(lambda s: "ERROR" in s).count()

Scala:

val lines = sc.textFile(...)
lines.filter(s => s.contains("ERROR")).count()

Java:

JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  Boolean call(String s) {
    return s.contains("ERROR");
  }
}).count();

Why is Apache Spark used?

Apache Spark is best for performing iterative jobs like machine learning.
Spark keeps working data in memory as RDDs (Resilient Distributed
Datasets).
Once a dataset is loaded into memory, it does not have to be reloaded
from disk for every pass over the data, which gives a tremendous
increase in speed.
Frequently used datasets can be cached in memory and reused across
computations.
Programs run up to 100x faster than Hadoop MapReduce in memory, or
10x faster on disk.
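
The sketch below illustrates this caching behavior with the RDD API. It is
a minimal example, not the deck's own code: the HDFS path, the app name,
and the ten-iteration loop are illustrative assumptions.

from pyspark import SparkContext

sc = SparkContext(appName="IterativeCachingDemo")

# Load the data once and keep it in memory; the path is a
# hypothetical placeholder.
values = sc.textFile("hdfs:///data/values.txt") \
           .map(lambda line: float(line)) \
           .cache()

# The first action reads from HDFS; the later iterations reuse the
# in-memory copy instead of rereading the input from disk.
total = 0.0
for step in range(10):
    total += values.sum()

print(total)
sc.stop()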

Spark libraries

Spark SQL:
Spark's module for working with structured data.
Spark Streaming:
Makes it easy to build scalable, fault-tolerant streaming applications.
MLlib:
Apache Spark's scalable machine learning library.
GraphX:
Apache Spark's API for graphs and graph-parallel computation.
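
As a minimal sketch of what a Spark Streaming application looks like, the
word count below reads text from a TCP socket in one-second micro-batches;
the host, port, and app name are illustrative assumptions.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="StreamingWordCount")
ssc = StreamingContext(sc, 1)  # 1-second micro-batches

# Count words arriving on a socket; localhost:9999 is a placeholder.
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()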

Is Apache Spark going to replace Hadoop?

Hadoop essentially consists of a MapReduce engine and a file system
(HDFS), whereas Spark is a framework that executes jobs, so in practice
Spark can only replace the MapReduce phase of the Hadoop ecosystem.
Spark was mainly designed to run on top of Hadoop so as to minimize job
execution time.
Spark is an alternative to the traditional MapReduce model, which worked
only in batch mode; Spark supports both batch and real-time processing.
Spark mainly utilizes the primary memory of the system to provide
efficient output, so it requires high-end machines to execute jobs.
Hadoop, on the other hand, can easily run on commodity hardware.
Spark's way of handling fault tolerance, recomputing lost data from RDD
lineage rather than replicating it, is very fast compared to Hadoop's;
it minimizes network I/O while still guaranteeing fault tolerance.

Thank You!
