Apache Spark Features

Apache Spark is an open-source cluster computing framework that provides fault tolerance, dynamic operations, and lazy evaluation. It allows for real-time stream processing and runs up to 100x faster than Hadoop for in-memory computing and up to 10x faster for disk-based operations. Spark supports multiple programming languages, integrates with Hadoop, and is cost-efficient, with no licensing fees.


Module 4: Features of Apache Spark


Fault Tolerance
Dynamic Nature
Lazy Evaluation
Real-Time Stream Processing
Speed
Reusability
Advanced Analytics
In-Memory Computing
Supporting Multiple Languages
Integrated with Hadoop
Cost Efficient

Fault Tolerance: Apache Spark is designed to handle worker node failures. It achieves this fault tolerance by using a DAG (Directed Acyclic Graph) and RDDs (Resilient Distributed Datasets). The DAG records the lineage of all the transformations and actions needed to complete a task, so in the event of a worker node failure the same results can be reproduced by rerunning the steps from the existing DAG.

Dynamic Nature: Spark offers over 80 high-level operators that make it easy to build parallel applications.

Lazy Evaluation: Spark does not evaluate any transformation immediately; all transformations are lazily evaluated. Transformations are added to the DAG, and the final computation or result becomes available only when an action is called. This gives Spark the ability to make optimization decisions, because all the transformations are visible to the Spark engine before any action is performed.

Real-Time Stream Processing: Spark Streaming brings Apache Spark's language-integrated API to stream processing, letting you write streaming jobs the same way you write batch jobs.

Speed: Spark enables applications running on Hadoop to run up to 100x faster in memory and up to 10x faster on disk. Spark achieves this by minimising disk read/write operations for intermediate results: it stores them in memory and performs disk operations only when essential. It relies on the DAG, a query optimizer and a highly optimized physical execution engine.
Reusability: Spark code can be reused for batch processing, for joining streaming data against historical data, and for running ad-hoc queries on streaming state.

Advanced Analytics: Apache Spark has rapidly become the de facto standard for big data processing and data science across multiple industries. Spark provides both machine learning (MLlib) and graph processing (GraphX) libraries, which companies across sectors leverage to tackle complex problems, all using the power of Spark and highly scalable clustered computers. Databricks provides an Advanced Analytics platform built on Spark.

In-Memory Computing: Unlike Hadoop MapReduce, Apache Spark is capable of processing tasks in memory and is not required to write intermediate results back to disk. This feature gives Spark a massive speed advantage. Over and above this, Spark can also cache intermediate results so they can be reused in subsequent iterations. This gives Spark an added performance boost for any iterative or repetitive process where results from one step are used later, or where a common dataset is shared across multiple tasks.

Supporting Multiple Languages: Spark comes with built-in multi-language support, with most of its APIs available in Java, Scala, Python and R, plus advanced features in R for data analytics. Spark also ships with Spark SQL, which offers SQL-like querying, so SQL developers find it very easy to use and the learning curve is greatly reduced.

Integrated with Hadoop: Apache Spark integrates very well with the Hadoop file system, HDFS. It supports multiple file formats such as Parquet, JSON, CSV, ORC and Avro, and Hadoop can easily be leveraged with Spark as an input data source or destination.

Cost Efficient: Apache Spark is open-source software, so it carries no licensing fee; users only need to account for hardware costs. Apache Spark also reduces many other costs, because stream processing, machine learning and graph processing come built in. Spark has no vendor lock-in, which makes it easy for organizations to pick and choose Spark features to suit their use case.
