Apache Spark Features
Fault Tolerance: Apache Spark is designed to handle worker node failures. It achieves this
fault tolerance by using the DAG (Directed Acyclic Graph) and RDDs (Resilient Distributed
Datasets). The DAG records the lineage of all the transformations and actions needed to
complete a task, so in the event of a worker node failure, the same results can be reproduced
by rerunning the steps recorded in the DAG, as the sketch below illustrates.
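To make the lineage concrete, here is a minimal PySpark sketch that prints the lineage Spark
records for an RDD; the input file name events.txt is a placeholder. If a partition is lost
with a failed worker, Spark recomputes it by replaying exactly these recorded steps.

from pyspark import SparkContext

sc = SparkContext("local[*]", "lineage-demo")
rdd = (sc.textFile("events.txt")                      # hypothetical input file
         .map(lambda line: line.split(","))
         .filter(lambda fields: len(fields) > 1))

# toDebugString() prints the recorded lineage (the DAG of transformations).
print(rdd.toDebugString().decode("utf-8"))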
Dynamic nature: Spark offers over 80 high-level operators that make it easy to build parallel
applications.
Lazy Evaluation: Spark does not evaluate any transformation immediately; all
transformations are lazily evaluated. Transformations are added to the DAG, and the final
computation or results become available only when an action is called. This gives Spark the
ability to make optimization decisions, since all the transformations are visible to the Spark
engine before any action is performed, as the sketch below shows.
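A minimal PySpark sketch of this behaviour (names are illustrative): the map and filter calls
only record work in the DAG, and nothing executes until count() is called.

from pyspark import SparkContext

sc = SparkContext("local[*]", "lazy-demo")
nums = sc.parallelize(range(1, 1000001))

squares = nums.map(lambda x: x * x)           # transformation: recorded, not run
evens = squares.filter(lambda x: x % 2 == 0)  # transformation: still nothing runs

print(evens.count())  # action: Spark now optimizes and executes the whole chain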
Speed: Spark enables applications running on Hadoop to run up to 100x faster in memory and
up to 10x faster on disk. Spark achieves this by minimising disk read/write operations for
intermediate results: it stores them in memory and performs disk operations only when
essential. Spark accomplishes this using the DAG, a query optimizer and a highly optimized
physical execution engine.
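Keeping an intermediate result in memory can also be requested explicitly with cache(); here
is a minimal sketch, assuming a hypothetical HDFS log file. The first action materializes the
filtered result in memory, and later actions reuse it instead of rereading from disk.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()
logs = spark.read.text("hdfs:///data/app.log")   # placeholder path

errors = logs.filter(logs.value.contains("ERROR")).cache()

print(errors.count())  # first action: reads from disk and fills the cache
errors.show(10)        # later actions reuse the in-memory result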
Reusability: Spark code can be reused for batch processing, for joining streaming data against
historical data, and for running ad-hoc queries on streaming state, as in the sketch below.
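A minimal sketch of this reuse, assuming hypothetical HDFS paths and an orders dataset with
an amount column: the same transformation function is applied unchanged to a batch
DataFrame and to a Structured Streaming DataFrame.

from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("reuse-demo").getOrCreate()

def high_value(orders: DataFrame) -> DataFrame:
    # Identical business logic for batch and streaming inputs.
    return orders.filter(col("amount") > 100)

batch = high_value(spark.read.json("hdfs:///orders/history"))      # batch
stream = high_value(spark.readStream.schema(batch.schema)
                         .json("hdfs:///orders/incoming"))         # streaming
stream.writeStream.format("console").start()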
Advanced Analytics: Apache Spark has rapidly become the de facto standard for big data
processing and data science across multiple industries. Spark provides both machine learning
(MLlib) and graph processing (GraphX) libraries, which companies across sectors leverage to
tackle complex problems, all done easily with the power of Spark and highly scalable
clusters. Databricks provides an advanced analytics platform built on Spark.
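As a small taste of the machine learning library, here is a minimal MLlib sketch that clusters
a few made-up points with k-means; the data and parameters are purely illustrative.

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()
df = spark.createDataFrame([(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (8.5, 9.0)],
                           ["x", "y"])

# Assemble the raw columns into the feature vector MLlib expects.
features = VectorAssembler(inputCols=["x", "y"], outputCol="features").transform(df)
model = KMeans(k=2, seed=42).fit(features)
print(model.clusterCenters())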
Supporting Multiple Languages: Spark comes with built-in multi-language support, with
most of its APIs available in Java, Scala, Python and R, and there are advanced features
available with the R language for data analytics. Spark also comes with Spark SQL, which
offers a SQL-like interface, so SQL developers find it very easy to use and the learning curve
is greatly reduced; a small example follows.
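A minimal Spark SQL sketch (table and column names are illustrative): a DataFrame is
registered as a temporary view and then queried with plain SQL.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").getOrCreate()
people = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])
people.createOrReplaceTempView("people")

spark.sql("SELECT name FROM people WHERE age > 40").show()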
Integrated with Hadoop: Apache Spark integrates very well with the Hadoop file system, HDFS.
It supports multiple file formats such as Parquet, JSON, CSV, ORC and Avro, and Hadoop can
easily be leveraged with Spark as an input data source or destination.
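A minimal sketch of this integration, assuming placeholder HDFS paths: Spark reads a CSV
file from HDFS and writes it back as Parquet.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-demo").getOrCreate()

# Read CSV from HDFS (placeholder path), then write it back as Parquet.
df = spark.read.option("header", True).csv("hdfs:///input/sales.csv")
df.write.mode("overwrite").parquet("hdfs:///output/sales_parquet")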
Cost efficient: Apache Spark is open source software, so it has no licensing fee associated
with it; users only have to account for the hardware cost. Apache Spark also reduces many
other costs, since stream processing, machine learning and graph processing come built in.
Spark does not lock users in to any vendor, which makes it very easy for organizations to
pick and choose Spark features as per their use case.