Introduction To Spark
- fast and general engine for large-scale data processing
- but distributed programming is much more complex than single-node programming; data must be
partitioned across machines, which increases latency when data is shared, and the chances of
failure also increase
- spark makes distributed programming easy: it is scalable, fault-tolerant, and provides a
programming paradigm that makes it easy to write code
- spark is lightning fast because it uses in-memory caching and a DAG-based processing engine
[DAG: operations are put in a graph and evaluated lazily; the DAG records operations without
necessarily evaluating them; evaluation happens only when the user asks for a result]
- spark optimizes the DAG pipeline; it passes data directly to the next operation without
rewriting intermediate data to storage
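The lazy-DAG idea above can be sketched in plain Python using generators, which are also lazy. This is a minimal illustration of the concept, not the Spark API: `map` and `filter` only record the pipeline, and nothing executes until an "action" (`sum`) demands a result, with each element flowing through every stage in one pass.

```python
# Minimal pure-Python sketch of lazy, DAG-style evaluation (NOT the Spark API).
# map/filter build a lazy pipeline; evaluation happens only when an action
# (here, sum) asks for a result, and data streams stage-to-stage without
# being written back to intermediate storage.

numbers = range(1, 1_000_001)

# "Transformations": recorded, not evaluated yet.
squared = map(lambda x: x * x, numbers)
evens = filter(lambda x: x % 2 == 0, squared)

# "Action": forces evaluation of the whole pipeline in a single pass.
total = sum(evens)
print(total)
```

In Spark the same shape appears as transformations (`map`, `filter`) building the DAG and actions (`count`, `collect`, `reduce`) triggering execution.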
- spark programming model uses expressive languages like Scala, Python, and Java
- interactive shell is available for Python and Scala
- spark collapses the data science pipeline into a single environment
- can read from and write to different data sources and formats (HDFS, Cassandra, S3, HBase)
- spark is mostly written in Scala, with some Java and Python
Quiz
1) Which programming APIs does Spark support?
Explanation
Apache Spark provides programming APIs in Python, Scala, Java, and R.
3) Choose the different types of data sources that Spark can read from and write to.
Explanation
Spark can read from and write to different data formats and data sources, including HDFS,
Cassandra, S3, and HBase. It can also access relational databases and traditional BI tools
through a server mode that provides standard JDBC and ODBC connectivity.
4) In which programming language is Spark primarily written?
Explanation
Spark is primarily written in Scala.
5) Anyone interested can contribute to Spark's source code.
Explanation
Apache Spark is open source, which implies that anybody interested can make a contribution to
Spark's source code. Code from contributors is reviewed thoroughly by Spark committers and
committed to the Spark source code base if approved.
6) Spark runs in a JVM.
Explanation
Spark is written in Scala, and Scala runs in a JVM. Hence, Spark runs in a JVM.
7) Hadoop is better suited than Spark for iterative machine learning algorithms.
Correct answer
False
Explanation
Spark's in-memory abstractions allow caching of data sets, which speeds up iterative machine
learning algorithms that need to process the same data set multiple times. Hadoop is slower
because it writes intermediate results to disk in every iteration.
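The caching argument above can be made concrete with a small pure-Python sketch (again, not the Spark API; `load_dataset` is a hypothetical stand-in for an expensive read). Without caching, every iteration pays the full load cost, which is roughly what MapReduce's per-iteration disk writes amount to; caching once, as with Spark's `rdd.cache()`, pays it only once.

```python
# Pure-Python sketch (NOT the Spark API) of why caching helps iterative
# algorithms. load_dataset() is a hypothetical stand-in for an expensive
# read from disk or a recomputed lineage.

load_count = 0

def load_dataset():
    global load_count
    load_count += 1            # count how often the expensive load runs
    return [1.0, 2.0, 3.0, 4.0]

# Without caching: every iteration reloads the data.
for _ in range(5):
    data = load_dataset()
    mean = sum(data) / len(data)
print("loads without cache:", load_count)   # 5 loads

# With caching (analogous to Spark's rdd.cache()): load once, reuse in memory.
load_count = 0
cached = load_dataset()
for _ in range(5):
    mean = sum(cached) / len(cached)
print("loads with cache:", load_count)      # 1 load
```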
8) Which of the following application types can Spark run in addition to batch-processing jobs?
Explanation
Spark provides APIs that can perform batch processing, stream processing, machine learning,
graph processing, and SQL processing, all in a single application in a single environment.
Explanation
Spark won the Daytona GraySort contest in 2014, sorting a petabyte three times faster while
using ten times less hardware than Hadoop's MapReduce.
Explanation
Spark originated as a research project in 2009 at UC Berkeley's AMPLab, motivated by
MapReduce and the need to apply machine learning in a scalable fashion.
Explanation
Spark offers a command-line interface, or interactive shell, for Scala and Python only at this
time.
13) Spark is faster than MapReduce both for in-memory and on-disk computations. How many
times faster was Spark recorded to be for in-memory computations?
Correct answer
100 times faster
Explanation
Spark was found to be 100 times faster than Hadoop's MapReduce for in-memory computations.
14) What year was Apache Spark made an open source technology?
Correct answer
2010
Explanation
Apache Spark was created in 2009 at UC Berkeley, open sourced in 2010, and transferred to the
Apache Software Foundation in 2013.
15) Operational and debugging tools from the Java stack are available for Spark programmers.
Correct answer
True
Explanation
Since Spark runs on a JVM, the operational and debugging tools from the Java stack are
available for performance monitoring and tuning of a Spark application.