Introduction To Spark

Spark is a fast and general engine for large-scale data processing. It makes distributed programming easy by providing a scalable, fault-tolerant framework and programming paradigm. Spark optimizes performance by using in-memory caching and lazy evaluation of operations in a DAG. It supports programming in Scala, Java, Python and R. Spark can read from and write to different data sources and formats like HDFS, Cassandra, S3 and HBase. It is primarily written in Scala.


Lecture 1

Introduction to Spark
- fast and general engine for large-scale data processing
- but distributed programming is much more complex than single-node: data must be partitioned
across machines, sharing data between nodes adds latency, and the chance of failure grows with
cluster size
- Spark makes distributed programming easy: it is scalable, fault-tolerant, and provides a
programming paradigm that makes parallel code easy to write
- Spark is lightning fast because it uses in-memory caching and a DAG-based processing engine
[DAG: operations are recorded in a graph and evaluated lazily; the DAG records operations
without immediately executing them; evaluation happens only when the user asks for a result]
- Spark optimizes the DAG pipeline: it passes data directly to the next operation without
writing intermediate results back to storage
- the Spark programming model uses expressive languages like Scala, Python and Java
- an interactive shell is available for Python and Scala
- Spark collapses the data science pipeline: ingestion, processing and analytics can run in one
framework
- can read from and write to different data sources and formats (HDFS, Cassandra, S3, HBase)
- Spark is mostly written in Scala, with some Java and Python
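
The lazy, DAG-style evaluation described above can be illustrated with an analogy in plain Python (this is not Spark code): generator pipelines likewise record operations and run nothing until a result is requested.

```python
# Analogy for lazy DAG evaluation (plain Python, not Spark):
# a generator pipeline records operations and executes only on demand.
calls = []

def load():
    # stands in for reading a partition of data from storage
    for i in range(5):
        calls.append(i)          # record that this element was actually produced
        yield i

pipeline = (x * x for x in load())   # "transformation": recorded, not executed
assert calls == []                   # nothing has run yet

result = list(pipeline)              # "action": triggers the whole pipeline
assert result == [0, 1, 4, 9, 16]
assert calls == [0, 1, 2, 3, 4]      # data was touched only once, at the end
```

As in Spark's optimized pipeline, each element flows straight through all operations without intermediate results being materialized anywhere.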

Use Cases of Spark


- fraud detection: Spark Streaming and ML
- network intrusion detection: Spark Streaming and ML
- customer segmentation and personalization: Spark SQL, ML
- social media sentiment analysis: Spark Streaming, Spark SQL, Stanford's CoreNLP wrapper
- real-time ad targeting
- predictive healthcare
- ex. Uber: uses Spark Streaming and Spark SQL for ETL, Spark MLlib and GraphX for
advanced analytics
- ex. Netflix: uses Spark Streaming in the AWS cloud, Spark GraphX for its recommender system
- ex. Pinterest: uses Spark Streaming, Spark SQL and the MemSQL Spark connector for real-time
analytics, Spark MLlib for machine learning
- ex. ADAM: uses Spark on Amazon EMR
- ex. Yahoo image and speech recognition: deep learning with CaffeOnSpark

Quiz
1) Which programming APIs does Spark support?

Explanation
Apache Spark provides programming APIs in Python, Scala, Java and R.

2) How does Spark make distributed processing simple?


Explanation
Spark provides a distributed and parallel processing framework. It provides scalability and
fault tolerance, and a programming paradigm that makes it easy to write code in a parallel
manner.

3) Choose the different types of data sources that Spark can read from and write to.
Explanation
Spark can read from and write to different data formats and data sources, including HDFS,
Cassandra, S3 and HBase. It can also access relational databases and traditional BI tools
through a server mode that provides standard JDBC and ODBC connectivity.

4) Which language is Spark written in primarily?


Correct answer
Scala

Explanation
Spark is primarily written in Scala.

5) Contributions to Spark's source code can only be made by employees of authorized
organizations.

Correct answer
False

Explanation
Apache Spark is open source, which implies that anybody interested can make a contribution to
Spark's source code. Code from contributors is reviewed thoroughly by Spark committers and
committed to the Spark source code base if approved.

6) Spark code executes in a JVM (Java virtual machine)


Correct answer
True

Explanation
Spark is written in Scala, and Scala runs in a JVM; hence Spark runs in a JVM.

7) Hadoop is better suited than Spark for iterative machine learning algorithms.
Correct answer
False

Explanation
Spark's in-memory abstractions allow caching of data sets, which speeds up iterative machine
learning algorithms that need to process a data set multiple times. Hadoop is slower because it
writes to disk on every iteration of a machine learning algorithm.
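
A toy model in plain Python (not Spark code) of why caching matters for iterative algorithms: without a cache the input is re-read on every iteration, with one it is read exactly once.

```python
# Toy model of iterative processing with and without a cache.
reads = {"count": 0}

def read_dataset():
    reads["count"] += 1          # stands in for an expensive disk read
    return list(range(100))

# MapReduce-style: every iteration goes back to storage.
for _ in range(10):
    total = sum(read_dataset())
assert reads["count"] == 10

# Spark-style: load once, cache in memory, iterate over the cached copy
# (analogous to calling .cache() on an RDD or DataFrame).
reads["count"] = 0
cached = read_dataset()
for _ in range(10):
    total = sum(cached)
assert reads["count"] == 1       # storage touched only once for 10 iterations
```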

8) Which of the following application types can Spark run in addition to batch-processing jobs?
Explanation
Spark provides APIs that can perform batch processing, stream processing, machine learning,
graph processing and SQL processing, all within a single application in a single environment.

9) Which of the following is NOT a characteristic of Spark?


Correct answer
Has its own file system
Explanation
Spark is not designed to be a data store and hence does not have its own file system. It can
integrate with a wide variety of data sources to read and write from.

10) Apache Spark set a record in large-scale sorting of data.


Correct answer
True

Explanation
Spark won the Daytona GraySort contest in 2014, sorting a petabyte 3 times faster and using
10 times less hardware than Hadoop's MapReduce.

11) Spark originated as a research project at Harvard University.


Correct answer
False

Explanation
Spark originated as a research project in 2009 in the UC Berkeley AMPLab, motivated by
MapReduce and the need to apply machine learning in a scalable fashion.

12) An interactive shell is available in Spark for which languages?


Correct answer
Python, Scala

Explanation
Spark offers a command-line interface, or interactive shell, for Scala and Python only at this
time.

13) Spark is faster than MapReduce both for in-memory and on-disk computations. How many
times faster was Spark recorded to be for in-memory computations?
Correct answer
100 times faster

Explanation
Spark was found to be 100 times faster than Hadoop's MapReduce for in-memory computations.

14) What year was Apache Spark made an open source technology?
Correct answer
2010

Explanation
Apache Spark was created in 2009 at UC Berkeley, open-sourced in 2010 and transferred to the
Apache Software Foundation in 2013.

15) Operational and debugging tools from the Java stack are available for Spark programmers
Correct answer
True

Explanation
Since Spark runs on a JVM, the operational and debugging tools from the Java stack are
available for performance monitoring and tuning of a Spark application.
