Introduction To Spark
- fast and general engine for large-scale data processing
- but distributed programming is much more complex than single-node programming; data must be
partitioned across machines, which increases latency when data is shared, and the chances of
failure also increase
- spark makes distributed programming easy: it is scalable, fault-tolerant, and provides a
programming paradigm that makes it easy to write code
- spark is lightning fast because it uses in-memory caching and a DAG-based processing engine
[DAG: operations are put in a graph and evaluated lazily; the DAG records operations without
necessarily evaluating them; evaluation happens only when the user asks for a result]
- spark optimizes the DAG pipeline; it passes data directly to the next operation without
rewriting intermediate data to storage
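The lazy-DAG idea above can be sketched in plain Python using generators, which are also lazy. This is a minimal illustration of the concept, not the Spark API: `map` and `filter` only record the pipeline, and nothing executes until an "action" (`sum`) demands a result, with each element flowing through every stage in one pass.

```python
# Minimal pure-Python sketch of lazy, DAG-style evaluation (NOT the Spark API).
# map/filter build a lazy pipeline; evaluation happens only when an action
# (here, sum) asks for a result, and data streams stage-to-stage without
# being written back to intermediate storage.

numbers = range(1, 1_000_001)

# "Transformations": recorded, not evaluated yet.
squared = map(lambda x: x * x, numbers)
evens = filter(lambda x: x % 2 == 0, squared)

# "Action": forces evaluation of the whole pipeline in a single pass.
total = sum(evens)
print(total)
```

In Spark the same shape appears as transformations (`map`, `filter`) building the DAG and actions (`count`, `collect`, `reduce`) triggering execution.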
- spark programming model uses expressive languages like Scala, Python, and Java
- interactive shell is available for Python and Scala
- spark collapses the data science pipeline into a single environment
- can read from and write to different data sources and formats (HDFS, Cassandra, S3, HBase)
- spark is mostly written in Scala, with some Java and Python
Quiz
1) Which programming APIs does Spark support?
Explanation
Apache Spark provides programming APIs in Python, Scala, Java, and R.
3) Choose the different types of data sources that Spark can read from and write to.
Explanation
Spark can read from and write to different data formats and data sources, including HDFS,
Cassandra, S3, and HBase. It can also access relational databases and traditional BI tools
through a server mode that provides standard JDBC and ODBC connectivity.
4) In which programming language is Spark primarily written?
Explanation
Spark is primarily written in Scala.
5) Anyone interested can contribute to Spark's source code.
Explanation
Apache Spark is open source, which implies that anybody interested can make a contribution to
Spark's source code. Code from contributors is reviewed thoroughly by Spark committers and
committed to the Spark source code base if approved.
6) Spark runs in a JVM.
Explanation
Spark is written in Scala, and Scala runs in a JVM. Hence, Spark runs in a JVM.
7) Hadoop is better suited than Spark for iterative machine learning algorithms.
Correct answer
False
Explanation
Spark's in-memory abstractions allow caching of data sets, which speeds up iterative machine
learning algorithms that need to process the same data set multiple times. Hadoop is slower
because it writes intermediate results to disk in every iteration.
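The caching argument above can be made concrete with a small pure-Python sketch (again, not the Spark API; `load_dataset` is a hypothetical stand-in for an expensive read). Without caching, every iteration pays the full load cost, which is roughly what MapReduce's per-iteration disk writes amount to; caching once, as with Spark's `rdd.cache()`, pays it only once.

```python
# Pure-Python sketch (NOT the Spark API) of why caching helps iterative
# algorithms. load_dataset() is a hypothetical stand-in for an expensive
# read from disk or a recomputed lineage.

load_count = 0

def load_dataset():
    global load_count
    load_count += 1            # count how often the expensive load runs
    return [1.0, 2.0, 3.0, 4.0]

# Without caching: every iteration reloads the data.
for _ in range(5):
    data = load_dataset()
    mean = sum(data) / len(data)
print("loads without cache:", load_count)   # 5 loads

# With caching (analogous to Spark's rdd.cache()): load once, reuse in memory.
load_count = 0
cached = load_dataset()
for _ in range(5):
    mean = sum(cached) / len(cached)
print("loads with cache:", load_count)      # 1 load
```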
8) Which of the following application types can Spark run in addition to batch-processing jobs?
Explanation
Spark provides APIs that can perform batch processing, stream processing, machine learning,
graph processing, and SQL processing, all in a single application in a single environment.
Explanation
Spark won the Daytona GraySort contest in 2014, sorting a petabyte three times faster while
using ten times less hardware than Hadoop's MapReduce.
Explanation
Spark originated as a research project in 2009 at UC Berkeley's AMPLab, motivated by
MapReduce and the need to apply machine learning in a scalable fashion.
Explanation
Spark offers a command-line interface, or interactive shell, for Scala and Python only at this
time.
13) Spark is faster than MapReduce both for in-memory and on-disk computations. How many
times faster was Spark recorded to be for in-memory computations?
Correct answer
100 times faster
Explanation
Spark was found to be 100 times faster than Hadoop's MapReduce for in-memory computations.
14) What year was Apache Spark made an open source technology?
Correct answer
2010
Explanation
Apache Spark was created in 2009 at UC Berkeley, open sourced in 2010, and transferred to the
Apache Software Foundation in 2013.
15) Operational and debugging tools from the Java stack are available for Spark programmers.
Correct answer
True
Explanation
Since Spark runs on a JVM, the operational and debugging tools from the Java stack are
available for performance monitoring and tuning of a Spark application.