Introduction To Apache Spark
Outline
• The Genesis of Spark
Reference:
• Chapter 1, “Learning Spark”, 2nd Edition. Authors: Jules S. Damji, Brooke Wenig,
Tathagata Das, Denny Lee. Publisher(s): O'Reilly Media, Inc. ISBN: 9781492050049
The Genesis of Spark
• Big Data and Distributed Computing at Google
o Creation of the Google File System (GFS), MapReduce (MR), and Bigtable to handle
massive amounts of data on the Internet
• Hadoop at Yahoo!
o The open-source community, especially Yahoo!, was also interested in these ideas
o GFS provided a blueprint for the Hadoop File System (HDFS)
o Donated to the Apache Software Foundation
o Shortcomings: difficult administration and management, complex operation, low fault
tolerance of MapReduce, slow MR jobs
The Genesis of Spark
• Spark was developed to address the issues Hadoop had
[Figure: Intermittent iteration of reads and writes between map and reduce computations]
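The read/write pattern between map and reduce steps can be illustrated with a small single-process sketch. This is not Hadoop or Spark code. It is a plain-Python word count, with hypothetical function names (`map_phase`, `reduce_phase`, `run_job_via_disk`, `run_job_in_memory`), that contrasts writing the map output to disk before reducing with passing it along in memory. The word-count job and the JSON temp file are assumptions made for illustration.

```python
import json
import os
import tempfile
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every input line."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def reduce_phase(pairs):
    """Group pairs by word and sum the counts for each word."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


def run_job_via_disk(lines):
    """MapReduce-style flow: write map output to disk, read it back, then reduce."""
    fd, path = tempfile.mkstemp(suffix=".json")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(map_phase(lines), f)  # intermediate write
        with open(path) as f:
            pairs = [tuple(p) for p in json.load(f)]  # intermediate read
    finally:
        os.remove(path)
    return reduce_phase(pairs)


def run_job_in_memory(lines):
    """In-memory flow: hand the map output straight to the reduce step."""
    return reduce_phase(map_phase(lines))


sample = ["spark is fast", "hadoop is batch", "spark is unified"]
disk_result = run_job_via_disk(sample)
memory_result = run_job_in_memory(sample)
```

Both functions return the same counts. The difference is the extra disk round trip, which is the cost that accumulates when a job chains many map and reduce stages.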
What Is Apache Spark?
● Apache Spark is a unified engine designed for large-scale distributed data processing, on premises in data centers or in the cloud.
● Design philosophy:
○ Speed
○ Ease of use
○ Modularity
○ Extensibility
What Is Apache Spark?
● Structured data (e.g., CSV, text, JSON, Avro, ORC, Parquet)
● Real-time processing of continually growing table
● Common machine learning algorithms
● Analyze graphs and topologies using algorithms, e.g., PageRank
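Since PageRank is named as an example graph algorithm, a minimal plain-Python sketch of the idea may help. It is not Spark's graph library and does not run distributed. The function name `pagerank`, the damping factor of 0.85, the iteration count, and the three-node graph are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively compute PageRank for a graph given as {node: [outgoing neighbours]}."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    ranks = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing links is spread evenly over all nodes.
        dangling = sum(ranks[node] for node in nodes if not links.get(node))
        new_ranks = {
            node: (1.0 - damping) / n + damping * dangling / n for node in nodes
        }
        # Every node shares its current rank equally among the nodes it links to.
        for node, targets in links.items():
            for target in targets:
                new_ranks[target] += damping * ranks[node] / len(targets)
        ranks = new_ranks
    return ranks


# Hypothetical graph: a links to b and c, b links to c, c links back to a.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
```

In this graph, node `c` receives rank from both `a` and `b`, so it ends up ranked above `b`, which only receives half of `a`'s rank.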
Who Uses Spark, and for What?
Data Science, Data Engineering, Machine Learning
Basic Operations a Data Scientist May Perform
Spark Ecosystem
Spark’s Distributed Execution
Spark Installation
Spark – Databricks Community Edition
1. Create a free Databricks account using this link:
https://databricks.com/try-databricks