Introduction To Apache Spark

The document provides an introduction to Apache Spark, detailing its genesis as a solution to the shortcomings of Hadoop in handling big data and distributed computing. It describes Spark as a unified engine for large-scale data processing, emphasizing its speed, ease of use, and modularity. Additionally, it outlines various use cases for Spark, including data science, machine learning, and real-time data processing.

Uploaded by

azamsyed811

Introduction to Apache Spark

Outline
• The Genesis of Spark
• What Is Apache Spark?
• Getting Started with Spark

Reference:
• Chapter 1, “Learning Spark”, 2nd Edition. Authors: Jules S. Damji, Brooke Wenig,
Tathagata Das, Denny Lee. Publisher(s): O'Reilly Media, Inc. ISBN: 9781492050049
The Genesis of Spark
• Big Data and Distributed Computing at Google
o Google created the Google File System (GFS), MapReduce (MR), and Bigtable to handle
massive amounts of data on the Internet

• Hadoop at Yahoo!
o The open-source community, Yahoo! in particular, took an interest in Google's approach
o GFS provided a blueprint for the Hadoop Distributed File System (HDFS)
o Hadoop was donated to the Apache Software Foundation (ASF)
o Shortcomings: hard to administer and manage, operationally complex, brittle fault
tolerance in MapReduce, slow MR jobs

• Spark was developed to address the issues Hadoop had


Intermittent iteration of reads and writes between map and reduce computations
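
To make the disk I/O cost concrete, here is a minimal pure-Python sketch (not Hadoop or Spark code) of a two-stage word count in which the map stage writes its intermediate result to disk and the reduce stage reads it back. The function names, the JSON-lines intermediate format, and the temporary file are all assumptions made for illustration.

```python
import json
import os
import tempfile
from collections import Counter


def map_stage(lines, out_path):
    """Emit (word, 1) pairs and write them to disk, as an MR map task would."""
    with open(out_path, "w", encoding="utf-8") as out:
        for line in lines:
            for word in line.split():
                out.write(json.dumps([word.lower(), 1]) + "\n")


def reduce_stage(in_path):
    """Read the intermediate pairs back from disk and sum the counts per word."""
    counts = Counter()
    with open(in_path, encoding="utf-8") as src:
        for record in src:
            word, n = json.loads(record)
            counts[word] += n
    return dict(counts)


def word_count(lines):
    """Run map then reduce, with a disk write and a disk read in between."""
    fd, path = tempfile.mkstemp(suffix=".jsonl")  # hypothetical intermediate file
    os.close(fd)
    try:
        map_stage(lines, path)       # stage 1: write the intermediate result
        return reduce_stage(path)    # stage 2: read it back from disk
    finally:
        os.remove(path)


result = word_count(["Spark is fast", "Hadoop is slow"])
print(result)
```

Chaining several such jobs repeats the write/read cycle at every stage boundary, which is the overhead the figure depicts. Spark reduces it by keeping intermediate results in memory where possible.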

What Is Apache Spark?
● Apache Spark is a unified engine designed for large-scale distributed data processing, on premises in data centers or in the cloud.
● Design philosophy:
○ Speed
○ Ease of use
○ Modularity
○ Extensibility

Apache Spark’s ecosystem of connectors

What Is Apache Spark?
● Spark SQL: structured data (e.g., CSV, text, JSON, Avro, ORC, Parquet)
● Structured Streaming: real-time processing of a continually growing table
● MLlib: common machine learning algorithms
● GraphX: analyzing graphs and topologies using algorithms, e.g., PageRank

Apache Spark components and API stack

Spark SQL
• Read from a JSON file stored on Amazon S3
• Create a temporary table, and
• Issue a SQL-like query on the results read into memory as a Spark DataFrame
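
The three steps above can be sketched in PySpark roughly as follows. This is an illustrative sketch, not the slide's original code: the S3 path, view name, and column names are hypothetical, and reading from `s3a://` assumes the cluster already has the S3 connector and credentials configured.

```python
def query_json(spark, path, view_name, sql):
    """Read JSON into a DataFrame, register a temporary view, and query it."""
    df = spark.read.json(path)              # step 1: read the JSON file as a DataFrame
    df.createOrReplaceTempView(view_name)   # step 2: expose it to SQL as a temporary view
    return spark.sql(sql)                   # step 3: run the query; the result is a DataFrame


def main():
    # Requires a PySpark installation; the import is deferred so that
    # query_json can be read and exercised without one.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("SparkSQLExample").getOrCreate()
    # Hypothetical S3 location, view name, and column names.
    result = query_json(
        spark,
        "s3a://example-bucket/path/data.json",
        "events",
        "SELECT name, COUNT(*) AS n FROM events GROUP BY name",
    )
    result.show(10)
    spark.stop()
```

Calling `main()` on a machine with Spark and S3 access runs the full flow. Because `query_json` only touches the session object it is given, the same helper works with a local file path when trying this out in a notebook.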

Who Uses Spark, and for What?
Data Science, Data Engineering, Machine Learning

Some use cases:

• Processing in parallel large data sets distributed across a cluster

• Performing ad hoc or interactive queries to explore and visualize data sets

• Building, training, and evaluating ML models using MLlib

• Implementing end-to-end data pipelines from myriad streams of data

• Analyzing graph data sets and social networks

Basic Operations a Data Scientist May Perform

Spark Ecosystem

Spark’s Distributed Execution

Spark Installation

Spark – Databricks Community Edition
1. Create a free Databricks account using this link:
https://databricks.com/try-databricks

2. When asked to select a cloud provider, click "Get started with Community Edition" towards the bottom (see screenshot)

3. Verify your email account by clicking the link sent to your email. Then log in here:
https://community.cloud.databricks.com/login.html
