0% found this document useful (0 votes)

3 views

1_PDFsam_apache_spark_tutorial

Uploaded by

mitmak

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

3 views

1_PDFsam_apache_spark_tutorial

Uploaded by

mitmak

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 7

Apache Spark

About the Tutorial

Apache Spark is a lightning-fast cluster computing designed for fast computation. It was
built on top of Hadoop MapReduce and it extends the MapReduce model to efficiently use
more types of computations which includes Interactive Queries and Stream Processing.

This is a brief tutorial that explains the basics of Spark Core programming.

Audience
This tutorial has been prepared for professionals aspiring to learn the basics of Big Data
Analytics using Spark Framework and become a Spark Developer. In addition, it would
be useful for Analytics Professionals and ETL developers as well.

Prerequisite
Before you start proceeding with this tutorial, we assume that you have prior exposure
to Scala programming, database concepts, and any of the Linux operating system
flavors.

Copyright & Disclaimer

All the content and graphics published in this e-book are the property of Tutorials Point
(I) Pvt. Ltd. The user of this e-book is prohibited to reuse, retain, copy, distribute or
republish any contents or a part of contents of this e-book in any manner without written
consent of the publisher.

We strive to update the contents of our website and tutorials as timely and as precisely
as possible, however, the contents may contain inaccuracies or errors. Tutorials Point (I)
Pvt. Ltd. provides no guarantee regarding the accuracy, timeliness or completeness of
our website or its contents including this tutorial. If you discover any errors on our
website or in this tutorial, please notify us at [email protected]

i
Apache Spark

Table of Contents
About the Tutorial .................................................................................................................................... i

Audience .................................................................................................................................................. i

Prerequisite.............................................................................................................................................. i

Copyright & Disclaimer............................................................................................................................. i

Table of Contents .................................................................................................................................... ii

1. SPARK INTRODUCTION ......................................................................................................... 1

Apache Spark .......................................................................................................................................... 1

Evolution of Apache Spark ...................................................................................................................... 1

Features of Apache Spark ........................................................................................................................ 1

Spark Built on Hadoop ............................................................................................................................ 2

Components of Spark .............................................................................................................................. 3

2. SPARK – RDD ........................................................................................................................ 4

Resilient Distributed Datasets ................................................................................................................. 4

Data Sharing is Slow in MapReduce ........................................................................................................ 4

Iterative Operations on MapReduce ....................................................................................................... 4

Interactive Operations on MapReduce .................................................................................................... 5

Data Sharing using Spark RDD ................................................................................................................. 6

Iterative Operations on Spark RDD.......................................................................................................... 6

Interactive Operations on Spark RDD ...................................................................................................... 6

3. SPARK – INSTALLATION ........................................................................................................ 8

Step 1: Verifying Java Installation............................................................................................................ 8

Step 2: Verifying Scala installation .......................................................................................................... 8

Step 3: Downloading Scala ...................................................................................................................... 8

Step 4: Installing Scala ............................................................................................................................. 9

Step 5: Downloading Apache Spark ......................................................................................................... 9

ii
Apache Spark

Step 6: Installing Spark .......................................................................................................................... 10

Step 7: Verifying the Spark Installation ................................................................................................. 10

4. SPARK – CORE PROGRAMMING.......................................................................................... 12

Spark Shell ............................................................................................................................................ 12

RDD ....................................................................................................................................................... 12

Transformations .................................................................................................................................... 12

Actions .................................................................................................................................................. 16

Programming with RDD ......................................................................................................................... 17

UN Persist the Storage .......................................................................................................................... 21

5. SPARK – DEPLOYMENT ....................................................................................................... 23

Spark-submit Syntax ............................................................................................................................. 27

6. ADVANCED SPARK PROGRAMMING ................................................................................... 30

Broadcast Variables............................................................................................................................... 30

Accumulators ........................................................................................................................................ 30

Numeric RDD Operations ...................................................................................................................... 31

iii
1. SPARK – INTRODUCTION Apache Spark

Industries are using Hadoop extensively to analyze their data sets. The reason is that
Hadoop framework is based on a simple programming model (MapReduce) and it
enables a computing solution that is scalable, flexible, fault-tolerant and cost effective.
Here, the main concern is to maintain speed in processing large datasets in terms of
waiting time between queries and waiting time to run the program.

Spark was introduced by Apache Software Foundation for speeding up the Hadoop
computational computing software process.

As against a common belief, Spark is not a modified version of Hadoop and is not,
really, dependent on Hadoop because it has its own cluster management. Hadoop is just
one of the ways to implement Spark.

Spark uses Hadoop in two ways – one is storage and second is processing. Since
Spark has its own cluster management computation, it uses Hadoop for storage purpose
only.

Apache Spark
Apache Spark is a lightning-fast cluster computing technology, designed for fast
computation. It is based on Hadoop MapReduce and it extends the MapReduce model to
efficiently use it for more types of computations, which includes interactive queries and
stream processing. The main feature of Spark is its in-memory cluster computing
that increases the processing speed of an application.

Spark is designed to cover a wide range of workloads such as batch applications,

iterative algorithms, interactive queries and streaming. Apart from supporting all these
workload in a respective system, it reduces the management burden of maintaining
separate tools.

Evolution of Apache Spark

Spark is one of Hadoop’s sub project developed in 2009 in UC Berkeley’s AMPLab by
Matei Zaharia. It was Open Sourced in 2010 under a BSD license. It was donated to
Apache software foundation in 2013, and now Apache Spark has become a top level
Apache project from Feb-2014.

Features of Apache Spark

Apache Spark has following features.

 Speed: Spark helps to run an application in Hadoop cluster, up to 100 times faster
in memory, and 10 times faster when running on disk. This is possible by reducing
number of read/write operations to disk. It stores the intermediate processing data
in memory.

1
Apache Spark

 Supports multiple languages: Spark provides built-in APIs in Java, Scala, or

Python. Therefore, you can write applications in different languages. Spark comes
up with 80 high-level operators for interactive querying.

 Advanced Analytics: Spark not only supports ‘Map’ and ‘reduce’. It also supports
SQL queries, Streaming data, Machine learning (ML), and Graph algorithms.

Spark Built on Hadoop

The following diagram shows three ways of how Spark can be built with Hadoop
components.

There are three ways of Spark deployment as explained below.

 Standalone: Spark Standalone deployment means Spark occupies the place on

top of HDFS(Hadoop Distributed File System) and space is allocated for HDFS,
explicitly. Here, Spark and MapReduce will run side by side to cover all spark jobs
on cluster.

 Hadoop Yarn: Hadoop Yarn deployment means, simply, spark runs on Yarn
without any pre-installation or root access required. It helps to integrate Spark
into Hadoop ecosystem or Hadoop stack. It allows other components to run on
top of stack.

 Spark in MapReduce (SIMR): Spark in MapReduce is used to launch spark job

in addition to standalone deployment. With SIMR, user can start Spark and uses
its shell without any administrative access.

2
Apache Spark

Components of Spark
The following illustration depicts the different components of Spark.

Apache Spark Core

Spark Core is the underlying general execution engine for spark platform that all other
functionality is built upon. It provides In-Memory computing and referencing datasets in
external storage systems.

Spark SQL
Spark SQL is a component on top of Spark Core that introduces a new data abstraction
called SchemaRDD, which provides support for structured and semi-structured data.

Spark Streaming
Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming
analytics. It ingests data in mini-batches and performs RDD (Resilient Distributed
Datasets) transformations on those mini-batches of data.

MLlib (Machine Learning Library)

MLlib is a distributed machine learning framework above Spark because of the
distributed memory-based Spark architecture. It is, according to benchmarks, done by
the MLlib developers against the Alternating Least Squares (ALS) implementations.
Spark MLlib is nine times as fast as the Hadoop disk-based version of Apache
Mahout (before Mahout gained a Spark interface).

GraphX
GraphX is a distributed graph-processing framework on top of Spark. It provides an API
for expressing graph computation that can model the user-defined graphs by using
Pregel abstraction API. It also provides an optimized runtime for this abstraction.

Apache Spark 24 Hours PDF
100% (6)
Apache Spark 24 Hours PDF
1,129 pages
Case 2 Temporary Staffing at Christie's
100% (2)
Case 2 Temporary Staffing at Christie's
4 pages
Service Manual Philips 246E9QDSB01
No ratings yet
Service Manual Philips 246E9QDSB01
75 pages
Fast Data Processing with Spark 2 - Third Edition
From Everand
Fast Data Processing with Spark 2 - Third Edition
Krishna Sankar
No ratings yet
Grade 3 Reading Comprehension Workbook
13% (8)
Grade 3 Reading Comprehension Workbook
3 pages
Apache Spark Tutorial
100% (4)
Apache Spark Tutorial
36 pages
Libro Spark
No ratings yet
Libro Spark
68 pages
Cse3002 Big Data m3 Detailed
No ratings yet
Cse3002 Big Data m3 Detailed
39 pages
Spark Tutorial
No ratings yet
Spark Tutorial
8 pages
Apache Spark Engine
100% (1)
Apache Spark Engine
82 pages
Module 3
No ratings yet
Module 3
51 pages
Apach Spark With Scala Slides
No ratings yet
Apach Spark With Scala Slides
187 pages
(Na) Aven Jeffrey - Sams Teach Yourself Spark in 24 Hours
No ratings yet
(Na) Aven Jeffrey - Sams Teach Yourself Spark in 24 Hours
1,229 pages
Apache Spark Tutorial
100% (1)
Apache Spark Tutorial
6 pages
Skyess Spark Syllabus
No ratings yet
Skyess Spark Syllabus
12 pages
Hands-on Guide to Apache Spark 3: Build Scalable Computing Engines for Batch and Stream Data Processing Alfonso Antolínez García All Chapters Instant Download
100% (2)
Hands-on Guide to Apache Spark 3: Build Scalable Computing Engines for Batch and Stream Data Processing Alfonso Antolínez García All Chapters Instant Download
41 pages
Hands-on Guide to Apache Spark 3: Build Scalable Computing Engines for Batch and Stream Data Processing 1st Edition Alfonso Antolínez García instant download
No ratings yet
Hands-on Guide to Apache Spark 3: Build Scalable Computing Engines for Batch and Stream Data Processing 1st Edition Alfonso Antolínez García instant download
85 pages
Apache Spark Cheatsheet (2014)
No ratings yet
Apache Spark Cheatsheet (2014)
9 pages
Tech Seminar Report
No ratings yet
Tech Seminar Report
5 pages
L03-Spark Framework
No ratings yet
L03-Spark Framework
58 pages
DEV3600SlideGuide PDF
No ratings yet
DEV3600SlideGuide PDF
555 pages
Apache Spark With Java
No ratings yet
Apache Spark With Java
209 pages
Learn Apache Spark
100% (1)
Learn Apache Spark
31 pages
Beginning Database Design
No ratings yet
Beginning Database Design
2 pages
Pyspark Interview Code
100% (3)
Pyspark Interview Code
197 pages
Spark SQL Tutorial PDF
100% (1)
Spark SQL Tutorial PDF
35 pages
Spark SQL Tutorial
0% (1)
Spark SQL Tutorial
7 pages
Unit IV spark
No ratings yet
Unit IV spark
23 pages
BDA U4 copy
No ratings yet
BDA U4 copy
49 pages
Developer Training For Apache Spark and Hadoop
No ratings yet
Developer Training For Apache Spark and Hadoop
3 pages
Lecture 3 PPT 22
No ratings yet
Lecture 3 PPT 22
25 pages
Apache Spark and Ignite
No ratings yet
Apache Spark and Ignite
4 pages
ApacheSparkWorkshop 2020 09 17
No ratings yet
ApacheSparkWorkshop 2020 09 17
58 pages
Solution Methodology
No ratings yet
Solution Methodology
3 pages
Features of Apache Spark
No ratings yet
Features of Apache Spark
7 pages
Sspark
No ratings yet
Sspark
7 pages
Introduction To Spark
No ratings yet
Introduction To Spark
84 pages
Hands-on Guide to Apache Spark 3: Build Scalable Computing Engines for Batch and Stream Data Processing Alfonso Antolínez García download
100% (1)
Hands-on Guide to Apache Spark 3: Build Scalable Computing Engines for Batch and Stream Data Processing Alfonso Antolínez García download
79 pages
Databricks On AWS 01 Getting Started Apache Spark Slides
100% (1)
Databricks On AWS 01 Getting Started Apache Spark Slides
29 pages
Learning Spark Preview Ed
No ratings yet
Learning Spark Preview Ed
18 pages
Pyspark_notes_new
No ratings yet
Pyspark_notes_new
18 pages
UNIT 4 Part 2
No ratings yet
UNIT 4 Part 2
11 pages
4.1. Spark Basics
No ratings yet
4.1. Spark Basics
28 pages
Intro To Apache Spark
No ratings yet
Intro To Apache Spark
66 pages
07_Apache Spark - An Introduction
No ratings yet
07_Apache Spark - An Introduction
36 pages
Spark Training - Java
No ratings yet
Spark Training - Java
8 pages
Spark Interview Questions
No ratings yet
Spark Interview Questions
19 pages
Apache Spark 1
No ratings yet
Apache Spark 1
11 pages
Lec no 10
No ratings yet
Lec no 10
17 pages
06-Apache Spark
No ratings yet
06-Apache Spark
75 pages
20J41A0514-Big Data Spark
No ratings yet
20J41A0514-Big Data Spark
12 pages
Unit-5 Spark
No ratings yet
Unit-5 Spark
24 pages
Unit 5
100% (1)
Unit 5
109 pages
Spark PPT
No ratings yet
Spark PPT
55 pages
Module 2.pptx
No ratings yet
Module 2.pptx
20 pages
4 Spark SBP
No ratings yet
4 Spark SBP
74 pages
Introduction to Spark
No ratings yet
Introduction to Spark
30 pages
09 Programming Hadoop - Spark, R and Pig
No ratings yet
09 Programming Hadoop - Spark, R and Pig
80 pages
UNIT V
No ratings yet
UNIT V
35 pages
Unit 4 Spark Updated
No ratings yet
Unit 4 Spark Updated
86 pages
Machine Learning With Spark - Sample Chapter
100% (1)
Machine Learning With Spark - Sample Chapter
36 pages
Cloud Computing : Beginners And Intermediate User Guide
From Everand
Cloud Computing : Beginners And Intermediate User Guide
David comer
No ratings yet
Learning Cascading
From Everand
Learning Cascading
Michael Covert
No ratings yet
301_PDFsam_The Big Book of Realistic Drawing Secrets Easy Techniques for Drawing People, Animals and More
No ratings yet
301_PDFsam_The Big Book of Realistic Drawing Secrets Easy Techniques for Drawing People, Animals and More
6 pages
181_PDFsam_The Big Book of Realistic Drawing Secrets Easy Techniques for Drawing People, Animals and More
No ratings yet
181_PDFsam_The Big Book of Realistic Drawing Secrets Easy Techniques for Drawing People, Animals and More
20 pages
89_PDFsam_Start Sketching and Drawing Now Simple Techniques for Drawing Landscapes, People and Objects
No ratings yet
89_PDFsam_Start Sketching and Drawing Now Simple Techniques for Drawing Landscapes, People and Objects
8 pages
201_PDFsam_The Big Book of Realistic Drawing Secrets Easy Techniques for Drawing People, Animals and More
No ratings yet
201_PDFsam_The Big Book of Realistic Drawing Secrets Easy Techniques for Drawing People, Animals and More
20 pages
281_PDFsam_The Big Book of Realistic Drawing Secrets Easy Techniques for Drawing People, Animals and More
No ratings yet
281_PDFsam_The Big Book of Realistic Drawing Secrets Easy Techniques for Drawing People, Animals and More
20 pages
121_PDFsam_The Big Book of Realistic Drawing Secrets Easy Techniques for Drawing People, Animals and More
No ratings yet
121_PDFsam_The Big Book of Realistic Drawing Secrets Easy Techniques for Drawing People, Animals and More
20 pages
121_PDFsam_Programming Pig
No ratings yet
121_PDFsam_Programming Pig
10 pages
181_PDFsam_Programming Pig
No ratings yet
181_PDFsam_Programming Pig
10 pages
71_PDFsam_Programming Pig
No ratings yet
71_PDFsam_Programming Pig
10 pages
_assets_File_ans-key_iom_CC_CL_9
No ratings yet
_assets_File_ans-key_iom_CC_CL_9
1 page
3_PDFsam_Beginner Guide Spark
No ratings yet
3_PDFsam_Beginner Guide Spark
2 pages
_assets_File_ans-key_iio_CC_CL_9
No ratings yet
_assets_File_ans-key_iio_CC_CL_9
1 page
9_PDFsam_Beginner Guide Spark
No ratings yet
9_PDFsam_Beginner Guide Spark
2 pages
1_PDFsam_Grafana Guidebook v1.1
No ratings yet
1_PDFsam_Grafana Guidebook v1.1
20 pages
_assets_File_ans-key_ioel_BB_CL_9
No ratings yet
_assets_File_ans-key_ioel_BB_CL_9
1 page
Intro to Vector Embeddings
No ratings yet
Intro to Vector Embeddings
8 pages
Difference between IRR and XIRR
No ratings yet
Difference between IRR and XIRR
7 pages
9 career saving lessons
No ratings yet
9 career saving lessons
11 pages
Language of Money
No ratings yet
Language of Money
19 pages
MasterSpellers FINALS-WINNERS-2024-25 list
No ratings yet
MasterSpellers FINALS-WINNERS-2024-25 list
4 pages
Beginner Guide Spark
No ratings yet
Beginner Guide Spark
12 pages
Chapter 4 Using Excel as Database
No ratings yet
Chapter 4 Using Excel as Database
4 pages
Grafana Guidebook v1.1
No ratings yet
Grafana Guidebook v1.1
41 pages
Craig Gentry Thesis
100% (3)
Craig Gentry Thesis
10 pages
Applying Machine Learning To Estimate The Effort and Duration of Individual Tasks in Software Projects
No ratings yet
Applying Machine Learning To Estimate The Effort and Duration of Individual Tasks in Software Projects
14 pages
Opposite Party Quote
No ratings yet
Opposite Party Quote
4 pages
Interfaces and Cables
No ratings yet
Interfaces and Cables
2 pages
MIC MICROPROJECT
No ratings yet
MIC MICROPROJECT
11 pages
Arba Minch Institute of Technology Arba Minch University: Arba Minch Kenema Foot Ball Club Information Management System
No ratings yet
Arba Minch Institute of Technology Arba Minch University: Arba Minch Kenema Foot Ball Club Information Management System
90 pages
Meil122 - Performance Task 1 - Manatad, Adelle Melyssa Imoen M.
No ratings yet
Meil122 - Performance Task 1 - Manatad, Adelle Melyssa Imoen M.
6 pages
Docu10450 - Documentum Content Server 6.6 Full Text Indexing Deployment Guide
No ratings yet
Docu10450 - Documentum Content Server 6.6 Full Text Indexing Deployment Guide
59 pages
Operating System Summary of Chapter 5
No ratings yet
Operating System Summary of Chapter 5
10 pages
Nesting in The Sheet Metal Industry Dealing With C
No ratings yet
Nesting in The Sheet Metal Industry Dealing With C
8 pages
Foxconn G31MV-K LGA775 1333FSB DDR2
No ratings yet
Foxconn G31MV-K LGA775 1333FSB DDR2
36 pages
Lista de Cliente 14-03-25
No ratings yet
Lista de Cliente 14-03-25
42 pages
ScanTools_Prisma
No ratings yet
ScanTools_Prisma
2 pages
Ra2211003011631 Monish
No ratings yet
Ra2211003011631 Monish
6 pages
Project Report 2024-25
No ratings yet
Project Report 2024-25
46 pages
AI Routers & Network Mind: A Hybrid Machine Learning Paradigm For Packet Routing
No ratings yet
AI Routers & Network Mind: A Hybrid Machine Learning Paradigm For Packet Routing
10 pages
LCFC NM-D711 r0.1
100% (1)
LCFC NM-D711 r0.1
86 pages
Direct Methods For Sparse Linear Systems Illustrated Edition Timothy A Davis pdf download
100% (3)
Direct Methods For Sparse Linear Systems Illustrated Edition Timothy A Davis pdf download
70 pages
Cheat Sheet Micro
No ratings yet
Cheat Sheet Micro
1 page
chp1ES
No ratings yet
chp1ES
45 pages
Lesson C. Have and Has
No ratings yet
Lesson C. Have and Has
1 page
Xps 13 9370 Laptop - Setup Guide - en Us
No ratings yet
Xps 13 9370 Laptop - Setup Guide - en Us
19 pages
Project Proposal
No ratings yet
Project Proposal
12 pages
PowerShell Practice Programs
No ratings yet
PowerShell Practice Programs
10 pages
Solution Manual for Introductory Econometrics A Modern Approach 5th Edition Wooldridge 1111531048 9781111531041 pdf download
80% (5)
Solution Manual for Introductory Econometrics A Modern Approach 5th Edition Wooldridge 1111531048 9781111531041 pdf download
49 pages
The Online Archive and The Internet Archive Challenges and Stakes
No ratings yet
The Online Archive and The Internet Archive Challenges and Stakes
21 pages
Execution Unit (EU) : Instruction Queue, and The Instruction Pointer. It Has The Task of Making Sure That The Bus
No ratings yet
Execution Unit (EU) : Instruction Queue, and The Instruction Pointer. It Has The Task of Making Sure That The Bus
10 pages

1_PDFsam_apache_spark_tutorial

Uploaded by

1_PDFsam_apache_spark_tutorial

Uploaded by

Apache Spark

About the Tutorial

Copyright & Disclaimer

Copyright & Disclaimer............................................................................................................................. i

Table of Contents .................................................................................................................................... ii

1. SPARK INTRODUCTION ......................................................................................................... 1

Apache Spark .......................................................................................................................................... 1

Evolution of Apache Spark ...................................................................................................................... 1

Features of Apache Spark ........................................................................................................................ 1

Spark Built on Hadoop ............................................................................................................................ 2

Components of Spark .............................................................................................................................. 3

2. SPARK – RDD ........................................................................................................................ 4

Resilient Distributed Datasets ................................................................................................................. 4

Data Sharing is Slow in MapReduce ........................................................................................................ 4

Iterative Operations on MapReduce ....................................................................................................... 4

Interactive Operations on MapReduce .................................................................................................... 5

Data Sharing using Spark RDD ................................................................................................................. 6

Iterative Operations on Spark RDD.......................................................................................................... 6

Interactive Operations on Spark RDD ...................................................................................................... 6

3. SPARK – INSTALLATION ........................................................................................................ 8

Step 1: Verifying Java Installation............................................................................................................ 8

Step 2: Verifying Scala installation .......................................................................................................... 8

Step 3: Downloading Scala ...................................................................................................................... 8

Step 4: Installing Scala ............................................................................................................................. 9

Step 5: Downloading Apache Spark ......................................................................................................... 9

Step 6: Installing Spark .......................................................................................................................... 10

Step 7: Verifying the Spark Installation ................................................................................................. 10

4. SPARK – CORE PROGRAMMING.......................................................................................... 12

Spark Shell ............................................................................................................................................ 12

Programming with RDD ......................................................................................................................... 17

UN Persist the Storage .......................................................................................................................... 21

5. SPARK – DEPLOYMENT ....................................................................................................... 23

Spark-submit Syntax ............................................................................................................................. 27

6. ADVANCED SPARK PROGRAMMING ................................................................................... 30

Numeric RDD Operations ...................................................................................................................... 31

Spark is designed to cover a wide range of workloads such as batch applications,

Evolution of Apache Spark

Features of Apache Spark

 Supports multiple languages: Spark provides built-in APIs in Java, Scala, or

Spark Built on Hadoop

There are three ways of Spark deployment as explained below.

 Standalone: Spark Standalone deployment means Spark occupies the place on

 Spark in MapReduce (SIMR): Spark in MapReduce is used to launch spark job

Apache Spark Core

MLlib (Machine Learning Library)

You might also like