Spark Running Notes

Apache Spark is a general-purpose, in-memory compute engine that serves as an alternative to Hadoop's MapReduce, requiring only two disk I/O operations for processing. It utilizes Resilient Distributed Datasets (RDDs) for data handling, which are immutable, fault-tolerant, and support lazy transformations and actions. Spark allows for various data processing tasks, including cleaning, querying, and machine learning, using a unified coding style across different programming languages like Scala and Python.


Apache Spark

=============

Apache Spark is a
==================

general purpose
in memory
compute engine

compute engine
===============

Hadoop provides 3 things:

1. HDFS - Storage
2. MapReduce - Computation
3. YARN - Resource Manager

Spark is a replacement/alternative for MapReduce.

Spark is a plug-and-play compute engine which needs 2 things to work with:

1. Storage - local storage, HDFS, Amazon S3
2. Resource Manager - YARN, Mesos, Kubernetes

(see the sketch below for how the same code plugs into different masters and storage)
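A minimal sketch of this plug-and-play idea in Scala (the master value and all paths are just examples, not from the notes):

import org.apache.spark.{SparkConf, SparkContext}

// The same Spark code can plug into different resource managers and storage;
// only the master URL and the input URI change (values here are illustrative).
val conf = new SparkConf()
  .setAppName("plug-and-play-sketch")
  .setMaster("local[*]")            // or "yarn", or "k8s://https://<host>:<port>"

val sc = new SparkContext(conf)

// Storage is equally pluggable: local files, HDFS, or Amazon S3.
val localRdd = sc.textFile("file:///tmp/file1")              // local storage (example path)
// val hdfsRdd = sc.textFile("hdfs:///user/cloudera/file1")   // HDFS
// val s3Rdd   = sc.textFile("s3a://my-bucket/file1")         // S3 (assumes the S3A connector is configured)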

in-memory
==========

MapReduce: mr1 -> mr2 -> mr3 -> mr4 -> mr5, each job sitting on top of HDFS.

For each MapReduce job we require 2 disk accesses: one for reading the input and the other for writing the output, so a chain of 5 jobs touches the disk 10 times.

Spark: v1 -> v2 -> v3 -> v4 -> v5 are kept in memory, with HDFS touched only at the start and at the end.

Only 2 disk I/Os are required for the whole chain.

Spark is said to be 10 to 100 times faster than MapReduce.

General Purpose
================

In Hadoop we needed a different tool for each kind of work:

Pig for cleaning
Hive for querying
Mahout for machine learning
Sqoop for data ingestion

and underneath, all of them are bound to use only map and reduce.

With Spark you learn just one style of writing code, and all the things like cleaning, querying, machine learning and data ingestion can happen with that, using operations such as:

filter
map
reduce

(see the sketch below)
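For example, a small spark-shell sketch (the numbers are made-up sample data):

// Build a small in-memory RDD (sample data, not from the notes).
val numbers = sc.parallelize(1 to 10)

// The same unified style covers cleaning, transforming and aggregating:
val evens   = numbers.filter(_ % 2 == 0)   // cleaning / filtering
val squared = evens.map(x => x * x)        // transformation
val total   = squared.reduce(_ + _)        // aggregation (action)

println(total)   // 220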

Spark Session - 2
==================

The basic unit which holds the data in Spark is called an RDD (Resilient Distributed Dataset).

Like a List, an RDD is nothing but an in-memory, distributed collection.

rdd1 = load file1 from hdfs
rdd2 = rdd1.map
rdd3 = rdd2.filter
rdd3.collect()

The transformations above only build up a DAG - a Directed Acyclic Graph.

There are 2 kinds of operations in Spark
========================================

1. Transformation
2. Action

Transformations are lazy; actions are not.

Whenever you call a transformation, an entry is added to the execution plan (the DAG). Nothing executes until an action is called.
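A quick sketch of this in the spark-shell (the sample data is made up for illustration):

// Each transformation only adds an entry to the execution plan; nothing runs yet.
val rdd1 = sc.parallelize(Seq("spark is fast", "spark is lazy"))
val rdd2 = rdd1.map(_.toUpperCase)          // lazy
val rdd3 = rdd2.filter(_.contains("LAZY"))  // lazy

// Only the action triggers execution of the whole chain.
rdd3.collect().foreach(println)             // prints: SPARK IS LAZY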
RDDs are:

Distributed
In-memory
Resilient (fault tolerant) - if we lose an RDD we can recover it back.

RDDs are resilient to failures.

RDD1
  | map
  |
RDD2
  | filter
  |
RDD3

If RDD3 is lost, Spark checks for its parent RDD using the lineage graph and quickly re-applies the filter transformation on RDD2 to get RDD3 back.
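A sketch of the same lineage idea using toDebugString (sample data made up for illustration):

// RDD1 -> map -> RDD2 -> filter -> RDD3, as in the diagram above.
val rdd1 = sc.parallelize(1 to 100)
val rdd2 = rdd1.map(_ * 2)
val rdd3 = rdd2.filter(_ > 150)

// toDebugString prints the lineage graph Spark would use to rebuild rdd3
// if one of its partitions were lost.
println(rdd3.toDebugString)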

Immutable
==========

Once we load an RDD with data, the data cannot be changed.

Why immutable?

Why are transformations lazy?

Assume that transformations are not lazy, and consider a 1 GB file in HDFS:

rdd1 = load file1 from hdfs
rdd1.print(line1)

To print just 1 line we ended up loading the whole 1 GB file into memory. Now consider the fact that Spark is lazy: for the same two statements, Spark reads only as much of file1 as it needs to print that one line.

Another example: file1 is in HDFS with 10 lakh rows.

rdd1 = load file1 from hdfs
rdd2 = rdd1.map
rdd3 = rdd2.filter
rdd3.collect()

Consider if Spark were not lazy: rdd1 would be materialized and all 10 lakh lines would be processed by the map (in the case of a map, the number of output records is the same as the number of input records), even though after applying the filter we are just interested in 5 records.

But consider the fact that Spark is lazy: it sees the whole plan before executing anything, so the work happens only when collect() is called and the huge intermediate results never need to be fully materialized.
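A sketch of the "print just one line" case (the HDFS path is only an example):

// Lazily point at a large file in HDFS; nothing is read yet (path is illustrative).
val rdd1 = sc.textFile("hdfs:///user/cloudera/sparkinput/file1")

// take(1) is an action: because Spark is lazy it reads only enough of the
// file to return the first line, instead of materializing the whole 1 GB.
rdd1.take(1).foreach(println)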

word count in spark
====================

We need to find the frequency of each word in a file which resides in HDFS.

We created a file locally and then moved it to HDFS. Now we want to process this file in HDFS using Apache Spark.

spark-shell (scala)

pyspark (python)

sc is nothing but the SparkContext, and it is the entry point to the Spark cluster.

The basic unit which holds the data in Spark is called an RDD.

val rdd1 = sc.textFile("/user/cloudera/sparkinput/file1")

val rdd2 = rdd1.flatMap(x => x.split(" "))

flatMap takes each line as input and can return any number of output elements, so the lines become one flat collection of words:

Array(spark, is, very, interesting, spark, is, in, memory, compute, engine)

map then turns each word into a pair; in a map, if we have n inputs then we will definitely have n outputs:

spark        -> (spark,1)
is           -> (is,1)
very         -> (very,1)
interesting  -> (interesting,1)
spark        -> (spark,1)
is           -> (is,1)
in           -> (in,1)
...
val rdd3 = rdd2.map(x => (x,1))

reduceByKey then adds up all the 1s for the same word:

(spark,2)
(is,2)
(very,1)
(interesting,1)
...

val rdd4 = rdd3.reduceByKey((x,y) => x+y)

rdd4.collect()

localhost:4040 (this gives you the Spark UI)

9108179578
spark with scala code
=======================
val rdd1 = sc.textFile("/user/cloudera/sparkinput/file1")

val rdd2 = rdd1.flatMap(x => x.split(" "))


val rdd3 = rdd2.map(x => (x, 1))

val rdd4 = rdd3.reduceByKey((x, y) => x + y)

rdd4.collect()

pyspark
=========

rdd1 = sc.textFile("file:///home/cloudera/file1")

rdd2 = rdd1.flatMap(lambda x : x.split(" "))

rdd3 = rdd2.map(lambda x : (x, 1))

rdd4 = rdd3.reduceByKey(lambda x,y : x + y)

rdd4.collect()

rdd4.saveAsTextFile("<hdfs path>")

In Scala we have anonymous functions; the same thing is called a lambda in Python.

Spark practical - 4
====================

word count problem - spark-shell (terminal)

word count problem in an IDE with a better dataset

We improved the word count by normalizing the case, then we sorted the data and took the top 10 (see the sketch below).
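A sketch of the improved word count (re-using the input path from above; the IDE version uses a bigger dataset, so treat this as illustrative):

val lines = sc.textFile("/user/cloudera/sparkinput/file1")

val topWords = lines
  .flatMap(_.split(" "))
  .map(_.toLowerCase)                 // normalize the case so "Spark" and "spark" count together
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .sortBy(_._2, ascending = false)    // sort by count, highest first
  .take(10)                           // action: bring only the top 10 to the driver

topWords.foreach(println)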

customer_id, product_id, amount_spent

We need to find the top 10 customers who spent the maximum amount.

Input lines:
44,8602,37.19
35,5368,65.89
2,3391,40.64

-> map to (customer_id, amount_spent):
(44,37.19)
(35,65.89)
(2,40.64)

-> reduceByKey to get the total spend per customer, for example:
(44,94)
(35,165)
(2,40)

The pipeline:

map
reduceByKey((x,y) => x+y)
sortBy(x => x._2)
collect

Inside the map, each line x looks like 44,8602,37.19 and x.split(",") breaks it into the three fields.

Whenever an RDD contains tuples of 2 elements it is called a pair RDD. Here the 1st element can be treated as the key and the second element as the value. A full sketch of the job follows.
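Putting the steps together (the file name is assumed; the field positions follow the layout above):

val orders = sc.textFile("/user/cloudera/sparkinput/customer-orders.csv")   // illustrative path

val top10 = orders
  .map { line =>
    val fields = line.split(",")            // customer_id, product_id, amount_spent
    (fields(0), fields(2).toDouble)         // keep (customer_id, amount_spent)
  }
  .reduceByKey(_ + _)                       // total spend per customer
  .sortBy(_._2, ascending = false)          // highest spenders first
  .take(10)                                 // top 10 customers

top10.foreach(println)
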
Spark practical - 5
====================

user_id, movie_id, rating_given, timestamp

how many times movies were rated 5 star
how many times movies were rated 4 star
how many times movies were rated 3 star
how many times movies were rated 2 star
how many times movies were rated 1 star

The rating column of the input looks like:
3
3
1
2
1
3
2
4

Mapped to (rating, 1) pairs:
(3,1)
(3,1)
(1,1)
(2,1)
(1,1)
...

Instead of using map where we say (x,1) and doing reduceByKey later, we can use countByValue.

map + reduceByKey is a transformation -> the result is an RDD
countByValue is an action -> the result is a local variable (a Map on the driver)

So if you feel that countByValue is the last thing you are doing and there are no more operations after that, then it's OK to use countByValue.

But if we feel that we need more operations after this, then we should not use countByValue, because we won't get parallelism. A sketch of both approaches follows.
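A sketch of both approaches on the ratings file (the path and the tab-separated layout are assumptions for this sketch):

val ratingsFile = sc.textFile("/user/cloudera/sparkinput/ratings.data")   // illustrative path
val ratings = ratingsFile.map(_.split("\t")(2))    // assuming tab-separated fields; rating_given is the 3rd field

// Approach 1: map + reduceByKey is a transformation -> result is still an RDD
val countsRdd = ratings.map(r => (r, 1)).reduceByKey(_ + _)

// Approach 2: countByValue is an action -> result is a local Map on the driver
val countsMap: scala.collection.Map[String, Long] = ratings.countByValue()

countsMap.foreach(println)   // (rating, how many times it was given)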

Spark practical - 6
====================

row_id, name, age, number_of_connections

We need to find the average number of connections for each age.

33, 100
33, 200
33, 300

output: (33, 200)

42, 200
42, 400
42, 500
42, 700

output: (42, 450)

input:  0,Will,33,385
output: (33,385)

// input to the map
// (33,100)
// (33,200)
// (33,300)

For each pair x, x._1 is the age (e.g. 33) and x._2 is the number of connections (e.g. 100).

// output of the map
// (33,(100,1))
// (33,(200,1))
// (33,(300,1))

mappedInput.map(x => (x._1,(x._2,1)))

In reduceByKey, x and y are two values for the same key (the key itself is not passed in), for example:
x = (100,1)
y = (200,1)

x._1 + y._1 adds up the connections and x._2 + y._2 adds up how many records were combined.
// output of reduceByKey
// (33,(600,3))
// (34,(800,4))

reduceByKey((x,y) => (x._1 + y._1, x._2 + y._2))

// input to the final map
// (33,(600,3))
// (34,(800,4))

// output: the average connections per age
// (33,200)
// (34,200)

totalsByAge.map(x => (x._1, x._2._1 / x._2._2))
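A sketch of the full job (the file name is assumed; the fields follow row_id,name,age,number_of_connections as above):

val lines = sc.textFile("/user/cloudera/sparkinput/friends.csv")   // illustrative path

// 0,Will,33,385  ->  (33, 385)
val mappedInput = lines.map { line =>
  val fields = line.split(",")
  (fields(2).toInt, fields(3).toInt)
}

// (33, 385) -> (33, (385, 1)); then sum connections and record counts per age
val totalsByAge = mappedInput
  .map(x => (x._1, (x._2, 1)))
  .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))

// (age, (totalConnections, count)) -> (age, average)
val averagesByAge = totalsByAge.map(x => (x._1, x._2._1 / x._2._2))

averagesByAge.collect().foreach(println)
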
Spark practical - 7
===================

1,11
