
WEEK 1

SPARK & SCALA


What is Apache Spark?

 Lightning fast cluster computing
 Faster data processing platform
 Fastest open-source engine for sorting a petabyte
 80 high-level operators that make coding quick
"Hello World!" of Big Data: The Word Count Example

In Java MapReduce, the word count example takes roughly 50 lines of code.

The same program written in Spark & Scala:

sparkContext.textFile("hdfs://…")
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .saveAsTextFile("hdfs://…")
Other Significant Features of Apache Spark…

 Interactive shell (REPL) lets you explore results as you go
 Provides APIs in Scala, Java, and Python, with support for other languages (such as R) on the way
 Integrates well with the Hadoop ecosystem and data sources (HDFS, Amazon S3, Hive, HBase, Cassandra, etc.)
 Can run on clusters managed by Hadoop YARN or Apache Mesos, and can also run standalone
 Complemented by a set of powerful, higher-level libraries (Spark SQL, Spark Streaming, MLlib, GraphX)
Event Detection Case

 Faster than the Japan Meteorological Agency
 Analyzing a Twitter stream
 Filtering relevant tweets – "Earthquake" or "Shaking"
Using "Spark Streaming" code…

TwitterUtils.createStream(...)
  .filter(_.getText.contains("earthquake") ||
          _.getText.contains("shaking"))
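For context, a minimal sketch of how that filter might sit inside a complete Spark Streaming job. This assumes the external spark-streaming-twitter connector is on the classpath and Twitter credentials are supplied as twitter4j system properties; the application name and batch interval are illustrative.

// Sketch only: connector, credentials, and batch interval are assumptions.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

object EarthquakeDetector {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("EarthquakeDetector")
    val ssc  = new StreamingContext(conf, Seconds(10))   // 10-second micro-batches

    // None = take credentials from twitter4j system properties
    val tweets = TwitterUtils.createStream(ssc, None)

    // Keep only tweets whose text mentions an earthquake or shaking
    val relevant = tweets.filter { status =>
      val text = status.getText.toLowerCase
      text.contains("earthquake") || text.contains("shaking")
    }

    relevant.map(_.getText).print()   // print a sample of matching tweets per batch

    ssc.start()
    ssc.awaitTermination()
  }
}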
Event Detection Case…

Semantic Analysis: distinguish tweets like "Attending a Conference on Earthquake" from "Earth is Shaking" (an actual earthquake)

Flow: Prediction model is ready (code in MLlib) → density of positive tweets → location of tweets retrieved with Twitter services → sending emails to addresses (Spark SQL)
Spark Architecture & Cluster
RDD (Resilient Distributed Dataset)

Definition:
 A basic data abstraction in Spark
 A collection of data items distributed across the network
 RDDs are the interface for running transformations and actions in Spark

Characteristics: Immutable, Lazily Evaluated, Distributed, Fault-Tolerant
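A small spark-shell sketch (assuming sc is the SparkContext; the data is made up) illustrating two of these characteristics, immutability and lazy evaluation:

val numbers = sc.parallelize(1 to 1000)   // an RDD; it is never modified in place
val doubled = numbers.map(_ * 2)          // map returns a *new* RDD; nothing runs yet (lazy)
val total   = doubled.reduce(_ + _)       // the action finally triggers the computation
println(total)                            // 1001000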
Methods to Create RDD

 By referencing an external dataset in an external storage system
 By parallelizing a collection of objects in the driver program

(both methods are sketched below)
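A minimal spark-shell sketch of both creation methods; the HDFS path is illustrative, not a real location.

val fromFile = sc.textFile("hdfs://namenode:8020/data/input.txt")   // 1. reference an external dataset
val fromCollection = sc.parallelize(Seq("spark", "scala", "rdd"))    // 2. parallelize a driver-side collection
println(fromCollection.count())                                      // 3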
Operations in RDD

 Transformations: create a new dataset from an existing one
 Actions: return a value to the driver program after running a computation on the dataset
Transformations in RDD:

 map(func)
 filter(func)
 flatMap(func)
 mapPartitions(func)
 mapPartitionsWithIndex(func)
 sample(withReplacement, fraction, seed)
 union(otherDataset)
 intersection(otherDataset)
 distinct([numTasks])
 groupByKey([numTasks])
 reduceByKey(func, [numTasks])
 aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])
 sortByKey([ascending], [numTasks])
 join(otherDataset, [numTasks])
 cogroup(otherDataset, [numTasks])
 cartesian(otherDataset)
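An illustrative chain of a few of these transformations in spark-shell; the input data is made up, and note that no job runs yet.

val lines = sc.parallelize(Seq("to be or not to be", "to do or not to do"))

val wordCounts = lines
  .flatMap(_.split(" "))        // flatMap: one word per element
  .filter(_.nonEmpty)           // filter: drop empty strings
  .map(word => (word, 1))       // map: pair each word with a count of 1
  .reduceByKey(_ + _)           // reduceByKey: sum counts per word
  .sortByKey()                  // sortByKey: order alphabetically

// Still nothing has executed: transformations only describe the computation.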
Actions in RDD:

 reduce(func)
 collect()
 count()
 first()
 take(n)
 takeSample(withReplacement, num, [seed])
 takeOrdered(n, [ordering])
 saveAsTextFile(path)
 saveAsSequenceFile(path)
 saveAsObjectFile(path)
 countByKey()
 foreach(func)
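Continuing the wordCounts sketch above, calling an action is what finally triggers execution; the output path below is illustrative.

println(wordCounts.count())          // number of distinct words
wordCounts.take(3).foreach(println)  // first three (word, count) pairs returned to the driver
// wordCounts.saveAsTextFile("hdfs://namenode:8020/output/wordcounts")  // illustrative path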
Variables

Shared Variables:
 Broadcast Variables
 Accumulators
Broadcast Variables

 Allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks
 Can be used to give every node a copy of a large input dataset
 Broadcast variables are created from a variable v by calling SparkContext.broadcast(v)
 The broadcast variable is a wrapper around v, and its value can be accessed by calling the value method

Code:

scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)

scala> broadcastVar.value
res0: Array[Int] = Array(1, 2, 3)
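A sketch of the typical broadcast pattern (the lookup table and names are illustrative): ship a read-only map to every executor once and reference it inside a task.

val countryNames = Map("IN" -> "India", "US" -> "United States", "JP" -> "Japan")
val bcNames = sc.broadcast(countryNames)           // cached once per executor

val codes = sc.parallelize(Seq("IN", "JP", "US", "IN"))
val resolved = codes.map(code => bcNames.value.getOrElse(code, "Unknown"))

resolved.collect().foreach(println)                // India, Japan, United States, India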
Accumulators

 Accumulators are variables that are only "added" to through an associative and commutative operation
 They can be used to implement counters (as in MapReduce) or sums
 Spark supports accumulators of numeric types, and programmers can add support for new types
 An accumulator is created from an initial value v by calling SparkContext.accumulator(v)
 Tasks running on a cluster can then add to it using the add method or the += operator
 Only the driver program can read the accumulator's value, using its value method

Code:

scala> val accum = sc.longAccumulator("My Accumulator")
accum: org.apache.spark.util.LongAccumulator = LongAccumulator(id: 0, name: Some(My Accumulator), value: 0)

scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum.add(x))
...
10/09/29 18:41:08 INFO SparkContext: Tasks finished in ...
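To round off the example, a sketch of how the driver would then read the accumulated total back with the value method (the REPL result label shown is illustrative):

scala> accum.value
res1: Long = 10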
Week 2

 Basics of Scala
 Object-Oriented Programming
 Functional Programming
 Scala Data Types
 Scala Functions
 Scala Variables