Apache Spark
Kishore Pusukuri, Spring 2018

HTTP://WWW.CS.CORNELL.EDU/COURSES/CS5412/2018SP
1
Recap
MapReduce
• For easily writing applications that process vast amounts of data in parallel on large clusters in a reliable, fault-tolerant manner
• Takes care of scheduling tasks, monitoring them, and re-executing failed tasks
HDFS & MapReduce: Running on the same set of nodes → compute nodes and storage nodes are the same (keeping data close to the computation) → very high throughput
YARN & MapReduce: A single master resource manager, one slave node manager per node, and one AppMaster per application
2
Today’s Topics
• Motivation
• Spark Basics
• Spark Programming

3
History of Hadoop and Spark

4
Apache Hadoop & Apache Spark
[Figure: layered software stacks of Apache Hadoop and Apache Spark]
• Processing: MapReduce, Hive, Pig, other applications; Spark Stream, Spark SQL
• Spark Core
• Resource manager: Yet Another Resource Negotiator (YARN); Mesos etc.
• Data storage: Hadoop Distributed File System (HDFS), Hadoop Database (HBase); Cassandra etc., other storage systems
• Data ingestion systems: e.g., Apache Kafka, Flume, etc.

5
Apache Spark
** Spark can connect to several types of cluster managers
(either Spark’s own standalone cluster manager, Mesos or
YARN)
[Figure: the Apache Spark software stack]
• Processing: Spark Stream, Spark SQL, Spark ML, other applications
• Spark Core
• Resource manager: Standalone Scheduler, Mesos etc., Yet Another Resource Negotiator (YARN)
• Data storage: S3, Cassandra etc., other storage systems; Hadoop NoSQL Database (HBase); Hadoop Distributed File System (HDFS)
• Data ingestion systems: e.g., Apache Kafka, Flume, etc.

6
Apache Hadoop: No Unified Vision

• Sparse Modules
• Diversity of APIs
• Higher Operational Costs
7
Spark Ecosystem: A Unified Pipeline

8
Spark vs MapReduce: Data Flow

9
Data Access Rates

• Within a node:
  ▪ CPU to memory: 10 GB/sec
  ▪ CPU to hard disk: 0.1 GB/sec
  ▪ CPU to SSD: 0.6 GB/sec
• Nodes between networks: 0.125 GB/sec to 1 GB/sec
• Nodes in the same rack: 0.125 GB/sec to 1 GB/sec
• Nodes between racks: 0.1 GB/sec
10
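
To make these rates concrete, the sketch below estimates how long it would take to move a dataset over a few of the paths listed above. The 100 GB dataset size is a made-up value chosen only for illustration, and the model ignores latency and contention.

```python
# Approximate transfer rates taken from the list above, in GB/sec.
RATES_GB_PER_SEC = {
    "CPU to memory": 10.0,
    "CPU to SSD": 0.6,
    "CPU to hard disk": 0.1,
    "Between racks": 0.1,
}

DATASET_GB = 100.0  # hypothetical dataset size, for illustration only


def scan_seconds(size_gb, rate_gb_per_sec):
    """Time to move size_gb at a constant rate, ignoring latency and contention."""
    return size_gb / rate_gb_per_sec


estimates = {
    path: scan_seconds(DATASET_GB, rate) for path, rate in RATES_GB_PER_SEC.items()
}

for path, seconds in estimates.items():
    print(f"{path}: about {seconds:.0f} s")
```

Under these assumptions, reading from memory takes about 10 s while reading from a hard disk takes about 1000 s, which is the gap that in-memory processing targets.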
Spark: High Performance & Simple Data Flow

11
Performance: Spark vs MapReduce (1)
• Iterative algorithms
  ▪ Spark is faster → a simplified data flow
  ▪ Avoids materializing data on HDFS after each iteration
• Example: k-means algorithm, 1 iteration
  ▪ HDFS Read
  ▪ Map(Assign sample to closest centroid)
  ▪ GroupBy(Centroid_ID)
  ▪ NETWORK Shuffle
  ▪ Reduce(Compute new centroids)
  ▪ HDFS Write
12
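
The k-means iteration listed on the previous slide can be sketched in plain Python as follows. This is a single-machine illustration, not Spark or MapReduce code: the HDFS read and write steps are left out, the sample points and starting centroids are made up, and the function names are hypothetical.

```python
from collections import defaultdict


def closest_centroid(point, centroids):
    """Return the index of the centroid nearest to point (squared Euclidean distance)."""
    distances = [
        sum((p - c) ** 2 for p, c in zip(point, centroid)) for centroid in centroids
    ]
    return distances.index(min(distances))


def kmeans_iteration(points, centroids):
    # Map: assign each sample to its closest centroid -> (centroid_id, point)
    assigned = [(closest_centroid(p, centroids), p) for p in points]

    # GroupBy(centroid_id): on a cluster, this is where the network shuffle happens
    groups = defaultdict(list)
    for centroid_id, point in assigned:
        groups[centroid_id].append(point)

    # Reduce: the new centroid is the mean of the points in each group
    new_centroids = list(centroids)  # a centroid with no points keeps its position
    for centroid_id, members in groups.items():
        dims = len(members[0])
        new_centroids[centroid_id] = tuple(
            sum(m[d] for m in members) / len(members) for d in range(dims)
        )
    return new_centroids


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centroids = [(1.0, 1.0), (9.0, 9.0)]
new_centroids = kmeans_iteration(points, centroids)
print(new_centroids)
```

In MapReduce, each call to such an iteration would be bracketed by an HDFS read and an HDFS write; keeping `points` in memory across iterations is what removes that cost.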
Performance: Spark vs MapReduce (2)

13
Code: Hadoop vs Spark (e.g., Word Count)

• Simple/Less code
• Multiple stages → pipeline
• Operations
  ▪ Transformations: apply user code to distributed data in parallel
  ▪ Actions: assemble final output from distributed data

14
Motivation (1)
MapReduce: The original scalable, general-purpose processing engine of the Hadoop ecosystem
• Disk-based data processing framework (HDFS files)
• Persists intermediate results to disk
• Data is reloaded from disk with every query → Costly I/O
• Best for ETL-like workloads (batch processing)
• Costly I/O → Not appropriate for iterative or stream processing workloads

15
Motivation (2)
Spark: General-purpose computational framework that substantially improves on the performance of MapReduce, but retains the basic model
• Memory-based data processing framework → avoids costly I/O by keeping intermediate results in memory
• Leverages distributed memory
• Remembers the operations applied to a dataset
• Data-locality-based computation → High Performance
• Best for both iterative (or stream processing) and batch workloads
16
Motivation - Summary
Software engineering point of view
  ▪ Hadoop code base is huge
  ▪ Contributions/Extensions to Hadoop are cumbersome
  ▪ Java-only hinders wide adoption, but Java support is fundamental
System/Framework point of view
  ▪ Unified pipeline
  ▪ Simplified data flow
  ▪ Faster processing speed
Data abstraction point of view
  ▪ New fundamental abstraction: RDD
  ▪ Easy to extend with new operators
  ▪ More descriptive computing model

17
Today’s Topics
• Motivation
• Spark Basics
• Spark Programming

18
Spark Basics (1)
Spark: Flexible, in-memory data processing framework written in Scala
Goals:
• Simplicity (Easier to use):
  ▪ Rich APIs for Scala, Java, and Python
• Generality: APIs for different types of workloads
  ▪ Batch, Streaming, Machine Learning, Graph
• Low Latency (Performance): In-memory processing and caching
• Fault-tolerance: Faults shouldn’t be a special case

19
Spark Basics (2)
There are two ways to manipulate data in Spark:
• Spark Shell:
  ▪ Interactive – for learning or data exploration
  ▪ Python or Scala
• Spark Applications:
  ▪ For large-scale data processing
  ▪ Python, Scala, or Java

20
Spark Shell
The Spark Shell provides interactive data exploration (REPL)

REPL: Read/Evaluate/Print Loop

21
Spark Core: Code Base (2012)

22
Spark Fundamentals
Example of an application:
• Spark Context
• Resilient Distributed Datasets (RDDs)
• Transformations
• Actions

23
Spark: Fundamentals
Spark Context
Resilient Distributed Datasets
(RDDs)
Transformations
Actions

24
Spark Context (1)
• Every Spark application requires a Spark Context: the main entry point to the Spark API
• Spark Shell provides a preconfigured Spark Context called “sc”

25
Spark Context (2)
• Standalone applications → Driver code → Spark Context
• Spark Context represents a connection to a Spark cluster

Standalone Application
(Driver Program)

26
Spark Context (3)
Spark Context works as a client and represents a connection to a Spark cluster

27
Spark Fundamentals
Example of an application:
• Spark Context
• Resilient Distributed Datasets (RDDs)
• Transformations
• Actions

28
Resilient Distributed Dataset
RDD (Resilient Distributed Dataset) is the fundamental unit of data in Spark: an immutable collection of objects (or records, or elements) that can be operated on “in parallel” (spread across a cluster)
Resilient -- if data in memory is lost, it can be recreated
• Recovers from node failures
• An RDD keeps its lineage information → it can be recreated from parent RDDs
Distributed -- processed across the cluster
• Each RDD is composed of one or more partitions → more partitions mean more parallelism
Dataset -- initial data can come from a file or be created programmatically
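
The idea of recreating lost data from lineage can be illustrated with a small toy model. The `ToyRDD` class below is hypothetical and is not Spark's implementation; it only shows that a dataset which remembers its parent and the function applied to it can be rebuilt after its in-memory copy is lost.

```python
class ToyRDD:
    """A toy stand-in for an RDD: it stores how to compute its data, not only the data."""

    def __init__(self, source=None, parent=None, fn=None):
        self.source = source   # base data (set only for a root dataset)
        self.parent = parent   # lineage: the dataset this one was derived from
        self.fn = fn           # lineage: the function applied to the parent's data
        self._cached = None    # in-memory copy, which may be lost

    def map(self, f):
        return ToyRDD(parent=self, fn=lambda data: [f(x) for x in data])

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda data: [x for x in data if pred(x)])

    def compute(self):
        """Return the data, rebuilding it from the lineage if no copy is in memory."""
        if self._cached is None:
            if self.parent is None:
                self._cached = list(self.source)
            else:
                self._cached = self.fn(self.parent.compute())
        return self._cached

    def lose_memory(self):
        """Simulate a node failure that wipes the in-memory copy."""
        self._cached = None


base = ToyRDD(source=range(1, 6))
evens_squared = base.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

before = evens_squared.compute()
evens_squared.lose_memory()      # the data is gone, but the lineage is not
after = evens_squared.compute()  # recreated from the parent datasets
print(before, after)
```

Because every dataset in the chain is immutable, recomputing from the parents always yields the same result as the lost copy.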

29
RDDs
Key Idea: Write applications in terms of transformations on distributed datasets
• Collections of objects spread across a cluster, held in a memory caching layer that stores data in a distributed, fault-tolerant cache
• Can fall back to disk when a dataset does not fit in memory
• Built through parallel transformations (map, filter, group-by, join, etc.)
• Automatically rebuilt on failure
• Controllable persistence (e.g., caching in RAM)
30
RDDs -- Immutability
• Immutability → lineage information → can be recreated at any time → Fault-tolerance
• Avoids data inconsistency problems → no simultaneous updates → Correctness
• Can live in memory as easily as on disk → Caching → Safe to share across processes/tasks → Improves performance
• Tradeoff: (Fault-tolerance & Correctness) vs (Disk, Memory & CPU)

31
Creating an RDD
Three ways to create an RDD
• From a file or set of files
• From data in memory
• From another RDD

32
Example: A File-based RDD

33
Spark Fundamentals
Example of an application:
• Spark Context
• Resilient Distributed Datasets (RDDs)
• Transformations
• Actions

34
RDD Operations
Two types of operations:
• Transformations: define a new RDD based on current RDD(s)
• Actions: return values

35
RDD Transformations
• A set of operations on an RDD that define how it should be transformed
• As in relational algebra, applying a transformation to an RDD yields a new RDD (because RDDs are immutable)
• Transformations are lazily evaluated, which allows optimizations to take place before execution
• Examples: map(), filter(), groupByKey(), sortByKey(), etc.
36
Example: map and filter Transformations

37
RDD Actions
• Apply transformation chains on RDDs, eventually performing some additional operations (e.g., counting)
• Some actions only store data to an external data source (e.g., HDFS); others fetch data from the RDD (and its transformation chain) upon which the action is applied, and convey it to the driver
• Some common actions:
  ▪ count() – return the number of elements
  ▪ take(n) – return an array of the first n elements
  ▪ collect() – return an array of all elements
  ▪ saveAsTextFile(file) – save to text file(s)

38
Lazy Execution of RDDs (1)
Data in RDDs is not processed
until an action is performed

39
Lazy Execution of RDDs (2)
Data in RDDs is not processed
until an action is performed

40
Lazy Execution of RDDs (3)
Data in RDDs is not processed
until an action is performed

41
Lazy Execution of RDDs (4)
Data in RDDs is not processed
until an action is performed

42
Lazy Execution of RDDs (5)
Data in RDDs is not processed
until an action is performed
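
Lazy execution can be observed with an analogy in plain Python, where the built-in `map` and `filter` return lazy iterators. This is only an analogy and not how Spark is implemented; the `calls` list and the helper functions are introduced here to record when work actually happens.

```python
calls = []  # records every element the functions below actually touch


def square(x):
    calls.append(("square", x))
    return x * x


def is_even(x):
    calls.append(("is_even", x))
    return x % 2 == 0


nums = [1, 2, 3, 4]

# "Transformations": these two lines only describe the pipeline.
squares = map(square, nums)
evens = filter(is_even, squares)
calls_before_action = len(calls)  # nothing has run yet

# "Action": materializing the result forces the whole chain to execute.
result = list(evens)
calls_after_action = len(calls)

print(calls_before_action, result, calls_after_action)
```

No function is called until the result is requested, at which point each of the four elements passes through both steps.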

43
Example: Log Mining
Load error messages from a log into memory, then interactively
search for various patterns:

lines = sc.textFile("hdfs://...")  # HadoopRDD
errors = lines.filter(lambda s: s.startswith("ERROR"))  # FilteredRDD
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()
messages.filter(lambda s: "foo" in s).count()

Result: full-text search of Wikipedia in 0.5 sec (vs 20 sec for on-disk data)
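
To see what each step of the log-mining pipeline produces, here is the same logic on an ordinary Python list. The log lines are invented for illustration, and the list `messages` merely plays the role of the cached RDD; no Spark API is involved.

```python
# Hypothetical log lines: level, timestamp, and message separated by tabs.
log_lines = [
    "ERROR\t2018-03-01 10:00:01\tdisk foo failed",
    "INFO\t2018-03-01 10:00:02\tjob started",
    "ERROR\t2018-03-01 10:00:03\tnetwork timeout",
    "ERROR\t2018-03-01 10:00:04\tfoo service unreachable",
]

# filter: keep only the error lines
errors = [s for s in log_lines if s.startswith("ERROR")]

# map: keep only the third tab-separated field (the message text)
messages = [s.split("\t")[2] for s in errors]

# messages is now held in memory, so each interactive query below
# scans it without re-reading or re-parsing the log.
foo_count = sum(1 for s in messages if "foo" in s)
timeout_count = sum(1 for s in messages if "timeout" in s)

print(messages)
print(foo_count, timeout_count)
```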

44
RDD and Partitions (More Parallelism)

45
RDD Graph: Data Set vs Partition Views
Much like in Hadoop MapReduce, each RDD is associated with (input) partitions
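
The link between partitions and parallelism can be sketched on one machine, with one task per partition. The helper names are hypothetical, and a thread pool stands in for the cluster's executors; this is an illustration of the idea rather than Spark's scheduler.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Split data into num_partitions contiguous chunks of near-equal size."""
    size, extra = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def process_partition(partition):
    """The per-partition task: here, sum the squares of the elements."""
    return sum(x * x for x in partition)


data = list(range(1, 11))
partitions = split_into_partitions(data, 4)

# One task per partition; more partitions allow more tasks to run at once.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(process_partition, partitions))

total = sum(partial_sums)
print(partitions, partial_sums, total)
```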

46
RDDs: Data Locality
• Data Locality Principle
  ▪ Same as for Hadoop MapReduce
  ▪ Avoids network I/O; workers should manage local data
• Data Locality and Caching
  ▪ First run: data not in cache, so use HadoopRDD’s locality preferences (from HDFS)
  ▪ Second run: FilteredRDD is in cache, so use its locations
  ▪ If something falls out of cache, go back to HDFS

47
RDDs -- Summary
RDDs are partitioned, locality-aware, distributed collections
  ▪ RDDs are immutable
RDDs are data structures that:
  ▪ Either point to a direct data source (e.g., HDFS)
  ▪ Or apply some transformations to their parent RDD(s) to generate new data elements
Computations on RDDs
  ▪ Represented by lazily evaluated lineage DAGs composed of chained RDDs

48
Lifetime of a Job in Spark

49
Anatomy of a Spark Application

Cluster Manager
(YARN/Mesos)

50
Typical RDD pattern of use
A Hadoop job uses an RDD to transform some input object; the RDD is like a “recipe” for generating a cooked version of the object.
The task might further transform the RDD output with additional RDDs, in the style of a functional program.
Eventually, some task consumes the RDD output (or perhaps several of these RDDs) as part of a MapReduce-style computation.

51
Spark: Key Techniques for Performance
Spark is an “execution engine for computing RDDs” that also decides when to perform the actual computation, where to place tasks (on the Hadoop cluster), and whether to cache RDD output.
It avoids recomputing an RDD by saving its output if it will be needed again, and arranges for tasks to run close to these cached RDDs (or in a place where later tasks will use the same RDD output).

52
Why is this a good strategy?
If MapReduce jobs were arbitrary programs, this wouldn’t help.
But in fact the MapReduce model is valuable because it often applies the same
transformations again and again on input files.
Also, MapReduce is often run again and again until a machine learning model
converges, or some huge batch of input is consumed, and by caching RDDs,
Spark can avoid wasteful effort.
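
The saving from caching in an iterative job can be made visible by counting how often the input is loaded. The sketch below is a toy model: `CountingSource` is a hypothetical stand-in for an input file, and the update rule inside the loop is an arbitrary placeholder for one step of a real algorithm.

```python
class CountingSource:
    """A stand-in for an input file that counts how often it is read."""

    def __init__(self, values):
        self.values = values
        self.reads = 0

    def load(self):
        self.reads += 1
        return list(self.values)


def iterate_without_cache(source, iterations):
    """Reload the input on every iteration, as a chain of separate jobs would."""
    estimate = 0.0
    for _ in range(iterations):
        data = source.load()
        estimate = (estimate + sum(data) / len(data)) / 2  # placeholder update step
    return estimate


def iterate_with_cache(source, iterations):
    """Load the input once and reuse the in-memory copy on every iteration."""
    data = source.load()
    estimate = 0.0
    for _ in range(iterations):
        estimate = (estimate + sum(data) / len(data)) / 2  # same placeholder step
    return estimate


uncached_source = CountingSource([2.0, 4.0, 6.0])
cached_source = CountingSource([2.0, 4.0, 6.0])

uncached_result = iterate_without_cache(uncached_source, 10)
cached_result = iterate_with_cache(cached_source, 10)
print(uncached_source.reads, cached_source.reads)
```

Both versions compute the same result, but the uncached one reads the input ten times and the cached one reads it once.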

53
Iterative Algorithms: Spark vs MapReduce

54
Today’s Topics
• Motivation
• Spark Basics
• Spark Programming

55
Spark Programming (1)
Creating RDDs
# Turn a Python collection into an RDD
sc.parallelize([1, 2, 3])

# Load text file from local FS, HDFS, or S3
sc.textFile("file.txt")
sc.textFile("directory/*.txt")
sc.textFile("hdfs://namenode:9000/path/file")

# Use existing Hadoop InputFormat (Java/Scala only)
sc.hadoopFile(keyClass, valClass, inputFmt, conf)
56
Spark Programming (2)
Basic Transformations

nums = sc.parallelize([1, 2, 3])

# Pass each element through a function
squares = nums.map(lambda x: x*x)  # => {1, 4, 9}

# Keep elements passing a predicate
even = squares.filter(lambda x: x % 2 == 0)  # => {4}

57
Spark Programming (3)
Basic Actions
nums = sc.parallelize([1, 2, 3])

# Retrieve RDD contents as a local collection
nums.collect() # => [1, 2, 3]

# Return first K elements
nums.take(2) # => [1, 2]

# Count number of elements
nums.count() # => 3

# Merge elements with an associative function
nums.reduce(lambda x, y: x + y) # => 6
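
Why the function passed to reduce should be associative can be shown without Spark. In the sketch below, `partitioned_reduce` is a hypothetical helper that mimics reducing each partition separately and then combining the partial results; the partition layouts are chosen only for illustration.

```python
from functools import reduce


def partitioned_reduce(partitions, fn):
    """Reduce each partition on its own, then reduce the partial results."""
    partials = [reduce(fn, part) for part in partitions]
    return reduce(fn, partials)


def add(x, y):
    return x + y


def subtract(x, y):
    return x - y


two_partitions = [[1, 2, 3], [4, 5, 6]]
three_partitions = [[1, 2], [3, 4], [5, 6]]

# Addition is associative: every partitioning gives the same answer.
add_two = partitioned_reduce(two_partitions, add)
add_three = partitioned_reduce(three_partitions, add)

# Subtraction is not: the answer depends on where the partition boundaries fall.
sub_two = partitioned_reduce(two_partitions, subtract)
sub_three = partitioned_reduce(three_partitions, subtract)

print(add_two, add_three, sub_two, sub_three)
```

With a non-associative function, the result changes with the number of partitions, so it would not be reproducible across cluster configurations.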
58
Spark Programming (4)
Working with Key-Value Pairs
Spark’s “distributed reduce” transformations operate on RDDs of
key-value pairs

Python: pair = (a, b)
pair[0] # => a
pair[1] # => b

Scala: val pair = (a, b)
pair._1 // => a
pair._2 // => b

Java: Tuple2 pair = new Tuple2(a, b);
pair._1 // => a
pair._2 // => b
59
Spark Programming (5)
Some Key-Value Operations

pets = sc.parallelize([("cat", 1), ("dog", 1), ("cat", 2)])

pets.reduceByKey(lambda x, y: x + y) # => {(cat, 3), (dog, 1)}

pets.groupByKey() # => {(cat, [1, 2]), (dog, [1])}

pets.sortByKey() # => {(cat, 1), (cat, 2), (dog, 1)}
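
The semantics of these three operations can be reproduced on a local list of pairs. The functions below are plain-Python stand-ins with hypothetical names; they show what each operation computes, not how Spark distributes the work.

```python
from collections import defaultdict

pets = [("cat", 1), ("dog", 1), ("cat", 2)]


def group_by_key(pairs):
    """Collect all values that share a key: key -> [values]."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_by_key(pairs, fn):
    """Merge the values of each key with fn, one pair at a time."""
    merged = {}
    for key, value in pairs:
        merged[key] = fn(merged[key], value) if key in merged else value
    return merged


def sort_by_key(pairs):
    """Order the pairs by key; sorted() is stable, so ties keep their input order."""
    return sorted(pairs, key=lambda pair: pair[0])


grouped = group_by_key(pets)
reduced = reduce_by_key(pets, lambda x, y: x + y)
ordered = sort_by_key(pets)
print(grouped, reduced, ordered)
```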

60
Example: Word Count
lines = sc.textFile("hamlet.txt")
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda x, y: x + y))
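
The difference between map and flatMap in the word count is easiest to see on a small local example. The two input lines below are made up, and the code uses ordinary Python lists in place of RDDs.

```python
from itertools import chain

lines = ["to be or not to be", "that is the question"]  # made-up input lines

# map produces one output per input line: a list of lists of words
mapped = [line.split(" ") for line in lines]

# flatMap flattens those lists into a single sequence of words
words = list(chain.from_iterable(mapped))

# map each word to (word, 1), then sum the ones per word (the reduceByKey step)
counts = {}
for word, one in ((w, 1) for w in words):
    counts[word] = counts.get(word, 0) + one

print(len(mapped), len(words), counts["to"], counts["be"])
```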

61
Example: Spark Streaming

Represents streams as a series of RDDs over time (typically at sub-second intervals, but this is configurable)

val spammers = sc.sequenceFile("hdfs://spammers.seq")

sc.twitterStream(...)
  .filter(t => t.text.contains("Santa Clara University"))
  .transform(tweets => tweets.map(t => (t.user, t)).join(spammers))
  .print()
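
The idea of treating a stream as a series of small batches can be sketched in plain Python. The events, the 0.5-second interval, and the `micro_batches` helper are all assumptions made for illustration; intervals with no events are simply skipped in this toy version.

```python
from collections import defaultdict


def micro_batches(events, interval):
    """Group (timestamp, value) events into consecutive batches of `interval` seconds."""
    batches = defaultdict(list)
    for timestamp, value in events:
        batches[int(timestamp // interval)].append(value)
    return [batches[i] for i in sorted(batches)]


# Hypothetical stream of (seconds since start, text) events.
events = [(0.1, "a spark"), (0.4, "b"), (0.6, "c spark"), (1.2, "d spark"), (1.9, "e")]

batches = micro_batches(events, interval=0.5)

# The same per-batch computation is applied to every batch as it arrives.
matches_per_batch = [sum(1 for text in batch if "spark" in text) for batch in batches]
print(batches, matches_per_batch)
```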

62
Spark: Combining Libraries (Unified Pipeline)
# Load data using Spark SQL
points = spark.sql("select latitude, longitude from tweets")

# Train a machine learning model
model = KMeans.train(points, 10)

# Apply it to a stream
(sc.twitterStream(...)
   .map(lambda t: (model.predict(t.location), 1))
   .reduceByWindow("5s", lambda a, b: a + b))

63
Spark: Setting the Level of Parallelism
All the pair RDD operations take an optional second
parameter for number of tasks

words.reduceByKey(lambda x, y: x + y, 5)
words.groupByKey(5)
visits.join(pageViews, 5)
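
What the task-count parameter controls can be illustrated by routing keys to a fixed number of partitions, each of which would be handled by one reduce task. This is a single-machine sketch: the use of CRC32 as the hash and the helper names are assumptions for illustration, not Spark's actual partitioner.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_partitions):
    """Pick a partition from a stable hash of the key (CRC32 is an illustrative choice)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def reduce_by_key(pairs, fn, num_partitions):
    """Route each pair to a partition by key, then reduce within each partition."""
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[partition_for(key, num_partitions)][key].append(value)

    results = []
    for part in partitions:  # each partition would be one reduce task
        reduced = {}
        for key, values in part.items():
            total = values[0]
            for v in values[1:]:
                total = fn(total, v)
            reduced[key] = total
        results.append(reduced)
    return results


words = [("a", 1), ("b", 1), ("a", 1), ("c", 1), ("b", 1), ("a", 1)]
per_task = reduce_by_key(words, lambda x, y: x + y, 5)

merged = {}
for task_output in per_task:
    merged.update(task_output)
print(len(per_task), merged)
```

All pairs with the same key land in the same partition, so each key is reduced by exactly one task regardless of how many tasks are requested.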

64
MapReduce vs Spark (Summary)
Performance:
  ▪ While Spark performs better when all the data fits in main memory (especially on dedicated clusters), MapReduce is designed for data that doesn’t fit in memory
Ease of Use:
  ▪ Spark is easier to use than Hadoop MapReduce, as it comes with user-friendly APIs for Scala (its native language), Java, Python, and Spark SQL.
Fault-tolerance:
  ▪ Batch processing: Spark → HDFS replication
  ▪ Stream processing: Spark RDDs replicated

65
Summary

Spark is a powerful “manager” for big data computing.

It centers on a job scheduler for Hadoop (MapReduce) that is smart
about where to run each task: co-locate task with data.
The data objects are “RDDs”: a kind of recipe for generating a file from
an underlying data collection. RDD caching allows Spark to run mostly
from memory-mapped data, for speed.

• Online tutorials: spark.apache.org/docs/latest
