Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_71)
Type in expressions to have them evaluated.
Spark context available as sc
scala>
4. SPARK – CORE PROGRAMMING
Spark Core is the base of the whole project. It provides distributed task dispatching,
scheduling, and basic I/O functionalities. Spark uses a specialized fundamental data
structure known as the RDD (Resilient Distributed Dataset), which is a logical collection of
data partitioned across machines. RDDs can be created in two ways: one is by
referencing datasets in external storage systems, and the second is by applying
transformations (e.g. map, filter, reduceByKey, join) to existing RDDs.
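As a minimal sketch in the Spark shell (the file name data.txt and the filter predicate are
assumptions for illustration), the two creation paths look like this:

scala> val fromStorage = sc.textFile("data.txt")                               // reference a dataset in external storage
scala> val fromExisting = fromStorage.filter(line => line.contains("Spark"))   // apply a transformation to an existing RDD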
Spark Shell
Spark provides an interactive shell: a powerful tool to analyze data interactively. It is
available in either Scala or Python. Spark's primary abstraction is a distributed
collection of items called a Resilient Distributed Dataset (RDD). RDDs can be created
from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs.
$ spark-shell
The Spark RDD API introduces a few Transformations and a few Actions to manipulate
RDDs.
RDD Transformations
An RDD transformation returns a pointer to a new RDD and allows you to create dependencies
between RDDs. Each RDD in the dependency chain (string of dependencies) has a function
for calculating its data and a pointer (dependency) to its parent RDD.
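As a small sketch (the file name and the map function here are assumptions), the dependency
chain of an RDD can be inspected in the shell with toDebugString, which prints each RDD and
its parent:

scala> val lines = sc.textFile("data.txt")
scala> val lengths = lines.map(line => line.length)   // lengths depends on lines, which depends on the file
scala> println(lengths.toDebugString)                 // prints the lineage (chain of dependencies)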
Spark is lazy, so nothing will be executed unless you call some transformation or action
that triggers job creation and execution. Therefore, an RDD transformation is not a set of
data but a step in a program (possibly the only step) telling Spark how to get data and
what to do with it. Look at the following snippet of the word-count example.
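A minimal sketch of that word-count chain in the Spark shell (the file name input.txt and
the output directory are assumptions here); nothing is computed until the final action is
called:

scala> val text = sc.textFile("input.txt")                  // RDD of lines
scala> val words = text.flatMap(line => line.split(" "))    // RDD of words
scala> val pairs = words.map(word => (word, 1))             // RDD of (word, 1) pairs
scala> val counts = pairs.reduceByKey(_ + _)                // RDD of (word, count) pairs
scala> counts.saveAsTextFile("output")                      // action: triggers the whole computation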
Given below is a list of RDD transformations.

map(func)
   Returns a new distributed dataset, formed by passing each element of the source through a function func.

filter(func)
   Returns a new dataset formed by selecting those elements of the source on which func returns true.

flatMap(func)
   Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).

mapPartitions(func)
   Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator<T> => Iterator<U> when running on an RDD of type T.

mapPartitionsWithIndex(func)
   Similar to mapPartitions, but also provides func with an integer value representing the index of the partition, so func must be of type (Int, Iterator<T>) => Iterator<U> when running on an RDD of type T.
union(otherDataset)
   Returns a new dataset that contains the union of the elements in the source dataset and the argument.

intersection(otherDataset)
   Returns a new RDD that contains the intersection of elements in the source dataset and the argument.

distinct([numTasks])
   Returns a new dataset that contains the distinct elements of the source dataset.
groupByKey([numTasks])
   When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.

reduceByKey(func, [numTasks])
   When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func.

sortByKey([ascending], [numTasks])
   When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified by the ascending argument.

join(otherDataset, [numTasks])
   When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.

cogroup(otherDataset, [numTasks])
   When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (Iterable<V>, Iterable<W>)) tuples. This operation is also called groupWith.
cartesian(otherDataset)
   When called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements).

pipe(command, [envVars])
   Pipes each partition of the RDD through a shell command, e.g. a Perl or bash script. RDD elements are written to the process's stdin and lines output to its stdout are returned as an RDD of strings.

coalesce(numPartitions)
   Decreases the number of partitions in the RDD to numPartitions. Useful for running operations more efficiently after filtering down a large dataset.

repartition(numPartitions)
   Reshuffles the data in the RDD randomly to create either more or fewer partitions and balance it across them. This always shuffles all data over the network.

repartitionAndSortWithinPartitions(partitioner)
   Repartitions the RDD according to the given partitioner and, within each resulting partition, sorts records by their keys. This is more efficient than calling repartition and then sorting within each partition because it can push the sorting down into the shuffle machinery.
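As an illustrative sketch (the sample data below is an assumption, not part of the tutorial),
a few of these transformations applied in the Spark shell; they only build the dependency
chain, and nothing runs until an action is called:

scala> val nums = sc.parallelize(List(1, 2, 3, 4, 5))
scala> val doubled = nums.map(x => x * 2)                        // map: apply a function to every element
scala> val multiplesOfFour = doubled.filter(x => x % 4 == 0)     // filter: keep elements matching a predicate
scala> val pairs = sc.parallelize(List(("a", 1), ("b", 2), ("a", 3)))
scala> val sums = pairs.reduceByKey(_ + _)                       // reduceByKey: aggregate values per key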
Actions
The following is a list of Actions, which return values.
reduce(func)
   Aggregates the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.

collect()
   Returns all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
count()
   Returns the number of elements in the dataset.

first()
   Returns the first element of the dataset (similar to take(1)).

take(n)
   Returns an array with the first n elements of the dataset.

takeSample(withReplacement, num, [seed])
   Returns an array with a random sample of num elements of the dataset, with or without replacement, optionally pre-specifying a random number generator seed.

takeOrdered(n, [ordering])
   Returns the first n elements of the RDD using either their natural order or a custom comparator.
saveAsTextFile(path)
   Writes the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark calls toString on each element to convert it to a line of text in the file.

saveAsObjectFile(path)
   Writes the elements of the dataset in a simple format using Java serialization, which can then be loaded using SparkContext.objectFile().

countByKey()
   Only available on RDDs of type (K, V). Returns a hashmap of (K, Int) pairs with the count of each key.
foreach(func)
   Runs a function func on each element of the dataset. This is usually done for side effects such as updating an Accumulator or interacting with external storage systems.
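A short sketch of a few of these actions in the Spark shell (the sample data is assumed for
illustration); unlike transformations, each of these calls triggers a job and returns a
value to the driver:

scala> val words = sc.parallelize(List("spark", "hadoop", "spark", "scala"))
scala> words.count()                          // 4
scala> words.first()                          // "spark"
scala> words.take(2)                          // Array(spark, hadoop)
scala> words.map(_.length).reduce(_ + _)      // 21, the total number of characters
scala> words.foreach(println)                 // side effects only; on a cluster the output goes to the executors' stdout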
Example
Consider a word-count example: it counts each word appearing in a document. Consider
the following text as an input; it is saved as an input.txt file in the home directory.