Lab 04 Spark APIs

The document provides an overview of Spark APIs, focusing on low-level APIs, RDDs, and their manipulation through transformations and actions. It explains how to create RDDs from local collections and external data sources, as well as various operations like filtering, mapping, and set operations. Additionally, it discusses the importance of caching and persisting RDDs to optimize performance during computations.

Spark APIs

Lab goals:
▪ How to use Spark APIs:
o Low-level API
o DataFrame
o SQL

Spark Low-Level API


There are two sets of low-level APIs:

▪ one for manipulating distributed data (RDDs),
▪ another for distributing and manipulating distributed shared variables (broadcast variables and accumulators).

How to Use the Low-Level APIs?

A SparkContext is the entry point for low-level API functionality. You access it through the
SparkSession, which is the tool you use to perform computation across a Spark cluster. You can access
a SparkContext via the following call:
spark.sparkContext
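
In the Spark shell, the spark (SparkSession) and sc (SparkContext) values are already created for you. For reference, here is a minimal sketch of how you would obtain them in a standalone application (the application name and master below are placeholders):

// in Scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Lab04")     // placeholder application name shown in the Spark UI
  .master("local[*]")   // assumption: running locally for the lab
  .getOrCreate()

val sc = spark.sparkContext // entry point to the low-level (RDD) API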

About RDDs

RDDs are not as commonly used. However, virtually all Spark code you run, whether DataFrames or
Datasets, compiles down to an RDD. The Spark UI also describes job execution in terms of RDDs.
Therefore, it will behoove you to have at least a basic understanding of what an RDD is and how to
use it.

RDD manipulation Lab


Creating an RDD
There are two ways to create an RDD: from a local collection (e.g. list) or from an external data
source. Let’s look at each of these methods next.

From a Local Collection

To create an RDD from a collection, you will need to use the parallelize method on a SparkContext
(within a SparkSession). This turns a single node collection into a parallel collection. When creating
this parallel collection, you can also explicitly state the number of partitions into which you would like
to distribute this array. In this case, we are creating two partitions:

Ex1:
// in Scala
val myCollection = "Spark The Definitive Guide : Big Data Processing Made Simple".split(" ")
val words = sc.parallelize(myCollection, 2)

words.foreach(println)
Ex2:
val numbers = sc.parallelize(List(1, 2, 3, 4))
numbers.foreach(x => println(x))

Ex3:
val numbers = sc.parallelize(Range(10, 50, 2), 4)
numbers.foreach(x => println(x))

From Data Sources

Although you can create RDDs from data sources or text files, it’s often preferable to use the Data
Source APIs. RDDs do not have a notion of “Data Source APIs” like DataFrames do. The Data Source
API is almost always a better way to read in data. That being said, you can also read data as RDDs
using sparkContext. For example, let’s read a text file line by line:
spark.sparkContext.textFile("/some/path/withTextFiles")

This creates an RDD for which each record in the RDD represents a line in that text file or files.
Alternatively, you can read in data so that each text file becomes a single record. The use case here is when each file consists of a large JSON object or some document that you will operate on as a whole:
spark.sparkContext.wholeTextFiles("/some/path/withTextFiles")

In this RDD, the name of the file is the first object and the value of the text file is the second string
object.
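
For example, here is a quick sketch of inspecting those (fileName, content) pairs, reusing the placeholder path above:

// in Scala
val filesRdd = spark.sparkContext.wholeTextFiles("/some/path/withTextFiles")
// each element is a (fileName, fileContent) pair
filesRdd.map { case (fileName, content) => (fileName, content.length) }.collect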

Additional features

An additional feature is that you can name this RDD so that it shows up in the Spark UI under a given name:
// in Scala
words.setName("myWords")
words.name // myWords

The getNumPartitions method returns the number of partitions of the RDD.


// in Scala
words.getNumPartitions

Examples
// in Terminal
mkdir toto
cd toto

echo "Apache Spark


Big Data and Analytics using Spark
Learning Spark
Real time Spark Streaming
Machine Learning using Spark
Spark using Scala
Pyspark
Spark and Kafka
Spark and R
Spark SQL" > keywords.txt
echo "The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.
On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat." > quangle.txt

//In Scala
val rdd1 = sc.textFile("toto/keywords.txt")
rdd1.getNumPartitions
rdd1.collect
rdd1.first

//In Scala
val rdd2 = sc.textFile("toto/")
rdd2.getNumPartitions
rdd2.collect
rdd2.first

//In Scala
val rdd3 = sc.wholeTextFiles("toto/")
rdd3.getNumPartitions
rdd3.collect
rdd3.first

//In Scala
val rdd4 = sc.textFile("toto/keywords.txt", 3)
rdd4.getNumPartitions
val rdd5 = spark.sparkContext.wholeTextFiles("toto/", 4)
rdd5.getNumPartitions

// in Terminal
start-dfs.sh
hadoop fs -mkdir toto
hadoop fs -put toto/*.txt toto

//In Scala
val rdd6 = sc.textFile("hdfs://localhost/user/hadoop/toto/")
rdd6.getNumPartitions
rdd6.collect
rdd6.first

//In Scala
val rdd7 = sc.wholeTextFiles("hdfs://localhost/user/hadoop/toto/")
rdd7.getNumPartitions
rdd7.collect
rdd7.first

Manipulating RDDs
Transformations
You specify transformations on one RDD to create another. In doing so, we define an RDD as a
dependency to another along with some manipulation of the data contained in that RDD.

Element-wise transformations
map(func): Returns a new RDD by applying the function func to each element of the source RDD.
Objective: To illustrate a map(func) transformation.

Action: Create an RDD of a numeric list. Then apply map(func) to multiply each element by 2.
val mapRdd = sc.parallelize(List(1, 2, 3, 4)) // Line 1
val mapRdd1 = mapRdd.map(x => x * 2) // Line 2

mapRdd1.collect

flatMap(func): Like map, but each item can be mapped to zero, one, or more items.

Objective: To illustrate the flatMap(func) transformation.

Action: Create an RDD for a list of Strings, apply flatMap(func).


val flatMapRdd = sc.parallelize(List("hello world", "hi")) //Line 1
val flatMapRdd1= flatMapRdd.flatMap(line => line.split(" ")) //Line 2

Another example:
val myCollection = "Spark The Definitive Guide : Big Data Processing Made Simple".split(" ")
val words = spark.sparkContext.parallelize(myCollection)
val rddWords= words.flatMap(word => word.toSeq)
rddWords.collect

words.flatMap(word => word.toSeq).take(5)

Difference between flatMap() and map() on an RDD
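
To make the difference concrete, here is a small comparison using the flatMapRdd defined above (the comments show the expected shape of the results):

// in Scala
flatMapRdd.map(line => line.split(" ")).collect
// Array(Array(hello, world), Array(hi))  -> one array per input element
flatMapRdd.flatMap(line => line.split(" ")).collect
// Array(hello, world, hi)                -> the results are flattened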

filter(func): Returns a new RDD that contains only elements that satisfy the condition.

Objective: To illustrate the filter(func) transformation.

Action: Create an RDD using an external data set. Apply filter(func) to display the lines that contain
the word Kafka.

Input File: keywords.txt


val filterRdd = sc.textFile("/home/data/keywords.txt") //Line 1
val filterRdd1 = filterRdd.filter(line => line.contains("Kafka"))//Line 2

Filtering is equivalent to creating a SQL-like where clause. You can look through the records in the RDD and see which ones match some predicate function. This function just needs to return a Boolean
type to be used as a filter function. The input should be whatever your given row is. In this next
example, we filter the RDD to keep only the words that begin with the letter “S”:

Example:
// in Scala

def startsWithS(individual:String) = {
individual.startsWith("S")
}

words.filter(word => startsWithS(word)).collect

Mapped and filtered RDD from an input RDD

mapPartitions(func): It is similar to map, but works at the partition level: func receives an iterator over the elements of each partition.

Objective: To illustrate the mapPartitions(func) transformation.

Action: Create an RDD of numeric type. Apply mapPartitions(func).


val rdd = sc.parallelize(10 to 90) //Line 1
rdd.mapPartitions(x => List(x.next).iterator).collect //Line 2
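
As another illustration of per-partition processing, the following sketch computes one sum per partition (so the result has as many elements as the RDD has partitions):

// in Scala
rdd.mapPartitions(iter => Iterator(iter.sum)).collect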

Pseudo set operations


Some simple set operations

union(otherDataset): This returns a new data set that contains the elements of the source RDD and
the argument RDD. The key rule here is the two RDDs should be of the same data type.

Objective: To illustrate union(otherDataset) .

Action: Create two RDDs of numeric type as shown here. Apply union(otherDataset) to combine both
RDDs.
val rdd = sc.parallelize(1 to 5) //Line 1
val rdd1 = sc.parallelize(6 to 10) //Line 2
val unionRdd=rdd.union(rdd1) //Line 3

intersection(otherDataset): This returns a new data set that contains the intersection of elements
from the source RDD and the argument RDD.

Objective: To illustrate intersection(otherDataset) .


Action: Create two RDDs of numeric type as shown here. Apply intersection(otherDataset) to display
all the elements of source RDD that also belong to argument RDD.
val rdd = sc.parallelize(1 to 5) //Line 1
val rdd1 = sc.parallelize(1 to 2) //Line 2
val intersectionRdd = rdd.intersection(rdd1) //Line 3

subtract(otherDataset): This returns a new data set that contains the elements of the first data set with the elements of the second data set removed.

Objective: To illustrate subtract(otherDataset).

Action: Create two RDDs of numeric type as shown here. Apply subtract(otherDataset) to display all the elements of the source RDD that do not appear in the argument RDD.
val rdd = sc.parallelize(1 to 5) //Line 1
val rdd1 = sc.parallelize(1 to 2) //Line 2
val subtractRdd = rdd.subtract(rdd1) //Line 3

distinct([numTasks]): This returns a new RDD that contains distinct elements within a source RDD.

Objective: To illustrate distinct([numTasks]) .

Action: Create two RDDs of numeric type as shown here. Apply union(otherDataset) and
distinct([numTasks]) to display distinct values.
val rdd = sc.parallelize(10 to 15) //Line 1
val rdd1 = sc.parallelize(10 to 15) //Line 2
val distinctRdd=rdd.union(rdd1).distinct //Line 3

sample(withReplacement, fraction, seed): This returns a sample of the source RDD containing approximately a fraction of its elements. The Boolean parameter withReplacement determines whether the same element may be sampled multiple times. The third parameter is the seed for random-number generation.

Objective: To illustrate sample(withReplacement, fraction, seed).

Action: Create an RDD of numeric type as shown here. Apply sample(withReplacement, fraction,
seed) to display a sample of values.
val rdd = sc.parallelize(10 to 100) //Line 1
println(rdd.sample(false, 0.1).collect().mkString(",")) //Line 2
println(rdd.sample(false, 0.1).collect().mkString(",")) //Line 3
println(rdd.sample(false, 0.1, 100).collect().mkString(",")) //Line 4
println(rdd.sample(false, 0.1, 100).collect().mkString(",")) //Line 5
println(rdd.sample(true, 0.3, 100).collect().mkString(",")) //Line 6
println(rdd.sample(true, 0.3, 100).collect().mkString(",")) //Line 7
println(rdd.sample(true, 0.3).collect().mkString(",")) //Line 8
println(rdd.sample(true, 0.3).collect().mkString(",")) //Line 9

Actions
reduce(func): This returns a single value obtained by aggregating the elements of the data set using a function func. The function takes two arguments and returns one value. It should be commutative and associative so that it can be computed correctly in parallel.

Objective: To illustrate reduce(func).

Action: Create an RDD that contains numeric values. Apply reduce(func) to display the sum of values.
val rdd = sc.parallelize(1 to 5) //Line 1
val sumRdd = rdd.reduce((t1,t2) => t1 + t2) //Line 2
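
Any commutative and associative function can be used. For example, a sketch that computes the maximum with reduce:

// in Scala
val maxValue = rdd.reduce((a, b) => if (a > b) a else b)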

collect(): All the elements of the data set are returned as an array to the driver program.

Objective: To illustrate collect().

Action: Create an RDD that contains a list of strings. Apply collect to display all the elements of the
RDD.
val rdd = sc.parallelize(List("Hello Spark", "Spark Programming")) //Line 1
rdd.collect() //Line 2

count(): This returns the number of elements in the data set.

Objective: To illustrate count().

Action: Create an RDD that contains a list of strings. Apply count to display the number of elements in
the RDD.
val rdd = sc.parallelize(List("Hello Spark", "Spark Programming")) //Line 1
rdd.count() //Line 2

first(): This returns the first element in the data set.

Objective: To illustrate first().

Action: Create an RDD that contains a list of strings. Apply first() to display the first element in the
RDD.
val rdd = sc.parallelize(List("Hello Spark", "Spark Programming")) //Line 1
rdd.first() //Line 2

take(n): This returns the first n elements in the data set as an array.

Objective: To illustrate take(n) .

Action: Create an RDD that contains a list of strings. Apply take(n) to display the first n elements in
the RDD.
val rdd = sc.parallelize(List("Hello","Spark","Spark SQL","MLib")) //Line 1
rdd.take(2) //Line 2

takeSample(withReplacement, num, seed): This returns a sample of num elements of the data set, as an array, with or without replacement. The third parameter is the seed for random-number generation.

Objective: To illustrate takeSample(withReplacement, num, seed).

Action: Create an RDD that contains a list of integer numbers. Apply takeSample(withReplacement,
num, seed) to display a sample with num elements in the RDD.
val rdd = sc.parallelize(10 to 100) //Line 1
rdd.takeSample(false, 10) //Line 2
rdd.takeSample(false, 10) //Line 3
rdd.takeSample(false, 10, 100) //Line 4
rdd.takeSample(false, 10, 100) //Line 5
rdd.takeSample(true, 30) //Line 6
rdd.takeSample(true, 30) //Line 7
rdd.takeSample(true, 30, 100) //Line 8
rdd.takeSample(true, 30, 100) //Line 9

foreach(func): performs a computation on each element of the RDD without bringing the data back to the driver.

Objective: To illustrate foreach(func).

Action: Create an RDD that contains a list of integer numbers. Apply foreach(func) to perform
computations over the RDD elements.
val rdd = sc.parallelize(10 to 100) //Line 1
rdd.takeSample(false, 10).foreach(println(_)) //Line 2

sortBy(func): the function func extracts a sort key from each element of the RDD, and sortBy returns a new RDD sorted by that key.

Objective: To illustrate sortBy(func).

Action: Create an RDD that contains a list of integer numbers. Apply sortBy(func) to sort the RDD
elements.
val rdd = sc.parallelize(10 to 100) //Line 1
rdd.sortBy(x => x).take(10) //Line 2: ascending order
rdd.sortBy(x => x * -1).take(10) //Line 3: descending order

Persistence (cache() & persist())


Spark RDDs are lazily evaluated, and sometimes we may wish to use the same RDD multiple times. If
we do this naively, Spark will recompute the RDD and all of its dependencies each time we call an
action on the RDD.

In the following example, the RDD result will be recomputed twice. Recomputing an RDD repeatedly can be especially expensive for iterative algorithms, which look at the data many times.
val input = sc.parallelize(10 to 100)
val result = input.map(x => x*x)
println(result.count())
println(result.collect().mkString(","))

To avoid computing an RDD multiple times, we can ask Spark to cache or persist the data.

Both caching and persisting are used to save Spark RDDs, DataFrames, and Datasets. The difference is that the RDD cache() method saves the data at the default storage level (MEMORY_ONLY), whereas persist() lets you store it at a user-defined storage level.

Spark offers several storage levels to choose from depending on your goals, such as MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, and their serialized and replicated variants.
When you persist a dataset, each node stores its partitions of the data in memory and reuses them in other actions on that dataset. Persisted data is also fault-tolerant: if any partition of a dataset is lost, it is automatically recomputed using the transformations that originally created it.

Example (persist method):


import org.apache.spark.storage.StorageLevel
val input = sc.parallelize(10 to 100)
val result = input.map(x => x * x)
result.persist(StorageLevel.DISK_ONLY)
println(result.count())
println(result.collect().mkString(","))
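
For comparison, here is a minimal sketch of the cache() variant, which is equivalent to persist(StorageLevel.MEMORY_ONLY); unpersist() releases the storage when the data is no longer needed:

// in Scala
val result2 = input.map(x => x * x)
result2.cache()                          // same as persist(StorageLevel.MEMORY_ONLY)
println(result2.count())                 // first action computes and caches the RDD
println(result2.collect().mkString(",")) // second action reuses the cached data
result2.unpersist()                      // drop the cached data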

Saving Files
Saving files means writing to plain-text files. With RDDs, you cannot actually “save” to a data source
in the conventional sense. You must iterate over the partitions in order to save the contents of each
partition to some external database. This is a low-level approach that reveals the underlying
operation that is being performed in the higher-level APIs. Spark will take each partition, and write
that out to the destination.

saveAsTextFile(path): Write the elements of the RDD as a text file in the local file system, HDFS, or
another storage system.

Objective: To illustrate saveAsTextFile(path).

Action: Create an RDD that contains a list of strings. Apply saveAsTextFile(path) to write the elements
in the RDD to a file.
val rdd = sc.parallelize(List("Hello","Spark","Spark SQL","MLib")) //Line 1
rdd.saveAsTextFile("/home/data/output") //Line 2

DataFrame API
Creating DataFrames
You can create DataFrames in three ways:

▪ Converting existing RDDs


▪ Running SQL queries
▪ Loading external data

The easiest method is to run an SQL query, but we’ll leave that for later. First, we’ll show you how to
create DataFrames from existing RDDs.

Creating DataFrames from RDDs


You can create DataFrames from RDDs in three ways:

▪ Using RDDs containing row data as tuples


▪ Using case classes
▪ Specifying a schema

Creating DataFrame from an RDD of tuples

So let’s create a DataFrame by calling toDF on an RDD:


val rdd = sc.textFile("csv/2015-summary.csv")
rdd.take(5)
val df = rdd.toDF()
df.show(false)
df.printSchema

val rdd1 = rdd.map(x => x.split(","))


rdd1.take(5)
val df1 = rdd1.toDF()
df1.show(false)
df1.printSchema

Now let’s create a DataFrame by converting the RDD’s element arrays to tuples and then calling toDF
on the resulting RDD. There’s no elegant way to convert an array to a tuple, so you have to resort to
this ugly expression:
val rdd3 = rdd1.map(x => (x(0), x(1), x(2)))
rdd3.take(5)
val df3 = rdd3.toDF()
df3.show(false)
df3.printSchema

If you call the show method without arguments, it shows the first 20 rows. This is nice, but the
column names are generic and aren’t particularly helpful. You can rectify that by specifying column
names when calling the toDF method:
val df3 = rdd3.toDF("Destination_Country_Name", "Origin_Country_Name", "Count")

If you call show now, you’ll see that column names appear in the output. You can also examine the
schema of the DataFrame with the printSchema method:
df3.printSchema

This method shows the information that the DataFrame has about its columns. You can see that column names are now available, but all the columns are of type String and they're all nullable. This could be desirable in some cases, but here it's obviously wrong: the Count column should be numeric, not a String. The following RDD-to-DataFrame conversion method lets you specify the desired column types along with their names.
Converting RDDS to DataFrames by specifying a schema

The last method for converting RDDs to DataFrames is to use SparkSession's createDataFrame method, which takes an RDD containing objects of type Row and a StructType. In Spark SQL, a StructType represents a schema. It contains one or more StructFields, each describing a column. You can construct a StructType schema for the flight data RDD with the following snippet:
import org.apache.spark.sql.types._
val flightSchema = StructType(Array(
  StructField("Destination_Country_Name", StringType, true),
  StructField("Origin_Country_Name", StringType, true),
  StructField("Count", IntegerType, true)
))

Supported data types


DataFrame takes columns of the usual types supported by major relational databases: strings,
integers, shorts, floats, doubles, bytes, dates, timestamps, and binary values (in relational
databases, called BLOBs). But it can also contain complex data types:
Arrays contain several values of the same type.
Maps contain key-value pairs, where the key is a primitive type.
Structs contain nested column definitions.
You’ll find Scala data type objects that you can use when constructing StructType objects in the
org.apache.spark.sql.types package.

For mapping your RDD to the Row type use the following snippet:
// import java.sql.Timestamp
object StringImplicits {
  implicit class StringImprovements(val s: String) {
    import scala.util.control.Exception.catching
    def toIntSafe = catching(classOf[NumberFormatException]) opt s.toInt
    // def toLongSafe = catching(classOf[NumberFormatException]) opt s.toLong
    // def toTimestampSafe = catching(classOf[IllegalArgumentException]) opt Timestamp.valueOf(s)
  }
}

import StringImplicits._
import org.apache.spark.sql._
def stringToRow(row: String): Row = {
  val r = row.split(",")
  Row(r(0),
      r(1),
      r(2).toIntSafe.getOrElse(null)
      // r(2).toInt
  )
}

val rowRDD = rdd.map(row => stringToRow(row))

Then you can create the final DataFrame like this:


val DFStruct = spark.createDataFrame(rowRDD, flightSchema)

Getting schema information

Two additional DataFrame functions can give you some information about the DataFrame's schema. The columns method returns a list of column names, and the dtypes method returns a list of tuples, each containing the column name and the name of its type. For the DFStruct DataFrame, the results look like this:
DFStruct.printSchema
DFStruct.columns
DFStruct.dtypes

Reading DataFrames from structured sources


The foundation for reading data in Spark is the DataFrameReader. We access this through the
SparkSession via the read attribute:

spark.read

After we have a DataFrame reader, we specify several values:

▪ The format
▪ The schema
▪ The read mode
▪ A series of options

The format, the schema, and the options each return a DataFrameReader that can undergo further transformations, and they are all optional except for the path. Each data source has a specific set of options that determine how the data is read into Spark (we cover some of these options shortly). At a minimum, you must supply the DataFrameReader with a path from which to read.

Here’s an example of the overall layout:


spark.read.format("csv")
.option("mode", "FAILFAST")
.option("inferSchema", "true")
.option("path", "path/to/file(s)")
.schema(someSchema)
.load()

The DataFrames can be created by using existing RDDs, Hive tables, and other data sources like text
files and external databases.

Source Json
The following example shows the steps to create DataFrames from the JSON file with SparkSession.

Follow the examples shown here to create the DataFrame from JSON contents.

Example 1:
val df = spark.read.format("json").load("data/flight-data/json/2015-summary.json")
val df = spark.read.json("data/flight-data/json/2015-summary.json")
Example 2:

Save the following in bookDetails.json.


{"bookId":101, "bookName":"Practical Spark", "Author":"Dharanitharan G"}
{"bookId":102, "bookName":"Spark Core", "Author":"Subhashini R C"}
{"bookId":103, "bookName":"Spark SQL", "Author":"Dharanitharan G"}
{"bookId":102, "bookName":"Spark Streaming", "Author":"Subhashini R C"}

val bookDetails = spark.read.json("bookDetails.json")

bookDetails.select("bookId","bookName").show()

bookDetails.filter($"bookName" === "Spark Core").show()

val grouped = bookDetails.groupBy($"Author")
val total = grouped.count()
total.show()

Source Csv
Follow the examples shown here to create the DataFrame from csv contents.

Example 1:
val flightData2015 = spark.read.option("inferSchema", "true").option("header", "true").csv("data/flight-data/csv/2015-summary.csv")
Example 2:
import org.apache.spark.sql.types._

val myManualSchema = new StructType(Array(
  new StructField("DEST_COUNTRY_NAME", StringType, true),
  new StructField("ORIGIN_COUNTRY_NAME", StringType, true),
  new StructField("count", LongType, false)
))
val df = spark.read.format("csv")
.option("header", "true")
.option("mode", "FAILFAST")
.schema(myManualSchema)
.load("data/flight-data/csv/2010-summary.csv")

Example 3:
import java.sql.Timestamp

import org.apache.spark.sql.types._

val postSchema = StructType(Seq(
StructField("commentCount", IntegerType, true),
StructField("lastActivityDate", TimestampType, true),
StructField("ownerUserId", LongType, true),
StructField("body", StringType, true),
StructField("score", IntegerType, true),
StructField("creationDate", TimestampType, true),
StructField("viewCount", IntegerType, true),
StructField("title", StringType, true),
StructField("tags", StringType, true),
StructField("answerCount", IntegerType, true),
StructField("acceptedAnswerId", LongType, true),
StructField("postTypeId", LongType, true),
StructField("id", LongType, false))
)

val df = spark.read.format("csv").option("delimiter", "~").option("mode", "FAILFAST").schema(postSchema).load("first-


edition-master/ch05/italianPosts.csv")
df.show(5)

Source JDBC
Spark SQL allows users to connect to external databases through JDBC (Java Database Connectivity). The tables from those databases can be loaded as DataFrames or Spark SQL temporary tables using the Data Sources API.

The following properties are mandatory to connect to the database.

URL: The JDBC URL to connect to (e.g., jdbc:mysql://${jdbcHostname}:${jdbcPort}/${jdbcDatabase}).

Driver: The class name of the JDBC driver to connect to the URL (e.g., com.mysql.cj.jdbc.Driver, for
mysql database).

UserName and Password: To connect to the database.

The following code creates the DataFrame from the mysql table.
val driver = "com.mysql.cj.jdbc.Driver"
val url = "jdbc:mysql://localhost:3306/bookdetails"
val username = "root"
val password = "MaMaryouma_2014"
val tablename = "bookDetails"

val jdbcDF = spark.read.format("jdbc").option("url", url).option("dbtable", tablename).option("user",


username).option("password", password).option("driver", driver).load()

jdbcDF.show()

Note that the MySQL JDBC driver JAR must be on the Spark classpath.
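
For example, when using spark-shell you can pass the driver JAR with the --jars option (the JAR path and version below are placeholders; use the connector installed on your machine):

// in Terminal
spark-shell --jars /path/to/mysql-connector-j-8.0.33.jar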

Writing DataFrames
Writing CSV Files
Just as with reading data, there are a variety of options for writing data when we write CSV files. This
is a subset of the reading options because many do not apply when writing data (like maxColumns
and inferSchema). Here’s an example:
val bookDetailsDF = spark.read.format("json").load("bookDetails.json")

val bookNames = bookDetailsDF.select('bookName, 'Author)

bookNames.write.format("csv").options(Map("header" -> "true", "delimiter" -> ",")).save("/path/to/output")

or, equivalently:

bookNames.write.format("csv")
.option("header", "true")
.option("delimiter", ",")
.save("/path/to/output")
DataFrame API basics
Before we start working with the DataFrame API, we will build a DataFrame that we will use to illustrate the different features of the API.

In 2009, Stack Exchange released an anonymized data dump of all questions and answers in the Stack Exchange communities, and it continues to release data as new questions become available. One of the communities whose data was released is the Italian Language Stack Exchange, and we'll use its data to illustrate Spark SQL concepts in this lab. We chose this community because of its small size (you can easily download and use it on your laptop).

The original data is available in XML format. We preprocessed it and created a comma-separated values (CSV) file. The first file you'll use is italianPosts.csv. It contains Italian language-related questions (and answers) with the following fields (delimited with tilde signs):

▪ commentCount—Number of comments related to the question/answer


▪ lastActivityDate—Date and time of the last modification
▪ ownerUserId—User ID of the owner
▪ body—Textual contents of the question/answer
▪ score—Total score based on upvotes and downvotes
▪ creationDate—Date and time of creation
▪ viewCount—View count
▪ title—Title of the question
▪ tags—Set of tags the question has been marked with
▪ answerCount—Number of related answers
▪ acceptedAnswerId—ID of the accepted answer, if the question has one
▪ postTypeId—Type of the post; 1 is for questions, 2 for answers
▪ id—Post’s unique ID

Building a DataFrame from italianPosts.csv

Ingesting the file into an RDD
val itPostsRows = sc.textFile("path/italianPosts.csv")

Converting the RDD to a DataFrame by specifying a schema

StringImplicits for converting data :


import java.sql.Timestamp

object StringImplicits {
  implicit class StringImprovements(val s: String) {
    import scala.util.control.Exception.catching
    def toIntSafe = catching(classOf[NumberFormatException]) opt s.toInt
    def toLongSafe = catching(classOf[NumberFormatException]) opt s.toLong
    def toTimestampSafe = catching(classOf[IllegalArgumentException]) opt Timestamp.valueOf(s)
  }
}

Parsing rows function:


import StringImplicits._
import org.apache.spark.sql._
def stringToRow(row: String): Row = {
  val r = row.split("~")
  Row(r(0).toIntSafe.getOrElse(null),
      r(1).toTimestampSafe.getOrElse(null),
      r(2).toLongSafe.getOrElse(null),
      r(3),
      r(4).toIntSafe.getOrElse(null),
      r(5).toTimestampSafe.getOrElse(null),
      r(6).toIntSafe.getOrElse(null),
      r(7),
      r(8),
      r(9).toIntSafe.getOrElse(null),
      r(10).toLongSafe.getOrElse(null),
      r(11).toLongSafe.getOrElse(null),
      r(12).toLong)
}

Building dataframe schema:


import org.apache.spark.sql.types._
val postSchema = StructType(Seq(
StructField("commentCount", IntegerType, true),
StructField("lastActivityDate", TimestampType, true),
StructField("ownerUserId", LongType, true),
StructField("body", StringType, true),
StructField("score", IntegerType, true),
StructField("creationDate", TimestampType, true),
StructField("viewCount", IntegerType, true),
StructField("title", StringType, true),
StructField("tags", StringType, true),
StructField("answerCount", IntegerType, true),
StructField("acceptedAnswerId", LongType, true),
StructField("postTypeId", LongType, true),
StructField("id", LongType, false))
)

Converting itPostsRows to a row RDD


val rowRDD = itPostsRows.map(row => stringToRow(row))

Then you can create the final DataFrame like this:


val itPostsDFStruct = spark.createDataFrame(rowRDD, postSchema)

Printing schema

itPostsDFStruct.printSchema

Displaying data

itPostsDFStruct.show(5)

Now that you have your DataFrame loaded (no matter which method you used to load the data), you
can start exploring the rich DataFrame API. DataFrames come with a DSL for manipulating data, which
is fundamental to working with Spark SQL. Spark’s machine learning library (ML) also relies on
DataFrames. They’ve become a cornerstone of Spark, so it’s important to get acquainted with their
API.

DataFrame’s DSL has a set of functionalities similar to the usual SQL functions for manipulating data
in relational databases. DataFrames work like RDDs: they’re immutable and lazy. They inherit their
immutable nature from the underlying RDD architecture. You can’t directly change data in a
DataFrame; you have to transform it into another one. They’re lazy because most DataFrame DSL
functions don’t return results. Instead, they return another DataFrame, similar to RDD
transformations.

In this part of the lab, we'll give you an overview of basic DataFrame functions and show how you can use them to select, filter, map, group, and join data. All of these functions also have SQL counterparts.

Selecting data (slicing)


Most of the DataFrame DSL functions work with Column objects. When using the select function for
selecting data, you can pass column names or Column objects to it, and it will return a new
DataFrame containing only those columns. For example (first rename the DataFrame variable to
postsDf to make it shorter):
val postsDf = itPostsDFStruct
val postsIdBody = postsDf.select("id", "body")

Several methods exist for creating Column objects. You can create columns using an existing
DataFrame object and its col function:
val postsIdBody = postsDf.select(postsDf.col("id"), postsDf.col("body"))

Or you can use some of the implicit methods (imported through spark.implicits._, which happens automatically in the Spark shell). One of them indirectly converts Scala's Symbol class to a Column. Symbol objects are sometimes used in Scala programs as identifiers
instead of Strings because they’re interned (at most one instance of an object exists) and can be
quickly checked for equality. They can be instantiated with Scala’s built-in quote mechanism or with
Scala’s apply function, so the following two statements are equivalent:
val postsIdBody = postsDf.select(Symbol("id"), Symbol("body"))
val postsIdBody = postsDf.select('id, 'body)

Another implicit method (called $) converts strings to ColumnName objects (which inherit from
Column objects, so you can use ColumnName objects as well):
val postsIdBody = postsDf.select($"id", $"body")

Column objects are important for DataFrame DSL, so all this flexibility is justified. That’s how you
specify which columns to select. The resulting DataFrame contains only the specified columns. If you
need to select all the columns except a single one, you can use the drop function, which takes a
column name or a Column object and returns a new DataFrame with the specified column missing.
For example, to remove the body column from the postsIdBody DataFrame (and thus leave only the
id column), you can use the following line (drop method):
val postIds = postsIdBody.drop("body")

Filtering data
You can filter DataFrame data using the where and filter functions (they’re synonymous). They take a
Column object or an expression string. The variant taking a string is used for parsing SQL expressions.
Why do you pass a Column object to a filtering function, you ask? Because the Column class, in
addition to representing a column name, contains a rich set of SQL-like operators that you can use to
build expressions. These expressions are also represented by the Column class.

For example, to see how many posts contain the word Italiano in their body, you can use the
following line:
postsIdBody.filter('body contains "Italiano").count

To select all the questions that don’t have an accepted answer, use this expression:

val noAnswer = postsDf.filter(('postTypeId === 1) and ('acceptedAnswerId isNull))

Here, the filter expression is based on two columns: the post type ID (which is equal to 1 for questions) and the accepted answer ID. Each comparison yields a Column object, and the two are combined into a third one using the and operator. You need the extra parentheses to help the Scala parser find its way.

These are just some of the available operators.

Note that you can use these column expressions in select functions as well. You can select only the
first n rows of a DataFrame with the limit function. The following will return a DataFrame containing
only the first 10 questions:
val firstTenQs = postsDf.filter('postTypeId === 1).limit(10)

Adding and Renaming columns


In some situations, you may want to rename a column to give it a shorter or a more meaningful
name. That’s what the withColumnRenamed function is for. It accepts two strings: the old and the
new name of the column. For example:
val firstTenQsRn = firstTenQs.withColumnRenamed("ownerUserId", "owner")

To add a new column to a DataFrame, use the withColumn function and give it the column name and
the Column expression. Let’s say you’re interested in a “views per score point” metric (in other
words, how many views are needed to increase score by one), and you want to see the questions
whose value of this metric is less than some threshold (which means if the question is more
successful, it gains a higher score with fewer views). If your threshold is 35 (which is the actual
average), this can be accomplished with the following expression:
postsDf.filter('postTypeId === 1).withColumn("ratio", 'viewCount / 'score).where('ratio < 35).show()
The output is too wide to print on this page, but you can run the command yourself and see that the
output contains the extra column called ratio.

Sorting data
DataFrame’s orderBy and sort functions sort data (they’re equivalent). They take one or more column
names or one or more Column expressions. The Column class has asc and desc operators, which are
used for specifying the sort order. The default is to sort in ascending order.

As an exercise, try to list the 10 most recently modified questions.
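
One possible solution, assuming the postsDf DataFrame defined above (questions are posts with postTypeId equal to 1, and lastActivityDate holds the last modification time):

// in Scala
postsDf.filter('postTypeId === 1).
  orderBy('lastActivityDate desc).
  show(10)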

Using built-in scalar and aggregate functions


Scalar functions return a single value for each row based on values in one or more columns in the
same row. Scalar functions include abs (calculates the absolute value), exp (computes the
exponential), and substring (extracts a substring of a given string).

Aggregate functions return a single value for a group of rows. They are min (returns the minimum
value in a group of rows), avg (calculates the average value in a group of rows), and so on.

Scalar and aggregate functions reside within the object org.apache.spark.sql.functions. You can
import them all at once (note that they’re automatically imported in Spark shell, so you don’t have to
do this yourself):
import org.apache.spark.sql.functions._

As an example, let’s find the question that was active for the largest amount of time. You can use the
lastActivityDate and creationDate columns for this and find the difference between them (in days)
using the datediff function, which takes two arguments: end date column and start date column. Here
goes:
postsDf.filter('postTypeId === 1).
withColumn("activePeriod", datediff('lastActivityDate, 'creationDate)).
orderBy('activePeriod desc).head.getString(3).
replace("&lt;","<").replace("&gt;",">")

As another example, let’s find the average and maximum score of all questions and the total number
of questions. Spark SQL makes this easy:
postsDf.select(avg('score), max('score), count('score)).show

Working with missing values


Sometimes you may need to clean your data before using it. It might contain null or empty values, or
equivalent string constants (for example, “N/A” or “unknown”). In those instances, the
DataFrameNaFunctions class, accessible through the DataFrames na field, may prove useful.
Depending on your case, you can choose to drop the rows containing null or NaN (the Scala constant
meaning “not a number”) values, to fill null or NaN values with constants, or to replace certain values
with string or numeric constants.

Each of these methods has several versions. For example, to remove all the rows from postsDf that
contain null or NaN values in at least one of their columns, call drop with no arguments:
val cleanPosts = postsDf.na.drop()
cleanPosts.count()

This is the same as calling drop("any"), which means null values can be in any of the columns. If you
use drop("all"), it removes rows that have null values in all of the columns. You can also specify
column names. For example, to remove the rows that don’t have an accepted answer ID, you can do
this:
postsDf.na.drop(Array("acceptedAnswerId"))

With the fill function, you can replace null and NaN values with a constant.
postsDf.na.fill(Map("viewCount" -> 0))
Finally, the replace function enables you to replace certain values in specific columns with different
ones. For example, imagine there was a mistake in your data export and you needed to change post
ID 1177 to 3000. You can do that with the replace function:
val postsDfCorrected = postsDf.na.replace(Array("id", "acceptedAnswerId"), Map(1177 -> 3000))

Grouping and joining data


To find the number of posts per author use the following:
postsDf.groupBy('ownerUserId).count.orderBy('ownerUserId desc).show(10)
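
Grouping is not limited to counting. With the agg function (and the aggregate functions imported earlier) you can compute several aggregates per group; here is a sketch:

// in Scala
postsDf.groupBy('ownerUserId).
  agg(count('id), avg('score), max('score)).
  orderBy('ownerUserId desc).
  show(10)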

Performing joins
As an example, load the italianVotes.csv file into a DataFrame with the following code:
val itVotesRaw = sc.textFile("first-edition-master/ch05/italianVotes.csv").map(x => x.split("~"))
val itVotesRows = itVotesRaw.map(row => Row(row(0).toLong, row(1).toLong, row(2).toInt, Timestamp.valueOf(row(3))))
val votesSchema = StructType(Seq(StructField("id", LongType, false), StructField("postId", LongType, false),
StructField("voteTypeId", IntegerType, false), StructField("creationDate", TimestampType, false)) )
val votesDf = spark.createDataFrame(itVotesRows, votesSchema)

Joining the two DataFrames (postsDf and votesDf) on the postId column can be done like this:
val postsVotes = postsDf.join(votesDf, postsDf("id") === 'postId)

This performs an inner join. You can perform an outer join by adding another argument:
val postsVotesOuter = postsDf.join(votesDf, postsDf("id") === 'postId, "outer")

If you examine the contents of the postsVotesOuter DataFrame, you’ll notice there are some rows
with all null values in the votes columns. These are the posts that have no votes.
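
For example, to count the posts that received no votes, you can filter the outer-joined DataFrame on a null postId:

// in Scala
postsVotesOuter.filter('postId isNull).count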
