Lab 04 Spark APIs
Lab goals:
▪ How to use the Spark APIs:
o Low-level API
o DataFrame
o SQL
A SparkContext is the entry point for low-level API functionality. You access it through the
SparkSession, which is the tool you use to perform computation across a Spark cluster. You can access
a SparkContext via the following call:
spark.sparkContext
About RDDs
RDDs are not as commonly used as the higher-level APIs. However, virtually all Spark code you run, whether it uses DataFrames or Datasets, compiles down to an RDD, and the Spark UI also describes job execution in terms of RDDs.
Therefore, it will behoove you to have at least a basic understanding of what an RDD is and how to
use it.
To create an RDD from a collection, you will need to use the parallelize method on a SparkContext
(within a SparkSession). This turns a single node collection into a parallel collection. When creating
this parallel collection, you can also explicitly state the number of partitions into which you would like
to distribute this array. In this case, we are creating two partitions:
Ex1:
// in Scala
val myCollection = "Spark The Definitive Guide : Big Data Processing Made Simple".split(" ")
val words = sc.parallelize(myCollection, 2)
words.foreach(println)
Ex2:
val mapRdd = sc.parallelize(List(1, 2, 3, 4)) // Line 1
mapRdd.foreach(x => println(x)) // Line 2
Ex3:
val numbers = sc.parallelize(Range(10, 50, 2), 4)
numbers.foreach(x => println(x))
Although you can create RDDs from data sources or text files, it’s often preferable to use the Data
Source APIs. RDDs do not have a notion of “Data Source APIs” like DataFrames do. The Data Source
API is almost always a better way to read in data. That being said, you can also read data as RDDs
using sparkContext. For example, let’s read a text file line by line:
spark.sparkContext.textFile("/some/path/withTextFiles")
This creates an RDD for which each record in the RDD represents a line in that text file or files.
Alternatively, you can read in data for which each text file should become a single record. The use
case here would be where each file is a file that consists of a large JSON object or some document
that you will operate on as an individual:
spark.sparkContext.wholeTextFiles("/some/path/withTextFiles")
In this RDD, the name of the file is the first object and the value of the text file is the second string
object.
Additional features
An additional feature is that you can then name this RDD to show up in the Spark UI according to a
given name:
// in Scala
words.setName("myWords")
words.name // myWords
Examples
// in Terminal
mkdir toto
cd toto
//In Scala
val rdd1 = sc.textFile("toto/keywords.txt")
rdd1.getNumPartitions
rdd1.collect
rdd1.first
//In Scala
val rdd2 = sc.textFile("toto/")
rdd2.getNumPartitions
rdd2.collect
rdd2.first
//In Scala
val rdd3 = sc.wholeTextFiles("toto/")
rdd3.getNumPartitions
rdd3.collect
rdd3.first
//In Scala
val rdd4 = sc.textFile("toto/keywords.txt", 3)
rdd4.getNumPartitions
val rdd5 = spark.sparkContext.wholeTextFiles("toto/", 4)
rdd5.getNumPartitions
// in Terminal
start-dfs.sh
hadoop fs -mkdir toto
hadoop fs -put toto/*.txt toto
//In Scala
val rdd6 = sc.textFile("hdfs://localhost/user/hadoop/toto/")
rdd6.getNumPartitions
rdd6.collect
rdd6.first
//In Scala
val rdd7 = sc.wholeTextFiles("hdfs://localhost/user/hadoop/toto/")
rdd7.getNumPartitions
rdd7.collect
rdd7.first
Manipulating RDDs
Transformations
You specify transformations on one RDD to create another. In doing so, you define the new RDD as a dependency of the original, along with some manipulation of the data contained in that RDD.
Element-wise transformations
map(func): Returns a new RDD by operating on each element of the source RDD.
Objective: To illustrate a map(func) transformation.
Action: Create an RDD of a numeric list. Then apply map(func) to multiply each element by 2.
val mapRdd = sc.parallelize(List(1, 2, 3, 4)) // Line 1
val mapRdd1 = mapRdd.map(x => x * 2) // Line 2
mapRdd1.collect
flatMap(func): Like map, but each item can be mapped to zero, one, or more items.
Another example:
val myCollection = "Spark The Definitive Guide : Big Data Processing Made Simple".split(" ")
val words = spark.sparkContext.parallelize(myCollection)
val rddWords= words.flatMap(word => word.toSeq)
rddWords.collect
filter(func): Returns a new RDD that contains only elements that satisfy the condition.
Action: Create an RDD using an external data set. Apply filter(func) to display the lines that contain
the word Kafka.
Filtering is equivalent to creating a SQL-like where clause. You can look through our records in the
RDD and see which ones match some predicate function. This function just needs to return a Boolean
type to be used as a filter function. The input should be whatever your given row is. In this next
example, we filter the RDD to keep only the words that begin with the letter “S”:
Example:
// in Scala
def startsWithS(individual: String) = {
  individual.startsWith("S")
}
words.filter(word => startsWithS(word)).collect()
union(otherDataset): This returns a new data set that contains the elements of the source RDD and the argument RDD. The key rule here is that the two RDDs must be of the same data type.
Action: Create two RDDs of numeric type as shown here. Apply union(otherDataset) to combine both
RDDs.
val rdd = sc.parallelize(1 to 5) //Line 1
val rdd1 = sc.parallelize(6 to 10) //Line 2
val unionRdd=rdd.union(rdd1) //Line 3
intersection(otherDataset): This returns a new data set that contains the intersection of elements
from the source RDD and the argument RDD.
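For example, a small sketch in the same style as the other examples:
val rdd = sc.parallelize(1 to 5) //Line 1
val rdd1 = sc.parallelize(3 to 8) //Line 2
val intersectionRdd = rdd.intersection(rdd1) //Line 3
intersectionRdd.collect //Line 4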
subtract(otherDataset): This returns a new data set that contains the elements of the first data set with the elements of the second data set removed.
Action: Create two RDDs of numeric type as shown here. Apply subtract(otherDataset) to display all the elements of the source RDD with the elements of the argument RDD removed.
val rdd = sc.parallelize(1 to 5) //Line 1
val rdd1 = sc.parallelize(1 to 2) //Line 2
val substractRdd = rdd.subtract(rdd1) //Line 3
distinct([numTasks]): This returns a new RDD that contains distinct elements within a source RDD.
Action: Create two RDDs of numeric type as shown here. Apply union(otherDataset) and
distinct([numTasks]) to display distinct values.
val rdd = sc.parallelize(10 to 15) //Line 1
val rdd1 = sc.parallelize(10 to 15) //Line 2
val distinctRdd=rdd.union(rdd1).distinct //Line 3
sample(withReplacement, fraction, seed): This returns a sampled subset of the source RDD, with or without replacement; fraction is the expected fraction of elements to include, and seed makes the sampling reproducible.
Action: Create an RDD of numeric type as shown here. Apply sample(withReplacement, fraction, seed) to display a sample of values.
val rdd = sc.parallelize(10 to 100) //Line 1
println(rdd.sample(false, 0.1).collect().mkString(",")) //Line 2
println(rdd.sample(false, 0.1).collect().mkString(",")) //Line 3
println(rdd.sample(false, 0.1, 100).collect().mkString(",")) //Line 4
println(rdd.sample(false, 0.1, 100).collect().mkString(",")) //Line 5
println(rdd.sample(true, 0.3, 100).collect().mkString(",")) //Line 6
println(rdd.sample(true, 0.3, 100).collect().mkString(",")) //Line 7
println(rdd.sample(true, 0.3).collect().mkString(",")) //Line 8
println(rdd.sample(true, 0.3).collect().mkString(",")) //Line 9
Actions
reduce(func): This returns a data set by aggregating the elements of the data set using a function func. The function takes two arguments and returns a single value. The function should be commutative and associative so that it can be computed correctly in parallel.
Action: Create an RDD that contains numeric values. Apply reduce(func) to display the sum of values.
val rdd = sc.parallelize(1 to 5) //Line 1
val sumRdd = rdd.reduce((t1,t2) => t1 + t2) //Line 2
collect(): All the elements of the data set are returned as an array to the driver program.
Action: Create an RDD that contains a list of strings. Apply collect to display all the elements of the
RDD.
val rdd = sc.parallelize(List("Hello Spark", "Spark Programming")) //Line 1
rdd.collect() //Line 2
count(): This returns the number of elements in the data set.
Action: Create an RDD that contains a list of strings. Apply count() to display the number of elements in the RDD.
val rdd = sc.parallelize(List("Hello Spark", "Spark Programming")) //Line 1
rdd.count() //Line 2
first(): This returns the first element in the data set.
Action: Create an RDD that contains a list of strings. Apply first() to display the first element in the RDD.
val rdd = sc.parallelize(List("Hello Spark", "Spark Programming")) //Line 1
rdd.first() //Line 2
take(n): This returns the first n elements in the data set as an array.
Action: Create an RDD that contains a list of strings. Apply take(n) to display the first n elements in
the RDD.
val rdd = sc.parallelize(List("Hello","Spark","Spark SQL","MLib")) //Line 1
rdd.take(2) //Line 2
takeSample(withReplacement, num, seed): This returns a sample of num elements from the data set as an array, with or without replacement. The third parameter represents the seed for random-number generation.
Action: Create an RDD that contains a list of integer numbers. Apply takeSample(withReplacement,
num, seed) to display a sample with num elements in the RDD.
val rdd = sc.parallelize(10 to 100) //Line 1
rdd.takeSample(false, 10) //Line 2
rdd.takeSample(false, 10) //Line 3
rdd.takeSample(false, 10, 100) //Line 4
rdd.takeSample(false, 10, 100) //Line 5
rdd.takeSample(true, 30) //Line 6
rdd.takeSample(true, 30) //Line 7
rdd.takeSample(true, 30, 100) //Line 8
rdd.takeSample(true, 30, 100) //Line 9
foreach(func): Performs a computation on each element in the RDD without bringing the data back to the driver.
Action: Create an RDD that contains a list of integer numbers. Apply foreach(func) to perform
computations over the RDD elements.
val rdd = sc.parallelize(10 to 100) //Line 1
rdd.takeSample(false, 10).foreach(println(_)) //Line 2
sortBy(func): The function func extracts a sort key from each element of the RDD, and sortBy then sorts the elements based on that key.
Action: Create an RDD that contains a list of integer numbers. Apply sortBy(func) to sort the RDD elements.
val rdd = sc.parallelize(10 to 100) //Line 1
rdd.sortBy(x => x).take(10) //Line 2
rdd.sortBy(x => x * -1).take(10) //Line 3
Caching and persisting
In the following example, the RDD result will be recomputed two times. Multiple evaluations of an RDD can be especially expensive for iterative algorithms, which look at the data many times.
val input = sc.parallelize(10 to 100)
val result = input.map(x => x*x)
println(result.count())
println(result.collect().mkString(","))
To avoid computing an RDD multiple times, we can ask Spark to cache or persist the data.
Both caching and persisting are used to save Spark RDDs, DataFrames, and Datasets. The difference is that the cache() method saves the data at the default storage level (MEMORY_ONLY), whereas the persist() method lets you store it at a user-defined storage level. Spark offers several storage levels to choose from, depending on your goals (for example MEMORY_ONLY, MEMORY_AND_DISK, and their serialized and replicated variants).
When you persist a dataset, each node stores its partitions in memory and reuses them in other actions on that dataset. Spark's persisted data is also fault-tolerant: if any partition of a Dataset is lost, it is automatically recomputed using the original transformations that created it.
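A minimal sketch of caching and persisting the result RDD from the example above (the choice of MEMORY_AND_DISK here is just an illustration):
// in Scala
import org.apache.spark.storage.StorageLevel
val input = sc.parallelize(10 to 100)
val result = input.map(x => x * x)
result.cache() // same as persist(StorageLevel.MEMORY_ONLY)
println(result.count()) // first action: computes and caches the data
println(result.collect().mkString(",")) // second action: reuses the cached data
result.unpersist() // the storage level can only be changed after unpersisting
result.persist(StorageLevel.MEMORY_AND_DISK) // explicit, user-defined storage level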
Saving Files
Saving files means writing to plain-text files. With RDDs, you cannot actually “save” to a data source
in the conventional sense. You must iterate over the partitions in order to save the contents of each
partition to some external database. This is a low-level approach that reveals the underlying
operation that is being performed in the higher-level APIs. Spark will take each partition, and write
that out to the destination.
saveAsTextFile(path): Write the elements of the RDD as a text file in the local file system, HDFS, or
another storage system.
Action: Create an RDD that contains a list of strings. Apply saveAsTextFile(path) to write the elements
in the RDD to a file.
val rdd = sc.parallelize(List("Hello","Spark","Spark SQL","MLib")) //Line 1
rdd.saveAsTextFile("/home/data/output") //Line 2
DataFrame API
Creating DataFrames
You can create DataFrames in three ways: by converting existing RDDs, by running SQL queries, or by loading external data.
The easiest method is to run an SQL query, but we'll leave that for later. First, we'll show you how to create DataFrames from existing RDDs.
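The examples below start from an RDD named rdd1, assumed to hold the lines of the flight-data CSV (used again later in this lab) split into arrays of fields. A minimal sketch of how it might be built:
// in Scala (a sketch; the path and header value match the flight-data CSV used later in this lab)
val rdd0 = sc.textFile("data/flight-data/csv/2015-summary.csv")
val rdd1 = rdd0.filter(line => !line.startsWith("DEST_COUNTRY_NAME")).map(line => line.split(","))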
Now let’s create a DataFrame by converting the RDD’s element arrays to tuples and then calling toDF
on the resulting RDD. There’s no elegant way to convert an array to a tuple, so you have to resort to
this ugly expression:
val rdd3 = rdd1.map(x => (x(0), x(1), x(2)))
rdd3.take(5)
val df3 = rdd3.toDF()
df3.show(false)
df3.printSchema
If you call the show method without arguments, it shows the first 20 rows. This is nice, but the
column names are generic and aren’t particularly helpful. You can rectify that by specifying column
names when calling the toDF method:
val df3 = rdd3.toDF("Destination_Country_Name", "Origin_Country_Name", "Count")
If you call show now, you’ll see that column names appear in the output. You can also examine the
schema of the DataFrame with the printSchema method:
df3.printSchema
This method shows the information that the DataFrame has about its columns. You can see that column names are now available, but all the columns are of type String and they're all nullable. This could be desirable in some cases, but here it's obviously wrong: the Count column, for example, should be a numeric type (an integer or a long) rather than a String. The following RDD-to-DataFrame conversion method lets you specify the desired column types along with their names.
Converting RDDs to DataFrames by specifying a schema
The last method for converting RDDs to DataFrames is to use SparkSession's createDataFrame method, which takes an RDD containing objects of type Row and a StructType. In Spark SQL, a StructType represents a schema. It contains one or more StructFields, each describing a column. You can construct a StructType schema for the flights RDD with the following snippet:
import org.apache.spark.sql.types._
val flightSchema = StructType(Array(
  StructField("Destination_Country_Name", StringType, true),
  StructField("Origin_Country_Name", StringType, true),
  StructField("Count", IntegerType, true)
))
To map your RDD to the Row type, use the following snippet:
// import java.sql.Timestamp
object StringImplicits {
  implicit class StringImprovements(val s: String) {
    import scala.util.control.Exception.catching
    def toIntSafe = catching(classOf[NumberFormatException]) opt s.toInt
    // def toLongSafe = catching(classOf[NumberFormatException]) opt s.toLong
    // def toTimestampSafe = catching(classOf[IllegalArgumentException]) opt Timestamp.valueOf(s)
  }
}
import StringImplicits._
import org.apache.spark.sql._
def stringToRow(row: String): Row = {
  val r = row.split(",")
  Row(r(0),
    r(1),
    r(2).toIntSafe.getOrElse(null)
    // r(2).toInt
  )
}
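The snippet above defines the row-mapping function but not the final createDataFrame call. A minimal sketch, assuming the flight-data CSV path used elsewhere in this lab (rawLines is a hypothetical name for the un-split lines, header excluded):
// in Scala (a sketch)
val rawLines = sc.textFile("data/flight-data/csv/2015-summary.csv").filter(line => !line.startsWith("DEST_COUNTRY_NAME"))
val rowRDD = rawLines.map(line => stringToRow(line))
val DFStruct = spark.createDataFrame(rowRDD, flightSchema)
DFStruct.show(5)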
Two additional DataFrame functions can give you some information about the DataFrame's schema. The columns method returns a list of column names, and the dtypes method returns a list of tuples, each containing the column name and the name of its type. For the DFStruct DataFrame, the results look like this:
DFStruct.printSchema
DFStruct.columns
DFStruct.dtypes
spark.read
The foundation for reading data is spark.read, which returns a DataFrameReader. Through it you specify:
▪ The format
▪ The schema
▪ The read mode
▪ A series of options
The format, options, and schema each return a DataFrameReader that can undergo further transformations, and they are all optional except for the path. Each data source has a specific set of options that determine how the data is read into Spark (we cover these options shortly). At a minimum, you must supply the DataFrameReader with a path from which to read.
DataFrames can be created from existing RDDs, Hive tables, and other data sources such as text files and external databases.
Source JSON
The following examples show how to create a DataFrame from JSON contents with SparkSession.
Example 1:
val df = spark.read.format("json").load("data/flight-data/json/2015-summary.json")
val df = spark.read.json("data/flight-data/json/2015-summary.json")
Example 2:
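Example 2 assumes a bookDetails DataFrame has already been loaded. A minimal sketch (the file name matches the bookDetails.json used later in the Writing DataFrames part):
val bookDetails = spark.read.json("bookDetails.json")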
bookDetails.select("bookId","bookName").show()
Source CSV
Follow the examples shown here to create a DataFrame from CSV contents.
Example 1:
val flightData2015 = spark.read.option("inferSchema", "true").option("header", "true")
  .csv("data/flight-data/csv/2015-summary.csv")
Example 2:
val myManualSchema = new StructType(Array(
new StructField("DEST_COUNTRY_NAME", StringType, true),
new StructField("ORIGIN_COUNTRY_NAME", StringType, true),
new StructField("count", LongType, false)
))
val df = spark.read.format("csv")
.option("header", "true")
.option("mode", "FAILFAST")
.schema(myManualSchema)
.load("data/flight-data/csv/2010-summary.csv")
Example 3:
import java.sql.Timestamp
import org.apache.spark.sql.types._
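Example 3 lists only the imports. A minimal sketch of how they might be used to read a CSV with a timestamp column (the file name, columns, and format below are purely illustrative):
// a sketch; the schema and path are hypothetical
val eventSchema = StructType(Array(
  StructField("eventId", LongType, false),
  StructField("eventName", StringType, true),
  StructField("eventTime", TimestampType, true)))
val eventsDF = spark.read.format("csv")
  .option("header", "true")
  .option("timestampFormat", "yyyy-MM-dd HH:mm:ss")
  .schema(eventSchema)
  .load("data/events.csv")
eventsDF.printSchema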
Source JDBC
Spark SQL allows users to connect to external databases through JDBC (Java Database Connectivity). The tables from the databases can be loaded as DataFrames or Spark SQL temporary tables using the Data Sources API.
Driver: The class name of the JDBC driver to connect to the URL (e.g., com.mysql.cj.jdbc.Driver for a MySQL database).
The following code creates a DataFrame from a MySQL table.
val driver = "com.mysql.cj.jdbc.Driver"
val url = "jdbc:mysql://localhost:3306/bookdetails"
val username = "root"
val password = "MaMaryouma_2014"
val tablename = "bookDetails"
val jdbcDF = spark.read.format("jdbc")
  .option("driver", driver).option("url", url)
  .option("user", username).option("password", password)
  .option("dbtable", tablename).load()
jdbcDF.show()
It is mandatory to have the MySQL JDBC driver JAR on the Spark classpath.
Writing DataFrames
Writing CSV Files
Just as with reading data, there are a variety of options for writing data when we write CSV files. This
is a subset of the reading options because many do not apply when writing data (like maxColumns
and inferSchema). Here’s an example:
val bookDetailsDF = spark.read.format("json").load("bookDetails.json")
bookDetailsDF.write.format("csv").options(Map("header" -> "true", "delimiter" -> ",")).save("/path/to/output")
or, equivalently:
bookDetailsDF.write.format("csv")
  .option("header", "true")
  .option("delimiter", ",")
  .save("/path/to/output")
DataFrame API basics
Before we start working with the DataFrame API, we will build a DataFrame that we will use to illustrate the different features of the API.
In 2009, Stack Exchange released an anonymized data dump of all questions and answers in the Stack Exchange communities, and it continues to release data as new questions become available. One of the communities whose data was released is the Italian Language Stack Exchange, and we'll use its data to illustrate Spark SQL concepts in this lab. We chose this community because of its small size (you can easily download and use it on your laptop).
The original data is available in XML format. We preprocessed it and created a comma-separated values (CSV) file. The first file you'll use is italianPosts.csv. It contains Italian language–related questions (and answers) whose fields are delimited with tilde signs (~). The following implicit helper methods can be used to safely parse its numeric and timestamp fields:
import java.sql.Timestamp
object StringImplicits {
  implicit class StringImprovements(val s: String) {
    import scala.util.control.Exception.catching
    def toIntSafe = catching(classOf[NumberFormatException]) opt s.toInt
    def toLongSafe = catching(classOf[NumberFormatException]) opt s.toLong
    def toTimestampSafe = catching(classOf[IllegalArgumentException]) opt Timestamp.valueOf(s)
  }
}
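The sections below refer to DataFrames named itPostsDFStruct and postsDf built from this file. The exact schema is not listed in this lab, so the following is only a hedged sketch reconstructed around the columns used later; the field names, types, and their positions are assumptions and should be checked against the actual file.
// in Scala (a sketch; verify field order and types against italianPosts.csv)
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import StringImplicits._
val itPostsRaw = sc.textFile("first-edition-master/ch05/italianPosts.csv")
val postsSchema = StructType(Seq(
  StructField("commentCount", IntegerType, true),
  StructField("lastActivityDate", TimestampType, true),
  StructField("ownerUserId", LongType, true),
  StructField("body", StringType, true),
  StructField("score", IntegerType, true),
  StructField("creationDate", TimestampType, true),
  StructField("viewCount", IntegerType, true),
  StructField("title", StringType, true),
  StructField("tags", StringType, true),
  StructField("answerCount", IntegerType, true),
  StructField("acceptedAnswerId", LongType, true),
  StructField("postTypeId", LongType, true),
  StructField("id", LongType, false)))
def postStringToRow(row: String): Row = {
  val r = row.split("~")
  Row(r(0).toIntSafe.getOrElse(null),
    r(1).toTimestampSafe.getOrElse(null),
    r(2).toLongSafe.getOrElse(null),
    r(3),
    r(4).toIntSafe.getOrElse(null),
    r(5).toTimestampSafe.getOrElse(null),
    r(6).toIntSafe.getOrElse(null),
    r(7),
    r(8),
    r(9).toIntSafe.getOrElse(null),
    r(10).toLongSafe.getOrElse(null),
    r(11).toLongSafe.getOrElse(null),
    r(12).toLong)
}
val itPostsDFStruct = spark.createDataFrame(itPostsRaw.map(postStringToRow), postsSchema)
val postsDf = itPostsDFStruct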
Printing schema
itPostsDFStruct.printSchema
Displaying data
itPostsDFStruct.show(5)
Now that you have your DataFrame loaded (no matter which method you used to load the data), you
can start exploring the rich DataFrame API. DataFrames come with a DSL for manipulating data, which
is fundamental to working with Spark SQL. Spark’s machine learning library (ML) also relies on
DataFrames. They’ve become a cornerstone of Spark, so it’s important to get acquainted with their
API.
DataFrame’s DSL has a set of functionalities similar to the usual SQL functions for manipulating data
in relational databases. DataFrames work like RDDs: they’re immutable and lazy. They inherit their
immutable nature from the underlying RDD architecture. You can’t directly change data in a
DataFrame; you have to transform it into another one. They’re lazy because most DataFrame DSL
functions don’t return results. Instead, they return another DataFrame, similar to RDD
transformations.
In this lab part, we'll give you an overview of basic DataFrame functions and show how you can use them to select, filter, map, group, and join data. All of these functions also have SQL counterparts.
Several methods exist for creating Column objects. You can create columns using an existing
DataFrame object and its col function:
val postsIdBody = postsDf.select(postsDf.col("id"), postsDf.col("body"))
or you can use some of the implicit methods (imported). One of them indirectly converts Scala’s
Symbol class to a Column. Symbol objects are sometimes used in Scala programs as identifiers
instead of Strings because they’re interned (at most one instance of an object exists) and can be
quickly checked for equality. They can be instantiated with Scala’s built-in quote mechanism or with
Scala’s apply function, so the following two statements are equivalent:
val postsIdBody = postsDf.select(Symbol("id"), Symbol("body"))
val postsIdBody = postsDf.select('id, 'body)
Another implicit method (called $) converts strings to ColumnName objects (which inherit from
Column objects, so you can use ColumnName objects as well):
val postsIdBody = postsDf.select($"id", $"body")
Column objects are important for DataFrame DSL, so all this flexibility is justified. That’s how you
specify which columns to select. The resulting DataFrame contains only the specified columns. If you
need to select all the columns except a single one, you can use the drop function, which takes a
column name or a Column object and returns a new DataFrame with the specified column missing.
For example, to remove the body column from the postsIdBody DataFrame (and thus leave only the
id column), you can use the following line (drop method):
val postIds = postsIdBody.drop("body")
Filtering data
You can filter DataFrame data using the where and filter functions (they’re synonymous). They take a
Column object or an expression string. The variant taking a string is used for parsing SQL expressions.
Why do you pass a Column object to a filtering function, you ask? Because the Column class, in
addition to representing a column name, contains a rich set of SQL-like operators that you can use to
build expressions. These expressions are also represented by the Column class.
For example, to see how many posts contain the word Italiano in their body, you can use the
following line:
postsIdBody.filter('body contains "Italiano").count
To select all the questions that don’t have an accepted answer, use this expression:
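// a reconstruction based on the description that follows; it uses the 'postTypeId and 'acceptedAnswerId columns of the posts schema assumed in this lab
postsDf.filter(('postTypeId === 1) and ('acceptedAnswerId isNull)).count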
Here, the filter expression is based on two columns: the post type ID (which is equal to 1 for questions) and the accepted answer ID column. Each comparison yields a Column object, and the two are combined into a third one using the and operator. You need the extra parentheses to help the Scala parser find its way.
Note that you can use these column expressions in select functions as well. You can select only the
first n rows of a DataFrame with the limit function. The following will return a DataFrame containing
only the first 10 questions:
val firstTenQs = postsDf.filter('postTypeId === 1).limit(10)
To add a new column to a DataFrame, use the withColumn function and give it the column name and
the Column expression. Let’s say you’re interested in a “views per score point” metric (in other
words, how many views are needed to increase score by one), and you want to see the questions
whose value of this metric is less than some threshold (which means if the question is more
successful, it gains a higher score with fewer views). If your threshold is 35 (which is the actual
average), this can be accomplished with the following expression:
postsDf.filter('postTypeId === 1).withColumn("ratio", 'viewCount / 'score).where('ratio < 35).show()
The output is too wide to print on this page, but you can run the command yourself and see that the
output contains the extra column called ratio.
Sorting data
DataFrame’s orderBy and sort functions sort data (they’re equivalent). They take one or more column
names or one or more Column expressions. The Column class has asc and desc operators, which are
used for specifying the sort order. The default is to sort in ascending order.
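For example, a sketch using the postsDf columns assumed earlier in this lab:
postsDf.orderBy('creationDate desc).show(5)
postsDf.sort('score.desc, 'viewCount.asc).show(5)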
Aggregate functions return a single value for a group of rows. Examples include min (returns the minimum value in a group of rows), avg (calculates the average value in a group of rows), and so on.
Scalar and aggregate functions reside within the object org.apache.spark.sql.functions. You can
import them all at once (note that they’re automatically imported in Spark shell, so you don’t have to
do this yourself):
import org.apache.spark.sql.functions._
As an example, let’s find the question that was active for the largest amount of time. You can use the
lastActivityDate and creationDate columns for this and find the difference between them (in days)
using the datediff function, which takes two arguments: end date column and start date column. Here
goes:
postsDf.filter('postTypeId === 1).
withColumn("activePeriod", datediff('lastActivityDate, 'creationDate)).
orderBy('activePeriod desc).head.getString(3).
replace("&lt;","<").replace("&gt;",">")
As another example, let’s find the average and maximum score of all questions and the total number
of questions. Spark SQL makes this easy:
postsDf.select(avg('score), max('score), count('score)).show
Dealing with missing values
Missing values can be handled through the DataFrame's na field, which gives access to the drop, fill, and replace functions. Each of these methods has several versions. For example, to remove all the rows from postsDf that contain null or NaN values in at least one of their columns, call drop with no arguments:
val cleanPosts = postsDf.na.drop()
cleanPosts.count()
This is the same as calling drop("any"), which means null values can be in any of the columns. If you
use drop("all"), it removes rows that have null values in all of the columns. You can also specify
column names. For example, to remove the rows that don’t have an accepted answer ID, you can do
this:
postsDf.na.drop(Array("acceptedAnswerId"))
With the fill function, you can replace null and NaN values with a constant.
postsDf.na.fill(Map("viewCount" -> 0))
Finally, the replace function enables you to replace certain values in specific columns with different
ones. For example, imagine there was a mistake in your data export and you needed to change post
ID 1177 to 3000. You can do that with the replace function:
val postsDfCorrected = postsDf.na.replace(Array("id", "acceptedAnswerId"), Map(1177 -> 3000))
Performing joins
As an example, load the italianVotes.csv file into a DataFrame with the following code:
val itVotesRaw = sc.textFile("first-edition-master/ch05/italianVotes.csv").map(x => x.split("~"))
val itVotesRows = itVotesRaw.map(row => Row(row(0).toLong, row(1).toLong, row(2).toInt, Timestamp.valueOf(row(3))))
val votesSchema = StructType(Seq(
  StructField("id", LongType, false),
  StructField("postId", LongType, false),
  StructField("voteTypeId", IntegerType, false),
  StructField("creationDate", TimestampType, false)))
val votesDf = spark.createDataFrame(itVotesRows, votesSchema)
Joining the two DataFrames (postsDf and votesDf) on the postId column can be done like this:
val postsVotes = postsDf.join(votesDf, postsDf("id") === 'postId)
This performs an inner join. You can perform an outer join by adding another argument:
val postsVotesOuter = postsDf.join(votesDf, postsDf("id") === 'postId, "outer")
If you examine the contents of the postsVotesOuter DataFrame, you’ll notice there are some rows
with all null values in the votes columns. These are the posts that have no votes.
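For example, you could count those posts with a quick sketch that reuses the votesDf column reference to disambiguate the two id columns:
postsVotesOuter.filter(votesDf("id").isNull).count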