9 SparkSQL
These training materials were prepared as part of the İstanbul Big Data Eğitim ve Araştırma Merkezi Project (no. TR10/16/YNY/0036), carried out under the İstanbul Kalkınma Ajansı's 2016 Yenilikçi ve Yaratıcı İstanbul Mali Destek Programı (Innovative and Creative Istanbul Financial Support Program). Sole responsibility for the content belongs to Bahçeşehir Üniversitesi; it does not reflect the views of İSTKA or the Kalkınma Bakanlığı (Ministry of Development).
Spark SQL
blurs the lines between RDDs and relational tables
// Register the DataFrame as a temporary table so it can be queried with SQL.
people.registerTempTable("people")
// SQL statements can be run by using the sql method provided by sqlContext.
val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
Spark SQL provides two methods for creating a DataFrame from an RDD: toDF
and createDataFrame.
Creating a DataFrame using toDF
Spark SQL provides an implicit conversion method named toDF, which creates
a DataFrame from an RDD of objects represented by a case class.
• Spark SQL infers the schema of a dataset.
• The toDF method is not defined in the RDD class, but it is available through
an implicit conversion.
• To convert an RDD to a DataFrame using toDF, you need to import the
implicit methods defined in the implicits object.
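A minimal sketch of toDF, assuming a spark-shell session where sc and sqlContext are already defined:

import sqlContext.implicits._

// The case class defines the schema that toDF infers.
case class Person(name: String, age: Int)

// Build an RDD of case class instances and convert it to a DataFrame.
val peopleRDD = sc.parallelize(Seq(Person("Alice", 29), Person("Bob", 17)))
val people = peopleRDD.toDF()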
createDataFrame
The createDataFrame method takes two arguments, an RDD of Rows and a schema, and returns a DataFrame.
• The schema for a dataset can be specified with an instance of StructType, which is a case class.
• A StructType object contains a sequence of StructField objects.
• StructField is also defined as a case class.
• The key difference between the toDF and createDataFrame methods is that the former infers the schema of a dataset and the latter requires you to specify the schema.
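A sketch of createDataFrame with an explicitly specified schema, again assuming sc and sqlContext from a spark-shell session:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

// Specify the schema explicitly as a StructType of StructFields.
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)))

// Build an RDD of Rows that matches the schema and create the DataFrame.
val rowRDD = sc.parallelize(Seq(Row("Alice", 29), Row("Bob", 17)))
val peopleDF = sqlContext.createDataFrame(rowRDD, schema)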
Creating a DataFrame from a Data Source
Spark SQL provides a unified interface for creating a DataFrame from a variety of data sources.
• Spark SQL provides a class named DataFrameReader, which defines the interface for reading data from a data source.
• It allows you to specify different options for reading data.
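For example, reading a JSON file through the DataFrameReader returned by sqlContext.read (the path is a placeholder):

// format selects the data source; load executes the read and returns a DataFrame.
val df = sqlContext.read
  .format("json")
  .load("path/to/people.json")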
The caching functionality can be tuned using the setConf method in the
SQLContext or HiveContext class.
The two configuration parameters for caching are
• spark.sql.inMemoryColumnarStorage.compressed
• spark.sql.inMemoryColumnarStorage.batchSize
By default, compression is turned on and the batch size for columnar caching is
10,000.
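For example, both parameters can be set with setConf (the batch size value here is just an illustration):

// Keep compression on and raise the columnar cache batch size.
sqlContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", "true")
sqlContext.setConf("spark.sql.inMemoryColumnarStorage.batchSize", "20000")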
DataFrame columns and dtypes
The columns method returns the names of all the columns in the source
DataFrame as an array of String.
The dtypes method returns the data types of all the columns in the source
DataFrame as an array of tuples.
The first element in a tuple is the name of a column and the second element is
the data type of that column.
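For example, using the people DataFrame built in the toDF sketch above:

val cols  = people.columns   // e.g. Array("name", "age")
val types = people.dtypes    // e.g. Array(("name", "StringType"), ("age", "IntegerType"))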
explain, printSchema methods
The explain method prints the physical plan on the console. It is useful for
debugging.
The printSchema method prints the schema of the source DataFrame on the console in a tree format.
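For example:

people.explain()      // prints the physical plan to the console
people.printSchema()  // prints the schema as a tree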
registerTempTable, toDF methods
The registerTempTable method creates a temporary table in the Hive metastore.
• It takes a table name as an argument.
• A temporary table can be queried using the sql method in SQLContext or HiveContext, which returns a DataFrame.
• It is available only during the lifespan of the application that creates it.
The toDF method allows you to rename the columns in the source DataFrame. It takes the new names of the columns as arguments and returns a new DataFrame.
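A short sketch of both methods, assuming the people DataFrame from earlier:

// Register the DataFrame as a temporary table and query it with SQL.
people.registerTempTable("people")
val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")

// Rename the columns of the DataFrame with toDF.
val renamed = people.toDF("full_name", "age_in_years")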
Language-Integrated Query Methods: agg
The agg method is a commonly used language-integrated query method of the DataFrame class. It performs the specified aggregations on one or more columns in the source DataFrame and returns the result as a new DataFrame.
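For example, computing a few aggregates over the age column of the people DataFrame:

import org.apache.spark.sql.functions._

// Each aggregation becomes a column in the resulting single-row DataFrame.
val ageStats = people.agg(min("age"), max("age"), avg("age"))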
Language-Integrated Query Methods: apply
The apply method takes the name of a column as an argument and returns the
specified column in the source DataFrame as an instance of the Column class.
• The Column class provides operators for manipulating a column in a
DataFrame.
distinct
The distinct method returns a new DataFrame containing only the unique rows in the source DataFrame.
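A minimal sketch of apply and distinct on the people DataFrame:

// apply returns a Column; Column operators can be used in filters and expressions.
val ageCol = people("age")               // same as people.apply("age")
val adults = people.filter(ageCol > 18)

// distinct drops duplicate rows.
val uniquePeople = people.distinct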
cube
The cube method returns a cube for multi-dimensional analysis.
• It is useful for generating cross-tabular reports.
• Assume you have a dataset that tracks sales along three dimensions: time, product and country.
• The cube method generates aggregates for all the possible combinations of the dimensions.
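A sketch of cube on a hypothetical salesDF DataFrame with time, product, country and revenue columns:

import org.apache.spark.sql.functions._

// Aggregates revenue for every combination of the three dimensions,
// including subtotals and the grand total.
val salesCube = salesDF.cube("time", "product", "country").agg(sum("revenue"))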
explode
The explode method generates zero or more rows from a column using a user-provided function.
It takes three arguments:
• the input column,
• the output column,
• a user-provided function generating one or more values for the output column for each value in the input column.
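A sketch of explode on a hypothetical textDF DataFrame with a line column of sentences; each line is split into one row per word:

// The user-provided function returns zero or more values for the output column.
val words = textDF.explode("line", "word") { line: String => line.split(" ") }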
join
The join method performs a join of two DataFrames. It takes three arguments: a DataFrame, a join expression and a join type.
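A sketch of join, assuming hypothetical peopleDF and citiesDF DataFrames that share a city id:

// Left outer join on the city id; the third argument selects the join type.
val joined = peopleDF.join(citiesDF, peopleDF("city_id") === citiesDF("id"), "left_outer")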
limit, orderBy
The limit method returns a DataFrame containing the specified number of rows from the source DataFrame.
The orderBy method returns a DataFrame sorted by the given columns. It takes
the names of one or more columns as arguments.
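For example:

val firstTen = people.limit(10)
val sorted   = people.orderBy("age", "name")   // ascending by default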
randomSplit, sample
The randomSplit method splits the source DataFrame into multiple DataFrames. It takes an array of weights as an argument and returns an array of DataFrames. It is a useful method for machine learning, where you want to split the raw dataset into training, validation and test datasets.
The sample method returns a DataFrame containing a randomly sampled fraction of the rows in the source DataFrame.
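For example, splitting the people DataFrame into training, validation and test sets, and drawing a 10% sample:

// Weights are normalized if they do not sum to 1.
val Array(training, validation, test) = people.randomSplit(Array(0.6, 0.2, 0.2))

// Sample roughly 10% of the rows without replacement.
val tenPercent = people.sample(withReplacement = false, fraction = 0.1)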
rollup
The rollup method is useful for subaggregation along a hierarchical dimension such as geography or time.
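A sketch of rollup on a hypothetical salesDF DataFrame, aggregating along a country/state hierarchy:

import org.apache.spark.sql.functions._

// Produces aggregates for (country, state), (country) and the grand total,
// but not for (state) alone, unlike cube.
val salesRollup = salesDF.rollup("country", "state").agg(sum("revenue"))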
select
The select method returns a DataFrame containing only the specified columns from the source DataFrame.
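For example:

val namesAndAges = people.select("name", "age")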
collect
The collect method returns the data in a DataFrame as an array of
Rows.
count
The count method returns the number of rows in the source
DataFrame.
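For example:

val rows  = people.collect()   // Array[Row] brought back to the driver
val total = people.count()     // number of rows as a Long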
DataFrame Actions: describe
The describe method can be used for exploratory data analysis.
• It returns summary statistics for numeric columns in the
source DataFrame.
• The summary statistics include min, max, count, mean, and standard deviation.
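For example, summary statistics for the age column of the people DataFrame:

people.describe("age").show()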
DataFrame Actions: first, show, take
The first method returns the first row in the source DataFrame.
The show method displays the rows in the source DataFrame on the driver console in a tabular format. It optionally takes the number of rows to display; by default, it shows the top 20 rows.
The take method returns the specified number of rows from the beginning of the source DataFrame as an array of Rows.
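For example:

val firstRow  = people.first()   // first Row
people.show()                    // top 20 rows in tabular format
people.show(5)                   // top 5 rows
val firstFive = people.take(5)   // Array of the first 5 Rows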
DataFrameWriter
The DataFrameWriter class defines the interface for writing data to a data source.
The same interface can be used to write data to relational databases, NoSQL data stores and a variety of file formats.
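A sketch of writing a DataFrame through the DataFrameWriter returned by write (the output path is a placeholder):

// format selects the output data source, mode controls behaviour when the
// destination already exists, and save executes the write.
people.write
  .format("parquet")
  .mode("overwrite")
  .save("path/to/output")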
SparkSQL Built-in Functions
Spark SQL comes with a comprehensive list of built-in functions, which are
optimized for fast execution.
• The built-in functions can be used from both the DataFrame API and SQL
interface.
• To use Spark’s built-in functions from the DataFrame API, you need to add
the following import statement to your source code.
import org.apache.spark.sql.functions._
The field extraction functions allow you to extract year, month, day, hour,
minute, and second from a Date/Time value.
• The built-in field extraction functions include year, quarter, month, weekofyear,
dayofyear, dayofmonth, hour, minute, and second.
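A sketch of the field extraction functions on a hypothetical eventsDF DataFrame with a ts timestamp column:

import org.apache.spark.sql.functions._

val parts = eventsDF.select(
  year(col("ts")), month(col("ts")), dayofmonth(col("ts")),
  hour(col("ts")), minute(col("ts")), second(col("ts")))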
The string functions: Spark SQL provides a variety of built-in functions for
processing columns that contain string values.
• The built-in string functions include ascii, base64, concat, concat_ws, decode, encode, format_number, format_string, get_json_object, initcap, instr, length, levenshtein, locate, lower, lpad, ltrim, printf, regexp_extract, regexp_replace, repeat, reverse, rpad, rtrim, soundex, space, split, substring, substring_index, translate, trim, unbase64, upper, and other commonly used string functions.
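A few of the string functions applied to the name column of the people DataFrame:

import org.apache.spark.sql.functions._

val cleaned = people.select(
  upper(col("name")), lower(col("name")), length(col("name")), trim(col("name")))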
To use a few classes and functions from the Spark SQL library, add the following import statement:
import org.apache.spark.sql._
The preceding code first uses the filter method to keep the businesses that have an average rating of 5.0.
You could also have written the language-integrated query version as follows:
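A minimal sketch of such a language-integrated query, assuming a hypothetical businesses DataFrame with an avg_rating column:

// Keep only the businesses whose average rating is exactly 5.0.
val fiveStar = businesses.filter(businesses("avg_rating") === 5.0)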