
Spark Programming – Spark SQL

These training presentations were produced as part of the Istanbul Big Data Education and Research Center Project (no. TR10/16/YNY/0036), carried out under the Istanbul Development Agency's 2016 Innovative and Creative Istanbul Financial Support Program. Sole responsibility for the content belongs to Bahçeşehir University; it does not reflect the views of İSTKA or the Ministry of Development.
Spark SQL
Spark SQL blurs the lines between RDDs and relational tables.

It lets you intermix SQL commands that query external data with complex analytics, all within a single application:
• allows SQL extensions based on MLlib
• Shark is being migrated to Spark SQL
Spark SQL

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._

// Define the schema using a case class.
case class Person(name: String, age: Int)

// Create an RDD of Person objects and register it as a table.
val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
people.registerAsTable("people")

// SQL statements can be run by using the sql methods provided by sqlContext.
val teenagers = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

// The results of SQL queries are SchemaRDDs and support all the
// normal RDD operations. The columns of a row in the result can be
// accessed by ordinal.
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
Hive Interoperability
 Spark SQL is compatible with Hive.
 It not only supports HiveQL, but can also access Hive
metastore, SerDes, and UDFs.
 You can also replace Hive with Spark SQL to get better
performance.
 HiveQL queries run much faster on Spark SQL than on
Hive.
Spark SQL: queries in HiveQL

// val sc: SparkContext  // An existing SparkContext.
// NB: the example on a laptop lacks a Hive MetaStore.
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

// Importing the SQL context gives access to all the
// public SQL functions and implicit conversions.
import hiveContext._

hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
hql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

// Queries are expressed in HiveQL.
hql("FROM src SELECT key, value").collect().foreach(println)
Executing SQL Queries Programmatically
The SQLContext class provides a method named sql, which executes a SQL query using Spark.

It takes a SQL statement as an argument and returns the result as an instance of the DataFrame class.
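
A minimal sketch of calling sql programmatically, assuming the people table registered in the earlier example:

// Run a SQL statement and get the result back as a DataFrame.
val adults = sqlContext.sql("SELECT name, age FROM people WHERE age > 21")
adults.show()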
DataFrame
DataFrame is Spark SQL’s primary data abstraction.
• Unlike RDD, DataFrame is schema aware.
• It represents a distributed collection of rows organized into named columns.
Conceptually, it is similar to a table in a relational database.
DataFrame Row
Row is a Spark SQL abstraction for representing a row of data.
• Conceptually, it is equivalent to a relational tuple or row in a table.
• Spark SQL provides factory methods to create Row objects. An example is
shown next.
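
A minimal sketch of creating and reading a Row (the field values are illustrative):

import org.apache.spark.sql.Row

// Create a Row with the Row factory method and read its fields by ordinal.
val row = Row("Alice", 30)
val name = row.getString(0)
val age  = row.getInt(1)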
Creating a DataFrame
A DataFrame can be created in two ways:
• from a data source, or
• from an RDD.

Spark SQL provides two methods for creating a DataFrame from an RDD: toDF
and createDataFrame.
Creating a DataFrame using toDF
Spark SQL provides an implicit conversion method named toDF, which creates
a DataFrame from an RDD of objects represented by a case class.
• Spark SQL infers the schema of a dataset.
• The toDF method is not defined in the RDD class, but it is available through
an implicit conversion.
• To convert an RDD to a DataFrame using toDF, you need to import the
implicit methods defined in the implicits object.
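
A minimal sketch of toDF; the Product case class and productDF below are hypothetical and are reused in later sketches:

import sqlContext.implicits._

// The case class defines the schema; Spark SQL infers column names and types from it.
case class Product(name: String, price: Double)

val productsRDD = sc.parallelize(Seq(Product("pen", 1.5), Product("notebook", 4.0)))
val productDF = productsRDD.toDF()
productDF.printSchema()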
createDataFrame
The createDataFrame method takes two arguments, an RDD of Rows and a schema, and returns a DataFrame.
• The schema for a dataset can be specified with an instance of StructType, which is a case class.
• A StructType object contains a sequence of StructField objects.
• StructField is also defined as a case class.
• The key difference between the toDF and createDataFrame methods is that the former infers the schema of a dataset and the latter requires you to specify the schema.
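
A minimal sketch of createDataFrame with an explicit StructType schema (the field names are illustrative):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Build an RDD of Rows and describe its schema explicitly.
val rowRDD = sc.parallelize(Seq(Row("Alice", 30), Row("Bob", 25)))
val schema = StructType(Seq(
  StructField("name", StringType, nullable = false),
  StructField("age", IntegerType, nullable = false)))
val peopleDF = sqlContext.createDataFrame(rowRDD, schema)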
Creating a DataFrame from a Data Source
Spark SQL provides a unified interface for creating a DataFrame from a variety of data sources.
• Spark SQL provides a class named DataFrameReader, which defines the interface for reading data from a data source.
• It allows you to specify different options for reading data.

For example, the same API can be used to create a DataFrame from a MySQL, PostgreSQL, Oracle, or Cassandra table.
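
A hedged sketch of reading a relational table through the unified read interface; the URL, table name, and credentials below are placeholders:

// Read a table over JDBC; the connection details are illustrative only.
val salesDF = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:mysql://host:3306/salesdb")
  .option("dbtable", "sales")
  .option("user", "username")
  .option("password", "password")
  .load()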
DataFrame from JSON using schema
The DataFrameReader class provides a method named json for reading a JSON dataset.
• It takes a path as argument and returns a DataFrame.
• The path can be the name of either a JSON file or a directory containing multiple JSON files.
• Spark SQL automatically infers the schema of a JSON dataset by scanning the entire dataset.
• You can avoid the scan and speed up DataFrame creation by specifying the schema.
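
A sketch of supplying a schema up front to skip the inference scan; the path and field names are illustrative:

import org.apache.spark.sql.types._

// Providing the schema avoids a full pass over the JSON files.
val bizSchema = StructType(Seq(
  StructField("name", StringType),
  StructField("stars", DoubleType)))
val bizDF = sqlContext.read.schema(bizSchema).json("path/to/business.json")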
Processing Data Programmatically with SQL/HiveQL
The sql method in the HiveContext class allows using HiveQL, whereas the sql method in the SQLContext class allows using SQL statements.
• The table referenced in a SQL/HiveQL statement must have an entry in a Hive metastore.
• If not, you can create a temporary table using the registerTempTable method provided by the DataFrame class.
• The sql method returns the result as a DataFrame, which can be displayed on a console or saved to a data source.
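
A minimal sketch, assuming the hypothetical productDF defined earlier:

// Register a temporary table so it can be referenced from SQL/HiveQL.
productDF.registerTempTable("products")
val cheap = sqlContext.sql("SELECT name, price FROM products WHERE price < 5.0")
cheap.show()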
Processing Data with the DataFrame API
The DataFrame API provides an alternative way for processing a dataset.
Basic DataFrame Operations: cache
The cache method stores the source DataFrame in memory using a columnar
format.
• It scans only the required columns and stores them in a compressed in-memory columnar format.
• Spark SQL automatically selects a compression codec for each column
based on data statistics.

The caching functionality can be tuned using the setConf method in the
SQLContext or HiveContext class.
The two configuration parameters for caching are
• spark.sql.inMemoryColumnarStorage.compressed
• and spark.sql.inMemoryColumnarStorage.batchSize.
By default, compression is turned on and the batch size for columnar caching is
10,000.
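
A sketch of caching and the two tuning parameters, using the hypothetical productDF from earlier:

// Tune columnar caching, then cache the DataFrame.
sqlContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", "true")
sqlContext.setConf("spark.sql.inMemoryColumnarStorage.batchSize", "10000")
productDF.cache()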
DataFrame columns and dtypes
The columns method returns the names of all the columns in the source
DataFrame as an array of String.

The dtypes method returns the data types of all the columns in the source
DataFrame as an array of tuples.
The first element in a tuple is the name of a column and the second element is
the data type of that column.
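
For example, with the hypothetical productDF from earlier:

val names = productDF.columns   // Array[String] of column names
val types = productDF.dtypes    // Array[(String, String)] of (name, type) pairs
types.foreach { case (col, dt) => println(s"$col: $dt") }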
explain, printSchema methods
The explain method prints the physical plan on the console. It is useful for
debugging.

The printSchema method prints the schema of the source DataFrame on the console in a tree format.
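
For example:

productDF.explain()      // prints the physical plan
productDF.printSchema()  // prints the schema as a tree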
registerTempTable, toDF methods
The registerTempTable method registers the source DataFrame as a temporary table with the given name.
• It takes a table name as an argument.
• A temporary table can be queried using the sql method in SQLContext or HiveContext, which returns the result as a DataFrame.
• It is available only during the lifespan of the application that creates it.

The toDF method allows you to rename the columns in the source DataFrame.
It takes the new column names as arguments and returns a new DataFrame.
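
A sketch of renaming the columns of the hypothetical productDF:

// Rename both columns and return a new DataFrame.
val renamedDF = productDF.toDF("product_name", "unit_price")
renamedDF.printSchema()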
Language-Integrated Query Methods: agg
The agg method is a commonly used language-integrated query method of the DataFrame class. It performs the specified aggregations on one or more columns in the source DataFrame and returns the result as a new DataFrame.
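
A sketch using built-in aggregate functions on the price column of the hypothetical productDF:

import org.apache.spark.sql.functions._

val priceStats = productDF.agg(min("price"), max("price"), avg("price"))
priceStats.show()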
Language-Integrated Query Methods: apply
The apply method takes the name of a column as an argument and returns the
specified column in the source DataFrame as an instance of the Column class.
• The Column class provides operators for manipulating a column in a
DataFrame.

Scala allows using productDF("price") instead of productDF.apply("price")


• It automatically converts productDF("price") to productDF.apply("price")
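
For example:

val priceCol = productDF("price")   // same as productDF.apply("price")
val discounted = productDF.select(productDF("name"), priceCol * 0.9)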
distinct
If a method or function expects an instance of the Column class as an argument, you can use the $"..." notation to select a column in a DataFrame.

Three equivalent ways of referring to a column are shown in the sketch below.

The distinct method returns a new DataFrame containing only the unique rows in the source DataFrame.
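
A sketch of the equivalent forms, plus distinct, using the hypothetical productDF and assuming import sqlContext.implicits._ for the $ notation:

// Three equivalent ways to reference the "price" column in a filter:
productDF.filter(productDF("price") > 3.0)
productDF.filter(productDF.apply("price") > 3.0)
productDF.filter($"price" > 3.0)

// distinct keeps only the unique rows.
val uniqueProducts = productDF.distinct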
cube
The cube method returns a cube for multi-dimensional analysis.
• It is useful for generating cross-tabular reports.
• Assume you have a dataset that tracks sales along three dimensions: time, product, and country.
• The cube method generates aggregates for all the possible combinations of the dimensions.
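
A sketch, assuming a hypothetical salesDF with time, product, country, and revenue columns:

import org.apache.spark.sql.functions._

// Aggregates for every combination of the three dimensions.
val salesCube = salesDF.cube("time", "product", "country").agg(sum("revenue"))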
explode
The explode method generates zero or more rows from a column using a user-provided function.
It takes three arguments:
• the input column,
• the output column, and
• a user-provided function generating one or more values for the output column for each value in the input column.

For example, consider a text column containing the contents of an email, where you want to split the email content into individual words and produce a row for each word in an email.
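
A hedged sketch of that example, assuming a hypothetical emailsDF with a text column named content:

// Produce one output row per word found in each email body.
val wordsDF = emailsDF.explode("content", "word") { text: String => text.split(" ") }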
filter
The filter method filters rows in the source DataFrame using a SQL expression
provided to it as an argument.
It returns a new DataFrame containing only the filtered rows.
The SQL expression can be passed as a string argument.
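
For example:

val affordable = productDF.filter("price < 5.0")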
groupBy
The groupBy method groups the rows in the source DataFrame using the
columns provided to it as arguments.
Aggregation can be performed on the grouped data returned by this method.
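
A sketch, again assuming the hypothetical salesDF:

val salesPerCountry = salesDF.groupBy("country").count()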
intersect
The intersect method takes a DataFrame as an argument and returns a new DataFrame containing only the rows that appear in both the input and the source DataFrame.
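
For example, with two hypothetical DataFrames that share the same schema:

val commonProducts = productDF.intersect(otherProductDF)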
join
The join method performs a SQL join of the source DataFrame with another DataFrame.

It takes three arguments: a DataFrame, a join expression, and a join type.
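
A sketch, assuming hypothetical ordersDF and customersDF that share a customer_id column:

val joinedDF = ordersDF.join(customersDF,
  ordersDF("customer_id") === customersDF("customer_id"),
  "left_outer")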
limit, orderBy
The limit method returns a DataFrame containing the specified number of rows from the source DataFrame.

The orderBy method returns a DataFrame sorted by the given columns. It takes
the names of one or more columns as arguments.
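
For example, using the hypothetical productDF:

val cheapestFive = productDF.orderBy("price").limit(5)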
randomSplit, sample
The randomSplit method splits the source DataFrame into multiple
DataFrames. It takes an array of weights as argument and returns an array of
DataFrames. It is a useful method for machine learning, where you want
to split the raw dataset into training, validation and test datasets.

The sample method returns a DataFrame containing the specified fraction of the rows in the source DataFrame.
It takes two arguments.
• The first argument is a Boolean value indicating whether sampling should be
done with replacement.
• The second argument specifies the fraction of the rows that should be
returned.
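
A sketch of both methods on the hypothetical productDF:

// 60/20/20 split for training, validation, and test sets.
val Array(trainDF, validationDF, testDF) = productDF.randomSplit(Array(0.6, 0.2, 0.2))

// 10% sample without replacement.
val sampledDF = productDF.sample(withReplacement = false, fraction = 0.1)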
rollup
The rollup method takes the names of one or more columns as arguments and returns a multi-dimensional rollup.

It is useful for subaggregation along a hierarchical dimension such as geography or time.
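
A sketch of a rollup along a time hierarchy, assuming a hypothetical salesDF with year, month, day, and revenue columns:

import org.apache.spark.sql.functions._

// Subtotals at the year, (year, month), and (year, month, day) levels.
val salesRollup = salesDF.rollup("year", "month", "day").agg(sum("revenue"))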
select
The select method returns a DataFrame containing only the specified columns from the source DataFrame.

A variant of the select method allows one or more Column expressions as arguments.
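
For example, on the hypothetical productDF:

val namesAndPrices = productDF.select("name", "price")
val discountedPrices = productDF.select($"name", $"price" * 0.9)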
selectExpr
The selectExpr method accepts one or more SQL expressions as arguments and returns a DataFrame generated by executing the specified SQL expressions.
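
For example:

val withTax = productDF.selectExpr("name", "price * 1.18 AS price_with_tax")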
withColumn
The withColumn method adds a new column to or replaces an
existing column in the source DataFrame and returns a new
DataFrame.

It takes two arguments:
• the name of the new column
• an expression for generating the values of the new column.
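
For example:

val taxedDF = productDF.withColumn("price_with_tax", $"price" * 1.18)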
RDD Operations
The DataFrame class supports commonly used RDD operations such as map, flatMap, foreach, foreachPartition, mapPartitions, coalesce, and repartition.
• These methods work similarly to the operations in the RDD class.
• If you need access to other RDD methods that are not present in the DataFrame class, you can get an RDD from a DataFrame.
RDD Operations
Fields in a Row can also be extracted using Scala pattern matching.
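
A sketch of dropping to the underlying RDD and extracting fields with pattern matching, assuming the productDF schema used in the earlier sketches:

import org.apache.spark.sql.Row

// Get the underlying RDD[Row] and pattern match on each Row.
val namePricePairs = productDF.rdd.map { case Row(name: String, price: Double) => (name, price) }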
DataFrame Actions
Similar to the RDD actions, the action methods in the DataFrame
class return results to the Driver program.

collect
The collect method returns the data in a DataFrame as an array of
Rows.

count
The count method returns the number of rows in the source
DataFrame.
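
For example:

val allRows = productDF.collect()   // Array[Row] gathered on the driver
val rowCount = productDF.count()    // number of rows as a Long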
DataFrame Actions: describe
The describe method can be used for exploratory data analysis.
• It returns summary statistics for numeric columns in the
source DataFrame.
• The summary statistics include min, max, count, mean, and standard deviation.
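
For example:

productDF.describe("price").show()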
DataFrame Actions: first, show, take
The first method returns the first row in the source DataFrame.

The show method displays the rows in the source DataFrame on the driver console in a tabular format.
It optionally displays the top N rows; by default, it shows the top 20.

The take method takes an integer N as an argument and returns the first N rows from the source DataFrame as an array of Rows.
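
For example:

val firstRow = productDF.first()   // a single Row
productDF.show(5)                  // display the top 5 rows
val firstFive = productDF.take(5)  // Array of the first 5 Rows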
Saving a DataFrame
Spark SQL provides a unified interface for saving a DataFrame to a variety of data sources.

The same interface can be used to write data to relational databases, NoSQL data stores, and a variety of file formats.

The DataFrameWriter class defines the interface for writing data to a data source.
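
A sketch of the unified write interface; the output paths are placeholders:

// Write the same DataFrame to different formats through DataFrameWriter.
productDF.write.parquet("path/to/output/products.parquet")
productDF.write.format("json").save("path/to/output/products.json")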
SparkSQL Built-in Functions
Spark SQL comes with a comprehensive list of built-in functions, which are
optimized for fast execution.
• The built-in functions can be used from both the DataFrame API and SQL
interface.
• To use Spark’s built-in functions from the DataFrame API, you need to add
the following import statement to your source code.
import org.apache.spark.sql.functions._

The built-in functions can be classified into the following categories:
• aggregate,
• collection,
• date/time,
• math,
• string,
• window, and
• miscellaneous functions.
Aggregate
The aggregate functions can be used to perform aggregations on a column.

The built-in aggregate functions include
• approxCountDistinct,
• avg,
• count,
• countDistinct,
• first,
• last,
• max,
• mean,
• min,
• sum, and
• sumDistinct.
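
A sketch combining groupBy with several built-in aggregate functions, assuming the hypothetical salesDF from earlier:

import org.apache.spark.sql.functions._

val summary = salesDF.groupBy("country")
  .agg(sum("revenue"), avg("revenue"), countDistinct("product"))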
Collection, Date/Time functions
The collection functions operate on columns containing a collection of elements.
The built-in collection functions include array_contains, explode, size, and sort_array.

The date/time functions make it easy to process columns containing date/time values.
These functions can be further sub-classified into the following categories: conversion, extraction, arithmetic, and miscellaneous functions.
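
A hedged sketch, assuming a hypothetical DataFrame df with an array column tags and a timestamp column created_at:

import org.apache.spark.sql.functions._

val tagsInfo = df.select(size($"tags"), sort_array($"tags"))
val dates    = df.select(year($"created_at"), weekofyear($"created_at"))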
Conversion, Field Extraction, Arithmetic
The conversion functions convert date/time values from one format to another.
For example, you can convert a timestamp string in yyyy-MM-dd HH:mm:ss
format to a Unix epoch value using the unix_timestamp function.
• The built-in conversion functions include unix_timestamp, from_unixtime,
to_date, quarter, day, dayofyear, weekofyear, from_utc_timestamp, and
to_utc_timestamp.

The field extraction functions allow you to extract year, month, day, hour,
minute, and second from a Date/Time value.
• The built-in field extraction functions include year, quarter, month, weekofyear,
dayofyear, dayofmonth, hour, minute, and second.

The arithmetic functions allow you to perform arithmetic operations on columns containing dates. For example, you can calculate the difference between two dates, add days to a date, or subtract days from a date.
• The built-in date arithmetic functions include datediff, date_add, date_sub,
add_months, last_day, next_day, and months_between.
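
A sketch, assuming a hypothetical ordersDF with order_date and ship_date columns:

import org.apache.spark.sql.functions._

val shipping = ordersDF.select(
  datediff($"ship_date", $"order_date"),   // days between the two dates
  date_add($"order_date", 7),              // a week after the order
  month($"order_date"))                    // extracted month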
Miscellaneous functions
Spark SQL provides a few other useful date- and time-related functions:
• current_date, current_timestamp, trunc, date_format.

The math functions operate on columns containing numerical values. Spark SQL comes with a long list of built-in math functions.
• abs, ceil, cos, exp, factorial, floor, hex, hypot, log, log10, pow, round, shiftLeft, sin, sqrt, tan, and other commonly used math functions.

The string functions: Spark SQL provides a variety of built-in functions for processing columns that contain string values.
• The built-in string functions include ascii, base64, concat, concat_ws, decode, encode, format_number, format_string, get_json_object, initcap, instr, length, levenshtein, locate, lower, lpad, ltrim, printf, regexp_extract, regexp_replace, repeat, reverse, rpad, rtrim, soundex, space, split, substring, substring_index, translate, trim, unbase64, upper, and other commonly used string functions.

Spark SQL supports window functions for analytics. A window function performs a calculation across a set of rows that are related to the current row.
• The built-in window functions provided by Spark SQL include cumeDist, denseRank, lag, lead, ntile, percentRank, rank, and rowNumber.
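
A sketch of a window function that ranks products by price within each country, assuming a DataFrame with those columns and the implicits import:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val byCountry = Window.partitionBy("country").orderBy($"price".desc)
val ranked = productDF.withColumn("price_rank", rank().over(byCountry))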
Interactive Analysis Example
Launch the Spark shell from a terminal:
path/to/spark/bin/spark-shell --master local[*]

To use classes and functions from the Spark SQL library, add an import statement:
import org.apache.spark.sql._

Create a DataFrame from a dataset:
val biz = sqlContext.read.json("path/to/yelp_academic_dataset_business.json")
Language-Integrated Query vs SQL

The query uses the filter method to select the businesses that have an average rating of 5.0.

You could also have written the language-integrated query version, as sketched below:
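
A minimal sketch of the two forms, assuming the Yelp business dataset exposes a stars column:

// Filter with a SQL expression passed as a string:
val fiveStar = biz.filter("stars = 5.0")

// Equivalent language-integrated version using a Column expression:
val fiveStarLiq = biz.filter(biz("stars") === 5.0)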
