
Spark Programming – Spark SQL

These training presentations were produced as part of the Istanbul Big Data Education and Research Center Project (no. TR10/16/YNY/0036), carried out under the Istanbul Development Agency's 2016 Innovative and Creative Istanbul Financial Support Program. Sole responsibility for the content belongs to Bahçeşehir University; it does not reflect the views of İSTKA or the Ministry of Development.
Spark SQL
Spark SQL blurs the lines between RDDs and relational tables.

It lets you intermix SQL commands that query external data with complex analytics, all within a single application:
• allows SQL extensions based on MLlib
• Shark is being migrated to Spark SQL
Spark SQL

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._

// Define the schema using a case class.
case class Person(name: String, age: Int)

// Create an RDD of Person objects and register it as a table.
val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
people.registerAsTable("people")

// SQL statements can be run by using the sql methods provided by sqlContext.
val teenagers = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

// The results of SQL queries are SchemaRDDs and support all the
// normal RDD operations. The columns of a row in the result can be
// accessed by ordinal.
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
Hive Interoperability
 Spark SQL is compatible with Hive.
 It not only supports HiveQL, but can also access Hive
metastore, SerDes, and UDFs.
 You can also replace Hive with Spark SQL to get better
performance.
 HiveQL queries run much faster on Spark SQL than on
Hive.
Spark SQL: queries in HiveQL

// val sc: SparkContext  // An existing SparkContext.
// NB: the example on a laptop lacks a Hive MetaStore.
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

// Importing the SQL context gives access to all the
// public SQL functions and implicit conversions.
import hiveContext._

hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
hql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

// Queries are expressed in HiveQL.
hql("FROM src SELECT key, value").collect().foreach(println)
Executing SQL Queries Programmatically
The SQLContext class provides a method named sql, which executes a SQL query using Spark.

It takes a SQL statement as an argument and returns the result as an instance of the DataFrame class.
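
A minimal sketch of calling sql programmatically, assuming the people table registered in the earlier example:

// Run a SQL statement and get the result back as a DataFrame.
val adults = sqlContext.sql("SELECT name, age FROM people WHERE age > 21")
adults.show()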
DataFrame
DataFrame is Spark SQL’s primary data abstraction.
• Unlike RDD, DataFrame is schema aware.
• It represents a distributed collection of rows organized into named columns.
Conceptually, it is similar to a table in a relational database.
DataFrame Row
Row is a Spark SQL abstraction for representing a row of data.
• Conceptually, it is equivalent to a relational tuple or row in a table.
• Spark SQL provides factory methods to create Row objects. An example is
shown next.
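
A minimal sketch of creating and reading a Row (the field values are illustrative):

import org.apache.spark.sql.Row

// Create a Row with the Row factory method and read its fields by ordinal.
val row = Row("Alice", 30)
val name = row.getString(0)
val age  = row.getInt(1)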
Creating a DataFrame
A DataFrame can be created in two ways:
• from a data source, or
• from an RDD.

Spark SQL provides two methods for creating a DataFrame from an RDD: toDF
and createDataFrame.
Creating a DataFrame using toDF
Spark SQL provides an implicit conversion method named toDF, which creates
a DataFrame from an RDD of objects represented by a case class.
• Spark SQL infers the schema of a dataset.
• The toDF method is not defined in the RDD class, but it is available through
an implicit conversion.
• To convert an RDD to a DataFrame using toDF, you need to import the
implicit methods defined in the implicits object.
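
A minimal sketch of toDF; the Product case class and productDF below are hypothetical and are reused in later sketches:

import sqlContext.implicits._

// The case class defines the schema; Spark SQL infers column names and types from it.
case class Product(name: String, price: Double)

val productsRDD = sc.parallelize(Seq(Product("pen", 1.5), Product("notebook", 4.0)))
val productDF = productsRDD.toDF()
productDF.printSchema()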
createDataFrame
The createDataFrame method takes two arguments, an RDD of Rows and a schema, and returns a DataFrame.
• The schema for a dataset can be specified with an instance of StructType, which is a case class.
• A StructType object contains a sequence of StructField objects.
• StructField is also defined as a case class.
• The key difference between the toDF and createDataFrame methods is that the former infers the schema of a dataset and the latter requires you to specify the schema.
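
A minimal sketch of createDataFrame with an explicit StructType schema (the field names are illustrative):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Build an RDD of Rows and describe its schema explicitly.
val rowRDD = sc.parallelize(Seq(Row("Alice", 30), Row("Bob", 25)))
val schema = StructType(Seq(
  StructField("name", StringType, nullable = false),
  StructField("age", IntegerType, nullable = false)))
val peopleDF = sqlContext.createDataFrame(rowRDD, schema)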
Creating a DataFrame from a Data Source
Spark SQL provides a unified interface for creating a DataFrame from a variety of data sources.
• Spark SQL provides a class named DataFrameReader, which defines the interface for reading data from a data source.
• It allows you to specify different options for reading data.

For example, the same API can be used to create a DataFrame from a MySQL, PostgreSQL, Oracle, or Cassandra table.
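
A hedged sketch of reading a relational table through the unified read interface; the URL, table name, and credentials below are placeholders:

// Read a table over JDBC; the connection details are illustrative only.
val salesDF = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:mysql://host:3306/salesdb")
  .option("dbtable", "sales")
  .option("user", "username")
  .option("password", "password")
  .load()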
DataFrame from JSON using schema
The DataFrameReader class provides a method named json for reading a JSON dataset.
• It takes a path as argument and returns a DataFrame.
• The path can be the name of either a JSON file or a directory containing multiple JSON files.
• Spark SQL automatically infers the schema of a JSON dataset by scanning the entire dataset.
• You can avoid the scan and speed up DataFrame creation by specifying the schema.
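
A sketch of supplying a schema up front to skip the inference scan; the path and field names are illustrative:

import org.apache.spark.sql.types._

// Providing the schema avoids a full pass over the JSON files.
val bizSchema = StructType(Seq(
  StructField("name", StringType),
  StructField("stars", DoubleType)))
val bizDF = sqlContext.read.schema(bizSchema).json("path/to/business.json")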
Processing Data Programmatically with SQL/HiveQL
The sql method in the HiveContext class allows using HiveQL, whereas the sql method in the SQLContext class allows using SQL statements.
• The table referenced in a SQL/HiveQL statement must have an entry in a Hive metastore.
• If not, you can create a temporary table using the registerTempTable method provided by the DataFrame class.
• The sql method returns the result as a DataFrame, which can be displayed on a console or saved to a data source.
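
A minimal sketch, assuming the hypothetical productDF defined earlier:

// Register a temporary table so it can be referenced from SQL/HiveQL.
productDF.registerTempTable("products")
val cheap = sqlContext.sql("SELECT name, price FROM products WHERE price < 5.0")
cheap.show()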
Processing Data with the DataFrame API
The DataFrame API provides an alternative way for processing a dataset.
Basic DataFrame Operations: cache
The cache method stores the source DataFrame in memory using a columnar
format.
• It scans only the required columns and stores them in a compressed in-memory columnar format.
• Spark SQL automatically selects a compression codec for each column
based on data statistics.

The caching functionality can be tuned using the setConf method in the
SQLContext or HiveContext class.
The two configuration parameters for caching are
• spark.sql.inMemoryColumnarStorage.compressed
• and spark.sql.inMemoryColumnarStorage.batchSize.
By default, compression is turned on and the batch size for columnar caching is
10,000.
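
A sketch of caching and the two tuning parameters, using the hypothetical productDF from earlier:

// Tune columnar caching, then cache the DataFrame.
sqlContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", "true")
sqlContext.setConf("spark.sql.inMemoryColumnarStorage.batchSize", "10000")
productDF.cache()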
DataFrame columns and dtypes
The columns method returns the names of all the columns in the source
DataFrame as an array of String.

The dtypes method returns the data types of all the columns in the source
DataFrame as an array of tuples.
The first element in a tuple is the name of a column and the second element is
the data type of that column.
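
For example, with the hypothetical productDF from earlier:

val names = productDF.columns   // Array[String] of column names
val types = productDF.dtypes    // Array[(String, String)] of (name, type) pairs
types.foreach { case (col, dt) => println(s"$col: $dt") }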
explain, printSchema methods
The explain method prints the physical plan on the console. It is useful for
debugging.

The printSchema method prints the schema of the source DataFrame on the console in a tree format.
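
For example:

productDF.explain()      // prints the physical plan
productDF.printSchema()  // prints the schema as a tree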
registerTempTable, toDF methods
The registerTempTable method registers the source DataFrame as a temporary table with the given name.
• It takes a table name as an argument.
• A temporary table can be queried using the sql method in SQLContext or HiveContext, which returns the result as a DataFrame.
• It is available only during the lifespan of the application that creates it.

The toDF method allows you to rename the columns in the source DataFrame.
It takes the new column names as arguments and returns a new DataFrame.
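
A sketch of renaming the columns of the hypothetical productDF:

// Rename both columns and return a new DataFrame.
val renamedDF = productDF.toDF("product_name", "unit_price")
renamedDF.printSchema()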
Language-Integrated Query Methods: agg
The agg method is a commonly used language-integrated query method of the DataFrame class. It performs the specified aggregations on one or more columns in the source DataFrame and returns the result as a new DataFrame.
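
A sketch using built-in aggregate functions on the price column of the hypothetical productDF:

import org.apache.spark.sql.functions._

val priceStats = productDF.agg(min("price"), max("price"), avg("price"))
priceStats.show()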
Language-Integrated Query Methods: apply
The apply method takes the name of a column as an argument and returns the
specified column in the source DataFrame as an instance of the Column class.
• The Column class provides operators for manipulating a column in a
DataFrame.

Scala allows using productDF("price") instead of productDF.apply("price")


• It automatically converts productDF("price") to productDF.apply("price")
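
For example:

val priceCol = productDF("price")   // same as productDF.apply("price")
val discounted = productDF.select(productDF("name"), priceCol * 0.9)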
distinct
If a method or function expects an instance of the Column class as an argument, you can use the $"..." notation to select a column in a DataFrame.

Three equivalent ways of referring to a column are shown in the sketch below.

The distinct method returns a new DataFrame containing only the unique rows in the source DataFrame.
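
A sketch of the equivalent forms, plus distinct, using the hypothetical productDF and assuming import sqlContext.implicits._ for the $ notation:

// Three equivalent ways to reference the "price" column in a filter:
productDF.filter(productDF("price") > 3.0)
productDF.filter(productDF.apply("price") > 3.0)
productDF.filter($"price" > 3.0)

// distinct keeps only the unique rows.
val uniqueProducts = productDF.distinct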
cube
The cube method returns a cube for multi-dimensional analysis.
• It is useful for generating cross-tabular reports.
• Assume you have a dataset that tracks sales along three dimensions: time, product, and country.
• The cube method generates aggregates for all the possible combinations of the dimensions.
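
A sketch, assuming a hypothetical salesDF with time, product, country, and revenue columns:

import org.apache.spark.sql.functions._

// Aggregates for every combination of the three dimensions.
val salesCube = salesDF.cube("time", "product", "country").agg(sum("revenue"))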
explode
The explode method generates zero or more rows from a column using a user-provided function.
It takes three arguments:
• the input column,
• the output column, and
• a user-provided function generating one or more values for the output column for each value in the input column.

For example, consider a text column containing the contents of an email, where you want to split the email content into individual words and produce a row for each word in an email.
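
A hedged sketch of that example, assuming a hypothetical emailsDF with a text column named content:

// Produce one output row per word found in each email body.
val wordsDF = emailsDF.explode("content", "word") { text: String => text.split(" ") }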
filter
The filter method filters rows in the source DataFrame using a SQL expression
provided to it as an argument.
It returns a new DataFrame containing only the filtered rows.
The SQL expression can be passed as a string argument.
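
For example:

val affordable = productDF.filter("price < 5.0")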
groupBy
The groupBy method groups the rows in the source DataFrame using the
columns provided to it as arguments.
Aggregation can be performed on the grouped data returned by this method.
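
A sketch, again assuming the hypothetical salesDF:

val salesPerCountry = salesDF.groupBy("country").count()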
intersect
The intersect method takes a DataFrame as an argument and returns a new DataFrame containing only the rows that appear in both the input and the source DataFrame.
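
For example, with two hypothetical DataFrames that share the same schema:

val commonProducts = productDF.intersect(otherProductDF)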
join
The join method performs a SQL join of the source DataFrame with another DataFrame.

It takes three arguments: a DataFrame, a join expression, and a join type.
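
A sketch, assuming hypothetical ordersDF and customersDF that share a customer_id column:

val joinedDF = ordersDF.join(customersDF,
  ordersDF("customer_id") === customersDF("customer_id"),
  "left_outer")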
limit, orderBy
The limit method returns a DataFrame containing the specified number of rows from the source DataFrame.

The orderBy method returns a DataFrame sorted by the given columns. It takes
the names of one or more columns as arguments.
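
For example, using the hypothetical productDF:

val cheapestFive = productDF.orderBy("price").limit(5)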
randomSplit, sample
The randomSplit method splits the source DataFrame into multiple
DataFrames. It takes an array of weights as argument and returns an array of
DataFrames. It is a useful method for machine learning, where you want
to split the raw dataset into training, validation and test datasets.

The sample method returns a DataFrame containing the specified fraction of the rows in the source DataFrame.
It takes two arguments.
• The first argument is a Boolean value indicating whether sampling should be
done with replacement.
• The second argument specifies the fraction of the rows that should be
returned.
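
A sketch of both methods on the hypothetical productDF:

// 60/20/20 split for training, validation, and test sets.
val Array(trainDF, validationDF, testDF) = productDF.randomSplit(Array(0.6, 0.2, 0.2))

// 10% sample without replacement.
val sampledDF = productDF.sample(withReplacement = false, fraction = 0.1)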
rollup
The rollup method takes the names of one or more columns as arguments and returns a multi-dimensional rollup.

It is useful for subaggregation along a hierarchical dimension such as geography or time.
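
A sketch of a rollup along a time hierarchy, assuming a hypothetical salesDF with year, month, day, and revenue columns:

import org.apache.spark.sql.functions._

// Subtotals at the year, (year, month), and (year, month, day) levels.
val salesRollup = salesDF.rollup("year", "month", "day").agg(sum("revenue"))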
select
The select method returns a DataFrame containing only the specified columns from the source DataFrame.

A variant of the select method allows one or more Column expressions as arguments.
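
For example, on the hypothetical productDF:

val namesAndPrices = productDF.select("name", "price")
val discountedPrices = productDF.select($"name", $"price" * 0.9)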
selectExpr
The selectExpr method accepts one or more SQL expressions as arguments and returns a DataFrame generated by executing the specified SQL expressions.
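
For example:

val withTax = productDF.selectExpr("name", "price * 1.18 AS price_with_tax")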
withColumn
The withColumn method adds a new column to or replaces an
existing column in the source DataFrame and returns a new
DataFrame.

It takes two arguments:
• the name of the new column
• an expression for generating the values of the new column.
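
For example:

val taxedDF = productDF.withColumn("price_with_tax", $"price" * 1.18)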
RDD Operations
The DataFrame class supports commonly used RDD operations such as map, flatMap, foreach, foreachPartition, mapPartitions, coalesce, and repartition.
• These methods work similarly to the operations in the RDD class.
• If you need access to other RDD methods that are not present in the DataFrame class, you can get an RDD from a DataFrame.
RDD Operations
Fields in a Row can also be extracted using Scala pattern matching.
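
A sketch of dropping to the underlying RDD and extracting fields with pattern matching, assuming the productDF schema used in the earlier sketches:

import org.apache.spark.sql.Row

// Get the underlying RDD[Row] and pattern match on each Row.
val namePricePairs = productDF.rdd.map { case Row(name: String, price: Double) => (name, price) }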
DataFrame Actions
Similar to the RDD actions, the action methods in the DataFrame
class return results to the Driver program.

collect
The collect method returns the data in a DataFrame as an array of
Rows.

count
The count method returns the number of rows in the source
DataFrame.
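
For example:

val allRows = productDF.collect()   // Array[Row] gathered on the driver
val rowCount = productDF.count()    // number of rows as a Long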
DataFrame Actions: describe
The describe method can be used for exploratory data analysis.
• It returns summary statistics for numeric columns in the
source DataFrame.
• The summary statistics include min, max, count, mean, and standard deviation.
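
For example:

productDF.describe("price").show()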
DataFrame Actions: first, show, take
The first method returns the first row in the source DataFrame.

The show method displays the rows in the source DataFrame on the driver console in a tabular format.
It optionally displays the top N rows; by default, it shows the top 20.

The take method takes an integer N as an argument and returns the first N rows from the source DataFrame as an array of Rows.
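
For example:

val firstRow = productDF.first()   // a single Row
productDF.show(5)                  // display the top 5 rows
val firstFive = productDF.take(5)  // Array of the first 5 Rows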
Saving a DataFrame
Spark SQL provides a unified interface for saving a DataFrame to a variety of data sources.

The same interface can be used to write data to relational databases, NoSQL data stores, and a variety of file formats.

The DataFrameWriter class defines the interface for writing data to a data source.
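
A sketch of the unified write interface; the output paths are placeholders:

// Write the same DataFrame to different formats through DataFrameWriter.
productDF.write.parquet("path/to/output/products.parquet")
productDF.write.format("json").save("path/to/output/products.json")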
SparkSQL Built-in Functions
Spark SQL comes with a comprehensive list of built-in functions, which are
optimized for fast execution.
• The built-in functions can be used from both the DataFrame API and SQL
interface.
• To use Spark’s built-in functions from the DataFrame API, you need to add
the following import statement to your source code.
import org.apache.spark.sql.functions._

The built-in functions can be classified into the following categories:
• aggregate,
• collection,
• date/time,
• math,
• string,
• window, and
• miscellaneous functions.
Aggregate
The aggregate functions can be used to perform aggregations on a column.

The built-in aggregate functions include
• approxCountDistinct,
• avg,
• count,
• countDistinct,
• first,
• last,
• max,
• mean,
• min,
• sum, and
• sumDistinct.
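
A sketch combining groupBy with several built-in aggregate functions, assuming the hypothetical salesDF from earlier:

import org.apache.spark.sql.functions._

val summary = salesDF.groupBy("country")
  .agg(sum("revenue"), avg("revenue"), countDistinct("product"))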
Collection, Date/Time functions
The collection functions operate on columns containing a collection of elements.
The built-in collection functions include array_contains, explode, size, and sort_array.

The date/time functions make it easy to process columns containing date/time values.
These functions can be further sub-classified into the following categories: conversion, extraction, arithmetic, and miscellaneous functions.
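
A hedged sketch, assuming a hypothetical DataFrame df with an array column tags and a timestamp column created_at:

import org.apache.spark.sql.functions._

val tagsInfo = df.select(size($"tags"), sort_array($"tags"))
val dates    = df.select(year($"created_at"), weekofyear($"created_at"))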
Conversion, Field Extraction, Arithmetic
The conversion functions convert date/time values from one format to another.
For example, you can convert a timestamp string in yyyy-MM-dd HH:mm:ss
format to a Unix epoch value using the unix_timestamp function.
• The built-in conversion functions include unix_timestamp, from_unixtime,
to_date, quarter, day, dayofyear, weekofyear, from_utc_timestamp, and
to_utc_timestamp.

The field extraction functions allow you to extract year, month, day, hour,
minute, and second from a Date/Time value.
• The built-in field extraction functions include year, quarter, month, weekofyear,
dayofyear, dayofmonth, hour, minute, and second.

The arithmetic functions allow you to perform arithmetic operations on columns containing dates. For example, you can calculate the difference between two dates, add days to a date, or subtract days from a date.
• The built-in date arithmetic functions include datediff, date_add, date_sub,
add_months, last_day, next_day, and months_between.
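
A sketch, assuming a hypothetical ordersDF with order_date and ship_date columns:

import org.apache.spark.sql.functions._

val shipping = ordersDF.select(
  datediff($"ship_date", $"order_date"),   // days between the two dates
  date_add($"order_date", 7),              // a week after the order
  month($"order_date"))                    // extracted month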
Miscellaneous functions
Spark SQL provides a few other useful date- and time-related functions:
• current_date, current_timestamp, trunc, date_format.

The math functions operate on columns containing numerical values. Spark SQL comes with a long list of built-in math functions.
• abs, ceil, cos, exp, factorial, floor, hex, hypot, log, log10, pow, round, shiftLeft, sin, sqrt, tan, and other commonly used math functions.

The string functions: Spark SQL provides a variety of built-in functions for processing columns that contain string values.
• The built-in string functions include ascii, base64, concat, concat_ws, decode, encode, format_number, format_string, get_json_object, initcap, instr, length, levenshtein, locate, lower, lpad, ltrim, printf, regexp_extract, regexp_replace, repeat, reverse, rpad, rtrim, soundex, space, split, substring, substring_index, translate, trim, unbase64, upper, and other commonly used string functions.

Spark SQL supports window functions for analytics. A window function performs a calculation across a set of rows that are related to the current row.
• The built-in window functions provided by Spark SQL include cumeDist, denseRank, lag, lead, ntile, percentRank, rank, and rowNumber.
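
A sketch of a window function that ranks products by price within each country, assuming a DataFrame with those columns and the implicits import:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val byCountry = Window.partitionBy("country").orderBy($"price".desc)
val ranked = productDF.withColumn("price_rank", rank().over(byCountry))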
Interactive Analysis Example
Launch the Spark shell from a terminal:
path/to/spark/bin/spark-shell --master local[*]

To use classes and functions from the Spark SQL library, add an import statement:
import org.apache.spark.sql._

Create a DataFrame from a dataset:
val biz = sqlContext.read.json("path/to/yelp_academic_dataset_business.json")
Language-Integrated Query vs SQL

The query uses the filter method to select the businesses that have an average rating of 5.0.

You could also have written the language-integrated query version, as sketched below:
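
A minimal sketch of the two forms, assuming the Yelp business dataset exposes a stars column:

// Filter with a SQL expression passed as a string:
val fiveStar = biz.filter("stars = 5.0")

// Equivalent language-integrated version using a Column expression:
val fiveStarLiq = biz.filter(biz("stars") === 5.0)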
