Fall 2019 Spark SQL MC

This document provides information on loading and manipulating structured data in Spark SQL using DataFrames and Datasets. It discusses how to load data from files in various formats, such as CSV, JSON, and Parquet, using Spark SQL or the DataFrame/Dataset API. It also describes operations that can be performed on DataFrames, such as filtering, aggregation, and joining. Additionally, it covers writing data back to files or saving it as Hive tables in different formats.


Features

• Spark module for structured data processing
• The unit of processing is the Dataset or DataFrame
• A DataFrame is a Dataset organized into named columns (see the sketch below)
• Data can be read and manipulated using SQL
• Ability to read Hive data as well
• High performance is achieved through the Catalyst optimizer
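A minimal sketch of the entry point these slides assume, a SparkSession created in Scala; the application name and sample data are placeholders:

import org.apache.spark.sql.SparkSession

// The SparkSession is the entry point for Spark SQL
val spark = SparkSession.builder()
  .appName("spark-sql-demo")   // hypothetical application name
  .master("local[*]")          // local run, for illustration only
  .getOrCreate()

import spark.implicits._

// A DataFrame is a Dataset[Row], i.e. a Dataset with named columns
val df = Seq((1, "North", 100.0), (2, "South", 250.0)).toDF("id", "region", "amount")
df.printSchema()
df.show()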
Load Data

val salesRecords =
spark.read.load("/Users/hadoop-user/Documents/SalesJan2009.parquet")

• This loads a Parquet file by default
• The default format is specified by the configuration property
“spark.sql.sources.default”
• The default can be overridden by setting the property on the
configuration object of the SparkSession:
spark.conf.set("spark.sql.sources.default", "csv")
Load Data
• An easier way to load data in formats other than Parquet is to specify the
format explicitly:
val salesRecords = spark
  .read
  .format("csv")
  .load("/Users/hadoop-user/Documents/SalesJan2009.csv")
• The built-in formats are json, parquet, jdbc, orc, csv and text
• Formats are data sources; in general a data source is referred to by its
fully qualified name, such as “org.apache.spark.sql.parquet”
• The out-of-the-box data sources also have short names, like the ones listed
above (e.g. jdbc); see the sketch below
https://spark.apache.org/docs/2.3.0/api/scala/#package
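As a sketch of the naming rule above, the two calls below are equivalent for the built-in Parquet source (the path is a placeholder):

// Built-in data sources can be referenced by their short name...
val df1 = spark.read.format("parquet").load("/path/to/data")
// ...or by the fully qualified name mentioned in these slides
val df2 = spark.read.format("org.apache.spark.sql.parquet").load("/path/to/data")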
Load Options
Data sources have their own options that can be specified during the load:
val salesRecords = spark.read.format("csv")
  .option("sep", ";")
  .option("inferSchema", "true")
  .option("header", "true")
  .load("/Users/hadoop-user/Documents/SalesJan2009.csv")
Note: when reading a CSV file, the first row is taken as the header if the
header option is set to true, as shown in the example above
Reading Parquet Files
• Parquet files can be read using:
spark.read.parquet("path")
• If the base location of the table is specified as the path, the partitions are
automatically discovered
• Partition columns of numeric, date, timestamp and string types are
automatically inferred
• These columns are inferred because the property
“spark.sql.sources.partitionColumnTypeInference.enabled” is set to true
by default
• If that property is set to false, all partition columns are read as strings
• If different partitions have different schemas, Spark can merge them if the
option “mergeSchema” is set to true
• Spark caches Hive metadata by default, so the metadata needs to be refreshed
if there is a chance it has been changed from outside (see the sketch below)
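A minimal sketch of a partition-aware Parquet read with schema merging and a metadata refresh; the path and table name are hypothetical:

// Read a partitioned table from its base path; partitions such as
// .../year=2019/month=01 are discovered automatically
val sales = spark.read
  .option("mergeSchema", "true")   // merge differing partition schemas
  .parquet("/Users/hadoop-user/Documents/sales_parquet")

// If the underlying files of a table were changed outside Spark,
// refresh the cached metadata for that table
spark.catalog.refreshTable("sales_table")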
Other Data Sources

• ORC
• JSON
• Avro
• JDBC
• Text
For self-describing sources such as JSON, Spark imports the schema by itself
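A minimal sketch of reading a JSON file and letting Spark infer the schema; the path is a placeholder:

// Each line of the input is expected to be a separate JSON object
val people = spark.read.json("/Users/hadoop-user/Documents/people.json")
people.printSchema()   // the schema has been inferred from the data itself
people.show()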
Dataframe Operations
• alias(String alias)
• apply(String columnName)
• cache (default level: MEMORY_AND_DISK)
• coalesce(int numPartitions)
• col(String colName)
• collect
• collectAsList
• columns
• count
• createGlobalTempView
• createOrReplaceGlobalTempView
• createTempView
• createOrReplaceTempView
Dataframe Operations

• distinct
• drop(Column col/String colName)
• drop(String.. colNames)
• dropDuplicates
• dropDuplicates(String[] colNames)
• except(Dataset other)
• explain
• filter(Column condition) // df.filter($"id" > 100) and df.filter("id > 100")
• filter(function)
• first
• foreach(function)
• foreachPartition(function)
Dataframe Operations

• rdd
• groupBy(Column col/String... cols)
• head
• intersect(Dataset other)
• join(Dataset other)
• union
• limit(n)
• map
• mapPartitions
• orderBy
• persist //with & without StorageLevel
• unpersist
Dataframe Operations

• repartition(numPartitions)
• select
• where
• withColumn
• show
• sort
• sparkSession
• take
• takeAsList
• toJSON
• write
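A minimal sketch chaining several of the operations listed above; the column names are assumptions about the sales CSV used in these slides:

// Requires: import spark.implicits._ for the $"..." column syntax
val sales = spark.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/Users/hadoop-user/Documents/SalesJan2009.csv")

// select, filter, groupBy and orderBy each return a new DataFrame
val topCountries = sales
  .select($"Country", $"Price")   // hypothetical column names
  .filter($"Price" > 1200)
  .groupBy($"Country")
  .count()
  .orderBy($"count".desc)

topCountries.show(10)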
Select
Every “.” returns a new DataFrame
Filter
alias

Apply

apply returns a Column object; it can be invoked without writing apply explicitly, e.g. df("colName") instead of df.apply("colName")


Collect

collect returns an array of Row objects (gathered on the driver)


Count
Distinct
Drop Duplicate
Filter function

The filter(function) overload is used when the condition is complicated and cannot be
expressed with the typical SQL-style expressions; in that case we write a filter function instead
First
Take and Head
Intersection
union
OrderBy
Convert to JSON
CreateOrReplaceTempView
createOrReplaceTempView vs createOrReplaceGlobalTempView

A global view is available to all sessions and remains available even after the current session is closed (see the sketch below)

createGlobalTempView: creates a new global view; an error is raised if the view already exists
createOrReplaceGlobalTempView: overwrites the existing global view if it exists
createTempView: creates a new view; an error is raised if the view already exists
createOrReplaceTempView: overwrites the existing view if it exists
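A minimal sketch of the two kinds of views; the view names and the Price column are hypothetical:

// Session-scoped view: visible only within this SparkSession
salesRecords.createOrReplaceTempView("sales")
spark.sql("SELECT * FROM sales WHERE Price > 1200").show()

// Application-scoped view: visible to all sessions via the global_temp database
salesRecords.createOrReplaceGlobalTempView("sales_global")
spark.newSession().sql("SELECT count(*) FROM global_temp.sales_global").show()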
groupBy
Count
Run SQL on Files

Instead of loading a file into a Dataset and selecting columns, we may run
SQL directly on the file:

val salesRecords =
  spark.sql("SELECT * FROM csv.`/Users/hdpuser/Documents/SalesJan2009.csv`")

Here the SQL is applied to the file directly, not to a DataFrame
spark.sql

Everything happens in memory; there is no separate database. Spark effectively works as an
in-memory database, so most things you would do in a database can be done here too.
Write Data

• The write functions are quite similar to the load functions


• The following will write the contents of the “salesRecords” dataset in the default
format:
salesRecords.write.save("/Users/hadoop-user/Documents/output")
• To write in a specific format, either change the default format or specify the
format, like:
salesRecords.write
  .format("csv")
  .save("/Users/hadoop-user/Documents/output")
• As with read, options applicable to the specific data source can be specified:
salesRecords.write
  .format("csv")
  .option("header", true)
  .save("/Users/hadoop-user/Documents/output")
The DataFrameWriter is obtained with .write and the data is written with .save.
When reading, the reader is called on the session object (spark.read.load); here the write is
done on the DataFrame, not on the SparkSession.
The writer object is derived from the DataFrame; the reader object is derived from the
SparkSession.
Two partitions mean two tasks and two output files (see the sketch below).
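A minimal sketch of how the number of partitions maps to output files; the output paths are placeholders:

// Two partitions -> two write tasks -> two part files in the output directory
salesRecords.repartition(2)
  .write
  .format("csv")
  .option("header", true)
  .save("/Users/hadoop-user/Documents/output_two_files")

// Coalescing to one partition produces a single part file
salesRecords.coalesce(1)
  .write
  .format("csv")
  .option("header", true)
  .save("/Users/hadoop-user/Documents/output_single_file")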
Save Mode

• SaveMode.ErrorIfExists
• SaveMode.Append
• SaveMode.Overwrite
• SaveMode.Ignore
SaveMode.Append
SaveMode.Overwrite
SaveMode.Ignore
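A minimal sketch of setting the save mode on the writer; the path is a placeholder:

import org.apache.spark.sql.SaveMode

// ErrorIfExists is the default; here we append instead of failing
salesRecords.write
  .format("csv")
  .mode(SaveMode.Append)   // alternatives: Overwrite, Ignore, ErrorIfExists
  .save("/Users/hadoop-user/Documents/output")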
Read & Write Hive Tables
• Hive support needs to be enabled on the SparkSession
• The Hive warehouse directory needs to be set via “spark.sql.warehouse.dir”
• Once the session is created, SQL statements can be issued using
sparkSession.sql("<sql_statement>")
• Sorting and partitioning can be done on the tables being saved (see the sketches below)
Note: with the embedded Derby metastore the warehouse directory comes from
“spark.sql.warehouse.dir”; the corresponding Hive metastore setting is “hive.metastore.warehouse.dir”
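A minimal sketch of creating a Hive-enabled session; the application name and warehouse path are placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-demo")                                               // hypothetical name
  .config("spark.sql.warehouse.dir", "/Users/hadoop-user/warehouse")  // placeholder path
  .enableHiveSupport()
  .getOrCreate()

// SQL statements can now be issued against Hive tables
spark.sql("SHOW TABLES").show()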
Save as Hive Table

• Dataframes can be saved as Hive tables using the saveAsTable function
• If no Hive metastore exists, a table is created in the default Derby database
• Even if the SparkSession is closed, the table metadata is retained as long as
the Derby session remains active
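A minimal sketch of saving a DataFrame as a partitioned, bucketed Hive table; the table and column names are assumptions:

// Partition, bucket and sort the table while saving it to the warehouse
salesRecords.write
  .format("parquet")
  .partitionBy("Country")        // hypothetical column
  .bucketBy(4, "Payment_Type")   // hypothetical column
  .sortBy("Price")               // hypothetical column; sortBy requires bucketBy
  .mode("overwrite")
  .saveAsTable("sales_table")    // hypothetical table name

// The saved table can then be queried with SQL
spark.sql("SELECT * FROM sales_table LIMIT 10").show()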
RDD to Dataframe

An RDD can be converted to a DataFrame using a Java bean class as follows:
val df = sparkSession.createDataFrame(rdd, beanClass)
Convert RDD to DF
When the RDD comes from a CSV file, we need to remove the first line, which is the header.
Because the conversion depends on the SparkSession object, putting this code at the top (before
the session exists) gives an error; it should run after the SparkSession has been created.
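In Scala the same conversion is often done with a case class rather than a Java bean; a minimal sketch, with the split positions assumed from the sales CSV:

case class Sale(product: String, price: Double, country: String)   // hypothetical fields

val rdd = spark.sparkContext.textFile("/Users/hadoop-user/Documents/SalesJan2009.csv")
val header = rdd.first()             // the first line is the header
val rows = rdd.filter(_ != header)   // remove the header line

import spark.implicits._             // only valid after the SparkSession exists

val salesDF = rows
  .map(_.split(","))
  .map(f => Sale(f(1), f(2).toDouble, f(7)))   // hypothetical column positions
  .toDF()

salesDF.printSchema()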
withColumn Example
Run without withColumn
lit creates a Column object out of a literal value; withColumn adds a new column
current_date gives the current date, so if the same code is run tomorrow the value would be 2019-04-13

Why do we need the current date? Because we need it to partition the data (see the sketch below)
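A minimal sketch of withColumn with lit and current_date, followed by a date-partitioned write; the column names and output path are hypothetical:

import org.apache.spark.sql.functions.{current_date, lit}

// lit wraps a literal value in a Column; current_date adds today's date
val withLoadInfo = salesRecords
  .withColumn("source", lit("SalesJan2009"))   // constant column from a literal
  .withColumn("load_date", current_date())     // new column with the current date

// The date column can then be used to partition the output
withLoadInfo.write
  .partitionBy("load_date")
  .mode("overwrite")
  .parquet("/Users/hadoop-user/Documents/sales_partitioned")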

You might also like