Fall 2019 Spark SQL MC

This document provides information on loading and manipulating structured data in Spark SQL using DataFrames and Datasets. It discusses how to load data from files in various formats, such as CSV, JSON, and Parquet, using Spark SQL or the DataFrame/Dataset API. It also describes operations that can be performed on DataFrames, such as filtering, aggregation, and joining. Additionally, it covers writing data back to files or saving it as Hive tables in different formats.


Features

• Spark module for structured data processing
• The unit of processing is the Dataset or DataFrame
• A DataFrame is a Dataset organized into named columns (see the sketch below)
• Data can be read and manipulated using SQL
• Ability to read Hive data as well
• High performance is achieved through the Catalyst optimizer
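A minimal sketch of the entry point these slides assume, a SparkSession created in Scala; the application name and sample data are placeholders:

import org.apache.spark.sql.SparkSession

// The SparkSession is the entry point for Spark SQL
val spark = SparkSession.builder()
  .appName("spark-sql-demo")   // hypothetical application name
  .master("local[*]")          // local run, for illustration only
  .getOrCreate()

import spark.implicits._

// A DataFrame is a Dataset[Row], i.e. a Dataset with named columns
val df = Seq((1, "North", 100.0), (2, "South", 250.0)).toDF("id", "region", "amount")
df.printSchema()
df.show()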
Load Data

val salesRecords =
spark.read.load("/Users/hadoop-user/Documents/SalesJan2009.parquet")

• This loads a Parquet file by default
• The default format is specified by the configuration property
“spark.sql.sources.default”
• The default can be overridden by setting the property on the
configuration object of the SparkSession:
spark.conf.set("spark.sql.sources.default", "csv")
Load Data
• An easier way to load data in formats other than Parquet is to specify the
format explicitly:
val salesRecords = spark
  .read
  .format("csv")
  .load("/Users/hadoop-user/Documents/SalesJan2009.csv")
• The built-in formats are json, parquet, jdbc, orc, csv and text
• Formats are data sources; in general a data source is referred to by its
fully qualified name, such as “org.apache.spark.sql.parquet”
• The out-of-the-box data sources also have short names, like the ones listed
above (e.g. jdbc); see the sketch below
https://spark.apache.org/docs/2.3.0/api/scala/#package
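As a sketch of the naming rule above, the two calls below are equivalent for the built-in Parquet source (the path is a placeholder):

// Built-in data sources can be referenced by their short name...
val df1 = spark.read.format("parquet").load("/path/to/data")
// ...or by the fully qualified name mentioned in these slides
val df2 = spark.read.format("org.apache.spark.sql.parquet").load("/path/to/data")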
Load Options
Data sources have their own options that can be specified during the load:
val salesRecords = spark.read.format("csv")
  .option("sep", ";")
  .option("inferSchema", "true")
  .option("header", "true")
  .load("/Users/hadoop-user/Documents/SalesJan2009.csv")
Note: when reading a CSV file, the first row is taken as the header if the
header option is set to true, as shown in the example above
Reading Parquet Files
• Parquet files can be read using:
spark.read.parquet("path")
• If the base location of the table is specified as the path, the partitions are
automatically discovered
• Partition columns of numeric, date, timestamp and string types are
automatically inferred
• These columns are inferred because the property
“spark.sql.sources.partitionColumnTypeInference.enabled” is set to true
by default
• If that property is set to false, all partition columns are read as strings
• If different partitions have different schemas, Spark can merge them if the
option “mergeSchema” is set to true
• Spark caches Hive metadata by default, so the metadata needs to be refreshed
if there is a chance it has been changed from outside (see the sketch below)
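A minimal sketch of a partition-aware Parquet read with schema merging and a metadata refresh; the path and table name are hypothetical:

// Read a partitioned table from its base path; partitions such as
// .../year=2019/month=01 are discovered automatically
val sales = spark.read
  .option("mergeSchema", "true")   // merge differing partition schemas
  .parquet("/Users/hadoop-user/Documents/sales_parquet")

// If the underlying files of a table were changed outside Spark,
// refresh the cached metadata for that table
spark.catalog.refreshTable("sales_table")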
Other Data Sources

• ORC
• JSON
• Avro
• JDBC
• Text
For self-describing sources such as JSON, Spark imports the schema by itself
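A minimal sketch of reading a JSON file and letting Spark infer the schema; the path is a placeholder:

// Each line of the input is expected to be a separate JSON object
val people = spark.read.json("/Users/hadoop-user/Documents/people.json")
people.printSchema()   // the schema has been inferred from the data itself
people.show()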
Dataframe Operations
• alias(String alias)
• apply(String columnName)
• cache (default level: MEMORY_AND_DISK)
• coalesce(int numPartitions)
• col(String colName)
• collect
• collectAsList
• columns
• count
• createGlobalTempView
• createOrReplaceGlobalTempView
• createTempView
• createOrReplaceTempView
Dataframe Operations

• distinct
• drop(Column col/String colName)
• drop(String.. colNames)
• dropDuplicates
• dropDuplicates(String[] colNames)
• except(Dataset other)
• explain
• filter(Column condition) // df.filter($"id" > 100) and df.filter("id > 100")
• filter(function)
• first
• foreach(function)
• foreachPartition(function)
Dataframe Operations

• rdd
• groupBy(Column col/String... cols)
• head
• intersect(Dataset other)
• join(Dataset other)
• union
• limit(n)
• map
• mapPartitions
• orderBy
• persist //with & without StorageLevel
• unpersist
Dataframe Operations

• repartition(numPartitions)
• select
• where
• withColumn
• show
• sort
• sparkSession
• take
• takeAsList
• toJSON
• write
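A minimal sketch chaining several of the operations listed above; the column names are assumptions about the sales CSV used in these slides:

// Requires: import spark.implicits._ for the $"..." column syntax
val sales = spark.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/Users/hadoop-user/Documents/SalesJan2009.csv")

// select, filter, groupBy and orderBy each return a new DataFrame
val topCountries = sales
  .select($"Country", $"Price")   // hypothetical column names
  .filter($"Price" > 1200)
  .groupBy($"Country")
  .count()
  .orderBy($"count".desc)

topCountries.show(10)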
Select
Every “.” returns a new DataFrame
Filter
alias

Apply

apply returns a Column object; it can be invoked without writing apply explicitly, e.g. df("colName") instead of df.apply("colName")


Collect

collect returns an array of Row objects (gathered on the driver)


Count
Distinct
Drop Duplicate
Filter function

The filter(function) overload is used when the condition is complicated and cannot be
expressed with the typical SQL-style expressions; in that case we write a filter function instead
First
Take and Head
Intersection
union
OrderBy
Convert to JSON
CreateOrReplaceTempView
createOrReplaceTempView vs createOrReplaceGlobalTempView

A global view is available to all sessions and remains available even after the current session is closed (see the sketch below)

createGlobalTempView: creates a new global view; an error is raised if the view already exists
createOrReplaceGlobalTempView: overwrites the existing global view if it exists
createTempView: creates a new view; an error is raised if the view already exists
createOrReplaceTempView: overwrites the existing view if it exists
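A minimal sketch of the two kinds of views; the view names and the Price column are hypothetical:

// Session-scoped view: visible only within this SparkSession
salesRecords.createOrReplaceTempView("sales")
spark.sql("SELECT * FROM sales WHERE Price > 1200").show()

// Application-scoped view: visible to all sessions via the global_temp database
salesRecords.createOrReplaceGlobalTempView("sales_global")
spark.newSession().sql("SELECT count(*) FROM global_temp.sales_global").show()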
groupBy
Count
Run SQL on Files

Instead of loading a file into a Dataset and selecting columns, we may run
SQL directly on the file:

val salesRecords =
  spark.sql("SELECT * FROM csv.`/Users/hdpuser/Documents/SalesJan2009.csv`")

Here the SQL is applied to the file directly, not to a DataFrame
spark.sql

Everything happens in memory; there is no separate database. Spark effectively works as an
in-memory database, so most things you would do in a database can be done here too.
Write Data

• The write functions are quite similar to the load functions


• The following will write the contents of the “salesRecords” dataset in the default
format:
salesRecords.write.save("/Users/hadoop-user/Documents/output")
• To write in a specific format, either change the default format or specify the
format, like:
salesRecords.write
  .format("csv")
  .save("/Users/hadoop-user/Documents/output")
• As with read, options applicable to the specific data source can be specified:
salesRecords.write
  .format("csv")
  .option("header", true)
  .save("/Users/hadoop-user/Documents/output")
The DataFrameWriter is obtained with .write and the data is written with .save.
When reading, the reader is called on the session object (spark.read.load); here the write is
done on the DataFrame, not on the SparkSession.
The writer object is derived from the DataFrame; the reader object is derived from the
SparkSession.
Two partitions mean two tasks and two output files (see the sketch below).
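A minimal sketch of how the number of partitions maps to output files; the output paths are placeholders:

// Two partitions -> two write tasks -> two part files in the output directory
salesRecords.repartition(2)
  .write
  .format("csv")
  .option("header", true)
  .save("/Users/hadoop-user/Documents/output_two_files")

// Coalescing to one partition produces a single part file
salesRecords.coalesce(1)
  .write
  .format("csv")
  .option("header", true)
  .save("/Users/hadoop-user/Documents/output_single_file")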
Save Mode

• SaveMode.ErrorIfExists
• SaveMode.Append
• SaveMode.Overwrite
• SaveMode.Ignore
SaveMode.Append
SaveMode.Overwrite
SaveMode.Ignore
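A minimal sketch of setting the save mode on the writer; the path is a placeholder:

import org.apache.spark.sql.SaveMode

// ErrorIfExists is the default; here we append instead of failing
salesRecords.write
  .format("csv")
  .mode(SaveMode.Append)   // alternatives: Overwrite, Ignore, ErrorIfExists
  .save("/Users/hadoop-user/Documents/output")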
Read & Write Hive Tables
• Hive support needs to be enabled on the SparkSession
• The Hive warehouse directory needs to be set via “spark.sql.warehouse.dir”
• Once the session is created, SQL statements can be issued using
sparkSession.sql("<sql_statement>")
• Sorting and partitioning can be done on the tables being saved (see the sketches below)
Note: with the embedded Derby metastore the warehouse directory comes from
“spark.sql.warehouse.dir”; the corresponding Hive metastore setting is “hive.metastore.warehouse.dir”
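A minimal sketch of creating a Hive-enabled session; the application name and warehouse path are placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-demo")                                               // hypothetical name
  .config("spark.sql.warehouse.dir", "/Users/hadoop-user/warehouse")  // placeholder path
  .enableHiveSupport()
  .getOrCreate()

// SQL statements can now be issued against Hive tables
spark.sql("SHOW TABLES").show()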
Save as Hive Table

• Dataframes can be saved as Hive tables using the saveAsTable function
• If no Hive metastore exists, a table is created in the default Derby database
• Even if the SparkSession is closed, the table metadata is retained as long as
the Derby session remains active
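A minimal sketch of saving a DataFrame as a partitioned, bucketed Hive table; the table and column names are assumptions:

// Partition, bucket and sort the table while saving it to the warehouse
salesRecords.write
  .format("parquet")
  .partitionBy("Country")        // hypothetical column
  .bucketBy(4, "Payment_Type")   // hypothetical column
  .sortBy("Price")               // hypothetical column; sortBy requires bucketBy
  .mode("overwrite")
  .saveAsTable("sales_table")    // hypothetical table name

// The saved table can then be queried with SQL
spark.sql("SELECT * FROM sales_table LIMIT 10").show()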
RDD to Dataframe

An RDD can be converted to a DataFrame using a Java bean class as follows:
val df = sparkSession.createDataFrame(rdd, beanClass)
Convert RDD to DF
When the RDD comes from a CSV file, we need to remove the first line, which is the header.
Because the conversion depends on the SparkSession object, putting this code at the top (before
the session exists) gives an error; it should run after the SparkSession has been created.
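In Scala the same conversion is often done with a case class rather than a Java bean; a minimal sketch, with the split positions assumed from the sales CSV:

case class Sale(product: String, price: Double, country: String)   // hypothetical fields

val rdd = spark.sparkContext.textFile("/Users/hadoop-user/Documents/SalesJan2009.csv")
val header = rdd.first()             // the first line is the header
val rows = rdd.filter(_ != header)   // remove the header line

import spark.implicits._             // only valid after the SparkSession exists

val salesDF = rows
  .map(_.split(","))
  .map(f => Sale(f(1), f(2).toDouble, f(7)))   // hypothetical column positions
  .toDF()

salesDF.printSchema()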
withColumn Example
Run without withColumn
lit creates a Column object out of a literal value; withColumn adds a new column
current_date gives the current date, so if the same code is run tomorrow the value would be 2019-04-13

Why do we need the current date? Because we need it to partition the data (see the sketch below)
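A minimal sketch of withColumn with lit and current_date, followed by a date-partitioned write; the column names and output path are hypothetical:

import org.apache.spark.sql.functions.{current_date, lit}

// lit wraps a literal value in a Column; current_date adds today's date
val withLoadInfo = salesRecords
  .withColumn("source", lit("SalesJan2009"))   // constant column from a literal
  .withColumn("load_date", current_date())     // new column with the current date

// The date column can then be used to partition the output
withLoadInfo.write
  .partitionBy("load_date")
  .mode("overwrite")
  .parquet("/Users/hadoop-user/Documents/sales_partitioned")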

You might also like