Fall209 Spark SQL MC
val salesRecords =
spark.read.load("/Users/hadoop-user/Documents/SalesJan2009.parquet")
• ORC
• JSON
• Avro
• JDBC
• Text
It has inferred the schema by itself (Parquet files embed their schema)
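A sketch of reading the other listed sources; the paths and connection details here are hypothetical, and Avro/JDBC additionally require the spark-avro package and a matching JDBC driver on the classpath:

val orcDF  = spark.read.orc("/tmp/sales.orc")
val jsonDF = spark.read.json("/tmp/sales.json")
val textDF = spark.read.text("/tmp/sales.txt")
val avroDF = spark.read.format("avro").load("/tmp/sales.avro")
val jdbcDF = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://localhost/sales")
  .option("dbtable", "transactions")
  .option("user", "hadoop-user")
  .load()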
DataFrame Operations
• alias(String alias)
• apply(String columnName)
• cache (default level: MEMORY_AND_DISK)
• coalesce(int numPartitions)
• col(String colName)
• collect
• collectAsList
• columns
• count
• createGlobalTempView
• createOrReplaceGlobalTempView
• createTempView
• createOrReplaceTempView
• distinct
• drop(Column col/String colName)
• drop(String... colNames)
• dropDuplicates
• dropDuplicates(String[] colNames)
• except(Dataset other)
• explain
• filter(Column condition) // df.filter($"id" > 100) & df.filter("id > 100")
• filter(function)
• first
• foreach(function)
• foreachPartition(function)
• rdd
• groupBy(Column col/String... cols)
• head
• intersect(Dataset other)
• join(Dataset other)
• union
• limit(n)
• map
• mapPartitions
• orderBy
• persist //with & without StorageLevel
• unpersist
• repartition(numPartitions)
• select
• where
• withColumn
• show
• sort
• sparkSession
• take
• takeAsList
• toJSON
• write
Select
Every "." returns a new DataFrame, so operations can be chained.
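A minimal sketch of chaining, assuming salesRecords from the earlier load and hypothetical column names "Product" and "Price":

import spark.implicits._
val result = salesRecords
  .select($"Product", $"Price")  // new DataFrame
  .filter($"Price" > 100)        // another new DataFrame
  .limit(10)                     // and another
result.show()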
Filter
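Both filter variants in a sketch ("Price" is a hypothetical column):

salesRecords.filter($"Price" > 100).show()  // Column condition
salesRecords.filter("Price > 100").show()   // SQL expression string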
alias
Apply
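A sketch of alias and apply; apply(columnName) returns the named Column, just like col(columnName):

val s        = salesRecords.alias("s")      // names the DataFrame for joins/expressions
val priceCol = salesRecords("Price")        // apply(String)
val sameCol  = salesRecords.col("Price")    // col(String)
salesRecords.select(priceCol).show()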
Filter function: we use it when the condition is complicated and cannot be expressed with typical SQL constructs; that is why we write a filter function instead.
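A sketch of a filter function with a made-up rule ("City" is a hypothetical column):

import org.apache.spark.sql.Row
val multiWordCities = salesRecords.filter((row: Row) => {
  val city = row.getAs[String]("City")
  city != null && city.trim.split("\\s+").length > 1  // keep multi-word city names
})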
First
Take and Head
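A sketch of the row-fetching actions:

val firstRow  = salesRecords.first()        // same as head()
val firstFive = salesRecords.take(5)        // Array[Row], same as head(5)
val asList    = salesRecords.takeAsList(5)  // java.util.List[Row]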
Intersection
Union
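A sketch of intersect and union; both inputs must have the same schema ("Country" and "Price" are hypothetical columns):

val usSales  = salesRecords.filter($"Country" === "United States")
val bigSales = salesRecords.filter($"Price" > 1000)
usSales.intersect(bigSales).show()  // rows present in both (distinct)
usSales.union(bigSales).show()      // rows from both (duplicates kept)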
OrderBy
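A sketch of ordering; orderBy and sort are aliases of each other:

import org.apache.spark.sql.functions.desc
salesRecords.orderBy(desc("Price")).show(5)
salesRecords.sort($"Country", $"Price".desc).show(5)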
Convert to JSON
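A sketch of toJSON, which turns each row into a JSON string (a Dataset[String]):

salesRecords.toJSON.show(3, truncate = false)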
CreateOrReplaceTempView
createOrReplaceTempView vs createOrReplaceGlobalTempView
A global view is available to all sessions and stays alive even if we close the current session.
createGlobalTempView: the view must not exist yet; we are just creating it, and we receive an error if it already exists
createOrReplaceGlobalTempView: it will overwrite the existing view if it exists
createTempView: the view must not exist yet; we are just creating it, and we receive an error if it already exists
createOrReplaceTempView: it will overwrite the existing view if it exists
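A sketch of the two scopes with hypothetical view names; note that global views live in the reserved global_temp database:

salesRecords.createOrReplaceTempView("sales")          // session-scoped
spark.sql("SELECT * FROM sales").show()
salesRecords.createOrReplaceGlobalTempView("sales_g")  // shared across sessions
spark.newSession().sql("SELECT COUNT(*) FROM global_temp.sales_g").show()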
groupBy
Count
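A sketch of grouping; groupBy returns a RelationalGroupedDataset, and count() turns it back into a DataFrame with a "count" column ("Country" is hypothetical):

salesRecords.groupBy($"Country").count().orderBy($"count".desc).show()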
Run SQL on Files
Instead of loading a file into a dataset and selecting columns, we may run SQL directly on the file:
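The table name is the format followed by the file path in backticks, reusing the path from the earlier load:

spark.sql(
  "SELECT * FROM parquet.`/Users/hadoop-user/Documents/SalesJan2009.parquet`"
).show()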
• SaveMode.ErrorIfExists (default): throws an exception if the output data already exists
• SaveMode.Append: appends the new rows to the existing data
• SaveMode.Overwrite: replaces the existing data
• SaveMode.Ignore: silently does nothing if the output data already exists
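A sketch of writing with an explicit mode (the output path is hypothetical):

import org.apache.spark.sql.SaveMode
salesRecords.write
  .mode(SaveMode.Overwrite)  // or Append / Ignore / ErrorIfExists
  .parquet("/tmp/sales_out")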
Read & Write Hive Tables
• Hive support needs to be enabled on the Spark session
• The Hive warehouse directory needs to be set via "spark.sql.warehouse.dir"
• Once the session is created, SQL statements can be issued using sparkSession.sql("<sql_statement>")
• Sorting and partitioning can be done on the tables being saved
By default, Spark keeps the metastore in an embedded Derby DB; the warehouse directory is configured with "spark.sql.warehouse.dir", which replaces the older Hive metastore setting "hive.metastore.warehouse.dir".
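A sketch of a Hive-enabled session (the warehouse path is hypothetical):

import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
  .appName("HiveExample")
  .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
  .enableHiveSupport()
  .getOrCreate()
spark.sql("SHOW TABLES").show()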
Save as Hive Table
Why do we need the current date? Because we need to partition the data by date.
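A sketch of the idea: tag each load with the current date and partition the saved Hive table by it (table and column names are hypothetical):

import org.apache.spark.sql.functions.current_date
salesRecords
  .withColumn("load_date", current_date())
  .write
  .partitionBy("load_date")
  .saveAsTable("sales_by_date")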