Spark Essentials
By
Prof Shibdas Dutta
Associate Professor,
DCG DATA CORE SYSTEMS INDIA PVT LTD
Kolkata
Table of Contents
Introduction
Basic Commands
• Introduction
• Many data scientists, analysts, and general business intelligence users
rely on interactive SQL queries for exploring data, and Spark SQL is a
tool that enables them to do so.
What Is Spark SQL?
Spark SQL is a Spark module for structured data processing. One use
of Spark SQL is to execute SQL queries. In this post, let's focus on
9 basic commands for running SQL queries.
1 — Creating a SparkSession
A SparkSession is the entry point to Spark SQL; you need one before
you can create DataFrames or run queries.
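A minimal sketch of creating one (the application name here is only an illustration):
# Creating a SparkSession (the app name 'spark_essentials' is illustrative)
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('spark_essentials').getOrCreate()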
2 — Creating DataFrames
There are many ways to create DataFrames in Spark. One of them is to
read from a Spark data source, as in the example below.
# Creating DataFrames
df = spark.read.format('bigquery') \
    .option('project', '<your_project_ID>') \
    .option('table', '<your_table_name>') \
    .load()
3 — Inspecting Data
After you've created your DataFrame, the next thing you'll probably want to do is quickly inspect
your data. Here are a few commands!
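For instance, reusing the df loaded above:
# Inspecting data
df.show(5)            # Display the first 5 rows
df.printSchema()      # Print the column names and types
df.describe().show()  # Summary statistics for numeric columns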
Let's say you have a new DataFrame named newdf and you want to display everything in its
customer_id column.
.select
# Selecting
newdf.select("customer_id").show()
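.select also accepts several columns at once (the column names here are assumed from the other examples in this post):
newdf.select("customer_id", "order_quantity").show()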
6 — Filtering
.filter
.filter keeps only the rows that satisfy a condition. For example, to keep orders
with a quantity greater than 10:
# Filtering
df.filter(df["order_quantity"] > 10).show()
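The condition can also be written as a SQL expression string, which does the same thing:
df.filter("order_quantity > 10").show()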
7 — Aggregating
All of the common aggregation methods, like .min(), .max(), and .count(), are
GroupedData methods. These are created by calling the .groupBy()
DataFrame method. To use these functions, we call that method on the
DataFrame. For example, to find the minimum value of the order_quantity
column in a DataFrame, df, you could do:
df.groupBy().min("order_quantity").show()
This creates a GroupedData object (so you can use the .min() method), then finds the
minimum value in order_quantity, and returns it as a DataFrame.
8 — Grouping and Aggregating
.groupBy() | .avg()
Passing one or more column names to .groupBy() aggregates within each
group instead of over the whole DataFrame.
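For example, to get the average order quantity per customer (column names assumed from the earlier examples):
# Grouping and aggregating
df.groupBy("customer_id").avg("order_quantity").show()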
9 — Running SQL Queries Programmatically
The sql function on a SparkSession enables applications to run SQL queries programmatically
and returns the result as a DataFrame.
You can register a DataFrame as a temporary view and run SQL queries against it.
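A sketch of that pattern (the view name orders is only an illustration):
# Registering a temporary view and querying it with SQL
df.createOrReplaceTempView('orders')
spark.sql('SELECT customer_id, SUM(order_quantity) AS total FROM orders GROUP BY customer_id').show()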