Unit-5 Spark SQL and Spark Streaming
Prepared By:
Aayushi Chaudhari,
Assistant Professor, CE, CSPIT,
CHARUSAT
• Perform ETL to and from various (semi- or unstructured) data sources.
• Perform advanced analytics (e.g. machine learning, graph processing) that are hard to express in relational systems.
• A DataFrames API that can perform relational operations on both external data sources and Spark's built-in RDDs.
• A highly extensible optimizer, Catalyst, that uses features of Scala to add composable rules, control code generation, and define extensions.
Schema RDD:
• Spark Core is designed with special data structure called RDD.
• Generally, Spark SQL works on schemas, tables, and records.
• Therefore, we can use a Schema RDD as a temporary table.
• Such a Schema RDD is also called a DataFrame.
Data Sources:
• Usually the data sources for Spark Core are text files, Avro files, etc. However, the data sources for Spark SQL are different.
• They include Parquet files, JSON documents, Hive tables, and the Cassandra database.
November 11, 2024| U & P U. Patel Department of Computer Engineering 8
Spark-Managed Tables
• In Apache Spark, two types of tables can be created: managed tables and external tables.
• For a managed table, Spark owns both the data and the metadata; for an external table, Spark tracks only the metadata while the data stays at a user-supplied location.
• Both table types are essential for data storage and management in Spark projects.
Creating Tables
Follow these steps to create tables using the Spark SQL API:
• Define the table schema by specifying column names and data types.
• Use the CREATE TABLE statement, providing the table name and schema.
• For managed tables, data is stored in the default location. For external tables, specify the
location using the LOCATION keyword.
• Execute the SQL command to create the table.
• When working with managed or external tables, always consider your data storage, access,
and management requirements to achieve optimal performance.
Dropping Tables
• Open your Apache Spark environment or platform.
• Use the DROP command to delete either managed or external tables.
• Verify the deletion by checking the table list or running a query.
Link : https://ucsdlib.github.io/python-novice-gapminder/07-reading-tabular/
Link : https://kaizen.itversity.com/courses/hdpcsd-hdp-certified-spark-developer-hdpcsd-python/lessons/hdpcsd-apache-spark-2-data-frames-and-spark-sql-python/topic/hdpcsd-data-frame-operations-basic-transformations-such-as-filtering-aggregations-joins-etc-python/
Aggregations, Joins
Spark Joins
• Joins combine rows from two or more DataFrames based on a related column.
• Types of Joins:
• Inner Join: Only matching rows.
• Left Join: All rows from the left, matching from the right.
• Right Join: All rows from the right, matching from the left.
• Outer Join: All rows from both, with nulls for non-matches.
Backpressure Handling:
• Backpressure occurs when the system can't handle incoming data fast enough, causing a bottleneck.
• Managing backpressure involves mechanisms like buffering, rate limiting, or dynamic resource allocation,
which add complexity.
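In classic DStream-based Spark Streaming, the rate-limiting mechanism described above is exposed through configuration properties. A minimal sketch, with illustrative values; backpressure lets Spark adapt the ingestion rate to the observed processing rate instead of letting batches queue up.

```python
from pyspark import SparkConf

conf = (
    SparkConf()
    .setAppName("backpressure-demo")
    # Dynamically adjust the receiving rate to match processing speed.
    .set("spark.streaming.backpressure.enabled", "true")
    # Optional hard cap: maximum records per second for each receiver.
    .set("spark.streaming.receiver.maxRate", "1000")
)
```

The `maxRate` cap acts as a safety ceiling even when backpressure is enabled; tuning both is one way to trade throughput against stability.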