
SPARK SQL AND SPARK STREAMING
Prepared By:
Aayushi Chaudhari,
Assistant Professor, CE, CSPIT,
CHARUSAT



Agenda
• Big Data and Spark SQL
• Spark-Managed Tables
• Reading Tables into Data Frames
• Aggregations
• Joins
• Creating Views
• Spark Streaming and Challenges of Stream Processing
• Spark’s Streaming APIs
• Spark streaming case study



Big Data and Spark SQL
• Big Data and Spark SQL are crucial components of modern data processing and
analysis, particularly for handling large-scale datasets.
• Big Data refers to extremely large datasets that are difficult to process and analyze
using traditional data processing tools.
• The main characteristics of Big Data, known as the 4 V's, are:
• Volume: The amount of data is massive, often in terabytes, petabytes, or even exabytes.
• Velocity: Data is generated and processed at high speeds, requiring real-time or near-real-
time analysis.
• Variety: Data comes in various formats—structured, semi-structured, and unstructured.
Examples include text, images, videos, logs, and more.
• Veracity: The quality and accuracy of data, which can vary greatly.
• Big Data technologies and frameworks, like Apache Hadoop, Apache Spark, and NoSQL
databases, are designed to process, analyze, and store these large datasets efficiently.
Big Data and Spark SQL
• Apache Spark is a powerful open-source data processing engine designed for
large-scale data processing. It provides:
• In-memory Processing: Unlike traditional Hadoop, which writes intermediate results to disk, Spark
processes data in memory, making it much faster.
• Ease of Use: APIs in multiple languages (Scala, Java, Python, R) and built-in libraries for machine
learning (MLlib), graph processing (GraphX), and streaming data (Spark Streaming).
• Unified Engine: Supports batch processing, interactive queries, real-time streaming, and complex
analytics.



Spark SQL
• Spark introduces a programming module for
structured data processing called Spark SQL.
• Spark SQL was first released in Spark 1.0
(May, 2014).
• It provides a programming abstraction called
DataFrames and can also act as a distributed
SQL query engine.
• Spark SQL is a component on top of Spark Core that introduces a new data abstraction called Schema RDD.
• Spark SQL allows developers to:
• Import relational data from Parquet files and Hive tables
• Run SQL queries over imported data and existing RDDs
• Easily write RDDs out to Hive tables or Parquet files
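
For illustration, a minimal PySpark sketch of these capabilities (the file path, table, and column names are assumptions, not from the slides):

  # Load Parquet data, query it with SQL, and write the result back out.
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("SparkSQLIntro").enableHiveSupport().getOrCreate()

  # Import relational data from a Parquet file into a DataFrame
  people = spark.read.parquet("/data/people.parquet")

  # Run SQL queries over the imported data
  people.createOrReplaceTempView("people")
  adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
  adults.show()

  # Write the result out as a table (or back to Parquet files)
  adults.write.mode("overwrite").saveAsTable("adults")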



Spark SQL
• Challenges:
• Perform ETL to and from various (semi- or unstructured) data sources.
• Perform advanced analytics (e.g. machine learning, graph processing) that are hard to express in relational systems.

• Solutions:
• A DataFrames API that can perform relational operations on both external data sources and Spark's built-in RDDs.
• A highly extensible optimizer, Catalyst, that uses features of Scala to add composable rules, control code generation, and define extensions.



Spark SQL Architecture



Spark SQL Architecture
 Language API:
• Spark is compatible with different languages, and so is Spark SQL.
• Spark SQL is supported by these language APIs: Python, Scala, Java, and HiveQL.

 Schema RDD:
• Spark Core is designed with a special data structure called RDD.
• Generally, Spark SQL works on schemas, tables, and records.
• Therefore, we can use a Schema RDD as a temporary table.
• This Schema RDD is also called a DataFrame.

 Data Sources:
• Usually, the data source for Spark Core is a text file, an Avro file, etc. However, the data sources for Spark SQL are different.
• These are Parquet files, JSON documents, Hive tables, and the Cassandra database.
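
A small PySpark sketch of reading from these sources (paths and names are illustrative; the Cassandra read assumes the third-party spark-cassandra-connector package is on the classpath):

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("DataSources").enableHiveSupport().getOrCreate()

  df_parquet = spark.read.parquet("/data/events.parquet")   # Parquet file
  df_json = spark.read.json("/data/events.json")            # JSON document
  df_hive = spark.table("sales_db.orders")                  # Hive table

  # Cassandra database (requires the spark-cassandra-connector package)
  df_cassandra = (spark.read.format("org.apache.spark.sql.cassandra")
                  .options(keyspace="shop", table="orders")
                  .load())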
Spark-Managed Tables
• In Apache Spark, there are two types of tables that can be created: managed tables and
external tables.
• Both of these tables are essential for data storage and management in Spark projects.

 Creating Tables
Follow these steps to create tables using the Spark SQL API:
• Define the table schema by specifying column names and data types.
• Use the CREATE TABLE statement, providing the table name and schema.
• For managed tables, data is stored in the default location. For external tables, specify the
location using the LOCATION keyword.
• Execute the SQL command to create the table.
• When working with managed or external tables, always consider your data storage, access,
and management requirements to achieve optimal performance.
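
The steps above map onto statements like the following; a sketch issued through the Spark SQL API (database, table, column, and path names are hypothetical):

  # Managed table: Spark owns both the metadata and the data files
  # (stored in the default warehouse location).
  spark.sql("""
      CREATE TABLE IF NOT EXISTS sales_managed (order_id INT, amount DOUBLE, order_dt DATE)
      USING parquet
  """)

  # External table: Spark owns only the metadata; the data stays at LOCATION.
  spark.sql("""
      CREATE TABLE IF NOT EXISTS sales_external (order_id INT, amount DOUBLE, order_dt DATE)
      USING parquet
      LOCATION '/mnt/data/sales/'
  """)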



Spark-Managed Tables
 Querying Tables
• For managed tables, use SQL queries to retrieve data from the Spark catalog.
• For external tables, SQL queries can access data that resides in external locations.
• When querying managed tables, data is retrieved from Spark’s default storage location.
• For external tables, the data is fetched from the location specified when the table was created.

 Dropping Tables
• Open your Apache Spark environment or platform.
• Use the DROP command to delete either managed or external tables.
• Verify the deletion by checking the table list or running a query.
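
A short sketch of querying and dropping these tables (continuing the hypothetical names from the previous example):

  # Query a table through the Spark catalog
  result = spark.sql("SELECT order_id, amount FROM sales_managed WHERE amount > 100")
  result.show()

  # DROP TABLE removes metadata and data for a managed table,
  # but only the metadata for an external table (the files at LOCATION remain).
  spark.sql("DROP TABLE IF EXISTS sales_managed")
  spark.sql("DROP TABLE IF EXISTS sales_external")

  # Verify the deletion by listing the remaining tables
  spark.sql("SHOW TABLES").show()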



Spark-Managed Tables
 Altering Table Properties
• To modify table properties in Spark, use the ALTER TABLE command along with the SET
TBLPROPERTIES keyword.
• Specify the table name and the properties to be changed, such as adjusting storage format or
adding custom parameters.
• Use the DESCRIBE EXTENDED command to verify that the table properties have been
updated accordingly.
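
For example, a sketch of altering and verifying table properties (the table and property names are illustrative):

  # Set custom properties on an existing table
  spark.sql("""
      ALTER TABLE sales_managed
      SET TBLPROPERTIES ('comment' = 'Daily sales data', 'owner' = 'analytics-team')
  """)

  # DESCRIBE EXTENDED shows the table's metadata, including its properties
  spark.sql("DESCRIBE EXTENDED sales_managed").show(truncate=False)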



Reading Tables into Data Frames

 Refer to this link: https://ucsdlib.github.io/python-novice-gapminder/07-reading-tabular/
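
As a quick illustration, a PySpark sketch of reading a catalog table and a file into DataFrames (table and file names are hypothetical):

  # Read a table registered in the catalog into a DataFrame
  orders_df = spark.table("sales_managed")        # equivalently: spark.read.table("sales_managed")
  orders_df.printSchema()
  orders_df.show(5)

  # Files can also be read directly into a DataFrame
  csv_df = (spark.read.option("header", "true")
            .option("inferSchema", "true")
            .csv("/data/orders.csv"))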



Aggregations, Joins
 Aggregations in Spark
• Aggregations summarize data (e.g., sum, count, average).
• Can be used for grouping data and performing operations on grouped data.
• Common Functions:
• sum(), avg(), min(), max()
• Methods:
• groupBy(): Groups rows based on one or more columns.
• agg(): Applies multiple aggregate functions at once.
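
A minimal sketch of groupBy() and agg() (DataFrame and column names are assumptions):

  from pyspark.sql import functions as F

  summary = (orders_df
             .groupBy("customer_id")
             .agg(F.count("*").alias("num_orders"),
                  F.sum("amount").alias("total_amount"),
                  F.avg("amount").alias("avg_amount"),
                  F.max("amount").alias("max_amount")))
  summary.show()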

 Refer to this link: https://kaizen.itversity.com/courses/hdpcsd-hdp-certified-spark-developer-hdpcsd-python/lessons/hdpcsd-apache-spark-2-data-frames-and-spark-sql-python/topic/hdpcsd-data-frame-operations-basic-transformations-such-as-filtering-aggregations-joins-etc-python/
Aggregations, Joins
 Spark Joins
• Joins combine rows from two or more DataFrames based on a related column.
• Types of Joins:
• Inner Join: Only matching rows.
• Left Join: All rows from the left, matching from the right.
• Right Join: All rows from the right, matching from the left.
• Outer Join: All rows from both, with null for non-matches
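
A sketch of the four join types on two DataFrames (DataFrame and column names are assumptions):

  customers_df = spark.table("customers")

  inner = orders_df.join(customers_df, on="customer_id", how="inner")   # only matching rows
  left  = orders_df.join(customers_df, on="customer_id", how="left")    # all rows from the left
  right = orders_df.join(customers_df, on="customer_id", how="right")   # all rows from the right
  outer = orders_df.join(customers_df, on="customer_id", how="outer")   # all rows from both sides
  inner.select("order_id", "customer_name", "amount").show()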

 Refer to this link: https://kaizen.itversity.com/courses/hdpcsd-hdp-certified-spark-developer-hdpcsd-python/lessons/hdpcsd-apache-spark-2-data-frames-and-spark-sql-python/topic/hdpcsd-data-frame-operations-basic-transformations-such-as-filtering-aggregations-joins-etc-python/



Spark Streaming
 Why Spark Streaming?
• Many important applications must process large streams of live data and
provide results in near-real-time
 Social network trends
 Website statistics
 Intrusion detection systems etc.
• Requires low latencies for faster processing
 What is Spark Streaming?
• A scalable and fault-tolerant stream processing engine.
• It lets you express a streaming computation the same way you would express a batch computation on static data.


Spark Streaming
• Data can be ingested from many sources like Kafka, Flume, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join, and window.
• Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, and fault-tolerant stream processing of live data streams.
• It is built on the Spark core engine, allowing seamless integration with other Spark components like
Spark SQL, MLlib, and GraphX.
• Key Features:
• Micro-batch Processing
• Ease of Use
• Integration with Ecosystem
• Fault Tolerance
• Scalability
• Windowed Operations



Challenges of Stream Processing
• Stream processing comes with its own set of challenges, especially when dealing with real-
time data. Some of the main challenges are:
 Latency vs. Throughput:
• Latency refers to the time delay in processing and delivering the results of a stream.
• Throughput is the amount of data processed in a given time.
• Balancing latency and throughput is a key challenge.
• Lower latency may lead to lower throughput, while optimizing for high throughput can increase latency.
 Fault Tolerance and Data Loss:
• Ensuring that no data is lost in case of node or network failures is critical.
• Implementing exactly-once semantics, where data is neither lost nor processed multiple times, is difficult
and often requires sophisticated mechanisms like checkpointing and write-ahead logs.
 Scalability:
• Stream processing systems must be able to scale horizontally to handle fluctuating data volumes.
• Dynamic scaling, where resources are allocated and deallocated automatically based on load, adds another
layer of complexity.



Challenges of Stream Processing
 Complex Event Processing:
• Detecting complex patterns or sequences of events in a data stream, like fraud detection or anomaly
detection, can be computationally intensive and requires sophisticated algorithms.

 Backpressure Handling:
• Backpressure occurs when the system can't handle incoming data fast enough, causing a bottleneck.
• Managing backpressure involves mechanisms like buffering, rate limiting, or dynamic resource allocation,
which add complexity.

 Integration and Compatibility:


• Integrating with various data sources (Kafka, Flume, etc.) and sinks (HDFS, databases) requires
understanding each component's behavior under load and failure conditions.
• Compatibility with different formats, protocols, and data models adds another layer of complexity.



Spark’s Streaming API
• The Spark Streaming API is based on DStreams (Discretized Streams), which is an abstraction
that represents a continuous stream of data, divided into small micro-batches.
• These micro-batches are processed like Resilient Distributed Datasets (RDDs) internally.

 Discretized Stream API (DStream)

• A high-level abstraction which represents a continuous stream of data.
• DStreams can be created either from input data streams from sources such as Kafka, Flume, and Kinesis, or by applying high-level operations on other DStreams.
• A DStream represents a continuous flow of data in the form of a sequence of RDDs.
• Each RDD in the sequence represents data collected over a time window (batch interval).
• Spark Streaming uses DStream as its fundamental abstraction for real-time processing.
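
For illustration, the classic DStream word count over a TCP socket (host, port, and batch interval are assumptions; requires a Spark version that still ships the DStream API):

  from pyspark import SparkContext
  from pyspark.streaming import StreamingContext

  sc = SparkContext(appName="NetworkWordCount")
  ssc = StreamingContext(sc, batchDuration=5)         # 5-second micro-batches

  lines = ssc.socketTextStream("localhost", 9999)     # DStream from a TCP source
  words = lines.flatMap(lambda line: line.split(" "))
  counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
  counts.pprint()                                     # print each batch's counts

  ssc.start()
  ssc.awaitTermination()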



Spark’s Streaming API
• Micro-Batch Processing: Spark Streaming processes data in small time intervals (batches).
• DStream Transformations: Similar to RDD operations, DStreams provide transformations like
map, flatMap, reduceByKey, window, etc.
• Stateful Operations: The updateStateByKey and mapWithState functions allow stateful stream
processing.
• Fault Tolerance: Data can be replayed in case of a failure by using checkpointing and reliable
data sources (like Kafka).
• Windowing Operations: You can define windows of data over which computations can be
performed, such as aggregating over the last 10 seconds of data.
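
A sketch of windowed and stateful operations, continuing the word-count DStream from the previous example (directory path and window sizes are assumptions):

  # Checkpointing is required for stateful and inverse-windowed operations
  ssc.checkpoint("/tmp/streaming-checkpoint")

  # Counts over the last 30 seconds, recomputed every 10 seconds
  windowed = (words.map(lambda w: (w, 1))
                   .reduceByKeyAndWindow(lambda a, b: a + b,   # add values entering the window
                                         lambda a, b: a - b,   # subtract values leaving the window
                                         windowDuration=30,
                                         slideDuration=10))

  # Running total per word across the whole stream
  def update_total(new_values, running_total):
      return sum(new_values) + (running_total or 0)

  totals = words.map(lambda w: (w, 1)).updateStateByKey(update_total)
  windowed.pprint()
  totals.pprint()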



Spark’s Streaming API
 Structured Streaming API
• Structured Streaming is a scalable and fault-tolerant stream processing engine built on the
Spark SQL engine.
• It operates in a declarative manner, allowing users to define a computation as a query on a
streaming DataFrame or Dataset, much like writing SQL queries on static data.
• Continuous Processing: Structured Streaming provides near real-time data processing by
continuously appending new data to an unbounded table.
• Event-Time Processing: Unlike the DStreams API, Structured Streaming supports event-time
processing and late data handling through watermarks.
• Exactly-Once Semantics: Structured Streaming guarantees end-to-end exactly-once fault
tolerance, which is especially important in stateful streaming applications.
• SQL Integration: Since Structured Streaming is built on the Spark SQL engine, it allows
seamless integration with SQL-based queries, aggregations, and joins.
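
A minimal Structured Streaming sketch of the same word count, written declaratively over a streaming DataFrame (host, port, and checkpoint path are assumptions):

  from pyspark.sql import SparkSession
  from pyspark.sql import functions as F

  spark = SparkSession.builder.appName("StructuredWordCount").getOrCreate()

  # Streaming DataFrame: an unbounded table that grows as new lines arrive
  lines = (spark.readStream
                .format("socket")
                .option("host", "localhost")
                .option("port", 9999)
                .load())

  words = lines.select(F.explode(F.split(lines.value, " ")).alias("word"))
  counts = words.groupBy("word").count()

  # Spark keeps the aggregation state and updates the result as new data arrives
  query = (counts.writeStream
                 .outputMode("complete")
                 .format("console")
                 .option("checkpointLocation", "/tmp/structured-checkpoint")
                 .start())
  query.awaitTermination()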



Spark’s Streaming API
Feature            | DStreams API                             | Structured Streaming API
Processing Model   | Micro-batch                              | Micro-batch and Continuous Processing
Abstraction        | DStreams (built on RDDs)                 | DataFrame/Dataset
Event Time Support | Limited                                  | Yes (event-time, watermarks)
Fault Tolerance    | Achieved via checkpointing               | Exactly-once end-to-end fault tolerance
Windowing Support  | Yes (using DStream windowing functions)  | Yes (more flexible and efficient)
Latency            | Higher due to batch processing           | Lower with Continuous Processing
API                | Functional API similar to RDDs           | Declarative API using DataFrame and SQL
State Management   | updateStateByKey, mapWithState           | Built-in, efficient state handling
Thank You.

