
SPARK SQL AND SPARK STREAMING
Prepared By:
Aayushi Chaudhari,
Assistant Professor, CE, CSPIT,
CHARUSAT



Agenda
• Big Data and Spark SQL
• Spark-Managed Tables
• Reading Tables into Data Frames
• Aggregations
• Joins
• Creating Views
• Spark Streaming and Challenges of Stream Processing
• Spark’s Streaming APIs
• Spark streaming case study



Big Data and Spark SQL
• Big Data and Spark SQL are crucial components of modern data processing and
analysis, particularly for handling large-scale datasets.
• Big Data refers to extremely large datasets that are difficult to process and analyze
using traditional data processing tools.
• The main characteristics of Big Data, known as the 4 V's, are:
• Volume: The amount of data is massive, often in terabytes, petabytes, or even exabytes.
• Velocity: Data is generated and processed at high speeds, requiring real-time or near-real-
time analysis.
• Variety: Data comes in various formats—structured, semi-structured, and unstructured.
Examples include text, images, videos, logs, and more.
• Veracity: The quality and accuracy of data, which can vary greatly.
• Big Data technologies and frameworks, like Apache Hadoop, Apache Spark, and NoSQL
databases, are designed to process, analyze, and store these large datasets efficiently.
Big Data and Spark SQL
• Apache Spark is a powerful open-source data processing engine designed for
large-scale data processing. It provides:
• In-memory Processing: Unlike traditional Hadoop, which writes intermediate results to disk, Spark
processes data in memory, making it much faster.
• Ease of Use: APIs in multiple languages (Scala, Java, Python, R) and built-in libraries for machine
learning (MLlib), graph processing (GraphX), and streaming data (Spark Streaming).
• Unified Engine: Supports batch processing, interactive queries, real-time streaming, and complex
analytics.



Spark SQL
• Spark introduces a programming module for
structured data processing called Spark SQL.
• Spark SQL was first released in Spark 1.0
(May, 2014).
• It provides a programming abstraction called
DataFrames and can also act as a distributed
SQL query engine.
• Spark SQL is a component on top of Spark Core that introduces a new data abstraction called Schema RDD.
• Spark SQL allows developers to:
• Import relational data from Parquet files and Hive tables
• Run SQL queries over imported data and existing RDDs
• Easily write RDDs out to Hive tables or Parquet files
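
For illustration, a minimal PySpark sketch of these capabilities (the file path, table, and column names are assumptions, not from the slides):

  # Load Parquet data, query it with SQL, and write the result back out.
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("SparkSQLIntro").enableHiveSupport().getOrCreate()

  # Import relational data from a Parquet file into a DataFrame
  people = spark.read.parquet("/data/people.parquet")

  # Run SQL queries over the imported data
  people.createOrReplaceTempView("people")
  adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
  adults.show()

  # Write the result out as a table (or back to Parquet files)
  adults.write.mode("overwrite").saveAsTable("adults")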



Spark SQL
• Challenges:
• Perform ETL to and from various (semi- or unstructured) data sources.
• Perform advanced analytics (e.g. machine learning, graph processing) that are hard to express in relational systems.

• Solutions:
• A DataFrames API that can perform relational operations on both external data sources and Spark's built-in RDDs.
• A highly extensible optimizer, Catalyst, that uses features of Scala to add composable rules, control code generation, and define extensions.



Spark SQL Architecture



Spark SQL Architecture
 Language API:
• Spark is compatible with different languages, and so is Spark SQL.
• Spark SQL is supported by these language APIs: Python, Scala, Java, and HiveQL.

 Schema RDD:
• Spark Core is designed with a special data structure called RDD.
• Generally, Spark SQL works on schemas, tables, and records.
• Therefore, we can use a Schema RDD as a temporary table.
• This Schema RDD is also called a DataFrame.

 Data Sources:
• Usually, the data source for Spark Core is a text file, an Avro file, etc. However, the data sources for Spark SQL are different.
• These are Parquet files, JSON documents, Hive tables, and the Cassandra database.
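
A small PySpark sketch of reading from these sources (paths and names are illustrative; the Cassandra read assumes the third-party spark-cassandra-connector package is on the classpath):

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("DataSources").enableHiveSupport().getOrCreate()

  df_parquet = spark.read.parquet("/data/events.parquet")   # Parquet file
  df_json = spark.read.json("/data/events.json")            # JSON document
  df_hive = spark.table("sales_db.orders")                  # Hive table

  # Cassandra database (requires the spark-cassandra-connector package)
  df_cassandra = (spark.read.format("org.apache.spark.sql.cassandra")
                  .options(keyspace="shop", table="orders")
                  .load())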
Spark-Managed Tables
• In Apache Spark, there are two types of tables that can be created: managed tables and
external tables.
• Both of these tables are essential for data storage and management in Spark projects.

 Creating Tables
Follow these steps to create tables using the Spark SQL API:
• Define the table schema by specifying column names and data types.
• Use the CREATE TABLE statement, providing the table name and schema.
• For managed tables, data is stored in the default location. For external tables, specify the
location using the LOCATION keyword.
• Execute the SQL command to create the table.
• When working with managed or external tables, always consider your data storage, access,
and management requirements to achieve optimal performance.
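
The steps above map onto statements like the following; a sketch issued through the Spark SQL API (database, table, column, and path names are hypothetical):

  # Managed table: Spark owns both the metadata and the data files
  # (stored in the default warehouse location).
  spark.sql("""
      CREATE TABLE IF NOT EXISTS sales_managed (order_id INT, amount DOUBLE, order_dt DATE)
      USING parquet
  """)

  # External table: Spark owns only the metadata; the data stays at LOCATION.
  spark.sql("""
      CREATE TABLE IF NOT EXISTS sales_external (order_id INT, amount DOUBLE, order_dt DATE)
      USING parquet
      LOCATION '/mnt/data/sales/'
  """)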



Spark-Managed Tables
 Querying Tables
• For managed tables, use SQL queries to retrieve data from the Spark catalog.
• For external tables, SQL queries can access data that resides in external locations.
• When querying managed tables, data is retrieved from Spark’s default storage location.
• For external tables, the data is fetched from the location specified when the table was created.

 Dropping Tables
• Open your Apache Spark environment or platform.
• Use the DROP command to delete either managed or external tables.
• Verify the deletion by checking the table list or running a query.
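
A short sketch of querying and dropping these tables (continuing the hypothetical names from the previous example):

  # Query a table through the Spark catalog
  result = spark.sql("SELECT order_id, amount FROM sales_managed WHERE amount > 100")
  result.show()

  # DROP TABLE removes metadata and data for a managed table,
  # but only the metadata for an external table (the files at LOCATION remain).
  spark.sql("DROP TABLE IF EXISTS sales_managed")
  spark.sql("DROP TABLE IF EXISTS sales_external")

  # Verify the deletion by listing the remaining tables
  spark.sql("SHOW TABLES").show()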



Spark-Managed Tables
 Altering Table Properties
• To modify table properties in Spark, use the ALTER TABLE command along with the SET
TBLPROPERTIES keyword.
• Specify the table name and the properties to be changed, such as adjusting storage format or
adding custom parameters.
• Use the DESCRIBE EXTENDED command to verify that the table properties have been
updated accordingly.
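
For example, a sketch of altering and verifying table properties (the table and property names are illustrative):

  # Set custom properties on an existing table
  spark.sql("""
      ALTER TABLE sales_managed
      SET TBLPROPERTIES ('comment' = 'Daily sales data', 'owner' = 'analytics-team')
  """)

  # DESCRIBE EXTENDED shows the table's metadata, including its properties
  spark.sql("DESCRIBE EXTENDED sales_managed").show(truncate=False)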



Reading Tables into Data Frames

 Refer to this link: https://ucsdlib.github.io/python-novice-gapminder/07-reading-tabular/
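
As a quick illustration, a PySpark sketch of reading a catalog table and a file into DataFrames (table and file names are hypothetical):

  # Read a table registered in the catalog into a DataFrame
  orders_df = spark.table("sales_managed")        # equivalently: spark.read.table("sales_managed")
  orders_df.printSchema()
  orders_df.show(5)

  # Files can also be read directly into a DataFrame
  csv_df = (spark.read.option("header", "true")
            .option("inferSchema", "true")
            .csv("/data/orders.csv"))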



Aggregations, Joins
 Aggregations in Spark
• Aggregations summarize data (e.g., sum, count, average).
• Can be used for grouping data and performing operations on grouped data.
• Common Functions:
• sum(), avg(), min(), max()
• Methods:
• groupBy(): Groups rows based on one or more columns.
• agg(): Applies multiple aggregate functions at once.
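
A minimal sketch of groupBy() and agg() (DataFrame and column names are assumptions):

  from pyspark.sql import functions as F

  summary = (orders_df
             .groupBy("customer_id")
             .agg(F.count("*").alias("num_orders"),
                  F.sum("amount").alias("total_amount"),
                  F.avg("amount").alias("avg_amount"),
                  F.max("amount").alias("max_amount")))
  summary.show()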

 Refer to this link: https://kaizen.itversity.com/courses/hdpcsd-hdp-certified-spark-developer-hdpcsd-python/lessons/hdpcsd-apache-spark-2-data-frames-and-spark-sql-python/topic/hdpcsd-data-frame-operations-basic-transformations-such-as-filtering-aggregations-joins-etc-python/
Aggregations, Joins
 Spark Joins
• Joins combine rows from two or more DataFrames based on a related column.
• Types of Joins:
• Inner Join: Only matching rows.
• Left Join: All rows from the left, matching from the right.
• Right Join: All rows from the right, matching from the left.
• Outer Join: All rows from both, with null for non-matches
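
A sketch of the four join types on two DataFrames (DataFrame and column names are assumptions):

  customers_df = spark.table("customers")

  inner = orders_df.join(customers_df, on="customer_id", how="inner")   # only matching rows
  left  = orders_df.join(customers_df, on="customer_id", how="left")    # all rows from the left
  right = orders_df.join(customers_df, on="customer_id", how="right")   # all rows from the right
  outer = orders_df.join(customers_df, on="customer_id", how="outer")   # all rows from both sides
  inner.select("order_id", "customer_name", "amount").show()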

 Refer to this link: https://kaizen.itversity.com/courses/hdpcsd-hdp-certified-spark-developer-hdpcsd-python/lessons/hdpcsd-apache-spark-2-data-frames-and-spark-sql-python/topic/hdpcsd-data-frame-operations-basic-transformations-such-as-filtering-aggregations-joins-etc-python/



Spark Streaming
 Why Spark Streaming?
• Many important applications must process large streams of live data and
provide results in near-real-time
 Social network trends
 Website statistics
 Intrusion detection systems etc.
• Requires low latencies for faster processing
 What is Spark Streaming?
• A scalable and fault-tolerant stream processing engine.
• It lets you express a streaming computation the same way you would express a batch computation on static data.


Spark Streaming
• Data can be ingested from many sources like Kafka, Flume, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join, and window.
• Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, and fault-tolerant stream processing of live data streams.
• It is built on the Spark core engine, allowing seamless integration with other Spark components like
Spark SQL, MLlib, and GraphX.
• Key Features:
• Micro-batch Processing
• Ease of Use
• Integration with Ecosystem
• Fault Tolerance
• Scalability
• Windowed Operations



Challenges of Stream Processing
• Stream processing comes with its own set of challenges, especially when dealing with real-
time data. Some of the main challenges are:
 Latency vs. Throughput:
• Latency refers to the time delay in processing and delivering the results of a stream.
• Throughput is the amount of data processed in a given time.
• Balancing latency and throughput is a key challenge.
• Lower latency may lead to lower throughput, while optimizing for high throughput can increase latency.
 Fault Tolerance and Data Loss:
• Ensuring that no data is lost in case of node or network failures is critical.
• Implementing exactly-once semantics, where data is neither lost nor processed multiple times, is difficult
and often requires sophisticated mechanisms like checkpointing and write-ahead logs.
 Scalability:
• Stream processing systems must be able to scale horizontally to handle fluctuating data volumes.
• Dynamic scaling, where resources are allocated and deallocated automatically based on load, adds another
layer of complexity.



Challenges of Stream Processing
 Complex Event Processing:
• Detecting complex patterns or sequences of events in a data stream, like fraud detection or anomaly
detection, can be computationally intensive and requires sophisticated algorithms.

 Backpressure Handling:
• Backpressure occurs when the system can't handle incoming data fast enough, causing a bottleneck.
• Managing backpressure involves mechanisms like buffering, rate limiting, or dynamic resource allocation,
which add complexity.

 Integration and Compatibility:


• Integrating with various data sources (Kafka, Flume, etc.) and sinks (HDFS, databases) requires
understanding each component's behavior under load and failure conditions.
• Compatibility with different formats, protocols, and data models adds another layer of complexity.



Spark’s Streaming API
• The Spark Streaming API is based on DStreams (Discretized Streams), which is an abstraction
that represents a continuous stream of data, divided into small micro-batches.
• These micro-batches are processed like Resilient Distributed Datasets (RDDs) internally.

 Discretized Stream API (DStream)

• A high-level abstraction which represents a continuous stream of data.
• DStreams can be created either from input data streams from sources such as Kafka, Flume, and Kinesis, or by applying high-level operations on other DStreams.
• A DStream represents a continuous flow of data in the form of a sequence of RDDs.
• Each RDD in the sequence represents data collected over a time window (batch interval).
• Spark Streaming uses DStream as its fundamental abstraction for real-time processing.
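
For illustration, the classic DStream word count over a TCP socket (host, port, and batch interval are assumptions; requires a Spark version that still ships the DStream API):

  from pyspark import SparkContext
  from pyspark.streaming import StreamingContext

  sc = SparkContext(appName="NetworkWordCount")
  ssc = StreamingContext(sc, batchDuration=5)         # 5-second micro-batches

  lines = ssc.socketTextStream("localhost", 9999)     # DStream from a TCP source
  words = lines.flatMap(lambda line: line.split(" "))
  counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
  counts.pprint()                                     # print each batch's counts

  ssc.start()
  ssc.awaitTermination()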



Spark’s Streaming API
• Micro-Batch Processing: Spark Streaming processes data in small time intervals (batches).
• DStream Transformations: Similar to RDD operations, DStreams provide transformations like
map, flatMap, reduceByKey, window, etc.
• Stateful Operations: The updateStateByKey and mapWithState functions allow stateful stream
processing.
• Fault Tolerance: Data can be replayed in case of a failure by using checkpointing and reliable
data sources (like Kafka).
• Windowing Operations: You can define windows of data over which computations can be
performed, such as aggregating over the last 10 seconds of data.
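
A sketch of windowed and stateful operations, continuing the word-count DStream from the previous example (directory path and window sizes are assumptions):

  # Checkpointing is required for stateful and inverse-windowed operations
  ssc.checkpoint("/tmp/streaming-checkpoint")

  # Counts over the last 30 seconds, recomputed every 10 seconds
  windowed = (words.map(lambda w: (w, 1))
                   .reduceByKeyAndWindow(lambda a, b: a + b,   # add values entering the window
                                         lambda a, b: a - b,   # subtract values leaving the window
                                         windowDuration=30,
                                         slideDuration=10))

  # Running total per word across the whole stream
  def update_total(new_values, running_total):
      return sum(new_values) + (running_total or 0)

  totals = words.map(lambda w: (w, 1)).updateStateByKey(update_total)
  windowed.pprint()
  totals.pprint()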



Spark’s Streaming API
 Structured Streaming API
• Structured Streaming is a scalable and fault-tolerant stream processing engine built on the
Spark SQL engine.
• It operates in a declarative manner, allowing users to define a computation as a query on a
streaming DataFrame or Dataset, much like writing SQL queries on static data.
• Continuous Processing: Structured Streaming provides near real-time data processing by
continuously appending new data to an unbounded table.
• Event-Time Processing: Unlike the DStreams API, Structured Streaming supports event-time
processing and late data handling through watermarks.
• Exactly-Once Semantics: Structured Streaming guarantees end-to-end exactly-once fault
tolerance, which is especially important in stateful streaming applications.
• SQL Integration: Since Structured Streaming is built on the Spark SQL engine, it allows
seamless integration with SQL-based queries, aggregations, and joins.
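
A minimal Structured Streaming sketch of the same word count, written declaratively over a streaming DataFrame (host, port, and checkpoint path are assumptions):

  from pyspark.sql import SparkSession
  from pyspark.sql import functions as F

  spark = SparkSession.builder.appName("StructuredWordCount").getOrCreate()

  # Streaming DataFrame: an unbounded table that grows as new lines arrive
  lines = (spark.readStream
                .format("socket")
                .option("host", "localhost")
                .option("port", 9999)
                .load())

  words = lines.select(F.explode(F.split(lines.value, " ")).alias("word"))
  counts = words.groupBy("word").count()

  # Spark keeps the aggregation state and updates the result as new data arrives
  query = (counts.writeStream
                 .outputMode("complete")
                 .format("console")
                 .option("checkpointLocation", "/tmp/structured-checkpoint")
                 .start())
  query.awaitTermination()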



Spark’s Streaming API
Feature            | DStreams API                             | Structured Streaming API
Processing Model   | Micro-batch                              | Micro-batch and Continuous Processing
Abstraction        | DStreams (built on RDDs)                 | DataFrame/Dataset
Event Time Support | Limited                                  | Yes (event-time, watermarks)
Fault Tolerance    | Achieved via checkpointing               | Exactly-once end-to-end fault tolerance
Windowing Support  | Yes (using DStream windowing functions)  | Yes (more flexible and efficient)
Latency            | Higher due to batch processing           | Lower with Continuous Processing
API                | Functional API similar to RDDs           | Declarative API using DataFrame and SQL
State Management   | updateStateByKey, mapWithState           | Built-in, efficient state handling
Thank You.

