
Topic 1: Spark Structured Streaming

What is Spark Streaming?

Spark Streaming is the part of Apache Spark that processes and analyses real-time data streams; Structured Streaming is its newer, DataFrame-based API. Think of it as a tool that takes live data (like tweets, website clicks, or sensor readings) and processes it to provide useful insights, such as counting events or detecting patterns. Under the hood it works in small, fast micro-batches to handle streaming data efficiently.

In Spark Streaming, "sliding window analytics" means analysing data within overlapping time
periods. Instead of dividing data into separate, non-overlapping chunks, a sliding window
looks at data from overlapping intervals. You set a window size (how much data to analyse)
and a slide interval (how often to update), so you can continuously track trends and
calculate results as new data comes in.

How to implement sliding window analysis in Spark Streaming:

• Spark Structured Streaming:

Use the window function within a Spark SQL query to define a sliding window with a specified duration and slide interval.

• Defining window parameters:

When using the window function, you specify the time column to group by, the window duration, and the slide interval.

• Aggregations on windows:

Once you've defined the sliding window, you can apply aggregations like sum, average, count, etc., to calculate values within each overlapping window.

Example scenario:

• Monitoring website traffic: You could use a 5-minute sliding window with a 1-minute slide interval to analyse website hits over a continuous period, capturing changes in traffic volume as new data arrives.

Topic 2:
The CAP theorem, or Consistency, Availability, and Partition tolerance theorem, describes
the trade-offs between these three properties in distributed systems. It states that it's not
possible to guarantee all three properties at the same time.

The three properties of the CAP theorem

• Consistency: Every read receives the most recent write or an error

• Availability: Every request receives a (non-error) response, though it may not reflect the most recent write

• Partition tolerance: The system can continue operating even if a network fault splits it into partitions

How the CAP theorem works

• Distributed systems can guarantee at most two of the three properties at once

• When a partition occurs, the system must choose between consistency and availability

• Many systems prioritize availability and partition tolerance, accepting temporary data inconsistency to ensure the system remains operational
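The choice forced by a partition can be sketched with a toy replica that, when cut off from its peer, either refuses to answer (consistent but unavailable, "CP") or answers from possibly stale local state (available but inconsistent, "AP"). This is a made-up illustration, not a real replication protocol:

```python
class Replica:
    """Toy replica illustrating the C-vs-A choice during a partition."""

    def __init__(self, mode):
        self.mode = mode          # "CP" or "AP"
        self.value = None
        self.partitioned = False  # True while the network fault lasts

    def replicate(self, value):
        # Replication only reaches this replica when the network is whole.
        if not self.partitioned:
            self.value = value

    def read(self):
        if self.partitioned and self.mode == "CP":
            # CP: refuse a possibly stale answer (gives up availability).
            raise ConnectionError("partitioned: refusing possibly stale read")
        # AP: always answer, even if stale (gives up consistency).
        return self.value

cp, ap = Replica("CP"), Replica("AP")
for r in (cp, ap):
    r.replicate("v1")
    r.partitioned = True
    r.replicate("v2")   # lost: the partition blocks replication
# Now ap.read() returns the stale "v1", while cp.read() raises an error.
```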

Topic 3:
Amazon DynamoDB is a fully managed NoSQL database service from Amazon Web Services (AWS) that stores data using a key-value model.

Key features

• Scalability: DynamoDB is serverless and, in on-demand mode, can scale to zero. It also has auto-scaling, which automatically adjusts throughput capacity based on traffic demands.

• Data models: DynamoDB supports both key-value and document data models.

• Global tables: DynamoDB's global tables are multi-region databases that automatically replicate data across different AWS regions.

• Secondary indexes: DynamoDB uses secondary indexes to provide more querying flexibility.

• Primary keys: DynamoDB uses primary keys to uniquely identify each item in a table.

Data types

• DynamoDB's scalar data types are number, string, binary, Boolean, and null.

• It also supports document types (nested lists and maps, i.e. JSON-like structures) and set types; formats such as XML or HTML can be stored as strings.
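A DynamoDB primary key is either a partition key alone or a partition key plus a sort key; real code would use boto3's put_item/get_item/query on a Table resource. The following toy in-memory table (entirely hypothetical names and data) illustrates how the composite key shapes access patterns:

```python
class ToyTable:
    """Toy model of a DynamoDB table keyed by (partition key, sort key)."""

    def __init__(self):
        self.items = {}   # (pk, sk) -> item

    def put_item(self, pk, sk, **attrs):
        self.items[(pk, sk)] = {"pk": pk, "sk": sk, **attrs}

    def get_item(self, pk, sk):
        # Point lookup needs the full primary key, as in DynamoDB.
        return self.items.get((pk, sk))

    def query(self, pk):
        # Like DynamoDB's Query: all items sharing one partition key,
        # returned in sort-key order.
        return sorted(
            (item for (p, _), item in self.items.items() if p == pk),
            key=lambda item: item["sk"],
        )

orders = ToyTable()
orders.put_item("user#1", "order#2024-02", total=30)
orders.put_item("user#1", "order#2024-01", total=10)
orders.put_item("user#2", "order#2024-01", total=99)
# query("user#1") returns both of user#1's orders, oldest sort key first.
```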

Topic 4:
Apache Cassandra is a distributed, open-source database that uses a simple data model to
store structured data. It was originally developed at Facebook and released in 2008.

Data model

• Cassandra's data model is simple and flexible, with dynamic control over data layout and format

• It stores data as rows organized into tables (also called column families)

• Each row is identified by a primary key value

• The primary key partitions data, allowing for partial or full data fetches
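The partial-versus-full fetch point above can be sketched in a few lines: rows are grouped into partitions by partition key and ordered by a clustering key within each partition. This is a hypothetical illustration of the storage model, not real Cassandra code (which would use CQL):

```python
from collections import defaultdict

# partition key -> {clustering key: row}
partitions = defaultdict(dict)

def insert(pk, ck, **cols):
    """Store a row under its partition key and clustering key."""
    partitions[pk][ck] = {"pk": pk, "ck": ck, **cols}

def fetch_partition(pk):
    """Full fetch: every row in one partition, in clustering order."""
    return [partitions[pk][ck] for ck in sorted(partitions[pk])]

def fetch_row(pk, ck):
    """Partial fetch: a single row by its full primary key (pk, ck)."""
    return partitions[pk].get(ck)

insert("sensor-1", "2024-01-01", temp=20)
insert("sensor-1", "2024-01-02", temp=22)
insert("sensor-2", "2024-01-01", temp=18)
# fetch_partition("sensor-1") returns sensor-1's two readings in order.
```

Because a partition lives together on disk, fetching one partition is cheap, which is why Cassandra data modeling starts from the queries you need rather than from normalized tables.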
