Topic 1:: Spark Structured Streaming
Spark Structured Streaming is the part of Apache Spark that processes and analyses real-time
data streams. Think of it as a tool that takes live data (like tweets, website clicks, or sensor
readings) and processes it to provide useful insights, like counting events or detecting patterns.
Under the hood it handles streaming data efficiently as a series of small, fast micro-batches.
In Structured Streaming, "sliding window analytics" means analysing data within overlapping time
periods. Instead of dividing data into separate, non-overlapping chunks, a sliding window
looks at data from overlapping intervals. You set a window size (how much data to analyse)
and a slide interval (how often to update), so you can continuously track trends and
calculate results as new data comes in.
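The overlapping-window idea can be sketched without a cluster. The helper below is a plain-Python illustration (not Spark's API; the function name and timestamps are hypothetical) of how a window size and slide interval assign each event to several overlapping windows:

```python
from collections import defaultdict

def sliding_window_counts(events, window_size, slide):
    """Count timestamped events in overlapping windows.

    events: list of (timestamp_seconds, value) pairs
    window_size, slide: in seconds; each event lands in
    window_size // slide overlapping windows.
    """
    counts = defaultdict(int)
    for ts, _ in events:
        # The latest window containing ts starts at the slide boundary
        # at or before ts; earlier windows start every `slide` seconds
        # back, as long as the window still covers ts.
        start = (ts // slide) * slide
        while start > ts - window_size:
            counts[start] += 1
            start -= slide
    return dict(counts)

# Three events, 5-minute (300 s) windows sliding every 1 minute (60 s):
# each event is counted in 300 // 60 = 5 overlapping windows
events = [(310, "a"), (320, "b"), (370, "c")]
print(sliding_window_counts(events, window_size=300, slide=60))
```

Note that an event is counted once per window it falls into, so totals across windows exceed the raw event count by design.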
Use the window function within a Spark SQL query to define a sliding window with a specified
duration and slide interval.
When using the window function, you specify the time column to group by, the window
duration, and the slide interval.
Aggregations on windows:
Once you've defined the sliding window, you can apply aggregations such as sum, average, and
count to calculate values within each overlapping window.
Example scenario:
Monitoring website traffic: You could use a 5-minute sliding window with a 1-minute
slide interval to analyse website hits over a continuous period, capturing changes in
traffic volume as new data arrives.
Topic 2:: The CAP Theorem
The CAP theorem, or Consistency, Availability, and Partition tolerance theorem, describes
the trade-offs between these three properties in distributed systems. It states that it's not
possible to guarantee all three properties at the same time.
Consistency: Every read receives the most recent write, or an error
Availability: Every read receives a response, but it might not contain the most recent data
Partition tolerance: The system can continue operating even if there's a network
fault that splits the system into partitions
When a partition occurs, the system must choose between consistency and
availability
Systems can prioritize availability and partition tolerance, accepting temporary data
inconsistency to ensure the system remains operational
Topic 3:: Amazon DynamoDB
Amazon DynamoDB is a NoSQL database that uses a key-value storage model. It's a fully
managed database service from Amazon Web Services (AWS).
Key features
Scalability: DynamoDB is serverless and can scale to zero. It also has auto-scaling,
which automatically adjusts throughput capacity based on traffic demands.
Data models: DynamoDB supports both key-value and document data models.
Primary keys: DynamoDB uses primary keys to uniquely identify each item in a table.
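The primary-key lookup can be sketched as an in-memory dictionary. This is an illustration of the key-value model only, not the real boto3 API; the key format and attribute names are hypothetical. A composite primary key pairs a partition key with a sort key to uniquely identify each item:

```python
# Toy key-value table: items addressed by (partition_key, sort_key),
# mirroring DynamoDB's composite primary key
table = {}

def put_item(pk, sk, attributes):
    table[(pk, sk)] = attributes

def get_item(pk, sk):
    # Returns None when no item has this primary key
    return table.get((pk, sk))

put_item("user#42", "order#2024-01-01", {"total": 19.99})
put_item("user#42", "order#2024-02-01", {"total": 5.00})

print(get_item("user#42", "order#2024-01-01"))  # {'total': 19.99}
```

Because every item is addressed by its full primary key, reads are constant-time lookups rather than scans, which is what lets DynamoDB scale throughput with demand.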
Topic 4:: Apache Cassandra
Apache Cassandra is a distributed, open-source database that uses a simple data model to
store structured data. It was originally developed at Facebook and released in 2008.
Data model
Cassandra's data model is simple and flexible, with dynamic control over data layout
and format
The partition key (the first part of the primary key) determines how data is partitioned
across nodes, allowing for partial or full fetches of a partition's rows
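How a partition key places data on a node can be sketched as follows. This is a simplification under stated assumptions: real Cassandra hashes the partition key to a Murmur3 token and assigns token ranges around a ring, whereas this sketch uses an MD5 hash and a modulo over a hypothetical three-node cluster, purely for illustration:

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical cluster

def owner(partition_key):
    # Hash the partition key to a token, then map the token to a node.
    # Real Cassandra assigns token *ranges* on a ring; modulo is a
    # simplification of that placement step.
    token = int(hashlib.md5(partition_key.encode()).hexdigest(), 16)
    return NODES[token % len(NODES)]

# All rows sharing a partition key hash to the same token and therefore
# live on the same node, which is what makes fetching a whole partition
# (or a slice of it) efficient
print(owner("sensor-17"))
```

Deterministic placement also means any coordinator node can compute which replica owns a key without a central lookup.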