Big Data 3rd Unit
1. Definition:
o A data stream is a continuous, rapid, and unbounded sequence of data
generated in real-time. Examples include sensor data, social media updates,
stock prices, and clickstreams.
o Unlike traditional databases, data streams cannot be stored in their entirety due
to their large volume and velocity.
2. Key Characteristics of Data Streams:
o Unbounded: Data arrives continuously with no fixed size.
o Time-sensitive: Processing must occur quickly to derive actionable insights.
o Dynamic and Evolving: Data patterns can change over time.
o Imperfect Information: Due to real-time processing, the data may not always
be complete or fully accurate.
Stream Computing
1. Sampling Data in a Stream
Definition:
o Selecting a representative subset of the data stream for analysis.
Techniques:
1. Reservoir Sampling:
Maintains a fixed-size sample from the stream by replacing elements
with a decreasing probability as the stream grows.
2. Random Sampling:
Randomly selects items from the stream based on a predefined
probability.
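The two techniques above can be sketched in a few lines of Python. Reservoir sampling keeps every item seen so far in the sample with equal probability k/n. This is a minimal sketch; the function name and parameters are illustrative:

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    sample = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            sample.append(item)        # fill the reservoir with the first k items
        else:
            j = random.randrange(n)    # new item replaces a sample slot with probability k/n
            if j < k:
                sample[j] = item
    return sample
```

For example, `reservoir_sample(range(10_000), 5)` returns 5 items drawn uniformly from the whole range without ever storing all 10,000.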
2. Filtering Streams
Definition:
o Extracting only the relevant data from the stream while discarding irrelevant
data.
Methods:
o Use conditional operators (e.g., price > 100) to filter data.
o Example:
From a social media stream, extract only posts containing the word
“urgent.”
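The "urgent" example above can be written as a small generator that passes matching posts through and discards the rest. A minimal sketch (names and sample posts are illustrative):

```python
def filter_stream(posts, keyword="urgent"):
    """Yield only the posts that mention the keyword, case-insensitively."""
    for post in posts:
        if keyword.lower() in post.lower():
            yield post

stream = ["URGENT: server down", "lunch plans?", "Fix needed, urgent!"]
matches = list(filter_stream(stream))  # keeps the first and third posts
```

Because it is a generator, the filter processes one post at a time and never needs the whole stream in memory.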
3. Counting Distinct Elements in a Stream
Problem:
o Determining the number of unique elements in a data stream (e.g., distinct IP
addresses).
Solution:
o Use probabilistic algorithms:
HyperLogLog:
Approximates the number of distinct elements by hashing each
item and keeping only small summary statistics of the hashes,
so memory stays fixed no matter how large the stream grows.
Bloom Filters:
Efficiently tests whether an element is part of a set.
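As a concrete illustration of the Bloom filter idea, the sketch below is a simplified teaching version (the bit-array size, number of hashes, and use of salted SHA-256 digests are assumptions, not a production design). It can say "definitely not seen" with certainty, but "seen" only probabilistically:

```python
import hashlib

class BloomFilter:
    """Probabilistic set-membership test: false positives are possible,
    false negatives are not."""
    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive several bit positions from salted hashes of the item.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        return all(self.bits[pos] for pos in self._positions(item))
```

For a stream of IP addresses, `add` marks each address seen, and `might_contain` answers membership queries in constant space.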
4. Estimating Moments
Definition:
o Moments are statistical properties of data streams, such as mean, variance, or
skewness.
Method:
o Use incremental algorithms to compute moments without storing the entire
stream.
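One standard incremental method for the first two moments is Welford's algorithm, which maintains the running mean and (population) variance with just three numbers, never storing the stream itself. A minimal sketch:

```python
class RunningMoments:
    """Welford's algorithm: update mean and variance one item at a time."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Population variance; divide by (n - 1) instead for the sample variance.
        return self.m2 / self.n if self.n else 0.0
```

Each `update` costs O(1) time and O(1) memory, which is exactly what stream processing requires.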
5. Counting Distinct Elements in a Window
Problem:
o Track the number of unique items within a sliding window of time or events.
Solution:
o Use sliding window techniques combined with hash-based counting.
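For windows small enough to hold in memory, the sliding-window-plus-hash-counting idea above can be implemented exactly with a deque and a hash-based counter. A minimal sketch (the class name is illustrative):

```python
from collections import Counter, deque

class WindowDistinctCounter:
    """Exact distinct count over the last `window` items."""
    def __init__(self, window):
        self.window = window
        self.items = deque()       # items currently inside the window, in order
        self.counts = Counter()    # hash-based count of each item in the window

    def add(self, item):
        self.items.append(item)
        self.counts[item] += 1
        if len(self.items) > self.window:
            old = self.items.popleft()     # expire the oldest item
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]

    def distinct(self):
        return len(self.counts)
```

For very large windows, approximate sketches (such as HyperLogLog variants) replace the exact counter, trading accuracy for memory.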
6. Decaying Window
Definition:
o Assigns more weight to recent data while gradually discounting older data.
Applications:
o Useful in scenarios where recent trends are more important, such as fraud
detection.
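A decaying-window count can be maintained with a single multiplication per event: each arrival first scales the old total down by (1 - c), so older events fade geometrically. A minimal sketch with an illustrative decay constant:

```python
class DecayingCounter:
    """Exponentially decaying count: recent events dominate the total."""
    def __init__(self, decay=0.01):
        self.decay = decay   # fraction of the old total forgotten per event
        self.total = 0.0

    def add(self, value=1.0):
        # Shrink the existing total, then add the new contribution.
        self.total = self.total * (1.0 - self.decay) + value
```

In a fraud-detection setting, a sudden burst of suspicious events raises `total` sharply even if the long-run history is quiet.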
Real-Time Analytics Platforms (RTAP)
1. Definition:
o RTAP refers to systems and tools designed for real-time data stream
processing and analytics.
2. Examples:
o Apache Kafka, Spark Streaming, Flink, and Google Cloud Dataflow.
Stock Market Prediction
Definition:
o Analyzing stock price streams to predict future movements and trends.
Case Studies
Problem:
o Track customer sentiment about a brand during a marketing campaign.
Solution:
Problem:
o Predict stock price movements based on live trading data.
Solution:
1. Use a real-time data feed to collect trading data.
2. Implement predictive models using tools like TensorFlow or PyTorch.
3. Execute trades automatically based on predictions.
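The three steps above hinge on a predictive model. As a hypothetical stand-in for the TensorFlow or PyTorch models mentioned, a toy moving-average crossover shows the shape of the predict-then-trade decision logic (thresholds and window lengths are illustrative, not trading advice):

```python
def trade_signal(prices, short_n=3, long_n=5):
    """Emit "buy" when the short-term average rises above the long-term
    average, "sell" when it falls below, and "hold" otherwise."""
    if len(prices) < long_n:
        return "hold"  # not enough history yet
    short_avg = sum(prices[-short_n:]) / short_n
    long_avg = sum(prices[-long_n:]) / long_n
    if short_avg > long_avg:
        return "buy"
    if short_avg < long_avg:
        return "sell"
    return "hold"
```

In a real pipeline this function would be replaced by the trained model, with the same interface: recent data in, an actionable signal out.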
The stream data model represents a framework for managing and querying continuous data
streams.
Key Concepts
The architecture enables real-time data ingestion, processing, and analysis. It consists of the
following components:
2. Components of RTAP
3. Applications of RTAP
Description: RTAP enables monitoring and analyzing IoT device data in real time.
Applications:
o Smart home automation.
o Predictive maintenance of machinery.
o Environmental monitoring (e.g., weather or pollution sensors).
Example: Detecting anomalies in factory equipment to prevent breakdowns.
Description: Analyzing social media feeds in real time to capture trends or public
sentiment.
Applications:
o Tracking viral content.
o Monitoring brand sentiment.
Example: Identifying trending hashtags for marketing campaigns.
3.9. Telecommunications
4. Benefits of RTAP
2.1. Use Case Example: Real-Time Twitter Sentiment Analysis
Context: A company wants to monitor the public perception of its brand on Twitter
in real time, especially to track product launches, marketing campaigns, or customer
satisfaction.
o Data Ingestion: Data is continuously collected from Twitter using APIs (like
Tweepy) or platforms like Kafka for real-time streaming.
o Data Processing:
The streaming data is processed using NLP techniques to identify
keywords, hashtags, and mentions of the brand.
Sentiment analysis models (trained on labeled data) classify each tweet
as positive, negative, or neutral.
o Real-Time Analytics: The results are immediately reflected on dashboards,
showing trends and sentiment scores, helping the company assess the impact
of its marketing campaigns.
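The sentiment-classification step above assumes a trained model. As a toy stand-in, a keyword-count classifier shows the input/output contract such a model would satisfy (the word lists are illustrative, not a real sentiment lexicon):

```python
POSITIVE = {"love", "great", "excellent", "happy"}
NEGATIVE = {"hate", "broken", "terrible", "worst"}

def classify(tweet):
    """Label a tweet by counting positive vs. negative keyword hits."""
    words = set(tweet.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

A production system would swap this function for a trained model (as the notes describe), precisely because keyword matching misses the sarcasm, slang, and emoji listed under Challenges below.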
Challenges:
o Ambiguity in language: Tweets often contain sarcasm, slang, or emojis,
which complicates sentiment analysis.
o Volume: The volume of tweets can be overwhelming, necessitating scalable
infrastructure.
o Latency: Ensuring near-zero latency to provide real-time feedback.
Outcome: The company can react to customer concerns, respond to negative
sentiment, or capitalize on positive sentiment immediately, optimizing marketing
efforts.
2.2. Use Case Example: Predicting Stock Prices Using News and Historical Data
Data Stream: Real-time stock price data (via APIs), financial news (via web
scraping, RSS feeds), or social media sentiment.
Processing Layer:
o Historical stock data analysis using machine learning models (e.g., LSTM,
ARIMA).
o Sentiment analysis of financial news using NLP tools like NLTK or TextBlob.
o Real-time processing using Apache Kafka for data streaming and Apache
Flink for stream analytics.
Machine Learning Model: A combination of time-series models (like ARIMA or
LSTM) and sentiment-based models.
Visualization: Real-time dashboards with predicted stock trends, implemented with
tools like Tableau or Power BI.
Timely Decision Making: Both systems allow businesses and investors to make
decisions based on real-time data, improving responsiveness.
Market Insight: Sentiment analysis provides insights into public perception, which
can influence stock prices or brand reputation.
Predictive Power: Stock market prediction models offer the potential to anticipate
market movements, giving traders a competitive edge.