Big Data Analytics - Chapter 4
Big Data Modelling is the process of designing how large-scale data is structured, stored, and accessed. It handles:
• Large volumes of data
• High velocity (real-time streaming)
• Variety (structured, semi-structured, and unstructured data)
• Distributed storage systems (like Hadoop, NoSQL databases)
Key Components of Big Data Modelling:
Component | Description | Examples/Tools
1. Data Source Identification | Identifying the origin and type of data (structured, semi-structured, unstructured) | Databases, Logs, Social Media, Sensors
2. Schema Design | Defining how data is structured and interpreted | Schema-on-read (NoSQL), Schema-on-write (RDBMS)
3. Storage Technology | Choosing appropriate data storage based on scalability and data type | Hadoop HDFS, MongoDB, Cassandra, Data Lakes
4. Data Relationships | Understanding how data entities are related to each other | ER Diagrams, Graph Databases (Neo4j)
5. Query and Processing Needs | Planning how data will be queried and processed (real-time or batch) | Apache Spark, MapReduce, SQL, Hive, Kafka
6. Scalability & Performance | Ensuring the model can handle data growth and maintain performance | Sharding, Partitioning, Distributed Systems
7. Security & Governance | Defining rules for data access, privacy, and compliance | Role-based access, Encryption, Audit Logs
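To make the schema design component (row 2) concrete, here is a minimal Python sketch contrasting schema-on-write (structure enforced before storing, as in an RDBMS) with schema-on-read (raw records stored as-is and interpreted at query time, as in many NoSQL systems); the column names and helper functions are illustrative, not taken from any specific tool.

import json

# Schema-on-write: the structure is enforced *before* data is stored,
# so only records matching the expected columns are accepted (RDBMS style).
EXPECTED_COLUMNS = {"user_id", "event", "timestamp"}   # assumed example schema

def write_with_schema(store, record):
    if set(record) != EXPECTED_COLUMNS:
        raise ValueError(f"record does not match schema: {record}")
    store.append(record)

# Schema-on-read: raw text is stored as-is; structure is applied at query time (NoSQL style).
def read_with_schema(raw_lines):
    for line in raw_lines:
        doc = json.loads(line)                    # interpret the record only when reading
        yield doc.get("user_id"), doc.get("event")

table = []
write_with_schema(table, {"user_id": 1, "event": "login", "timestamp": "2024-01-01T00:00:00"})

raw_store = ['{"user_id": 2, "event": "click", "extra_field": "accepted as-is"}']
print(list(read_with_schema(raw_store)))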
Data Model Structures
In Big Data Analytics, data model structures define how data is organized, stored, and accessed across
large, distributed systems. These models must be scalable, flexible, and efficient to handle massive,
complex, and fast-changing data.
Data Model Structure | Description | Best Suited For | Examples/Tools
Key-Value Store | Stores data as a collection of key-value pairs. High-speed read/write operations. | Caching, session storage, user profiles | Redis, Riak, Amazon DynamoDB
Column-Oriented Model | Stores data in columns instead of rows. Optimized for analytical queries on large datasets. | Data warehousing, real-time analytics | Apache HBase, Cassandra, Amazon Redshift
Document Store | Stores data as documents (usually JSON or XML), allowing semi-structured formats. | Content management, product catalogs | MongoDB, CouchDB, Elasticsearch
Graph Model | Represents data as nodes and relationships (edges). Ideal for highly connected data. | Social networks, fraud detection, recommendation engines | Neo4j, JanusGraph, Amazon Neptune
Object-Oriented Model | Data stored as objects (as in OOP). Integrates well with application code. | Complex application development | db4o, ObjectDB
Time-Series Model | Optimized for data indexed over time intervals. | IoT, sensor data, monitoring, stock prices | InfluxDB, OpenTSDB, TimescaleDB
Multi-Model Databases | Support multiple data models within one system. | Flexibility across diverse data types | ArangoDB, OrientDB, Azure Cosmos DB
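To illustrate two of these structures, the Python sketch below shows the key-value and document-store access patterns, assuming a local Redis server and a local MongoDB instance are running; the key, database, collection, and field names are made up for illustration.

import json
import redis                       # key-value store client
from pymongo import MongoClient    # document store client

# Key-value store: one opaque value per key, optimized for very fast get/set.
kv = redis.Redis(host="localhost", port=6379)
kv.set("session:42", json.dumps({"user": "alice", "cart_items": 3}))
session = json.loads(kv.get("session:42"))

# Document store: semi-structured JSON-like documents, queryable by field.
client = MongoClient("mongodb://localhost:27017")
products = client["shop"]["products"]             # database and collection names are illustrative
products.insert_one({"sku": "A-100", "name": "Kettle", "tags": ["kitchen", "electric"]})
match = products.find_one({"tags": "kitchen"})    # query on a field inside the document

print(session, match["name"] if match else None)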
Data Model Operations
Data Model Operations refer to the actions or processes performed on data within a big data
environment. These operations enable the storage, retrieval, transformation, and analysis of large-scale
data across distributed systems.
Operation | Description | Purpose | Tools/Technologies
1. Data Ingestion | Collecting and importing data from multiple sources into a big data system. | To make raw data available for storage and analysis | Apache Kafka, Flume, Sqoop, NiFi
4. Data Transformation | Converting data into a suitable format or structure for analysis. | To clean, format, and enrich data | Apache Spark, Apache Beam, Talend
10. Data Visualization | Presenting analyzed data through dashboards, charts, or graphs. | To communicate insights and support decisions | Tableau, Power BI, Grafana, Kibana
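As one concrete example of the transformation step (operation 4 above), here is a rough PySpark sketch that takes a small in-memory sample (standing in for ingested data), cleans it, and aggregates it; the column names and sample records are invented.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("transform-sketch").getOrCreate()

# Ingestion stand-in: a tiny in-memory sample instead of a Kafka/Flume/Sqoop feed.
raw = spark.createDataFrame(
    [("2024-05-01", "electronics", 120.0),
     ("2024-05-01", "grocery", None),             # dirty record with a missing amount
     ("2024-05-02", "electronics", 80.5)],
    ["sale_date", "category", "amount"],
)

# Transformation: clean (drop incomplete rows), then aggregate for analysis.
clean = raw.dropna(subset=["amount"])
daily_totals = clean.groupBy("sale_date", "category").agg(F.sum("amount").alias("total_amount"))

# The result would normally feed a dashboard (Tableau, Power BI, Grafana, Kibana).
daily_totals.show()
spark.stop()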
Processing Workloads
A processing workload refers to the amount and type of data being processed within a defined
time period. Understanding workloads is essential for selecting the right tools and
infrastructure in Big Data Analytics.
Batch Processing: Processes large volumes of data at once (not in real time) and runs on a schedule or on demand.
Example: a monthly sales report.
Typical characteristics of batch workloads:
• High Latency: Because data is processed in bulk, there is a delay from data collection to result generation.
• Large Data Volumes: Batch jobs typically work on huge datasets, often covering hours, days, or months of data.
• Sequential Read/Writes: Data is accessed sequentially, allowing for efficient disk operations.
• Complex Queries: Involves multi-stage queries with joins, aggregations, and filters across large datasets.
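A minimal batch-style sketch in plain Python, matching the characteristics above: it assumes a hypothetical sales.csv file with date and amount columns, reads the whole dataset sequentially in one pass, and only produces the monthly report once everything has been processed.

import csv
from collections import defaultdict

def monthly_sales_report(path="sales.csv"):
    """Batch job: read the entire dataset, then emit aggregated results."""
    totals = defaultdict(float)
    with open(path, newline="") as f:              # sequential read of one large file
        for row in csv.DictReader(f):              # expected columns: date, amount (assumed)
            month = row["date"][:7]                # e.g. "2024-05" from "2024-05-13"
            totals[month] += float(row["amount"])
    # Results only become available once the whole file has been processed (high latency).
    return dict(totals)

if __name__ == "__main__":
    print(monthly_sales_report())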
Transaction Processing: Processes many small, individual operations (OLTP) as they occur, rather than in bulk.
Typical characteristics of transactional workloads:
• Write-Intensive: Most OLTP systems prioritize writing new data (e.g., creating transactions).
• Fewer Joins: Simpler queries focused on operational tasks, not complex analysis.
• High Availability: Designed to handle a large number of users simultaneously with high uptime.
• Data Consistency: Strong focus on ACID properties (Atomicity, Consistency, Isolation, Durability).
Binary Data Blocks:
• Represent streams of incoming data (e.g., from a web app, ATM, or POS terminal).
• These chunks are small and processed immediately instead of being grouped as in batch processing.
Typical record-level operations:
• INSERT: Adds a single new record (e.g., a new order or transaction).
• UPDATE: Modifies an existing record (e.g., updating the status of an order).
• SELECT: Retrieves specific, small pieces of data (e.g., checking an account balance).
• DELETE: Removes a record (e.g., deleting an expired session token).
DBMS (Database Management System):
• The central system that stores and manages all transactions.
• It is optimized for fast access, minimal latency, and high concurrency (i.e., handling many users at once).
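As a small illustration of these four operations, the sketch below uses Python's built-in sqlite3 module as a stand-in for a production transactional DBMS; the orders table and its columns are made up for the example.

import sqlite3

conn = sqlite3.connect(":memory:")     # stand-in for a transactional DBMS
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, amount REAL)")

# INSERT: adds a single new record (a new order).
conn.execute("INSERT INTO orders (id, status, amount) VALUES (?, ?, ?)", (1, "PLACED", 19.99))

# UPDATE: modifies an existing record (the order's status changes).
conn.execute("UPDATE orders SET status = ? WHERE id = ?", ("SHIPPED", 1))

# SELECT: retrieves a small, specific piece of data (one order's status).
status = conn.execute("SELECT status FROM orders WHERE id = ?", (1,)).fetchone()[0]

# DELETE: removes a record (e.g., a cancelled order).
conn.execute("DELETE FROM orders WHERE id = ?", (1,))

conn.commit()                          # ACID: changes become durable when the transaction commits
print(status)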
Difference between Batch and Transaction Processing
Feature | Transactional Workload | Batch Workload
Latency | Low; each record is processed immediately | High; results appear only after the full dataset is processed
Data Volume per Job | Small, individual records | Huge datasets covering hours, days, or months of data
Query Complexity | Simple queries with fewer joins, focused on operational tasks | Complex multi-stage queries with joins, aggregations, and filters
Read/Write Pattern | Write-intensive, small operations | Sequential reads/writes over large datasets
Consistency & Scheduling | Strong ACID consistency with high availability | Runs on a schedule or on demand
Realtime Big Data processing is commonly carried out in two ways:
1) Event Stream Processing (ESP): the real-time analysis of events as they arrive, in the order they happen. ESP uses simple rules and processes data in memory for speed; results can be shown live or used to trigger automatic actions. It prioritizes speed over analytical complexity.
2) Complex Event Processing (CEP): analyzes real-time events from different sources to find patterns and take action. CEP applies richer rules and logic for deeper analysis; it is more advanced than ESP and may be slower because it handles more complex tasks. It often uses the results of ESP as its input.
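A rough Python sketch of the difference, using an in-memory list as a stand-in for a live event stream (the event fields, users, and thresholds are invented): the ESP rule reacts to each event on its own, while the CEP rule correlates several events to detect a pattern.

from collections import deque

events = [
    {"type": "login_failed", "user": "alice"},
    {"type": "login_failed", "user": "alice"},
    {"type": "purchase", "user": "bob", "amount": 12000},
    {"type": "login_failed", "user": "alice"},
]

# ESP: a simple, stateless rule applied to each event as it arrives.
def esp_rule(event):
    if event["type"] == "purchase" and event.get("amount", 0) > 10000:
        print("ESP alert: large purchase by", event["user"])

# CEP: correlates several events to detect a pattern (three failed logins by the same user).
recent_failures = deque(maxlen=3)

def cep_rule(event):
    if event["type"] == "login_failed":
        recent_failures.append(event["user"])
        if len(recent_failures) == 3 and len(set(recent_failures)) == 1:
            print("CEP alert: possible brute-force attack on", event["user"])

for e in events:                       # events are processed in arrival order, in memory
    esp_rule(e)
    cep_rule(e)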
Realtime Big Data Processing and SCV:
In realtime Big Data processing, the SCV principle (Speed, Consistency, Volume) helps design
efficient systems. Both hard-realtime and near-realtime systems must handle high data volume
without losing data.
• In a hard-realtime system, speed is critical, so consistency may be reduced. Approximation or
sampling techniques are used to get fast results, but accuracy may suffer.
• In a near-realtime system, speed is slightly less critical, so the system can maintain
consistency by using the full dataset, leading to more accurate results.
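A small Python sketch of that trade-off (the synthetic readings and the 1% sample size are assumptions): the hard-realtime path answers quickly from a sample and accepts approximation, while the near-realtime path scans the full dataset for an exact, consistent answer.

import random

# Synthetic sensor readings; in practice this would be a high-volume stream.
measurements = [random.gauss(50, 10) for _ in range(1_000_000)]

# Hard-realtime style: answer fast from a ~1% sample, accepting some inaccuracy.
sample = random.sample(measurements, k=10_000)
approx_mean = sum(sample) / len(sample)

# Near-realtime style: use the full dataset; slower, but consistent and exact.
exact_mean = sum(measurements) / len(measurements)

print(f"approximate mean: {approx_mean:.2f}   exact mean: {exact_mean:.2f}")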
Realtime Big Data Processing and MapReduce:
MapReduce is not suitable for real-time Big Data processing because it works in batch mode,
needs the complete dataset upfront, and involves high processing overhead and latency. It waits
for all map tasks to finish before reduce tasks begin, making it too slow for real-time needs.
The following techniques can help in near-realtime scenarios:
• Using in-memory storage.
• Running micro-batch jobs at short intervals.
• Combining pre-computed views with fresh streaming data.
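A toy Python sketch of the last idea (all names, values, and the 5-second interval are illustrative): a pre-computed batch view is merged with counts from short micro-batches of fresh events to serve near-real-time totals.

import time
from collections import Counter

# Pre-computed view (e.g., produced by a nightly batch job).
batch_view = Counter({"page_a": 10_000, "page_b": 7_500})

def read_new_events():
    """Stand-in for a stream consumer; returns events that arrived since the last call."""
    return ["page_a", "page_a", "page_b"]          # hypothetical fresh page views

def micro_batch_loop(iterations=3, interval=5):
    speed_view = Counter()
    for _ in range(iterations):                    # in production this loop would run continuously
        speed_view.update(read_new_events())       # small, in-memory micro-batch
        combined = batch_view + speed_view         # merge pre-computed view with fresh data
        print("near-real-time totals:", dict(combined))
        time.sleep(interval)                       # short interval instead of one huge batch job

micro_batch_loop()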
END OF CHAPTER 4
THANK YOU