Big Data Analytics - Chapter 4

Chapter 4 discusses Big Data Modelling, focusing on the design and organization of data for efficient storage and analysis. It covers key components such as data sources, schema design, storage technology, and processing needs, along with operations like data ingestion, storage, retrieval, and transformation. The chapter also differentiates between batch and transactional processing, highlighting their characteristics and use cases.

UNIT – II

Chapter 4: BIG DATA MODELLING
• Data Model Structures
• Data Model Operations
• Processing Workloads
• Processing in Batch Mode
• Processing in Real-time Mode
Big Data Modelling
Big Data Modelling is the process of designing the structure and organization of data so that it can be efficiently stored, processed, and analyzed using big data technologies.

It handles:
• Large volumes of data
• High velocity (real-time streaming)
• Variety (structured, semi-structured, and unstructured data)
• Distributed storage systems (like Hadoop, NoSQL databases)
Key Components of Big Data Modelling:
1. Data Source Identification: Identifying the origin and type of data (structured, semi-structured, unstructured). Examples/Tools: Databases, Logs, Social Media, Sensors
2. Schema Design: Defining how data is structured and interpreted (see the sketch after this list). Examples/Tools: Schema-on-read (NoSQL), Schema-on-write (RDBMS)
3. Storage Technology: Choosing appropriate data storage based on scalability and data type. Examples/Tools: Hadoop HDFS, MongoDB, Cassandra, Data Lakes
4. Data Relationships: Understanding how data entities are related to each other. Examples/Tools: ER Diagrams, Graph Databases (Neo4j)
5. Query and Processing Needs: Planning how data will be queried and processed (real-time or batch). Examples/Tools: Apache Spark, MapReduce, SQL, Hive, Kafka
6. Scalability & Performance: Ensuring the model can handle data growth and maintain performance. Examples/Tools: Sharding, Partitioning, Distributed Systems
7. Security & Governance: Defining rules for data access, privacy, and compliance. Examples/Tools: Role-based access, Encryption, Audit Logs
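To make item 2 (schema design) concrete, here is a minimal Python sketch contrasting schema-on-write with schema-on-read. It uses only the standard library, and the table name, sample records, and field names are hypothetical.

# Minimal illustration of schema-on-write vs. schema-on-read.
import json
import sqlite3

# Schema-on-write (RDBMS style): the structure is declared before any data is
# stored, and every inserted row must match the declared columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id TEXT, action TEXT, amount REAL)")
conn.execute("INSERT INTO events VALUES (?, ?, ?)", ("u1", "purchase", 49.99))

# Schema-on-read (NoSQL / data-lake style): raw records are kept as-is and the
# structure is interpreted only at query time, so records may differ in shape.
raw_records = [
    '{"user_id": "u1", "action": "purchase", "amount": 49.99}',
    '{"user_id": "u2", "action": "click", "page": "/home"}',   # different fields
]
parsed = [json.loads(line) for line in raw_records]
purchases = [r for r in parsed if r.get("action") == "purchase"]
print(purchases)

The point of the sketch is the trade-off: schema-on-write catches bad data early, while schema-on-read keeps ingestion flexible and defers interpretation to analysis time.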
Data Model Structures
In Big Data Analytics, data model structures define how data is organized, stored, and accessed across
large, distributed systems. These models must be scalable, flexible, and efficient to handle massive,
complex, and fast-changing data.
• Key-Value Store: Stores data as a collection of key-value pairs; supports high-speed read/write operations. Best suited for: caching, session storage, user profiles. Examples/Tools: Redis, Riak, Amazon DynamoDB
• Column-Oriented Model: Stores data in columns instead of rows; optimized for analytical queries on large datasets. Best suited for: data warehousing, real-time analytics. Examples/Tools: Apache HBase, Cassandra, Amazon Redshift
• Document Store: Stores data as documents (usually JSON or XML), allowing semi-structured formats. Best suited for: content management, product catalogs. Examples/Tools: MongoDB, CouchDB, Elasticsearch
• Graph Model: Represents data as nodes and relationships (edges); ideal for highly connected data (see the sketch after this list). Best suited for: social networks, fraud detection, recommendation engines. Examples/Tools: Neo4j, JanusGraph, Amazon Neptune
• Object-Oriented Model: Data stored as objects (as used in OOP); integrates well with application code. Best suited for: complex application development. Examples/Tools: db4o, ObjectDB
• Time-Series Model: Optimized for data indexed over time intervals. Best suited for: IoT, sensor data, monitoring, stock prices. Examples/Tools: InfluxDB, OpenTSDB, TimescaleDB
• Multi-Model Databases: Support multiple data models within one system. Best suited for: flexibility across diverse data types. Examples/Tools: ArangoDB, OrientDB, Azure Cosmos DB
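As a toy illustration of why the graph model fits highly connected data (the kind of query engines such as Neo4j answer natively), the Python sketch below stores relationships as an adjacency list and computes friend-of-friend recommendations; the people and edges are made up.

# Minimal sketch of the graph data model using a plain adjacency list.
from collections import defaultdict

edges = [("alice", "bob"), ("bob", "carol"), ("alice", "dave"), ("dave", "carol")]

graph = defaultdict(set)
for a, b in edges:           # undirected "friend" relationship: store both directions
    graph[a].add(b)
    graph[b].add(a)

def friend_suggestions(user):
    """Recommend friends-of-friends the user is not already connected to."""
    direct = graph[user]
    candidates = set()
    for friend in direct:
        candidates |= graph[friend]
    return candidates - direct - {user}

print(friend_suggestions("alice"))   # {'carol'}

A dedicated graph database performs the same traversal, but with indexing, a query language, and persistence, so it stays fast when the graph has billions of edges.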
Data Model Operations
Data Model Operations refer to the actions or processes performed on data within a big data
environment. These operations enable the storage, retrieval, transformation, and analysis of large-scale
data across distributed systems.
1. Data Ingestion: Collecting and importing data from multiple sources into a big data system. Purpose: to make raw data available for storage and analysis. Tools/Technologies: Apache Kafka, Flume, Sqoop, NiFi
2. Data Storage: Persisting data in distributed systems with appropriate structure and format. Purpose: to store data efficiently in a scalable manner. Tools/Technologies: HDFS, Cassandra, MongoDB, Amazon S3
3. Data Retrieval: Accessing or querying data from storage for processing or analysis. Purpose: to read data based on user queries or program logic. Tools/Technologies: Apache Hive, Pig, SQL-on-Hadoop
4. Data Transformation: Converting data into a suitable format or structure for analysis. Purpose: to clean, format, and enrich data. Tools/Technologies: Apache Spark, Apache Beam, Talend
5. Data Aggregation: Summarizing and combining data for reporting and analysis. Purpose: to generate insights and KPIs (see the sketch after this list). Tools/Technologies: Apache Hive, Spark SQL, Presto
6. Data Indexing: Creating indexes to improve the speed of data access and queries. Purpose: to enable faster query performance. Tools/Technologies: Elasticsearch, Solr, HBase
7. Data Querying: Running analytical or operational queries on large datasets. Purpose: to explore and analyze data. Tools/Technologies: HiveQL, Spark SQL, Drill, Impala
8. Data Updating: Modifying existing data in big data systems (less common in traditional Hadoop). Purpose: to correct or refine existing records. Tools/Technologies: HBase, Delta Lake, Apache Iceberg
9. Data Deletion: Removing outdated or irrelevant data from storage systems. Purpose: data lifecycle management and optimization. Tools/Technologies: HDFS commands, NoSQL APIs, retention policies
10. Data Visualization: Presenting analyzed data through dashboards, charts, or graphs. Purpose: to communicate insights and support decisions. Tools/Technologies: Tableau, Power BI, Grafana, Kibana
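As an illustrative sketch of operations 4 and 5 (transformation and aggregation) with one of the tools named above, the snippet below runs a tiny PySpark job; it assumes a local PySpark installation, and the application name, sample rows, and column names are invented.

# A minimal PySpark sketch of transformation (cleaning) and aggregation.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("ops-sketch").master("local[*]").getOrCreate()

rows = [
    ("2024-01-01", "electronics", 120.0),
    ("2024-01-01", "grocery", None),       # a dirty record to be cleaned out
    ("2024-01-02", "electronics", 80.0),
]
df = spark.createDataFrame(rows, ["day", "category", "amount"])

# Transformation: drop records with missing amounts.
clean = df.filter(F.col("amount").isNotNull())

# Aggregation: total sales per category (a simple KPI).
totals = clean.groupBy("category").agg(F.sum("amount").alias("total_sales"))
totals.show()

spark.stop()

In a production pipeline the same two steps would read from a distributed store such as HDFS or S3 instead of an in-memory list.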
Processing Workloads
A processing workload refers to the amount and type of data being processed within a defined
time period. Understanding workloads is essential for selecting the right tools and
infrastructure in Big Data Analytics.

There are two main types of processing workloads:

Batch Processing: Processes large volumes of data at once (not in real time), running on a schedule or on demand.
Ex: Monthly sales report

Transactional Processing: Deals with real-time or near real-time data changes, ensuring immediate updates and consistency.
Ex: Real-time fraud detection
Batch Processing
Batch processing, also known as offline processing, refers to the method of processing large volumes
of data in one go, instead of handling data in real time. In this mode, data is collected, stored, and
processed at scheduled intervals (e.g., hourly, daily, or weekly), rather than instantly as it arrives.
Use cases
Business Intelligence (BI) Reports:
• Monthly or quarterly sales performance
• Customer segmentation analysis
ETL Pipelines (Extract, Transform, Load):
• Moving and transforming data from operational databases to data warehouses
Data Warehousing and OLAP (Online Analytical Processing):
• Complex OLAP cubes built for multidimensional analysis
Log Processing:
• Collecting server logs and analyzing them in bulk to find trends or issues
Financial Auditing:
• Batch jobs analyzing large sets of transaction logs for compliance checks
Characteristics of Batch Processing
• High Latency: Because data is processed in bulk, there is a delay from data collection to result generation.
• Large Data Volumes: Batch jobs typically work on huge datasets, often covering hours, days, or months of data.
• Sequential Read/Writes: Data is accessed in a sequential manner, allowing for efficient disk operations.
• Complex Queries: Involves multi-stage queries with joins, aggregations, and filters across large datasets.
• Read-Intensive: Often more reading than writing, especially in analytics workloads.
• Offline Execution: Not executed instantly; relies on scheduled triggers like cron jobs or workflows.
• Data Blocks (Binary Data Representation): The blocks of binary data (1010110...1101) represent large datasets that need to be processed. These datasets are grouped together for batch execution instead of being processed individually in real time.
• INSERT: Bulk data is inserted into the database at once.
• SELECT: Large queries retrieve and analyze data in bulk.
• UPDATE: Multiple records are updated together in a scheduled process.
• DELETE: Bulk deletion of outdated or redundant data to optimize storage.
• DBMS (Database Management System): Acts as a central data repository where batch processes execute, handling large-scale transactions in bulk and ensuring consistency and efficiency. (A bulk-operation sketch follows this list.)
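To make the bulk INSERT/SELECT/UPDATE/DELETE idea concrete, here is a hedged standard-library sketch; the sales table, values, and thresholds are hypothetical, with sqlite3 standing in for a warehouse-scale DBMS.

# A sketch of batch-style bulk operations against a DBMS using sqlite3.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, region TEXT, amount REAL)")

# INSERT: load a whole batch of records in one call.
batch = [(1, "north", 120.0), (2, "south", 75.5), (3, "north", 30.0)]
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", batch)

# SELECT: one large analytical query over the whole batch.
for region, total in conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"):
    print(region, total)

# UPDATE / DELETE: applied to many rows at once, typically on a schedule.
conn.execute("UPDATE sales SET amount = amount * 1.1 WHERE region = 'north'")
conn.execute("DELETE FROM sales WHERE amount < 50")
conn.commit()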
Transactional Processing
Transactional processing, also known as online processing or OLTP (Online Transaction Processing),
refers to the real-time, interactive processing of small amounts of data. This model emphasizes speed,
responsiveness, and immediate data consistency, and is designed to support day-to-day business
operations.
Use cases
E-Commerce:
• Adding new customer orders, updating inventory, payment processing.
Banking:
• Processing ATM withdrawals, fund transfers, balance checks.
Healthcare:
• Patient check-ins, prescription updates, appointment bookings.
Telecommunications:
• Call records logging, real-time billing.
Retail:
• Point-of-sale transactions, loyalty point updates.
Characteristics of Transactional Processing
• Low Latency: Processes data in real time or near real time, ensuring immediate response.
• Small Data Footprint: Each transaction involves a small set of data.
• Random Reads/Writes: Access patterns are non-sequential and often unpredictable.
• Write-Intensive: Most OLTP systems prioritize writing new data (e.g., creating transactions).
• Fewer Joins: Simpler queries focused on operational tasks, not complex analysis.
• High Availability: Designed to handle a large number of users simultaneously with high uptime.
• Data Consistency: Strong focus on ACID properties (Atomicity, Consistency, Isolation, Durability).
Binary Data Blocks:
• Represent streams of incoming data (e.g., from a web app, ATM, or POS terminal).
• These chunks are small and processed immediately instead of being grouped as in batch processing.
INSERT: Adds a single new record (e.g., a new order or transaction).
UPDATE: Modifies an existing record (e.g., updating the status of an order).
SELECT: Retrieves specific, small pieces of data (e.g., checking an account balance).
DELETE: Removes a record (e.g., deleting an expired session token).
DBMS (Database Management System):
• The central system that stores and manages all transactions.
• Optimized for fast access, minimal latency, and high concurrency (i.e., handling many users at once). (A single-transaction sketch follows this list.)
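For contrast with the batch sketch above, here is a hedged OLTP-style sketch of one small transaction committed immediately; the account names and balances are invented, and sqlite3 again stands in for a production OLTP database.

# A sketch of a single transactional (OLTP-style) operation with ACID behaviour.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 500.0), ("bob", 200.0)])
conn.commit()

# A fund transfer: two small writes that must succeed or fail together (atomicity).
try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 50 WHERE id = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 50 WHERE id = 'bob'")
except sqlite3.Error:
    print("transfer failed, nothing was applied")

# SELECT: a small, targeted read (e.g., a balance check).
print(conn.execute("SELECT balance FROM accounts WHERE id = 'alice'").fetchone())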
Difference between Batch and Transaction Processing
• Latency: Transactional is low (real-time); Batch is high (delayed/overnight).
• Data Size per Operation: Transactional is small; Batch is large.
• Query Complexity: Transactional uses simple queries (fewer joins); Batch uses complex queries (many joins, aggregations).
• Read/Write Pattern: Transactional is random; Batch is sequential.
• System Type: Transactional is OLTP (Online Transaction Processing); Batch is OLAP (Online Analytical Processing).
• Typical Use Case: Transactional covers banking, e-commerce, and reservations; Batch covers BI reports, data warehousing, and ETL.
Processing in Batch Mode
Batch processing is a data processing method where large volumes of data are collected, stored,
and processed together in a single batch, rather than being processed immediately. It is typically
used for non-real-time tasks where data can be handled in groups at scheduled times.

A processing engine is a software component or system responsible for executing operations on data. In Big Data systems it:
• Takes input data,
• Transforms or analyzes it,
• And then produces output or results.
Batch Processing with MapReduce
• MapReduce is a batch processing framework used to process large datasets.
• It's built on the divide-and-conquer principle.
• Designed for parallel and distributed computing over clusters of low-cost (commodity)
hardware.
Characteristics of Batch Processing with MapReduce
• Batch-Oriented – Processes large datasets in batches, not in real time.
• Parallel Processing – Splits tasks and runs them simultaneously.
• Distributed – Works across multiple machines (a cluster).
• Fault-Tolerant – Recovers from failures automatically.
• Scalable – Easily handles growing data by adding nodes.
• Schema-Free – Can process any type of data (structured or not).
• Code-to-Data – Moves code to where data is stored.
• High Latency – Not meant for quick responses; jobs take time to complete.
• Efficient – Uses low-cost hardware effectively.
• Map & Reduce Logic – Uses Map() and Reduce() functions, as sketched below.
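The divide-and-conquer flow can be sketched in a few lines of plain Python. This is only an illustration of the Map, Shuffle, and Reduce phases, not the Hadoop API, and the input lines are made up.

# A word-count sketch of the Map() and Reduce() idea in plain Python.
from collections import defaultdict

lines = ["big data needs big tools", "map and reduce split big jobs"]

# Map phase: each input line is turned into (key, value) pairs independently,
# so different lines could be handled by different worker nodes.
def map_phase(line):
    return [(word, 1) for word in line.split()]

mapped = [pair for line in lines for pair in map_phase(line)]

# Shuffle phase: group all values that share the same key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: combine the grouped values into one result per key.
def reduce_phase(word, counts):
    return word, sum(counts)

result = dict(reduce_phase(w, c) for w, c in groups.items())
print(result)   # {'big': 3, 'data': 1, ...}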
Processing in Real-time Mode
Real-time processing is a data processing method where data is captured, processed, and responded to
immediately or within a few seconds, in-memory, before being stored on disk. It enables systems to act
on data as it arrives, making it ideal for time-sensitive applications.
Speed Consistency Volume (SCV) Principle:
The SCV principle in distributed data processing states that a system can optimize only two out of the
following three at the same time.
Speed: Speed refers to how fast data is processed after it's generated.
It focuses on analysis time, not data collection. Real-time systems
process data much quicker than batch systems, enabling rapid
insights for time-sensitive tasks.
Consistency: Consistency refers to the accuracy and precision of
results. A consistent system uses all available data for more reliable
outputs, while less consistent systems may use sampling, leading to
acceptable precision but lower accuracy.
Volume: Volume refers to the amount of data that can be processed.
Big Data systems deal with rapidly growing datasets, and handling
all this data often requires distributed processing. Managing large
volumes while also maintaining speed and consistency is a major
challenge.
This diagram shows how real-time data is processed:
1) Streaming data enters the system continuously.
2) A data transfer engine sends it to two places:
a) In-memory storage for fast processing.
b) Storage device for saving the data.
3) The processing engine analyzes the data in-memory.
4) The results are displayed on a dashboard in real-time.
Two important concepts related to Realtime Big Data processing are:

1) Event Stream Processing (ESP): Event Stream Processing (ESP) is the real-time analysis of
events as they arrive, in the order they happen. It uses simple rules and processes data in
memory for speed. The results can be shown live or used to trigger automatic actions. ESP
focuses on being fast rather than complex.

2) Complex Event Processing (CEP): Complex Event Processing (CEP) looks at real-time events
from different sources to find patterns and take action. It uses rules and logic to do deeper
analysis. CEP is more advanced than ESP and may be slower because it handles more complex
tasks. Often, it uses results from ESP as input.
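To illustrate the ESP idea in point 1 (simple rules applied in memory to events in arrival order), here is a toy Python sketch; the event source, fields, and the 3000 threshold are hypothetical, and a real deployment would read from a streaming platform such as Kafka.

# A toy Event Stream Processing sketch: events are checked one at a time,
# in arrival order, against a simple in-memory rule that triggers an action.
import time

def event_stream():
    """Stand-in for a live source such as a Kafka topic or sensor feed."""
    for amount in [120, 80, 4500, 60, 9000]:
        yield {"type": "transaction", "amount": amount, "ts": time.time()}

ALERT_THRESHOLD = 3000   # simple rule: flag unusually large transactions

for event in event_stream():
    if event["amount"] > ALERT_THRESHOLD:
        # In a real system this could update a live dashboard or trigger an action.
        print("ALERT: large transaction", event["amount"])
    else:
        print("ok", event["amount"])

CEP would sit on top of such a stream, correlating events from several sources against richer rules before acting.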
Realtime Big Data Processing and SCV:
In realtime Big Data processing, the SCV principle (Speed, Consistency, Volume) helps design
efficient systems. Both hard-realtime and near-realtime systems must handle high data volume
without losing data.
• In a hard-realtime system, speed is critical, so consistency may be reduced. Approximation or
sampling techniques are used to get fast results, but accuracy may suffer.
• In a near-realtime system, speed is slightly less critical, so the system can maintain
consistency by using the full dataset, leading to more accurate results.
Realtime Big Data Processing and MapReduce:
MapReduce is not suitable for real-time Big Data processing because it works in batch mode,
needs the complete dataset upfront, and involves high processing overhead and latency. It waits
for all map tasks to finish before reduce tasks begin, making it too slow for real-time needs.
Approaches that can help in near-real-time scenarios include:
• Using in-memory storage.
• Running micro-batch jobs at short intervals.
• Combining pre-computed views with fresh streaming data.
A micro-batch sketch follows this list.
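As a rough illustration of the micro-batch idea (the approach popularized by engines such as Spark Streaming), here is a hedged standard-library sketch; the batch size, interval, queue contents, and running total are all invented.

# A micro-batch sketch: incoming records are buffered briefly and processed in
# small batches at short intervals, approximating real-time results.
import time
from collections import deque

incoming = deque([5, 12, 7, 30, 2, 18, 9])   # stand-in for a live stream
BATCH_INTERVAL = 0.5                          # seconds between micro-batches
running_total = 0                             # a "pre-computed view" kept fresh

while incoming:
    time.sleep(BATCH_INTERVAL)
    # Drain up to a few records that "arrived" during the interval into one batch.
    batch = [incoming.popleft() for _ in range(min(3, len(incoming)))]
    running_total += sum(batch)               # merge fresh data into the view
    print("processed micro-batch", batch, "running total:", running_total)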
END OF CHAPTER 4
THANK YOU
