Unit - IV Data Analytics Frameworks: Centralized and Distributed Functional Architectures of Relational Systems
Big Data refers to datasets that are so large or complex that traditional data processing
applications are inadequate to deal with them. The challenges include data capture, storage,
analysis, data curation, search, sharing, transfer, visualization, querying, updating, and
information privacy.
To manage such extensive data volumes effectively, specialized Big Data Architectures are
developed. These architectures address the three V's of big data—Volume (the amount of
data), Velocity (the speed at which data is generated), and Variety (the different types of
data)—and sometimes also include Veracity (data accuracy) and Value (data usefulness).
Big Data Architectures can be broadly divided into Centralized and Distributed systems,
especially when dealing with relational data management.
Relational Database Management Systems (RDBMS) are databases that store data in a
structured format, using rows and columns (i.e., tables). They use Structured Query Language
(SQL) to perform queries and manage data. Examples of traditional RDBMS include Oracle
Database, MySQL, Microsoft SQL Server, and PostgreSQL.
Key Components of an RDBMS:
1. Tables (Relations): The fundamental structure in which data is stored. Each table is a
collection of related data entries organized by rows and columns.
2. Schema: Defines the structure of the database, including the tables, fields, data types,
and relationships between tables.
3. Primary Key: A unique identifier for each record in a table. Ensures that each entry can
be uniquely distinguished.
4. Foreign Key: A field in one table that refers to the Primary Key of another table,
establishing a relationship between the two tables.
5. Indexes: Data structures that improve the speed of data retrieval operations at the cost
of additional space and processing time for writes.
Example Scenario:
● Students Table: Contains student ID (Primary Key), name, date of birth, and other
personal information.
● Courses Table: Contains course ID (Primary Key), course name, and description.
● Enrollments Table: Manages the many-to-many relationship between students and
courses, using student ID and course ID as Foreign Keys.
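As a minimal sketch of this scenario, the snippet below builds the three tables with Python's built-in sqlite3 module; the column names and the inserted rows are illustrative assumptions, not part of any particular system:

import sqlite3

# In-memory database used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce foreign-key constraints

conn.executescript("""
CREATE TABLE Students (
    student_id    INTEGER PRIMARY KEY,   -- Primary Key
    name          TEXT NOT NULL,
    date_of_birth TEXT
);
CREATE TABLE Courses (
    course_id   INTEGER PRIMARY KEY,     -- Primary Key
    course_name TEXT NOT NULL,
    description TEXT
);
CREATE TABLE Enrollments (               -- many-to-many relationship
    student_id INTEGER REFERENCES Students(student_id),  -- Foreign Key
    course_id  INTEGER REFERENCES Courses(course_id),    -- Foreign Key
    PRIMARY KEY (student_id, course_id)
);
-- Index to speed up lookups by course (extra space and write cost)
CREATE INDEX idx_enroll_course ON Enrollments(course_id);
""")

conn.execute("INSERT INTO Students VALUES (1, 'Asha', '2004-05-01')")
conn.execute("INSERT INTO Courses VALUES (101, 'Databases', 'Intro to RDBMS')")
conn.execute("INSERT INTO Enrollments VALUES (1, 101)")

# SQL query joining the three tables
for row in conn.execute("""
    SELECT s.name, c.course_name
    FROM Enrollments e
    JOIN Students s ON s.student_id = e.student_id
    JOIN Courses  c ON c.course_id  = e.course_id
"""):
    print(row)  # ('Asha', 'Databases')

The composite primary key on Enrollments prevents duplicate enrollments, and the index speeds up lookups by course at the cost of extra work on inserts.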
1. Centralized Database:
A centralized database is stored and maintained at a single location, and all users access it from that one site.
Advantages:
● Since all data is stored at a single location, it is easier to access and coordinate the data.
● A centralized database has very minimal data redundancy, since all data is kept in one place.
● It is generally cheaper to set up and maintain than a distributed database.
Disadvantages:
● The single site is a single point of failure: if it becomes unavailable, no data can be accessed.
● As the number of users and the volume of queries grow, the central system can become a performance bottleneck.
2. Distributed Database:
A distributed database consists of multiple databases that are connected to each other and
spread across different physical locations. The data stored at each location can be managed
independently of the other locations, and the databases at different sites communicate with
one another over a computer network.
Advantages:
● This database can be easily expanded as data is already spread across different
physical locations.
● The distributed database can easily be accessed from different networks.
● This database is more secure in comparison to a centralized database.
Disadvantages:
● This database is very costly and is difficult to maintain because of its complexity.
Distributed Functional Architecture
Definition
A Distributed Functional Architecture refers to a system where data storage, processing, and
management are spread across multiple interconnected systems or nodes. This architecture is
designed to handle large-scale data processing by leveraging distributed computing and
storage.
Components
Detailed Workflow
1. Data Ingestion:
○ Data is ingested in parallel across multiple nodes. Data pipelines are designed to
feed into distributed storage and processing systems.
○ Example: Sensor data from thousands of IoT devices is ingested simultaneously
into a distributed database.
2. Data Storage:
○ Data is partitioned across multiple nodes using sharding strategies, and replication ensures that data remains available even if one node fails (a short sharding sketch follows this list).
○ Example: A large-scale e-commerce platform stores customer and transaction
data across different data centers globally.
3. Data Processing:
○ Distributed computing frameworks like Spark process data in parallel, using the
power of multiple nodes to handle large datasets.
○ Example: A real-time recommendation system processes user behavior data
across a distributed cluster to generate recommendations.
4. Data Access:
○ Users and applications access data through distributed queries that retrieve and
aggregate data from multiple nodes.
○ Example: A BI tool queries a distributed database to generate a report combining
data from multiple sources.
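To make the sharding and replication idea in step 2 concrete, here is a small Python sketch; the node names, the hash-based placement rule, and the replication factor are assumptions chosen only for illustration:

import hashlib

NODES = ["node-a", "node-b", "node-c"]   # hypothetical storage nodes
REPLICATION_FACTOR = 2                    # each record is kept on 2 nodes

def shard_for(key: str) -> int:
    """Hash the record key to pick a primary shard (hash partitioning)."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % len(NODES)

def placements(key: str) -> list:
    """Primary node plus the next node(s) in the ring as replicas."""
    start = shard_for(key)
    return [NODES[(start + i) % len(NODES)] for i in range(REPLICATION_FACTOR)]

# Example: customer records are spread across nodes; if one node fails,
# each record is still available on its replica node.
for customer_id in ["C1001", "C1002", "C1003"]:
    print(customer_id, "->", placements(customer_id))

In a real distributed database the placement logic is far more sophisticated (consistent hashing, rebalancing, consensus on writes), but the basic effect is the same: each record has a home shard plus replicas on other nodes.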
Advantages
● Scalability: Easily scales horizontally by adding more nodes, making it suitable for very
large datasets and high transaction volumes.
● Fault Tolerance: If one node fails, others can take over, minimizing downtime.
● Performance: Distributed processing allows for handling large workloads more
efficiently, reducing processing time.
● Global Accessibility: Data can be stored in geographically distributed nodes, ensuring
faster access for users worldwide.
Challenges
Examples
● Google Spanner: A globally distributed database system that supports both SQL
queries and strong consistency.
● Apache Cassandra: A distributed NoSQL database designed for high availability and
scalability, often used by large-scale enterprises like Netflix.
● Amazon DynamoDB: A key-value and document database that delivers single-digit
millisecond performance at any scale.
Data Warehousing Architectures
Data Warehousing is the process of collecting, storing, and managing large volumes of data
from various sources within an organization to support business intelligence (BI) activities,
including reporting, data analysis, and decision-making. A Data Warehouse is a central
repository of integrated data from one or more disparate sources.
Data Warehousing systems are designed to support queries and analysis rather than
transaction processing, providing a historical view of data. The architecture of a Data
Warehouse is crucial as it defines how data is stored, managed, and accessed.
Key Components of a Data Warehouse Architecture:
1. Data Sources:
○ The origin points from where data is collected. These can include databases, flat
files, online transaction processing (OLTP) systems, enterprise resource planning
(ERP) systems, and external data sources.
2. ETL (Extract, Transform, Load) Process:
○ Extract: Collecting data from various source systems.
○ Transform: Cleaning, filtering, and reformatting the data into a suitable structure
for analysis.
○ Load: Loading the transformed data into the Data Warehouse.
3. Staging Area:
○ A temporary storage area where data is processed during the ETL process. It is
used to clean, transform, and prepare data before it is loaded into the Data
Warehouse.
4. Data Warehouse:
○ The central repository where integrated, historical data is stored. It is optimized
for query performance and analysis rather than transaction processing.
5. Data Marts:
○ Subsets of the Data Warehouse that are designed for specific business lines or
departments. They contain data tailored to the needs of a particular group or
function within the organization.
6. OLAP (Online Analytical Processing) Cubes:
○ Multidimensional data structures that allow users to perform complex queries and analysis quickly. OLAP cubes pre-aggregate data for rapid querying (a short aggregation sketch follows this list).
7. Metadata:
○ Data about the data stored in the Data Warehouse. It includes information on
data sources, transformations, data structures, and relationships. Metadata is
critical for managing and navigating the Data Warehouse.
8. Query Tools:
○ Software applications that allow users to query the Data Warehouse, generate
reports, and perform data analysis. These tools often include SQL interfaces,
reporting software, and data visualization tools.
9. Business Intelligence (BI) Tools:
○ Applications used to analyze data and generate insights. BI tools include
dashboards, data mining tools, and machine learning platforms.
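As a toy illustration of the pre-aggregation behind OLAP cubes (item 6 above), the snippet below summarizes a small fact table along two dimensions with pandas; the column names and revenue figures are made up for the example:

import pandas as pd

# Toy fact table (values are illustrative)
sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South", "South"],
    "product": ["A", "B", "A", "A", "B"],
    "year":    [2023, 2023, 2023, 2024, 2024],
    "revenue": [100, 150, 200, 120, 80],
})

# Pre-aggregate revenue along two dimensions (region x product),
# the same idea an OLAP cube applies across many dimensions.
cube = pd.pivot_table(sales, values="revenue",
                      index="region", columns="product",
                      aggfunc="sum", fill_value=0)
print(cube)

# "Slicing" the cube: revenue for product A by region
print(cube["A"])

A real OLAP cube performs the same kind of aggregation across many dimensions and stores the results so that queries do not have to recompute them.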
Types of Data Warehouse Architectures
1. Single-Tier Architecture
Definition:
● A simple architecture where the Data Warehouse, ETL process, and BI tools are all
housed on a single system.
Characteristics:
Limitations:
Use Case:
● Small businesses or departments with limited data volume and straightforward analytics
needs.
2. Two-Tier Architecture
Definition:
● A more common architecture where the Data Warehouse and data marts are on one
layer, and the BI tools are on another.
Components:
Characteristics:
Limitations:
Use Case:
3. Three-Tier Architecture
Definition:
● The most widely used architecture, consisting of three layers: the data source layer, the
Data Warehouse layer, and the BI tool layer.
Components:
Characteristics:
Limitations:
Use Case:
4. Hybrid Architecture
Definition:
Components:
Characteristics:
Limitations:
Use Case:
● Organizations transitioning to the cloud or with a mix of legacy systems and modern BI
needs.
Data Warehousing Workflow
1. Data Extraction:
○ Data is extracted from various sources such as OLTP systems, flat files, and
external sources. This process may occur in real-time or in batch mode.
2. Data Transformation:
○ The extracted data is cleaned, normalized, and transformed to fit the schema of
the Data Warehouse. This step includes data validation, deduplication, and
conversion of data types.
3. Data Loading:
○ Transformed data is loaded into the staging area first for temporary storage and
further processing, and then into the Data Warehouse.
4. Data Storage:
○ Data is stored in the Data Warehouse in a structured format. Data marts may be
created to cater to specific departments or functions.
5. Data Aggregation and Cubes:
○ Data is aggregated and summarized into OLAP cubes to support fast,
multidimensional analysis.
6. Query and Analysis:
○ Users access the data through BI tools, querying the Data Warehouse or OLAP
cubes to generate reports, dashboards, and insights.
7. Metadata Management:
○ Metadata is maintained to track the origin, transformation, and location of data
within the system.
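A self-contained sketch of the extract-transform-load steps described in this workflow, using only the Python standard library; the inline CSV data, column names, and the in-memory SQLite "warehouse" table are assumptions made purely for illustration:

import csv, io, sqlite3

# --- Extract: read raw records from a source (a CSV string stands in
# for an OLTP export or flat file)
raw = io.StringIO("order_id,amount,country\n1, 250 ,in\n2,100,IN\n2,100,IN\n")
rows = list(csv.DictReader(raw))

# --- Transform: clean, validate, deduplicate, and convert data types
seen, cleaned = set(), []
for r in rows:
    key = r["order_id"]
    if key in seen:            # deduplication
        continue
    seen.add(key)
    cleaned.append((int(key), float(r["amount"].strip()), r["country"].strip().upper()))

# --- Load: write the transformed rows into the warehouse table
dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE fact_orders (order_id INTEGER, amount REAL, country TEXT)")
dw.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", cleaned)

print(dw.execute("SELECT country, SUM(amount) FROM fact_orders GROUP BY country").fetchall())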
ETL Tools:
● Informatica PowerCenter
● Talend
● Microsoft SQL Server Integration Services (SSIS)
● Apache NiFi
Data Warehouse Databases:
● Oracle Exadata
● Amazon Redshift
● Google BigQuery
● Microsoft Azure Synapse Analytics
● SAP BW
BI Tools:
● IBM Cognos
● Microsoft Power BI
● Tableau
Data Warehousing Platforms:
● Snowflake
● Teradata
● Cloudera Data Platform
Service-Oriented Architecture (SOA)
Service-Oriented Architecture (SOA) is an architectural style in which application functionality is delivered as a collection of loosely coupled, reusable services that communicate with each other over a network.
Key Principles of SOA
1. Loose Coupling:
○ Services are designed to minimize dependencies on each other. Changes to one
service should not impact other services.
2. Interoperability:
○ Services can communicate across different platforms and technologies using
standardized protocols.
3. Reusability:
○ Services are designed to be reused across different applications and business
processes.
4. Abstraction:
○ Services encapsulate the underlying logic, exposing only the necessary interface
to consumers.
5. Autonomy:
○ Each service operates independently, with control over its own logic and data.
6. Discoverability:
○ Services are published in a registry or repository, allowing them to be easily
discovered and invoked by consumers.
7. Composability:
○ Services can be combined or orchestrated to create complex business processes
or workflows.
Components of SOA
1. Services:
○ Self-contained units of functionality that can be independently developed,
deployed, and managed. Each service provides a specific business function,
such as customer management or order processing.
2. Enterprise Service Bus (ESB):
○ A middleware component that facilitates communication between services. The
ESB handles message routing, transformation, and protocol mediation, enabling
seamless integration between services.
3. Service Registry and Repository:
○ A centralized directory where services are registered and stored. The registry
contains metadata about each service, including its location, interface, and usage
policies, enabling service discovery and governance.
4. Service Consumers:
○ Applications or systems that invoke services to perform specific tasks.
Consumers can be web applications, desktop applications, or other services.
5. Service Providers:
○ The entities responsible for implementing and hosting services. Providers define
the service interface, business logic, and data processing.
6. Service Contracts:
○ The formal agreements between service consumers and providers, specifying the
service's interface, input/output data formats, communication protocols, and
quality of service (QoS) requirements.
7. Business Process Management (BPM):
○ Tools and technologies that enable the orchestration and management of
services to create end-to-end business processes.
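The publish-discover-invoke relationship among providers, the registry, and consumers can be sketched with plain Python objects; the service name and the in-memory registry below are hypothetical stand-ins for a real registry and ESB:

from typing import Callable, Dict

# Service registry: maps a service name to its callable endpoint (stand-in
# for a real registry that would hold location, interface, and policies).
registry: Dict[str, Callable] = {}

def publish(name: str, service: Callable) -> None:
    registry[name] = service          # service provider publishes its service

def discover(name: str) -> Callable:
    return registry[name]             # service consumer discovers it by name

# Provider: a self-contained "customer management" service
def get_customer(customer_id: int) -> dict:
    return {"id": customer_id, "name": "Asha"}   # illustrative data

publish("CustomerService", get_customer)

# Consumer: knows only the contract (name in, dict out), not the implementation
customer = discover("CustomerService")(42)
print(customer)

The consumer depends only on the service name and its contract, not on how or where the provider is implemented, which is what keeps the services loosely coupled.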
SOA Architecture
1. Enterprise Service Bus (ESB)
Definition:
● The ESB acts as a communication layer that connects and integrates services within an
SOA. It provides routing, mediation, and transformation services, allowing different
services to communicate seamlessly.
Functions:
● Routes messages between services.
● Transforms message and data formats.
● Mediates between different communication protocols.
Examples of ESB:
● Apache Camel
● Mule ESB
● IBM WebSphere ESB
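As a toy sketch (not modeled on any of the products above) of the routing and transformation role an ESB plays, consider the following Python snippet; the topic name, message fields, and target service are assumptions:

# Toy message bus: routes a message to a destination service and
# transforms its payload on the way (the core mediation idea of an ESB).

def to_order_service(msg: dict) -> dict:
    # Transformation: rename fields to match the target service's contract
    return {"orderId": msg["id"], "total": msg["amount"]}

ROUTES = {"order.created": ("OrderService", to_order_service)}

def send(topic: str, msg: dict) -> None:
    service, transform = ROUTES[topic]          # routing by topic
    print(f"deliver to {service}:", transform(msg))

send("order.created", {"id": 7, "amount": 99.5})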
2. Service Registry and Repository
Definition:
● A centralized directory where services are published and stored. It includes metadata
about each service, such as its location, interface, and policies.
Functions:
Examples:
3. Service Providers and Consumers
Service Providers:
Service Consumers:
Interaction:
● Consumers discover services via the registry, invoke them through the ESB, and receive
the required functionality.
4. Service Contracts
Definition:
● A service contract is a formal specification that defines the interaction between a service
provider and a consumer.
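As a small illustration of what a contract pins down (the interface plus the input and output data formats), here is a sketch using Python type hints; the service name and fields are hypothetical:

from typing import Protocol, TypedDict

class OrderRequest(TypedDict):
    order_id: int            # input data format agreed in the contract
    customer_id: int

class OrderStatus(TypedDict):
    order_id: int            # output data format agreed in the contract
    status: str

class OrderService(Protocol):
    """Contract: any provider must expose this interface."""
    def get_status(self, request: OrderRequest) -> OrderStatus: ...

# A provider that fulfils the contract
class SimpleOrderService:
    def get_status(self, request: OrderRequest) -> OrderStatus:
        return {"order_id": request["order_id"], "status": "SHIPPED"}

svc: OrderService = SimpleOrderService()
print(svc.get_status({"order_id": 7, "customer_id": 42}))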
Components:
Importance:
Lambda Architecture
Lambda Architecture is designed to handle large-scale data processing in a way that balances
latency, throughput, and fault-tolerance. It combines batch processing for historical data and
real-time stream processing for fresh data, providing a unified view of both historical and
real-time data.
The architecture is particularly effective for scenarios where real-time data processing is
required, but the results need to be eventually consistent with batch-processed data to ensure
accuracy and completeness.
Layers of Lambda Architecture
1. Batch Layer:
○ Purpose: Handles large-scale historical data processing and generates batch
views that are accurate and complete.
○ Data Storage: Stores the master dataset, which is an immutable and
append-only dataset containing all the raw data.
○ Processing: Performs computation on the entire dataset (e.g., aggregation,
filtering) to generate batch views. This layer typically has higher latency but
guarantees accuracy.
2. Speed Layer:
○ Purpose: Handles real-time data processing to provide low-latency updates to
the system.
○ Data Storage: Stores the incoming data temporarily until it is processed.
○ Processing: Performs real-time computation on the data as it arrives, generating
real-time views or updates. This layer is designed for low latency but may
sacrifice some accuracy due to the approximate nature of real-time
computations.
3. Serving Layer:
○ Purpose: Combines the results from both the batch and speed layers to serve
queries in a unified manner.
○ Data Storage: Stores the precomputed views (from both batch and speed layers)
and indexes them for fast query performance.
○ Query Handling: Ensures that queries return the most up-to-date and accurate
data by combining batch views and real-time updates.
Workflow of Lambda Architecture
1. Data Ingestion:
○ Raw data is ingested from various sources (e.g., sensors, logs, transactions) and
stored in both the batch layer (as the master dataset) and the speed layer.
2. Batch Processing:
○ The batch layer periodically processes the entire master dataset to generate
comprehensive batch views. This process may take minutes or hours, depending
on the data volume and complexity.
3. Real-Time Processing:
○ Simultaneously, the speed layer processes incoming data in real-time, generating
real-time views that reflect the latest data. This processing is typically done in
seconds or milliseconds.
4. Serving Layer:
○ The serving layer merges the batch views with the real-time views to provide a
complete and up-to-date result for any query. The batch views offer accuracy,
while the real-time views ensure low-latency updates.
5. Query Execution:
○ When a query is made, the serving layer accesses both the batch and speed
layers to retrieve the required data, ensuring that the response is both accurate
and timely.
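A compact sketch of how the serving layer can merge a precomputed batch view with a real-time view; the page names and counts are illustrative, and in a real deployment the batch view would come from a batch framework and the real-time view from a stream processor:

from collections import Counter

# Batch view: accurate counts computed periodically over the full master dataset
batch_view = Counter({"page_home": 10_000, "page_cart": 2_500})

# Real-time view: counts for events that arrived after the last batch run
realtime_view = Counter()
for event in ["page_home", "page_cart", "page_home"]:   # fresh stream events
    realtime_view[event] += 1

def query(page: str) -> int:
    """Serving layer: combine batch accuracy with real-time freshness."""
    return batch_view[page] + realtime_view[page]

print(query("page_home"))   # 10002: batch result plus streamed updates

When the next batch run completes, the batch view absorbs the recent events and the real-time view is reset, keeping the combined answer both accurate and fresh.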
Advantages of Lambda Architecture
1. Scalability:
○ Can handle massive volumes of data due to the separation of batch and real-time
processing.
2. Fault-Tolerance:
○ The architecture is resilient to system failures, as data is stored in an immutable
format in the batch layer, allowing for recomputation if necessary.
3. Flexibility:
○ Supports various data processing use cases, from real-time analytics to
long-term data warehousing.
4. Accuracy and Completeness:
○ Batch processing ensures that historical data is processed accurately, while the
speed layer allows for real-time updates, providing a balance between accuracy
and latency.
5. Separation of Concerns:
○ By dividing processing into batch and real-time layers, each can be optimized
independently for its specific requirements.
Challenges of Lambda Architecture
1. Complexity:
○ The architecture is inherently complex, requiring the development and
maintenance of two parallel processing pipelines (batch and speed).
2. Data Consistency:
○ Ensuring consistency between the batch and speed layers can be challenging,
especially when dealing with late-arriving data or updates.
3. Latency in Batch Processing:
○ The batch layer introduces inherent latency, as it processes large volumes of
data over longer periods.
4. Resource Intensive:
○ Requires significant computational resources to manage and maintain both
processing layers and the serving layer.
5. Data Duplication:
○ Data is often stored and processed in multiple layers, leading to potential
duplication of storage and processing efforts.
Technologies Commonly Used
Batch Layer:
● Hadoop (HDFS and MapReduce): For storing the immutable master dataset and computing batch views.
● Apache Spark: For large-scale batch computation over the master dataset.
Speed Layer:
● Apache Storm, Spark Streaming, or Apache Flink: For low-latency processing of incoming data streams.
Serving Layer:
● HBase: For storing and serving batch views with low-latency read access.
● Cassandra: For distributed and scalable data storage.
● Elasticsearch: For indexing and searching across the batch and real-time views.
Use Cases of Lambda Architecture
1. Real-Time Analytics:
○ Monitoring and analyzing streaming data, such as financial transactions, social
media feeds, or sensor data, to provide real-time insights.
2. Fraud Detection:
○ Identifying fraudulent activities by combining historical data analysis with
real-time monitoring to detect anomalies.
3. Personalization Engines:
○ Delivering personalized recommendations by processing user behavior data in
real-time, while refining models with batch processing.
4. IoT Applications:
○ Managing data from a vast network of IoT devices, where real-time data is
crucial, but long-term trends and patterns also need to be analyzed.
Lambda Architecture vs. Kappa Architecture
● Lambda Architecture:
○ Uses both batch and speed layers to process data.
○ Suitable for scenarios where both historical accuracy and real-time updates are
required.
● Kappa Architecture:
○ Relies solely on stream processing, eliminating the batch layer.
○ Simplifies the architecture but may compromise on the accuracy of long-term
data processing.
Lambda Architecture vs. Traditional Data Warehousing
● Lambda Architecture:
○ Combines real-time and batch processing to offer low-latency insights.
○ Better suited for modern, large-scale data environments.
● Traditional Data Warehousing:
○ Focuses primarily on batch processing of historical data.
○ May not provide the real-time capabilities needed for modern applications.