
Chapter 6 - Big Data Architecture Part 1

This document provides an overview of big data architecture. It defines big data architecture and describes its key components, including data sources, storage, processing, analytics and reporting. It also outlines best practices for building a big data architecture, such as analyzing business needs, selecting vendors, deployment strategies, capacity planning and disaster recovery. The benefits of an open reference architecture are also discussed.

Uploaded by

Suren Dev


Big Data Architecture

Part 1
Introduction
The non-stop growth of data, the frantic release of
new electronic devices and the trend toward data-driven
decision-making in companies are fueling a constant
demand for more efficient Big Data processing
systems.

Investment in Big Data architecture has grown
rapidly in recent years and, according to
Gartner, businesses will keep investing more in IT in
2018 and 2019, focusing on IoT, blockchain and Big
Data.
Introduction
Big Data refers to huge amounts of heterogeneous
data from both traditional and new sources, growing
at a higher rate than ever.

Because this data is so heterogeneous, it is a challenge to
build systems that centrally and efficiently process and
analyze such huge amounts of data, which originate both
inside and outside an organization.
Big Data Architecture
Definition
“Big data architecture refers to the logical and
physical structure that dictates how high volumes of
data are ingested, processed, stored, managed, and
accessed.” (OmniSci, 2020)

“A Big data architecture describes the blueprint of a
system handling massive volume of data during its
storage, processing, analysis and visualization.”
(Kalipe & Behera, 2019)
Big Data Architecture
Definition
The big data architecture framework serves as a
reference blueprint for big data infrastructures and
solutions, logically defining how big data solutions
will work, the components that will be used, how
information will flow, and security details.
How to Build a Big Data
Architecture
Designing a big data reference architecture, while
complex, follows the same general procedure:
• Analyze the Problem
• Select a Vendor
• Deployment Strategy
• Capacity Planning
• Infrastructure Sizing
• Plan for Disaster Recovery
How to Build a Big Data
Architecture
• Analyze the Problem
• First determine if the business does in fact have a big
data problem, taking into consideration criteria such
as data variety, velocity, and challenges with the
current system.
• Common use cases include data archival, process
offload, data lake implementation, unstructured data
processing, and data warehouse modernization.
How to Build a Big Data
Architecture
• Select a Vendor
• Hadoop is one of the most widely recognized tools
for building an end-to-end big data
architecture.
• Popular vendors of Hadoop distributions include
Amazon Web Services, IBM BigInsights, Cloudera,
Hortonworks, MapR, and Microsoft.
How to Build a Big Data
Architecture
• Deployment Strategy
• Deployment can be on-premises, which tends
to be more secure; cloud-based, which is cost-effective
and provides flexibility regarding scalability;
or a hybrid of the two.
How to Build a Big Data
Architecture
• Capacity Planning
• When planning hardware and infrastructure sizing,
consider the daily data ingestion volume, the data volume of
the one-time historical load, the data retention period,
multi-data center deployment, and the time period
for which the cluster is sized.
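The sizing inputs listed above can be combined into a rough storage estimate. The sketch below is only an illustration: the replication factor, the overhead margin and the idea that each data center holds a full copy are assumptions, not figures from the source, and the function and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CapacityInputs:
    daily_ingest_tb: float        # data ingested per day
    historical_load_tb: float     # one-time historical load
    retention_days: int           # how long ingested data is kept
    replication_factor: int = 3   # assumed number of copies kept by the file store
    overhead: float = 0.25        # assumed headroom for temporary/intermediate data
    data_centers: int = 1         # assumed: every data center holds a full copy


def required_raw_storage_tb(c: CapacityInputs) -> float:
    """Estimate the raw disk capacity needed once the retention period is full."""
    logical = c.historical_load_tb + c.daily_ingest_tb * c.retention_days
    per_site = logical * c.replication_factor * (1 + c.overhead)
    return per_site * c.data_centers


if __name__ == "__main__":
    inputs = CapacityInputs(daily_ingest_tb=0.5, historical_load_tb=100, retention_days=365)
    print(f"Raw storage needed: {required_raw_storage_tb(inputs):.1f} TB")
```

Changing the assumed replication factor or overhead shifts the result linearly, which makes it easy to see which input dominates the estimate.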
How to Build a Big Data
Architecture
• Infrastructure Sizing
• This is based on capacity planning and determines
the number of clusters or environments required and
the type of hardware required.
• Consider the type of disk and number of disks per
machine, the types of processing memory and
memory size, number of CPUs and cores, and the
data retained and stored in each environment.
How to Build a Big Data
Architecture
• Plan for Disaster Recovery
• In developing a backup and disaster recovery plan,
consider the criticality of the data stored, the Recovery
Point Objective and Recovery Time Objective
requirements, the backup interval, multi-data center
deployment, and whether Active-Active or
Active-Passive disaster recovery is most appropriate.
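How the backup interval relates to the Recovery Point Objective (RPO) and the Recovery Time Objective (RTO) can be checked with a few lines of arithmetic. This is a minimal sketch that assumes simple periodic backups, so the worst-case data loss equals the backup interval; the function names are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """With periodic backups, everything written since the last backup can be lost."""
    return backup_interval


def meets_objectives(backup_interval: timedelta, restore_time: timedelta,
                     rpo: timedelta, rto: timedelta) -> bool:
    """True if the plan satisfies both the RPO and the RTO."""
    return worst_case_data_loss(backup_interval) <= rpo and restore_time <= rto


if __name__ == "__main__":
    ok = meets_objectives(
        backup_interval=timedelta(hours=1),
        restore_time=timedelta(minutes=30),
        rpo=timedelta(hours=4),
        rto=timedelta(hours=1),
    )
    print("Plan meets RPO/RTO:", ok)
```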
General Big Data architecture
• Big data solutions typically involve one or
more of the following types of workload:
• Batch processing of big data sources at rest.
• Real-time processing of big data in motion.
• Interactive exploration of big data.
• Predictive analytics and machine learning.
General Big Data
architecture
• Most big data architectures include some or all of
the following components:
• Data Sources
• Data Storage
• Batch Processing
• Real-time message ingestion
• Stream Processing
• Analytical Data Store
• Analysis & Reporting
• Orchestration
General Big Data
architecture
• Data sources: All big data solutions start with
one or more data sources. Examples include:
• Application data stores, such as relational
databases.
• Static files produced by applications, such as
web server log files.
• Real-time data sources, such as IoT devices.
General Big Data
architecture
• Data storage:
• Data for batch processing operations is
typically stored in a distributed file store that
can hold high volumes of large files in various
formats. This kind of store is often called a
data lake.
• Examples: Azure Data Lake Store or blob
containers in Azure Storage.
General Big Data
architecture
• Batch processing
• Because the data sets are so large, often a big data solution
must process data files using long-running batch jobs to
filter, aggregate, and otherwise prepare the data for analysis.
• Usually these jobs involve reading source files, processing
them, and writing the output to new files.
• Options include running U-SQL jobs in Azure Data Lake
Analytics, using Hive, Pig, or custom Map/Reduce jobs in an
HDInsight Hadoop cluster, or using Java, Scala, or Python
programs in an HDInsight Spark cluster.
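The read, process and write pattern described above can be sketched on a single machine with the Python standard library. The log format used here (one "path status" pair per line) and the function name are assumptions made for illustration; a real job would run on a distributed engine such as those listed above.

```python
import csv
from collections import Counter
from pathlib import Path


def run_batch_job(source_dir: Path, output_file: Path) -> Counter:
    """Count HTTP 5xx responses per URL path across all *.log files in source_dir."""
    errors = Counter()
    for log_file in sorted(source_dir.glob("*.log")):   # read the source files
        with log_file.open() as f:
            for line in f:
                parts = line.split()
                if len(parts) != 2:
                    continue                             # skip malformed lines
                path, status = parts
                if status.startswith("5"):               # filter: server errors only
                    errors[path] += 1                    # aggregate per path
    with output_file.open("w", newline="") as f:         # write the output to a new file
        writer = csv.writer(f)
        writer.writerow(["path", "server_errors"])
        writer.writerows(sorted(errors.items()))
    return errors
```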
General Big Data
architecture
• Real-time message ingestion
• If the solution includes real-time sources, the architecture must
include a way to capture and store real-time messages for stream
processing.
• This might be a simple data store, where incoming messages are
dropped into a folder for processing.
• However, many solutions need a message ingestion store to act
as a buffer for messages, and to support scale-out processing,
reliable delivery, and other message queuing semantics. Options
include Azure Event Hubs, Azure IoT Hub, and Kafka.
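The buffering role of a message ingestion store can be illustrated in-process with a bounded queue feeding several consumers. This is only a stand-in for the idea: it offers none of the durability or delivery guarantees of the services named above, and all names in it are hypothetical.

```python
import queue
import threading


def consume(buffer: queue.Queue, sink: list, lock: threading.Lock) -> None:
    """Consumer: take messages from the buffer until a None sentinel arrives."""
    while True:
        msg = buffer.get()
        if msg is None:
            break
        with lock:
            sink.append(msg)


def run_pipeline(messages, workers: int = 2, capacity: int = 8) -> list:
    """Push messages through a bounded buffer to several consumers (scale-out)."""
    buffer = queue.Queue(maxsize=capacity)   # put() blocks when the buffer is full
    sink, lock = [], threading.Lock()
    threads = [threading.Thread(target=consume, args=(buffer, sink, lock))
               for _ in range(workers)]
    for t in threads:
        t.start()
    for msg in messages:                     # producer side: ingest incoming messages
        buffer.put(msg)
    for _ in threads:
        buffer.put(None)                     # one stop signal per consumer
    for t in threads:
        t.join()
    return sink


if __name__ == "__main__":
    processed = run_pipeline(range(20), workers=3)
    print(len(processed), "messages processed")
```

The bounded capacity is what makes the queue act as a buffer: a burst of producers is slowed down rather than overwhelming the consumers.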
General Big Data
architecture
• Stream processing
• After capturing real-time messages, the solution must
process them by filtering, aggregating, and otherwise
preparing the data for analysis.
• The processed stream data is then written to an output sink.
• Example: Azure Stream Analytics provides a managed stream
processing service based on perpetually running SQL queries
that operate on unbounded streams. You can also use open
source Apache streaming technologies like Storm and Spark
Streaming in an HDInsight cluster.
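A common form of stream aggregation is a tumbling (fixed, non-overlapping) time window. The sketch below assumes events arrive in timestamp order as (timestamp, device, value) tuples, which is a simplification chosen for illustration; it is not how any of the products above are implemented.

```python
from collections import defaultdict
from typing import Iterable, Iterator, Tuple

Event = Tuple[float, str, float]  # (timestamp in seconds, device id, value)


def tumbling_window_avg(events: Iterable[Event], window_s: float) -> Iterator[Event]:
    """Yield (window_start, device, mean value) whenever a window closes.

    Assumes events arrive ordered by timestamp.
    """
    current = None
    sums = defaultdict(lambda: [0.0, 0])      # device -> [running total, count]
    for ts, device, value in events:
        start = ts - (ts % window_s)          # start of the window this event falls in
        if current is not None and start != current:
            for dev, (total, n) in sorted(sums.items()):
                yield current, dev, total / n  # emit the closed window to the sink
            sums.clear()
        current = start
        sums[device][0] += value
        sums[device][1] += 1
    if current is not None:                    # flush the last window of a finite input
        for dev, (total, n) in sorted(sums.items()):
            yield current, dev, total / n


if __name__ == "__main__":
    readings = [(0, "a", 1.0), (5, "a", 3.0), (7, "b", 10.0), (12, "a", 5.0)]
    for row in tumbling_window_avg(readings, window_s=10):
        print(row)
```

On a truly unbounded stream the final flush never runs; results are emitted only as each window closes.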
General Big Data
architecture
• Analytical data store:
• Many big data solutions prepare data for analysis and then serve
the processed data in a structured format that can be queried
using analytical tools.
• The analytical data store used to serve these queries can be a
Kimball-style relational data warehouse, as seen in most traditional
business intelligence (BI) solutions.
• Alternatively, the data could be presented through a low-latency
NoSQL technology such as HBase, or an interactive Hive database
that provides a metadata abstraction over data files in the
distributed data store.
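A Kimball-style warehouse organizes data into fact tables joined to dimension tables. The following sketch uses an in-memory SQLite database as a small stand-in; the table names, columns and sample rows are made up for illustration.

```python
import sqlite3


def build_warehouse() -> sqlite3.Connection:
    """Create a tiny star schema: one fact table and one dimension table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
        CREATE TABLE fact_sales (product_id INTEGER REFERENCES dim_product, amount REAL);
        INSERT INTO dim_product VALUES (1, 'books'), (2, 'games');
        INSERT INTO fact_sales VALUES (1, 10.0), (1, 5.0), (2, 20.0);
    """)
    return conn


def sales_by_category(conn: sqlite3.Connection) -> list:
    """Analytical query: total sales per product category."""
    return conn.execute("""
        SELECT d.category, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_product AS d ON d.product_id = f.product_id
        GROUP BY d.category
        ORDER BY d.category
    """).fetchall()


if __name__ == "__main__":
    print(sales_by_category(build_warehouse()))
```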
General Big Data
architecture
• Analysis and reporting:
• The goal of most big data solutions is to provide
insights into the data through analysis and
reporting.
• To empower users to analyze the data, the
architecture may include a data modeling layer,
such as a multidimensional OLAP cube or a tabular
data model.
General Big Data
architecture
• Analysis and reporting:
• Analysis and reporting can also take the form of
interactive data exploration by data scientists or data
analysts.
• For these scenarios, many Azure services support
analytical notebooks, such as Jupyter, enabling these
users to leverage their existing skills with Python or
R. For large-scale data exploration, you can use
Microsoft R Server, either standalone or with Spark.
General Big Data
architecture
• Orchestration
• Most big data solutions consist of repeated data
processing operations, encapsulated in workflows, that
transform source data, move data between multiple
sources and sinks, load the processed data into an
analytical data store, or push the results straight to a
report or dashboard.
• To automate these workflows, you can use an
orchestration technology such as Azure Data Factory or
Apache Oozie and Sqoop.
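At its core, an orchestrated workflow is a set of tasks with dependencies that must run in a valid order. The sketch below uses the standard library's graphlib module (Python 3.9 or later) to show that idea; it does not model scheduling, retries or monitoring, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set


def run_workflow(tasks: Dict[str, Callable[[], None]],
                 dependencies: Dict[str, Set[str]]) -> List[str]:
    """Run every task after the tasks it depends on; return the execution order."""
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        tasks[name]()
    return order


if __name__ == "__main__":
    log = []
    tasks = {
        "extract": lambda: log.append("read source data"),
        "transform": lambda: log.append("filter and aggregate"),
        "load": lambda: log.append("load analytical data store"),
        "report": lambda: log.append("refresh dashboard"),
    }
    # Each key maps to the set of tasks that must finish before it starts.
    dependencies = {"transform": {"extract"}, "load": {"transform"}, "report": {"load"}}
    print(run_workflow(tasks, dependencies))
```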
The benefits of using an ‘open’ Big
Data reference architecture

• It provides a common language for the various
stakeholders;
• It encourages adherence to common standards,
specifications, and patterns;
• It provides consistent methods for implementation
of technology to solve similar problem sets;
The benefits of using an ‘open’ Big
Data reference architecture

• It illustrates and improves understanding of the
various Big Data components, processes, and
systems, in the context of a vendor- and technology-
agnostic Big Data conceptual model;
• It facilitates analysis of candidate standards for
interoperability, portability, reusability, and
extendibility.
Big Data Architecture application in
industry
• Use cases of Big Data based on Architectural
components
Types of Big Data Architecture

• Lambda Architecture
• The lambda architecture is an approach to big data
processing that aims to achieve low-latency updates
while maintaining the highest possible accuracy.
• It is divided into three layers. The first, the “batch layer”, is
composed of a distributed file system which stores
the entirety of the collected data.
• The same layer stores a set of predefined functions
to be run on the dataset to produce what is called a
batch view. These views are stored in a database
constituting the “serving layer”, from which they
can be queried interactively by the user.
Types of Big Data Architecture

• Lambda Architecture
• The third layer, called the “speed layer”, computes
incremental functions on the new data as it
arrives in the system.
• It processes only the data generated
between two consecutive recomputations of the
batch views, and it produces real-time
views which are also stored in the
serving layer. The different views are queried
together to obtain the most accurate possible
results.
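The interplay of the three layers can be shown with a toy in-memory model that counts events per key. It is a sketch of the idea only: the class and method names are hypothetical, and a real system would use a distributed file system, a stream processor and a database in place of these Python objects.

```python
from collections import Counter


class LambdaCounter:
    """Toy Lambda architecture that counts events per key."""

    def __init__(self) -> None:
        self.master = []                 # batch layer: append-only master dataset
        self.batch_view = Counter()      # serving layer: precomputed batch view
        self.realtime_view = Counter()   # serving layer: view produced by the speed layer

    def ingest(self, key: str) -> None:
        self.master.append(key)          # every event is stored in the master dataset
        self.realtime_view[key] += 1     # speed layer: incremental update on arrival

    def recompute_batch_view(self) -> None:
        self.batch_view = Counter(self.master)  # batch layer: full recomputation
        self.realtime_view.clear()       # these events are now covered by the batch view

    def query(self, key: str) -> int:
        # Both views are queried together to get the most accurate result.
        return self.batch_view[key] + self.realtime_view[key]


if __name__ == "__main__":
    system = LambdaCounter()
    for k in ["a", "a", "b"]:
        system.ingest(k)
    system.recompute_batch_view()
    system.ingest("a")                   # arrives after the last batch recomputation
    print(system.query("a"))
```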
Types of Big Data Architecture

• Lambda Architecture advantages:

• Provides better accuracy, higher throughput and
lower latency for reads and updates simultaneously,
without compromising data consistency.
• Is more resilient thanks to the distributed file system
used to store the master dataset, mostly because it is
less subject to human error (such as unintended bulk
deletions) than a traditional RDBMS.
• Helps achieve the main requirements of a reliable Big
Data system, among them robustness and fault
tolerance, which are provided through the batch layer.
Types of Big Data Architecture

• Lambda Architecture disadvantages:

• The multiple layers make the architecture complex.
Synchronization between the layers can be expensive,
so it has to be handled carefully.
• Support and maintenance become difficult because the batch
and speed layers are distinct and distributed.
• Many technologies have emerged that can help build a
Lambda architecture, but finding people who
have mastered these technologies can be difficult.
• It can be difficult to implement this architecture with open-source
technologies, and the difficulty increases if it has to be
implemented in the cloud.
• Maintaining the code is also difficult, because
it has to produce the same results across the distributed system.
Types of Big Data Architecture

• Lambda Architecture disadvantages:

• One of the major disadvantages of this
architecture is the need to maintain two
similar code bases, one in the speed layer and
another in the batch layer, to perform the
same computation on different sets of data.
• That implies redundancy, and it requires two
different sets of skills to write the
logic for the streaming data and for the batch data.
Types of Big Data Architecture

• Lambda Architecture Implementation

• A particularly suitable application of the Lambda
architecture is log ingestion and analytics.
The reason is that log messages are immutable and
often generated at high speed in systems that
need to offer high availability.
• The Lambda architecture is preferred in cases
where there is an equal need for real-time
analysis of incoming data and for periodic analysis
of the entire repository of collected data. Social
media analysis, especially of tweets, is a typical
example of such an application.
Types of Big Data Architecture

• Lambda Architecture Software Requirement

• Batch layer
• The requirements of the batch layer make
Hadoop the most suitable framework to use for
its implementation. HDFS provides the
append-only technology needed to accommodate the
master dataset.
Types of Big Data Architecture

• Lambda Architecture Software Requirement

• Speed layer. The speed layer can be implemented using real-time
processing tools such as Storm or S4. Spark Streaming can also be used,
although it processes data in micro-batches rather than as true streams. The
advantage is that the Spark code can be reused in the batch layer.

• Serving layer. Any random-access NoSQL database can
host the real-time and batch views. Some examples
are HBase, CouchDB, Voldemort or even MongoDB.
Cassandra is particularly preferred because of the
fast writes that it provides.
Types of Big Data Architecture

• Lambda Architecture Software Requirement

• Queuing system. A queuing system is necessary to ensure
asynchronous and fault-tolerant transmission of the real-time
data to the batch and speed layers. Popular options include Apache
Kafka and Flume.
Types of Big Data Architecture

• Kappa Architecture

• The Kappa architecture was proposed to reduce the overhead that the lambda
architecture incurs by maintaining two separate
code bases for stream and batch processing.

• Its author, Jay Kreps, observed that the need for a batch
processing system came from the need to reprocess previously
streamed data when the code changed. In the Kappa
architecture the batch layer is removed and the speed layer
is enhanced to offer reprocessing capabilities.
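Reprocessing in a Kappa-style system amounts to replaying the retained event log through the new version of the streaming code. The sketch below models the log as a plain list and the two code versions as hypothetical functions; it illustrates the principle rather than any particular streaming platform.

```python
from typing import Callable, Dict, Iterable

View = Dict[str, int]


def build_view(log: Iterable[str], process: Callable[[View, str], None]) -> View:
    """Replay the log from the beginning through `process` to build a fresh view."""
    view: View = {}
    for record in log:
        process(view, record)
    return view


def count_v1(view: View, record: str) -> None:
    """Original streaming logic: count records exactly as they arrive."""
    view[record] = view.get(record, 0) + 1


def count_v2(view: View, record: str) -> None:
    """Changed logic: count case-insensitively."""
    key = record.lower()
    view[key] = view.get(key, 0) + 1


if __name__ == "__main__":
    retained_log = ["GET", "get", "POST"]
    old_view = build_view(retained_log, count_v1)
    new_view = build_view(retained_log, count_v2)   # same code path, replayed from the start
    print(old_view, new_view)
```

Because the same streaming code handles both live data and replays, there is only one code base to maintain.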
References
• Kalipe, Godson & Behera, Rajat. (2019). Big Data Architectures: A
detailed and application oriented review.
• OmniSci. (2020).
https://www.omnisci.com/technical-glossary/big-data-architecture
• Big Data Architecture Style.
https://docs.microsoft.com/en-us/azure/architecture/guide/architecture-styles/big-data#:~:text=A%20big%20data%20architecture%20is,big%20data%20sources%20at%20rest.
