Chapter 6 - Big Data Architecture Part 1

This document provides an overview of big data architecture. It defines big data architecture and describes its key components, including data sources, storage, processing, analytics and reporting. It also outlines best practices for building a big data architecture, such as analyzing business needs, selecting vendors, deployment strategies, capacity planning and disaster recovery. The benefits of an open reference architecture are also discussed.

Uploaded by

Suren Dev
Copyright
© All Rights Reserved

Big Data Architecture

Part 1
Introduction
The non-stop growth of data, the frantic release of
new electronic devices, and the trend toward
data-driven decision-making in companies are fueling
a constant demand for more efficient Big Data
processing systems.

Investment in Big Data architecture has grown
rapidly in recent years and, according to Gartner,
businesses will keep investing more in IT in 2018
and 2019, focusing on IoT, blockchain, and Big Data.
Introduction
Big Data refers to huge amounts of heterogeneous
data from both traditional and new sources, growing
at a higher rate than ever.

Because this data is so heterogeneous and comes from
both inside and outside the organization, it is a
challenge to build systems that can process and
analyze such volumes centrally and efficiently.
Big Data Architecture
Definition
“Big data architecture refers to the logical and
physical structure that dictates how high volumes of
data are ingested, processed, stored, managed, and
accessed.” (OmniSci, 2020)

“A Big data architecture describes the blueprint of a
system handling massive volume of data during its
storage, processing, analysis and visualization.”
(Kalipe & Behera, 2019)
Big Data Architecture
Definition
The big data architecture framework serves as a
reference blueprint for big data infrastructures and
solutions, logically defining how big data solutions
will work, the components that will be used, how
information will flow, and security details.
How to Build a Big Data
Architecture
Designing a big data reference architecture, while
complex, generally follows the same procedure:
• Analyze the Problem
• Select a Vendor
• Deployment Strategy
• Capacity Planning
• Infrastructure Sizing
• Plan for Disaster Recovery
How to Build a Big Data
Architecture
• Analyze the Problem
• First determine if the business does in fact have a big
data problem, taking into consideration criteria such
as data variety, velocity, and challenges with the
current system.
• Common use cases include data archival, process
offload, data lake implementation, unstructured data
processing, and data warehouse modernization.
How to Build a Big Data
Architecture
• Select a Vendor
• Hadoop is one of the most widely recognized tools
for managing a big data architecture end to end.
• Popular vendors of Hadoop distributions include
Amazon Web Services, BigInsights, Cloudera,
Hortonworks, MapR, and Microsoft.
How to Build a Big Data
Architecture
• Deployment Strategy
• Deployment can be on-premises, which tends to be
more secure; cloud-based, which is cost-effective
and scales flexibly; or a hybrid of the two.
How to Build a Big Data
Architecture
• Capacity Planning
• When planning hardware and infrastructure sizing,
consider the daily data ingestion volume, the data
volume of the one-time historical load, the data
retention period, multi-data center deployment, and
the time period for which the cluster is sized.
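
The capacity-planning inputs above can be combined into a rough storage estimate. The sketch below is only an illustration: the function name, the replication factor of 3, and the 25% headroom are assumptions, not figures from this chapter or from any vendor.

```python
def required_raw_storage_tb(daily_ingest_tb, historical_load_tb,
                            retention_days, replication_factor=3,
                            headroom=0.25):
    """Estimate the raw disk capacity (in TB) a cluster needs.

    Every default here is an illustrative assumption.
    """
    # Logical data held once the retention window is full.
    logical_tb = historical_load_tb + daily_ingest_tb * retention_days
    # A distributed file store keeps several copies of each block.
    replicated_tb = logical_tb * replication_factor
    # Reserve free space for temporary and intermediate job output.
    return replicated_tb * (1 + headroom)


estimate = required_raw_storage_tb(daily_ingest_tb=0.5,
                                   historical_load_tb=100,
                                   retention_days=365)
print(f"Raw storage needed: {estimate:.1f} TB")
```

Multi-data center deployment would multiply this figure again by the number of sites that hold a full copy.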
How to Build a Big Data
Architecture
• Infrastructure Sizing
• This is based on capacity planning and determines
the number of clusters or environments required and
the type of hardware required.
• Consider the type of disk and number of disks per
machine, the type and size of memory, the number of
CPUs and cores, and the data retained and stored in
each environment.
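
As a rough illustration of how disk counts translate into machine counts, the sketch below divides a storage requirement by per-node capacity. The function name and the sample figures (12 disks of 4 TB each) are hypothetical.

```python
import math


def data_nodes_needed(raw_storage_tb, disks_per_node, disk_size_tb):
    """Return how many machines are needed to hold raw_storage_tb."""
    per_node_tb = disks_per_node * disk_size_tb
    # Round up: a fractional machine still has to be bought whole.
    return math.ceil(raw_storage_tb / per_node_tb)


nodes = data_nodes_needed(raw_storage_tb=1060, disks_per_node=12,
                          disk_size_tb=4)
print(f"Data nodes needed: {nodes}")
```

Storage is only one constraint. Memory size and core count per machine may call for more nodes than this figure suggests.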
How to Build a Big Data
Architecture
• Plan for Disaster Recovery
• In developing a backup and disaster recovery plan,
consider the criticality of the data stored, the
Recovery Point Objective (RPO) and Recovery Time
Objective (RTO) requirements, the backup interval,
multi-data center deployment, and whether Active-Active
or Active-Passive disaster recovery is most appropriate.
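
One way to relate the backup interval to the two objectives is sketched below. It assumes simple periodic backups, where the worst-case data loss equals one full backup interval. The function name and the sample durations are hypothetical.

```python
from datetime import timedelta


def plan_meets_objectives(backup_interval, restore_time, rpo, rto):
    """Check a periodic-backup plan against its RPO and RTO."""
    # Worst case: the failure happens just before the next backup,
    # so everything written since the previous backup is lost.
    worst_case_loss = backup_interval
    return worst_case_loss <= rpo and restore_time <= rto


four_hourly_ok = plan_meets_objectives(
    backup_interval=timedelta(hours=4), restore_time=timedelta(hours=2),
    rpo=timedelta(hours=6), rto=timedelta(hours=3))
daily_ok = plan_meets_objectives(
    backup_interval=timedelta(hours=24), restore_time=timedelta(hours=2),
    rpo=timedelta(hours=6), rto=timedelta(hours=3))
print(four_hourly_ok, daily_ok)
```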
General Big Data architecture
General Big Data
architecture
• Big data solutions typically involve one or
more of the following types of workload:
• Batch processing of big data sources at rest.
• Real-time processing of big data in motion.
• Interactive exploration of big data.
• Predictive analytics and machine learning.
General Big Data
architecture
• Most big data architectures include some or all of
the following components:
• Data Sources
• Data Storage
• Batch Processing
• Real-time message ingestion
• Stream Processing
• Analytical Data Store
• Analysis & Reporting
• Orchestration
General Big Data
architecture
• Data sources: All big data solutions start with
one or more data sources. Examples include:
• Application data stores, such as relational
databases.
• Static files produced by applications, such as
web server log files.
• Real-time data sources, such as IoT devices.
General Big Data
architecture
• Data storage:
• Data for batch processing operations is
typically stored in a distributed file store that
can hold high volumes of large files in various
formats. This kind of store is often called a
data lake.
• Examples: Azure Data Lake Store or blob
containers in Azure Storage.
General Big Data
architecture
• Batch processing
• Because the data sets are so large, often a big data solution
must process data files using long-running batch jobs to
filter, aggregate, and otherwise prepare the data for analysis.
• Usually these jobs involve reading source files, processing
them, and writing the output to new files.
• Options include running U-SQL jobs in Azure Data Lake
Analytics, using Hive, Pig, or custom Map/Reduce jobs in an
HDInsight Hadoop cluster, or using Java, Scala, or Python
programs in an HDInsight Spark cluster.
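
The read, process, and write pattern described above can be sketched in plain Python on a single machine. This is a toy stand-in for a distributed job, not U-SQL, Hive, or Spark code. The file layout, the column names `url` and `status`, and the function name are assumptions made for the example.

```python
import csv
import tempfile
from collections import Counter
from pathlib import Path


def run_batch_job(source_dir, output_file):
    """Count HTTP 5xx responses per URL across all CSV logs in source_dir."""
    errors = Counter()
    for log_file in sorted(source_dir.glob("*.csv")):    # read source files
        with log_file.open(newline="") as f:
            for row in csv.DictReader(f):
                if row["status"].startswith("5"):        # filter
                    errors[row["url"]] += 1              # aggregate
    with output_file.open("w", newline="") as f:         # write a new file
        writer = csv.writer(f)
        writer.writerow(["url", "error_count"])
        writer.writerows(sorted(errors.items()))
    return errors


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp)
    (src / "day1.csv").write_text("url,status\n/a,200\n/a,500\n/b,503\n")
    (src / "day2.csv").write_text("url,status\n/a,502\n/b,200\n")
    out_dir = src / "output"
    out_dir.mkdir()
    result = run_batch_job(src, out_dir / "errors_by_url.csv")
    summary_lines = (out_dir / "errors_by_url.csv").read_text().splitlines()

print(result)
```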
General Big Data
architecture
• Real-time message ingestion
• If the solution includes real-time sources, the architecture must
include a way to capture and store real-time messages for stream
processing.
• This might be a simple data store, where incoming messages are
dropped into a folder for processing.
• However, many solutions need a message ingestion store to act
as a buffer for messages, and to support scale-out processing,
reliable delivery, and other message queuing semantics. Options
include Azure Event Hubs, Azure IoT Hub, and Kafka.
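
The buffering role of a message ingestion store can be illustrated with an in-process bounded queue. This is only a sketch: `queue.Queue` stands in for a service such as Event Hubs or Kafka, and the message fields and sizes are made up.

```python
import queue
import threading

# Bounded buffer standing in for the message ingestion store.
buffer = queue.Queue(maxsize=100)
SENTINEL = None          # marks the end of the stream in this demo
received = []


def producer(n):
    for i in range(n):
        # put() blocks while the buffer is full, so a slow consumer
        # slows the producer down instead of losing messages.
        buffer.put({"device": "sensor-1", "reading": i})
    buffer.put(SENTINEL)


def consumer():
    while True:
        msg = buffer.get()
        if msg is SENTINEL:
            break
        received.append(msg["reading"])


t_prod = threading.Thread(target=producer, args=(1000,))
t_cons = threading.Thread(target=consumer)
t_prod.start()
t_cons.start()
t_prod.join()
t_cons.join()
print(len(received))
```

A real ingestion store adds what this sketch lacks: durable storage, multiple consumers, and delivery guarantees across machine failures.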
General Big Data
architecture
• Stream processing
• After capturing real-time messages, the solution must
process them by filtering, aggregating, and otherwise
preparing the data for analysis.
• The processed stream data is then written to an output sink.
• Example: Azure Stream Analytics provides a managed stream
processing service based on perpetually running SQL queries
that operate on unbounded streams. You can also use open
source Apache streaming technologies like Storm and Spark
Streaming in an HDInsight cluster.
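
A minimal sketch of stream aggregation follows, using fixed-size (tumbling) time windows. It is plain Python rather than Stream Analytics, Storm, or Spark code, and it assumes events arrive in timestamp order. The function name and the event shape are hypothetical.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Yield (window_start, {key: count}) for each non-empty window.

    events: iterable of (timestamp_seconds, key), assumed time-ordered.
    """
    current_start = None
    counts = defaultdict(int)
    for ts, key in events:
        start = ts - ts % window_seconds
        if current_start is not None and start != current_start:
            # The window has closed: emit its result to the output sink.
            yield current_start, dict(counts)
            counts.clear()
        current_start = start
        counts[key] += 1
    if current_start is not None:
        yield current_start, dict(counts)    # flush the last window


events = [(0, "a"), (3, "b"), (9, "a"), (10, "a"), (25, "b")]
windows = list(tumbling_window_counts(events, 10))
print(windows)
```

Because the function is a generator, it also works on an unbounded source. Each window is emitted as soon as the first event of the next window arrives.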
General Big Data
architecture
• Analytical data store:
• Many big data solutions prepare data for analysis and then serve
the processed data in a structured format that can be queried
using analytical tools.
• The analytical data store used to serve these queries can be a
Kimball-style relational data warehouse, as seen in most traditional
business intelligence (BI) solutions.
• Alternatively, the data could be presented through a low-latency
NoSQL technology such as HBase, or an interactive Hive database
that provides a metadata abstraction over data files in the
distributed data store.
General Big Data
architecture
• Analysis and reporting:
• The goal of most big data solutions is to provide
insights into the data through analysis and
reporting.
• To empower users to analyze the data, the
architecture may include a data modeling layer,
such as a multidimensional OLAP cube or tabular
data model.
General Big Data
architecture
• Analysis and reporting:
• Analysis and reporting can also take the form of
interactive data exploration by data scientists or data
analysts.
• For these scenarios, many Azure services support
analytical notebooks, such as Jupyter, enabling these
users to leverage their existing skills with Python or
R. For large-scale data exploration, you can use
Microsoft R Server, either standalone or with Spark.
General Big Data
architecture
• Orchestration
• Most big data solutions consist of repeated data
processing operations, encapsulated in workflows, that
transform source data, move data between multiple
sources and sinks, load the processed data into an
analytical data store, or push the results straight to a
report or dashboard.
• To automate these workflows, you can use an
orchestration technology such as Azure Data Factory
or Apache Oozie and Sqoop.
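
At its core, an orchestrator runs steps in dependency order. The sketch below shows that idea with the standard-library `graphlib` module (Python 3.9+). It is not Data Factory or Oozie code, and the step names are hypothetical.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps that must finish before it starts.
workflow = {
    "ingest": set(),
    "transform": {"ingest"},
    "load_warehouse": {"transform"},
    "refresh_dashboard": {"load_warehouse"},
}

run_log = []

# Placeholder tasks: a real step would move or transform data.
tasks = {name: (lambda n=name: run_log.append(n)) for name in workflow}


def run_workflow(graph, step_functions):
    """Run every step after all of its dependencies have run."""
    for step in TopologicalSorter(graph).static_order():
        step_functions[step]()


run_workflow(workflow, tasks)
print(run_log)
```

Production orchestrators add scheduling, retries, and monitoring on top of this ordering.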
The benefits of using an ‘open’ Big
Data reference architecture

• It provides a common language for the various
stakeholders;
• It encourages adherence to common standards,
specifications, and patterns;
• It provides consistent methods for implementation
of technology to solve similar problem sets;
The benefits of using an ‘open’ Big
Data reference architecture

• It illustrates and improves understanding of the
various Big Data components, processes, and
systems, in the context of a vendor- and technology-
agnostic Big Data conceptual model;
• It facilitates analysis of candidate standards for
interoperability, portability, reusability, and
extendibility.
Big Data Architecture application in
industry
• Use cases of Big Data based on Architectural
components
Types of Big Data Architecture

• Lambda Architecture
• The lambda architecture is an approach to big data
processing that aims to achieve low latency updates
while maintaining the highest possible accuracy.
• It is divided into three layers. The first, the
“batch layer”, is composed of a distributed file
system that stores the entirety of the collected data.
• The same layer stores a set of predefined functions
that are run on the dataset to produce what are
called batch views. Those views are stored in a
database that constitutes the “serving layer”, from
which the user can query them interactively.
Types of Big Data Architecture

• Lambda Architecture
• The third layer, called the “speed layer”, computes
incremental functions on new data as it arrives in
the system.
• It processes only the data generated between two
consecutive recomputations of the batch views, and
it produces real-time views that are also stored in
the serving layer. The different views are queried
together to obtain the most accurate possible
results.
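
The interplay of the three layers can be sketched in a few lines of Python. This is a single-process toy model with hypothetical names, not a reference implementation. It uses page-view counting as the computation.

```python
from collections import Counter

master_dataset = []         # batch layer: append-only store of all events
batch_view = Counter()      # serving layer: result of the last batch run
realtime_view = Counter()   # serving layer: increments since that run


def ingest(event):
    """New data goes to the master dataset and to the speed layer."""
    master_dataset.append(event)
    realtime_view[event] += 1


def recompute_batch_view():
    """Batch layer: rebuild the view from the entire master dataset."""
    batch_view.clear()
    batch_view.update(master_dataset)
    # The batch view now covers everything, so the increments are dropped.
    realtime_view.clear()


def query(key):
    """Serving layer: merge both views to answer a query."""
    return batch_view[key] + realtime_view[key]


for page in ["home", "home", "cart"]:
    ingest(page)
recompute_batch_view()
ingest("home")              # arrives after the batch run
home_views = query("home")
print(home_views)
```

In a real deployment the batch run takes hours, so the speed layer must keep absorbing events while it executes. The sketch skips that by running everything in sequence.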
Types of Big Data Architecture

• Lambda Architecture advantages:
• It provides better accuracy, higher throughput, and
lower latency for reads and updates simultaneously,
without compromising data consistency.
• It is more resilient thanks to the distributed file
system used to store the master dataset, mostly
because that store is less subject to human error
(such as unintended bulk deletions) than a
traditional RDBMS.
• It helps achieve the main requirements of a reliable
Big Data system, among which are robustness and
fault tolerance, provided through the batch layer.
Types of Big Data Architecture

• Lambda Architecture disadvantages:
• The different layers make the architecture complex.
Synchronization between the layers can be expensive, so
it has to be handled carefully.
• Support and maintenance are difficult because the batch
and speed layers are distinct and distributed.
• Many technologies have emerged that can help in building
a Lambda architecture, but finding people who have
mastered them can be difficult.
• It can be difficult to apply this architecture with
open-source technologies, and the difficulty increases if
it has to be implemented in the cloud.
• Maintaining the code is also difficult, because it has to
produce the same results across the distributed system.
Types of Big Data Architecture

• Lambda Architecture disadvantages:
• One of the major disadvantages of this
architecture is the need to maintain two
similar code bases, one in the speed layer and
another in the batch layer, that perform the
same computation on different sets of data.
• That implies redundancy, and it requires two
different sets of skills to write the logic for
the streaming data and for the batch data.
Types of Big Data Architecture

• Lambda Architecture Implementation
• A particularly suitable application of the Lambda
architecture is log ingestion and analytics. The
reason is that log messages are immutable and
often generated at high speed in systems that
need to offer high availability.
• The Lambda Architecture is preferred in cases
where there is an equal need for real-time analysis
of incoming data and for periodic analysis of the
entire repository of collected data. Social media
analysis, especially the analysis of tweets, is a
good example of such an application.
Types of Big Data Architecture

• Lambda Architecture Hardware Requirement


Types of Big Data Architecture

• Lambda Architecture Software Requirement
• Batch layer
• The requirements of the batch layer make Hadoop
the most suitable framework for its implementation.
HDFS provides the append-only storage needed to
accommodate the master dataset.
Types of Big Data Architecture

• Lambda Architecture Software Requirement

• Speed layer. The speed layer can be implemented using real-time
processing tools such as Storm or S4. Spark Streaming can also be used,
although it treats data in micro-batches rather than as true streams. The
advantage is that the Spark code can be reused in the batch layer.

• Serving layer. Any random-access NoSQL database can
host the real-time and batch views. Some examples
are HBase, CouchDB, Voldemort, or even MongoDB.
Cassandra is particularly preferred because of the
fast writes that it provides.
Types of Big Data Architecture

• Lambda Architecture Software Requirement

• Queuing system. A queuing system is necessary to ensure
asynchronous and fault-tolerant transmission of the real-time
data to the batch and speed layers. Popular options include
Apache Kafka and Flume.
Types of Big Data Architecture

• Kappa Architecture

• The Kappa architecture was proposed to reduce the lambda
architecture’s overhead that came with handling two separate
code bases for stream and batch processing.

• Its author, Jay Kreps, observed that the necessity of a batch
processing system came from the need to reprocess previously
streamed data when the code changed. In the Kappa
architecture the batch layer was removed and the speed layer
enhanced to offer reprocessing capabilities.
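
Reprocessing in a Kappa-style system amounts to replaying the retained event log through the new version of the code. The sketch below illustrates the idea under simplifying assumptions: the log is an in-memory list, and the event fields and function names are hypothetical.

```python
# Append-only, retained event log (stand-in for a replayable stream).
log = [
    {"user": "ann", "bot": False},
    {"user": "bot7", "bot": True},
    {"user": "ann", "bot": False},
]


def build_view(process, events):
    """Stream every event through `process` to build a fresh view."""
    view = {}
    for event in events:
        process(view, event)
    return view


def count_v1(view, event):
    """Original logic: count every event per user."""
    view[event["user"]] = view.get(event["user"], 0) + 1


def count_v2(view, event):
    """Changed logic: ignore bot traffic."""
    if not event["bot"]:
        view[event["user"]] = view.get(event["user"], 0) + 1


view_v1 = build_view(count_v1, log)
# The code changed: replay the same log with the new logic, then
# switch readers to the new view. No separate batch code base is needed.
view_v2 = build_view(count_v2, log)
print(view_v1, view_v2)
```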
References
• Kalipe, Godson & Behera, Rajat. (2019). Big Data Architectures: A
detailed and application oriented review.
• OmniSci. (2020). Big Data Architecture.
https://www.omnisci.com/technical-glossary/big-data-architecture
• Microsoft. Big Data Architecture Style.
https://docs.microsoft.com/en-us/azure/architecture/guide/architecture-styles/big-data
