Big Data Architecture

This document describes the architecture of big data systems. It discusses the different layers, including data sources, ingestion, storage, analysis and consumption. It also covers cross-layer operations such as connecting data sources, governance, systems management and quality of service. Finally, it explains the Lambda architecture, which combines batch and real-time processing for queries, and the related Kappa architecture.
• The architecture of Big Data consists of methods and mechanisms for collecting and storing data, securing it, processing it, and then converting it into database structures and file systems.
• Analysis tools help us analyze the collected data and then make intelligent decisions on the basis of it.
• Hence, the greater the amount of data collected and analyzed, the better the decision-making ability of the machine or device.

Big Data Architecture: A very generic design
• The architecture of Big Data consists of multiple layers:
• Layer 1. Big data sources layer
• Layer 2. Data ingestion layer
• Layer 3. Data messaging and storage layer
• Layer 4. Analysis layer
• Layer 5. Consumption layer
• These logical layers are as follows:

• Big data sources layer: The data that comes into a big data system has many different sources. These sources can be company servers, third-party data providers and various sensors related to companies. Big Data can take in and store data in two modes, namely real-time mode and batch mode. Examples of data sources include applications and software such as MS Office docs, ERP systems, Relational Database Management Systems (RDBMS), mobile devices, social media, sensors and data warehouses.

Data Ingestion layer
• Data ingestion is the process of obtaining and importing data for immediate use or storage in a database. To ingest something is to take something in or absorb something.
• Data can be streamed in real time or ingested in batches. In real-time data ingestion, each data item is imported as the source emits it. When data is ingested in batches, data items are imported in discrete chunks at periodic intervals of time (both modes are contrasted in the sketch under cross-layer operation 1 below). The first step in an effective data ingestion process is to prioritize the data sources. Individual files must be validated and data items routed to their correct destinations.

• Data messaging and storage layer: All the data from the various sources is received by this layer. If the data received is unstructured, or is not in a format the analytic tools can understand, this layer converts it into a format the analysis tools can read. In Big Data, unstructured data is stored in specialized file systems such as the Hadoop Distributed File System (HDFS) or in a NoSQL database, whereas structured data is stored in an RDBMS.

• Analysis layer: This layer deals with the analysis of the stored data. Here the stored data is analyzed to extract trends and business intelligence from it. Many different sorts of tools operate in a big data environment. For the analysis of structured data, techniques such as sampling are used, whereas unstructured data requires advanced, specialized analytics toolsets.

• Consumption layer: All the analyzed data is received by this layer. The task of this layer is to present the analyzed data as output to the desired receiver. There are various types of outputs, such as applications, business processes and human viewers.

• Four types of processes operate in between these logical layers. These cross-layer operations are:
• connecting to data sources,
• governance,
• systems management and
• quality of service (QoS).

Cross-layer operations
• 1. Connecting to data sources: Data arrives in a big data system at a very fast rate. In order to quickly receive and analyze this data, we need connections that can support these actions at that rate. For that, the architecture requires adapters and connectors that can connect the data sources to the storage system, protocols and networks.
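To make the ingestion modes and the connector idea concrete, here is a minimal, illustrative sketch in Python. All names (store, batch_ingest, stream_ingest, the destination labels) are hypothetical stand-ins for real adapters and connectors such as a Kafka consumer or an HDFS writer.

```python
from typing import Iterable, Iterator

def store(records: list, destination: str) -> None:
    # Stand-in for a storage-layer write (e.g., HDFS, NoSQL, RDBMS).
    print(f"wrote {len(records)} record(s) to {destination}")

def batch_ingest(source: Iterable[str], chunk_size: int = 100) -> None:
    """Batch mode: import data items in discrete chunks at intervals."""
    chunk = []
    for record in source:
        chunk.append(record)
        if len(chunk) == chunk_size:
            store(chunk, "batch_store")
            chunk = []
    if chunk:  # flush the final partial chunk
        store(chunk, "batch_store")

def stream_ingest(source: Iterator[str]) -> None:
    """Real-time mode: import each data item as the source emits it."""
    for record in source:
        store([record], "stream_store")

if __name__ == "__main__":
    events = (f"event-{i}" for i in range(250))
    batch_ingest(events, chunk_size=100)  # writes chunks of 100, 100, 50
    stream_ingest(iter(["late-event-1", "late-event-2"]))
```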
Cross-layer operations
• 2. Governing big data: The architecture of Big Data provides for both the privacy and the security of the data that it receives and analyzes.
• Organizations using Big Data have a choice: use a security tool of their own on the analytics storage system, invest in specialized software to keep their Hadoop environment safe and secure, or sign an agreement with their cloud Hadoop provider that provides service-level security.
• The policies that deal with the protection and security of data should cover the whole process, from data ingestion through analysis to the deletion or archiving of data.

Cross-layer operations
• 3. Managing systems: The architecture of Big Data is a large-scale, distributed cluster with highly scalable performance and capacity.
• The health of the system should be checked regularly and continuously with the help of central management system consoles.
• If the consumer is using the cloud as an environment for Big Data, they should establish and monitor strong Service Level Agreements (SLAs) with their cloud provider.

Cross-layer operations
• 4. Quality of service: Quality of service (QoS) is an important aspect of Big Data. It is the framework that helps define the quality of data, security and compliance policies, the sizes and frequency of incoming data sets, and the filtering of data.

Lambda Architecture
• The Lambda Architecture is a deployment model for data processing that organizations use to combine a traditional batch pipeline with a fast real-time stream pipeline for data access.
• It is a common architecture model in IT and development organizations' toolkits as businesses strive to become more data-driven and event-driven in the face of massive volumes of rapidly generated data, often referred to as "big data."
• The Lambda Architecture contains both a traditional batch data pipeline and a fast streaming pipeline for real-time data, as well as a serving layer for responding to queries.
• Five main components of the Lambda Architecture:
• New Data. This component represents the new data coming in from a variety of sources, which can then be included in the Lambda Architecture for analysis.
• This component is oftentimes a streaming source like Apache Kafka, which is not the original data source but an intermediary store that can hold data in order to serve both the batch layer and the speed layer of the Lambda Architecture.
• The data is delivered simultaneously to both the batch layer and the speed layer to enable a parallel indexing effort, as sketched below.
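The following toy sketch (plain Python, not a real Kafka client) shows the dual-delivery idea: every incoming record is handed to both the batch layer and the speed layer. The batch_log and speed_index structures are hypothetical stand-ins for the two layers.

```python
batch_log = []    # batch layer: durable, append-only history
speed_index = {}  # speed layer: small index of the newest data

def deliver(record: dict) -> None:
    # Simultaneous delivery: the same record feeds both layers,
    # enabling the parallel indexing effort described above.
    batch_log.append(record)            # -> batch layer
    speed_index[record["id"]] = record  # -> speed layer

for i in range(3):
    deliver({"id": i, "value": f"reading-{i}"})

print(len(batch_log), "records in the batch log")
print(sorted(speed_index))  # ids already visible in near real time
```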
• Batch Layer. This component saves all data coming
into the system as batch views in preparation for indexing.
• The input data is saved in a model that looks like a series of changes/updates made to a system of record, similar to the output of a change data capture (CDC) system. Oftentimes this is simply a file in the comma-separated values (CSV) format.
• The data is treated as immutable and append-only to ensure a trusted historical record of all incoming data. A technology like Apache Hadoop is often used as a system for ingesting the data as well as storing it in a cost-effective way. A minimal append-only sketch follows.
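Here is a minimal sketch of the immutable, append-only storage model using a local CSV file. The file name and record fields are hypothetical; a real deployment would typically write to HDFS rather than a local file.

```python
import csv

BATCH_FILE = "batch_views.csv"  # hypothetical stand-in for a file on HDFS

def append_change(entity_id: str, field: str, new_value: str) -> None:
    # Append-only: existing rows are never updated or deleted, so the
    # file remains a trusted historical record of CDC-style changes.
    with open(BATCH_FILE, "a", newline="") as f:
        csv.writer(f).writerow([entity_id, field, new_value])

append_change("user-42", "email", "old@example.com")
append_change("user-42", "email", "new@example.com")  # a later update

# Current state is recovered by replaying the full change history.
state = {}
with open(BATCH_FILE) as f:
    for entity, field, value in csv.reader(f):
        state.setdefault(entity, {})[field] = value
print(state)  # {'user-42': {'email': 'new@example.com'}}
```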
• Serving Layer. This layer incrementally indexes
the latest batch views to make them queryable by end users. This layer can also reindex all data to fix a coding bug or to create different indexes for different use cases.
• The key requirement in the serving layer is that the processing is done in an extremely parallelized way to minimize the time to index the data set. While an indexing job runs, newly arriving data is queued up for indexing in the next indexing job.
• Speed Layer. This layer complements the serving layer by indexing the most recently added data not yet fully indexed by the serving layer. This includes the data that the serving layer is currently indexing as well as new data that arrived after the current indexing job started. Since there is an expected lag between the time the latest data is added to the system and the time it is available for querying (due to the time it takes to perform the batch indexing work), it is up to the speed layer to index the latest data and narrow this gap.
• This layer typically leverages stream processing software to index the incoming data in near real time, minimizing the latency before the data becomes available for querying (a small stream-indexing sketch follows). When the Lambda Architecture was first introduced, Apache Storm was a leading stream processing engine used in deployments, but other technologies have since gained popularity as candidates for this component (like Hazelcast Jet, Apache Flink, and Apache Spark Streaming).
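As a minimal illustration of speed-layer stream indexing, the sketch below applies an incremental, per-event update so a running aggregate is queryable immediately, without waiting for the next batch indexing job. The event shape and index are hypothetical; real deployments would use an engine such as Flink or Spark Streaming.

```python
from collections import defaultdict

page_views = defaultdict(int)  # speed-layer index: view counts per page

def on_event(event: dict) -> None:
    # Incremental update: O(1) work per event, queryable immediately.
    page_views[event["page"]] += 1

for e in [{"page": "/home"}, {"page": "/docs"}, {"page": "/home"}]:
    on_event(e)

print(page_views["/home"])  # 2 -- available without any batch job
```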
• Query. This component is responsible for
submitting end user queries to both the serving layer and the speed layer and consolidating the results (a query-and-hand-off sketch appears after the benefits list below).
• This gives end users a complete query over all data, including the most recently added data, providing a near-real-time analytics system.

Working of Lambda Architecture
• Data is indexed simultaneously by both the serving layer and the speed layer.
• The batch/serving layers continue to index incoming data in batches. Since the batch indexing takes time, the speed layer complements the batch/serving layers by indexing all the new, unindexed data in near real time.
• This gives you a large and consistent view of data in the batch/serving layers that can be recreated at any time, along with a smaller index that contains the most recent data.
• Once a batch indexing job completes, the newly batch-indexed data is available for querying, so the speed layer's copy of the same data/indexes is no longer needed and is therefore deleted from the speed layer.
• The serving layer then begins indexing the latest data in the system that it has not yet indexed, data which has already been indexed by the speed layer (and so is available for querying at the speed layer).
• This ongoing hand-off between the speed layer and the batch/serving layers ensures that all data is ready for querying and that the latency of data availability is low. When the serving layer completes a job, it moves on to the next batch, and the speed layer discards its copy of the data that the serving layer just indexed.

Benefits of the Lambda Architecture
• Reduced latency: The speed layer uses stream processing technologies to immediately index recent data that is not yet queryable in the batch/serving layers, narrowing the time window of unanalyzable data.
• Data consistency: The indexing process ensures the data reflects the latest state in both the batch and speed layers.
• Scalability: The Lambda Architecture is based on distributed, scale-out technologies that can be expanded by simply adding more nodes.
• Fault tolerance: The Lambda Architecture is based on distributed systems that support fault tolerance, so should a hardware failure occur, other nodes are available to continue the workload.
• Human fault tolerance: If there are any bugs in the indexing code or any omissions, the code can be updated and then rerun to reindex all data.
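The following sketch ties the query component and the hand-off together: queries consult both indexes, and once a batch job completes, the speed layer discards its now-redundant copy. All structures are hypothetical stand-ins for real serving/speed indexes.

```python
serving_index = {"k1": "v1-batch"}  # older, batch-indexed data
speed_index = {"k2": "v2-recent"}   # newest, not yet batch-indexed

def query(key: str):
    # Consolidate results: prefer the speed layer for the newest data,
    # fall back to the serving layer for historical data.
    return speed_index.get(key, serving_index.get(key))

print(query("k1"), query("k2"))  # complete view across both layers

def complete_batch_job() -> None:
    # After batch indexing finishes, the speed layer's copy of that
    # data is redundant and is deleted.
    serving_index.update(speed_index)
    speed_index.clear()

complete_batch_job()
print(query("k2"))  # "v2-recent", now served by the serving layer
```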
Kappa Architecture
• The Kappa Architecture is a software architecture used for processing streaming data. The main premise behind the Kappa Architecture is that you can perform both real-time and batch processing, especially for analytics, with a single technology stack.
• It is based on a streaming architecture in which an incoming series of data is first stored in a messaging engine like Apache Kafka.
• From there, a stream processing engine reads the data, transforms it into an analyzable format, and then stores it in an analytics database for end users to query.
• The Kappa Architecture is useful for on-demand analytics.
• The Kappa Architecture is considered a simpler alternative to the Lambda Architecture, as it uses the same technology stack to handle both real-time stream processing and historical batch processing.
• A streaming architecture is a defined set of technologies that work together to handle stream processing, which is the practice of taking action on a series of data at the time the data is created.
• Streaming layer: The streaming layer delivers low-latency, near-real-time results. It uses incremental algorithms to perform updates, which saves time but sacrifices accuracy.
• Serving layer: The serving layer serves the data computed by the streaming layer. A minimal Kappa-style sketch follows.
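To make the single-stack premise concrete, here is a minimal Kappa-style sketch: one processing function serves both real-time and "batch" needs, with batch results produced by replaying the same immutable log through the same code. The log and table names are hypothetical stand-ins for Kafka and an analytics database.

```python
event_log = []        # stand-in for a Kafka topic (immutable, ordered)
analytics_table = {}  # stand-in for the analytics database

def process(event: dict, table: dict) -> None:
    # The single transformation, used for both live and replayed data.
    table[event["user"]] = table.get(event["user"], 0) + event["amount"]

def on_live_event(event: dict) -> None:
    event_log.append(event)          # retain the event for future replays
    process(event, analytics_table)  # near-real-time update

def replay() -> dict:
    # "Batch" processing = replaying history through the same code path.
    table: dict = {}
    for event in event_log:
        process(event, table)
    return table

on_live_event({"user": "a", "amount": 5})
on_live_event({"user": "a", "amount": 7})
print(analytics_table, replay())  # identical results, one code path
```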
Lambda vs. Kappa architecture
• Layers: Lambda uses separate layers for batch and streaming; Kappa uses a unified layer for both batch and stream.
• Code complexity: Lambda's is higher, i.e., two technology stacks must be maintained for the batch and stream layers; Kappa's is lower, i.e., a single technology stack is maintained for both batch and stream.
• Performance: in Lambda, processing large amounts of data from the database would be expensive; Kappa offers faster performance with its combined batch-and-stream layer.

Technology of Big Data
[Image-only slides: big data technologies]

ADVANTAGE OF BIG DATA TOOLS
[Image-only slide: advantages of big data tools]