Big Data Architecture

The document describes the architecture of big data systems. It discusses the different layers including data sources, ingestion, storage, analysis and consumption. It also covers cross-layer operations like connecting data sources, governance, systems management and quality of service. Finally, it explains the Lambda architecture which combines batch and real-time processing for queries.

Big Data Architecture

• The architecture of Big Data consists of methods
and mechanisms for collecting and storing data,
securing it, processing it, and then converting it
into database structures and file systems.
• Analysis tools help us analyze the collected data
and then make intelligent decisions on the basis
of it.
• Hence, the greater the amount of data collected
and analyzed, the better the decision-making
ability of the machine or device.
Big Data Architecture: A very generic
design
• The architecture of Big Data consists of multiple
layers:
• Layer 1. Big data sources layer
• Layer 2. Data Ingestion layer
• Layer 3. Data messaging and storage layer
• Layer 4. Analysis layer
• Layer 5. Consumption layer
• These logical layers are described as follows:
• Big data sources layer: The data that comes into
a Big Data system originates from many different
sources. These sources can be company servers,
third-party data providers and various sensors
related to companies.
• Big Data systems have the ability to take in and
store data in two modes, namely real-time mode
and batch mode. Some examples of data sources
include applications and software such as MS
Office documents, ERP systems, Relational Database
Management Systems (RDBMS), mobile devices,
social media, sensors and data warehouses.
Data Ingestion layer
• Data ingestion is the process of obtaining and
importing data for immediate use or storage in
a database. To ingest something is to take
something in or absorb something.
• Data can be streamed in real time or ingested
in batches. In real-time data ingestion, each data
item is imported as the source emits it. When data
is ingested in batches, data items are imported in
discrete chunks at periodic intervals. The
first step in an effective data ingestion process is
to prioritize the data sources. Individual files must
be validated and data items routed to the correct
destinations.
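The two ingestion modes described above can be sketched in Python. This is a minimal illustration, not a real ingestion framework; the function names, and the queue and list standing in for a live source and a data store, are assumptions made for the example:

```python
import time
from queue import Empty, Queue

def ingest_realtime(source: Queue, store: list) -> None:
    """Real-time ingestion: import each item as soon as the source emits it."""
    while True:
        try:
            item = source.get(timeout=0.1)
        except Empty:
            break                 # source has gone quiet; stop ingesting
        store.append(item)        # each item lands in storage immediately

def ingest_batch(source: list, store: list, chunk_size: int = 3) -> None:
    """Batch ingestion: import items in discrete chunks at periodic intervals."""
    for i in range(0, len(source), chunk_size):
        store.extend(source[i:i + chunk_size])  # one chunk per interval
        time.sleep(0.01)          # stand-in for waiting until the next interval
```

The real-time path reacts to every item; the batch path trades latency for fewer, larger import operations.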
• Data messaging and storage layer:
• All the data from the various sources is received
by this layer. If the received data is unstructured
and not in a format that the analytics tools can
understand, this layer converts it into a format
readable by the analysis tools. In Big Data,
unstructured data is stored in specialized file
systems such as the Hadoop Distributed File System
(HDFS) or in a NoSQL database, whereas structured
data is stored in an RDBMS.
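The routing decision this layer makes can be sketched as a small function. The record fields and store names below are illustrative assumptions for the example, not a real API:

```python
def choose_store(record: dict) -> str:
    """Route a record to a storage system based on its structure.
    'schema'/'format' fields and the store names are illustrative."""
    if "schema" in record:                    # fixed schema -> structured
        return "RDBMS"
    if record.get("format") in ("json", "xml"):
        return "NoSQL"                        # semi-structured documents
    return "HDFS"                             # unstructured files and blobs
```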
• Analysis layer:
• This layer deals with the analysis of the stored
data. In this layer the stored data is analyzed to
extract trends and business intelligence from it.
Many different sorts of tools operate in a big data
environment. For the analysis of structured data,
techniques such as sampling are used, whereas
unstructured data requires more advanced,
specialized analytics toolsets.
• Consumption layer:
• All the analyzed data is received by this layer.
The task of this layer is to present this
analyzed data as an output to the desired
receiver.
• There are various types of outputs, such as
applications, business processes and human
viewers.
• There are four types of processes that operate
across these logical layers. These cross-layer
operations are:
• connecting to data sources,
• governance,
• systems management and
• quality of service (QoS).
Cross-layer operations
• 1. Connecting to data sources: Data arrives in
a Big Data system at a very fast rate.
In order to receive and analyze this
data quickly, we need to have connections that
can support these actions at a fast rate.
• For that, the architecture requires adapters
and connectors that can connect data sources to
the storage system, protocols and
networks.
Cross-layer operations
• 2. Governing big data: The architecture of Big
Data provides privacy as well as security for
the data that it receives and analyzes.
• Organizations using Big Data have a choice:
use a security tool of their own on the
analytics storage system, invest in specialized
software to keep their Hadoop environment
safe and secure, or sign an agreement
with their cloud Hadoop provider that provides
service-level security.
• The policies that deal with the protection and
security of data should cover the whole
process, from data ingestion through analysis
to deletion or archiving of the data.
Cross-layer operations
• 3. Managing systems:
• The architecture of Big Data is a large-scale
cluster with a distributed structure that
has highly scalable performance and
capacity.
• It should regularly and continuously check
the health of the system with the help of central
management system consoles.
• If the consumer is using the cloud as an
environment for Big Data, then they should
establish and monitor strong service level
agreements (SLAs) with their cloud provider.
Cross-layer operations
• 4. Quality of service:
• Quality of service is an important aspect of Big
Data: it is the framework that helps define the
quality of data, security and compliance
policies, the sizes and frequency of incoming
data sets, and data filtering.
Lambda Architecture
• The Lambda Architecture is a deployment
model for data processing that organizations use
to combine a traditional batch pipeline with a
fast real-time stream pipeline for data access.
• It is a common architecture model in IT and
development organizations' toolkits as
businesses strive to become more data-driven
and event-driven in the face of massive
volumes of rapidly generated data, often
referred to as “big data.”
Lambda Architecture
• The Lambda Architecture contains both a
traditional batch data pipeline and a fast
streaming pipeline for real-time data, as well
as a serving layer for responding to queries.

• Five main components of the Lambda Architecture:


• Data Sources
• Batch Layer
• Serving Layer
• Speed Layer
• Query
Lambda Architecture

• Data Sources: Data can be obtained from a
variety of sources, which can then be included in
the Lambda Architecture for analysis.
• This component is oftentimes a streaming source
like Apache Kafka, which is not the original data
source, but is an intermediary store that can hold
data in order to serve both the batch layer and
the speed layer of the Lambda Architecture.
• The data is delivered simultaneously to both the
batch layer and the speed layer to enable a
parallel indexing effort.
Lambda Architecture

• Batch Layer. This component saves all data coming
into the system as batch views in preparation for
indexing.
• The input data is saved in a model that looks like a
series of changes/updates that were made to a system
of record, similar to the output of a change data
capture (CDC) system. Oftentimes this is simply a file in
the comma-separated values (CSV) format.
• The data is treated as immutable and append-only to
ensure a trusted historical record of all incoming data.
A technology like Apache Hadoop is often used as a
system for ingesting the data as well as storing the data
in a cost-effective way.
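The append-only, CDC-style record model can be sketched with Python's standard csv module; the helper name and record fields are illustrative assumptions:

```python
import csv
import io

def append_change(log: io.StringIO, op: str, key: str, value: str) -> None:
    """Append one change record to an append-only CSV log. Records are
    never updated in place, so the log is a trusted historical record."""
    csv.writer(log).writerow([op, key, value])

log = io.StringIO()
append_change(log, "insert", "user:1", "alice")
append_change(log, "update", "user:1", "alicia")
# Both the insert and the later update survive in the log; the latest
# state is derived by replaying the changes, as with CDC output.
```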
Lambda Architecture

• Serving Layer. This layer incrementally indexes
the latest batch views to make them queryable by
end users. This layer can also reindex all data to
fix a coding bug or to create different indexes for
different use cases.
• The key requirement in the serving layer is that
the processing is done in an extremely
parallelized way to minimize the time to index
the data set. While an indexing job is run, newly
arriving data will be queued up for indexing in
the next indexing job.
• Speed Layer. This layer complements the serving layer by
indexing the most recently added data not yet fully indexed
by the serving layer. This includes the data that the serving
layer is currently indexing as well as new data that arrived
after the current indexing job started. Since there is an
expected lag between the time the latest data was added to
the system and the time the latest data is available for
querying (due to the time it takes to perform the batch
indexing work), it is up to the speed layer to index the latest
data to narrow this gap.
• This layer typically leverages stream processing software to
index the incoming data in near real-time to minimize the
latency of getting the data available for querying. When the
Lambda Architecture was first introduced, Apache Storm was
a leading stream processing engine used in deployments, but
other technologies have since gained more popularity as
candidates for this component (like Hazelcast Jet, Apache
Flink, and Apache Spark Streaming).
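A minimal sketch of the speed layer's job, assuming events carry a timestamp "ts" and a "key" (both illustrative field names), with a simple count as the index:

```python
from collections import Counter

def speed_index(stream, batch_cutoff: float) -> Counter:
    """Index only the events newer than the last completed batch job;
    older events are already covered by the serving layer."""
    recent = Counter()
    for event in stream:
        if event["ts"] > batch_cutoff:
            recent[event["key"]] += 1   # incremental, low-latency update
    return recent
```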
Lambda Architecture

• Query. This component is responsible for
submitting end user queries to both the
serving layer and the speed layer and
consolidating the results.
• This gives end users a complete query on all
data, including the most recently added data,
to provide a near real-time analytics system.
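The consolidation step can be sketched as a single lookup against both indexes; representing each index as a plain dict of counts is an illustrative simplification:

```python
def consolidated_query(key: str, serving: dict, speed: dict) -> int:
    """Answer a query from both layers: the serving layer covers the
    fully batch-indexed history, the speed layer covers the recent gap."""
    return serving.get(key, 0) + speed.get(key, 0)
```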
Working of Lambda Architecture
• Data is indexed simultaneously by both the
serving layer and the speed layer.
Working of Lambda Architecture
• The batch/serving layers continue to index
incoming data in batches.
• Since the batch indexing takes time, the speed
layer complements the batch/serving layers by
indexing all the new, unindexed data in near
real-time.
• This gives you a large and consistent view of
data in the batch/serving layers that can be
recreated at any time, along with a smaller index
that contains the most recent data.
• Once a batch indexing job completes, the newly
batch-indexed data is available for querying, so
the speed layer’s copy of the same data/indexes is
no longer needed and is therefore deleted from
the speed layer.
• The serving layer then begins indexing the latest
data in the system that had not yet been indexed
by this layer, which has already been indexed by
the speed layer (so it is available for querying at
the speed layer).
• This ongoing hand-off between the speed layer
and the batch/serving layers ensures that all data
is ready for querying and that the latency for data
availability is low.
• When the serving layer completes a job, it moves to
the next batch and the speed layer discards its copy of
the data that the serving layer just indexed.
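The hand-off can be sketched as a pruning step on the speed layer's data, again assuming an illustrative "ts" timestamp on each event:

```python
def handoff(speed_events: list, batch_indexed_through: float) -> list:
    """After a batch indexing job completes, the speed layer discards
    its copy of everything the batch/serving layers now cover, keeping
    only events newer than the completed job's cutoff."""
    return [e for e in speed_events if e["ts"] > batch_indexed_through]
```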
Benefits of the Lambda Architecture
• The Lambda Architecture offers the following benefits:
• Reduced latency: The speed layer uses stream processing
technologies to immediately index recent data that is currently not
queryable in the batch/serving layers, thus narrowing the time
window of unanalyzable data. This helps to reduce the latency.
• Data consistency: The indexing process ensures the data
reflects the latest state in both the batch and speed layers.
• Scalability: The Lambda Architecture is based on distributed,
scale-out technologies that can be expanded by simply adding
more nodes.
• Fault tolerance: The Lambda Architecture is based on distributed
systems that support fault tolerance, so should a hardware failure
occur, other nodes are available to continue the workload.
• Human fault tolerance: If there are any bugs in the indexing code
or any omissions, the code can be updated and then rerun to
reindex all data.
Kappa Architecture
• The Kappa Architecture is a software architecture
used for processing streaming data. The main
premise behind the Kappa Architecture is that
you can perform both real-time and batch
processing, especially for analytics, with a single
technology stack.
• It is based on a streaming architecture in which
an incoming series of data is first stored in a
messaging engine like Apache Kafka.
• From there, a stream processing engine will read
the data and transform it into an analyzable
format, and then store it into an analytics
database for end users to query.
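This single-stack flow can be sketched in a few lines; the iterable message log and list-backed analytics database are illustrative stand-ins for a Kafka topic and a real analytics store:

```python
def kappa_pipeline(message_log, transform, analytics_db: list) -> None:
    """Single-stack Kappa flow: read each event from the messaging
    engine (here, any iterable), transform it into an analyzable
    shape, and store the result for end users to query."""
    for event in message_log:
        analytics_db.append(transform(event))
```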
Kappa Architecture

• The Kappa Architecture is useful for on-demand
analytics.
• The Kappa Architecture is considered a simpler
alternative to the Lambda Architecture as it uses the
same technology stack to handle both real-time stream
processing and historical batch processing.
• A streaming architecture is a defined set of
technologies that work together to handle stream
processing, which is the practice of taking action on a
series of data at the time the data is created.
Kappa Architecture
• Streaming layer
• The streaming layer delivers low-latency,
near-real-time results. It uses incremental
algorithms to perform updates, which saves
time but sacrifices some accuracy.
• Serving layer
• The serving layer is used to serve the data computed
from the streaming layer.
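The kind of incremental algorithm the streaming layer relies on can be illustrated with a one-pass running mean, a standard technique shown here as a sketch:

```python
class RunningMean:
    """One-pass incremental mean: each update costs O(1), so results
    stay near-real-time without re-scanning the full history."""

    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0

    def update(self, x: float) -> float:
        self.n += 1
        self.mean += (x - self.mean) / self.n  # incremental update step
        return self.mean
```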
Lambda vs. Kappa architecture

• Layers: The Lambda architecture uses separate layers
for batch and streaming; the Kappa architecture uses a
unified layer for both batch and stream.
• Code complexity: Lambda has higher code complexity,
i.e., two technology stacks to maintain for the batch and
stream layers; Kappa has lower code complexity, i.e., a
single technology stack for both batch and stream.
• Performance and cost: Lambda gives faster performance
with dedicated batch and stream layers; in Kappa,
processing large amounts of historical data from the
database would be expensive.
Technology of Big Data
ADVANTAGE OF BIG DATA TOOLS
