Big Data Architectures
A big data architecture is designed to handle the ingestion, processing, and analysis of
data that is too large or complex for traditional database systems. The threshold at
which organizations enter into the big data realm differs, depending on the capabilities
of the users and their tools. For some, it can mean hundreds of gigabytes of data, while
for others it means hundreds of terabytes. As tools for working with big datasets
advance, so does the meaning of big data. More and more, this term relates to the value
you can extract from your data sets through advanced analytics rather than strictly the
size of the data, although in such cases the data sets tend to be quite large.
Over the years, the data landscape has changed. What you can do, or are expected to
do, with data has changed. The cost of storage has fallen dramatically, while the means
by which data is collected keeps growing. Some data arrives at a rapid pace, constantly
demanding to be collected and observed. Other data arrives more slowly, but in very
large chunks, often in the form of decades of historical data. You might be facing an
advanced analytics problem, or one that requires machine learning. These are
challenges that big data architectures seek to solve.
Big data solutions typically involve one or more of the following types of workload:
Store and process data in volumes too large for a traditional database.
Transform unstructured data for analysis and reporting.
Capture, process, and analyze unbounded streams of data in real time, or with low
latency.
Most big data architectures include some or all of the following components:

Data sources. All big data solutions start with one or more data sources. Examples
include:
Application data stores, such as relational databases.
Static files produced by applications, such as web server log files.
Real-time data sources, such as IoT devices.
Batch processing. Because the data sets are so large, often a big data solution
must process data files using long-running batch jobs to filter, aggregate, and
otherwise prepare the data for analysis. Usually these jobs involve reading source
files, processing them, and writing the output to new files. Options include running
U-SQL jobs in Azure Data Lake Analytics, using Hive, Pig, or custom Map/Reduce
jobs in an HDInsight Hadoop cluster, or using Java, Scala, or Python programs in
an HDInsight Spark cluster.
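As a rough illustration, the following PySpark sketch shows what such a batch job might look like on an HDInsight Spark cluster. The storage paths, column names, and aggregation logic are hypothetical placeholders, not part of any particular solution.

```python
# Minimal batch-processing sketch for a Spark cluster.
# Paths and column names are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-log-batch").getOrCreate()

# Read raw web server logs landed in the distributed file store.
logs = spark.read.json("abfss://raw@example.dfs.core.windows.net/weblogs/*.json")

# Filter out health-check noise, then aggregate requests per URL per day.
daily_counts = (
    logs.filter(F.col("path") != "/healthz")
        .groupBy(F.to_date("timestamp").alias("day"), "path")
        .agg(F.count("*").alias("requests"),
             F.avg("latency_ms").alias("avg_latency_ms"))
)

# Write the prepared output as new files for downstream analysis.
daily_counts.write.mode("overwrite").parquet(
    "abfss://curated@example.dfs.core.windows.net/weblogs_daily/")

spark.stop()
```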
Machine learning. Using the prepared data from batch or stream processing,
machine learning algorithms can build models that predict outcomes or classify
data. These models can be trained on large datasets, and the resulting models
can then be used to analyze new data and make predictions. This can be done
using Azure Machine Learning.
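The following is a minimal sketch of this pattern using scikit-learn rather than any specific Azure Machine Learning workflow; the file names and feature columns are hypothetical, and in practice a script like this might be submitted as an Azure Machine Learning training job.

```python
# Sketch: train a model on prepared (batch-processed) data, then score new data.
# File names and feature columns are hypothetical placeholders.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

prepared = pd.read_parquet("prepared_events.parquet")
X = prepared[["avg_latency_ms", "requests", "error_rate"]]
y = prepared["will_churn"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = GradientBoostingClassifier().fit(X_train, y_train)
print("holdout accuracy:", accuracy_score(y_test, model.predict(X_test)))

# The trained model can now classify new incoming records.
new_data = pd.read_parquet("new_events.parquet")
predictions = model.predict(new_data[X.columns])
```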
Analytical data store. Many big data solutions prepare data for analysis and then
serve the processed data in a structured format that can be queried using
analytical tools. The analytical data store used to serve these queries can be a
Kimball-style relational data warehouse, as seen in most traditional business
intelligence (BI) solutions. Alternatively, the data could be presented through a
low-latency NoSQL technology such as HBase, or an interactive Hive database that
provides a metadata abstraction over data files in the distributed data store. Azure
Synapse Analytics provides a managed service for large-scale, cloud-based data
warehousing. HDInsight supports Interactive Hive, HBase, and Spark SQL, which
can also be used to serve data for analysis.
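As a sketch of the metadata-abstraction approach, the following Spark SQL example registers a table over prepared files in the data lake so they can be served through plain SQL. The path and table name are hypothetical; Interactive Hive on HDInsight provides a similar abstraction.

```python
# Sketch: expose prepared files through a Spark SQL table so analytical
# tools can query them. Path and schema are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("serving-layer")
         .enableHiveSupport().getOrCreate())

# Register a table whose metadata points at files in the data lake.
spark.sql("""
    CREATE TABLE IF NOT EXISTS weblogs_daily
    USING PARQUET
    LOCATION 'abfss://curated@example.dfs.core.windows.net/weblogs_daily/'
""")

# Queries against the store are now plain SQL.
top_pages = spark.sql("""
    SELECT path, SUM(requests) AS total_requests
    FROM weblogs_daily
    GROUP BY path
    ORDER BY total_requests DESC
    LIMIT 10
""")
top_pages.show()
```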
Analysis and reporting. The goal of most big data solutions is to provide insights
into the data through analysis and reporting. To empower users to analyze the
data, the architecture may include a data modeling layer, such as a
multidimensional OLAP cube or tabular data model in Azure Analysis Services. It
might also support self-service BI, using the modeling and visualization
technologies in Microsoft Power BI or Microsoft Excel. Analysis and reporting can
also take the form of interactive data exploration by data scientists or data
analysts. For these scenarios, many Azure services support analytical notebooks,
such as Jupyter, enabling these users to use their existing skills with Python or
Microsoft R. For large-scale data exploration, you can use Microsoft R Server,
either standalone or with Spark.
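As a small, hypothetical example of such interactive exploration, a notebook cell might pull an aggregate from the analytical data store into a pandas DataFrame; the ODBC connection details below are placeholders.

```python
# Sketch of interactive exploration in a Jupyter notebook. The connection
# details are hypothetical; any ODBC-accessible analytical store works.
import pandas as pd
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=example.sql.azuresynapse.net;Database=analytics;"
    "Authentication=ActiveDirectoryInteractive;"
)
df = pd.read_sql(
    "SELECT day, SUM(requests) AS requests "
    "FROM weblogs_daily GROUP BY day ORDER BY day", conn)

# Quick visual check of the trend, rendered inline in the notebook.
df.plot(x="day", y="requests", title="Requests per day")
```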
Lambda architecture

Queries over very large data sets can take a long time to run, so their results are
often precomputed in batch and stored separately from the raw data for querying. One
drawback to this approach is that it introduces latency: if processing takes a few
hours, a query may return results that are several hours old. Ideally, you would like to
get some results in real time (perhaps with some loss of accuracy), and combine these
results with the results from the batch analytics.
The lambda architecture, first proposed by Nathan Marz, addresses this problem by
creating two paths for data flow. All data coming into the system goes through these
two paths:
A batch layer (cold path) stores all of the incoming data in its raw form and
performs batch processing on the data. The result of this processing is stored as a
batch view.
A speed layer (hot path) analyzes data in real time. This layer is designed for low
latency, at the expense of accuracy.
The batch layer feeds into a serving layer that indexes the batch view for efficient
querying. The speed layer updates the serving layer with incremental updates based on
the most recent data.
Data that flows into the hot path is constrained by latency requirements imposed by the
speed layer, so that it can be processed as quickly as possible. Often, this requires
trading some level of accuracy for data that is ready sooner.
For example, consider an IoT scenario where a large number of temperature sensors are
sending telemetry data. The speed layer may be used to process a sliding time window
of the incoming data.
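The following Spark Structured Streaming sketch shows what such a sliding window might look like. The Kafka source settings, message schema, and window sizes are hypothetical choices, not prescriptions, and the Kafka source requires the spark-sql-kafka connector package.

```python
# Sketch of a hot-path sliding window: average temperature per sensor over
# the last minute, updated every ten seconds. Source settings are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("hot-path").getOrCreate()

readings = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "temperature")
    .load()
    .select(F.from_json(F.col("value").cast("string"),
                        "sensorId STRING, temp DOUBLE, ts TIMESTAMP").alias("r"))
    .select("r.*"))

# One-minute window sliding every ten seconds; events later than the
# 30-second watermark are dropped in favor of timeliness.
windowed = (readings
    .withWatermark("ts", "30 seconds")
    .groupBy(F.window("ts", "1 minute", "10 seconds"), "sensorId")
    .agg(F.avg("temp").alias("avg_temp")))

query = windowed.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
```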
Data flowing into the cold path, on the other hand, is not subject to the same low
latency requirements. This allows for high accuracy computation across large data sets,
which can be very time intensive.
Eventually, the hot and cold paths converge at the analytics client application. If the
client needs to display timely, yet potentially less accurate data in real time, it will
acquire its result from the hot path. Otherwise, it will select results from the cold path to
display less timely but more accurate data. In other words, the hot path has data for a
relatively small window of time, after which the results can be updated with more
accurate data from the cold path.
The raw data stored at the batch layer is immutable. Incoming data is always appended
to the existing data, and the previous data is never overwritten. Any changes to the
value of a particular datum are stored as a new timestamped event record. This allows
for recomputation at any point in time across the history of the data collected. The
ability to recompute the batch view from the original raw data is important, because it
allows for new views to be created as the system evolves.
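A toy sketch in plain Python illustrates the append-only model and recomputation; the event shape and view logic are hypothetical.

```python
# Sketch: the batch layer's raw store is append-only. Changes arrive as new
# timestamped event records, and any view can be recomputed from history.
from datetime import datetime, timezone

raw_events = []  # the immutable master data set; records are only appended

def record(sensor_id, temperature):
    raw_events.append({
        "sensor_id": sensor_id,
        "temperature": temperature,
        "recorded_at": datetime.now(timezone.utc),
    })

def latest_readings(events):
    """Recompute a 'current state' batch view from the full event history."""
    view = {}
    for e in events:  # events are ordered; later records win
        view[e["sensor_id"]] = e["temperature"]
    return view

record("sensor-1", 21.5)
record("sensor-1", 22.1)  # a correction is a new record, not an overwrite
print(latest_readings(raw_events))  # {'sensor-1': 22.1}
```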
Kappa architecture
A drawback to the lambda architecture is its complexity. Processing logic appears in two
different places — the cold and hot paths — using different frameworks. This leads to
duplicate computation logic and the complexity of managing the architecture for both
paths.
The kappa architecture was proposed by Jay Kreps as an alternative to the lambda
architecture. It has the same basic goals as the lambda architecture, but with an
important distinction: All data flows through a single path, using a stream processing
system.
There are some similarities to the lambda architecture's batch layer, in that the event
data is immutable and all of it is collected, instead of a subset. The data is ingested as a
stream of events into a distributed and fault tolerant unified log. These events are
ordered, and the current state of an event is changed only by a new event being
appended. Similar to a lambda architecture's speed layer, all event processing is
performed on the input stream and persisted as a real-time view.
If you need to recompute the entire data set (equivalent to what the batch layer does in
lambda), you simply replay the stream, typically using parallelism to complete the
computation in a timely fashion.
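As an illustration, the following sketch uses the kafka-python client to rewind a consumer to the start of the log and rebuild a view. The broker address, topic, and event shape are hypothetical, and a real recomputation would typically run in parallel across partitions.

```python
# Sketch: recomputing a view in a kappa architecture by replaying the
# unified log from the beginning. Names are hypothetical placeholders.
import json
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(
    bootstrap_servers="broker:9092",
    enable_auto_commit=False,  # offsets are managed manually for replay
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

partitions = [TopicPartition("events", p)
              for p in consumer.partitions_for_topic("events")]
consumer.assign(partitions)
consumer.seek_to_beginning(*partitions)  # rewind: reprocess the full history

state = {}
for message in consumer:  # runs indefinitely; a real job would parallelize
    event = message.value
    state[event["key"]] = event["value"]  # rebuild the materialized view
```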
Internet of Things (IoT)

Event-driven architectures are central to IoT solutions. The following diagram shows a
possible logical architecture for IoT. The diagram emphasizes the event-streaming
components of the architecture.
The cloud gateway ingests device events at the cloud boundary, using a reliable, low
latency messaging system.
Devices might send events directly to the cloud gateway, or through a field gateway. A
field gateway is a specialized device or software, usually collocated with the devices, that
receives events and forwards them to the cloud gateway. The field gateway might also
preprocess the raw device events, performing functions such as filtering, aggregation, or
protocol transformation.
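As a small sketch of this ingestion step, a device (or a field gateway forwarding on its behalf) might send telemetry to Azure IoT Hub using the azure-iot-device Python SDK; the connection string and payload below are placeholders.

```python
# Sketch: a device forwarding telemetry to the cloud gateway (Azure IoT Hub).
# The connection string and payload are hypothetical placeholders.
import json
from azure.iot.device import IoTHubDeviceClient, Message

client = IoTHubDeviceClient.create_from_connection_string(
    "HostName=example.azure-devices.net;DeviceId=sensor-1;SharedAccessKey=...")
client.connect()

# A field gateway might aggregate several raw readings before forwarding.
reading = {"sensorId": "sensor-1", "temp": 22.4, "ts": "2024-01-01T00:00:00Z"}
msg = Message(json.dumps(reading),
              content_encoding="utf-8",
              content_type="application/json")
client.send_message(msg)

client.disconnect()
```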
After ingestion, events go through one or more stream processors that can route the
data (for example, to storage) or perform analytics and other processing.
The following are some common types of processing. (This list is certainly not
exhaustive.)
Hot path analytics, analyzing the event stream in (near) real time, to detect
anomalies, recognize patterns over rolling time windows, or trigger alerts when a
specific condition occurs in the stream.
Machine learning.
The boxes that are shaded gray show components of an IoT system that are not directly
related to event streaming, but are included here for completeness.
The device registry is a database of the provisioned devices, including the device
IDs and usually device metadata, such as location.
The provisioning API is a common external interface for provisioning and
registering new devices.
Some IoT solutions allow command and control messages to be sent to devices.
Next steps
See the following relevant Azure services:

Azure Data Lake Analytics
Azure IoT Hub
Azure Machine Learning
Azure Synapse Analytics
Related resources
Azure IoT reference architecture
IoT architecture design
Big data architecture style