
Big Data Ecosystem & Architecture

Lecture 4
Outline

▪ The Big Data Ecosystem and the different components that exist in this ecosystem.

▪ A very generic Big Data architecture.

▪ Big Data architecture patterns | Lambda vs Kappa architecture.
Big Data Ecosystem

▪ The Big Data Ecosystem refers to the entire environment of tools, technologies, frameworks, and processes that work together to manage and analyse big data.

▪ It is a broad concept that includes not only the architecture but also the people, processes, and business use cases surrounding big data.
Big Data Ecosystem Overview
Big Data Architecture

“Big Data architecture is the logical and/or physical layout/structure of how Big Data will be stored, accessed and managed within a Big Data or IT environment” … Techopedia

▪ Logically defines how the Big Data solution will work, the core components (hardware, database, software, storage) used, the flow of information, security and more.

▪ Designing a big data architecture is a complex and challenging process due to the following:
• The characteristics of big data.
• The rapid pace of new technological innovations.
• Competing products at lower costs in the market.

“Big Data Analytics”, Ch.01 L03: Introduction To ... Big Data Analytics, Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India
A Very Generic Big Data Architecture

[Figure: a very generic big data architecture - see https://www.youtube.com/watch?v=rvqCqK2Lpjg and “Big Data Analytics”, Ch.01 L03, Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India]

▪ The processing workflow can be divided into three layers:
• Data sources;
• Data management (integration, storage and consumption);
• Data analytics, business intelligence (BI) and knowledge discovery (KD).

▪ This division allows us to discuss big data topics from different perspectives:
• For computer scientists and engineers - data storage and management, communication, and computation.
• For data scientists and statisticians - developing machine learning models to extract usable information from datasets that are too large and complex for traditional methods.
• From an organizational viewpoint - business analysts are expected to select and deploy analytics services.
Data Source

▪ The real power of big data is the ability to ingest huge volumes of data from different sources.

▪ Any type of data can be acquired and stored.

▪ The data sources layer is composed of both private (internal) and public (external) data sources.

▪ The most challenging task is to capture the heterogeneous datasets coming from various service providers.

• XML and JSON are the de facto formats for web and mobile applications due to their ease of integration into browser and server technologies that support JavaScript (see the sketch after this list).

• Linked Data and Semantic Web technologies are also used for publishing and interlinking structured data.
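As a minimal illustration of ingesting JSON from a public (external) source, the sketch below pulls records over HTTP; the endpoint URL and field names are hypothetical placeholders, not part of the lecture material.

```python
import requests  # HTTP client for pulling data from external sources

# Hypothetical external JSON API endpoint (placeholder URL).
SOURCE_URL = "https://api.example.com/v1/events"

def fetch_events():
    """Pull a batch of JSON records from an external data source."""
    response = requests.get(SOURCE_URL, params={"limit": 100}, timeout=10)
    response.raise_for_status()   # fail fast on HTTP errors
    return response.json()        # parse the JSON payload into Python objects

if __name__ == "__main__":
    for event in fetch_events():
        print(event)              # downstream: hand off to the ingestion layer
```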


Ingestion Layer

▪ Data ingestion is the process of obtaining and importing data from data sources for immediate use or storage.

▪ Data ingestion can be classified into two types (a small sketch contrasting the two follows below):

• Batch - Large sets of data are acquired and supplied in batches. Data collection might be conditionally triggered, scheduled, or ad hoc.

• Streaming - The continuous flow of data. This is required for real-time data analysis. It finds and retrieves data as it is generated. Because it is always watching for changes in data pools, it requires more resources.
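A minimal sketch of the two ingestion styles, assuming a local CSV file as the batch source and a generator standing in for a continuous stream; the file name and record shape are illustrative only.

```python
import csv
import random
import time

def batch_ingest(path):
    """Batch: read a complete, already-collected dataset in one pass."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))   # the whole batch is loaded at once

def stream_ingest():
    """Streaming: yield records continuously as they are generated."""
    while True:
        yield {"sensor": "s1", "value": random.random()}  # stand-in for a live feed
        time.sleep(1)                                     # new data arrives over time

# Batch: process everything, then stop.
# records = batch_ingest("events.csv")

# Streaming: process each record as it arrives.
# for record in stream_ingest():
#     handle(record)
```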
Data Ingestion - Data Access Connectors

▪ Data access connectors are tools and frameworks for extracting and ingesting data from various sources into the big data storage.

▪ These connectors can include both wired and wireless connections.

▪ The data ingestion mechanism can be either a push or a pull mechanism.

▪ The choice of the specific tool or framework for data ingestion is driven by the data consumer:
• If the consumer has the capability (or requirement) to pull data, publish-subscribe messaging frameworks or messaging queues, which allow the consumers to pull the data, can be used. The data producers push data to a messaging framework or a queue from which the consumers can pull the data.
• An alternative design is the push approach, where the data sources first push data to the framework and the framework then pushes the data to the data sinks.
Data Ingestion - Data Access Connectors

▪ Messaging Queues: useful for push-pull messaging, where the producers push data to the queues and the consumers pull the data from the queues. The producers and consumers do not need to be aware of each other.
▪ e.g., RabbitMQ, ZeroMQ, RestMQ and Amazon SQS.
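A minimal push-pull sketch using RabbitMQ via the pika client, assuming a broker running on localhost; the queue name is illustrative.

```python
import pika  # RabbitMQ client (pip install pika)

# Assumes a RabbitMQ broker on localhost; "ingest" is an illustrative queue name.
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="ingest")

# Producer pushes a message to the queue...
channel.basic_publish(exchange="", routing_key="ingest", body=b'{"id": 1}')

# ...and a consumer pulls it, without either side knowing the other.
method, header, body = channel.basic_get(queue="ingest", auto_ack=True)
if method is not None:
    print("pulled:", body)

connection.close()
```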
Data Ingestion - Data Access Connectors

▪ Publish-Subscribe Messaging: a communication model that involves publishers, brokers and consumers. Publishers are the source of data; they send the data to topics which are managed by the broker.
▪ Examples of publish-subscribe messaging frameworks include Apache Kafka and Amazon Kinesis.
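A minimal publish-subscribe sketch with Apache Kafka using the kafka-python client, assuming a broker at localhost:9092; the topic name is illustrative.

```python
from kafka import KafkaConsumer, KafkaProducer  # pip install kafka-python

# Publisher: sends records to a topic managed by the Kafka broker.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("sensor-readings", b'{"sensor": "s1", "value": 0.42}')
producer.flush()  # make sure the message actually reaches the broker

# Subscriber: pulls records from the same topic, decoupled from the publisher.
consumer = KafkaConsumer(
    "sensor-readings",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",   # start from the beginning of the topic
    consumer_timeout_ms=5000,       # stop iterating if no new messages arrive
)
for message in consumer:
    print("received:", message.value)
```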
Data Ingestion - Data Access Connectors

• Source-Sink Connectors: allow collecting, aggregating and moving data from various sources (such as server logs, databases, social media, streaming sensor data from Internet of Things devices and other sources) into a centralized data store (such as a distributed file system).

• Database Connectors: used for importing data from RDBMS into big data storage and analytics frameworks, e.g., Apache Sqoop.

• Custom Connectors: can be built based on the source of the data and the data collection requirements. Examples include custom connectors for collecting data from social networks, for NoSQL databases, and for the Internet of Things (IoT), e.g., REST, WebSocket and MQTT, as well as AWS IoT and Azure IoT Hub. A small sketch of an MQTT-based IoT connector follows below.
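As one hedged example of a custom IoT connector, the sketch below subscribes to sensor topics over MQTT using the paho-mqtt client (1.x callback API assumed); the broker host and topic filter are illustrative.

```python
import paho.mqtt.client as mqtt  # pip install paho-mqtt (1.x API assumed here)

def on_message(client, userdata, msg):
    """Forward each arriving sensor reading to the big data storage layer."""
    print(f"{msg.topic}: {msg.payload.decode()}")  # stand-in for a write to storage

client = mqtt.Client()
client.on_message = on_message

# Illustrative broker host and topic filter.
client.connect("mqtt.example.com", 1883, keepalive=60)
client.subscribe("sensors/#")   # '#' matches every sensor topic under 'sensors/'

client.loop_forever()           # block and ingest messages as they are published
```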
Storage Layer

▪ Data storage includes distributed filesystems (e.g., HDFS) and non-relational (NoSQL) databases, which store the data collected from the raw data sources using the data access connectors.

▪ The Hadoop Distributed File System (HDFS) is a distributed file system that runs on large clusters and provides high-throughput access to data.

▪ Data stored in HDFS can be analysed with various big data analytics frameworks built on top of HDFS.

▪ For certain analytics applications, it is preferable to store data in a NoSQL database such as HBase.
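A minimal sketch of landing ingested records in HDFS over WebHDFS, using the third-party hdfs Python package; the NameNode address, user, and paths are assumptions (port 9870 is the Hadoop 3 default WebHDFS HTTP port).

```python
from hdfs import InsecureClient  # pip install hdfs (WebHDFS client)

# Assumed NameNode WebHDFS endpoint and user.
client = InsecureClient("http://namenode.example.com:9870", user="hadoop")

# Write a newline-delimited JSON batch into the raw zone of the data lake.
records = b'{"sensor": "s1", "value": 0.42}\n{"sensor": "s2", "value": 0.17}\n'
client.write("/data/raw/sensors/batch-0001.json", data=records, overwrite=True)

# Read it back, e.g., for a downstream analytics framework to consume.
with client.read("/data/raw/sensors/batch-0001.json") as reader:
    print(reader.read().decode())
```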
Analysis Layer

▪ Data analytics refers to technologies that are grounded mostly in data mining and statistical analysis for business insights.
▪ To draw insights from the data, this layer pulls data from the data storage layer or directly from the data source.
▪ The selection of an appropriate processing model and analytical solution is a challenging problem and depends on the business issues of the targeted domain.

[Figure: Predict the future, understand the past - the four types of data analysis]
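A minimal descriptive-analytics sketch using PySpark, assuming the HDFS path and record fields from the storage example above; all names are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sensor-analytics").getOrCreate()

# Pull raw records from the storage layer (illustrative HDFS path and schema).
df = spark.read.json("hdfs://namenode.example.com/data/raw/sensors/")

# Descriptive analytics: summarize past behaviour per sensor.
summary = df.groupBy("sensor").agg(
    F.count("*").alias("readings"),
    F.avg("value").alias("avg_value"),
)
summary.show()

spark.stop()
```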
Data Consumption Layer

▪ The data consumption layer is where the processed and analysed data is made available to end-users, applications, or systems for decision-making, reporting, or further action.

▪ This layer is the final stage in the data pipeline, where the value of the data is realized by stakeholders.

▪ Tools such as dashboards, reporting tools, and business intelligence (BI) platforms (e.g., Tableau, Power BI, Qlik) are used to visualize and interact with the data. In addition, custom applications or APIs may also consume data for specific business needs.
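As a hedged sketch of the "custom applications or APIs" option, the minimal Flask service below exposes precomputed results to consumers; the endpoint and the in-memory results dictionary are stand-ins for a real serving store.

```python
from flask import Flask, jsonify  # pip install flask

app = Flask(__name__)

# Stand-in for a serving store populated by the analysis layer.
ANALYTICS_RESULTS = {
    "s1": {"readings": 1440, "avg_value": 0.42},
    "s2": {"readings": 1391, "avg_value": 0.17},
}

@app.route("/api/sensors/<sensor_id>/summary")
def sensor_summary(sensor_id):
    """Serve an analysed summary to dashboards or downstream applications."""
    result = ANALYTICS_RESULTS.get(sensor_id)
    if result is None:
        return jsonify({"error": "unknown sensor"}), 404
    return jsonify(result)

if __name__ == "__main__":
    app.run(port=5000)
```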
Governance Layer

▪ Strong guidelines and processes are required to monitor, structure, store, and secure the data from the time it enters the enterprise until it is processed, stored, analysed, and purged or archived.

▪ Governance for big data includes:
• Managing high volumes of data in a variety of formats.
• Continuously training and managing the statistical models required to pre-process unstructured data and analytics.
• Setting policy and compliance regulations for external data regarding its retention and usage.
• Defining the data archiving and purging policies.
• Creating the policy for how data can be replicated across various systems.
• Setting data encryption policies.
[Figure: processing patterns - batch processing and real-time processing, each shown with and without a dedicated serving layer]
Big Data Architecture Patterns | Lambda Architecture

▪ Lambda architecture is a data processing framework that aims to provide a unified and fault-tolerant approach to big data processing.

▪ The architecture is designed to handle both batch and real-time data processing, providing a comprehensive solution for large-scale data analysis:
• The ability to process data at high speed in a streaming context is necessary for operational needs, such as transaction processing and real-time reporting.
• Batch processing, involving massive amounts of data and the related correlation and aggregation, is important for business reporting.

“What is Lambda Architecture”, Asim Zahid, Mar 25, 2023
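A toy sketch of the Lambda idea, assuming a periodically rebuilt batch view plus a real-time speed view whose counts are merged at query time; all names and numbers are illustrative.

```python
# Batch layer: a precomputed view over the complete (historical) dataset.
batch_view = {"page_a": 10_000, "page_b": 4_200}   # e.g., rebuilt hourly

# Speed layer: incremental counts for events that arrived after the last batch run.
speed_view = {"page_a": 37, "page_c": 5}

def query(page: str) -> int:
    """Serving layer: merge the batch view with the real-time delta."""
    return batch_view.get(page, 0) + speed_view.get(page, 0)

print(query("page_a"))  # 10037 - historical total plus fresh events
print(query("page_c"))  # 5     - seen only by the speed layer so far
```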
Pattern 1(a) – Lambda Architecture | Batch-Only Serving Layer

[Figure - see https://www.youtube.com/watch?v=waDJcSCXz_Y]
Pattern 1(b) – Lambda Architecture | Dedicated Serving Layer

[Figure - see https://www.youtube.com/watch?v=waDJcSCXz_Y]
Pattern 1(c) – Lambda Architecture | Common Serving Layer

[Figure - see https://www.youtube.com/watch?v=waDJcSCXz_Y]
Big Data Architecture Patterns | Kappa Architecture

▪ A data processing architecture used for streaming data.

▪ It is based on a streaming architecture in which an incoming series of data is first stored in a messaging engine like Apache Kafka.

▪ From there, a stream processing engine reads the data and transforms it into an analysable format.

▪ The Kappa architecture supports (near) real-time analytics, since the data is read and transformed immediately after it is inserted into the messaging engine.

▪ The main difference in the Kappa architecture is that all data is treated as if it were a stream, so the stream processing engine acts as the sole data transformation engine. A minimal sketch of this single-path design follows below.
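A minimal single-path sketch of the Kappa idea with kafka-python, assuming a broker at localhost:9092: every record enters as a stream event, is transformed by one stream processor, and is re-published for consumers; topic names and the transformation are illustrative.

```python
import json

from kafka import KafkaConsumer, KafkaProducer  # pip install kafka-python

consumer = KafkaConsumer("raw-events", bootstrap_servers="localhost:9092")
producer = KafkaProducer(bootstrap_servers="localhost:9092")

# Single transformation path: every record is handled as a stream event,
# whether it is historical (replayed from the log) or arriving right now.
for message in consumer:
    event = json.loads(message.value)
    event["value_scaled"] = event["value"] * 100  # illustrative transformation
    producer.send("analysable-events", json.dumps(event).encode())
```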
Pattern 2 – Kappa Architecture

[Figure - see https://www.youtube.com/watch?v=waDJcSCXz_Y]
Lambda Architecture vs Kappa Architecture

[Figure - see https://www.youtube.com/watch?v=waDJcSCXz_Y]
