Lec 4 - Big Data Ecosystem Architecture
Outline
▪ The Big Data Ecosystem and the different components that exist in this ecosystem.
Big Data Ecosystem
▪ Logically defines how a Big Data solution will work: the core components (hardware, databases, software, storage) used, the flow of information, security and more.
"Big Data Analytics", Ch.01 L03: Introduction to ... Big Data Analytics, Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Education India
A Very Generic Big Data Architecture
https://www.youtube.com/watch?v=rvqCqK2Lpjg
A Very Generic Big Data Architecture
▪ The real power of big data is the ability to ingest huge volumes of data from different sources.
▪ The data sources layer is composed of both private (internal) and public
(external) data sources.
• XML and JSON are the de facto formats for web and mobile applications due to their ease of integration into browser and server technologies that support JavaScript.
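As a quick illustration, Python's standard `json` module shows how directly such payloads map onto native data structures (the field names below are hypothetical):

```python
import json

# A sample payload as a web or mobile client might send it (hypothetical fields).
payload = '{"sensor_id": "s-17", "temperature": 21.5, "unit": "C"}'

record = json.loads(payload)  # parse JSON text into a Python dict
print(record["sensor_id"], record["temperature"])

# Serialising back to JSON is just as direct, which is why JSON integrates
# so easily with JavaScript-based browser and server stacks.
print(json.dumps(record, sort_keys=True))
```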
▪ Data ingestion is the process of obtaining and importing data from data sources for immediate use or storage.
• Batch - Large sets of data are acquired and supplied in batches. Data
collection might be conditionally triggered, scheduled, or ad hoc.
• Streaming - The continuous flow of data. This is required for real-time data
analysis. It finds and retrieves data as it is generated. Because it is always
watching for changes in data pools, it necessitates more resources.
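The two ingestion modes above can be sketched in a few lines of Python. This is an in-memory toy, not a real ingestion framework: batch ingestion yields fixed-size chunks, while streaming ingestion yields one record at a time as it arrives:

```python
from typing import Iterable, Iterator, List


def batch_ingest(source: List[dict], batch_size: int) -> Iterator[List[dict]]:
    """Acquire data in fixed-size batches (scheduled, triggered, or ad hoc)."""
    for i in range(0, len(source), batch_size):
        yield source[i:i + batch_size]


def stream_ingest(source: Iterable[dict]) -> Iterator[dict]:
    """Retrieve each record as it is generated (continuous flow)."""
    for record in source:
        yield record  # a real system would block here waiting for new events


events = [{"id": n} for n in range(5)]

batches = list(batch_ingest(events, batch_size=2))  # 3 batches: 2 + 2 + 1 records
streamed = list(stream_ingest(events))              # one record at a time
```

Note the resource trade-off from the slide: the streaming path must stay active continuously, whereas the batch path only runs when a batch is due.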
Data Ingestion - Data Access Connectors
▪ The Data Access Connectors are tools and frameworks for extracting and ingesting data from various sources into the big data storage.
▪ The choice of the specific tool or framework for data ingestion will be driven by
the data consumer.
• If the consumer has the capability (or requirement) to pull data, publish-subscribe messaging frameworks or messaging queues, which allow the consumers to pull the data, can be used. The data producers push data to a messaging framework or a queue from which the consumers pull the data.
• An alternative design approach is the push approach, where the data sources first
push data to the framework and the framework then pushes the data to the data
sinks.
Data Ingestion - Data Access Connectors
▪ Messaging Queues: are useful for push-pull messaging where the producers
push data to the queues and the consumers pull the data from the queues.
The producers and consumers do not need to be aware of each other.
▪ e.g., RabbitMQ, ZeroMQ, RestMQ and Amazon SQS.
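The push-pull decoupling can be illustrated with an in-process stand-in for such a queue, using Python's standard `queue` module (a real deployment would use a broker such as RabbitMQ or Amazon SQS):

```python
import queue
import threading

# In-process stand-in for a messaging queue such as RabbitMQ or Amazon SQS.
q: "queue.Queue[str]" = queue.Queue()


def producer() -> None:
    # The producer pushes messages without knowing who will consume them.
    for n in range(3):
        q.put(f"event-{n}")


def consumer(out: list) -> None:
    # The consumer pulls messages at its own pace; get() blocks until
    # a message is available.
    for _ in range(3):
        out.append(q.get())


received: list = []
t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer, args=(received,))
t1.start(); t2.start()
t1.join(); t2.join()
print(received)  # producer and consumer stayed fully decoupled
```

The key property from the slide holds here: neither side references the other directly; the queue is the only shared contract.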
Data Ingestion - Data Access Connectors
• Source-Sink Connectors: allow collecting, aggregating and moving data from various
sources (such as server logs, databases, social media, streaming sensor data from
Internet of Things devices and other sources) into a centralized data store (such as a
distributed file system).
• Database Connectors: used for importing data from RDBMS into big data storage and analytics frameworks, e.g., Apache Sqoop.
• Custom Connectors: can be built based on the source of the data and the data collection requirements. Examples include custom connectors for collecting data from social networks, for NoSQL databases and for the Internet of Things (IoT). Protocols such as REST, WebSocket and MQTT, and platforms such as AWS IoT and Azure IoT Hub, can serve as the basis for IoT custom connectors.
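A custom connector can be sketched as a small pull-transform-push loop. Everything below (the class name, the source and sink callables) is hypothetical and stands in for, e.g., a real REST poll or MQTT subscription writing into a data store:

```python
import json
from typing import Callable, Iterable


class CustomConnector:
    """Hypothetical connector: pulls records from a source, serialises them,
    and pushes them to a sink (e.g., a distributed file system writer)."""

    def __init__(self, source: Callable[[], Iterable[dict]],
                 sink: Callable[[str], None]) -> None:
        self.source = source
        self.sink = sink

    def run(self) -> int:
        count = 0
        for record in self.source():       # e.g., a REST poll or MQTT subscription
            self.sink(json.dumps(record))  # serialise into the storage format
            count += 1
        return count


# Demo with in-memory stand-ins for the IoT source and the data store.
collected: list = []
connector = CustomConnector(
    source=lambda: [{"device": "d1", "temp": 20}, {"device": "d2", "temp": 22}],
    sink=collected.append,
)
print(connector.run())  # number of records moved from source to sink
```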
Storage Layer
▪ Data Storage includes distributed filesystems (i.e. HDFS) and non-relational databases
(NoSQL), which store the data collected from the raw data sources using the data
access connectors.
▪ The Hadoop Distributed File System (HDFS) is a distributed file system that runs on large clusters and provides high-throughput access to data.
▪ Data stored in HDFS can be analysed with various big data analytics frameworks built
on top of HDFS.
Analytics Layer
▪ Data analytics refers to technologies that are grounded mostly in data mining and statistical analysis, used to derive business insights.
▪ To draw insights from the data, the analytics layer pulls data from the data storage layer or directly from the data sources.
▪ The selection of an appropriate processing model and analytical solution is a challenging problem and depends on the business issues of the targeted domain.
Predict the future, understand the past: the four types of data analysis
Data Consumption Layer
▪ The data consumption layer is where the processed and analysed data is made
available to end-users, applications, or systems for decision-making, reporting, or
further action.
▪ This layer is the final stage in the data pipeline, where the value of the data is realized
by stakeholders.
▪ Tools such as dashboards, reporting tools, and business intelligence (BI) platforms (e.g., Tableau, Power BI, Qlik) are used to visualize and interact with the data. In addition, custom applications or APIs may also consume data for specific business needs.
Governance Layer
▪ Strong guidelines and processes are required to monitor, structure, store, and secure the data from the time it enters the enterprise until it is processed, stored, analysed, and purged or archived.
With Dedicated Serving Layer | Without Dedicated Serving Layer
Big Data Architecture Patterns | Lambda Architecture
▪ The architecture is designed to handle both batch and real-time data processing,
providing a comprehensive solution for handling large-scale data analysis.
• The ability to process data at high speed in a streaming context is necessary for operational needs, such as transaction processing and real-time reporting.
• Batch processing, involving massive amounts of data and the related correlation and aggregation, is important for business reporting.
https://www.youtube.com/watch?v=waDJcSCXz_Y
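The batch/speed split can be made concrete with a toy serving-layer merge. All names and numbers below are illustrative; the point is that a query combines a precomputed batch view with an incremental real-time view:

```python
# Batch layer: precomputed view over the master dataset (recomputed periodically).
batch_view = {"page_a": 100, "page_b": 40}   # e.g., total clicks up to the last batch run

# Speed layer: incremental real-time view over events since that batch run.
realtime_view = {"page_a": 3, "page_c": 1}


def query(key: str) -> int:
    """Serving layer: merge both views to answer a query."""
    return batch_view.get(key, 0) + realtime_view.get(key, 0)


print(query("page_a"))  # batch count plus real-time increment
print(query("page_c"))  # only seen since the last batch run
```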
Pattern 1(b) – Lambda Architecture, Dedicated Serving Layer
https://www.youtube.com/watch?v=waDJcSCXz_Y
Pattern 1(c) – Lambda Architecture, Common Serving Layer
https://www.youtube.com/watch?v=waDJcSCXz_Y
Big Data Architecture Patterns | Kappa Architecture
▪ In the Kappa Architecture, incoming data is first inserted into a messaging engine. From there, a stream processing engine reads the data and transforms it into an analysable format.
▪ The Kappa Architecture supports (near) real-time analytics when the data is
read and transformed immediately after it is inserted into the messaging
engine.
▪ The main difference with the Kappa Architecture is that all data is treated as if
it were a stream, so the stream processing engine acts as the sole data
transformation engine.
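The stream-only idea can be sketched as follows: one transformation function serves both live processing and full reprocessing, since reprocessing is just replaying the append-only log through the same code path (a toy sketch, not a real stream engine):

```python
from typing import Iterable, Iterator


def transform(stream: Iterable[dict]) -> Iterator[dict]:
    """The sole data transformation engine: it only ever sees a stream."""
    for event in stream:
        # Hypothetical transformation: double a numeric field.
        yield {**event, "value": event["value"] * 2}


# The append-only log is the system of record.
log = [{"id": 1, "value": 10}, {"id": 2, "value": 20}]

# Live processing and full reprocessing use the identical transformation:
live = list(transform(iter(log)))   # consuming events as they arrive
replayed = list(transform(log))     # replaying the whole log
print(live == replayed)
```

This is the contrast with Lambda: there is no separate batch code path to keep in sync with the streaming one.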
Pattern 2 – Kappa Architecture
https://www.youtube.com/watch?v=waDJcSCXz_Y
Lambda Architecture VS Kappa Architecture
https://www.youtube.com/watch?v=waDJcSCXz_Y