Architecture

The ingestion layer brings in data from various sources such as databases, streaming data, APIs, and third parties, using services like AWS DMS, Kinesis Data Firehose, Amazon AppFlow, and AWS Data Exchange. The data is stored in the data lake's raw zone for processing. The processing layer transforms the data using AWS Glue and Step Functions to validate, clean, and enrich it, moving it to the curated zone. The cataloging layer uses AWS Glue crawlers and Lake Formation to track metadata for discovery and querying in Athena.


MKS infrastructure:

Ingestion layer
The ingestion layer is responsible for bringing data into the data lake. It provides the ability to
connect to internal and external data sources over a variety of protocols. It can ingest batch and
streaming data into the storage layer. The ingestion layer is also responsible for delivering
ingested data to a diverse set of targets in the data storage layer (including the object store,
databases, and warehouses).

The ingestion layer in our serverless architecture is composed of a set of purpose-built AWS
services to enable data ingestion from a variety of sources. Each of these services enables
simple self-service data ingestion into the data lake landing zone and provides integration with
other AWS services in the storage and security layers.

Operational database sources:


- These are operational SQL and NoSQL databases.
- Use AWS Data Migration Service (AWS DMS) to connect to SQL and NoSQL
databases and ingest their data into Amazon Simple Storage Service (Amazon S3)
buckets.
- With AWS DMS, you can first perform a one-time import of the source data into the data
lake and then replicate ongoing changes from the source database.
- AWS DMS encrypts S3 objects using AWS Key Management Service (AWS KMS) keys
as it stores them in the data lake.
- AWS DMS is a fully managed, resilient service and provides a wide choice of instance
sizes to host database replication tasks.
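
As an illustration of this pattern, the sketch below uses boto3 to create an S3 target endpoint and a full-load-plus-CDC replication task. Every ARN, bucket name, and identifier is a hypothetical placeholder, not a value from this architecture.

# Minimal sketch: S3 target endpoint + full-load-and-CDC replication task with AWS DMS.
# All ARNs, bucket names, and identifiers below are hypothetical placeholders.
import json
import boto3

dms = boto3.client("dms")

# Target endpoint that writes replicated data into the data lake landing zone,
# encrypted with a customer-managed KMS key.
target = dms.create_endpoint(
    EndpointIdentifier="datalake-s3-target",
    EndpointType="target",
    EngineName="s3",
    S3Settings={
        "BucketName": "my-datalake-landing-zone",   # placeholder bucket
        "BucketFolder": "dms/sales_db",
        "ServiceAccessRoleArn": "arn:aws:iam::111122223333:role/dms-s3-access",
        "DataFormat": "parquet",
        "EncryptionMode": "SSE_KMS",
        "ServerSideEncryptionKmsKeyId": "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE",
    },
)

# Select every schema and table in the source database for replication.
table_mappings = json.dumps({
    "rules": [{
        "rule-type": "selection",
        "rule-id": "1",
        "rule-name": "include-all",
        "object-locator": {"schema-name": "%", "table-name": "%"},
        "rule-action": "include",
    }]
})

# One-time full load of existing data, followed by ongoing change data capture (CDC).
dms.create_replication_task(
    ReplicationTaskIdentifier="sales-db-to-datalake",
    SourceEndpointArn="arn:aws:dms:us-east-1:111122223333:endpoint:SOURCE",    # placeholder
    TargetEndpointArn=target["Endpoint"]["EndpointArn"],
    ReplicationInstanceArn="arn:aws:dms:us-east-1:111122223333:rep:INSTANCE",  # placeholder
    MigrationType="full-load-and-cdc",
    TableMappings=table_mappings,
)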

Streaming data sources:


- Use Amazon Kinesis Data Firehose to receive streaming data from internal and external
sources.
- You can configure a Kinesis Data Firehose API endpoint where sources can send
streaming data such as clickstreams, application and infrastructure logs, monitoring
metrics, and IoT data such as device telemetry and sensor readings.

Kinesis Data Firehose does the following:

● Buffers incoming streams


● Batches, compresses, transforms, and encrypts the streams
● Stores the streams as S3 objects in the landing zone in the data lake
- Kinesis Data Firehose natively integrates with the security and storage layers and can
deliver data to Amazon S3, Amazon Redshift, and Amazon OpenSearch Service for
real-time analytics use cases.
- Kinesis Data Firehose is serverless, requires no administration, and has a cost model
where you pay only for the volume of data you transmit and process through the service.
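
To make the producer side concrete, here is a minimal boto3 sketch that sends one clickstream event to a delivery stream; the stream name and event fields are hypothetical placeholders.

# Minimal sketch: send one clickstream event to a Kinesis Data Firehose delivery stream
# that buffers, compresses, and delivers records to the S3 landing zone.
# The stream name and event fields are hypothetical placeholders.
import json
import boto3

firehose = boto3.client("firehose")

event = {"user_id": "u-123", "page": "/checkout", "ts": "2024-01-01T12:00:00Z"}

firehose.put_record(
    DeliveryStreamName="datalake-clickstream",   # placeholder stream name
    Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
)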

Data APIs
- Organizations use SaaS and partner applications such as Salesforce, Marketo, and
Google Analytics to support their business operations.
- SaaS APIs:
- The ingestion layer uses Amazon AppFlow to easily ingest SaaS application data
into the data lake.
- You can schedule AppFlow data ingestion flows or trigger them by events in the
SaaS application.
- Ingested data can be validated, filtered, mapped, and masked before being stored in
the data lake.
- AppFlow natively integrates with authentication, authorization, and encryption
services in the security and governance layer.
- Partner APIs
- To ingest data from partner and third-party APIs, organizations build or purchase
custom applications that connect to APIs, fetch data, and create S3 objects in the
landing zone by using AWS SDKs. These applications and their dependencies
can be packaged into Docker containers and hosted on AWS Fargate. Fargate is
a serverless compute engine for hosting Docker containers without having to
provision, manage, and scale servers. Fargate natively integrates with AWS
security and monitoring services to provide encryption, authorization, network
isolation, logging, and monitoring to the application containers.
- AWS Glue Python shell jobs also provide a serverless alternative to build and
schedule data ingestion jobs that can interact with partner APIs by using native,
open-source, or partner-provided Python libraries. AWS Glue provides
out-of-the-box capabilities to schedule singular Python shell jobs or include them
as part of a more complex data ingestion workflow built on AWS Glue workflows.
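
As a sketch of that second option (the API URL, bucket, and prefix are hypothetical placeholders), a Glue Python shell job could pull records from a partner REST API and write them as an object in the S3 landing zone:

# Sketch of an AWS Glue Python shell job that ingests data from a partner API
# into the S3 landing zone. The API URL, bucket, and prefix are hypothetical.
import json
from datetime import datetime, timezone

import boto3
import requests  # assumed available to the job or bundled as an additional library

API_URL = "https://api.example-partner.com/v1/orders"   # placeholder partner endpoint
BUCKET = "my-datalake-landing-zone"                      # placeholder bucket

def main():
    # Fetch the latest records from the partner API.
    response = requests.get(API_URL, timeout=30)
    response.raise_for_status()
    records = response.json()

    # Write the payload as a date-partitioned object in the landing zone.
    key = f"partner/orders/ingest_date={datetime.now(timezone.utc):%Y-%m-%d}/orders.json"
    boto3.client("s3").put_object(Bucket=BUCKET, Key=key, Body=json.dumps(records).encode("utf-8"))

if __name__ == "__main__":
    main()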

Third-party data sources:


- Ingest data from third-party datasets such as historical demographics, weather data, and
consumer behavior data.
- AWS Data Exchange provides a serverless way to find, subscribe to, and ingest
third-party data directly into S3 buckets in the data lake landing zone.

Data Lake
The data lake (storage) layer is responsible for providing durable, scalable, secure, and cost-effective
components to store vast quantities of data. It supports storing unstructured data and datasets of
a variety of structures and formats.

To store data based on its consumption readiness for different personas across the organization,
the storage layer is organized into the following zones:

● Raw zone – where components from the ingestion layer land data. This is a transient
area where data is ingested from sources as-is.
● Cleaned zone – After the preliminary quality checks, the data from the raw zone is
moved to the cleaned zone for permanent storage. Here, data is stored in its original
format. Having all data from all sources permanently stored in the cleaned zone provides
the ability to “replay” downstream data processing in case of errors or data loss in
downstream storage zones.
● Curated zone – This zone hosts data that is in the most consumption-ready state and
conforms to organizational standards and data models. Datasets in the curated zone are
typically partitioned, cataloged, and stored in formats that support performant and
cost-effective access by the consumption layer. The processing layer creates datasets in
the curated zone after cleaning, normalizing, standardizing, and enriching data from the
raw zone. All personas across organizations use the data stored in this zone to drive
business decisions.
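
As a concrete illustration (all bucket and prefix names are hypothetical placeholders), the zones above are often mapped to separate buckets or prefixes, with curated datasets written under partitioned, Parquet-backed paths:

# Sketch of a zone layout for the storage layer; names are hypothetical placeholders.
RAW_ZONE     = "s3://my-datalake-raw/"        # transient, as-ingested data
CLEANED_ZONE = "s3://my-datalake-cleaned/"    # permanent copy in original format
CURATED_ZONE = "s3://my-datalake-curated/"    # partitioned, cataloged, consumption-ready

# Curated datasets are typically partitioned so that query engines can prune data, e.g.:
curated_key = CURATED_ZONE + "sales/orders/year=2024/month=01/day=15/part-0000.parquet"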

Processing layer
The processing layer is responsible for transforming data into a consumable state through data
validation, cleanup, normalization, transformation, and enrichment. It’s responsible for advancing
the consumption readiness of datasets along the raw, cleaned, and curated zones and registering
metadata for the raw and transformed data into the cataloging layer.

- AWS Glue and AWS Step Functions provide serverless components to build,
orchestrate, and run pipelines that can easily scale to process large data volumes.
Multi-step workflows built using AWS Glue and Step Functions can catalog, validate,
clean, transform, and enrich individual datasets and advance them from the raw to the cleaned
zone and from the cleaned to the curated zone in the storage layer.
- AWS Glue is a serverless, pay-per-use ETL service for building and running Apache Spark
jobs (written in Scala or Python) or Python shell jobs, without requiring you to deploy or manage
clusters. AWS Glue automatically generates the code to accelerate your data
transformations and loading processes. AWS Glue ETL builds on top of Apache Spark
and provides commonly used out-of-the-box data source connectors, data structures,
and ETL transformations to validate, clean, transform, and flatten data stored in many
open-source formats such as CSV, JSON, Parquet, and Avro. AWS Glue ETL also
provides capabilities to incrementally process partitioned data (a minimal job sketch follows this list).
- Step Functions is a serverless engine that you can use to build and orchestrate
scheduled or event-driven data processing workflows.
- You use Step Functions to build complex data processing pipelines that involve
orchestrating steps implemented by using multiple AWS services such as AWS
Glue, AWS Lambda, Amazon Elastic Container Service (Amazon ECS) containers,
and more.
- Step Functions provides visual representations of complex workflows and their
running state to make them easy to understand. It manages state, checkpoints, and
restarts of the workflow for you to make sure that the steps in your data pipeline run in
order and as expected. Built-in try/catch, retry, and rollback capabilities deal with errors
and exceptions automatically.
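
As a minimal sketch of the kind of Glue ETL step such a Step Functions workflow would orchestrate (the database, table, and bucket names are hypothetical placeholders), the job below reads a raw dataset from the catalog, drops invalid rows, and writes partitioned Parquet to the curated zone:

# Sketch of an AWS Glue ETL (PySpark) job: read a raw dataset from the Glue Data Catalog,
# drop invalid rows, and write partitioned Parquet to the curated zone.
# Database, table, and bucket names are hypothetical placeholders.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the raw dataset registered by the crawlers in the cataloging layer.
raw = glue_context.create_dynamic_frame.from_catalog(
    database="datalake_raw", table_name="orders"
)

# Basic validation/cleanup: keep only rows with a non-null order identifier.
cleaned = raw.toDF().where("order_id IS NOT NULL")

# Write consumption-ready, partitioned Parquet into the curated zone.
cleaned.write.mode("append").partitionBy("order_date").parquet(
    "s3://my-datalake-curated/sales/orders/"
)

job.commit()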

Cataloging and search layer


The cataloging and search layer is responsible for storing business and technical metadata about
datasets hosted in the storage layer.

It provides the ability to track schemas and the granular partitioning information of datasets in the
lake.

As the number of datasets in the data lake grows, this layer makes datasets in the data lake
discoverable by providing search capabilities.
AWS Glue crawlers in the processing layer can track evolving schemas and newly added
partitions of datasets in the data lake, and add new versions of corresponding metadata in the
Lake Formation catalog.
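
For example, a crawler over the curated zone could be created as follows (the crawler name, role ARN, database, and S3 path are hypothetical placeholders):

# Sketch: create an AWS Glue crawler that keeps the catalog in sync with the curated zone.
# Names, role ARN, and S3 path are hypothetical placeholders.
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="curated-sales-crawler",
    Role="arn:aws:iam::111122223333:role/GlueCrawlerRole",     # placeholder role
    DatabaseName="datalake_curated",
    Targets={"S3Targets": [{"Path": "s3://my-datalake-curated/sales/"}]},
    Schedule="cron(0 2 * * ? *)",   # run nightly to pick up new partitions and schema changes
    SchemaChangePolicy={"UpdateBehavior": "UPDATE_IN_DATABASE", "DeleteBehavior": "LOG"},
)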

Data Analysis:
- Athena is an interactive query service that enables you to run complex ANSI SQL
against terabytes of data stored in Amazon S3 without needing to first load it into a
database. Athena queries can analyze structured, semi-structured, and columnar data
stored in open-source formats such as CSV, JSON, XML, Avro, Parquet, and ORC.
Athena uses table definitions from Lake Formation to apply schema-on-read to data read
from Amazon S3.

Athena is serverless, so there is no infrastructure to set up or manage, and you pay only
for the amount of data scanned by the queries you run. Athena provides faster results
and lower costs by reducing the amount of data it scans by using dataset partitioning
information stored in the Lake Formation catalog. You can run queries directly on the
Athena console or submit them using the Athena JDBC or ODBC endpoints.

Athena natively integrates with AWS services in the security and monitoring layer to
support authentication, authorization, encryption, logging, and monitoring. It supports
table- and column-level access controls defined in the Lake Formation catalog.
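
A minimal boto3 sketch of submitting a partition-pruned Athena query (the database, table, partition columns, and result bucket are hypothetical placeholders):

# Sketch: run an Athena query against a curated table and store results in S3.
# Database, table, and output bucket are hypothetical placeholders.
import boto3

athena = boto3.client("athena")

query = athena.start_query_execution(
    QueryString="""
        SELECT order_date, SUM(amount) AS revenue
        FROM orders
        WHERE year = '2024' AND month = '01'   -- partition columns prune the data scanned
        GROUP BY order_date
    """,
    QueryExecutionContext={"Database": "datalake_curated"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)

print(query["QueryExecutionId"])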

- Amazon QuickSight provides a serverless BI (business intelligence) capability to
easily create and publish rich, interactive dashboards. QuickSight enriches
dashboards and visuals with out-of-the-box, automatically generated ML insights such as
forecasting, anomaly detection, and narrative highlights. QuickSight natively integrates
with Amazon SageMaker to enable additional custom ML model-based insights to your BI
dashboards. You can access QuickSight dashboards from any device using a QuickSight
app, or you can embed the dashboard into web applications, portals, and websites.
- To achieve blazing fast performance for dashboards, QuickSight provides an in-memory
caching and calculation engine called SPICE. SPICE automatically replicates data for
high availability and enables thousands of users to simultaneously perform fast,
interactive analysis while shielding your underlying data infrastructure. QuickSight
automatically scales to tens of thousands of users and provides a cost-effective,
pay-per-session pricing model.
- QuickSight allows you to securely manage your users and content via a comprehensive
set of security features, including role-based access control, active directory integration,
AWS CloudTrail auditing, single sign-on (IAM or third-party), private VPC subnets, and
data backup.

Data Warehouse:
Amazon Redshift is a fully managed data warehouse service that can host and process
petabytes of data and run thousands of highly performant queries in parallel. Amazon Redshift uses
a cluster of compute nodes to run very low-latency queries to power interactive dashboards and
high-throughput batch analytics to drive business decisions. You can run Amazon Redshift
queries directly on the Amazon Redshift console or submit them using the JDBC/ODBC
endpoints provided by Amazon Redshift.

Amazon Redshift provides native integration with Amazon S3 in the storage layer, Lake
Formation catalog, and AWS services in the security and monitoring layer.

Amazon Redshift provides the capability, called Amazon Redshift Spectrum, to perform
in-place queries on structured and semi-structured datasets in Amazon S3 without
needing to load them into the cluster. Amazon Redshift Spectrum can spin up thousands of
query-specific temporary nodes to scan exabytes of data to deliver fast results. Organizations
typically load most frequently accessed dimension and fact data into an Amazon Redshift cluster
and keep up to exabytes of structured, semi-structured, and unstructured historical data in
Amazon S3. Amazon Redshift Spectrum enables running complex queries that combine data in a
cluster with data on Amazon S3 in the same query.
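
As a sketch of this pattern using the Redshift Data API (the cluster, database, user, role, and table names are hypothetical placeholders), the first statement exposes the Lake Formation catalog as an external schema and the second joins S3-resident data with a local dimension table:

# Sketch: create a Redshift Spectrum external schema over the data catalog and join
# S3-resident data with a local table. All names and ARNs are hypothetical placeholders.
import boto3

redshift_data = boto3.client("redshift-data")

create_external_schema = """
CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum_curated
FROM DATA CATALOG DATABASE 'datalake_curated'
IAM_ROLE 'arn:aws:iam::111122223333:role/RedshiftSpectrumRole'
"""

join_query = """
SELECT d.region, SUM(o.amount) AS revenue
FROM spectrum_curated.orders AS o        -- data stays in S3, scanned by Redshift Spectrum
JOIN dim_customer AS d ON o.customer_id = d.customer_id
GROUP BY d.region
"""

redshift_data.batch_execute_statement(
    ClusterIdentifier="analytics-cluster",   # placeholder cluster
    Database="analytics",
    DbUser="analytics_user",
    Sqls=[create_external_schema, join_query],
)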

ML (Machine Learning):

Amazon SageMaker is a fully managed service that provides components to build, train, and
deploy ML models using an integrated development environment (IDE) called Amazon
SageMaker Studio. In Amazon SageMaker Studio, you can upload data, create new notebooks,
train and tune models, move back and forth between steps to adjust experiments, compare
results, and deploy models to production, all in one place by using a unified visual interface.

ML models are trained on Amazon SageMaker managed compute instances, including highly
cost-effective Amazon Elastic Compute Cloud (Amazon EC2) Spot Instances. You can organize
multiple training jobs by using Amazon SageMaker Experiments. You can build training jobs
using Amazon SageMaker built-in algorithms, your custom algorithms, or hundreds of algorithms
you can deploy from AWS Marketplace. Amazon SageMaker Debugger provides full visibility into
model training jobs. Amazon SageMaker also provides automatic hyperparameter tuning for ML
training jobs.
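
A minimal sketch using the SageMaker Python SDK (the container image, role ARN, and S3 paths are hypothetical placeholders): train on a managed Spot instance, reading training data from the curated zone.

# Sketch: train a model on managed (Spot) instances with the SageMaker Python SDK.
# The image URI, role ARN, and S3 paths are hypothetical placeholders.
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()

estimator = Estimator(
    image_uri="111122223333.dkr.ecr.us-east-1.amazonaws.com/my-training-image:latest",  # placeholder
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",                       # placeholder
    instance_count=1,
    instance_type="ml.m5.xlarge",
    use_spot_instances=True,      # cost-effective Spot capacity for training
    max_run=3600,
    max_wait=7200,                # must be >= max_run when Spot instances are used
    output_path="s3://my-datalake-ml/artifacts/",
    sagemaker_session=session,
)

# Training data is read directly from the curated zone of the data lake.
estimator.fit({"train": "s3://my-datalake-curated/sales/orders/"})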

You can deploy Amazon SageMaker trained models into production with a few clicks and easily
scale them across a fleet of fully managed EC2 instances. You can choose from multiple EC2
instance types and attach cost-effective GPU-powered inference acceleration. After the models
are deployed, Amazon SageMaker can monitor key model metrics for inference accuracy and
detect any concept drift.

Amazon SageMaker provides native integrations with AWS services in the storage and security
layers.

Security and monitoring


The security and governance layer is responsible for protecting the data in the storage layer and
processing resources in all other layers. It provides mechanisms for access control, encryption,
network protection, usage monitoring, and auditing. The security layer also monitors activities of
all components in other layers and generates a detailed audit trail. Components of all other
layers provide native integration with the security and governance layer.

Authentication and authorization
IAM provides user-, group-, and role-level identity to users and the ability to configure
fine-grained access control for resources managed by AWS services in all layers of our
architecture. IAM supports multi-factor authentication and single sign-on through integrations with
corporate directories and open identity providers such as Google, Facebook, and Amazon.

Lake Formation provides a simple and centralized authorization model for tables hosted in the
data lake. Once implemented in Lake Formation, authorization policies for databases and tables
are enforced by other AWS services such as Athena, Amazon EMR, QuickSight, and Amazon
Redshift Spectrum. In Lake Formation, you can grant or revoke database-, table-, or
column-level access for IAM users, groups, or roles defined in the same account hosting the
Lake Formation catalog or another AWS account. The simple grant/revoke-based authorization
model of Lake Formation considerably simplifies the previous IAM-based authorization model
that relied on separately securing S3 data objects and metadata objects in the AWS Glue Data
Catalog.
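
For example, granting an analyst role column-level SELECT on a curated table might look like the following sketch (the role ARN, database, table, and column names are hypothetical placeholders):

# Sketch: grant column-level SELECT on a curated table through Lake Formation.
# The role ARN, database, table, and column names are hypothetical placeholders.
import boto3

lakeformation = boto3.client("lakeformation")

lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/AnalystRole"},
    Resource={
        "TableWithColumns": {
            "DatabaseName": "datalake_curated",
            "Name": "orders",
            "ColumnNames": ["order_id", "order_date", "amount"],  # column-level access
        }
    },
    Permissions=["SELECT"],
)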

Encryption
AWS KMS provides the capability to create and manage symmetric and asymmetric
customer-managed encryption keys. AWS services in all layers of our architecture natively
integrate with AWS KMS to encrypt data in the data lake. It supports both creating new keys and
importing existing customer keys. Access to the encryption keys is controlled using IAM and is
monitored through detailed audit trails in CloudTrail.
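
A minimal sketch of creating such a customer-managed key and an alias for it (the alias name is a hypothetical placeholder):

# Sketch: create a symmetric customer-managed key and an alias for data lake encryption.
# The alias name is a hypothetical placeholder.
import boto3

kms = boto3.client("kms")

key = kms.create_key(
    Description="Customer-managed key for data lake encryption",
    KeyUsage="ENCRYPT_DECRYPT",
    KeySpec="SYMMETRIC_DEFAULT",
)
kms.create_alias(
    AliasName="alias/datalake-cmk",
    TargetKeyId=key["KeyMetadata"]["KeyId"],
)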

Network protection
Our architecture uses Amazon Virtual Private Cloud (Amazon VPC) to provision a logically
isolated section of the AWS Cloud (called a VPC) that is isolated from the internet and other AWS
customers. Amazon VPC provides the ability to choose your own IP address range, create subnets,
and configure route tables and network gateways. AWS services from other layers in our
architecture launch resources in this private VPC to protect all traffic to and from these
resources.

Monitoring and logging


AWS services in all layers of our architecture store detailed logs and monitoring metrics in Amazon
CloudWatch. CloudWatch provides the ability to analyze logs, visualize monitored metrics, define
monitoring thresholds, and send alerts when thresholds are crossed.

All AWS services in our architecture also store extensive audit trails of user and service actions
in CloudTrail. CloudTrail provides event history of your AWS account activity, including actions
taken through the AWS Management Console, AWS SDKs, command line tools, and other AWS
services. This event history simplifies security analysis, resource change tracking, and
troubleshooting. In addition, you can use CloudTrail to detect unusual activity in your AWS
accounts. These capabilities help simplify operational analysis and troubleshooting.
