The document provides an overview of AWS Data Engineering, focusing on key services such as AWS Kinesis for real-time data streaming, AWS Glue for ETL processes, and AWS Database Migration Service for database migrations. It details the functionalities of each service, including data ingestion, processing, and management features. Additionally, it explains the architecture and components involved in these services, emphasizing their scalability and serverless nature.

AWS Data Engineering

Course by Johnny Chivers

https://www.youtube.com/c/JohnnyChivers
What is Data Engineering?
Data Engineering is the process of collecting, analysing, and transforming data from numerous sources. Data can be transient or persisted to a repository.
AWS Data Streaming

AWS Data Engineering


What is AWS Kinesis?
Realtime
Amazon Kinesis enables you to ingest, buffer, and process streaming data in real time, so you can derive insights in seconds or minutes instead of hours or days.

AWS Managed Service
Amazon Kinesis is fully managed and runs your streaming applications without requiring you to manage any infrastructure.

Scalable
Amazon Kinesis can handle any amount of streaming data and process data from hundreds of thousands of sources with very low latencies.
Kinesis Video Streams
Amazon Kinesis Video Streams makes it easy to securely stream video from connected devices to AWS for analytics, machine learning (ML), and other processing.

Kinesis Data Firehose
Amazon Kinesis Data Firehose is the easiest way to capture, transform, and load data streams into AWS data stores for near real-time analytics with existing business intelligence tools.

Kinesis Data Streams
Amazon Kinesis Data Streams is a scalable and durable real-time data streaming service that can continuously capture gigabytes of data per second from hundreds of thousands of sources.

Kinesis Data Analytics
Amazon Kinesis Data Analytics is the easiest way to process data streams in real time with SQL or Apache Flink without having to learn new programming languages or processing frameworks.
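As a concrete starting point for the Data Streams service described above, the following is a minimal boto3 sketch that provisions a stream and waits for it to become active. The stream name, shard count, and region are illustrative assumptions, not values from the course.

```python
# Minimal sketch: creating a Kinesis data stream with boto3.
# Stream name, shard count, and region are illustrative assumptions.
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Provisioned mode: capacity is fixed by the number of shards.
kinesis.create_stream(StreamName="example-stream", ShardCount=2)

# Wait until the stream is ACTIVE before writing to it.
kinesis.get_waiter("stream_exists").wait(StreamName="example-stream")

summary = kinesis.describe_stream_summary(StreamName="example-stream")
print(summary["StreamDescriptionSummary"]["StreamStatus"])
```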
Kinesis Data Streams High-Level Architecture

https://docs.aws.amazon.com/streams/latest/dev/key-concepts.html
Producer
Producers put records into Amazon Kinesis Data Streams. For example, a web server sending log data to a stream is a producer.

Shard
A shard is a uniquely identified sequence of data records in a stream. A stream is composed of one or more shards, each of which provides a fixed unit of capacity. Each shard can support up to 5 transactions per second for reads, up to a maximum total data read rate of 2 MB per second, and up to 1,000 records per second for writes, up to a maximum total data write rate of 1 MB per second (including partition keys). The data capacity of your stream is a function of the number of shards that you specify for the stream. The total capacity of the stream is the sum of the capacities of its shards. If your data rate increases, you can increase or decrease the number of shards allocated to your stream.

Retention Period
The length of time that data records are accessible after they are added to the stream. A stream's retention period is set to a default of 24 hours after creation. You can increase the retention period up to 8,760 hours (365 days).
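The sketch below shows a producer in these terms: it extends the retention period and puts a single record. It assumes a stream named "example-stream" already exists; all values are hypothetical.

```python
# Minimal producer sketch, assuming a stream named "example-stream" exists.
import json
import boto3

kinesis = boto3.client("kinesis")

# Extend retention from the 24-hour default to 7 days (168 hours).
kinesis.increase_stream_retention_period(
    StreamName="example-stream", RetentionPeriodHours=168
)

# Put a single record; the partition key decides which shard receives it.
response = kinesis.put_record(
    StreamName="example-stream",
    Data=json.dumps({"path": "/index.html", "status": 200}).encode("utf-8"),
    PartitionKey="web-server-01",
)
print(response["ShardId"], response["SequenceNumber"])
```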
Partition Key
A partition key is used to group data by shard within a stream.

Consumer
Consumers get records from Amazon Kinesis Data Streams and process them. These consumers are known as Amazon Kinesis Data Streams applications.

Sequence Number
Each data record has a sequence number that is unique per partition key within its shard.
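To make the consumer side concrete, here is a minimal polling sketch that reads from the first shard of the same hypothetical "example-stream". A production Kinesis Data Streams application would more typically use the Kinesis Client Library or AWS Lambda.

```python
# Minimal consumer sketch, assuming a single-shard stream named "example-stream".
import boto3

kinesis = boto3.client("kinesis")

shard_id = kinesis.describe_stream(StreamName="example-stream")[
    "StreamDescription"]["Shards"][0]["ShardId"]

iterator = kinesis.get_shard_iterator(
    StreamName="example-stream",
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",   # start from the oldest available record
)["ShardIterator"]

batch = kinesis.get_records(ShardIterator=iterator, Limit=100)
for record in batch["Records"]:
    # Each record carries its partition key and per-shard sequence number.
    print(record["PartitionKey"], record["SequenceNumber"], record["Data"])
```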
Kinesis Data Firehose High-Level Architecture

https://docs.aws.amazon.com/firehose/latest/dev/what-is-this-service.html
Record
The data of interest that your data producer sends to a Kinesis Data Firehose delivery stream. A record can be as large as 1,000 KB.

Data Producer
Producers send records to Kinesis Data Firehose delivery streams. For example, a web server that sends log data to a delivery stream is a data producer. You can also configure your Kinesis Data Firehose delivery stream to automatically read data from an existing Kinesis data stream and load it into destinations. For more information, see Sending Data to an Amazon Kinesis Data Firehose Delivery Stream.

Buffer Size and Buffer Interval
Kinesis Data Firehose buffers incoming streaming data to a certain size or for a certain period of time before delivering it to destinations. Buffer Size is in MBs and Buffer Interval is in seconds.
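The following sketch shows a data producer sending one record to a delivery stream with boto3. The delivery stream name "example-delivery-stream" is an assumption; Firehose buffers the record until the configured size or interval is reached.

```python
# Minimal Firehose producer sketch; the delivery stream name is hypothetical.
import json
import boto3

firehose = boto3.client("firehose")

firehose.put_record(
    DeliveryStreamName="example-delivery-stream",
    # Newline-delimited JSON is a common convention for records landing in S3.
    Record={"Data": json.dumps({"event": "page_view", "path": "/home"}).encode("utf-8") + b"\n"},
)
```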
Destinations

Amazon Simple Storage Service (Amazon S3)


Amazon Redshift
Amazon Elasticsearch Service (Amazon ES)
Splunk
Datadog
Dynatrace
LogicMonitor
MongoDB
New Relic
Sumo Logic
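Tying buffering and destinations together, here is a minimal sketch of creating a delivery stream that delivers to Amazon S3, the first destination listed above. The role ARN, bucket ARN, and names are assumptions.

```python
# Minimal sketch: a Firehose delivery stream buffering records into Amazon S3.
import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="example-delivery-stream",
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/example-firehose-role",
        "BucketARN": "arn:aws:s3:::example-bucket",
        "Prefix": "events/",
        # Deliver whenever 5 MB accumulates or 300 seconds pass, whichever is first.
        "BufferingHints": {"SizeInMBs": 5, "IntervalInSeconds": 300},
    },
)
```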
Kinesis Data Analytics

https://docs.aws.amazon.com/kinesisanalytics/latest/dev/how-it-works.html
AWS Glue

AWS Data Engineering


What is AWS Glue?
Managed ETL Service
AWS Glue is a fully managed ETL (extract, transform, and load) service that makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it reliably between various data stores and data streams.

Collection of Components
AWS Glue consists of a central metadata repository known as the AWS Glue Data Catalog, an ETL engine that automatically generates Python or Scala code, and a flexible scheduler that handles dependency resolution, job monitoring, and retries.

Serverless
AWS Glue is serverless, so there’s no infrastructure to set up or manage.
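To give a feel for the Python code the Glue ETL engine works with, here is a minimal sketch of a Glue job script. The catalog database "example_db", table "raw_events", and S3 output path are illustrative assumptions, not values from the course.

```python
# Minimal sketch of a Glue ETL script; names and paths are hypothetical.
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read a DynamicFrame using table metadata from the Glue Data Catalog.
events = glue_context.create_dynamic_frame.from_catalog(
    database="example_db", table_name="raw_events"
)

# Drop records missing a user id, then write the result to S3 as Parquet.
cleaned = events.filter(lambda row: row["user_id"] is not None)
glue_context.write_dynamic_frame.from_options(
    frame=cleaned,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/cleaned-events/"},
    format="parquet",
)
```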
AWS Glue Key Concepts

https://docs.aws.amazon.com/glue/latest/dg/components-key-concepts.html
AWS Glue Data Catalog
The persistent metadata store in AWS Glue. It contains table definitions, job definitions, and other control information to manage your AWS Glue environment. Each AWS account has one AWS Glue Data Catalog per region.

Crawler
A program that connects to a data store (source or target), progresses through a prioritized list of classifiers to determine the schema for your data, and then creates metadata tables in the AWS Glue Data Catalog.

Classifiers
Determine the schema of your data. AWS Glue provides classifiers for common file types, such as CSV, JSON, AVRO, and XML, as well as for common relational database management systems using a JDBC connection. You can also write your own classifiers.

Data Store
A data store is a repository for persistently storing your data. Examples include Amazon S3 buckets and relational databases. A data source is a data store that is used as input to a process or transform. A data target is a data store that a process or transform writes to.
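The sketch below shows these concepts together: a crawler pointed at an S3 data store that writes the tables it infers into the Data Catalog. The role, database name, and S3 path are assumptions.

```python
# Minimal sketch: defining and starting a Glue crawler with boto3.
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="example-crawler",
    Role="arn:aws:iam::123456789012:role/example-glue-role",
    DatabaseName="example_db",
    # The crawler classifies the objects under this path and creates catalog tables.
    Targets={"S3Targets": [{"Path": "s3://example-bucket/raw-events/"}]},
)

glue.start_crawler(Name="example-crawler")
```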

https://docs.aws.amazon.com/glue/latest/dg/components-overview.html
AWS Glue Data Catalog

The AWS Glue Data Catalog is your persistent metadata store. It is a managed service that lets you store, annotate, and share metadata in the AWS Cloud in the same way you would in an Apache Hive metastore.

The Data Catalog also provides comprehensive audit and governance capabilities, with schema change tracking and data access controls. You can audit changes to data schemas. This helps ensure that data is not inappropriately modified or inadvertently shared.

Each AWS account has one AWS Glue Data Catalog per AWS region.

The AWS Glue Data Catalog consists of a hierarchy of databases and tables. Tables are the metadata definitions that represent your data, and databases are logically grouped sets of tables.
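The hierarchy can be walked directly with boto3, as in this minimal sketch that lists every database and the tables it groups in the current account and region.

```python
# Minimal sketch: walking the Data Catalog hierarchy (databases, then tables).
import boto3

glue = boto3.client("glue")

for database in glue.get_databases()["DatabaseList"]:
    print(database["Name"])
    for table in glue.get_tables(DatabaseName=database["Name"])["TableList"]:
        # Each table holds schema and location metadata for the underlying data.
        location = table.get("StorageDescriptor", {}).get("Location", "")
        print("  ", table["Name"], location)
```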
AWS Database
Migration Service
(DMS)

AWS Data Engineering


What is AWS DMS?
AWS Database Migration Service helps you migrate databases to AWS
quickly and securely. The source database remains fully operational during
the migration, minimizing downtime to applications that rely on the
database. The AWS Database Migration Service can migrate your data to
and from most widely used commercial and open-source databases.

https://aws.amazon.com/dms/
https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Introduction.Components.html
Replication instance
At a high level, an AWS DMS replication instance is simply a managed Amazon Elastic Compute Cloud (Amazon EC2) instance that hosts one or more replication tasks.

Replication tasks
You use an AWS DMS replication task to move a set of data from the source endpoint to the target endpoint. Creating a replication task is the last step you need to take before you start a migration.

Endpoint
AWS DMS uses an endpoint to access your source or target data store. The specific connection information differs depending on your data store, but in general you supply details such as the endpoint type, engine type, server name, port, and credentials when you create an endpoint.

Schema and Code Migration
AWS DMS doesn't perform schema or code conversion; the AWS Schema Conversion Tool (AWS SCT) handles those tasks.
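The sketch below puts these components together with boto3: a source endpoint and a full-load replication task. The identifiers, ARNs, and credentials are assumptions, and a replication instance and target endpoint are presumed to exist already.

```python
# Minimal DMS sketch: a source endpoint plus a full-load replication task.
import json
import boto3

dms = boto3.client("dms")

source = dms.create_endpoint(
    EndpointIdentifier="example-source",
    EndpointType="source",
    EngineName="mysql",
    ServerName="source-db.example.com",
    Port=3306,
    Username="admin",
    Password="example-password",
)

dms.create_replication_task(
    ReplicationTaskIdentifier="example-full-load",
    SourceEndpointArn=source["Endpoint"]["EndpointArn"],
    TargetEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:TARGET",
    ReplicationInstanceArn="arn:aws:dms:us-east-1:123456789012:rep:INSTANCE",
    MigrationType="full-load",
    # Table mappings: migrate every table in the "sales" schema.
    TableMappings=json.dumps({
        "rules": [{
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-sales",
            "object-locator": {"schema-name": "sales", "table-name": "%"},
            "rule-action": "include",
        }]
    }),
)
```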

https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Introduction.Components.html
Sources for AWS DMS

Oracle
Microsoft SQL Server
MySQL
MariaDB
PostgreSQL
MongoDB
SAP Adaptive Server Enterprise (ASE)
Microsoft Azure SQL Database
Amazon RDS instance databases
Amazon Simple Storage Service (Amazon S3)

https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Introduction.Sources.html
