Slides
https://www.youtube.com/c/JohnnyChivers
What is Data Engineering?
Data Engineering is the process of collecting, analysing, and
transforming data from numerous sources. Data can be
transient or persisted to a repository.
AWS Data Streaming
Amazon Kinesis enables you to ingest, buffer, and process streaming data in real time, so you can derive insights in seconds or minutes instead of hours or days.

Amazon Kinesis is fully managed and runs your streaming applications without requiring you to manage any infrastructure.

Amazon Kinesis can handle any amount of streaming data and process data from hundreds of thousands of sources with very low latencies.
Kinesis Video Streams
Amazon Kinesis Video Streams makes it easy to securely stream video from connected devices to AWS for analytics, machine learning (ML), and other processing.

Kinesis Data Firehose
Amazon Kinesis Data Firehose is the easiest way to capture, transform, and load data streams into AWS data stores for near real-time analytics with existing business intelligence tools.
https://docs.aws.amazon.com/streams/latest/dev/key-concepts.html
Producer
Producers put records into Amazon Kinesis Data Streams. For example, a web server sending log data to a stream is a producer.

Shard
A shard is a uniquely identified sequence of data records in a stream. A stream is composed of one or more shards, each of which provides a fixed unit of capacity. Each shard can support up to 5 transactions per second for reads, up to a maximum total data read rate of 2 MB per second, and up to 1,000 records per second for writes, up to a maximum total data write rate of 1 MB per second (including partition keys). The data capacity of your stream is a function of the number of shards that you specify for the stream. The total capacity of the stream is the sum of the capacities of its shards. If your data rate increases, you can increase or decrease the number of shards allocated to your stream.

Retention Period
The length of time that data records are accessible after they are added to the stream. A stream's retention period is set to a default of 24 hours after creation. You can increase the retention period up to 8760 hours (365 days).

Partition Key
A partition key is used to group data by shard within a stream.

Consumer
Consumers get records from Amazon Kinesis Data Streams and process them. These consumers are known as Amazon Kinesis Data Streams applications.

Sequence Number
Each data record has a sequence number that is unique per partition key within its shard.
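A minimal boto3 sketch tying these concepts together: a producer puts a record keyed by a partition key and gets back the shard ID and sequence number, and a simple consumer reads records from that shard. The stream name, region, and payload are placeholders, not part of the original slides.

```python
import json
import boto3

# Hypothetical stream name and region; the stream must already exist.
kinesis = boto3.client("kinesis", region_name="eu-west-1")

# Producer: the partition key determines which shard the record lands on.
response = kinesis.put_record(
    StreamName="example-stream",
    Data=json.dumps({"path": "/home", "status": 200}).encode("utf-8"),
    PartitionKey="web-server-01",
)
# The sequence number is unique per partition key within the shard.
print(response["ShardId"], response["SequenceNumber"])

# Consumer: read from the start of the shard's retention period.
iterator = kinesis.get_shard_iterator(
    StreamName="example-stream",
    ShardId=response["ShardId"],
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]
for record in kinesis.get_records(ShardIterator=iterator, Limit=10)["Records"]:
    print(record["SequenceNumber"], record["Data"])
```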
Kinesis Data Firehose High-Level Architecture
https://docs.aws.amazon.com/firehose/latest/dev/what-is-this-service.html
Record
The data of interest that your data producer sends to a Kinesis Data Firehose delivery stream. A record can be as large as 1,000 KB.

Data Producer
Producers send records to Kinesis Data Firehose delivery streams. For example, a web server that sends log data to a delivery stream is a data producer. You can also configure your Kinesis Data Firehose delivery stream to automatically read data from an existing Kinesis data stream, and load it into destinations. For more information, see Sending Data to an Amazon Kinesis Data Firehose Delivery Stream.

Buffer Size and Buffer Interval
Kinesis Data Firehose buffers incoming streaming data to a certain size or for a certain period of time before delivering it to destinations. Buffer Size is in MBs and Buffer Interval is in seconds.

Destinations
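As a rough illustration of a data producer, the snippet below pushes one record into a Kinesis Data Firehose delivery stream with boto3 and lets the configured buffer size and buffer interval decide when it is flushed to the destination. The delivery stream name and payload are made up for the example.

```python
import json
import boto3

firehose = boto3.client("firehose", region_name="eu-west-1")

# Hypothetical delivery stream; Firehose buffers records until the configured
# buffer size (MB) or buffer interval (seconds) is reached, then delivers them.
log_line = {"timestamp": "2023-01-01T00:00:00Z", "level": "INFO", "message": "page served"}
firehose.put_record(
    DeliveryStreamName="example-delivery-stream",
    Record={"Data": (json.dumps(log_line) + "\n").encode("utf-8")},
)
```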
https://docs.aws.amazon.com/kinesisanalytics/latest/dev/how-it-works.html
AWS Glue
AWS Glue is a fully managed ETL (extract, transform, and load) service that makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it reliably between various data stores and data streams.

AWS Glue consists of a central metadata repository known as the AWS Glue Data Catalog, an ETL engine that automatically generates Python or Scala code, and a flexible scheduler that handles dependency resolution, job monitoring, and retries.

AWS Glue is serverless, so there's no infrastructure to set up or manage.
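For context, a Glue ETL script in Python typically looks something like the sketch below: read a table from the Data Catalog, apply a mapping, and write the result to Amazon S3. The database, table, column, and bucket names are invented for illustration.

```python
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read a source table that a crawler registered in the Data Catalog (names are hypothetical).
source = glue_context.create_dynamic_frame.from_catalog(
    database="example_db", table_name="raw_events"
)

# Rename and retype columns on the way through.
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("event_id", "string", "event_id", "string"),
        ("ts", "string", "event_time", "timestamp"),
    ],
)

# Write the transformed data back out to S3 as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/curated/events/"},
    format="parquet",
)
job.commit()
```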
AWS Glue Key Concepts
https://docs.aws.amazon.com/glue/latest/dg/components-key-concepts.html
AWS Glue Data Catalog
The persistent metadata store in AWS Glue. It contains table definitions, job definitions, and other control information to manage your AWS Glue environment. Each AWS account has one AWS Glue Data Catalog per region.

Crawler
A program that connects to a data store (source or target), progresses through a prioritized list of classifiers to determine the schema for your data, and then creates metadata tables in the AWS Glue Data Catalog.
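A crawler can also be defined programmatically. The boto3 sketch below creates one over an S3 prefix and starts it, so the tables it discovers land in the named Data Catalog database; the crawler name, IAM role ARN, database, and path are all placeholders.

```python
import boto3

glue = boto3.client("glue", region_name="eu-west-1")

# All identifiers below are hypothetical.
glue.create_crawler(
    Name="example-raw-crawler",
    Role="arn:aws:iam::123456789012:role/ExampleGlueCrawlerRole",
    DatabaseName="example_db",
    Targets={"S3Targets": [{"Path": "s3://example-bucket/raw/"}]},
)
glue.start_crawler(Name="example-raw-crawler")
```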
https://docs.aws.amazon.com/glue/latest/dg/components-overview.html
AWS Glue Data Catalog
The AWS Glue Data Catalog is your persistent metadata store. It is a managed service that lets you store, annotate, and share metadata in the AWS Cloud in the same way you would in an Apache Hive metastore.

The Data Catalog also provides comprehensive audit and governance capabilities, with schema change tracking and data access controls. You can audit changes to data schemas. This helps ensure that data is not inappropriately modified or inadvertently shared.

Each AWS account has one AWS Glue Data Catalog per AWS region.

The AWS Glue Data Catalog consists of a hierarchy of databases and tables. Tables are the metadata definitions that represent your data, and databases are logical groupings of tables.
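To see that hierarchy from code, a short boto3 loop over the catalog might look like this (the printed columns are just for illustration):

```python
import boto3

glue = boto3.client("glue", region_name="eu-west-1")

# Walk the catalog: databases are logical groupings, tables hold the metadata
# (schema, storage location) that describes the underlying data.
for database in glue.get_databases()["DatabaseList"]:
    for table in glue.get_tables(DatabaseName=database["Name"])["TableList"]:
        location = table.get("StorageDescriptor", {}).get("Location", "")
        print(database["Name"], table["Name"], location)
```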
AWS Database Migration Service (DMS)
https://aws.amazon.com/dms/
https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Introduction.Components.html
Replication instance
At a high level, an AWS DMS replication instance is simply a managed Amazon Elastic Compute Cloud (Amazon EC2) instance that hosts one or more replication tasks.

Replication tasks
You use an AWS DMS replication task to move a set of data from the source endpoint to the target endpoint. Creating a replication task is the last step you need to take before you start a migration.
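Assuming a replication instance and source/target endpoints already exist, creating and starting a replication task with boto3 looks roughly like the sketch below; the ARNs, task identifier, and table mappings are placeholders.

```python
import json
import boto3

dms = boto3.client("dms", region_name="eu-west-1")

# Hypothetical mapping rule: include every table in the "public" schema.
table_mappings = {
    "rules": [{
        "rule-type": "selection",
        "rule-id": "1",
        "rule-name": "include-public",
        "object-locator": {"schema-name": "public", "table-name": "%"},
        "rule-action": "include",
    }]
}

task = dms.create_replication_task(
    ReplicationTaskIdentifier="example-full-load-task",
    SourceEndpointArn="arn:aws:dms:eu-west-1:123456789012:endpoint:SOURCE",
    TargetEndpointArn="arn:aws:dms:eu-west-1:123456789012:endpoint:TARGET",
    ReplicationInstanceArn="arn:aws:dms:eu-west-1:123456789012:rep:INSTANCE",
    MigrationType="full-load",  # or "cdc" / "full-load-and-cdc"
    TableMappings=json.dumps(table_mappings),
)["ReplicationTask"]

# In practice you wait until the task status is "ready" before starting it.
dms.start_replication_task(
    ReplicationTaskArn=task["ReplicationTaskArn"],
    StartReplicationTaskType="start-replication",
)
```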
Sources for AWS DMS
Oracle
Microsoft SQL Server
MySQL
MariaDB
PostgreSQL
MongoDB
SAP Adaptive Server Enterprise (ASE)
Microsoft Azure SQL Database
Amazon RDS instance databases
Amazon Simple Storage Service (Amazon S3)
https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Introduction.Sources.html
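Each of these engines is registered with DMS as a source endpoint. As one hedged example, a PostgreSQL source endpoint could be defined with boto3 like this (hostname, credentials, and database name are placeholders):

```python
import boto3

dms = boto3.client("dms", region_name="eu-west-1")

# Hypothetical connection details for a PostgreSQL source database.
dms.create_endpoint(
    EndpointIdentifier="example-postgres-source",
    EndpointType="source",
    EngineName="postgres",
    ServerName="db.example.internal",
    Port=5432,
    DatabaseName="appdb",
    Username="dms_user",
    Password="example-password",
)
```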