0% found this document useful (0 votes)
60 views

AWS Data Analytics Specialty Exam Cram Notes

This document provides an introduction and overview of the AWS Certified Data Analytics - Specialty exam. It discusses the recommended knowledge and experience needed to pass the exam, including a minimum of 5 years experience with popular data analytics tools and 2 years of hands-on experience designing, building, and maintaining analytics systems using AWS services. The document also defines data analytics as inspecting, cleaning, converting, and modeling data to uncover usable information and assist with decision making. Finally, it introduces Amazon S3 and multipart uploads as important concepts related to collecting and storing large amounts of data.

Uploaded by

mnats2020
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
60 views

AWS Data Analytics Specialty Exam Cram Notes

This document provides an introduction and overview of the AWS Certified Data Analytics - Specialty exam. It discusses the recommended knowledge and experience needed to pass the exam, including a minimum of 5 years experience with popular data analytics tools and 2 years of hands-on experience designing, building, and maintaining analytics systems using AWS services. The document also defines data analytics as inspecting, cleaning, converting, and modeling data to uncover usable information and assist with decision making. Finally, it introduces Amazon S3 and multipart uploads as important concepts related to collecting and storing large amounts of data.

Uploaded by

mnats2020
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 43

DAS-C01: AWS Certified Data Analytics -

Specialty

Exam Cram Notes


First Edition
Chapter 01: Introduction

Introduction
Course Introduction
Individuals with experience and skills working with AWS
services to develop, create, protect, and maintain
analytics systems can pursue the AWS Certified Data
Analytics - Specialty. This course walks you through the
different concepts, including designing analytical
questions, gathering, analyzing, and preparing data while
evaluating the data and uncovering insights from it.

Recommended AWS Knowledge


The ideal applicant should have:

1) A minimum of five years of expertise with popular Digital Use Cases


data analytics tools 1. Application Monitoring: We can monitor our
2) Two years of hands-on experience and competence application performance, stability, and resource
in designing, building, securing, and maintaining usage with data analytics to make our applications
analytics systems using AWS services more efficient and cost-effective.
3) The ability to create AWS data analytics services and 2. Financial Analysis: Financial analysis is another huge
comprehend how they interact area for data analytics. We can gain useful insights
4) An understanding of how AWS data analytics from huge amounts of financial data.
services fit within the data lifecycle of collection, 3. Machine Learning: We can manage our large data
storage, processing, and visualization sets, train models, feed information from our
machine learning application back into that same
What is Data Analytics? data analytics system to help refine those models
Inspection, cleaning, converting, and modeling data to and generally improve our machine learning systems
uncover usable information, informing conclusions, and through the use of data analytics tools.
assisting decision-making are what data analytics is all
4. IOT Management: We can track, manage, and refine
about. Simply said, data analytics implies that we have
distributed networks of discrete devices using data
some data, and we need to answer questions with it. We
analytics tools.
have a process that needs to be improved, and we can
use the data we collect from our users in different
locations and on different devices to answer questions
and improve the processes that occur within our data
collection, data processing, data analytics, and our
understanding of what that data represents.

Steps for Success


To understand the different steps that go into a data
analytics pipeline, we need to understand the idea of the
steps for a successful data analytics pipeline.
Chapter 02: Amazon Simple Storage Service

Introduction to S3 add items to the bucket and read articles. As long as our
In our data analytics steps for success, S3 will fall in data users are properly close to the us-east-1 region, they will
collection primarily. It has features that will use in data not have any issues. We will see some latency once we
preparation, analysis, and data interpretation. get out to the other geographic locations. As users get
further and further away, that latency will increase. So,
Amazon S3 is a type of object storage that allows you to they may complain about how long it takes to get things
store and recover any quantity of data from any location. from our S3 data store.
It is a low-cost storage solution with business resilience,
reliability, efficiency, privacy, and infinite expansion. S3 S3 Multipart Upload
uses an object called the bucket. The bucket is the atomic What do we do when we have a lot of data?
unit for S3. If we think about a large file or object in the context of
S3, we can think about it as a large building. Moving it all
Amazon S3 offers a straightforward web service interface
at once is impractical if we are moving a building. It can
for storing and retrieving any amount of data from any
be time-consuming if we have to carry a completely big
location at any time. You may quickly create projects that
building all by itself, in one piece, it will be moved very
integrate cloud-native storage using this service.
slowly, and we are probably going to damage the
Because Amazon S3 is easily customizable and you only
structure, and it is potentially expensive. Hence, when
pay for what you use, you can start small and scale up as
we talk about moving a large object, we run into some
needed without sacrificing performance or
same issues. It is kind of impractical. Depending on the
dependability.
compute resources involved, it will be time-consuming
Upload Interfaces might be expensive. We might have a lot of idle CPU
When we upload data at S3, we have several interfaces cycles because we are just moving a large file all at once.
to work with. The AWS management console, the AWS To get around that, multipart upload comes in. If we look
CLI, and several AWS SDKs. at some of the limitations of the standard ways to get
data into S3, a single S3 Put is only going to let us upload
AWS Management Console
five gigabytes of data in a single Put, but an S3 object can
When we use the management console, we use a
be up to five tebibytes.
graphical user interface. We can add files and folders. We
can set most of the options to upload with this interface. We will use a multipart upload to get a five-tebibyte
object into S3. It has three steps, like if we were going to
AWS CLI
move a building, we would probably break it down into
When using the AWS CLI, we enter commands in our
components and have a plan and use that plan to
terminal that will allow us to move data into the S3
reassemble those components on the other side of the
bucket. We can use these commands to retrieve data
move.
from S3 buckets.
 Prepare Data: We will break the data into reasonably
Transfer Acceleration
sized pieces.
Another feature of S3 that we should be aware of when
talking about getting data into S3 is Transfer  Move Pieces: We will perform the multipart upload
Acceleration. To understand why we use Transfer steps to move all data to your S3 bucket.
Acceleration and how it works, let's look at a scenario.  S3 puts it together: We will let S3 know the upload
Assume we have an application. Our application will is complete, and S3 puts the data back together in
store data in a bucket in the us-east-1 region. And that the bucket.
works very well to send and receive data. Our users will
Three Multipart Upload API Calls 1) Parts: Multipart uploads can be made up of up to
There are three API calls that we use to perform this 10,000 fragments.
process. 2) Overwrite: Specifying the same part number as a
1) Create Multipart Upload: First, we have the previously uploaded part can be utilized to overwrite
CreateMultipartUpload API call. It returns Bucket, that part. We can overwrite the parts while the
Key, and UploadID. The main thing we need from multipart upload is still in progress. Suppose
returning this API call is the UploadID. The multipart something in your log file changes, or there is an
upload acts as a meta object that stores all of the update to a section of it. In that case, you can
information about our upload while it is happening. overwrite that part, that that section is in and have
It is going to hold the information about all the parts. the latest version of the log file in your object when
2) Upload Parts: Next, we have the UploadPart API call. it uploads, or if one of the parts fails or has some
We need to provide Bucket, Key, Part Number, and corrupted data in it, you can write it again into the
Upload ID. It returns an ETag. It is very important multipart upload. It will use whatever the latest data
because we need to deliver that when we do our is for that piece of the upload when it is reassembled.
final API call. 3) Auto-Abort: A bucket lifecycle policy that can be
3) Complete Multipart Upload: Finally, we have the utilized to abort multipart uploads after a specified
CompleteMultipartUpload API call. It returns time automatically. It prevents situations where a
Bucket, ETags, and Key. We need to provide Bucket, multipart upload gets started, something goes wrong
Key, Part Number, Upload ID, all Part Numbers, and with it, and the closing piece of it never goes
ETags. through, or there is an error when the close is
requested that upload will not work until the upload
Three Multipart Upload API Calls is completed or aborted. Generally, in a production
There are three API calls that we use to perform this system, it is good to have an auto abort configured
process.
for your bucket if you know that there will be a lot of
1) Create Multipart Upload: First, we have the multipart uploads going to that bucket.
CreateMultipartUpload API call. It returns Bucket, Key, Best Practices and Limitations
and UploadID. The main thing we need from returning There are also a few best practices and limitations that
this API call is the UploadID. The multipart upload acts as we need to think about.
a meta object that stores all of the information about our
upload while it is happening. It is going to hold the  AWS recommends considering multipart upload for
information about all the parts. files more significant than a hundred megabytes. It
means that you might want to use multipart upload
2) Upload Parts: Next, we have the UploadPart API call. for anything larger than a hundred megabytes (100
We need to provide Bucket, Key, Part Number, and MiB).
Upload ID. It returns an ETag. It is very important because  We need to consider the limitation that all parts
we need to deliver that when we do our final API call. must be at least five megabytes (5 MiB), except for
3) Complete Multipart Upload: Finally, we have the the final part.
CompleteMultipartUpload API call. It returns Bucket,  When we put these together, parts should be
ETags, and Key. We need to provide Bucket, Key, Part between five and a hundred megabytes.
Number, Upload ID, all Part Numbers, and ETags.
S3 Storage Classes
Considerations What are Storage Classes?
We have some considerations for this process. Amazon S3 provides a variety of storage classes to satisfy
a variety of use cases. Storage classes can be assigned to
individual objects, or the bucket can be configured to use S3 Standard
a specific storage class by default for anything added to S3 Standard provides excellent durability, availability,
it. These include: and performance object storage for frequently accessed
dataS3 Standard is suitable for a wide range of use cases,
 S3 Standard is a storage type for general-purpose
including cloud services, dynamic websites, content
storage of commonly accessed data. distribution, mobile and gaming apps, and big data
 S3 Intelligent-Tiering is utilized for data with analytics, due to its low latency and high throughput. A
uncertain or changing access patterns. It is more of a single bucket can contain objects stored across S3
management engine than a storage class itself. Standard, S3 Intelligent-Tiering, S3 Standard-IA, and S3
 Standard Infrequent Access is used for infrequently One Zone-IA, S3 Storage Classes, which can be defined at
accessed data. the object level. You may also utilize S3 Lifecycle policies
 One Zone Infrequent Access is very similar to to migrate items across storage classes without having to
Standard Infrequent Access, except it can easily make any modifications to your application.
replace data. We will use this storage class for that S3 Infrequent Access
data if we have an on-premise data store that we S3 Standard-IA is for data accessed infrequently but
want to keep a copy of in the cloud for easier access. has to be available quickly when needed. S3 Standard-IA
 The Glacier storage class is used for data archives combines S3 Standard's strong durability, speed, and low
that we may need faster than the Glacier Deep latency with a cheap per-GB storage and retrieval charge.
Archive. S3 Standard-IA is appropriate for long-term storage,
 The Glacier Deep Archive is also an archival storage backups, and data storage for disaster recovery files
class. It is utilized for digital and long-term archive because of its low cost and significant performance. A
preservation. single bucket can contain objects stored across S3
Standard, S3 Intelligent-Tiering, S3 Standard-IA, S3 One
Availability and Durability Zone-IA, and S3 Storage Classes can be defined at the
S3 Standard, S3 Standard–IA, S3 Intelligent-Tiering, S3 object level. You may also utilize S3 Lifecycle policies to
One Zone–IA, S3 Glacier, and S3 Glacier Deep Archive are migrate items across storage classes without having to
all meant to offer 99.999999999 percent (11 9's) data make any modifications to your application.
durability over a year. This level of durability corresponds
to a projected yearly loss of 0.000000001 percent of S3 Intelligent Tiering
items. For example, if you store 10,000,000 objects on Independent of object size or retention duration, S3
Amazon S3, you may anticipate a single object to be lost Intelligent-Tiering is the best storage class for data with
once every 10,000 years on average. On Outposts, S3 is unknown, changing, or unexpected access patterns. S3
meant to store data reliably and redundantly across Intelligent-Tiering may be the default storage class for
several devices and servers. Furthermore, Amazon S3 data lakes, analytics, and new applications.
Standard, S3 Standard-IA, S3 Glacier, and S3 Glacier Deep S3 Intelligent-Tiering monitors our objects for access
Archive are all built to keep data alive in the case of a assigned to the Intelligent-Tiering storage class. If the
complete S3 Availability Zone failure. object is not accessed for 30 days, it will move it to the
The S3 Standard storage class is designed for 99.99 configured Infrequent Access storage class. And once
percent availability, the S3 Standard-IA storage class and that object is accessed, it will be transferred back to the
the S3 Intelligent-Tiering storage class for 99.9% Standard storage class. It is where the unknown access
availability, the S3 One Zone-IA storage class for 99.5 pattern comes in. We will pay to monitor our objects, but
percent availability, and the S3 Glacier and S3 Glacier we trade off the access cost of Infrequent Access for this
Deep Archive storage classes for 99.99 percent monitoring cost. The monitoring is 1/4 of a penny per
availability and a 99.9% Service Level Agreement (SLA). 1,000 objects, and the storage cost will depend on the
storage class the object is in, in the underlying storage
class. If it is in Standard or Infrequent Access, that will
determine what we are paying to store those objects.

S3 Glacier
S3 Glacier is a safe, long-lasting, low-cost data archiving
storage type. You can store any quantity of data reliably
at prices comparable to or lower than on-premises
alternatives. S3 Glacier offers three retrieval options,
ranging from a few minutes to hours, to keep costs
reasonable while meeting various demands. You may
utilize S3 Lifecycle policies to move data between the S3
Storage Classes for active data (S3 Standard, S3
Intelligent-Tiering, S3 Standard-IA, and S3 One Zone-IA)
S3 Security and Encryption
S3 Glacier, or you can upload objects directly to S3
Glacier. S3 Security Overview
At AWS, cloud security is a top priority. As an AWS client,
S3 Lifecycle Policies you have access to a data center and network
Life Of Data architecture designed to fulfill the needs of the most
When we talk about the life of data, we mean that we security-conscious businesses. When we look at S3
will create our data, that data will be actively utilized, security and encryption, there are many S3 features and
and then eventually, we will likely archive or delete that integrated services that provide various functions to
data. Once the data has been formed, it will be active in maintain the security of our S3 buckets.
either the standard or infrequent access S3 storage class. S3 Features
And then, it will move over to the archive storage class.
We may then push that data back into active utilization  Access Analyzer for S3
at some point. We can manage most of this with our  Amazon S3 Server Access Logging
lifecycle policy. We do not need to work potentially  Bucket Policy
millions of objects stored in S3, and all of this will happen  Bucket Access Control List (ACL)
for us automatically.  Cross-Region Replication
Data Lifecycle in S3  Multi-factor authentication (MFA) Delete
When we look at the storage classes, we overlap our  Object Access Control List (ACL)
Venn diagram of active and archive. Typically, we will  Object Locking
bounce from the S3 standard to the infrequent access  Versioning
storage classes and then move to the archive. We might
move back to standard again and repeat this process in Integrated Services
the lifecycle of our data. It may have a loop around and  Amazon CloudWatch Alarms
around, or it may eventually be deleted after a certain
 AWS CloudTrail Logs
amount of time.
 Identity Access Management (IAM)
 VPC Endpoints
 Service Control Policies
 Key Management Service (KMS)
In-Flight Security
The S3 Bucket in In-Flight security requires TLS support
from the clients that connect to the S3 buckets. We also
have VPC endpoints, making it so that we can only access
our bucket through our VPC. We can combine that VPC generate a report around what is available. It is useful if
with VPN options to access S3 buckets from the outside. there is a hole that we missed in our access policies that
Then we can use some access control and auditing maybe we do not want to provide access to a bucket
feature to ensure that our above security features are through some specific avenue, which will reveal that for
operating in the way we expect them to. us.

Client-Side Encryption
The application server will request an encrypted object
for client-side encryption if we are using the KMS option.
Our bucket is going to return that encrypted object and
a cipher blob. The cipher blob identifies the Key that that
object was encrypted with. Our application server then
needs to call KMS for that Key. Hence, it will request the
Key associated with the cipher blob that is returned to Object Protection and Replication
the object, and then a data key is returned for that Protection
object. We can then combine our encrypted object and We have an object that is in our S3 bucket. This object
our plain text data key into a decrypted object. happens to be a beat. We want to protect this object so
that it cannot be deleted, or we are going to put it into a
Server-Side Encryption Write Once Read Many modes or WORM. We can turn
For server-side encryption, if S3 manages our encryption on object locking, which prevents the deletion of this
keys, our application server will request an object, and object without disabling object locking.
S3 will decrypt it and send it back.
Alternatively, we can enable multi-factor authentication
The S3 Access Security Waterfall to delete our bucket, requiring an MFA token to delete
There is a permission waterfall when we talk about the objects in the bucket. It is useful because we can control
API and our various ways to access objects in S3 and this feature with some granularity, and these users with
control that access. Each one of the services in the MFA tokens are allowed to delete objects from a bucket.
waterfall adds to the flow, and at the bottom line of the All other users are not, which gives our administrators
waterfall, they are all combined to create a single policy the ability to remove objects if needed.
that determines whether or not we can access an object
or a bucket in S3. Replication
We can also replicate our bucket across regions. Hence,
Access Logging, Alerting, and Auditing the objects in our bucket will be copied to a bucket in a
We have the S3 server access log and cloud trail logs, and second region bucket. We will need to have a different
with these two combined, we can get various name, but this provides disaster recovery in case there is
granularities of access to our buckets. It can be used to a loss of the entire AWS region for whatever reason. We
feed data into CloudWatch, or CloudWatch on its own can turn on cross-region replication, and we will still
can be used for alerting to perform various actions. If a access our buckets. We may need to update our code to
bucket suddenly receives a considerable amount of point our applications to the correct bucket or set up
requests, we could set up a CloudWatch alarm that will some automatic failover in our application code. To turn
trigger a Lambda function that will turn off access to that on cross-region replication, you need to enable
bucket. Maybe a timer waits a certain amount of time versioning. This is because if an object is placed in a
and then automatically re-enables access to that bucket bucket and the replication engine starts, and it is a huge
to resume operation. object, that object is replaced. Suddenly you have an
Understanding all the layers of access authorization can invalid replication occurring. Therefore, the service can
get quite complicated, and that is where the access replicate a specific version. Then if that object is
analyzer for S3 comes in. It will analyze the various overwritten while replication is still happening, it will
policies and ACLs involved in providing access, and it will complete the reproduction of the performance that it is
copying and then replicate the newer version. It means
that it is possible for there to be some delay in replicating
our objects into our second region. Still, generally, this
works out very well for disaster recovery.
Chapter 03: Databases in AWS

Introduction to Databases in AWS Relational databases are used in the enterprise to


Introduction organize data and find links between crucial data
In the analytics process, how do you use databases? We elements. They simplify managing and finding
will discuss databases services in AWS in this chapter. In information, allowing businesses to make better-
the analytics process, how do you use databases? For informed decisions and save money. They work
example, databases. For the most part, Databases will be effectively with data that is structured.
in our data preparation section. They will be a starting Row Databases
point for aggregating data or a source of data that you
will input into your pipeline, perhaps as a secondary Row-oriented databases organize data by history and
dimension, or collect and send into our data analysis. It retain all of the data associated with a record in memory
might be moved to a data warehouse or data lake, or it adjacent to each other. Row-oriented databases are the
could be read directly from the database in our pipeline. conventional method of data organization, and they still
offer some important advantages for storing data fast.
Organizing Data They have been designed to read and write rows quickly.
We might have user profile service metadata application
logs and IoT data. We will take that data and put it in our Row Database Use Cases
databases, then gather it all together, get it organized, OLTP-Online Transaction Processing has the following
and clean up our little pile of data. features.
Services
 Used for rapid transactions
We have the relational database service Aurora, an
 Protects data through transaction rollback
engine that runs in the relational database service,
 Ideal for low latency applications
DynamoDB, and Elasticache. All of these have different
use cases. Columnar Databases

Database Engines Types A columnar database is a Database Management System


A database engine (sometimes known as a storage (DBMS) that stores data in columns rather than rows. A
engine) is the software component that enables a columnar database's goal is to reduce the time it takes to
Database Management System (DBMS) to generate, return a query by quickly writing and reading data to and
read, update, and delete (GRUD) data from a database. from hard disc storage. Columnar databases store data
Most database management systems have an so that disc I/O speed is considerably improved. They are
Application Programming Interface (API) that allows very useful for data warehousing and analytics.
users to communicate with the underlying engine Columnar Database Use Cases
without going via the DBMS's user interface.
OLAB – Online Analytics Processing has the following
Relational Database features.
A relational database is a collection of data objects with
specified relationships and can be easily retrieved. In the  Uses for analytics workloads
relational database paradigm, data structures such as  Manages large amounts of data
data tables, indexes, and views are maintained distinct  Handles complex long-running query operations
from physical storage structures, allowing database
managers to alter the physical data storage without
affecting the logical data structure.
Non-Relational Database entities, while edges store relationships between things
Non-relational databases (often referred to as ‘NoSQL’ or in graph databases. Edge contains a start node, an end
‘JSON’ or ‘key: value’ databases) differ from traditional node, a type, and a direction, and it may be used to
relational databases because they are stored in a non- define parent-child connections, actions, and ownership,
tabular format. Non-relational databases are among other things. A node can have an infinite number
substantially more adaptable than relational databases and variety of relationships.
since they can digest and organize various information
side-by-side. Non-relational databases, on the other Relational Database Service
hand, should be built on data formats such as Introduction
documents. A document can be quite thorough while Amazon RDS (Amazon Relational Database Service) is a
also including various data types in multiple forms. cloud-based managed service that makes it easier to set
up, operate, and scale a relational database. It offers
Key-Value Database cost-effectiveness and scalability while handling time-
A key-value database is a non-relational database that consuming database management responsibilities,
uses a basic key-value method to store data. Data is allowing you to focus on your applications and company.
stored in a key-value database as a collection of key- Amazon RDS offers the functionality of a well-known
value pairs. A key serves as a unique identifier. Both keys MySQL, MariaDB, Oracle, SQL Server, or PostgreSQL
and values can be any type of object, from basic to database. It implies that the code, apps, and tools you
sophisticated compound objects. Key-value databases are already using with your old databases should operate
are extremely partitioned tables and can scale just fine with Amazon RDS. Amazon RDS can back up your
horizontally to scales that other databases cannot. database automatically and maintain your database
Suppose a current section fills and extra storage space is software up to date with the newest version. You have
necessary. In that case, Amazon DynamoDB assigns the option of scaling the processing resources or storage
additional partitions to the database. space associated with your relational database instance.
Document Database Furthermore, for read-heavy database workloads,
Amazon RDS makes it simple to implement replication to
A document database is a non-relational database that increase database availability data durability or grow
stores and queries data as JSON-like documents. beyond the capacity restrictions of a single database
Documents and document databases can adapt to the instance. There are no upfront expenditures necessary,
demands of applications due to their flexible, semi- and you just pay for the resources you use, as with other
structured, and hierarchical nature. By employing the Amazon Web Services.
same document-model format as their application code,
document databases make it easier for developers to Amazon Aurora, MySQL, MariaDB, Oracle, SQL Server,
store and query data in a database. The document model and PostgreSQL are all supported by Amazon RDS.
is particularly suited to use cases where each document Managed Service
is unique and changes over time, such as catalogs, user RDS is a managed service. What does this mean? The
profiles, and content management systems. Flexible basic explanation is that we do not have access to the
indexing, strong ad hoc searches, and analytics over operating system of the instances we run in these
collections of documents are all possible with document services. Hence, what we get is a management API that
databases. allows us two manage database instances. RDS runs
Graph Database several engines, and you have Aurora, MySQL, MariaDB,
PostgreSQL, SQL Server, and Oracle and Aurora in two
Graph databases are especially suited for storing and varieties MySQL and PostgreSQL.
traversing relationships. Relationships are treated as
first-class citizens in graph databases, accounting for the
vast bulk of the database's value. Nodes store data
Second Level Service Neptune
RDS is considered the second level of service, and to Introduction
understand that, we use the dig command. You create a Amazon Neptune is a fast, trustworthy, and fully-
database instance and run a dig against it. You can get a managed graph database service that simplifies the
ton of information about RDS structure from the simple design and operation of applications that deal with vast,
control. linked datasets. Neptune is designed around a purpose-
RDS Instance made, high-performance graph database engine. This
A DB instance is a standalone database environment that engine is designed to store billions of relationships and
runs on the cloud. It is the fundamental component of query the graph in milliseconds. Neptune supports the
Amazon RDS. A DB instance can hold numerous user- popular graph query languages Apache TinkerPop
created databases and be accessed using the same client Gremlin and W3C's SPARQL, allowing you to effectively
tools and applications as a single database instance. create searches that explore densely linked datasets.
Neptune drives graph application cases, including
Operation System Access recommendation engines, fraud detection, knowledge
How do you manage not having operating system access graphs, medicine discovery, and network security.
and still tune our database to function to our workload?
You have parameter groups, which are controlled Neptune is extremely available with read replicas, point-
through the RDS API. You have option groups, which are in-time recovery, continuous backup to Amazon S3, and
also held through the RDS API parameter groups, which replication across Availability Zones. Neptune offers data
will allow us to change engine parameters, like how security measures such as encryption at rest and in
logging is configured and stored various memory transit. Because Neptune is fully managed, you no longer
management parameters and any other engine need to worry about database administration activities
variables. Some operating system variables/option such as hardware provisioning, software patching, setup,
groups are used to manage plug-ins in some cases. configuration, or backups.
Oracle uses option groups pretty heavily, whereas Graph Structure
PostgreSQL does not use them. A graph is a data structure consisting of a finite number
Disaster Recovery of nodes (or vertices) and edges that link them. In the
The big disaster recovery feature for RDS is multi-AZ following examples, circles represent vertices, and lines
deployments. The way that it works is that you are going indicate edges. An edge (x,y) signals that the x vertex
to have a primary and secondary instance. In this relates to the y vertex.
example, our prime samples are in US West 2A or We know that a graph has a node from our database
secondary ones in US West 2B. These instances are going engine types section. Those nodes will contain data, and
to be replicating at the block level. The database engine then the nodes will be connected via edges. The edges
is not involved at all; the database engine will not be are common data in those nodes. We can traverse these
running at all and on the secondary. It is going to be graphs and collect data secured by the nodes.
sleeping. RDS then monitors the engine, the compute
instances, the EBS volumes, and the replication between Interface Languages
those EBS volumes, and if the machine or the EC2 It is two different interface languages that Neptune
instance or the EBS volume fails, it will terminate that supports. We have Apache's TinkerPop Gremlin, or
instance. That instance will no longer be the primary, and simply Gremlin, and we have the W3C sparkle protocol,
the secondary in US West 2B will become the primary. which is used with the resource description framework
The RDS service will wake up the database engine, and or RDF query language.
then a new secondary in our old availability zone US
1. Apache – TinkerPop Germlin
West 2A will be created with an engine that is not
running. o Graph structure property
o Interface: WebSocket
o Query Pattern: Traversal
2. W3C – SPARQL Protocol & RDF Query Language
o Graph structure: Resource Description Framework
(RDF)
o Interface: HTTP Rest
o Query Pattern: SQL S3 Select
DocumenDB Using simple SQL expressions, S3 Select allows apps to
get only a subset of data from an object. You may
Introduction
drastically improve speed by retrieving only the data
Amazon DocumentDB (with MongoDB compatibility) is a
required by your application using S3 Select. In many
highly scalable, dependable, and completely managed
circumstances, you might expect to see a 400%
database service. You may run the same application code
improvement.
and utilize the same drivers and tools with MongoDB
with Amazon DocumentDB. Amazon DocumentDB S3 Select
simplifies the setup, operation, and scaling of MongoDB- Using simple SQL expressions, S3 Select allows apps to
compatible databases in the cloud. get only a subset of data from an object. You may
drastically improve speed by retrieving only the data
Some programmers may not consider their data model
required by your application using S3 Select. In many
normalized rows and columns. Data is typically
circumstances, you might expect to see a 400%
represented as a JSON document at the application tier
improvement.
because it is more intuitive for developers to think of
their data model as a document. Athena
Amazon Athena is an interactive query service that
Serverless Options allows you to use conventional SQL to evaluate data
Introduction directly in Amazon Simple Storage Service (Amazon S3).
Serverless is a means of offering backend services as With a few clicks in the AWS Management Console, you
needed. Users may use a serverless provider to build and can aim Athena at your Amazon S3 data and start running
publish code without worrying about the underlying ad-hoc searches with conventional SQL to obtain results
infrastructure. Because the service is auto-scaling, a firm in seconds.
that obtains backend services from a serverless vendor is
charged depending on their calculations and does not DynamoDB
have to reserve and pay for a predetermined amount of Amazon DynamoDB is a fully managed NoSQL database
bandwidth or number of servers. It should be noted that, service that delivers quick and predictable performance
despite the term, actual servers are still utilized, but while seamless scaling. You can offload the
developers are not required to be aware of them. administrative requirements of running and growing a
distributed database using DynamoDB, so you do not
To start, we have S3 select, which is just an API call that have to worry about hardware provisioning, setup,
lets us make selects against data in S3 buckets. We have configuration, replication, software patching, or cluster
Athena, which is more of a fully relational database scalability. DynamoDB also supports encryption at rest,
management system that S3 backs. We have DynamoDB, removing the operational load and complexity
a key-value plus a fully managed serverless database. We associated with securing sensitive data.
also have Aurora serverless, which leverages the
architecture of Aurora to treat our database as though it Aurora
is serverless. Serverless means that we do not need to Amazon Aurora (Aurora) is a relational database engine
administrate any servers. Hence there are still servers that is fully managed and compatible with MySQL and
behind all of these services. We are just not going to be PostgreSQL. MySQL and PostgreSQL combine the
building them at all. performance and dependability of high-end commercial
databases with the simplicity and low cost of open-
source databases. The code, tools, and applications that
you are now using with your existing MySQL and
PostgreSQL databases may be utilized with Aurora.
Aurora may give up to five times the performance of
MySQL and three times the throughput of PostgreSQL in
specific workloads without needing modifications to the
bulk of your existing applications.

Aurora Serverless
Amazon Aurora Serverless v1 (Amazon Aurora Serverless
version 1) is a configuration for on-demand autoscaling
in Amazon Aurora. An Aurora Serverless DB cluster is a
database cluster that dynamically scales processing
capacity based on the needs of your application. In
contrast, Aurora supplied DB clusters require manual
capacity management. Aurora Serverless v1 is a simple,
low-cost choice for occasional, intermittent, or
unexpected workloads. It saves money since it starts up
automatically, boosts processing capacity to match the
needs of your application, and shuts down when not in
use.
Chapter 04: Collecting Streaming Data

Introduction Big Data Collection


The streaming data makes up a mass amount of data It is based on collecting and then mining large amounts
collected, so it is important to understand what it is and of data for information. Big data refers to massive
what are the different ways you can collect, process, and volumes of organized, semi-structured, and
store it within AWS. Streaming data is new data, and it unstructured data acquired by businesses. New
plays a big role in actionable decisions that can be made approaches for collecting and analyzing data have
with that data. emerged because it takes a lot of time and money to load
big data into a traditional relational database for
Kinesis Family analysis. In a data lake, raw data with extra information
Data Collection is aggregated. Machine learning and artificial intelligence
The practice of gathering, measuring, and evaluating systems then employ complicated algorithms to search
correct research insights using established procedures is for repeating patterns.
data collection. Based on the evidence gathered, a
Streaming Data
researcher can assess their hypothesis. Depending on
Streaming data is generated continuously by thousands
the information requested, the approach to data
of data sources. These are typically sent in small data
gathering differs for different topics of research.
records simultaneously, think in the order of kilobytes.
Regardless of the subject of study, data gathering is the
For example, think about sensors in vehicles or industrial
first and most significant stage in most situations.
equipment that you might see in some factory or farming
The systematic process of gathering and measuring machinery. These sensors send data to streaming
information from many sources to create a full and applications; these applications monitor performance,
accurate picture of a subject is known as data collection. detect any potential defects in advance, or even place an
Data collecting allows a person or organization to answer order for spare parts. When defects or any type of
pertinent questions, assess results, and forecast future anomaly is detected within the equipment it is running
probability and trends. on; it prevents equipment downtime. Another example
is a financial institution that tracks changes in the stock
Data collection’s most critical objective is to ensure that
market in real-time. It then computes value at risk and
information-rich and reliable data is collected for
rebalances portfolios automatically depending on stock
statistical analysis so that data-driven decisions can be
price fluctuations.
made.
AWS Kinesis Family
Data Collection Methods
The AWS Kinesis family is not just one service but a family
Surveys, interviews, and focus groups are the most
of services. Kinesis offers several different services that
common methods for gathering information.
help you get your streaming data into AWS and build
Corporations may now collect data from mobile devices,
robust applications around streaming data. The first
website traffic, server activity, and other relevant
service is the Kinesis data stream.
sources using Web and analytics technologies depending
on the project. Kinesis data stream allows us to collect and process large
streams of data records in real-time. It is the most
important service when it comes to Kinesis services. The
next Kinesis service is Kinesis Data Firehose. Kinesis Data
Firehose is our delivery stream, which allows us to deliver 1. Accelerated log and data feed intake and processing:
our streaming data to various data sources, such as
Data can be immediately sent into a stream by
Amazon S3, Redshift, Elasticsearch, or Splunk. The next
producers. For example, push system and application
Kinesis service is Kinesis video streams that allow us to
logs will be ready for analysis in seconds. If the front-end
stream live videos from devices to the AWS cloud, and
or application server dies, the log data will not be lost.
you can build real-time applications around these video
Kinesis Data Streams provide quicker data feed intake
streams. Finally, we have Kinesis data analytics. It is what
because you do not batch the data on the servers before
you can use to process and analyze streaming data using
submitting it for input.
standard SQL queries. Hence, you can essentially run SQL
queries in real-time on streaming data. 2. Real-time metrics and reporting:

Thus, Kinesis Data Stream, Kinesis Data Firehose, Kinesis Data gathered via Kinesis Data Streams may be used for
Video Streams, and Kinesis Data Analytics are the four real-time data analysis and reporting. Instead of waiting
services that make up the Kinesis family. for batches of data to arrive, your data-processing
application might work on metrics and reporting for
Kinesis Data Streams system and application logs as they come.
Introduction
One of the benefits of using Kinesis Data Stream is 3. Real-time data analytics:
aggregating real-time data and then loading it into some The strength of parallel processing is combined with the
data warehousing solution like Redshift or some value of real-time data in this way. For example, utilizing
MapReduce cluster like EMR. Kinesis Data Streams are many Kinesis Data Streams applications operating in
also durable and elastic, meaning that you would not parallel, process website clickstreams in real-time, and
lose our data records. You would not lose our streaming then assesses site usability engagement.
data, and it scales up or down depending on the number
of records coming into our stream. It means that you can 4. Complex stream processing:
get all the benefits of a managed streaming service Kinesis Data Stream applications and data streams may
rather than having to host it on EC2 instances ourselves. be turned into Directed Acyclic Graphs (DAGs). It usually
You want to take advantage of managed services entails combining data from many Kinesis Data Stream
wherever you can, which is an integral part of collecting applications into a single stream for later processing by
streaming data within AWS. You can also have parallel another Kinesis Data Stream application.
applications reading from the stream. Hence, you can
perform different functions on that data. It is one of the Shard
significant differences between Kinesis and other Shards are a container that holds our information
queuing services. shipped off to consumers. Let us assume that you have a
single shard. You can see this shard has two data records;
Working with Kinesis Data Streams each one of these data records consists of a partition key,
Kinesis Data Streams may be used to collect and a sequence ID, the data, and the actual data you want to
aggregate data in real-time. IT infrastructure log data, ship off to consumers. The partition key is going to be the
application logs, social media, market data feeds, and same for all the records within a shard. The sequence
online clickstream data are some examples of the data number will be in the order in which the shard received
types that may be employed. Because the data intake the record, and that will be our data. Each shard consists
and processing are both done in real-time, the of a sequence of data records. Here, you have two, which
processing is generally minimal. can be ingested at 1000 records per second. The actual
data payload per record can be up to one megabyte.
The following are some examples of how Kinesis Data
Streams can be used: Processing & Storage
A shard is temporary data storage. Data records are
stored for 24 hours by default and can be extended up to
365 days. By default, the data retention period is 24 3. Kinesis Agent
hours. You can raise the data retention period to seven
The Kinesis Agent is a ready-to-use Java application that
days by enabling extended data retention. You can
can be deployed on a Linux-based server. It is an agent
increase it even further by allowing long-term data
that monitors specific files and continuously sends data
retention to have the data persist in the shard for up to
to our Data Stream. Hence, you might want to install this
365 days. You can do this by using the increased stream
on web servers, log servers, or database servers.
retention period operation. You can decrease it by using
the reduced stream retention period operation. Hence, Kinesis Agent is a Java software application that allows
going back to our train example, passengers will be you to gather and transfer data to Kinesis Data Streams
booted from the train every 24 hours, but some rules quickly. The agent watches a group of files in real-time
may differ for some trains. It is the retention period, so and feeds new data to your stream. The agent performs
some passengers may be allowed to stay for up to 365 file rotation, checkpointing, and retries in the event of a
days. failure. It distributes all of your data in a dependable,
fast, and straightforward manner. It also emits Amazon
Interacting with Kinesis Data Stream
CloudWatch metrics to assist you in monitoring and
There are a few different ways to interact with Kinesis
troubleshooting the streaming operation.
Data Streams:
By default, entries from each file are processed based on
1. Kinesis Producer Library (KPL):
the newline ('n') character. The agent, on the other hand,
An application that inserts user data into a Kinesis data may be set to parse multi-line entries.
stream is an Amazon Kinesis Data Streams producer (also
The agent may be deployed on Linux-based web servers,
called data ingestion). The Kinesis Producer Library (KPL)
log servers, and database servers. Configure the agent
makes it easier for developers to create producer
after installing it by providing the files to monitor and the
applications by achieving high write throughput to a
data stream. Once set up, the agent takes data from files
Kinesis data stream.
and consistently feeds it to the stream.
2. Kinesis Client Library (KCL):
4. Kinesis API (AWS SDK):
KCL takes care of many complicated duties connected
It is used to process data from the Kinesis Data Stream.
with distributed computing, allowing you to receive and
Once the data is in Kinesis Data Streams, you can use
process data from a Kinesis data stream. Load balancing
Kinesis Client Library, abbreviated as KCL, to directly
across numerous consumer application instances,
interact with the Kinesis Producer Library to consume.
responding to consumer application instance failures,
These libraries are used to abstract some of the low-level
checkpointing processed records, and responding to re-
commands you would have to use with the Kinesis API.
sharding are examples of these. The KCL handles all of
However, it is used for more low-level API operations and
these subtasks, allowing you to concentrate on
more manual configurations. Hence, the interaction with
implementing your unique record-processing logic.
Kinesis Data Stream is by using the Kinesis API. With the
The KCL is not the same as the Kinesis Data Streams APIs Kinesis API, you can perform the same actions that you
offered in the AWS SDKs. The Kinesis Data Streams APIs can achieve with the Kinesis Producer Library or the
assist you in managing many elements of Kinesis Data Kinesis Client Library. Hence, you can install the Kinesis
Streams, such as establishing streams, re-sharding, Producer Library on two EC2 Instances or integrate it
inserting and receiving information. The KCL adds a layer directly into your Java applications.
of abstraction around all of these subtasks, allowing you
KPL VS Kinesis API
to focus on the particular data processing logic in your
Some key features between the Kinesis Producer Library
consumer application.
and the Kinesis API are mentioned below:
Features of KPL: 3. Data Producer:
 Provides a layer of abstraction dedicated to data
Records are sent to Kinesis Data Firehose delivery
intake
streams by producers. A data producer is, for example, a
 Retry system that is both automatic and adjustable
web server that delivers log data to a delivery stream.
 In order to achieve higher packing efficiency and You can also set up your Kinesis Data Firehose delivery
better performance, additional processing delays stream to read data from an existing Kinesis data stream
may occur and put it into destinations automatically.
 Java wrapper
4. Buffer Size & Buffer Interval:
Features of Kinesis API:
 Low-level API calls (PutRecords and GetRecords) Before sending data from Kinesis to destinations,
 Stream creations, re-sharding, and putting and Firehose buffers incoming streaming data to a specific
getting records are manually handled size or for a certain amount of time. Buffer size is
 No delays in processing measured in megabytes, while buffer interval is
 Any AWS SDK measured in seconds.

Kinesis Data Firehose Redshift Destination


Introduction The next destination is Redshift. With Redshift, you have
Amazon Kinesis Data Firehose is a fully managed service our data producers, and you can send the data through
that delivers real-time streaming data to Amazon S3, Kinesis Data Firehose. The data will always go to S3 first
Amazon Redshift, Amazon OpenSearch Service (Amazon and then to Redshift. Hence, what happens is that the
ES), Splunk, and any custom HTTP endpoint or HTTP streaming data is delivered to an S3 bucket first, and then
endpoints owned by supported third-party service Firehose automatically issues the copy command to load
providers like Datadog, Dynatrace, LogicMonitor, the data from S3 to Redshift. Anytime you load data into
MongoDB, New Relic, and Sumo Log. Kinesis Data Redshift, it first gets loaded into S3, and then it issues a
Firehose, Kinesis Data Streams, Kinesis Video Streams, copy command to load that data from S3 to Redshift. You
and Amazon Kinesis Data Analytics are part of the Kinesis can do the same thing by intercepting records,
streaming data platform. You do not need to build transforming them using AWS Lambda, loading them
applications or manage resources using Kinesis Data onto the S3 bucket, and then eventually using the copy
Firehose. You set up your data producers to transmit command to load it onto Redshift. You can also intercept
data to Kinesis Data Firehose, and the data is delivered here and transform the records as well. Hence, you have
automatically to the destination you specify. Kinesis Data a few different options to transform our records before
Firehose may also be configured to alter data before you load them onto Redshift. You can always have a
sending it. backup bucket if you want to and store off that data
before you load onto Redshift or before you transform
AWS Kinesis Firehose Key Concepts those records. As a result, Redshift just knows that it is
Understanding the following ideas will help you get the data warehousing solution that AWS offers.
started with Kinesis Data Firehose:
Elasticsearch Destination
1. Kinesis Data Firehose Delivery Stream: Similar to the other destinations, you have our data
Kinesis Data Firehose is an underlying entity. You utilize producers that go through Kinesis Data Firehose. You can
Kinesis Data Firehose by first generating a delivery then load it into Elasticsearch and transform the records
stream and then delivering data to it. before they load onto our Elasticsearch cluster. Similarly,
you can load the data before converting it onto
2. Record: Elasticsearch into a backup bucket in the S3 bucket.
Your data producer provides the relevant data to a
Kinesis Data Firehose delivery stream. A record can be up
to 1,000 KB in size.
Splunk Destination
The last destination is Splunk instances. Splunk is a way to aggregate your log files from servers or applications to have a single place where you can have all your log files aggregated. It is important to know that it is a destination for Firehose. Hence, just like other destinations, you have our data producers. It goes through Kinesis Data Firehose, where you can load the data off into our Splunk instances and transform the records as well. Similarly, you can load that data into an S3 bucket as a backup before you transform it or load it into our Splunk instances.

Buffer Size & Buffer Interval
You have all sorts of data producers. It might be clickstream data, gaming data, IoT data, or any streaming data. You can feed that through Kinesis Data Firehose. You have data coming in aggregated, which is then finally shipped off to our destination. Hence, Amazon Kinesis Firehose buffers the incoming streaming data to a certain size for a certain period before delivering it to a destination. The buffer size for S3 as a destination spans from 1 to 128 megabytes. In contrast, Elasticsearch ranges from 1 megabyte to 100 megabytes.

Kinesis Video Streams
Introduction
Amazon Kinesis Video Streams is a fully managed Amazon Web Services (AWS) solution for streaming live videos from devices to the AWS Cloud. Develop applications for real-time video processing and batch-oriented video analytics.

Kinesis Video Streams offers more than simply video data storage. You may use it to see your video feeds as they arrive in the cloud in real-time. You may either watch your live streams via the AWS Management Console or create your monitoring application that displays live video using the Kinesis Video Streams API library.

Producer & Consumer Applications
CCTV monitoring streams video into the cloud using an Amazon cloud cam, which is an opt-in service that you can subscribe to. This is an example where videos are being streamed every day. You have a camera that streams through your wifi and collects any videos you want, for either surveillance or in front of your own home. This camera has some software installed onto it that pushes Video Streams into the cloud. Since Amazon builds it, it is all captured by Amazon, and then it has to use Kinesis Video Streams.

Real-time vs. Batch-oriented
There are many different services that you can use to analyze our data, either in real-time or in a batch-oriented process. You can use Amazon EC2 instances; you can hook it into Amazon Rekognition that allows us to connect in machine learning services for our video stream. You can also hook it into several other AWS services and even connect it to other third-party services. You can use any of these services to analyze, process, and consume our video applications.

Kinesis Video Stream Benefits
The following are some of the advantages of using Kinesis Video Streams:

1. Connect & Stream from Millions of Devices:
Kinesis Video Streams links and streams video, audio, and other data from millions of devices, such as cellphones, drones, and dash cams. Using the Kinesis Video Streams producer libraries, you may set up your devices to broadcast in real-time or as after-the-fact media uploads.

2. Durably Store, Encrypt, & Index Data:
You may set your Kinesis video stream to save media data indefinitely or for specified retention periods. Kinesis Video Streams additionally creates an index over the recorded data based on timestamps supplied by the service producer. Using the time index in your applications, you may obtain specified data in a stream.

3. Focus on Managing Applications Instead of Infrastructure:
Because Kinesis Video Streams is server-less, no infrastructure is required to set up or administer. You do not have to worry about the underlying infrastructure's deployment, setup, or elastic scalability as your data streams and the number of consuming applications increase and decline. Kinesis Video Streams provides all of the administration and maintenance required to automatically maintain streams, allowing you to concentrate on the applications rather than the infrastructure.
4. Build Real-Time & Batch Applications on Data Streams:
Kinesis Video Streams may be used to construct bespoke real-time applications that work on live data streams and batch or ad hoc applications that function on durably persistent data. It does not have tight latency constraints, so you may use open source (Apache MXNet, OpenCV), homemade, or third-party solutions from the AWS Marketplace to create, deploy, and manage bespoke applications to process and analyze data streams. Kinesis Video Streams Get APIs allow you to create several concurrent applications that can handle real-time or batch mode data.

5. Stream Data More Securely:
Kinesis Video Streams encrypts all data as it passes through the service and as it is saved. Kinesis Video Streams uses AWS Key Management Service (AWS KMS) to encrypt all data at rest and enforces Transport Layer Security (TLS) on data streaming from devices. AWS Identity and Access Management (IAM) may also be used to manage data access.

Kinesis Data Analytics
Introduction
You may use standard SQL to handle and analyze streaming data using Amazon Kinesis Data Analytics for SQL Applications. The service lets you quickly develop and run sophisticated SQL code against streaming sources to do time-series analytics, feed real-time dashboards, and generate real-time metrics.

To get started with Kinesis Data Analytics, develop a Kinesis data analytics application that constantly reads and analyses streaming data. Data may be ingested via Amazon Kinesis Data Streams and Amazon Kinesis Data Firehose streaming sources. Then, using the interactive editor, you can write your SQL code and test it with live streaming data. You may also choose where you want Kinesis Data Analytics to deliver the results. Amazon Kinesis Data Firehose (Amazon S3, Amazon Redshift, Amazon OpenSearch Service, and Splunk), AWS Lambda, and Amazon Kinesis Data Streams are supported as destinations by Kinesis Data Analytics.

AWS Kinesis Data Analytics Benefits
You can create SQL code that reads, analyzes, and stores data in real-time using Amazon Kinesis Data Analytics. You may build applications that convert and give insights into your data using conventional SQL queries on streaming data. The following are some use-case examples for Kinesis Data Analytics:

1. Generate Time-Series Analytics:
Metrics may be calculated over periods and then sent to Amazon S3 or Amazon Redshift through a Kinesis data delivery stream.

2. Feed Real-Time Dashboards:
You may feed real-time dashboards with aggregated and processed streaming data findings.

3. Create Real-Time Metrics:
Custom metrics and triggers may be created for usage in real-time monitoring, alerts, and alarms.

Amazon Managed Streaming for Kafka
Apache Kafka
Apache Kafka is a distributed data store optimized for real-time data ingestion and processing. Streaming data is constantly created by hundreds of data sources, which often transmit data records simultaneously. A streaming platform must manage this continual input of data while still processing it sequentially and progressively. Apache Kafka was originally developed by LinkedIn and was made open source in 2011. The Apache community then took it over, and it is now a distributed streaming platform with three key capabilities.

Kafka offers its consumers the following three primary functions:
 Streams of records can be published and subscribed to.
 Streams of records can be effectively stored in the sequence in which they were created.
 Real-time processing of record streams.

Apache Kafka Management
Apache Kafka is an open-source piece of software. You can download it and install it onto a server or an EC2 instance. However, if you decide to go down that path, you have to manage all of these Apache Kafka servers yourself. You have to ensure that the cluster is up to date and make sure that it is auto-scaling as well, so that it scales up to demand and scales back down whenever the streaming data spikes decrease. This is where MSK, the Managed Streaming service for Kafka, comes to the rescue.
Amazon MSK
Amazon Managed Streaming for Apache Kafka (Amazon
MSK) is a completely managed service that allows you to
design and run applications that handle streaming data
using Apache Kafka. Amazon MSK takes control-plane
actions, including building, updating, and removing
clusters. It enables the usage of Apache Kafka data-plane
tasks, such as data production and data consumption. It
runs Apache Kafka open-source versions and implies that
existing applications, tools, and plugins from partners
and the Apache Kafka community are supported without
application code modifications.
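To illustrate the kind of code a Kinesis Data Analytics for SQL application (described earlier in this chapter) runs, the hedged sketch below computes a per-minute average over a streaming source. The stream and column names are assumptions; the SQL would normally be pasted into the interactive editor or supplied as the application code when the application is created.

# A minimal Kinesis Data Analytics for SQL application body (assumed schema).
KDA_APPLICATION_CODE = """
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (ticker VARCHAR(4), avg_price DOUBLE);

CREATE OR REPLACE PUMP "STREAM_PUMP" AS
  INSERT INTO "DESTINATION_SQL_STREAM"
  SELECT STREAM ticker, AVG(price) AS avg_price
  FROM "SOURCE_SQL_STREAM_001"
  GROUP BY ticker,
           STEP("SOURCE_SQL_STREAM_001".ROWTIME BY INTERVAL '60' SECOND);
"""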
Chapter 05: Data Collection and Getting Data into AWS

Introduction
This chapter is concerned with collecting data on AWS and deciding the best ways to ingest your data into AWS. You will also learn how to collect data in AWS via a dedicated network or to use hardware appliances and transfer databases using the Database Migration Service (DMS). You will also learn how to use the Amazon Kinesis family of services and when each one is most useful. For data interpretation and the discovery of our data, we can use QuickSight.

Data Loses Value Quickly Over Time
From the graph, data loses value quickly over time. On the left-hand side, our streaming data or our preventative, predictive, and actionable data are where our time-critical decisions can be made. If we want fast, actionable data, then that is where our streaming data lies. Further along, we want to build tools around batch-oriented processes, build out ETL pipelines, or possibly build business intelligence tools around our data.

Direct Connect, Snowball, Snowball Edge, Snowmobile
AWS General Rule Of Thumb
AWS has a general rule of thumb. This graph lays out your network connection speed, the amount of data, and whether to use a managed or unmanaged service from AWS. The difference between an unmanaged and a managed service is that unmanaged uses the AWS CLI (the AWS command line) or the AWS console to transfer data into AWS. A managed service is the Direct Connect option, the Snowball family, and other tools that you can use to move data from one location to another or from on-premises into the AWS network.

Data Migration Service (Managed Services To Move Your Data To AWS)
Hybrid Cloud Storage
The hybrid cloud storage connects your on-premises applications that require low-latency access or need rapid data transfer to cloud storage.

Database Migration Service
1. Data Migration
Easily and securely migrate widely used commercial and open-source databases and data warehouses into the cloud.

2. Replication
Easily replicate your databases and data warehouses between two locations.

3. Fully Operational
Databases stay fully operational during the migration, minimizing downtime for the applications using them.

Online Data Transfer
It makes it simple and easy to transfer your data into and out of AWS via online methods. AWS DataSync allows you to automate moving data between your on-premises storage into AWS, into S3, EFS, or FSx for Windows File Server. You can use AWS DataSync to transfer data at speeds up to 10 times faster than some of the other open-source tools that you can use. You can use DataSync for one-time data migrations or have recurring data processing workflows. You can also automate the replication and the data protection and recovery of your data. It is a great tool to set up automation for moving data between various locations into AWS. The AWS Transfer Family allows you to transfer directly into and out of S3 using FTP, SFTP, or FTPS. S3 Transfer Acceleration allows you to maximize your available bandwidth regardless of the distance from your customers into S3. Kinesis Data Firehose is a simple way to load streaming data into AWS.

Offline Data Transfer – The Snow Family
This introduces the Snow family, which makes it simple to get your data into and out of AWS via offline methods.

Snowcone
You can load data to Snowcone through Wi-Fi or wired 10 GbE networking. You can then ship the device with data to AWS for offline data transfer.

Snowball
Used for petabyte-scale data transport with import and export to S3.

Snowmobile
An exabyte-scale data transport solution that uses a secure 40-foot shipping container pulled by a semi-trailer truck to transfer large amounts of data into and out of AWS.

Snowball Edge
It is local storage and large-scale data transfer. It also offers local Lambda and EC2 instance compute, and AWS IoT Greengrass.

1. Cross-Region Replication
It gives you the ability to create cross-region replications of your database for applications running in other regions.

2. Offload Analytics
You can replicate data to the cloud and run analytics on your cloud databases rather than the original database that users interact with.

3. Keeping Data in Sync
Sometimes you need to keep your data in sync between testing, staging, and production environments.
DMS Use Cases
1. Migrate Applications
Migrate business-critical databases, move from Classic to VPC, or move a costly, license-driven data warehouse to Redshift.

2. Upgrade
With DMS, you can upgrade versions of your database software easily with no downtime.

3. Archive Old Data
You can migrate historical data to a more cost-efficient storage solution while still retaining access to it.

4. Migrate Datastores
You can migrate from NoSQL to SQL, SQL to NoSQL, and SQL to SQL.

Mass Amount Of Data
Use a Snowball device to:
 Store 80 TB storage, 10 GB network
 User interface similar to S3
 All data is encrypted end-to-end

Replication
Leverages "change data capture," which pulls just the changes from the source and delivers them to the destination.

Data Pipeline
Data Pipeline automates the movement and transformation of your data. It helps you process and move data between AWS compute and storage services and on-premises data sources. You can create an ETL workflow to automate the processing and movement of data at scheduled intervals.

Key Concepts
A Data Pipeline is a container that consists of four parts: data nodes, activities, preconditions, and schedules. Data nodes define where the data is coming from; this can be DynamoDB, S3, an RDS instance, some on-premises database, or Redshift. Activities are a way for the components of the Pipeline to define work to perform. Preconditions are conditional statements that must be true before the activity is run. In schedules, we set up when we want the Data Pipeline to run. AWS provisions and terminates the EC2 instances or the EMR clusters that transform and process the data from one source to another. When the work is finished, the compute terminates automatically.

Data Pipeline for On-premises
We can run Data Pipeline for on-premises by installing the task runner on a server in our local network. We can access the local database securely and poll the Data Pipeline for the next task to run. When it recognizes a task to run, it runs that task on the task runner installed on-premises.
Lambda, API Gateway, and CloudFront
Definitions
Lambda
An event-driven service that allows you to run your code in AWS without managing infrastructure.

API Gateway
A serverless API that can be used to create RESTful, HTTP, and WebSocket APIs.

CloudFront
A content delivery network that allows you to deliver data, videos, applications, and APIs with low latency.

Serverless Architectures
These services in action can trigger things like Lambda functions. You can have an application that runs on CloudFront that gets the data from S3 or is served out through S3 via edge locations. It can trigger API Gateway endpoints that, in turn, trigger Lambda functions. These Lambda functions can communicate with Cognito. You can use Cognito for the login and authentication services. In addition, you can communicate back and forth to S3 and the same with DynamoDB. These architectures are pretty much endless to trigger endpoints through API Gateway, communicate with Lambda functions, do things like login users, get data back and forth from S3, and create data stores within DynamoDB.
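As a minimal sketch of the pattern described above, the Lambda handler below could sit behind an API Gateway endpoint and write an item to a DynamoDB table; the table name and event shape are assumptions, not part of the book's example.

import json
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("user-events")   # placeholder table name


def lambda_handler(event, context):
    # API Gateway (proxy integration) delivers the request body as a JSON string.
    body = json.loads(event.get("body") or "{}")
    table.put_item(Item={"user_id": body.get("user_id", "anonymous"),
                         "action": body.get("action", "unknown")})
    return {"statusCode": 200, "body": json.dumps({"stored": True})}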
Chapter 06: Amazon Elastic Map Reduce

Introduction reduces cluster shares the data with S3. We also can
Elastic Map Reduce or EMR plays a huge role in data store it on the local file system. This can be an instance
analytics, processing, and big data frameworks. We can store or on EBS volumes.
use the EMR architecture and the Hadoop framework to
EMR Architecture
process and analyze massive amounts of data. Log
analysis, web indexing, data warehousing, machine Introduction
learning (ML), financial analysis, scientific modeling, and The entire cluster is spun up in a single availability zone.
bioinformatics all use Amazon EMR for data analysis. It Every single EMR cluster has a primary node, or it can
also supports Apache Spark, Apache Hive, Presto, and have three primary nodes. Hence, it is either a single
Apache HBase workloads, connecting with Hive and Pig, primary node or three primary nodes. This primary node
free source Hadoop data warehousing technologies. Pig manages all of the components in the distributed
provides a high-level interface for scripting Map-Reduce applications. When a job needs to be submitted or some
tasks in Hadoop, whereas Hive utilizes queries and processing or Map reduce tasks, the core nodes come
analyses. This service is expensive, and it uses a lot of into play. The primary node manages these core nodes.
computing power, but it gives you the ability to process The last part of the EMR architecture is our task nodes.
huge amounts of data in a short amount of time. Task nodes are optional. They add power to perform
parallel, computational tasks on the data, and they help
Apache Hadoop and EMR Software the core nodes.
Collection Primary Node Features
Map Reduce
Map-reduce is a technique that data scientists can use to  Single or Multi-Primary Nodes
distribute workloads across many different computing Whenever you launch a cluster, you will have the option
nodes to process other data and get the information to choose between one primary node and three primary
back quicker than just on a single node. nodes. You will only have a single primary node most of
the time, but now you can also have multiple primary
Hadoop Distributed File System (HDFS)
nodes. You would have numerous primary nodes
Hadoop Distributed File System is open-source software
because you do not have a single point of failure.
that allows you to operate a distributed file system over
Therefore, if one master node fails, the cluster uses the
several computers to tackle challenges requiring large
other two master nodes to run without interruptions.
amounts of data. HDFS is meant to run on low-cost
EMR automatically replaces the primary node and
hardware and is extremely fault-tolerant. HDFS is a file
provisions it with any configurations or bootstrap actions
system that allows high-throughput access to application
that need to happen. Hence, all it does is remediate that
data and is well suited to applications with huge data
single point of failure.
collections. The problem with setting up an HDFS cluster
so it requires a lot of maintenance and management.  Manages the Cluster Resources
This is where Elastic Map Reduce comes in.
The primary node also manages the cluster resources. It
EMR coordinates the distribution of the parallel execution for
Elastic Map reduce is a fully managed AWS service that the different Map reduce tasks.
allows you to spin up Hadoop ecosystems. Not only can
you store data on HDFS, but you have some other
storage options as well. We also have the EMR file
system or EMRFS. This means that the Elastic Map
 Tracks and Directs HDFS  Added and Removed from running clusters
The primary node also tracks and directs the HDFS. The Task nodes can be added and removed from the core
primary node knows how to lookup files and track data nodes to ramp up extra CPU or memory for compute-
on the core nodes. intensive tasks.

 YARN Resource Management Single Availability Zone Concept


The EMR clusters only reside in a single availability zone.
The primary node is also responsible for the YARN
The main reason behind the single availability zone
resource management. EMR uses YARN (Yet Another
concept so the nodes in the cluster can communicate
Resource Negotiator) to manage cluster resources for
faster. It means that they do not have to traverse as
multiple data-processing frameworks.
much internet or the AWS backbone. They are closer
 Monitors Core and Task Nodes Health together, and they are in the same availability zone. It
means block replication can happen more quickly.
The primary node tracks the status of jobs submitted to
Hence, you can find your files faster when finding them
the cluster and monitors the health of the core and task
on HDFS. In addition, the communication between nodes
nodes.
happens faster. Hence, whenever core nodes are
Core Node Features processing back and forth, they can communicate faster.
In addition, the access to metadata and the ease of
 Run Tasks for the Primary Node
launching a replacement cluster is faster if one of the
The primary node manages core nodes and runs Hadoop nodes goes down or has some overload.
Map reduce tasks, Hive Scripts, and Spark executors.
Long-Running Clusters
 Coordinates Data Storage If you set up the cluster to continue operating after
processing is completed, the type of cluster is known as
The core node is also responsible for coordinating data
a long-running cluster. Long-running allows you to
storage. The core nodes know how and where to store
communicate with the cluster after it has completed its
the data. This data is stored on HDFS or EMRFS. The
operations, but it requires manual shutdown.
DataNode daemons run on the core node.
Considerations
 Multiple Core Nodes, Only One Core Instance Group
We can have multiple core nodes but only one core
 The total number of EMR processing hours per
instance group. These multiple core nodes are made up
day is less than 24, and you can benefit from
of multiple EC2 instances. This makeup the instance in
shutting down your cluster when it is not being
group or fleet from.
used.
Task Node Features  You are not using HDFS as your primary data
 Optional Helpers storage (instead, you are using EMRFS with S3).
 Your job processing is intensive, iterative data
Task nodes are optional and can add power to perform
processing.
parallel computation tasks on data like Map reduce tasks
and Spark executor. Long-running Cluster

 No HDFS or DataNode Daemon  You frequently run processing jobs where it is


beneficial to keep the cluster running after the
Task nodes do not store data in HDFS. It is not used as a
data store and does not run the Data Node daemon. previous position.
 Your processing jobs have an input-output
dependency on one another.
 It is more cost-effective to store your data on EMR Operations - On-Demand and Spot
HDFS instead of S3. Instances
 You have a requirement for higher performance On-Demand Instances
I/O HDFS provides.
You pay for computing capacity by the second with On-
EMR Operations - Choosing an Demand Instances, and there are no long-term
Instance Type obligations. You have complete control over its lifespan,
Whenever we provision an EMR cluster, the instance size deciding when to start, restart, hibernate, or terminate
of our nodes is important because we might have it.
workloads that are CPU intensive. We might have some When you buy On-Demand Instances, you do not have to
that are input-output or memory intensive. commit to anything long-term. You pay for the time your
You can choose many different instances. Whenever we On-Demand Instances are operating. A running On-
choose an instance type, we prefer it either for the Demand Instance's pricing per second is fixed.
primary node or for the core of task nodes. We can bunch Spot Instances
core and task nodes together because the primary node
will not be a super compute-intensive machine like the A Spot Instance is a virtual machine that runs on spare
core and task nodes will be. EC2 capacity accessible at a lower price than the On-
Demand price. Spot Instances allow you to request new
Choosing an Instance Type EC2 instances at great discounts, allowing you to reduce
Primary Nodes your Amazon EC2 charges dramatically.
 Primary Nodes does not have large If you can be flexible about when your applications run
computational requirements. and if your applications is interrupted, Spot Instances are
 For clusters with 50 or fewer nodes, you can use a cost-effective option. Spot Instances are ideal for data
the M5 family. analysis, batch processes, background processing, and
 For clusters with greater than 50 nodes, you can optional activities.
use the M4 family.
CloudWatch Metrics
Core and Tasks Nodes We will discuss some metrics that we might want to
follow to scale up our cluster or add more instances or
 Depends on the type of processing. need to scale down our cluster.
 For general purposes, the balance of CPU, disk
space, and I/O, you can use the M5 family.  Tracking Cluster Progress
 For Batch Processing, HPC, or CPU-based If we want to track the progress of the cluster, we can
machine learning, you can use C4, C5, and Z1d use these metrics:
families.
1. RunningMapTasks
 For Graphics processing or GPU-based machine
2. RemainingMapTasks
learning, you can use G3, P2, and P3 families.
3. RunningReduceTasks
 For spark applications (in-memory caching), you
4. RemainingReduceTasks
can use R4 and R5 families.
 For Large HDFS and Map reduce jobs requiring It will help us determine the number of maps, reduce
high I/O performance and high IOPS, you can use tasks currently running on the cluster, as well as the
H1, I3, and D2 families. number of maps, and reduce tasks remaining for a
particular job.

 Detecting Idle Cluster


We could also track a Cloud Watch metric that helps us EMR File Storage and Compression
determine if one of our clusters is idle. It means we are How Hadoop Splits Files?
being charged, and the EMR cluster is not even doing any Hadoop splits large files into multiple chunks of smaller
work. That means life is costing us money, but it is not sizes. After breaking the files, a single map task processes
running any task. each part. The HDFS framework has already separated
 No More Storage for HDFS the data files into various blocks using Hadoop as the
underlying data storage. Since our data is already
Another important Cloud Watch metric so if there is no fragmented, Hadoop uses HDFS data blocks to assign a
more storage for HDFS, we can monitor the HDFS single map task to each of the HDFS blocks. Hence,
utilization metric, which is the percentage of disk space whenever we use Hadoop on our EMR cluster, Hadoop
currently being used. We can trigger an event that fires either splits the files or stores the files in HDFS or has the
once a high rate of power is used; let's say 80% of the file stored in S3. If stored in HDFS, the files are
capacity is used. We will fire a trigger that sends an email automatically divided into chunks. If they are stored into
informing us that we are running out of space in HDFS. S3, files are split into multiple HTTP range requests.
You need to add more instances, you need to add more Whether using HDFS or S3, the compression algorithm
EBS volumes, or you need to do something to make sure needs to have splitting available.
you have enough room for all of your HDFS data, as well
as replications. The Benefits of File Compression
 Better Performance: You get better performance
Monitor a Cluster with UI when less data is transferred between S3, mappers,
Not only can we monitor our cluster with CloudWatch
and reducers.
metrics, but we can also monitor our cluster with an
 Less Network Traffic: It also gives us less network
actual user interface. If you go into the EMR
traffic between S3 and EMR since you share fewer
documentation, you can see all of the links for these
various UI components that come installed with an EMR data.
cluster.  Reduced Storage Costs: Smaller compressed files
take up less storage, so you end up paying less for
Resizing a Cluster – Auto Scaling storage.
EMR-managed Scaling
With EMR managed scaling, we can automatically File Sizes Best Practices
increase and decrease the number of instances in the According to AWS and the EMR best practices, we can
core and task nodes based on your workload. Master look at the algorithms we are using, whether they are
nodes do not scale, though. You set a minimum and splitable or not, and then determine the file sizes.
maximum limit for the number of instances in your Hadoop will assign a single mapper to process our data if
cluster nodes. You create a custom auto-scaling policy. the compression type does not allow for splitting. That
means that a single thread is responsible for fetching that
Whenever we create our EMR cluster within the console, data from S3. Since a single line is limited to how much
we can select cluster scaling, and then we can use an information it can pull from S3 at any given time, this is
EMR manager to scale or create a custom auto-scaling the throughput. The process of reading the entire file
policy. from S3 into a mapper becomes a bottleneck for our
data. Hence, the best practice is to have a file size of one
If we select EMR managed to scale, we set the minimum,
to two gigabytes for algorithms that do not allow
the maximum, the on-demand limit, and the full core
splitting. If splitting is available.
nodes we want. Depending on our workload, it will
automatically decrease and increase the number of S3DistCp Command
instances. S3DistCP is an extension of DistCP. It is optimized to work
with the AWS S3 service. The Apache framework creates
DistCP, while AWS makes the S3DistCP command. It has
been optimized for your EMR and S3 workloads.

 It allows you to copy files within a cluster or from one


cluster to another or S3 into your HDFS cluster.
 It also allows you to combine smaller files into larger
files. It can help you copy data between S3 buckets
or S3 to HDFS or HDFS to S3.
 Either you can run S3DistCP by using a step within
your EMR cluster, or you can run it on the primary
node. It allows you to copy data and combine many
small files into fewer larger files. We can either run
this on the primary node's command line-shell or
create a new step in the existing EMR clusters.
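One way to run S3DistCp is to submit it as a step on a running cluster. The hedged boto3 sketch below (the cluster ID, buckets, and options are placeholders) copies small S3 objects into HDFS while grouping them into larger files.

import boto3

emr = boto3.client("emr")

emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",   # placeholder cluster ID
    Steps=[{
        "Name": "Combine small files with S3DistCp",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "s3-dist-cp",
                "--src", "s3://my-raw-bucket/logs/",        # placeholder source
                "--dest", "hdfs:///data/logs-combined/",
                "--groupBy", ".*(\\d{4}-\\d{2}-\\d{2}).*",  # group output by date in the key
                "--targetSize", "1024",                     # aim for roughly 1 GB output files
            ],
        },
    }],
)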
Chapter 07: Using Redshift

Introduction worker nodes will go to their Slices, and then they will
Redshift is a data warehousing service. It can warehouse each have a piece of data that they need to return, which
the data at the petabyte scale, which means Redshift can will come back, and the leader will combine that data
store large data. It can also index and query data so that into our query response, and that query response will go
it remains usable. We can store petabytes and hexabytes back to our end user. That is the Redshift query process
of data in S3. at a very high level.

Redshift Architecture Redshift in the AWS Service


Cluster Ecosystem
Within the cluster, we have either leader nodes or With Redshift, we are in the active utilization area. We
worker nodes. The leader node manages the schema. It have end users on the left and a businessperson on the
contains the data warehouse metadata and performs all right that needs to perform some analytics on the
the query planning and script generation. The worker application. End-users collect authentication data, and
nodes perform query execution and slice management they will then log in to our service. An email goes back to
before storing all the data within the Slices. the end-user and logs something in a database. There
might be some data that we want to make sure makes it
Node through into our storage or otherwise into our
These are EC2 instances; there are three types to choose application, so we will put it into a data stream, which
from for the most part. Node is the individual compute will be processed out by another compute resource that
resources with storage attached for the Redshift cluster. computes resource may log something into Redshift. It
The storage attached to these nodes is faster. These are may also store something in our database. It will
generally used in cases where we need to perform generate a return that goes back to the end-user.
queries quickly. We want near real-time analytics.
Redshift Use Cases
Slice
Data Warehouse VS Data Lake
A slice is a group of configurable entities kept in a
The dissimilarity between a data warehouse and a data
reusable asset as a single unit. Slices are useful for
grouping entities and other slices for reuse. Prefabs and lake is more structured. The data in it will be cataloged,
slices are similar, but slices are part of the new and the access speed for retrieving data from a data
component entity structure. Prefabs cannot contain warehouse is much faster than it would be from a data
lake. It is primarily because of the lack of structure and
component entities, although slices can.
cataloging that can be seen in a data lake.
Redshift Query Process
A cluster has a leader node and several worker nodes. If What makes Redshift Different?
It is, in most cases, much faster if optimizing to the same
there is a single node cluster, the leader and worker will
be the same node. It will separate the leader into a Slice, level. It is much more scalable, so we can have a small
and we will have our user that sends a query to the warehouse and easily scale it up to a gigantic warehouse.
leader node. The leader node generates a query plan, A few AWS service integrations make some of the paths
that we would perform with the data warehouse
which lets it create execution scripts to complete that
considerably easier than they would be if we had to
query because it knows where all of the data is stored in
the cluster. Those execution scripts will go to worker integrate with other systems in an ad hoc manner.
nodes. The Slices themselves do not store any table data.
They have data, and they know where that data is. The
Redshift Table Design 3. All – All slices store a copy of table data.
Columnar databases let us access columns of our data Constraints
more efficiently. The column table is arranged to set a 1. Primary, Foreign Key: Used by the query planner
columnar data file. To get the node size from our row
as a hint about relations.
database, we access every row in our table, whereas we
need to access a single row with the columnar database. 2. Unique: Not enforced used by query planner as
It has significant implications for OLAP or OLAP a hint.
transactions where we need to make comparisons or
3. Not Null: Not enforced or respected.
calculations from large amounts of single-column data.
Redshift Spectrum
Data Types
There are several numeric data types, and they have How do you query flow?
aliases. You can maintain compatibility between Redshift Spectrum is an interface to create foreign tables
Postgres and Redshift, for instance, because Redshift's in our Redshift cluster from stored data in S3. It uses the
interface is Postgres compatible. Redshift only has a external keyword when creating schemas and tables in
reduced set, but it contains aliases to map them our collection. It can read and write to Spectrum tables.
appropriately. There are few signed integer data types It does not support update and delete operations.
and few floating-point data types. Boolean is standard. It Access control can either IAM or use AWS Lake
is only true or false. Texts have character and variable Formation in query flow, giving more granular control
characters. The character is fixed length, whereas over tables. It is an external service that sends data
varchar or variable character is a variable length. In time requests through access control and then goes to the
data types, the date is the calendar date, which is the data store. The data store comprises a Glue data catalog
year, month, day, and the timestamp is the date and and Athena SQL interface connected to the underlying
time. The timestamp TZ includes the time zone. Time and data in S3. Once that query has been answered in
time TZ is the daytime or zone daytime had; this gives us Athena, it will return the data to Spectrum, which then
our set of data types. These will map incompatibility to passes it back to the Redshift cluster and is combined
the Postgres data types because Redshift is the primary with any other data involved in our query to provide a
compatible interface. Aliases are used to map to the query response cluster.
actual Redshift data types to maintain compatibility.

Compression
In Redshift, Postgres can compress individual columns,
which means different compression types are available
depending on the data type. Most of our number and
time data types will default to AZ64 compression, and
our character and variable character data types will
default to LZO compression and several other impression
encodings available. The other data types are default to
raw, Boolean, real, and double. If something is our sort
key, it will need to be raw; it is uncompressed because
the database engine frequently performs queries.

Distribution Styles
1. Even – Blocks are distributed evenly between
cluster slices (default).
2. Key – identical key values are stored on the same
Slice.
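Pulling the table design ideas together, a hedged DDL sketch (hypothetical table and column names) might choose a KEY distribution style and a sort key, submitted here through the Redshift Data API.

import boto3

redshift_data = boto3.client("redshift-data")

DDL = """
CREATE TABLE sales (
    sale_id     BIGINT,
    customer_id INTEGER,
    sale_date   DATE,
    amount      DECIMAL(12, 2)
)
DISTSTYLE KEY
DISTKEY (customer_id)   -- co-locate rows for the same customer on one slice
SORTKEY (sale_date);    -- keep date-range scans cheap
"""

redshift_data.execute_statement(
    ClusterIdentifier="my-redshift-cluster",   # placeholder
    Database="dev",
    DbUser="awsuser",                          # placeholder
    Sql=DDL,
)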
Chapter 08: Redshift Operation and Maintenance

Introduction Utilizing Vacuum and Deep Copy


Launching a Redshift Cluster The Vacuum Process
Interfaces The Vacuum process will first reclaim disk space to
There are several interfaces for launching a Redshift remove rows marked for deletion. . Then it will sort the
cluster. Examples of such include the AWS management table, which will make queries more efficient, and then
console, the AWS CLI, and various AWS SDKs. reindex the table to account for that new sorting so that
our query planner knows exactly where all of our rows
Required Parameters are on our table.
To use the AWS CLI, the parameters that need to provide
when we launch a Redshift cluster using the create Automatic Vacuuming
cluster command for the CLI is a node type for our Automatic Vacuum Delete
cluster. The cluster will have the same node type for  Automatically reclaim disk space.It is triggered by a
every node in our group. We cannot configure this per high percentage of rows marked for deletion.
node. If our cluster type is multi-node, we do not need to  Activity monitoring dictates the schedule.
provide the cluster type parameter but require several
nodes. If our cluster type is a single node, which is the Automatic Table Sort
other option for cluster type, we do not need to give the  A high percentage of unsorted Rows triggers it.
number of nodes parameter. We will need to provide a  Utilizes SCAN operations to identify unsorted tables.
user name and a master user password. A cluster Automatic Analysis
identifier must be provided to identify our cluster.  Automatically updates table statistics.
Resizing a Redshift Cluster  Waits for low activity periods to analyze jobs
Classic Resize  Utilizes table statistics age for triggering.
 Hours To Days Backup and Restore
Duration Impacting Factors: Snapshots
 Point in time backup of the whole cluster.
 Source cluster activity.
 It can be manually triggered.
 Size and number of tables.
 Scheduled automated snapshots.
 Uniformity of data distribution across nodes.
 Source and target node configuration. Restoring from Snapshot
It creates a new cluster, and then that snapshot data is
Elastic Resize loaded as queries request it. Hence, as the cluster needs
 Minutes data to fulfil a query, it will bump that data up in the
Constraints: queue of loading from S3 to can complete those queries
in a timely fashion. RDS does this for most of its engines
 It cannot be used on single-node clusters. as well. It is not as noticeable for Redshift because of how
 The cluster must be in a VPC. the Redshift query process works.
 The new configuration must have sufficient
storage.
Loading Data From S3  Total table count.
If we want to copy single tables in and out of a backup,  Health status
Redshift has very tight S3 integration. We can copy data
out of S3. The IAM role permits us to read out of that S3 Queries/Load
bucket we identified. Redshift will go and retrieve that  Query duration.
data and load it into our table. We need to provide some  Query throughput.
information about that data that we are pulling into our  Query duration per WLM queue.
bucket, but if it is formatted in a CSV or Parquet format,  Concurrency scaling activity.
it can infer that information from that data. We can also  Concurrency scaling usage.
use this command to copy Elastic MapReduce,  Average query time by priority.
DynamoDB, and SSH connections.
Usage limits
Unloading Data To S3  Usage limit for concurrency scaling.
We can use the unload command and provide a piece of  Usage limit for Redshift Spectrum.
SQL that specifies the data we want to unload. When we
give two directives, which tell the command where to CloudWatch
save that data, we need to provide an IAM role with In the CloudWatch, we have:
permissions to write into our target bucket and provide Cluster
a format. If we format it as Parquet, AWS calls this lake  Commit queue length.
to unload because Parquet is a common data lake
 Concurrency scaling sounds.
format. Parquet is very efficient. We are offloading a
 Database connections.
table to S3 because we will switch part of our analytics
workload to Athena. Athena works well with Parquet.  Health status.
We can dump out our entire Redshift cluster into S3 and  Maintenance mode etc.
then point Athena at that bucket, and we have access to Node
the same data without running the Redshift cluster. Once  CPU utilization.
all is specified, Redshift will generate a file and place it
 Read IOPS.
into our S3 bucket.
 Write IOPS etc.
Monitoring
Redshift Console
Redshift console has the following services that we can
monitor.

Cluster
 CPU Utilization.
 Maintenance mode.
Storage
 Percentage Disk Space used.
 Auto Vacuum Freed.
 Read throughput.
 Read latency.
 Write throughput.
 Write latency.
Database
 Database connections.
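To make the COPY and UNLOAD flows described earlier in this chapter concrete, the sketch below loads a table from S3 and unloads query results back to S3 as Parquet; the bucket, IAM role, and cluster identifiers are placeholders.

import boto3

redshift_data = boto3.client("redshift-data")

COPY_SQL = """
COPY sales
FROM 's3://my-analytics-bucket/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-s3-role'
FORMAT AS PARQUET;
"""

UNLOAD_SQL = """
UNLOAD ('SELECT * FROM sales WHERE sale_date >= ''2021-01-01''')
TO 's3://my-analytics-bucket/unload/sales_'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-s3-role'
FORMAT AS PARQUET;
"""

for statement in (COPY_SQL, UNLOAD_SQL):
    redshift_data.execute_statement(
        ClusterIdentifier="my-redshift-cluster",   # placeholder
        Database="dev",
        DbUser="awsuser",
        Sql=statement,
    )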
Chapter 09: AWS Glue, Athena, and QuickSight

Introduction You can also have databases outside of AWS, so as long


Construct Glue Crawler, perform SQL queries in Athena, as you can associate with them using a JDBC connection,
and create Visualization Charts with AWS QuickSight in you will access that data as a data source within AWS
this chapter. Glue.

Glue Data Catalog AWS Glue Data Catalog


What is AWS Glue? Persistent Metadata Store
You can store, annotate, and share metadata between
Serverless ETL Service
AWS services (similar to Apache Hive megastore).
Some important points about Serverless ETL services are:
Centralized Repository
 No server provisioning
There is only one data catalog per AWS region, providing
 AWS fully manages them a uniform repository so different systems can store and
 Extract Transform load find metadata to query and transform that data.
 Categorize, clean, and enrich your data
Provided Comprehensive Audit
 Move data between various data stores
You can stack schema changes and data access control.
AWS Glue - Use cases This helps ensure that its data is not inappropriately
 Query Data in S3 modified or inadvertently shared.

Consider an example; you have a massive data lake in S3 Glue Jobs


with customer feedback data. You can use AWS Glue to AWS Glue Jobs
crawl your S3 data lake to prepare tables that you can AWS Glue job performs the extract, transform, and load
then query using Athena to see your customer feedback. (ETL) work in AWS Glue. You have some input data;
consider the input data as the fabric. The Glue job is
 Joining Data for A Data Warehouse
where the actual work is being done; this is the cleaning
Consider an example; you have a Clickstream data lake in of your data, the transformation of your data, the
RDS and customer data in S3. You can use Redshift enrichment of your data, and the joining of your data.
Spectrum to Query your data or QuickSight to visualize This will be the actual physical labor and the tools that
the data. You can use AWS Glue to join and enrich your you use to create the final output. Output data onto
data and then load the results into Redshift. another data store or into some data lake or data
warehousing solution.
 Creating a Centralized Data Catalog
Output File Formats
Consider an example; you have different types of stored
If we are storing data in a relational database or a JDBC
in many other locations. You can use AWS Glue to
connection, the output file format would not matter.
manage the metadata and create a central repository.
However, if we are storing it into a file server or S3, we
You can then access the Data Catalog for ETL and
can have various output file formats as follows;
analytics with many other AWS services.
Output File Formats
AWS Glue Components
JSON*
The data can be in a plethora of different locations; it can
CSV*
be in S3, DynamoDB, RDS, Redshift, or a database on EC2.
ORC
Parquet
Avro
*optional compression (gzip, bzip2) Integrates With QuickSight
Create easy data visualizations by using Athena to
generate reports to explore data with BI tools of SQL
Glue Jobs Run In Isolated
clients connected with JDBC and ODBC drivers.
Glue Runs Jobs on Virtual Resources
All the resources needed to run ETL jobs are provisioned Integrates With AWS Glue
and managed in its isolated service account. Athena Integrates with AWS Glue Data Catalog, allowing
you to create tables and query data in Athena as well as
What a Glue Job needs? use the ETL and data discovery features of AWS Glue.
You provide output data sources and input data targets
in your VPC. In addition, you give the IAM role, VPC ID, Connecting to Data Sources
subnet ID, and security group that is needed to access Athena natively supports querying datasets and data
data sources and targets. sources that are registered with the Glue Data Catalog.
You can have a data connector using an external Hive
Traffic governed by Your VPC metastore to query datasets in Amazon S3. You can also
Traffic in, out, and within the spark environment is use a data connector for external Hive meta stores to
determined by your networking policies. The one query data in S3 that is using an Apache Hive metastore.
exception by your networking policies is calls made to You do not have to migrate your Hive metastore data to
the AWS Glue API. However, these can be audited the AWS Glue Data Catalog.
through CloudTrail.
When To Use Athena
Getting Started with Athena
S3 Select And Glacier Select
What Is Athena?  S3 Select – Use SQL statements to filter the
Athena helps to easily query your S3 data. You can use
contents of S3 objects and retrieve just the
standard SQL queries to analyze data directly in S3.
subset of data you need.
Athena is serverless, and you can only pay for the queries
 Glacier Select - Use SQL statements directly on
that you run. Athena automatically scales, so results are
fast, even with large datasets and complex questions. your data in S3 Glacier without restoring data to
a more frequently accessible tier.
Athena Federated Queries
Athena could only query data in S3, but since customers QuickSight Visualizations and
have data in other data sources, AWS created the ability Dashboards
to connect to external data sources using Athena Amazon QuickSight
federated queries. Athena uses data source connectors Amazon QuickSight helps to create visualization and
to run on AWS Lambda to run federated queries. A data dashboard with your data. Business intelligence tool
source connector is a section of code that can translate makes it easy to create visualization and dashboard from
between Athena and your target data source. With this your data. Simply point QuickSight at your input data
new feature, you can query data in places or build source to start creating visualizations.
pipelines to extract data from multiple data sources,
such as CloudWatch Metrics, DynamoDB, Elasticsearch, Visualization types
JDBC compliant data sources like Redshift and RDS, and  Bar Charts
store the query results in S3.  Combo Charts
 Donut Charts
Athena Data Formats And Integrations
 Gauge Charts (Maps)
Data Formats
 Heat Maps
Athena helps you analyze unstructured, semi-structured,
 Histograms
and structured data stored in S3. Examples include CSV,
TSV, JSON, Textfiles, Parquet, ORC, Snappy, Zlib, LZO, and  KPIs
GZIP.  Line Charts
 Pie Charts Identity And Access Management In QuickSight
 Scatter Plots After creating dashboards and visualizations, you want to
 Tree Maps give users access to that data or those visualizations, so
 Word Clouds you need to find a way to make sure that you can set up
your users within QuickSight. We have our QuickSight
QuickSight Dashboards dashboards; they sit on the public internet. We can set
Specify User Access up users either using IAM credentials or have QuickSight
When sharing a dashboard, you specify which user has only users with just email addresses, but these are
access to it. filtered through.

Dashboard Viewers vs. Owners


Dashboard viewers have limited access (viewing,
filtering, sorting). Owners can share the dashboard and
optionally edit and share the analysis.

Embedded In Website or Application


A shared dashboard can be embedded only in a website
or app if QuickSight is set up as an Enterprise edition.

QuickSight Security and


Authentication
QuickSight Data Encryption
Encryption at Rest (Enterprise edition only)
The metadata and data uploaded into SPICE are
encrypted with AWS-managed keys.

Encryption in Transit
QuickSight supports encryption of all data transfers using
SSL. This includes data to and from SPICE and from SPICE
to the user interface.

Key Management
AWS manages all the keys associated with QuickSight.
Database server certificates are the responsibility of the
customer.

Connecting To AWS Resources


We can connect to resources in AWS, for example,
services like Redshift, RDS, Aurora, and databases on
EC2. Make sure that all of these security mechanisms are
set up and configured properly to allow QuickSight
access to these resources. We can also connect
QuickSight into S3. We will need to make sure that IAM
roles, the bucket policy, and the manifest file are set up
properly, which is essentially just a metadata file that
allows QuickSight to connect the dots and better
understand the data schema S3.
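As a small end-to-end illustration of querying S3 data that a Glue crawler has cataloged, the sketch below runs an Athena query through boto3; the database, table, and output location are assumptions.

import boto3

athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString="""
        SELECT page, COUNT(*) AS views
        FROM clickstream_events          -- hypothetical table created by a Glue crawler
        WHERE event_date = DATE '2021-06-01'
        GROUP BY page
        ORDER BY views DESC
        LIMIT 10
    """,
    QueryExecutionContext={"Database": "analytics_db"},                 # placeholder
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # placeholder
)
print("Query execution id:", response["QueryExecutionId"])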
Chapter 10: ElasticSearch
The Interface
Introduction to Elasticsearch Elasticsearch uses a REST (Representational State
The Amazon Elasticsearch Service is a managed service
Transfer) API for its interface. JSON is a common format
that makes deploying, operating, and scaling
for REST APIs. With the standard HTTP methods, we can
Elasticsearch in the AWS Cloud simple. Elasticsearch is a
interact with Elasticsearch; assuming we have all the
prominent open-source search and analytics engine for
permissions open, we can send a GET to the base URL,
log analytics, real-time application monitoring, and click
index name, and provide a type and item ID. It will return
stream analytics, among other applications.
that item.
With Amazon Elasticsearch Service, you have direct
Loading Data
access to the Elasticsearch open-source API, allowing you
Anything that can interact with the API can send data
to reuse code and applications from your existing
into Elasticsearch. We will have to write the actual code
Elasticsearch setups. Kibana is incorporated into the
to send that data into Elasticsearch, but it is just
Amazon Elasticsearch Service, allowing you to easily
interacting with an API.
display and analyze your data.
Service Integration
Elasticsearch Service
Kinesis Data Firehose
The Amazon ElasticSearch service is a search domain that
Elasticsearch is a configurable target.
runs most of the ELK (ElasticSearch Logstash and Kibana)
stack. The service created with ElasticSearch is called Cloudwatch
search domains. They come pre-installed with CloudWatch Logs subscription can deliver to
ElasticSearch and Kibana Logstash; it is a fairly complex Elasticsearch Service.
system all on its own, but it does have tight integration
with ElasticSearch. IoT
IoT rules can send data to Elasticsearch Service.
Searching
ElasticSearch organizes data as indexes. The purpose of Visualizing Elasticsearch Data
ElasticSearch is to enable word searching. It is a search Visualization Tools Examples
engine utility. The way that searching works is called a Kibana
reverse index. For example, we take two documents,  Part of ELK stack
ElasticSearch will index each word in two documents.  Pre-installed on Elasticsearch Service (ES)
Essentially, this service creates an index and is able to domains
define some characteristics about it, and it puts every
 It does not require writing extra code
word in these two documents in all of the documents and
stores it into that index. Part of the metadata in that D3j3
index is which documents those words appear in and  They are easily embedded in external
exactly where they are within that document. We get applications
this mapping of all of the words in each document in the  Can visualize any JSON formatted data
store. This is very powerful because it can return those  Third-party connectors are available
documents very quickly, so we can search for words or
phrases and find them very quickly.
Chapter 11: AWS Security Services

Introduction Secrets Manager


In this chapter, we are going to discuss AWS security Secrets Manager is a great service that acts as a vault for
services. Security services do not fit in any of the steps in our passwords and API keys. You can use it to rotate
the data analytics steps for success, but we need to those security objects in several of our services. You can
secure our data. Data is the lifeblood that makes our manage who has access to those security objects
application work. It is the most valuable thing in any through IAM. Hence, you can say this IAM group can use
application. It is what allows us to generate revenue from this set of API keys or passwords so on and so forth.
our application. Without data, our application does
nothing. Identity Access Management
Overview
The AWS security services are Identity and Access Introduction
Management or IAM, VPC security features, the Key AWS Identity and Access Management (IAM) is a web
Management Service, and Secrets Manager. All of these service that provides secured control access to AWS
have some integration or functionality that allows us to resources such as compute, storage, database, and
secure the other services. application services in the AWS Cloud. IAM manages
authentication and authorization by controlling who is
1. Identity Access Management signed in and has permission to utilize the resources. IAM
Identity and Access Management, or IAM, allows us to uses access control concepts such as Users, Groups,
control access to other AWS services. Any API call that Roles, and Policies to control which users can access
you make is going to pass through Identity and Access specific services, the kinds of actions they can perform,
Management. Several of the services that you use that and which resources are available. The IAM service is
may not use IAM for their primary authentication on provided at no additional cost. However, your account
their primary interface. You can use IAM users or groups will be charged upon the usage of other AWS services by
to control access to RDS instances or Redshift clusters, or your users.
Elasticsearch service domains, and it also provides an IAM Concept
external identity federation. Hence, if you have an active
directory in an on-premises system, you can integrate With IAM, you have Amazon Web Services that is at its
that with IAM and allow Active Directory users to access core one big API. Hence, when you send a request to that
AWS services with their Active Directory credentials. API, it is then passed through IAM to check the
permissions. If you present credentials that permit us,
2. VPC-Security our request is sent to whatever service you are making
VPC security features are about network layer security an API call for. If you do not have permission, then you
hence security groups and network ACLs will allow us to get a response that tells us no.
control network traffic flow to and from our resources IAM Permission Objects
within our VPC.
Policies govern AWS access by establishing and applying
3. Key Management Service them to IAM identities (users, groups of users, or roles)
Key Management Service is another very important or AWS resources. A policy in AWS is an object that
service. It allows you to store your encryption keys as specifies the rights of identity or resource when it is
well as generate encryption keys in conjunction with IAM associated with it. AWS checks these policies when an
and CloudTrail. You can log and audit that key usage. IAM principal (user or role) submits a request. The
permissions granted by the policies determine whether
IAM Features

The IAM service is part of AWS's secure worldwide infrastructure. With IAM, you can create and manage users and groups, security credentials such as passwords and access keys, and permission policies that allow and deny access to AWS resources.

1. IAM Users

An IAM user is a unique identity with limited access to an AWS account and its resources, defined by its IAM permissions and policies. IAM users can represent a person, a system, or an application. IAM policies assigned to users must grant explicit permissions to services or resources before the users can view or use them. IAM lets you create individual users within your AWS account and give them their own username, password, and access keys. Individual users can then log into the console using a URL that is specific to their account. You can also create access keys for individual users to make programmatic calls to access AWS resources. You can provide user access to any or all of the AWS services linked with IAM, or you can use IAM in combination with external identity sources, such as Microsoft Active Directory, AWS Directory Service, or Login with Amazon.

2. IAM Groups

A group is a collection of IAM users. You may use groups to establish permissions for a set of users, making it easier to manage their access. For example, you may create a group named Admins and assign the rights that administrators generally require. Any user in that group has the default permissions granted to the group. Assume a new user joins your organization and requires administrator capabilities; in that instance, you may provide the necessary rights by adding the user to the relevant group. Similarly, suppose a user changes jobs within your firm; rather than modifying that user's permissions, you may remove them from the old groups and add them to the new ones.
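The user-and-group workflow above can be scripted. The sketch below assumes boto3 and uses hypothetical user, group, and managed-policy names; it is one possible illustration, not a prescribed procedure.

import boto3

iam = boto3.client("iam")

# Create an individual user and programmatic credentials for it.
iam.create_user(UserName="analyst1")
access_key = iam.create_access_key(UserName="analyst1")["AccessKey"]

# Create a group, attach a pre-defined managed policy to it, and add the
# user to the group so the user inherits the group's permissions.
iam.create_group(GroupName="Admins")
iam.attach_group_policy(
    GroupName="Admins",
    PolicyArn="arn:aws:iam::aws:policy/AdministratorAccess",
)
iam.add_user_to_group(GroupName="Admins", UserName="analyst1")

# The secret access key is only returned at creation time.
print(access_key["AccessKeyId"])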
3. IAM Roles

An IAM role is an IAM object that allows you to define a set of permissions for a user or service to access resources. However, the permissions are not connected to a specific IAM user or group. Instead, IAM users, mobile and EC2-based applications, and AWS services (such as Amazon EC2) can programmatically assume a role (see the sketch at the end of this IAM discussion). Assuming the role results in temporary security credentials that the user or application may use to make AWS programmatic requests. These temporary security credentials have a configurable expiration and are automatically rotated.

4. IAM Policies

An IAM policy is a rule or group of rules defining the actions that may or may not be performed on an AWS resource. Policies are used to grant permissions. When a policy is associated with an identity or resource, it defines the permissions for that identity or resource. When a user makes a request, AWS examines these policies, and the permissions in the policies decide whether the request is approved or rejected. AWS policies are maintained as JSON documents and can be either identity-based or resource-based. Permissions can be granted in several ways:

 Include a managed policy. AWS offers several pre-defined policies, such as AmazonS3ReadOnlyAccess
 Include an inline policy; an inline policy is a hand-written policy
 Add the user to a group with suitable permission policies
 Clone an existing IAM user's permissions

IAM Secured Services

There are many examples of services that are secured with IAM. Simple Storage Service (S3) is often controlled with IAM, but it also has its own security features: S3 predates the unified AWS API (it was the second service launched as part of Amazon Web Services), so bucket policies, object policies, and ACLs are artifacts from before the larger API was created. DynamoDB access is controlled entirely with IAM. Database Migration Service, Athena, Glue, Lake Formation, Kinesis, Elasticsearch, and QuickSight are also controlled with IAM, as are the various code services. More services than those mentioned here are secured with IAM; anything that is primarily accessed through an API interface will, for the most part, flow through IAM.
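The role behaviour described above can be sketched with AWS STS. In this hypothetical example the role ARN and session name are placeholders; the temporary credentials returned by the call expire on their own.

import boto3

sts = boto3.client("sts")

# Assume an existing role and receive temporary security credentials.
assumed = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/AnalyticsReadRole",
    RoleSessionName="exam-cram-demo",
    DurationSeconds=900,  # temporary credentials expire after 15 minutes
)
creds = assumed["Credentials"]

# Use the temporary credentials for subsequent API calls; every call still
# passes through IAM for authorization.
s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
print(s3.list_buckets()["Buckets"])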
Key Management System

Introduction

AWS Key Management Service (KMS) is a managed service that allows you to produce and govern the keys used in cryptographic operations. The service offers a highly available key generation, storage, administration, and auditing solution that allows you to encrypt or digitally sign data within your applications or govern data encryption across AWS services.

AWS KMS Keys

The fundamental resource in AWS KMS is the AWS KMS key (KMS key). A KMS key can be used to encrypt, decrypt, and re-encrypt data. It can also create data keys for usage outside of AWS KMS. Typically, symmetric KMS keys are used, although asymmetric KMS keys can be created and used for encryption or signing.

Customer Managed Keys

Customer-managed keys are KMS keys that you generate, own, and administer in your AWS account. You have full authority over these KMS keys, including the ability to create and manage key policies, IAM policies, and grants, activate and disable the keys, rotate their cryptographic material, add tags, create aliases for the KMS keys, and schedule the destruction of the KMS keys.

Customer-managed keys are displayed on the Customer managed keys page of the AWS Management Console for AWS KMS. To definitively identify a customer-managed key, use the DescribeKey operation: the value of the KeyManager field in the DescribeKey response for customer-managed keys is CUSTOMER.

Symmetric KMS Keys

When you generate an AWS KMS key, you are given a symmetric KMS key by default.

In AWS KMS, a symmetric KMS key is a 256-bit encryption key that never leaves AWS KMS unencrypted; you must call AWS KMS to use a symmetric KMS key. Symmetric keys are used in symmetric encryption, which employs the same key for encryption and decryption. Unless your workload requires asymmetric encryption, symmetric KMS keys are a good option because they never leave AWS KMS unprotected.

The AWS services that are integrated with AWS KMS secure your data with symmetric KMS keys. These services do not support encryption with asymmetric KMS keys.

Asymmetric KMS Keys

AWS KMS also allows you to generate asymmetric KMS keys. An asymmetric KMS key is a mathematically related pair of public and private keys. The private key never leaves AWS KMS unprotected; you must call AWS KMS to use the private key. The public key can be used within AWS KMS by calling the AWS KMS API operations, or it can be downloaded and used outside of AWS KMS. Multi-Region asymmetric KMS keys can also be generated.

It is possible to produce asymmetric KMS keys that represent RSA key pairs for public-key encryption or for signing and verification, or elliptic curve key pairs for signing and verification.

Data Keys

Data keys are encryption keys that may be used to encrypt data, especially large volumes of data. Unlike KMS keys, which cannot be downloaded, data keys are returned to you for use outside of AWS KMS.

When AWS KMS generates a data key, it provides you with a plaintext data key that you may use right away (optional), as well as an encrypted copy that you can store safely with the data. When you are ready to decrypt the data, ask AWS KMS to decrypt the encrypted data key.

AWS KMS produces, encrypts, and decrypts data keys. However, AWS KMS does not store, manage, or monitor your data keys, nor does it execute cryptographic operations with data keys. Data keys must be used and managed outside of AWS KMS.
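A brief boto3 sketch of the data-key flow follows. The key alias is a hypothetical placeholder, and the example only shows the GenerateDataKey/Decrypt round trip; the local encryption step is omitted here.

import boto3

kms = boto3.client("kms")

# Ask KMS for a data key under a KMS key. KMS returns the key twice:
# in plaintext (use it, then discard it) and encrypted under the KMS key
# (safe to persist alongside your ciphertext).
data_key = kms.generate_data_key(KeyId="alias/example-data-key", KeySpec="AES_256")
plaintext_key = data_key["Plaintext"]
encrypted_key = data_key["CiphertextBlob"]

# Later, recover the plaintext key by asking KMS to decrypt the encrypted copy.
restored = kms.decrypt(CiphertextBlob=encrypted_key)["Plaintext"]
assert restored == plaintext_key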
Customer Master Key

CMKs are classified into two types: AWS-managed and customer-managed. An AWS-managed CMK is produced when you activate server-side encryption of an AWS resource under the AWS-managed CMK for that service for the first time (e.g., SSE-KMS). The AWS-managed CMK is specific to your AWS account and the region in which it is deployed. AWS-managed CMKs can only safeguard resources within the AWS service for which they were designed, and they do not offer the granular control that a customer-managed CMK does. To get further control, employ a customer-managed CMK in all supported AWS services and in your applications. A customer-managed CMK is generated at your request and should be configured according to your specific use case.
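The sketch below, under the assumption that boto3 credentials are configured, creates a customer-managed key and confirms it with DescribeKey; the description and alias are placeholders.

import boto3

kms = boto3.client("kms")

# Create a customer-managed key (symmetric by default) and give it an alias.
key_id = kms.create_key(Description="example customer-managed key")["KeyMetadata"]["KeyId"]
kms.create_alias(AliasName="alias/example-cmk", TargetKeyId=key_id)

# DescribeKey reports who manages the key: CUSTOMER for keys you create,
# AWS for the keys a service creates on your behalf.
metadata = kms.describe_key(KeyId=key_id)["KeyMetadata"]
print(metadata["KeyManager"])   # -> "CUSTOMER"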
Envelope Encryption

When you encrypt your data, it is protected, but you must safeguard the encryption key. Encrypting plaintext data with a data key and then encrypting the data key with a separate key is known as envelope encryption.

You may even encrypt the data encryption key under another encryption key, and encrypt that encryption key under yet another key. However, one key must ultimately remain in plaintext so that you can decrypt the keys and your data. That top-level plaintext encryption key is the root key.

Envelope encryption has several advantages:

 Protecting Data Keys

When you encrypt a data key, you do not have to worry about where to keep the encrypted data key, since encryption protects it by default. The encrypted data key can be safely stored alongside the encrypted data

 Encrypting the Same Data under Multiple Keys

Encryption operations can be time-intensive, especially if the data being encrypted is substantial. Instead of re-encrypting the raw data with multiple keys numerous times, you can re-encrypt only the data keys that protect the raw data

 Combining the Strengths of Multiple Algorithms

Symmetric key algorithms are quicker than public-key algorithms and create smaller ciphertexts. However, public-key algorithms allow intrinsic role separation and easier key management. Envelope encryption allows you to combine the strengths of each technique

KMS Encryption Flows

Following are the KMS encryption flows:

 Server-Side Encryption
 Envelope Encryption
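One way to implement the envelope pattern client-side is sketched below; it assumes boto3 plus the third-party cryptography package, and the key alias and payload are placeholders. This is an illustration of the pattern, not the only possible implementation.

import base64
import boto3
from cryptography.fernet import Fernet

kms = boto3.client("kms")

# 1. Get a data key: plaintext for local use, ciphertext to store with the data.
data_key = kms.generate_data_key(KeyId="alias/example-cmk", KeySpec="AES_256")

# 2. Encrypt the payload locally with the plaintext data key
#    (Fernet expects a url-safe base64-encoded 32-byte key).
fernet = Fernet(base64.urlsafe_b64encode(data_key["Plaintext"]))
ciphertext = fernet.encrypt(b"large analytics payload")

# 3. Keep only the ciphertext and the encrypted data key; discard the plaintext key.
stored = {"ciphertext": ciphertext, "encrypted_key": data_key["CiphertextBlob"]}

# 4. To decrypt, unwrap the data key with KMS, then decrypt the payload locally.
plaintext_key = kms.decrypt(CiphertextBlob=stored["encrypted_key"])["Plaintext"]
recovered = Fernet(base64.urlsafe_b64encode(plaintext_key)).decrypt(stored["ciphertext"])
assert recovered == b"large analytics payload"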
Secrets Manager

Secrets Manager allows you to replace hardcoded credentials, such as passwords, in your code with an API call to Secrets Manager that retrieves the secret programmatically. Because the secret no longer resides in the code, this helps ensure that it cannot be compromised by someone reviewing your code. You may also configure Secrets Manager to rotate the secret for you on a pre-defined schedule, which allows you to substitute long-term secrets with short-term ones, drastically lowering the chance of compromise.

Secrets Manager Concept

AWS Secrets Manager secures access to your applications, services, and IT resources without the upfront investment or ongoing maintenance expenses associated with running your own infrastructure.

Secrets Manager is designed for IT administrators who need a safe and scalable way to store and manage secrets. Security administrators may use Secrets Manager to meet regulatory and compliance standards by monitoring and rotating secrets without interfering with applications, and developers who want to replace hardcoded secrets in their applications can retrieve secrets from Secrets Manager programmatically.
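A minimal retrieval sketch, assuming boto3, is shown below. The secret name is a placeholder, and the secret value is assumed to be a JSON string holding database connection details; adjust the field names to whatever your secret actually stores.

import json
import boto3

secrets = boto3.client("secretsmanager")

# Retrieve the secret at runtime instead of hardcoding credentials in the code.
value = secrets.get_secret_value(SecretId="prod/analytics/redshift")
creds = json.loads(value["SecretString"])

conn_info = {
    "host": creds["host"],
    "port": creds["port"],
    "user": creds["username"],
    "password": creds["password"],
}
# Pass conn_info to your database driver; nothing sensitive lives in the source code.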
Secrets Manager Features

Encrypted secret values can be retrieved programmatically at runtime. Secrets Manager improves your security posture by removing hard-coded credentials from your application source code and preventing credentials from being stored in any form within the application. Storing the credentials in or with the application exposes them to compromise by anybody who has access to your program or its components, and it makes rotating your credentials tough: before you can deprecate the old credentials, you must update your application and distribute the updates to each client.

Secrets Manager allows you to replace stored credentials with a runtime call to the Secrets Manager web service, so credentials are acquired dynamically only when they are needed.

Secret Storage

AWS Secrets Manager uses JSON to store secret data; a secret stores a string inside the JSON document. Those secrets can be passwords, SSH keys, API keys — really, any string that fits within 64 kilobytes can be stored in Secrets Manager. You can also store Base64-encoded strings in Secrets Manager; you just have to decode the Base64 encoding after retrieval.

Different Secrets Types Stored in AWS Secrets Manager

Secrets Manager allows you to store text in the encrypted secret data part of a secret. It normally comprises the database or service's connection information; this information may contain the server name, IP address, port number, and the user name and password used to access the service. Along with the secret value, a secret also stores:

 Secret name and description
 Rotation or expiration settings
 ARN of the KMS key associated with the secret
 Any attached AWS tags

Encrypt Secret Data

Secrets Manager encrypts a secret's protected text using AWS Key Management Service (AWS KMS). AWS KMS is used for key storage and encryption by many AWS services. When your secret is at rest, AWS KMS ensures its safe encryption. Every secret in Secrets Manager is associated with a KMS key. It can be a customer-managed key created in AWS KMS or the AWS-managed key for Secrets Manager in the account (aws/secretsmanager).

Secret Rotation

The process of updating a secret regularly is known as rotation. When you rotate a secret, the credentials in both the secret and the database or service are updated. You may set up automatic rotation for your secrets in Secrets Manager. After rotation, applications that obtain the secret from Secrets Manager automatically receive the updated credentials.

Automatically Rotate Secrets

Secrets Manager may be configured to rotate your secrets automatically on a pre-defined schedule and without human interaction.

Rotation is defined and implemented using an AWS Lambda function. This function specifies how Secrets Manager carries out the following tasks:

 Creates a new version of the secret
 Stores the secret in Secrets Manager
 Configures the protected service to use the new version
 Verifies the new version
 Marks the new version as production-ready
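The tasks above map onto the four steps Secrets Manager passes to the rotation Lambda. The skeleton below is a minimal sketch of that handler; the setSecret and testSecret bodies are placeholders that a real implementation would fill in for its database or service.

import boto3

secrets = boto3.client("secretsmanager")

def lambda_handler(event, context):
    """Skeleton of a Secrets Manager rotation function."""
    secret_id = event["SecretId"]
    token = event["ClientRequestToken"]
    step = event["Step"]

    if step == "createSecret":
        # Generate new credentials and store them as the AWSPENDING version.
        new_password = secrets.get_random_password(PasswordLength=32)["RandomPassword"]
        secrets.put_secret_value(
            SecretId=secret_id,
            ClientRequestToken=token,
            SecretString=new_password,
            VersionStages=["AWSPENDING"],
        )
    elif step == "setSecret":
        pass  # apply the AWSPENDING credentials to the database or service
    elif step == "testSecret":
        pass  # verify that the AWSPENDING credentials actually work
    elif step == "finishSecret":
        # Promote the pending version to AWSCURRENT so applications pick it up.
        # A full implementation also moves the stage off the previous version.
        secrets.update_secret_version_stage(
            SecretId=secret_id,
            VersionStage="AWSCURRENT",
            MoveToVersionId=token,
        )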
Rotation Strategies

Secrets Manager has two rotation strategies:

1. Single User Rotation Strategy

The single-user strategy refreshes one user's credentials in a single secret. It is the most basic rotation approach, and it is suitable for the majority of use cases.

2. Alternating Users Rotation Strategy

The alternating-users strategy refreshes the credentials of two users in a single secret. You create the first user, and rotation clones it to create the second.

Each new version of the secret alternates which user's password is updated, so the other user's credentials stay valid. If the first version contains user1/password1, the second version has user2/password2. User1/password3 is used in the third version, while user2/password4 is used in the fourth. You have two sets of valid credentials at any one time: the current and the previous credentials.
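To round off the rotation discussion, the short sketch below enables scheduled rotation on an existing secret; the secret name and Lambda ARN are hypothetical placeholders.

import boto3

secrets = boto3.client("secretsmanager")

# Attach a rotation Lambda to an existing secret and rotate every 30 days.
secrets.rotate_secret(
    SecretId="prod/analytics/redshift",
    RotationLambdaARN="arn:aws:lambda:us-east-1:123456789012:function:redshift-rotator",
    RotationRules={"AutomaticallyAfterDays": 30},
)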
VPC Network Security Features

Amazon VPC lets you provision a logically isolated section of the AWS cloud to launch AWS resources in a virtual network that you define. You have complete control over your virtual networking environment, including the ability to define route tables and network gateways, as well as choose IP address ranges and construct subnets.

A Virtual Private Cloud (VPC) is a cloud computing concept that provides an on-demand modifiable pool of shared computing resources assigned inside a public cloud environment while also offering some level of separation from other public cloud users. As the cloud (pool of resources) in a VPC model is exclusively available to a single client, it provides privacy with more control and a safe environment where only the defined client may work.

Components of VPC

 Virtual Private Cloud: A logically isolated virtual network in the AWS cloud. The IP address space of a VPC is defined by the ranges you choose
 Subnet: A segment of a VPC's IP address range to place groups of isolated resources
 Internet Gateway: The Amazon VPC side of a connection to the public Internet
 NAT Gateway: A highly available, managed Network Address Translation (NAT) service for your resources in a private subnet to access the Internet
 Hardware VPN Connection: A hardware-based VPN connection between your Amazon VPC and your data center, home network, or co-location facility
 Virtual Private Gateway: The Amazon VPC side of a VPN connection
 Customer Gateway: Your side of a VPN connection
 Router: Routers interconnect subnets and direct traffic between Internet gateways, virtual private gateways, NAT gateways, and subnets
 Peering Connection: A peering connection enables you to route traffic via private IP addresses between two peered VPCs
 VPC Endpoints: Enable private connectivity to services hosted in AWS from within your VPC without using an Internet Gateway, VPN, Network Address Translation (NAT) devices, or firewall proxies
 Egress-only Internet Gateway: A stateful gateway that provides egress-only IPv6 traffic from the VPC to the Internet
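A few of the components listed above can be created with boto3, as in the minimal sketch below; the CIDR ranges are placeholders chosen for illustration.

import boto3

ec2 = boto3.client("ec2")

# Carve out a VPC with a CIDR range of your choosing.
vpc_id = ec2.create_vpc(CidrBlock="10.0.0.0/16")["Vpc"]["VpcId"]

# Add a subnet to hold a group of isolated resources.
subnet_id = ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.0.1.0/24")["Subnet"]["SubnetId"]

# Attach an internet gateway: the VPC side of a connection to the public Internet.
igw_id = ec2.create_internet_gateway()["InternetGateway"]["InternetGatewayId"]
ec2.attach_internet_gateway(InternetGatewayId=igw_id, VpcId=vpc_id)

print(vpc_id, subnet_id, igw_id)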
Network Access Control List

A network access control list (NACL) is an optional security layer for your VPC that operates as a firewall to manage traffic in and out of one or more subnets. Set up network ACLs with rules similar to your security groups to provide your VPC with an extra layer of protection.

Security Groups

A security group acts as a virtual firewall for your instance, controlling inbound and outbound traffic. When you deploy an instance in a VPC, you may assign up to five security groups to the instance. Security groups operate at the instance level rather than the subnet level. As a result, each instance in a VPC subnet can be allocated to a separate set of security groups.

Traffic Monitoring

Traffic Mirroring is an Amazon VPC feature that allows you to copy network traffic from the elastic network interface of an Amazon EC2 instance. The traffic can then be routed to out-of-band security and monitoring equipment for:

 Content inspection
 Threat monitoring
 Troubleshooting
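To illustrate the Security Groups subsection above, the sketch below creates a group in an existing VPC and opens inbound HTTPS only; the VPC ID and group name are hypothetical placeholders.

import boto3

ec2 = boto3.client("ec2")

# Create a security group and allow inbound HTTPS from anywhere;
# all other inbound traffic stays blocked by default.
group_id = ec2.create_security_group(
    GroupName="analytics-web",
    Description="Allow HTTPS to the analytics front end",
    VpcId="vpc-0123456789abcdef0",
)["GroupId"]

ec2.authorize_security_group_ingress(
    GroupId=group_id,
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 443,
        "ToPort": 443,
        "IpRanges": [{"CidrIp": "0.0.0.0/0"}],
    }],
)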