AWS Data Analytics Specialty Exam Cram Notes
Introduction
Course Introduction
Individuals with experience and skills in using AWS
services to design, build, secure, and maintain
analytics solutions can pursue the AWS Certified Data
Analytics - Specialty certification. This course walks you
through the different concepts, including designing
analytical questions and collecting, analyzing, and
preparing data, while evaluating that data and
uncovering insights from it.
Introduction to S3
In our data analytics steps for success, S3 falls primarily in data collection. It also has features that we will use in data preparation, analysis, and data interpretation.

Amazon S3 is a type of object storage that allows you to store and retrieve any quantity of data from any location. It is a low-cost storage solution with business resilience, reliability, efficiency, privacy, and practically infinite expansion. S3 uses a container called the bucket. The bucket is the atomic unit for S3.

Amazon S3 offers a straightforward web service interface for storing and retrieving any amount of data from any location at any time. You may quickly create projects that integrate cloud-native storage using this service. Because Amazon S3 is easily customizable and you only pay for what you use, you can start small and scale up as needed without sacrificing performance or dependability.

Upload Interfaces
When we upload data to S3, we have several interfaces to work with: the AWS Management Console, the AWS CLI, and several AWS SDKs.

AWS Management Console
When we use the Management Console, we use a graphical user interface. We can add files and folders, and we can set most of the upload options with this interface.

AWS CLI
When using the AWS CLI, we enter commands in our terminal that move data into an S3 bucket. We can also use these commands to retrieve data from S3 buckets.

Transfer Acceleration
Another feature of S3 that we should be aware of when talking about getting data into S3 is Transfer Acceleration. To understand why we use Transfer Acceleration and how it works, let's look at a scenario. Assume we have an application that stores data in a bucket in the us-east-1 region, and that works very well to send and receive data. Our users will add items to the bucket and read articles. As long as our users are reasonably close to the us-east-1 region, they will not have any issues. We will see some latency once we get out to other geographic locations. As users get further and further away, that latency will increase, and they may complain about how long it takes to get things from our S3 data store. Transfer Acceleration addresses this by routing transfers through nearby edge locations, speeding up long-distance uploads to the bucket.

S3 Multipart Upload
What do we do when we have a lot of data? If we think about a large file or object in the context of S3, we can think about it as a large building. Moving a building all at once is impractical: carrying the whole structure in one piece is time-consuming, it will move very slowly, we will probably damage it, and it is potentially expensive. When we move a large object, we run into the same issues. It is impractical and, depending on the compute resources involved, time-consuming and potentially expensive, and we might have a lot of idle CPU cycles because we are moving one large file all at once. To get around that, multipart upload comes in. If we look at the limitations of the standard ways to get data into S3, a single S3 PUT only lets us upload five gigabytes of data, but an S3 object can be up to five tebibytes.

We will use a multipart upload to get a five-tebibyte object into S3. It has three steps, much like moving a building: we would break it down into components, have a plan, and use that plan to reassemble those components on the other side of the move.

Prepare Data: We will break the data into reasonably sized pieces.
Move Pieces: We will perform the multipart upload steps to move all the data to your S3 bucket.
S3 puts it together: We will let S3 know the upload is complete, and S3 puts the data back together in the bucket.
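The AWS SDKs can perform these three steps for us. As a rough illustration (not from the original notes), the sketch below uses boto3, the AWS SDK for Python, with its high-level transfer manager; the bucket name and file path are placeholders.

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Files larger than multipart_threshold are uploaded as multipart uploads,
# in chunks of multipart_chunksize, using several threads.
config = TransferConfig(
    multipart_threshold=100 * 1024 * 1024,  # switch to multipart above ~100 MB
    multipart_chunksize=64 * 1024 * 1024,   # size of each part
    max_concurrency=8,
)

s3.upload_file("backup.tar", "example-analytics-bucket",
               "backups/backup.tar", Config=config)

The SDK breaks the file into parts, uploads them in parallel, and tells S3 to reassemble them, which is exactly the prepare, move, and reassemble flow described above.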
Three Multipart Upload API Calls
There are three API calls that we use to perform this process.
1) Create Multipart Upload: First, we have the CreateMultipartUpload API call. It returns Bucket, Key, and UploadID. The main thing we need from this API call is the UploadID. The multipart upload acts as a meta object that stores all of the information about our upload while it is happening; it is going to hold the information about all the parts.
2) Upload Parts: Next, we have the UploadPart API call. We need to provide Bucket, Key, Part Number, and Upload ID. It returns an ETag, which is very important because we need to deliver it when we make our final API call.
3) Complete Multipart Upload: Finally, we have the CompleteMultipartUpload API call. We need to provide Bucket, Key, Upload ID, and all Part Numbers with their ETags. It returns Bucket, ETag, and Key.

Considerations
We have some considerations for this process.
1) Parts: Multipart uploads can be made up of up to 10,000 parts.
2) Overwrite: Specifying the same part number as a previously uploaded part can be used to overwrite that part while the multipart upload is still in progress. Suppose something in your log file changes, or there is an update to a section of it. In that case, you can overwrite the part that section is in and have the latest version of the log file in your object when it uploads. Likewise, if one of the parts fails or contains corrupted data, you can write it again into the multipart upload. S3 will use whatever the latest data is for that piece of the upload when it is reassembled.
3) Auto-Abort: A bucket lifecycle policy can be used to abort multipart uploads automatically after a specified time. It prevents situations where a multipart upload gets started, something goes wrong with it, and the closing call never goes through, or there is an error when the close is requested; that upload will sit incomplete until it is completed or aborted. Generally, in a production system, it is good to have an auto-abort configured for your bucket if you know that there will be a lot of multipart uploads going to that bucket.

Best Practices and Limitations
There are also a few best practices and limitations that we need to think about.
AWS recommends considering multipart upload for files larger than a hundred megabytes (100 MiB).
We need to consider the limitation that all parts must be at least five megabytes (5 MiB), except for the final part.
When we put these together, parts should be between five and a hundred megabytes.
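To make the three API calls concrete, here is a minimal sketch (not from the original notes) of a low-level multipart upload with boto3; the bucket, key, and file names are placeholders, and read_chunks is a small helper defined for the example.

import boto3

def read_chunks(path, chunk_size=100 * 1024 * 1024):
    # Yield the file in ~100 MB pieces so every part (except possibly the
    # last) is above the 5 MiB minimum part size.
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk

s3 = boto3.client("s3")
bucket, key = "example-analytics-bucket", "logs/big-file.gz"

# 1) CreateMultipartUpload returns the UploadId that ties all parts together.
upload = s3.create_multipart_upload(Bucket=bucket, Key=key)
upload_id = upload["UploadId"]

# 2) UploadPart is called once per part; each call returns an ETag that
#    must be handed back when the upload is completed.
parts = []
for part_number, chunk in enumerate(read_chunks("big-file.gz"), start=1):
    response = s3.upload_part(Bucket=bucket, Key=key, PartNumber=part_number,
                              UploadId=upload_id, Body=chunk)
    parts.append({"PartNumber": part_number, "ETag": response["ETag"]})

# 3) CompleteMultipartUpload passes all part numbers and ETags so S3 can
#    reassemble the object in the bucket.
s3.complete_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id,
                             MultipartUpload={"Parts": parts})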
S3 Storage Classes
What are Storage Classes?
Amazon S3 provides a variety of storage classes to satisfy
a variety of use cases. Storage classes can be assigned to
individual objects, or the bucket can be configured to use a specific storage class by default for anything added to it. These include:

S3 Standard is a storage class for general-purpose storage of commonly accessed data.
S3 Intelligent-Tiering is utilized for data with uncertain or changing access patterns. It is more of a management engine than a storage class itself.
Standard Infrequent Access is used for infrequently accessed data.
One Zone Infrequent Access is very similar to Standard Infrequent Access, except it is meant for data that can easily be replaced. We will use this storage class if we have an on-premises data store that we want to keep a copy of in the cloud for easier access.
The Glacier storage class is used for data archives that we may need faster than Glacier Deep Archive can provide.
Glacier Deep Archive is also an archival storage class. It is utilized for digital preservation and long-term archives.

Availability and Durability
S3 Standard, S3 Standard-IA, S3 Intelligent-Tiering, S3 One Zone-IA, S3 Glacier, and S3 Glacier Deep Archive are all designed to offer 99.999999999 percent (11 9's) data durability over a year. This level of durability corresponds to a projected yearly loss of 0.000000001 percent of objects. For example, if you store 10,000,000 objects on Amazon S3, you may anticipate losing a single object once every 10,000 years on average. On Outposts, S3 is designed to store data reliably and redundantly across several devices and servers. Furthermore, Amazon S3 Standard, S3 Standard-IA, S3 Glacier, and S3 Glacier Deep Archive are all built to keep data available in the case of a complete S3 Availability Zone failure.

The S3 Standard storage class is designed for 99.99 percent availability, the S3 Standard-IA and S3 Intelligent-Tiering storage classes for 99.9 percent availability, the S3 One Zone-IA storage class for 99.5 percent availability, and the S3 Glacier and S3 Glacier Deep Archive storage classes for 99.99 percent availability and a 99.9 percent Service Level Agreement (SLA).

S3 Standard
S3 Standard provides excellent durability, availability, and performance object storage for frequently accessed data. S3 Standard is suitable for a wide range of use cases, including cloud services, dynamic websites, content distribution, mobile and gaming apps, and big data analytics, due to its low latency and high throughput. A single bucket can contain objects stored across the S3 Standard, S3 Intelligent-Tiering, S3 Standard-IA, and S3 One Zone-IA storage classes, which can be assigned at the object level. You may also utilize S3 Lifecycle policies to migrate objects across storage classes without having to make any modifications to your application.

S3 Infrequent Access
S3 Standard-IA is for data that is accessed infrequently but has to be available quickly when needed. S3 Standard-IA combines S3 Standard's strong durability, speed, and low latency with a low per-GB storage price and a per-GB retrieval charge. S3 Standard-IA is appropriate for long-term storage, backups, and data storage for disaster recovery files because of its low cost and high performance. A single bucket can contain objects stored across the S3 Standard, S3 Intelligent-Tiering, S3 Standard-IA, and S3 One Zone-IA storage classes, which can be assigned at the object level. You may also utilize S3 Lifecycle policies to migrate objects across storage classes without having to make any modifications to your application.

S3 Intelligent Tiering
Independent of object size or retention duration, S3 Intelligent-Tiering is the best storage class for data with unknown, changing, or unexpected access patterns. S3 Intelligent-Tiering may be used as the default storage class for data lakes, analytics, and new applications.

S3 Intelligent-Tiering monitors the objects assigned to the Intelligent-Tiering storage class for access. If an object is not accessed for 30 days, it is moved to the Infrequent Access tier; once that object is accessed again, it is transferred back to the Standard tier. That is where the unknown access pattern comes in. We pay to monitor our objects, but we trade the retrieval cost of Infrequent Access for this monitoring cost. The monitoring charge is roughly a quarter of a penny per 1,000 objects, and the storage cost depends on the underlying tier the object is in; whether it sits in Standard or Infrequent Access determines what we pay to store it.
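As a small illustration (not from the original notes), an object can be written directly into a specific storage class with boto3; the bucket and key names here are placeholders.

import boto3

s3 = boto3.client("s3")

# Store the object in an infrequent-access class instead of the bucket default.
s3.put_object(
    Bucket="example-analytics-bucket",
    Key="archive/2023/report.csv",
    Body=b"col_a,col_b\n1,2\n",
    StorageClass="STANDARD_IA",  # or "INTELLIGENT_TIERING", "GLACIER", "DEEP_ARCHIVE"
)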
S3 Glacier
S3 Glacier is a secure, durable, low-cost data archiving storage class. You can store any quantity of data reliably at prices comparable to or lower than on-premises alternatives. S3 Glacier offers three retrieval options, ranging from a few minutes to hours, to keep costs reasonable while meeting various demands. You may utilize S3 Lifecycle policies to move data from the storage classes for active data (S3 Standard, S3 Intelligent-Tiering, S3 Standard-IA, and S3 One Zone-IA) into S3 Glacier, or you can upload objects directly to S3 Glacier.

S3 Lifecycle Policies
Life Of Data
When we talk about the life of data, we mean that we will create our data, that data will be actively utilized, and then eventually, we will likely archive or delete that data. Once the data has been created, it will be active in either the Standard or Infrequent Access S3 storage class. Then it will move over to an archive storage class, and we may push that data back into active utilization at some point. We can manage most of this with a lifecycle policy. We do not need to manually work through the potentially millions of objects stored in S3; all of this happens for us automatically.

Data Lifecycle in S3
When we look at the storage classes, active and archive overlap like a Venn diagram. Typically, we will bounce from the S3 Standard to the Infrequent Access storage classes and then move to the archive. We might move back to Standard again and repeat this process in the lifecycle of our data. The data may loop around and around, or it may eventually be deleted after a certain amount of time.
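A minimal sketch (not from the original notes) of what such a lifecycle policy could look like with boto3; the bucket name, prefix, and time values are placeholders chosen for illustration.

import boto3

s3 = boto3.client("s3")

# One rule that moves objects to Standard-IA after 30 days, to Glacier after
# 90 days, deletes them after a year, and aborts stale multipart uploads.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-analytics-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-then-expire",
                "Status": "Enabled",
                "Filter": {"Prefix": "logs/"},
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 365},
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            }
        ]
    },
)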
S3 Security and Encryption
S3 Security Overview
At AWS, cloud security is a top priority. As an AWS client, you have access to a data center and network architecture designed to fulfill the needs of the most security-conscious businesses. When we look at S3 security and encryption, there are many S3 features and integrated services that provide various functions to maintain the security of our S3 buckets.

S3 Features
Access Analyzer for S3
Amazon S3 Server Access Logging
Bucket Policy
Bucket Access Control List (ACL)
Cross-Region Replication
Multi-factor authentication (MFA) Delete
Object Access Control List (ACL)
Object Locking
Versioning

Integrated Services
Amazon CloudWatch Alarms
AWS CloudTrail Logs
Identity and Access Management (IAM)
VPC Endpoints
Service Control Policies
Key Management Service (KMS)

In-Flight Security
For in-flight security, the S3 bucket requires TLS support from the clients that connect to it. We also have VPC endpoints, which make it so that we can only access our bucket through our VPC, and we can combine that VPC with VPN options to access S3 buckets from the outside. Then we can use access control and auditing features to ensure that the security features above are operating in the way we expect them to.
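One common way to require TLS is a bucket policy that denies any request made without secure transport. The sketch below (not from the original notes) applies such a policy with boto3; the bucket name is a placeholder.

import json
import boto3

s3 = boto3.client("s3")
bucket = "example-analytics-bucket"

# Deny any request that does not arrive over TLS (aws:SecureTransport is false).
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyInsecureTransport",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        }
    ],
}
s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))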
Client-Side Encryption
For client-side encryption with the KMS option, the application server requests an encrypted object. Our bucket returns that encrypted object and a cipher blob. The cipher blob identifies the key that the object was encrypted with. Our application server then needs to call KMS for that key: it requests the key associated with the cipher blob returned with the object, and a data key is returned for that object. We can then combine our encrypted object and our plaintext data key to produce the decrypted object.

Server-Side Encryption
For server-side encryption, if S3 manages our encryption keys, our application server simply requests an object, and S3 decrypts it and sends it back.

The S3 Access Security Waterfall
There is a permission waterfall when we talk about the API and our various ways to access objects in S3 and control that access. Each one of the services in the waterfall adds to the flow, and at the bottom of the waterfall, they are all combined to create a single policy that determines whether or not we can access an object or a bucket in S3.

Access Logging, Alerting, and Auditing
We have S3 server access logs and CloudTrail logs, and with these two combined, we can get various granularities of access information for our buckets. That data can be fed into CloudWatch, or CloudWatch on its own can be used for alerting to perform various actions. If a bucket suddenly receives a considerable number of requests, we could set up a CloudWatch alarm that triggers a Lambda function to turn off access to that bucket; maybe a timer waits a certain amount of time and then automatically re-enables access to that bucket to resume operation.

Understanding all the layers of access authorization can get quite complicated, and that is where Access Analyzer for S3 comes in. It will analyze the various policies and ACLs involved in providing access, and it will generate a report about what access is available. It is useful if there is a hole that we missed in our access policies; if we do not want to provide access to a bucket through some specific avenue, the report will reveal that for us.

Object Protection and Replication
Protection
We have an object in our S3 bucket that we want to protect so that it cannot be deleted; that is, we want to put it into a Write Once Read Many (WORM) mode. We can turn on object locking, which prevents the deletion of this object without first disabling object locking.

Alternatively, we can enable multi-factor authentication delete on our bucket, requiring an MFA token to delete objects in the bucket. It is useful because we can control this feature with some granularity: only users with MFA tokens are allowed to delete objects from the bucket, and all other users are not, which still gives our administrators the ability to remove objects if needed.

Replication
We can also replicate our bucket across regions, so the objects in our bucket are copied to a bucket in a second region. That bucket will need a different name, but this provides disaster recovery in case there is a loss of an entire AWS region for whatever reason. We can turn on cross-region replication and still access our buckets; we may need to update our code to point our applications to the correct bucket or set up some automatic failover in our application code. To turn on cross-region replication, you need to enable versioning. This is because if a huge object is placed in a bucket, the replication engine starts copying it, and the object is then replaced, you would suddenly have an invalid replication occurring. With versioning, the service can replicate a specific version; if the object is overwritten while replication is still happening, it will complete the replication of the version it is copying and then replicate the newer version. This means that it is possible for there to be some delay in replicating our objects into the second region, but generally, this works out very well for disaster recovery.
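A rough sketch (not from the original notes) of enabling versioning and a basic replication rule with boto3. The bucket names and the IAM role ARN are placeholders, and the rule uses the simple prefix-based configuration form.

import boto3

s3 = boto3.client("s3")

# Versioning must be enabled on both buckets before replication can be set up.
s3.put_bucket_versioning(
    Bucket="example-source-bucket",
    VersioningConfiguration={"Status": "Enabled"},
)

# A single rule that replicates every new object version to a bucket in
# another region.
s3.put_bucket_replication(
    Bucket="example-source-bucket",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
        "Rules": [
            {
                "ID": "replicate-everything",
                "Status": "Enabled",
                "Prefix": "",
                "Destination": {"Bucket": "arn:aws:s3:::example-destination-bucket"},
            }
        ],
    },
)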
Chapter 03: Databases in AWS
Aurora Serverless
Amazon Aurora Serverless v1 (Amazon Aurora Serverless
version 1) is a configuration for on-demand autoscaling
in Amazon Aurora. An Aurora Serverless DB cluster is a
database cluster that dynamically scales processing
capacity based on the needs of your application. In
contrast, Aurora provisioned DB clusters require manual
capacity management. Aurora Serverless v1 is a simple,
low-cost choice for occasional, intermittent, or
unexpected workloads. It saves money since it starts up
automatically, boosts processing capacity to match the
needs of your application, and shuts down when not in
use.
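As an illustration only (not from the original notes), an Aurora Serverless v1 cluster can be created with boto3 by setting the serverless engine mode and a scaling configuration; the identifier, engine choice, and credentials below are placeholders.

import boto3

rds = boto3.client("rds")

# Scale between 1 and 8 capacity units and pause after five minutes of inactivity.
rds.create_db_cluster(
    DBClusterIdentifier="example-serverless-cluster",
    Engine="aurora-mysql",
    EngineMode="serverless",
    MasterUsername="admin",
    MasterUserPassword="change-me-please",
    ScalingConfiguration={
        "MinCapacity": 1,
        "MaxCapacity": 8,
        "AutoPause": True,
        "SecondsUntilAutoPause": 300,
    },
)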
Chapter 04: Collecting Streaming Data
Kinesis Data Streams, Kinesis Data Firehose, Kinesis Video Streams, and Kinesis Data Analytics are the four services that make up the Kinesis family.

Kinesis Data Streams
Introduction
One of the benefits of using Kinesis Data Streams is aggregating real-time data and then loading it into a data warehousing solution like Redshift or a MapReduce cluster like EMR. Kinesis Data Streams is also durable and elastic, meaning that you will not lose your data records or your streaming data, and it scales up or down depending on the number of records coming into the stream. It means that you get all the benefits of a managed streaming service rather than having to host it on EC2 instances yourself. You want to take advantage of managed services wherever you can, which is an integral part of collecting streaming data within AWS. You can also have parallel applications reading from the stream, so you can perform different functions on the same data. It is one of the significant differences between Kinesis and other queuing services.

Working with Kinesis Data Streams
Kinesis Data Streams may be used to collect and aggregate data in real-time. IT infrastructure log data, application logs, social media, market data feeds, and online clickstream data are some examples of the data types that may be employed. Because the data intake and processing are both done in real-time, the processing overhead is generally minimal.

The following are some examples of how Kinesis Data Streams can be used:

2. Real-time metrics and reporting:
Data gathered via Kinesis Data Streams may be used for real-time data analysis and reporting. Instead of waiting for batches of data to arrive, your data-processing application can work on metrics and reporting for system and application logs as the data comes in.

3. Real-time data analytics:
The strength of parallel processing is combined with the value of real-time data in this way. For example, you can process website clickstreams in real time utilizing many Kinesis Data Streams applications operating in parallel, and then assess site usability and engagement.

4. Complex stream processing:
Kinesis Data Streams applications and data streams may be turned into Directed Acyclic Graphs (DAGs). This usually entails combining data from many Kinesis Data Streams applications into a single stream for later processing by another Kinesis Data Streams application.

Shard
Shards are containers that hold the information shipped off to consumers. Let us assume that you have a single shard. This shard has two data records, and each data record consists of a partition key, a sequence number, and the actual data you want to ship off to consumers. The partition key is going to be the same for all the records within a shard, and the sequence number reflects the order in which the shard received the record. Each shard consists of a sequence of data records, which can be ingested at 1,000 records per second. The actual data payload per record can be up to one megabyte.

Processing & Storage
A shard is temporary data storage. Data records are stored for 24 hours by default and can be retained for up to 365 days. By default, the data retention period is 24 hours. You can raise the data retention period to seven days by enabling extended data retention, and you can increase it even further by enabling long-term data retention to have the data persist in the stream for up to 365 days. You can do this by using the IncreaseStreamRetentionPeriod operation, and you can decrease it by using the DecreaseStreamRetentionPeriod operation. Hence, going back to our train example, passengers will be booted from the train every 24 hours, but some rules may differ for some trains; that is the retention period, so some passengers may be allowed to stay for up to 365 days.
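To make the shard concepts concrete, here is a minimal sketch (not from the original notes) of a producer writing one record with boto3 and raising the retention period; the stream name and payload are placeholders.

import json
import boto3

kinesis = boto3.client("kinesis")

# Each record carries a partition key (used to pick a shard) and a data blob
# of up to 1 MB.
kinesis.put_record(
    StreamName="example-clickstream",
    PartitionKey="user-42",
    Data=json.dumps({"page": "/checkout", "ts": 1700000000}).encode("utf-8"),
)

# Retention can be raised beyond the 24-hour default, up to 365 days.
kinesis.increase_stream_retention_period(
    StreamName="example-clickstream",
    RetentionPeriodHours=168,  # seven days
)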
Interacting with Kinesis Data Streams
There are a few different ways to interact with Kinesis Data Streams:

1. Kinesis Producer Library (KPL):
An application that inserts user data into a Kinesis data stream is an Amazon Kinesis Data Streams producer (this is also called data ingestion). The Kinesis Producer Library (KPL) makes it easier for developers to create producer applications by achieving high write throughput to a Kinesis data stream.

2. Kinesis Client Library (KCL):
The KCL takes care of many complicated duties connected with distributed computing, allowing you to receive and process data from a Kinesis data stream. Load balancing across numerous consumer application instances, responding to consumer application instance failures, checkpointing processed records, and responding to re-sharding are examples of these. The KCL handles all of these subtasks, allowing you to concentrate on implementing your unique record-processing logic.

The KCL is not the same as the Kinesis Data Streams APIs offered in the AWS SDKs. The Kinesis Data Streams APIs assist you in managing many elements of Kinesis Data Streams, such as creating streams, re-sharding, and putting and getting records. The KCL adds a layer of abstraction around these subtasks, allowing you to focus on the particular data processing logic in your consumer application.

3. Kinesis Agent:
The Kinesis Agent is a ready-to-use Java application that can be deployed on a Linux-based server. It is an agent that monitors specific files and continuously sends data to our data stream. Hence, you might want to install it on web servers, log servers, or database servers.

Kinesis Agent is a Java software application that allows you to gather and transfer data to Kinesis Data Streams quickly. The agent watches a group of files in real time and feeds new data to your stream. The agent performs file rotation, checkpointing, and retries in the event of a failure. It delivers all of your data in a dependable, fast, and straightforward manner. It also emits Amazon CloudWatch metrics to assist you in monitoring and troubleshooting the streaming operation.

By default, entries from each file are processed based on the newline ('\n') character. The agent may, however, be configured to parse multi-line entries.

The agent may be deployed on Linux-based web servers, log servers, and database servers. Configure the agent after installing it by providing the files to monitor and the destination data stream. Once set up, the agent takes data from the files and consistently feeds it to the stream.

4. Kinesis API (AWS SDK):
The Kinesis API is used to get data into and out of a Kinesis Data Stream. Once the data is in Kinesis Data Streams, you can use the Kinesis Client Library (KCL) to consume it, alongside the Kinesis Producer Library for producing. These libraries abstract some of the low-level commands you would otherwise have to issue with the Kinesis API; the Kinesis API itself is used for lower-level operations and more manual configurations. With the Kinesis API, you can perform the same actions that you can achieve with the Kinesis Producer Library or the Kinesis Client Library. Hence, you can install the Kinesis Producer Library on your EC2 instances or integrate it directly into your Java applications.

KPL VS Kinesis API
Some key features of the Kinesis Producer Library and the Kinesis API are mentioned below:
Features of KPL:
Provides a layer of abstraction dedicated to data intake
Retry system that is both automatic and adjustable
In order to achieve higher packing efficiency and better performance, additional processing delays may occur
Java wrapper

Features of Kinesis API:
Low-level API calls (PutRecords and GetRecords)
Stream creation, re-sharding, and putting and getting records are manually handled
No delays in processing
Available in any AWS SDK

Kinesis Data Firehose
3. Data Producer:
Records are sent to Kinesis Data Firehose delivery streams by producers. A data producer is, for example, a web server that delivers log data to a delivery stream. You can also set up your Kinesis Data Firehose delivery stream to read data from an existing Kinesis data stream and put it into destinations automatically.

4. Buffer Size & Buffer Interval:
Before sending data to its destinations, Kinesis Data Firehose buffers incoming streaming data to a specific size or for a certain amount of time. Buffer size is measured in megabytes, while buffer interval is measured in seconds.
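As a small illustration (not from the original notes), a data producer can write a record to a delivery stream with boto3; Firehose then buffers records by size or interval before delivering them. The stream name and payload are placeholders.

import json
import boto3

firehose = boto3.client("firehose")

firehose.put_record(
    DeliveryStreamName="example-delivery-stream",
    Record={"Data": (json.dumps({"status": 200, "path": "/index.html"}) + "\n").encode("utf-8")},
)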
Amazon MSK
Amazon Managed Streaming for Apache Kafka (Amazon
MSK) is a completely managed service that allows you to
design and run applications that handle streaming data
using Apache Kafka. Amazon MSK takes control-plane
actions, including building, updating, and removing
clusters. It enables the usage of Apache Kafka data-plane
tasks, such as data production and data consumption. It
runs Apache Kafka open-source versions and implies that
existing applications, tools, and plugins from partners
and the Apache Kafka community are supported without
application code modifications.
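As a sketch of those control-plane operations (not from the original notes), an MSK cluster can be created with boto3; the cluster name, Kafka version, instance type, and the subnet and security group IDs are all placeholders.

import boto3

msk = boto3.client("kafka")

msk.create_cluster(
    ClusterName="example-analytics-cluster",
    KafkaVersion="2.8.1",
    NumberOfBrokerNodes=3,
    BrokerNodeGroupInfo={
        "InstanceType": "kafka.m5.large",
        "ClientSubnets": ["subnet-0aaa0aaa", "subnet-0bbb0bbb", "subnet-0ccc0ccc"],  # placeholders
        "SecurityGroups": ["sg-01230123"],                                           # placeholder
    },
)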
Chapter 05: Data Collection and Getting Data into AWS
Serverless Architectures
These services in action can trigger things like Lambda
functions. You can have an application that runs on
CloudFront that gets the data from S3 or is served out
Chapter 06: Amazon Elastic Map Reduce
Introduction
Elastic Map Reduce, or EMR, plays a huge role in data analytics, processing, and big data frameworks. We can use the EMR architecture and the Hadoop framework to process and analyze massive amounts of data. Log analysis, web indexing, data warehousing, machine learning (ML), financial analysis, scientific modeling, and bioinformatics all use Amazon EMR for data analysis. It also supports Apache Spark, Apache Hive, Presto, and Apache HBase workloads, and it connects with Hive and Pig, open-source Hadoop data warehousing technologies. Pig provides a high-level interface for scripting Map-Reduce tasks in Hadoop, whereas Hive uses queries and analyses. This service is expensive, and it uses a lot of computing power, but it gives you the ability to process huge amounts of data in a short amount of time.

Apache Hadoop and EMR Software Collection
Map Reduce
Map-Reduce is a technique that data scientists can use to distribute workloads across many different computing nodes to process data and get the information back quicker than on a single node.

Hadoop Distributed File System (HDFS)
Hadoop Distributed File System is open-source software that allows you to operate a distributed file system over several computers to tackle challenges requiring large amounts of data. HDFS is meant to run on low-cost hardware and is extremely fault-tolerant. HDFS is a file system that allows high-throughput access to application data and is well suited to applications with huge data collections. The problem with setting up an HDFS cluster is that it requires a lot of maintenance and management. This is where Elastic Map Reduce comes in.

EMR
Elastic Map Reduce is a fully managed AWS service that allows you to spin up Hadoop ecosystems. Not only can you store data on HDFS, but you have some other storage options as well. We also have the EMR file system, or EMRFS, which means that the Elastic Map Reduce cluster shares the data with S3. We can also store it on the local file system; this can be an instance store or EBS volumes.

EMR Architecture
Introduction
The entire cluster is spun up in a single availability zone. Every EMR cluster has either a single primary node or three primary nodes. The primary node manages all of the components in the distributed applications. When a job, some processing, or Map-Reduce tasks need to be submitted, the core nodes come into play, and the primary node manages these core nodes. The last part of the EMR architecture is our task nodes. Task nodes are optional. They add power to perform parallel computational tasks on the data, and they help the core nodes.

Primary Node Features

Single or Multi-Primary Nodes
Whenever you launch a cluster, you will have the option to choose between one primary node and three primary nodes. You will only have a single primary node most of the time, but now you can also have multiple primary nodes. You would have multiple primary nodes so that you do not have a single point of failure. Therefore, if one primary node fails, the cluster uses the other two primary nodes to run without interruptions. EMR automatically replaces the primary node and provisions it with any configurations or bootstrap actions that need to happen. Hence, all it does is remediate that single point of failure.

Manages the Cluster Resources
The primary node also manages the cluster resources. It coordinates the distribution and parallel execution of the different Map-Reduce tasks.

Tracks and Directs HDFS
The primary node also tracks and directs the HDFS. The primary node knows how to look up files and track data on the core nodes.

Added and Removed from Running Clusters
Task nodes can be added and removed from the core nodes to ramp up extra CPU or memory for compute-intensive tasks.
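A rough sketch (not part of the original notes) of launching a small EMR cluster with boto3; the cluster name, release label, instance types, log bucket, and roles shown are placeholders or default role names.

import boto3

emr = boto3.client("emr")

# One primary node and two core nodes, with Spark and Hive installed.
emr.run_job_flow(
    Name="example-analytics-cluster",
    ReleaseLabel="emr-6.10.0",
    Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
    LogUri="s3://example-analytics-bucket/emr-logs/",
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,              # 1 primary + 2 core nodes
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)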
Chapter 07: Amazon Redshift
Introduction
Redshift is a data warehousing service. It can warehouse data at the petabyte scale, which means Redshift can store very large datasets. It can also index and query that data so it remains usable, and we can store petabytes and even exabytes of data in S3.

When a query runs, the worker nodes go to their slices, and each slice returns the piece of data it is responsible for. That data comes back, the leader node combines it into our query response, and that query response goes back to our end user. That is the Redshift query process at a very high level.

Compression
Redshift can compress individual columns, which means different compression types are available depending on the data type. Most of the number and time data types default to AZ64 compression, and the character and variable-character data types default to LZO compression; several other compression encodings are also available. Other data types, such as Boolean, real, and double, default to raw. If a column is our sort key, it needs to be raw (uncompressed) because the database engine queries it frequently.
Distribution Styles
1. Even – Blocks are distributed evenly between
cluster slices (default).
2. Key – identical key values are stored on the same
Slice.
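To tie the compression encodings and distribution styles together, here is a sketch (not from the original notes) that runs illustrative DDL through the Redshift Data API with boto3; the cluster, database, user, and table names are placeholders.

import boto3

redshift_data = boto3.client("redshift-data")

# DDL illustrating a KEY distribution style, a raw (uncompressed) sort key,
# and explicit column encodings.
sql = """
CREATE TABLE sales (
    sale_id     BIGINT       ENCODE az64,
    customer_id BIGINT       ENCODE az64,
    notes       VARCHAR(256) ENCODE lzo,
    sold_at     TIMESTAMP    ENCODE raw
)
DISTSTYLE KEY
DISTKEY (customer_id)
SORTKEY (sold_at);
"""

redshift_data.execute_statement(
    ClusterIdentifier="example-cluster",
    Database="analytics",
    DbUser="awsuser",
    Sql=sql,
)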
Chapter 08: Redshift Operation and Maintenance
Cluster
CPU Utilization.
Maintenance mode.
Storage
Percentage Disk Space used.
Auto Vacuum Freed.
Read throughput.
Read latency.
Write throughput.
Write latency.
Database
Database connections.
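These cluster, storage, and database figures correspond to CloudWatch metrics in the AWS/Redshift namespace. As a small illustration (not from the original notes), one of them can be pulled with boto3; the cluster identifier is a placeholder.

import datetime
import boto3

cloudwatch = boto3.client("cloudwatch")

end = datetime.datetime.utcnow()
start = end - datetime.timedelta(hours=1)

response = cloudwatch.get_metric_statistics(
    Namespace="AWS/Redshift",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "ClusterIdentifier", "Value": "example-cluster"}],
    StartTime=start,
    EndTime=end,
    Period=300,
    Statistics=["Average"],
)
for point in response["Datapoints"]:
    print(point["Timestamp"], point["Average"])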
Chapter 09: AWS Glue, Athena, and QuickSight
Encryption in Transit
QuickSight supports encryption of all data transfers using
SSL. This includes data to and from SPICE and from SPICE
to the user interface.
Key Management
AWS manages all the keys associated with QuickSight.
Database server certificates are the responsibility of the
customer.
You may even encrypt the data encryption key using another encryption key, and then encrypt that encryption key again. However, one key must finally remain in plaintext so that the chain of keys can be decrypted.

AWS Secrets Manager
AWS Secrets Manager secures access to your applications, services, and IT resources without the upfront investment or ongoing maintenance expenses associated with running your own infrastructure.

Secrets Manager is designed for IT administrators who need a safe and scalable way to store and manage secrets. Security administrators may use Secrets Manager to meet regulatory and compliance standards and to monitor and rotate secrets without interfering with applications. Secrets Manager may also be called programmatically by developers who want to replace hardcoded secrets in their applications.

Secrets Manager Features
At runtime, encrypted secret values can be retrieved programmatically. Secrets Manager improves your security posture by removing hard-coded credentials from your application source code and preventing credentials from being stored in any form within the application. Storing the credentials in or with the application exposes them to compromise by anybody who has access to your program or its components, and this technique makes rotating your credentials tough: before you can deprecate the old credentials, you must update your application and distribute the updates to each client.

Secrets Manager allows you to remove stored credentials by replacing them with a runtime call to the Secrets Manager web service.

Secret Storage
AWS Secrets Manager stores secret data as a JSON string. Those secrets can be passwords, SSH keys, or API keys; really, any string that fits within 64 kilobytes can be stored in Secrets Manager. You could also store Base64-encoded strings in Secrets Manager, as long as you remember to decode that Base64 encoding when you retrieve them.

Different Secret Types Stored in AWS Secrets Manager
Secrets Manager allows you to store text in the encrypted secret data part of a secret. It normally comprises the database or service's connection information. This information may contain the server name, IP address, port number, and the user name and password used to access the service. A secret also includes:

Secret name and description
Rotation or expiration settings
ARN of the KMS key associated with the secret
Any attached AWS tags

Encrypt Secret Data
Secrets Manager encrypts a secret's protected text using AWS Key Management Service (AWS KMS). AWS KMS is used for key storage and encryption by many AWS services. When your secret is at rest, AWS KMS ensures its safe encryption. Every secret is associated with a KMS key in Secrets Manager. It can be a customer-managed key created in AWS KMS or the AWS-managed key for Secrets Manager in the account (aws/secretsmanager).

Secret Rotation
The process of updating a secret regularly is known as rotation. When you rotate a secret, the credentials in both the secret and the database or service are updated. You may set up automatic rotation for your secrets in Secrets Manager. After rotation, applications that obtain the secret from Secrets Manager automatically receive the updated credentials.

Automatically Rotate Secrets
Secrets Manager may be configured to rotate your secrets automatically on a pre-defined schedule and without human interaction.

Rotation is defined and implemented using an AWS Lambda function. This function specifies how Secrets Manager will carry out the following tasks:

Creates a new version of the secret
Stores the secret in Secrets Manager
Configures the protected service to use the new version
Verifies the new version
Marks the new version as production-ready

Rotation Strategies
Secrets Manager has two rotation strategies:
1. Single User Rotation Strategy
The single-user strategy refreshes one user's credentials in a single secret. It is the most basic rotation approach, and it is suitable for the majority of use cases.

2. Alternating Users Rotation Strategy
The alternating-users strategy refreshes two users' credentials in a single secret. You create the first user, and rotation clones it to create the second. With each new version of the secret, the other user is brought up to date. If the first version contains user1/password1, the second version has user2/password2. User1/password3 is used in the third version, while user2/password4 is used in the fourth. You have two sets of valid credentials at any one time: the current and prior credentials.
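A minimal sketch (not from the original notes) of retrieving a secret at runtime with boto3 instead of hard-coding credentials; the secret name and its JSON keys are placeholders.

import json
import boto3

secrets = boto3.client("secretsmanager")

# Fetch the current version of the secret and parse its JSON payload.
response = secrets.get_secret_value(SecretId="prod/analytics/redshift")
credentials = json.loads(response["SecretString"])

username = credentials["username"]
password = credentials["password"]

Because the application always asks Secrets Manager for the current value, rotated credentials are picked up automatically without redeploying the application.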
VPC Network Security Features
Amazon VPC lets you provision a logically isolated section of the AWS cloud to launch AWS resources in a virtual network that you define. You have complete control over your virtual networking environment, including the ability to define route tables and network gateways, as well as choose IP address ranges and construct subnets.

A Virtual Private Cloud (VPC) is a cloud computing concept that provides an on-demand, modifiable pool of shared computing resources assigned inside a public cloud environment while also offering some level of separation from other public cloud users. As the cloud (pool of resources) in a VPC model is exclusively available to a single client, it provides privacy with more control and a safe environment where only the defined client may work.

Components of VPC
Virtual Private Cloud: A logically isolated virtual network in the AWS cloud. The IP address space of a VPC is defined by the ranges you choose
Subnet: A segment of a VPC's IP address range to place groups of isolated resources
Internet Gateway: The Amazon VPC side of a connection to the public Internet
NAT Gateway: A highly available, managed Network Address Translation (NAT) service for your resources in a private subnet to access the Internet
Hardware VPN Connection: A hardware-based VPN connection between your Amazon VPC and your data center, home network, or co-location facility
Virtual Private Gateway: The Amazon VPC side of a VPN connection
Customer Gateway: Your side of a VPN connection
Router: Routers interconnect subnets and direct traffic between Internet gateways, virtual private gateways, NAT gateways, and subnets
Peering Connection: A peering connection enables you to route traffic via private IP addresses between two peered VPCs
VPC Endpoints: Enable private connectivity to services hosted in AWS from within your VPC without using an Internet Gateway, VPN, Network Address Translation (NAT) devices, or firewall proxies
Egress-only Internet Gateway: A stateful gateway that provides egress-only IPv6 traffic from the VPC to the Internet

Network Access Control List
A network access control list (NACL) is an optional security layer for your VPC that operates as a firewall to manage traffic in and out of one or more subnets. Set up network ACLs with rules similar to your security groups to provide your VPC with an extra layer of protection.

Security Groups
For your instance, a security group acts as a virtual firewall, controlling inbound and outbound traffic. When you deploy an instance in a VPC, you may assign the instance up to five security groups. Security groups operate at the instance level rather than the subnet level. As a result, each instance in a VPC subnet can be allocated to a separate set of security groups.

Traffic Monitoring
Traffic Mirroring is an Amazon VPC feature that allows you to copy network traffic from the elastic network interface of an Amazon EC2 instance. The traffic can then be routed to out-of-band security and monitoring appliances for:
Content inspection
Threat monitoring
Troubleshooting
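As a closing illustration (not from the original notes), a security group can be created and given a single inbound rule with boto3; the VPC ID and CIDR range below are placeholders.

import boto3

ec2 = boto3.client("ec2")

# Create a security group in a VPC and allow inbound HTTPS only.
sg = ec2.create_security_group(
    GroupName="analytics-web-sg",
    Description="Allow HTTPS from the corporate network",
    VpcId="vpc-0123456789abcdef0",
)

ec2.authorize_security_group_ingress(
    GroupId=sg["GroupId"],
    IpPermissions=[
        {
            "IpProtocol": "tcp",
            "FromPort": 443,
            "ToPort": 443,
            "IpRanges": [{"CidrIp": "203.0.113.0/24"}],
        }
    ],
)

Because security groups are stateful and attach at the instance level, this single ingress rule is enough to allow return traffic for established HTTPS connections.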