4.1. Notes
AWS DataSync
AWS DataSync is a tool that helps you move or copy large amounts of data between different places,
such as your own servers or other cloud services, and Amazon Web Services (AWS).
From Your Own Servers to AWS: If you have data stored on your company's servers, you can
move it to AWS, for example into Amazon S3 (object storage), Amazon EFS (a file system), or
Amazon FSx (specialized file systems).
Between Different AWS Services: You can also use DataSync to copy data between different
AWS services, like moving data from one S3 storage to another.
From AWS Back to Your Servers: You can even move data back from AWS to your own
servers.
Connection: To connect your own servers with AWS, you’ll need to install a small program
(called an "agent") on your server. This agent helps your server talk to AWS and transfer the
data.
No Agent for AWS to AWS: If you're moving data only within AWS services, you don’t need
this agent.
Scheduled Transfers: Data doesn’t move continuously; you set up a schedule, like every hour,
day, or week, for DataSync to do its job. So there’s a bit of a delay, but it keeps everything in
sync as per your schedule.
Metadata and Permissions: When DataSync moves your files, it keeps all the important
details (like who can access the files and other security settings) intact.
Fast Transfers: DataSync can move data very quickly (up to 10 gigabits per second), but if you
don’t want to use too much of your network’s capacity, you can set a bandwidth limit to throttle it
(see the sketch after this list).
Limited Network? If your network bandwidth is too limited to transfer the data online, AWS
offers a small device called Snowcone with the DataSync agent pre-installed. You load your data
onto Snowcone and ship the device to AWS, where the transfer is completed for you.
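To make the scheduling and bandwidth-throttling ideas above concrete, here is a minimal boto3 sketch; the location ARNs, the cron expression, and the bandwidth cap are placeholder assumptions, not values from these notes.

```python
import boto3

datasync = boto3.client("datasync")

# Hypothetical source (on-premises) and destination (S3) location ARNs.
response = datasync.create_task(
    SourceLocationArn="arn:aws:datasync:us-east-1:123456789012:location/loc-src",
    DestinationLocationArn="arn:aws:datasync:us-east-1:123456789012:location/loc-dst",
    Name="nightly-sync",
    # Run the task on a schedule instead of continuously (hourly/daily/weekly).
    Schedule={"ScheduleExpression": "cron(0 2 * * ? *)"},  # every day at 02:00 UTC
    # Throttle the transfer so it does not saturate the network link.
    Options={"BytesPerSecond": 50 * 1024 * 1024},  # cap at roughly 50 MB/s
)
print(response["TaskArn"])
```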
78. AWS DataSync - Solution Architecture
When you want to use AWS DataSync and ensure that your connection is private (not over the public
internet), you can use AWS Direct Connect. Direct Connect is a service that creates a direct, private
network link between your own data center and AWS.
o To keep the connection private, you need to go through your VPC, which is your own
private network within AWS.
o Your DataSync agent (the software that moves your data) will connect to AWS
through Direct Connect.
o If you use a Public VIF, the connection bypasses the VPC and reaches DataSync’s public
endpoint. This might not be what you want because the traffic is not fully private.
o Create a VPC Interface Endpoint: This is a special connection point in your VPC that
allows you to privately connect to AWS services like DataSync.
o Set Up a Private VIF: This will create a private link between your Direct Connect
connection and the VPC Interface Endpoint.
o Now, the DataSync agent can securely send data through Direct Connect, into your
VPC, and then to the DataSync service, all without ever touching the public internet.
Using this method ensures that your data stays private and secure as it moves between your
data center and AWS.
AWS Kinesis is a service that helps you handle a large amount of data that needs to be processed in
real-time. It’s like a fast-moving river (or stream) of data where you can continuously add and analyze
information.
Big Data Applications: Kinesis is often used in big data scenarios where large amounts of
information are processed and analyzed quickly.
1. Highly Available: Kinesis automatically copies your data across three different locations
(Availability Zones), making it reliable and fault-tolerant.
o Kinesis Streams: For real-time data ingestion at large scale, where data is divided
into parts called "Shards."
o Kinesis Analytics: Allows you to perform real-time data analysis using SQL on data
coming through Kinesis Streams.
o Kinesis Firehose: Automatically loads the data from streams into other AWS services
like S3, Redshift, Elasticsearch, or Splunk.
Shards: Think of Shards as individual lanes in a highway. Data flows through these lanes, and
you can decide how many lanes (Shards) you need based on the amount of data.
Data Retention: By default, Kinesis Streams keep data for 24 hours, but you can extend this
up to a year. This means you can replay and reprocess data within this time frame.
Multiple Consumers: Unlike SQS (another AWS service), where once data is read it’s gone,
Kinesis allows multiple applications to read the same data stream simultaneously. This is
useful for real-time data processing by different systems.
Data Is Immutable: Once data is in Kinesis, it can’t be deleted until it expires. This ensures
that the data is always available for processing.
On-Demand Mode: Kinesis automatically adjusts the number of Shards based on the data
flow, so you don’t have to plan for capacity.
Provisioned Mode: You manage the number of Shards yourself, adjusting them as needed.
Producers: These are the sources of data. They could be applications using the AWS SDK,
Kinesis Producer Library (KPL) for advanced features like batching and compression, or
Kinesis Agents that monitor log files.
Consumers: These are the applications that read the data. They could be simple consumers
using the SDK, AWS Lambda functions, or more complex ones using Kinesis Client Library
(KCL) for coordinated reads and checkpoints.
Key Limits:
Producer Limits: A producer can send up to 1 MB of data per second or 1,000 records per second, per
Shard. To handle more throughput, you add more Shards (a minimal producer/consumer sketch follows this list).
o Classic Consumers: Can read up to 2 MB per second per Shard, shared among all
consumers.
o Enhanced Fan-Out Consumers: Each consumer can read 2 MB per second per Shard,
independently, without API call limits, providing better performance.
Real-Time Data Processing: If you need to process and analyze data as soon as it arrives,
Kinesis is a great choice.
Scalability: You can easily scale the amount of data you process by adding more Shards.
Flexibility: Multiple applications can consume the same data stream, allowing different types
of analysis on the same data.
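A minimal boto3 sketch of a producer and a classic consumer, assuming a hypothetical stream called my-stream; real applications would typically use the KPL/KCL mentioned above.

```python
import json
import boto3

kinesis = boto3.client("kinesis")
STREAM = "my-stream"  # hypothetical stream name

# Producer: the partition key decides which shard the record lands in,
# subject to the 1 MB/s or 1,000 records/s per-shard limit.
kinesis.put_record(
    StreamName=STREAM,
    Data=json.dumps({"sensor": "s1", "temp": 23.5}).encode("utf-8"),
    PartitionKey="s1",
)

# Classic consumer: read one shard starting from the oldest retained record.
shard_id = kinesis.describe_stream(StreamName=STREAM)["StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=STREAM, ShardId=shard_id, ShardIteratorType="TRIM_HORIZON"
)["ShardIterator"]
for record in kinesis.get_records(ShardIterator=iterator, Limit=100)["Records"]:
    print(record["SequenceNumber"], record["Data"])
```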
AWS Kinesis Data Firehose is a fully managed service that takes data from various sources and
delivers it to destinations like Amazon S3, Redshift, OpenSearch, or other third-party services.
1. Data Sources:
o Producers: These are applications or services like Kinesis Data Streams, Amazon
CloudWatch Logs, IoT devices, etc., that send data to Kinesis Data Firehose.
2. Data Destinations:
o AWS Destinations:
Amazon Redshift: For large-scale data analysis (data first goes to S3, then is
copied to Redshift).
3. Backup Options:
o You can back up all data or only the data that failed to be delivered to its destination
into an S3 bucket.
Near Real-Time: Firehose delivers data in batches, so there is a slight delay (it is not true real-time).
You set a buffer size (e.g., 32 MB) and a buffer time (e.g., 1 minute) to control when data is
flushed to the destination (see the sketch after this list).
Data Formats & Transformation: Supports various data formats and allows you to transform
data using Lambda functions if needed.
o Buffer Time: The maximum time data will stay in the buffer before being sent, even if
the buffer size isn't reached.
2. Data Delivery:
o Real-Time: If you need real-time data delivery, Firehose is not the best option. For
real-time, use Lambda with Kinesis Data Streams.
Kinesis Data Streams: Use when you need to handle real-time streaming data at scale, with
the ability to manage scaling and data retention.
Kinesis Data Firehose: Use when you need a simple, fully managed way to load streaming
data into storage or analysis services, and near real-time processing is sufficient.
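A rough boto3 sketch of the buffering behaviour described above; the delivery stream name, IAM role ARN, and bucket ARN are placeholders.

```python
import boto3

firehose = boto3.client("firehose")

# Hypothetical delivery stream writing batches to S3. BufferingHints controls
# the batch-size/age trade-off that makes Firehose "near real-time".
firehose.create_delivery_stream(
    DeliveryStreamName="clickstream-to-s3",
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
        "BucketARN": "arn:aws:s3:::my-analytics-bucket",
        "BufferingHints": {"SizeInMBs": 32, "IntervalInSeconds": 60},
    },
)

# Producers can then push records directly (DirectPut).
firehose.put_record(
    DeliveryStreamName="clickstream-to-s3",
    Record={"Data": b'{"page": "/home", "user": "u42"}\n'},
)
```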
AWS Kinesis Data Analytics is a service that allows you to analyze streaming data in real-time. You
can use it to process data coming from Kinesis Data Streams or Kinesis Data Firehose and then send
the processed results to various destinations like S3, Redshift, or dashboards.
o Kinesis Data Streams or Kinesis Data Firehose: These are the sources of streaming
data that Kinesis Data Analytics reads and processes.
o Reference Data (Optional): You can also use static data from Amazon S3 to enrich
your streaming data during processing.
2. Data Processing:
o SQL Queries: You can write SQL queries to analyze the streaming data in real time.
For example, you might count the number of items by their ID or perform other
transformations.
o Error Handling: If something goes wrong (like an unexpected data type), Kinesis Data
Analytics generates an error stream, which can be monitored.
3. Data Output:
o Output Stream: The processed data can be sent to another Kinesis Data Stream,
where other applications or consumers can access it.
o Firehose: Alternatively, the output can be sent to Kinesis Data Firehose, which then
delivers the data to Amazon S3, Redshift, or other destinations.
Streaming ETL: Extract, Transform, Load data in real time, like selecting specific columns or
making simple data transformations.
Continuous Metrics: Generate live metrics, such as a leaderboard for a mobile game.
Responsive Analytics: Filter and analyze streaming data in real time to trigger alerts or
actions based on specific criteria.
Pay for What You Use: You only pay for the resources consumed, though it can be expensive.
Serverless: No need to manage servers; it automatically scales based on the data load.
IAM Permissions: You need to set up IAM roles to control access to streaming sources and
destinations.
SQL and Apache Flink: You can write your data processing logic in SQL or use Apache Flink, a
Java-based framework for more complex processing.
Schema Discovery: Kinesis Data Analytics can automatically detect and understand the data
structure in your streams.
Lambda for Pre-processing: You can use AWS Lambda to pre-process data before it enters
Kinesis Data Analytics, as sketched below.
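A minimal sketch of a Lambda handler wired to a Kinesis stream as an event source, of the kind that could pre-process records before (or consume results after) Kinesis Data Analytics; the item_id field is a hypothetical payload attribute.

```python
import base64
import json

def handler(event, context):
    """Process a batch of Kinesis records delivered to Lambda as an event source."""
    for record in event["Records"]:
        # Kinesis record payloads arrive base64-encoded in the Lambda event.
        payload = base64.b64decode(record["kinesis"]["data"])
        item = json.loads(payload)
        # Hypothetical pre-processing step: drop malformed items, keep the rest.
        if "item_id" not in item:
            continue
        print(f"item {item['item_id']} arrived at "
              f"{record['kinesis']['approximateArrivalTimestamp']}")
    return {"records_seen": len(event["Records"])}
```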
Producers (like sensors or apps) send data into Kinesis Data Streams.
Kinesis Data Streams is like a conveyor belt where data flows in real-time.
Analyzing Data:
Use Amazon Kinesis Data Analytics to look at and make sense of the data as it comes in.
You can send the processed data to Kinesis Data Streams again or use Kinesis Data Firehose.
Kinesis Data Firehose can move the data to places like Amazon S3 (a storage service),
Amazon Redshift (a data warehouse), or Amazon Elasticsearch Service (a search service).
Producers can also send data straight to Kinesis Data Firehose, which will then store it in
Amazon S3.
2. Cost-Effectiveness Comparison:
Suppose you need to handle 3000 messages per second, each 1 KB in size.
Kinesis Data Streams would need 3 "shards" (which are like separate lanes on the conveyor
belt), costing about $32 per month.
For the same data, DynamoDB would cost around $1450 per month.
While DynamoDB provides long-term storage, its streaming capabilities are much more
expensive.
Comparison:
Kinesis Data Streams is much cheaper and better for streaming data compared to
DynamoDB.
3. Overview of Technologies:
Ordering:
SNS:
DynamoDB:
S3:
Features:
o Fully Managed: AWS handles Kafka broker nodes and Zookeeper nodes.
o High Availability: Deploys clusters in your VPC across multiple Availability Zones
(AZs) for redundancy.
Kafka Cluster: Consists of multiple brokers. Producers send data to Kafka topics, and
consumers read from these topics.
Kafka Topics: Topics are partitioned to allow parallel processing and scalable consumption.
Data in topics is replicated across brokers for fault tolerance.
Producers & Consumers: Producers push data to Kafka topics, and consumers pull data from
these topics for processing or forwarding to other destinations (a minimal producer sketch follows this list).
1. Message Limits:
2. Scaling:
3. Encryption:
o Amazon MSK: Options include plain text or TLS for in-flight encryption. Both services
provide at-rest encryption.
4. Data Retention:
o Amazon MSK: Data retention can be configured to exceed one year, depending on
the EBS storage paid for.
Integration with Amazon MSK
o Kinesis Data Analytics for Apache Flink: Use Flink to process data directly from MSK.
o AWS Glue: Perform streaming ETL jobs with Apache Spark Streaming.
o AWS Lambda: Set up Lambda functions to process data from MSK as an event
source.
o Custom Kafka Consumers: Implement custom consumers using EC2, ECS, or EKS.
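A minimal producer sketch using the kafka-python library, assuming a hypothetical MSK bootstrap broker and an orders topic; MSK TLS listeners conventionally use port 9094.

```python
from kafka import KafkaProducer  # pip install kafka-python

# Placeholder MSK bootstrap broker; real clusters expose TLS listeners on port 9094.
producer = KafkaProducer(
    bootstrap_servers=["b-1.my-cluster.abc123.kafka.us-east-1.amazonaws.com:9094"],
    security_protocol="SSL",  # in-flight encryption, which MSK supports
    value_serializer=lambda v: v.encode("utf-8"),
)

# Push a record to a topic; MSK replicates the partition across brokers/AZs.
producer.send("orders", value='{"order_id": 1, "amount": 42.0}')
producer.flush()
```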
AWS Batch allows you to run many jobs (like processing images or data) at once. You can choose how
you want these jobs to run:
Serverless with AWS Fargate: No need to manage servers. AWS handles it for you.
EC2 Instances: Use regular or Spot Instances (which are cheaper but can be interrupted).
Key Points:
EC2 and Spot Instances provide more control but need some management.
2. Trigger Job:
o Option 2: Use Amazon EventBridge to directly start a batch job when an image is
uploaded.
3. Process Job:
o AWS Batch pulls a Docker image (a package with your code) from Amazon ECR
(Elastic Container Registry).
o The job processes the uploaded image and saves the results back to Amazon S3
(a job-submission sketch follows this list).
Lambda:
Batch:
5. Multi-Node Mode
o Note: Does not work with Spot Instances and is better with EC2 instances in a cluster
placement group.
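A minimal boto3 sketch of submitting a Batch job like the image-processing example above; the job queue, job definition, and S3 key are placeholder assumptions.

```python
import boto3

batch = boto3.client("batch")

# Submit a job against a pre-created job queue and job definition.
response = batch.submit_job(
    jobName="process-image-001",
    jobQueue="image-processing-queue",
    jobDefinition="image-processor:1",  # points at the Docker image stored in ECR
    containerOverrides={
        "environment": [
            {"name": "INPUT_S3_KEY", "value": "uploads/photo-001.jpg"},
        ]
    },
)
print(response["jobId"])
```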
AWS EMR helps you handle large-scale data processing tasks by running Hadoop clusters in the
cloud. Here's a simple breakdown:
AWS EMR is a cloud service that lets you run Hadoop clusters to process big data. It’s useful if you’re
moving from an on-premise Hadoop setup to the cloud because:
Key Components:
Apache Spark
HBase
Presto
Flink
Hive
These tools help with tasks like data processing, machine learning, web indexing, and more.
o Task Nodes: Run tasks (optional and can use Spot Instances).
2. Storage:
o Temporary Storage: EC2 instances use EBS volumes with Hadoop Distributed File
System (HDFS) for temporary storage.
o Long-term Storage: Use EMRFS to store data in Amazon S3 for durability and multi-
AZ storage.
3. Optimizing Cost
Reserved Instances: Lower cost for long-term use (e.g., master and core nodes).
Spot Instances: Cheapest but less reliable (good for task nodes).
Cluster Types:
Transient Clusters: Use for temporary tasks and shut down when done.
4. Instance Configuration
Uniform Instance Groups: Choose a single instance type and purchasing option for each
node type (master, core, task). Supports auto-scaling.
Instance Fleets: Allows mixing of instance types and purchasing options (e.g., some on-
demand, some spot). Provides flexibility but currently doesn’t support auto-scaling.
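A rough boto3 sketch of launching a transient cluster with uniform instance groups (on-demand master and core nodes, Spot task nodes); the release label, instance types, and IAM roles are assumptions.

```python
import boto3

emr = boto3.client("emr")

# Transient Spark cluster: on-demand master/core nodes, Spot task nodes.
response = emr.run_job_flow(
    Name="nightly-spark-cluster",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"Name": "master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
            {"Name": "task", "InstanceRole": "TASK", "Market": "SPOT",
             "InstanceType": "m5.xlarge", "InstanceCount": 4},
        ],
        # Transient cluster: shut down once the submitted steps finish.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```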
Summary
Pros:
Cons:
Pros:
Cons:
Use Case: Ideal for regular tasks like scheduled reports or notifications.
Description: Trigger Lambda functions based on events from various AWS services.
Event Sources:
Pros:
Use Case: Good for workflows that depend on specific events occurring in your
infrastructure.
4. AWS Batch
Pros:
o Scalable: Manages compute resources efficiently.
Use Case: Suitable for long-running, batch processing tasks that need more control over
compute resources.
5. AWS Fargate
Pros:
Use Case: Good for tasks requiring containers but where you don't need the extensive
features of Batch.
6. AWS EMR
Pros:
o Powerful: Handles large-scale data processing with tools like Hadoop, Spark, and
Hive.
Use Case: Ideal for big data workloads, step executions, and complex data processing tasks.
Summary
AWS Glue is a managed ETL (Extract, Transform, Load) service that helps you prepare and transform
data for analytics. It is fully serverless, meaning you don’t need to manage any servers or
infrastructure.
Key Components
o Description: These are tasks that extract data from various sources, transform it (i.e.,
clean, modify, or aggregate it), and then load it into a target data store.
o Example Workflow:
Load: Move the processed data into a data warehouse like Amazon Redshift.
o Description: A centralized repository that stores metadata about your data. This
metadata includes information about tables, columns, data types, etc.
o How It Works:
Crawlers: AWS Glue has crawlers that scan your data sources (like Amazon
S3, Amazon RDS, DynamoDB, or JDBC-compatible databases).
Cataloging: The crawlers detect and record metadata about your data
sources and store this information in the Glue Data Catalog (a crawler-setup sketch appears below).
o Benefits:
Integration: Works seamlessly with other AWS services for data analytics.
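A minimal boto3 sketch of the crawler workflow described above; the crawler name, IAM role, catalog database, and S3 path are placeholders.

```python
import boto3

glue = boto3.client("glue")

# Point a crawler at an S3 prefix so it can populate the Glue Data Catalog.
glue.create_crawler(
    Name="sales-data-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="analytics_catalog",
    Targets={"S3Targets": [{"Path": "s3://my-analytics-bucket/sales/"}]},
)

# Run it on demand; the discovered tables and columns land in the Data Catalog,
# where Athena, Redshift Spectrum, or EMR can then query them.
glue.start_crawler(Name="sales-data-crawler")
```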
Use Cases
108. Redshift
Amazon Redshift is a data warehousing service designed for OLAP (Online Analytical Processing),
which means it's great for performing complex queries and analysis on large volumes of data. It's
different from OLTP (Online Transaction Processing) which is more suited for real-time transactional
databases.
Key Features
1. Columnar Storage
o Description: Redshift stores data by columns instead of rows. This is efficient for
analytical queries that often aggregate data from many rows.
o Benefit: Faster performance for operations like summing or averaging columns.
3. Scalability
o Details: Clusters can have hundreds of nodes, and each node can hold up to 16
terabytes of data.
4. Data Loading
o Sources: Data can be loaded into Redshift from Amazon S3 (via the COPY command),
Kinesis Data Firehose, DynamoDB, or AWS Database Migration Service (DMS); see the COPY sketch after this list.
5. Node Types
o Leader Node: Manages query planning and aggregates results from compute nodes.
o Compute Nodes: Perform queries and send results to the leader node.
7. Redshift Spectrum
o How It Works: Redshift spins up Spectrum nodes to process data in S3 and then
aggregates results in the Redshift cluster.
o Types:
9. Concurrency Scaling
Deployment: Redshift clusters are typically deployed within a VPC (Virtual Private Cloud) and
use IAM for security, KMS for encryption, and CloudWatch for monitoring.
Tools: You can use AWS QuickSight, Tableau, and other BI tools for dashboarding and
reporting with Redshift.
High, Sustained Query Volume: Redshift is ideal if you have a consistent need for complex,
large-scale queries.
Sporadic Usage: If your usage is occasional, consider AWS Athena for ad-hoc querying of
data in S3, which can be more cost-effective.
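A rough sketch of loading data from S3 with COPY through the Redshift Data API (boto3); the cluster, database, table, bucket, and IAM role are placeholder assumptions.

```python
import boto3

redshift_data = boto3.client("redshift-data")

# Load Parquet files from S3 into a Redshift table with COPY, run via the Data API.
copy_sql = """
    COPY sales
    FROM 's3://my-analytics-bucket/sales/2024/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT AS PARQUET;
"""
response = redshift_data.execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql=copy_sql,
)
print(response["Id"])  # statement id, which can be polled with describe_statement
```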
Amazon DocumentDB is a fully managed NoSQL document database service that is designed to be
compatible with MongoDB. It provides a cloud-native solution for handling JSON data, offering
similar benefits to AWS's Aurora for relational databases.
Key Features
1. MongoDB Compatibility
2. Fully Managed
3. High Availability
o Description: DocumentDB replicates data across three Availability Zones (AZs) for
fault tolerance.
4. Automatic Scaling
1. Database Storage
o Cost: You pay for the database storage used, billed per gigabyte per month.
2. Instances
o On-Demand Instances: Primary and replica instances handle read and write
operations.
3. IO Operations
4. Backups
Deployment
No On-Demand Tier: DocumentDB does not have an on-demand pricing tier. Costs are based
on instance usage, I/O operations, storage, and backup.
For MongoDB Users: If you're already using MongoDB and need a managed, cloud-native
solution, DocumentDB provides a compatible environment with the added benefits of AWS's
infrastructure (a connection sketch follows this list).
NoSQL Applications: Ideal for applications that require high performance and scalable
document storage with JSON data.
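Because DocumentDB is MongoDB-compatible, a standard MongoDB driver can connect to it; this pymongo sketch uses a placeholder cluster endpoint and credentials.

```python
from pymongo import MongoClient  # DocumentDB speaks the MongoDB wire protocol

# Placeholder cluster endpoint and credentials; DocumentDB requires TLS with the
# Amazon CA bundle, and retryable writes must be disabled.
client = MongoClient(
    "mongodb://appuser:password@my-docdb-cluster.cluster-abc123.us-east-1.docdb.amazonaws.com:27017/"
    "?tls=true&tlsCAFile=global-bundle.pem&replicaSet=rs0"
    "&readPreference=secondaryPreferred&retryWrites=false"
)

db = client["catalog"]
db.products.insert_one({"sku": "A-100", "name": "widget", "price": 9.99})
print(db.products.find_one({"sku": "A-100"}))
```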
Amazon Timestream is a fully managed, serverless time series database designed for handling time-
stamped data efficiently. It provides a scalable and cost-effective solution for managing and analyzing
large volumes of time series data.
Key Features
o Description: Time series data consists of time-stamped points that track changes
over time, such as measurements or events.
o Description: Optimized for storing and analyzing trillions of events per day.
o Benefit: More efficient and cost-effective for time series data compared to
traditional relational databases.
4. Data Management
5. SQL Compatibility
6. Analytics Functions
o Description: Includes built-in time series analytics functions for real-time pattern
detection and analysis.
7. Security
Use Cases
IoT Applications: Track and analyze sensor data from connected devices.
Integration
1. Data Ingestion
o Sources: AWS IoT, Kinesis Data Streams, Prometheus, Telegraf, Kinesis Data Analytics
(Apache Flink), Amazon MSK.
o Description: Supports integration with various data sources for seamless data
ingestion.
o Tools: Amazon QuickSight for dashboards, Amazon SageMaker for machine learning,
Grafana for visualization.
Architecture
Data Flow: Data can be ingested from various sources, stored in Timestream, and queried
using SQL or integrated tools (see the write/query sketch at the end of this section).
Analytics: Timestream’s time series analytics functions provide real-time insights into data
patterns.
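A minimal boto3 sketch of writing and then querying a time-stamped measurement; the database and table names are placeholders.

```python
import time
import boto3

ts_write = boto3.client("timestream-write")

# Write one time-stamped IoT measurement.
ts_write.write_records(
    DatabaseName="iot",
    TableName="sensor_readings",
    Records=[
        {
            "Dimensions": [{"Name": "device_id", "Value": "sensor-1"}],
            "MeasureName": "temperature",
            "MeasureValue": "23.5",
            "MeasureValueType": "DOUBLE",
            "Time": str(int(time.time() * 1000)),  # epoch milliseconds
            "TimeUnit": "MILLISECONDS",
        }
    ],
)

# Query the latest readings back with standard SQL.
ts_query = boto3.client("timestream-query")
result = ts_query.query(
    QueryString='SELECT * FROM "iot"."sensor_readings" ORDER BY time DESC LIMIT 10'
)
print(result["Rows"])
```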
Amazon Athena is a serverless query service that allows you to analyze data stored in Amazon S3
using standard SQL. It is built on the Presto engine and does not require you to provision or manage
any infrastructure.
Key Features
o Description: Athena is serverless, meaning you don’t need to manage servers. You
can run SQL queries directly on data stored in S3.
2. Data Formats
o Supported Formats: CSV, JSON, ORC, Avro, Parquet, and potentially others.
o Benefit: Athena can handle various data formats for flexible querying.
3. Pricing
o Benefit: You only pay for the data you query, with no upfront costs.
o Additional Tools: Can also integrate with machine learning tools like Amazon
SageMaker and visualization tools like Grafana.
Use Cases
Log Analysis: Analyze logs from AWS services (e.g., VPC flow logs, CloudTrail logs).
Performance Optimization
o Benefit: Scanning only the necessary columns reduces data scanned and improves
performance.
2. Data Compression
o Benefit: Smaller data sizes lead to faster query times and lower costs.
3. Partitioning
o Description: Organize data into partitions based on certain criteria (e.g., year, month,
day).
o Benefit: Queries can target specific partitions, reducing the amount of data scanned.
Example: For flight data, partitions could be organized by year, month, and day (e.g.,
/year=1991/month=01/day=01).
4. File Size
Federated Queries
Description: Athena can query data from various sources beyond S3, including relational and
non-relational databases.
Mechanism: Uses Data Source Connectors (Lambda functions) to execute federated queries.
Workflow: Athena sends queries to Lambda functions, which run them against the other data
sources. Results are returned to Athena and can be stored in S3 for further analysis (a basic query sketch follows).
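A minimal boto3 sketch of running a SQL query against a partitioned table, in the spirit of the flight-data example above; the database, table, and output location are placeholders.

```python
import boto3

athena = boto3.client("athena")

# Filter on partition columns so Athena scans only the matching S3 prefixes.
query = """
    SELECT origin, dest, COUNT(*) AS flights
    FROM flights
    WHERE year = '1991' AND month = '01' AND day = '01'
    GROUP BY origin, dest;
"""
start = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "flights_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
status = athena.get_query_execution(QueryExecutionId=start["QueryExecutionId"])
print(status["QueryExecution"]["Status"]["State"])  # e.g. QUEUED / RUNNING / SUCCEEDED
```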
112. Amazon QuickSight
Amazon QuickSight is a serverless business intelligence (BI) service designed to help you create
interactive dashboards and perform data analysis. It is fast, scalable, and offers per-session pricing.
Key Features
1. Serverless BI Service
2. SPICE Engine
o Usage: Works with imported data (e.g., from CSV files, Excel). Does not work directly
with live connections to databases.
3. User-Level Security
o Data Imports: Excel, CSV, JSON, TSV, and ELF/CLF log formats.
o Analysis: Create interactive and detailed visualizations. Analysis allows for deeper
exploration and manipulation of data.
6. User Management
o Enterprise Edition: Groups of users for better access control and management.
Use Cases
Business Analytics: Create interactive reports and visualizations to gain business insights.
Ad-Hoc Analysis: Quickly analyze and visualize data as needed.
Performance Optimization
o Benefit: Import data into QuickSight to leverage SPICE for in-memory processing and
faster query performance.
o Recommended Formats: Use efficient data formats like Parquet and ORC when
importing data for better performance.
o Large Datasets: Import data efficiently and manage it to ensure quick access and
analysis.
Analytics Layer
1. Amazon S3
2. Amazon EMR
o Use Case: Ideal for migrating existing Big Data workloads to AWS.
3. Amazon Redshift
o Options:
4. Amazon Athena
o Use Case: Best for ad-hoc queries and sporadic data analysis.
5. Amazon QuickSight
1. IoT Devices
o Description: Real-time data ingestion service that delivers data to destinations like
S3.
3. Data Transformation
4. S3 Events
5. Data Processing
o Amazon Athena: Run queries on the data in S3 and update reporting buckets.
6. Data Reporting
o Amazon Redshift: Alternatively, store and query data for more complex analysis.
1. Amazon EMR
o Cluster Options:
o Cost Management:
o Data Access: Integrates with DynamoDB and S3 (via EMRFS), and uses EBS for scratch
storage.
2. Amazon Athena
o Integration: Works with AWS services, and queries are auditable via CloudTrail.
3. Amazon Redshift
o Description: Advanced SQL queries and full-scale data warehousing.
o Options: