4.1. Notes

77. AWS DataSync

What is AWS DataSync?

AWS DataSync is a tool that helps you move or copy large amounts of data between different places,
such as your own servers or other cloud services, and Amazon Web Services (AWS).

Where Can You Move Data?

 From Your Own Servers to AWS: If you have data stored on your company's servers, you can
move it to AWS, like into Amazon S3 (for storing files), Amazon EFS (a file system), or Amazon
FSx (a specialized storage).

 Between Different AWS Services: You can also use DataSync to copy data between different
AWS services, like moving data from one S3 storage to another.

 From AWS Back to Your Servers: You can even move data back from AWS to your own
servers.

How Does DataSync Work?

 Connection: To connect your own servers with AWS, you’ll need to install a small program
(called an "agent") on your server. This agent helps your server talk to AWS and transfer the
data.

 No Agent for AWS to AWS: If you're moving data only within AWS services, you don’t need
this agent.

When Does the Data Move?

 Scheduled Transfers: Data doesn’t move continuously; you set up a schedule, like every hour,
day, or week, for DataSync to do its job. So there’s a bit of a delay, but it keeps everything in
sync as per your schedule.

Keeping Data Details Intact:

 Metadata and Permissions: When DataSync moves your files, it keeps all the important
details (like who can access the files and other security settings) intact.

Handling Large Data:

 Fast Transfers: DataSync can move data very quickly (up to 10 gigabits per second), but if you
don’t want to use too much of your network’s capacity, you can slow it down.

Exam Tip - Snowcone:

 Limited Network? If you can't transfer data because your network isn’t strong enough, AWS
offers a small device called Snowcone. You can load your data onto Snowcone, and then send
the device to AWS, where it will transfer the data for you.
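
To make this concrete, here is a minimal boto3 (Python) sketch of scheduling and starting a DataSync task. It assumes a task has already been created between an on-premises location and S3; the task ARN and region are placeholders, not values from these notes.

import boto3

datasync = boto3.client("datasync", region_name="us-east-1")

# Placeholder ARN for an existing task (e.g. on-premises NFS -> Amazon S3).
task_arn = "arn:aws:datasync:us-east-1:123456789012:task/task-0123456789abcdef0"

# Run the task on a schedule instead of continuously; DataSync accepts
# schedule expressions such as rate(1 hour) or cron(...).
datasync.update_task(
    TaskArn=task_arn,
    Schedule={"ScheduleExpression": "rate(1 hour)"},
)

# Or kick off a one-off transfer immediately.
execution = datasync.start_task_execution(TaskArn=task_arn)
print(execution["TaskExecutionArn"])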
78. AWS DataSync - Solution Architecture

Private Access to AWS DataSync with Direct Connect

When you want to use AWS DataSync and ensure that your connection is private (not over the public
internet), you can use AWS Direct Connect. Direct Connect is a service that creates a direct, private
network link between your own data center and AWS.

Steps to Set It Up:

1. Use a VPC (Virtual Private Cloud):

o To keep the connection private, you need to go through your VPC, which is your own
private network within AWS.

2. Direct Connect Connection:

o Your DataSync agent (the software that moves your data) will connect to AWS
through Direct Connect.

3. Public VIF (Virtual Interface) Option:

o If you use a Public VIF, the connection bypasses the VPC and reaches DataSync's public endpoint. This might not be what you want because the traffic is not kept fully private.

4. Private Connection Option:

o Create a VPC Interface Endpoint: This is a special connection point in your VPC that
allows you to privately connect to AWS services like DataSync.

o Set Up a Private VIF: This will create a private link between your Direct Connect
connection and the VPC Interface Endpoint.

o Now, the DataSync agent can securely send data through Direct Connect, into your
VPC, and then to the DataSync service, all without ever touching the public internet.

Why This Matters:

 Using this method ensures that your data stays private and secure as it moves between your
data center and AWS.
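
A hedged boto3 sketch of the private option above: creating the VPC interface endpoint for DataSync. The VPC, subnet, and security group IDs are placeholders.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Create an interface endpoint for the DataSync service inside the VPC, so the
# agent's traffic stays on the Direct Connect private VIF + VPC path.
response = ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="vpc-0123456789abcdef0",                     # placeholder VPC
    ServiceName="com.amazonaws.us-east-1.datasync",    # DataSync endpoint service for the region
    SubnetIds=["subnet-0123456789abcdef0"],            # placeholder subnet
    SecurityGroupIds=["sg-0123456789abcdef0"],         # must allow the agent's traffic
    PrivateDnsEnabled=False,
)
print(response["VpcEndpoint"]["VpcEndpointId"])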

99. Kinesis Data Streams

What is AWS Kinesis?

AWS Kinesis is a service that helps you handle a large amount of data that needs to be processed in
real-time. It’s like a fast-moving river (or stream) of data where you can continuously add and analyze
information.

What Can You Use Kinesis For?


 Real-Time Data Processing: If your application generates a lot of data that needs to be
analyzed immediately—like logs, IoT sensor data, or user clickstreams—Kinesis is perfect for
this.

 Big Data Applications: Kinesis is often used in big data scenarios where large amounts of
information are processed and analyzed quickly.

Key Features of Kinesis:

1. Highly Available: Kinesis automatically copies your data across three different locations
(Availability Zones), making it reliable and fault-tolerant.

2. Three Main Services:

o Kinesis Streams: For real-time data ingestion at large scale, where data is divided
into parts called "Shards."

o Kinesis Analytics: Allows you to perform real-time data analysis using SQL on data
coming through Kinesis Streams.

o Kinesis Firehose: Automatically loads the data from streams into other AWS services
like S3, Redshift, Elasticsearch, or Splunk.

Understanding Kinesis Streams:

 Shards: Think of Shards as individual lanes in a highway. Data flows through these lanes, and
you can decide how many lanes (Shards) you need based on the amount of data.

o Producers send data to Shards.

o Consumers read data from Shards.

 Data Retention: By default, Kinesis Streams keep data for 24 hours, but you can extend this
up to a year. This means you can replay and reprocess data within this time frame.

 Multiple Consumers: Unlike SQS (another AWS service), where once data is read it’s gone,
Kinesis allows multiple applications to read the same data stream simultaneously. This is
useful for real-time data processing by different systems.

 Data Is Immutable: Once data is in Kinesis, it can’t be deleted until it expires. This ensures
that the data is always available for processing.

How to Manage Shards:

 On-Demand Mode: Kinesis automatically adjusts the number of Shards based on the data
flow, so you don’t have to plan for capacity.

 Provisioned Mode: You manage the number of Shards yourself, adjusting them as needed.

Producers and Consumers in Kinesis:

 Producers: These are the sources of data. They could be applications using the AWS SDK,
Kinesis Producer Library (KPL) for advanced features like batching and compression, or
Kinesis Agents that monitor log files.
 Consumers: These are the applications that read the data. They could be simple consumers
using the SDK, AWS Lambda functions, or more complex ones using Kinesis Client Library
(KCL) for coordinated reads and checkpoints.
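
As a rough illustration of the producer and consumer roles above, here is a minimal boto3 sketch. It assumes a stream named "my-stream" already exists and shows the classic (shared-throughput) consumer path.

import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")
stream = "my-stream"  # assumed, pre-created stream

# Producer: the partition key decides which shard the record lands in.
kinesis.put_record(
    StreamName=stream,
    Data=json.dumps({"sensor_id": "s1", "temperature": 21.5}).encode(),
    PartitionKey="s1",
)

# Classic consumer: iterate one shard with GetRecords.
shard_id = kinesis.describe_stream(StreamName=stream)["StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=stream, ShardId=shard_id, ShardIteratorType="TRIM_HORIZON"
)["ShardIterator"]
records = kinesis.get_records(ShardIterator=iterator, Limit=100)["Records"]
for record in records:
    print(record["Data"])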

Key Limits:

 Producer Limits: Can send up to 1 MB of data per second or 1,000 messages per second, per
Shard. To handle more data, you add more Shards.

 Consumer Limits: There are two types of consumer modes:

o Classic Consumers: Can read up to 2 MB per second per Shard, shared among all
consumers.

o Enhanced Fan-Out Consumers: Each consumer can read 2 MB per second per Shard,
independently, without API call limits, providing better performance.

Why Use Kinesis?

 Real-Time Data Processing: If you need to process and analyze data as soon as it arrives,
Kinesis is a great choice.

 Scalability: You can easily scale the amount of data you process by adding more Shards.

 Flexibility: Multiple applications can consume the same data stream, allowing different types
of analysis on the same data.

100. Kinesis Data Firehose

What is Kinesis Data Firehose?

AWS Kinesis Data Firehose is a fully managed service that takes data from various sources and
delivers it to destinations like Amazon S3, Redshift, OpenSearch, or other third-party services.

How Does Kinesis Data Firehose Work?

1. Data Sources:

o Producers: These are applications or services like Kinesis Data Streams, Amazon
CloudWatch Logs, IoT devices, etc., that send data to Kinesis Data Firehose.

o Data Transformation (Optional): You can use a Lambda function to transform or process the data before sending it to its destination.

2. Data Destinations:

o AWS Destinations:

 Amazon S3: Store data files.

 Amazon Redshift: For large-scale data analysis (data first goes to S3, then is
copied to Redshift).

 Amazon OpenSearch: For search and analytics.

o Third-Party Destinations: Services like Datadog, Splunk, New Relic, and MongoDB.


o Custom HTTP Endpoint: If you have your own API, Firehose can send data directly to
it.

3. Backup Options:

o You can back up all data or only the data that failed to be delivered to its destination
into an S3 bucket.

Key Features of Kinesis Data Firehose:

 Fully Managed: No need to manage servers or infrastructure.

 Automatic Scaling: It adjusts automatically based on the data load.

 Near Real-Time: It delivers data in batches, making it slightly delayed (not true real-time).
You can set the buffer size (e.g., 32 MB) and buffer time (e.g., 1 minute) to control when data
is sent.

 Data Formats & Transformation: Supports various data formats and allows you to transform
data using Lambda functions if needed.

How Kinesis Data Firehose Delivers Data:

1. Buffering: Data is collected in a buffer before being sent to the destination.

o Buffer Size: The amount of data collected before it is sent.

o Buffer Time: The maximum time data will stay in the buffer before being sent, even if
the buffer size isn't reached.

2. Data Delivery:

o Real-Time: If you need real-time data delivery, Firehose is not the best option. For
real-time, use Lambda with Kinesis Data Streams.

o Near Real-Time: Firehose is near real-time because of the buffering mechanism.
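
A minimal boto3 sketch of a producer putting one record into an existing delivery stream (the stream name is a placeholder). Because of buffering, the record only appears at the destination once the buffer size or buffer interval is reached.

import json
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

# Placeholder delivery stream that is configured to deliver into S3.
firehose.put_record(
    DeliveryStreamName="my-delivery-stream",
    Record={"Data": (json.dumps({"event": "click", "user": "u42"}) + "\n").encode()},
)
# The record is not visible in S3 immediately; Firehose flushes it once the
# buffer size (e.g. a few MB) or buffer interval (e.g. 60 seconds) is hit.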

When to Use Kinesis Data Firehose vs. Kinesis Data Streams:

 Kinesis Data Streams: Use when you need to handle real-time streaming data at scale, with
the ability to manage scaling and data retention.

 Kinesis Data Firehose: Use when you need a simple, fully managed way to load streaming
data into storage or analysis services, and near real-time processing is sufficient.

101. Kinesis Data Analytics

What is Kinesis Data Analytics?

AWS Kinesis Data Analytics is a service that allows you to analyze streaming data in real-time. You
can use it to process data coming from Kinesis Data Streams or Kinesis Data Firehose and then send
the processed results to various destinations like S3, Redshift, or dashboards.

How Does Kinesis Data Analytics Work?


1. Data Input:

o Kinesis Data Streams or Kinesis Data Firehose: These are the sources of streaming
data that Kinesis Data Analytics reads and processes.

o Reference Data (Optional): You can also use static data from Amazon S3 to enrich
your streaming data during processing.

2. Data Processing:

o SQL Queries: You can write SQL queries to analyze the streaming data in real time.
For example, you might count the number of items by their ID or perform other
transformations.

o Error Handling: If something goes wrong (like an unexpected data type), Kinesis Data
Analytics generates an error stream, which can be monitored.

3. Data Output:

o Output Stream: The processed data can be sent to another Kinesis Data Stream,
where other applications or consumers can access it.

o Firehose: Alternatively, the output can be sent to Kinesis Data Firehose, which then
delivers the data to Amazon S3, Redshift, or other destinations.

Use Cases for Kinesis Data Analytics:

 Streaming ETL: Extract, Transform, Load data in real time, like selecting specific columns or
making simple data transformations.

 Continuous Metrics: Generate live metrics, such as a leaderboard for a mobile game.

 Responsive Analytics: Filter and analyze streaming data in real time to trigger alerts or
actions based on specific criteria.

Key Features of Kinesis Data Analytics:

 Pay for What You Use: You only pay for the resources consumed, though it can be expensive.

 Serverless: No need to manage servers; it automatically scales based on the data load.

 IAM Permissions: You need to set up IAM roles to control access to streaming sources and
destinations.

 SQL and Apache Flink: You can write your data processing logic in SQL or use Apache Flink, a
Java-based framework for more complex processing.

 Schema Discovery: Kinesis Data Analytics can automatically detect and understand the data
structure in your streams.

 Lambda for Pre-processing: You can use AWS Lambda to pre-process data before it enters
Kinesis Data Analytics.

102. Streaming Architectures


1. Real-Time Data Pipeline Example:

Producers and Kinesis Data Streams:

 Producers (like sensors or apps) send data into Kinesis Data Streams.

 Kinesis Data Streams is like a conveyor belt where data flows in real-time.

Analyzing Data:

 Use Amazon Kinesis Data Analytics to look at and make sense of the data as it comes in.

 Lambda Functions can change or process the data.

Processing and Storage:

 You can send the processed data to Kinesis Data Streams again or use Kinesis Data Firehose.

 Kinesis Data Firehose can move the data to places like Amazon S3 (a storage service),
Amazon Redshift (a data warehouse), or Amazon Elasticsearch Service (a search service).

Direct Data Production:

 Producers can also send data straight to Kinesis Data Firehose, which will then store it in
Amazon S3.

2. Cost-Effectiveness Comparison:

Using Kinesis Data Streams and Lambda:

 Suppose you need to handle 3000 messages per second, each 1 KB in size.

 Kinesis Data Streams would need 3 "shards" (which are like separate lanes on the conveyor
belt), costing about $32 per month.

 Lambda functions will handle the processing without extra costs.
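
A quick back-of-the-envelope check of those figures (the per-shard-hour price below is illustrative and varies by region):

import math

messages_per_second = 3000
message_size_kb = 1

throughput_mb_s = messages_per_second * message_size_kb / 1024        # ~2.93 MB/s
shards = max(math.ceil(throughput_mb_s),                              # 1 MB/s per shard
             math.ceil(messages_per_second / 1000))                   # 1,000 records/s per shard
shard_hour_price = 0.015                                              # assumed USD per shard-hour
monthly_cost = shards * shard_hour_price * 24 * 30                    # ~USD 32 per month

print(shards, round(monthly_cost, 2))   # 3 shards, ~32.4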

Using DynamoDB with DynamoDB Streams:

 For the same data, DynamoDB would cost around $1450 per month.

 While DynamoDB provides long-term storage, its streaming capabilities are much more
expensive.

Comparison:

 Kinesis Data Streams is much cheaper and better for streaming data compared to
DynamoDB.

3. Overview of Technologies:

Kinesis Data Streams:

 Data: Once added, you can't change it.

 Retention: Keeps data for up to 1 year.

 Ordering: Preserves the order in which records arrive, per shard.

 Readers: Can be read by EC2 (servers), Lambda functions, or other services.


 Latency: Takes about 200 milliseconds to process.

Kinesis Data Firehose:

 Data: Almost real-time, updates every minute or so.

 Retention: Depends on how it's set up with S3.

SQS (Standard and FIFO):

 Data: Once added, it can't be changed.

 Retention: Keeps data for 1 to 14 days.

 Ordering:

o Standard Queue: No specific order.

o FIFO Queue: Keeps data in the order it arrives.

 Scalability: Handles a lot of messages easily.

 Latency: Takes 10 to 100 milliseconds.

SNS:

 Data: Once added, it can't be changed.

 Retention: No retention, data disappears once delivered.

 Ordering: No specific order.

 Scalability: Handles a lot of messages easily.

 Latency: Takes 10 to 100 milliseconds.

DynamoDB:

 Data: Can be changed or updated.

 Retention: Data can be kept indefinitely or with a time limit.

 Ordering: No specific order.

 Scalability: Can adjust to handle more or less data.

 Latency: Takes 10 to 100 milliseconds.

S3:

 Data: Objects can be overwritten; previous versions can be kept with versioning.

 Retention: Indefinite; lifecycle policies can transition or expire objects over time.

 Ordering: No specific order.

 Scalability: Handles a large number of read and write requests.

 Latency: Takes 10 to 100 milliseconds.


103. Amazon MSK

Amazon MSK and Apache Kafka Overview

Amazon MSK (Managed Streaming for Apache Kafka):

 Purpose: Provides a fully-managed Kafka service on AWS. It simplifies the setup, management, and scaling of Kafka clusters.

 Features:

o Fully Managed: AWS handles Kafka broker nodes and Zookeeper nodes.

o High Availability: Deploys clusters in your VPC across multiple Availability Zones
(AZs) for redundancy.

o Automatic Recovery: Recovers from common Kafka failures automatically.

o Storage: Data is stored on EBS volumes, with retention configurable as needed.

Apache Kafka Basics:

 Kafka Cluster: Consists of multiple brokers. Producers send data to Kafka topics, and
consumers read from these topics.

 Kafka Topics: Topics are partitioned to allow parallel processing and scalable consumption.
Data in topics is replicated across brokers for fault tolerance.

 Producers & Consumers: Producers push data to Kafka topics, and consumers pull data from
these topics for processing or sending to other destinations.

Key Differences Between Kinesis Data Streams and Amazon MSK

1. Message Limits:

o Kinesis Data Streams: Messages are limited to 1 MB.

o Amazon MSK: Default limit is 1 MB but can be configured up to 10 MB.

2. Scaling:

o Kinesis Data Streams: Scale by adding or removing shards (shard splitting/merging).

o Amazon MSK: Scale by adding partitions to Kafka topics (partitions cannot be removed).

3. Encryption:

o Kinesis Data Streams: In-flight encryption is enabled by default.

o Amazon MSK: Options include plain text or TLS for in-flight encryption. Both services
provide at-rest encryption.

4. Data Retention:

o Kinesis Data Streams: Data retention ranges from 24 hours to 1 year.

o Amazon MSK: Data retention can be configured to exceed one year, depending on
the EBS storage paid for.
Integration with Amazon MSK

1. Data Processing Options:

o Kinesis Data Analytics for Apache Flink: Use Flink to process data directly from MSK.

o AWS Glue: Perform streaming ETL jobs with Apache Spark Streaming.

o AWS Lambda: Set up Lambda functions to process data from MSK as an event
source.

o Custom Kafka Consumers: Implement custom consumers using EC2, ECS, or EKS.
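
As a sketch of producing to an MSK topic, assuming a cluster with TLS enabled already exists: boto3 looks up the bootstrap brokers, and the third-party kafka-python library sends a message. The cluster ARN and topic name are placeholders.

import boto3
from kafka import KafkaProducer  # third-party kafka-python package

msk = boto3.client("kafka", region_name="us-east-1")

# Placeholder cluster ARN; returns broker endpoints for TLS connections
# (available when TLS is enabled on the cluster).
brokers = msk.get_bootstrap_brokers(
    ClusterArn="arn:aws:kafka:us-east-1:123456789012:cluster/demo/abcd1234"
)["BootstrapBrokerStringTls"]

# Produce to a Kafka topic over TLS (in-flight encryption).
producer = KafkaProducer(bootstrap_servers=brokers.split(","), security_protocol="SSL")
producer.send("orders", b'{"order_id": 1}')
producer.flush()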

104. AWS Batch

1. What is AWS Batch?

AWS Batch allows you to run many jobs (like processing images or data) at once. You can choose how
you want these jobs to run:

 Serverless with AWS Fargate: No need to manage servers. AWS handles it for you.

 EC2 Instances: Use regular or Spot Instances (which are cheaper but can be interrupted).

Key Points:

 Fargate is completely serverless.

 EC2 and Spot Instances provide more control but need some management.

2. How to Set Up a Batch Job

Example: Creating Thumbnails from Images:

1. Upload Images: Upload images to Amazon S3.

2. Trigger Job:

o Option 1: Use Amazon S3 event notifications to trigger an AWS Lambda function, which then starts a batch job (a sketch of this Lambda handler follows step 3 below).

o Option 2: Use Amazon EventBridge to directly start a batch job when an image is
uploaded.

3. Process Job:

o AWS Batch pulls a Docker image (a package with your code) from Amazon ECR
(Elastic Container Registry).

o The job processes the image and saves the results back to Amazon S3.

o Optionally, you can add some details into Amazon DynamoDB.
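
A minimal sketch of the Lambda handler from Option 1 above, submitting a Batch job when S3 reports a new upload; the queue and job definition names are placeholders.

import boto3

batch = boto3.client("batch")

def handler(event, context):
    # S3 event notification -> extract the uploaded object and submit a job.
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]

    batch.submit_job(
        jobName="make-thumbnail",
        jobQueue="image-processing-queue",        # placeholder queue
        jobDefinition="thumbnail-job:1",          # placeholder job definition (Docker image in ECR)
        containerOverrides={
            "environment": [
                {"name": "SOURCE_BUCKET", "value": bucket},
                {"name": "SOURCE_KEY", "value": key},
            ]
        },
    )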

3. Lambda vs. Batch

 Lambda:

o Limited runtime (maximum of 15 minutes).


o Limited disk space.

o Good for short, quick tasks.

 Batch:

o No time limit; runs as long as needed.

o Any runtime as long as it’s in a Docker image.

o Uses EBS or EC2 instances for disk space.

o More flexibility and suitable for long-running tasks.

4. Compute Environments in AWS Batch

 Managed Compute Environment:

o AWS manages the resources for you.

o You choose between On-Demand or Spot Instances.

o AWS handles scaling and capacity.

o Ensure that your VPC setup allows access to ECS services.

 Unmanaged Compute Environment:

o You control and manage the resources yourself.

o More responsibility and potentially more cost.

5. Multi-Node Mode

 Multi-Node Mode is for high-performance tasks that need multiple instances.

o One main node controls several other nodes.

o Suitable for tightly coupled workloads.

o Note: Does not work with Spot Instances and is better with EC2 instances in a cluster
placement group.

105. Amazon EMR

AWS EMR helps you handle large-scale data processing tasks by running Hadoop clusters in the
cloud. Here's a simple breakdown:

1. What is AWS EMR?

AWS EMR is a cloud service that lets you run Hadoop clusters to process big data. It’s useful if you’re
moving from an on-premise Hadoop setup to the cloud because:

 Elasticity: You can scale your cluster up or down quickly.

 Cost: You only pay for the time you use.

Key Components:
 Apache Spark

 HBase

 Presto

 Flink

 Hive

These tools help with tasks like data processing, machine learning, web indexing, and more.

2. How EMR Works

1. Clusters: EMR uses clusters of EC2 instances to process data.

o Master Node: Manages the cluster.

o Core Nodes: Run tasks and store data.

o Task Nodes: Run tasks (optional and can use Spot Instances).

2. Storage:

o Temporary Storage: EC2 instances use EBS volumes with Hadoop Distributed File
System (HDFS) for temporary storage.

o Long-term Storage: Use EMRFS to store data in Amazon S3 for durability and multi-
AZ storage.

3. Optimizing Cost

 On-Demand Instances: Reliable but more expensive.

 Reserved Instances: Lower cost for long-term use (e.g., master and core nodes).

 Spot Instances: Cheapest but less reliable (good for task nodes).

Cluster Types:

 Long-running Clusters: Ideal for continuous processing tasks.

 Transient Clusters: Use for temporary tasks and shut down when done.

4. Instance Configuration

 Uniform Instance Groups: Choose a single instance type and purchasing option for each
node type (master, core, task). Supports auto-scaling.

 Instance Fleets: Allows mixing of instance types and purchasing options (e.g., some on-
demand, some spot). Provides flexibility but currently doesn’t support auto-scaling.
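
A hedged boto3 sketch of launching a transient cluster with uniform instance groups (Spark plus Hive, task nodes on Spot); the release label, instance types, roles, and log bucket are placeholders.

import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Transient cluster: terminates when no steps are left to run.
response = emr.run_job_flow(
    Name="nightly-processing",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
            {"InstanceRole": "TASK", "InstanceType": "m5.xlarge", "InstanceCount": 2,
             "Market": "SPOT"},                  # task nodes on Spot for cost savings
        ],
        "KeepJobFlowAliveWhenNoSteps": False,    # transient cluster
    },
    LogUri="s3://my-emr-logs/",                  # placeholder bucket
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])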


106. Running Jobs on AWS

Strategies for Running Jobs on AWS


1. EC2 Instances with CRON Jobs

 Description: Provision an EC2 instance to run CRON jobs.

 Pros:

o Simple to set up for basic tasks.

 Cons:

o Not highly available or scalable.

o If the instance fails, the jobs fail too.

o Not a recommended strategy for production environments.

2. Amazon EventBridge and Lambda

 Description: Use EventBridge to trigger Lambda functions on a schedule.

 Pros:

o Serverless: No need to manage infrastructure.

o Scalable: AWS handles scaling automatically.

o Highly Available: Reduces the risk of single points of failure.

 Cons:

o Limits: Lambda has time and resource limitations.

 Use Case: Ideal for regular tasks like scheduled reports or notifications.
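
A minimal boto3 sketch of this pattern: an hourly EventBridge rule targeting an existing Lambda function. The rule name, schedule, and function ARN are placeholders.

import boto3

events = boto3.client("events", region_name="us-east-1")

# Hourly schedule rule (placeholder names and ARN).
rule_arn = events.put_rule(
    Name="hourly-report",
    ScheduleExpression="rate(1 hour)",
    State="ENABLED",
)["RuleArn"]

# Point the rule at an existing Lambda function.
events.put_targets(
    Rule="hourly-report",
    Targets=[{
        "Id": "report-lambda",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:generate-report",
    }],
)
# Note: the Lambda function also needs a resource-based permission allowing
# events.amazonaws.com to invoke it for this rule ARN.
print(rule_arn)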

3. Reactive Workflows with Lambda

 Description: Trigger Lambda functions based on events from various AWS services.

 Event Sources:

o EventBridge: For events from different AWS services.

o S3: For new object creations or deletions.

o API Gateway: For API requests.

o SQS and SNS: For messaging and notifications.

 Pros:

o Reactive: Responds to real-time events.

 Use Case: Good for workflows that depend on specific events occurring in your
infrastructure.

4. AWS Batch

 Description: Use Batch for running large-scale batch processing jobs.

 Pros:
o Scalable: Manages compute resources efficiently.

o Flexibility: Works with both EC2 and Fargate.

 Use Case: Suitable for long-running, batch processing tasks that need more control over
compute resources.

5. AWS Fargate

 Description: Run containers without managing servers.

 Pros:

o Serverless: No need to manage EC2 instances.

o Scalable: Automatically scales based on demand.

 Use Case: Good for tasks requiring containers but where you don't need the extensive
features of Batch.

6. AWS EMR

 Description: Use EMR for big data processing and clustering.

 Pros:

o Powerful: Handles large-scale data processing with tools like Hadoop, Spark, and
Hive.

 Use Case: Ideal for big data workloads, step executions, and complex data processing tasks.

Summary

 For Simple, Scheduled Jobs: Use EventBridge with Lambda.

 For Reactive Workflows: Use Lambda triggered by various events.

 For Batch Processing: Use AWS Batch or Fargate.

 For Big Data Processing: Use AWS EMR.

107. AWS Glue

AWS Glue Overview

What is AWS Glue?

AWS Glue is a managed ETL (Extract, Transform, Load) service that helps you prepare and transform
data for analytics. It is fully serverless, meaning you don’t need to manage any servers or
infrastructure.

Key Components

1. AWS Glue ETL Jobs

o Description: These are tasks that extract data from various sources, transform it (e.g., clean, modify, or aggregate it), and then load it into a target data store.
o Example Workflow:

 Extract: Data from Amazon S3 or an Amazon RDS database.

 Transform: Perform data cleaning and transformations.

 Load: Move the processed data into a data warehouse like Amazon Redshift.

2. AWS Glue Data Catalog

o Description: A centralized repository that stores metadata about your data. This
metadata includes information about tables, columns, data types, etc.

o How It Works:

 Crawlers: AWS Glue has crawlers that scan your data sources (like Amazon
S3, Amazon RDS, DynamoDB, or JDBC-compatible databases).

 Cataloging: The crawlers detect and record metadata about your data
sources and store this information in the Glue Data Catalog.

o Benefits:

 Discovery: Services like Amazon Athena, Amazon Redshift Spectrum, and Amazon EMR use the Glue Data Catalog to understand and query your data.
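
A short boto3 sketch of the crawler-and-catalog flow described above; the crawler name, IAM role, database, and S3 path are placeholders.

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Crawl an S3 prefix and record the discovered schema in the Data Catalog.
glue.create_crawler(
    Name="sales-crawler",                                    # placeholder
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",   # placeholder IAM role
    DatabaseName="sales_db",
    Targets={"S3Targets": [{"Path": "s3://my-data-lake/sales/"}]},
)
glue.start_crawler(Name="sales-crawler")

# Once the crawler finishes, the catalogued tables are visible to Athena,
# Redshift Spectrum and EMR through the same Data Catalog.
for table in glue.get_tables(DatabaseName="sales_db")["TableList"]:
    print(table["Name"], [c["Name"] for c in table["StorageDescriptor"]["Columns"]])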

How AWS Glue Helps

 Serverless: No need to manage the underlying infrastructure.

 Data Preparation: Simplifies the process of preparing and transforming data.

 Integration: Works seamlessly with other AWS services for data analytics.

Use Cases

 Data Preparation: Preparing data for analytics or machine learning.

 Data Cataloging: Managing metadata and making data discoverable.

 Data Integration: Combining data from multiple sources.

108. Redshift

What is Amazon Redshift?

Amazon Redshift is a data warehousing service designed for OLAP (Online Analytical Processing),
which means it's great for performing complex queries and analysis on large volumes of data. It's
different from OLTP (Online Transaction Processing) which is more suited for real-time transactional
databases.

Key Features

1. Columnar Storage

o Description: Redshift stores data by columns instead of rows. This is efficient for
analytical queries that often aggregate data from many rows.
o Benefit: Faster performance for operations like summing or averaging columns.

2. Massively Parallel Processing (MPP)

o Description: Redshift uses MPP to distribute queries across multiple nodes in a cluster.

o Benefit: Enhanced performance for large-scale queries.

3. Scalability

o Description: Redshift clusters can scale to petabytes of data.

o Details: Clusters can have hundreds of nodes, and each node can hold up to 16
terabytes of data.

4. Data Loading

o Sources: Data can be loaded into Redshift from Amazon S3, Kinesis Data Firehose,
DynamoDB, or using AWS Database Migration Service (DMS).

5. Node Types

o Leader Node: Manages query planning and aggregates results from compute nodes.

o Compute Nodes: Perform queries and send results to the leader node.

6. Backup and Restore

o Snapshots: Point-in-time backups stored in Amazon S3. Snapshots can be automated or manual.

o Cross-Region Snapshots: Snapshots can be copied to another AWS region for disaster recovery.

7. Redshift Spectrum

o Description: Allows querying of data in Amazon S3 without loading it into Redshift.

o How It Works: Redshift spins up Spectrum nodes to process data in S3 and then
aggregates results in the Redshift cluster.

8. Workload Management (WLM)

o Description: Manages query priorities to prevent long-running queries from blocking short-running ones.

o Types:

 Automatic WLM: Redshift manages queues and resources.

 Manual WLM: Users define queues and manage resources.

9. Concurrency Scaling

o Description: Automatically adds cluster capacity to handle increased query loads.

o Benefit: Ensures consistent performance with a high number of queries.
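
To tie the data-loading point above to code, here is a hedged sketch using the Redshift Data API to run a COPY from S3; the cluster, database, user, IAM role, and bucket are placeholders.

import boto3

redshift_data = boto3.client("redshift-data", region_name="us-east-1")

# Load files from S3 into a table using COPY (all identifiers are placeholders).
copy_sql = """
    COPY sales
    FROM 's3://my-data-lake/sales/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT AS PARQUET;
"""
statement = redshift_data.execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="analytics",
    DbUser="admin",
    Sql=copy_sql,
)

# The call is asynchronous; poll for completion.
status = redshift_data.describe_statement(Id=statement["Id"])["Status"]
print(status)  # SUBMITTED / STARTED / FINISHED / FAILED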


Deployment and Integration

 Deployment: Redshift clusters are typically deployed within a VPC (Virtual Private Cloud) and
use IAM for security, KMS for encryption, and CloudWatch for monitoring.

 Tools: You can use AWS QuickSight, Tableau, and other BI tools for dashboarding and
reporting with Redshift.

When to Use Redshift?

 High, Sustained Query Volume: Redshift is ideal if you have a consistent need for complex,
large-scale queries.

 Big Data Analytics: For analyzing massive datasets efficiently.

When to Consider Alternatives?

 Sporadic Usage: If your usage is occasional, consider AWS Athena for ad-hoc querying of
data in S3, which can be more cost-effective.

109. Amazon DocumentDB

Amazon DocumentDB Overview

What is Amazon DocumentDB?

Amazon DocumentDB is a fully managed NoSQL document database service that is designed to be
compatible with MongoDB. It provides a cloud-native solution for handling JSON data, offering
similar benefits to AWS's Aurora for relational databases.

Key Features

1. MongoDB Compatibility

o Description: DocumentDB is designed to be compatible with MongoDB APIs, making it easier for users of MongoDB to migrate to AWS.

o Benefit: Simplifies the transition to a managed cloud database service.

2. Fully Managed

o Description: DocumentDB is a managed service, meaning AWS handles maintenance tasks like backups, patching, and scaling.

o Benefit: Reduces the operational overhead of managing a database.

3. High Availability

o Description: DocumentDB replicates data across three Availability Zones (AZs) for
fault tolerance.

o Benefit: Ensures high availability and durability of data.

4. Automatic Scaling

o Description: The storage layer automatically scales in increments of 10 GB.

o Benefit: Simplifies storage management as your data grows.


5. Performance

o Description: DocumentDB can handle millions of requests per second.

o Benefit: Provides high performance for large-scale applications.
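
Because DocumentDB speaks the MongoDB wire protocol, a standard MongoDB driver can be used. A minimal sketch with pymongo, assuming a placeholder cluster endpoint and the Amazon RDS CA bundle downloaded locally:

from pymongo import MongoClient  # standard MongoDB driver

# Placeholder cluster endpoint and credentials; TLS is enabled on DocumentDB
# clusters by default, so the CA bundle file must be supplied.
client = MongoClient(
    "mongodb://appuser:secret@my-docdb-cluster.cluster-abc123.us-east-1.docdb.amazonaws.com:27017"
    "/?tls=true&tlsCAFile=global-bundle.pem&replicaSet=rs0&readPreference=secondaryPreferred"
)

orders = client["shop"]["orders"]
orders.insert_one({"order_id": 1, "items": [{"sku": "A1", "qty": 2}]})
print(orders.find_one({"order_id": 1}))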

Architecture and Pricing

1. Database Storage

o Description: Data is stored in DocumentDB’s database storage.

o Cost: You pay for the database storage used, billed per gigabyte per month.

2. Instances

o On-Demand Instances: Primary and replica instances handle read and write
operations.

o Cost: Charged per second with a minimum billing of 10 minutes.

3. IO Operations

o Description: Read and write operations against the database storage.

o Cost: Charged per million I/O operations.

4. Backups

o Description: Backups are stored in Amazon S3.

o Cost: Charged per gigabyte per month for backup storage.

Deployment

 No Pay-Per-Request Tier: DocumentDB has no serverless, pay-per-request option; costs are based on instance usage, I/O operations, storage, and backups.

When to Use DocumentDB

 For MongoDB Users: If you're already using MongoDB and need a managed, cloud-native
solution, DocumentDB provides a compatible environment with added benefits of AWS's
infrastructure.

 NoSQL Applications: Ideal for applications that require high performance and scalable
document storage with JSON data.

110. Amazon Timestream

Amazon Timestream Overview

What is Amazon Timestream?

Amazon Timestream is a fully managed, serverless time series database designed for handling time-
stamped data efficiently. It provides a scalable and cost-effective solution for managing and analyzing
large volumes of time series data.
Key Features

1. Time Series Data

o Description: Time series data consists of time-stamped points that track changes
over time, such as measurements or events.

o Example: A graph showing temperature readings over several years.

2. Serverless and Scalable

o Description: Automatically adjusts capacity to handle varying loads without manual intervention.

o Benefit: Scales up or down based on the volume of data and queries.

3. Performance and Cost Efficiency

o Description: Optimized for storing and analyzing trillions of events per day.

o Benefit: More efficient and cost-effective for time series data compared to
traditional relational databases.

4. Data Management

o Recent Data: Stored in memory for fast access.

o Historical Data: Stored in a cost-optimized storage tier.

5. SQL Compatibility

o Description: Supports SQL for querying time series data.

o Benefit: Facilitates complex queries and analysis.

6. Analytics Functions

o Description: Includes built-in time series analytics functions for real-time pattern
detection and analysis.

o Benefit: Helps identify trends and anomalies quickly.

7. Security

o Description: Supports encryption both in transit and at rest.
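
A minimal boto3 sketch of writing one time-stamped measurement and querying it back with SQL; the database and table names are placeholders, and writes and queries use separate clients.

import time
import boto3

ts_write = boto3.client("timestream-write", region_name="us-east-1")
ts_query = boto3.client("timestream-query", region_name="us-east-1")

# Write one time-stamped measurement (database/table are placeholders).
ts_write.write_records(
    DatabaseName="iot",
    TableName="sensor_readings",
    Records=[{
        "Dimensions": [{"Name": "device_id", "Value": "sensor-1"}],
        "MeasureName": "temperature",
        "MeasureValue": "21.5",
        "MeasureValueType": "DOUBLE",
        "Time": str(int(time.time() * 1000)),   # milliseconds since epoch
        "TimeUnit": "MILLISECONDS",
    }],
)

# Query recent data with SQL, using a built-in time function.
result = ts_query.query(
    QueryString='SELECT device_id, avg(measure_value::double) AS avg_temp '
                'FROM "iot"."sensor_readings" '
                'WHERE time > ago(1h) GROUP BY device_id'
)
print(result["Rows"])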

Use Cases

 IoT Applications: Track and analyze sensor data from connected devices.

 Operational Applications: Monitor and analyze system performance and operational metrics.

 Real-Time Analytics: Perform real-time analysis on time-stamped data.

Integration

1. Data Ingestion
o Sources: AWS IoT, Kinesis Data Streams, Prometheus, Telegraf, Kinesis Data Analytics
(Apache Flink), Amazon MSK.

o Description: Supports integration with various data sources for seamless data
ingestion.

2. Data Visualization and Analysis

o Tools: Amazon QuickSight for dashboards, Amazon SageMaker for machine learning,
Grafana for visualization.

o JDBC Compatibility: Allows connection from any JDBC-compatible application for querying and analysis.

Architecture

 Data Flow: Data can be ingested from various sources, stored in Timestream, and queried
using SQL or integrated tools.

 Analytics: Timestream’s time series analytics functions provide real-time insights into data
patterns.

111. Amazon Athena

Amazon Athena Overview

What is Amazon Athena?

Amazon Athena is a serverless query service that allows you to analyze data stored in Amazon S3
using standard SQL. It is built on the Presto engine and does not require you to provision or manage
any infrastructure.

Key Features

1. Serverless and SQL-Based

o Description: Athena is serverless, meaning you don’t need to manage servers. You
can run SQL queries directly on data stored in S3.

o Engine: Built on Presto, which supports SQL queries.

2. Data Formats

o Supported Formats: CSV, JSON, ORC, Avro, and Parquet.

o Benefit: Athena can handle various data formats for flexible querying.

3. Pricing

o Description: Priced per terabyte of data scanned by each query.

o Benefit: You only pay for the data you query, with no upfront costs.

4. Integration with Other Tools


o Common Use: Often used with Amazon QuickSight for creating reports and
dashboards.

o Additional Tools: Can also integrate with machine learning tools like Amazon
SageMaker and visualization tools like Grafana.

Use Cases

 Ad Hoc Queries: Quickly run queries on data stored in S3.

 Business Intelligence: Perform analytics and reporting.

 Log Analysis: Analyze logs from AWS services (e.g., VPC flow logs, CloudTrail logs).
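
A minimal boto3 sketch of an ad-hoc query; the database, table, partition columns, and results bucket are placeholders, and Athena runs the query asynchronously.

import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Start an ad-hoc query (database, table, and output bucket are placeholders).
execution_id = athena.start_query_execution(
    QueryString="SELECT action, count(*) AS hits FROM vpc_flow_logs "
                "WHERE year='2024' AND month='01' GROUP BY action",  # partition columns prune the scan
    QueryExecutionContext={"Database": "logs_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)["QueryExecutionId"]

# Poll until the query finishes, then read the first page of results.
while True:
    state = athena.get_query_execution(QueryExecutionId=execution_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=execution_id)["ResultSet"]["Rows"]
    print(rows)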

Performance Optimization

1. Columnar Data Formats

o Recommended Formats: Apache Parquet and ORC.

o Benefit: Scanning only the necessary columns reduces data scanned and improves
performance.

2. Data Compression

o Description: Use compression mechanisms to reduce data size.

o Benefit: Smaller data sizes lead to faster query times and lower costs.

3. Partitioning

o Description: Organize data into partitions based on certain criteria (e.g., year, month,
day).

o Benefit: Queries can target specific partitions, reducing the amount of data scanned.

Example: For flight data, partitions could be organized by year, month, and day (e.g.,
/year=1991/month=01/day=01).

4. File Size

o Description: Use larger files (e.g., 128 MB or more).

o Benefit: Reduces overhead and improves query performance compared to many small files.

Federated Queries

 Description: Athena can query data from various sources beyond S3, including relational and
non-relational databases.

 Mechanism: Uses Data Source Connectors (Lambda functions) to execute federated queries.

 Supported Sources: CloudWatch Logs, DynamoDB, RDS, ElastiCache, DocumentDB, Redshift, Aurora, SQL Server, MySQL, HBase, and on-premises databases.

Workflow: Athena sends queries to Lambda functions, which then execute queries on other data
sources. Results are returned to Athena and can be stored in S3 for further analysis.
112. Amazon QuickSight

Amazon QuickSight Overview

What is Amazon QuickSight?

Amazon QuickSight is a serverless business intelligence (BI) service designed to help you create
interactive dashboards and perform data analysis. It is fast, scalable, and offers per-session pricing.

Key Features

1. Serverless BI Service

o Description: No server management is required. QuickSight scales automatically based on usage.

o Benefit: Easily handles large datasets and concurrent users.

2. SPICE Engine

o Description: SPICE (Super-fast, Parallel, In-memory Calculation Engine) is an in-memory engine that speeds up data processing.

o Usage: Works with imported data (e.g., from CSV files, Excel). Does not work directly
with live connections to databases.

3. User-Level Security

o Enterprise Edition: Supports column-level security (CLS) to restrict access to specific columns based on user permissions.

4. Integration with Data Sources

o AWS Services: RDS, Aurora, Redshift, Athena, S3, OpenSearch, Timestream.

o Third-Party Sources: Salesforce, Jira, Teradata, on-premises databases using JDBC.

o Data Imports: Excel, CSV, JSON, TSV, and ELF/CLF log formats.

5. Dashboards and Analysis

o Analysis: Create interactive and detailed visualizations. Analysis allows for deeper
exploration and manipulation of data.

o Dashboards: Read-only snapshots of analyses, preserving filters, parameters, and sorting. Useful for sharing consistent views with users.

6. User Management

o Standard Edition: Individual users.

o Enterprise Edition: Groups of users for better access control and management.

Use Cases

 Business Analytics: Create interactive reports and visualizations to gain business insights.
 Ad-Hoc Analysis: Quickly analyze and visualize data as needed.

 Reporting: Generate and share reports with dashboards.

Performance Optimization

1. Use SPICE for Faster Computation

o Benefit: Import data into QuickSight to leverage SPICE for in-memory processing and
faster query performance.

2. Optimize Data Formats

o Recommended Formats: Use efficient data formats like Parquet and ORC when
importing data for better performance.

3. Efficient Data Management

o Large Datasets: Import data efficiently and manage it to ensure quick access and
analysis.

113. Big Data Architecture

Data Engineering Pipeline on AWS

Analytics Layer

1. Amazon S3

o Role: Centralized data storage.

2. Amazon EMR (Elastic MapReduce)

o Description: Processes large datasets using Hadoop, Spark, or Hive.

o Use Case: Ideal for migrating existing Big Data workloads to AWS.

3. Amazon Redshift

o Description: Data warehousing service for complex SQL queries.

o Options:

 Redshift Spectrum: Queries data directly in S3 without loading it into Redshift.

 Redshift Warehouse: Load data into Redshift for extensive SQL-based analysis.

4. Amazon Athena

o Description: Serverless SQL engine for querying data stored in S3.

o Use Case: Best for ad-hoc queries and sporadic data analysis.

5. Amazon QuickSight

o Description: Business intelligence service for creating interactive dashboards.


o Integration: Connects with Redshift and Athena to visualize data.

Big Data Ingestion

1. IoT Devices

o Data Stream: Send data to Kinesis Data Stream.

2. Kinesis Data Firehose

o Description: Real-time data ingestion service that delivers data to destinations like
S3.

3. Data Transformation

o Lambda Functions: Optionally transform data before it is stored in S3.

4. S3 Events

o Description: Trigger notifications or further processing with SQS or Lambda when new files are added to S3.

5. Data Processing

o Amazon Athena: Run queries on the data in S3 and update reporting buckets.

6. Data Reporting

o Amazon QuickSight: Create dashboards from reporting buckets.

o Amazon Redshift: Alternatively, store and query data for more complex analysis.
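
A sketch of the optional transformation Lambda in this ingestion flow. Firehose hands the function base64-encoded records and expects each one back with recordId, result, and re-encoded data; the enrichment shown is only an example.

import base64
import json

def handler(event, context):
    """Optional Firehose transformation step: enrich each record before S3."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        payload["ingested_by"] = "firehose-pipeline"   # example enrichment

        output.append({
            "recordId": record["recordId"],
            "result": "Ok",   # or "Dropped" / "ProcessingFailed"
            "data": base64.b64encode((json.dumps(payload) + "\n").encode()).decode(),
        })
    return {"records": output}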

Comparing Warehousing Technologies

1. Amazon EMR

o Description: Big Data processing using Apache Hive, Spark, etc.

o Cluster Options:

 Long-Running Cluster: For multiple jobs.

 Cluster per Job: For isolated job processing.

o Cost Management:

 Spot Instances, On-Demand Instances, Reserved Instances.

o Data Access: Integrates with DynamoDB and S3 (via EMRFS), and uses EBS for scratch storage.

2. Amazon Athena

o Description: Serverless SQL querying for S3 data.

o Use Case: Simple queries and data aggregation.

o Integration: Works with AWS services, and queries are auditable via CloudTrail.

3. Amazon Redshift
o Description: Advanced SQL queries and full-scale data warehousing.

o Options:

 Redshift Spectrum: For querying data in S3 without loading it into Redshift.

o Usage: Best when query volume is high and sustained enough to justify the investment.
