4.1. Notes
AWS DataSync
AWS DataSync is a tool that helps you move or copy large amounts of data between different places,
such as your own servers or other cloud services, and Amazon Web Services (AWS).
From Your Own Servers to AWS: If you have data stored on your company's servers, you can
move it to AWS, for example into Amazon S3 (object storage), Amazon EFS (a file system), or
Amazon FSx (specialized file systems).
Between Different AWS Services: You can also use DataSync to copy data between different
AWS services, like moving data from one S3 storage to another.
From AWS Back to Your Servers: You can even move data back from AWS to your own
servers.
Connection: To connect your own servers with AWS, you’ll need to install a small program
(called an "agent") on your server. This agent helps your server talk to AWS and transfer the
data.
No Agent for AWS to AWS: If you're moving data only within AWS services, you don’t need
this agent.
Scheduled Transfers: Data doesn’t move continuously; you set up a schedule, like every hour,
day, or week, for DataSync to do its job. So there’s a bit of a delay, but it keeps everything in
sync as per your schedule.
Metadata and Permissions: When DataSync moves your files, it keeps all the important
details (like who can access the files and other security settings) intact.
Fast Transfers: DataSync can move data very quickly (up to 10 gigabits per second), but if you
don’t want to use too much of your network’s capacity, you can set a bandwidth limit to throttle it
(see the sketch after this list).
Limited Network? If your network bandwidth is too limited to transfer the data online, AWS
offers a small device called Snowcone with the DataSync agent pre-installed. You load your data
onto Snowcone and ship the device to AWS, where the transfer is completed for you.
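To make the scheduling and bandwidth-throttling ideas above concrete, here is a minimal boto3 sketch; the location ARNs, the cron expression, and the bandwidth cap are placeholder assumptions, not values from these notes.

```python
import boto3

datasync = boto3.client("datasync")

# Hypothetical source (on-premises) and destination (S3) location ARNs.
response = datasync.create_task(
    SourceLocationArn="arn:aws:datasync:us-east-1:123456789012:location/loc-src",
    DestinationLocationArn="arn:aws:datasync:us-east-1:123456789012:location/loc-dst",
    Name="nightly-sync",
    # Run the task on a schedule instead of continuously (hourly/daily/weekly).
    Schedule={"ScheduleExpression": "cron(0 2 * * ? *)"},  # every day at 02:00 UTC
    # Throttle the transfer so it does not saturate the network link.
    Options={"BytesPerSecond": 50 * 1024 * 1024},  # cap at roughly 50 MB/s
)
print(response["TaskArn"])
```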
78. AWS DataSync - Solution Architecture
When you want to use AWS DataSync and ensure that your connection is private (not over the public
internet), you can use AWS Direct Connect. Direct Connect is a service that creates a direct, private
network link between your own data center and AWS.
o To keep the connection private, you need to go through your VPC, which is your own
private network within AWS.
o Your DataSync agent (the software that moves your data) will connect to AWS
through Direct Connect.
o If you use a Public VIF, the connection bypasses the VPC and reaches DataSync’s public
endpoint. This might not be what you want because the traffic is not fully private.
o Create a VPC Interface Endpoint: This is a special connection point in your VPC that
allows you to privately connect to AWS services like DataSync.
o Set Up a Private VIF: This will create a private link between your Direct Connect
connection and the VPC Interface Endpoint.
o Now, the DataSync agent can securely send data through Direct Connect, into your
VPC, and then to the DataSync service, all without ever touching the public internet.
Using this method ensures that your data stays private and secure as it moves between your
data center and AWS.
AWS Kinesis is a service that helps you handle a large amount of data that needs to be processed in
real-time. It’s like a fast-moving river (or stream) of data where you can continuously add and analyze
information.
Big Data Applications: Kinesis is often used in big data scenarios where large amounts of
information are processed and analyzed quickly.
1. Highly Available: Kinesis automatically copies your data across three different locations
(Availability Zones), making it reliable and fault-tolerant.
o Kinesis Streams: For real-time data ingestion at large scale, where data is divided
into parts called "Shards."
o Kinesis Analytics: Allows you to perform real-time data analysis using SQL on data
coming through Kinesis Streams.
o Kinesis Firehose: Automatically loads the data from streams into other AWS services
like S3, Redshift, Elasticsearch, or Splunk.
Shards: Think of Shards as individual lanes in a highway. Data flows through these lanes, and
you can decide how many lanes (Shards) you need based on the amount of data.
Data Retention: By default, Kinesis Streams keep data for 24 hours, but you can extend this
up to a year. This means you can replay and reprocess data within this time frame.
Multiple Consumers: Unlike SQS (another AWS service), where once data is read it’s gone,
Kinesis allows multiple applications to read the same data stream simultaneously. This is
useful for real-time data processing by different systems.
Data Is Immutable: Once data is in Kinesis, it can’t be deleted until it expires. This ensures
that the data is always available for processing.
On-Demand Mode: Kinesis automatically adjusts the number of Shards based on the data
flow, so you don’t have to plan for capacity.
Provisioned Mode: You manage the number of Shards yourself, adjusting them as needed.
Producers: These are the sources of data. They could be applications using the AWS SDK,
Kinesis Producer Library (KPL) for advanced features like batching and compression, or
Kinesis Agents that monitor log files.
Consumers: These are the applications that read the data. They could be simple consumers
using the SDK, AWS Lambda functions, or more complex ones using Kinesis Client Library
(KCL) for coordinated reads and checkpoints.
Key Limits:
Producer Limits: A producer can send up to 1 MB of data per second or 1,000 records per second, per
Shard. To handle more throughput, you add more Shards (a minimal producer/consumer sketch follows this list).
o Classic Consumers: Can read up to 2 MB per second per Shard, shared among all
consumers.
o Enhanced Fan-Out Consumers: Each consumer can read 2 MB per second per Shard,
independently, without API call limits, providing better performance.
Real-Time Data Processing: If you need to process and analyze data as soon as it arrives,
Kinesis is a great choice.
Scalability: You can easily scale the amount of data you process by adding more Shards.
Flexibility: Multiple applications can consume the same data stream, allowing different types
of analysis on the same data.
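A minimal boto3 sketch of a producer and a classic consumer, assuming a hypothetical stream called my-stream; real applications would typically use the KPL/KCL mentioned above.

```python
import json
import boto3

kinesis = boto3.client("kinesis")
STREAM = "my-stream"  # hypothetical stream name

# Producer: the partition key decides which shard the record lands in,
# subject to the 1 MB/s or 1,000 records/s per-shard limit.
kinesis.put_record(
    StreamName=STREAM,
    Data=json.dumps({"sensor": "s1", "temp": 23.5}).encode("utf-8"),
    PartitionKey="s1",
)

# Classic consumer: read one shard starting from the oldest retained record.
shard_id = kinesis.describe_stream(StreamName=STREAM)["StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=STREAM, ShardId=shard_id, ShardIteratorType="TRIM_HORIZON"
)["ShardIterator"]
for record in kinesis.get_records(ShardIterator=iterator, Limit=100)["Records"]:
    print(record["SequenceNumber"], record["Data"])
```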
AWS Kinesis Data Firehose is a fully managed service that takes data from various sources and
delivers it to destinations like Amazon S3, Redshift, OpenSearch, or other third-party services.
1. Data Sources:
o Producers: These are applications or services like Kinesis Data Streams, Amazon
CloudWatch Logs, IoT devices, etc., that send data to Kinesis Data Firehose.
2. Data Destinations:
o AWS Destinations:
Amazon Redshift: For large-scale data analysis (data first goes to S3, then is
copied to Redshift).
3. Backup Options:
o You can back up all data or only the data that failed to be delivered to its destination
into an S3 bucket.
Near Real-Time: Firehose delivers data in batches, so there is a slight delay (it is not true real-time).
You set a buffer size (e.g., 32 MB) and a buffer time (e.g., 1 minute) to control when data is
flushed to the destination (see the sketch after this list).
Data Formats & Transformation: Supports various data formats and allows you to transform
data using Lambda functions if needed.
o Buffer Time: The maximum time data will stay in the buffer before being sent, even if
the buffer size isn't reached.
2. Data Delivery:
o Real-Time: If you need real-time data delivery, Firehose is not the best option. For
real-time, use Lambda with Kinesis Data Streams.
Kinesis Data Streams: Use when you need to handle real-time streaming data at scale, with
the ability to manage scaling and data retention.
Kinesis Data Firehose: Use when you need a simple, fully managed way to load streaming
data into storage or analysis services, and near real-time processing is sufficient.
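A rough boto3 sketch of the buffering behaviour described above; the delivery stream name, IAM role ARN, and bucket ARN are placeholders.

```python
import boto3

firehose = boto3.client("firehose")

# Hypothetical delivery stream writing batches to S3. BufferingHints controls
# the batch-size/age trade-off that makes Firehose "near real-time".
firehose.create_delivery_stream(
    DeliveryStreamName="clickstream-to-s3",
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
        "BucketARN": "arn:aws:s3:::my-analytics-bucket",
        "BufferingHints": {"SizeInMBs": 32, "IntervalInSeconds": 60},
    },
)

# Producers can then push records directly (DirectPut).
firehose.put_record(
    DeliveryStreamName="clickstream-to-s3",
    Record={"Data": b'{"page": "/home", "user": "u42"}\n'},
)
```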
AWS Kinesis Data Analytics is a service that allows you to analyze streaming data in real-time. You
can use it to process data coming from Kinesis Data Streams or Kinesis Data Firehose and then send
the processed results to various destinations like S3, Redshift, or dashboards.
o Kinesis Data Streams or Kinesis Data Firehose: These are the sources of streaming
data that Kinesis Data Analytics reads and processes.
o Reference Data (Optional): You can also use static data from Amazon S3 to enrich
your streaming data during processing.
2. Data Processing:
o SQL Queries: You can write SQL queries to analyze the streaming data in real time.
For example, you might count the number of items by their ID or perform other
transformations.
o Error Handling: If something goes wrong (like an unexpected data type), Kinesis Data
Analytics generates an error stream, which can be monitored.
3. Data Output:
o Output Stream: The processed data can be sent to another Kinesis Data Stream,
where other applications or consumers can access it.
o Firehose: Alternatively, the output can be sent to Kinesis Data Firehose, which then
delivers the data to Amazon S3, Redshift, or other destinations.
Streaming ETL: Extract, Transform, Load data in real time, like selecting specific columns or
making simple data transformations.
Continuous Metrics: Generate live metrics, such as a leaderboard for a mobile game.
Responsive Analytics: Filter and analyze streaming data in real time to trigger alerts or
actions based on specific criteria.
Pay for What You Use: You only pay for the resources consumed, though it can be expensive.
Serverless: No need to manage servers; it automatically scales based on the data load.
IAM Permissions: You need to set up IAM roles to control access to streaming sources and
destinations.
SQL and Apache Flink: You can write your data processing logic in SQL or use Apache Flink, a
Java-based framework for more complex processing.
Schema Discovery: Kinesis Data Analytics can automatically detect and understand the data
structure in your streams.
Lambda for Pre-processing: You can use AWS Lambda to pre-process data before it enters
Kinesis Data Analytics, as sketched below.
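A minimal sketch of a Lambda handler wired to a Kinesis stream as an event source, of the kind that could pre-process records before (or consume results after) Kinesis Data Analytics; the item_id field is a hypothetical payload attribute.

```python
import base64
import json

def handler(event, context):
    """Process a batch of Kinesis records delivered to Lambda as an event source."""
    for record in event["Records"]:
        # Kinesis record payloads arrive base64-encoded in the Lambda event.
        payload = base64.b64decode(record["kinesis"]["data"])
        item = json.loads(payload)
        # Hypothetical pre-processing step: drop malformed items, keep the rest.
        if "item_id" not in item:
            continue
        print(f"item {item['item_id']} arrived at "
              f"{record['kinesis']['approximateArrivalTimestamp']}")
    return {"records_seen": len(event["Records"])}
```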
Producers (like sensors or apps) send data into Kinesis Data Streams.
Kinesis Data Streams is like a conveyor belt where data flows in real-time.
Analyzing Data:
Use Amazon Kinesis Data Analytics to look at and make sense of the data as it comes in.
You can send the processed data to Kinesis Data Streams again or use Kinesis Data Firehose.
Kinesis Data Firehose can move the data to places like Amazon S3 (a storage service),
Amazon Redshift (a data warehouse), or Amazon Elasticsearch Service (a search service).
Producers can also send data straight to Kinesis Data Firehose, which will then store it in
Amazon S3.
2. Cost-Effectiveness Comparison:
Suppose you need to handle 3000 messages per second, each 1 KB in size.
Kinesis Data Streams would need 3 "shards" (which are like separate lanes on the conveyor
belt), costing about $32 per month.
For the same data, DynamoDB would cost around $1450 per month.
While DynamoDB provides long-term storage, its streaming capabilities are much more
expensive.
Comparison:
Kinesis Data Streams is much cheaper and better for streaming data compared to
DynamoDB.
3. Overview of Technologies:
Ordering:
SNS:
DynamoDB:
S3:
Features:
o Fully Managed: AWS handles Kafka broker nodes and Zookeeper nodes.
o High Availability: Deploys clusters in your VPC across multiple Availability Zones
(AZs) for redundancy.
Kafka Cluster: Consists of multiple brokers. Producers send data to Kafka topics, and
consumers read from these topics.
Kafka Topics: Topics are partitioned to allow parallel processing and scalable consumption.
Data in topics is replicated across brokers for fault tolerance.
Producers & Consumers: Producers push data to Kafka topics, and consumers pull data from
these topics for processing or forwarding to other destinations (a minimal producer sketch follows this list).
1. Message Limits:
2. Scaling:
3. Encryption:
o Amazon MSK: Options include plain text or TLS for in-flight encryption. Both services
provide at-rest encryption.
4. Data Retention:
o Amazon MSK: Data retention can be configured to exceed one year, depending on
the EBS storage paid for.
Integration with Amazon MSK
o Kinesis Data Analytics for Apache Flink: Use Flink to process data directly from MSK.
o AWS Glue: Perform streaming ETL jobs with Apache Spark Streaming.
o AWS Lambda: Set up Lambda functions to process data from MSK as an event
source.
o Custom Kafka Consumers: Implement custom consumers using EC2, ECS, or EKS.
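A minimal producer sketch using the kafka-python library, assuming a hypothetical MSK bootstrap broker and an orders topic; MSK TLS listeners conventionally use port 9094.

```python
from kafka import KafkaProducer  # pip install kafka-python

# Placeholder MSK bootstrap broker; real clusters expose TLS listeners on port 9094.
producer = KafkaProducer(
    bootstrap_servers=["b-1.my-cluster.abc123.kafka.us-east-1.amazonaws.com:9094"],
    security_protocol="SSL",  # in-flight encryption, which MSK supports
    value_serializer=lambda v: v.encode("utf-8"),
)

# Push a record to a topic; MSK replicates the partition across brokers/AZs.
producer.send("orders", value='{"order_id": 1, "amount": 42.0}')
producer.flush()
```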
AWS Batch allows you to run many jobs (like processing images or data) at once. You can choose how
you want these jobs to run:
Serverless with AWS Fargate: No need to manage servers. AWS handles it for you.
EC2 Instances: Use regular or Spot Instances (which are cheaper but can be interrupted).
Key Points:
EC2 and Spot Instances provide more control but need some management.
2. Trigger Job:
o Option 2: Use Amazon EventBridge to directly start a batch job when an image is
uploaded.
3. Process Job:
o AWS Batch pulls a Docker image (a package with your code) from Amazon ECR
(Elastic Container Registry).
o The job processes the uploaded image and saves the results back to Amazon S3
(a job-submission sketch follows this list).
Lambda:
Batch:
5. Multi-Node Mode
o Note: Does not work with Spot Instances and is better with EC2 instances in a cluster
placement group.
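A minimal boto3 sketch of submitting a Batch job like the image-processing example above; the job queue, job definition, and S3 key are placeholder assumptions.

```python
import boto3

batch = boto3.client("batch")

# Submit a job against a pre-created job queue and job definition.
response = batch.submit_job(
    jobName="process-image-001",
    jobQueue="image-processing-queue",
    jobDefinition="image-processor:1",  # points at the Docker image stored in ECR
    containerOverrides={
        "environment": [
            {"name": "INPUT_S3_KEY", "value": "uploads/photo-001.jpg"},
        ]
    },
)
print(response["jobId"])
```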
AWS EMR helps you handle large-scale data processing tasks by running Hadoop clusters in the
cloud. Here's a simple breakdown:
AWS EMR is a cloud service that lets you run Hadoop clusters to process big data. It’s useful if you’re
moving from an on-premise Hadoop setup to the cloud because:
Key Components:
Apache Spark
HBase
Presto
Flink
Hive
These tools help with tasks like data processing, machine learning, web indexing, and more.
o Task Nodes: Run tasks (optional and can use Spot Instances).
2. Storage:
o Temporary Storage: EC2 instances use EBS volumes with Hadoop Distributed File
System (HDFS) for temporary storage.
o Long-term Storage: Use EMRFS to store data in Amazon S3 for durability and multi-
AZ storage.
3. Optimizing Cost
Reserved Instances: Lower cost for long-term use (e.g., master and core nodes).
Spot Instances: Cheapest but less reliable (good for task nodes).
Cluster Types:
Transient Clusters: Use for temporary tasks and shut down when done.
4. Instance Configuration
Uniform Instance Groups: Choose a single instance type and purchasing option for each
node type (master, core, task). Supports auto-scaling.
Instance Fleets: Allows mixing of instance types and purchasing options (e.g., some on-
demand, some spot). Provides flexibility but currently doesn’t support auto-scaling.
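A rough boto3 sketch of launching a transient cluster with uniform instance groups (on-demand master and core nodes, Spot task nodes); the release label, instance types, and IAM roles are assumptions.

```python
import boto3

emr = boto3.client("emr")

# Transient Spark cluster: on-demand master/core nodes, Spot task nodes.
response = emr.run_job_flow(
    Name="nightly-spark-cluster",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"Name": "master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
            {"Name": "task", "InstanceRole": "TASK", "Market": "SPOT",
             "InstanceType": "m5.xlarge", "InstanceCount": 4},
        ],
        # Transient cluster: shut down once the submitted steps finish.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```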
Summary
Pros:
Cons:
Pros:
Cons:
Use Case: Ideal for regular tasks like scheduled reports or notifications.
Description: Trigger Lambda functions based on events from various AWS services.
Event Sources:
Pros:
Use Case: Good for workflows that depend on specific events occurring in your
infrastructure.
4. AWS Batch
Pros:
o Scalable: Manages compute resources efficiently.
Use Case: Suitable for long-running, batch processing tasks that need more control over
compute resources.
5. AWS Fargate
Pros:
Use Case: Good for tasks requiring containers but where you don't need the extensive
features of Batch.
6. AWS EMR
Pros:
o Powerful: Handles large-scale data processing with tools like Hadoop, Spark, and
Hive.
Use Case: Ideal for big data workloads, step executions, and complex data processing tasks.
Summary
AWS Glue is a managed ETL (Extract, Transform, Load) service that helps you prepare and transform
data for analytics. It is fully serverless, meaning you don’t need to manage any servers or
infrastructure.
Key Components
o Description: These are tasks that extract data from various sources, transform it (i.e.,
clean, modify, or aggregate it), and then load it into a target data store.
o Example Workflow:
Load: Move the processed data into a data warehouse like Amazon Redshift.
o Description: A centralized repository that stores metadata about your data. This
metadata includes information about tables, columns, data types, etc.
o How It Works:
Crawlers: AWS Glue has crawlers that scan your data sources (like Amazon
S3, Amazon RDS, DynamoDB, or JDBC-compatible databases).
Cataloging: The crawlers detect and record metadata about your data
sources and store this information in the Glue Data Catalog (a crawler-setup sketch appears below).
o Benefits:
Integration: Works seamlessly with other AWS services for data analytics.
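A minimal boto3 sketch of the crawler workflow described above; the crawler name, IAM role, catalog database, and S3 path are placeholders.

```python
import boto3

glue = boto3.client("glue")

# Point a crawler at an S3 prefix so it can populate the Glue Data Catalog.
glue.create_crawler(
    Name="sales-data-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="analytics_catalog",
    Targets={"S3Targets": [{"Path": "s3://my-analytics-bucket/sales/"}]},
)

# Run it on demand; the discovered tables and columns land in the Data Catalog,
# where Athena, Redshift Spectrum, or EMR can then query them.
glue.start_crawler(Name="sales-data-crawler")
```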
Use Cases
108. Redshift
Amazon Redshift is a data warehousing service designed for OLAP (Online Analytical Processing),
which means it's great for performing complex queries and analysis on large volumes of data. It's
different from OLTP (Online Transaction Processing) which is more suited for real-time transactional
databases.
Key Features
1. Columnar Storage
o Description: Redshift stores data by columns instead of rows. This is efficient for
analytical queries that often aggregate data from many rows.
o Benefit: Faster performance for operations like summing or averaging columns.
3. Scalability
o Details: Clusters can have hundreds of nodes, and each node can hold up to 16
terabytes of data.
4. Data Loading
o Sources: Data can be loaded into Redshift from Amazon S3 (via the COPY command),
Kinesis Data Firehose, DynamoDB, or AWS Database Migration Service (DMS); see the COPY sketch after this list.
5. Node Types
o Leader Node: Manages query planning and aggregates results from compute nodes.
o Compute Nodes: Perform queries and send results to the leader node.
7. Redshift Spectrum
o How It Works: Redshift spins up Spectrum nodes to process data in S3 and then
aggregates results in the Redshift cluster.
o Types:
9. Concurrency Scaling
Deployment: Redshift clusters are typically deployed within a VPC (Virtual Private Cloud) and
use IAM for security, KMS for encryption, and CloudWatch for monitoring.
Tools: You can use AWS QuickSight, Tableau, and other BI tools for dashboarding and
reporting with Redshift.
High, Sustained Query Volume: Redshift is ideal if you have a consistent need for complex,
large-scale queries.
Sporadic Usage: If your usage is occasional, consider AWS Athena for ad-hoc querying of
data in S3, which can be more cost-effective.
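A rough sketch of loading data from S3 with COPY through the Redshift Data API (boto3); the cluster, database, table, bucket, and IAM role are placeholder assumptions.

```python
import boto3

redshift_data = boto3.client("redshift-data")

# Load Parquet files from S3 into a Redshift table with COPY, run via the Data API.
copy_sql = """
    COPY sales
    FROM 's3://my-analytics-bucket/sales/2024/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT AS PARQUET;
"""
response = redshift_data.execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql=copy_sql,
)
print(response["Id"])  # statement id, which can be polled with describe_statement
```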
Amazon DocumentDB is a fully managed NoSQL document database service that is designed to be
compatible with MongoDB. It provides a cloud-native solution for handling JSON data, offering
similar benefits to AWS's Aurora for relational databases.
Key Features
1. MongoDB Compatibility
2. Fully Managed
3. High Availability
o Description: DocumentDB replicates data across three Availability Zones (AZs) for
fault tolerance.
4. Automatic Scaling
1. Database Storage
o Cost: You pay for the database storage used, billed per gigabyte per month.
2. Instances
o On-Demand Instances: Primary and replica instances handle read and write
operations.
3. IO Operations
4. Backups
Deployment
No On-Demand Tier: DocumentDB does not have an on-demand pricing tier. Costs are based
on instance usage, I/O operations, storage, and backup.
For MongoDB Users: If you're already using MongoDB and need a managed, cloud-native
solution, DocumentDB provides a compatible environment with the added benefits of AWS's
infrastructure (a connection sketch follows this list).
NoSQL Applications: Ideal for applications that require high performance and scalable
document storage with JSON data.
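Because DocumentDB is MongoDB-compatible, a standard MongoDB driver can connect to it; this pymongo sketch uses a placeholder cluster endpoint and credentials.

```python
from pymongo import MongoClient  # DocumentDB speaks the MongoDB wire protocol

# Placeholder cluster endpoint and credentials; DocumentDB requires TLS with the
# Amazon CA bundle, and retryable writes must be disabled.
client = MongoClient(
    "mongodb://appuser:password@my-docdb-cluster.cluster-abc123.us-east-1.docdb.amazonaws.com:27017/"
    "?tls=true&tlsCAFile=global-bundle.pem&replicaSet=rs0"
    "&readPreference=secondaryPreferred&retryWrites=false"
)

db = client["catalog"]
db.products.insert_one({"sku": "A-100", "name": "widget", "price": 9.99})
print(db.products.find_one({"sku": "A-100"}))
```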
Amazon Timestream is a fully managed, serverless time series database designed for handling time-
stamped data efficiently. It provides a scalable and cost-effective solution for managing and analyzing
large volumes of time series data.
Key Features
o Description: Time series data consists of time-stamped points that track changes
over time, such as measurements or events.
o Description: Optimized for storing and analyzing trillions of events per day.
o Benefit: More efficient and cost-effective for time series data compared to
traditional relational databases.
4. Data Management
5. SQL Compatibility
6. Analytics Functions
o Description: Includes built-in time series analytics functions for real-time pattern
detection and analysis.
7. Security
Use Cases
IoT Applications: Track and analyze sensor data from connected devices.
Integration
1. Data Ingestion
o Sources: AWS IoT, Kinesis Data Streams, Prometheus, Telegraf, Kinesis Data Analytics
(Apache Flink), Amazon MSK.
o Description: Supports integration with various data sources for seamless data
ingestion.
o Tools: Amazon QuickSight for dashboards, Amazon SageMaker for machine learning,
Grafana for visualization.
Architecture
Data Flow: Data can be ingested from various sources, stored in Timestream, and queried
using SQL or integrated tools (see the write/query sketch at the end of this section).
Analytics: Timestream’s time series analytics functions provide real-time insights into data
patterns.
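A minimal boto3 sketch of writing and then querying a time-stamped measurement; the database and table names are placeholders.

```python
import time
import boto3

ts_write = boto3.client("timestream-write")

# Write one time-stamped IoT measurement.
ts_write.write_records(
    DatabaseName="iot",
    TableName="sensor_readings",
    Records=[
        {
            "Dimensions": [{"Name": "device_id", "Value": "sensor-1"}],
            "MeasureName": "temperature",
            "MeasureValue": "23.5",
            "MeasureValueType": "DOUBLE",
            "Time": str(int(time.time() * 1000)),  # epoch milliseconds
            "TimeUnit": "MILLISECONDS",
        }
    ],
)

# Query the latest readings back with standard SQL.
ts_query = boto3.client("timestream-query")
result = ts_query.query(
    QueryString='SELECT * FROM "iot"."sensor_readings" ORDER BY time DESC LIMIT 10'
)
print(result["Rows"])
```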
Amazon Athena is a serverless query service that allows you to analyze data stored in Amazon S3
using standard SQL. It is built on the Presto engine and does not require you to provision or manage
any infrastructure.
Key Features
o Description: Athena is serverless, meaning you don’t need to manage servers. You
can run SQL queries directly on data stored in S3.
2. Data Formats
o Supported Formats: CSV, JSON, ORC, Avro, Parquet, and potentially others.
o Benefit: Athena can handle various data formats for flexible querying.
3. Pricing
o Benefit: You only pay for the data you query, with no upfront costs.
o Additional Tools: Can also integrate with machine learning tools like Amazon
SageMaker and visualization tools like Grafana.
Use Cases
Log Analysis: Analyze logs from AWS services (e.g., VPC flow logs, CloudTrail logs).
Performance Optimization
o Benefit: Scanning only the necessary columns reduces data scanned and improves
performance.
2. Data Compression
o Benefit: Smaller data sizes lead to faster query times and lower costs.
3. Partitioning
o Description: Organize data into partitions based on certain criteria (e.g., year, month,
day).
o Benefit: Queries can target specific partitions, reducing the amount of data scanned.
Example: For flight data, partitions could be organized by year, month, and day (e.g.,
/year=1991/month=01/day=01).
4. File Size
Federated Queries
Description: Athena can query data from various sources beyond S3, including relational and
non-relational databases.
Mechanism: Uses Data Source Connectors (Lambda functions) to execute federated queries.
Workflow: Athena sends queries to Lambda functions, which run them against the other data
sources. Results are returned to Athena and can be stored in S3 for further analysis (a basic query sketch follows).
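A minimal boto3 sketch of running a SQL query against a partitioned table, in the spirit of the flight-data example above; the database, table, and output location are placeholders.

```python
import boto3

athena = boto3.client("athena")

# Filter on partition columns so Athena scans only the matching S3 prefixes.
query = """
    SELECT origin, dest, COUNT(*) AS flights
    FROM flights
    WHERE year = '1991' AND month = '01' AND day = '01'
    GROUP BY origin, dest;
"""
start = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "flights_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
status = athena.get_query_execution(QueryExecutionId=start["QueryExecutionId"])
print(status["QueryExecution"]["Status"]["State"])  # e.g. QUEUED / RUNNING / SUCCEEDED
```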
112. Amazon QuickSight
Amazon QuickSight is a serverless business intelligence (BI) service designed to help you create
interactive dashboards and perform data analysis. It is fast, scalable, and offers per-session pricing.
Key Features
1. Serverless BI Service
2. SPICE Engine
o Usage: Works with imported data (e.g., from CSV files, Excel). Does not work directly
with live connections to databases.
3. User-Level Security
o Data Imports: Excel, CSV, JSON, TSV, and ELF/CLF log formats.
o Analysis: Create interactive and detailed visualizations. Analysis allows for deeper
exploration and manipulation of data.
6. User Management
o Enterprise Edition: Groups of users for better access control and management.
Use Cases
Business Analytics: Create interactive reports and visualizations to gain business insights.
Ad-Hoc Analysis: Quickly analyze and visualize data as needed.
Performance Optimization
o Benefit: Import data into QuickSight to leverage SPICE for in-memory processing and
faster query performance.
o Recommended Formats: Use efficient data formats like Parquet and ORC when
importing data for better performance.
o Large Datasets: Import data efficiently and manage it to ensure quick access and
analysis.
Analytics Layer
1. Amazon S3
2. Amazon EMR
o Use Case: Ideal for migrating existing Big Data workloads to AWS.
3. Amazon Redshift
o Options:
4. Amazon Athena
o Use Case: Best for ad-hoc queries and sporadic data analysis.
5. Amazon QuickSight
1. IoT Devices
o Description: Real-time data ingestion service that delivers data to destinations like
S3.
3. Data Transformation
4. S3 Events
5. Data Processing
o Amazon Athena: Run queries on the data in S3 and update reporting buckets.
6. Data Reporting
o Amazon Redshift: Alternatively, store and query data for more complex analysis.
1. Amazon EMR
o Cluster Options:
o Cost Management:
o Data Access: Integrates with DynamoDB and S3 (via EMRFS), and uses EBS for scratch
storage.
2. Amazon Athena
o Integration: Works with AWS services, and queries are auditable via CloudTrail.
3. Amazon Redshift
o Description: Advanced SQL queries and full-scale data warehousing.
o Options: