AWS Data Engineering Cheatsheet
Nata in Data
Hello dears, here you can find cheat sheets for the most commonly used AWS
services in Data Engineering, like:
Features
Columnar Storage: Redshift uses columnar storage, data
compression, and zone maps to minimize the amount of I/O needed
for queries.
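To make the zone-map idea concrete, here is a minimal Python sketch. It is not Redshift's actual implementation; it only assumes that each block of a column keeps the minimum and maximum value it holds, so a range filter can skip blocks that cannot match. The names Block, build_zone_map, and blocks_to_scan are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    """One storage block of a single column, with its zone-map entry."""
    values: List[int]
    min_value: int
    max_value: int


def build_zone_map(column: List[int], block_size: int) -> List[Block]:
    """Split a column into fixed-size blocks and record min/max per block."""
    blocks = []
    for start in range(0, len(column), block_size):
        chunk = column[start:start + block_size]
        blocks.append(Block(chunk, min(chunk), max(chunk)))
    return blocks


def blocks_to_scan(blocks: List[Block], low: int, high: int) -> List[Block]:
    """Keep only blocks whose [min, max] range overlaps the filter [low, high]."""
    return [b for b in blocks if b.max_value >= low and b.min_value <= high]


# Hypothetical sorted column: order IDs 1..12 stored in blocks of 4 values.
order_ids = list(range(1, 13))
zone_map = build_zone_map(order_ids, block_size=4)
needed = blocks_to_scan(zone_map, low=6, high=7)
print(f"scan {len(needed)} of {len(zone_map)} blocks")
```

The benefit depends on how the data is ordered: on a sorted column the block ranges barely overlap, so most blocks can be skipped, while on randomly ordered data nearly every block may have to be read.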
Components
Cluster: Comprises a leader node and one or more compute nodes.
A database is created upon provisioning a cluster for loading data
and running queries.
Redshift Nodes
Leader Node: Manages client connections, parses queries, and
coordinates execution plans with compute nodes.
Node Types
Dense Storage (DS): For large data workloads using HDD storage.
https://www.nataindata.com/blog/aws-data-engineering-cheat-sheet/ 3/27
2/22/25, 7:28 PM AWS Data Engineering Cheatsheet
Parameter Groups
Parameter groups apply to all databases within a cluster. The default
parameter group has preset values and cannot be modified.
Redshift Spectrum
Query Exabytes of Data: Run queries against data in S3 without
loading or transforming it.
Columnar Format: Scans only the needed columns for your query,
reducing data processing.
Redshift ML
Machine Learning: Train and deploy machine learning models using
SQL commands within Redshift.
regardless of the database you are connected to. This feature is available on
Redshift RA3 node types at no extra cost.
Cluster Snapshots
Types: There are two types of snapshots, automated and manual. Both
are stored in S3 and transferred over an encrypted SSL connection.
Monitoring
Audit Logging: Tracks authentication attempts, connections,
disconnections, user definition changes, and queries. Logs are
stored in S3.
Security
Access Control: By default, only the AWS account that creates the
cluster can access it.
Pricing
Billing: Pay per second based on the type and number of nodes in
your cluster.
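As a rough illustration of per-second billing, the sketch below estimates the cost of a provisioned cluster from the node count, an hourly node rate, and the running time. The function name cluster_cost and the rate used in the example are hypothetical; real rates depend on node type and region.

```python
def cluster_cost(node_count: int, hourly_rate_per_node: float,
                 seconds_running: int) -> float:
    """Estimate the cost of a cluster that is billed per second."""
    if node_count < 1 or seconds_running < 0:
        raise ValueError("node_count must be >= 1 and seconds_running >= 0")
    per_second_rate = hourly_rate_per_node / 3600
    return node_count * per_second_rate * seconds_running


# Hypothetical rate of 0.25 USD per node-hour, 4 nodes running for 90 minutes.
cost = cluster_cost(node_count=4, hourly_rate_per_node=0.25,
                    seconds_running=90 * 60)
print(f"{cost:.2f} USD")
```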
Cluster Management
Creating a Cluster
Deleting a Cluster
Describing a Cluster
aws redshift describe-clusters \
--cluster-identifier my-redshift-cluster
Database Management
Connecting to the Database
Use a PostgreSQL-compatible tool such as psql or a SQL client:
Creating a Database
Dropping a Database
User Management
Creating a User
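-- Redshift passwords need 8-64 characters with an uppercase letter, a lowercase letter, and a digit.
CREATE USER myuser WITH PASSWORD 'MyPassw0rd1';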
Dropping a User
Granting Permissions
Revoking Permissions
Table Management
Creating a Table
Dropping a Table
Inserting Data
INSERT INTO mytable (id, name, age) VALUES (1, 'John Doe', 30);
Updating Data
Deleting Data
Querying Data
Performance Tuning
Analyzing a Table
ANALYZE mytable;
Vacuuming a Table
VACUUM mytable;
Security
Enabling SSL
In psql or your SQL client, use the sslmode parameter:
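One way to pass sslmode is inside a libpq-style connection URI, which psql accepts as its first argument. The helper below is a small sketch that builds such a URI. The function name redshift_conninfo, the endpoint, the database, and the user are hypothetical; port 5439 is assumed as the Redshift default.

```python
from urllib.parse import quote, urlencode

# sslmode values understood by libpq-based clients such as psql.
ALLOWED_SSLMODES = {"disable", "allow", "prefer", "require",
                    "verify-ca", "verify-full"}


def redshift_conninfo(host: str, database: str, user: str,
                      port: int = 5439, sslmode: str = "require") -> str:
    """Build a connection URI that carries the sslmode parameter."""
    if sslmode not in ALLOWED_SSLMODES:
        raise ValueError(f"unsupported sslmode: {sslmode}")
    query = urlencode({"sslmode": sslmode})
    return (f"postgresql://{quote(user, safe='')}@{host}:{port}/"
            f"{quote(database, safe='')}?{query}")


# Hypothetical endpoint, database, and user.
uri = redshift_conninfo(
    host="my-redshift-cluster.example.us-east-1.redshift.amazonaws.com",
    database="dev",
    user="myuser",
)
print(uri)
```

The printed URI can be handed to psql in quotes. With sslmode=require the connection is encrypted but the server certificate is not verified; verify-ca and verify-full add that check and need the CA certificate available on the client.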
Maintenance
Resizing a Cluster
--node-type dc2.large \
--number-of-nodes 4
CPU Utilization
Database Connections
Read/Write IOPS
Network Traffic
Buckets
Access Control: For each bucket, you can control who may create,
delete, and list objects, view its access logs, and choose the
geographical region where it is stored.
Eventual Consistency: Bucket configuration changes are eventually
consistent. A deleted bucket may still appear when you list all buckets,
and versioning enabled for the first time may take a while to propagate.
Storage Classes
Frequently Accessed Objects
S3 Standard: General-purpose storage for frequently accessed
data.
S3 One Zone-IA: Less expensive, stores data in one AZ, and is not
resilient to AZ loss. Suitable for objects over 128 KB stored for at
least 30 days.
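The 128 KB and 30-day guidance can be read as billing minimums. The sketch below assumes that smaller objects are billed as 128 KB and that objects removed earlier are still billed for 30 days; these constants and the function name billable_gb_days are assumptions for illustration, so check the current S3 pricing page before relying on them.

```python
MIN_BILLABLE_BYTES = 128 * 1024  # assumed minimum billable object size
MIN_BILLABLE_DAYS = 30           # assumed minimum storage duration
GB = 1024 ** 3


def billable_gb_days(size_bytes: int, days_stored: int) -> float:
    """Billable GB-days for one object in an infrequent-access class."""
    billed_bytes = max(size_bytes, MIN_BILLABLE_BYTES)
    billed_days = max(days_stored, MIN_BILLABLE_DAYS)
    return billed_bytes / GB * billed_days


# A 10 KB object kept for 5 days is billed like a 128 KB object kept for 30 days.
small = billable_gb_days(size_bytes=10 * 1024, days_stored=5)
floor = billable_gb_days(size_bytes=128 * 1024, days_stored=30)
print(small == floor)
```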
Amazon S3 Intelligent-Tiering
Automatic Cost Optimization: Moves data between frequent and
infrequent access tiers based on access patterns.
S3 Glacier
Retrieval Options
Expedited: Access data within 1-5 minutes for urgent requests.
Additional Information
Object Storage: For S3 Standard, Standard-IA, and Glacier classes,
objects are stored across multiple devices in at least three AZs.
Overview
Amazon Athena is an interactive query service that allows you to analyze
data directly in Amazon S3 and other data sources using SQL. It is serverless.
Features
Serverless: No infrastructure to manage.
AWS Glue Integration: Works seamlessly with AWS Glue for data
cataloging.
Managed Data Catalog: Stores metadata and schemas for your
S3-stored data.
Queries
Geospatial Data: You can query geospatial data.
Cost Controls
Workgroups: Isolate queries by teams, applications, or workloads
and enforce cost controls.
No Charge for Failed Queries: You are not charged for queries that
fail.
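A small cost estimator shows how the failed-query rule fits into a pay-per-data-scanned model. The price per TB, the per-query minimum, and the names QueryRun and query_cost are assumptions for illustration, not values taken from this cheat sheet.

```python
from dataclasses import dataclass

TB = 1024 ** 4
MB = 1024 ** 2


@dataclass
class QueryRun:
    """Outcome of one query: how much data it scanned and whether it succeeded."""
    bytes_scanned: int
    succeeded: bool


def query_cost(run: QueryRun, price_per_tb: float = 5.0,
               min_bytes: int = 10 * MB) -> float:
    """Estimated charge for one query; failed queries cost nothing."""
    if not run.succeeded:
        return 0.0
    billed_bytes = max(run.bytes_scanned, min_bytes)
    return billed_bytes / TB * price_per_tb


# Hypothetical runs: one large successful query and one failed query.
runs = [QueryRun(2 * TB, True), QueryRun(500 * MB, False)]
total = sum(query_cost(r) for r in runs)
print(f"{total:.2f} USD")
```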
Benefits
Device Connectivity: Connect and stream from millions of devices.
Components
Producer: Source that puts data into a Kinesis video stream.
Video Playbacks
HLS (HTTP Live Streaming): For live playback.
Metadata
Nonpersistent Metadata: Ad hoc metadata for specific fragments.
Pricing
Pay for the volume of data ingested, stored, and consumed.
Components
Data Producer: Application emitting data records to a Kinesis data
stream, assigning partition keys to records.
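The partition key decides which shard a record lands on. Kinesis hashes the key with MD5 into a 128-bit integer and routes the record to the shard whose hash key range contains that value. The sketch below imitates this locally. The helper names and the evenly split ranges are assumptions; a real stream reports its actual ranges per shard.

```python
import hashlib
from typing import List, Tuple

MAX_HASH = 2 ** 128 - 1  # largest value of a 128-bit hash key


def even_shard_ranges(shard_count: int) -> List[Tuple[int, int]]:
    """Split the hash key space into contiguous, roughly equal ranges."""
    size = (MAX_HASH + 1) // shard_count
    ranges = []
    for i in range(shard_count):
        start = i * size
        # The last shard absorbs any remainder so the space is fully covered.
        end = MAX_HASH if i == shard_count - 1 else start + size - 1
        ranges.append((start, end))
    return ranges


def shard_for_key(partition_key: str, ranges: List[Tuple[int, int]]) -> int:
    """Return the index of the shard whose range contains the key's MD5 hash."""
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    hash_value = int(digest, 16)
    for index, (start, end) in enumerate(ranges):
        if start <= hash_value <= end:
            return index
    raise ValueError("hash value outside all shard ranges")


# Hypothetical stream with 4 shards and device IDs as partition keys.
ranges = even_shard_ranges(4)
for key in ["device-1", "device-2", "device-3"]:
    print(key, "-> shard", shard_for_key(key, ranges))
```

Because the mapping is deterministic, all records with the same partition key go to the same shard, which preserves their order but can create a hot shard if one key dominates the traffic.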
Data Record
Record: Unit of data in a stream with a sequence number, partition
key, and data blob (max 1 MB).
Sequence Number
Unique identifier for each data record, assigned by Kinesis when
data is added.
Monitoring
Monitor shard-level metrics using CloudWatch, Kinesis Agent, and
Kinesis libraries. Log API calls with CloudTrail.
Security
Automatically encrypt sensitive data with AWS KMS.
Use IAM for access control and VPC endpoints to keep traffic within
the Amazon network.
Pricing
Charged per shard hour, PUT Payload Unit, and enhanced fan-out
usage. Extended data retention incurs additional charges.
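PUT Payload Units are counted per record in fixed-size chunks, so record size affects cost as well as record count. The sketch below assumes a 25 KB chunk, taken here as 25 * 1024 bytes; the constant and the function name put_payload_units are assumptions to verify against the current pricing page.

```python
PAYLOAD_UNIT_BYTES = 25 * 1024  # assumed size of one PUT Payload Unit


def put_payload_units(record_bytes: int) -> int:
    """Number of PUT Payload Units counted for a single record."""
    if record_bytes <= 0:
        raise ValueError("record_bytes must be positive")
    # Integer ceiling division: a partly filled chunk counts as a full unit.
    return -(-record_bytes // PAYLOAD_UNIT_BYTES)


# Hypothetical record sizes: 5 KB, 26 KB, and 1 MB.
for size in (5 * 1024, 26 * 1024, 1024 * 1024):
    print(size, "bytes ->", put_payload_units(size), "unit(s)")
```

Under this assumption, many tiny records cost as much per record as a 25 KB one, which is why producers often aggregate small messages before sending them.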
Features
Scalable: Automatically scales to match data throughput.
Batch Size and Interval: Control data upload frequency and size.
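The interplay of batch size and interval can be sketched as a buffer that delivers a batch as soon as either limit is reached. This is an illustration of the idea, not Firehose's implementation; the class name DeliveryBuffer and its parameters are hypothetical, and timestamps are passed in explicitly to keep the example deterministic.

```python
from typing import List, Optional


class DeliveryBuffer:
    """Collects records and releases a batch when size or age hits its limit."""

    def __init__(self, max_bytes: int, max_seconds: float) -> None:
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self._records: List[bytes] = []
        self._size = 0
        self._opened_at: Optional[float] = None

    def add(self, record: bytes, now: float) -> Optional[List[bytes]]:
        """Buffer a record; return the batch if a flush condition is met."""
        if self._opened_at is None:
            self._opened_at = now  # the interval starts with the first record
        self._records.append(record)
        self._size += len(record)
        size_hit = self._size >= self.max_bytes
        time_hit = now - self._opened_at >= self.max_seconds
        # A real service would also flush on a timer, not only on arrival.
        return self._flush() if size_hit or time_hit else None

    def _flush(self) -> List[bytes]:
        batch = self._records
        self._records, self._size, self._opened_at = [], 0, None
        return batch


# Hypothetical limits: flush at 10 bytes or after 60 seconds.
buf = DeliveryBuffer(max_bytes=10, max_seconds=60)
print(buf.add(b"abcd", now=0))   # below both limits, nothing delivered
print(buf.add(b"efgh", now=1))   # still below both limits
print(buf.add(b"ij", now=2))     # size limit reached, batch delivered
```

Larger limits mean fewer, bigger uploads and higher latency; smaller limits deliver data sooner at the cost of many small objects.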
Pricing
Pay for the volume of data transmitted. Additional charges for data
format conversion.
General Features
Serverless: Automatically manages infrastructure.
SQL Features
Java Features
Apache Flink: Uses open-source libraries for building streaming
applications.
Components
Input: Streaming source for the application.
Pricing
Charged based on the number of KPUs used. Additional charges for
Java application orchestration and storage.
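A back-of-the-envelope estimate for KPU-based billing might look like the sketch below. It assumes one extra KPU per Java (Flink) application for orchestration and leaves storage charges out; the function name flink_app_cost and the price in the example are hypothetical.

```python
def flink_app_cost(kpus: int, hours: float, price_per_kpu_hour: float,
                   orchestration_kpus: int = 1) -> float:
    """Estimate compute cost of a streaming application billed per KPU-hour.

    Storage charges are not included in this estimate.
    """
    if kpus < 1 or hours < 0:
        raise ValueError("kpus must be >= 1 and hours >= 0")
    billed_kpus = kpus + orchestration_kpus
    return billed_kpus * hours * price_per_kpu_hour


# Hypothetical price of 0.125 USD per KPU-hour, 2 KPUs running for 10 hours.
estimate = flink_app_cost(kpus=2, hours=10, price_per_kpu_hour=0.125)
print(f"{estimate:.2f} USD")
```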