Database Notes
Building applications is all about data collection and management. If you design an e-commerce
application, you want to show available inventory catalog data to customers and collect purchase
data as they make a transaction. Similarly, if you are running an autonomous vehicle application, you need to collect sensor data and make real-time decisions based on that data. So far, you have learned about networking, storage, and compute in previous
chapters. In this chapter, you will learn the choices of database services available in AWS to
complete the core architecture tech stack.
With so many choices at your disposal, it is easy to get analysis paralysis. So, in this chapter, we will walk through the major categories of databases and the services AWS offers in each.
By the end of this chapter, you will learn about different AWS database service offerings and how
to choose a suitable database for your workload.
Relational databases have been around for decades. In a relational database, data is organized into tables,
and rows in tables are associated with other rows in other tables by using the column values
in each row as relationship keys. Another important feature of relational databases is that they
normally use Structured Query Language (SQL) to access, insert, update, and delete records. SQL
was created by IBM researchers Raymond Boyce and Donald Chamberlin in the 1970s. Relational
databases and SQL have served us well for decades.
As the internet’s popularity increased in the 1990s, we started hitting scalability limits with
relational databases. Additionally, a wider variety of data types started cropping up. Relational Database Management Systems (RDBMSs) were not well suited to these new workloads. This spurred the development of new designs, and we got the term NoSQL databases. As confusing as the term is, it does convey the idea that such a database can deal with data that is not structured, and it deals with it without necessarily using SQL. The term NoSQL was popularized in 2009 by Eric Evans and Johan Oskarsson to describe databases that were not relational.
A non-relational database gives you flexibility in how you store data and query it. Let’s see an example of making a choice between a relational and non-relational database. Take a banking application, where you want every customer’s bank account to always be consistent and roll back in case of any error. In such a scenario, you want to use a relational database. For a relational database, if some information is not available, then you are forced to store null or some other value. Now take an example of a user profile store, where users may optionally provide details such as their phone number, country, and education. In such cases, you want to use a non-relational database and store only the information provided by the user without adding null values where the user doesn’t provide details (unlike in relational databases).
Imagine a user table with columns for name, phone, education, and country. You can see that only Maverick provided their full information, while the other users left some fields empty. In a relational database, you have to store the missing information with a NULL value across all columns, while in a non-relational database, that column doesn’t exist at all.
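To make the difference concrete, here is a minimal sketch in Python: the relational side uses the standard library’s sqlite3 module, and the document side uses a plain dictionary. The second user, “Goose”, and the exact column names are illustrative assumptions.

```python
import json
import sqlite3

# Relational side: every row carries every column, so missing details
# become NULLs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, phone TEXT, education TEXT, country TEXT)")
conn.execute("INSERT INTO users VALUES ('Maverick', '555-0100', 'MSc', 'USA')")
conn.execute("INSERT INTO users (name) VALUES ('Goose')")  # other columns -> NULL

# Document side: each item stores only the attributes the user provided.
goose = {"name": "Goose"}  # no NULL placeholders; the attributes simply don't exist

print(conn.execute("SELECT * FROM users WHERE name = 'Goose'").fetchone())
# -> ('Goose', None, None, None)
print(json.dumps(goose))  # -> {"name": "Goose"}
```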
It is nothing short of amazing what has occurred since then. Hundreds of new offerings have
been developed, each trying to solve a different problem. In this environment, deciding the best
service or product to solve your problem becomes complicated. And you must consider not only your current requirements and workloads, but also whether your choice of database will be able to cover your future requirements and new demands. With so much data getting generated, it is natural that much of the innovation is driven by data. Let’s look in detail at how data
is driving innovation.
Since high-speed internet became available in the last decade, more and more data is getting
generated. Before we dive into the database services themselves, let’s look at the industry perspective on data:
• The surge of data: Our current era is witnessing an enormous surge in data generation.
Managing the vast amount of data originating from your business applications is essential.
However, the exponential growth primarily stems from the data produced by network-connected devices, including but not limited to mobile phones, connected vehicles, smart homes,
wearable technologies, household appliances, security systems, industrial equipment,
machinery, and electronic gadgets, constantly generate real-time data. Notably, over one-
third of mobile sign-ups on cellular networks result from built-in cellular connections in
most modern cars. In addition, applications generate real-time data, such as purchase
data from e-commerce sites, user behavior from mobile apps, and social media posts and feeds.
• Faster development cycles: Businesses need to ship software faster. Microservices enable developers to break down their applications into smaller, independently deployable components.
• Data as a central asset: Data is now a strategic asset, and it must be incorporated into every aspect of the business, rather than just being an after-the-fact activity. Monitoring the organization’s operations in real time is critical to fuel innovation and quick decision-making, whether through human intervention or automation.
• The pace of change: Technology has accelerated the rate of change and change management, enabling businesses to adapt to evolving market needs and stay ahead of the competition.
Now that you have seen the trends the industry is adopting, let’s learn some basics of databases and
learn about the database consistency model in more detail.
In the context of databases, ensuring transaction data consistency involves restricting any database transaction’s ability to modify data in unauthorized ways. When data is written to the database, it must follow the rules and constraints defined for it. This stringent process ensures that data integrity is maintained and that the information stored in the database is accurate and trustworthy. Currently, there are two popular data consistency models.
We’ll discuss these models in the following subsections.
When database sizes were measured in megabytes, we could have stringent requirements that enforced strict consistency. Since storage has become exponentially cheaper, databases can be much bigger, often measured in terabytes and even petabytes. For this reason, making databases strictly consistent at this scale is much harder. The first popular consistency model, ACID, stands for the following:
• Atomicity: A transaction either completes entirely or not at all; partial updates are never persisted.
• Consistency: Every transaction takes the database from one valid state to another valid state.
• Isolation: Concurrent transactions produce the same result as if they had been executed one after another.
• Durability: Once a transaction is committed, it stays committed, even in the event of a failure.
ACID was taken as the law of the land for many years, but a new model emerged with the advent
of bigger-scale projects and implementations. In many instances, the ACID model is more pessi-
mistic than required, and it’s too safe at the expense of scalability and performance.
The BASE model relaxes some ACID requirements, such as data freshness, immediate consistency, and accuracy, to gain other benefits, such as scalability, resilience, and performance. BASE stands for the following:
• Basic availability: The data is available for the majority of the time (but not necessarily all the time).
• Soft state: Replicas do not have to be mutually consistent at all times. Some readers may get the latest numbers, and others may be a few milliseconds behind and not have the latest updates. In this case, the readers will have different results, but if they rerun the query soon after, they are likely to see the newer values. For many workloads, the trade-off of getting the results fast versus being entirely up to date may be acceptable.
• Eventual consistency: The data exhibits consistency eventually, maybe not immediately, but by the time the data is retrieved at a later point.
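To see the ACID guarantees in action, here is a small, self-contained Python sketch using the standard library’s sqlite3 module; the account names, balances, and CHECK constraint are illustrative assumptions. When the second statement fails, the whole transaction rolls back, so the first statement’s effect disappears too.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY,"
             " balance INTEGER CHECK (balance >= 0))")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 0)])
conn.commit()

def transfer(amount):
    try:
        with conn:  # one transaction: commit on success, roll back on error
            conn.execute("UPDATE accounts SET balance = balance + ?"
                         " WHERE name = 'bob'", (amount,))
            conn.execute("UPDATE accounts SET balance = balance - ?"
                         " WHERE name = 'alice'", (amount,))  # fails if overdrawn
    except sqlite3.IntegrityError:
        print("transfer rolled back, nothing was applied")

transfer(500)  # alice only has 100, so the CHECK constraint fires
print(conn.execute("SELECT * FROM accounts ORDER BY name").fetchall())
# -> [('alice', 100), ('bob', 0)]  -- bob's credit was undone too (atomicity)
```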
Let’s look at the database usage model, which is a crucial differentiator when storing your data.
Whichever database you choose, two basic operations, ingesting data and retrieving data, will always be present. On the ingestion side, the data will arrive in two different ways. It will either be a change data capture (CDC) set, which captures changes to existing data, or brand-new data.
But what drives your choice of database is not the fact that these two operations are present, but rather questions such as how often the data will be written versus read, how quickly it must be ingested and retrieved, how much data there will be, and how it will be queried.
Two technologies have been the standards to address these questions for many years: online transaction processing (OLTP) systems and online analytics processing (OLAP) systems. The fundamental question that needs to be answered is: is it more important for the database to perform during data ingestion or retrieval? In other words, does the database need to be read-heavy or write-heavy?
OLTP databases are optimized for write operations (such as inserts and updates) and typically normalize their schemas, often to the third normal form (3NF).
A table using 3NF will reduce data duplication, minimize data anomalies, and guarantee referential integrity. The normal forms were defined by Edgar F. Codd, the inventor of the relational model for database management.
A database relation (for example, a database table) meets the 3NF standard if each table’s columns only depend on the table’s primary key. Let’s look at an example of a table that fails to meet this standard. Suppose an employee table, among other columns, contains the employee’s supervisor’s name as well as the supervisor’s phone number. A supervisor can undoubtedly have more than one employee under supervision, so the supervisor’s phone number would be duplicated across many employee rows. To resolve this issue, we could add a supervisor table, put the supervisor’s name and phone number in the supervisor table, and remove the phone number from the employee table.
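The fix described above looks like the following in code, a minimal sketch using Python’s sqlite3 module with assumed table and column names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Normalized (3NF): the supervisor's phone number lives in exactly one row,
# and employees reference it through a foreign key.
conn.executescript("""
    CREATE TABLE supervisors (
        supervisor_id INTEGER PRIMARY KEY,
        name          TEXT NOT NULL,
        phone         TEXT NOT NULL
    );
    CREATE TABLE employees (
        employee_id   INTEGER PRIMARY KEY,
        name          TEXT NOT NULL,
        supervisor_id INTEGER REFERENCES supervisors (supervisor_id)
    );
    INSERT INTO supervisors VALUES (1, 'Sam', '555-0101');
    INSERT INTO employees  VALUES (10, 'Ana', 1), (11, 'Raj', 1);
""")

# Changing the phone number is now a single-row update rather than a change
# to every employee row that used to duplicate it.
conn.execute("UPDATE supervisors SET phone = '555-0199' WHERE supervisor_id = 1")
```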
Conversely, OLAP databases do not process many transactions. Once data is ingested, it is usually read many times but rarely modified, and retrieval is often performed using a query language, typically SQL.
Queries in an OLAP environment are often complex and involve subqueries and aggregations.
In the context of OLAP systems, the performance of queries is the relevant measure. An OLAP
database typically contains historical data aggregated and stored in multi-dimensional schemas
(typically using the star schema).
For example, a bank might handle millions of daily transactions in an OLTP system, storing deposits, withdrawals, and transfers, and later aggregate that activity in an OLAP system for reporting and analysis.
As you have learned about the database consistency model and its uses, you must be wondering which model is suitable for which system. Broadly, ACID is a natural fit for OLTP workloads, while BASE can be applied for OLAP.
Let’s go further and learn about the various kinds of database services available in AWS and how they map to these categories. AWS offers a broad range of database services that are purpose-built for every major use case. These services are battle-tested and provide deep functionality, so you get the high availability, performance, reliability, and security required by production workloads.
The lineup includes relational databases for transactional applications, such as Amazon RDS and Amazon Aurora, non-relational databases like
Amazon DynamoDB for internet-scale applications, an in-memory data store called Amazon
ElastiCache for caching and real-time workloads, and a graph database, Amazon Neptune, for
developing applications with highly connected data. Migrating your existing databases to AWS
is made simple and cost-effective with the AWS Database Migration Service. Each of these database services is so vast that covering it in detail would warrant a book of its own.
There has been a lot of innovation in the database space, but relational databases have served us well for many years without needing any other type of database. A relational database is probably the first thing that comes to mind when you think about storing data. So, let’s analyze the different relational options that AWS offers us.
Given what we said in the previous section, it is not surprising that Amazon has a robust lineup of relational database services. It is possible to install your database on an EC2 instance and manage it yourself. However, unless you have an unusual requirement that demands full control, when you consider all the costs, including system administration costs, you will most likely be better off and save money using Amazon RDS.
Amazon RDS was designed by AWS to simplify the management of crucial transactional appli-
cations by providing an easy-to-use platform for setting up, operating, and scaling a relational
database in the cloud. With RDS, laborious administrative tasks such as hardware provisioning, database setup, patching, and backups are automated. RDS offers instance types optimized for memory, performance, or I/O, and supports six well-known database engines: Amazon Aurora (compatible with MySQL and PostgreSQL), MySQL, PostgreSQL, MariaDB, SQL Server, and Oracle.
If you want more control of your database at the OS level, AWS has now launched Amazon RDS
Custom. It provisions all AWS resources in your account, enabling full access to the underlying Amazon EC2 resources and the database environment.
You can install third-party and packaged applications directly onto the database instance, just as you would in a self-managed environment, while still benefiting from the automation that RDS traditionally provides.
• Community (Postgres, MySQL, and MariaDB): AWS offers RDS with three different open-source engines: PostgreSQL, MySQL, and MariaDB.
• Amazon Aurora (Postgres and MySQL): As you can see, Postgres and MySQL are here, as they are in the community offering, but running them within the Aurora wrapper can add many benefits. AWS started offering the MySQL-compatible service in 2014 and added the Postgres version in 2017. Some of these benefits are as follows:
a. Fivefold performance increase over the vanilla MySQL version
b. Automatic six-way replication across availability zones to improve availability and fault tolerance
• Commercial (Oracle and SQL Server): Many organizations still run Oracle workloads, so AWS offers RDS for Oracle and SQL Server as well. Keep in mind that, on top of the cost of this service, there will be a licensing cost associated with using these engines, which otherwise might not be present if you use a community edition.
As discussed, RDS is a fully managed database service offered by AWS. Let’s look at the key attributes that make your database more resilient and performant.
Multi-AZ deployments - Multi-AZ deployments in RDS provide improved availability and du-
rability for database instances, making them an ideal choice for production database workloads.
With Multi-AZ DB instances, RDS synchronously replicates data to a standby instance in a dif-
ferent Availability Zone (AZ) for enhanced resilience. You can change your environment from
Single-AZ to Multi-AZ at any time. Each AZ runs on its own distinct, independent infrastructure
and is built to be highly dependable.
In the event of an infrastructure failure, RDS initiates an automatic failover to the standby instance,
allowing you to resume database operations as soon as the failover is complete. Additionally, the
endpoint for your DB instance remains the same after a failover, eliminating manual administrative intervention and enabling your application to resume database operations seamlessly.
Read replicas - RDS makes it easy to create read replicas of your database and automatically keeps them in sync with the primary database (for MySQL, PostgreSQL, and MariaDB engines). Read replicas are helpful for both read scaling and disaster recovery use cases. You can add read replicas to handle read workloads, so your master database doesn’t become overloaded with read requests. Depending on the database engine, you may be able to position your read replica in a different region than your master, providing you with the option of having a read location closer to your users, as well as a promotion target in case of an issue with the master, ensuring you have coverage in the event of a disaster.
While both Multi-AZ deployments and read replicas can be used independently, they can also
be used together to provide even greater availability and performance for your database. In this
case, you would create a Multi-AZ deployment for your primary database and then create one or more read replicas to serve read traffic.
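Both capabilities can be enabled programmatically. Here is a hedged sketch using boto3; the identifiers, instance class, and storage size are placeholders rather than recommendations, and region/credential configuration is assumed:

```python
import boto3

rds = boto3.client("rds")

# Primary instance with a synchronous standby in a second AZ.
rds.create_db_instance(
    DBInstanceIdentifier="orders-db",
    Engine="mysql",
    DBInstanceClass="db.m5.large",
    AllocatedStorage=100,
    MasterUsername="admin",
    MasterUserPassword="change-me",  # use Secrets Manager in practice
    MultiAZ=True,
)

# Asynchronously replicated read replica to offload read traffic.
rds.create_db_instance_read_replica(
    DBInstanceIdentifier="orders-db-replica-1",
    SourceDBInstanceIdentifier="orders-db",
)
```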
Automated backup - With RDS, a scheduled backup is automatically performed once a day during a backup window that you define. You can set the retention period for your backups, which can be up to 35 days. While automated backups are available for 35 days, you can retain longer backups using the manual snapshots feature provided by RDS. RDS keeps multiple copies of your backup in each AZ where you have an instance deployed to ensure their durability and availability. During the automatic backup window, storage I/O may be briefly suspended, so choose the window carefully to help achieve high performance if your application is time-sensitive and needs to be always on.
Database snapshots - You can manually create backups of your instance stored in Amazon S3,
which are retained until you decide to remove them. You can use a database snapshot to create a
new instance whenever needed. Even though database snapshots function as complete backups,
you are charged only for incremental storage usage.
Data storage - Amazon RDS supports the most demanding database applications by utilizing Amazon Elastic Block Store (Amazon EBS) volumes for storage. There are two SSD-backed storage options to choose from: a cost-effective general-purpose option and a provisioned IOPS option for I/O-intensive workloads.
Scalability - You can often scale your RDS database compute and storage resources without downtime to meet your performance and price requirements. You may want to scale your database instance up or down, including
scaling up to handle the higher load, scaling down to preserve resources when you have a lower
load, and scaling up and down to control costs if you have regular periods of high and low usage.
Monitoring - RDS offers a set of 15-18 monitoring metrics that are automatically available for you. These let you monitor crucial aspects such as CPU utilization, memory usage, storage, and latency. You can view the metrics in individual or multiple graphs or integrate them into your existing monitoring tool. Additionally, RDS provides Enhanced Monitoring, which offers access to more than 50 additional metrics. By enabling Enhanced Monitoring, you can specify the granularity at which you want these metrics collected.
Amazon RDS Performance Insights is a performance monitoring tool for Amazon RDS databases.
It allows you to monitor the performance of your databases in real-time and provides insights
and recommendations for improving the performance of your applications. With Performance
Insights, you can view a graphical representation of your database’s performance over time and pinpoint the queries and waits that generate the most load. Performance Insights also provides recommendations for improving the performance of your workloads.
Security - Controlling network access to your database is made simple with RDS. You can run
your database instances in Amazon Virtual Private Cloud (Amazon VPC) to isolate them and restrict network access to them. Additionally, most RDS engine types offer encryption at rest, and all engines support encryption
in transit. RDS offers a wide range of compliance readiness, including HIPAA eligibility.
You can learn more about RDS by visiting the AWS page: https://aws.amazon.com/rds/.
As you have learned about RDS, let’s dive deeper into AWS cloud-native databases with Amazon
Aurora.
Amazon Aurora is a relational database service that blends the availability and rapidity of high-
end commercial databases with the simplicity and cost-effectiveness of open-source databases.
Aurora is built with full compatibility with the MySQL and PostgreSQL engines, enabling applications written for those engines to run without changes.
Amazon Aurora has many key features that have been added to expand the service’s capabilities
since it launched in 2014. Let’s review some of these key features:
• Serverless - Amazon Aurora Serverless is an on-demand, auto-scaling configuration that lets your database automatically start up, shut down, and adjust its capacity based on the needs of your application. Amazon Aurora Serverless v2 scales almost instantly to accommodate hundreds of thousands of transactions, adjusting capacity in fine-grained increments to ensure the right resources for your application. You won’t have to manage the database
capacity, and you’ll only pay for your application’s resources. Compared to peak load
provisioning capacity, you could save up to 90% of your database cost with Amazon
Aurora Serverless.
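A sketch of provisioning a Serverless v2 cluster with boto3 might look like the following; the cluster name and capacity bounds are illustrative assumptions:

```python
import boto3

rds = boto3.client("rds")

rds.create_db_cluster(
    DBClusterIdentifier="app-cluster",
    Engine="aurora-postgresql",
    MasterUsername="admin",
    MasterUserPassword="change-me",  # use Secrets Manager in practice
    ServerlessV2ScalingConfiguration={
        "MinCapacity": 0.5,   # Aurora Capacity Units (ACUs)
        "MaxCapacity": 16.0,  # ceiling Aurora may scale up to under load
    },
)

# Instances in a Serverless v2 cluster use the special db.serverless class.
rds.create_db_instance(
    DBInstanceIdentifier="app-cluster-writer",
    DBClusterIdentifier="app-cluster",
    Engine="aurora-postgresql",
    DBInstanceClass="db.serverless",
)
```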
• Global Database - Amazon Aurora Global Database allows a single database to span across multiple AWS regions, allowing for faster local reads and rapid disaster recovery.
Global Database utilizes storage-based replication to replicate your database across var-
ious regions, typically resulting in less than one-second latency. By utilizing a secondary
region, you can have a backup option in case of a regional outage or degradation and can
quickly recover. Additionally, it takes less than one minute to promote a database in the
secondary region to full read/write capabilities.
• Encryption - With Amazon Aurora, you can encrypt your databases by using keys you
create and manage through AWS Key Management Service (AWS KMS). When you use
Amazon Aurora encryption, data stored on the underlying storage and automated backups,
snapshots, and replicas within the same cluster are encrypted. Amazon Aurora secures
data in transit using SSL (AES-256).
• Automatic, continuous, incremental backups and point-in-time restore - Amazon Aurora continuously backs up your data to Amazon S3, enabling you to restore your database to any second within your retention period. This capability is known as point-in-time recovery.
• Monitoring and metrics - Amazon Aurora offers a range of monitoring and performance
tools to help you keep your database instances running smoothly. You can use Amazon
CloudWatch metrics at no additional cost to monitor over 20 key operational metrics, such
as compute, memory, storage, query throughput, cache hit ratio, and active connections.
If you need more detailed insights, you can use Enhanced Monitoring to gather metrics
from the operating system instance that your database runs on. Additionally, you can use
Amazon RDS Performance Insights, a powerful database monitoring tool that provides an
easy-to-understand dashboard for visualizing database load and detecting performance
problems, so you can take corrective action quickly.
• Governance - Integration with AWS CloudTrail provides a record of actions across your AWS infrastructure, providing you with oversight over storage, analysis, and corrective actions. You can ensure your organization remains compliant with regulations such as HIPAA and PCI DSS. The platform enables you to capture
and unify user activity and API usage across AWS Regions and accounts in a centralized,
controlled environment, which can help you avoid penalties.
• Amazon Aurora machine learning - With Amazon Aurora machine learning, you can
incorporate machine learning predictions into your applications through the SQL programming language, eliminating the need to acquire separate tools or possess prior machine
learning experience. It offers a straightforward, optimized, and secure integration between
Aurora and AWS ML services, eliminating the need to create custom integrations or move
data between them.
Enterprise use cases for Amazon Aurora span multiple industries. Here are examples of some of them:
• Meet security and compliance requirements - Amazon Aurora provides several security features, such as encryption at rest and in transit, to help protect your data and ensure compliance with industry standards and regulations.
• Deploy globally distributed applications - Achieve multi-region scalability and resilience for internet-scale applications, such as mobile games, social media apps, and online services, using Aurora Global Database.
• Scale capacity automatically - To meet variable capacity requirements, databases are often split across multiple instances, but this can lead to over-provisioning or under-provisioning, resulting in increased costs or limited scalability. Aurora Serverless solves this problem by automatically scaling the capacity of instances up and down. You will not need to manage or upsize your servers manually. You need to set a min/max capacity unit setting and allow Aurora to scale to meet the load.
Amazon RDS Proxy is a service that acts as a database proxy for Amazon Relational Database
Service (RDS). It is fully managed by AWS and helps to increase the scalability and resilience of your applications. Its key benefits include the following:
• Improved scalability: RDS Proxy automatically scales to handle a large number of con-
current connections, making it easier for your application to scale.
• Better resilience to database failures: RDS Proxy can automatically failover to a standby
replica if the primary database instance becomes unavailable, reducing downtime and
improving availability.
• Enhanced security: RDS Proxy can authenticate and authorize incoming connections,
helping to prevent unauthorized access to your database. It can also encrypt data in transit,
providing an extra security measure for your data.
With RDS Proxy, database passwords are managed by AWS Secrets Manager. Consider an Aurora deployment put across two Availability Zones (AZs) to achieve high availability, where AZ1 hosts Aurora’s primary database while AZ2 has the Aurora read replica.
With RDS Proxy, when a failover happens, application connections are preserved. Only trans-
actions that are actively sending or processing data will be impacted. During failover, the proxy routes new connections to the promoted database instance, so applications can resume without configuration changes.
Amazon RDS Proxy is a useful tool for improving the performance, availability, and security of
your database-powered applications. It is fully managed, so you don’t have to worry about the
underlying infrastructure, and it can help make your applications more scalable, resilient, and
secure. You can learn more about RDS Proxy by visiting the AWS page: https://aws.amazon.
com/rds/proxy/.
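As a rough sketch of the setup, the boto3 calls below create a proxy and register a database behind it; the IAM role, Secrets Manager secret, subnets, and instance identifier are placeholders from a hypothetical account:

```python
import boto3

rds = boto3.client("rds")

rds.create_db_proxy(
    DBProxyName="orders-proxy",
    EngineFamily="MYSQL",
    RoleArn="arn:aws:iam::123456789012:role/rds-proxy-role",  # placeholder
    Auth=[{
        "AuthScheme": "SECRETS",  # credentials come from Secrets Manager
        "SecretArn": "arn:aws:secretsmanager:us-east-1:123456789012:secret:db-creds",
        "IAMAuth": "DISABLED",
    }],
    VpcSubnetIds=["subnet-0abc1234", "subnet-0def5678"],  # placeholders
)

# Point the proxy's default target group at the database it should front.
rds.register_db_proxy_targets(
    DBProxyName="orders-proxy",
    DBInstanceIdentifiers=["orders-db"],
)
```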
High availability and performance are a database’s most essential and tricky parts. But this problem can be solved in an intelligent way using machine learning. Let’s look at the recently launched Amazon DevOps Guru for RDS, which helps with database performance issues using ML.
DevOps Guru for Amazon RDS is a recently introduced capability that uses machine learning to
automatically identify and troubleshoot performance and operational problems related to relational databases. It can detect issues such as resource overutilization or problematic SQL queries and provide recommendations for resolution, helping developers
address these issues quickly. DevOps Guru for Amazon RDS utilizes machine learning models to
deliver these insights and suggestions.
DevOps Guru for RDS aids in quickly resolving operational problems related to databases by
notifying developers immediately via Amazon Simple Notification Service (SNS) and EventBridge when issues arise. It also provides diagnostic information, as well as intelligent
remediation recommendations, and details on the extent of the issue.
AWS keeps adding innovations for Amazon RDS as a core service. Recently, they launched Amazon RDS instances available on AWS’s Graviton2 chip, which helps them offer lower prices with better performance, and with RDS on AWS Outposts, you can even run your database near your workload in an on-premises environment. You can learn more about DevOps Guru by visiting the AWS page: https://aws.amazon.com/devops-guru/.
Besides SQL databases, Amazon’s well-known NoSQL databases are very popular. Let’s learn
about some NoSQL databases.
When it comes to NoSQL databases, you need to understand two main categories: key-value and document. A key-value database suits workloads that need high throughput, low-latency reads and writes, and endless scale, while a document database stores documents and quickly accesses and queries on any attribute. Document and key-value databases are close cousins. Both store values against keys; the difference is that in a document database, the value stored (the document) will be transparent to the database and, therefore, can be indexed to assist retrieval. In the case of a key-value database, the value is opaque and will not be scannable, indexed, or visible until the value is retrieved by specifying the key. Retrieving a value without using the key would require a full table scan. Content is stored as
Binary Large Objects (BLOBs). Data stores keep the value without regard for the content. Data in key-value database records is accessed using the key (in rare instances, through a full table scan), and queries on anything other than the key are neither supported nor a primary consideration. Key-value stores often use the hash table pattern to store the keys. No column-type relationships exist, which keeps the implementation details simple. Starting in the key-value category, AWS provides Amazon DynamoDB. Let’s learn more about DynamoDB.
DynamoDB is a fully managed, multi-Region, multi-active database that delivers exceptional
performance, with single-digit-millisecond latency, at any scale. It is capable of handling more
than 10 trillion daily requests, with the ability to support peaks of over 20 million requests per sec-
ond, making it an ideal choice for internet-scale applications. DynamoDB offers built-in security,
backup and restore features, and in-memory caching. One of the unique features of DynamoDB
is its elastic scaling, which allows for seamless growth as the number of users and required I/O
throughput increases. You pay only for the storage and I/O throughput you provision, or on a pay-per-request basis if you choose on-demand mode. DynamoDB can scale to support millions of users making thousands of concurrent requests every second. In addition, it provides fine-grained access control and support for end-to-end encryption to ensure data security. Key characteristics include the following:
• Fully managed
• Supports multi-region deployment
• Multi-master deployment
• Fine-grained identity and access control
DynamoDB provides the option of on-demand backups for archiving data to meet regulatory requirements. Additionally, you can enable continuous backups for point-in-time recovery, allowing restoration to any point in the last 35 days with per-second granularity. All backups are automatically encrypted.
DynamoDB is built for high availability and durability. All writes are persisted on an SSD disk and automatically replicated across multiple Availability Zones.
DynamoDB Accelerator (DAX) is a managed, highly available, in-memory cache for DynamoDB that keeps reads fast even at high request rates. DAX eliminates the need for you to manage cache invalidation, data population, or cluster management, and delivers microsecond latency by doing all the heavy
lifting required to add in-memory acceleration to your DynamoDB tables.
DynamoDB is a table-based database. While creating the table, you can specify at least three components:
1. Keys: First, a partition key to uniquely identify and distribute items, and second, a sort key to sort and retrieve a batch of data in a given range. For example, transaction ID can be your primary key, and transaction date-time can be the sort key.
2. WCU: Write capacity units define at what rate you want to write your data in DynamoDB.
3. RCU: Read capacity units define at what rate you want to read from your given DynamoDB table.
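Putting those three components together, a minimal boto3 sketch for the transaction table described above could look like this (the table name and capacity values are illustrative):

```python
import boto3

dynamodb = boto3.client("dynamodb")

dynamodb.create_table(
    TableName="transactions",
    AttributeDefinitions=[
        {"AttributeName": "transaction_id", "AttributeType": "S"},
        {"AttributeName": "transaction_datetime", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "transaction_id", "KeyType": "HASH"},         # partition key
        {"AttributeName": "transaction_datetime", "KeyType": "RANGE"},  # sort key
    ],
    # The provisioned rates discussed above: reads and writes per second.
    ProvisionedThroughput={"ReadCapacityUnits": 10, "WriteCapacityUnits": 5},
)
```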
Behind the scenes, DynamoDB stores the table’s data across multiple partitions, and the provisioned capacity of the table is equally distributed across all partitions.
As mentioned, DynamoDB is a table-based database, and within the table, you have items. An item’s primary key can consist of one attribute or two. As more items are added to the table in DynamoDB, it becomes apparent that attributes can differ between items, and each item can have a unique set of attributes. Additionally, the primary key is the only mandatory part of the schema.
Sometimes you need to query data using the primary key, and sometimes you need to query by other attributes; for this, DynamoDB provides secondary indexes. A local secondary index (LSI) is stored in the same partition as the item in the table, ensuring consistency. Whenever an item is updated, the corresponding LSI is also updated and acknowledged.
An LSI shares its partition key with the parent table but can have a different sort key. In the index, you can choose to have just the keys or to project additional attributes, depending on your query scenarios. A global secondary index (GSI), on the other hand, allows you to query data based on attributes that are not part of the primary key or sort key
of the base table. It’s useful when you want to perform ad-hoc queries on different attributes of
the data or when you need to support multiple access patterns. A GSI can be used to scale read
queries beyond the capacity of the base table and can also be used to query data across partitions.
In general, if your data size is small enough, and you only need to query data based on a different
sort key within the same partition key, you should use an LSI. If your data size is larger, or you
need to query data based on attributes that are not part of the primary key or sort key, you should
use a GSI. However, keep in mind that a GSI comes with some additional cost and complexity in
terms of provisioned throughput, index maintenance, and eventual consistency.
If an item collection’s data size exceeds 10 GB, the only option is to use a GSI as an LSI limits the
data size in a particular partition. If eventual consistency is acceptable for your use case, a GSI
can be used as it is suitable for 99% of scenarios.
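As an illustration of the GSI path, the hedged sketch below adds an index on a hypothetical customer_id attribute to the transactions table from earlier and then queries it:

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Create a GSI keyed on an attribute outside the table's primary key.
dynamodb.update_table(
    TableName="transactions",
    AttributeDefinitions=[
        {"AttributeName": "customer_id", "AttributeType": "S"},
    ],
    GlobalSecondaryIndexUpdates=[{
        "Create": {
            "IndexName": "by-customer",
            "KeySchema": [{"AttributeName": "customer_id", "KeyType": "HASH"}],
            "Projection": {"ProjectionType": "ALL"},
            "ProvisionedThroughput": {"ReadCapacityUnits": 5,
                                      "WriteCapacityUnits": 5},
        }
    }],
)

# Queries against the GSI name it explicitly and are eventually consistent.
resp = dynamodb.query(
    TableName="transactions",
    IndexName="by-customer",
    KeyConditionExpression="customer_id = :c",
    ExpressionAttributeValues={":c": {"S": "cust-42"}},
)
print(resp["Items"])
```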
DynamoDB is very useful in designing serverless event-driven architectures. You can capture item-level data changes, e.g., PutItem, UpdateItem, and DeleteItem, by using DynamoDB Streams.
You can learn more about Amazon DynamoDB by visiting the AWS page: https://aws.amazon.
com/dynamodb/.
You may want a more sophisticated database when storing JSON documents that need indexing and fast
search. Let’s learn about the AWS document database offering.
In this section, let’s talk about Amazon DocumentDB, “a fully managed MongoDB-compatible da-
tabase service designed from the ground up to be fast, scalable, and highly available”. DocumentDB is
a purpose-built document database engineered for the cloud that provides millions of requests
per second with millisecond latency and can scale out to 15 read replicas in minutes. It is compatible with MongoDB 3.6/4.0 and managed by AWS, which means no hardware provisioning is needed, and you get auto-patching, quick setup, strong security, and automatic backups.
If you look at the evolution of document databases, what was the need for them? JSON became the de facto standard for data exchange and data modeling within applications. Using JSON in the application and then trying to map JSON to relational databases introduced friction and complication.
Object Relational Mappers (ORMs) were created to help with this friction, but there were complications with performance and functionality. A crop of document databases popped up to solve this problem. They share some common traits:
• Data is stored in documents that are in a JSON-like format, and these documents are first-class objects. Rather than documents being stored as values or data types, in DocumentDB, documents are the key design point of the database.
• Documents offer a flexible schema that can evolve with your application and handle semi-structured data. Additionally, powerful indexing capabilities make querying such documents much faster.
• The document model maps naturally to the objects in your application code, eliminating the need to translate data between your application and the database.
Compared to traditional relational databases, Amazon DocumentDB offers several advantages,
such as:
• On-demand instance pricing: You can pay by the hour without any upfront fees or long-term commitments, which eliminates the complexity of planning and purchasing database capacity ahead of your needs and makes it cheap to experiment with development and testing.
• Compatibility with MongoDB 3.x and 4.x: Amazon DocumentDB supports MongoDB 3.6 and 4.0 drivers and tools, allowing customers to use their existing applications, drivers, and tools with little or no change.
• Flexible document model: DocumentDB works well when the data being stored has a complex or hierarchical structure, or when the data being stored is subject to frequent changes.
• High performance: DocumentDB is designed for high performance and can be well suited
for applications that require fast read and write access to data.
• Scalability: DocumentDB is designed to be horizontally scalable, which means that it can
be easily expanded to support large amounts of data and a high number of concurrent users.
• Easy querying: DocumentDB supports a rich, MongoDB-compatible query language that makes it easy to retrieve and manipulate data within the database.
DocumentDB has recently introduced new features that allow for ACID transactions across mul-
tiple documents, statements, collections, or databases. You can learn more about Amazon Docu-
mentDB by visiting the AWS page: https://aws.amazon.com/documentdb/.
If you want sub-millisecond performance, you need your data in memory. Let’s learn about
in-memory databases.
Studies have shown that if your site is slow, even for just a few seconds, you will lose customers. A slow site results in 90% of your customers leaving the site, and 57% of those customers will purchase elsewhere. These are demanding times for services, and to keep up with demand, you need to ensure that users aren’t kept waiting to purchase your services, products, and offerings so that you can continue growing your business.
Applications and databases have changed dramatically not only in the past 50 years but also just in the past 10. Where you might have had a few thousand users who could wait for a few seconds for a page to load, you now have millions who expect near-instant responses. That shift has forced the application and database world to rethink how data is stored and accessed. It’s essential to use the right tool for the job. In-memory data stores are used when there is a need for maximum throughput: you cache data in memory, which helps to increase performance by taking the load off the backing database.
In-memory databases, or IMDBs for short, usually store the entire data in the main memory.
Contrast this with databases that use a machine’s RAM for optimization but do not store all
the data simultaneously in primary memory and instead rely on disk storage. IMDBs generally
perform better than disk-optimized databases because disk access is slower than direct memory
access. In-memory operations are more straightforward and can be performed using fewer CPU
cycles. In-memory data access eliminates seek time when querying the data, which enables faster and more predictable performance. In terms of raw performance, in-memory operations are usually measured in nanoseconds, whereas operations that
require disk access are usually measured in milliseconds. So, in-memory operations are usually
about a million times faster than operations needing disk access.
Some use cases of in-memory databases are real-time analytics, chat apps, gaming leaderboards, session stores, and caching.
Based on your data access pattern, you can use either lazy caching or write-through. In lazy caching, the cache engine checks whether the data is in the cache and, if not, gets it from the database and keeps it in the cache to serve future requests. Lazy caching is also called the cache-aside pattern.
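A minimal cache-aside sketch in Python with the redis-py client might look like the following; the endpoint, TTL, and the load_user_from_database helper are hypothetical stand-ins for your own backing store:

```python
import json
import redis  # pip install redis

cache = redis.Redis(host="my-cache.example.com", port=6379)  # placeholder endpoint

def load_user_from_database(user_id):
    """Hypothetical stand-in for a query against the backing database."""
    return {"id": user_id, "name": "Maverick"}

def get_user(user_id):
    """Lazy caching (cache-aside): try the cache first, then the database."""
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)             # cache hit: served from memory
    user = load_user_from_database(user_id)   # cache miss: fall back to the DB
    cache.setex(key, 300, json.dumps(user))   # populate cache with a 5-minute TTL
    return user
```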
In the write-through pattern, the application writes through the caching engine, which updates the cache and persists the change to the database, so subsequent reads find the data already in the cache. The Amazon ElastiCache service is an AWS-provided cache database. Let’s learn more details about it.
Amazon ElastiCache is a cloud-based web service that enables users to deploy and manage an
in-memory cache with ease. By storing frequently accessed data in memory, in-memory caches
can enhance application performance, enabling faster data access compared to retrieving data
from a slower backing store such as a disk-based database. ElastiCache offers support for two
popular open-source in-memory cache engines: Memcached and Redis. Both engines are well
known for their reliability, scalability, and performance.
With ElastiCache, you can quickly and easily set up, manage, and scale an in-memory cache in
the cloud.
The service automates time-consuming tasks such as provisioning hardware, patching software, setting up clusters, and monitoring the health of the cache environment, allowing you to focus on developing
and deploying your application.
In addition, ElastiCache integrates seamlessly with other AWS services, making it easy to use the cache alongside the rest of your architecture. Amazon ElastiCache helps you build data-intensive applications. In-memory data storage boosts applications’ performance by retrieving data directly from memory.
Since ElastiCache offers two engines, a natural question is which one to choose. Your team may already be a user of Memcached. If your organization has already committed to Memcached, it is likely not worth switching. No one can predict the future, but as of this writing, Redis continues to gain supporters. Here is a comparison of the features and capabilities of the two:
AWS recently launched Amazon MemoryDB for Redis due to Redis’s popularity. It is a durable, in-memory database service that provides ultra-fast performance and is compatible with Redis. With MemoryDB, all of your data is stored in memory, allowing for microsecond read and single-digit millisecond write latency,
as well as high throughput. You can learn more about Amazon MemoryDB by visiting the AWS
page: https://aws.amazon.com/memorydb/.
Now, data is getting more complicated, with many-to-many relationships and several layers of connections, for example, linking users by their common likes. Let’s look at graph databases to solve this problem.
Graph databases are data stores that treat relationships between records as first-class citizens. In traditional databases, relationships are often an afterthought. In the case of relational databases, relationships are implicit and manifest themselves as foreign key relationships. In graph databases, relationships are explicit: records are stored as nodes (also called vertices), and these relationships are called edges. Because relationships are stored directly, for certain use cases, they offer much better data retrieval performance than traditional databases.
As you can imagine, graph databases are particularly suited for use cases that place heavy im-
portance on relationships among entities.
Following an edge in a graph database takes roughly constant time. With graph databases, it is not uncommon to be able to traverse millions of edges
per second.
Graph databases can handle nodes with many edges regardless of the dataset’s number of nodes.
You only need a pattern and an initial node to traverse a graph database. Graph databases can
easily navigate the adjacent edges and nodes around an initial starting node while caching and
aggregating data from the visited nodes and edges. As an example of a pattern and a starting
point, you might have a database that contains ancestry information. In this case, the starting
point might be you, and the pattern might be a parent.
Nodes can share edges regardless of the quantity or type without a performance penalty. Edges can carry labels and properties of their own. A popular framework for working with property graphs is Apache TinkerPop, an open-source graph computing project. It provides an imperative traversal language, called Gremlin, that can be used to write traversals on property graphs, and many open-source and vendor implementations support it. You might find that the Gremlin traversal language could be a favorable option as it offers a method to navigate through property
graphs. You might also like openCypher, an open-source declarative query language for graphs,
as it provides a familiar SQL-like structure to compose queries for graph data.
The Resource Description Framework (RDF) also represents a labeled, directed multi-graph, but it uses the concept of triples, subject, predicate, and object, to encode the graph. Now let’s look at Amazon Neptune, which is Amazon’s graph database service.
Amazon Neptune is a managed service for graph databases, which uses nodes, edges, and properties to represent and store data. This model is well suited to represent the complex relationships found in many types of data, such as the relationships between people in a social network or the interactions between different products on an
e-commerce website.
Neptune supports the property graph and W3C’s RDF standards, making it easy to integrate with
other systems and tools that support these standards. Neptune also supports the Gremlin query language, which is powerful and easy to use, and makes it straightforward to perform complex graph traversals and data manipulation operations on the data stored in the database.
In addition, Neptune is highly scalable and available, with the ability to support billions of ver-
tices and edges in a single graph. It is also fully managed, which means that Amazon takes care
of the underlying infrastructure and performs tasks such as provisioning, patching, and backup
and recovery, allowing you to focus on building and using your application. You can learn more
about Amazon Neptune by visiting the AWS page: https://aws.amazon.com/neptune/.
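For a feel of what a Gremlin traversal looks like from code, here is a hedged sketch using the open-source gremlinpython driver against a placeholder Neptune endpoint, mirroring the ancestry example above (the vertex label, property names, and edge label are assumptions):

```python
# pip install gremlinpython; the endpoint below is a placeholder.
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal

conn = DriverRemoteConnection(
    "wss://my-neptune.cluster-abc123.us-east-1.neptune.amazonaws.com:8182/gremlin",
    "g",
)
g = traversal().withRemote(conn)

# Pattern plus starting node: from the vertex representing you, follow
# "parent" edges two hops out to reach grandparents.
grandparents = (
    g.V().has("person", "name", "you")
         .out("parent")
         .out("parent")
         .values("name")
         .toList()
)
print(grandparents)
conn.close()
```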
A time-series database (TSDB) is a database optimized for data points indexed by time. In many applications, it is not only important to track what happened but just as important to track when it happened. The unit of measure to use for the time depends on the use case. For some applications, it might be enough to know on what day the event happened. But for other applications, it might be required to record the exact millisecond an event occurred. Typical use cases include the following:
• Performance monitoring
• Networking and infrastructure applications
RDBMSes can store this data, but they are not optimized to process, store, and analyze this type of data. Modern applications generate events that need to be tracked and measured, sometimes with real-time requirements.
Amazon Timestream is AWS’s serverless time-series database service. It has features that can automate query rollups, retention, tiering, and data compression. Like other serverless services, it scales up or down depending on how much data is coming into the streams. Also, because it’s serverless and fully managed, tasks such as provisioning, patching, and cluster scaling are not the responsibility of the DevOps team, allowing them to focus on more important tasks. You can store multiple measures per table row with the multi-measure records feature, instead of one measure per table row, making it easier to migrate existing data. With scheduled queries, Timestream can automatically and periodically run queries and store the results in a separate table.
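A hedged sketch of writing one multi-measure record with boto3’s timestream-write client follows; the database, table, dimension, and measure names are illustrative assumptions:

```python
import time
import boto3

tsw = boto3.client("timestream-write")

# One multi-measure record: several measures share a single table row.
tsw.write_records(
    DatabaseName="fleet",          # assumed to exist
    TableName="vehicle_metrics",   # assumed to exist
    Records=[{
        "Dimensions": [{"Name": "vehicle_id", "Value": "truck-17"}],
        "MeasureName": "telemetry",
        "MeasureValueType": "MULTI",
        "MeasureValues": [
            {"Name": "speed_kmh", "Value": "87.5", "Type": "DOUBLE"},
            {"Name": "fuel_pct",  "Value": "42.0", "Type": "DOUBLE"},
        ],
        "Time": str(int(time.time() * 1000)),  # epoch milliseconds
    }],
)
```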
A ledger database (LDB) provides an immutable, cryptographically verifiable, and transparent transaction log orchestrated by a central authority:
• LDB immutability: Imagine you deposit $1,000 in your bank. You see on your phone that your balance reflects the deposit, and you trust that record because the ledger cannot be rewritten after the fact. In other words, only inserts are allowed, and updates cannot be performed.
• LDB transparency: In this context, transparency refers to the ability to track changes to the data over time. The change history, at a minimum, should include who changed the data, when the data was changed, and what the value of the data was before it was changed.
• LDB cryptographic verifiability: When the transaction is recorded, the entire transaction data is hashed. In simple terms, the string of data that forms the transaction is whittled down into a smaller string of unique characters. Whenever the transaction is verified, it needs to match that string. In the ledger, the hash comprises the transaction data and appends the previous transaction’s hash. Doing this ensures that the entire chain of transactions is valid. If someone tried to enter another transaction in between, it would invalidate the hash, and the system would detect that the foreign transaction was added via an unauthorized method.
Returning to the bank example, the ledger records all credits and debits related to the bank account. It can then be followed from a point in history, allowing us to calculate the current account balance. With immutability and verifiability, you can trust the balance the ledger produces.
Amazon QLDB is a fully managed service that provides a centralized trusted authority to manage
an immutable, transparent, and cryptographically verifiable transaction log. It tracks each application value change and manages a complete, verifiable history of changes over time. QLDB is useful for applications that need to keep track of the history of transactions with high availability and reliability. Some examples that need this level of reliability are financial transactions, insurance claims, supply chains, and vehicle registration records.
Blockchain frameworks offer capabilities such as decentralized data sharing and smart contracts among multiple parties; Amazon QLDB, by contrast, provides an immutable ledger while still using a centrally trusted transaction log.
QLDB is designed to act as your system of record or source of truth. When you write to QLDB, your transaction is committed to an append-only journal. QLDB treats all interactions, such as reads, inserts, updates, and deletes, like a transaction and catalogs everything sequentially in this journal.
Once the transaction is committed to the journal, it is immediately materialized into tables and
indexes. QLDB provides a current state table and indexed history as a default when you create a
new ledger. Leveraging these allows customers to, as the names suggest, view the current state of their data as well as its indexed history. You can learn more about Amazon QLDB by visiting the AWS page: https://aws.amazon.com/qldb/.
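At the control-plane level, creating a ledger and fetching its journal digest with boto3 might look like the sketch below; the ledger name is an assumption:

```python
import boto3

qldb = boto3.client("qldb")

# Create a ledger; deletion protection guards the immutable history.
qldb.create_ledger(
    Name="vehicle-registry",
    PermissionsMode="STANDARD",
    DeletionProtection=True,
)

# The digest is a hash covering the whole journal up to this point; it is
# the anchor used to verify that past transactions were not tampered with.
digest = qldb.get_digest(Name="vehicle-registry")
print(digest["Digest"])
```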
Sometimes you need a database for large-scale applications that need fast read and write perfor-
mance, which a wide-column store can achieve. Let’s learn more about it.
Wide-column databases can sometimes be referred to as column family databases. A wide-column
database is a NoSQL database that can store petabyte-scale amounts of data. Its architecture relies
on persistent, sparse matrix, multi-dimensional mapping using a tabular format. Wide-column
databases are generally not relational.
Some everyday use cases for wide-column databases are as follows:
• Geolocation data
• User preferences
• Reporting
• Logging applications
• Many inserts, but not many updates
• Low latency requirements
However, wide-column databases are generally a poor fit for ad hoc queries.
Apache Cassandra is probably the most popular wide-column store implementation today. Its
architecture allows deployments without single points of failure. It can be deployed across clus-
ters and data centers.
Amazon Keyspaces (formerly Amazon Managed Apache Cassandra Service, or Amazon MCS) is a
fully managed service that allows users to deploy Cassandra workloads. Let’s learn more about it.
Amazon Keyspaces is a fully managed, scalable, and highly available NoSQL database service. NoSQL databases are a type of database that does not
use the traditional table-based relational database model and is well suited for applications that
require fast, scalable access to large amounts of data.
Keyspaces is based on Apache Cassandra, an open-source NoSQL database that is widely used for
applications that require high performance, scalability, and availability. Keyspaces provides the same Cassandra experience without the burden of managing the underlying infrastructure.
Keyspaces supports both table and Cassandra Query Language (CQL) APIs, making it easy to
migrate existing Cassandra applications to Keyspaces. It also provides built-in security features,
such as encryption at rest and network isolation, using Amazon VPC and integrates seamlessly
with other AWS services, such as Amazon EMR and Amazon SageMaker.
Servers are automatically spun up or brought down, and, as such, users are only charged for the resources Cassandra is using at any one time. Since AWS manages it, users of the service never have to deal with the underlying servers. You can interact with Keyspaces using the Cassandra Query Language shell, which is called cqlsh. With cqlsh, you can create tables, insert data into the tables, and access the data via queries, among other operations.
Keyspaces supports the Cassandra CQL API. Because of this, existing code and drivers developed for Cassandra will work without changes. Using Amazon Keyspaces instead of self-managed Apache Cassandra is as easy as modifying your database endpoint to point to an Amazon Keyspaces service endpoint, as the sketch below shows.
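The sketch below illustrates that idea with the open-source cassandra-driver for Python; the keyspace, table, service-specific credentials, and certificate path are assumptions you would replace per the AWS documentation:

```python
from ssl import SSLContext, PROTOCOL_TLSv1_2, CERT_REQUIRED
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider

# Keyspaces requires TLS; the root certificate file is assumed to have
# been downloaded locally per the AWS documentation.
ssl_context = SSLContext(PROTOCOL_TLSv1_2)
ssl_context.load_verify_locations("sf-class2-root.crt")
ssl_context.verify_mode = CERT_REQUIRED

# Service-specific credentials (an assumption) are generated in IAM.
auth = PlainTextAuthProvider(username="my-user-at-123456789012",
                             password="my-service-password")

# Pointing the standard Cassandra driver at the Keyspaces endpoint is the
# only real change compared to self-managed Cassandra.
cluster = Cluster(["cassandra.us-east-1.amazonaws.com"], port=9142,
                  ssl_context=ssl_context, auth_provider=auth)
session = cluster.connect()
session.execute("""
    CREATE TABLE IF NOT EXISTS demo_ks.orders (
        order_id   uuid,
        order_date timestamp,
        total      decimal,
        PRIMARY KEY (order_id, order_date)
    )
""")
```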
In addition to Keyspaces being wide-column, the major difference from DynamoDB is that it
supports composite partition keys and multiple clustering keys, which are not available in Dy-
namoDB. However, DynamoDB has better connectivity with other AWS services, such as Athena,
Kinesis, and Elasticsearch.
Keyspaces provides an SLA for 99.99% availability within an AWS Region. Encryption is enabled
by default for tables, and tables are replicated three times in multiple AWS Availability Zones to
ensure high availability. You can create continuous backups of tables with hundreds of terabytes
of data with no effect on your application’s performance, and recover data to any point in time
within the last 35 days. You can learn more about Amazon Keyspaces by visiting the AWS page:
https://aws.amazon.com/keyspaces/.
In the new world of cloud-born applications, modern organizations will not only use multiple types of databases for multiple applications, but often multiple databases within a single application. You can choose from the options available in AWS based on your workload.
Managing and scaling databases in a legacy infrastructure, whether on-premises or self-managed
in the cloud (on EC2), can be a tedious, time-consuming, and costly process. You have to worry
about operational challenges such as the following:
• Administrative overhead: Tasks such as patching, upgrading, and creating backups can be time-consuming and laborious
• Performance and availability issues
• Scalability issues, such as capacity planning and scaling clusters for computing and storage
• Security and compliance issues, such as network isolation, encryption, and compliance
programs, including PCI, HIPAA, FedRAMP, ISO, and SOC
Instead of dealing with the challenges mentioned above, you would rather spend your time innovating and creating new applications instead of managing infrastructure. With AWS-managed databases, you can avoid the need to over- or under-provision infrastructure to accommodate variable workloads, and you can avoid upfront costs such as software licensing, hardware refresh, and maintenance resources. AWS manages
everything for you, so you can focus on innovation and application development, rather than
infrastructure management. You won’t need to worry about administrative tasks such as server provisioning, patching, and backups. AWS continuously monitors your clusters to ensure your workloads are running with self-healing storage and automated scal-
ing, allowing you to focus on higher-value tasks such as schema design and query optimization.
In the 60s and 70s, mainframes were the primary means of building applications, but by the 80s, client-server architectures had taken over. Applications became more distributed, but the underlying data model remained mostly struc-
tured, and the database often functioned as a monolith. With the advent of the internet in the
90s, three-tier application architecture emerged. Although client and application code became
more distributed, the underlying data model continued to be mainly structured, and the database
remained a monolith. For nearly three decades, developers typically built applications against
a single database. And that is an interesting data point because if you have been in the industry
for a while, you often bump into folks whose mental model is, “Hey, I’ve been building apps for
a long time, and it’s always against this one database.”
That model has changed now that applications are built in the cloud. Microservices have now extended to databases, providing developers
with the ability to break down larger applications into smaller, specialized services that cater to
Developers are moving away from a monolithic database with a single storage and compute engine that struggles to handle every access pattern. Modern applications demand lower latency and the ability to handle millions of transactions per second with many concurrent users. As a result, data management systems have evolved to include specialized storage and compute engines. Plus, what we’ve seen over the last few years is that more and more companies are hiring technical talent in-house to take advantage of the enormous wave of technological innovation that is underway. These teams build applications as microservices, where they compose the different elements together using the right tool for the job.
Many factors contribute to the performance, scale, and availability requirements of modern apps:
• Users: The internet enables businesses to touch millions of new customers, so apps must serve user bases of unprecedented size.
• Data volume: Applications now collect and store far more data than ever before.
• Request rate: As businesses deliver experiences in more markets, they need their apps and databases to handle unprecedented levels of throughput.
• Access/scale: Applications are accessed from billions of smartphones worldwide, and businesses connect smartphones, cars, manufacturing equipment, and other devices to their systems.
• Economics: While businesses once had to buy infrastructure upfront and hope they’ll succeed, that model is unrealistic in 2023. Instead, they have to hedge their success by only paying for what they use, without capping how much they can grow.
These modern applications have wildly different database requirements, which are more advanced and nuanced than simply running everything in a relational database.
Developers respond by breaking complex applications into smaller components and choosing the most appropriate tool for each job. The best tool for a given task often varies by use case, leading developers to build highly distributed applications using multiple specialized databases.
Now that you have learned about the different types of AWS databases, let’s go into more detail about
moving on from legacy databases.
Numerous legacy applications have been developed on conventional databases, and consumers
have had to grapple with database providers that are expensive, proprietary, and impose punish-
ing licensing terms and frequent audits. Oracle, for instance, announced that they would double
licensing fees if their software is run on AWS or Microsoft. As a result, customers are attempting
to switch as soon as possible to open-source databases such as MySQL, PostgreSQL, and MariaDB.
Customers who are migrating to open-source databases are seeking to strike a balance between the cost-effectiveness of open-source engines and the performance of commercial-grade databases. Achieving the same level of performance on open-source databases as on commercial ones is challenging. AWS introduced Amazon Aurora, a cloud-native relational database that is compatible with MySQL
and PostgreSQL to address this need. Aurora aims to provide a balance between the performance
and availability of high-end commercial databases and the simplicity and cost-effectiveness of
open-source databases. It boasts 5 times better performance than standard MySQL and 3 times
better performance than standard PostgreSQL, while maintaining the security, availability, and
reliability of commercial-grade databases, all at a fraction of the cost. Additionally, customers
can migrate their workloads to other AWS services, such as DynamoDB, to achieve application
scalability.
This chapter has covered many database services that you have learned about, so let’s put them together and learn how to choose
the right database.
In the previous sections, you learned how to classify databases and the different database services that AWS provides: relational (Amazon RDS and Aurora), key-value (DynamoDB), document (DocumentDB), in-memory (ElastiCache and MemoryDB), graph (Neptune), time-series (Timestream), ledger (QLDB), and wide-column (Keyspaces). When you think about this collection of databases, you may think,
“Oh, no. You don’t need that many databases. I have a relational database, and it can take care
of all this for you”. Swiss Army knives are hardly the best solution for anything other than the
most straightforward task. If you want the right tool for the right job, one that gives you the expected performance and scale, choose a purpose-built database. So no one tool rules the world, and you should have the right tool for the right job to make you
spend less money, be more productive, and change the customer experience.
Consider focusing on common database categories to choose the right database instead of brows-
ing through hundreds of different databases. One such category is ‘relational,’ which many people
are familiar with. Suppose you have a workload where strong consistency is crucial, where you know in advance the questions that will be asked of the data and require consistent answers. In that case, a relational database is a good fit.
Popular options for this category include Amazon Aurora, Amazon RDS, open-source engines
like PostgreSQL, MySQL, and MariaDB, as well as RDS commercial engines such as SQL Server
and Oracle Database.
AWS has developed several purpose-built non-relational databases to facilitate the evolution of
application development. For instance, in the key-value category, Amazon DynamoDB is a data-
base that provides optimal performance for running key-value pairs at single-digit millisecond latency at any scale. If your use case involves storing and querying data in the same document model used in your application code, then Amazon DocumentDB is a natural choice. In the past, some relational databases added an XML data type to become an XML database. However, this approach had limitations, as the document model was bolted on rather than native, and purpose-built document databases have replaced XML databases. Amazon DocumentDB, launched in January 2019, is an excellent example of such a database.
If your application requires faster response times than single-digit millisecond latency, consider an
in-memory database and cache that can access data in microseconds. Amazon ElastiCache offers fully managed Redis and Memcached, making it possible to retrieve data rapidly for real-time processing use cases such as messaging, and real-time geospatial data such as drive distance.
Suppose you have large datasets with many connections between them. For instance, a sports company might want to link its athletes with its followers and provide personalized recommendations based on the interests of millions of users. Managing all these connections and providing fast queries can be challenging with traditional relational databases. In this case, you can use Amazon Neptune, which is purpose-built for highly connected data.
Time-series data is not just a timestamp or a data type that you might use in a relational database. Instead, a time-series database’s core feature is that the primary axis of the data model is time. Amazon Timestream is an example of a purpose-built time-series database that provides fast and scalable querying of time-series data.
Amazon QLDB is a fully managed ledger database service. A ledger is a type of database that is
used to store and track transactions and is typically characterized by its immutability and the
ability to append data in sequential order.
Immutability means that once data is written to the ledger, it cannot be changed, and new data can only be appended.
A wide-column database is an excellent choice for applications that require fast data processing at a massive scale, such as equipment maintenance, fleet management, and route optimization. Amazon Keyspaces for Apache Cassandra provides a wide-column
database option that allows you to develop applications that can handle thousands of requests
per second with practically unlimited throughput and storage.
When choosing a database, start with the requirements of the problem you are trying to solve. Some of the questions the requirements should answer are as follows:
• How much data will be stored, and how fast will it grow?
• Is the workload read-heavy or write-heavy?
• What consistency guarantees does the application need?
• What latency do users expect for reads and writes?
• What access patterns and queries will be run against the data?
In instances where there is a lot of data and it needs to be accessed quickly, NoSQL databases
might be a better solution. SQL vendors realize this and are constantly trying to improve their
offerings to better compete with NoSQL, including adopting techniques from the NoSQL world.
For example, Aurora is a SQL service, and it now offers Aurora Serverless, taking a page out of
the NoSQL playbook.
As services get better, the line between NoSQL and SQL databases keeps on blurring, making the decision harder. If you are unsure, you might want to draw up a Proof of Concept using a couple of options to determine which option best fits your requirements.
Another reason to choose NoSQL over SQL might be the ability NoSQL offers to create schema-less databases. However, tread carefully. Not having a schema might come at a high price.
Without a schema, the structure of the data becomes too variable and creates more problems than it solves. Just because we can create
databases without a schema in a NoSQL environment, we should not forgo validation checks
before creating a record. If possible, a validation scheme should be implemented, even when
using a NoSQL option.
It is true that going schema-less increases implementation agility during the data ingestion phase.
However, it increases complexity during the data access phase. So, make your choice by weighing the required trade-off between data context and data performance.
If it is becoming hard to maintain your relational databases as they scale, consider switching
to a managed database service such as Amazon RDS or Amazon Aurora. With these services, you
can migrate your workloads and applications without the need to redesign your application, and
you can continue to utilize your current database skills.
Self-managed databases like Oracle, SQL Server, MySQL, PostgreSQL, and MariaDB can be mi-
grated to Amazon RDS using the lift and shift approach. For better performance and availability,
MySQL and PostgreSQL databases can be moved to Amazon Aurora, which offers 3-5 times better
throughput. Non-relational databases like MongoDB and Redis are popularly used as document and in-memory databases in use cases like content management, personalization, mobile apps, and caching. To avoid the burden of maintaining non-relational databases at scale, organizations can move to a managed database service like Amazon DocumentDB for self-managed MongoDB databases or Amazon ElastiCache for self-managed Redis. These services make it possible to manage the databases without rearchitecting the application and enable the same DB skill sets to be leveraged while migrating workloads and applications.
Now that you understand the different choices of databases, the question becomes how to migrate your existing databases to AWS. The main paths are as follows:
• Self-service - For many migrations, the self-service path using the Database Migration Service (DMS) and the Schema Conversion Tool (SCT) offers the tools necessary to execute the move. With over 250,000 migrations completed through DMS, customers have successfully migrated their instances to AWS. Using DMS, you can make homogeneous migrations from your legacy database service to a managed service on AWS, such as from self-managed Oracle to Oracle on Amazon RDS, as well as heterogeneous migrations between different engines (see the sketch after this list).
• The AWS Data Lab is a service that helps customers choose their platform and understand
the differences between self-managed and managed services. It involves a 4-day intensive
engagement between the customer and AWS database service teams, supported by AWS
solutions architecture resources, to create an actionable deliverable that accelerates the
customer’s use and success with database services. Customers work directly with Data
Lab architects and each service’s product managers and engineers. At the end of a Lab, the
customer will have a working prototype of a solution that they can put into production
at an accelerated rate.
A Data Lab is a mutual commitment between a customer and AWS. Each party dedicates
key personnel for an intensive joint engagement, where potential solutions will be evaluated and prototyped. The teams work together over those four days to create usable deliverables to enable the customer to accelerate the deployment of
large AWS projects. After the Lab, the teams remain in communication until the projects
are successfully implemented.
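As a hedged illustration of the self-service path mentioned above, the boto3 sketch below starts a DMS replication task; the endpoint and replication instance ARNs are placeholders that assume those resources already exist:

```python
import boto3

dms = boto3.client("dms")

# The source/target endpoints and the replication instance are assumed to
# have been created beforehand; their ARNs here are placeholders.
dms.create_replication_task(
    ReplicationTaskIdentifier="legacy-to-aws",
    SourceEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:SOURCE",
    TargetEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:TARGET",
    ReplicationInstanceArn="arn:aws:dms:us-east-1:123456789012:rep:INSTANCE",
    MigrationType="full-load-and-cdc",  # bulk load, then ongoing change replication
    TableMappings="""{
        "rules": [{
            "rule-type": "selection", "rule-id": "1", "rule-name": "1",
            "object-locator": {"schema-name": "%", "table-name": "%"},
            "rule-action": "include"
        }]
    }""",
)
```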
In addition to the above, AWS has an extensive Partner Network of consulting and software vendor
partners who can provide expertise and tools to migrate your data to AWS.
In this chapter, you learned about many of the database options available in AWS. You started by
revisiting a brief history of databases and innovation trends led by data. After that, you explored data consistency models and database usage patterns. You further explored different types of databases and when it’s appropriate to use each one, including the multiple database choices available in AWS, and you learned about making a choice to use the right database service for your workload.
In the next chapter, you will learn about AWS’s services for cloud security and monitoring.