DYNAMODB
INSIGHTS
Antti Stenvall
P2P/AP Senior product architect
Contents
• What and why
• From SQL design to DynamoDB design – Key differences
• Indices, partitions, data duplication
• Pricing
• Examples:
• Consumption manager with conditional updates
• Relational data (single table design)
What and Why?
• An AWS maintained NoSQL database
• Can be used as key-value store, but also in (substantially) more complex situations
• Not as flexible as an SQL database (e.g. no aggregation, update or delete only with the full key, transaction support is more limited than in SQL, arbitrary queries are not feasible)
• Reads by key happen in two ways, and only in two ways
1. Both keys known exactly (this is get, returns at most one item)
2. One key known exactly plus part of the other key (or nothing); the sort order comes from the other key (this is query)
• In practice the AWS-managed unit is usually a table, not a database, but a single table can serve as a whole database, which makes it awesome
• You can't ask for the total number of rows matching a condition; pagination means browsing the index (backwards or forwards) starting from a given item
• Why would I use something that is not as flexible as SQL?
• Fully serverless, zero maintenance, easy to provision, encryption at rest is easy to enable, easy backups
• Very fast (single-digit millisecond reads and writes), scales horizontally, global tables for multi-region replication
• Easy to use, but challenging to design and it is not suitable for all use cases (like SQL isn’t either)
• Very cheap (low to zero cost at rest), reliable
• Automatic row expiration (data removal at given time)
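The pagination point above can be sketched as follows. This is a minimal sketch, assuming `table` is a boto3 `Table` resource (or any object with the same `query` method); the function name `iter_partition` and the default page size are hypothetical. A query response carries `LastEvaluatedKey` when more items remain, and passing it back as `ExclusiveStartKey` continues browsing from that item.

```python
from typing import Any, Dict, Iterator


def iter_partition(table: Any, pk: str, forward: bool = True,
                   page_size: int = 100) -> Iterator[Dict[str, Any]]:
    """Yield every item of one partition, page by page, in SK order."""
    kwargs: Dict[str, Any] = {
        "KeyConditionExpression": "#pk = :pk",
        "ExpressionAttributeNames": {"#pk": "PK"},
        "ExpressionAttributeValues": {":pk": pk},
        "ScanIndexForward": forward,  # False browses the sort key backwards
        "Limit": page_size,
    }
    while True:
        page = table.query(**kwargs)
        yield from page.get("Items", [])
        last_key = page.get("LastEvaluatedKey")
        if last_key is None:
            return  # no more pages in this partition
        # Continue from the last item of the previous page.
        kwargs["ExclusiveStartKey"] = last_key
```

Note that there is still no way to learn the total row count up front; the loop only discovers the end when `LastEvaluatedKey` is absent.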
From SQL design to DynamoDB
design
• Indexing requires substantially more thought
• ALL access patterns are tightly coupled to indices and must be known beforehand
• DynamoDB limitations must be considered in the beginning
• You may want to forget SQL-style transactions (e.g. a single transaction can cover only a limited number of items, at most 100 in current DynamoDB and 25 in older versions, so support for atomic operations is limited)
• Aggregation is missing (if needed, consider maintaining aggregated objects yourself)
• Arbitrary querying is not possible (only one-or-two key queries)
• Note! You can scan the table with any conditions, BUT that is like a full scan of the whole SQL database → use it only for maintenance purposes, e.g. when the key schema must be updated
• An update can only target a known primary key (PK+SK)
• DynamoDB special features may be considered
• You may want to react to changes via DynamoDB streams and decouple logic
• Conditional updates can (should) be used for optimistic locking
• Automatic row removal at a defined time (TTL)
• Automatic back-ups
• With Python, boto3 Table API should be used
• 400 KB row size limit
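The optimistic locking bullet above can be illustrated with a version counter. This is a minimal sketch, not the author's implementation: the attribute names `version` and `status` and the function name are hypothetical. The returned dictionary is meant to be passed to the boto3 `Table.update_item` method.

```python
from typing import Any, Dict


def versioned_update_kwargs(pk: str, sk: str, expected_version: int,
                            new_status: str) -> Dict[str, Any]:
    """Build update_item arguments that succeed only if the row is still
    at the version we read earlier (optimistic locking)."""
    return {
        "Key": {"PK": pk, "SK": sk},
        # Write the new value and bump the version in one request.
        "UpdateExpression": "SET #status = :status, #version = :next",
        # Reject the write if someone else bumped the version first.
        "ConditionExpression": "#version = :expected",
        "ExpressionAttributeNames": {
            "#status": "status",
            "#version": "version",
        },
        "ExpressionAttributeValues": {
            ":status": new_status,
            ":expected": expected_version,
            ":next": expected_version + 1,
        },
    }
```

Usage would look like `table.update_item(**versioned_update_kwargs("INVOICE#1", "INVOICE#1", 3, "APPROVED"))`. When the condition fails, boto3 raises a `ClientError` with the code `ConditionalCheckFailedException`; the caller then re-reads the row and retries.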
Indices – Primary index
• Primary key consists of two parts
• Partition key: PK (also known as hash key, not as primary key)
• Sort key: SK (also known as range key)
• The (PK, SK) pair is unique. A put operation overwrites an existing entry with the same pair. The index supports get and query operations.
• In a get operation, PK and SK must be given exactly
• In a query operation, PK must be given exactly; SK can be constrained with e.g. begins_with or between (or omitted entirely)
• Always name your key columns PK and SK and fix their type to string
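The difference between get and query can be sketched as two argument builders for the boto3 `Table.get_item` and `Table.query` methods. This is a minimal sketch; the function names and parameters are hypothetical, and the key columns are assumed to be named PK and SK as recommended above.

```python
from typing import Any, Dict, Optional, Tuple


def get_kwargs(pk: str, sk: str) -> Dict[str, Any]:
    """get: both keys are known exactly, at most one item comes back."""
    return {"Key": {"PK": pk, "SK": sk}}


def query_kwargs(pk: str, sk_prefix: Optional[str] = None,
                 sk_between: Optional[Tuple[str, str]] = None) -> Dict[str, Any]:
    """query: PK exact, SK optionally constrained by a prefix or a range."""
    if sk_prefix is not None and sk_between is not None:
        raise ValueError("give sk_prefix or sk_between, not both")
    condition = "#pk = :pk"
    names = {"#pk": "PK"}
    values: Dict[str, Any] = {":pk": pk}
    if sk_prefix is not None:
        condition += " AND begins_with(#sk, :prefix)"
        names["#sk"] = "SK"
        values[":prefix"] = sk_prefix
    elif sk_between is not None:
        condition += " AND #sk BETWEEN :low AND :high"
        names["#sk"] = "SK"
        values[":low"], values[":high"] = sk_between
    return {
        "KeyConditionExpression": condition,
        "ExpressionAttributeNames": names,
        "ExpressionAttributeValues": values,
    }
```

For example, `table.get_item(**get_kwargs("INVOICE#1", "INVOICE#1"))` fetches one row, while `table.query(**query_kwargs("INVOICE#1", sk_prefix="IL#"))` returns every row of that partition whose SK starts with `IL#`.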
What do PK (partition key) and SK (sort key) do?
• PK partitions the data and dictates the performance
• SK sorts the data within a partition; queries like ends_with or contains are therefore not available on keys, because the sort order cannot serve them efficiently
You can't query how many rows there are! You can only browse through the index in the order defined by SK.
In addition to keys, rows also contain attributes. These are like columns in a relational database table,
but they are fully schema-free: in one row col_a can be a number, in another a nested JSON object.
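The roles of PK and SK can be illustrated with a toy in-memory model. This is only a sketch for intuition, not how DynamoDB is implemented internally; the class `TinyTable` and its methods are hypothetical. Items with the same PK live together, sorted by SK, so all keys sharing a prefix sit next to each other and can be found with a binary search. Keys sharing a suffix are scattered, which is why ends_with or contains would need a full walk of the partition.

```python
import bisect
from typing import Any, Dict, List


class TinyTable:
    """Toy model: one sorted list of sort keys per partition key."""

    def __init__(self) -> None:
        self._keys: Dict[str, List[str]] = {}              # PK -> sorted SKs
        self._items: Dict[str, List[Dict[str, Any]]] = {}  # PK -> items, same order

    def put(self, item: Dict[str, Any]) -> None:
        keys = self._keys.setdefault(item["PK"], [])
        items = self._items.setdefault(item["PK"], [])
        i = bisect.bisect_left(keys, item["SK"])
        if i < len(keys) and keys[i] == item["SK"]:
            items[i] = item  # same (PK, SK) pair: put overwrites
        else:
            keys.insert(i, item["SK"])
            items.insert(i, item)

    def query_begins_with(self, pk: str, prefix: str) -> List[Dict[str, Any]]:
        keys = self._keys.get(pk, [])
        items = self._items.get(pk, [])
        # Binary search to the first key >= prefix, then read while it matches.
        start = bisect.bisect_left(keys, prefix)
        result = []
        for sk, item in zip(keys[start:], items[start:]):
            if not sk.startswith(prefix):
                break
            result.append(item)
        return result
```

The attributes stored alongside the keys are arbitrary dictionaries here, mirroring the schema-free nature of rows.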
Indices – Secondary index
• Two kinds of secondary indices: local and global
• Local (don't use: complicates your design with no benefit)
• Only the sort key differs (the partition key is the same as in the primary index) → local because the data stays in the same partition
• Supports strongly consistent reads
• At most 10 GB of data per PK value
• Global
• Use index names GSI1, GSI2, ... and column names GSI1PK (partition) and GSI1SK (sort)
• Different items can have the same values for the pair (GSI1PK, GSI1SK) → no get operation on the index, only query
• Eventually consistent reads
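The naming convention above can be sketched as a table definition. This is a minimal sketch, assuming the result is passed to the boto3 client or resource `create_table` call; the function name, the table name and the choice of on-demand billing with all attributes projected are illustrative assumptions.

```python
from typing import Any, Dict, List


def table_definition(table_name: str, gsi_count: int = 1) -> Dict[str, Any]:
    """Build create_table arguments with PK/SK and GSI1..GSIn string keys."""
    key_names: List[str] = ["PK", "SK"]
    gsis: List[Dict[str, Any]] = []
    for n in range(1, gsi_count + 1):
        pk, sk = f"GSI{n}PK", f"GSI{n}SK"
        key_names += [pk, sk]
        gsis.append({
            "IndexName": f"GSI{n}",
            "KeySchema": [
                {"AttributeName": pk, "KeyType": "HASH"},   # partition key
                {"AttributeName": sk, "KeyType": "RANGE"},  # sort key
            ],
            # Copy every attribute into the index.
            "Projection": {"ProjectionType": "ALL"},
        })
    definition: Dict[str, Any] = {
        "TableName": table_name,
        "BillingMode": "PAY_PER_REQUEST",  # on-demand capacity
        # Only key attributes are declared; all are typed as strings ("S").
        "AttributeDefinitions": [
            {"AttributeName": name, "AttributeType": "S"} for name in key_names
        ],
        "KeySchema": [
            {"AttributeName": "PK", "KeyType": "HASH"},
            {"AttributeName": "SK", "KeyType": "RANGE"},
        ],
    }
    if gsis:  # the parameter must be omitted when there are no indices
        definition["GlobalSecondaryIndexes"] = gsis
    return definition
```

Only the key columns appear in the definition; every other attribute stays schema-free.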
Indices – Partitions and data
duplication
• PK (or GSI PK) defines the partition where the data is physically located → speed
• GSIs duplicate data (attributes); you can define which attributes are copied, or copy all (use all: disk space is cheap and it keeps the design simpler)
• Always design for the minimum number of secondary indices; write down and revisit your access patterns
• Some rows may not be relevant for e.g. GSI3: if GSI3PK or GSI3SK is missing from a row, that row is not duplicated to the partition defined by GSI3PK
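The last point (rows without the index keys do not appear in the index) can be illustrated with a toy model. This is a sketch for intuition only, not DynamoDB itself; the function `gsi_view` is hypothetical.

```python
from typing import Any, Dict, List


def gsi_view(items: List[Dict[str, Any]], n: int) -> List[Dict[str, Any]]:
    """Return the rows that would be duplicated into index GSI<n>,
    ordered by (GSI<n>PK, GSI<n>SK)."""
    pk_name, sk_name = f"GSI{n}PK", f"GSI{n}SK"
    # A row is indexed only when it carries both index key attributes.
    indexed = [row for row in items if pk_name in row and sk_name in row]
    return sorted(indexed, key=lambda row: (row[pk_name], row[sk_name]))
```

A row that sets GSI1PK and GSI1SK but no GSI3 keys therefore shows up in `gsi_view(rows, 1)` and is absent from `gsi_view(rows, 3)`.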
Pricing
• Read request: up to 4 KB consumes one unit (strongly consistent operation) or half a unit (eventually consistent operation)
• Write request: up to 1 KB consumes one unit
• Two capacity modes: on-demand and provisioned (use on-demand)
• On-demand: pay for what you use (read $0.25 / M units, write $1.25 / M units)
• Provisioned: provision capacity, billed hourly
• Data storage: first 25 GB free, then $0.25 / GB per month
• Plus costs for other features: https://aws.amazon.com/dynamodb/pricing/
• Whenever you design or work with AWS, it's important to understand the costs
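The unit arithmetic above can be worked through in a few lines. This is a minimal sketch; the function names are hypothetical, and the default prices are simply the on-demand figures quoted on this slide, which vary by region and over time.

```python
import math


def read_units(item_bytes: int, strongly_consistent: bool = True) -> float:
    """One unit per started 4 KB; eventually consistent reads cost half."""
    units = max(1, math.ceil(item_bytes / 4096))
    return units if strongly_consistent else units / 2


def write_units(item_bytes: int) -> int:
    """One unit per started 1 KB."""
    return max(1, math.ceil(item_bytes / 1024))


def on_demand_cost_usd(read_request_units: float, write_request_units: float,
                       read_price_per_million: float = 0.25,
                       write_price_per_million: float = 1.25) -> float:
    """Cost of the given number of units at per-million-unit prices."""
    return (read_request_units * read_price_per_million
            + write_request_units * write_price_per_million) / 1_000_000
```

For example, a 9 KB row costs 3 units for a strongly consistent read and 1.5 units for an eventually consistent one, while writing a 2.5 KB row costs 3 units. Sizes are rounded up per request, so many small rows are cheaper to write than one row just over a 1 KB boundary.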
Example: Consumption manager
• Use case: we want to limit the usage of a service
• Service A can be, for example, an "infinitely scalable" Lambda and Service B a managed AWS service that has a quota; we want Service A to consume only part of that quota
• What can possibly go wrong here?
Consumption manager – Database
design
• One database, two different kinds of items: Consumption, Queued job
• Access patterns
• Give me current consumption
• Update current consumption
• Add job to queue
• Get next job in queue
• Remove job from queue
TYPE         PK             SK                               Attributes
CONSUMPTION  CONSUMPTION    CONSUMPTION#SERVICE-A            consumption, service_name, max_capacity?
JOB          JOB#SERVICE-A  QUEUED-AT#1659943176149#c474e63  id, service, queued_at, payload
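The key layout of this design can be sketched as small builder functions. This is a minimal sketch under the assumption that the table reads as PK `CONSUMPTION` with SK `CONSUMPTION#<service>` for the consumption row, and PK `JOB#<service>` with SK `QUEUED-AT#<milliseconds>#<id>` for queued jobs; the function names and the way the job id is generated are hypothetical.

```python
import time
import uuid
from typing import Any, Dict, Optional


def consumption_key(service: str) -> Dict[str, str]:
    """Primary key of the single consumption row of a service."""
    return {"PK": "CONSUMPTION", "SK": f"CONSUMPTION#{service}"}


def job_item(service: str, payload: Any, queued_at_ms: Optional[int] = None,
             job_id: Optional[str] = None) -> Dict[str, Any]:
    """Queued job row; the SK starts with the queueing time, so sort key
    order is queue order."""
    if queued_at_ms is None:
        queued_at_ms = int(time.time() * 1000)
    if job_id is None:
        job_id = uuid.uuid4().hex[:7]  # hypothetical id format
    return {
        "TYPE": "JOB",
        "PK": f"JOB#{service}",
        "SK": f"QUEUED-AT#{queued_at_ms}#{job_id}",
        "id": job_id,
        "service": service,
        "queued_at": queued_at_ms,
        "payload": payload,
    }


def next_job_query_kwargs(service: str) -> Dict[str, Any]:
    """Arguments for Table.query returning the oldest queued job."""
    return {
        "KeyConditionExpression": "#pk = :pk AND begins_with(#sk, :prefix)",
        "ExpressionAttributeNames": {"#pk": "PK", "#sk": "SK"},
        "ExpressionAttributeValues": {":pk": f"JOB#{service}",
                                      ":prefix": "QUEUED-AT#"},
        "ScanIndexForward": True,  # ascending SK: oldest first
        "Limit": 1,
    }
```

String sort keys compare character by character, so the timestamps order correctly only while they have the same number of digits, which holds for millisecond timestamps for the foreseeable future. Removing a job is a delete with the job's exact PK and SK.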
Updating consumption table
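One way to update the consumption row safely is a conditional update, so that the capacity check and the increment happen in a single atomic request. This is a minimal sketch and not necessarily the flow the slide showed: the function name is hypothetical, `max_capacity` is assumed to be supplied by the caller, and the result is meant for the boto3 `Table.update_item` method.

```python
from typing import Any, Dict


def consume_kwargs(service: str, amount: int, max_capacity: int) -> Dict[str, Any]:
    """Build update_item arguments that add `amount` to the consumption
    counter only if the total stays within `max_capacity`."""
    if amount > max_capacity:
        raise ValueError("amount exceeds max_capacity")
    return {
        "Key": {"PK": "CONSUMPTION", "SK": f"CONSUMPTION#{service}"},
        # ADD creates the counter on first use and increments it afterwards.
        "UpdateExpression": "ADD #consumption :amount",
        # Condition expressions cannot do arithmetic, so the remaining
        # room is computed here: consumption + amount <= max_capacity.
        "ConditionExpression":
            "attribute_not_exists(#consumption) OR #consumption <= :room",
        "ExpressionAttributeNames": {"#consumption": "consumption"},
        "ExpressionAttributeValues": {
            ":amount": amount,
            ":room": max_capacity - amount,
        },
        "ReturnValues": "UPDATED_NEW",
    }
```

If two invocations of Service A race for the last free capacity, only one update passes the condition; the other raises a `ClientError` with the code `ConditionalCheckFailedException` and can put its job in the queue instead. Releasing capacity is the same `ADD` with a negative amount.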
Example: Relational data in
DynamoDB
• Simplified view of invoice, order and matching data
• Design schema for this in DynamoDB (exclude user initiated search
cases)
DynamoDB design
TYPE              PK                    SK            GSI1PK                       GSI1SK                    GSI2PK      GSI2SK   GSI3PK            GSI3SK      GSI4PK      GSI4SK
INVOICE           INVOICE#{id}          INVOICE#{id}
INVOICE_LINE      INVOICE#{invoice_id}  IL#{id}       INVOICE#{invoice_id}         MATCHING_STATUS#{status}
CODING_ROW        INVOICE#{invoice_id}  CR#{id}       INVOICE#{invoice_id}         MATCHING_STATUS#{status}
ORDER             ORDER#{id}            ORDER#{id}    ORDER_NUMBER#{order_number}  DUMMY
ORDER_ROW         ORDER#{order_id}      OR#{id}       ORDER#{order_id}             MATCHING_STATUS#{status}
GOODS_RECEIPT     ORDER#{order_id}      GR#{id}       ORDER#{order_id}             MATCHING_STATUS#{status}  OR#{or_id}  GR
GR_MATCHING_DATA  ORDER#{order_id}      GRMD#{id}     OR#{or_id}                   GRMDATA                   GR#{gr_id}  GRMDATA  INVOICE#{inv_id}  IL#{il-id}  CR#{cr_id}  CR
• Always include TYPE, ids are in SKs, remember PK,SK pair is unique
• Add all attributes appearing in keys to individual attributes as well
• Use GSI only when needed
• All invoice data can be fetched with single query (for given id)
• All invoice / order data with given matching status with single query (for given header id + matching status)
• All order data can be fetched with single query (for given id)
• Not possible to fetch for given gr e.g. all the invoices' header data (which would be trivial in SQL)
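The access patterns of this design can be sketched as argument builders for the boto3 `Table.query` method. This is a minimal sketch, assuming that invoice and order rows carry `{header}#{id}` in GSI1PK and `MATCHING_STATUS#{status}` in GSI1SK as in the key schema; the function names and the plain attribute names `id`, `invoice_id` and `matching_status` are hypothetical.

```python
from typing import Any, Dict, Optional


def header_query_kwargs(header: str, header_id: str,
                        matching_status: Optional[str] = None) -> Dict[str, Any]:
    """All rows under one invoice or order, optionally only those with a
    given matching status (served by GSI1)."""
    if header not in ("INVOICE", "ORDER"):
        raise ValueError("header must be INVOICE or ORDER")
    pk_value = f"{header}#{header_id}"
    if matching_status is None:
        # Header, lines and related rows share the same PK: one query.
        return {
            "KeyConditionExpression": "#pk = :pk",
            "ExpressionAttributeNames": {"#pk": "PK"},
            "ExpressionAttributeValues": {":pk": pk_value},
        }
    return {
        "IndexName": "GSI1",
        "KeyConditionExpression": "#pk = :pk AND #sk = :sk",
        "ExpressionAttributeNames": {"#pk": "GSI1PK", "#sk": "GSI1SK"},
        "ExpressionAttributeValues": {
            ":pk": pk_value,
            ":sk": f"MATCHING_STATUS#{matching_status}",
        },
    }


def invoice_line_item(invoice_id: str, line_id: str, matching_status: str,
                      **attributes: Any) -> Dict[str, Any]:
    """Invoice line row; values used in keys are repeated as attributes."""
    return {
        "TYPE": "INVOICE_LINE",
        "PK": f"INVOICE#{invoice_id}",
        "SK": f"IL#{line_id}",
        "GSI1PK": f"INVOICE#{invoice_id}",
        "GSI1SK": f"MATCHING_STATUS#{matching_status}",
        "id": line_id,
        "invoice_id": invoice_id,
        "matching_status": matching_status,
        **attributes,
    }
```

Because GSI1SK embeds the status, changing the matching status of a row means rewriting GSI1SK together with the plain attribute.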
How to document object?
[Figure labels: key schema of the table; key definitions of the object; SK (optional for querying); attributes]
Key takeaways
• Plan your access patterns, revisit them continuously
• Querying is with a key pair, there is no searching (forget scanning)
• There are no joins
• There are simple/no transactions, use conditional updates and optimistic locking
• Aggregation is not possible; maintain aggregates yourself with DynamoDB Streams or on create/update/delete (CUD)
• Name one attribute TYPE (it helps in debugging and development)
• Use the convention PK, SK, GSI1PK, GSI1SK, GSI2PK, ... for naming keys
• Don't take DynamoDB for granted, but consider it as an alternative to other databases
• Sometimes more than one table is better (e.g. a lot of CUD, but only little use for streams)
• Centralize your key schema in models, don't let it leak into repository functions
• Plan your access patterns, revisit them continuously
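The takeaway about aggregating with streams can be sketched as a small handler for a batch of DynamoDB stream records, for example inside a Lambda function. This is a minimal sketch: the function name is hypothetical, it assumes the stream is configured to include item images, and it only counts changes per item TYPE instead of persisting an aggregate.

```python
from collections import Counter
from typing import Any, Dict, Tuple


def count_changes_by_type(event: Dict[str, Any]) -> "Counter[Tuple[str, str]]":
    """Count stream records per (TYPE, eventName) pair."""
    counts: "Counter[Tuple[str, str]]" = Counter()
    for record in event.get("Records", []):
        change = record["dynamodb"]
        # INSERT and MODIFY carry NewImage, REMOVE carries OldImage.
        image = change.get("NewImage") or change.get("OldImage") or {}
        # Stream images use typed values, e.g. {"S": "INVOICE"} for strings.
        item_type = image.get("TYPE", {}).get("S", "UNKNOWN")
        counts[(item_type, record["eventName"])] += 1
    return counts
```

A real aggregator would write these counts back to an aggregate row, which is also where having a TYPE attribute on every item pays off.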
Thank you!