Setting Expectations With RPO: Architectural Considerations Chapter 1


In many ways, database server architecture is treated as a mere afterthought. It's often
much easier to simply create a single node, install some software, and consider the
whole affair resolved. If a company is particularly paranoid, they may even spare
some thought for a replica server, or perhaps some kind of backup.
The true importance of database cluster architecture is easily overlooked as a result.
But what is server architecture? Why does it matter?
Look down the street. Any street is fine. What do you see? Homes, offices, and
buildings of various descriptions. With very rare exceptions, each one of these was
meticulously planned, from the foundation to the walls to the electrical wires, pipes,
up to the roof and drainage systems. A failure in any of these components could lead
to the ultimate demise of the entire structure, given enough time.
The same also applies to a PostgreSQL cluster! Database architecture
defines what goes into a database server cluster, and the reason for each element. How
does it communicate? How many nodes are required? Where do we put those nodes,
and why? What common problems are inherent in those decisions? How will our
decisions influence the underlying cost? What trade-offs can we make, given some
important constraints? How does all of this affect data availability? We need those
answers before we even consider hardware or virtualization. There are many
important considerations we must entertain when designing a highly available
PostgreSQL cluster.
Why, then, is the critical application and user data that drives the company's entire
application stack so commonly treated so callously? We direct so much attention to
the application, with its various layers of indirection, queues, caches, container
automation, and microarchitecture, that the data layer is overlooked or considered a
nuisance.
This is actually quite understandable. In most cases, a PostgreSQL database layer
demands an entirely different approach, one that developers, system administrators,
and others in information technology may not be familiar with managing.
Even experienced database administrators may not comprehend the scale and
necessary theoretical concepts that drive the high availability of databases.
While we can't reduce the subtle art of database server architecture to a few
memorable quips sure to entertain at parties, we can make the subject far
more approachable. It shouldn't be necessary to have a Ph.D. in abstract theoretical
frameworks to prevent a costly database outage.
In this chapter, we will learn how the layout of the nodes in our PostgreSQL cluster
can drastically influence its availability. We will cover the following recipes:
Setting expectations with RPO
Defining timetables through RTO
Picking redundant copies
Selecting locations
Having enough backups
Considering quorum
Introducing indirection
Preventing split brain
Incorporating multi-master
Leveraging multi-master

Setting expectations with RPO


RPO, short for Recovery Point Objective, is a common term in business continuity.
In the context of a database system, it describes the amount of data that may be lost
following an unexpected outage before the system is once again operational. It's important to
understand this at an early stage because it will drive decisions such as node count,
data synchronization methods, and backup technologies.
In this recipe, we will examine the ingredients for concocting a comprehensive RPO
that will influence the PostgreSQL cluster composition itself.
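To make the definition concrete, here is a minimal Python sketch that is not taken from the recipe itself. It treats the data lost in an outage as the window between the last change known to exist somewhere other than the failed node and the moment of failure, then compares that window against a target RPO. The function name, the timestamps, and the 60-second target are all hypothetical.

```python
from datetime import datetime, timedelta


def loss_window(last_safe_copy: datetime, outage: datetime) -> timedelta:
    """Return how much recent work is lost if the primary fails at `outage`.

    `last_safe_copy` is the time of the last change known to exist somewhere
    other than the failed node (for example a replica or a backup).
    """
    if outage < last_safe_copy:
        raise ValueError("outage cannot precede the last safe copy")
    return outage - last_safe_copy


# Hypothetical values for illustration only.
rpo = timedelta(seconds=60)
last_safe_copy = datetime(2024, 1, 1, 12, 0, 0)
outage = datetime(2024, 1, 1, 12, 0, 45)

lost = loss_window(last_safe_copy, outage)
print(f"lost {lost.total_seconds():.0f}s of changes; within RPO: {lost <= rpo}")
```

With these made-up timestamps, 45 seconds of changes are lost, which fits inside a 60-second RPO.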
Getting ready
The first thing we need to do is set expectations. These are most often defined by
upper management or some other decision-making entity. Data loss is never desirable
but is unavoidable in catastrophic scenarios. How much data loss can the business
tolerate under these circumstances? Seconds, minutes, or hours' worth?
This recipe will mainly focus on information gathering from key individuals, so make
sure it's possible to at least email anyone involved with the application stack.
Hardware purchases depend on budget proposals, so it may even be necessary to
interact with VP and C-level executives as well. Even if we don't do this right away,
try to determine the extent of influence available to you.
How to do it...
Since we're dealing with many vectors, we should iterate over them if possible. Try to
follow a process like this:
1. Seek the input of major decision makers:
    VP and C-level executives involved with technology
    Product manager
    Application designers and architects
    Infrastructure team lead
2. Find an amount of time that will satisfy most or all of the above.
3. Follow the rest of the advice in this chapter to find a suitable architecture.
4. Try to determine a rough cost for this and the closest alternative.
5. Present one or more designs and cost estimates to decision makers.
6. Document the final RPO decision and architecture as reference material.
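Step 2 above can be sketched in a few lines of Python. This is only an illustration under assumed inputs: the stakeholder answers, expressed as seconds of tolerable data loss, are invented, and `satisfied_by` is a hypothetical helper rather than part of any tool.

```python
# Hypothetical answers, in seconds of tolerable data loss, gathered in step 1.
tolerances = {
    "CTO": 0,
    "Product manager": 60,
    "Application architect": 300,
    "Infrastructure team lead": 30,
}


def satisfied_by(rpo_seconds):
    """List the stakeholders whose stated tolerance a given RPO meets."""
    return [who for who, limit in tolerances.items() if rpo_seconds <= limit]


# The strictest answer satisfies everyone; looser candidates satisfy fewer people.
strictest = min(tolerances.values())
for candidate in sorted(set(tolerances.values())):
    names = satisfied_by(candidate)
    print(f"RPO {candidate:>3}s satisfies {len(names)}/{len(tolerances)}: {', '.join(names)}")
```

Listing how many people each candidate value satisfies gives a starting point for the negotiation in steps 3 through 5, where cost enters the picture.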
How it works...
Decision makers such as the technology VP, CEO, and CTO have the final word
in most cases. Their input is vital and should be considered a requirement before ever
taking a step further. Keep in mind that these people are likely not familiar with the
technical feasibility of their demands at this extreme implementation level. When
asked a question such as "How much data can we lose in a major outage?" they're probably
going to say "None!" Regardless, this is a vital first step for reasons that will shortly
become apparent.
Then, we simply traverse the stack of people who helped define the features the
application stack fulfills, those who designed and implemented it, and whoever may
be in charge of the requisite hardware and network where everything runs. Perhaps
the design has a built-in tolerance for certain amounts of loss. Perhaps inherent
queues or caches act as a sort of buffer for data backend difficulties. Maybe the design
assumes there are multiple data systems all ingesting the same stream for
redundancy. The architecture and those who built it are the best sources of this
information.
Once we know the maximum amount of data the backend can lose before being
restored, we must apply what we learn from the rest of this chapter and choose one or
two best-case designs that can deliver that promise. The point here is that we will be
executing this recipe several times until everyone agrees to all inherent design costs
and limitations before continuing.
The best way to estimate cost is to take the chosen database server architectures and
itemize a rough cost for each element. The next chapter on Hardware Planning describes
in detail how to do this. We don't have to be exact here; the goal is to have some
numbers we can present to decision makers. Do they still want zero RPO if it costs
10x as much as a design that tolerates ten seconds of data loss? Are they willing to
compromise on a hybrid design?
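The comparison we present can be as simple as the following Python sketch. Every design name, RPO, and cost figure here is a placeholder chosen to mirror the 10x question above; real numbers would come from the hardware estimates.

```python
# Hypothetical candidate designs: (name, RPO in seconds, rough total cost).
designs = [
    ("Zero data loss", 0, 500_000),
    ("Hybrid", 5, 150_000),
    ("Ten seconds of loss", 10, 50_000),
]

# Express each cost as a multiple of the cheapest design.
cheapest = min(cost for _, _, cost in designs)
for name, rpo_seconds, cost in designs:
    multiple = cost / cheapest
    print(f"{name:<22} RPO {rpo_seconds:>2}s  ${cost:>9,}  {multiple:.1f}x")
```

Showing cost as a multiple of the cheapest option puts the trade-off in terms decision makers can weigh directly.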
Once we have chosen a final structure, possibly the most important step is to produce
a document describing that architecture, why it was chosen, the known limitations,
and the RPO it delivers. Present this document to decision makers and encourage
them to sign it if possible. Save it in any corporate documentation management
system available, and make sure it's one of the first things people see regarding the
database cluster layer. This document will single-handedly answer multiple questions
about the capabilities of the database cluster, all while acting as a reference
specification.
There's more...
RPO is considered a vital part of business continuity planning. Entire books have
been written on this subject, and what we've presented here is essentially a functional
summary. The subject is deep and varied, rich with its own inherent techniques
beyond simply architecture and design. It is the language of business and resource
management, so it can be a key component when interacting with decision makers.
Learning these concepts in depth can help influence the overall application stack to a
more sustainable long-term structure. We'll cover more of these techniques in this
chapter, but don't be afraid to proactively incorporate them into your
repertoire.

Defining timetables through RTO


Like RPO, RTO refers to a common business continuity term known as Recovery
Time Objective. In practice, this is the amount of time an outage of the database layer
may last. Often, it is incorporated into a Service Level Agreement (SLA) contract
presented to clients or assumed as a metric within the application stack. Like RPO,
this is a contractual-level element that can determine the number of required nodes at
steadily increasing expense as the amount of tolerable downtime decreases.
In this recipe, we will examine the steps necessary to define a realistic RTO, and
what that could mean given known industry standards.
Getting ready
As with RPO, our goal in determining a functional RTO is to set expectations
regarding inherent architecture limitations. The primary difference here is that RTO is
more easily quantifiable. Fire up your favorite spreadsheet program, such as
OpenOffice, Microsoft Excel, or Google Sheets; we'll be using it to keep track of how
much time each layer of the application, including the database layer, contributes to a
potential outage scenario.
How to do it...
We simply need to produce a spreadsheet to track all of the elements of known RTO
that depend on the database. We can do this with the following steps:
1. If possible, locate an existing RTO SLA for each portion of the application
that depends on PostgreSQL.
2. If this does not exist, seek the input of major decision makers:
    VP and C-level executives involved with technology
    Product manager
    Application designers and architects
    Infrastructure team lead
3. Find an amount of time that will satisfy most or all of the above.
4. Create a new spreadsheet for RTO.
5. Create a heading row with the following columns:
    Activity
    Time (seconds)
    Count
    Total (seconds)
6. In the Total column of the first data row, create the following formula:
    =B2*C2
7. Create one row for each of the following Activity categories:
    Minor Upgrade
    Major Upgrade
    Reboot
    Switchover
    Failover
    OS Upgrade
    Etc.
8. Copy and paste the formula into the Total column for all the rows we
created.
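For readers who prefer to check the arithmetic outside a spreadsheet, the same layout can be sketched in Python. All durations and counts below are placeholders, not measurements, and the grand total at the end is our own addition; the steps above stop at the per-row Total.

```python
# Each row mirrors the spreadsheet: (Activity, Time in seconds, Count).
# All durations and counts are hypothetical placeholders.
rows = [
    ("Minor Upgrade", 300, 4),
    ("Major Upgrade", 3600, 1),
    ("Reboot", 120, 2),
    ("Switchover", 10, 6),
    ("Failover", 30, 1),
    ("OS Upgrade", 600, 2),
]

print(f"{'Activity':<15}{'Time (s)':>10}{'Count':>7}{'Total (s)':>11}")
grand_total = 0
for activity, seconds, count in rows:
    total = seconds * count  # same calculation as =B2*C2 in the Total column
    grand_total += total
    print(f"{activity:<15}{seconds:>10}{count:>7}{total:>11}")

print(f"{'All activities':<32}{grand_total:>11}")
```

Swapping in measured times and expected counts turns this into a quick cross-check of the spreadsheet totals.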
