Backups and DR
Payal Singh
Database Administrator
OmniTI
1
Who am I ?
DBA@OmniTI
Github: payals
Blog: http://penningpence.blogspot.com
Twitter: @pallureshu
Email: [email protected]
2
Agenda
Types of backups
Validation
Backups management
Automation
Filesystem
In-house solution
3
Types of Backups
4
Business Continuity Objectives
5
Types of Backups
● Logical
○ pg_dump
○ pg_dumpall
● Physical (File-system Level)
○ Online
Most commonly used
No downtime
○ Offline
rarely used
shutdown database for backup
6
Logical Backups
Advantages:
Granularity
Fine-tuned restores
Multiple built-in compression types
Ease of use, no extra setup required
Disadvantages:
Relatively slower
Frozen snapshots in time, i.e. no PITR
Locks
Data spread in time
7
pg_dumpall
Pros:
• Globals
Cons:
● Requires superuser privileges
● No granularity while restoring
● Plaintext only
● Cannot take advantage of faster/parallel restore with pg_restore
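One common way to work around these cons is to use pg_dumpall only for globals and pg_dump in custom format for each database, so that restores can go through pg_restore. The sketch below only builds the command lines; the helper names, file paths and the default "postgres" user are illustrative assumptions, not part of the original deck.

```python
def globals_dump_cmd(outfile, user="postgres"):
    """pg_dumpall limited to roles and tablespaces (plain SQL output)."""
    return ["pg_dumpall", "-U", user, "--globals-only", "-f", outfile]


def database_dump_cmd(dbname, outfile, user="postgres"):
    """Custom-format pg_dump, which pg_restore can restore selectively."""
    return ["pg_dump", "-U", user, "-Fc", "-f", outfile, dbname]


def restore_cmd(dbname, dumpfile, jobs=4, user="postgres"):
    """Parallel restore of a custom-format dump with pg_restore -j."""
    return ["pg_restore", "-U", user, "-d", dbname, "-j", str(jobs), dumpfile]


if __name__ == "__main__":
    # Hypothetical paths and database name, printed for inspection only.
    for cmd in (
        globals_dump_cmd("/data/backup/globals.sql"),
        database_dump_cmd("appdb", "/data/backup/appdb.dump"),
        restore_cmd("appdb", "/data/backup/appdb.dump"),
    ):
        print(" ".join(cmd))
```

Each list can be passed to subprocess.run(cmd, check=True) so that a failing dump raises instead of passing silently.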
8
Physical Backups
Advantages:
Faster
Incremental Backups
Point In Time Recovery
Compression by default on certain file systems
Disadvantages:
Lacks granularity
9
Why do you need both?
10
pg_basebackup
Advantage:
• In core postgres
• No explicit backup mode required
• Multiple instances can run concurrently
• Backups can be made on master as well as standby
Disadvantage:
• Slower when compared to snapshot file-system technologies
• Backups are always of the entire cluster
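A minimal sketch of a pg_basebackup invocation, assuming a role that is allowed to make replication connections in pg_hba.conf. The host, user and target directory are hypothetical; the flags used (-h, -U, -D, -Ft, -z, -P) are standard pg_basebackup options.

```python
def basebackup_cmd(target_dir, host="localhost", user="replication_user",
                   compress=True):
    """Build a pg_basebackup command that writes a tar-format backup."""
    cmd = ["pg_basebackup", "-h", host, "-U", user,
           "-D", target_dir,   # directory that receives the backup
           "-Ft",              # tar output format
           "-P"]               # progress reporting
    if compress:
        cmd.append("-z")       # gzip the tar output (tar format only)
    return cmd


if __name__ == "__main__":
    print(" ".join(basebackup_cmd("/data/backup/base_2015-02-28")))
```

Because no explicit backup mode is needed, the same command can be pointed at a standby by changing only the host.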
11
pg_basebackup Requirements
12
Delayed Replicas as Backups
• Recovery may still lose some data; how much that matters depends on the
importance of the table, the objects that depend on it, etc.
13
Version Controlled DDL
14
Validation
15
Continuous Restores
Validation is important
Estimates / Expectations
Procedure / External factors RTO
Does it even work?
Development Database
Routine refresh + restore testing
Reporting Databases:
Overnight restores for reporting databases refreshed daily. Great
candidate for daily validation.
16
Sample Validation Log
2015-03-01 10:00:03 : Testing backup for 2015-02-28 from
/data/backup/hot_backup/+/var/log/test/bin/test_backups.sql
2015-03-01 10:00:03 : Starting decompress of
/data/backup/hot_backup/test.local-data-2015-02-28.tar.gz
2015-03-01 10:00:03 : Starting decompress of
/data/backup/hot_backup/test.local-xlog-2015-02-28.tar.gz
2015-03-01 10:00:03 : Waiting for both decompressing processes to finish
2015-03-01 14:36:06 : Decompressing worked, generated directory test with
size: 963G
2015-03-01 14:36:06 : Starting PostgreSQL
server starting
2015-03-01 14:36:07.372 EST @ 3282 LOG: loaded library
"pg_scoreboard.so"
2015-03-01 15:52:36 : PostgreSQL started
2015-03-01 15:52:36 : Validating Database
starting Vacuum
2015-03-02 08:17:56 : Test result: OK
2015-03-02 08:18:11 : All OK.
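A validation log like this doubles as an RTO estimate. The sketch below parses a few lines excerpted from the sample above and reports how long each phase took; the parsing helpers are illustrative and assume the "timestamp : message" layout shown.

```python
from datetime import datetime

# Excerpt of the sample log: one line per phase boundary.
LOG = """\
2015-03-01 10:00:03 : Waiting for both decompressing processes to finish
2015-03-01 14:36:06 : Starting PostgreSQL
2015-03-01 15:52:36 : PostgreSQL started
2015-03-02 08:17:56 : Test result: OK
"""


def parse(line):
    """Split 'YYYY-MM-DD HH:MM:SS : message' into (datetime, message)."""
    stamp, _, message = line.partition(" : ")
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"), message


def step_durations(log_text):
    """Return (message, elapsed) pairs, where elapsed runs until the next line."""
    events = [parse(line) for line in log_text.splitlines() if " : " in line]
    return [(msg, nxt_ts - ts)
            for (ts, msg), (nxt_ts, _) in zip(events, events[1:])]


if __name__ == "__main__":
    for message, elapsed in step_durations(LOG):
        print(f"{elapsed}  {message}")
```

On this excerpt, decompression accounts for roughly four and a half hours, recovery and startup for a little over an hour, and the validation vacuum for about sixteen hours, which is the kind of number an RTO discussion needs.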
17
Where did Gitlab go wrong?
No PITR, no alerts for missing backups, no documentation for backup
location, no restore testing
No retention policy
No monitoring
No RPO
“…Unless we can pull these from the past 24 hours they will be
lost”
18
Sobering Thought of the Day
19
Backup Management
20
Retention Period
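A retention period can be expressed as a cutoff date. The sketch below is one possible policy, not the one used in the talk: it keeps every backup newer than keep_days and, as an added safety assumption, never expires the most recent backup even if it is older than the cutoff.

```python
from datetime import date, timedelta


def split_by_retention(backup_dates, today, keep_days):
    """Partition backup dates into (keep, expired) lists, both sorted."""
    cutoff = today - timedelta(days=keep_days)
    keep = sorted(d for d in backup_dates if d >= cutoff)
    expired = sorted(d for d in backup_dates if d < cutoff)
    # Safety net: if everything is past the cutoff, retain the newest backup.
    if not keep and expired:
        keep.append(expired.pop())
    return keep, expired


if __name__ == "__main__":
    dates = [date(2015, 2, 20), date(2015, 2, 27), date(2015, 2, 28)]
    keep, expired = split_by_retention(dates, date(2015, 3, 1), keep_days=7)
    print("keep:", keep)
    print("expired:", expired)
```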
21
Security
Transfer
One way passwordless SSH access
HTTPS for cloud uploads (e.g. s3cmd)
Storage
Encryption
Access control
PCI Compliance
Private keys
Logical backups preferred
Multi-Tenancy Environment
Backup Server Client
Client Backup server
22
Monitoring and Alerts
Alert at runtime
• To detect errors at runtime within script, or change in system/user
environments
• Immediate
• Cron emails
Alert on storage/retention:
• Error in retention logic/implementation
• Secondary/Delayed
• Cron’d script, external monitoring
Alert on backup validation:
• On routine refreshes/restores
23
S3 Backups
24
S3 Backups Contd.
Sample Bucket Policy:
{
"Version": "2012-10-17",
"Id": "S3-Account-Permissions",
"Statement": [
{
"Sid": "1",
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::<account_number>:user/omniti_testing"
},
"Action": "s3:*",
"Resource": [
"arn:aws:s3:::omniti_testing",
"arn:aws:s3:::omniti_testing/*"
]
}
]
}
25
S3 Backups Contd.
26
Automation
27
Why Automate?
Reduced Chance of Human Error
pg_dump backups
filesystem level backups
retention scripts
backup monitoring and alerts
off-site and off-server transfer as well as storage
access channels
security keys
cronjobs…
Less work, faster
Move scripts, swap crontabs, ensure all accesses exist
Uniformity, Reliability
28
What to Automate?
Scripts
Crontabs
Accesses
Validation/restores
29
Automation Example In Chef
payal@payal-ThinkPad-T520$ ls -l s3_backup/
drwxr-xr-x 2 payal payal 4096 Dec 2 11:03 attributes
drwxr-xr-x 3 payal payal 4096 Dec 2 11:03 files
-rw-r--r-- 1 payal payal 1351 Dec 2 11:03 README.md
drwxr-xr-x 2 payal payal 4096 Dec 2 11:03 recipes
drwxr-xr-x 3 payal payal 4096 Dec 2 11:03 templates
payal@payal-ThinkPad-T520 $ ls -l s3_backup/templates/default/
-rwxr-xr-x 1 payal payal 3950 Dec 2 11:03 backup.sh.erb
-rw-r--r-- 1 payal payal 1001 Jan 12 12:40 check_offsite_backups.sh.erb
-rw-r--r-- 1 payal payal 37 Dec 2 11:03 pg_dump_backup.conf.erb
-rw-r--r-- 1 payal payal 19 Dec 2 11:03 pg_dump.conf.erb
-rw-r--r-- 1 payal payal 6319 Dec 2 11:03 pg_dump_upload.py.erb
-rw-r--r-- 1 payal payal 1442 Dec 2 11:03 pg_start_backup.sh.erb
-rw-r--r-- 1 payal payal 1631 Dec 2 11:03 s3cfg.erb
30
Automation Example in Ansible
32
Can Filesystems enhance DR?
ZFS
• Available on Solaris, Illumos. ZFS on Linux is new but improving!
• Built in compression
• Protections against data corruption
• Snapshots
• Copy-on-Write
33
How can ZFS help you with DR?
Scenario 1: You’re using pg_upgrade for a
major upgrade with the hard link option (-k). It fails
partway through. Since you used hard links, you
cannot start the older cluster. What do you
do?
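One answer to Scenario 1 is to take a ZFS snapshot before the risky task and roll back if it fails. The sketch below wraps any task that way; the dataset name, snapshot label and the injected run callable are hypothetical, and it assumes PostgreSQL is stopped and that the dataset holds both the old and new clusters.

```python
import subprocess


def snapshot_cmd(dataset, label):
    """Command that creates the snapshot dataset@label."""
    return ["zfs", "snapshot", f"{dataset}@{label}"]


def rollback_cmd(dataset, label):
    """Command that reverts the dataset to dataset@label."""
    return ["zfs", "rollback", f"{dataset}@{label}"]


def run_with_snapshot(task, dataset, label, run):
    """Snapshot, run task(), and roll back to the snapshot if task() raises."""
    run(snapshot_cmd(dataset, label))
    try:
        task()
    except Exception:
        run(rollback_cmd(dataset, label))
        raise


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    run_with_snapshot(lambda: None, "tank/pgdata", "pre_upgrade", print)
    # For real use, pass: lambda cmd: subprocess.run(cmd, check=True)
```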
34
How can ZFS help you with DR?
Scenario 2: You’ve accidentally deleted a large
table in a very large database where taking full
backups is infeasible. Is there a faster
alternative to a filesystem restore? ZFS rollback
will overwrite pgdata so you cannot use that.
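One possible approach to Scenario 2, offered as an assumption rather than the speaker's answer, is to clone an earlier snapshot, start a second PostgreSQL instance on the clone, and dump only the lost table. The sketch lists the commands in order; every name, path and the port are hypothetical, and it assumes data files and WAL live on the same dataset so the snapshot is crash-consistent.

```python
def recover_table_cmds(dataset, snapshot, clone, clone_mount,
                       dbname, table, dump_file, port=5433):
    """Ordered commands to pull one table out of a ZFS snapshot via a clone."""
    return [
        # Writable copy of the snapshot; production pgdata is untouched.
        ["zfs", "clone", f"{dataset}@{snapshot}", clone],
        # The clone carries the live server's pid file; remove it first.
        ["rm", "-f", f"{clone_mount}/postmaster.pid"],
        # Second instance on another port; it runs crash recovery on start.
        ["pg_ctl", "-D", clone_mount, "-o", f"-p {port}", "-w", "start"],
        # Dump just the lost table from the clone.
        ["pg_dump", "-p", str(port), "-Fc", "-t", table, "-f", dump_file, dbname],
        ["pg_ctl", "-D", clone_mount, "-m", "fast", "stop"],
        ["zfs", "destroy", clone],
    ]


if __name__ == "__main__":
    for cmd in recover_table_cmds("tank/pgdata", "nightly", "tank/pgrecover",
                                  "/tank/pgrecover", "appdb", "orders",
                                  "/tmp/orders.dump"):
        print(" ".join(cmd))
```

The resulting dump can then be loaded into the production database with pg_restore.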
36
OmniPITR
Comprehensive logging
Built in retention
37
Barman
Simplicity
Minimal knowledge of PITR inner workings required
Wrappers for recovery process
38
wal-e
Minimal setup
Cloud integration:
39
In House Solution
40
Logical
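import os

# config_file, pg_dump_path, dump_file_path and start_time are module-level
# settings defined elsewhere in the script.
def take_dump():
    try:
        with open(config_file, 'r') as f:
            for db in f:
                db = db.strip()
                if not db:
                    continue
                # Each config line holds pg_dump arguments ending in the database name.
                base = dump_file_path + db.split()[-1] + "_" + start_time
                dump_command = (pg_dump_path + " -U postgres -v -Fc -f " + base + ".sql"
                                + " " + db + " 2>> " + base + ".log")
                # os.system does not raise on failure, so check the exit status.
                if os.system(dump_command) != 0:
                    print('ERROR: backup of ' + db.split()[-1] + ' failed')
                else:
                    print('backup of ' + db.split()[-1] + ' completed successfully')
    except OSError:
        print('ERROR: could not read ' + config_file)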
41
Physical
pg_basebackup:
ZFS snapshots:
ZFS restore from snapshot
ZFS rollback to snapshot after failed upgrade or maintenance task
…
read_params "$@"
if [[ -z ${OFFLINE} ]]
then
    postgres_start_backup
    zfs_snap
    postgres_stop_backup
else
    zfs_snap
fi
backup
zfs_clear_snaps
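The online branch of the script above can be sketched in Python to show the ordering guarantee that matters: backup mode must be ended even if the snapshot fails. The run_sql and take_snapshot callables are hypothetical stand-ins for the shell functions; pg_start_backup and pg_stop_backup are the function names in the 9.x releases this talk references (PostgreSQL 15 renamed them to pg_backup_start and pg_backup_stop).

```python
def online_snapshot(run_sql, take_snapshot, label):
    """Bracket a filesystem snapshot with backup mode on a running cluster."""
    run_sql(f"SELECT pg_start_backup('{label}')")
    try:
        take_snapshot()
    finally:
        # Always leave backup mode, even when the snapshot step raised.
        run_sql("SELECT pg_stop_backup()")


if __name__ == "__main__":
    # Dry run: print the SQL and a marker instead of touching a database.
    online_snapshot(print, lambda: print("-- zfs snapshot here --"), "nightly")
```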
42
S3 Backups
Basic steps:
check_lock()
take_dump()
gpg_encrypt()
s3_upload()
move_files()
cleanup()
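The steps above can be sketched as a small driver. Only the step names come from the slide; the lock-file implementation of check_lock, the AlreadyRunning exception and the run_backup helper are assumptions about how such a script might be arranged.

```python
import os
import tempfile


class AlreadyRunning(Exception):
    """Raised when another backup run holds the lock file."""


def check_lock(lock_path):
    """Atomically create the lock file, failing if it already exists."""
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise AlreadyRunning(lock_path)
    with os.fdopen(fd, "w") as f:
        f.write(str(os.getpid()))


def run_backup(lock_path, steps):
    """Take the lock, run each step in order, and always release the lock."""
    check_lock(lock_path)
    try:
        for step in steps:
            step()
    finally:
        os.remove(lock_path)


if __name__ == "__main__":
    lock = os.path.join(tempfile.mkdtemp(), "s3_backup.lock")
    names = ["take_dump", "gpg_encrypt", "s3_upload", "move_files", "cleanup"]
    # Placeholder steps that only print their name.
    run_backup(lock, [lambda n=n: print("running", n) for n in names])
```

If a step raises, the remaining steps are skipped and the lock is still removed, so the next cron run is not blocked by a stale lock.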
43
Redundancy is Good
“That Sunday, Thomas deleted remotely stored backups and turned off the automated
backup system. He made some changes to VPN authentication that basically locked
everybody out, and turned off the automatic restart. He deleted internal IT wiki pages,
removed users from a mailing list, deactivated the company's pager notification system,
and a number of other things that basically created a huge mess that the company spent
the whole of Monday sorting out (it turned out there were local copies of the deleted
backups).”
https://www.theregister.co.uk/2017/02/23/michael_thomas_appeals_conviction/
44
S3 Backups Monitoring
Sample:
45
Backup Documentation
46
References
http://www.postgresql.org/docs/9.4/static
https://github.com/omniti-labs/omnipitr
http://docs.pgbarman.org
http://docs.chef.io/
https://github.com/omniti-labs/pgtreats
https://github.com/omniti-labs/pg_extractor
http://www.druva.com/blog/understanding-rpo-and-rto
http://evol-monkey.blogspot.co.uk/2013/08/postgresql-backup-and-recovery.html
47
Questions?
Payal Singh
[email protected]
OmniTI Computer Consulting Inc.
48