Troubleshooting streaming replication
Alexey Lesovsky
alexey.lesovsky@dataegret.com
dataegret.com
01 Quick introduction
• WAL and replication internals.
• Replication setup.
02 Troubleshooting tools
• 3rd party tools.
• Builtin tools.
03 Troubleshooting cases
• Symptoms and problems.
• Detection and solutions.
• Lessons learned.
01 Quick introduction
Goals
Better understanding of streaming replication.
How to quickly find and fix problems.
https://goo.gl/Mm3ugt
Agenda
Write-Ahead Log.
Streaming replication internals.
Streaming replication setup.
Troubleshooting tools overview (3rd party).
Troubleshooting tools overview (builtin).
Troubleshooting in practice.
Questions.
Write-Ahead Log
Durability in ACID.
Almost all changes are recorded in the WAL.
pg_xlog/ (pg_wal/) directory in the DATADIR.
Synchronous WAL write by backends.
Asynchronous WAL write by WAL writer.
Recovery process relies on WAL.
Streaming replication internals
WAL Sender process.
WAL Receiver process.
Startup process (recovery).
Streaming replication vs. WAL archiving.
Streaming replication internals
[Diagram: on the master, WAL buffers, storage, and the WAL sender; the network; on the standby, the WAL receiver, storage, and the startup process.]
Streaming replication setup
Master:
● postgresql.conf;
● Restart.
Standby:
● pg_basebackup;
● postgresql.conf;
● recovery.conf setup.
Master setup
wal_level, max_wal_senders, max_replication_slots.
archive_mode, archive_command.
wal_keep_segments.
wal_sender_timeout.
synchronous_standby_names.
Master setup pitfalls
wal_level, archive_mode, max_wal_senders, max_replication_slots – require a restart.
wal_keep_segments – requires extra storage space.
wal_sender_timeout – reduce it if the network is unreliable.
synchronous_standby_names – commits on the master hang if the synchronous standby fails.
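The storage cost of wal_keep_segments can be estimated with simple arithmetic. The sketch below assumes the default 16 MB WAL segment size (a build with a different segment size would change the result); the function name is illustrative and not part of PostgreSQL.

```python
# Rough estimate of the disk space reserved by wal_keep_segments.
# Assumption: default 16 MB WAL segment size.
WAL_SEGMENT_BYTES = 16 * 1024 * 1024


def wal_keep_bytes(wal_keep_segments, segment_bytes=WAL_SEGMENT_BYTES):
    """Return the minimum number of bytes kept in pg_xlog/ for standbys."""
    return wal_keep_segments * segment_bytes


if __name__ == "__main__":
    # Example settings chosen for illustration only.
    for segments in (64, 256, 1024):
        gib = wal_keep_bytes(segments) / 1024 ** 3
        print(f"wal_keep_segments = {segments}: about {gib:.0f} GiB")
```

The real directory can be larger than this estimate, because checkpoint settings, replication slots, and archiving also hold segments back.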
Standby setup
hot_standby.
max_standby_streaming_delay, max_standby_archiving_delay.
hot_standby_feedback.
wal_receiver_timeout.
Standby setup pitfalls
hot_standby – enables read-only queries on the standby.
max_standby_streaming_delay – increases the maximum possible lag.
hot_standby_feedback:
● postpones vacuum on the master;
● potential table/index bloat.
wal_receiver_timeout – reduce it if the network is unreliable.
recovery.conf
primary_conninfo and/or restore_command.
standby_mode.
trigger_file.
Recovery target:
● immediate;
● particular point/xid/timestamp;
● recovery_min_apply_delay.
recovery.conf pitfalls
Any change requires a restart.
02 Troubleshooting tools
3rd party troubleshooting tools
Top (procps) – CPU usage, load average, mem/swap usage.
Iostat (sysstat), iotop – storage utilization, process IO.
Nicstat – network interfaces utilization.
pgCenter – replication stats.
Perf – deep investigations.
Builtin troubleshooting tools
Statistics views.
Auxiliary functions.
pg_xlogdump utility.
Builtin troubleshooting tools
Statistics views:
● pg_stat_replication;
● pg_stat_database, pg_stat_database_conflicts;
● pg_stat_activity;
● pg_stat_archiver.
Builtin troubleshooting tools
Auxiliary functions:
● pg_current_xlog_location, pg_last_xlog_receive_location;
● pg_xlog_location_diff;
● pg_xlog_replay_pause, pg_xlog_replay_resume;
● pg_is_xlog_replay_paused.
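To make the location functions less abstract: an LSN such as 16/B374D848 is a 64-bit byte position written as two hexadecimal halves, and pg_xlog_location_diff() returns the difference between two such positions in bytes. A minimal sketch of the same arithmetic outside the database (the helper names are illustrative, not PostgreSQL APIs):

```python
def parse_lsn(lsn):
    """Convert an LSN such as '16/B374D848' into a 64-bit byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lsn_diff(lsn_a, lsn_b):
    """Bytes between two LSNs, same argument order as pg_xlog_location_diff()."""
    return parse_lsn(lsn_a) - parse_lsn(lsn_b)


if __name__ == "__main__":
    # Hypothetical positions: current LSN on the master vs. replayed LSN
    # on a standby.
    lag_bytes = lsn_diff("1/00000000", "0/FF000000")
    print(f"lag: {lag_bytes} bytes ({lag_bytes // 1024} kB)")
```

This can be handy when comparing LSNs copied from logs or monitoring output without a database connection.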
Builtin troubleshooting tools
pg_xlogdump:
● Decodes and displays XLOG for debugging;
● Can give wrong results when the server is running.
pg_xlogdump -f -p /xlog_96 \
  $(psql -qAtX -c "select pg_xlogfile_name(pg_current_xlog_location())")
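The file name handed to pg_xlogdump follows a fixed scheme: the timeline, then the high and low parts of the segment number, each printed as eight hexadecimal digits. A minimal sketch of that mapping, assuming the default 16 MB segment size and timeline 1 (both are assumptions; the function name is illustrative). For an LSN that falls exactly on a segment boundary the server-side function may report the preceding segment, which this sketch does not reproduce.

```python
SEGMENT_BYTES = 16 * 1024 * 1024                    # default segment size (assumed)
SEGMENTS_PER_XLOGID = 0x100000000 // SEGMENT_BYTES  # 256 with 16 MB segments


def wal_file_name(lsn, timeline=1):
    """Name of the WAL segment that contains the byte at the given LSN."""
    high, low = lsn.split("/")
    position = (int(high, 16) << 32) | int(low, 16)
    segno = position // SEGMENT_BYTES
    return "%08X%08X%08X" % (
        timeline,
        segno // SEGMENTS_PER_XLOGID,
        segno % SEGMENTS_PER_XLOGID,
    )


if __name__ == "__main__":
    # Hypothetical LSN, used only to show the naming scheme.
    print(wal_file_name("16/B374D848"))
```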
03 Troubleshooting cases
Troubleshooting cases
Replication lag.
pg_xlog/ bloat.
Long transactions and recovery conflicts.
Recovery process: 100% CPU usage.
Replication lag
Main symptom – query results differ between master and standbys.
Detection:
● pg_stat_replication and pg_xlog_location_diff();
● pg_last_xact_replay_timestamp().
Replication lag
# \d pg_stat_replication
          View "pg_catalog.pg_stat_replication"
      Column      |           Type           | Modifiers
------------------+--------------------------+-----------
 pid              | integer                  |
 usesysid         | oid                      |
 usename          | name                     |
 application_name | text                     |
 client_addr      | inet                     |
 client_hostname  | text                     |
 client_port      | integer                  |
 backend_start    | timestamp with time zone |
 backend_xmin     | xid                      |
 state            | text                     |
 sent_location    | pg_lsn                   |
 write_location   | pg_lsn                   |
 flush_location   | pg_lsn                   |
 replay_location  | pg_lsn                   |
 sync_priority    | integer                  |
 sync_state       | text                     |
Replication lag
# SELECT
    client_addr AS client, usename AS user, application_name AS name,
    state, sync_state AS mode,
    (pg_xlog_location_diff(pg_current_xlog_location(),sent_location) / 1024)::int as pending,
    (pg_xlog_location_diff(sent_location,write_location) / 1024)::int as write,
    (pg_xlog_location_diff(write_location,flush_location) / 1024)::int as flush,
    (pg_xlog_location_diff(flush_location,replay_location) / 1024)::int as replay,
    (pg_xlog_location_diff(pg_current_xlog_location(),replay_location))::int / 1024 as total_lag
  FROM pg_stat_replication;
  client  |  user  |    name     |   state   | mode  | pending | write | flush | replay | total_lag
----------+--------+-------------+-----------+-------+---------+-------+-------+--------+-----------
 10.6.6.9 | repmgr | walreceiver | streaming | async |       0 |     0 |     0 | 410480 |    410480
 10.6.6.7 | repmgr | walreceiver | streaming | async |       0 |  2845 | 95628 | 112552 |    211025
 10.6.6.6 | repmgr | walreceiver | streaming | async |       0 |     0 |  3056 |   9496 |     12552
 10.6.6.8 | repmgr | walreceiver | streaming | async |  847582 |     0 |     0 |   3056 |    850638
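Each lag column in this output points at a different stage of the pipeline, so the largest component is usually the first place to look. The sketch below applies that idea to the sample rows shown here. The mapping from stage to likely cause is an assumption made for illustration, not something the view itself reports, and the function name is hypothetical.

```python
# Values are in kB, as produced by the query in this slide.
# Assumption: the stage-to-cause mapping below is a rule of thumb.
LIKELY_CAUSE = {
    "pending": "WAL generated but not sent: master load or network",
    "write": "WAL sent but not written: network or standby under pressure",
    "flush": "WAL written but not flushed: standby storage",
    "replay": "WAL flushed but not replayed: recovery is slow or blocked",
}


def dominant_stage(row):
    """Return (stage, kB) for the largest lag component, or None if no lag."""
    stage = max(LIKELY_CAUSE, key=lambda name: row[name])
    return (stage, row[stage]) if row[stage] > 0 else None


# Sample rows copied from the output in this slide.
rows = {
    "10.6.6.9": {"pending": 0, "write": 0, "flush": 0, "replay": 410480},
    "10.6.6.7": {"pending": 0, "write": 2845, "flush": 95628, "replay": 112552},
    "10.6.6.6": {"pending": 0, "write": 0, "flush": 3056, "replay": 9496},
    "10.6.6.8": {"pending": 847582, "write": 0, "flush": 0, "replay": 3056},
}

for client, row in rows.items():
    found = dominant_stage(row)
    if found is None:
        print(f"{client}: no lag")
    else:
        stage, kb = found
        print(f"{client}: {stage} = {kb} kB ({LIKELY_CAUSE[stage]})")
```

A single dominant column is only a hint: 10.6.6.7, for example, also has a large flush component, so both storage and recovery on that standby deserve a look.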
Replication lag
Network problems – nicstat.
Storage problems – iostat, iotop.
Recovery is stuck – top, pg_stat_activity.
WAL pressure:
● pg_stat_activity, pg_stat_progress_vacuum;
● pg_xlog_location_diff().
Replication lag
Network/storage problems:
● check workload;
● upgrade hardware.
Recovery is stuck – wait or cancel queries on standby.
WAL pressure:
● Reduce amount of work;
● Reduce amount of WAL:
● full_page_writes = off, wal_compression = on, wal_log_hints = off;
● expand interval between checkpoints.
pg_xlog/ bloat
Main symptoms:
● unexpected increase in disk space usage;
● abnormal size of the pg_xlog/ directory.
pg_xlog/ bloat
Detection:
● du -csh;
● pg_replication_slots, pg_stat_archiver;
● errors in postgres logs.
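Where the size check needs to run from a monitoring script rather than by hand, the du step can be approximated in a few lines. This is a sketch only: it counts regular files directly inside the given directory, ignores subdirectories such as archive_status/, and the function name and example path handling are assumptions.

```python
import os
import sys


def wal_dir_usage(path):
    """Return (file_count, total_bytes) for regular files directly in path."""
    count = 0
    total = 0
    with os.scandir(path) as entries:
        for entry in entries:
            # Skip subdirectories and symlinks; count only plain files.
            if entry.is_file(follow_symlinks=False):
                count += 1
                total += entry.stat(follow_symlinks=False).st_size
    return count, total


if __name__ == "__main__":
    # Pass the real pg_xlog/ (or pg_wal/) location as the first argument;
    # the current directory is used only as a harmless default.
    path = sys.argv[1] if len(sys.argv) > 1 else "."
    files, size = wal_dir_usage(path)
    print(f"{path}: {files} files, {size / 1024 ** 2:.1f} MiB")
```

Tracking the file count over time makes a stuck slot or a failing archive_command visible as a steadily growing number.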
pg_xlog/ bloat
Problems:
● Massive CRUD.
● Unused slot.
● Broken archive_command.
pg_xlog/ bloat
Solutions:
● check replication lag;
● reduce checkpoint_segments/max_wal_size, wal_keep_segments;
● change reserved space ratio (ext filesystems);
● add extra space (LVM, ZFS, etc.);
● drop unused slot or fix slot consumer;
● fix WAL archiving;
● checkpoint, checkpoint, checkpoint.
Recovery conflicts
Main symptoms – errors in PostgreSQL or application logs.
postgres.c:errdetail_recovery_conflict():
● User was holding shared buffer pin for too long.
● User was holding a relation lock for too long.
● User was or might have been using tablespace that must be dropped.
● User query might have needed to see row versions that must be removed.
● User transaction caused buffer deadlock with recovery.
● User was connected to a database that must be dropped.
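Counting how often each of these messages appears in a standby's log gives a quick picture of which conflict type dominates. The sketch below matches the message texts listed above against log lines; the sample lines and their prefix format are hypothetical, since the real layout depends on log_line_prefix.

```python
from collections import Counter

# DETAIL texts as listed in this slide.
CONFLICT_DETAILS = [
    "User was holding shared buffer pin for too long.",
    "User was holding a relation lock for too long.",
    "User was or might have been using tablespace that must be dropped.",
    "User query might have needed to see row versions that must be removed.",
    "User transaction caused buffer deadlock with recovery.",
    "User was connected to a database that must be dropped.",
]


def count_conflicts(lines):
    """Count recovery-conflict DETAIL messages found in an iterable of lines."""
    counts = Counter()
    for line in lines:
        for detail in CONFLICT_DETAILS:
            if detail in line:
                counts[detail] += 1
                break
    return counts


# Hypothetical log excerpt, for illustration only.
sample_log = [
    "ERROR:  canceling statement due to conflict with recovery",
    "DETAIL:  User query might have needed to see row versions that must be removed.",
    "ERROR:  canceling statement due to conflict with recovery",
    "DETAIL:  User was holding a relation lock for too long.",
    "ERROR:  canceling statement due to conflict with recovery",
    "DETAIL:  User query might have needed to see row versions that must be removed.",
]

for detail, hits in count_conflicts(sample_log).most_common():
    print(f"{hits:4d}  {detail}")
```

The same totals per database, without the message text, are available from the conflict statistics view, so the log count is mainly useful for tying conflicts to specific queries and times.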
Recovery conflicts
Detection:
● pg_stat_database + pg_stat_database_conflicts;
● PostgreSQL logs.
Recovery conflicts
Problems:
● queries are cancelled too often;
● long transactions on a standby – check pg_stat_activity;
● huge apply lag – check pg_stat_replication.
Recovery conflicts
Solutions:
● increase streaming delay (potentially causes lag);
● enable hot_standby_feedback (potentially causes bloat);
● rewrite queries;
● set up a dedicated standby for long queries.
Recovery 100% CPU usage
Main symptoms:
● huge apply lag;
● 100% CPU usage by recovery process.
Recovery 100% CPU usage
Detection:
● top – CPU usage;
● pg_stat_replication – amount of lag.
Recovery 100% CPU usage
Investigation:
● perf top/record/report (requires debug symbols);
● pg_xlogdump.
Recovery 100% CPU usage
Solutions:
● depend on the results of the investigation;
● change the problematic workload (if found).
Lessons learned
Streaming replication problems are always distributed.
There are many sources of problems:
● system resources, app/queries, workload.
Always use monitoring.
Learn how to use builtin tools.
Links
PostgreSQL official documentation – The Statistics Collector
https://www.postgresql.org/docs/current/static/monitoring-stats.html
PostgreSQL Mailing Lists (general, performance, hackers)
https://www.postgresql.org/list/
PostgreSQL-Consulting company blog
http://blog.postgresql-consulting.com/
Thanks for watching!

More Related Content

PDF
Deep dive into PostgreSQL statistics.
Alexey Lesovsky
 
PDF
Deep dive into PostgreSQL statistics.
Alexey Lesovsky
 
PDF
Kevin Kempter PostgreSQL Backup and Recovery Methods @ Postgres Open
PostgresOpen
 
PDF
PostgreSQL and RAM usage
Alexey Bashtanov
 
PDF
Backup and-recovery2
Command Prompt., Inc
 
ODP
OpenGurukul : Database : PostgreSQL
Open Gurukul
 
PDF
Postgresql database administration volume 1
Federico Campoli
 
PDF
How does PostgreSQL work with disks: a DBA's checklist in detail. PGConf.US 2015
PostgreSQL-Consulting
 
Deep dive into PostgreSQL statistics.
Alexey Lesovsky
 
Deep dive into PostgreSQL statistics.
Alexey Lesovsky
 
Kevin Kempter PostgreSQL Backup and Recovery Methods @ Postgres Open
PostgresOpen
 
PostgreSQL and RAM usage
Alexey Bashtanov
 
Backup and-recovery2
Command Prompt., Inc
 
OpenGurukul : Database : PostgreSQL
Open Gurukul
 
Postgresql database administration volume 1
Federico Campoli
 
How does PostgreSQL work with disks: a DBA's checklist in detail. PGConf.US 2015
PostgreSQL-Consulting
 

What's hot (20)

PDF
PostgreSQL 공간관리 살펴보기 이근오
PgDay.Seoul
 
PPTX
Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016
DataStax
 
PPTX
PostgreSQL Database Slides
metsarin
 
ODP
Introduction to PostgreSQL
Jim Mlodgenski
 
PDF
PostgreSQL replication
NTT DATA OSS Professional Services
 
PDF
Linux Profiling at Netflix
Brendan Gregg
 
PDF
Mvcc in postgreSQL 권건우
PgDay.Seoul
 
PDF
PostgreSQL WAL for DBAs
PGConf APAC
 
PDF
Linux tuning to improve PostgreSQL performance
PostgreSQL-Consulting
 
PPTX
Introduction to PostgreSQL
Joel Brewer
 
PDF
Performance Wins with BPF: Getting Started
Brendan Gregg
 
PDF
Understanding PostgreSQL LW Locks
Jignesh Shah
 
PDF
Building Network Functions with eBPF & BCC
Kernel TLV
 
PDF
Mastering PostgreSQL Administration
EDB
 
PDF
Advanced backup methods (Postgres@CERN)
Anastasia Lubennikova
 
PDF
Introduction to Apache Calcite
Jordan Halterman
 
PPTX
M|18 Deep Dive: InnoDB Transactions and Replication
MariaDB plc
 
PDF
Spark shuffle introduction
colorant
 
PDF
PostgreSQL Deep Internal
EXEM
 
PPTX
Postgresql Database Administration Basic - Day1
PoguttuezhiniVP
 
PostgreSQL 공간관리 살펴보기 이근오
PgDay.Seoul
 
Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016
DataStax
 
PostgreSQL Database Slides
metsarin
 
Introduction to PostgreSQL
Jim Mlodgenski
 
PostgreSQL replication
NTT DATA OSS Professional Services
 
Linux Profiling at Netflix
Brendan Gregg
 
Mvcc in postgreSQL 권건우
PgDay.Seoul
 
PostgreSQL WAL for DBAs
PGConf APAC
 
Linux tuning to improve PostgreSQL performance
PostgreSQL-Consulting
 
Introduction to PostgreSQL
Joel Brewer
 
Performance Wins with BPF: Getting Started
Brendan Gregg
 
Understanding PostgreSQL LW Locks
Jignesh Shah
 
Building Network Functions with eBPF & BCC
Kernel TLV
 
Mastering PostgreSQL Administration
EDB
 
Advanced backup methods (Postgres@CERN)
Anastasia Lubennikova
 
Introduction to Apache Calcite
Jordan Halterman
 
M|18 Deep Dive: InnoDB Transactions and Replication
MariaDB plc
 
Spark shuffle introduction
colorant
 
PostgreSQL Deep Internal
EXEM
 
Postgresql Database Administration Basic - Day1
PoguttuezhiniVP
 
Ad

Viewers also liked (11)

PDF
RxNetty vs Tomcat Performance Results
Brendan Gregg
 
ODP
G1 Garbage Collector: Details and Tuning
Simone Bordet
 
PPTX
Am I reading GC logs Correctly?
Tier1 App
 
PDF
Row Pattern Matching in SQL:2016
Markus Winand
 
PDF
Java Performance Analysis on Linux with Flame Graphs
Brendan Gregg
 
PPTX
Shell,信号量以及java进程的退出
wang hongjiang
 
PDF
SREcon 2016 Performance Checklists for SREs
Brendan Gregg
 
POTX
Performance Tuning EC2 Instances
Brendan Gregg
 
PDF
Blazing Performance with Flame Graphs
Brendan Gregg
 
PDF
Linux 4.x Tracing Tools: Using BPF Superpowers
Brendan Gregg
 
PDF
Container Performance Analysis
Brendan Gregg
 
RxNetty vs Tomcat Performance Results
Brendan Gregg
 
G1 Garbage Collector: Details and Tuning
Simone Bordet
 
Am I reading GC logs Correctly?
Tier1 App
 
Row Pattern Matching in SQL:2016
Markus Winand
 
Java Performance Analysis on Linux with Flame Graphs
Brendan Gregg
 
Shell,信号量以及java进程的退出
wang hongjiang
 
SREcon 2016 Performance Checklists for SREs
Brendan Gregg
 
Performance Tuning EC2 Instances
Brendan Gregg
 
Blazing Performance with Flame Graphs
Brendan Gregg
 
Linux 4.x Tracing Tools: Using BPF Superpowers
Brendan Gregg
 
Container Performance Analysis
Brendan Gregg
 
Ad

Similar to Troubleshooting PostgreSQL Streaming Replication (20)

PDF
Streaming replication in practice
Alexey Lesovsky
 
PPTX
Hack an ASP .NET website? Hard, but possible!
Vladimir Kochetkov
 
PDF
Why you should be using structured logs
Stefan Krawczyk
 
PDF
pg_proctab: Accessing System Stats in PostgreSQL
Command Prompt., Inc
 
PDF
pg_proctab: Accessing System Stats in PostgreSQL
Mark Wong
 
TXT
Gg steps
Hari Prasath
 
PPTX
Oracle Basics and Architecture
Sidney Chen
 
PDF
Oracle to Postgres Migration - part 2
PgTraining
 
PDF
2013 Collaborate - OAUG - Presentation
Biju Thomas
 
PDF
Improving the performance of Odoo deployments
Odoo
 
PDF
Large Scale Log Analytics with Solr: Presented by Rafał Kuć & Radu Gheorghe, ...
Lucidworks
 
PDF
pg_proctab: Accessing System Stats in PostgreSQL
Mark Wong
 
PPT
Logging with Logback in Scala
Knoldus Inc.
 
PDF
Php version 7
RANVIJAY GAUR
 
PDF
pg_proctab: Accessing System Stats in PostgreSQL
Command Prompt., Inc
 
PDF
pg_proctab: Accessing System Stats in PostgreSQL
Mark Wong
 
PDF
Peeking into the Black Hole Called PL/PGSQL - the New PL Profiler / Jan Wieck...
Ontico
 
PDF
Osol Pgsql
Emanuel Calvo
 
PPT
LOGBack and SLF4J
jkumaranc
 
PPT
LOGBack and SLF4J
jkumaranc
 
Streaming replication in practice
Alexey Lesovsky
 
Hack an ASP .NET website? Hard, but possible!
Vladimir Kochetkov
 
Why you should be using structured logs
Stefan Krawczyk
 
pg_proctab: Accessing System Stats in PostgreSQL
Command Prompt., Inc
 
pg_proctab: Accessing System Stats in PostgreSQL
Mark Wong
 
Gg steps
Hari Prasath
 
Oracle Basics and Architecture
Sidney Chen
 
Oracle to Postgres Migration - part 2
PgTraining
 
2013 Collaborate - OAUG - Presentation
Biju Thomas
 
Improving the performance of Odoo deployments
Odoo
 
Large Scale Log Analytics with Solr: Presented by Rafał Kuć & Radu Gheorghe, ...
Lucidworks
 
pg_proctab: Accessing System Stats in PostgreSQL
Mark Wong
 
Logging with Logback in Scala
Knoldus Inc.
 
Php version 7
RANVIJAY GAUR
 
pg_proctab: Accessing System Stats in PostgreSQL
Command Prompt., Inc
 
pg_proctab: Accessing System Stats in PostgreSQL
Mark Wong
 
Peeking into the Black Hole Called PL/PGSQL - the New PL Profiler / Jan Wieck...
Ontico
 
Osol Pgsql
Emanuel Calvo
 
LOGBack and SLF4J
jkumaranc
 
LOGBack and SLF4J
jkumaranc
 

More from Alexey Lesovsky (20)

PDF
Отладка и устранение проблем в PostgreSQL Streaming Replication.
Alexey Lesovsky
 
PDF
Call of Postgres: Advanced Operations (part 5)
Alexey Lesovsky
 
PDF
Call of Postgres: Advanced Operations (part 4)
Alexey Lesovsky
 
PDF
Call of Postgres: Advanced Operations (part 3)
Alexey Lesovsky
 
PDF
Call of Postgres: Advanced Operations (part 2)
Alexey Lesovsky
 
PDF
Call of Postgres: Advanced Operations (part 1)
Alexey Lesovsky
 
PDF
Troubleshooting PostgreSQL with pgCenter
Alexey Lesovsky
 
PDF
PostgreSQL Streaming Replication
Alexey Lesovsky
 
PDF
GitLab PostgresMortem: Lessons Learned
Alexey Lesovsky
 
PDF
PostgreSQL Vacuum: Nine Circles of Hell
Alexey Lesovsky
 
PDF
Tuning Linux for Databases.
Alexey Lesovsky
 
PDF
Managing PostgreSQL with PgCenter
Alexey Lesovsky
 
PDF
Nine Circles of Inferno or Explaining the PostgreSQL Vacuum
Alexey Lesovsky
 
PDF
Streaming replication in practice
Alexey Lesovsky
 
PDF
PostgreSQL Streaming Replication Cheatsheet
Alexey Lesovsky
 
PDF
Deep dive into PostgreSQL statistics.
Alexey Lesovsky
 
PDF
Pgcenter overview
Alexey Lesovsky
 
PDF
Highload 2014. PostgreSQL: ups, DevOps.
Alexey Lesovsky
 
PDF
PostgreSQL Troubleshoot On-line, (RITfest 2015 meetup at Moscow, Russia).
Alexey Lesovsky
 
PDF
Linux tuning for PostgreSQL at Secon 2015
Alexey Lesovsky
 
Отладка и устранение проблем в PostgreSQL Streaming Replication.
Alexey Lesovsky
 
Call of Postgres: Advanced Operations (part 5)
Alexey Lesovsky
 
Call of Postgres: Advanced Operations (part 4)
Alexey Lesovsky
 
Call of Postgres: Advanced Operations (part 3)
Alexey Lesovsky
 
Call of Postgres: Advanced Operations (part 2)
Alexey Lesovsky
 
Call of Postgres: Advanced Operations (part 1)
Alexey Lesovsky
 
Troubleshooting PostgreSQL with pgCenter
Alexey Lesovsky
 
PostgreSQL Streaming Replication
Alexey Lesovsky
 
GitLab PostgresMortem: Lessons Learned
Alexey Lesovsky
 
PostgreSQL Vacuum: Nine Circles of Hell
Alexey Lesovsky
 
Tuning Linux for Databases.
Alexey Lesovsky
 
Managing PostgreSQL with PgCenter
Alexey Lesovsky
 
Nine Circles of Inferno or Explaining the PostgreSQL Vacuum
Alexey Lesovsky
 
Streaming replication in practice
Alexey Lesovsky
 
PostgreSQL Streaming Replication Cheatsheet
Alexey Lesovsky
 
Deep dive into PostgreSQL statistics.
Alexey Lesovsky
 
Pgcenter overview
Alexey Lesovsky
 
Highload 2014. PostgreSQL: ups, DevOps.
Alexey Lesovsky
 
PostgreSQL Troubleshoot On-line, (RITfest 2015 meetup at Moscow, Russia).
Alexey Lesovsky
 
Linux tuning for PostgreSQL at Secon 2015
Alexey Lesovsky
 

Recently uploaded (20)

PPTX
Information Retrieval and Extraction - Module 7
premSankar19
 
PPTX
Color Model in Textile ( RGB, CMYK).pptx
auladhossain191
 
PDF
The Effect of Artifact Removal from EEG Signals on the Detection of Epileptic...
Partho Prosad
 
PPTX
ternal cell structure: leadership, steering
hodeeesite4
 
PPT
1. SYSTEMS, ROLES, AND DEVELOPMENT METHODOLOGIES.ppt
zilow058
 
PDF
2010_Book_EnvironmentalBioengineering (1).pdf
EmilianoRodriguezTll
 
PPT
SCOPE_~1- technology of green house and poyhouse
bala464780
 
PPTX
22PCOAM21 Session 2 Understanding Data Source.pptx
Guru Nanak Technical Institutions
 
PPTX
unit 3a.pptx material management. Chapter of operational management
atisht0104
 
PPTX
MSME 4.0 Template idea hackathon pdf to understand
alaudeenaarish
 
PDF
Software Testing Tools - names and explanation
shruti533256
 
PPT
Ppt for engineering students application on field effect
lakshmi.ec
 
PDF
settlement FOR FOUNDATION ENGINEERS.pdf
Endalkazene
 
PDF
Chad Ayach - A Versatile Aerospace Professional
Chad Ayach
 
PDF
dse_final_merit_2025_26 gtgfffffcjjjuuyy
rushabhjain127
 
PPTX
Inventory management chapter in automation and robotics.
atisht0104
 
PPTX
AgentX UiPath Community Webinar series - Delhi
RohitRadhakrishnan8
 
PPTX
Module2 Data Base Design- ER and NF.pptx
gomathisankariv2
 
PPTX
MT Chapter 1.pptx- Magnetic particle testing
ABCAnyBodyCanRelax
 
PDF
67243-Cooling and Heating & Calculation.pdf
DHAKA POLYTECHNIC
 
Information Retrieval and Extraction - Module 7
premSankar19
 
Color Model in Textile ( RGB, CMYK).pptx
auladhossain191
 
The Effect of Artifact Removal from EEG Signals on the Detection of Epileptic...
Partho Prosad
 
ternal cell structure: leadership, steering
hodeeesite4
 
1. SYSTEMS, ROLES, AND DEVELOPMENT METHODOLOGIES.ppt
zilow058
 
2010_Book_EnvironmentalBioengineering (1).pdf
EmilianoRodriguezTll
 
SCOPE_~1- technology of green house and poyhouse
bala464780
 
22PCOAM21 Session 2 Understanding Data Source.pptx
Guru Nanak Technical Institutions
 
unit 3a.pptx material management. Chapter of operational management
atisht0104
 
MSME 4.0 Template idea hackathon pdf to understand
alaudeenaarish
 
Software Testing Tools - names and explanation
shruti533256
 
Ppt for engineering students application on field effect
lakshmi.ec
 
settlement FOR FOUNDATION ENGINEERS.pdf
Endalkazene
 
Chad Ayach - A Versatile Aerospace Professional
Chad Ayach
 
dse_final_merit_2025_26 gtgfffffcjjjuuyy
rushabhjain127
 
Inventory management chapter in automation and robotics.
atisht0104
 
AgentX UiPath Community Webinar series - Delhi
RohitRadhakrishnan8
 
Module2 Data Base Design- ER and NF.pptx
gomathisankariv2
 
MT Chapter 1.pptx- Magnetic particle testing
ABCAnyBodyCanRelax
 
67243-Cooling and Heating & Calculation.pdf
DHAKA POLYTECHNIC
 

Troubleshooting PostgreSQL Streaming Replication

  • 2. dataegret.com Quick introduction • WAL and replication internals. • Replication setup. Troubleshooting tools • 3rd party tools. • Builtin tools. Troubleshooting cases • Symptoms and problems. • Detection and solutions. • Lessons learned. 02 03 01
  • 4. Goals01 dataegret.com Better understanding of streaming replication. How to quickly find and fix problems. https://goo.gl/Mm3ugt
  • 5. Agenda01 dataegret.com Write-Ahead Log. Streaming replication internals. Streaming replication setup. Troubleshooting tools overview (3-rd party). Troubleshooting tools overview (builtin). Troubleshooting in practice. Questions.
  • 6. Write Ahead Log01 dataegret.com Durability in ACID. Almost all changes fixed in WAL. pg_xlog/ (pg_wal/) directory in the DATADIR. Synchronous WAL write by backends. Asynchronous WAL write by WAL writer. Recovery process relies on WAL.
  • 7. Streaming replication internals01 dataegret.com WAL Sender process. WAL Receiver process. Startup process (recovery). Streaming replication vs. WAL archiving.
  • 9. Streaming replication setup01 dataegret.com Master: ● postgresql.conf; ● Restart. Standby: ● pg_basebackup; ● postgresql.conf; ● recovery.conf setup.
  • 10. Master setup01 dataegret.com wal_level, max_wal_senders, max_replication_slots. archive_mode, archive_command. wal_keep_segments. wal_sender_timeout. synchronous_standby_names.
  • 11. Master setup pitfalls01 dataegret.com wal_level, archive_mode, max_wal_senders, max_replication_slots – require restart. wal_keep_segments – requires extra storage space. wal_sender_timeout – reduce that, if network is bad. synchronous_standby_names – master freezes if standby fails.
  • 13. Standby setup pitfalls01 dataegret.com hot_standby – enables SELECT queries. max_standby_streaming_delay – increases max possible lag. hot_standby_feedback: ● postpones vacuum; ● potential tables/indexes bloat. wal_receiver_timeout – reduce that, if network is bad.
  • 14. Recovery.conf01 dataegret.com primary_conninfo and/or restore_command. standby_mode. trigger_file. Recovery target: ● immediate; ● particular point/xid/timestamp; ● recovery_min_apply_delay.
  • 17. 3rd party troubleshooting tools02 dataegret.com Top (procps). Iostat (sysstat), iotop. Nicstat. pgCenter. Perf.
  • 18. 3rd party troubleshooting tools02 dataegret.com Top (procps) – CPU usage, load average, mem/swap usage. Iostat (sysstat), iotop – storage utilization, process IO. Nicstat – network interfaces utilization. pgCenter – replication stats. Perf – deep investigations.
  • 19. Builtin troubleshooting tools02 dataegret.com Statistics views. Auxiliary functions. pg_xlogdump utility.
  • 20. Builtin troubleshooting tools02 dataegret.com Statistics view: ● pg_stat_replication; ● pg_stat_databases, pg_stat_databases_conflicts; ● pg_stat_activity; ● pg_stat_archiver.
  • 21. Builtin troubleshooting tools02 dataegret.com Auxiliary functions: ● pg_current_xlog_location, pg_last_xlog_receive_location; ● pg_xlog_location_dif; ● pg_xlog_replay_pause, pg_xlog_replay_resume; ● pg_is_xlog_replay_paused.
  • 22. Builtin troubleshooting tools02 dataegret.com pg_xlogdump: ● Decodes and displays XLOG for debugging; ● Can give wrong results when the server is running. pg_xlogdump -f -p /xlog_96 $(psql -qAtX -c "select pg_xlogfile_name(pg_current_xlog_location())")
  • 24. Troubleshooting cases03 dataegret.com Replication lag. pg_xlog/ bloat. Long transactions and recovery conflicts. Recovery process: 100% CPU usage.
  • 25. Replication lag03 dataegret.com Main symptom – answers differ between master and standbys. Detection: ● pg_stat_replication and pg_xlog_location_dif(); ● pg_last_xact_replay_timestamp().
  • 26. Replication lag03 dataegret.com # d pg_stat_replication View "pg_catalog.pg_stat_replication" Column | Type | Modifiers ------------------+--------------------------+----------- pid | integer | usesysid | oid | usename | name | application_name | text | client_addr | inet | client_hostname | text | client_port | integer | backend_start | timestamp with time zone | backend_xmin | xid | state | text | sent_location | pg_lsn | write_location | pg_lsn | flush_location | pg_lsn | replay_location | pg_lsn | sync_priority | integer | sync_state | text |
  • 27. Replication lag03 dataegret.com # SELECT client_addr AS client, usename AS user, application_name AS name, state, sync_state AS mode, (pg_xlog_location_diff(pg_current_xlog_location(),sent_location) / 1024)::int as pending, (pg_xlog_location_diff(sent_location,write_location) / 1024)::int as write, (pg_xlog_location_diff(write_location,flush_location) / 1024)::int as flush, (pg_xlog_location_diff(flush_location,replay_location) / 1024)::int as replay, (pg_xlog_location_diff(pg_current_xlog_location(),replay_location))::int / 1024 as total_lag FROM pg_stat_replication; сlient | user | name | state | mode | pending | write | flush | replay | total_lag ----------+--------+-------------+-----------+-------+---------+-------+-------+--------+----------- 10.6.6.9 | repmgr | walreceiver | streaming | async | 0 | 0 | 0 | 410480 | 410480 10.6.6.7 | repmgr | walreceiver | streaming | async | 0 | 2845 | 95628 | 112552 | 211025 10.6.6.6 | repmgr | walreceiver | streaming | async | 0 | 0 | 3056 | 9496 | 12552 10.6.6.8 | repmgr | walreceiver | streaming | async | 847582 | 0 | 0 | 3056 | 850638
  • 28. Replication lag03 dataegret.com # SELECT client_addr AS client, usename AS user, application_name AS name, state, sync_state AS mode, (pg_xlog_location_diff(pg_current_xlog_location(),sent_location) / 1024)::int as pending, (pg_xlog_location_diff(sent_location,write_location) / 1024)::int as write, (pg_xlog_location_diff(write_location,flush_location) / 1024)::int as flush, (pg_xlog_location_diff(flush_location,replay_location) / 1024)::int as replay, (pg_xlog_location_diff(pg_current_xlog_location(),replay_location))::int / 1024 as total_lag FROM pg_stat_replication; сlient | user | name | state | mode | pending | write | flush | replay | total_lag ----------+--------+-------------+-----------+-------+---------+-------+-------+--------+----------- 10.6.6.9 | repmgr | walreceiver | streaming | async | 0 | 0 | 0 | 410480 | 410480 10.6.6.7 | repmgr | walreceiver | streaming | async | 0 | 2845 | 95628 | 112552 | 211025 10.6.6.6 | repmgr | walreceiver | streaming | async | 0 | 0 | 3056 | 9496 | 12552 10.6.6.8 | repmgr | walreceiver | streaming | async | 847582 | 0 | 0 | 3056 | 850638
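    The arithmetic behind this query can be reproduced outside the database. The sketch below is in Python and is not from the slides. It assumes WAL locations in their usual textual form (two hexadecimal numbers separated by a slash, the high and low 32 bits of the position). The helper names lsn_to_int, lsn_diff and lag_breakdown_kb are made up for illustration. It uses integer division, so results can differ by one from the rounded ::int casts in the query.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a textual WAL location such as '16/B374D848' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lsn_diff(a: str, b: str) -> int:
    """Byte distance between two WAL locations (a minus b)."""
    return lsn_to_int(a) - lsn_to_int(b)


def lag_breakdown_kb(current: str, sent: str, write: str,
                     flush: str, replay: str) -> dict:
    """Split the total lag into the same stages as the query, in KiB."""
    return {
        "pending": lsn_diff(current, sent) // 1024,    # generated, not yet sent
        "write": lsn_diff(sent, write) // 1024,        # sent, not yet written
        "flush": lsn_diff(write, flush) // 1024,       # written, not yet flushed
        "replay": lsn_diff(flush, replay) // 1024,     # flushed, not yet replayed
        "total_lag": lsn_diff(current, replay) // 1024,
    }


if __name__ == "__main__":
    # Hypothetical locations, chosen only to exercise the arithmetic.
    print(lag_breakdown_kb("0/500000", "0/500000", "0/4FF000",
                           "0/4FE000", "0/400000"))
```

    The four stages add up to the total lag. That is why the query can show where a standby falls behind, and not only by how much. In PostgreSQL 10 and later the same functions and columns are named pg_wal_lsn_diff(), pg_current_wal_lsn(), sent_lsn, write_lsn, flush_lsn and replay_lsn.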
  • 31. Replication lag
    Network problems – nicstat.
    Storage problems – iostat, iotop.
    Recovery is stuck – top, pg_stat_activity.
    WAL pressure:
    ● pg_stat_activity, pg_stat_progress_vacuum;
    ● pg_xlog_location_diff().
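    One way to choose between these tools is to look at which stage of the lag dominates. The Python sketch below is an illustration, not part of the talk. The mapping from stage to suspect is a rough assumption (pending and write point at the network or at WAL pressure, flush at standby storage, replay at recovery), and the names HINTS and dominant_stage are hypothetical.

```python
# Rough, assumed mapping from the lag stage to the first thing to check.
HINTS = {
    "pending": "WAL not sent yet: check network (nicstat) and WAL pressure on the master",
    "write": "WAL sent but not written: check network (nicstat)",
    "flush": "WAL written but not flushed: check standby storage (iostat, iotop)",
    "replay": "WAL flushed but not replayed: check recovery (top, pg_stat_activity)",
}


def dominant_stage(lag_kb: dict):
    """Return (stage, hint) for the largest lag component, or None if there is no lag."""
    stages = ("pending", "write", "flush", "replay")
    stage = max(stages, key=lambda s: lag_kb.get(s, 0))
    if lag_kb.get(stage, 0) <= 0:
        return None
    return stage, HINTS[stage]


if __name__ == "__main__":
    # Values in KiB taken from the 10.6.6.8 row of the example output.
    print(dominant_stage({"pending": 847582, "write": 0, "flush": 0, "replay": 3056}))
```

    With the example output, 10.6.6.8 is dominated by pending WAL, while 10.6.6.9 lags almost entirely in replay, so the two standbys call for different tools.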
  • 32. Replication lag
    Network/storage problems:
    ● check workload;
    ● upgrade hardware.
    Recovery is stuck – wait or cancel queries on standby.
    WAL pressure:
    ● reduce the amount of work;
    ● reduce the amount of WAL:
        ● full_page_writes = off, wal_compression = on, wal_log_hints = off;
        ● expand the interval between checkpoints.
  • 33. pg_xlog/ bloat
    Main symptoms:
    ● unexpected increase in disk space usage;
    ● abnormal size of the pg_xlog/ directory.
  • 34. pg_xlog/ bloat
    Detection:
    ● du -csh;
    ● pg_replication_slots, pg_stat_archiver;
    ● errors in postgres logs.
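    The du check can also be scripted, for example to feed a monitoring system. The Python sketch below is an assumption-laden illustration and not from the slides. It counts only files directly inside the directory, treats names of 24 uppercase hexadecimal digits as WAL segments, and sums apparent file sizes, not allocated blocks as du does. The name wal_dir_summary is hypothetical.

```python
import os
import re

# WAL segment files are named with 24 hexadecimal digits.
SEGMENT_NAME = re.compile(r"^[0-9A-F]{24}$")


def wal_dir_summary(path: str) -> tuple:
    """Return (segment_count, total_bytes) for files directly inside `path`."""
    segments = 0
    total = 0
    with os.scandir(path) as entries:
        for entry in entries:
            # Skip subdirectories such as archive_status/.
            if not entry.is_file(follow_symlinks=False):
                continue
            total += entry.stat(follow_symlinks=False).st_size
            if SEGMENT_NAME.match(entry.name):
                segments += 1
    return segments, total


if __name__ == "__main__":
    import sys
    count, size = wal_dir_summary(sys.argv[1])
    print(f"{count} segments, {size / 1024 / 1024:.1f} MiB")
```

    A segment count that keeps growing between runs is the signal to look at pg_replication_slots and pg_stat_archiver next.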
  • 35. pg_xlog/ bloat
    Problems:
    ● Massive CRUD.
    ● Unused slot.
    ● Broken archive_command.
  • 36. pg_xlog/ bloat
    Solutions:
    ● check replication lag;
    ● reduce checkpoint_segments/max_wal_size, wal_keep_segments;
    ● change reserved space ratio (ext filesystems);
    ● add extra space (LVM, ZFS, etc.);
    ● drop the unused slot or fix the slot consumer;
    ● fix WAL archiving;
    ● checkpoint, checkpoint, checkpoint.
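    To judge whether pg_xlog/ is actually bloated, it helps to know how many segments the settings normally allow. The Python sketch below is an illustration, not from the slides. It encodes the rule of thumb from the documentation of releases that still use checkpoint_segments (before 9.5) and assumes the default 16 MB segment size. The function names are hypothetical. With max_wal_size, a soft limit in later releases, this formula does not apply.

```python
import math

SEGMENT_BYTES = 16 * 1024 * 1024  # default WAL segment size (assumed)


def expected_max_wal_files(checkpoint_segments: int,
                           checkpoint_completion_target: float = 0.5,
                           wal_keep_segments: int = 0) -> int:
    """Usual upper bound on the number of segment files in pg_xlog/ (pre-9.5)."""
    by_checkpoints = (2 + checkpoint_completion_target) * checkpoint_segments + 1
    by_keep = checkpoint_segments + wal_keep_segments + 1
    return math.ceil(max(by_checkpoints, by_keep))


def expected_max_wal_bytes(checkpoint_segments: int,
                           checkpoint_completion_target: float = 0.5,
                           wal_keep_segments: int = 0) -> int:
    """The same bound expressed in bytes."""
    files = expected_max_wal_files(checkpoint_segments,
                                   checkpoint_completion_target,
                                   wal_keep_segments)
    return files * SEGMENT_BYTES


if __name__ == "__main__":
    print(expected_max_wal_files(10, 0.5, 100))
```

    Neither an unused replication slot nor a failing archive_command is limited by this bound, so a directory far above it points at one of those two causes.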
  • 37. Recovery conflicts
    Main symptoms – errors in postgresql or application logs.
    postgres.c:errdetail_recovery_conflict():
    ● User was holding shared buffer pin for too long.
    ● User was holding a relation lock for too long.
    ● User was or might have been using tablespace that must be dropped.
    ● User query might have needed to see row versions that must be removed.
    ● User transaction caused buffer deadlock with recovery.
    ● User was connected to a database that must be dropped.
  • 38. Recovery conflicts
    Detection:
    ● pg_stat_database + pg_stat_database_conflicts;
    ● postgresql logs.
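    The counters in pg_stat_database_conflicts are cumulative, so the useful signal is how much they grow between two samples. The Python sketch below is an illustration, not from the slides. It assumes each sample is a dict keyed by the view's confl_* column names, and the names CONFLICT_HINTS and conflict_deltas are hypothetical.

```python
# Columns of pg_stat_database_conflicts and the kind of conflict each one counts.
CONFLICT_HINTS = {
    "confl_tablespace": "tablespace that must be dropped",
    "confl_lock": "relation lock held for too long",
    "confl_snapshot": "row versions that must be removed",
    "confl_bufferpin": "shared buffer pin held for too long",
    "confl_deadlock": "buffer deadlock with recovery",
}


def conflict_deltas(before: dict, after: dict) -> list:
    """Return (column, increase) pairs that grew between two samples, largest first."""
    deltas = {}
    for column in CONFLICT_HINTS:
        grown = after.get(column, 0) - before.get(column, 0)
        if grown > 0:  # ignore unchanged counters and statistics resets
            deltas[column] = grown
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical samples for one database on a standby.
    earlier = {"confl_snapshot": 10, "confl_lock": 2}
    later = {"confl_snapshot": 25, "confl_lock": 3, "confl_bufferpin": 0}
    for column, grown in conflict_deltas(earlier, later):
        print(f"{column}: +{grown} ({CONFLICT_HINTS[column]})")
```

    The column that grows fastest shows which of the error details from the logs is the real problem.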
  • 39. Recovery conflicts
    Problems:
    ● queries are cancelled too often;
    ● long transactions on a standby – check pg_stat_activity;
    ● huge apply lag – check pg_stat_replication.
  • 40. Recovery conflicts
    Solutions:
    ● increase max_standby_streaming_delay (potentially causes lag);
    ● enable hot_standby_feedback (potentially causes bloat);
    ● rewrite queries;
    ● set up a dedicated standby for long queries.
  • 41. Recovery 100% CPU usage
    Main symptoms:
    ● huge apply lag;
    ● 100% CPU usage by recovery process.
  • 42. Recovery 100% CPU usage
    Detection:
    ● top – CPU usage;
    ● pg_stat_replication – amount of lag.
  • 43. Recovery 100% CPU usage
    Investigation:
    ● perf top/record/report (requires debug symbols);
    ● pg_xlogdump.
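    pg_xlogdump output gets long quickly, so a first pass is often to count records per resource manager and see what the startup process is busy replaying. The Python sketch below is an illustration, not from the slides. It relies only on each record line starting with "rmgr: <name>"; the sample lines are hypothetical and shortened, and the name count_by_rmgr is made up.

```python
import re
from collections import Counter

# Each pg_xlogdump record line starts with "rmgr: <resource manager name>".
RMGR_RE = re.compile(r"^rmgr:\s+(\S+)")


def count_by_rmgr(lines) -> Counter:
    """Count WAL records per resource manager in pg_xlogdump text output."""
    counts = Counter()
    for line in lines:
        match = RMGR_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical, shortened sample of pg_xlogdump output.
    sample = [
        "rmgr: Heap        len (rec/tot):     31/    59, tx: 1234, desc: insert",
        "rmgr: Btree       len (rec/tot):     18/    50, tx: 1234, desc: insert",
        "rmgr: Heap        len (rec/tot):     31/    59, tx: 1235, desc: insert",
        "rmgr: Transaction len (rec/tot):     12/    44, tx: 1235, desc: commit",
    ]
    for name, n in count_by_rmgr(sample).most_common():
        print(name, n)
```

    In practice the lines would come from a pipe, for example by passing sys.stdin to count_by_rmgr. A resource manager that dominates the counts narrows down which workload on the master to examine.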
  • 44. Recovery 100% CPU usage
    Solutions:
    ● depend on the investigation results;
    ● change the problematic workload (if found).
  • 45. Lessons learned
    Streaming replication problems are always distributed.
    There are many sources of problems:
    ● system resources, app/queries, workload.
    Always use monitoring.
    Learn how to use builtin tools.
  • 46. Links
    PostgreSQL official documentation – The Statistics Collector
    https://www.postgresql.org/docs/current/static/monitoring-stats.html
    PostgreSQL Mailing Lists (general, performance, hackers)
    https://www.postgresql.org/list/
    PostgreSQL-Consulting company blog
    http://blog.postgresql-consulting.com/