
Errant GTIDs Breaking Replication - How To Detect and Avoid Them

This document discusses errant GTIDs, which are transactions that exist on a replica database but not the primary database, causing inconsistencies between databases. It describes how to detect errant GTIDs by comparing the GTID sets between databases and finding the differences. The consequences of errant GTIDs and some ways to detect and fix them are also covered, such as using tools like Orchestrator and custom monitoring checks.


Errant GTIDs breaking replication
Dieter Adriaenssens
Ghent University
Who am I?

Dieter Adriaenssens
• Linux System Administrator
• MySQL DBA
• Works at Ghent University
• Open Source : former phpMyAdmin team member
• Lives in Ghent, Belgium
• Climber
• E-mail : [email protected]
• Twitter : @dcadriaenssens

• Pictures :
− (c) Thomas Kindermans https://en.wikipedia.org/wiki/File:Gent_Graslei_zonsondergang.JPG
− (c) Rouslan Placella


Errant GTIDs breaking replication
How to detect and avoid them
Context

Replicating MySQL cluster
• Primary with some replicas
• Master failover
• Replication breaks from new primary (previous replica) to remaining replicas :
‘[..] the master has purged binary logs containing GTIDs that the slave requires.’
• Where did those transactions on the new primary come from?

Overview

• Replication and GTID
• Errant GTIDs on a replica
• Detection
• Where do they come from?
• How to avoid them
• How to fix them
• Demo

Introduction
Replication, GTID, data consistency
Replication

MySQL server replication :
• Primary (master)
• Several replicas (slaves)
• Identical state across several servers
[Diagram: one primary replicating to two replicas]

Replication : Why

• High availability
• Master failover
• Disaster recovery
• Scaling load
• Regional distribution
• ...

Replication : How

MySQL master-slave replication :
• All nodes start from an identical state
• Replicas apply transactions from the primary node’s binary logs
• Transactions are applied in the correct order
• Asynchronous : the replica initiates the connection, the primary does not wait for the replicas
• Maintain identical state
● Read/write on primary
● Replicas should be read-only

Replication : How

Keeping track of binlog position :
• Several binlog files : mysql_bin.xxxxxx
• Rotation (based on size, on server restart)
• Every transaction has a unique (incremental) position : binlog file name + offset, specific to that server
• Replication starts from :
● Binary log file name
● Position in that binlog
• Problematic with :
● Multiple primaries
● Master failover
● Restore from backup
GTID

Global Transaction Identifier
• Unique identifier of a transaction
• GTID = Origin ID + transaction ID
• Origin ID : server UUID
• Transaction ID : incremental number

Format : server_uuid:transaction_id
3E11FA47-71CA-11E1-9E33-C80AA9429562:23

SELECT @@GLOBAL.server_uuid;
3E11FA47-71CA-11E1-9E33-C80AA9429562

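To make the GTID format concrete, here is a minimal Python sketch (not part of the talk) that splits a single GTID into its origin ID and transaction ID. The function name `parse_gtid` is hypothetical, and it only handles the plain `server_uuid:transaction_id` form shown above.

```python
import uuid


def parse_gtid(gtid: str) -> tuple[str, int]:
    """Split a GTID 'server_uuid:transaction_id' into (origin ID, transaction ID)."""
    origin, sep, number = gtid.strip().rpartition(":")
    if not sep:
        raise ValueError(f"not a GTID: {gtid!r}")
    # uuid.UUID validates the origin ID; str() normalises it to lower case.
    origin_id = str(uuid.UUID(origin))
    transaction_id = int(number)
    if transaction_id < 1:
        raise ValueError(f"transaction ID must be >= 1: {gtid!r}")
    return origin_id, transaction_id


print(parse_gtid("3E11FA47-71CA-11E1-9E33-C80AA9429562:23"))
# ('3e11fa47-71ca-11e1-9e33-c80aa9429562', 23)
```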
GTID set

Executed GTID set
• Set of applied GTIDs on a server
• Defines a state of data consistency
• Easy to compare between nodes
• Replica applies GTIDs of a primary
• Simplifies failovers (no need to remember binlog position)

SHOW SLAVE STATUS;
SHOW MASTER STATUS;
SELECT @@GLOBAL.gtid_executed;

Executed_Gtid_Set:
27ab32d2-7f36-11e7-8bd7-aa00009002fd:1-7552113,
50d5e9eb-c5d3-11e6-b86b-aa00009002f7:1-80192973,
c004c0eb-c84e-11e6-8efc-aa00009002fd:1-6084195:6140951-6141015

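The executed GTID set is a compact list of intervals per origin server. A minimal Python sketch (hypothetical helper names, assuming well-formed input like the Executed_Gtid_Set above) shows how such a set can be read: entries are separated by commas, and one origin can have several intervals separated by colons.

```python
def parse_gtid_set(text: str) -> dict[str, list[tuple[int, int]]]:
    """Parse a GTID set into {server_uuid: [(start, end), ...]}.

    Entries are separated by commas; each entry is 'uuid:interval[:interval...]'
    where an interval is either 'N' or 'N-M'. Line breaks are ignored.
    """
    result: dict[str, list[tuple[int, int]]] = {}
    for entry in text.replace("\n", "").split(","):
        entry = entry.strip()
        if not entry:
            continue
        origin, *intervals = entry.split(":")
        ranges = result.setdefault(origin.lower(), [])
        for interval in intervals:
            start, _, end = interval.partition("-")
            ranges.append((int(start), int(end or start)))
    return result


def count_transactions(gtid_set: dict[str, list[tuple[int, int]]]) -> int:
    """Number of GTIDs in a parsed set (intervals are inclusive)."""
    return sum(end - start + 1 for ranges in gtid_set.values() for start, end in ranges)


executed = parse_gtid_set("""27ab32d2-7f36-11e7-8bd7-aa00009002fd:1-7552113,
50d5e9eb-c5d3-11e6-b86b-aa00009002f7:1-80192973,
c004c0eb-c84e-11e6-8efc-aa00009002fd:1-6084195:6140951-6141015""")
print(executed["c004c0eb-c84e-11e6-8efc-aa00009002fd"])
# [(1, 6084195), (6140951, 6141015)]
```

The gap between 6084195 and 6140951 in the last entry means that origin's transactions in between were never applied on this server.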
ROW based replication

• Changes in rows are replicated
• As opposed to statement-based replication, which replicates the actual queries
• In combination with GTID => data consistency across the cluster nodes

Replicating cluster with GTID

• MySQL Server cluster
● Primary (R/W)
● Replicas (R/O)
• GTID enabled
• Row based replication
• ProxySQL to redirect traffic to the correct cluster node
• Orchestrator managing the cluster nodes (automatic master failover)

orchestrator -c topology --alias=demo

node1:3306 [0s,ok,5.7.25,rw,ROW,>>,GTID]
+ node2:3306 [0s,ok,5.7.25,ro,ROW,>>,GTID]
+ node3:3306 [0s,ok,5.7.25,ro,ROW,>>,GTID]

[Diagram: ProxySQL routing traffic to Node1, Node2 and Node3, with Orchestrator managing the nodes]

Errant GTID
Definition, consequences, detection, examples, how to avoid, fixes
Errant GTID

• Unreplicated transaction existing only on a replica
• Data is not the same on all nodes
• Cluster is no longer in a consistent state

Errant GTID : consequences

• Everything is fine?
• Inconsistent state between nodes
● Split brain
● Different data when reading from that replica
• Unexpected behaviour when a replica is promoted to master
● The other replicas fetch the errant transactions from the new primary, so replication might fail
● If the errant GTID was already purged from the new primary’s binlogs, replication stops on master failover :
‘[..] the master has purged binary logs containing GTIDs that the slave requires.’

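The failover scenario above can be modelled in a few lines of Python. This is a simplified sketch with hypothetical, deliberately small GTID sets: it expands intervals into Python sets (fine for illustration, not for real sets with millions of GTIDs), and it assumes the check MySQL performs with GTID auto-positioning, namely that the primary refuses to serve a replica whose executed set does not contain all GTIDs the primary has purged.

```python
def expand(gtid_set: str) -> set[tuple[str, int]]:
    """Expand 'uuid:1-3:5,...' into {(uuid, 1), (uuid, 2), (uuid, 3), (uuid, 5), ...}.

    Only suitable for small illustrative sets; real sets hold millions of GTIDs.
    """
    gtids: set[tuple[str, int]] = set()
    for entry in gtid_set.replace("\n", "").split(","):
        entry = entry.strip()
        if not entry:
            continue
        origin, *intervals = entry.split(":")
        for interval in intervals:
            start, _, end = interval.partition("-")
            gtids.update((origin.lower(), n) for n in range(int(start), int(end or start) + 1))
    return gtids


def after_failover(new_primary_executed: str, new_primary_purged: str, replica_executed: str) -> str:
    """Predict what a remaining replica does once it replicates from the new primary."""
    executed = expand(new_primary_executed)
    purged = expand(new_primary_purged)
    replica = expand(replica_executed)
    # GTIDs the primary no longer has in its binlogs must already be on the replica.
    if not purged <= replica:
        return "replication stops: the master has purged binary logs containing GTIDs that the slave requires"
    missing = executed - replica
    if missing:
        # The errant transactions are now replicated to (and applied on) this replica.
        return f"replica applies {len(missing)} transaction(s) it never saw before"
    return "replica is in sync"


PRIMARY = "27ab32d2-7f36-11e7-8bd7-aa00009002fd"
ERRANT = "50d5e9eb-c5d3-11e6-b86b-aa00009002f7"
node2 = f"{PRIMARY}:1-100,{ERRANT}:1-4"  # promoted replica with 4 errant GTIDs
node3 = f"{PRIMARY}:1-100"  # remaining replica

print(after_failover(node2, "", node3))
print(after_failover(node2, f"{PRIMARY}:1-50,{ERRANT}:1-4", node3))
```

Both outcomes are problems: either the errant transactions silently spread to the other replicas, or replication breaks.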
Errant GTID detection
Errant GTID detection

• Compare executed GTID sets between primary node and replica nodes
• Replica has GTIDs that are not on the primary => errant GTID

GTID executed set

# primary
SELECT @@GLOBAL.gtid_executed;

27ab32d2-7f36-11e7-8bd7-aa00009002fd:1-7552120

# replica
SELECT @@GLOBAL.gtid_executed;

27ab32d2-7f36-11e7-8bd7-aa00009002fd:1-7552113,
50d5e9eb-c5d3-11e6-b86b-aa00009002f7:1-4

GTID subset
SELECT GTID_SUBSET('<gtid_executed_replica>',
'<gtid_executed_primary>');
SELECT GTID_SUBSET(
'27ab32d2-7f36-11e7-8bd7-aa00009002fd:1-7552113',
'27ab32d2-7f36-11e7-8bd7-aa00009002fd:1-7552120') AS is_subset;
+-----------+
| is_subset |
+-----------+
|         1 |
+-----------+
1 row in set (0.00 sec)
Replica GTID set is a subset of primary GTID set : OK

GTID subset
SELECT GTID_SUBSET('<gtid_executed_replica>',
'<gtid_executed_primary>');
SELECT GTID_SUBSET(
'27ab32d2-7f36-11e7-8bd7-aa00009002fd:1-7552113,
50d5e9eb-c5d3-11e6-b86b-aa00009002f7:1-4',
'27ab32d2-7f36-11e7-8bd7-aa00009002fd:1-7552120') AS is_subset;
+-----------+
| is_subset |
+-----------+
|         0 |
+-----------+
1 row in set (0.00 sec)
Replica GTID set is NOT a subset of primary GTID set
=> Errant GTID on replica
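For readers who want to see what GTID_SUBSET() checks, here is a minimal pure-Python equivalent. It is a sketch, not MySQL's implementation: it assumes well-formed input in the format shown above, and the function names are hypothetical. A set is a subset when every interval of the first set is fully covered by the intervals of the same origin in the second set.

```python
def parse_gtid_set(text: str) -> dict[str, list[tuple[int, int]]]:
    """Parse 'uuid:1-5:7,uuid2:3' into {uuid: [(start, end), ...]}."""
    result: dict[str, list[tuple[int, int]]] = {}
    for entry in text.replace("\n", "").split(","):
        entry = entry.strip()
        if not entry:
            continue
        origin, *intervals = entry.split(":")
        ranges = result.setdefault(origin.lower(), [])
        for interval in intervals:
            start, _, end = interval.partition("-")
            ranges.append((int(start), int(end or start)))
    return result


def covered(ranges: list[tuple[int, int]], start: int, end: int) -> bool:
    """True if every number in [start, end] lies in the union of `ranges`."""
    cursor = start
    for lo, hi in sorted(ranges):
        if lo > cursor:
            break  # gap: `cursor` itself is not covered
        cursor = max(cursor, hi + 1)
        if cursor > end:
            break
    return cursor > end


def gtid_subset(set1: str, set2: str) -> bool:
    """Like GTID_SUBSET(set1, set2): is every GTID of set1 also in set2?"""
    s1, s2 = parse_gtid_set(set1), parse_gtid_set(set2)
    return all(
        covered(s2.get(origin, []), start, end)
        for origin, ranges in s1.items()
        for start, end in ranges
    )


primary = "27ab32d2-7f36-11e7-8bd7-aa00009002fd:1-7552120"
replica = """27ab32d2-7f36-11e7-8bd7-aa00009002fd:1-7552113,
50d5e9eb-c5d3-11e6-b86b-aa00009002f7:1-4"""
print(gtid_subset(replica, primary))  # False => errant GTID on the replica
```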
Find errant GTIDs
SELECT GTID_SUBTRACT('<gtid_executed_replica>',
'<gtid_executed_primary>');
SELECT GTID_SUBTRACT(
'27ab32d2-7f36-11e7-8bd7-aa00009002fd:1-7552113,
50d5e9eb-c5d3-11e6-b86b-aa00009002f7:1-4',
'27ab32d2-7f36-11e7-8bd7-aa00009002fd:1-7552120') AS errant_gtid;
+------------------------------------------+
| errant_gtid |
+------------------------------------------+
| 50d5e9eb-c5d3-11e6-b86b-aa00009002f7:1-4 |
+------------------------------------------+
1 row in set (0.00 sec)
Result : the set of errant GTIDs on the replica

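GTID_SUBTRACT() can be sketched the same way: for each origin, remove the intervals of the second set from the intervals of the first. This is again a hypothetical pure-Python illustration (assumed names, well-formed input), handy for example in a monitoring script that already has both gtid_executed values as strings.

```python
def parse_gtid_set(text: str) -> dict[str, list[tuple[int, int]]]:
    """Parse 'uuid:1-5:7,uuid2:3' into {uuid: [(start, end), ...]}."""
    result: dict[str, list[tuple[int, int]]] = {}
    for entry in text.replace("\n", "").split(","):
        entry = entry.strip()
        if not entry:
            continue
        origin, *intervals = entry.split(":")
        ranges = result.setdefault(origin.lower(), [])
        for interval in intervals:
            start, _, end = interval.partition("-")
            ranges.append((int(start), int(end or start)))
    return result


def subtract_ranges(ranges, remove):
    """Parts of `ranges` not covered by `remove` (inclusive (start, end) intervals)."""
    result = []
    remove = sorted(remove)
    for start, end in sorted(ranges):
        cursor = start
        for lo, hi in remove:
            if hi < cursor or lo > end:
                continue  # no overlap with what is left of this interval
            if lo > cursor:
                result.append((cursor, lo - 1))  # keep the part before the removed range
            cursor = max(cursor, hi + 1)
            if cursor > end:
                break
        if cursor <= end:
            result.append((cursor, end))
    return result


def gtid_subtract(set1: str, set2: str) -> str:
    """Like GTID_SUBTRACT(set1, set2): the GTIDs of set1 that are not in set2."""
    s1, s2 = parse_gtid_set(set1), parse_gtid_set(set2)
    parts = []
    for origin in sorted(s1):
        left = subtract_ranges(s1[origin], s2.get(origin, []))
        if left:
            intervals = ":".join(f"{lo}-{hi}" if lo != hi else str(lo) for lo, hi in left)
            parts.append(f"{origin}:{intervals}")
    return ",".join(parts)


primary = "27ab32d2-7f36-11e7-8bd7-aa00009002fd:1-7552120"
replica = """27ab32d2-7f36-11e7-8bd7-aa00009002fd:1-7552113,
50d5e9eb-c5d3-11e6-b86b-aa00009002f7:1-4"""
print(gtid_subtract(replica, primary))  # 50d5e9eb-c5d3-11e6-b86b-aa00009002f7:1-4
```

An empty result means the replica has no errant GTIDs.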
Errant GTID : automatic detection

• Icinga monitoring check
• Orchestrator
• Other tools (MySQL Utilities) :
● mysqlfailover
● mysqlrpladmin

Errant GTID : detection

Monitoring check
• Automate checking for errant GTID
• Icinga compatible output format
• Uses orchestrator for cluster info
• https://github.com/UGent-DICT/check_mysql_gtid

Errant GTID : monitoring check
./check_mysql_gtid <clustername>

./check_mysql_gtid demo

MYSQL_CLUSTER_GTID OK - GTIDs on all nodes are replicated in the cluster!

Cluster demo (primary : node1) :
- node2 : OK
- node3 : OK

Everything is fine!

Errant GTID : monitoring check
./check_mysql_gtid <clustername>

./check_mysql_gtid demo

MYSQL_CLUSTER_GTID WARNING : replicas containing unreplicated GTIDs : node2

Cluster demo (primary : node1) :
- node2 : GTIDs only exist on the replica :
50d5e9eb-c5d3-11e6-b86b-aa00009002f7:1-4
- node3 : OK

The check reports unreplicated (errant) GTIDs on one of the nodes.
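The output format above can be reproduced with a small Python sketch. This is not the source of check_mysql_gtid: the function `render_check` is hypothetical, and it assumes the per-replica result of GTID_SUBTRACT(<gtid_executed_replica>, <gtid_executed_primary>) has already been fetched (for example for the nodes Orchestrator reports). It only shows how the result maps to Icinga-style output and the standard plugin exit codes.

```python
# Standard Nagios/Icinga plugin exit codes.
OK, WARNING = 0, 1


def render_check(cluster: str, primary: str, errant_by_replica: dict[str, str]) -> tuple[int, str]:
    """Build Icinga-style output from per-replica errant GTID sets.

    `errant_by_replica` maps each replica to the result of
    GTID_SUBTRACT(<gtid_executed_replica>, <gtid_executed_primary>);
    an empty string means the replica has no errant GTIDs.
    """
    bad = [node for node, errant in errant_by_replica.items() if errant]
    if bad:
        status = WARNING
        summary = "MYSQL_CLUSTER_GTID WARNING : replicas containing unreplicated GTIDs : " + ", ".join(bad)
    else:
        status = OK
        summary = "MYSQL_CLUSTER_GTID OK - GTIDs on all nodes are replicated in the cluster!"
    lines = [summary, "", f"Cluster {cluster} (primary : {primary}) :"]
    for node, errant in errant_by_replica.items():
        if errant:
            lines += [f"- {node} : GTIDs only exist on the replica :", errant]
        else:
            lines.append(f"- {node} : OK")
    return status, "\n".join(lines)


status, output = render_check(
    "demo", "node1", {"node2": "50d5e9eb-c5d3-11e6-b86b-aa00009002f7:1-4", "node3": ""}
)
print(output)
```

A real plugin would then print the output and exit with `status` (e.g. `sys.exit(status)`), since Icinga derives the service state from the exit code.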
Errant GTID : detection

Orchestrator
• Reports errant GTIDs (>= v3.0.13)
• Provides ways to fix them (>= v3.0.14)

Errant GTID

Find transaction
• Look for the GTID in the binary logs
• Each binlog starts with the set of GTIDs executed before that file (Previous-GTIDs event)
• Select the relevant binlog
• Find the transaction in that binlog

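Selecting the relevant binlog follows from the Previous-GTIDs sets: the GTID is in the first file whose successor already lists it as executed. A minimal Python sketch, assuming you have collected each file's Previous-GTIDs set (for example as reported by mysqlbinlog or SHOW BINLOG EVENTS); the function names and the file list are hypothetical.

```python
from typing import Optional


def gtid_in_set(gtid: str, gtid_set: str) -> bool:
    """True if the single GTID 'uuid:N' is part of the GTID set string."""
    origin, _, number = gtid.rpartition(":")
    n = int(number)
    for entry in gtid_set.replace("\n", "").split(","):
        entry = entry.strip()
        if not entry:
            continue
        uuid, *intervals = entry.split(":")
        if uuid.lower() != origin.lower():
            continue
        for interval in intervals:
            start, _, end = interval.partition("-")
            if int(start) <= n <= int(end or start):
                return True
    return False


def find_binlog(gtid: str, binlogs: list[tuple[str, str]]) -> Optional[str]:
    """Return the binlog file holding `gtid`, or None if it predates the oldest file.

    `binlogs` lists (file_name, previous_gtids) in order; previous_gtids is the
    GTID set recorded at the start of each file. Assumes `gtid` was executed.
    """
    if not binlogs or gtid_in_set(gtid, binlogs[0][1]):
        return None  # executed before the oldest remaining binlog (purged)
    for (name, _), (_, next_previous) in zip(binlogs, binlogs[1:]):
        if gtid_in_set(gtid, next_previous):
            return name  # not executed before `name`, but executed before the next file
    return binlogs[-1][0]  # otherwise it must be in the newest file


P = "27ab32d2-7f36-11e7-8bd7-aa00009002fd"
E = "50d5e9eb-c5d3-11e6-b86b-aa00009002f7"
binlogs = [
    ("node2-bin.000001", ""),
    ("node2-bin.000002", f"{P}:1-7552000"),
    ("node2-bin.000003", f"{P}:1-7552100,{E}:1-4"),
]
print(find_binlog(f"{E}:1", binlogs))  # node2-bin.000002
```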
Errant GTID : find transaction

mysqlbinlog --base64-output=DECODE-ROWS --verbose node2-bin.00* | grep -A6 "50d5e9eb-c5d3-11e6-b86b-aa00009002f7:1'"

SET @@SESSION.GTID_NEXT= '50d5e9eb-c5d3-11e6-b86b-aa00009002f7:1'/*!*/;
# at 23890384
#180305 6:25:02 server id 590100 end_log_pos 23890464 CRC32 0x2fe8266b Query thread_id=3848059 exec_time=0 error_code=0
SET TIMESTAMP=1520227502/*!*/;
SET @@session.sql_mode=1073741824/*!*/;
FLUSH ERROR LOGS
/*!*/;

The errant transaction is a FLUSH ERROR LOGS statement.

Errant GTID : Examples

Transactions on a replica
• Manual transactions (by accident on a replica)
• Scripted maintenance tasks (e.g. config management)
● User creation
● Database creation
• Master failover gone bad :
● Split brain
● Writes redirected to a replica (e.g. host is r/w by accident, or after a restart)
• Log flushes

Intermezzo : flush-logs

• FLUSH LOGS; → not added to binlog (1)
• FLUSH ERROR LOGS; → added to binlog (2)
• FLUSH [LOCAL | NO_WRITE_TO_BINLOG] ERROR LOGS; → not added to binlog
• mysqladmin flush-logs executes FLUSH LOGS;
• mysqladmin flush-logs error (3) executes FLUSH ERROR LOGS;

• (1) Expected behaviour, according to the documentation : “FLUSH LOGS, FLUSH BINARY LOGS, FLUSH TABLES WITH READ LOCK (with or without a table list), and FLUSH TABLES tbl_name ... FOR EXPORT are not written to the binary log in any case because they would cause problems if replicated to a slave.” https://dev.mysql.com/doc/refman/8.0/en/flush.html
• (2) Related bug report : https://bugs.mysql.com/bug.php?id=88720
• (3) Introduced in MySQL 5.7.4
Avoid errant GTIDs
Avoid errant GTIDs

Avoid transactions on replica nodes :
• Don’t execute (write) statements on a replica

Avoid errant GTIDs

Use read_only:
• Set read_only on all replicas
• Preferably in the config file, to avoid a writable node after a restart
• Orchestrator can set a previous master to read_only on a failover : ApplyMySQLPromotionAfterMasterFailover = true

Avoid errant GTIDs

Use super_read_only:
• Users with SUPER privileges can still write when read_only is set
• Limit SUPER privileges/users
• Set super_read_only on all replicas
• Orchestrator can set a previous master to super_read_only on a failover :
UseSuperReadOnly = true (>= v3.0.7)

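Both settings can be verified automatically, next to the errant GTID check. A minimal Python sketch with a hypothetical function name, assuming the global variables have already been fetched as strings (for example from SHOW GLOBAL VARIABLES LIKE '%read_only'). Note that enabling super_read_only also enables read_only in MySQL, but the sketch checks both anyway.

```python
def read_only_problems(variables: dict[str, str]) -> list[str]:
    """List write-protection problems on a replica.

    `variables` holds global variables as strings ('ON'/'OFF' or '1'/'0');
    a missing variable is treated as OFF.
    """
    def enabled(name: str) -> bool:
        return variables.get(name, "OFF").strip().upper() in ("ON", "1")

    problems = []
    if not enabled("read_only"):
        problems.append("read_only is OFF: any user with write privileges can create errant GTIDs")
    if not enabled("super_read_only"):
        problems.append("super_read_only is OFF: users with SUPER can still write")
    return problems


print(read_only_problems({"read_only": "ON", "super_read_only": "OFF"}))
```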
Avoid errant GTIDs

Disable binlog for some transactions :
• Some statements don’t have to be replicated
• Examples :
● Log flushes
● Fixing an inconsistent state on a replica (e.g. errant GTID)
• Use the @@SESSION variable!

SET @@SESSION.SQL_LOG_BIN=0;
<statement>;
SET @@SESSION.SQL_LOG_BIN=1;

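When such maintenance is scripted, it helps to guarantee that SQL_LOG_BIN is switched back on even if a statement fails. A minimal Python sketch, assuming a DB-API style cursor (for example from mysql-connector-python or PyMySQL); the context manager name `binlog_disabled` is hypothetical. Only the session variable is changed, so other connections keep logging.

```python
from contextlib import contextmanager


@contextmanager
def binlog_disabled(cursor):
    """Run statements on this session without writing them to the binlog."""
    cursor.execute("SET @@SESSION.SQL_LOG_BIN=0")
    try:
        yield cursor
    finally:
        # Always re-enable binary logging for this session, even after an error.
        cursor.execute("SET @@SESSION.SQL_LOG_BIN=1")
```

Usage on a replica: `with binlog_disabled(cursor) as cur: cur.execute("FLUSH ERROR LOGS")`.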
Fix errant GTIDs
Fix errant GTIDs

Examine situation :
• Examine the transaction (binlog)
• Does it change data?
• Can’t find the transaction?
• Check consistency with pt-table-checksum (and resynchronize with pt-table-sync)
• Is data consistent?

Fix errant GTIDs

Possible fixes:
• Insert empty transactions on other nodes (including primary)
• ‘Remove’ GTIDs from replica binlog
• Roll back transactions : Unsplit brain
● Talk by Shlomi Noach @ FOSDEM 2019 https://fosdem.org/2019/schedule/event/unplitmysql/
• Restore data from primary/backup

Errant GTID : Insert empty transactions
On all nodes (or only on the primary if replication still works) :
● Repeat for each errant GTID

SET gtid_next='50d5e9eb-c5d3-11e6-b86b-aa00009002f7:1';
BEGIN;
COMMIT;

SET gtid_next='50d5e9eb-c5d3-11e6-b86b-aa00009002f7:2';
BEGIN;
COMMIT;

SET gtid_next=automatic;

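Typing these statements by hand gets tedious for a range such as 1-4. A minimal Python sketch (hypothetical helper, assuming the errant set comes from GTID_SUBTRACT as shown earlier) that expands the errant GTID set into one empty transaction per GTID, ending with a reset to automatic GTID assignment:

```python
def empty_transactions(errant: str) -> list[str]:
    """Expand an errant GTID set like 'uuid:1-4' into empty-transaction SQL."""
    statements = []
    for entry in errant.replace("\n", "").split(","):
        entry = entry.strip()
        if not entry:
            continue
        origin, *intervals = entry.split(":")
        for interval in intervals:
            start, _, end = interval.partition("-")
            for number in range(int(start), int(end or start) + 1):
                # One empty transaction per errant GTID.
                statements += [f"SET gtid_next='{origin}:{number}';", "BEGIN;", "COMMIT;"]
    # Return to normal GTID assignment for this session.
    statements.append("SET gtid_next='AUTOMATIC';")
    return statements


print("\n".join(empty_transactions("50d5e9eb-c5d3-11e6-b86b-aa00009002f7:1-4")))
```

The printed statements can be pasted into a mysql client session on the primary (and, if needed, on the other nodes).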
Errant GTID : Remove from binlog
On the primary
SELECT @@GLOBAL.gtid_executed;

27ab32d2-7f36-11e7-8bd7-aa00009002fd:1-7552120

On the replica
STOP SLAVE;
RESET MASTER;

SET GLOBAL GTID_PURGED="27ab32d2-7f36-11e7-8bd7-aa00009002fd:1-7552120";

START SLAVE;

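RESET MASTER : all binlogs on the replica are deleted!
Only use the primary’s gtid_executed if the replica has already applied all of it; otherwise use the replica’s own gtid_executed without the errant GTIDs.
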
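A safer way to pick the GTID_PURGED value is to start from the replica’s own gtid_executed and remove only the errant GTIDs, so transactions the replica has not applied yet are still fetched after START SLAVE. A minimal Python sketch of that computation (hypothetical helper names, well-formed input assumed); it only builds the statements, it does not run them.

```python
def parse_gtid_set(text: str) -> dict[str, list[tuple[int, int]]]:
    """Parse 'uuid:1-5:7,uuid2:3' into {uuid: [(start, end), ...]}."""
    result: dict[str, list[tuple[int, int]]] = {}
    for entry in text.replace("\n", "").split(","):
        entry = entry.strip()
        if not entry:
            continue
        origin, *intervals = entry.split(":")
        ranges = result.setdefault(origin.lower(), [])
        for interval in intervals:
            start, _, end = interval.partition("-")
            ranges.append((int(start), int(end or start)))
    return result


def subtract_ranges(ranges, remove):
    """Parts of `ranges` not covered by `remove` (inclusive (start, end) intervals)."""
    result = []
    remove = sorted(remove)
    for start, end in sorted(ranges):
        cursor = start
        for lo, hi in remove:
            if hi < cursor or lo > end:
                continue
            if lo > cursor:
                result.append((cursor, lo - 1))
            cursor = max(cursor, hi + 1)
            if cursor > end:
                break
        if cursor <= end:
            result.append((cursor, end))
    return result


def reset_master_statements(replica_executed: str, errant: str) -> list[str]:
    """Build the RESET MASTER fix; GTID_PURGED = replica's own set minus the errant GTIDs."""
    executed, unwanted = parse_gtid_set(replica_executed), parse_gtid_set(errant)
    parts = []
    for origin in sorted(executed):
        left = subtract_ranges(executed[origin], unwanted.get(origin, []))
        if left:
            intervals = ":".join(f"{lo}-{hi}" if lo != hi else str(lo) for lo, hi in left)
            parts.append(f"{origin}:{intervals}")
    purged = ",".join(parts)
    return [
        "STOP SLAVE;",
        "RESET MASTER;",  # deletes ALL binary logs on this replica
        f'SET GLOBAL GTID_PURGED="{purged}";',
        "START SLAVE;",
    ]


replica = """27ab32d2-7f36-11e7-8bd7-aa00009002fd:1-7552113,
50d5e9eb-c5d3-11e6-b86b-aa00009002f7:1-4"""
print("\n".join(reset_master_statements(replica, "50d5e9eb-c5d3-11e6-b86b-aa00009002f7:1-4")))
```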
Errant GTID : Fix them

Orchestrator (>= v3.0.14)
• Inject empty transactions
• Reset master (remove GTIDs)

Demo
Conclusion

Avoid errant GTIDs
• They can break replication
• Data consistency across the cluster is lost
• Avoid manual transactions on replicas
• Set read_only
• Set super_read_only
• Temporarily disable the binlog if you have to do something on a replica
• Monitor for errant GTIDs

Acknowledgements

Thanks to:
• Colleagues at Ghent University
• Tibor Korocz from Percona
• Blog posts :
● https://www.percona.com/blog/2014/05/19/errant-transactions-major-hurdle-for-gtid-based-failover-in-mysql-5-6/
● https://dzone.com/articles/how-createrestore-slave-using
● https://dzone.com/articles/mysql-replication-errant-transactions-in-gtid-base
● https://severalnines.com/blog/mysql-replication-and-gtid-based-failover-deep-dive-errant-transactions
● https://www.percona.com/blog/2013/03/26/repair-mysql-5-6-gtid-replication-by-injecting-empty-transactions/

Questions?

● Contact : @dcadriaenssens
● Monitoring check: https://github.com/UGent-DICT/check_mysql_gtid

Rate My Session
