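Me and MySQL: The Three-Day War (Investigation)
Hello, this is Tanaka, a DBA.
The era is Heisei, the age is DevOps. This is the record of a battle between a MySQL (4.0.26) that suddenly crashed and a DBA (twenty-something at heart).
This is episode 2 of a series, so if you have not read the preview (the introduction) yet, start here:
⇒ Me and MySQL: The Three-Day War (Preview)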
To calm down for the moment, instead of counting prime numbers, I will sort out the information I got from 'the person next to me'.
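- There have been past cases of it crashing repeatedly
- Rebuilding it on another server makes it recover for a while, but then the crashes come back
- Hang in there, you can do it
I did not calm down. There was no punchline either.
Anyway, 4.0 or not, it is still MySQL. The time has come to mobilize all the MySQL knowledge I have built up day to day.
Right, then.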
Number of processes running now: 0
130731 00:36:20 mysqld restarted
130731 0:36:20 InnoDB: Database was not shut down normally.
InnoDB: Starting recovery from log files...
InnoDB: Starting log scan based on checkpoint at
InnoDB: log sequence number 141 926344408
InnoDB: Doing recovery: scanned up to log sequence number 141 926357179
130731 0:36:20 InnoDB: Starting an apply batch of log records to the database...
InnoDB: Progress in percents: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
InnoDB: Apply batch completed
InnoDB: In a MySQL replication slave the last master binlog file
InnoDB: position 0 58547047, file name mysql-bin.238
InnoDB: Last MySQL binlog file position 0 922822642, file name ./mysql-bin.060
130731 0:36:21 InnoDB: Flushing modified pages from the buffer pool...
130731 0:36:21 InnoDB: Started
/usr/local/mysql4026/libexec/mysqld: ready for connections.
Version: '4.0.26-log' socket: '/xxx/mysql.sock' port: 3306 Source distribution
Crash recovery looks fine.
130731 0:36:21 Slave SQL thread initialized, starting replication in log 'mysql-bin.238' at position 58547047, relay log './mysql-relay-bin.736' position: 5133063
130731 0:36:21 Slave I/O thread: connected to master '[email protected]:3306', replication started in log 'mysql-bin.238' at position 58547047
It has not detected any corruption in the relay log or in the master's binary log either, and then,
mysqld got signal 11;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help diagnose
the problem, but since we have already crashed, something is definitely wrong
and this may fail.
key_buffer_size=67108864
read_buffer_size=2093056
max_used_connections=0
max_connections=3000
threads_connected=0
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_connections = 30773512 K
bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
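As a rough sanity check of the memory estimate in that crash report, here is a minimal Python sketch of the formula it prints. The excerpt does not show sort_buffer_size, so the sketch back-solves it from the reported total; that value is an inference, not something read from the server, and the function names are made up for illustration. It also assumes that "K" means 1024 bytes and that the total is rounded down.

```python
# Values copied from the crash report (all in bytes unless noted).
KEY_BUFFER_SIZE = 67108864
READ_BUFFER_SIZE = 2093056
MAX_CONNECTIONS = 3000
REPORTED_TOTAL_KIB = 30773512  # the "... = 30773512 K" figure


def worst_case_kib(key_buffer, read_buffer, sort_buffer, max_connections):
    """key_buffer_size + (read_buffer_size + sort_buffer_size) * max_connections, in KiB."""
    total_bytes = key_buffer + (read_buffer + sort_buffer) * max_connections
    return total_bytes // 1024  # assumption: the report rounds down to whole KiB


def implied_sort_buffer_bytes(total_kib, key_buffer, read_buffer, max_connections):
    """Solve the same formula for sort_buffer_size (approximate because of KiB rounding)."""
    per_connection = (total_kib * 1024 - key_buffer) / max_connections
    return per_connection - read_buffer


if __name__ == "__main__":
    sort_buffer = implied_sort_buffer_bytes(
        REPORTED_TOTAL_KIB, KEY_BUFFER_SIZE, READ_BUFFER_SIZE, MAX_CONNECTIONS
    )
    print(f"implied sort_buffer_size: about {sort_buffer / 1024 / 1024:.2f} MiB")
    print(f"worst case: about {REPORTED_TOTAL_KIB / 1024 / 1024:.1f} GiB")
```

Under those assumptions the implied sort_buffer_size comes out at roughly 8 MiB, and the worst case at roughly 29 GiB. Almost all of that comes from the per-connection term multiplied by max_connections=3000.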
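Bang, out of nowhere.
I can't help wondering why a backup-only slave has max_connections=3000, but threads_connected=0 suggests the crash is not caused by external queries. No traffic normally reaches it from outside anyway, and it is not used for batch jobs either.
...When in trouble, reach for the core, right!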
# vim ./my.cnf
..
core-file
..
# ulimit -c
0
# ulimit -c unlimited
# ulimit -c
unlimited
This is not a suid setup (I am not running /etc/init.d/mysql or the like as root; mysqld_safe is started as the mysql user), so there is no need to touch suid_dumpable.
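The core-dump prerequisites mentioned here (the ulimit -c value and suid_dumpable) can also be inspected from a script. The following is only a sketch, assuming Linux and Python's standard resource module; the helper names are made up for illustration. Note that it reports the limit of the process running the script, while mysqld inherits its own limit from the shell that launches mysqld_safe.

```python
import resource
from pathlib import Path


def core_limit_allows_dumps(soft_limit):
    """A soft RLIMIT_CORE of 0 suppresses core files; any other value permits them."""
    return soft_limit != 0


def describe_limit(value):
    """Render a limit the way `ulimit -c` would: a number or 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    print(f"core size limit: soft={describe_limit(soft)} hard={describe_limit(hard)}")
    print(f"core files possible for this process: {core_limit_allows_dumps(soft)}")

    # Linux-specific; only matters when the process has changed its uid/gid.
    suid_dumpable = Path("/proc/sys/fs/suid_dumpable")
    if suid_dumpable.exists():
        print(f"fs.suid_dumpable = {suid_dumpable.read_text().strip()}")
```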
Leave it in this state for a while, and
# ll core*
-rw------- 1 mysql mysql 1227132928 7月 31 10:11 core.26263
there it is!
Actually, it appeared so quickly that I was a little taken aback.
Now then, on to the analysis.
# gdb /usr/local/mysql4026/libexec/mysqld ./core.26263
..
warning: no loadable sections found in added symbol-file system-supplied DSO at 0x7fff63be6000
Core was generated by `/usr/local/mysql4026/libexec/mysqld --defaults-file=/xxx/my.cnf'.
Program terminated with signal 11, Segmentation fault.
#0 0x00000000006b32b9 in kill ()
(gdb) bt
Cannot access memory at address 0x2ac9611dc000
( д ) ゚ ゚ I can't read it
And while I was saying that,
# ll core*
-rw------- 1 mysql mysql 1227132928 7月 31 10:11 core.26263
-rw------- 1 mysql mysql 1227227136 7月 31 10:18 core.26520
Σ(゚д゚lll) It crashed again
For now, I turn core output off with ulimit -c 0 so the disk does not get eaten up, and pull the core and the mysqld binary over to the personal work VM I am used to using.
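To get a feel for how quickly repeated crashes could fill the disk, a small sketch like the following could list the core.* files and estimate how many more dumps of the same size would fit. It is only an illustration: the function names are hypothetical, and it assumes the dumps land in the directory being inspected.

```python
import shutil
from pathlib import Path


def core_files(directory="."):
    """Return (name, size_in_bytes) for each core.* file in directory, sorted by name."""
    return [
        (path.name, path.stat().st_size)
        for path in sorted(Path(directory).glob("core.*"))
        if path.is_file()
    ]


def dumps_until_full(free_bytes, dump_bytes):
    """How many more dumps of dump_bytes fit into free_bytes."""
    if dump_bytes <= 0:
        raise ValueError("dump_bytes must be positive")
    return free_bytes // dump_bytes


if __name__ == "__main__":
    found = core_files(".")
    free = shutil.disk_usage(".").free
    for name, size in found:
        print(f"{name}: {size} bytes")
    if found:
        largest = max(size for _, size in found)
        print(f"room for about {dumps_until_full(free, largest)} more dumps of that size")
```

With dumps of about 1.2 GB appearing every few minutes, as in the listing above, that number shrinks fast, which is why core output gets switched off once a usable core has been captured.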
gdb, one more time.
$ gdb ./mysqld ./core.26263
..
(gdb) bt
#0 0x00000000006b32b9 in kill ()
#1 0x0000000000674d23 in pthread_kill ()
#2 0x0000000000457724 in handle_segfault ()
#3 0x00000000006772f5 in __pthread_sighandler ()
#4 <signal handler called>
#5 0x0000000000671455 in pthread_cond_wait ()
#6 0x0000000000457473 in end_thread(THD*, bool) ()
#7 0x000000000046bac4 in handle_one_connection ()
#8 0x000000000067213b in pthread_start_thread ()
#9 0x00000000006db2ff in clone ()
Cannot access memory at address 0x2ac9611dc000
(gdb) frame 6
#6 0x0000000000457473 in end_thread(THD*, bool) ()
Σ(゚д゚lll) I see it! A single drop of water
I am not sure whether it is because this gdb is newer than the earlier one, but being able to see even just the function names is a big deal.
(gdb) thread apply all bt
Thread 13 (Thread 26190):
#0 0x00000000006d993a in select ()
#1 0x0000000000457c12 in handle_connections_sockets ()
#2 0x000000000045a326 in main ()
Thread 12 (Thread 26191):
#0 0x00000000006d9642 in poll ()
#1 0x00000000006726b7 in __pthread_manager ()
#2 0x00000000006db2ff in clone ()
#3 0x0000000000000000 in ?? ()
Thread 11 (Thread 26192):
#0 0x0000000000674b4c in __pthread_sigsuspend ()
#1 0x000000000067415d in __pthread_wait_for_restart_signal ()
#2 0x000000000067149c in pthread_cond_wait ()
#3 0x000000000060cc16 in os_event_wait ()
#4 0x00000000006100b0 in os_aio_simulated_handle ()
#5 0x00000000005d0b41 in fil_aio_wait ()
#6 0x0000000000500c2c in io_handler_thread ()
#7 0x000000000067213b in pthread_start_thread ()
#8 0x00000000006db2ff in clone ()
#9 0x0000000000000000 in ?? ()
Thread 10 (Thread 26193):
#0 0x0000000000674b4c in __pthread_sigsuspend ()
#1 0x000000000067415d in __pthread_wait_for_restart_signal ()
#2 0x000000000067149c in pthread_cond_wait ()
#3 0x000000000060cc16 in os_event_wait ()
#4 0x00000000006100b0 in os_aio_simulated_handle ()
#5 0x00000000005d0b41 in fil_aio_wait ()
#6 0x0000000000500c2c in io_handler_thread ()
#7 0x000000000067213b in pthread_start_thread ()
#8 0x00000000006db2ff in clone ()
#9 0x0000000000000000 in ?? ()
Thread 9 (Thread 26194):
#0 0x0000000000674b4c in __pthread_sigsuspend ()
#1 0x000000000067415d in __pthread_wait_for_restart_signal ()
#2 0x000000000067149c in pthread_cond_wait ()
#3 0x000000000060cc16 in os_event_wait ()
#4 0x00000000006100b0 in os_aio_simulated_handle ()
#5 0x00000000005d0b41 in fil_aio_wait ()
#6 0x0000000000500c2c in io_handler_thread ()
#7 0x000000000067213b in pthread_start_thread ()
#8 0x00000000006db2ff in clone ()
#9 0x0000000000000000 in ?? ()
Thread 8 (Thread 26196):
#0 0x0000000000674b4c in __pthread_sigsuspend ()
#1 0x000000000067415d in __pthread_wait_for_restart_signal ()
#2 0x000000000067149c in pthread_cond_wait ()
#3 0x000000000060cc16 in os_event_wait ()
#4 0x00000000006100b0 in os_aio_simulated_handle ()
#5 0x00000000005d0b41 in fil_aio_wait ()
#6 0x0000000000500c2c in io_handler_thread ()
#7 0x000000000067213b in pthread_start_thread ()
#8 0x00000000006db2ff in clone ()
#9 0x0000000000000000 in ?? ()
Thread 7 (Thread 26197):
#0 0x00000000006d993a in select ()
#1 0x000000000060d9db in os_thread_sleep ()
#2 0x00000000004fe445 in srv_lock_timeout_and_monitor_thread ()
#3 0x000000000067213b in pthread_start_thread ()
#4 0x00000000006db2ff in clone ()
#5 0x0000000000000000 in ?? ()
Thread 6 (Thread 26199):
#0 0x00000000006d993a in select ()
#1 0x000000000060d9db in os_thread_sleep ()
#2 0x00000000004fea3d in srv_error_monitor_thread ()
#3 0x000000000067213b in pthread_start_thread ()
#4 0x00000000006db2ff in clone ()
#5 0x0000000000000000 in ?? ()
Thread 5 (Thread 26200):
#0 0x00000000006d993a in select ()
#1 0x000000000060d9db in os_thread_sleep ()
#2 0x00000000004ff863 in srv_master_thread ()
#3 0x000000000067213b in pthread_start_thread ()
#4 0x00000000006db2ff in clone ()
#5 0x0000000000000000 in ?? ()
Thread 4 (Thread 26201):
#0 0x00000000006b3329 in sigsuspend ()
#1 0x000000000067501a in __pthread_sigwait ()
#2 0x000000000045a89d in signal_hand ()
#3 0x000000000067213b in pthread_start_thread ()
#4 0x00000000006db2ff in clone ()
#5 0x0000000000000000 in ?? ()
Thread 3 (Thread 26202):
#0 0x0000000000677573 in read ()
#1 0x000000000064ea78 in vio_read ()
#2 0x00000000004505f3 in my_real_read(st_net*, unsigned long*) ()
#3 0x0000000000450c25 in my_net_read ()
#4 0x00000000004f7d2c in mc_net_safe_read(st_mysql*) ()
#5 0x00000000004f1f16 in handle_slave_io ()
#6 0x000000000067213b in pthread_start_thread ()
#7 0x00000000006db2ff in clone ()
#8 0x636f6c2f7273752f in ?? ()
#9 0x6c7173796d2f6c61 in ?? ()
#10 0x62696c2f36323034 in ?? ()
#11 0x73796d2f63657865 in ?? ()
#12 0x616572203a646c71 in ?? ()
#13 0x6320726f66207964 in ?? ()
#14 0x6f697463656e6e6f in ?? ()
#15 0x737265560a2e736e in ?? ()
#16 0x2e3427203a6e6f69 in ?? ()
#17 0x676f6c2d36322e30 in ?? ()
#18 0x656b636f73202027 in ?? ()
#19 0x7461642f27203a74 in ?? ()
#20 0x61706f7475612f61 in ?? ()
#21 0x2f316d62642d6567 in ?? ()
#22 0x6f732e6c7173796d in ?? ()
#23 0x726f702020276b63 in ?? ()
#24 0x2031303633203a74 in ?? ()
#25 0x20656372756f5320 in ?? ()
#26 0x7562697274736964 in ?? ()
#27 0x0000000a6e6f6974 in ?? ()
#28 0x0000000000000000 in ?? ()
Thread 2 (Thread 26203):
#0 0x0000000000674b4c in __pthread_sigsuspend ()
#1 0x000000000067415d in __pthread_wait_for_restart_signal ()
#2 0x000000000067149c in pthread_cond_wait ()
#3 0x000000000049cd66 in MYSQL_LOG::wait_for_update(THD*, bool) ()
#4 0x00000000004f3af1 in next_event(st_relay_log_info*) ()
#5 0x00000000004f3f63 in handle_slave_sql ()
#6 0x000000000067213b in pthread_start_thread ()
#7 0x00000000006db2ff in clone ()
Cannot access memory at address 0x2ac961a98000
Thread 1 (Thread 26263):
#0 0x00000000006b32b9 in kill ()
#1 0x0000000000674d23 in pthread_kill ()
#2 0x0000000000457724 in handle_segfault ()
#3 0x00000000006772f5 in __pthread_sighandler ()
#4 <signal handler called>
#5 0x0000000000671455 in pthread_cond_wait ()
#6 0x0000000000457473 in end_thread(THD*, bool) ()
#7 0x000000000046bac4 in handle_one_connection ()
#8 0x000000000067213b in pthread_start_thread ()
#9 0x00000000006db2ff in clone ()
Yeah, I think I can work with this.
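With this many threads in the output, picking out the one that took the signal by eye is tedious. Here is a minimal Python sketch that parses `thread apply all bt` text and reports which thread contains the `<signal handler called>` marker, plus the frames beneath it, which are what the thread was executing when the signal arrived. The regular expressions and helper names are my own assumptions about this output format, not part of gdb, and the sample is abbreviated from the backtrace above.

```python
import re

THREAD_RE = re.compile(r"^Thread (\d+) \(Thread (\d+)\):$")
FRAME_RE = re.compile(r"^#(\d+)\s+(?:0x[0-9a-fA-F]+ in )?(.*)$")
SIGNAL_MARKER = "<signal handler called>"

SAMPLE = """\
Thread 2 (Thread 26203):
#0 0x0000000000674b4c in __pthread_sigsuspend ()
#1 0x000000000067415d in __pthread_wait_for_restart_signal ()
#2 0x000000000067149c in pthread_cond_wait ()
#3 0x000000000049cd66 in MYSQL_LOG::wait_for_update(THD*, bool) ()
#4 0x00000000004f3af1 in next_event(st_relay_log_info*) ()
#5 0x00000000004f3f63 in handle_slave_sql ()
Cannot access memory at address 0x2ac961a98000
Thread 1 (Thread 26263):
#0 0x00000000006b32b9 in kill ()
#1 0x0000000000674d23 in pthread_kill ()
#2 0x0000000000457724 in handle_segfault ()
#3 0x00000000006772f5 in __pthread_sighandler ()
#4 <signal handler called>
#5 0x0000000000671455 in pthread_cond_wait ()
#6 0x0000000000457473 in end_thread(THD*, bool) ()
#7 0x000000000046bac4 in handle_one_connection ()
"""


def frame_name(rest):
    """Reduce 'end_thread(THD*, bool) ()' to 'end_thread'; keep '<...>' markers as they are."""
    if rest.startswith("<"):
        return rest
    return rest.split("(", 1)[0].strip()


def parse_threads(text):
    """Map gdb thread number -> list of frame names, innermost frame first."""
    threads = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        thread = THREAD_RE.match(line)
        if thread:
            current = int(thread.group(1))
            threads[current] = []
            continue
        frame = FRAME_RE.match(line)
        if frame and current is not None:
            threads[current].append(frame_name(frame.group(2)))
    return threads


def interrupted_frames(frames):
    """Frames below the signal-handler marker, i.e. the code that was interrupted."""
    if SIGNAL_MARKER not in frames:
        return []
    return frames[frames.index(SIGNAL_MARKER) + 1:]


if __name__ == "__main__":
    for number, frames in parse_threads(SAMPLE).items():
        if SIGNAL_MARKER in frames:
            print(f"Thread {number} took the signal in: {interrupted_frames(frames)}")
```

Lines that are neither a thread header nor a frame, such as "Cannot access memory at address ...", are simply skipped.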
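If you already have a hunch about what caused this failure from what you have seen so far, please head over to our DBA recruiting page!
(To be continued)
Episode 3 (Reproduction) is here:
⇒ Me and MySQL: The Three-Day War (Reproduction)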
gmomedia-engineer posted this