Linux Performance Tuning and Stabilization Tips

Yoshinori Matsunobu
Lead of MySQL Professional Services APAC, Sun Microsystems
[email protected]

Copyright 2010 Sun Microsystems inc

The World's Most Popular Open Source Database

Table of contents
Memory and swap space management
Synchronous I/O, filesystem, and I/O scheduler
Useful commands and tools
  iostat, mpstat, oprofile, SystemTap, gdb


Random Access Memory

The most important H/W component for an RDBMS
RAM access is much faster than HDD/SSD access
  RAM: ~60ns
    100,000 queries per second is not impossible
  HDD: ~5ms
  SSD: 100-500us
16-64GB of RAM is now pretty common
*Hot application data* should be cached in memory
Minimizing the size of hot application data is important
  Use compact data types (SMALLINT instead of VARCHAR/BIGINT, TIMESTAMP instead of DATETIME, etc.)
  Do not create unnecessary indexes
  Delete records or move them to archive tables, to keep hot tables smaller
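To make the latency gap concrete, here is a minimal sketch that turns the approximate latencies quoted on this slide into an upper bound on strictly serialized random accesses per second. The function name and the dictionary are illustrative only, and real devices overlap requests, so treat the numbers as orders of magnitude rather than benchmarks.

```python
# Approximate latencies from the slide, in nanoseconds.
LATENCY_NS = {
    "RAM": 60,                # ~60ns
    "SSD (fast)": 100_000,    # 100us
    "SSD (slow)": 500_000,    # 500us
    "HDD": 5_000_000,         # ~5ms
}


def accesses_per_second(latency_ns: int) -> int:
    """How many one-at-a-time accesses fit into one second."""
    return 1_000_000_000 // latency_ns


for name, ns in LATENCY_NS.items():
    print(f"{name:12s} {accesses_per_second(ns):>12,d} accesses/sec")
```

Under these assumptions a single HDD spindle tops out at a few hundred random accesses per second, which is why keeping hot data in RAM matters so much.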

Cache hot application data in memory

DBT-2 (W200)                              Transactions per Minute   %user   %iowait
Buffer pool 1G                                            1125.44      2%       30%
Buffer pool 2G                                            1863.19      3%       28%
Buffer pool 5G                                            4385.18    5.5%       33%
Buffer pool 30G (all data in cache)                      36784.76     36%        8%

DBT-2 benchmark (write intensive)
20-25GB hot data (200 warehouses, running 1 hour)
Nehalem 2.93GHz x 8 cores, MySQL 5.5.2, 4 RAID1+0 HDDs
RAM size affects everything. Not only SELECT, but also INSERT/UPDATE/DELETE
  INSERT: Random reads/writes happen when inserting into indexes in random order
  UPDATE/DELETE: Random reads/writes happen when modifying records

Use Direct I/O

[Figure: Buffered I/O (RAM holds both the InnoDB buffer pool and the filesystem cache) vs. Direct I/O (RAM holds the InnoDB buffer pool only), each in front of the InnoDB data file]

Direct I/O is important to fully utilize memory
innodb_flush_method=O_DIRECT
Alignment: the file I/O unit must be a multiple of 512 bytes
O_DIRECT can't be used for the InnoDB log file, binary log file, MyISAM, PostgreSQL data files, etc.

Do not allocate too much memory

user$ top
Mem:  32967008k total, 32808696k used,   158312k free,    10240k buffers
Swap: 35650896k total,  4749460k used, 30901436k free,   819840k cached
  PID USER   PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 5231 mysql  25   0 35.0g  30g  324 S  0.0 71.8  7:46.50  mysqld

What happens if no free memory space is available?
  The filesystem cache is reduced to free up memory
  Process(es) are swapped out to free up memory
Swap is bad
  Process spaces are written to disk (swap out)
  Disk reads happen when accessing on-disk process spaces (swap in)
  Massive random disk reads and writes will happen

What if swap size is set to zero?

With swap size set to zero, swapping doesn't happen anymore. But...
Very dangerous
When neither RAM nor swap space is available, the OOM killer is invoked. The OOM killer may kill any process to free up memory
The most memory-consuming process (mysqld) will be killed first
  It's an abort shutdown. Crash recovery takes place on restart
  Priority is determined by ORDER BY /proc/<PID>/oom_score DESC
  Normally mysqld has the highest score
    Depending on VM size, CPU time, running time, etc.
It often takes a very long time (minutes to hours) for the OOM killer to kill processes
  We can't do anything until enough memory space is available
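The "ORDER BY /proc/<PID>/oom_score DESC" rule can be checked on a live box with a few lines of Python. This is a minimal sketch: the function names are made up for illustration, and it only assumes the standard Linux /proc/<PID>/oom_score files mentioned on the slide.

```python
import os


def read_oom_scores(proc_root="/proc"):
    """Return {pid: oom_score} for every process we are allowed to read."""
    scores = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "oom_score")) as f:
                scores[int(entry)] = int(f.read().strip())
        except (OSError, ValueError):
            continue  # process exited or file unreadable
    return scores


def kill_order(scores):
    """PIDs in the order the slide describes: highest oom_score first."""
    return sorted(scores, key=lambda pid: scores[pid], reverse=True)
```

Calling `kill_order(read_oom_scores())[:5]` on a database server would list the five most likely victims; on a dedicated MySQL host one would expect mysqld at the top.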

Do not set swap=zero

top - 01:01:29 up 5:53, 3 users, load average: 0.66, 0.17, 0.06
Tasks: 170 total, 3 running, 167 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 24.9%sy, 0.0%ni, 75.0%id, 0.2%wa, 0.0%hi, 0.0%si, 0.0%st
Mem:  32967008k total, 32815800k used,   151208k free,     8448k buffers
Swap:        0k total,        0k used,        0k free,   376880k cached
  PID USER   PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
26988 mysql  25   0   30g  30g 1452 R 98.5 97.7  0:42.18  mysqld

If no memory space is available, the OOM killer will be invoked
Some CPU cores consume 100% system resources
  24.9% (average) = 1 of 4 cores using 100% CPU in this case
  The terminal froze (SSH connections can't be established)

Swap is bad, but OOM killer is much worse than swap

What if stopping OOM Killer?

If /proc/<PID>/oom_adj is set to -17, the OOM killer won't kill the process
Setting -17 for sshd is a good practice so that we can continue to log in remotely
  # echo -17 > /proc/<pid of sshd>/oom_adj
But don't set -17 for mysqld
  If the over-memory-consuming process is not killed, Linux can't get any available memory space
  We can't do anything for a long, long time -> long downtime

Swap space management

Swap space is needed to stabilize systems
  But we don't want mysqld swapped out
What consumes memory?
  RDBMS
    Mainly process space is used (innodb_buffer_pool, key_buffer, sort_buffer, etc.)
    Sometimes the filesystem cache is used (MyISAM files, etc.)
  Administration (backup, etc.)
    Mainly the filesystem cache is used
We want to keep mysqld in RAM, rather than allocating a large filesystem cache

Be careful about backup operations

Mem:  32967008k total, 28947472k used,  4019536k free,   152520k buffers
Swap: 35650896k total,        0k used, 35650896k free,   197824k cached
  PID USER   PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 5231 mysql  25   0 27.0g  27g  288 S  0.0 92.6  7:40.88  mysqld

Copying 8GB datafile

Mem:  32967008k total, 32808696k used,   158312k free,    10240k buffers
Swap: 35650896k total,  4749460k used, 30901436k free,  8819840k cached
  PID USER   PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 5231 mysql  25   0 27.0g  22g  324 S  0.0 71.8  7:46.50  mysqld

Copying large files often causes swap

vm.swappiness = 0
Mem:  32967008k total, 28947472k used,  4019536k free,   152520k buffers
Swap: 35650896k total,        0k used, 35650896k free,   197824k cached
  PID USER   PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 5231 mysql  25   0 27.0g  27g  288 S  0.0 91.3  7:55.88  mysqld

Copying 8GB of datafile

Mem:  32967008k total, 32783668k used,   183340k free,     3940k buffers
Swap: 35650896k total,      216k used, 35650680k free,  4117432k cached
  PID USER   PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 5231 mysql  25   0 27.0g  27g  288 S  0.0 80.6  8:01.44  mysqld

Set vm.swappiness=0 in /etc/sysctl.conf
  Default is 60
When physical RAM is fully consumed, the Linux kernel reduces the filesystem cache with high priority (lower swappiness increases that priority)
After no filesystem cache is left, swapping starts
  The OOM killer won't be invoked if large enough swap space is allocated. It's safer
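A configuration audit can verify that the swappiness setting actually made it into /etc/sysctl.conf. The sketch below parses sysctl.conf-style text; the helper names and the sample text are illustrative, not part of any MySQL or Linux tooling.

```python
def parse_sysctl(text):
    """Parse sysctl.conf-style text into a dict; later lines win."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments (# or ;) and lines without an assignment
        if not line or line.startswith(("#", ";")) or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def swappiness_is_zero(text):
    """True only if vm.swappiness is explicitly set to 0."""
    return parse_sysctl(text).get("vm.swappiness") == "0"


sample = """
# keep mysqld in RAM; shrink the filesystem cache first
vm.swappiness = 0
"""
print(swappiness_is_zero(sample))
```

In practice one would pass the contents of /etc/sysctl.conf and also compare against the running value in /proc/sys/vm/swappiness, since the file only takes effect after `sysctl -p` or a reboot.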

Memory allocator
mysqld uses malloc()/mmap() for memory allocation
A faster and more concurrent memory allocator such as tcmalloc can be used
  Install Google Perftools (tcmalloc is included)
    # yum install libunwind
    # cd google-perftools-1.5 ; ./configure --enable-frame-pointers; make; make install
  export LD_PRELOAD=/usr/local/lib/tcmalloc_minimal.so; mysqld_safe &
InnoDB internally uses its own memory allocator
  Can be changed in the InnoDB Plugin
    If innodb_use_sys_malloc = 1 (default 1), InnoDB uses the OS memory allocator
    tcmalloc can be used by setting LD_PRELOAD

Memory allocator would matter for CPU bound workloads

                   Default allocator   tcmalloc_minimal   %user   up
Buffer pool 1G               1125.44            1131.04      2%   +0.50%
Buffer pool 2G               1863.19            1881.47      3%   +0.98%
Buffer pool 5G               4385.18            4460.50    5.5%   +1.2%
Buffer pool 30G             36784.76           38400.30     36%   +4.4%

DBT-2 benchmark (write intensive)
Nehalem 2.93GHz x 8 cores, MySQL 5.5.2
20-25GB hot data (200 warehouses, running 1 hour)

Be careful about per-session memory

Do not allocate much more memory than needed (especially for per-session memory)
Allocating 2MB takes much longer than allocating 128KB
  Linux malloc() internally calls brk() if size <= 512KB, and mmap() otherwise
In some cases, too large a per-session memory allocation hurts performance
  SELECT * FROM huge_myisam_table LIMIT 1;
  SET read_buffer_size = 256*1024; (256KB)
    -> 0.68 seconds to run 10,000 times
  SET read_buffer_size = 2048*1024; (2MB)
    -> 18.81 seconds to run 10,000 times
In many cases MySQL does not allocate more per-session memory than needed. But be careful about some extreme cases (like the one above: MyISAM + LIMIT + full scan)

Table of contents
Memory and swap space management
Synchronous I/O, filesystem, and I/O scheduler
Useful commands and tools
  iostat, mpstat, oprofile, SystemTap, gdb

File I/O and synchronous writes

An RDBMS calls fsync() many times (per transaction commit, checkpoints, etc.)
Make sure to use a Battery-Backed Write Cache (BBWC) on RAID cards
  10,000+ fsync() per second with BBWC; without BBWC, fewer than 200 on HDD
  Disable the write cache on the disks themselves for safety reasons
Do not enable write barriers on filesystems (enabled by default in some cases)
  With barriers, writes go through to the disks even though BBWC is enabled (very slow)
  ext3: mount -o barrier=0
  xfs: mount -o nobarrier
  drbd: no-disk-barrier in drbd.conf

[Figure: write path to disk (seek & rotation time) with and without a battery-backed write cache]

Overwriting or Appending?
Some files are overwritten (fixed file size), others are appended (increasing file size)
  Overwritten: InnoDB log file
  Appended: Binary log file
Appending + fsync() is much slower than overwriting + fsync()
  Additional file space needs to be allocated and file metadata needs to be flushed per fsync()
10,000+ fsync/sec for overwriting, 3,000 or less fsync/sec for appending
  Appending speed highly depends on the filesystem
  Copy-on-write filesystems such as Solaris ZFS are fast enough for appending (7,000+)
Be careful when using sync-binlog=1 for binary logs
  Consider using ZFS
  Check the "preallocating binlog" worklog: WL#4925
Do not extend files too frequently
  innodb-autoextend-increment = 20 (default 8)
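The overwrite-versus-append gap can be measured directly. The following is a rough sketch, not the benchmark used for the numbers above: the function names, write size and iteration count are arbitrary choices, and results are only meaningful on the real database volume (not on tmpfs, and not when a volatile disk cache absorbs the fsync).

```python
import os
import tempfile
import time


def fsync_rate(path, mode, n=200, size=512):
    """Return fsync() calls per second for 'overwrite' or 'append' writes."""
    buf = b"x" * size
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        if mode == "overwrite":
            # Preallocate so every timed write stays inside the existing file
            os.write(fd, b"\0" * (size * n))
            os.fsync(fd)
        start = time.perf_counter()
        for i in range(n):
            if mode == "overwrite":
                os.lseek(fd, i * size, os.SEEK_SET)
            else:
                os.lseek(fd, 0, os.SEEK_END)  # grow the file on every write
            os.write(fd, buf)
            os.fsync(fd)
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return n / elapsed if elapsed > 0 else float("inf")


def compare(directory=None, n=200):
    """Run both modes in a temporary directory under `directory`."""
    results = {}
    with tempfile.TemporaryDirectory(dir=directory) as d:
        for mode in ("overwrite", "append"):
            results[mode] = fsync_rate(os.path.join(d, mode + ".dat"), mode, n)
    return results
```

For example, `compare("/var/lib/mysql", n=2000)` (path assumed) would report both rates for the filesystem that holds the binary logs.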

Quick file i/o health check

Check that BBWC is enabled and write barriers are disabled
Overwriting + fsync() test
  Run a mysqlslap insert test (InnoDB, single-threaded, innodb_flush_log_at_trx_commit=1) and check that qps is over 1,000

$ mysqlslap --concurrency=1 --iterations=1 --engine=innodb \
    --auto-generate-sql --auto-generate-sql-load-type=write \
    --number-of-queries=100000

Buffered and asynchronous writes

Some file I/O operations are neither direct I/O nor synchronous
  File copy, MyISAM, mysqldump, innodb_flush_log_at_trx_commit=2, etc.
Dirty pages in the filesystem cache need to be flushed to disk in the end
  pdflush takes care of it, with a maximum of 8 threads
When? -> Highly dependent on vm.dirty_background_ratio and vm.dirty_ratio
  Flushing dirty pages starts in the background after reaching dirty_background_ratio * RAM (default is 10%; 10% of 64GB is 6.4GB)
  Forced flushing starts after reaching dirty_ratio * RAM (default is 40%)
Forced, bursty dirty page flushing is problematic
  All buffered write operations become synchronous, which hugely increases latency
Flush dirty pages aggressively
  Execute sync; while doing massive write operations
  Reduce vm.dirty_background_ratio
  Upgrade to 2.6.32 or higher
    pdflush threads are allocated per device. Flushing to slow devices won't block other pdflush threads
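The two thresholds are simple percentages, so it is easy to see how much dirty data can pile up before flushing starts. A minimal sketch using the slide's simplification (ratio times total RAM); the function name is illustrative, and the kernel's own calculation is based on reclaimable memory rather than total RAM, so real thresholds are somewhat lower.

```python
GB = 1024 ** 3


def dirty_thresholds(ram_bytes, background_ratio=10, dirty_ratio=40):
    """Return (background_flush_start, forced_flush_start) in bytes."""
    background = ram_bytes * background_ratio // 100
    forced = ram_bytes * dirty_ratio // 100
    return background, forced


background, forced = dirty_thresholds(64 * GB)
print(f"background flush starts at {background / GB:.1f} GB")
print(f"forced flush starts at     {forced / GB:.1f} GB")
```

With the defaults on a 64GB box this gives 6.4GB and 25.6GB, which illustrates why lowering vm.dirty_background_ratio makes flushing start earlier and in smaller bursts.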

Filesystem ext3

By far the most widely used filesystem
  But not always the best
Deleting large files takes a long time
  Internally it has to do a lot of random disk I/O (slow on HDD)
  In MySQL, if DROP TABLE takes a long time, all client threads are blocked from opening/closing tables (by the LOCK_open mutex)
  Be careful when using MyISAM, InnoDB with innodb_file_per_table, PBXT, etc.
Writing to a file is serialized
  Serialized by i-mutex, allocated per i-node
  Sometimes it is faster to allocate many files instead of a single huge file
  Less optimized for faster storage (like PCI-Express SSD)
Use dir_index to speed up searching for files
Use barrier=0 to disable write-through

Filesystem xfs/ext2
xfs
  Fast for dropping files
  Concurrent writes to a file are possible when using O_DIRECT
  Not officially supported in current RHEL (supported in SuSE)
  Disable write barriers by setting "nobarrier"
ext2
  Faster for writes because ext2 doesn't support journaling
  fsck takes a very long time
  In active-active redundancy environments (i.e. MySQL Replication), ext2 is sometimes used to gain performance
Btrfs (under development)
  Copy-on-write filesystem
  Supports transactions (no half-block updates)
  Snapshot backup with no overhead

Concurrent write matters on fast storage

[Figure: Random write IOPS (16KB blocks), 1 I/O thread vs. 100 I/O threads, for HDD (ext3), HDD (xfs), Intel (ext3), Intel (xfs), Fusion (ext3) and Fusion (xfs); Y axis 0-20000]

Negligible on HDD (4 SAS RAID1)
1.8 times difference on Fusion I/O

I/O scheduler
Note: an RDBMS (especially InnoDB) also schedules I/O requests, so theoretically the Linux I/O scheduler is not needed
Linux has I/O schedulers to efficiently handle lots of I/O requests
  I/O scheduler type and queue size matter
Types of I/O schedulers (introduced in 2.6.10: RHEL5)
  noop: Sorts incoming I/O requests by logical block address, that's all
  deadline: Prioritizes read (sync) requests over write (async) requests to some extent (to avoid the "write-starving-reads" problem)
  cfq (default): Schedules I/O requests fairly per I/O thread
  anticipatory: Removed in 2.6.33 (bad scheduler. Don't use it)
Default is cfq, but noop / deadline is better in many cases
  # echo noop > /sys/block/sdX/queue/scheduler
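Reading the same sysfs file back shows which scheduler is active: the kernel lists all available schedulers and puts the current one in square brackets. A small sketch for checking this from a monitoring script; the function names are illustrative and the device name passed to `read_scheduler` is whatever block device holds the data files.

```python
import re


def active_scheduler(text):
    """Return the scheduler shown in [brackets], or None if none is marked."""
    match = re.search(r"\[([^\]]+)\]", text)
    return match.group(1) if match else None


def available_schedulers(text):
    """Return every scheduler name listed, active or not."""
    return [name.strip("[]") for name in text.split()]


def read_scheduler(device):
    """Read the active scheduler for a block device such as 'sdb'."""
    with open(f"/sys/block/{device}/queue/scheduler") as f:
        return active_scheduler(f.read())


print(active_scheduler("noop anticipatory deadline [cfq]"))
```

Note that the `echo` above does not survive a reboot, so a check like this is useful to confirm that the setting was reapplied at boot time.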

cfq madness
Running two benchmark programs concurrently
  1. Multi-threaded random disk reads (simulating RDBMS reads)
  2. Single-threaded overwriting + fsync() (simulating redo log writes)

Random read    write+fsync()   Scheduler        reads/sec     writes/sec
I/O threads    running                          from iostat   from iostat
1              No              noop/deadline            260              0
1              No              cfq                      260              0
100            No              noop/deadline           2100              0
100            No              cfq                     2100              0
1              Yes             noop/deadline            212          14480
1              Yes             cfq                      248            246
100            Yes             noop/deadline           1915          12084
100            Yes             cfq                     2084              0

In an RDBMS, write IOPS is often very high because HDD + write cache can handle thousands of transaction commits per second (write+fsync)
With cfq, write IOPS was adjusted down to the per-thread read IOPS, which reduced total IOPS significantly
Verified on RHEL5.3 and SuSE 11, Sun Fire X4150, 4 HDD H/W RAID1+0

Changing I/O scheduler (InnoDB)

[Figure: DBT-2 (MySQL 5.1) throughput in NOTPM (Y axis 0-15000) for the noop, cfq, deadline and as schedulers, on RAID1+0 and RAID5]

- Sun Fire X4150 (4 HDDs, H/W RAID controller + BBWC)
- RHEL5.3 (2.6.18-128)
- Built-in InnoDB 5.1

Changing I/O scheduler queue size (MyISAM)

[Figure: Time to insert 1 million records (HDD). Y axis: seconds (0-5000); X axis: existing records (millions, 1-77); series: queue size=100000 and queue size=128 (default)]

Queue size = N
  Sorts N outstanding I/O requests to optimize disk seeks
MyISAM does not optimize I/O requests internally
  Highly dependent on OS and storage
  When inserting into indexes, massive random disk writes/reads happen
Increasing the I/O queue size reduces disk seek overhead
  # echo 100000 > /sys/block/sdX/queue/nr_requests
No impact on InnoDB
  Many RDBMSs, including InnoDB, internally sort I/O requests

Useful commands and tools

iostat
mpstat
oprofile
SystemTap (stap)
gdb

iostat
Detailed I/O statistics per device
A very important tool because in most cases an RDBMS becomes I/O bound
iostat -x
Check r/s, w/s, svctm, %util
  IOPS is much more important than transfer size
%util always equals (r/s + w/s) * svctm (hard-coded in the iostat source file)

# iostat -xm 10
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          21.16    0.00    6.14   29.77    0.00   42.93
Device:  rrqm/s  wrqm/s     r/s     w/s   rMB/s   wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sdb        2.60  389.01  283.12   47.35    4.86    2.19    43.67     4.89   14.76   3.02  99.83

(283.12+47.35) * 3.02(ms)/1000 = 0.9980 = 100% util
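The relationship between r/s, w/s, svctm and %util is plain arithmetic and can be reproduced in a couple of lines. A minimal sketch (the function name is illustrative) using the figures from the iostat sample:

```python
def util(r_per_s, w_per_s, svctm_ms):
    """%util as a fraction: (r/s + w/s) * svctm, with svctm given in ms."""
    return (r_per_s + w_per_s) * svctm_ms / 1000.0


# Values taken from the sdb line of the iostat sample
print(f"{util(283.12, 47.35, 3.02):.4f}")
```

Because %util is derived this way, it says how busy the device looked, not how many more requests it could absorb.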


iostat example (DBT-2)

(283.12+47.35) * 3.02(ms)/1000 = 0.9980 = 100% util

# iostat -xm 10
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          40.03    0.00   16.51   16.52    0.00   26.94
Device:  rrqm/s  wrqm/s     r/s     w/s   rMB/s   wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sdb        6.39  368.53  543.06  490.41    6.71    3.90    21.02     3.29    3.20   0.90  92.66

(543.06+490.41) * 0.90(ms)/1000 = 0.9301 = 93% util

Sometimes throughput gets higher even though %util reaches 100%
  Write cache, command queuing, etc.
In both cases %util is almost 100%, but r/s and w/s are far different
Do not trust %util too much
Check svctm rather than %util
  If your storage can handle 1000 IOPS, svctm should be less than 1.00 (ms), so you can send alerts if svctm is higher than 1.00 for a couple of minutes
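The alerting rule on svctm can be sketched as follows. The function names, the sample values and the "12 consecutive samples" window (two minutes at a 10-second iostat interval) are assumptions for illustration; the only rule taken from the slide is that the svctm budget in milliseconds is 1000 divided by the IOPS the storage should sustain.

```python
def svctm_budget_ms(target_iops):
    """Largest svctm (ms) that still allows target_iops requests per second."""
    return 1000.0 / target_iops


def should_alert(svctm_samples_ms, target_iops, min_consecutive):
    """True if svctm exceeded the budget for min_consecutive samples in a row."""
    budget = svctm_budget_ms(target_iops)
    run = 0
    for sample in svctm_samples_ms:
        run = run + 1 if sample > budget else 0  # reset on any good sample
        if run >= min_consecutive:
            return True
    return False


# Hypothetical series: one healthy sample, then two minutes of slow service
samples = [0.90] + [3.02] * 12
print(should_alert(samples, target_iops=1000, min_consecutive=12))
```

Requiring a run of consecutive samples avoids paging on a single slow interval, which matches the "for a couple of minutes" wording above.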

mpstat
Per-CPU-core statistics
vmstat displays average statistics
It's very common that only one of the CPU cores consumes 100% of its CPU resources
  The rest of the CPU cores are idle
  Especially applies to batch jobs
If you check only vmstat/top/iostat/sar, you will not notice a single-threaded bottleneck
You can also check network bottlenecks (%irq, %soft) with mpstat
  vmstat counts them as %idle

vmstat and mpstat

# vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b    swpd    free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  1 2096472 1645132  18648  19292    0    0  4848     0 1223  517  0  0 88 12  0
 0  1 2096472 1645132  18648  19292    0    0  4176     0 1287  623  0  0 87 12  0
 0  1 2096472 1645132  18648  19292    0    0  4320     0 1202  470  0  0 88 12  0
 0  1 2096472 1645132  18648  19292    0    0  3872     0 1289  627  0  0 87 12  0

# mpstat -P ALL 1
...
11:04:37 AM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
11:04:38 AM  all    0.00    0.00    0.12   12.33    0.00    0.00    0.00   87.55   1201.98
11:04:38 AM    0    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00    990.10
11:04:38 AM    1    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
11:04:38 AM    2    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
11:04:38 AM    3    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
11:04:38 AM    4    0.99    0.00    0.99   98.02    0.00    0.00    0.00    0.00    206.93
11:04:38 AM    5    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
11:04:38 AM    6    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      4.95
11:04:38 AM    7    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00

vmstat displays average statistics. 12% * 8 (average) = 100% * 1 + 0% * 7
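The averaging effect is easy to reproduce. In this sketch "busy" is taken as 100 minus %idle for each core in the mpstat sample, so CPU 4 is fully busy and the other seven are idle; the function names and the 95% threshold are illustrative choices.

```python
def average_busy(per_core_busy):
    """What an averaged view such as vmstat would report."""
    return sum(per_core_busy) / len(per_core_busy)


def saturated_cores(per_core_busy, threshold=95.0):
    """Indexes of cores that are effectively pegged."""
    return [cpu for cpu, busy in enumerate(per_core_busy) if busy >= threshold]


# busy = 100 - %idle per core, from the mpstat sample (CPU 0 to CPU 7)
busy = [0.0, 0.0, 0.0, 0.0, 100.0, 0.0, 0.0, 0.0]
print(average_busy(busy), saturated_cores(busy))
```

An average of about 12% looks harmless, yet one core has no headroom left, which is exactly the single-threaded bottleneck that the per-core view exposes.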


Oprofile
Profiles CPU usage of running processes
You can easily identify which functions consume CPU resources
Supports both user-space and system-space profiling
Mainly used by database-internals developers
If specific functions consume most of the resources, applications might be redesigned to skip calling them
Not useful for checking low-CPU activities
  I/O bound, mutex waits, etc.
How to use
  opcontrol --start (--no-vmlinux)
  benchmarking
  opcontrol --dump
  opcontrol --shutdown
  opreport -l /usr/local/bin/mysqld

Oprofile example

# opreport -l /usr/local/bin/mysqld
samples  %        symbol name
83003    8.8858   String::copy(char const*, unsigned int, charset_info_st*, charset_info_st*, unsigned int*)
79125    8.4706   MYSQLparse(void*)
68253    7.3067   my_wc_mb_latin1
55410    5.9318   my_pthread_fastmutex_lock
34677    3.7123   my_utf8_uni
18359    1.9654   MYSQLlex(void*, void*)
12044    1.2894   _ZL15get_hash_symbolPKcjb
11425    1.2231   _ZL20make_join_statisticsP4JOINP10TABLE_LISTP4ItemP16st_dynamic_array

You can see that quite a lot of CPU resources were spent on character conversions (latin1 <-> utf8)
Disabling character code conversions on the application side will improve performance (20% in this case)

samples  %        symbol name
83107    10.6202  MYSQLparse(void*)
68680    8.7765   my_pthread_fastmutex_lock
20469    2.6157   MYSQLlex(void*, void*)
13083    1.6719   _ZL15get_hash_symbolPKcjb
12148    1.5524   JOIN::optimize()
11529    1.4733   _ZL20make_join_statisticsP4JOINP10TABLE_LISTP4ItemP16st_dynamic_array

SystemTap

SystemTap provides a simple command line interface and scripting language for writing instrumentation for a live running kernel (and applications)
Similar to DTrace
A SystemTap script runs as a Linux kernel module
No need to rebuild applications to profile them
  kernel-header/devel and kernel-debuginfo packages are needed
Supported in RHEL 5 by default (5.4 is more stable than older versions, but be careful about bug reports)
User-level functions can be profiled if the target program has DWARF debugging symbols
  The official MySQL binary has DWARF symbols, so you do not need to rebuild mysqld
  Add -g if you build MySQL yourself
You can write custom C code inside a SystemTap script, but it's limited
  This is called "guru mode"
  Easily causes kernel panic. Be extremely careful
  Since it is a kernel module, user-side libraries cannot be used

SystemTap use-case 1 : Per-file i/o statistics

# filestat 10
2010-04-05 11:18:55
iotime     r/s    w/s   rBytes/s  wBytes/s  file
6.12s      182.4  1.6   2.85M     30.4K     /hdd/data/dbt2/stock.ibd
2.36s      64.1   0.7   1.00M     11.2K     /hdd/data/dbt2/customer.ibd
1.05s      25.7   6.1   411.2K    100.8K    /hdd/data/dbt2/orders.ibd
826.6ms    15.1   0.5   241.6K    9.6K      /hdd/data/dbt2/order_line.ibd
645.1ms    16.6   2.1   265.6K    107.2K    /hdd/data/dbt2/new_order.ibd

2010-04-05 11:19:05
iotime     r/s    w/s   rBytes/s  wBytes/s  file
4.76s      173.9  0.6   2.72M     12.8K     /hdd/data/dbt2/stock.ibd
2.22s      68.1   2.3   1.06M     36.8K     /hdd/data/dbt2/customer.ibd
1.23s      18.4   1.2   294.4K    25.6K     /hdd/data/dbt2/order_line.ibd
919.6ms    14.3   11.0  228.8K    521.6K    /hdd/data/dbt2/new_order.ibd

iostat provides per-device I/O statistics, iotop provides per-process I/O statistics
  Not enough for mysqld

Sample Code

global readstats, fdlist

probe syscall.read, syscall.pread {
  if (execname() == "mysqld") {
    readstats[pid(), fd] <<< count
    fdlist[pid(), fd] = fd
  }
}
probe timer.s($1) {
  foreach ([pid+, fd] in fdlist) {
    reads = @count(readstats[pid, fd])
    rbytes = @sum(readstats[pid, fd])
    printf("%d %d %d\n", fd, reads, rbytes)
    ...

#!/bin/sh
stap filestat.stap 10 | perl sum.pl

Programming within SystemTap is possible, but difficult
  Most utility libraries cannot be used
  Limited to 1000 statements per probe
Typical coding style:
  Print raw statistical information (i.e. file descriptor, iotime, reads, writes, bytes-read, bytes-written, etc.) to STDOUT
  Pipe to a Perl script (or python/ruby/etc.)
  Filtering/grouping/sorting/decorating etc. in Perl
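The post-processing half of this pipeline is not shown in the slides, so here is a hedged stand-in written in Python rather than Perl. It assumes the SystemTap script prints one "fd reads rbytes" line per file descriptor, as the printf in the sample does; the function name and the sort order are illustrative choices, not a port of sum.pl.

```python
from collections import defaultdict


def aggregate(lines):
    """Sum reads and bytes per fd from 'fd reads rbytes' lines."""
    totals = defaultdict(lambda: [0, 0])
    for line in lines:
        parts = line.split()
        if len(parts) != 3 or not all(p.isdigit() for p in parts):
            continue  # skip anything that is not a data line
        fd, reads, rbytes = map(int, parts)
        totals[fd][0] += reads
        totals[fd][1] += rbytes
    # Busiest descriptors (by bytes read) first
    return sorted(
        ((fd, r, b) for fd, (r, b) in totals.items()),
        key=lambda row: row[2],
        reverse=True,
    )
```

Fed from a pipe, `aggregate(sys.stdin)` would play the role of the grouping and sorting step; mapping each fd back to a file name (for example through /proc/<PID>/fd) is left out of this sketch.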

SystemTap use-case 2 : Userspace profiling

mysql> EXPLAIN SELECT user_id, post_date, title
    -> FROM diary ORDER BY rating DESC limit 100\G
*********
select_type: SIMPLE
      table: diary
       type: ALL
        key: NULL
       rows: 1163
      Extra: Using filesort

mysql> SELECT user_id, post_date, title
    -> FROM diary ORDER BY rating DESC limit 100;
100 rows in set (0.73 sec)

[root #] stap sort.stp
# of returned rows sorted by old algorithm: 0
# of returned rows sorted by new algorithm: 100

Background: MySQL Sorting Algorithm

MySQL has two sorting algorithms (old algorithm / new algorithm)
It chooses one of the two depending on column length, data types, etc.
Currently there is no MySQL status variable to check which algorithm is used
Sometimes the performance difference is huge (especially when used with LIMIT)
Inside MySQL, rr_from_pointers() is called by the old algorithm, rr_unpack_from_buffer() by the new algorithm

Old algorithm
  1) Load into sort buffer     2) Sort
     rating  RowID                rating  RowID
     4.71    1                    4.71    1
     3.32    2                    4.50    4
     4.10    3                    4.10    3
     4.50    4                    3.32    2
  3) Fetch the rest of the columns
     user_id  post_date   rating  title
     100      2010-03-29  4.71    UEFA CL: Inter vs Chelsea
     2        2010-03-30  3.32    Denmark vs Japan, 3-0
     3        2010-03-31  4.10    MySQL Administration
     10       2010-04-01  4.50    Linux tuning

New algorithm
  1) Load all columns into sort buffer
     user_id  post_date   rating  title
     100      2009-03-29  4.71    UEFA CL: Inter vs Chelsea
     2        2009-03-30  3.32    Denmark vs Japan, 3-0
     3        2009-03-31  4.10    MySQL Administration
     10       2009-04-01  4.50    Linux tuning
  2) Sort

SystemTap Script 2
global oldsort=0;
global newsort=0;
probe process("/usr/local/bin/mysqld").function("*rr_from_pointers*").return
{
  oldsort++;
}
probe process("/usr/local/bin/mysqld").function("*rr_unpack_from_buffer*").return
{
  newsort++;
}
probe end
{
  printf("# of returned rows sorted by old algorithm: %d \n", oldsort);
  printf("# of returned rows sorted by new algorithm: %d \n", newsort);
}
----
[root #] stap sort.stp
# of returned rows sorted by old algorithm: 0
# of returned rows sorted by new algorithm: 100

gdb
Debugging tool
gdb can take thread stack dumps from a running process (similar to Solaris truss)
Useful for identifying where and why mysqld hangs up, slows down, etc.
  But you have to read the MySQL source code
Debugging symbols are required in the target program

gdb case study

Suddenly all queries stopped responding for 1-10 seconds
Checking the slow query log
All queries are simple enough; it's strange for them to take 10 seconds
CPU utilization (%us, %sy) was almost zero
SHOW GLOBAL STATUS and SHOW FULL PROCESSLIST were not helpful

mysql> SELECT query_time, start_time, sql_text
    -> FROM mysql.slow_log WHERE start_time
    -> BETWEEN '2010-02-05 23:00:00' AND '2010-02-05 01:00:00'
    -> ORDER BY query_time DESC LIMIT 10;
+------------+---------------------+----------+
| query_time | start_time          | sql_text |
+------------+---------------------+----------+
| 00:00:11   | 2010-02-05 23:09:55 | begin    |
| 00:00:09   | 2010-02-05 23:09:55 | Prepare  |
| 00:00:08   | 2010-02-05 23:09:55 | Prepare  |
| 00:00:08   | 2010-02-05 23:09:55 | Init DB  |
| 00:00:08   | 2010-02-05 23:09:55 | Init DB  |
| 00:00:07   | 2010-02-05 23:09:55 | Prepare  |
| 00:00:07   | 2010-02-05 23:09:55 | Init DB  |
| 00:00:07   | 2010-02-05 23:09:55 | Init DB  |
| 00:00:07   | 2010-02-05 23:09:55 | Init DB  |
| 00:00:06   | 2010-02-05 23:09:55 | Prepare  |
+------------+---------------------+----------+
10 rows in set (0.02 sec)

Taking thread dumps with gdb

gdbtrace() {
  PID=`cat /var/lib/mysql/mysql.pid`
  STACKDUMP=/tmp/stackdump.$$
  echo 'thread apply all bt' > $STACKDUMP
  echo 'detach' >> $STACKDUMP
  echo 'quit' >> $STACKDUMP
  gdb --batch --pid=$PID -x $STACKDUMP
}

while true
do
  CONN=`netstat -an | grep 3306 | grep ESTABLISHED | wc | awk '{print $1}'`
  if [ $CONN -gt 100 ]; then
    gdbtrace
  fi
  sleep 3
done

Attach to the running mysqld, then take a thread dump
Take dumps every 3 seconds
Attaching and dumping with gdb is expensive, so invoke it only when an exceptional scenario (i.e. conn > threshold) happens
Check whether the same LWPs are waiting at the same place
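Checking "whether the same LWPs are waiting at the same place" across many dumps is tedious by hand. The sketch below is one possible helper, not part of the original toolchain: it assumes the usual `thread apply all bt` text layout with one frame per line, and the list of lock-related prefixes to skip is a guess tuned to the traces in this talk.

```python
import re

# "Thread 73 (Thread 0x46c1d950 (LWP 28494)):" -> captures the LWP id
THREAD_RE = re.compile(r"^Thread \d+ \(.*\(LWP (\d+)\)")
# "#3  0x0000000000a0f67d in func_name (args...)" -> captures func_name
FRAME_RE = re.compile(r"^#\d+\s+(?:0x[0-9a-f]+ in )?([^\s(]+) \(")
SKIP_PREFIXES = ("__lll_", "_L_lock", "pthread_mutex", "my_pthread_")


def top_frames(dump_text):
    """Map LWP -> first frame that is not a low-level locking function."""
    result, lwp = {}, None
    for raw in dump_text.splitlines():
        line = raw.strip()
        thread = THREAD_RE.match(line)
        if thread:
            lwp = int(thread.group(1))
            continue
        frame = FRAME_RE.match(line)
        if frame and lwp is not None and lwp not in result:
            func = frame.group(1)
            if not func.startswith(SKIP_PREFIXES):
                result[lwp] = func
    return result


def stuck_lwps(dump_a, dump_b):
    """LWPs whose interesting top frame is identical in two dumps."""
    a, b = top_frames(dump_a), top_frames(dump_b)
    return {lwp: func for lwp, func in a.items() if b.get(lwp) == func}
```

Running `stuck_lwps` on two dumps taken 3 seconds apart narrows hundreds of threads down to the ones that did not move, which is where reading the source code should start.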


Stack Trace

.....
Thread 73 (Thread 0x46c1d950 (LWP 28494)):
#0 0x00007ffda5474384 in __lll_lock_wait () from /lib/libpthread.so.0
#1 0x00007ffda546fc5c in _L_lock_1054 () from /lib/libpthread.so.0
#2 0x00007ffda546fb30 in pthread_mutex_lock () from /lib/libpthread.so.0
#3 0x0000000000a0f67d in my_pthread_fastmutex_lock (mp=0xf46d30) at thr_mutex.c:487
#4 0x000000000060cbe4 in dispatch_command (command=16018736, thd=0x80, packet=0x65 <Address 0x65 out of bounds>, packet_length=4294967295) at sql_parse.cc:969
#5 0x000000000060cb56 in do_command (thd=0xf46d30) at sql_parse.cc:854
#6 0x0000000000607f0c in handle_one_connection (arg=0xf46d30) at sql_connect.cc:1127
#7 0x00007ffda546dfc7 in start_thread () from /lib/libpthread.so.0
#8 0x00007ffda46305ad in clone () from /lib/libc.so.6
#9 0x0000000000000000 in ?? ()

Many threads were waiting at pthread_mutex_lock(), called from sql_parse.cc:969

Reading sql_parse.cc:969
953 bool dispatch_command(enum enum_server_command command, THD *thd,
954                       char* packet, uint packet_length)
955 {
956   NET *net= &thd->net;
957   bool error= 0;
958   DBUG_ENTER("dispatch_command");
959   DBUG_PRINT("info",("packet: '%*.s'; command: %d", packet_length, packet, command));
960
961   thd->command=command;
962   /*
963     Commands which always take a long time are logged into
964     the slow log only if opt_log_slow_admin_statements is set.
965   */
966   thd->enable_slow_log= TRUE;
967   thd->lex->sql_command= SQLCOM_END; /* to avoid confusing VIEW detectors */
968   thd->set_time();
969   VOID(pthread_mutex_lock(&LOCK_thread_count));

Who locked LOCK_thread_count for seconds?

Thread 1 (Thread 0x7ffda58936e0 (LWP 15380)):
#0 0x00007ffda4630571 in clone () from /lib/libc.so.6
#1 0x00007ffda546d396 in do_clone () from /lib/libpthread.so.0
#2 0x00007ffda546db48 in pthread_create@@GLIBC_2.2.5 () from /lib/libpthread.so.0
#3 0x0000000000600a66 in create_thread_to_handle_connection (thd=0x3d0f00) at mysqld.cc:4811
#4 0x00000000005ff65a in handle_connections_sockets (arg=0x3d0f00) at mysqld.cc:5134
#5 0x00000000005fe6fd in main (argc=4001536, argv=0x4578c260) at mysqld.cc:4471
#0 0x00007ffda4630571 in clone () from /lib/libc.so.6

gdb stack dumps were taken every 3 seconds
In all cases, Thread 1 (LWP 15380) was stopped at the same point
clone() (called by pthread_create()) seemed to take a long time

4795 void create_thread_to_handle_connection(THD *thd) 4796 { 4797 if (cached_thread_count > wake_thread) 4798 { 4799 /* Get thread from cache */ 4800 thread_cache.append(thd); 4801 wake_thread++; 4802 pthread_cond_signal(&COND_thread_cache); 4803 } 4804 else 4805 { 4811 if ((error=pthread_create(&thd->real_id,&connection_attrib, 4812 handle_one_connection, 4813 (void*) thd))) 4839 } 4840 (void) pthread_mutex_unlock(&LOCK_thread_count); pthread_create is called under critical section (LOCK_thread_count is released after that) If cached_thread_count > wake_thread, pthread_create is not called Increasing thread_cache_size will fix the problem!
Copyright 2010 Sun Microsystems inc The Worlds Most Popular Open Source Database 47

Reading mysqld.cc:4811
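Before raising thread_cache_size, the thread cache miss rate can be estimated from two server status counters. The following is a minimal sketch, assuming the `mysql` client can connect with default credentials; the helper name `thread_cache_miss_pct` and the value 64 are illustrative only. Threads_created divided by Connections approximates the fraction of new connections that had to go through pthread_create().

```shell
# thread_cache_miss_pct THREADS_CREATED CONNECTIONS
# Prints the integer percentage of connections that needed a new thread.
thread_cache_miss_pct() {
    created=$1
    connections=$2
    if [ "$connections" -le 0 ]; then
        # No connections yet: report 0 instead of dividing by zero.
        echo 0
        return
    fi
    echo $(( created * 100 / connections ))
}

# Example usage against a running server (not run automatically):
#   created=$(mysql -N -B -e "SHOW GLOBAL STATUS LIKE 'Threads_created'" | awk '{print $2}')
#   conns=$(mysql -N -B -e "SHOW GLOBAL STATUS LIKE 'Connections'" | awk '{print $2}')
#   thread_cache_miss_pct "$created" "$conns"
#   mysql -e "SET GLOBAL thread_cache_size = 64"   # 64 is an arbitrary example value
```

A percentage that keeps climbing under load suggests most connections still create threads while LOCK_thread_count is held.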

Configuration Summary

Install at least sar, mpstat, and iostat (sysstat package)
  oprofile, gdb, and SystemTap (stap) are recommended
Allocate swap space (approx. half of RAM size)
Set vm.swappiness = 0 and use O_DIRECT
Set /sys/block/sdX/queue/scheduler = deadline or noop
Filesystem tuning
  relatime (noatime)
  ext3: tune2fs -O dir_index -c -1 -i 0
  xfs: nobarrier
  Make sure the battery-backed write cache is enabled

Others
Put the database partitions (/var/lib/mysql, /tmp) on partitions separate from the root partition (/)
  If a database partition fills up, it should not affect the Linux kernel
/etc/security/limits.conf
  soft nofile 8192
  hard nofile 8192
Restart Linux automatically if a kernel panic happens
  kernel.panic_on_oops = 1
  kernel.panic = 1
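The kernel settings from this summary can be verified on a running host with a few read-only checks. This is a sketch under assumptions: the helper name `check_setting` and the device name sda are hypothetical, and the paths are the usual procfs/sysfs locations for these parameters.

```shell
# check_setting NAME ACTUAL EXPECTED
# Prints "ok NAME=ACTUAL" or "differs NAME=ACTUAL (want EXPECTED)".
check_setting() {
    if [ "$2" = "$3" ]; then
        echo "ok $1=$2"
    else
        echo "differs $1=$2 (want $3)"
    fi
}

# Example usage (not run automatically; sda is a placeholder device):
#   check_setting vm.swappiness "$(cat /proc/sys/vm/swappiness)" 0
#   check_setting kernel.panic "$(cat /proc/sys/kernel/panic)" 1
#   check_setting kernel.panic_on_oops "$(cat /proc/sys/kernel/panic_on_oops)" 1
#   cat /sys/block/sda/queue/scheduler   # the active scheduler is shown in [brackets]
# To apply a value at runtime (as root): sysctl -w vm.swappiness=0
```

Values set with sysctl -w do not survive a reboot; persistent values belong in /etc/sysctl.conf.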

Enjoy the conference!

The slides will be published on SlideShare very soon
My talks on Wed/Thu
  More Mastering the Art of Indexing
    April 14th (Wed), 14:00-15:00, Ballroom A
  SSD Deployment Strategies for MySQL
    April 15th (Thu), 14:00-14:45, Ballroom E

Contact:
  E-mail: [email protected]
  Blog: http://yoshinorimatsunobu.blogspot.com
  @matsunobu on Twitter
