Linux and H/W Optimizations for MySQL
Yoshinori Matsunobu
H/W improvements
HDD RAID, Write Cache
Large RAM
SATA SSD, PCI-Express SSD
More CPU cores
Faster Network
S/W improvements
Improved algorithms (I/O scheduling, swap control, etc.)
Much better concurrency
Avoiding stalls
Improved space efficiency (compression, etc)
Per-server performance is important
An additional 900 servers would cost $10M initially and $1M every year
If you can increase per-server throughput, you can reduce the total
number of servers, which decreases TCO
[Chart: throughput over time for Product A vs. Product B; test box: 16GB RAM, HDD RAID (120GB), many slaves]
Some unstable database servers suddenly drop performance in
some situations
Low performance is a problem because we can’t meet
customers’ demands
Though product A is better on average, product B is much
more stable
Don't trust benchmarks: vendors' benchmarks show their best
scores but not the worse numbers
Avoiding stalls
All clients are blocked for a short period of time (from under one
second up to a few seconds)
The number of connections grows significantly (10-100 on average, but
suddenly spikes to 1000+, and TOO MANY CONNECTIONS errors are
thrown; a per-second monitoring sketch follows below)
Increased response time
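Because these spikes last only a few seconds, per-minute monitoring misses them. A minimal sketch (not from the original deck) that samples the connection count every second; it assumes the mysql client tools can connect locally without prompting:

while true; do
  # Threads_connected spikes to 1000+ during a stall
  mysqladmin extended-status | grep -w Threads_connected
  sleep 1
done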
Avoiding stalls (2)
Typical stalls
Dropping a huge table (LOCK_open mutex)
Burst writes (at checkpoints, or when free space in the redo log files
runs low)
pthread_create()/clone() (called at connection establishment)
etc
Handling Real-World workloads..
Cons:
Additional operations are added
Merging might take a very long time
when many secondary indexes must be updated and many rows
have been inserted.
It may continue to happen even after a server shutdown and restart
INSERT gets slower
Time to insert 1 million records (InnoDB, HDD)
[Chart: seconds per million rows inserted (y-axis) vs. existing records in millions (x-axis); sequential-order inserts stay around 10,000 rows/s while random-order inserts degrade to about 2,000 rows/s]
The index size exceeded the InnoDB buffer pool size at 73 million records in the
random order test
Inserts gradually take more time because the buffer pool hit ratio gets worse
(more random disk reads are needed)
For sequential order inserts, insertion time did not change:
no random reads/writes
INSERT performance difference
In-memory INSERT throughput
15,000+ inserts/s from a single thread on recent H/W
UPDATE performance
Need to read the target record blocks and index blocks
– Fast if completed in the buffer pool; otherwise massive foreground disk reads
happen
– Data size does not grow significantly, compared to INSERT/DELETE
Huge performance difference between storage devices
– In-memory UPDATE: 12,000/s
– HDD UPDATE: 300/s
– SATA SSD UPDATE: 1,800/s
– PCI-E SSD UPDATE: 4,000/s
– * Single Thread
– Random reads happen in foreground, so random read speed matters a lot
SSD performance and deployment strategies for MySQL
What do you need to consider? (H/W layer)
SSD or HDD?
Interface
SATA/SAS or PCI-Express?, How many drives?
RAID
H/W RAID, S/W RAID or JBOD?
Network
Is 100Mbps or 1Gbps enough?
Memory
Is 2GB RAM + PCI-E SSD faster than 64GB RAM +
8HDDs?
CPU
Nehalem, Opteron or older Xeon?
What do you need to consider?
Redundancy
RAID
DRBD (network mirroring)
Semi-Sync MySQL Replication
Async MySQL Replication
Filesystem
ext3, xfs, raw device ?
File location
Data file, Redo log file, etc
Regular SAS HDD : 200 iops per drive (disk seek & rotation is
slow)
HDD: 196 reads/s at 1 i/o thread, 443 reads/s at 100 i/o threads
Intel : 3508 reads/s at 1 i/o thread, 14538 reads/s at 100 i/o threads
Fusion I/O : 10526 reads/s at 1 i/o thread, 41379 reads/s at 100 i/o threads
Single-thread throughput on Intel is 16x better than on HDD; Fusion I/O is 25x better
SSD's concurrency scaling (4x) is much better than HDD's (2.2x)
A very strong reason to use SSD
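A minimal sketch (not from the deck) for reproducing this kind of measurement with fio; the device path is an example and must point at a disposable device or file:

fio --name=randread --filename=/dev/fioa --direct=1 --ioengine=libaio \
    --rw=randread --bs=16k --numjobs=1 --iodepth=100 \
    --runtime=60 --time_based --group_reporting

Vary --iodepth between 1 and 100 to compare single-threaded and highly concurrent random read IOPS.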
High Concurrency
[Diagram: PCI-Express SSD: flash memory chips attached to the CPU via PCI-Express]
Advantage
PCI-Express is a much faster interface than SAS/SATA
(current) Disadvantages
Most motherboards have a limited number of PCI-E slots
No hot-swap mechanism
Write performance on SSD
[Chart: random write IOPS at 1 and 100 I/O threads: HDD (4-disk RAID10, xfs), Intel (xfs), Fusion I/O (xfs)]
Flash memory chips
A single SSD drive consists of many flash memory chips (e.g. 2GB each)
A flash memory chip internally consists of many blocks (e.g. 512KB)
A block internally consists of many pages (e.g. 4KB)
It is *not* possible to overwrite a non-empty block
Reading from pages is possible
Writing to pages in an empty block is possible
Appending is possible
Overwriting pages in a non-empty block is *not* possible
Understanding how data is written to SSD (2)
[Diagram: to write new data, the drive 1) reads the remaining valid pages from the target block, 2) writes them together with the new data into an empty block; background jobs ERASE unused blocks]
[Chart: write IOPS over time, fastest vs. slowest: Intel, Fusion (150G), Fusion (120G), Fusion (96G), Fusion (80G); stopping writes for a while lets performance recover]
Using tachIOn
tachIOn is highly optimized for keeping write performance high
[Chart: sequential read/write throughput in MB/s: 4 HDD (RAID10, xfs), Intel (xfs), Fusion I/O (xfs)]
[Chart: fsync/sec for 1KB, 8KB, and 16KB writes: HDD (xfs), Intel (xfs), Fusion I/O (xfs)]
Write cache
[Chart: IOPS at 1 and 16 threads: Fusion I/O on ext3, xfs, and raw device]
Filesystem
[Chart: IOPS vs. concurrency (1-200) for 1KB, 4KB, and 16KB block sizes]
[Chart: reads/s vs. concurrency (1-200) for 4KB, 8KB, and 16KB block sizes]
Huge difference
On SSDs, not only IOPS but also the I/O transfer size matters
It's worth considering storage engines that support
configurable block sizes (a sketch follows below)
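A hypothetical my.cnf fragment illustrating the idea; it assumes MySQL 5.6 or later, where innodb_page_size became configurable (at the time of this deck, smaller InnoDB pages required rebuilding the server):

[mysqld]
# use 4KB InnoDB pages to better match SSD I/O transfer size
innodb_page_size = 4096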
SLC vs MLC (16KB)
Random Read IOPS, FusionIO (16KB)
[Chart: random read IOPS (16KB) vs. concurrency (1-200), SLC vs. MLC]
[Chart: random read IOPS vs. concurrency (1-200), SLC vs. MLC]
[Chart: random read IOPS vs. concurrency (1-200), FusionIO vs. tachIOn]
# mpstat -P ALL 1
CPU %user %nice %sys %iowait %irq %soft %idle intr/s
all 0.45 0.00 7.75 26.69 1.65 0.00 63.45 40046.40
0 1.00 0.00 12.60 86.40 0.00 0.00 0.00 1000.20
1 1.00 0.00 13.63 85.37 0.00 0.00 0.00 0.00
2 0.40 0.00 4.80 26.80 0.00 0.00 68.00 0.00
3 0.00 0.00 0.00 0.00 79.20 0.00 20.80 39033.20 ...
[Chart: random reads/s vs. concurrency (1-200), single drive vs. two drives (RAID0)]
Two drives deliver nearly twice the throughput, if enough read I/O is
coming
When the number of clients was small (not enough I/O requests were coming),
%irq was not so high, so adding a second drive did not help much
The number of slots is limited on most motherboards
# of interfaces (FusionIO MLC)
Random Read IOPS (16KB)
[Chart: random read IOPS (16KB) vs. concurrency]
FusionIO Duo internally has two drives behind a single PCI-Express connector
Two IRQ ports can be used, which greatly increases throughput
A couple of restrictions
FusionIO Duo has two device files (/dev/fioa, /dev/fiob), so a single large native filesystem
can not be created
FusionIO Duo has physical (height/length) and electrical restrictions
– Only one Duo drive can be installed in an HP DL360 (two PCI-E ports) physically
– On some servers, maximum performance can not be reached without an optional power cable
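One workaround (not from the deck) is software striping across the two device files so a single filesystem can span them; a minimal sketch, with example device names and mount point:

# stripe the two Duo device files into one RAID0 md device
mdadm --create /dev/md0 --level=0 --raid-devices=2 --chunk=64 /dev/fioa /dev/fiob
mkfs.xfs /dev/md0
mount -o nobarrier /dev/md0 /data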
tachIOn(SLC) vs FusionIO Duo(MLC)
[Chart: random read IOPS vs. concurrency (1-200): tachIOn, FusionIO Duo, FusionIO]
[Chart: random reads/s vs. number of threads (1-200), Nehalem X5650 vs. Opteron 6174]
Recent lineups
ZFS L2ARC
– Part of ZFS filesystem
Facebook FlashCache
– Working as Linux Kernel module
FusionIO DirectCache
– Working between OS and FusionIO. Depending on FusionIO drives
Oracle Smart Flash Cache
– L2 cache of Oracle Database. Depending on Oracle database
Issues
Performance is not good (FlashCache)
– Overheads in the Linux kernel seem huge
– Even though data is 100% on SSD, random read IOPS dropped 40% and random write IOPS dropped 75% on
FusionIO (tested with CentOS 5.5)
– Performance drops are smaller on Intel X25-E
In practice it's a Single Point of Failure
– It's not just a cache: we expect SSD-level performance, so if it breaks, the system goes down
Total write volume grows significantly: the L2 cache has to be written whenever data is read from HDD
Virtualization?
Currently the performance drop is serious (on faster drives)
Got only 1/30 of the throughput when tested with Ubuntu 10.04 + KVM +
FusionIO
[Two charts: random reads/sec vs. number of threads (1-200), Physical vs. KVM]
Raw SSD drives performed much better than going through a traditional H/W
RAID controller
Even with RAID10, performance was worse than a single raw drive
The H/W RAID controller seemed to be a serious bottleneck
Make sure SSD drives have a write cache and a capacitor (Intel X25-
V/M/E doesn't have a capacitor)
Use JBOD + write cache + capacitor
The Intel 320 SSD has a capacitor
Enable HyperThreading
Fusion I/O + HDD HT OFF (8) HT ON (16) Up
Buffer pool 1G 19295.94 20785.42 +7.7%
Buffer pool 2G 25627.49 28438.00 +11%
Buffer pool 5G 39435.25 45785.12 +16%
Buffer pool 30G 66053.68 81412.23 +23%
InnoDB Plugin and 5.1 scale well with 16-24 CPU cores
HT is more effective in SSD environments because the load is more
CPU bound
MySQL 5.5
Fusion I/O + HDD MySQL5.1 MySQL5.5 Up
Buffer pool 1G 19295.94 24019.32 +24%
Buffer pool 2G 25627.49 32325.76 +26%
Buffer pool 5G 39435.25 47296.12 +20%
Buffer pool 30G 66053.68 67253.45 +1.8%
Deploying on slave?
If using HDD on master, SATA SSD should be enough to handle workloads
– PCI-Express SSD is much more expensive than SATA SSD
How about running multiple MySQL instances on single server?
– Virtualization is not fast
– Running multiple MySQL instances on single OS is more reasonable
Does PCI-E SSD have enough storage capacity to run multiple instances?
On HDD environments, typically only 100-200GB of database data can be stored
because of slow random IOPS on HDD
FusionIO SLC: 320GB Duo + 160GB = 480GB
FusionIO MLC: 1280GB Duo + 640GB = 1920GB (or using ioDrive Octal)
tachIOn SLC: 800GB x 2 = 1600GB
Running multiple slaves on single box
[Diagram: before: each master (M) has a backup (B) and slaves (S1-S3) on dedicated HDD boxes; after: slave and backup instances from multiple masters are consolidated onto shared SSD boxes]
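A hypothetical configuration sketch (not from the deck) for consolidating instances with mysqld_multi; ports, paths, and instance numbers are examples:

[mysqld_multi]
mysqld     = /usr/bin/mysqld_safe
mysqladmin = /usr/bin/mysqladmin

[mysqld1]
port    = 3306
socket  = /var/lib/mysql1/mysql.sock
datadir = /var/lib/mysql1

[mysqld2]
port    = 3307
socket  = /var/lib/mysql2/mysql.sock
datadir = /var/lib/mysql2

# start both instances
mysqld_multi start 1,2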
CPU Utilization
%user 27.3%, %sys 11%(%soft 4%), %iowait 4%
cf. SATA SSD: %user 4%, %sys 1%, %iowait 1%
No replication delay
No significant (100+ms) response time delay caused by SSD
CPU loads
22:10:57 CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s
22:11:57 all 27.13 0.00 6.58 4.06 0.14 3.70 0.00 58.40 56589.95
…
22:11:57 23 30.85 0.00 7.43 0.90 1.65 49.78 0.00 9.38 44031.82
MySQL server currently does not scale well with 24 logical CPU cores
When running 6+ instances on 24 cores, the number of CPU cores used
should be limited per instance (a pinning sketch follows below)
Check the /proc/cpuinfo output
Use the same physical id (socket id) for one instance
Within the same physical id, use the same core id (physical core) for one
instance
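A minimal sketch (not from the deck) of pinning one instance to the logical CPUs of a single socket; the CPU list and config path are examples and should be derived from /proc/cpuinfo:

# map logical CPUs to sockets/cores
grep -E 'processor|physical id|core id' /proc/cpuinfo
# pin one mysqld instance to the chosen logical CPUs
taskset -c 0,2,4,6,8,10 /usr/bin/mysqld_safe --defaults-file=/etc/my1.cnf &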
Application Design
Bottleneck shifts from storage to CPU/Network
Massively executed SQL statements should be migrated to
NoSQL / HandlerSocket /etc
Separating tables
History tables (where only recent data is accessed) work fine on HDD
because most of the active data fits in memory
Rarely accessed tables can also be stored on HDD
Other tables on SSD
Making MySQL better
Parallel SQL threads
Pool of threads
8KB/4KB InnoDB Blocks
Minimized performance stalls
No hot global/large mutex
LINEAR HASH partitions on large tables help (example below)
– The index mutex is allocated per partition
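A hypothetical example of the partitioning idea (database, table, and column names are made up; the partition key must be an integer column that is part of every unique key):

# split the index mutex across 16 partitions
mysql -e "ALTER TABLE mydb.huge_table PARTITION BY LINEAR HASH(id) PARTITIONS 16"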
Future improvements from DBA perspective
One master, one backup/slave and one DR slave
Single slave should be enough from performance perspective
Disks
PCI-E SSDs (i.e. FusionIO, tachIOn) perform very well
SAS/SATA SSDs with a capacitor (i.e. Intel 320)
Carefully research the RAID controller; many controllers
do not scale with SSD drives
Keep enough reserved space, use tachIOn, or use
RAID0 if you need to handle massive write traffic
HDD is good at sequential writes
Concurrency matters
Single SSD scales as well as 8-16 HDDs
Concurrent ALTER TABLE, parallel query
Memory and Swap Space Management
Random Access Memory
RAM access speed is much faster than HDD/SSD
RAM: ~60ns
– 100,000 queries per second is not impossible
HDD: ~5ms
SSD: 100-500us
16-100+GB RAM is now pretty common
Filesystem Cache
Swap is bad
Process spaces are written to disk (swap out)
Disk reads happen when accessing on-disk process spaces (swap in)
Massive random disk reads and writes will happen
What if we set the swap size to zero?
By setting the swap size to zero, swapping doesn't happen anymore. But..
Very dangerous
When neither RAM nor swap space is available, the OOM killer is invoked. The OOM
Killer may kill any process to allocate memory space
It often takes a very long time (minutes to hours) for the OOM Killer to kill processes
We can't do anything until enough memory space is available
Do not set the swap size to zero (a safer sketch follows below)
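A commonly used alternative, not necessarily this deck's exact recommendation: keep a swap partition but tell the kernel to avoid swapping application memory. A minimal sketch:

sysctl -w vm.swappiness=0
echo "vm.swappiness = 0" >> /etc/sysctl.conf   # persist across reboots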
top - 01:01:29 up 5:53, 3 users, load average: 0.66, 0.17, 0.06
Tasks: 170 total, 3 running, 167 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 24.9%sy, 0.0%ni, 75.0%id, 0.2%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 32967008k total, 32815800k used, 151208k free, 8448k buffers
Swap: 0k total, 0k used, 0k free, 376880k cached
top - 11:54:51 up 7 days, 15:17, 1 user, load average: 0.21, 0.14, 0.10
Tasks: 251 total, 1 running, 250 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.5%us, 0.2%sy, 0.0%ni, 98.9%id, 0.3%wa, 0.0%hi, 0.1%si, 0.0%st
Mem: 32818368k total, 31154696k used, 1663672k free, 125048k buffers
Swap: 4184924k total, 1292756k used, 2892168k free, 2716380k cached
Swap happened even though filesystem cache was not wiped out
InnoDB log file size: 3GB in total
Such large redo logs are common with the InnoDB Plugin because crash recovery became much faster
Binary logs (or relay logs)
Swap bug on Linux
A known issue of Linux Kernel
Fixed in 2.6.28, merged into RHEL6
https://bugzilla.redhat.com/show_bug.cgi?id=160033
When will CentOS 6 be released?
Filesystem Cache
In many cases MySQL does not allocate more per-session memory than needed,
but be careful about some extreme cases (like the above:
MyISAM + LIMIT + full scan)
File I/O
File I/O and synchronous writes
An RDBMS calls fsync() many times (per transaction commit, at checkpoints, etc)
Make sure to use a Battery Backed-up Write Cache (BBWC) on RAID cards
10,000+ fsync() per second with BBWC, fewer than 200 on HDD without it
Disable the write cache on the disks themselves for safety reasons
Do not set "write barrier" on filesystems (enabled by default in some cases),
as it writes through to the disks even though BBWC is enabled (very slow)
ext3: mount -o barrier=0 (needed for some distros such as SuSe Linux)
xfs: mount -o nobarrier
drbd: no-disk-barrier in drbd.conf
The physical file remove doesn't happen because the ref count is not zero
When the hard link is removed, the physical file remove happens because at
that time the ref count is zero
The physical remove operation causes massive random disk I/O, but at this
stage it doesn't take a MySQL mutex, so it won't block MySQL
operations
– But iowait increases, so don't run the rm command when load is high (a sketch follows below)
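A minimal sketch of this hard-link trick (paths and names are examples; it assumes innodb_file_per_table so the table has its own .ibd file):

ln /data/mysql/db1/huge_table.ibd /data/mysql/db1/huge_table.ibd.keep
mysql -e "DROP TABLE db1.huge_table"      # fast: the .ibd file still has another link
# later, when load is low, remove the link; the real file deletion happens here
rm /data/mysql/db1/huge_table.ibd.keep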
Filesystem – xfs/ext2
xfs
Fast for dropping files
Concurrent writes to a single file are possible when using O_DIRECT
Not officially supported in current RHEL (supported in SuSE)
Disable the write barrier by setting "nobarrier"
ext2
Faster for writes because ext2 doesn't support journaling
But fsck takes a very long time
In active-active redundancy environments (i.e. MySQL
replication), ext2 is sometimes used to gain performance
Random Write IOPS (16KB Blocks)
[Chart: 1 vs. 100 I/O threads: HDD (ext3), HDD (xfs), Intel (ext3), Intel (xfs), Fusion I/O (ext3), Fusion I/O (xfs)]
DBT-2 (MySQL5.1)
[Chart: NOTPM for RAID1+0 and RAID5 under the noop, cfq, deadline, and anticipatory I/O schedulers]
Queue size = N
Sorting N outstanding I/O requests to optimize disk seeks
MyISAM does not optimize I/O requests internally,
so it depends heavily on the OS and storage
When inserting into indexes, massive random disk writes/reads happen
Increasing the I/O queue size reduces disk seek overheads
# echo 100000 > /sys/block/sdX/queue/nr_requests
No impact on InnoDB
Many RDBMSs, including InnoDB, sort I/O requests internally
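A minimal sketch (not from the deck) of switching the I/O scheduler per device, as compared in the DBT-2 chart above; sdX is a placeholder for the real device name:

cat /sys/block/sdX/queue/scheduler          # e.g. noop anticipatory deadline [cfq]
echo deadline > /sys/block/sdX/queue/scheduler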
Network
Fighting against network bottlenecks
Latency
100Mbit Ethernet: 3000 us
– Around 35,000qps from 100 memcached clients
– Not good if you use SSD
– Easily reaches 100Mbps bandwidth when copying large files
1Gbit Ethernet: 400us
– Around 250,000qps from 100 memcached clients
– 100,000qps is not impossible in MySQL
Latency is not dramatically better with 10Gbit Ethernet
– Check Dolphin Supersockets, etc
When setting up a new replication slave, the master sends large binlog events
to the slave, which easily saturates 100Mbps of traffic
Workarounds: run START/STOP SLAVE IO_THREAD repeatedly, or
manually send binlogs by scp with an upper limit, then apply the events with
mysqlbinlog and mysql (a sketch follows below)
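A minimal sketch of the manual approach (host and file names are examples; scp's -l limit is in Kbit/s):

scp -l 80000 master:/var/lib/mysql/mysql-bin.000123 /tmp/    # cap the copy at ~80Mbit/s
mysqlbinlog /tmp/mysql-bin.000123 | mysql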
Remote datacenter
1. Sending SYN, changing state to SYN_SENT
2. Receiving SYN
3. Checking conditions (i.e. back_log); if it doesn't meet the criteria, dropping it
4. Generating SYN+ACK, changing state to SYN_RECV
5. Sending SYN+ACK
6. Receiving SYN+ACK
7. Generating ACK
8. Sending ACK
9. Receiving ACK, changing state to ESTABLISHED
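A hedged sketch (values are examples) of raising the listen/SYN backlogs so that connection bursts are not dropped at step 3:

# MySQL side: in my.cnf, e.g.  back_log = 1024
sysctl -w net.ipv4.tcp_max_syn_backlog=8192
sysctl -w net.core.somaxconn=1024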
[Diagram: Web servers -> Queue servers (Q4M) -> Worker programs -> MySQL servers (InnoDB)]
iostat
mpstat
dstat
oprofile
gdb
pmp (Poor Man’s Profiler)
gcore
iostat
Showing detailed I/O statistics per device
Very important tool because in most cases RDBMS becomes I/O
bound
iostat -x
Check r/s, w/s, svctm, %util
IOPS is much more important than transfer size
Always %util = (r/s + w/s) * svctm (hard coded in the iostat
source file)
# iostat -xm 10
avg-cpu:  %user %nice %system %iowait %steal %idle
          21.16  0.00    6.14   29.77   0.00  42.93

Device: rrqm/s wrqm/s    r/s   w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sdb       2.60 389.01 283.12 47.35  4.86  2.19    43.67     4.89 14.76  3.02 99.83
# mpstat -P ALL 1
...
11:04:37 AM CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s
11:04:38 AM all 0.00 0.00 0.12 12.33 0.00 0.00 0.00 87.55 1201.98
11:04:38 AM 0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 990.10
11:04:38 AM 1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00
11:04:38 AM 2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00
11:04:38 AM 3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00
11:04:38 AM 4 0.99 0.00 0.99 98.02 0.00 0.00 0.00 0.00 206.93
11:04:38 AM 5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00
11:04:38 AM 6 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 4.95
11:04:38 AM 7 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00
Replication Slave
%idle on a single CPU core drops to 0-10%
– That core is mostly consumed by %user, %system, %iowait, %soft
# dstat -N bond0
----total-cpu-usage---- -dsk/total- -net/bond0- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  2   1  95   1   0   0|2416k 5112k| 782k 1196k|   0     0 |5815   11k
  3   1  95   0   0   0|3376k   15M| 792k 1213k|   0     0 |5632   11k
  3   2  95   0   0   0|2120k 3264k| 793k 1229k|   0     0 |5707   11k
  2   1  96   0   0   0|1920k 3560k| 788k 1193k|   0     0 |5856   11k
  2   1  96   0   0   0|1872k   13M| 770k 1112k|   0     0 |5463   10k
yum install dstat
Similar UI to vmstat, but it also shows network statistics
The disk and net totals are incorrect if you use RAID/bonding (double counted),
so filter by the disk/interface name that you want to trace
“mtstat” additionally supports mysql status outputs
https://launchpad.net/mtstat
Oprofile
Profiling CPU usage from running processes
You can easily identify which functions consume CPU
resources
Supporting both user space and system space profiling
Mainly used by database-internal developers
If specific functions consume most of the resources, applications
might be re-designed to skip calling them
Not useful to check low-CPU activities
I/O bound, mutex waits, etc
How to use
opcontrol --start (--no-vmlinux)
benchmarking
opcontrol --dump
opcontrol --shutdown
opreport -l /usr/local/bin/mysqld
Oprofile example
# opreport -l /usr/local/bin/mysqld
samples  %       symbol name
83003    8.8858  String::copy(char const*, unsigned int, charset_info_st*, charset_info_st*, unsigned int*)
79125    8.4706  MYSQLparse(void*)
68253    7.3067  my_wc_mb_latin1
55410    5.9318  my_pthread_fastmutex_lock
34677    3.7123  my_utf8_uni
18359    1.9654  MYSQLlex(void*, void*)
12044    1.2894  _ZL15get_hash_symbolPKcjb
11425    1.2231  _ZL20make_join_statisticsP4JOINP10TABLE_LISTP4ItemP16st_dynamic_array
You can see quite a lot of CPU resources were spent on character set conversions
(latin1 <-> utf8)
Disabling character code conversions on the application side will improve
performance (a sketch follows below)
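A hedged sketch (not from the deck) of one way to avoid the conversion: make clients talk to the server in the character set the tables actually use. The option names are real MySQL options; the values are examples:

# my.cnf
[mysqld]
character-set-server = utf8
skip-character-set-client-handshake   # ignore the client's requested charset
# or, per client: mysql --default-character-set=utf8 ...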
samples  %        symbol name
83107    10.6202  MYSQLparse(void*)
68680     8.7765  my_pthread_fastmutex_lock
20469     2.6157  MYSQLlex(void*, void*)
13083     1.6719  _ZL15get_hash_symbolPKcjb
12148     1.5524  JOIN::optimize()
11529     1.4733  _ZL20make_join_statisticsP4JOINP10TABLE_LISTP4ItemP16st_dynamic_array
Checking stalls
Collecting statistics
Per-minute statistics (i.e. vmstat 60) are not helpful; collect per-second
statistics
Checking stalls by gdb or pmp
Debugging tool
#!/bin/bash
nsamples=1
sleeptime=0
pid=$(pidof mysqld)
for x in $(seq 1 $nsamples)
do
  gdb -ex "set pagination 0" -ex "thread apply all bt" -batch -p $pid
  sleep $sleeptime
done | \
awk '
BEGIN { s = ""; }
/Thread/ { print s; s = ""; }
/^\#/ { if (s != "" ) { s = s "," $4 } else { s = $4 } }
END { print s }' | \
sort | uniq -c | sort -r -n -k 1,1
pmp output example
291 pthread_cond_wait@@GLIBC_2.3.2,one_thread_per_connection_end,handle_one_connection
57 read,my_real_read,my_net_read,do_command,handle_one_connection,start_thread
26 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,os_aio_simulated_handle,fil_aio_wait,io_handler_thread,start_thread
3 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,srv_purge_worker_thread
1 select,os_thread_sleep,srv_purge_thread
1 select,os_thread_sleep,srv_master_thread
1 select,os_thread_sleep,srv_lock_timeout_and_monitor_thread
1 select,os_thread_sleep,srv_error_monitor_thread
1 select,handle_connections_sockets,main,select
1 read,vio_read_buff,my_real_read,my_net_read,cli_safe_read,handle_slave_io
1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,sync_array_wait_event,rw_lock_s_lock_spin,buf_page_get_gen,btr_cur_search_to_nth_level,row_search_for_mysql,ha_innodb::index_read,handler::index_read_idx_map,join_read_const,join_read_const_table,make_join_statistics,JOIN::optimize,mysql_select,handle_select,execute_sqlcom_select,mysql_execute_command,mysql_parse,dispatch_command,do_command,handle_one_connection
Disadvantages of gdb/pmp
top - 20:39:14 up 360 days, 17:56, 1 user, load average: 1.26, 1.29, 1.32
Tasks: 125 total, 2 running, 123 sleeping, 0 stopped, 0 zombie
Cpu(s): 10.8% us, 0.8% sy, 0.0% ni, 87.2% id, 0.7% wa, 0.0% hi, 0.5% si
Mem: 24680588k total, 24609132k used, 71456k free, 99260k buffers
Swap: 4192956k total, 160744k used, 4032212k free, 4026256k cached
| Slave_open_temp_tables | 15927 |
(gdb) p active_mi->rli->save_temporary_tables->s->table_name
$1 = {str = 0x31da400c3e "claimed_guids", length = 13}
(gdb) p $a->file->stats
$16 = {data_file_length = 1044496, max_data_file_length = 5592400,
index_file_length = 1044496, max_index_file_length = 0, delete_length = 0,
auto_increment_value = 0, records = 1, deleted = 0, mean_rec_length = 8,
create_time = 0, check_time = 0, update_time = 0, block_size = 0}
Tmp tables use the MEMORY engine, so 1044496 + 1044496 bytes (2MB) were used per
table
The table name was "claimed_guids"
Dumping slave’s tmp tables info
define print_all_tmp_tables
set $a= active_mi->rli->save_temporary_tables
set $b= slave_open_temp_tables
while ($i < $b)
p $a->alias
p $a->file->stats
set $a= $a->next
set $i=$i+1
end
end
set pagination 0
set $i=0
print_all_tmp_tables
detach
quit
# gdb /usr/lib/debug/usr/sbin/mysqld.debug /data/mysql/core.11706 -x above_script
The list length was 15927 (same as Slave_open_temp_tables), and all tables had the
same name and the same size (2MB)
15927 * 2MB = 31GB
This explains why VIRT was 30+ GB larger than on other slaves
Rebooting the slave should be fine because they are tmp tables
Capturing MySQL packets
libpcap
A library to capture network packets
Most network capturing tools, including tcpdump, rely on
libpcap
Packet loss often happens
tcpdump + mk-query-digest
1) [root#] tcpdump -i bond0 port 3306 -s 65535 -x -n -q -tttt -c 10000 > tcpdump.out
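The next step was elided on this slide; a hedged sketch of the usual follow-up with Maatkit's mk-query-digest (now pt-query-digest):

2) [root#] mk-query-digest --type tcpdump tcpdump.out > query_digest.txt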
SATA SSD
Much more cost effective than spending on RAM only
H/W RAID Controller + SATA SSD won’t perform great
Capacitor + SSD should be great. Check Intel 320
PCI-Express SSD
Check FusionIO and tachIOn
Expensive, but MLC might be affordable
The master handles the traffic well, but slave servers can not
catch up due to the single-threaded replication channel
Consider running multiple instances on a slave server
RAM